This tutorial explains how to extract data from any arbitrary data source with an API and perform any ETL/ELT needed before loading it into your data warehouse, e.g. BigQuery, Redshift or Snowflake.
Image by author @mshakhomirov
Project outline
You will learn how to:
Create a PayPal account with developer access, a Sandbox and mock test transactions
Create a sandbox account to integrate and test code in PayPal’s testing environment.
Populate your testing environment with some transaction data
Try to CURL this data from the PayPal reporting API.
Create a PayPal data connector with AWS Lambda
Connect to your PayPal account and use a PayPal API authorization token to extract transaction data, for example, for yesterday.
Use a configuration file called ./config.json which holds all the settings for how to access and save your data. For example, you might want to name your file with the same prefix as the name of the table in your data warehouse.
Send an HTTP request to the PayPal API to get transaction data and save the result as a JSON file in your AWS S3 bucket.
Run it locally with the npm run test command
Image by author @mshakhomirov
About the idea
BUILDING A DATA WAREHOUSE:
Usually you would want to put your data warehouse solution (BigQuery, Snowflake or any other) in the center of the diagram.
- Connect any external data source with ease, e.g. set up a pipe to get data from the PayPal API and save it to the Cloud.
- Load data into BigQuery (AWS S3 to BigQuery ingest manager with Node.JS)
- Create documented data transformation pipelines with Git, CI/CD. For example, with Dataform
- Simplify and automate deployment with Cloudformation or Terraform (Infrastructure as a code)
- Create BI reports with Google Data Studio (for example, revenue reconciliation, etc.)
Image by author @mshakhomirov
MODERN DATA STACK TOOLS (not a complete list of course):
Ingestion: Fivetran, Stitch
Warehousing: Snowflake, BigQuery, Redshift
Transformation: dbt, Dataform, APIs.
BI: Looker, Mode, Periscope, Chartio, Metabase, Redash
Image by author @mshakhomirov
When it comes to data extraction and ingestion, you could use paid, managed tools like Fivetran or Stitch to extract data from arbitrary data sources (e.g. payment merchant providers, exchange rates, geocoding databases, etc.), but if you follow this tutorial you will be fully capable of doing it yourself.
Real life scenario for Data Engineers
Imagine you are a Data Engineer
You are working on a project connecting various data sources into your data warehouse in BigQuery.
Your company is a mobile game development studio and has various products being sold on both platforms, iOS and Android.
YOUR STACK
Your dev stack is hybrid and includes AWS and GCP. Your team uses Node.js a lot.
The data science team uses Python, but server and client data pipelines are created using Node.
Your data stack is modern, event-driven and data intensive.
Your data warehouse solution must be cost-effective and flexible enough that you can add any data source you need.
It must be able to scale easily to meet your growing data volumes.
THE TASK
All data comes as files from various data sources, e.g. databases, Kinesis Firehose streams and various notification services. It is stored
in your cloud data lake in different formats (CSV, JSON, PARQUET, etc.).
As a data engineer you were tasked to create a new data pipeline to feed financial data from PayPal into the data warehouse.
The finance team will use it to analyse revenue reconciliation reports in Google Data Studio daily.
You decided to use AWS Lambda functions and Node.js to extract data daily from the PayPal Transactions API and save it to AWS S3 first.
Image by author @mshakhomirov
This simple pipeline will ensure data is saved and prepared for loading into BigQuery later where it will be organised into tables.
Prerequisites, Libraries and setup
TOOLS
Node.js and Node package manager installed
Basic understanding of cloud computing (Amazon Web Services account), AWS CLI and AWS SDK
PayPal Developer account
Shell (Command line interface) commands and scripting (Advanced).
TECHNIQUES
Understanding of REST APIs.
Good knowledge of Node.js (intermediate). You will create a Lambda function:
- You must understand Node.js basic concepts, i.e. async functions, Node packages and how the code works
- Basic debugging (consoles, print statements)
- Loops: i.e. for
- Branches: if, if/else, switches
Shell commands and scripting, as you will want to deploy your Lambda using the AWS CLI from the command line and be able to test it locally.
Let's begin
Step 1. Create a PayPal account with developer access and a Sandbox
Go to developer.paypal.com and create an Application. This would be an integration for your Node.js app.
Create a sandbox account to integrate and test code in PayPal’s testing environment.
Populate your testing environment with some transaction data
Try to CURL this data from the PayPal reporting API.
If any of these steps cause difficulties please refer to Hints section below.
Step 2. Create a local Node.js app called bq-paypal-revenue on your machine. This app will do the following:
Connect to your PayPal account and use a PayPal API authorization token to extract transaction data, for example, for yesterday.
Use a configuration file called ./config.json which holds all the settings for how to access and save your data. For example, you might want to name your file with the same prefix as the name of the table in your data warehouse.
Send an HTTP request to the PayPal API to get transaction data and save the result as a JSON file in your AWS S3 bucket.
Run it locally with the npm run test command
Deliverable
The deliverable for this project is a working Node.js app which can run locally and be deployed to your AWS account as a Lambda function.
Feeling stuck? Buy the author a coffee and ask a question.
Help
STEP 1: PayPal Sandbox
- Use an access token to query the PayPal API. More info on how to get an access token can be found here
- PayPal API basics explains how to create live and sandbox PayPal accounts. Try to create a sandbox. Another useful link is the PayPal Sandbox testing guide, which explains the testing process and provides some tips on creating your PayPal developer account.
- Use your Sandbox client_id:secret and add it to a CURL request:
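For illustration, here is that token request sketched in Node.js with axios, the language used for the app in Step 2 (a plain curl call against the same endpoint with -u "client_id:secret" and -d "grant_type=client_credentials" does the same job). The credentials below are placeholders:

```
// getToken.js - minimal sketch: exchange your Sandbox client_id:secret for an access token.
const axios = require('axios');

const CLIENT_ID = '<your-sandbox-client-id>'; // replace with your own Sandbox values
const SECRET = '<your-sandbox-secret>';

async function getAccessToken() {
  const response = await axios.post(
    'https://api-m.sandbox.paypal.com/v1/oauth2/token',
    'grant_type=client_credentials',                     // form-encoded body, as in the curl example
    {
      auth: { username: CLIENT_ID, password: SECRET },   // becomes the Basic Authorization header
      headers: { 'Content-Type': 'application/x-www-form-urlencoded', Accept: 'application/json' },
    }
  );
  // response.data contains access_token, token_type and expires_in
  return response.data.access_token;
}

getAccessToken().then(console.log).catch(console.error);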
- As a result you will see something like this:
Image by author @mshakhomirov
- Now you can use this access token to create another request to pull transactions or any other data you need, for example Invoices, etc. Re-use the access token until it expires. Then get a new token. The idea is to automate the process, so in Step 2 we will create a microservice to get data from PayPal programmatically.
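As a sketch (not the article's exact code), the same transaction request can be made from Node.js; the Transaction Search endpoint requires start_date and end_date as ISO 8601 timestamps and returns transaction_details plus paging metadata:

```
// listTransactions.js - minimal sketch: pull transactions for a date range.
const axios = require('axios');

async function listTransactions(accessToken) {
  const response = await axios.get(
    'https://api-m.sandbox.paypal.com/v1/reporting/transactions',
    {
      headers: { Authorization: `Bearer ${accessToken}` }, // the '<Access-Token>' from the previous step
      params: {
        start_date: '2021-07-01T00:00:00-0000',
        end_date: '2021-07-31T23:59:59-0000',
        fields: 'all',
        page_size: 100,
        page: 1,
      },
    }
  );
  // response.data.transaction_details is an array; total_pages tells you if there is more to fetch
  return response.data;
}
```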
Replace '<Access-Token>' with a token from the previous step. You will see something like this:
Image by author @mshakhomirov
You can see that there are no transactions yet.
STEP 1: How to mock test PayPal transactions
- The easiest way to create sample transactions in your Sandbox is to use the PayPal Product API Executor (APEX)
Image by author @mshakhomirov
- You would want to use PayPal Checkout Standard, and to create a completed transaction the flow would be:
- Get an access token
- Create Order (Authorise and Capture). The default intent for Express Checkout is Capture, which captures the funds straight after the transaction is approved by the buyer. Read more about Authorisation here.
- Approve Order (the buyer approves the order and is happy to pay)
- Update Order
- Capture Payment for Order
- Show Order details
- Create an Order (replace the phrase after 'Bearer' with your token):
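A minimal Node.js sketch of that order creation call (the tutorial shows it as a CURL request); the amount and currency are arbitrary test values:

```
// createOrder.js - minimal sketch: create a Sandbox order with intent CAPTURE.
const axios = require('axios');

async function createOrder(accessToken) {
  const response = await axios.post(
    'https://api-m.sandbox.paypal.com/v2/checkout/orders',
    {
      intent: 'CAPTURE',
      purchase_units: [
        { amount: { currency_code: 'USD', value: '100.00' } }, // arbitrary test amount
      ],
    },
    {
      headers: {
        'Content-Type': 'application/json',
        Authorization: `Bearer ${accessToken}`, // the token from the previous step
      },
    }
  );
  // response.data.id is the order ID; response.data.links contains the buyer approval link
  return response.data;
}
```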
As a result you will see that a sample transaction was created:
Image by author @mshakhomirov
- The customer will now accept and approve it:
Image by author @mshakhomirov
IMPORTANT:
To emulate a Sandbox customer you would want to create a test Personal account in your PayPal developer portal. That would be different from your test Business account. By doing so you will be able to virtually approve your orders. Go to
https://developer.paypal.com/developer/accounts/
and create a test Personal account. Then log in and use it to approve the order.
Image by author @mshakhomirov
Image by author @mshakhomirov
- Once the order is approved you might want to UPDATE it. The CURL below updates an order with the CREATED or APPROVED status. You cannot update an order with the COMPLETED status.
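As an illustration, an update that changes the purchase amount could be sketched like this in Node.js; the JSON Patch path below follows PayPal's documented pattern for the default purchase unit:

```
// updateOrder.js - minimal sketch: PATCH an order that is still CREATED or APPROVED.
const axios = require('axios');

async function updateOrder(accessToken, orderId) {
  await axios.patch(
    `https://api-m.sandbox.paypal.com/v2/checkout/orders/${orderId}`,
    [
      {
        op: 'replace',
        path: "/purchase_units/@reference_id=='default'/amount",
        value: { currency_code: 'USD', value: '120.00' }, // new test amount
      },
    ],
    {
      headers: {
        'Content-Type': 'application/json',
        Authorization: `Bearer ${accessToken}`,
      },
    }
  );
  // A successful update returns HTTP 204 No Content.
}
```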
- The final step would be to capture the payment. Go to APEX, click capture, or use CURL with the previously approved order ID:
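Sketched in Node.js, the capture call is a POST with an empty JSON body to the order's capture endpoint:

```
// captureOrder.js - minimal sketch: capture the payment for an APPROVED order.
const axios = require('axios');

async function captureOrder(accessToken, orderId) {
  const response = await axios.post(
    `https://api-m.sandbox.paypal.com/v2/checkout/orders/${orderId}/capture`,
    {}, // empty JSON body
    {
      headers: {
        'Content-Type': 'application/json',
        Authorization: `Bearer ${accessToken}`,
      },
    }
  );
  // response.data.status should now be COMPLETED
  return response.data;
}
```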
IMPORTANT:
The order MUST be approved by a customer in order to capture the payment, and this is the trickiest part, which normally requires some server development, but you can simply use APEX to create a virtual buyer (using a Personal Sandbox account) just to test it.
- Now you can check order details:
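For completeness, a sketch of fetching the order details with a simple GET:

```
// getOrder.js - minimal sketch: show order details, including its status.
const axios = require('axios');

async function getOrder(accessToken, orderId) {
  const response = await axios.get(
    `https://api-m.sandbox.paypal.com/v2/checkout/orders/${orderId}`,
    { headers: { Authorization: `Bearer ${accessToken}` } }
  );
  return response.data; // contains status (e.g. COMPLETED), purchase_units, payer, etc.
}
```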
STEP 1: Get historical transactions from PayPal using the Reporting API
When you finally have some completed transactions in your Sandbox, query them with the Reporting API request shown earlier. The output will be something like this:
Image by author @mshakhomirov
This JSON response has all the transactions within the selected date range. Next you would want to run this request programmatically and save the data to cloud storage, for example AWS S3. This is a "data lake" step which aims to store your data safely so you can examine or change it before loading it into the data warehouse. This does, of course, create a duplicate of your data in both cloud storage and the data warehouse, but I would recommend doing it this way so you can easily examine any data loading errors. Setting an expiration policy on a bucket will clean the cloud storage after a certain number of days.
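As a sketch, such an expiration rule can be set with the aws-sdk (the bucket name and the 7-day retention below are arbitrary examples; the same thing can be done in the S3 console):

```
// setExpiration.js - minimal sketch: expire objects in the data lake bucket after 7 days.
const AWS = require('aws-sdk');
const s3 = new AWS.S3();

const params = {
  Bucket: 'your-datalake-bucket',    // hypothetical bucket name
  LifecycleConfiguration: {
    Rules: [
      {
        ID: 'expire-paypal-extracts',
        Filter: { Prefix: '' },      // apply to the whole bucket
        Status: 'Enabled',
        Expiration: { Days: 7 },
      },
    ],
  },
};

s3.putBucketLifecycleConfiguration(params, (err, data) => {
  if (err) console.error(err);
  else console.log('Lifecycle rule applied', data);
});
```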
Step 2: Create a serverless Node.js application with AWS Lambda to extract data from the PayPal API
- You would want to use the following Node.js modules:
- axios to make HTTP requests to the PayPal API
- moment to handle datetime data and parameters
- aws-sdk to save data to S3
- run-local-lambda to test your Lambda locally
- Initialise a new Node.js app so you have a ./package.json like this:
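A minimal example of what it might contain is shown below; the test script wiring through run-local-lambda, the event file path and the version numbers are assumptions, so adjust them to your setup:

```
{
  "name": "bq-paypal-revenue",
  "version": "1.0.0",
  "description": "Lambda to extract PayPal transactions and save them to S3",
  "main": "app.js",
  "scripts": {
    "test": "run-local-lambda --file app.js --event test/data.json --timeout 30"
  },
  "dependencies": {
    "axios": "^0.21.1",
    "aws-sdk": "^2.804.0",
    "moment": "^2.24.0",
    "run-local-lambda": "^1.1.1"
  }
}
```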
- Your app directory would look like this:
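Based on the files mentioned in this tutorial, the layout would be roughly:

```
bq-paypal-revenue/
├── app.js              # Lambda handler and processEvent logic
├── config.json         # S3 bucket, file prefixes and table names
├── token_config.json   # PayPal OAuth request settings (Base64 encoded client_id:secret)
├── deploy.sh           # deployment script (AWS CLI)
├── package.json
├── test/
│   └── data.json       # sample event for npm run test (assumed name)
└── node_modules/
```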
Image by author @mshakhomirov
- Use ./config.json to separate your live and staging environments. For example:
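A hedged example of what ./config.json could contain is below; the key names (Staging/Production, Tables, s3Bucket, s3Prefix) are illustrative rather than prescriptive. The important idea is that each entry maps a PayPal report to a data warehouse table name which is used as the file prefix:

```
{
  "Staging": {
    "s3Bucket": "data-staging.your-project",
    "s3Prefix": "paypal/",
    "Tables": [
      { "name": "paypal_transaction", "report": "transactions" }
    ]
  },
  "Production": {
    "s3Bucket": "data-production.your-project",
    "s3Prefix": "paypal/",
    "Tables": [
      { "name": "paypal_transaction", "report": "transactions" }
    ]
  }
}
```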
- Create a file to configure your PayPal access token credentials and replace "Basic *" with a Base64 encoded combination of client_id and secret (see the example after this paragraph). You can also copy and paste this Base64 encoded string from one of the previous CURL requests at the beginning of this tutorial where you got the access_token. You would want to pass your Base64 encoded string client_id:client_secret in the Authorization header. If you use Postman, for example, you will find your client_id and client_secret values appended to the text "Basic " in Headers as follows: "Authorization": "Basic QWNxd2xIYT..... Use it to get an access token with OAuth 2.0 and then use it in your PayPal API calls. Read more about it here.
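For example, ./token_config.json could hold the settings for the OAuth request (the field names here are illustrative; the Authorization value is your Base64 encoded client_id:secret string):

```
{
  "Url": "https://api-m.sandbox.paypal.com/v1/oauth2/token",
  "Body": "grant_type=client_credentials",
  "Headers": {
    "Accept": "application/json",
    "Content-Type": "application/x-www-form-urlencoded",
    "Authorization": "Basic QWNxd2xIYT....."
  }
}
```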
- Now create a file called app.js:
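A minimal skeleton for app.js could look like the sketch below. It is an assumption about structure, not the author's exact code: the handler simply delegates to processEvent (defined in the next step) and picks the environment block from the illustrative ./config.json above:

```
// app.js - skeleton of the Lambda entry point; processEvent is defined in the next step.
const AWS = require('aws-sdk');
const axios = require('axios');
const moment = require('moment'); // used by processEvent to build yesterday's date range

const config = require('./config.json');
const tokenConfig = require('./token_config.json');

const s3 = new AWS.S3();

// Environment switch is an assumption; pick whatever mechanism your deployment uses.
const env = process.env.NODE_ENV === 'production' ? 'Production' : 'Staging';

exports.handler = async (event) => {
  try {
    const saved = await processEvent(event, config[env]);
    console.log('Saved objects:', saved);
    return { statusCode: 200, body: JSON.stringify(saved) };
  } catch (err) {
    console.error('PayPal extraction failed', err);
    throw err;
  }
};
```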
- Now you would want to create a function called processEvent to process an event that will trigger the function to extract transaction data from PayPal.
- FINAL SOLUTION:
- PayPal will paginate the output if the transaction request size is too big, so processEvent will evaluate the size first and then extract data page by page from the PayPal API.
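A sketch of that pagination logic, using the illustrative config shapes from earlier: it authenticates, requests yesterday's transactions page by page (total_pages from the first response tells it when to stop) and writes each page to S3 with the target table name as the file prefix:

```
// processEvent - sketch: authenticate with PayPal, page through yesterday's transactions
// and save each page to S3. Field names (Url, Body, Headers, Tables, s3Bucket, s3Prefix)
// come from the illustrative config examples above.
async function processEvent(event, cfg) {
  // 1. Get an access token using ./token_config.json
  const tokenResponse = await axios.post(tokenConfig.Url, tokenConfig.Body, {
    headers: tokenConfig.Headers,
  });
  const accessToken = tokenResponse.data.access_token;

  // 2. Yesterday's date range in the format the Transaction Search API expects
  const start = moment().subtract(1, 'days').startOf('day').format('YYYY-MM-DDTHH:mm:ss') + '-0000';
  const end = moment().subtract(1, 'days').endOf('day').format('YYYY-MM-DDTHH:mm:ss') + '-0000';

  const saved = [];
  for (const table of cfg.Tables) {
    let page = 1;
    let totalPages = 1;
    while (page <= totalPages) {
      // 3. Evaluate the size of the result: total_pages tells us how many pages to pull
      // (switch to api-m.paypal.com for the live environment)
      const { data } = await axios.get('https://api-m.sandbox.paypal.com/v1/reporting/transactions', {
        headers: { Authorization: `Bearer ${accessToken}` },
        params: { start_date: start, end_date: end, fields: 'all', page_size: 500, page },
      });
      totalPages = data.total_pages || 1;

      // 4. Save the batch as JSON; the object key starts with the target table name
      const key = `${cfg.s3Prefix}${table.name}_${start.slice(0, 10)}_page_${page}.json`;
      await s3
        .putObject({ Bucket: cfg.s3Bucket, Key: key, Body: JSON.stringify(data.transaction_details) })
        .promise();
      saved.push(key);
      page += 1;
    }
  }
  return saved;
}
```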
- To test and run it locally on your machine, use npm i and then npm run test.
- Use ./deploy.sh to deploy your Lambda in AWS.
- When the Lambda is deployed to AWS it is executed, for example, by a CloudWatch event trigger. An AWS CloudWatch event will trigger the Lambda daily using a schedule.
- Lambda will get PayPal credentials from ./token_config.json and authenticate with PayPal.
- Looping through each PayPal report needed (tables in ./config.json), Lambda will save each batch of transactions into AWS S3 in JSON format.
- This will source data for a table in your data warehouse, i.e. if you give the file the same name as your data warehouse table it can be programmatically loaded into the relevant table, so another microservice can pick it up from there (not covered in this project).
- You can invoke the Lambda manually once it is deployed by running a short script with the AWS CLI (aws lambda invoke).
Project conclusions
The traditional ETL, which stands for Extraction, Transformation and Loading, has now evolved into ELT: data is Extracted from source systems, Loaded into the data warehouse and then Transformed within the data warehouse. This live project helps you manage that "E" part, the extraction from external data sources.
Image by author @mshakhomirov
- Whether you're a startup or a global enterprise, you've learned how to use serverless to extract data from arbitrary third-party data sources.
- You've learned the concepts of a data lake, a data warehouse and decoupled data storage.
- You've learned how to use AWS Lambda functions to create a simple and reliable data extraction pipeline.
- All of the above provides an idea or a template which can easily be extended and reused for any other arbitrary data source you might require in the future.