Extract data from PayPal API

and prepare for loading into your data warehouse with Serverless, APIs and Node.js

This tutorial explains how to extract data from any arbitrary data source with an API and perform ETL/ELT before loading it into your data warehouse, e.g. BigQuery, Redshift or Snowflake.

GitHub repository with code

© Mike Shakhomirov, November 12, 2021
Image by author @mshakhomirov

Project outline

You will learn how to:

Create a PayPal account with developer access and a Sandbox, and mock test transactions

  • Create a sandbox account to integrate and test code in PayPal’s testing environment.
  • Populate your testing environment with some transaction data.
  • Try to CURL this data from the PayPal reporting API.

Create a PayPal data connector with AWS Lambda

  • Connect to your PayPal account and use a PayPal API authorization token to extract transaction data, for example, for yesterday.
  • Use a configuration file called ./config.json which holds all the settings for how to access and save your data. For example, you might want to name your file with the same prefix as the name of the table in your data warehouse.
  • Send an HTTP request to the PayPal API to get transaction data and save the result as a JSON file in your AWS S3 bucket.
  • Run it locally with the npm run test command.

Image by author @mshakhomirov

    About the idea

    BUILDING A DATA WAREHOUSE:

    Usually you would want to put your data warehouse solution (BigQuery, Snowflake or any other) at the center of the diagram.

    1. Connect any external data source with ease, i.e. set up a pipe to get data from the PayPal API and save it to the Cloud.
    2. Load data into BigQuery (AWS S3 to BigQuery ingest manager with Node.JS)
    3. Create documented data transformation pipelines with Git, CI/CD. For example, with Dataform
    4. Simplify and automate deployment with CloudFormation or Terraform (Infrastructure as code)
    5. Create BI reports with Google Data Studio (for example, revenue reconciliation, etc.)
    Image by author @mshakhomirov

    MODERN DATA STACK TOOLS (not a complete list of course):

  • Ingestion: Fivetran, Stitch
  • Warehousing: Snowflake, BigQuery, Redshift
  • Transformation: dbt, Dataform, APIs.
  • BI: Looker, Mode, Periscope, Chartio, Metabase, Redash
    Image by author @mshakhomirov

    When it comes to data extraction and ingestion, you could use paid and managed tools like Fivetran or Stitch to extract data from arbitrary data sources (e.g. payment merchant providers, exchange rates, geocoding databases, etc.), but if you follow this tutorial you will be fully capable of doing it yourself.

    Real life scenario for Data Engineers

    Imagine you are a Data Engineer

    You are working on a project connecting various data sources into your data warehouse in BigQuery. Your company is a mobile game development studio and has various products sold on both platforms, iOS and Android.

    YOUR STACK

    Your dev stack is hybrid and includes AWS and GCP. Your team uses Node.js a lot.
    The data science team uses Python, but server and client data pipelines are created using Node.
    Your data stack is modern, event-driven and data-intensive.
    The data warehouse solution must be cost-effective and flexible enough that you can add any data source you need.
    It must be able to scale easily to meet the growing volume of data.

    THE TASK

    All data comes as files from various data sources, i.e. databases, Kinesis Firehose streams and various notification services. It is stored in your cloud data lake in different formats (CSV, JSON, PARQUET, etc.).

    As a data engineer, you have been tasked with creating a new data pipeline to feed financial data from PayPal into the data warehouse. The finance team will use it to analyse revenue reconciliation reports in Google Data Studio daily.

    You decided to use AWS Lambda functions and Node.js to extract data daily from the PayPal transactions API and save it to AWS S3 first.

    Image by author @mshakhomirov
    This simple pipeline will ensure data is saved and prepared for loading into BigQuery later, where it will be organised into tables.

    Prerequisites, Libraries and setup

    TOOLS

  • Node.js and Node package manager installed
  • Basic understanding of cloud computing (Amazon Web Services account), AWS CLI and AWS SDK
  • PayPal Developer account
  • Shell (Command line interface) commands and scripting (Advanced).
    TECHNIQUES

  • Understanding of REST APIs.
  • Good knowledge of Node.JS (intermediate). You will create a Lambda Function.
    • You must understand Node.JS basic concepts, i.e. async functions, Node packages and how the code works.
    • basic debugging (consoles, print statements)
    • loops: i.e. for
    • branches: if, if/else, switches
  • Shell commands and scripting as you would want to deploy your Lambda using AWS CLI from command line and be able to test it locally.
    Let's begin

    Step 1. Create a PayPal account with developer access and a Sandbox

    Go to developer.paypal.com and create an Application. This would be an integration for your Node.js app.
    Create a sandbox account to integrate and test code in PayPal’s testing environment.
    Populate your testing environment with some transaction data.
    Try to CURL this data from the PayPal reporting API.

    If any of these steps cause difficulties please refer to Hints section below.

    Step 2. Create a local Node.js app called bq-paypal-revenue on your machine. This app will do the following:

    Connect to your PayPal account and use a PayPal API authorization token to extract transaction data, for example, for yesterday.
    Use a configuration file called ./config.json which holds all the settings for how to access and save your data. For example, you might want to name your file with the same prefix as the name of the table in your data warehouse.
    Send an HTTP request to the PayPal API to get transaction data and save the result as a JSON file in your AWS S3 bucket.
    Run it locally with the npm run test command.

    Deliverable

    The deliverable for this project is a working Node.js app that can run locally and be deployed to your AWS account as a Lambda function.

    Feeling stuck? Buy the author a coffee and ask a question.

    Help

    STEP 1: PayPal Sandbox

    • Use an access token to query the PayPal API. More info on how to get an access token can be found here.
    • PayPal API basics explains how to create live and sandbox PayPal accounts. Try to create a sandbox. Another useful link would be the PayPal Sandbox testing guide, which explains the testing process and provides some tips on creating your PayPal developer account.
    • Use your Sandbox client_id:secret and add it to a CURL request (see the sample token request after this list).
    • As a result you will see something like this:
      Image by author @mshakhomirov
    • Now you can use this access token to create another request to pull transactions or any other data you need, for example invoices. Re-use the access token until it expires, then get a new one. The idea is to automate the process, so in Step 2 we will create a microservice to get data from PayPal programmatically. Replace '<Access-Token>' with the token from the previous step. You will see something like this:
      Image by author @mshakhomirov
      You can see that there are no transactions yet.
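    A sample token request would look something like this (the endpoint and grant type follow PayPal's OAuth2 documentation; substitute your own Sandbox client_id and secret):

      # Exchange your Sandbox client_id:secret for an OAuth2 access token
      curl -v -X POST https://api-m.sandbox.paypal.com/v1/oauth2/token \
        -H "Accept: application/json" \
        -H "Accept-Language: en_US" \
        -u "<client_id>:<secret>" \
        -d "grant_type=client_credentials"

    The JSON response contains an access_token and an expires_in value (in seconds). Pass the token as an Authorization: Bearer <Access-Token> header in subsequent calls; the reporting request itself is shown in the "Get historical transactions" section below.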

    STEP 1: How to mock test PayPal transactions

    • The easiest way to create sample transactions in your Sandbox is to use the PayPal Product API Executor (APEX)
      Image by author @mshakhomirov
    • You would want to use PayPal Checkout Standard and to create a completed transaction the flow would be:
      • Get an access token
      • Create Order (Authorise and Capture). The default intent for Express Checkout is Capture, which captures the funds straight after the transaction is approved by the buyer. Read more about Authorisation here.
      • Approve Order (Buyer approves the order and is happy to pay)
      • Update Order
      • Capture Payment for Order
      • Show Order details
    • Create an Order (replace the phrase after 'Bearer' with your token; a sample request is shown after this list). As a result you will see that a sample transaction was created:
      Image by author @mshakhomirov
    • The customer will now accept and approve it:
      Image by author @mshakhomirov
      IMPORTANT: To emulate a Sandbox customer you would want to create a test Personal account in your PayPal developer portal. That would be different from your test Business account. By doing so you will be able to virtually approve your orders. Go to https://developer.paypal.com/developer/accounts/ and create a Test Personal account. Then log in and use it to approve the order.
      Image by author @mshakhomirov
      Image by author @mshakhomirov
    • Once the Order is approved you might want to UPDATE it with a PATCH request. You can only update an order with the CREATED or APPROVED status; you cannot update an order with the COMPLETED status.
    • The final step would be to capture the payment. Go to APEX, or click capture, or use CURL with the previously approved order ID (see the capture request after this list). IMPORTANT: the Order MUST be approved by a customer before you can capture the payment, and this is the trickiest part, which normally requires some server development. However, you can simply use APEX to create a virtual buyer (using the Personal Sandbox account) just to test it.
    • Now you can check order details (see the last request after this list):
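    The Orders API calls used in the flow above would look roughly like this (these are the v2 Checkout Orders endpoints from PayPal's docs; the amount, currency and order ID are placeholders):

      # 1. Create an order (intent CAPTURE); replace <Access-Token> with your token
      curl -v -X POST https://api-m.sandbox.paypal.com/v2/checkout/orders \
        -H "Content-Type: application/json" \
        -H "Authorization: Bearer <Access-Token>" \
        -d '{
              "intent": "CAPTURE",
              "purchase_units": [
                { "amount": { "currency_code": "USD", "value": "100.00" } }
              ]
            }'

      # 2. The response contains an "approve" link; open it as your Sandbox Personal (buyer)
      #    account to approve the order.

      # 3. Capture the payment for the approved order
      curl -v -X POST https://api-m.sandbox.paypal.com/v2/checkout/orders/<Order-Id>/capture \
        -H "Content-Type: application/json" \
        -H "Authorization: Bearer <Access-Token>"

      # 4. Check order details
      curl -v -X GET https://api-m.sandbox.paypal.com/v2/checkout/orders/<Order-Id> \
        -H "Content-Type: application/json" \
        -H "Authorization: Bearer <Access-Token>"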

    STEP 1: Get historical transactions from PayPal using the reporting API

    When you finally have some completed transactions in your Sandbox you would want to use the Reporting API:
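      # List transactions for a date range; dates use ISO 8601 with a timezone offset
      # (the dates, page size and fields here are example values)
      curl -v -X GET "https://api-m.sandbox.paypal.com/v1/reporting/transactions?start_date=2021-11-01T00:00:00-0000&end_date=2021-11-02T23:59:59-0000&fields=all&page_size=100&page=1" \
        -H "Content-Type: application/json" \
        -H "Authorization: Bearer <Access-Token>"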

    Output would be something like this:

    Image by author @mshakhomirov

    This JSON response has all the transactions within the selected date range. Next you would want to run this command programmatically and save the data to Cloud storage, for example AWS S3. This is the "data lake" step, which aims to store your data safely so you can examine or change it before loading it into the data warehouse. This will of course create a duplicate of your data in both Cloud storage and the data warehouse, but I would recommend doing it this way so you can easily examine any data loading errors. Setting an expiration policy on the bucket will clean up the Cloud storage after a certain number of days.
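    A minimal sketch of such an expiration policy with the AWS CLI (the bucket name, prefix and retention period are placeholders):

      # Expire raw PayPal extracts 30 days after they were written to the bucket
      aws s3api put-bucket-lifecycle-configuration \
        --bucket your-datalake-bucket \
        --lifecycle-configuration '{
          "Rules": [
            {
              "ID": "expire-raw-paypal-extracts",
              "Filter": { "Prefix": "paypal_transaction/" },
              "Status": "Enabled",
              "Expiration": { "Days": 30 }
            }
          ]
        }'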

    Step 2: Create a serverless Node.js application with AWS Lambda to extract data from the PayPal API

    • You would want to use the following Node.js modules:
      • axios to make HTTP requests to PayPal API
      • moment to handle datetime data and parameters
      • aws-sdk to save data to S3
      • run-local-lambda to test your lambda locally
    • Initialise a new Node.js app so that you have a ./package.json like the sketch after this list.
    • Your app directory would look like this:
      Image by author @mshakhomirov
    • Use ./config.json to separate your live and staging environments (see the example after this list).
    • Create a file to configure your PayPal access token credentials, replacing "Basic *" with a Base64-encoded combination of client_id and secret. You can copy and paste this Base64-encoded string from one of the previous CURL requests at the beginning of this tutorial where you got the access_token. You would want to pass your Base64-encoded client_id:client_secret string in the Authorization header. If you use Postman, for example, you will find your client_id and client_secret values appended to the text "Basic " in Headers as follows: "Authorization": "Basic QWNxd2xIYT..... Use it to get an access token with OAuth 2.0 and then use that token in your PayPal API calls. Read more about it here.
    • Now create a file called app.js (a sketch is shown after this list).
    • Now you would want to create a function called processEvent to process an event that will trigger the function to extract transaction data from PayPal.
    • FINAL SOLUTION:
      • PayPal will paginate the output if the transaction response is too big, so processEvent will evaluate the size first and then extract data page by page from the PayPal API.
      • To test and run it locally on your machine, run npm i and then npm run test.
      • Use a ./deploy.sh to deploy your Lambda in AWS.
        • When the Lambda is deployed to AWS it is executed by a CloudWatch event trigger; a CloudWatch event rule will trigger the Lambda daily on a schedule.
        • Lambda will get PayPal credentials from ./token_config.json and authenticate with PayPal
        • Looping through each PayPal report needed (tables in ./config.json), the Lambda will save each batch of transactions into AWS S3 in JSON format.
        • This will source data for a table in your data warehouse, i.e. if you give the file the same name as your data warehouse table it can be programmatically loaded into the relevant table, so another microservice can pick it up from there (not covered in this project).
      • Once deployed, you can invoke the Lambda manually by running the script shown after this list (AWS CLI required).
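    The sketches below are not the author's exact files but a minimal, self-contained version of each piece referenced in the list above; module versions, bucket names, file layout and config field names are illustrative assumptions.

    A ./package.json with the dependencies and an npm run test script (the run-local-lambda flags follow that package's README as far as I recall, so double-check them against its docs):

      {
        "name": "bq-paypal-revenue",
        "version": "1.0.0",
        "description": "Extract PayPal transactions and save them to AWS S3",
        "main": "app.js",
        "scripts": {
          "test": "export STAGE=staging; run-local-lambda --file app.js --event test/event.json --timeout 30"
        },
        "dependencies": {
          "axios": "^0.21.1",
          "aws-sdk": "^2.1001.0",
          "moment": "^2.29.1",
          "run-local-lambda": "^1.1.1"
        }
      }

    A ./config.json separating live and staging environments:

      {
        "staging": {
          "paypalApiHost": "https://api-m.sandbox.paypal.com",
          "bucket": "your-staging-datalake-bucket",
          "tableName": "paypal_transaction"
        },
        "live": {
          "paypalApiHost": "https://api-m.paypal.com",
          "bucket": "your-prod-datalake-bucket",
          "tableName": "paypal_transaction"
        }
      }

    An ./app.js sketch of the Lambda handler and processEvent (single page only; the pagination loop described above is left out for brevity):

      // app.js — minimal sketch of the PayPal connector Lambda
      const axios = require('axios');
      const moment = require('moment');
      const AWS = require('aws-sdk');

      // Pick the environment section of ./config.json; ./token_config.json holds
      // { "auth": "Basic <base64 client_id:secret>" } as described above.
      const config = require('./config.json')[process.env.STAGE || 'staging'];
      const tokenConfig = require('./token_config.json');

      const s3 = new AWS.S3();

      exports.handler = async (event) => {
        const result = await processEvent(event);
        return { statusCode: 200, body: JSON.stringify(result) };
      };

      // Exchange the Base64-encoded client_id:secret for an OAuth2 access token.
      const getAccessToken = async () => {
        const res = await axios({
          method: 'post',
          url: `${config.paypalApiHost}/v1/oauth2/token`,
          headers: {
            Authorization: tokenConfig.auth,
            'Content-Type': 'application/x-www-form-urlencoded',
          },
          data: 'grant_type=client_credentials',
        });
        return res.data.access_token;
      };

      // Pull yesterday's transactions and save the raw JSON into S3.
      const processEvent = async (event) => {
        const accessToken = await getAccessToken();

        const start = moment.utc().subtract(1, 'days').startOf('day').format('YYYY-MM-DD[T]HH:mm:ss[-0000]');
        const end = moment.utc().subtract(1, 'days').endOf('day').format('YYYY-MM-DD[T]HH:mm:ss[-0000]');

        const res = await axios({
          method: 'get',
          url: `${config.paypalApiHost}/v1/reporting/transactions`,
          headers: { Authorization: `Bearer ${accessToken}` },
          params: { start_date: start, end_date: end, fields: 'all', page_size: 100, page: 1 },
        });

        const transactions = res.data.transaction_details || [];

        // Name the object after the target table so a downstream loader can route it.
        const key = `${config.tableName}/${moment.utc().subtract(1, 'days').format('YYYY-MM-DD')}.json`;
        await s3.putObject({
          Bucket: config.bucket,
          Key: key,
          Body: JSON.stringify(transactions),
          ContentType: 'application/json',
        }).promise();

        return { saved: key, rows: transactions.length };
      };

    Once the Lambda is deployed, you can invoke it manually like this (the function name is whatever you used in ./deploy.sh; with AWS CLI v2 add --cli-binary-format raw-in-base64-out):

      aws lambda invoke \
        --function-name bq-paypal-revenue \
        --payload '{}' \
        response.json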

    Project conclusions

    The traditional ETL, which stands for Extraction, Transformation and Loading, has now evolved into ELT. Data is Extracted from source systems, Loaded into the data warehouse and then Transformed within the data warehouse. This live project helps you manage the "E" part: the extraction from external data sources.

    Image by author @mshakhomirov
    • Whether you’re a startup or a global enterprise, you've learned how to use serverless functions to extract data from arbitrary third-party data sources.
    • You've learned the concepts of a data lake, a data warehouse and decoupled data storage.
    • You've learned how to use AWS Lambda functions to create a simple and reliable data extraction pipeline.
    • All of the above provides a template that can easily be extended and reused for any other arbitrary data source you might require in the future.

    Mike

    Mike is a Machine Learning Engineer / Python / JavaScript full-stack dev / middle-shelf tequila connoisseur and accomplished napper. Stay tuned, read me on Medium https://medium.com/@mshakhomirov/membership and receive my unique content.
