Imagine you are a data engineer tasked with syncing data between one of your MySQL database instances and your data warehouse.
This is a very common scenario in which you need to connect one of the most popular data sources to your data warehouse solution, e.g. Snowflake, Redshift or BigQuery. I previously wrote about the modern data stack and various ways of doing it here.
This article is a detailed summary of how to extract and save data from a relational database (MySQL) without using third-party apps.
I will give you a detailed explanation of a Node.js serverless application built with an AWS Lambda function. This microservice extracts data from a MySQL database and can be run locally on your machine and/or from your AWS account on a schedule.
There is also a more advanced example of the same application which creates Node.js streams to extract and save data from MySQL to your AWS S3 data lake.
You will learn how to:
- use a yaml config for your queries;
- run a single $ npm run test command to export a hundred million rows in chunks to Cloud Storage.
This tutorial might be useful for data engineers and everyone who works with MySQL databases or wants to learn how to connect various arbitrary data sources to data warehouse solutions.
TOOLS
TECHNIQUES
You would want to use SELECT * INTO OUTFILE 'file.csv' to achieve this:
https://gist.github.com/mshakhomirov/f54c634d9923c4fa0a36fca655277f9a
With a bit of tweaking and changing the outfile you'll achieve the desired outcome. However, this is a manual operation, and keep in mind that you would want to limit the range in your WHERE clause using an indexed column. Otherwise your database will have a hard time.
Skip this part if you are after more advanced examples with Node.js streaming, data transformation and S3 upload.
In your command line run:
$ mkdir mysql-connector
$ cd mysql-connector
$ npm init
Use npm to install the mysql2 package for your Lambda:
$ npm i mysql2@2.0.0
Install a library to run and test the Lambda locally:
$ npm i run-local-lambda@1.1.1
Use a config file to define the pipeline configuration and the tables to use:
https://gist.github.com/mshakhomirov/7c2065799179bb1f6f58549106188f87
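The gist above is the actual config; purely as an illustration of the idea, such a file could hold the connection details plus the queries to run per table. The field names below are hypothetical, not necessarily what the gist uses:

```json
{
  "dbConnection": {
    "host": "your-mysql-host",
    "user": "your-user",
    "password": "your-password",
    "database": "your-database"
  },
  "Tables": [
    {
      "name": "transactions",
      "query": "SELECT * FROM transactions WHERE created_at >= '2021-01-01';",
      "output": "./output/transactions.csv"
    }
  ]
}
```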
Create an async function in your main application file ./app.js to run the SQL query from the ./config.json file. Your application folder should now look like this:
https://gist.github.com/mshakhomirov/1aca62a66767b01679c0bef2329b5d1c
Your main application file ./app.js would look like this:
https://gist.github.com/mshakhomirov/31dba2bf1b260cf13e8d22afb21d21e9
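The gist above is the full version. A minimal sketch of the same idea is shown below, using the mysql2 promise API; it only runs the configured queries and logs row counts, and the config field names (dbConnection, Tables) are my own assumptions:

```javascript
// ./app.js — minimal sketch only; see the gist above for the full version.
const mysql = require('mysql2/promise');
const config = require('./config.json');

// Lambda entry point
exports.handler = async (event) => processEvent(event);

const processEvent = async (event) => {
  const connection = await mysql.createConnection(config.dbConnection);
  try {
    for (const table of config.Tables) {
      // Run each configured SQL query and report how many rows came back.
      const [rows] = await connection.execute(table.query);
      console.log(`${table.name}: exported ${rows.length} rows`);
    }
  } finally {
    await connection.end();
  }
};
```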
This is the gist of how to export data from MySQL programmatically. Very simple.
Run $ npm run test in your command line and that will export data from your MySQL database. Make sure your credentials and database host address are correct.
Now you can change the queries in your ./config.json file and run them programmatically.
Use Node.js stream and export data from MySQL
First of all, why use a Node.js stream? Well, if your dataset is bigger than your memory, then you would want to extract the data in chunks, like we did in step 1. That is what stream is for.
It can help you to optimise your app's memory and export data without loading all of the rows into memory.
If you use AWS Lambda or GCP Cloud Functions, it helps to save money and not over-provision memory. Very often 128 MB of allocated memory is enough to export a few million rows.
So with a Node.js stream you can connect MySQL to your data warehouse more efficiently.
The gist of it
Let's say you want to extract data from MySQL row by row and save it locally as CSV. The example below would work with the npm packages mysql2 and csv-stringify.
https://gist.github.com/mshakhomirov/d300ef636910235669150a10deefb6bf
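The gist above is the working example. The condensed sketch below shows the same idea, assuming the mysql2 query .stream() API and a csv-stringify transform; the table name and connection details are placeholders:

```javascript
// Condensed sketch of the gist above; table name and connection details are placeholders.
const fs = require('fs');
const mysql = require('mysql2');
const { stringify } = require('csv-stringify'); // in older csv-stringify versions it is the default export

const connection = mysql.createConnection({
  host: 'your-mysql-host',
  user: 'your-user',
  password: 'your-password',
  database: 'your-database',
});

connection.query('SELECT * FROM transactions') // hypothetical table
  .stream({ highWaterMark: 100 })              // pull rows a few at a time instead of buffering them all
  .pipe(stringify({ header: true }))           // turn each row object into a CSV line
  .pipe(fs.createWriteStream('./transactions.csv'))
  .on('finish', () => connection.end());
```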
Stream MySQL data and save it locally as CSV
All you need is a queryDbAndSave() function. Try adding the async example below into your processEvent() function.
https://gist.github.com/mshakhomirov/491c4eb380e26496103d8bbb236e208c
Add this to your processEvent() function like so:
https://gist.github.com/mshakhomirov/b689f77578ed527b4dd0dfca6008fc36
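The gists above are the source of truth. As an illustration only, the wiring can be as simple as looping over the configured tables and awaiting queryDbAndSave() for each one; the config.Tables key and the queryDbAndSave() signature here are assumptions:

```javascript
// Illustrative wiring only; see the gists above for the actual version.
const processEvent = async (event) => {
  for (const table of config.Tables) {
    // Stream each configured query and save the result locally as CSV.
    await queryDbAndSave(table.query, `./output/${table.name}.csv`);
  }
};
```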
Next, in your command line run: $ npm run test
Got the idea? Cool. Let's continue to more examples.
Stream to GCP's Cloud Storage or AWS S3
This example will save your SQL query results to AWS S3 or GCP Cloud Storage, with no need to over-provision memory resources.
In my scenario I would want to stream query results to my S3 data lake bucket and transform JSON into ndJSON. That way I could easily trigger further data ingestion with some other service when a file has been created in Cloud Storage.
If you want to learn how to do that, check my tutorial on how to handle data loading into your data warehouse:
https://towardsdatascience.com/how-to-handle-data-loading-in-bigquery-with-serverless-ingest-manager-and-node-js-4f99fba92436
To stream-save query results you would want to add a new branch to your queryDbAndSave() function:
https://gist.github.com/mshakhomirov/5b935158985446b9b9bbd56edeff29eb
This example will work with the same modules, but you would want to include these ones too:
- aws-sdk@2.1001.0, the version that's used in AWS Lambda environments at the moment of writing this article. Check the current environments here so you could then simply exclude it from deployment artifacts.
- through2@4.0.2, a wrapper library making it easier to construct streams.
So now, if you add another MySQL pipe to your ./config.json:
https://gist.github.com/mshakhomirov/c537a5c86dba71cd4d89033fd1d62506
Run $ npm run test in your command line and it will export the data from MySQL and save it as one file in stream mode to your Cloud Storage.
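For reference, here is a stripped-down sketch of such a streaming branch, assuming the aws-sdk v2 S3 upload API and through2; the bucket, object key, table name and connection details are placeholders, and the gist above remains the actual implementation:

```javascript
// Stripped-down sketch of the S3 streaming branch; names and keys are placeholders.
const mysql = require('mysql2');
const AWS = require('aws-sdk');
const through2 = require('through2');

const s3 = new AWS.S3();
const connection = mysql.createConnection({ /* connection details from your config */ });

// Transform each MySQL row object into a newline-delimited JSON (ndJSON) line.
const toNdjson = through2.obj(function (row, _enc, callback) {
  callback(null, Buffer.from(JSON.stringify(row) + '\n'));
});

const rowStream = connection.query('SELECT * FROM transactions').stream(); // hypothetical table

s3.upload(
  {
    Bucket: 'your-datalake-bucket',
    Key: 'mysql/transactions/2021/11/13/transactions.ndjson',
    Body: rowStream.pipe(toNdjson),
  },
  (err, data) => {
    connection.end();
    if (err) console.error(err);
    else console.log(`Uploaded to ${data.Location}`);
  }
);
```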
Stream and save data in chunks to Cloud Storage
You also might want to save data locally or to AWS S3 in chunks, i.e. in batch mode. The snippet below explains how to do it.
You would want to declare an output file batch size (in rows) at the top of your ./app.js: const BATCH_SIZE = process.env.BATCH_SIZE || 3000;
Then you would want to evaluate SQL query size. You could use another async function for that:
https://gist.github.com/mshakhomirov/ece0d6afc0d805f01bafc7335a5e1abc
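The gist above has the actual implementation. One common way to do such a dry run, which is my own assumption rather than necessarily what the gist does, is to wrap the configured query in a COUNT(*) subquery:

```javascript
// A possible dry-run helper: estimate how many rows a configured query would return.
// Wrapping the query in COUNT(*) is an assumption, not necessarily the gist's approach.
const checkSqlSize = async (connection, query) => {
  const countSql = `SELECT COUNT(*) AS cnt FROM (${query.replace(/;\s*$/, '')}) AS t`;
  const [rows] = await connection.promise().query(countSql);
  return rows[0].cnt;
};
```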
Add another branch to your queryDbAndSave() function and export data in chunks, checking each time whether it is time to finish:
https://gist.github.com/mshakhomirov/18eef36fd2c5d0e1ffacab28824877c3
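Again, the gist is the working version. Conceptually, batching a row stream can look like the sketch below; the file naming, the ndJSON output and the part counter are my own illustration:

```javascript
// Conceptual illustration of batching; file naming and counters are assumptions.
const fs = require('fs');

const BATCH_SIZE = process.env.BATCH_SIZE || 3000;

const saveInBatches = (connection, query, keyPrefix) => new Promise((resolve, reject) => {
  let batch = [];
  let part = 0;

  const flush = () => {
    if (!batch.length) return;
    // Write the current chunk as one ndJSON file, e.g. ./output/<prefix>_0.ndjson
    fs.writeFileSync(`${keyPrefix}_${part++}.ndjson`, batch.map((r) => JSON.stringify(r)).join('\n'));
    batch = [];
  };

  connection.query(query).stream()
    .on('data', (row) => {
      batch.push(row);
      if (batch.length >= BATCH_SIZE) flush(); // time to cut a new file?
    })
    .on('end', () => { flush(); resolve(part); }) // write any remaining rows
    .on('error', reject);
});
```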
Final solution for the processEvent() function in ./app.js:
Don't forget to run $ npm i moment. ./app.js will use it to construct file keys when saving objects.
https://gist.github.com/mshakhomirov/fc2ad8b2b0904e942cc8c745f131485b
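For example, the file keys mentioned above could be built with moment like this; the key layout itself is my own illustration:

```javascript
// Illustration only: one possible way to build S3 object keys with moment.
const moment = require('moment');

const fileKey = (table, part) =>
  `mysql/${table}/${moment().format('YYYY/MM/DD')}/${table}_${part}.ndjson`;

// e.g. mysql/transactions/2021/11/13/transactions_0.ndjson
console.log(fileKey('transactions', 0));
```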
The final solution, which can be found in this repository, uses the npm config package and yaml definitions for your MySQL pipes.
I prefer using yaml simply because it is easier to read when you add those long SQL queries.
A sample ./config/staging.yaml would usually look like this:
https://gist.github.com/mshakhomirov/5866cfe527113a14982406f583e4308d
It is also more intuitive, in my opinion, when you need to separate live and staging environments. In your command line run $ npm i config. So your final application folder, with ./config/ instead of ./config.json, would look like this:
https://gist.github.com/mshakhomirov/ecfa4f896ede8fbc52b4c25034e668b8
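As a side note, the config package picks the environment from NODE_ENV, so loading the staging pipes could look like the sketch below. This assumes js-yaml is installed so config can parse yaml files, and the 'Tables' key is a hypothetical name; use whatever your yaml actually defines:

```javascript
// Assumes ./config/staging.yaml exists and js-yaml is installed so config can parse it.
// 'Tables' is a hypothetical key name used for illustration.
process.env.NODE_ENV = process.env.NODE_ENV || 'staging';

const config = require('config');
const tables = config.get('Tables');

console.log(`Loaded ${tables.length} MySQL pipe definitions for ${process.env.NODE_ENV}`);
```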
There are three ways to do it. One of them is to use a deployment script ./deploy.sh like the one below. Run $ ./deploy.sh in your command line and it will deploy the Lambda:
https://gist.github.com/mshakhomirov/43df39dbdfda83dfe2c47eb5a8369cea
During the deployment make sure you have configured all access roles and security groups properly. For example, if your MySQL database is in AWS, then your Lambda function must be deployed in the same VPC to be able to access it. Once you enable VPC support in Lambda, your function no longer has access to anything outside your VPC, which includes S3. With S3 specifically you can use VPC Endpoints to resolve this.
This is a simple and reliable data export solution that allows you to extract data from a MySQL database programmatically with low memory usage. This way you can create a MySQL data connector with some awesome features:
- $ npm run test to run it from your command line.
- yaml definitions to describe your MySQL pipes.
- dryRun to evaluate SQL query size.
- Node.js streams to avoid memory over-provisioning.
Some real-time integrations might be expensive, and often it depends on the size of the data you have. Just imagine each row inserted into MySQL triggering the Lambda to export and insert it into your data warehouse. There is a better way to monitor and control data ingestion like that.
Just a few days ago I used it to export 56 million rows from MySQL. Then I saved them in chunks to my data lake in AWS S3. For example, I have a data loading manager, and every time a file lands in the data lake it sends the data to my BigQuery data warehouse. I wrote about it in this post before: https://towardsdatascience.com/how-to-handle-data-loading-in-bigquery-with-serverless-ingest-manager-and-node-js-4f99fba92436
So exporting my MySQL data and loading it into my data warehouse was quick and easy.