Imagine you are a Data engineer and you were tasked to sync data from one of your MySQL database instances and your data warehouse.
This is a very common scenario in which you need to connect of of the most popular data sources to your data warehouse solution, i.e. Snowflake, Redshift or BigQuery. I previously wrote about modern data stack and various ways of doing it here.
This article is a detailed summary on how to extract and save data from relational database (MySQL) without using 3rd paty apps.
I will give you with a detailed explanation of Node.js serverless application built with AWS Lambda function. This microservice will extract data from MySQL database and can be ran locally on your machine or/and from AWS account on a schedule.
There is also a more advanced example of the same application which creates Node.JS
streams to extract and save data from MySQL to your AWS S3 datalake.
You will learn how to:
yamlconfig for your queries.
$ npm run testcommand would export a hundred million rows in chunks to Cloud Storage.
This tutorial might be useful for data engineers and everyone who works with MySQL databases or wants to learn how to connect various arbitrary data sources to data warehouse solutions.
You would want to use
SELECT * INTO 'file.csv' to achieve this:
With a bit of tweaking and changing the
outfile you'll achieve the desired outcome.
However, this is a manual operation and keep in mind that you would want to limit your range in
where clause using indexed column. Otherwise your database will be having hard times.
Skip this part if you are after more advanced examples with node.js
streaming, data transformation and
$ mkdir mysql-connector.
$ cd mysql-connector.
$ npm init.
npmto install mysql package for your lambda:
$npm i email@example.com
$ npm i firstname.lastname@example.org
configfile to define the pipeline configuration and tables to use:
asyncfunction in your main application file
./app.jsto run SQL query from
Your application folder now should look like this: https://gist.github.com/mshakhomirov/1aca62a66767b01679c0bef2329b5d1c
Your main application file
./app.js would look like this:
This is the gist of how to export data from MySQL programmatically. Very simple.
$ npm run test in your command line and that would export data from MySQL database. Make sure your credentials and database host address are correct.
Now you can change query in your
./config.json file and run them programmatically.
streamand export data from MySQL
First of all, why use Node.js stream?
Well if your dataset is more than your memory then you would want to extract data in chunks like we did in step 1. that what
stream is for.
It can help you to optimise your app's memory and export data without loading all of the rows into memory.
If you use AWS Lambda or GCP Cloud functions it helps to save money and not to overprovision the memory resource.
Very often 128 Mb of allocated memory is enough to export a few millions rows.
So with Node.js
streamyou can connect MySQL to your data warehouse more efficiently.
The gist of it
Let's say you want to extract data from MySQL row by row and save it locally as CSV. The example below would work with npm packages
streamand save locally as CSV
All you need is a
queryDbAndSave() function. Try to add this
async example below into your
Add this to your
processEvent() function like so:
Next in your command line run:
$ npm run test.
Got the idea? Cool. Let's continue to more examples.
streamto GCP's Cloud Storage or AWS S3
This example will save your SQL query results to AWS S3 or GCP Cloud Storage and no need to over provision memory resources.
In my scenario I would want to
stream query results to my S3 datalake bucket and transform JSON into ndJSON. In this way I could easily trigger further data ingestion with soome other service when file has been created in Cloud Storage.
If you want to learn how to do it check my tutroial on how to handle data loading into your data warehouse.
stream-save query results you would want to add a new branch to your
This example will work for the same modules but you would want to include these ones too:
npm i email@example.com that's used in AWS Lambda environments at the moment of writing this article. Theck current evironments here so you could then simple exclude it from deployment artifacts.
firstname.lastname@example.org"a wrapper library making easier to construct
So now if add another MySQL pipe to your
$ npm run testin your command line and it will export the data from MySQL and save it as one file in
streammode to your Cloud Storage.
streamand save it in chunks to Cloud Storage
You also might want to save data locally or to AWS S3 in chunks, i.e. in batch mode. The snippet below explains how to do it.
You would want to declare an output file batch size in rows at the top of your
const BATCH_SIZE = process.env.BATCH_SIZE || 3000;
Then you would want to evaluate SQL query size. You could use another async function for that:
Add another branch to your
queryDbAndSave() function and export data in chunks each time checking if it's a time to finish:
Final solution for
processEvent() function in
Don't forget to run
$npm i moment.
./app.js will use to construct file keys to save objects.
The final solution which can be found in this repository use
npm config and
yaml definitions for your MySQL pipes.
I prefer using
yamlsimply because it is easier to read when you add those long SQL queries.
./config/staging.yaml would usually look:
It is also more intuitive in my opinion when you need to separate
staging environments. In your command line run
$ npm i config. So your final application folder with
./config.json would look like:
There are three ways to do it.
./deploy.shlike this one below. Run
$ ./deploy.shin your command line and it will deploy the Lambda: https://gist.github.com/mshakhomirov/43df39dbdfda83dfe2c47eb5a8369cea
During the deployment make sure you configured all access roles and Security groups properly. For example, when your MySQL database is in AWS then your Lambda function must be deployed in the same VPC to be able to access it. Once you enable VPC support in Lambda your function no longer has access to anything outside your VPC, which includes S3. With S3 specifically you can use [VPC Endpoints] to resolve this.
This is a simple and reliable data export solution which allows you to extract data from MySQL database programmatically with low memory usage. This way you can create a simple and reliable MySQL data connector with some awesome features:
$npm run testfrom your command line.
yamldefinitions to describe your MySQL pipes.
dryRunto evaluate SQL size.
Node.js streamsto avoid memory over provisioning.
Some real-time integrations might be expensive and often it depnds on the size of the data you have. Just imagine each row inserted into MySQL would trigger the Lambda to export and insert it into your data warehouse. There is a better way to monitor and control data ingestion like that.
Just a few days ago I used it to export 56 million of rows from MySQL. Then I saved them in chunks to my data lake in AWS S3. For example, I have data loading manager and every time file lands in data lake it sends data to my BigQuery data warehouse. I wrote about it in this post before: https://towardsdatascience.com/how-to-handle-data-loading-in-bigquery-with-serverless-ingest-manager-and-node-js-4f99fba92436
So exporting my MySQL data and loading it into my data warehouse was quick and easy.