Dataform is a great free tool for SQL data transformation. It helps to keep your data warehouse clean and well-organized. It has nice dependency graphs explaining data lineage and it serves as a single source of truth for everything that happens there.
I am going to talk about unit tests though.
Dataform supports SQL unit tests for views and you can read about it in Dataform docs [1]. It is indeed simple.
In Dataform unit testing the view is simple:
However, often it is not enough.
You might want to run and test custom SQL operation / a script having multiple statements and more than one output.
Indeed, it becomes tricky when you want to unit test a SQL script which depends on some other actions, i.e. views, scripts, tables, etc. In this case you would want to keep it atomic and run one test for all of them each time you run it again.
Let's imagine we perform incremental updates on our table every day/hour and you would want to unit test that:
*_tests
schemas of each dependency. Depending on each table's schema it will create an input, i.e. production_tests.reputation_data
for production.reputation_data
, etc.user_reputation
and save the actual output in *_tests
schema, i.e. in production_tests.user_reputation
.Let's write the unit test for user_reputation
script:
A unit test fails if the actual output from the dataset is not equal to the expected output.
It means that:
Let's run it:
unit_tests
. Make sure your scripts have it.*_tests.any_table
)....
As a result we will see a Pass for our unit test:
So what has just happened?
Dataform ran all the dependencies but created "fake" outputs in my new production_tests.*
schema so the final script (a unit test script) could use them as inputs.
Let's see what happens if someone decides to change the logic of any of these dependencies in that pipeline. I will change the reputation_data_v
slightly:
I will run the unit test (test_user_reputation.sqlx
) again:
So this is how we run unit tests for SQL scripts in Dataform:
unit_tests
. Make sure your scripts has it.*_tests.any_table
).Dataform has a command line interface so you can run your project's actions from the command line:
Read more about it here: 2
Dataform also has a docker image so you might want to setup a Gitflow pipeline to run unit_tests
actions each time you create a Pull Request:
So when you push changes to your repository it will run the checks for your pipelines to see if any logic has been affected:
Dataform is a great tool for data modeling. It's like DBT but written in JavaScript. It has Git and CI/CD features which makes it extremely powerful. It documents SQL script automatically and creates nice dependency graphs which makes it useful as a single source of truth for everyone who is going to use it.
At the moment new Dataform signups are closed as the company was quietly acquired by Google. Dataform is now available in Preview
mode in Google Cloud Platform which means that it will become available with a full list of features in a couple of months. I've already tried a couple of things in Preview
but still prefer to use the legacy Web UI. If I want to run it as a microservice Dataform has a Docker
image which allows doing that.
I couldn't find Dependency graphs and some other important features in Preview
version in GCP.
Comments
Jen Lopez November 27, 2022
Nice.