Dataform is a great free tool for SQL data transformation. It helps to keep your data warehouse clean and well-organized. It has nice dependency graphs explaining data lineage and it serves as a single source of truth for everything that happens there.
I am going to talk about unit tests though.
Dataform supports SQL unit tests for views and you can read about it in Dataform docs . It is indeed simple.
In Dataform unit testing the view is simple:
However, often it is not enough.
You might want to run and test custom SQL operation / a script having multiple statements and more than one output.
Indeed, it becomes tricky when you want to unit test a SQL script which depends on some other actions, i.e. views, scripts, tables, etc. In this case you would want to keep it atomic and run one test for all of them each time you run it again.
Let's imagine we perform incremental updates on our table every day/hour and you would want to unit test that:
*_testsschemas of each dependency. Depending on each table's schema it will create an input, i.e.
user_reputationand save the actual output in
*_testsschema, i.e. in
Let's write the unit test for
A unit test fails if the actual output from the dataset is not equal to the expected output.
It means that:
Let's run it:
unit_tests. Make sure your scripts have it.
As a result we will see a Pass for our unit test:
So what has just happened?
Dataform ran all the dependencies but created "fake" outputs in my new
production_tests.* schema so the final script (a unit test script) could use them as inputs.
Let's see what happens if someone decides to change the logic of any of these dependencies in that pipeline. I will change the
I will run the unit test (
So this is how we run unit tests for SQL scripts in Dataform:
unit_tests. Make sure your scripts has it.
Dataform has a command line interface so you can run your project's actions from the command line:
Read more about it here: 2
Dataform also has a docker image so you might want to setup a Gitflow pipeline to run
unit_tests actions each time you create a Pull Request:
So when you push changes to your repository it will run the checks for your pipelines to see if any logic has been affected:
At the moment new Dataform signups are closed as the company was quietly acquired by Google. Dataform is now available in
Preview mode in Google Cloud Platform which means that it will become available with a full list of features in a couple of months. I've already tried a couple of things in
Preview but still prefer to use the legacy Web UI. If I want to run it as a microservice Dataform has a
Docker image which allows doing that.
I couldn't find Dependency graphs and some other important features in
Preview version in GCP.
Jen Lopez November 27, 2022