Making Data Engineering and Machine Learning available to everyone

A community platform where everyone is welcome to share their ideas about data engineering, machine learning, data platform architecture and pipeline design. We believe that authors shouldn't be told how to write. Beginner? Great! Share something you learned.


Advanced SQL techniques for beginners.

Mike November 20, 2022 11 Comments
advanced sql techniques

On a scale from 1 to 10 how good are your data warehousing skills?

Want to go above 9/10? This article is for you.

In this blog post you'll find the most intricate data warehouse SQL techniques explained. I will use the BigQuery standard SQL dialect to scribble down a few thoughts.

  • Incremental tables and MERGE updates.
  • Date arrays and numbering functions
  • Convert table into Array of structs and pass them to UDF
  • Creating marketing funnels
  • Regexp

SQL is a powerful tool that helps to manipulate data. Hopefully these SQL use cases from digital marketing will be useful for you. It's a handy skill indeed and can help you with many projects. These SQL snippets made my life a lot easier and I use them at work almost every day.

Continue Reading on Medium

Data Lakehouse, Big Data Export and External Tables

Mike January 25, 2023 11 Comments
data pipeline design

When to use, file formats, external tables, storage costs and performance.

Yet another way to improve our data solution.

If your data platform architecture requires a data lake then this article is for you.

Continue Reading on Medium

Infrastructure as Code for your Data Pipelines.

Mike January 15, 2023 11 Comments
data pipeline design

Infrastructure as code is a fantastic strategy to create and manage new cloud resources, e.g. cloud storage buckets or event streams, as well as for streamlining data engineering processes, including CI/CD pipelines.

Advanced cheatsheet for beginners

Consider this article a user-friendly introduction to Infrastructure as Code with a collection of stack file samples to deploy resources that your data platform might need.

Continue Reading on Medium

Data pipeline design patterns.

Mike December 27, 2022 110 Comments
data pipeline design

Typically data is processed, extracted, and transformed in steps. Therefore, a sequence of data processing stages can be referred to as a data pipeline.
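To make the idea of "a sequence of data processing stages" concrete, here is a minimal Python sketch. The stage names (`extract`, `drop_invalid`, `cast_types`) and the sample rows are hypothetical and only illustrate how stages can be chained; they are not taken from the article.

```python
from typing import Callable, Iterable, Iterator

Record = dict
Stage = Callable[[Iterable[Record]], Iterator[Record]]


def extract() -> Iterator[Record]:
    # Hypothetical source rows; a real pipeline would read from an API, file or database.
    yield {"user_id": "1", "amount": "10.50"}
    yield {"user_id": "2", "amount": "n/a"}
    yield {"user_id": "3", "amount": "4.25"}


def drop_invalid(rows: Iterable[Record]) -> Iterator[Record]:
    # Stage 1: skip rows whose amount cannot be parsed as a number.
    for row in rows:
        try:
            float(row["amount"])
        except ValueError:
            continue
        yield row


def cast_types(rows: Iterable[Record]) -> Iterator[Record]:
    # Stage 2: convert string fields into typed values.
    for row in rows:
        yield {"user_id": int(row["user_id"]), "amount": float(row["amount"])}


def run_pipeline(source: Iterable[Record], stages: list) -> list:
    # Feed the output of each stage into the next one, in order.
    rows = source
    for stage in stages:
        rows = stage(rows)
    return list(rows)


result = run_pipeline(extract(), [drop_invalid, cast_types])
```

Because every stage is a generator, rows flow through one at a time, and a stage can be added, removed or reordered without touching the others.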

What is a data pipeline?

Which design pattern to choose?

There are lots of things to consider, e.g. which data stack to use? What tools to consider? How to design a data pipeline conceptually? ETL or ELT? Maybe ETLT? What is Change Data Capture?

I will try to cover these questions here.

Continue Reading on Medium

Unit tests for sql scripts with dependencies in Dataform.

Mike November 27, 2022 11 Comments
advanced sql techniques

Do you unit test your data warehouse scripts?

Dataform is a great free tool for SQL data transformation. It helps to keep your data warehouse clean and well organized. It has nice dependency graphs explaining data lineage and it serves as a single source of truth for everything that happens there.

I am going to talk about unit tests though.

What makes a good unit test?

  • It should test expected vs actual results
  • Should describe the script's logic corresponding to use cases.
  • It should be automated.
  • Be Independent (tests should not do setup or teardown for one another)
  • It should be easy to implement.
  • Be Repeatable: Anyone should be able to run it in any environment.
  • Once it's written, it should remain for future use.

Dataform supports SQL unit tests for views and you can read about it in Dataform docs. It is indeed simple.
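The properties listed above can be illustrated outside Dataform as well. The sketch below is not Dataform syntax: it uses Python's `unittest` with an in-memory SQLite database as a stand-in warehouse, and the `payments` table and `daily_revenue` view are hypothetical names chosen for the example.

```python
import sqlite3
import unittest

# Hypothetical view under test: total revenue per user.
VIEW_SQL = """
CREATE VIEW daily_revenue AS
SELECT user_id, SUM(amount) AS revenue
FROM payments
GROUP BY user_id
"""


class DailyRevenueViewTest(unittest.TestCase):
    def setUp(self):
        # Each test builds its own in-memory database, so tests stay
        # independent and repeatable in any environment.
        self.conn = sqlite3.connect(":memory:")
        self.conn.execute("CREATE TABLE payments (user_id INTEGER, amount REAL)")
        self.conn.executemany(
            "INSERT INTO payments VALUES (?, ?)",
            [(1, 10.0), (1, 5.0), (2, 7.5)],
        )
        self.conn.execute(VIEW_SQL)

    def tearDown(self):
        self.conn.close()

    def test_revenue_is_summed_per_user(self):
        # Compare expected rows with what the view actually returns.
        expected = [(1, 15.0), (2, 7.5)]
        actual = self.conn.execute(
            "SELECT user_id, revenue FROM daily_revenue ORDER BY user_id"
        ).fetchall()
        self.assertEqual(expected, actual)


def run_tests() -> bool:
    # Run the suite programmatically and report whether it passed.
    suite = unittest.defaultTestLoader.loadTestsFromTestCase(DailyRevenueViewTest)
    return unittest.TextTestRunner(verbosity=0).run(suite).wasSuccessful()
```

Calling `run_tests()` from a CI job is one way to cover the "it should be automated" point; the mock rows in `setUp` play the role of the test inputs.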

Continue Reading on Medium

Deploy Machine learning models with Node.js Swagger, BigQuery and AWS Cloudformation.

Mike January 29, 2022 101 Comments
Cloudformation ML

Learn how to Deploy Machine learning models with Node.js Swagger, BigQuery and AWS Cloudformation.

In this post I will create a simple API and deploy API service with AWS Cloudformation. I want to achieve the following:

  • create a Node.JS API to serve my machine learning models.
  • connect API service to a data warehouse solution (in my case it would be BigQuery)
  • deploy my service using Docker and AWS Cloudformation

You will learn how to deploy your models with Docker with just one command.

Continue Reading on Medium

Automated data quality checks for your data warehouse.

Mike January 29, 2022 101 Comments
Auto email notifications

Building a data warehouse using clean data.

This is a simple and reliable data quality framework that most modern data warehouses support. It lets you check your data with views and detect potential data quality issues with ease. It covers more than missing data and NULL values: almost anything can be used in dataset conditions, e.g. a regex function to check that data matches a particular pattern, or combined conditions over multiple columns. Need anomaly detection? It's simple. Just add a 30-day moving average and a threshold to your dataset conditions and get an email notification when the threshold is breached.

You will learn how to:

  • Check data quality using SQL
  • Send email notifications
  • Create test tables and mock data with SQL
  • Detect data anomalies
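The moving-average check mentioned above can be sketched in a few lines. The article does this with SQL views inside the warehouse; the version below is plain Python for illustration only, and the function name `find_anomalies`, the 50% threshold and the sample series are assumptions.

```python
from datetime import date, timedelta


def find_anomalies(daily_counts, window=30, threshold=0.5):
    """Flag days that deviate from the trailing moving average.

    daily_counts: list of (day, value) tuples ordered by day.
    threshold: allowed relative deviation, e.g. 0.5 means 50%.
    """
    anomalies = []
    for i in range(window, len(daily_counts)):
        day, value = daily_counts[i]
        # Average of the `window` days preceding the current one.
        previous = [v for _, v in daily_counts[i - window:i]]
        moving_avg = sum(previous) / window
        if moving_avg and abs(value - moving_avg) / moving_avg > threshold:
            anomalies.append((day, value, moving_avg))
    return anomalies


# Hypothetical series: 31 steady days followed by a sudden drop.
start = date(2022, 1, 1)
series = [(start + timedelta(days=i), 100) for i in range(31)]
series.append((start + timedelta(days=31), 10))

anomalies = find_anomalies(series)
```

In a real setup the returned rows would feed the notification step; here they are simply collected in `anomalies`.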
Continue Reading on Medium

How to create a simple recommendation engine and train your model on GCP

Mike December 14, 2021 101 Comments
WALS model

Complete guide using Tensorflow, Airflow scheduler and Docker

This tutorial explains how to train a user-items-ratings recommendation model using the WALS algorithm.

You will learn how to:

  • Run the model trainer locally
  • Write a Dockerfile and create a custom environment container
  • Push it to Google Cloud Platform and create a custom container Google AI training job
  • Schedule model training with Airflow
Continue Reading on Medium

How to export millions of rows from MySQL

Mike November 28, 2021 10 Comments
load data from MySQL

Data warehouse guide and MySQL data connector how-to with Serverless, APIs and Node.js.

You will learn how to:

  • create a simple Node.js app with AWS Lambda.
  • use Node.js streams to optimise memory consumption.
  • extract data and save it locally in CSV and JSON formats.
  • export it into the Cloud Storage.
  • use yaml config for your queries.
  • deploy and schedule it.
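The memory-saving idea behind the streaming step in the list above is to never hold the full result set at once. The article uses Node.js streams and MySQL; the sketch below shows the same principle in Python with an in-memory SQLite database as a stand-in, and the `users` table, `export_in_batches` helper and batch size are hypothetical.

```python
import csv
import io
import sqlite3


def export_in_batches(conn, query, out, batch_size=1000):
    """Write query results to `out` as CSV, fetching `batch_size` rows at a time."""
    cursor = conn.execute(query)
    writer = csv.writer(out)
    # Header row taken from the cursor's column metadata.
    writer.writerow([col[0] for col in cursor.description])
    total = 0
    while True:
        rows = cursor.fetchmany(batch_size)
        if not rows:
            break
        writer.writerows(rows)
        total += len(rows)
    return total


# Stand-in data source; the article's source is MySQL.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER, name TEXT)")
conn.executemany(
    "INSERT INTO users VALUES (?, ?)",
    [(i, f"user_{i}") for i in range(2500)],
)

buffer = io.StringIO()
exported = export_in_batches(conn, "SELECT id, name FROM users ORDER BY id", buffer)
```

The `StringIO` buffer is used here only to keep the example self-contained; passing an open file handle as `out` is what keeps memory usage bounded by the batch size.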
Continue Reading on Medium

How to load data into your BigQuery data warehouse with Serverless and Node.JS.

Mike November 21, 2021 10 Comments
load data

This project is about **data engineering**, modern data stacks, thinking outside the box, self learning, customisation, being language agnostic and being able to achieve the desired outcome with unconventional methods.

Continue Reading on Medium

How to extract real-time intraday data from Google Analytics 4 and Firebase in BigQuery.

Mike November 19, 2021 140 Comments
Firebase

If you are a Firebase or Google Analytics 4 user and you have set up data imports into your BigQuery data warehouse, then you might want to create real-time custom reports with your data in the intraday schema. The problem is that this integrated dataset is deleted automatically by Google every day. So if you choose to connect it as a data source to your report in Google Data Studio, you won't find it the day after.

Here is the solution to improve data availability with Firebase or Google Analytics 4 integrations.

Continue Reading on Medium

How to extract data from PayPal API and prepare for loading into your data warehouse.

Mike November 12, 2021 140 Comments
DAU

This tutorial explains how to extract data from any arbitrary data source with an API and perform any ETL/ELT before loading it into your data warehouse, e.g. BigQuery, Redshift or Snowflake.

Continue Reading on Medium

How to calculate Real Active Users. What are the numbers?

Mike January 06, 2021 140 Comments
DAU

Have you ever wondered what your real DAU numbers are? How do you identify linked user accounts and users cross-using each other's devices? Then this article is for you. MAU, DAU, linked users and retention: a Google Data Studio template and SQL guide for datasets. Feedback is much appreciated. Thanks!

Continue Reading

Datastudioguides.com. How-to guides and Data Studio tutorials.

Mike January 31, 2021 14 Comments
Visit datastudioguides.com

Tons of great content on Google Data Studio at Datastudioguides.com. Here, you’ll find what you need to get started. Google Data Studio is a fantastic tool for creating beautiful dashboards and reports. It’s easy to use and, perhaps most importantly, completely free. Check their "How-to guides".

Continue Reading

Building a BigQuery monitoring dashboard.

Mike January 06, 2021 140 Comments
Big Query monitoring dashboard

I've created a new Data Studio template for BigQuery usage monitoring with a report labeling system. Feedback is much appreciated. Thanks!

Continue Reading

Retention and Daily Active Users Explained.

Mike January 06, 2021 140 Comments

I've created a new Data Studio template for User Retention with calendar charts. This is my new article "Retention and Daily Active Users Explained" which explains how to use it. It's a Data Studio guide and BigQuery tutorial for Firebase users, Machine Learning enthusiasts and Marketers. All you wanted to know. Feedback is much appreciated. Thanks!

Continue Reading

Building a static website with forms and blog. Part 3. Lambda commenting system.

Mike August 26, 2019 63 Comments

This is the first part of our tutorial on building web applications. We'll start with HTML5 and CSS basics aiming to build a blogging platform. I'll explain how to buy your domain name and put your website live. In the next parts you will learn how to add back-end features, e.g. forms, blog comments and user authentication.

Continue Reading

Building a static website with forms and blog. Part 2. Lambda contact form.

Mike August 23, 2019 63 Comments

This is the first part of our tutorial on building web applications. We'll start with HTML5 and CSS basics aiming to build a blogging platform. I'll explain how to buy your domain name and put your website live. In the next parts you will learn how to add back-end features, e.g. forms, blog comments and user authentication.

Continue Reading

Building a static blogger website. Part 1: Domain name. S3. SSL, HTTPS and Cloudfront.

Mike August 19, 2019 63 Comments

This is the first part of our tutorial on building web applications. We'll start with HTML5 and CSS basics aiming to build a blogging platform. I'll explain how to buy your domain name and put your website live. In the next parts you will learn how to add back-end features, e.g. forms, blog comments and user authentication.

Continue Reading