dbt
dbt (data build tool) helps you transform data; it performs the T in the ELT process
What is dbt
- dbt performs the T in the ELT (Extract, Load, Transform) process
- dbt is extremely good at transforming data that’s already loaded into your warehouse
- dbt enables analytics engineers to transform data in their warehouses by simply writing select statements
What it is not
- dbt does not extract or load data from other systems or databases
Why Data Transformation
Data from source systems needs to be transformed for the purpose of:
- Data Validation & Cleansing
- Data Standardization & Enrichment
- Data Integration between systems
- Application of Business Logic & Rules
- Derivation of Business KPIs
- Data Modeling suited for Analytics
dbt in the Data Platform Landscape
Features of dbt
Software Engineering best practices can also be applied to Data Engineering pipelines:
- Version Control - Analytic code should be version controlled. Analysis changes as data and businesses evolve, and it’s important to know who changed what, when
- Quality Assurance - Bad data can lead to bad analyses, and bad analyses can lead to bad decisions. Any code that generates data or analysis should be reviewed and tested
- Documentation - Data analysis should come with a basic description of how it should be interpreted, and the team should be able to add to that documentation as additional questions arise
- Modularity - Data analysis code should also follow the principles of Reusability, Manageability & DRY (Don't Repeat Yourself)
How dbt works
- Your main focus is on writing models (i.e. SELECT-only queries) that reflect core business logic
- There’s no need to write boilerplate code to create tables and views, or to define the order of execution of your models
- Instead, dbt handles turning these models into objects in the warehouse (see the sketch after this list):
- It generates the DDL required to build each model, as per the model's materialization
- It executes the compiled queries against your data warehouse
- It maintains model dependencies
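For illustration, a model file might look like the sketch below; the file, table, and column names (order_summary, raw.orders, customer_id, amount) are hypothetical, and the compiled DDL described afterwards is only approximate.

```sql
-- models/order_summary.sql  (hypothetical model; table and column names are made up)
-- You write only a SELECT statement -- no CREATE TABLE / CREATE VIEW boilerplate
select
    customer_id,
    count(*)    as order_count,
    sum(amount) as total_amount
from raw.orders
group by customer_id
```

On `dbt run`, with the default view materialization, dbt compiles this into roughly `create view <target_schema>.order_summary as ( ... )` and executes it against the warehouse, building models in dependency order.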
Data Platform without dbt
dbt-Based Flow
dbt Overview
dbt Projects
- A dbt project is a directory of .sql and .yml files
- The directory must contain at a minimum:
- Project file: a dbt_project.yml file that configures and defines your dbt project (a minimal example is sketched after this list)
- Models: A model is a single .sql file. Each model contains a single select statement that either transforms raw data into a dataset that is ready for analytics, or, more often, is an intermediate step in such a transformation
- Projects typically contain a few other resources as well, including snapshots, tests, and seed files
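As a sketch, a minimal dbt_project.yml could look like the following; the project name, profile name, and defaults are hypothetical placeholders.

```yaml
# dbt_project.yml -- minimal sketch; names and values are placeholders
name: my_analytics_project
version: '1.0.0'
config-version: 2

# which connection profile (in profiles.yml) dbt should use
profile: my_warehouse

# where dbt looks for model .sql files
model-paths: ["models"]

# project-wide defaults; can be overridden per folder or per model
models:
  my_analytics_project:
    +materialized: view
```

Alongside this file, models live under models/, with optional snapshots/, tests/, and seeds/ directories for the other resource types.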
dbt Adapters
- dbt connects to and runs SQL against your database, warehouse, platform, or query engine. It works by using a dedicated adapter for each technology
- You will need a data warehouse with source data already loaded in it to use dbt
- dbt natively supports connections to Snowflake, BigQuery, Redshift and Postgres data warehouses
- When you define your connection, you’ll also be able to specify the target schema where dbt should create your models as tables and views
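As an illustration, a connection profile for the Postgres adapter might look roughly like this; all values below are placeholders, and the exact keys differ per adapter, so check your adapter's documentation.

```yaml
# profiles.yml -- illustrative Postgres connection; every value below is a placeholder
my_warehouse:
  target: dev
  outputs:
    dev:
      type: postgres
      host: localhost
      port: 5432
      user: dbt_user
      password: dbt_password
      dbname: analytics
      schema: dbt_dev      # target schema where dbt creates your tables and views
      threads: 4
```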
dbt Sources
- Sources make it possible to name and describe the data loaded into your warehouse by your Extract and Load tools
- Select from source tables in your models using the source() function. It helps to define the lineage of your data
- Test your assumptions about your source data
- Calculate the freshness of your source data
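A sketch of a source definition tying these points together; the source, table, and column names (jaffle_shop, orders, order_id, _loaded_at) are made up.

```yaml
# models/staging/sources.yml -- illustrative source definition
version: 2

sources:
  - name: jaffle_shop               # logical name used in source('jaffle_shop', ...)
    schema: raw                     # schema your Extract/Load tool writes into
    loaded_at_field: _loaded_at     # column used to calculate source freshness
    freshness:
      warn_after: {count: 12, period: hour}
    tables:
      - name: orders
        columns:
          - name: order_id
            tests:                  # assumptions about the source data
              - unique
              - not_null
```

A model would then select from it with `select * from {{ source('jaffle_shop', 'orders') }}`, and `dbt source freshness` checks the freshness configuration above.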
dbt Models
- A model is a select statement.
- Models are defined in .sql files (typically in your models directory)
- Each .sql file must contain only one model
- The name of the file is used as the model name
- By default, dbt creates models as views
- dbt builds models in a target schema you define
- A model can be referenced in downstream models using the ref() function. It helps to define the lineage of your data
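A sketch of a downstream model built on top of hypothetical upstream models named stg_customers and stg_orders; all names here are illustrative.

```sql
-- models/marts/customer_orders.sql  (hypothetical; the model takes its name from the file)
-- Each ref() call declares a dependency, so dbt builds the upstream models first
-- and records the lineage between them and this model
select
    c.customer_id,
    c.customer_name,
    count(o.order_id) as order_count
from {{ ref('stg_customers') }} as c
left join {{ ref('stg_orders') }} as o
    on o.customer_id = c.customer_id
group by c.customer_id, c.customer_name
```

Because the file is named customer_orders.sql, the model is called customer_orders, and with no materialization configured dbt builds it as a view in the target schema.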
dbt Materializations
- dbt handles boilerplate code to materialize queries as relations. For each model you create, you can easily configure a materialization
- A materialization represents a build strategy for your select query – the code behind a materialization is robust, boilerplate SQL that wraps your select query in a statement to create a new, or update an existing, relation
- dbt ships with the following built-in materializations:
- view (default): The model is built as a view in the database
- table: The model is built as a table in the database
- ephemeral: The model is not directly built in the database but is instead pulled into dependent models as common table expressions (CTEs)
- incremental: The model is initially built as a table, and in subsequent runs, dbt inserts new rows and updates changed rows in the table
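As a sketch of configuring a materialization, the hypothetical model below is built incrementally; on the first run dbt creates the table, and on later runs only newer rows are processed (the model, column, and key names are made up).

```sql
-- models/marts/daily_events.sql  (hypothetical incremental model)
{{ config(
    materialized='incremental',
    unique_key='event_id'      -- lets dbt update changed rows instead of inserting duplicates
) }}

select
    event_id,
    event_type,
    event_timestamp
from {{ ref('stg_events') }}

{% if is_incremental() %}
  -- on incremental runs, only pick up rows newer than what is already in the target table
  where event_timestamp > (select max(event_timestamp) from {{ this }})
{% endif %}
```

Setting `materialized='view'`, `'table'`, or `'ephemeral'` in the same `config()` block (or in dbt_project.yml) switches to the other built-in materializations.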
Advantages of dbt
- A well-known and widely used technology stack in the modern Data Engineering space
- Quicker development, testing & documentation, with less overhead on infrastructure management
- Less time spent troubleshooting data pipelines
- Complements & co-exists with Airflow & Snowflake
- Ease of use and great documentation lead to a low learning curve
- Data Transformation, Data Lineage, Data Documentation & Data Search are strong points, which also cover Data Governance aspects
- Many engineers are familiar with SQL & dbt, so onboarding new team members to the data platform is fast
- Migrating the target data warehouse will have minimal effect on existing data pipelines
dbt Cloud Features
- A web-based IDE to develop dbt projects
- A purpose-built Job scheduler
- Share dbt documentation with your team
- VCS Integration
- Logging & Alerting
- Single Sign On (SSO)
- Role-based ACLs
- Audit logging
dbt Cloud vs Open Source
- dbt Cloud provides a place for data analysts to complete a request in a few hours, without getting caught up in the intricacies of infrastructure (CLI, Kubernetes cluster access)
- CI is self-serve for less technical users; everything lives in git and is hosted over the internet, so it is easy to access
- Notifications and hosted run logs are accessible to users rather than locked inside a Kubernetes cluster
- Access for everyone: giving business users access reduces bottlenecks and increases change velocity, without the overhead of manual coordination and infrastructure management
In-House Tooling vs. dbt Cloud
Reference:
- Business Case Guide & dbt Cloud TCO Calculator: https://www.getdbt.com/resources/making-the-case-for-dbt-cloud