Logo DWBI.org Login / Sign Up
Sign Up
Have Login?
This site is protected by reCAPTCHA and the Google Privacy Policy and Terms of Service apply.
Login
New Account?
Recovery
Go to Login
By continuing you indicate that you agree to Terms of Service and Privacy Policy of the site.
DBT

DBT

Updated on Jun 14, 2024

dbt, data build tool helps to transform data. dbt performs the T in ELT process

What is dbt

  • dbt performs the T in ELT (Extract, Load, Transform) process
  • dbt is extremely good at transforming data that’s already loaded into your warehouse
  • dbt enables analytics engineers to transform data in their warehouses by simply writing select statements

What it is not

  • dbt does not extract or load data from other systems or databases

Why Data Transformation

Data from source systems needs to be transformed for the purpose of:

  • Data Validation & Cleansing
  • Data Standardization & Enrichment
  • Data Integration between systems
  • Apply Business Logic & Rules
  • Derive Business KPI
  • Modeling suited for Analytics

dbt in Data Platform Landscape

Features of dbt

Software Engineering best practices can also be applied to Data Engineering Pipelines-

  • Version Control​ - Analytic code  should be version controlled. Analysis changes as data and businesses evolve, and it’s important to know who changed what, when
  • Quality Assurance ​- Bad data can lead to bad analyses, and bad analyses can lead to bad decisions. Any code that generates data or analysis should be reviewed and tested
  • Documentation​ - Data analysis should come with a basic description of how it should be interpreted, and the team should be able to add to that documentation as additional questions arise
  • Modularity​ - Data analysis code should also follow the principles of Reusability, Manageability & DRY

How dbt works

  • Main focus will be on writing models (i.e. SELECT only queries) that reflect core business logic
  • There’s no need to write boilerplate code to create tables and views, or to define the order of execution of your models
  • Instead, dbt handles turning these models into objects in the warehouse
  • Generates the DDL required to build the model, as per the model's materialization
  • Executes the compiled queries against your data warehouse
  • Maintain model dependencies

Data Platform without DBT

DBT Based Flow

DBT Overview

dbt Projects

  • A dbt project is a directory of .sql and .yml files
  • The directory must contain at a minimum:
  • Project file: a dbt_project.yml file which configures and defines your dbt project
  • Models: A model is a single .sql file. Each model contains a single select statement that either transforms raw data into a dataset that is ready for analytics, or, more often, is an intermediate step in such a transformation
  • Projects typically contain a few other resources as well, including snapshots, tests, and seed files

dbt Adapters

  • dbt connects to and runs SQL against your database, warehouse, platform, or query engine. It works by using a dedicated adapter for each technology
  • You will need a data warehouse with source data already loaded in it to use dbt
  • dbt natively supports connections to Snowflake, BigQuery, Redshift and Postgres data warehouses
  • When you define your connection, you’ll also be able to specify the target schema where dbt should create your models as tables and views

dbt Sources

  • Sources make it possible to name and describe the data loaded into your warehouse by your Extract and Load tools
  • Select from source tables in your models using the source() function. It helps to define the lineage of your data
  • Test your assumptions about your source data
  • Calculate the freshness of your source data

dbt Models

  • A model is a select statement.
  • Models are defined in .sql files (typically in your models directory)
  • Each .sql file must contain only one model
  • The name of the file is used as the model name
  • By default dbt will create the models as views
  • Build models in a target schema you define
  • A model can be referenced in downstream models using the ref() function. It helps to define the lineage of your data

dbt Materializations

  • dbt handles boilerplate code to materialize queries as relations. For each model you create, you can easily configure a materialization
  • A materialization represents a build strategy for your select query – the code behind a materialization is robust, boilerplate SQL that wraps your select query in a statement to create a new, or update an existing, relation
  • dbt ships with the following built-in materializations:
  • view (default): The model is built as a view in the database
  • table: The model is built as a table in the database
  • ephemeral: The model is not directly built in the database but is instead pulled into dependent models as common table expressions (CTEs)
  • incremental: The model is initially built as a table, and in subsequent runs, dbt inserts new rows and updates changed rows in the table

Advantages of dbt

  • A well known and widely used Technology stack in modern Data Engineering space
  • Quicker development, testing & documentation + less overhead on infrastructure management
  • Less time spent troubleshooting data pipelines
  • Compliments & co-exists with Airflow & Snowflake
  • Ease to use with great documentation leads to low learning curve
  • Data Transformation, Data Lineage & Data Documentation, Data Searching are the strong points which also covers the Data Governance aspects
  • Many engineers are familiar with SQL & dbt hence onboarding new team members to data platform is fast 
  • Migration of Target Data warehouse will have minimal effect on existing data pipelines

dbt Cloud Features

  • A web-based IDE to develop dbt projects
  • A purpose-built Job scheduler
  • Share dbt documentation with your team
  • VCS Integration
  • Logging & Alerting
  • Single Sign On (SSO)
  • Role-based ACLs
  • Audit logging

dbt Cloud vs Open Source

  • Provide a place for data analysts to complete a request in a few hours
  • Without getting caught up in the intricacies of infrastructure (CLI, k8s cluster access)
  • CI is self-serve for less technical users. All in git / hosted over internet so can easily access
  • Notifications - hosted logs in runs accessible to users. Not locked in k8s cluster 
  • Access for everyone - giving access to business users to reduce bottlenecks → increase change velocity
  • without the overhead of manual coordination and infrastructure management

In-House Tooling vs. dbt Cloud

Reference:



Top 10 Articles