dbt
dbt (data build tool) helps you transform data; it performs the T in the ELT process
What is dbt
- dbt performs the T in the ELT (Extract, Load, Transform) process
- dbt is extremely good at transforming data that’s already loaded into your warehouse
- dbt enables analytics engineers to transform data in their warehouses by simply writing select statements
What it is not
- dbt does not extract or load data from other systems or databases
Why Data Transformation
Data from source systems needs to be transformed for the purpose of:
- Data Validation & Cleansing
- Data Standardization & Enrichment
- Data Integration between systems
- Application of Business Logic & Rules
- Derivation of Business KPIs
- Data Modeling suited for Analytics
dbt in the Data Platform Landscape
Features of dbt
Software Engineering best practices can also be applied to Data Engineering pipelines:
- Version Control - Analytic code should be version controlled. Analysis changes as data and businesses evolve, and it’s important to know who changed what, when
- Quality Assurance - Bad data can lead to bad analyses, and bad analyses can lead to bad decisions. Any code that generates data or analysis should be reviewed and tested
- Documentation - Data analysis should come with a basic description of how it should be interpreted, and the team should be able to add to that documentation as additional questions arise
- Modularity - Data analysis code should also follow the principles of Reusability, Manageability & DRY (Don't Repeat Yourself)
How dbt works
- Your main focus is on writing models (i.e. SELECT-only queries) that reflect core business logic
- There’s no need to write boilerplate code to create tables and views, or to define the order of execution of your models
- Instead, dbt handles turning these models into objects in the warehouse (see the sketch after this list):
- It generates the DDL required to build each model, as per the model's materialization
- It executes the compiled queries against your data warehouse
- It maintains model dependencies
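For illustration, a model file might look like the sketch below; the file, table, and column names (order_summary, raw.orders, customer_id, amount) are hypothetical, and the compiled DDL described afterwards is only approximate.

```sql
-- models/order_summary.sql  (hypothetical model; table and column names are made up)
-- You write only a SELECT statement -- no CREATE TABLE / CREATE VIEW boilerplate
select
    customer_id,
    count(*)    as order_count,
    sum(amount) as total_amount
from raw.orders
group by customer_id
```

On `dbt run`, with the default view materialization, dbt compiles this into roughly `create view <target_schema>.order_summary as ( ... )` and executes it against the warehouse, building models in dependency order.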
Data Platform without dbt
dbt-Based Flow
dbt Overview
dbt Projects
- A dbt project is a directory of .sql and .yml files
- The directory must contain at a minimum:
- Project file: a dbt_project.yml file that configures and defines your dbt project (a minimal example is sketched after this list)
- Models: A model is a single .sql file. Each model contains a single select statement that either transforms raw data into a dataset that is ready for analytics, or, more often, is an intermediate step in such a transformation
- Projects typically contain a few other resources as well, including snapshots, tests, and seed files
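As a sketch, a minimal dbt_project.yml could look like the following; the project name, profile name, and defaults are hypothetical placeholders.

```yaml
# dbt_project.yml -- minimal sketch; names and values are placeholders
name: my_analytics_project
version: '1.0.0'
config-version: 2

# which connection profile (in profiles.yml) dbt should use
profile: my_warehouse

# where dbt looks for model .sql files
model-paths: ["models"]

# project-wide defaults; can be overridden per folder or per model
models:
  my_analytics_project:
    +materialized: view
```

Alongside this file, models live under models/, with optional snapshots/, tests/, and seeds/ directories for the other resource types.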
dbt Adapters
- dbt connects to and runs SQL against your database, warehouse, platform, or query engine. It works by using a dedicated adapter for each technology
- You will need a data warehouse with source data already loaded in it to use dbt
- dbt natively supports connections to Snowflake, BigQuery, Redshift and Postgres data warehouses
- When you define your connection, you’ll also be able to specify the target schema where dbt should create your models as tables and views
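As an illustration, a connection profile for the Postgres adapter might look roughly like this; all values below are placeholders, and the exact keys differ per adapter, so check your adapter's documentation.

```yaml
# profiles.yml -- illustrative Postgres connection; every value below is a placeholder
my_warehouse:
  target: dev
  outputs:
    dev:
      type: postgres
      host: localhost
      port: 5432
      user: dbt_user
      password: dbt_password
      dbname: analytics
      schema: dbt_dev      # target schema where dbt creates your tables and views
      threads: 4
```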
dbt Sources
- Sources make it possible to name and describe the data loaded into your warehouse by your Extract and Load tools
- Select from source tables in your models using the source() function. It helps to define the lineage of your data
- Test your assumptions about your source data
- Calculate the freshness of your source data
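A sketch of a source definition tying these points together; the source, table, and column names (jaffle_shop, orders, order_id, _loaded_at) are made up.

```yaml
# models/staging/sources.yml -- illustrative source definition
version: 2

sources:
  - name: jaffle_shop               # logical name used in source('jaffle_shop', ...)
    schema: raw                     # schema your Extract/Load tool writes into
    loaded_at_field: _loaded_at     # column used to calculate source freshness
    freshness:
      warn_after: {count: 12, period: hour}
    tables:
      - name: orders
        columns:
          - name: order_id
            tests:                  # assumptions about the source data
              - unique
              - not_null
```

A model would then select from it with `select * from {{ source('jaffle_shop', 'orders') }}`, and `dbt source freshness` checks the freshness configuration above.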
dbt Models
- A model is a select statement.
- Models are defined in .sql files (typically in your models directory)
- Each .sql file must contain only one model
- The name of the file is used as the model name
- By default, dbt creates models as views
- dbt builds models in a target schema you define
- A model can be referenced in downstream models using the ref() function. It helps to define the lineage of your data
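A sketch of a downstream model built on top of hypothetical upstream models named stg_customers and stg_orders; all names here are illustrative.

```sql
-- models/marts/customer_orders.sql  (hypothetical; the model takes its name from the file)
-- Each ref() call declares a dependency, so dbt builds the upstream models first
-- and records the lineage between them and this model
select
    c.customer_id,
    c.customer_name,
    count(o.order_id) as order_count
from {{ ref('stg_customers') }} as c
left join {{ ref('stg_orders') }} as o
    on o.customer_id = c.customer_id
group by c.customer_id, c.customer_name
```

Because the file is named customer_orders.sql, the model is called customer_orders, and with no materialization configured dbt builds it as a view in the target schema.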
dbt Materializations
- dbt handles boilerplate code to materialize queries as relations. For each model you create, you can easily configure a materialization
- A materialization represents a build strategy for your select query – the code behind a materialization is robust, boilerplate SQL that wraps your select query in a statement to create a new, or update an existing, relation
- dbt ships with the following built-in materializations:
- view (default): The model is built as a view in the database
- table: The model is built as a table in the database
- ephemeral: The model is not directly built in the database but is instead pulled into dependent models as common table expressions (CTEs)
- incremental: The model is initially built as a table, and in subsequent runs, dbt inserts new rows and updates changed rows in the table
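As a sketch of configuring a materialization, the hypothetical model below is built incrementally; on the first run dbt creates the table, and on later runs only newer rows are processed (the model, column, and key names are made up).

```sql
-- models/marts/daily_events.sql  (hypothetical incremental model)
{{ config(
    materialized='incremental',
    unique_key='event_id'      -- lets dbt update changed rows instead of inserting duplicates
) }}

select
    event_id,
    event_type,
    event_timestamp
from {{ ref('stg_events') }}

{% if is_incremental() %}
  -- on incremental runs, only pick up rows newer than what is already in the target table
  where event_timestamp > (select max(event_timestamp) from {{ this }})
{% endif %}
```

Setting `materialized='view'`, `'table'`, or `'ephemeral'` in the same `config()` block (or in dbt_project.yml) switches to the other built-in materializations.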
Advantages of dbt
- A well-known and widely used technology stack in the modern Data Engineering space
- Quicker development, testing & documentation, with less overhead on infrastructure management
- Less time spent troubleshooting data pipelines
- Complements & co-exists with Airflow & Snowflake
- Ease of use and great documentation lead to a low learning curve
- Data Transformation, Data Lineage, Data Documentation & Data Search are strong points, which also cover Data Governance aspects
- Many engineers are familiar with SQL & dbt, so onboarding new team members to the data platform is fast
- Migrating the target data warehouse will have minimal effect on existing data pipelines
dbt Cloud Features
- A web-based IDE to develop dbt projects
- A purpose-built Job scheduler
- Share dbt documentation with your team
- VCS Integration
- Logging & Alerting
- Single Sign On (SSO)
- Role-based ACLs
- Audit logging
dbt Cloud vs Open Source
- dbt Cloud provides a place for data analysts to complete a request in a few hours, without getting caught up in the intricacies of infrastructure (CLI, Kubernetes cluster access)
- CI is self-serve for less technical users; everything lives in git and is hosted over the internet, so it is easy to access
- Notifications and hosted run logs are accessible to users rather than locked inside a Kubernetes cluster
- Access for everyone: giving business users access reduces bottlenecks and increases change velocity, without the overhead of manual coordination and infrastructure management
In-House Tooling vs. dbt Cloud
Reference:
- Business Case Guide & dbt Cloud TCO Calculator: https://www.getdbt.com/resources/making-the-case-for-dbt-cloud