DBT (Data Build Tool) is a command line data transformation tool, which enables data analysts and engineers to transform data in their warehouses more effectively. It allows you to test business logic, ensuring data quality, and fix issues proactively before they impact your business.
Here I’m using the Databricks service from Azure, and the DBT commands are ran from local (can be part of Databricks Workflow or any other orchestrator)
Prerequisites
- Python (> 3.6)
- Databricks cloud
- dbt-databricks (pip library)
The versions I used for this POC are
- Python - 3.10.12
- Azure Databricks - 13.3 LTS
- dbt-databricks - 3.0.0
The current latest version of dbt-databricks (v3.0.0) needs min Python version as 3.8, as its dependant library databricks-sql-connector requires the mentioned version.
Installation and initial setup
Install “dbt-databricks” python library
pip install dbt-databricks
Setup DBT
Get in to the directory where you want to create new DBT project and run the following command.
Note: dbt init will create a new project dir with the name you mention in the interactive console.
dbt init
The above command will provide multiple options to choose and enter in interactive manner. Below picture depicts sample project setup.
The above setup does not have proper Databricks details entered, I just created for reference. Once you understand the DBT structure, you should be able to create the project manually without dbt init command.
After completing the above setup, you would see the dir struture as follows.
├── README.md
├── analyses
├── dbt_project.yml
├── macros
├── models
│ └── example
│ ├── my_first_dbt_model.sql
│ ├── my_second_dbt_model.sql
│ └── schema.yml
├── seeds
├── snapshots
└── tests
The conection details are stored in profiles.yml file, which is by default created under user home directory ~/.dbt/.
DBT will search for profiles.yml in the following order.
- Under current project dir
- DBT_PROFILES_DIR (if DBT_PROFILES_DIR is set to maintain custom dir)
- User home dir (~/.dbt)
Note: No need of profiles.yml for dbt Cloud
Project structure and config files
databricks_poc
├── analyses
├── macros
├── models
├── seeds
├── snapshots
├── tests
├── README.md
├── dbt_project.yml
└── profiles.yml
profiles.yml
databricks_poc:
target: dev
outputs:
dev:
type: databricks
catalog: hive_metastore
schema: default
host:
http_path:
token:
dbt_project.yml
name: 'databricks_poc'
version: '1.0.0'
config-version: 2
profile: 'databricks_poc'
model-paths: ["models"]
analysis-paths: ["analyses"]
test-paths: ["tests"]
seed-paths: ["seeds"]
macro-paths: ["macros"]
snapshot-paths: ["snapshots"]
clean-targets:
- "target"
- "dbt_packages"
How DBT seed works
Things that we should keep in mind while using seed
- dbt seed is recommended only for smaller dataset, mainly for mapping.
- Every time when we run seed command, it will recreate the table with new data.
- Make sure you version control the seed files with proper data.
- Because of above reasons, it is not recommended to use in production.
I have created the following schemas (databases) in Databricks upfront before triggering the DBT commands.
CREATE SCHEMA IF NOT EXISTS poc_bronze;
CREATE SCHEMA IF NOT EXISTS poc_silver;
CREATE SCHEMA IF NOT EXISTS poc_gold;
Changing the default Schema (database)
By default, DBT generates the target schema which is the combination of target_schema (default schema defined in the profile) and custom_schema. Based on the config we defined for seeds, our expectation would be to create and load the data under the defined schema (ie poc_bronze).
Note: instead of hive_metastore.poc_bronze, DBT will create new schema (database) like this => hive_metastore.defaut_poc_bronze
DBT gets the schema from built-in macro called generate_schema_name, which has following line to return schema
{% macro generate_schema_name(custom_schema_name, node) -%}
{%- set default_schema = target.schema -%}
{%- if custom_schema_name is none -%}
{{ default_schema }}
{%- else -%}
{{ default_schema }}_{{ custom_schema_name | trim }}
{%- endif -%}
{%- endmacro %}
So, we need to update the following line to satisfy our requirement
{{ default_schema }}_{{ custom_schema_name | trim }}
TO
{{ custom_schema_name | trim }}
To do so, we need to add a new macro file called “get_custom_schema.sql” with the updated code as like follows.
Load data using dbt seed
Note: seed is mainly for mapping and other static dataset of smaller in size, not recommended for large volume of data. People generally avoid running seed in Production, so use it for testing and smaller mapping dataset.
We have the following CSV files with the sample data for product and product_category tables that are supposed to be created under poc_bronze database.
product.csv [ seeds / poc_bronze / product.csv ]
product_id, name, category_id, unit_price, created_at, updated_at
1, Apples, 1, 100, 2023-11-17 10:00:00, 2023-11-17 10:00:00
2, Oranges, 1, 80, 2023-11-17 10:00:00, 2023-11-17 10:00:00
3, Grapes, 1, 70, 2023-11-17 10:00:00, 2023-11-17 10:00:00
4, Onions, 2, 90, 2023-11-17 10:00:00, 2023-11-17 10:00:00
5, Tomatoes, 2, 60, 2023-11-17 10:00:00, 2023-11-17 10:00:00
6, Potatoes, 2, 50, 2023-11-17 10:00:00, 2023-11-17 10:00:00
7, Carrots, 2, 100, 2023-11-17 10:00:00, 2023-11-17 10:00:00
8, Milk, 3, 30, 2023-11-17 10:00:00, 2023-11-17 10:00:00
9, Butter, 3, 120, 2023-11-17 10:00:00, 2023-11-17 10:00:00
10, Cheese, 3, 150, 2023-11-17 10:00:00, 2023-11-17 10:00:00
product_category.csv [ seeds / poc_bronze / product_category.csv ]
category_id, name
1, Fruits
2, Vegetables
3, Dairy
This is how the files and the seeds configured for this prject.
Now run the following command to create respective tables and load data into them.
dbt seed
Note: No need to create tables and schema manually before seeding, DBT will create the tables with schema inferred from CSV files.
But, DBT has options to define our custom schema fully or partially for any columns.
Azure Databricks catalog
How DBT model works
Data modeling with dbt is a way of transforming data for business intelligence or downstream applications in a modular approach.
- We can write models using SQL and Python.
- A SQL model is nothing but a SELECT statement.
- SQL models are defined in .sql files under models dir
- dbt models are materialized as VIEWs by default
Note: Python support is added from dbt Core v1.3
In this project I will be referring SQL models only. Below picture shows how the models are defined and configured for this project.
- some *.sql files under models dir
- models related config in dbt_project.yml
- to configure different database and other properties
- models/sources.yml
- to refer existing tables
models config in dbt_project.yml
models:
databricks_poc:
silver:
+schema: 'poc_silver'
materialized: view
- + sign is like global config, which applies to the entire hiearchy from current level.
- +schema tells DBT to create the models under poc_silver database for all models under models/silver dir and subdirs
- schema without + sign means only for the mentioned dir (ie models/silver dir)
- We can also override the materialization type within model sql file by mentioning the type on top
-
{{ config(materialized = “table”) }}
-
- SQL Model supports multiple types of materialization https://docs.getdbt.com/docs/build/materializations
- View
- Table
- Incremental
- Ephemeral
- Materialized View
- Python model supports only two materializations for now
- table
- incremental
In this project, we will be covering SQL model VIEW and TABLE materializations. As per our config, if we run dbt run command we should see the expected view created under poc_silver database.
dbt run
Azure Databricks catalog
Creating a Delta table using model
Again run dbt run command to create the required delta table.
Source code for above POC is available in my repo https://github.com/pybala/dbt_databricks_poc.
I will be covering basic data quality testing and incremental models in the next set of articles.