Data modeling with dbt: Part I, organise data
Data modeling is essential to data analysis: it transforms raw data into information a business can actually use. With dbt, data analysts can write data models as modular, reusable code. This series of articles covers dbt's data model layers, best practices, and proper testing.
Organise source data
Data models start with source data ingested into the data warehouse. dbt creates an extra layer between sources and models, so raw data remains untouched. Define a source correctly before using it in models.
Defining source data
In dbt, sources are defined in YAML files. It's best to create one directory per source inside a ‘staging’ directory within the ‘models’ directory, so the documentation sits next to the corresponding staging models. In the YAML file, specify the source's database and schema names and give the source a name to use downstream. Here's an example of a source in a source.yml file:
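Below is a minimal sketch of such a file, using the ‘ios’ source and ‘page_view’ table referenced later in this article. The database and schema names are placeholders; swap in whatever your warehouse actually uses.

```yaml
# models/staging/ios/source.yml
version: 2

sources:
  - name: ios            # the name you'll reference downstream
    database: raw        # placeholder: the database your raw data lands in
    schema: ios_app      # placeholder: the schema your loader writes to
    tables:
      - name: page_view
```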
Before you can use a raw data source in a dbt model, it must be defined in a YAML file like the one above; otherwise, dbt won't know where to look when you reference the source downstream.
Referencing a source
In dbt, staging models should always reference a previously defined source. Remember to only reference sources in staging models. To select from a source, use this syntax:
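In a staging model's SQL file, the reference uses dbt's source() macro. The file name below is illustrative:

```sql
-- models/staging/ios/stg_ios__page_views.sql (file name is illustrative)
select *
from {{ source('ios', 'page_view') }}
```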
A source reference needs two things: a source name and a table name. In the YAML file, we defined ‘ios’ as the source name and ‘page_view’ as the table name.
Documenting a source
Document all sources used in your dbt models in YAML files, adding descriptions to source, table, and column names. Comprehensive documentation helps team members understand the code and contribute to the project while minimising questions and simplifying knowledge transfer. Here's an example:
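Here's a sketch of what that documentation can look like, extending the earlier source.yml; the descriptions and column names are illustrative, not taken from a real project:

```yaml
# models/staging/ios/source.yml
# Descriptions below are examples only.
version: 2

sources:
  - name: ios
    description: Raw event data exported from the iOS app.
    database: raw
    schema: ios_app
    tables:
      - name: page_view
        description: One row per page viewed in the app.
        columns:
          - name: page_view_id
            description: Unique identifier for the page view event.
          - name: viewed_at
            description: UTC timestamp of when the page was viewed.
```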
When documenting data, include quirks in both table and column descriptions. Write documentation anticipating questions users may have.
Creating staging models
dbt uses staging models to standardise raw data sources and protect them from direct use; downstream data models reference the staging models rather than the raw data. Staging models apply light transformations such as datatype casting and column renaming to ensure company standards are met.
Staging models keep data simple and close to the source without using aggregates or window functions.
A typical staging model looks like this:
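The sketch below follows the common staging-model pattern, selecting from the ‘ios’ source defined earlier. The raw column names and casts are placeholders for whatever standards your team uses.

```sql
-- models/staging/ios/stg_ios__page_views.sql (illustrative)
with source as (

    -- read only from the defined source, never the raw table directly
    select * from {{ source('ios', 'page_view') }}

),

renamed as (

    select
        id as page_view_id,                        -- rename to company standard
        cast(user_id as varchar) as user_id,       -- enforce expected datatypes
        cast(viewed_at as timestamp) as viewed_at
    from source

)

select * from renamed
```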
The code reads only from the defined source and uses basic SQL functions to maintain data integrity.