Data modeling with dbt: Part II, design best practices

After defining your sources and creating staging models, the next step is to extend your data models. This means focusing on the core models, which produce the datasets that data analysts can query directly, and it is an essential step in building out a dbt project.

These models contain more advanced SQL transformations such as joins, aggregations, and window functions. Unlike in the staging layer, you have much more flexibility in how you produce the desired results. However, to leverage the full potential of dbt, it's essential to follow some best practices.
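For example, a core model might join two staging models and layer an aggregation and a window function on top in a single query. The sketch below assumes hypothetical staging models named stg_orders and stg_customers with the column names shown:

    -- models/core/customer_orders.sql (hypothetical names throughout)
    with orders as (
        select * from {{ ref('stg_orders') }}
    ),

    customers as (
        select * from {{ ref('stg_customers') }}
    )

    select
        customers.customer_id,
        customers.customer_name,
        count(orders.order_id) as order_count,
        sum(orders.order_total) as lifetime_value,
        -- window function applied on top of the aggregates
        rank() over (order by sum(orders.order_total) desc) as value_rank
    from customers
    left join orders
        on customers.customer_id = orders.customer_id
    group by customers.customer_id, customers.customer_name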

Ensure models read from the staging models

As mentioned earlier, ensure that all your models, except the staging models, read data from the staging models rather than referencing sources directly or using non-dbt syntax. This helps you maintain a single, accurate point of contact with your raw data and lets dbt keep track of the relationships between your models.

If you are not using the {{ source() }} or {{ ref() }} functions, you are doing something wrong.
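To make the contrast concrete, here is a minimal sketch, assuming a hypothetical staging model stg_orders and a hypothetical raw schema raw_db.analytics:

    -- Anti-pattern: a hard-coded table name; dbt cannot infer lineage from this
    select * from raw_db.analytics.orders

    -- Correct: {{ ref() }} resolves the table and records the dependency
    select * from {{ ref('stg_orders') }}

The {{ source() }} function plays the same role for raw tables, and should only appear inside staging models.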

Write models in a modular way

dbt is an excellent tool because it allows you to write modular code. Instead of writing a model once and being done with it, you can reuse previously written code in other data models. Any code that is repeated across multiple data models should become its own model. That way, the code runs once and is referenced in numerous locations, which saves computing resources in terms of both time and money.

Intermediate models contain code that would otherwise be repeated across models, and they are used to produce modular core data models. Analysts and business users do not query them directly.
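As a sketch, suppose two core models both need orders joined to customers. The join can be extracted into a single intermediate model (hypothetical names below) and referenced from each of them:

    -- models/intermediate/int_orders_enriched.sql
    -- The join is written and computed once here
    select
        orders.order_id,
        orders.order_total,
        customers.customer_id,
        customers.region
    from {{ ref('stg_orders') }} as orders
    left join {{ ref('stg_customers') }} as customers
        on orders.customer_id = customers.customer_id

    -- models/core/finance/revenue_by_region.sql
    -- Downstream models reuse the intermediate model instead of repeating the join
    select
        region,
        sum(order_total) as revenue
    from {{ ref('int_orders_enriched') }}
    group by region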

Store models for data analysts in a core or mart directory 

In dbt, you can organise models by source or by business purpose. The models that sit downstream of the staging (and any intermediate) models are called core or mart models. These models should be grouped together in a folder under the 'models' directory, and then further subcategorised by business function. This organisation method will help you quickly identify and manage the models according to their business purpose.

Core models are the final models used directly by data analysts and business users. These models contain the datasets that are referenced within reports and dashboards. On the other hand, staging models and intermediate models are primarily used by data engineers.
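One possible layout, using the hypothetical models from the earlier sketches and business-function subfolders under core:

    models/
    ├── staging/
    │   ├── stg_orders.sql
    │   └── stg_customers.sql
    ├── intermediate/
    │   └── int_orders_enriched.sql
    └── core/
        ├── finance/
        │   └── revenue_by_region.sql
        └── marketing/
            └── customer_orders.sql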

Data modeling approaches 

This section outlines various approaches for structuring a data project, including my personal approach to data modeling. There is no one-size-fits-all solution, as the structure will depend on the company's specific needs.

  1. At the source level, you have raw data, atomic events, and data at various levels of tidiness, which can be difficult for humans to understand.

  2. At the staging level, you build normalised analytics tables from that disorganised data, without duplication. You can implement the Kimball Star Schema at this level. Engineers can keep things organised and sanitised, with consistent naming and column ordering.

  3. At the mart level, you should create denormalised tables from the normalised tables to make them easier for humans to understand without writing joins (see the sketch after this list). Keep in mind that denormalisation can result in some data duplication. However, around ten denormalised tables can answer most people's questions with a high degree of accuracy.
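As a minimal sketch of a mart-level table, the model below denormalises orders, customers, and products into one wide table, assuming hypothetical staging models stg_orders, stg_customers, and stg_products:

    -- models/core/orders_wide.sql (hypothetical names)
    -- A single wide table that answers order questions without further joins
    select
        orders.order_id,
        orders.ordered_at,
        orders.order_total,
        customers.customer_name,
        customers.region,
        products.product_name,
        products.category
    from {{ ref('stg_orders') }} as orders
    left join {{ ref('stg_customers') }} as customers
        on orders.customer_id = customers.customer_id
    left join {{ ref('stg_products') }} as products
        on orders.product_id = products.product_id

Customer and product attributes are duplicated across order rows here, which is exactly the trade-off the list above describes.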
