Data modeling with dbt: Part IV, testing

dbt allows you to test your sources and models within your code quickly. It features built-in generic tests, which check for null and unique values, and customisable tests. This ensures accurate and reliable data.

Source testing

When using dbt, always test sources first. This helps ensure data meets expectations before being used in downstream models, eliminating debugging steps. Testing only the models can make you assume that the model's logic is causing it to fail when the actual issue is with the input data. Debugging at the model level can be frustrating in such cases. When working with source data, use dbt's generic tests, including unique, not_null, accepted_values, and relationships.

I added initial generic tests to all source columns to detect empty columns and duplicates. We can add these tests to the YAML file previously mentioned; this process will ensure data accuracy:

It is important to ensure that you are running the correct queries on your data to ensure that your expectations about the data are accurate. Adding tests just for the sake of adding them is not recommended. Your tests need to be accurate to avoid receiving inaccurate alerts that can lead to alert fatigue.

Model testing

The next stage is to test your models. Adding these tests helps you catch failures in individual models due to incorrect logic or changes in upstream data models, avoiding the need to debug an entire DAG.

I use generic tests like 'accepted_values' and 'relationship' while adding 'not_null' and 'unique' tests to models. This helps to prevent unexpected issues caused by flawed logic.

The 'accepted_values' test in dbt lets you specify a column's expected values. The test fails if the column values are absent in the list. Like the following:

Any value other than checkout, signup, and settings in the screen column will cause the test to fail, and dbt will throw an error.

The relationships test aims to help you establish connections between various models or between a model and a source. This is useful to ensure your joins function does not adequately produce duplicates. The test verifies that all the values in the column you are testing are also present in the column of the model or source you are linking it to.

Under a column, you add this and specify the related model or source. See example below:

In this model, dbt will verify if all the values in the 'screen' column are also in the 'screen_name' column of the screen model. The test will fail if any value is found in this column that is not present in the orders table.

Previous
Previous

Logical fallacies: Ten common pitfalls

Next
Next

Data modeling with dbt: Part III, SQL best practices