dbt-profiler
implements dbt macros for profiling database relations and creating doc
blocks and table schemas (schema.yml
) containing said profiles. A calculated profile contains the following measures for each column in a relation:
column_name
: Name of the columndata_type
: Data type of the columnnot_null_proportion
^: Proportion of column values that are notNULL
(e.g.,0.62
means that 62% of the values are populated while 38% areNULL
)distinct_proportion
^: Proportion of unique column values (e.g.,1
means that 100% of the values are unique)distinct_count
^: Count of unique column valuesis_unique
^: True if all column values are uniquemin
*^: Minimum column valuemax
*^: Maximum column valueavg
**^: Average column valuemedian
**^: Median column valuestd_dev_population
**^: Population standard deviationstd_dev_sample
**^: Sample standard deviationprofiled_at
: Profile calculation date and time
*
numeric, date and time columns only
**
numeric columns only
^
can be excluded from the profile using exclude_measures
argument
dbt-profiler
aims to provide the following:
- get_profile macro for generating profiling SQL queries that can be used as dbt models or ad-hoc queries
- print_profile macro for ad-hoc model profiling to support data exploration
- Describe a mechanism to include model profiles in dbt docs
For the third point there are at least two options:
meta
properties, anddoc
blocks.
An example of the first is implemented in the print_profile_schema macro. The second can be achieved with the following pattern:
- Add a
"docs"
folder explicitly todbt_project.yml
viamodel-paths
model-paths: ["models", "docs"]
- Use print_profile_docs macro to generate the profile as a Markdown table wrapped in a Jinja
docs
macro - Copy the output to a
docs/dbt_profiler/<model>.md
file
# docs/dbt_profiler/customer.md
{% docs dbt_profiler__customer %}
| column_name | data_type | not_null_proportion | distinct_proportion | distinct_count | is_unique | min | max | avg | std_dev_population | std_dev_sample | profiled_at |
| ----------------------- | --------- | ------------------- | ------------------- | -------------- | --------- | ---------- | ---------- | ------------------- | ------------------- | ------------------- | ----------------------------- |
| customer_id | int64 | 1.00 | 1.00 | 100 | 1 | 1 | 100 | 50.5000000000000000 | 28.8660700477221200 | 29.0114919758820200 | 2022-01-13 10:14:48.300040+00 |
| first_order | date | 0.62 | 0.46 | 46 | 0 | 2018-01-01 | 2018-04-07 | | | | 2022-01-13 10:14:48.300040+00 |
| most_recent_order | date | 0.62 | 0.52 | 52 | 0 | 2018-01-09 | 2018-04-09 | | | | 2022-01-13 10:14:48.300040+00 |
| number_of_orders | int64 | 0.62 | 0.04 | 4 | 0 | 1 | 5 | 1.5967741935483863 | 0.7716692718648833 | 0.7779687173818426 | 2022-01-13 10:14:48.300040+00 |
| customer_lifetime_value | float64 | 0.62 | 0.35 | 35 | 0 | 1 | 99 | 26.9677419354838830 | 18.6599171435558730 | 18.8122455252636630 | 2022-01-13 10:14:48.300040+00 |
{% enddocs %}
- Include the profile in a model description using the
doc
macro
version: 2
models:
- name: customer
description: |
Represents a customer.
`dbt-profiler` results:
{{ doc("dbt_profiler__customer") }}
columns:
- name: customer_id
tests:
- not_null
- unique
One of the advantages of the doc
approach over the meta
approach is that it doesn't require changes to the schema.yml except for the doc
macro call. Once the macro call has been embedded in the schema the actual profiles can be maintained in a dedicated dbt_profiler/
directory as Markdown files. The profile files can then be automatically updated by a CI process that runs once a week or month as follows:
- List the models you want to profile (e.g., using
dbt list --output name -m ${node_selection}
) - For each model run
dbt run-operation print_profile_docs --args '{"relation_name": "'${relation_name}'", "schema": "'${schema}'"}'
and store the result indbt_profiler/${relation_name}.md
- Note that you need to store the
dbt run-operation print_profile_docs
output in e.g. a variable before piping it to the target file. Piping the output directly to a file (e.g.,dbt run-operation print_profile_docs > ${relation_name}.md
) will result in a situation where the target file is emptied beforedbt run-operation
compiles the dbt project which will throw an error if you're already referring to thedoc
block that the operation has not yet generated. See example update-relation-profile.sh script.
- Create a Pull Request for the updated profiles (e.g., using create-pull-request GitHub Action)
dbt-profiler
requires dbt version >=1.1.0
. Check dbt Hub for the latest installation instructions.
dbt-profiler
may work with unsupported adapters but they haven't been tested yet. If you've used dbt-profiler
with any of the unsupported adapters I'd love to hear your feedback (e.g., create an issue, PR or hit me with with a DM on dbt Slack) 😊
✅ AWS Athena
✅ BigQuery
✅ Databricks
✅ PostgreSQL
✅ Redshift
✅ Snowflake
✅ SQL Server
❌ Apache Spark
❌ Presto
get_profile (source)
This macro returns a relation profile as a SQL query that can be used in a dbt model. This is handy for previewing relation profiles in dbt Cloud.
relation
(required): Relation objectexclude_measures
(optional): List of measures to exclude from the profile (default:[]
)include_columns
(optional): List of columns to include in the profile (default:[]
i.e., all). Only one ofinclude_columns
andexclude_columns
can be specified at a time.exclude_columns
(optional): List of columns to exclude from the profile (default:[]
). Only one ofinclude_columns
andexclude_columns
can be specified at a time.where_clause
(optional): SQLWHERE
clause to allow exclustion of records from profiler.group_by
(optional): SQLgroup_by
to aggregate data from profiler (default:[]
)
Use this macro in a dbt model, using a ref():
{{ dbt_profiler.get_profile(relation=ref("customers"), where_clause="is_active = true") }}
Use this macro in a dbt model, using a source():
{{ dbt_profiler.get_profile(relation=source("jaffle_shop","customers"), exclude_measures=["std_dev_population", "std_dev_sample"]) }}
To configure the macro to be called only when dbt is in execute mode:
-- depends_on: {{ ref("customers") }}
{% if execute %}
{{ dbt_profiler.get_profile(relation=ref("customers")) }}
{% endif %}
get_profile_table (source)
This macro returns a relation profile as an agate.Table. The macro does not print anything to stdout
and therefore is not meant to be used as a standalone operation.
relation
(eitherrelation
orrelation_name
is required): Relation objectrelation_name
(eitherrelation
orrelation_name
is required): Relation nameschema
(optional): Schema whererelation_name
exists (default:none
i.e., target schema)database
(optional): Database whererelation_name
exists (default:none
i.e., target database)exclude_measures
(optional): List of measures to exclude from the profile (default:[]
)include_columns
(optional): List of columns to include in the profile (default:[]
i.e., all). Only one ofinclude_columns
andexclude_columns
can be specified at a time.exclude_columns
(optional): List of columns to exclude from the profile (default:[]
). Only one ofinclude_columns
andexclude_columns
can be specified at a time.where_clause
(optional): SQL where clause to allow exclustion of records from profiler. This is done after theWHERE
keyword.
Call this macro from another macro or dbt model:
{% set table = dbt_profiler.get_profile_table(relation_name="customers") %}
print_profile (source)
❗ This macro does not work in dbt Cloud. The profile doesn't display in the cloud console log because the underlying print_table() method is disabled.
This macro prints a relation profile as a Markdown table to stdout
.
relation
(eitherrelation
orrelation_name
is required): Relation objectrelation_name
(eitherrelation
orrelation_name
is required): Relation nameschema
(optional): Schema whererelation_name
exists (default:none
i.e., target schema)database
(optional): Database whererelation_name
exists (default:none
i.e., target database)exclude_measures
(optional): List of measures to exclude from the profile (default:[]
)include_columns
(optional): List of columns to include in the profile (default:[]
i.e., all). Only one ofinclude_columns
andexclude_columns
can be specified at a time.exclude_columns
(optional): List of columns to exclude from the profile (default:[]
). Only one ofinclude_columns
andexclude_columns
can be specified at a time.max_rows
(optional): The maximum number of rows to display before truncating the data (default:none
i.e., not truncated)max_columns
(optional): The maximum number of columns to display before truncating the data (default:7
)max_column_width
(optional): Truncate all columns to at most this width (default:30
)max_precision
(optional): Puts a limit on the maximum precision displayed for number types (default:none
i.e., not limited)where_clause
(optional): SQL where clause to allow exclustion of records from profiler. This is done after theWHERE
keyword.
Call the macro as an operation:
dbt run-operation print_profile --args '{"relation_name": "customers"}'
An alternative for dbt Cloud that prints the profile in the console log but not in a Markdown format:
{% set profile = dbt_profiler.get_profile(relation=ref("customers")) %}
{% for row in profile.rows %}
{% do log(row.values(), info=True) %}
{% endfor %}
column_name | data_type | not_null_proportion | distinct_proportion | distinct_count | is_unique | min | max | avg | std_dev_population | std_dev_sample | profiled_at |
---|---|---|---|---|---|---|---|---|---|---|---|
customer_id | int64 | 1.00 | 1.00 | 100 | 1 | 1 | 100 | 50.5000000000000000 | 28.8660700477221200 | 29.0114919758820200 | 2022-01-13 10:14:48.300040+00 |
first_order | date | 0.62 | 0.46 | 46 | 0 | 2018-01-01 | 2018-04-07 | 2022-01-13 10:14:48.300040+00 | |||
most_recent_order | date | 0.62 | 0.52 | 52 | 0 | 2018-01-09 | 2018-04-09 | 2022-01-13 10:14:48.300040+00 | |||
number_of_orders | int64 | 0.62 | 0.04 | 4 | 0 | 1 | 5 | 1.5967741935483863 | 0.7716692718648833 | 0.7779687173818426 | 2022-01-13 10:14:48.300040+00 |
customer_lifetime_value | float64 | 0.62 | 0.35 | 35 | 0 | 1 | 99 | 26.9677419354838830 | 18.6599171435558730 | 18.8122455252636630 | 2022-01-13 10:14:48.300040+00 |
print_profile_schema (source)
This macro prints a relation schema YAML to stdout
containing all columns and their profiles.
relation
(eitherrelation
orrelation_name
is required): Relation objectrelation_name
(eitherrelation
orrelation_name
is required): Relation nameschema
(optional): Schema whererelation_name
exists (default:none
i.e., target schema)database
(optional): Database whererelation_name
exists (default:none
i.e., target database)exclude_measures
(optional): List of measures to exclude from the profile (default:[]
)include_columns
(optional): List of columns to include in the profile (default:[]
i.e., all). Only one ofinclude_columns
andexclude_columns
can be specified at a time.exclude_columns
(optional): List of columns to exclude from the profile (default:[]
). Only one ofinclude_columns
andexclude_columns
can be specified at a time.model_description
(optional): Model description included in the schema (default:""
)column_description
(optional): Column descriptions included in the schema (default:""
)where_clause
(optional): SQL where clause to allow exclustion of records from profiler. This is done after theWHERE
keyword.
Call the macro as an operation:
dbt run-operation print_profile_schema --args '{"relation_name": "customers"}'
version: 2
models:
- name: customers
description: ''
columns:
- name: number_of_orders
description: ''
meta:
data_type: int64
row_count: 100.0
not_null_proportion: 0.62
distinct_proportion: 0.04
distinct_count: 4.0
is_unique: 0.0
min: '1'
max: '5'
avg: 1.5967741935483863
std_dev_population: 0.7716692718648833
std_dev_sample: 0.7779687173818426
profiled_at: '2022-01-13 10:08:18.446822+00'
- name: customer_lifetime_value
description: ''
meta:
data_type: float64
row_count: 100.0
not_null_proportion: 0.62
distinct_proportion: 0.35
distinct_count: 35.0
is_unique: 0.0
min: '1'
max: '99'
avg: 26.967741935483883
std_dev_population: 18.659917143555873
std_dev_sample: 18.812245525263663
profiled_at: '2022-01-13 10:08:18.446822+00'
- name: customer_id
description: ''
meta:
data_type: int64
row_count: 100.0
not_null_proportion: 1.0
distinct_proportion: 1.0
distinct_count: 100.0
is_unique: 1.0
min: '1'
max: '100'
avg: 50.5
std_dev_population: 28.86607004772212
std_dev_sample: 29.01149197588202
profiled_at: '2022-01-13 10:08:18.446822+00'
- name: first_order
description: ''
meta:
data_type: date
row_count: 100.0
not_null_proportion: 0.62
distinct_proportion: 0.46
distinct_count: 46.0
is_unique: 0.0
min: '2018-01-01'
max: '2018-04-07'
avg: null
std_dev_population: null
std_dev_sample: null
profiled_at: '2022-01-13 10:08:18.446822+00'
- name: most_recent_order
description: ''
meta:
data_type: date
row_count: 100.0
not_null_proportion: 0.62
distinct_proportion: 0.52
distinct_count: 52.0
is_unique: 0.0
min: '2018-01-09'
max: '2018-04-09'
avg: null
std_dev_population: null
std_dev_sample: null
profiled_at: '2022-01-13 10:08:18.446822+00'
This what the profile looks like on the dbt docs site:
print_profile_docs (source)
❗ This macro does not work in dbt Cloud. The profile doesn't display in the cloud console log because the underlying print_table() method is disabled.
This macro prints a relation profile as a Markdown table wrapped in a Jinja docs
macro to stdout
.
relation
(eitherrelation
orrelation_name
is required): Relation objectrelation_name
(eitherrelation
orrelation_name
is required): Relation nameschema
(optional): Schema whererelation_name
exists (default:none
i.e., target schema)database
(optional): Database whererelation_name
exists (default:none
i.e., target database)exclude_measures
(optional): List of measures to exclude from the profile (default:[]
)include_columns
(optional): List of columns to include in the profile (default:[]
i.e., all). Only one ofinclude_columns
andexclude_columns
can be specified at a time.exclude_columns
(optional): List of columns to exclude from the profile (default:[]
). Only one ofinclude_columns
andexclude_columns
can be specified at a time.docs_name
(optional):docs
macro name (default:dbt_profiler__{{ relation_name }}
)max_rows
(optional): The maximum number of rows to display before truncating the data (default:none
i.e., not truncated)max_columns
(optional): The maximum number of columns to display before truncating the data (default:7
)max_column_width
(optional): Truncate all columns to at most this width (default:30
)max_precision
(optional): Puts a limit on the maximum precision displayed for number types (default:none
i.e., not limited)where_clause
(optional): SQL where clause to allow exclustion of records from profiler. This is done after theWHERE
keyword.
Call the macro as an operation:
dbt run-operation print_profile_docs --args '{"relation_name": "customers"}'
{% docs dbt_profiler__customers %}
| column_name | data_type | not_null_proportion | distinct_proportion | distinct_count | is_unique | min | max | avg | std_dev_population | std_dev_sample | profiled_at |
| ----------------------- | --------- | ------------------- | ------------------- | -------------- | --------- | ---------- | ---------- | ------------------- | ------------------- | ------------------- | ----------------------------- |
| customer_id | int64 | 1.00 | 1.00 | 100 | 1 | 1 | 100 | 50.5000000000000000 | 28.8660700477221200 | 29.0114919758820200 | 2022-01-13 10:14:48.300040+00 |
| first_order | date | 0.62 | 0.46 | 46 | 0 | 2018-01-01 | 2018-04-07 | | | | 2022-01-13 10:14:48.300040+00 |
| most_recent_order | date | 0.62 | 0.52 | 52 | 0 | 2018-01-09 | 2018-04-09 | | | | 2022-01-13 10:14:48.300040+00 |
| number_of_orders | int64 | 0.62 | 0.04 | 4 | 0 | 1 | 5 | 1.5967741935483863 | 0.7716692718648833 | 0.7779687173818426 | 2022-01-13 10:14:48.300040+00 |
| customer_lifetime_value | float64 | 0.62 | 0.35 | 35 | 0 | 1 | 99 | 26.9677419354838830 | 18.6599171435558730 | 18.8122455252636630 | 2022-01-13 10:14:48.300040+00 |
{% enddocs %}
mdutoo: Added date type to tests, fix #37 Error when profiling integer after date after string columns
Profiling a table whose column are integer, date, string in this order raises the following error : ERROR: UNION types text and numeric cannot be matched LINE 60: avg("int_after_date_after_string") as avg, Appropriately casting the null default value solves it.