Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

refactoring-external-materialization #332

Draft
wants to merge 17 commits into
base: master
Choose a base branch
from
Draft
Changes from 1 commit
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
13 changes: 10 additions & 3 deletions dbt/include/duckdb/macros/materializations/external.sql
Original file line number Diff line number Diff line change
Expand Up @@ -7,11 +7,14 @@
{%- set read_location = adapter.external_read_location(location, rendered_options) -%}

-- set language - python or sql
-- i have to learn general about python models
{%- set language = model['language'] -%}

{%- set target_relation = this.incorporate(type='view') %}

-- Continue as normal materialization
-- Why do we have temp and intermediate relation?
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The idea is to not destroy an existing relation that has the same name as this one (if it exists in the catalog) in case something fails during the construction of the relation defined by this model.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i understand; we prepare everything and do all the transformation, and at the very end, we exchange the names

I understand how it fits into the context but am not sure what it means for me right now in the implementation, so make sure that i don't overlook that one in the implementation.

-- What does load_cached_relation do?
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Built-in dbt method; dbt will do the equivalent of caching the DB catalog metadata it loaded during the run so as to minimize the number of round-trips it does to the database (less relevant for DuckDB, more relevant for e.g. MotherDuck.)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I will have to take a look into MotherDuck here because I still don't understand how this fits here and what the places are where I have to be careful not to overlook the functionality. So please, if you see something that is against MotherDuck logic, make me aware of that.

{%- set existing_relation = load_cached_relation(this) -%}
{%- set temp_relation = make_intermediate_relation(this.incorporate(type='table'), suffix='__dbt_tmp') -%}
{%- set intermediate_relation = make_intermediate_relation(target_relation, suffix='__dbt_int') -%}
Expand All @@ -27,6 +30,8 @@
{%- set backup_relation = make_backup_relation(target_relation, backup_relation_type) -%}
-- as above, the backup_relation should not already exist
{%- set preexisting_backup_relation = load_cached_relation(backup_relation) -%}

--What is grants here?
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

GRANTS in the context of e.g. Snowflake or another cloud DWH. DuckDB doesn't have an equivalent concept yet, though it wouldn't surprise me to see it in the context of MotherDuck or Iceberg/Delta tables.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I will skip it for now to simplify, but the access grants here are an exciting topic. We want to export from the current context/duckdb into external parquet, delta, iceberg, and register a view over this location so that the table can be referenced in the next steps.

-- grab current tables grants config for comparision later on
{% set grant_config = config.get('grants') %}

Expand Down Expand Up @@ -62,10 +67,11 @@
{{ adapter.rename_relation(intermediate_relation, target_relation) }}

{{ run_hooks(post_hooks, inside_transaction=True) }}

-- What is should_revoke?
milicevica23 marked this conversation as resolved.
Show resolved Hide resolved
{% set should_revoke = should_revoke(existing_relation, full_refresh_mode=True) %}
{% do apply_grants(target_relation, grant_config, should_revoke=should_revoke) %}

--What does that do?
milicevica23 marked this conversation as resolved.
Show resolved Hide resolved
{% do persist_docs(target_relation, model) %}

-- `COMMIT` happens here
Expand All @@ -76,9 +82,10 @@
{{ drop_relation_if_exists(temp_relation) }}

-- register table into glue
--I dont know glue so can you explain me a bit about that?
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There's a lot of history here-- the ability to register an external table with the AWS Glue catalog (and thus make it accessible to e.g. AWS Athena) actually predates the concept of dbt-duckdb plugins (in the BasePlugin sense.) The original idea of plugins for external relations was to generalize this capability to other cataloging systems.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice, I see; I will need you to support me from the requirements point of view because I am unsure if I can imagine all possible use cases implemented until now. I don't have so much experience in the AWS world so more azure, google cloud but happy to learn about glue and athena
I will skip this one till we don't agree on some structure of the code; then, we can speak about details and each use case.

{%- set plugin_name = config.get('plugin') -%}
{%- set glue_register = config.get('glue_register', default=false) -%}
{%- set partition_columns = config.get('partition_columns', []) -%}
{%- set partition_columns = config.get('partition_columns', []) -%} -- this one is never used?
milicevica23 marked this conversation as resolved.
Show resolved Hide resolved
{% if plugin_name is not none or glue_register is true %}
{% if glue_register %}
{# legacy hack to set the glue database name, deprecate this #}
Expand All @@ -89,6 +96,6 @@

{{ run_hooks(post_hooks, inside_transaction=False) }}

{{ return({'relations': [target_relation]}) }}
{{ return({'relations': [target_relation]}) }} -- to what i return?
milicevica23 marked this conversation as resolved.
Show resolved Hide resolved

{% endmaterialization %}
Loading