
[Feature] The insert_by_period materialization should graduate to part of the main project #4174

Closed · 1 task done
joellabes opened this issue Nov 1, 2021 · 7 comments
Labels: enhancement

@joellabes (Contributor)

Is there an existing feature request for this?

  • I have searched the existing issues

Describe the Feature

Since 0.10.1, dbt-utils has contained a Redshift-specific macro called insert_by_period to do incremental loads.
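For context, a model using it looks roughly like this (a sketch based on the dbt-utils README of the era; the materialization replaces the `__PERIOD_FILTER__` placeholder with a date-range predicate on each pass):

```sql
{{
  config(
    materialized = 'insert_by_period',
    period = 'month',
    timestamp_field = 'created_at',
    start_date = '2018-01-01',
    stop_date = '2018-06-01'
  )
}}

with events as (

    select *
    from {{ ref('events') }}
    -- Replaced by the materialization with a filter on timestamp_field
    -- for the period currently being inserted
    where __PERIOD_FILTER__

)

select * from events
```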

Recently, there's been some interest from the community in modernising it.

Although we could leave it in utils, I think it's time for it to make the jump to being a first-class materialization.

Describe alternatives you've considered

  • Leaving it in dbt-utils and making it compatible with the four adapters utils commits to supporting. Either way, other databases will need to implement their own shims if their implementation isn't compatible with the default and they want to offer it.
  • Writing up my hacky method from the EP days (use a standard incremental model, but process no more than a month at a time; it works using only inbuilt functionality, but takes multiple runs to complete the backfill 🤢). See the sketch below.
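For the record, that workaround is nothing more exotic than the following (a sketch; the relation, column, and date values are invented, and dateadd is Redshift/Snowflake syntax):

```sql
-- A standard incremental model that never advances more than one month past
-- its high-water mark, so a full backfill takes repeated `dbt run`s.
{{ config(materialized='incremental') }}

{% if is_incremental() %}
with high_water as (
    select max(created_at) as max_ts from {{ this }}
)
{% else %}
with high_water as (
    select cast('2018-01-01' as timestamp) as max_ts  -- arbitrary backfill start
)
{% endif %}

select e.*
from {{ ref('stg_events') }} as e
cross join high_water as h
where e.created_at > h.max_ts
  and e.created_at <= dateadd(month, 1, h.max_ts)
```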

I'd be happiest if this could just be extra parameters on the existing incremental materialisation. I don't know whether that's actually possible; if not, it would need to be the fifth inbuilt option, and I'd probably swing closer to leaving it in utils, because it's not as cut-and-dried, and so adding the extra friction maybe isn't so bad.

Who will this benefit?

  • SQLFluff users
  • Snowflake users, probably BQ/PGSQL/etc too (I always assumed that it was a bandaid for the fact that Redshift would fall over at higher volumes, but obviously it also has value for more scalable warehouses)
  • Cautious developers: when I was at EP, I shied away from this materialisation because it didn't feel robust enough (when I looked at it, it was already six minor versions "out of date", and it had a hacky feel to it)

Are you interested in contributing this feature?

Sure am, despite how scary materialization code is.

Anything else?

No response

@joellabes joellabes added the enhancement and triage labels Nov 1, 2021
@balmasi commented Nov 8, 2021

There's also an interesting insert_by_rank variation in the dbtvault project, which allows you to iterate over an arbitrary ranking function rather than only dates/times.

It's been pretty handy.
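For anyone unfamiliar, here's a generic sketch of the idea (this is not dbtvault's actual implementation; `__RANK_FILTER__` and the column names are hypothetical placeholders, analogous to `__PERIOD_FILTER__`):

```sql
-- Chunk the load by an arbitrary rank instead of a time period. The
-- materialization would substitute __RANK_FILTER__ with a predicate on
-- iteration_rank for each pass.
with ranked as (

    select
        *,
        dense_rank() over (order by load_batch_id) as iteration_rank
    from {{ ref('stg_events') }}

)

select * from ranked
where __RANK_FILTER__
```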

@jtcohen6 (Contributor)

Thanks for opening this @joellabes, and getting the conversation started. All in all, I find myself agreeing with your assessment here:

I'd be happiest if this could just be extra parameters on the existing incremental materialisation. I don't know whether that's actually possible; if not, it would need to be the fifth inbuilt option, and I'd probably swing closer to leaving it in utils, because it's not as cut-and-dried, and so adding the extra friction maybe isn't so bad.

I'm wary of our ability to modify the existing incremental materialization without adding a ton of new complexity. The ability to --full-refresh with one query is a pretty significant piece of the current implementation. We'd also need to introduce a new context variable for the period being inserted, or use a string-replacement hack (__PERIOD_FILTER__) similar to the one used currently.

I sense more significant overlap with the proposal for sharded tables in #1637. When a table gets so big that it's impossible to create it with a single query (however long-running or expensive), perhaps the right answer is to make it multiple tables. Then, rather than inserting period chunks into one table, we create many tables, and a view to union them together.

Sharding tables is a hard problem—there's a strict 1:1 mapping between database objects and project resources today, a rule we'd need to break—but it feels like the right hard problem to tackle. Support for sharded tables will also, I believe, offer us a way toward native support for "lambda views". After all, the only difference is that the "latest" shard needs to be materialized as a view instead of a table.

BigQuery has long had a native approach to this, with some wonky/legacy syntax, as ingestion-time-partitioned tables. We've removed support for those from dbt-bigquery ahead of v1 (it was effectively deprecated code). Whether we re-add support for the native approach there—or opt for one in which dbt more actively coordinates between shards and the union view, as it will have to on other adapters—that should be an implementation detail to figure out when we have a chance to tackle this project in earnest.

So: Support for table sharding is on the list of big improvements we should consider making after v1. I think it may obviate the need to add insert_by_period into the global project. This will definitely require changes within dbt-core, but a lot of the most important logic will live in macros and materialization code. If anyone has some time on their hands, and wants to play around with the possibilities of many-tables, one-model before then — I'd love to see what you can come up with!
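To make the shape of that concrete, here is a hypothetical sketch of many-tables, one-model (no actual dbt support implied; all object names are invented). Each shard is built as its own table, the newest shard stays a view for the lambda case, and a view unions them into the single relation downstream models would ref:

```sql
-- Shards built once each: analytics.events_2021_10, analytics.events_2021_11, ...
-- The newest shard (analytics.events_current) is a view over the raw source.

create or replace view analytics.events as
    select * from analytics.events_2021_10
    union all
    select * from analytics.events_2021_11
    union all
    select * from analytics.events_current  -- the "latest" shard stays a view
```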

@joellabes (Contributor, Author)

OK! Until the glory days of sharded tables are upon us, I'll start thinking a bit harder about how a modernised version of the current insert_by_period materialization should behave in utils.

@ilmari-aalto commented Feb 2, 2022

Ooh, this is a very exciting initiative! Sharded tables sound like the way to go, even if it breaks the 1:1 mapping between a model and the corresponding physical relations.

I just wanted to add some comments, since we've internally modified insert_by_period for our purposes. Sometimes there's a lot of source data, so the initial (backfill) run of a model is too heavy and might not finish. The incremental materialization would be fine as long as you never had to backfill, or fully reprocess after changes. We never want to backfill manually against the database; in our opinion, all logic should be encapsulated in the model, so all models should be populated from the command line only (that is, by running dbt build manually or on a schedule).

Models should run successfully (they shouldn't hang) and finish within the execution window we have. On the other hand, it's OK for us that the physical relation only eventually catches up with the latest data, which sounds very similar to what you did at your previous company, @joellabes.

Some of the changes we made to insert_by_period:

  • We decided it's OK to process data in chunks of one day. This simplified the materialization logic, because everything has to work at the date grain only, and backfilling is just a matter of iterating over the missing dates of data.
  • Instead of __PERIOD_FILTER__, we have __PROCESSING_DATE__ for the date to be processed. It can be referenced several times, which allows, for example, processing date ranges: `between dateadd(day, -27, '__PROCESSING_DATE__') and '__PROCESSING_DATE__'` (see the sketch after this list).
  • If I recall correctly, with insert_by_period your target model has to contain a timestamp column with the same name as the source timestamp column, because __PERIOD_FILTER__ is substituted as one atomic string. That's not the case when using a processing-date placeholder. (You still need some date/timestamp column in the target, of course, but the naming is your choice and you can aggregate timestamps to dates.)
  • Delete-insert can optionally be done against columns other than the date/timestamp column.
  • You can optionally reprocess some days on each run.
  • You can set a maximum number of days to process in one run (to limit the total execution time).
  • Currently data is processed from older to newer. Usually recent data is more important than old data, so we'd like to change this and backfill starting from now, working our way back in time instead.
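To make the placeholder concrete, a model body under this scheme might look like the following (a sketch; the model and column names are invented):

```sql
-- The materialization loops over the missing dates, substituting each one
-- for __PROCESSING_DATE__ before executing the query, then delete-inserts
-- the results for that day.
select
    session_id,
    user_id,
    cast('__PROCESSING_DATE__' as date) as processing_date
from {{ ref('stg_sessions') }}
where started_at between dateadd(day, -27, '__PROCESSING_DATE__')
                     and '__PROCESSING_DATE__'
```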

@github-actions (bot)

This issue has been marked as Stale because it has been open for 180 days with no activity. If you would like the issue to remain open, please remove the stale label or comment on the issue, or it will be closed in 7 days.

@github-actions github-actions bot added the stale label Oct 17, 2022
@github-actions (bot)

Although we are closing this issue as stale, it's not gone forever. Issues can be reopened if there is renewed community interest; add a comment to notify the maintainers.

@dataders dataders reopened this Aug 7, 2024
@github-actions github-actions bot removed the stale label Aug 8, 2024
@graciegoheen (Contributor)

Closing this in favor of our epic to add a new "microbatch" incremental strategy.
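For reference, the microbatch strategy that eventually shipped (dbt Core 1.9+) is configured roughly like this per the dbt docs (the event_time column name here is illustrative). dbt filters each batch on event_time itself, so no __PERIOD_FILTER__-style placeholder is needed in the model body:

```sql
{{ config(
    materialized='incremental',
    incremental_strategy='microbatch',
    event_time='created_at',
    batch_size='day',
    begin='2020-01-01',
    lookback=3
) }}

select * from {{ ref('stg_events') }}
```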

@graciegoheen graciegoheen closed this as not planned Sep 14, 2024