
[Feature] The insert_by_period materialization should graduate to part of the main project #4174

Closed · 1 task done
joellabes opened this issue Nov 1, 2021 · 7 comments
Labels: enhancement

@joellabes (Contributor)

Is there an existing feature request for this?

  • I have searched the existing issues

Describe the Feature

Since 0.10.1, dbt-utils has contained a Redshift-specific macro called insert_by_period to do incremental loads.
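For context, a model using it looks roughly like this (a sketch based on the dbt-utils README of the era; the materialization replaces the `__PERIOD_FILTER__` placeholder with a date-range predicate on each pass):

```sql
{{
  config(
    materialized = 'insert_by_period',
    period = 'month',
    timestamp_field = 'created_at',
    start_date = '2018-01-01',
    stop_date = '2018-06-01'
  )
}}

with events as (

    select *
    from {{ ref('events') }}
    -- Replaced by the materialization with a filter on timestamp_field
    -- for the period currently being inserted
    where __PERIOD_FILTER__

)

select * from events
```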

Recently, there's been some interest from the community in modernising it.

Although we could leave it in utils, I think it's time for it to make the jump to being a first-class materialization.

Describe alternatives you've considered

  • Leaving it in dbt-utils and making it compatible with the four adapters utils commits to supporting. Either way, other databases will need to implement their own shims if their implementation isn't compatible with the default and they want to offer it.
  • Writing up my hacky method from the EP days (use a standard incremental model, but process no more than a month at a time; it works using only inbuilt functionality, but takes multiple runs to complete the backfill 🤢). See the sketch below.
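For the record, that workaround is nothing more exotic than the following (a sketch; the relation, column, and date values are invented, and dateadd is Redshift/Snowflake syntax):

```sql
-- A standard incremental model that never advances more than one month past
-- its high-water mark, so a full backfill takes repeated `dbt run`s.
{{ config(materialized='incremental') }}

{% if is_incremental() %}
with high_water as (
    select max(created_at) as max_ts from {{ this }}
)
{% else %}
with high_water as (
    select cast('2018-01-01' as timestamp) as max_ts  -- arbitrary backfill start
)
{% endif %}

select e.*
from {{ ref('stg_events') }} as e
cross join high_water as h
where e.created_at > h.max_ts
  and e.created_at <= dateadd(month, 1, h.max_ts)
```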

I'd be happiest if this could just be extra parameters on the existing incremental materialisation. I don't know whether that's actually possible; if not, it would need to be the fifth inbuilt option, and I'd probably swing closer to leaving it in utils, because it's not as cut-and-dried, and so adding the extra friction maybe isn't so bad.

Who will this benefit?

  • SQLFluff users
  • Snowflake users, probably BQ/PGSQL/etc too (I always assumed that it was a bandaid for the fact that Redshift would fall over at higher volumes, but obviously it also has value for more scalable warehouses)
  • Cautious developers: when I was at EP, I shied away from this materialisation because it didn't feel robust enough (when I looked at it, it was already six minor versions "out of date", and it had a hacky feel to it)

Are you interested in contributing this feature?

Sure am, despite how scary materialization code is.

Anything else?

No response

@joellabes joellabes added the enhancement and triage labels Nov 1, 2021
@balmasi commented Nov 8, 2021

There's also an interesting insert_by_rank variation in the dbtvault project, which allows you to iterate over an arbitrary ranking function rather than only dates/times.

It's been pretty handy.
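For anyone unfamiliar, here's a generic sketch of the idea (this is not dbtvault's actual implementation; `__RANK_FILTER__` and the column names are hypothetical placeholders, analogous to `__PERIOD_FILTER__`):

```sql
-- Chunk the load by an arbitrary rank instead of a time period. The
-- materialization would substitute __RANK_FILTER__ with a predicate on
-- iteration_rank for each pass.
with ranked as (

    select
        *,
        dense_rank() over (order by load_batch_id) as iteration_rank
    from {{ ref('stg_events') }}

)

select * from ranked
where __RANK_FILTER__
```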

@jtcohen6 (Contributor)

Thanks for opening this @joellabes, and getting the conversation started. All in all, I find myself agreeing with your assessment here:

I'd be happiest if this could just be extra parameters on the existing incremental materialisation. I don't know whether that's actually possible; if not, it would need to be the fifth inbuilt option, and I'd probably swing closer to leaving it in utils, because it's not as cut-and-dried, and so adding the extra friction maybe isn't so bad.

I'm wary of our ability to modify the existing incremental materialization without adding a ton of new complexity. The ability to --full-refresh with one query is a pretty significant piece of the current implementation. We'd also need to introduce a new context variable for the period being inserted, or use a string-replacement hack (__PERIOD_FILTER__) similar to the one used currently.

I sense more significant overlap with the proposal for sharded tables in #1637. When a table gets so big that it's impossible to create it with a single query (however long-running or expensive), perhaps the right answer is to make it multiple tables. Then, rather than inserting period chunks into one table, we create many tables, and a view to union them together.

Sharding tables is a hard problem—there's a strict 1:1 mapping between database objects and project resources today, a rule we'd need to break—but it feels like the right hard problem to tackle. Support for sharded tables will also, I believe, offer us a way toward native support for "lambda views". After all, the only difference is that the "latest" shard needs to be materialized as a view instead of a table.

BigQuery has long had a native approach to this, with some wonky/legacy syntax, as ingestion-time-partitioned tables. We've removed support for those from dbt-bigquery ahead of v1 (it was effectively deprecated code). Whether we re-add support for the native approach there—or opt for one in which dbt more actively coordinates between shards and the union view, as it will have to on other adapters—that should be an implementation detail to figure out when we have a chance to tackle this project in earnest.

So: Support for table sharding is on the list of big improvements we should consider making after v1. I think it may obviate the need to add insert_by_period into the global project. This will definitely require changes within dbt-core, but a lot of the most important logic will live in macros and materialization code. If anyone has some time on their hands, and wants to play around with the possibilities of many-tables, one-model before then — I'd love to see what you can come up with!
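To make the shape of that concrete, here is a hypothetical sketch of many-tables, one-model (no actual dbt support implied; all object names are invented). Each shard is built as its own table, the newest shard stays a view for the lambda case, and a view unions them into the single relation downstream models would ref:

```sql
-- Shards built once each: analytics.events_2021_10, analytics.events_2021_11, ...
-- The newest shard (analytics.events_current) is a view over the raw source.

create or replace view analytics.events as
    select * from analytics.events_2021_10
    union all
    select * from analytics.events_2021_11
    union all
    select * from analytics.events_current  -- the "latest" shard stays a view
```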

@joellabes (Contributor, Author)

OK! Until the glory days of sharded tables are upon us, I'll start thinking a bit harder about how a modernised version of the current insert_by_period materialization should behave in utils.

@ilmari-aalto commented Feb 2, 2022

Ooh, this is a very exciting initiative! Sharded tables sound like the way to go, even if it breaks the 1:1 mapping between a model and the corresponding physical relations.

I just wanted to add some comments, since we've internally modified insert_by_period for our purposes. Sometimes there's a lot of source data, so the initial (backfill) run of a model is too heavy and might not finish. The incremental materialization would be fine as long as you never had to backfill, or fully reprocess after changes. We never want to backfill manually against the database; in our opinion, all logic should be encapsulated in the model, so all models should be populated from the command line only (that is, by running dbt build manually or on a schedule).

Models should run successfully (they shouldn't hang) and finish within the execution window we have. On the other hand, it's OK for us that the physical relation only eventually catches up with the latest data, which sounds very similar to what you did at your previous company, @joellabes.

Some of the changes we made to insert_by_period:

  • We decided it's OK to process data in chunks of one day. This simplified the materialization logic, because everything has to work at the date grain only, and backfilling is just a matter of iterating over the missing dates of data.
  • Instead of __PERIOD_FILTER__, we have __PROCESSING_DATE__ for the date to be processed. It can be referenced several times, which allows, for example, processing date ranges: `between dateadd(day, -27, '__PROCESSING_DATE__') and '__PROCESSING_DATE__'` (see the sketch after this list).
  • If I recall correctly, with insert_by_period your target model has to contain a timestamp column with the same name as the source timestamp column, because __PERIOD_FILTER__ is substituted as one atomic string. That's not the case when using a processing-date placeholder. (You still need some date/timestamp column in the target, of course, but the naming is your choice and you can aggregate timestamps to dates.)
  • Delete-insert can optionally be done against columns other than the date/timestamp column.
  • You can optionally reprocess some days on each run.
  • You can set a maximum number of days to process in one run (to limit the total execution time).
  • Currently data is processed from older to newer. Usually recent data is more important than old data, so we'd like to change this and backfill starting from now, working our way back in time instead.
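To make the placeholder concrete, a model body under this scheme might look like the following (a sketch; the model and column names are invented):

```sql
-- The materialization loops over the missing dates, substituting each one
-- for __PROCESSING_DATE__ before executing the query, then delete-inserts
-- the results for that day.
select
    session_id,
    user_id,
    cast('__PROCESSING_DATE__' as date) as processing_date
from {{ ref('stg_sessions') }}
where started_at between dateadd(day, -27, '__PROCESSING_DATE__')
                     and '__PROCESSING_DATE__'
```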

@github-actions (bot)

This issue has been marked as Stale because it has been open for 180 days with no activity. If you would like the issue to remain open, please remove the stale label or comment on the issue, or it will be closed in 7 days.

@github-actions github-actions bot added the stale label Oct 17, 2022
@github-actions (bot)

Although we are closing this issue as stale, it's not gone forever. Issues can be reopened if there is renewed community interest; add a comment to notify the maintainers.

@dataders dataders reopened this Aug 7, 2024
@github-actions github-actions bot removed the stale label Aug 8, 2024
@graciegoheen (Contributor)

Closing this in favor of our epic to add a new "microbatch" incremental strategy.
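For reference, the microbatch strategy that eventually shipped (dbt Core 1.9+) is configured roughly like this per the dbt docs (the event_time column name here is illustrative). dbt filters each batch on event_time itself, so no __PERIOD_FILTER__-style placeholder is needed in the model body:

```sql
{{ config(
    materialized='incremental',
    incremental_strategy='microbatch',
    event_time='created_at',
    batch_size='day',
    begin='2020-01-01',
    lookback=3
) }}

select * from {{ ref('stg_events') }}
```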

@graciegoheen graciegoheen closed this as not planned Sep 14, 2024