Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[CT-1704] [Bug] Incremental models updates failing due to nested data limitation in Redshift Spectrum #254

Closed
2 tasks done
davide-vito opened this issue Dec 21, 2022 · 2 comments
Labels
bug Something isn't working wontfix This will not be worked on

Comments

@davide-vito
Copy link

davide-vito commented Dec 21, 2022

Is this a new bug in dbt-redshift?

  • I believe this is a new bug in dbt-redshift
  • I have searched the existing issues, and I could not find an existing issue for this bug

Current Behavior

Hello, we are trying to use dbt cloud with Redshfit Spectrum.

When we first build our incremental models with dbt run everything works fine, however when we run it again without the --full-refresh flag, we get the error Predicates not yet supported in nested s3 scan. This same error happens if we run the compiled incremental query on redshift too, suggesting i's not supported by Redshift Spectrum (see Redshift docs - Nested-data-restrictions).

Is there another way to make this work?

Expected Behavior

We expected to update the table with the latest modifications by date as summarised in the dbt docs for incremental models

Steps To Reproduce

In dbt cloud, using as an example the following model:

{{ config(
        materialized='incremental'
         ) 
}}

with A as (
    SELECT
        u.entity.id as id,
        CAST(u.last_updated as timestamp) as last_updated,
        u.entity.colA,
        u.entity.colB,
        u.entity.colC,
    FROM {{ source('test', 'test_table') }} u
)

SELECT * FROM A

{% if is_incremental() %}

  where last_updated > (select max(last_updated) from {{ this }})
  
{% endif %}

Relevant log output

Predicates not yet supported in nested s3 scan.

Environment

dbt cloud

Additional Context

No response

@davide-vito davide-vito added bug Something isn't working triage labels Dec 21, 2022
@github-actions github-actions bot changed the title [Bug] Incremental models updates failing due to nested data limitation in Redshift Spectrum [CT-1704] [Bug] Incremental models updates failing due to nested data limitation in Redshift Spectrum Dec 21, 2022
@dbeatty10
Copy link
Contributor

Thanks for reaching out @davide-vito !

I agree with your assessment that this error message is specific to Redshift Spectrum and not something that dbt is doing.

Some ideas, some or all of which may be impractical in your specific use case:

  1. Collaborate with the producer of the object in S3 to add last_updated as a top-level attribute (rather than nested)
  2. Add a transformation step that writes the entire object to a regular table with unnested attributes first, then do a second step that performs the incremental logic

I haven't used Redshift Spectrum personally, so there might be some other approaches that I can't think of!

I'd suggest asking your question in dbt Discourse -- it's best place to get help & advice from the community of dbt practitioners:

If you go with idea 2), it sounds like you wouldn't be the first to resort to a similar strategy [1]:

Because of the query execution inefficiency for nested fields, ETL pipelines were set up at LinkedIn to copy the nested columns as a set of top-level columns corresponding to subfields

I'm going to close this because I don't think we can do anything about it within dbt-redshift. But if anyone discovers we can do something about it, we'll be glad to re-open it.

@dbeatty10 dbeatty10 closed this as not planned Won't fix, can't repro, duplicate, stale Dec 22, 2022
@dbeatty10 dbeatty10 added wontfix This will not be worked on and removed triage labels Dec 22, 2022
@dossy
Copy link

dossy commented Aug 12, 2024

Sorry for necroposting, but since I had this exact same issue (incremental model on Redshift Spectrum), I wanted to share the work-around that I came up with, as I know I won't remember the next time I need it and hopefully will find my own comment here to remind me, and maybe it'll help others who wind up here as well, as it was the first result in Google for Predicates not yet supported in nested s3 scan.

My solution, tl;dr:

{% set last_timestamp_query %}

    select max(event_timestamp) from {{ this }}

{% endset %}

{% set last_timestamp = dbt_utils.get_single_value(last_timestamp_query, default=0) if is_incremental() else 0 %}

select *
from {{ source('some_schema', 'some_table') }}
where 1 = 1

{% if is_incremental() %}

and event_timestamp > {{ last_timestamp }}

{% endif %}

The implementation will need to be customized to your specific object names, but this is the "skeleton" of the approach that solves the problem for me.

Hope this helps!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working wontfix This will not be worked on
Projects
None yet
Development

No branches or pull requests

3 participants