[CT-1340] [Feature] Support Python models (dbt-py) on Redshift/AWS #204

Closed
3 tasks done
ChenyuLInx opened this issue Oct 13, 2022 · 22 comments
Labels
enhancement (New feature or request), help_wanted (Extra attention is needed), python_models, Stale

Comments

@ChenyuLInx
Contributor

Is this your first time submitting a feature request?

  • I have read the expectations for open source contributors
  • I have searched the existing issues, and I could not find an existing issue for this feature
  • I am requesting a straightforward extension of existing dbt-redshift functionality, rather than a Big Idea better suited to a discussion

Describe the feature

Background:

There's a Spark Redshift connector. This would allow users to run Python transformation code on an EMR cluster that loads data from Redshift and writes the transformed data back to Redshift. The whole process is very similar to using Dataproc to run Python models on GCP/BigQuery.
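
For readers unfamiliar with the feature: a dbt Python model is a `.py` file that defines `model(dbt, session)` and returns a DataFrame. Below is a minimal sketch of what such a model might look like if it ran on a Spark cluster. `stg_orders` is a made-up upstream model, and the DataFrame returned by `dbt.ref` is assumed to be a PySpark one.

```python
def model(dbt, session):
    # Materialize the result as a table in the warehouse.
    dbt.config(materialized="table")

    # "stg_orders" is a hypothetical upstream model; on a Spark-backed
    # runtime, dbt.ref() is assumed to return a PySpark DataFrame.
    orders = dbt.ref("stg_orders")

    # Any DataFrame transformation can go here; the returned
    # DataFrame is what gets written back to the warehouse.
    return orders.dropDuplicates()
```

The point of the request is that this function body would execute remotely on the cluster, not on the machine running dbt.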

Items needed for implementation:

  • If additional profile information is needed for the EMR cluster, we can add it as optional attributes on Credentials (existing example for BigQuery).
  • We need one macro to generate the final code to run on the EMR cluster. Previous example for dbt-bigquery here.
  • Now that we have the profile info and the macro to generate the final code, we need submission classes to submit Python code to the cluster. Existing submission code for dbt-bigquery, functions to define in impl.py (link1, link2), and how those classes are used by dbt-core (this doesn't need to be changed).
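
A rough sketch of how the first and third items might fit together, not the actual dbt-redshift design. Every name here (`EmrSettings`, `EmrJobHelper`, the field names, the S3 path layout) is hypothetical. The only external shape borrowed is the step dictionary that boto3's EMR `add_job_flow_steps` call accepts. The helper only builds the request and never calls AWS.

```python
from dataclasses import dataclass
from typing import Any, Dict, List, Optional


@dataclass
class EmrSettings:
    """Hypothetical optional profile attributes for running Python models on EMR."""

    emr_cluster_id: Optional[str] = None
    emr_region: Optional[str] = None
    s3_staging_dir: Optional[str] = None  # e.g. "s3://my-bucket/dbt-py"

    def missing_fields(self) -> List[str]:
        # Fields stay optional so SQL-only projects are unaffected;
        # only Python models require them to be set.
        return [name for name, value in vars(self).items() if value is None]


class EmrJobHelper:
    """Hypothetical submission class for compiled Python model code."""

    def __init__(self, parsed_model: Dict[str, Any], settings: EmrSettings) -> None:
        missing = settings.missing_fields()
        if missing:
            raise ValueError(f"Python models need these profile fields: {missing}")
        self.model_name = parsed_model["alias"]
        self.settings = settings

    def script_uri(self) -> str:
        # Assumed layout: one uploaded script per model under the staging dir.
        base = self.settings.s3_staging_dir.rstrip("/")
        return f"{base}/{self.model_name}.py"

    def build_step(self) -> Dict[str, Any]:
        # One entry of the Steps list passed to EMR's add_job_flow_steps.
        return {
            "Name": f"dbt-py {self.model_name}",
            "ActionOnFailure": "CONTINUE",
            "HadoopJarStep": {
                "Jar": "command-runner.jar",
                "Args": ["spark-submit", "--deploy-mode", "cluster", self.script_uri()],
            },
        }
```

A real implementation would also upload the compiled code to S3, submit the step, and poll until it finishes. Those parts are left out because they depend on choices this issue doesn't settle (EMR vs. Glue, authentication, and so on).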

Describe alternatives you've considered

No response

Who will this benefit?

No response

Are you interested in contributing this feature?

No response

Anything else?

No response

@ChenyuLInx ChenyuLInx added enhancement New feature or request triage labels Oct 13, 2022
@github-actions github-actions bot changed the title [Feature] Python model [CT-1340] [Feature] Python model Oct 13, 2022
@ChenyuLInx ChenyuLInx added help_wanted Extra attention is needed and removed triage labels Oct 13, 2022
@lostmygithubaccount lostmygithubaccount changed the title [CT-1340] [Feature] Python model [CT-1340] [Feature] Support Python models (dbt-py) on Redshift/AWS Nov 1, 2022
@lostmygithubaccount

some relevant community discussion:

per @colin-rogers-dbt, it may be easier to run on Glue than EMR. I personally have no preference -- whichever is easier for users to set up and faster to run on

@colin-rogers-dbt
Contributor

I wonder if the right thing to do here is just pick one to implement, with a long-term goal of supporting both. Can we leverage the existing Spark/Glue adapters?

@lostmygithubaccount

new redshift integration with Apache Spark announced: https://aws.amazon.com/blogs/aws/new-amazon-redshift-integration-with-apache-spark/

@saraleon1

+1 dbt Cloud Enterprise Customer - This team is using Redshift and is really interested in leveraging Python models in their dbt project. Their AWS contact passed this resource along as a possible way to run Python + Redshift.

@colin-rogers-dbt
Contributor

@lostmygithubaccount thanks for sharing, we should definitely look into leveraging this!

This is particularly interesting as it means we don't need to recreate the issues with the dbt-bigquery adapter, where we shoehorned Dataproc in. Going forward we should adhere to the rough principle that an adapter should only know how to leverage the capabilities of a single data transformation tool. If we want to support EMR/Glue, we should focus on multi-adapter/project so folks can use the existing adapters for those tools.

@lostmygithubaccount

@saraleon1 great to hear! on the Python connector there, we should use that in dbt-redshift in general for making the connection in the way we use the similar Snowflake connector. however, I don't think that'll get us Python models -- or at least not the ones we want. I see it can read into numpy or pandas locally, but we want to maintain the principle of executing Python code remotely (in the "warehouse")

@colin-rogers-dbt very much agreed! as fyi BigQuery was working on more native integration (forwarded you an email)

cc: @jtcohen6

@ryadav03

Any ETA on this?

@github-actions
Contributor

This issue has been marked as Stale because it has been open for 180 days with no activity. If you would like the issue to remain open, please remove the stale label or comment on the issue, or it will be closed in 7 days.

@zdravis

zdravis commented Aug 7, 2023

Hi, is this feature going to be implemented? @colin-rogers-dbt

@viniciusnunest

+1 Great idea!

@dnascimento

+1. dbt-fal supports this, but not with incremental models.

@joshua-pgatour

Would love this feature please.

@ipcleary

+1

@cschouten

+1

@rohaldb

rohaldb commented Oct 28, 2023

+1

@rohaldb

rohaldb commented Oct 28, 2023

I would really like to move my pipelines into dbt but don't want to lock myself into having to use SQL for transformations. So unfortunately, until this is done, I'm going to have to stick with Glue :/

@erees-embarkvet

+1

@qoqajr

qoqajr commented Jan 11, 2024

+1

@SkinnyPigeon

We would love this feature

@marzaccaro

Is this going to be implemented?

@MICHAELFOLA

This will be a great addition. I was looking to use this today, but unfortunately I will have to use SQL again.

@spagnoloe-amenitiz

+1, Redshift is one of the main DWHs in the market and this feature is a great tool for Data Science/Analytics. It is a pity not to have it there...
