Test out early stop sort merge join to handle AS OF join? #360

Open
CTCC1 opened this issue Sep 5, 2023 · 2 comments
Labels
future: This issue is in a backlog of ideas to possibly be done in the future
performance: Issues with the time to execute operations or other performance issues

Comments

@CTCC1
Contributor

CTCC1 commented Sep 5, 2023

I came across some online benchmarks of AS OF joins showing that, in certain cases, an "early stop sort merge join" can outperform a UNION-based AS OF join.

https://www.hopsworks.ai/post/a-spark-join-operator-for-point-in-time-correct-joins (fwiw, it mentions tempo as the inspiration for the UNION-based AS OF join)

Open-source implementation:
https://github.com/Ackuq/spark-pit/blob/main/scala/src/main/scala/execution/Patterns.scala

Would be interested to see what the community / maintainers think.
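
For readers unfamiliar with the idea, here is a minimal sketch of what an early-stop merge for an AS OF join might look like on in-memory rows, in plain Python rather than Spark. It assumes both sides are sorted by key and then by timestamp descending, so the first right row at or before a left row's timestamp is also the most recent one and the scan can stop there. The function name `early_stop_asof_join` and the `(key, timestamp, value)` row layout are hypothetical and are not taken from spark-pit or tempo.

```python
from typing import List, Optional, Tuple

Row = Tuple[str, int, str]                      # (key, timestamp, value)
Joined = Tuple[str, int, str, Optional[str]]    # left row plus matched right value


def early_stop_asof_join(left: List[Row], right: List[Row]) -> List[Joined]:
    """For each left row, attach the latest right value with the same key
    and a timestamp <= the left timestamp (None if there is no such row)."""
    # Sort both sides by key ascending, timestamp descending.
    left_sorted = sorted(left, key=lambda r: (r[0], -r[1]))
    right_sorted = sorted(right, key=lambda r: (r[0], -r[1]))

    out: List[Joined] = []
    j = 0  # right-side cursor; it only ever moves forward
    for key, ts, left_val in left_sorted:
        # Skip right rows for earlier keys, and rows for this key that are
        # newer than the current left row. Later left rows for the same key
        # are older still, so skipped rows can never match them either.
        while j < len(right_sorted) and (
            right_sorted[j][0] < key
            or (right_sorted[j][0] == key and right_sorted[j][1] > ts)
        ):
            j += 1
        if j < len(right_sorted) and right_sorted[j][0] == key:
            # Early stop: the first eligible row is the most recent match.
            # The cursor stays put because the same right row may also be
            # the best match for the next, older left row.
            out.append((key, ts, left_val, right_sorted[j][2]))
        else:
            out.append((key, ts, left_val, None))
    return out
```

Because the cursor never rewinds, each side is scanned once after sorting. This is only an illustration of the merge step; a Spark operator would additionally have to deal with partitioning, tolerances, and code generation.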

@tnixon
Contributor

tnixon commented Sep 6, 2023

Thank you @CTCC1 for bringing this to our attention! We do plan to set up a more formal process for performance testing our functions and As-Of Joins are at the top of that list. We're also doing a big refactoring of the code that will make it easier to compare different implementations head-to-head, so this is very useful information, thanks!

tnixon added the performance and future labels Sep 6, 2023
@Tom-Newton

I have never tested this library, but its implementation is quite similar to an in-house implementation we previously used. Switching to spark-pit gave us a 20x speedup for the as-of join stage on one of our workloads.

I will mention, though, that spark-pit is completely unmaintained and has quite a lot of bugs. I ended up creating a fork that fixes all the bugs I found: https://github.com/Tom-Newton/spark-pit.
