-
Notifications
You must be signed in to change notification settings - Fork 43
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Order columns for Data Skipping #76
base: main
Are you sure you want to change the base?
Conversation
@robertkossendey just for my understanding, I always thought that potentially all cols are indexable it is just computationally more expensive on e.g. long strings or are these not indexable at all? I was just wondering as the list is called |
@Triamus you are totally right, the naming is a bit off 😅 Can you think of a more fitting name? |
@robertkossendey well naming is the hardest thing in CS... not perfect but what about |
@Triamus updated 👍 |
@robertkossendey - this looks like a good start, but it might be nice to have something that's a bit more generic.
The only way we can do that is by letting the user optionally pass in a function. It should be generic enough so the user can do any of these operations:
Any arbitrary logic with the logic you've coded as the default. Let me know if you think that's possible. |
@MrPowers could you describe the interface you envision for the function? Right now I don't think I can follow you 😅 |
@robertkossendey - here's a We should probably invoke That way the user could specify arbirarty reordering logic based on column names, column types, or even the nullable property. If the function is set to I am going to add the quinn function now. I will ping back here when that's built. Should be easier for you to see what I'm saying when the quinn function is built. |
@MrPowers ahh okay, it should default to the "intelligent" ordering if the function is |
I added a Going to try to add a more generic |
@MrPowers I've added the possibility to provide a sort UDF. Is that going in the direction of your idea? |
@robertkossendey why are we changing the |
I am not sure if this changes too much the scope of the function, but shouldn't this re-order be done in place? I mean the function should receive a Delta table and apply an ALTER TABLE ALTER COLUMN to that table with the order of the new columns. This is because saving the dataframe with the columns re-ordered I think will not affect the physical order of the columns. |
@brayanjuls - yea a metadata only operation would obviously be ideal, but is that possible for Delta 1.0? I think you'd need column mapping to be enabled to make this a metadata-only thing. I'm also not sure about if you made it an in-place operation if the column statistics would be re-generated for the new columns. I know very little about how this all works under the hood. |
@brayanjuls @MrPowers my assumption was that a user would use this function on their first write to a table. That's also why it is Delta Table agnostic, you pass a data frame and the function returns a data frame. @Triamus you are right, maybe we should make setting this property optional. It is also risky, since the user might reorder the columns in later steps of his pipeline. WDYT? |
Initial draft for #75 .
Feedback would be welcome! :)