
Order columns for Data Skipping #76

Open · wants to merge 4 commits into main

Conversation

robertkossendey (Collaborator)

Initial draft for #75.

Feedback would be welcome! :)

Triamus (Contributor) commented Jan 24, 2023

@robertkossendey just for my understanding: I always thought that potentially all columns are indexable, and that indexing is simply more computationally expensive for e.g. long strings. Or are those not indexable at all? I ask because the list is called indexable_data_types. For example, a short string might still be a candidate for indexing, or am I mistaken? Thanks.

robertkossendey (Collaborator, Author)

@Triamus you are totally right, the naming is a bit off 😅 Can you think of a more fitting name?

Triamus (Contributor) commented Jan 24, 2023

@robertkossendey well, naming is the hardest thing in CS... It's not perfect, but what about efficient_index_types?

robertkossendey (Collaborator, Author)

@Triamus updated 👍

MrPowers (Owner)

@robertkossendey - this looks like a good start, but it might be nice to have something that's a bit more generic.

  • we need to let users specify the columns they want first
  • we want smart default behavior
  • we want to let the user customize the columns in any order they'd like

The only way we can do that is by letting the user optionally pass in a function. It should be generic enough so the user can do any of these operations:

  • Put IntegerType cols first, then StringType cols, then anything goes for the other cols
  • Put all the columns that start with "c" first, then anything goes

In short: any arbitrary ordering logic, with the logic you've coded here as the default.

Let me know if you think that's possible.
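For illustration, a minimal sketch of what such caller-supplied ordering functions could look like in PySpark, assuming a hypothetical reorder_columns(df, ordering_fn) helper that selects the columns in the returned order (none of these names are part of the PR yet):

```python
from pyspark.sql.types import IntegerType, StringType

def ints_then_strings(fields):
    # IntegerType columns first, then StringType columns, then everything else.
    ints = [f for f in fields if isinstance(f.dataType, IntegerType)]
    strings = [f for f in fields if isinstance(f.dataType, StringType)]
    rest = [f for f in fields if f not in ints and f not in strings]
    return ints + strings + rest

def c_columns_first(fields):
    # Columns whose names start with "c" first, then everything else.
    starts_with_c = [f for f in fields if f.name.startswith("c")]
    rest = [f for f in fields if not f.name.startswith("c")]
    return starts_with_c + rest

# Hypothetical usage:
# df = reorder_columns(df, ints_then_strings)
# df = reorder_columns(df, c_columns_first)
```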

robertkossendey (Collaborator, Author)

@MrPowers could you describe the interface you envision for the function? Right now I can't quite follow you 😅

MrPowers (Owner)

@robertkossendey - here's a renameColumns function that takes a function as an argument, just to give you an idea of the pattern. The example is in Scala, so it's not ideal; it looks like there is no quinn equivalent yet.

We should probably invoke reorder_columns with a function that takes a StructType as an argument and returns a StructType.

That way the user could specify arbitrary reordering logic based on column names, column types, or even the nullable property. If the function is set to None, then we'll automatically apply intelligent reordering logic.

I am going to add the quinn function now and will ping back here once it's built. It should be easier to see what I'm saying then.
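A rough sketch of the shape being described here, assuming the default "intelligent" ordering simply puts cheaply-indexable types first (the type list and the function body are illustrative, not the actual implementation):

```python
from pyspark.sql import DataFrame
from pyspark.sql.types import (
    StructType, BooleanType, ByteType, ShortType, IntegerType, LongType,
    FloatType, DoubleType, DateType, TimestampType,
)

# Illustrative list of types assumed to be cheap to collect min/max stats for.
EFFICIENT_INDEX_TYPES = (
    BooleanType, ByteType, ShortType, IntegerType, LongType,
    FloatType, DoubleType, DateType, TimestampType,
)

def reorder_columns(df: DataFrame, ordering_fn=None) -> DataFrame:
    """Reorder df's columns using ordering_fn (StructType -> StructType).

    When ordering_fn is None, fall back to a default that puts the
    efficiently indexable types first.
    """
    if ordering_fn is None:
        def ordering_fn(schema: StructType) -> StructType:
            efficient = [f for f in schema.fields
                         if isinstance(f.dataType, EFFICIENT_INDEX_TYPES)]
            rest = [f for f in schema.fields
                    if not isinstance(f.dataType, EFFICIENT_INDEX_TYPES)]
            return StructType(efficient + rest)
    new_schema = ordering_fn(df.schema)
    return df.select([field.name for field in new_schema.fields])
```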

robertkossendey (Collaborator, Author)

@MrPowers ahh okay, it should default to the "intelligent" ordering if the function is None; that was the part I was missing! Will add this later :)

MrPowers (Owner)

I added a reorder_columns_sorted function here that behaves like the Python sorted function: mrpowers-io/quinn#48

Going to try to add a more generic reorder_columns function that's instantiated with a function that takes a StructType argument and returns a StructType. The reorder_columns function will be harder to use, but more flexible.
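For readers following along, a guess at the shape of such a sorted-style helper (not the actual quinn#48 code; the key defaults to sorting by column name purely for illustration):

```python
def reorder_columns_sorted(df, key=lambda field: field.name, reverse=False):
    # Sort the DataFrame's StructFields the way Python's sorted() sorts a list,
    # then select the columns in the resulting order.
    fields = sorted(df.schema.fields, key=key, reverse=reverse)
    return df.select([field.name for field in fields])

# e.g. shortest column names first:
# df2 = reorder_columns_sorted(df, key=lambda field: len(field.name))
```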

robertkossendey (Collaborator, Author)

@MrPowers I've added the possibility to provide a sort UDF. Is that going in the direction you had in mind?

Triamus (Contributor) commented Jan 25, 2023

@robertkossendey why are we changing the spark.databricks.delta.properties.defaults.dataSkippingNumIndexedCols in the function? Is the assumption that an unknowing user wants the default behavior to not only reorder the columns but also expand the statistics collection to a potentially large number of cols? And wouldn't that impact all following table ops even if the reorder_cols function is not invoked? Is that the intention or am I misunderstanding the code flow?
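For context on the scope of that setting (illustrative values and table name, assuming an active SparkSession; not code from the PR): the session-level default applies to every Delta table created afterwards in that session, whereas the table property scopes the change to a single table.

```python
# Session-level default: affects all Delta tables created later in this session.
spark.conf.set(
    "spark.databricks.delta.properties.defaults.dataSkippingNumIndexedCols", "50"
)

# Table-level property: limits the change to one table ("my_table" is a placeholder).
spark.sql(
    "ALTER TABLE my_table "
    "SET TBLPROPERTIES ('delta.dataSkippingNumIndexedCols' = '50')"
)
```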

brayanjuls

I am not sure if this changes the scope of the function too much, but shouldn't this re-order be done in place?

I mean, the function would receive a Delta table and apply an ALTER TABLE ALTER COLUMN to that table with the new column order. I think that saving the DataFrame with the columns re-ordered will not affect the physical order of the columns in the existing table.
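A hedged sketch of the in-place approach being suggested (table and column names are placeholders; whether this is supported, and whether it is metadata-only, depends on the Delta Lake version, as discussed in the next comment):

```python
# Move a cheaply-indexable column to the front of an existing Delta table,
# then place another column right after it. Exact DDL may vary across versions.
spark.sql("ALTER TABLE events ALTER COLUMN event_id FIRST")
spark.sql("ALTER TABLE events ALTER COLUMN event_ts AFTER event_id")
```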

MrPowers (Owner)

@brayanjuls - yeah, a metadata-only operation would obviously be ideal, but is that possible for Delta 1.0? I think you'd need column mapping to be enabled to make this a metadata-only thing. I'm also not sure whether the column statistics would be regenerated for the new column order if you made it an in-place operation. I know very little about how this all works under the hood.

robertkossendey (Collaborator, Author) commented Feb 20, 2023

@brayanjuls @MrPowers my assumption was that a user would use this function on their first write to a table. That's also why it is Delta table agnostic: you pass in a DataFrame and the function returns a DataFrame.

@Triamus you are right, maybe we should make setting this property optional. It is also risky, since the user might reorder the columns in later steps of their pipeline. WDYT?
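A tiny usage sketch of that intended flow, assuming the hypothetical reorder_columns helper sketched earlier and a placeholder output path:

```python
# First write: reorder before saving so Delta's per-file statistics cover the
# most useful columns (reorder_columns is the hypothetical helper from above).
ordered_df = reorder_columns(raw_df)  # default "intelligent" ordering
(ordered_df
    .write
    .format("delta")
    .save("/tmp/events_table"))  # placeholder path
```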
