
Order columns for Data Skipping #76

Open · wants to merge 4 commits into main

Conversation

robertkossendey (Collaborator)

Initial draft for #75.

Feedback would be welcome! :)

Triamus (Contributor) commented Jan 24, 2023

@robertkossendey just for my understanding: I always thought that potentially all columns are indexable, and that indexing is simply more computationally expensive for e.g. long strings. Or are those not indexable at all? I ask because the list is called indexable_data_types. For example, a short string might still be a candidate for indexing, or am I mistaken? Thanks.

robertkossendey (Collaborator, Author)

@Triamus you are totally right, the naming is a bit off 😅 Can you think of a more fitting name?

Triamus (Contributor) commented Jan 24, 2023

@robertkossendey well, naming is the hardest thing in CS... It's not perfect, but what about efficient_index_types?

robertkossendey (Collaborator, Author)

@Triamus updated 👍

MrPowers (Owner)

@robertkossendey - this looks like a good start, but it might be nice to have something that's a bit more generic.

  • we need to let users specify the columns they want first
  • we want smart default behavior
  • we want to let the user customize the columns in any order they'd like

The only way we can do that is by letting the user optionally pass in a function. It should be generic enough so the user can do any of these operations:

  • Put IntegerType cols first, then StringType cols, then anything goes for the other cols
  • Put all the columns that start with "c" first, then anything goes

In short: any arbitrary ordering logic, with the logic you've coded here as the default.

Let me know if you think that's possible.
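For illustration, a minimal sketch of what such caller-supplied ordering functions could look like in PySpark, assuming a hypothetical reorder_columns(df, ordering_fn) helper that selects the columns in the returned order (none of these names are part of the PR yet):

```python
from pyspark.sql.types import IntegerType, StringType

def ints_then_strings(fields):
    # IntegerType columns first, then StringType columns, then everything else.
    ints = [f for f in fields if isinstance(f.dataType, IntegerType)]
    strings = [f for f in fields if isinstance(f.dataType, StringType)]
    rest = [f for f in fields if f not in ints and f not in strings]
    return ints + strings + rest

def c_columns_first(fields):
    # Columns whose names start with "c" first, then everything else.
    starts_with_c = [f for f in fields if f.name.startswith("c")]
    rest = [f for f in fields if not f.name.startswith("c")]
    return starts_with_c + rest

# Hypothetical usage:
# df = reorder_columns(df, ints_then_strings)
# df = reorder_columns(df, c_columns_first)
```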

robertkossendey (Collaborator, Author)

@MrPowers could you describe the interface you envision for the function? Right now I can't quite follow you 😅

MrPowers (Owner)

@robertkossendey - here's a renameColumns function that takes a function as an argument, just to give you an idea of the pattern. The example is in Scala, so it's not ideal; it looks like there is no quinn equivalent yet.

We should probably invoke reorder_columns with a function that takes a StructType as an argument and returns a StructType.

That way the user could specify arbitrary reordering logic based on column names, column types, or even the nullable property. If the function is set to None, then we'll automatically apply intelligent reordering logic.

I am going to add the quinn function now and will ping back here once it's built. It should be easier to see what I'm saying then.
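A rough sketch of the shape being described here, assuming the default "intelligent" ordering simply puts cheaply-indexable types first (the type list and the function body are illustrative, not the actual implementation):

```python
from pyspark.sql import DataFrame
from pyspark.sql.types import (
    StructType, BooleanType, ByteType, ShortType, IntegerType, LongType,
    FloatType, DoubleType, DateType, TimestampType,
)

# Illustrative list of types assumed to be cheap to collect min/max stats for.
EFFICIENT_INDEX_TYPES = (
    BooleanType, ByteType, ShortType, IntegerType, LongType,
    FloatType, DoubleType, DateType, TimestampType,
)

def reorder_columns(df: DataFrame, ordering_fn=None) -> DataFrame:
    """Reorder df's columns using ordering_fn (StructType -> StructType).

    When ordering_fn is None, fall back to a default that puts the
    efficiently indexable types first.
    """
    if ordering_fn is None:
        def ordering_fn(schema: StructType) -> StructType:
            efficient = [f for f in schema.fields
                         if isinstance(f.dataType, EFFICIENT_INDEX_TYPES)]
            rest = [f for f in schema.fields
                    if not isinstance(f.dataType, EFFICIENT_INDEX_TYPES)]
            return StructType(efficient + rest)
    new_schema = ordering_fn(df.schema)
    return df.select([field.name for field in new_schema.fields])
```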

robertkossendey (Collaborator, Author)

@MrPowers ahh okay, it should default to the "intelligent" ordering if the function is None; that was the part I was missing! Will add this later :)

MrPowers (Owner)

I added a reorder_columns_sorted function here that behaves like the Python sorted function: mrpowers-io/quinn#48

Going to try to add a more generic reorder_columns function that's instantiated with a function that takes a StructType argument and returns a StructType. The reorder_columns function will be harder to use, but more flexible.
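For readers following along, a guess at the shape of such a sorted-style helper (not the actual quinn#48 code; the key defaults to sorting by column name purely for illustration):

```python
def reorder_columns_sorted(df, key=lambda field: field.name, reverse=False):
    # Sort the DataFrame's StructFields the way Python's sorted() sorts a list,
    # then select the columns in the resulting order.
    fields = sorted(df.schema.fields, key=key, reverse=reverse)
    return df.select([field.name for field in fields])

# e.g. shortest column names first:
# df2 = reorder_columns_sorted(df, key=lambda field: len(field.name))
```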

robertkossendey (Collaborator, Author)

@MrPowers I've added the possibility to provide a sort UDF. Is that going in the direction you had in mind?

Triamus (Contributor) commented Jan 25, 2023

@robertkossendey why are we changing the spark.databricks.delta.properties.defaults.dataSkippingNumIndexedCols in the function? Is the assumption that an unknowing user wants the default behavior to not only reorder the columns but also expand the statistics collection to a potentially large number of cols? And wouldn't that impact all following table ops even if the reorder_cols function is not invoked? Is that the intention or am I misunderstanding the code flow?
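For context on the scope of that setting (illustrative values and table name, assuming an active SparkSession; not code from the PR): the session-level default applies to every Delta table created afterwards in that session, whereas the table property scopes the change to a single table.

```python
# Session-level default: affects all Delta tables created later in this session.
spark.conf.set(
    "spark.databricks.delta.properties.defaults.dataSkippingNumIndexedCols", "50"
)

# Table-level property: limits the change to one table ("my_table" is a placeholder).
spark.sql(
    "ALTER TABLE my_table "
    "SET TBLPROPERTIES ('delta.dataSkippingNumIndexedCols' = '50')"
)
```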

brayanjuls

I am not sure if this changes the scope of the function too much, but shouldn't this re-order be done in place?

I mean, the function would receive a Delta table and apply an ALTER TABLE ALTER COLUMN to that table with the new column order. I think that saving the DataFrame with the columns re-ordered will not affect the physical order of the columns in the existing table.
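A hedged sketch of the in-place approach being suggested (table and column names are placeholders; whether this is supported, and whether it is metadata-only, depends on the Delta Lake version, as discussed in the next comment):

```python
# Move a cheaply-indexable column to the front of an existing Delta table,
# then place another column right after it. Exact DDL may vary across versions.
spark.sql("ALTER TABLE events ALTER COLUMN event_id FIRST")
spark.sql("ALTER TABLE events ALTER COLUMN event_ts AFTER event_id")
```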

MrPowers (Owner)

@brayanjuls - yeah, a metadata-only operation would obviously be ideal, but is that possible for Delta 1.0? I think you'd need column mapping to be enabled to make this a metadata-only thing. I'm also not sure whether the column statistics would be regenerated for the new column order if you made it an in-place operation. I know very little about how this all works under the hood.

robertkossendey (Collaborator, Author) commented Feb 20, 2023

@brayanjuls @MrPowers my assumption was that a user would use this function on their first write to a table. That's also why it is Delta table agnostic: you pass in a DataFrame and the function returns a DataFrame.

@Triamus you are right, maybe we should make setting this property optional. It is also risky, since the user might reorder the columns in later steps of their pipeline. WDYT?
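A tiny usage sketch of that intended flow, assuming the hypothetical reorder_columns helper sketched earlier and a placeholder output path:

```python
# First write: reorder before saving so Delta's per-file statistics cover the
# most useful columns (reorder_columns is the hypothetical helper from above).
ordered_df = reorder_columns(raw_df)  # default "intelligent" ordering
(ordered_df
    .write
    .format("delta")
    .save("/tmp/events_table"))  # placeholder path
```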
