Ordering columns better for data skipping #75

MrPowers · 2023-01-24T11:18:35Z

Specifically this section:

Collecting statistics on a column containing long values such as string or binary is an expensive operation. To avoid collecting statistics on such columns you can configure the table property delta.dataSkippingNumIndexedCols. This property indicates the position index of a column in the table’s schema. All columns with a position index less than the delta.dataSkippingNumIndexedCols property will have statistics collected. For the purposes of collecting statistics, each field within a nested column is considered as an individual column. To avoid collecting statistics on columns containing long values, either set the delta.dataSkippingNumIndexedCols property so that the long value columns are after this index in the table’s schema, or move columns containing long strings to an index position greater than the delta.dataSkippingNumIndexedCols property by using ALTER TABLE ALTER COLUMN.

We should provide a helper method that orders DataFrames with the "best" column types for data skipping first. We should let the user specify the columns they commonly filter on (put those first), then the integer columns, etc. Not sure how this would work with Z ORDER. Need to think about this one more, but seems like it's important.
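A rough sketch of what such a helper could look like, working only on `(name, type)` pairs such as those returned by PySpark's `DataFrame.dtypes`. The function names, the type ranking, and the decision to count only top-level columns (nested fields are not expanded) are all assumptions made for illustration, not an agreed design, and Z ORDER is not considered here.

```python
from typing import List, Sequence, Tuple

# Hypothetical ranking of Spark type names (as they appear in DataFrame.dtypes).
# Lower rank = placed earlier. The ranking itself is an assumption.
_TYPE_RANK = {
    "tinyint": 0, "smallint": 0, "int": 0, "bigint": 0,
    "date": 1, "timestamp": 1,
    "float": 2, "double": 2,
    "boolean": 3,
    "string": 5, "binary": 5,
}
# Types the quoted docs call out as expensive to collect statistics on.
_LONG_VALUE_TYPES = {"string", "binary"}


def _rank(dtype: str) -> int:
    if dtype.startswith("decimal"):
        return 2
    # Unknown or nested types (struct, array, map) sit just before strings.
    return _TYPE_RANK.get(dtype, 4)


def order_for_data_skipping(
    dtypes: Sequence[Tuple[str, str]], filter_cols: Sequence[str] = ()
) -> List[str]:
    """Return column names: user-specified filter columns first, then by type rank."""
    known = {name for name, _ in dtypes}
    missing = [c for c in filter_cols if c not in known]
    if missing:
        raise ValueError(f"Unknown filter columns: {missing}")
    first = list(dict.fromkeys(filter_cols))  # de-duplicate, keep user order
    first_set = set(first)
    rest = [(n, t) for n, t in dtypes if n not in first_set]
    # sorted() is stable, so columns of equal rank keep their original order.
    rest_sorted = sorted(rest, key=lambda pair: _rank(pair[1]))
    return first + [n for n, _ in rest_sorted]


def suggested_num_indexed_cols(
    ordered: Sequence[str],
    dtypes: Sequence[Tuple[str, str]],
    filter_cols: Sequence[str] = (),
) -> int:
    """Count leading columns up to the first long-valued column the user does not filter on."""
    type_of = dict(dtypes)
    keep = set(filter_cols)
    count = 0
    for name in ordered:
        if type_of[name] in _LONG_VALUE_TYPES and name not in keep:
            break
        count += 1
    return count


dtypes = [
    ("comment", "string"),
    ("id", "bigint"),
    ("country", "string"),
    ("created", "timestamp"),
    ("score", "double"),
]
ordered = order_for_data_skipping(dtypes, filter_cols=["country"])
num_indexed = suggested_num_indexed_cols(ordered, dtypes, filter_cols=["country"])
```

With a real DataFrame this could be applied as `df.select(*order_for_data_skipping(df.dtypes, ["country"]))`, and the second value could then be used for the `delta.dataSkippingNumIndexedCols` table property.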

robertkossendey · 2023-01-24T12:10:57Z

I would like to work on this :) I will propose a solution in the form of a WIP PR and we can discuss it then, okay?

MrPowers · 2023-01-24T12:14:28Z

@robertkossendey - yep, sounds awesome, thanks!

robertkossendey self-assigned this Jan 24, 2023

robertkossendey mentioned this issue Jan 24, 2023

Order columns for Data Skipping #76

Open
