Ordering columns better for data skipping #75

Open
MrPowers opened this issue Jan 24, 2023 · 2 comments

@MrPowers
Owner

See the docs on data skipping.

Specifically this section:

Collecting statistics on a column containing long values such as string or binary is an expensive operation. To avoid collecting statistics on such columns you can configure the table property delta.dataSkippingNumIndexedCols. This property indicates the position index of a column in the table’s schema. All columns with a position index less than the delta.dataSkippingNumIndexedCols property will have statistics collected. For the purposes of collecting statistics, each field within a nested column is considered as an individual column. To avoid collecting statistics on columns containing long values, either set the delta.dataSkippingNumIndexedCols property so that the long value columns are after this index in the table’s schema, or move columns containing long strings to an index position greater than the delta.dataSkippingNumIndexedCols property by using ALTER TABLE ALTER COLUMN.
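
For reference, the property described above is set with an `ALTER TABLE ... SET TBLPROPERTIES` statement. Here is a minimal sketch that only builds the SQL string; the function name `set_num_indexed_cols_sql` and the table name `events` are made up for illustration, and the statement would still need to be run through something like `spark.sql(...)`.

```python
def set_num_indexed_cols_sql(table_name: str, num_cols: int) -> str:
    """Build the SQL that sets delta.dataSkippingNumIndexedCols on a table.

    Hypothetical helper: it only formats the statement and does not touch Spark.
    """
    return (
        f"ALTER TABLE {table_name} "
        f"SET TBLPROPERTIES ('delta.dataSkippingNumIndexedCols' = '{num_cols}')"
    )


# Example: collect statistics only for the first 3 columns of a
# hypothetical table called "events".
statement = set_num_indexed_cols_sql("events", 3)
print(statement)
```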

We should provide a helper method that orders DataFrames with the "best" column types for data skipping first. We should let the user specify the columns they commonly filter on (put those first), then the integer columns, etc. Not sure how this would work with Z ORDER. Need to think about this one more, but seems like it's important.
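
A rough sketch of what such a helper could look like, just to make the idea concrete. Everything here is an assumption and not an agreed design: the name `order_columns_for_data_skipping`, the type ranking (numeric, date, and boolean types first; long string and binary types last), and the choice to work on `(name, type)` pairs in the shape PySpark's `df.dtypes` returns. It does not address Z ORDER.

```python
from typing import List, Sequence, Tuple

# Assumed ranking: lower rank means earlier in the schema.
_CHEAP_STATS_TYPES = {
    "tinyint", "smallint", "int", "bigint", "float", "double",
    "decimal", "date", "timestamp", "boolean",
}
_EXPENSIVE_STATS_TYPES = {"string", "binary"}


def _rank(type_name: str) -> int:
    # Strip parameters such as "decimal(10,2)" or "array<int>" to get the base type.
    base = type_name.split("(")[0].split("<")[0].strip().lower()
    if base in _CHEAP_STATS_TYPES:
        return 0
    if base in _EXPENSIVE_STATS_TYPES:
        return 2
    return 1  # everything else (arrays, maps, structs, ...) goes in the middle


def order_columns_for_data_skipping(
    dtypes: Sequence[Tuple[str, str]],
    filter_cols: Sequence[str] = (),
) -> List[str]:
    """Return column names with user-chosen filter columns first,
    then the remaining columns ordered by the assumed type ranking."""
    names = [name for name, _ in dtypes]
    missing = [c for c in filter_cols if c not in names]
    if missing:
        raise ValueError(f"Unknown filter columns: {missing}")

    first = list(dict.fromkeys(filter_cols))  # de-duplicate, keep user order
    rest = [(n, t) for n, t in dtypes if n not in first]
    # sorted() is stable, so columns of equal rank keep their original order.
    rest_sorted = sorted(rest, key=lambda pair: _rank(pair[1]))
    return first + [n for n, _ in rest_sorted]


example_dtypes = [
    ("comment", "string"),
    ("id", "bigint"),
    ("payload", "binary"),
    ("event_date", "date"),
    ("country", "string"),
    ("amount", "decimal(10,2)"),
]
ordered = order_columns_for_data_skipping(example_dtypes, ["country"])
print(ordered)
```

With a real DataFrame, the result could be applied with `df.select(*order_columns_for_data_skipping(df.dtypes, ["country"]))`. The number of leading columns worth indexing could then inform the value of `delta.dataSkippingNumIndexedCols`.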

@robertkossendey robertkossendey self-assigned this Jan 24, 2023
@robertkossendey
Collaborator

I would like to work on this :) I will propose a solution in the form of a WIP PR and we can discuss it then, okay?

@MrPowers
Owner Author

@robertkossendey - yep, sounds awesome, thanks!
