Report on 'margins'? #7

Open
amilbourne opened this issue Jun 18, 2020 · 2 comments

Comments

@amilbourne

When I look at the report output, my first thought (particularly with numeric types) is, what if the data changes a bit?

It might be useful to show how much margin the suggested dtype would leave if the data changes. Perhaps a flag could add extra info to the output? For example:

wm_yr_wk (int64) currently taking 54,729,096 bytes, to save 41,046,726 bytes try wm_yr_wk.astype(int16)
int16 range: -32768 to +32767
data range: -1000 to +29067

Obviously you could get that info yourself, but it might be nice to just be given it. You could give more info than just data range (percentiles or SDs) but this seems like an easy addition.
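
A minimal sketch of how such a margin report could be computed with NumPy and pandas. The helper name `int_margin` and the sample values are hypothetical and are not part of dtype_diet; only `np.iinfo` and the standard `Series.min`/`Series.max` calls are used.

```python
import numpy as np
import pandas as pd


def int_margin(series: pd.Series, target: str) -> dict:
    """Compare a series' observed range with the range of a target integer dtype."""
    info = np.iinfo(target)
    lo, hi = int(series.min()), int(series.max())
    return {
        "dtype_min": int(info.min),
        "dtype_max": int(info.max),
        "data_min": lo,
        "data_max": hi,
        # How far the data could drift before the target dtype overflows.
        "headroom_below": lo - int(info.min),
        "headroom_above": int(info.max) - hi,
        "fits": info.min <= lo and hi <= info.max,
    }


# Made-up column whose extremes match the data range in the example above.
s = pd.Series([-1000, 11101, 29067], dtype="int64", name="wm_yr_wk")
m = int_margin(s, "int16")
print(f"int16 range: {m['dtype_min']} to {m['dtype_max']:+d}")
print(f"data range: {m['data_min']} to {m['data_max']:+d}")
print(f"headroom: {m['headroom_below']} below, {m['headroom_above']} above")
```

The headroom figures make it easy to see that the upper bound is the tight one here, which is the kind of margin the extra output lines would convey.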

@ianozsvald
Owner

Hey Antony, thanks for the interest. Agreed that more info could be useful. I think it's also worth noting that some conversions might change the floating point results (but if the change is under a threshold then maybe that's cool). Good food for thought :-)
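
One way to check whether a float conversion stays under a threshold is to cast to the smaller dtype, cast back, and measure the largest change. This is only a sketch: `downcast_error`, the sample values, and the `threshold` value are assumptions for illustration, not something dtype_diet provides.

```python
import numpy as np
import pandas as pd


def downcast_error(series: pd.Series, target: str = "float32") -> float:
    """Largest absolute change introduced by casting to target and back to float64."""
    original = series.to_numpy(dtype="float64")
    roundtrip = original.astype(target).astype("float64")
    return float(np.max(np.abs(original - roundtrip)))


s = pd.Series([0.1, 1234.5678, 1e6 + 0.25])
err = downcast_error(s, "float32")
threshold = 1e-3  # hypothetical tolerance chosen by the user
print(f"max abs change: {err:.3g}, acceptable: {err <= threshold}")
```

An absolute threshold is used here for simplicity; a relative one may suit data that spans many orders of magnitude better.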

@amilbourne
Author

I deal a lot with data from Excel at the moment and a lot of it has noise in the 15th decimal place (or so). In this case a reduction of accuracy would be fine. Arguably a user could round the data before passing it to dtype_diet if they don't need the precision, but perhaps the library can help to find the optimum rounding level. It might be a nice feature anyway. Presumably you would need some input from the user on whether they are prepared to lose accuracy.
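
A rough sketch of what finding a rounding level could look like: try increasing numbers of decimals until the rounding error falls within a tolerance the user supplies. The function `min_decimals` and the noisy sample values are hypothetical. Note that rounding alone does not shrink a float64 column; it only identifies how much precision the data actually needs.

```python
import numpy as np


def min_decimals(values, tolerance: float, max_decimals: int = 15):
    """Smallest number of decimals whose rounding error stays within tolerance."""
    arr = np.asarray(values, dtype="float64")
    for decimals in range(max_decimals + 1):
        if np.max(np.abs(arr - np.round(arr, decimals))) <= tolerance:
            return decimals
    return None  # no rounding level within max_decimals meets the tolerance


# Values with noise around the 15th decimal place, as described above.
noisy = [0.1 + 2e-15, 2.5 - 1e-15, 3.25 + 3e-15]
level = min_decimals(noisy, tolerance=1e-12)
print(f"decimals needed: {level}")
```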

In fact I imagine you could plot a curve of rounding error vs storage size, although only a few points on the storage size axis would be valid data types. That is probably overkill - I'm just doing some blue skies thinking :-)
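
The points on such a curve could be gathered by casting to each float dtype and recording storage size against the largest error. This is a sketch with made-up sample values; the `points` list is only an illustration of what a plot would be built from.

```python
import numpy as np

values = np.array([0.1, 1234.5678, 54321.125], dtype="float64")

points = []  # (dtype name, bytes used, max absolute error)
for dtype in ("float64", "float32", "float16"):
    cast = values.astype(dtype)
    max_err = float(np.max(np.abs(values - cast.astype("float64"))))
    points.append((dtype, cast.nbytes, max_err))
    print(f"{dtype}: {cast.nbytes} bytes, max abs error {max_err:.3g}")
```

With only three float widths available, the "curve" is really three points, which matches the observation that few positions on the storage axis are valid.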
