Generate null values for nullable fields #303

Merged


nightscape
Contributor

This PR changes the behavior of DataframeGenerator to generate null values if a field is nullable.
This solves #302.
There is one rather ugly spot in the code:

colValues.contains(null) ||
  // Unfortunately, dataframeGen.arbitrary sometimes generates DataFrames where all
  // rows have exactly identical values.
  // In that case, even generating many rows doesn't help to get some nulls...
  // To work around this we check if we generated at least some distinct values.
  colValues.distinct.size < 4 ||
  // This is needed for Array-valued fields where .distinct returns all values, even when
  // they're identical.
  colValues.size == colValues.distinct.size

If someone knows how to prevent the generator from producing completely identical rows, that would simplify the code a lot.
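To make the three-part condition above easier to follow, here is a minimal standalone sketch that evaluates it on plain Scala sequences. The object name `NullableColumnCheck` and the method name `acceptable` are hypothetical and not part of the PR; only the boolean expression is taken from the snippet quoted above.

```scala
object NullableColumnCheck {
  // Hypothetical helper: returns true when a column of generated values
  // passes the check quoted above for a nullable field.
  def acceptable(colValues: Seq[Any]): Boolean =
    // At least one null was generated.
    colValues.contains(null) ||
      // Fewer than 4 distinct values: the generator produced (nearly)
      // identical rows, so a missing null is tolerated.
      colValues.distinct.size < 4 ||
      // Every value counts as distinct. This happens for Array values,
      // which Scala compares by reference rather than by content.
      colValues.size == colValues.distinct.size
}
```

Under these assumptions, `acceptable(Seq(1, null, 3))` and `acceptable(Seq(7, 7, 7, 7, 7))` both hold, while `acceptable(Seq(1, 2, 3, 4, 4))` does not: that column has no null, four distinct values, and one duplicate.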

@holdenk
Owner

holdenk commented Sep 30, 2019

If you've got the time to look at the failing test, let me know; otherwise I can take a look.

@nightscape nightscape force-pushed the generate_nulls_for_nullable_datatypes branch from 0fbc490 to 7abeb87 Compare September 30, 2019 07:15
@codecov-io

codecov-io commented Sep 30, 2019

Codecov Report

Merging #303 into master will increase coverage by 0.02%.
The diff coverage is 100%.


@@            Coverage Diff             @@
##           master     #303      +/-   ##
==========================================
+ Coverage   86.36%   86.39%   +0.02%     
==========================================
  Files          46       46              
  Lines        1005     1007       +2     
  Branches       86       87       +1     
==========================================
+ Hits          868      870       +2     
  Misses        116      116              
  Partials       21       21

| Flag | Coverage Δ |
|---|---|
| #python | 85.87% <ø> (ø) ⬆️ |
| #scala | 71.01% <100%> (+0.07%) ⬆️ |

| Impacted Files | Coverage Δ |
|---|---|
| ...holdenkarau/spark/testing/DataframeGenerator.scala | 97.56% <100%> (+0.12%) ⬆️ |

Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update 4f130b0...7abeb87.

@nightscape
Contributor Author

@holdenk fixed the failing test. I needed to explicitly set some fields to nullable = false in another test.
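For readers unfamiliar with the flag, a minimal sketch of what marking a field as non-nullable looks like in a Spark schema. The object name and the field names `name` and `age` are hypothetical and not taken from the affected test; the block assumes `spark-sql` is on the classpath.

```scala
import org.apache.spark.sql.types.{IntegerType, StringType, StructField, StructType}

object NonNullableSchemaExample {
  // Hypothetical schema: "name" must never be null, "age" may be null.
  val schema: StructType = StructType(Seq(
    StructField("name", StringType, nullable = false),
    StructField("age", IntegerType, nullable = true)
  ))
}
```

With the change in this PR, a generator driven by such a schema would be expected to produce nulls only for `age`.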

@nightscape
Contributor Author

@holdenk ping 😉

@holdenk
Owner

holdenk commented Oct 17, 2019

Sorry for the delay, and thanks for adding this!

@holdenk holdenk merged commit 79eef40 into holdenk:master Oct 17, 2019
@holdenk
Owner

holdenk commented Oct 17, 2019

I'll try and do a release next week so folks can start using it :) Thank you @nightscape and thanks for working with my slow review cycle.

@nightscape nightscape deleted the generate_nulls_for_nullable_datatypes branch October 17, 2019 09:27
@nightscape
Contributor Author

Thanks for merging! And no worries about review cycles, I'm no better 😉
