Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Omit parquet field for a schema with no nested fields #205

Merged
merged 2 commits into from
May 8, 2024

Conversation

istreeter
Copy link
Contributor

Snowplow users commonly use schemas with no nested fields. Previously, we were creating a string column and loading the string field {}. But there is no benefit to loading this redundant data.

By omitting a column for these schemas, it means we support schema evolution if the user ever adds a nested field to the empty schema.

For empty schemas with additionalProperties: true we retain the old behaviour of loading the original JSON as a string.

@coveralls
Copy link

coveralls commented Apr 19, 2024

Pull Request Test Coverage Report for Build 9003594613

Warning: This coverage report may be inaccurate.

This pull request's base commit is no longer the HEAD commit of its target branch. This means it includes changes from outside the original pull request, including, potentially, unrelated coverage changes.

Details

  • 47 of 50 (94.0%) changed or added relevant lines in 4 files are covered.
  • 7 unchanged lines in 7 files lost coverage.
  • Overall coverage increased (+0.1%) to 80.668%

Changes Missing Coverage Covered Lines Changed/Added Lines %
modules/core/src/main/scala/com.snowplowanalytics/iglu.schemaddl/parquet/Field.scala 32 33 96.97%
modules/core/src/main/scala/com.snowplowanalytics/iglu.schemaddl/parquet/FieldValue.scala 0 2 0.0%
Files with Coverage Reduction New Missed Lines %
modules/core/src/main/scala/com.snowplowanalytics/iglu.schemaddl/jsonschema/circe/NumberCodecs.scala 1 70.37%
modules/core/src/main/scala/com.snowplowanalytics/iglu.schemaddl/parquet/Caster.scala 1 84.85%
modules/core/src/main/scala/com.snowplowanalytics/iglu.schemaddl/bigquery/Row.scala 1 71.79%
modules/core/src/main/scala/com.snowplowanalytics/iglu.schemaddl/parquet/Migrations.scala 1 90.67%
modules/core/src/main/scala/com.snowplowanalytics/iglu.schemaddl/jsonschema/package.scala 1 61.76%
modules/core/src/main/scala/com.snowplowanalytics/iglu.schemaddl/jsonschema/circe/StringCodecs.scala 1 64.29%
modules/core/src/main/scala/com.snowplowanalytics/iglu.schemaddl/jsonschema/mutate/Widened.scala 1 86.36%
Totals Coverage Status
Change from base Build 8434107272: 0.1%
Covered Lines: 1256
Relevant Lines: 1557

💛 - Coveralls

Snowplow users commonly use schemas with no nested fields. Previously,
we were creating a string column and loading the string field `{}`. But
there is no benefit to loading this redundant data.

By omitting a column for these schemas, it means we support schema
evolution if the user ever adds a nested field to the empty schema.

For empty schemas with `additionalProperties: true` we retain the old
behaviour of loading the original JSON as a string.
In modern snowplow loaders we have already switched to using Vector when
manipulating data structures. We use vector because we commonly need to
join together fields from different shcmeas.

But even the new loader code still has lots of `list.toVector`. If we
change to Vector in schema-ddl then we can eliminate a lot of those
unnecessary conversions.
@istreeter istreeter merged commit 4c0505d into develop May 8, 2024
1 check passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants