Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Bulk Load CDK: Root-Level Flattening & S3V2 Usage #48377

Merged

Conversation

johnny-schmidt
Copy link
Contributor

What

Adds support for root-level flattening to ObjectStorage + S3V2 usage (configurable for Json/CSV, required for Avro/Parquet).

Root level flattening means that the client fields are written to the top level of the resulting record instead of to an _airbyte_data subfield.

So given the client data {"id": 1, "name": "Bob"}

  • root-level-flattening {"_airbyte_raw_id": "<uuid>", "_airbyte_generation_id": 10, ..., "id": 1, "name": "Bob" }
  • no flattening: {"_airbyte_raw_id": "<uuid>", "_airbyte_generation_id": 10, ..., "_airbyte_data": { "id": 1, "name": "Bob" } }

Specifically

  • adds flattening option to both schema -> schema with meta and record -> record data with meta
  • unit tests (plus missing tests for "no flattening" conversions)
  • unflattening to the data dumper
  • some tweaks to data readers to make tests work better
  • slight bug fix where CSV was writing data in data order and not schema order
  • missing CSV conversion types
  • changes to expected spec to include the flattening option
  • integration test cases + secret configs for CSV & JSON with flattening enabled

@johnny-schmidt johnny-schmidt requested a review from a team as a code owner November 6, 2024 02:11
Copy link

vercel bot commented Nov 6, 2024

The latest updates on your projects. Learn more about Vercel for Git ↗︎

1 Skipped Deployment
Name Status Preview Comments Updated (UTC)
airbyte-docs ⬜️ Ignored (Inspect) Visit Preview Nov 7, 2024 7:30pm

Copy link
Contributor

@edgao edgao left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

couple small comments, otherwise lgtm

)
)
if (flatten) {
(schema as ObjectType).properties.forEach { (name, field) -> properties[name] = field }
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

maybe we should throw if a property clashes with an airbyte field? (honestly not sure, I've never figured out what the expected behavior is for people chaining source -> airbyte -> destination -> airbyte -> destination)

@@ -60,6 +61,19 @@ class AirbyteValueWithMetaToOutputRecord {
}
}

fun AirbyteValue.maybeUnflatten(wasFlattened: Boolean): ObjectValue {
this as ObjectValue
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: remove this? (unless it's for the sake of doing a smart cast, in which case add a comment? or maybe define this as fun ObjectValue.maybeUnflatten)

is DateValue -> value.value
is IntValue -> value.value.toString()
is TimeValue -> value.value
is UnknownValue -> ""
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we should preserve the original json for unknown value?

import io.airbyte.cdk.load.data.UnknownType
import io.airbyte.cdk.load.data.json.toAirbyteValue
import io.airbyte.cdk.load.util.deserializeToNode
import org.apache.commons.csv.CSVRecord

class CsvRowToAirbyteValue {
fun convert(row: CSVRecord, schema: AirbyteType): AirbyteValue {
print("converting row: $row")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
print("converting row: $row")

@octavia-squidington-iv octavia-squidington-iv requested a review from a team November 7, 2024 17:04
@johnny-schmidt johnny-schmidt force-pushed the jschmidt/s3v2/issue-10511/add-mappers-to-formatting-writers branch from 25e79bd to 99e4115 Compare November 7, 2024 17:29
@johnny-schmidt johnny-schmidt force-pushed the jschmidt/s3v2/issue-10580/root-level-flattening branch from e3fea36 to 45a6f89 Compare November 7, 2024 18:10
Base automatically changed from jschmidt/s3v2/issue-10511/add-mappers-to-formatting-writers to master November 7, 2024 19:29
@johnny-schmidt johnny-schmidt force-pushed the jschmidt/s3v2/issue-10580/root-level-flattening branch from 45a6f89 to 1b09fbf Compare November 7, 2024 19:29
@johnny-schmidt johnny-schmidt enabled auto-merge (squash) November 7, 2024 19:30
@johnny-schmidt johnny-schmidt merged commit e165b01 into master Nov 7, 2024
36 checks passed
@johnny-schmidt johnny-schmidt deleted the jschmidt/s3v2/issue-10580/root-level-flattening branch November 7, 2024 19:43
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area/connectors Connector related issues CDK Connector Development Kit connectors/destination/s3-v2
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants