🐛 Bug: Failure when subsequent records have fundamentally incompatible schemas #89

Closed
aaronsteers opened this issue Mar 1, 2024 · 1 comment · Fixed by #67

aaronsteers commented Mar 1, 2024

When subsequent records have incompatible schemas, the pa.Table.from_pandas() call fails:

Example:

----> 3 result = source.read(cache=cache)

9 frames
/usr/local/lib/python3.10/dist-packages/airbyte/source.py in read(self, cache, streams, write_strategy, force_full_refresh)
    592         )
    593         print(f"Started `{self.name}` read operation at {pendulum.now().format('HH:mm:ss')}...")
--> 594         cache.processor.process_airbyte_messages(
    595             self._tally_records(
    596                 self._read(

/usr/local/lib/python3.10/dist-packages/airbyte/_processors/base.py in process_airbyte_messages(self, messages, write_strategy, max_batch_size)
    207         for stream_name, stream_batch in stream_batches.items():
    208             batch_df = pd.DataFrame(stream_batch)
--> 209             record_batch = pa.Table.from_pandas(batch_df)
    210             self._process_batch(stream_name, record_batch)
    211             progress.log_batch_written(stream_name, len(stream_batch))

/usr/local/lib/python3.10/dist-packages/pyarrow/table.pxi in pyarrow.lib.Table.from_pandas()

/usr/local/lib/python3.10/dist-packages/pyarrow/pandas_compat.py in dataframe_to_arrays(df, schema, preserve_index, nthreads, columns, safe)
    611 
    612     if nthreads == 1:
--> 613         arrays = [convert_column(c, f)
    614                   for c, f in zip(columns_to_convert, convert_fields)]
    615     else:

/usr/local/lib/python3.10/dist-packages/pyarrow/pandas_compat.py in <listcomp>(.0)
    611 
    612     if nthreads == 1:
--> 613         arrays = [convert_column(c, f)
    614                   for c, f in zip(columns_to_convert, convert_fields)]
    615     else:

/usr/local/lib/python3.10/dist-packages/pyarrow/pandas_compat.py in convert_column(col, field)
    598             e.args += ("Conversion failed for column {!s} with type {!s}"
    599                        .format(col.name, col.dtype),)
--> 600             raise e
    601         if not field_nullable and result.null_count > 0:
    602             raise ValueError("Field {} was non-nullable but pandas column "

/usr/local/lib/python3.10/dist-packages/pyarrow/pandas_compat.py in convert_column(col, field)
    592 
    593         try:
--> 594             result = pa.array(col, type=type_, from_pandas=True, safe=safe)
    595         except (pa.ArrowInvalid,
    596                 pa.ArrowNotImplementedError,

/usr/local/lib/python3.10/dist-packages/pyarrow/array.pxi in pyarrow.lib.array()

/usr/local/lib/python3.10/dist-packages/pyarrow/array.pxi in pyarrow.lib._ndarray_to_array()

/usr/local/lib/python3.10/dist-packages/pyarrow/error.pxi in pyarrow.lib.check_status()

ArrowInvalid: ("Could not convert 'false' with type str: tried to convert to boolean", 'Conversion failed for column attributes with type object')
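
To find which column in a batch carries conflicting value types before it reaches pa.Table.from_pandas(), a small stdlib-only check along these lines might help. This is only a sketch: `find_mixed_type_columns` is a hypothetical helper, not part of PyAirbyte or pyarrow, and the sample records are an assumption about the kind of data behind the error above (a boolean `attributes` value in one record, the string 'false' in a later one).

```python
from collections import defaultdict


def find_mixed_type_columns(records):
    """Return {column: sorted type names} for columns whose non-null
    values have more than one Python type across the given records."""
    seen = defaultdict(set)
    for record in records:
        for key, value in record.items():
            # None is skipped: nulls alone do not make a column inconsistent.
            if value is not None:
                seen[key].add(type(value).__name__)
    return {key: sorted(names) for key, names in seen.items() if len(names) > 1}


# Assumed batch: `attributes` is a bool in the first record and the
# string "false" in the second, mirroring the reported error message.
stream_batch = [
    {"id": 1, "attributes": True},
    {"id": 2, "attributes": "false"},
]

mixed = find_mixed_type_columns(stream_batch)
print(mixed)  # {'attributes': ['bool', 'str']}
```

The check is deliberately conservative: it also flags mixes such as int and float that the conversion may well accept, so its output is a hint about where to look rather than a prediction of failure.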

Reported in Slack:

https://airbytehq.slack.com/archives/C06FZ238P8W/p1709053334428409?thread_ts=1708526473.508759&cid=C06FZ238P8W

aaronsteers commented

Related issue:

Hopefully this will also be resolved by #67.

@aaronsteers aaronsteers self-assigned this Mar 11, 2024