[BUG] Add retries to pyarrow write_dataset call (#2445)
Multipart uploads to Cloudflare R2 intermittently fail with an `InvalidPart` error ([more info about the error here](https://docs.aws.amazon.com/AmazonS3/latest/API/API_CompleteMultipartUpload.html)). I have confirmed that this error does not occur when the write is retried, so this PR adds retry logic to `write_dataset` to fix the issue. Tested with this code:

```python
import daft
import pyarrow as pa
from tqdm import trange


def main():
    daft.context.set_runner_ray()

    df = daft.from_pydict({"a": list(range(10_000_000 * 16))})
    df = df.into_partitions(16)
    table = df.to_arrow()

    s3 = daft.io.S3Config(
        endpoint_url="...",
        key_id="...",
        access_key="...",
        region_name="auto",
    )
    io_config = daft.io.IOConfig(s3=s3)

    for i in trange(1_000):
        path = f"s3://eventual-public-data/kevin-test/{i}.parquet"
        df.write_parquet(path, io_config=io_config)

        written_df = daft.read_parquet(path, io_config=io_config)
        written_df = written_df.sort("a")
        written_arrow = written_df.to_arrow()
        assert written_arrow.equals(table)


if __name__ == "__main__":
    main()
```
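For reference, the retry pattern is roughly the following. This is a minimal sketch of the approach, not the PR's actual diff: the function name `write_dataset_with_retries`, the retry count, the backoff, and the choice to retry on `OSError` (how PyArrow surfaces object-store I/O failures) are all assumptions for illustration.

```python
import time

import pyarrow as pa
import pyarrow.dataset as ds

NUM_RETRIES = 3        # assumed retry budget for this sketch
BACKOFF_SECONDS = 1.0  # assumed base backoff between attempts


def write_dataset_with_retries(data, base_dir, **kwargs):
    """Retry pyarrow's write_dataset on transient I/O failures,
    e.g. R2's intermittent InvalidPart on multipart uploads."""
    for attempt in range(1, NUM_RETRIES + 1):
        try:
            ds.write_dataset(data, base_dir, **kwargs)
            return
        except OSError:
            # PyArrow raises OSError for filesystem/object-store errors.
            if attempt == NUM_RETRIES:
                raise
            time.sleep(BACKOFF_SECONDS * attempt)


# Usage: existing_data_behavior="overwrite_or_ignore" keeps a retried
# write from failing on files left behind by a previous attempt.
table = pa.table({"a": range(100)})
write_dataset_with_retries(
    table,
    "out/",
    format="parquet",
    existing_data_behavior="overwrite_or_ignore",
)
```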