Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

✨Feat(source-s3): add file-transfer feature to s3 #48346

Open
wants to merge 6 commits into
base: master
Choose a base branch
from

Conversation

aldogonzalez8
Copy link
Contributor

@aldogonzalez8 aldogonzalez8 commented Nov 5, 2024

What

Implement file-transfer feature for source-s3

How

It updated the spec to show the option that leverages cdk logic to handle the new record type and add file_size() and get_file() to the stream reader.

Review guide

  1. airbyte-integrations/connectors/source-s3/source_s3/v4/stream_reader.py: file_size() and get_file() to the stream reader
  2. airbyte-integrations/connectors/source-s3/source_s3/v4/config.py: update spec.
  3. docs/integrations/sources/s3.md: updated doc.

Note: We need to merge first this other PR.

User Impact

If they just need to move files as they are, this will greatly improve users' experience as sync will be quite fast.

Can this PR be safely reverted and rolled back?

  • YES 💚
  • NO ❌

Copy link

vercel bot commented Nov 5, 2024

The latest updates on your projects. Learn more about Vercel for Git ↗︎

Name Status Preview Comments Updated (UTC)
airbyte-docs ✅ Ready (Inspect) Visit Preview 💬 Add feedback Nov 6, 2024 3:05pm

@aldogonzalez8
Copy link
Contributor Author

aldogonzalez8 commented Nov 5, 2024

/approve-regression-tests

Check job output.

✅ Approving regression tests

@aldogonzalez8 aldogonzalez8 requested a review from a team November 6, 2024 15:47
@aldogonzalez8 aldogonzalez8 changed the title Feat(file-transfer): add file-transfer feature to s3 ✨Feat(source-s3): add file-transfer feature to s3 Nov 6, 2024
@aaronsteers aaronsteers self-requested a review November 7, 2024 16:49
Copy link
Collaborator

@aaronsteers aaronsteers left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No surprises - all looks good! 👍

🚀

@@ -183,6 +187,40 @@ def open_file(self, file: RemoteFile, mode: FileReadMode, encoding: Optional[str
# we can simply return the result here as it is a context manager itself that will release all resources
return result

@override
def get_file(self, file: RemoteFile, local_directory: str, logger: logging.Logger):
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I notice we're returning a dict here. Could we add a return type to the signature here, and a quick docstring explaining the expected behavior and return value?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Makes sense, let me do that. I will add it to the CDK side as well.

self.s3_client.download_file(self.config.bucket, file.uri, local_file_path)
write_duration = time.time() - start_download_time
logger.info(f"Finished downloading the file {file.uri} and saved to {local_file_path} in {write_duration:,.2f} seconds.")
return {"file_url": absolute_file_path, "bytes": file_size, "file_relative_path": file_relative_path}
Copy link
Collaborator

@aaronsteers aaronsteers Nov 7, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For my info, do we also share somewhere the original full URL?

Like s3://my_bucket/path/to/file?

Totally okay if not, because I know there's nothing using it yet, but want to make sure I understand the behavior.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you mean for the record? We do it here in the CDK here.

Copy link
Collaborator

@aaronsteers aaronsteers Nov 7, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@aldogonzalez8 - Thanks for pointing me to that. Sadly, last time I was working on this, I think what I found was that file.uri is a lie - and it is only the relative path part without bucket info or any "absolute" reference point. But if we fix that, the code on your side looks good.

No need to block on this; just wanted to make sure I had the right info.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area/connectors Connector related issues area/documentation Improvements or additions to documentation connectors/source/s3
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants