✨Feat(source-s3): add file-transfer feature to s3 #48346
base: master
Conversation
/approve-regression-tests
No surprises - all looks good! 👍
🚀
@@ -183,6 +187,40 @@ def open_file(self, file: RemoteFile, mode: FileReadMode, encoding: Optional[str
        # we can simply return the result here as it is a context manager itself that will release all resources
        return result

    @override
    def get_file(self, file: RemoteFile, local_directory: str, logger: logging.Logger):
I notice we're returning a dict here. Could we add a return type to the signature here, and a quick docstring explaining the expected behavior and return value?
Makes sense, let me do that. I will add it to the CDK side as well.
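For reference, a hedged sketch of what the annotated signature and docstring could look like. The return keys come from the dict in the diff below; the import paths, class placement, and docstring wording are assumptions for illustration, not the final implementation.

```python
# Sketch only: import paths and docstring wording are assumptions.
import logging
from typing import Any, Dict

from typing_extensions import override  # assumed; `typing.override` on Python 3.12+

from airbyte_cdk.sources.file_based.remote_file import RemoteFile  # assumed CDK import path


class SourceS3StreamReader:  # illustrative placement only
    @override
    def get_file(self, file: RemoteFile, local_directory: str, logger: logging.Logger) -> Dict[str, Any]:
        """Download `file` from S3 into `local_directory`.

        Returns a dict describing the transferred file:
            file_url: absolute local path of the downloaded file
            bytes: size of the downloaded file in bytes
            file_relative_path: path of the file relative to `local_directory`
        """
        ...
```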
        self.s3_client.download_file(self.config.bucket, file.uri, local_file_path)
        write_duration = time.time() - start_download_time
        logger.info(f"Finished downloading the file {file.uri} and saved to {local_file_path} in {write_duration:,.2f} seconds.")
        return {"file_url": absolute_file_path, "bytes": file_size, "file_relative_path": file_relative_path}
For my info, do we also share the original full URL somewhere, like s3://my_bucket/path/to/file?
Totally okay if not, because I know there's nothing using it yet, but I want to make sure I understand the behavior.
Do you mean for the record? We do that in the CDK here.
@aldogonzalez8 - Thanks for pointing me to that. Sadly, last time I was working on this, I think what I found was that file.uri is a lie - it is only the relative path, without bucket info or any "absolute" reference point. But if we fix that, the code on your side looks good.
No need to block on this; I just wanted to make sure I had the right info.
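For what it's worth, if the full URL is ever needed, here is a minimal sketch of composing it from the bucket and the relative key, assuming file.uri stays the bucket-relative path described above (nothing in the connector emits this value yet):

```python
# Minimal sketch, assuming `file.uri` is the bucket-relative key as discussed above.
def full_s3_url(bucket: str, relative_key: str) -> str:
    """Build the original s3:// URL, e.g. s3://my_bucket/path/to/file."""
    return f"s3://{bucket}/{relative_key.lstrip('/')}"
```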
What
Implement file-transfer feature for source-s3
How
This PR updates the spec to expose the option that leverages the CDK logic to handle the new record type, and adds file_size() and get_file() to the stream reader. (A rough sketch of the shape of these additions is included after the review guide below.)
Review guide
airbyte-integrations/connectors/source-s3/source_s3/v4/stream_reader.py: adds file_size() and get_file() to the stream reader.
airbyte-integrations/connectors/source-s3/source_s3/v4/config.py: updates the spec.
docs/integrations/sources/s3.md: updates the docs.
Note: this other PR needs to be merged first.
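As referenced above, a rough sketch of the shape of the two new stream reader methods. The boto3-style calls (head_object, download_file), the local path handling, and the class name are assumptions for illustration, not the exact connector code.

```python
# Rough sketch under stated assumptions: boto3-style client calls and a
# bucket-relative file.uri. Not the exact connector implementation.
import logging
import os
from typing import Any, Dict


class StreamReaderSketch:
    def __init__(self, s3_client, bucket: str) -> None:
        self.s3_client = s3_client
        self.bucket = bucket

    def file_size(self, file) -> int:
        """Return the size of the remote object in bytes via a HEAD request."""
        response = self.s3_client.head_object(Bucket=self.bucket, Key=file.uri)
        return response["ContentLength"]

    def get_file(self, file, local_directory: str, logger: logging.Logger) -> Dict[str, Any]:
        """Download the object and return the fields used by the file-transfer record."""
        local_file_path = os.path.join(local_directory, file.uri)
        os.makedirs(os.path.dirname(local_file_path), exist_ok=True)
        self.s3_client.download_file(self.bucket, file.uri, local_file_path)
        logger.info(f"Finished downloading {file.uri} to {local_file_path}.")
        return {
            "file_url": os.path.abspath(local_file_path),
            "bytes": self.file_size(file),
            "file_relative_path": file.uri,
        }
```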
User Impact
For users who just need to move files as they are, this will greatly improve the experience, since syncs will be much faster.
Can this PR be safely reverted and rolled back?