Skip to content
This repository has been archived by the owner on Apr 26, 2024. It is now read-only.

Handle boto3 clients more efficiently with lru_cache #361

Closed

Conversation

mattiamatrix
Copy link

@mattiamatrix mattiamatrix commented Dec 19, 2023

Closes #332

Example

Screenshots

Checklist

  • References any related issue by including "Closes #" or "Closes ".
    • If no issue exists and your change is not a small fix, please create an issue first.
  • Includes tests or only affects documentation.
  • Passes pre-commit checks.
    • Run pre-commit install && pre-commit run --all locally for formatting and linting.
  • Includes screenshots of documentation updates.
    • Run mkdocs serve view documentation locally.
  • Summarizes PR's changes in CHANGELOG.md

@mattiamatrix mattiamatrix marked this pull request as ready for review December 19, 2023 21:32
@mattiamatrix mattiamatrix requested a review from a team as a code owner December 19, 2023 21:32
Copy link
Member

@desertaxle desertaxle left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the contribution @mattiamatrix! This looks like a great optimization, but I'd like to suggest some tweaks:

  1. I think we should move the changes into the get_client method so all the client types can benefit from this optimization.
  2. We should maintain a cache of clients with their parameters as the cache key. In the current implementation, if any attributes are changed on the class, the client returned would not be updated to reflect those changes. By introducing a LRU cache, we can improve performance while maintaining configurability on the fly.

Let me know if you'd like me to go into more detail for either of those points!

prefect_aws/s3.py Outdated Show resolved Hide resolved
@mattiamatrix
Copy link
Author

Thanks for the contribution @mattiamatrix! This looks like a great optimization, but I'd like to suggest some tweaks:

  1. I think we should move the changes into the get_client method so all the client types can benefit from this optimization.
  2. We should maintain a cache of clients with their parameters as the cache key. In the current implementation, if any attributes are changed on the class, the client returned would not be updated to reflect those changes. By introducing a LRU cache, we can improve performance while maintaining configurability on the fly.

Let me know if you'd like me to go into more detail for either of those points!

Thank you for the review @desertaxle. I made the changes as you described. Let me know what you think now.
I had to add a little bit of code to follow pydantic/pydantic#3376.

@mattiamatrix mattiamatrix changed the title handle s3_client more efficiently Handle boto3 clients more efficiently with lru_cache Dec 22, 2023
@mattiamatrix
Copy link
Author

Hello @desertaxle @zzstoatzz, is the PR ok now?

@zzstoatzz
Copy link
Contributor

hi @mattiamatrix - Alex is currently out-of-office for a while, so I can take over reviewing this.

Let me take a look here / get caught up and do some QA and I will get back to you on this - appreciate the contribution! 🙂

@zzstoatzz
Copy link
Contributor

hey @mattiamatrix - this looks pretty good to me. Can you add a test for the new functionality, maybe something like this?

from prefect_aws.credentials import _get_client_cached, AwsCredentials, ClientType
from unittest.mock import patch

@patch('prefect_aws.credentials._get_client_cached')
def test_get_client_cached(mock_get_client_cached):
    """
    Test to ensure that _get_client_cached function returns the same instance
    for multiple calls with the same parameters and properly utilizes lru_cache.
    """

    # Create a mock AwsCredentials instance
    aws_credentials_block = AwsCredentials()

    # Call _get_client_cached multiple times with the same parameters
    _get_client_cached(aws_credentials_block, ClientType.S3)
    _get_client_cached(aws_credentials_block, ClientType.S3)

    # Verify that _get_client_cached is called only once due to caching
    mock_get_client_cached.assert_called_once_with(aws_credentials_block, ClientType.S3)

    # Optionally, you can test with different parameters to ensure they are cached separately
    _get_client_cached(aws_credentials_block, ClientType.ECS)
    assert mock_get_client_cached.call_count == 2, "Should be called twice with different parameters"

@mattiamatrix
Copy link
Author

Hi @zzstoatzz, thank you for the snippet. I added the code that you shared but the test is failing with error AssertionError: Expected '_get_client_cached' to be called once. Called 0 times. since call_count is always 0.

@zzstoatzz it would be very helpful if you could directly update the code and push the correct test. Thank you.

@zzstoatzz
Copy link
Contributor

sorry @mattiamatrix - i should have clarified that that was pseudocode based on my memory of the mocks / fixtures available

i do not have time to do this now but if you're unable then I can come back to this later

@mattiamatrix
Copy link
Author

sorry @mattiamatrix - i should have clarified that that was pseudocode based on my memory of the mocks / fixtures available

i do not have time to do this now but if you're unable then I can come back to this later

Ok, I pushed a version that is using cache_info() instead to get a similar result.

@zzstoatzz
Copy link
Contributor

hi @mattiamatrix - after discussing this internally, I think we should change how we're getting that __hash__

as is:

  • if you changed field values, we'd still hit the cache since we're just looking at the id
  • successively loaded instances of AwsCredentials would not hit the cache (share clients) because they would have different values for id

instead of just the id, the __hash__ should be derived from the hashes of all the values of all the fields on AwsCredentials (or alternatively, we could just use the underlying return value of get_params_override as the cache key instead of the AwsCredentials object)

@mattiamatrix
Copy link
Author

Close in favour of #369

@mattiamatrix mattiamatrix deleted the use-s3-client-more-efficiently branch January 19, 2024 18:22
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

put_directory creates a new s3_client for each file uploaded
3 participants