
Configure modified retry for uploading partition yamls to GCS #1292

Open
karcot1 opened this issue Oct 8, 2024 · 1 comment
Labels
priority: p1 High priority. Fix may be included in the next release. type: bug Error or flaw in code with unintended results or allowing sub-optimal usage patterns.


karcot1 commented Oct 8, 2024

When uploading a large number of YAMLs (10,000) to GCS, we are encountering a 503 error. This indicates the connection to GCS was interrupted, which is commonly encountered when uploading a large number of files via Python.

The exact error message:
'Request failed with status code', 503, 'Expected one of', <HTTPStatus.OK: 200>

The standard recommendation for 5xx errors is to retry, and the client library provides sample code as a reference: https://github.com/googleapis/python-storage/blob/main/samples/snippets/storage_configure_retries.py#L55

The request is to please modify _write_gcs_file in gcs_helper.py to use retries.

Suggestion (based on above sample code):

from google.cloud import storage
from google.cloud.storage.retry import DEFAULT_RETRY

def _write_gcs_file(file_path: str, data: str):
    gcs_bucket = get_gcs_bucket(file_path)
    blob = gcs_bucket.blob(_get_gcs_file_path(file_path))

    # Allow up to 500 s overall; note that in newer google-cloud-storage
    # releases, with_deadline is a deprecated alias for with_timeout.
    modified_retry = DEFAULT_RETRY.with_deadline(500.0)
    # Exponential backoff: start at 1.5 s, grow by 1.2x, cap at 45 s.
    modified_retry = modified_retry.with_delay(initial=1.5, multiplier=1.2, maximum=45.0)

    blob.upload_from_string(data, retry=modified_retry)

Ideally, the values (deadline, initial, multiplier, maximum) would be parameterized so that end users can tune them for optimal performance.
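As a rough illustration of what the initial/multiplier/maximum parameters control, here is a small standalone sketch with no GCS dependency (the function name backoff_schedule is made up for illustration) of the delay sequence an exponential-backoff retry would sleep between attempts:

```python
def backoff_schedule(initial: float, multiplier: float, maximum: float, attempts: int):
    """Yield the sleep interval (seconds) before each retry attempt,
    growing geometrically and capped at `maximum` -- the same knobs
    the suggestion above passes to with_delay()."""
    delay = initial
    for _ in range(attempts):
        yield min(delay, maximum)
        delay *= multiplier

# With the values suggested above, the first few delays are
# 1.5, 1.8, 2.16, ... seconds, never exceeding 45.0.
print([round(d, 2) for d in backoff_schedule(1.5, 1.2, 45.0, 5)])
```

This makes it easy to sanity-check parameter choices: with the suggested values, roughly the first 20 retries all fit comfortably inside the 500-second deadline.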

@helensilva14 helensilva14 added type: bug Error or flaw in code with unintended results or allowing sub-optimal usage patterns. priority: p0 Highest priority. Critical issue. Will be fixed prior to next release. labels Oct 8, 2024
@helensilva14 helensilva14 self-assigned this Oct 8, 2024
sundar-mudupalli-work (Collaborator) commented Oct 9, 2024

Hi,

Review of the customer usage showed that the customer had up to 8 separate generate-partitions executions running in parallel. Given how long it takes to write to GCS (about 4 hours for 10,000 objects), and that each DVT execution is single-threaded, it is unclear how a single run raised 503 errors. DVT creates a client for each file, as outlined below, which could exacerbate the situation.

When writing a large number of GCS files for partitions, it can take about 1.25 seconds to write each file to GCS, which is awfully slow. The problem is in the get_gcs_bucket function, which creates an HTTP client and a connection to the bucket on every call. With 10,000 files, this adds up to a very long time. However, if the return value from get_gcs_bucket is cached, writes of small files to GCS improve dramatically (> 20x).
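The caching fix can be sketched without touching GCS at all; below, the expensive client/bucket construction is stood in for by a counter (the name get_gcs_bucket_cached is illustrative, not DVT's actual code):

```python
import functools

call_count = 0

@functools.lru_cache(maxsize=None)
def get_gcs_bucket_cached(bucket_name: str):
    """Stand-in for get_gcs_bucket: pretend each call builds an HTTP
    client and resolves the bucket (the ~1.25 s cost described above)."""
    global call_count
    call_count += 1
    return f"<bucket handle for {bucket_name}>"

# 10,000 writes to the same bucket now pay the construction cost once.
for _ in range(10_000):
    get_gcs_bucket_cached("my-partition-bucket")

print(call_count)  # -> 1
```

In the real helper, keying the cache on the bucket name (rather than the full file path) keeps one cached handle per bucket.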

It may make sense to restructure the gcs_helper package as a class, perhaps renamed to filsys_helper, which would allow DVT to interact with Cloud Storage or the file system as specified by the user. If the customer specifies a GCS destination (or source), we can save the storage_client.get_bucket('bucket_name') result in the class so that it can be reused. This will significantly improve performance.
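A minimal sketch of that restructuring, with the storage client injected so the bucket handle is resolved once and reused (class and method names here are hypothetical, not DVT's actual API; a tiny fake client stands in for google.cloud.storage):

```python
class FilesysHelper:
    """Hypothetical helper that caches one bucket handle per bucket name."""

    def __init__(self, storage_client):
        self._client = storage_client
        self._buckets = {}  # bucket name -> cached bucket handle

    def _bucket(self, name: str):
        # Resolve the bucket once; reuse the handle on later writes.
        if name not in self._buckets:
            self._buckets[name] = self._client.get_bucket(name)
        return self._buckets[name]

    def write_file(self, bucket_name: str, path: str, data: str):
        blob = self._bucket(bucket_name).blob(path)
        blob.upload_from_string(data)

# A fake client demonstrates that the handle is fetched only once:
class _FakeBlob:
    def upload_from_string(self, data):
        pass

class _FakeBucket:
    def blob(self, path):
        return _FakeBlob()

class _FakeClient:
    def __init__(self):
        self.get_bucket_calls = 0
    def get_bucket(self, name):
        self.get_bucket_calls += 1
        return _FakeBucket()

client = _FakeClient()
helper = FilesysHelper(client)
for i in range(100):
    helper.write_file("partitions", f"part_{i}.yaml", "key: value")

print(client.get_bucket_calls)  # -> 1
```

Injecting the client also makes the helper easy to unit-test against a local filesystem or a fake, which fits the goal of supporting both GCS and local paths.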

Sundar Mudupalli

@helensilva14 helensilva14 added priority: p1 High priority. Fix may be included in the next release. and removed priority: p0 Highest priority. Critical issue. Will be fixed prior to next release. labels Oct 18, 2024