[ETL-683] Increase memory of raw lambda #140

Merged
merged 2 commits on Sep 12, 2024
2 changes: 2 additions & 0 deletions .github/workflows/upload-and-deploy.yaml
@@ -132,11 +132,13 @@ jobs:
- name: Install additional python dependency
  run: |
    pipenv install ecs_logging~=2.0
    pipenv install pytest-datadir

- name: Test lambda scripts with pytest
  run: |
    pipenv run python -m pytest tests/test_s3_event_config_lambda.py -v
    pipenv run python -m pytest tests/test_s3_to_glue_lambda.py -v
    pipenv run python -m pytest -v tests/test_lambda_raw.py
Contributor

Nit: Depending on whether this PR or mine gets merged first (there may be a slight merge conflict), I think you can combine the tests like:

- name: Test scripts with pytest (lambda, etc.)
  run: |
    pipenv run python -m pytest \
      tests/test_s3_event_config_lambda.py \
      tests/test_s3_to_glue_lambda.py \
      tests/test_lambda_dispatch.py \
      tests/test_consume_logs.py \
      tests/test_lambda_raw.py -v

Contributor Author

Eh, it's better for the commands to be discrete and explicit, imo. I'm not sure whether the behavior is the same running the tests one at a time (separate "pipenv run" invocations) vs. running them all at once (a single "pipenv run"). But if there are advantages other than brevity, I'm open to running all the tests at once.
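One concrete difference between the two forms: a single invocation shares one pytest session (and therefore any session-scoped fixtures and cached imports) across all the listed files, while separate commands each start a fresh interpreter and report their own exit code. A rough single-session equivalent using pytest.main, illustrative only and not part of this PR:

import pytest

# Runs all three files in one pytest session and returns a single exit code.
exit_code = pytest.main([
    "-v",
    "tests/test_s3_event_config_lambda.py",
    "tests/test_s3_to_glue_lambda.py",
    "tests/test_lambda_raw.py",
])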


- name: Test dev synapse folders for STS access with pytest
  run: >
61 changes: 31 additions & 30 deletions src/lambda_function/raw/app.py
@@ -211,35 +211,37 @@ def yield_compressed_data(object_stream: io.BytesIO, path: str, part_threshold=None
    part_threshold = 8 * 1024 * 1024
    with zipfile.ZipFile(object_stream, "r") as zip_stream:
        with zip_stream.open(path, "r") as json_stream:
            compressed_data = io.BytesIO()
            # analogous to the part number of a multipart upload
            chunk_number = 1
            with gzip.GzipFile(
                filename=os.path.basename(path),
                fileobj=compressed_data,
                compresslevel=6,
                mode="wb",
            ) as gzip_file:
                # We can expect at least 10x compression, so reading/writing the
                # JSON in 10*part_threshold chunks ensures we do not flush the
                # gzip buffer too often, which can slow the write process significantly.
                compression_factor = 10
                for chunk in iter(
                    lambda: json_stream.read(compression_factor * part_threshold), b""
                ):
                    gzip_file.write(chunk)
                    # .flush() ensures that .tell() gives us an accurate byte count.
                    gzip_file.flush()
                    if compressed_data.tell() >= part_threshold:
                        yield compressed_data_wrapper(
                            compressed_data=compressed_data, chunk_number=chunk_number
                        )
                        compressed_data.seek(0)
                        compressed_data.truncate(0)
                        chunk_number = chunk_number + 1
            yield compressed_data_wrapper(
                compressed_data=compressed_data, chunk_number=chunk_number
            )
            with io.BytesIO() as compressed_data:
                # analogous to the part number of a multipart upload
                chunk_number = 1
                with gzip.GzipFile(
                    filename=os.path.basename(path),
                    fileobj=compressed_data,
                    compresslevel=6,
                    mode="wb",
                ) as gzip_file:
                    # We can expect at least 10x compression, so reading/writing the
                    # JSON in 10*part_threshold chunks ensures we do not flush the
                    # gzip buffer too often, which can slow the write process significantly.
                    compression_factor = 10
                    for chunk in iter(
                        lambda: json_stream.read(compression_factor * part_threshold),
                        b"",
                    ):
                        gzip_file.write(chunk)
                        # .flush() ensures that .tell() gives us an accurate byte count.
                        gzip_file.flush()
                        if compressed_data.tell() >= part_threshold:
                            yield compressed_data_wrapper(
                                compressed_data=compressed_data,
                                chunk_number=chunk_number,
                            )
                            compressed_data.seek(0)
                            compressed_data.truncate(0)
                            chunk_number = chunk_number + 1
                yield compressed_data_wrapper(
                    compressed_data=compressed_data, chunk_number=chunk_number
                )
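As a sanity check, the rewritten generator can be exercised locally by feeding it an in-memory zip and a deliberately small part threshold so that multiple parts are produced. A minimal sketch, with a hypothetical file name and payload, assuming the import path used by this repo's tests:

import io
import zipfile

from src.lambda_function.raw.app import yield_compressed_data

def make_zip_stream(path: str, payload: bytes) -> io.BytesIO:
    """Build an in-memory zip archive containing a single file."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as zip_file:
        zip_file.writestr(path, payload)
    buffer.seek(0)
    return buffer

# ~1.7 MB of JSON lines; the tiny part_threshold forces multiple yields.
stream = make_zip_stream("data.json", b'{"key": "value"}\n' * 100_000)
parts = yield_compressed_data(stream, "data.json", part_threshold=1024)
for part_number, _part in enumerate(parts, start=1):
    print(f"yielded part {part_number}")

A side benefit of the "with io.BytesIO() as compressed_data:" change is that the buffer is now closed deterministically when the generator finishes or is abandoned, rather than waiting for garbage collection.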


def compressed_data_wrapper(compressed_data: io.BytesIO, chunk_number: int):
@@ -334,4 +336,3 @@ def main(event: dict, s3_client: boto3.client, raw_bucket: str, raw_key_prefix:
    logger.info(
        f"Complete multipart upload response: {completed_upload_response}"
    )
    return completed_upload_response
2 changes: 1 addition & 1 deletion src/lambda_function/raw/template.yaml
@@ -52,7 +52,7 @@ Resources:
Handler: app.lambda_handler
Runtime: !Sub "python${LambdaPythonVersion}"
Role: !Ref RoleArn
MemorySize: 1024
MemorySize: 1769
Contributor
@rxu17 rxu17 Sep 11, 2024

Nit: Something I was thinking of, but do we want to make this a variable instead? That way the develop version of the lambda could declare a smaller memory size than the prod version, since it won't use as much. Or were you running into this for both?

Contributor Author
@philerooski philerooski Sep 11, 2024

I don't think there's much benefit. This is not an expensive function to run, so no harm in keeping the configurations the same.

EphemeralStorage:
  Size: 2048
Timeout: 900
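Worth noting: 1,769 MB is the allocation at which Lambda provides the equivalent of one full vCPU, which is presumably why this specific value was chosen over a round number. A quick way to confirm the deployed setting with boto3; this is a sketch, and the function name is a placeholder:

import boto3

lambda_client = boto3.client("lambda")
# FunctionName below is hypothetical; substitute the deployed raw lambda's name.
config = lambda_client.get_function_configuration(FunctionName="raw-lambda")
assert config["MemorySize"] == 1769  # 1,769 MB corresponds to one full vCPU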
2 changes: 1 addition & 1 deletion tests/test_lambda_raw.py
@@ -4,7 +4,7 @@
import zipfile

import boto3
import pytest
import pytest # requires pytest-datadir to be installed
from moto import mock_s3

import src.lambda_function.raw.app as app
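For reference, pytest-datadir supplies the datadir and shared_datadir fixtures, which copy fixture files into a temporary directory for each test. A minimal illustration, with a hypothetical test and file name not from this repository:

def test_reads_fixture_file(shared_datadir):
    # shared_datadir is a pathlib.Path pointing at a per-test copy of the
    # data/ directory next to this test module, so the test can read (or
    # even modify) fixture files without touching the repository copy.
    payload = (shared_datadir / "sample.json").read_text()
    assert payload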