
[Bug]: Permissions error uploading prepared XML records to S3 #878

Closed · TylerHendrickson opened this issue Jul 31, 2024 · 2 comments

Labels: bug (Something isn't working), Grant Finder

TylerHendrickson (Member) commented Jul 31, 2024

Why is this issue important?

We cannot save data from Grants.gov pertaining to new or modified grant opportunities. This issue currently only affects Staging.

Current State

As of the deployment of #848 to Staging, the SplitGrantsGovXMLDB Lambda function is unable to upload per-grant XML files to the S3 "prepared data" bucket. When the Lambda function attempts to do so, it receives an AccessDenied response from S3 and logs Error uploading prepared grant record to S3 (see example logs via the Datadog query service:grants-ingest "Error uploading prepared grant record to S3" functionname:grants_ingest-splitgrantsgovxmldb env:staging). This issue appears to be due to incorrect or missing permissions in the SplitGrantsGovXMLDB Lambda function's execution role.

Background: The changes in #848 introduced a new filename (object key) naming scheme for the S3 "prepared data" bucket. Previously, object keys for Grants.gov XML records always followed a pattern of <id prefix>/<full id>/grants.gov/v2.xml. Now, instead of /v2.xml, object keys may have either of the following suffixes:

  • /v2.OpportunitySynopsisDetail_1_0.xml
  • /v2.OpportunityForecastDetail_1_0.xml

However, the permissions for SplitGrantsGovXMLDB were never updated to match the new object key convention, so the Lambda function is still only allowed to inspect and upload S3 objects with the /v2.xml suffix.

Note: This issue is also causing errors later in the PublishGrantEvents step, as the missing Grants.gov data is causing validation failures. Resolving the SplitGrantsGovXMLDB permissions issue should also resolve this downstream PublishGrantEvents validation issue.

Expected State

The SplitGrantsGovXMLDB Lambda function is permitted to upload per-grant XML records to the S3 "prepared data" bucket using the new object key suffixes.

Implementation Plan

Update permissions for Lambda execution role

In terraform/modules/SplitGrantsGovXMLDB/main.tf, update the Lambda function's AllowInspectS3PreparedData and AllowS3UploadPreparedData execution policy statements so that the resources list in both statements includes the following patterns:

  • ${data.aws_s3_bucket.prepared_data.arn}/*/*/grants.gov/v2.OpportunitySynopsisDetail_1_0.xml
  • ${data.aws_s3_bucket.prepared_data.arn}/*/*/grants.gov/v2.OpportunityForecastDetail_1_0.xml
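
For illustration, the updated statements might look roughly like the following sketch. The statement layout and actions here are assumptions (only the ARN patterns come from this issue), and the legacy /v2.xml pattern is retained on the assumption that backwards compatibility with existing objects is desired (see the discussion below):

```hcl
# Hypothetical excerpt of terraform/modules/SplitGrantsGovXMLDB/main.tf;
# statement structure and actions are illustrative only.
statement {
  sid     = "AllowS3UploadPreparedData"
  effect  = "Allow"
  actions = ["s3:PutObject"]
  resources = [
    # Legacy suffix, kept for backwards compatibility with existing objects:
    "${data.aws_s3_bucket.prepared_data.arn}/*/*/grants.gov/v2.xml",
    # New suffixes introduced by #848:
    "${data.aws_s3_bucket.prepared_data.arn}/*/*/grants.gov/v2.OpportunitySynopsisDetail_1_0.xml",
    "${data.aws_s3_bucket.prepared_data.arn}/*/*/grants.gov/v2.OpportunityForecastDetail_1_0.xml",
  ]
}
```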

Refactor Last-Modified Determination in SplitGrantsGovXMLDB handler

(See Option 3 in the comment below.)

We want to start using the "Grants Prepared Data" DynamoDB Table to determine whether an encountered XML record is new, modified, or unmodified. This replaces the existing behavior of checking S3 for the following conditions:

  1. Object does not exist: XML record is new and should be saved to S3
  2. Object exists and its modification timestamp is older than the XML record's LastUpdatedDate value: XML record is newer than the currently-persisted record and should replace the currently-persisted record in S3
  3. Object exists and its modification timestamp is the same as (or more recent than) the XML record's LastUpdatedDate value: XML record has been seen before and is unchanged since the last time it was encountered by SplitGrantsGovXMLDB
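
Roughly, that existing check amounts to the following (a sketch assuming the AWS SDK for Go v2; the handler's actual helper names and error handling differ):

```go
// Sketch only: the current S3-based check that compares an object's
// Last-Modified timestamp against the XML record's LastUpdatedDate.
package main

import (
	"context"
	"errors"
	"time"

	"github.com/aws/aws-sdk-go-v2/aws"
	"github.com/aws/aws-sdk-go-v2/service/s3"
	"github.com/aws/aws-sdk-go-v2/service/s3/types"
)

// shouldSaveToS3 reports whether the XML record is new or modified relative
// to what is currently persisted at the given bucket/key.
func shouldSaveToS3(ctx context.Context, client *s3.Client, bucket, key string, recordLastUpdated time.Time) (bool, error) {
	head, err := client.HeadObject(ctx, &s3.HeadObjectInput{
		Bucket: aws.String(bucket),
		Key:    aws.String(key),
	})
	if err != nil {
		var notFound *types.NotFound
		if errors.As(err, &notFound) {
			return true, nil // Condition 1: object does not exist; record is new.
		}
		return false, err
	}
	// Conditions 2 and 3: save only if the record is newer than the object.
	return recordLastUpdated.After(aws.ToTime(head.LastModified)), nil
}
```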

When using DynamoDB, the behavior is basically the same:

  1. DDB item does not exist: XML record is new (unrepresented in DynamoDB) and should be saved to S3
  2. DDB item exists and its LastUpdatedDate value is older than the XML record's LastUpdatedDate value: XML record is newer than the currently-persisted table item, so the XML record should replace the currently-persisted record in S3, which will eventually result in an update to the corresponding DynamoDB table item
  3. DDB item exists and its LastUpdatedDate value is the same as (or more recent than) the XML record's LastUpdatedDate value: XML record has been seen before and is unchanged since the last time it was persisted to DynamoDB.
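
A minimal sketch of the DynamoDB-based check (the table name, key schema, LastUpdatedDate attribute name, and timestamp format are all assumptions for illustration, not the handler's actual code):

```go
// Sketch only: the DynamoDB-based new/modified/unmodified determination.
// Table name, key schema, attribute name, and timestamp format are assumed.
package main

import (
	"context"
	"fmt"
	"time"

	"github.com/aws/aws-sdk-go-v2/aws"
	"github.com/aws/aws-sdk-go-v2/service/dynamodb"
	"github.com/aws/aws-sdk-go-v2/service/dynamodb/types"
)

// isNewOrModified reports whether an XML record should be (re)saved to S3,
// based on the LastUpdatedDate stored in the "Grants Prepared Data" table.
func isNewOrModified(ctx context.Context, ddb *dynamodb.Client, table, grantID string, recordLastUpdated time.Time) (bool, error) {
	out, err := ddb.GetItem(ctx, &dynamodb.GetItemInput{
		TableName:            aws.String(table),
		Key:                  map[string]types.AttributeValue{"grant_id": &types.AttributeValueMemberS{Value: grantID}},
		ProjectionExpression: aws.String("LastUpdatedDate"),
	})
	if err != nil {
		return false, err
	}

	// Case 1: no item exists, so the record is new and should be saved to S3.
	attr, ok := out.Item["LastUpdatedDate"]
	if !ok {
		return true, nil
	}
	s, ok := attr.(*types.AttributeValueMemberS)
	if !ok {
		return false, fmt.Errorf("unexpected type for LastUpdatedDate attribute")
	}
	stored, err := time.Parse(time.RFC3339, s.Value) // assumed storage format
	if err != nil {
		return false, err
	}

	// Case 2: record is newer than the persisted item; replace it in S3.
	// Case 3: same or older; the record is unchanged and can be skipped.
	return recordLastUpdated.After(stored), nil
}
```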


@TylerHendrickson TylerHendrickson added the bug Something isn't working label Jul 31, 2024
@TylerHendrickson TylerHendrickson self-assigned this Jul 31, 2024
TylerHendrickson (Member, Author) commented Aug 1, 2024

Adding some thoughts here related to the Additional considerations note in the issue description:

  • We need to ensure that this change is backwards-compatible with the old /v2.xml suffix. Otherwise, every Grants.gov XML record that currently exists in the prepared data bucket will be perceived as missing (and requiring an update) the next time SplitGrantsGovXMLDB runs. Details TBD

Options to consider:

  1. Do a one-time operation (in each deployment environment) to bulk-rename S3 object keys – every object with a key matching */*/grants.gov/v2.xml would be renamed to <equivalent>/<equivalent>/grants.gov/v2.OpportunitySynopsisDetail_1_0.xml. (A copy-then-delete sketch of this rename appears after this list.)
    • Pros
      • Generally straightforward
    • Cons:
      • Requires a more hands-on/coordinated roll-out to Production
      • Would affect around 80k S3 objects. I'm not particularly worried about cost, but it's hard to say offhand how long this operation would take, and it's probably one that should be scripted out (possibly to the grants-ingest CLI). It could end up being relatively high-effort for something so limited-use.
      • Related to the previous point: this would result in a lot of unnecessary downstream processing unless we disable the S3 event triggers for the new object key convention, and then re-enable them (which adds more coordination overhead for deployment).
      • This is a solvable problem, but we need to be careful about preserving object metadata and tags.
      • Resets object expiration for all Grants.gov records in S3, since copies are new objects; note that the vast majority of existing objects have already been transitioned to Glacier at this point.
  2. Do nothing, and allow every Grants.gov record to be recreated.
    • Pros:
      • Low effort
    • Cons:
      • Nearly doubles the bucket's object count. However, I'm not so worried about cost here (and we could go in after the fact and mass-delete all objects with keys like */*/grants.gov/v2.xml), so this is a pretty minimal concern.
      • Also has the effect of resetting object expiration for all Grants.gov records.
      • Also results in a lot of downstream processing.
  3. Add a DynamoDB table to serve as an index for mapping Grants.gov records to their respective S3 keys
    • Pros:
      • Doesn't necessarily preclude either of the prior two options (particularly "Do nothing").
      • Using DDB to index S3 object metadata and/or look up S3 keys is a fairly well-established pattern.
      • The particular value of an S3 object key becomes more arbitrary, so we can more easily change naming conventions again in the future if we need to.
      • We would be able to store and compare to the Grants.gov record's actual LastUpdatedDate rather than relying on S3 file modification timestamps as we do today, which is admittedly a bit kludgy.
      • Might enable more optimized/batched lookups (currently we run HEAD on every Grants.gov record on a daily basis)
      • If we wanted to (we might not), we could reuse the existing DynamoDB table by adding a new item attribute, although that could make lookups less optimized
    • Cons:
      • Level of effort
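
For reference, here is a minimal sketch of the per-object rename from Option 1, assuming the AWS SDK for Go v2. Bucket and key values are placeholders, and objects already transitioned to Glacier would need to be restored before they can be copied:

```go
// Sketch of the per-object rename in Option 1: copy to the new key while
// preserving metadata and tags, then delete the old key.
package main

import (
	"context"

	"github.com/aws/aws-sdk-go-v2/aws"
	"github.com/aws/aws-sdk-go-v2/service/s3"
	"github.com/aws/aws-sdk-go-v2/service/s3/types"
)

func renameObject(ctx context.Context, client *s3.Client, bucket, oldKey, newKey string) error {
	// COPY directives carry the source object's user metadata and tags over
	// to the new key, addressing the metadata-preservation concern above.
	// Note: CopySource must be URL-encoded if keys contain special characters.
	if _, err := client.CopyObject(ctx, &s3.CopyObjectInput{
		Bucket:            aws.String(bucket),
		Key:               aws.String(newKey),
		CopySource:        aws.String(bucket + "/" + oldKey),
		MetadataDirective: types.MetadataDirectiveCopy,
		TaggingDirective:  types.TaggingDirectiveCopy,
	}); err != nil {
		return err
	}
	_, err := client.DeleteObject(ctx, &s3.DeleteObjectInput{
		Bucket: aws.String(bucket),
		Key:    aws.String(oldKey),
	})
	return err
}
```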

TylerHendrickson (Member, Author) commented

Completed with #892. Additional fixes: #921, #922.
