nodeadm: retry IMDS 404 errors #1970

Merged
merged 1 commit into from
Oct 8, 2024

@ndbaker1 ndbaker1 commented Sep 17, 2024

Issue #, if available:

Description of changes:

nodeadm fails its IMDS calls when the instance's credentials are slow to propagate: IMDS returns a 404 along with an error indicating that no IAM credentials were provided for the instance.

nodeadm now checks for this error message and treats it as retryable.

Aug 28 19:12:30 localhost nodeadm[1491]: {"level":"info","ts":1724872350.0166261,"caller":"init/init.go:148","msg":"Fetching instance details.."}
Aug 28 19:12:30 localhost nodeadm[1491]: SDK 2024/08/28 19:12:30 DEBUG attempting waiter request, attempt count: 1
Aug 28 19:12:30 localhost nodeadm[1491]: SDK 2024/08/28 19:12:30 DEBUG request failed with unretryable error http response error StatusCode: 404, request to EC2 IMDS failed

cloud-init logs show IMDS resolving about 2 seconds later:

2024-08-28 19:12:32,448 - url_helper.py[DEBUG]: Read from http://169.254.169.254:80/2021-03-23/meta-data/instance-id (200, 19b) after 1 attempts

Other notable changes:

  • Removed all direct uses of the AWS IMDS client except in the internal helper implementation

By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.

Testing Done

We have a test case, https://github.com/awslabs/amazon-eks-ami/blob/71aa1082300050221bb940abf14f0ba635fc59eb/nodeadm/test/e2e/cases/imds-timeouts/run.sh, meant to show that nodeadm succeeds if IMDS comes up in the middle of execution.

@ndbaker1

/ci

@ndbaker1 roger that! I've dispatched a workflow. 👍

@ndbaker1 the workflow that you requested has completed. 🎉

| AMI variant | Build | Test |
|---|---|---|
| 1.23 / al2023 | success ✅ | success ✅ |
| 1.24 / al2023 | success ✅ | success ✅ |
| 1.25 / al2023 | success ✅ | success ✅ |
| 1.26 / al2023 | success ✅ | success ✅ |
| 1.27 / al2023 | success ✅ | success ✅ |
| 1.28 / al2023 | success ✅ | success ✅ |
| 1.29 / al2023 | success ✅ | success ✅ |
| 1.30 / al2023 | success ✅ | success ✅ |

mattcjo commented Sep 20, 2024

@ndbaker1 Quick update on the root cause: the node successfully authenticated with IMDS, but the resource it was trying to access was missing. Retry logic is still an appropriate solution here.

@ndbaker1 ndbaker1 marked this pull request as draft October 4, 2024 15:35
@ndbaker1 ndbaker1 marked this pull request as ready for review October 7, 2024 22:01
ndbaker1 commented Oct 7, 2024

Rerunning CI, as we reverted to the original approach.

/ci

github-actions bot commented Oct 7, 2024

@ndbaker1 roger that! I've dispatched a workflow. 👍

github-actions bot commented Oct 7, 2024

@ndbaker1 the workflow that you requested has completed. 🎉

| AMI variant | Build | Test |
|---|---|---|
| 1.23 / al2023 | success ✅ | success ✅ |
| 1.24 / al2023 | success ✅ | success ✅ |
| 1.25 / al2023 | success ✅ | success ✅ |
| 1.26 / al2023 | success ✅ | success ✅ |
| 1.27 / al2023 | success ✅ | success ✅ |
| 1.28 / al2023 | success ✅ | success ✅ |
| 1.29 / al2023 | success ✅ | success ✅ |
| 1.30 / al2023 | success ✅ | success ✅ |

@ndbaker1 ndbaker1 merged commit 093058d into awslabs:main Oct 8, 2024
10 checks passed
@ndbaker1 ndbaker1 deleted the imds branch October 8, 2024 06:06