Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Optimize cloud detection #483

Merged
merged 1 commit into from
Dec 25, 2023

Conversation

syuu1228
Copy link
Contributor

Currently get_cloud_instance() is very slow on GCE / Azure, since it fails to detect AWS metadata server, and retrying multiple times in curl().

Actually we don't really need to do detection, we can specify target cloud name statically when we build machine-image.

To do so, adding target cloud name to machine_image_info.yaml while building machine-image, and reference it from scylla_cloud.py.

Note that we still need to fallback to metadata server based detection, since there are users who manually install scylla-machine-image package , not using our machine-image.

Fixes #481

@syuu1228
Copy link
Contributor Author

@mykaul
Copy link

mykaul commented Sep 24, 2023

nit: you should be able to reliably detect cloud based on DMI data - from https://coroot.com/blog/cloud-metadata :

AWS Xen instances: the uuid from /sys/hypervisor/uuid has the "ec2" or "EC2" prefix
AWS Nitro instances: the content of /sys/class/dmi/id/board_vendor is "Amazon EC2"
GCP: the content of /sys/class/dmi/id/board_vendor is "Google"
Azure: the content of /sys/class/dmi/id/board_vendor is "Microsoft Corporation"

@syuu1228
Copy link
Contributor Author

nit: you should be able to reliably detect cloud based on DMI data - from https://coroot.com/blog/cloud-metadata :

AWS Xen instances: the uuid from /sys/hypervisor/uuid has the "ec2" or "EC2" prefix
AWS Nitro instances: the content of /sys/class/dmi/id/board_vendor is "Amazon EC2"
GCP: the content of /sys/class/dmi/id/board_vendor is "Google"
Azure: the content of /sys/class/dmi/id/board_vendor is "Microsoft Corporation"

I also considered that, it would works well on AWS / GCP, but on Azure DMI conteints are exactly same as baremetal Hyper-V.
To cause correct error message "Unknown cloud provider! Only AWS/GCP/Azure supported" on baremetal Hyper-V, we still need to keep metadata server access on azure_instance.is_azure_instance(), so we cannot drop network access entirely from cloud detection.

see: https://stackoverflow.com/questions/11570965/how-to-detect-azure-amazon-vm

Maybe we can wait metadata server access just for Azure, since it will likely success because in that case we already know it's not AWS nor GCP.

Or, maybe we can ignore about non-cloud environment, since scylla-machine-image script will does not work on such environment at some point.
But it will not clealer message I mentioned above, it will cause Traceback at some point instead.

lib/scylla_cloud.py Outdated Show resolved Hide resolved
@mykaul
Copy link

mykaul commented Sep 26, 2023

nit: you should be able to reliably detect cloud based on DMI data - from https://coroot.com/blog/cloud-metadata :

AWS Xen instances: the uuid from /sys/hypervisor/uuid has the "ec2" or "EC2" prefix
AWS Nitro instances: the content of /sys/class/dmi/id/board_vendor is "Amazon EC2"
GCP: the content of /sys/class/dmi/id/board_vendor is "Google"
Azure: the content of /sys/class/dmi/id/board_vendor is "Microsoft Corporation"

I also considered that, it would works well on AWS / GCP, but on Azure DMI conteints are exactly same as baremetal Hyper-V. To cause correct error message "Unknown cloud provider! Only AWS/GCP/Azure supported" on baremetal Hyper-V, we still need to keep metadata server access on azure_instance.is_azure_instance(), so we cannot drop network access entirely from cloud detection.

see: https://stackoverflow.com/questions/11570965/how-to-detect-azure-amazon-vm

Maybe we can wait metadata server access just for Azure, since it will likely success because in that case we already know it's not AWS nor GCP.

Or, maybe we can ignore about non-cloud environment, since scylla-machine-image script will does not work on such environment at some point. But it will not clealer message I mentioned above, it will cause Traceback at some point instead.

We can ignore Hyper-V, or treat it just like Azure - either path is fine, IMHO.

@fruch
Copy link
Collaborator

fruch commented Sep 26, 2023

@mykaul
Copy link

mykaul commented Sep 26, 2023

We could do something similar to

https://github.com/dgzlopes/cloud-detect/blob/master/cloud_detect/providers/azure_provider.py

This needs the networking stack to be up (and IPv4 configured, what about IPv6 only?) - we'd like to avoid that.

@avikivity
Copy link
Member

There are also metal instances, which might be different.

@mykaul
Copy link

mykaul commented Sep 26, 2023

There are also metal instances, which might be different.

That could be the fallback if above detection fails.

@syuu1228
Copy link
Contributor Author

syuu1228 commented Oct 2, 2023

Pushed new implementation which uses DMI for the detection, and async/await for metadata server access (when DMI probe failed).
The implementation referenced https://github.com/dgzlopes/cloud-detect, borrowed some ideas:

  • which DMI parameter to detect the cloud
  • order of access DMI and metadata server (try DMI first, then metadata server)
  • use async/await for metadata server access

@syuu1228
Copy link
Contributor Author

syuu1228 commented Oct 2, 2023

Ah, it breaking the unittest, fixing...

@syuu1228
Copy link
Contributor Author

syuu1228 commented Oct 2, 2023

Testcase fixed.

fruch
fruch previously approved these changes Oct 2, 2023
Copy link
Collaborator

@fruch fruch left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

even that I would prefer using https://github.com/dgzlopes/cloud-detect directly
and not copying it's implementation (even it's MIT license, and we could do it legally)

@yaronkaikov
Copy link
Collaborator

@syuu1228 Please rebase and re-test the verification job

@syuu1228
Copy link
Contributor Author

syuu1228 commented Oct 3, 2023

LGTM

even that I would prefer using https://github.com/dgzlopes/cloud-detect directly and not copying it's implementation (even it's MIT license, and we cloud do it legally)

We actually can package cloud-detect package into scylla-python3 by updating install-dependencies.sh in scylla repo, but I did not prefer to update that just for scylla-machine-image so I haven't added any package for scylla-machine-image scripts.
That's the reason why I haven't used cloud-detect in this PR, and also other reason is we already have metadata based detection and DMI based one should be very small change.

But if we copy scylla-cqlsh implementation which adds zipped pip package into relocatable package, then we can use pip packages on scylla-machine-image too, that would be alternative way to implement this.

@fruch
Copy link
Collaborator

fruch commented Oct 3, 2023

LGTM

even that I would prefer using https://github.com/dgzlopes/cloud-detect directly and not copying it's implementation (even it's MIT license, and we cloud do it legally)

We actually can package cloud-detect package into scylla-python3 by updating install-dependencies.sh in scylla repo, but I did not prefer to update that just for scylla-machine-image so I haven't added any package for scylla-machine-image scripts.
That's the reason why I haven't used cloud-detect in this PR, and also other reason is we already have metadata based detection and DMI based one should be very small change.

But if we copy scylla-cqlsh implementation which adds zipped pip package into relocatable package, then we can use pip packages on scylla-machine-image too, that would be alternative way to implement this.

A simple vendoring of the code as a submodule, might also work in this case.

@syuu1228
Copy link
Contributor Author

even that I would prefer using https://github.com/dgzlopes/cloud-detect directly and not copying it's implementation (even it's MIT license, and we cloud do it legally)

We actually can package cloud-detect package into scylla-python3 by updating install-dependencies.sh in scylla repo, but I did not prefer to update that just for scylla-machine-image so I haven't added any package for scylla-machine-image scripts.
That's the reason why I haven't used cloud-detect in this PR, and also other reason is we already have metadata based detection and DMI based one should be very small change.
But if we copy scylla-cqlsh implementation which adds zipped pip package into relocatable package, then we can use pip packages on scylla-machine-image too, that would be alternative way to implement this.

A simple vendoring of the code as a submodule, might also work in this case.

Tried to add cloud-detect as submodule and load from scylla-machine-image, but it seems difficult because the module is hevily depend with aiohttp, witch contains C extension.
Probably we should just update scylla-python3 dependencies to add the module since it's simplest solution.

@avikivity
Copy link
Member

Is there no simple way to do this without adding complicated packages?

@syuu1228
Copy link
Contributor Author

Is there no simple way to do this without adding complicated packages?

I think current patch on this PR does this.
It referenced cloud-detect, borrowed some ideas, but not adding additional packages:
#483 (comment)

@benipeled
Copy link
Contributor

@syuu1228 if there is nothing else to add for this change - please verify it and make sure the slowness issue resolved so we can merge it

Currently get_cloud_instance() is very slow on GCE / Azure, since it fails to
detect AWS metadata server, and retrying multiple times in curl().

To speed up cloud detection, we will introduce following:
 - Use DMI parameters for the detection when it's available
 - Use async/await for metadata server access

Fixes scylladb#481
@syuu1228
Copy link
Contributor Author

@yaronkaikov yaronkaikov merged commit d97e311 into scylladb:next Dec 25, 2023
1 check passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Long delay after SSH login
6 participants