CI Support for aarch64 (AWS graviton2) #78
Comments
I spent a while reading through the code on these. My current bias is towards simplicity. I have to recognize I'm not finding a lot of time, and some of these have a lot of complexity. While the complexity is hidden by Terraform, we don't have a good Terraform story (yet), and it's still complexity to manage/debug/fix. Given that, I am currently strongly biased towards the Envoy-style AWS ASG approach. It is, by far, the simplest approach here. Last night I ported the AMI generation from envoyproxy/ci-infra to make a GitHub runner -- osquery/infrastructure#7
What if we use one of our existing available CI runners (Linux/x86), but cross-compile for ARM and then use cross-execution to run the osquery tests (using …)?
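To make that concrete, here is a minimal sketch of what cross-execution could look like on a hosted x86 runner, assuming qemu-user-static binfmt emulation; the workflow name, container image, and binary path are illustrative placeholders, not a tested config:

    # Hypothetical sketch: run aarch64 binaries on an x86 GitHub-hosted runner
    # via qemu-user-static binfmt emulation. Image and paths are placeholders.
    name: aarch64-cross-test
    on: [push]
    jobs:
      cross-test:
        runs-on: ubuntu-latest
        steps:
          - uses: actions/checkout@v2
          - name: Register qemu-aarch64 handlers with binfmt_misc
            run: docker run --rm --privileged multiarch/qemu-user-static --reset -p yes
          - name: Run an aarch64 binary under emulation
            run: |
              # assumes aarch64 build artifacts already exist under ./build (e.g. from a cross-compile step)
              docker run --rm -v "$PWD:/src" -w /src arm64v8/ubuntu:20.04 \
                ./build/osquery/osqueryd --version

The cross-compile step itself would still run natively on x86; only the test binaries would pay the qemu overhead.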
On Slack a bit ago, Stefano said that was unacceptably slow. But maybe he was compiling under qemu.
Ah, I didn't see that conversation, but I think he told me today that the ARM-based AWS instance was unacceptably slow. Cross-compiling shouldn't be slow, and qemu overhead for cross-execution should be acceptable.
I'm not sure how fast you're expecting, but building on a Graviton2 instance on AWS it's about 6m15s to build without tests, and 6m43s with them.
That's plenty fast. He must've been talking about something else then. Regardless of speed, my suggestion was just about a possible way to build and test ARM without having to provision our own ARM-based CI runners on another cloud, until GitHub Actions gets an ARM CI runner, since it seems like we don't have the time to learn Terraform/Ansible, set up another cloud account, maintain it, pay for it, etc.
https://osquery.slack.com/archives/C019GR05SAH/p1599466550051900 (Alessandro, not Stefano) Time and money are a bit funny. We do have an AWS presence, and I'm ignoring the Terraform side and manually configuring. I'm currently testing CodeBuild and slowly trying to get a native runner up. Of course, I haven't yet broached …
I tried spinning up AWS CodeBuild (this is the AWS CI offering). I used an incredibly simple … and the build went smoothly. Took 1,123 seconds (about 4 minutes in cmake and submodules, and 15 minutes in the build). While that's quite a bit more than the 7ish minutes cited earlier, the CodeBuild tooling is nice: good display of things, but not as many platforms or options as GitHub. Still, if I can't get another strategy to work, we can probably figure out how to use this as a fallback.
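For reference, a CodeBuild buildspec for a CMake project is roughly this shape. This is only a sketch; the actual buildspec used above was elided, and the package list and build commands here are assumptions, not the real configuration:

    # Hypothetical buildspec.yml sketch for an aarch64 CodeBuild project.
    version: 0.2
    phases:
      install:
        commands:
          - apt-get update && apt-get install -y cmake git python3
      build:
        commands:
          - git submodule update --init --recursive
          - mkdir -p build && cd build
          - cmake ..
          - cmake --build . -j "$(nproc)"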
I used a VM that had more than the 8 vCPUs the CodeBuild VMs have, so that makes sense.
osquery/osquery-toolchain#23 is the Dockerfile I'm using to build the builders
Hey everyone, I'm the founder of Cirrus CI. We are collaborating with AWS folks to bring free managed Graviton2 CI for OSS projects, which we are about to announce. Would you like to try it out? It's as simple as configuring the Cirrus CI GitHub App and adding the following:
    # .cirrus.yml
    task:
      arm_container:
        image: ubuntu:latest
      script: uname -a

Cirrus CI will run such a task on an EKS cluster of Graviton2 instances. You can use containers of any size up to 8 CPUs, and up to 16 CPUs in total concurrently (for example, 8 concurrent tasks with 2 CPUs each).
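For osquery specifically, the task would presumably look closer to the sketch below; the image choice, cpu/memory sizing, and build commands are assumptions rather than a tested config:

    # Hypothetical .cirrus.yml sketch for an osquery aarch64 build on Graviton2.
    task:
      name: build_aarch64
      arm_container:
        image: ubuntu:20.04
        cpu: 8
        memory: 16G
      setup_script:
        - apt-get update && apt-get install -y cmake git python3
      build_script:
        - git submodule update --init --recursive
        - mkdir -p build && cd build
        - cmake .. && cmake --build . -j "$(nproc)"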
Hi @fkorotkov. Coincidentally, I've been reading about Cirrus CI, and am overjoyed you found this. I'd love to chat! I'd love a cleaner solution for aarch64, and we're starting to think about Apple's M1 as well. Does it make sense for us to find some time to chat, or should I just try this first?
Will be happy to chat! You can email me at [email protected] and we'll figure something out. For future researchers: there is a problem with Apple M1 because none of the existing virtualization technologies support it yet, and therefore it's impossible for CIs to provide ephemeral VMs. But if you have your own M1 hardware, Cirrus CI natively supports it via Persistent Workers. @directionless you probably read about them because of this comment: actions/runner#805 (comment)
Forgot to mention: if you are planning to experiment with Cirrus CI, I highly recommend checking out the Cirrus CLI, which can run Cirrus tasks locally. It's a great way to iterate quickly on the config.
FYI
Problem
osquery has had aarch64 support (osquery/osquery#6612) for a bit (huge shoutout to the contributors on that). The big sticking point in declaring it stable is adding it to CI.
Our last CI was Azure Pipelines; our current CI is GitHub Actions. Unfortunately, neither of these hosts aarch64 runners. But they both distribute runners for that platform, so you can run your own... (GitHub Actions is a fork of Azure Pipelines, so it's unsurprising they look similar.)
Possible Solutions
A short link dump, and discussion, of possible solutions.
Self Hosted Runner with an Auto Scaling Group
Envoy uses an AWS Auto Scaling Group to manage workers. These workers have some tooling to run a single job and then detach themselves. This feels very clean, in that it uses a simple AWS tool to handle availability.
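Roughly, each instance's user data registers a runner, runs one job, then pulls the instance out of the group so the ASG replaces it. A cloud-init flavored sketch of that flow, where the runner install path, token handling, and the run.sh --once behavior are assumptions rather than a tested setup:

    #cloud-config
    # Hypothetical user data for an ASG instance: register a GitHub runner,
    # run a single job, then terminate so the ASG launches a fresh replacement.
    runcmd:
      - cd /opt/actions-runner
      - ./config.sh --unattended --url https://github.com/osquery/osquery --token "${RUNNER_REGISTRATION_TOKEN}"  # token fetched elsewhere, e.g. from SSM
      - ./run.sh --once   # assumed single-job mode; newer runner releases use --ephemeral
      - INSTANCE_ID="$(curl -s http://169.254.169.254/latest/meta-data/instance-id)"
      - aws autoscaling terminate-instance-in-auto-scaling-group --instance-id "$INSTANCE_ID" --no-should-decrement-desired-capacity

The nice property is that all of the "one job per clean machine" logic lives in the instance itself; the ASG only has to keep N instances alive.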
References:
Self Hosted Runner in Kubernetes (EKS)
We could host runners as pods in a Kubernetes cluster. This is appealing in its simplicity, at least once you accept Kubernetes.
I think this has some potential drawbacks around security. I don't think pods are as isolated as we might like them to be.
There's also a drawback in that we have to bring in Kubernetes. I have some experience there (Kolide runs several clusters), but it would be new to the osquery project.
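For a sense of shape, a runner Deployment pinned to Graviton2 nodes might look like the sketch below. The container image and environment variable names are hypothetical placeholders, and note that the pods still share the node's kernel, which is the isolation concern above:

    # Hypothetical sketch: self-hosted runners as a Deployment on EKS Graviton2 nodes.
    apiVersion: apps/v1
    kind: Deployment
    metadata:
      name: osquery-aarch64-runner
    spec:
      replicas: 2
      selector:
        matchLabels:
          app: osquery-aarch64-runner
      template:
        metadata:
          labels:
            app: osquery-aarch64-runner
        spec:
          nodeSelector:
            kubernetes.io/arch: arm64            # schedule only onto Graviton2 nodes
          containers:
            - name: runner
              image: example.com/osquery/github-runner:aarch64   # hypothetical image
              env:
                - name: RUNNER_REPO_URL          # illustrative variable names
                  value: https://github.com/osquery/osquery
                - name: RUNNER_TOKEN
                  valueFrom:
                    secretKeyRef:
                      name: runner-registration
                      key: token
              resources:
                requests:
                  cpu: "4"
                  memory: 8Gi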
References:
Self Hosted Runner with Lambda Scaling
Philips uses a pile of Terraform to create Lambdas that manage spinning up and down spot instances as workers. This looks pretty well formed, and has some discussion of security. I think it trades the complexity of the Auto Scaling Group for a Lambda function.
While I think this is a strong contender, I think it will be simpler for us to use auto scaling groups.
References:
Moving CI
There may be some CI vendors that have native support for aarch64: Amazon's various offerings, Travis CI.
However, moving CI has significant complexity cost to us. We are currently primarily invested in GitHub.
However, if Amazon CodeBuild works well enough, it might be okay to maintain both? Worth at least a little experimenting.