
CI Support for aarch64 (AWS graviton2) #78

Open
directionless opened this issue Feb 24, 2021 · 15 comments
Labels
moving parts This involved infra, accounts, or services we need to manage

Comments

@directionless
Member

directionless commented Feb 24, 2021

Problem

osquery has had aarch64 support (osquery/osquery#6612) for a bit. Huge shoutouts to the contributors on that! The big sticking point in declaring it stable is adding it to CI.

Our last CI was Azure Pipelines; our current CI is GitHub Actions. Unfortunately, neither hosts aarch64 runners. But both distribute runner agents for that platform, so you can run your own. (GitHub Actions is a fork of Azure Pipelines, so it's unsurprising they look similar.)
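For context, once a self-hosted aarch64 runner is registered, workflows target it via `runs-on` labels. A minimal sketch (the workflow name and build commands are illustrative, not osquery's actual workflow):

```yaml
# .github/workflows/build-aarch64.yml (illustrative sketch)
name: build-aarch64
on: [push, pull_request]
jobs:
  build:
    # "self-hosted" plus OS/architecture labels route the job to a
    # registered self-hosted runner carrying those labels.
    runs-on: [self-hosted, linux, ARM64]
    steps:
      - uses: actions/checkout@v2
      # Hypothetical build steps; the real workflow would mirror
      # the existing x86 build jobs.
      - run: mkdir build && cd build && cmake ..
      - run: cmake --build build -j "$(nproc)"
```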

Possible Solutions

A short link dump and discussion of possible solutions.

Self Hosted Runner with an Auto Scaling Group

Envoy uses an AWS autoscaling group to manage workers. These workers have some tooling to run a single job, and then detach themselves. This feels very clean, in that it uses a simple AWS tool to handle availability.
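A minimal sketch of what that could look like as a CloudFormation fragment (the instance type, AMI, and subnet IDs are placeholders, not the Envoy setup itself):

```yaml
# Illustrative CloudFormation fragment: an ASG of Graviton2 runner instances.
Resources:
  RunnerLaunchTemplate:
    Type: AWS::EC2::LaunchTemplate
    Properties:
      LaunchTemplateData:
        InstanceType: c6g.4xlarge     # Graviton2 (placeholder size)
        ImageId: ami-00000000         # placeholder runner AMI
  RunnerGroup:
    Type: AWS::AutoScaling::AutoScalingGroup
    Properties:
      MinSize: "0"
      MaxSize: "4"
      DesiredCapacity: "1"
      LaunchTemplate:
        LaunchTemplateId: !Ref RunnerLaunchTemplate
        Version: !GetAtt RunnerLaunchTemplate.LatestVersionNumber
      VPCZoneIdentifier:
        - subnet-00000000             # placeholder subnet
```

The "run one job, then detach" behavior would live in the runner AMI itself; the ASG just replaces detached instances to maintain capacity.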

References:

Self Hosted Runner in Kubernetes (EKS)

We could host runners as pods in a Kubernetes cluster. This is appealing in its simplicity, at least once you accept Kubernetes.

I think this has some potential drawbacks around security. I don't think pods are as isolated as we might like them to be.

There's also a drawback in that we have to bring in Kubernetes. I have some experience there (Kolide runs several clusters), but it would be new to the osquery project.

References:

Self Hosted Runner with Lambda Scaling

Philips uses a pile of Terraform to create Lambdas that manage spinning up and down spot instances as workers. This looks pretty well formed, and has some discussion of security. I think it trades the complexity of the Auto Scaling Group for a Lambda function.

While I think this is a strong contender, I think it will be simpler for us to use auto scaling groups.

References:

Moving CI

There may be some CI vendors with native support for aarch64: Amazon's various offerings, Travis CI.

However, moving CI has significant complexity cost to us. We are currently primarily invested in GitHub.

However, if Amazon CodeBuild works well enough, it might be okay to maintain both? Worth at least a little experimenting.

@directionless directionless added the moving parts This involved infra, accounts, or services we need to manage label Feb 24, 2021
@directionless
Member Author

directionless commented Feb 24, 2021

I spent a while reading through the code on these. My current bias is towards simplicity. I have to recognize I'm not finding a lot of time, and some of these have a lot of complexity. While the complexity is hidden by Terraform, we don't have a good Terraform story (yet), and it's still complexity to manage/debug/fix.

Given that, I am currently strongly biased towards the envoy style AWS ASG approach. It is, by far, the simplest approach here.

Last night I ported the AMI generation from envoyproxy/ci-infra to making a github runner -- osquery/infrastructure#7

@mike-myers-tob
Member

What if we use one of our existing available CI runners (Linux/x86), but cross-compile for ARM and then use cross-execution to run the osquery tests (using qemu-user and binfmt-misc so that any non-native binaries get executed as if they're native)? Because osquery is statically linked this might be more feasible than it sounds.
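As a sketch of that idea in an ordinary x86 Actions job (the image tag and final command are illustrative):

```yaml
# Illustrative: run aarch64 binaries on an x86 GitHub-hosted runner.
jobs:
  arm-tests:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v2
      # Registers qemu-user-static handlers via binfmt_misc so the kernel
      # transparently hands non-native (e.g. aarch64) binaries to qemu.
      - uses: docker/setup-qemu-action@v1
        with:
          platforms: arm64
      # Any aarch64 binary produced by a cross-compile step can now be
      # executed directly, e.g. inside an arm64 container:
      - run: docker run --rm --platform linux/arm64 ubuntu:20.04 uname -m
```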

@directionless
Member Author

> What if we use one of our existing available CI runners (Linux/x86), but cross-compile for ARM and then use cross-execution to run the osquery tests (using qemu-user and binfmt-misc so that any non-native binaries get executed as if they're native)? Because osquery is statically linked this might be more feasible than it sounds.

On Slack a bit ago, Stefano said that was unacceptably slow. But maybe he was compiling under qemu.

@mike-myers-tob
Member

> On Slack a bit ago, Stefano said that was unacceptably slow. But maybe he was compiling under qemu.

Ah, I didn't see that conversation but I think he told me today that the ARM-based AWS instance was unacceptably slow. Cross-compiling shouldn't be slow, and qemu overhead for cross-execution should be acceptable.

@AGSaidi

AGSaidi commented Feb 25, 2021

I'm not sure how fast you're expecting, but building on a Graviton2 instance on AWS it's about 6m15s to build without tests and 6m43s with them.

@mike-myers-tob
Member

> I'm not sure how fast you're expecting, but building on a Graviton2 instance on AWS it's about 6m15s to build without tests and 6m43s with them.

That's plenty fast. He must've been talking about something else then.

Regardless of speed, my suggestion was just about a possible way to build and test ARM without having to provision our own ARM-based CI runners on another cloud, until GitHub Actions gets an ARM CI runner. Since it seems like we don't have the time to learn Terraform/Ansible, set up another cloud account and maintain it and pay for it etc.

@directionless
Member Author

> Regardless of speed, my suggestion was just about a possible way to build and test ARM without having to provision our own ARM-based CI runners on another cloud, until GitHub Actions gets an ARM CI runner. Since it seems like we don't have the time to learn Terraform/Ansible, set up another cloud account and maintain it and pay for it etc.

https://osquery.slack.com/archives/C019GR05SAH/p1599466550051900 (Alessandro, not Stefano)

Time and money are a bit funny. We do have an AWS presence, and I'm ignoring the Terraform side and manually configuring. I'm currently testing CodeBuild and slowly trying to get a native runner up.

Of course, I haven't yet broached trailofbits/osquery:ubuntu-18.04-toolchain-v9

@directionless
Member Author

I tried spinning up AWS CodeBuild (this is the AWS CI thing). I used an incredibly simple buildspec.yml, having created a multiplatform trailofbits/osquery:ubuntu-18.04-toolchain-v9.

Build went smoothly. Took 1,123 seconds (about 4 minutes in cmake and submodules, and 15 minutes in the build), which is quite a bit more than the 7ish minutes cited earlier.

The CodeBuild tooling is nice, with a good display of things, but not as many platforms or options as GitHub. Still, if I can't get another strategy to work, we can probably figure out how to use this as a fallback.
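For reference, an "incredibly simple" buildspec.yml would be along these lines (a reconstructed sketch; the phases and commands are assumptions, not the exact file used):

```yaml
# Illustrative buildspec.yml for an aarch64 CodeBuild project
version: 0.2
phases:
  build:
    commands:
      - mkdir build && cd build
      - cmake -DCMAKE_BUILD_TYPE=RelWithDebInfo ..
      - cmake --build . -j "$(nproc)"
      - ctest --output-on-failure
```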

@AGSaidi

AGSaidi commented Mar 8, 2021

I used a VM that had more than the 8 vcpus the CodeBuild VMs have, so that makes sense.

@directionless
Member Author

osquery/osquery-toolchain#23 is the Dockerfile I'm using to build the builders

@fkorotkov

Hey everyone,

I'm the founder of Cirrus CI. We are collaborating with AWS folks to bring free managed Graviton2 CI to OSS projects, which we are about to announce. Would you like to try it out? It's as simple as configuring the Cirrus CI GitHub App and adding the following .cirrus.yml config. No need to manage your own infrastructure.

# .cirrus.yml
task:
  arm_container:
    image: ubuntu:latest
  script: uname -a

Cirrus CI will run such a CI task on an EKS cluster of Graviton2 instances. You can run containers of any size up to 8 CPUs, with 16 CPUs in total concurrently (for example, 8 concurrent tasks with 2 CPUs each).

@directionless
Member Author

Hi @fkorotkov Coincidentally, I've been reading about Cirrus CI, and am overjoyed you found this. I'd love to chat!

I'd love a cleaner solution for aarch64, and we're starting to think about apple's m1 as well. Does it make sense for us to find some time to chat, or should I just try this first?

@fkorotkov

Will be happy to chat! You can email me at [email protected] and we'll figure something out.

For future researchers: there is a problem with Apple M1 because none of the existing virtualization technologies support it yet, and therefore it's impossible for CIs to provide ephemeral VMs. But if you have your own M1 hardware, Cirrus CI natively supports it via Persistent Workers. @directionless you probably read about them because of this comment actions/runner#805 (comment)

@fkorotkov

Forgot to mention: if you are planning to experiment with Cirrus CI, I highly recommend checking out Cirrus CLI, which can run Cirrus tasks locally. It's a great way to iterate quickly on the config.

@fkorotkov

FYI arm_containers are GA now and you can try them out. https://cirrus-ci.org/guide/linux/
