
The search for package hosting #80

Open
directionless opened this issue Sep 12, 2021 · 7 comments

@directionless
Member

directionless commented Sep 12, 2021

Our package downloads are currently running around 70 TB/month, and growing. This is fairly expensive for me personally, so I've been slowly trying to find a better answer.

  • I looked into sponsorship with Fastly. That fell through.
  • Package hosting is against Cloudflare's terms of service (see 2.8, Limitation on Serving Non-HTML Content). I reached out to them in 2021-09 to see if their OSS program would help.
  • KeyCDN looks cheapest so far; I emailed them on 2021-09-11 to ask about OSS support.
  • We're currently (as of 2021-09-11) on AWS CloudFront. I pinged their sponsorship people, but note that this is roughly a $7k bill.
  • I keep wondering about using GitHub release downloads instead of self-hosting.
@directionless
Member Author

directionless commented Sep 14, 2021

I captured the CloudFront logs for most of a day and did some analysis. A very small number of IP addresses account for the vast majority of our served traffic. Most of these are in AWS, and they look like they keep requesting the same deb files. I would bet these are VPC gateways for large sites with frequent or ephemeral node creation.

If we recognize that most of our traffic stays within AWS, we can plan hosting around that. Notably, transfer from S3 to an endpoint in the same region is free. This suggests a possible solution.

We create a bucket per region, replicate our package content between them, and then bounce users to those direct S3 URLs. There are, of course, implementation questions...

Possible Implementations:

| What | Pro | Con |
| --- | --- | --- |
| Lambda | Lots of control; probably inexpensive | Served from US; need to write it |
| CloudFront Function | Easily fits what we're doing today | Limited functionality |
| Lambda@Edge | Global | Need to write it; limited languages |
| S3 MRAP | AWS implementation of what I'd write | Unclear if custom URLs are supported; unclear price |
| Route 53 geo load balancing | Simple | Unclear if we can do the custom URLs for S3 |
| Just use S3 from us-east-1 | Simple | Need to rename bucket; can't easily do other regions |

I suspect the best route for me is to write the Lambda. It gets us off CloudFront, Go is supported, etc. But that's going to take more than a couple of days. So as a stopgap, I've gone and implemented a CloudFront Function that redirects requests from our top 15 IP addresses to the bucket closest to them.

Getting there involved a bunch of other pieces:

  1. There are now 3 additional buckets serving packages
  2. S3 is configured to replicate between them
  3. CloudFront has a viewer-request function that generates redirect URLs for the top N users
  4. I enabled S3 storage metrics; the storage costs look pretty cheap
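For the curious, a viewer-request function along those lines might look roughly like this. This is just a sketch: the IP prefixes and bucket names are placeholders, not the real top-15 list or bucket names (which aren't in this thread).

```javascript
// Sketch of a CloudFront viewer-request function that 302s known heavy
// downloaders to the S3 bucket nearest them. CloudFront Functions use a
// restricted ES5-style JavaScript runtime, hence var and string concat.

// Hypothetical region -> bucket mapping (placeholder names).
var REGION_BUCKETS = {
  'us-east-1': 'pkg-mirror-us-east-1',
  'us-west-2': 'pkg-mirror-us-west-2',
  'eu-west-1': 'pkg-mirror-eu-west-1'
};

// Hypothetical heavy-downloader IP prefixes -> the region they sit in.
// (These are documentation addresses, not the actual top-15 list.)
var HEAVY_DOWNLOADERS = {
  '203.0.113.': 'us-east-1',
  '198.51.100.': 'eu-west-1'
};

function regionForIp(ip) {
  for (var prefix in HEAVY_DOWNLOADERS) {
    if (ip.indexOf(prefix) === 0) {
      return HEAVY_DOWNLOADERS[prefix];
    }
  }
  return null; // not a known heavy downloader
}

function handler(event) {
  var request = event.request;
  var region = regionForIp(event.viewer.ip);
  if (region === null) {
    return request; // everyone else falls through to the origin as before
  }
  var bucket = REGION_BUCKETS[region];
  return {
    statusCode: 302,
    statusDescription: 'Found',
    headers: {
      location: {
        value: 'https://' + bucket + '.s3.' + region +
               '.amazonaws.com' + request.uri
      }
    }
  };
}
```

Since the function only touches viewer requests from a short allow-list of prefixes, everyone else keeps the existing CloudFront behavior untouched.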

@directionless
Member Author

Another approach here is to use GitHub. There is some prior art on serving an apt repo from GitHub Pages:

@mike-myers-tob
Member

CloudSmith offers OSS package hosting sponsorships, but we'd have to contact them: https://help.cloudsmith.io/docs/open-source-hosting-policy

GitHub Packages is free for OSS. Can we use this? https://docs.github.com/en/packages/learn-github-packages/introduction-to-github-packages Ah... maybe only for Chocolatey / NuGet: https://docs.github.com/en/packages/learn-github-packages/introduction-to-github-packages#supported-clients-and-formats

@directionless
Member Author

> CloudSmith offers OSS package hosting sponsorships, but we'd have to contact them: https://help.cloudsmith.io/docs/open-source-hosting-policy

Always good to have more options! Their quoted bandwidth caps are orders of magnitude below our current bandwidth. (This is also true for packagecloud and the various other places I've seen.) This may be different if my S3 redirects pan out.

> GitHub Packages is free for OSS. Can we use this? https://docs.github.com/en/packages/learn-github-packages/introduction-to-github-packages ah....maybe only for Chocolatey / NuGet https://docs.github.com/en/packages/learn-github-packages/introduction-to-github-packages#supported-clients-and-formats

I want this to be useful. But the supported repos are very lacking.

@mike-myers-tob
Member

> > CloudSmith offers OSS package hosting sponsorships, but we'd have to contact them: https://help.cloudsmith.io/docs/open-source-hosting-policy

> Always good to have more options! Their quoted bandwidth caps are orders of magnitude below our current bandwidth. (This is also true for packagecloud and the various other places I've seen.) This may be different if my S3 redirects pan out.

Yes, but the way I read it, those limits are what's guaranteed for any open-source project; if you accept a sponsorship agreement of some kind, there would be other unstated limits (perhaps negotiable up to where we need them). Not sure if osquery is high-profile enough for them, but we could ask.

We definitely want to get this hosting bill off your wallet; it's not sustainable, and it's kind of an existential risk to the project if you suddenly have to cut out.

I think we want to encourage these top downloaders to manage their own package cache. We could write a tutorial that explains how to create private mirrors of package repos, so they can point their ephemeral VMs at those instead of constantly re-downloading from our S3 bucket. Maybe we can even pitch it as a cost saver for them, if it means less inbound network traffic cost on their side. What if we rate limited by IP address or IP range, whichever would be effective? Not for everyone, just for the repeat downloaders. Eventually they would notice that they should be caching.

@robbat2

robbat2 commented Sep 29, 2021

Wondering if you have more stats:

  • size of the actual repo, in terms of distinct objects & byte count
  • From the Fastly engagement, how much origin traffic was there?

I'll send an email to @directionless making some introductions.

@directionless
Member Author

I'm not great at updating tickets...

I thought about it a bit and realized that something seemed very off. The number of requests we see (about 2.5 million/day) is extremely high for osquery packaging. So I went digging into the actual users of the data. Is it really credible that ~30 computers a second download osquery?

Anyhow, I discovered that a single consumer is responsible for the vast bulk of the traffic. I don't know much about it, other than that it's a VPC in AWS us-east-1, and it keeps downloading the Ubuntu x86 package.

Because AWS same-region S3 data transfer is free, this leads to a simple solution: for the busy clients, we can redirect them directly to the bucket. I set up some redirect magic in CloudFront to bounce the top 10 AWS IPs to direct bucket links. Our monthly bill is now much more manageable, though probably still a bit higher than desired.

I think I can get it even lower by moving the redirect logic into Lambda and moving away from CloudFront completely.
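A minimal sketch of what that Lambda could look like, assuming a Lambda function URL (HTTP payload v2) in front of it. The `x-viewer-region` header is a hypothetical stand-in for a real GeoIP lookup, and the bucket names are placeholders, not the actual buckets.

```javascript
// Sketch: a Lambda that replaces CloudFront entirely by answering every
// package request with a 302 to the same path in the nearest regional
// bucket. Placeholder bucket names; the real ones aren't in this thread.
const REGION_BUCKETS = {
  'us-east-1': 'pkg-mirror-us-east-1',
  'us-west-2': 'pkg-mirror-us-west-2',
  'eu-west-1': 'pkg-mirror-eu-west-1',
};
const DEFAULT_REGION = 'us-east-1';

// Trust the region hint if we actually mirror there; otherwise fall back.
function closestRegion(hint) {
  return REGION_BUCKETS[hint] ? hint : DEFAULT_REGION;
}

// Lambda function URL style handler (payload v2). In a real deployment
// this would be exported as `handler`, and the region would come from a
// GeoIP database rather than a client-supplied header.
const handler = async (event) => {
  const headers = event.headers || {};
  const region = closestRegion(headers['x-viewer-region']); // hypothetical header
  const path = event.rawPath || '/';
  const bucket = REGION_BUCKETS[region];
  return {
    statusCode: 302,
    headers: {
      Location: `https://${bucket}.s3.${region}.amazonaws.com${path}`,
    },
  };
};
```

Compared to the CloudFront Function stopgap, this version redirects everyone, not just an allow-list, which is what would let the project drop CloudFront (and its bandwidth bill) entirely.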
