The search for package hosting #80
I captured the CloudFront logs for most of a day and did some analysis. A very small number of IP addresses account for the vast majority of our served traffic. Most of these are in AWS, and they look like they keep requesting the same packages.

If we recognize that most of our traffic is within AWS, then we can plan hosting around that. Notably, transfer from S3 to an endpoint within the same region is free. This leads to a possible solution: we create a bucket per region, replicate our package content, and then bounce users to those direct S3 URLs. There are, of course, implementation questions...

Possible implementations:
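To make the shape of that concrete, here is a minimal Go sketch of the redirect flow. The `osquery-packages-<region>` bucket naming and the region lookup are assumptions, not settled choices:

```go
package main

import (
	"fmt"
	"net/http"
)

// regionFor would map a client IP to the nearest AWS region, e.g. via a
// GeoIP database or AWS's published ip-ranges.json. Placeholder only.
func regionFor(clientIP string) string {
	return "us-east-1"
}

// redirect bounces the request to a direct, virtual-hosted-style S3 URL
// in the client's region, so AWS-internal clients get free S3 transfer.
func redirect(w http.ResponseWriter, r *http.Request) {
	region := regionFor(r.RemoteAddr) // RemoteAddr is "ip:port"; fine for a sketch
	target := fmt.Sprintf("https://osquery-packages-%s.s3.%s.amazonaws.com%s",
		region, region, r.URL.Path)
	http.Redirect(w, r, target, http.StatusFound) // 302
}

func main() {
	http.HandleFunc("/", redirect)
	http.ListenAndServe(":8080", nil)
}
```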
I suspect that the best route for me is to write the Lambda. It gets us off CloudFront, Go is supported, etc. But that's going to take more than a couple of days. So as a stopgap, I've gone and implemented a CloudFront function to redirect users from our top 15 IP addresses to the bucket closest to them. Getting to that has a bunch of other pieces:
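The CloudFront function itself has to be JavaScript, but since the longer-term plan is a Go Lambda, that redirect handler might look roughly like the following sketch. The function-URL event type, the IP-to-region table, and the bucket names are all assumptions:

```go
package main

import (
	"fmt"

	"github.com/aws/aws-lambda-go/events"
	"github.com/aws/aws-lambda-go/lambda"
)

// heavyDownloaders maps the top requester IPs (from the CloudFront log
// analysis) to their home region. These addresses are made-up examples.
var heavyDownloaders = map[string]string{
	"192.0.2.10": "us-east-1",
	"192.0.2.11": "us-west-2",
}

func handler(req events.LambdaFunctionURLRequest) (events.LambdaFunctionURLResponse, error) {
	region, ok := heavyDownloaders[req.RequestContext.HTTP.SourceIP]
	if !ok {
		region = "us-east-1" // default bucket for everyone else
	}
	target := fmt.Sprintf("https://osquery-packages-%s.s3.%s.amazonaws.com%s",
		region, region, req.RawPath)
	return events.LambdaFunctionURLResponse{
		StatusCode: 302,
		Headers:    map[string]string{"Location": target},
	}, nil
}

func main() {
	lambda.Start(handler)
}
```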
Another approach here is to use GitHub. There is some prior art in making an apt repo on GitHub Pages:
CloudSmith offers OSS package hosting sponsorships, but we'd have to contact them: https://help.cloudsmith.io/docs/open-source-hosting-policy

GitHub Packages is free for OSS. Can we use this? https://docs.github.com/en/packages/learn-github-packages/introduction-to-github-packages

Ah... maybe only for Chocolatey / NuGet: https://docs.github.com/en/packages/learn-github-packages/introduction-to-github-packages#supported-clients-and-formats
Always good to have more options! Their quoted bandwidth caps are orders of magnitude below our current bandwidth (this is also true for packagecloud and the various other places I've seen). This may be different if my S3 redirects pan out.
I want this to be useful, but the supported repo formats are very lacking.
Yes, but the way I read it, those limits are what's guaranteed for any open-source project; if you accept a sponsorship agreement of some kind, there would be other unstated limits (perhaps negotiable up to where we need them). Not sure if osquery is high-profile enough for them, but we could ask. We definitely want to get this hosting bill off your wallet; it's not sustainable, and it's kind of an existential risk to the project if you suddenly have to cut out.

I think we want to encourage these top downloaders to manage their own package cache. We could write a tutorial that explains how to create private mirrors of package repos, so that they can point their ephemeral VMs at those instead of constantly re-downloading from our S3 bucket (a toy sketch of such a cache follows below). Maybe we can even pitch it as a cost saver for them too, if it means less inbound network traffic cost for them.

What if we could rate limit by IP address or IP range, whichever would be effective? Not for everyone, just for the repeat downloaders. Eventually they will notice that they should be caching.
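As a toy illustration of what that tutorial could describe: a pull-through cache that serves from local disk and only hits the upstream bucket on a miss. The upstream URL here is assumed, and a real deployment would more likely use an off-the-shelf tool (apt-cacher-ng, a caching nginx, Artifactory):

```go
package main

import (
	"io"
	"net/http"
	"os"
	"path/filepath"
)

const upstream = "https://pkg.osquery.io" // assumed upstream package host
const cacheDir = "/var/cache/pkg-mirror"

// serve answers from local disk when possible, and downloads from the
// upstream bucket at most once per package otherwise.
func serve(w http.ResponseWriter, r *http.Request) {
	local := filepath.Join(cacheDir, filepath.Clean(r.URL.Path))
	if _, err := os.Stat(local); err == nil {
		http.ServeFile(w, r, local) // cache hit: no upstream traffic
		return
	}
	resp, err := http.Get(upstream + r.URL.Path) // cache miss: fetch once
	if err != nil {
		http.Error(w, err.Error(), http.StatusBadGateway)
		return
	}
	defer resp.Body.Close()
	if resp.StatusCode != http.StatusOK {
		http.Error(w, "upstream fetch failed", http.StatusBadGateway)
		return
	}
	os.MkdirAll(filepath.Dir(local), 0o755)
	f, err := os.Create(local)
	if err != nil {
		http.Error(w, err.Error(), http.StatusInternalServerError)
		return
	}
	io.Copy(f, resp.Body) // store locally...
	f.Close()
	http.ServeFile(w, r, local) // ...then serve from the cache
}

func main() {
	http.HandleFunc("/", serve)
	http.ListenAndServe(":8080", nil)
}
```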
Wondering if you have more stats:
I'll send an email to @directionless making some introductions.
I'm not great at updating tickets... I thought a bit and realized that something seemed very off. The number of requests we see (about 2.5 million/day) is very, very high for osquery packaging. Is it actually credible that we see ~30 computers a second download osquery? So I went digging into the actual users of the data.

Anyhow, I discovered that there is a single consumer responsible for the vast bulk of the traffic. I don't know much about it, other than it's a VPC in AWS us-east-1, and they keep downloading the Ubuntu x86 package. Because AWS in-region S3 data transfer is free, this leads to a simple solution: for the busy clients, we can redirect them directly to the bucket.

I went and set up some redirect magic in CloudFront to bounce the top 10 AWS IPs to direct bucket links. Our monthly bill is now much more manageable, though probably still a bit higher than desired. I think I can get it even lower by moving my redirecting project into Lambda and completely moving away from CloudFront.
Our package downloads are currently running around 70 TB/month, and this is growing. This is fairly expensive for me personally, so I've been slowly trying to find a better answer.
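For rough scale, a back-of-envelope using an assumed blended CloudFront egress rate of about $0.07/GB (published US tiers ran roughly $0.06 to $0.085/GB; actual rates vary by region and over time):

```math
70\ \text{TB/month} \approx 71{,}680\ \text{GB} \times \$0.07/\text{GB} \approx \$5{,}000/\text{month}
```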
(2.8 Limitation on Serving Non-HTML Content)