
Researching the queueing/ticketing system in the GCP context and in Cloud Run Services #1541

Open
Tracked by #1367
marcocastignoli opened this issue Aug 8, 2024 · 14 comments

Comments

@marcocastignoli
Member

marcocastignoli commented Aug 8, 2024

Context

  • The Sourcify server struggles when many requests come in; it doesn't have any scaling strategy.
  • The Sourcify server currently holds one pending HTTP request for each pending verification: /verify HTTP requests are closed only when the verification is completed. With [Milestone] APIv2 #1367 we want to introduce a receipt property in the /v2/verify response, which will allow us to decouple the HTTP request from the verification status.

Solutions

We are exploring two solutions:

  • One that uses a queuing system between two new services: sourcify-http-server and sourcify-verification-service. sourcify-http-server will push pending contracts to the queue-service, and sourcify-verification-service will read pending contracts from the queue-service, verifying them and marking them as completed. This solution involves setting up a queue service, adding more complexity to our architecture, but we get granular control over what's in the queue, enabling us to potentially implement priority systems.
graph TD
    User -->|/verify| sourcify-http-server
    sourcify-http-server -->|Push Pending Contracts| queue-service
    queue-service -->|Read Pending Contracts| sourcify-verification-service
    sourcify-verification-service -->|Mark as Completed| sourcify-database
    sourcify-http-server -->|Read Status| sourcify-database
    sourcify-http-server -->|Read Status| queue-service
  • The second solution doesn't use a queue service but takes advantage of the scaling solutions offered by GCP. Two new services, sourcify-http-server and sourcify-verification-service, will be deployed as Google Cloud Run services. sourcify-http-server will receive /verify requests from the internet and call sourcify-verification-service directly, without passing through a queue. The verification status will be saved in sourcify-database.
graph TD
    User -->|/public_api_verify| sourcify-http-server
    sourcify-http-server -->|/internal_api_verify| sourcify-verification-service
    sourcify-verification-service -->|Write Status| sourcify-database
    sourcify-http-server -->|Read Status| sourcify-database
@kuzdogan
Member

  • Do we need "granular control on what's in the queue" or "priority systems"?
  • How do we send "tickets" in the second case?

@manuelwedler
Collaborator

Could you explain what benefits the second proposal would bring compared to our current setup? Wouldn't just one server also scale well with GCP?

@marcocastignoli
Member Author

marcocastignoli commented Aug 12, 2024

@manuelwedler

Could you explain what benefits the second proposal would bring compared to our current setup? Wouldn't just one server also scale well with GCP?

With our current setup we cannot unbind requests from verifications: a request stays pending until the verification process is over. We need some way to separate verification from HTTP requests if we want to support receipts in API v2.

We could potentially separate them within the same "sourcify-server" process, but then we would not be taking advantage of GCP scaling based on the number of requests:

  • Cloud Run would scale based on the number of open HTTP requests, but in this hypothetical setup those are no longer related to verifications. So it would not scale by the number of pending verifications, only by the number of HTTP requests.
  • Cloud Run also scales based on resource usage, so the service would still scale, just not in an optimal way.

@marcocastignoli
Member Author

@kuzdogan

Do we need "granular control on what's in the queue" or "priority systems"?

I cannot think of any real use case for this, other than prioritizing some chains.

How do we send "tickets" in the second case?

In the diagram I wrote "Read Status" from sourcify-http-server to sourcify-database:

  • sourcify-http-server calls sourcify-verification-service, triggering a new verification
  • sourcify-verification-service stores a new receipt in the database as "pending" and returns the receipt id to sourcify-http-server, then starts the verification process and marks the receipt as completed once done
  • sourcify-http-server can always read the status directly from the database
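
A minimal sketch of that receipt flow on the sourcify-verification-service side (assuming an Express-style server; the /internal_api_verify endpoint name is taken from the diagram above, everything else is illustrative, including the in-memory map standing in for sourcify-database):

```typescript
import express from "express";
import { randomUUID } from "crypto";

const app = express();
app.use(express.json());

// Stand-in for sourcify-database; a real implementation would use Postgres.
const receipts = new Map<string, { status: "pending" | "completed"; result?: unknown }>();

// Store a "pending" receipt, return its id immediately,
// and run the verification in the background.
app.post("/internal_api_verify", (req, res) => {
  const receiptId = randomUUID();
  receipts.set(receiptId, { status: "pending" });
  verify(req.body)
    .then((result) => receipts.set(receiptId, { status: "completed", result }))
    .catch((err) => receipts.set(receiptId, { status: "completed", result: { error: String(err) } }));
  res.status(202).json({ receiptId });
});

// sourcify-http-server would read the status from the database directly;
// this endpoint just makes the sketch self-contained.
app.get("/receipt/:id", (req, res) => {
  const receipt = receipts.get(req.params.id);
  receipt ? res.json(receipt) : res.sendStatus(404);
});

async function verify(payload: unknown): Promise<unknown> {
  // Placeholder for the actual compile-and-match logic.
  return { matched: true };
}

app.listen(3001);
```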

@manuelwedler
Collaborator

manuelwedler commented Oct 9, 2024

To be able to proceed here, some more feedback:

Second approach

If I get it right, GCP Cloud Run scales the number of instances based on pending HTTP requests or when CPU utilization rises above some percentage. So the sourcify-verification-service receives requests to verify, but these are closed after returning the receipt id, and therefore it does not necessarily scale up while verifying. Is this how you meant it to be? This would mean the sourcify-verification-service would need to spin up some workers internally that handle the verification. This could be implemented in different ways:

  • Either you spin up a worker for every verification as it comes in, but this would mean CPU usage would go up and Cloud Run would scale this service up again. I think this would not be much different from our current setup.
  • Or you queue the verifications, so you only spin up a limited number of workers (-> less CPU usage). But then you have an internal queue, which means memory usage goes up. I think this would be totally fine, but I am also not sure whether an external queue would then be better. (A rough sketch of this internal queue follows below.)
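
To illustrate the second bullet, a rough sketch of such an internal queue with a fixed worker count (plain TypeScript; all names are hypothetical):

```typescript
// A job is any async verification task.
type Job = () => Promise<void>;

// Holds pending jobs in memory and runs at most `maxWorkers` at once.
class BoundedQueue {
  private queue: Job[] = [];
  private running = 0;

  constructor(private readonly maxWorkers: number) {}

  push(job: Job): void {
    this.queue.push(job);
    this.drain();
  }

  private drain(): void {
    // Start jobs while there is worker capacity and work waiting.
    while (this.running < this.maxWorkers && this.queue.length > 0) {
      const job = this.queue.shift()!;
      this.running++;
      job().finally(() => {
        this.running--;
        this.drain(); // a slot freed up; pick up the next queued job
      });
    }
  }
}

// e.g. at most 4 verifications compile concurrently; the rest wait in memory.
const verificationQueue = new BoundedQueue(4);
```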

Maybe I am wrong here with my assumptions, so happy to hear your opinion on this.

First approach

We should look into what options we have for such a queue-service. There is, for example, the Google-managed Cloud Tasks. We should compare the pros and cons of the different approaches. For example, I think such an external queue service can also provide some benefits for logging and debugging purposes. Maybe it would be good to have a small list.
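For Cloud Tasks specifically, enqueueing a verification could look roughly like this (a sketch using the @google-cloud/tasks Node client; project, location, queue name, and target URL are placeholders):

```typescript
import { CloudTasksClient } from "@google-cloud/tasks";

const client = new CloudTasksClient();

// Enqueue an HTTP task that Cloud Tasks will deliver to the verification service.
async function enqueueVerification(contractPayload: object): Promise<void> {
  const parent = client.queuePath("my-project", "europe-west1", "verification-queue");
  await client.createTask({
    parent,
    task: {
      httpRequest: {
        httpMethod: "POST",
        url: "https://sourcify-verification-service.example.com/internal_api_verify",
        headers: { "Content-Type": "application/json" },
        body: Buffer.from(JSON.stringify(contractPayload)).toString("base64"),
      },
    },
  });
}
```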

In general, I think we need to look a bit closer at the two approaches and also define the internal structure of the components to decide which option is best.

@kuzdogan
Member

kuzdogan commented Oct 15, 2024

Summarizing the call and the next steps:

We agreed on 3 viable options:

1. Queueing Service + Verification Service + HTTP Server

Similar to option 1 above, having a Queueing Service and a separate Verification Service. Keeping this short as we did not discuss the details.

I guess in this case the scaling will be handled by the Queueing Service itself?

Leaving it here to keep this option open.

2. Verification Service + HTTP Server

Similar to option 2 above, just having a separate Verification service:

In this case the rough flow is as follows:

  • The user sends a verification request.
  • The Sourcify public HTTP server receives the request, processes it, and sends the contract for verification to the Verification Service.
  • The Verification Service writes a verificationJob to the DB and responds to the HTTP server with the job ID.
  • The HTTP server responds to the user with the job ID.
  • The user now starts polling the HTTP server for the job with that ID.
  • Meanwhile the Verification Service spins up a worker which compiles and processes the verification. It writes the result to the Database upon (successful or unsuccessful) compilation.
  • The user finally receives a response from the HTTP server with isCompleted: true and the verification result.

Scaling: In this case, the Verification Services will be scaled by their CPU usage. Once a certain usage is hit (60% in GCP Cloud Run), a new instance is spun up and new requests from the HTTP server will be routed to the new instance. This should also be compatible with other scalable deployments, e.g. Kubernetes.
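
To make the user's polling step concrete, the client side could look like this (a hypothetical sketch; the URL path and isCompleted field mirror the flow above but are not a final API):

```typescript
// Poll the HTTP server until the verification job completes.
async function pollJob(jobId: string): Promise<unknown> {
  for (;;) {
    const res = await fetch(`https://sourcify.example/v2/verify/${jobId}`); // placeholder URL
    const job = await res.json();
    if (job.isCompleted) return job; // contains the verification result
    await new Promise((resolve) => setTimeout(resolve, 2000)); // wait before retrying
  }
}
```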

3. Only one HTTP Server

In the call, a third option has been proposed that requires no separate service (just an HTTP server) but outsources the async task to Workers.

In this case the rough flow is as follows:

  • The user sends a verification request.
  • The Sourcify public HTTP server receives the request, creates a new verificationJob, and sends the job ID back to the user. Finally, it spins up a Worker with this job ID.
  • The user now starts polling the HTTP server for the job with that ID.
  • The Worker compiles and processes the verification. It writes the result to the Database upon (successful or unsuccessful) compilation.
  • The user finally receives a response from the HTTP server with isCompleted: true and the verification result.

Scaling: Here the server instances are scaled by CPU usage, similar to how it's done at the moment. Since the server instances are stateless, this is easily possible.

Next steps

We'd like to create simple sequence diagrams of the last 2 proposals to make them easily understandable. After that we'll contact Markus from the Devops team for his feedback.

@manuelwedler
Collaborator

manuelwedler commented Oct 16, 2024

  • Meanwhile the Verification Service compiles and processes the verification. It writes the result to the Database upon (successful or unsuccessful) compilation.

This in the second option also implies a "worker". A worker is just a term for any background task that is being processed. We could also just call it a background task or something similar, but I imagine it to be a class that gets instantiated with the request and then handles the verification in the background. This class could be called VerificationWorker, for example (see the sketch below). So the only actual difference between the second and third options is whether we split the server into two services or not.

I also updated your comment to make this clear.
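
A sketch of what such a VerificationWorker class could look like (illustrative only; the request type, DB interface, and compile step are stubs, not Sourcify code):

```typescript
// Hypothetical placeholder types for the sketch.
interface VerificationRequest {
  chainId: string;
  address: string;
  files: Record<string, string>;
}
interface JobDatabase {
  completeJob(jobId: string, result: unknown): Promise<void>;
  failJob(jobId: string, error: string): Promise<void>;
}
async function compileAndMatch(req: VerificationRequest): Promise<unknown> {
  return { match: "perfect" }; // placeholder for the actual compilation + bytecode match
}

// The HTTP handler instantiates this with the request, responds with the job ID,
// and lets run() finish in the background.
class VerificationWorker {
  constructor(
    private readonly jobId: string,
    private readonly request: VerificationRequest,
    private readonly db: JobDatabase,
  ) {}

  async run(): Promise<void> {
    try {
      const result = await compileAndMatch(this.request);
      await this.db.completeJob(this.jobId, result);
    } catch (err) {
      await this.db.failJob(this.jobId, String(err));
    }
  }
}
```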

@manuelwedler
Collaborator

Here are the sequence diagrams for option number 2 and 3:

2. Verification Service + HTTP Server

[Sequence diagram: queue_option2.drawio]

3. Only one HTTP Server

[Sequence diagram: queue_option3.drawio]

@marcocastignoli
Member Author

Recap of my conversation with Markus.

The monolithic option 3 is fine; the only downside is that we would scale parts of our service that don't need scaling:

  • if verification uses too much CPU, the HTTP part will also scale even though it doesn't need to
  • if there are too many requests to the HTTP service, the verification part will also scale even if it doesn't need to

Option 2 is ideal because it separates the HTTP and verification scaling concerns, but it comes with additional effort:

  • plain HTTP/JSON is not ideal; it would be better to use protobuf/gRPC.
  • upgrading the DB schema would require updating both the HTTP and verification services; this makes the deployment process incredibly more complex. Instead of having the verification service handle database operations, designate the HTTP service as the sole component responsible for writing to the database. The verification service would then return the necessary verification information to the HTTP server, eliminating the need for it to directly interact with the database.
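
A sketch of that second point, where only the HTTP service touches the database and the verification service just returns its result (all names hypothetical):

```typescript
// The verification service is stateless and never sees the DB.
interface VerificationClient {
  verify(payload: unknown): Promise<unknown>; // e.g. a gRPC or HTTP call
}
// Only the HTTP service holds a handle to the job store.
interface JobStore {
  createJob(payload: unknown): Promise<string>; // returns job ID, status "pending"
  completeJob(jobId: string, result: unknown): Promise<void>;
}

// Inside the HTTP service: it owns all DB writes.
async function handleVerify(
  payload: unknown,
  db: JobStore,
  verifier: VerificationClient,
): Promise<string> {
  const jobId = await db.createJob(payload); // write "pending" receipt
  // Fire-and-forget: return the job ID while verification continues.
  verifier
    .verify(payload)
    .then((result) => db.completeJob(jobId, result))
    .catch((err) => db.completeJob(jobId, { error: String(err) }));
  return jobId;
}
```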

@kuzdogan
Member

kuzdogan commented Oct 29, 2024

I'm not sure if I get the 2nd point:

this makes the deployment process incredibly more complex.

I understand we'll have two components to build and deploy but is there something I'm missing that'll make it incredibly complex?

Instead of having the verification service handle database operations, designate the HTTP service as the sole component responsible for writing to the database. The verification service would then return the necessary verification information to the HTTP server, eliminating the need for it to directly interact with the database.

I don't get why it is favorable to have the HTTP server do the DB operations instead of the Verification Server.

Overall, to me the downsides of number 3 are not a big concern compared to the development effort that'll be needed for number 2. I think we can just set the request count limit high enough that we mostly scale on CPU instead.

@marcocastignoli
Member Author

In response to

I understand we'll have two components to build and deploy but is there something I'm missing that'll make it incredibly complex?

I'm citing Markus:

you'd need to keep both services aligned and need to deploy changes at the same time. so they get coupled and then it's basically a monolith as you can't develop both services independently. and you'd need to make sure only one is doing migrations or do migrations out of band aka manually so the two services won't fight each other when starting and applying migrations
also if you have to update "always" both at the same time, then you also can just have everything in one codebase.

I honestly also didn't fully get this point. It's not a huge deal to keep everything synchronized. This probably becomes a problem when you have to keep different versions online, deploy with zero downtime, or run more than 2 services.

@manuelwedler
Collaborator

you'd need to keep both services aligned and need to deploy changes at the same time. so they get coupled and then it's basically a monolith as you can't develop both services independently. and you'd need to make sure only one is doing migrations or do migrations out of band aka manually so the two services won't fight each other when starting and applying migrations
also if you have to update "always" both at the same time, then you also can just have everything in one codebase.

I honestly also didn't fully get this point. It's not a huge deal to keep everything synchronized. This probably becomes a problem when you have to keep different versions online, deploy with zero downtime, or run more than 2 services.

I still think Markus makes some valid points here. For example, maintaining a database module for two services increases the maintenance burden.
I agree that deploying at the same time does not seem like a big issue for us at the moment, but to be future-proof and have a clean architecture, decoupling the services seems the better option to me. So if we go with option 2, I would also integrate Markus' proposals.

Overall, I think option 3 is very easy to implement for us and option 2 just means reduced costs compared to 3. As costs are not a priority at the moment, I would also go with 3 for now. It should also be possible to upgrade the architecture from 3 to 2 if we feel like there is the need later.

@kuzdogan
Member

I'm also in favor of option 3 for its ease

@marcocastignoli
Member Author

Then we all agree!
