
[BUG] - query utxos by address performance varies vastly #5810

Open
jonathangenlambda opened this issue Apr 27, 2024 · 24 comments
Labels
Stale type: bug Something is not working

Comments

@jonathangenlambda

External

Area
Node querying

Summary
We observed decreasing performance of "query utxos by address" with increasing node versions. On 1.35.7 it took around 4 seconds; with 8.7.3 and 8.9.1 we saw wild fluctuations in timings, with most queries taking around 7 seconds while regularly observed outliers ranged from 30 to 45 and even 90 (!) seconds. We observed this behaviour querying the same address, on the same cloud hardware and with the same application.

Expected behavior
Query timings should be within a predictable range. It is clear that some fluctuations are perfectly normal and expected, but outliers of up to 12 times the typical latency throw a wrench into every production environment.
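For context, the timings above are for a plain address query of roughly the shape below (a sketch; the address and socket path are placeholders, and --socket-path assumes a cardano-cli from the 8.x line, with older releases reading CARDANO_NODE_SOCKET_PATH instead):

  # Hypothetical reproduction of the benchmarked query; only the
  # wall-clock time reported by `time` was compared across node versions.
  time cardano-cli query utxo \
    --address addr1... \
    --mainnet \
    --socket-path /path/to/node.socket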

@jonathangenlambda jonathangenlambda added the type: bug Something is not working label Apr 27, 2024
@jonathangenlambda
Author

For comparison: we saw "query utxos by txins" with 90+ txins taking only around 100 milliseconds.
So regarding "query utxos by address", something is definitely off there, and it needs to be improved, as this makes it prohibitively expensive to use in production. Using Blockfrost is not always an alternative and comes with its own (monetary and integration) costs.
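For reference, the two query shapes being compared correspond roughly to these CLI invocations (placeholder values; --tx-in can be repeated for the 90+ inputs mentioned above):

  # Fast path reported above: look up specific UTxOs by transaction input.
  cardano-cli query utxo --tx-in "<txid>#0" --mainnet

  # Slow path: filter the entire UTxO set by address.
  cardano-cli query utxo --address addr1... --mainnet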

@carbolymer
Contributor

carbolymer commented Apr 29, 2024

There's a change to how UTXOs are stored in the works - UTXO-HD. The main change is that UTXOs won't be kept in memory, so I believe the queries will be slower, but you should get less variable query times.

@jasagredo Correct me if I'm wrong here please. Do we have any performance expectations here?

@jonathangenlambda
Author

@carbolymer so do I understand this correctly: the queries will be even slower but consistently slower?
This is unacceptable and makes it completely useless in any serious production setting.

@jasagredo
Contributor

The QueryUtxoByAddress query has to traverse the whole UTXO set to find the UTXOs associated with the requested account.

I may be wrong, but that query is not expected to be performant, nor is it expected to be used in production. The node does not need a reverse index of UTXOs to operate in the network, so maintaining such an index falls outside the node's responsibilities, and doing so would impact other areas, in particular memory usage.

I think the expected setup is for clients to track the UTXO set of their accounts, which (I think) cardano-wallet does. The alternative would be to use some external UTXO indexer like db-sync.

When UTXO-HD arrives, this query will still have to traverse the whole UTXO set, but that set will be on disk instead of in memory, so some regression is expected there.

Please correct me if I misinterpreted something above @disassembler

@jasagredo
Contributor

In any case, leaving aside that this query might be slower than desired, I don't have an explanation for the fluctuations, and I would not have expected them to happen.

@jasagredo
Contributor

If you want to investigate this, perhaps the first step would be to query, on the latest node versions, a chain that is (more or less) at the same tip that 1.35.7 was at back then. My suspicion is that the code that performs the query is not at fault, as I don't think it has changed much, but rather (1) the data on the chain, which was perhaps much smaller back then, and (2) perhaps some thunks being forced by traversing the UTXO set.

If the cause is (1), there is not much to investigate here. If the cause is (2), then there is probably some profiling investigation that could be done to try to smooth out the fluctuations, even if the query remains slow.
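On the profiling point, one low-effort first step (a sketch only; it assumes the node binary was built with -rtsopts enabled, and the run flags below are placeholders for an existing configuration) would be to restart the node with GHC runtime statistics and compare allocation/GC figures between a quiet period and a period with address queries:

  # Print a GHC runtime summary (allocations, GC time) when the node exits.
  cardano-node run \
    --config /path/to/config.json \
    --topology /path/to/topology.json \
    --database-path /path/to/db \
    --socket-path /path/to/node.socket \
    +RTS -s -RTS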

@jonathangenlambda
Author

jonathangenlambda commented Apr 29, 2024

I may be wrong, but that query is not expected to be performant, nor is it expected to be used in production.

Then please put some warnings/documentation into the Haddocks to make that VERY clear.

The alternative would be to use some external UTXO indexer like db-sync.

Relying on db-sync is a big risk: it requires a lot of resources, needs additional integration, and is often behind node releases (for example, it is/was not compatible with 8.9.1 nodes despite the official 8.9.1 release). But OK, there are other options besides db-sync; it seems that the only really feasible option is to rely on Blockfrost.
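For completeness, the Blockfrost route for this lookup is a single HTTP call (a sketch; the project id and address are placeholders, using the public v0 addresses/{address}/utxos endpoint):

  curl -H "project_id: $BLOCKFROST_PROJECT_ID" \
    "https://cardano-mainnet.blockfrost.io/api/v0/addresses/addr1.../utxos"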

When UTXO-HD arrives, this query will still have to traverse the whole UTXO set, but that set will be on disk instead of in memory, so some regression is expected there.

Right, so this makes the query completely unusable. Please remove it from the API and/or mark it clearly as "don't use in production". To be honest, I don't understand such a fundamental refactoring as UTXO-HD when the consequence is an actual performance regression...

As I said, we saw those fluctuations on the very same address over a window of a few days, and the UTXO set of this address definitely did not change that much.

@jorisdral
Contributor

To be honest, I don't understand such a fundamental refactoring as UTXO-HD when the consequence is an actual performance regression...

The UTXO set is currently stored in memory. Since it is ever-growing, it will at some point have to be moved to disk. Otherwise, only computers with a large memory budget will be able to run a node.

However, what hasn't been mentioned yet is that the node will have two modes: one for storing the UTXO set on disk and one for storing it in memory. The consensus code will still be refactored to account for both modes, but the on-disk mode is not mandatory.

@HeinrichApfelmus

HeinrichApfelmus commented Apr 30, 2024

I may be wrong, but that query is not expected to be performant, nor is it expected to be used in production. The node does not need a reverse index of UTXOs to operate in the network, so maintaining such an index falls outside the node's responsibilities, and doing so would impact other areas, in particular memory usage.

I think the expected setup is for clients to track the UTXO set of their accounts, which (I think) cardano-wallet does. The alternative would be to use some external UTXO indexer like db-sync.

Yes.

I can confirm that cardano-wallet keeps track of the UTxO set itself by parsing blocks fetched from the local node.

People have created a variety of external chain indexers; see Carp vs alternatives for details.

I do want to point out that developers who are new to the Cardano ecosystem often expect the node to perform some indexing out of the box, such as an Address → UTxO query. Different cryptocurrencies handle this issue differently — I found it instructive to look at the source code of the Kraken wallet: for Bitcoin, this wallet uses Electrum, a semi-light wallet architecture; for Solana, the wallet talks directly to the nodes, which index everything and present an HTTP interface.

I generally agree with the strict separation of concerns taken by the Cardano node, but it's also a fact that the Node-to-Client protocol exists and goes beyond the needs of block producers, so there is room to haggle about what precisely it should contain — perhaps in the same executable, perhaps in a separate executable. It's mostly a trade-off involving implementation complexity of clients as well as resource usage.

@gitmachtl
Contributor

gitmachtl commented May 22, 2024

Please remove it from the API and/or mark it clearly as "don't use in production".

@jasagredo

No, don't remove it. There are a whole bunch of tools and scripts out there relying on the ability of the CLI to query UTXOs by address, even if it's slow!


This issue is stale because it has been open 30 days with no activity. Remove stale label or comment or this will be closed in 120 days.

@github-actions github-actions bot added the Stale label Jun 22, 2024
@frog357

frog357 commented Jul 4, 2024

I'm still using 1.35.7 for this reason: even today it still replies in less than 2 seconds, whereas all the newer versions are considerably slower. We all have gigabytes of RAM these days; it seems petty to fret over an index of UTXOs taking up precious RAM. The fact is, we want people to use this software. We should support all use cases and not tell people they are doing it wrong. If you provide a defective tool, don't complain when people are upset that it doesn't work the same from day to day. I agree about db-sync being behind the main version - ALWAYS. We want adoption - people using it - not "you are doing it wrong".

@github-actions github-actions bot removed the Stale label Jul 5, 2024
@gitmachtl
Contributor

@frog357
I agree that the UTXO query by address should stay within cardano-cli, but I would take a look at Ogmios in your case.

@jonathangenlambda
Author

Ogmios is just a REST wrapper around cardano-node; there is no reason why it should be faster against your own local node. Besides that, it requires additional integration effort.

@gitmachtl
Contributor

Ogmios is just a REST wrapper around cardano-node; there is no reason why it should be faster against your own local node. Besides that, it requires additional integration effort.

Ah sorry... I meant Kupo, not Ogmios 😄

@jonathangenlambda
Author

jonathangenlambda commented Jul 12, 2024

OK, we just had three extreme outlier cases where querying the address addr1z92l7rnra7sxjn5qv5fzc4fwsrrm29mgkleqj9a0y46j5lqy7t0mecfnwgpzh0uh4vcqmd8du5yspdtgf0lllh27c5zshyv38p took a staggering 2 minutes and 35 seconds (!!).

According to our Grafana dashboard, the node had a prolonged spike of CPU usage (25%) at that time, lasting 30 minutes. It is completely unclear why, as we did not produce this load.

For reference: when I query it on my local machine against a locally synced mainnet node running 9.0, it takes 3 seconds, returning 2 UTxOs.

Whatever you guys say, 2 minutes and 35 seconds is absolutely ridiculous.

@gitmachtl
Contributor

gitmachtl commented Jul 12, 2024

Hmm... I did a query right now with node 9.0.0 and it took 4 seconds. The node with the higher CPU load, was that also node 9.0.0? And it was not in the ledger-replay phase after an upgrade? Have you taken a look at the system (htop or so) for read or write activity during that period?
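One way to gather that kind of evidence while reproducing a slow query is to sample the node process alongside it (a sketch assuming a Linux host with the sysstat package installed; the interval and address are arbitrary placeholders):

  # Per-process CPU and disk activity of cardano-node, sampled every 5 seconds.
  pidstat -u -d -p "$(pgrep -x cardano-node)" 5 &

  # Run the slow query under `time` while the sampler is running.
  time cardano-cli query utxo --address addr1... --mainnet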

@jonathangenlambda
Author

@gitmachtl yeah, I know, locally it always works very fast, which I verified myself, also arriving at 3-4 seconds.
The higher CPU load was on an 8.9.1 node, and no, there was no ledger replay after an upgrade because we didn't upgrade.

@jonathangenlambda
Author

@gitmachtl just checked the metrics:

  • Receive bandwidth was around 10 kbytes/sec.
  • Transmit bandwidth was around 10 kbytes/sec, with a spike of 30 kB/s towards the end.
  • Rate of received packets was ~25 p/s, with a spike of 75 at the end.
  • Rate of transmitted packets was ~25 p/s, with a spike of 100-110 p/s.
  • Storage IO was 1 IOPS for writes most of the time and peaked at 44 towards the last 5 minutes, but the queries happened already a few minutes before.
  • Throughput was basically 0 until the spike at the end.

@jonathangenlambda
Author

jonathangenlambda commented Jul 12, 2024

@gitmachtl and when looking at the node logs we see quite a few of these:

[cardano-:cardano.node.ConnectionManager:Info:4198] [2024-07-11 22:41:42.29 UTC] TrConnectionHandler (ConnectionId {localAddress = 172.16.1.26:3001, remoteAddress = 34.89.206.68:1338}) (TrConnectionHandlerError OutboundError ExceededTimeLimit (ChainSync (Header (HardForkBlock (': * ByronBlock (': * (ShelleyBlock (TPraos StandardCrypto) (ShelleyEra StandardCrypto)) (': * (ShelleyBlock (TPraos StandardCrypto) (AllegraEra StandardCrypto)) (': * (ShelleyBlock (TPraos StandardCrypto) (MaryEra StandardCrypto)) (': * (ShelleyBlock (TPraos StandardCrypto) (AlonzoEra StandardCrypto)) (': * (ShelleyBlock (Praos StandardCrypto) (BabbageEra StandardCrypto)) (': * (ShelleyBlock (Praos StandardCrypto) (ConwayEra StandardCrypto)) ('[] *)))))))))) (Tip HardForkBlock (': * ByronBlock (': * (ShelleyBlock (TPraos StandardCrypto) (ShelleyEra StandardCrypto)) (': * (ShelleyBlock (TPraos StandardCrypto) (AllegraEra StandardCrypto)) (': * (ShelleyBlock (TPraos StandardCrypto) (MaryEra StandardCrypto)) (': * (ShelleyBlock (TPraos StandardCrypto) (AlonzoEra StandardCrypto)) (': * (ShelleyBlock (Praos StandardCrypto) (BabbageEra StandardCrypto)) (': * (ShelleyBlock (Praos StandardCrypto) (ConwayEra StandardCrypto)) ('[] *)))))))))) (ServerAgency TokNext TokCanAwait) ShutdownPeer)

There was also a lot of peer-protocol activity around that time.

We also saw the number of connections drop from 60 to 53 as a result of these queries, recovering afterwards.

@KtorZ
Contributor

KtorZ commented Jul 15, 2024

Note that this very problem was the primary motivation for Kupo ( https://github.com/CardanoSolutions/kupo ). I was told more than 2 years ago that queryUTxOByAddress was deprecated and about to be removed for performance reasons, so an alternative had to be found.

I believe Kupo provides a great solution to that nowadays and has even evolved beyond that initial goal thanks to open-source contributions. It's also sufficiently lightweight and self-contained (a single unit of deployment) that it doesn't add many constraints to an existing setup.
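For anyone evaluating that route, a minimal Kupo setup looks roughly like the following (a sketch from memory of Kupo's README; the flag names, default port 1442, and the ?unspent filter should be double-checked against `kupo --help` and the docs, and all paths and the address are placeholders):

  # Index only the address of interest while following the local node.
  kupo \
    --node-socket /path/to/node.socket \
    --node-config /path/to/cardano-node/config.json \
    --since origin \
    --match "addr1..." \
    --workdir ./kupo-db

  # UTxO lookups then become a cheap HTTP call against the local index.
  curl "http://localhost:1442/matches/addr1...?unspent"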

@jonathangenlambda
Author

@KtorZ thanks for your input on this - unfortunately we were never told that queryUTxOByAddress was deprecated and about to be removed for performance reasons, nor was/is this apparent anywhere in the Haddocks, unless I am mistaken.

We might have a look into Kupo then.

@gitmachtl
Contributor

I hope it at least stays in there as a function, because third-party tools rely on it, even if it is slow.


This issue is stale because it has been open 30 days with no activity. Remove stale label or comment or this will be closed in 120 days.

@github-actions github-actions bot added the Stale label Aug 15, 2024