
Handle cached result column names and rows separately #6504

Merged
derrabus merged 2 commits into doctrine:4.2.x from array-result
Aug 29, 2024

Conversation

morozov
Member

@morozov morozov commented Aug 18, 2024

Q A
Type bug

Summary

Prior to #6428, there were two problems with caching results by design:

  1. If the result contains no rows, its cached version won't return the number of columns.
  2. If the result contains columns with the same name (e.g. SELECT a.id, b.id FROM a, b), they will clash even when fetched in numeric mode, as illustrated below.
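For illustration (not code from the pull request), here is why numeric fetching cannot recover both columns from the old map-based format:

// Old cache format: each row is stored as a map keyed by column name,
// so for SELECT a.id, b.id the second "id" overwrites the first.
$cachedRow = ['id' => 2];
array_values($cachedRow); // [2], so fetchNumeric() can only return one value instead of [1, 2]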

See #6428 (comment) for reference.

The solution is:

  1. Handle column names and rows in the cache result separately. This way, the column names are available even if there is no data.
  2. Store rows as tuples (lists) instead of maps (associative arrays). This way, even if the result contains multiple columns with the same name, they can be correctly fetched with fetchNumeric(); see the sketch after this list.
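A minimal sketch of the new layout (variable names are illustrative, not the actual ArrayResult internals):

// Column names and rows are kept separately; each row is a plain list.
$columnNames = ['id', 'id'];           // SELECT a.id, b.id FROM a, b
$rows        = [[1, 2]];

count($columnNames);                   // columnCount() works even with zero rows
$rows[0];                              // fetchNumeric(): both values survive as [1, 2]
array_combine($columnNames, $rows[0]); // fetchAssociative(): duplicate names collapse here, as they do in drivers' associative mode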

Additional notes:

  1. The changes in TestUtil::generateResultSetQuery() were necessary for the integration testing of the changes in ArrayResult. Later, however, I realized that the modified code is database-agnostic, so running it in integration with each database platform would effectively test the same code path and would be redundant. I replaced the integration tests with unit tests but left the change in TestUtil, as it makes the calling code cleaner.
  2. ArrayStatementTest was renamed to ArrayResultTest, which Git doesn't seem to recognize due to the number of changes.

TODO:

  • Decide on the release version. This pull request changes the cache format, so users will have to clear their cache during the upgrade. I wouldn't consider it a BC break. We could also migrate the cache at runtime, but given that this cache can be cleared, I don't think it's worth the effort.
  • Add upgrade documentation.

@morozov morozov marked this pull request as draft August 18, 2024 00:52
@morozov morozov marked this pull request as ready for review August 18, 2024 07:11
@greg0ire
Member

Hi! After reading this thread, I think the result cache is unlikely to get cleared between deploys. Because of that, there will be a performance hit. What else will there be? How will DBAL behave when attempting to unserialize a row cached by the previous version? I think an intermediate solution between doing nothing and migrating the cache at runtime could be to discard cache entries that look like they were generated by the previous version, also at runtime. Would that be significantly less effort?

@morozov
Member Author

morozov commented Aug 18, 2024

Hi

👋

What else will there be?

I don't think there are any other implications. I can imagine that unless a cache is implemented ideally (which is rarely the case), it needs to be cleared every now and then. A dependency update looks like an adequate reason to clear the cache.

How will DBAL behave when attempting to unserialize a row cached by the previous version?

The current implementation doesn't support this scenario. Likely, there will be some notices like "Undefined index" and maybe more severe errors.

I think an intermediate solution between doing nothing and migrating the cache at runtime could be to discard cache entries that look like they were generated by the previous version, also at runtime. Would that be significantly less effort?

Yeah, this behavior looks reasonable in general, but it will have the same performance implications as the user clearing the cache. The downside is that if the code is able to ignore the cache by design, it becomes prone to issues where an error in the code or infrastructure manifests as performance degradation rather than as application-level errors. These situations are much more subtle and harder to diagnose.

As an operator of such an application, I would be comfortable with seeing the errors, reading the upgrade documentation, clearing the cache and getting the problem solved. I would be less calm if I knew that this application could ignore the cache without reporting it in any way.

@greg0ire
Member

I see what you mean. Here is another idea: what if, upon detecting an old version of the cache, we threw a dedicated exception instead of letting the users get to the errors about undefined index and so on? I'm assuming there is a reliable way to detect old versions of the cache, and that there aren't too many places in the code where that would need to be done.

In any case, I think this should ship in 4.2.0, because a good upgrade guide should make that OK.

@morozov
Member Author

morozov commented Aug 18, 2024

I think we should reproduce the exact scenario of the cache format mismatch and act based on that.

Besides that, since we're discussing this, I thought that maybe, instead of storing a tuple of column names and rows in the cache, we can split ArrayResult into the actual object to be cached (which will contain the column names and rows) and the result/iterator (which will reference the cached data and maintain the fetch state). This way, it would be much easier to check whether the cached data matches the expected format (instanceof) and report the error otherwise (could we trigger a warning in this case?).

@morozov
Member Author

morozov commented Aug 18, 2024

Throwing an exception will unnecessarily complicate the upgrade. It will block the execution of the application and prevent it from naturally populating the cache with data in the new format. If it were a warning, the operator could either see the warnings and clear the cache, or ignore them and let the cache get updated naturally over time.

Caching is not a functional feature of an application, so its misbehavior shouldn't be silenced but shouldn't be blocking either.

@greg0ire
Member

we can split ArrayResult into the actual object to be cached

Sorry, I don't understand what you mean by that, can you elaborate?

Regarding the warning, I think that would be a sensible solution indeed 👍

@morozov
Member Author

morozov commented Aug 18, 2024

Sorry, I don't understand what you mean by that, can you elaborate?

In my current version of the feature, ArrayResult objects store the following properties:

  1. $columnNames and $rows, which describe the cached data.
  2. $num, which describes the state of fetching data from the result.

The column names and rows are stored in cache as a tuple:
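Roughly, with assumed variable names, the cached value is shaped like this:

// Assumed names: a two-element tuple of column names and row tuples.
$value[$realKey] = [
    ['id', 'id'],     // column names
    [[1, 2], [3, 4]], // rows as lists
];
$item->set($value);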

Using a tuple looks clumsy because:

  1. It's just an array that doesn't enforce the number or the types of its elements.
  2. It's less memory-efficient than objects.

I propose that we break ArrayResult into two classes. Something like:

readonly class CachedData {
    // Stateless container holding only the data to be cached.
    public function __construct(private array $columnNames, private array $rows) {}
}

class CachedResult implements Result {
    // Fetch position; runtime state that must not end up in the cache.
    private int $num = 0;

    public function __construct(private CachedData $data) {}
}

This way, CachedData will be a stateless data container that can be cached, and CachedResult will be a stateful iterator over this data.

Another idea (which doesn't invalidate but complements the above) is that instead of storing an associative array of results in the cache item (see below), we could store an object there.

dbal/src/Connection.php

Lines 808 to 811 in 7c9d983

$value = $item->get();
if (! is_array($value)) {
    $value = [];
}

This way, instead of validating, discarding and updating elements of this array one by one, we could discard the whole value if it's not an object of the expected class. And in the unlikely event that we need to change the cache format again, this approach will allow introducing format versioning and automatic cache invalidation during upgrades.
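As a sketch, reusing the hypothetical class name from above (not the actual Connection code):

$value = $item->get();

if (! $value instanceof CachedData) {
    // Unknown or outdated format: treat it as a cache miss and repopulate.
    $value = null;
}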

@greg0ire
Member

Thanks for the explanation, I think moving towards more objects is a good idea. Using objects will result in a slightly bigger cache because of the way serialize() works, but I can't think of a realistic scenario where it would be an actual issue for anyone.

@derrabus
Member

Thank you for working on this issue!

We must be able to handle an old cache gracefully. We can recommend clearing the cache, we can document it, and people will still deploy without clearing the cache. Not clearing the cache is fine in ~99% of deployments, and it's the preferable option if the alternative is starting with a cold cache after deployment.

We should either upgrade or discard old cache entries. If we don't want to deal with them at all, we could just change our naming strategy for cache keys:

Change the strategy for cache key generation

This way, we will never read an old entry, and since we don't access them anymore, they should be purged from the cache eventually. However, this means that the cache might grow to twice its usual size for a short period of time. Also, the app will effectively have a cold cache after deployment.
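A sketch of that approach (the version suffix is illustrative; the actual key generation lives in QueryCacheProfile::generateCacheKeys()):

// Salting the generated key with a format version means that entries written
// in the old format are simply never looked up again.
$cacheKey = sha1($sql . '-' . serialize($params) . '-' . serialize($types) . '-v2');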

Detect and discard old cache data when reading

Old cache data would be treated as if it were expired. This adds a little complexity to our code, which we might want to remove in 5.0. On the other hand, the cache size should remain roughly the same. The detection logic, however, could have a performance impact that we should keep as low as possible.

Apart from that, the effect would be the same as in the previous option.

Detect and upgrade cache data when reading

Old cache data can easily be upgraded to the new format. This way, the app would keep its warm cache, but the data transformation will have a runtime impact. If that impact is lower than a full roundtrip to the database, it might be worth it.

We could even write back the upgraded data to the cache. However, since the old entries should expire eventually, that should not really be necessary.
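A sketch of such an upgrade, assuming the old format stored associative rows and the new one stores column names plus row tuples:

// Assumed formats: old = list of associative rows, new = [column names, row tuples].
// Note that duplicate column names cannot be recovered from the old format.
function upgradeCachedRows(array $oldRows): array
{
    $columnNames = $oldRows === [] ? [] : array_keys($oldRows[0]);

    return [$columnNames, array_map('array_values', $oldRows)];
}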

@derrabus
Member

We might have another inconsistency with regards to column names: How should a result behave if we call getColumnName() or columnCount() on a freed result? Some drivers are able to fulfill that request, others raise an exception. With your change, the ArrayResult switches from raising an exception to returning the correct column because the list of columns is not freed along with the rest.

Which behavior do we actually want?

@morozov
Member Author

morozov commented Aug 20, 2024

Which behavior do we actually want?

I forgot to mention this, but the current implementation is a deliberate choice (not necessarily correct).

/**
 * Discards the non-fetched portion of the result, enabling the originating statement to be executed again.
 */
public function free(): void;

  1. I don't think that for a "physical" statement it's required to clear the column metadata in order to re-execute the statement. Furthermore, upon re-execution, the column metadata is guaranteed to be the same, so there's no point in discarding it.
  2. A result cannot initially exist with empty metadata, so it shouldn't be cleared upon freeing either.

So I believe we should retain the metadata upon freeing.
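In code, the intent is roughly this (property names taken from the description above):

public function free(): void
{
    // Discard the remaining rows but keep $this->columnNames, so that
    // columnCount() and getColumnName() still work on a freed result.
    $this->rows = [];
    $this->num  = 0;
}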

@morozov
Member Author

morozov commented Aug 24, 2024

Personally, I'm not sold on improving the upgrade experience at the cost of complicating the code at this point. It's not data, it's cache.

My major concern is not really about the code (which should be easy to write), it's primarily about the scenarios which we need to account for while designing the upgrade logic. E.g. how will a given solution work in a rolling upgrade scenario where multiple processes running different versions of the DBAL will read from and write to the cache? This eliminates the approach of upgrading cache in place because it will break the process(es) still running the old version.

The second concern is automated testing. The upgrade scenarios would have to be tested somehow, and we don't have the tooling for testing upgrades. It means that we'd have to build some one-off tooling for that.

With that said, I will totally understand if the proposed change in its current form is not acceptable for a minor release (maybe we can move it to the next major) or at all. But I'm not interested in building automation for a scenario that could be handled by clearing cache. Hopefully, it doesn't sound like I'm irritated or upset. I don't have any attachment to this pull request and am just stating the facts.

@derrabus
Member

E.g. how will a given solution work in a rolling upgrade scenario where multiple processes running different versions of the DBAL will read from and write to the cache? This eliminates the approach of upgrading cache in place because it will break the process(es) still running the old version.

I'd be happy if the new version simply does not choke on the old cache format, because I'm pretty sure that this is going to backfire on us. The code we would need for that should be less annoying to maintain than the tickets we would otherwise need to close.

The second concern is automated testing. The upgrade scenarios would have to be tested somehow, and we don't have the tooling for testing upgrades. It means that we'd have to build some one-off tooling for that.

I wouldn't overthink this. Just pre-populate an array cache with the old format and try reading a result from it.

But I'm not interested in building automation for a scenario that could be handled by clearing cache. Hopefully, it doesn't sound like I'm irritated or upset. I don't have any attachment to this pull request and am just stating the facts.

Shall I take over the part with the cache upgrading? Either way, we should document in our UPGRADE.md file that the result cache format will change and recommend clearing the cache when upgrading.

@morozov
Member Author

morozov commented Aug 27, 2024

Shall I take over the part with the cache upgrading?

That would be great. Thank you!

@derrabus
Member

All right. Let's ignore cache upgrading for now and I'll work on that in a follow-up. Can you please rebase your PR to resolve the conflicts and document the cache change in UPGRADE.md?

@morozov
Member Author

morozov commented Aug 28, 2024

@derrabus, done.

@derrabus derrabus merged commit d47b6b1 into doctrine:4.2.x Aug 29, 2024
78 of 79 checks passed
xabbuh added a commit to symfony/symfony that referenced this pull request Aug 30, 2024
…buh)

This PR was merged into the 5.4 branch.

Discussion
----------

[Messenger] fix test to be compatible with DBAL 4.2

| Q             | A
| ------------- | ---
| Branch?       | 7.1
| Bug fix?      | no
| New feature?  | no
| Deprecations? | no
| Issues        |
| License       | MIT

see doctrine/dbal#6504

Commits
-------

447bf7e fix test to be compatible with DBAL 4.2
derrabus added a commit that referenced this pull request Aug 30, 2024
|      Q       |   A
|------------- | -----------
| Type         | improvement
| Fixed issues | Follows #6504

#### Summary

I'm changing the query cache format once again: Instead of storing a
tuple of arrays, I'm now storing the whole `ArrayResult` object. This
way, we can easily detect if we support the data returned from the cache
and we can handle future cache format changes in the `__unserialize()`
methods.

After #6504, we would run into errors if an app attempted to use a
cache written with DBAL 4.1. With my change, the old cache is ignored
and the app behaves as if it had encountered a cache miss instead.

I've also added tests covering the serialization of `ArrayResult`
objects.
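As an illustration (keys and fallbacks are assumptions, not the actual implementation), __unserialize() can absorb format changes like this:

public function __unserialize(array $data): void
{
    // If a later version renames or adds keys, old cached objects can be
    // mapped to the new shape here instead of failing with undefined-index errors.
    $this->columnNames = $data['columnNames'] ?? [];
    $this->rows        = $data['rows'] ?? [];
    $this->num         = 0;
}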
derrabus added a commit that referenced this pull request Aug 30, 2024
* 4.2.x:
  Invalidate old query cache format (#6510)
  Handle cached result column names and rows separately (#6504)
@morozov morozov deleted the array-result branch October 6, 2024 12:36