
Handle cached result column names and rows separately #6504

Merged
derrabus merged 2 commits into doctrine:4.2.x from array-result
Aug 29, 2024

Conversation

morozov
Member

@morozov morozov commented Aug 18, 2024

Q A
Type bug

Summary

Prior to #6428, there were two problems with caching results by design:

  1. If the result contains no rows, its cached version won't return the number of columns.
  2. If the result contains columns with the same name (e.g. SELECT a.id, b.id FROM a, b), they will clash even when fetched in numeric mode, as illustrated below.
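For illustration (not code from the pull request), here is why numeric fetching cannot recover both columns from the old map-based format:

// Old cache format: each row is stored as a map keyed by column name,
// so for SELECT a.id, b.id the second "id" overwrites the first.
$cachedRow = ['id' => 2];
array_values($cachedRow); // [2], so fetchNumeric() can only return one value instead of [1, 2]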

See #6428 (comment) for reference.

The solution is:

  1. Handle column names and rows in the cache result separately. This way, the column names are available even if there is no data.
  2. Store rows as tuples (lists) instead of maps (associative arrays). This way, even if the result contains multiple columns with the same name, they can be correctly fetched with fetchNumeric(); see the sketch after this list.
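A minimal sketch of the new layout (variable names are illustrative, not the actual ArrayResult internals):

// Column names and rows are kept separately; each row is a plain list.
$columnNames = ['id', 'id'];           // SELECT a.id, b.id FROM a, b
$rows        = [[1, 2]];

count($columnNames);                   // columnCount() works even with zero rows
$rows[0];                              // fetchNumeric(): both values survive as [1, 2]
array_combine($columnNames, $rows[0]); // fetchAssociative(): duplicate names collapse here, as they do in drivers' associative mode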

Additional notes:

  1. The changes in TestUtil::generateResultSetQuery() were necessary for the integration testing of the changes in ArrayResult. Later, however, I realized that the modified code is database-agnostic, so running it in integration with each database platform would effectively test the same code path and would be redundant. I replaced the integration tests with unit tests but left the change in TestUtil, as it makes the calling code cleaner.
  2. ArrayStatementTest was renamed to ArrayResultTest, which Git doesn't seem to recognize due to the number of changes.

TODO:

  • Decide on the release version. This pull request changes the cache format, so users will have to clear their cache during the upgrade. I wouldn't consider it a BC break. We could also migrate the cache at runtime, but given that this cache can be cleared, I don't think it's worth the effort.
  • Add upgrade documentation.

@morozov morozov marked this pull request as draft August 18, 2024 00:52
@morozov morozov marked this pull request as ready for review August 18, 2024 07:11
@greg0ire
Member

Hi! After reading this thread, I think the result cache is unlikely to get cleared between deploys. Because of that, there will be a performance hit. What else will there be? How will DBAL behave when attempting to unserialize a row cached by the previous version? I think an intermediate solution between doing nothing and migrating the cache at runtime could be to discard cache entries that look like they were generated by the previous version, also at runtime. Would that be significantly less effort?

@morozov
Member Author

morozov commented Aug 18, 2024

Hi

👋

What else will there be?

I don't think there are any other implications. I can imagine that unless a cache is implemented ideally (which is rarely the case), it needs to be cleared every now and then. A dependency update looks like an adequate reason to clear the cache.

How will DBAL behave when attempting to unserialize a row cached by the previous version?

The current implementation doesn't support this scenario. Likely, there will be some notices like "Undefined index" and maybe more severe errors.

I think an intermediate solution between doing nothing and migrating the cache at runtime could be to discard cache entries that look like they were generated by the previous version, also at runtime. Would that be significantly less effort?

Yeah, this behavior looks reasonable in general, but it will have the same performance implications as the user clearing the cache. The downside is that if the code is able to ignore the cache by design, it becomes prone to issues where an error in the code or infrastructure manifests as performance degradation rather than as application-level errors. These situations are much more subtle and harder to diagnose.

As an operator of such an application, I would be comfortable with seeing the errors, reading the upgrade documentation, clearing the cache and getting the problem solved. I would be less calm if I knew that this application could ignore the cache without reporting it in any way.

@greg0ire
Member

I see what you mean. Here is another idea: what if, upon detecting an old version of the cache, we threw a dedicated exception instead of letting the users get to the errors about undefined index and so on? I'm assuming there is a reliable way to detect old versions of the cache, and that there aren't too many places in the code where that would need to be done.

In any case, I think this should ship in 4.2.0, because a good upgrade guide should make that OK.

@morozov
Member Author

morozov commented Aug 18, 2024

I think we should reproduce the exact scenario of the cache format mismatch and act based on that.

Besides that, since we're discussing this, I thought that maybe, instead of storing a tuple of column names and rows in the cache, we can split ArrayResult into the actual object to be cached (which will contain the column names and rows) and the result/iterator (which will reference the cached data and maintain the fetch state). This way, it would be much easier to check whether the cached data matches the expected format (instanceof) and report the error otherwise (could we trigger a warning in this case?).

@morozov
Member Author

morozov commented Aug 18, 2024

Throwing an exception will unnecessarily complicate the upgrade. It will block the execution of the application and prevent it from naturally populating the cache with data in the new format. If it were a warning, the operator could either see the warnings and clear the cache, or ignore them and let the cache get updated naturally over time.

Caching is not a functional feature of an application, so its misbehavior shouldn't be silenced but shouldn't be blocking either.

@greg0ire
Member

we can split ArrayResult into the actual object to be cached

Sorry, I don't understand what you mean by that, can you elaborate?

Regarding the warning, I think that would be a sensible solution indeed 👍

@morozov
Member Author

morozov commented Aug 18, 2024

Sorry, I don't understand what you mean by that, can you elaborate?

In my current version of the feature, ArrayResult objects store the following properties:

  1. $columnNames and $rows, which describe the cached data.
  2. $num, which describes the state of fetching data from the result.

The column names and rows are stored in cache as a tuple:
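Roughly, with assumed variable names, the cached value is shaped like this:

// Assumed names: a two-element tuple of column names and row tuples.
$value[$realKey] = [
    ['id', 'id'],     // column names
    [[1, 2], [3, 4]], // rows as lists
];
$item->set($value);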

Using a tuple looks clumsy because:

  1. It's just an array that doesn't enforce the number or the types of its elements.
  2. It's less memory-efficient than objects.

I propose that we break ArrayResult into two classes. Something like:

readonly class CachedData {
    // Stateless container holding only the data to be cached.
    public function __construct(private array $columnNames, private array $rows) {}
}

class CachedResult implements Result {
    // Fetch position; runtime state that must not end up in the cache.
    private int $num = 0;

    public function __construct(private CachedData $data) {}
}

This way, CachedData will be a stateless data container that can be cached, and CachedResult will be a stateful iterator over this data.

Another idea (which doesn't invalidate but complements the above) is that instead of storing an associative array of results in the cache item (see below), we could store an object there.

dbal/src/Connection.php

Lines 808 to 811 in 7c9d983

$value = $item->get();
if (! is_array($value)) {
    $value = [];
}

This way, instead of validating, discarding and updating elements of this array one by one, we could discard the whole value if it's not an object of the expected class. And in the unlikely event that we need to change the cache format again, this approach will allow introducing format versioning and automatic cache invalidation during upgrades.
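As a sketch, reusing the hypothetical class name from above (not the actual Connection code):

$value = $item->get();

if (! $value instanceof CachedData) {
    // Unknown or outdated format: treat it as a cache miss and repopulate.
    $value = null;
}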

@greg0ire
Member

Thanks for the explanation, I think moving towards more objects is a good idea. Using objects will result in a slightly bigger cache because of the way serialize() works, but I can't think of a realistic scenario where it would be an actual issue for anyone.

@derrabus
Member

Thank you for working on this issue!

We must be able to handle an old cache gracefully. We can recommend clearing the cache, we can document it, and people will still deploy without clearing the cache. Not clearing the cache is fine in ~99% of deployments, and it's the preferable option if the alternative is starting with a cold cache after deployment.

We should either upgrade or discard old cache entries. If we don't want to deal with them at all, we could just change our naming strategy for cache keys:

Change the strategy for cache key generation

This way, we will never read an old entry, and since we don't access them anymore, they should be purged from the cache eventually. However, this means that the cache might grow to twice its usual size for a short period of time. Also, the app will effectively have a cold cache after deployment.
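A sketch of that approach (the version suffix is illustrative; the actual key generation lives in QueryCacheProfile::generateCacheKeys()):

// Salting the generated key with a format version means that entries written
// in the old format are simply never looked up again.
$cacheKey = sha1($sql . '-' . serialize($params) . '-' . serialize($types) . '-v2');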

Detect and discard old cache data when reading

Old cache data would be treated as if it were expired. This adds a little complexity to our code, which we might want to remove in 5.0. On the other hand, the cache size should remain roughly the same. The detection logic, however, could have a performance impact that we should keep as low as possible.

Apart from that, the effect would be the same as in the previous option.

Detect and upgrade cache data when reading

Old cache data can easily be upgraded to the new format. This way, the app would keep its warm cache, but the data transformation will have a runtime impact. If that impact is lower than a full roundtrip to the database, it might be worth it.

We could even write back the upgraded data to the cache. However, since the old entries should expire eventually, that should not really be necessary.
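A sketch of such an upgrade, assuming the old format stored associative rows and the new one stores column names plus row tuples:

// Assumed formats: old = list of associative rows, new = [column names, row tuples].
// Note that duplicate column names cannot be recovered from the old format.
function upgradeCachedRows(array $oldRows): array
{
    $columnNames = $oldRows === [] ? [] : array_keys($oldRows[0]);

    return [$columnNames, array_map('array_values', $oldRows)];
}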

@derrabus
Member

We might have another inconsistency with regards to column names: How should a result behave if we call getColumnName() or columnCount() on a freed result? Some drivers are able to fulfill that request, others raise an exception. With your change, the ArrayResult switches from raising an exception to returning the correct column because the list of columns is not freed along with the rest.

Which behavior do we actually want?

@morozov
Member Author

morozov commented Aug 20, 2024

Which behavior do we actually want?

I forgot to mention this, but the current implementation is a deliberate choice (not necessarily correct).

/**
 * Discards the non-fetched portion of the result, enabling the originating statement to be executed again.
 */
public function free(): void;

  1. I don't think that for a "physical" statement it's required to clear the column metadata in order to re-execute the statement. Furthermore, upon re-execution, the column metadata is guaranteed to be the same, so there's no point in discarding it.
  2. A result cannot initially exist with empty metadata, so it shouldn't be cleared upon freeing either.

So I believe we should retain the metadata upon freeing.
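In code, the intent is roughly this (property names taken from the description above):

public function free(): void
{
    // Discard the remaining rows but keep $this->columnNames, so that
    // columnCount() and getColumnName() still work on a freed result.
    $this->rows = [];
    $this->num  = 0;
}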

@morozov
Member Author

morozov commented Aug 24, 2024

Personally, I'm not sold on improving the upgrade experience at the cost of complicating the code at this point. It's not data, it's cache.

My major concern is not really about the code (which should be easy to write), it's primarily about the scenarios which we need to account for while designing the upgrade logic. E.g. how will a given solution work in a rolling upgrade scenario where multiple processes running different versions of the DBAL will read from and write to the cache? This eliminates the approach of upgrading cache in place because it will break the process(es) still running the old version.

The second concern is automated testing. The upgrade scenarios would have to be tested somehow, and we don't have the tooling for testing upgrades. It means that we'd have to build some one-off tooling for that.

With that said, I will totally understand if the proposed change in its current form is not acceptable for a minor release (maybe we can move it to the next major) or at all. But I'm not interested in building automation for a scenario that could be handled by clearing cache. Hopefully, it doesn't sound like I'm irritated or upset. I don't have any attachment to this pull request and am just stating the facts.

@derrabus
Member

E.g. how will a given solution work in a rolling upgrade scenario where multiple processes running different versions of the DBAL will read from and write to the cache? This eliminates the approach of upgrading cache in place because it will break the process(es) still running the old version.

I'd be happy if the new version simply does not choke on the old cache format, because I'm pretty sure that this is going to backfire on us. The code we would need for that should be less annoying to maintain than the tickets we would otherwise need to close.

The second concern is automated testing. The upgrade scenarios would have to be tested somehow, and we don't have the tooling for testing upgrades. It means that we'd have to build some one-off tooling for that.

I wouldn't overthink this. Just pre-populate an array cache with the old format and try reading a result from it.

But I'm not interested in building automation for a scenario that could be handled by clearing cache. Hopefully, it doesn't sound like I'm irritated or upset. I don't have any attachment to this pull request and am just stating the facts.

Shall I take over the part with the cache upgrading? Either way, we should document in our UPGRADE.md file that the result cache format will change and recommend clearing the cache when upgrading.

@morozov
Member Author

morozov commented Aug 27, 2024

Shall I take over the part with the cache upgrading?

That would be great. Thank you!

@derrabus
Member

All right. Let's ignore cache upgrading for now and I'll work on that in a follow-up. Can you please rebase your PR to resolve the conflicts and document the cache change in UPGRADE.md?

@morozov
Member Author

morozov commented Aug 28, 2024

@derrabus, done.

@derrabus derrabus merged commit d47b6b1 into doctrine:4.2.x Aug 29, 2024
78 of 79 checks passed
xabbuh added a commit to symfony/symfony that referenced this pull request Aug 30, 2024
…buh)

This PR was merged into the 5.4 branch.

Discussion
----------

[Messenger] fix test to be compatible with DBAL 4.2

| Q             | A
| ------------- | ---
| Branch?       | 7.1
| Bug fix?      | no
| New feature?  | no
| Deprecations? | no
| Issues        |
| License       | MIT

see doctrine/dbal#6504

Commits
-------

447bf7e fix test to be compatible with DBAL 4.2
derrabus added a commit that referenced this pull request Aug 30, 2024
|      Q       |   A
|------------- | -----------
| Type         | improvement
| Fixed issues | Follows #6504

#### Summary

I'm changing the query cache format once again: Instead of storing a
tuple of arrays, I'm now storing the whole `ArrayResult` object. This
way, we can easily detect if we support the data returned from the cache
and we can handle future cache format changes in the `__unserialize()`
methods.

After #6504, we would run into errors if an app attempted to use a
cache written with DBAL 4.1. With my change, the old cache is ignored
and the app behaves as if it had encountered a cache miss instead.

I've also added tests covering the serialization of `ArrayResult`
objects.
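As an illustration (keys and fallbacks are assumptions, not the actual implementation), __unserialize() can absorb format changes like this:

public function __unserialize(array $data): void
{
    // If a later version renames or adds keys, old cached objects can be
    // mapped to the new shape here instead of failing with undefined-index errors.
    $this->columnNames = $data['columnNames'] ?? [];
    $this->rows        = $data['rows'] ?? [];
    $this->num         = 0;
}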
derrabus added a commit that referenced this pull request Aug 30, 2024
* 4.2.x:
  Invalidate old query cache format (#6510)
  Handle cached result column names and rows separately (#6504)
@morozov morozov deleted the array-result branch October 6, 2024 12:36