Is there a reason not to reuse stats for multiple detection runs? #308

Jinsung-L · 2023-01-19T04:07:46Z

My thought was that if I pass the --stats CSV option with the same video multiple times, detection after the first run would be faster, because it could reuse previously calculated values such as adaptive_ratio and content_val.

But it looks like neither caching nor loading is happening at all.

So I looked for some notes about this and found b89238c.
This commit deprecates the load_from_csv functionality of StatsManager, and the commit message says:

Loading stats from disk is rarely used, and complicates the implementation
of detection algorithms. Equivalent functionality with much better performance
could be obtained for most use cases by seeking instead.

Yet I'm not fully convinced why recalculating the stats on every run is better than caching previously calculated stats and reusing them for the next runs.

What am I missing here?

Answered by Breakthrough

Breakthrough · 2023-01-26T02:32:44Z

Maintainer

This feature was removed for a variety of reasons. Regarding performance, the benefit isn't as great as it once was now that v0.6 does some things in parallel to use multiple cores. One could also just process the output of a statsfile to see which frames exceed a particular threshold similar to how the detection loop works.
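
As a rough illustration of post-processing a statsfile this way, here is a minimal sketch rather than PySceneDetect's own detection loop. The function name `frames_over_threshold`, the `min_gap` parameter, and the sample data are hypothetical, and the column names `Frame Number` and `content_val` are assumptions about the CSV layout that should be checked against a real statsfile.

```python
import csv
import io


def frames_over_threshold(stats_csv, threshold, column="content_val",
                          frame_column="Frame Number", min_gap=1):
    """Return frame numbers whose score in `column` meets `threshold`.

    `min_gap` drops hits closer than that many frames to the previous hit,
    a rough stand-in for a minimum scene length.
    """
    hits = []
    for row in csv.DictReader(stats_csv):
        value = row.get(column)
        if not value:  # skip rows with a missing or empty score
            continue
        frame = int(row[frame_column])
        if float(value) >= threshold and (not hits or frame - hits[-1] >= min_gap):
            hits.append(frame)
    return hits


# Made-up sample data; the column names are assumptions about the file layout.
SAMPLE = """Frame Number,Timecode,content_val
1,00:00:00.000,0.0
2,00:00:00.042,1.5
3,00:00:00.083,42.0
4,00:00:00.125,2.0
5,00:00:00.167,35.5
6,00:00:00.208,30.1
"""

# Trying several thresholds only re-reads the CSV; no video is decoded.
for t in (27.0, 40.0):
    print(t, frames_over_threshold(io.StringIO(SAMPLE), t, min_gap=2))
```

With a real statsfile, `io.StringIO(SAMPLE)` would be replaced by an open file handle, for example `open("stats.csv", newline="")`.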

Really though, using the statsfile as a cache was not well thought out. The CSV file format is inefficient for that purpose, and a significant number of factors affect the stats calculations. These include the downscale factor, and even some detector parameters (e.g. for edge detection). Reusing a statsfile when these things change would be incorrect, and supporting it leads to a big rabbit hole of choices (e.g. do we make a new column for each changed dimension?).

Now things are much simpler for people actually consuming the statsfile for its primary purpose: statistical analysis of the video itself. In the long term, I'm not opposed to having some kind of data cache for speeding up repeated calculations when reprocessing videos. However, I don't think the statsfile itself is a good choice for that.

Sorry if this rationale wasn't made clear enough, but I'm happy to talk through specific points further if you wish. Thanks for the question.

3 replies

Breakthrough Jan 26, 2023
Maintainer

Could you also share a bit more about your use case, and why you would benefit from a cache? Any information you can provide about your workflow and expectations will help inform future direction on this matter. Thank you!

Jinsung-L Jan 28, 2023
Author

I personally find it irritating to work out the right threshold adjustment by opening up the CSV file and comparing each data point to the video timestamp.
So what I'm doing is just running the list-scenes or split-video command multiple times, changing only the threshold.
In this case, since I haven't changed any parameter other than the threshold, I think there's no reason to recalculate all the statistics again.
I understand that some parameters affect the frame score, but I think most detector parameters, such as threshold or min-content-val, don't.

So that was my use case and my thoughts on caching the stats. The core idea is that the final score can be reused across multiple runs if the changed parameters have no effect on it. Wouldn't this reduce the time it takes to calculate the frame scores to 0s?
After reading your answer, I see that the stats file is not a good choice for caching. Perhaps Python's pickle could be a simple alternative?

Maybe calculating the stats of the video and detecting the scenes from those stats should be separated in order to implement the caching feature. That way, we can use the parameters of the video analyzer as a cache key, skip the calculation if stats were previously calculated with the same parameters, and let scene detectors apply various thresholds to the cached stats.
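
A minimal sketch of that separation, under stated assumptions: `ScoreCache`, `detect_cuts`, and `fake_compute_scores` are hypothetical names, the scores are made up, and none of this reflects how PySceneDetect is actually structured. The expensive analysis step is cached under a key built from the video path and the analysis parameters, while thresholding is a cheap second stage that never touches the video.

```python
import hashlib
import json


class ScoreCache:
    """In-memory cache of per-frame scores, keyed by video path and analysis parameters."""

    def __init__(self, compute_scores):
        self._compute_scores = compute_scores  # the expensive decode-and-score step
        self._store = {}

    @staticmethod
    def _key(video_path, params):
        # Sort keys so logically equal parameter sets produce the same key.
        payload = json.dumps({"video": video_path, "params": params}, sort_keys=True)
        return hashlib.sha256(payload.encode("utf-8")).hexdigest()

    def scores(self, video_path, **params):
        key = self._key(video_path, params)
        if key not in self._store:
            self._store[key] = self._compute_scores(video_path, **params)
        return self._store[key]


def detect_cuts(scores, threshold):
    """Cheap second stage: indices of frames whose score meets the threshold."""
    return [i for i, score in enumerate(scores) if score >= threshold]


calls = []


def fake_compute_scores(video_path, downscale=1):
    # Placeholder for real frame analysis; records each call so reuse is visible.
    calls.append((video_path, downscale))
    return [0.0, 3.0, 41.0, 2.0, 29.0]


cache = ScoreCache(fake_compute_scores)
for threshold in (27.0, 35.0, 45.0):
    print(threshold, detect_cuts(cache.scores("video.mp4", downscale=2), threshold))
print("analysis passes:", len(calls))
```

Changing only the threshold reuses the cached scores, whereas changing an analysis parameter such as the hypothetical `downscale` produces a new key and triggers a fresh pass. Persisting `_store` to disk (with pickle, for instance) would be a separate design decision.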

Thank you for the answer! I love this project.

Breakthrough Jan 30, 2023
Maintainer

Ideally it would reduce the time to zero, but the way statsfiles worked was to query each detector on each frame of a video if processing was required. This is why it was never really faster once everything was optimized, as every frame would still be decoded.

To achieve a true speedup would require significant changes to the scanning logic. The statsfile cache was complex to maintain and led to some bugs, so I want to hold off on adding this feature until there has been careful thought as to how it should be integrated properly.

I'm open to any suggestions or ideas for how this can be integrated into the Python API side of things, and any proposals are most welcome. A few ideas might be to separate the calculation of frame metrics from how scenes are actually detected. We also need a way to fingerprint a video and the parameters used for it, e.g. using some kind of hash of the first few frames and the path.
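
One way such a fingerprint could look, as a sketch only: the function name `fingerprint` and its `head_bytes` parameter are hypothetical, and hashing the first bytes of the file is a swapped-in stand-in for hashing the first few decoded frames, which would need a video decoder. The analysis parameters are mixed into the hash so that a change in, say, the downscale factor invalidates the entry.

```python
import hashlib
import json
import os
import tempfile


def fingerprint(video_path, params, head_bytes=1 << 20):
    """Hash the path, file size, leading bytes, and analysis parameters."""
    with open(video_path, "rb") as f:
        head = f.read(head_bytes)
    parts = [
        os.path.abspath(video_path).encode("utf-8"),
        str(os.path.getsize(video_path)).encode("ascii"),
        head,
        json.dumps(params, sort_keys=True).encode("utf-8"),
    ]
    digest = hashlib.sha256()
    for part in parts:
        # Prefix each part with its length so adjacent fields cannot blur together.
        digest.update(len(part).to_bytes(8, "big"))
        digest.update(part)
    return digest.hexdigest()


# Demo with a throwaway file standing in for a video.
with tempfile.NamedTemporaryFile(suffix=".mp4", delete=False) as tmp:
    tmp.write(b"not a real video, just some bytes")
demo_path = tmp.name

same_a = fingerprint(demo_path, {"downscale": 2})
same_b = fingerprint(demo_path, {"downscale": 2})
different = fingerprint(demo_path, {"downscale": 4})
print(same_a == same_b, same_a == different)
os.remove(demo_path)
```

Including the path means a moved or renamed file misses the cache. Dropping the path and relying on size plus content would trade that for a higher chance of collisions between near-identical files.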
