This repository has been archived by the owner on Feb 4, 2020. It is now read-only.

Very simple fast test of hashing performance for moderate sized files #241

Open · wants to merge 1 commit into master

Conversation

inorton
Contributor

@inorton inorton commented Nov 2, 2016

I noticed the conversation about hashing in one of the other issues. Perhaps this would be useful?

It creates a directory tree of 1000 files, each 256K in size, and then simply hashes each one. Because the files are created fresh on each run, any misleading figures due to warm/cold disk caches should at least be consistent across runs. The whole test takes only 1-2 seconds for me.
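The benchmark described above could be sketched roughly as follows. This is a minimal illustration, not the pull request's actual code; the file names, directory layout, and the choice of MD5 are assumptions for the sketch (clcache's real hash function and the PR's exact structure may differ). The demo run at the bottom is scaled down so it finishes instantly.

```python
import hashlib
import os
import tempfile
import time

def create_tree(root, count, size):
    """Create `count` files of `size` random bytes under `root`."""
    paths = []
    for i in range(count):
        # Spread files across 16 subdirectories, mimicking a source tree.
        sub = os.path.join(root, str(i % 16))
        os.makedirs(sub, exist_ok=True)
        path = os.path.join(sub, "file{}.bin".format(i))
        with open(path, "wb") as f:
            f.write(os.urandom(size))
        paths.append(path)
    return paths

def hash_all(paths):
    """Hash every file sequentially and return the hex digests."""
    digests = []
    for path in paths:
        with open(path, "rb") as f:
            digests.append(hashlib.md5(f.read()).hexdigest())
    return digests

# Scaled-down smoke run; the benchmark in the pull request uses
# 1000 files of 256K each.
with tempfile.TemporaryDirectory() as root:
    paths = create_tree(root, count=20, size=4096)
    start = time.perf_counter()
    digests = hash_all(paths)
    elapsed = time.perf_counter() - start
print("hashed {} files in {:.3f}s".format(len(digests), elapsed))
```

Creating the tree fresh on every run is what keeps the cache state comparable between runs, at the cost of measuring a mostly-warm page cache.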

@codecov-io

codecov-io commented Nov 2, 2016

Current coverage is 88.86% (diff: 100%)

Merging #241 into master will decrease coverage by 0.55%

@@             master       #241   diff @@
==========================================
  Files             1          1          
  Lines          1040        997    -43   
  Methods           0          0          
  Messages          0          0          
  Branches        166        158     -8   
==========================================
- Hits            930        886    -44   
- Misses           82         83     +1   
  Partials         28         28          

Powered by Codecov. Last update 1e7b28a...7c8f2d1

@frerich
Owner

frerich commented Nov 3, 2016

Thanks, I think it's a good idea to have some sort of standard benchmark. I suppose instead of creating a new file, this could be part of the existing performancetests.py test case?

Alas (?), the discussion in #239 suggests that the slow-hashing issue is related to concurrent clcache instances, at least for @akleber's setup -- so I'm not sure the test code as it stands reproduces that issue.

In any case, I very much agree that some sort of performance test for this functionality would be good -- it's not clear to me though which scenarios to benchmark.

@inorton
Contributor Author

inorton commented Nov 5, 2016

I started this with a theory that we could compute hashes in parallel using concurrent.futures when multiple CPUs are available (some workers doing I/O, some doing hashing). My tests here showed that it actually made things worse, by a considerable margin.
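A comparison like the one described could be sketched as below. This is a hedged illustration of the experiment, not the author's code; the worker count, file sizes, and use of `ThreadPoolExecutor` are assumptions. For CPU-bound hashing in CPython, the GIL plus competing disk reads are a plausible reason the parallel version loses.

```python
import concurrent.futures
import hashlib
import os
import tempfile

def hash_file(path):
    """Hash a single file and return its hex digest."""
    with open(path, "rb") as f:
        return hashlib.md5(f.read()).hexdigest()

def hash_sequential(paths):
    """Baseline: hash files one after another."""
    return [hash_file(p) for p in paths]

def hash_parallel(paths, workers=4):
    """Hash files concurrently; pool.map preserves input order."""
    with concurrent.futures.ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(hash_file, paths))

# Small demo tree so the comparison can be run anywhere.
with tempfile.TemporaryDirectory() as root:
    paths = []
    for i in range(32):
        p = os.path.join(root, "f{}.bin".format(i))
        with open(p, "wb") as f:
            f.write(os.urandom(8192))
        paths.append(p)
    seq = hash_sequential(paths)
    par = hash_parallel(paths)

# Both strategies must produce identical digests; only timing differs.
assert seq == par
```

Timing each variant with `time.perf_counter()` around the two calls is enough to reproduce the kind of comparison discussed here.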

@frerich
Owner

frerich commented Nov 5, 2016

Indeed, it matches @akleber's observation that concurrent hashing of files is substantially slower than sequential hashing.

Maybe this is another argument in favor of some sort of server process which acts as the sole instance to sequentially hash (and potentially cache) hashes.
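The idea of a process that caches hashes could look something like the toy sketch below. This is purely illustrative and not clcache's design: the class name, the `(path, mtime, size)` cache key, and MD5 are all assumptions. A real server process would also need IPC and invalidation, which are omitted here.

```python
import hashlib
import os
import tempfile

class HashCache:
    """Toy in-process cache: re-hash a file only when its stat data changes."""

    def __init__(self):
        self._cache = {}

    def digest(self, path):
        st = os.stat(path)
        # Assumed cache key: a changed mtime or size invalidates the entry.
        key = (path, st.st_mtime_ns, st.st_size)
        if key not in self._cache:
            with open(path, "rb") as f:
                self._cache[key] = hashlib.md5(f.read()).hexdigest()
        return self._cache[key]

# Demo: the second lookup is served from the cache without re-reading the file.
with tempfile.TemporaryDirectory() as d:
    p = os.path.join(d, "a.txt")
    with open(p, "wb") as f:
        f.write(b"hello")
    cache = HashCache()
    first = cache.digest(p)
    second = cache.digest(p)
assert first == second
```

A single such instance serializes the hashing work by construction, which is exactly the property being argued for here.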

@frerich
Owner

frerich commented Nov 14, 2016

I think a performance test to check how fast cache hits and cache misses are (both concurrently as well as sequentially) would be a nice thing to have, but that should probably go into performancetests.py.

@frerich frerich added the test label Nov 14, 2016