Lots of bloom filter false positives in labelOccurrence extraction step #14

dnmilne · 2013-12-18T07:17:40Z

We use a bloom filter to decide which n-grams to count during the label occurrence extraction step, but it seems to be producing a very large number of false positives: on the Simple English Wikipedia we get 2M misses where we would expect only a few thousand. This significantly increases the work the combiners and reducers have to do.

It also doesn't look good for my proposed strategy for issue #9.
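
For anyone checking the numbers, here is a rough sketch of the standard bloom filter sizing formulas, which may help tell whether a false positive count is in line with how the filter was configured. It is not the code used in this project: the function names are made up for illustration, and the formula assumes independent, uniformly distributed hash functions, which a real implementation may not satisfy.

```python
import math


def expected_false_positive_rate(num_bits, num_hashes, num_items):
    """Approximate false positive rate of a bloom filter.

    Uses the textbook approximation (1 - e^(-k*n/m))^k, where m is the
    number of bits, k the number of hash functions and n the number of
    inserted items. Assumes ideal (independent, uniform) hash functions.
    """
    return (1.0 - math.exp(-num_hashes * num_items / num_bits)) ** num_hashes


def optimal_parameters(num_items, target_rate):
    """Return (num_bits, num_hashes) for a desired false positive rate.

    m = -n * ln(p) / (ln 2)^2 and k = (m / n) * ln 2, rounded to usable
    integer values.
    """
    num_bits = math.ceil(-num_items * math.log(target_rate) / (math.log(2) ** 2))
    num_hashes = max(1, round(num_bits / num_items * math.log(2)))
    return num_bits, num_hashes


if __name__ == "__main__":
    # Hypothetical figures for illustration only, not measurements from this issue.
    labels = 1_000_000
    bits, hashes = optimal_parameters(labels, target_rate=0.01)
    print(bits, hashes, expected_false_positive_rate(bits, hashes, labels))
```

If the rate observed in a job is far above what `expected_false_positive_rate` gives for the configured size and hash count, the likely suspects are an undersized bit vector, more inserted items than planned for, or poorly distributed hashes.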

dnmilne pushed a commit that referenced this issue Dec 18, 2013

working on issue #14 (now down to as many hits as misses)

652c496
