Lots of bloom filter false positives in labelOccurrence extraction step #14

Open
dnmilne opened this issue Dec 18, 2013 · 0 comments

dnmilne commented Dec 18, 2013

We use a Bloom filter to decide which n-grams to count during the label occurrence extraction step, and it appears to be producing a very large number of false positives: on Simple Wikipedia we get 2M misses, where we would expect only a few thousand. This has a big effect on how much work the combiners and reducers have to do.
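One quick way to check whether the filter is simply undersized is to compare its configuration against the standard Bloom filter approximation p = (1 - e^(-kn/m))^k, where m is the number of bits, k the number of hash functions, and n the number of labels inserted. The sketch below is only an illustration: the function names are hypothetical, and the numbers in the usage note are made up, since this issue does not state the actual filter size, hash count, or label count.

```python
import math
from typing import Tuple


def expected_false_positive_rate(num_bits: int, num_hashes: int, num_items: int) -> float:
    """Approximate false-positive rate p = (1 - e^(-k*n/m))^k."""
    return (1.0 - math.exp(-num_hashes * num_items / num_bits)) ** num_hashes


def recommended_size(num_items: int, target_rate: float) -> Tuple[int, int]:
    """Return (num_bits, num_hashes) that roughly achieve target_rate for num_items."""
    # Optimal bit count: m = -n * ln(p) / (ln 2)^2
    num_bits = math.ceil(-num_items * math.log(target_rate) / (math.log(2) ** 2))
    # Optimal hash count: k = (m / n) * ln 2, rounded to a whole number of hashes
    num_hashes = max(1, round(num_bits / num_items * math.log(2)))
    return num_bits, num_hashes
```

For example, with a hypothetical one million labels, `recommended_size(1_000_000, 0.01)` suggests roughly 9.6 million bits and 7 hash functions, whereas `expected_false_positive_rate(1_000_000, 7, 1_000_000)` (one bit per label) is close to 1. The expected number of false positives is this rate multiplied by the number of distinct non-label n-grams tested, so even a small rate can yield many misses on a large corpus.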

It also doesn't look good for my proposed strategy for Issue #9.
