
Add Accurate Count-Distinct aggregator (RoaringBitmap) #493

Open
wants to merge 2 commits into base: develop

Conversation

vidma
Contributor

@vidma vidma commented Oct 4, 2015

Internally, this uses RoaringBitmap, a compressed alternative to BitSet (it's quite fast, and commonly used in projects such as Spark and Druid).

This is very simple, but IMHO algebird is still missing it :)

I'll add tests if you're willing to merge it (our tests are currently Spark/DataFrame dependent).
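The shape of the proposed aggregator can be sketched in a few lines. In this sketch, `scala.collection.immutable.BitSet` stands in for RoaringBitmap, and the minimal `Semigroup` trait stands in for `com.twitter.algebird.Semigroup` — the PR itself would use those libraries; this is just the idea, not the actual patch:

```scala
// Minimal sketch of an exact count-distinct aggregator. BitSet stands in
// for RoaringBitmap, and the trait below for algebird's Semigroup.
trait Semigroup[T] {
  def plus(left: T, right: T): T
}

import scala.collection.immutable.BitSet

// Merging two bitmaps is set union; the cardinality of the final union
// is the exact number of distinct ids seen.
object BitSetSemigroup extends Semigroup[BitSet] {
  def plus(left: BitSet, right: BitSet): BitSet = left | right
}

// Each partition records the ids it saw; merging gives the global count.
val partitions = Seq(BitSet(1, 2, 3), BitSet(2, 3, 4), BitSet(4, 5))
val distinctCount = partitions.reduce(BitSetSemigroup.plus).size
println(distinctCount) // exact distinct count: 5
```

Unlike HyperLogLog, this trades memory for an exact (not approximate) count, which is where the compressed bitmap representation matters.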

@johnynek

@ianoc
Collaborator

ianoc commented Oct 4, 2015

I think our main concern with this will be the addition of the dependency. It looks like a pretty pure dependency, and all of the packages it uses seem to be used only in tests. Do you think it's small enough to be in core, @johnynek, or should we have it in an algebird-X module?

@johnynek
Collaborator

johnynek commented Oct 4, 2015

Yes, the dependency is the trick here. This would add a new dependency for everyone, even those who don't use this feature.

A second concern is that this would be the second bitset dependency (we took another one, which may have been a mistake):
https://github.com/twitter/algebird/blob/develop/algebird-core/src/main/scala/com/twitter/algebird/BloomFilter.scala#L22

We definitely should not have two compressed bitset dependencies. They claim this performs better than EWAH. If that's true, we should consider just moving to this.

In the meantime, it might be better to build the aggregator on the current dependency without adding a new one, and then evaluate whether we should switch to only the RoaringBitmap implementation. That would be my opinion.

Also, note, there is a faster aggregation for RoaringBitmap according to the docs:
https://github.com/lemire/RoaringBitmap/blob/master/src/main/java/org/roaringbitmap/FastAggregation.java

which could be used as the implementation for sumOption on the semigroup, which should speed things up when this is used with spark or scalding.

One more comment: if we did merge this, the monoid should probably be in the mutable namespace, since this data structure is mutable (even though we are not mutating it here).
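A batched sumOption along those lines might look like the sketch below. Here `mutable.BitSet` stands in for RoaringBitmap (where the batched merge would delegate to `FastAggregation.or`), the trait is a minimal stand-in for algebird's `Semigroup` (whose real signature takes a `TraversableOnce`), and all names are illustrative:

```scala
import scala.collection.mutable

// Stand-in for algebird's Semigroup; the real sumOption takes TraversableOnce.
trait Semigroup[T] {
  def plus(left: T, right: T): T
  // Default: pairwise reduction with plus.
  def sumOption(items: Seq[T]): Option[T] = items.reduceOption(plus)
}

object MutableBitSetSemigroup extends Semigroup[mutable.BitSet] {
  // Non-mutating plus: allocates a fresh bitmap per pairwise merge.
  def plus(left: mutable.BitSet, right: mutable.BitSet): mutable.BitSet =
    left | right

  // Batched merge: fold everything into one accumulator, avoiding one
  // intermediate allocation per pairwise plus. This is the kind of win
  // FastAggregation.or provides for RoaringBitmap.
  override def sumOption(items: Seq[mutable.BitSet]): Option[mutable.BitSet] =
    if (items.isEmpty) None
    else {
      val acc = mutable.BitSet.empty
      items.foreach(acc |= _)
      Some(acc)
    }
}

val bitmaps = Seq(mutable.BitSet(1, 2), mutable.BitSet(2, 3), mutable.BitSet(4))
println(MutableBitSetSemigroup.sumOption(bitmaps).map(_.size)) // Some(4)
```

The accumulator is freshly allocated inside sumOption, so the inputs are never mutated — which is what makes a mutable structure safe to hide behind a semigroup interface.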

}
}

class RoaringBitampSemigroup extends Semigroup[RoaringBitmap] {

Bitamp should be Bitmap (also appears elsewhere)

@avibryant
Contributor

Agreed that we should pick a compressed bitset to use. It seems worth noting that the author of the EWAH implementation we use (Lemire) is also the author of the RoaringBitmap paper and implementation, which gives his claims about which one is faster a lot of weight.

@vidma
Contributor Author

vidma commented Oct 5, 2015

Good point on sumOption; finally Algebird's Spark support might do some good for us (I hope).

Agreed that we should pick a compressed bitset to use.

Sounds good. Then I'll look into sumOption and add some tests.

@vidma
Contributor Author

vidma commented Oct 5, 2015

btw, @johnynek, regarding sumOption...

looking again at https://github.com/twitter/algebird/pull/397/files , I see your comment

I don't see a way to use sumOption in sumByKey or aggregateByKey without reimplementing or skipping map-side combining. I wish Spark had something like scalding's sumByLocalKeys.

I wish too! Any new ideas on how to add sumByLocalKeys to Spark?

@vidma force-pushed the features/add-exact-count-distinct-monoid branch from 4527954 to f4e38ef on October 5, 2015 21:32
@vidma force-pushed the features/add-exact-count-distinct-monoid branch from f4e38ef to 9e00b8b on October 5, 2015 21:33
@johnynek
Collaborator

johnynek commented Oct 5, 2015

Guess I forgot about that. No new ideas on how to solve it.


@vidma
Contributor Author

vidma commented Oct 5, 2015

Continuing the sumByLocalKeys off-topic: I'd say the only useful place is a combiner after a shuffle, when using a sort-based shuffle (for locality), but it's not so simple, I guess...

P.S. As I understand it, even Scalding calls sumOption only after the shuffle (SummingCache uses .plus).

@johnynek
Collaborator

johnynek commented Oct 5, 2015

Yes, for now, we only use sumOption after shuffle. We could do this on the map-side as well, but the problem there is always about how many items to cache in memory to possibly improve summing. A jit-like approach would be great (do some measurements on a few items, then use those to do the rest. That, or store history and check a history service to see how many items should be kept for each key before calling sumOption).


@ianoc
Collaborator

ianoc commented Oct 6, 2015

Scalding only uses it on the reduce side natively right now, though Summingbird will optionally use sumOption map-side.

I've used it map side with two general strategies before:

  1. If we can quickly compute the topN (some code used in SB is in algebird-util here), then we can do something with those keys.
  2. If we know the overall key-space cardinality is small, or we are happy to rely on lots of LRU evictions, then putting lists in for the values works pretty well.

In scalding, using sumByLocalKeys it looks something like: `.map(specialType).sumByLocalKeys.map(_.present).group.forceToReducers.sum`
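Strategy 2 above can be sketched as a small map-side summing cache, loosely modeled on the idea behind algebird's SummingCache. The class and its names here are made up for illustration; the eviction policy is simply least-recently-written:

```scala
import scala.collection.mutable

// Illustrative sketch: a bounded map-side cache that sums values per key
// and emits evicted partial sums downstream instead of every raw pair.
class MapSideSummingCache[K, V](capacity: Int, plus: (V, V) => V) {
  // LinkedHashMap keeps insertion order; re-inserting a summed key moves
  // it to the back, so the head is the least-recently-written entry.
  private val cache = mutable.LinkedHashMap.empty[K, V]

  // Add one (key, value); returns an evicted (key, partialSum) if full.
  def put(key: K, value: V): Option[(K, V)] = {
    val summed = cache.remove(key).map(plus(_, value)).getOrElse(value)
    cache.put(key, summed)
    if (cache.size > capacity) {
      val evicted = cache.head
      cache.remove(evicted._1)
      Some(evicted)
    } else None
  }

  // Emit the remaining partial sums at the end of the task.
  def flush(): List[(K, V)] = {
    val out = cache.toList
    cache.clear()
    out
  }
}

val cache = new MapSideSummingCache[String, Int](2, _ + _)
println(cache.put("a", 1)) // None
println(cache.put("b", 2)) // None
println(cache.put("a", 3)) // None ("a" summed to 4, still only 2 keys)
println(cache.put("c", 5)) // Some((b,2)): oldest entry evicted
println(cache.flush())     // List((a,4), (c,5))
```

This is where the "how many items to cache" question from earlier in the thread shows up concretely: `capacity` is the knob a jit-like or history-based approach would tune per key.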

@CLAassistant

CLAassistant commented Jul 18, 2019

CLA assistant check
Thank you for your submission! We really appreciate it. Like many open source projects, we ask that you sign our Contributor License Agreement before we can accept your contribution.


vidmantas zemleris seems not to be a GitHub user. You need a GitHub account to be able to sign the CLA. If you already have a GitHub account, please add the email address used for this commit to your account.
You have signed the CLA already but the status is still pending? Let us recheck it.
