Imagenet Pipeline #120
Also pass in numFeatures to block solver
Also add utility function to shuffle an Array
Also pass in appropriate values to sampler, solver to avoid multiple passes. Also caches the right set of things now
… into imagenet-sift-fv
// In place deterministic shuffle
def shuffleArray[T](arr: Array[T]) = {
  // Shuffle each row in the same fashion
  val rnd = new java.util.Random(42)
We should probably take the seed as a parameter. Also, is Breeze's shuffle not good enough for what you're trying to do?
Added a seed param. This is different from Breeze's shuffle in that I am trying to shuffle an Array[DenseVector[Double]], which is the output of calling collect on an RDD[DV[Double]].
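For readers following the thread, here is a minimal sketch of what a seeded, in-place shuffle over a generic array could look like. It is not the code merged in this PR: the object name ShuffleSketch, the default seed value, and the choice of a Fisher-Yates loop are assumptions made for illustration. Because it works on any Array[T], it would apply equally to an Array[DenseVector[Double]] obtained from collect.

```scala
import java.util.Random

// Hypothetical helper for illustration; not the PR's actual implementation.
object ShuffleSketch {
  // In-place Fisher-Yates shuffle driven by a seeded RNG, so the same seed
  // always yields the same permutation for arrays of the same length.
  def shuffleArray[T](arr: Array[T], seed: Long = 42L): Array[T] = {
    val rnd = new Random(seed)
    var i = arr.length - 1
    while (i > 0) {
      // Pick a position in [0, i] and swap it into slot i.
      val j = rnd.nextInt(i + 1)
      val tmp = arr(i)
      arr(i) = arr(j)
      arr(j) = tmp
      i -= 1
    }
    arr
  }
}
```

Taking the seed as a parameter keeps the shuffle reproducible while letting callers vary it, which is the point raised in the review comment above.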
I've addressed the comments and also added the SignedHellinger after SIFTs (before PCA). Note that I had to make a new BatchedHellingerMapper and that this uses
@@ -159,8 +159,15 @@ class BlockLeastSquaresEstimator(blockSize: Int, numIter: Int, lambda: Double =
  override def fit(
      trainingFeatures: RDD[DenseVector[Double]],
      trainingLabels: RDD[DenseVector[Double]]): BlockLinearMapper = {
    val vectorSplitter = new VectorSplitter(blockSize)
Is it a problem to have a single version of these with None, or does that break the Estimator API?
Yeah, I tried it and it breaks the API.
Awesome stuff @shivaram! I had a few minor things: one about refactoring to reuse some code in the SIFT/LCS pipeline and one about not reimplementing shuffle. If you want to merge this as-is and save those for a future PR, I'm good with this!
Alright, merging this to hit milestone 0.1!
Closes #11
This PR adds the SIFT + LCS + FV ImageNet pipeline. It includes changes to several components that help us avoid making multiple passes of SIFT over the data (e.g., VectorSplitter and the Sampling node).
The pipeline still looks a bit complex because of the sample reuse across PCA and GMM. Let me know if you can think of ways to make this better.