Review use of FunctionNode #121

etrain · 2015-05-18T16:24:22Z

In several pipelines, we use FunctionNode to handle cases where, for example, an Estimator[A,B] doesn't return a Transformer[A,B], but instead returns a Transformer[C,D], or where there is no good meaning for a single-item transformation.

Currently, FunctionNode feels like a "catch-all" because the Transformer/Estimator APIs don't sufficiently cover some of the data transformation operations we need to support.

One example of this is NGramsCounts, which takes a Seq[Seq[T]] and returns a model of type NGrams[T] => Int.
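To make the type mismatch concrete, here is a minimal sketch. The `Transformer` and `Estimator` traits are simplified stand-ins rather than the project's actual definitions, `Seq` replaces `RDD` so the snippet runs without Spark, and `NGram` and `NGramsCountsSketch` are hypothetical names used only for illustration.

```scala
// Simplified stand-ins for the pipeline interfaces (assumption: not the real API).
// Seq replaces RDD so the sketch runs without Spark.
trait Transformer[A, B] { def apply(in: A): B }
trait Estimator[A, B] { def fit(data: Seq[A]): Transformer[A, B] }

// Hypothetical n-gram type: a fixed-length sequence of tokens.
case class NGram[T](tokens: Seq[T])

// Fitting consumes whole documents (Seq[T]), but the fitted model maps
// NGram[T] => Int, so it does not fit the Estimator[A, B] signature above.
object NGramsCountsSketch {
  def fit[T](docs: Seq[Seq[T]], n: Int): Transformer[NGram[T], Int] = {
    val counts: Map[NGram[T], Int] =
      docs
        .flatMap(_.sliding(n).filter(_.size == n).map(w => NGram(w)))
        .groupBy(identity)
        .map { case (gram, occurrences) => gram -> occurrences.size }
    new Transformer[NGram[T], Int] {
      // Unseen n-grams get a count of zero.
      def apply(in: NGram[T]): Int = counts.getOrElse(in, 0)
    }
  }
}
```

The input type of `fit` and the input type of the returned model differ, which is exactly the case `Estimator[A, B]` cannot express.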

Other examples include Windower and Sampler, which are used in the RandomPatchCifar pipeline. These nodes are different in that they do not operate on single items and are thus not transformers; to draw a database analogy, they act more like an Aggregator.
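A rough sketch of why these nodes are not item-to-item: the first emits many outputs per input, and the second depends on the whole dataset. `DatasetOp`, `WindowerSketch`, and `SamplerSketch` are hypothetical names, the windowing is a 1-D simplification rather than the real image patch extraction, and `Seq` again stands in for `RDD`.

```scala
import scala.util.Random

// Hypothetical interface for dataset-level operations (assumption: not the real API).
trait DatasetOp[A, B] { def apply(data: Seq[A]): Seq[B] }

// One input item yields several output items, so there is no single-item form.
class WindowerSketch(size: Int, stride: Int)
    extends DatasetOp[Vector[Double], Vector[Double]] {
  def apply(data: Seq[Vector[Double]]): Seq[Vector[Double]] =
    // Keep only full-size windows; sliding may emit a shorter trailing one.
    data.flatMap(_.sliding(size, stride).filter(_.size == size))
}

// The output depends on the whole dataset, not on any one item.
class SamplerSketch[T](numSamples: Int, seed: Long) extends DatasetOp[T, T] {
  def apply(data: Seq[T]): Seq[T] =
    new Random(seed).shuffle(data).take(numSamples)
}
```

Neither class can be written as a function from one `A` to one `B`, which is what separates them from a `Transformer`.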


tomerk · 2015-05-18T17:02:51Z

What percent of these are only being used in the "fitting" part of a pipeline and not the "prediction" part?

tomerk · 2015-05-18T17:03:18Z

And are these all RDD to RDD?

concretevitamin · 2015-05-18T17:40:47Z

Just randomly chiming in - the aggregation pattern is everywhere in every query processing engine (and you're totally right, it's also in decade-old databases!), so I guess there's a reason.

tomerk · 2015-05-18T18:28:31Z

So after taking a closer look, it seems to me that the cases where we're using FunctionNode right now fall under one of two categories:

- Some form of 'aggregation' (representable as any RDD transformation that isn't item to item) being done only at 'fit' time
- Something related to zipping & block transformers & estimators, which we still need to figure out how to do cleanly

Some questions I have about the Aggregators are:

- Do we want to be able to chain these with transformers? (Judging by how they're being used right now, it looks like there's at least some interest in it.)
- Where do we want to call these aggregators?
  - Only internally within Estimators?
  - Directly on the training data before we call estimator.fit(data)?
  - Somehow chain it within a pipeline but have it only apply in the 'fitting' stage?
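One possible shape for the last option, purely as a sketch: wrap an estimator so that the aggregation runs inside `fit` and never appears in the fitted transformer. Everything here (`FitOnlySketch`, `withFitOnly`, `MeanCenterer`, the simplified traits, and `Seq` in place of `RDD`) is hypothetical and not the project's API.

```scala
object FitOnlySketch {
  // Simplified stand-ins for the pipeline interfaces (assumption: not the real API).
  trait Transformer[A, B] { def apply(in: A): B }
  trait Estimator[A, B] { def fit(data: Seq[A]): Transformer[A, B] }

  // Wrap an estimator so the aggregation is applied to the training data only.
  // The fitted transformer is returned unchanged, so prediction never sees it.
  def withFitOnly[A, B](aggregate: Seq[A] => Seq[A], est: Estimator[A, B]): Estimator[A, B] =
    new Estimator[A, B] {
      def fit(data: Seq[A]): Transformer[A, B] = est.fit(aggregate(data))
    }

  // Toy estimator for illustration: learns the mean and subtracts it.
  object MeanCenterer extends Estimator[Double, Double] {
    def fit(data: Seq[Double]): Transformer[Double, Double] = {
      val mean = if (data.isEmpty) 0.0 else data.sum / data.size
      new Transformer[Double, Double] {
        def apply(in: Double): Double = in - mean
      }
    }
  }
}
```

Calling `est.fit(aggregate(data))` directly at the call site would correspond to the second option; the wrapper just moves that call inside the pipeline stage.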

etrain added the enhancement label May 18, 2015

tomerk mentioned this issue May 18, 2015

Imagenet Pipeline #120

Merged

tomerk mentioned this issue May 18, 2015

Figure out future DAG operators #125

Open

tomerk mentioned this issue Jun 23, 2015

Dag construction v2 #146

Merged
