A meta-repository of big data tools
Here is the source code for the major pieces of a data science platform (hadoop, pig, wukong, storm, kafka, etc.), and their essential plugins.
Clone it so you have all the source at hand -- to track development, to steal ideas from, or because you're getting on an airplane in ten minutes. The browse directory links to the most-likely-to-be-interesting directory of each project, so you don't spend time trying to figure out if it's src/main/java/org/with/lots/of/dirs or java/src/main or what.
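To see where each shortcut leads without clicking through, something like the following should work. It is a minimal sketch, assuming the entries in the browse directory are ordinary symlinks and that the directory is named `browse` at the top of your checkout; `browse_targets` is a made-up helper name, not something shipped in this repo.

```python
import os


def browse_targets(browse_dir):
    """Return {link name: link target} for every symlink in browse_dir.

    Assumption: the shortcuts are plain filesystem symlinks. Regular
    files and directories in browse_dir are skipped.
    """
    targets = {}
    for name in sorted(os.listdir(browse_dir)):
        path = os.path.join(browse_dir, name)
        if os.path.islink(path):
            # readlink returns the target exactly as stored in the link,
            # which is usually a path relative to browse_dir
            targets[name] = os.readlink(path)
    return targets
```

From the top of a checkout, `browse_targets("browse")` gives you a dict you can print or grep to find the interesting directory for a given project.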
- It's not a link to every tool in the space -- only repos we've found useful or promising.
- It doesn't build everything from scratch, or have a complete set of dependencies. (Pull requests encouraged!)
- hadoop-common -- Hadoop (Core Framework)
- hadoop-mapreduce -- Hadoop (Distributed Computation)
- hadoop-hdfs -- Hadoop (Distributed File System)
- mahout -- machine learning on Hadoop
- hive -- high-level interface to Hadoop
- crunch -- data science on Hadoop
- pig -- the tool itself
- piggybank -- the official contrib set of Pig UDFs
- piggychimp -- Pig UDFs from infochimps-labs
- sounder -- Pig UDFs from Jacob Perkins (@thedatachef)
- datafu -- Pig UDFs from LinkedIn
- elasticsearch -- full-text datastore of joy
- hbase -- store a billion of, y'know, whatever
- R -- statistics, tried and true, written by statisticians (unfortunately, written by statisticians)
- Julia -- statistics, exciting and new, written by programmers (unfortunately, exciting and new)
- kafka -- real-time data delivery
- storm -- real-time data analytics
- wukong-example-data -- useful tables and interesting datasets, from country codes to UFO sightings
Tools that are needed to make the other tools work
- addressable
- bundler
- guard, guard-rspec, guard-yard
- uuidtools
- htmlentities
- oj
(other dependencies: RedCloth forgery highline jeweler json kramdown multi_json perftools.rb pry rake rb-fsevent redcarpet rspec simplecov yard)