# Home
Welcome to the idigbio-spark wiki!
This wiki is meant to share best practices for using large biodiversity data sets.
# When to use?

When possible, use tools that you are already comfortable with (e.g. R, python/pandas). Only when the datasets no longer fit on a single machine, or when the calculations take too long, should you consider distributed processing frameworks/platforms such as Hadoop or Apache Spark. Don't use distributed computing unless you really, really need it.
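To give a rough sense of what the Spark side of that trade-off looks like, here is a minimal sketch of a simple aggregation over a large occurrence file. The file path and column name are hypothetical; the same operation in R or pandas is usually preferable while the data still fits in memory.

```scala
// Minimal sketch, assuming a hypothetical tab-separated occurrence dump on HDFS.
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("record-count-example")
  .getOrCreate()

// Read the (hypothetical) occurrence file; Spark spreads the read across executors.
val occurrences = spark.read
  .option("sep", "\t")
  .option("header", "true")
  .csv("hdfs:///data/occurrences.tsv")

// Count records per (hypothetical) scientificname column, in parallel.
occurrences.groupBy("scientificname")
  .count()
  .show(20)
```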
# Pick a distributed processing platform

These pages focus on using Apache Spark, but many other distributed processing frameworks exist.
# Prepare your data

Once you realize that you need distributed processing for your analysis, the first step is to prepare your datasets. The data format/data store should be suitable for distributed processing.
Do's - use HDFS, Parquet, or other technologies specifically designed for distributed computing. Plain, uncompressed files help distributed processing frameworks like Spark distribute work by splitting files into chunks (see the sketch after this list).

Don't - avoid large compressed files; most compression formats are hard to chunk or split, so the work cannot be distributed.
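As an illustration of the Do's above, here is a minimal sketch that converts a large plain-text TSV into Parquet with Spark, producing a splittable, columnar copy that later jobs can scan efficiently. The input and output paths are hypothetical.

```scala
// Minimal sketch, assuming hypothetical HDFS paths for input and output.
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("tsv-to-parquet")
  .getOrCreate()

spark.read
  .option("sep", "\t")
  .option("header", "true")
  .csv("hdfs:///data/raw/occurrences.tsv")      // splittable plain-text input
  .write
  .mode("overwrite")
  .parquet("hdfs:///data/occurrences-parquet")  // columnar, splittable output
```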
# Set up your cluster

Although it is getting easier to set up a compute cluster from scratch, it still requires considerable software/hardware skills. We suggest re-using an existing compute cluster when possible to avoid days/weeks of setup time and system administration. The idea of GUODA is to provide access to such a compute cluster.
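If you do have access to an existing cluster, a Spark application only needs to be pointed at its master. Below is a minimal sketch, assuming a hypothetical Spark standalone master URL; in practice the master is usually supplied via spark-submit rather than hard-coded.

```scala
// Minimal sketch, assuming a hypothetical Spark standalone master URL.
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("reuse-existing-cluster")
  .master("spark://cluster-head-node:7077") // hypothetical host; often passed via spark-submit --master instead
  .getOrCreate()
```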
# Develop your processing jobs

# Test your processing jobs

# Deploy your processing jobs