
Docker and Data Visualization


Background

What is Docker?

For scientists and engineers, Docker is an alternative to virtual machines for making data analysis and visualization workflows reproducible. In general, scientific applications are not user-friendly; even installation can be a complicated and time-consuming process because of external dependencies and a lack of documentation. Docker is a potential solution to these setup problems: with Docker, you can pack all of your data analysis and visualization applications into containers.

Please read the official Docker documentation if you are not familiar with these concepts.

Why Docker?

Engineers and (data) scientists should focus on data analysis and visualization, not software installation. If they can share common data analysis environments like the following, they do not have to spend their time dealing with common software installation pitfalls, such as finding the correct combination of library versions or setting the proper options to start each application:

  • Python + IPython Notebook + Pandas + NumPy/SciPy
  • R + igraph + Bioconductor
  • Ubuntu box with specific version of libc to compile package X

The power of Docker, or containers, is that users can share popular data analysis environments as a single text file, called a Dockerfile. In addition, they can extend existing setups by inheriting from images available in a public (or private) repository such as Docker Hub.
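
For example, a minimal Dockerfile that extends a public base image might look like the sketch below. The base image and package list here are illustrative assumptions, not an official recipe.

```
# A sketch only: extend a public base image with the tools one analysis needs.
# The base image and package list are illustrative, not an official recipe.
FROM ubuntu:14.04

# System packages for the scientific Python stack
RUN apt-get update && apt-get install -y python-pip python-numpy python-scipy

# Pure-Python libraries installed on top
RUN pip install pandas networkx "ipython[notebook]"
```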

Docker for Network Analysis

Network data analysis and visualization is a non-trivial task. You need various kinds of tools to prepare, analyze, and visualize your data sets, which means setting up your own data analysis environment is another headache: you have to handle tool dependencies, versions, install scripts, and so on. For example, our Python examples depend on several popular Python libraries, including:

  • IPython Notebook
  • Pandas
  • NetworkX
  • NumPy
  • SciPy

All of these must be installed manually before running our examples. This takes some time for first-time users and can be a barrier for busy scientists. Our goal is to provide a shortcut for them: instead of writing a long software installation guide, we will provide Docker images for network analysis and visualization.
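
Once such an image is published, setup collapses to a single command. The image name below is a placeholder for the images we will publish, not a real one.

```
# One-time download of a prebuilt analysis environment; no manual installs.
# "cytoscape/ipython-notebook" is a placeholder image name.
docker pull cytoscape/ipython-notebook
```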

Reproducibility

In science, reproducibility of experiments is the foundation of openness. Unfortunately, even computational, or dry, experiments are hard to reproduce because of the problems discussed above. The combination of Docker and IPython Notebook can be a solution.

Docker

Since a Docker image can include the specific versions of software packages and the exact configuration files you used for your research, you can use Docker as a portable environment for reproducing your entire data analysis pipeline.
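
For instance, a Dockerfile can pin the exact library versions used in a study. The versions below are examples only, not a prescription.

```
# A sketch: pin the exact versions used in the original study so the
# environment can be rebuilt identically later. Versions are examples only.
FROM ubuntu:14.04
RUN apt-get update && apt-get install -y python-pip python-dev build-essential \
    gfortran libblas-dev liblapack-dev
RUN pip install numpy==1.9.2 scipy==0.15.1 pandas==0.15.2 networkx==1.9.1
```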

IPython Notebook

Just as you use a paper lab notebook at the bench, you can use IPython Notebook as the lab notebook for computational experiments. If you run the IPython Notebook server in a Docker container, you can re-run the entire data analysis pipeline on any machine running Docker.
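
A minimal sketch of this setup, assuming a placeholder image name:

```
# Publish the notebook server on port 8888 and mount the host's notebook
# directory so your work persists outside the container.
# The image name is a placeholder.
docker run -d -p 8888:8888 -v $(pwd)/notebooks:/notebooks cytoscape/ipython-notebook
```

You can then open the notebook in your browser on port 8888 (on Mac or Windows, use the IP address of the boot2docker VM instead of localhost).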

Getting Started

There is plenty of good (and free) documentation for Docker. Before you use our sample images, you need to understand the basic Docker commands (a few everyday ones are sketched after the list below). Here are links to useful Docker documentation:

  • Install Docker
  • Basic Docker Commands
  • Run IPython Notebook on Docker

When you run our Python examples:

  • Official Docker Images from Cytoscape Consortium
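
As a quick reference, these are the everyday commands covered in the documents above:

```
docker pull <image>    # download an image from Docker Hub
docker run <image>     # start a container from an image
docker ps              # list running containers (-a to include stopped ones)
docker images          # list images available locally
docker stop <id>       # stop a running container
docker rm <id>         # remove a stopped container
```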

Beyond Basics

For real-world problems, you need multiple tools to build a workflow, which means you need multiple containers to build your pipeline. In this section, I will briefly introduce tools for working with multiple containers; a minimal Compose file is sketched after the list below.

  • Compose an Application with Multiple Containers
  • Orchestrating Large Container-Based Systems
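
As a rough sketch, a Compose file (in the original v1 file format) wires containers together declaratively. The service and image names here are illustrative placeholders, not a published setup.

```
# A sketch of a two-service pipeline in the Compose v1 file format.
# Service names and images are illustrative placeholders.
notebook:
  image: cytoscape/ipython-notebook   # placeholder analysis environment
  ports:
    - "8888:8888"
  links:
    - database                        # reachable from the notebook as "database"
database:
  image: postgres:9.4                 # example data store for analysis results
```

Running docker-compose up then starts both containers and links them together.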