Docker and Data Visualization
For scientists and engineers, Docker is an alternative to virtual machines for making data analysis and visualization workflows reproducible. In general, scientific applications are not user-friendly; even installation is complicated and time-consuming because of external dependencies and a lack of documentation. Docker is a potential solution to these setup problems: it lets you pack all of your data analysis and visualization applications into containers.
Please read the official Docker documentation if you are not familiar with its concepts.
Engineers and (data) scientists should focus on data analysis and visualization, not software installation. If they can share common data analysis environments like the following, they do not have to spend their time on common software installation pitfalls, such as finding the correct combination of library versions or the proper options for starting applications:
- Python + IPython Notebook + Pandas + NumPy/SciPy
- R + igraph + Bioconductor
- Ubuntu box with a specific version of libc for compiling package X
The power of Docker, and of containers in general, is that users can share popular data analysis environments as a single file, called a Dockerfile. In addition, they can extend existing setups by inheriting from images available in a public (or private) repository such as Docker Hub.
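For example, here is a minimal Dockerfile sketch that inherits a public base image and pins library versions (the base image, package versions, and start command are illustrative assumptions, not a tested configuration):

```dockerfile
# Inherit a public base image from Docker Hub.
FROM python:2.7

# Pin exact versions so collaborators rebuild the same environment.
# (Version numbers are illustrative; SciPy is omitted because it needs
# extra system build dependencies.)
RUN pip install \
    "ipython[notebook]==3.2.1" \
    pandas==0.16.2 \
    networkx==1.9.1

# Start the notebook server when the container runs.
CMD ["ipython", "notebook", "--ip=0.0.0.0", "--no-browser"]
```

Anyone with Docker installed can rebuild the identical environment with docker build, and the resulting image can be pushed to Docker Hub for others to extend.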
Network data analysis and visualization is a non-trivial task. You need various tools to prepare, analyze, and visualize your data sets, so setting up your own data analysis environment is another headache: you have to handle tool dependencies, versions, install scripts, and so on. For example, our Python examples depend on several popular Python libraries, including:
- IPython Notebook
- Pandas
- NetworkX
- NumPy
- SciPy
All of these would have to be installed manually before running our examples. This takes time for first-time users and can be a barrier for busy scientists. Our goal is to provide a shortcut: instead of writing a long software installation guide, we provide Docker images for network analysis and visualization.
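With a prebuilt image, the whole dependency list above collapses into a single command. A minimal sketch, assuming a hypothetical image name (substitute the actual published image):

```shell
# Download a prebuilt analysis environment in one step.
# The image name below is a hypothetical placeholder.
docker pull cytoscape/ipython-notebook
```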
In science, reproducibility of experiments is the foundation of openness. Unfortunately, even computational (dry) experiments are hard to reproduce because of the problems discussed above. The combination of Docker and IPython Notebook can be a solution.
Since a Docker image can include specific versions of software packages and the exact settings files you used for your research, you can use Docker as a portable environment that reproduces your entire data analysis pipeline.
Just as you keep a paper lab notebook, you can use IPython Notebook as the lab notebook for computational experiments. If you run the IPython Notebook server in a Docker container, you can re-run the entire data analysis pipeline on any machine running Docker.
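For example, here is a sketch of starting the notebook server from a container, with the project directory mounted so notebooks and data stay on the host (the image name is again a hypothetical placeholder):

```shell
# Start an IPython Notebook server inside a container.
# -p publishes the notebook port; -v mounts the current project directory.
docker run -d -p 8888:8888 \
    -v "$(pwd)":/notebooks \
    cytoscape/ipython-notebook
```

Then open http://localhost:8888 in a browser (with boot2docker on Mac or Windows, use the VM's IP address instead of localhost).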
There are many good (and free) resources for learning Docker. Before you use our sample images, you need to understand the basic Docker commands. A good starting point is the official documentation:
- Docker documentation: https://docs.docker.com/
When you run the Python examples
- Official Docker images from the Cytoscape Consortium
For real-world problems, you need multiple tools to build a workflow, which means your pipeline spans multiple containers. In this section, I will briefly introduce tools for working with multiple containers.
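Tools such as Docker Compose automate this, but the basic idea can be sketched with plain docker commands and (legacy) container links; all image, container, and script names here are hypothetical:

```shell
# 1. Start a database container to hold intermediate results.
docker run -d --name graphdb postgres:9.4

# 2. Run the analysis container, linked to the database and with the
#    project directory mounted for input/output files.
docker run -it --link graphdb:db \
    -v "$(pwd)":/data \
    cytoscape/ipython-notebook \
    python /data/run_pipeline.py
```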
© 2014-2015 The Cytoscape Consortium
Developed and maintained by Keiichiro Ono (UC San Diego, Trey Ideker Lab)