- Michael Mercier (Inria/Atos)
- Cristian Ruiz (Inria)
Grid5000, \structure{Kameleon}, Expo, … \bigskip\bigskip
Thanks for the feedback from:
- Pierre Neyron (CNRS)
- Arnaud Legrand (CNRS)
- Olivier Richard (UGA)
- Lucas Nussbaum (Loria)
Here is the pad for interactions:
Material (demo, slides) available on github
Reproducible research: What does it mean? Watch the first webinar if you need a reminder.\medskip
Reproducibility is a cornerstone of scientific method
An article about computational science in a scientific publication is not the scholarship itself, it is merely advertising of the scholarship. The actual scholarship is the complete software development environment and the complete set of instructions which generated the figures. \flushright{– David Donoho, 1998}
- Making it available is a great first step
- Making sure others can rerun it is another step
Experiment replication is not an easy task if you do not have it in mind from the beginning: \vspace{0.2cm}
The path from having a piece of software running on the programmer’s own machine to getting it running on someone else’s machine is fraught with potential pitfalls
\bigskip
In reproducible research, scientists should care about both the experiments and the analysis:
- All the artifacts (input/outputs files)
- The source code
- Documentation on how to compile, install and run
Still, several problems may prevent someone from rerunning an experiment
- Unresolved dependencies
- Compilation errors
Fewer than 50% of the experimental setups of papers submitted to ACM conferences and journals could be built
- Portability issues
- E.g., BOINC had to rely on homogeneous redundancy to protect against numerical instabilities (OS, hardware, …).
- Imprecise documentation
- ”I have no clue about how to install it, configure it or run it!”
- Dependency Hell
- ”I can’t install this dependency package without breaking my entire system”
- Code rot
- ”This dependency package version is buggy! What was the version that was used to run the experiment in the first place?!?”
- Efforts are not rewarded by the current academic research and funding environment
- Software vendors tend to protect their markets through proprietary formats and interfaces
- Investigators naturally tend to want to own and control their research tools
- Even the most generalized software will not be able to meet the specific needs of every researcher in a field
- The need to derive and publish results as quickly as possible precludes the often slower standards-based development path
- Raw data and code: rely on the published paper for documentation
- Extensive documentation: it may still require certain skills
- Adopt a controlled environment: e.g., rely on a scientific workflow
- Use virtual machines to capture and publish code, data and experimental environment
For myself:
- Be able to reproduce my own experiment later
- Improve my productivity (when preparing articles, PhD, rebuttals, …)
- Be able to scale my experiment on other machines
- Facilitate experiment extensions and modifications
- Be a better scientist by doing better science
$\winkey$
For other people: my students, my colleagues, my peers, …
- Allow them to reproduce my experiment and corroborate (or not) my results
- Allow them to base their research on mine and extend it
For everyone else:
- Improve knowledge sharing
- Increase collaboration possibilities
- Do better science!
One way to go is to take care of your experimental environment
There are mainly two approaches:
- Preserving the mess by capturing the already set up environment
- Encourage cleanliness with several options:
- Using a constrained environment
- Building your own environment
See Preserve the Mess or Encourage Cleanliness? (Thain et al., 2015)
Each of them has a different level of constraint and flexibility:
- The more constrained your environment is, the simpler it is
- Freedom comes with responsibility
Can be numerous or unique depending on the experiment workflow:
- Experiment environments
- local, on a testbed, on a dedicated server,…
- Analysis environments
- Usually a unique local environment
The whole environment contains both hardware and software information
Necessary when we carry out performance measurements
Tools to capture hardware configuration:
- dmidecode
- hwloc (lstopo)
- ls* tools (lsblk, lshw, lspci, lsmod, …)
- proprietary tools (BIOS, NVIDIA, …)
- Testbed hardware description APIs (Grid’5000, Chameleon)
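As an illustration, a capture session with these tools could look like this (the output file names are illustrative; dmidecode needs root privileges):

```
$ sudo dmidecode > hw-dmi.txt     # BIOS/DMI hardware inventory
$ lstopo hw-topology.xml          # hwloc topology (CPUs, caches, NUMA)
$ lsblk > hw-disks.txt            # block devices
$ lspci > hw-pci.txt              # PCI devices
```

Archiving such dumps alongside the results documents the hardware side of the experiment.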
Since the hardware itself cannot be shared, the hardware environment needs to be documented as exhaustively as possible.
Of course it depends on how the results of experiments are affected by the underlying hardware.
Different types of environment:
- minimal description in a mail
- README in a git repository
- small documentation
- bundle of the experiment tool and its dependencies
- Linux container image
- Virtual machine
- A complete system image
The role of a virtual environment is to provide some isolation within the host\vspace{-1.5em}
- A virtual environment can only use a limited part of the resources:
- filesystem
- memory/cpu/disk/network
- Has its own software stack
$⇒$ clean dependencies
\medskip
By the way:
- What is a container?
- An isolated part of the system that shares the operating system kernel
- What is a virtual machine?
- A full system image that shares the host hardware with the guest OS through a hypervisor
Environment built, specialized, controlled, and versioned by somebody else:
- SageMathCloud
- Use Jupyter
- Julia, Python, R, Haskell, Ruby…
- 40 languages (partly) supported
Sharing is easy but you have to stick to what the environment provides
Constrained but extensible environment:
- Activepapers (Beta)
- Python or JVM based language
Start your experimental setup in a controlled environment from the beginning
- Clean install system in a virtual environment
- Software appliance marketplaces (e.g., TurnKey Linux, The Cloud Market)
- Default Testbed (Grid’5000, Cloudlab, Chameleon) environments
This encourages cleanliness:
Your environment is controlled (you start from a clean system)
TurnKey Linux: http://www.turnkeylinux.org ; The Cloud Market: http://www.thecloudmarket.com
But nothing is responsible for tracking the modifications applied to this environment
You don’t know what is inside the box
Several approaches for capturing your environment:
- Export everything
- Kernel + Libraries + Application
- Heavy but safe
- Capture only what is needed to run on a similar system
- Libraries (only dependencies) + App
- Lightweight but can be partial
A simple capture of an environment is a complete copy of it.
It depends on what your environment is:
- On a classical local machine:
- Problem: A simple backup bundle is not easily usable by others
- Partial solution: Clone your hard drive to a VM (excluding personal data)
- On virtual environment use the instant snapshot capability
- Faster and simpler backup
- The VM needs to have been used from the beginning (as mentioned previously)
- On a testbed machine use the provided snapshot mechanism
In either case, sharing is complicated
- Environment images of several gigabytes are common
- Need a dedicated place to store them (a repository or some market place)
You still don’t know what is inside the box
Use a tracking tool to capture only what is necessary
- Instrumenting a run of your experiment to capture everything it uses:
- Binaries/Scripts (experiment.py, Python 2.7)
- Configuration files (conf.yaml)
- Libraries (libc, numpy, matplotlib)
Then create a compressed bundle
- Rerun the experiment on another machine:
- Import the provided bundle
- Initialize the environment (depends on the tools…)
- Rerun the exact same experiment
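With ReproZip, for example, this workflow boils down to a few commands (file names are illustrative):

```
$ reprozip trace ./experiment.py      # instrumented run: record every file used
$ reprozip pack my_experiment.rpz     # create the compressed bundle
# ... then, on another machine:
$ reprounzip docker setup my_experiment.rpz exp/  # unpack into a Docker image
$ reprounzip docker run exp/                      # rerun the exact same experiment
```

Other unpackers (installing packages, chroot, Vagrant) follow the same setup/run pattern.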
Capture is not foolproof:
- Running with only one set of parameters is not enough
- Higher risk of missing something
$\frowny$
Less messy than copying a full virtual environment
Existing tools:
- CDE (Guo et al., 2011)
- First to bring the idea
- Seems unmaintained since 2013
- ReproZip (Freire et al., 2013)
- One tool to trace and pack
- Several tools to unpack and run (package install, chroot, Docker, Vagrant)
- More during the demo
$☺$
- CARE (Janin et al., 2014)
- Only for experts
- Seems unmaintained since 2014
- Parrot
- Limited to the Parrot filesystem…
- If you’re moving a computation to a new system, it should be simple and straightforward to set up an environment almost identical to that of the original machine
- A major challenge in reproducing computations is installing the prerequisite software environment
$\frowny$ - Modern open computational science relies on complex software stacks
- So, it is necessary to know:
- How was it built?
- What does it contain?
- How can I modify it to extend the experiment?
$ tar -xzf pdt-3.19.tar.gz && cd pdtoolkit-3.19/
$ ./configure -prefix=/usr/local/pdt-install
$ make clean install
- Need to install all dependencies by hand
- Some skills are required
A PM is a collection of software tools that automates the process of installing, upgrading, configuring, and removing computer programs for a computer’s operating system in a consistent manner
- Examples in the Linux world: APT, yum, pacman, Nix …
- There also exists package managers for programming languages: Bundler, CPAN, CRAN, EasyInstall, Go Get, Maven, pip, RubyGems, …
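With a language-level package manager, recording exact versions is what makes a setup re-installable. A hypothetical Python example (the file content is illustrative):

```
# requirements.txt, as produced by `pip freeze`
numpy==1.11.0
matplotlib==1.5.1
PyYAML==3.11
```

Running `pip install -r requirements.txt` on another machine then reinstalls the same versions.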
The DevOps Approach
- Dev = Development, Ops= (System) operation
- You have a pile of crusty code that’s hard to install
- And documenting how to install it is almost as hard!
$☺$ - Why not develop scripts that reliably install your toolset?
- Because that sounds hard?
$\winkey$ - But it’s more fun than writing documentation!
Use all the good things that software engineering has created over the decades to ensure isolation and reproducibility
Credit: inspiration for this slide taken from “A Common Scientific Compute Environment for Research and Education”, presented at SciPy 2014
- README
- Shell scripts
- Configuration management tools: automate software configuration and installation
- Software stacks can be easily transportable
- Some CM tools: Puppet, Salt, Ansible
- A lot of work has to be done to write recipes
$\frowny$
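As an illustration, an Ansible recipe (playbook) could look like this (the host group and package names are hypothetical):

```yaml
# Hypothetical playbook: declare the desired state of the machines;
# Ansible makes the installation steps repeatable.
- hosts: experiment_nodes
  become: yes
  tasks:
    - name: Install the experiment dependencies
      apt:
        name: [gcc, make, python-numpy]
        state: present
    - name: Deploy the experiment sources
      copy:
        src: experiment/
        dest: /opt/experiment/
```

The recipe itself is small and text-based, so it can be versioned and shared alongside the experiment.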
Any application can easily be moved across different environments
- Docker is an open-source engine that automates the deployment of any application as a lightweight, portable, self-sufficient container that will run virtually anywhere
- Docker tries to achieve deterministic builds by isolating your service, building it from a snapshotted OS and running imperative steps on top of it
- Dependency hell: Docker works with images that consume minimal disk space, are versioned, archivable, and shareable (DockerHub)
- Dockerfiles: resolving imprecise documentation
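As an illustration, a minimal Dockerfile could look like this (the base image and package names are hypothetical, not taken from the webinar):

```dockerfile
# Hypothetical example: pin a base image, install the dependencies,
# copy the experiment code, and define how to run it.
FROM debian:8.4
RUN apt-get update && apt-get install -y \
    python \
    python-numpy \
    python-matplotlib
COPY experiment.py /experiment/
WORKDIR /experiment
CMD ["python", "experiment.py"]
```

Pinning the base image tag (debian:8.4 rather than debian:latest) is what keeps the build close to deterministic.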
- It automates the build of development environment using a base environment called box and a series of text-based instructions
- Researchers write text-based configuration files that provide instruction to build virtual machines
- Largely solves the problem of sharing a VM: since these files are small, researchers can easily share them and track different versions via source-control repositories
- VMs are not seen as black boxes anymore
- Researchers can automate the process of building and configuring virtual machines
- It is possible to use different providers: EC2, VirtualBox, VMware, Docker, etc.
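A minimal Vagrantfile could look like this (the box name and provisioning steps are illustrative):

```ruby
# Hypothetical example: pick a base box, choose a provider,
# and describe the provisioning steps as code.
Vagrant.configure("2") do |config|
  config.vm.box = "ubuntu/trusty64"        # the base "box"
  config.vm.provider "virtualbox" do |vb|
    vb.memory = 2048
  end
  # Imperative steps that set up the experimental environment
  config.vm.provision "shell", inline: <<-SHELL
    apt-get update
    apt-get install -y build-essential python-numpy
  SHELL
end
```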
- Apply functional model to packaging
A package is the output of a deterministic function (it depends only on the function’s inputs, without any side effects)
- The principle: two independent runs of a given build process for a given set of inputs should return the same value
- Functional hash-based immutable package management
- Isolated build
- Deterministic
- No dependency hell
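For instance, a Nix derivation is written as a function call whose declared inputs fully determine the output (a hypothetical sketch; the package names come from nixpkgs):

```nix
# Hypothetical example: the build sees only the declared inputs;
# the result is stored under a hash of all of them.
with import <nixpkgs> {};

stdenv.mkDerivation {
  name = "my-experiment-1.0";
  src = ./.;                      # the experiment sources
  buildInputs = [ python gsl ];   # explicit, isolated dependencies
}
```

Two builds with the same inputs land in the same hash-addressed store path, which is what rules out dependency hell.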
An experimental setup
- Experimenters have access to the original base experimental setup \(E\).
- Experimenters know exactly the sequence of actions \(\langle A_{1}, A_{2}, A_{3}, …, A_{n}\rangle \) that produced \(E’\).
- \bf Experimenters are able to change some action \(A_{i}\) and successfully re-construct an experimental setup \(E”\)
Additional problems:
- Accessing the same base setup \(E\)
- \bgroup\bf Software used is not available anymore\egroup
The Debian community is quite active on the reproducibility front.
- It’s an archive that gives access to old packages based on dates and version numbers
- It provides a valuable resource for tracking down when regressions were introduced, or for providing a specific environment that a particular application may require to run
- Only concerns software that is packaged
$\frowny$
- Easy to use
$\leadsto$ structured language based on a few constructs, relying on shell commands
- Allows shareability thanks to the hierarchical structure of recipes and the extend mechanism
- Kameleon supports the build process by providing debugging mechanisms such as interactive shell sessions, break-points and checkpointing
- Allows the easy integration of providers using the same language for the recipes
- Persistent cache makes reconstructability a reality
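A Kameleon recipe is a YAML file; a rough, hypothetical sketch (the extended base recipe and step names are illustrative, see the Kameleon documentation for the exact syntax):

```yaml
# Hypothetical recipe sketch: extend a shared base recipe,
# then add experiment-specific setup steps.
extend: debian8-x86_64.yaml      # hierarchical recipes: inherit the base
global:
  appliance_name: my-experiment
bootstrap:
  - "@base"                      # reuse the parent's bootstrap section
setup:
  - "@base"
  - install_dependencies:
    - exec_in: apt-get install -y python-numpy
export:
  - "@base"
```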
It’s time for a Docker Demo (follow the links from https://github.com/alegrand/RR_webinars/)
Docker advantages for reproducible research:
- Integrating into local development environments
- Modular reuse
- Portable environments
- Public repositories for sharing
- Versioning
- Portable computation & sharing
$ docker export container-name > container.tar
$ docker push username/r-recommended
- Re-usable modules
$ docker run -d --name db training/postgres
$ docker run -d -P --link db:db training/webapp \
python app.py
- Versioning
$ docker history r-base
$ docker tag d7e5801bb7ac ttimbers/mmp-dyf-skat:latest
Let’s demo a complete use case (follow the links from https://github.com/alegrand/RR_webinars/).\medskip
Use case: A not that simple simulation
- SimGrid (C library) + BatSim (C++) + OAR scheduler (Python) + A new scheduler (Perl)
- Python script for the glue
Steps:
- Build an environment with Kameleon
- Capture an experiment with ReproZip
- Export the corresponding bundle
- Rerun the experiment on another machine (ReproZip + Docker)
- Compare the results (CSV + Python $\leadsto$ graphics) still using the environment
Reproducibility is easier when you have it in mind from the beginning
- Choose your tools
- Reproducibility brings some complexity, but there are more and more tools to manage this complexity for you
- Provide environments
- Whatever the environment quality you provide, it is better than no environment at all $\winkey$
- Better if you provide the recipe
- Providing experiment environment is good. Providing the recipe to build this environment is better!