
Controlling Your Environment

People involved in preparing this talk

  • Michael Mercier (Inria/Atos)
  • Cristian Ruiz (Inria)

Grid'5000, \structure{Kameleon}, Expo, … \bigskip\bigskip

Thanks to the following people for their feedback:

  • Pierre Neyron (CNRS)
  • Arnaud Legrand (CNRS)
  • Olivier Richard (UGA)
  • Lucas Nussbaum (Loria)

Here is the pad for interactions:

Material (demo, slides) available on GitHub:

  1. A Docker Demo
  2. A complete use case

Why should we care?

Motivations

Reproducible research: What does it mean? Watch the first webinar if you need a reminder.\medskip

Reproducibility is a cornerstone of scientific method

Is code sufficient?

An article about computational science in a scientific publication is not the scholarship itself, it is merely advertising of the scholarship. The actual scholarship is the complete software development environment and the complete set of instructions which generated the figures. \flushright{– David Donoho, 1998}

  • Making it available is a great first step
  • Making sure others can rerun it is another step

Problem statement

Experiment replication is not an easy task if you do not have it in mind from the beginning: \vspace{0.2cm}

The path from having a piece of software running on the programmer’s own machine to getting it running on someone else’s machine is fraught with potential pitfalls

\bigskip

In reproducible research, scientists should care about both the experiments and the analysis:

  • All the artifacts (input/output files)
  • The source code
  • Documentation on how to compile, install and run

Still, several problems may prevent someone from rerunning an experiment

Dependencies and compilation problems

  • Unresolved dependencies
  • Compilation errors

Less than 50% of the experimental setups of papers submitted to ACM conferences and journals could be built

Other technical issues

Portability issues
E.g., BOINC had to rely on homogeneous redundancy to protect against numerical instabilities (OS, hardware, …).
Imprecise documentation
I have no clue about how to install it, configure it or run it!
Dependency Hell
I can’t install this dependency package without breaking my entire system
Code rot
This dependency package version is buggy! What was the version that was used to run the experiment in the first place?!?

Cultural challenges

  • Efforts are not rewarded by the current academic research and funding environment
  • Software vendors tend to protect their markets through proprietary formats and interfaces
  • Investigators naturally tend to want to own and control their research tools
  • Even the most generalized software will not be able to meet the specific needs of every researcher in a field
  • The need to derive and publish results as quickly as possible precludes the often slower standards-based development path

Disseminating science software

Disseminating science software

  • Raw data and code: rely on the published paper for documentation
  • Extensive documentation: it may still require certain skills
  • Adopt a controlled environment: e.g., rely on a scientific workflow
  • Use virtual machines to capture and publish code, data and experimental environment

Everywhere there is code, you need an environment

Why should I take care of my experiment environment?

For myself:

  • Be able to reproduce my own experiment later
  • Improve my productivity (when preparing articles, PhD, rebuttals, …)
  • Be able to scale my experiment on other machines
  • Facilitate experiment extensions and modifications
  • Be a better scientist by doing better science $\winkey$

For other people: my students, my colleagues, my peers, …

  • Allow them to reproduce my experiment and corroborate (or not) my results
  • Allow them to base their research on my research and extend

For everyone else:

  • Improve knowledge sharing
  • Increase collaboration possibilities
  • Do better science!

Controlling your environment

One way to go is to take care of your experimental environment

There are mainly two approaches:

  • Preserving the mess by capturing the already set up environment
  • Encouraging cleanliness, with several options:
    • Using a constrained environment
    • Building your own environment

    See Preserve the Mess or Encourage Cleanliness? (Thain et al., 2015)

Constraint for simplicity, complexity for freedom

Each of these approaches offers a different level of constraint and flexibility:

  • The more constrained your environment is, the simpler it is
  • Freedom comes with responsibility

What is an environment?

Environment definition

An experiment may involve numerous environments or a unique one, depending on its workflow:

  • Experiment environments
    • local, on a testbed, on a dedicated server,…
  • Analysis environments
    • Usually a unique local environment

The whole environment contains both hardware and software information

Hardware

Necessary when we carry out performance measurements

Tools to capture hardware configuration:

  • dmidecode
  • hwloc (lstopo)
  • ls* tools (lsblk, lshw, lspci, lsmod,…)
  • proprietary tools (bios, nvidia,…)
  • Testbed hardware description APIs (Grid’5000, Chameleon)
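
For instance, a minimal capture sketch using the tools above (output file names are illustrative):

$ sudo dmidecode > dmi.txt              # BIOS, memory and motherboard details
$ lstopo topology.svg                   # hwloc topology, format inferred from the extension
$ lsblk > disks.txt && lspci > pci.txt  # block devices and PCI devices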

The hardware is not shareable

As it cannot be shared, the hardware environment needs to be documented as exhaustively as possible.

How exhaustive the documentation must be depends on how strongly the results of the experiments are affected by the underlying hardware.

Software

Different types of environment:

Very succinct (usually what is provided, if provided…)

  • minimal description in a mail
  • README in a git repository
  • small documentation

Partial

  • bundle of the experiment tool and its dependencies
  • Linux container image

Full

A complete environment backup with the operating system included
  • Virtual machine
  • A complete system image

Virtual environments: important notions

The role of a virtual environment is to provide some isolation within the host\vspace{-1.5em}

  • A virtual environment can only use a limited part of the resources:
    • filesystem
    • memory/CPU/disk/network
  • Has its own software stack $⇒$ clean dependencies

\medskip

By the way:

What is a container?
An isolated part of the system that shares the operating system kernel
What is a virtual machine?
A full system image running a guest OS that shares the host hardware through a hypervisor

VM vs container

Types of environments

First approach: use a constrained environment

Use of third party environments

An environment built, specialized, controlled and versioned by somebody else:


  • SageMathCloud
    • Uses Jupyter
      • Julia, Python, R, Haskell, Ruby…
      • 40 languages (partly) supported

Sharing is easy but you have to stick to what the environment provides

Constrained but extensible environments:

  • Activepapers (Beta)
    • Python or JVM based language


Use a controlled environment as a base

Start your experimental setup in a controlled environment from the beginning

  • A clean system install in a virtual environment
  • Software appliance marketplaces (e.g., TurnKey Linux, The Cloud Market)
  • Default testbed environments (Grid’5000, CloudLab, Chameleon)

This encourages cleanliness:
Your environment is controlled (you start from a clean system)

TurnKey Linux: http://www.turnkeylinux.org · The Cloud Market: http://www.thecloudmarket.com

Drawbacks

Nothing tracks the modifications applied to this environment
You don’t know what is inside the box $\frowny$

Second approach: Capturing an environment

Capturing an environment

Several approaches for capturing your environment:

  • Export everything
    • Kernel + Libraries + Application
    • Heavy but safe
  • Capture only what is needed to run on a similar system
    • Libraries (only dependencies) + App
    • Lightweight but can be partial

Copying your experiment environment

A simple capture of an environment is a complete copy of it.

It depends on what your environment is:

  1. On a classical local machine:
    • Problem: A simple backup bundle is not easily usable by others
    • Partial solution: Clone your hard drive to a VM (excluding personal data)
  2. On a virtual environment, use the instant snapshot capability
    • Faster and simpler backup
    • The VM needs to have been used from the beginning (as mentioned previously)
  3. On a testbed machine, use the provided snapshot mechanism
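
For instance, with a VirtualBox-based environment (the VM name is illustrative):

$ VBoxManage snapshot "exp-vm" take "before-run"  # instant snapshot of the VM
$ VBoxManage export "exp-vm" -o exp-vm.ova        # shareable appliance, often several gigabytes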

In all cases, sharing is complicated:

  • Huge environment images of several gigabytes are common
  • They need a dedicated place to be stored (a repository or some marketplace)

You still don’t know what is inside the box $\frowny$ \smallskip

Capture only what is needed

Use a tracking tool to capture only what is necessary

  • Instrument a run of your experiment to capture everything it uses:
    • Binaries/Scripts (experiment.py, Python 2.7)
    • Configuration files (conf.yaml)
    • Libraries (libc, numpy, matplotlib)

    Then create a compressed bundle

  • Rerun the experiment on another machine:
    1. Import the provided bundle
    2. Initialize the environment (depends on the tools…)
    3. Rerun the exact same experiment

Capture is not foolproof:

  • Running with only one set of parameters is not enough
  • Higher risk of missing something $\frowny$

Less messy than a full virtual-environment copy $☺$, but not easy to modify when extending an experiment $\frowny$

Capture tools

Existing tools:

  • CDE (Guo et al., 2011)
    • First to bring the idea
    • Apparently unmaintained since 2013
  • ReproZip (Freire et al., 2013)
    • One tool to trace and pack
    • Several tools to unpack and run (install package, chroot, docker, vagrant)
    • More during the demo $☺$ (see the sketch after this list)
  • CARE (Janin et al., 2014)
    • Only for experts
    • Apparently unmaintained since 2014
  • Parrot
    • Limited to the Parrot filesystem…
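
As a sketch of the trace/pack/rerun cycle with ReproZip (experiment.py is a hypothetical script):

$ reprozip trace python experiment.py   # instrumented run: record every file used
$ reprozip pack my_experiment.rpz       # create the compressed bundle

# Then, on another machine:
$ reprounzip docker setup my_experiment.rpz exp/
$ reprounzip docker run exp/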

Third approach: Building a complete environment

Environment generation (some facts)

  • If you’re moving a computation to a new system, it should be simple and straightforward to set up an environment almost identical to that of the original machine
  • A major challenge in reproducing computations is installing the prerequisite software environment $\frowny$
  • Modern open computational science relies on complex software stacks
  • So, it is necessary to know:
    • How was it built?
    • What does it contain?
    • How can I modify it to extend the experiment?

How is software installed and configured?

Source code compilation

$ tar -xzf pdt-3.19.tar.gz && cd pdtoolkit-3.19/
$ ./configure -prefix=/usr/local/pdt-install
$ make clean install
  • Need to install all dependencies by hand
  • Some skills are required

Package manager

A PM is a collection of software tools that automates the process of installing, upgrading, configuring, and removing computer programs for a computer’s operating system in a consistent manner

  • Examples in the Linux world: APT, yum, pacman, Nix …
  • Package managers also exist for programming languages: Bundler, CPAN, CRAN, EasyInstall, Go Get, Maven, pip, RubyGems, …
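
For example (package names and the pinned version are illustrative):

$ sudo apt-get install python-numpy   # system package manager
$ pip install numpy==1.8.2            # language package manager, pinned to a version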

DevOps: Docker and Vagrant

The DevOps Approach

  • Dev = Development, Ops = (System) Operations
  • You have a pile of crusty code that’s hard to install
  • And documenting how to install it is almost as hard! $☺$
  • Why not develop scripts that reliably install your toolset?
    • Because that sounds hard? $\winkey$
    • But it’s more fun than writing documentation!

Use all the good things that software engineering has developed over decades to ensure isolation and reproducibility

Credit: this slide is inspired by “A Common Scientific Compute Environment for Research and Education”, presented at SciPy 2014

Creating recipes: text based description

  • README
  • Shell scripts
  • Configuration management tools: automate software configuration and installation
    • Software stacks can be easily transportable
    • Some CM tools: Puppet, Salt, Ansible
    • A lot of work has to be done to write recipes $\frowny$

DevOps response: Docker for deployment

Any application can be easily moved through different environments

DevOps response: Docker for deployment

  • Docker is an open-source engine that automates the deployment of any application as a lightweight, portable, self-sufficient container that will run virtually anywhere
  • Docker tries to achieve deterministic builds by isolating your service, building it from a snapshotted OS and running imperative steps on top of it
  • Dependency hell: Docker works with images that consume minimal disk space, are versioned, archivable, and shareable (DockerHub)
  • Dockerfiles: resolving imprecise documentation (see the sketch below)
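
A minimal Dockerfile sketch (base image and packages are illustrative); the file itself doubles as precise installation documentation:

FROM debian:8
RUN apt-get update && apt-get install -y python python-numpy
COPY experiment.py /exp/experiment.py
CMD ["python", "/exp/experiment.py"]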

DevOps response: Vagrant for building

  • It automates the build of development environments using a base environment called a box and a series of text-based instructions

DevOps response: Vagrant for building

  • Researchers write text-based configuration files that provide instructions to build virtual machines
  • Solves the problem of sharing a VM: since these files are small, researchers can easily share them and track different versions via source-control repositories
  • VMs are not seen as black boxes anymore
  • Researchers can automate the process of building and configuring virtual machines
  • It is possible to use different providers: EC2, VirtualBox, VMware, Docker, etc.
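
A minimal Vagrantfile sketch (box name and provisioning commands are illustrative):

Vagrant.configure("2") do |config|
  config.vm.box = "debian/jessie64"   # the base environment (the "box")
  config.vm.provision "shell",
    inline: "apt-get update && apt-get install -y python-numpy"
end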

Reproducible builds

Reproducible builds: functional package management (Nix)

  • Apply the functional model to packaging

A package is the output of a deterministic function (it depends only on the function’s inputs, without any side effects)

  • The principle: two independent runs of a given build process for a given set of inputs should return the same value
  • Functional hash-based immutable package management
  • Isolated build
  • Deterministic
  • No dependency hell
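
A quick way to get such an isolated environment (attribute names vary across nixpkgs versions):

$ nix-shell -p python27 python27Packages.numpy   # isolated shell built only from these inputs
[nix-shell]$ python -c 'import numpy'            # numpy resolved from /nix/store/<hash>-...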

Reproducible builds: Nix workflow

Environment generation

Reconstructability

An experimental setup \(E'\) is reconstructable if the following three facts hold:

  1. Experimenters have access to the original base experimental setup \(E\).
  2. Experimenters know exactly the sequence of actions \(\langle A_{1}, A_{2}, A_{3}, …, A_{n}\rangle\) that produced \(E'\).
  3. Experimenters are able to change some action \(A_{i}\) and successfully re-construct an experimental setup \(E''\).

Reconstructability

Additional problems:

  • Accessing the same base setup \(E\)
  • Software used is not available anymore

Dealing with software availability (Debian Snapshot)

The Debian community is quite active on the reproducibility front.

  • It is an archive that allows access to old packages based on dates and version numbers
  • It provides a valuable resource for tracking down when regressions were introduced, or for providing a specific environment that a particular application may require to run
  • Only concerns software that is packaged $\frowny$
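
For example, pinning an APT source to a snapshot date (the date is illustrative):

# in /etc/apt/sources.list
deb http://snapshot.debian.org/archive/debian/20160101T000000Z/ jessie main

$ apt-get -o Acquire::Check-Valid-Until=false update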

Kameleon: Reconstructable Appliance Generator

Kameleon Features

  • Easy to use $\leadsto$ a structured language based on a few constructs and relying on shell commands
  • Allows shareability thanks to the hierarchical structure of recipes and the extend mechanism
  • Kameleon supports the build process by providing debugging mechanisms such as interactive shell sessions, break-points and checkpointing
  • Allows the easy integration of providers using the same language for the recipes
  • Persistent cache makes reconstructability a reality
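
A rough sketch of a Kameleon recipe (recipe and step names are illustrative):

extend: debian8.yaml            # inherit a base recipe (hierarchical structure)
setup:
  - install_dependencies:
    - exec_in: apt-get update && apt-get -y install python-numpy

$ kameleon build my_recipe.yaml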

Demo time

Docker

It’s time for a Docker Demo (follow the links from https://github.com/alegrand/RR_webinars/)

Docker advantages for reproducible research:

  • Integrating into local development environments
  • Modular reuse
  • Portable environments
  • Public repositories for sharing
  • Versioning

Docker advantages

  • Portable computation & sharing
$ docker export container-name > container.tar
$ docker push username/r-recommended
  • Re-usable modules
$ docker run -d --name db training/postgres
$ docker run -d -P --link db:db training/webapp \
   python app.py
  • Versioning
$ docker history r-base
$ docker tag  d7e5801bb7ac ttimbers/mmp-dyf-skat:latest

A complete use case: Batsim

Let’s demo a complete use case (follow the links from https://github.com/alegrand/RR_webinars/).\medskip

Use case: a not-so-simple simulation

  • SimGrid (C library) + BatSim (C++) + OAR scheduler (Python) + a new scheduler (Perl)
  • Python script for the glue

Steps:

  • Build an environment with Kameleon
  • Capture an experiment with ReproZip
  • Export the corresponding bundle
  • Rerun the experiment on another machine (ReproZip + Docker)
  • Compare the results (CSV + Python $\leadsto$ graphics), still using the environment

Conclusion

Conclusion

Reproducibility is easier when you have it in mind from the beginning

Choose your tools
Reproducibility brings some complexity, but more and more tools exist to manage this complexity for you
Provide environments
Whatever the quality of the environment you provide, it is better than no environment at all $\winkey$
Better if you provide the recipe
Providing the experiment environment is good. Providing the recipe to build this environment is better!
