Workflow
The RASPA2 wrapper was created with workflow in mind. It was written to play well with other programs, so it can be used in (or left out of) larger glue scripts. Similar interfaces have been built for other computational tools, and most third-party programs (e.g. openbabel, mongoDB) have been written to work with this approach.
Talking about "workflow" and "interfaces" is really vague. To explain the meaning underlying these words, let's start by defining the key term here: the glue script.
Simply put, a glue script is a short script that "glues" together multiple software components. There are many benefits to this approach, which are easier to show than to explain. Let's use the following example: we have a non-P1 cif file. We want to create a cif file that the charge algorithm EQeq accepts, run EQeq to get the charged cif format that RASPA accepts, and then use RASPA to get the helium void fraction.
Here's the "conventional" way of doing this:
- Open the cif file in a program with good cif i/o, such as Materials Studio.
- Save the cif, making sure it's outputted with P1 symmetry.
- Run EQeq from the command line (something like ./eqeq mof.cif).
- Grab the outputted RASPA cif file, deleting the other output. Move the cif to $RASPA_DIR/share/raspa/structures/cif.
- Edit a RASPA input file with the name of the cif file.
- Run RASPA (something like ./simulate -i hevf.input).
- Open the file in the generated Output/ folder, and search for "Average Widom Rosenbluth factor".
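The last step in this list is a plain text search, which is easy to script. Here is a minimal sketch, assuming only that the phrase appears somewhere in a file under Output/. The helper name find_lines is hypothetical, and nothing is assumed about the rest of RASPA's output format.

```python
from pathlib import Path


def find_lines(root, phrase):
    """Yield (path, line) for every line under root that contains phrase."""
    for path in sorted(Path(root).rglob("*")):
        if not path.is_file():
            continue
        # errors="replace" keeps the search going if a file is not valid text
        with path.open(errors="replace") as handle:
            for line in handle:
                if phrase in line:
                    yield path, line.rstrip("\n")
```

Calling find_lines("Output", "Average Widom Rosenbluth factor") and looping over the result prints each matching file and line, which replaces opening the file by hand.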
The details of this workflow may differ from person to person, and shell scripts have been written to simplify some aspects, but this is the general idea. Let's contrast this with how you would do it with a glue script.
- Write out what you want to do in a Python script.
import pybel
import RASPA2
# Use pybel to parse, fill, and charge cif structure
mol = next(pybel.readfile("cif", "sample.cif"))
mol.unitcell.FillUnitCell(mol.OBMol)
mol.calccharges("eqeq")
# Use RASPA to get the helium void fraction
print(RASPA2.get_helium_void_fraction(mol))
- Run the script.
Rather than performing multiple steps by hand, you write out your logic once and run it in one step. Over time this naturally produces a set of scripts for common tasks, which is great for consistent and reproducible workflows (Note: they can also be version controlled!).
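One way to turn the script above into a reusable tool for common tasks is to wrap its steps in a function and take the cif path from the command line. The following is only a sketch: the function names are hypothetical, while the pybel and RASPA2 calls are the ones from the example above.

```python
import argparse


def build_parser():
    """Command-line interface: one positional argument, the cif path."""
    parser = argparse.ArgumentParser(
        description="Print the helium void fraction of a cif structure.")
    parser.add_argument("cif", help="path to the input cif file")
    return parser


def helium_void_fraction(cif_path):
    """Same steps as the script above, wrapped in a function."""
    # Imported here so the command-line parsing works even where these
    # packages are missing. Newer Open Babel releases expose pybel as
    # `from openbabel import pybel` instead.
    import pybel
    import RASPA2

    mol = next(pybel.readfile("cif", cif_path))
    mol.unitcell.FillUnitCell(mol.OBMol)
    mol.calccharges("eqeq")
    return RASPA2.get_helium_void_fraction(mol)


def main(argv=None):
    """Parse arguments (sys.argv by default) and print the result."""
    args = build_parser().parse_args(argv)
    print(helium_void_fraction(args.cif))
```

Saved under a name of your choice (say hevf.py) with a call to main() at the bottom, this could be run as python hevf.py sample.cif for any structure.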
If you're curious about why Python, here's an article by SciPy titled Using Python as Glue. The upshot is that Python is very easy to write and interfaces well with other scientific languages (e.g. C, FORTRAN).
In the above case, the automation only serves to simplify an existing workflow. However, the greatest power of glue scripts is not in saving time, but in enabling workflows that were previously unimaginable. Let's go into a relevant example below.
One of the motivations of writing the python bindings was to assist in creating databases larger than the filesystem allows. As shown in the databases section of this wiki, we solve this by using mongoDB. This comes with a wide range of features, but it forces a learning curve and breaks existing bash scripts. As the existing scripts were not written as glue, they're difficult to change and don't adapt easily to the ever-changing needs of computational projects.
Using the glue script approach, we can easily work with a database. We can also easily work with cloud-based supercomputing (like Amazon EC2) or Northwestern's Quest cluster. Here's an example with a cloud supercomputer.
# Connect to the database and get 3000 sample MOFs.
# Let's assume that these contain an id, a charged RASPA molfile, and a helium void fraction.
import pymongo
db = pymongo.MongoClient().sample
# Turn the cursor into a list so it can be iterated more than once.
mofs = list(db.mof.find().limit(3000))

# On each core, we will get the v/v methane uptake at 65 bar.
def f(mof):
    import RASPA2
    output = RASPA2.run(mof["charged mol"], "CH4", temperature=298, pressure=65e5,
                        helium_void_fraction=mof["helium void fraction"],
                        input_file_type="mol")
    return output["Number of molecules"]["CH4"]["Average loading absolute [cm^3 (STP)/cm^3 framework]"][0]

# Distribute the jobs on the amazon cloud cores. Wait for the jobs to finish.
import cloud
jids = cloud.map(f, mofs)
uptakes = cloud.result(jids)

# Let's print the results for postprocessing, and additionally save them back into the database.
# By saving into a central database, you can share your simulation results with everyone else. (Useful for data mining!)
for mof, uptake in zip(mofs, uptakes):
    print(mof["_id"], uptake)
    db.mof.update_one({"_id": mof["_id"]}, {"$set": {"CH4": uptake}})
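If the cloud library is not available to you, the same map-then-collect pattern can be sketched with Python's standard concurrent.futures module. This is a swapped-in technique, not what the example above uses. The names map_jobs and fake_uptake are hypothetical, and fake_uptake is a stand-in for the RASPA call so the sketch runs without RASPA or a database.

```python
from concurrent.futures import ThreadPoolExecutor


def map_jobs(func, items, max_workers=4, executor_cls=ThreadPoolExecutor):
    """Apply func to every item in parallel; results keep the input order."""
    with executor_cls(max_workers=max_workers) as pool:
        return list(pool.map(func, items))


def fake_uptake(mof):
    # Hypothetical stand-in for f(mof) above; it does not run a simulation.
    return mof["helium void fraction"] * 100.0


# Made-up records with the same keys the example above reads.
sample_mofs = [
    {"_id": 1, "helium void fraction": 0.5},
    {"_id": 2, "helium void fraction": 0.25},
]
uptakes = map_jobs(fake_uptake, sample_mofs)
for mof, uptake in zip(sample_mofs, uptakes):
    print(mof["_id"], uptake)
```

Passing executor_cls=ProcessPoolExecutor (also from concurrent.futures) would spread the work over local CPU cores instead of threads, provided the function and its arguments can be pickled.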
The previous example stops the glue scripting at post-processing. This is fairly common in regular glue scripts, as no one wants their entire process to fail because the plotting part of their glue script broke. However, there is a solution. Scientists across the world have been working for years to create the ideal scientific glue-script workflow. Their project is called IPython, and its notebook feature is perfect for our computational needs. To set up IPython and get sample scripts, run the following:
conda install ipython
git clone https://github.com/numat/simulation-notebooks.git
cd simulation-notebooks
ipython notebook
If you don't have access to simulation-notebooks, feel free to email [email protected] for samples.
This will install IPython, grab some sample notebooks made at NuMat, and run the program. The sample scripts include a tutorial, as well as thoroughly tested interactive scripts for common tasks: isotherms for many MOFs, adsorption as a function of temperature and pressure for a single MOF, high-throughput screening using mongo, BET surface area back-calculation, and more. This is the current workflow at NuMat, and it has simplified common simulations to the point where experimental chemists can use it from their laptops. In the case of new simulations, the notebooks are even more powerful, allowing interactive testing and debugging.
By default, IPython notebook will only run on your own computer. To get the most out of the notebook, someone has to set one up on a computer with as many relevant libraries as possible (RASPA, EQeq, mofgen, pymongo, etc.). If you have the password, you can access one such setup at http://numat-tech.com/notebook. This setup gives you a simple web interface to one of NuMat's high-end Ubuntu desktops. This means that you can run simulations from any laptop without worrying about installing the libraries properly, or even knowing how to use Linux. That is, once one person has set it up, it's set up for everyone.