OxCGRT data merge, npi model computation docker deployment #523

Merged
merged 63 commits into master from 518-npi-model-data-merge on Sep 11, 2020

Conversation

JanataPavel
Contributor

OxCGRT data merge

Added a luigi step that downloads data from the Oxford COVID Government Response Tracker (OxCGRT) and merges them into the countermeasures data.
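
The download-and-merge step looks roughly like the sketch below. This is a minimal illustration only; the task name, output path, and OxCGRT URL are assumptions, not the exact code in this PR.

```python
# Minimal sketch of a luigi step that downloads the OxCGRT CSV
# (task name, URL, and output path are illustrative assumptions).
import luigi
import pandas as pd

OXCGRT_URL = (
    "https://raw.githubusercontent.com/OxCGRT/covid-policy-tracker/"
    "master/data/OxCGRT_latest.csv"
)

class DownloadOxCGRT(luigi.Task):
    def output(self):
        return luigi.LocalTarget("data/oxcgrt_latest.csv")

    def run(self):
        df = pd.read_csv(OXCGRT_URL, low_memory=False)
        with self.output().open("w") as f:
            df.to_csv(f, index=False)
```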

The original countermeasures collected for the paper (up to the end of May) are in countermeasures_model_data.csv.

As described in the issue, we want to use new sources of countermeasures data for the NPI model. So far, only the data from OxCGRT are used. At the moment we still don't have newer sources for the Mask wearing and Universities countermeasures, and the Businesses countermeasure is derived from a related OxCGRT feature, which we might want to replace in the future.

The data from the paper take priority over the merged OxCGRT data.
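
In pandas terms, the precedence rule is essentially a combine_first: wherever the paper's CSV has a value, it wins, and OxCGRT only fills the gaps. The column and index names below are assumptions made for the sake of the example.

```python
# Illustrative sketch of the precedence rule (column names are assumptions).
import pandas as pd

paper = pd.read_csv("countermeasures_model_data.csv", parse_dates=["Date"])
oxcgrt = pd.read_csv("oxcgrt_features.csv", parse_dates=["Date"])  # hypothetical merged OxCGRT features

paper = paper.set_index(["Code", "Date"])
oxcgrt = oxcgrt.set_index(["Code", "Date"])

# Keep the paper's value wherever it exists, fall back to OxCGRT otherwise.
merged = paper.combine_first(oxcgrt)
```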

At this moment, the countries for which the NPI model is computed are the original 40 countries from the paper. In the near future, we will want to replace this with a list of about 80 countries.

NPI model GCS Docker computation

A new GitHub workflow was created that builds the docker image, uploads it to the container registry on Google Cloud, and then creates a VM instance on which it runs the newly built image. A script in the container downloads a CSV with the R estimates, which we compute every day, and runs the NPI model. When the computation finishes, it exports the results and uploads them to a Google Cloud Storage bucket, from where they can be used by the FE. After that, the container runs a gcloud command to remove the instance it is running on.
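
The container's entry point follows roughly the sequence below. This is a hedged sketch, not the actual script: the URL, bucket, module, instance, and zone names are placeholders.

```python
# Sketch of the container entry point: download R estimates, run the model,
# upload results, then delete the VM. All names below are placeholders.
import datetime
import subprocess

import requests
from google.cloud import storage

R_ESTIMATES_URL = "https://storage.googleapis.com/example-bucket/r_estimates.csv"  # placeholder
RESULTS_BUCKET = "example-results-bucket"  # placeholder

def main():
    # 1. Download the daily R estimates produced by the upload-data workflow.
    resp = requests.get(R_ESTIMATES_URL, timeout=60)
    resp.raise_for_status()
    with open("r_estimates.csv", "wb") as f:
        f.write(resp.content)

    # 2. Run the NPI model (placeholder for the real entry point).
    subprocess.run(["python", "-m", "epimodel.run_npi_model"], check=True)

    # 3. Upload the exported results so the frontend can pick them up.
    client = storage.Client()
    bucket = client.bucket(RESULTS_BUCKET)
    today = datetime.date.today().isoformat()
    for name in ("latest_npi-results.json", f"{today}_npi-results.json"):
        bucket.blob(name).upload_from_filename("npi-results.json")

    # 4. Remove the VM this container is running on (name and zone are placeholders).
    subprocess.run(
        ["gcloud", "compute", "instances", "delete", "npi-model-vm",
         "--zone", "europe-west1-b", "--quiet"],
        check=True,
    )

if __name__ == "__main__":
    main()
```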

As the model no longer runs before the WebExport task, the results of the NPI model are exported into a separate file.
The upload-data workflow runs every day: it processes the Johns Hopkins data, computes the r_estimates, exports them to data-v4.json, and uploads that file to the main channel. The compute-npi-model workflow can be run independently of it and can be triggered from the GitHub web UI (only after the workflow file is in master, but it can already be run through the GitHub API now). When running the workflow, the name of the channel to which the results will be uploaded can be specified. The docker container uploads two files to the bucket - latest_npi-results.json and <date>_npi-results.json - so that we can show our past predictions in the future.
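
Until the workflow file is in master and the web UI button appears, triggering it through the API looks roughly like the sketch below; the repository, workflow file name, and the "channel" input name are assumptions here.

```python
# Hypothetical sketch of triggering the compute-npi-model workflow via the
# GitHub workflow_dispatch API; repo, file name, and input name are assumptions.
import os
import requests

resp = requests.post(
    "https://api.github.com/repos/epidemics/epimodel/actions/workflows/"
    "compute-npi-model.yml/dispatches",
    headers={
        "Authorization": f"token {os.environ['GITHUB_TOKEN']}",
        "Accept": "application/vnd.github.v3+json",
    },
    json={"ref": "master", "inputs": {"channel": "staging"}},
)
resp.raise_for_status()  # GitHub returns 204 No Content on success
```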

Notes on performance

During development, I encountered a few performance issues. The original idea was to take advantage of the parallelism of the pymc3 library, which turned out to be problematic. Although pymc3 allows running multiple jobs, each using multiple threads, in practice there are no performance gains from using more threads (probably due to issues in the implementation of pymc3 itself). It turned out to work best when each chain uses only one thread. As we're using 2 chains, a 2-core CPU is enough.
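
Concretely, the single-thread-per-chain setup amounts to capping the BLAS/OpenMP thread pools before pymc3 (and theano) are imported and then sampling with one core per chain. The model below is just a placeholder to show where the settings go.

```python
# Sketch of the "one thread per chain" configuration; the model is a dummy.
import os

# Must be set before theano/pymc3 are imported.
os.environ["OMP_NUM_THREADS"] = "1"   # limit OpenMP threads per sampling process
os.environ["MKL_NUM_THREADS"] = "1"   # limit MKL threads per sampling process

import pymc3 as pm

with pm.Model():
    mu = pm.Normal("mu", mu=0.0, sigma=1.0)
    pm.Normal("obs", mu=mu, sigma=1.0, observed=[0.1, -0.2, 0.3])
    # 2 chains on 2 cores: each NUTS process gets its own core and one thread.
    trace = pm.sample(draws=500, tune=500, chains=2, cores=2)
```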

This project uses Poetry as its dependency manager. However, while experimenting with the model, I found that it actually runs significantly better with Anaconda (due to a more direct interaction between Theano and MKL). Because of this, the docker container in which the model runs uses a conda environment.

The conda docker image requires at least 15 GB of disk space on the VM instance. The model needs about 20 GB of memory when running on 40 countries with the current number of countermeasures and a 90-day extrapolation period. When more countries are added, the RAM will have to be increased.

JanataPavel and others added 30 commits August 14, 2020 14:45
* added a dummy feature for every intervention, which is turned on when the intervention is turned off for the first time and turned off when the original intervention is put in place again.
* extended the data for another month by using the last known countermeasures in each region and data from Johns Hopkins
Prepared a docker container which can be run on GCP compute instances from GitHub pipelines

defined a conda environment to accelerate the computation of the NPI model
* don't run previous steps in the workflows; instead, download the latest r_estimates.csv inside docker. The other steps are quick
Created a script which deletes the instance when docker exits. This script is copied to the GCP console, but I included it in the repo for consistency
* OxCGRT added data for subregions (e.g. US states), which broke the pipeline. The fix is to filter them out (see the sketch after this list), but we might use them in the future
* Triggering the npi-model computing workflow manually
* More tuning iterations for the model (to hopefully shrink the confidence interval)
* Made sure that each NUTS sampling process created by pymc3 only uses one thread - The parallelization doesn't work and this greatly speeds up the computation
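
For the subregion fix mentioned above, the idea is simply to keep national-level OxCGRT rows. Assuming the CSV marks subnational records with a non-empty RegionCode column, the filter is a one-liner:

```python
# Hypothetical sketch: drop OxCGRT subregion rows (e.g. US states), keeping
# only country-level records; assumes RegionCode is empty on national rows.
import pandas as pd

oxcgrt = pd.read_csv("OxCGRT_latest.csv", low_memory=False)
national_only = oxcgrt[oxcgrt["RegionCode"].isna()]
```
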
@lgtm-com

lgtm-com bot commented Sep 1, 2020

This pull request introduces 3 alerts when merging 89c422f into 70bd566 - view on LGTM.com

new alerts:

  • 2 for Unused import
  • 1 for Unused local variable

@witzatom witzatom self-requested a review September 11, 2020 11:16
@witzatom witzatom merged commit e397f2e into master Sep 11, 2020
@witzatom witzatom deleted the 518-npi-model-data-merge branch September 11, 2020 11:17