
Commit

Merge pull request #72 from datasciencebr/cuducos-load-specific-files
Make commands load datasets from specific files, not according to dates from the settings
Irio authored Dec 19, 2016
2 parents f20f3e9 + c7de2f1 commit 802c580
Showing 15 changed files with 239 additions and 213 deletions.
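In short, commands that used to build dataset file names from `AMAZON_S3_*_DATE` settings now receive an explicit path (a before/after sketch based on the diffs below):

```console
# before: file name resolved from dates in the settings
$ python manage.py reimbursements
# after: dataset path passed explicitly
$ python manage.py reimbursements /tmp/serenata-data/reimbursements.xz
```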
4 changes: 3 additions & 1 deletion Makefile
@@ -8,6 +8,8 @@ collectstatic: run.jarbas

seed: run.jarbas
docker-compose run --rm jarbas python manage.py loaddatasets
docker-compose run --rm jarbas python manage.py reimbursements /tmp/serenata-data/reimbursements.xz
docker-compose run --rm jarbas python manage.py companies /tmp/serenata-data/2016-09-03-companies.xz
docker-compose run --rm jarbas python manage.py irregularities /tmp/serenata-data/irregularities.xz

run.devel: collectstatic
45 changes: 20 additions & 25 deletions README.md
@@ -132,42 +132,45 @@ If you have [Docker](https://docs.docker.com/engine/installation/) (with [Docker

```console
$ docker-compose up -d --build
$ docker-compose run --rm jarbas python manage.py migrate
$ docker-compose run --rm jarbas python manage.py ceapdatasets
```


You can access it at [`localhost:80`](http://localhost:80/). However, your database starts empty and you still have to collect the static files and load the data:

```console
$ docker-compose run --rm jarbas python manage.py collectstatic --no-input
$ docker-compose run --rm jarbas python manage.py loaddatasets
$ docker-compose run --rm jarbas python manage.py reimbursements <path to reimbursements.xz>
$ docker-compose run --rm jarbas python manage.py irregularities <path to irregularities.xz file>
$ docker-compose run --rm jarbas python manage.py companies <path to companies.xz>
```

You can get the datasets by running [Rosie](https://github.com/datasciencebr/rosie) or directly with the [toolbox](https://github.com/datasciencebr/serenata-toolbox).
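If you prefer to download the files directly, they follow the URL pattern the loading commands use (a sketch; the date prefix depends on the dataset version, as in `2016-12-06-reimbursements.xz`):

```console
$ curl -O https://s3-sa-east-1.amazonaws.com/serenata-de-amor-data/2016-12-06-reimbursements.xz
```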

Also there are some clever shortcuts in the `Makefile` if you like them.

### Local install

#### Requirements

Jarbas requires [Python 3.5](http://python.org), [Node.js 6](http://nodejs.org), and [PostgreSQL 9.4+](https://www.postgresql.org).

Once you have `pip` and `npm` available, install the dependencies:

```console
npm install
python -m pip install -r requirements.txt
```

##### Python's `lzma` module

In some Linux distros `lzma` is not installed by default. You can check whether you have it with `$ python -m lzma`. On Debian-based systems you can fix that with `$ apt-get install liblzma-dev`, and on macOS with `$ brew install xz`, but you might have to re-compile your Python (e.g. with `$ brew upgrade --cleanup python`).
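For convenience, the same check and fixes as one-liners (package names may vary by distro):

```console
$ python -m lzma                   # exits silently when lzma is available
$ apt-get install liblzma-dev      # Debian-based systems
$ brew install xz                  # macOS; may require re-compiling Python
```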


#### Settings

Copy `contrib/.env.sample` as `.env` in the project's root folder and adjust your settings. These are the main variables:
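For example, from the project root:

```console
$ cp contrib/.env.sample .env
```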

##### Django settings

@@ -187,8 +190,6 @@ Copy `contrib/.env.sample` as `.env` in the project's root folder and adjust your settings
* `AMAZON_S3_BUCKET` (_str_) Name of the Amazon S3 bucket to look for datasets (e.g. `serenata-de-amor-data`)
* `AMAZON_S3_REGION` (_str_) Region of the Amazon S3 (e.g. `s3-sa-east-1`)
* `AMAZON_S3_DATASET_DATE` (_str_) Datasets file name prefix of CEAP datasets from Serenata de Amor (e.g. `2016-08-08` for `2016-08-08-current-year.xz`)
* `AMAZON_S3_REIMBURSEMENTS_DATE` (_str_) Reimbursements dataset file name date prefix (e.g. `2016-12-06` for `2016-12-06-reimbursements.xz`)
* `AMAZON_S3_COMPANIES_DATE` (_str_) Suppliers (companies) datasets file name date prefix (e.g. `2016-08-08` for `2016-08-08-companies.xz`)
* `AMAZON_S3_CEAPTRANSLATION_DATE` (_str_) File name prefix for dataset guide (e.g. `2016-08-08` for `2016-08-08-ceap-datasets.md`)
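For instance, with the sample values above the `loaddatasets` command composes file names and URLs such as:

```
2016-08-08-current-year.xz
https://s3-sa-east-1.amazonaws.com/serenata-de-amor-data/2016-08-08-current-year.xz
```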

##### Google settings
@@ -210,19 +211,13 @@ Now you can load the data from our datasets and get some other data as static files:

```
$ python manage.py loaddatasets
$ python manage.py reimbursements <path to reimbursements.xz>
$ python manage.py irregularities <path to irregularities.xz file>
$ python manage.py companies <path to companies.xz>
$ python manage.py ceapdatasets
```

Use `python manage.py loaddatasets --help` to check options for limiting the number of documents loaded from the datasets.
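For example, to load from a local copy of the data directory and reset existing records first (using the `--source` and `--drop-all` flags defined by the load commands; the path below is the one used in the `Makefile`):

```console
$ python manage.py loaddatasets --source /tmp/serenata-data --drop-all
```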

You can get the datasets by running [Rosie](https://github.com/datasciencebr/rosie) or directly with the [toolbox](https://github.com/datasciencebr/serenata-toolbox).

#### Generate static files

Expand Down
2 changes: 0 additions & 2 deletions contrib/.env.sample
@@ -9,7 +9,5 @@ AMAZON_S3_REGION=s3-sa-east-1

AMAZON_S3_CEAPTRANSLATION_DATE=2016-08-08
AMAZON_S3_DATASET_DATE=2016-08-08
AMAZON_S3_REIMBURSEMENTS_DATE=2016-12-06
AMAZON_S3_SUPPLIERS_DATE=2016-09-03

GOOGLE_STREET_VIEW_API_KEY=my-google-places-api-key
118 changes: 65 additions & 53 deletions jarbas/core/management/commands/__init__.py
@@ -10,53 +10,13 @@

class LoadCommand(BaseCommand):

def add_arguments(self, parser, add_drop_all=True):
parser.add_argument('dataset', help='Path to the .xz dataset')
if add_drop_all:
parser.add_argument(
'--drop-all', '-d', dest='drop', action='store_true',
help='Drop all existing records before loading the datasets'
)

@staticmethod
def to_number(value, cast=None):
@@ -91,12 +51,6 @@ def to_date(text):
except ValueError:
return None

def drop_all(self, model):
if model.objects.count() != 0:
msg = 'Deleting all existing records from {} model'
@@ -115,3 +69,61 @@ def print_count(self, model, **kwargs):
@staticmethod
def get_model_name(model):
return model._meta.label.split('.')[-1]


class OldLoadCommand(LoadCommand):

def add_arguments(self, parser):
parser.add_argument(
'--drop-all', '-d', dest='drop', action='store_true',
help='Drop all existing records before loading the datasets'
)
parser.add_argument(
'--source', '-s', dest='source', default=None,
help='Data directory of Serenata de Amor (dataset source)'
)
parser.add_argument(
'--dataset-version', dest='dataset_version', default=None,
            help='Dataset file version (usually a YYYY-MM-DD date)'
)

def get_dataset(self, name):
if self.source:
return self.load_local(self.source, name)
return self.load_remote(name)

def load_remote(self, name):
"""Load a document from Amazon S3"""
url = self.get_url(name)
print("Loading " + url)
with NamedTemporaryFile(delete=False) as tmp:
urlretrieve(url, filename=tmp.name)
return tmp.name

def load_local(self, source, name):
"""Load documents from local source"""
path = self.get_path(source, name)

if not os.path.exists(path):
print(path + " not found")
return None

print("Loading " + path)
return path

def get_url(self, suffix):
return 'https://{region}.amazonaws.com/{bucket}/{file_name}'.format(
region=settings.AMAZON_S3_REGION,
bucket=settings.AMAZON_S3_BUCKET,
file_name=self.get_file_name(suffix)
)

def get_path(self, source, name):
return os.path.join(source, self.get_file_name(name))

def get_file_name(self, name):
if not self.date:
settings_name = 'AMAZON_S3_{}_DATE'.format(name.upper())
self.date = getattr(settings, settings_name)
return '{date}-{name}.xz'.format(date=self.date, name=name)
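For reference, a hypothetical minimal subclass (not part of this commit) illustrating the new contract, next to the settings-driven lookup kept in `OldLoadCommand`:

```python
# Hypothetical example: a new-style command takes the dataset path as a
# positional argument instead of resolving file names from settings dates.
from jarbas.core.management.commands import LoadCommand


class Command(LoadCommand):
    help = 'Example: load a dataset from an explicit .xz path'

    def handle(self, *args, **options):
        self.path = options['dataset']  # added by LoadCommand.add_arguments
        if options.get('drop', False):
            # a real command would call self.drop_all(SomeModel) here
            pass
        print('Loading ' + self.path)
```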

jarbas/core/management/commands/loadsuppliers.py
@@ -12,27 +12,25 @@ class Command(LoadCommand):
help = 'Load Serenata de Amor supplier dataset into the database'

def handle(self, *args, **options):
self.path = options['dataset']
self.count = self.print_count(Supplier)
print('Starting with {:,} suppliers'.format(self.count))

if options.get('drop', False):
self.drop_all(Supplier)
self.drop_all(Activity)
self.count = 0

self.save_suppliers(self.get_dataset('companies'))
self.save_suppliers()

def save_suppliers(self, dataset):
def save_suppliers(self):
"""
        Create a Supplier object for each row of the dataset file at
        self.path, creating the related activity when needed.
        """
        skip = ('main_activity', 'secondary_activity')
        keys = [f.name for f in Supplier._meta.fields if f.name not in skip]
with lzma.open(self.path, mode='rt') as file_handler:
for row in csv.DictReader(file_handler):
main, secondary = self.save_activities(row)

8 changes: 2 additions & 6 deletions jarbas/core/management/commands/irregularities.py
@@ -12,14 +12,10 @@ class Command(LoadCommand):
filter_keys = ('applicant_id', 'document_id', 'year')

def add_arguments(self, parser):
super().add_arguments(parser, add_drop_all=False)

def handle(self, *args, **options):
self.path = options['dataset']
if not os.path.exists(self.path):
raise FileNotFoundError(os.path.abspath(self.path))

4 changes: 2 additions & 2 deletions jarbas/core/management/commands/loaddatasets.py
@@ -5,11 +5,11 @@

from django.conf import settings

from jarbas.core.management.commands import OldLoadCommand
from jarbas.core.models import Document


class Command(OldLoadCommand):
help = 'Load Serenata de Amor datasets into the database'
suffixes = ('current-year', 'last-year', 'previous-years')

6 changes: 2 additions & 4 deletions jarbas/core/management/commands/reimbursements.py
@@ -16,8 +16,7 @@ def add_arguments(self, parser)
)

def handle(self, *args, **options):
self.path = options['dataset']
self.count = Reimbursement.objects.count()
print('Starting with {:,} reimbursements'.format(self.count))

@@ -31,8 +30,7 @@ def handle(self, *args, **options):
@property
def reimbursements(self):
"""Returns a Generator with a Reimbursement object for each row."""
with lzma.open(self.path, mode='rt') as file_handler:
for row in csv.DictReader(file_handler):
yield Reimbursement(**self.serialize(row))

