
Commit

Merge pull request #72 from datasciencebr/cuducos-load-specific-files
Make commands load datasets from specific files, not according to dates from the settings
Irio authored Dec 19, 2016
2 parents f20f3e9 + c7de2f1 commit 802c580
Showing 15 changed files with 239 additions and 213 deletions.
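In short, commands that used to build dataset file names from `AMAZON_S3_*_DATE` settings now receive an explicit path (a before/after sketch based on the diffs below):

```console
# before: file name resolved from dates in the settings
$ python manage.py reimbursements
# after: dataset path passed explicitly
$ python manage.py reimbursements /tmp/serenata-data/reimbursements.xz
```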
4 changes: 3 additions & 1 deletion Makefile
@@ -8,6 +8,8 @@ collectstatic: run.jarbas

seed: run.jarbas
docker-compose run --rm jarbas python manage.py loaddatasets
docker-compose run --rm jarbas python manage.py reimbursements /tmp/serenata-data/reimbursements.xz
docker-compose run --rm jarbas python manage.py companies /tmp/serenata-data/2016-09-03-companies.xz
docker-compose run --rm jarbas python manage.py irregularities /tmp/serenata-data/irregularities.xz

run.devel: collectstatic
45 changes: 20 additions & 25 deletions README.md
@@ -132,42 +132,45 @@ If you have [Docker](https://docs.docker.com/engine/installation/) (with [Docker

```console
$ docker-compose up -d --build
$ docker-compose run --rm jarbas python manage.py migrate
$ docker-compose run --rm jarbas python manage.py ceapdatasets
```


You can access it at [`localhost:80`](http://localhost:80/). However, your database starts empty and you still have to collect the static files and load the data:

```console
$ docker-compose run --rm jarbas python manage.py collectstatic --no-input
$ docker-compose run --rm jarbas python manage.py loaddatasets
$ docker-compose run --rm jarbas python manage.py reimbursements <path to reimbursements.xz>
$ docker-compose run --rm jarbas python manage.py irregularities <path to irregularities.xz file>
$ docker-compose run --rm jarbas python manage.py companies <path to companies.xz>
```

You can get the datasets by running [Rosie](https://github.com/datasciencebr/rosie) or directly with the [toolbox](https://github.com/datasciencebr/serenata-toolbox).
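If you prefer to download the files directly, they follow the URL pattern the loading commands use (a sketch; the date prefix depends on the dataset version, as in `2016-12-06-reimbursements.xz`):

```console
$ curl -O https://s3-sa-east-1.amazonaws.com/serenata-de-amor-data/2016-12-06-reimbursements.xz
```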

Also there are some clever shortcuts in the `Makefile` if you like them.

### Local install

#### Requirements

Jarbas requires [Python 3.5](http://python.org), [Node.js 6](http://nodejs.org), and [PostgreSQL 9.4+](https://www.postgresql.org).

Once you have `pip` and `npm` available, install the dependencies:

```console
npm install
python -m pip install -r requirements.txt
```

##### Python's `lzma` module

In some Linux distros `lzma` is not installed by default. You can check whether you have it with `$ python -m lzma`. On Debian-based systems you can fix that with `$ apt-get install liblzma-dev`, and on macOS with `$ brew install xz`, but you might have to re-compile your Python (e.g. with `$ brew upgrade --cleanup python`).
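For convenience, the same check and fixes as one-liners (package names may vary by distro):

```console
$ python -m lzma                   # exits silently when lzma is available
$ apt-get install liblzma-dev      # Debian-based systems
$ brew install xz                  # macOS; may require re-compiling Python
```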


#### Settings

Copy `contrib/.env.sample` as `.env` in the project's root folder and adjust your settings. These are the main variables:
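For example, from the project root:

```console
$ cp contrib/.env.sample .env
```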

##### Django settings

@@ -187,8 +190,6 @@ Copy `contrib/.env.sample` as `.env` in the project's root folder and adjust your settings
* `AMAZON_S3_BUCKET` (_str_) Name of the Amazon S3 bucket to look for datasets (e.g. `serenata-de-amor-data`)
* `AMAZON_S3_REGION` (_str_) Region of the Amazon S3 (e.g. `s3-sa-east-1`)
* `AMAZON_S3_DATASET_DATE` (_str_) Datasets file name prefix of CEAP datasets from Serenata de Amor (e.g. `2016-08-08` for `2016-08-08-current-year.xz`)
* `AMAZON_S3_REIMBURSEMENTS_DATE` (_str_) Reimbursements dataset file name date prefix (e.g. `2016-12-06` for `2016-12-06-reimbursements.xz`)
* `AMAZON_S3_COMPANIES_DATE` (_str_) Suppliers (companies) datasets file name date prefix (e.g. `2016-08-08` for `2016-08-08-companies.xz`)
* `AMAZON_S3_CEAPTRANSLATION_DATE` (_str_) File name prefix for dataset guide (e.g. `2016-08-08` for `2016-08-08-ceap-datasets.md`)
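For instance, with the sample values above the `loaddatasets` command composes file names and URLs such as:

```
2016-08-08-current-year.xz
https://s3-sa-east-1.amazonaws.com/serenata-de-amor-data/2016-08-08-current-year.xz
```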

##### Google settings
@@ -210,19 +211,13 @@ Now you can load the data from our datasets and get some other data as static files:

```
$ python manage.py loaddatasets
$ python manage.py reimbursements <path to reimbursements.xz>
$ python manage.py irregularities <path to irregularities.xz file>
$ python manage.py companies <path to companies.xz>
$ python manage.py ceapdatasets
```

Use `python manage.py loaddatasets --help` to check options for limiting the number of documents loaded from the datasets.
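For example, to load from a local copy of the data directory and reset existing records first (using the `--source` and `--drop-all` flags defined by the load commands; the path below is the one used in the `Makefile`):

```console
$ python manage.py loaddatasets --source /tmp/serenata-data --drop-all
```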

You can get the datasets by running [Rosie](https://github.com/datasciencebr/rosie) or directly with the [toolbox](https://github.com/datasciencebr/serenata-toolbox).

#### Generate static files

Expand Down
2 changes: 0 additions & 2 deletions contrib/.env.sample
@@ -9,7 +9,5 @@ AMAZON_S3_REGION=s3-sa-east-1

AMAZON_S3_CEAPTRANSLATION_DATE=2016-08-08
AMAZON_S3_DATASET_DATE=2016-08-08
AMAZON_S3_REIMBURSEMENTS_DATE=2016-12-06
AMAZON_S3_SUPPLIERS_DATE=2016-09-03

GOOGLE_STREET_VIEW_API_KEY=my-google-places-api-key
118 changes: 65 additions & 53 deletions jarbas/core/management/commands/__init__.py
@@ -10,53 +10,13 @@

class LoadCommand(BaseCommand):

def add_arguments(self, parser, add_drop_all=True):
parser.add_argument('dataset', help='Path to the .xz dataset')
if add_drop_all:
parser.add_argument(
'--drop-all', '-d', dest='drop', action='store_true',
help='Drop all existing records before loading the datasets'
)

@staticmethod
def to_number(value, cast=None):
@@ -91,12 +51,6 @@ def to_date(text):
except ValueError:
return None

def drop_all(self, model):
if model.objects.count() != 0:
msg = 'Deleting all existing records from {} model'
@@ -115,3 +69,61 @@ def print_count(self, model, **kwargs):
@staticmethod
def get_model_name(model):
return model._meta.label.split('.')[-1]


class OldLoadCommand(LoadCommand):

def add_arguments(self, parser):
parser.add_argument(
'--drop-all', '-d', dest='drop', action='store_true',
help='Drop all existing records before loading the datasets'
)
parser.add_argument(
'--source', '-s', dest='source', default=None,
help='Data directory of Serenata de Amor (dataset source)'
)
parser.add_argument(
'--dataset-version', dest='dataset_version', default=None,
            help='Dataset file version (usually a YYYY-MM-DD date)'
)

def get_dataset(self, name):
if self.source:
return self.load_local(self.source, name)
return self.load_remote(name)

def load_remote(self, name):
"""Load a document from Amazon S3"""
url = self.get_url(name)
print("Loading " + url)
with NamedTemporaryFile(delete=False) as tmp:
urlretrieve(url, filename=tmp.name)
return tmp.name

def load_local(self, source, name):
"""Load documents from local source"""
path = self.get_path(source, name)

if not os.path.exists(path):
print(path + " not found")
return None

print("Loading " + path)
return path

def get_url(self, suffix):
return 'https://{region}.amazonaws.com/{bucket}/{file_name}'.format(
region=settings.AMAZON_S3_REGION,
bucket=settings.AMAZON_S3_BUCKET,
file_name=self.get_file_name(suffix)
)

def get_path(self, source, name):
return os.path.join(source, self.get_file_name(name))

def get_file_name(self, name):
if not self.date:
settings_name = 'AMAZON_S3_{}_DATE'.format(name.upper())
self.date = getattr(settings, settings_name)
return '{date}-{name}.xz'.format(date=self.date, name=name)
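For reference, a hypothetical minimal subclass (not part of this commit) illustrating the new contract, next to the settings-driven lookup kept in `OldLoadCommand`:

```python
# Hypothetical example: a new-style command takes the dataset path as a
# positional argument instead of resolving file names from settings dates.
from jarbas.core.management.commands import LoadCommand


class Command(LoadCommand):
    help = 'Example: load a dataset from an explicit .xz path'

    def handle(self, *args, **options):
        self.path = options['dataset']  # added by LoadCommand.add_arguments
        if options.get('drop', False):
            # a real command would call self.drop_all(SomeModel) here
            pass
        print('Loading ' + self.path)
```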

jarbas/core/management/commands/loadsuppliers.py
@@ -12,27 +12,25 @@ class Command(LoadCommand):
help = 'Load Serenata de Amor supplier dataset into the database'

def handle(self, *args, **options):
self.path = options['dataset']
self.count = self.print_count(Supplier)
print('Starting with {:,} suppliers'.format(self.count))

if options.get('drop', False):
self.drop_all(Supplier)
self.drop_all(Activity)
self.count = 0

self.save_suppliers(self.get_dataset('companies'))
self.save_suppliers()

def save_suppliers(self, dataset):
def save_suppliers(self):
"""
        Create a Supplier object for each row of the dataset file at
        self.path, creating the related activity when needed.
        """
        skip = ('main_activity', 'secondary_activity')
        keys = [f.name for f in Supplier._meta.fields if f.name not in skip]
with lzma.open(self.path, mode='rt') as file_handler:
for row in csv.DictReader(file_handler):
main, secondary = self.save_activities(row)

8 changes: 2 additions & 6 deletions jarbas/core/management/commands/irregularities.py
@@ -12,14 +12,10 @@ class Command(LoadCommand):
filter_keys = ('applicant_id', 'document_id', 'year')

def add_arguments(self, parser):
super().add_arguments(parser, add_drop_all=False)

def handle(self, *args, **options):
self.path = options['dataset']
if not os.path.exists(self.path):
raise FileNotFoundError(os.path.abspath(self.path))

4 changes: 2 additions & 2 deletions jarbas/core/management/commands/loaddatasets.py
@@ -5,11 +5,11 @@

from django.conf import settings

from jarbas.core.management.commands import OldLoadCommand
from jarbas.core.models import Document


class Command(OldLoadCommand):
help = 'Load Serenata de Amor datasets into the database'
suffixes = ('current-year', 'last-year', 'previous-years')

6 changes: 2 additions & 4 deletions jarbas/core/management/commands/reimbursements.py
@@ -16,8 +16,7 @@ def add_arguments(self, parser)
)

def handle(self, *args, **options):
self.path = options['dataset']
self.count = Reimbursement.objects.count()
print('Starting with {:,} reimbursements'.format(self.count))

@@ -31,8 +30,7 @@ def handle(self, *args, **options):
@property
def reimbursements(self):
"""Returns a Generator with a Reimbursement object for each row."""
with lzma.open(self.path, mode='rt') as file_handler:
for row in csv.DictReader(file_handler):
yield Reimbursement(**self.serialize(row))

