Get a model build process going for multiple models in the database #12

Closed · lauralorenz opened this issue Aug 8, 2016 · 12 comments

@lauralorenz commented Aug 8, 2016

Right now we have a management command manage.py train that supports building a model against a corpus from disk and saving it to the database using the arbiter Estimator and Score models. With this issue, we want to be able to create models against documents in the database, and support the ability for multiple models to use the same documents and know which documents they used.

I think the intended implementation strategy for this is to

  • build a reader like/subclassed from/expanding TranscriptCorpusReader that can pull a corpus from the database via a queryset, as opposed to from disk (a rough sketch is included below)
  • provide a way for the model management command to ingest a queryset specification and utilize it when constructing the corpus
  • add necessary features to the build process and attributes to the Estimator model to be able to track an Estimator's dependent documents
    • may want to consider what happens if the documents change underneath the Estimator instance; do we version the documents or each Estimator's input data? Do we care about reproducibility of each Estimator at this granularity?

This issue is closed when a model build process can be run against the stored documents in the database and track which documents were used for each model.
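
A minimal sketch of the queryset-backed reader idea, assuming a Document model with a text field; the QuerySetCorpusReader name and field names are illustrative placeholders, not the project's actual API:

from corpus.models import Document


class QuerySetCorpusReader(object):
    """Reads a corpus out of the database rather than from disk."""

    def __init__(self, queryset=None):
        # Default to the full document table; callers can pass any
        # restricted queryset (by user, label, date, etc.) instead.
        self.queryset = queryset if queryset is not None else Document.objects.all()

    def documents(self):
        # .iterator() streams rows from the database instead of
        # caching the whole queryset in memory.
        for doc in self.queryset.iterator():
            yield doc.text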

@bbengfort (Member) commented Aug 8, 2016

My proposal is as follows:

  • Corpus model to which estimators have a foreign key
  • ManyToMany relationship between Corpus and Document models
  • QueryCorpusReader (accepts either a Query or a Corpus object)
  • Ensure the Corpus Loader works with the QueryCorpusReader (generators being the main concern).

These tasks seem to meet the specification of the requirement; a rough sketch of the models follows.

This will work for now, so long as the corpora are small; there is no memory issue for reads (it's streaming) but too many database queries can slow down performance.
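
A rough sketch of that schema in Django; the app label, related names, and the Estimator fields here are assumptions rather than the project's actual definitions:

from django.conf import settings
from django.db import models


class Corpus(models.Model):
    """A snapshot of documents used to build one or more estimators."""

    user = models.ForeignKey(settings.AUTH_USER_MODEL, null=True, blank=True,
                             on_delete=models.CASCADE)
    documents = models.ManyToManyField('corpus.Document', related_name='corpora')
    created = models.DateTimeField(auto_now_add=True)


class Estimator(models.Model):
    """A trained model that records exactly which corpus it was built from."""

    corpus = models.ForeignKey('corpus.Corpus', related_name='estimators',
                               on_delete=models.CASCADE)
    # ... existing estimator fields (pickled model, scores, build time, etc.)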

@bbengfort (Member)

@lauralorenz -- will this issue block you in any way? I probably won't be able to get to it until next week ...

@lauralorenz (Author)

Nope, it shouldn't block me; I can use the disk-read models. This one blocks #15 (which is more cosmetic anyway), and otherwise I think this milestone can be worked on without this.


@bbengfort (Member)

Figuring out labels according to votes shouldn't be tough ...

select d.id, d.title, l.name, count(l.id) from annotations a 
        join documents d on a.document_id = d.id
        join labels l on a.label_id = l.id
    group by d.title, l.name, d.id;

But how to do this in Django?
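
One way to express that grouping in the ORM, assuming models named Annotation, Document, and Label with the obvious foreign keys (the names are guesses at the schema, not confirmed):

from django.db.models import Count

from corpus.models import Annotation

# Group annotations by (document, label) and count the votes for each
# pair; each result row is a dict with the document id/title, the label
# name, and the vote count.
votes = (
    Annotation.objects
    .values('document__id', 'document__title', 'label__name')
    .annotate(votes=Count('label'))
)

# The winning label for a single document would then be something like:
# votes.filter(document__id=doc_id).order_by('-votes').first()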

@bbengfort (Member)

Ok, models can now be built as follows:

from corpus.models import *
from corpus.reader import *
from corpus.learn import *
from django.contrib.auth.models import User
from sklearn.linear_model import LogisticRegression as model

# Create a corpus for a specific user (database operation)
user = User.objects.get(username='bbengfort') 
corpus = Corpus.objects.create(user=user)  # create() takes keyword arguments

# Instantiate a query corpus reader from the corpus model object 
# As well as the loader that can use that reader 
reader = CorpusModelReader(corpus)
loader = CorpusLoader(reader, folds=2)

# Build the model 
(clf, scores), total_time = build_model(loader, model) 

So we're now building models from the documents that are in the database. Note that we still need something to save the built estimator into the database, and I haven't hooked this up in a view because it takes minutes to run; but I think this issue is complete.
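
Something along these lines could persist the build into the arbiter models mentioned above; the field names on Estimator and Score and the shape of scores are assumptions, not the real schema:

import pickle

from arbiter.models import Estimator, Score

# Serialize the fitted classifier and tie it to the corpus it was
# trained on, so we always know which documents each model used.
estimator = Estimator.objects.create(
    owner=user,
    corpus=corpus,
    estimator=pickle.dumps(clf),   # assumes a binary/blob column
    build_time=total_time,
)

# Record per-fold cross-validation results, assuming scores maps a
# metric name to a list of fold values.
for metric, values in scores.items():
    for value in values:
        Score.objects.create(estimator=estimator, metric=metric, score=value)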

@bbengfort (Member)

@lauralorenz discuss then move to done?

@lauralorenz (Author)

@bbengfort Sure. Am I missing something, or where is the branch at?

@bbengfort (Member)

I merged my branch into develop

@bbengfort (Member)

@lauralorenz -- I put some inline comments on the commit.

@bbengfort (Member) commented Aug 17, 2016

At this point we need to:

  • allow creation of labeled and unlabeled corpora (excluding None labels for labeled)
  • expand the M2M relationship between corpus and documents to hardcode the label (a sketch of the through model follows this list)
  • save the corpus with the estimator during model build
  • expand the management command to build a model for a user, for the debates, or for the entire corpus
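
A sketch of what hardcoding the label on the M2M could look like, using an explicit through model; the LabeledDocument name and its fields are illustrative:

from django.db import models


class Corpus(models.Model):
    documents = models.ManyToManyField('corpus.Document',
                                       through='corpus.LabeledDocument',
                                       related_name='corpora')


class LabeledDocument(models.Model):
    """Through table that freezes the label a document carried when the
    corpus was created, so later re-annotation cannot change the training
    data out from under an estimator."""

    corpus = models.ForeignKey('corpus.Corpus', on_delete=models.CASCADE)
    document = models.ForeignKey('corpus.Document', on_delete=models.CASCADE)
    label = models.ForeignKey('corpus.Label', null=True, blank=True,
                              on_delete=models.SET_NULL)

    class Meta:
        unique_together = ('corpus', 'document')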

In order to do a build in the view (e.g. the user clicks a button) we'd need Celery, and ideally have #13 in place so that we could report progress (and have a place for that button).

So I'd suggest that after expanding the Django management command, we simply create a new issue for that and call this one good?

@lauralorenz (Author)

Yeah, I agree with all of that. Let's punt on the view/Celery version of model builds for now.

@bbengfort (Member)

@lauralorenz -- ok just pushed the release with this. Things should be working but more testing is required. I'll move this to done for right now; let me know if you have any trouble with the CLI.
