Get a model build process going for multiple models in the database #12

Closed · lauralorenz opened this issue Aug 8, 2016 · 12 comments

@lauralorenz commented Aug 8, 2016

Right now we have a management command manage.py train that supports building a model against a corpus from disk and saving it to the database using the arbiter Estimator and Score models. With this issue, we want to be able to create models against documents in the database, and support the ability for multiple models to use the same documents and know which documents they used.

I think the intended implementation strategy for this is to

  • build a reader like/subclassed from/expanding TranscriptCorpusReader that can pull a corpus from the database via a queryset, as opposed to from disk (a rough sketch is included below)
  • provide a way for the model management command to ingest a queryset specification and utilize it when constructing the corpus
  • add necessary features to the build process and attributes to the Estimator model to be able to track an Estimator's dependent documents
    • may want to consider what happens if the documents change underneath the Estimator instance; do we version the documents or each Estimator's input data? Do we care about reproducibility of each Estimator at this granularity?

This issue is closed when a model build process can be run against the stored documents in the database and track which documents were used for each model.
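
A minimal sketch of the queryset-backed reader idea, assuming a Document model with a text field; the QuerySetCorpusReader name and field names are illustrative placeholders, not the project's actual API:

from corpus.models import Document


class QuerySetCorpusReader(object):
    """Reads a corpus out of the database rather than from disk."""

    def __init__(self, queryset=None):
        # Default to the full document table; callers can pass any
        # restricted queryset (by user, label, date, etc.) instead.
        self.queryset = queryset if queryset is not None else Document.objects.all()

    def documents(self):
        # .iterator() streams rows from the database instead of
        # caching the whole queryset in memory.
        for doc in self.queryset.iterator():
            yield doc.text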

@bbengfort (Member) commented Aug 8, 2016

My proposal is as follows:

  • Corpus model to which estimators have a foreign key
  • ManyToMany relationship between Corpus and Document models
  • QueryCorpusReader (accepts either a Query or a Corpus object)
  • Ensure the Corpus Loader works with the QueryCorpusReader (generators being the main concern).

These tasks seem to meet the specification of the requirement; a rough sketch of the models follows.

This will work for now, so long as the corpora are small; there is no memory issue for reads (it's streaming) but too many database queries can slow down performance.
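
A rough sketch of that schema in Django; the app label, related names, and the Estimator fields here are assumptions rather than the project's actual definitions:

from django.conf import settings
from django.db import models


class Corpus(models.Model):
    """A snapshot of documents used to build one or more estimators."""

    user = models.ForeignKey(settings.AUTH_USER_MODEL, null=True, blank=True,
                             on_delete=models.CASCADE)
    documents = models.ManyToManyField('corpus.Document', related_name='corpora')
    created = models.DateTimeField(auto_now_add=True)


class Estimator(models.Model):
    """A trained model that records exactly which corpus it was built from."""

    corpus = models.ForeignKey('corpus.Corpus', related_name='estimators',
                               on_delete=models.CASCADE)
    # ... existing estimator fields (pickled model, scores, build time, etc.)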

@bbengfort (Member)

@lauralorenz -- will this issue block you in any way? I probably won't be able to get to it until next week ...

@lauralorenz (Author)

Nope, it shouldn't block me; I can use the disk-read models. This one blocks #15 (which is more cosmetic anyway), and otherwise I think this milestone can be worked on without this.


@bbengfort (Member)

Figuring out labels according to votes shouldn't be tough ...

select d.id, d.title, l.name, count(l.id) from annotations a 
        join documents d on a.document_id = d.id
        join labels l on a.label_id = l.id
    group by d.title, l.name, d.id;

But how to do this in Django?
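
One way to express that grouping in the ORM, assuming models named Annotation, Document, and Label with the obvious foreign keys (the names are guesses at the schema, not confirmed):

from django.db.models import Count

from corpus.models import Annotation

# Group annotations by (document, label) and count the votes for each
# pair; each result row is a dict with the document id/title, the label
# name, and the vote count.
votes = (
    Annotation.objects
    .values('document__id', 'document__title', 'label__name')
    .annotate(votes=Count('label'))
)

# The winning label for a single document would then be something like:
# votes.filter(document__id=doc_id).order_by('-votes').first()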

@bbengfort (Member)

Ok, models can now be built as follows:

from corpus.models import *
from corpus.reader import *
from corpus.learn import *
from django.contrib.auth.models import User
from sklearn.linear_model import LogisticRegression as model

# Create a corpus for a specific user (database operation)
user = User.objects.get(username='bbengfort') 
corpus = Corpus.objects.create(user=user)  # create() takes keyword arguments

# Instantiate a query corpus reader from the corpus model object 
# As well as the loader that can use that reader 
reader = CorpusModelReader(corpus)
loader = CorpusLoader(reader, folds=2)

# Build the model 
(clf, scores), total_time = build_model(loader, model) 

So we're now building models from the documents that are in the database. Note that we still need something to save the built estimator into the database, and I haven't hooked this up in a view because it takes minutes to run; but I think this issue is complete.
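
Something along these lines could persist the build into the arbiter models mentioned above; the field names on Estimator and Score and the shape of scores are assumptions, not the real schema:

import pickle

from arbiter.models import Estimator, Score

# Serialize the fitted classifier and tie it to the corpus it was
# trained on, so we always know which documents each model used.
estimator = Estimator.objects.create(
    owner=user,
    corpus=corpus,
    estimator=pickle.dumps(clf),   # assumes a binary/blob column
    build_time=total_time,
)

# Record per-fold cross-validation results, assuming scores maps a
# metric name to a list of fold values.
for metric, values in scores.items():
    for value in values:
        Score.objects.create(estimator=estimator, metric=metric, score=value)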

@bbengfort (Member)

@lauralorenz discuss then move to done?

@lauralorenz (Author)

@bbengfort Sure. Am I missing something, or where is the branch at?

@bbengfort (Member)

I merged my branch into develop

@bbengfort (Member)

@lauralorenz -- I put some inline comments on the commit.

@bbengfort (Member) commented Aug 17, 2016

At this point we need to:

  • allow creation of labeled and unlabeled corpora (excluding None labels for labeled)
  • expand the M2M relationship between corpus and documents to hardcode the label (a sketch of the through model follows this list)
  • save the corpus with the estimator during model build
  • expand the management command to build a model for a user, for the debates, or for the entire corpus
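
A sketch of what hardcoding the label on the M2M could look like, using an explicit through model; the LabeledDocument name and its fields are illustrative:

from django.db import models


class Corpus(models.Model):
    documents = models.ManyToManyField('corpus.Document',
                                       through='corpus.LabeledDocument',
                                       related_name='corpora')


class LabeledDocument(models.Model):
    """Through table that freezes the label a document carried when the
    corpus was created, so later re-annotation cannot change the training
    data out from under an estimator."""

    corpus = models.ForeignKey('corpus.Corpus', on_delete=models.CASCADE)
    document = models.ForeignKey('corpus.Document', on_delete=models.CASCADE)
    label = models.ForeignKey('corpus.Label', null=True, blank=True,
                              on_delete=models.SET_NULL)

    class Meta:
        unique_together = ('corpus', 'document')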

In order to do a build in the view (e.g. the user clicks a button) we'd need Celery, and ideally have #13 in place so that we could report progress (and have a place for that button).

So I'd suggest that after expanding the Django management command, we simply create a new issue for that and call this one good?

@lauralorenz (Author)

Yeah, I agree with all of that. Let's punt on the view/Celery version of model builds for now.

@bbengfort (Member)

@lauralorenz -- ok just pushed the release with this. Things should be working but more testing is required. I'll move this to done for right now; let me know if you have any trouble with the CLI.
