2011 census microdata play #68

Open: wants to merge 41 commits into base: develop

Conversation

edwardchalstrey1
Collaborator

As part of the Synthetic Data and Privacy Preservation - Turing/ONS partnership project 3, we're trying out the QUIPP pipeline on this dataset.

Note: we may never need to merge this. I'm just putting it up so @ots22 can easily pull the branch.

@ots22 I've attempted to modify the existing examples to run the different synth-method choices with stock parameters, changing only the parts that refer to column names. Example 4, the SGF one, ran without any errors (I've set this one to enabled: true). If you pull the branch and set enabled: true for any of the others, you should hopefully get the errors I got for those.

On the SGF one, it seems to have generated a synthetic dataset! However, the 2nd column has no values (possibly because I wrongly chose the categorical type for that column in the dataset JSON here; not sure).

Also, I created issue #67 for the error I got on the CTGAN one, as I noticed the same error when I tried to run the existing CTGAN example from run-inputs.

@ots22
Member

ots22 commented Jul 15, 2021

From our discussion in-person just now:

  • we're planning to drop CTGAN for now
  • we fixed a few errors in the synthpop parameters, and now a 'bootstrap' synthesis works
  • the classifiers run for a long time (still to be investigated)

@gmingas
Contributor

gmingas commented Jul 15, 2021

I think the classifiers run for a long time when no specific classifier with specific hyperparameters is passed in the run-inputs file. In that case, a number of classifiers are tested, each with many combinations of hyperparameters. I recommend using something like this to reduce the run time: it uses only logistic regression with defined params.
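
To illustrate why pinning a single classifier helps, here is a rough sketch that counts model fits for a multi-classifier hyperparameter search versus one classifier with fixed parameters. The search space, the parameter values, and the 5-fold cross-validation are all hypothetical assumptions for illustration; they are not QUIPP's actual defaults, and `count_fits` is not part of the pipeline.

```python
from itertools import product

# Hypothetical search space, for illustration only (NOT QUIPP's real defaults).
SEARCH_SPACE = {
    "LogisticRegression": {"C": [0.01, 0.1, 1.0, 10.0], "max_iter": [100, 1000]},
    "RandomForest": {"n_estimators": [10, 100, 500], "max_depth": [5, 10, None]},
    "SVC": {"C": [0.1, 1.0, 10.0], "kernel": ["linear", "rbf"], "gamma": ["scale", "auto"]},
}

# One classifier with fixed hyperparameters: a single combination.
FIXED = {"LogisticRegression": {"C": [1.0], "max_iter": [1000]}}


def count_fits(space, cv_folds=5):
    """Count model fits needed to try every combination with k-fold CV.

    cv_folds=5 is an assumed value; the real pipeline may validate differently.
    """
    total = 0
    for grid in space.values():
        # Every combination of the listed hyperparameter values for this classifier.
        combos = list(product(*grid.values()))
        total += len(combos) * cv_folds
    return total


full_search_fits = count_fits(SEARCH_SPACE)
fixed_fits = count_fits(FIXED)
print(f"full search: {full_search_fits} fits, fixed classifier: {fixed_fits} fits")
```

Under these assumed grids the full search needs 29 times as many fits as the fixed logistic regression, and each fit on a census-sized dataset can itself be slow.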

ots22 marked this pull request as ready for review on September 2, 2021, 17:17