notes on stuff I should add #2

ianozsvald · 2015-10-19T13:56:22Z

if you lack constraints on datastores then duplicates will occur
how to create setup.py
hypothesis can fuzz mysql to make sure the data going in and back out is the same
assume during data ingestion that you'll have duplications/redundancy - how to spot and remove?
starting point for data ingestion - assume this is a sequence of processes that build on each other, not a single process with all the steps done at once. this way you can swap things in, test in isolation and scale to more machines
list some text similarity metrics (fuzzywuzzy, Levenshtein); note the choice between char-based, word-based and char n-gram similarity; maybe normalising punctuation/case/unicode first is useful?
pandas read_csv defaults to dayfirst=False; consider dayfirst=True for poorly specified European (day-first) dates
consider linking to http://datapatterns.org/pattern/
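
A minimal sketch of the "spot duplicates" idea from the notes above, using only the standard library rather than fuzzywuzzy or a Levenshtein package. The function names (`normalise`, `char_ngrams`, `ngram_similarity`, `likely_duplicates`) and the 0.8 threshold are assumptions made for illustration, and `difflib.SequenceMatcher` is swapped in as the char-based ratio.

```python
import string
import unicodedata
from difflib import SequenceMatcher


def normalise(text):
    """Strip accents and ASCII punctuation, casefold, collapse whitespace."""
    decomposed = unicodedata.normalize("NFKD", text)
    no_accents = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    no_punct = "".join(ch for ch in no_accents if ch not in string.punctuation)
    return " ".join(no_punct.casefold().split())


def char_ngrams(text, n=3):
    """Set of overlapping character n-grams in text."""
    return {text[i:i + n] for i in range(max(len(text) - n + 1, 1))}


def ngram_similarity(a, b, n=3):
    """Jaccard similarity of the two strings' character n-gram sets."""
    grams_a, grams_b = char_ngrams(a, n), char_ngrams(b, n)
    return len(grams_a & grams_b) / len(grams_a | grams_b)


def likely_duplicates(records, threshold=0.8):
    """Yield (a, b, score) for pairs whose normalised forms look alike."""
    cleaned = [normalise(r) for r in records]
    for i in range(len(records)):
        for j in range(i + 1, len(records)):
            score = SequenceMatcher(None, cleaned[i], cleaned[j]).ratio()
            if score >= threshold:
                yield records[i], records[j], score


names = ["Acme Ltd.", "ACME ltd", "Globex"]
pairs = list(likely_duplicates(names))
```

The pairwise loop is quadratic, so this only suits small samples; for larger data you would need some blocking or indexing step first.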

learning strategies

clustering for EDA

t-SNE in sklearn; the visualisations at https://lvdmaaten.github.io/tsne/ help you understand what to expect (stuff close in n dimensions should be close in 2D)

cleaning

glueviz should be noted for EDA (and qgrid?) and https://pypi.python.org/pypi/pivottablejs
if during cleaning you have to deal with internationalised text (e.g. Russian "Альфа-Банк"), be aware that without tests a naive bit of processing (e.g. lowercasing and some cleaning rules in C#) might give you "?????-????", which you then blindly store in the database. this is a particular danger for mixed-programming-language transformations (C#'s .NET rules vs Python's rules), where each language does different things
example of bad encoding twice " Électricité de France "
date parsing: http://blog.scrapinghub.com/2015/11/09/parse-natural-language-dates-with-dateparser/
https://github.com/aparrish/pycorpora/blob/master/README.rst lots of nice mappings using small JSON datasets
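
A small sketch of the two encoding failures mentioned above, reproduced in Python. The helper names (`survives_round_trip`, `mis_decode`, `undo_mis_decode`) are made up for illustration, and Windows-1252 is an assumption about the wrong codec; the real culprit in any given system may differ.

```python
def survives_round_trip(text, encoding):
    """True if text can be encoded and decoded again without loss."""
    try:
        return text.encode(encoding).decode(encoding) == text
    except UnicodeError:
        return False


name = "Альфа-Банк"

# Forcing Cyrillic through an encoding that cannot represent it, with
# errors="replace", silently turns every unsupported character into "?".
lossy = name.encode("ascii", errors="replace").decode("ascii")


def mis_decode(text):
    """Encode as UTF-8, then wrongly decode the bytes as Windows-1252."""
    return text.encode("utf-8").decode("cp1252")


def undo_mis_decode(text):
    """Reverse one round of mis_decode."""
    return text.encode("cp1252").decode("utf-8")


original = "Électricité de France"
once = mis_decode(original)    # one round of bad decoding
twice = mis_decode(once)       # the same mistake applied again
repaired = undo_mis_decode(undo_mis_decode(twice))
```

A check like `survives_round_trip` is cheap to run as a test at each boundary between systems, which is where this kind of damage tends to creep in.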

process

list project-types that might work and why; @springcoil talks about the need to invest in tooling to deliver working systems
r&d != engineering
how might r&d (e.g. 1 person) interface with an eng team?
which bits of an agile process seem to work well? do sprints work well (depends on the task-type)?
who 'owns' the data/process, can that cause problems?
does the lack of a shared language hinder things?
data scientists need clean data, but the system will probably always have some dirty data. there is a need for a data-cleaning process (owned by a data eng team?) that improves data quality to an agreed schema and exports/transforms the data so the r&d team can use it
building mini-monolithic-blocks is normal, remember to break them up into smaller services that can be tested else critical testing can easily be avoided (costing later development speed)
add logging early for anything production-like
luigi for task pipelines to avoid manual steps
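
A rough sketch of "small steps that build on each other, with logging from the start". This is plain Python standing in for a real task runner such as luigi, not luigi's API; the step names, the `run_pipeline` helper and the toy data are all assumptions for illustration.

```python
import logging

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("ingest")


def parse(rows):
    """Split raw comma-separated lines into stripped fields."""
    return [[field.strip() for field in row.split(",")] for row in rows]


def drop_incomplete(rows, n_fields=2):
    """Keep only rows with the expected number of non-empty fields."""
    return [row for row in rows if len(row) == n_fields and all(row)]


def deduplicate(rows):
    """Remove exact duplicate rows, keeping first-seen order."""
    seen = set()
    unique = []
    for row in rows:
        key = tuple(row)
        if key not in seen:
            seen.add(key)
            unique.append(row)
    return unique


def run_pipeline(data, steps):
    """Apply each step in turn, logging row counts in and out."""
    for step in steps:
        before = len(data)
        data = step(data)
        log.info("%s: %d rows in, %d rows out", step.__name__, before, len(data))
    return data


raw = ["a, 1", "a, 1", "b,", "c, 3"]
clean = run_pipeline(raw, [parse, drop_incomplete, deduplicate])
```

Because each step is an ordinary function, it can be tested in isolation or swapped out without touching the others.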

getting hired:

list of tools I'd like to see

auto-possible-euro-datetime-checker (icy.py?) for pandas when reading ambiguous datetimes
string->unit converter (e.g. for relative times like "7 minutes" and weights and measures e.g. "23cm", "1inch", "1 in.", "2000m", "2kilometres", "1 pound", "23oz.", "0.25kg")
datetime parsing http://crsmithdev.com/arrow/ (stronger parser than labix dateutil I think), https://github.com/bear/parsedatetime/ (human friendly input?), https://dateparser.readthedocs.org/en/latest/ (relative dates as input)
anonymisation http://blog.applied.ai/approaches-to-data-anonymisation/
data generators eg https://github.com/jbrambleDC/simulacram?files=1
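
A possible starting point for the first tool in the list above (the ambiguous euro-datetime checker). It is only a sketch: `guess_day_position`, the regex and the four return labels are assumptions, and it only looks at purely numeric dates written with `/`, `.` or `-` separators.

```python
import re

# Matches strings such as "13/02/2015" or "3-4-15"; ISO dates like
# "2015-10-19" do not match and are skipped.
DATE_RE = re.compile(r"^\s*(\d{1,2})[/.-](\d{1,2})[/.-](\d{2,4})\s*$")


def guess_day_position(date_strings):
    """Return 'dayfirst', 'monthfirst', 'ambiguous' or 'inconsistent'."""
    day_first_ok = month_first_ok = True
    for text in date_strings:
        match = DATE_RE.match(text)
        if match is None:
            continue  # not a numeric d/m/y-style date; ignored in this sketch
        first, second = int(match.group(1)), int(match.group(2))
        if not (1 <= first <= 31 and 1 <= second <= 12):
            day_first_ok = False    # cannot be day/month/year
        if not (1 <= first <= 12 and 1 <= second <= 31):
            month_first_ok = False  # cannot be month/day/year
    if day_first_ok and month_first_ok:
        return "ambiguous"
    if day_first_ok:
        return "dayfirst"
    if month_first_ok:
        return "monthfirst"
    return "inconsistent"
```

The result could be used to choose the `dayfirst` argument before calling pandas `read_csv`, and "ambiguous" or "inconsistent" is a prompt to go and ask whoever produced the data. Note that an input with no matching dates also comes back as "ambiguous".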
