notes on stuff I should add #2

Open
ianozsvald opened this issue Oct 19, 2015 · 2 comments

Comments

@ianozsvald
Owner

  • if a datastore lacks uniqueness constraints then duplicates will occur
  • how to create setup.py
  • Hypothesis can fuzz MySQL to check that data round-trips unchanged (what goes in is what comes back out)
  • assume during data ingestion that you'll have duplication/redundancy; how do you spot and remove it?
  • starting point for data ingestion: treat it as a sequence of processes that build on each other, not a single process that does every step at once. This way you can swap steps in and out, test each in isolation and scale out to more machines
  • list some text similarity metrics (fuzzywuzzy, Levenshtein); note whether the similarity is char-based, word-based or char n-gram based; maybe removing punctuation and normalising case/unicode first is useful?
  • pandas read_csv defaults to dayfirst=False, so an ambiguous date like 01/02/2015 is read as 2 January; consider dayfirst=True for poorly specified European (day-first) dates
  • consider linking to http://datapatterns.org/pattern/
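
A minimal sketch of the first point and of the "how to spot and remove" question. It uses the standard-library sqlite3 module as a stand-in for MySQL, and the table names (`loose`, `strict`) and the `email` column are made up for illustration. `INSERT OR IGNORE` is SQLite syntax; MySQL spells the equivalent `INSERT IGNORE`.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Hypothetical tables: one with no constraint, one with a UNIQUE constraint.
conn.execute("CREATE TABLE loose (email TEXT)")
conn.execute("CREATE TABLE strict (email TEXT UNIQUE)")

incoming = ["a@example.com", "b@example.com", "a@example.com"]
rows = [(email,) for email in incoming]

# Without a constraint every row is accepted, including the repeat.
conn.executemany("INSERT INTO loose (email) VALUES (?)", rows)

# With a constraint, INSERT OR IGNORE skips rows that would collide.
conn.executemany("INSERT OR IGNORE INTO strict (email) VALUES (?)", rows)

# Spot duplicates after the fact by grouping on the supposed key.
duplicates = conn.execute(
    "SELECT email, COUNT(*) FROM loose GROUP BY email HAVING COUNT(*) > 1"
).fetchall()

loose_count = conn.execute("SELECT COUNT(*) FROM loose").fetchone()[0]
strict_count = conn.execute("SELECT COUNT(*) FROM strict").fetchone()[0]
print(loose_count, strict_count, duplicates)
```

The `GROUP BY ... HAVING COUNT(*) > 1` query is the "spot" step; adding the constraint is what stops the problem recurring.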
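
A rough sketch of the text-similarity point, using only the standard library rather than fuzzywuzzy. The function names (`normalise`, `levenshtein`, `char_ngrams`, `jaccard`, `word_similarity`) are illustrative choices, not any library's API. It shows normalisation first, then a char-based edit distance, a word-based overlap and a char n-gram overlap.

```python
import string
import unicodedata


def normalise(text):
    """Strip accents and punctuation, lower-case, collapse whitespace."""
    decomposed = unicodedata.normalize("NFKD", text)
    no_accents = "".join(c for c in decomposed if not unicodedata.combining(c))
    no_punct = no_accents.translate(str.maketrans("", "", string.punctuation))
    return " ".join(no_punct.lower().split())


def levenshtein(a, b):
    """Character-level edit distance; insert, delete and substitute cost 1."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,               # delete ca
                current[j - 1] + 1,            # insert cb
                previous[j - 1] + (ca != cb),  # substitute (free if equal)
            ))
        previous = current
    return previous[-1]


def char_ngrams(text, n=3):
    """Set of overlapping character n-grams."""
    return {text[i:i + n] for i in range(len(text) - n + 1)}


def jaccard(a, b):
    """Overlap of two sets, from 0.0 (disjoint) to 1.0 (identical)."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def word_similarity(a, b):
    """Word-based Jaccard similarity; ignores word order."""
    return jaccard(set(a.split()), set(b.split()))


left, right = normalise("Data, Science!"), normalise("science data")
print(levenshtein(left, right))
print(word_similarity(left, right))
print(jaccard(char_ngrams(left), char_ngrams(right)))
```

The three measures disagree on reordered text: the word-based score treats "data science" and "science data" as identical, while the char-based measures do not, which is one reason to decide up front which granularity suits the data.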

learning strategies

  • more clean data (probably) beats smarter algorithms

clustering for EDA

cleaning

process

  • list project types that might work and why; @springcoil talks about the need to invest in tooling to deliver working systems
  • r&d != engineering
  • how might r&d (e.g. 1 person) interface with an eng team?
  • which bits of an agile process seem to work well? do sprints work well (depends on the task-type)?
  • who 'owns' the data/process, can that cause problems?
  • does the lack of a shared language hinder things?
  • data scientists need clean data, but the system will probably always have some dirty data, so there is a need for a data-cleaning process (a data eng team?) that improves data quality to an agreed schema and can export/transform the data for use by the r&d team
  • building mini-monolithic blocks is normal; remember to break them up into smaller services that can be tested, otherwise critical testing is easily skipped (which costs development speed later)
  • add logging early for anything production-like
  • luigi for task pipelines to avoid manual steps
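
A small sketch of the last three points together: small stages instead of one block, logging from the start, and no manual steps between stages. It deliberately uses plain generators from the standard library rather than luigi, and the stage names (`parse`, `drop_invalid`, `dedupe`) and the comma-separated input format are assumptions made up for illustration.

```python
import logging

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("pipeline")


def parse(lines):
    """Stage 1: split raw 'name,value' lines into (name, value) string pairs."""
    for line in lines:
        name, _, value = line.partition(",")
        yield name.strip(), value.strip()


def drop_invalid(records):
    """Stage 2: keep records whose value is an integer; log the rest."""
    for name, value in records:
        try:
            yield name, int(value)
        except ValueError:
            log.warning("dropping record with bad value: %r", (name, value))


def dedupe(records):
    """Stage 3: pass each distinct record through once, in first-seen order."""
    seen = set()
    for record in records:
        if record in seen:
            log.info("skipping duplicate: %r", record)
            continue
        seen.add(record)
        yield record


def run(lines):
    """Chain the stages; each one can also be tested on its own."""
    return list(dedupe(drop_invalid(parse(lines))))


result = run(["a, 1", "b, x", "a, 1", "c, 3"])
print(result)
```

Because each stage only takes an iterable and yields records, a stage can be swapped out or unit-tested without running the others; a task runner such as luigi would add scheduling and dependency tracking on top of the same decomposition.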

getting hired:

  • what you need to show if you want to get hired (github, talks)
  • minimal stuff you should do to be more visible

list of tools I'd like to see

further reading

pipeline building

tools on my radar

review:

@springcoil

I'm @springcoil on this

@ianozsvald
Owner Author

Ooops, fixed, cheers :-) /cc @springcoil
