webscraper/data cleaning/sql/data manipulation/visualization projects

riversxiao/datamining

Data Mining Project


Talk is cheap, Show me the data

Synopsis


source(deepdotweb)

I currently work at an online payment company, handling risky transactions. As a risk analyst, I find it important to know the story behind the data so that I am not blinded by the data itself. But there are too many stories to know in a big company like mine, so it is equally important to sharpen my data analysis skills.

To sharpen my data analysis skills, I'm creating this repo to record my learning endeavour.

I divided the learning path into the following five categories:

  1. webscraper for getting data from various data sources

  2. datacleaning for preprocessing the data

  3. datastoring for importing data, executing SQL, and exporting data for further analysis

  4. datamanipulation for performing data analysis using tools like numpy and pandas in Python or dplyr in R

  5. datavisualization for presenting data to an audience, using tools including ggplot2, matplotlib, D3, and Tableau

Environment requirements


  • Python 3.5 or higher

  • The datasets used in the project are listed in the corresponding subfolders.

Project Details


  1. webscraper for getting data from various data sources
  • In this project I created several web scrapers with libraries like urllib, Beautiful Soup, Selenium, and Scrapy.

  • Detailed descriptions can be found in the subfolder: webscraper
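
As an illustration of the fetch-then-parse pattern behind a simple scraper, here is a minimal sketch using only the standard library. It is not the code from the webscraper subfolder: `html.parser` stands in for Beautiful Soup so that nothing needs to be installed, and the names `LinkCollector`, `parse_links`, `fetch`, the User-Agent string, and the sample HTML are all hypothetical.

```python
from html.parser import HTMLParser
from urllib.request import Request, urlopen


class LinkCollector(HTMLParser):
    """Collect (href, text) pairs for every <a href=...> tag in a page."""

    def __init__(self):
        super().__init__()
        self.links = []
        self._href = None   # href of the <a> tag currently open, if any
        self._text = []     # text fragments seen inside that tag

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append((self._href, "".join(self._text).strip()))
            self._href = None


def parse_links(html):
    """Return all links found in an HTML string."""
    parser = LinkCollector()
    parser.feed(html)
    parser.close()
    return parser.links


def fetch(url, timeout=10):
    """Download a page as text. The User-Agent value is a placeholder."""
    request = Request(url, headers={"User-Agent": "datamining-example/0.1"})
    with urlopen(request, timeout=timeout) as response:
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset, errors="replace")


# Hypothetical sample page, so the parser can be tried without network access.
SAMPLE_HTML = (
    '<ul><li><a href="/a">First</a></li>'
    '<li><a href="/b">Second <b>item</b></a></li></ul>'
)

if __name__ == "__main__":
    for href, text in parse_links(SAMPLE_HTML):
        print(href, text)
```

For a live page, `parse_links(fetch(url))` combines the two steps. Pages rendered by JavaScript usually need a browser-driven tool such as Selenium instead.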

  2. datacleaning for preprocessing the data
  • There is a lot of work to be done here; this is usually the most labor-intensive part of a project. To make this task easier, I wrote several data cleaning scripts.

  • Detailed descriptions can be found in the subfolder: datacleaning
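
To give a flavour of what a small cleaning script can do, the sketch below trims whitespace, normalizes column names, maps common missing-value markers to `None`, and drops exact duplicates. It is an assumption-laden example rather than one of the scripts in the datacleaning subfolder: the `MISSING` marker set, the `clean_rows` helper, and the sample CSV are invented for illustration.

```python
import csv
import io

# Assumed set of strings that should be treated as "no value".
MISSING = {"", "na", "n/a", "null", "none", "-"}


def clean_rows(rows):
    """Trim whitespace, map missing-value markers to None, drop duplicates.

    Assumes every row has the same columns as the header (no extra fields).
    """
    seen = set()
    cleaned = []
    for row in rows:
        record = {}
        for key, value in row.items():
            value = (value or "").strip()
            record[key.strip().lower()] = (
                None if value.lower() in MISSING else value
            )
        # A hashable fingerprint of the cleaned row, used to detect duplicates.
        fingerprint = tuple(sorted(record.items(), key=lambda kv: kv[0]))
        if fingerprint in seen:
            continue
        seen.add(fingerprint)
        cleaned.append(record)
    return cleaned


# Hypothetical messy input: padded names, a missing amount, a repeated row.
RAW_CSV = """Name , Amount
 alice ,10
bob,N/A
 alice ,10
"""

cleaned = clean_rows(csv.DictReader(io.StringIO(RAW_CSV)))

if __name__ == "__main__":
    for record in cleaned:
        print(record)
```

Values are kept as strings here; converting types (for example amounts to numbers) would be a separate step once the missing values are handled.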

  3. datastoring for importing data, executing SQL, and exporting data for further analysis
  • SQL or NoSQL storage is important for keeping data, yet this step is often skipped in data analysis tasks. This folder contains SQL scripts and tips for optimizing their performance.

  • Detailed descriptions can be found in the subfolder: datastoring
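
The import, query, export cycle described above can be sketched with Python's built-in `sqlite3` module. This is only an illustration under stated assumptions: the `transactions` table, its columns, the sample rows, and the index name are hypothetical, and an in-memory SQLite database stands in for whatever database the datastoring subfolder actually targets.

```python
import csv
import io
import sqlite3

# In-memory database, so the example leaves no files behind.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE transactions ("
    "id INTEGER PRIMARY KEY, merchant TEXT NOT NULL, amount REAL NOT NULL)"
)

# Import: parameterized inserts avoid building SQL strings by hand.
sample_rows = [("shop_a", 12.5), ("shop_b", 40.0), ("shop_a", 7.5)]
conn.executemany(
    "INSERT INTO transactions (merchant, amount) VALUES (?, ?)", sample_rows
)

# One common performance tip: index the columns used for filtering.
conn.execute(
    "CREATE INDEX idx_transactions_merchant ON transactions (merchant)"
)
conn.commit()

# Execute SQL: total amount per merchant.
totals = conn.execute(
    "SELECT merchant, SUM(amount) FROM transactions "
    "GROUP BY merchant ORDER BY merchant"
).fetchall()

# EXPLAIN QUERY PLAN reports how SQLite intends to run a query,
# which helps check whether an index is being considered.
plan = conn.execute(
    "EXPLAIN QUERY PLAN SELECT * FROM transactions WHERE merchant = ?",
    ("shop_a",),
).fetchall()

# Export: write the query result as CSV text for further analysis.
buffer = io.StringIO()
writer = csv.writer(buffer)
writer.writerow(["merchant", "total"])
writer.writerows(totals)
exported = buffer.getvalue()

if __name__ == "__main__":
    print(totals)
    print(plan)
    print(exported)
```

Replacing `":memory:"` with a file path persists the database, and writing to an opened file instead of `io.StringIO` produces a CSV file on disk.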

  4. datamanipulation for performing data analysis using tools like numpy and pandas in Python or dplyr in R

TODO:

  5. datavisualization for presenting data to an audience, using tools including ggplot2, matplotlib, D3, and Tableau
  • In this subfolder, I use libraries like matplotlib, ggplot2, seaborn, and plotly, together with Python, R, D3, and Tableau, to create various data visualizations for the final report.

TODO:

Further Notice


This is an ongoing project that needs further refinement and maintenance. I'll keep updating the content.
