Talk is cheap, Show me the data
source(deepdotweb)
I currently work at an online payment company, handling risky transactions. As a risk analyst, I find it important to know the story behind the data so that I am not blinded by the numbers. But there are too many stories to know in a big company like mine, so it's equally important to sharpen my data analysis skills.
To sharpen my data analysis skills, I'm creating this repo to record my learning endeavour.
I divided the learning path into the following five categories:
- webscraper: getting data from various data sources
- datacleaning: preprocessing the data
- datastoring: importing data, executing SQL and exporting data for further analysis
- datamanipulation: performing data analysis with tools such as NumPy and pandas in Python or dplyr in R
- datavisualization: presenting data to an audience with tools including ggplot2, matplotlib, D3 and Tableau
- Python 3.5 or higher
- The datasets used in the project are listed in the specific subfolders.
- webscraper: getting data from various data sources
- In this project I created several web scrapers with libraries such as urllib, Beautiful Soup, Selenium and Scrapy.
- Detailed descriptions can be found in the subfolder: webscraper
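As a rough illustration of the scraping step, the sketch below collects links from a page using only the standard library (`urllib` plus `html.parser`). It is not one of the scrapers in the subfolder: the names `LinkParser`, `extract_links` and `fetch`, the user-agent string and the sample HTML are all assumptions made for the example.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin
from urllib.request import Request, urlopen


class LinkParser(HTMLParser):
    """Collect the href of every <a> tag seen while parsing."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        for name, value in attrs:
            if name == "href" and value:
                # Resolve relative links against the page URL.
                self.links.append(urljoin(self.base_url, value))


def extract_links(html, base_url):
    """Return all absolute link targets found in an HTML string."""
    parser = LinkParser(base_url)
    parser.feed(html)
    return parser.links


def fetch(url, timeout=10):
    """Download a page and return its text (needs network access)."""
    # Hypothetical user-agent string, used only for this sketch.
    request = Request(url, headers={"User-Agent": "learning-scraper/0.1"})
    with urlopen(request, timeout=timeout) as response:
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset, errors="replace")


# Parse an inline sample so the example runs without network access.
sample = '<p><a href="/a">A</a> <a href="https://example.org/b">B</a></p>'
links = extract_links(sample, "https://example.com/index.html")
print(links)
```

For a real page, `extract_links(fetch(url), url)` would combine the two helpers. Beautiful Soup or Scrapy would likely be more convenient for anything beyond simple link extraction.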
- datacleaning: preprocessing the data
- There is a lot of work to be done here; this is usually the most labor-intensive part of the process. To make it easier, I wrote several data cleaning scripts.
- Detailed descriptions can be found in the subfolder: datacleaning
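To give a feel for what such a cleaning script might do, here is a minimal sketch using only the standard library. The column names (`id`, `amount`, `country`), the sample rows and the cleaning rules (trim whitespace, drop rows with a missing or non-numeric amount, drop duplicate ids) are assumptions chosen for illustration, not the rules used in the subfolder.

```python
import csv
import io

# Hypothetical raw data with stray whitespace, a missing value,
# a duplicate row and a non-numeric amount.
RAW = """id,amount,country
1, 10.50 ,us
2,,DE
1, 10.50 ,us
3,abc,fr
4,7,  gb
"""


def clean_rows(rows):
    """Normalize fields and drop unusable or duplicate rows."""
    seen_ids = set()
    cleaned = []
    for row in rows:
        record_id = row["id"].strip()
        country = row["country"].strip().upper()
        try:
            amount = float(row["amount"].strip())
        except ValueError:
            # Skip rows whose amount is missing or not a number.
            continue
        if record_id in seen_ids:
            # Keep only the first occurrence of each id.
            continue
        seen_ids.add(record_id)
        cleaned.append({"id": record_id, "amount": amount, "country": country})
    return cleaned


rows = list(csv.DictReader(io.StringIO(RAW)))
cleaned = clean_rows(rows)
for record in cleaned:
    print(record)
```

Whether to drop or repair a bad row depends on the dataset; dropping is just the simplest choice for a sketch.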
- datastoring: importing data, executing SQL and exporting data for further analysis
- A SQL or NoSQL database is important for storing data, yet this step is often skipped in data analysis tasks. This folder holds SQL scripts and tips for optimizing their performance.
- Detailed descriptions can be found in the subfolder: datastoring
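The import, query and export cycle described above can be sketched with Python's built-in `sqlite3` module. The table name `transactions`, its columns and the sample rows are hypothetical, and the index is only one common optimization that may or may not match the tips in the subfolder.

```python
import csv
import io
import sqlite3

# An in-memory database keeps the example self-contained.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE transactions ("
    "id INTEGER PRIMARY KEY, country TEXT, amount REAL)"
)

# Import: load rows with a parameterized bulk insert.
rows = [(1, "US", 10.5), (2, "DE", 20.0), (3, "US", 4.5)]
conn.executemany("INSERT INTO transactions VALUES (?, ?, ?)", rows)

# An index on a column used for filtering or grouping can speed up
# queries on larger tables.
conn.execute("CREATE INDEX idx_transactions_country ON transactions (country)")

# Execute SQL: total amount per country.
totals = conn.execute(
    "SELECT country, SUM(amount) FROM transactions "
    "GROUP BY country ORDER BY country"
).fetchall()

# Export: write the result as CSV text for further analysis.
buffer = io.StringIO()
writer = csv.writer(buffer, lineterminator="\n")
writer.writerow(["country", "total_amount"])
writer.writerows(totals)
exported = buffer.getvalue()
conn.close()

print(exported)
```

Replacing `io.StringIO()` with an open file handle would write the export to disk instead.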
- datamanipulation: performing data analysis with tools such as NumPy and pandas in Python or dplyr in R
TODO:
- datavisualization: presenting data to an audience with tools including ggplot2, matplotlib, D3 and Tableau
- In this subfolder, I'm using libraries such as matplotlib, ggplot2, seaborn and plotly in Python and R, along with D3 and Tableau, to create data visualizations for the final report.
TODO:
This is an ongoing project that needs further refinement and maintenance; I'll keep updating the content.