This repository contains Jupyter Notebooks with code examples for data analysis and machine learning in Python. The notebooks are organized into code blocks that can be used as a reference for common tasks.
Details for the Conda environment and requirements.txt are at the end of this README.
The repository contains both Python and PySpark notebooks.
The Python notebooks below were created and run on Python 3.8:
- 1_Data_operations.ipynb: Covers major data operations required before getting into any analysis or model building
- 2_Pandas_apply_optimization.ipynb: Compares various ways of applying functions to a pandas DataFrame and helps in optimizing pandas code (a minimal sketch follows this list)
- 3_Clustering_kmeans.ipynb: Showcases the flow of a clustering exercise using customer sales data
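As a rough illustration of the kind of comparison covered in 2_Pandas_apply_optimization.ipynb, the sketch below times a row-wise `apply` against a vectorized equivalent on synthetic data. The column names, data, and timing approach are my own placeholders, not taken from the notebook.

```python
# Minimal sketch (not from the notebook): row-wise apply vs. vectorized pandas code.
# Column names and data are made up for illustration.
import time
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "units": np.random.randint(1, 100, size=100_000),
    "unit_price": np.random.rand(100_000) * 50,
})

# Row-wise apply: flexible but slow, since the Python function runs once per row.
start = time.perf_counter()
df["revenue_apply"] = df.apply(lambda row: row["units"] * row["unit_price"], axis=1)
print(f"apply:      {time.perf_counter() - start:.3f}s")

# Vectorized arithmetic: the same result computed on whole columns at once.
start = time.perf_counter()
df["revenue_vectorized"] = df["units"] * df["unit_price"]
print(f"vectorized: {time.perf_counter() - start:.3f}s")
```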
The PySpark notebooks below were built using Spark 3.1.2 installed on Windows, unless specified otherwise:
- pyspark/1_Clustering_kmeans.ipynb: Showcases the flow of a clustering exercise using customer sales data (a minimal version of this flow is sketched after the list)
- pyspark/2_Spark_data_ops.ipynb: Covers major data operations in PySpark
- pyspark/3_rolling_window_features.ipynb: Builds a classification model using rolling window features
- pyspark/4_xgboost.py: PySpark script showcasing how to use XGBoost with Spark 2.4.5 (run on AWS EMR)
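The sketch below outlines the general clustering flow referenced in pyspark/1_Clustering_kmeans.ipynb (assemble features, scale, fit KMeans, inspect assignments). The column names, sample data, and choice of k are placeholders of my own, not taken from the notebook.

```python
# Minimal sketch (not from the notebook): a KMeans flow in pyspark.ml.
# Column names, data, and k are placeholders for illustration.
from pyspark.sql import SparkSession
from pyspark.ml.feature import VectorAssembler, StandardScaler
from pyspark.ml.clustering import KMeans

spark = SparkSession.builder.appName("kmeans_sketch").getOrCreate()

# Hypothetical customer sales data.
sales = spark.createDataFrame(
    [(1, 120.0, 5), (2, 80.0, 2), (3, 300.0, 12), (4, 45.0, 1)],
    ["customer_id", "total_spend", "num_orders"],
)

# Assemble the feature columns into a single vector column, then scale them.
assembler = VectorAssembler(inputCols=["total_spend", "num_orders"], outputCol="features_raw")
scaler = StandardScaler(inputCol="features_raw", outputCol="features")
assembled = assembler.transform(sales)
scaled = scaler.fit(assembled).transform(assembled)

# Fit KMeans and attach cluster assignments as a "prediction" column.
model = KMeans(k=2, seed=42, featuresCol="features").fit(scaled)
model.transform(scaled).select("customer_id", "prediction").show()

spark.stop()
```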
For Spark 2.4.4 based notebooks, see ./pyspark/pyspark_2_4_4.
For the Spark installation process, refer to this Medium article.
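Once Spark is installed locally, one quick sanity check (my own suggestion, not a step from the article) is to start a SparkSession and print its version:

```python
# Minimal check that a local Spark install is working (not from the linked article).
from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[*]").appName("install_check").getOrCreate()
print(spark.version)  # the notebooks here were built against 3.1.2 unless noted otherwise
spark.stop()
```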
Upcoming notebooks:
- Linear regression
- Logistic regression
- Decision Trees and Random forests
- More notebooks for PySpark
I also plan to keep updating the existing notebooks along the way.
Create a Python 3.8 environment and install the dependencies from the requirements.txt file:
- Create the env: `conda create -n mlInPython python=3.8`
- Switch to the env: `conda activate mlInPython`
- Install the dependencies: `pip install -r requirements.txt`
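To confirm the environment is set up correctly, a quick check (my own suggestion, assuming pandas is among the pinned dependencies in requirements.txt) is to run the following from the new environment:

```python
# Quick check of the new environment (assumes pandas is listed in requirements.txt).
import sys
import pandas as pd

print(sys.version)      # should report Python 3.8.x
print(pd.__version__)   # confirms pandas installed correctly
```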