Twitter Account Classification Project

This project is part of the CSIT946 course and focuses on processing text data to classify Twitter accounts. The goal is to distinguish human from non-human accounts using machine learning techniques. The model is a vanilla approach only, intended as a baseline for future development of more complex models.

Project Overview

Data Preprocessing: Implemented stratified splitting to preserve the original distribution of Twitter account types.
Modeling: Used a bag-of-words model coupled with logistic regression to classify accounts.
Text Preprocessing: Developed a custom cleanTweet function to clean the raw tweet text before model training.
Analysis: Analyzed and presented key findings on distinctive keywords associated with human and non-human Twitter accounts.
Model Optimization: Fine-tuned the logistic regression model to enhance performance metrics.
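
The steps above can be sketched roughly as follows. This is a minimal illustration, not the project's actual notebook code: clean_tweet is a hypothetical stand-in for the cleanTweet function (whose real rules are not shown here), the toy texts and labels are invented, and the scikit-learn settings (test_size, random_state, max_iter) are assumptions.

```python
import re

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline


def clean_tweet(text):
    """Hypothetical stand-in for the project's cleanTweet function."""
    text = text.lower()
    text = re.sub(r"https?://\S+", " ", text)  # drop URLs
    text = re.sub(r"[@#]\w+", " ", text)       # drop mentions and hashtags
    text = re.sub(r"[^a-z\s]", " ", text)      # keep letters and whitespace only
    return re.sub(r"\s+", " ", text).strip()   # collapse repeated whitespace


# Toy data invented for illustration: 0 = human, 1 = non-human.
texts = [
    "Had a lovely coffee with my sister this morning",
    "Can't believe how good that movie was!",
    "My dog just stole my sandwich again",
    "Running late for work and it is raining",
    "Anyone else watching the game tonight?",
    "So proud of my little brother today",
    "Trying a new pasta recipe for dinner",
    "Finally finished reading that book",
    "WIN a FREE prize now!!! https://example.com/win",
    "Free followers, click here: http://example.com #follow",
    "Breaking: hourly weather update for @your_city",
    "Click now for free alerts https://example.com/alerts",
]
labels = [0] * 8 + [1] * 4

# stratify=labels keeps the 2:1 class ratio in both the train and test splits.
X_train, X_test, y_train, y_test = train_test_split(
    texts, labels, test_size=0.25, stratify=labels, random_state=42
)

# Bag-of-words counts feeding a logistic regression classifier.
model = make_pipeline(
    CountVectorizer(preprocessor=clean_tweet),
    LogisticRegression(max_iter=1000),
)
model.fit(X_train, y_train)
predictions = model.predict(X_test)
```

Passing the cleaning function as the vectorizer's preprocessor keeps cleaning and feature extraction in one pipeline, so the same cleaning is applied at training and prediction time.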

Performance Summary

Class 0 (Human): Precision 83%, Recall 93%, F1-Score 0.88
Class 1 (Non-Human): Precision 80%, Recall 59%, F1-Score 0.68
Overall Accuracy: 83%
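
As a quick sanity check, the F1-scores above can be recomputed from the reported precision and recall, since F1 is their harmonic mean. The sketch below uses only the rounded figures from this summary; the helper name f1_score_from is made up for illustration.

```python
def f1_score_from(precision, recall):
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Rounded (precision, recall) pairs taken from the summary above.
reported = {
    "Class 0 (Human)": (0.83, 0.93),
    "Class 1 (Non-Human)": (0.80, 0.59),
}

f1_by_class = {
    name: round(f1_score_from(p, r), 2) for name, (p, r) in reported.items()
}

for name, f1 in f1_by_class.items():
    print(f"{name}: F1 = {f1:.2f}")
```

Both recomputed values agree with the reported F1-scores (0.88 and 0.68) to two decimal places.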

Limitations

Simple Feature Extraction: Uses a basic bag of words approach.
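Imbalanced Recall: Recall for non-human accounts (59%) is much lower than for human accounts (93%), so many non-human accounts are misclassified as human.
Data Limitations: Limited labeled ground truth is available for training and validating the classifier.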

Suggestions for Improvement

Advanced Feature Extraction: Implement TF-IDF or word embeddings.
Enhanced Model Techniques: Explore other machine learning models or deep learning approaches.
Increase Training Data: Collect more labeled data to improve the model's learning capability.
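
As one possible starting point for the TF-IDF suggestion, the bag-of-words counts could be swapped for scikit-learn's TfidfVectorizer while keeping logistic regression as the classifier. This is only a sketch: the toy texts are invented, and the ngram_range and sublinear_tf settings are assumptions, not tuned values from this project.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Toy data invented for illustration: 0 = human, 1 = non-human.
texts = [
    "had a lovely coffee with my sister this morning",
    "my dog just stole my sandwich again",
    "running late for work and it is raining",
    "so proud of my little brother today",
    "win a free prize now click the link",
    "free followers click here to win",
    "hourly weather update for your city",
    "click now for free alerts",
]
labels = [0, 0, 0, 0, 1, 1, 1, 1]

# TF-IDF down-weights words that appear in many documents;
# ngram_range=(1, 2) also adds two-word phrases as features.
tfidf_model = make_pipeline(
    TfidfVectorizer(ngram_range=(1, 2), sublinear_tf=True),
    LogisticRegression(max_iter=1000),
)
tfidf_model.fit(texts, labels)

# Each row holds the estimated probabilities for class 0 and class 1.
probabilities = tfidf_model.predict_proba(["win a free prize now"])
```

Because only the vectorizer changes, the baseline and the TF-IDF variant can be compared on the same train and test split.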

Installation:

Clone the repository and install the required libraries listed in requirements.txt (for example, with pip install -r requirements.txt).

Built with:

Dependencies

This project requires the following Python libraries:

pandas
numpy
scikit-learn
nltk
matplotlib
wordcloud

Usage

Data Preprocessing: Ensure the dataset is clean and split into training and test sets.
Model Training: Train the logistic regression model using the preprocessed data.
Evaluation: Evaluate the model performance using accuracy, precision, recall, and F1-score metrics.
Analysis: Analyze the results and present key findings.

Restriction

This project is part of the CSIT946 subject at UoW. The code is provided for educational purposes and demonstration use only.

Disclaimer

This code is provided "as is" without warranty of any kind, and I, as the author, am not liable for any issues that arise from its use. While you're welcome to learn from it, please do not copy or distribute it for your own coursework or assignments without permission.

Repository: xuanhuyen3011/twitter_classification
