This project is part of the CSIT946 course and focuses on processing text data to classify Twitter accounts. The goal is to distinguish human from non-human accounts using machine learning techniques. Note that the model is a vanilla approach only, intended as a baseline for the future development of more complex models.
- Data Preprocessing: Implemented stratified splitting to preserve the original distribution of Twitter account types.
- Modeling: Used a bag of words model coupled with logistic regression to classify accounts.
- Text Preprocessing: Developed a custom `cleanTweet` function to preprocess text data, ensuring optimal input quality for model training.
- Analysis: Analyzed and presented key findings on distinctive keywords associated with human and non-human Twitter accounts.
- Model Optimization: Fine-tuned the logistic regression model to enhance performance metrics.
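
The source of `cleanTweet` is not reproduced in this README, so the block below is only a guess at what such a cleaning step might look like. The name `clean_tweet_sketch` and every rule in it (lowercasing, dropping URLs and mentions, keeping hashtag words, removing non-letters) are assumptions for illustration, not the project's actual implementation.

```python
import re


def clean_tweet_sketch(text: str) -> str:
    """Hypothetical stand-in for the project's cleanTweet function."""
    text = text.lower()
    text = re.sub(r"https?://\S+|www\.\S+", " ", text)  # drop URLs
    text = re.sub(r"@\w+", " ", text)                   # drop @mentions
    text = text.replace("#", "")                        # keep the hashtag word, drop '#'
    text = re.sub(r"[^a-z\s]", " ", text)               # drop digits, punctuation, emoji
    return re.sub(r"\s+", " ", text).strip()            # collapse repeated whitespace


example = "Check THIS out!! https://t.co/abc @user #MachineLearning 2024"
print(clean_tweet_sketch(example))  # prints "check this out machinelearning"
```

Whether to keep hashtags, digits, or emoji is a design choice: they may carry signal for spotting automated accounts, so a real cleaning function might treat them differently.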
- Class 0 (Human): Precision - 83%, Recall - 93%, F1-Score - 0.88
- Class 1 (Non-Human): Precision - 80%, Recall - 59%, F1-Score - 0.68
- Overall Accuracy: 83%
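
As a quick consistency check, the F1-scores above can be recomputed from the reported precision and recall, since F1 is their harmonic mean. The helper name `f1_from` is made up for this example.

```python
def f1_from(precision: float, recall: float) -> float:
    """F1 is the harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


human_f1 = f1_from(0.83, 0.93)      # Class 0 (Human)
non_human_f1 = f1_from(0.80, 0.59)  # Class 1 (Non-Human)

print(f"{human_f1:.2f} {non_human_f1:.2f}")  # prints "0.88 0.68"
```

Both values match the reported figures after rounding to two decimals.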
- Simple Feature Extraction: Uses a basic bag of words approach.
- Imbalanced Recall: Recall for the non-human class (59%) is much lower than for the human class (93%).
- Data Limitations: Only limited ground-truth labels are available, which constrains classification accuracy.
- Advanced Feature Extraction: Implement TF-IDF or word embeddings.
- Enhanced Model Techniques: Explore other machine learning models or deep learning approaches.
- Increase Training Data: Collect more labeled data to improve the model's learning capability.
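
One way the TF-IDF idea above might be tried is to swap the bag of words step for scikit-learn's `TfidfVectorizer` while keeping logistic regression. This is a sketch only: the toy texts, the labels, and the parameter choices (`ngram_range`, `sublinear_tf`, `max_iter`) are assumptions, not settings taken from the project.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Made-up examples for illustration; 0 = human, 1 = non-human.
texts = [
    "had a great coffee with friends today",
    "so tired after the gym this morning",
    "free followers click here now",
    "win a prize click the link now",
]
labels = [0, 0, 1, 1]

# TF-IDF down-weights words that appear in many tweets, unlike raw counts.
tfidf_model = make_pipeline(
    TfidfVectorizer(ngram_range=(1, 2), sublinear_tf=True),
    LogisticRegression(max_iter=1000),
)
tfidf_model.fit(texts, labels)

predictions = tfidf_model.predict(["click here for free followers"])
print(predictions)
```

Wrapping both steps in one pipeline keeps the vectorizer fitted on training text only, which avoids leaking test vocabulary into the model.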
Clone the repository and install the required libraries.
This project requires the following Python libraries:
- pandas
- numpy
- scikit-learn
- nltk
- matplotlib
- wordcloud
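
Before running the project, it can help to confirm that these libraries are importable. The snippet below is a small sketch using only the standard library; the helper name `missing_packages` is made up. Note that `scikit-learn` is imported as `sklearn`.

```python
from importlib.util import find_spec

# Maps the pip package name to the module name used in import statements.
REQUIRED = {
    "pandas": "pandas",
    "numpy": "numpy",
    "scikit-learn": "sklearn",
    "nltk": "nltk",
    "matplotlib": "matplotlib",
    "wordcloud": "wordcloud",
}


def missing_packages(required=REQUIRED):
    """Return the pip names of packages whose module cannot be found."""
    return [name for name, module in required.items() if find_spec(module) is None]


missing = missing_packages()
if missing:
    print("Missing, install with: pip install " + " ".join(missing))
else:
    print("All required libraries are available.")
```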
- Data Preprocessing: Ensure the dataset is clean and split into training and test sets.
- Model Training: Train the logistic regression model using the preprocessed data.
- Evaluation: Evaluate the model performance using accuracy, precision, recall, and F1-score metrics.
- Analysis: Analyze the results and present key findings.
This project is part of the CSIT946 subject at UoW. The code is provided for educational purposes and demonstration use only.
This code is provided "as is" without warranty of any kind, and I, as the author, am not liable for any issues that arise from its use. While you're welcome to learn from it, please do not copy or distribute it for your own coursework or assignments without permission.