This is a term project for the CS435 (Big Data) course. Our team of four students was instructed to formulate a question or problem, clearly define the goals our analytics would accomplish, and explain how the results could benefit specific parties. The objectives of the term project were to perform large-scale data analytics using technologies typically found in modern data centers and to interpret the results to extract insight from the data.
For our project, we decided to use the Amazon product dataset to formulate a rating scale that determines whether a consumer’s written review properly matches their star rating, using the Stanford Natural Language Processing library.
The project is divided into 3 main parts:
- Preprocessing the data.
- Assigning a rating to each individual review using 5 sentiment classes, from very negative (1) to very positive (5), by implementing 7 different algorithms in Apache Spark and comparing them.
- Calculating an overall adjusted star rating for each product using a Bayesian average.
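
To illustrate the last step, here is a minimal sketch of a Bayesian average in plain Python. It is not the project's Spark implementation: the function names `bayesian_average` and `adjust_all`, the `prior_weight` parameter, and the choice of the global mean rating as the prior are all assumptions made for this example. The idea is that a product with few reviews is pulled toward the prior, while a product with many reviews stays close to its own mean.

```python
def bayesian_average(ratings, prior_mean, prior_weight):
    """Blend a product's ratings with a prior mean.

    prior_weight (hypothetical parameter) acts like a number of
    "virtual" reviews that all carry the prior_mean rating.
    """
    n = len(ratings)
    return (prior_weight * prior_mean + sum(ratings)) / (prior_weight + n)


def adjust_all(ratings_by_product, prior_weight=5):
    """Compute an adjusted rating per product.

    Assumption: the prior is the mean over all ratings in the input.
    """
    all_ratings = [r for rs in ratings_by_product.values() for r in rs]
    global_mean = sum(all_ratings) / len(all_ratings)
    return {
        product: bayesian_average(rs, global_mean, prior_weight)
        for product, rs in ratings_by_product.items()
    }


if __name__ == "__main__":
    # Made-up per-review sentiment ratings on the 1-5 scale.
    sample = {"A": [5, 5], "B": [4] * 10}
    for product, score in adjust_all(sample).items():
        print(product, round(score, 3))
```

In this sketch, product `A` has only two reviews, so its adjusted rating lands well below its raw mean of 5, whereas product `B`, with ten reviews, moves only slightly from its raw mean of 4.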