A collection of Python scripts for N-gram analysis of software reviews. This repository tokenizes software review data, removes stopwords, lemmatizes, and generates N-grams using NLTK, pandas, and scikit-learn for an NSF REU AI research internship. It also processes review date data. Before running, check the script's directory for data files.

kamron-h/REU_Capterra_N-gram_Analysis

N-gram Analysis of Software Reviews


This repository contains Python scripts for conducting N-gram analysis on software reviews.


Dependencies

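The following third-party Python libraries are required:

  • NLTK
  • pandas
  • scikit-learn
  • openpyxl (used by pandas to read .xlsx files)

The script also uses two modules from the Python standard library, which need no installation:

  • collections
  • re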

If you're running the script for the first time, uncomment the following lines to download the necessary NLTK data (tokenizer models and corpora):

nltk.download('punkt')
nltk.download('stopwords')
nltk.download('wordnet')

Usage

The script contains two main functions: ngram_analysis and process_date_data.


ngram_analysis

The ngram_analysis function reads in an Excel file containing software review data, tokenizes the "All NCSS Capterra Cons" column, removes stopwords, performs lemmatization, generates N-grams, and calculates and prints the frequency distribution of these N-grams.

# Usage
file_path = 'Capterra_Cons_Excel.xlsx'
ngram_analysis(file_path, 3)  # Change 3 to whatever 'n' you want for the N-gram
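
The core of that pipeline (tokenize, drop stopwords, build N-grams, count them) can be sketched with the standard library alone. This is a minimal illustration, not the repository's implementation: count_ngrams, the tiny STOPWORDS set, and the sample reviews are all hypothetical. A regular expression stands in for NLTK's tokenizer, lemmatization is omitted, and N-grams are assumed to be built per review rather than across review boundaries.

```python
import re
from collections import Counter

# Hypothetical, deliberately tiny stopword list for illustration only;
# the script itself relies on NLTK's English stopword corpus.
STOPWORDS = {"the", "is", "a", "to", "and", "it", "of"}


def count_ngrams(texts, n):
    """Count n-grams (as tuples of tokens) across an iterable of review strings."""
    counts = Counter()
    for text in texts:
        # Lowercase and keep alphabetic runs only (a stand-in for NLTK tokenization).
        tokens = [t for t in re.findall(r"[a-z]+", text.lower()) if t not in STOPWORDS]
        # Slide a window of length n over the filtered tokens of this review.
        counts.update(zip(*(tokens[i:] for i in range(n))))
    return counts


# Made-up sample reviews, not data from the repository.
reviews = ["The interface is slow to load", "Slow to load and hard to learn"]
bigram_counts = count_ngrams(reviews, 2)

# Print N-grams from most to least frequent.
for ngram, count in bigram_counts.most_common():
    print(" ".join(ngram), count)
```

Because stopwords are removed before the window is applied, words that were not adjacent in the original text (such as "slow" and "load") can end up in the same N-gram.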

process_date_data

The process_date_data function reads in a CSV file, converts the dataframe into a single string, extracts all four-digit numbers (intended for years), and then uses the N-gram model to calculate and print the number of occurrences for each 4-digit year.

# Usage
file_name = 'Review_Dates.csv'
process_date_data(file_name)
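
The year-extraction step can be sketched as follows. This is an illustration under stated assumptions rather than the script's own code: count_years and the sample string are hypothetical, the input is assumed to be the dataframe already flattened to one string, and a Counter replaces the N-gram model, which for single tokens amounts to the same per-year tally.

```python
import re
from collections import Counter


def count_years(text):
    """Count every standalone four-digit number found in text."""
    # \b\d{4}\b matches exactly four digits bounded by non-word characters,
    # so "2021" in "03/14/2021" matches while "03" and "14" do not.
    years = re.findall(r"\b\d{4}\b", text)
    return Counter(years)


# Made-up sample standing in for the flattened contents of the CSV file.
sample = "03/14/2021, 11/02/2021, 07/30/2019"
year_counts = count_years(sample)

# Print the counts in chronological order.
for year, count in sorted(year_counts.items()):
    print(year, count)
```

Note that the pattern matches any four-digit number, not only years, so other four-digit values in the file would be counted too.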

Please ensure the data files are in the same directory as the script when running it.
