This repository contains the development of classification models for recognizing emotions in speech based on prosody (intonation, rhythm, stress, etc.). The data used is the open emoUERJ dataset. The Jupyter notebooks detail several feature extraction methods, tested on a Support Vector Machine classifier and a custom deep learning model built with Keras.
- Python 3.9
- emoUERJ dataset
Repository installation
git clone https://github.com/gustavo-fardo/speech-emotion-ptbr
cd ./speech-emotion-ptbr
Create a Python virtual environment (recommended):
sudo apt install python3.9
sudo apt install virtualenv
python3.9 -m virtualenv .venv --python=$(which python3.9)
Note: every time you open a new terminal, activate the virtual environment with the command:
source .venv/bin/activate
Deactivate it with:
deactivate
To reproduce the results, download the emoUERJ dataset and place it inside the /datasets folder.
Data augmentation with the Audiomentations library was used to triple the size of the dataset, applying the following transforms (a code sketch is shown after this list):
- Gaussian Noise with a random amplitude between 0.001 and 0.01
- Time Stretch with a rate between 0.8 and 1.24
- Pitch Shift with a random semitone variation from -2 to 2
The data augmentation process is detailed in feat_extract.ipynb and feat_extract2.ipynb.
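A minimal sketch of such an augmentation pipeline, assuming audio loaded with librosa at 16 kHz; the file path, sample rate, and per-transform probabilities are placeholders and may differ from those used in the notebooks:

```python
import librosa
from audiomentations import Compose, AddGaussianNoise, TimeStretch, PitchShift

# Transform chain mirroring the ranges listed above; p=1.0 applies every transform.
augment = Compose([
    AddGaussianNoise(min_amplitude=0.001, max_amplitude=0.01, p=1.0),
    TimeStretch(min_rate=0.8, max_rate=1.24, p=1.0),
    PitchShift(min_semitones=-2, max_semitones=2, p=1.0),
])

# Load one recording (placeholder path) and create an augmented copy of it.
samples, sr = librosa.load("datasets/emoUERJ/example.wav", sr=16000)
augmented = augment(samples=samples, sample_rate=sr)
```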
The feature extraction methods tested on emoUERJ are documented in feat_extract.ipynb and feat_extract2.ipynb, and include the following (a librosa-based sketch follows the list):
- TSFEL library (too slow, not tested with the models)
- Praat and Parselmouth, based on https://github.com/uzaymacar/simple-speech-features
- pyAudioAnalysis library
- Mel-Frequency Cepstrum Coefficients (MFCC) with Librosa library
- Mel Spectrogram with Librosa library
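As an illustration of the librosa-based features, a minimal sketch is given below; the number of coefficients and the fixed-length summary are assumptions for illustration, not necessarily the settings used in the notebooks:

```python
import numpy as np
import librosa

def extract_features(path, sr=16000, n_mfcc=13):
    """Return an MFCC summary vector and a log mel spectrogram for one audio file."""
    y, sr = librosa.load(path, sr=sr)

    # Mel-Frequency Cepstral Coefficients: shape (n_mfcc, frames).
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc)

    # Mel spectrogram converted to dB: shape (n_mels, frames).
    mel = librosa.feature.melspectrogram(y=y, sr=sr)
    log_mel = librosa.power_to_db(mel, ref=np.max)

    # Summarize each coefficient over time to get a fixed-length vector for the SVM.
    mfcc_vector = np.concatenate([mfcc.mean(axis=1), mfcc.std(axis=1)])
    return mfcc_vector, log_mel
```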
Two classification models were evaluated (minimal sketches are shown below):
- SVM: a simple support vector machine with a linear kernel and C=1.0
- Neural network: a deep learning model built with Keras
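Minimal sketches of both models, assuming fixed-length feature vectors X and integer labels y for the four emotion classes; the data here is random placeholder data, and the actual network architecture is defined in the notebooks and may differ:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from tensorflow import keras

# Placeholder data: 300 samples of 26-dimensional feature vectors, 4 emotion classes.
X = np.random.rand(300, 26).astype("float32")
y = np.random.randint(0, 4, size=300)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# SVM with a linear kernel and C=1.0, as described above.
svm = SVC(kernel="linear", C=1.0)
svm.fit(X_train, y_train)
print("SVM accuracy:", svm.score(X_test, y_test))

# A small fully connected Keras network (this architecture is only illustrative).
model = keras.Sequential([
    keras.layers.Input(shape=(X.shape[1],)),
    keras.layers.Dense(64, activation="relu"),
    keras.layers.Dropout(0.3),
    keras.layers.Dense(32, activation="relu"),
    keras.layers.Dense(4, activation="softmax"),
])
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy", metrics=["accuracy"])
model.fit(X_train, y_train, epochs=20, batch_size=16, validation_data=(X_test, y_test), verbose=0)
print("Keras accuracy:", model.evaluate(X_test, y_test, verbose=0)[1])
```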
The requirements for the implementations can be installed with:
pip install -r requirements.txt
The repository also includes two standalone scripts:
- embedded_classifier.py: captures audio from a microphone and classifies its emotion in near real-time
- realtime_emotion_subtitle.py: given a .wav audio file, outputs the proportion of each of the four emotions in near real-time
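A hedged sketch of the kind of capture-and-classify loop the microphone script implements, assuming the sounddevice library for recording, a previously saved Keras model, and the MFCC summary features from the sketch above; the model path, label order, and window length are illustrative assumptions, not the script's actual code:

```python
import numpy as np
import sounddevice as sd
import librosa
from tensorflow import keras

EMOTIONS = ["happiness", "anger", "sadness", "neutral"]  # assumed label order
SR = 16000
WINDOW_SECONDS = 3

model = keras.models.load_model("emotion_model.h5")  # hypothetical saved model

while True:
    # Record a short window from the default microphone.
    audio = sd.rec(int(WINDOW_SECONDS * SR), samplerate=SR, channels=1, dtype="float32")
    sd.wait()
    y = audio.flatten()

    # Same fixed-length MFCC summary used for training (illustrative).
    mfcc = librosa.feature.mfcc(y=y, sr=SR, n_mfcc=13)
    features = np.concatenate([mfcc.mean(axis=1), mfcc.std(axis=1)])[np.newaxis, :]

    # Print the predicted probability of each emotion for the current window.
    probs = model.predict(features, verbose=0)[0]
    print({emo: round(float(p), 2) for emo, p in zip(EMOTIONS, probs)})
```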