This repository contains tutorials, testing materials, and other documents related to natural language processing and machine learning. It was first created by Titi KH while working primarily with Ian Scott (with a lot of support from Mike Thicke! And of course other super duper cool co-workers: Cassie Lem, Dimitrios Tzouris, Brian Boggan, and Bonnie Russell).
After gaining a basic understanding of KC and other background knowledge (e.g. Docker environments, how to use the API to access stats, etc.), I became clearer about my short-term and long-term goals:
- test Python libraries for extracting text from different types of deposited files
- clean the files and build one or more structured dataframes
- perform topic modeling or other analysis on the data at hand
All related materials can be found in the subfolder "text4test". The script for the text extraction comparison can be found under "stage1/tutorial1-textout.md".
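
As a rough illustration of what comparing extraction libraries can look like (this is not the script in "stage1" itself), the sketch below runs several extractors on one file and records how much text each returns and how long it takes. The function `compare_extractors` and the two stand-in extractors are hypothetical names made up for this example; in practice each entry would wrap a call to a real extraction library.

```python
import time
from typing import Callable, Dict, List


def compare_extractors(path: str, extractors: Dict[str, Callable[[str], str]]) -> List[dict]:
    """Run each extractor on one file and record basic comparison stats."""
    results = []
    for name, extract in extractors.items():
        start = time.perf_counter()
        try:
            text = extract(path)
            error = None
        except Exception as exc:  # a library may not support this file type
            text, error = "", str(exc)
        elapsed = time.perf_counter() - start
        results.append({
            "extractor": name,
            "n_chars": len(text),
            "n_words": len(text.split()),
            "seconds": elapsed,
            "error": error,
        })
    return results


# Hypothetical stand-in extractors, used only to make the sketch runnable.
def read_plain_text(path: str) -> str:
    with open(path, encoding="utf-8") as handle:
        return handle.read()


def unsupported(path: str) -> str:
    raise ValueError("file type not supported")
```

Calling `compare_extractors("some_file.txt", {"plain": read_plain_text, "bad": unsupported})` returns one row per extractor, which makes it easy to see which library fails on a file type and which one recovers the most text.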
- Machine Learning in Python (https://scikit-learn.org/stable/index.html); potentially talk to Stephaine Vasko.
In this stage, I have been focusing on accessing and downloading files through the Invenio API and then extracting text from all downloaded files. These steps can be found in the script "apiinvenio-9th.py" (in the folder "stage2"). The next step is to clean all the extracted text, which is currently saved as CSV, and prepare it for machine learning.
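
The cleaning step could start from something like the minimal sketch below, which uses only the Python standard library. The column names `record_id` and `text` are assumptions for illustration (the actual CSV layout may differ), and the specific cleaning rules (rejoining hyphenated line breaks, collapsing whitespace, lowercasing) are one reasonable starting point rather than a fixed recipe.

```python
import csv
import io
import re


def clean_text(raw: str) -> str:
    """Normalize extracted text before any machine learning step."""
    # Rejoin words hyphenated across line breaks (this also merges
    # genuinely hyphenated words that happen to break at a line end).
    text = re.sub(r"-\s*\n\s*", "", raw)
    # Collapse runs of spaces, tabs, and newlines into single spaces.
    text = re.sub(r"\s+", " ", text)
    return text.strip().lower()


def clean_rows(reader, text_column="text"):
    """Yield cleaned copies of each row, skipping rows with no text left."""
    for row in reader:
        cleaned = clean_text(row.get(text_column) or "")
        if cleaned:
            yield {**row, text_column: cleaned}


# Small in-memory example; a real run would open the CSV file instead.
sample = io.StringIO('record_id,text\n1,"Topic  model-\ning   of TEXT"\n2,"   "\n')
rows = list(clean_rows(csv.DictReader(sample)))
```

Here `rows` keeps only the first record, with its text normalized, because the second record contains nothing but whitespace.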