nlp-ml-repo

Introduction

This repository contains tutorials, testing materials, and other documents related to natural language processing and machine learning. It was first created by Titi KH while primarily working with Ian Scott (with a lot of support from Mike Thicke! And of course other super duper cool co-workers: Cassie Lem, Dimitrios Tzouris, Brian Boggan, and Bonnie Russell).

Stage 1: 2023 Fall/Winter

After gaining a basic understanding of KC and other background knowledge (e.g. Docker environments, how to use the API to access stats, etc.), I became clearer about my short-term and long-term goals:

test Python libraries for extracting text from different types of deposited files
clean the files and build one or more structured dataframes
perform topic modeling or other analysis on the data at hand

All related materials can be found in the subfolder "text4test". The script for the text extraction comparison can be found under "stage1/tutorial1-textout.md".
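As a rough illustration of the first goal, the sketch below shows one way to run several extractors over the same file and record which ones succeed and how much text each returns. It is not the code in the tutorial. The `Extractor` interface, `compare_extractors`, and the two stand-in decoders are hypothetical names made up for this example; a real PDF or DOCX library would be wrapped in a function with the same signature.

```python
import time
from typing import Callable, Dict

# Assumed interface for this sketch: an extractor takes the raw bytes
# of a file and returns the extracted text.
Extractor = Callable[[bytes], str]


def decode_utf8(data: bytes) -> str:
    """Stand-in extractor: strict UTF-8 decoding (raises on invalid bytes)."""
    return data.decode("utf-8")


def decode_latin1(data: bytes) -> str:
    """Stand-in extractor: Latin-1 decoding (accepts any byte sequence)."""
    return data.decode("latin-1")


def compare_extractors(data: bytes, extractors: Dict[str, Extractor]) -> Dict[str, dict]:
    """Run every extractor on the same bytes and summarize the outcome."""
    results = {}
    for name, extract in extractors.items():
        start = time.perf_counter()
        try:
            text = extract(data)
            outcome = {"ok": True, "chars": len(text), "error": ""}
        except Exception as exc:
            # Record the failure instead of stopping, so one broken
            # extractor does not hide the results of the others.
            outcome = {"ok": False, "chars": 0, "error": type(exc).__name__}
        outcome["seconds"] = time.perf_counter() - start
        results[name] = outcome
    return results
```

Calling `compare_extractors(file_bytes, {"utf8": decode_utf8, "latin1": decode_latin1})` returns one summary dictionary per extractor, which makes it easy to line up results across file types.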

Resources I still need to check out

Machine Learning in Python (https://scikit-learn.org/stable/index.html); potentially talk to Stephaine Vasko.

Stage 2: 2024 Spring/Summer

In this stage, I have been focusing on accessing and downloading files through the Invenio API and then extracting text from all downloaded files. These steps can be found in the script "apiinvenio-9th.py" (in the folder "stage2"). The next step is to clean all the extracted text, currently saved as CSV, and prepare it for machine learning.
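The cleaning step could start out something like the sketch below. It uses only the standard library and makes several assumptions: the CSV has a header row, the extracted text sits in a column called `text`, and "cleaning" means Unicode normalization, replacing control characters with spaces, and collapsing whitespace. The column name and both function names are hypothetical, and the actual CSV produced by the script may be laid out differently.

```python
import csv
import io
import re
import unicodedata


def clean_text(raw: str) -> str:
    """Normalize Unicode, replace control characters, collapse whitespace."""
    text = unicodedata.normalize("NFKC", raw)
    # Characters in the Unicode "C*" categories (control, format, etc.)
    # are replaced with spaces; this includes newlines and tabs.
    text = "".join(
        " " if unicodedata.category(ch).startswith("C") else ch for ch in text
    )
    return re.sub(r"\s+", " ", text).strip()


def clean_rows(csv_text: str, text_column: str = "text") -> list:
    """Clean the text column of every row and drop rows left empty."""
    reader = csv.DictReader(io.StringIO(csv_text))
    cleaned = []
    for row in reader:
        row = dict(row)
        row[text_column] = clean_text(row.get(text_column) or "")
        if row[text_column]:
            cleaned.append(row)
    return cleaned
```

Rows whose text is empty after cleaning are dropped here on the assumption that they would add nothing to later modeling; keeping them with a flag instead would be an equally reasonable choice.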
