nlp-ml-repo

Introduction

This repository contains tutorials, testing materials, and other documents relating to natural language processing and machine learning. It was first created by Titi KH while primarily working with Ian Scott (with a lot of support from Mike Thicke! And of course other super duper cool co-workers: Cassie Lem, Dimitrios Tzouris, Brian Boggan, and Bonnie Russell).

Stage 1: 2023 Fall/Winter

After gaining a basic understanding of KC and related background (e.g. Docker environments, how to use the API to access stats, etc.), I became clearer about my short-term and long-term goals:

  • test Python libraries for extracting text from different types of deposited files
  • clean the files and build one or more structured dataframes
  • perform topic modeling or other analysis on the data at hand

All related materials can be found in the subfolder "text4test". The script for the text extraction comparison can be found under "stage1/tutorial1-textout.md".
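
To give a rough idea of what comparing text extractors can look like, here is a minimal sketch. It is not the script from the tutorial. The two extractor functions (`read_as_utf8`, `read_nonblank_lines`) and the helper `compare_extractors` are hypothetical stand-ins that use only the standard library. In practice each extractor would wrap whichever library is being tested.

```python
import time
from pathlib import Path
from typing import Callable, Dict, List


def read_as_utf8(path: Path) -> str:
    """Hypothetical baseline extractor: read the file as UTF-8 text."""
    return path.read_text(encoding="utf-8", errors="replace")


def read_nonblank_lines(path: Path) -> str:
    """Hypothetical second extractor: same text, blank lines dropped."""
    lines = read_as_utf8(path).splitlines()
    return "\n".join(line.strip() for line in lines if line.strip())


def compare_extractors(
    path: Path, extractors: Dict[str, Callable[[Path], str]]
) -> List[dict]:
    """Run every extractor on the same file and record one row per run."""
    rows = []
    for name, extract in extractors.items():
        start = time.perf_counter()
        try:
            text, error = extract(path), ""
        except Exception as exc:  # a failed extractor should not stop the comparison
            text, error = "", repr(exc)
        rows.append(
            {
                "file": path.name,
                "extractor": name,
                "n_chars": len(text),
                "seconds": time.perf_counter() - start,
                "error": error,
            }
        )
    return rows
```

Each row is a plain dict, so the list can be passed to `pandas.DataFrame(rows)` or written out with `csv.DictWriter` to compare character counts, timings, and failures side by side.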

Resources I still need to check out

Stage 2: 2024 Spring/Summer

In this stage, I have been focusing on accessing and downloading files through the Invenio API, and then extracting text data from all downloaded files. These steps can be found in the script "apiinvenio-9th.py" (in the folder "stage2"). The next step is to clean all the extracted text, currently saved as CSV, and prepare it for machine learning.
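
As a sketch of what the cleaning step might look like (not the actual pipeline), the example below normalizes extracted text and drops rows that end up empty. The column names `filename` and `text` are assumptions about the CSV layout, and `clean_text` and `clean_csv` are hypothetical helpers. Only the standard library is used.

```python
import csv
import re
import unicodedata


def clean_text(raw: str) -> str:
    """Normalize Unicode (e.g. the 'fi' ligature), collapse whitespace, lowercase."""
    text = unicodedata.normalize("NFKC", raw)
    text = re.sub(r"\s+", " ", text)
    return text.strip().lower()


def clean_csv(src, dst, text_column: str = "text") -> int:
    """Copy rows from src to dst with cleaned text; return the number of rows kept.

    src and dst are open text file objects. text_column is an assumed
    column name and should be changed to match the real CSV header.
    """
    reader = csv.DictReader(src)
    if reader.fieldnames is None or text_column not in reader.fieldnames:
        raise ValueError(f"column {text_column!r} not found in CSV header")
    writer = csv.DictWriter(dst, fieldnames=reader.fieldnames, lineterminator="\n")
    writer.writeheader()
    kept = 0
    for row in reader:
        row[text_column] = clean_text(row.get(text_column) or "")
        if row[text_column]:  # skip files where no text was extracted
            writer.writerow(row)
            kept += 1
    return kept
```

When working with real files, open them with `newline=""` as the `csv` module documentation recommends, for example `open("extracted.csv", newline="", encoding="utf-8")`, where the file name is a placeholder. Lowercasing is a common choice for topic modeling but may not suit every downstream task.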