In this project, I built and trained a model to recognize named entities in a sentence. The model assigns a tag to each word of the sentence, which is a classic Natural Language Processing task. The main file is here.
To replicate the results, download the PersianNER dataset (linked below under ArmanPersoNERCorpus) and change the dataset path in the ipynb file. You also need a pretrained fastText model. My fastText model is here. You can add it to your drive and point the notebook to the correct path. Running the ipynb file in Google Colab is recommended.
Named Entity Recognition is a process in which an algorithm takes a string of text (a sentence or paragraph) as input and identifies the relevant nouns (people, places, organizations, and so on) mentioned in that string. Here is an example:
John went to New York to interview with Microsoft
B-PER O O B-LOC I-LOC O O O B-ORG
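The one-tag-per-word alignment in the example above can be made explicit in a few lines of Python. This is only an illustrative sketch: the whitespace tokenization and the helper name `align` are assumptions for this example and are not taken from the project's notebook.

```python
def align(sentence: str, tags: str) -> list[tuple[str, str]]:
    """Pair each whitespace-separated token with its tag (hypothetical helper)."""
    tokens = sentence.split()
    labels = tags.split()
    # A tagger must emit exactly one tag per token.
    if len(tokens) != len(labels):
        raise ValueError(f"{len(tokens)} tokens but {len(labels)} tags")
    return list(zip(tokens, labels))


pairs = align(
    "John went to New York to interview with Microsoft",
    "B-PER O O B-LOC I-LOC O O O B-ORG",
)
for token, tag in pairs:
    print(f"{token}\t{tag}")
```

Printing the pairs side by side makes it easy to check that "New York" is covered by a B-LOC tag followed by an I-LOC tag.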
https://github.com/HaniehP/PersianNER
This dataset contains 250,015 tokens in 7,682 Persian sentences. Each file contains one token per line, along with its manually annotated named-entity tag. Sentences are separated by a new line. The NER tags are in IOB format.
The IOB format (short for inside, outside, beginning) is a common format for tagging tokens in chunking tasks in computational linguistics (e.g. named entity recognition).
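A file in this layout could be read with a small parser like the one below. It is a minimal sketch under two assumptions that the text above does not spell out: that the token and its tag are separated by whitespace, and that sentences are separated by an empty line. The function name `read_iob` and the sample lines are hypothetical.

```python
from typing import Iterable

Sentence = list[tuple[str, str]]


def read_iob(lines: Iterable[str]) -> list[Sentence]:
    """Group 'token tag' lines into sentences (assumes blank-line separators)."""
    sentences: list[Sentence] = []
    current: Sentence = []
    for raw in lines:
        line = raw.strip()
        if not line:
            # An empty line closes the sentence collected so far.
            if current:
                sentences.append(current)
                current = []
            continue
        parts = line.rsplit(maxsplit=1)  # the tag is the last field
        if len(parts) != 2:
            raise ValueError(f"expected 'token tag', got: {line!r}")
        current.append((parts[0], parts[1]))
    if current:  # the last sentence may not be followed by a blank line
        sentences.append(current)
    return sentences


# Hypothetical sample data in the same layout as the corpus files.
sample = ["John B-PER", "lives O", "in O", "New B-LOC", "York I-LOC", ". O",
          "", "This O", "is O"]
sentences = read_iob(sample)
print(len(sentences), "sentences")
```

With a real corpus file you would pass an open file handle instead of a list, for example `read_iob(open(path, encoding="utf-8"))`, where `path` is wherever you stored the dataset.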
An example with IOB format:
John B-PER
lives O
in O
New B-LOC
York I-LOC
. O
This O
is O
another O
sentence O
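To show how IOB tags map back to entities, here is a small decoding sketch. It is not the project's code: the function name `iob_to_spans` is hypothetical, and treating a stray I- tag (one that does not continue an open entity of the same type) as the start of a new entity is a lenient choice made for this example.

```python
def iob_to_spans(tokens: list[str], tags: list[str]) -> list[tuple[str, str]]:
    """Collect (entity_type, text) pairs from parallel token and IOB tag lists."""
    spans: list[tuple[str, str]] = []
    start, label = None, None  # currently open entity, if any
    for i, tag in enumerate(tags):
        # An I- tag of the same type simply extends the open entity.
        if tag.startswith("I-") and label == tag[2:]:
            continue
        # Any other tag closes the open entity.
        if label is not None:
            spans.append((label, " ".join(tokens[start:i])))
            start, label = None, None
        # B- opens a new entity; a stray I- is treated the same way (assumption).
        if tag.startswith(("B-", "I-")):
            start, label = i, tag[2:]
    if label is not None:  # entity that runs to the end of the sentence
        spans.append((label, " ".join(tokens[start:])))
    return spans


tokens = ["John", "lives", "in", "New", "York", "."]
tags = ["B-PER", "O", "O", "B-LOC", "I-LOC", "O"]
spans = iob_to_spans(tokens, tags)
print(spans)
```

The B- prefix is what lets two adjacent entities of the same type be told apart, which a plain inside/outside scheme could not do.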
In ArmanPersoNERCorpus, NEs are categorized into six classes:
- person (B-pers, I-pers)
- organization (B-org, I-org), such as banks, ministries, embassies, teams, nationalities, networks and publishers
- location (B-loc, I-loc), such as cities, villages, rivers, seas, gulfs, deserts and mountains
- facility (B-fac, I-fac), such as schools, universities, research centers, airports, railways, bridges, roads, harbors, stations, hospitals, parks, zoos and cinemas
- product (B-pro, I-pro), such as books, newspapers, TV shows, movies, airplanes, ships, cars, theories, laws, agreements and religions
- event (B-event, I-event), such as wars, earthquakes, national holidays, festivals and conferences
- other (O): the remaining tokens
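Before training, tags like the ones listed above are usually turned into integer ids. The snippet below is one possible way to build that mapping. The id ordering and the names `ENTITY_TYPES`, `TAGS`, `tag2id`, `id2tag` and `encode` are assumptions for illustration and may differ from what the notebook uses.

```python
# Entity type suffixes as they appear in the corpus tags.
ENTITY_TYPES = ["pers", "org", "loc", "fac", "pro", "event"]

# "O" first, then a B-/I- pair per entity type: 1 + 6 * 2 = 13 tags.
TAGS = ["O"] + [f"{prefix}-{etype}" for etype in ENTITY_TYPES for prefix in ("B", "I")]

tag2id = {tag: i for i, tag in enumerate(TAGS)}
id2tag = {i: tag for tag, i in tag2id.items()}


def encode(tags: list[str]) -> list[int]:
    """Map a sequence of tag strings to integer ids (raises KeyError on unknown tags)."""
    return [tag2id[t] for t in tags]


print(len(TAGS), "tags")
print(encode(["B-pers", "I-pers", "O", "B-loc"]))
```

Keeping `id2tag` alongside `tag2id` makes it straightforward to turn model predictions back into readable tags.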
This was the final project for FanAsa Academy's DeepNLP course, which was held in the summer of 1398 (2019).
Instructors: Reza Vasefi - Fatemeh Mashhadi