lydia: Arabic NER pipeline for the Lydia system

Modify your bashrc to add the lydia modules to the Python path:

    # Add custom Python modules to the Python path.
    PYTHONPATH=$PYTHONPATH:/path/to/your/lydia/
    export PYTHONPATH

Change /path/to/your/lydia/ to wherever you are keeping the code.

To get a dump of Arabic Wikipedia, see http://download.wikimedia.org/arwiki/.
The one used here:

    wget -c http://download.wikimedia.org/arwiki/20110121/arwiki-20110121-pages-articles.xml.bz2

To fragment the Wikipedia dump into separate XML files, each containing one
article:

    $wikipedia/fragmenter.py wikiDump.xml outputDirectory

outputDirectory will contain all the files, named after the ids of the
Wikipedia articles. In each XML file the article text is contained between
the <text></text> tags. That text is still written in the wiki markup
language, which carries most of the formatting, so the markup has to be
removed (see the sketches below).

To extract the text and remove the markup:

    $wikipedia/formatter.py inputFile

This generates an output file called inputFile.f.

To run the named-entity recognizer over a folder:

    $aner/NER.py folderName

This runs over all the files that have the .f extension, deletes all the
temporary files except the .bw.post.NER file, which has five columns, and
stores the statistics in folderName/stats.

To run it on a single file:

    $aner/NER filename

This runs on the file passed, produces a .bw.ner file, and does not delete
any temporary files. Some illustrative sketches of these steps follow.
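
As a quick illustration, the whole pipeline can be driven from Python. This
is a minimal sketch, not part of the repository: it assumes the scripts are
executable, that WIKIPEDIA and ANER below point at the wikipedia/ and aner/
directories of your checkout (the paths shown are placeholders), and that
the fragment files carry no .f extension before formatting.

    #!/usr/bin/env python
    # Sketch of driving the documented pipeline end to end.
    import os
    import subprocess

    WIKIPEDIA = os.path.expanduser('~/sandbox/lydia/wikipedia')  # placeholder
    ANER = os.path.expanduser('~/sandbox/lydia/aner')            # placeholder
    DUMP = 'arwiki-20110121-pages-articles.xml'
    OUT = 'articles'

    # 1. Split the dump into one XML file per article.
    subprocess.check_call([os.path.join(WIKIPEDIA, 'fragmenter.py'), DUMP, OUT])

    # 2. Strip the wiki markup from every fragment; formatter.py writes
    #    an inputFile.f next to each input file.
    for name in os.listdir(OUT):
        if not name.endswith('.f'):
            subprocess.check_call([os.path.join(WIKIPEDIA, 'formatter.py'),
                                   os.path.join(OUT, name)])

    # 3. Tag every *.f file in the folder; stats end up in articles/stats.
    subprocess.check_call([os.path.join(ANER, 'NER.py'), OUT])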
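
For intuition about the fragmentation step, here is a hedged
re-implementation sketch. It is not the repository's fragmenter.py; it
streams the dump with xml.etree.ElementTree and writes each <page> element
to a file named after the article id, matching the behavior described above.
It requires Python 3.8+ for the {*} namespace wildcard.

    # Sketch (not the repository's fragmenter.py): stream a MediaWiki dump
    # and write each <page> to its own file named after the article id.
    import os
    import sys
    import xml.etree.ElementTree as ET

    def fragment(dump_path, out_dir):
        os.makedirs(out_dir, exist_ok=True)
        for _, elem in ET.iterparse(dump_path):
            if elem.tag.rsplit('}', 1)[-1] == 'page':
                # The first <id> under <page> is the article id.
                page_id = elem.findtext('.//{*}id')
                with open(os.path.join(out_dir, page_id + '.xml'), 'wb') as out:
                    out.write(ET.tostring(elem, encoding='utf-8'))
                elem.clear()  # keep memory bounded while streaming

    if __name__ == '__main__':
        fragment(sys.argv[1], sys.argv[2])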
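
Finally, the README only says that the .bw.post.NER output has five
columns; their exact meanings are not documented here, so this consumer
sketch treats each row as an opaque 5-tuple of whitespace-separated fields.

    # Sketch for consuming the five-column .bw.post.NER output.
    # The column semantics are not documented in this README, so rows are
    # yielded as raw 5-tuples; map them to names once you know the layout.
    import sys

    def rows(path):
        with open(path, encoding='utf-8') as f:
            for line in f:
                fields = line.split()
                if len(fields) == 5:  # skip blank or malformed lines
                    yield tuple(fields)

    if __name__ == '__main__':
        for row in rows(sys.argv[1]):
            print(row)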