mokha/corpona


corpona

corpona is a library for processing corpus formats (e.g. XML and JSON). Install it with pip: pip install -U corpona.

Examples

Reading NewsML XML format

from corpona import XML
d = XML.parse_xml('2660341.xml')
print(f"Guid: {d.guid}") # access tag attributes as Python attributes
print(f"Language: {d.attributes['xml:lang']}") # attribute names with special characters are read from the attributes dict

contentMeta = d['contentMeta'][0]
print(f"Urgency: {contentMeta['urgency']}")
print(f"Headline: {contentMeta['headline']}")
print(f"Subject: {contentMeta['subject'][0]['name']}")
print("Genres: {}".format(", ".join(g['name'].text for g in contentMeta['genre'])))
print()
content_body = d['contentSet'][0]['inlineXML'][0]['html'][0]['body'][0]
print("Content: ")
for p in content_body['p']:
    print(p)

Getting a Summary of an XML/JSON

from corpona import XML
from corpona import summarize
from pprint import pprint

d = XML.parse_xml('data.xml', namespaces={'http://www.w3.org/XML/1998/namespace': 'xml', })
pprint(summarize(d), indent=4)

pprint(summarize([
    {'key1': 'hello1', 'key2': 1},
    {'key1': 'hello2', 'key2': 2},
    {'key1': 'hello3', 'key2': 3},
    {'key1': 'hello4', 'key2': 4},
]), indent=4)
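
As a rough intuition for what a structural summary of a list of records can contain, the sketch below counts the value types seen under each key. The function name summarize_records_sketch and its output shape are assumptions made for illustration; this is not corpona's implementation, and the output of summarize may look different.

```python
from collections import Counter, defaultdict
from typing import Any, Dict, Iterable, Mapping


def summarize_records_sketch(
    records: Iterable[Mapping[str, Any]],
) -> Dict[str, Dict[str, int]]:
    """Hypothetical helper: map each key to counts of the value types under it."""
    type_counts = defaultdict(Counter)
    for record in records:
        for key, value in record.items():
            # Record the type name of every value observed for this key.
            type_counts[key][type(value).__name__] += 1
    return {key: dict(counts) for key, counts in type_counts.items()}


summary = summarize_records_sketch([
    {'key1': 'hello1', 'key2': 1},
    {'key1': 'hello2', 'key2': 2},
])
print(summary)
```

A summary of this kind is mainly useful for spotting keys that are missing from some records or that hold mixed types before writing any parsing code.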

Find children

from corpona import find_child

data = {"key":["list_item", {"key2":"oo"}, {"key2":"bbb"}]}
print(find_child(data, ["key", "key2"]))
print(find_child(data, ["key", "key3"], default_value="ok"))

>> ['oo', 'bbb']
>> ['ok']
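
The lookup above can be pictured as a path walk that fans out over lists and skips items that are not dictionaries. The following is a minimal sketch under that assumption. find_child_sketch is a hypothetical name, not part of corpona, and the library's real traversal rules may differ; the sketch only reproduces the two outputs shown above.

```python
from typing import Any, List, Sequence


def find_child_sketch(
    node: Any, path: Sequence[str], default_value: Any = None
) -> List[Any]:
    """Hypothetical helper: collect every value reachable by following `path`."""

    def walk(current: Any, remaining: List[str]) -> List[Any]:
        if not remaining:
            # The whole path has been consumed, so this value is a match.
            return [current]
        if isinstance(current, list):
            # Fan out: try the remaining path on every list item.
            found: List[Any] = []
            for item in current:
                found.extend(walk(item, remaining))
            return found
        if isinstance(current, dict) and remaining[0] in current:
            return walk(current[remaining[0]], remaining[1:])
        # Scalars and dicts without the key contribute nothing.
        return []

    matches = walk(node, list(path))
    # Assumption: the default is returned once, and only when nothing matched.
    return matches if matches else [default_value]


data = {"key": ["list_item", {"key2": "oo"}, {"key2": "bbb"}]}
print(find_child_sketch(data, ["key", "key2"]))
print(find_child_sketch(data, ["key", "key3"], default_value="ok"))
```

Returning a list in both cases keeps the caller's code uniform: it can always iterate over the result without checking whether the path matched.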

Cite

If you use the library in an academic paper, please cite it:

Alnajjar, K. & Hämäläinen, M. (2021). Corpona – The Pythonic Way of Processing Corpora. In Hämäläinen, M., Partanen, N. & Alnajjar, K. (eds.), Multilingual Facilitation. University of Helsinki, pp. 25–30.

@inbook{3bd164164c8648b986cb14a4a8524423,
title = "Corpona – The Pythonic Way of Processing Corpora",
author = "Khalid Alnajjar and Mika H{\"a}m{\"a}l{\"a}inen",
year = "2021",
language = "English",
pages = "25--30",
editor = "Mika H{\"a}m{\"a}l{\"a}inen and Niko Partanen and Khalid Alnajjar",
booktitle = "Multilingual Facilitation",
publisher = "University of Helsinki",
address = "Finland",
}

Need NLP solutions for your business?

Our company, Rootroo, offers consulting on multilingual NLP tasks. We have a strong academic background in state-of-the-art AI solutions for every NLP need. Just contact us, we won't bite.
