The task of the text extraction-18 was to extract all speeches of the 1st to 18th legislative periods from
the XML documents passed by the crawler and to output them as JSON files.
The crawler supplied these documents by temporarily persisting the plenary minutes as
XML files and passing them as parameters to the text extraction-18.
The basic idea is to examine the text block semantically and syntactically in order to derive a search
algorithm that finds the following items within the text block, assigns them to a person, and extracts them:
- Title of the speech
- Name of the speaker
- Affiliation
- Date of the speech
- Speech
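
The five extracted items can be pictured as one record per speech that is then written out as JSON. The following is only a minimal sketch under assumptions: the class name `SpeechRecord`, the JSON key names, and the hand-written serializer are hypothetical and are not taken from the text extraction-18 itself.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical container for the five items extracted per speech. */
final class SpeechRecord {
    final String title;
    final String speaker;
    final String affiliation;
    final String date;
    final String text;

    SpeechRecord(String title, String speaker, String affiliation, String date, String text) {
        this.title = title;
        this.speaker = speaker;
        this.affiliation = affiliation;
        this.date = date;
        this.text = text;
    }

    /** Serializes the record as a flat JSON object; the key names are assumptions. */
    String toJson() {
        Map<String, String> fields = new LinkedHashMap<>();
        fields.put("title", title);
        fields.put("speaker", speaker);
        fields.put("affiliation", affiliation);
        fields.put("date", date);
        fields.put("text", text);

        StringBuilder sb = new StringBuilder("{");
        boolean first = true;
        for (Map.Entry<String, String> e : fields.entrySet()) {
            if (!first) {
                sb.append(',');
            }
            sb.append('"').append(e.getKey()).append("\":\"")
              .append(escape(e.getValue())).append('"');
            first = false;
        }
        return sb.append('}').toString();
    }

    /** Escapes the characters that must be escaped inside a JSON string. */
    static String escape(String s) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            switch (c) {
                case '"':  sb.append("\\\""); break;
                case '\\': sb.append("\\\\"); break;
                case '\n': sb.append("\\n"); break;
                case '\r': sb.append("\\r"); break;
                case '\t': sb.append("\\t"); break;
                default:
                    if (c < 0x20) {
                        // Remaining control characters use the six-character escape form.
                        sb.append(String.format("\\u%04x", (int) c));
                    } else {
                        sb.append(c);
                    }
            }
        }
        return sb.toString();
    }
}
```

In a real build a JSON library would normally replace the manual serializer; it is written out here only to keep the sketch free of dependencies.
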
Most of the problems arose from the assumption that the documents were formatted consistently.
Problem | Description |
---|---|
Document type | In the 1st to 14th legislative periods, the XML files specified a document type, but from the 15th legislative period onwards this was no longer the case. |
Formatting | In some documents, the commonalities assumed for the table of contents were also absent, which makes a uniform program for all legislative periods almost impossible. |
Naming of the affiliation | The naming of affiliations also varied widely in spelling. The best example of this is the party "BÜNDNIS 90/DIE GRÜNEN", for which alone several different spellings occur. |
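
One way to cope with the affiliation spellings is to reduce every variant to a comparison key before looking up a canonical name. The sketch below is an illustration under assumptions, not the approach used by the text extraction-18: the class `AffiliationNormalizer`, its methods, and the sample variants in the usage note are hypothetical.

```java
import java.text.Normalizer;
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

/** Hypothetical helper that maps spelling variants of an affiliation to one canonical name. */
final class AffiliationNormalizer {

    // Canonical party name; the umlauts are written as Unicode escapes so the
    // source compiles regardless of the file encoding.
    static final String GREENS = "B\u00DCNDNIS 90/DIE GR\u00DCNEN";

    private static final Map<String, String> CANONICAL = new HashMap<>();
    static {
        CANONICAL.put(key(GREENS), GREENS);
    }

    /** Builds a comparison key: no diacritics, upper case, letters and digits only. */
    static String key(String raw) {
        String decomposed = Normalizer.normalize(raw, Normalizer.Form.NFD);
        String withoutMarks = decomposed.replaceAll("\\p{M}+", "");
        return withoutMarks.toUpperCase(Locale.ROOT).replaceAll("[^A-Z0-9]", "");
    }

    /** Returns the canonical name for known variants, otherwise the trimmed input. */
    static String canonical(String raw) {
        String known = CANONICAL.get(key(raw));
        return known != null ? known : raw.trim();
    }
}
```

With this key, made-up variants such as `Bündnis 90/Die Grünen` or `BUNDNIS 90 / DIE GRUNEN` resolve to the same canonical name, while an unknown affiliation is returned unchanged apart from trimming. Variants that differ in more than case, diacritics, spacing, or punctuation would need additional map entries.
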
- If there is a title
  - the first name is also contained in the same entry and must be saved;
    after that, each entry is a name,
    until the entry that contains a title.
    The next loop pass starts from this title.
- otherwise
  - if a title is followed by a name
    - then each subsequent entry is a name and must be saved,
      until an entry contains a title.
      The next loop pass starts from this title.
Once the speakers have been found for each title, these entries are stored in a map.
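
The grouping of speakers under titles can be sketched roughly as follows. This is a simplified illustration under assumptions: it presumes that each table-of-contents entry has already been classified as a title, a name, or both, and the names `TocSpeakerGrouping`, `Entry`, and `group` are hypothetical rather than taken from the project.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical sketch: collects the speaker names that follow each title. */
final class TocSpeakerGrouping {

    /** Pre-classified table-of-contents entry; a field is null when absent. */
    static final class Entry {
        final String title;
        final String name;

        Entry(String title, String name) {
            this.title = title;
            this.name = name;
        }
    }

    /** Maps each title to the names found in or after its entry, in document order. */
    static Map<String, List<String>> group(List<Entry> entries) {
        Map<String, List<String>> speakersByTitle = new LinkedHashMap<>();
        List<String> current = null;
        for (Entry e : entries) {
            if (e.title != null) {
                // A new title starts a new group; this is the "next loop pass".
                current = speakersByTitle.computeIfAbsent(e.title, t -> new ArrayList<>());
            }
            if (e.name != null && current != null) {
                // Covers both a name in the title entry and names in later entries.
                current.add(e.name);
            }
        }
        return speakersByTitle;
    }
}
```

Names that appear before the first title are skipped in this sketch, and a `LinkedHashMap` is used so that the titles keep the order of the table of contents.
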
mvn clean package
java -jar textextraction-18.jar <plenary minutes xml file>
This creates a JSON file from the plenary minutes containing all speeches they include.