htw-projekt-p2p-volltextsuche/textextraction-18

Textextraction-18

The task of Textextraction-18 was to extract all speeches of the 1st to 18th legislative periods from the XML documents passed by the crawler and to output them as JSON files. The crawler supports this by temporarily persisting the plenary minutes as XML files and passing them as parameters to Textextraction-18.

The basic idea is to examine the text block semantically and syntactically in order to derive a search algorithm that finds the following items within the text block, assigns them to a person, and extracts them:
  • Title of the speech
  • Name of the speaker
  • Affiliation
  • Date of the speech
  • Speech
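
The five items above can be pictured as one record per speech. The following is only a minimal sketch: the class name `SpeechSketch`, the JSON key names, and the hand-written serializer are assumptions made for illustration and are not taken from the project, which may well use a JSON library and different field names.

```java
public class SpeechSketch {
    // Hypothetical field names; the project's real JSON keys are not documented here.
    final String title;
    final String speaker;
    final String affiliation;
    final String date;
    final String text;

    SpeechSketch(String title, String speaker, String affiliation, String date, String text) {
        this.title = title;
        this.speaker = speaker;
        this.affiliation = affiliation;
        this.date = date;
        this.text = text;
    }

    // Wraps a value in double quotes and escapes the characters JSON requires.
    static String quote(String s) {
        StringBuilder sb = new StringBuilder("\"");
        for (char c : s.toCharArray()) {
            switch (c) {
                case '"':  sb.append("\\\""); break;
                case '\\': sb.append("\\\\"); break;
                case '\n': sb.append("\\n"); break;
                case '\r': sb.append("\\r"); break;
                case '\t': sb.append("\\t"); break;
                default:
                    if (c < 0x20) {
                        // Remaining control characters become four-digit hex escapes.
                        sb.append(String.format("\\u%04x", (int) c));
                    } else {
                        sb.append(c);
                    }
            }
        }
        return sb.append('"').toString();
    }

    // Serializes the record as a single JSON object.
    String toJson() {
        return "{\"title\":" + quote(title)
                + ",\"speaker\":" + quote(speaker)
                + ",\"affiliation\":" + quote(affiliation)
                + ",\"date\":" + quote(date)
                + ",\"text\":" + quote(text) + "}";
    }

    public static void main(String[] args) {
        // Placeholder values only; no real speech data.
        SpeechSketch s = new SpeechSketch(
                "Example title", "Example speaker", "Example party",
                "example date", "Example speech text.");
        System.out.println(s.toJson());
    }
}
```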


Problems

Most of the problems arose from the assumption that there was consistency in the formatting of the documents.
Document type: In the 1st to 14th legislative periods, the XML files declared a document type, but from the 15th legislative period onwards this was no longer the case.
Formatting: In some documents, the common features assumed for the table of contents were also missing, which makes a uniform program for all legislative periods almost impossible.
Naming of the affiliation: There were also many spelling differences in the naming of affiliations. The best example of this is the party "BÜNDNIS 90/DIE GRÜNEN". For the different spellings of this party alone:
  • BÜNDNIS 90/DIE GRÜNEN
  • BÜNDNIS 90/
    DIE GRÜNEN
  • BÜNDNIS 90

    /DIE GRÜNEN
  • BÜNDNIS 90/DIE GRÜ-
    NEN
three regular expressions had to be changed to make them work.
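
To illustrate how such spelling variants can be folded into one canonical form, here is a minimal sketch. The pattern and the class name `AffiliationNormalizer` are assumptions for illustration, not the expressions actually used in the project; the sketch only covers the line-break and hyphenation variants listed above.

```java
import java.util.regex.Pattern;

public class AffiliationNormalizer {
    // Hypothetical pattern: tolerates whitespace and line breaks around the
    // slash, and a hyphenated line break inside "GRÜNEN".
    private static final Pattern GRUENE = Pattern.compile(
            "BÜNDNIS\\s*90\\s*/\\s*DIE\\s+GRÜ(?:-\\s*)?NEN");

    static final String CANONICAL = "BÜNDNIS 90/DIE GRÜNEN";

    // Replaces every matched variant in the text with the canonical spelling.
    static String normalize(String text) {
        return GRUENE.matcher(text).replaceAll(CANONICAL);
    }

    public static void main(String[] args) {
        System.out.println(normalize("BÜNDNIS 90/\n    DIE GRÜNEN"));
        System.out.println(normalize("BÜNDNIS 90\n\n    /DIE GRÜNEN"));
        System.out.println(normalize("BÜNDNIS 90/DIE GRÜ-\n    NEN"));
    }
}
```

Normalizing before matching keeps the later speaker search independent of how the minutes happened to wrap the party name.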
The decisive step in the extraction was searching the table of contents for persons. Within the search itself, however, there were differences that had to be handled separately. The search is implemented by the method createMap() in the SpeechSearch class. Described in words, the method proceeds roughly as follows:
  • IF the entry contains a title
    • the first name is contained in the same entry and must be saved;
      after that, each entry is a name
      until an entry contains a title.
      The next loop pass starts from this title.
  • otherwise
    • if a title is followed by a name
      • then each subsequent entry is a name and must be saved
        until an entry contains a title.
        The next loop pass starts from this title.

Once the speakers have been found for each title, these entries are saved in a map.
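
The flow described above can be sketched as follows. This is not the project's createMap() implementation: the `Entry` type, its two nullable fields, and the `Map<String, List<String>>` return type are assumptions chosen so that both cases (a title entry that already carries the first name, and a title followed by separate name entries) fit one loop.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class SpeechSearchSketch {
    /** Hypothetical table-of-contents entry; either part may be null. */
    static final class Entry {
        final String title;
        final String name;

        Entry(String title, String name) {
            this.title = title;
            this.name = name;
        }
    }

    // Groups speaker names under the most recent title, preserving document order.
    static Map<String, List<String>> createMap(List<Entry> entries) {
        Map<String, List<String>> speakersByTitle = new LinkedHashMap<>();
        List<String> current = null;
        for (Entry e : entries) {
            if (e.title != null) {
                // A title starts (or resumes) the group that later names belong to.
                current = speakersByTitle.computeIfAbsent(e.title, t -> new ArrayList<>());
            }
            if (e.name != null && current != null) {
                // Covers both a name in the title entry and a name-only entry.
                current.add(e.name);
            }
            // Names that appear before any title are skipped in this sketch.
        }
        return speakersByTitle;
    }
}
```

A `LinkedHashMap` is used here so that the titles come out in the order in which they appear in the table of contents.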


Build Jar File

mvn clean package


Execute

java -jar textextraction-18.jar <plenary minutes xml file>
This creates a JSON file containing all speeches recorded in the plenary minutes.
