Textextraction-18

The task of Textextraction-18 was to extract all speeches of the 1st to 18th legislative periods from the XML documents passed by the crawler and to output them as JSON files. The crawler made this possible by temporarily persisting the plenary minutes as XML files and passing them as parameters to Textextraction-18.

The basic idea is to examine the text block semantically and syntactically in order to derive a search algorithm that finds the following items within the text block:
● Title of the speech
● Name of the speaker
● Affiliation
● Date of the speech
● Speech
The algorithm assigns these items to a person and extracts them.
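The extracted items can be pictured as one small value object per speech. The following is only a sketch: the class name `Speech`, its field names, and the use of `LocalDate` are assumptions for illustration and are not taken from the project's source.

```java
import java.time.LocalDate;
import java.util.Objects;

// Hypothetical container for one extracted speech; names are illustrative only.
final class Speech {
    private final String title;        // title of the speech
    private final String speaker;      // name of the speaker
    private final String affiliation;  // party or other affiliation
    private final LocalDate date;      // date of the speech
    private final String text;         // the speech itself

    Speech(String title, String speaker, String affiliation, LocalDate date, String text) {
        // Reject missing values early so incomplete extractions are noticed.
        this.title = Objects.requireNonNull(title, "title");
        this.speaker = Objects.requireNonNull(speaker, "speaker");
        this.affiliation = Objects.requireNonNull(affiliation, "affiliation");
        this.date = Objects.requireNonNull(date, "date");
        this.text = Objects.requireNonNull(text, "text");
    }

    String getTitle() { return title; }
    String getSpeaker() { return speaker; }
    String getAffiliation() { return affiliation; }
    LocalDate getDate() { return date; }
    String getText() { return text; }
}
```

Keeping all five items in one immutable object makes it straightforward to hand a complete speech to whatever component writes the JSON output.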

Problems

Most of the problems arose from the assumption that the documents were formatted consistently, which turned out not to hold.

Problem	Description
Document type	In the 1st to 14th legislative periods, the XML files declared a document type, but from the 15th legislative period onwards this was no longer the case.
Formatting	In some documents, the commonalities assumed for the table of contents were also missing, which makes a uniform program for all legislative periods almost impossible.
Naming of the affiliation	There were also many spelling differences in the naming of affiliations. The best example is the party "BÜNDNIS 90/DIE GRÜNEN". To handle the different spellings of this party alone ("BÜNDNIS 90/DIE GRÜNEN", "BÜNDNIS 90/ DIE GRÜNEN", "BÜNDNIS 90 /DIE GRÜNEN", "BÜNDNIS 90/DIE GRÜ- NEN"), three regular expressions had to be changed.
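One way to cope with such spelling variants is to normalize an affiliation string before matching it. The sketch below is not the project's code, which adjusted its regular expressions instead; the class `AffiliationNormalizer` and its method are hypothetical. It assumes that a hyphen followed by whitespace between two letters is always a line-break hyphenation, which may not hold for every affiliation.

```java
import java.util.regex.Pattern;

// Hypothetical helper, not part of the project: collapses spelling variants of an affiliation.
final class AffiliationNormalizer {
    // "GRÜ- NEN" -> "GRÜNEN": hyphen plus whitespace between two letters (assumed to be a line break).
    private static final Pattern LINE_BREAK_HYPHEN = Pattern.compile("(?<=\\p{L})-\\s+(?=\\p{L})");
    // "90/ DIE" or "90 /DIE" -> "90/DIE": drop whitespace around a slash.
    private static final Pattern SLASH_SPACING = Pattern.compile("\\s*/\\s*");
    // Any remaining run of whitespace becomes a single space.
    private static final Pattern WHITESPACE = Pattern.compile("\\s+");

    private AffiliationNormalizer() {}

    static String normalize(String raw) {
        String s = LINE_BREAK_HYPHEN.matcher(raw).replaceAll("");
        s = SLASH_SPACING.matcher(s).replaceAll("/");
        return WHITESPACE.matcher(s).replaceAll(" ").trim();
    }
}
```

With this normalization, all four spellings listed in the table map to the same string, so a single comparison or a single pattern is enough afterwards.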

The decisive step in the extraction was searching the table of contents for persons. Within the search itself, however, there were differences that had to be handled separately. The search was implemented in the method createMap() of the SpeechSearch class. Described in words, the method proceeds roughly as follows:

IF the entry contains a title
- the first name is contained in the same entry and must be saved;
  after that, each entry is a name
  until an entry contains a title.
  The next loop pass starts from this title.
OTHERWISE
- IF a title is followed by a name
  - each subsequent entry is a name and must be saved
    until an entry contains a title.
    The next loop pass starts from this title.

Once the speakers have been found for each title, these entries are saved in a map.
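The grouping step described above can be sketched as follows. This is not the actual createMap() implementation: the class `TocGrouping`, the method `groupSpeakersByTitle`, and the `isTitle` predicate are assumptions made for illustration, and the sketch covers only the case where titles and names are separate entries. The case where the first name shares an entry with the title would need an additional splitting step that is not shown.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Predicate;

// Hypothetical sketch of grouping table-of-contents entries into title -> speaker names.
final class TocGrouping {
    private TocGrouping() {}

    static Map<String, List<String>> groupSpeakersByTitle(List<String> entries,
                                                          Predicate<String> isTitle) {
        // LinkedHashMap keeps the titles in document order.
        Map<String, List<String>> speakersByTitle = new LinkedHashMap<>();
        List<String> currentSpeakers = null;
        for (String entry : entries) {
            if (isTitle.test(entry)) {
                // A title starts a new group; following entries belong to it.
                currentSpeakers = speakersByTitle.computeIfAbsent(entry, k -> new ArrayList<>());
            } else if (currentSpeakers != null) {
                // Every entry up to the next title is treated as a speaker name.
                currentSpeakers.add(entry);
            }
            // Names that appear before the first title are skipped.
        }
        return speakersByTitle;
    }
}
```

How a title is recognized is deliberately left to the caller, because the source does not say how the project distinguishes titles from names. A caller might, for example, pass `e -> e.startsWith("Tagesordnungspunkt")` as the predicate.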

Build Jar File

mvn clean package

Execute

java -jar textextraction-18.jar <plenary minutes xml file>
Creates a JSON file from the plenary minutes with all speeches they contain.
