htw-projekt-p2p-volltextsuche/Crawler

Crawler software that downloads and processes speeches of the Bundestag through external text processors.

Functionality

The crawler downloads speeches from the official Bundestag website at regular, configurable intervals.

The speeches are processed through separate executables, and the returned results are inserted into the P2P network for indexing. Only electoral terms (Wahlperioden) 18 and later are supported; protocols from earlier terms are skipped.

How to run without Docker

  1. Make sure you have the .NET 5.0 SDK (or higher) installed.
  2. Clone the repository.
  3. Build the application with dotnet publish ./Crawler -c Release
  4. Switch to the build directory ./Crawler/bin/Release/net5.0/publish
  5. Edit appsettings.json and add the required MongoDB connection string MongoConnectionString as well as the endpoint of the indexing API IndexingApiEndpoint. Optionally adjust the indexing interval.
  6. Run the application ./Crawler.exe (a full session is sketched below).
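
Putting the steps together, a session might look like this; the clone URL is inferred from the repository path above, and on Linux/macOS the published binary is ./Crawler instead of ./Crawler.exe:

git clone https://github.com/htw-projekt-p2p-volltextsuche/Crawler.git
cd Crawler
dotnet publish ./Crawler -c Release
cd ./Crawler/bin/Release/net5.0/publish
# edit appsettings.json here, then start the crawler:
./Crawler.exe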

Running with Docker Compose

Persistent mount points

  • The path /app/data/ (by default) must be persisted to the host; otherwise the crawler cannot track which protocols have already been indexed.

Application settings

Application settings can be changed either by mounting an appsettings.json file into the container or via environment variables.

  • If mounting an appsettings.json file, mount it to /app/appsettings.json inside the container.
  • If configuring via environment variables, use the option names from appsettings.json as the environment variable names, e.g. IndexingApiEndpoint or Interval. A compose sketch combining both approaches follows below.
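
A minimal docker-compose.yml sketch combining both approaches; the image name, service names, and host paths are placeholders rather than values defined by this repository:

version: "3.8"
services:
  crawler:
    image: crawler:latest                # placeholder image name
    environment:
      # Option names mirror the keys in appsettings.json
      IndexingApiEndpoint: "http://indexer:8421/api"
      MongoConnectionString: "mongodb://mongo:27017"
    volumes:
      # Persist the local tracking database (see above)
      - ./data:/app/data
      # Alternatively, mount a complete configuration file:
      # - ./appsettings.json:/app/appsettings.json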

Default appsettings.json

{
  // Interval (CRON Expression) in which the Bundestag website will be crawled
  "Interval": "* * * * *",

  // One-time delay in seconds before the crawler evaluates the CRON expression and starts running on the given "Interval" schedule
  "InitialDelay": 0,

  // Delay in seconds between HTTP calls to the indexing API when saving a chunk of protocols
  "ChunkDelay": 0,

  // Maximum batch size in which documents will be POSTed to the indexing API endpoint
  "MaximumBatchSize": 5,

  // Database which will be used to save speeches
  "MongoConnectionString": "mongodb://0.0.0.0:8430",
  "MongoDatabase": "crawler",
  "MongoCollection": "protocols",

  // Database that is used to determine which documents of the Bundestag have already been indexed
  "LocalDbConnectionString": "Data Source=data/local.db",

  // Indexing API endpoint (without a trailing slash!)
  "IndexingApiEndpoint": "http://0.0.0.0:8421/api",

  // Timeout in seconds for HTTP requests to the API endpoints
  "IndexingApiTimeout": 300
}
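
The default "Interval" of * * * * * triggers a crawl every minute. Since this is a standard five-field CRON expression, a gentler schedule can be configured, for example:

  // Crawl once per day at 03:00
  "Interval": "0 3 * * *"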

External Specifications

MongoDB speech document

The crawler writes extracted speeches to a MongoDB database so the frontend can access them easily. The document representing a speech is defined as follows:

{
    // This ID is automatically generated by the crawler, as the official data source does not provide unique IDs for all speeches. It is a UUID as per the version 4 specification.
    "_id": "uuid",

    // Title of the speech.
    "title": "string",
    
    // The name of the speaker.
    "speaker": "string",

    // Affiliation of the speaker. This can be their role in the Bundestag or, for example, their party.
    "affiliation": "string",

    // Date of the speech in the format yyyy-mm-dd
    "date": "yyyy-mm-dd",

    // Entire processed content of the speech.
    "text": "string"
}
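
For illustration, a resulting document might look like this; every value below is invented and not taken from the real dataset:

{
    // Example only: all values are invented
    "_id": "4c0cfb50-1b4e-4f0a-9d3c-7a2f0e6b8d11",
    "title": "Regierungserklärung zur Digitalpolitik",
    "speaker": "Max Mustermann",
    "affiliation": "SPD",
    "date": "2021-06-11",
    "text": "Sehr geehrte Damen und Herren, ..."
}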
