What is Apache Tika Server?

Apache Tika Server is a server edition of Apache Tika.

Apache Tika is a content detection and analysis framework. It allows users to easy text-extraction for thousand different file types (such as PPT, XLS, and PDF) in the single interface. Tika useful for search engine indexing, content analysis, translation, and much more.

Tika is a project of the Apache Software Foundation and was formely a subproject of Apache Lucene.

wikipedia.org/wiki/Apache_Tika

How to use this image

Run Tika Server

$ docker run -d -p 9998:9998 --name some-tika kujira/tika

...via `docker-compose`

Example docker-compose.yml for kujira/tika:

version: '3.1'

services:
    tika:
        image: kujira/tika
        restart: always
        ports:
          - 9998:9998

Where to Store Data

Tika does not use data store.

Port Mapping

Tika exposes 9998 port in the container. Just add -p 9998:9998 to the docker run arguments and then access either http://localhost:9998 or http://host-ip:9998 in a browser.

Environment Variables

When you start the kujira/tika image, you can adjust the configuration of the instance by passing one or more environment variables on the docker run command line.

`TIKA_TEXT_CSV_CONFIDENCE`

Set a fraction value to determine text file is csv or not in. Default is 0.5.

`TIKA_HTML_EXTRACT_SCRIPTS`

Set true to enable HTML script extraction. Default is false.

`TIKA_OFFICE_EXTRACT_MACROS`

Set true to enable VBA macro extraction. Default is false.

`TIKA_OFFICE_INCLUDE_DELETED_CONTENT`

Set true to enable deleted content extraction. Default is false.

`TIKA_OFFICE_INCLUDE_MOVED_CONTENT`

Set true to enable moved content extraction. Default is false.

`TIKA_OFFICE_INCLUDE_SHAPE_CONTENT`

Set true to enable moved content extraction. Default is false.

`TIKA_OFFICE_INCLUDE_HEADER_FOOTER`

Set false to disable header and footer extraction. Default is true.

`TIKA_OFFICE_INCLUDE_MISSING_ROWS`

Set true to enable missing rows extraction. Default is false.

`TIKA_OFFICE_INCLUDE_SLIDE_NOTES`

Set false to disable slide note extraction. Default is true.

`TIKA_OFFICE_INCLUDE_SLIDE_MASTER`

Set false to disable slide master extraction. Default is true.

`TIKA_OFFICE_INCLUDE_PHONETIC`

Set false to disable concatenate phonetic(aka. furigana) extraction. Default is true. When true, 山田太郎ヤマダタロウ will be extracted to 山田太郎ヤマダタロウ.

`TIKA_OFFICE_USE_SAX_EXTRACTOR`

Set true to enable SAX docx and pptx extraction. Default is false.

`TIKA_OFFICE_DATE_OVERRIDE_FORMAT`

Sets the format of date string. Default is yyyy-mm-dd. You can find custom date format here.

`TIKA_TESSERACT_OCR_LANGUAGE`

Set Tesseract OCR language model name. Default is eng. You can join several model names with + character, like this: eng+deu+fra

Supported model names

deu (German)
eng (English)
fra (French)
ita (Itarian)
jpn (Japanese)
jpn_vert (Japanese Vertical)
spa (Spanish)

`TIKA_TESSERACT_OCR_TIMEOUT`

Set the timeout of tesseract ocr execution. Default is 120.

`TIKA_TESSERACT_OCR_AUTO_ROTATE`

Set true to automatic rotate image if needs. Default is false.

`TIKA_PDF_INCLUDE_BOOKMARK`

Set true to enable bookmark extraction. Default is false.

`TIKA_PDF_INCLUDE_ANNOTATION`

Set true to enable annotation extraction. Default is false.

`TIKA_PDF_OCR_STRATEGY`

Set an OCR stragegy string. Default is no_ocr.

OCR strategy values

no_ocr (default)
ocr_only
ocr_and_text
auto

Name		Name	Last commit message	Last commit date
Latest commit History 10 Commits
assets		assets
Dockerfile		Dockerfile
README.md		README.md
docker-compose.yml		docker-compose.yml

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

What is Apache Tika Server?

How to use this image

Run Tika Server

...via `docker-compose`

Where to Store Data

Port Mapping

Environment Variables

`TIKA_TEXT_CSV_CONFIDENCE`

`TIKA_HTML_EXTRACT_SCRIPTS`

`TIKA_OFFICE_EXTRACT_MACROS`

`TIKA_OFFICE_INCLUDE_DELETED_CONTENT`

`TIKA_OFFICE_INCLUDE_MOVED_CONTENT`

`TIKA_OFFICE_INCLUDE_SHAPE_CONTENT`

`TIKA_OFFICE_INCLUDE_HEADER_FOOTER`

`TIKA_OFFICE_INCLUDE_MISSING_ROWS`

`TIKA_OFFICE_INCLUDE_SLIDE_NOTES`

`TIKA_OFFICE_INCLUDE_SLIDE_MASTER`

`TIKA_OFFICE_INCLUDE_PHONETIC`

`TIKA_OFFICE_USE_SAX_EXTRACTOR`

`TIKA_OFFICE_DATE_OVERRIDE_FORMAT`

`TIKA_TESSERACT_OCR_LANGUAGE`

`TIKA_TESSERACT_OCR_TIMEOUT`

`TIKA_TESSERACT_OCR_AUTO_ROTATE`

`TIKA_PDF_INCLUDE_BOOKMARK`

`TIKA_PDF_INCLUDE_ANNOTATION`

`TIKA_PDF_OCR_STRATEGY`

About

Releases

Packages

Languages

greendesertsnow/docker-tikaserver

Folders and files

Latest commit

History

Repository files navigation

What is Apache Tika Server?

How to use this image

Run Tika Server

...via docker-compose

Where to Store Data

Port Mapping

Environment Variables

TIKA_TEXT_CSV_CONFIDENCE

TIKA_HTML_EXTRACT_SCRIPTS

TIKA_OFFICE_EXTRACT_MACROS

TIKA_OFFICE_INCLUDE_DELETED_CONTENT

TIKA_OFFICE_INCLUDE_MOVED_CONTENT

TIKA_OFFICE_INCLUDE_SHAPE_CONTENT

TIKA_OFFICE_INCLUDE_HEADER_FOOTER

TIKA_OFFICE_INCLUDE_MISSING_ROWS

TIKA_OFFICE_INCLUDE_SLIDE_NOTES

TIKA_OFFICE_INCLUDE_SLIDE_MASTER

TIKA_OFFICE_INCLUDE_PHONETIC

TIKA_OFFICE_USE_SAX_EXTRACTOR

TIKA_OFFICE_DATE_OVERRIDE_FORMAT

TIKA_TESSERACT_OCR_LANGUAGE

TIKA_TESSERACT_OCR_TIMEOUT

TIKA_TESSERACT_OCR_AUTO_ROTATE

TIKA_PDF_INCLUDE_BOOKMARK

TIKA_PDF_INCLUDE_ANNOTATION

TIKA_PDF_OCR_STRATEGY

About

Resources

Stars

Watchers

Forks

Releases

Packages 0

Languages

...via `docker-compose`

`TIKA_TEXT_CSV_CONFIDENCE`

`TIKA_HTML_EXTRACT_SCRIPTS`

`TIKA_OFFICE_EXTRACT_MACROS`

`TIKA_OFFICE_INCLUDE_DELETED_CONTENT`

`TIKA_OFFICE_INCLUDE_MOVED_CONTENT`

`TIKA_OFFICE_INCLUDE_SHAPE_CONTENT`

`TIKA_OFFICE_INCLUDE_HEADER_FOOTER`

`TIKA_OFFICE_INCLUDE_MISSING_ROWS`

`TIKA_OFFICE_INCLUDE_SLIDE_NOTES`

`TIKA_OFFICE_INCLUDE_SLIDE_MASTER`

`TIKA_OFFICE_INCLUDE_PHONETIC`

`TIKA_OFFICE_USE_SAX_EXTRACTOR`

`TIKA_OFFICE_DATE_OVERRIDE_FORMAT`

`TIKA_TESSERACT_OCR_LANGUAGE`

`TIKA_TESSERACT_OCR_TIMEOUT`

`TIKA_TESSERACT_OCR_AUTO_ROTATE`

`TIKA_PDF_INCLUDE_BOOKMARK`

`TIKA_PDF_INCLUDE_ANNOTATION`

`TIKA_PDF_OCR_STRATEGY`

Packages