Tesseract release planning

Here we can plan the next releases of Tesseract.

Future releases

Here are some ideas for future Tesseract releases.

Modernize the code using C++11 (see discussions here and here).
Use LLVM tools: clang-format, clang-tidy, scan-build, sanitizers.
Replace more Tesseract data types (GenericVector, ...) with C++ standard types, especially in the API.
Add a JSON (or XML) output format. It will be used for full OCR and for layout-only output (--psm 2).
Add an option to use alternative binarization methods from Leptonica.
Add an option to output separate files for multipage input (out1.hocr, out2.hocr ...).
Add a multi-threading option to the command line (OpenMP will be disabled at runtime in this mode).
Explore the option to use Protocol Buffers or FlatBuffers for the traineddata.
Improve error handling and don't ignore return values from functions (see discussion).
Replace tprintf etc. with an advanced logging API with log levels.

5.0.0

Advanced logging

Requirements (see also discussion):

Log levels:

trace
debug
info
warning
error
fatal
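
A minimal sketch of how these levels could map onto a logging API. All names here (LogLevel, LevelName, FormatRecord, Logger) are hypothetical and not part of Tesseract; the record format is an assumption as well.

```cpp
#include <cassert>
#include <ostream>
#include <sstream>
#include <string>

// Hypothetical log levels, ordered from least to most severe as listed above.
enum class LogLevel { Trace, Debug, Info, Warning, Error, Fatal };

// Returns the lowercase name of a level, matching the list above.
inline const char* LevelName(LogLevel level) {
  switch (level) {
    case LogLevel::Trace: return "trace";
    case LogLevel::Debug: return "debug";
    case LogLevel::Info: return "info";
    case LogLevel::Warning: return "warning";
    case LogLevel::Error: return "error";
    case LogLevel::Fatal: return "fatal";
  }
  assert(false && "unhandled LogLevel");
  return "unknown";
}

// Formats one record as "[level] message" followed by a newline (assumed format).
inline std::string FormatRecord(LogLevel level, const std::string& message) {
  std::ostringstream line;
  line << '[' << LevelName(level) << "] " << message << '\n';
  return line.str();
}

// Hypothetical logger: records below the threshold are dropped.
class Logger {
 public:
  explicit Logger(std::ostream& out, LogLevel threshold = LogLevel::Info)
      : out_(out), threshold_(threshold) {}

  void set_threshold(LogLevel threshold) { threshold_ = threshold; }

  bool Enabled(LogLevel level) const { return level >= threshold_; }

  // Writes the record if its level is enabled; returns whether it was written.
  bool Log(LogLevel level, const std::string& message) {
    if (!Enabled(level)) return false;
    out_ << FormatRecord(level, message);
    return true;
  }

 private:
  std::ostream& out_;
  LogLevel threshold_;
};
```

With a threshold of LogLevel::Warning, a call such as Log(LogLevel::Info, ...) is dropped, while Log(LogLevel::Error, ...) is written to the stream passed to the constructor (for example std::cerr).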

Related issues:

tesseract-ocr/tesseract#1338

Useful links:

List of Open Source C++ logging libraries

4.0.0

See the release notes.

See also the discussion for issue #1423.

Open issues which should be fixed

Issues with the "bug" label (see list here)
Noise characters recognized with bbox as the entire page #1192
Segmentation fault when using integer models for LSTM training #1573
Report a warning when the Tesseract initialisation code detects an unsupported locale setting. (See comment)
Insufficient error message when output file cannot be created (see issue 1424)
“no best words!!” on mixed language (fra+ara) items (see issue 235)
mgr_.Init(traineddata_path.c_str()):Error:Assert failed (see issue 1075)

Features wanted for this release

Script for installing only selected languages from GitHub (see issue)
https://github.com/zdenop/tessdata_downloader

To be discussed

Depending on available resources and opinions, these suggestions will be added to the plan for the next release, deferred to a future release, or abandoned.

Enhance --list-langs to show additional information for scripts and languages, such as engine type (legacy / LSTM) and version

This will make the command slower, because each file must be opened and parsed. Should this be added as --list-langs-details, or as --list-lang-details for a single language file selected by language code?
--list-langs should also display the directory it is using
Fix the autotools build so that the debug mode uses -O0 as intended
Add an option to select the implementation for dot product (CPU, SSE, AVX, ...)
Relative includes for traineddata

tessedit_load_sublangs should search for the sublangs relative to the parent, rather than starting in the tessdata directory.
More fixes for compiler warnings and issues reported by Coverity Scan
Add a simple bash script for building tesseract
New traineddata format

In addition to the current proprietary format, Tesseract could also support ZIP archives (see discussion).

A possible implementation using libarchive is available, but needs more testing.
"Training light" - Learning by doing (see issue)
Modify text2image to use PrepareDistortedPix() #1052
Schedule date
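
One possible shape for the dot product selection item listed above is a lookup from an option value to a function pointer. This is only a sketch under assumptions: the names (DotProductFn, DotProductGeneric, DotProductUnrolled, SelectDotProduct, Dot) and the option values "generic" and "unrolled" are hypothetical, and the unrolled variant is a plain C++ stand-in for a real SSE or AVX implementation.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Signature shared by all (hypothetical) dot product implementations.
using DotProductFn = double (*)(const double* u, const double* v, std::size_t n);

// Portable reference implementation.
inline double DotProductGeneric(const double* u, const double* v, std::size_t n) {
  double total = 0.0;
  for (std::size_t i = 0; i < n; ++i) total += u[i] * v[i];
  return total;
}

// Stand-in for a SIMD variant: four independent accumulators, plain C++ only.
inline double DotProductUnrolled(const double* u, const double* v, std::size_t n) {
  double acc[4] = {0.0, 0.0, 0.0, 0.0};
  std::size_t i = 0;
  for (; i + 4 <= n; i += 4) {
    for (std::size_t k = 0; k < 4; ++k) acc[k] += u[i + k] * v[i + k];
  }
  double total = acc[0] + acc[1] + acc[2] + acc[3];
  for (; i < n; ++i) total += u[i] * v[i];  // remaining elements
  return total;
}

// Maps an option value to an implementation; returns nullptr for unknown names.
inline DotProductFn SelectDotProduct(const std::string& name) {
  if (name == "generic") return &DotProductGeneric;
  if (name == "unrolled") return &DotProductUnrolled;
  return nullptr;
}

// Convenience wrapper that checks the selection and the vector sizes.
inline double Dot(DotProductFn fn, const std::vector<double>& u,
                  const std::vector<double>& v) {
  assert(fn != nullptr && u.size() == v.size());
  return fn(u.data(), v.data(), u.size());
}
```

Returning nullptr for an unknown name lets the caller decide whether to report an error or fall back to the generic implementation.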

Regression of features from 3.0x

Tesseract 4.0 should be a full replacement for Tesseract 3.05 and have the same features when used with the old OCR engine (--oem 0). The following items still need verification: are they really regressions, or just missing features for LSTM?

User Words (See comment)
User Patterns (See issue)

Fixed in 4.1.0

Features from 3.0x which are missing for LSTM

These features still work with the old OCR engine (--oem 0), but are missing and desired for LSTM.

Black list / White list (See issue). Here is a workaround. Fixed in 4.1.0.
hOCR font info (See comment)

Future release

Here we collect important issues and features for the release(s) following 4.0.0.

New LSTM-based OSD detector (see comment).
Remove Legacy Tesseract Engine (see issue)
Better Multi-language implementation for training (See comment)
ARM SIMD support for dot product #519
Using OpenMP for dot product #983
Remove deprecated code

This does not include OpenCL or the old Tesseract engine.
Tesseract creates output for missing input (see issue 1023).

Mostly solved, but could be improved.
Issue 1353: Patch for /training/tessopt.cpp (see pull request 13)

It looks like it is not possible to run more than one training in the same process. The pull request describes a possible fix, but does not include a complete implementation (low priority).
