This module transforms ABBYY XML documents, generated by ABBYY FineReader 10, into basically accessible ePub 3. The code is optimized for ABBYY XML documents created by the Internet Archive, though it may work for other ABBYY XML as well.
- Unicode-compliant
- Can handle left-to-right and right-to-left text.
- Attempts to recognize running headers, footers, and decimal or page numbers.
  Level of confidence in fuzzy matching can be fine-tuned in config.ini. Errs on the side of minimizing false positives.
- Will use Kakadu image libraries if present, otherwise will fall back to Pillow.
- Accessibility is inherently limited by the input ABBYY FineReader documents. If they are marked up with headings and other semantic markup, that structure will be incorporated into the ePub.
- There is currently no functionality for image description.
- The module can also transform ABBYY XML documents generated by ABBYY FineReader 6. However, those documents are not marked up with headings, so there is no structural navigation for accessibility.
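The Kakadu-or-Pillow fallback described in the feature list can be illustrated with a minimal sketch. It assumes that Kakadu counts as "present" when its kdu_expand binary is on the PATH; that check, and the function name pick_image_backend, are illustrative assumptions and not necessarily how this module detects Kakadu.

```python
import shutil


def pick_image_backend(which=shutil.which):
    """Return "kakadu" if a Kakadu binary is found on PATH, else "pillow".

    `which` is injectable so the lookup can be replaced in tests.
    """
    # Assumption: kdu_expand (Kakadu's decompressor) being on PATH means
    # the Kakadu libraries are installed and usable.
    if which("kdu_expand"):
        return "kakadu"
    # Otherwise fall back to the pure-Python route.
    return "pillow"


if __name__ == "__main__":
    print(pick_image_backend())
```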
- Python 3
- If running epubcheck, a Java Runtime environment
- If running DAISY Ace, NodeJS version >= 6.4.0
- If using Kakadu, install the binaries and add them to your PATH and LD_LIBRARY_PATH
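To check the NodeJS requirement before running DAISY Ace, a small helper like the one below may be useful. It is only a sketch: the helper name node_version_ok is hypothetical, and it assumes the version text has the usual form printed by node --version, such as v8.10.0.

```python
import re

MIN_NODE = (6, 4, 0)  # minimum NodeJS version listed in the requirements


def node_version_ok(version_text, minimum=MIN_NODE):
    """Return True if text such as 'v8.10.0' is at least `minimum`."""
    match = re.match(r"v?(\d+)\.(\d+)\.(\d+)", version_text.strip())
    if match is None:
        # Unrecognized output is treated as "not good enough".
        return False
    found = tuple(int(part) for part in match.groups())
    # Tuples compare element by element, so (10, 19, 0) >= (6, 4, 0).
    return found >= minimum
```

The text to check could come from, for example, subprocess.check_output(["node", "--version"], universal_newlines=True).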
From within a Python program:
from abbyy_to_epub3 import create_epub
book = create_epub.Ebook('docname') # See *Assumptions* below.
book.craft_epub()
From the shell:
abbyy2epub docname # See *Assumptions* below.
The available command line arguments are:
usage: abbyy2epub [-h] [-d] [--epubcheck level] [--ace level] docname
Process an ABBYY file into an EPUB
positional arguments:
item_dir The file path where this item's files are kept.
item_identifier The unique ID of this item.
item_bookpath The prefix to a specific book within an item. In a simple
book, usually the same as the item_identifier.
optional arguments:
-h, --help show this help message and exit
-d, --debug Show debugging information
--epubcheck Run EpubCheck on the newly created EPUB, given a severity level
--ace Run DAISY Ace on the newly created EPUB, given a severity level
Epubcheck: If you'd like to run epubcheck, there are certain system dependencies. Depending on your environment, these may need to be installed manually. On Ubuntu, I installed these with:
sudo apt-get install default-jre libpython3-dev
DAISY Ace: If you'd like to run Ace, there are certain system dependencies. Read the installation instructions, but in a nutshell:
- Install NodeJS. Important: You need at least version 6.4.0, which is newer than the version in the package manager for many distributions. (E.g. versions of Ubuntu before 17.10 Artful). If you have an older version on your system and you can't upgrade, consider running NodeJS in an isolated environment such as nodeenv.
- Install Ace:
npm install @daisy/ace -g
- Create a configuration file for the user account that will be running the code, in ~/.config/DAISY Ace/. You can modify the configuration per the documentation (https://daisy.github.io/ace/docs/config/), but be sure to add this block:
{
"cli": {
"return-2-on-validation-error": true
}
}
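If you would rather script the configuration step, the following sketch merges the required block into a JSON configuration file without discarding other settings. The function name ensure_ace_cli_setting is hypothetical, and the file name inside ~/.config/DAISY Ace/ is not stated here, so the path is passed in explicitly; check the Ace documentation for the exact file name.

```python
import json
from pathlib import Path


def ensure_ace_cli_setting(config_path):
    """Add the required "cli" setting to an Ace config file, creating it if needed."""
    path = Path(config_path)
    config = {}
    if path.exists():
        # Preserve whatever the user has already configured.
        config = json.loads(path.read_text(encoding="utf-8"))
    # Only this one key is required by the instructions above.
    config.setdefault("cli", {})["return-2-on-validation-error"] = True
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(json.dumps(config, indent=2) + "\n", encoding="utf-8")
    return config
```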
This package can be installed on your local system. From the directory containing setup.py:
pip install -r requirements.txt
python setup.py develop
pip install .
You can rebuild the documentation, which is generated with Sphinx.
cd docs
make html
Before deploying, make sure you bump the version of the package in __init__.py. Then, run the upload.sh script in the root of the repository and enter the appropriate Internet Archive credentials when prompted.
You can test that the package has been installed correctly by going to https://devpi.archive.org or by running $ pip3 install --upgrade -i https://petaboxdevpi:{PASSWORD}@devpi.archive.org/books/formats abbyy_to_epub3.
Note that petaboxdevpi:{PASSWORD} is not needed inside the IA network.
Run py.test from the top-level app directory. Create new tests in the tests subdirectory.
An item may contain 1 or more books. In order to accommodate this subtlety and delineate between books, an item_dir and item_identifier are not sufficient to isolate a specific book. To circumvent this limitation, we require another identifier called the item_bookpath which acts as a prefix to the files of a specific book. Given a datanode and an item_dir of an item, all the constituent files for a book can be constructed using item_identifier and item_bookpath in the following ways:
- The item_identifier is the unique ID of this item
- The item_dir is the file path where this item's files are kept
- The item_bookpath is the name of the particular book file, often the same as item_identifier
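As a rough illustration of how the three values combine, the sketch below builds the expected file paths for one book using the naming patterns listed in this README ({item_identifier}_meta.xml and {item_bookpath}_{file}). The function name book_file_paths is hypothetical, and the prefixed scandata file name is an assumption based on the general {item_bookpath}_{file} pattern.

```python
from pathlib import Path


def book_file_paths(item_dir, item_identifier, item_bookpath):
    """Return the expected paths of one book's files inside an item directory."""
    item = Path(item_dir)
    return {
        # One metadata manifest per item, keyed by the item identifier.
        "meta": item / f"{item_identifier}_meta.xml",
        # Book-specific files are prefixed with the bookpath.
        "abbyy": item / f"{item_bookpath}_abbyy.gz",
        "jp2": item / f"{item_bookpath}_jp2.zip",
        # Assumption: scandata follows the same prefix pattern.
        "scandata": item / f"{item_bookpath}_scandata.xml",
    }
```

For a simple item, item_identifier and item_bookpath are usually the same string, so all four names share one prefix.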
The structure is assumed to be:
- scandata.xml describes the structure of the book (metadata, page numbers)
- docname_abbyy.gz unzips to docname_abbyy, an XML file generated by ABBYY.
- docname_jp2.zip unzips to a directory called docname_jp2, which includes a number of documents in the format docname_####.jp2.
- The scandata has hopefully marked up one leaf as 'Cover'. Failing that, we will use the first leaf marked 'Title', and failing that, the first leaf marked 'Normal'.
- There is a single global metadata manifest file for the entire item named {item_identifier}_meta.xml.
- All of the other book-specific files follow the form {item_bookpath}_{file}, e.g. {item_bookpath}_abbyy.gz.
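The cover fallback order ('Cover', then 'Title', then 'Normal') can be sketched as follows. This is not the module's own code: the function name pick_cover_leaf is hypothetical, and the input is assumed to be (leaf number, page type) pairs in book order that have already been read from the scandata.

```python
def pick_cover_leaf(leaves):
    """Return the leaf number to use as the cover, or None if nothing qualifies.

    `leaves` is an iterable of (leaf_number, page_type) pairs in book order.
    """
    leaves = list(leaves)
    # Try each page type in priority order; take the first matching leaf.
    for wanted in ("Cover", "Title", "Normal"):
        for leaf_number, page_type in leaves:
            if page_type == wanted:
                return leaf_number
    return None
```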
Module documentation is available at Read The Docs.