diff --git a/README.md b/README.md
new file mode 100644
index 0000000..f36657f
--- /dev/null
+++ b/README.md
@@ -0,0 +1,23 @@
+hocr-spec
+=========
+
+The hOCR Embedded OCR Workflow and Output Format
+
+## About
+
+This repository contains a [Markdown version](./hocr-spec.md) of the
+[hOCR](https://en.wikipedia.org/wiki/HOCR) format specification edited by
+[Thomas Breuel](https://github.com/tmbdev), converted from the May 2010 edition
+hosted on [Google
+Docs](https://docs.google.com/document/d/1QQnIQtvdAC_8n92-LhwPcjtAUFwBlzE8EWnKAxlgVf0/preview).
+
+## Why
+
+The goal of this project is to make the hOCR specification more accessible and
+easier to maintain:
+
+* cross-reference other specs
+* harmonize style
+* track changes without the spam of a world-editable Google Doc
+* structured improvements with GitHub tools
+* add samples
diff --git a/hocr-spec.md b/hocr-spec.md
new file mode 100644
index 0000000..3a90d2a
--- /dev/null
+++ b/hocr-spec.md
@@ -0,0 +1,865 @@
+# The hOCR Embedded OCR Workflow and Output Format
+
+Thomas Breuel (editor)
+
+## Table of Contents
+* [Revision History](#revision-history)
+* [1 Rationale](#1-rationale)
+* [2 Getting Started](#2-getting-started)
+* [3 Terminology and Representation](#3-terminology-and-representation)
+* [4 Logical Structuring Elements](#4-logical-structuring-elements)
+* [5 Typesetting Related Elements](#5-typesetting-related-elements)
+* [6 Inline Representations](#6-inline-representations)
+* [7 Character Information](#7-character-information)
+* [8 OCR Engine-Specific Markup](#8-ocr-engine-specific-markup)
+* [9 Font, Text Color, Language, Direction](#9-font-text-color-language-direction)
+* [10 Alternative Segmentations / Readings](#10-alternative-segmentations--readings)
+* [11 Grouped Elements and Multiple Hierarchies](#11-grouped-elements-and-multiple-hierarchies)
+* [12 Capabilities](#12-capabilities)
+* [13 Profiles](#13-profiles)
+* [14 Required Meta Information](#14-required-meta-information)
+* [15 HTML Markup](#15-html-markup)
+ * [15.1 Restrictions on HTML Content](#151-restrictions-on-html-content)
+ * [15.2 Recommendations for Mappings](#152-recommendations-for-mappings)
+ * [15.2.1 html_none](#1521-html_none)
+ * [15.2.2 html_simple](#1522-html_simple)
+ * [15.2.3 html_ocr_](#1523-html_ocr_)
+ * [15.2.4 html_absolute_](#1524-html_absolute_)
+ * [15.2.5 html_xytable_absolute](#1525-html_xytable_absolute)
+ * [15.2.6 html_xytable_relative](#1526-html_xytable_relative)
+ * [15.2.7 html_](#1527-html_)
+* [16 Document Meta Information](#16-document-meta-information)
+* [17 Sample Usage](#17-sample-usage)
+
+## Revision History
+
+### 2016-03-02
+
+* Markdown version (@kba)
+
+### March 2010
+
+* bug fixes, clarifications (@tmbdev)
+
+### December 2007
+
+* initial release (@tmbdev)
+
+
+## 1 Rationale
+
+The purpose of this document is to define an open standard for representing OCR
+results. The goal is to reuse as much existing technology as possible, and to
+arrive at a representation that makes it easy to reuse OCR results.
+
+
+## 2 Getting Started
+
+This document describes many tags and a lot of information that can be output.
+However, getting started with hOCR is easy: you only need to output the tags
+and information you actually want to. For example, just outputting `ocr_line`
+tags with bounding boxes is already very useful for many applications. Just
+start simple and add more output information as the need arises.
+
+
+## 3 Terminology and Representation
+
+This document describes a representation of various aspects of OCR output in an
+XML-like format. That is, we define a set of tags containing text and other
+tags, together with attributes of those tags.
+
+However, since the content we are representing is formatted text, we are not
+actually using a new XML format for the representation; instead, we embed the
+representation in XHTML (or HTML) because XHTML and XHTML processing already
+define many aspects of OCR output representation that would otherwise need
+additional, separate, and ad hoc definitions. These aspects include:
+
+* standard representations for common logical structuring elements, including
+ section headings, citations, tables, emphasis, line breaks, quotations,
+ and preformatted text
+* standard representations for fonts, embedded images, embedded vector
+ graphics, tables, languages, writing direction, colors
+* standard representations for geometric layout and positioning
+* output files that are understood without any further modification by widely
+ used viewers (browsers), editors, conversion tools, and indexing tools
+* libraries for parsing and generating the content
+* support for document metadata
+
+We are embedding this information inside HTML by encoding it within valid tags
+and attributes inside HTML. We are going to use the terms "elements" and
+"properties" for referring to embedded markup.
+
+Elements are defined by the `class=` attribute on an arbitrary HTML tag. All
+elements in this format have a class name of the form `ocr…_…`.
+
+Properties are defined by putting information into the `title=` attribute of an
+HTML tag. Properties in title attributes are of the form “name values…”, and
+multiple properties are separated by semicolons.
+
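+Here is an example (the ids and coordinates are illustrative):
+
+```html
+<div class="ocr_page" id="page_1">
+  <div class="ocr_carea" id="column_2" title="bbox 313 324 733 1922">
+    <div class="ocr_par" id="par_7"> ... </div>
+    <div class="ocr_par" id="par_19"> ... </div>
+  </div>
+</div>
+```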
+
+The following properties can apply to most elements (where it makes sense):
+
+* `bbox x0 y0 x1 y1` – the bounding box of the element relative to the
+ binarized document image
+ * use `x_bboxes` below for character bounding boxes
+ * do not use `bbox` unless the bounding box of the layout component is, in
+ fact, rectangular
+ * some non-rectangular layout components may have rectangular bounding boxes
+ if the non-rectangularity is caused by floating elements around which text flows
+
+* `textangle alpha` - the angle in degrees by which textual content has been
+ rotated relative to the rest of the page (if not present, the angle is assumed
+ to be zero); rotations are counter-clockwise, so an angle of 90 degrees is
+ vertical text running from bottom to top in Latin script; note that this is
+ different from reading order, which should be indicated using standard HTML
+ properties
+
+The following properties can apply to most elements but should not be used
+unless there is no alternative:
+
+* `poly x0 y0 x1 y1 ...` - a closed polygon for elements with non-rectangular bounds
+ * this property must not be used unless there is no other way of
+ representing the layout of the page using rectangular bounding boxes,
+ since most tools will simply not have the capability of dealing with
+ non-rectangular layouts
+ * note that the natural and correct representation of many non-rectangular
+ layouts is in terms of rectangular content areas and rectangular floats
+ * documents using polygonal borders anywhere must indicate this in the
+ metadata
+ * documents should attempt to provide a reasonable bbox equivalent as well
+* `order n` – the reading order of the element (an integer)
+ * this property must not be used unless there is no other way of representing
+ the reading order of the page by element ordering within the page, since
+ many tools will not be able to deal with content that is not in reading order
+ * presence must be declared in the document meta data
+
+The following property relates the flow between multiple `ocr_carea` elements,
+and between `ocr_carea` and `ocr_linear` elements.
+
+* `cflow s` – the content flow on the page that this element is a part of
+ * s must be a unique string for each content flow
+ * must be present on ocr_carea and ocrx_block tags when reading order is
+ attempted and multiple content flows are present
+ * presence must be declared in the document meta data
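+
+As a rough illustration of `cflow`, the fragment below sketches a page with two
+interleaved content flows. The flow names `article1` and `article2`, the
+coordinates, and the choice of `<div>` tags are assumptions made up for this
+sketch; only the `cflow` and `bbox` properties come from the text above.
+
+```html
+<!-- Illustrative only: flow names and coordinates are invented. -->
+<div class="ocr_page" title="bbox 0 0 2300 3200">
+  <div class="ocr_carea" title="bbox 100 100 1100 1500; cflow article1">...</div>
+  <div class="ocr_carea" title="bbox 1200 100 2200 1500; cflow article2">...</div>
+  <div class="ocr_carea" title="bbox 100 1600 1100 3000; cflow article1">...</div>
+</div>
+```
+
+A consumer could then collect all areas sharing one `cflow` value to recover a
+single flow, in document order.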
+
+This property applies primarily to text lines:
+
+* `baseline pn pn-1 … p0` - a polynomial describing the baseline of a line of
+ text
+ * the polynomial is in the coordinate system of the line, with the bottom
+ left of the bounding box as the origin
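+
+The properties in this section might be combined on line elements roughly as
+follows. This is only a sketch: the coordinates, the angle, and the two baseline
+coefficients (read here as `p1 p0` of a degree-one polynomial) are invented
+values, and the use of `<span>` is an assumption.
+
+```html
+<!-- Illustrative only: all numeric values are invented. -->
+<span class="ocr_line" title="bbox 120 400 1980 460; baseline 0.002 -12">A nearly horizontal line of text</span>
+<span class="ocr_line" title="bbox 60 500 110 1400; textangle 90">Vertical text running from bottom to top</span>
+```
+
+Each `title` holds "name values…" pairs separated by semicolons, so a reader of
+the file can split on `;` first and on whitespace second.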
+
+## 4 Logical Structuring Elements
+
+We recognize the following logical structuring elements:
+* `ocr_document`
+ * `ocr_linear`
+ * `ocr_title`
+ * `ocr_author`
+ * `ocr_abstract`
+ * `ocr_part [H1]`
+ * `ocr_chapter [H1]`
+ * `ocr_section [H2]`
+ * `ocr_sub*section [H3,H4]`
+ * `ocr_display`
+ * `ocr_blockquote [BLOCKQUOTE]`
+ * `ocr_par [P]`
+
+These logical tags have their standard meaning as used in the publishing
+industry and tools like LaTeX, MS Word, and others.
+
+The standard HTML tags given in brackets specify the preferred HTML tags to use
+with those logical structuring elements, but it may not be possible or
+desirable to actually choose those tags (e.g., when adding hOCR information to
+an existing HTML output routine).
+
+For all of these elements except `ocr_linear`, there exists a natural linear
+ordering defined by reading order (`ocr_linear` indicates that the elements
+contained in it have a linear ordering). At the level of `ocr_linear`, there
+may not be a single distinguished order. A common example of `ocr_linear` is a
+newspaper, in which a single newspaper may contain many linear streams (such as
+articles), but there is no unique reading order among the different streams. OCR
+evaluation tools should therefore be sensitive to the order of all elements
+other than `ocr_linear`.
+
+Tags must be nested as indicated by nesting above, but not all tags within the
+hierarchy need to be present.
+
+Textual information like section numbers and bullets must be represented as
+text inside the containing element.
+
+Documents whose logical structure does not map naturally onto these logical
+structuring elements must not use them for other purposes.
+
+Image captions may be indicated using the `ocr_caption` element; such an
+element refers to the image(s) contained within the same float, or the
+immediately adjacent image if both the image and the `ocr_caption` element are
+in running text.
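+
+A small document using some of these elements could look like the sketch below.
+It assumes a nesting of document, linear stream, chapter, section, and
+paragraph, and it uses `<div>` rather than the bracketed heading tags; both the
+tag choices and the text are made up for illustration.
+
+```html
+<!-- Illustrative only: tag choices and text are invented. -->
+<div class="ocr_document">
+  <div class="ocr_linear">
+    <div class="ocr_title">A Sample Report</div>
+    <div class="ocr_chapter">
+      <div class="ocr_section">
+        1.1 Background
+        <p class="ocr_par">The section number above is kept as ordinary text.</p>
+      </div>
+    </div>
+  </div>
+</div>
+```
+
+Levels that the OCR system cannot recover (for example `ocr_part`) are simply
+left out, since not all tags in the hierarchy need to be present.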
+
+
+## 5 Typesetting Related Elements
+
+The following typesetting related elements are based on a typesetting model as
+found in most typesetting systems, including
+[XSL:FO](https://www.w3.org/TR/xsl11/#fo-section),
+[(La)TeX](https://latex-project.org/guides/usrguide.pdf),
+[LibreOffice](https://wiki.documentfoundation.org/images/e/e6/WG42-WriterGuideLO.pdf),
+and Microsoft Word.
+
+In those systems, each page is divided into a number of areas. Each area can
+be a part of the body text (or multiple body texts, in the case of
+newspaper layouts). The content of the areas derives from a linear stream of
+textual content, which flows into the areas, filling them linewise in their
+preferred directions.
+
+Overlaid onto the page is a set of floating elements; floating elements exist
+outside the normal reading order. Floating elements may be introduced by the
+textual content, or they may be related to the page itself (anchoring is a
+logical property). In typesetting systems, floating elements may be anchored to
+the page, to paragraphs, or to the content stream. Floating elements can
+overlap content areas and render on top of or under content, or they can force
+content to flow around them. The default for floating elements in this spec is
+that their anchor is undefined (it is a logical property, not a typesetting
+property), and that text flows around them. Note that with rectangular content
+areas and rectangular floats, already a wide variety of non-rectangular text
+shapes can be realized.
+
+**Issue: there is currently no way of indicating anchoring or flow-around
+properties for floating elements; properties need to be defined for this.**
+
+The typesetting related elements therefore are:
+
+* `ocr_page`
+* `ocr_carea` ("ocr content area" or "body area"; used to be called ~~ocr_column~~)
+* `ocr_line` [SPAN]
+* (floats)
+* `ocr_separator` (any separator or similar element)
+* `ocr_noise` (any noise element that isn't part of typesetting)
+
+The `ocr_page` element must be present in all hOCR documents.
+
+The following properties SHOULD be present:
+
+* `bbox`
+ * the bounding box of the page; for pages, the top left corner must be at
+ `(0,0)`, so a typical page bounding box will look like `bbox 0 0 2300 3200`
+* `image imagefile`
+ * image file name used as input
+ * syntactically, must be a UNIX-like pathname or http URL (no Windows pathnames)
+ * may be relative
+ * cannot be resolved to the actual file in general (e.g., if the hOCR file
+ becomes separated from the image files)
+ * if the hOCR file is present in a directory hierarchy or file archive, should
+ resolve to the corresponding image file
+* `imagemd5 checksum`
+ * MD5 fingerprint of the image file that this page was derived from
+ * allows re-associating pages with source images
+* `ppageno n`
+ * the physical page number
+ * the front cover is page number 0
+ * should be unique
+ * must not be present unless the pages in the document have a physical ordering
+ * must not be present unless it is well defined and unique
+* `lpageno string`
+ * the logical page number expressed on the page
+ * may not be numerical (e.g., Roman numerals)
+ * usually is unique
+ * must not be present unless it has been recognized from the page and is unambiguous
+
+The following properties MAY be present:
+
+* `scan_res x_res y_res`
+ * scanning resolution in DPI
+* `x_scanner string`
+ * a representation of the scanner
+* `x_source string`
+ * an implementation-dependent representation of the document source
+ * could be a URL or a /gfs/ path
+ * offsets within a multipage format (e.g., TIFF) may be represented using
+ additional strings or using URL parameters or fragments
+ * examples
+ * `x_source /gfs/cc/clean/012345678911 17`
+ * `x_source http://pageserver/012345678911&page=17`
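+
+Putting several of the page properties together, an `ocr_page` element might
+look like the following sketch. The image path, page number, and resolution are
+hypothetical values chosen for illustration.
+
+```html
+<!-- Illustrative only: path and numbers are invented. -->
+<div class="ocr_page" title="bbox 0 0 2300 3200; image scans/page_0017.png; ppageno 17; scan_res 300 300">
+  ...
+</div>
+```
+
+The relative `image` path is only meaningful while the hOCR file stays next to
+the image files, which is why a checksum such as `imagemd5` can be worth adding.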
+
+The `ocr_carea` elements should appear in reading order unless this is
+impossible because of some other structuring requirement. If the document
+contains multiple `ocr_linear` streams, then each `ocr_carea` must indicate
+which stream it belongs to.
+
+In typesetting systems, content areas are filled with “blocks”, but most of
+those blocks are not recoverable or semantically meaningful. However, one type
+of block is visible and very important for OCR engines: the line. Lines are
+typesetting blocks that only contain glyphs (“inlines” in XSL terminology).
+
+They are represented by the `ocr_line` area. In addition to the standard
+properties, the `ocr_line` area supports the following additional properties:
+
+* `hardbreak n`
+ * a zero (default) indicates that the end of the line is not a hard
+ (explicit) line break, but a break due to text flow
+ * a one indicates that the end of the line is a hard (explicit) line break
+
+Any special characters representing the desired end-of-line processing must be
+present inside the `ocr_line` element. Examples of such special characters are a
+soft hyphen (`&shy;`), a hard line break (`<br>`), or whitespace (` `) for soft
+line breaks.
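+
+The fragment below sketches how `hardbreak` and the end-of-line characters
+could appear together. The text and coordinates are invented; the first line
+ends in a soft line break (trailing whitespace), the second in a hard one.
+
+```html
+<!-- Illustrative only: text and coordinates are invented. -->
+<span class="ocr_line" title="bbox 100 200 900 240; hardbreak 0">a line that wraps because of text flow </span>
+<span class="ocr_line" title="bbox 100 250 600 290; hardbreak 1">the last line of the paragraph<br></span>
+```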
+
+**TODO: unicode point for soft hyphen**
+
+Note that for many documents, the actual ground truth careas are well-defined
+by the document style of the original document before printing and scanning.
+From a single page, the `careas` of the original document style cannot be
+recovered exactly. However, the partition of a document by `ocr_carea` for an
+individual page shall be considered correct relative to ground truth if (1) all
+the text contained in a ground truth carea is fully contained within a single
+ocr_carea, (2) no text outside a ground truth `carea` is contained within an
+ocr_carea, and (3) the `ocr_careas` appear in the same order as the text flow
+relationships between the ground truth careas.
+
+
+The following floats are defined:
+
+* `ocr_float`
+* `ocr_separator`
+* `ocr_textfloat`
+* `ocr_textimage`
+* `ocr_image`
+* `ocr_linedrawing` – something that could be represented well and naturally in
+ a vector graphics format like SVG (even if it is actually represented as PNG)
+* `ocr_photo` – something that requires JPEG or PNG to be represented well
+* `ocr_header`
+* `ocr_footer`
+* `ocr_pageno`
+* `ocr_table`
+
+Floats should not be nested.
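+
+As a sketch of how floats might sit next to a content area on a page, consider
+the fragment below. All coordinates, the image file name, and the caption text
+are assumptions for illustration, as is the choice of `<div>` and `<span>` tags.
+
+```html
+<!-- Illustrative only: coordinates, file name, and text are invented. -->
+<div class="ocr_page" title="bbox 0 0 2300 3200">
+  <div class="ocr_header" title="bbox 100 50 2200 120">Annual Report</div>
+  <div class="ocr_carea" title="bbox 100 200 2200 2400">...</div>
+  <div class="ocr_float" title="bbox 100 2450 1100 3000">
+    <img src="figure1.png" alt="">
+    <span class="ocr_caption">Figure 1: Revenue by quarter</span>
+  </div>
+  <div class="ocr_pageno" title="bbox 1100 3080 1200 3140">17</div>
+</div>
+```
+
+The floats are siblings of the content area rather than children of each other,
+in line with the recommendation that floats should not be nested.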
+
+
+## 6 Inline Representations
+
+There is some content that should behave and flow like text:
+
+* `ocr_glyph` – an individual glyph represented as an image (e.g., an unrecognized character)
+ * must contain a single IMG tag, or be present on one
+* `ocr_glyphs` – multiple glyphs represented as an image (e.g., an unrecognized word)
+ * must contain a single IMG tag, or be present on one
+* `ocr_dropcap` – an individual glyph representing a dropcap
+ * may contain text or an `<IMG>` tag; the `ALT` of the image tag should
+ contain the corresponding text
+* `ocr_glyphs` – a collection of glyphs represented as an image
+ * must contain a single IMG tag, or be present on one
+* `ocr_chem` – a chemical formula
+ * must contain either a single IMG tag or ChemML markup, or be present on one
+* `ocr_math` – a mathematical formula
+ * must contain either a single IMG tag or MathML markup, or be present on one
+
+Mathematical and chemical formulas that float must be put into an `ocr_float`
+section.
+
+Mathematical and chemical formulas that are “display” mode should be put into
+an `ocr_display` section.
+
+Non-breaking spaces must be represented using the HTML `&nbsp;` entity.
+
+Soft hyphens must be represented using the HTML `&shy;` entity.
+
+Different space widths should be indicated using the HTML entities `&ensp;`,
+`&emsp;`, `&thinsp;`, `&zwnj;`, and `&zwj;`.
+
+The HTML `&lrm;` and `&rlm;` entities (indicating writing direction) must not
+be used; all writing direction changes must be indicated with tags.
+
+Other superscripts and subscripts must be represented using the HTML `<sup>` and
+`<sub>` tags, even if special Unicode characters are available.
+
+Furigana and similar constructs must be represented using their correct Unicode
+encoding.
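+
+A line mixing ordinary text with inline elements might be written as in the
+sketch below. The glyph image file name, the coordinates, and the text are
+hypothetical.
+
+```html
+<!-- Illustrative only: file name, coordinates, and text are invented. -->
+<span class="ocr_line" title="bbox 100 200 1400 240">
+  H<sub>2</sub>O and x<sup>2</sup>, followed by an unrecognized character:
+  <span class="ocr_glyph"><img src="glyph_0042.png" alt=""></span>
+</span>
+```
+
+Here the subscript and superscript use tags rather than special Unicode
+characters, and the unrecognized character is carried as a single image.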
+
+
+## 7 Character Information
+
+Character-level information may be put on any element that contains only a
+single "line" of text; if no other layout element applies, the `ocr_cinfo`
+element may be used.
+
+
+* `cuts c1 c2 c3 …`
+ * character segmentation cuts (see below)
+ * there must be a bbox property relative to which the cuts can be interpreted
+* `nlp c1 c2 c3 …`
+ * estimate of the negative log probabilities of each character by the recognizer
+
+For left-to-right writing directions, cuts are sequences of deltas in the x and
+y direction; the first delta in each path is an offset in the x direction
+relative to the last x position of the previous path. The subsequent deltas
+alternate between up and right moves.
+
+Assume a bounding box of `(0,0,300,100)`; then
+
+```
+cuts("10 11 7 19") =
+ [ [(10,0),(10,100)], [(21,0),(21,100)], [(28,0),(28,100)], [(47,0),(47,100)] ]
+cuts("10,50,3 11,30,-3") =
+ [ [(10,0),(10,50),(13,50),(13,100)], [(21,0),(21,30),(18,30),(18,100)] ]
+```
+
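+Here is an example (the values are illustrative):
+
+```html
+<span class="ocr_cinfo" title="bbox 0 0 300 100; cuts 9 11 7,8,-2 15 3">hello</span>
+```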
+
+
+Cuts are between all codepoints contained within the element, including any
+whitespace and control characters. Simply use a delta of 0 (zero) for
+invisible codepoints.
+
+Writing directions other than left-to-right specify cuts as if the bounding box
+for the element had been rotated by a multiple of 90 degrees such that the
+writing direction is left to right, then rotated back.
+
+It is undefined what happens when cut paths intersect, with the exception that
+a delta of 0 always corresponds to an invisible codepoint.
+
+
+## 8 OCR Engine-Specific Markup
+
+A few abstractions are used as intermediate representations in OCR engines,
+although they do not have a meaning that can be defined either in terms of
+typesetting or logical function. Representing them may be useful to represent
+existing OCR output, say for workflow abstractions.
+
+Commonly suggested engine-specific elements are:
+
+* `ocrx_block`
+ * any kind of "block" returned by an OCR system
+ * engine-specific because the definition of a "block" depends on the engine
+* `ocrx_line`
+ * any kind of "line" returned by an OCR system that differs from the standard ocr_line above
+ * might be some kind of "logical" line
+* `ocrx_word`
+ * any kind of "word" returned by an OCR system
+ * engine specific because the definition of a "word" depends on the engine
+
+The meaning of these tags is OCR engine specific. However, generators should
+attempt to ensure the following properties:
+
+* an `ocrx_block` should not contain content from multiple `ocr_carea` elements
+* the union of all `ocrx_block` elements should approximately cover all `ocr_carea` elements
+* an `ocrx_block` should contain either a float or body text, but not both
+* an `ocrx_block` should contain either an image or text, but not both
+* an `ocrx_line` should correspond as closely as possible to an `ocr_line`
+* `ocrx_cinfo` should nest inside `ocrx_line`
+* `ocrx_cinfo` should contain only `x_confs`, `x_bboxes`, and `cuts` attributes
+
+The following properties are defined:
+
+* `x_font s`
+ * OCR-engine specific font names
+* `x_fsize n`
+ * OCR-engine specific font size
+* `x_bboxes b1x0 b1y0 b1x1 b1y1 b2x0 b2y0 b2x1 b2y1 …`
+ * OCR-engine specific boxes associated with each codepoint contained in the
+ element
+ * note that the bbox property is a property for the bounding box of a layout
+ element, not of individual characters
+ * in particular, use `<span class="ocr_cinfo" title="x_bboxes ....">`, not
+ `<span class="ocr_cinfo" title="bbox ...">`
+* `x_confs c1 c2 c3 …`
+ * OCR-engine specific character confidences
+ * `c1` etc. must be numbers
+ * higher values should express higher confidences
+ * if possible, convert character confidences to values between 0 and 100 and
+ have them approximate posterior probabilities (expressed in %)
+* `x_wconf n`
+ * OCR-engine specific confidence for the entire contained substring
+ * n must be a number
+ * higher values should express higher confidences
+ * if possible, convert word confidences to values between 0 and 100 and have
+ them approximate posterior probabilities (expressed in %)
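+
+Word-level engine output with confidences might be recorded as in this sketch.
+The coordinates and confidence values are invented, and wrapping the words in
+an `ocr_line` is just one plausible arrangement.
+
+```html
+<!-- Illustrative only: coordinates and confidences are invented. -->
+<span class="ocr_line" title="bbox 100 200 900 240">
+  <span class="ocrx_word" title="bbox 100 200 260 240; x_wconf 96">hello</span>
+  <span class="ocrx_word" title="bbox 280 200 450 240; x_wconf 71">world</span>
+</span>
+```
+
+With values scaled to the range 0 to 100, a downstream tool can flag words
+below a chosen threshold for manual review.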
+
+
+## 9 Font, Text Color, Language, Direction
+
+OCR-generated font and text color information is encoded using standard HTML
+and CSS attributes on elements with a class of `ocr_...` or `ocrx_...`.
+Language and writing direction should be indicated using the HTML standard
+attributes `lang=` and `dir=`, or alternatively can be indicated as properties on
+elements.
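+
+For instance, language and writing direction could be attached to paragraph
+elements as sketched below. The language codes and coordinates are example
+values only.
+
+```html
+<!-- Illustrative only: languages and coordinates are invented. -->
+<p class="ocr_par" lang="de" dir="ltr" title="bbox 100 200 1100 600">...</p>
+<p class="ocr_par" lang="ar" dir="rtl" title="bbox 1200 200 2200 600">...</p>
+```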
+
+OCR information and presentation information can be separated by putting the
+CSS info related to the OCR in an outer element with an `ocr_` or `ocrx_` class,
+and then overriding it for the presentation by nesting another `<span>` with the
+actual presentation information inside that:
+
+```
+<span class="ocr_cinfo" style="ocr style"><span style="presentation style"> ... </span></span>
+```
+
+The CSS3 text layout attributes can be used when necessary. For example, CSS
+supports writing-mode, direction, glyph-orientation, [ISO-15924-based
+script](http://www.unicode.org/iso15924/codelists.html), text-indent, etc.
+
+
+## 10 Alternative Segmentations / Readings
+
+Alternative segmentations and readings are indicated by a `<span>` with
+`class="alternatives"`. It must contain `<ins>` and `<del>` elements. The first
+contained element should be `<ins>` and represent the most probable
+interpretation, the subsequent ones `<del>`. Each `<ins>` and `<del>` element
+should have `class="alt"` and a property of either `nlp` or `x_cost`. These
+`<span>`, `<ins>`, and `<del>` tags can nest arbitrarily.
+
+Example:
+
+```html
+<span class="alternatives">
+<ins class="alt" title="nlp 0.3">hello</ins>
+<del class="alt" title="nlp 1.1">hallo</del>
+</span>
+```
+
+Whitespace within the `<span>` but outside the contained `<ins>`/`<del>`
+elements is ignored and should be inserted to improve readability of the HTML
+when viewed in a browser.
+
+
+## 11 Grouped Elements and Multiple Hierarchies
+
+The different levels of layout information (logical, physical, engine-specific)
+each form hierarchies, but those hierarchies may not be mutually compatible;
+for example, a single ocr_page may contain information from multiple sections
+or chapters. To represent both hierarchies within a single document, elements
+may be grouped together. That is, two elements with the same class may be
+treated as one element by adding a "groupid identifier" property to them and
+using the same identifier.
+
+Grouped elements should be logically consistent with the markup they represent;
+for example, it is probably not sensible to use grouped elements to interleave
+parts of two different chapters. Therefore, grouped elements should usually be
+adjacent in the markup.
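+
+A chapter that continues across a page boundary could be grouped as in the
+following sketch. The identifier `chapter3`, the page numbers, and the page
+sizes are hypothetical.
+
+```html
+<!-- Illustrative only: the group identifier and page data are invented. -->
+<div class="ocr_page" title="bbox 0 0 2300 3200; ppageno 11">
+  <div class="ocr_chapter" title="groupid chapter3">... start of the chapter ...</div>
+</div>
+<div class="ocr_page" title="bbox 0 0 2300 3200; ppageno 12">
+  <div class="ocr_chapter" title="groupid chapter3">... rest of the chapter ...</div>
+</div>
+```
+
+Both `ocr_chapter` elements share a class and a `groupid`, so a processor may
+treat them as one chapter while the tree still follows the physical pages.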
+
+Applications using hOCR may choose to manipulate grouped elements directly, but
+the simplest way of dealing with them is to transform a document with grouped
+elements into one without grouped elements prior to further processing by first
+removing tags that are not of interest for the subsequent processing step, and
+then collapsing grouped elements into single elements. For example, output
+that contains both logical and physical layout information, where the logical
+layout information uses grouped elements, can be transformed by removing all
+the physical layout information, and then collapsing all split ocr_chapter
+elements into single ocr_chapter elements based on the groupid. The result is
+a simple DOM tree. This transformation can be provided generically as a
+pre-processor or JavaScript.
+
+The presence of grouped elements does not need to be indicated in the header;
+when it affects their operations, hOCR processors should check for the presence
+of grouped elements in the output and fail with an error message if they cannot
+correctly process the hOCR information.
+
+
+## 12 Capabilities
+
+Any program generating files in this output format must indicate in the
+document metadata what kind of markup it is capable of generating. This
+includes listing the exact set of markup sections that the system could have
+generated, even if it did not actually generate them for the particular
+document.
+
+The capability to generate specific properties is given by the prefix `ocrp_…`;
+the important properties are:
+
+* `ocrp_lang` – capable of generating lang= attributes
+* `ocrp_dir` – capable of generating dir= attributes
+* `ocrp_poly` – capable of generating polygonal bounds
+* `ocrp_font` – capable of generating font information (standard font information)
+* `ocrp_nlp` – capable of generating nlp confidences
+
+The capability to generate other specific embedded formats is given by the
+prefix `ocr_embeddedformat_`.
+
+If an OCR engine represents a particular tag but cannot determine reading order
+for that tag, it must specify a capability of `ocr_<tag>_unordered`.
+
+If a document lists a certain capability but no element or attribute is found
+that corresponds to that capability, users of the document may infer that the
+content is absent in the source document. If a capability is not listed, the
+corresponding element or attribute must not be present in the document.
+
+
+## 13 Profiles
+
+hOCR provides standard means of marking up information, but it does not mandate
+the presence or absence of particular kinds of information. For example, an
+hOCR file may contain only logical markup, only physical markup, or only
+engine-specific markup. As a result, merely knowing that OCR output is hOCR
+compliant doesn't tell us whether that file is actually useful for subsequent
+processing.
+
+OCR systems can use hOCR in various different ways internally, but we will
+eventually define some common profiles that mandate what kinds of information
+need to be present in particular kinds of output.
+
+Of particular importance are:
+
+* physical layout profile: OCR output in XHTML format with a defined set of
+ common physical layout markup capabilities (page, carea, floats, line).
+ Logical layout may be present as well, but the document tree structure must
+ represent the physical layout structure, with logical layout elements split
+ and grouped as needed.
+
+* logical layout profile: OCR output in XHTML format with a defined set of
+ common logical layout markup capabilities (linear, chapter, section,
+ subsection). Physical layout may be present as well, but the document tree
+ structure must represent the logical layout structure, with physical layout
+ elements split and grouped as needed.
+
+Other possible profiles might be defined for specific engines or specific document classes:
+
+* common commercial OCR output (e.g., Abbyy)
+ * ocr_page
+ * ocrx_block, ocrx_line, ocrx_word
+ * ocrp_lang
+ * ocrp_font
+* book target
+ * all logical structuring elements (as applicable), except ocr_linear
+ * ocr_page
+* newspaper target
+ * all logical structuring elements (as applicable)
+ * articles map on ocr_linear
+ * ocr_page
+
+
+## 14 Required Meta Information
+
+The OCR system is required to indicate the following using meta tags in the header:
+
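+* `<meta name="ocr-system" content="name version"/>`
+* `<meta name="ocr-capabilities" content="capabilities"/>`
+ * see the capabilities defined above
+
+The OCR system should indicate the following information:
+
+* `<meta name="ocr-number-of-pages" content="number-of-pages"/>`
+* `<meta name="ocr-langs" content="languages-considered-by-ocr"/>`
+ * use ISO 639-1 codes
+ * value may be `unknown`
+* `<meta name="ocr-scripts" content="scripts-considered-by-ocr"/>`
+ * use ISO 15924 letter codes
+ * value may be `unknown`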
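+
+A document header carrying this meta information might look like the sketch
+below. The system name `exampleocr 0.1`, the capability list, and the other
+content values are invented for illustration.
+
+```html
+<!-- Illustrative only: all content values are invented. -->
+<head>
+  <meta name="ocr-system" content="exampleocr 0.1"/>
+  <meta name="ocr-capabilities" content="ocr_page ocr_carea ocr_par ocr_line ocrp_lang"/>
+  <meta name="ocr-number-of-pages" content="2"/>
+  <meta name="ocr-langs" content="en de"/>
+  <meta name="ocr-scripts" content="Latn"/>
+</head>
+```
+
+The capability list names everything the system could have generated, even if
+a given element does not occur in this particular document.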
+
+
+## 15 HTML Markup
+
+The HTML-based markup is orthogonal to the hOCR-based markup; that is, both can
+be chosen independently of one another. The only thing that needs to be
+consistent between the two markups is the text contained within the tags. hOCR
+and other embedded format tags can be put on HTML tags, or they can be put on
+their own `<span>`/`<div>` tags.
+
+There are many different choices possible and reasonable for the HTML markup,
+depending on the use and further processing of the document. Each such choice
+must be indicated in the meta data for the document.
+
+Many mappings derived from existing tools are quite similar, and most follow
+the restrictions and recommendations below already without further
+modifications.
+
+Depending on the particular HTML markup used in the document, the document is
+suitable for different kinds of processing and use. The formats have the
+following intents:
+
+* `html_none`: straightforward equivalent of Goodoc or XDOC
+* `html_ocr`: straightforward recording of commercial OCR system output
+* `html_absolute`: target format for services like Google's View as HTML
+* `html_xytable`: target format for layout-preserving on-screen document viewing
+* `html_simple`: target format for convenient on-line viewing and intermediate format for indexing
+
+As long as a format contains the hOCR information, it can be reprocessed by
+layout analysis software and converted into one of the other formats. In
+particular, we envision layout analysis tools for converting any hOCR document
+into `html_absolute`, `html_xytable`, and `html_simple`. Furthermore,
+internally, a layout analysis system might use `html_xytable` as an
+intermediate format for converting hOCR into `html_simple`.
+
+
+### 15.1 Restrictions on HTML Content
+
+To avoid problems, any use of HTML markup must follow the following rules:
+
+* HTML content must not use class names that conflict with any of those defined in this document (“ocr_*”)
+* HTML content must not use the title= attribute on any element with an ocr_* class for any purposes other than encoding OCR-related properties as described in this document
+
+
+### 15.2 Recommendations for Mappings
+
+When possible, any mapping of logical structure onto HTML should try to follow the following rules:
+
+* the mapping should be "natural" -- similar to what an author of the document
+ might have entered into a WYSIWYG content creation tool
+* text should be in reading order
+* all tags should be used for the intended purpose (and only for the intended
+ purpose) as defined in the HTML 4 spec
+* floats are contained in `