diff --git a/1.2/index.bs b/1.2/index.bs index 8e9d11f..9170920 100644 --- a/1.2/index.bs +++ b/1.2/index.bs @@ -10,7 +10,7 @@ Editor: Konstantin Baierer, UB Mannheim http://github.com/UB-Mannheim, konstanti Former Editor: Thomas Breuel, http://www.9x9.com/ Previous Version: https://github.com/kba/hocr-spec/blob/master/1.1/spec.md Abstract: A subset of HTML for marking up OCR results -Markup Shorthands: markdown on, biblio on +Markup Shorthands: markdown on, biblio on, markup on
 {
@@ -42,7 +42,7 @@ arrive at a representation that makes it easy to reuse OCR results.
 
 This document describes many tags and a lot of information that can be output.
 However, getting started with hOCR is easy: you only need to output the tags
-and information you actually want to.  For example, just outputting `ocr_line`
+and information you actually want to.  For example, just outputting <{ocr_line}>
 tags with bounding boxes is already very useful for many applications.  Just
 start simple and add more output information as the need arises.
 
@@ -97,7 +97,7 @@ multiple properties are separated by semicolons.
 
 The following properties can apply to most elements (where it makes sense):
 
-### `bbox`
+### bbox
 
 `bbox x0 y0 x1 y1`
 
@@ -108,8 +108,8 @@ the lower-right corner (x1, y1).
   * the values are with reference to the top-left corner of the document image
     and measured in pixels
   * the order of the values is `x0 y0 x1 y1` = "left top right bottom"
-  * use `x_bboxes` below for character bounding boxes
-  * do not use `bbox` unless the bounding box of the layout component is, in
+  * use 'x_bboxes' below for character bounding boxes
+  * do not use 'bbox' unless the bounding box of the layout component is, in
     fact, rectangular
   * some non-rectangular layout components may have rectangular bounding boxes
     if the non-rectangularity is caused by floating elements around which text flows
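
As an illustration of the `bbox` rules listed above, the following is a minimal Python sketch of how a consumer might read a single `bbox` property. The names `BBox` and `parse_bbox` are hypothetical helpers, not part of hOCR, and the sketch assumes integer pixel coordinates.

```python
from typing import NamedTuple


class BBox(NamedTuple):
    """Rectangle in image pixels: (x0, y0) is upper-left, (x1, y1) is lower-right."""
    x0: int
    y0: int
    x1: int
    y1: int

    @property
    def width(self) -> int:
        return self.x1 - self.x0

    @property
    def height(self) -> int:
        return self.y1 - self.y0


def parse_bbox(prop: str) -> BBox:
    """Parse one property string such as 'bbox 10 20 160 30' (hypothetical helper)."""
    name, *values = prop.split()
    if name != "bbox" or len(values) != 4:
        raise ValueError(f"not a bbox property: {prop!r}")
    box = BBox(*(int(v) for v in values))
    # The order is "left top right bottom", so the second corner must not
    # lie above or to the left of the first one.
    if box.x1 < box.x0 or box.y1 < box.y0:
        raise ValueError(f"corners out of order: {prop!r}")
    return box
```

For the value `bbox 10 20 160 30`, this sketch yields a box that is 150 pixels wide and 10 pixels high.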
@@ -135,7 +135,7 @@ the document image which border is drawn in black.
 
 
 
-### `textangle`
+### textangle
 
 `textangle alpha`
 
@@ -150,7 +150,7 @@ which should be indicated using standard HTML properties
 The following properties can apply to most elements but should not be used
 unless there is no alternative:
 
-### `poly`
+### poly
 
 `poly x0 y0 x1 y1 ...`
 
@@ -163,11 +163,11 @@ A closed polygon for elements with non-rectangular bounds
   * note that the natural and correct representation of many non-rectangular
     layouts is in terms of rectangular content areas and rectangular floats
   * documents using polygonal borders anywhere must indicate this by adding
-    [[#ocrp_poly]] to the list of `ocr-capabilities` in the
-    [[#required-meta-information]]
-  * documents should attempt to provide a reasonable bbox equivalent as well
+    ''ocr-capabilities/ocrp_poly'' to the list of 'ocr-capabilities' (see
+    [[#required-meta-information]])
+  * documents should attempt to provide a reasonable 'bbox' equivalent as well
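
One way to derive the "reasonable bbox equivalent" mentioned in the list above is to take the smallest axis-aligned rectangle that encloses all polygon points. The Python sketch below shows this under stated assumptions: `poly_to_bbox` is a hypothetical helper, coordinates are integers, and a polygon needs at least three points. Other choices of an equivalent box may suit a given layout better.

```python
def poly_to_bbox(prop: str) -> str:
    """Return the enclosing 'bbox x0 y0 x1 y1' for a 'poly x0 y0 x1 y1 ...' property."""
    name, *values = prop.split()
    # Assumption: at least three (x, y) pairs, and an even number of values.
    if name != "poly" or len(values) < 6 or len(values) % 2:
        raise ValueError(f"not a poly property: {prop!r}")
    coords = [int(v) for v in values]
    xs, ys = coords[0::2], coords[1::2]
    return f"bbox {min(xs)} {min(ys)} {max(xs)} {max(ys)}"
```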
 
-### `order`
+### order
 
 `order n`
 
@@ -177,27 +177,27 @@ The reading order of the element (an integer)
     the reading order of the page by element ordering within the page, since
     many tools will not be able to deal with content that is not in reading order
 
-### `presence`
+### presence
 
 Issue: [Use of property presence](https://github.com/kba/hocr-spec/issues/10)
 
-`presence` presence must be declared in the document meta data
+'presence' presence must be declared in the document meta data
 
-### `cflow`
+### cflow
 
 `cflow s`
 
-This property relates the flow between multiple [[#ocr_carea]] elements,
-and between [[#ocr_carea]] and [[#ocr_linear]] elements.
+This property relates the flow between multiple <{ocr_carea}> elements,
+and between <{ocr_carea}> and <{ocr_linear}> elements.
 
 The content flow on the page that this element is a part of
 
   * s must be a unique string for each content flow
-  * must be present on [[#ocr_carea]] and [[#ocrx_block]] tags when reading
+  * must be present on <{ocr_carea}> and <{ocrx_block}> tags when reading
     order is attempted and multiple content flows are present
   * presence must be declared in the document meta data
 
-### `baseline`
+### baseline
 
 `baseline pn pn-1 ... p0`
 
@@ -220,7 +220,7 @@ contains the following information:
     title="bbox 105 66 823 113; baseline 0.015 -18">...
 ```
 
-bbox is the bounding box of the line in image coordinates (blue). The two
+'bbox' is the bounding box of the line in image coordinates (blue). The two
 numbers for the baseline are the slope (1st number) and constant term (2nd
 number) of a linear equation describing the baseline relative to the bottom
 left corner of the bounding box (red). The baseline crosses the y-axis at `-18`
@@ -237,30 +237,30 @@ and its slope angle is `arctan(0.015) = 0.86°`.
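
The arithmetic of the `eurotext.tif` example can be checked with a short Python sketch. It splits the `title` value on semicolons and evaluates the linear baseline in image coordinates. `parse_title` and `baseline_y` are hypothetical helpers. The sketch assumes that the baseline's y axis points downward like the image's, so that `-18` places the baseline 18 pixels above the bottom edge of the box, and it ignores any quoting inside property values.

```python
import math


def parse_title(title: str) -> dict[str, list[str]]:
    """Split a title attribute into {property name: [value, ...]}."""
    props = {}
    for chunk in title.split(";"):
        parts = chunk.split()
        if parts:
            props[parts[0]] = parts[1:]
    return props


def baseline_y(props: dict[str, list[str]], x: float) -> float:
    """Image y coordinate of a linear baseline at image x coordinate."""
    x0, y0, x1, y1 = (int(v) for v in props["bbox"])
    slope, offset = (float(v) for v in props["baseline"])
    # The polynomial's origin is the bottom-left corner (x0, y1) of the bbox.
    return y1 + slope * (x - x0) + offset


props = parse_title("bbox 105 66 823 113; baseline 0.015 -18")
left_y = baseline_y(props, 105)    # at the left edge of the box
right_y = baseline_y(props, 823)   # at the right edge of the box
angle = math.degrees(math.atan(float(props["baseline"][0])))
```

Under these assumptions the baseline sits at y = 95 at the left edge of the line, and `angle` rounds to the 0.86° quoted in the text.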
 
 We recognize the following logical structuring elements:
 
-  * `ocr_document`
-    * `ocr_linear`
-      * `ocr_title`
-      * `ocr_author`
-      * `ocr_abstract`
-      * `ocr_part` [`

`] - * `ocr_chapter` [`

`] - * `ocr_section` [`

`] + * <{ocr_document}> + * <{ocr_linear}> + * <{ocr_title}> + * <{ocr_author}> + * <{ocr_abstract}> + * <{ocr_part}> [`

`] + * <{ocr_chapter}> [`

`] + * <{ocr_section}> [`

`] * `ocr_sub*section` [`

`,`

`] - * `ocr_display` - * `ocr_blockquote` [`
`] - * `ocr_par` [`

`] - -## `ocr_document` -## `ocr_title` -## `ocr_author` -## `ocr_abstract` -## `ocr_part` -## `ocr_chapter` -## `ocr_section` -## `ocr_subsubsection` -## `ocr_display` -## `ocr_blockquote` -## `ocr_par` + * <{ocr_display}> + * <{ocr_blockquote}> [`

`] + * <{ocr_par}> [`

`] + +## ocr_document +## ocr_title +## ocr_author +## ocr_abstract +## ocr_part +## ocr_chapter +## ocr_section +## ocr_subsubsection +## ocr_display +## ocr_blockquote +## ocr_par These logical tags have their standard meaning as used in the publishing industry and tools like LaTeX, MS Word, and others. @@ -270,15 +270,15 @@ with those logical structuring elements, but it may not be possible or desirable to actually chose those tags (e.g., when adding hOCR information to an existing HTML output routine). -## `ocr_linear` +### ocr_linear -For all of these elements except `ocr_linear`, there exists a natural linear -ordering defined by reading order (`ocr_linear` indicates that the elements -contained in it have a linear ordering). At the level of `ocr_linear`, there -may not be a single distinguished order. A common example of `ocr_linear` is a +For all of these elements except <{ocr_linear}>, there exists a natural linear +ordering defined by reading order (<{ocr_linear}> indicates that the elements +contained in it have a linear ordering). At the level of <{ocr_linear}>, there +may not be a single distinguished order. A common example of <{ocr_linear}> is a newspaper, in which a single newspaper may contain many linear, but there is no unique reading order for the different linear. OCR evaluation tools should -therefore be sensitive to the order of all elements other than `ocr_linear`. +therefore be sensitive to the order of all elements other than <{ocr_linear}>. Tags must be nested as indicated by nesting above, but not all tags within the hierarchy need to be present. @@ -289,11 +289,11 @@ text inside the containing element. Documents whose logical structure does not map naturally onto these logical structuring elemetns must not use them for other purpose. 
-## `ocr_caption` +## ocr_caption -Image captions may be indicated using the `ocr_caption` element; such an +Image captions may be indicated using the <{ocr_caption}> element; such an element refers to the image(s) contained within the same float, or the -immediately adjacent image if both the image and the `ocr_caption` element are +immediately adjacent image if both the image and the <{ocr_caption}> element are in running text. @@ -332,57 +332,57 @@ properties for floating elements; properties need to be defined for this. The following classes, as well as [floats](#classes-for-floats) are used for type-setting elements. -### `ocr_page` +### ocr_page -The `ocr_page` element must be present in all hOCR documents. +The <{ocr_page}> element must be present in all hOCR documents. -### `ocr_column` +### ocr_column

**OBSOLETE** -Please use [[#ocr_carea]] instead +Please use <{ocr_carea}> instead
-### `ocr_carea` +### ocr_carea "ocr content area" or "body area" Used to be called ocr_column -The `ocr_carea` elements should appear in reading order unless this is impossible +The <{ocr_carea}> elements should appear in reading order unless this is impossible because of some other structuring requirement. If the document contains multiple -`ocr_linear` streams, then each `ocr_carea` must indicate which stream it belongs +<{ocr_linear}> streams, then each <{ocr_carea}> must indicate which stream it belongs to. Note that for many documents, the actual ground truth careas are well-defined by the document style of the original document before printing and scanning. From a single page, the `careas` of the original document style cannot be -recovered exactly. However, the partition of a document by `ocr_carea` for an +recovered exactly. However, the partition of a document by <{ocr_carea}> for an individual page shall be considered correct relative to ground truth if 1. all the text contained in a ground truth carea is fully contained within a - single `ocr_carea`, + single <{ocr_carea}>, 2. no text outside a ground truth `carea` is contained within an - `ocr_carea`, and - 3. the `ocr_careas` appear in the same order as the text flow + <{ocr_carea}>, and + 3. the <{ocr_carea}> appear in the same order as the text flow relationships between the ground truth careas. -### `ocr_line` +### ocr_line In typesetting systems, content areas are filled with “blocks”, but most of those blocks are not recoverable or semantically meaningful. However, one type of block is visible and very important for OCR engines: the line. Lines are typesetting blocks that only contain glyphs (“inlines” in XSL terminology). -They are represented by the `ocr_line` area. +They are represented by the <{ocr_line}> area. 
-`ocr_line` should be in a `` +<{ocr_line}> should be in a `` -### `ocr_separator` +### ocr_separator Any separator or similar element -### `ocr_noise` +### ocr_noise Any noise element that isn't part of typesetting @@ -395,7 +395,7 @@ The following properties should be present: The bounding box of the page; for pages, the top left corner must be at `(0,0)`, so a typical page bounding box will look like `bbox 0 0 2300 3200` -### `image` +### image `image imagefile` @@ -407,14 +407,14 @@ The bounding box of the page; for pages, the top left corner must be at * if the hOCR file is present in a directory hierarchy or file archive, should resolve to the corresponding image file -### `imagemd5` +### imagemd5 `imagemd5 checksum` * MD5 fingerprint of the image file that this page was derived from * allows re-associating pages with source images -### `ppageno` +### ppageno `ppageno n` @@ -424,7 +424,7 @@ The bounding box of the page; for pages, the top left corner must be at * must not be present unless the pages in the document have a physical ordering * must not be present unless it is well defined and unique -### `lpageno` +### lpageno `lpageno string` @@ -437,19 +437,19 @@ The bounding box of the page; for pages, the top left corner must be at The following properties MAY be present: -### `scan_res` +### scan_res `scan_res x_res y_res` * scanning resolution in DPI -### `x_scanner` +### x_scanner `x_scanner string` * a representation of the scanner -### `x_source` +### x_source `x_source string` @@ -462,9 +462,9 @@ The following properties MAY be present: * `x_source http://pageserver/012345678911&page=17` In addition to the standard -properties, the `ocr_line` area supports the following additional properties: +properties, the <{ocr_line}> area supports the following additional properties: -### `hardbreak` +### hardbreak `hardbreak n` @@ -473,7 +473,7 @@ properties, the `ocr_line` area supports the following additional properties: * a one indicates that the line is a 
hard (explicit) line break Any special characters representing the desired end-of-line processing must be -present inside the `ocr_line` element. Examples of such special characters are a +present inside the <{ocr_line}> element. Examples of such special characters are a soft hyphen ("­", `U+00AD`), a hard line break (`
`), or whitespace (` `) for soft line breaks. @@ -483,48 +483,48 @@ Floats should not be nested. The following floats are defined: -### `ocr_float` +### ocr_float `ocr_float` -### `ocr_separator` +### ocr_separator -`ocr_separator` +`ocr_separator` in the context of float classes. -### `ocr_textfloat` +### ocr_textfloat `ocr_textfloat` -### `ocr_textimage` +### ocr_textimage `ocr_textimage` -### `ocr_image` +### ocr_image `ocr_image` -### `ocr_linedrawing` +### ocr_linedrawing Something that could be represented well and naturally in a vector graphics format like SVG (even if it is actually represented as PNG) -### `ocr_photo` +### ocr_photo Something that requires JPEG or PNG to be represented well -### `ocr_header` +### ocr_header `ocr_header` -### `ocr_footer` +### ocr_footer `ocr_footer` -### `ocr_pageno` +### ocr_pageno `ocr_pageno` -### `ocr_table` +### ocr_table `ocr_table` @@ -534,44 +534,44 @@ There is some content that should behave and flow like text ## Classes for Inline Representation -### `ocr_glyph` +### ocr_glyph An individual glyph represented as an image (e.g., an unrecognized character) Must contain a single `` tag, or be present on one -### `ocr_glyphs` +### ocr_glyphs Multiple glyphs represented as an image (e.g., an unrecognized word) Must contain a single `` tag, or be present on one -### `ocr_dropcap` +### ocr_dropcap An individual glyph representing a dropcap May contain text or an `` tag; the `alt` of the image tag should contain the corresponding text -### `ocr_chem` +### ocr_chem A chemical formula Must contain either a single `` tag or [[CML]] markup, or be present on one -### `ocr_math` +### ocr_math A mathematical formula Must contain either a single `` tag or [[MathML]] markup, or be present on one -Mathematical and chemical formulas that float must be put into an `ocr_float` +Mathematical and chemical formulas that float must be put into an <{ocr_float}> section. 
Mathematical and chemical formulas that are “display” mode should be put into -an `ocr_display` section. +an <{ocr_display}> section. ### Non-breaking space @@ -586,8 +586,9 @@ Different space widths should be indicated using HTML and ` `, `&emsp`, Soft hyphens must be represented using the HTML `­` entity. -The HTML `‎` and `‏` entities (indicating writing direction) must not -be used; all writing direction changes must be indicated with tags. +The HTML `‎` and +`‏` entities (indicating writing direction) must not be used; all +writing direction changes must be indicated with tags. ### Superscript and Subscript @@ -606,20 +607,20 @@ must be represented using their correct Unicode encoding. Character-level information may be put on any element that contains only a single "line" of text. -### `ocr_cinfo` +### ocr_cinfo -If no other layout element applies, the `ocr_cinfo` element may be used. +If no other layout element applies, the <{ocr_cinfo}> element may be used. ## Properties for Character Information -### `cuts` +### cuts `cuts c1 c2 c3 ...` * character segmentation cuts (see below) - * there must be a bbox property relative to which the cuts can be interpreted + * there must be a 'bbox' property relative to which the 'cuts' can be interpreted -### `nlp` +### nlp `nlp c1 c2 c3 ...` @@ -670,21 +671,21 @@ Common suggested engine-specific markup are: ## Classes for engine specific markup -### `ocrx_block` +### ocrx_block Issue: [ocr_carea vs ocrx_block](https://github.com/kba/hocr-spec/issues/28) * any kind of "block" returned by an OCR system * engine-specific because the definition of a "block" depends on the engine -### `ocrx_line` +### ocrx_line Issue: [ocr_line vs ocrx_line](https://github.com/kba/hocr-spec/issues/19) - * any kind of "line" returned by an OCR system that differs from the standard ocr_line above + * any kind of "line" returned by an OCR system that differs from the standard <{ocr_line}> above * might be some kind of "logical" line -### 
`ocrx_word` +### ocrx_word * any kind of "word" returned by an OCR system * engine specific because the definition of a "word" depends on the engine @@ -692,42 +693,44 @@ Issue: [ocr_line vs ocrx_line](https://github.com/kba/hocr-spec/issues/19) The meaning of these tags is OCR engine specific. However, generators should attempt to ensure the following properties: -* an `ocrx_block` should not contain content from multiple ocr_careas -* the union of all `ocrx_blocks` should approximately cover all `ocr_careas` -* an `ocrx_block` should contain either a float or body text, but not both -* an `ocrx_block` should contain either an image or text, but not both -* an `ocrx_line` should correspond as closely as possible to an `ocr_line` -* `ocrx_cinfo` should nest inside `ocrx_line` -* `ocrx_cinfo` should contain only `x_conf`, `x_bboxes`, and `cuts` attributes +* An <{ocrx_block}> should not contain content from multiple <{ocr_carea}>. +* The union of all <{ocrx_block|ocrx_blocks}> should approximately cover all <{ocr_carea}>. +* an <{ocrx_block}> should contain either a float or body text, but not both +* an <{ocrx_block}> should contain either an image or text, but not both +* an <{ocrx_line}> should correspond as closely as possible to an <{ocr_line}> +* <{ocrx_cinfo}> should nest inside <{ocrx_line}> +* <{ocrx_cinfo}> should contain only 'x_confs', 'x_bboxes', and 'cuts' attributes + +Issue: ocrx_cinfo? 
## Properties for engine-specific markup The following properties are defined: -### `x_font` +### x_font `x_font s` * OCR-engine specific font names -### `x_fsize` +### x_fsize `x_fsize n` * OCR-engine specific font size -### `x_bboxes` +### x_bboxes `x_bboxes b1x0 b1y0 b1x1 b1y1 b2x0 b2y0 b2x1 b2y1 ...` * OCR-engine specific boxes associated with each codepoint contained in the element - * note that the bbox property is a property for the bounding box of a layout + * note that the 'bbox' property is a property for the bounding box of a layout element, not of individual characters * in particular, use ``, not `` -### `x_confs` +### x_confs `x_confs c1 c2 c3 ...` @@ -737,7 +740,7 @@ The following properties are defined: * if possible, convert character confidences to values between 0 and 100 and have them approximate posterior probabilities (expressed in %) -### `x_wconf` +### x_wconf `x_wconf n` @@ -777,7 +780,7 @@ Alternative segmentations and readings are indicated by a `` with `class="alternatives"`. It must contains `` and `` elements. The first contained element should be `` and represent the most probable interpretation, the subsequent ones ``. Each `` and `` element should have `class="alt"` and a -property of either `nlp` or `x_cost`. These ``, ``, and `` tags can nest +property of either 'nlp' or 'x_cost'. These ``, ``, and `` tags can nest arbitrarily.
@@ -798,7 +801,7 @@ when viewed in a browser. The different levels of layout information (logical, physical, engine-specific) each form hierarchies, but those hierarchies may not be mutually compatible; -for example, a single `ocr_page` may contain information from multiple sections +for example, a single <{ocr_page}> may contain information from multiple sections or chapters. To represent both hierarchies within a single document, elements may be grouped together. That is, two elements with the same class may be treated as one element by adding a "groupid identifier" property to them and @@ -816,8 +819,8 @@ removing tags that are not of interest for the subsequent processing step, and then collapsing grouped elements into single elements. For example, output that contains both logical and physical layout information, where the logical layout information uses grouped elements, can be transformed by removing all -the physical layout information, and then collapsing all split `ocr_chapter` -elements into single `ocr_chapter` elements based on the groupid. The result is +the physical layout information, and then collapsing all split <{ocr_chapter}> +elements into single <{ocr_chapter}> elements based on the groupid. The result is a simple DOM tree. This transformation can be provided generically as a pre-processor or Javascript. @@ -838,23 +841,23 @@ document. The capability to generate specific properties is given by the prefix `ocrp_...`; the important properties are: -## `ocrp_lang` +## ocrp_lang Capable of generating `lang=` attributes -## `ocrp_dir` +## ocrp_dir Capable of generating `dir=` attributes -## `ocrp_poly` +## ocrp_poly Capable of generating [polygonal bounds](#poly) -## `ocrp_font` +## ocrp_font Capable of generating font information (standard font information) -## `ocrp_nlp` +## ocrp_nlp Capable of generating [nlp confidences](#nlp) @@ -880,16 +883,31 @@ corresponding element or attribute must not be present in the document. 
The OCR system is required to indicate the following using meta tags in the header: +### ocr-system + * `` + +### ocr-capabilities + * `` * see [[#capabilities]] +## Recommended Meta Information + The OCR system should indicate the following information +### ocr-number-of-pages + * `` + +### ocr-langs + * `` * use [ISO 639-1](https://www.loc.gov/standards/iso639-2/php/code_list.php) codes * value may be `unknown` + +### ocr-scripts + * `` * use [ISO 15924](http://www.unicode.org/iso15924/codelists.html) letter codes * value may be `unknown` @@ -930,17 +948,17 @@ Other possible profiles might be defined for specific engines or specific document classes: * common commercial OCR output (e.g., Abbyy) - * ocr_page - * ocrx_block, ocrx_line, ocrx_word - * ocrp_lang - * ocrp_font + * <{ocr_page}> + * <{ocrx_block}>, <{ocrx_line}>, <{ocrx_word}> + * ''ocr-capabilities/ocrp_lang'' + * ''ocr-capabilities/ocrp_font'' * book target - * all logical structuring elements (as applicable), except ocr_linear - * ocr_page + * all logical structuring elements (as applicable), except <{ocr_linear}> + * <{ocr_page}> * newspaper target * all logical structuring elements (as applicable) - * articles map on ocr_linear - * ocr_page + * articles map on <{ocr_linear}> + * <{ocr_page}> # HTML Markup @@ -1200,3 +1218,7 @@ Issue: [correct MIME type for hOCR?](https://github.com/kba/hocr-spec/issues/27) : Applications which use this media type: : File extension(s): :: `*.html`, `*.hocr` + + + + diff --git a/1.2/index.html b/1.2/index.html index a27e208..8bde9db 100644 --- a/1.2/index.html +++ b/1.2/index.html @@ -1185,93 +1185,6 @@ [data-md] > :last-child { margin-bottom: 0; } - - + + +

hOCR - OCR Workflow and Output embedded in HTML

-

Living Standard,

+

Living Standard,

This version: @@ -1403,7 +1440,7 @@

@@ -1425,35 +1462,38 @@

Table of Contents

  • 3.1 General Properties
      -
    1. 3.1.1 bbox -
    2. 3.1.2 textangle +
    3. 3.1.1 bbox +
    4. 3.1.2 textangle
  • 3.2 Non-recommended general properties
      -
    1. 3.2.1 poly -
    2. 3.2.2 order -
    3. 3.2.3 presence -
    4. 3.2.4 cflow -
    5. 3.2.5 baseline +
    6. 3.2.1 poly +
    7. 3.2.2 order +
    8. 3.2.3 presence +
    9. 3.2.4 cflow +
    10. 3.2.5 baseline
  • 4 Logical Structuring Elements
      -
    1. 4.1 ocr_document -
    2. 4.2 ocr_title -
    3. 4.3 ocr_author -
    4. 4.4 ocr_abstract -
    5. 4.5 ocr_part -
    6. 4.6 ocr_chapter -
    7. 4.7 ocr_section -
    8. 4.8 ocr_subsubsection -
    9. 4.9 ocr_display -
    10. 4.10 ocr_blockquote -
    11. 4.11 ocr_par -
    12. 4.12 ocr_linear -
    13. 4.13 ocr_caption +
    14. 4.1 ocr_document +
    15. 4.2 ocr_title +
    16. 4.3 ocr_author +
    17. 4.4 ocr_abstract +
    18. 4.5 ocr_part +
    19. 4.6 ocr_chapter +
    20. 4.7 ocr_section +
    21. 4.8 ocr_subsubsection +
    22. 4.9 ocr_display +
    23. 4.10 ocr_blockquote +
    24. + 4.11 ocr_par +
        +
      1. 4.11.1 ocr_linear +
      +
    25. 4.12 ocr_caption
  • 5 Typesetting Related Elements @@ -1461,44 +1501,44 @@

    Table of Contents

  • 5.1 Classes for typesetting elements
      -
    1. 5.1.1 ocr_page -
    2. 5.1.2 ocr_column -
    3. 5.1.3 ocr_carea -
    4. 5.1.4 ocr_line -
    5. 5.1.5 ocr_separator -
    6. 5.1.6 ocr_noise +
    7. 5.1.1 ocr_page +
    8. 5.1.2 ocr_column +
    9. 5.1.3 ocr_carea +
    10. 5.1.4 ocr_line +
    11. 5.1.5 ocr_separator +
    12. 5.1.6 ocr_noise
  • 5.2 Recommended Properties for typesetting elements
    1. 5.2.1 bbox (typesetting) -
    2. 5.2.2 image -
    3. 5.2.3 imagemd5 -
    4. 5.2.4 ppageno -
    5. 5.2.5 lpageno +
    6. 5.2.2 image +
    7. 5.2.3 imagemd5 +
    8. 5.2.4 ppageno +
    9. 5.2.5 lpageno
  • 5.3 Optional Properties for typesetting elements
      -
    1. 5.3.1 scan_res -
    2. 5.3.2 x_scanner -
    3. 5.3.3 x_source -
    4. 5.3.4 hardbreak +
    5. 5.3.1 scan_res +
    6. 5.3.2 x_scanner +
    7. 5.3.3 x_source +
    8. 5.3.4 hardbreak
  • 5.4 Classes for floats
      -
    1. 5.4.1 ocr_float -
    2. 5.4.2 ocr_separator -
    3. 5.4.3 ocr_textfloat -
    4. 5.4.4 ocr_textimage -
    5. 5.4.5 ocr_image -
    6. 5.4.6 ocr_linedrawing -
    7. 5.4.7 ocr_photo -
    8. 5.4.8 ocr_header -
    9. 5.4.9 ocr_footer -
    10. 5.4.10 ocr_pageno -
    11. 5.4.11 ocr_table +
    12. 5.4.1 ocr_float +
    13. 5.4.2 ocr_separator +
    14. 5.4.3 ocr_textfloat +
    15. 5.4.4 ocr_textimage +
    16. 5.4.5 ocr_image +
    17. 5.4.6 ocr_linedrawing +
    18. 5.4.7 ocr_photo +
    19. 5.4.8 ocr_header +
    20. 5.4.9 ocr_footer +
    21. 5.4.10 ocr_pageno +
    22. 5.4.11 ocr_table
  • @@ -1507,11 +1547,11 @@

    Table of Contents

  • 6.1 Classes for Inline Representation
      -
    1. 6.1.1 ocr_glyph -
    2. 6.1.2 ocr_glyphs -
    3. 6.1.3 ocr_dropcap -
    4. 6.1.4 ocr_chem -
    5. 6.1.5 ocr_math +
    6. 6.1.1 ocr_glyph +
    7. 6.1.2 ocr_glyphs +
    8. 6.1.3 ocr_dropcap +
    9. 6.1.4 ocr_chem +
    10. 6.1.5 ocr_math
    11. 6.1.6 Non-breaking space
    12. 6.1.7 Non-default spaces
    13. 6.1.8 Hyphenation @@ -1525,13 +1565,13 @@

      Table of Contents

    14. 7.1 Classes for Character Information
        -
      1. 7.1.1 ocr_cinfo +
      2. 7.1.1 ocr_cinfo
    15. 7.2 Properties for Character Information
        -
      1. 7.2.1 cuts -
      2. 7.2.2 nlp +
      3. 7.2.1 cuts +
      4. 7.2.2 nlp
  • @@ -1540,18 +1580,18 @@

    Table of Contents

  • 8.1 Classes for engine specific markup
      -
    1. 8.1.1 ocrx_block -
    2. 8.1.2 ocrx_line -
    3. 8.1.3 ocrx_word +
    4. 8.1.1 ocrx_block +
    5. 8.1.2 ocrx_line +
    6. 8.1.3 ocrx_word
  • 8.2 Properties for engine-specific markup
      -
    1. 8.2.1 x_font -
    2. 8.2.2 x_fsize -
    3. 8.2.3 x_bboxes -
    4. 8.2.4 x_confs -
    5. 8.2.5 x_wconf +
    6. 8.2.1 x_font +
    7. 8.2.2 x_fsize +
    8. 8.2.3 x_bboxes +
    9. 8.2.4 x_confs +
    10. 8.2.5 x_wconf
  • 9 Font, Text Color, Language, Direction @@ -1560,19 +1600,31 @@

    Table of Contents

  • 12 Capabilities
      -
    1. 12.1 ocrp_lang -
    2. 12.2 ocrp_dir -
    3. 12.3 ocrp_poly -
    4. 12.4 ocrp_font -
    5. 12.5 ocrp_nlp +
    6. 12.1 ocrp_lang +
    7. 12.2 ocrp_dir +
    8. 12.3 ocrp_poly +
    9. 12.4 ocrp_font +
    10. 12.5 ocrp_nlp
    11. 12.6 ocr_embeddedformat_<formatname>
    12. 12.7 ocr_<tag>_unordered
  • 13 Metadata
      -
    1. 13.1 Required Meta Information -
    2. 13.2 Document metadata +
    3. + 13.1 Required Meta Information +
        +
      1. 13.1.1 ocr-system +
      2. 13.1.2 ocr-capabilities +
      +
    4. + 13.2 Recommended Meta Information +
        +
      1. 13.2.1 ocr-number-of-pages +
      2. 13.2.2 ocr-langs +
      3. 13.2.3 ocr-scripts +
      +
    5. 13.3 Document metadata
  • 14 Profiles
  • @@ -1610,6 +1662,11 @@

    Table of Contents

  • 17.1 Media Type
  • Conformance +
  • + Index +
      +
    1. Terms defined by this specification +
  • References
      @@ -1630,7 +1687,7 @@

      This document describes many tags and a lot of information that can be output. However, getting started with hOCR is easy: you only need to output the tags -and information you actually want to. For example, just outputting ocr_line tags with bounding boxes is already very useful for many applications. Just +and information you actually want to. For example, just outputting ocr_line tags with bounding boxes is already very useful for many applications. Just start simple and add more output information as the need arises.

      3. Terminology and Representation

      This document describes a representation of various aspects of OCR output in an @@ -1670,17 +1727,17 @@

      < multiple properties are separated by semicolons.

      -
      <div class="ocr_page" id="page_1">
      -  <div class="ocr_carea" id="column_2" title="bbox 313 324 733 1922">
      -    <div class="ocr_par" id="par_7"> ... </div>
      -    <div class="ocr_par" id="par_19"> ... </div>
      -  </div>
      -</div>
      +
      <div class="ocr_page" id="page_1">
      +  <div class="ocr_carea" id="column_2" title="bbox 313 324 733 1922">
      +    <div class="ocr_par" id="par_7"> ... </div>
      +    <div class="ocr_par" id="par_19"> ... </div>
      +  </div>
      +</div>
       

      3.1. General Properties

      The following properties can apply to most elements (where it makes sense):

      -

      3.1.1. bbox

      +

      3.1.1. bbox

      bbox x0 y0 x1 y1

      The bbox - short for "bounding box" - of an element is a rectangular box around this element, which is defined by the upper-left corner (x0, y0) and @@ -1692,9 +1749,9 @@

      3.1
    1. the order of the values is x0 y0 x1 y1 = "left top right bottom"

    2. -

      use x_bboxes below for character bounding boxes

      +

      use x_bboxes below for character bounding boxes

    3. -

      do not use bbox unless the bounding box of the layout component is, in +

      do not use bbox unless the bounding box of the layout component is, in fact, rectangular

    4. some non-rectangular layout components may have rectangular bounding boxes @@ -1703,8 +1760,8 @@

      3.1

      See also the section §5.2.1 bbox (typesetting).

      -
      <span class='ocr_line' id='line_1'
      -    title="bbox 10 20 160 30">...</span>
      +
      <span class='ocr_line' id='line_1'
      +    title="bbox 10 20 160 30">...</span>
       

      The bounding box bbox of this line is shown in blue and is spanned by the upper-left corner (10, 20) and the lower-right corner (160, 30). @@ -1712,7 +1769,7 @@

      3.1 the document image which border is drawn in black.

      bbox explained

      -

      3.1.2. textangle

      +

      3.1.2. textangle

      textangle alpha

      The angle in degrees by which textual content has been rotated relative to the rest of the page (if not present, the angle is assumed to be zero); rotations @@ -1722,7 +1779,7 @@

      3.2. Non-recommended general properties

      The following properties can apply to most elements but should not be used unless there is no alternative:

      -

      3.2.1. poly

      +

      3.2.1. poly

      poly x0 y0 x1 y1 ...

      A closed polygon for elements with non-rectangular bounds

        @@ -1735,11 +1792,11 @@

        3.2

        note that the natural and correct representation of many non-rectangular layouts is in terms of rectangular content areas and rectangular floats

      • -

        documents using polygonal borders anywhere must indicate this by adding §12.3 ocrp_poly to the list of ocr-capabilities in the §13.1 Required Meta Information

        +

        documents using polygonal borders anywhere must indicate this by adding ocrp_poly to the list of ocr-capabilities (see §13.1 Required Meta Information)

      • -

        documents should attempt to provide a reasonable bbox equivalent as well

        +

        documents should attempt to provide a reasonable bbox equivalent as well

      -

      3.2.2. order

      +

      3.2.2. order

      order n

      The reading order of the element (an integer)

        @@ -1748,36 +1805,36 @@

        3. the reading order of the page by element ordering within the page, since many tools will not be able to deal with content that is not in reading order

      -

      3.2.3. presence

      +

      3.2.3. presence

      Use of property presence

      -

      presence presence must be declared in the document meta data

      -

      3.2.4. cflow

      +

      presence presence must be declared in the document meta data

      +

      3.2.4. cflow

      cflow s

      -

      This property relates the flow between multiple §5.1.3 ocr_carea elements, -and between §5.1.3 ocr_carea and §4.12 ocr_linear elements.

      +

      This property relates the flow between multiple ocr_carea elements, +and between ocr_carea and ocr_linear elements.

      The content flow on the page that this element is a part of

      • s must be a unique string for each content flow

      • -

        must be present on §5.1.3 ocr_carea and §8.1.1 ocrx_block tags when reading +

        must be present on ocr_carea and ocrx_block tags when reading order is attempted and multiple content flows are present

      • presence must be declared in the document meta data

      -

      3.2.5. baseline

      +

      3.2.5. baseline

      baseline pn pn-1 ... p0

      This property applies primarily to textlines.

      The baseline is described by a polynomial of order n with the coefficients pn ... p0 with n = 1 for a linear (i.e. straight) line.

      The polynomial is in the coordinate system of the line, with the bottom left of the bounding box as the origin.

      -
      - +
      +

      The hOCR output for the first line of eurotext.tif contains the following information:

      -
      <span class='ocr_line' id='line_1_1'
      -    title="bbox 105 66 823 113; baseline 0.015 -18">...</span>
      +
      <span class='ocr_line' id='line_1_1'
      +    title="bbox 105 66 823 113; baseline 0.015 -18">...</span>
       
      -

      bbox is the bounding box of the line in image coordinates (blue). The two +

      bbox is the bounding box of the line in image coordinates (blue). The two numbers for the baseline are the slope (1st number) and constant term (2nd number) of a linear equation describing the baseline relative to the bottom left corner of the bounding box (red). The baseline crosses the y-axis at -18 and its slope angle is arctan(0.015) = 0.86°.
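The arithmetic in the example can be checked with a short sketch. The function names are hypothetical, and the code assumes that the baseline's y axis points downward like the image's, so that the intercept of -18 lies 18 pixels above the bottom edge of the box; verify this against your OCR engine's output before relying on it.

```python
import math


def baseline_angle_degrees(slope):
    """Angle of a linear baseline, from the first (slope) coefficient."""
    return math.degrees(math.atan(slope))


def baseline_y(bbox, slope, intercept, x):
    """Image y coordinate of a linear baseline at image x coordinate x.

    The origin of the baseline polynomial is the bottom-left corner of
    the bounding box, i.e. (x0, y1) in image coordinates.
    ASSUMPTION: y grows downward, as in image coordinates.
    """
    x0, _y0, _x1, y1 = bbox
    return y1 + slope * (x - x0) + intercept


angle = baseline_angle_degrees(0.015)                        # about 0.86 degrees
y_left = baseline_y((105, 66, 823, 113), 0.015, -18, 105)    # baseline at the left edge
```

For higher-order polynomials the same idea applies, with the coefficients evaluated from the highest order down.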

      @@ -1787,71 +1844,71 @@

      We recognize the following logical structuring elements:

      -

      4.1. ocr_document

      -

      4.2. ocr_title

      -

      4.3. ocr_author

      -

      4.4. ocr_abstract

      -

      4.5. ocr_part

      -

      4.6. ocr_chapter

      -

      4.7. ocr_section

      -

      4.8. ocr_subsubsection

      -

      4.9. ocr_display

      -

      4.10. ocr_blockquote

      -

      4.11. ocr_par

      +

      4.1. ocr_document

      +

      4.2. ocr_title

      +

      4.3. ocr_author

      +

      4.4. ocr_abstract

      +

      4.5. ocr_part

      +

      4.6. ocr_chapter

      +

      4.7. ocr_section

      +

      4.8. ocr_subsubsection

      +

      4.9. ocr_display

      +

      4.10. ocr_blockquote

      +

      4.11. ocr_par

      These logical tags have their standard meaning as used in the publishing industry and tools like LaTeX, MS Word, and others.

      The standard HTML tags given in brackets specify the preferred HTML tags to use with those logical structuring elements, but it may not be possible or desirable to actually choose those tags (e.g., when adding hOCR information to an existing HTML output routine).

      -

      4.12. ocr_linear

      -

      For all of these elements except ocr_linear, there exists a natural linear -ordering defined by reading order (ocr_linear indicates that the elements -contained in it have a linear ordering). At the level of ocr_linear, there -may not be a single distinguished order. A common example of ocr_linear is a +

      4.11.1. ocr_linear

      +

      For all of these elements except ocr_linear, there exists a natural linear +ordering defined by reading order (ocr_linear indicates that the elements +contained in it have a linear ordering). At the level of ocr_linear, there +may not be a single distinguished order. A common example of ocr_linear is a newspaper, in which a single newspaper may contain many linear, but there is no unique reading order for the different linear. OCR evaluation tools should -therefore be sensitive to the order of all elements other than ocr_linear.

      +therefore be sensitive to the order of all elements other than ocr_linear.

      Tags must be nested as indicated by nesting above, but not all tags within the hierarchy need to be present.

      Textual information like section numbers and bullets must be represented as text inside the containing element.

      Documents whose logical structure does not map naturally onto these logical structuring elements must not use them for other purposes.

      -

      4.13. ocr_caption

      -

      Image captions may be indicated using the ocr_caption element; such an +

      4.12. ocr_caption

      +

      Image captions may be indicated using the ocr_caption element; such an element refers to the image(s) contained within the same float, or the -immediately adjacent image if both the image and the ocr_caption element are +immediately adjacent image if both the image and the ocr_caption element are in running text.

      The following typesetting related elements are based on a typesetting model as @@ -1878,53 +1935,53 @@

      The following classes, as well as floats are used for type-setting elements.

      -

      5.1.1. ocr_page

      -

      The ocr_page element must be present in all hOCR documents.

      -

      5.1.2. ocr_column

      +

      5.1.1. ocr_page

      +

      The ocr_page element must be present in all hOCR documents.

      +

      5.1.2. ocr_column

      OBSOLETE -

      Please use §5.1.3 ocr_carea instead

      +

      Please use ocr_carea instead

      -

      5.1.3. ocr_carea

      +

      5.1.3. ocr_carea

      "ocr content area" or "body area"

      Used to be called ocr_column

      -

      The ocr_carea elements should appear in reading order unless this is impossible -because of some other structuring requirement. If the document contains multiple ocr_linear streams, then each ocr_carea must indicate which stream it belongs +

      The ocr_carea elements should appear in reading order unless this is impossible +because of some other structuring requirement. If the document contains multiple ocr_linear streams, then each ocr_carea must indicate which stream it belongs to.

      Note that for many documents, the actual ground truth careas are well-defined by the document style of the original document before printing and scanning. From a single page, the careas of the original document style cannot be -recovered exactly. However, the partition of a document by ocr_carea for an +recovered exactly. However, the partition of a document by ocr_carea for an individual page shall be considered correct relative to ground truth if

      1. all the text contained in a ground truth carea is fully contained within a -single ocr_carea,

        +single ocr_carea,

      2. -

        no text outside a ground truth carea is contained within an ocr_carea, and

        +

        no text outside a ground truth carea is contained within an ocr_carea, and

      3. -

        the ocr_careas appear in the same order as the text flow +

        the ocr_carea appear in the same order as the text flow relationships between the ground truth careas.

      -

      5.1.4. ocr_line

      +

      5.1.4. ocr_line

      In typesetting systems, content areas are filled with “blocks”, but most of those blocks are not recoverable or semantically meaningful. However, one type of block is visible and very important for OCR engines: the line. Lines are typesetting blocks that only contain glyphs (“inlines” in XSL terminology). -They are represented by the ocr_line area.

      -

      ocr_line should be in a <span>

      -

      5.1.5. ocr_separator

      +They are represented by the ocr_line area.

      +

      ocr_line should be in a <span>

      +

      5.1.5. ocr_separator

      Any separator or similar element

      -

      5.1.6. ocr_noise

      +

      5.1.6. ocr_noise

      Any noise element that isn’t part of typesetting

      The following properties should be present:

      5.2.1. bbox (typesetting)

      The bounding box of the page; for pages, the top left corner must be at (0,0), so a typical page bounding box will look like bbox 0 0 2300 3200

      -

      5.2.2. image

      +

      5.2.2. image

      image imagefile

      • @@ -1940,7 +1997,7 @@

        5.

        if the hOCR file is present in a directory hierarchy or file archive, should resolve to the corresponding image file

      -

      5.2.3. imagemd5

      +

      5.2.3. imagemd5

      imagemd5 checksum

      • @@ -1948,7 +2005,7 @@

        allows re-associating pages with source images

      -

      5.2.4. ppageno

      +

      5.2.4. ppageno

      ppageno n

      • @@ -1962,7 +2019,7 @@

      • must not be present unless it is well defined and unique

      -

      5.2.5. lpageno

      +

      5.2.5. lpageno

      lpageno string

      • @@ -1976,19 +2033,19 @@

      5.3. Optional Properties for typesetting elements

      The following properties MAY be present:

      -

      5.3.1. scan_res

      +

      5.3.1. scan_res

      scan_res x_res y_res

      • scanning resolution in DPI

      -

      5.3.2. x_scanner

      +

      5.3.2. x_scanner

      x_scanner string

      • a representation of the scanner

      -

      5.3.3. x_source

      +

      5.3.3. x_source

      x_source string

      • @@ -2008,8 +2065,8 @@

      In addition to the standard -properties, the ocr_line area supports the following additional properties:

      -

      5.3.4. hardbreak

      +properties, the ocr_line area supports the following additional properties:

      +

      5.3.4. hardbreak

      hardbreak n

      • @@ -2019,67 +2076,67 @@

        ocr_line element. Examples of such special characters are a soft hyphen ("­", U+00AD), a hard line break (<br>), or whitespace () for soft line breaks.

        5.4. Classes for floats

        Floats should not be nested.

        The following floats are defined:

        -

        5.4.1. ocr_float

        +

        5.4.1. ocr_float

        ocr_float

        -

        5.4.2. ocr_separator

        -

        ocr_separator

        -

        5.4.3. ocr_textfloat

        +

        5.4.2. ocr_separator

        +

        ocr_separator in the context of float classes.

        +

        5.4.3. ocr_textfloat

        ocr_textfloat

        -

        5.4.4. ocr_textimage

        +

        5.4.4. ocr_textimage

        ocr_textimage

        -

        5.4.5. ocr_image

        +

        5.4.5. ocr_image

        ocr_image

        -

        5.4.6. ocr_linedrawing

        +

        5.4.6. ocr_linedrawing

        Something that could be represented well and naturally in a vector graphics format like SVG (even if it is actually represented as PNG)

        -

        5.4.7. ocr_photo

        +

        5.4.7. ocr_photo

        Something that requires JPEG or PNG to be represented well

        -

        5.4.8. ocr_header

        +

        5.4.8. ocr_header

        ocr_header

        - +

        ocr_footer

        -

        5.4.10. ocr_pageno

        +

        5.4.10. ocr_pageno

        ocr_pageno

        -

        5.4.11. ocr_table

        +

        5.4.11. ocr_table

        ocr_table

        6. Inline Representations

        There is some content that should behave and flow like text

        6.1. Classes for Inline Representation

        -

        6.1.1. ocr_glyph

        +

        6.1.1. ocr_glyph

        An individual glyph represented as an image (e.g., an unrecognized character)

        Must contain a single <img> tag, or be present on one

        -

        6.1.2. ocr_glyphs

        +

        6.1.2. ocr_glyphs

        Multiple glyphs represented as an image (e.g., an unrecognized word)

        Must contain a single <img> tag, or be present on one

        -

        6.1.3. ocr_dropcap

        +

        6.1.3. ocr_dropcap

        An individual glyph representing a dropcap

        May contain text or an <img> tag; the alt of the image tag should contain the corresponding text

        -

        6.1.4. ocr_chem

        +

        6.1.4. ocr_chem

        A chemical formula

        Must contain either a single <img> tag or [CML] markup, or be present on one

        -

        6.1.5. ocr_math

        +

        6.1.5. ocr_math

        A mathematical formula

        Must contain either a single <img> tag or [MathML] markup, or be present on one

        -

        Mathematical and chemical formulas that float must be put into an ocr_float section.

        +

        Mathematical and chemical formulas that float must be put into an ocr_float section.

        Mathematical and chemical formulas that are “display” mode should be put into -an ocr_display section.

        +an ocr_display section.

        6.1.6. Non-breaking space

        Non-breaking spaces must be represented using the HTML &nbsp; entity.

        6.1.7. Non-default spaces

        Different space widths should be indicated using HTML and &ensp;, &emsp;, &thinsp;, &zwnj;, &zwj;.

        6.1.8. Hyphenation

        Soft hyphens must be represented using the HTML &shy; entity.

        -

        The HTML &lrm; and &rlm; entities (indicating writing direction) must not -be used; all writing direction changes must be indicated with tags.

        +

        The HTML &lrm; and &rlm; entities (indicating writing direction) must not be used; all +writing direction changes must be indicated with tags.

        6.1.9. Superscript and Subscript

        Other superscripts and subscripts must be represented using the HTML <sup> and <sub> tags, even if special Unicode characters are available.

        6.1.10. Ruby characters

        @@ -2088,18 +2145,18 @@

        7.1. Classes for Character Information

        Character-level information may be put on any element that contains only a single "line" of text.

        -

        7.1.1. ocr_cinfo

        -

        If no other layout element applies, the ocr_cinfo element may be used.

        +

        7.1.1. ocr_cinfo

        +

        If no other layout element applies, the ocr_cinfo element may be used.

        7.2. Properties for Character Information

        -

        7.2.1. cuts

        +

        7.2.1. cuts

        cuts c1 c2 c3 ...

        • character segmentation cuts (see below)

        • -

          there must be a bbox property relative to which the cuts can be interpreted

          +

          there must be a bbox property relative to which the cuts can be interpreted

        -

        7.2.2. nlp

        +

        7.2.2. nlp

        nlp c1 c2 c3 ...

        • @@ -2112,12 +2169,12 @@

          7.2.

          Assume a bounding box of (0,0,300,100); then

          -
          cuts("10 11 7 19") =
          +
          cuts("10 11 7 19") =
               [ [(10,0),(10,100)], [(21,0),(21,100)], [(28,0),(28,100)], [(47,0),(47,100)] ]
          -cuts("10,50,3 11,30,-3") =
          +cuts("10,50,3 11,30,-3") =
               [ [(10,0),(10,50),(13,50),(13,100)], [(21,0),(21,30),(18,30),(18,100)] ]
           
          -
          <span class="ocr_cinfo" title="bbox 0 0 300 100; nlp 1.7 2.3 3.9 2.7; cuts 9 11 7,8,-2 15 3">hello</span>
          +
          <span class="ocr_cinfo" title="bbox 0 0 300 100; nlp 1.7 2.3 3.9 2.7; cuts 9 11 7,8,-2 15 3">hello</span>
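The two `cuts(...)` expansions above can be reproduced with a small sketch. The function name `expand_cuts` is hypothetical, and the rule it uses is inferred only from those two examples: the first number of each cut is an x offset from the start of the previous cut, and any further numbers alternate between a vertical move and a horizontal move before the path runs to the far edge of the box. Coordinates are relative to the bounding box origin.

```python
def expand_cuts(spec, bbox):
    """Expand a cuts property value into a list of point paths.

    ASSUMPTION: semantics inferred from the two examples in the text,
    not from a normative definition.
    """
    _x0, y0, _x1, y1 = bbox
    height = y1 - y0
    paths = []
    start_x = 0
    for token in spec.split():
        deltas = [int(d) for d in token.split(",")]
        start_x += deltas[0]          # offset from the previous cut's start
        x, y = start_x, 0
        path = [(x, y)]
        rest = deltas[1:]
        for i in range(0, len(rest), 2):
            y += rest[i]              # vertical move
            path.append((x, y))
            if i + 1 < len(rest):
                x += rest[i + 1]      # horizontal move
                path.append((x, y))
        path.append((x, height))      # finish at the far edge of the box
        paths.append(path)
    return paths


straight = expand_cuts("10 11 7 19", (0, 0, 300, 100))
stepped = expand_cuts("10,50,3 11,30,-3", (0, 0, 300, 100))
```

Note that the second stepped cut starts at x = 21, i.e. relative to where the first cut began (10), not where it ended (13).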
           

          Cuts are between all codepoints contained within the element, including any @@ -2135,7 +2192,7 @@

          Common suggested engine-specific markup are:

          8.1. Classes for engine specific markup

          -

          8.1.1. ocrx_block

          +

          8.1.1. ocrx_block

          ocr_carea vs ocrx_block

          • @@ -2143,15 +2200,15 @@

            engine-specific because the definition of a "block" depends on the engine

          -

          8.1.2. ocrx_line

          +

          8.1.2. ocrx_line

          ocr_line vs ocrx_line

          • -

            any kind of "line" returned by an OCR system that differs from the standard ocr_line above

            +

            any kind of "line" returned by an OCR system that differs from the standard ocr_line above

          • might be some kind of "logical" line

          -

          8.1.3. ocrx_word

          +

          8.1.3. ocrx_word

          • any kind of "word" returned by an OCR system

            @@ -2162,47 +2219,48 @@

            -

            an ocrx_block should not contain content from multiple ocr_careas

            +

            An ocrx_block should not contain content from multiple ocr_carea.

          • -

            the union of all ocrx_blocks should approximately cover all ocr_careas

            +

            The union of all ocrx_blocks should approximately cover all ocr_carea.

          • -

            an ocrx_block should contain either a float or body text, but not both

            +

            an ocrx_block should contain either a float or body text, but not both

          • -

            an ocrx_block should contain either an image or text, but not both

            +

            an ocrx_block should contain either an image or text, but not both

          • -

            an ocrx_line should correspond as closely as possible to an ocr_line

            +

            an ocrx_line should correspond as closely as possible to an ocr_line

          • -

            ocrx_cinfo should nest inside ocrx_line

            +

            ocrx_cinfo should nest inside ocrx_line

          • -

            ocrx_cinfo should contain only x_conf, x_bboxes, and cuts attributes

            +

            ocrx_cinfo should contain only x_confs, x_bboxes, and cuts attributes

          +

          ocrx_cinfo?

          8.2. Properties for engine-specific markup

          The following properties are defined:

          -

          8.2.1. x_font

          +

          8.2.1. x_font

          x_font s

          • OCR-engine specific font names

          -

          8.2.2. x_fsize

          +

          8.2.2. x_fsize

          x_fsize n

          • OCR-engine specific font size

          -

          8.2.3. x_bboxes

          +

          8.2.3. x_bboxes

          x_bboxes b1x0 b1y0 b1x1 b1y1 b2x0 b2y0 b2x1 b2y1 ...

          • OCR-engine specific boxes associated with each codepoint contained in the element

          • -

            note that the bbox property is a property for the bounding box of a layout +

            note that the bbox property is a property for the bounding box of a layout element, not of individual characters

          • in particular, use <span class="ocr_cinfo" title="x_bboxes ....">, not <span class="ocr_cinfo" title="bbox ...">
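Since `x_bboxes` is a flat list of numbers, a consumer has to regroup it into one box per codepoint. A minimal sketch, with a hypothetical helper name and made-up sample values:

```python
def parse_x_bboxes(value):
    """Group the flat x_bboxes number list into (x0, y0, x1, y1) tuples."""
    numbers = [int(n) for n in value.split()]
    if len(numbers) % 4:
        raise ValueError("x_bboxes needs four numbers per box")
    return [tuple(numbers[i:i + 4]) for i in range(0, len(numbers), 4)]


# Illustrative values only; pair each box with the codepoint it belongs to.
boxes = parse_x_bboxes("0 0 10 20 12 0 22 20")
per_char = list(zip("hi", boxes))
```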

          -

          8.2.4. x_confs

          +

          8.2.4. x_confs

          x_confs c1 c2 c3 ...

          • @@ -2215,7 +2273,7 @@

            if possible, convert character confidences to values between 0 and 100 and have them approximate posterior probabilities (expressed in %)

          -

          8.2.5. x_wconf

          +

          8.2.5. x_wconf

          x_wconf n

          • @@ -2248,14 +2306,14 @@

            . It must contain <ins> and <del> elements. The first contained element should be <ins> and represent the most probable interpretation, the subsequent ones <del>. Each <ins> and <del> element should have class="alt" and a -property of either nlp or x_cost. These <span>, <ins>, and <del> tags can nest +property of either nlp or x_cost. These <span>, <ins>, and <del> tags can nest arbitrarily.

            -
            <span class="alternatives">
            -<ins class="alt" title="nlp 0.3">hello</ins>
            -<del class="alt" title="nlp 1.1">hallo</del>
            -</span>
            +
            <span class="alternatives">
            +<ins class="alt" title="nlp 0.3">hello</ins>
            +<del class="alt" title="nlp 1.1">hallo</del>
            +</span>
             

            Whitespace within the <span> but outside the contained <ins>/<del> elements is ignored and should be inserted to improve readability of the HTML @@ -2263,7 +2321,7 @@

            11. Grouped Elements and Multiple Hierarchies

            The different levels of layout information (logical, physical, engine-specific) each form hierarchies, but those hierarchies may not be mutually compatible; -for example, a single ocr_page may contain information from multiple sections +for example, a single ocr_page may contain information from multiple sections or chapters. To represent both hierarchies within a single document, elements may be grouped together. That is, two elements with the same class may be treated as one element by adding a "groupid identifier" property to them and @@ -2279,7 +2337,7 @@

            12.1. ocrp_lang

            +

            12.1. ocrp_lang

            Capable of generating lang= attributes

            -

            12.2. ocrp_dir

            +

            12.2. ocrp_dir

            Capable of generating dir= attributes

            -

            12.3. ocrp_poly

            +

            12.3. ocrp_poly

            Capable of generating polygonal bounds

            -

            12.4. ocrp_font

            +

            12.4. ocrp_font

            Capable of generating font information (standard font information)

            -

            12.5. ocrp_nlp

            +

            12.5. ocrp_nlp

            Capable of generating nlp confidences

            12.6. ocr_embeddedformat_<formatname>

            The capability to generate other specific embedded formats is given by the @@ -2317,9 +2375,13 @@

            13. Metadata

            13.1. Required Meta Information

            The OCR system is required to indicate the following using meta tags in the header:

            +

            13.1.1. ocr-system

            • <meta name="ocr-system" content="name version"/>

              +
            +

            13.1.2. ocr-capabilities

            +
            • <meta name="ocr-capabilities" content="capabilities"/>

            +

            The OCR system should indicate the following information

            +

            13.2.1. ocr-number-of-pages

            • <meta name="ocr-number-of-pages" content="number-of-pages"/>

              +
            +

            13.2.2. ocr-langs

            +
            • <meta name="ocr-langs" content="languages-considered-by-ocr"/>

                @@ -2339,6 +2406,9 @@

                value may be unknown

              +
            +

            13.2.3. ocr-scripts

            +
            • <meta name="ocr-scripts" content="scripts-considered-by-ocr"/>

                @@ -2348,7 +2418,7 @@

                value may be unknown

            -

            13.2. Document metadata

            +

            13.3. Document metadata

            For document meta information, use the Dublin Core Embedding into HTML. See also Citation Guidelines for Dublin Core.

            @@ -2384,21 +2454,21 @@

            14

            common commercial OCR output (e.g., Abbyy)

          • book target

            • -

              all logical structuring elements (as applicable), except ocr_linear

              +

              all logical structuring elements (as applicable), except ocr_linear

            • -

              ocr_page

              +

              ocr_page

          • newspaper target

            @@ -2406,9 +2476,9 @@

            14
          • all logical structuring elements (as applicable)

          • -

            articles map on ocr_linear

            +

            articles map on ocr_linear

          • -

            ocr_page

            +

            ocr_page

        15. HTML Markup

        @@ -2595,32 +2665,32 @@

         import libxml2,re,os,string

        -# convert the HTML to XHTML (if necessary)
        -os.system("tidy -q -asxhtml < page.html > page.xhtml 2> /dev/null")
        +# convert the HTML to XHTML (if necessary)
        +os.system("tidy -q -asxhtml < page.html > page.xhtml 2> /dev/null")

        -# parse the XML
        -doc = libxml2.parseFile('page.xhtml')
        +# parse the XML
        +doc = libxml2.parseFile('page.xhtml')

        -# search all nodes having a class of ocr_line
        -lines = doc.xpathEval("//*[@class='ocr_line']")
        +# search all nodes having a class of ocr_line
        +lines = doc.xpathEval("//*[@class='ocr_line']")

        -# a function for extracting the text from a node
        +# a function for extracting the text from a node
         def get_text(node):
        -    textnodes = node.xpathEval(".//text()")
        +    textnodes = node.xpathEval(".//text()")
             s = string.join([node.getContent() for node in textnodes])
        -    return re.sub(r'\s+',' ',s)
        +    return re.sub(r'\s+',' ',s)

        -# a function for extracting the bbox property from a node
        -# note that the title= attribute on a node with an ocr_ class must
        -# conform with the OCR spec
        +# a function for extracting the bbox property from a node
        +# note that the title= attribute on a node with an ocr_ class must
        +# conform with the OCR spec
         def get_bbox(node):
        -    data = node.prop('title')
        -    bboxre = re.compile(r'\bbbox\s+(\d+)\s+(\d+)\s+(\d+)\s+(\d+)')
        +    data = node.prop('title')
        +    bboxre = re.compile(r'\bbbox\s+(\d+)\s+(\d+)\s+(\d+)\s+(\d+)')
             return [int(x) for x in bboxre.search(data).groups()]

        -# this extracts all the bounding boxes and the text they contain
        -# it doesn’t matter what other markup the line node may contain
        +# this extracts all the bounding boxes and the text they contain
        +# it doesn’t matter what other markup the line node may contain
         for line in lines:
             print get_bbox(line),get_text(line)
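The sample above targets Python 2 and the libxml2 bindings. For readers on Python 3, the same extraction can be sketched with only the standard library. This is an alternative illustration, not the spec's sample: the class name `OcrLineCollector` is hypothetical, it uses `html.parser` instead of XPath, and it assumes `ocr_line` elements are not nested inside each other.

```python
import re
from html.parser import HTMLParser

BBOX_RE = re.compile(r"\bbbox\s+(\d+)\s+(\d+)\s+(\d+)\s+(\d+)")
# Void elements never get an end tag, so they must not affect nesting depth.
VOID_TAGS = {"br", "img", "hr", "meta", "link", "input", "wbr"}


class OcrLineCollector(HTMLParser):
    """Collect (bbox, text) pairs for every element with class ocr_line."""

    def __init__(self):
        super().__init__()
        self.lines = []    # finished (bbox, text) pairs
        self._depth = 0    # nesting depth inside the current ocr_line
        self._bbox = None
        self._chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        attrs = dict(attrs)
        if self._depth:
            self._depth += 1
        elif "ocr_line" in (attrs.get("class") or "").split():
            match = BBOX_RE.search(attrs.get("title") or "")
            self._bbox = [int(v) for v in match.groups()] if match else None
            self._depth = 1
            self._chunks = []

    def handle_endtag(self, tag):
        if tag in VOID_TAGS or not self._depth:
            return
        self._depth -= 1
        if self._depth == 0:
            text = re.sub(r"\s+", " ", "".join(self._chunks)).strip()
            self.lines.append((self._bbox, text))

    def handle_data(self, data):
        if self._depth:
            self._chunks.append(data)


collector = OcrLineCollector()
collector.feed(
    '<div class="ocr_page"><span class="ocr_line" '
    'title="bbox 10 20 160 30">Hello <b>big</b>\n world</span></div>'
)
for bbox, text in collector.lines:
    print(bbox, text)
```

As in the original sample, any other markup inside the line element is ignored and only its text is kept.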

      @@ -2807,6 +2877,85 @@

      +

      Index

      +

      Terms defined by this specification

      +

      References

      Normative References

      @@ -2838,8 +2987,271 @@

      +
      ocrx_cinfo?
      -
      \ No newline at end of file +

  • + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + \ No newline at end of file diff --git a/1.2/metadata b/1.2/metadata index 38cdc08..97c54c1 100644 --- a/1.2/metadata +++ b/1.2/metadata @@ -9,4 +9,4 @@ Editor: Konstantin Baierer, UB Mannheim http://github.com/UB-Mannheim, konstanti Former Editor: Thomas Breuel, http://www.9x9.com/ Previous Version: https://github.com/kba/hocr-spec/blob/master/1.1/spec.md Abstract: A subset of HTML for marking up OCR results -Markup Shorthands: markdown on, biblio on +Markup Shorthands: markdown on, biblio on, markup on diff --git a/1.2/spec.md b/1.2/spec.md index 78f1135..ca94594 100644 --- a/1.2/spec.md +++ b/1.2/spec.md @@ -13,7 +13,7 @@ arrive at a representation that makes it easy to reuse OCR results. This document describes many tags and a lot of information that can be output. However, getting started with hOCR is easy: you only need to output the tags -and information you actually want to. For example, just outputting `ocr_line` +and information you actually want to. For example, just outputting <{ocr_line}> tags with bounding boxes is already very useful for many applications. Just start simple and add more output information as the need arises. @@ -68,7 +68,7 @@ multiple properties are separated by semicolons. The following properties can apply to most elements (where it makes sense): -### `bbox` +### bbox `bbox x0 y0 x1 y1` @@ -79,8 +79,8 @@ the lower-right corner (x1, y1). 
   * the values are with reference to the top-left corner of the document image
     and measured in pixels
   * the order of the values is `x0 y0 x1 y1` = "left top right bottom"
-  * use `x_bboxes` below for character bounding boxes
-  * do not use `bbox` unless the bounding box of the layout component is, in
+  * use 'x_bboxes' below for character bounding boxes
+  * do not use 'bbox' unless the bounding box of the layout component is, in
     fact, rectangular
   * some non-rectangular layout components may have rectangular bounding boxes
     if the non-rectangularity is caused by floating elements around which text flows
@@ -106,7 +106,7 @@ the document image which border is drawn in black.
    -### `textangle` +### textangle `textangle alpha` @@ -121,7 +121,7 @@ which should be indicated using standard HTML properties The following properties can apply to most elements but should not be used unless there is no alternative: -### `poly` +### poly `poly x0 y0 x1 y1 ...` @@ -134,11 +134,11 @@ A closed polygon for elements with non-rectangular bounds * note that the natural and correct representation of many non-rectangular layouts is in terms of rectangular content areas and rectangular floats * documents using polygonal borders anywhere must indicate this by adding - [[#ocrp_poly]] to the list of `ocr-capabilities` in the - [[#required-meta-information]] - * documents should attempt to provide a reasonable bbox equivalent as well + ''ocr-capabilities/ocrp_poly'' to the list of 'ocr-capabilities' (see + [[#required-meta-information]]) + * documents should attempt to provide a reasonable 'bbox' equivalent as well -### `order` +### order `order n` @@ -148,27 +148,27 @@ The reading order of the element (an integer) the reading order of the page by element ordering within the page, since many tools will not be able to deal with content that is not in reading order -### `presence` +### presence Issue: [Use of property presence](https://github.com/kba/hocr-spec/issues/10) -`presence` presence must be declared in the document meta data +'presence' presence must be declared in the document meta data -### `cflow` +### cflow `cflow s` -This property relates the flow between multiple [[#ocr_carea]] elements, -and between [[#ocr_carea]] and [[#ocr_linear]] elements. +This property relates the flow between multiple <{ocr_carea}> elements, +and between <{ocr_carea}> and <{ocr_linear}> elements. 
The content flow on the page that this element is a part of * s must be a unique string for each content flow - * must be present on [[#ocr_carea]] and [[#ocrx_block]] tags when reading + * must be present on <{ocr_carea}> and <{ocrx_block}> tags when reading order is attempted and multiple content flows are present * presence must be declared in the document meta data -### `baseline` +### baseline `baseline pn pn-1 ... p0` @@ -191,7 +191,7 @@ contains the following information: title="bbox 105 66 823 113; baseline 0.015 -18">...
    ``` -bbox is the bounding box of the line in image coordinates (blue). The two +'bbox' is the bounding box of the line in image coordinates (blue). The two numbers for the baseline are the slope (1st number) and constant term (2nd number) of a linear equation describing the baseline relative to the bottom left corner of the bounding box (red). The baseline crosses the y-axis at `-18` @@ -208,30 +208,30 @@ and its slope angle is `arctan(0.015) = 0.86°`. We recognize the following logical structuring elements: - * `ocr_document` - * `ocr_linear` - * `ocr_title` - * `ocr_author` - * `ocr_abstract` - * `ocr_part` [`

    `] - * `ocr_chapter` [`

    `] - * `ocr_section` [`

    `] + * <{ocr_document}> + * <{ocr_linear}> + * <{ocr_title}> + * <{ocr_author}> + * <{ocr_abstract}> + * <{ocr_part}> [`

    `] + * <{ocr_chapter}> [`

    `] + * <{ocr_section}> [`

    `] * `ocr_sub*section` [`

    `,`

    `] - * `ocr_display` - * `ocr_blockquote` [`
    `] - * `ocr_par` [`

    `] - -## `ocr_document` -## `ocr_title` -## `ocr_author` -## `ocr_abstract` -## `ocr_part` -## `ocr_chapter` -## `ocr_section` -## `ocr_subsubsection` -## `ocr_display` -## `ocr_blockquote` -## `ocr_par` + * <{ocr_display}> + * <{ocr_blockquote}> [`

    `] + * <{ocr_par}> [`

    `] + +## ocr_document +## ocr_title +## ocr_author +## ocr_abstract +## ocr_part +## ocr_chapter +## ocr_section +## ocr_subsubsection +## ocr_display +## ocr_blockquote +## ocr_par These logical tags have their standard meaning as used in the publishing industry and tools like LaTeX, MS Word, and others. @@ -241,15 +241,15 @@ with those logical structuring elements, but it may not be possible or desirable to actually chose those tags (e.g., when adding hOCR information to an existing HTML output routine). -## `ocr_linear` +### ocr_linear -For all of these elements except `ocr_linear`, there exists a natural linear -ordering defined by reading order (`ocr_linear` indicates that the elements -contained in it have a linear ordering). At the level of `ocr_linear`, there -may not be a single distinguished order. A common example of `ocr_linear` is a +For all of these elements except <{ocr_linear}>, there exists a natural linear +ordering defined by reading order (<{ocr_linear}> indicates that the elements +contained in it have a linear ordering). At the level of <{ocr_linear}>, there +may not be a single distinguished order. A common example of <{ocr_linear}> is a newspaper, in which a single newspaper may contain many linear, but there is no unique reading order for the different linear. OCR evaluation tools should -therefore be sensitive to the order of all elements other than `ocr_linear`. +therefore be sensitive to the order of all elements other than <{ocr_linear}>. Tags must be nested as indicated by nesting above, but not all tags within the hierarchy need to be present. @@ -260,11 +260,11 @@ text inside the containing element. Documents whose logical structure does not map naturally onto these logical structuring elemetns must not use them for other purpose. 
-## `ocr_caption`
+## ocr_caption

-Image captions may be indicated using the `ocr_caption` element; such an
+Image captions may be indicated using the <{ocr_caption}> element; such an
element refers to the image(s) contained within the same float, or the
-immediately adjacent image if both the image and the `ocr_caption` element are
+immediately adjacent image if both the image and the <{ocr_caption}> element are
in running text.
@@ -303,57 +303,57 @@ properties for floating elements; properties need to be defined for this.
The following classes, as well as [floats](#classes-for-floats) are used for
type-setting elements.

-### `ocr_page`
+### ocr_page

-The `ocr_page` element must be present in all hOCR documents.
+The <{ocr_page}> element must be present in all hOCR documents.

-### `ocr_column`
+### ocr_column

    **OBSOLETE** -Please use [[#ocr_carea]] instead +Please use <{ocr_carea}> instead
    -### `ocr_carea` +### ocr_carea "ocr content area" or "body area" Used to be called ocr_column -The `ocr_carea` elements should appear in reading order unless this is impossible +The <{ocr_carea}> elements should appear in reading order unless this is impossible because of some other structuring requirement. If the document contains multiple -`ocr_linear` streams, then each `ocr_carea` must indicate which stream it belongs +<{ocr_linear}> streams, then each <{ocr_carea}> must indicate which stream it belongs to. Note that for many documents, the actual ground truth careas are well-defined by the document style of the original document before printing and scanning. From a single page, the `careas` of the original document style cannot be -recovered exactly. However, the partition of a document by `ocr_carea` for an +recovered exactly. However, the partition of a document by <{ocr_carea}> for an individual page shall be considered correct relative to ground truth if 1. all the text contained in a ground truth carea is fully contained within a - single `ocr_carea`, + single <{ocr_carea}>, 2. no text outside a ground truth `carea` is contained within an - `ocr_carea`, and - 3. the `ocr_careas` appear in the same order as the text flow + <{ocr_carea}>, and + 3. the <{ocr_carea}> appear in the same order as the text flow relationships between the ground truth careas. -### `ocr_line` +### ocr_line In typesetting systems, content areas are filled with “blocks”, but most of those blocks are not recoverable or semantically meaningful. However, one type of block is visible and very important for OCR engines: the line. Lines are typesetting blocks that only contain glyphs (“inlines” in XSL terminology). -They are represented by the `ocr_line` area. +They are represented by the <{ocr_line}> area. 
-`ocr_line` should be in a `` +<{ocr_line}> should be in a `` -### `ocr_separator` +### ocr_separator Any separator or similar element -### `ocr_noise` +### ocr_noise Any noise element that isn't part of typesetting @@ -366,7 +366,7 @@ The following properties should be present: The bounding box of the page; for pages, the top left corner must be at `(0,0)`, so a typical page bounding box will look like `bbox 0 0 2300 3200` -### `image` +### image `image imagefile` @@ -378,14 +378,14 @@ The bounding box of the page; for pages, the top left corner must be at * if the hOCR file is present in a directory hierarchy or file archive, should resolve to the corresponding image file -### `imagemd5` +### imagemd5 `imagemd5 checksum` * MD5 fingerprint of the image file that this page was derived from * allows re-associating pages with source images -### `ppageno` +### ppageno `ppageno n` @@ -395,7 +395,7 @@ The bounding box of the page; for pages, the top left corner must be at * must not be present unless the pages in the document have a physical ordering * must not be present unless it is well defined and unique -### `lpageno` +### lpageno `lpageno string` @@ -408,19 +408,19 @@ The bounding box of the page; for pages, the top left corner must be at The following properties MAY be present: -### `scan_res` +### scan_res `scan_res x_res y_res` * scanning resolution in DPI -### `x_scanner` +### x_scanner `x_scanner string` * a representation of the scanner -### `x_source` +### x_source `x_source string` @@ -433,9 +433,9 @@ The following properties MAY be present: * `x_source http://pageserver/012345678911&page=17` In addition to the standard -properties, the `ocr_line` area supports the following additional properties: +properties, the <{ocr_line}> area supports the following additional properties: -### `hardbreak` +### hardbreak `hardbreak n` @@ -444,7 +444,7 @@ properties, the `ocr_line` area supports the following additional properties: * a one indicates that the line is a 
hard (explicit) line break Any special characters representing the desired end-of-line processing must be -present inside the `ocr_line` element. Examples of such special characters are a +present inside the <{ocr_line}> element. Examples of such special characters are a soft hyphen ("­", `U+00AD`), a hard line break (`
    `), or whitespace (` `) for soft line breaks. @@ -454,48 +454,48 @@ Floats should not be nested. The following floats are defined: -### `ocr_float` +### ocr_float `ocr_float` -### `ocr_separator` +### ocr_separator -`ocr_separator` +`ocr_separator` in the context of float classes. -### `ocr_textfloat` +### ocr_textfloat `ocr_textfloat` -### `ocr_textimage` +### ocr_textimage `ocr_textimage` -### `ocr_image` +### ocr_image `ocr_image` -### `ocr_linedrawing` +### ocr_linedrawing Something that could be represented well and naturally in a vector graphics format like SVG (even if it is actually represented as PNG) -### `ocr_photo` +### ocr_photo Something that requires JPEG or PNG to be represented well -### `ocr_header` +### ocr_header `ocr_header` -### `ocr_footer` +### ocr_footer `ocr_footer` -### `ocr_pageno` +### ocr_pageno `ocr_pageno` -### `ocr_table` +### ocr_table `ocr_table` @@ -505,44 +505,44 @@ There is some content that should behave and flow like text ## Classes for Inline Representation -### `ocr_glyph` +### ocr_glyph An individual glyph represented as an image (e.g., an unrecognized character) Must contain a single `` tag, or be present on one -### `ocr_glyphs` +### ocr_glyphs Multiple glyphs represented as an image (e.g., an unrecognized word) Must contain a single `` tag, or be present on one -### `ocr_dropcap` +### ocr_dropcap An individual glyph representing a dropcap May contain text or an `` tag; the `alt` of the image tag should contain the corresponding text -### `ocr_chem` +### ocr_chem A chemical formula Must contain either a single `` tag or [[CML]] markup, or be present on one -### `ocr_math` +### ocr_math A mathematical formula Must contain either a single `` tag or [[MathML]] markup, or be present on one -Mathematical and chemical formulas that float must be put into an `ocr_float` +Mathematical and chemical formulas that float must be put into an <{ocr_float}> section. 
Mathematical and chemical formulas that are “display” mode should be put into -an `ocr_display` section. +an <{ocr_display}> section. ### Non-breaking space @@ -557,8 +557,9 @@ Different space widths should be indicated using HTML and ` `, `&emsp`, Soft hyphens must be represented using the HTML `­` entity. -The HTML `‎` and `‏` entities (indicating writing direction) must not -be used; all writing direction changes must be indicated with tags. +The HTML `‎` and +`‏` entities (indicating writing direction) must not be used; all +writing direction changes must be indicated with tags. ### Superscript and Subscript @@ -577,20 +578,20 @@ must be represented using their correct Unicode encoding. Character-level information may be put on any element that contains only a single "line" of text. -### `ocr_cinfo` +### ocr_cinfo -If no other layout element applies, the `ocr_cinfo` element may be used. +If no other layout element applies, the <{ocr_cinfo}> element may be used. ## Properties for Character Information -### `cuts` +### cuts `cuts c1 c2 c3 ...` * character segmentation cuts (see below) - * there must be a bbox property relative to which the cuts can be interpreted + * there must be a 'bbox' property relative to which the 'cuts' can be interpreted -### `nlp` +### nlp `nlp c1 c2 c3 ...` @@ -641,21 +642,21 @@ Common suggested engine-specific markup are: ## Classes for engine specific markup -### `ocrx_block` +### ocrx_block Issue: [ocr_carea vs ocrx_block](https://github.com/kba/hocr-spec/issues/28) * any kind of "block" returned by an OCR system * engine-specific because the definition of a "block" depends on the engine -### `ocrx_line` +### ocrx_line Issue: [ocr_line vs ocrx_line](https://github.com/kba/hocr-spec/issues/19) - * any kind of "line" returned by an OCR system that differs from the standard ocr_line above + * any kind of "line" returned by an OCR system that differs from the standard <{ocr_line}> above * might be some kind of "logical" line -### 
`ocrx_word` +### ocrx_word * any kind of "word" returned by an OCR system * engine specific because the definition of a "word" depends on the engine @@ -663,42 +664,44 @@ Issue: [ocr_line vs ocrx_line](https://github.com/kba/hocr-spec/issues/19) The meaning of these tags is OCR engine specific. However, generators should attempt to ensure the following properties: -* an `ocrx_block` should not contain content from multiple ocr_careas -* the union of all `ocrx_blocks` should approximately cover all `ocr_careas` -* an `ocrx_block` should contain either a float or body text, but not both -* an `ocrx_block` should contain either an image or text, but not both -* an `ocrx_line` should correspond as closely as possible to an `ocr_line` -* `ocrx_cinfo` should nest inside `ocrx_line` -* `ocrx_cinfo` should contain only `x_conf`, `x_bboxes`, and `cuts` attributes +* An <{ocrx_block}> should not contain content from multiple <{ocr_carea}>. +* The union of all <{ocrx_block|ocrx_blocks}> should approximately cover all <{ocr_carea}>. +* an <{ocrx_block}> should contain either a float or body text, but not both +* an <{ocrx_block}> should contain either an image or text, but not both +* an <{ocrx_line}> should correspond as closely as possible to an <{ocr_line}> +* <{ocrx_cinfo}> should nest inside <{ocrx_line}> +* <{ocrx_cinfo}> should contain only 'x_confs', 'x_bboxes', and 'cuts' attributes + +Issue: ocrx_cinfo? 
## Properties for engine-specific markup The following properties are defined: -### `x_font` +### x_font `x_font s` * OCR-engine specific font names -### `x_fsize` +### x_fsize `x_fsize n` * OCR-engine specific font size -### `x_bboxes` +### x_bboxes `x_bboxes b1x0 b1y0 b1x1 b1y1 b2x0 b2y0 b2x1 b2y1 ...` * OCR-engine specific boxes associated with each codepoint contained in the element - * note that the bbox property is a property for the bounding box of a layout + * note that the 'bbox' property is a property for the bounding box of a layout element, not of individual characters * in particular, use ``, not `` -### `x_confs` +### x_confs `x_confs c1 c2 c3 ...` @@ -708,7 +711,7 @@ The following properties are defined: * if possible, convert character confidences to values between 0 and 100 and have them approximate posterior probabilities (expressed in %) -### `x_wconf` +### x_wconf `x_wconf n` @@ -748,7 +751,7 @@ Alternative segmentations and readings are indicated by a `` with `class="alternatives"`. It must contains `` and `` elements. The first contained element should be `` and represent the most probable interpretation, the subsequent ones ``. Each `` and `` element should have `class="alt"` and a -property of either `nlp` or `x_cost`. These ``, ``, and `` tags can nest +property of either 'nlp' or 'x_cost'. These ``, ``, and `` tags can nest arbitrarily.
    @@ -769,7 +772,7 @@ when viewed in a browser. The different levels of layout information (logical, physical, engine-specific) each form hierarchies, but those hierarchies may not be mutually compatible; -for example, a single `ocr_page` may contain information from multiple sections +for example, a single <{ocr_page}> may contain information from multiple sections or chapters. To represent both hierarchies within a single document, elements may be grouped together. That is, two elements with the same class may be treated as one element by adding a "groupid identifier" property to them and @@ -787,8 +790,8 @@ removing tags that are not of interest for the subsequent processing step, and then collapsing grouped elements into single elements. For example, output that contains both logical and physical layout information, where the logical layout information uses grouped elements, can be transformed by removing all -the physical layout information, and then collapsing all split `ocr_chapter` -elements into single `ocr_chapter` elements based on the groupid. The result is +the physical layout information, and then collapsing all split <{ocr_chapter}> +elements into single <{ocr_chapter}> elements based on the groupid. The result is a simple DOM tree. This transformation can be provided generically as a pre-processor or Javascript. @@ -809,23 +812,23 @@ document. The capability to generate specific properties is given by the prefix `ocrp_...`; the important properties are: -## `ocrp_lang` +## ocrp_lang Capable of generating `lang=` attributes -## `ocrp_dir` +## ocrp_dir Capable of generating `dir=` attributes -## `ocrp_poly` +## ocrp_poly Capable of generating [polygonal bounds](#poly) -## `ocrp_font` +## ocrp_font Capable of generating font information (standard font information) -## `ocrp_nlp` +## ocrp_nlp Capable of generating [nlp confidences](#nlp) @@ -851,16 +854,31 @@ corresponding element or attribute must not be present in the document. 
The OCR system is required to indicate the following using meta tags in the header: +### ocr-system + * `` + +### ocr-capabilities + * `` * see [[#capabilities]] +## Recommended Meta Information + The OCR system should indicate the following information +### ocr-number-of-pages + * `` + +### ocr-langs + * `` * use [ISO 639-1](https://www.loc.gov/standards/iso639-2/php/code_list.php) codes * value may be `unknown` + +### ocr-scripts + * `` * use [ISO 15924](http://www.unicode.org/iso15924/codelists.html) letter codes * value may be `unknown` @@ -901,17 +919,17 @@ Other possible profiles might be defined for specific engines or specific document classes: * common commercial OCR output (e.g., Abbyy) - * ocr_page - * ocrx_block, ocrx_line, ocrx_word - * ocrp_lang - * ocrp_font + * <{ocr_page}> + * <{ocrx_block}>, <{ocrx_line}>, <{ocrx_word}> + * ''ocr-capabilities/ocrp_lang'' + * ''ocr-capabilities/ocrp_font'' * book target - * all logical structuring elements (as applicable), except ocr_linear - * ocr_page + * all logical structuring elements (as applicable), except <{ocr_linear}> + * <{ocr_page}> * newspaper target * all logical structuring elements (as applicable) - * articles map on ocr_linear - * ocr_page + * articles map on <{ocr_linear}> + * <{ocr_page}> # HTML Markup @@ -1171,3 +1189,7 @@ Issue: [correct MIME type for hOCR?](https://github.com/kba/hocr-spec/issues/27) : Applications which use this media type: : File extension(s): :: `*.html`, `*.hocr` + + + +
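The `baseline` hunk in this diff quotes the property string `bbox 105 66 823 113; baseline 0.015 -18` and states a slope angle of `arctan(0.015) = 0.86°`. The following is a minimal sketch for checking that arithmetic, not part of the spec or of this change. The helper names `parse_title` and `baseline_angle_degrees` are hypothetical, and the parser assumes that properties are separated by semicolons and values by whitespace, which would not hold for values that themselves contain semicolons.

```python
import math


def parse_title(title):
    """Split an hOCR title string into {property_name: [value, ...]}.

    Assumption: properties are separated by ';' and values by whitespace.
    Values are kept as strings; callers convert them as needed.
    """
    props = {}
    for chunk in title.split(";"):
        parts = chunk.split()
        if parts:
            props[parts[0]] = parts[1:]
    return props


def baseline_angle_degrees(props):
    """Slope angle of a linear baseline given as (slope, constant term)."""
    slope, _constant = (float(v) for v in props["baseline"])
    return math.degrees(math.atan(slope))


# The property string quoted in the baseline hunk of this diff.
props = parse_title("bbox 105 66 823 113; baseline 0.015 -18")
angle = baseline_angle_degrees(props)
print(props["bbox"], round(angle, 2))
```

Only the two-coefficient (linear) form is handled here; a higher-degree `baseline pn pn-1 ... p0` would need a polynomial evaluation instead of a single slope.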