diff --git a/1.2/index.bs b/1.2/index.bs index 8e9d11f..9170920 100644 --- a/1.2/index.bs +++ b/1.2/index.bs @@ -10,7 +10,7 @@ Editor: Konstantin Baierer, UB Mannheim http://github.com/UB-Mannheim, konstanti Former Editor: Thomas Breuel, http://www.9x9.com/ Previous Version: https://github.com/kba/hocr-spec/blob/master/1.1/spec.md Abstract: A subset of HTML for marking up OCR results -Markup Shorthands: markdown on, biblio on +Markup Shorthands: markdown on, biblio on, markup on
{ @@ -42,7 +42,7 @@ arrive at a representation that makes it easy to reuse OCR results. This document describes many tags and a lot of information that can be output. However, getting started with hOCR is easy: you only need to output the tags -and information you actually want to. For example, just outputting `ocr_line` +and information you actually want to. For example, just outputting <{ocr_line}> tags with bounding boxes is already very useful for many applications. Just start simple and add more output information as the need arises. @@ -97,7 +97,7 @@ multiple properties are separated by semicolons. The following properties can apply to most elements (where it makes sense): -### `bbox` +### bbox `bbox x0 y0 x1 y1` @@ -108,8 +108,8 @@ the lower-right corner (x1, y1). * the values are with reference to the the top-left corner of the document image and measured in pixels * the order of the values are `x0 y0 x1 y1` = "left top right bottom" - * use `x_bboxes` below for character bounding boxes - * do not use `bbox` unless the bounding box of the layout component is, in + * use 'x_bboxes' below for character bounding boxes + * do not use 'bbox' unless the bounding box of the layout component is, in fact, rectangular * some non-rectangular layout components may have rectangular bounding boxes if the non-rectangularity is caused by floating elements around which text flows @@ -135,7 +135,7 @@ the document image which border is drawn in black. 
-### `textangle` +### textangle `textangle alpha` @@ -150,7 +150,7 @@ which should be indicated using standard HTML properties The following properties can apply to most elements but should not be used unless there is no alternative: -### `poly` +### poly `poly x0 y0 x1 y1 ...` @@ -163,11 +163,11 @@ A closed polygon for elements with non-rectangular bounds * note that the natural and correct representation of many non-rectangular layouts is in terms of rectangular content areas and rectangular floats * documents using polygonal borders anywhere must indicate this by adding - [[#ocrp_poly]] to the list of `ocr-capabilities` in the - [[#required-meta-information]] - * documents should attempt to provide a reasonable bbox equivalent as well + ''ocr-capabilities/ocrp_poly'' to the list of 'ocr-capabilities' (see + [[#required-meta-information]]) + * documents should attempt to provide a reasonable 'bbox' equivalent as well -### `order` +### order `order n` @@ -177,27 +177,27 @@ The reading order of the element (an integer) the reading order of the page by element ordering within the page, since many tools will not be able to deal with content that is not in reading order -### `presence` +### presence Issue: [Use of property presence](https://github.com/kba/hocr-spec/issues/10) -`presence` presence must be declared in the document meta data +'presence' presence must be declared in the document meta data -### `cflow` +### cflow `cflow s` -This property relates the flow between multiple [[#ocr_carea]] elements, -and between [[#ocr_carea]] and [[#ocr_linear]] elements. +This property relates the flow between multiple <{ocr_carea}> elements, +and between <{ocr_carea}> and <{ocr_linear}> elements. 
The content flow on the page that this element is a part of * s must be a unique string for each content flow - * must be present on [[#ocr_carea]] and [[#ocrx_block]] tags when reading + * must be present on <{ocr_carea}> and <{ocrx_block}> tags when reading order is attempted and multiple content flows are present * presence must be declared in the document meta data -### `baseline` +### baseline `baseline pn pn-1 ... p0` @@ -220,7 +220,7 @@ contains the following information: title="bbox 105 66 823 113; baseline 0.015 -18">... ``` -bbox is the bounding box of the line in image coordinates (blue). The two +'bbox' is the bounding box of the line in image coordinates (blue). The two numbers for the baseline are the slope (1st number) and constant term (2nd number) of a linear equation describing the baseline relative to the bottom left corner of the bounding box (red). The baseline crosses the y-axis at `-18` @@ -237,30 +237,30 @@ and its slope angle is `arctan(0.015) = 0.86°`. We recognize the following logical structuring elements: - * `ocr_document` - * `ocr_linear` - * `ocr_title` - * `ocr_author` - * `ocr_abstract` - * `ocr_part` [``] - * `ocr_chapter` [`
`] - * `ocr_section` [`
`] + * <{ocr_document}> + * <{ocr_linear}> + * <{ocr_title}> + * <{ocr_author}> + * <{ocr_abstract}> + * <{ocr_part}> [`
`] + * <{ocr_chapter}> [`
`] + * <{ocr_section}> [`
`] * `ocr_sub*section` [`
`,`
`] - * `ocr_display` - * `ocr_blockquote` [`
`] - * `ocr_par` [``] - -## `ocr_document` -## `ocr_title` -## `ocr_author` -## `ocr_abstract` -## `ocr_part` -## `ocr_chapter` -## `ocr_section` -## `ocr_subsubsection` -## `ocr_display` -## `ocr_blockquote` -## `ocr_par` + * <{ocr_display}> + * <{ocr_blockquote}> [`
`] + * <{ocr_par}> [``] + +## ocr_document +## ocr_title +## ocr_author +## ocr_abstract +## ocr_part +## ocr_chapter +## ocr_section +## ocr_subsubsection +## ocr_display +## ocr_blockquote +## ocr_par These logical tags have their standard meaning as used in the publishing industry and tools like LaTeX, MS Word, and others. @@ -270,15 +270,15 @@ with those logical structuring elements, but it may not be possible or desirable to actually chose those tags (e.g., when adding hOCR information to an existing HTML output routine). -## `ocr_linear` +### ocr_linear -For all of these elements except `ocr_linear`, there exists a natural linear -ordering defined by reading order (`ocr_linear` indicates that the elements -contained in it have a linear ordering). At the level of `ocr_linear`, there -may not be a single distinguished order. A common example of `ocr_linear` is a +For all of these elements except <{ocr_linear}>, there exists a natural linear +ordering defined by reading order (<{ocr_linear}> indicates that the elements +contained in it have a linear ordering). At the level of <{ocr_linear}>, there +may not be a single distinguished order. A common example of <{ocr_linear}> is a newspaper, in which a single newspaper may contain many linear, but there is no unique reading order for the different linear. OCR evaluation tools should -therefore be sensitive to the order of all elements other than `ocr_linear`. +therefore be sensitive to the order of all elements other than <{ocr_linear}>. Tags must be nested as indicated by nesting above, but not all tags within the hierarchy need to be present. @@ -289,11 +289,11 @@ text inside the containing element. Documents whose logical structure does not map naturally onto these logical structuring elemetns must not use them for other purpose. 
-## `ocr_caption` +## ocr_caption -Image captions may be indicated using the `ocr_caption` element; such an +Image captions may be indicated using the <{ocr_caption}> element; such an element refers to the image(s) contained within the same float, or the -immediately adjacent image if both the image and the `ocr_caption` element are +immediately adjacent image if both the image and the <{ocr_caption}> element are in running text. @@ -332,57 +332,57 @@ properties for floating elements; properties need to be defined for this. The following classes, as well as [floats](#classes-for-floats) are used for type-setting elements. -### `ocr_page` +### ocr_page -The `ocr_page` element must be present in all hOCR documents. +The <{ocr_page}> element must be present in all hOCR documents. -### `ocr_column` +### ocr_column
-### `ocr_carea` +### ocr_carea "ocr content area" or "body area" Used to be calledocr_column-The `ocr_carea` elements should appear in reading order unless this is impossible +The <{ocr_carea}> elements should appear in reading order unless this is impossible because of some other structuring requirement. If the document contains multiple -`ocr_linear` streams, then each `ocr_carea` must indicate which stream it belongs +<{ocr_linear}> streams, then each <{ocr_carea}> must indicate which stream it belongs to. Note that for many documents, the actual ground truth careas are well-defined by the document style of the original document before printing and scanning. From a single page, the `careas` of the original document style cannot be -recovered exactly. However, the partition of a document by `ocr_carea` for an +recovered exactly. However, the partition of a document by <{ocr_carea}> for an individual page shall be considered correct relative to ground truth if 1. all the text contained in a ground truth carea is fully contained within a - single `ocr_carea`, + single <{ocr_carea}>, 2. no text outside a ground truth `carea` is contained within an - `ocr_carea`, and - 3. the `ocr_careas` appear in the same order as the text flow + <{ocr_carea}>, and + 3. the <{ocr_carea}> appear in the same order as the text flow relationships between the ground truth careas. -### `ocr_line` +### ocr_line In typesetting systems, content areas are filled with “blocks”, but most of those blocks are not recoverable or semantically meaningful. However, one type of block is visible and very important for OCR engines: the line. Lines are typesetting blocks that only contain glyphs (“inlines” in XSL terminology). -They are represented by the `ocr_line` area. +They are represented by the <{ocr_line}> area. 
-`ocr_line` should be in a `` +<{ocr_line}> should be in a `` -### `ocr_separator` +### ocr_separator Any separator or similar element -### `ocr_noise` +### ocr_noise Any noise element that isn't part of typesetting @@ -395,7 +395,7 @@ The following properties should be present: The bounding box of the page; for pages, the top left corner must be at `(0,0)`, so a typical page bounding box will look like `bbox 0 0 2300 3200` -### `image` +### image `image imagefile` @@ -407,14 +407,14 @@ The bounding box of the page; for pages, the top left corner must be at * if the hOCR file is present in a directory hierarchy or file archive, should resolve to the corresponding image file -### `imagemd5` +### imagemd5 `imagemd5 checksum` * MD5 fingerprint of the image file that this page was derived from * allows re-associating pages with source images -### `ppageno` +### ppageno `ppageno n` @@ -424,7 +424,7 @@ The bounding box of the page; for pages, the top left corner must be at * must not be present unless the pages in the document have a physical ordering * must not be present unless it is well defined and unique -### `lpageno` +### lpageno `lpageno string` @@ -437,19 +437,19 @@ The bounding box of the page; for pages, the top left corner must be at The following properties MAY be present: -### `scan_res` +### scan_res `scan_res x_res y_res` * scanning resolution in DPI -### `x_scanner` +### x_scanner `x_scanner string` * a representation of the scanner -### `x_source` +### x_source `x_source string` @@ -462,9 +462,9 @@ The following properties MAY be present: * `x_source http://pageserver/012345678911&page=17` In addition to the standard -properties, the `ocr_line` area supports the following additional properties: +properties, the <{ocr_line}> area supports the following additional properties: -### `hardbreak` +### hardbreak `hardbreak n` @@ -473,7 +473,7 @@ properties, the `ocr_line` area supports the following additional properties: * a one indicates that the line is a 
hard (explicit) line break Any special characters representing the desired end-of-line processing must be -present inside the `ocr_line` element. Examples of such special characters are a +present inside the <{ocr_line}> element. Examples of such special characters are a soft hyphen ("", `U+00AD`), a hard line break (`
`), or whitespace (` `) for soft line breaks. @@ -483,48 +483,48 @@ Floats should not be nested. The following floats are defined: -### `ocr_float` +### ocr_float `ocr_float` -### `ocr_separator` +### ocr_separator -`ocr_separator` +`ocr_separator` in the context of float classes. -### `ocr_textfloat` +### ocr_textfloat `ocr_textfloat` -### `ocr_textimage` +### ocr_textimage `ocr_textimage` -### `ocr_image` +### ocr_image `ocr_image` -### `ocr_linedrawing` +### ocr_linedrawing Something that could be represented well and naturally in a vector graphics format like SVG (even if it is actually represented as PNG) -### `ocr_photo` +### ocr_photo Something that requires JPEG or PNG to be represented well -### `ocr_header` +### ocr_header `ocr_header` -### `ocr_footer` +### ocr_footer `ocr_footer` -### `ocr_pageno` +### ocr_pageno `ocr_pageno` -### `ocr_table` +### ocr_table `ocr_table` @@ -534,44 +534,44 @@ There is some content that should behave and flow like text ## Classes for Inline Representation -### `ocr_glyph` +### ocr_glyph An individual glyph represented as an image (e.g., an unrecognized character) Must contain a single `` tag, or be present on one -### `ocr_glyphs` +### ocr_glyphs Multiple glyphs represented as an image (e.g., an unrecognized word) Must contain a single `` tag, or be present on one -### `ocr_dropcap` +### ocr_dropcap An individual glyph representing a dropcap May contain text or an `` tag; the `alt` of the image tag should contain the corresponding text -### `ocr_chem` +### ocr_chem A chemical formula Must contain either a single `` tag or [[CML]] markup, or be present on one -### `ocr_math` +### ocr_math A mathematical formula Must contain either a single `` tag or [[MathML]] markup, or be present on one -Mathematical and chemical formulas that float must be put into an `ocr_float` +Mathematical and chemical formulas that float must be put into an <{ocr_float}> section. 
Mathematical and chemical formulas that are “display” mode should be put into -an `ocr_display` section. +an <{ocr_display}> section. ### Non-breaking space @@ -586,8 +586,9 @@ Different space widths should be indicated using HTML and ` `, `&emsp`, Soft hyphens must be represented using the HTML `` entity. -The HTML `` and `` entities (indicating writing direction) must not -be used; all writing direction changes must be indicated with tags. +The HTML `` and +`` entities (indicating writing direction) must not be used; all +writing direction changes must be indicated with tags. ### Superscript and Subscript @@ -606,20 +607,20 @@ must be represented using their correct Unicode encoding. Character-level information may be put on any element that contains only a single "line" of text. -### `ocr_cinfo` +### ocr_cinfo -If no other layout element applies, the `ocr_cinfo` element may be used. +If no other layout element applies, the <{ocr_cinfo}> element may be used. ## Properties for Character Information -### `cuts` +### cuts `cuts c1 c2 c3 ...` * character segmentation cuts (see below) - * there must be a bbox property relative to which the cuts can be interpreted + * there must be a 'bbox' property relative to which the 'cuts' can be interpreted -### `nlp` +### nlp `nlp c1 c2 c3 ...` @@ -670,21 +671,21 @@ Common suggested engine-specific markup are: ## Classes for engine specific markup -### `ocrx_block` +### ocrx_block Issue: [ocr_carea vs ocrx_block](https://github.com/kba/hocr-spec/issues/28) * any kind of "block" returned by an OCR system * engine-specific because the definition of a "block" depends on the engine -### `ocrx_line` +### ocrx_line Issue: [ocr_line vs ocrx_line](https://github.com/kba/hocr-spec/issues/19) - * any kind of "line" returned by an OCR system that differs from the standard ocr_line above + * any kind of "line" returned by an OCR system that differs from the standard <{ocr_line}> above * might be some kind of "logical" line -### `ocrx_word` 
+### ocrx_word * any kind of "word" returned by an OCR system * engine specific because the definition of a "word" depends on the engine @@ -692,42 +693,44 @@ Issue: [ocr_line vs ocrx_line](https://github.com/kba/hocr-spec/issues/19) The meaning of these tags is OCR engine specific. However, generators should attempt to ensure the following properties: -* an `ocrx_block` should not contain content from multiple ocr_careas -* the union of all `ocrx_blocks` should approximately cover all `ocr_careas` -* an `ocrx_block` should contain either a float or body text, but not both -* an `ocrx_block` should contain either an image or text, but not both -* an `ocrx_line` should correspond as closely as possible to an `ocr_line` -* `ocrx_cinfo` should nest inside `ocrx_line` -* `ocrx_cinfo` should contain only `x_conf`, `x_bboxes`, and `cuts` attributes +* An <{ocrx_block}> should not contain content from multiple <{ocr_carea}>. +* The union of all <{ocrx_block|ocrx_blocks}> should approximately cover all <{ocr_carea}>. +* an <{ocrx_block}> should contain either a float or body text, but not both +* an <{ocrx_block}> should contain either an image or text, but not both +* an <{ocrx_line}> should correspond as closely as possible to an <{ocr_line}> +* <{ocrx_cinfo}> should nest inside <{ocrx_line}> +* <{ocrx_cinfo}> should contain only 'x_confs', 'x_bboxes', and 'cuts' attributes + +Issue: ocrx_cinfo? 
## Properties for engine-specific markup The following properties are defined: -### `x_font` +### x_font `x_font s` * OCR-engine specific font names -### `x_fsize` +### x_fsize `x_fsize n` * OCR-engine specific font size -### `x_bboxes` +### x_bboxes `x_bboxes b1x0 b1y0 b1x1 b1y1 b2x0 b2y0 b2x1 b2y1 ...` * OCR-engine specific boxes associated with each codepoint contained in the element - * note that the bbox property is a property for the bounding box of a layout + * note that the 'bbox' property is a property for the bounding box of a layout element, not of individual characters * in particular, use ``, not `` -### `x_confs` +### x_confs `x_confs c1 c2 c3 ...` @@ -737,7 +740,7 @@ The following properties are defined: * if possible, convert character confidences to values between 0 and 100 and have them approximate posterior probabilities (expressed in %) -### `x_wconf` +### x_wconf `x_wconf n` @@ -777,7 +780,7 @@ Alternative segmentations and readings are indicated by a `` with `class="alternatives"`. It must contains `` and `` elements. The first contained element should be `` and represent the most probable interpretation, the subsequent ones ``. Each `` and `` element should have `class="alt"` and a -property of either `nlp` or `x_cost`. These ``, ``, and `` tags can nest +property of either 'nlp' or 'x_cost'. These ``, ``, and `` tags can nest arbitrarily.``` -bbox is the bounding box of the line in image coordinates (blue). The two +'bbox' is the bounding box of the line in image coordinates (blue). The two numbers for the baseline are the slope (1st number) and constant term (2nd number) of a linear equation describing the baseline relative to the bottom left corner of the bounding box (red). The baseline crosses the y-axis at `-18` @@ -208,30 +208,30 @@ and its slope angle is `arctan(0.015) = 0.86°`. 
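As a worked illustration of the baseline example quoted in this hunk (`bbox 105 66 823 113; baseline 0.015 -18`), the sketch below recomputes the slope angle and locates the baseline in image coordinates. The function names are hypothetical. The sign convention is an assumption read from the prose, not a reference implementation: the constant term is measured from the bottom-left corner of the bounding box, with image y growing downward.

```python
import math


def baseline_angle_degrees(slope):
    """Return the baseline slope angle in degrees for a linear baseline."""
    return math.degrees(math.atan(slope))


def baseline_y(bbox, slope, offset, x):
    """Return the image-space y of the baseline at image-space x.

    Assumption: the baseline polynomial is relative to the bottom-left
    corner (x0, y1) of the bounding box, and image y increases downward.
    """
    x0, y0, x1, y1 = bbox
    return y1 + slope * (x - x0) + offset


bbox = (105, 66, 823, 113)
angle = baseline_angle_degrees(0.015)
y_at_left_edge = baseline_y(bbox, 0.015, -18, x=105)
```

Under these assumptions the angle rounds to the 0.86° quoted in the text, and the baseline meets the left edge of the box 18 pixels above its bottom edge.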
We recognize the following logical structuring elements: - * `ocr_document` - * `ocr_linear` - * `ocr_title` - * `ocr_author` - * `ocr_abstract` - * `ocr_part` [`@@ -798,7 +801,7 @@ when viewed in a browser. The different levels of layout information (logical, physical, engine-specific) each form hierarchies, but those hierarchies may not be mutually compatible; -for example, a single `ocr_page` may contain information from multiple sections +for example, a single <{ocr_page}> may contain information from multiple sections or chapters. To represent both hierarchies within a single document, elements may be grouped together. That is, two elements with the same class may be treated as one element by adding a "groupid identifier" property to them and @@ -816,8 +819,8 @@ removing tags that are not of interest for the subsequent processing step, and then collapsing grouped elements into single elements. For example, output that contains both logical and physical layout information, where the logical layout information uses grouped elements, can be transformed by removing all -the physical layout information, and then collapsing all split `ocr_chapter` -elements into single `ocr_chapter` elements based on the groupid. The result is +the physical layout information, and then collapsing all split <{ocr_chapter}> +elements into single <{ocr_chapter}> elements based on the groupid. The result is a simple DOM tree. This transformation can be provided generically as a pre-processor or Javascript. @@ -838,23 +841,23 @@ document. 
The capability to generate specific properties is given by the prefix `ocrp_...`; the important properties are: -## `ocrp_lang` +## ocrp_lang Capable of generating `lang=` attributes -## `ocrp_dir` +## ocrp_dir Capable of generating `dir=` attributes -## `ocrp_poly` +## ocrp_poly Capable of generating [polygonal bounds](#poly) -## `ocrp_font` +## ocrp_font Capable of generating font information (standard font information) -## `ocrp_nlp` +## ocrp_nlp Capable of generating [nlp confidences](#nlp) @@ -880,16 +883,31 @@ corresponding element or attribute must not be present in the document. The OCR system is required to indicate the following using meta tags in the header: +### ocr-system + * `` + +### ocr-capabilities + * `` * see [[#capabilities]] +## Recommended Meta Information + The OCR system should indicate the following information +### ocr-number-of-pages + * `` + +### ocr-langs + * `` * use [ISO 639-1](https://www.loc.gov/standards/iso639-2/php/code_list.php) codes * value may be `unknown` + +### ocr-scripts + * `` * use [ISO 15924](http://www.unicode.org/iso15924/codelists.html) letter codes * value may be `unknown` @@ -930,17 +948,17 @@ Other possible profiles might be defined for specific engines or specific document classes: * common commercial OCR output (e.g., Abbyy) - * ocr_page - * ocrx_block, ocrx_line, ocrx_word - * ocrp_lang - * ocrp_font + * <{ocr_page}> + * <{ocrx_block}>, <{ocrx_line}>, <{ocrx_word}> + * ''ocr-capabilities/ocrp_lang'' + * ''ocr-capabilities/ocrp_font'' * book target - * all logical structuring elements (as applicable), except ocr_linear - * ocr_page + * all logical structuring elements (as applicable), except <{ocr_linear}> + * <{ocr_page}> * newspaper target * all logical structuring elements (as applicable) - * articles map on ocr_linear - * ocr_page + * articles map on <{ocr_linear}> + * <{ocr_page}> # HTML Markup @@ -1200,3 +1218,7 @@ Issue: [correct MIME type for hOCR?](https://github.com/kba/hocr-spec/issues/27) : 
Applications which use this media type: : File extension(s): :: `*.html`, `*.hocr` + + + + diff --git a/1.2/index.html b/1.2/index.html index a27e208..8bde9db 100644 --- a/1.2/index.html +++ b/1.2/index.html @@ -1185,93 +1185,6 @@ [data-md] > :last-child { margin-bottom: 0; } - - + + +-### `textangle` +### textangle `textangle alpha` @@ -121,7 +121,7 @@ which should be indicated using standard HTML properties The following properties can apply to most elements but should not be used unless there is no alternative: -### `poly` +### poly `poly x0 y0 x1 y1 ...` @@ -134,11 +134,11 @@ A closed polygon for elements with non-rectangular bounds * note that the natural and correct representation of many non-rectangular layouts is in terms of rectangular content areas and rectangular floats * documents using polygonal borders anywhere must indicate this by adding - [[#ocrp_poly]] to the list of `ocr-capabilities` in the - [[#required-meta-information]] - * documents should attempt to provide a reasonable bbox equivalent as well + ''ocr-capabilities/ocrp_poly'' to the list of 'ocr-capabilities' (see + [[#required-meta-information]]) + * documents should attempt to provide a reasonable 'bbox' equivalent as well -### `order` +### order `order n` @@ -148,27 +148,27 @@ The reading order of the element (an integer) the reading order of the page by element ordering within the page, since many tools will not be able to deal with content that is not in reading order -### `presence` +### presence Issue: [Use of property presence](https://github.com/kba/hocr-spec/issues/10) -`presence` presence must be declared in the document meta data +'presence' presence must be declared in the document meta data -### `cflow` +### cflow `cflow s` -This property relates the flow between multiple [[#ocr_carea]] elements, -and between [[#ocr_carea]] and [[#ocr_linear]] elements. 
+This property relates the flow between multiple <{ocr_carea}> elements, +and between <{ocr_carea}> and <{ocr_linear}> elements. The content flow on the page that this element is a part of * s must be a unique string for each content flow - * must be present on [[#ocr_carea]] and [[#ocrx_block]] tags when reading + * must be present on <{ocr_carea}> and <{ocrx_block}> tags when reading order is attempted and multiple content flows are present * presence must be declared in the document meta data -### `baseline` +### baseline `baseline pn pn-1 ... p0` @@ -191,7 +191,7 @@ contains the following information: title="bbox 105 66 823 113; baseline 0.015 -18">...+ + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + \ No newline at end of file diff --git a/1.2/metadata b/1.2/metadata index 38cdc08..97c54c1 100644 --- a/1.2/metadata +++ b/1.2/metadata @@ -9,4 +9,4 @@ Editor: Konstantin Baierer, UB Mannheim http://github.com/UB-Mannheim, konstanti Former Editor: Thomas Breuel, http://www.9x9.com/ Previous Version: https://github.com/kba/hocr-spec/blob/master/1.1/spec.md Abstract: A subset of HTML for marking up OCR results -Markup Shorthands: markdown on, biblio on +Markup Shorthands: markdown on, biblio on, markup on diff --git a/1.2/spec.md b/1.2/spec.md index 78f1135..ca94594 100644 --- a/1.2/spec.md +++ b/1.2/spec.md @@ -13,7 +13,7 @@ arrive at a representation that makes it easy to reuse OCR results. This document describes many tags and a lot of information that can be output. However, getting started with hOCR is easy: you only need to output the tags -and information you actually want to. For example, just outputting `ocr_line` +and information you actually want to. For example, just outputting <{ocr_line}> tags with bounding boxes is already very useful for many applications. Just start simple and add more output information as the need arises. @@ -68,7 +68,7 @@ multiple properties are separated by semicolons. 
The following properties can apply to most elements (where it makes sense): -### `bbox` +### bbox `bbox x0 y0 x1 y1` @@ -79,8 +79,8 @@ the lower-right corner (x1, y1). * the values are with reference to the the top-left corner of the document image and measured in pixels * the order of the values are `x0 y0 x1 y1` = "left top right bottom" - * use `x_bboxes` below for character bounding boxes - * do not use `bbox` unless the bounding box of the layout component is, in + * use 'x_bboxes' below for character bounding boxes + * do not use 'bbox' unless the bounding box of the layout component is, in fact, rectangular * some non-rectangular layout components may have rectangular bounding boxes if the non-rectangularity is caused by floating elements around which text flows @@ -106,7 +106,7 @@ the document image which border is drawn in black.hOCR - OCR Workflow and Output embedded in HTML
-Living Standard,
+Living Standard,
- This version: @@ -1403,7 +1440,7 @@
To the extent possible under law, the editors have waived all copyright and related or neighboring rights to this work. -In addition, as of 17 October 2016, +In addition, as of 18 October 2016, the editors have made this specification available under the Open Web Foundation Agreement Version 1.0, which is available at http://www.openwebfoundation.org/legal/the-owf-1-0-agreements/owfa-1-0. Parts of this work may be from another specification document. If so, those parts are instead covered by the license of that specification document.
@@ -1425,35 +1462,38 @@Table of Contents
3.1 General Properties 3.2 Non-recommended general properties 4 Logical Structuring Elements -
- 4.1
ocr_document
-- 4.2
ocr_title
-- 4.3
ocr_author
-- 4.4
ocr_abstract
-- 4.5
ocr_part
-- 4.6
ocr_chapter
-- 4.7
ocr_section
-- 4.8
ocr_subsubsection
-- 4.9
ocr_display
-- 4.10
ocr_blockquote
-- 4.11
ocr_par
-- 4.12
ocr_linear
-- 4.13
ocr_caption
+- 4.1 ocr_document +
- 4.2 ocr_title +
- 4.3 ocr_author +
- 4.4 ocr_abstract +
- 4.5 ocr_part +
- 4.6 ocr_chapter +
- 4.7 ocr_section +
- 4.8 ocr_subsubsection +
- 4.9 ocr_display +
- 4.10 ocr_blockquote +
- + 4.11 ocr_par + +
- 4.12 ocr_caption
5 Typesetting Related Elements @@ -1461,44 +1501,44 @@ Table of Contents
5.1 Classes for typesetting elements 5.2 Recommended Properties for typesetting elements 5.3 Optional Properties for typesetting elements 5.4 Classes for floats -
- 5.4.1
ocr_float
-- 5.4.2
ocr_separator
-- 5.4.3
ocr_textfloat
-- 5.4.4
ocr_textimage
-- 5.4.5
ocr_image
-- 5.4.6
ocr_linedrawing
-- 5.4.7
ocr_photo
-- 5.4.8
ocr_header
-- 5.4.9
ocr_footer
-- 5.4.10
ocr_pageno
-- 5.4.11
ocr_table
+- 5.4.1 ocr_float +
- 5.4.2 ocr_separator +
- 5.4.3 ocr_textfloat +
- 5.4.4 ocr_textimage +
- 5.4.5 ocr_image +
- 5.4.6 ocr_linedrawing +
- 5.4.7 ocr_photo +
- 5.4.8 ocr_header +
- 5.4.9 ocr_footer +
- 5.4.10 ocr_pageno +
- 5.4.11 ocr_table
@@ -1507,11 +1547,11 @@ Table of Contents
6.1 Classes for Inline Representation -
- 6.1.1
ocr_glyph
-- 6.1.2
ocr_glyphs
-- 6.1.3
ocr_dropcap
-- 6.1.4
ocr_chem
-- 6.1.5
ocr_math
+- 6.1.1 ocr_glyph +
- 6.1.2 ocr_glyphs +
- 6.1.3 ocr_dropcap +
- 6.1.4 ocr_chem +
- 6.1.5 ocr_math
- 6.1.6 Non-breaking space
- 6.1.7 Non-default spaces
- 6.1.8 Hyphenation @@ -1525,13 +1565,13 @@
Table of Contents
- 7.1 Classes for Character Information
- 7.2 Properties for Character Information
-
- 7.2.1
cuts
-- 7.2.2
nlp
+- 7.2.1 cuts +
- 7.2.2 nlp
@@ -1540,18 +1580,18 @@ Table of Contents
8.1 Classes for engine specific markup 8.2 Properties for engine-specific markup 9 Font, Text Color, Language, Direction @@ -1560,19 +1600,31 @@ Table of Contents
12 Capabilities 13 Metadata 14 Profiles @@ -1610,6 +1662,11 @@ Table of Contents
17.1 Media Type Conformance + + Index + References @@ -1630,7 +1687,7 @@
This document describes many tags and a lot of information that can be output. However, getting started with hOCR is easy: you only need to output the tags -and information you actually want to. For example, just outputting
ocr_line
tags with bounding boxes is already very useful for many applications. Just +and information you actually want to. For example, just outputtingocr_line
tags with bounding boxes is already very useful for many applications. Just start simple and add more output information as the need arises.3. Terminology and Representation
This document describes a representation of various aspects of OCR output in an @@ -1670,17 +1727,17 @@
< multiple properties are separated by semicolons.
-<div class="ocr_page" id="page_1"> - <div class="ocr_carea" id="column_2" title="bbox 313 324 733 1922"> - <div class="ocr_par" id="par_7"> ... </div> - <div class="ocr_par" id="par_19"> ... </div> - </div> -</div> +<div class="ocr_page" id="page_1"> + <div class="ocr_carea" id="column_2" title="bbox 313 324 733 1922"> + <div class="ocr_par" id="par_7"> ... </div> + <div class="ocr_par" id="par_19"> ... </div> + </div> +</div>3.1. General Properties
The following properties can apply to most elements (where it makes sense):
-3.1.1.
+bbox
3.1.1. bbox
bbox x0 y0 x1 y1
The
bbox
- short for "bounding box" - of an element is a rectangular box around this element, which is defined by the upper-left corner (x0, y0) and @@ -1692,9 +1749,9 @@3.1
the order of the values are
x0 y0 x1 y1
= "left top right bottom"- -
use
+x_bboxes
below for character bounding boxesuse x_bboxes below for character bounding boxes
- -
do not use
bbox
unless the bounding box of the layout component is, in +do not use bbox unless the bounding box of the layout component is, in fact, rectangular
some non-rectangular layout components may have rectangular bounding boxes @@ -1703,8 +1760,8 @@
3.1
See also the section §5.2.1 bbox (typesetting).
--<span class='ocr_line' id='line_1' - title="bbox 10 20 160 30">...</span> +<span class='ocr_line' id='line_1' + title="bbox 10 20 160 30">...</span>The bounding box
bbox
of this line is shown in blue and is spanned by the upper-left corner (10, 20) and the lower-right corner (160, 30). @@ -1712,7 +1769,7 @@3.1 the document image whose border is drawn in black.
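A minimal sketch of how a consumer might read these properties, assuming the `title` value is a plain semicolon-separated list whose values contain no quoted semicolons. The helper names `parse_title` and `bbox_size` are hypothetical and not part of hOCR.

```python
def parse_title(title):
    """Split an hOCR title attribute into {property name: [values]}."""
    props = {}
    for chunk in title.split(";"):
        parts = chunk.split()
        if parts:  # skip empty chunks, e.g. after a trailing semicolon
            props[parts[0]] = parts[1:]
    return props


def bbox_size(props):
    """Return (width, height) in pixels from the bbox property."""
    x0, y0, x1, y1 = (int(v) for v in props["bbox"])
    return x1 - x0, y1 - y0


# The line example above: bbox 10 20 160 30
props = parse_title("bbox 10 20 160 30")
print(props["bbox"], bbox_size(props))
```

Values are kept as strings in the dictionary because other properties (for example `image` or `lpageno`) are not numeric; each caller converts what it needs.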
3.1.2.
+textangle
3.1.2. textangle
textangle alpha
The angle in degrees by which textual content has been rotated relative to the rest of the page (if not present, the angle is assumed to be zero); rotations @@ -1722,7 +1779,7 @@
3.2. Non-recommended general properties
The following properties can apply to most elements but should not be used unless there is no alternative:
-3.2.1.
+poly
3.2.1. poly
poly x0 y0 x1 y1 ...
A closed polygon for elements with non-rectangular bounds
@@ -1735,11 +1792,11 @@
-3.2
note that the natural and correct representation of many non-rectangular layouts is in terms of rectangular content areas and rectangular floats
- -
documents using polygonal borders anywhere must indicate this by adding §12.3 ocrp_poly to the list of
+ocr-capabilities
in the §13.1 Required Meta Informationdocuments using polygonal borders anywhere must indicate this by adding ocrp_poly to the list of ocr-capabilities (see §13.1 Required Meta Information)
- -
documents should attempt to provide a reasonable bbox equivalent as well
+documents should attempt to provide a reasonable bbox equivalent as well
3.2.2.
+order
3.2.2. order
order n
The reading order of the element (an integer)
@@ -1748,36 +1805,36 @@
-3. the reading order of the page by element ordering within the page, since many tools will not be able to deal with content that is not in reading order
3.2.3.
+presence
3.2.3. presence
--
presence
presence must be declared in the document meta data3.2.4.
+cflow
presence presence must be declared in the document meta data
+3.2.4. cflow
-
cflow s
This property relates the flow between multiple §5.1.3 ocr_carea elements, -and between §5.1.3 ocr_carea and §4.12 ocr_linear elements.
+This property relates the flow between multiple
ocr_carea
elements, +and betweenocr_carea
andocr_linear
elements.The content flow on the page that this element is a part of
-
s must be a unique string for each content flow
- -
must be present on §5.1.3 ocr_carea and §8.1.1 ocrx_block tags when reading +
must be present on
ocr_carea
andocrx_block
tags when reading order is attempted and multiple content flows are presentpresence must be declared in the document meta data
3.2.5.
+baseline
3.2.5. baseline
baseline pn pn-1 ... p0
This property applies primarily to textlines.
The baseline is described by a polynomial of order
n
with the coefficientspn ... p0
withn = 1
for a linear (i.e. straight) line.The polynomial is in the coordinate system of the line, with the bottom left of the bounding box as the origin.
-- +\ No newline at end of file +++The hOCR output for the first line of eurotext.tif contains the following information:
-<span class='ocr_line' id='line_1_1' - title="bbox 105 66 823 113; baseline 0.015 -18">...</span> +@@ -2807,6 +2877,85 @@<span class='ocr_line' id='line_1_1' + title="bbox 105 66 823 113; baseline 0.015 -18">...</span>-bbox is the bounding box of the line in image coordinates (blue). The two +
bbox is the bounding box of the line in image coordinates (blue). The two numbers for the baseline are the slope (1st number) and constant term (2nd number) of a linear equation describing the baseline relative to the bottom left corner of the bounding box (red). The baseline crosses the y-axis at
@@ -1787,71 +1844,71 @@-18
and its slope angle isarctan(0.015) = 0.86°
.
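The arithmetic in this example can be checked with a short sketch. It assumes that baseline offsets use the same downward-pointing y axis as image coordinates, so a constant term of `-18` lies 18 pixels above the bottom edge of the bounding box; the function names are hypothetical.

```python
import math


def baseline_offset(coeffs, x):
    """Evaluate pn*x^n + ... + p0 with Horner's rule; coeffs are ordered pn ... p0."""
    y = 0.0
    for c in coeffs:
        y = y * x + c
    return y


def baseline_in_image(bbox, coeffs, x_img):
    """Baseline y in image coordinates at image column x_img.

    The origin of the polynomial is the bottom-left corner of the
    bounding box, i.e. (x0, y1) in image coordinates (assumption: y
    grows downward in both systems).
    """
    x0, y0, x1, y1 = bbox
    return y1 + baseline_offset(coeffs, x_img - x0)


bbox = (105, 66, 823, 113)   # from the example above
coeffs = [0.015, -18.0]      # slope, constant term
print(baseline_in_image(bbox, coeffs, 105))          # y at the left edge of the line
print(round(math.degrees(math.atan(coeffs[0])), 2))  # slope angle in degrees
```

Horner's rule is used so the same function also covers higher-order baselines, where more than two coefficients are given.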
We recognize the following logical structuring elements: -
- -
+
ocr_document
- -
+
ocr_linear
- -
+
ocr_title
- -
+
ocr_author
- -
+
ocr_abstract
- -
+
ocr_part
[<h1>
]
ocr_part
[<h1>
]
- -
+
ocr_chapter
[<h1>
]
ocr_chapter
[<h1>
]- -
+
ocr_section
[<h2>
]
ocr_section
[<h2>
]
ocr_sub*section
[<h3>
,<h4>
]- -
+
ocr_display
- -
+
ocr_blockquote
[<blockquote>
]
ocr_blockquote
[<blockquote>
]- -
+
ocr_par
[<p>
]
ocr_par
[<p>
]4.1.
-ocr_document
4.2.
-ocr_title
4.3.
-ocr_author
4.4.
-ocr_abstract
4.5.
-ocr_part
4.6.
-ocr_chapter
4.7.
-ocr_section
4.8.
-ocr_subsubsection
4.9.
-ocr_display
4.10.
-ocr_blockquote
4.11.
+ocr_par
4.1. ocr_document
+4.2. ocr_title
+4.3. ocr_author
+4.4. ocr_abstract
+4.5. ocr_part
+4.6. ocr_chapter
+4.7. ocr_section
+4.8. ocr_subsubsection
+4.9. ocr_display
+4.10. ocr_blockquote
+4.11. ocr_par
These logical tags have their standard meaning as used in the publishing industry and tools like LaTeX, MS Word, and others.
The standard HTML tags given in brackets specify the preferred HTML tags to use with those logical structuring elements, but it may not be possible or desirable to actually choose those tags (e.g., when adding hOCR information to an existing HTML output routine).
-4.12.
-ocr_linear
For all of these elements except
ocr_linear
, there exists a natural linear -ordering defined by reading order (ocr_linear
indicates that the elements -contained in it have a linear ordering). At the level ofocr_linear
, there -may not be a single distinguished order. A common example ofocr_linear
is a +4.11.1. ocr_linear
+For all of these elements except
+therefore be sensitive to the order of all elements other thanocr_linear
, there exists a natural linear +ordering defined by reading order (ocr_linear
indicates that the elements +contained in it have a linear ordering). At the level ofocr_linear
, there +may not be a single distinguished order. A common example ofocr_linear
is a newspaper, in which a single newspaper may contain many linear, but there is no unique reading order for the different linear. OCR evaluation tools should -therefore be sensitive to the order of all elements other thanocr_linear
.ocr_linear
.Tags must be nested as indicated by nesting above, but not all tags within the hierarchy need to be present.
Textual information like section numbers and bullets must be represented as text inside the containing element.
Documents whose logical structure does not map naturally onto these logical structuring elements must not use them for other purposes.
-4.13.
-ocr_caption
Image captions may be indicated using the
ocr_caption
element; such an +4.12. ocr_caption
+Image captions may be indicated using the
ocr_caption
element; such an element refers to the image(s) contained within the same float, or the -immediately adjacent image if both the image and theocr_caption
element are +immediately adjacent image if both the image and theocr_caption
element are in running text.5. Typesetting Related Elements
The following typesetting related elements are based on a typesetting model as @@ -1878,53 +1935,53 @@
5.1. Classes for typesetting elements The following classes, as well as floats are used for type-setting elements.
-5.1.1.
-ocr_page
The
-ocr_page
element must be present in all hOCR documents.5.1.2.
+ocr_column
5.1.1. ocr_page
+The
+ocr_page
element must be present in all hOCR documents.5.1.2. ocr_column
-5.1.3.
+ocr_carea
5.1.3. ocr_carea
"ocr content area" or "body area"
Used to be called
-ocr_columnThe
ocr_carea
elements should appear in reading order unless this is impossible -because of some other structuring requirement. If the document contains multipleocr_linear
streams, then eachocr_carea
must indicate which stream it belongs +The
ocr_carea
elements should appear in reading order unless this is impossible +because of some other structuring requirement. If the document contains multipleocr_linear
streams, then eachocr_carea
must indicate which stream it belongs to.Note that for many documents, the actual ground truth careas are well-defined by the document style of the original document before printing and scanning. From a single page, the
careas
of the original document style cannot be -recovered exactly. However, the partition of a document byocr_carea
for an +recovered exactly. However, the partition of a document byocr_carea
for an individual page shall be considered correct relative to ground truth if-
all the text contained in a ground truth carea is fully contained within a -single
+singleocr_carea
,ocr_carea
,- -
no text outside a ground truth
+carea
is contained within anocr_carea
, andno text outside a ground truth
carea
is contained within anocr_carea
, and- -
the
ocr_careas
appear in the same order as the text flow +the
ocr_carea
appear in the same order as the text flow relationships between the ground truth careas.5.1.4.
+ocr_line
5.1.4. ocr_line
In typesetting systems, content areas are filled with “blocks”, but most of those blocks are not recoverable or semantically meaningful. However, one type of block is visible and very important for OCR engines: the line. Lines are typesetting blocks that only contain glyphs (“inlines” in XSL terminology). -They are represented by the
-ocr_line
area.-
ocr_line
should be in a<span>
5.1.5.
+They are represented by theocr_separator
ocr_line
area. ++
ocr_line
should be in a<span>
5.1.5. ocr_separator
Any separator or similar element
-5.1.6.
+ocr_noise
5.1.6. ocr_noise
Any noise element that isn’t part of typesetting
5.2. Recommended Properties for typesetting elements
The following properties should be present:
5.2.1.
bbox (typesetting)
The bounding box of the page; for pages, the top left corner must be at
-(0,0)
, so a typical page bounding box will look likebbox 0 0 2300 3200
5.2.2.
+image
5.2.2. image
image imagefile
-
- @@ -1940,7 +1997,7 @@
5.
if the hOCR file is present in a directory hierarchy or file archive, should resolve to the corresponding image file
5.2.3.
+imagemd5
5.2.3. imagemd5
imagemd5 checksum
-
- @@ -1948,7 +2005,7 @@
allows re-associating pages with source images
5.2.4.
+ppageno
5.2.4. ppageno
ppageno n
-
- @@ -1962,7 +2019,7 @@
must not be present unless it is well defined and unique
5.2.5.
+lpageno
5.2.5. lpageno
lpageno string
- @@ -1976,19 +2033,19 @@
5.3. Optional Properties for typesetting elements
The following properties MAY be present:
-5.3.1.
+scan_res
5.3.1. scan_res
scan_res x_res y_res
-
scanning resolution in DPI
5.3.2.
+x_scanner
5.3.2. x_scanner
x_scanner string
-
a representation of the scanner
5.3.3.
+x_source
5.3.3. x_source
x_source string
- @@ -2008,8 +2065,8 @@
In addition to the standard -properties, the
-ocr_line
area supports the following additional properties:5.3.4.
+properties, thehardbreak
ocr_line
area supports the following additional properties: +5.3.4. hardbreak
hardbreak n
- @@ -2019,67 +2076,67 @@
ocr_line element. Examples of such special characters are a soft hyphen ("",
U+00AD
), a hard line break (<br>
), or whitespace () for soft line breaks.
5.4. Classes for floats
Floats should not be nested.
The following floats are defined:
-5.4.1.
+ocr_float
5.4.1. ocr_float
-
ocr_float
5.4.2.
-ocr_separator
-
ocr_separator
5.4.3.
+ocr_textfloat
5.4.2. ocr_separator
++
ocr_separator
in the context of float classes.5.4.3. ocr_textfloat
-
ocr_textfloat
5.4.4.
+ocr_textimage
5.4.4. ocr_textimage
-
ocr_textimage
5.4.5.
+ocr_image
5.4.5. ocr_image
-
ocr_image
5.4.6.
+ocr_linedrawing
5.4.6. ocr_linedrawing
Something that could be represented well and naturally in a vector graphics format like SVG (even if it is actually represented as PNG)
-5.4.7.
+ocr_photo
5.4.7. ocr_photo
Something that requires JPEG or PNG to be represented well
-5.4.8.
+ocr_header
5.4.8. ocr_header
-
ocr_header
5.4.9.
+ocr_footer
5.4.9. ocr_footer
-
ocr_footer
5.4.10.
+ocr_pageno
5.4.10. ocr_pageno
-
ocr_pageno
5.4.11.
+ocr_table
5.4.11. ocr_table
ocr_table
6. Inline Representations
There is some content that should behave and flow like text
6.1. Classes for Inline Representation
-6.1.1.
+ocr_glyph
6.1.1. ocr_glyph
An individual glyph represented as an image (e.g., an unrecognized character)
Must contain a single
-<img>
tag, or be present on one6.1.2.
+ocr_glyphs
6.1.2. ocr_glyphs
Multiple glyphs represented as an image (e.g., an unrecognized word)
Must contain a single
-<img>
tag, or be present on one6.1.3.
+ocr_dropcap
6.1.3. ocr_dropcap
An individual glyph representing a dropcap
May contain text or an
-<img>
tag; thealt
of the image tag should contain the corresponding text6.1.4.
+ocr_chem
6.1.4. ocr_chem
A chemical formula
Must contain either a single
-<img>
tag or [CML] markup, or be present on one6.1.5.
+ocr_math
6.1.5. ocr_math
A mathematical formula
Must contain either a single
-<img>
tag or [MathML] markup, or be present on oneMathematical and chemical formulas that float must be put into an
+ocr_float
section.Mathematical and chemical formulas that float must be put into an
ocr_float
section.Mathematical and chemical formulas that are “display” mode should be put into -an
+anocr_display
section.ocr_display
section.6.1.6. Non-breaking space
Non-breaking spaces must be represented using the HTML
entity.6.1.7. Non-default spaces
Different space widths should be indicated using HTML and
 
,&emsp
, 
,‌
,‍
.6.1.8. Hyphenation
Soft hyphens must be represented using the HTML
-­
entity.The HTML
+‎
and‏
entities (indicating writing direction) must not -be used; all writing direction changes must be indicated with tags.The HTML
‎
and‏
entities (indicating writing direction) must not be used; all +writing direction changes must be indicated with tags.6.1.9. Superscript and Subscript
Other superscripts and subscripts must be represented using the HTML
<sup>
and<sub>
tags, even if special Unicode characters are available.6.1.10. Ruby characters
@@ -2088,18 +2145,18 @@7.1. Classes for Character Information
Character-level information may be put on any element that contains only a single "line" of text.
-7.1.1.
-ocr_cinfo
If no other layout element applies, the
+ocr_cinfo
element may be used.7.1.1. ocr_cinfo
+If no other layout element applies, the
ocr_cinfo
element may be used.7.2. Properties for Character Information
-7.2.1.
+cuts
7.2.1. cuts
cuts c1 c2 c3 ...
-
character segmentation cuts (see below)
- -
there must be a bbox property relative to which the cuts can be interpreted
+there must be a bbox property relative to which the cuts can be interpreted
7.2.2.
+nlp
7.2.2. nlp
nlp c1 c2 c3 ...
- @@ -2112,12 +2169,12 @@
7.2.
Assume a bounding box of
-(0,0,300,100)
; thencuts("10 11 7 19") = +cuts("10 11 7 19") = [ [(10,0),(10,100)], [(21,0),(21,100)], [(28,0),(28,100)], [(47,0),(47,100)] ] -cuts("10,50,3 11,30,-3") = +cuts("10,50,3 11,30,-3") = [ [(10,0),(10,50),(13,50),(13,100)], [(21,0),(21,30),(18,30),(18,100)] ]-<span class="ocr_cinfo" title="bbox 0 0 300 100; nlp 1.7 2.3 3.9 2.7; cuts 9 11 7,8,-2 15 3">hello</span> +<span class="ocr_cinfo" title="bbox 0 0 300 100; nlp 1.7 2.3 3.9 2.7; cuts 9 11 7,8,-2 15 3">hello</span>Cuts are between all codepoints contained within the element, including any @@ -2135,7 +2192,7 @@
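The two `cuts(...)` expansions above can be reproduced with a small sketch. It assumes, based only on those examples, that the first number of each cut is an x offset from the previous cut's starting x, that the following numbers alternate between a move in y and a move in x, and that every path is finally extended to the full height of the bounding box. `expand_cuts` is a hypothetical helper name.

```python
def expand_cuts(cuts, height):
    """Expand a cuts property value into paths in bbox-relative coordinates."""
    paths = []
    base_x = 0
    for token in cuts.split():
        deltas = [int(d) for d in token.split(",")]
        base_x += deltas[0]          # offset from the previous cut's start
        x, y = base_x, 0
        path = [(x, y)]
        for i, delta in enumerate(deltas[1:]):
            if i % 2 == 0:
                y += delta           # vertical move
            else:
                x += delta           # horizontal move
            path.append((x, y))
        path.append((x, height))     # run to the far edge of the bbox
        paths.append(path)
    return paths


# Bounding box (0, 0, 300, 100) as in the text, so height is 100.
print(expand_cuts("10 11 7 19", 100))
print(expand_cuts("10,50,3 11,30,-3", 100))
```

Note that the second cut in `"10,50,3 11,30,-3"` starts at x = 21 (10 + 11), not at 24: the offset is taken from where the previous cut started, which is what the expansion in the text shows.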
Commonly suggested engine-specific markup includes:
8.1. Classes for engine specific markup
-8.1.1.
+ocrx_block
8.1.1. ocrx_block
-
- @@ -2143,15 +2200,15 @@
engine-specific because the definition of a "block" depends on the engine
8.1.2.
+ocrx_line
8.1.2. ocrx_line
-
- -
any kind of "line" returned by an OCR system that differs from the standard ocr_line above
+any kind of "line" returned by an OCR system that differs from the standard
ocr_line
abovemight be some kind of "logical" line
8.1.3.
+ocrx_word
8.1.3. ocrx_word
+
any kind of "word" returned by an OCR system
@@ -2162,47 +2219,48 @@-
an
+ocrx_block
should not contain content from multiple ocr_careasAn
ocrx_block
should not contain content from multipleocr_carea
.- -
the union of all
+ocrx_blocks
should approximately cover allocr_careas
The union of all
ocrx_blocks
should approximately cover allocr_carea
.- -
an
+ocrx_block
should contain either a float or body text, but not bothan
ocrx_block
should contain either a float or body text, but not both- -
an
+ocrx_block
should contain either an image or text, but not bothan
ocrx_block
should contain either an image or text, but not both- -
an
+ocrx_line
should correspond as closely as possible to anocr_line
an
ocrx_line
should correspond as closely as possible to anocr_line
- -
+
ocrx_cinfo
should nest insideocrx_line
ocrx_cinfo
should nest insideocrx_line
- -
+
ocrx_cinfo
should contain onlyx_conf
,x_bboxes
, andcuts
attributes
ocrx_cinfo
should contain only x_confs, x_bboxes, and cuts attributes8.2. Properties for engine-specific markup
The following properties are defined:
-8.2.1.
+x_font
8.2.1. x_font
x_font s
-
OCR-engine specific font names
8.2.2.
+x_fsize
8.2.2. x_fsize
x_fsize n
-
OCR-engine specific font size
8.2.3.
+x_bboxes
8.2.3. x_bboxes
x_bboxes b1x0 b1y0 b1x1 b1y1 b2x0 b2y0 b2x1 b2y1 ...
-
OCR-engine specific boxes associated with each codepoint contained in the element
- -
note that the bbox property is a property for the bounding box of a layout +
note that the bbox property is a property for the bounding box of a layout element, not of individual characters
in particular, use
<span class="ocr_cinfo" title="x_bboxes ....">
, not<span class="ocr_cinfo" title="bbox ...">
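As an illustration of the per-codepoint layout of `x_bboxes`, the following sketch pairs each codepoint of an element's text with its box. It assumes exactly four integers per codepoint, in the same `x0 y0 x1 y1` order as `bbox`; the function name and the sample coordinates are made up for the example.

```python
def pair_boxes(text, x_bboxes):
    """Return [(codepoint, (x0, y0, x1, y1)), ...] for an element's text."""
    values = [int(v) for v in x_bboxes.split()]
    if len(values) != 4 * len(text):
        raise ValueError("expected four numbers per codepoint")
    boxes = [tuple(values[i:i + 4]) for i in range(0, len(values), 4)]
    return list(zip(text, boxes))


# Hypothetical values for a two-character element.
for char, box in pair_boxes("hi", "10 20 18 40 19 20 23 40"):
    print(char, box)
```

Raising on a length mismatch is a design choice for the sketch; a more tolerant reader might instead ignore the property when the counts disagree.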
8.2.4.
+x_confs
8.2.4. x_confs
x_confs c1 c2 c3 ...
-
- @@ -2215,7 +2273,7 @@
if possible, convert character confidences to values between 0 and 100 and have them approximate posterior probabilities (expressed in %)
8.2.5.
+x_wconf
8.2.5. x_wconf
x_wconf n
- @@ -2248,14 +2306,14 @@
. It must contain
<ins>
and<del>
elements. The first contained element should be<ins>
and represent the most probable interpretation, the subsequent ones<del>
. Each<ins>
and<del>
element should haveclass="alt"
and a -property of eithernlp
orx_cost
. These<span>
,<ins>
, and<del>
tags can nest +property of either nlp or x_cost. These<span>
,<ins>
, and<del>
tags can nest arbitrarily.Whitespace within the
<span>
but outside the contained<ins>
/<del>
elements is ignored and should be inserted to improve readability of the HTML @@ -2263,7 +2321,7 @@11. Grouped Elements and Multiple Hierarchies
The different levels of layout information (logical, physical, engine-specific) each form hierarchies, but those hierarchies may not be mutually compatible; -for example, a single
ocr_page
may contain information from multiple sections +for example, a singleocr_page
may contain information from multiple sections or chapters. To represent both hierarchies within a single document, elements may be grouped together. That is, two elements with the same class may be treated as one element by adding a "groupid identifier" property to them and @@ -2279,7 +2337,7 @@ocr_chapter elements into single
ocr_chapter
elements based on the groupid. The result is a simple DOM tree. This transformation can be provided generically as a pre-processor or Javascript.The presence of grouped elements does not need to be indicated in the header; @@ -2294,15 +2352,15 @@
12.1.
ocrp_lang
+12.1. ocrp_lang
Capable of generating
-lang=
attributes12.2.
+ocrp_dir
12.2. ocrp_dir
Capable of generating
-dir=
attributes12.3.
+ocrp_poly
12.3. ocrp_poly
Capable of generating polygonal bounds
-12.4.
+ocrp_font
12.4. ocrp_font
Capable of generating font information (standard font information)
-12.5.
+ocrp_nlp
12.5. ocrp_nlp
Capable of generating nlp confidences
12.6.
ocr_embeddedformat_<formatname>
The capability to generate other specific embedded formats is given by the @@ -2317,9 +2375,13 @@
13. Metadata
13.1. Required Meta Information
The OCR system is required to indicate the following using meta tags in the header:
+13.1.1. ocr-system
+
+
<meta name="ocr-system" content="name version"/>
13.1.2. ocr-capabilities
++
<meta name="ocr-capabilities" content="capabilities"/>
@@ -2327,10 +2389,15 @@
see §12 Capabilities 13.2. Recommended Meta Information
The OCR system should indicate the following information
+13.2.1. ocr-number-of-pages
+
+
<meta name="ocr-number-of-pages" content="number-of-pages"/>
13.2.2. ocr-langs
++
<meta name="ocr-langs" content="languages-considered-by-ocr"/>
@@ -2339,6 +2406,9 @@
+
value may be
unknown
13.2.3. ocr-scripts
+-
<meta name="ocr-scripts" content="scripts-considered-by-ocr"/>
@@ -2348,7 +2418,7 @@
value may be unknown
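A reader can collect these `meta` entries with the Python standard library alone. The sketch below is one possible approach, not a reference implementation; the class name `OcrMetaParser` and the sample header values (`example-ocr 1.0` and so on) are invented, and it assumes the capability list is whitespace-separated.

```python
from html.parser import HTMLParser


class OcrMetaParser(HTMLParser):
    """Collect <meta name="ocr-..." content="..."> entries into a dict."""

    def __init__(self):
        super().__init__()
        self.meta = {}

    def handle_starttag(self, tag, attrs):
        # Also called for self-closing <meta .../> via handle_startendtag.
        if tag != "meta":
            return
        attr = dict(attrs)
        name = attr.get("name") or ""
        if name.startswith("ocr-"):
            self.meta[name] = attr.get("content") or ""


sample = """
<html><head>
<meta name="ocr-system" content="example-ocr 1.0"/>
<meta name="ocr-capabilities" content="ocr_page ocr_carea ocr_line"/>
<meta name="ocr-langs" content="unknown"/>
</head><body></body></html>
"""

parser = OcrMetaParser()
parser.feed(sample)
capabilities = parser.meta.get("ocr-capabilities", "").split()
print(parser.meta["ocr-system"], capabilities)
```

A consumer could then check for a capability such as `ocrp_poly` in `capabilities` before relying on the corresponding markup.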
13.2. Document metadata
+13.3. Document metadata
For document meta information, use the Dublin Core Embedding into HTML. See also Citation Guidelines for Dublin Core.
@@ -2384,21 +2454,21 @@14
common commercial OCR output (e.g., Abbyy)
- -
ocr_page
+- -
ocrx_block, ocrx_line, ocrx_word
+- -
ocrp_lang
+- -
ocrp_font
+book target
- -
all logical structuring elements (as applicable), except ocr_linear
+all logical structuring elements (as applicable), except
ocr_linear
- -
ocr_page
+newspaper target
@@ -2406,9 +2476,9 @@14
all logical structuring elements (as applicable)
- -
articles map on ocr_linear
+articles map on
ocr_linear
- -
ocr_page
+15. HTML Markup
@@ -2595,32 +2665,32 @@import libxml2,re,os,string -# convert the HTML to XHTML (if necessary) -os.system("tidy -q -asxhtml < page.html > page.xhtml 2> /dev/null") +# convert the HTML to XHTML (if necessary) +os.system("tidy -q -asxhtml < page.html > page.xhtml 2> /dev/null") -# parse the XML -doc = libxml2.parseFile('page.xhtml') +# parse the XML +doc = libxml2.parseFile('page.xhtml') -# search all nodes having a class of ocr_line -lines = doc.xpathEval("//*[@class='ocr_line']") +# search all nodes having a class of ocr_line +lines = doc.xpathEval("//*[@class='ocr_line']") -# a function for extracting the text from a node +# a function for extracting the text from a node def get_text(node): - textnodes = node.xpathEval(".//text()") + textnodes = node.xpathEval(".//text()") s = string.join([node.getContent() for node in textnodes]) - return re.sub(r'\s+',' ',s) + return re.sub(r'\s+',' ',s) -# a function for extracting the bbox property from a node -# note that the title= attribute on a node with an ocr_ class must -# conform with the OCR spec +# a function for extracting the bbox property from a node +# note that the title= attribute on a node with an ocr_ class must +# conform with the OCR spec def get_bbox(node): - data = node.prop('title') - bboxre = re.compile(r'\bbbox\s+(\d+)\s+(\d+)\s+(\d+)\s+(\d+)') + data = node.prop('title') + bboxre = re.compile(r'\bbbox\s+(\d+)\s+(\d+)\s+(\d+)\s+(\d+)') return [int(x) for x in bboxre.search(data).groups()] -# this extracts all the bounding boxes and the text they contain -# it doesn’t matter what other markup the line node may contain +# this extracts all the bounding boxes and the text they contain +# it doesn’t matter what other markup the line node may contain for line in lines: print get_bbox(line),get_text(line)
+
Index
+Terms defined by this specification
++
- baseline, in §3.2.5 +
- bbox, in §3.1.1 +
- cflow, in §3.2.4 +
- cuts, in §7.2.1 +
- hardbreak, in §5.3.4 +
- image, in §5.2.2 +
- imagemd5, in §5.2.3 +
- lpageno, in §5.2.5 +
- nlp, in §7.2.2 +
- ocr_abstract, in §4.4 +
- ocr_author, in §4.3 +
- ocr_blockquote, in §4.10 +
- ocr-capabilities, in §13.1.2 +
- ocr_caption, in §4.12 +
- ocr_carea, in §5.1.3 +
- ocr_chapter, in §4.6 +
- ocr_chem, in §6.1.4 +
- ocr_cinfo, in §7.1.1 +
- ocr_column, in §5.1.2 +
- ocr_display, in §4.9 +
- ocr_document, in §4.1 +
- ocr_dropcap, in §6.1.3 +
- ocr_float, in §5.4.1 +
- ocr_footer, in §5.4.9 +
- ocr_glyph, in §6.1.1 +
- ocr_glyphs, in §6.1.2 +
- ocr_header, in §5.4.8 +
- ocr_image, in §5.4.5 +
- ocr-langs, in §13.2.2 +
- ocr_line, in §5.1.4 +
- ocr_linear, in §4.11.1 +
- ocr_linedrawing, in §5.4.6 +
- ocr_math, in §6.1.5 +
- ocr_noise, in §5.1.6 +
- ocr-number-of-pages, in §13.2.1 +
- ocr_page, in §5.1.1 +
- ocr_pageno, in §5.4.10 +
- ocr_par, in §4.11 +
- ocr_part, in §4.5 +
- ocrp_dir, in §12.2 +
- ocrp_font, in §12.4 +
- ocr_photo, in §5.4.7 +
- ocrp_lang, in §12.1 +
- ocrp_nlp, in §12.5 +
- ocrp_poly, in §12.3 +
- ocr-scripts, in §13.2.3 +
- ocr_section, in §4.7 +
- + ocr_separator + +
- ocr_subsubsection, in §4.8 +
- ocr-system, in §13.1.1 +
- ocr_table, in §5.4.11 +
- ocr_textfloat, in §5.4.3 +
- ocr_textimage, in §5.4.4 +
- ocr_title, in §4.2 +
- ocrx_block, in §8.1.1 +
- ocrx_line, in §8.1.2 +
- ocrx_word, in §8.1.3 +
- order, in §3.2.2 +
- poly, in §3.2.1 +
- ppageno, in §5.2.4 +
- presence, in §3.2.3 +
- scan_res, in §5.3.1 +
- textangle, in §3.1.2 +
- x_bboxes, in §8.2.3 +
- x_confs, in §8.2.4 +
- x_font, in §8.2.1 +
- x_fsize, in §8.2.2 +
- x_scanner, in §5.3.2 +
- x_source, in §5.3.3 +
- x_wconf, in §8.2.5 +
References
Normative References
@@ -2838,8 +2987,271 @@
↵
ocrx_cinfo? ↵-`] - * `ocr_chapter` [`
`] - * `ocr_section` [`
`] + * <{ocr_document}> + * <{ocr_linear}> + * <{ocr_title}> + * <{ocr_author}> + * <{ocr_abstract}> + * <{ocr_part}> [`
`] + * <{ocr_chapter}> [`
`] + * <{ocr_section}> [`
`] * `ocr_sub*section` [`
`,`
`] - * `ocr_display` - * `ocr_blockquote` [`
`] - * `ocr_par` [``] - -## `ocr_document` -## `ocr_title` -## `ocr_author` -## `ocr_abstract` -## `ocr_part` -## `ocr_chapter` -## `ocr_section` -## `ocr_subsubsection` -## `ocr_display` -## `ocr_blockquote` -## `ocr_par` + * <{ocr_display}> + * <{ocr_blockquote}> [`
`] + * <{ocr_par}> [``] + +## ocr_document +## ocr_title +## ocr_author +## ocr_abstract +## ocr_part +## ocr_chapter +## ocr_section +## ocr_subsubsection +## ocr_display +## ocr_blockquote +## ocr_par These logical tags have their standard meaning as used in the publishing industry and tools like LaTeX, MS Word, and others. @@ -241,15 +241,15 @@ with those logical structuring elements, but it may not be possible or desirable to actually chose those tags (e.g., when adding hOCR information to an existing HTML output routine). -## `ocr_linear` +### ocr_linear -For all of these elements except `ocr_linear`, there exists a natural linear -ordering defined by reading order (`ocr_linear` indicates that the elements -contained in it have a linear ordering). At the level of `ocr_linear`, there -may not be a single distinguished order. A common example of `ocr_linear` is a +For all of these elements except <{ocr_linear}>, there exists a natural linear +ordering defined by reading order (<{ocr_linear}> indicates that the elements +contained in it have a linear ordering). At the level of <{ocr_linear}>, there +may not be a single distinguished order. A common example of <{ocr_linear}> is a newspaper, in which a single newspaper may contain many linear, but there is no unique reading order for the different linear. OCR evaluation tools should -therefore be sensitive to the order of all elements other than `ocr_linear`. +therefore be sensitive to the order of all elements other than <{ocr_linear}>. Tags must be nested as indicated by nesting above, but not all tags within the hierarchy need to be present. @@ -260,11 +260,11 @@ text inside the containing element. Documents whose logical structure does not map naturally onto these logical structuring elemetns must not use them for other purpose. 
-## `ocr_caption` +## ocr_caption -Image captions may be indicated using the `ocr_caption` element; such an +Image captions may be indicated using the <{ocr_caption}> element; such an element refers to the image(s) contained within the same float, or the -immediately adjacent image if both the image and the `ocr_caption` element are +immediately adjacent image if both the image and the <{ocr_caption}> element are in running text. @@ -303,57 +303,57 @@ properties for floating elements; properties need to be defined for this. The following classes, as well as [floats](#classes-for-floats) are used for type-setting elements. -### `ocr_page` +### ocr_page -The `ocr_page` element must be present in all hOCR documents. +The <{ocr_page}> element must be present in all hOCR documents. -### `ocr_column` +### ocr_column
-### `ocr_carea` +### ocr_carea "ocr content area" or "body area" Used to be calledocr_column-The `ocr_carea` elements should appear in reading order unless this is impossible +The <{ocr_carea}> elements should appear in reading order unless this is impossible because of some other structuring requirement. If the document contains multiple -`ocr_linear` streams, then each `ocr_carea` must indicate which stream it belongs +<{ocr_linear}> streams, then each <{ocr_carea}> must indicate which stream it belongs to. Note that for many documents, the actual ground truth careas are well-defined by the document style of the original document before printing and scanning. From a single page, the `careas` of the original document style cannot be -recovered exactly. However, the partition of a document by `ocr_carea` for an +recovered exactly. However, the partition of a document by <{ocr_carea}> for an individual page shall be considered correct relative to ground truth if 1. all the text contained in a ground truth carea is fully contained within a - single `ocr_carea`, + single <{ocr_carea}>, 2. no text outside a ground truth `carea` is contained within an - `ocr_carea`, and - 3. the `ocr_careas` appear in the same order as the text flow + <{ocr_carea}>, and + 3. the <{ocr_carea}> appear in the same order as the text flow relationships between the ground truth careas. -### `ocr_line` +### ocr_line In typesetting systems, content areas are filled with “blocks”, but most of those blocks are not recoverable or semantically meaningful. However, one type of block is visible and very important for OCR engines: the line. Lines are typesetting blocks that only contain glyphs (“inlines” in XSL terminology). -They are represented by the `ocr_line` area. +They are represented by the <{ocr_line}> area. 
-`ocr_line` should be in a `` +<{ocr_line}> should be in a `` -### `ocr_separator` +### ocr_separator Any separator or similar element -### `ocr_noise` +### ocr_noise Any noise element that isn't part of typesetting @@ -366,7 +366,7 @@ The following properties should be present: The bounding box of the page; for pages, the top left corner must be at `(0,0)`, so a typical page bounding box will look like `bbox 0 0 2300 3200` -### `image` +### image `image imagefile` @@ -378,14 +378,14 @@ The bounding box of the page; for pages, the top left corner must be at * if the hOCR file is present in a directory hierarchy or file archive, should resolve to the corresponding image file -### `imagemd5` +### imagemd5 `imagemd5 checksum` * MD5 fingerprint of the image file that this page was derived from * allows re-associating pages with source images -### `ppageno` +### ppageno `ppageno n` @@ -395,7 +395,7 @@ The bounding box of the page; for pages, the top left corner must be at * must not be present unless the pages in the document have a physical ordering * must not be present unless it is well defined and unique -### `lpageno` +### lpageno `lpageno string` @@ -408,19 +408,19 @@ The bounding box of the page; for pages, the top left corner must be at The following properties MAY be present: -### `scan_res` +### scan_res `scan_res x_res y_res` * scanning resolution in DPI -### `x_scanner` +### x_scanner `x_scanner string` * a representation of the scanner -### `x_source` +### x_source `x_source string` @@ -433,9 +433,9 @@ The following properties MAY be present: * `x_source http://pageserver/012345678911&page=17` In addition to the standard -properties, the `ocr_line` area supports the following additional properties: +properties, the <{ocr_line}> area supports the following additional properties: -### `hardbreak` +### hardbreak `hardbreak n` @@ -444,7 +444,7 @@ properties, the `ocr_line` area supports the following additional properties: * a one indicates that the line is a 
hard (explicit) line break Any special characters representing the desired end-of-line processing must be -present inside the `ocr_line` element. Examples of such special characters are a +present inside the <{ocr_line}> element. Examples of such special characters are a soft hyphen ("", `U+00AD`), a hard line break (`
`), or whitespace (` `) for soft line breaks. @@ -454,48 +454,48 @@ Floats should not be nested. The following floats are defined: -### `ocr_float` +### ocr_float `ocr_float` -### `ocr_separator` +### ocr_separator -`ocr_separator` +`ocr_separator` in the context of float classes. -### `ocr_textfloat` +### ocr_textfloat `ocr_textfloat` -### `ocr_textimage` +### ocr_textimage `ocr_textimage` -### `ocr_image` +### ocr_image `ocr_image` -### `ocr_linedrawing` +### ocr_linedrawing Something that could be represented well and naturally in a vector graphics format like SVG (even if it is actually represented as PNG) -### `ocr_photo` +### ocr_photo Something that requires JPEG or PNG to be represented well -### `ocr_header` +### ocr_header `ocr_header` -### `ocr_footer` +### ocr_footer `ocr_footer` -### `ocr_pageno` +### ocr_pageno `ocr_pageno` -### `ocr_table` +### ocr_table `ocr_table` @@ -505,44 +505,44 @@ There is some content that should behave and flow like text ## Classes for Inline Representation -### `ocr_glyph` +### ocr_glyph An individual glyph represented as an image (e.g., an unrecognized character) Must contain a single `` tag, or be present on one -### `ocr_glyphs` +### ocr_glyphs Multiple glyphs represented as an image (e.g., an unrecognized word) Must contain a single `` tag, or be present on one -### `ocr_dropcap` +### ocr_dropcap An individual glyph representing a dropcap May contain text or an `` tag; the `alt` of the image tag should contain the corresponding text -### `ocr_chem` +### ocr_chem A chemical formula Must contain either a single `` tag or [[CML]] markup, or be present on one -### `ocr_math` +### ocr_math A mathematical formula Must contain either a single `` tag or [[MathML]] markup, or be present on one -Mathematical and chemical formulas that float must be put into an `ocr_float` +Mathematical and chemical formulas that float must be put into an <{ocr_float}> section. 
Mathematical and chemical formulas that are “display” mode should be put into -an `ocr_display` section. +an <{ocr_display}> section. ### Non-breaking space @@ -557,8 +557,9 @@ Different space widths should be indicated using HTML and ` `, `&emsp`, Soft hyphens must be represented using the HTML `` entity. -The HTML `` and `` entities (indicating writing direction) must not -be used; all writing direction changes must be indicated with tags. +The HTML `` and +`` entities (indicating writing direction) must not be used; all +writing direction changes must be indicated with tags. ### Superscript and Subscript @@ -577,20 +578,20 @@ must be represented using their correct Unicode encoding. Character-level information may be put on any element that contains only a single "line" of text. -### `ocr_cinfo` +### ocr_cinfo -If no other layout element applies, the `ocr_cinfo` element may be used. +If no other layout element applies, the <{ocr_cinfo}> element may be used. ## Properties for Character Information -### `cuts` +### cuts `cuts c1 c2 c3 ...` * character segmentation cuts (see below) - * there must be a bbox property relative to which the cuts can be interpreted + * there must be a 'bbox' property relative to which the 'cuts' can be interpreted -### `nlp` +### nlp `nlp c1 c2 c3 ...` @@ -641,21 +642,21 @@ Common suggested engine-specific markup are: ## Classes for engine specific markup -### `ocrx_block` +### ocrx_block Issue: [ocr_carea vs ocrx_block](https://github.com/kba/hocr-spec/issues/28) * any kind of "block" returned by an OCR system * engine-specific because the definition of a "block" depends on the engine -### `ocrx_line` +### ocrx_line Issue: [ocr_line vs ocrx_line](https://github.com/kba/hocr-spec/issues/19) - * any kind of "line" returned by an OCR system that differs from the standard ocr_line above + * any kind of "line" returned by an OCR system that differs from the standard <{ocr_line}> above * might be some kind of "logical" line -### `ocrx_word` 
+### ocrx_word * any kind of "word" returned by an OCR system * engine specific because the definition of a "word" depends on the engine @@ -663,42 +664,44 @@ Issue: [ocr_line vs ocrx_line](https://github.com/kba/hocr-spec/issues/19) The meaning of these tags is OCR engine specific. However, generators should attempt to ensure the following properties: -* an `ocrx_block` should not contain content from multiple ocr_careas -* the union of all `ocrx_blocks` should approximately cover all `ocr_careas` -* an `ocrx_block` should contain either a float or body text, but not both -* an `ocrx_block` should contain either an image or text, but not both -* an `ocrx_line` should correspond as closely as possible to an `ocr_line` -* `ocrx_cinfo` should nest inside `ocrx_line` -* `ocrx_cinfo` should contain only `x_conf`, `x_bboxes`, and `cuts` attributes +* An <{ocrx_block}> should not contain content from multiple <{ocr_carea}>. +* The union of all <{ocrx_block|ocrx_blocks}> should approximately cover all <{ocr_carea}>. +* an <{ocrx_block}> should contain either a float or body text, but not both +* an <{ocrx_block}> should contain either an image or text, but not both +* an <{ocrx_line}> should correspond as closely as possible to an <{ocr_line}> +* <{ocrx_cinfo}> should nest inside <{ocrx_line}> +* <{ocrx_cinfo}> should contain only 'x_confs', 'x_bboxes', and 'cuts' attributes + +Issue: ocrx_cinfo? 
## Properties for engine-specific markup The following properties are defined: -### `x_font` +### x_font `x_font s` * OCR-engine specific font names -### `x_fsize` +### x_fsize `x_fsize n` * OCR-engine specific font size -### `x_bboxes` +### x_bboxes `x_bboxes b1x0 b1y0 b1x1 b1y1 b2x0 b2y0 b2x1 b2y1 ...` * OCR-engine specific boxes associated with each codepoint contained in the element - * note that the bbox property is a property for the bounding box of a layout + * note that the 'bbox' property is a property for the bounding box of a layout element, not of individual characters * in particular, use ``, not `` -### `x_confs` +### x_confs `x_confs c1 c2 c3 ...` @@ -708,7 +711,7 @@ The following properties are defined: * if possible, convert character confidences to values between 0 and 100 and have them approximate posterior probabilities (expressed in %) -### `x_wconf` +### x_wconf `x_wconf n` @@ -748,7 +751,7 @@ Alternative segmentations and readings are indicated by a `` with `class="alternatives"`. It must contains `` and `` elements. The first contained element should be `` and represent the most probable interpretation, the subsequent ones ``. Each `` and `` element should have `class="alt"` and a -property of either `nlp` or `x_cost`. These ``, ``, and `` tags can nest +property of either 'nlp' or 'x_cost'. These ``, ``, and `` tags can nest arbitrarily.@@ -769,7 +772,7 @@ when viewed in a browser. The different levels of layout information (logical, physical, engine-specific) each form hierarchies, but those hierarchies may not be mutually compatible; -for example, a single `ocr_page` may contain information from multiple sections +for example, a single <{ocr_page}> may contain information from multiple sections or chapters. To represent both hierarchies within a single document, elements may be grouped together. 
That is, two elements with the same class may be treated as one element by adding a "groupid identifier" property to them and @@ -787,8 +790,8 @@ removing tags that are not of interest for the subsequent processing step, and then collapsing grouped elements into single elements. For example, output that contains both logical and physical layout information, where the logical layout information uses grouped elements, can be transformed by removing all -the physical layout information, and then collapsing all split `ocr_chapter` -elements into single `ocr_chapter` elements based on the groupid. The result is +the physical layout information, and then collapsing all split <{ocr_chapter}> +elements into single <{ocr_chapter}> elements based on the groupid. The result is a simple DOM tree. This transformation can be provided generically as a pre-processor or Javascript. @@ -809,23 +812,23 @@ document. The capability to generate specific properties is given by the prefix `ocrp_...`; the important properties are: -## `ocrp_lang` +## ocrp_lang Capable of generating `lang=` attributes -## `ocrp_dir` +## ocrp_dir Capable of generating `dir=` attributes -## `ocrp_poly` +## ocrp_poly Capable of generating [polygonal bounds](#poly) -## `ocrp_font` +## ocrp_font Capable of generating font information (standard font information) -## `ocrp_nlp` +## ocrp_nlp Capable of generating [nlp confidences](#nlp) @@ -851,16 +854,31 @@ corresponding element or attribute must not be present in the document. 
The OCR system is required to indicate the following using meta tags in the header: +### ocr-system + * `` + +### ocr-capabilities + * `` * see [[#capabilities]] +## Recommended Meta Information + The OCR system should indicate the following information +### ocr-number-of-pages + * `` + +### ocr-langs + * `` * use [ISO 639-1](https://www.loc.gov/standards/iso639-2/php/code_list.php) codes * value may be `unknown` + +### ocr-scripts + * `` * use [ISO 15924](http://www.unicode.org/iso15924/codelists.html) letter codes * value may be `unknown` @@ -901,17 +919,17 @@ Other possible profiles might be defined for specific engines or specific document classes: * common commercial OCR output (e.g., Abbyy) - * ocr_page - * ocrx_block, ocrx_line, ocrx_word - * ocrp_lang - * ocrp_font + * <{ocr_page}> + * <{ocrx_block}>, <{ocrx_line}>, <{ocrx_word}> + * ''ocr-capabilities/ocrp_lang'' + * ''ocr-capabilities/ocrp_font'' * book target - * all logical structuring elements (as applicable), except ocr_linear - * ocr_page + * all logical structuring elements (as applicable), except <{ocr_linear}> + * <{ocr_page}> * newspaper target * all logical structuring elements (as applicable) - * articles map on ocr_linear - * ocr_page + * articles map on <{ocr_linear}> + * <{ocr_page}> # HTML Markup @@ -1171,3 +1189,7 @@ Issue: [correct MIME type for hOCR?](https://github.com/kba/hocr-spec/issues/27) : Applications which use this media type: : File extension(s): :: `*.html`, `*.hocr` + + + +
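Reviewer note: the property syntaxes whose headings this diff touches (`bbox 0 0 2300 3200`, `x_bboxes b1x0 b1y0 b1x1 b1y1 ...`, `x_confs c1 c2 c3 ...`) can be read with a few lines of Python. The block below is a minimal sketch, not part of the spec or of any existing hOCR tool. The function names `parse_title` and `group_bboxes` are hypothetical, and it assumes properties are separated by semicolons, values contain only whitespace-separated tokens, and coordinates are integers. Quoted values that contain spaces or semicolons, such as some `image` or `x_source` strings, are not handled.

```python
from typing import Dict, List, Tuple


def parse_title(title: str) -> Dict[str, List[str]]:
    """Split an hOCR title string into {property name: list of value tokens}.

    Assumption: properties are separated by ';' and no value contains ';'.
    """
    props: Dict[str, List[str]] = {}
    for chunk in title.split(";"):
        tokens = chunk.split()
        if not tokens:
            continue  # tolerate empty chunks such as a trailing semicolon
        props[tokens[0]] = tokens[1:]
    return props


def group_bboxes(values: List[str]) -> List[Tuple[int, ...]]:
    """Group a flat coordinate list into (x0, y0, x1, y1) tuples."""
    if len(values) % 4 != 0:
        raise ValueError("expected a multiple of four coordinates")
    nums = [int(v) for v in values]
    return [tuple(nums[i:i + 4]) for i in range(0, len(nums), 4)]


# Illustrative title string, made up for this sketch.
sample = "bbox 10 20 60 40; x_bboxes 10 20 30 40 35 20 60 40; x_confs 97 88"
sample_props = parse_title(sample)
element_box = group_bboxes(sample_props["bbox"])[0]    # box of the layout element
char_boxes = group_bboxes(sample_props["x_bboxes"])    # one box per codepoint
char_confs = [float(c) for c in sample_props["x_confs"]]
```

Values stay as string tokens in `parse_title` so that one parser can serve numeric properties such as `ppageno` and string properties such as `lpageno`. Conversion to numbers happens only where the property is known to be numeric.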