From 6fdbbbf28d85b0531d7bebbb59433d5fe49422f6 Mon Sep 17 00:00:00 2001
From: amitdo
Date: Wed, 19 Oct 2016 14:43:17 +0300
Subject: [PATCH] Move paragraphs from the 'ocrx_word' section to the right sections

---
 1.2/index.bs   | 27 ++++++++++----------
 1.2/index.html | 68 ++++++++++++++++++++++++++------------------------
 1.2/spec.md    | 27 ++++++++++----------
 3 files changed, 61 insertions(+), 61 deletions(-)

diff --git a/1.2/index.bs b/1.2/index.bs
index 9170920..321d586 100644
--- a/1.2/index.bs
+++ b/1.2/index.bs
@@ -609,7 +609,11 @@ single "line" of text.
 
 ### ocr_cinfo
 
-If no other layout element applies, the <{ocr_cinfo}> element may be used.
+Issue: ocrx_cinfo?
+
+ * If no other layout element applies, the <{ocr_cinfo}> element may be used.
+ * <{ocrx_cinfo}> should nest inside <{ocrx_line}>
+ * <{ocrx_cinfo}> should contain only 'x_confs', 'x_bboxes', and 'cuts' attributes
 
 ## Properties for Character Information
 
@@ -678,31 +682,26 @@ Issue: [ocr_carea vs ocrx_block](https://github.com/kba/hocr-spec/issues/28)
  * any kind of "block" returned by an OCR system
  * engine-specific because the definition of a "block" depends on the engine
 
+Generators should attempt to ensure the following properties:
+
+ * An <{ocrx_block}> should not contain content from multiple <{ocr_carea}>.
+ * The union of all <{ocrx_block|ocrx_blocks}> should approximately cover all <{ocr_carea}>.
+ * an <{ocrx_block}> should contain either a float or body text, but not both
+ * an <{ocrx_block}> should contain either an image or text, but not both
+
 ### ocrx_line
 
 Issue: [ocr_line vs ocrx_line](https://github.com/kba/hocr-spec/issues/19)
 
  * any kind of "line" returned by an OCR system that differs from the standard <{ocr_line}> above
  * might be some kind of "logical" line
+ * an <{ocrx_line}> should correspond as closely as possible to an <{ocr_line}>
 
 ### ocrx_word
 
  * any kind of "word" returned by an OCR system
  * engine specific because the definition of a "word" depends on the engine
 
-The meaning of these tags is OCR engine specific. However, generators should
-attempt to ensure the following properties:
-
-* An <{ocrx_block}> should not contain content from multiple <{ocr_carea}>.
-* The union of all <{ocrx_block|ocrx_blocks}> should approximately cover all <{ocr_carea}>.
-* an <{ocrx_block}> should contain either a float or body text, but not both
-* an <{ocrx_block}> should contain either an image or text, but not both
-* an <{ocrx_line}> should correspond as closely as possible to an <{ocr_line}>
-* <{ocrx_cinfo}> should nest inside <{ocrx_line}>
-* <{ocrx_cinfo}> should contain only 'x_confs', 'x_bboxes', and 'cuts' attributes
-
-Issue: ocrx_cinfo?
-
 ## Properties for engine-specific markup
 
 The following properties are defined:
diff --git a/1.2/index.html b/1.2/index.html
index 8bde9db..2dbdd1a 100644
--- a/1.2/index.html
+++ b/1.2/index.html
@@ -1421,7 +1421,7 @@

hOCR - OCR Workflow and Output embedded in HTML

-

Living Standard,

+

Living Standard,

This version: @@ -1440,7 +1440,7 @@

@@ -2146,7 +2146,15 @@

7.1.1. ocr_cinfo

-

If no other layout element applies, the ocr_cinfo element may be used.

+

ocrx_cinfo?

+

7.2. Properties for Character Information

7.2.1. cuts

cuts c1 c2 c3 ...

@@ -2154,7 +2162,7 @@

7.2
  • character segmentation cuts (see below)

  • -

    there must be a bbox property relative to which the cuts can be interpreted

    +

    there must be a bbox property relative to which the cuts can be interpreted

    7.2.2. nlp

    nlp c1 c2 c3 ...

    @@ -2200,6 +2208,17 @@

    engine-specific because the definition of a "block" depends on the engine

    +

    Generators should attempt to ensure the following properties:

    +

    8.1.2. ocrx_line

    ocr_line vs ocrx_line

      @@ -2207,6 +2226,8 @@

      ocr_line above

    • might be some kind of "logical" line

      +
    • +

      an ocrx_line should correspond as closely as possible to an ocr_line

    8.1.3. ocrx_word

      @@ -2215,25 +2236,6 @@

      engine specific because the definition of a "word" depends on the engine

    -

    The meaning of these tags is OCR engine specific. However, generators should -attempt to ensure the following properties:

    - -

    ocrx_cinfo?

    8.2. Properties for engine-specific markup

    The following properties are defined:

    8.2.1. x_font

    @@ -2985,9 +2987,9 @@

    Use of property presence

  • There is currently no way of indicating anchoring or flow-around properties for floating elements; properties need to be defined for this.
    +
    ocrx_cinfo?
    ocr_carea vs ocrx_block
    ocr_line vs ocrx_line
    -
    ocrx_cinfo?
    Delete x_cost
    XML namespace for hOCR HTML?
    What DOCTYPE for hOCR HTML?
    @@ -3101,7 +3103,7 @@

    3.2.4. cflow (2) (3)
  • 5.1.2. ocr_column
  • 5.1.3. ocr_carea (2) (3) (4) (5) (6) -
  • 8.1.3. ocrx_word (2) +
  • 8.1.1. ocrx_block (2) @@ -3165,13 +3167,13 @@

    #propdef-x_bboxesReferenced in: