2.0: Replace title= props with data-ocr-* attributes #77

kba · 2016-10-22T11:35:51Z

Reusing the title= attribute of HTML elements for OCR-specific values is bad practice. It's understandable, since at the time of hOCR's initial development there were few mechanisms for extending HTML, but HTML5 offers quite a few.

In a (possible) next major revision of the standard, we could use data-ocr-* attributes for that purpose.

<span id="line1" class="ocr_line" title="bbox 0 0 100 100">...</span>

could be expressed as

<span id="line1" data-ocr-tag="line" data-ocr-bbox="[0,0,100,100]"> ... </span>

This is more verbose, but it would make it much easier to specify behavior and work with the content. For example, in JavaScript you could do:

var line = document.querySelector("#line1");
// data-ocr-bbox is exposed as dataset.ocrBbox; its value is a JSON array
var bbox = JSON.parse(line.dataset.ocrBbox);
var width = bbox[2] - bbox[0];
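To illustrate what a migration could look like, here is a minimal sketch that converts an existing class/title pair into the proposed attributes. The function name `titleToDataAttrs` is hypothetical, and the conventions it applies (JSON arrays for numeric values, `data-ocr-tag` derived from the `ocr_` class) are assumptions extrapolated from the example above, not part of any spec. It also assumes that properties are separated by `;` and that no property value itself contains a `;`.

```javascript
// Hypothetical helper: convert an hOCR class/title pair into the
// data-ocr-* attributes sketched in this proposal.
function titleToDataAttrs(className, title) {
  const attrs = {};

  // "ocr_line" -> "line" (assumption: one ocr_* token in the class list).
  const tag = /(?:^|\s)ocr_(\w+)/.exec(className);
  if (tag) attrs["data-ocr-tag"] = tag[1];

  // Each property is a name followed by space-separated values.
  for (const prop of title.split(";")) {
    const [name, ...values] = prop.trim().split(/\s+/);
    if (!name) continue; // skip empty segments, e.g. after a trailing ";"

    // Assumption: purely numeric values become a JSON array,
    // anything else is kept as a plain string.
    const allNumeric =
      values.length > 0 && values.every((v) => !Number.isNaN(Number(v)));
    attrs["data-ocr-" + name] = allNumeric
      ? JSON.stringify(values.map(Number))
      : values.join(" ");
  }
  return attrs;
}

console.log(titleToDataAttrs("ocr_line", "bbox 0 0 100 100"));
```

For the example element above this yields `data-ocr-tag="line"` and `data-ocr-bbox="[0,0,100,100]"`.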

zuphilip · 2016-10-22T13:01:35Z

I think the data-ocr-* attributes would be a good way to continue. But is there any reason to change the class as well? The class attribute is standard HTML and has very good support, e.g. document.getElementsByClassName("ocr_line").
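For comparison, a rough sketch of both lookups. The helper names `findByOcrClass` and `findByOcrTag` are hypothetical, and `data-ocr-tag` is the attribute name from the proposal above; the attribute variant assumes the tag is a simple token that needs no escaping inside the selector.

```javascript
// Current convention: look elements up by their hOCR class name.
function findByOcrClass(root, className) {
  return Array.from(root.getElementsByClassName(className));
}

// Proposed convention: look elements up with an attribute selector.
// Assumption: `tag` contains no quotes or other characters needing escaping.
function findByOcrTag(root, tag) {
  return Array.from(root.querySelectorAll('[data-ocr-tag="' + tag + '"]'));
}

// Only runs in a browser, where `document` exists.
if (typeof document !== "undefined") {
  console.log(findByOcrClass(document, "ocr_line").length);
  console.log(findByOcrTag(document, "line").length);
}
```

Both calls are a single line, so the difference is mostly one of convention rather than of convenience.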

kba · 2016-10-22T13:30:47Z

It would make it easier to map between formats (ALTO) and serializations if the OCR application profile of the HTML were uniform, i.e. if it didn't force a naming convention onto class, id or title.
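As a sketch of such a mapping, a data-ocr-bbox value could be turned into the HPOS, VPOS, WIDTH and HEIGHT attributes that ALTO uses for geometry. The function name `bboxToAlto` is hypothetical; the sketch assumes the bbox is ordered [x0, y0, x1, y1] and that the ALTO file measures in pixels, so no unit conversion is applied.

```javascript
// Hypothetical mapping from a data-ocr-bbox value to ALTO-style geometry.
// Assumes bbox = [x0, y0, x1, y1] and that ALTO coordinates are in pixels.
function bboxToAlto(bboxJson) {
  const [x0, y0, x1, y1] = JSON.parse(bboxJson);
  return {
    HPOS: x0,          // left edge
    VPOS: y0,          // top edge
    WIDTH: x1 - x0,
    HEIGHT: y1 - y0,
  };
}

console.log(bboxToAlto("[0,0,100,100]"));
```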

kba added this to the Version 2.0 milestone Oct 22, 2016
