FS#60951 - [tesseract-data-jpn] Vertical text failing, add jpn_vert
Attached to Project:
Community Packages
Opened by Xum (Xum) - Sunday, 02 December 2018, 09:48 GMT
Last edited by Caleb Maclennan (alerque) - Tuesday, 01 February 2022, 09:06 GMT
Details
Description:
The OCR data for Japanese does not include the vertical version (jpn_vert), and there is no separate package for it. Most Japanese literature is written vertically. This wasn't a big issue in the past, because the basic Tesseract engine can mostly handle vertical Japanese text without jpn_vert. But Tesseract 4 added recognition using a neural net. This neural net is used by default and only outputs garbage when trying to recognize vertical Japanese text without jpn_vert.

Couldn't the jpn_vert traineddata be added to this package? It is relatively small compared to the normal jpn traineddata.

Additional info:
* package version(s): tesseract 4.0.0-1, tesseract-data-jpn 1:4.0.0-1

Steps to reproduce:
Run tesseract on the test image (attached):
tesseract testv.png stdout -l jpn
One can switch to the legacy engine with "--oem 0", which outputs acceptable results.
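For anyone wanting to compare the cases described above side by side, here is a minimal sketch. The helper name build_tesseract_cmd and the RUNS list are made up for illustration. The third run assumes the jpn_vert traineddata is installed, which was not the case when this task was opened. It only uses the tesseract options already mentioned in the report (-l and --oem).

```python
import os
import shutil
import subprocess


def build_tesseract_cmd(image, lang, oem=None):
    """Return the tesseract argument list for one OCR run printed to stdout."""
    cmd = ["tesseract", image, "stdout", "-l", lang]
    if oem is not None:
        # --oem 0 selects the legacy engine, as suggested in the report.
        cmd += ["--oem", str(oem)]
    return cmd


# The three runs worth comparing on the vertical test image from the report.
RUNS = [
    ("jpn, default engine", build_tesseract_cmd("testv.png", "jpn")),
    ("jpn, legacy engine", build_tesseract_cmd("testv.png", "jpn", oem=0)),
    # Assumption: the jpn_vert traineddata is installed.
    ("jpn_vert, default engine", build_tesseract_cmd("testv.png", "jpn_vert")),
]

if __name__ == "__main__":
    if shutil.which("tesseract") is None or not os.path.exists("testv.png"):
        # Nothing to run against, so just show the command lines.
        print("tesseract or testv.png not found; commands that would run:")
        for label, cmd in RUNS:
            print(f"  {label}: {' '.join(cmd)}")
    else:
        for label, cmd in RUNS:
            result = subprocess.run(
                cmd,
                capture_output=True,
                encoding="utf-8",
                errors="replace",
                check=False,
            )
            print(f"== {label} ==")
            print(result.stdout or result.stderr)
```

If the report is accurate, the first run should print garbage for vertical text, while the second should give acceptable output.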
Closed by Caleb Maclennan (alerque)
Tuesday, 01 February 2022, 09:06 GMT
Reason for closing: Fixed
Additional comments about closing: tesseract-data-jpn_vert-2:4.1.0-3-any.pkg.tar.zst
Comment by Caleb Maclennan (alerque) - Tuesday, 01 February 2022, 08:46 GMT
Should this be packaged as a separate tesseract-data-jpn_vert
package or added to the current one? If it can work independently,
it should probably be separate; if they are only really useful in
conjunction with each other, we should probably just put them in
the same package.
Comment by Caleb Maclennan (alerque) - Tuesday, 01 February 2022, 09:05 GMT
Since this has been open for a long time and I'm not sure the
original reporter will see this any time soon, I went ahead and
bundled this as a separate package. If somebody feels that
was the wrong call, please open a new bug report and I'll join
them.