FS#60951 - [tesseract-data-jpn] Vertical text failing, add jpn_vert

Attached to Project: Community Packages
Opened by Xum (Xum) - Sunday, 02 December 2018, 09:48 GMT
Last edited by Caleb Maclennan (alerque) - Tuesday, 01 February 2022, 09:06 GMT
Task Type: Feature Request
Category: Packages
Status: Closed
Assigned To: Jelle van der Waa (jelly), Caleb Maclennan (alerque)
Architecture: All
Severity: Low
Priority: Normal
Reported Version:
Due in Version: Undecided
Due Date: Undecided
Percent Complete: 100%
Votes: 0
Private: No


The OCR data for Japanese does not include the vertical variant (jpn_vert), and there is no separate package for it. Most Japanese literature is written vertically. This wasn't a big issue in the past, because the legacy Tesseract engine can mostly handle vertical Japanese text without jpn_vert. Tesseract 4, however, added recognition based on a neural net. This neural net is used by default and outputs only garbage when trying to recognize vertical Japanese text without jpn_vert.
Could the jpn_vert traineddata be added to this package? It is relatively small compared to the normal jpn traineddata.

Additional info:
* package version(s): tesseract 4.0.0-1, tesseract-data-jpn 1:4.0.0-1

Steps to reproduce:
Run tesseract on the test image (attached): tesseract testv.png stdout -l jpn
One can switch to the legacy engine with "--oem 0", which outputs acceptable results.
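
The comparison described in these steps can be sketched as a small script. This is only an illustration: the helper name tesseract_cmd and the variant labels are made up for this example, and the third command assumes that a jpn_vert traineddata file is already installed, which is exactly what this report asks for. The script only prints the command lines and does not run tesseract.

```python
import shlex


def tesseract_cmd(image, lang, oem=None):
    """Build a tesseract command line that prints recognized text to stdout.

    Hypothetical helper for illustration; it only assembles arguments.
    """
    cmd = ["tesseract", image, "stdout", "-l", lang]
    if oem is not None:
        # --oem 0 selects the legacy engine, as mentioned in the report.
        cmd += ["--oem", str(oem)]
    return cmd


# Three cases worth comparing on the attached test image.
VARIANTS = {
    "jpn, default engine (reported as garbage)":
        tesseract_cmd("testv.png", "jpn"),
    "jpn, legacy engine (reported workaround)":
        tesseract_cmd("testv.png", "jpn", oem=0),
    "jpn_vert (assumes the vertical data is installed)":
        tesseract_cmd("testv.png", "jpn_vert"),
}

for label, cmd in VARIANTS.items():
    # shlex.join quotes each argument so the line can be pasted into a shell.
    print(f"{label}:\n  {shlex.join(cmd)}")
```

Each printed line can be pasted into a shell, or the argument list can be passed to subprocess.run, to compare the output of the three variants side by side.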

Closed by Caleb Maclennan (alerque) - Tuesday, 01 February 2022, 09:06 GMT
Reason for closing: Fixed
Additional comments about closing: g.tar.zst
Comment by Caleb Maclennan (alerque) - Tuesday, 01 February 2022, 08:46 GMT
Should this be packaged as a separate tesseract-data-jpn_vert package or added to the current one? If it can work independently, it should probably be separate; if the two are only really useful in conjunction with each other, we should probably just stuff them in the same package.
Comment by Caleb Maclennan (alerque) - Tuesday, 01 February 2022, 09:05 GMT
Since this has been open for a long time and I'm not sure the original reporter will see this any time soon, I went ahead and bundled this as a separate package. If somebody feels that was the wrong call, please open a new bug report and I'll combine them.