Arch Linux

FS#64073 - [tesseract] Vertical data missing for Chinese, Japanese, and Korean

Attached to Project: Arch Linux
Opened by Sid S. (cnte) - Wednesday, 09 October 2019, 17:12 GMT
Last edited by freswa (frederik) - Friday, 21 February 2020, 22:03 GMT
Task Type: Feature Request
Category: Packages: Extra
Status: Assigned
Assigned To: Jelle van der Waa (jelly)
Architecture: All
Severity: Low
Priority: Normal
Reported Version:
Due in Version: Undecided
Due Date: Undecided
Percent Complete: 0%
Votes: 1
Private: No

Details

Description:
tesseract doesn't correctly recognize vertical Chinese, Japanese, and Korean text even when the relevant language packages are installed. Comparing the PKGBUILD for tesseract-data against the upstream repository shows that the _langs variable is missing entries for jpn_vert, chi_sim_vert, chi_tra_vert, and kor_vert.
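
A rough sketch of the check described above, for illustration only. The language list passed in is a hypothetical, shortened stand-in for the real _langs array in the PKGBUILD (which is a bash array and much longer), and missing_vert is a made-up helper name, not part of any package.

```shell
# Vertical-text models named in this report.
vert="chi_sim_vert chi_tra_vert jpn_vert kor_vert"

# Print each vertical model that is absent from the space-separated
# language list given as $1 (a stand-in for the PKGBUILD's _langs).
missing_vert() {
  for want in $vert; do
    case " $1 " in
      *" $want "*) ;;        # already listed, nothing to report
      *) echo "$want" ;;     # not listed, would need to be added
    esac
  done
}

# Hypothetical excerpt of the current list: no vertical entries yet.
missing_vert "chi_sim chi_tra jpn kor"
```

Anything this prints would have to be appended to _langs for the corresponding tesseract-data-* package to be built.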

After adding the required entries to the PKGBUILD (actually, to the tesseract-data-git package on AUR), I compiled the needed packages and got tesseract to recognize vertical text.

The tesseract-data package should be fixed and the four new packages for vertical CJK text (jpn_vert, chi_sim_vert, chi_tra_vert, kor_vert) added.

Additional info:
package version: tesseract-data 1:4.0.0-1
upstream: https://github.com/tesseract-ocr/tessdata

Steps to reproduce:
Run the following command on the attached image.
tesseract vertical_japanese.png wrong_output -l jpn

Compile the tesseract-data-jpn_vert package and run the following command.
tesseract vertical_japanese.png correct_output -l jpn_vert

(Granted, jpn_vert still misreads the first character, but that's beside the point: as you can see, the output is far more accurate than with jpn.)
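
Before running the second command it can help to confirm that the vertical model is actually installed. The following is a minimal sketch, assuming that `tesseract --list-langs` prints a header line followed by one language code per line; has_lang is a hypothetical helper name.

```shell
# Succeed if the language code in $1 appears as a whole line on stdin.
# -x matches whole lines only, so "jpn" does not match "jpn_vert".
has_lang() {
  grep -qxF -- "$1"
}

# Typical use (requires tesseract to be installed):
#   tesseract --list-langs | has_lang jpn_vert \
#     || echo "jpn_vert data not installed"
```

Matching whole lines avoids a false positive from the plain jpn entry, which is the situation this report describes.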