FS#64073 - [tesseract] Vertical data missing for Chinese, Japanese, and Korean

Attached to Project: Arch Linux
Opened by Sid S. (cnte) - Wednesday, 09 October 2019, 17:12 GMT
Last edited by Caleb Maclennan (alerque) - Friday, 16 June 2023, 15:14 GMT
Task Type Feature Request
Category Packages: Extra
Status Closed
Assigned To Caleb Maclennan (alerque)
Architecture All
Severity Low
Priority Normal
Reported Version
Due in Version Undecided
Due Date Undecided
Percent Complete 100%
Votes 1
Private No

Details

Description:
tesseract doesn't correctly recognize vertical Chinese, Japanese, and Korean even when the relevant packages are installed. Looking at the PKGBUILD for tesseract-data and comparing it against the upstream repository, the _langs variable needs to have entries for jpn_vert, chi_sim_vert, chi_tra_vert, and kor_vert.
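
For illustration only, the change described above might look roughly like the following bash sketch. The starting contents of _langs here are a shortened, hypothetical excerpt (the real array in the PKGBUILD is much longer), and the loop is just one way to append the four vertical entries without duplicating any that are already present.

```bash
# Hypothetical excerpt of the PKGBUILD's _langs array; the real list is far longer.
_langs=(chi_sim chi_tra jpn kor)

# Vertical models named in this report.
_vert_langs=(chi_sim_vert chi_tra_vert jpn_vert kor_vert)

# Append each vertical entry unless it is already listed.
for _v in "${_vert_langs[@]}"; do
  _found=0
  for _l in "${_langs[@]}"; do
    if [ "$_l" = "$_v" ]; then
      _found=1
    fi
  done
  if [ "$_found" -eq 0 ]; then
    _langs+=("$_v")
  fi
done

# Show the resulting list, one entry per line.
printf '%s\n' "${_langs[@]}"
```

In an actual PKGBUILD the four names would more likely just be written directly into the array; the loop only makes the before/after difference explicit.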

After adding the required entries to the PKGBUILD (actually, to the tesseract-data-git package on AUR), I compiled the needed packages and got tesseract to recognize vertical text.

The tesseract-data package should be fixed and the four new packages for vertical CJK (jpn_vert, chi_sim_vert, chi_tra_vert, and kor_vert) added.

Additional info:
package version: tesseract-data 1:4.0.0-1
upstream: https://github.com/tesseract-ocr/tessdata

Steps to reproduce:
Run the following command on the attached image.
tesseract vertical_japanese.png wrong_output -l jpn

Compile the tesseract-data-jpn_vert package and run the following command.
tesseract vertical_japanese.png correct_output -l jpn_vert

(Granted, it misreads the first character. But that's not relevant, and as you can see, it's a lot more accurate.)
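
Before running the commands above, it may help to check which vertical models are actually present. The sketch below assumes the trained data lives in /usr/share/tessdata (an assumption; the directory can differ between systems) and that each model is a file named <lang>.traineddata, as in the upstream repository. The function name missing_vert_models is made up for this example.

```shell
# Print the vertical CJK models that are missing from a tessdata directory.
# The default path is an assumption; pass another directory as the first argument.
missing_vert_models() {
  dir=${1:-/usr/share/tessdata}
  for lang in chi_sim_vert chi_tra_vert jpn_vert kor_vert; do
    # A model counts as installed if its .traineddata file exists.
    [ -f "$dir/$lang.traineddata" ] || printf '%s\n' "$lang"
  done
}

missing_vert_models
```

Empty output would mean all four models were found. Running tesseract --list-langs should give a similar picture from tesseract's own point of view.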

Closed by Caleb Maclennan (alerque)
Friday, 16 June 2023, 15:14 GMT
Reason for closing: Fixed
Comment by Caleb Maclennan (alerque) - Friday, 16 June 2023, 14:20 GMT
kor_vert and jpn_vert got added sometime in the past. I'll add the other two now.
