FS#64073 : [tesseract] Vertical data missing for Chinese, Japanese, and Korean

Attached to Project: Arch Linux
Opened by Sid S. (cnte) - Wednesday, 09 October 2019, 17:12 GMT
Last edited by Caleb Maclennan (alerque) - Friday, 16 June 2023, 15:14 GMT

Task Type	Feature Request
Category	Packages: Extra
Status	Closed
Assigned To	Caleb Maclennan (alerque)
Architecture	All
Severity	Low
Priority	Normal
Reported Version
Due in Version	Undecided
Due Date	Undecided
Percent Complete
Votes	1 Minmo (Minmo) (2020-12-18)
Private	No

Details

Description:
tesseract doesn't correctly recognize vertical Chinese, Japanese, and Korean text even when the relevant language packages are installed. Comparing the PKGBUILD for tesseract-data against the upstream repository shows that the _langs variable is missing entries for jpn_vert, chi_sim_vert, chi_tra_vert, and kor_vert.

After adding the required entries to the PKGBUILD (actually, to the tesseract-data-git package on AUR), I compiled the needed packages and got tesseract to recognize vertical text.

The tesseract-data package should be fixed and the four new packages for vertical CJK (jpn_vert, chi_sim_vert, chi_tra_vert, and kor_vert) added.
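
As a rough illustration of the comparison described above, the following sketch reports which of the four vertical entries are absent from a language list. The contents of current_langs are a hypothetical excerpt, not the real _langs value from the PKGBUILD, and missing_vert is a helper name made up for this example.

```shell
# Hypothetical excerpt of the packaged language list; the real _langs is much longer.
current_langs="chi_sim chi_tra jpn kor"

# The four vertical entries named in this report.
required_vert="jpn_vert chi_sim_vert chi_tra_vert kor_vert"

# missing_vert CURRENT REQUIRED: print each word of REQUIRED that is not in CURRENT.
missing_vert() {
    for want in $2; do
        case " $1 " in
            *" $want "*) ;;               # already present, nothing to report
            *) printf '%s\n' "$want" ;;   # absent, print it
        esac
    done
}

missing_vert "$current_langs" "$required_vert"
```

With the hypothetical list above, all four vertical entries are printed, one per line.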

Additional info:
package version: tesseract-data 1:4.0.0-1
upstream: https://github.com/tesseract-ocr/tessdata

Steps to reproduce:
Run the following command on the attached image.
tesseract vertical_japanese.png wrong_output -l jpn

Build and install the tesseract-data-jpn_vert package, then run the following command.
tesseract vertical_japanese.png correct_output -l jpn_vert

(Granted, it misreads the first character. But that's not relevant, and as you can see, it's a lot more accurate.)
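
Before running either command, it may help to confirm which language data tesseract can actually see. The sketch below uses tesseract's --list-langs option; has_lang is a helper name made up for this example, and the check is skipped when tesseract is not installed.

```shell
# has_lang NAME: read a language listing on stdin; succeed if NAME appears on a line by itself.
has_lang() {
    grep -qx -- "$1"
}

# Only query tesseract when it is actually installed.
if command -v tesseract >/dev/null 2>&1; then
    for lang in jpn jpn_vert; do
        if tesseract --list-langs 2>&1 | has_lang "$lang"; then
            echo "$lang: available"
        else
            echo "$lang: missing"
        fi
    done
fi
```

If jpn_vert is reported as missing, the second command in the steps above cannot load that language data.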

vertical_japanese.png (4.4 KiB)

current_PKGBUILD (1.4 KiB)

corrected_PKGBUILD (1.4 KiB)

Closed by Caleb Maclennan (alerque)
Friday, 16 June 2023, 15:14 GMT
Reason for closing: Fixed

Comment by Caleb Maclennan (alerque) - Friday, 16 June 2023, 14:20 GMT

kor_vert and jpn_vert got added sometime in the past. I'll add the other two now.
