FS#64073 - [tesseract] Vertical data missing for Chinese, Japanese, and Korean
Attached to Project:
Arch Linux
Opened by Sid S. (cnte) - Wednesday, 09 October 2019, 17:12 GMT
Last edited by Caleb Maclennan (alerque) - Friday, 16 June 2023, 15:14 GMT
Details
Description:
tesseract doesn't correctly recognize vertical Chinese, Japanese, and Korean text even when the relevant packages are installed. Comparing the PKGBUILD for tesseract-data against the upstream repository shows that the _langs variable needs entries for jpn_vert, chi_sim_vert, chi_tra_vert, and kor_vert. After adding the required entries to the PKGBUILD (actually, to the tesseract-data-git package on AUR), I built the needed packages and got tesseract to recognize vertical text. The tesseract-data package should be fixed and the new packages for vertical CJK added.

Additional info:
package version: tesseract-data 1:4.0.0-1
upstream: https://github.com/tesseract-ocr/tessdata

Steps to reproduce:
Run the following command on the attached image.
tesseract vertical_japanese.png wrong_output -l jpn
Build the tesseract-data-jpn_vert package and run the following command.
tesseract vertical_japanese.png correct_output -l jpn_vert
(Granted, it misreads the first character. But that's not relevant, and as you can see, it's a lot more accurate.)
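To see which of the four vertical models named in this report are absent on a given system, something like the following sketch may help. The function name missing_vert_langs is hypothetical and not part of tesseract or any Arch package; it only filters a list of language codes read from standard input, one per line.

```shell
# Hypothetical helper: print each vertical CJK model from this report
# that does not appear in the language list given on stdin.
missing_vert_langs() {
    installed=$(cat)
    for lang in jpn_vert chi_sim_vert chi_tra_vert kor_vert; do
        # -x matches whole lines only, so "jpn" does not count as "jpn_vert";
        # -F treats the language code as a fixed string.
        printf '%s\n' "$installed" | grep -qxF "$lang" || printf '%s\n' "$lang"
    done
}
```

Assuming tesseract is on PATH, the list can be fed in with `tesseract --list-langs 2>&1 | missing_vert_langs`. The header line that tesseract prints before the language codes is ignored because it never equals a language code exactly. No output means all four models are installed.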
Comment by Caleb Maclennan (alerque) - Friday, 16 June 2023, 14:20 GMT
kor_vert and jpn_vert got added sometime in the past. I'll add the other two now.