FS#65676 - [tesseract] should depend on tesseract-data-eng

Attached to Project: Community Packages
Opened by carlenny (carlenny) - Monday, 02 March 2020, 02:16 GMT
Last edited by Jelle van der Waa (jelly) - Saturday, 06 June 2020, 18:06 GMT
Task Type: Bug Report
Category: Packages
Status: Closed
Assigned To: Jelle van der Waa (jelly)
Architecture: All
Severity: Low
Priority: Normal
Reported Version:
Due in Version: Undecided
Due Date: Undecided
Percent Complete: 100%
Votes: 0
Private: No

Details

tesseract-data-eng should be a (non-optional) dependency of tesseract.

Steps to reproduce:

$ pacman -Q | grep tesseract
tesseract 4.1.1-1
tesseract-data-deu 1:4.0.0-1

$ ocrmypdf -l deu --output-type pdf --skip-text input.pdf output.pdf
ERROR - Tesseract failed to report available languages.
Output from Tesseract:
-----------
Error opening data file /usr/share/tessdata/eng.traineddata
Please make sure the TESSDATA_PREFIX environment variable is set to your "tessdata" directory.
Failed loading language 'eng'
Tesseract couldn't load any languages!
List of available languages (2):
deu
osd

IMHO this is not an upstream bug, because Tesseract's [issue guidelines](https://github.com/tesseract-ocr/tesseract/blob/master/CONTRIBUTING.md) say:

> Each version of Tesseract has its own language data you need to obtain. You must obtain and install trained data for English (eng) and osd. Verify that Tesseract knows about these two files (and other trained data you installed) with this command: tesseract --list-langs.
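
The check the guideline describes could be sketched roughly as follows. This is only an illustration, assuming the data directory from the error output above (`/usr/share/tessdata`); the function name `missing_tessdata` is hypothetical and is not part of tesseract or of the Arch package.

```shell
# Hypothetical helper: print the name of each required language whose
# traineddata file is absent from the given directory.
# Assumption: the default directory is the one named in the error output.
missing_tessdata() {
    dir="${1:-/usr/share/tessdata}"
    # The upstream guidelines name eng and osd as the required data files.
    for lang in eng osd; do
        [ -f "$dir/$lang.traineddata" ] || echo "$lang"
    done
}
```

Running `missing_tessdata` with no argument on the system from the report would presumably print `eng`, since only `deu` and `osd` are listed as available. Installing `tesseract-data-eng` and re-running `tesseract --list-langs` should then confirm that both files are found.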

Closed by Jelle van der Waa (jelly)
Saturday, 06 June 2020, 18:06 GMT
Reason for closing: Fixed
Additional comments about closing: Works as intended, check the optional dependencies
Comment by Jelle van der Waa (jelly) - Saturday, 06 June 2020, 18:06 GMT
That doesn't scale, however, since someone might need a different language, and the package already has an optional dependency on it.
