FS#41746 - [tesseract] tesseract 3.03rc1-1 complains about missing osd.traineddata

Attached to Project: Community Packages
Opened by Matt (madalu) - Friday, 29 August 2014, 03:26 GMT
Last edited by Sergej Pupykin (sergej) - Tuesday, 03 March 2015, 19:06 GMT
Task Type Bug Report
Category Packages
Status Closed
Assigned To Sergej Pupykin (sergej)
Architecture All
Severity Low
Priority Normal
Reported Version
Due in Version Undecided
Due Date Undecided
Percent Complete 100%
Votes 1
Private No

Details

Description: The default hocr configuration does not work.

When generating hocr output, tesseract fails and complains about a missing /usr/share/tessdata/osd.traineddata.

This is because /usr/share/tessdata/configs/hocr has an additional line as of 3.03rc1-1:

tessedit_pageseg_mode 1

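However, this will not work on Arch, as no OSD data is packaged for Arch. Page segmentation mode 1 is automatic page segmentation with orientation and script detection (OSD), which requires osd.traineddata.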

Here is the error message:

out.001 - Tesseract Open Source OCR Engine v3.03 with Leptonica
Error opening data file /usr/share/tessdata/osd.traineddata
Please make sure the TESSDATA_PREFIX environment variable is set to the parent directory of your "tessdata" directory.
Failed loading language 'osd'
Tesseract couldn't load any languages!
Warning: Auto orientation and script detection requested, but osd language failed to load
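
The condition behind this error can be checked without running OCR. The following is a minimal sketch, assuming the default tessdata location from the report; check_hocr_osd is a hypothetical helper name, not part of tesseract. It only inspects the hocr config and looks for osd.traineddata next to it.

```shell
#!/bin/sh
# Sketch: report whether a tessdata directory can satisfy its hocr config.
# check_hocr_osd is a hypothetical helper, not a tesseract command.
check_hocr_osd() {
    # Assumed default path, taken from the error message in this report.
    tessdata_dir="${1:-/usr/share/tessdata}"
    hocr_config="$tessdata_dir/configs/hocr"

    # Page segmentation modes 0 and 1 both rely on OSD data.
    if grep -Eq '^[[:space:]]*tessedit_pageseg_mode[[:space:]]+[01][[:space:]]*$' \
        "$hocr_config" 2>/dev/null; then
        if [ -f "$tessdata_dir/osd.traineddata" ]; then
            echo "ok: OSD requested and osd.traineddata present"
            return 0
        fi
        echo "missing: hocr config requests OSD but osd.traineddata is absent"
        return 1
    fi

    # Config is absent or does not request an OSD-based mode.
    echo "ok: hocr config does not request OSD"
    return 0
}
```

Calling check_hocr_osd with no argument inspects /usr/share/tessdata; a non-zero exit status corresponds to the situation described in this report.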

Additional info:
* package version(s): 3.03rc1-1
- I also have tesseract-data-eng, tesseract-data-deu, and tesseract-data-fra installed
* config and/or log files etc.: no special config

Steps to reproduce:

Call tesseract on a scanned PNM file,
- e.g., "tesseract out.001.pnm ocr hocr"

This command works with no problems with 3.02.02-4. However, with 3.03rc1-1, tesseract complains (see error message above).






Closed by  Sergej Pupykin (sergej)
Tuesday, 03 March 2015, 19:06 GMT
Reason for closing:  Fixed
Additional comments about closing:  training utils and osd.traineddata added
Comment by Matt (madalu) - Friday, 29 August 2014, 04:09 GMT
Please excuse the typo: The title of the bug should read "tesseract 3.03rc1-1."

If someone has the authority to change it, I would much appreciate it.

Comment by Frank Siegert (fsiegert) - Friday, 19 September 2014, 11:04 GMT
The necessary data files seem to reside in tesseract-ocr-3.01.osd.tar.gz, which is available together with all the language files here: https://code.google.com/p/tesseract-ocr/downloads/list

As a workaround until the maintainer fixes this, one can download and untar the file manually and copy its contents to /usr/share/tessdata/.
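
The manual workaround can be sketched roughly as follows, assuming the tarball has already been downloaded. install_osd is a hypothetical helper name, and the archive's internal directory layout is not assumed: the script searches for osd.traineddata rather than hard-coding a path. Unlike the description above, it copies only that one file.

```shell
#!/bin/sh
# Sketch: extract osd.traineddata from a downloaded tarball into tessdata.
# install_osd is a hypothetical helper, not part of any package.
install_osd() {
    tarball="$1"
    # Assumed default destination, as named in the workaround.
    dest="${2:-/usr/share/tessdata}"
    workdir=$(mktemp -d) || return 1

    # Unpack into a scratch directory so nothing else lands in tessdata.
    tar -xzf "$tarball" -C "$workdir" || { rm -rf "$workdir"; return 1; }

    # Locate the file wherever the archive happens to place it.
    osd_file=$(find "$workdir" -type f -name 'osd.traineddata' | head -n 1)
    if [ -z "$osd_file" ]; then
        echo "osd.traineddata not found in $tarball" >&2
        rm -rf "$workdir"
        return 1
    fi

    cp "$osd_file" "$dest/"
    status=$?
    rm -rf "$workdir"
    return $status
}
```

Writing to /usr/share/tessdata needs root privileges. A file copied this way is not tracked by pacman, so it may have to be removed before installing a later package that ships osd.traineddata itself.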
Comment by Aleksei (yupi) - Saturday, 01 November 2014, 04:29 GMT
This is probably related to the fact that the current package is also missing training tools.
From
https://code.google.com/p/tesseract-ocr/wiki/Compiling

-------------
If you want the training tools (3.03), you will also need to run the following commands:

make training
sudo make training-install
Build of training tools is not available if you do not have necessary dependencies (pay attention to messages from ./configure script).
--------------
