FS#33008 - [tesseract] Segfaults on any non-trivial input image

Attached to Project: Community Packages
Opened by Alain Kalker (ackalker) - Saturday, 08 December 2012, 02:13 GMT
Last edited by Sergej Pupykin (sergej) - Friday, 11 January 2013, 17:01 GMT
Task Type Bug Report
Category Packages
Status Closed
Assigned To Sergej Pupykin (sergej)
Architecture All
Severity Low
Priority Normal
Reported Version
Due in Version Undecided
Due Date Undecided
Percent Complete 100%
Votes 0
Private No

Details

Description:
Running Tesseract on almost any non-trivial image (one containing more than two short words) leads to a segmentation fault.

Additional info:
* package version(s)
$ tesseract -v
tesseract 3.02.02
leptonica-1.69
libgif 4.1.6 : libjpeg 8b : libpng 1.5.13 : libtiff 4.0.3 : zlib 1.2.7
$ pacman -Qs tesseract-data
local/tesseract-data-eng 3.02.02-1 (tesseract-data)
Tesseract OCR data (eng)
local/tesseract-data-nld 3.02.02-1 (tesseract-data)
Tesseract OCR data (nld)

* config and/or log files etc.
Config unchanged from package.
Example message from dmesg:
[39855.247296] tesseract[18952]: segfault at 0 ip 00007fb2e77b7198 sp 00007fff5e289360 error 4 in libtesseract.so.3.0.2[7fb2e7573000+2bd000]
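
As a rough aid for reading the line above: subtracting the library's load address (the first number inside the brackets) from the instruction pointer gives the offset of the faulting instruction within libtesseract.so.3.0.2. A minimal shell sketch of that arithmetic, using only the values from the message (the variable names are illustrative):

```shell
# Values copied from the dmesg line above.
ip=0x00007fb2e77b7198      # "ip": instruction pointer at the time of the fault
base=0x7fb2e7573000        # load address of libtesseract.so.3.0.2

# Offset of the faulting instruction relative to the start of the library mapping.
offset=$(printf '0x%x' $(( $ip - $base )))
echo "$offset"             # prints 0x244198
```

This offset could then be handed to a tool such as addr2line (with -e pointing at the library file, whose install path is not stated here), though it will only resolve to a source line if debug symbols are available.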


Steps to reproduce:
$ tesseract image.jpg outfile
Tesseract Open Source OCR Engine v3.02.02 with Leptonica
index >= 0 && index < size_used_:Error:Assert failed:in file ../ccutil/genericvector.h, line 512
Segmentation fault (core dumped)

I have reason to believe (see: http://code.google.com/p/tesseract-ocr/issues/detail?id=683 ) that one possible reason for this is that the language data sources defined in the PKGBUILD don't work with the current version of Tesseract, even though the download page at googlecode seems to indicate they do.

I would suggest using the tesseract-ocr-3.02.XXX.tar.gz files instead of the XXX.traineddata.gz files. (The .tar.gz files also contain the (newer) traineddata files.)

Closed by  Sergej Pupykin (sergej)
Friday, 11 January 2013, 17:01 GMT
Reason for closing:  Fixed
Comment by Alain Kalker (ackalker) - Saturday, 08 December 2012, 05:37 GMT
I've created and test built packages from a new PKGBUILD, which solves the segmentation faults, and as an added bonus adds several more languages.

Some notes:
- Uses tesseract-ocr-3.02.tar.gz, not tesseract-ocr-3.02.tar.gz. As far as I've checked, the latter includes files generated by a much older version of Autotools, and is missing some other files. The 'newer' tarball has changes to at least one source file, and is the one actually recommended on the Downloads page.
- To conform more to upstream naming, and to make building easier, I would like to suggest renaming the package base from 'tesseract' to 'tesseract-ocr'. In my PKGBUILD I've already done this.
- Also, the language data packages don't need to be built for a specific architecture, as they are just data, so I'm setting all these packages to "arch=('any')".
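
For illustration only, the per-package arch override described in the last note might look roughly like the sketch below. This is not the attached PKGBUILD: the two-language list, the data package names (taken from the pacman output in the report), the tarball's tesseract-ocr/tessdata layout, and the /usr/share/tessdata install path are all assumptions, and build() and the application's package function are left out.

```shell
# Hedged sketch of a split PKGBUILD fragment (bash); names are illustrative.
pkgbase=tesseract-ocr
_langs=(eng nld)                 # hypothetical subset; the real list would be longer
pkgname=("$pkgbase")
for _l in "${_langs[@]}"; do
  pkgname+=("tesseract-data-$_l")
done
pkgver=3.02.02
pkgrel=1
arch=('i686' 'x86_64')           # default, used by the application package

# build() and package_tesseract-ocr() omitted for brevity.

# Generate one package function per language. Each one overrides arch,
# since trained data files are architecture-independent.
for _l in "${_langs[@]}"; do
  eval "package_tesseract-data-$_l() {
    pkgdesc='Tesseract OCR data ($_l)'
    groups=('tesseract-data')
    arch=('any')
    install -Dm644 \"\$srcdir/tesseract-ocr/tessdata/$_l.traineddata\" \\
      \"\$pkgdir/usr/share/tessdata/$_l.traineddata\"
  }"
done
```

Setting arch inside each package function leaves the top-level value untouched, so only the data packages would be built as 'any'.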

Any suggestions are more than welcome! :-)
   PKGBUILD (4.8 KiB)
Comment by Alain Kalker (ackalker) - Saturday, 08 December 2012, 05:39 GMT
Oops, first note should start: "Uses tesseract-ocr-3.02.02.tar.gz, not tesseract-3.02.02.tar.gz".
Yes, it's getting late here :-)
Comment by Alain Kalker (ackalker) - Saturday, 08 December 2012, 18:54 GMT
Now that there are so many languages, perhaps it would be better to further split up the package into a single package 'tesseract-ocr' for the application, and a split package 'tesseract-ocr-data' for the language data (which isn't expected to change as often as Tesseract itself).
This would remove the need to download and repackage all the language data whenever Tesseract changes. It would also eliminate the private variable _langver.
