FS#33008 : [tesseract] Segfaults on any non-trivial input image

Attached to Project: Community Packages
Opened by Alain Kalker (ackalker) - Saturday, 08 December 2012, 02:13 GMT
Last edited by Sergej Pupykin (sergej) - Friday, 11 January 2013, 17:01 GMT

Task Type	Bug Report
Category	Packages
Status	Closed
Assigned To	Sergej Pupykin (sergej)
Architecture	All
Severity	Low
Priority	Normal
Reported Version
Due in Version	Undecided
Due Date	Undecided
Percent Complete
Votes	0
Private	No

Details

Description:
Running Tesseract on almost any non-trivial image (one containing more than 2 short words) leads to a segmentation fault.

Additional info:
* package version(s)
$ tesseract -v
tesseract 3.02.02
leptonica-1.69
libgif 4.1.6 : libjpeg 8b : libpng 1.5.13 : libtiff 4.0.3 : zlib 1.2.7
$ pacman -Qs tesseract-data
local/tesseract-data-eng 3.02.02-1 (tesseract-data)
Tesseract OCR data (eng)
local/tesseract-data-nld 3.02.02-1 (tesseract-data)
Tesseract OCR data (nld)

* config and/or log files etc.
Config unchanged from package.
Example message from dmesg:
[39855.247296] tesseract[18952]: segfault at 0 ip 00007fb2e77b7198 sp 00007fff5e289360 error 4 in libtesseract.so.3.0.2[7fb2e7573000+2bd000]
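
The dmesg line above is enough to locate the faulting instruction inside the library: subtracting the mapping base (the first number inside the brackets) from the instruction pointer (ip) gives the offset within libtesseract.so.3.0.2. A minimal sketch using shell arithmetic; the helper name `fault_offset` is made up for illustration:

```shell
# fault_offset is a hypothetical helper, not part of any existing tool.
# Usage: fault_offset <ip> <mapping_base>   (both hexadecimal, without a 0x prefix)
fault_offset() {
    # Subtract the library's load address from the instruction pointer
    # and print the result as a hexadecimal offset.
    printf '0x%x\n' "$(( 0x$1 - 0x$2 ))"
}

# Values copied from the dmesg line above.
fault_offset 00007fb2e77b7198 7fb2e7573000    # prints 0x244198
```

The result is smaller than the mapping size 0x2bd000, so the fault is indeed inside the library. With debug symbols available, an offset like this could be passed to a tool such as addr2line to find the source line.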

Steps to reproduce:
$ tesseract image.jpg outfile
Tesseract Open Source OCR Engine v3.02.02 with Leptonica
index >= 0 && index < size_used_:Error:Assert failed:in file ../ccutil/genericvector.h, line 512
Segmentation fault (core dumped)
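
To check several images without watching the console, the crash can be detected from the exit status: shells such as bash report a process killed by a signal as 128 plus the signal number, so SIGSEGV (signal 11 on Linux) shows up as 139. A rough sketch; the helper `is_segfault` and the loop over *.jpg files are assumptions for illustration only:

```shell
# is_segfault is a hypothetical helper: it succeeds if the given exit status
# is 139, i.e. 128 + 11 (SIGSEGV) as reported by the shell.
is_segfault() {
    [ "$1" -eq 139 ]
}

# Assumed usage: run tesseract on every JPEG in the current directory.
for img in ./*.jpg; do
    [ -e "$img" ] || continue                     # glob matched nothing
    tesseract "$img" "${img%.jpg}" >/dev/null 2>&1
    status=$?
    if is_segfault "$status"; then
        echo "segfault: $img"
    fi
done
```

Each output base name is derived from the image name, so results from images that do not crash are not overwritten.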

I have reason to believe (see: http://code.google.com/p/tesseract-ocr/issues/detail?id=683 ) that one possible cause is that the language data sources defined in the PKGBUILD don't work with the current version of Tesseract, even though the download page on Google Code seems to indicate they do.

I would suggest using the tesseract-ocr-3.02.XXX.tar.gz files rather than the XXX.traineddata.gz files. (The .tar.gz files also contain the (newer) traineddata files.)

Closed by Sergej Pupykin (sergej)
Friday, 11 January 2013, 17:01 GMT
Reason for closing: Fixed

Comment by Alain Kalker (ackalker) - Saturday, 08 December 2012, 05:37 GMT

I've created and test-built packages from a new PKGBUILD, which resolves the segmentation faults and, as an added bonus, adds several more languages.

Some notes:
- Uses tesseract-ocr-3.02.tar.gz, not tesseract-ocr-3.02.tar.gz. As far as I've checked, the former includes files generated by a much older version of the Autotools and is missing some other files. The 'newer' tarball has changes to at least one source file and is the one actually recommended on the Downloads page.
- To conform more closely to upstream naming, and to make building easier, I would like to suggest renaming the package base from 'tesseract' to 'tesseract-ocr'. I've already done this in my PKGBUILD.
- Also, the language data packages don't need to be built for a specific architecture, as they are just data, so I'm setting all these packages to "arch=('any')".

Any suggestions are more than welcome! :-)

PKGBUILD (4.8 KiB)

Comment by Alain Kalker (ackalker) - Saturday, 08 December 2012, 05:39 GMT

Oops, first note should start: "Uses tesseract-ocr-3.02.02-tar.gz, not tesseract-3.02.02.tar.gz".
Yes, it's getting late here :-)

Comment by Alain Kalker (ackalker) - Saturday, 08 December 2012, 18:54 GMT

Now that there are so many languages, perhaps it would be better to further split up the package into a single package 'tesseract-ocr' for the application, and a split package 'tesseract-ocr-data' for the language data (which isn't expected to change as often as Tesseract itself).
This would remove the need to download and repackage all the language data whenever Tesseract changes. It would also eliminate the private variable _langver.
