FS#33008 - [tesseract] Segfaults on any non-trivial input image
Attached to Project:
Community Packages
Opened by Alain Kalker (ackalker) - Saturday, 08 December 2012, 02:13 GMT
Last edited by Sergej Pupykin (sergej) - Friday, 11 January 2013, 17:01 GMT
Details
Description:
Running Tesseract on most any non-trivial image (containing more than 2 short words) leads to a segmentation fault.

Additional info:

* package version(s)

$ tesseract -v
tesseract 3.02.02
leptonica-1.69
libgif 4.1.6 : libjpeg 8b : libpng 1.5.13 : libtiff 4.0.3 : zlib 1.2.7

$ pacman -Qs tesseract-data
local/tesseract-data-eng 3.02.02-1 (tesseract-data)
    Tesseract OCR data (eng)
local/tesseract-data-nld 3.02.02-1 (tesseract-data)
    Tesseract OCR data (nld)

* config and/or log files etc.

Config unchanged from package.

Example message from dmesg:

[39855.247296] tesseract[18952]: segfault at 0 ip 00007fb2e77b7198 sp 00007fff5e289360 error 4 in libtesseract.so.3.0.2[7fb2e7573000+2bd000]

Steps to reproduce:

$ tesseract image.jpg outfile
Tesseract Open Source OCR Engine v3.02.02 with Leptonica
index >= 0 && index < size_used_:Error:Assert failed:in file ../ccutil/genericvector.h, line 512
Segmentation fault (core dumped)

I have reason to believe (see: http://code.google.com/p/tesseract-ocr/issues/detail?id=683) that one possible reason for this is that the language data sources defined in the PKGBUILD don't work with the current version of Tesseract, even though the download page at googlecode seems to indicate they do.

I would suggest using the tesseract-ocr-3.02.XXX.tar.gz files and not the XXX.traineddata.gz. (The .tar.gz files also contain the (newer) traineddata files.)
Some notes:
- Uses tesseract-ocr-3.02.tar.gz , not tesseract-ocr-3.02.tar.gz . As far as I've checked, the former includes files generated by a much older version of the Autotools, and is missing some other files. The 'newer' tarball has changes to at least one source file, and is the one actually recommended on the Downloads page.
- To conform more to upstream naming, and to make building easier, I would like to suggest renaming the package base from 'tesseract' to 'tesseract-ocr'. In my PKGBUILD I've already done this.
- Also, the language data packages don't need to be built for a specific architecture, as they are just data, so I'm setting all these packages to "arch=('any')".
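For illustration only, here is roughly how the rename and the per-package arch=('any') override could look in a split PKGBUILD. This is a minimal sketch, not the PKGBUILD attached to this task: the data package name, the install paths and the layout of the extracted tarball are all assumptions.

```shell
# Illustrative PKGBUILD fragment (bash, as read by makepkg).
# NOT the attached PKGBUILD: package names and paths are assumptions.
pkgbase=tesseract-ocr
pkgname=('tesseract-ocr' 'tesseract-ocr-data-eng')
pkgver=3.02.02
pkgrel=1
arch=('i686' 'x86_64')   # default; applies to the compiled engine

package_tesseract-ocr() {
  pkgdesc="An OCR program"
  # Assumes the source tree was unpacked to $srcdir/tesseract-ocr and built in build()
  cd "$srcdir/tesseract-ocr"
  make DESTDIR="$pkgdir" install
}

package_tesseract-ocr-data-eng() {
  pkgdesc="Tesseract OCR data (eng)"
  arch=('any')           # per-package override: pure data, no compiled code
  # Assumed location of the traineddata file inside the language tarball
  install -Dm644 "$srcdir/tesseract-ocr/tessdata/eng.traineddata" \
    "$pkgdir/usr/share/tessdata/eng.traineddata"
}
```

With an override like this, makepkg should produce a single -any package for each language while the engine package remains architecture-specific.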
Any suggestions are more than welcome! :-)
Yes, it's getting late here :-)
This would remove the need to download and repackage all the language data whenever Tesseract changes. It would also eliminate the private variable _langver .