FS#39245 - [glibc] regex infinite loop

Attached to Project: Arch Linux
Opened by test0 (test0) - Sunday, 09 March 2014, 00:46 GMT
Last edited by Allan McRae (Allan) - Saturday, 20 February 2016, 01:47 GMT
Task Type Bug Report
Category Packages: Core
Status Closed
Assigned To Allan McRae (Allan)
Architecture i686
Severity High
Priority Normal
Reported Version
Due in Version Undecided
Due Date Undecided
Percent Complete 100%
Votes 0
Private No

Details

Description:
Running the `file` command on a specific file causes it to get stuck using 100% CPU (the original report is at http://bugs.gw.com/view.php?id=332).
I traced the problem to a call to regexec and created a short program to replicate the issue (attached).

Additional info:
* Using glibc 2.19-3

Steps to reproduce:
Compile and run the attached C file. It should print either "Match" or "No match", but instead it gets stuck.

Closed by  Allan McRae (Allan)
Saturday, 20 February 2016, 01:47 GMT
Reason for closing:  Fixed
Comment by Allan McRae (Allan) - Sunday, 09 March 2014, 03:23 GMT
The test case gives "No match" here, for both i686 and x86_64. How are you compiling it?
Comment by Allan McRae (Allan) - Sunday, 09 March 2014, 14:10 GMT
And I get results from file on your example. It just takes a long time.

$ time file TFAC00I0.RCI
TFAC00I0.RCI: Non-ISO extended-ASCII text, with very long lines, with NEL line terminators

real 0m53.735s
user 0m53.717s
sys 0m0.000s

Comment by Allan McRae (Allan) - Sunday, 09 March 2014, 14:10 GMT
Hrm... in my chroots it takes ~5 seconds. Both i686 and x86_64.
Comment by Dave Reisner (falconindy) - Sunday, 09 March 2014, 15:02 GMT
Maybe a difference in locale? LANG=C in the chroot v. LANG=en_US.UTF-8 on the host?
Comment by Allan McRae (Allan) - Sunday, 09 March 2014, 15:12 GMT
Yes - it seems any UTF-8 locale is slow.

Running perf on the example file in the upstream bug report:

en_US.UTF-8:
Samples: 220K of event 'cycles', Event count (approx.): 183996098111
29.05% file libc-2.19.so [.] re_search_internal
24.70% file libc-2.19.so [.] re_acquire_state_context
13.97% file libc-2.19.so [.] merge_state_with_log
13.91% file libc-2.19.so [.] re_node_set_compare.part.2
6.99% file libc-2.19.so [.] re_string_context_at
5.04% file libc-2.19.so [.] check_node_accept_bytes.isra.26
3.98% file libc-2.19.so [.] clean_state_log_if_needed
1.84% file libc-2.19.so [.] memset

C:
Samples: 22K of event 'cycles', Event count (approx.): 18264791986
97.21% file libc-2.19.so [.] re_search_internal
1.93% file libc-2.19.so [.] check_halt_state_context.isra.20
0.56% file libc-2.19.so [.] re_string_context_at


BTW, the example attached here seems just to be a regex on a really long line and not related...
Comment by test0 (test0) - Sunday, 09 March 2014, 23:46 GMT
My default LANG is en_GB.UTF-8. With LANG=C `file TFAC00I0.RCI` completes in about 33 seconds.
The attached example completes in 33 seconds with my default LANG or LANG=C, so I guess I just didn't wait long enough when first testing. However, if I change the regex in the attachment (e.g. to ".*aaa$") it completes in around half a second, so the slowness is not just due to a really long line. I compile it simply with `gcc regex_test.c -o regex_test`.
I left `file TFAC00I0.RCI` running again and it returned after 8 minutes (though I swear I left it for much longer than that initially and it didn't return). Considering it's usually as good as instant, it shouldn't be taking even 8 minutes, should it?
Comment by Doug Newgard (Scimmia) - Wednesday, 13 May 2015, 21:34 GMT
The upstream report is marked as "resolved", what is the status here?
Comment by Allan McRae (Allan) - Wednesday, 13 May 2015, 23:10 GMT
The glibc part has not been resolved.
Comment by Allan McRae (Allan) - Saturday, 20 February 2016, 01:47 GMT
Just tested this - now the same speed with or without UTF-8. Anything more should be reported upstream.
