FS#41530 - [coreutils] uniq fails on this test file

Attached to Project: Arch Linux
Opened by Rasmus Steinke (rasi) - Monday, 11 August 2014, 23:54 GMT
Last edited by Dave Reisner (falconindy) - Tuesday, 12 August 2014, 17:38 GMT
Task Type Bug Report
Category Packages: Core
Status Closed
Assigned To Sébastien Luttringer (seblu)
Architecture All
Severity Low
Priority Normal
Reported Version
Due in Version Undecided
Due Date Undecided
Percent Complete 100%
Votes 0
Private No

Details

Description: Using this test file, uniq fails to eliminate all duplicate lines.


Additional info:
coreutils 8.23-1
Arch Linux 64bit, testing enabled


Steps to reproduce:
Extract the tar file.
Run "cat test3 | uniq".
Interestingly, it works if you grep for one of the failed entries first.
E.g.: "grep MiMi test3 | uniq" will work.
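For context on the behaviour reported here: uniq only collapses *adjacent* duplicate lines, so any duplicate separated from its twin by another line survives. A minimal sketch with hypothetical data (not the attached test3):

```shell
#!/bin/sh
# uniq removes only runs of identical adjacent lines.
printf 'a\na\nb\na\n' > /tmp/uniq_demo.txt

uniq /tmp/uniq_demo.txt
# Prints:
#   a
#   b
#   a
# The last "a" survives because "b" separates it from the first run.
```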
This task depends upon

Closed by  Dave Reisner (falconindy)
Tuesday, 12 August 2014, 17:38 GMT
Reason for closing:  Works for me
Additional comments about closing:  input to uniq must be "sorted"
Comment by Gerardo Exequiel Pozzi (djgera) - Tuesday, 12 August 2014, 00:58 GMT
Works as expected here.
Comment by Allan McRae (Allan) - Tuesday, 12 August 2014, 03:03 GMT
What locale do you use?
Comment by Dave Reisner (falconindy) - Tuesday, 12 August 2014, 16:42 GMT
Are you expecting that unsorted input will be made unique? (hint: you shouldn't)
Comment by Rasmus Steinke (rasi) - Tuesday, 12 August 2014, 16:53 GMT
huh? it's not unsorted at all... all duplicates are right behind each other...
Comment by Rasmus Steinke (rasi) - Tuesday, 12 August 2014, 16:54 GMT
carnager@caprica ~ > locale -a
C
en_US.utf8
POSIX
carnager@caprica ~ > locale
LANG=en_US.UTF-8
LC_CTYPE="en_US.UTF-8"
LC_NUMERIC="en_US.UTF-8"
LC_TIME="en_US.UTF-8"
LC_COLLATE="en_US.UTF-8"
LC_MONETARY="en_US.UTF-8"
LC_MESSAGES="en_US.UTF-8"
LC_PAPER="en_US.UTF-8"
LC_NAME="en_US.UTF-8"
LC_ADDRESS="en_US.UTF-8"
LC_TELEPHONE="en_US.UTF-8"
LC_MEASUREMENT="en_US.UTF-8"
LC_IDENTIFICATION="en_US.UTF-8"
LC_ALL=
Comment by Rasmus Steinke (rasi) - Tuesday, 12 August 2014, 16:57 GMT
falconindy: But you have a point. Running "cat test3 | uniq" will not only leave duplicates, it will also mess up the order.
Comment by Rasmus Steinke (rasi) - Tuesday, 12 August 2014, 17:00 GMT
Gerardo: your result is also messed up.
a) The order in your result has changed
b) there are duplicates

original file had this:

2008 • Red Sky Coven • Volume 5
2008 • Red Sky Coven • Volume 5
2008 • Red Sky Coven • Volume 5
2008 • Red Sky Coven • Volume 5
2008 • Red Sky Coven • Volume 5
2008 • Red Sky Coven • Volume 5
2008 • Red Sky Coven • Volume 5
2008 • Red Sky Coven • Volume 5
2008 • Red Sky Coven • Volume 5
2008 • Red Sky Coven • Volume 5
2008 • Red Sky Coven • Volume 5
2008 • Red Sky Coven • Volume 5
2008 • Red Sky Coven • Volume 5
1999 • Red Sky Coven • Volume 3
1999 • Red Sky Coven • Volume 3
1999 • Red Sky Coven • Volume 3
1999 • Red Sky Coven • Volume 3
1999 • Red Sky Coven • Volume 3
1999 • Red Sky Coven • Volume 3
1999 • Red Sky Coven • Volume 3

and your result has this:

2008 • Red Sky Coven • Volume 5
1999 • Red Sky Coven • Volume 3
1995 • Red Sky Coven • Volume 2
1995 • Red Sky Coven • Volume 1
2008 • Red Sky Coven • Volume 5
1999 • Red Sky Coven • Volume 3
1995 • Red Sky Coven • Volume 2
1995 • Red Sky Coven • Volume 1
Comment by Dave Reisner (falconindy) - Tuesday, 12 August 2014, 17:01 GMT
> huh? it's not unsorted at all... all duplicates are right behind each other...
No, not really...

$ sed -n '/^2012 • Girls Aloud • Ten$/=' test3
97
98
101
102
106
110
111
...

Notice the gaps? 98 will be elided, 102 will be elided, 111 will be elided... you still have dupes in the list.
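The standard fixes follow from the point above: either sort the input first so duplicates become adjacent, or filter out repeats while preserving the original order with the well-known awk idiom. A sketch with hypothetical stand-in data (not the attached test3):

```shell
#!/bin/sh
# Stand-in input with non-adjacent duplicates.
printf 'b\na\nb\n' > /tmp/dupes_demo.txt

# 1. Sort first, then dedupe (original order is lost):
sort -u /tmp/dupes_demo.txt
# Prints:
#   a
#   b

# 2. Keep only the first occurrence of each line,
#    preserving the original order:
awk '!seen[$0]++' /tmp/dupes_demo.txt
# Prints:
#   b
#   a
```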
Comment by Rasmus Steinke (rasi) - Tuesday, 12 August 2014, 17:37 GMT
Oh... damn it. Didn't realise those...