FS#12107 - Arch news RSS feed is broken (duplicates last 10 entries)

Attached to Project: Arch Linux
Opened by shapeshifter (shapeshifter) - Wednesday, 12 November 2008, 19:05 GMT
Last edited by Dan McGee (toofishes) - Saturday, 17 January 2009, 15:57 GMT
Task Type Bug Report
Category Web Sites
Status Closed
Assigned To Dusty Phillips (Dusty)
Dan McGee (toofishes)
Architecture All
Severity High
Priority Normal
Reported Version None
Due in Version Undecided
Due Date Undecided
Percent Complete 100%
Votes 5
Private No

Details

Description:

The "arch news" RSS feed is broken for many readers. Every time you start the reader, it receives the last 10 entries over and over again instead of only grabbing new entries.

The problem has been confirmed with Thunderbird, Akregator and Canto. Apparently, Firefox doesn't have a problem. Also, all the other feeds are fine; it's definitely a fault with the arch news feed. There's a screenshot attached as proof. :P
This task depends upon

This task blocks these from closing
 FS#12537 - 2 copies of mailing list messages delivered 
Closed by  Dan McGee (toofishes)
Saturday, 17 January 2009, 15:57 GMT
Reason for closing:  Fixed
Additional comments about closing:  Die bugs die
Comment by Allan McRae (Allan) - Thursday, 13 November 2008, 13:59 GMT
I can confirm this. It happened for me right after the site upgrade so I am assuming causation...
Comment by Dusty Phillips (Dusty) - Friday, 14 November 2008, 13:44 GMT
Can anyone specifically tell me if the packages feed is not broken?

Dusty
Comment by shapeshifter (shapeshifter) - Friday, 14 November 2008, 18:49 GMT
Tell you what?
In any case, it's still the same. I'm sorry I don't have any server side experience with RSS feeds. I can only say that until now I've received the last entries 18 times already, making it 180 "new" entries from this feed in 3 days. Prior to opening this bug report I asked in #archlinux and many users confirmed this erratic behaviour of this feed.
Comment by Allan McRae (Allan) - Friday, 14 November 2008, 21:49 GMT
He means: does the same thing happen with the package update RSS feed? I am going to test that now.
Comment by Allan McRae (Allan) - Friday, 14 November 2008, 21:52 GMT
And my testing shows that package feed seems fine.
Comment by Alessandro Doro (adoroo) - Saturday, 15 November 2008, 14:58 GMT
I can confirm this in Opera.
I have duplicated feeds for "Recent News Updates" since 07/10/2008 (newsletter news).
I have apparently random duplicates for "Recent Packages Updates"; most recent duplicates are: sudo-ba, geeqie-ba, networkmanager-i686, gnome-network-manager-i686, libnetworkmanager-i686
(ba = both archs)
Comment by Dusty Phillips (Dusty) - Saturday, 15 November 2008, 18:55 GMT
This issue has me baffled; the feeds code is mostly built into django and I hardly changed it during the 1.0 port.

I have commented out some code that may be causing the problem. I'm unable to replicate it in canto or opera.

Can anyone confirm if the issue is still occurring?

Thanks,

Dusty
Comment by Alessandro Doro (adoroo) - Saturday, 15 November 2008, 19:19 GMT
I noticed that the duplicated entries are not really duplicated.
The domain in the item destination link is different: www.archlinux.org and archlinux.org.
Comment by Alessandro Doro (adoroo) - Saturday, 15 November 2008, 19:22 GMT
e.g. I see two feed entries in Opera for "zsnes 1.51-5 i686"; one points to http://archlinux.org/packages/extra/i686/zsnes/, the other points to http://www.archlinux.org/packages/extra/i686/zsnes/.
Comment by shapeshifter (shapeshifter) - Saturday, 15 November 2008, 19:34 GMT
Yeah, now that you mention it, I can see that as well. I have 26 duplicates of every entry now (in Thunderbird) and I see that every second entry has the www. prefix and every second does not. They're alternating...
Comment by Dusty Phillips (Dusty) - Saturday, 15 November 2008, 20:25 GMT
Is there a chance that the duplicate entries refer to entries that happened before the switch and then again after the switch?

I never use RSS and when I open the feed in canto, thunderbird, or opera, I don't get any duplicate entries. So my theory is that the duplicate entries refer to articles that were downloaded prior to the switch and then had to be downloaded again afterward... but going forward there won't be any more duplicates. Is this possible?

Dusty
Comment by shapeshifter (shapeshifter) - Saturday, 15 November 2008, 20:25 GMT
Yeah, now that you mention it, I can see that as well. I have 26 duplicates of every entry now (in Thunderbird) and I see that every second entry has the www. prefix and every second does not. They're alternating...
Comment by shapeshifter (shapeshifter) - Saturday, 15 November 2008, 20:28 GMT
sorry for the duplicate. I'm testing now if deleting all copies of the last 10 entries will solve the problem. Logically, they should come in one more time and that should be it. Following your theory, it would seem like the client always thinks that the server side entry is a different one. I don't understand though why it's changing back and forth all the time.
Comment by Allan McRae (Allan) - Saturday, 15 November 2008, 20:28 GMT
I removed the feed from Thunderbird and just added it again. Now there are no more duplicates for me.
Comment by shapeshifter (shapeshifter) - Saturday, 15 November 2008, 20:35 GMT
Removing and readding the feed doesn't work for me. I tried it when I first encountered the problem and I tried again now and it didn't work. I deleted all entries, removed the feed completely, closed TB, opened TB, added the Feed, got 10 new entries, closed TB, opened TB and got another 10 entries.

The first 10 entries that came in when I added the feed were all in the format
http://archlinux.org/news/412/
http://archlinux.org/news/413/
... and so on, while the second 10 entries that came in after restarting TB were in the format:
http://www.archlinux.org/news/412/
http://www.archlinux.org/news/413/
... and so on.

It's odd that you can't reproduce this. I'm using TB 2.0.0.17 from extra and I don't encounter this problem with any other feeds I have.
Comment by Alessandro Doro (adoroo) - Saturday, 15 November 2008, 20:42 GMT
The destination link is also used as item guid in the rss file, so showing the duplicates is the right thing for the reader.

Before 10/oct/2008 "Recent News Updates" entries linked to archlinux.org.

Recent entries (since monday 10, site upgrade date) from "Recent Packages Updates" link to www.archlinux.org pages with *some* duplicated entry that links to archlinux.org pages. Maybe sometimes a duplicate rss file is pushed/generated?

For the record I grab the feeds from the following urls:
http://www.archlinux.org/feeds/news/
http://www.archlinux.org/feeds/packages/
Comment by shapeshifter (shapeshifter) - Sunday, 16 November 2008, 01:28 GMT
Upon further trying, I notice that if I use this url as the feed...
http://www.archlinux.org/feeds/news/
...every second duplicate of an entry comes with www. in its link and every other second duplicate comes without it as described above. But if I use this feed...
http://archlinux.org/feeds/news/
...without the www. in front of it, all the duplicates have a www. in front of it. So
http://www.archlinux.org/feeds/news/ --> creates http://www.archlinux.org/news/420/ AND http://archlinux.org/news/420/ alternating, while
http://archlinux.org/feeds/news/ --> only creates http://www.archlinux.org/news/420/
I still get duplicates no matter what I try though... Maybe it'll just "level out" as soon as 10 "real" new entries have come out. Until then I'll have a couple of 1000 copies of the last 10 entries though ;)
Comment by Hervé (herve) - Friday, 21 November 2008, 10:47 GMT
Well, the first mistake is that archlinux.org must redirect to www.archlinux.org. There must be a single URL pointing to the same resource.
Comment by Dusty Phillips (Dusty) - Sunday, 30 November 2008, 01:57 GMT
Hey guys,

I've tried to hardcode the feed link, but I can't test if it failed completely, solves the problem, or does nothing... additional feedback requested.

Dusty
Comment by Alessandro Doro (adoroo) - Sunday, 30 November 2008, 14:38 GMT
Dusty,

I set up a cron job to download the feed file every 10 minutes.
Here is the download script:
#! /bin/sh
cd $HOME/rsstest/pkgs
wget http://archlinux.org/feeds/packages/
cd $HOME/rsstest/pkgs/www
wget http://www.archlinux.org/feeds/packages/

Then you could watch the files in $HOME/rsstest/pkgs to see what is happening.

To get a more readable form of the xml file:
for i in index.html*; do gawk '{printf "%s", $0}' "$i" | xmllint --format - > "${i/html/xml}"; done

FYI; I have received a duplicate of the latest "news update" ("We're back") "http://archlinux.org/news/423/" (friday? I don't remember well); the first was sent on 20/11/2008 as "http://www.archlinux.org/news/423/".

I have duplicates from "Package Update" also. The last was on Friday. The reader downloads the feeds every three hours, but sometimes I refresh them manually. So I don't know if every feed is duplicated or only a few. Let's let cron do its job.
Comment by Hervé (herve) - Sunday, 30 November 2008, 15:21 GMT
I'm not confident about your test case because fetching an RSS feed requires HTTP headers to download only what is new.

If you don't include a timestamp condition, you'll be served the same response every time.
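
For reference, a conditional fetch of the kind described above might look like the following Python sketch. It only builds the request and sends nothing. The function name and the example timestamp are assumptions for illustration, and whether the feed server actually honours If-Modified-Since is not established in this thread.

```python
from datetime import datetime, timezone
from email.utils import format_datetime
from urllib.request import Request


def build_conditional_request(url, last_fetch):
    """Build a GET request that asks only for content newer than last_fetch.

    last_fetch must be a timezone-aware datetime in UTC.
    """
    request = Request(url)
    # HTTP dates are always expressed in GMT.
    request.add_header("If-Modified-Since", format_datetime(last_fetch, usegmt=True))
    return request


if __name__ == "__main__":
    # Hypothetical timestamp of the previous successful download.
    last_fetch = datetime(2008, 11, 30, 15, 21, tzinfo=timezone.utc)
    req = build_conditional_request("http://www.archlinux.org/feeds/packages/", last_fetch)
    for name, value in req.header_items():
        print(f"{name}: {value}")
```

A server that supports this answers 304 Not Modified when nothing has changed; a plain wget without such a header gets the full file on every run, which is why identical downloads pile up.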
Comment by Alessandro Doro (adoroo) - Sunday, 30 November 2008, 16:25 GMT
You're right. The script is very basic; no header control. I'm manually discarding identical downloads.
Also it seems that the rss is generated every hour now (not so on nov 16th), so maybe 10 minutes is excessive.

But the problem/bug is another.
Two items are generated for every news item and for some (every?) package updates, with different GUIDs: http://www.archlinux.org/... and http://archlinux.org/....
The duplicate always resides in a different rss file.
This is a server, not reader, related problem.

Moreover I think that shapeshifter has a problem with the reader. Another story...
Comment by Alessandro Doro (adoroo) - Wednesday, 03 December 2008, 13:55 GMT
I'm pretty sure the current situation is this:
first www.archlinux.org announces a new package:
http://www.archlinux.org/packages/testing/x86_64/snort/, snort 2.8.2.1-8 x86_64
later archlinux.org announces a new package:
http://archlinux.org/packages/testing/x86_64/snort/, snort 2.8.2.1-8 x86_64

Two feed items are "correctly" generated.
We, humans, know that is the same package. We should instruct the machine.

Solution 1:
only www.archlinux.org (or archlinux.org) broadcasts the news;
be careful, this could affect the other parts of the website, I don't know.

MAYBE MAYBE MAYBE
In models.py:
change the get_absolute_url() method of the classes Package() and News()


Solution 2:
a filter in the feed generator discards items from archlinux.org (or www.archlinux.org)

In feeds.py:
class PackageFeed(Feed):
    def items(self):
        return Package.objects.order_by('-last_update')[:24]

class NewsFeed(Feed):
    def items(self):
        return News.objects.order_by('-postdate', '-id')[:10]

The filter should be in:
Package.objects.order_by('-last_update')[:24]
and
News.objects.order_by('-postdate', '-id')[:10]
If Package.id and News.id are the URL, narrow the list down to items whose id contains www.archlinux.org.
Comment by Hervé (herve) - Wednesday, 03 December 2008, 14:15 GMT
As I said earlier, there must be a single URL pointing to the same web page. If "www.archlinux.org" is the canonical URL, the other one must redirect to it with a Redirect Permanent response.

This is the preliminary step before debugging any HTTP interaction.

So I vote as strongly as possible for solution 1. Solution 2 is just a hack to compensate for the duplicate URLs.
Comment by Alessandro Doro (adoroo) - Wednesday, 03 December 2008, 15:17 GMT
I agree with you.
Solution 2 is *clearly* only a temporary hack.
Solution 1, in the way I described it, is not the real solution.

Your proposed solution implies opening a new "bug report" or "feature request".

But also note that archlinux.org and www.archlinux.org front pages don't show up duplicates in the "Recent Updates" and "Latest News". A 301 will hide a weakness in the feed generation code.

OT: the PackageFeed() method items() picks the 24 most recent packages. What if between two runs of the script more than 24 packages are updated?
Comment by Alessandro Doro (adoroo) - Wednesday, 03 December 2008, 15:51 GMT
I wrote:
> archlinux.org and www.archlinux.org front pages don't show up duplicates
> in the "Recent Updates" and "Latest News".

Sorry:
http://www.archlinux.org/packages/?sort=-last_update&limit=250
is a better test case.

If the code were 100% correct:
http://www.archlinux.org/feeds/packages/ should pick up packages only from www.archlinux.org
http://archlinux.org/feeds/packages/ should pick up packages only from archlinux.org

The problem is not in "HTTP interaction" but in "page generation"; it is not a web problem, it is a programming problem, and it should be solved whether or not archlinux.org redirects to www.archlinux.org.

Now I think only the web site developers can choose what steps to take.
Comment by Aaron Griffin (phrakture) - Monday, 12 January 2009, 21:48 GMT
Just pinging this one. See the related bug. This is still an issue, and shows in the arch-announce mailing list.

I *thought* I switched the arch-announce ML to use only the non-www version of the feed, but that didn't seem to fix it.
Comment by Dusty Phillips (Dusty) - Monday, 12 January 2009, 23:09 GMT
I know about both bugs and have spent several sessions trying to track it down. This is the most serious bug on my list, but I can't find a solution. I'd love to blame django, except nobody else has reported similar issues. ;-)

I can't imagine the issue is related to the www prefix. If somebody is subscribed to only one of the urls and is still getting duplicates, something else is going on. I'm not sure what correct behaviour would be if someone is subscribed to both urls, but that's kind of a silly thing to do, so I want to solve the single url issue first.

Dusty
Comment by Hervé (herve) - Tuesday, 13 January 2009, 09:11 GMT
Maybe the issue is in the Apache configuration (or whatever front-end), like a rewrite rule without the "L" flag.

Could we see it without affecting any level of security?
Comment by Dusty Phillips (Dusty) - Tuesday, 13 January 2009, 14:05 GMT
Tagging Dan in to address the apache question.
Comment by Aaron Griffin (phrakture) - Tuesday, 13 January 2009, 17:14 GMT
I don't think it's apache related. I think it's an issue with the model itself. I wonder if cactus would have any insight... /me goes and finds him
Comment by Aaron Griffin (phrakture) - Wednesday, 14 January 2009, 19:54 GMT
Are we changing the "pubDate" or whatever of the feed items? Feed-readers typically use that to detect if the item is newer or not. That is, if I have "Article A" from June 1st, the reader will ignore all other times it sees that item UNLESS the date changes. All of a sudden, "Article A" from July 7th appears, and *poof* my reader says it's new.

Can someone check to see the dates on their duplicate items?
Comment by Aaron Griffin (phrakture) - Wednesday, 14 January 2009, 20:07 GMT
Oh oh oh! Relevant:

http://projects.archlinux.org/?p=archweb_pub.git;a=commitdiff;h=37fc9586b1256aebda0098f209f0c1f51642717b

The packages feed doesn't have this change. Any reason for this?
Comment by Dusty Phillips (Dusty) - Wednesday, 14 January 2009, 20:15 GMT
My memory is fuzzy, but I think that django 1.0 was crashing if you supplied date objects instead of datetime objects. Either that, or it was just an unsuccessful attempt to fix this problem.
Comment by Alessandro Doro (adoroo) - Wednesday, 14 January 2009, 20:19 GMT
Same pubDate.
Feed-readers should also check the item GUID so:

<item>
<title>apache-ant 1.7.1-1 i686</title>
<link>http://www.archlinux.org/packages/extra/i686/apache-ant/</link>
<description>Ant is a java-based build tool.</description>
<pubDate>Mon, 12 Jan 2009 18:02:39 -0500</pubDate>
<guid>http://www.archlinux.org/packages/extra/i686/apache-ant/</guid>
<category>Extra</category>
<category>i686</category>
</item>

is different from:

<item>
<title>apache-ant 1.7.1-1 i686</title>
<link>http://archlinux.org/packages/extra/i686/apache-ant/</link>
<description>Ant is a java-based build tool.</description>
<pubDate>Mon, 12 Jan 2009 18:02:39 -0500</pubDate>
<guid>http://archlinux.org/packages/extra/i686/apache-ant/</guid>
<category>Extra</category>
<category>i686</category>
</item>

whatever the date is.
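
To make the point concrete, here is a minimal Python sketch of a reader that de-duplicates by GUID alone. The new_items helper is hypothetical, and for brevity both items are placed in one file, whereas in practice the two variants were seen in different downloads of the feed.

```python
import xml.etree.ElementTree as ET

# Trimmed copies of the two items quoted above, combined into one document.
RSS = """<rss version="2.0"><channel>
<item>
<title>apache-ant 1.7.1-1 i686</title>
<pubDate>Mon, 12 Jan 2009 18:02:39 -0500</pubDate>
<guid>http://www.archlinux.org/packages/extra/i686/apache-ant/</guid>
</item>
<item>
<title>apache-ant 1.7.1-1 i686</title>
<pubDate>Mon, 12 Jan 2009 18:02:39 -0500</pubDate>
<guid>http://archlinux.org/packages/extra/i686/apache-ant/</guid>
</item>
</channel></rss>"""


def new_items(rss_text, seen_guids):
    """Return titles of items whose GUID is not in seen_guids, and record them."""
    fresh = []
    for item in ET.fromstring(rss_text).iter("item"):
        guid = item.findtext("guid")
        if guid not in seen_guids:
            seen_guids.add(guid)
            fresh.append(item.findtext("title"))
    return fresh


if __name__ == "__main__":
    seen = set()
    # Same title and pubDate, but two distinct GUIDs, so both count as new.
    print(new_items(RSS, seen))
```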
Comment by Aaron Griffin (phrakture) - Wednesday, 14 January 2009, 20:35 GMT
@Dusty: Eliott has been speculating that this is caching related, with different people hitting the feed page via archlinux.org and www.archlinux.org

Two things:
* Maybe try reverting this change, http://projects.archlinux.org/?p=archweb_pub.git;a=commitdiff;h=HEAD
* Maybe attempt to shut off caching on the feeds to see if this fixes the problem
Comment by Aaron Griffin (phrakture) - Wednesday, 14 January 2009, 20:37 GMT
Err, not reverting, but reverting the change from absolute to relative URLs - notice that the feeds still contain absolute URLs, so Django has to do some interpolation there.

EDIT: This also applies to the get_absolute_url functions in the models for News and Packages too
Comment by Dusty Phillips (Dusty) - Wednesday, 14 January 2009, 20:44 GMT
The memcache is (I believe) operated at the django level, www.archlinux.org and archlinux.org should not be caching different content. I just did a cursory (ie: 25 second) google search on django feed caching and couldn't find anything; the feeds are automatically generated by django so I don't think I have control over how they are cached. I'd like to think django does the right thing.

The commit you mentioned was actually reverting an early change that didn't seem to help. I'm not sure how to specify relative urls; would they be relative to the feed location? I think django would still have to interpolate it.

I finally found an article related to this issue, but it may not be relevant. I'll experiment with it and look for more info on Friday; I have a meeting in five minutes so I have to go now. ;-)

http://www.hoboes.com/Mimsy/?ART=513 <-- for my reference.
Comment by Aaron Griffin (phrakture) - Wednesday, 14 January 2009, 20:51 GMT
What I meant was: could we try switching all these things to have the full "http://www.archlinux.org/" in front of them - the commit I pointed out AND the get_absolute_url function from the models? That would solve our flip-flopping between www and no-www in these feeds.
Comment by eliott (cactus) - Thursday, 15 January 2009, 05:15 GMT
To reiterate Aaron's description, with an example:

Assume two users. UserA, UserB.
UserA has a feed reader pointed to http://www.archlinux.org/feeds/packages/
UserB has a feed reader pointed to http://archlinux.org/feeds/packages/
All caches are empty.

UserA requests his page. Gets it. The page gets cached. In the contents of the feed link=http://www.archlinux.org/feeds/packages/
UserB requests his page. Gets same page. In the contents of the feed link=http://www.archlinux.org/feeds/packages/

Time passes, and the caches time out.
UserB manages to slip in before UserA. Gets his page. The page gets cached. However, since his url is archlinux.org, django creates the link using the HOST header of the client request, tacking on the relative url partial. In the contents of the feed link=http://archlinux.org/feeds/packages/. That gets cached.
UserA now does his request. He gets the same page cached by UserB's request. In the contents of the feed link=http://archlinux.org/feeds/packages/

Result. Both users get duplicates in their rss feed reader.
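
The sequence above can be modelled with a toy cache in Python. This is only a sketch of the hypothesis: PathKeyedCache and render_feed are made-up names, and nothing here claims to reproduce how Django or memcached actually key their entries.

```python
class PathKeyedCache:
    """Toy page cache keyed on the URL path only, ignoring the Host header."""

    def __init__(self):
        self._pages = {}

    def get(self, host, path, render):
        if path not in self._pages:
            # Cache miss: the page is rendered with the Host of whoever asked first.
            self._pages[path] = render(host, path)
        return self._pages[path]

    def expire(self):
        """Simulate the cache timeout by dropping every stored page."""
        self._pages.clear()


def render_feed(host, path):
    """Stand-in for the feed view: builds an absolute link from the request Host."""
    return f"<link>http://{host}/packages/extra/i686/zsnes/</link>"


if __name__ == "__main__":
    cache = PathKeyedCache()
    path = "/feeds/packages/"
    # Round 1: UserA (www) arrives first, UserB (bare domain) gets UserA's page.
    print(cache.get("www.archlinux.org", path, render_feed))
    print(cache.get("archlinux.org", path, render_feed))
    cache.expire()
    # Round 2: UserB slips in first, so UserA now receives bare-domain links.
    print(cache.get("archlinux.org", path, render_feed))
    print(cache.get("www.archlinux.org", path, render_feed))
```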

Fixes:

1) In the apache vhost.
RewriteCond %{HTTP_HOST} ^archlinux\.org$
RewriteRule (.*) http://www.archlinux.org$1

2) In archweb_pub
A patch something like this.
http://cactuswax.net/p/eliott/misc/0001-pew-pew-pew.patch.txt
Comment by eliott (cactus) - Thursday, 15 January 2009, 05:17 GMT
note: the first will fix the issue, as well as the second. It wouldn't hurt to do them both.
*shrugs*
Comment by Hervé (herve) - Thursday, 15 January 2009, 10:43 GMT
Well, I'm glad someone else is asking to fix the Apache virtual hosts and make www.archlinux.org the canonical URL.

Apart from any RSS issue, having two different URLs pointing to the same resources is bad for maintenance, caching and search engines.

I find the second proposition a bit violent.
Comment by Dusty Phillips (Dusty) - Thursday, 15 January 2009, 14:01 GMT
I'm reluctant to use an FQDN in get_absolute_url because it makes the code less portable. People do run variations of this codebase on other domains, but more importantly, I have to run the site locally for testing.

However, in searching, I *think* I found a django snippet that allows me to set the guid explicitly instead of having django rewrite it, which should have the same effect. I'll try to test it on Friday. In the meantime or if it doesn't work, are there any side-effects to changing the apache configuration?
Comment by Dan McGee (toofishes) - Thursday, 15 January 2009, 14:44 GMT
There shouldn't be any real nasty side-effects from that change. We just need to pick one and go with it.
Comment by Alessandro Doro (adoroo) - Thursday, 15 January 2009, 14:45 GMT
In summary:
· duplicates in news and packages feeds
· duplicates in arch-announce (ml and web archive) ---  FS#12537 
· no duplicates in news and packages sections in home page
· no duplicates for news and packages pages

So "def item_guid(self, item):" for the PackageFeed and NewsFeed classes should partly solve the problem (guid could be, e.g., the link without the domain part).
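
A minimal Python sketch of such a domain-free guid follows. The host_independent_guid helper is hypothetical; it is written as a plain function so it runs on its own, and wiring it into an item_guid method on the feed classes is left as an assumption.

```python
from urllib.parse import urlsplit


def host_independent_guid(link):
    """Return the link with scheme and domain removed, for use as an item GUID."""
    parts = urlsplit(link)
    guid = parts.path
    if parts.query:
        # Keep the query string so distinct pages do not collapse into one GUID.
        guid += "?" + parts.query
    return guid


if __name__ == "__main__":
    for link in (
        "http://www.archlinux.org/packages/extra/i686/apache-ant/",
        "http://archlinux.org/packages/extra/i686/apache-ant/",
    ):
        print(host_independent_guid(link))
```

Both links yield the same value, so a reader comparing GUIDs would treat them as one item. Note that a guid which is no longer a full URL would normally be marked isPermaLink="false" in RSS 2.0.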

And I think that the issue of "having two different URLs pointing to the same resource", where a resource is a package or news, is worth a discussion in the arch-dev mailing list.
Comment by Hervé (herve) - Thursday, 15 January 2009, 15:49 GMT
> And I think that the issue of "having two different URLs pointing to the same resource", where a resource is a package or news, is worth a discussion in the
> arch-dev mailing list.

No it's just Web good sense. It has nothing to do with what kind of resource (as defined by the HTTP standard) the Web server is publishing. It's an administrator's task, not a package developer's task.

It doesn't require any change to the Django code.
Comment by Aaron Griffin (phrakture) - Thursday, 15 January 2009, 17:13 GMT
Personally, I prefer archlinux.org and feel that www.archlinux.org should redirect there.
Comment by Dan McGee (toofishes) - Friday, 16 January 2009, 01:12 GMT
Yeah, I didn't want to get into a www policy debate...

I'm not going to really bring into this what I prefer (OK, www), but I'll approach it more from the "let's not change the norm" angle. Just about everyone uses www. on their primary site, and even in the Linux world where people are smart enough to realize this is arbitrary, debian.org, ubuntu.com and gentoo.org redirect to their www sites. Slackware does the same thing as we currently do (two distinct namespaces), and fedoraproject.org omits the www.

Are there any non-personal/non-political reasons for one or the other, or should Aaron and I just fight it out on IRC or something? :)
Comment by Dusty Phillips (Dusty) - Friday, 16 January 2009, 01:32 GMT
Actually, I think you should fight it out in person and take a video.
Comment by eliott (cactus) - Friday, 16 January 2009, 03:39 GMT
technical reason for using www:

A base domain cannot be a CNAME. This is an important distinction if you use cache services (http accelerators) like Akamai, Limelight, etc. They require CNAME aliasing for some of their accelerator/edge-cache services.

For archlinux.org though.. It probably doesn't matter.

(note: i prefer www myself. :D )
Comment by Dan McGee (toofishes) - Friday, 16 January 2009, 04:05 GMT
# direct 'bare' URLs to www
RewriteCond %{HTTP_HOST} ^archlinux\.org$
RewriteRule (.*) http://www.archlinux.org$1 [R=301]

Rewrite rule now in effect.
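
For readers unfamiliar with mod_rewrite, the effect of the rule can be sketched in Python. This is a toy model under stated assumptions, not the server's code: canonical_redirect is a made-up name, and any request not on the bare domain is simply reported as (200, None).

```python
CANONICAL_HOST = "www.archlinux.org"
BARE_HOST = "archlinux.org"


def canonical_redirect(host, path):
    """Return (status, location) for a request, mirroring the rewrite rule's effect.

    Requests whose Host is exactly the bare domain get a 301 to the same path
    on the www host; everything else passes through untouched.
    """
    if host == BARE_HOST:
        return 301, f"http://{CANONICAL_HOST}{path}"
    return 200, None


if __name__ == "__main__":
    print(canonical_redirect("archlinux.org", "/feeds/news/"))
    print(canonical_redirect("www.archlinux.org", "/feeds/news/"))
```

With every request ending up on the www host, links built from the Host header can only ever carry one domain.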
Comment by Hervé (herve) - Friday, 16 January 2009, 13:27 GMT
My Thunderbird has been open all morning and I've just received an announcement for "iproute2", which is indeed a new update, so I think we can say the bug is fixed!
Comment by Alessandro Doro (adoroo) - Friday, 16 January 2009, 15:51 GMT
Very well.
No duplicates here.
Comment by Dusty Phillips (Dusty) - Friday, 16 January 2009, 22:59 GMT
So are we happy now or should I still add the guid test to django?
Comment by Aaron Griffin (phrakture) - Friday, 16 January 2009, 23:09 GMT
Let's leave this open for a few more days to see if it pops back up.
Comment by Dan McGee (toofishes) - Saturday, 17 January 2009, 02:02 GMT
I think adding some guid magic to Django would still not be a bad idea, but it's obviously not a huge priority. I'll leave it up to you, Dusty.
Comment by Dusty Phillips (Dusty) - Saturday, 17 January 2009, 13:31 GMT
Well, the main reason not to do it if it's not necessary is that if some community site uses our codebase, it would be nice if their news items showed up in a different feed. On purpose this time. ;-)
