FS#12107 - Arch news RSS feed is broken (duplicates last 10 entries)
Attached to Project:
Arch Linux
Opened by shapeshifter (shapeshifter) - Wednesday, 12 November 2008, 19:05 GMT
Last edited by Dan McGee (toofishes) - Saturday, 17 January 2009, 15:57 GMT
Details
Description:
The "arch news" RSS feed is broken for many readers. Everytime you start the reader, it will receive the 10 last entries over and over again and again and again instead of only grabbing new entries. The problem has been confirmed with Thunderbird, Akregator and Canto. Apparently, Firefox doesn't have a problem. Also, all the other feeds are fine, it's definetly a fault with the arch news feed. There's a screenshot attached as proof. :P |
This task depends upon
Closed by Dan McGee (toofishes)
Saturday, 17 January 2009, 15:57 GMT
Reason for closing: Fixed
Additional comments about closing: Die bugs die
Dusty
In any case, it's still the same. I'm sorry I don't have any server-side experience with RSS feeds. I can only say that so far I've received the last entries 18 times already, making for 180 "new" entries from this feed in 3 days. Prior to opening this bug report I asked in #archlinux and many users confirmed this erratic behaviour of this feed.
I have had duplicate entries from "Recent News Updates" since 07/10/2008 (newsletter news).
I have apparently random duplicates for "Recent Packages Updates"; most recent duplicates are: sudo-ba, geeqie-ba, networkmanager-i686, gnome-network-manager-i686, libnetworkmanager-i686
(ba = both archs)
I have commented out some code that may be causing the problem. I'm unable to replicate it in canto or opera.
Can anyone confirm if the issue is still occurring?
Thanks,
Dusty
The domain in the item destination links differs: www.archlinux.org vs. archlinux.org.
I never use RSS and when I open the feed in canto, thunderbird, or opera, I don't get any duplicate entries. So my theory is that the duplicate entries refer to articles that were downloaded prior to the switch and then had to be downloaded again afterward... but going forward there won't be any more duplicates. Is this possible?
Dusty
The first 10 entries that came in when I added the feed were all in the format
http://archlinux.org/news/412/
http://archlinux.org/news/413/
... and so on, while the second 10 entries that came in after restarting TB were in the format:
http://www.archlinux.org/news/412/
http://www.archlinux.org/news/413/
... and so on.
It's odd that you can't reproduce this. I'm using TB 2.0.0.17 from extra and I don't encounter this problem with any other feeds I have.
Before 10/Oct/2008, "Recent News Updates" entries linked to archlinux.org.
Recent entries (since Monday the 10th, the site upgrade date) from "Recent Packages Updates" link to www.archlinux.org pages, with *some* duplicated entries that link to archlinux.org pages. Maybe a duplicate RSS file is sometimes pushed/generated?
For the record I grab the feeds from the following urls:
http://www.archlinux.org/feeds/news/
http://www.archlinux.org/feeds/packages/
Using this feed...
http://www.archlinux.org/feeds/news/
...every second duplicate of an entry comes with www. in its link, and every other duplicate comes without it, as described above. But if I use this feed...
http://archlinux.org/feeds/news/
...without the www. in front, all the duplicates have www. in front of them. So
http://www.archlinux.org/feeds/news/ --> creates http://www.archlinux.org/news/420/ AND http://archlinux.org/news/420/ alternating, while
http://archlinux.org/feeds/news/ --> only creates http://www.archlinux.org/news/420/
I still get duplicates no matter what I try, though... Maybe it'll just "level out" as soon as 10 "real" new entries have come out. Until then I'll have a couple of thousand copies of the last 10 entries though ;)
I've tried to hardcode the feed link, but I can't test whether it fails completely, solves the problem, or does nothing... additional feedback requested.
Dusty
I set up a cron job to download the feed file every 10 minutes.
Here is the download script:
#!/bin/sh
# Fetch the non-www feed
cd "$HOME/rsstest/pkgs"
wget http://archlinux.org/feeds/packages/
# Fetch the www feed into a separate directory
cd "$HOME/rsstest/pkgs/www"
wget http://www.archlinux.org/feeds/packages/
Then you could watch the files in $HOME/rsstest/pkgs to see what is happening.
To get a more readable form of the xml file:
for i in index.html*; do gawk '{printf $0}' $i | xmllint --format - > ${i/html/xml}; done
FYI: I received a duplicate of the latest news update ("We're back"), "http://archlinux.org/news/423/" (Friday? I don't remember exactly); the first was sent on 20/11/2008 as "http://www.archlinux.org/news/423/".
I also have duplicates from "Package Update". The last was on Friday. The reader downloads the feeds every three hours, but sometimes I refresh them manually, so I don't know whether every feed is duplicated or only a few. Let cron do its job.
If you don't include a timestamp condition, you'll be served the same response every time.
Also, it seems that the RSS is generated every hour now (it wasn't on Nov 16th), so maybe every 10 minutes is excessive.
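For illustration only, here is roughly what "a timestamp condition" means: a conditional GET with an If-Modified-Since header, sketched with Python 2's urllib2 (the cron job above uses plain wget, and whether the server honours the header is another question):

import urllib2

req = urllib2.Request('http://www.archlinux.org/feeds/packages/')
req.add_header('If-Modified-Since', 'Sun, 16 Nov 2008 00:00:00 GMT')
try:
    print urllib2.urlopen(req).read()   # 200: the feed changed since that date
except urllib2.HTTPError, e:
    if e.code == 304:
        print 'Not modified'            # 304: nothing new, empty body
    else:
        raise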
But the actual problem/bug is something else.
Two items are generated for every news item and for some (every?) package update, with different GUIDs: http://www.archlinux.org/... and http://archlinux.org/....
The duplicate always resides in a different RSS file.
This is a server-side problem, not a reader problem.
Moreover, I think shapeshifter also has a problem with his reader, but that's another story...
First, www.archlinux.org announces a new package:
http://www.archlinux.org/packages/testing/x86_64/snort/, snort 2.8.2.1-8 x86_64
Later, archlinux.org announces the same package:
http://archlinux.org/packages/testing/x86_64/snort/, snort 2.8.2.1-8 x86_64
Two feed items are "correctly" generated.
We humans know that it is the same package; we have to teach the machine.
Solution 1:
only www.archlinux.org (or archlinux.org) broadcasts the news;
be careful, this could affect other parts of the website; I don't know.
MAYBE MAYBE MAYBE
In models.py:
change the get_absolute_url() method of the classes Package() and News()
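A hypothetical sketch of that change, purely as illustration; the URL layout is guessed from the links seen in the feed, and the real archweb models may be structured differently:

from django.db import models

class News(models.Model):
    # ... existing fields ...

    def get_absolute_url(self):
        # Pin the canonical host so links (and GUIDs derived from them)
        # always use www.archlinux.org.
        return 'http://www.archlinux.org/news/%d/' % self.id

# Package().get_absolute_url() would be changed the same way, e.g. returning
# 'http://www.archlinux.org/packages/<repo>/<arch>/<pkgname>/'.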
Solution 2:
a filter in the feed generator discards items from archlinux.org (or www.archlinux.org)
In feeds.py:
class PackageFeed(Feed):
    def items(self):
        return Package.objects.order_by('-last_update')[:24]

class NewsFeed(Feed):
    def items(self):
        return News.objects.order_by('-postdate', '-id')[:10]
The filter should be in:
Package.objects.order_by('-last_update')[:24]
and
News.objects.order_by('-postdate', '-id')[:10]
If Package.id and News.id are the URLs, narrow the list down to items whose id contains www.archlinux.org.
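A sketch of that filter, valid only under the assumption above (that the id visible to the feed is the full URL); if the id is a numeric primary key, the filter would have to inspect get_absolute_url() or another field instead:

# in feeds.py, relying on the existing imports shown above
class PackageFeed(Feed):
    def items(self):
        # Fetch a few extra rows, drop the archlinux.org twins, keep 24.
        pkgs = Package.objects.order_by('-last_update')[:48]
        return [p for p in pkgs if 'www.archlinux.org' in str(p.id)][:24]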
This is the preliminary step before debugging any HTTP interaction.
So I vote as strongly as possible for solution 1. Solution 2 is just a hack to compensate for the duplicated URLs.
Solution 2 is *clearly* only a temporary hack.
Solution 1, as I described it, is not the real solution.
Your proposed solution implies opening a new "bug report" or "feature request".
But also note that the archlinux.org and www.archlinux.org front pages don't show duplicates in "Recent Updates" and "Latest News". A 301 will hide a weakness in the feed generation code.
OT: the PackageFeed items() method picks the 24 most recent packages. What if more than 24 packages are updated between two runs of the script?
> archlinux.org and www.archlinux.org front pages don't show up duplicates
> in the "Recent Updates" and "Latest News".
Sorry:
http://www.archlinux.org/packages/?sort=-last_update&limit=250
is a better test case.
If the code were 100% correct:
http://www.archlinux.org/feeds/packages/ should pick up packages only from www.archlinux.org
http://archlinux.org/feeds/packages/ should pick up packages only from archlinux.org
The problem is not in the "HTTP interaction" but in the "page generation"; it is not a web server problem but a programming problem, and it should be solved whether or not archlinux.org redirects to www.archlinux.org.
Now I think only the web site developers can choose what steps to take.
I *thought* I switched the arch-announce ML to use only the non-www version of the feed, but that didn't seem to fix it.
I can't imagine the issue is related to the www prefix. If somebody is subscribed to only one of the urls and is still getting duplicates, something else is going on. I'm not sure what correct behaviour would be if someone is subscribed to both urls, but that's kind of a silly thing to do, so I want to solve the single url issue first.
Dusty
Could we see it without affecting any level of security?
Can someone check to see the dates on their duplicate items?
http://projects.archlinux.org/?p=archweb_pub.git;a=commitdiff;h=37fc9586b1256aebda0098f209f0c1f51642717b
The packages feed doesn't have this change. Any reason for this?
Feed readers should also check the item GUID, so:
<item>
  <title>apache-ant 1.7.1-1 i686</title>
  <link>http://www.archlinux.org/packages/extra/i686/apache-ant/</link>
  <description>Ant is a java-based build tool.</description>
  <pubDate>Mon, 12 Jan 2009 18:02:39 -0500</pubDate>
  <guid>http://www.archlinux.org/packages/extra/i686/apache-ant/</guid>
  <category>Extra</category>
  <category>i686</category>
</item>
is different from:
<item>
  <title>apache-ant 1.7.1-1 i686</title>
  <link>http://archlinux.org/packages/extra/i686/apache-ant/</link>
  <description>Ant is a java-based build tool.</description>
  <pubDate>Mon, 12 Jan 2009 18:02:39 -0500</pubDate>
  <guid>http://archlinux.org/packages/extra/i686/apache-ant/</guid>
  <category>Extra</category>
  <category>i686</category>
</item>
whatever the date is.
Two things:
* Maybe try reverting this change, http://projects.archlinux.org/?p=archweb_pub.git;a=commitdiff;h=HEAD
* Maybe attempt to shut off caching on the feeds to see if this fixes the problem (a rough sketch follows below)
EDIT: This also applies to the get_absolute_url functions in the models for News and Packages
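A rough sketch of the caching idea, assuming the Django 1.0-era syndication view and a feeds dict wired up in urls.py; the actual archweb URLconf (and the import path for the feed classes) may look different:

from django.conf.urls.defaults import patterns
from django.views.decorators.cache import never_cache
from django.contrib.syndication.views import feed as syndication_feed

from feeds import PackageFeed, NewsFeed  # hypothetical import path

feeds = {'packages': PackageFeed, 'news': NewsFeed}

urlpatterns = patterns('',
    # never_cache adds Cache-Control headers so the cache middleware
    # (and upstream caches) stop serving stale feed pages.
    (r'^feeds/(?P<url>.*)/$', never_cache(syndication_feed),
        {'feed_dict': feeds}),
)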
The commit you mentioned was actually reverting an earlier change that didn't seem to help. I'm not sure how to specify relative urls; would they be relative to the feed location? I think django would still have to interpolate it.
I finally found an article related to this issue, but it may not be relevant. I'll experiment with it and look for more info on Friday; I have a meeting in five minutes so I have to go now. ;-)
http://www.hoboes.com/Mimsy/?ART=513 <-- for my reference.
Assume two users. UserA, UserB.
UserA has a feed reader pointed to http://www.archlinux.org/feeds/packages/
UserB has a feed reader pointed to http://archlinux.org/feeds/packages/
All caches are empty.
UserA requests his page. Gets it. The page gets cached. In the contents of the feed link=http://www.archlinux.org/feeds/packages/
UserB requests his page. Gets same page. In the contents of the feed link=http://www.archlinux.org/feeds/packages/
Time passes, and the cache times out.
UserB manages to slip in before UserA. Gets his page. The page gets cached. However, since his url is archlinux.org, django creates the link using the HOST header of the client request, tacking on the relative url partial. In the contents of the feed link=http://archlinux.org/feeds/packages/. That gets cached.
UserA now does his request. He gets the same page cached by UserB's request. In the contents of the feed link=http://archlinux.org/feeds/packages/
Result: both users get duplicates in their RSS feed readers.
Fixes:
1) In the apache vhost.
RewriteCond %{HTTP_HOST} ^archlinux\.org$
RewriteRule (.*) http://www.archlinux.org$1
2) In archweb_pub
A patch something like this.
http://cactuswax.net/p/eliott/misc/0001-pew-pew-pew.patch.txt
*shrugs*
RSS issues aside, having two different URLs pointing to the same resources is bad for maintenance, caching and search engines.
I find the second proposal a bit drastic.
However, in searching, I *think* I found a django snippet that allows me to set the guid explicitly instead of having django rewrite it, which should have the same effect. I'll try to test it on Friday. In the meantime or if it doesn't work, are there any side-effects to changing the apache configuration?
· duplicates in news and packages feeds
· duplicates in arch-announce (ML and web archive) --- FS#12537
· no duplicates in the news and packages sections on the home page
· no duplicates on the news and packages pages
So "def item_guid(self, item):" for the PackageFeed and NewsFeed classes should partly solve the problem (guid could be, e.g., the link whitout the domain part).
And I think that the issue of "having two different URLs pointing to the same resource", where a resource is a package or a news item, is worth a discussion on the arch-dev mailing list.
> arch-dev mailing list.
No, it's just good Web sense. It has nothing to do with what kind of resource (as defined by the HTTP standard) the Web server is publishing. It's an administrator's task, not a package developer's task.
It doesn't require any change to the Django code.
I'm not really going to bring what I prefer into this (OK, www), but I'll approach it more from a "let's not change the norm" angle. Just about everyone uses www. on their primary site, and even in the Linux world, where people are smart enough to realize this is arbitrary, debian.org, ubuntu.com and gentoo.org redirect to their www sites. Slackware does the same thing as we currently do (two distinct namespaces), and fedoraproject.org omits the www.
Are there any non-personal/non-political reasons for one or the other, or should Aaron and I just fight it out on IRC or something? :)
A base domain cannot be a CNAME. This is an important distinction if you use cache services (http accelerators) like Akamai, Limelight, etc. They require CNAME aliasing for some of their accelerator/edge-cache services.
For archlinux.org though.. It probably doesn't matter.
(note: i prefer www myself. :D )
RewriteCond %{HTTP_HOST} ^archlinux\.org$
RewriteRule (.*) http://www.archlinux.org$1 [R=301]
Rewrite rule now in effect.
No duplicates here.