FS#47837 - {wiki} Countermeasures against spam wave needed

Attached to Project: Arch Linux
Opened by Jakub Klinkovský (lahwaacz) - Wednesday, 20 January 2016, 22:20 GMT
Last edited by Doug Newgard (Scimmia) - Sunday, 24 July 2016, 05:44 GMT
Task Type Feature Request
Category Web Sites
Status Closed
Assigned To Pierre Schmitz (Pierre)
Architecture All
Severity Critical
Priority Urgent
Reported Version
Due in Version Undecided
Due Date Undecided
Percent Complete 100%
Votes 11
Private No


For several days, our wiki has been under a spam attack, which escalated today. See the recent changes [1], block log [2] and deletion log [3] for related activity.

The malicious accounts are created serially, a few minutes apart, so the captcha is arguably broken. I can write a cleanup bot, but unless we get something like the AbuseFilter extension [4] installed on the wiki, this will soon look like clone wars.

Pierre should have access to the IP addresses used by the accounts, but given that the attack was not slowed down by the 24-hour IP-based autoblock, the attackers most likely use a distributed network of bots, so IP blocking would affect many users.

[1] https://wiki.archlinux.org/index.php/Special:RecentChanges
[2] https://wiki.archlinux.org/index.php/Special:Log/block
[3] https://wiki.archlinux.org/index.php/Special:Log/delete
[4] https://www.mediawiki.org/wiki/Extension:AbuseFilter

Closed by  Doug Newgard (Scimmia)
Sunday, 24 July 2016, 05:44 GMT
Reason for closing:  Implemented
Additional comments about closing:  The AbuseFilter extension has been installed as suggested and works as expected.
Comment by Dario Giovannetti (kynikos) - Thursday, 21 January 2016, 03:57 GMT
Besides installing the AbuseFilter extension, enabling the Nuke extension [5], which is already installed [6], would also help immensely against this kind of attack.

[5] https://www.mediawiki.org/wiki/Extension:Nuke
[6] https://projects.archlinux.org/vhosts/wiki.archlinux.org.git/tree/extensions/Nuke

Then I think we either have to change FunnyQuestion's question and answer [7], which is now cracked, or more simply try to use one of the captchas from the already installed ConfirmEdit extension.[8][9]

[7] https://projects.archlinux.org/vhosts/wiki.archlinux.org.git/tree/extensions/FunnyQuestion/FunnyQuestion.body.php#n43
[8] https://www.mediawiki.org/wiki/Extension:ConfirmEdit
[9] https://projects.archlinux.org/vhosts/wiki.archlinux.org.git/tree/extensions/ConfirmEdit

For completeness' sake, I'm adding a link to MediaWiki's manual on combating spam.[10]

[10] https://www.mediawiki.org/wiki/Manual:Combating_spam
Comment by Shulhan (sulhan) - Thursday, 21 January 2016, 08:05 GMT
> The malicious accounts are created serially, a few minutes apart, so the captcha is arguably broken.

So the problem is not the user creation process itself? AFAICR, the wiki has an option to require users to confirm their account by email, right?
Comment by Jakub Klinkovský (lahwaacz) - Thursday, 21 January 2016, 08:50 GMT
That could be done with $wgEmailConfirmToEdit [11], but can be cracked as well.

[11] https://www.mediawiki.org/wiki/Special:MyLanguage/Manual:$wgEmailConfirmToEdit
Comment by Pierre Schmitz (Pierre) - Monday, 01 February 2016, 20:02 GMT
Due to ongoing spam that cannot be defeated even by more Arch-specific captchas, I had to disable account creation for now. The spammers started to post large content, which could lead to a denial-of-service attack. I'll see if I can come up with a more clever solution.
Comment by Shulhan (sulhan) - Friday, 05 February 2016, 05:52 GMT
I just remembered something: did the spam start after a MediaWiki upgrade? If so, I think one solution is to either downgrade to the previous version or upgrade to the latest stable one.
Comment by brian downing (bsd) - Wednesday, 10 February 2016, 12:13 GMT
Since the captcha is not rotating, it's always the same. That's not much of a captcha.

Even if the community were to create a list of Arch captchas, it would be trivial for the vandal to code for the list of questions and valid responses.
Spammer -> vandal (not really spam, but vandalism)

The Arch-specific captcha is "cute", but if it is not effective, maybe it's time to replace it with a more standard captcha.
Comment by Vladimir Panteleev (CyberShadow) - Thursday, 11 February 2016, 14:16 GMT

I made an anti-spam plugin for a wiki I administer, http://wiki.dlang.org/ . It's also a domain-specific question/answer CAPTCHA, but the questions are randomly generated. After some tweaking, we've had zero spam since then.

Here's my code:


(It had a much larger variation of questions before, but people complained that some may have been too hard and would scare away newbies.)

> The Arch-specific captcha is "cute", but if it is not effective, maybe it's time to replace it with a more standard captcha.

Unfortunately, in my experience, you will have a much worse time with a standard CAPTCHA. I think spammers can buy 1000 reCAPTCHA solutions for $5 or so.

Let me know if I can help.
Comment by brian downing (bsd) - Thursday, 11 February 2016, 20:45 GMT
What about Google's reCAPTCHA? Is that compromised as well?
Comment by Jakub Klinkovský (lahwaacz) - Thursday, 11 February 2016, 20:56 GMT
In [8] in one of the comments above, reCAPTCHA is evaluated as having low effectiveness at stopping spam.

But the QuestyCaptcha module in the "official" ConfirmEdit extension might be a good alternative (it would probably solve the problem with rotation), though it is still in beta state. And of course we'd need to build the database of questions ourselves.
Comment by Vladimir Panteleev (CyberShadow) - Thursday, 11 February 2016, 21:04 GMT
Yes, that's the one I meant. But to clarify, it's not "compromised" in a technical or algorithmic sense. Spambot operators can buy CAPTCHA "solutions" in bulk - i.e. access to an API (which they can plug into their spam botnet) which connects CAPTCHA-defended websites with humans who can solve them cheaply. Given the cost of labor in countries such as China, this is very cost-effective and allows them a net profit. They will also do the same for static challenge/response questions, which is probably the reason for the recent spam wave discussed here.

The important part here is that, with few exceptions, spammers are not going to bother customizing their spamming software for individual websites - the cost/benefit ratio is too high. Thus, the registration form must not present a challenge that can be outsourced elsewhere for low pay (e.g. reCAPTCHA), nor a static challenge that requires only a one-time effort to defeat (such as the current challenge).
Comment by Vladimir Panteleev (CyberShadow) - Thursday, 11 February 2016, 21:08 GMT
We've used QuestyCaptcha with a set of questions on wiki.dlang.org before. It didn't work too well.

The issue is that there is almost zero penalty for failing a CAPTCHA challenge. Thus, if you have 100 questions and the spammer can solve (or outsource solving) one, they get a 1% success rate. Given how many requests a spam botnet can make to the wiki server, it will still be enough to flood the wiki with spam.
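To make the arithmetic above concrete, here is a minimal Python sketch. The function name and the request volume are hypothetical; only the pool of 100 questions with one known answer comes from the comment above.

```python
def expected_spam_accounts(attempts, known_questions, pool_size):
    """Expected number of registrations that get through.

    Assumes each attempt draws one question uniformly at random from the
    pool and that a failed attempt costs the bot nothing.
    """
    return attempts * known_questions / pool_size


# Hypothetical volume: 10,000 registration attempts against a pool of
# 100 questions, of which the spammer knows the answer to one.
print(expected_spam_accounts(10_000, 1, 100))  # 100.0
```

Under these assumptions, enlarging the question pool only scales the number of requests the botnet has to send; it does not stop the flood.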
Comment by Vladimir Panteleev (CyberShadow) - Thursday, 11 February 2016, 21:48 GMT
An example of something that could work is to modify the current challenge (the output of "pacman -V|base64|head -1") by adding a randomly generated string, so that it becomes "(echo hhE6qhrQQ8;pacman -V)|base64|head -1". Although simple, this requires writing custom (site-specific) code to defeat, which is unlikely to happen.
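A rough Python sketch of how the server side of such a randomized challenge might look. This is not the wiki's actual code: all function names are hypothetical, and it assumes the server keeps a copy of the `pacman -V` output to compare against. It relies on GNU `base64` wrapping its output at 76 characters by default, so `head -1` yields the first 76 characters of the encoding.

```python
import base64
import secrets
import string

ALPHABET = string.ascii_letters + string.digits


def make_nonce(length=10):
    """Random alphanumeric string, generated per registration form."""
    return "".join(secrets.choice(ALPHABET) for _ in range(length))


def challenge_command(nonce):
    """The shell pipeline shown to the user."""
    return f"(echo {nonce};pacman -V)|base64|head -1"


def expected_answer(nonce, pacman_output):
    """First line of the base64 output for this nonce.

    `echo` appends a newline after the nonce; GNU base64 wraps at
    76 characters, so the first line is the first 76 characters.
    """
    data = (nonce + "\n").encode() + pacman_output.encode()
    return base64.b64encode(data)[:76].decode()


def check_answer(nonce, pacman_output, submitted):
    """Compare the submitted answer with the expected one."""
    expected = expected_answer(nonce, pacman_output)
    return secrets.compare_digest(submitted.strip().encode(), expected.encode())


if __name__ == "__main__":
    nonce = make_nonce()
    print(challenge_command(nonce))
```

Because the nonce changes with every form, a previously solved answer cannot be replayed; the nonce would have to be stored with the session (an assumption of this sketch) so that the submitted answer can be checked against it.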
Comment by brian downing (bsd) - Friday, 12 February 2016, 10:09 GMT
If there is a dedicated individual, or group, that is determined to cause damage through the creation of accounts, then even a perfect CAPTCHA is only going to slow them down.
We should be looking at the attack vector(s) and thinking of ways to stop the attack, as well as to slow the attackers down.
Comment by Vladimir Panteleev (CyberShadow) - Friday, 12 February 2016, 11:06 GMT
Brian, spammers are not going to target the Arch Wiki specifically. Their goal is to target as many websites at once as possible with a minimum per-website effort. If you are dealing with a targeted attack (and this is not the case here), then that is a completely different situation requiring different counter-measures.

I don't see any fixable attack vectors here to speak of. IP blacklists / DNSBLs do not work because botnets grow faster than these lists can keep up. Heuristics involving JavaScript checks, UA sniffing etc. are all defeatable en masse (unless Arch Wiki implements a custom registration form, but even that is defeatable by outsourcing registration to a human).

Additionally, I would not recommend exploring "slowing them down" as a pursuable direction. Even though the CAPTCHA is often solved by humans, the spamming is done by bots, and it will continue to eat into wiki editors' and administrators' time until solved.
Comment by brian downing (bsd) - Friday, 12 February 2016, 12:39 GMT
I was under the impression this was a targeted attack. How can you be certain that it is not?
Since there was an Arch-specific captcha, they would have had to go to a certain amount of effort to answer it the first time.
That doesn't sound like an easy drive-by, nor low hanging fruit.

Slowing them down does make it more expensive in time and pennies, and if they're looking for easy drive-bys, then slowing would help to some degree. It looks like there is no **perfect** solution, but the collection of measures will be additive.

If they can afford to hire humans, then the CAPTCHA has little effect other than slowing them down and adding cost, however little it may be.
Comment by Vladimir Panteleev (CyberShadow) - Friday, 12 February 2016, 13:36 GMT
> I was under the impression this was a targeted attack. How can you be certain that it is not?

Well... probably better to ask, targeted by whom, and to what degree? As in, who had to spend time to target the Arch Wiki, and how much? It is entirely possible that the spambot operator did not have to perform any action to target the Arch Wiki specifically. I'm not certain, and what follows is mostly conjecture as far as it concerns this case, but this is what I've seen:

It is known that spambot operators use CAPTCHA-solving services (there is even a common API), so I think it's very likely that a similar service exists for Questy-like CAPTCHAs. For example, the forum spam software XRumer has a database of 170,000 questions and answers. Services such as Amazon's Mechanical Turk are often used for defeating such challenges.

Another trick that is used on some shady websites is that in place of using a CAPTCHA service directly, they connect to a service which sends you other websites' CAPTCHAs. Thus, the user on the shady website (A) is actually solving the CAPTCHA on a registration form of a different victim website (B). Apart from a delay, this is completely transparent, so in our case it could've appeared as, "In the context of <Arch Linux Wiki>, what is the answer to the question: What is the output of `pacman ...`?" Website A still knows whether the CAPTCHA answer is correct (because website B provides that), and as a result website A's operator wins a bit of money and website B gets spammed.

So, the person who solved the CAPTCHA (running Arch Linux, or finding the answer on Google, or asking an Arch Linux user) may not have even known what the answer would be used for.

> Slowing them down does make it more expensive in time and pennies, and if they're looking for easy drive-bys, then slowing would help to some degree.

Indeed, but it's important to target the most costly areas. CPU time or bandwidth on a botnet you own is effectively free. The most expensive part is almost surely the human involvement. Any Arch Linux user can solve the CAPTCHA I proposed once, but to defeat my proposal one would need to write custom code.

Anyway, what would you propose? I could have a go at implementing the QuestyCaptcha plugin I proposed if you like.
Comment by David Runge (dvzrv) - Friday, 01 April 2016, 23:29 GMT
I've had similar issues recently and solved them by blocking the attackers' outdated Chrome version (it was the same across many different IPs) with a simple http_user_agent rule in nginx.
I know this is probably the most brutal way of doing it, but the attackers used a version of Chrome from way way back in the day, so it was safe to assume that no real users would be harmed by this choice.
if ($http_user_agent ~ Chrome/<attackers chrome version> ) {
    return 403;
}
Comment by Jakub Klinkovský (lahwaacz) - Sunday, 22 May 2016, 12:14 GMT
Update on the current state: account creation has been suspended again due to another spam wave on May 19.

Since the wiki's CAPTCHA has zero effect even with very site-specific questions, we should take the approach of the Wikimedia Foundation and prevent the vandalism from ever happening, instead of assuming good faith of all created accounts. The AbuseFilter extension, suggested in the very first post, provides a ready-to-go solution, which is extensively tested on Wikipedia and its sister projects, where even anonymous editing is allowed. Besides blocking abusive changes to articles' content, it can also be used to throttle actions of new accounts (if the server uses memcached), which would give us a chance to continually improve the filter without having to worry about extensive cleanup. I hope that this will help to restore normal account registration once and for all.
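To illustrate the general idea of throttling new accounts, here is a minimal Python sketch of a sliding-window limiter. It is not AbuseFilter's implementation or its filter syntax; the class name, the thresholds and the in-memory dictionary (standing in for a shared cache such as memcached) are all assumptions made for illustration.

```python
from collections import defaultdict, deque


class NewAccountThrottle:
    """Limit how many actions a recently created account may perform."""

    def __init__(self, max_actions=3, window_seconds=600, new_account_age=86400):
        self.max_actions = max_actions
        self.window_seconds = window_seconds
        self.new_account_age = new_account_age
        # account name -> timestamps of recently allowed actions
        self._recent = defaultdict(deque)

    def allow(self, account, account_age, now):
        """Return True if the action may proceed.

        `account_age` and `now` are in seconds; accounts older than
        `new_account_age` are never throttled.
        """
        if account_age >= self.new_account_age:
            return True
        recent = self._recent[account]
        # Drop timestamps that have fallen out of the window.
        while recent and now - recent[0] >= self.window_seconds:
            recent.popleft()
        if len(recent) >= self.max_actions:
            return False
        recent.append(now)
        return True
```

With a limit like this in place, a spam account that slips past the filter can only make a handful of edits before it is stopped, which bounds the cleanup work while the filter rules are refined.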