Combating SpamBots

The war against spam is ever escalating. Two weeks ago I took my anti-spam tactics to the next level. I want people to be able to post comments to my website without registering. Anonymous comments (or rather, comments from unverified authors) should be available if the webmaster sees fit. But over the past several months, comment spam had become a real problem. I logged in one day and found several hundred spam comments that had gone unnoticed for quite some time. At that time, I did not have any anti-spam measures. I looked around and added a CAPTCHA to the comment form. That stopped most of the spam, but the determined spammers were still getting through.

Number of failed CAPTCHA responses per IP address in the failed CAPTCHA log (top 17 hosts, IP addresses not shown):
514, 250, 160, 158, 138, 111, 78, 78, 73, 72, 69, 60, 60, 54, 54, 52, 50
2700+ other unique hosts: <50 hits per host

In the past 2 months, I have logged more than 14,000 failed CAPTCHA attempts. Most of the unique hosts have one or two failures, but more than 1,000 unique IP addresses have four or more failures. At some point you have to draw the line and I draw it at four. Or maybe three. A bona fide person can easily rack up one or two failures. But usually only spambots are dumb enough to get more than three.
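The "draw the line at four" rule can be sketched in a few lines of Python. This is only an illustration, not the code running on this site: the function name `suspected_bots`, the constant `FAILURE_LIMIT`, and the sample log (which uses documentation-range IP addresses) are all made up for the example.

```python
from collections import Counter

# Assumption: the line is drawn at four failures, inclusive.
FAILURE_LIMIT = 4

def suspected_bots(failed_ips, limit=FAILURE_LIMIT):
    """Return, sorted, the IPs that have at least `limit` failed CAPTCHA responses."""
    counts = Counter(failed_ips)
    return sorted(ip for ip, n in counts.items() if n >= limit)

# Hypothetical log: one entry per failed CAPTCHA response.
log = ["192.0.2.1"] * 5 + ["192.0.2.2"] * 2 + ["192.0.2.3"] * 4
print(suspected_bots(log))
```

In the sample log, the host with only two failures is left alone, which matches the idea that a real person can fumble a CAPTCHA once or twice.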

I can characterize the failures, and many of them seem to follow a certain pattern: hit twice in rapid succession and then give up for a while. Two hits alone are not usually successful -- the bot usually guesses an empty string or 0 or 1. The problem is that if you are using a math CAPTCHA, those can be the right answer. And obviously, if the spambot keeps at it two at a time, it will eventually guess correctly and be able to post. I found that the spambots were able to crack several of the CAPTCHAs I offered: ReCAPTCHA, math, word list, word order, etc. All of them except ReCAPTCHA can be cracked by random entries. I am not sure how they managed to crack ReCAPTCHA. But all the spam was starting to make me angry. Finally, in addition to CAPTCHA, I resorted to comment moderation, which requires me to log in and manually approve all comments. I really don't like this because sometimes I forget. Then the comments get old and people think I don't care.

I did a little hunting around the Drupal front and found Mollom. This is a nice line of defense against spam. But I read elsewhere that in some cases it wasn't catching everything. Remember that spambots are in it for the speed and money, so their GET to POST times are very short. I whipped up a little module that checks that. All you super-human typists had better slow down when commenting on my forms. Then I took a page out of Ignacio Segura's book and had my little module add a honeypot field to the comment form as well. You will not see it (unless you are looking at the HTML source, reading with a non-CSS-compliant browser like lynx, or are a spambot). It is meant to be left empty and will cause a form rejection if it has any text in it.
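For the curious, the two checks boil down to something like the following Python sketch. It is not the module itself. The five-second threshold is an assumption (the real value is not stated here), and `looks_like_bot` and its parameters are hypothetical names.

```python
# Assumed minimum time between serving the form (GET) and receiving it (POST).
MIN_SECONDS = 5.0

def looks_like_bot(form_served_at, submitted_at, honeypot_value,
                   min_seconds=MIN_SECONDS):
    """Return True if the submission trips the timing check or the honeypot.

    Timestamps are in seconds. `honeypot_value` is whatever arrived in the
    hidden field that humans never see and therefore leave empty.
    """
    too_fast = (submitted_at - form_served_at) < min_seconds
    # Whitespace-only input is treated as empty here; that is a design choice.
    honeypot_filled = honeypot_value.strip() != ""
    return too_fast or honeypot_filled

# A human who took 30 seconds and left the hidden field alone passes.
print(looks_like_bot(0.0, 30.0, ""))
# A bot that posted after one second is rejected.
print(looks_like_bot(0.0, 1.0, ""))
```

The server has to remember when it served the form (for example in the session) so that the client cannot simply lie about the timestamp.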

Then one step more. Because what is escalation if you are not really accelerating? I noticed that once spambots did get in, they were usually 'advertising' for companies of ill repute. Offering things like p1Lz and other items to EnH4Nc3 certain parts of one's body. But to get around blacklists for certain words, they intentionally misspell what they are advertising and also link to obscurely named domains (which are usually not words either). I figured any rational human being would spell at least 75% of their words correctly (and that figure allows for things like spambot, acronyms, and other non-English shortcuts). So my latest addition to the spam warfare is PHP's pspell library. So all you spammers out there had better spell it right.
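The 75% rule is easy to illustrate. The sketch below is in Python and swaps pspell for a plain set of known words, so it shows the idea rather than the real check. The tiny dictionary, the function names, and the tokenizing regular expression are all assumptions for the example.

```python
import re

def spelled_correctly_ratio(text, dictionary):
    """Return the fraction of words in `text` that appear in `dictionary`."""
    words = re.findall(r"[A-Za-z0-9']+", text.lower())
    if not words:
        return 1.0  # design choice: text with no words has nothing misspelled
    known = sum(1 for word in words if word in dictionary)
    return known / len(words)

def passes_spell_check(text, dictionary, threshold=0.75):
    """Accept the comment only if at least `threshold` of its words are known."""
    return spelled_correctly_ratio(text, dictionary) >= threshold

# Hypothetical stand-in for a real spelling dictionary.
DICTIONARY = {"great", "post", "thanks", "for", "sharing", "buy", "now", "cheap"}

print(passes_spell_check("Great post, thanks for sharing", DICTIONARY))
print(passes_spell_check("buy p1Lz now EnH4Nc3 ch34p", DICTIONARY))
```

In the second sample only "buy" and "now" are real words, so two out of five falls well short of the threshold.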

SpamBot attacks
Then, as the final blow to spammers (and bad spellers everywhere), I added a "three strikes and you are out" gotcha: if you fail the previous tests more than a given number of times, you get added to the blacklist. All entries in the blacklist are forbidden to access any part of the website. Permanently. And it seems to work. I have not seen any spam get past the filters in the two weeks that this has been in effect. Let's hope this lasts.
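A strike counter like that could look roughly like this in Python. Again, a sketch and not the real thing: the class name `StrikeTracker`, the default limit of three, and the in-memory storage are assumptions. A real site would keep the counts and the blacklist in a database so they survive restarts.

```python
class StrikeTracker:
    """Count failed anti-spam checks per IP and blacklist repeat offenders."""

    def __init__(self, max_failures=3):
        self.max_failures = max_failures  # assumed limit; the real one may differ
        self.failures = {}
        self.blacklist = set()

    def record_failure(self, ip):
        """Register one failed check; return True if `ip` is now blacklisted."""
        self.failures[ip] = self.failures.get(ip, 0) + 1
        # "More than a given number of times": strictly greater than the limit.
        if self.failures[ip] > self.max_failures:
            self.blacklist.add(ip)
        return ip in self.blacklist

    def status_for(self, ip):
        """Return the HTTP status a request from `ip` should get."""
        return 403 if ip in self.blacklist else 200

tracker = StrikeTracker()
for _ in range(4):
    banned = tracker.record_failure("192.0.2.1")
print(banned, tracker.status_for("192.0.2.1"), tracker.status_for("192.0.2.2"))
```

Entries are never removed from the blacklist, which mirrors the "Permanently" above.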

I was curious about the actual counts of things, so I whipped up a few SQL queries that gave me the statistics that I wanted. I pushed it all into OOo and came up with this fine chart. There are a couple of things to note:

  • This is about a month of data.
  • The yellow line (number of daily comment spam posts) is on the scale to the right. The other two lines are on the scale to the left.
  • The first day I tried all this stuff out (29 Jul) I didn't actually have the blacklist implemented, which accounts for there being no HTTP/403 entries on that day.
  • There has been zero comment spam since 29 Jul. It is not for a lack of trying.
  • The blue line shows the number of newly recognized SpamBot IP addresses.
  • The red-orange line shows the number of attempts from previously identified SpamBots that got rejected by the blacklist.
  • I find it quite funny that the HTTP/403 line looks like my server is flipping the bird at the SpamBots. That's what it is doing... And no, I did not doctor the data.
  • I see that there seem to be trends or waves of spam. That is fascinating and frightening all at the same time.
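The per-day counts behind a chart like this come from simple GROUP BY queries. Here is a rough Python sketch using an in-memory SQLite database. The `events` table, its columns, the `kind` labels, and the sample rows are all invented for illustration and are not the real schema or data.

```python
import sqlite3

def daily_counts(conn):
    """Return {day: {kind: count}} from the hypothetical `events` table."""
    rows = conn.execute(
        "SELECT day, kind, COUNT(*) FROM events "
        "GROUP BY day, kind ORDER BY day, kind"
    )
    result = {}
    for day, kind, n in rows:
        result.setdefault(day, {})[kind] = n
    return result

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (day TEXT, ip TEXT, kind TEXT)")
# Made-up rows: 'new_bot' = newly blacklisted IP, 'blocked_403' = rejected request.
conn.executemany(
    "INSERT INTO events VALUES (?, ?, ?)",
    [
        ("day1", "192.0.2.1", "new_bot"),
        ("day2", "192.0.2.1", "blocked_403"),
        ("day2", "192.0.2.1", "blocked_403"),
        ("day2", "192.0.2.2", "new_bot"),
    ],
)
print(daily_counts(conn))
```

Each `kind` then becomes one line on the chart, with one point per day.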

Do you do anything to combat spam on your sites? Obviously comment moderation is the only truly perfect filter, but it requires so much work. Especially when I really don't get that many human comments per day, but loads of spam attempts.

Today ends with Vernon: 15, SpamBots: 0.