CAPTCHA: Is There A Better Way?
You may have never heard of a CAPTCHA, but if you spend any time on the internet, you've definitely seen one. CAPTCHAs most often take the form of distorted words that a visitor has to type in to complete an action, and are designed as a test to tell humans from programs. The current state of the art is reCAPTCHA, pictured below:

From a usability standpoint, CAPTCHA represents a hurdle for human visitors. While people who design CAPTCHAs are trying to solve a very real and difficult problem, the war against malicious programs and spammers has escalated to the point where it has a human toll, and we need to seek out alternatives.
The Captcha Arms Race
Unfortunately, the current incarnation of CAPTCHA is a losing proposition. Originally, it made sense to use reading to tell humans from programs; reading is easy for most people and difficult for computers, and words represent an almost unlimited task variety. As computers get faster and programmers get more creative, though, creating a secure, word-based CAPTCHA means making the reading task increasingly difficult, which adversely affects human users. Computers are getting better and better at reading all of the time, while our reading ability as adult humans stays roughly the same (and often gets worse as we age). Logically, it's only a matter of time before simple, word-based CAPTCHA is completely ineffective.
The Sesame Street Solution
So, how do we up the difficulty level for computers without hurting people? For word-based CAPTCHA, we've really only followed one path: making the words more and more difficult to read. What if, instead of making the answer more difficult, we focused on the question?
If you ever watched Sesame Street you probably remember the game "One of these things is not like the other." We humans are naturally good at detecting differences; it's an evolutionary necessity and built into many of our sensory systems. Consider the examples below:

In all of these, you can easily tell which of the 3 words is different. Now, consider asking a computer the question: "Which word is different?". Current technology could easily read the three words in every example above, but how does a machine parse the word "different"? Does it mean red, bold, italicized, green, underlined?
By making the question ambiguous, we've added a layer of difficulty for machines that's easily resolvable for humans. This "Difference CAPTCHA" could allow us to increase the level of security without increasing word distortion. Granted, it's not a perfect solution, and has many of the issues CAPTCHA currently has, but it taps a strength of human brains and at least buys us a bit more time in the arms race.
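To make the idea concrete, here is a minimal sketch of how a "Difference CAPTCHA" challenge might be assembled. Everything in it is an assumption for illustration: the function name, the style table (based on the attributes listed above), and the use of inline HTML styles. It is not an implementation from this post.

```php
<?php
// Hypothetical table of ways one word can differ from the others.
const DIFFERENCE_STYLES = [
    'red'       => 'color:red',
    'bold'      => 'font-weight:bold',
    'italic'    => 'font-style:italic',
    'green'     => 'color:green',
    'underline' => 'text-decoration:underline',
];

// Builds the markup for a list of words in which exactly one word
// (at $oddIndex) carries the chosen style. Returns the markup and
// the word the visitor is expected to type.
function build_difference_challenge(array $words, int $oddIndex, string $style): array
{
    $parts = [];
    foreach ($words as $i => $word) {
        $css = ($i === $oddIndex)
            ? ' style="' . DIFFERENCE_STYLES[$style] . '"'
            : '';
        $parts[] = '<span' . $css . '>' . htmlspecialchars($word) . '</span>';
    }
    return ['html' => implode(' ', $parts), 'answer' => $words[$oddIndex]];
}

// Pick the odd word and the differing attribute at random for each challenge.
$words = ['apple', 'river', 'stone'];
$challenge = build_difference_challenge(
    $words,
    random_int(0, count($words) - 1),
    array_rand(DIFFERENCE_STYLES)
);
```

Inline styles are only a stand-in here, since a program could read them straight out of the markup. A real version would presumably render the words as an image, so that the only way to answer is to look at the result and decide what "different" means.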
A Note About reCAPTCHA
I want to emphasize that this post is in no way an attack on people who design CAPTCHAs. The real enemies are the spammers and hackers who have made CAPTCHA a necessary evil. The folks at reCAPTCHA have done an admirable job of trying to deal with accessibility issues and are using CAPTCHA to accomplish a worthwhile task, helping to decipher digitized books.
Spam Fighting Secrets Revealed
In many ways, this post has been over 2 years in the making. When I started blogging, I decided, as a long-time coder and control freak, to do the back-end programming myself. I came to regret that decision pretty quickly, as blog spammers flooded the site long before I had any legitimate readers. Gradually, though, I began to see the fight against spam as a personal challenge, and decided to develop my own code to even the odds.
That code has developed quite a bit over the past 2 years, and I've decided to make it available to my readers. I realize that I'm probably taking a big risk by doing this, and may get hit with a lot of spam, but I've come to see spam as a serious threat to usability, and I think we have to be willing to talk openly about how to fight it.
The Code
I'll start out with the link to the anti-spam code (written in PHP), in case you want to follow along while reading the post. For the non-coders out there, don't worry; I'll be discussing all of the techniques in a general, cross-platform sense. A couple of notes to you coders: (1) the code assumes that the input variable "$comment" contains your text, (2) you'll have to adjust the thresholds for your own situation, and (3) if you want to see the details for any given bit of text, I've created an output variable called "$spam_report".
Spam Filters
Below are the techniques and filters used in the anti-spam code. The system operates by assigning a weight to each filter and then comparing the resulting weights against two thresholds: low spam probability and high spam probability. Low-probability spam is marked for approval, whereas high-probability spam is automatically rejected. I've found that this 3-tiered system (accept, approve, reject) is more forgiving and allows for greater flexibility.
Trigger Words
This is simply a list of suspicious words (Viagra, anyone?) that gets compared to the comment text and tallied up. Although this is probably the lowest-tech of the filters, it has two additional advantages: (1) you can use it to block obscene or harassing content, and (2) you can use it to quickly filter out persistent spammers while you look for a better solution.
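As a rough illustration of the tally described above, the following sketch counts trigger-word hits in a comment. The function name and the word list are hypothetical and are not taken from the downloadable code.

```php
<?php
// Hypothetical helper: counts how many times any trigger word
// appears in the comment, ignoring case.
function count_trigger_words(string $comment, array $triggerWords): int
{
    $text = strtolower($comment);
    $hits = 0;
    foreach ($triggerWords as $word) {
        // substr_count() tallies non-overlapping occurrences of the word.
        $hits += substr_count($text, strtolower($word));
    }
    return $hits;
}

// Example word list (an assumption; use your own).
$comment = 'Cheap Viagra here! Best viagra prices.';
$hits = count_trigger_words($comment, ['viagra', 'casino']);
```

Plain substring matching also fires when a trigger word appears inside a longer, innocent word, so a short list of distinctive terms is likely to work better than a long list of common ones.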
Link Counting
Links are the lifeblood of blog spammers, and link counting is the single most powerful filter I've found. The system has two separate thresholds, one for counts of "http:" and one for the "url=" format that some blogs and forums used to use. These days, I've found "url=" to almost always be indicative of spam (unless your system still uses it, of course), and in my personal experience, any blog comment with more than one "http:" link is suspicious.
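The two counts described above could be sketched like this. The function name and the suspicion rule at the end are assumptions that simply restate the rule of thumb from the paragraph; the actual thresholds are yours to tune.

```php
<?php
// Hypothetical helper: counts "http:" and "url=" occurrences separately,
// ignoring case. Note that "https:" does not contain "http:", so secure
// links would need their own count.
function count_links(string $comment): array
{
    $text = strtolower($comment);
    return [
        'http' => substr_count($text, 'http:'),
        'url'  => substr_count($text, 'url='),
    ];
}

$counts = count_links('See http://a.example and [url=http://b.example]this[/url]');

// Rule of thumb from the text: any "url=" or more than one "http:" is suspicious.
$suspicious = $counts['url'] > 0 || $counts['http'] > 1;
```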
Text Density
This filter strips out HTML tags and compares the length of the remaining text to the overall length of the comment. The limit is the lowest percentage of text-to-content that you deem acceptable. I designed this filter specifically to handle spam comments that consist of only one link but have just a few words of text around it. Those comments have an unusually low text density.
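A minimal sketch of the text-density calculation, assuming density is expressed as a percentage of the comment's total length. The function name is hypothetical.

```php
<?php
// Hypothetical helper: percentage of the comment that is left
// once HTML tags are stripped out.
function text_density(string $comment): float
{
    $total = strlen($comment);
    if ($total === 0) {
        return 0.0; // avoid dividing by zero on empty comments
    }
    $text = strip_tags($comment);
    return 100.0 * strlen($text) / $total;
}

// A comment that is almost entirely one link has a very low density.
$density = text_density('<a href="http://spam.example/pills">buy</a>');
```

A comment with no markup at all scores 100, so the threshold only affects comments where tags make up most of the content.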
Vowel Density
Have you ever typed nonsense text that looked something like "sdfgsdfgsfdg"? Spammers do it all the time, and it creates a very unusual vowel-to-consonant ratio. Vowel density is the ratio of vowels to total text (after HTML tags are stripped out). The density of a normal post can range quite a bit, so you'll have to play around with this one. The vowel density filter has the added advantage of flagging non-Roman characters, for those of you who get Chinese and Russian spam. For reference, the average vowel density on my blog is currently 23.4%. The threshold setting represents the lowest acceptable limit as a percentage.
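Here is a small sketch of the vowel-density calculation, followed by one way a filter's weight might feed the two thresholds (accept, approve, reject) described at the start of this section. The function names, the weight of 2.0, and all threshold values are made-up assumptions for illustration only.

```php
<?php
// Hypothetical helper: percentage of characters that are vowels,
// measured after HTML tags are stripped out.
function vowel_density(string $comment): float
{
    $text = strip_tags($comment);
    $total = strlen($text);
    if ($total === 0) {
        return 0.0;
    }
    // preg_match_all() returns the number of matches found.
    $vowels = preg_match_all('/[aeiou]/i', $text);
    return 100.0 * $vowels / $total;
}

// Hypothetical three-tier decision: compare a score against
// the low and high spam-probability thresholds.
function classify_comment(float $score, float $low, float $high): string
{
    if ($score >= $high) {
        return 'reject';  // high-probability spam
    }
    if ($score >= $low) {
        return 'approve'; // held for manual approval
    }
    return 'accept';
}

$minVowelDensity = 15.0; // assumed lowest acceptable percentage
$comment = 'sdfgsdfgsfdg';

// Add this filter's (assumed) weight only when the comment falls below the limit.
$score = vowel_density($comment) < $minVowelDensity ? 2.0 : 0.0;
$verdict = classify_comment($score, 1.0, 3.0);
```

Because only the Roman vowels are counted, text written entirely in another script scores zero, which matches the side effect mentioned above.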
Spam Countermeasures
In addition to the spam-filtering code, I have a couple of extra features in place on my blog to deter spam that you won't find in the PHP:
Image-based Submit
If you get hit by a lot of automated spam, this can be shockingly effective. Most automated programs use a method of form posting that only works for standard submit buttons and not image-based (input type="image") buttons. Switch your comment submission button to an image, and you may see a dramatic drop in spam. Better yet, switch your button to an image that looks like a CSS-styled button. How? It's easy: just create the CSS button, take a screen capture of it, and save it as an image.
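One possible server-side companion to this trick, not something described in this post, is to check that the form really was submitted through the image button. When a visitor clicks an image button, the browser sends the click coordinates as "name.x" and "name.y", and PHP exposes them with underscores instead of dots. The sketch below assumes a button named "send"; that name and the function are hypothetical.

```php
<?php
// Hypothetical check: did the request include the click coordinates
// that an image button named $name would send?
function submitted_via_image_button(array $post, string $name): bool
{
    // PHP turns "send.x" / "send.y" into "send_x" / "send_y".
    return isset($post[$name . '_x'], $post[$name . '_y']);
}

// Simulated form data; in a real page this would be $_POST.
$fromBrowser = ['comment' => 'Nice post', 'send_x' => '14', 'send_y' => '9'];
$fromScript  = ['comment' => 'Buy pills', 'send' => 'Submit'];

$browserOk = submitted_via_image_button($fromBrowser, 'send');
$scriptOk  = submitted_via_image_button($fromScript, 'send');
```

A determined spammer could fake the two fields, so treat a missing pair as one more signal rather than as proof.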
24-hour Nofollow
This is a bit of an experiment, but since nofollowed links are essentially worthless to spammers, I've adopted a 24-hour nofollow policy. The nofollow is automatically lifted by the code after 24 hours, giving me enough time to remove spam comments without harming my legitimate visitors.
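The time-based rule could look something like the sketch below, which decides the rel attribute when a comment link is rendered rather than rewriting stored comments. The function names and the render-time approach are assumptions, not the blog's actual code.

```php
<?php
// Hypothetical helper: links in comments younger than the hold period
// get rel="nofollow"; older comments are followed normally.
function comment_link_rel(int $postedAt, int $now, int $holdSeconds = 86400): string
{
    return ($now - $postedAt) < $holdSeconds ? 'nofollow' : '';
}

// Renders a commenter's link with or without the nofollow attribute.
function render_comment_link(string $url, string $label, int $postedAt, int $now): string
{
    $rel = comment_link_rel($postedAt, $now);
    $relAttr = $rel === '' ? '' : ' rel="' . $rel . '"';
    return '<a href="' . htmlspecialchars($url) . '"' . $relAttr . '>'
        . htmlspecialchars($label) . '</a>';
}

// A comment posted one hour ago is still nofollowed.
$now = time();
$link = render_comment_link('http://example.com', 'Me', $now - 3600, $now);
```

Deciding at render time means nothing has to run on a schedule; the nofollow simply stops being emitted once the comment is a day old.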
Disclaimers & Licensing
Will these techniques make you immune to spam? Of course not. Any spammer who just wanted to prove a point could certainly get a nonsense comment posted on this blog. The goal is to put the low-hanging fruit out of the reach of opportunistic blog spammers, especially people who rely on automation or are using spam comments to generate cheap, easy links.
A bit of legalese: You are free to use the PHP code in this post as you like on your own blog, either in full or in part. I only ask that you not commercialize or mass-produce the code without consulting with me first. Additionally, if you use this code on an active blog and make changes or improvements, I'd love to hear about them.
Anatomy of A Blog Spammer
You've probably noticed that I seem to be talking about spam quite a bit this month. Although it's partly because I've been dealing with a lot of spam lately and the issues are fresh in my mind, it's also because I think spam poses a serious threat to usability. In addition to the added time and trouble directly caused by spam, we also end up fighting that spam by forcing our legitimate visitors to jump through more and more hoops.
One of my goals in developing the engine to power this blog has been to develop spam-fighting techniques that are as invisible as possible to end-users. I'm going to talk about those techniques in detail next week, but before you can fight spam effectively, you have to know your enemy. Here are a few things I've learned about blog spammers:
1. Spammers Are Lazy
OK, maybe the people themselves aren't always lazy, but any individual piece of blog spam is such a low-value proposition that a spammer can't be bothered to spend much time on it. This leads to a number of common practices, including: (1) automation, (2) cutting and pasting, and (3) typing nonsense text (e.g. "asdasfasfasfasf").
2. Spammers Are Human
I've been amazed to discover how much spam is actually being entered directly by people, but my point is much broader than that. Even if spam is coming from a program, that program first has to be written by a human being. If you want to stop blog spam, you have to understand the motivations and personality types of the people who create it. Which leads us to...
3. Blog Spam Has A Purpose
Unlike computer viruses, which are often created for pure ego or as experiments gone wrong, blog spam has one key purpose: to generate links. The vast majority of blog spammers are looking for an easy way to get sites to link back to their own sites (or clients) and drive search engine traffic. If you can block those links, you'll render 90%+ of blog spam impotent.
Building Blocks
It may not sound revolutionary, but all of the ways I've fought spam on this blog come back, one way or another, to these three points. Next week, I'll be building on these basic principles and revealing my anti-spam techniques and PHP code in detail.
24-hour Nofollow: An Experiment
If you've commented on this blog recently, you may have noticed that your comment was nofollowed. Don't panic. In light of recent spam attacks, I'm running a proof-of-concept experiment and am automatically nofollowing new comments for 24 hours, after which they will be followed normally.
I very much support link following, and will do my best to support my active visitors and commenters by allowing you to link to your own sites. As a usability specialist, I'm strongly opposed to punishing my users for the sins of a few opportunistic morons. My hope is that the 24-hour nofollow will give me enough time to screen and remove spam without hurting my legitimate visitors.
I have begun to see spam as a serious threat to usability, and my next three posts will focus on this topic. Among them, I'll be going out on a limb and releasing the techniques and source code I've used to fight spam on this blog. Over the next three Tuesdays, I hope you'll return for the following posts: "Anatomy of A Blog Spammer", "Spam Fighting Secrets Revealed", and "CAPTCHA: Is There A Better Way?".
How to Track Outbound Links
A key aspect of strategic website usability is being able to track goals and measure conversion. For an e-commerce site, tracking goals is usually pretty simple: you want visitors to make a purchase, fill out a contact form, etc., and that process generally results in a specific page that can be logged and measured.
What if the goal you're tracking is a link to an outside site? You might, for example, want to track the effectiveness of advertising, or you might have a third party handling some aspect of your shopping cart, subscription process, etc. As simple as outbound links are to create, tracking them is an entirely different matter. The core problem is that, even though an outbound link "lives" on your website, when a visitor clicks on that link, the request goes straight from their web browser to the server that hosts that other site. Your website never sees the request, it never gets logged, and you'll never see it in your analytics.
So, how do we go about tracking these off-site actions?
The Redirection Trap
A classic way to track outbound links is by trapping them using redirection. This usually requires writing some code, but it's pretty basic. Essentially, instead of linking directly to an outside site, you'll link to a page on your own site that requests that other site.
Let's use this site as an example. I manage my subscribers through Feedburner, and it would be nice to be able to track how many people click on that link. Using PHP, one option would be for me to create a new page, let's call it "subscribe.php", with code that looks something like this:
header("Location: http://feeds.feedburner.com/usereffect");
exit; // stop the script once the redirect header has been sent
That new page simply redirects to the outside site, a process which is virtually invisible to the website visitor. The key is that this intermediary page (subscribe.php) is logged, allowing me to track it through my traffic logs and analytics.
Fun With Redirection
The simple method above can be used to power some much more interesting features as well. Let's say, for example, that you want to build your own banner ad tracking system, capturing all of your outbound ad clicks. By passing a parameter (let's call it $link) to your redirection page, that one page can serve as a portal to multiple outbound links. A simple PHP example (using some of my blogroll links) might look like this:
// Read the link number from the query string (e.g. ?link=2).
$link = isset($_GET['link']) ? (int) $_GET['link'] : 0;
if ($link == 1) {
    header("Location: http://www.kaushik.net/avinash");
} elseif ($link == 2) {
    header("Location: http://www.seomoz.org/blog");
} elseif ($link == 3) {
    header("Location: http://www.thehotiron.com");
}
exit; // nothing else should run after the redirect
Of course, this could be expanded to any number of links and could also include code before the redirection to, for example, store the ad click in a tracking database.
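As one way the expanded version might look, the sketch below swaps the if/elseif chain for a lookup table and records each click before redirecting. The function names are hypothetical, and a plain log file ("clicks.log", an assumed name) stands in for the tracking database.

```php
<?php
// Link table mirroring the blogroll example; add as many entries as needed.
$links = [
    1 => 'http://www.kaushik.net/avinash',
    2 => 'http://www.seomoz.org/blog',
    3 => 'http://www.thehotiron.com',
];

// Hypothetical helper: returns the URL for a link ID, or null if unknown.
function resolve_outbound_link(array $links, $id): ?string
{
    $id = is_numeric($id) ? (int) $id : 0;
    return $links[$id] ?? null;
}

// Hypothetical helper: one tab-separated log line per click.
function format_click_log(int $id, string $url, int $timestamp): string
{
    return gmdate('c', $timestamp) . "\t" . $id . "\t" . $url . "\n";
}

$id = $_GET['link'] ?? 0;
$url = resolve_outbound_link($links, $id);

if ($url !== null) {
    // Record the click first, then send the visitor on their way.
    file_put_contents(
        __DIR__ . '/clicks.log',
        format_click_log((int) $id, $url, time()),
        FILE_APPEND | LOCK_EX
    );
    header('Location: ' . $url);
    exit;
}
```

Looking destinations up in a fixed table also means the page can only ever redirect to sites you listed yourself, rather than to whatever address someone puts in the query string.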
Google Analytics to The Rescue
If you don't like the do-it-yourself approach, Google Analytics includes a function you can use to track outbound links. This function essentially rewrites a link as the page you specify. Continuing my example above, if I wanted to track my Feedburner subscription link using Google Analytics, I would simply add the following JavaScript in my link (<a> tag):
onClick="javascript: pageTracker._trackPageview('/subscribe.php');"
Whenever someone clicked on that link, Google Analytics would log that click as a visit to the virtual page "subscribe.php". To use this feature, you do need the latest version of Google Analytics. Visit this GA help page for more details.
Dr. Peter J. Meyers (AKA "Dr. Pete") is the President of User Effect, a former start-up executive, cognitive psychologist, usability evangelist and lifelong programmer.

