27 May 2005

 

How Bayesian Spam Filtering Works

If a word, "Cartesian" for example, never appears in spam but often in your legitimate mail, the probability of "Cartesian" indicating spam is near zero. "Toner", on the other hand, appears exclusively, and often, in spam. "Toner" has a very high probability of being found in spam, not much below 1 (100%).
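To make the idea concrete, here is a minimal sketch of how such a per-word probability could be estimated from counts. The function name, the example counts, and the clamping to the range 0.01 to 0.99 are assumptions chosen for illustration, not a description of any particular filter.

```python
def word_spam_probability(spam_hits, ham_hits, n_spam, n_ham,
                          floor=0.01, ceiling=0.99):
    """Estimate how strongly a word indicates spam from training counts.

    spam_hits / ham_hits: number of spam / good messages containing the word.
    n_spam / n_ham: total number of spam / good messages seen so far.
    The floor and ceiling are assumed values that keep a single word from
    ever being treated as absolute proof.
    """
    spam_rate = spam_hits / n_spam if n_spam else 0.0
    ham_rate = ham_hits / n_ham if n_ham else 0.0
    if spam_rate + ham_rate == 0:
        return 0.5  # word never seen: no evidence either way
    p = spam_rate / (spam_rate + ham_rate)
    return min(ceiling, max(floor, p))


# Hypothetical counts: "Cartesian" only in good mail, "toner" only in spam.
cartesian = word_spam_probability(spam_hits=0, ham_hits=40, n_spam=1000, n_ham=1000)
toner = word_spam_probability(spam_hits=120, ham_hits=0, n_spam=1000, n_ham=1000)
print(cartesian, toner)  # prints: 0.01 0.99
```

With these made-up counts, "Cartesian" lands at the floor and "toner" at the ceiling, matching the "near zero" and "not much below 1" described above.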
When a new message arrives, it is analyzed by the Bayesian spam filter, and the probability of the complete message being spam is calculated using the individual characteristics.
Let's say a message contains both "Cartesian" and "toner". From these words alone it's not yet clear whether we have spam or legit mail. But other characteristics will (most probably) indicate a probability that allows the filter to classify the message as either spam or good mail.
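One common way to combine the individual word probabilities is the naive Bayes rule, which treats the words as independent evidence. The sketch below assumes that rule and uses made-up probabilities for the extra words "free" and "click"; real filters differ in which words they pick and how they weight them.

```python
import math


def combine(probabilities):
    """Combine per-word spam probabilities into one message probability.

    Equivalent to prod(p) / (prod(p) + prod(1 - p)), computed through
    log-odds. Every probability must lie strictly between 0 and 1.
    """
    log_odds = sum(math.log(p) - math.log(1.0 - p) for p in probabilities)
    return 1.0 / (1.0 + math.exp(-log_odds))


# "Cartesian" (0.01) and "toner" (0.99) cancel each other out.
undecided = combine([0.01, 0.99])

# Two further spammy words (assumed values) tip the balance.
with_more_evidence = combine([0.01, 0.99, 0.90, 0.95])

print(round(undecided, 3), round(with_more_evidence, 3))
```

The two words alone leave the message at about 0.5, which is the "not yet clear" case; the additional characteristics push the result above 0.99.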
Bayesian Spam Filters Can Adapt Automatically
Now that we have a classification, the message can be used to train the filter further.
In this case, either the probability of "Cartesian" indicating good mail is lowered (if the message containing both "Cartesian" and "toner" is found to be spam), or the probability of "toner" indicating spam must be reconsidered.
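The retraining step can be sketched as simple bookkeeping: the filter keeps counts per word and recomputes probabilities after each classified message. The class `WordStats`, its methods, and the training data below are hypothetical and only illustrate the principle.

```python
from collections import Counter


class WordStats:
    """Per-word message counts, updated every time a message is classified."""

    def __init__(self):
        self.spam_hits = Counter()  # word -> number of spam messages containing it
        self.ham_hits = Counter()   # word -> number of good messages containing it
        self.n_spam = 0
        self.n_ham = 0

    def train(self, words, is_spam):
        # Count each word once per message, however often it occurs in it.
        hits = self.spam_hits if is_spam else self.ham_hits
        hits.update(set(words))
        if is_spam:
            self.n_spam += 1
        else:
            self.n_ham += 1

    def spam_probability(self, word):
        spam_rate = self.spam_hits[word] / self.n_spam if self.n_spam else 0.0
        ham_rate = self.ham_hits[word] / self.n_ham if self.n_ham else 0.0
        if spam_rate + ham_rate == 0:
            return 0.5  # unknown word: no evidence either way
        return spam_rate / (spam_rate + ham_rate)


stats = WordStats()
for _ in range(20):
    stats.train(["cartesian", "coordinates"], is_spam=False)
for _ in range(20):
    stats.train(["toner", "cheap"], is_spam=True)

before = stats.spam_probability("cartesian")
# The message with both words turns out to be spam, so the filter learns from it.
stats.train(["cartesian", "toner"], is_spam=True)
after = stats.spam_probability("cartesian")
print(before, after)
```

After the extra spam message, "cartesian" is no longer a perfect good-mail indicator: its spam probability rises slightly above zero, while "toner" stays firmly on the spam side.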
Using this auto-adaptive technique, Bayesian filters can learn from both their own and the user's decisions (if she manually corrects a misjudgment by the filters). The adaptability of Bayesian filters also makes sure they are most effective for the individual email user. While most people's spam may have similar characteristics, the legitimate mail is characteristically different for everybody.
How Can Spammers Get Past Bayesian Filters?
The characteristics of legitimate mail are just as important for the Bayesian spam filtering process as the spam is. If the filters are trained specifically for every user, spammers will have an even harder time working around everybody's (or even most people's) spam filters, and the filters can adapt to almost everything spammers try.
Spammers will only make it past well-trained Bayesian filters if they make their spam messages look perfectly like the ordinary email everybody may get. They could do that today, too.
Spammers do not usually send such ordinary emails, I presume, because they don't work. So chances are they won't be doing it when ordinary, boring emails are the only way to make it past the anti-spam filters.
If spammers do switch to mostly ordinary-looking emails, however, we will see a lot of spam in our Inboxes again, and email may become as frustrating as it was in pre-Bayesian days (or even worse). It will also have ruined the market for most kinds of spam, though, and thus won't last for long.
There is one exception that could let spammers work their way through Bayesian filters even with their usual content. It's in the nature of Bayesian statistics that one word that appears very frequently in good mail can be so significant as to turn any message from looking like spam to being rated as ham by the filter.
If spammers find a way to determine your sure-fire good-mail words (by using HTML return receipts to see which messages you opened, for example), they can include one of them in a junk mail and reach you even through a well-trained Bayesian filter.
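A small calculation shows how much weight a single strong good-mail word can carry. This sketch assumes the naive Bayes combining rule and invented word probabilities; the function name `message_spam_probability` is hypothetical as well.

```python
import math


def message_spam_probability(word_probabilities):
    """Naive Bayes combination of per-word spam probabilities (each in (0, 1))."""
    log_odds = sum(math.log(p) - math.log(1.0 - p) for p in word_probabilities)
    return 1.0 / (1.0 + math.exp(-log_odds))


# Three typical spam words with assumed probabilities.
spammy_words = [0.95, 0.90, 0.90]
plain_spam = message_spam_probability(spammy_words)

# The same message plus one word that, for this user, appears almost only in good mail.
sure_fire_ham_word = 0.0001
disguised_spam = message_spam_probability(spammy_words + [sure_fire_ham_word])

print(round(plain_spam, 4), round(disguised_spam, 4))
```

With these numbers the plain message scores well above 0.99, while the copy carrying the one good-mail word drops to roughly 0.13 and would be rated as ham by a filter that uses 0.5 as its threshold.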
