14 June 2005

 

Go Bayesian to keep spam mails off

The term ‘Spam’ itself dates back to 1937, when Jay Hormel of Austin, Minnesota, in the United States, introduced Hormel’s spiced ham under that name. During World War II, nearly two million 50-lb crates of it were sent to soldiers, who quickly developed a love-hate relationship with the pink product.
The issue of e-mail spam has only gotten bigger over the years, and the spam business is now valued at around $20 billion per year. Hardly a day passes without us finding mails peddling, among other things, Viagra, debt, porn sites, insurance rates and online degrees in our mailbox. Even as security and mail companies offer various solutions to combat this menace, spammers themselves are becoming smarter by the day, finding means to bypass existing anti-spam tools.
One of the most promising antidotes to spam is so-called Bayesian filtering, which calculates the probability that a given message is spam, based on analysis of messages previously identified as being spam or not being spam.
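To make the calculation concrete, here is a rough sketch of Bayes' rule applied to a single word. The counts, the word and the function name are invented for illustration, and spam and good mail are assumed to be equally likely to begin with.

```python
def spam_probability(spam_with_word, spam_total, ham_with_word, ham_total,
                     prior_spam=0.5):
    """P(spam | word) by Bayes' rule, from counts of previously sorted mail."""
    p_word_given_spam = spam_with_word / spam_total
    p_word_given_ham = ham_with_word / ham_total
    numerator = p_word_given_spam * prior_spam
    denominator = numerator + p_word_given_ham * (1.0 - prior_spam)
    return numerator / denominator


# Hypothetical history: "viagra" appeared in 400 of 1000 spam mails
# and in 2 of 1000 good mails.
p = spam_probability(400, 1000, 2, 1000)
print(round(p, 3))  # 0.995
```

A word that has never appeared in either pile makes the denominator zero, so a real filter would need some smoothing or a default value for unseen words.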
Most of the spam filters available in the market are keyword-based: they use word and phrase lists to trap spam. These only look for occurrences of the ‘banned’ words to determine whether a given mail is spam.
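A keyword filter of this kind can be sketched in a few lines. The phrase list and function name below are hypothetical, and commercial products are certainly more elaborate, but the weakness is the same: a slight rewording slips through.

```python
# Hypothetical banned-phrase list; real products ship much longer ones.
BANNED_PHRASES = ["insurance rates", "viagra", "online degree"]


def keyword_filter(message):
    """Return True if any banned phrase occurs verbatim in the message."""
    text = message.lower()
    return any(phrase in text for phrase in BANNED_PHRASES)


print(keyword_filter("Lowest insurance rates, apply today"))    # True
print(keyword_filter("Get the best rates of insurance today"))  # False
```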
Bayesian filters, on the other hand, pick up statistical patterns in the mail they have seen, and can evolve with spam. A Bayesian filter learns from experience. So while it knows that the phrase ‘insurance rates’ signals spam, it can also learn, without human assistance, that ‘get the best rates of insurance’ is spam too.
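The learning step might look something like the sketch below: a textbook naive Bayes classifier with add-one smoothing, trained on a handful of made-up messages. The class name, the training mails and the smoothing choice are all assumptions for illustration, not the method of any particular product.

```python
import math
import re
from collections import Counter


class TinyBayesFilter:
    """Minimal naive Bayes text classifier (illustrative only)."""

    def __init__(self):
        self.word_counts = {"spam": Counter(), "ham": Counter()}
        self.message_counts = {"spam": 0, "ham": 0}

    @staticmethod
    def tokenize(text):
        return re.findall(r"[a-z']+", text.lower())

    def train(self, text, label):
        """Record one message that a person has labelled 'spam' or 'ham'."""
        self.message_counts[label] += 1
        self.word_counts[label].update(self.tokenize(text))

    def spam_probability(self, text):
        vocab = set(self.word_counts["spam"]) | set(self.word_counts["ham"])
        total_msgs = sum(self.message_counts.values())
        log_score = {}
        for label in ("spam", "ham"):
            counts = self.word_counts[label]
            total_words = sum(counts.values())
            # Smoothed prior, so an empty class does not produce log(0)
            score = math.log((self.message_counts[label] + 1) / (total_msgs + 2))
            for word in self.tokenize(text):
                if word not in vocab:
                    continue  # never seen in training: carries no evidence
                # Add-one smoothing keeps a single unseen word from zeroing the score
                score += math.log((counts[word] + 1) / (total_words + len(vocab)))
            log_score[label] = score
        # Turn the two log scores into a probability; clamp to avoid overflow
        diff = max(min(log_score["ham"] - log_score["spam"], 700.0), -700.0)
        return 1.0 / (1.0 + math.exp(diff))


spam_filter = TinyBayesFilter()
spam_filter.train("cheap insurance rates for you", "spam")
spam_filter.train("best insurance rates online, apply now", "spam")
spam_filter.train("lowest rates on car insurance", "spam")
spam_filter.train("meeting notes from monday attached", "ham")
spam_filter.train("lunch tomorrow with the project team", "ham")
spam_filter.train("please review the attached project report", "ham")

# The exact phrase below never appeared in training, but its words did.
print(spam_filter.spam_probability("get the best rates of insurance") > 0.5)  # True
```

Nothing in the code mentions insurance: the association comes entirely from the labelled examples, so feeding the filter newer spam shifts its judgement without anyone editing a word list.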
The Bayes in Bayesian was an 18th-century British clergyman and amateur mathematician, Thomas Bayes, who suggested in a posthumously published paper that the probability of some event occurring in the future is related to the proportion of times that event occurred in the past under the same circumstances. Later, mathematicians refined Bayes' ideas and, in the 20th century, built a formal system of classification and decision-making and began applying it to many tasks in science and engineering.
In a sense, Bayesian filters work much as humans do. Think about how you detect spam. A quick glance is often enough. You know what spam looks like, and you know what good mail looks like. This knowledge, of course, comes from past experience with spam. Once you know what to look for, the probability of mistaking spam for good mail is basically zero.
A Bayesian filter does something similar: it learns from past experience. While we learn to recognise spam from the subject line and the sender’s name, a Bayesian filter goes even further. It analyses the entire message, so it has a lot more to draw on. Once such filters have ‘learnt’, they can actually get better than humans at detecting spam.
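As a rough idea of what analysing the entire message can mean, the sketch below uses Python's standard `email` module to pull words from the sender, the subject and the body of a raw mail. The function name, the sample message and the choice to prefix header words are assumptions made here for illustration.

```python
import re
from email import message_from_string

WORD = re.compile(r"[a-z0-9']+")


def message_tokens(raw_message):
    """Collect tokens from selected headers and the body of a raw e-mail."""
    msg = message_from_string(raw_message)
    tokens = []
    for header in ("From", "Subject"):
        value = msg.get(header, "")
        # Prefix header words so "subject:best" stays distinct from "best" in the body
        tokens += [f"{header.lower()}:{w}" for w in WORD.findall(value.lower())]
    body = msg.get_payload()
    if isinstance(body, str):  # single-part messages only, for brevity
        tokens += WORD.findall(body.lower())
    return tokens


raw = (
    "From: deals@example.com\n"
    "Subject: Best rates!\n"
    "\n"
    "Get the best rates of insurance today.\n"
)
tokens = message_tokens(raw)
print(tokens)
```

Every one of these tokens can then be counted and scored, which is why the filter has far more evidence to work with than a reader glancing at the subject line.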
A number of software products already implement some form of the Bayesian technique. The SpamBayes project has produced an Outlook add-in. Another free Outlook spam filter using a Bayesian technique is Spammunition, currently in beta. Spam Bully provides a commercial solution. PopFile is an open-source spam filter.
Bayesian spam filters are here to stay. One sign that they are effective, and likely the future of spam filtering, is that software giant Microsoft set up a department to develop software implementing the Bayesian technique as far back as 1997.
So here’s wishing you luck with your mailbox. May it never hold more spam than actual mail, as is now the case for most of us. With Bayesian filters, that will hopefully soon be just a bad memory. Tell us what you'd like to read next in this feature.
