Chapter 12. Fighting Spam

This formula simply expands based on the number of words.

4. Compute the combined probability that the 15 most interesting words might be found in any email (both spam and nonspam). We can call this probability Pa.

5. Finally, the probability that a message is spam is computed using Ps, like so:

Probability that an email is spam = Ps × Pe ÷ Pa

We then take this resulting probability and make a decision. Most systems flag an email as spam when the probability is 90% or greater. Empirical testing shows that few emails fall within the middle probabilities; most fall either at the higher end (and are spam) or at the lower end (and are not spam).

Learning About the Adaptive Filter for Junk Mail Control

Thunderbird's adaptive filter must be trained before it can be used. You must have both a collection (called a corpus) of good email words and one of bad email words:

· If the good corpus is empty, all the messages are classified as spam.
· If the bad corpus is empty, all the messages are classified as nonspam.

These conditions make it clear that the user must train the spam filter before it can be effective.

The filter's configuration includes four preferences. (See Chapter 13, "Customizing Thunderbird for Power Users," for information on setting and resetting preferences in Thunderbird.) Here are the four preferences:

· mail.adaptivefilters.junk_threshold--This is the threshold percentage. A message that scores greater than this number is considered to be spam. The default is 90%, the same value suggested by Graham.
· mail.toolbars.showbutton.junk--This preference determines whether the Junk/Not-junk button is displayed on the Thunderbird toolbar. This button lets you easily toggle a message's status from junk to not-junk, or vice versa.
· mailnews.display.sanitizeJunkMail--This setting prevents Thunderbird from displaying images or other content that might contain harmful code (such as viruses).
· mailnews.ui.junk.firstuse--The first time the junk mail controls are used, this option displays an information box describing the Thunderbird junk mail (spam) features.

The junk threshold, which defaults to 90%, and the first-use preference should probably not be modified unless you are sure of the consequences.

Tip: If you are determined to not start from scratch, you can download a starting filter file. However, these files might not reflect spam as it is today. Training your own filter is probably the best option because the words in your nonspam emails can differ from those in other users' emails.

Training the Adaptive Filter

Thunderbird's adaptive filter requires training. When you first installed Thunderbird, there was no junk filter list. This list is always created, from scratch, based on your training.

Thunderbird stores the spam filter word list in a file named training.dat. This is a binary file, with a file header and a series of variable-length records. Each record consists of a 4-byte hit count, a 4-byte integer specifying how long the token (a word or string) is, and then the token's contents. A token is usually a single word, although there are cases where it might be a compound word or other information.
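To make the record layout described above concrete, here is a minimal Python sketch that walks a buffer of such records. It is an illustration, not Thunderbird's own code. The text does not describe the file header, so the sketch does not try to parse it: the caller supplies the offset at which the records begin. The byte order (big-endian) and the treatment of both integers as unsigned are assumptions, and the names iter_token_records and pack_record are hypothetical helpers invented for this example.

```python
import struct
from typing import Iterator, Optional, Tuple

# Assumption: both 4-byte fields are big-endian unsigned integers.
# The text only says "4-byte"; change the format string if that is wrong.
RECORD_PREFIX = struct.Struct(">II")  # (hit count, token length)


def iter_token_records(
    data: bytes, offset: int = 0, limit: Optional[int] = None
) -> Iterator[Tuple[str, int]]:
    """Yield (token, hit_count) pairs from a run of variable-length records.

    `offset` is where the records start; the header layout is not covered
    in the text, so skipping it is left to the caller.
    """
    seen = 0
    while offset < len(data) and (limit is None or seen < limit):
        if offset + RECORD_PREFIX.size > len(data):
            raise ValueError("truncated record prefix")
        hits, length = RECORD_PREFIX.unpack_from(data, offset)
        offset += RECORD_PREFIX.size
        token = data[offset:offset + length]
        if len(token) != length:
            raise ValueError("truncated token")
        offset += length
        seen += 1
        # Tokens are treated as UTF-8 text here (an assumption).
        yield token.decode("utf-8", errors="replace"), hits


def pack_record(token: str, hits: int) -> bytes:
    """Build one record in the same layout, for experimenting."""
    raw = token.encode("utf-8")
    return RECORD_PREFIX.pack(hits, len(raw)) + raw


if __name__ == "__main__":
    sample = pack_record("offer", 42) + pack_record("meeting", 3)
    print(list(iter_token_records(sample)))
    # prints [('offer', 42), ('meeting', 3)]
```

Because each record carries its own length field, the reader never needs a delimiter between tokens: it reads the 8-byte prefix, then exactly as many bytes as the prefix says. Work on a copy of training.dat rather than the live file if you try this against a real profile.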