Bayesian probability and networks
One of the most recent strategies to filter out spam is based on statistical text classification. Most of the filters document that they are based on Bayesian networks.
Although I had some memories of Bayes' theorem, I had never heard of Bayesian networks. So I got a refresher on probabilities and read through a Bayesian networks tutorial.
Some of the aspects in the papers I read shed new light on what I remembered from school. I'll detail some keys to understanding Bayes' theorem in a meaningful way and summarize what I learned about Bayesian networks, without going too much into the formal aspects.
Bayes' theorem
A good introduction to the theorem is: Data Analysis: A Bayesian Tutorial, by Sivia, D. S. (1996), found via this Green Hat Journal's entry on Bayesian Learning.
Although this only contains the first couple of chapters of Sivia's book, I found it really interesting, because it gives a meaningful approach to Bayes' theorem.
It also combines very well with this Tutorial on Learning With Bayesian Networks by David Heckerman, which also presents Bayes' theorem before digging into Bayesian networks.
Disclaimer
I hope the following summary of these papers can help with reading and understanding them, but it does not pretend to be as detailed or as formally correct as they are. Also, I skipped pretty much all the computational complexity problems that come up when you actually try to compute these probabilities.
Bayes probability
The first key that emerges from these papers is Bayes' definition of a probability: the probability of an event x is a person's degree of belief in that event. In particular, you can assign a probability before even doing the experiment: there is no need to toss a coin a thousand times to determine the probability of heads being the result of the next toss.
Variables or unknowns
In the example of a coin, where you don't know whether the coin is fair, you can represent the system as two variables: X, the state of the coin or result of a toss, and O, the fairness or balance of the coin. Specific outcomes or assignments of state for these variables are written using lower-case letters x and o.
X and O are correlated, because knowledge of the coin's balance allows you to evaluate the probability of each outcome of a toss.
The key here is to list the variables that define your system and analyze any causal relationship between them.
Bayes' theorem
What you then want to achieve is to figure out the probability for O knowing the results of X in previous experiments. Then you can evaluate the probability for the next outcome of X, because O is the hidden parameter that drives X's behavior.
Bayes' theorem is a general rule to correlate variables: for two variables A and B, P(A | B) = P(B | A) * P(A) / P(B).
This theorem can be interpreted usefully by taking A as an unknown variable and B as the data you have collected so far in your experiments. This makes P(A) the prior probability of A (before you did the experiments) and P(A | B) the posterior probability of A (after you got the results B from the experiments).
The theorem becomes interesting when you know that A is the variable that drives the behavior of your system. In that case, you can express P(B | A), the probability of the results you got from the experiments, assuming you knew A.
So when you know the causal relationship between A (the driving variable or cause) and B (the driven variable or effect), you can invert that knowledge to infer A from B.
This means you can find the posterior probability for the hidden variable given its prior probability and some experimental results, if you know how the hidden variable influences the experiments.
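To make this inversion concrete, here is a tiny Python sketch of my own (the prior and the two candidate coins are made up, not taken from the papers): the hidden variable says whether a coin is fair or double-headed, and the data is a single toss that came up heads.

```python
# Minimal Bayes' theorem check (my own toy numbers, not from the papers).
# Hidden variable A: the coin is "fair" or "double-headed".
# Observed data B: one toss came up heads.

prior = {"fair": 0.9, "double-headed": 0.1}        # P(A), chosen arbitrarily
likelihood = {"fair": 0.5, "double-headed": 1.0}   # P(B | A): chance of heads

# P(B) is the normalizing constant: sum over all values of A.
p_b = sum(prior[a] * likelihood[a] for a in prior)

# P(A | B) = P(B | A) * P(A) / P(B)
posterior = {a: likelihood[a] * prior[a] / p_b for a in prior}
print(posterior)  # {'fair': 0.818..., 'double-headed': 0.181...}
```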
Coin example
In the case of the coin, A is O (the coin's fairness) and B is the data D from collecting outcomes of X. The theorem then reads P(o | D) = P(D | o) * P(o) / P(D).
Depending on your knowledge of the coin before you experiment with it (do you have any reason to believe the coin has certain properties?), you can figure out P(o).
P(D) is a normalizing constant that we can ignore.
When you assume a certain value for o, you can tell the probability of getting certain outcomes for one or more tosses. This allows you to compute P(D | o), the probability of the experimental data you collected, under the hypothesis of o.
Sivia presents the coin example with a very practical approach, by considering various prior assumptions for the coin's fairness and various toss results, and graphs the posterior probability distribution for the coin's fairness.
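Here is a small Python sketch in that spirit; the flat prior and the toss counts are my own choices, not Sivia's actual figures.

```python
# Sketch of the grid approach (flat prior and toss counts are my own assumptions).
# o is the probability of heads; D is "7 heads out of 10 tosses".
heads, tails = 7, 3

grid = [i / 100 for i in range(101)]                      # candidate values of o
prior = [1.0 for _ in grid]                               # flat prior P(o)
likelihood = [o**heads * (1 - o)**tails for o in grid]    # P(D | o)

unnormalized = [l * p for l, p in zip(likelihood, prior)]
norm = sum(unnormalized)                                  # plays the role of P(D)
posterior = [u / norm for u in unnormalized]              # P(o | D)

best = grid[posterior.index(max(posterior))]
print(best)  # peaks at 0.7, i.e. 7 heads out of 10
```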
Probability for the next outcome
Now that you have evaluated the posterior probability for the hidden variable, you can find the probability of each outcome of the next experiment. In the case of the coin, the posterior knowledge of O, P(o | D), allows you to express the probability of the next outcome x for the variable X, given the results of your experiments: P(X=x | D).
To do so, you need to consider every possible value o of the hidden variable O. You can write P(X=x | D) as the sum of P(X=x | o) * P(o | D) over all possible values of o.
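Continuing the same toy setup (still my own numbers), here is that sum written out over the same grid; the block recomputes the posterior so it stands on its own.

```python
# Predictive probability of the next toss:
# P(X=heads | D) = sum over o of P(heads | o) * P(o | D).
# Same toy setup as before (7 heads out of 10 tosses, flat prior).
heads, tails = 7, 3
grid = [i / 100 for i in range(101)]
weights = [o**heads * (1 - o)**tails for o in grid]   # unnormalized P(o | D) under a flat prior
norm = sum(weights)
posterior = [w / norm for w in weights]               # P(o | D)

p_next_heads = sum(o * p for o, p in zip(grid, posterior))  # P(heads | o) is just o
print(p_next_heads)  # about 0.67, a bit below 7/10 because of the flat prior
```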
Bayesian Networks
In the coin example, we considered two variables, with one being a hidden parameter driving the other variable's behavior. Bayesian or belief networks are an extension of this causality or dependency relation to more than two variables.
Considering a set of variables, a Bayesian network consists of one node per variable. These nodes are connected by directed arcs that represent the dependencies. Nodes that aren't connected are independent.
Each node is then described by a probability distribution that is conditioned on the state of the nodes that it depends on. In the example of the coin, the local probability distribution for X is P(x | o), which describes the effect of the parent (O) on a node (X).
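As a rough illustration (my own hand-rolled encoding, not the API of any particular library), such a network can be stored as one conditional probability table per node, keyed by the states of its parents; the numbers below are made up.

```python
# A hand-rolled encoding of a Bayesian network (not any particular library's API).
# Each node lists its parents and a table P(node | parents), keyed by the parents' states.
# Example: the coin network O -> X, with made-up numbers for P(o).
network = {
    "O": {
        "parents": [],
        "cpt": {(): {"fair": 0.8, "biased": 0.2}},            # P(o), assumed prior
    },
    "X": {
        "parents": ["O"],
        "cpt": {
            ("fair",):   {"heads": 0.5, "tails": 0.5},        # P(x | o)
            ("biased",): {"heads": 0.9, "tails": 0.1},
        },
    },
}

def local_probability(node, value, parent_values):
    """P(node = value | parents = parent_values), read from the node's table."""
    return network[node]["cpt"][tuple(parent_values)][value]

print(local_probability("X", "heads", ["biased"]))  # 0.9
```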
How to use an existing network
Assuming you have the network, you can infer the probability of a variable given the other variables. The Heckerman paper describes this in a fraud detection example and shows how knowing which variables are dependent and which are independent helps simplify the formula.
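Here is a sketch of that kind of inference on a stripped-down network loosely inspired by the fraud example; the structure is simplified and all the probabilities are made up.

```python
# Inference by enumeration on a tiny network loosely inspired by the fraud example
# (structure simplified, all numbers made up): Fraud -> Gas, Fraud -> Jewelry.
p_fraud = {True: 0.001, False: 0.999}       # P(Fraud)
p_gas = {True: 0.2, False: 0.01}            # P(Gas | Fraud)
p_jewelry = {True: 0.05, False: 0.001}      # P(Jewelry | Fraud)

def joint(fraud, gas, jewelry):
    """P(Fraud, Gas, Jewelry) factors along the arcs of the network."""
    pg = p_gas[fraud] if gas else 1 - p_gas[fraud]
    pj = p_jewelry[fraud] if jewelry else 1 - p_jewelry[fraud]
    return p_fraud[fraud] * pg * pj

# P(Fraud | Gas=True, Jewelry=True): sum the joint over the query variable and normalize.
evidence = {True: joint(True, True, True), False: joint(False, True, True)}
print(evidence[True] / (evidence[True] + evidence[False]))
```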
How to improve the network for a fixed structure?
Without changing the network structure, we want to improve the local probability distributions, by learning from some experimental results.
The key here is to first parameterize the local probability distributions. These new parameters can be considered "hidden" variables that determine the behavior of the variables in your system. So, for a given state o of a hidden variable, you can tell the probability of the variable or node Xi that it conditions, given the state of its parents (Pai).
Some assumptions need to be made to effectively parameterize the relationship between a variable and its parents. This helps you break down the hidden parameter into sub-parameters, each responsible for driving a node.
The fairness O of the coin was a simple example of a hidden variable that conditioned the behavior of the coin's state X.
Because we have introduced new variables and new dependencies, the network can be extended to include new nodes and arcs for these parameters. We can now use the previously described technique to figure out the probability of the hidden parameters given the data from the experiments.
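As a sketch of what this looks like for binary nodes (my own toy counts, with both O and X assumed observed in the data; a Beta prior is one common parameterization for binary variables), learning P(x | o) amounts to keeping one set of counts per state of the parent.

```python
# Learning the local distribution of X in the fixed structure O -> X, per parent state.
# A Beta prior over the heads probability updates by simply adding the observed counts.
# The prior counts and the data below are my own toy choices.

def beta_update(alpha_heads, alpha_tails, tosses):
    """Posterior Beta parameters after observing a list of 'heads'/'tails' outcomes."""
    heads = sum(1 for t in tosses if t == "heads")
    tails = len(tosses) - heads
    return alpha_heads + heads, alpha_tails + tails

# One set of parameters per parent state of X (here, the two states of O).
params = {"fair": (1, 1), "biased": (1, 1)}                # uniform Beta(1, 1) priors
data = {"fair": ["heads", "tails", "heads"],               # tosses grouped by the state of O
        "biased": ["heads", "heads", "heads", "tails"]}

for state, tosses in data.items():
    params[state] = beta_update(*params[state], tosses)

# Predicted P(heads | o) is the posterior mean alpha_h / (alpha_h + alpha_t).
for state, (a_h, a_t) in params.items():
    print(state, a_h / (a_h + a_t))
```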
How to decide between several network structures?
Improving the network without changing its structure already required quite a few assumptions, because it relies on an effective parameterization. Improving the network structure requires further simplifications, because the number of possible structures grows more than exponentially in the number of nodes. One way to handle this is to pre-select a small number of structures and find which one fits best.
For this, you need to evaluate the prior probability P(Sh) for each structure hypothesis Sh. Then, using the collected data, you can find the posterior probability P(Sh | D). This brings out the structure that is most likely.
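A sketch of that comparison for two binary variables (the data is made up and I put uniform Beta(1, 1) priors on all the parameters): score each candidate structure by its marginal likelihood P(D | Sh), and with equal priors P(Sh) the higher score points at the more likely structure.

```python
import math

# Comparing two candidate structures for two binary variables A and B,
# scoring each by P(D | Sh) with uniform Beta(1, 1) priors on every parameter.
# The data below is made up; the scoring formula is the standard Beta-binomial one.

def log_marginal(ones, zeros):
    """log P(counts | theta ~ Beta(1, 1)) for a single binary variable."""
    return (math.lgamma(ones + 1) + math.lgamma(zeros + 1)
            - math.lgamma(ones + zeros + 2))

# Toy data: observed pairs (a, b).
data = [(1, 1), (1, 1), (1, 0), (0, 0), (0, 0), (0, 1), (1, 1), (0, 0)]
n = len(data)
a1 = sum(a for a, _ in data)
b1 = sum(b for _, b in data)
b1_given_a1 = sum(b for a, b in data if a == 1)
b1_given_a0 = sum(b for a, b in data if a == 0)

# Structure S1: A and B independent (two unrelated coins).
score_independent = log_marginal(a1, n - a1) + log_marginal(b1, n - b1)

# Structure S2: A -> B (B has a separate parameter for each state of A).
score_dependent = (log_marginal(a1, n - a1)
                   + log_marginal(b1_given_a1, a1 - b1_given_a1)
                   + log_marginal(b1_given_a0, (n - a1) - b1_given_a0))

# With equal priors P(Sh), the higher score is the more likely structure.
print(score_independent, score_dependent)
```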
Notes
I haven't found a precise description of how spam filters use Bayesian networks. Paul Graham's plan for spam seems to describe a very simple network (a one-level tree) with the root variable representing the likelihood of spam and all the leaf variables representing stemmed words. Update: I have found confirmation that this simple network is what is often used in spam filters. It assumes that word frequencies are independent of each other and is called "Naive Bayes".
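A minimal sketch of such a filter (the training messages and the smoothing are my own toy choices):

```python
# A minimal Naive Bayes spam scorer in the spirit of the "one level tree":
# one root variable (spam or not) and one word variable per word, all assumed
# independent given the root. Training data and smoothing are my own toy choices.
import math
from collections import Counter

spam_docs = ["cheap pills online", "online casino win money", "cheap money fast"]
ham_docs = ["meeting notes attached", "lunch tomorrow", "notes from the meeting"]

spam_counts = Counter(w for d in spam_docs for w in d.split())
ham_counts = Counter(w for d in ham_docs for w in d.split())
vocab = set(spam_counts) | set(ham_counts)

def log_score(message, counts, prior):
    """log P(class) + sum of log P(word | class), with add-one smoothing."""
    total = sum(counts.values())
    score = math.log(prior)
    for w in message.split():
        score += math.log((counts[w] + 1) / (total + len(vocab)))
    return score

msg = "cheap pills online now"
verdict = "spam" if log_score(msg, spam_counts, 0.5) > log_score(msg, ham_counts, 0.5) else "ham"
print(verdict)  # "spam" for this message
```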
One thing that might be interesting is to rate entries in RSS/blog feeds by applying a similar technique. The "indicator" would give the likelihood that a post covers a topic that interests you. Also, on really busy days, you could tune the filter to be more selective. I found one other reference to this idea on Jim O'Halloran's blog, but haven't found any implementation yet.
Other text classification techniques might also be useful to experiment with, like SVMs (Support Vector Machines) and neural networks.
Update: Addy is also thinking about a similar blog filtering idea.
Resources:
- Wikipedia's page on Bayes' theorem.
- Bayes' Essay towards solving a Problem in the Doctrine of Chances.
- Using Bayesian networks as classifiers.
- Wikipedia's page on the Monty Hall problem: a classic Bayesian analysis puzzle that is counter-intuitive at first (people usually get it wrong). The picture provides a great aid to understanding the problem.
Update (2005/10/21): I was thinking of Bayes' theorem in the shower yesterday (for some reason) and had trouble remembering the exact form of the equation. I found a simple visualization for it, as a Venn or set diagram, that I hadn't thought of before.
Consider the set of all states, E, as a rectangular box. It then contains two sets, A and B, represented as circles that intersect. The probabilities P(A), P(B), P(A | B) and P(B | A) correspond to the probability of being inside a certain set under certain assumptions.
For example, P(A) is the probability of being in A given that you are in E. Similarly, P(B | A) is the probability of being in B (or actually inside the intersection of A and B) given that you are within A.
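Written out as equations, the picture says that P(A | B) = P(A and B) / P(B) and P(B | A) = P(A and B) / P(A). Eliminating the shared term P(A and B) between the two gives P(A | B) * P(B) = P(B | A) * P(A), which is just a rearranged form of Bayes' theorem.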