(Technical note: the following explanation is not quite correct, as it can lead to 0 probability for picking certain words, which isn’t right… the correct approach is to have a uniform prior (i.e. add 1 to everything) but this seemed to me to be a little easier to explain…)

Have you ever wondered why spam has all those hilarious nonsensical strings of words in it? They’re pretty odd constructions… not as random as if you picked words randomly out of a hat, almost grammatical much of the time, but still clearly gibberish.

I know this has been addressed in a bazillion places (see for example this), but most explanations say that these are randomly generated words designed to fool your spam filter, without explaining the interesting part – why they appear to have more structure than completely random choices. So, I thought I’d get into it in a bit more detail.

What are they? The answer is found in the concept of Markov chains.

You take a piece of “training” text. Make a list of all the words in it. For each word, make a list of all the other words that come after it, with the number of times each appears. So with the sentence: “the quick brown fox jumped over the lazy dog”, I’d end up with the list:

1. the -> (1, quick), (1, lazy) 2. quick -> (1, brown) 3. brown -> (1, fox) 4. fox -> (1, jumped) 5. jumped -> (1, over) 6. over -> (1, the) 7. lazy -> (1, dog) 8. dog ->

Turn that into a matrix, where the rows represent the “leading” words and the columns represent “following” words, and each number in the matrix says how many times the following word appeared after the leading word. So we’d get:

the quick brown fox jumped over lazy dog the 0 1 0 0 0 0 1 0 quick 0 0 1 0 0 0 0 0 brown 0 0 0 1 0 0 0 0 fox 0 0 0 0 1 0 0 0 jumped 0 0 0 0 0 1 0 0 over 0 0 0 0 0 0 1 0 lazy 0 0 0 0 0 0 0 1 dog 0 0 0 0 0 0 0 0

Now, divide every number in the matrix by the total of its row, and you’ll notice that each row becomes a sort of probability distribution.

the quick brown fox jumped over lazy dog the 0 0.5 0 0 0 0 0.5 0 quick 0 0 1 0 0 0 0 0 brown 0 0 0 1 0 0 0 0 fox 0 0 0 0 1 0 0 0 jumped 0 0 0 0 0 1 0 0 over 0 0 0 0 0 0 1 0 lazy 0 0 0 0 0 0 0 1 dog 0 0 0 0 0 0 0 0

You can interpret this as saying “if the first word is a ‘the’ there’s a 50% chance the next word is ‘quick’, and a 50% chance the next word is ‘lazy’. For all the other words, there is only one possible word following it.”

Now so far this hasn’t been too interesting. Almost every word has only one possible following word because the text is so short. But, if you train it with a larger text, and interpret the rows as a probability distribution, you can start to see for every word what sort of word tends to follow it. This gives a very interesting insight into the nature of written text.

If you then take that big “transition matrix” you’ve trained from a large text, you can use it to actually generate new text in the following way:

1. Pick a “seed” word from the text at random. For best results use one with many possible following words.

2. Find the row in the matrix corresponding to that word. Choose the next word at random, weighted according to the probabilities in the row. That is, if the the column corresponding to the word “blue” has the number .05 in it, you have a 5% chance of picking “blue” as the next word, and so on (when we divided each number by the total of its row we made sure that these probabilities would add up to 1).

3. Go back to step 2 using this second word as the new “seed” word. Continue this process to generate as long a string of words as you want. If you end up with a word for which no other words follow it (uncommon when you train on a large test, but possible – imagine if the last word of a novel was the only occurrence of the word “xylophone”, or whatever), just pick a random word.

You can see how strings of words generated with this method will follow the “trends” of the training data, meaning that if you were to generate a new transition matrix from the generated words it would, on average, look the same as the original transition matrix since you picked the words according to those weights. This completely mechanical process can generate data which looks, statistically, like meaningful English. Of course, it is not necessarily grammatical, and is certainly devoid of higher meaning since it was generated through this completely simplistic process.

Those “chains” of words constructed by the above process are an example of Markov chains. And they are (as far as I know) the answer to the question “where does spam come from?” posed above. Those uncannily-almost-grammatical ramblings below the “Viagra” ads, generated through the above process, are the spam-creators way of fooling your spam filter. They include these chains to give their advertisements statistical similarity to meaningful human correspondence. This works because the spam filters are (at least in part) using probabilistic models that depend on word-transitions and word frequencies to classify incoming email as spam. The spammers and the filter-writers are engaged in an eternal game of randomly-generated cat-and-mouse.

As with many algorithms in Bayesian statistics and machine learning, the fact that such a simple process can give birth to such seemingly, for lack of a better word, “intelligent” results, is mind-boggling.

[…] a followup to my last post on Markov chains, I thought I’d post some code I wrote to generate those sorts of word-chains given a training […]

[…] […]