A quick note on Markov chains, or “where does spam come from?”

February 26, 2012

(Technical note: the following explanation is not quite correct, since it can assign zero probability to some words, which isn’t right… the proper fix is a uniform prior (i.e. add 1 to every count), but the version below seemed a little easier to explain…)

Have you ever wondered why spam has all those hilarious nonsensical strings of words in it? They’re pretty odd constructions… not as random as if you picked words randomly out of a hat, almost grammatical much of the time, but still clearly gibberish.

I know this has been addressed in a bazillion places, but most explanations just say that these are randomly generated words designed to fool your spam filter, without explaining the interesting part: why they appear to have more structure than completely random choices. So I thought I’d get into it in a bit more detail.

What are they? The answer is found in the concept of Markov chains.

You take a piece of “training” text. Make a list of all the words in it. For each word, make a list of the words that immediately follow it, with the number of times each appears. So with the sentence “the quick brown fox jumped over the lazy dog”, I’d end up with this list:

1. the    -> (1, quick), (1, lazy) 
2. quick  -> (1, brown)
3. brown  -> (1, fox)
4. fox    -> (1, jumped)
5. jumped -> (1, over)
6. over   -> (1, the)
7. lazy   -> (1, dog)
8. dog    ->
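
(If you like to see things in code: here’s a tiny Scala sketch of that counting step. It’s just my own illustration, and the name followerCounts is made up.)

def followerCounts(text: String): Map[String, Map[String, Int]] = {
  val words = text.split("\\s+").toList
  // pair each word with the word that follows it, then count the pairs
  words.zip(words.tail)
    .groupBy(_._1)
    .map { case (leader, pairs) =>
      leader -> pairs.groupBy(_._2).map { case (follower, ps) => follower -> ps.size }
    }
}

// followerCounts("the quick brown fox jumped over the lazy dog")("the")
//   == Map("quick" -> 1, "lazy" -> 1)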

Turn that into a matrix, where the rows represent the “leading” words and the columns represent “following” words, and each number in the matrix says how many times the following word appeared after the leading word. So we’d get:

       the quick brown fox jumped over lazy dog
the    0   1     0     0   0      0    1    0
quick  0   0     1     0   0      0    0    0
brown  0   0     0     1   0      0    0    0 
fox    0   0     0     0   1      0    0    0
jumped 0   0     0     0   0      1    0    0
over   0   0     0     0   0      0    1    0
lazy   0   0     0     0   0      0    0    1
dog    0   0     0     0   0      0    0    0

Now, divide every number in the matrix by the total of its row (leaving alone a row of all zeros, like “dog”, which has no followers), and you’ll notice that each row becomes a probability distribution over the next word:

       the quick brown fox jumped over lazy dog
the    0   0.5   0     0   0      0    0.5  0
quick  0   0     1     0   0      0    0    0
brown  0   0     0     1   0      0    0    0 
fox    0   0     0     0   1      0    0    0
jumped 0   0     0     0   0      1    0    0
over   0   0     0     0   0      0    1    0
lazy   0   0     0     0   0      0    0    1
dog    0   0     0     0   0      0    0    0

You can interpret this as saying: if the current word is “the”, there’s a 50% chance the next word is “quick” and a 50% chance it’s “lazy”. Every other word has exactly one possible follower.
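
In the same sketchy Scala as above (again, transitionProbs is my own made-up name), the normalization step might look like this. Conveniently, a word with no followers, like “dog”, never shows up as a row in followerCounts at all, so there’s no divide-by-zero to worry about:

def transitionProbs(
    counts: Map[String, Map[String, Int]]): Map[String, Map[String, Double]] =
  counts.map { case (leader, row) =>
    val total = row.values.sum.toDouble
    // divide each count by its row total so each row sums to 1
    leader -> row.map { case (follower, n) => follower -> n / total }
  }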

Now, so far this hasn’t been too interesting. Almost every word has only one possible following word, because the text is so short. But if you train on a larger text and interpret each row as a probability distribution, you can start to see, for every word, what sort of word tends to follow it. This gives a very interesting insight into the nature of written text.

If you then take that big “transition matrix” you’ve trained from a large text, you can use it to actually generate new text in the following way:

1. Pick a “seed” word from the text at random. For best results use one with many possible following words.

2. Find the row in the matrix corresponding to that word. Choose the next word at random, weighted according to the probabilities in the row. That is, if the column corresponding to the word “blue” has the number 0.05 in it, you have a 5% chance of picking “blue” as the next word, and so on (when we divided each number by the total of its row, we made sure that these probabilities would add up to 1).

3. Go back to step 2, using this second word as the new “seed” word. Continue this process to generate as long a string of words as you want. If you end up with a word that no other word follows (uncommon when you train on a large text, but possible – imagine if the last word of a novel were the only occurrence of the word “xylophone”, or whatever), just pick a random word.
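
Here’s a sketch of that generation loop, building on the two functions above (the weighted pick just walks the row, accumulating probability until it passes a uniform random draw):

import scala.util.Random

def pickNext(row: Map[String, Double], rng: Random): String = {
  val r = rng.nextDouble()
  // running cumulative sum over the row, e.g. 0.5, 1.0 for the "the" row
  val cumulative = row.toList
    .scanLeft(("", 0.0)) { case ((_, acc), (word, p)) => (word, acc + p) }
    .tail
  // take the first word whose cumulative probability exceeds the draw
  cumulative.find(_._2 > r).getOrElse(cumulative.last)._1
}

def generate(probs: Map[String, Map[String, Double]],
             seed: String, length: Int, rng: Random): List[String] = {
  val allWords = probs.keys.toIndexedSeq
  (1 until length).scanLeft(seed) { (word, step) =>
    probs.get(word) match {
      case Some(row) => pickNext(row, rng)
      case None      => allWords(rng.nextInt(allWords.size)) // dead end: random word
    }
  }.toList
}

// generate(transitionProbs(followerCounts(trainingText)), "the", 50, new Random)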

You can see how strings of words generated with this method will follow the “trends” of the training data: if you were to build a new transition matrix from the generated words, it would, on average, look the same as the original transition matrix, since you picked the words according to those weights. This completely mechanical process can generate data which looks, statistically, like meaningful English. Of course, it is not necessarily grammatical, and it is certainly devoid of higher meaning, since it was generated by such a simplistic process.

Those “chains” of words constructed by the above process are an example of Markov chains. And they are (as far as I know) the answer to the question “where does spam come from?” posed above. Those uncannily-almost-grammatical ramblings below the “Viagra” ads, generated through the above process, are the spam-creators’ way of fooling your spam filter: they include these chains to give their advertisements statistical similarity to meaningful human correspondence. This works because spam filters are (at least in part) using probabilistic models that depend on word transitions and word frequencies to classify incoming email as spam. The spammers and the filter-writers are engaged in an eternal game of randomly-generated cat-and-mouse.
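
To make “probabilistic models that depend on word frequencies” a bit more concrete, here’s a toy, naive-Bayes-flavored score (a deliberate oversimplification of what real filters do):

// toy spam score: sum of per-word log-likelihood ratios.
// pSpam and pHam are word frequencies estimated from labeled mail.
def spamScore(words: Seq[String],
              pSpam: Map[String, Double],
              pHam: Map[String, Double]): Double =
  words.map { w =>
    // unseen words get a tiny default probability rather than zero
    math.log(pSpam.getOrElse(w, 1e-6) / pHam.getOrElse(w, 1e-6))
  }.sum
// classify as spam when the score exceeds some tuned threshold

The Markov-generated padding drags a sum like this back toward “normal correspondence” territory.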

As with many algorithms in Bayesian statistics and machine learning, the fact that such a simple process can give birth to such seemingly (for lack of a better word) “intelligent” results is mind-boggling.


Getting started with Scala, Part I: Rambling Introduction!

February 25, 2012

Until recently, my Scala experimentation has mostly used code that I wrote myself, avoiding the complexities of IDEs/build scripts/DLL hell/etc. The PushGP implementation I wrote uses no third-party libraries, and is all in one big file with a single “main” method!

I’ve used C# with .NET at work for the last 4 years or so, and I’ve mostly figured out how to code in that environment by now (read: my co-workers, especially Sean, have with long-suffering patience taught me how to do what I need to do, and let me copy their keybindings). Up until recently with Scala, I’d been trying to avoid learning how to really use the IDE and environment, concentrating instead on the neato things I could do with type systems and DSLs and whatnot.

Unfortunately, once you want to do anything besides write programming language implementations from scratch (thanks, built-in parser combinator library!), you’re going to need to reference other libraries. You’re going to need to learn how to build those libraries, which are frequently distributed as balls of source code. You’re going to need to learn how to make your IDE reference them from your project. And you’re going to need to learn wtf a JAR is. This, in my experience, is quite a pain.

So, I decided to put together a few bits of advice that might be useful to other people who are used to how rad C# + Studio + ReSharper is (or to a lesser extent F# – I’m looking at you, guy who was too lazy to let me color type names and variable names differently!), and are trying to get started with Scala. Read on through the next few posts and find out how!


Why traits and not just data?

February 7, 2012

I started learning Scala coming from the perspective of functional programming. I really want to “get” traits/mixins, but I’m having trouble seeing any reason to use inheritance/super heavily to achieve “stackability” when weighed against the benefit of a more data-driven approach that just uses functions as data.

In the chapter on traits in the excellent Artima book, Programming in Scala, the authors’ example demonstrating heavy use of traits is a modifiable queue, which performs certain stackable modifications on its inputs/outputs. They urge you to consider traits in any situation where you are doing a lot of “stackable modifications”.

I’m not sure why you wouldn’t just do this using lists of functions. Instead of relying on super and linearization to apply modifiers in the correct order, just treat the concept of a “modifier” as a piece of data and manipulate it as you would any other data. What’s nice about this is that these data-driven “mixins” can be added at runtime, removed, and re-ordered.

I’m not claiming traits are useless or anything – there are definitely benefits to first-class syntax, and I’m pretty sure I’m missing something more than that. But it seems that most of the time, if you want to organize code as a bunch of stackable modifications, function-valued data is a simpler and more flexible way to go than trying to ponder your way through a bunch of multiple inheritance.

Their example:

import collection.mutable.ArrayBuffer

object MainModule {

  abstract class IntQueue {
    def get(): Int
    def put(x: Int)
  }

  class BasicIntQueue extends IntQueue {
    private val buf = new ArrayBuffer[Int]
    def get() = buf.remove(0)
    def put(x: Int) { buf += x}
  }

  trait Doubling extends IntQueue {
    abstract override def put(x: Int) { super.put(2 * x)}
  }

  trait Incrementing extends IntQueue {
    abstract override def put(x: Int) { super.put(x + 1)}
  }

  def main(args: Array[String]) {

    // the rightmost trait's put runs first: increment, then double (prints 22, 42)
    val q1 = new BasicIntQueue with Doubling with Incrementing

    q1 put 10
    q1 put 20

    println(q1.get())
    println(q1.get())

    // reversed mixin order: double first, then increment (prints 21, 41)
    val q2 = new BasicIntQueue with Incrementing with Doubling

    q2 put 10
    q2 put 20

    println(q2.get())
    println(q2.get())

  }

}

My re-writing:

import collection.mutable.ArrayBuffer

object MainModule2 {

  abstract class Queue[T] {
    def put(item: T)
    def get(): T
  }

  trait BasicIntQueue extends Queue[Int] {
    private val buf = new ArrayBuffer[Int]
    def get() = buf.remove(0)
    def put(x: Int) { buf += x }
  }

  case class ModifiableQueue(
    getModifiers: List[Int=>Int],
    putModifiers: List[Int=>Int]) extends BasicIntQueue {

    private def compose[A](f: A => A, g: A => A): A => A =
      { x => f(g(x)) }

    // foldLeft with identity (instead of reduce) so that an empty
    // modifier list, e.g. on an unmodified queue, doesn't throw
    override def get() =
      (getModifiers.foldLeft(identity[Int] _)(compose[Int]))(super.get())
    override def put(item: Int) {
      super.put((putModifiers.foldLeft(identity[Int] _)(compose[Int]))(item))
    }

    def modify(qm: QueueModifier) =
      ModifiableQueue(
        this.getModifiers ++ List(qm.getModifier),
        this.putModifiers ++ List(qm.putModifier))
  }

  case class QueueModifier(
    getModifier: Int => Int = { x => x },
    putModifier: Int => Int = { x => x })

  val doubling = QueueModifier(putModifier = { 2 * _ })

  val incrementing = QueueModifier(putModifier = { 1 + _ })

  def queue = new ModifiableQueue(Nil, Nil)

  def main(args: Array[String]) {

    // just like mixins:
    val q1 = queue modify doubling modify incrementing

    q1 put 10
    q1 put 20

    println("increment, then double:")

    println(q1.get())
    println(q1.get())

    val q2 = queue modify incrementing modify doubling

    q2 put 10
    q2 put 20

    println("double, then increment:")

    println(q2.get())
    println(q2.get())

    // but now "traits" are just first-class data...
    // way better than mixins because it can be done at runtime,
    // composably, higher-order, etc:

    val q3 = (List(incrementing, doubling) foldLeft queue) { _ modify _ }

    q3 put 10
    q3 put 20

    println("folded in modifiers:")

    println(q3.get())
    println(q3.get())

    // they can even be removed, stacked N times, etc:

    val q4 = q3.copy(putModifiers = q3.putModifiers.tail)

    q4 put 10
    q4 put 20

    println("removed last modifier:")

    println(q4.get())
    println(q4.get())

    def modifyN(q: ModifiableQueue, qm: QueueModifier, n: Int) =
      (1 to n map { _ => qm } foldLeft q) { _ modify _ }

    val q5 = modifyN(queue, doubling, 5)

    q5 put 10
    q5 put 20

    println("doubled 5 times:")

    println(q5.get())
    println(q5.get())

  }

}

This is a very simple example, and it has the downside that a new “modifier” list must be added for each method we wish to modify. I’m going to think this through some more and see if I can’t figure out a more generic approach using reflection, traits, dictionaries, or some combination thereof. One approach could be to define a generic trait that represents the “wrapping” of one method, and mix that in for each method you wish to stackably modify.
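
For what it’s worth, here’s a rough sketch of the shape that generic wrapping might take (purely speculative on my part, not from the book): a small immutable holder for one method’s modifier stack, so a class needs just one field per wrapped method.

// speculative sketch: one reusable, immutable modifier stack per wrapped method
case class Wrapper[A](modifiers: List[A => A] = Nil) {
  def add(f: A => A) = Wrapper(f :: modifiers)
  // modifiers run in the order they were added
  def apply(x: A): A = modifiers.foldRight(x) { (f, acc) => f(acc) }
}

// usage, e.g. inside a queue class:
//   val putWrapper = Wrapper[Int]().add(_ + 1).add(_ * 2)
//   putWrapper(10)  // increments first, then doubles: 22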


F#-style pipe operator in Scala

February 7, 2012

This is super trivial, but sometimes it’s nice to use the F#-style pipe operator to write computations that read in the order they execute (basically postfix function application). In idiomatic Scala, this is usually accomplished with method calls and implicits that add extension methods, aided by the fact that many common functions for which postfix notation is convenient, like map, are pre-defined as methods on collections. But general piping can still be a tool for the toolbox:

object MainModule {

  // wraps a value in an object with a |> method: x |> f is just f(x)
  case class Pipeable[A](a: A) {
    def |>[B](f: A => B) = f(a)
  }

  implicit def convert[A](a: A): Pipeable[A] = Pipeable(a)

  def main(args: Array[String]) {

    def map2[A, B](f: A => B)(xs: List[A]) = xs map f

    // the lambdas' parameter types are annotated because the compiler
    // can't infer them through the pipe alone
    val test = List(1, 2, 3, 4, 5) |>
      map2 { (x: Int) => x + 1 } |>
      map2 { (x: Int) => x * 2 }

    println(test)

  }
}