Non-empty lists

March 10, 2012

I recently watched a video about the library scalaz, which among other things discussed a “non-empty list” data structure provided by the library. I enjoyed very much when someone in the library asked “isn’t that just Cons?”… This is an awesome implication of the fact that in Scala, data constructors are actually full-blown types, something that functional programming guys – like myself – can easily forget.

Of course, scalaz’s NonEmptyList is a bit more than just Cons (or :: or whatever it’s called in Scala’s standard library), because it implements standard collection functions like map, flatmap, etc. such that they preserve the NonEmptyList result type.

Anyhow, this video, and my subsequent investigation of scalaz’s NonEmptyList, got me thinking about this data type. List is such a fundamental data structure – lists and folds on lists (remember, a data type is completely defined by its catamorphism) basically embody the concept of induction. So, it seemed like NonEmptyList should have some sort of similarly deep meaning, and a similarly pithy definition.

I was sort of disappointed to find that the scalaz definition of NonEmptyList actually appeals to the definition of List directly, looking something like this (scalaz doesn’t use a case class, but this gets the idea across):

case class NonEmptyList[+A](hd: A, tl: List[A])

I started reading a Haskell-cafe thread about NonEmptyLists, and saw Edward Kmett (who is one of my favorite writers on functional programming) use a NonEmptyList that was defined in a very similar way, once again appealing to the definition of List:

data NE a = NE a [a] deriving (Eq,Ord,Show,Read)

I felt kind of dissatisfied with these definitions. I find using the definition of List inside NonEmptyList to be sort of ugly – it seems to get in the way of understanding the essence of NonEmptyList, taken as its own concept. From a library implementer’s perspective, I understand the practicality – plenty of functions are already written to work with lists, and this is probably a fairly efficient encoding, performance-wise, as only the first cons cell needs to have any special information compared to a regular List.

With all those practical considerations aside, I thought this was a much slicker definition:

case class NonEmptyList[+A](hd: A, tl: Option[NonEmptyList[A]])

This seems to get a bit more at the fundamental nature of NonEmptyLists. Of course it is trivially isomorphic to the above definition, but I like it better since its a bit more self-contained – Option is simpler than List (something like forall X. 1 + X vs. forall X. mu. x. 1 + X * x).

As far as I understand it, this is the equivalent of calling someNonEmptyList.tail.toNel (toNel produces an Option[NonEmptyList[A]]) in the scalaz library, so they certainly use it this way. Still, I think its slicker to do it by default rather than relying on extension methods on List. Don’t get me wrong, I’m not so foolish as to think they hadn’t considered doing it in the way I propose – these guys know a hell a lot more about types and FP than I do (I’ve quoted their blogs a bunch, and I just bought a pre-release of their upcoming book).

As I continued to read the Haskell-cafe thread, I was excited to see that the next (and final) post was by Conor McBride, and showed that same slick formulation:

data NE x = x :> Maybe (NE x)

Conor McBride is a rock-star of type theory – the guy who showed the definition of “differentiation” (think calculus) can be extended to datatypes! What a boss.

It’s always a fun experience when you see people discussing a problem, you think “why don’t you solve it *this* way??”, and then someone whom you look up to offers that same idea. The occasional affirmation that you’re not totally out to lunch. Unfortunately for me, this only ever seems to happen while looking back at things – if I could just have decent ideas *before* other people have them, then I might be on to something.

All kidding aside, I feel like this phenomenon is common with many aspiring academics or budding programmers (at least among my friends and I): first you start reinventing wheels that were invented 50 years ago, then 20, then 5, then 1, then before you know it you’re actually doing some original research!

Anyhow, I just really liked the idea of NonEmptyList, and this particular slick little formulation, and wanted to share it with everybody. I know above I speculated as to whether NonEmptyList had some sort of similarly deep meaning as List, and I’ll let you know if I read of (or think of) any.

Java VisualVM is a sweet, free Scala profiler that comes with the JDK!

March 7, 2012

Scala is a very sweet language, but many of its abstractions come with hidden costs, so you might sometimes wind up writing some really slick code that is dog-slow, and be wondering where to start.

That’s where VisualVM comes in. You probs already have it installed on your computer. This was an especially pleasant surprise for me as primarily a .NET developer, I’m used to the decent profilers being expensive (my favorite is the one that comes with Visual Studio Ultimate, as I think sampling > instrumenting).

Just use Launchy to open up Search Everything, type in “visualvm”, and open the program. Then:

  1. Start running the code you want to profile inside IntelliJ.
  2. You’ll see a new java icon, representing the running process, appear under the “Local” section of the leftmost “Applications” pane of VisualVM – right-click the icon.
  3. From the menu that appears, click “Sample”.
  4. Now wait a while for it to collect enough data, stop running your program, and you should have a sweet list of the hotspots in your code.
  5. When you want to dig into a hotspot and figure out the stack trace leading up to it, select it and click the “Snapshot” button. Bam!
Of course, this isn’t perfect – in the ~10 minutes I’ve been playing with it, I still haven’t figured out how to get good line numbers. Perhaps running in debug mode will add better symbols for the tool to take advantage of. Even so, the traces are more than enough to point you in the right direction.

Thanks to and for pointing me to this stuff!


Scala code to generate Markov chains from a text file

March 6, 2012

(Technical note: the following code is not quite correct, as it can lead to 0 probability for picking certain words, which isn’t right… the correct approach is to have a uniform prior (i.e. add 1 to everything). I’ll update it soon. Learning is fun!)

As a followup to my last post on Markov chains, I thought I’d post some code I wrote to generate those sorts of word-chains given a training .txt file.

I will be putting this example along with some of my helper libraries and other introductory machine learning examples up on my GitHub in the near future. I’ve been self-studying machine learning for a while now, and writing little models and snippets of code here and there, but I thought it would be a good exercise to have a bit more discipline about it and try to make public a set of worked examples, factored into a mini-library of sorts.

I’ve been using Project Gutenberg as a nice free source of training text, and the code below is configured to use a local copy of War and Peace as the training data. Just set the path to point to wherever you download the text. If you can’t get this stuff to work keep on the lookout for some posts about setting up Scala + IntelliJ that I’ve been putting off writing…

Here’s some sample output:

the mob pointing to act and earned him princess mary listened to 
thedisordered battalion three and wait said something to be a small
officer said his coat over the village of countbezukhov and yellow
legs went onall the strange voice this a memorandum is so on their
mastersand he moved on the country seat of the dorogomilov gates of
oldenburgand there to commitmurder he had saved by the young
armenian family had taken from the meeting and the morning
alpatychdonned a quiet gentleirony because it offended by its all
that a third of moscow no farther end innothinghe was such


Anyhow, here’s the code (apologies for the funky formatting, I know I should just use Gists or figure out a blog plugin or something, but it’s haaaaard. Wah.)

object StandaloneMarkovChainGenerator {

  // replace this with a path containing your training text
  val localPath = "C:\\Users\\Luke\\Desktop\\Files For Machine Learning\\";

  def readLocalTextFile(path: String): String =
    io.Source.fromFile(localPath + path).mkString

  def getWordSequenceFromString(str: String): Seq[String] =
    str.toLowerCase filter { c => c.isLetter || c.isSpaceChar } split "\\s+"
  def getWordSequenceFromLocalFile(fileName: String): Seq[String] =

  def getConsecutivePairs[T](items: Seq[T]): Seq[(T, T)] =
    items zip (items drop 1)

  def getCountsDouble[T](items: Seq[T]): Map[T, Double] =
    items groupBy identity mapValues { _.length.asInstanceOf[Double] }

  type Dist[+T] = () => T

  def getWeightedCasesDistribution[T](
    weightedCases: Seq[(T, Double)]): Dist[T] = {

    val cases = weightedCases map { _._1 }
    val weights = weightedCases map { _._2 }
    val summedWeights = weights .scan(0d) { _ + _ } drop 1
    val sumOfWeights = summedWeights.last
    val probs = summedWeights map { _ / sumOfWeights }
    val casesAndProbs = cases zip probs

    { () =>
      val roll = scala.math.random
      casesAndProbs find {
        case (_, prob) => prob > roll
      } map { _._1 } getOrElse (sys.error("Impossible!"))

  def generateMarkovChainFromFile(
    fileName: String, startWord: String = "the"): Stream[String] = {

    val words = getWordSequenceFromLocalFile(fileName)
    val pairs = getConsecutivePairs(words)
    val pairsWithCounts = getCountsDouble(pairs)

    val wordsToFollowingWordsAndCounts =
        .map { case ((a, b), num) => (a, (b, num)) }
        .groupBy { case (a, (b, num)) => a }
        .mapValues { _ map { case (a, (b, num)) => (b, num) } }

    val sortedByFrequency =
      wordsToFollowingWordsAndCounts mapValues { _ sortBy { _._2 } }

    var wordsToNextWordDists = Map(): Map[String, Dist[String]]

    def pickNext(word: String): String = {
      val possibleNextWordsAndWeights = sortedByFrequency(word)
      if (possibleNextWordsAndWeights.isEmpty) { return startWord }
      wordsToNextWordDists = wordsToNextWordDists updated
        (word, getWeightedCasesDistribution(possibleNextWordsAndWeights))


  def main(args: Array[String]) {
    // you can replace this with a different book if you'd like
    val filePath = "MarkovChain\\WarAndPeace.txt"
    val chain = generateMarkovChainFromFile(filePath)

    println(chain take 100 reduce { _ + " " + _ })

Miles Sabin’s Shapeless is awesome!

March 6, 2012

So, I just found out about this cool Scala library, Shapeless.

It contains (among many things) what the author describes as “the mother of all HList implementations.” If you’ll remember, I did a post on using HLists to kinda-sorta simulate variable-arity polymorphism, as a followup to my very frist psot using Typed Racket’s variable-arity features to make a “fmap*” function that “lifts” functions of arbitrary arity into functions over functorized values – in that case, the functor was the type of probability distributions.

Shapeless has the ability to convert from functions of arbitrary arity to functions which accept a single HList parameter. Using that, he’s defined a variadic lifting function, lift0, which takes a function (A, B, C, ...) => D and turns it into (Option[A], Option[B], Option[C], ...) => Option[D]. This is exactly my holy-grail “fmap*” for the Option functor, and it’s done perfectly!

In Typed Racket I had to (mostly) disable type checking to use fmap* without requiring a frightening amount of type annotations, which made it more of a moral victory than something practical. In my Scala attempts at HLists, I acknowledged the connection, but certainly didn’t get anywhere close to an ability to convert between regular functions and functions over HLists, nor did I know it was possible.

Anyhow, I just think it’s very cool that somebody solved this puzzle 1) which I didn’t know how to solve, 2) for which I doubted a solution was possible, and 3) packaged it all up into a nicely reusable library that borders on the practical. This puts Miles Sabin up there with Oleg for smashing apart my preconceptions of what is and isn’t possible in the static type systems of actually usable languages.

I recommend you check out the library. It’s mostly above my head, which is always a good sign. Next step is to use recursive types to develop a similar facility for curried functions…