October 10, 2011

Scaling up

Photo by chaoticzee
A trend in modern programming languages is to simplify parallel computation so that the increasing number of cores in processors can be utilized. The days of ever-increasing clock speeds are over, so single-threaded code no longer benefits much from newer hardware. This is one of the major reasons why functional languages are enjoying a renaissance: functions with no mutable state can be parallelized much more easily. In a functional language you declare what to do, not how to do it.
Developing concurrent software in Java has always been hard for the average programmer to get right. Java 7 introduces the fork/join framework, but we will have to wait for Java 8 to get lambda expressions (closures). The technical keynote at JavaOne 2011 nicely summarizes these features using a simple example: calculating the best graduate score among students. The functional implementation of the task is shown to be much simpler than the one using the fork/join framework. The nice thing is that it implicitly relies on the collections framework to parallelize the higher-order functions applied to the list.

One of the more popular languages that was built from the ground up to be scalable is Scala. For a good introduction, watch the presentation given by the language creator himself, Martin Odersky, at Scala Days 2011. In the presentation he introduces the new parallel collections and shows how easily they can be used in a small example.
Scala has been around for almost ten years and has recently gained a lot of interest. The Scala IDE offers nice Eclipse integration with content assist, refactoring and debugging support, and since Scala runs on the JVM it mixes nicely with Java components.
This means you can start implementing the time-critical computations in Scala now rather than waiting for Java 8.
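As a minimal sketch of that interoperability (the names here are made up for illustration): Scala code can instantiate and call Java classes directly, since both compile to JVM bytecode.

```scala
import java.util.ArrayList

object InteropDemo {
  // Fill a plain Java collection from Scala and call its API directly
  def firstUpper(words: Array[String]): String = {
    val list = new ArrayList[String]()
    words.foreach(w => list.add(w))   // Scala closure feeding a Java method
    list.get(0).toUpperCase           // ordinary Java API call
  }
}
```

The reverse direction works too: compiled Scala classes can be called from Java like any other JVM classes.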

Let's create an example similar to the one from the JavaOne keynote and see what it would look like in Scala:
package com.example
import scala.collection.parallel.immutable.ParSeq

object StudentExample {
  // The student class
  class Student(val gradYear: Int) {
    // The score is calculated randomly
    def score: Double = {
      var sum = 0.0d
      for (i <- 1 to 100)
        sum += Math.random / 100
      sum
    }
  }
To benefit at all from parallelization I had to introduce some non-trivial calculation when computing the student score, such as summing up 100 random numbers.

This is how an imperative implementation of the maximum score would look:
def maxScoreImperative(students: List[Student], gradYear: Int): Double = {
    var max = 0.0d
    for (student <- students)
      if (student.gradYear == gradYear)
        max = Math.max(max, student.score)
    max
  }
This is easily understood by a Java programmer, but the functional implementation may be a bit harder to grasp depending on your background:
def maxScore(students: List[Student], gradYear: Int): Double =
    students.filter(_.gradYear == gradYear).map(_.score).fold(0.0d)(Math.max)
The important thing to note is that the same result is achieved by applying three higher-order functions (filter, map and fold) to the student list.
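To make the pipeline concrete, here is the same chain applied step by step to a tiny hand-made list (the students and scores are made up for illustration):

```scala
object PipelineDemo {
  case class S(gradYear: Int, score: Double)

  val students = List(S(2011, 0.4), S(2010, 0.9), S(2011, 0.7))

  val kept   = students.filter(_.gradYear == 2011)  // keep the 2011 students
  val scores = kept.map(_.score)                    // extract their scores
  val best   = scores.fold(0.0)(Math.max)           // reduce to the maximum: 0.7
}
```

Each step produces a new collection, which is what makes the chain so easy to read, and, as we will see, also has a cost.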
The recent ParSeq class can be used instead of List to get implicit parallelization of the list operations:
def maxScorePar(students: ParSeq[Student], gradYear: Int): Double =
    students.filter(_.gradYear == gradYear).map(_.score).fold(0.0d)(Math.max)
Now on to my highly unscientific test of the three different implementations of the maxScore function:
def main(args: Array[String]) {
    // Create a million students, where every other one graduates in 2011
    var students = List[Student]()
    for (i <- 1 to 1000000) {
      val gradYear = 2010 + (i % 2)
      students = new Student(gradYear) :: students
    }
    // This is the parallel version of the list
    val studentsPar = students.par
    // Measure the time ten times
    var measuredTimes = List[Long]()
    for (i <- 1 to 10) {
      val tick = System.currentTimeMillis()
      val max = maxScoreImperative(students, 2011)
      val tock = System.currentTimeMillis()
      measuredTimes = (tock - tick) :: measuredTimes
    }
    // Skip first result due to JVM warmup
    measuredTimes = measuredTimes.tail

    println("Average time: " + math.round(measuredTimes.sum.toDouble / measuredTimes.length) + "ms")
  }
}
Similarly, the maxScore and maxScorePar implementations were measured.
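The repeated measurement could be factored into a small helper; this is only a sketch, not the code actually used above:

```scala
object Timing {
  // Run the block `runs` times (runs must be at least 2), drop the first
  // (JVM warmup) measurement and return the average wall-clock time of the
  // remaining runs, in milliseconds.
  def averageMillis(runs: Int)(block: => Unit): Long = {
    val times = for (_ <- 1 to runs) yield {
      val tick = System.currentTimeMillis()
      block
      System.currentTimeMillis() - tick
    }
    val measured = times.tail
    math.round(measured.sum.toDouble / measured.length)
  }
}
```

With such a helper each variant becomes a one-liner, e.g. `Timing.averageMillis(10) { maxScore(students, 2011) }`.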

Using Scala 2.9.1 the result on my dual-core MacBook Pro was:
  • Average time imperative: 2.4s
  • Average time functional: 2.8s
  • Average time functional parallel: 3.9s
Trying on a new quad-core iMac gave the following results:
  • Average time imperative: 1.2s
  • Average time functional: 1.5s
  • Average time functional parallel: 10.0s (!)
The imperative version was a bit faster than the functional one, which wasn't so surprising, but I was expecting much better results from the parallel list implementation. The only thing that scaled up in this example was the execution time! The Scala parallel collections were introduced only a year ago, in version 2.8, so I guess there is room for improvement.
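One source of overhead in the chained version is that filter and map each allocate an intermediate collection before fold runs. As an untested sketch of a possible mitigation, the three steps can be fused into a single traversal with foldLeft (here using a fixed score field instead of the randomized score method above, to keep the sketch self-contained):

```scala
object FusedMax {
  class Student(val gradYear: Int, val score: Double)

  // One pass over the list: filter, map and fold combined,
  // so no intermediate collections are allocated
  def maxScoreFused(students: List[Student], gradYear: Int): Double =
    students.foldLeft(0.0) { (max, s) =>
      if (s.gradYear == gradYear) Math.max(max, s.score) else max
    }
}
```

Whether this helps the parallel case is another question, since foldLeft is inherently sequential, but it at least removes the allocation cost from the sequential functional version.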

Updated: For some comments and better results see the Scaling up (enhanced) post...
