October 18, 2011

Scaling up (enhanced)

The results of trying to scale up in my previous post were so disappointing that I asked Martin Odersky, the creator of Scala, for some advice. He helped me get in touch with the author of Scala's parallel collections, Aleksandar Prokopec. Here is his feedback:
  1. Always use the scala.testing.Benchmark class to measure performance.
  2. Always use the server VM. Benchmark results will in most cases be entirely different.
  3. There are no parallel lists - when calling .par on a list, it gets converted to the standard immutable parallel collection, which is ParVector. Transformer methods (such as filter and map) process items in parallel, but construct the final parallel vector sequentially - until vector concatenation is built into the standard library, the construction of the parallel vector will not be parallelized. We are looking into adding concats. In the meantime, I suggest using parallel arrays and comparing them against regular arrays in terms of performance.
  4. Every invocation of a parallel method requires some synchronization, so you might want to keep the number of invocations to a minimum. This is particularly the case with transformer methods that produce intermediate collections - thus possibly triggering GC cycles.
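Point 4 can be illustrated with the parallel collections' aggregate method (this is my sketch, not part of Aleksandar's feedback, and the three-student data is hypothetical): the whole filter/map/fold chain collapses into a single parallel pass with no intermediate collections.

```scala
object SinglePass {
  case class Student(gradYear: Int, score: Double)

  // One parallel pass over the students: the first function folds within a
  // chunk, the second merges per-chunk results - no intermediate collections.
  def maxScoreSinglePass(students: scala.collection.parallel.ParSeq[Student],
                         gradYear: Int): Double =
    students.aggregate(0.0d)(
      (max, s) => if (s.gradYear == gradYear) math.max(max, s.score) else max,
      math.max)

  def main(args: Array[String]) {
    val students = Seq(Student(2011, 0.4), Student(2010, 0.9), Student(2011, 0.7)).par
    println(maxScoreSinglePass(students, 2011))  // prints 0.7
  }
}
```
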
... and the modified version of the example:
object util {
  val studentCount = 1000000
}

case class Student(gradYear: Int, score: Double)

object StudentExample extends testing.Benchmark {
  val students = (for (i <- 1 to util.studentCount) 
    yield new Student(2010 + (i % 2), math.random)).toArray

  def run() {
    val gradYear = 2011
    students.filter(_.gradYear == gradYear).map(_.score).fold(0.0d)(math.max)
  }
}

object StudentExamplePar extends testing.Benchmark {
  val students = (for (i <- 1 to util.studentCount)
    yield new Student(2010 + (i % 2), math.random)).toArray.par

  def run() {
    val gradYear = 2011
    students.filter(_.gradYear == gradYear).map(_.score).fold(0.0d)(math.max)
  }
}

object StudentExampleImp extends testing.Benchmark {
  val students = (for (i <- 1 to util.studentCount)
    yield new Student(2010 + (i % 2), math.random)).toArray

  def run() {
    val gradYear = 2011
    var maxscore = 0.0d
    for (student <- students)
      if (student.gradYear == gradYear)
        maxscore = math.max(maxscore, student.score)
  }
}
... and the results on my dual-core MacBook Pro:
$ java -server -cp /scala-2.9.1.final/lib/scala-library.jar:. StudentExample 10
StudentExample$ 273 38 35 40 36 36 36 36 35 39

$ java -server -cp /scala-2.9.1.final/lib/scala-library.jar:. StudentExamplePar 10
StudentExamplePar$ 495 32 31 31 30 30 30 32 30 31
... and the results on my quad-core iMac:
$ java -server -cp /scala-2.9.1.final/lib/scala-library.jar:. StudentExample 10
StudentExample$ 154 18 17 17 17 17 17 17 17 17

$ java -server -cp /scala-2.9.1.final/lib/scala-library.jar:. StudentExamplePar 10
StudentExamplePar$ 222 10 10 10 9 10 10 10 10 10
A performance gain can now be seen, but both Aleksandar and I agreed that it wasn't the expected speedup. As a reference I also ran the imperative example, which was still by far the most efficient. The results on the dual-core were:
$ java -server -cp /scala-2.9.1.final/lib/scala-library.jar:. StudentExampleImp 10
StudentExampleImp$ 24 7 13 12 10 10 11 11 13 13
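To put a number on the gain, the steady-state averages of the timings above work out to roughly a 1.2x speedup on two cores and 1.7x on four - far from linear. A quick sketch of the arithmetic:

```scala
object Speedup {
  // Steady-state timings (ms) copied from the benchmark output above,
  // i.e. with the first (warmup) measurement dropped.
  val dualSeq = Seq(38, 35, 40, 36, 36, 36, 36, 35, 39)
  val dualPar = Seq(32, 31, 31, 30, 30, 30, 32, 30, 31)
  val quadSeq = Seq(18, 17, 17, 17, 17, 17, 17, 17, 17)
  val quadPar = Seq(10, 10, 10, 9, 10, 10, 10, 10, 10)

  def mean(xs: Seq[Int]): Double = xs.sum.toDouble / xs.length

  def main(args: Array[String]) {
    // Roughly 1.2 on two cores and 1.7 on four.
    println("dual-core speedup: " + mean(dualSeq) / mean(dualPar))
    println("quad-core speedup: " + mean(quadSeq) / mean(quadPar))
  }
}
```
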

October 10, 2011

Scaling up

Photo by chaoticzee
A trend in modern programming languages is to simplify parallel computation so that the increasing number of cores in processors can be utilized. The days when clock speeds increased according to Moore's law are over, so single-threaded code no longer benefits much from newer hardware. This is one of the major reasons why functional languages are enjoying a renaissance: functions with no mutable state can be parallelized much more easily. In a functional language you declare what to do, but not how to do it.
Developing concurrent software in Java has always been hard to get right for the average programmer. Java 7 introduces the fork/join framework, but we'll have to wait for Java 8 to get lambda expressions (closures). The technical keynote at JavaOne 2011 nicely summarizes these features using a simple example of calculating the best graduate score among students. The functional implementation of the task is shown to be much simpler than using the fork/join framework. The nice thing is that it implicitly relies on the collections framework to parallelize the higher-order functions on the list.

One of the more popular languages that was built from the ground up to be scalable is Scala. For a good introduction, watch the presentation given by the language creator himself: Martin Odersky at Scala Days 2011. In the presentation he introduces the new parallel collections and shows how easily they can be utilized in a small example.
Scala has been around for almost ten years and has recently gained a lot of interest. The Scala IDE is a nice integration with Eclipse, supporting content assist, refactoring and debugging, and since Scala runs on top of the JDK it can be nicely mixed with Java components.
This means you can start implementing the time-critical computations in Scala now rather than waiting for Java 8.

Let's create an example similar to the one from the JavaOne keynote and see what it looks like in Scala:
package com.example
import scala.collection.parallel.immutable.ParSeq

object StudentExample {
  // The student class
  class Student(val gradYear: Int) {
    // The score is calculated randomly
    def score: Double = {
      var sum = 0.0d
      for (i <- 1 to 100)
        sum += math.random / 100
      sum
    }
  }
To benefit at all from parallelization, I had to introduce some non-trivial calculation when getting the student score, like summing up 100 random values.

This is what an imperative implementation of the maximum score would look like:
def maxScoreImperative(students: List[Student], gradYear: Int): Double = {
    var max = 0.0d
    for (student <- students)
      if (student.gradYear == gradYear)
        max = math.max(max, student.score)
    max
  }
This is easily understood by a Java programmer, but the functional implementation may be a bit harder to grasp depending on your background:
def maxScore(students: List[Student], gradYear: Int): Double =
    students.filter(_.gradYear == gradYear).map(_.score).fold(0.0d)(math.max)
The important thing to note is that three higher-order functions filter, map and fold are applied on the student list to achieve the same result.
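Traced on a toy list (a hypothetical three-student example, with Student simplified to a case class), each stage produces an intermediate collection:

```scala
object PipelineDemo {
  case class Student(gradYear: Int, score: Double)

  def main(args: Array[String]) {
    val students = List(Student(2011, 0.4), Student(2010, 0.9), Student(2011, 0.7))

    val graduates = students.filter(_.gradYear == 2011)  // keeps the two 2011 students
    val scores    = graduates.map(_.score)               // List(0.4, 0.7)
    val best      = scores.fold(0.0d)(math.max)          // 0.7
    println(best)  // prints 0.7
  }
}
```
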
The recent ParSeq class can be used instead of List to get implicit parallelization of the list functions:
def maxScorePar(students: ParSeq[Student], gradYear: Int): Double =
    students.filter(_.gradYear == gradYear).map(_.score).fold(0.0d)(math.max)
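Incidentally (a sketch of mine, not from the keynote), Scala 2.9 also introduces the GenSeq trait, the common supertype of sequential and parallel sequences, so a single method body can serve both cases:

```scala
import scala.collection.GenSeq

object GenDemo {
  case class Student(gradYear: Int, score: Double)

  // GenSeq is the common supertype of Seq and ParSeq (new in Scala 2.9),
  // so one implementation handles both the sequential and the parallel case.
  def maxScoreGen(students: GenSeq[Student], gradYear: Int): Double =
    students.filter(_.gradYear == gradYear).map(_.score).fold(0.0d)(math.max)

  def main(args: Array[String]) {
    val students = List(Student(2011, 0.4), Student(2010, 0.9), Student(2011, 0.7))
    println(maxScoreGen(students, 2011))      // sequential: 0.7
    println(maxScoreGen(students.par, 2011))  // parallel: 0.7
  }
}
```
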
Now on to my highly unscientific test of the three different implementations of the maxScore function:
def main(args: Array[String]) {
    // Create a million students, where every other graduates 2011
    var students = List[Student]()
    for (i <- 1 to 1000000) {
      val gradYear = 2010 + (i % 2)
      students = new Student(gradYear) :: students
    }
    // This is the parallel version of the list
    val studentsPar = students.par
    // Measure the time ten times
    var measuredTimes = List[Long]()
    for (i <- 1 to 10) {
      val tick = System.currentTimeMillis()
      val max = maxScoreImperative(students, 2011)
      val tock = System.currentTimeMillis()
      measuredTimes = (tock - tick) :: measuredTimes
    }
    // Skip the first result due to JVM warmup. Since the times were
    // prepended, the first measurement is the last element of the list.
    measuredTimes = measuredTimes.init

    println("Average time: " + math.round(measuredTimes.sum.toDouble / measuredTimes.length) + "ms")
  }
}
Similarly, the maxScore and the maxScorePar implementations were measured.

Using Scala 2.9.1 the result on my dual-core MacBook Pro was:
  • Average time imperative: 2.4s
  • Average time functional: 2.8s
  • Average time functional parallel: 3.9s
Trying on a new quad-core iMac gave the following results:
  • Average time imperative: 1.2s
  • Average time functional: 1.5s
  • Average time functional parallel: 10.0s (!)
The imperative was a bit faster than the functional, which wasn't so surprising, but I was expecting much better results from the parallel list implementation. The only thing that scaled up in this example was the execution time! The Scala parallel collections were introduced only a year ago in version 2.8, so I guess there is room for improvement.

Updated: For some comments and better results see the Scaling up (enhanced) post...