This is the performance of a dot product of two vectors of 10 million doubles each using Data Parallel Haskell. Both machines have 8 cores. Each core of the T2 has 8 hardware thread contexts. The performance is memory bound. The DPH code is
dotp' ::...

This is the performance of a dot product of two vectors of 10 million doubles each using Data Parallel Haskell. Both machines have 8 cores. Each core of the T2 has 8 hardware thread contexts. The performance is memory bound. The DPH code is

dotp' :: [:Double:] -> [:Double:] -> Double
dotp' v w = D.sumP (zipWithP (*) v w)

Interesting is that despite the much lower single-thread performance, the T2 makes it all up in excellent multi-threading performance. Interesting to Haskell folks is that the corresponding multi-threaded C code using pthreads is much harder to write and barely any faster when using all parallelism (the Sun C-Compiler still manages to reduce the runtime by about 30% with 64 threads, but on the Xeons, there is no significant difference).

NB: This uses the latest development version of GHC and DPH.  Thanks to Ben Lippmeier for his nice work on the SPARC backend of GHC.