Java Extreme Performance: Part I – Baseline

June 24, 2010

I’m currently working on writing a new STM engine for the Multiverse project and this time the focus is going to be on scalability and performance. The goal is that for uncontended data and 1 ref per transaction, the update performance should be at least 10 million per second and readonly transactions should reach 50 to 75 million per second (all per core, and it should scale linearly). So I’m going to write some posts about what I discovered while improving performance, and this is the first one.

The first big thing is that profilers become useless, even with stack sampling: the call stack changes so frequently that no information can be derived from it, even at a sampling interval of 1 ms (which apparently is the lowest you can get).

What works best for me is to throw away everything you have and start from scratch with an (incomplete) baseline. This helps to create some kind of ‘wow.. I wish I could get this performance’ and by adding more logic, you can see how expensive something is. In a lot of cases you will be flabbergasted how expensive certain constructs are; e.g. a volatile read or a polymorphic method call, or that the JIT is not able to optimize something you would expect it to optimize.
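A minimal sketch of what such a baseline could look like, assuming Java 8 or later: the first loop is the stripped-down baseline, the second adds exactly one construct (a volatile read per iteration), and the difference between the two timings hints at what that construct costs. The class and method names are hypothetical and are not part of Multiverse, and the timings vary per platform and JDK.

```java
import java.util.function.IntToLongFunction;

// Hypothetical baseline harness; all names here are illustrative only.
final class BaselineSketch {
    static long plainField = 1;
    static volatile long volatileField = 1;

    // Baseline: read a plain field in a tight loop.
    static long sumPlain(int iterations) {
        long sum = 0;
        for (int i = 0; i < iterations; i++) {
            sum += plainField;
        }
        return sum;
    }

    // Same loop with one added construct: a volatile read per iteration.
    static long sumVolatile(int iterations) {
        long sum = 0;
        for (int i = 0; i < iterations; i++) {
            sum += volatileField;
        }
        return sum;
    }

    // Times a single run and returns the average nanoseconds per iteration.
    static double nanosPerOp(IntToLongFunction body, int iterations) {
        long start = System.nanoTime();
        long result = body.applyAsLong(iterations);
        long elapsed = System.nanoTime() - start;
        if (result != iterations) {
            throw new IllegalStateException("unexpected sum: " + result);
        }
        return (double) elapsed / iterations;
    }

    public static void main(String[] args) {
        int iterations = 100_000_000;
        // Warm up first so the JIT has a chance to compile both loops.
        for (int i = 0; i < 5; i++) {
            sumPlain(iterations);
            sumVolatile(iterations);
        }
        System.out.printf("plain:    %.3f ns/op%n", nanosPerOp(BaselineSketch::sumPlain, iterations));
        System.out.printf("volatile: %.3f ns/op%n", nanosPerOp(BaselineSketch::sumVolatile, iterations));
    }
}
```

The JIT may collapse the plain loop almost entirely, which is part of the lesson: the baseline shows what the machine can do when nothing stands in the way.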

This is a very time consuming process (especially since it also depends on the platform or the jdk you are using). But it will help to gain a deeper insight and help you to write better performing code.

Groovy and STM

May 18, 2010

Yesterday evening I was playing with some initial support for Groovy in Multiverse, a software transactional memory implementation for Java. And with the help of Alex Tkachman it was easy to get some initial support up and running.

If you want to have a transactional reference (similar to a ref in Clojure), you can just create a Ref (or in this case a LongRef):

class Account{
   final LongRef balance = new LongRef();

   long getBalance(){balance.get()}

   void setBalance(long newBalance){
      if(newBalance < 0){
         throw new InsufficientFundsException();
      }
      balance.set(newBalance);
   }
}

And if you want to have an atomic block, you can execute the following:

    Account from = new Account(10)
    Account to = new Account(10)

    atomic(readonly:false, trackreads:true) {
      from.balance -= 5
      to.balance += 5
    }

    println "from $from.balance"
    println "to $to.balance"

The parameters in the atomic block are not needed (Multiverse is able to infer some settings and in the future more inference will be added), but I wanted to see if it was possible. Having closure support in a language increases language complexity, but imho it makes a language a lot easier to use and less painful on the eyes.

There is more clutter that can be removed, but I really need to polish my Groovy skills. Groovy support is going to be added to Multiverse 0.6 (expected in 6 to 8 weeks). I haven’t checked in the new Groovy support on the snapshot, so contact me if you want to play with it.

Multiverse: Timed blocking transactions

April 15, 2010

With traditional concurrency control you can do a timed wait for some resource (a lock for example) to become available. If the resource doesn’t become available within a certain limit, the operation fails and this needs to be handled in the code. There are a few problems with this approach:

  1. no convenient way to pass the total timeout to each blocking call
  2. you need to have several versions of blocking methods (with or without timeout, interruptible or not) and often there is no easy abstraction higher up. So you keep getting almost identical methods, that only differ in the type of blocking method they call
  3. no easy way to roll back changes. If you decide to abort an operation because of a timeout, it could be hard to restore the system in the original state (so no atomicity)
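To make these problems concrete, here is a minimal sketch in plain java.util.concurrent, without any STM. The class TimedTransferSketch and its transfer method are hypothetical names made up for this illustration. Note how the remaining timeout has to be computed and passed along by hand, and how the rollback has to be written by hand as well.

```java
import java.util.concurrent.TimeUnit;
import java.util.concurrent.locks.ReentrantLock;

// Hypothetical lock-based account; not part of Multiverse.
final class TimedTransferSketch {
    final ReentrantLock lock = new ReentrantLock();
    int balance;

    TimedTransferSketch(int balance) {
        this.balance = balance;
    }

    // Moves 'amount' from one account to the other within one overall timeout.
    static boolean transfer(TimedTransferSketch from, TimedTransferSketch to,
                            int amount, long timeout, TimeUnit unit)
            throws InterruptedException {
        long deadline = System.nanoTime() + unit.toNanos(timeout);
        if (!from.lock.tryLock(timeout, unit)) {
            return false;
        }
        try {
            from.balance -= amount;
            // Problem 1: the remaining time must be recomputed for every blocking call.
            long remaining = deadline - System.nanoTime();
            if (!to.lock.tryLock(remaining, TimeUnit.NANOSECONDS)) {
                // Problem 3: undoing the partial change is a manual job.
                from.balance += amount;
                return false;
            }
            try {
                to.balance += amount;
                return true;
            } finally {
                to.lock.unlock();
            }
        } finally {
            from.lock.unlock();
        }
    }
}
```

Problem 2 shows up as soon as an untimed or a non-interruptible variant is needed: the whole method has to be duplicated with a different lock call.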

With software transactional memory these problems can be solved. I have just added support for timed blocking methods, so you can say something like:

class BankAccount{
    private int balance;

    public int getBalance(){
        return balance;
    }

    public void setBalance(int balance){
        this.balance = balance;
    }

    @TransactionalMethod(timeout = 1, timeoutTimeUnit = TimeUnit.SECONDS)
    public static void tryRemove(BankAccount a, int amount){
        a.setBalance(a.getBalance()-amount);//is rolled back in case of a failure
    }
}


When the tryRemove is called (and no transaction is running), and the balance is not sufficient, the call blocks until a timeout occurs or the balance is increased and the money withdrawn. And if you want the call to be interruptible, you just need to add the InterruptedException:

    @TransactionalMethod(timeout = 1)
    public void tryRemoveInterruptibly(int amount) throws InterruptedException{
       balance = balance - amount;//is rolled back in case of a failure
    }


And the timeunit defaults to SECONDS, so no need to configure that.

In Multiverse you only need to specify multiple blocking versions of a method if that method is exposed to non transactional methods. Transactional methods calling transactional methods will always join the transaction of the outermost transactional call (Multiverse provides a flattened transaction model for nested transactions for now). This also means that it is very easy to create a timed blocking method on datastructures you don’t own.

Multiverse STM: Looking for a Groovy Committer

April 10, 2010

One of the mission statements of Multiverse is to create an STM that can be used inside any JVM based language. We already have Scala integration available (although it needs some care) and since yesterday we also have someone responsible for integrating Multiverse with JRuby (welcome aboard Sai Venkat).

But it would be cool if we could add Groovy integration in the 0.5 release (in a week) so that Multiverse integrates seamlessly with Groovy. So we are looking for someone who has Groovy experience, wants to experiment with STM technology and wants to work on an Open Source project.

If you are interested, send a mail to alarmnummer at gmail dot com.

Multiverse and static instrumentation

April 9, 2010

Multiverse 0.4 relies on a javaagent if you want to seamlessly integrate it with the Java language. The problem is that javaagents suck in production environments, so compiletime instrumentation is a logical next step. The biggest step was to redesign the instrumentation so that the same instrumentation can be used for the javaagent and the static compiler. This was completed a few weeks ago. The other step was to create a preinstrumented, self contained Multiverse jar using Maven so that the same jar can be used as agent, as compiler (jarjar rocks) and as library. After 1 day of struggling and some help from an ex colleague (thanks Silvester) I finally solved the last problems.

So for the 0.5 release of Multiverse, expected next week, it is possible to do static instrumentation (which can be integrated into maven) and to combine this with the Multiverse javaagent (used from your IDE). Already instrumented classes can be mixed with the javaagent and instrumented classes are protected against repeated instrumentation.

Another advantage is that projects that don’t want to rely on instrumentation, but take a more programmatic approach (Multiverse 0.5 will contain a new API for that as well), can still use the preinstrumented transactional datastructures provided by Multiverse:

  • 0.4: TransactionalLinkedList (also implements BlockingQueue and BlockingDeque)
  • 0.4: TransactionalArrayList
  • 0.5: TransactionalReferenceArray
  • 0.5: TransactionalTreeMap (also implements the ConcurrentMap interface)
  • 0.5: TransactionalTreeSet
  • 0.4: TransactionalThreadPoolExecutor (also implements ExecutorService)

In the 0.6 release more transactional datastructures will be added, and more performance optimizations on the existing collections are to be expected as well.

Multiverse and constructor optimizations

April 2, 2010

Another important optimization I have just implemented, and which is going to be added to the Multiverse 0.5 release, is that transactions that only create objects (or create and read them) are very cheap (reads already are very cheap). A construction needs only one volatile read (for the version) and one volatile write (for the content to be stored). This means that creating transactional objects is going to be very fast and won’t cause scalability problems.

I hope I’ll be able to inline the complete transaction for the constructor when it is very basic (just like the getter/setter optimization).

Multiverse and getter/setter optimization

April 2, 2010

I’m currently working on removing unnecessary method calls (there is no method as fast as a method that is not called) and I have been able to boost performance on one of the transactional collection classes a few hundred percent and there still is a lot more to gain.

But there is more low hanging fruit waiting to be picked: if a getter or setter is called on a transactional object, the transaction normally used to execute these methods can be optimized away completely and replaced by an extremely well performing solution. Getters will have the performance of one threadlocal read and one volatile read. The setters are a little bit more complicated because:

  1. they still need to acquire/release a lock (so at least one cas and one volatile write)
  2. it also needs one volatile write for storing the content
  3. and the biggest influence: increasing a shared clock: an AtomicLong.
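The steps above can be sketched roughly as follows. This is not the Multiverse implementation: FastRefSketch, its Content holder and the static CLOCK are hypothetical names, the threadlocal lookup of a running transaction is left out, and the only purpose is to show which memory operations a setter pays for compared to a getter.

```java
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical sketch of a transaction-free getter/setter; not Multiverse code.
final class FastRefSketch {
    // Shared clock: every successful write increments it.
    static final AtomicLong CLOCK = new AtomicLong();

    // Immutable value + version pair, published with a single volatile write.
    private static final class Content {
        final long value;
        final long version;

        Content(long value, long version) {
            this.value = value;
            this.version = version;
        }
    }

    private final AtomicBoolean locked = new AtomicBoolean(false);
    private volatile Content content = new Content(0, CLOCK.get());

    // Getter: one volatile read.
    long get() {
        return content.value;
    }

    long version() {
        return content.version;
    }

    // Setter: returns false if another writer currently holds the lock.
    boolean trySet(long newValue) {
        if (!locked.compareAndSet(false, true)) {        // step 1: acquire, one cas
            return false;
        }
        try {
            long newVersion = CLOCK.incrementAndGet();   // step 3: shared clock
            content = new Content(newValue, newVersion); // step 2: volatile write
            return true;
        } finally {
            locked.set(false);                           // step 1: release, volatile write
        }
    }
}
```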

I expect that writes will reach 8+ million transactions/second (125 ns/write) and reads around 80 million transactions/second (12.5 ns/read) on a single core. Another thing that needs to be tested more extensively is how well this is going to scale, because that shared counter is going to put a lot of pressure on the cpu cache. The atomic writes can be made cheaper by relaxing the increment of the clock: if concurrent threads are increasing that clock and one of them succeeds, the others are all right with that.
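A minimal sketch of the difference between a strict and a relaxed clock increment, assuming the clock is a plain AtomicLong. RelaxedClockSketch and its method names are hypothetical. The relaxed variant lets concurrent writers share a version number, so the surrounding validation logic has to be prepared for that; that part is not shown here.

```java
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical sketch of a relaxed clock increment; not Multiverse code.
final class RelaxedClockSketch {
    private final AtomicLong clock = new AtomicLong();

    // Strict: every caller gets its own tick, retrying under contention.
    long strictTick() {
        return clock.incrementAndGet();
    }

    // Relaxed: a single cas attempt. If it fails, another thread has already
    // advanced the clock, and that newer value is good enough for the caller.
    long relaxedTick() {
        long current = clock.get();
        if (clock.compareAndSet(current, current + 1)) {
            return current + 1;
        }
        return clock.get();
    }

    long now() {
        return clock.get();
    }
}
```

Under contention the relaxed variant never loops, which is where the saving comes from.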

This functionality is also going to be integrated in the transactional reference support for the Akka project (for the programmatic support the threadlocal read for the transaction can even be bypassed in the get and set).

The relaxed clock optimization is explained in the TL2 paper of David Dice and was added to Multiverse quite some time ago (it is configurable).