Java Extreme Performance: Part 2 – Object pooling

June 28, 2010

This is the second in the series about extreme Java performance; see Part 1 for more information.

In most cases creating objects is not a performance issue, and object pooling today is considered an anti-pattern. But if you are creating millions of ‘difficult’ objects per second (ones that escape the stack frame and potentially the thread), it can kill your performance and scalability.

In Multiverse I have taken extreme care to prevent any form of unwanted object creation. In the current version a lot of objects are reused, e.g. transactions. But that design still depends on one object per update per transactional object (the object that contains the transaction-local state of an object; I call this the tranlocal). So if you are doing 10 million update transactions on a transactional object, at least 10 million tranlocal objects are created.

In the newest design (which will be part of Multiverse 0.7), I have overcome this limitation by pooling tranlocals once I know that no other transaction is using them anymore. The pool is stored in a threadlocal, so it doesn’t need to be threadsafe. To see the effect of pooling, look at the diagrams underneath:

without pooling
with pooling

As you can see, with pooling enabled it scales linearly for uncontended data (although there sometimes is a strange dip), but with pooling disabled it doesn’t scale linearly.
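The thread-local pooling described above can be sketched roughly as follows. This is a hypothetical illustration, not Multiverse’s actual code: the `Tranlocal` and `TranlocalPool` names and their API are made up for this example. The key point is that because each thread has its own pool, `take` and `release` need no synchronization; an object may only be released once no other transaction can still be reading it.

```java
import java.util.ArrayDeque;

// Illustrative tranlocal: holds the transaction-local state of an object.
final class Tranlocal {
    long value;

    void reset() {
        value = 0;
    }
}

final class TranlocalPool {
    // One pool per thread, so no thread-safety is needed.
    private static final ThreadLocal<ArrayDeque<Tranlocal>> POOL =
            ThreadLocal.withInitial(ArrayDeque::new);

    static Tranlocal take() {
        Tranlocal t = POOL.get().poll();
        // Fall back to allocation when the pool is empty.
        return (t != null) ? t : new Tranlocal();
    }

    static void release(Tranlocal t) {
        // Only safe once no other transaction is using t anymore.
        t.reset();
        POOL.get().push(t);
    }
}

public class PoolDemo {
    public static void main(String[] args) {
        Tranlocal a = TranlocalPool.take();
        a.value = 42;
        TranlocalPool.release(a);

        Tranlocal b = TranlocalPool.take();
        // The same instance is reused on this thread.
        System.out.println(a == b); // prints true
    }
}
```

With this scheme an update-heavy workload reuses a handful of tranlocals per thread instead of allocating one per update.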

Some information about the testing environment:
a dual-CPU Xeon 5500 system (so 8 real cores) with 12 GB of RAM, running 64-bit Linux and the 64-bit Sun JDK 1.6.0_19, with the following command-line settings:
java -XX:+UseParallelGC -XX:+UseParallelOldGC -XX:+UnlockExperimentalVMOptions -XX:+UseCompressedOops -server -da

[edit]
The new diagram without smoothed lines. Unfortunately I was not able to generate a histogram in a few minutes’ time.
combined


Java Extreme Performance: Part 1 – Baseline

June 24, 2010

I’m currently working on writing a new STM engine for the Multiverse project, and this time the focus is going to be on scalability and performance. The goal is that for uncontended data and 1 ref per transaction, update performance should be at least 10 million transactions per second and readonly transactions should reach 50–75 million per second (all per core, and it should scale linearly). So I’m going to write some posts about what I discovered while improving performance, and this is the first one.

The first big thing is that profilers become useless (even if you use stack sampling), because the call stack often changes so frequently that no information can be derived from it (even with a sampling frequency of 1 ms, which apparently is the lowest you can get).

What works best for me is to throw away everything you have and start from scratch with an (incomplete) baseline. This helps to create a ‘wow… I wish I could get this performance’ reference point, and by adding more logic you can see how expensive each addition is. In a lot of cases you will be flabbergasted at how expensive certain constructs are, e.g. a volatile read or a polymorphic method call, or at the JIT failing to optimize something you would expect it to optimize.
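A crude sketch of this baseline approach, comparing a plain field read against a volatile read, might look like the following. This is an assumption-laden illustration, not a rigorous benchmark: a hand-rolled loop like this is vulnerable to warmup and dead-code-elimination artifacts, so a real measurement would use a harness such as JMH.

```java
// Crude baseline sketch: compare the cost of plain vs volatile field reads.
// Illustrative only; use a proper harness (e.g. JMH) for real measurements.
public class Baseline {
    static volatile long volatileField = 1;
    static long plainField = 1;

    static long loopPlain(int n) {
        long sum = 0;
        for (int i = 0; i < n; i++) {
            sum += plainField;
        }
        return sum;
    }

    static long loopVolatile(int n) {
        long sum = 0;
        for (int i = 0; i < n; i++) {
            sum += volatileField; // each iteration must re-read the field
        }
        return sum;
    }

    public static void main(String[] args) {
        int n = 100_000_000;
        // Warm up so the JIT compiles both loops before timing.
        loopPlain(n);
        loopVolatile(n);

        long t0 = System.nanoTime();
        long r1 = loopPlain(n);
        long t1 = System.nanoTime();
        long r2 = loopVolatile(n);
        long t2 = System.nanoTime();

        if (r1 != r2) {
            throw new AssertionError("loops should sum the same value");
        }
        System.out.printf("plain: %d ms, volatile: %d ms%n",
                (t1 - t0) / 1_000_000, (t2 - t1) / 1_000_000);
    }
}
```

Starting from a skeleton like this and then adding the real logic one piece at a time makes the cost of each construct visible against the baseline.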

This is a very time-consuming process (especially since it also depends on the platform and the JDK you are using), but it will help you gain deeper insight and write better-performing code.