Java Extreme Performance: Part 2 – Object pooling

June 28, 2010

This is the second in the series about extreme Java performance; see Part 1 for more information.

In most cases creating objects is not a performance issue, and object pooling is nowadays considered an anti-pattern. But if you are creating millions of ‘difficult’ objects per second (ones that escape the stack frame and potentially the thread), it can kill your performance and scalability.

In Multiverse I have taken extreme care to prevent any form of unwanted object creation. In the current version a lot of objects are reused, e.g. transactions. But that design still depends on 1 object per update per transactional object (the object that contains the transaction-local state of an object; I call this the tranlocal). So if you are doing 10 million update transactions on a transactional object, at least 10 million tranlocal objects are created.

In the newest design (which will be part of Multiverse 0.7), I have overcome this limitation by pooling tranlocals once I know that no other transaction is using them anymore. The pool is stored in a threadlocal, so it doesn’t need to be threadsafe; a simplified sketch follows below the diagrams. To see the effect of pooling, see the diagrams underneath (image is clickable for a less sucky version):

without pooling
with pooling

As you can see, with pooling enabled it scales linearly for uncontended data (although there sometimes is a strange dip), but if pooling is disabled it doesn’t scale linearly.
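
Below is a minimal sketch of what such a thread-local pool could look like. The Tranlocal class, the pool size and the take/put method names are simplifications for illustration, not the actual Multiverse 0.7 code.

    // Simplified illustration of a thread-local tranlocal pool. The Tranlocal
    // class, pool size and method names are assumptions for this sketch, not
    // the real Multiverse implementation.
    class Tranlocal {
        // holds the transaction-local state of a transactional object
    }

    public final class TranlocalPool {

        private static final int MAX_POOL_SIZE = 100;

        // One pool per thread, so no synchronization is needed.
        private static final ThreadLocal<TranlocalPool> POOL =
                new ThreadLocal<TranlocalPool>() {
                    @Override
                    protected TranlocalPool initialValue() {
                        return new TranlocalPool();
                    }
                };

        private final Tranlocal[] items = new Tranlocal[MAX_POOL_SIZE];
        private int size = 0;

        public static TranlocalPool instance() {
            return POOL.get();
        }

        /** Returns a pooled tranlocal, or null so the caller falls back to 'new'. */
        public Tranlocal take() {
            if (size == 0) {
                return null;
            }
            Tranlocal tranlocal = items[--size];
            items[size] = null;
            return tranlocal;
        }

        /** Hands a tranlocal back once no other transaction can reference it. */
        public void put(Tranlocal tranlocal) {
            if (size < MAX_POOL_SIZE) {
                items[size++] = tranlocal;
            }
        }
    }

A transaction would call take() when it needs a tranlocal and only fall back to new on a pool miss, and call put() after the commit once it knows no other transaction still holds a reference.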

Some information about the testing environment:
Dual CPU Xeon 5500 system (so 8 real cores) with 12 GB of RAM, running 64-bit Linux with the 64-bit Sun JDK 1.6.0_19 and the following command-line settings:

    java -XX:+UseParallelGC -XX:+UseParallelOldGC -XX:+UnlockExperimentalVMOptions -XX:+UseCompressedOops -server -da

[edit]
The new diagram without smooth lines. Unfortunately I was not able to generate a histogram in a few minutes' time (image is clickable).
combined


STM: Encryption and Security

June 28, 2010

Software Transactional Memory makes certain problems easier to solve since there is an additional layer between the data itself and the code using the data. The standard problems STMs solve are coordinating concurrency and providing consistency and failure atomicity. But adding more ‘enterprisy’ logic isn’t that hard.

One of the things that would be quite easy to add is security; in some cases you want to control whether someone is allowed to access a certain piece of information (either for reading or for writing). Most STMs already provide readonly vs update transactions, and once executing in readonly mode a user is not allowed to modify information (although one should be careful that a readonly transaction is not ignored because there is an enclosing update transaction). This can easily be extended to provide some form of security where certain information can only be accessed if certain conditions are met.
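
As a purely hypothetical illustration (this interface is not part of any existing STM API, including Multiverse), such a check could take the form of a hook the STM consults when a transaction opens a transactional object:

    // Hypothetical access-control hook, consulted by the STM when a transaction
    // opens a transactional object; the names are illustrative assumptions.
    public interface AccessPolicy {

        /** Called before an object is opened for reading. */
        void checkRead(Object transactionalObject) throws SecurityException;

        /** Called before an object is opened for writing; a readonly
            transaction never reaches this point. */
        void checkWrite(Object transactionalObject) throws SecurityException;
    }

A readonly transaction would only ever trigger checkRead, so such a hook maps naturally onto the readonly vs update distinction the STM already makes.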

Another related piece of functionality is encryption; in some cases you don’t want the data to sit unprotected in memory (someone could access it with a debugger, for example), or to be placed unprotected on durable storage, or to be distributed in the clear. Adding some form of encryption when committing and decryption when loading would also not be that hard to realize, since the transaction already knows exactly what data is used. Personally I don’t like working with encryption since it very often is completely in your face and gets intertwined with the design, but some declarative annotations would remove a lot of pain.
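
In the same hypothetical spirit, an encryption hook could sit between the transaction and whatever serializes its state to durable storage or the network. The class below is an illustrative assumption, not part of Multiverse; it only shows where the encrypt/decrypt calls would live.

    import javax.crypto.Cipher;
    import javax.crypto.SecretKey;

    // Hypothetical encrypt-on-commit / decrypt-on-load codec; not part of
    // Multiverse. Real code would also pick an explicit cipher mode and IV.
    public final class AesStateCodec {

        private final SecretKey key;

        public AesStateCodec(SecretKey key) {
            this.key = key;
        }

        /** Called by the commit path before the state leaves the transaction. */
        public byte[] encrypt(byte[] serializedState) throws Exception {
            Cipher cipher = Cipher.getInstance("AES");
            cipher.init(Cipher.ENCRYPT_MODE, key);
            return cipher.doFinal(serializedState);
        }

        /** Called by the load path before the state enters the transaction. */
        public byte[] decrypt(byte[] storedState) throws Exception {
            Cipher cipher = Cipher.getInstance("AES");
            cipher.init(Cipher.DECRYPT_MODE, key);
            return cipher.doFinal(storedState);
        }
    }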


STM: Object vs Field Granularity

June 28, 2010

STMs for object-oriented languages like Java, C#, Scala, etc. can either be:

  1. field granular: meaning that each field of an object is managed individually
  2. object granular: meaning that all fields of an object are managed as a single unit

Both approaches have their pros and cons. The advantage of field granularity is that independent fields of a transactional object won’t cause conflicts. If you have a double linked list for example, you want the head and the tail of the list to be accessible independently, no matter what happens on the other side. This reduces the number of conflicts you get, and so improves concurrency.

But having field granularity is not without its problems:

  1. more expensive transactions: each field needs to be tracked individually
  2. more expensive loading: each field needs to be loaded individually
  3. more expensive conflict detection: each field needs to be checked individually
  4. more expensive commit: each field needs to be locked/updated individually

This means that transactions require more CPU and memory. Another problem is that a field granular transaction takes longer, which means it can suffer from more conflicts (reducing concurrency) and cause more conflicts (since it acquires locks for longer). And last but not least, a field granular transaction puts a lot more pressure on your memory bus since more ‘synchronization’ actions are needed.

In Multiverse (when instrumentation is used), object granularity is the default. And to get field granularity, just place the @FieldGranular annotation on top of the field, as in the sketch below.
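
For example, the double linked list from above could look roughly like this; the @FieldGranular annotation is the one mentioned above, but the @TransactionalObject annotation and the package names are written from memory and may differ slightly from the actual Multiverse API.

    import org.multiverse.annotations.FieldGranular;
    import org.multiverse.annotations.TransactionalObject;

    // Sketch only: annotation/package details may not match the exact
    // Multiverse API.
    @TransactionalObject
    public class DoubleLinkedList<E> {

        // With @FieldGranular a transaction that only touches 'head' does not
        // conflict with a transaction that only touches 'tail'.
        @FieldGranular
        private Node<E> head;

        @FieldGranular
        private Node<E> tail;

        @TransactionalObject
        public static class Node<T> {
            private T value;
            private Node<T> next;
            private Node<T> prev;
        }
    }

Without the field annotations the whole object is managed as a single unit, so an update of the head and an update of the tail would conflict with each other.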


Java Extreme Performance: Part I – Baseline

June 24, 2010

I’m currently working on writing a new STM engine for the Multiverse project, and this time the focus is on scalability and performance. The goal is that for uncontended data and 1 ref per transaction, update performance should be at least 10 million transactions per second and readonly transactions should reach 50 to 75 million per second (all per core, and it should scale linearly). So I’m going to write some posts about what I discover while improving performance, and this is the first one.

The first big thing is that profilers become useless (even if you use stack sampling) because the call stack changes so frequently that no information can be derived from it (even with a sampling frequency of 1 ms, which apparently is the lowest you can get).

What works best for me is to throw away everything you have and start from scratch with an (incomplete) baseline. This creates a kind of ‘wow… I wish I could get this performance’ reference point, and by adding more logic you can see how expensive something is. In a lot of cases you will be flabbergasted at how expensive certain constructs are; e.g. a volatile read or a polymorphic method call, or the JIT not being able to optimize something you would expect it to optimize.
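
As an illustration of such a baseline, here is a deliberately naive, hand-rolled loop (no benchmark framework); the numbers it prints are only meaningful relative to the next, more complete version of the same loop, and warm-up and JIT loop optimizations make any single measurement suspect.

    // Naive baseline: how many iterations per second does a bare loop reach
    // before any real (STM) logic is added? Swap the plain increment for a
    // volatile one or a polymorphic call to see what such a construct costs.
    public final class Baseline {

        private long plainField;
        private volatile long volatileField;

        public static void main(String[] args) {
            Baseline b = new Baseline();
            long iterations = 100 * 1000 * 1000;

            long startNs = System.nanoTime();
            for (long i = 0; i < iterations; i++) {
                b.plainField++;
                // b.volatileField++;   // swap in to see the cost of a volatile write
            }
            long durationNs = System.nanoTime() - startNs;

            double perSecond = (iterations * 1e9) / durationNs;
            System.out.println(perSecond + " increments/second");
            // Print the field so the JIT can't remove the loop entirely.
            System.out.println("ignore: " + b.plainField);
        }
    }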

This is a very time consuming process (especially since it also depends on the platform and the JDK you are using). But it will help you gain a deeper insight and write better performing code.