A few weeks ago I received the last components of my dual Xeon 5500 configuration for Multiverse, and I have been working on a framework for doing microbenchmarks. This framework has produced some nice diagrams that provide a lot of insight.
One of the things I wanted to see is how well the system scales with and without contention. The graph below shows the results of a benchmark at various levels of contention for updating a counter stored in the STM:
- The slowest one has contended shared state: all threads share the same counter.
- The blue one is a bit better because each thread gets its own counter.
- The green one is the same as the blue one, but instead of using instrumentation, the classes are 'instrumented' manually.
- In the last graph, each counter is placed in its own STM.
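Stripped of the STM machinery, the shared versus per-thread counter setups reduce to a simple contention pattern. The sketch below is not the Multiverse benchmark itself, just a minimal illustration of that pattern: with `shared = true` all threads CAS on one `AtomicLong`, with `shared = false` each thread owns its counter and the CAS never fails.

```java
import java.util.concurrent.atomic.AtomicLong;

// Minimal sketch (not the actual Multiverse benchmark): shared vs.
// per-thread counter increments, the contention pattern being measured.
public class CounterContention {

    static final int INCREMENTS = 100_000;

    // Runs `threads` workers; if `shared` they all hammer one AtomicLong,
    // otherwise each thread increments its own. Returns the combined count.
    static long run(int threads, boolean shared) throws InterruptedException {
        AtomicLong sharedCounter = new AtomicLong();
        AtomicLong total = new AtomicLong();
        Thread[] workers = new Thread[threads];
        for (int i = 0; i < threads; i++) {
            workers[i] = new Thread(() -> {
                AtomicLong counter = shared ? sharedCounter : new AtomicLong();
                for (int k = 0; k < INCREMENTS; k++) {
                    counter.incrementAndGet(); // CAS; retries under contention
                }
                if (!shared) {
                    total.addAndGet(counter.get());
                }
            });
            workers[i].start();
        }
        for (Thread t : workers) {
            t.join();
        }
        return shared ? sharedCounter.get() : total.get();
    }
}
```

Timing `run` for increasing thread counts in both modes reproduces the shape of the first two lines in the graph: the shared variant flattens out as the cache line bounces between cores, the per-thread variant keeps scaling.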
The big questions are:
- Why is the performance difference between manual and real instrumentation so big? It can partially be explained by the manual version not having to access expensive fields like volatiles (for the global STM) and threadlocals (for the current transaction).
- Why does the system scale so badly when transactions don't share state, only the STM? The only point of contention is the shared logical clock (an AtomicLong), so I would expect more linear growth.
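That second point deserves a picture. A shared logical clock, as I understand the design, means every commit bumps one global `AtomicLong`, no matter which counter the transaction touched. A minimal sketch (class and method names are mine, not Multiverse's):

```java
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical sketch of a global version clock: one AtomicLong that
// every commit increments, regardless of what data the transaction wrote.
public class LogicalClock {

    private final AtomicLong clock = new AtomicLong();

    // Called on every commit. The incrementAndGet is a CAS loop, so the
    // cache line holding `clock` ping-pongs between cores (and sockets
    // on a dual Xeon), even for transactions on completely disjoint data.
    long tick() {
        return clock.incrementAndGet();
    }

    // Called when a transaction starts, to snapshot the read version.
    long current() {
        return clock.get();
    }
}
```

So even the "uncontended" benchmark serializes on this one word of memory once per commit, which would explain sub-linear growth; whether it explains all of it is exactly the open question.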
So a lot of hard questions to answer.
The benchmark framework is going to be integrated with the continuous integration server, so we get continuous feedback on performance. One of the features that needs to be added is trend analysis, but since the data is already there, this should not be a difficult requirement to implement.
It appears that statistics were still activated on the non-manually instrumented version. Statistics cause a lot of CAS updates.
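To make that concrete: if the statistics live in shared atomic counters, every transaction pays a CAS on them even when the transactional data itself is uncontended. A hedged sketch (names are mine), contrasting a single shared `AtomicLong` with a striped `LongAdder`, which is one standard way to take that cost off the hot path:

```java
import java.util.concurrent.atomic.AtomicLong;
import java.util.concurrent.atomic.LongAdder;

// Hypothetical sketch of per-transaction statistics, not Multiverse code.
public class StmStatistics {

    // A single shared AtomicLong: one contended CAS per transaction,
    // even when the transactions touch completely disjoint data.
    private final AtomicLong startedCas = new AtomicLong();

    // LongAdder stripes the count over multiple cells, so threads
    // rarely collide on a write; reading sums the cells.
    private final LongAdder startedStriped = new LongAdder();

    void onTransactionStartCas()     { startedCas.incrementAndGet(); }
    void onTransactionStartStriped() { startedStriped.increment(); }

    long startedCasCount()     { return startedCas.get(); }
    long startedStripedCount() { return startedStriped.sum(); }
}
```

Either way, making statistics switchable (and off by default in benchmarks) is the cheaper fix, which is what the corrected run below reflects.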
As you can see, the difference between the manual and the automatic versions has decreased. So that partially answers the first question.