I’m currently working on writing a new STM engine for the Multiverse project and this time the focus is going to be on scalability and performance. The goal is that for uncontended data and 1 ref per transaction, the update performance should be a least 10 million per second and readonly transactions should be 50/75 million per second (all per core and it should scale linearly). So I’m going to write some posts about what I discovered while improving performance and this is the first one.
The first big thing is that profilers become useless (even if you use stack sampling) because the call stack often changes so frequently, that no information can be derived from it (even if you use a sampling frequency of 1 ms which apparently is the lowest you can get).
What works best for me is to throw away everything you have and start from scratch with an (incomplete) baseline. This helps to create some kind of ‘wow.. I wish I could get this performance’ and by adding more logic, you can see how expensive something is. In a lot of cases you will be flabbergasted how expensive certain constructs are; e.g. a volatile read or a polymorphic method call, or that the JIT is not able to optimize something you would expect it to optimize.
This is a very time consuming process (especially since it also depends on the platform or the jdk you are using). But it will help to gain a deeper insight and help you to write better performing code.