Multiverse: Concurrency, STM & MVCC

How does Multiversion Concurrency Control work in Multiverse? At the moment it works like this: each object that lives in STM-space and is attached to a transaction (directly or indirectly) is dehydrated when the transaction commits and the object is dirty. All dehydrated/hydrated STM objects have a unique id (a long), and to make fast lookups on this id possible, the dehydrated objects are stored in a balanced binary tree. This immutable balanced tree is a stable view of reality: a snapshot. It can be shared between transactions since there is no mutable state, so readers don't block writers and writers don't block readers. If a transaction needs an instance of a previously committed object, a new hydrated object is created from the dehydrated object and registered with the transaction, to make sure that multiple reads of the same id give the same instance. When a transaction starts, it gets a reference to a snapshot and automatically gets transaction-level read consistency (so it can't observe changes made by other transactions, only its own).
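
As an illustration of that read path, here is a minimal sketch in Java. The names Snapshot, DehydratedObject and StmObject are invented for this example and are not the actual Multiverse API; the point is only the per-transaction hydration cache keyed by id.

import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch of a transactional read: look up the dehydrated form in the
// immutable snapshot, hydrate it, and cache it so repeated reads of the same id
// return the same instance within this transaction.
final class TransactionSketch {
    private final Snapshot snapshot; // immutable view obtained when the transaction started
    private final Map<Long, StmObject> hydrated = new HashMap<>();

    TransactionSketch(Snapshot snapshot) {
        this.snapshot = snapshot;
    }

    StmObject read(long id) {
        StmObject cached = hydrated.get(id);
        if (cached != null) {
            return cached; // same id, same instance
        }
        DehydratedObject dehydrated = snapshot.find(id); // lookup in the balanced tree
        if (dehydrated == null) {
            return null; // the object doesn't exist in this snapshot
        }
        StmObject object = dehydrated.hydrate(this);
        hydrated.put(id, object);
        return object;
    }
}

interface Snapshot { DehydratedObject find(long id); }
interface DehydratedObject { StmObject hydrate(TransactionSketch tx); }
interface StmObject { }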

Non-destructive updates to the heap

But does Multiverse need to copy the entire tree when it does a commit? No, Multiverse uses the same technique that pure functional programming languages use to update a tree: non-destructive updates. Only the updated node, and the path from the root to it, need to be replaced. So if you have a tree of 1000 items and you need to do an update, at most about 10 nodes need to be replaced (2^10 is roughly 1000, so the tree is about 10 levels deep).
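
Here is a minimal sketch of such a path-copying update on an immutable binary search tree keyed by the long id. The Node class is purely illustrative (it skips rebalancing for brevity) and is not the Multiverse heap implementation.

// Non-destructive insert/update: a new root is returned and only the nodes on the
// path to the changed entry are copied; the rest of the tree is shared with the
// previous version, which remains valid as a snapshot.
final class Node {
    final long id;
    final Object value;
    final Node left;
    final Node right;

    Node(long id, Object value, Node left, Node right) {
        this.id = id;
        this.value = value;
        this.left = left;
        this.right = right;
    }

    static Node insert(Node node, long id, Object value) {
        if (node == null) {
            return new Node(id, value, null, null); // new leaf
        }
        if (id < node.id) {
            return new Node(node.id, node.value, insert(node.left, id, value), node.right);
        }
        if (id > node.id) {
            return new Node(node.id, node.value, node.left, insert(node.right, id, value));
        }
        return new Node(id, value, node.left, node.right); // replace the value, keep both subtrees
    }
}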

Concurrent transactions

Transactions can be committed concurrently as long as there are no write/write conflicts. Read/write conflicts are not detected, so, just like Oracle, Multiverse is not completely serializable (for now).

Concurrent commits

And what about concurrency? Does Multiverse rely on locking to prevent concurrent commits? No: Multiverse is almost lock-free (in the following weeks the last locks will be removed). When a transaction commits, a new snapshot is built based on the changes and the active snapshot, and the new snapshot is then installed as the active one. This is done within a CAS (compare-and-swap) loop, because another transaction may have finished committing (and replaced the active snapshot) earlier. The advantage of this approach is that all changes are isolated and atomic: they appear at once, or not at all. A rollback is nothing more than not creating a new snapshot.
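
A rough sketch of such a commit loop, using an AtomicReference for the CAS; Snapshot.createNew and the way conflicts are reported are assumptions made for this example, not the actual Multiverse code.

import java.util.concurrent.atomic.AtomicReference;

// Sketch of a lock-free commit: build a new snapshot from the active one plus this
// transaction's changes, then try to install it with a compare-and-swap.
final class SnapshotHolder {
    interface Changes { }
    interface Snapshot {
        Snapshot createNew(Changes changes); // assumed to return null on a write/write conflict
    }

    private final AtomicReference<Snapshot> active;

    SnapshotHolder(Snapshot initial) {
        this.active = new AtomicReference<>(initial);
    }

    boolean commit(Changes changes) {
        while (true) {
            Snapshot current = active.get();
            Snapshot next = current.createNew(changes);
            if (next == null) {
                return false; // write/write conflict: the transaction has to be retried
            }
            if (active.compareAndSet(current, next)) {
                return true; // all changes become visible at once
            }
            // another transaction replaced the active snapshot first: loop and rebuild
        }
    }
}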

Livelocking

What about livelocking, does Multiverse suffer from livelocking? Yes, lock-free algorithms can suffer from livelocking: if a lot of transactions commit at the same time, it could happen that many of them keep running the CAS loop and do a lot of unnecessary work. Especially longer transactions could starve. The holy grail for Multiverse would be a wait-free commit. Another level where livelocking can happen is when there is a write conflict and a transaction needs to be retried. By setting a maximum number of retries (in the future other policies will be added) you can control the amount of livelocking.
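
As a sketch of how a maximum number of retries bounds that second form of livelocking (the names WriteConflictException and TooManyRetriesException are made up for this example):

import java.util.concurrent.Callable;

// Retry the transactional task at most maxRetries times after the initial attempt;
// giving up after a bounded number of attempts keeps livelocking under control.
final class RetryPolicySketch {
    static <T> T execute(int maxRetries, Callable<T> task) throws Exception {
        for (int attempt = 0; attempt <= maxRetries; attempt++) {
            try {
                return task.call();
            } catch (WriteConflictException conflict) {
                // write conflict: fall through and try again
            }
        }
        throw new TooManyRetriesException("giving up after " + maxRetries + " retries");
    }

    static final class WriteConflictException extends RuntimeException { }
    static final class TooManyRetriesException extends RuntimeException {
        TooManyRetriesException(String message) { super(message); }
    }
}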

Deadlocking

At the moment Multiverse doesn't support blocking locks and is (almost) lock-free, so a deadlock can't happen. When blocking locks are added, deadlock detection will be added as well.

retry

What about notification? At the moment Multiverse supports the retry. The implementation is quite simple: when a transaction aborts with a retry, a RetryError is thrown. The TransactionTemplate (just like a Spring HibernateTemplate) catches this RetryError and retrieves all the ids that have been read by the transaction. After this is done, a Latch is registered on all addresses that have been read, and the template waits until this latch is opened. Once the latch is opened, a new transaction begins. Some latch implementations also support timed waiting, which makes it possible to prevent waiting indefinitely.
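
A simplified sketch of that template loop, with a CountDownLatch standing in for the Latch; the Stm and Transaction interfaces below are modelled after the description in this post and are not the real Multiverse API.

import java.util.Set;
import java.util.concurrent.CountDownLatch;

// Sketch of a TransactionTemplate-style retry loop: catch the RetryError, register a
// latch on every id the transaction has read, block until a commit opens the latch,
// and then start a new transaction.
final class TransactionTemplateSketch {
    interface Stm {
        Transaction startTransaction();
        void registerListener(Set<Long> readIds, CountDownLatch latch);
    }
    interface Transaction {
        void commit();
        Set<Long> getReadIds();
    }
    interface TransactionalCallable<T> { T call(Transaction tx); }
    static final class RetryError extends Error { }

    private final Stm stm;

    TransactionTemplateSketch(Stm stm) {
        this.stm = stm;
    }

    <T> T execute(TransactionalCallable<T> callable) throws InterruptedException {
        while (true) {
            Transaction tx = stm.startTransaction();
            try {
                T result = callable.call(tx);
                tx.commit();
                return result;
            } catch (RetryError retry) {
                CountDownLatch latch = new CountDownLatch(1);
                stm.registerListener(tx.getReadIds(), latch); // wake up when one of these ids is written
                latch.await(); // a timed await would prevent waiting indefinitely
            }
        }
    }
}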

orElse

What about the orElse? The orElse is a little bit more difficult. The TransactionTemplate could catch the RetryError in a try/catch and continue in the catch block. The problem is that changes made in the try block are not rolled back (Multiverse can't roll back changes inside objects). Although this would be a very well-performing solution (the RetryError is reused because creating a stack trace is very expensive, and error handling is just jumping to an instruction), it introduces some strange semantics. So a different solution needs to be found here. For the nerds: in the past I have written some Prolog compilers based on the WAM, and this is just a choicepoint.

Optimistic locking

What about optimistic locking? Well, this implementation is optimistic by nature: write conflicts are only detected at the end, and the CAS loop in essence is an optimistic lock.

Pessimistic locking

What about pessimistic locking? Just like MVCC databases, pessimistic locking is going to be added. And just like in databases, locks are automatically released when the transaction commits. I'm also thinking about a read/write lock (with an upgrade that detects whether another transaction received the write lock this transaction was waiting for… so something that works unlike the ReadWriteLock from java.util.concurrent). For example:

Transaction t = stm.startTransaction();
Person person = (Person) t.read(handle);
PessimisticLock personLock = t.getLock(person);
personLock.acquireExclusiveNoWait();
// ... do stuff on person

Or the shorter approach:

Transaction t = stm.startTransaction();
Person person = (Person) t.readAndLockNoWait(handle, LockMode.exclusive);
// ... do stuff on person

And unlike Oracle, where an update, insert or delete automatically leads to a pessimistic lock in the database, Multiverse doesn't automatically lock objects yet. Automatic locking requires listening to every write on an STM object field, and that reduces performance. But perhaps in the future it will be optional (the instrumentation of POJOs needs to change for that).

9 Responses to Multiverse: Concurrency, STM & MVCC

  1. Marcus says:

    Why RetryError instead of RetryException?

    “An Error is a subclass of Throwable that indicates serious problems that a reasonable application should not try to catch. Most such errors are abnormal conditions. The ThreadDeath error, though a “normal” condition, is also a subclass of Error because most applications should not try to catch it.”

  2. pveentjer says:

    Hi Marcus

    It is an error because a user should not try to catch it. It is a mechanism for control flow. If it were an exception, an exception handler could catch it, even though it should only be caught by the transaction template. So that is why I decided to make it an error.

  3. nairalist says:

    Was this your former implementation before discovering TL2?

    In some ways it's better than TL2: if you need to back up a consistent snapshot of a huge array of transactional refs that are constantly being updated, the above system will work but TL2 will stall. In TL2, readers don't block writers, but writers can block readers by causing them to abort over and over again…

  4. Seun Osewa says:

    Was this your former implementation before discovering TL2? TL2 doesn't appear to provide a consistent snapshot; it only detects when snapshot consistency has been violated and aborts.

    • pveentjer says:

      Hi Seun,

      If you combine TL2 with some history (so tracking previously committed data), you can have a 'permanent' snapshot. This is the approach used in Clojure, but it consumes quite some memory.

      Oracle uses the Undo/Redo logs (I always forget which one) to reconstruct previously committed data. As long as the data is still in this log, a valid snapshot can be reconstructed.

      With my new design, providing a durable snapshot is going to be a lot harder since I don't have a clock anymore to reconstruct a valid snapshot. Dropping the clock was needed to make the STM more scalable, but it also has some drawbacks. Another drawback of the new approach is that I can't make use of cheap read-only transactions… still thinking about that.

      • Seun Osewa says:

        TL2 read-only transactions are awesome. After some moderate optimization, my toy STM now does 10 million read-only transactions per second but only about 1 million read-write transactions per second. Losing that would be sad indeed.

        Keeping N items of history for each Ref sounds like it would work, but it also seems wasteful of RAM. If there was a way to make old versions of Refs eligible for garbage collection the moment they are no longer needed, that would have been awesome. But it won’t be TL2 anymore. It won’t have the same performance characteristics.

        Where snapshots are required, I guess persistent collections used with TL2 can do the job. Let's say I have a list of constantly updated integers that need to be backed up to disk every minute. If each integer is stored in a TL2 ref, it won't work out: the backup transaction will never make progress.

        But if we store the integers in an array of persistent immutable Vectors, then the backup process can just be a read-only transaction that reads the current version of the vector and writes it to disk at leisure.

        With this technique, even S2PL with encounter-order locking will work without any significant blocking if we commit immediately after reading the Vector so the read lock can be dropped before we start writing to the relatively slow disk.

        I guess there’s a strong case for providing snapshot functionality outside the STM Engine, so we only have to pay the price when we really need the feature.

  5. pveentjer says:

    Have you implemented the TL2 GV4 clock optimization, where concurrent ticks on the clock don't need to lead to multiple increments?

    But since you are still in the 1M range, I guess you still have a lot of other clutter around. I use JProfiler often to see what is happening.

    • Seun Osewa says:

      Yes. The way it's implemented is very similar to the Java snippet you showed me. The global counter is not a serious bottleneck yet at current speeds. At least, not on my humble dual-core system. I'm impressed that you bought a server just to develop.

      The update performance is currently dominated by the performance of ThreadLocals and the update/iteration performance of the objects that track the read set and write set (HashMap, TreeMap).

      I’d like to find an application that needs to do more than 1 million update transactions a second before making things more complex. A web cache certainly doesn’t need to handle more than a few thousand transactions a second.

  6. pveentjer says:

    You will certainly hit that wall eventually. I guess that on your system you can do 20M update transactions/second on a dual-core machine with the TL2 GV4 optimization. The problem is that it doesn't matter how many cores you throw at it after that (performance is likely to degrade).

    And I have special transactions for different sizes of changes. The system uses speculative configuration; it starts small and upgrades to more expensive ones if needed. Once learned, the next transactions for that 'section' will be bigger.

    And you are currently testing with a small number of refs, but if the number of refs increases, the performance goes down. That is one of the reasons I focus on removing overhead as much as possible, so that it keeps performing for a longer period. I'm also working on dealing with longer transactions at a different level (e.g. cooperation… but only some thoughts in the back of my mind… nothing concrete).
