Lightweight Batch Processing I: Intro

If you are lucky, your application is a lot more complex than the standard request/response web application. The complexity in these applications can typically be found in the business domain or in the presentation logic. Batch processing systems, which process large volumes of data, always make me happy to be a software developer, because so much interesting stuff is going on, especially concurrency control and transaction management.

This is the first post in a series about lightweight batch processing. The goal is to share my knowledge and, hopefully, to gain new insights from your comments. There are batch frameworks (like the newest Spring module: Spring Batch), but frameworks often introduce a lot more functionality (and complexity) than required, and they can’t always be used, for a wide range of reasons (sometimes technical, sometimes political). This series is aimed at those scenarios. My approach is to start from a basic example, point out the problems that can occur (and under which conditions), and eventually refactor the example.

Let’s get started: below you can see a standard approach to processing a batch of employees.

EmployeeDao employeeDao;

@Transactional
void processAll(){
    List<Employee> batch = getBatch();
    for(Employee employee: batch)
        process(employee);
}

void process(Employee employee){
    // ...logic
}

As you can see, the code is quite simple. There is no need to integrate the scheduling logic into the processing logic; it is much better to hook up a scheduler (like Quartz, for example) from the outside, which makes the code much easier to test, maintain and extend. This example works fine for a small number of employees, as long as processing a single employee doesn’t take too much time. But when the number of employees increases, or the time to process a single item increases, this approach won’t scale well. One of the biggest problems (for now) is that the complete batch is executed under a single transaction. Although this transaction provides the ‘all or nothing’ (atomicity) behavior that is normally desired, the length of the transaction can cause several problems:

  1. lock contention (and even lock escalation, depending on the database), leading to decreased performance and eventually to completely serialized access to the database. This can be problematic if the batch process is not the only user of the database.
  2. failing transactions, caused by running out of undo space or by the database aborting the transaction because it runs too long.
  3. when the transaction fails, all the items need to be reprocessed, even the ones that didn’t cause a problem. If the batch takes a long time to run, this behavior could be highly undesirable.
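
To make the third point concrete, here is a minimal sketch that simulates a single all-or-nothing transaction in plain Java. Everything in it is an assumption for illustration: the class SingleTransactionDemo, the in-memory map standing in for a database table, and the snapshot/restore trick standing in for commit and rollback. It is not how a real transaction manager works, only a way to see the effect.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical in-memory stand-in for a database table: employee id -> status.
class SingleTransactionDemo {
    final Map<Long, String> table = new LinkedHashMap<>();

    SingleTransactionDemo(int size) {
        for (long id = 1; id <= size; id++) {
            table.put(id, "NEW");
        }
    }

    // Simulates a transaction: snapshot the table, restore it if the work throws.
    // Returns true on "commit", false on "rollback".
    boolean runInTransaction(Runnable work) {
        Map<Long, String> snapshot = new LinkedHashMap<>(table);
        try {
            work.run();
            return true;
        } catch (RuntimeException e) {
            table.clear();
            table.putAll(snapshot);
            return false;
        }
    }

    // Processes every item inside one transaction; the item with id failingId throws.
    boolean processAll(long failingId) {
        return runInTransaction(() -> {
            List<Long> ids = new ArrayList<>(table.keySet());
            for (Long id : ids) {
                if (id == failingId) {
                    throw new IllegalStateException("cannot process employee " + id);
                }
                table.put(id, "PROCESSED");
            }
        });
    }

    // Number of items whose work survived.
    long countProcessed() {
        return table.values().stream().filter("PROCESSED"::equals).count();
    }
}
```

With five employees and a failure on id 4, processAll(4) returns false and countProcessed() is 0: the work done for employees 1 to 3 is thrown away along with the failing item.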

In the following example the long-running transaction has been replaced by multiple smaller transactions: one transaction to retrieve the batch and one transaction for each employee that needs to be processed:

EmployeeDao employeeDao;

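// Note: if this is Spring's default proxy-based @Transactional support, calls
// made from within the same object bypass the proxy, so getBatch() and
// process() have to be invoked through a proxied bean for the annotations to apply.
void processAll(){
    List<Employee> batch = getBatch();
    for(Employee employee: batch)
        process(employee);
}

@Transactional
List<Employee> getBatch(){
    return employeeDao.findItemsToProcess();
}

@Transactional
void process(Employee employee){
    // ...logic
}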

As you may have noticed, this example is not without problems either. One of the biggest problems is that the complete list of employees needs to be retrieved first. If the number of employees is very large, or if a single employee consumes a lot of resources (memory, for example), this can exhaust those resources (apart from running another long-running transaction!). One possible solution is to retrieve only the ids:

EmployeeDao employeeDao;

void processAll(){
    List<Long> batch = getBatch();
    for(Long id: batch)
        process(id);
}

@Transactional
List<Long> getBatch(){
    return employeeDao.findItemsToProcess();
}

@Transactional
void process(long id){
    Employee employee = employeeDao.load(id);
    // ...actual processing
}

A big advantage of retrieving a list of ids instead of a list of Employees is that the transactional behavior is well defined. Detaching objects from sessions and reattaching them introduces a lot more vagueness (especially if the O/R mapping tool doesn’t detach the objects entirely). Other approaches are possible: you can keep a cursor open and retrieve an employee only when it is needed, but then you still have a long-running transaction. Another approach is to retrieve only as many employees as can be processed in a single run, and to repeat this until no more items are found.
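
The last approach above (retrieve only what fits in a single run, repeat until nothing is left) could look roughly like the sketch below. It is plain Java with an in-memory stand-in for the DAO, so all names in it (InMemoryEmployeeDao, findIdsToProcess, setStatus, the NEW/PROCESSED/FAILED states, the chunk size) are assumptions made for illustration, not part of any real API. Note one design assumption in particular: every item that has been picked up must leave the NEW state, even when it fails, otherwise the loop would keep fetching the same failing item forever.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical in-memory stand-in for EmployeeDao: employee id -> status.
class InMemoryEmployeeDao {
    private final Map<Long, String> statusById = new LinkedHashMap<>();

    InMemoryEmployeeDao(int size) {
        for (long id = 1; id <= size; id++) {
            statusById.put(id, "NEW");
        }
    }

    // Returns at most maxResults ids of employees that still need processing.
    List<Long> findIdsToProcess(int maxResults) {
        List<Long> ids = new ArrayList<>();
        for (Map.Entry<Long, String> entry : statusById.entrySet()) {
            if (ids.size() == maxResults) {
                break;
            }
            if ("NEW".equals(entry.getValue())) {
                ids.add(entry.getKey());
            }
        }
        return ids;
    }

    void setStatus(long id, String status) {
        statusById.put(id, status);
    }

    long count(String status) {
        return statusById.values().stream().filter(status::equals).count();
    }
}

class ChunkedBatchProcessor {
    private final InMemoryEmployeeDao employeeDao;
    private final int chunkSize;
    int chunksFetched = 0;

    ChunkedBatchProcessor(InMemoryEmployeeDao employeeDao, int chunkSize) {
        this.employeeDao = employeeDao;
        this.chunkSize = chunkSize;
    }

    // Repeats "fetch a small chunk of ids, process it" until nothing is left.
    void processAll() {
        List<Long> chunk = employeeDao.findIdsToProcess(chunkSize);
        while (!chunk.isEmpty()) {
            chunksFetched++;
            for (Long id : chunk) {
                process(id);
            }
            chunk = employeeDao.findIdsToProcess(chunkSize);
        }
    }

    // In real code each call would run in its own short transaction.
    void process(long id) {
        try {
            doWork(id);
            employeeDao.setStatus(id, "PROCESSED");
        } catch (RuntimeException e) {
            // Moving failed items out of NEW keeps the loop from retrying them forever.
            employeeDao.setStatus(id, "FAILED");
        }
    }

    // Placeholder for the business logic; id 7 fails to show the failure path.
    void doWork(long id) {
        if (id == 7) {
            throw new IllegalStateException("cannot process employee " + id);
        }
    }
}
```

Running new ChunkedBatchProcessor(new InMemoryEmployeeDao(10), 4).processAll() fetches three non-empty chunks and leaves nine employees PROCESSED and one FAILED. With a real database, the FAILED marker would have to be written in a separate transaction, because the transaction of the failing item is rolled back.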

In the next blog post I’ll deal with multi-threading and locking.

4 Responses to Lightweight Batch Processing I: Intro

  1. F. Degenaar says:

    This batch operation will only be recoverable if the process method is idempotent. Otherwise certain employees will be processed again when the batch job starts again, yielding wrong results.

    Just my 0.02 EUR
    Fokko

  2. pveentjer says:

    You are correct, but it also depends on the problem.

    Often you need to process objects that are in a certain state. If this is the case, the operation doesn’t need to be idempotent, because the item is only processed once (after it has been processed, the state changes and it won’t be picked up again).

    But an exception handling part is planned and I’ll certainly add your comment, thanks!

  3. Kurtdb says:

    Small question: why is your getBatch transactional? Are you doing some write-operations behind the scenes? A select-statement doesn’t strike me as being a transactional operation.

    K.

  4. pveentjer says:

    @Kurtdb
    A good question. Depending on the type of database, a transaction is always used, so why not make it explicit? In Oracle a transaction is always used. This is required for the MVCC (multi-version concurrency control) system, and it makes it possible to get statement-level read consistency (or transaction-level read consistency if the serializable isolation level is used). Oracle would not work if it didn’t use a transaction internally! I guess the same goes for other MVCC databases like PostgreSQL and MySQL + InnoDB. And a transaction in Oracle is not expensive as long as only selects are done.

    But you mention a good point: my example could be improved. I could make getBatch read-only if only reads are done.
