Spring Batch or How Not to Design an API

Let me start out by saying I’m a huge fan of the Spring framework. It revolutionized enterprise Java development, supplanted J2EE and is probably the single most important Java development in its turgid history.

One of the great things about Spring is that it is largely non-invasive and the documentation is extensive and, for the most part, excellent. The Spring reference manual is running at around 600 pages these days.

In fact about the only negative thing I can say about Spring is that if you get stuck there’s a good chance you’ll have a hard time finding an answer. So much of the Spring-related information is contained in mailing lists (my pet peeve) and forum posts (often unanswered questions), two of the mediums that in part led to the creation of StackOverflow.

Compare that to something as ubiquitous and venerable as Apache’s log4j, where your only real options (beyond the meagre introduction) are to read the source code or to buy some book. Poor or no documentation seems to be the hallmark of Apache projects to the point that my default position when evaluating an unfamiliar one is to be wary.

I’ve read about Spring Batch over the last year or two. Batch jobs tend to be one of those things that we as programmers hate doing, probably because they’re messy. There is a small kernel of technical solution surrounded by layers and layers of questions like:

  • How is the job started?
  • How do we monitor it?
  • How is it restarted manually?
  • How is it called?
  • When is it called?
  • How does it interact with other such jobs?
  • and so on.

But it’s a common problem, so I approached Spring Batch with high hopes that it might ease some of the pain, especially given it’s now up to version 2.0.1. Unfortunately it didn’t, but at least it’s a good example of how not to design an API.

The Big Picture

Spring Batch’s unit of work is called a Job. A job involves one or more Steps that consist of Chunks. A particular run of a Job is called a JobInstance and identified by the JobParameters passed to it. Each attempt at a JobInstance (successful or otherwise) is a JobExecution. The state of a particular execution is stored in a JobRepository. A particular Job with a given set of JobParameters is started using a JobLauncher.
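
To make this concrete, launching a job programmatically looks roughly like the sketch below. The context file name, bean names and parameter are invented for illustration; the JobLauncher and JobParametersBuilder calls are the standard 2.x API.

import org.springframework.batch.core.Job;
import org.springframework.batch.core.JobExecution;
import org.springframework.batch.core.JobParameters;
import org.springframework.batch.core.JobParametersBuilder;
import org.springframework.batch.core.launch.JobLauncher;
import org.springframework.context.ApplicationContext;
import org.springframework.context.support.ClassPathXmlApplicationContext;

public class LaunchExample {
  public static void main(String[] args) throws Exception {
    // Hypothetical context file defining a "jobLauncher" bean and a "loadData" job.
    ApplicationContext ctx = new ClassPathXmlApplicationContext("batch-context.xml");
    JobLauncher launcher = (JobLauncher) ctx.getBean("jobLauncher");
    Job job = (Job) ctx.getBean("loadData");

    // The JobParameters identify the JobInstance; each run() call is one JobExecution.
    JobParameters params = new JobParametersBuilder()
        .addString("run.date", "2009-06-12")  // invented parameter
        .toJobParameters();

    JobExecution execution = launcher.run(job, params);
    System.out.println(execution.getStatus());
  }
}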

A given chunk of work has an ItemReader for a source that can be anything, e.g. a CSV file, a database query, data read from a TCP connection or whatever you like. Data is written out using an ItemWriter, which again can be anything. In between there is optionally an ItemProcessor that transforms the items read into the items to be written.
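
The three interfaces themselves are tiny. Here is a rough sketch of each role; the Price type and the sample data are invented, and the signatures are those of the 2.x API.

import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

import org.springframework.batch.item.ItemProcessor;
import org.springframework.batch.item.ItemReader;
import org.springframework.batch.item.ItemWriter;

public class ReadProcessWriteSketch {

  // Reads one item per call and signals the end of input by returning null.
  static class StaticLineReader implements ItemReader<String> {
    private final Iterator<String> lines = Arrays.asList("ABC,10.50", "XYZ,3.25").iterator();

    public String read() {
      return lines.hasNext() ? lines.next() : null;
    }
  }

  // Optionally transforms each item read into the item to be written.
  static class LineToPriceProcessor implements ItemProcessor<String, Price> {
    public Price process(String line) {
      String[] fields = line.split(",");
      return new Price(fields[0], Double.parseDouble(fields[1]));
    }
  }

  // Receives all the items for the current chunk in a single call.
  static class ConsolePriceWriter implements ItemWriter<Price> {
    public void write(List<? extends Price> items) {
      for (Price p : items) {
        System.out.println(p.symbol + " = " + p.value);
      }
    }
  }

  static class Price {
    final String symbol;
    final double value;
    Price(String symbol, double value) {
      this.symbol = symbol;
      this.value = value;
    }
  }
}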

Sounds good? It did to me, particularly the simple abstraction of item readers and writers. But the honeymoon was quickly over.

Some background is required here. I needed to load some CSV files once per day into an in-memory cache (Oracle Coherence). That’s a straight load that overwrites existing entries. Neither the CSV files nor the cache are transactional (although I believe Coherence can support JTA transactions) and the job is eminently rerunnable: it’ll just overwrite the same data.

Why do I Need a Transaction Manager?

The basic job looks something like this:

<job id="loadData">
  <step id="loadDataStep">
    <tasklet>
      <chunk reader="reader" writer="writer"/>
    </tasklet>
  </step>
</job>

That’s assuming you use the batch schema as the default XML namespace. The names refer to other beans in your application context.

But if you try to load the above, you’ll get exceptions thrown if you don’t have a Spring bean named “transactionManager” visible to the job. Why? I’m not doing transactions! This is enough of a problem that Spring Batch provides a ResourcelessTransactionManager to use in such situations.
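
For the record, the workaround is just a matter of exposing that no-op implementation under the magic name. A minimal sketch (in XML it amounts to a one-line bean definition for the same class):

import org.springframework.batch.support.transaction.ResourcelessTransactionManager;
import org.springframework.transaction.PlatformTransactionManager;

public class NoOpTransactionConfig {

  // ResourcelessTransactionManager is a PlatformTransactionManager that does nothing;
  // it exists purely to satisfy the mandatory "transactionManager" dependency
  // when the job touches no transactional resources.
  public PlatformTransactionManager transactionManager() {
    return new ResourcelessTransactionManager();
  }
}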

Why do I Need a Job Repository?

Spring Batch is big on the concept that a given job only be run once (successfully). If it fails, the idea is generally to allow it to be restarted and continue. As such, the Job needs to maintain state about attempts to run, what succeeded and so on. That’s all well and good except that sometimes it’s just not appropriate.

Thing is, I don’t care about any of that. My job can run as many times as it pleases. Why am I being forced down this path?

Spring Batch comes with two implementations of the JobRepository interface: one saves job state to a database, the other keeps it in memory as a collection of static Maps. I chose the map implementation because, like I said, I didn’t need the state anyway.
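
Wiring up that in-memory flavour ends up looking something like the sketch below. I'm writing it in Java for brevity; the exact setters have shifted a little between 2.x releases, so treat it as an outline rather than gospel. Note that even the map-backed repository wants a transaction manager, hence the ResourcelessTransactionManager again.

import org.springframework.batch.core.launch.support.SimpleJobLauncher;
import org.springframework.batch.core.repository.JobRepository;
import org.springframework.batch.core.repository.support.MapJobRepositoryFactoryBean;
import org.springframework.batch.support.transaction.ResourcelessTransactionManager;

public class InMemoryBatchInfrastructure {

  // Builds a map-backed JobRepository and a launcher that uses it.
  public static SimpleJobLauncher createLauncher() throws Exception {
    MapJobRepositoryFactoryBean factory = new MapJobRepositoryFactoryBean();
    factory.setTransactionManager(new ResourcelessTransactionManager());
    factory.afterPropertiesSet();
    JobRepository jobRepository = (JobRepository) factory.getObject();

    SimpleJobLauncher launcher = new SimpleJobLauncher();
    launcher.setJobRepository(jobRepository);
    launcher.afterPropertiesSet();
    return launcher;
  }
}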

Why Is the CSV Parser So Strict?

CSV parsing is generally done using the bundled FlatFileItemReader. There is a lot of configuration that goes into instantiating one of these, a ridiculous amount in fact.
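
To give a sense of the moving parts, here is roughly what assembling one looks like, written in Java rather than XML for brevity. It's a sketch only: the column names, file name and Price bean are invented, and the class names are the 2.x ones as I understand them.

import org.springframework.batch.item.file.FlatFileItemReader;
import org.springframework.batch.item.file.mapping.BeanWrapperFieldSetMapper;
import org.springframework.batch.item.file.mapping.DefaultLineMapper;
import org.springframework.batch.item.file.transform.DelimitedLineTokenizer;
import org.springframework.core.io.FileSystemResource;

public class PriceFileReaderFactory {

  public static FlatFileItemReader<Price> createReader() {
    // Every column in the file has to be accounted for, hence the empty names
    // for the columns we don't care about.
    DelimitedLineTokenizer tokenizer = new DelimitedLineTokenizer();
    tokenizer.setNames(new String[] { "symbol", "", "", "price", "", "name" });

    // Binds the named columns onto a bean using Spring data binding.
    BeanWrapperFieldSetMapper<Price> fieldSetMapper = new BeanWrapperFieldSetMapper<Price>();
    fieldSetMapper.setTargetType(Price.class);

    DefaultLineMapper<Price> lineMapper = new DefaultLineMapper<Price>();
    lineMapper.setLineTokenizer(tokenizer);
    lineMapper.setFieldSetMapper(fieldSetMapper);

    FlatFileItemReader<Price> reader = new FlatFileItemReader<Price>();
    reader.setResource(new FileSystemResource("prices.csv"));
    reader.setLinesToSkip(1);  // skip the header record
    reader.setLineMapper(lineMapper);
    return reader;
  }

  public static class Price {
    private String symbol;
    private double price;
    private String name;
    public void setSymbol(String symbol) { this.symbol = symbol; }
    public void setPrice(double price) { this.price = price; }
    public void setName(String name) { this.name = name; }
  }
}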

First problem: when I specify the column names I have to get the exact number of commas right or the parser bombs out (with an exception about the incorrect number of tokens). Can’t I just specify the columns up to the last field I’m interested in rather than putting 30 commas at the end just for the sake of it?

Second problem: if you have different record types in your file, each must have the correct config to parse it whether you use it or not. My first file had a header record and a footer record. I needed config to process these records (technically I can skip the first one with a skip parameter, but that doesn’t work for the last).

Third problem: if you don’t care about certain fields and try to ignore them with config like “,,,price,,name” it actually reads them anyway and puts them in a result map with a key of the empty string.

Fourth problem: I did have a price field and it was getting overwritten by nonsense. It took a while to find this one, but basically the reader has a concept of distance. Distance is a static field (why??) so you can’t change it with Spring config, and it defaults to 5. This means that it will ignore up to the last 5 characters of your result bean property names when trying to set a value (it’ll try an exact match first, then ignore the last character, and so on up to 5).

Can you spot the problem? The field “price” is 5 characters long. That means one of my empty key-value pairs with an empty string key matches the price property. What the…?!

Why Is the CSV Reader So Slow?

When loading 100,000 records this initially took 50 seconds. That’s ridiculously slow. Turns out there was a good reason for this.

The “normal” means of creating a value object for your reader is to use a prototype-scoped Spring bean and apply Spring data binding with the name-value pairs retrieved from the parsing step. This is ridiculously slow. I replaced it with a simple method that just instantiated a new object and manually set its properties, and the load time went to under 4 seconds.
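
The replacement is hardly sophisticated: essentially a hand-rolled FieldSetMapper along these lines (a sketch, reusing the invented Price bean from the earlier reader example).

import org.springframework.batch.item.file.mapping.FieldSetMapper;
import org.springframework.batch.item.file.transform.FieldSet;

// No prototype bean and no data binding: just instantiate the object and set
// its properties directly from the parsed FieldSet.
public class PriceFieldSetMapper implements FieldSetMapper<PriceFileReaderFactory.Price> {

  public PriceFileReaderFactory.Price mapFieldSet(FieldSet fieldSet) {
    PriceFileReaderFactory.Price price = new PriceFileReaderFactory.Price();
    price.setSymbol(fieldSet.readString("symbol"));
    price.setPrice(fieldSet.readDouble("price"));
    price.setName(fieldSet.readString("name"));
    return price;
  }
}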

That’s a ridiculous difference: an order of magnitude. You may lay the blame on reflection but not so. Ibatis has proven to me beyond a shadow of a doubt that flexible and robust reflection-based property setting can be done very quickly. Of course doing it manually is going to be faster but that much faster?

I don’t mind writing that code either, except for this: it’s one less thing Spring Batch is doing for me and another chunk of basically boilerplate code I’d rather not write but have to.

What Do I Do About Parsing Errors?

The Job (or the Step, specifically) allows you to configure which exceptions to ignore as well as how many of them to tolerate. These happen for a variety of reasons. In some cases it was people editing the CSV file with Excel, which pads each line with commas so that every line has the same number of fields. A noble gesture but misguided: it breaks my CSV parsing (unfortunately).

The other error I had was commas inadequately escaped in the file. This is a genuine error. The problem is that each time it happens I get a giant stack trace in my log file. I don’t want that.

You can add a listener and listen for these errors. Thing is, that doesn’t stop the default behaviour of dumping a giant stack trace. Annoying. Really annoying. This is basically an event system, and any reasonable event system should have some means of halting further propagation of the event, kind of like e.preventDefault() in jQuery.
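
For what it's worth, the listener itself is easy enough to write. Something like the sketch below logs a single warning per skipped record; it just can't cancel the framework's own stack-trace dump, which is the real complaint.

import org.apache.commons.logging.Log;
import org.apache.commons.logging.LogFactory;
import org.springframework.batch.core.SkipListener;

// Notified whenever an item is skipped during read, process or write.
public class OneLineSkipListener implements SkipListener<Object, Object> {

  private static final Log log = LogFactory.getLog(OneLineSkipListener.class);

  public void onSkipInRead(Throwable t) {
    log.warn("Skipped unreadable line: " + t.getMessage());
  }

  public void onSkipInProcess(Object item, Throwable t) {
    log.warn("Skipped item during processing: " + item + " (" + t.getMessage() + ")");
  }

  public void onSkipInWrite(Object item, Throwable t) {
    log.warn("Skipped item during write: " + item + " (" + t.getMessage() + ")");
  }
}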

The Good

It’s worth having an intermission to mention some of the good things.

The CSV parser is (otherwise) good in that it supports quoting and escaping. Splitting a string on the comma is easy. Handling commas that are escaped or within quotes is not exactly hard but it’s tedious and who wants to write that code? I know I don’t.

The commit intervals are good: you can set a property specifying how many items are read between each commit.

The ItemProcessors are reasonably good too. You can basically use them as adapters between readers and writers, such that it becomes easier to reuse the same reader and/or writer in different jobs, which is actually something I did.
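
For example, something along the lines of the sketch below (the types are invented) converts what a shared reader produces into what one particular job's writer expects.

import org.springframework.batch.item.ItemProcessor;

// An ItemProcessor used purely as an adapter: the reader keeps producing Record items
// and this turns them into the CacheEntry items this particular job's writer wants,
// so the same reader and writer classes can be reused across jobs.
public class RecordToCacheEntryProcessor
    implements ItemProcessor<RecordToCacheEntryProcessor.Record, RecordToCacheEntryProcessor.CacheEntry> {

  public CacheEntry process(Record record) {
    return new CacheEntry(record.key, record);
  }

  public static class Record {
    public String key;
    public String payload;
  }

  public static class CacheEntry {
    public final String key;
    public final Object payload;
    public CacheEntry(String key, Object payload) {
      this.key = key;
      this.payload = payload;
    }
  }
}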

The range of events you can listen to is also good, particularly how you can just implement, say, ItemStream or StepExecutionListener on a custom ItemWriter and then get those events.
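
A sketch of what I mean: a custom writer that picks up the stream and step lifecycle callbacks simply by implementing those interfaces (the actual cache-loading is elided).

import java.util.List;

import org.springframework.batch.core.ExitStatus;
import org.springframework.batch.core.StepExecution;
import org.springframework.batch.core.StepExecutionListener;
import org.springframework.batch.item.ExecutionContext;
import org.springframework.batch.item.ItemStream;
import org.springframework.batch.item.ItemWriter;

public class CacheLoadingWriter implements ItemWriter<Object>, ItemStream, StepExecutionListener {

  private int written;

  // ItemWriter: called once per chunk with the items to write.
  public void write(List<? extends Object> items) {
    written += items.size();
    // ... push the items into the cache here ...
  }

  // ItemStream: called when the step opens, periodically updates and finally closes the stream.
  public void open(ExecutionContext executionContext) {
  }

  public void update(ExecutionContext executionContext) {
    executionContext.putInt("written", written);
  }

  public void close() {
  }

  // StepExecutionListener: called around the whole step.
  public void beforeStep(StepExecution stepExecution) {
  }

  public ExitStatus afterStep(StepExecution stepExecution) {
    return ExitStatus.COMPLETED;
  }
}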

Why Does My Job Keep Failing?

This, to me, was the straw that broke the camel’s back. I didn’t necessarily mind having to put in a map-backed JobRepository just to satisfy the “purity” of the API… until it stopped working.

This only became apparent when I tried to run several jobs in parallel. Basically the map-backed JobRepository has huge concurrency problems despite the somewhat misleading usage of synchronized in some parts of the code base.

For example, in MapJobExecutionDao:

private static Map<Long, JobExecution> executionsById = TransactionAwareProxyFactory.createTransactionalMap();
private static long currentId = 0;

public void saveJobExecution(JobExecution jobExecution) {
  Assert.isTrue(jobExecution.getId() == null);
  Long newId = currentId++;
  jobExecution.setId(newId);
  jobExecution.incrementVersion();
  executionsById.put(newId, copy(jobExecution));
}

What’s wrong with this? Quite a few things actually. For a start:

  1. Updating a long is not an atomic operation. That’s why we have AtomicLong;
  2. The post-increment operator is not threadsafe; and
  3. Updating this particular Map in this way is not threadsafe.

Aside from that there’s also the annoying issue that the maps are static. Why is this annoying? Because I had so many threading issues with this particular JobRepository implementation that I gave up and used one for each Job, and because the maps are static, even separate instances end up sharing state.
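
To be clear about what I'd expect instead, here is a sketch (not Spring Batch code, just an illustration) of the same save logic written for concurrent use.

import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicLong;

public class SafeExecutionStore {

  private final Map<Long, Object> executionsById = new ConcurrentHashMap<Long, Object>();
  private final AtomicLong currentId = new AtomicLong();

  public long save(Object execution) {
    // Atomic increment: two threads can never be handed the same id.
    long newId = currentId.incrementAndGet();
    // Concurrent map: the put itself is safe without external locking.
    executionsById.put(newId, execution);
    return newId;
  }
}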

Last Minute Complaints

One of the great things about Ibatis is that it supports grouping of result rows. A fixed commit interval is useful but what if you’re bundling rows to get them into groups? I did have to do this and it became painful. It resulted in a custom ItemReader that internally bundles rows. It works but I’d rather not have had to write it.
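
For the curious, the bundling reader is essentially a delegating ItemReader along the lines of the sketch below (the Row type and its grouping key are invented).

import java.util.ArrayList;
import java.util.List;

import org.springframework.batch.item.ItemReader;

// Bundles consecutive rows that share a key into a single item, so a group is
// never split across two chunks (and therefore never across two commits).
public class GroupingItemReader implements ItemReader<List<GroupingItemReader.Row>> {

  private final ItemReader<Row> delegate;
  private Row pending;  // first row of the next group, already read but not yet returned

  public GroupingItemReader(ItemReader<Row> delegate) {
    this.delegate = delegate;
  }

  public List<Row> read() throws Exception {
    Row first = (pending != null) ? pending : delegate.read();
    pending = null;
    if (first == null) {
      return null;  // end of input
    }
    List<Row> group = new ArrayList<Row>();
    group.add(first);
    Row next;
    while ((next = delegate.read()) != null) {
      if (next.getGroupKey().equals(first.getGroupKey())) {
        group.add(next);
      } else {
        pending = next;  // belongs to the next group, so hold it for the next read()
        break;
      }
    }
    return group;
  }

  public interface Row {
    String getGroupKey();
  }
}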

The other problem with commit intervals is they work off items read rather than items written. In some cases with a commit interval of 1000 I was doing commits of 8 records because the rest had been rejected for various reasons.

The config required is excessive. Whatever happened to convention over configuration? A good example of this is that you need to create separate line tokenizers and field set mappers (the first creates a property map from a line of CSV whereas the second binds that to an object). That separation is a reasonable option if you need to mix and match them, but none of my scenarios had this kind of reuse. Couldn’t we simplify this somewhat?

We Have the Technology

Now, with all these problems, you may reasonably point out that perhaps I’m using the wrong tool or even that what I’m doing is wrong. I believe, however, that the above are quite legitimate complaints.

You need look no further than Spring itself to see what Spring Batch is doing wrong. Spring MVC for example is a really lightweight Web framework in many ways. A request is mapped to a controller. That controller creates a model and passes it to a view. It’s straightforward and simple yet flexible and powerful. Controllers can vary from the very simple to the extremely complex with many implementations you can use out of the box.

That’s the right way to design an API or a library or framework: use as much or as little as you like. Don’t make me create a bunch of stuff I don’t need (and will in fact break what I want to do) for the sake of your API.

The very simplest Spring Batch job should be a simple read and write with no transactions and no repository. Restartability and a repository should be some kind of decorator/observer or just a more complex implementation. Do transaction management the way Spring does it (declaratively, programmatically or not at all, as you see fit) on your jobs or steps and on the repository.

While I can appreciate the flat file reader’s “white list” approach it’s simply too strict. CSV input by its nature tends to be hard to reliably define. It should be trivial to change this strict behaviour and to mask exceptions (or not as you see fit).

Things like property distance are just plain bizarre and need to go or at least be instance variables so config can easily override them.

Much like the groupBy in Ibatis, you should be able to set a commit breakpoint that uses one or more fields, instead of or in addition to the fixed commit interval, to allow clean breaks in writing.

The commit interval should absolutely work off items written not items read.

There needs to be something in between hand-coding property setting and Spring’s full data binding that is reasonably powerful yet without the huge cost of Spring data binding prototype beans.

Lastly, the map-based repository needs some serious attention with respect to concurrency.

Conclusion

If you’ve gotten this far you may think I’m quite negative on the whole Spring Batch experience, and you’d be right. Honestly, I expect more out of something that wears the Spring label, has been out for a year or two and is ostensibly at version 2.0.

10 comments:

Sabarish Sasidharan said...

Ouch, that was a bit too harsh. I actually like Spring Batch for the abstractions and flexibility they have considered.

I am surprised they have that kind of code in the in memory job repository. And their passion for static variables is a bit worrying.

But I don't see separating the concerns of transaction management and job repository into first class abstractions as a bad thing. I do feel that the defaults could have been a resourceless tx manager and an in-memory job repository, or in fact the job repository should be optional.

I just started using Spring Batch. I am yet to hit the potholes you hit, but thanks to you, I know the potholes now.

Michel Zanini said...

I agree with you about the job repository. I've been using Spring Batch since version M4. I have tried to open issues for some of the problems you mentioned, but the authors denied them. Look at these issues in Jira: BATCH-778 and BATCH-1199. But I do think we have to show this to them, so we can open issues and try to make the framework better, instead of just complaining.

Dave Syer said...

As one of the authors of Spring Batch I also find this article a bit harsh, especially coming from someone who has not been active on the forum and has not, as far as I can tell, raised any issues in the Batch JIRA. Spring projects are community projects and we do care a lot about what people think. All of the issues above would benefit from discussion and clarification, but this is not the right forum, so I warmly invite interested parties to follow up on the Batch forum (http://forum.springsource.org?forumdisplay.php?f=41) or in JIRA (http://jira.springframework.org/browse/BATCH).

ejboy said...

Have a look at examples of Scriptella ETL. This tool was designed from the developer's point of view and its usage and integration was made as simple as possible:
- Load CSV data into a database (http://snippets.dzone.com/posts/show/3508)
- Copy table from one database to another (http://snippets.dzone.com/posts/show/3511)

Integration with Spring is easy - http://snippets.dzone.com/posts/show/4862

Snehal S. Antani said...

There are two more general complaints I've heard about Spring Batch from my customers.

First, unlike Spring Core, business logic is truly split across XML configuration files and Java code; debugging and maintenance became a challenge for them, especially as the applications became more complex. With Spring Core, the XML files were used primarily to wire together beans and reference aspects; your business logic remained in Java.

Second, the checkpoint/restart mechanism within SB isn't truly transactional. The updates to the checkpoint/restart table are done in a separate transaction than business logic updates; therefore there is a window where data can get out of sync. I was told about this point from 2-3 different customers, but haven't checked it myself yet.

In the end, the programming style used to express your batch business logic is your own prerogative; we've seen customers use EJB3, EJB2, Compute Grid APIs, BDS Framework + Spring, Spring Batch, etc. The more interesting challenge is providing an end-to-end batch processing platform. The better the contract between the batch application and its underlying platform (container, server, workload manager, etc), the more container-managed services can be provided.

We've been recommending that customers use the Batch Data Stream Framework (BDSFW) and their favorite dependency injection container (Spring or OSGI) to build applications. This provides a nice balance of platform integration and vendor neutrality, where applications are still product agnostic (as long as you use the adapter pattern) but still a tight enough integration with the WebSphere Compute Grid batch container to provide services like: job pacing/throttling for 24x7 batch + oltp; transactional checkpoint/restart; workload management integration; job log mgmt; etc.

To further address vendor neutrality, we're shipping a version of the Compute Grid batch container that can run in Tomcat, JBoss, WAS CE, Weblogic, WebSphere Smash, etc.

For more information generally, you can checkout the following:
1. SwissRe article on Compute Grid: http://www-01.ibm.com/software/tivoli/features/ccr2-2008-12/swissre-websphere-compute-grid-zos.html

2. Compute Grid technical introduction: http://snehalantani.googlepages.com/latestpresentationmaterial

Steve said...

Greetings. Your observations are not completely without merit. However, I do find simply moaning on about something that is, at the end of the day, free to be somewhat ungracious.

We've been using Spring Batch for just over a year at my current employer. We've built up a supplemental library of over 100 classes with ~13,000 lines of code to fill in the gaps and to improve on what we've experienced. The parsing is a good example, and we've built our own template classes that do a hell of a job (no pun intended) of making development easy.

We've built a Batch Console of our own and with some effort adapted the core Spring Batch code to provide callback messages over a message bus. This makes our Console a responsive illustration of the work in progress (thanks to IceFaces and its reverse Ajax). With our additional library work we've standardised things like Log and Audit production. Not only are these details observable from the Console, they are used to drive support and billing tasks.

So yeah. Spring Batch has a few warts. Like most Open Source though you have to embrace what's there and fill in the gaps. We did, and our programming group now love writing Batch Jobs, and our support group no longer dislike fixing them! :-)

Ryan said...

Hi Steve -- any chance your company could graciously contribute those 13,000 lines back to the community? :)

Anonymous said...

This is where you see the mentality of Java contributors: instead of actually trying to improve the product by reading and understanding the issues raised here, they just say "your opinion is harsh, you don't even contribute, you are not one of us, you are too dumb to use Java..." and so on...

Way to go guys...

- Everyone can design a complex system; only simplicity requires intelligence...

Anonymous said...

Thank you for your article! It helped me a lot in understanding Spring Batch and those many errors I ran into...

Anonymous said...

Who would use Spring Batch? It is truly awful. When you compare it with Camel (and to a lesser extent Spring Integration), the learning curve for the amount of functionality is ridiculous. I can understand Camel and get very complicated process flows running with ease. Even doing multi-threaded fork and join pipelines with complicated exception handling is a breeze. Getting the most simple Spring Batch pipeline working takes too long, and often working out how to do something slightly more complicated takes some lateral thinking. You get some flat file parsing and XML stuff out of the box but, big deal, this is easy to implement anyway.
