Saturday, June 06, 2009

Default Charset Contention

It's no secret that if you want your multi-threaded application to run (and scale) well, the less data and the fewer resources its threads share, the better.

Well, I ran smack into a shared piece of data in the JDK libraries that I would never have guessed was there. It turns out there is a significant difference between the following method implementations:

String convert(byte[] data) {
    return new String(data);
}

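and

// The charset-name constructor declares a checked exception
// (java.io.UnsupportedEncodingException), so the method must declare it too.
String convert(byte[] data) throws UnsupportedEncodingException {
    return new String(data, "UTF-8");
}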


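The difference is that the second one scales much better than the first when many threads call it at the same time. (The two also only produce the same strings when the platform's default character set happens to be UTF-8.)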

The reason is that in order for java.lang.String to get the default character set, it needs a lock on the Charset.class object. Since there's only one of those per JVM, no two threads can get the default character set concurrently, and therefore the first implementation can become a bottleneck.

To demonstrate just how bad the bottleneck can be, I wrote a tiny test program. It starts 50 threads, each of which creates 100,000 strings and then exits.
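The original test program isn't reproduced here, so the following is only a minimal sketch of what such a test might look like. The class name CharsetContentionDemo and the helpers convert and runTrial are made up for illustration, and the sample bytes are arbitrary. The timings it prints depend heavily on the JDK version and hardware; on recent JDKs the gap may be much smaller or absent.

```java
import java.io.UnsupportedEncodingException;
import java.util.concurrent.CountDownLatch;

public class CharsetContentionDemo {

    // Arbitrary sample payload (plain ASCII bytes).
    private static final byte[] DATA = {'h', 'e', 'l', 'l', 'o'};

    // Hypothetical helper: decodes DATA with or without an explicit charset name.
    static String convert(boolean explicitCharset) {
        try {
            return explicitCharset ? new String(DATA, "UTF-8") : new String(DATA);
        } catch (UnsupportedEncodingException e) {
            // Should not happen: every Java platform is required to support UTF-8.
            throw new IllegalStateException(e);
        }
    }

    // Hypothetical helper: starts `threads` threads, each creating `iterations`
    // strings, and returns the elapsed wall-clock time in milliseconds.
    static long runTrial(boolean explicitCharset, int threads, int iterations) {
        CountDownLatch startGate = new CountDownLatch(1);
        Thread[] workers = new Thread[threads];
        for (int i = 0; i < threads; i++) {
            workers[i] = new Thread(() -> {
                try {
                    startGate.await(); // release all workers at the same moment
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                    return;
                }
                int sink = 0;
                for (int j = 0; j < iterations; j++) {
                    sink += convert(explicitCharset).length();
                }
                // Use the result so the loop cannot be optimized away entirely.
                if (sink < 0) {
                    System.out.println(sink);
                }
            });
            workers[i].start();
        }
        long begin = System.nanoTime();
        startGate.countDown();
        try {
            for (Thread worker : workers) {
                worker.join();
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        return (System.nanoTime() - begin) / 1_000_000;
    }

    public static void main(String[] args) {
        System.out.println("default charset:  " + runTrial(false, 50, 100_000) + " ms");
        System.out.println("explicit charset: " + runTrial(true, 50, 100_000) + " ms");
    }
}
```

The latch is there so that all 50 threads start decoding at roughly the same time; without it, the early threads could finish before the late ones are even created, which would hide some of the contention.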

On my laptop, the results for this contrived scenario are dramatic.

Not specifying the character set, the program takes about 35 seconds to complete.
When the character set is specified, the program takes about 4 seconds to complete.

The moral of the story is that in concurrent environments, you can't make any assumptions about what data is being shared (and therefore contended for). Instead, you should rely on thread dumps, profilers, and source code analysis.
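As one way to look for this kind of contention from inside a running program, here is a rough sketch that uses the standard java.lang.management API to list threads that are currently blocked waiting for a monitor. The class name BlockedThreadReport and the helper blockedThreads are invented for this example; a jstack dump or a profiler gives you the same information with stack traces attached.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.lang.management.ThreadMXBean;
import java.util.ArrayList;
import java.util.List;

public class BlockedThreadReport {

    // Hypothetical helper: one line per thread that is BLOCKED on a monitor,
    // naming the lock it is waiting for and the thread that holds it.
    static List<String> blockedThreads() {
        ThreadMXBean mx = ManagementFactory.getThreadMXBean();
        List<String> report = new ArrayList<>();
        for (ThreadInfo info : mx.dumpAllThreads(false, false)) {
            if (info != null && info.getThreadState() == Thread.State.BLOCKED) {
                report.add(info.getThreadName()
                        + " waiting for " + info.getLockName()
                        + " held by " + info.getLockOwnerName());
            }
        }
        return report;
    }

    public static void main(String[] args) {
        for (String line : blockedThreads()) {
            System.out.println(line);
        }
    }
}
```

A single snapshot only shows what is blocked at that instant. If the same lock name keeps turning up across several snapshots taken under load, that lock is a good candidate for the bottleneck.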