Wednesday, February 20, 2013

When should I use an int[] ?

Our team was participating in a code review (via ReviewBoard – pretty good tool) and some folks noticed that an int[] was being used to represent a version number (e.g., 5.3.2 would be new int[]{5, 3, 2}). It had been that way for a while, and this code was just continuing to use that representation. Comments were made that using an actual domain-specific class (e.g., VersionNumber) with constituent parts (getMajor(), getMinor(), etc.) would be better. Discussions erupted around changing it now, since it had been working just fine for years, but that’s another blog post.

Let me be blunt – I hated this use of int[]. Fundamentally, what is happening is that the author is writing in a high-level, object-oriented language and using the constructs provided by assembly language. This is not an uncommon problem – software engineers have often spent time in school learning all about how “it’s all just ones and zeroes” and that memory is “just one long array of 32- or 64-bit numbers”. And, yes, it is important to know that stuff. And, I’ve got nothing against assembly as a language and nothing but respect for the guys that write it (I’m not one of them – but if I were writing assembly, it’s quite possible I’d forego the array and represent a version number with just 16 bits, masking off 1 nibble for each component, and supporting 4-component version numbers whose values can reach 15.15.15.15! Pack the bits!)
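For grins, that nibble-packing scheme would look something like this in Java (a sketch of the joke, not a recommendation):

```java
public class PackedVersion {
    // Four 4-bit nibbles packed into 16 bits: each component maxes out at 15.
    public static int pack(int a, int b, int c, int d) {
        return (a << 12) | (b << 8) | (c << 4) | d;
    }

    // index 0 is the most significant (major) component
    public static int component(int packed, int index) {
        return (packed >> (12 - 4 * index)) & 0xF;
    }
}
```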

And yet…

It’s even more important to understand when to completely forget (or at least ignore) those underlying truths. When people don’t ignore or forget them, well, you see int[] crop up in places. You see bitmasks used to represent a set of Boolean options. You see an int used to represent the result of comparing two objects (Seriously, what’s up with that? Can I get an enum in 21st century Java?)

Oh, wait, did your buddy just say it’s more efficient to use the int[] over the object? I very much doubt that the bottleneck of your Java application’s performance is the creation time or memory footprint of simple objects like VersionNumber, but I stand ready to see the supporting data. Now, I’m not saying it never is… I haven’t seen your code. I’m just saying it isn’t about 99.99% of the time, so let’s not commit the cardinal sin of optimization – optimizing without evidence.

At the end of the day, object-oriented languages like Java really only have one basic design tool – classes. This is what allows for objects/instances, inheritance, polymorphism and encapsulation. Choosing to use an int[] is choosing to not use the one tool Java offers, and certainly calls into question why you’re writing in Java at all. And, have some empathy for the people that come after you, that maintain your code, or use your code in a new feature. Nobody wants to call a method that returns an int[] and recreate the logic that gives semantics to that nonsense.

But, hey, if you really want an int[], I won’t fight you. Just do me a favor, and make it a private member on the VersionNumber class. Use encapsulation to hide the truth and let me continue to think in my object-oriented language.
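That compromise might look something like this (a sketch, not the code from the review; the accessors follow the ones floated above):

```java
public class VersionNumber {
    // The int[] can still live here, hidden behind domain-specific accessors.
    private final int[] parts;

    public VersionNumber(int... parts) {
        if (parts.length == 0) throw new IllegalArgumentException("empty version");
        this.parts = parts.clone();   // defensive copy
    }

    public int getMajor() { return parts[0]; }
    public int getMinor() { return parts.length > 1 ? parts[1] : 0; }
    public int getPatch() { return parts.length > 2 ? parts[2] : 0; }

    public String toString() {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < parts.length; i++) {
            if (i > 0) sb.append('.');
            sb.append(parts[i]);
        }
        return sb.toString();
    }
}
```

Callers get getMajor() instead of [0], and nobody downstream ever has to know (or care) how the parts are stored.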

Thursday, February 25, 2010

Ginormous Methods (-XX:-DontCompileHugeMethods)

You know what I like to see when I open up a class? Not much. That is, I like my classes with Single Responsibility, and I like their methods short and to the point. If there are lots of different paths, maybe polymorphism should pay a visit.

But, I'm not pharisaic in this regard. I mean, if some of the methods ramble on a bit, well, that's just what happens sometimes. So, I'll cut some slack on rather long methods.

That's not what I'm talking about today. I'm not talking about methods that scroll off the screen. Those are big methods. I'm talking about methods that go on for so long, if you didn't have a nice IDE, you would have long forgotten what the name of the method is, or if you're still in it! I'm talking about 1000+ line long methods. Oh yeah.

So, besides the fact that maintainers of this code will likely want to throw themselves off of something tall, what is the problem? Let's back up a bit. First, methods of this length are a real maintenance problem. Specifically, when some poor soul goes to add some logic to this method, there are often so many variables on the stack, they don't even know if the data they need is available. They'll probably just go get it again. And, forget about testing this thing. Odds are the number of side-effects is about as long as the method, and so even if you could set up all the hundreds of fixtures, you'd need hundreds of asserts to ensure the method did all of the things it does.

But, guess what... there's another problem, and it doesn't involve developers. It turns out that the JVM doesn't care for ginormous methods either. Basically, once your method exceeds 8KB (8,000 bytes) of bytecode, the JVM sort of hangs it up, and says, forget this, I'm not going to JIT this method (OK, you can ask really nicely with -XX:-DontCompileHugeMethods, but really, that's not the path you want to take). Even worse, I've seen situations in deployed environments where the JVM will crash, because it's just such a tangly mess of oop maps and confusion.
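If you want to see this happening (or override it), HotSpot exposes a couple of flags; the exact diagnostic output varies by JVM build, and MyApp here is just a placeholder:

```shell
# Watch what the JIT compiles; an over-limit hot method may show up
# with a "COMPILE SKIPPED: hot method too big" note in this output.
java -XX:+PrintCompilation MyApp

# The escape hatch mentioned above (really, refactor instead):
java -XX:-DontCompileHugeMethods MyApp
```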

So, if you ever find yourself in a situation where you're arguing over how bad a method needs to be refactored, and the opposition isn't responding to squishy things like maintainability, testability, developer morale, sanity, etc., then now you can pull out performance as another reason. You might just surprise them enough to get it changed.

Sunday, July 12, 2009

Charset, continued

I wrote about a concurrency issue in an earlier post, and while everything in that post is true, there's more to that story.

While this form

new String(byte[] data, String charsetName)

does perform better than the overload without the character set, the above invocation is not concurrent in all circumstances. It turns out that the resolution of a character set name to a character set object is also a synchronized call (a JVM-wide lock). You can read all about it in Sun's humbly named FastCharsetProvider class.

Basically, their implementation is fine, as long as a particular thread never deals with more than 2 character sets (see the ThreadLocal cache). But, it should be noted that you count character sets by their names and aliases, not by unique character sets. For example, there are lots of ways to specify the US-ASCII character set, and if you use more than 2 of them, you defeat the cache. And, it's more than aliases that are a problem: lots of textual data (e.g., HTML) uses the distinct encodings UTF-8, Cp1252, and ISO-8859-1 almost interchangeably.
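To see the aliasing in action: here are three different name strings that all resolve to the same canonical charset, yet each distinct string is its own entry as far as name-keyed caching goes ("ASCII" and "ISO646-US" are standard JDK aliases for US-ASCII):

```java
import java.nio.charset.Charset;

public class AliasDemo {
    public static void main(String[] args) {
        // One charset, many names -- each line prints the canonical "US-ASCII".
        System.out.println(Charset.forName("US-ASCII").name());
        System.out.println(Charset.forName("ASCII").name());
        System.out.println(Charset.forName("ISO646-US").name());
    }
}
```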

In JDK 1.6, many of these charset methods now take a Charset object, instead of just the name, so that avoids this hurdle. But, some of us are still in JDK 1.5 land and may be for some time. In that case, there's a concurrent approach to dealing with multiple character sets.


Java NIO

Yes, in Matrix-like fashion, NIO saves the day. Now, the problem with Java NIO is that it's "Not I/O", meaning you're going to need to change your code to use these new APIs. And, even if you are able/willing to do that in your code, that says nothing about all of the 3rd party software your system likely uses.

But, let's assume we have control over our little universe... how is Java NIO more concurrent? Because it lets us do things using Charset objects (like JDK 1.6), without requiring us to use JDK 1.6.

For example, converting from a byte[] to a String using a character set in JDK 1.6 looks like this:

private static final Charset UTF8 = Charset.forName("UTF-8");

public String treatAsUTF8(byte[] data) {
    return new String(data, UTF8);
}

However, there is a way to get there with Java NIO alone:

public String treatAsUTF8(byte[] data) {
    return UTF8.decode(ByteBuffer.wrap(data)).toString();
}

What was that about? (and who let this guy chain that many invocations on one line?!)
This approach may look painful... creating a ByteBuffer out of the byte[], decoding it into a CharBuffer, and then toString()ing the CharBuffer to get our result. And, it is, in a lot of ways. But, the above invocation is fully concurrent! So, if you're dealing with multiple charsets and multiple threads, huge gains can be had.

But, the above code isn't very multi-charset friendly, so let's beef it up a bit:

private static final ConcurrentMap<String, Charset> charsetsByAlias =
    new ConcurrentHashMap<String, Charset>();

public static String decode(byte[] data, String charsetName) {
    Charset cs = lookup(charsetName);
    return cs.decode(ByteBuffer.wrap(data)).toString();
}

private static Charset lookup(String alias) {
    Charset cs = charsetsByAlias.get(alias);
    if (cs == null) {
        cs = Charset.forName(alias);
        charsetsByAlias.putIfAbsent(alias, cs);
    }
    return cs;
}

So, what does this decode() routine buy us over new String()? When I run tests with more than 2 character sets and a dozen or more threads in a JDK 1.5 environment, the difference is about an order of magnitude. This isn't some 30% gain sort of thing; it's a total game-changer if your app is in this situation (say, a proxy server on the internet).

Parting comments

The above discussion and implementation is limited to the "new String" scenario, but that's just for brevity and illustration. The same problems can manifest when calling String.getBytes, or creating an InputStreamReader or OutputStreamWriter and many others, and the same approach works to resolve them.
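For instance, the String.getBytes direction can take the same detour through NIO (encode here is my own helper name, not a JDK method):

```java
import java.nio.ByteBuffer;
import java.nio.charset.Charset;

public class NioEncode {
    // Encoding counterpart of the decode() routine above: goes straight
    // through a Charset object, so no charset-name resolution is involved
    // (and no JVM-wide lock).
    public static byte[] encode(String text, Charset cs) {
        ByteBuffer buf = cs.encode(text);
        byte[] out = new byte[buf.remaining()];
        buf.get(out);   // copy out only the bytes actually produced
        return out;
    }
}
```

Pair it with the same kind of alias-to-Charset cache shown earlier and you've covered both directions.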

If you're wondering why the cached implementation does not pre-load the map with all charsets and their aliases (see Charset.availableCharsets()), it's because in my testing, it just doesn't matter. That is, the gain is in the subsequent requests, not the initial one. Furthermore, this lazy approach keeps the cache limited to the aliases your app actually encounters, not the huge pile found in most JDKs.

Saturday, June 06, 2009

Default Charset Contention

It's no secret that if you want your multi-threaded application to run (and scale) well, the less shared data and the fewer shared resources, the better.

Well, I ran smack into a shared piece of data in the JDK libraries that I would never have guessed was there. It turns out there is a significant difference between the following method implementations:

String convert(byte[] data) {
    return new String(data);
}

and

String convert(byte[] data) throws UnsupportedEncodingException {
    return new String(data, "UTF-8");
}


The difference is that the second is more concurrent than the first one.

The reason is that in order for java.lang.String to get the default character set, it needs a lock on the Charset.class object. Since there's only one of those per JVM, no two threads can look up the default character set concurrently, and therefore the first implementation can be a bottleneck.

To demonstrate just how bad the bottleneck can be, I wrote a tiny test program. It creates 50 threads and each thread creates 100,000 strings, and then dies.
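The harness was shaped roughly like this (a reconstruction for illustration, not the original program; the thread and iteration counts are parameters so you can dial them down):

```java
import java.util.concurrent.CountDownLatch;

public class CharsetBench {
    // Runs `perThread` byte[]-to-String conversions on each of `nThreads`
    // threads. A null charsetName exercises the default-charset overload
    // (the contended path); otherwise the named overload is used.
    static long run(final int nThreads, final int perThread,
                    final String charsetName) throws Exception {
        final byte[] data = "hello, world".getBytes("UTF-8");
        final CountDownLatch done = new CountDownLatch(nThreads);
        long start = System.nanoTime();
        for (int i = 0; i < nThreads; i++) {
            new Thread() {
                public void run() {
                    try {
                        for (int j = 0; j < perThread; j++) {
                            String s = (charsetName == null)
                                    ? new String(data)
                                    : new String(data, charsetName);
                        }
                    } catch (Exception e) {
                        throw new RuntimeException(e);
                    } finally {
                        done.countDown();
                    }
                }
            }.start();
        }
        done.await();
        return System.nanoTime() - start;   // elapsed wall-clock time
    }
}
```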

On my laptop, the results for this contrived scenario are dramatic.

Not specifying the character set, the program takes about 35 seconds to complete.
When the character set is specified, the program takes about 4 seconds to complete.

The moral of the story is that in concurrent environments, you can't make any assumptions about what data is being shared (and therefore, contended for). Instead, you should rely on thread dumps, profilers, and source code analysis.

Saturday, May 16, 2009

Overbuilding

I ran into a situation where I was creating a testing framework abstraction layer for my coworkers to use. The primary goal was to end up with something that was as easy to use as possible, especially given how complex the code to be tested was. To set the stage, JUnit was in place, but we needed mock objects to help with our large, legacy codebase, since it not only lacked unit tests, but also lacked classes with unit testing in mind (read: lots of coupling with no means for injection). An uphill road if ever there was one.

Mockito seemed like a good fit, since most of our classes were coupled to a wide number of concrete classes and there's very little in the way of interfaces anywhere. So, Mockito's ability to mock concrete classes (via CGLIB trickery) really saved my unit-test-writing-bacon. Requiring the team to introduce interfaces or injection everywhere would have made this effort DOA.

The trouble with mocking a complex class (many methods and many references) is that setting up all the stubs is not just tedious, it can be downright frightful. The astute reader has been chanting "time to refactor" several times by now, to which I would agree. But, refactoring means development time which means there are powers-that-be who must approve such an effort, and the battle to convince them of such benefits involve careful reasoning about long-term costs, technical debt, and things of this nature. So, let's invite Ward Cunningham to dinner and discuss ways to get that done some other time.

(Take note - I'm all about refactoring. So, if you're like me, you're already grabbing your monitor and screaming "refactor this crap!". But, this is a situation where refactoring is off the table, and if you've never encountered such a situation in commercial software development before, I have nothing but envy for you).

So, back to mocking a really complex class. The first/naive approach is to use the Mockito API to create our mock, and then use all of its slick JDK 1.5 methods to stub out values we're going to need. The problem is that we certainly don't want every unit test writer to have to do all this "setup code". Remember, we've got tons of methods. That means there's probably redundancy (suppose there's one method returning a URL as a String, another as a URL, etc. -- not only does the test writer have to mock both, he has to know (and remember!) to mock both). In short, if tests are painful to write, you can bet that tests won't get written (or worse, maintained).

The second approach is to use a Factory. This design pattern is probably in competition with Singleton for the most used. I say that, because sometimes this is OK, and sometimes this is just developers giving a knee-jerk response when the words "object creation" and "pattern" are used in the same sentence. The trouble with a Factory in this situation is that our unit test abstraction layer is meant to serve the needs of all unit test writers (present and future). With 50 methods and members on a class, there are an obscene number of object instance permutations to be handled. Factories aren't very good at obscene permutations. Factory won't cut down the unit test set up code very much.

This led me to my third approach -- the Builder pattern. Builders are a great fit here, since one of their strengths is providing the client code a great deal of control over object creation, without being involved in object construction. In other words, the ability to specify in great detail how an object gets built, while not seeing a single constructor, or any other part of that class' definition.

So, great, I'll have a Builder for every class that I need to mock. I'll let the unit test writer grab a builder, configure it, and then build their mock. Even better, they don't even need to know we're using Mockito. Shoot - they don't even know whether we're using "real" objects, spies, mocks, dynamic proxies, subclasses, etc. All they know is that they asked for an instance of type X that has the specifications they provided. I'll be an even nicer guy and ensure that the built objects have all kinds of very friendly default values, so most of the time, things "just work".
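Stripped to its essentials, each builder had roughly this shape (class names and defaults are invented for illustration; the real build() handed back a Mockito-backed instance rather than a plain object):

```java
public class ServerConfigBuilder {
    // Friendly defaults, so a test that doesn't care gets something sane.
    private String host = "localhost";
    private int port = 8080;

    public ServerConfigBuilder host(String host) { this.host = host; return this; }
    public ServerConfigBuilder port(int port) { this.port = port; return this; }

    // Test writers only ever see the built type; whether it's a real
    // object, a mock, or a spy is this builder's private business.
    public ServerConfig build() { return new ServerConfig(host, port); }
}

class ServerConfig {
    private final String host;
    private final int port;
    ServerConfig(String host, int port) { this.host = host; this.port = port; }
    public String url() { return "http://" + host + ":" + port; }
}
```

A test that only cares about the port writes exactly one line: new ServerConfigBuilder().port(9090).build().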

So, I was all happy and danced a jig. Clean, simple design - check. No coupling between unit tests and mocking library - check. Unit tests can get "set up" with the absolute minimum involvement of the test writer (only write as much as they deviate from the defaults). Time to write it up, and notify my coworkers. But wait, just to sort of "check myself" (I've got that tendency - it's some combination of personal paranoia and writing code for a while), I pull down a design pattern book off the shelf and look up the Builder pattern.

Oh, yuck.

What the heck is all this? Director, AbstractBuilder, ConcreteBuilder, the product, a Product interface, etc., etc. Part of me was saying to myself, "dude, you screwed this up", but the rest was saying, "what is all this stuff"? Where is Keep It Simple Stupid when you need him?

My takeaway (and the impetus of this post), is that sometimes even the seminal design patterns need a little trimming. Here's my justification with respect to the Builder pattern:

  1. Product. Obviously, need to build something. So, I keep this one.
  2. ConcreteBuilder. Similarly, need something to build it with. It stays.
  3. AbstractBuilder. Don't need it. I've got a ConcreteBuilder for every class I need, and those ConcreteBuilders know about their references, so the Products that get built are composites and life is good. Also, the codebase I'm dealing with has almost no inheritance, so the Builders map one-for-one to the types of Products I need. (Yes, I know, refactor!)
  4. Marker interface (for Product). Not appropriate. The users (unit test writers) do want to be coupled directly to the class being built. That's the point of the test.
  5. Director. In some sense, the unit test themselves are in this role. But, the point is there isn't a new class being introduced to support this.

The point is not that the Builder design pattern is all crazy-bloated, and the GoF are nutjobs. No, the point is that when building software, layers of abstraction and complexity should always be taken into account. While it's great to follow the Builder pattern to the letter (because you just say "Builder" and some other dev either knows what you mean, or goes and looks it up), it may be a fancier tool than you need. And, the worst thing you can do in software is exceed necessary complexity. I'll take maintaining software without a single pattern over software with excessive complexity any day.

I also don't truck with the notion of "if you don't have all those layers, then it won't be flexible enough down the road". My philosophy is I'd rather have just enough layers of abstraction to get through today. If I knew what tomorrow held, I'd own more stocks. Taken from a different angle, and borrowing someone else's phrasing (I couldn't pin down the original source):

There's no problem in Computer Science that can't be solved by adding another layer of abstraction to it. Except for the problem of too many layers of abstraction.

Wednesday, September 13, 2006

Welcome to HalfBottle!

I wanted a place where I could post essays of my thoughts on things, especially software design, and so this blog was born. As for the title, I wanted a colorful phrase that would simultaneously describe the content and mood.

The first phrase that came to mind was "He's got more ideas than a half bottle of liquor."