Friday, February 23, 2007

JDK Bugs: To fix or not to fix?

With Java 5.0 comes direct support for code points. What's the point, you ask? Well, there just aren't enough Java char values to represent all the world's alphabets, so pairs of characters are used to represent what's called a supplementary code point. Fortunately the design is such that half of a code point can never be mistaken for a single "regular" character. The code point for a regular character is just the char value represented as an int, and a supplementary code point is specified as a pair of "non-regular" characters x and y for which Character.isSurrogatePair(x, y) returns true. This design ensures that a loop like this will never accidentally match half of a surrogate pair:

for (int i = 0, length = string.length(); i < length; ++i)
{
  if (string.charAt(i) == '/')
  {
    // Do something useful.
  }
}
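
To see the encoding in action, here's a minimal sketch; U+1D11E, the musical G clef, is just an arbitrary supplementary character chosen for illustration:

String s = new String(Character.toChars(0x1D11E));
System.out.println(s.length());                       // 2: one code point, two chars
char x = s.charAt(0);
char y = s.charAt(1);
System.out.println(Character.isSurrogatePair(x, y));  // true
System.out.println(s.codePointAt(0) == 0x1D11E);      // true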

So can you ignore code points and if not, how do you use them properly? Here's how you would write a loop that processes each code point in a string as opposed to each character in a string.

for (int i = 0, length = s.length();
     i < length;
     i = s.offsetByCodePoints(i, 1))
{
  int codePoint = s.codePointAt(i);
  // Do something useful with codePoint.
}
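
Related helpers follow the same pattern. For example, String.codePointCount reports how many code points a range of chars contains; this little sketch just contrasts it with length():

String s = "a" + new String(Character.toChars(0x1D11E));
System.out.println(s.length());                       // 3 chars
System.out.println(s.codePointCount(0, s.length()));  // 2 code points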

There are new overloaded methods, such as Character.isJavaIdentifierStart(int), that work on code points. While a high or low surrogate character is not a proper Java identifier start character, the code point a surrogate pair specifies might well be, so correct behavior relies on processing code points properly. You can imagine that creating a substring should not split a surrogate pair in half, so there are plenty of tricky new issues to consider. Won't it be fun to revisit all that code you wrote years ago?!
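
Here's a quick sketch of that distinction; U+10400, DESERET CAPITAL LETTER LONG I, is just an arbitrary supplementary letter chosen for illustration:

String s = new String(Character.toChars(0x10400));
// The individual surrogate chars are not identifier starts...
System.out.println(Character.isJavaIdentifierStart(s.charAt(0)));       // false
// ...but the code point they specify is.
System.out.println(Character.isJavaIdentifierStart(s.codePointAt(0)));  // true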

My first attempt to use code points went well, and I was pleased that a loop using them was still quite clean. But then I noticed that the JUnit tests I wrote behaved differently on the build machine than on my own machine; in fact, only my machine seemed to be behaving correctly. This example demonstrates the problem by behaving differently on different JVMs:

public class CodePointTest
{
  public static void main(String[] args)
  {
    printCharacters("abcd");
    printCharacters("abcd".substring(1));
    printCharacters("abcd".substring(2));
    printCharacters("abcd".substring(3));
  }

  private static void printCharacters(String s)
  {
    System.out.println("String: " + s);

    for (int i = 0, length = s.length();
         i < length;
         i = s.offsetByCodePoints(i, 1))
    {
      System.out.println
        ("\ti:" + i + " - " + (char)s.codePointAt(i));
    }
  }
}

It was apparent that the returned index was wrong by exactly the offset value displayed in the debug view of the String. A String produced by substring shares its parent's backing char[] and records where it starts with an internal offset field; the broken implementation returns an index relative to the backing array rather than to the string itself, so the fix would be to subtract that offset before returning. While the JVM I was using worked fine, everyone else had stumbled upon bug_id=6242664. But that problem report is two years old, so you really have to wonder how it is possible that this bug is still not fixed!

It would appear there are those who believe we should just use the corresponding Character.offsetByCodePoints helper method as a workaround. But if you measure a loop like this:

int sum = 0; // accumulate something so the loop can't be optimized away
long start = System.currentTimeMillis();
for (int count = 0; count < COUNT; ++count) // COUNT and s are defined elsewhere
{
  for (int i = 0, length = s.length();
       i < length;
       i = Character.offsetByCodePoints(s, i, 1))
  {
    sum += s.codePointAt(i);
  }
}
long end = System.currentTimeMillis();
System.out.println((end - start) + "ms, sum = " + sum);

It's 50% slower (on my working JVM) than the same loop using String's offsetByCodePoints method directly. So not only is performance worse, you can well imagine the cycles being burned across the world because of this one method. It is a method on String, after all!

The lesson is that if you want to write code-point-aware code that will work on all JVMs for any String, then until Java 5.0 reaches its end of life, you cannot use String's offsetByCodePoints method. I find that incredible. Eclipse would never treat its clients with such apparent disregard; we'd have the problem fixed within a week...
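
For the record, here's a sketch of a portable loop that sidesteps both the bug and the Character.offsetByCodePoints overhead by advancing the index by hand; it relies only on codePointAt, which the bug report does not implicate:

for (int i = 0, length = s.length(); i < length; )
{
  int codePoint = s.codePointAt(i);
  // Do something useful with codePoint.

  // Advance by 1 for a regular char, 2 for a surrogate pair.
  i += Character.charCount(codePoint);
}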