Mutability, Arrays and the Cost of Temporary Objects in Java

In his 2001 must-read book Effective Java, Joshua Bloch devotes an item to the advice “Favor immutability”. The article Java theory and practice: To mutate or not to mutate? provides an excellent overview of what this means and why it matters. It states:

An immutable object is one whose externally visible state cannot change after it is instantiated.

A Brief Overview of Immutability

Let’s say you want to create a class to model arbitrary precision rational numbers (ie fractions). A mutable version might start out:

public class BigRational {
  private BigInteger numerator = BigInteger.ZERO;
  private BigInteger denominator = BigInteger.ONE;

  // constructors and so on

  public BigRational add(BigRational other) {
    if (numerator.signum() == 0) {
      numerator = other.numerator;
      denominator = other.denominator;
    } else if (other.numerator.signum() == 0) {
      // no action required
    } else if (denominator.equals(other.denominator)) {
      numerator = numerator.add(other.numerator);
    } else {
      // this could be optimized for greatest common divisor
      numerator = numerator.multiply(other.denominator).add(other.numerator.multiply(denominator));
      denominator = denominator.multiply(other.denominator);
    }
    return this;
  }

  // etc
}

This is how many classes naively start. The problem comes when this class is used as a data member or a parameter. Consider this class:

public class MyClass {
  private BigRational data = new BigRational();

  public BigRational getData() {
    return data;
  }
}

Doing this will modify the internal state of the class:

MyClass mc = new MyClass();
BigRational rational = mc.getData();
rational.add(rational); // mutates mc's internal state!

Obviously this behaviour isn’t desirable, which leads to the practice of defensive copying, a practice familiar to any C programmer. Each time this getter is called a temporary copy is created so the internal state of the class isn’t violated.

One of the biggest early errors in Java’s design was that the Date class is mutable. This means that any API that uses dates either has to use a non-standard date class or defensively copy date instances. Oddly, Java got String, BigInteger and BigDecimal all right as they’re all immutable. Even stranger, the later Calendar class (introduced in JDK 1.1) was also made mutable.
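Until an immutable date type is available, the standard workaround is to defensively copy the Date on the way in and on the way out. A minimal sketch (the Meeting class is invented for this example):

```java
import java.util.Date;

// Illustrative class holding a mutable Date safely via defensive copies.
public class Meeting {
  private final Date when;

  public Meeting(Date when) {
    // Copy on the way in so the caller cannot mutate our state later.
    this.when = new Date(when.getTime());
  }

  public Date getWhen() {
    // Copy on the way out so the returned object can be mutated freely.
    return new Date(when.getTime());
  }
}
```

Every call pays for a temporary object, which is exactly the cost immutability avoids.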

An immutable version of BigRational would look something like:

public class BigRational {
  private final BigInteger numerator;
  private final BigInteger denominator;

  public BigRational() {
    this(BigInteger.ZERO);
  }

  public BigRational(BigInteger integer) {
    this(integer, BigInteger.ONE);
  }

  public BigRational(BigInteger numerator, BigInteger denominator) {
    if (denominator.signum() == 0) {
      throw new IllegalArgumentException("denominator cannot be zero");
    }
    if (numerator.signum() == 0) {
      this.numerator = BigInteger.ZERO;
      this.denominator = BigInteger.ONE;
    } else {
      this.numerator = numerator;
      this.denominator = denominator;
    }
  }

  public BigRational add(BigRational other) {
    if (numerator.signum() == 0) {
      return other;
    } else if (other.numerator.signum() == 0) {
      return this;
    } else if (denominator.equals(other.denominator)) {
      return new BigRational(numerator.add(other.numerator), denominator);
    } else {
      // this could be optimized for greatest common divisor
      return new BigRational(
          numerator.multiply(other.denominator).add(other.numerator.multiply(denominator)),
          denominator.multiply(other.denominator));
    }
  }

  // etc
}

Arrays are Mutable

The big problem with all this is that Java arrays are mutable. So for example:

public void doStuff(String args[]) {
  args[0] = "Hello world";
}

...

String arr[] = new String[] { "one", "two", "three" };
doStuff(arr);
System.out.println(arr[0]); // Hello world

This is one big reason why you should use Lists instead of arrays in almost all circumstances where you have a choice. Lists can be made immutable:

List<String> list = new ArrayList<String>();
list.add("one");
list.add("two");
list.add("three");
final List<String> immutableList = Collections.unmodifiableList(list);
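Mutating the wrapper then fails fast at run time instead of silently corrupting state. A small sketch (the class and method names are invented for this example):

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Illustrative sketch: mutating an unmodifiable wrapper fails at run time.
public class UnmodifiableDemo {
  public static boolean tryAdd(List<String> list) {
    try {
      list.add("four");
      return true;  // mutation succeeded
    } catch (UnsupportedOperationException e) {
      return false; // the wrapper rejected the mutation
    }
  }

  public static void main(String[] args) {
    List<String> list = new ArrayList<String>();
    list.add("one");
    System.out.println(tryAdd(Collections.unmodifiableList(list))); // false
  }
}
```

Note that unmodifiableList returns a view: changes made through the original backing list are still visible through the wrapper, so copy into a fresh list first if you need a true snapshot.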

On a side note, this rather verbose syntax may get a little easier if the collection literals proposed for Java 7 make it into the language, for example:

List<String> list = ["one", "two", "three"];
final List<String> immutableList = Collections.unmodifiableList(list);

Enum Values

Java 5 introduced typesafe enums, largely based on Joshua Bloch’s proposal (which used classes). The unofficial versions had lots of potential issues (eg having to implement readResolve() to stop serialization from creating new instances). In my opinion, Java’s enums are one of the (increasingly few) language constructs where Java is significantly better than, say, C#, as Java’s enums aren’t just thinly wrapped integers (as they are in C/C++/.Net) and can also have behaviour.

Java enums have a static method called values() which returns an array of all instances of that enum. After the lessons of the Date class, this particular decision was nothing short of shocking. A List would have been a far more sensible choice. Because arrays are mutable, the array of instances must be defensively copied each time values() is called, forcing you to write code like this repeatedly:

public enum Season {
  SPRING, SUMMER, AUTUMN, WINTER;

  private static final List<Season> VALUES =
    Collections.unmodifiableList(
      new ArrayList<Season>(Arrays.asList(values())));

  public static List<Season> getValues() { return VALUES; }
}
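The defensive copy is easy to observe: every call to values() returns a distinct array, and vandalising one copy leaves the enum itself untouched. A quick sketch (ValuesDemo is an invented name):

```java
// Illustrative sketch: values() hands out a fresh defensive copy every call.
public class ValuesDemo {
  public enum Season { SPRING, SUMMER, AUTUMN, WINTER }

  public static void main(String[] args) {
    Season[] a = Season.values();
    Season[] b = Season.values();
    System.out.println(a != b);             // true: distinct arrays
    a[0] = Season.WINTER;                   // vandalise our copy...
    System.out.println(Season.values()[0]); // SPRING: the enum is unharmed
  }
}
```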

Is This Really Necessary?

There is nothing inherently wrong with temporary objects. It’s a question of degree. So creating several hundred (or thousand) temporary objects isn’t any big deal. At some point there is such a thing as too many.

I will demonstrate two things here:

  1. The cost of temporary arrays; and
  2. The correct way to generate random enums.

This post was prompted by Java: Pick a random value from an enum?, where the poster created a new Random on every iteration. Years ago this was genuinely bad: Randoms were seeded with the current time (in milliseconds or even seconds), so instances created within a short space of time produced highly correlated sequences. It’s also worth answering the question of fairness.
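The old seeding problem is easy to demonstrate: two java.util.Random instances constructed with the same seed produce identical sequences, which is exactly what happened when several time-seeded Randoms were created in the same clock tick. A minimal sketch (SeedDemo is an invented name):

```java
import java.util.Random;

// Illustrative sketch: equal seeds yield identical sequences, which is why
// time-seeded Randoms created in the same clock tick used to be correlated.
public class SeedDemo {
  public static boolean sameSequence(long seed) {
    Random r1 = new Random(seed);
    Random r2 = new Random(seed);
    for (int i = 0; i < 100; i++) {
      if (r1.nextInt(4) != r2.nextInt(4)) {
        return false;
      }
    }
    return true;
  }

  public static void main(String[] args) {
    System.out.println(sameSequence(12345L)); // true
  }
}
```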

This enum will be used:

public enum Season {
  SPRING, SUMMER, AUTUMN, WINTER;

  private static final List<Season> VALUES1 =
      Collections.unmodifiableList(
          new ArrayList<Season>(Arrays.asList(values())));
  private static final Season[] VALUES2 = values();
  private static final int SIZE = VALUES2.length;
  
  private static final Random RANDOM = new Random();

  public static Season random1() {
    return values()[new Random().nextInt(SIZE)];
  }

  public static Season random2() {
    return values()[RANDOM.nextInt(SIZE)];
  }

  public static Season random3() {
    return VALUES1.get(new Random().nextInt(SIZE));
  }

  public static Season random4() {
    return VALUES1.get(RANDOM.nextInt(SIZE));
  }

  public static Season random5() {
    return VALUES2[new Random().nextInt(SIZE)];
  }

  public static Season random6() {
    return VALUES2[RANDOM.nextInt(SIZE)];
  }
}

with the following test harness:

public class Temporary {
  private static final int COUNT = 30000000;

  public static void main(String args[]) {
    ScheduledExecutorService executor = Executors.newSingleThreadScheduledExecutor();
    MemoryMonitor monitor = new MemoryMonitor();
    executor.scheduleAtFixedRate(monitor, 0, 10, TimeUnit.MILLISECONDS);
    int[] tally = new int[4];
    long baseline = usedMemory();
    long start = System.nanoTime();
    for (int i=0; i<COUNT; i++) {
      tally[Season.random1().ordinal()]++;
    }
    long end = System.nanoTime();
    executor.shutdown();
    try {
      executor.awaitTermination(Long.MAX_VALUE, TimeUnit.NANOSECONDS);
    } catch (InterruptedException e) {
      throw new RuntimeException(e);
    }
    long memoryUsed = monitor.peak() - baseline;
    for (Season season : Season.values()) {
      System.out.printf("%s: %,d%n", season, tally[season.ordinal()]);
    }
    System.out.printf("%nCompleted %,d iterations in %,.3f seconds using %,d bytes%n",
        COUNT, ((end - start) / 1000000) / 1000.0d, memoryUsed
    );
  }

  private static long usedMemory() {
    Runtime runtime = Runtime.getRuntime();
    return runtime.totalMemory() - runtime.freeMemory();
  }

  private static void waitForEnter() {
    try {
      new BufferedReader(new InputStreamReader(System.in)).readLine();
    } catch (IOException e) {
      e.printStackTrace();
    }
  }
}

and

public class MemoryMonitor implements Runnable {
  private final Runtime runtime = Runtime.getRuntime();
  private final List<Long> usage = new ArrayList<Long>();

  @Override
  public void run() {
    usage.add(runtime.totalMemory() - runtime.freeMemory());
  }

  public List<Long> usage() {
    return usage;
  }

  public long peak() {
    return Collections.max(usage);
  }
}

with each method being run in turn.

The Results

Method    Run time (seconds)    Peak memory usage (bytes)
random1   9.746                 681,288
random2   5.914                 665,592
random3   5.123                 669,408
random4   1.476                 18,376
random5   4.593                 661,368
random6   1.056                 18,376

From this we can draw several conclusions:

  1. Creating a Random on every invocation is fair (in distribution terms) but has a high cost in temporary objects and CPU time (roughly 1.5-4x slower);
  2. The garbage collector is working: both the temporary arrays and the temporary Random objects contribute to peak memory usage, but Java reclaims them almost as fast as they are created, and the churn is probably triggering extra garbage collections;
  3. Using a static copy of the enum values is 2-5x as fast;
  4. A static array copy is roughly 10-40% quicker than a static List copy; and
  5. The most optimized version (random6) uses about 35x less memory and runs over 9x quicker than the least optimized (random1).

Conclusion

For significant use of an enum’s values() method, it’s a no brainer: create and use a static copy instead. It’s faster and uses way less memory. On non-trivial applications it will also mean less memory fragmentation and less (possibly expensive) GCs, which is a significant issue with high-usage Web applications.

Hard Numbers on Stackoverflow Careers

This is a follow-up to Joel Inc., Stackoverflow Careers and Jumping Sharks, posted late last week.

Joel posted Stack Stats this week in which he demonstrates the correlation between Stackoverflow reputation and Careers take-up.

Now the exact meaning of this graph isn’t stated. Does it mean that 30% of users with 50,000 reputation and above have submitted CVs? Let’s assume it does.

Reputation    Percentage    Users in Range    # of CVs
1,000         8%            3,257             261
2,000         10%           1,196             120
3,000         11%           640               70
4,000         12%           373               45
5,000         13%           244               32
6,000         13.5%         153               21
7,000         14%           112               16
8,000         15%           95                14
9,000         16%           48                8
10,000        17%           220               37
15,000        19.5%         79                15
20,000        22%           58                12
30,000        26%           28                7
50,000        30%           15                5
Total         10.2%         6,518             663

So we have 145 employers (as of 15 Dec 2009) and 663 job seekers of the 6,518 in the sample, a take-up of 10.2%.

I would guess that the vast majority of those would have paid $29 for 3 years, so these ~700 users account for roughly $20,000 of revenue over 3 years.

Is this kind of barrier—charging job seekers—really worth that kind of revenue stream?

Of course the hope is both that:

  1. The number of candidates will substantially grow; and
  2. Many (or most) of them will convert to paying $99/year in 3 years.

Three years from now I would consider it optimistic that the users matching the profile might number 40,000 instead of 10 to 25,000. That won’t accurately reflect the natural attrition rate either (there are some users who have already become basically inactive).

Considering that not all users will be looking for work at the same time (assuming Goldman Sachs’ next crackpot house of cards hasn’t come tumbling down yet), it’s hard to imagine the take-up rate being higher than 15-20% and that’s being optimistic.

So if everything goes well 10,000 people are paying $99/year. $1 million a year—basically money for nothing—is nothing to sneeze at. I can’t see it happening however.

Even if it does, it’s questionable whether this is the critical mass required to attract employers. I guess time will tell.

Joel Inc., Stackoverflow Careers and Jumping Sharks

Joel Spolsky is a legend in the programming world. His blog—Joel on Software—is the most popular and well-known programming blog. In mid-2008, Joel and Jeff Atwood—of Coding Horror fame—launched Stackoverflow, a free site for asking programming questions.

Stackoverflow is clearly a success but the sister sites haven’t fared nearly as well. Recently Jeff and Joel launched Stackoverflow Careers, a site for programmers to find jobs and employers to find programmers.

Stackoverflow Careers may just be a bridge too far.

Let’s Talk About… Joel

Joel on Software was the first blog I ever read. I read it before anyone really knew what a blog was. Controlling Your Environment Makes You Happy was one of those things I read that completely changed my perspective. How Microsoft Lost the API War I consider to be almost prophetic in its predictions regarding the then-Longhorn now-Vista boondoggle and desktop bloodletting by Web applications.

But something isn’t right in the Land of Joel.

In the late 90s, during a brief flirtation with strenuous physical activity, I learnt to SCUBA dive. I went on one of those courses that consist of an evening of instruction on the evils of nitrogen, a weekend in the pool and then a weekend in the ocean. This was a PADI course, which is very much the consumer-grade diving education, and I state that as a simple observation, not a judgement or accusation. At the other end of the spectrum is NAUI.

PADI is all about selling you stuff—gear, courses, whatever. A friend remarked to me that PADI stood for Put Another Dollar In.

NAUI on the other hand is much more highly regarded but less prolific. It is a not-for-profit organisation. Whereas some accuse PADI of dumbing down SCUBA training, nothing of the sort is levelled against NAUI. That same friend said NAUI stands for Not Another Untrained Idiot.

What does this have to do with Joel? Whereas Joel was once the NAUI-like font of wisdom, now it just seems like he’s trying to sell me stuff.

Jumping the Shark

Of course I’m not the first to articulate this.

In recent times Joel has taken quite a bashing, for example Joel Spolsky, Snake-Oil Salesman and Sten Anderson’s I Heart Joel on Software.

Sten’s comments are particularly interesting because what he says is true: all Joel’s endless talk about great programmers is thinly disguised disdain for the 99% of us that didn’t go to MIT, Stanford, UW, Yale, Harvard or UPenn.

Amusingly, Jeff Atwood posted several years ago Has Joel Spolsky Jumped the Shark? going so far as to say:

I reject this new, highly illogical Joel Spolsky. I demand the immediate return of the sage, sane, wise Joel Spolsky of years past. But maybe it's like wishing for a long-running television show to return to its previous glories.

I guess he got over it.

Side note: Jeff was responding to Language Wars (emphasis added by Jeff):

FogBugz is written in Wasabi, a very advanced, functional-programming dialect of Basic with closures and lambdas and Rails-like active records that can be compiled down to VBScript, JavaScript, PHP4 or PHP5. Wasabi is a private, in-house language written by one of our best developers that is optimized specifically for developing FogBugz; the Wasabi compiler itself is written in C#.

I admit it: I love a good rant. And not just ranting for ranting’s sake but a rant with a message, an essential kernel of truth, a pearl of wisdom. It’s hard to forget Zed Shaw’s now-infamous (albeit retracted) Rails is a Ghetto rant of nearly two years ago. Yesterday I read Giles Bowkett’s Blogs are Godless Communist Bullshit. It’s long but entertaining and absolutely worth reading.

But is all this criticism justified?

Firstly, some background.

IT Recruitment

In Europe and Australia programmers (and other IT professionals) are found in three ways:

  1. Direct recruitment by the employer. This usually means big employers who have dedicated HR departments to filter out CVs, book interviews and so on. Such candidates will most likely become salaried employees of the company;
  2. Word of mouth; and
  3. Through recruitment agencies.

In my experience recruitment agents are loathed by IT workers (eg Why is IT recruitment so bad?). Most of the time they’re utterly clueless (I have in all seriousness been asked “I see you have 7 years of Java experience but do you have any J2SE experience?”). Horror stories are legion. IT recruitment in London in particular is a soul-destroying experience.

Recruiters will fill positions on a permanent (salaried) or contract (paid by the hour, day, week or month) basis.

The recruiter will earn a fee that is typically around 10-15% of the candidate’s annual salary upon successfully filling the position. If the employee leaves in the probationary period (typically three months) some or all of that will be refunded.

With contractors the recruiter will typically earn a margin of 10-25% (or even higher) on top of the contractor’s rate either for a fixed term (eg it scales down after a year) or in perpetuity. Expat contractors typically have criminally high margins put on top of what they earn, at least initially.

So recruitment is expensive.

Compare that to placing ads on job boards, which typically costs hundreds of dollars (eg the jobs.joelonsoftware.com FAQ and Monster Job Posting) and lasts for weeks. One ad can potentially fill multiple positions. Employers will typically keep CVs on file, and getting contacted some time after applying is not uncommon. So ads can be effective, although there can be a lot of chaff.

IT recruitment is broken so there’s definitely room for a solution.

Stackoverflow Careers

Careers is another site hoping to capitalize on the success of Stackoverflow. Programmers routinely demonstrate the ability to self-organize, which I think explains, at least in part, its success. Computer science is also centuries old. Yes, I said “centuries old”. So before some reddit lurker points out that computers were born in the mid-twentieth century, I suggest you consult the Timeline of computing 2400 BC–1949 and the work of Charles Babbage and others.

The latest money-making venture is Stackoverflow Careers, heavily cross-promoted by Jeff Atwood (Introducing Stack Overflow Careers and Stack Overflow Careers: Amplifying Your Awesome) and Joel (Upgrade your career and Programmer search engine) as well as echoes in the blogosphere.

Despite the success in terms of audience size (Joel in his Google Tech Talk claims a ~30% programmer share, which is huge if true), programmers are a hard bunch to monetize (see Our Amazon Advertising Experiment). Careers is the latest incarnation.

It’s free to have a public CV but having a private CV costs money (allegedly $99/year after 31st December but don’t be surprised if that changes). The private CV is searchable by employers and allows (as Jeff/Joel put it) “deep” integration with Stackoverflow.

Employers are paying too: anywhere from $500 for a week to $5,000 for a year (see the FAQ).

Not cheap. So what are we getting for our money?

The Hollywood Analogy

Joel claims:

In Hollywood, studios who need talent browse through portfolios, find two or three possible candidates, and make them great offers. And then they all try to outdo each other providing plush work environments and great benefits.

Make no mistake: you’re being sold something here. The allure of stardom is deliberate bait. Giles succinctly sums this up:

This last part is laugh-out-loud funny. That's not how Hollywood works. I'm an actor, I've been studying acting for years, and I know award-winning actors who still have to go out on auditions like everybody else. You might wonder how a newbie like me, with nothing but Cop #3 in a student film to his credit, can claim to know award-winning, seasoned professionals. It's simple: because they have to go on auditions like everybody else.

I will take one issue with what Giles said:

Robert Downey Jr. had to fight like hell to get the lead role in Iron Man.

Yes, but there’s a reason for that. He had a serious drug problem and any studio is going to balk at betting a billion dollar franchise on a cokehead.

But I digress.

Here’s another difference: actors are basically the most flexible labour market in the world. They go where the work is. The film shoots for 40 weeks in Siberia? Fine, no problem. Actors go where the films are.

Programmers on the other hand are not nearly as flexible. Programmers are regular workers. We have families, friends, mortgages and so on. Sure we might move from St Louis to San Francisco for a job but we also might not. I think it’s safe to say that more often than not, we’re not looking to move across country. Hell, we’ll even turn down a job if it’s in the wrong part of the same city.

Imagine how far you’d get as an actor if you said “I’d love to work on your TV show but the studio is in Burbank and the commute from Redondo Beach is a bitch so I think I’ll pass.” (Only knowing LA to change planes, I make no apologies for any gross errors in LA geography I may have just made.)

So instead of there being a handful of job markets for actors there are probably 100 or more for programmers.

So What’s In It For Me?

From the FAQ:

If you are seeking employment, we do require a modest annual payment to file your CV. Filing your CV makes it eligible to appear in searches by hiring managers via our private search interface. This fee allows us to ensure employers that everyone they find is actively looking for a job.

Isn’t the fact that I’ve filled out a CV and ticked a box that says I’m looking for work sufficient? Apparently not.

Consider Finding Great Developers:

The great software developers, indeed, the best people in every field, are quite simply never on the market.

So the target market seems to be those developers who think they’re great developers but actually aren’t. If they were they wouldn’t be looking. I get it: everyone is better than average.

Giles sums this up:

The number one rule of the con: you can't con an honest man … Try to get something for nothing, just because Joel Spolsky said you could? You're going to get burned.

I should point out that I signed up in the beta. I was under no illusions however (then again, who ever thinks they are?). The chances of an employer looking in my remote backwater are next to nil but I figured at $30, at worst I was out two lunches from Nando's.

And If I’m A Hiring Manager?

Approximately 6,500 Stackoverflow users have 1,000 reputation or more. That cut-off is arbitrary, but the point is this: integration with Stackoverflow only adds value if you’ve contributed a sufficiently large number of answers to mine. Go up to 2,000 rep and you’re down to fewer than 3,200 users. And so on.

Let’s be optimistic and say the potential audience for whom Stackoverflow will add value to their CV is 10,000. A number of these can be eliminated as being students, retired, incapable of working (eg disability or serious prolonged injury) or simply not looking for work.

Joel claims:

But Stack Overflow Careers doesn’t have to be massive. It’s not for the 5.2 million people who visit Stack Overflow; it’s for the top 25,000 developers who participate actively.

Want to know what the 25,000th user looks like?

I mean no disrespect to these people but “participate actively”?

Take careful note of the language too: 25,000 from 5.2 million? Hell, you’re already the top half of one percent! You’re elite, positively l33t! Uh huh.

Crunching the Numbers

There are at least 100 distinct geographical job markets for an employer. If you’re lucky 10% of the pool is accessible to you either by being in the right place or willing to relocate.

Of those 10%, maybe 10% have the right skills. The importance of programming languages is definitely overstated by (typically clueless) HR departments and recruiters. It’s also true that good developers can program in anything (given sufficient time) but not all languages are interchangeable in all situations. I would consider a Java Web developer to be largely interchangeable with an ASP.NET C# Web developer (in that there is sufficient crossover to enable a sufficiently speedy transition) but I wouldn’t hire a Ruby programmer to do C programming for microcontrollers and embedded devices. The transition from unmanaged (eg C/C++) to managed (eg C#/.Net) code can be steep enough.

Of this reduced pool, how many have the right experience? The more experienced you get as a developer, generally the more important domain knowledge becomes. I wouldn’t hire a mobile telephony architect to design a system for market-making options on commodities futures because you’d spend 6 months explaining bid/ask, spreads, what a future is, what an option is, in-the-money, out-of-the-money, short, long, contango, volatility, Black-Scholes… the list goes on.

Of the remaining few who has the right amount of experience? You wouldn’t hire a fresh college grad to mentor junior developers.

Now that you’ve got a short list (“short” being the operative word), consider how many are actually available.

And you haven’t even interviewed anybody yet!

So if you optimistically assume that 10,000 people sign up for Careers, chances are you’re down to less than five. Of those, how many are seriously looking? They’re paying by the year so why not have your CV out there just in case?

Don’t be fooled, paying to file your CV doesn’t ensure you’re seriously looking. The only thing it ensures is that you’re a revenue stream.

Critical Mass

Matching candidates to employers is low probability. The number who fit the profile is probably 1 in 1,000 or even less.

So of the 10 to 25 thousand relevant potential candidates, some percentage will actually be looking for work. Of that percentage, a smaller percentage will pay to be seen by employers, less than might otherwise be seen if the service was free (for job seekers). I expect that number to be 2,000 or less and that number is, in my opinion, inflated by the cheap beta registration.

So an employer is going to pay big bucks—much more than a typical job ad—to reach a much smaller target audience?

People will pay money if they are getting value for money. Paying $15,000 to a recruiter to find you a programmer is cheap because the recruiter is doing most of the legwork and assuming a large part of the risk (in that they don’t typically get paid if you don’t find someone you like). Job ads are cheap because they may reach tens or even hundreds of thousands of candidates.

It’s like Careers is charging as if it’s already a proven success.

Things like this work on the principle of critical mass. Take eBay. People buy on eBay because there are things to be bought. People sell on eBay because people will buy them. Without either group the site fails. A job board is no different. People go to them because they have jobs they want. Companies advertise on them because they reach the right audience.

So what job board—and let’s be honest; that’s what it is—is going to survive by restricting itself to 10 to 25 thousand candidates globally? Perhaps Jeff and Joel are thinking that it will be so successful that everyone else will just have to sign up anyway.

Good luck with that business strategy.

Is It Legal?

I have to wonder if anyone has bothered to ask this yet. Consider Job seekers are hit by illegal fees. Not just in the United Arab Emirates is it illegal to charge job seekers. Also, How job seekers can best use recruitment agencies (emphasis added):

Recruitment agencies make their money by charging employers a fee for a permanent hire or an hourly or daily margin on a temporary placement. It is illegal to charge job seekers a fee for finding them work.

That’s for Australia. The point here is that Jeff and Joel probably need to be very careful about how they define the Careers site if they don’t want to run afoul of laws set up to protect the unemployed from unscrupulous practices.

Smoke and Mirrors

From the FAQ:

If you are seeking employment, we do require a modest annual payment to file your CV. Filing your CV makes it eligible to appear in searches by hiring managers via our private search interface. This fee allows us to ensure employers that everyone they find is actively looking for a job.

We’re being sold something here.

Also consider Upgrade your career:

Employers can see how good you are at communicating, …

OK

… how well you explain things, …

OK

… how well you understand the tools that you’re using, …

Er… OK.

… and generally, if you’re a great developer or not.

Whoa. Sorry, but the fact that I know that parsing HTML with regular expressions is retarded, that I can explain how to add a jQuery click() handler, and that not sanitizing user input to SQL statements is idiotic doesn’t make me a great developer. It means anything from I like teaching to I’m narcissistic enough to like hearing the sound of my own voice (virtually speaking), perhaps both.

And let’s not forget that all of this can be established by simply including a URL to your Stackoverflow profile on your CV.

Conclusion

The numbers just don’t add up on this one. My only question is how long it’ll be before that sinks in and the model changes. With so much free choice, it’s just not viable to charge job seekers while simultaneously limiting the candidate pool for employers and charging them an arm and a leg for information they can get from a URL.

Google Wave Invites to Give Away

I've got a dozen or so of these I don't really need. Drop me a line and I'll send you one, first in first served until they run out.

Programming Puzzles, Chess Positions and Huffman Coding

This week Andrew Rollings asked the question ProgrammerPuzzle: Encoding a chess board state throughout a game on StackOverflow.

Now I’ll admit that I love this kind of question. I’m not really such a big fan of Code Golf as that’s an exercise in writing terse, unreadable code (although some of the solutions have been brilliant). But this Chess problem is the sort of thing that will allow a programmer to demonstrate his or her mental acuity and problem solving ability (or the lack thereof).

The Problem

What is the most space-efficient way you can think of to encode the state of a chess game (or subset thereof)? That is, given a chess board with the pieces arranged legally, encode both this initial state and all subsequent legal moves taken by the players in the game.

This image illustrates the starting Chess position. Chess is played on an 8x8 board, with each player starting with an identical set of 16 pieces consisting of 8 pawns, 2 rooks, 2 knights, 2 bishops, 1 queen and 1 king, as illustrated here:

Positions are generally recorded as a letter for the column followed by the number for the row so White’s queen is at d1. Moves are most often stored in algebraic notation, which is unambiguous and generally only specifies the minimal information necessary. Consider this opening:

1. e4 e5
2. Nf3 Nc6
3. …

which translates to:

  1. White moves king’s pawn from e2 to e4 (it is the only piece that can get to e4 hence “e4”);
  2. Black moves the king’s pawn from e7 to e5;
  3. White moves the knight (N) to f3;
  4. Black moves the knight to c6.

The board looks like this:

An important ability for any programmer is to be able to correctly and unambiguously specify the problem.

So what’s missing or ambiguous? A lot as it turns out.

Board State vs Game State

The first thing you need to determine is whether you’re storing the state of a game or the position of pieces on the board. Encoding simply the positions of the pieces is one thing but the problem says “all subsequent legal moves”. The problem also says nothing about knowing the moves up to this point. That’s actually a problem as I’ll explain.

Castling

The game has proceeded as follows:

1. e4 e5
2. Nf3 Nc6
3. Bb5 a6
4. Ba4 Bc5

The board looks as follows:

White has the option of castling. Part of the requirements for this is that the king and the relevant rook can never have moved, so whether the king and each rook of each side have moved will need to be stored. Obviously if a piece isn’t on its starting square it has moved; otherwise the fact needs to be specified.

There are several strategies that can be used for dealing with this problem.

Firstly, we could store an extra 6 bits of information (1 for each rook and each king) to indicate whether that piece has moved. We could streamline this by only storing a bit for one of these six squares if the right piece happens to be in it. Alternatively, we could treat each unmoved piece as another piece type, so instead of 6 piece types on each side (pawn, rook, knight, bishop, queen and king) there are 8 (adding unmoved rook and unmoved king).
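
The first of those strategies can be sketched in a few lines of Java. This is a minimal illustration of my own (the class and constant names are invented, not from any chess library): one bit per king and per rook, packed into a single int.

```java
// Tracks the six "has moved" flags needed for castling: each king and
// each rook gets one bit, so 6 bits packed into a single int.
public class CastlingRights {
  public static final int WHITE_KING   = 1 << 0;
  public static final int WHITE_A_ROOK = 1 << 1;
  public static final int WHITE_H_ROOK = 1 << 2;
  public static final int BLACK_KING   = 1 << 3;
  public static final int BLACK_A_ROOK = 1 << 4;
  public static final int BLACK_H_ROOK = 1 << 5;

  private int moved; // all bits zero: nothing has moved yet

  public void markMoved(int piece) {
    moved |= piece;
  }

  // Kingside castling requires that neither the king nor the h-rook has moved.
  public boolean canCastleKingside(boolean white) {
    int mask = white ? (WHITE_KING | WHITE_H_ROOK) : (BLACK_KING | BLACK_H_ROOK);
    return (moved & mask) == 0;
  }

  public static void main(String[] args) {
    CastlingRights rights = new CastlingRights();
    System.out.println(rights.canCastleKingside(true));  // true
    rights.markMoved(WHITE_H_ROOK);
    System.out.println(rights.canCastleKingside(true));  // false
  }
}
```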

En Passant

Another peculiar and often-neglected rule in Chess is En Passant.

The game has progressed.

1. e4 e5
2. Nf3 Nc6
3. Bb5 a6
4. Ba4 Bc5
5. O-O b5
6. Bb3 b4
7. c4

Black now has the option of moving his pawn on b4 to c3, taking the White pawn on c4. This only happens on the first opportunity, meaning that if Black passes on the option now he can’t take it next move. So we need to store this.

If we know the previous move we can definitely answer whether En Passant is possible. Alternatively we can store, for each pawn on its 4th rank, whether it has just moved there with a double move forward. Or we can look at each possible En Passant position on the board and have a flag to indicate whether it’s possible or not.
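
One of those options can be sketched compactly: since at most one en passant capture can be pending, it is enough to record whether a double pawn move just happened and on which file. This is my own hypothetical encoding (4 bits: a valid flag plus a 3-bit file), not from the article:

```java
// En passant availability packed into 4 bits: bit 3 = "a pawn just moved
// two squares", bits 0-2 = the file of that pawn (0 = a ... 7 = h).
// Combined with whose turn it is, this identifies the capturable pawn.
public class EnPassantState {
  public static final int NONE = 0; // no en passant capture available

  public static int encode(int file) {
    return 0x8 | (file & 0x7);
  }

  public static boolean isAvailable(int state) {
    return (state & 0x8) != 0;
  }

  public static int file(int state) {
    return state & 0x7;
  }

  public static void main(String[] args) {
    int state = encode(2); // a pawn on the c-file just double-moved
    System.out.println(isAvailable(state)); // true
    System.out.println(file(state));        // 2
  }
}
```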

Promotion

It is White’s move. If White moves his pawn on h7 to h8 it can be promoted to any other piece (but not the king). 99% of the time it is promoted to a Queen, but sometimes it isn’t, typically because promoting to a Queen may force a stalemate when otherwise you’d win. This is written as:

56. h8=Q

This is important in our problem because it means we can’t count on there being a fixed number of pieces on each side. It is entirely possible (but incredibly unlikely) for one side to end up with 9 queens, 10 rooks, 10 bishops or 10 knights if all 8 pawns get promoted.

Stalemate

When in a position from which you cannot win, your best tactic is to try for a stalemate. The most likely variant is where you cannot make a legal move (usually because any move would put your king in check). In this case you can claim a draw. This one is easy to cater for.

The second variant is draw by threefold repetition. If the same board position occurs three times in a game (or will occur a third time on the next move), a draw can be claimed. The positions need not occur in any particular order (meaning it doesn’t have to be the same sequence of moves repeated three times). This one greatly complicates the problem because you have to remember every previous board position. If this is a requirement of the problem, the only possible solution is to store every previous move.

Lastly, there is the fifty move rule. A player can claim a draw if no pawn has moved and no piece has been taken in the previous fifty consecutive moves, so we would need to store how many moves have passed since a pawn was moved or a piece taken (the later of the two). This requires 6 bits (0-63).

Whose Turn Is It?

Of course we also need to know whose turn it is and this is a single bit of information.

Two Problems

Because of the stalemate case, the only feasible or sensible way to store the game state is to store all the moves that led to this position. I’ll tackle that one problem. The board state problem will be simplified to this: store the current position of all pieces on the board ignoring castling, en passant, stalemate conditions and whose turn it is.

Piece layout can be broadly handled in one of two ways: by storing the contents of each square or by storing the position of each piece.

Simple Contents

There are six piece types (pawn, rook, knight, bishop, queen and king). Each piece can be White or Black, so a square may contain one of 12 possible pieces or it may be empty, giving 13 possibilities. 13 values can be stored in 4 bits (0-15), so the simplest solution is to store 4 bits for each square times 64 squares, or 256 bits of information.
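
This 4-bits-per-square layout can be sketched as follows; the class and the particular piece codes (0 = empty, 1-12 = the pieces) are my own choices for illustration. The 256 bits live in four 64-bit longs:

```java
// 64 squares x 4 bits = 256 bits, stored in four longs (16 squares per long).
// Piece codes 0-12 are arbitrary: 0 = empty, 1-12 = the twelve piece/colour
// combinations. Squares are indexed 0-63 (a1 = 0, b1 = 1, ... h8 = 63).
public class PackedBoard {
  private final long[] bits = new long[4];

  public void set(int square, int piece) { // square 0-63, piece 0-12
    int word = square / 16;
    int shift = (square % 16) * 4;
    bits[word] = (bits[word] & ~(0xFL << shift)) | ((long) piece << shift);
  }

  public int get(int square) {
    int word = square / 16;
    int shift = (square % 16) * 4;
    return (int) ((bits[word] >>> shift) & 0xF);
  }

  public static void main(String[] args) {
    PackedBoard board = new PackedBoard();
    board.set(3, 5);                  // some piece code on d1 (index 3)
    System.out.println(board.get(3)); // 5
    System.out.println(board.get(4)); // 0 (empty)
  }
}
```

Reads and writes are a couple of shifts and masks, which is why manipulation is so cheap with this scheme.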

The advantage of this method is that manipulation is incredibly easy and fast. This could even be extended by adding 3 more possibilities without increasing the storage requirements: a pawn that has moved 2 spaces on the last turn, a king that hasn’t moved and a rook that hasn’t moved, which will cater for a lot of previously mentioned issues.

But we can do better.

Base 13 Encoding

It is often helpful to think of the board position as a very large number. This is often done in computer science. For example, the halting problem treats a computer program (rightly) as a large number.

The first solution treats the position as a 64 digit base 16 number but as demonstrated there is redundancy in this information (being the 3 unused possibilities per “digit”) so we can reduce the number space to 64 base 13 digits. Of course this can’t be done as efficiently as base 16 can but it will save on storage requirements (and minimizing storage space is our goal).

In base 10 the number 234 is equivalent to 2 × 10² + 3 × 10¹ + 4 × 10⁰.

In base 16 the number 0xA50 is equivalent to 10 × 16² + 5 × 16¹ + 0 × 16⁰ = 2640 (decimal).

So we can encode our position as p₀ × 13⁶³ + p₁ × 13⁶² + … + p₆₃ × 13⁰, where pᵢ represents the contents of square i.

2²⁵⁶ equals approximately 1.16e77. 13⁶⁴ equals approximately 1.96e71, which requires 237 bits of storage space. That saving of a mere 7.5% comes at a cost of significantly increased manipulation costs.
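
The base 13 packing falls out directly from that positional formula using BigInteger. This sketch is my own (it assumes the square values 0-12 with 0 meaning empty):

```java
import java.math.BigInteger;

// Base 13 packing: treat the 64 square values (0-12, 0 = empty) as the
// digits of one very large number. The result needs at most 237 bits,
// against 256 for the 4-bits-per-square layout.
public class Base13Board {
  private static final BigInteger BASE = BigInteger.valueOf(13);

  public static BigInteger encode(int[] squares) { // 64 values, each 0-12
    BigInteger n = BigInteger.ZERO;
    for (int piece : squares) {
      n = n.multiply(BASE).add(BigInteger.valueOf(piece));
    }
    return n;
  }

  public static int[] decode(BigInteger n) {
    int[] squares = new int[64];
    for (int i = 63; i >= 0; i--) {
      BigInteger[] divRem = n.divideAndRemainder(BASE);
      squares[i] = divRem[1].intValue();
      n = divRem[0];
    }
    return squares;
  }

  public static void main(String[] args) {
    int[] board = new int[64];
    board[0] = 12;
    board[63] = 7;
    BigInteger n = encode(board);
    System.out.println(java.util.Arrays.equals(decode(n), board)); // true
  }
}
```

The repeated BigInteger multiplications and divisions are exactly the increased manipulation cost mentioned above.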

Variable Base Encoding

In legal boards certain pieces can’t appear on certain squares. For example, pawns cannot occur in the first or eighth ranks, reducing the possibilities for those squares to 11. That reduces the possible boards to 11¹⁶ × 13⁴⁸ = 1.35e70 (approximately), requiring 233 bits of storage space.

Actually encoding and decoding such values to and from decimal (or binary) is a little more convoluted but it can be done reliably and is left as an exercise to the reader.
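
As a sketch of that exercise, the mixed-radix arithmetic looks like the base 13 case except that the radix depends on the square. This is my own illustration, and it assumes the two pawn codes are remapped out of the alphabet on ranks 1 and 8 so those squares use values 0-10:

```java
import java.math.BigInteger;

// Mixed-radix packing: the 16 squares on ranks 1 and 8 can never hold a
// pawn, so they only need 11 values; the other 48 squares need 13.
// Total combinations: 11^16 x 13^48, about 233 bits.
public class VariableBaseBoard {
  private static int radix(int square) { // square 0-63, rank-major from a1
    int rank = square / 8;
    return (rank == 0 || rank == 7) ? 11 : 13;
  }

  public static BigInteger encode(int[] squares) {
    BigInteger n = BigInteger.ZERO;
    for (int i = 0; i < 64; i++) {
      n = n.multiply(BigInteger.valueOf(radix(i)))
           .add(BigInteger.valueOf(squares[i]));
    }
    return n;
  }

  public static int[] decode(BigInteger n) {
    int[] squares = new int[64];
    for (int i = 63; i >= 0; i--) {
      BigInteger[] divRem = n.divideAndRemainder(BigInteger.valueOf(radix(i)));
      squares[i] = divRem[1].intValue();
      n = divRem[0];
    }
    return squares;
  }

  public static void main(String[] args) {
    int[] board = new int[64];
    board[0] = 10;  // a rank-1 square: value must be 0-10
    board[8] = 12;  // a rank-2 square: value may be 0-12
    BigInteger n = encode(board);
    System.out.println(java.util.Arrays.equals(decode(n), board)); // true
  }
}
```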

Variable Width Alphabets

The previous two methods can both be described as fixed-width alphabetic encoding. Each of the 11, 13 or 16 members of the alphabet is substituted for another value. Each “character” is the same width but the efficiency can be improved when you consider that each character is not equally likely.

Consider Morse code (pictured left). Characters in a message are encoded as a sequence of dashes and dots. Those dashes and dots are transferred over radio (typically) with a pause between them to delimit them.

Notice how the letter E (the most common letter in English) is a single dot, the shortest possible sequence, whereas Z (one of the least frequent) is two dashes and two dots.

Such a scheme can significantly reduce the size of an expected message but comes at the cost of increasing the size of a random character sequence.

It should be noted that Morse code has another inbuilt feature: dashes are as long as three dots so the above code is created with this in mind to minimize the use of dashes. Since 1s and 0s (our building blocks) don’t have this problem, it’s not a feature we need to replicate.

Lastly, there are two kinds of rests in Morse code. A short rest (the length of a dot) is used to distinguish between dots and dashes. A longer gap (the length of a dash) is used to delimit characters.

So how does this apply to our problem?

Huffman Coding

There is an algorithm for dealing with variable length codes called Huffman coding. Huffman coding creates a variable length code substitution that typically uses the expected frequency of the symbols to assign shorter values to the more common symbols.

In the above tree, the letter E is encoded as 000 (or left-left-left) and S is 1011. It should be clear that this encoding scheme is unambiguous.

This is an important distinction from Morse code. Morse code has the character separator so it can do otherwise ambiguous substitution (eg 4 dots can be H or 2 Is) but we only have 1s and 0s so we choose an unambiguous substitution instead.

Below is a simple implementation (the members shown belong inside an enclosing class, with java.util’s List, Map, Comparator, ArrayList, HashMap, TreeMap, PriorityQueue and Collections imported):

private static class Node {
  private final Node left;
  private final Node right;
  private final String label;
  private final int weight;

  private Node(String label, int weight) {
    this.left = null;
    this.right = null;
    this.label = label;
    this.weight = weight;
  }

  public Node(Node left, Node right) {
    this.left = left;
    this.right = right;
    label = "";
    weight = left.weight + right.weight;
  }

  public boolean isLeaf() {
    return left == null && right == null;
  }

  public Node getLeft() {
    return left;
  }

  public Node getRight() {
    return right;
  }

  public String getLabel() {
    return label;
  }

  public int getWeight() {
    return weight;
  }
}

private static class WeightComparator implements Comparator<Node> {
  @Override
  public int compare(Node o1, Node o2) {
    if (o1.getWeight() == o2.getWeight()) {
      return 0;
    } else {
      return o1.getWeight() < o2.getWeight() ? -1 : 1;
    }
  }
}

private static class PathComparator implements Comparator<String> {
  @Override
  public int compare(String o1, String o2) {
    if (o1 == null) {
      return o2 == null ? 0 : -1;
    } else if (o2 == null) {
      return 1;
    } else {
      int length1 = o1.length();
      int length2 = o2.length();
      if (length1 == length2) {
        return o1.compareTo(o2);
      } else {
        return length1 < length2 ? -1 : 1;
      }
    }
  }
}

private final static List<String> COLOURS;
private final static Map<String, Integer> WEIGHTS;

static {
  List<String> list = new ArrayList<String>();
  list.add("White");
  list.add("Black");
  COLOURS = Collections.unmodifiableList(list);
  Map<String, Integer> map = new HashMap<String, Integer>();
  for (String colour : COLOURS) {
    map.put(colour + " " + "King", 1);
    map.put(colour + " " + "Queen", 1);
    map.put(colour + " " + "Rook", 2);
    map.put(colour + " " + "Knight", 2);
    map.put(colour + " " + "Bishop", 2);
    map.put(colour + " " + "Pawn", 8);
  }
  map.put("Empty", 32);
  WEIGHTS = Collections.unmodifiableMap(map);
}

public static void main(String[] args) {
  PriorityQueue<Node> queue = new PriorityQueue<Node>(WEIGHTS.size(), new WeightComparator());
  for (Map.Entry<String, Integer> entry : WEIGHTS.entrySet()) {
    queue.add(new Node(entry.getKey(), entry.getValue()));
  }
  while (queue.size() > 1) {
    Node first = queue.poll();
    Node second = queue.poll();
    queue.add(new Node(first, second));
  }
  Map<String, Node> nodes = new TreeMap<String, Node>(new PathComparator());
  addLeaves(nodes, queue.peek(), "");
  for (Map.Entry<String, Node> entry : nodes.entrySet()) {
    System.out.printf("%s %s%n", entry.getKey(), entry.getValue().getLabel());
  }
}

public static void addLeaves(Map<String, Node> nodes, Node node, String prefix) {
  if (node != null) {
    addLeaves(nodes, node.getLeft(), prefix + "0");
    addLeaves(nodes, node.getRight(), prefix + "1");
    if (node.isLeaf()) {
      nodes.put(prefix, node);
    }
  }
}

One possible output is:

        White   Black
Empty   0
Pawn    110     100
Rook    11111   11110
Knight  10110   10101
Bishop  10100   11100
Queen   111010  111011
King    101110  101111

For a starting position this equates to 32 × 1 + 16 × 3 + 12 × 5 + 4 × 6 = 164 bits (32 empty squares at 1 bit each, 16 pawns at 3 bits, 12 rooks/knights/bishops at 5 bits and 4 queens/kings at 6 bits).

State Difference

Another possible approach is to combine the very first approach with Huffman coding. This is based on the assumption that most expected Chess boards (rather than randomly generated ones) are more likely than not to, at least in part, resemble a starting position.

So what you do is XOR the 256 bit current board position with a 256 bit starting position and then encode that (using Huffman coding or, say, some method of run length encoding). Obviously this will be very efficient to start with (64 0s probably corresponding to 64 bits) but increase in storage required as the game progresses.
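
The XOR step itself is trivial on the four-long packed representation from the first approach; only the subsequent compression pass takes any work. A sketch of my own (the compression stage is omitted):

```java
import java.util.Arrays;

// XOR the 256-bit packed board (four longs, 4 bits per square) against the
// packed starting position. An unchanged board XORs to all zeros, which a
// run length or Huffman pass can then store very cheaply.
public class BoardDiff {
  public static long[] diff(long[] current, long[] start) {
    long[] out = new long[4];
    for (int i = 0; i < 4; i++) {
      out[i] = current[i] ^ start[i];
    }
    return out;
  }

  // XOR is its own inverse, so applying the diff to the start restores the board.
  public static long[] restore(long[] delta, long[] start) {
    return diff(delta, start);
  }

  public static void main(String[] args) {
    long[] start = {1L, 2L, 3L, 4L};
    long[] current = {1L, 2L, 99L, 4L};
    long[] delta = diff(current, start);
    System.out.println(Arrays.equals(restore(delta, start), current)); // true
  }
}
```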

Piece Position

As mentioned, another way of attacking this problem is to instead store the position of each piece a player has. This works particularly well with endgame positions where most squares will be empty (but in the Huffman coding approach empty squares only use 1 bit anyway).

Each side will have a king and 0-15 other pieces. Because of promotion the exact make up of those pieces can vary enough that you can’t assume the numbers based on the starting positions are maxima.

The logical way to divide this up is store a Position consisting of two Sides (White and Black). Each Side has:

  • A king: 6 bits for the location;
  • Has pawns: 1 (yes), 0 (no);
  • If yes, number of pawns: 3 bits (a value of 0-7 representing 1-8 pawns);
  • If yes, the location of each pawn is encoded: 45 bits (see below);
  • Number of non-pawns: 4 bits (0-15);
  • For each piece: type (2 bits for queen, rook, knight, bishop) and location (6 bits)

As for the pawn locations, the pawns can only be on 48 possible squares (not 64 like the others). As such, it is better not to waste the extra 16 values per pawn that using 6 bits would allow. So if you have 8 pawns there are 48⁸ possibilities, equalling 28,179,280,429,056. You need 45 bits to encode that many values.
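
Since 45 bits fit comfortably in a long, the base 48 packing for pawns doesn’t even need BigInteger. This helper is my own illustration (pawn squares are numbered 0-47 by my own convention):

```java
// Base 48 packing for pawn squares: with at most eight pawns, each on one
// of 48 pawn-legal squares (indices 0-47), 48^8 is about 2.8e13 and the
// packed value fits in 45 bits of a long.
public class PawnPacker {
  public static long encode(int[] pawnSquares) { // up to 8 values, each 0-47
    long n = 0;
    for (int square : pawnSquares) {
      n = n * 48 + square;
    }
    return n;
  }

  public static int[] decode(long n, int count) {
    int[] squares = new int[count];
    for (int i = count - 1; i >= 0; i--) {
      squares[i] = (int) (n % 48);
      n /= 48;
    }
    return squares;
  }

  public static void main(String[] args) {
    int[] pawns = {0, 5, 10, 47, 20, 1, 33, 8};
    long packed = encode(pawns);
    System.out.println(java.util.Arrays.equals(decode(packed, 8), pawns)); // true
  }
}
```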

That’s 115 bits per side (6 + 1 + 3 + 45 + 4 + 7 × 8 for a full side) or 230 bits total. The starting position is the worst case for this method however and it will get substantially better as you remove pieces.

It should be pointed out that there are fewer than 48⁸ possibilities because the pawns can’t all be on the same square. The first has 48 possibilities, the second 47 and so on: 48 × 47 × … × 41 = 1.52e13, or 44 bits of storage.

You can further improve this by eliminating the squares that are occupied by other pieces (including the other side) so you could first place the white non-pawns then the black non-pawns, then the white pawns and lastly the black pawns. On a starting position this reduces the storage requirements to 44 bits for White and 42 bits for Black.
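
The shrinking-radix idea in the last two paragraphs can be sketched by encoding, for each piece, its index into the list of squares still free; the radix drops by one with every piece placed. This implementation is my own illustration:

```java
import java.math.BigInteger;
import java.util.ArrayList;
import java.util.List;

// Shrinking-radix encoding: the first piece has totalSquares candidate
// squares, the next one fewer, and so on. For 8 pawns on 48 squares this
// gives 48 x 47 x ... x 41 combinations, about 44 bits.
public class ShrinkingRadix {
  public static BigInteger encode(int[] squares, int totalSquares) {
    List<Integer> free = new ArrayList<Integer>();
    for (int i = 0; i < totalSquares; i++) {
      free.add(i);
    }
    BigInteger n = BigInteger.ZERO;
    for (int square : squares) {
      int index = free.indexOf(square); // position among still-free squares
      n = n.multiply(BigInteger.valueOf(free.size()))
           .add(BigInteger.valueOf(index));
      free.remove(index);
    }
    return n;
  }

  public static int[] decode(BigInteger n, int count, int totalSquares) {
    // Recover the per-piece indices in reverse, then replay them forwards.
    int[] indices = new int[count];
    for (int i = count - 1; i >= 0; i--) {
      BigInteger radix = BigInteger.valueOf(totalSquares - i);
      BigInteger[] divRem = n.divideAndRemainder(radix);
      indices[i] = divRem[1].intValue();
      n = divRem[0];
    }
    List<Integer> free = new ArrayList<Integer>();
    for (int i = 0; i < totalSquares; i++) {
      free.add(i);
    }
    int[] squares = new int[count];
    for (int i = 0; i < count; i++) {
      squares[i] = free.remove(indices[i]);
    }
    return squares;
  }

  public static void main(String[] args) {
    int[] pawns = {10, 5, 40};
    BigInteger n = encode(pawns, 48);
    System.out.println(java.util.Arrays.equals(decode(n, 3, 48), pawns)); // true
  }
}
```

Eliminating already-occupied squares, as described above, is a matter of seeding the free list with only the unoccupied squares before encoding each group of pieces.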

Combined Approaches

Each of these approaches has its strengths and weaknesses. You could, say, pick the best 4 and then encode a scheme selector in the first two bits, with the scheme-specific storage after that.

With the overhead that small, this will by far be the best approach.

Game State

I return to the problem of storing a game rather than a position. Because of the threefold repetition we have to store the list of moves that have occurred to this point.

Annotations

One thing you have to determine is whether you are simply storing a list of moves or annotating the game. Chess games are often annotated, for example:

17. Bb5!! Nc4?

White’s move is marked by two exclamation points as brilliant whereas Black’s is viewed as a mistake. See Chess punctuation.

Additionally, you may also need to store free text commentary accompanying the moves.

I am assuming that the moves are sufficient so there will be no annotations.

Algebraic Notation

We could simply store the text of the move here (“e4”, “Bxb5”, etc). Including a terminating byte you’re looking at about 6 bytes (48 bits) per move in the worst case. That’s not particularly efficient.

The second thing to try is to store the starting location (6 bits) and end location (6 bits) so 12 bits per move. That is significantly better.
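
Packing a move this way is just a couple of shifts and masks. A sketch of my own (squares are indexed 0-63, rank-major from a1, which is my own convention):

```java
// A move packed as a 6-bit "from" square and a 6-bit "to" square:
// 12 bits per move, stored in the low bits of an int.
public class PackedMove {
  public static int encode(int from, int to) { // both 0-63
    return (from << 6) | to;
  }

  public static int from(int move) {
    return (move >>> 6) & 0x3F;
  }

  public static int to(int move) {
    return move & 0x3F;
  }

  public static void main(String[] args) {
    int move = encode(12, 28); // e2 (index 12) to e4 (index 28)
    System.out.println(from(move)); // 12
    System.out.println(to(move));   // 28
  }
}
```

Note that this representation needs a small extension for promotion, which also changes the piece on the destination square.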

Alternatively we can determine all the legal moves from the current position in a predictable and deterministic way and state which we’ve chosen. This then goes back to the variable base encoding mentioned above. White and Black each have 20 possible first moves (16 pawn moves and 4 knight moves), more on the second and so on.

Conclusion

There is no absolutely right answer to this question. There are many possible approaches of which the above are just a few.

What I like about this and similar problems is that it demands abilities important to any programmer like considering the usage pattern, accurately determining requirements and thinking about corner cases.

Chess positions taken as screenshots from Chess Position Trainer.