A syntax error in a comment, or: The case of the vanishing parenthesis (10 Jan 2006)

The other day, Visual C++ would not compile this code, reporting lots and lots of errors:

  if (!strncmp(text, "FOO", 3)) 
  {
    foobar(text); //üüü 
  }
  else 
  {
    gazonk();
  }

Now this was funny because that code had not changed in ages and had always compiled just fine. At first, I couldn't explain what was going on. Hmmmm... note the funny u-umlauts in the comment. Why would someone use a comment like that? Well, the above code was inherited from source code originally written on an HP-UX system. For many years, the default character encoding on HP-UX systems has been Roman8. In that encoding, the above comment looked like this:

    foobar(text); //■■■

(In case your browser cannot display them: those characters are the Unicode code point U+25A0, a filled box.)

So the original programmer used this special character for graphically highlighting the line. In Roman8, the filled box has a character code of 0xFC. On a Windows system in the US or Europe, which defaults to displaying characters according to ISO8859-1 (aka Latin1), 0xFC will be interpreted as the German u-umlaut ü.

So far, so good, but why the compilation errors?

On the affected system, I ran the code through the C preprocessor (cpp), and ended up with this preprocessed version:

  if (!strncmp(text, "FOO", 3)) 
  {
    foobar(text);
  else 
  {
    gazonk();
  }

Wow - the preprocessor threw away the comment, as expected, but also the closing parenthesis } on the next line! Hence, the parentheses in the code are now unbalanced, which the compiler complains bitterly about.

But why would the preprocessor misbehave so badly on this system? Shortly before, I had installed the Windows multi-language UI pack (MUI) to run tests in Japanese; because of that, the system defaulted to a Japanese locale. In the default Japanese locale, Windows assumes that all strings are encoded according to the Shift-JIS standard, which is a multi-byte character set (MBCS).
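
If you want to check which encoding your own system currently assumes, the Win32 function GetACP() reports the active ANSI code page; on a default Japanese setup it returns 932, Microsoft's Shift-JIS code page, while a US or Western European setup returns 1252. A minimal sketch:

  #include <windows.h>
  #include <stdio.h>

  int main(void)
  {
    /* 932 = Shift-JIS (Japanese), 1252 = Western European, ... */
    printf("Active ANSI code page: %u\n", GetACP());
    return 0;
  }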

Shift-JIS tries to tackle the problem of representing the several thousand characters used in Japanese. The code positions 0-127 are identical to US ASCII. In the range from 128 to 255, some byte values indicate "first byte of a two-byte sequence" - and 0xFC is indeed one of those indicator bytes.
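
Windows exposes this check as IsDBCSLeadByte(), and the ranges themselves are easy to write down. A minimal sketch, assuming the lead-byte ranges of Windows code page 932 (0x81-0x9F and 0xE0-0xFC):

  #include <stdio.h>

  /* Lead-byte ranges of Windows code page 932 (Shift-JIS).
     On Windows, IsDBCSLeadByte() performs the same test for the active code page. */
  static int is_sjis_lead_byte(unsigned char b)
  {
    return (b >= 0x81 && b <= 0x9F) || (b >= 0xE0 && b <= 0xFC);
  }

  int main(void)
  {
    printf("0xFC is a lead byte: %s\n", is_sjis_lead_byte(0xFC) ? "yes" : "no");
    return 0;
  }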

So the preprocessor reads the line until it finds the // comment marker. It then switches into "comment mode" and reads all characters until the end of the line, only to discard them. (The compiler doesn't care about comments, so why bother it with them?)

Now the preprocessor finds the first 0xFC character, and - according to the active Japanese locale - assumes that it is the first byte of a two-byte character. Hence, it reads the next byte (also 0xFC, the second "box"), converts the sequence 0xFC 0xFC into a Japanese Kanji character, and throws that character away. Then the next byte is read, which again is 0xFC (the third "box" in the comment), and so the preprocessor will slurp another byte, interpreting it as the second byte of a two-byte character.

But the next byte in the file after the third "box" is a 0x0A, i.e. the line-feed character which indicates the end of the line. The preprocessor reads that byte, forms a two-byte character from it and its predecessor (0xFC), discards the character - and misses the end of the line.
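
To make this concrete, here is a toy model of that scanning behavior - not the actual preprocessor code, just a loop that skips a // comment while honoring Shift-JIS lead bytes, run over the raw bytes of the two lines above:

  #include <stdio.h>

  /* Toy model, not the real preprocessor: skip a // comment, but treat
     0x81-0x9F and 0xE0-0xFC as Shift-JIS lead bytes that consume the
     following byte as well. */
  static int is_lead(unsigned char b)
  {
    return (b >= 0x81 && b <= 0x9F) || (b >= 0xE0 && b <= 0xFC);
  }

  int main(void)
  {
    /* Raw bytes of:  "  foobar(text); //<box><box><box>\n  }\n" */
    const unsigned char text[] = "  foobar(text); //\xFC\xFC\xFC\n  }\n";
    size_t i = 18;                      /* first byte after the "//" */

    while (text[i] != '\0' && text[i] != '\n') {
      if (is_lead(text[i]) && text[i + 1] != '\0')
        i += 2;                         /* lead byte swallows the next byte */
      else
        i += 1;
    }
    /* A byte-oriented scanner stops at offset 21 (the first LF); this one
       stops at offset 25, the line feed after the closing } of the if block. */
    printf("comment ends at offset %u\n", (unsigned)i);
    return 0;
  }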

The preprocessor now has no choice but to continue searching for the next LF, which it finds on the next line - but only after the closing parenthesis. Which is why that closing parenthesis never makes it to the compiler. Hocus, pocus, leavenotracus.

So special characters in comments are not a particularly brilliant idea - not just because they might be misinterpreted (in our case, displayed as ü instead of the originally intended box), but because they can actually cause the compiler to fail.

If you think this could only happen in a Roman8 context, consider this variation of the original code:

  if (!strncmp(text, "MENU", 4)) 
  {
    display_ui(text); //Menü
  } 
  else 
  {
    gazonk();
  }

Here, we're simply using the German word for menu in the comment; we're not even trying to be "graphical" and draw boxes. But even this is enough to cause the same compilation issue as in my original example.
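
A quick look at the bytes (assuming the file is stored in Latin-1, with the layout shown above) makes the problem obvious: the ü (0xFC) is the last byte before the line feed, so under a Shift-JIS locale the line feed once again gets swallowed as a trail byte:

  #include <stdio.h>

  int main(void)
  {
    /* Latin-1 bytes of the end of the comment line plus the brace line. */
    const unsigned char tail[] = "//Men\xFC\n  }\n";
    for (size_t i = 0; tail[i] != '\0'; ++i)
      printf("%02X ", (unsigned)tail[i]);
    printf("\n");   /* prints: 2F 2F 4D 65 6E FC 0A 20 20 7D 0A */
    return 0;
  }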

Now, in my particular case, the affected code isn't likely to be compiled in Japan or China anytime soon, except in the non-standard situation in which I performed my experiments with the MUI pack and a Japanese UI. But what if your next open-source project attracts hundreds of volunteers around the world who want to refine the code, and some of those volunteers happen to be from Japan? If you're trying to be too clever (or too patriotic) in your comments, they might have to spend more time finding out why the code won't compile than adding new features to your code.

