The other day, Visual C++ would not compile this code, reporting lots and lots
of errors:
if (!strncmp(text, "FOO", 3))
{
foobar(text); //üüü
}
else
{
gazonk();
}
Now this was funny because that code had not changed in ages, and so far had compiled
just fine. At first, I couldn't explain what was going on. Hmmmm... note the
funny u-umlauts in the comment. Why would someone use a comment like that?
Well, the above code was inherited from source code originally written on an HP-UX
system. For long years, the default character encoding on HP-UX systems has been
Roman8. In that encoding, the
above comment looked like this:
foobar(text); //■■■
(If your browser cannot interpret the Unicode codepoint U+25A0, it represents a
filled box.)
So the original programmer used this special character for graphically highlighting the
line. In Roman8, the filled box has a character code of 0xFC. On a Windows system
in the US or Europe, which defaults to displaying characters according to
ISO8859-1 (aka Latin1), 0xFC will
be interpreted as the German u-umlaut
ü.
So far, so good, but why the compilation errors?
On the affected system, I ran the code through the C preprocessor (
cpp
), and ended
up with this preprocessed version:
if (!strncmp(text, "FOO", 3))
{
foobar(text);
else
{
gazonk();
}
Wow - the preprocessor threw away the comment, as expected, but also the closing
parenthesis
} on the next line! Hence, the parentheses in the code are now
unbalanced, which the compiler complains bitterly about.
But why would the preprocessor misbehave so badly on this system? Shortly before,
I had installed the
Windows multi-language UI pack
(MUI) to run tests in Japanese; because of that, the system defaulted to a Japanese
locale. In the default Japanese locale, Windows assumes that all strings are
encoding according to the
Shift-JIS
standard, which is a multi-byte character set (MBCS).
Shift-JIS tries to tackle the problem of representing the several thousands of
Japanese characters. The code positions 0-127 are identical with US ASCII.
In the range from 128-255, some byte values indicate "first byte of a
two-byte sequence" - and 0xFC is indeed one of those indicator bytes.
So the preprocessor reads the line until it finds the // comment indicators.
The preprocessor changes into "comment mode" and reads all characters until
the end of the line, only to discard them. (The compiler doesn't care about
the comments, so why bother it with them?)
Now the preprocessor finds the first 0xFC character, and - according to
the active Japanese locale - assumes that it is
the first byte of a two-byte character. Hence, it reads the next byte (also 0xFC,
the second "box"), converts the sequence 0xFC 0xFC into a
Japanese Kanji character, and throws that character away.
Then the next byte is read, which again is 0xFC
(the third "box" in the comment), and so the preprocessor will slurp
another byte, interpreting it as the second byte of a two-byte character.
But the next byte in the file after the third "box" is a 0x0A, i.e. the
line-feed character which indicates the end of the line. The preprocessor
reads that byte, forms a two-byte character from it and its predecessor (0xFC),
discards the character - and misses the end of the line.
The preprocessor doesn't have a choice now but to continue searching for the next LF,
which it finds in the next line, but only
after the closing parenthesis. Which is
why that closing parenthesis never makes it to the compiler. Hocus, pocus, leavenotracus.
So special characters in comments are not a particularly brilliant idea; not just because
they might be misinterpreted (in our case, displayed as
ü instead of
the originally intended box), but because they can actually cause the compiler to
fail.
If you think this could only happen in a Roman8 context, consider this variation
of the original code:
if (!strncmp(text, "MENU", 4))
{
display_ui(text); //Menü
}
else
{
gazonk();
}
Here, we're simply using the German translation for
menu in the comment;
we're not even trying to be "graphical" and draw boxes in our comments.
But even this is enough to cause the same compilation issue as with my original example.
Now, in my particular case, the affected code isn't likely to be compiled in Japan or China
anytime soon, except in that non-standard situation when I performed my experiments with
the MUI pack and a Japanese UI. But what if your next open-source
project attracts hundreds of volunteers around the world who want to refine the code, and
some of those volunteers happen to be from Japan? If you're trying to be too clever
(or too patriotic) in your comments, they might have to spend more time on finding
out why the code won't compile than on adding new features to your code.