Character Set Woes, or How C# and Java Differ When Writing to a File

by expack3

I recently finished porting my Java JAR-based console app to C# .NET. It was a fun, challenging little venture, but one thing made me want to repeatedly slam my head into the nearest wall: the difference in how C# and Java write an integer to a file.

 

Before I get into the differences, here's an example rundown in pseudocode:

1. Assume the code will eventually be outputting to a file in the Latin-1 character set.

2. Assume the following variables exist:

  • The 16-bit integer k, used to keep track of how many bits of the current byte have been filled.
  • The 16-bit integer j, used to keep track of which character of the current bit string is being accessed.
  • The 16-bit integer i, used to keep track of which string array is being accessed.
  • The 16-bit integer s, used to store the final Latin-1 byte value.
  • The list of 2×1 string arrays vectorString, which stores the complete Huffman compression encoding table; the first row of each array contains a character, while the second row contains a string holding the bits that represent the character in row 1.
  • The byte byter, which accumulates the bits of the byte currently being built.
  • The file writer writer; it is set to use Latin-1 encoding.

If k does not equal 8, do the following:

    If the character in element j of row 2 of the string array kept in element i of vectorString is equal to the character 1, do the following:

        Make byter equal to the result of applying a bitwise OR to byter and the result of shifting 1 left by 7-k bits.

    Increment k and j by 1.

Otherwise, do the following:

    Make s equal to the result of applying a bitwise AND to byter and 0xFF.

    Write the Latin-1 character represented by s to a file.

    Make byter equal to 0.

    Make s equal to 0.
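For anyone who would rather read code than pseudocode, here is a minimal Java sketch of the same bit-packing idea. It is not the code from my app: the class name BitPacker, the pack method, and the plain list of bit strings standing in for row 2 of vectorString are all made up for illustration, it collects bytes in memory instead of writing a file, and it assumes a leftover partial byte should be padded with zeros and flushed at the end.

```java
import java.io.ByteArrayOutputStream;
import java.util.List;

public class BitPacker {
    // Packs strings of '0'/'1' characters into bytes, most significant bit first.
    // "codes" is a hypothetical stand-in for the bit strings kept in vectorString.
    public static byte[] pack(List<String> codes) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        int byter = 0; // the byte currently being built
        int k = 0;     // how many bits of byter are filled so far
        for (String bits : codes) {
            for (int j = 0; j < bits.length(); j++) {
                if (bits.charAt(j) == '1') {
                    byter |= 1 << (7 - k); // set bit k, counting from the top
                }
                k++;
                if (k == 8) {
                    out.write(byter & 0xFF); // emit the finished byte
                    byter = 0;
                    k = 0;
                }
            }
        }
        if (k > 0) {
            // Assumption: flush a trailing partial byte, padded with zero bits.
            out.write(byter & 0xFF);
        }
        return out.toByteArray();
    }

    public static void main(String[] args) {
        for (byte b : pack(List.of("101", "00001", "1"))) {
            System.out.printf("%02X ", b & 0xFF);
        }
    }
}
```

The bit strings "101" and "00001" fill one byte (10100001), and the lone "1" ends up in the top bit of a second, padded byte.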

 

OK, now that we know what's supposed to happen, here's how to ensure s is correctly written to the file in each language, and what that actually looks like:

Java

Write the low-order byte of s straight to the file; the stream deals in raw bytes, so no encoding is applied.

bitWriter.write(s); (bitWriter is a properly-initialized FileOutputStream)
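To make the Java side concrete, here is a small self-contained sketch showing that FileOutputStream.write(int) stores only the low 8 bits of its argument. The class name RawByteWrite, the helper writeAndReadBack, and the temporary file are all just for illustration; none of them come from my app.

```java
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;

public class RawByteWrite {
    // Writes each value with FileOutputStream.write(int) into a temporary file,
    // then returns the bytes that actually landed on disk.
    public static byte[] writeAndReadBack(int... values) {
        try {
            Path tmp = Files.createTempFile("rawbyte", ".bin");
            try {
                try (FileOutputStream bitWriter = new FileOutputStream(tmp.toFile())) {
                    for (int s : values) {
                        bitWriter.write(s); // only the low-order 8 bits are written
                    }
                }
                return Files.readAllBytes(tmp);
            } finally {
                Files.deleteIfExists(tmp);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

Calling writeAndReadBack(0xE9, 0x1E9) gives back two identical bytes, since the upper bits of the second value are dropped.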

 

C# .NET

Write to the file the string created by decoding, as Latin-1, a new one-element byte array containing the result of converting s to a byte.

bitWriter.Write(Encoding.GetEncoding("iso-8859-1").GetString(new byte[] { Convert.ToByte(s) }));

 

WHAT THE HECK HAPPENED?!?!

 

As it turns out, the two snippets aren't using the same kind of writer. Java's FileOutputStream works exclusively with raw bytes: write(s) puts the low 8 bits of s in the file, and no character encoding is ever involved. The C# file writer I was using works exclusively with characters, and .NET strings are UTF-16 in memory, so whatever I hand it gets run through the writer's encoding on the way out. The result is that in Java I can simply throw byte values at the stream and they come out properly; in C#, however, I have to convert s back into a byte, decode that byte into a Latin-1 character, and then let the writer encode it right back into the same byte. (.NET's FileStream is byte-oriented and has a WriteByte method, so a closer port would use that instead of a character writer.) It's so head-bangingly convoluted I still can't get over it!
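The same byte-versus-character split can be reproduced in Java alone, which makes the problem easier to see. The sketch below pushes one value through a character writer (OutputStreamWriter) under two encodings. The class CharWriterDemo and the method writeAsChar are invented for this illustration, and it writes to an in-memory buffer rather than a real file.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStreamWriter;
import java.io.UncheckedIOException;
import java.io.Writer;
import java.nio.charset.Charset;

public class CharWriterDemo {
    // Treats the low byte of s as a character, writes it through a
    // character-based writer, and returns the bytes the encoding produced.
    public static byte[] writeAsChar(int s, Charset charset) {
        ByteArrayOutputStream sink = new ByteArrayOutputStream();
        try (Writer writer = new OutputStreamWriter(sink, charset)) {
            writer.write((char) (s & 0xFF));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return sink.toByteArray(); // writer is closed (and flushed) by now
    }
}
```

With StandardCharsets.ISO_8859_1 the value 0xE9 comes out as the single byte E9, because Latin-1 maps code points 0 through 255 straight onto byte values. With StandardCharsets.UTF_8 the same character becomes the two bytes C3 A9, which is exactly the sort of corruption a character writer causes when it is handed binary data under the wrong encoding.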
