Wednesday, March 19, 2008

String to Bytes : Comparing Methods

Suppose you have a string and you want to convert it to an array of bytes. There are two ways of doing so: using the Encoding class directly, or using the Encoding class together with a stream writer.

Let's examine both ways. Here I have written two functions that convert the same string to bytes and print those bytes.

using System;
using System.IO;
using System.Text;

class Program
{
    static void Main(string[] args)
    {
        string strTest = "this is a test";
        Encoding encodingType = Encoding.ASCII;
        Console.WriteLine("First Method");
        FirstMethod(strTest, encodingType);
        Console.WriteLine("Second Method");
        SecondMethod(strTest, encodingType);
        Console.ReadLine();
    }

    // Converts the string with Encoding.GetBytes and prints each byte.
    private static void FirstMethod(string strTest, Encoding encodingType)
    {
        foreach (byte b in encodingType.GetBytes(strTest))
        {
            Console.Write("{0},", b);
        }
        Console.WriteLine(Environment.NewLine);
    }

    // Writes the string through a StreamWriter into a MemoryStream and prints each byte.
    private static void SecondMethod(string strTest, Encoding encodingType)
    {
        MemoryStream mo = new MemoryStream();
        StreamWriter so = new StreamWriter(mo, encodingType);
        so.Write(strTest);
        so.Close(); // flushes the writer and closes the underlying stream
        foreach (byte b in mo.ToArray()) // ToArray still works on a closed MemoryStream
        {
            Console.Write("{0},", b);
        }
        mo.Close();
        Console.WriteLine(Environment.NewLine);
    }
}


The output is the same for both functions when the encoding is ASCII. Check the output below.

First Method
116,104,105,115,32,105,115,32,97,32,116,101,115,116,

Second Method
116,104,105,115,32,105,115,32,97,32,116,101,115,116,


But what about the case when the encoding is Unicode (Encoding.Unicode, which is UTF-16 little endian)? Then there is a difference: the output of the second method is prepended with two bytes, 255 and 254. Check the output.

First Method
116,0,104,0,105,0,115,0,32,0,105,0,115,0,32,0,97,0,32,0,116,0,101,0,115,0,116,0,


Second Method
255,254,116,0,104,0,105,0,115,0,32,0,105,0,115,0,32,0,97,0,32,0,116,0,101,0,115,0,116,0,
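
To check that the two outputs differ only by those leading bytes, here is a minimal sketch that compares them in code. It assumes that Encoding.GetPreamble() returns exactly the bytes StreamWriter emits at the start of an empty stream. The class and helper names (PreambleCheck, ViaStreamWriter, PreamblePlusBytes) are made up for this illustration and are not part of the program above.

```csharp
using System;
using System.IO;
using System.Linq;
using System.Text;

static class PreambleCheck
{
    // Hypothetical helper: encodes the text by writing it through a StreamWriter.
    public static byte[] ViaStreamWriter(string text, Encoding encoding)
    {
        using (MemoryStream stream = new MemoryStream())
        {
            using (StreamWriter writer = new StreamWriter(stream, encoding))
            {
                writer.Write(text);
            }
            // ToArray still works after the writer has closed the stream.
            return stream.ToArray();
        }
    }

    // Hypothetical helper: the encoding's preamble followed by the directly encoded bytes.
    public static byte[] PreamblePlusBytes(string text, Encoding encoding)
    {
        return encoding.GetPreamble().Concat(encoding.GetBytes(text)).ToArray();
    }

    static void Main()
    {
        string text = "this is a test";
        foreach (Encoding encoding in new[] { Encoding.ASCII, Encoding.Unicode })
        {
            bool same = ViaStreamWriter(text, encoding)
                .SequenceEqual(PreamblePlusBytes(text, encoding));
            Console.WriteLine("{0}: {1}", encoding.WebName, same);
        }
    }
}
```

If the assumption holds, the comparison should come out true for both encodings, since the ASCII encoding has an empty preamble. If the extra bytes are not wanted, passing an encoding constructed without a byte order mark, such as new UnicodeEncoding(false, false), to the StreamWriter should leave them out.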

The two additional bytes are the preamble. The preamble is a set of bytes that usually denotes the byte order for the decoder. Here, 255 and 254 are FF FE in hexadecimal, the mark for UTF-16 little endian.
The Unicode byte order mark (BOM) is serialized as follows (in hexadecimal):


UTF-8: EF BB BF

UTF-16 big endian byte order: FE FF

UTF-16 little endian byte order: FF FE

UTF-32 big endian byte order: 00 00 FE FF

UTF-32 little endian byte order: FF FE 00 00
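
The values in this list can be printed from .NET itself. The following is only a sketch: PreambleTable and PreambleHex are hypothetical names chosen for this example. It uses new UTF32Encoding(true, true) for big endian UTF-32 because the Encoding class has no static property for it.

```csharp
using System;
using System.Text;

static class PreambleTable
{
    // Hypothetical helper: formats an encoding's preamble as space-separated hex.
    public static string PreambleHex(Encoding encoding)
    {
        // BitConverter.ToString separates bytes with '-', so swap that for a space.
        return BitConverter.ToString(encoding.GetPreamble()).Replace("-", " ");
    }

    static void Main()
    {
        Encoding[] encodings =
        {
            Encoding.UTF8,                 // UTF-8
            Encoding.BigEndianUnicode,     // UTF-16 big endian
            Encoding.Unicode,              // UTF-16 little endian
            new UTF32Encoding(true, true), // UTF-32 big endian, with BOM
            Encoding.UTF32                 // UTF-32 little endian
        };
        foreach (Encoding encoding in encodings)
        {
            Console.WriteLine("{0}: {1}", encoding.WebName, PreambleHex(encoding));
        }
    }
}
```

Whether a preamble is emitted depends on how the encoding object was constructed, so an encoding created with the byte order mark turned off would print an empty value here.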