java XML parsing: line 1, character 38, unable to switch the encoding

While working on a Java web application I was trying to import an XML file into a Microsoft SQL Server database table with an XML type column. Everything seemed to be OK except I was running into the following error:

java XML parsing: line 1, character 38, unable to switch the encoding

After much googling the consensus was that this error is raised because SQL Server stores XML in the XML type column using UTF-16 character encoding while the file I was trying to insert was UTF-8 encoded. So, I had to convert my input data from UTF-8 to UTF-16. My original code was doing this:

String str = new String(data);

The String class defaults to UTF-16 unless specified otherwise. However, to be sure I changed the above to:

String str = new String(data, "UTF-16");

Nope, didn’t work… Well, the documentation for the String constructor isn’t the greatest so maybe the encoding parameter is just used as a hint at what the input data is supposed to be. The behavior of this constructor when the given bytes are not valid in the given charset is unspecified. Not entirely clear… Just to be sure, I changed it to this:

String str = new String(data);
str = new String(str.getBytes("UTF-16"), "UTF-16");

Still no luck. Ok, how about using CharsetEncoder/Decoder

Charset utf8= Charset.forName("UTF-8");
Charset utf16= Charset.forName("UTF-16");
CharBuffer cb = utf8.decode(new ByteBuffer.wrap(data));
ByteBuffer outputBuffer = utf16.encode(cb);
String str = new String(outputBuffer.array(), "UTF-16");

Still no. Well, looking at the XML document itself I see the XML declaration is specifying the encoding attribute as encoding="UTF-8". I delete it and finally it works. So, even though I changed the input to the correct encoding, the fact that this attribute was specified in the XML document itself caused SQL Server to complain. Ok fine… However, I ended up removing all the encoding conversion code that I had been playing with and reverted back to the original code:

String str = new String(data);

Tried the import again but without the encoding attribute and it still works. SQL server will do the conversion for you if the encoding attribute isn’t specified. It won’t touch it however if it is. So, options now are, 1) either remove the encoding attribute altogether, or 2) make sure the file is UTF-16 encoded with the encoding attribute set accordingly.

Related Posts

Leave a Reply