Removing the Byte Order Mark inserted in source files by Visual Studio

So, I was reverse engineering some C# code that I had developed in Visual Studio in order to create some UML class diagrams. Running StarUML, I ran into the following error:

Unrecoverable Parse Error 1Line 0Column

After poking around, it seems StarUML does not like the UTF-8 Byte Order Mark (BOM) that Visual Studio adds to the beginning of the file by default. The Byte Order Mark is the Unicode character U+FEFF, placed at the beginning of a Unicode text file to indicate the endianness of its multi-byte code units. In UTF-8 it is encoded as the three-byte sequence 0xEF, 0xBB, 0xBF. UTF-8 does not need a byte order mark, since endianness is not an issue for an encoding built from 8-bit code units. However, for some reason Microsoft thinks it’s a good idea to put it there anyway. This mark must be removed from the file before StarUML will be able to parse it.

There are a few ways to remove this mark. In Visual Studio, go to Save As…, choose “Save With Encoding…”, and select “UTF-8 without signature”. Once the file is saved without the BOM, Visual Studio will not add it again. Unfortunately, as far as I can tell there is no way to make this the default for all files in Visual Studio, so it must be done manually the first time each file is saved.

If you have access to a Linux machine or have Cygwin installed, you can batch modify existing files with this little script (change the *.cpp pattern to match your own sources, e.g. *.cs for C#):

find . -name "*.cpp" -exec vim -c "set nobomb" -c wq! {} \;

Or, if you don’t have Cygwin but you do have vim, you can use this batch script:

for %%f in (*.cpp) do call vim -c "set nobomb" -c wq! "%%f"

Note: when this runs as a batch script, it seems I need to hit [return] each time vim exits, which isn’t the case with the Cygwin version.
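If vim is not available, or the [return] prompts get in the way, the same cleanup can probably be done with standard Unix tools alone. The following is only a sketch under a few assumptions: the function names has_bom and strip_bom are made up for this example, and it relies on od, tr and tail behaving as they do on a typical Linux or Cygwin install. It checks whether the first three bytes are EF BB BF and, if so, rewrites the file starting at the fourth byte.

```shell
#!/bin/sh
# Hypothetical helpers for illustration; the names are not from any existing tool.

# has_bom FILE: succeed (exit 0) if FILE starts with the UTF-8 BOM bytes EF BB BF.
has_bom() {
    # od -N3 dumps only the first three bytes as hex; tr removes spacing and newlines.
    [ "$(od -An -tx1 -N3 "$1" | tr -d ' \n')" = "efbbbf" ]
}

# strip_bom FILE: remove a leading BOM in place; files without a BOM are left untouched.
strip_bom() {
    has_bom "$1" || return 0
    tmp="$1.nobom.$$"
    # tail -c +4 copies everything from the 4th byte onward.
    # Writing back with cat keeps the original file's permissions.
    tail -c +4 "$1" > "$tmp" && cat "$tmp" > "$1" && rm -f "$tmp"
}
```

With these defined in the current shell, something like `for f in *.cpp; do strip_bom "$f"; done` would process one directory. Because files without a BOM are skipped, running it twice should be harmless.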
Of course, this problem is not limited to StarUML; the byte order mark can cause problems in many other programs as well. It just so happened that I was using StarUML when I discovered the issue. The root cause is Microsoft adding this unnecessary mark. The real fix, however, would be for applications, including StarUML, to handle the mark properly, since the standard does say it “could” be there even though it is neither necessary nor recommended.
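To illustrate what tolerating the mark on the consuming side might look like, here is a minimal sketch of a filter that drops a leading BOM from whatever it reads and passes everything else through unchanged. It is not how StarUML or any other tool actually works; skip_bom is a hypothetical name, and the sketch assumes a POSIX-style sh with sed and printf.

```shell
#!/bin/sh
# Hypothetical filter for illustration; not part of any existing tool.

# skip_bom: copy stdin to stdout, dropping a UTF-8 BOM (EF BB BF) at the very start.
skip_bom() {
    # Build the three BOM bytes with printf octal escapes, so the sed
    # expression does not depend on \x escape support.
    bom=$(printf '\357\273\277')
    # LC_ALL=C makes sed work on raw bytes; "1s" limits the edit to the first line.
    LC_ALL=C sed "1s/^${bom}//"
}
```

A parser that reads from standard input could then be fed with `skip_bom < SomeFile.cs`, leaving the file on disk as Visual Studio wrote it.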