Byte Order Mark: How to Properly Read a Text File

In 2019 I was trying to read a CSV (Comma Separated Values) file into Java. The CSV file contained data exported from a DB2 table. My job was to create a new CSV file with some columns encrypted, so that it could be loaded into a new database with stricter security requirements. I inspected the data in Notepad++ and wrote a Java program, with the com.opencsv:opencsv dependency, to read each row of the input CSV file into a list of Java beans. Java beans were easier to handle for encryption, and opencsv supports reading and writing them. Everything seemed to work, but when I inspected the output file, the values for the first column were missing.

Issue

Reading Java beans with opencsv from the input CSV file wasn’t working. For some reason, the bean field for the first column in the CSV was being populated with null instead of the actual content.

Root Cause Analysis

The first thing I verified was whether the column value of the CsvBindByName annotation on this bean field had the correct CSV column name.

The column name in the annotation was correct. Next, I tested the code with a simple CSV file that I created myself, and it worked as expected. I then debugged the opencsv code to find out why it could not match the first column name in the CSV with the corresponding field in the class, and found that the column name opencsv read for this field had some unknown characters prefixed to it. That’s why the column name I had provided in the class didn’t match. For example, with NAME as the column name in the CsvBindByName annotation, opencsv read the header from the CSV file as “NAME” preceded by extra, invisible characters.
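The mismatch can be reproduced without opencsv. The sketch below is not taken from the original program; the class and method names are made up for illustration. It assumes the header bytes are decoded as UTF-8, in which case Java's decoder keeps the BOM as the character U+FEFF at the front of the string.

```java
import java.nio.charset.StandardCharsets;

public class BomHeaderMismatch {
    // Decodes raw bytes as UTF-8, the way a plain reader would.
    // Java's UTF-8 decoder does not strip a leading BOM; it becomes U+FEFF.
    static String decodeUtf8(byte[] bytes) {
        return new String(bytes, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // UTF-8 BOM (EF BB BF) followed by the ASCII bytes of "NAME".
        byte[] raw = {(byte) 0xEF, (byte) 0xBB, (byte) 0xBF, 'N', 'A', 'M', 'E'};
        String header = decodeUtf8(raw);

        System.out.println(header.length());               // 5, not 4
        System.out.println(header.equals("NAME"));         // false
        System.out.println(header.charAt(0) == '\uFEFF');  // true
    }
}
```

Because the extra character is invisible in most editors and consoles, the two strings look identical when printed, which is why the problem is hard to spot without a debugger or a hex editor.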

That’s when I knew that the issue lay with the input file I had been given. I inspected the CSV file with a hex editor and found three extra bytes, EF BB BF, at the very start of the file, before the first column name.
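If no hex editor is at hand, the first few bytes of a stream can be printed from Java itself. This is a minimal sketch using only the JDK (Java 11 or later for readNBytes); the class name, method name and sample bytes are illustrative assumptions, not part of the original program.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;

public class FirstBytesHex {
    // Reads up to `count` bytes from the stream and formats them as
    // space-separated, two-digit uppercase hex values.
    static String firstBytesAsHex(InputStream in, int count) throws IOException {
        byte[] head = in.readNBytes(count); // Java 11+
        StringBuilder sb = new StringBuilder();
        for (byte b : head) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) throws IOException {
        // Stand-in for a real file: UTF-8 BOM followed by "NAME".
        byte[] sample = {(byte) 0xEF, (byte) 0xBB, (byte) 0xBF, 'N', 'A', 'M', 'E'};
        System.out.println(firstBytesAsHex(new ByteArrayInputStream(sample), 4)); // EF BB BF 4E
    }
}
```

For a real file, the ByteArrayInputStream would be replaced by a stream opened on the file, for example with java.nio.file.Files.newInputStream.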

Searching Google for the hexadecimal sequence 0xEFBBBF led me to the Byte Order Mark page on Wikipedia: https://en.wikipedia.org/wiki/Byte_order_mark

Byte Order Mark

Sometimes a text stream contains special bytes at the beginning, before the actual data, to tell the consumer of the stream a few things:

  1. Endianness of the data. Endianness is the order in which the bytes of a multi-byte value are stored or transmitted. Not all CPU architectures store bytes in the same order, which is why it’s crucial to know the endianness of data before consuming it. The two common orders are big-endian (BE), where the most significant byte comes first, and little-endian (LE), where the least significant byte comes first. For example, the two bytes 0x00 0x01 are interpreted as decimal 1 in a big-endian system and as decimal 256 in a little-endian system. A Java int consists of 4 bytes, so before we can convert 4 bytes from a stream into a value, we need to know whether the first or the last byte is the most significant.
  2. Whether data is encoded using Unicode
  3. What Unicode encoding was used (UTF-8/16/32)
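Byte order is easy to see with the JDK's ByteBuffer. The following sketch (class and method names are illustrative) writes the same int in both orders, and then encodes the BOM character U+FEFF in both UTF-16 byte orders, which yields the two UTF-16 marks.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class EndiannessDemo {
    // Serializes a 4-byte int using the requested byte order.
    static byte[] intToBytes(int value, ByteOrder order) {
        return ByteBuffer.allocate(Integer.BYTES).order(order).putInt(value).array();
    }

    public static void main(String[] args) {
        // The same value, 1, laid out in the two byte orders.
        System.out.println(Arrays.toString(intToBytes(1, ByteOrder.BIG_ENDIAN)));    // [0, 0, 0, 1]
        System.out.println(Arrays.toString(intToBytes(1, ByteOrder.LITTLE_ENDIAN))); // [1, 0, 0, 0]

        // The BOM character U+FEFF encoded in both UTF-16 byte orders.
        // Java bytes are signed, so 0xFE prints as -2 and 0xFF as -1.
        System.out.println(Arrays.toString("\uFEFF".getBytes(StandardCharsets.UTF_16BE))); // [-2, -1]
        System.out.println(Arrays.toString("\uFEFF".getBytes(StandardCharsets.UTF_16LE))); // [-1, -2]
    }
}
```

A reader that sees FE FF at the start of a UTF-16 stream therefore knows the data is big-endian, and FF FE tells it the data is little-endian.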

Here is the list of byte order mark (BOM) byte sequences:

  1. UTF-8 0xEFBBBF
  2. UTF-16 Big Endian 0xFEFF
  3. UTF-16 Little Endian 0xFFFE
  4. UTF-32 Big Endian 0x0000FEFF
  5. UTF-32 Little Endian 0xFFFE0000
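The list above translates directly into a small detection routine. This is a sketch with made-up names (BomDetector, detectBom), not a library API. The one subtlety is ordering: the UTF-32 little-endian mark begins with the same two bytes as the UTF-16 little-endian mark, so the longer sequences have to be checked first.

```java
public class BomDetector {
    // Returns the encoding implied by a leading BOM, or "none" if no BOM is found.
    // FF FE 00 00 is reported as UTF-32LE, although the same bytes could also be
    // a UTF-16LE BOM followed by a NUL character; this sketch ignores that case.
    static String detectBom(byte[] head) {
        if (startsWith(head, 0x00, 0x00, 0xFE, 0xFF)) return "UTF-32BE";
        if (startsWith(head, 0xFF, 0xFE, 0x00, 0x00)) return "UTF-32LE";
        if (startsWith(head, 0xEF, 0xBB, 0xBF)) return "UTF-8";
        if (startsWith(head, 0xFE, 0xFF)) return "UTF-16BE";
        if (startsWith(head, 0xFF, 0xFE)) return "UTF-16LE";
        return "none";
    }

    // Compares the first bytes of data with the given unsigned byte values.
    private static boolean startsWith(byte[] data, int... prefix) {
        if (data.length < prefix.length) {
            return false;
        }
        for (int i = 0; i < prefix.length; i++) {
            if ((data[i] & 0xFF) != prefix[i]) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        byte[] utf8 = {(byte) 0xEF, (byte) 0xBB, (byte) 0xBF, 'N', 'A', 'M', 'E'};
        byte[] plain = {'N', 'A', 'M', 'E'};
        System.out.println(detectBom(utf8));  // UTF-8
        System.out.println(detectBom(plain)); // none
    }
}
```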

These bytes are not part of the data and should be excluded before the data is processed. The problem with my code was that I wasn’t excluding them.

Notepad on Windows 10 offers BOM-related encoding options, such as “UTF-8 with BOM”, when you save a text file.

It also detects these bytes in the files it opens.

MS Excel offers two options for saving as CSV; the UTF-8 option adds the UTF-8 BOM bytes.

Solution

I used BOMInputStream from Apache Commons IO to exclude the BOM bytes from the stream before handing it over to opencsv. You can see a working solution in this repository: https://github.com/ConsciousObserver/ByteOrderMarkerTest

Once BOMInputStream is used, the name field is populated with its content instead of being null.
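For readers who cannot add a dependency, the same idea can be sketched with the JDK alone. This is not the code from the repository and not how BOMInputStream is implemented; it is a simplified alternative that handles only the UTF-8 BOM, and the names Utf8BomSkipper and skipUtf8Bom are hypothetical. It peeks at the first three bytes and pushes them back if they are not a BOM.

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.PushbackInputStream;
import java.nio.charset.StandardCharsets;

public class Utf8BomSkipper {
    // Wraps the stream and consumes a leading UTF-8 BOM (EF BB BF) if present.
    // Any other leading bytes are pushed back so no data is lost.
    static InputStream skipUtf8Bom(InputStream in) throws IOException {
        PushbackInputStream pushback = new PushbackInputStream(in, 3);
        byte[] head = new byte[3];
        int read = pushback.readNBytes(head, 0, 3); // Java 9+; returns 0 at end of stream
        boolean hasBom = read == 3
                && (head[0] & 0xFF) == 0xEF
                && (head[1] & 0xFF) == 0xBB
                && (head[2] & 0xFF) == 0xBF;
        if (!hasBom && read > 0) {
            pushback.unread(head, 0, read);
        }
        return pushback;
    }

    public static void main(String[] args) throws IOException {
        // Stand-in for the exported file: BOM, header row, one data row.
        byte[] csv = "\uFEFFNAME,AGE\nAlice,30\n".getBytes(StandardCharsets.UTF_8);
        try (BufferedReader reader = new BufferedReader(new InputStreamReader(
                skipUtf8Bom(new ByteArrayInputStream(csv)), StandardCharsets.UTF_8))) {
            String header = reader.readLine();
            System.out.println(header.startsWith("NAME")); // true
        }
    }
}
```

The returned stream can be wrapped in a Reader and passed to a CSV parser in the same place where the raw file stream was used before.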

Lesson

You will encounter some strange things that will challenge your preconceived ideas.
