Logging and monitoring for IoT devices are critical in larger deployments. Smaller edge devices can have unpredictable availability: they might go offline for long periods, change their physical location, or throttle their bandwidth to save power. The fog, the continuum between edge and cloud, also has to be monitored, as its nodes cannot uphold the same guarantees that cloud nodes can.

For reasons of availability (supporting a disconnected mode of operation) as well as resource efficiency (transmitting data in batches is more efficient than sending individual data items), DITAS nodes will need to buffer monitoring and logging data locally. For this reason, we analyzed and compared different options for local buffering of monitoring and logging data. Essentially, there are two ways data can be stored locally: in a database system or directly on the file system. While database systems are convenient to use and provide rich query capabilities, they also come with a higher computational overhead than using the file system directly. Furthermore, the heterogeneity of devices and their limited resources make it rather difficult to deploy databases on every possible device, whereas file storage should be possible almost everywhere. Finally, most programming environments already come with methods to interact with the local file system, and most operating systems already have built-in tools to process, transmit, and collect files. Therefore, we will focus on file-based methods in the following. This leaves the question of choosing the best file format – for our purposes, “best” means lightweight, easy to transmit (files should be as small as possible, and we also require the ability to split files into smaller chunks in case we cannot transmit all data at once), and easy to use for developers and applications.

Methodology

We developed an experimental framework in Java and Python to generate, write, read, and compress different file formats. Java- and Python-based runtime environments seemed a good choice, as these should be available on most *nix-based platforms.

We first compiled a list of formats that might be a good fit for DITAS. For each format, we considered multiple libraries (where available) to find the implementation that was easiest to use. After selecting a library for a specific format, we determined for each format: the size of the resulting file for a representative sample of logging data, the time it took to create this file, and the impact of the library on CPU and memory. We also considered other file properties: 1) the human readability of a file, 2) the difficulty of extending a data structure with additional fields, and 3) the difficulty of splitting a file while keeping all parts readable on their own, as we might not be able to send files in one piece.
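To make this measurement step concrete, here is a minimal, illustrative sketch of how write time and resulting file size can be captured in Java; class and method names such as WriteBenchmark and measureWrite are our own and not part of the DITAS framework, and the real setup additionally sampled CPU and memory usage.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;

// Illustrative only: times an arbitrary serialization step and reports the
// size of the file it produced. The actual framework also sampled CPU and
// memory usage while the writer was running.
public final class WriteBenchmark {

    @FunctionalInterface
    public interface Writer {
        void writeTo(Path target) throws IOException;
    }

    public static void measureWrite(String label, Writer writer, Path target) throws IOException {
        long start = System.nanoTime();
        writer.writeTo(target);
        long elapsedMs = (System.nanoTime() - start) / 1_000_000;
        long sizeBytes = Files.size(target);
        System.out.printf("%s: %d bytes written in %d ms%n", label, sizeBytes, elapsedMs);
    }
}
```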

In order to make our evaluation comparable, we selected a use case that allowed us to vary resolution and information repetition, but let us create repeatable data sets. Our use case is roughly based on an IoT e-health scenario where patient data is monitored and, thus, produces a steady stream of data items. To test the different storage options, we simulated sensor data (a list of values) that is gathered at a specific time, comes from an IPv6 address, and serves a specific purpose (ID). Our sample data, therefore, contains structured data, lists, numbers, text data, and timestamps.
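As an illustration, one simulated data item could be modeled by a plain Java class like the following; the field names are our own illustrative choice, not necessarily those used in the experiments.

```java
import java.util.List;

// Hypothetical shape of one simulated data item: a timestamp, the IPv6
// address of the reporting device, a purpose ID, and a list of sensor values.
public class SensorReading {
    public long timestampMillis;   // when the values were gathered (epoch milliseconds)
    public String sourceAddress;   // IPv6 address of the device, e.g. "2001:db8::1"
    public String purposeId;       // identifies for which purpose the data was collected
    public List<Double> values;    // the actual measurements

    public SensorReading() { }     // no-args constructor for (de)serialization libraries

    public SensorReading(long timestampMillis, String sourceAddress,
                         String purposeId, List<Double> values) {
        this.timestampMillis = timestampMillis;
        this.sourceAddress = sourceAddress;
        this.purposeId = purposeId;
        this.values = values;
    }
}
```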

Formats

In total we looked into six file formats: CSV, XML, JSON, CLF/ELF, Protocol Buffers, and MessagePack. We selected these formats as representatives of different approaches to data storage. In the following, we describe each format and the libraries we used.

  1. XML (eXtensible Markup Language), introduced in 1997, was designed to be both human- and machine-readable. Data is organized in a tree structure where each subtree fully describes its schema and content. XML allows elements to have different schemas as long as the structure can be represented as a tree. For Java, we looked into Jackson XML, DOM, SAX, and StAX as potential libraries before settling on Jackson XML.
  2. CSV (Comma-Separated Values) is an even older file format and has been in use since 1972. CSV is typically used for table-like data, where all data in a file has to follow the same schema. CSV only stores schema information at the top of the document, making it difficult to split a file without rewriting the header. The format also has problems representing nested information, which means data has to be denormalized before it can be stored in a CSV file. Adding or removing fields likewise cannot be done without rewriting the file. We looked at OpenCSV and Jackson CSV for our implementation and again settled on Jackson CSV, as it had the best usability and feature set.
  3. JSON (JavaScript Object Notation), developed in 2000, stores data in key-value pairs where values can be lists, other JSON objects, or primitive values. Each JSON object describes itself, which makes it trivial to mix different schemas in the same file and to add new fields. We again looked at Jackson and org.json for our Java implementation. Both support streams and can write data without additional boilerplate code.
  4. CLF/ELF (Common Log Format / Extended Log Format) are text-based file formats with no strict schema support. The format is used in many *nix applications, which is why there is a rich set of tools to analyze these types of files. Since these files do not enforce any schema, the data representation can change at any time, but only at the cost of significant communication overhead between producers and consumers of these log files. Nevertheless, these formats are of interest, since any program logging in this format is easily embeddable in a *nix environment.
  5. Protocol Buffers, initially developed by Google around 2001, is a binary format for sending data over the network. However, it can also be used to store strongly typed data. Protocol Buffers files tend to be small, since the data is packed as tightly as possible. Data schemas are stored in external files that can be used to generate serializers and deserializers for many different programming languages. The format supports schema changes and handles versioning of these schemas. Google offers tooling and libraries for most common programming languages.
  6. MessagePack is a JSON-inspired binary data format started around 2013. It can be seen as a combination of JSON and Protocol Buffers: each file still contains a schema description of each object, but stored in a more space-efficient way. Libraries are available for multiple programming languages; a small serialization sketch comparing JSON and MessagePack follows this list.
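To give an impression of how little code the self-describing formats require, the following minimal sketch serializes the hypothetical SensorReading shown earlier once as JSON and once as MessagePack, using Jackson together with its MessagePack data-format module (assuming the jackson-databind and jackson-dataformat-msgpack dependencies are on the classpath). This is illustrative code, not the benchmark code behind the figures.

```java
import com.fasterxml.jackson.databind.ObjectMapper;
import org.msgpack.jackson.dataformat.MessagePackFactory;

import java.util.List;

// Minimal illustration: serialize the same (hypothetical) SensorReading once as
// JSON text and once as MessagePack binary, then compare the resulting sizes.
public class FormatComparison {
    public static void main(String[] args) throws Exception {
        SensorReading reading = new SensorReading(
                System.currentTimeMillis(), "2001:db8::1",
                "heart-rate-monitoring", List.of(71.0, 72.5, 70.8));

        ObjectMapper jsonMapper = new ObjectMapper();
        ObjectMapper msgpackMapper = new ObjectMapper(new MessagePackFactory());

        byte[] jsonBytes = jsonMapper.writeValueAsBytes(reading);
        byte[] packBytes = msgpackMapper.writeValueAsBytes(reading);

        System.out.println("JSON:        " + jsonBytes.length + " bytes");
        System.out.println("MessagePack: " + packBytes.length + " bytes");
    }
}
```

On typical inputs the MessagePack output is noticeably smaller, since field names and values are encoded in a compact binary representation instead of text.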

Figure 1 [File Format Sizes] – Average size of each file format in bytes for the same content. MessagePack and Protocol Buffers produce the smallest files, with MessagePack having a slightly higher throughput. One interesting observation is the behavior of XML: our XML library produced the largest files but also had the highest throughput, a clear indicator that writing the overhead parts of an XML file (whitespace and tags) does not require much time or many resources.

Compression

Some of these file formats contain a high level of redundant information, which is the main reason for the differences in file size. One way to handle redundant information is to use compression. We therefore looked into common compression techniques and their impact on different devices that could be found in fog environments.

In total, we looked at four different algorithms: ZIP, BZIP2, GZIP, and Google Snappy. For our experiments we used implementations for Java as well as commonly available command-line tools. The ZIP, BZIP2, and GZIP experiments were done using the Apache compression framework; for Snappy we used a library provided by Google.
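As a sketch of how such a comparison can be set up with the Apache Commons Compress library (the actual experiment harness differs, and the class names here are illustrative), the following snippet compresses one previously written file with GZIP and BZIP2 and reports size and runtime.

```java
import org.apache.commons.compress.compressors.bzip2.BZip2CompressorOutputStream;
import org.apache.commons.compress.compressors.gzip.GzipCompressorOutputStream;

import java.io.IOException;
import java.io.OutputStream;
import java.nio.file.Files;
import java.nio.file.Path;

// Illustrative sketch: compress a previously written log/monitoring file with
// two of the evaluated algorithms and report the resulting size and runtime.
public class CompressionComparison {

    @FunctionalInterface
    interface CompressorFactory {
        OutputStream wrap(OutputStream raw) throws IOException;
    }

    public static void main(String[] args) throws IOException {
        Path input = Path.of(args[0]);          // e.g. the JSON file produced earlier
        byte[] data = Files.readAllBytes(input);

        compress("gzip", data, input.resolveSibling(input.getFileName() + ".gz"),
                GzipCompressorOutputStream::new);
        compress("bzip2", data, input.resolveSibling(input.getFileName() + ".bz2"),
                BZip2CompressorOutputStream::new);
    }

    static void compress(String label, byte[] data, Path target, CompressorFactory factory)
            throws IOException {
        long start = System.nanoTime();
        try (OutputStream out = factory.wrap(Files.newOutputStream(target))) {
            out.write(data);
        }
        long elapsedMs = (System.nanoTime() - start) / 1_000_000;
        System.out.printf("%s: %d -> %d bytes in %d ms%n",
                label, data.length, Files.size(target), elapsedMs);
    }
}
```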

Each of these algorithms was then applied to each of our generated file formats and evaluated based on runtime, resulting file size, and resource cost. The results can be seen in the following graphs.

Figure 2 [Compression Runtime] – We ran all experiments on multiple devices (a 2016 i5 Mac, an Intel Edison, and a Raspberry Pi Zero W) to see how the different file formats and compression algorithms perform.

Overall, we observed that the lower-powered devices needed significantly more time to run the experiments: on average, the Mac needed about 35x less time to perform the same task than the Raspberry Pi. We also observed that not all compression algorithms behave the same on these devices. ZIP compression performed better overall on the constrained devices. We believe this is mainly due to the write behavior of each algorithm, as writing data is the main bottleneck on the Edison and the Raspberry Pi.

Figure 3 [Detailed overview of all compression algorithms per file format] – In this graph we can observe the performance of each compression algorithm for each file format.

The algorithms performed mostly similarly for all formats. However, we observed that files in the Protocol Buffers or MessagePack format are not treated equally by all algorithms. Snappy, for instance, was less well suited to compressing this binary data, as can be seen in Figure 3.

Conclusion

Formats like Protocol Buffers and MessagePack are well suited for storing logging data, as the files they produce are small even without compression. However, storing and transmitting binary data creates additional overhead for a human who wants to read the data.
Some file formats also require strict protocols between the consumer and the producer of the data. For instance, instead of merely agreeing that a log record should contain a timestamp, both partners have to agree on where exactly the timestamp is stored.

Other formats like MessagePack, JSON, CSV, and XML, on the other hand, can be written and read without such strict protocols, as each file contains some meta information about its content. We could also observe the trade-offs the different compression algorithms make: they produce small files but either consume a lot of time (BZIP2, GZIP) or resources (Snappy) to do so.

We asked a group of students to implement and evaluate the selected file formats and questioned them afterwards about their experiences. The following table shows the results. We evaluated the human readability of each format by giving the students an example file and asking them about its content. For CLF, XML, JSON, and CSV, the students could use standard tools to simply open each file. MessagePack and Protocol Buffers could only be read after writing a program for each, where MessagePack could at least be read without knowing the structure of the contained data.
We also asked the students to read and write each file from/to a data stream. Finally, we asked them how easy it was to implement each experiment, which we used to derive an ease-of-programming score for each format.

                             CLF    CSV                           JSON             MSGPACK   Protocol Buffers   XML
Human readability of files   5      4                             5                2         1                  5
Streamable writing           yes    needs to store header state   yes              yes       yes                yes
Streamable reading           yes    needs to store header state   yes, by object   yes       partially          needs to store state
Ease of programming          5      4                             4                3         2                  3

* Scores range from 5 (trivial) to 1 (difficult)

Depending on the use case, we recommend Protocol Buffers if the files only have to be read by applications, CLF/ELF if the data is only for logging without any structured content, and JSON or CSV if the data is structured and needs to be human-readable. If file size is more important than computation time, we furthermore recommend using ZIP on all but Protocol Buffers files, as the compression gain is significant without consuming many resources; if overall time is important and resources are not an issue, we recommend Snappy as a good in-between solution. Keep in mind that compression techniques like GZIP and BZIP2 need significantly more CPU/memory to create their smaller files, something that can create problems on edge devices like a Raspberry Pi (Figure 2).

See the source code on GitLab: https://gitlab.tubit.tu-berlin.de/ditas/datamonitor