Data-intensive applications are becoming increasingly fundamental to the analysis of data gathered by the Internet of Things (IoT). Data collected through tiny, affordable sensors and transmitted by smart devices are enabling the fourth industrial revolution by supporting, for instance, predictive maintenance of machinery, real-time tracking of production lines, and efficient scheduling of tasks. At the same time, mobile phones and wearables are changing people's habits, as the data collected by these devices can be exploited to optimize daily activities and improve quality of life.

A very simple example of a data-intensive application is shown in Figure 1. As usually occurs, computation is performed in several steps, each fed by a different type of data. For instance, a first processing step aligns the data coming from the set of sensors deployed in a managed environment (i.e., Ambient Sensing Alignment). The result is a stream of data that is correct and consistent but, in some cases, so large that it needs to be reduced without losing quality (i.e., Ambient Sensing Integration). As the aim of the application, in this case, is to predict future trends (i.e., Data Enrichment and Prediction), additional data sources are required to provide historical patterns. Finally, the data are further processed (i.e., Visualization Preparation) to transform them into a form suitable for visualization.
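The four steps described above can be sketched as a chain of functions. This is only an illustrative sketch: the sensor readings, the 60-second alignment grid, the averaging strategy, and the constant `historical_offset` are all assumptions made for the example and are not taken from the DITAS application in Figure 1.

```python
from statistics import mean

# Hypothetical sensor readings: (sensor_id, timestamp_in_seconds, temperature).
readings = [
    ("s1", 0.2, 20.0), ("s2", 0.9, 22.0),
    ("s1", 60.4, 21.0), ("s2", 61.1, 23.0),
]

def align(readings, period=60):
    """Ambient Sensing Alignment: snap every timestamp to a common time grid."""
    return [(sid, int(ts // period) * period, value) for sid, ts, value in readings]

def integrate(aligned):
    """Ambient Sensing Integration: reduce volume by averaging sensors per time slot."""
    slots = {}
    for _, ts, value in aligned:
        slots.setdefault(ts, []).append(value)
    return {ts: mean(values) for ts, values in sorted(slots.items())}

def enrich_and_predict(integrated, historical_offset):
    """Data Enrichment and Prediction: naive forecast that adds an offset
    (assumed to come from a historical data source) to the latest value."""
    latest = integrated[max(integrated)]
    return {"last": latest, "forecast": latest + historical_offset}

def prepare_visualization(prediction):
    """Visualization Preparation: convert the result into chart-friendly records."""
    return [{"label": label, "value": round(value, 1)} for label, value in prediction.items()]

result = prepare_visualization(
    enrich_and_predict(integrate(align(readings)), historical_offset=0.5)
)
print(result)
```

Each function consumes the output of the previous one, which mirrors the point made above: every step is fed by a different type of data, and only the prediction step needs an additional, external data source.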

Data-intensive applications and Fog Computing: the DITAS vision

According to the DITAS perspective, Cloud Computing is mainly related to the core of the network, whereas Edge Computing focuses on providing the owner of resources with local, 'in-situ' means for collecting and preprocessing data before sending it to cloud resources (for further utilization), thus addressing typical constraints of sensor-to-cloud scenarios such as limited bandwidth and strict latency requirements. For this reason, and in light of the definition proposed by the OpenFog Consortium [1] (see Figure 2), DITAS considers Fog Computing as the sum of Cloud and Edge Computing.

A key point to be addressed to achieve this goal is modeling the dependency between tasks and data. Both can live in the Cloud or at the Edge and, at run-time, can migrate from one environment to the other or, within the same environment, from one resource to another. Furthermore, a task can also change the data source used for processing according to the specific location in which it operates. For instance, the prediction task, which requires weather data, can decide at run-time which of the several data sources available on the Web to actually use.

Virtual Data Container

The usual way to define a data source, which includes metadata describing the content, does not provide enough support for the envisioned scenario. As we are dealing with a Fog environment, the location of the data sources and their relevance are two additional issues to be considered. This requires the definition of an abstraction layer that hides the heterogeneity of the data content, i.e., the Virtual Data Container (see Figure 3). The concept of a Virtual Data Container (VDC) represents one of the key elements in the DITAS proposal. Generally speaking, a VDC embeds the logic that enables proper information logistics, depending on both the application to be executed and the available data sources. On the one hand, a VDC is linked to one or more of the tasks composing the data-intensive application. Along with these links, developers specify their data needs, including both functional (i.e., the content) and non-functional (i.e., data quality) aspects. On the other hand, a VDC is connected to a set of data sources offering the given information. This virtual layer hides from the developer the intricacies of the underlying complex infrastructure, composed of smart devices, sensors, and traditional computing nodes located in the cloud.
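To make the role of the abstraction layer more concrete, the following minimal sketch shows a container that sits between a task and several data sources. It is not the DITAS API: the class names, the `read` method, the `quality` score, and the selection rule (prefer a source in the task's location, then the highest quality) are all hypothetical choices made for illustration.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List

@dataclass
class DataSource:
    """Hypothetical description of a concrete data source behind the container."""
    name: str
    location: str                     # "edge" or "cloud"
    quality: float                    # assumed data quality score in [0, 1]
    fetch: Callable[[], Dict[str, object]]

@dataclass
class VirtualDataContainer:
    """Hides which concrete source serves the data requested by a task."""
    sources: List[DataSource] = field(default_factory=list)

    def read(self, min_quality: float, preferred_location: str) -> Dict[str, object]:
        # Non-functional requirement: keep only sources that are good enough.
        candidates = [s for s in self.sources if s.quality >= min_quality]
        if not candidates:
            raise LookupError("no data source satisfies the requested quality")
        # Assumed policy: prefer the task's own location, then the higher quality.
        best = max(candidates, key=lambda s: (s.location == preferred_location, s.quality))
        return {"source": best.name, **best.fetch()}

vdc = VirtualDataContainer([
    DataSource("edge-station", "edge", 0.7, lambda: {"temperature": 21.5}),
    DataSource("cloud-archive", "cloud", 0.9, lambda: {"temperature": 21.0}),
])

edge_answer = vdc.read(min_quality=0.6, preferred_location="edge")
strict_answer = vdc.read(min_quality=0.8, preferred_location="edge")
print(edge_answer["source"], strict_answer["source"])
```

The task only states what it needs (a minimum quality and where it runs); which source actually answers is decided inside the container, so the same call can be served by a different source when requirements or locations change.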

Data utility

As both data sources and tasks can move among the available resources, the dependency between a task and a specific data source may be affected. For instance, if a task using a given weather data source is moved, another weather data source may turn out to be more useful for making the overall application effective and efficient.

For this reason, Data Utility is introduced to model the relevance of a data source with respect to its usage, where the usage is defined by the task goal. More precisely, Data Utility extends the classical notion of data quality for a data source, i.e., fitness for use by a data consumer [2], and measures the relevance of data for the usage context, where the context is defined in terms of the designer's goals and the system characteristics. The designer's goals are captured by the definition of tasks, which includes the input descriptions and the related requests in terms of both functional and non-functional requirements, while the system characteristics include the definition of the data sources.

To properly define Data Utility, Potential Data Utility must first be introduced. Indeed, the utility of a data source is composed of two main elements: the Potential Data Utility, i.e., the utility per se, which does not depend on the usage, and the Quality of Service, which captures how the data source is seen by a task.

Potential Data Utility summarizes the capabilities of a data source and can be evaluated periodically, independently of the context. It is calculated by looking at the data and at the characteristics of the data source, and is derived from a Data Quality assessment and a Reputation assessment, which together help in understanding the potential value of the data. For instance, errors or out-of-date information in a data source are signs of poor data quality that can be assessed regardless of a specific usage. Finally, Quality of Service consists of the typical dimensions that evaluate how data are provided, such as response time and latency. Such an evaluation is required because moving data sources affects these dimensions and, in turn, the utility of the data for a task may change as well.
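As a rough illustration of how the two elements could be combined, the sketch below ranks two weather data sources before and after a task is moved. The text does not specify how Data Quality, Reputation, and Quality of Service are aggregated, so the weighted means, the linear response-time penalty, the 500 ms threshold, and all the numeric scores are assumptions chosen only for this example.

```python
# Hypothetical weather data sources with context-independent scores in [0, 1].
SOURCES = {
    "weather_a": {"data_quality": 0.9, "reputation": 0.8},
    "weather_b": {"data_quality": 0.7, "reputation": 0.8},
}

def potential_data_utility(data_quality, reputation, w_quality=0.5):
    """Context-independent part: assumed weighted mean of quality and reputation."""
    return w_quality * data_quality + (1.0 - w_quality) * reputation

def quality_of_service(response_time_ms, max_acceptable_ms=500.0):
    """Context-dependent part: assumed linear penalty on the observed response time."""
    return max(0.0, 1.0 - response_time_ms / max_acceptable_ms)

def data_utility(source, response_time_ms, w_potential=0.5):
    """Assumed aggregation: weighted mean of Potential Data Utility and QoS."""
    pdu = potential_data_utility(source["data_quality"], source["reputation"])
    qos = quality_of_service(response_time_ms)
    return w_potential * pdu + (1.0 - w_potential) * qos

def best_source(response_times_ms):
    """Pick the source with the highest utility as seen from the task's location."""
    return max(SOURCES, key=lambda name: data_utility(SOURCES[name], response_times_ms[name]))

# Response times (ms) as observed by the task when it runs in the cloud and at the edge.
from_cloud = best_source({"weather_a": 100.0, "weather_b": 400.0})
from_edge = best_source({"weather_a": 450.0, "weather_b": 50.0})
print(from_cloud, from_edge)
```

With these assumed numbers, the source with the higher Potential Data Utility is preferred while the task runs in the cloud, but once the task moves to the edge the other source becomes more useful because its Quality of Service, as seen from the new location, is much better.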

[1] OpenFog Consortium Architecture Working Group: OpenFog Architecture Overview (February 2016)

[2] Wang, R.Y., Strong, D.M.: Beyond Accuracy: What Data Quality Means to Data Consumers. J. Manage. Inf. Syst. 12(4), 5–33 (1996)

Authors: Cinzia Cappiello, Barbara Pernici, Pierluigi Plebani, Monica Vitali (POLIMI)