The Internet has become such an integral part of our lives that we often take for granted just how much it has transformed the way we communicate, share, and consume data. The transformation is both seamless and relentless, so perhaps we can forgive the uninitiated for thinking that pervasive terms such as Cloud, Fog, Edge, P2P and Jungle are nothing more than buzzwords. Nevertheless, it is these very paradigms (among others) and their underlying technologies that facilitate the transformation, leading to a seemingly endless list of potential service delivery models. The equally pervasive “everything” as-a-Service model springs to mind, as it is now well and truly accepted into the mainstream lexicon.

The ubiquity of smart devices, along with the resulting increase in data volumes, creates even more opportunity for innovation, particularly for those pioneering the Internet of Things (IoT). However, it also poses new challenges, more pressing than ever before, related to the delivery of cohesive services over heterogeneous computing systems. The ICT sector, as always, is rising to the challenge, attempting to unlock the potential of an ever-increasing network of data creators and consumers. Cloud Computing is the incumbent paradigm, capable of meeting the requirements of the most demanding use cases, but with the emergence of IoT there is a greater need for service owners to access and/or process time-sensitive data at the edge of the network. Thus, a mixed cloud/edge environment is envisaged, and this forms the basis for the DITAS project.

The Cloud has revolutionized computing by enabling the processing and storage of large amounts of data remotely, from anywhere, and at relatively low cost. While this clearly gives developers of data-intensive applications a great deal of flexibility, it also increases complexity and amplifies the factors that developers must consider early in the design phase: where to store data and in which format, what mechanisms must be implemented to secure data at rest and in transit, and much more. Complexity increases even further when we consider the numerous configurations now possible in a heterogeneous system of IoT-ready devices.

So how do we manage the execution of such data-intensive applications in mixed cloud/fog environments?

The DITAS project intends to abstract this complexity away from developers (who may not necessarily have complete knowledge of the data sources) at design time with the introduction of the Virtual Data Container (VDC) concept. A Data Administrator takes responsibility for making one or more data sources available to the application; this role can be assumed by anyone with complete knowledge of the data sources and the required data processing.

The VDC essentially becomes an abstraction layer between the application and one or more data sources, as shown in the figure above. The VDC is made available to the Developer when the Data Administrator defines and publishes a VDC Blueprint to a repository (the VDC Blueprint repository). The Developer simply retrieves the required data from the VDC; they are not concerned with how and where the data are stored and can concentrate on defining more relevant requirements such as quality of data (e.g., accuracy, timeliness) and quality of service (e.g., transmission rate, encryption). The VDC offers the following capabilities:

  • Provides uniform access to data sources regardless of where they run, i.e., on the Edge or on the Cloud;
  • Embeds a set of data processing techniques able to transform data (e.g., encryption, compression);
  • Allows these processing techniques to be composed into pipelines (inspired by the Node-RED programming model) and the resulting application to be executed, as sketched after this list;
  • Enacts the data and task movement strategies, based on decisions taken by the Virtual Data Manager (VDM), which controls all the VDCs instantiated from the same VDC Blueprint.
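To make the pipeline idea concrete, here is a minimal sketch in Python of how such processing techniques might be composed and executed. The `compress` and `encrypt` steps and the `Pipeline` class are illustrative assumptions, not the DITAS API:

```python
import gzip
from typing import Callable, List

# A processing technique is any function that transforms a payload of bytes.
Technique = Callable[[bytes], bytes]

def compress(data: bytes) -> bytes:
    """Example technique: gzip compression."""
    return gzip.compress(data)

def encrypt(data: bytes) -> bytes:
    """Example technique: a toy XOR cipher, standing in for real encryption."""
    key = 0x5A
    return bytes(b ^ key for b in data)

class Pipeline:
    """Composes techniques in order, Node-RED style: each node's output
    feeds the next node's input."""
    def __init__(self, steps: List[Technique]):
        self.steps = steps

    def run(self, data: bytes) -> bytes:
        for step in self.steps:
            data = step(data)
        return data

# Compose techniques into a pipeline and execute it on a sample payload.
pipeline = Pipeline([compress, encrypt])
result = pipeline.run(b"sensor reading: 42")
```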

What is the Execution Environment in a nutshell?

The Execution Environment is designed to make decisions about data and computational movement. It consists of:

  • a data movement enactor that, based on the information collected by the monitoring system, selects the most suitable data movement techniques;
  • a distributed monitoring and analytics system that collects information about how the application behaves with respect to data management;
  • an execution engine that supports the execution and the adaptation, through computational movement, of data-intensive applications distributed among on-premises and cloud resources;
  • an Auditing and Compliance framework that enforces data security and privacy policies across the DITAS architecture.

The three main technical implementations in the project are as follows:

Software Development Kit (SDK) – This provides extensions of popular tools such as Node-RED to define applications. The purpose of this tool is to allow developers to design applications by specifying Virtual Data Containers (VDCs) and constraints/preferences for the Cloud and Edge resources to be exploited. Applications are then deployed satisfying all constraints, based on the developer’s instructions and the degree of freedom given by the VDCs.
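As a purely illustrative sketch (the field names below are assumptions, not the DITAS Blueprint schema), a developer’s declaration through the SDK might look like this:

```python
# Hypothetical, simplified view of what a developer might declare through the
# SDK: the VDC Blueprint to use and the constraints/preferences the deployment
# must satisfy. All field names here are invented for illustration.
application = {
    "name": "patient-monitoring-app",
    "vdc_blueprint": "blood-test-data-v1",   # published by a Data Administrator
    "constraints": {
        "data_quality": {"accuracy": ">= 0.95", "timeliness": "< 5s"},
        "service_quality": {"encryption": "required", "throughput": ">= 10 MB/s"},
    },
    "preferences": {
        "placement": ["edge", "cloud"],       # degree of freedom left to the VDC
    },
}
```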

Virtual Data Containers (VDCs) – These provide an abstraction layer for developers so they can focus only on the data: what they want to use and why, forgetting about implementation details. With VDCs, applications can easily access the required data, in the desired format and with the proper quality level, rather than directly searching for and accessing them among various data infrastructure providers. At design time, VDCs allow developers to simply define data requirements, the expected quality, and the relative importance of the data. At run time, VDCs are responsible for providing the right data and satisfying those requirements by hiding the complex underlying infrastructure composed of different platforms, storage systems, and network capabilities.
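From the developer’s side, run-time access could then be as simple as the following sketch; the `VDCClient` class and its `query` method are hypothetical stand-ins for whatever interface a concrete VDC exposes:

```python
# Hypothetical VDC client: the developer states what data they need and at
# what quality; the VDC resolves where and how the data are actually stored.
class VDCClient:
    def __init__(self, blueprint_id: str):
        self.blueprint_id = blueprint_id

    def query(self, what: str, accuracy: float, max_latency_s: float) -> list:
        # A real VDC would route this request to Edge or Cloud data sources
        # and run its embedded processing pipeline; here it is only a stub.
        return []

vdc = VDCClient("blood-test-data-v1")
readings = vdc.query("glucose-levels", accuracy=0.95, max_latency_s=5.0)
```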

Execution Environment (EE) – This is based on a powerful execution engine capable of managing a distributed architecture and taking care of data movement and computation, maintaining coordination with the other resources involved in the same application. The EE also has a monitoring system capable of checking the status of the execution, tracking data movements, and collecting all the data necessary for understanding the behavior of the application.
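To give a flavor of the kind of decision the EE automates, here is a rough sketch of a data movement enactor choosing between moving data and moving computation; the thresholds and strategy names are invented for this example:

```python
# Illustrative decision rule: map monitored metrics to a movement strategy.
# Thresholds and strategy names are invented for this sketch.
def choose_movement(latency_ms: float, link_mbps: float, data_mb: float) -> str:
    transfer_s = data_mb * 8 / link_mbps      # naive transfer-time estimate
    if transfer_s >= 2:
        return "move-computation-to-data"     # data is large: move the task
    if latency_ms > 100:
        return "move-data-to-edge"            # data is small but far: bring it closer
    return "no-movement"

# E.g., 8 MB over a 50 Mbps link with 150 ms latency -> "move-data-to-edge".
print(choose_movement(latency_ms=150, link_mbps=50, data_mb=8))
```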