The DITAS Cloud Platform allows developers to design data-intensive applications, deploy them on a mixed cloud/edge environment, and execute the resulting distributed application in an optimal way by exploiting data and computation movement strategies. In this post, we’ll focus on data movement within the DITAS project, as it poses several interesting challenges. Data has to be moved on demand between edge and cloud, partial and updated data has to be kept in sync, and at the same time the data has to go through different transformations depending on the user requirements. The use cases in the project utilize both relational databases and object storage.
The DITAS ecosystem has several components responsible for the tasks related to data movement. The workflow begins with the Decision System for Movement (DS4M), which continuously checks the gathered metrics and monitoring information. When it detects a violation that might increase the latency of accessing the data, it initiates a data movement.
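To make the trigger condition concrete, here is a minimal sketch of a latency check. It is not the DS4M's actual logic: the function name, the sliding window and the threshold are all assumptions made for illustration.

```python
from statistics import mean


def latency_violated(samples_ms, threshold_ms, window=5):
    """Return True when the average of the last `window` samples exceeds the threshold.

    Hypothetical rule: the real DS4M evaluates its own set of metrics.
    """
    recent = samples_ms[-window:]
    # Require a full window so a single slow request does not trigger a movement.
    return len(recent) == window and mean(recent) > threshold_ms


if __name__ == "__main__":
    observed = [10, 12, 11, 95, 110, 120, 130, 125]
    if latency_violated(observed, threshold_ms=100):
        print("violation detected, a data movement would be requested")
```

Averaging over a window rather than reacting to one sample is a design choice of this sketch, meant to avoid moving data because of a single outlier.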
The Data Movement Enactor (DME) runs within the DITAS management layer (VDM) and serves as the orchestrator and as the endpoint that the decision system calls. The DS4M sends the initial movement request to the DME as a JSON payload over a REST API. The DME compiles the table or object data named in that request into SQL queries, which it then sends to the Data Access Layer component (DAL). The DAL transforms and prepares the data; the required transformation is passed along by the DME, which in turn receives it with the initial request from the DS4M.
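The step from a movement request to SQL queries can be sketched as follows. The payload layout (`tables`, `name`, `columns`, `filter`) is hypothetical, since the post does not describe the real DS4M request format, and the function name is made up for this example.

```python
import json
import re

# Only plain identifiers are accepted, so table and column names
# cannot smuggle extra SQL into the generated query.
_IDENTIFIER = re.compile(r"^[A-Za-z_][A-Za-z0-9_]*$")


def _check_identifier(name):
    if not _IDENTIFIER.match(name):
        raise ValueError(f"unsafe SQL identifier: {name!r}")
    return name


def build_movement_queries(payload):
    """Turn a JSON movement request (hypothetical format) into SELECT statements."""
    request = json.loads(payload)
    queries = []
    for table in request["tables"]:
        name = _check_identifier(table["name"])
        columns = table.get("columns")
        column_list = ", ".join(_check_identifier(c) for c in columns) if columns else "*"
        query = f"SELECT {column_list} FROM {name}"
        # Assumption: the filter comes from a trusted component and is used verbatim.
        # A production version would parse or parameterize it instead.
        row_filter = table.get("filter")
        if row_filter:
            query += f" WHERE {row_filter}"
        queries.append(query)
    return queries
```

For a request that names a `patient` table with columns `id` and `name` and the filter `age > 60`, this yields `SELECT id, name FROM patient WHERE age > 60`. Queries like this are what would be handed to the DAL.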
One of the challenges posed by moving partial database table data is that all newly added and modified records need to be kept in sync. For this reason we use a third-party application called SymmetricDS (https://www.symmetricds.org/). SymmetricDS is an open-source database replication tool that provides a convenient way of keeping track of database modifications.
The DME keeps a cache of the names of the tables that were moved. Based on that cache, and using SymmetricDS, it monitors modifications to those tables. Additional SQL queries that describe these modifications are composed and sent to the DAL, in the same way as when the data movement is first initiated. The DAL is then responsible for transforming the data extracted by the SQL query provided by the DME. The data is then exported in the Apache Parquet (https://en.wikipedia.org/wiki/Apache_Parquet) format and uploaded to a shared FTP server.
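The idea of the table-name cache can be illustrated with a short sketch. The class, the function and the shape of a change event are assumptions. They do not reflect the DME's code or the SymmetricDS API; they only show how changes to tables that were never moved can be filtered out.

```python
from dataclasses import dataclass, field


@dataclass
class MovedTableCache:
    """Remembers which tables have already been moved (hypothetical helper)."""

    tables: set = field(default_factory=set)

    def record_movement(self, table):
        self.tables.add(table.lower())

    def is_tracked(self, table):
        return table.lower() in self.tables


def queries_for_changes(cache, changes):
    """Build one parameterized query per change that touches a moved table.

    `changes` is an iterable of (table, pk_column, pk_value) tuples, an assumed
    shape for whatever the change-tracking tool reports. Table and column
    names are assumed to be trusted here.
    """
    queries = []
    for table, pk_column, pk_value in changes:
        if not cache.is_tracked(table):
            continue  # the table was never moved, so there is nothing to sync
        queries.append((f"SELECT * FROM {table} WHERE {pk_column} = ?", (pk_value,)))
    return queries
```

With only `patient` recorded in the cache, a change to `patient` produces a query while a change to `billing` is skipped.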
To facilitate the actual data movement, the DAL component is recreated in the target location. This is performed by the Deployment Engine (DE) at the request of the DME. The newly deployed DAL can then access the generated Parquet file on the FTP server and recreate the data in the target cluster.
This wraps up a very brief, high-level view of how data movement is performed “under the hood” of the DITAS cloud platform. The process is still subject to some optimisation and may undergo changes in the final version of DITAS.