STAR’s Blockchain for Data Provenance and Traceability: Tackling the Challenges of Industrial Data Reliability
Understanding Industrial Data Reliability Challenges
Data reliability is a key prerequisite for developing industrial applications that leverage Big Data, Machine Learning (ML) and Artificial Intelligence (AI) systems. It is also essential for the trustworthiness of AI applications in industrial environments: without reliable data, it is impossible to develop trusted AI systems. For instance, training ML algorithms with unreliable data results in the well-known GIGO (Garbage In, Garbage Out) effect. Moreover, the use of unreliable data is one of the main root causes of biased and unreliable AI/ML systems.
However, achieving data reliability in industrial environments is very challenging, because many factors can degrade or corrupt industrial data:
- Environmental influences (e.g., high or low temperatures, humidity, moisture, and air pressure).
- Background noise such as noise pollution or interference (e.g., alarms, extraneous speech) and electrical noise from devices like motors, cooling devices, air conditioning, and power supplies.
- Faulty or inaccurate sensors such as sensing systems with poor precision.
- Dying batteries that compromise a system’s ability to provide reliable measurements.
- Compromised or attacked devices that produce biased or fake data due to adversarial attacks (e.g., data poisoning, data modification, false information injection).
- Compromised analytics algorithms, such as algorithms under poisoning, evasion, and other types of adversarial attacks.
To alleviate data unreliability, industrial organizations need data infrastructures that are cyber-resilient and cannot be tampered with. In this direction, several research works suggest the use of distributed ledger technologies. Blockchain infrastructures facilitate decentralized data management, including decentralized data operations and transactions. The advantages of a blockchain infrastructure for reliable data operations include:
- Lack of single points of failure: Blockchain infrastructures operate in a highly distributed fashion and do not rely on a trusted third party for the validation of data transactions. Because they have no single point of failure, they are much more difficult to compromise.
- Tamper-resilience: Blockchains have anti-tampering properties. Changing data once it is written to a distributed ledger requires a next-to-impossible investment in resources. This is a foundation for data reliability, as adversarial parties cannot practically alter blockchain data.
- Data transparency and auditability: Transactions that write/store (meta)data on a blockchain are transparent and accessible to all members (peers) of a blockchain network. Hence, they are auditable by the other participants in the blockchain network.
- Security: Blockchain infrastructures offer integrity protection mechanisms, including data hashing and cryptographic linking among the various blocks. This boosts their tamper-proof nature and minimizes security risks. Also, it is not possible to hack a blockchain by attacking only a few of its nodes. Blockchains support consensus mechanisms, which require an absolute majority of nodes to agree on changes to the blockchain contents. Therefore, blockchains are resilient against cyber-attacks that compromise one or a few nodes.
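The hashing and cryptographic linking mentioned above can be illustrated with a minimal sketch. It is a toy hash chain and not the Hyperledger Fabric block format: the function names, the JSON encoding, and the sample sensor readings are assumptions made purely for illustration. Each block's hash covers the hash of its predecessor, so changing one block invalidates the chain from that point on.

```python
import hashlib
import json

GENESIS_HASH = "0" * 64  # assumed placeholder for "no previous block"


def block_hash(index, payload, prev_hash):
    """Hash a block's contents together with the previous block's hash."""
    body = json.dumps(
        {"index": index, "payload": payload, "prev_hash": prev_hash},
        sort_keys=True,
    )
    return hashlib.sha256(body.encode("utf-8")).hexdigest()


def append_block(chain, payload):
    """Append a block whose hash covers the hash of its predecessor."""
    prev_hash = chain[-1]["hash"] if chain else GENESIS_HASH
    index = len(chain)
    chain.append(
        {
            "index": index,
            "payload": payload,
            "prev_hash": prev_hash,
            "hash": block_hash(index, payload, prev_hash),
        }
    )


def first_invalid_block(chain):
    """Return the index of the first block that fails verification, or None."""
    prev_hash = GENESIS_HASH
    for block in chain:
        expected = block_hash(block["index"], block["payload"], prev_hash)
        if block["prev_hash"] != prev_hash or block["hash"] != expected:
            return block["index"]
        prev_hash = block["hash"]
    return None


# Hypothetical sensor readings, used only to populate the toy chain.
chain = []
for temp_c in (21.5, 21.7, 21.6):
    append_block(chain, {"sensor": "s1", "temp_c": temp_c})
```

If an attacker edits the payload of a block, verification fails at that block. If the attacker also recomputes that block's hash, verification fails at the next block instead, because its stored link no longer matches. In a real network the attacker would additionally have to convince a majority of nodes to accept the rewritten chain.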
Introducing the STAR Blockchain Infrastructure
Motivated by these benefits, STAR implements a blockchain infrastructure to boost the reliability of the data used by the project’s trusted AI systems. The figure below illustrates how the Blockchain Data Provenance and Traceability service interacts with other non-Blockchain modules of the STAR platform.
The STAR blockchain has a rather complex architecture: its components are spread across several interconnected machines, which together form a private, permissioned Blockchain network. Permissioned blockchains provide much better performance than the popular public blockchain networks (e.g., the Bitcoin network), since they need not employ computationally expensive Proof of Work (PoW) mechanisms. Rather, they rely on lighter-weight consensus among known, authenticated participants (Hyperledger Fabric, for instance, uses a Raft-based ordering service rather than PoW or Proof of Stake), which enables permissioned blockchain infrastructures to support thousands of transactions per second. STAR leverages the open source Hyperledger Fabric project from the Linux Foundation for the implementation of its blockchain infrastructure.
As illustrated in the figure, an organization participating in the network is, in this context, a non-Blockchain module of the STAR architecture (e.g., a Cyber Physical Production System (CPPS)) that benefits from recording information on the Blockchain. Everything that interacts with the Blockchain network acquires its organizational identity from its digital certificate and its Membership Service Provider (MSP) definition. Service owners do not communicate with the Blockchain network directly, but via a multi-level Backend application that exposes several APIs to client applications.
A main value proposition of the STAR blockchain for data provenance, when compared to similar blockchain-based systems like ProvChain and ProductChain, lies in its ability to record and protect not only raw data and source industrial data, but also AI/ML models and analytics results. This is why STAR’s blockchain for data provenance is suitable for trusted AI systems, which is in line with the overall aim of the STAR project.
STAR’s Blockchain as a Docker Swarm Overlay Network
STAR develops an industrial-grade system leveraging a Docker Swarm Overlay Topology. To understand how STAR uses a Docker overlay network, consider the following three sample use case scenarios, which involve three organizations as illustrated in the figure:
- Organization “one” offers to the platform a data stream addressed to Organization “two”. Metadata on the source of the data are stored on the Blockchain. Organization “two”, which needs at a later point to verify the source of the data stream, may refer to the Blockchain.
- Organization “one” offers to the platform a service leveraging an artificial intelligence algorithm (data processor). Metadata on the processor's configuration (state) right before its execution are stored on the Blockchain. Organization “two”, another stakeholder of the STAR platform, which needs at a later point to verify which processor performed the data processing and under which conditions, may refer to the Blockchain.
- Organization “one” offers to the platform a service leveraging an artificial intelligence algorithm (data processor). Metadata on the results of the algorithm's calculations are stored on the Blockchain. Organization “two”, another stakeholder of the STAR platform, which needs at a later point to verify a result it may have come across via a different route, may refer to the Blockchain to assert that the result has not been tampered with.
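The three scenarios share one pattern: one organization records metadata about an artifact (a data source, a processor configuration, or a result), and another organization later checks a copy against that record. The sketch below illustrates the pattern with an in-memory stand-in for the ledger. It does not use the STAR Backend API or Hyperledger Fabric; the class, the record identifiers, and the sample artifacts are all hypothetical.

```python
import hashlib
import json


def fingerprint(artifact):
    """SHA-256 digest over a canonical JSON encoding of the artifact."""
    canonical = json.dumps(artifact, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


class InMemoryLedger:
    """Hypothetical stand-in for the ledger: write-once records keyed by id."""

    def __init__(self):
        self._records = {}

    def put(self, record_id, kind, owner, artifact):
        """Store only metadata (kind, owner, digest), never the artifact itself."""
        if record_id in self._records:
            raise ValueError(f"record {record_id!r} already exists")
        self._records[record_id] = {
            "kind": kind,
            "owner": owner,
            "digest": fingerprint(artifact),
        }

    def verify(self, record_id, artifact):
        """True only if the record exists and the artifact matches its digest."""
        record = self._records.get(record_id)
        return record is not None and record["digest"] == fingerprint(artifact)


ledger = InMemoryLedger()

# Organization "one" registers one record per scenario (sample values are invented).
source = {"stream": "line-3-vibration", "device": "cpps-17"}
config = {"model": "anomaly-detector", "version": "1.2", "threshold": 0.8}
result = {"window": 42, "anomaly": True, "score": 0.93}
ledger.put("source-1", "data-source", "org1", source)
ledger.put("config-1", "processor-state", "org1", config)
ledger.put("result-1", "analytics-result", "org1", result)

# Organization "two" later checks a copy it received through another route.
received = {"window": 42, "anomaly": False, "score": 0.93}
```

Here `ledger.verify("result-1", result)` succeeds, while `ledger.verify("result-1", received)` fails because the copy differs from what Organization “one” recorded. Storing digests rather than the artifacts themselves is a design choice of this sketch, not something the text above prescribes.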
As shown in the figure, Organizations “one” and “two” are hosted on two distinct virtual machines that act as Workers within the Docker Swarm network. For the transition to a production-level deployment, the administrator can replicate those Workers to match the number of stakeholders requiring services from Hyperledger Fabric. The Manager within the Docker Swarm network corresponds to a third virtual machine, which hosts Organization “zero” and a set of Fabric Orderers (accompanied by their Certification Authority). This configuration is not mandatory for the future full-scale model, which could employ an odd number of Managers hosting only replicas of the Fabric Orderers and network monitoring/administration tools. If the leading Manager fails, another Manager can take over the task of overseeing the network processes. For the MVP, the Manager also hosts an Organization simply to cut down on resource expenses.
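The preference for an odd number of Managers follows from majority-based consensus: Docker Swarm managers coordinate through Raft, which needs a strict majority (a quorum) of managers to stay available. The arithmetic can be sketched as follows; the function names are chosen here for illustration.

```python
def quorum(managers):
    """Smallest number of managers that forms a strict majority."""
    if managers < 1:
        raise ValueError("a swarm needs at least one manager")
    return managers // 2 + 1


def tolerated_failures(managers):
    """How many managers can fail while a quorum is still reachable."""
    return managers - quorum(managers)


# Compare a few cluster sizes side by side.
summary = {n: (quorum(n), tolerated_failures(n)) for n in range(1, 8)}
```

Three managers tolerate one failure, and so do four, so adding a manager to reach an even count raises the quorum without improving fault tolerance. Five managers tolerate two failures.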
As already described, an “Organization” consists of the actual Fabric Peer Node, a Certification Authority, a Command-Line Interface for administration tasks, a CouchDB instance to persist the global state (the world state, in Fabric terms), and a Java Spring Boot Application that exposes an API with the business logic to the outside world. All of these are deployed in Docker containers. Fabric Channels are, in essence, also materialized through Docker containers. These can be hosted anywhere, but for clarity let us assume that they are also hosted on the Manager machine. Smart Contracts are likewise deployed as Docker containers and are strictly associated with Fabric Channels. All Peers and the set of Orderers are then attached to Channels so that they effectively share the global state of the distributed ledger.
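To make the MVP layout easier to picture, the following sketch models it as plain data: one Manager hosting Organization “zero” and the Orderers, plus two Workers hosting Organizations “one” and “two”. It is not a deployment script. The machine names, container names, and the default of three Orderers are assumptions, since the text only speaks of "a set" of Orderers. Channels and Smart Contract containers are left out for brevity.

```python
from dataclasses import dataclass, field

# Per-organization components named in the text above.
ORG_COMPONENTS = ("peer", "ca", "cli", "couchdb", "api")


@dataclass
class SwarmNode:
    name: str
    role: str  # "manager" or "worker"
    containers: list = field(default_factory=list)


def add_organization(node, org):
    """Place one container per organization component on the given node."""
    for component in ORG_COMPONENTS:
        node.containers.append(f"{component}.{org}")


def build_mvp(orderer_count=3):
    """Build the three-machine MVP layout (orderer_count is an assumption)."""
    manager = SwarmNode("vm-manager", "manager")
    add_organization(manager, "org0")
    manager.containers += [f"orderer{i}" for i in range(orderer_count)]
    manager.containers.append("ca.orderer")

    workers = []
    for i, org in enumerate(("org1", "org2"), start=1):
        worker = SwarmNode(f"vm-worker{i}", "worker")
        add_organization(worker, org)
        workers.append(worker)
    return [manager] + workers


nodes = build_mvp()
```

Scaling toward production, as described earlier, would correspond to adding more Worker entries and moving Organization “zero” off the Manager.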