By Qi Zhou (@qc_qizhou)

Introduction

A bedrock of the future Web3 is decentralized storage (dStorage), where users can store a large amount of data in the network without worrying that the data will be withheld or even discarded by a centralized organization. The essential part of dStorage is proof of storage: the network can prove that the data uploaded by users is actually stored by data providers in the network. Solutions such as Filecoin and Arweave have been developed to solve the proof of storage problem, and they work quite well for static files. However, if the data from users can be frequently modified or deleted, i.e., is dynamic, we currently do not have an ideal solution.

In this article, we focus on proof of storage on large dynamic datasets, where the data can be frequently modified or deleted by users.

To illustrate how to solve the proof of storage problem, let us first revisit a naïve solution and some existing solutions, and then explore in detail how to achieve proof of storage on large dynamic datasets.

Naïve Solution to Proof of Storage on a Static Dataset

Consider a static dataset containing a list of BLOBs that are stored off-chain by a data provider, and the commitment of the dataset is available on the blockchain.

Note that the commitment and proof must be succinct, i.e., much smaller than the actual dataset, so that on-chain verification is cheap and we do not need to upload the whole dataset to the expensive blockchain.

There are several well-known commitment schemes satisfying our needs, such as Merkle trees and polynomial commitments. In the Merkle tree commitment scheme, the commitment is the Merkle root of the data, and the proof consists of the sibling hashes of the nodes on the path from the leaf (the data to be proven) to the root. A nice property of a Merkle proof is that its size is O(log(n)), where n is the number of leaves.

An Illustration of the Merkle tree, where D1 (red) is the data to be verified, D0 and N1 (blue) are the proof, and Root is the commitment

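The Merkle commitment and proof described above can be sketched in a few lines of Python. This is a minimal illustration (assuming a power-of-two number of leaves and SHA-256 as the hash function), not a production implementation:

```python
import hashlib

def h(data: bytes) -> bytes:
    # SHA-256 is used here for illustration; any collision-resistant hash works.
    return hashlib.sha256(data).digest()

def merkle_root(leaves):
    # Hash pairs of nodes level by level until a single root remains.
    # Assumes len(leaves) is a power of two for simplicity.
    level = leaves
    while len(level) > 1:
        level = [h(level[i] + level[i + 1]) for i in range(0, len(level), 2)]
    return level[0]

def merkle_proof(leaves, index):
    # Collect the sibling hash at every level from the leaf up to the root.
    proof = []
    level = leaves
    while len(level) > 1:
        proof.append(level[index ^ 1])  # sibling of the current node
        level = [h(level[i] + level[i + 1]) for i in range(0, len(level), 2)]
        index //= 2
    return proof

def verify(root, leaf_hash, index, proof):
    # Recompute the path to the root using the O(log n) sibling hashes.
    node = leaf_hash
    for sibling in proof:
        node = h(node + sibling) if index % 2 == 0 else h(sibling + node)
        index //= 2
    return node == root

# Example matching the figure: prove D1 using the siblings hash(D0) and N1.
blobs = [b"D0", b"D1", b"D2", b"D3"]
leaves = [h(b) for b in blobs]
root = merkle_root(leaves)          # the on-chain commitment
proof = merkle_proof(leaves, 1)     # two sibling hashes = O(log n)
assert verify(root, h(b"D1"), 1, proof)
```

With four BLOBs the proof contains only two hashes, and in general log2(n) hashes, which is what makes on-chain verification cheap.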

With the above setup, we can now run the following naïve protocol to verify whether the data provider is actually storing the dataset:

  1. The data provider first submits a transaction to the protocol, together with a security deposit, to confirm that it will host the dataset.
  2. At each fixed interval (e.g., every hour, triggered by Chainlink Keepers), the protocol generates a cryptographic random value that is unpredictable and unforgeable by any data provider. The random value can come from RANDAO in Ethereum PoS or from a Verifiable Random Function (VRF) such as Chainlink VRF.
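To make step 2 concrete, once the protocol has an unpredictable random value, a common approach is to derive from it the index of the chunk the provider must prove. The sketch below is a hypothetical illustration (the mixing with the provider's address is an assumption, so that different providers are challenged on different chunks), not the exact scheme of any particular protocol:

```python
import hashlib

def challenge_index(random_value: bytes, provider_addr: bytes, num_chunks: int) -> int:
    # Mix the beacon randomness (e.g., a RANDAO or VRF output) with the
    # provider's address, then reduce modulo the number of chunks to pick
    # which chunk this provider must prove it is storing.
    digest = hashlib.sha256(random_value + provider_addr).digest()
    return int.from_bytes(digest, "big") % num_chunks

# The same inputs always yield the same challenge, so the on-chain
# verifier can recompute it, but it cannot be predicted in advance.
idx = challenge_index(b"\x01" * 32, b"\xaa" * 20, 1024)
```

Because the random value is unknown before the challenge period, a provider cannot precompute only the challenged chunk's proof and discard the rest of the dataset.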