← Will Donnelly

It would be really neat if it were possible to have a fully decentralized "cloud" ecosystem in which providers of computing resources could be made fully fungible. Sadly we'll need some staggering leaps in homomorphic computing before that can be done for CPU time.

However, there is no reason in principle that bulk data storage can't be done in a decentralized fashion. Encrypting the data on the client (and adding a MAC) means that no third party can access or tamper with the actual data, and storing redundant copies mitigates the concern that any one storage provider might disappear.

Proof of Storage

Let's start with a much simpler problem: How can I verify that the storage provider is actually storing my entire 1TiB upload, with a minimum of actual data transfer?

This can be done using a straightforward challenge-response protocol. The data must be encrypted and MAC'd in smallish chunks, perhaps in the 4kiB to 64kiB range. This is common practice for large data streams already, so it shouldn't be a problem to require here. The client can then challenge the storage provider to return a randomly selected chunk. Thanks to the MAC the client can verify that the response payload is indeed the chunk they requested, while only storing a single encryption/signing key locally.

Assuming that the client picks chunks to request at random, the storage provider can pass a given challenge with probability equal to the fraction F of the original uploaded chunks they actually stored. Repeating the challenge-response operation makes things exponentially harder (F^N) for only a linear increase in data transfer. For instance if the storage provider only retains half of the uploaded chunks, it would take only 7 random requests to have a >99% chance of discovering their duplicity.

A Brief Sketch of Everything Else

The system only needs to interact with a "blockchain" insofar as there is payment being exchanged for storage services. The acts of discovering storage providers, uploading data to them, and retrieving data back should all take place "off the chain" anyway.

If clients pay any nonzero amount at upload time, there's a real risk that the market will be overrun by "storage providers" who never bother to store anything. It is possible that some sort of reputation mechanism could mitigate this, but of course then you have to worry about Sybil attacks on the reputation system itself.

If clients pay for retrieval things work out differently, however then the storage providers are being placed in the unenviable position of having to predict whether the client will actually ever retrieve their uploaded data in full. However there is some interesting potential in this vein for the system to serve dual purposes as a sort of "CDN" for publishing content as well.

But paying for storage upfront is the more typical model in real life, and there's probably good reason for that. Ideally most payments could be effectively a simple money transfer, the same as with any other cryptocurrency. Perhaps there are two transactions on the public chain rather than one -- the first placing the payment in escrow with certain conditions attached, and the second coming some time later and releasing the funds because the client isn't disputing anything. There must also be a timeout, with the rule that if the client doesn't either dispute or release funds by that time the payment is released to the provider.

Then we come to dispute resolution: what happens when the storage provider reneges on their end of the agreement? If there's no dispute process then we must rely solely on storage provider reputation, which is possible but hampered by the fact that nobody's ever designed a really solid P2P reputation system that would work well here. But it's an option. On the other end of the spectrum, if the client could always dispute transactions and get the payment back then we'd need a solid reputation system for worthwhile clients. Same issue.

So the final piece of the design is some process by which the rest of the network can verify objectively how a disputed agreement is resolved, either in favor of the client or in favor of the server. Assume that individual chunks of uploaded data are encrypted such that it's safe to publish them to the world as part of the dispute resolution process. Then the client can publish a dispute message saying "the server did not store chunk N of storage contract C". If the server does not contest this statement then the funds will eventually be released back to the client, but the server can contest the statement by publishing the complete contents of the encrypted block.

But this has some other issues related to the fact that the full encrypted block is still large by the standards of a blockchain transaction entry. First there's the raw size issue, and secondly there's the issue that people might try to get clever and store their data directly on the chain by abusing that mechanism.

I suspect these issues are resolvable, for instance there's a way one might use merkle trees to prove possession of arbitrarily small substrings of a large blob (because you can't generate an appropriate chain of hashes from a specific range of bytes up to the top-level blob hash unless you either store all the bytes or store so many lower-level hashes that you might as well have stored the underlying bytes). Likewise adding penalty fees to every step of the dispute process ought to be able to deter most forms of bad actor, since they'll lose money rather than simply not gaining money.

But then we come to the hardest question: is this whole system with all of its complexity and overheads actually better in some way than just...paying a traditional company for a certain amount of cloud data storage?