Scaling Ethereum: Data Bloat, Data Availability, and the Cloudless Solution

12 Jun 2024

Codex is a cloudless, trustless, p2p storage protocol seeking to offer strong data persistence and durability guarantees for the Ethereum ecosystem and beyond. Currently, EIP-4844 only offers a partial solution to the problem of data bloat. Fees remain high, and the ecosystem has few long-term data storage options.

Determining how to persist Ethereum’s excess data will allow it to scale indefinitely into the future, and Codex has arrived on the scene to help alleviate those concerns. Let’s explore the problem.

Have you ever swapped ETH for another token on Uniswap?

I connected via MetaMask and tried to trade 0.01 ETH (roughly $35) for SNT. The gas fee cost as much as the trade itself. That is too high a fee for trading cryptocurrency, and most people do not want to pay this much.

Let’s get to the heart of why these transactions are so expensive.

3/20/2024

Web3 and decentralized finance have grown massively in recent years. Due to the rapid development and implementation of new protocols, the Ethereum blockchain has become bloated with data. The result? Prohibitively expensive gas fees and poor user experience. This data bloat can also be described as “network congestion,” where transaction data clogs the network and undermines scalability.

This article examines why the blockchain has become bloated, why transaction throughput has suffered, and various approaches to solving the problem. I will specifically focus on data availability in the context of Ethereum and rollups. I will explore how Codex offers a solution to the DA problem, except with data persistence and durability guarantees that most other solutions lack.

Bear with me; I will use jargon and technical language, but I will do my best to explore this vital, underappreciated topic in clear language. More people in the ecosystem must begin grappling with how robust data availability sampling (DAS) is for scaling blockchains. Before continuing, the reader should have read about consensus mechanisms, proof of stake, and how the technology functions from a high level.

Let us start by unpacking the blockchain trilemma.

Problematic Trilemma

All decentralized technologies that want to grow suffer from similar constraints.

They want to scale to allow more and more users to adopt the tech — from thousands to millions of users. However, scaling different technologies comes with different engineering challenges.

In the case of Ethereum, the blocks on the chain contain transactional, state, and smart contract data. The more people use the network, the more data is added to each block. The problem is that when the blocks start filling up, a fee market emerges, where those who pay higher gas fees are more likely to get their transaction included in the next block.
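To make the fee market concrete, here is a minimal sketch of how a block producer might fill a block when demand exceeds capacity. It is a toy model: the `Tx` fields, the `build_block` helper, and the fee units are hypothetical, and it ignores details of Ethereum's real fee mechanism, such as the base fee.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Tx:
    sender: str
    gas_used: int     # gas the transaction consumes
    fee_per_gas: int  # price offered per unit of gas (hypothetical units)


def build_block(mempool: list[Tx], gas_limit: int) -> list[Tx]:
    """Greedy block building: take the highest bidders first until the block is full."""
    block, gas_left = [], gas_limit
    for tx in sorted(mempool, key=lambda t: t.fee_per_gas, reverse=True):
        if tx.gas_used <= gas_left:
            block.append(tx)
            gas_left -= tx.gas_used
    return block


mempool = [
    Tx("alice", 21_000, 5),
    Tx("bob", 21_000, 50),
    Tx("carol", 21_000, 20),
]
# The block only has room for two of the three transactions.
block = build_block(mempool, gas_limit=42_000)
included = [tx.sender for tx in block]
```

In this example the lowest bidder is left waiting for a later block, which is exactly the pressure that pushes fees up when blocks are full.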

A simple solution would be to expand the block size and allow more transactional data. However, there is a problem with this approach, which is part of the blockchain trilemma.

The trilemma states that blockchains have three primary features they want to maintain and enhance: scalability, decentralization, and security. It suggests that improving any two of them tends to weaken the third.

In the case of Ethereum, upgrading the block capacity also increases hardware requirements for running a fully validating node on the network. When the network raises hardware requirements in such a way, it becomes more difficult for ordinary people to run a full node — which negatively impacts the network by decreasing overall decentralization and censorship resistance.

On the surface, the problem seems insurmountable. Luckily, developers and engineers are rethinking how blockchains can scale. They are envisioning blockchains and their ecosystems as modular rather than monolithic.

Modular Versus Monolithic

It is vital to restate that running a full node on the network is imperative to its success. But what exactly is a “full node” or “fully validating node”?

A full node is a network participant that downloads all blockchain data and executes all transactions created on the network. Full nodes require more computing power and disk space because they download the complete transactional data set.

An article by Yuan Han Li titled “WTF is Data Availability” explains:

“Since full nodes check every transaction to verify they follow the rules of the blockchain, blockchains cannot process more transactions per second without increasing the hardware requirements of running a full node (better hardware = more powerful full nodes = full nodes can check more transactions = bigger blocks containing more transactions are allowed).”

The problem with maintaining decentralization is that you want some network participants to run full nodes. However, these nodes require tremendous computing power that is too expensive for most users to purchase and maintain. And if that occurs, it dramatically limits the number of nodes on the network, harming overall decentralization.

The main problem is that the miners and validators could withhold data from the network, preventing others from accessing all the data. This is the crux of the problem in the context of “monolithic blockchains.”

Although this is a bit of an overused buzzword in the ecosystem, the idea of “monolithic” in blockchain means that the base layer — or the Ethereum blockchain — has to act as the settlement layer, the consensus layer, and the data availability layer, which bloats the system with data, slowing down transactional throughput and raising fees.

The solution to this problem of having a “monolithic” blockchain is to “modularize” its functionality and offload the data availability function to other network participants. In this scenario, the base layer of the blockchain would then just function as the settlement and consensus layer. All data availability requirements would be offloaded to other actors in the network.

Now that we understand the wisdom of modularisation, what exactly is data availability, and why is it crucial to the network?

The DA Problem and Rollups

Data availability is what a blockchain requires to function as an immutable arbiter of truth. Without the availability of transactional data, no one would know if the blockchain contains fraudulent or invalid transactions. In other words, no one could prove whether the validators and miners behaved maliciously or not. An article by Emmanuel Awosika described it:

“‘Data availability’ is the guarantee that the data behind a newly proposed block—which is necessary to verify the block’s correctness—is available to other participants on the blockchain network.”

An important aside: note that there is a difference between “data availability” and “data storage.” Many people in the space confuse the two. Data availability asks if the data is available and anyone can access it, and data storage means holding data in a location over the long term. In this sense, data storage implies the idea of “data persistence.” Nick White, Celestia’s COO, provided a powerful analogy:

If you have canned food, it represents data storage. The food is in the can and stored for the long term, and can be accessed and taken out of storage at any time. In this sense, there is an element of “data persistence” with regard to “data storage.” Conversely, data availability is more like a buffet. The food is prepared and spread out on a buffet table. It is available for everyone to sample. Data availability is similar.

Data is made available to the network primarily so network participants can verify the data is accurate and does not contain malicious transactions.

This raises the question: what is the “data availability problem”?

The “data availability problem” is the central problem technologists are trying to solve to scale Ethereum. The problem is that when a full node broadcasts transactional data around the ecosystem, smaller nodes called “light nodes” do not typically have the hardware needed to download and execute all of the transactions.

A ledger.com article explained how light nodes work:

“Light nodes do not download or validate transactions and only contain the block header. In other words, light nodes assume that transactions in a block are valid without the verification that full nodes provide, which makes light nodes less secure. This issue is referred to as the data availability problem.”

In this case, those nodes just need to know if the data is available and if it represents the current “state” of the blockchain. The “state” is simply the data currently stored on the chain, such as address balances and smart contract values. On the Ethereum blockchain, in its current form, light clients have to rely on so-called data availability committees (DACs) to provide on-chain attestations that the data is indeed available.

In the context of an Ethereum scaling solution, called a rollup, this data has to be made available so that network participants can determine if that data conforms to network rules. In other words, they need to ensure the data is accurate and that validators do not try to dupe the light clients.

Optimistic and ZK Rollups

To understand the DA problem further, it is crucial to comprehend rollups. Rollups are layer 2 blockchains that have nodes called sequencers; these sequencers assist in batching, compressing, and ordering transactions. Benjamin Simon described the relationship between rollups and Ethereum:

“A rollup is essentially a separate blockchain, but with a couple of modifications. Like Ethereum, a rollup protocol has a “virtual machine” that executes smart contract code. The rollup’s virtual machine operates independently from Ethereum’s own virtual machine (the “EVM”), but it is managed by an Ethereum smart contract. This connection allows rollups and Ethereum to communicate. A rollup executes transactions and processes data, and Ethereum receives and stores the results.”

Put simply, rollups are off-chain scaling solutions. However, rollups do not sacrifice security like many “off-chain” scaling solutions normally would. In the case of rollups, only data processing and computation occur off-chain (via sequencers). The transactions are ultimately stored on the layer 1 blockchain, preserving security. Before EIP-4844, rollups posted this on-chain data as “calldata.”
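As a rough illustration of what “batching, compressing, and ordering” means, the sketch below plays the role of a sequencer. Everything in it is an assumption made for illustration: the transaction fields, the `build_batch` and `read_batch` helpers, and the use of JSON with zlib. Real rollups use their own encodings and compression schemes.

```python
import json
import zlib


def build_batch(txs: list[dict]) -> bytes:
    """Order transactions by arrival sequence, serialize them, and compress the result."""
    ordered = sorted(txs, key=lambda tx: tx["seq"])
    raw = json.dumps(ordered, separators=(",", ":")).encode()
    return zlib.compress(raw, level=9)


def read_batch(blob: bytes) -> list[dict]:
    """Anyone holding the posted bytes can decompress them and re-check the transactions."""
    return json.loads(zlib.decompress(blob))


# 100 toy transactions that arrive out of order.
txs = [{"seq": i, "from": "0xabc", "to": "0xdef", "value": 1} for i in reversed(range(100))]
blob = build_batch(txs)
raw_size = len(json.dumps(txs).encode())
compressed_size = len(blob)
```

The point of the exercise is that the compressed batch is what gets posted to layer 1, so the smaller it is, the less each user pays, yet the data stays available for anyone to verify.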

In a way, rollups are the community’s way of “having their cake and eating it too”; they get to maintain network security while scaling the usability. It is an ingenious solution.

There are two popular types of rollups: Optimistic Rollups and ZK Rollups.

Optimistic rollups are the more widely discussed and deployed type of rollup. As their name suggests, “optimistic” rollups rely on a 1-of-n trust assumption: at least one honest actor in the ecosystem is watching the chain. What does that mean? Optimistic rollups assume all transactions posted to the network are valid. To compensate for this “optimism,” rollups provide a 7-day window during which anyone can submit a “fraud proof” showing that the transactions submitted by the rollup are invalid.
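The challenge window can be sketched in a few lines. This is a simplified model, not how any particular rollup implements it: the `Batch` type and both helper functions are hypothetical, and whether a fraud proof is valid is passed in as a flag rather than computed.

```python
from dataclasses import dataclass

CHALLENGE_WINDOW_SECONDS = 7 * 24 * 60 * 60  # the 7-day window described above


@dataclass
class Batch:
    posted_at: int            # timestamp at which the rollup posted the batch
    challenged: bool = False  # set once a valid fraud proof has been accepted


def submit_fraud_proof(batch: Batch, now: int, proof_is_valid: bool) -> bool:
    """Accept a fraud proof only inside the window; returns the batch's challenged status."""
    in_window = now < batch.posted_at + CHALLENGE_WINDOW_SECONDS
    if in_window and proof_is_valid:
        batch.challenged = True
    return batch.challenged


def is_final(batch: Batch, now: int) -> bool:
    """A batch is final once the window has passed without a successful challenge."""
    return (not batch.challenged) and now >= batch.posted_at + CHALLENGE_WINDOW_SECONDS


DAY = 24 * 60 * 60
honest = Batch(posted_at=0)
fraudulent = Batch(posted_at=0)
submit_fraud_proof(fraudulent, now=2 * DAY, proof_is_valid=True)
```

Note the trade-off this exposes: the honest batch is only final after the full window has elapsed, which is why withdrawals from optimistic rollups involve a waiting period.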

One key thing to know about optimistic rollups is that they are mostly EVM compatible, so developers can efficiently work with them. In this way, they can be seen as Ethereum's more popular scaling solution. Two examples of optimistic rollups are Optimism and Arbitrum.

ZK-rollups use zero-knowledge cryptography to prove that the transactions they compress and batch are correct and accurate. Instead of assuming that all the transactions are accurate (like optimistic rollups), ZK-rollups generate a “validity proof” to demonstrate the transactions are valid immediately, eliminating any waiting period.

However, it is known that ZK-rollups can be more difficult for developers to work with, as not all of them are EVM-compatible. ZK-rollups are also computationally intensive because generating the proofs consumes many resources. Nonetheless, more and more EVM-compatible rollups are starting to hit the market. The Scroll rollup EVM solution is just one example.

Solution: Data Availability Sampling and Codex

I mentioned earlier that rollups need somewhere to dump their data. Most rollups have been pushing data to the Ethereum main chain, as mentioned — which leads to the crux of the problem: data bloat. When bloat occurs, transactional throughput suffers, and fees for transactions and smart contract execution increase.

Recall that part of the solution is not to rely on fully validating nodes for network security. If we just rely on these nodes, most users would be unable to run full nodes due to prohibitively expensive hardware requirements. (Note that raising the block size is a potential solution, albeit dubious, as this path negatively impacts decentralization. Nonetheless, that particular argument has become invalid because rollups act as layer 2 scaling solutions that maintain the security of the main chain).

That said, what is the answer to not having everyone run full nodes?

The solution is to empower light nodes (as well as full nodes) to verify data without downloading and executing all transactions. This is the heart of the problem and where the magic of scaling the Ethereum network (among other blockchains) can be found.

Data Availability, Erasure Encoding, and Codex

The first step is to have a data availability layer with a robust network of light clients to determine if the data is available. But how can light clients, who typically only check header data and rely on full nodes for their information, ensure their data is valid and complete? The answer can be found within a mathematical trick called “data availability sampling (DAS).”

DAS is a method of sampling a small portion of a larger chunk of data and using those samples to determine, with high probability, that the rest of the data exists and can be reconstructed. Many organizations (including the Celestia blockchain and DA layer) are leveraging DAS through erasure encoding and polynomial commitments. Reed-Solomon codes are the popular choice among many projects. These types of polynomials look like this:

Y = a[0] + a[1]x + a[2]x^2 + ... + a[k]x^k

These functions are used to determine missing data and fully restore it. This works by creating a “K of N” encoding, where K is the number of original data pieces and N is the total number of pieces, including the N - K pieces of “parity data.” If some of the pieces go missing, the node’s machine leverages a mathematical function called Lagrange Interpolation to restore the original data from any K pieces that remain. The mathematics involved are seemingly arcane to most people, but the idea is straightforward.
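For readers who want to see the trick in action, here is a toy version of the idea: k original symbols are extended to n symbols in total, and any k survivors are enough to rebuild the originals with Lagrange interpolation. It is a sketch under stated assumptions, not the scheme Codex or Celestia actually ships. The small prime field `P = 257` and the function names are my own choices, and the data symbols are treated as the polynomial's values at x = 0..k-1 rather than as its coefficients, which is a common variant.

```python
P = 257  # small prime; all arithmetic is modulo P (an illustrative choice)


def lagrange_eval(points: list[tuple[int, int]], x: int) -> int:
    """Evaluate at x the unique polynomial of degree < len(points) through `points`, mod P."""
    total = 0
    for i, (xi, yi) in enumerate(points):
        num, den = 1, 1
        for j, (xj, _) in enumerate(points):
            if i != j:
                num = num * (x - xj) % P
                den = den * (xi - xj) % P
        # pow(den, -1, P) is the modular inverse of den (Python 3.8+).
        total = (total + yi * num * pow(den, -1, P)) % P
    return total


def encode(data: list[int], n: int) -> list[int]:
    """Extend k data symbols to n symbols: positions 0..k-1 hold the data, k..n-1 the parity."""
    points = list(enumerate(data))
    return [lagrange_eval(points, x) for x in range(n)]


def recover(shares: dict[int, int], k: int) -> list[int]:
    """Rebuild the k original symbols from any k surviving (position, value) pairs."""
    points = list(shares.items())[:k]
    return [lagrange_eval(points, x) for x in range(k)]


data = [72, 105, 33, 7]                          # k = 4 original symbols, each below P
coded = encode(data, n=8)                        # 4 data symbols plus 4 parity symbols
survivors = {i: coded[i] for i in (1, 4, 6, 7)}  # half of the symbols have gone missing
restored = recover(survivors, k=4)
```

Here three of the four original symbols were lost, yet the data comes back intact because four symbols of any kind survived.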

There are a few clear examples of erasure coding in action. The method has long been used to recover data from scratched CDs: erasure coding lets a CD player reconstruct bits of music that are missing due to surface damage. Satellites also leverage erasure codes in case data goes missing in the vastness of space. In both cases, the receiver can reconstruct missing data, adding redundant protection to the system.
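Erasure coding is also what makes sampling so effective for light clients. Once the data is extended, hiding even a small part of the original requires withholding a large fraction of the extended chunks, and a large hole is easy to hit with a few random samples. The sketch below assumes independent, uniformly random samples (with replacement) and uses hypothetical helper names; the 50% figure in the usage example is an assumption that corresponds to a simple one-dimensional code with as much parity as data.

```python
import random


def miss_probability(withheld_fraction: float, samples: int) -> float:
    """Chance that `samples` independent uniform samples all land on available chunks."""
    return (1.0 - withheld_fraction) ** samples


def light_client_check(available: list[bool], samples: int, rng: random.Random) -> bool:
    """Sample random chunk indices; accept only if every sampled chunk is served."""
    return all(available[rng.randrange(len(available))] for _ in range(samples))


# If half of the extended chunks are withheld, how likely is it that
# 20 random samples all miss the hole?
p_fooled = miss_probability(0.5, 20)

# A fully available block always passes the check.
ok = light_client_check([True] * 100, samples=20, rng=random.Random(0))
```

With these assumptions, `p_fooled` is 0.5 raised to the 20th power, or less than one in a million, and that is for a single light client. Many clients sampling independently push the odds down even further.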

The specific scheme that Codex (as well as Celestia) uses is called the 2D erasure coding scheme. It should be noted that 2D erasure coding, although popular in the crypto ecosystem, is not a new technology. However, how it is used to solve the DA problem is quite interesting. Dr. Bautista explained how the Codex team uses erasure coding:

“Similarly to Codex, erasure coding the original data into a more redundant and robust data structure is fundamental for the rest of the protocol to work, without it there is no magic. In Codex, this happens inside the Codex client of the node that wants to upload the data, while in Ethereum this happens inside the Ethereum validator of the consensus/beacon client of the node that is building/proposing the block.”

There is more to the story regarding the journey of the data in Codex, but it is beyond the scope of the article. Read Dr. Bautista’s piece to understand data dispersal, sampling, and the “lazy repair” mechanisms that Codex leverages.

Codex intends to have simultaneous data storage and retrieval functionality and data availability sampling through proof compression. This would allow for processing ephemeral data (or data that is not needed over the long term), and data persistence and durability guarantees that other projects may be missing.

Conclusion: Cracking the Problem

The debate on how to scale blockchains is ending. In the Bitcoin ecosystem, arguments have been raging on how to scale a blockchain, from increasing the block size limit to leveraging layer 2 solutions. The reality is that a mixture of the two is the most reasonable solution. For instance, Codex can act as the cloudless data availability layer for Ethereum (as well as for other blockchains), allowing the block size to grow because the network would contain many nodes to conduct DA checks on the network.

The good news is that this will increase the network's throughput while maintaining the security of the base layer. And what results from that? Yep, you got it: cheaper fees and faster transactions. As users of blockchains, that is really what we care most about.

One day, perhaps soon, I will be able to do my token swap for pennies on the dollar instead of for 35 bucks.

by Sterlin Lujan