Managing state growth & corresponding issues
As the Orbit ecosystem grows, more and more teams are choosing to build on the Arbitrum tech stack. Orbit offers a feature-rich and scalable rollup stack that allows teams to focus on their ecosystem growth and build great products. As a result, many Orbit chains are seeing usage and throughput increase rapidly, often with sustained transaction load at the chain’s throughput limit for months on end.
This page aims to help Orbit chain operators and owners safely operate a high-throughput chain. The primary considerations are state size and state growth rate. We'll discuss how increased state size affects the performance of different components of the Orbit stack, and which metrics can indicate a need to upgrade various infrastructure components.
Understanding state size and state growth rate
When we say 'state size', we mean the total amount of data recorded on the blockchain. State size is a critical metric for node performance because a larger state imposes higher infrastructure requirements on nodes for storing and searching the existing state.
The state growth rate is simply the rate at which state size increases. A high state growth rate creates higher requirements for nodes to process state transitions and perform operations needed to keep up with the tip of the chain.
The critical Nitro stack parameter affecting state size and state growth rate is the gas speed limit. Offchain Labs sets it to a default of 7,000,000 gas/s, a safe, sustained operating limit designed to keep Orbit chains performant and sustainable over the long term. You can read more about the nuances of the gas speed limit here.
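For illustration, the sketch below shows how a chain owner could adjust this parameter through Nitro's `ArbOwner` precompile, which exposes `setSpeedLimit`. It assumes ethers v6; the RPC URL and key handling are placeholders, and raising the limit beyond the default trades away the safety margin discussed on this page.

```typescript
import { ethers } from "ethers";

// Minimal sketch (assuming ethers v6): a chain owner adjusting the gas speed
// limit via Nitro's ArbOwner precompile. The RPC URL and key handling are
// placeholders; only the chain owner account may call this.
const ARB_OWNER = "0x0000000000000000000000000000000000000070";
const arbOwnerAbi = ["function setSpeedLimit(uint64 limit) external"];

async function updateSpeedLimit(gasPerSecond: bigint): Promise<void> {
  const provider = new ethers.JsonRpcProvider("https://your-orbit-chain.example/rpc");
  const owner = new ethers.Wallet(process.env.CHAIN_OWNER_KEY!, provider);
  const arbOwner = new ethers.Contract(ARB_OWNER, arbOwnerAbi, owner);

  // 7,000,000 gas/s is the default; raising it trades safety margin for throughput.
  const tx = await arbOwner.setSpeedLimit(gasPerSecond);
  await tx.wait();
}

updateSpeedLimit(7_000_000n).catch(console.error);
```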
Behavior at ultra-high throughput
At high state growth rates, especially in cases where a chain is pushing past prescribed limits, an Orbit chain may display certain behaviors that either indicate or result from the chain load being higher than its infrastructure can support. The following is a list of such behaviors.
1. Read Operation Bottleneck at High Disk Latency
As the number of read requests on a chain grows, the impact of disk latency on performance becomes more pronounced. The aggregate impact can be approximated as the number of read requests multiplied by the per-request disk latency, so high read volumes may necessitate switching to low-latency local NVMe drives.
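As a rough illustration of that multiplication, the following sketch (with made-up numbers, and ignoring request parallelism) shows why per-request latency dominates at high read volumes:

```typescript
// Back-of-the-envelope sketch: total time spent waiting on disk scales as
// (read ops) x (latency per read). All numbers are illustrative, not measurements.
function readStallSeconds(readOpsPerSecond: number, diskLatencyMs: number): number {
  return readOpsPerSecond * (diskLatencyMs / 1000);
}

// A networked disk at ~1 ms latency vs a local NVMe drive at ~0.05 ms:
console.log(readStallSeconds(10_000, 1)); // 10 s of waiting per wall-clock second if serial -> saturated
console.log(readStallSeconds(10_000, 0.05)); // 0.5 s -> comfortable headroom
```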
2. Increased Single-Core CPU and RAM Utilization
High single-core CPU and RAM utilization indicates that you may require more performant hardware. As this trend continues, hardware investments become prohibitively expensive for the ecosystem or require increasingly custom solutions, which decreases accessibility for node runners.
3. Increased Total State Database Size
A state database growth rate on the order of multiple terabytes per month indicates that your chain may require larger drives. Played out over time, this may force node runners on the chain to adopt prohibitively expensive or hard-to-procure drives (e.g., those not available on major cloud providers).
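To make this concrete, a simple runway estimate follows; the helper and all figures are hypothetical:

```typescript
// Hypothetical helper: estimate how long current drives will last given the
// observed state database growth rate. All numbers are placeholders.
function monthsUntilDiskFull(capacityTb: number, usedTb: number, growthTbPerMonth: number): number {
  return (capacityTb - usedTb) / growthTbPerMonth;
}

// e.g., a 16 TB volume, 6 TB used, growing 2 TB/month -> ~5 months of runway.
console.log(monthsUntilDiskFull(16, 6, 2));
```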
4. Increased Disk Write Operations per Second
The number of write operations per second directly correlates to state size growth. As the state growth rate increases, ecosystem nodes that aren’t properly resourced may fall out of sync with the chain.
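One way to observe this correlation is to sample completed write operations directly. The sketch below is Linux-only and reads `/proc/diskstats`; the device name `nvme0n1` is a placeholder:

```typescript
import { readFileSync } from "fs";

// Linux-only sketch: sample completed write operations from /proc/diskstats
// twice and report writes/s for one device. "nvme0n1" is a placeholder.
function completedWrites(device: string): number {
  const row = readFileSync("/proc/diskstats", "utf8")
    .split("\n")
    .map((line) => line.trim().split(/\s+/))
    .find((fields) => fields[2] === device);
  if (!row) throw new Error(`device ${device} not found in /proc/diskstats`);
  return Number(row[7]); // kernel diskstats field 5: writes completed
}

const intervalSeconds = 10;
const before = completedWrites("nvme0n1");
setTimeout(() => {
  const after = completedWrites("nvme0n1");
  console.log(`${(after - before) / intervalSeconds} write ops/s`);
}, intervalSeconds * 1000);
```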
5. Increased Sync-from-Genesis Time
As state size increases, so does the time a new node needs to catch up to the tip of the chain. In the worst case, a large state size combined with a high state growth rate can prevent new nodes from ever catching up.
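The worst case falls out of simple arithmetic: a node only gains ground at the difference between its sync throughput and the chain's growth rate. The figures below are illustrative assumptions:

```typescript
// Sketch of the catch-up dynamic: a fresh node reaches the tip only if its
// sync throughput exceeds the chain's state growth rate.
function catchUpDays(stateSizeTb: number, syncTbPerDay: number, growthTbPerDay: number): number {
  if (syncTbPerDay <= growthTbPerDay) return Infinity; // node never catches up
  return stateSizeTb / (syncTbPerDay - growthTbPerDay);
}

console.log(catchUpDays(10, 0.5, 0.1)); // 25 days
console.log(catchUpDays(10, 0.1, 0.1)); // Infinity: growth outpaces sync
```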
Summary of symptoms, mitigations, and risks
The general trend with any issue in the table below is as follows:
- The simple resolutions involve moving to more expensive infrastructure.
- When simple resolutions are exhausted, infrastructure becomes both expensive and bespoke (options that readily available cloud providers do not support).
- The long-term risk (and point of no return) is when infrastructure requirements are too expensive or too inaccessible for node runners.
| Behavior | Risk | Mitigations & Considerations |
| --- | --- | --- |
| Performance degradation due to storage reads at high disk latency | An increase in read operations causes nodes to spend more time reading state from disk. As read volume grows, these delays can degrade chain performance. | Upgrade to faster local NVMe drives (PCIe Gen4/Gen5, not configured with RAID). In the short term, NVMe usage will significantly increase costs for node runners. In the extreme, you may exhaust the drive specifications offered by available infrastructure vendors. |
| Growing or constant sequencer backlog (using the `arb_sequencer_backlog` metric) over a sustained period | A growing or persistent backlog implies that nodes cannot keep up with the transaction load accepted by the sequencer. | A monitoring sketch for detecting a sustained backlog follows this table. |
| Large state database size and high growth rate of the state database | A large state database requires nodes to run more expensive disks, which reduces economic feasibility for node runners. In extreme cases, the required disk size may be unsupported by accessible cloud service providers. | The primary resolution is to increase the disk size requirement for nodes on your chain. |
| High utilization of single-core CPU and RAM | As with the cases above, this symptom implies a need to upgrade hardware. The main risk is the economic feasibility and long-term accessibility of new hardware options. | The only resolution is to upgrade your node's CPU and RAM. |
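As referenced in the backlog row above, here is a hedged monitoring sketch that polls a Nitro node's Prometheus-style metrics endpoint for `arb_sequencer_backlog`. The endpoint path and port follow go-ethereum conventions that Nitro inherits and may differ for your configuration; the alert threshold is an arbitrary placeholder.

```typescript
// Hedged sketch: watch a node's metrics endpoint for a sustained sequencer
// backlog. URL, port, and threshold are placeholder assumptions; verify the
// metrics path against your own node's metrics configuration.
const METRICS_URL = "http://localhost:6070/debug/metrics/prometheus";

async function sequencerBacklog(): Promise<number> {
  const body = await (await fetch(METRICS_URL)).text();
  const line = body.split("\n").find((l) => l.startsWith("arb_sequencer_backlog"));
  return line ? Number(line.trim().split(/\s+/).pop()) : 0;
}

// Warn only when the backlog exceeds the threshold for many consecutive
// samples, since short spikes under load are normal.
async function watchBacklog(thresholdTxs = 1_000, consecutiveLimit = 30, intervalMs = 10_000) {
  let consecutive = 0;
  while (true) {
    consecutive = (await sequencerBacklog()) > thresholdTxs ? consecutive + 1 : 0;
    if (consecutive >= consecutiveLimit) {
      console.warn("Sustained sequencer backlog: nodes may be falling behind the accepted load.");
    }
    await new Promise((resolve) => setTimeout(resolve, intervalMs));
  }
}

watchBacklog().catch(console.error);
```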