Risk & Security Enhancement for App Chains: An In-depth Writeup of CWA-2023-004

In January 2024, the CertiK research team, in collaboration with Confio's security contributors, identified and addressed a high-impact vulnerability affecting App Chains that allow permissionless uploads in the CosmWasm ecosystem. This vulnerability, designated as CWA-2023-004, enables a remote attacker to submit a malformed contract payload that causes a deterministic failure in every transaction processed by the WasmVM. This ultimately leads to a widespread outage across the validator network.

To mitigate the issue, the Confio team and CertiK worked together to develop and prepare a patch swiftly. We also proactively notified all app chains vulnerable to this issue, ensuring a smooth and prompt adoption of the security patch across the entire ecosystem. Thanks to the community’s joint effort, this bug was addressed without any user impact.

In this blog post, we delve into the technical details of CWA-2023-004.

Glimpse of the Impact

The vulnerability resides in versions of WasmVM prior to 1.5.1, 1.4.2, 1.3.1, and 1.2.5 and affects app chains that permit permissionless contract uploads. A remote attacker can submit a relatively small, malformed WASM payload (approximately 600KB) to poison the targeted app chain. Once exploited, validator nodes continue to operate without crashing, giving the appearance of normal functionality. However, every subsequent transaction processed by the network will fail.

This "live but deterministically failing" behavior significantly disrupts the network and blocks regular transactions, including:

Users being unable to store new contracts.
Users being unable to execute existing contracts.

This form of network disruption is particularly insidious, as the system remains operational but is functionally paralyzed. Unlike typical outages where services are visibly down, this vulnerability creates a deceptive state where the network appears to be live, yet all critical operations fail. Because the failure can go unnoticed until significant damage is done, it poses a novel and challenging threat to the reliability and trustworthiness of the affected app chain network.

Technical Details

Panic In The Module Serialization Routine

In wasmd, the default size limit imposed on Wasm payloads is currently 800KB. This limit is defined in x/wasm/types/validation.go as follows:

// https://github.com/CosmWasm/wasmd/blob/main/x/wasm/types/validation.go#L22
// MaxWasmSize is the largest a compiled contract code can be when storing code on chain
MaxWasmSize = 800 * 1024 // extension point for chains to customize via the compile flag.
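
To make the numbers concrete, here is a minimal sketch, written in Rust for consistency with the VM code discussed later, of why a payload of roughly 600KB is accepted under the default limit. The constant mirrors the Go value above; the helper name within_size_limit is hypothetical, and the sketch assumes the limit is inclusive.

```rust
/// Mirrors wasmd's default `MaxWasmSize` (800 * 1024 bytes).
const MAX_WASM_SIZE: usize = 800 * 1024;

/// Hypothetical helper (not part of wasmd): true if the payload fits the limit.
/// Assumption: a payload of exactly MAX_WASM_SIZE bytes is accepted.
fn within_size_limit(wasm: &[u8]) -> bool {
    wasm.len() <= MAX_WASM_SIZE
}

fn main() {
    // Placeholder bytes of the size mentioned in this post, not a real contract.
    let payload = vec![0u8; 600 * 1024];
    println!(
        "limit = {} bytes, payload = {} bytes, accepted = {}",
        MAX_WASM_SIZE,
        payload.len(),
        within_size_limit(&payload)
    );
}
```

A size check alone therefore cannot stop this attack: what matters is the size of the compiled output, not the size of the input.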

A user-submitted Wasm payload within this size limit goes through a series of checks before it is saved to disk: validation by compatibility::check_wasm, followed by compilation into a runtime representation, Module. The compiled module is then serialized to the filesystem cache using wasmer::Module::serialize_to_file.

impl<A, S, Q> Cache<A, S, Q> {
    pub fn save_wasm_unchecked(&self, wasm: &[u8]) -> VmResult<Checksum> {
        …
        let module = compile(&compiling_engine, wasm)?;
        …
        cache.fs_cache.store(&checksum, &module)?;   // inner panic
    }
}

impl FileSystemCache {
    pub fn store(&mut self, checksum: &Checksum, module: &Module) -> VmResult<usize> { 
        …
        module
            .serialize_to_file(&path)
            .map_err(|e| VmError::cache_err(format!("Error writing module to disk: {e}")))?;
        …
    }
}

However, an attacker can carefully craft a malformed wasm payload that meets this size limitation and passes all validations, but compiles into an excessively large Module. This leads to a runtime panic when the compiled module is saved to disk. The panic occurs when a huge compiled module (over 2GB) is serialized using rkyv inside wasmer. For the full backtrace and reproduction details, please refer to the advisory page.

While application chains have the flexibility to override the wasm size limit, our observations indicate that most app chains tend to loosen this restriction rather than tighten it. Here are a few examples.

Neutron: 1.6MB
Terra2: 800KB (no change)
Osmosis: 3MB

From Panic To Poisoned Lock

Within the cosmwasm-vm crate, the struct Cache uses a Mutex to safeguard the CacheInner field, ensuring exclusive access to the cache. Inside the Cache::save_wasm_unchecked function, the lock is acquired before the Module is saved to fs_cache.

pub struct Cache<A: BackendApi, S: Storage, Q: Querier> {
    …
    inner: Mutex<CacheInner>,
    …
}

impl<A, S, Q> Cache<A, S, Q> {
    pub fn save_wasm_unchecked(&self, wasm: &[u8]) -> VmResult<Checksum> {
        …
        let mut cache = self.inner.lock().unwrap();                  // acquire lock
        let checksum = save_wasm_to_disk(&cache.wasm_path, wasm)?;
        cache.fs_cache.store(&checksum, &module)?;                   // inner panic
        Ok(checksum)                                                 // release lock
    }
}

As explained in the Rust documentation: "A Mutex will poison itself if one of its MutexGuards (the thing it returns when a lock is obtained) is dropped during a panic. Any future attempts to lock the Mutex will return an Err or panic."

In the event of a panic inside cache.fs_cache.store, the MutexGuard is dropped during unwinding and the lock on self.inner is left in a poisoned state, so every subsequent attempt to lock the Mutex returns an Err.
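
The poisoning behavior can be reproduced with the standard library alone. The following is a minimal sketch, not the cosmwasm-vm code: the Mutex<u32> is a stand-in for the real CacheInner, and the helper names poison and can_lock are hypothetical.

```rust
use std::panic::{self, AssertUnwindSafe};
use std::sync::Mutex;

/// Panics while holding the guard, the way a failing `store` call would.
/// The panic is caught here so the demo can keep running.
fn poison(cache: &Mutex<u32>) {
    let result = panic::catch_unwind(AssertUnwindSafe(|| {
        let mut guard = cache.lock().unwrap(); // acquire lock
        *guard += 1;
        panic!("simulated serialization failure"); // guard is dropped during unwinding
    }));
    assert!(result.is_err());
}

/// Returns true if the lock can still be acquired normally.
fn can_lock(cache: &Mutex<u32>) -> bool {
    cache.lock().is_ok()
}

fn main() {
    // Stand-in for the shared cache state.
    let cache = Mutex::new(0u32);
    println!("before: can_lock = {}", can_lock(&cache));
    poison(&cache);
    println!(
        "after: can_lock = {}, is_poisoned = {}",
        can_lock(&cache),
        cache.is_poisoned()
    );
}
```

After the simulated panic, lock() returns an Err on every call, which is exactly the state that lock().unwrap() cannot tolerate.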

Consistent Panic When Accessing The Poisoned Lock

The Cache in cosmwasm-vm serves as shared state per VM and is used when handling transactions within the WasmVM context. Because each access to the lock follows the pattern self.inner.lock().unwrap(), a poisoned cache.inner lock has far-reaching side effects.

Once the lock has been poisoned by a panic, lock() returns an Err and the unwrap() call panics, so every subsequent call to these functions panics as well. As shown below, metrics retrieval, wasm saving, wasm removal, pinning operations, and module retrieval are all affected.

    pub fn metrics(&self) -> Metrics {
        let cache = self.inner.lock().unwrap();  // panic when mutex is poisoned
        ...
    }

    pub fn save_wasm_unchecked(&self, wasm: &[u8]) -> VmResult<Checksum> {
        ...
        let mut cache = self.inner.lock().unwrap();  // panic when mutex is poisoned
        ...
    }

    pub fn remove_wasm(&self, checksum: &Checksum) -> VmResult<()> {
        let mut cache = self.inner.lock().unwrap();  // panic when mutex is poisoned
        ...
    }

    pub fn pin(&self, checksum: &Checksum) -> VmResult<()> {
        let mut cache = self.inner.lock().unwrap();  // panic when mutex is poisoned
        ...
    }

    fn get_module(&self, checksum: &Checksum) -> VmResult<(CachedModule, Store)> {
        let mut cache = self.inner.lock().unwrap();  // panic when mutex is poisoned
        ...
    }

Deterministic Error Returned With `catch_unwind`

All of the affected cache operations are implemented in Rust and exposed to Go through the C FFI bindings in libwasmvm.

At each Go-Rust FFI boundary, Rust's catch_unwind mechanism is used to handle panics occurring within the Rust code. This ensures that panics are caught and returned to the Go side as an Error.

This behavior allows the node to maintain its operational state while concurrently providing deterministic error results when executing transactions. However, since the VM cache is shared across different execution contexts, all subsequent transaction executions will be affected.
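
The combined effect can be sketched in a few lines. This is an illustration under stated assumptions rather than the libwasmvm implementation: the Cache type, its save method, the call_at_boundary wrapper, and the error string are all hypothetical, and only std::panic::catch_unwind and std::sync::Mutex are real APIs.

```rust
use std::panic::{self, AssertUnwindSafe};
use std::sync::Mutex;

/// Hypothetical shared cache; the `u32` counts stored modules.
struct Cache {
    inner: Mutex<u32>,
}

impl Cache {
    /// Stores a module; panics while holding the lock if `oversized` is set.
    fn save(&self, oversized: bool) -> u32 {
        let mut inner = self.inner.lock().unwrap(); // panics if the mutex is poisoned
        if oversized {
            panic!("simulated panic while serializing the module");
        }
        *inner += 1;
        *inner
    }
}

/// Hypothetical boundary wrapper: turns any panic into an error value,
/// so the caller (the Go side in the real system) keeps running.
fn call_at_boundary<T>(f: impl FnOnce() -> T) -> Result<T, String> {
    panic::catch_unwind(AssertUnwindSafe(f)).map_err(|_| "caught panic".to_string())
}

fn main() {
    let cache = Cache { inner: Mutex::new(0) };
    // A normal call succeeds.
    println!("{:?}", call_at_boundary(|| cache.save(false)));
    // The malicious call panics while holding the lock and poisons it.
    println!("{:?}", call_at_boundary(|| cache.save(true)));
    // Every later call fails, even though its input is harmless.
    println!("{:?}", call_at_boundary(|| cache.save(false)));
}
```

The process never crashes, yet the third call fails for a reason unrelated to its own input, which is what makes the failure both persistent and deterministic.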

In summary, an attacker can poison the VM cache's inner lock by submitting a malformed wasm payload. Once the lock is poisoned, every subsequent transaction execution panics when it accesses the VM cache. These panics are caught by catch_unwind at the FFI boundary and surface on the Go side as a deterministic transaction error returned to the client.

From the user's perspective, the blockchain can no longer process any transaction, which is akin to a network outage.

The Security Patch

To address the issue, a security patch was released in commit f69ffc7f7a66015b7d31ffad1d5e08d6c692d44f. The core idea behind the patch is to limit the complexity of the WASM payload before it is further processed by WasmVM, so that a small malformed payload can no longer compile into an oversized module and trigger the vulnerability.

Specifically, the following constraints were introduced:

const MAX_FUNCTIONS: usize = 20_000;
const MAX_FUNCTION_PARAMS: usize = 100;
const MAX_TOTAL_FUNCTION_PARAMS: usize = 10_000;

MAX_FUNCTIONS: Limits the total number of functions within a contract to 20,000, reducing the potential attack surface by capping the complexity of the contract's structure.
MAX_FUNCTION_PARAMS: Restricts each function to a maximum of 100 parameters, ensuring that individual functions cannot be overloaded with excessive inputs.
MAX_TOTAL_FUNCTION_PARAMS: Sets a cumulative limit of 10,000 parameters across all functions within a contract, preventing an excessive accumulation of parameters that could overwhelm code generation.
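
As a rough illustration of how these three limits interact, consider the sketch below. It is not the code from the patch: the function check_function_limits and its input (a slice holding the parameter count of each function) are hypothetical simplifications, and the real check operates on the parsed Wasm module.

```rust
const MAX_FUNCTIONS: usize = 20_000;
const MAX_FUNCTION_PARAMS: usize = 100;
const MAX_TOTAL_FUNCTION_PARAMS: usize = 10_000;

/// Hypothetical check: `param_counts[i]` is the number of parameters of function `i`.
fn check_function_limits(param_counts: &[usize]) -> Result<(), String> {
    // Limit 1: total number of functions in the contract.
    if param_counts.len() > MAX_FUNCTIONS {
        return Err(format!("too many functions: {}", param_counts.len()));
    }
    let mut total = 0usize;
    for (index, &count) in param_counts.iter().enumerate() {
        // Limit 2: parameters of each individual function.
        if count > MAX_FUNCTION_PARAMS {
            return Err(format!("function {index} has too many parameters: {count}"));
        }
        total += count;
    }
    // Limit 3: cumulative parameters across all functions.
    if total > MAX_TOTAL_FUNCTION_PARAMS {
        return Err(format!("too many parameters in total: {total}"));
    }
    Ok(())
}

fn main() {
    // A small, ordinary contract passes.
    println!("{:?}", check_function_limits(&[2, 3, 0]));
    // 101 functions with 100 parameters each respect the per-function limit
    // but exceed the cumulative one (10,100 > 10,000).
    println!("{:?}", check_function_limits(&vec![100; 101]));
}
```

The cumulative limit is the one that matters most here: it rejects payloads that stay within the per-function and function-count limits individually but would still add up to an excessive total.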

These limitations ensure that any WASM payload submitted to the network is within a manageable scope, greatly reducing the likelihood of a successful attack. By preemptively restricting the complexity of contract payloads, the patch effectively neutralizes the specific exploit vector of CWA-2023-004, safeguarding the stability and reliability of the CosmWasm ecosystem.

More details regarding the entire timeline and actions can be found at the security advisory page.

Conclusion

In this blog post, we shared the details of CWA-2023-004. Through the combined efforts of CertiK Skyfall and the Confio security contributors, the issue was effectively neutralized, ensuring the continued security and reliability of app chains. The targeted restrictions on WASM payload complexity have fortified the network against this class of attack, turning a potential attack vector into an opportunity for enhanced security. This proactive approach underscores the importance of vigilance and collaboration in maintaining a robust and trustworthy blockchain environment.