Why Memory-Safe Blockchain RPC Nodes are Not Panic-Free
7/10/2023

CertiK's Skyfall team recently uncovered several vulnerabilities in Rust-based RPC nodes in various blockchains, including Aptos, Starcoin, and Sui. As RPC nodes are pivotal infrastructure components linking dApps with the underlying blockchains, their robustness is essential for seamless operations. Blockchain designers know how important stable RPC services are, and so they embrace memory-safe languages like Rust to avoid common vulnerabilities that might disrupt RPC nodes.

Adopting a memory-safe language such as Rust helps RPC nodes avoid many families of attacks based on memory corruption. Through our recent auditing experience, however, we found that even memory-safe Rust implementations, if not carefully designed, are susceptible to certain security threats that can disrupt the liveness of RPC services. In this blog post, we present a series of vulnerabilities with real-world examples, including a bug for which Sui awarded a cash bounty.

1. Background

1.1 The Role of Blockchain RPC Nodes

A blockchain's Remote Procedure Call (RPC) service is a core infrastructure component in Layer 1 blockchains. It provides a critical API front end for users and serves as a gateway to the back-end blockchain network. However, a blockchain RPC service differs from traditional ones in that it serves users without requiring authentication. The service's continuous availability, or "liveness", is of paramount importance, and any disruption can significantly impact the usability of the underlying blockchain.

1.2 Traditional vs. Blockchain RPC Server From The Auditing Perspective

Auditing a traditional RPC server focuses on aspects like input validation, authorization/authentication, cross-site request forgery/server-side request forgery (CSRF/SSRF), injection vulnerabilities (like SQL injection, command injection), and information leakages.

However, the scenario is different for blockchain RPC servers. There is no need to authenticate clients initiating the request at the RPC layer as long as the transaction is signed. As the frontend of the blockchain, one primary goal of the RPC service is to guarantee its liveness. If it fails, users cannot interact with the blockchain, thereby hindering their ability to query on-chain data, submit transactions, or publish contracts.

Hence, the most vulnerable aspect of the blockchain RPC server is its availability. If the server goes down, users lose the ability to interact with the blockchain. More seriously, some attacks can proliferate through the chain, affecting a substantial number of nodes, or even leading to a complete network breakdown.

1.3 Why New Blockchains Adopt Memory-Safe RPC Implementations

Several prominent Layer 1 blockchains, such as Aptos and Sui, implement their RPC services using the memory-safe programming language Rust. Thanks to its strong type safety and rigorous compile-time checks, Rust virtually immunizes the program against memory corruption vulnerabilities like stack and heap overflows, null pointer dereferences, and use-after-free bugs.

To further secure the codebase, developers strictly follow best practices, such as not introducing unsafe code. Placing the crate-level attribute #![forbid(unsafe_code)] in the source code makes the compiler reject any unsafe block in that crate.

Figure: Example of a Rust programming practice enforced by blockchain developers

To prevent integer overflow, developers often use functions like checked_add, checked_sub, saturating_add, and saturating_sub rather than plain addition and subtraction (+, -). Resource exhaustion is mitigated by setting appropriate timeouts, request size limits, and restrictions on the number of items requested.
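As a minimal sketch of the overflow-safe arithmetic mentioned above: the helper names page_end and remaining_budget are hypothetical and do not come from any of the audited codebases. For comparison, plain + and - panic on overflow in debug builds and wrap silently in default release builds.

```rust
/// Hypothetical helper: end offset of a paginated request.
/// `checked_add` returns `None` on overflow instead of panicking or wrapping.
fn page_end(start: u64, limit: u64) -> Option<u64> {
    start.checked_add(limit)
}

/// Hypothetical helper: remaining budget after charging a cost.
/// `saturating_sub` clamps the result at zero instead of underflowing.
fn remaining_budget(budget: u64, cost: u64) -> u64 {
    budget.saturating_sub(cost)
}
```

A caller can then turn the None case into a normal error response rather than letting an attacker-chosen offset crash or confuse the handler.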

2. Threats to Memory-Safe RPC Implementations in Layer 1 Blockchains

Although RPC nodes written in safe Rust are largely immune to traditional memory-safety vulnerabilities, their role exposes them to inputs that attackers can easily manipulate. Several conditions can lead to denial of service in memory-safe RPC implementations. Memory amplification, for instance, can exhaust the service's memory, while logic issues might introduce infinite loops. Race conditions are another threat: concurrent operations can produce an unintended sequence of events that leaves the system in an inconsistent state. Finally, improperly managed dependencies and third-party libraries might bring unknown vulnerabilities into the system.

In this blog post, we aim to bring attention to the more direct ways that Rust runtime protections can be triggered, leading to self-aborted services.

2.1 Explicit Rust Panic: A Direct Path to Service Disruption

Developers can intentionally or inadvertently introduce code that explicitly panics. These are primarily meant to handle unexpected or abnormal situations. A few common instances include:

  • assert!(): This macro is used when a condition must be met. If the asserted condition fails, the program panics, indicating a serious error in the code.
  • panic!(): This macro is invoked when the program encounters an unrecoverable error and cannot proceed further.
  • unreachable!(): This macro is used when a piece of code should never be executed. If this is ever reached, it signifies a serious logic error.
  • unimplemented!() and todo!(): These macros are placeholders for functionalities yet to be implemented. If reached, the program will panic.
  • unwrap(): This method is used on Option or Result types, causing the program to panic when it encounters an Err variant or None.
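To make the difference concrete, here is a small sketch that contrasts a panicking call with a non-panicking one. The function names and the idea of parsing a "limit" parameter are assumptions chosen for illustration only.

```rust
use std::num::ParseIntError;

/// Panicking variant: `unwrap()` panics as soon as the caller
/// supplies a value that is not a valid u32.
fn parse_limit_panicking(raw: &str) -> u32 {
    raw.parse::<u32>().unwrap()
}

/// Non-panicking variant: the parse error is returned to the caller,
/// who can map it to an ordinary "invalid request" response.
fn parse_limit(raw: &str) -> Result<u32, ParseIntError> {
    raw.parse::<u32>()
}
```

Both functions behave identically on well-formed input; they differ only in what happens when the input is attacker-controlled and malformed.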

Vulnerability I: Triggering the assert! failure in Move Verifier

The Aptos blockchain employs the Move Bytecode Verifier to undertake a reference safety analysis through an abstract interpretation of bytecode. The execute() function, part of the TransferFunctions trait implementation, simulates the execution of a bytecode instruction within a basic block.


The function execute_inner() is tasked with interpreting the current bytecode instruction and updating the state accordingly. If we've reached the last instruction in the basic block, as indicated by index == last_index, the function invokes assert!(self.stack.is_empty()) to ensure the stack is empty. The intent behind this assertion is to guarantee that all operations are appropriately balanced, meaning that each push to the stack is matched by a corresponding pop.

In a normal execution flow, the stack is always balanced during the abstract interpretation process. This is assured by the Stack Balance Checker, which verifies the bytecode prior to interpretation. However, once we extend our perspective to encompass the scope of the Abstract Interpreter, we find that this stack balance assumption isn't always valid.

Figure: Patch of the vulnerable analyze_function in AbstractInterpreter

At its core, the Abstract Interpreter simulates bytecode at the basic block level. In its original implementation, encountering an error during execute_block would prompt the analysis process to log the error and continue onto the next block in the control flow graph. This could create a scenario where an error in execute_block results in an unbalanced stack. If execution were to continue under these conditions, it could confront the assert! check with a non-empty stack, thereby triggering a panic.

This behavior can be exploited by an attacker. By crafting specific bytecodes that trigger an error within execute_block(), it's possible for execute() to reach the assert statement with a non-empty stack, thereby causing the assert check to fail. This would result in a panic and terminate the RPC service, hence affecting its liveness.

To prevent this, a fix has been implemented to ensure that the first occurrence of an error in the execute_block function halts the entire analysis process. This prevents the risk of subsequent crashes that could occur if the analysis continued after the stack was left unbalanced by an error. This modification helps improve the robustness and security of the Abstract Interpreter by eliminating a potential panic-inducing scenario.
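The following toy model sketches why continuing after an error is dangerous and why stopping at the first error avoids the panic. It is not the Move verifier code: the Instr enum, the Analyzer struct, and both analyze_* functions are simplified stand-ins invented for this illustration.

```rust
/// Toy instruction set (hypothetical, not Move bytecode).
enum Instr {
    Push,
    Pop,
    Fail, // stands in for any instruction the analysis rejects
}

struct Analyzer {
    stack: Vec<u8>,
}

impl Analyzer {
    /// Interprets one basic block and asserts an empty stack at its last instruction.
    fn execute_block(&mut self, block: &[Instr]) -> Result<(), String> {
        for (index, instr) in block.iter().enumerate() {
            match instr {
                Instr::Push => self.stack.push(0),
                Instr::Pop => {
                    self.stack
                        .pop()
                        .ok_or_else(|| "pop on empty stack".to_string())?;
                }
                // Early return: anything pushed earlier in this block stays on the stack.
                Instr::Fail => return Err("verification error".to_string()),
            }
            if index == block.len() - 1 {
                // Panics if an earlier, failed block left values behind.
                assert!(self.stack.is_empty());
            }
        }
        Ok(())
    }
}

/// Shape of the original behavior: record the error and move on to the next block.
fn analyze_continue_on_error(blocks: &[Vec<Instr>]) -> Vec<String> {
    let mut analyzer = Analyzer { stack: Vec::new() };
    let mut errors = Vec::new();
    for block in blocks {
        if let Err(e) = analyzer.execute_block(block) {
            errors.push(e);
        }
    }
    errors
}

/// Shape of the fix: the first error halts the whole analysis.
fn analyze_stop_on_first_error(blocks: &[Vec<Instr>]) -> Result<(), String> {
    let mut analyzer = Analyzer { stack: Vec::new() };
    for block in blocks {
        analyzer.execute_block(block)?;
    }
    Ok(())
}
```

With the input [[Push, Fail], [Push, Pop]], the first block fails with one value still on the stack. The continue-on-error variant then reaches the assertion in the second block with a non-empty stack and panics, while the stop-on-first-error variant simply returns the error.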

Vulnerability II: Triggering the panic! failure in Starcoin

The Starcoin blockchain maintains its own fork of the Move implementation. Inside this Move repo, there is a panic! in the constructor of the Struct type. This panic! is triggered explicitly if the supplied StructDefinition carries native field information.

Figure: Explicit panic! when initializing a Struct inside the normalization routine

This potential risk lies within the process of republishing modules. If the module being published is already present in the datastore, a module normalization process is required for both the existing module and the attacker-controlled input module. During this process, the 'normalized::Module::new' function constructs module structs from the attacker-controlled input module, which can trigger the 'panic!'.

Figure: Preconditions of the normalization routine

This panic! can be triggered by submitting a specially crafted payload from the client-side, marking it as a potential security vulnerability. As a result, a malicious actor can disrupt the liveness of the RPC service, making it a critical area of focus for maintaining secure operations of the Starcoin blockchain.

Figure: Patch of the Struct initialization panic

Starcoin’s fix introduces a new behavior to handle the Native condition. Instead of panicking, it now returns an empty vector (vec![]), which signifies that there are no fields present for compatibility checking. This mitigates the possibility of a panic induced by user-submitted data.
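The shape of this change can be sketched as follows. The types below are heavily simplified stand-ins that only borrow the names used in the text; the real Move definitions carry far more information, and the method names new_panicking and new are assumptions made for this example.

```rust
/// Simplified stand-in for the field information of a struct definition.
enum StructFieldInformation {
    Native,
    Declared(Vec<String>),
}

/// Simplified stand-in for the attacker-controlled input.
struct StructDefinition {
    field_information: StructFieldInformation,
}

/// Simplified stand-in for the normalized struct.
struct NormalizedStruct {
    fields: Vec<String>,
}

impl NormalizedStruct {
    /// Pre-fix shape: a native struct in the input aborts the handler.
    fn new_panicking(def: &StructDefinition) -> Self {
        let fields = match &def.field_information {
            StructFieldInformation::Native => panic!("native structs are not expected here"),
            StructFieldInformation::Declared(fields) => fields.clone(),
        };
        NormalizedStruct { fields }
    }

    /// Post-fix shape: a native struct yields an empty field list.
    fn new(def: &StructDefinition) -> Self {
        let fields = match &def.field_information {
            StructFieldInformation::Native => vec![],
            StructFieldInformation::Declared(fields) => fields.clone(),
        };
        NormalizedStruct { fields }
    }
}
```

The only difference between the two constructors is the Native arm, which is exactly the arm an attacker can reach by choosing the module payload.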

2.2 Implicit Rust Panic: Overlooked Pathways to Service Disruption

While explicit panic points are readily identifiable in the source code, implicit panic represents hidden pitfalls that developers might overlook. Such implicit panics typically occur when using APIs provided by standard or third-party libraries. Developers need to thoroughly read and understand the API documentation, or their Rust program could be brought to an unexpected halt.

Figure: The implicit panic behavior inside BTreeMap

Let's take BTreeMap from the Rust standard library as an example. BTreeMap is a commonly used data structure that stores key-value pairs in a sorted B-tree. BTreeMap offers two methods to retrieve values by their keys: get(&self, key: &Q) and index(&self, key: &Q).

The method get(&self, key: &Q) retrieves a value using its key and returns an Option. This can either be Some(&V), a reference to the value if the key exists, or None if the key is not found in the BTreeMap.

On the other hand, index(&self, key: &Q) directly returns a reference to the value corresponding to the key. However, it carries a significant risk: if the key does not exist in the BTreeMap, it triggers an implicit panic. This could unexpectedly crash the program if not handled correctly, making it a potential vulnerability.

In fact, the index(&self, key: &Q) method is the underlying implementation of the std::ops::Index trait. This trait provides convenient syntactic sugar for indexing operations (i.e., container[index]) in immutable contexts. Developers might directly use btree_map[key], invoking the index(&self, key: &Q) method under the hood. What they might overlook, however, is the fact that such usage can trigger a panic if the key is not found, thus creating a hidden threat to the stability of the program.
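A short sketch of the two access styles, using hypothetical helper names and a map from u16 keys to string labels:

```rust
use std::collections::BTreeMap;

/// Uses `get`: a missing key is reported as `None`.
fn lookup_checked(map: &BTreeMap<u16, &'static str>, key: u16) -> Option<&'static str> {
    map.get(&key).copied()
}

/// Uses the `Index` sugar: a missing key panics inside `index()`.
fn lookup_indexed(map: &BTreeMap<u16, &'static str>, key: u16) -> &'static str {
    map[&key]
}
```

The two functions agree whenever the key exists. The indexed version is shorter, which is why it is tempting, but it moves the failure case out of the type signature and into a runtime panic.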

Vulnerability III: Triggering Implicit Panic in Sui RPC

The Sui module publish routine allows users to submit module payloads through RPC. Before forwarding the request to the backend validator network for Bytecode verification, the RPC handler uses the SuiCommand::Publish function to directly disassemble the received modules.

During this disassembly process, the code_unit section in the submitted module is utilized to construct a VMControlFlowGraph. This construction process involves the creation of basic blocks, which are stored in a BTreeMap called 'blocks'. The process involves creating and manipulating this map, and it's here where an implicit panic can be triggered under certain conditions.

Here's a look at a simplified snippet of the code:

Figure: Implicit panic when creating the VMControlFlowGraph

In this code, a new VMControlFlowGraph is created by iterating over the code and creating a new basic block for each code unit. The basic blocks are stored in a BTreeMap called blocks.

The blocks map is indexed using blocks[&block] within a loop that iterates over the stack, which has been initialized with an ENTRY_BLOCK_ID. The assumption here is that there will at least be an ENTRY_BLOCK_ID present in the blocks map.

However, this assumption may not always hold true. For instance, if the submitted code is empty, the blocks map will still be empty after the "create basic block" process. When the code later tries to iterate over the blocks map using for succ in &blocks[&block].successors, it could lead to an implicit panic if the key is not found. This is because the blocks[&block] expression is essentially an invocation of the index() method, which, as discussed earlier, causes a panic if the key is not present in the BTreeMap.
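The traversal pattern described above can be sketched with a defensive lookup. This is not the VMControlFlowGraph code: BasicBlock is reduced to a successor list, and reachable_blocks is a hypothetical function written for this illustration.

```rust
use std::collections::{BTreeMap, BTreeSet};

type BlockId = u16;
const ENTRY_BLOCK_ID: BlockId = 0;

/// Simplified stand-in: a basic block reduced to its successor ids.
struct BasicBlock {
    successors: Vec<BlockId>,
}

/// Walks the graph from the entry block and returns every reachable block id.
/// A missing block id becomes an error instead of a panic.
fn reachable_blocks(
    blocks: &BTreeMap<BlockId, BasicBlock>,
) -> Result<BTreeSet<BlockId>, String> {
    let mut seen = BTreeSet::new();
    let mut stack = vec![ENTRY_BLOCK_ID];
    while let Some(block) = stack.pop() {
        if !seen.insert(block) {
            continue; // already visited
        }
        // `blocks[&block]` would panic here when `blocks` is empty.
        let basic_block = blocks
            .get(&block)
            .ok_or_else(|| format!("missing basic block {}", block))?;
        for succ in &basic_block.successors {
            stack.push(*succ);
        }
    }
    Ok(seen)
}
```

With an empty map, the very first lookup of ENTRY_BLOCK_ID fails and the function returns an error that the RPC handler could report to the client.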

A remote attacker can exploit this by submitting a malformed module payload with an empty code_unit field. This single RPC request can cause the entire JSON-RPC process to panic. Because such payloads cost almost nothing to produce, an attacker who keeps sending them can cause persistent disruption of the service. In a blockchain network context, this means the network may become unable to confirm new transactions, leading to a denial-of-service (DoS) condition. The network's functionality and users' trust in the system would be significantly impacted.

Figure: Sui's fix to remove the disassemble functionality from the RPC publish routine

It’s worth noting that the CodeUnitVerifier in the Move Bytecode Verifier is indeed responsible for ensuring that the code_unit section is never empty. However, the sequence of operations exposes the RPC handler to potential vulnerabilities. This is due to the verification process being carried out at the Validator node, a stage subsequent to the RPC handling of the input module.

In response to this, Sui has addressed the vulnerability by removing the disassemble functionality in the module publish RPC routine. This is an effective way to prevent the RPC service from handling potentially dangerous, unverified bytecode.

Furthermore, other RPC methods related to object querying, which also incorporate the disassembling functionality, are not susceptible to this attack with empty code units. This is because they always query and disassemble an existing published module. A published module must already have passed verification, so the assumption of non-empty code units when building a VMControlFlowGraph always holds.

3. Recommendations for Developers

In light of understanding the threats that both explicit and implicit panics pose to the stability of RPC services in blockchain, it's important to arm developers with strategies to prevent or mitigate these risks. Therefore, we present the following recommendations. These strategies are aimed at reducing the likelihood of unexpected service disruptions, improving the resilience of the system, and promoting best practices in Rust programming.

  • Rust Panic Abstraction: Consider using Rust's std::panic::catch_unwind function to catch panics and convert them into error messages. This prevents the entire program from crashing and allows developers to handle the error in a controlled manner. Note that this only works when panics unwind; it has no effect in binaries built with panic = "abort".
  • Careful API Usage: Implicit panics typically occur due to misuse of APIs provided by standard or third-party libraries. Hence, it is crucial to understand the APIs fully and handle potential error situations adequately. Always assume that an API can fail and prepare for that situation.
  • Appropriate Error Handling: Use Result and Option types for error handling rather than resorting to panics. They provide a more controlled way of dealing with errors and exceptional situations.
  • Documentation and Comments: Ensure that your code is well-documented and comments are placed at critical sections, including where panics could occur. This will help other developers to understand the potential risks and handle them effectively.
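As a minimal sketch of the first recommendation, a request handler can be wrapped so that a panic becomes an error value. The wrapper name handle_request and the String-based request/response types are assumptions for illustration; a real service would use its own handler and error types.

```rust
use std::panic::{self, AssertUnwindSafe};

/// Hypothetical wrapper: runs `handler` and converts a panic into an `Err`.
/// Only effective when the binary is built with unwinding panics.
fn handle_request<F>(handler: F) -> Result<String, String>
where
    F: FnOnce() -> String,
{
    panic::catch_unwind(AssertUnwindSafe(handler)).map_err(|payload| {
        // Panic payloads are usually a `&str` or a `String`.
        let message = payload
            .downcast_ref::<&str>()
            .map(|s| s.to_string())
            .or_else(|| payload.downcast_ref::<String>().cloned())
            .unwrap_or_else(|| "unknown panic".to_string());
        format!("internal error: {}", message)
    })
}
```

AssertUnwindSafe is used here for brevity; it tells the compiler that the caller accepts responsibility for any shared state the handler may have left half-updated when it panicked. Catching panics at the request boundary is a safety net, not a substitute for returning Result from code that handles untrusted input.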

4. Summary

In summary, Rust-based RPC nodes play an important role in blockchain systems like Aptos, Starcoin, and Sui. Since they connect dApps with the underlying blockchains, their reliability is essential for the smooth operation of those systems. Even though these systems use Rust, which is a memory-safe language, they can still be brought down by design flaws that trigger panics. CertiK's research team explored these risks with real-world examples, demonstrating the need for caution and meticulous design in memory-safe programming.