Pocket Network has always been designed to do one thing and do it well: provide a utilitarian economy that coordinates unstoppable Web3 access. In the context of the v1.0 Utility Module, this means we are laser-focused on optimizing the existing utility of the network, not expanding the scope of our utility.
This means optimizing how effectively we coordinate Web3 access, which can be broken down into two categories:
- Relay Quality: RPC node (Servicer) incentives should be as tightly-coupled as possible to relay quality, so that Pocket Network’s service not only matches centralized providers’ but provides a level of quality, unique to Pocket’s architecture, that can’t be matched
- Relay Scalability: our protocol should be as scalable as possible, to maximize the number of relays that the network can process and optimize the efficiency (and thus cost) of the service
v0 – Client-side + Layer-2 Enforcement
When we first designed Pocket Network, we assumed that service quality would be ensured through a Darwinian competitive dynamic. Indeed, there is a weak incentive to perform better on latency/volume to maximize relays processed, though this is insufficient to ensure quality on all metrics.
To address this, we included a simple client-side validation mechanism known as a challenge transaction, which enables Apps to protest the rewards that Servicers earned for low-quality work. However, 1.5 years into mainnet, challenge transactions have never been utilized, likely since the Portal does quality checks before it sends relays. We’ve learned that client-side validation is an insufficient model to rely upon because it 1) results in Apps experiencing quality failures before they are corrected, 2) forces Apps to choose between validating relays and maximizing relay speed.
The Portal is our gateway drug. It is a web application that stakes Pocket Apps on behalf of developers, and allows us to provide the simple RPC endpoints that apps have come to expect from centralized providers. Currently, the Portal administers the majority of the protocol’s App stakes to ensure service quality while we bootstrap new chains. In 2021 we developed various layer-2 cherry-picking methods through the Portal to ensure that Apps receive service from the highest-quality Servicers available in each Session. These methods have laid the groundwork for the on-chain quality enforcement that we will be introducing in v1.0.
v1.0 – On-chain Enforcement
The Portal is a service that builds on the Pocket Network, managing app stakes on behalf of Apps and providing quality of service checks. Maintaining this core service has shown us how critical quality assurance is to our core utility, so we decided to take these lessons and move them on-chain.
We have considered a world in which other people deploy their own Portals and compete to provide the best layer-2 quality enforcement. However, we realized that Portals would fall to the tragedy of the commons; they are simultaneously expensive to run and inherently altruistic. The only way this could be resolved off-chain is through economies of scale and extractive pricing. Sound familiar? If we went down this route, we would no longer be solving the incentive problem we saw in the full node ecosystem, we’d simply be pushing it further up the stack.
To solve this in a manner compatible with our trustless vision, the protocol needs to directly incentivize actors on-chain to enforce quality according to a standardized ruleset. Enter the Fishermen. These are a new set of actors who can disguise themselves as Apps and sample the quality provided by Servicers.
Fishermen measure the quality of relays across three key metrics according to a standardized sampling protocol:
- Availability: Since Fishermen, Apps, and Servicers are all time-synced according to the session protocol, time-based sampling can be used to assess the availability of the Servicer. If no signed response can be collected from a Servicer, a null sample is recorded for the sample time slot. The more null samples the worse the Servicer’s availability score.
- Latency: The time it takes for the Fisherman to receive a signed response from the Servicer (i.e. Round Trip Time) is another metric that is tracked. Due to normal variances in latency to be expected from the varying geographical distance between Apps and Servicers, these metrics are used to disincentivize high-average latency rather than explicitly rewarding the highest-performing Servicers.
- Consistency: In addition to time-based sampling, it is mandatory for Fishermen to sample all Servicers in a session at once. The Fishermen can then compare responses and measure data consistency without needing their own comparative Web3 datasource.
Fishermen compile these samples into test scores for each Servicer, which are averaged out across Fishermen over time, ultimately determining the proportion of block rewards that Servicers receive. Fishermen are not incentivized to influence test scores because they are paid based on the quantity and completeness of their reports, not the content of the metrics being reported.
v0 – Quantity-based
85% of v0 block rewards are distributed to Servicers proportional to the volume of relays they served. This means Pocket Network’s incentives currently optimize for the quantity of work done, not the quality of work done.
v1.0 – Quality-based
v1.0 prioritizes quality over quantity; block rewards are distributed to Servicers according to the aggregate test scores submitted by Fishermen.
The total salary for Servicers is still proportional to the volume of relays performed on aggregate. However, this is divided between Servicers in proportion to their test scores. Each Servicer above the MinimumReportCardThreshold is eligible for an equal salary from the total pool for their RelayChain or GeoZone but has their allocation burned by the difference between their ReportCard% and 100%.
These incentives enable Pocket Network to probabilistically guarantee a Service Level Agreement for applications.
v0 – Pessimistic Proofs of Relays
v0 is pessimistic about proving work done; guilty until proven innocent. To validate the volume of relays they completed, Servicers must generate plasma merkle trees, store relay evidence, and create merkle proofs, which will then be validated by Validators, as part of a multi-step Zero Knowledge Range Proof.
This works very well as secure cryptography but it scales linearly because plasma merkle trees have O(n log(n)) space complexity, where n is the number of relays, and the branch must be included in every proof transaction. In practice, this means proofs get more expensive to process in proportion to relays, which contributes to higher CPU loads and longer block processing times.
If Pocket Network is to grow to serve quadrillions of relays, the relay proof lifecycle must be fundamentally restructured.
v1.0 – Optimistic Proofs of Samples
Work payments in v1.0 are optimistic, more like a salary compared to v0’s unit-based payments.
The total salary pool is still proportional to the volume of relays performed on aggregate. To determine the size of this total available reward, Fishermen probabilistically estimate volume using probabilistic hash collisions rather than counting up (and proving) every relay.
The Fishermen samples themselves are also optimistic. Fishermen only need to submit test scores on a pseudorandom schedule and only need to verify a single pseudorandomly selected non-null sample. Since the Fishermen (and Servicers) can’t predict which test scores will be submitted or which samples will need to be verified, the monitoring system remains a secure deterrent of bad behavior while avoiding the cost that would come from validating every test score.
By switching to an optimistic model, we reduce both the frequency and size of proofs, which should allow relays to scale exponentially rather than linearly.
Creating & Validating Proofs
v0 – Unified Actors
In v0, until recently, Validators and Servicers were bundled together. This meant that the scalability of Servicers was bound by Validators, for two reasons:
- Validators are subject to the scalability of the BFT consensus algorithm, which is arguably the least scalable component of our protocol by design
- Servicers are burdened with performing expensive Validator tasks, even though these aren’t necessary to perform Servicer tasks
We have already begun moving away from this model with the separation of Validators and Servicers in v0.7. Limiting Validators to the top 1,000 by stake has allowed our Servicer count to grow to almost 25,000 without impacting the health of our blockchain. v1.0 will take this a step further.
v1.0 – Task-based Specialization of Actors
We have already agreed that Servicers shouldn’t need to validate proofs. Taking this a step further, why should they need to prove their own work? In v0, Servicers must constantly store claims and proofs if they want to get paid, which presents a computational burden that distracts them from optimizing on their most important tasks: relaying RPC requests.
In v1.0, we are more explicitly separating actors according to their tasks. We are introducing the Fisherman, whose sole responsibility is to prove the quality of work done. This narrows the scope of Servicer work to just performing relays, which should make it cheaper to perform relays and thus dramatically enhance the efficiency of the network’s core task.
v1.0 Utility Roadmap
We will begin live-testing the Fisherman sampling methods in v0, using the Portal as a low-risk supervised environment.
When v1.0 launches, we will begin with a single DAO-governed Fisherman. This will enable us to adopt all the benefits of v1.0, with the Fisherman actor being the only trusted actor in the network.
We will then transition to a multi-Fisherman model, wherein DAO governance can appoint and remove Fishermen, burning them to penalize bad behavior.
Finally, Fishermen will become truly trustless actors as the monitoring/enforcement of their behavior is moved entirely on-chain.
Read more details about the v1.0 Utility module spec.
The consensus module coordinates Validators to come to agreement that the transactions in a block are legitimate before the block is added to the blockchain. Currently, v0 inherits its consensus implementation from Tendermint. While Tendermint provides a great framework for building applications that can handle arbitrary computation, it is not optimized for Pocket Network’s singular focus of coordinating unstoppable Web3 access.
Some of the major changes to the consensus module for v1.0 include:
- Replacing Tendermint BFT with HotStuff BFT
- Migrating from a Round Robin leader selection process to a blind, pseudorandom leader selection process
- Allowing the Block Proposer to validate transactions against the mempool before including them in a block
For the network as a whole, these changes will enable more consistent block times. For those running nodes, these changes will allow for:
- Less bandwidth and compute resources spent communicating with other nodes during consensus
- Less storage usage, as invalid transactions will no longer be included in blocks
- Block creation rewards proportional to the amount staked.
A Primer on Consensus
The two most popular types of consensus algorithms are Nakamoto and Byzantine Fault Tolerance (BFT).
In Nakamoto consensus, the network chooses to follow the longest chain. This is the consensus mechanism that Bitcoin, Ethereum 1.0, and most Proof of Work chains currently use. As long as someone can submit a valid proof of work, they can add a block to the chain.
In case the block added was fraudulent, whether maliciously or accidentally, whoever submits the next block can choose to do so from whichever point they wish, creating a fork in the chain. If the network adopts this new chain enough that it becomes the longest, this results in what’s known as a chain reorganization. This is why applications that make transactions on these chains often wait for a number of “confirmations”, because the chain can potentially reorganize at any point. This process of waiting for transactions accomplishes something called probabilistic finality.
Probabilistic finality works well for coming to consensus on events that have already occurred, however, since Pocket Network uses the current state of the blockchain to determine which Servicers to match Apps with, Nakamoto consensus is incompatible with the Utility module.
In BFT consensus algorithms, Validators must come to agreement before adding any blocks to the chain. The name comes from the “Byzantine generals problem”, a thought experiment that dealt with how a group of military generals, who could not communicate directly with each other, could come to a decision about whether to attack or flee. This issue doesn’t just deal with the case of the general being a malicious actor, but also the case of the messenger being the one who changed the vote, or simply not being able to deliver the message. In a computer system, Byzantine faults are used to describe errors that can occur from an actor either being compromised or faulty. These types of consensus mechanisms, while useful for blockchain consensus, were originally developed to enable aircraft to rely upon their sensor data to make software decisions as long as a certain threshold of their sensors were always in agreement.
Where Nakamoto consensus can probabilistically resolve fraudulent blocks through its longest-chain selection process, BFT instead opts to require consensus on the next block before moving forward with the chain. Since Pocket Network’s Utility module requires a single state from which to maintain coordination between Servicers and Apps, BFT is the obvious choice.
The Consensus Algorithm
v0 – Exponential Communication Complexity
The Tendermint BFT consensus algorithm has served Pocket Network well but will hold us back because communication complexity scales exponentially in proportion to the number of nodes, for two reasons:
- Network-Wide Random Gossip: Every node in the network must pass on what every other node has told them, including Servicers. The more nodes there are, the more node-node messages must be sent and received, resulting in an exponential volume of messages. If Pocket Network is to continue scaling to hundreds of thousands of nodes, this gossip process must be fundamentally restructured, otherwise the communication bandwidth required will become insurmountable.
- Pessimistic Responsiveness: When voting on a block, the network must wait for all nodes to vote (or wait a minimum period of time) before moving on, even if enough votes have already been cast to reach consensus. This can make the network more vulnerable to chain halts if not enough nodes can keep up with the voting process.
v1.0 – Linear Communication Complexity
We will be solving these issues in v1.0 by switching to our own implementation of the HotStuff BFT consensus algorithm: HotPocket. The biggest advantage of this algorithm is that it uses validator-specific structured gossip and optimistic responsiveness to significantly reduce communication complexity.
- Validator-Specific Structured Gossip: HotPocket selects a small group of nodes to be responsible for listening for messages and then broadcasting gossip to everyone. This means instead of sharing messages with everyone, nodes only have to share with the selected listeners, which allows communication bandwidth to scale exponentially with the number of Validators.
- Optimistic Progress: HotPocket introduces a new phase in the consensus process. During this phase, Validators must acknowledge that they have seen the block proposal going up for a vote. When it comes to the voting phase, any Validators who did not acknowledge the proposal can be dismissed and the chain can move on with majority consensus as long as the present Validators approve it. Along with this, we have a custom pacemaker module that ensures consistent block times, which are critical for the time-synced mechanisms in the Utility module.
The Leader Selection Process
v0 – Round Robin
Tendermint uses Round Robin to determine the next leader who will propose the next block. This makes the blockchain vulnerable to DDoS attacks since we know exactly who the next leaders are going to be and when. We have already modified Tendermint’s Round Robin to use a pseudorandom selection algorithm where the hash of the previous block determines the next leader.
v1.0 – VRF + CDFs
In v1.0, we are leveraging Verifiable Random Functions (VRFs) for more secure pseudo-random leader selection that is unpredictable before the block production, yet deterministic and computationally cheap to verify afterwards. Whereas v0 uses the previous block hash to generate leaders, v1.0 uses the VRF secret keys of other Validators which are impossible for the proposer to know. VRFs are combined with Cumulative Distribution Functions (CDFs) to enable selection to be weighted based on the amount staked by the Validator. Because it is probabilistic, it is possible for the VRF to generate 0 proposers, in which case we fallback to the v0 Round Robin process.
Handling Invalid Transactions
v0 – Block Proposer Can’t Validate
Since Tendermint was designed to support arbitrary computation, there were risks for them allowing the block proposer to validate transactions against the mempool. Since the state of the mempool changes as transactions come in, there is currently no way for the block proposer to know if transactions are valid before proposing them in a block. For chained and dependent transactions that rely on ordering, this can cause problems of unexpected failed transactions. Since transactions aren’t validated until they’re included in the block, this leads to the chain being bloated with meaningless failed transactions, which makes indexing and processing transactions harder, ultimately wasting resources.
v1 – Block Proposer Can Validate
Since Pocket Network isn’t a blockchain built for arbitrary computation, and instead built for a specific purpose, we can make guarantees about what computations would occur if the block proposer tried to validate transactions against the mempool. This means that it is possible for the block proposer to filter invalid transactions before proposing a block. This will greatly help to reduce the storage requirements of the chain as well as making it easier to index and process transactions.
Read more details about the v1.0 Consensus module spec.
The Peer-to-Peer (P2P) module is responsible for handling how all the nodes in the network communicate with each other when new transactions need to be added to a block and when the new block gets added to the chain. Currently, v0 pairs peers together randomly. This random pairing was chosen to ensure redundancy across the network, however, as Pocket Network continues to grow, the redundancy and complexity of random pairing require more bandwidth as more nodes come online – with a good chunk of that bandwidth being used to retrieve copies of information the nodes have already seen.
In v1.0, the P2P module will transition from random communication to structured communication. For those running nodes, this transition means a significant reduction in bandwidth usage. For the network as a whole, this means more efficient and scalable communication, while maintaining the required redundancy. This will move the P2P system from having low visibility and a large amount of message duplication to having high visibility and less message duplication.
A Primer on P2P
There are three primary components of the P2P system that all contribute to the reliability and speed of the network. Those components include Event Dissemination which handles how nodes will be paired to communicate with each other, Membership and Churn which handles knowing when nodes either come online or go offline, and the Transport Layer which controls the mechanism used to actually send and receive messages between peers. Some of these problems are problems that can apply to networking across the board, while others are unique to P2P networks specifically. The requirements of building an effective P2P communication network include:
- Efficiency and Scalability – making sure that more peers can connect without a major impact to the network performance
- Fault and Partition Tolerance – despite any number of breakdowns that happen between nodes, the network can continue to operate
- Managing Membership and Churn – effectively communicating when a peer both joins and leaves the network.
While there are a large variety of existing P2P implementations out in the wild, each implementation has made design decisions based on specific needs that don’t translate to the needs of Pocket. In choosing the design for the v1.0 implementation, it was important to understand the fundamental needs of Pocket Network. There are three types of communications that are fundamental to how Pocket Network operates.
- Servicers need to be able to inform the network of transactions made
- Validators need to be able to communicate with each other during consensus and with the whole network when a new block is created
- Archival nodes need to be informed of new blocks, as well as handling and making requests for new blocks.
From these three main types of communication, we can specifically say that the P2P module of Pocket Networks needs to be able to:
- Communicate blocks and transactions across the network
- Restrict communications to only the groups who are needed for those communications.
v0 – Random Gossip
Currently, Pocket Network communicates information about blocks and transactions through random gossip. When a node learns of a new piece of information, they pass that information along randomly to other nodes in the network. This random process ensures that the data can make its way through the network, regardless of failures between certain pairs of nodes. This process, while ensuring redundancy, does so in a way that requires nodes to spend bandwidth and resources receiving the same message multiple times.
Building on this, because nodes randomly select other nodes to share this information with, nodes will share and receive information regardless of whether that is information they need. Gossip about the consensus process is only relevant to nodes who are acting as Validators. Gossiping with every Servicer and Archival node about the validation process not only takes away resources from those nodes but it slows down the gossip phase of the consensus process.
v1.0 – Structured Gossip
RainTree is a gossip protocol that leverages the efficiency that Binary Trees provide for lookups of sorted, randomly-distributed data. RainTree provides the mechanism to ensure delivery, offer redundancy against non-participation without needing retries or acknowledgements, and perform a cleanup to ensure 100% message propagation in all cases.
The process starts with a node that wishes to gossip a message to all other peers in the network. Once the node has a sorted list of all other peers in the network, it needs to:
- Build a binary tree based off the sorted list, with the left branch being the peer at 33%, and the right branch being the peer at 66%, with a third branch that points back to itself
- Determine its own and max depth of the tree
- Gossip messages to all depths of the tree.
The peers on the left and right branches then build their own subsections of the tree using the above, and the process continues. This third branch is there to add redundancy. When the node begins gossiping down the tree, it will, eventually, reach the point where it has to send itself the message. Once it has reached this point, the node is able to send it to not just the left and right as expected but also every other node in the originally sorted list.
Since networks are vulnerable to nodes dropping out at any point, if one of the branches does not acknowledge that it received the message, the node that tried to communicate with them will then readjust and send the message to the children that would have received the message from the unresponsive parent.
To be 100% certain that all functional nodes received the message, the Double-Daisy-Chain (DDC) cleanup layer provides a reliability mechanism that’s used in other gossip algorithms, including epidemic communications. Once the message propagation has reached the next to last layer of the tree, the nodes in the final layer are sent a special “I Got You Want?” (IGYM) message that expects either a Yes/No answer. If the answer is a Yes, the node is forwarded along the full message and they continue said process. If the answer is No, or there is no answer, that node is skipped and the next node in the chain is selected for the same process.
v0 – Everyone Communicates Everything
In the current system, all nodes communicate any information to all other nodes regardless of whether the information is relevant to that node. This means that Servicers must actively listen and share information about the consensus process, despite the fact that they are not participating. In networks that primarily consist of Validators, this doesn’t cause many issues, but since Pocket Network contains other specialized actors such as Servicers and Fishermen, scaling these actors should not impact the communication of processes irrelevant to their input. If we don’t scale actor communication independently, scaling Servicers and Fishermen will slow down consensus, resulting in longer block times.
v1.0 – Restricted to Relevant Parties
The gossip protocol in the Consensus module is a specialized version of RainTree designed for that specific case. As Pocket Network already keeps a list of the current Validators, it’s possible to leverage RainTree for this process. This allows the consensus process to take advantage of the optimizations and redundancy of RainTree. Even more importantly, since the tree will be constructed from only the list of Validators, this means that gossip related to consensus can occur only between those that are involved in the process. This means that the consensus process isn’t held up by trying to communicate with the entire network and that other actors are now unburdened from gossip that isn’t relevant to them.
Read more details about the v1.0 Peer To Peer module spec.
The Persistence module is responsible for ensuring that the data continues to persist over time, across deployments, and throughout software changes. Currently, V0 handles data persistence through Tendermint. Tendermint uses a similar mechanism to most other blockchains, storing the different state data that needs to persist in Merkle trees. The roots of these trees are included in the block, and then for each block, each of the trees is stored as a file.
This design simplifies the computational overhead a new client would need to get up to sync with the current block, however, it does so at the cost of storage. While this decision was made to ensure that the barrier to entry of acting as a validator of a network is low, it fails to consider how critical full nodes, nodes that maintain copies of the data to be accessed and queried by applications, are as infrastructure for production applications.
As the core mission of Pocket Network is to provide access to high quality, decentralized infrastructure, the role of the full node needs to be considered as a priority and not an afterthought. Some of the changes that are being made to the persistence layer in V1 include:
- Moving from a Key Value Filesystem DB to an SQL based Tamper Proof Hybrid Mutable DB
- Decoupling the persistence layer to allow for a Client Server Architecture between pocket-core and persistence.
For those interacting with the network, these changes will mean significantly faster access to any state data that would have relied on a query to access.
For those running nodes, these changes will not only make the current experience better, but it will also open the door to more control and options when it comes to deployment configurations. These changes will:
- Drastically reduce the storage needs (80%) for storing the blockchain data.
- Drastically reduce the amount of resources needed to query state data. (10+ seconds -> milliseconds)
- Enable better data portability, making it easier to quickly spawn up additional databases as needed
- Enable individual scalability, making it easier to kill off and replace a failed process.
- Allow for an additional layer of fault tolerance, since the choice for database engine can be replaced.
- Enable for multi-process concurrency, making it possible to horizontally scale pocket-core without worrying about the storage cost for doing so.
V0 – Key Value Filesystem
Currently, at each block, the data that represents the current state is stored as a tree in a file. While this provides some benefits, such as being able to simply move a directory to move the data, and easily being able to store arbitrary data, these benefits come with significant downsides. For one, regardless if anything has changed in the state between blocks, an entire copy of that tree is written again. There is no process for currently managing data deduplication.
The other issues come from using a filesystem as a production database. Out of the box, Linux cannot handle the amount of simultaneously open files needed to manage a system in this manner, hence needing to make sure the
ulimit is properly specified before running a node. Using the filesystem also means that data access is almost entirely reliant on the I/O speeds of the system. When you add in how querying through old state would mean opening multiple files, this means that queries dependent on I/O are the slowest part of the system.
V1 – Tamper Proof Hybrid Mutable DB
SQL provides a mature, optimized, battle-hardened solution to the data storage problem. One issue with SQL is it requires a well-defined data schema, and so it’s well suited for arbitrary data storage. Since Pocket Network does not handle arbitrary computation, and instead serves a defined application specific purpose, SQL provides the opportunity for significant storage and speed optimizations.
A Persistence Client Middleware will communicate between
pocket-core and a generic database engine, to define what needs to happen to define, persist, update, and query the datasets it stores. This mechanism will define the following to ensure consistency:
- Versioning of the state dataset
- Byte-perfect consistency of the data encoding
- Schema definition mechanism
- Deterministic write mechanism
- Idempotent Dataset Updates
To ensure that the data is tamper proof, the Patricia Merkle Tree can be stored in the state dataset for each block, and can be used to verify the validity of each block. As the operations on the data are ACID and idempotent, any change to the underlying data is detectable.
V0 – Data Duplication
The current system requires that each pocket-core process needs access to an independent copy of the data to run. If a node needs to increase resources, whether that be to handle more traffic, or more data access, the only option is to start another process that needs a full copy of the data to run.
V1 – Client-Server Architecture
Breaking the core and persistence layers into a client server architecture gives node runners the ability to scale and manage the respective processes more efficiently. While keeping the persistence layer on the same machine as the
pocket-core process would allow for the least latency, it also restricts the node runner from configuring their infrastructure in a way that best serves the needed demand.
Some of the examples that node runners now have access to include:
- Spawning up multiple
pocket-core processes on behalf of the same identity to scale to handle more demand.
- Connecting multiple
pocket-core processes that represent at least two nodes in close physical proximity to the same database.
- Connecting multiple
pocket-core process that represent multiple nodes to a database cluster to allow a collection of nodes to efficiently scale-up storage needs
Since nodes know the maximum amount of relays that the applications it’s serving can request, this gives node runners the ability to save costs by allocating resources only as needed.
Read more details about the v1.0 Persistence module spec.