Random Sampling of NEAR’s Validators
NEAR’s current approach to selecting validators is sufficient for today’s number of shards and validators, but it can easily become inefficient as the validator set grows. This blog post presents a new, simple, and efficient way of selecting NEAR’s validators and analyzes its safety and liveness guarantees.
The NEAR blockchain system uses a mechanism called stateless validation to ensure that all transactions are executed correctly. When a NEAR node processes a sequence of transactions and computes a new blockchain state, it produces a state witness alongside the new state. The state witness is a piece of data containing all the information necessary for anyone to verify (“validate”) the transition from the old blockchain state to the new one. The NEAR node then distributes this state witness to a set of other nodes that have the role of validators. Validators verify the state witness and vote on whether the state transition represented by the witness should be accepted as valid. Their votes (a.k.a. endorsements) are required for the next NEAR block to be produced.
The validators are called stateless because they do not need any prior knowledge of the blockchain state to verify the correctness of the state transition - the state witness is all they need. We are not going to dive into the details of how state witnesses work - they can be found in the corresponding documentation. What we are interested in now is how the validators are selected.
To become a validator, a node needs to lock some NEAR tokens as stake. It is then allowed to vote on (i.e., endorse) NEAR’s state transitions and receives rewards for doing so. We make the common assumption that strictly less than one third of all the stake belongs to faulty (potentially malicious) validators. Building only on this assumption, we need to ensure the following with high probability despite potential malicious attacks.
- Safety: The system does not accept an incorrect state transition.
- Liveness: The system does accept correct state transitions.
- Efficiency: The system does not spend too many resources on the validation process, even with many validators and many state transitions to be verified.
- Fairness: The rewards of correct validators are proportional to their stakes.
NEAR is a sharded blockchain system that splits its state into multiple parts called shards. A separate state transition happens in every shard with each block. If a transaction touches the state in two (or more) shards, it is split into two (or more) separate (though related) state transitions. Making every validator verify every state transition is, unfortunately, not efficient. Therefore, only a limited subset of validators must be responsible for validating each state transition.
How NEAR currently samples validators
NEAR’s current implementation splits each validator’s stake into a variable number of parts and, at each block, assigns each such part to exactly one random shard. The owner of the stake part then becomes responsible for verifying and voting on that shard’s state transition. If the validator consistently performs this task, it receives a corresponding reward.
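To make the idea concrete, here is a toy sketch (not NEAR’s actual implementation; the `part_size` parameter and the rounding rule below are illustrative simplifications of the mechanism described in NEP-509):

```python
import random

def assign_stake_parts(validators, part_size, num_shards, rng=random.Random(0)):
    """Toy sketch: split each validator's stake into parts and assign each
    part to a random shard for the next block. `validators` maps a
    validator name to its stake."""
    assignment = {shard: [] for shard in range(num_shards)}
    for name, stake in validators.items():
        num_parts = max(1, round(stake / part_size))  # variable number of parts
        for _ in range(num_parts):
            assignment[rng.randrange(num_shards)].append(name)
    return assignment

# Example: validators with more stake become responsible for more shards.
print(assign_stake_parts({"v1": 100, "v2": 40, "v3": 10}, part_size=20, num_shards=4))
```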
This method ensures that each of NEAR’s shards is assigned validators with approximately the same amount of stake, intuitively making each shard similarly secure (as it is “backed by” a similar amount of stake). Note that non-sharded blockchain systems do not need to deal with this concern, as they consist of only a single shard.
Our current approach is sufficient for the number of shards and validators NEAR has at the time of writing, but it has several drawbacks. While it does ensure fairness, it can easily become inefficient as the number of validators grows. Since each validator is involved in the voting process at each block, the system might need to process unnecessarily many votes and transmit large amounts of state witness data. We will not discuss the details of the current validation mechanism here (the interested reader can find more in NEAR Enhancement Proposal 509); it involves variably sized parts of validators’ stake and stake-weighted endorsements, and its safety and liveness guarantees are difficult to formally analyze. These considerations bring us to designing the new method of validator selection we present next.
New way of sampling validators
The basic idea is that, instead of assigning each validator to a random shard, we take each shard and assign some randomly selected set of validators to it (while adjusting the selection probability according to the validators’ stake). As we will see later, this subtle detail simplifies the reasoning about the system and allows us to easily quantify the safety and liveness guarantees. In a nutshell, for each state transition, we randomly$^1$ select a small subset of validators that we call the validator sample. The probability of choosing a validator to be included in the sample is proportional to its stake and a validator can be chosen more than once. If enough validators in the sample endorse (vote for) the state transition, we consider the state transition validated.
We pick the validators for a sample with replacement. Validators that have been chosen multiple times have the corresponding number of votes. For instance, if a validator has been included in a sample 3 times, then its endorsement is also counted 3 times. Validators’ endorsements are not weighted by stake, as their stake is already taken into account in the selection probability.
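As a minimal sketch (assuming stakes are given as a simple mapping; the real protocol derives its pseudo-randomness from on-chain entropy, see footnote 1), the sampling step could look as follows:

```python
import random
from collections import Counter

def draw_validator_sample(stakes, s, rng):
    """Draw s validators with replacement, each pick proportional to stake.
    Returns a Counter mapping each chosen validator to its number of votes."""
    names = list(stakes)
    weights = [stakes[name] for name in names]
    return Counter(rng.choices(names, weights=weights, k=s))

# Example: a validator drawn 3 times gets 3 votes in the sample.
print(draw_validator_sample({"v1": 100, "v2": 50, "v3": 25}, s=10, rng=random.Random(42)))
```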
To ensure fairness, we keep rewarding the validators proportionally to their stake (and thus also proportionally to the rate of their endorsements) if they consistently keep voting for correct state transitions.
Safety and liveness requirements
Intuitively, we say: “If a sufficiently large subset (that we call a quorum) of a small representative sample of validators endorse a state transition, then most likely all correct validators would have endorsed it.” The word “representative” is very important here. It means that the sample’s proportion of correct and faulty validators will be similar to that of the overall validator set. Now let us formalize this intuition.
Let us translate “sufficiently large subset of a small sample” to “$q$ out of $s$ endorsements”. Furthermore, “most likely” shall be “with probability at least $(1-\beta)$”. In other words, we need to choose our validator sample of size $s$ such that the probability that $q$ out of those $s$ validators endorse an incorrect state transition is upper-bounded by a security parameter $\beta$. Note that this corresponds precisely to the safety requirement defined above.
Conversely, for liveness, we want to ensure that the sample contains enough validators voting for the state transition so it gets accepted. That is, for some small parameter $\gamma$, we need the probability that fewer than $q$ out of $s$ votes are cast for the correct state transition to be at most $\gamma$. We know that correct validators will always endorse a correct state transition and never an incorrect one. If $f’$ is the relative fraction of faulty validators in a sample (making $(1-f’)$ the fraction of correct ones), we can express the above as follows.
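$$\Pr[f’·s ≥ q] ≤ \beta \quad \text{(safety)}$$

$$\Pr[(1-f’)·s < q] ≤ \gamma \quad \text{(liveness)}$$

In words: a safety violation requires at least $q$ of the $s$ sampled validators to be faulty, while a liveness violation occurs when the correct validators alone cannot reach the quorum.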
We are free to choose the $\beta$ and $\gamma$ parameters according to our needs (within some constraints that we discuss later). For example, if we consider a system with 50 shards, a block rate of 3 blocks/second, a separate validator set for each state transition, and an expected time to a safety violation of 1 million years, we obtain $\beta$ = 1 / (1'000'000 years · 365 days · 24 hours · 3600 seconds · 3 blocks · 50 shards) = 1 / ($4.73·10^{15}$ samples) = $2.11·10^{-16}$.
Calculating failure probabilities
Let us dive deeper into the representativeness of the sample. We are choosing (with replacement) $s$ validators for the sample, where a validator’s chance of being selected is proportional to its stake. This is equivalent to sampling from the stake itself and including the chosen stake’s owner in the sample. Our assumption is that at most $f = ⅓$ of all the stake may be owned by faulty validators, as depicted below (darts represent randomly selected parts of the stake).

Now it is easy to see that each selection of a validator is a Bernoulli trial in which the selected validator is faulty with probability $f = ⅓$. That makes the number of times we select a faulty validator (denoted by $F$) in a sample of size $s$ a binomially distributed random variable: $F \sim Bin(s, f)$. To obtain the relative fraction of faulty validators $f’$, we divide by the sample size $s$ and obtain $f’ = F/s$, and thus $F = f’·s \sim Bin(s, f)$. We can perform an analogous calculation for the distribution of the number of correct validators with probability $(1-f) = ⅔$ of selecting a correct validator, leading to $(s - F) = (1-f’)·s \sim Bin(s, 1-f)$.
Let us focus on the number of faulty validators $f’·s \sim Bin(s, f)$. Since the expected value of a binomial distribution is $s·f$, the expected fraction of faulty validators in a sample will always be $s·f / s = f = ⅓$. What changes with sample size $s$ is only the variance (represented by the color gradient in the figure below). The bigger the sample size, the less likely $f’$ is to deviate much from $f$.
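A quick simulation (illustrative only, with an arbitrary seed and trial count) confirms this concentration effect:

```python
import random
from statistics import mean, stdev

f = 1 / 3                      # assumed fraction of faulty stake
rng = random.Random(7)

for s in (50, 200, 800):
    # Each trial draws a sample of size s; every slot is faulty with probability f.
    fractions = [sum(rng.random() < f for _ in range(s)) / s for _ in range(2_000)]
    print(f"s={s:3d}  mean f'={mean(fractions):.3f}  stdev={stdev(fractions):.4f}")
# The mean stays near 1/3 while the standard deviation shrinks roughly as 1/sqrt(s).
```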
If we fix the acceptable safety violation probability $\beta$ and a sample size $s$, we can compute the minimal quorum size $q$ that satisfies our safety requirements by evaluating the binomial distribution$^2$. For example, in order to bound the safety violation probability by $\beta = 2.11·10^{-16}$ in a sample of $s = 50$ validators, we need to wait until $q = 45$ of them endorse the same state transition. If we had a larger sample of, say, 75 validators, we would require 60 endorsements to achieve the same safety violation probability. Note that, in a larger sample, we also need a larger absolute number, but a smaller relative number of endorsements (90% vs 80% in the above examples). The plot below shows possible combinations of sample sizes ($s$) and quorum sizes ($q$) for a tolerable safety violation probability $\beta = 2.11·10^{-16}$.
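The computation itself is a one-liner with any binomial-distribution library. Footnote 2 links the actual code; the following is an independent sketch using scipy, where $\Pr[F ≥ q]$ is `binom.sf(q - 1, s, f)`:

```python
from scipy.stats import binom

def min_quorum(s, f, beta):
    """Smallest q such that Pr[F >= q] <= beta for F ~ Bin(s, f)."""
    for q in range(s + 1):
        if binom.sf(q - 1, s, f) <= beta:  # sf(q - 1) = Pr[F >= q]
            return q
    return None  # even a unanimous sample is not safe enough

beta = 2.11e-16
print(min_quorum(50, 1/3, beta))  # 45, as in the example above
print(min_quorum(75, 1/3, beta))  # 60
print(min_quorum(32, 1/3, beta))  # None: (1/3)**32 exceeds beta
```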
We see that if we use a small sample, we need all of its votes. As the sample grows, we can afford to “lose” some votes. Why does the plot start at a sample size of 33 validators? Because, if we only used 32 or fewer validators, we could not be sufficiently ($\beta = 2.11·10^{-16}$) sure that a state transition is correct even if all those validators endorsed it. There would still be a probability of $(⅓)^{32} ≈ 5.4·10^{-16} > \beta$ that all those 32 validators are faulty.
Safety vs. liveness trade-off
One might be tempted to say that the best choice is to stick with a small sample and a relatively large (but absolutely small) quorum. After all, a smaller validator sample means a smaller communication and computation overhead for the same safety guarantees. This has an important drawback, however: if we ever get even slightly unlucky with our sample, we can easily get stuck, since even an (absolutely) very small (but relatively large) number of faulty validators can block any state transition from being accepted. This is a symptom of a general trade-off: every choice we make regarding $\beta$, $s$, and $q$ also determines the liveness violation probability $\gamma$. In a small sample, the quorum size required to achieve safety might be prohibitive for liveness. Increasing the sample size allows for a smaller relative quorum, decreasing the liveness violation probability $\gamma$ while maintaining safety.
We can also turn things around and start by fixing $\gamma$ to ensure the desired liveness. However, the resulting maximal allowed quorum size might prohibit us from guaranteeing safety with a sufficiently small $\beta$. We can thus say that the smaller the sample, the tougher the trade-off between safety and liveness.
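To see the trade-off numerically, we can evaluate both failure probabilities in the worst case (a full third of the stake is faulty and withholds its votes). The small, safety-feasible configurations from above have a liveness violation probability close to 1, while a larger configuration (s = 1000, q = 500, an illustrative choice not taken from the post) makes both probabilities negligible:

```python
from scipy.stats import binom

f = 1 / 3
for s, q in ((33, 33), (50, 45), (75, 60), (1000, 500)):
    beta_actual = binom.sf(q - 1, s, f)  # Pr[F >= q]: safety violation
    gamma = binom.sf(s - q, s, f)        # Pr[F > s - q]: liveness violation
    print(f"s={s:4d}, q={q:3d}: beta = {beta_actual:.2e}, gamma = {gamma:.2e}")
```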

In practice (i.e., in the NEAR system), safety and liveness requirements are not quite symmetric. A safety violation corresponds to accepting an invalid state transition, e.g., minting or transferring an arbitrary amount of tokens. This situation is very hard to meaningfully recover from, and we need to make sure it simply does not occur. A liveness violation in the context of a single state transition, on the other hand, can be dealt with more easily. For example, if a state transition is not accepted by some time-out, we can try again, potentially even using a different, perhaps more conservative validation strategy. This manifests as a performance glitch, which might be inconvenient but, especially if rare, not critical. We can therefore say that for NEAR, we can tune the safety vs. liveness trade-off very much in favor of a low $\beta$ at the cost of a relatively large $\gamma$.
Defending against an adaptive adversary
What was said above would suffice against a static adversary that has to choose the corrupted validators ahead of time. However, an adaptive adversary may wait until we select our validator sample and then corrupt a quorum of just that sample. If we rely on small validator samples, there is no defense against an adaptive adversary without making additional assumptions. Even corrupting the whole sample would fit in the adversary’s budget of ⅓ of the total corruptible stake. Therefore, to deter even an adaptive adversary, we make an additional assumption: a validator can only be dynamically corrupted if doing so doesn’t make it lose money. We then make it expensive to perform an adaptive attack through the following two measures:
- A validator that endorses an incorrect state transition or consistently fails to endorse a correct one loses its stake.
- We make sure that any quorum of validators in a sample always has some minimal cumulative amount of stake.
We call the cumulative amount of stake of a quorum in a sample the strength of that sample. There are many subsets of a sample that constitute a quorum with different amounts of stake and we always look at the one with the smallest stake. We assume that an adaptive adversary corrupting a quorum of validators in a sample would need to at least refund the stake to the corrupted validators to make them collude. The strength of a sample then corresponds to the price of an adaptive attack.
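Computing the strength of a concrete sample is a small optimization problem: among all sets of distinct sampled validators whose votes sum to at least a quorum, find the one with the least total stake. A minimal sketch (the function name is illustrative) using dynamic programming over the number of votes:

```python
def sample_strength(sample, stakes, q):
    """Least cumulative stake of any quorum of q votes in `sample`.

    `sample` maps validator -> number of times it was drawn (its votes);
    `stakes` maps validator -> stake. A validator's stake counts only once,
    no matter how many of its votes the quorum uses.
    """
    INF = float("inf")
    best = [0.0] + [INF] * q   # best[v] = least stake gathering at least v votes
    for validator, votes in sample.items():
        for v in range(q, 0, -1):  # descending: each validator used at most once
            candidate = best[max(0, v - votes)] + stakes[validator]
            if candidate < best[v]:
                best[v] = candidate
    return best[q]  # INF means the sample has fewer than q votes in total

# Example: v2 was drawn twice, so the cheapest 3-vote quorum is {v1, v2}
# with stake 7 (the alternative {v2, v3} has stake 11).
print(sample_strength({"v1": 1, "v2": 2, "v3": 1}, {"v1": 5, "v2": 2, "v3": 9}, q=3))
```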
We introduce another security parameter $\alpha$ representing the required minimal strength of a sample. When picking a sample, we make sure that its strength is at least $\alpha$ – i.e., that the total stake of every quorum is at least $\alpha$. It is difficult to devise a probability distribution for the sample strength because it depends on how the stake is distributed among the validators (e.g., all validators having equal stake, or most validators having a small stake and a few having a large one). We do show, however, what the sample strength looks like for different quorum sizes in several simulated example scenarios in the figure below.
In the uniform stake distribution, all validators have the same stake. In the pseudo-exponential distribution, the first validator has 10% of the total stake, the second has 10% of the rest (i.e. 9% of the total stake), etc. The last validator, however, has all the remaining stake (that is where the “pseudo” comes from). The “actual” distribution corresponds to the assignment of stake to validators on the NEAR mainnet in September 2025. The standard deviation is not shown, as it is at least an order of magnitude smaller than the value itself.
Note that the uniform sample has less than ⅓ of the stake despite its size being more than ⅓ of the total number of validators. This is because it does not contain 111 distinct nodes - some validators have been picked multiple times, but their stake counts only once. The key takeaway from the above picture is that non-uniform stake distributions tend to lead to stronger samples. This is because very “rich” validators have a high probability of being chosen and drive the sample strength up.
We are slightly hand-wavy about the analysis of the sample strength since, at the end of the day, it is relatively easy to satisfy $\alpha$ whenever we are creating a validator sample in practice. We can simply choose the sample to satisfy $\beta$ and $\gamma$ and then, if necessary, keep increasing the sample size $s$ and quorum $q$ until $\alpha$ is satisfied too. While it might impact the performance, it will never compromise safety or liveness.
In a nutshell
We have seen how to randomly choose a sample of validators to vote on the correctness of NEAR state transitions. We also discussed how to quantify the safety and liveness guarantees we obtain based on the parameters of the sample:
- $s$: sample size
- $q$: quorum size required to accept a state transition
In essence, for a given sample size $s$, the quorum size $q$ is the knob we can turn to determine where on the safety-liveness trade-off we want to be. Increasing $q$ strengthens safety but weakens liveness and vice versa. The sample size $s$ determines how “tough” this trade-off is: if $s$ is too small, we only have bad options. The larger $s$ is, the more favorable the options we can choose from (using $q$). There is no free lunch, however, and we pay for these better options with the increased computation and communication overhead of larger samples.
To deal with an adaptive adversary that can corrupt validators on the fly, we use the concept of sample strength to quantify the cost of an adaptive corruption. We can then set acceptable limits on
- $\alpha$: The minimal cost of an adaptive corruption of a validator sample
- $\beta$: The safety violation probability
- $\gamma$: The liveness violation probability
and choose a sample that respects all these limits. To achieve this, we can proceed as follows:
1. Pick acceptable security parameters $\alpha$, $\beta$, and $\gamma$.
2. Pick sample parameters $s$ and $q$ based on $\beta$ and $\gamma$.
3. Draw a sample of size $s$ by choosing a random validator $s$ times, with the probability of choosing a validator proportional to its stake.
4. While the sample strength is less than $\alpha$, keep adding validators to the sample (increasing $q$ accordingly, as discussed above).
The first two steps only need to be performed once; only the last two are executed for each sample. A practical alternative to step 4 is to simply draw an entirely new sample of size $s$ and hope it will be stronger. We might prefer this option if most of the samples we draw are strong enough and redrawing happens rarely, even though it technically shortens the expected time until a safety violation happens. What we gain in return is more consistent performance, where no sample exceeds the size $s$.
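Putting it all together, here is a minimal end-to-end sketch of steps 3 and 4 (illustrative names and structure; real randomness comes from on-chain entropy, and the actual implementation may grow the sample differently). Growing $s$ and $q$ in lockstep preserves safety, since requiring one more endorsement out of one more slot can only make a purely faulty quorum less likely, and it never decreases the sample strength:

```python
import random
from collections import Counter

def draw_sample(stakes, s, rng):
    """Step 3: s stake-weighted draws with replacement."""
    names = list(stakes)
    weights = [stakes[name] for name in names]
    return Counter(rng.choices(names, weights=weights, k=s))

def strength(sample, stakes, q):
    """Least total stake over all quorums of q votes (stake counted once)."""
    INF = float("inf")
    best = [0.0] + [INF] * q
    for name, votes in sample.items():
        for v in range(q, 0, -1):
            best[v] = min(best[v], best[max(0, v - votes)] + stakes[name])
    return best[q]

def pick_sample(stakes, s, q, alpha, rng=random.Random()):
    """Steps 3 and 4: draw a sample, then grow it (and q) until its
    strength reaches alpha. Assumes alpha is attainable, i.e., well
    below the total stake."""
    sample = draw_sample(stakes, s, rng)
    names = list(stakes)
    weights = [stakes[name] for name in names]
    while strength(sample, stakes, q) < alpha:
        sample[rng.choices(names, weights=weights, k=1)[0]] += 1
        q += 1
    return sample, q
```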
Lastly, it is important to note that we have always been considering the worst-case scenario, namely one third of the stake being actually controlled by faulty validators throughout the lifetime of the system. The computed failure probabilities are upper bounds that are tight only in the worst case. In practice, the actual amount of stake controlled by faulty validators stays well below the assumed limit ($f$) the vast majority of the time. Therefore, the real-world failure probabilities can also be expected to remain significantly below $\beta$ and $\gamma$.
$^1$We use the entropy provided by the NEAR protocol at each block as the source of randomness. Whenever we perform “random” sampling, we use pseudo-randomness derived from this entropy. (back to text)
$^2$For details on how we compute the exact values, have a look at the code. (back to text)