*(October 2023)* These are my notes on how [Redis Sentinel](https://redis.io/docs/management/sentinel/) works. The focus is on the design and implementation of Sentinel rather than configuration and usage.

Sentinel monitors a configured set of Redis masters. It discovers the replicas attached to those masters, and if it detects that a master is down it will fail over to one of its replicas and configure the remaining replicas to replicate the new master. Sentinel acts as the authority for which instance is the correct Redis master: if an instance's configuration differs, it will be reconfigured. For example, an instance that reports itself as a master when it should be a replica will be demoted, and a replica replicating from the wrong master will be pointed at the correct one.

Sentinel is designed as a distributed system where multiple Sentinel nodes coordinate to detect failures and fail over the Redis masters being monitored. This makes Sentinel fault tolerant, since it still functions correctly even if a minority of Sentinel nodes go down. It also reduces false positives when detecting Redis masters as down, since multiple Sentinels must agree the master is down before triggering a failover.

These notes describe how Sentinel handles failure detection and failover. To keep things simple I'll ignore many details of Sentinel such as user scripts, ACLs, TILT mode and alerting. I'll also assume Sentinel is only monitoring a single master, whereas it can actually monitor multiple Redis masters at once.

# State Machine

Sentinel runs as a Redis server cron job. It executes `sentinelTimer` at the configured rate (every 100ms by default). On each run Sentinel checks its current state and the state of the instances being monitored, and takes any required actions. Therefore Sentinel can be modelled as a state machine where each run evaluates the current state and moves to the next state if required. The diagram below provides an overview of the Sentinel states. As mentioned above, this assumes only a single master is being monitored to keep it simple.

![[Sentinel State Machine.png]]

Note in practice state transitions are more complex than the diagram shows, though it gives a useful overview. The states include:

* **Healthy**: Sentinel is monitoring the master with `PING` requests and getting valid responses, so it doesn't need to do anything
* **SDOWN**: If the Sentinel does not get a valid `PING` response from the master for more than the configured 'down-after-period', the master is considered 'subjectively down' (`SDOWN`), meaning the current Sentinel node believes the master is down but this has not been confirmed by a quorum of Sentinels. Once the Sentinel believes the master is down, it sends `SENTINEL-IS-MASTER-DOWN-BY-ADDR` requests to all known Sentinels asking if they agree that the master is down
* **ODOWN**: If the Sentinel finds a quorum of Sentinels agree the master is down, the master is considered 'objectively down' (`ODOWN`). If there is not already a failover in progress by another Sentinel, the Sentinel will attempt to start a failover by triggering an election
* **Election**: To fail over the master, the Sentinel must be elected leader by a majority of Sentinels. If it is elected leader, it proceeds with the failover; otherwise it does nothing, as another Sentinel is performing the failover instead
* **Failover**: To perform the failover, the Sentinel selects the best replica to promote and sends it `REPLICAOF NO ONE` to promote it to master. It then waits for the instance to confirm it is now a master, then reconfigures the remaining replicas to replicate the new master. If the Sentinel cannot find a suitable replica, or times out waiting for the replica to confirm it has been promoted, the failover is aborted

If another Sentinel performs a failover, it will notify the other Sentinels about the change of master and they will switch to monitoring the new master.
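To make the timer-driven design concrete, here is a minimal sketch of how a single tick could evaluate the master's state and advance through the states above. It's Python pseudocode with illustrative names and inputs, not Sentinel's actual implementation; the real logic lives in `sentinelTimer` and the functions referenced throughout these notes.

```
import time

# Illustrative states matching the diagram above, not Sentinel's real data structures.
HEALTHY, SDOWN, ODOWN, ELECTION, FAILOVER = "healthy", "sdown", "odown", "election", "failover"

def next_state(state, now, last_valid_pong, down_after_period,
               agreeing_sentinels, quorum, failover_in_progress, won_election):
    """One tick of the (simplified) Sentinel state machine for a single master."""
    if state == HEALTHY and now - last_valid_pong > down_after_period:
        return SDOWN          # no valid PING reply for too long
    if state == SDOWN and agreeing_sentinels >= quorum:
        return ODOWN          # a quorum of Sentinels agree the master is down
    if state == ODOWN and not failover_in_progress:
        return ELECTION       # try to become the failover leader
    if state == ELECTION and won_election:
        return FAILOVER       # promote a replica and reconfigure the rest
    return state

# Example: the master stopped answering PINGs 35s ago with a 30s down-after-period.
print(next_state(HEALTHY, now=time.time(), last_valid_pong=time.time() - 35,
                 down_after_period=30, agreeing_sentinels=0, quorum=2,
                 failover_in_progress=False, won_election=False))  # -> "sdown"
```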
# Configuration

Each Sentinel is configured with a set of Redis masters to monitor. It uses each master to discover the master's replicas and the other Sentinels monitoring the same master. As the Sentinel's configuration changes, whether due to discovering new Sentinel and replica instances, failovers occurring or manual reconfiguration, the Sentinel flushes its new configuration to disk. Therefore if it is restarted it can continue from where it left off. See `sentinelFlushConfig`.

On startup each Sentinel generates a random run ID to identify itself. This is also persisted to disk, so if a Sentinel restarts it will continue with the same run ID.

## Replica Discovery

Sentinel periodically sends an `INFO` request to all Redis instances being monitored (both masters and replicas). It uses the Redis master's responses to discover which replicas are attached to the master. Once discovered, Sentinel begins monitoring the replicas as well.

## Gossip

Sentinel uses a simple form of gossip to discover other Sentinels and to notify Sentinels about a change of master. Each Sentinel subscribes to, and periodically publishes 'hello' messages on, the Redis pub/sub channel `__sentinel__:hello`. The messages are published on every instance being monitored, including masters, replicas and other Sentinels. This means as long as there is at least one communication path between two Sentinels, they can still exchange 'hello' messages.

Sentinels use this mechanism to discover new Sentinels, since if two Sentinels are configured to monitor the same master, each will publish to and subscribe to the channel on that master. The 'hello' message includes the Sentinel's IP, port and run ID to identify itself. It also includes the identity of the master, which is used to notify the other Sentinels about a failover. See `sentinelSendHello` and `sentinelProcessHelloMessage`.

## Epochs

Since 'hello' messages are used to send configuration updates containing the identity of the master, Sentinels need to ensure they only apply the latest configuration and discard stale updates. Therefore each update has a version number, called an epoch. Sentinels track the epoch of their current configuration and only accept configuration updates with a higher epoch. Since Sentinels periodically publish their configuration to all other Sentinels, updates should propagate quickly.

When a Sentinel triggers a failover, it increments its current epoch. It uses this new epoch to hold an election and try to elect itself as leader (described below). If the Sentinel wins the election, it will perform the failover and send the new master configuration to the other Sentinels via 'hello' messages. These include the new epoch, so all other Sentinels should accept the latest configuration. See `sentinelStartFailover`.

## Instance Reconfiguration

As mentioned already, Sentinel acts as the authority for the Redis master configuration. If it finds a Redis instance with the wrong configuration based on its `INFO` response, such as a replica that believes it is a master or a replica replicating from the wrong master, the instance will be updated to the correct configuration. To avoid interfering with an ongoing failover, Sentinel waits for the failover timeout after it discovers the configuration inconsistency, giving the updated configuration time to propagate; otherwise it may end up reverting a failover triggered by another Sentinel. If a master is reporting itself as a replica, it is instead considered down and a failover will occur. See `sentinelRefreshInstanceInfo`.
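Pulling the gossip and epoch mechanisms together, here is a hedged sketch of applying a master configuration received in a 'hello' message only when it carries a newer epoch. The payload layout and names here are simplified for illustration only; the real message format and parsing are in `sentinelSendHello` and `sentinelProcessHelloMessage`.

```
from dataclasses import dataclass

@dataclass
class MasterConfig:
    ip: str
    port: int
    config_epoch: int

def apply_hello(current: MasterConfig, hello: str) -> MasterConfig:
    """Parse a simplified 'hello' payload and keep whichever master config is newer."""
    # Assumed, simplified field layout: sender ip, port, run ID, then the master it advertises.
    sentinel_ip, sentinel_port, sentinel_runid, master_ip, master_port, epoch = hello.split(",")
    advertised = MasterConfig(master_ip, int(master_port), int(epoch))
    # Stale updates (equal or lower epoch) are discarded; only newer ones are applied.
    return advertised if advertised.config_epoch > current.config_epoch else current

current = MasterConfig("10.0.0.1", 6379, config_epoch=3)
# A Sentinel at 10.0.0.9 advertises a new master promoted in epoch 4.
print(apply_hello(current, "10.0.0.9,26379,abc123,10.0.0.2,6379,4"))
```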
# Failure Detection

Each Sentinel sends `PING` messages to every known instance once a second and tracks how long it has been since the last valid response from each instance. See `sentinelSendPeriodicCommands`, `sentinelSendPing` and `sentinelPingReplyCallback`.

## Subjectively Down (`SDOWN`)

If a Sentinel does not receive a valid `PONG` response from an instance for more than the configured 'down-after-period', it considers the instance down. To avoid false positives, Sentinels must check that a quorum of Sentinels agree the instance is down before a failover can occur. When a Sentinel has detected an instance as down but has not yet confirmed that a quorum agrees, the instance is considered 'subjectively down' (`SDOWN`).

Any instance may be considered subjectively down, including replicas and other Sentinels. The monitored Redis master is also considered down if it is reporting itself as a replica, which triggers a failover to a new master. See `sentinelCheckSubjectivelyDown`.

## Objectively Down (`ODOWN`)

Once a Sentinel detects the Redis master as down, it asks the other known Sentinels if they agree. If a quorum of Sentinels agree, the master is considered 'objectively down' (`ODOWN`) and Sentinel will trigger a failover. Note the quorum is configured by the user, so it is not necessarily a majority.

Only the Redis master can be considered objectively down. Replicas may be subjectively down, which is used when deciding which replica to promote in a failover, but Sentinel never attempts to verify whether other Sentinels agree a replica is down.

Once a Sentinel considers the Redis master subjectively down, it sends a `SENTINEL-IS-MASTER-DOWN-BY-ADDR` request to the other Sentinels. The request includes the IP and port of the master. `SENTINEL-IS-MASTER-DOWN-BY-ADDR` is also used for leader election, as described below. When a Sentinel receives a `SENTINEL-IS-MASTER-DOWN-BY-ADDR` request, it responds with whether it agrees the master is down. See `sentinelCommand`.

On receiving a response, the Sentinel tracks which other Sentinels agree the master is down. A Sentinel's response is only valid for 5 seconds, so the Sentinel keeps sending requests to renew the responses. This means that if a Sentinel becomes unreachable, it is not counted when deciding whether the master is objectively down. The Sentinel then checks if the master is objectively down by counting how many Sentinels agree the master is down. See `sentinelCheckObjectivelyDown`.
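As a rough illustration of the quorum check, here is a sketch of counting the Sentinels whose agreement is still fresh, assuming the 5-second reply validity described above. The names and structures are illustrative, not Sentinel's real ones (see `sentinelCheckObjectivelyDown` for the actual version).

```
import time

REPLY_VALIDITY = 5  # seconds a SENTINEL-IS-MASTER-DOWN-BY-ADDR reply stays valid

def is_objectively_down(local_sdown: bool, last_agree_times: dict, quorum: int,
                        now: float) -> bool:
    """ODOWN requires local SDOWN plus enough other Sentinels with fresh 'down' replies."""
    if not local_sdown:
        return False
    # Count ourselves plus every Sentinel that recently agreed the master is down.
    votes = 1 + sum(1 for t in last_agree_times.values() if now - t <= REPLY_VALIDITY)
    return votes >= quorum

now = time.time()
replies = {"sentinel-b": now - 2, "sentinel-c": now - 30}  # sentinel-c's reply has expired
print(is_objectively_down(True, replies, quorum=2, now=now))  # True: us plus sentinel-b
```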
# Failover

Once a Sentinel considers the Redis master objectively down (`ODOWN`), and there is not already a failover in progress by the current Sentinel or another Sentinel, and the last failover attempt was more than twice the failover timeout ago, the Sentinel will attempt to trigger a failover.

## Leader Election

To trigger a failover a Sentinel must win an election among the other Sentinels. This ensures a majority of Sentinels agree the failover may proceed and avoids multiple Sentinels starting a failover at the same time. Requiring a majority also avoids a split-brain scenario where two partitions independently trigger a failover: the minority partition cannot elect a leader, so it cannot trigger a failover.

Leader election in Sentinel is similar to leader election in [Raft](https://raft.github.io/). Raft has a concept of terms, which are numbered with consecutive integers. Nodes learn about the latest term from the leader. To trigger an election, a node increments its term and requests that the other nodes vote for it. Each node can only vote for one node per term, and will not vote for a node whose term is smaller than its own current term.

In Sentinel, epochs are the equivalent of terms. As described in 'Configuration' above, Sentinel uses epochs to version configuration updates, which are propagated via the gossip mechanism. If a Sentinel sees a configuration with a later epoch, it applies the configuration update and updates its own epoch to match.

To start a failover, a Sentinel increments its current epoch and sends a request to all other Sentinels asking them to vote for it. Sentinel reuses the `SENTINEL-IS-MASTER-DOWN-BY-ADDR` request: if the sender's run ID and failover epoch are included in the request, it is requesting a vote. See `sentinelAskMasterStateToOtherSentinels`.

When a Sentinel receives a `SENTINEL-IS-MASTER-DOWN-BY-ADDR` request with a run ID and epoch, it will either vote for that Sentinel or respond with the Sentinel it already voted for in that epoch or a larger one. Therefore, similar to Raft, each Sentinel tracks who it last voted for, can only vote for one Sentinel per epoch, and cannot vote for a Sentinel with an old epoch. See `sentinelCommand` and `sentinelVoteLeader`.

When the sender receives a `SENTINEL-IS-MASTER-DOWN-BY-ADDR` response, it records who that Sentinel voted for (see `sentinelReceiveIsMasterDownReply`), then counts the votes to work out which Sentinel won. If the Sentinel hasn't yet voted itself, it votes for the Sentinel with the most votes, or for itself if there are no votes yet. If it votes for another Sentinel, it considers a failover in progress by that Sentinel, so it won't trigger a failover itself for at least twice the failover timeout.

It's possible for multiple Sentinels to start an election at the same time, which can lead to a split vote where no Sentinel wins. In that case they time out, abort their failover attempts and retry. To reduce the chance of another split vote, Sentinel adds random jitter to its run interval, so on the retry one Sentinel should run first and win.

## Replica Selection

Once elected, the Sentinel selects the best replica to fail over to according to the following criteria (a sketch of this ordering follows the list):

1. Ignore replicas that are considered down because they have not responded to the Sentinel's `PING`s or `INFO` requests
2. Ignore replicas that have been disconnected from the master for more than 10 times the 'down-after-period' plus the time the master has been considered down. This avoids promoting a lagging replica that was disconnected from the master while Sentinel still considered the master healthy
3. Ignore replicas with a replica priority of zero
4. Select the replica with the lowest replica priority
5. If priorities are equal, select the replica with the highest replication offset
6. If replication offsets are equal, select the replica with the lexicographically smallest run ID

See `sentinelSelectSlave` and `compareSlavesForPromotion`.
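Here is the promised sketch of that ordering as a simple selection, roughly in the spirit of `compareSlavesForPromotion`. The `Replica` type and its fields are illustrative; real Sentinel tracks this state per monitored instance.

```
from dataclasses import dataclass

@dataclass
class Replica:
    run_id: str
    priority: int
    repl_offset: int
    disconnected: bool  # no recent PING/INFO replies, or disconnected from the master too long

def select_replica(replicas):
    # Criteria 1-3: drop down/disconnected replicas and those with priority zero.
    candidates = [r for r in replicas if not r.disconnected and r.priority != 0]
    if not candidates:
        return None  # no suitable replica, so the failover is aborted
    # Criteria 4-6: lowest priority, then highest replication offset, then smallest run ID.
    return min(candidates, key=lambda r: (r.priority, -r.repl_offset, r.run_id))

replicas = [
    Replica("aaa", priority=100, repl_offset=500, disconnected=False),
    Replica("bbb", priority=100, repl_offset=800, disconnected=False),
    Replica("ccc", priority=1,   repl_offset=100, disconnected=True),
]
print(select_replica(replicas).run_id)  # "bbb": same priority, highest offset
```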
## Failover

Once the Sentinel has selected a replica, it performs the actual failover. To promote the replica to a master the Sentinel runs:

```
MULTI
REPLICAOF NO ONE
CONFIG REWRITE
CLIENT KILL TYPE normal
CLIENT KILL TYPE pubsub
EXEC
```

* `REPLICAOF NO ONE` tells the replica to become a master
* `CONFIG REWRITE` tells the node to rewrite its configuration file (if there is one) with the new configuration (where it is a master, not a replica)
* `CLIENT KILL` terminates normal and pub/sub clients

The response is ignored, since Sentinel polls `INFO` to check the node's role has been updated rather than checking the reply. See `sentinelFailoverSendSlaveOfNoOne` and `sentinelSendSlaveOf`. Once `INFO` confirms the role update, the failover is considered complete and can no longer be aborted.

Finally the Sentinel reconfigures the other replicas using the same commands, except `REPLICAOF` points to the new master. Since replicas may block while syncing with the new master, `parallel-syncs` can be configured to limit how many replicas are reconfigured at a time. See `sentinelFailoverReconfNextSlave`.

The other Sentinels are notified about the failover and the new master configuration using the 'hello' gossip mechanism described above. The Sentinel that triggered the failover will have the highest epoch, since it won the election, so all other Sentinels will accept the new configuration.
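Finally, a hedged sketch of how the `parallel-syncs` limit might be applied when reconfiguring the remaining replicas: each timer tick only starts new reconfigurations while fewer than `parallel-syncs` replicas are currently syncing. The helper and state names are illustrative, not Sentinel's actual implementation (see `sentinelFailoverReconfNextSlave`).

```
def reconfigure_replicas(replicas, new_master, parallel_syncs, send_replicaof):
    """One timer tick: start at most `parallel_syncs` concurrent reconfigurations."""
    in_progress = sum(1 for r in replicas if r["state"] == "syncing")
    for r in replicas:
        if in_progress >= parallel_syncs:
            break                                   # wait for the current syncs to finish
        if r["state"] == "pending":
            send_replicaof(r["addr"], new_master)   # REPLICAOF <new-master-ip> <new-master-port>
            r["state"] = "syncing"
            in_progress += 1

replicas = [{"addr": ("10.0.0.3", 6379), "state": "pending"},
            {"addr": ("10.0.0.4", 6379), "state": "pending"},
            {"addr": ("10.0.0.5", 6379), "state": "pending"}]
reconfigure_replicas(replicas, ("10.0.0.2", 6379), parallel_syncs=1,
                     send_replicaof=lambda addr, m: print("reconfigure", addr, "->", m))
print([r["state"] for r in replicas])  # only one replica starts syncing per tick
```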