By popular request, we’ll explore the topic of cluster state changes and quorum over the next couple of blog articles.
In computer science, Brewer’s CAP theorem states that it’s impossible for a distributed system to simultaneously guarantee consistency, availability, and partition tolerance. This means that, when faced with a network partition, one has to choose between consistency and availability.
OneFS does not compromise on consistency, so a mechanism is required to manage a cluster’s transient state and quorum. As such, the primary role of the OneFS Group Management Protocol (GMP) is to help create and maintain a group of synchronized nodes. Having a consistent view of the cluster state is critical, since initiators need to know which nodes and drives are available to write to, and so on. A group is a given set of nodes which have synchronized state, and a cluster may form multiple groups as connection state changes. GMP distributes a variety of state information about nodes and drives, from identifiers to usage statistics. The most fundamental of these is the composition of the cluster, or ‘static aspect’ of the group, which is stored in the array.xml file. The array.xml file also includes info such as the node’s ID and GUID, and whether it is a diskless or storage node, plus attributes not considered part of the static aspect, such as internal IP addresses.
Similarly, the state of a node’s drives is stored in the drives.xml file, along with a flag indicating whether each drive is an SSD. Whereas GMP manages node states directly, drive states are actually managed by the ‘drv’ module and broadcast via GMP. A significant difference between nodes and drives is that a node’s static aspect is distributed to every node in the array.xml file, whereas drive state is stored only locally, in each node’s own drives.xml. The array.xml information is needed by every node in order to define the cluster and allow nodes to form connections. When a node goes down, the other nodes have no way to obtain its drive configuration: drive information may be cached by the GMP, but it is not available if that cache is cleared.
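As a rough illustration of the split between the two files, the kind of per-node and per-drive information described above might be modeled as follows. This is a hypothetical sketch for orientation only; the actual array.xml and drives.xml schemas are internal to OneFS, and the class and field names here are assumptions made for the example.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class NodeStaticInfo:
    """Roughly the per-node info array.xml conveys (hypothetical model, not the real schema)."""
    node_id: int
    guid: str
    storage_node: bool                # storage vs diskless node
    internal_ips: List[str] = field(default_factory=list)   # carried in array.xml, but not part of the static aspect

@dataclass
class DriveInfo:
    """Roughly the per-drive info drives.xml conveys (hypothetical model, not the real schema)."""
    drive_id: int
    is_ssd: bool
    state: str = "UP"                 # dynamic aspect, managed by the 'drv' module

# array.xml-style information is replicated to every node in the cluster;
# drives.xml-style information lives only on the node that owns the drives,
# which is why a downed node's drive configuration cannot be fetched from it.
```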
Conversely, ‘dynamic aspect’ refers to the state of nodes and drives which may change. These states indicate to the various file system modules the health of nodes and their drives, plus whether or not a component can be used for particular operations. For example, a soft-failed node or drive should not be used for new allocations. These components can be in one of seven states:
- UP: The component is responding.
- DOWN: The component is not responding.
- DEAD: The component is not allowed to come back to the UP state and should be removed.
- STALLED: A drive is responding slowly.
- GONE: The component has been removed.
- Soft-failed: The component is in the process of being removed.
- Read-only: This state only applies to nodes.
Note: A node or drive may go from ‘down, soft-failed’ to ‘up, soft-failed’ and back. These flags are persistently stored in the array.xml file for nodes and the drives.xml file for drives.
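For illustration, the seven states above could be modeled as a simple enumeration. This is a conceptual sketch only, not OneFS source, and the enum name is an assumption for the example.

```python
from enum import Enum, auto

class ComponentState(Enum):
    """Dynamic-aspect states a node or drive can report via GMP (conceptual sketch)."""
    UP = auto()           # the component is responding
    DOWN = auto()         # the component is not responding
    DEAD = auto()         # not allowed to return to UP; should be removed
    STALLED = auto()      # a drive is responding slowly
    GONE = auto()         # the component has been removed
    SOFT_FAILED = auto()  # in the process of being removed
    READ_ONLY = auto()    # applies to nodes only

# Soft-failed and read-only combine with up/down, so a node can move from
# 'down, soft-failed' to 'up, soft-failed' and back, as noted above.
```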
Group and drive state information allows the various file system modules to make timely and accurate decisions about how they should utilize nodes and drives. For example, when reading a block, the selected mirror should be on a node and drive where a read can succeed (if possible). File system modules use the GMP to test for node and drive capabilities, which include:
- Readable: Drives on this node may be read.
- Writable: Drives on this node may be written to.
- Restripe From: Move blocks away from the node.
Access levels help define ‘as a last resort’ usage: they identify states in which access should be avoided unless absolutely necessary. The access levels, in order of increasing access, are as follows:
- Normal: The default access level.
- Read Stalled: Allows reading from stalled drives.
- Modify Stalled: Allows writing to stalled drives.
- Read Soft-fail: Allows reading from soft-failed nodes and drives.
- Never: Indicates a group state never supports the capability.
Drive state and node state capabilities are shown in the following tables. As shown, the only group states affected by increasing access levels are soft-failed and stalled.
Minimum Access Level for Capabilities Per Node State

| Node State | Readable | Writable | Restripe From |
|---|---|---|---|
| UP | Normal | Normal | No |
| UP, Smartfail | Soft-fail | Never | Yes |
| UP, Read-only | Normal | Never | No |
| UP, Smartfail, Read-only | Soft-fail | Never | Yes |
| DOWN | Never | Never | No |
| DOWN, Smartfail | Never | Never | Yes |
| DOWN, Read-only | Never | Never | No |
| DOWN, Smartfail, Read-only | Never | Never | Yes |
| DEAD | Never | Never | Yes |
Minimum Access Level for Capabilities Per Drive State

| Drive State | Minimum Access Level to Read | Minimum Access Level to Write | Restripe From |
|---|---|---|---|
| UP | Normal | Normal | No |
| UP, Smartfail | Soft-fail | Never | Yes |
| DOWN | Never | Never | No |
| DOWN, Smartfail | Never | Never | Yes |
| DEAD | Never | Never | Yes |
| STALLED | Read Stalled | Modify Stalled | No |
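As a concrete illustration of how a module might consult this information, the following sketch encodes the two tables as lookup dictionaries and tests whether a capability is available at a given access level. It is purely illustrative; the table names, helper function, and level strings are assumptions made for the example, not OneFS interfaces.

```python
# Access levels in order of increasing access; None represents 'Never'.
LEVELS = ["Normal", "Read Stalled", "Modify Stalled", "Read Soft-fail"]

# state -> (minimum level to read, minimum level to write, restripe from)
NODE_TABLE = {
    "UP":                         ("Normal",         "Normal", False),
    "UP, Smartfail":              ("Read Soft-fail", None,     True),
    "UP, Read-only":              ("Normal",         None,     False),
    "UP, Smartfail, Read-only":   ("Read Soft-fail", None,     True),
    "DOWN":                       (None,             None,     False),
    "DOWN, Smartfail":            (None,             None,     True),
    "DOWN, Read-only":            (None,             None,     False),
    "DOWN, Smartfail, Read-only": (None,             None,     True),
    "DEAD":                       (None,             None,     True),
}

DRIVE_TABLE = {
    "UP":              ("Normal",         "Normal",         False),
    "UP, Smartfail":   ("Read Soft-fail", None,             True),
    "DOWN":            (None,             None,             False),
    "DOWN, Smartfail": (None,             None,             True),
    "DEAD":            (None,             None,             True),
    "STALLED":         ("Read Stalled",   "Modify Stalled", False),
}

READ, WRITE = 0, 1

def allowed(table, state, capability, access_level):
    """True if the capability (READ or WRITE) is permitted at access_level for the given state."""
    minimum = table[state][capability]
    if minimum is None:                       # 'Never': no access level suffices
        return False
    return LEVELS.index(access_level) >= LEVELS.index(minimum)

# A stalled drive becomes readable only once access is raised to Read Stalled:
assert not allowed(DRIVE_TABLE, "STALLED", READ, "Normal")
assert allowed(DRIVE_TABLE, "STALLED", READ, "Read Stalled")
# A smartfailed (soft-failed) node is never writable, at any access level:
assert not allowed(NODE_TABLE, "UP, Smartfail", WRITE, "Read Soft-fail")
```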
OneFS depends on a consistent view of a cluster’s group state. For example, some decisions, such as choosing lock coordinators, are made assuming all nodes have the same coherent notion of the cluster.
Group changes originate from multiple sources, depending on the particular state. Drive group changes are initiated by the drv module. Service group changes are initiated by processes opening and closing service devices. Each group change creates a new group ID, comprising a node ID and a group serial number. This group ID can be used to quickly determine whether a cluster’s group has changed, and is invaluable for troubleshooting cluster issues, by identifying the history of group changes across the nodes’ log files.
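Conceptually, a group ID behaves like a (node ID, serial number) pair, and comparing two observations of it tells you whether a group change happened in between. The sketch below is purely illustrative; the type and field names are assumptions for the example, not OneFS definitions.

```python
from typing import NamedTuple

class GroupId(NamedTuple):
    """Conceptual group identifier: the initiating node plus a serial number (illustrative only)."""
    node_id: int
    serial: int

observed_at_start = GroupId(node_id=1, serial=42)
observed_at_end = GroupId(node_id=3, serial=43)

# A cheap way to detect that the cluster's group changed during an operation:
if observed_at_start != observed_at_end:
    print("group change occurred between the two observations")
```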
GMP provides coherent cluster state transitions using a process similar to two-phase commit, with the up and down states for nodes being directly managed by the GMP. The Remote Block Manager (RBM) provides the communication channel that connects the devices in OneFS. When a node mounts /ifs, it initializes the RBM in order to connect to the other nodes in the cluster, and uses it to exchange GMP info, negotiate locks, and access data on the other nodes.
Before /ifs is mounted, a ‘cluster’ is just a list of MAC and IP addresses in array.xml, managed by ibootd when nodes join or leave the cluster. When mount_efs is called, it must first determine what it is contributing to the file system, based on the information in drives.xml. After a cluster (re)boot, the first node to mount /ifs is immediately placed into a group on its own, with all other nodes marked down. As the RBM forms connections, the GMP merges the connected nodes, enlarging the group until the full cluster is represented. A group transaction in which nodes transition to up is called a ‘merge’, whereas a transaction in which nodes transition to down is called a ‘split’. Several file system modules must update internal state to accommodate splits and merges of nodes; primarily, this involves synchronizing memory state between nodes.
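One way to picture merges and splits is as set operations on the membership that each side of the cluster currently agrees on. The sketch below is a loose conceptual illustration; the real GMP transaction is a distributed, two-phase-commit-style protocol, not a local set union.

```python
# Conceptual sketch of merge and split on group membership (not OneFS code).
group_on_node1 = {1}        # the first node to mount /ifs forms a group on its own
group_on_node2 = {2, 3}     # nodes 2 and 3 have already connected to each other

# Merge: as RBM connections form, the groups combine and every member
# agrees on the enlarged membership.
merged = group_on_node1 | group_on_node2        # {1, 2, 3}

# Split: a node stops responding and transitions to down, so the surviving
# members agree on a reduced membership.
after_split = merged - {3}                      # {1, 2}
print(sorted(merged), sorted(after_split))
```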
The soft-failed, read-only, and dead states are not directly managed by the GMP. These states are persistent and must be written to array.xml accordingly. Soft-failed state changes are often initiated from the user interface, for example via the ‘isi devices’ command.
A GMP group relies on cluster quorum to enforce consistency across node disconnects. Requiring ⌊N/2⌋ + 1 replicas to be available ensures that no updates are lost. Since nodes and drives in OneFS may be readable but not writable, OneFS has two quorum properties:
- Read quorum
- Write quorum
Read quorum is governed by having ⌊N/2⌋ + 1 nodes readable, as indicated by the sysctl efs.gmp.has_quorum. Similarly, write quorum requires at least ⌊N/2⌋ + 1 writable nodes, as represented by the sysctl efs.gmp.has_super_block_quorum. A group of nodes with quorum is called the ‘majority’ side, whereas a group without quorum is a ‘minority’. By definition, there can only be one ‘majority’ group, but there may be multiple ‘minority’ groups. A group which has any component in any state other than up is referred to as degraded.
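The quorum arithmetic itself is straightforward. The sketch below illustrates the ⌊N/2⌋ + 1 rule; the function is an assumption made for illustration, not how OneFS evaluates quorum internally, although the sysctls above surface the corresponding read and write quorum states.

```python
def has_quorum(available: int, total: int) -> bool:
    """True if `available` nodes meet the floor(N/2) + 1 threshold for a cluster of `total` nodes."""
    return available >= total // 2 + 1

# Example: a 5-node cluster retains quorum with 3 nodes available...
assert has_quorum(available=3, total=5)
# ...but not with 2, since 2 < floor(5/2) + 1 = 3.
assert not has_quorum(available=2, total=5)

# Read quorum counts readable nodes and write quorum counts writable nodes,
# so a cluster can hold read quorum while having lost write quorum.
```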
File system operations typically query a GMP group several times before completing. A group may change over the course of an operation, but the operation needs a consistent view. This is provided by the group info, which is the primary interface modules use to query group state. The current group info can be viewed via the sysctl efs.gmp.current_info command. It includes not only the GMP’s group state, but also information about the services provided by nodes in the cluster. This allows nodes in the cluster to discover when services change state on other nodes and take the appropriate action. An example is SMB lock expiry, which uses GMP service information to clean up locks held by other nodes when the service owning the lock goes down.
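The ‘query once, reuse the snapshot’ pattern that group info enables might look roughly like the following. The helper function and the node dictionaries here are hypothetical stand-ins for illustration, not the actual in-kernel interface.

```python
# Illustrative only: get_group_info() and its return shape are hypothetical.

def pick_write_targets(get_group_info):
    # Capture one consistent view of the group at the start of the operation...
    group_info = get_group_info()
    # ...and make every decision against that same snapshot, rather than
    # re-querying GMP and risking a change of view mid-operation.
    return [n["id"] for n in group_info["nodes"] if n["writable"]]

def example_group_info():
    return {"group_id": (1, 42),
            "nodes": [{"id": 1, "writable": True},
                      {"id": 2, "writable": True},
                      {"id": 3, "writable": False}]}   # e.g. a read-only node

print(pick_write_targets(example_group_info))          # [1, 2]
```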
Processes change the service state in GMP by opening and closing service devices. A particular service transitions from down to up in the GMP group when the owning process opens the file descriptor for its device. Closing the service file descriptor triggers a group change that reports the service as down. A process can close the file descriptor explicitly if it chooses, but most often the file descriptor remains open for the duration of the process and is closed automatically by the kernel when the process terminates.
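In UNIX terms, the lifecycle is simply an open file descriptor held for the life of the process. The sketch below illustrates the idea; the device path and main loop are hypothetical, and the comments describe the behavior attributed to GMP above rather than anything the example code itself performs.

```python
import os
import time

SERVICE_DEVICE = "/dev/example_service"   # hypothetical path, for illustration only

def run_service():
    # Opening the service device is what triggers the group change that
    # reports the service as up.
    fd = os.open(SERVICE_DEVICE, os.O_RDWR)
    try:
        while True:                       # hypothetical service main loop
            time.sleep(60)
    finally:
        # Closing the descriptor reports the service as down; if the process
        # dies before reaching this point, the kernel closes the descriptor
        # on exit and the same 'service down' group change occurs.
        os.close(fd)
```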
An understanding of OneFS groups and their related group change messages allows you to determine the current health of a cluster – as well as reconstruct the cluster’s history when troubleshooting issues that involve cluster stability, network health, and data integrity. We’ll explore the reading and interpretation of group change status data in the second part of this blog article series.