Cluster Composition and GMP – Part 3

In this third and final article in the series on OneFS groups, we’ll take a look at what can be learned about a cluster’s state and transitions, and how. Simply put, ‘group state’ is a list of the nodes, drives, and protocols participating in a cluster at a particular point in time.

Under normal operating conditions, every node and its requisite disks are part of the current group, and the group status can be viewed from any node in the cluster using the ‘sysctl efs.gmp.group’ CLI command. If a greater level of detail is desired, the ‘sysctl efs.gmp.current_info’ command will report extensive current GMP information.

When a group change occurs, a cluster-wide process writes a message describing the new group membership to /var/log/messages on every node. Similarly, if a cluster ‘splits’ into multiple sub-clusters (multiple groups), each newly formed sub-cluster behaves in the same way: each of its nodes records its group membership to /var/log/messages. A split is rarely, if ever, a desirable event. A cluster is defined by its group members: nodes or drives which lose sight of other group members no longer belong to the same group and therefore no longer belong to the same cluster.

The ‘grep’ CLI utility can be used to view group changes from one node’s perspective by searching /var/log/messages for the expression ‘new group’. This extracts the group change statements from the logfile. The output from this command may be lengthy, so it can be piped to the ‘tail’ command to limit it to the desired number of lines.

Please note that, for the sake of clarity, the protocol information has been removed from the end of each group string in all the following examples. For example:

{ 1-3:0-11, smb: 1-3, nfs: 1-3, hdfs: 1-3, all_enabled_protocols: 1-3, isi_cbind_d: 1-3, lsass: 1-3, s3: 1-3 }

 

Will be represented as:

 

{ 1-3:0-11 }
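As an aside, these compact range expressions are straightforward to expand programmatically. The following Python sketch (illustrative only, not part of OneFS) parses a simplified group string into a per-node drive list, skipping non-numeric entries such as the ‘down:’ or ‘diskless:’ lists:

```python
import re

def expand_ranges(spec):
    """Expand a range list such as '0-7,9-12' into [0, ..., 7, 9, ..., 12]."""
    ids = []
    for part in spec.split(","):
        if "-" in part:
            lo, hi = part.split("-")
            ids.extend(range(int(lo), int(hi) + 1))
        else:
            ids.append(int(part))
    return ids

def parse_group(group_str):
    """Parse a simplified GMP group string into a {node: [drives]} dict.

    Segments with non-numeric prefixes ('down:', 'diskless:', protocol
    lists, and so on) are skipped, so only the 'up' membership is returned.
    """
    nodes = {}
    for segment in group_str.strip(" {}").split(", "):
        node_spec, _, drive_spec = segment.partition(":")
        if not re.fullmatch(r"[\d,-]+", node_spec):
            continue  # 'down:', 'soft_failed:', 'diskless:' etc.
        for node in expand_ranges(node_spec):
            nodes[node] = expand_ranges(drive_spec) if drive_spec else []
    return nodes

print(parse_group("{ 1-3:0-11 }"))
```

A string like ‘{ 1:0-11, 3:0-7,9-12, down: 2 }’ parses the same way, since the drive-range continuation after a node’s colon contains no spaces.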

 

In the following example, the ‘tail -n 10’ command limits the output to the last ten group changes reported in the file:

tme-1# grep -i 'new group' /var/log/messages | tail -n 10

2020-06-15-T08:07:50 -04:00 <0.4> tme-1 (id1) /boot/kernel.amd64/kernel: [gmp_info.c:1863] (pid 1814="kt: gmp-drive-updat") new group: : { 1:0-4, down: 1:5-11, 2-3 }

2020-06-15-T08:07:50 -04:00 <0.4> tme-1 (id1) /boot/kernel.amd64/kernel: [gmp_info.c:1863] (pid 1814="kt: gmp-drive-updat") new group: : { 1:0-5, down: 1:6-11, 2-3 }

2020-06-15-T08:07:50 -04:00 <0.4> tme-1 (id1) /boot/kernel.amd64/kernel: [gmp_info.c:1863] (pid 1814="kt: gmp-drive-updat") new group: : { 1:0-6, down: 1:7-11, 2-3 }

2020-06-15-T08:07:50 -04:00 <0.4> tme-1 (id1) /boot/kernel.amd64/kernel: [gmp_info.c:1863] (pid 1814="kt: gmp-drive-updat") new group: : { 1:0-7, down: 1:8-11, 2-3 }

2020-06-15-T08:07:50 -04:00 <0.4> tme-1 (id1) /boot/kernel.amd64/kernel: [gmp_info.c:1863] (pid 1814="kt: gmp-drive-updat") new group: : { 1:0-8, down: 1:9-11, 2-3 }

2020-06-15-T08:07:50 -04:00 <0.4> tme-1 (id1) /boot/kernel.amd64/kernel: [gmp_info.c:1863] (pid 1814="kt: gmp-drive-updat") new group: : { 1:0-9, down: 1:10-11, 2-3 }

2020-06-15-T08:07:50 -04:00 <0.4> tme-1 (id1) /boot/kernel.amd64/kernel: [gmp_info.c:1863] (pid 1814="kt: gmp-drive-updat") new group: : { 1:0-10, down: 1:11, 2-3 }

2020-06-15-T08:07:50 -04:00 <0.4> tme-1 (id1) /boot/kernel.amd64/kernel: [gmp_info.c:1863] (pid 1814="kt: gmp-drive-updat") new group: : { 1:0-11, down: 2-3 }

2020-06-15-T08:07:51 -04:00 <0.4> tme-1 (id1) /boot/kernel.amd64/kernel: [gmp_info.c:1863] (pid 1814="kt: gmp-merge") new group: : { 1:0-11, 3:0-7,9-12, down: 2 }

2020-06-15-T08:07:52 -04:00 <0.4> tme-1 (id1) /boot/kernel.amd64/kernel: [gmp_info.c:1863] (pid 1814="kt: gmp-merge") new group: : { 1-2:0-11, 3:0-7,9-12 }
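When there are many such entries to review, extracting the key fields from each line helps. Here’s a minimal Python sketch (the regular expression is an assumption based on the log format shown above) that pulls the timestamp, the GMP process name, and the group string out of a ‘new group’ message:

```python
import re

# Sample 'new group' line, matching the format shown above.
LINE = ('2020-06-15-T08:07:50 -04:00 <0.4> tme-1 (id1) '
        '/boot/kernel.amd64/kernel: [gmp_info.c:1863] '
        '(pid 1814="kt: gmp-drive-updat") new group: : '
        '{ 1:0-4, down: 1:5-11, 2-3 }')

NEW_GROUP = re.compile(
    r'^(?P<ts>\S+)\s+(?P<tz>\S+)'             # timestamp and UTC offset
    r'.*\(pid \d+="kt: (?P<proc>[^"]+)"\)'    # GMP process (split, merge, ...)
    r'.*new group: [^{]*(?P<group>\{.*\})'    # the group string itself
)

m = NEW_GROUP.match(LINE)
print(m.group("ts"), m.group("proc"), m.group("group"))
# -> 2020-06-15-T08:07:50 gmp-drive-updat { 1:0-4, down: 1:5-11, 2-3 }
```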

All the group changes in this set occur within a two-second window, so it’s worth looking further back in the logs, prior to the incident being investigated.

Here are some useful data points that can be gleaned from the example above:

  1. The last line shows that all the cluster’s nodes are operational and belong to the group. No nodes or drives report as down or split. (At some point in the past, drive ID 8 on node 3 was replaced, and a replacement disk was subsequently added successfully.)
  2. Node 1 rebooted. In the first eight lines, each group change adds a drive on node 1 back into the group, while nodes 2 and 3 are inaccessible. This occurs on node reboot, prior to any attempt to join an active group, and is indicative of healthy behavior.
  3. Node 3 forms a group with node 1 before node 2 does. This could suggest that node 2 rebooted while node 3 remained up.

A review of group changes from the other nodes’ logs should be able to confirm this. In this case node 3’s logs show:

tme-3# grep -i 'new group' /var/log/messages | tail -n 10

2020-06-15-T08:07:50 -04:00 <0.4> tme-3(id3) /boot/kernel.amd64/kernel: [gmp_info.c:1863] (pid 1814="kt: gmp-drive-updat") new group: : { 3:0-4, down: 1-2, 3:5-7,9-12 }

2020-06-15-T08:07:50 -04:00 <0.4> tme-3(id3) /boot/kernel.amd64/kernel: [gmp_info.c:1863] (pid 1814="kt: gmp-drive-updat") new group: : { 3:0-5, down: 1-2, 3:6-7,9-12 }

2020-06-15-T08:07:50 -04:00 <0.4> tme-3(id3) /boot/kernel.amd64/kernel: [gmp_info.c:1863] (pid 1814="kt: gmp-drive-updat") new group: : { 3:0-6, down: 1-2, 3:7,9-12 }

2020-06-15-T08:07:50 -04:00 <0.4> tme-3(id3) /boot/kernel.amd64/kernel: [gmp_info.c:1863] (pid 1814="kt: gmp-drive-updat") new group: : { 3:0-7, down: 1-2, 3:9-12 }

2020-06-15-T08:07:50 -04:00 <0.4> tme-3(id3) /boot/kernel.amd64/kernel: [gmp_info.c:1863] (pid 1814="kt: gmp-drive-updat") new group: : { 3:0-7,9, down: 1-2, 3:10-12 }

2020-06-15-T08:07:50 -04:00 <0.4> tme-3(id3) /boot/kernel.amd64/kernel: [gmp_info.c:1863] (pid 1814="kt: gmp-drive-updat") new group: : { 3:0-7,9-10, down: 1-2, 3:11-12 }

2020-06-15-T08:07:50 -04:00 <0.4> tme-3(id3) /boot/kernel.amd64/kernel: [gmp_info.c:1863] (pid 1814="kt: gmp-drive-updat") new group: : { 3:0-7,9-11, down: 1-2, 3:12 }

2020-06-15-T08:07:50 -04:00 <0.4> tme-3(id3) /boot/kernel.amd64/kernel: [gmp_info.c:1863] (pid 1814="kt: gmp-drive-updat") new group: : { 3:0-7,9-12, down: 1-2 }

2020-06-15-T08:07:50 -04:00 <0.4> tme-3(id3) /boot/kernel.amd64/kernel: [gmp_info.c:1863] (pid 1828="kt: gmp-merge") new group: : { 1:0-11, 3:0-7,9-12, down: 2 }

2020-06-15-T08:07:52 -04:00 <0.4> tme-3(id3) /boot/kernel.amd64/kernel: [gmp_info.c:1863] (pid 1828="kt: gmp-merge") new group: : { 1-2:0-11, 3:0-7,9-12 }

Since node 3 rebooted at the same time, it’s worth checking node 2’s logs to see whether it, too, rebooted simultaneously. In this instance, the logfiles confirm that it did. Given that all three nodes rebooted at once, it’s highly likely that this was a cluster-wide event rather than a single-node issue. OneFS ‘software watchdog’ timeouts (also known as softwatch or swatchdog), for example, cause cluster-wide reboots; however, these are typically staggered rather than simultaneous. The softwatch process monitors the kernel and dumps a stack trace and/or reboots the node when the node is not responding. This helps protect the cluster from the impact of heavy CPU starvation and aids the issue detection and resolution process.

If a cluster experiences multiple, staggered group changes, it can be extremely helpful to construct a timeline of the order and duration in which nodes are up or down. This info can then be cross-referenced with panic stack traces and other system logs to help diagnose the root cause of an event.

For example, in the following log excerpt, a fifteen-node cluster experiences a series of node reboots over a twenty-minute period. These are the group change messages from node 14, which stayed up for the whole duration:

tme-14# grep -i 'new group' /var/log/messages

2020-06-10-T14:54:00 -04:00 <0.4> tme-14(id20) /boot/kernel.amd64/kernel: [gmp_info.c:1863] (pid 1060="kt: gmp-merge") new group: : { 1-2:0-11, 6-8, 9,13-15:0-11, 16:0,2-12, 17-18:0-11, 19-21, diskless: 6-8, 19-21 }

2020-06-15-T06:44:38 -04:00 <0.4> tme-14(id20) /boot/kernel.amd64/kernel: [gmp_info.c:1863] (pid 1060="kt: gmp-split") new group: : { 1-2:0-11, 6-8, 13-15:0-11, 16:0,2-12, 17-18:0-11, 19-21, down: 9 }

2020-06-15-T06:44:58 -04:00 <0.4> tme-14(id20) /boot/kernel.amd64/kernel: [gmp_info.c:1863] (pid 1066="kt: gmp-split") new group: : { 1:0-11, 6-8, 13-14:0-11, 16:0,2-12, 17-18:0-11, 19-21, down: 2, 9, 15 }

2020-06-15-T06:45:20 -04:00 <0.4> tme-14(id20) /boot/kernel.amd64/kernel: [gmp_info.c:1863] (pid 1066="kt: gmp-split") new group: : { 1:0-11, 6-8, 14:0-11, 16:0,2-12, 17-18:0-11, 19-21, down: 2, 9, 13, 15 }

2020-06-15-T06:47:09 -04:00 <0.4> tme-14(id20) /boot/kernel.amd64/kernel: [gmp_info.c:1863] (pid 1066="kt: gmp-merge") new group: : { 1:0-11, 6-8, 9,14:0-11, 16:0,2-12, 17-18:0-11, 19-21, down: 2, 13, 15 }

2020-06-15-T06:47:27 -04:00 <0.4> tme-14(id20) /boot/kernel.amd64/kernel: [gmp_info.c:1863] (pid 1066="kt: gmp-split") new group: : { 6-8, 9,14:0-11, 16:0,2-12, 17-18:0-11, 19-21, down: 1-2, 13, 15 }

2020-06-15-T06:48:11 -04:00 <0.4> tme-14(id20) /boot/kernel.amd64/kernel: [gmp_info.c:1863] (pid 2102="kt: gmp-split") new group: : { 6-8, 9,14:0-11, 16:0,2-12, 17:0-11, 19-21, down: 1-2, 13, 15, 18 }

2020-06-15-T06:50:55 -04:00 <0.4> tme-14(id20) /boot/kernel.amd64/kernel: [gmp_info.c:1863] (pid 2102="kt: gmp-merge") new group: : { 6-8, 9,13-14:0-11, 16:0,2-12, 17:0-11, 19-21, down: 1-2, 15, 18 }

2020-06-15-T06:51:26 -04:00 <0.4> tme-14(id20) /boot/kernel.amd64/kernel: [gmp_info.c:1863] (pid 85396="kt: gmp-merge") new group: : { 2:0-11, 6-8, 9,13-14:0-11, 16:0,2-12, 17:0-11, 19-21, down: 1, 15, 18 }

2020-06-15-T06:51:53 -04:00 <0.4> tme-14(id20) /boot/kernel.amd64/kernel: [gmp_info.c:1863] (pid 85396="kt: gmp-merge") new group: : { 2:0-11, 6-8, 9,13-15:0-11, 16:0,2-12, 17:0-11, 19-21, down: 1, 18 }

2020-06-15-T06:54:06 -04:00 <0.4> tme-14(id20) /boot/kernel.amd64/kernel: [gmp_info.c:1863] (pid 85396="kt: gmp-merge") new group: : { 1-2:0-11, 6-8, 9,13-15:0-11, 16:0,2-12, 17:0-11, 19-21, down: 18 }

2020-06-15-T06:56:10 -04:00 <0.4> tme-14(id20) /boot/kernel.amd64/kernel: [gmp_info.c:1863] (pid 2102="kt: gmp-merge") new group: : { 1-2:0-11, 6-8, 9,13-15:0-11, 16:0,2-12, 17-18:0-11, 19-21 }

2020-06-15-T06:59:54 -04:00 <0.4> tme-14(id20) /boot/kernel.amd64/kernel: [gmp_info.c:1863] (pid 85396="kt: gmp-split") new group: : { 1-2:0-11, 6-8, 9,13-15,17-18:0-11, 19-21, down: 16 }

2020-06-15-T07:05:23 -04:00 <0.4> tme-14(id20) /boot/kernel.amd64/kernel: [gmp_info.c:1863] (pid 1066="kt: gmp-merge") new group: : { 1-2:0-11, 6-8, 9,13-15:0-11, 16:0,2-12, 17-18:0-11, 19-21 }

First, run the isi_nodes "%{name}: LNN %{lnn}, Array ID %{id}" command to map the cluster’s node names to their respective LNNs and Array IDs.
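With that output in hand, the mapping can be turned into a small lookup table for translating the Array IDs in a ‘down:’ list into LNNs. A Python sketch (the node names and ID pairs below are hypothetical, chosen to match this example cluster):

```python
import re

# Hypothetical isi_nodes output for a few of this cluster's nodes.
ISI_NODES_OUT = """\
tme-1: LNN 1, Array ID 1
tme-2: LNN 2, Array ID 2
tme-6: LNN 6, Array ID 9
tme-12: LNN 12, Array ID 18
"""

def aid_to_lnn(text):
    """Build an {Array ID: LNN} lookup from isi_nodes output."""
    return {int(m.group(2)): int(m.group(1))
            for m in re.finditer(r"LNN (\d+), Array ID (\d+)", text)}

lookup = aid_to_lnn(ISI_NODES_OUT)

# Translate a 'down:' list of Array IDs into LNNs.
print([lookup[aid] for aid in (2, 9, 18)])  # -> [2, 6, 12]
```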

Before the cluster node outage event on June 15 there was a group change on June 10:

2020-06-10-T14:54:00 -04:00 <0.4> tme-14(id20) /boot/kernel.amd64/kernel: [gmp_info.c:1863] (pid 1060="kt: gmp-merge") new group: : { 1-2:0-11, 6-8, 9,13-15:0-11, 16:0,2-12, 17-18:0-11, 19-21, diskless: 6-8, 19-21 }

After that change, all nodes were online and the cluster could be considered healthy. The cluster contains nine X210s with twelve drives apiece and six diskless nodes (accelerators). The Array IDs now extend to 21, with Array IDs 3 through 5 and 10 through 12 missing, which indicates that six nodes were removed from the cluster at some point (Array IDs are not reused).

So, the first event occurs at 06:44:38 on 15 June:

2020-06-15-T06:44:38 -04:00 <0.4> tme-14(id20) /boot/kernel.amd64/kernel: [gmp_info.c:1863] (pid 1060="kt: gmp-split") new group: : { 1-2:0-11, 6-8, 13-15:0-11, 16:0,2-12, 17-18:0-11, 19-21, down: 9, diskless: 6-8, 19-21 }

Node 14 identifies Array ID 9 (LNN 6) as having left the group.

Next, twenty seconds later, two more nodes (2 & 15) are marked as offline:

2020-06-15-T06:44:58 -04:00 <0.4> tme-14(id20) /boot/kernel.amd64/kernel: [gmp_info.c:1863] (pid 1066="kt: gmp-split") new group: : { 1:0-11, 6-8, 13-14:0-11, 16:0,2-12, 17-18:0-11, 19-21, down: 2, 9, 15, diskless: 6-8, 19-21 }

Twenty-two seconds later, another node goes offline:

2020-06-15-T06:45:20 -04:00 <0.4> tme-14(id20) /boot/kernel.amd64/kernel: [gmp_info.c:1863] (pid 1066="kt: gmp-split") new group: : { 1:0-11, 6-8, 14:0-11, 16:0,2-12, 17-18:0-11, 19-21, down: 2, 9, 13, 15, diskless: 6-8, 19-21 }

At this point, four nodes (LNNs 2, 6, 7 & 9) are marked as being offline.

Almost two minutes later, the previously down node (node 6) rejoins the group:

2020-06-15-T06:47:09 -04:00 <0.4> tme-14(id20) /boot/kernel.amd64/kernel: [gmp_info.c:1863] (pid 1066="kt: gmp-merge") new group: : { 1:0-11, 6-8, 9,14:0-11, 16:0,2-12, 17-18:0-11, 19-21, down: 2, 13, 15, diskless: 6-8, 19-21 }

However, twenty-five seconds after node 6 comes back, node 1 leaves the group:

2020-06-15-T06:47:27 -04:00 <0.4> tme-14(id20) /boot/kernel.amd64/kernel: [gmp_info.c:1863] (pid 1066="kt: gmp-split") new group: : { 6-8, 9,14:0-11, 16:0,2-12, 17-18:0-11, 19-21, down: 1-2, 13, 15, diskless: 6-8, 19-21 }

Finally, the group returns to its original composition:

2020-06-15-T07:05:23 -04:00 <0.4> tme-14(id20) /boot/kernel.amd64/kernel: [gmp_info.c:1863] (pid 1066="kt: gmp-merge") new group: : { 1-2:0-11, 6-8, 9,13-15:0-11, 16:0,2-12, 17-18:0-11, 19-21, diskless: 6-8, 19-21 }

As such, a timeline of this cluster event could read:

  1. June 15 06:44:38 6 down
  2. June 15 06:44:58 2, 9 down (6 still down)
  3. June 15 06:45:20 7 down (2, 6, 9 still down)
  4. June 15 06:47:09 6 up (2, 7, 9 still down)
  5. June 15 06:47:27 1 down (2, 7, 9 still down)
  6. June 15 06:48:11 12 down (1, 2, 7, 9 still down)
  7. June 15 06:50:55 7 up (1, 2, 9, 12 still down)
  8. June 15 06:51:26 2 up (1, 9, 12 still down)
  9. June 15 06:51:53 9 up (1, 12 still down)
  10. June 15 06:54:06 1 up (12 still down)
  11. June 15 06:56:10 12 up (none down)
  12. June 15 06:59:54 10 down
  13. June 15 07:05:23 10 up (none down)
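Deriving such a timeline by hand is error-prone, so it can be worth scripting. The Python sketch below diffs the ‘down:’ list of consecutive group changes to emit up/down transitions; the Array-ID-to-LNN mapping and the first four events from the log above are hard-coded for illustration:

```python
# Array ID -> LNN mapping implied by this example cluster.
AID_TO_LNN = {1: 1, 2: 2, 6: 3, 7: 4, 8: 5, 9: 6, 13: 7, 14: 8,
              15: 9, 16: 10, 17: 11, 18: 12, 19: 13, 20: 14, 21: 15}

# (timestamp, Array IDs in the 'down:' list) for each group change.
events = [
    ("06:44:38", {9}),
    ("06:44:58", {2, 9, 15}),
    ("06:45:20", {2, 9, 13, 15}),
    ("06:47:09", {2, 13, 15}),
]

prev = set()
for ts, down in events:
    # Newly down nodes appear in this event's 'down:' list but not the
    # previous one; recovered nodes are the reverse.
    for aid in sorted(down - prev):
        print(f"June 15 {ts} LNN {AID_TO_LNN[aid]} down")
    for aid in sorted(prev - down):
        print(f"June 15 {ts} LNN {AID_TO_LNN[aid]} up")
    prev = down
```

Run against these four events, the output reproduces the first four timeline entries above: LNN 6 down, then LNNs 2 and 9, then LNN 7, then LNN 6 back up.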

The next step would be to review the logs from the other nodes in the cluster for this time period and construct similar timelines. Once done, these can be distilled into one comprehensive, cluster-wide timeline.

Note: Before triangulating log events across a cluster, it’s important to ensure that the constituent nodes’ clocks are all synchronized. To check this, run the ‘isi_for_array -q date’ command on all nodes and confirm that the timestamps match. If not, apply the time offset for a particular node to the timestamps of its logfiles.
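For example, if a given node’s clock turns out to be, say, 37 seconds fast relative to the reference node (a hypothetical skew), the correction is a simple subtraction before merging timelines:

```python
from datetime import datetime, timedelta

# Hypothetical measured skew: this node's clock runs 37 seconds fast.
offset = timedelta(seconds=37)

ts = datetime.strptime("2020-06-15 06:44:38", "%Y-%m-%d %H:%M:%S")
corrected = ts - offset
print(corrected)  # -> 2020-06-15 06:44:01
```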

Here’s another example of how to interpret a series of group events in a cluster. Consider the following group info excerpt from the logs on node 1 of the cluster:

2020-06-15-T18:01:17 -04:00 <0.4> tme-1(id1) /boot/kernel.amd64/kernel: [gmp_info.c:1863] (pid 5681="kt: gmp-config") new group: <1,270>: { 1:0-11, down: 2, 6-11, diskless: 6-8 }

2020-06-15-T18:02:05 -04:00 <0.4> tme-1(id1) /boot/kernel.amd64/kernel: [gmp_info.c:1863] (pid 5681="kt: gmp-config") new group: <1,271>: { 1-2:0-11, 6-8, 9-11:0-11, soft_failed: 11, diskless: 6-8 }

2020-06-15-T18:08:56 -04:00 <0.4> tme-1(id1) /boot/kernel.amd64/kernel: [gmp_info.c:1863] (pid 10899="kt: gmp-split") new group: <1,272>: { 1-2:0-11, 6-8, 9-10:0-11, down: 11, soft_failed: 11, diskless: 6-8 }

2020-06-15-T18:08:56 -04:00 <0.4> tme-1(id1) /boot/kernel.amd64/kernel: [gmp_info.c:1863] (pid 10899="kt: gmp-config") new group: <1,273>: { 1-2:0-11, 6-8, 9-10:0-11, diskless: 6-8 }

2020-06-15-T18:09:49 -04:00 <0.4> tme-1(id1) /boot/kernel.amd64/kernel: [gmp_info.c:1863] (pid 10998="kt: gmp-config") new group: <1,274>: { 1-2:0-11, 6-8, 9-10:0-11, soft_failed: 10, diskless: 6-8 }

2020-06-15-T18:15:34 -04:00 <0.4> tme-1(id1) /boot/kernel.amd64/kernel: [gmp_info.c:1863] (pid 12863="kt: gmp-split") new group: <1,275>: { 1-2:0-11, 6-8, 9:0-11, down: 10, soft_failed: 10, diskless: 6-8 }

2020-06-15-T18:15:34 -04:00 <0.4> tme-1(id1) /boot/kernel.amd64/kernel: [gmp_info.c:1863] (pid 12863="kt: gmp-config") new group: <1,276>: { 1-2:0-11, 6-8, 9:0-11, diskless: 6-8 }

The timeline of events here can be interpreted as follows:

  1. In the first line, node 1 has just rebooted: node 1 is up, and all other nodes that are part of the cluster are down. (Nodes with Array IDs 3 through 5 were removed from the cluster prior to this sequence of events.)
  2. The second line indicates that all the nodes have returned to the group, except for Array ID 11, which has been smartfailed.
  3. In the third line, Array ID 11 is reported as both smartfailed and down.
  4. Moments later, in the fourth line, Array ID 11 has been removed from the cluster entirely.
  5. Less than a minute later, the node with Array ID 10 is smartfailed, and the same sequence of events occurs.
  6. After the smartfail finishes, the group shows the node with Array ID 10 as down, and it is then removed from the cluster entirely.

Because group changes document the cluster’s actual configuration from OneFS’ perspective, they’re a vital tool in understanding which devices the cluster considers available, and which devices the cluster considers as having failed, at a point in time. This information, when combined with other data from cluster logs, can provide a succinct but detailed cluster history, simplifying both debugging and failure analysis.
