OneFS Cbind and DNS Caching

OneFS cbind is the distributed DNS cache daemon for a PowerScale cluster. As such, its primary role is to accelerate domain name lookups on the cluster, particularly for NFS workloads, which can frequently involve a large number of lookup requests, especially when using netgroups. Cbind itself is logically separated into two halves:

Component Description
Gateway cache The entries a node refreshes from the DNS server.
Local cache The entries a node refreshes from the Gateway node.

Cbind’s architecture helps to distribute the cache and associated DNS workload across all nodes in the cluster, and the daemon runs as a OneFS service under the purview of MCP and the /etc/mcp/sys/services/isi_cbind_d control script:

# isi services -a | grep -i bind

   isi_cbind_d          Bind Cache Daemon                        Enabled

On startup the cbind daemon, isi_cbind_d, reads its configuration from the cbind_config.gc gconfig file. If needed, configuration changes can be made using the ‘isi network dnscache’ or ‘isi_cbind’ CLI tools.

The cbind daemon also supports multi-tenancy across the cluster, with each tenant’s groupnet being allocated its own completely independent DNS cache, with multiple client interfaces to separate DNS requests from different groupnets. Cbind uses the 127.42.x.x address range and can be accessed by client applications across the entire range. The lower 16 bits of the address are set by the client to the groupnet ID for the query. For example, if the client is trying to query the DNS servers on the groupnet with ID 5, it will send the DNS query to 127.42.0.5.
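As a minimal sketch of this addressing scheme, the following Python maps a groupnet ID onto the 127.42.x.x range by placing the ID in the low-order 16 bits of the address. The function name is purely illustrative and is not part of OneFS.

```python
import ipaddress

# Start of the 127.42.x.x range that cbind listens on.
CBIND_BASE = ipaddress.IPv4Address("127.42.0.0")


def cbind_address(groupnet_id: int) -> str:
    """Return the loopback address a client would query for a groupnet ID.

    Illustrative helper only: the groupnet ID occupies the low 16 bits.
    """
    if not 0 <= groupnet_id <= 0xFFFF:
        raise ValueError("groupnet ID must fit in the low 16 bits")
    return str(ipaddress.IPv4Address(int(CBIND_BASE) | groupnet_id))


print(cbind_address(5))  # 127.42.0.5
```

Groupnet ID 1 maps to 127.42.0.1 under this scheme.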

Under the hood, the cbind daemon comprises two DNS query/response containers, or ‘stallsets’:

Component Description
DNS stallset The DNS stallset is a collection of DNS stalls which encapsulate a single DNS server and a list of DNS queries which have been sent to the DNS servers and are waiting for a response.
Cluster stallset The cluster stallset is similar to the DNS stallset, except the cluster stalls encapsulate the connection to another node in the cluster, known as the gateway node. It also holds a list of DNS queries which have been forwarded to the gateway node and are waiting for a response.

Contained within a stallset are the stalls themselves, which store the actual DNS requests and responses. The DNS stallset provides a separate stall for each DNS server that cbind has been configured to use, and requests are handled via a round-robin algorithm. Similarly, for the cluster stallset, there is a stall for each node within the cluster. The index of the cluster stallset is the gateway node’s (devid – 1).
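The stall layout described above can be pictured with a small Python model. This is only a sketch: the class and function names are hypothetical, and the real daemon’s data structures are not described in this article beyond one stall per DNS server, round-robin selection, and a cluster stallset index of (devid – 1).

```python
from itertools import cycle


def cluster_stall_index(devid: int) -> int:
    """Index into the cluster stallset for a gateway node: devid - 1."""
    if devid < 1:
        raise ValueError("devid starts at 1")
    return devid - 1


class DnsStallset:
    """Hypothetical model: one stall per DNS server, chosen round-robin."""

    def __init__(self, servers):
        # Each stall holds the queries still waiting on that server.
        self.stalls = {server: [] for server in servers}
        self._next_server = cycle(list(servers))

    def enqueue(self, query: str) -> str:
        """Queue a query on the next server in rotation and return that server."""
        server = next(self._next_server)
        self.stalls[server].append(query)
        return server


stallset = DnsStallset(["10.21.25.10", "10.21.25.11"])
for name in ("a.example.com", "b.example.com", "c.example.com"):
    print(name, "->", stallset.enqueue(name))
```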

The cluster stallset entry for the node that is running the daemon is treated as a special case, known as ‘L1 mode’, because the gateway for these DNS requests is the node executing the code. Requests on the gateway stall also have an entry on the DNS stallset representing the request to the external DNS server. All other actively participating cluster stallset entries are referred to as ‘L2+L1’ mode. However, if a node cannot reach DNS, it is moved to L2 mode to prevent it from being used by the other nodes. An associated log entry is written to /var/log/isi_cbind_d.log, of the form:

isi_cbind_d[6204]: [0x800703800]bind: Error

sending query to dns:10.21.25.11: Host is down

In order to support large clusters, cbind uses a consistent hash to determine the gateway node to cache a request and the appropriate cluster stallset to use. This consistent hashing algorithm, which decides on which node to cache an entry, is designed to minimize the number of entry transfers as nodes are added/removed, while also reducing the number of threads and UDP ports used. To illustrate cbind’s consistent hashing, consider the following three node cluster:

In this scenario, when the cbind service on Node 3 becomes active, one third each of the gateway cache from node 1 and 2 respectively gets transferred to node 3. Similarly, if node 3’s cbind service goes down, its gateway cache is divided equally between nodes 1 and 2. For a DNS request on node 3, the node first checks its local cache. If the entry is not found, it will automatically query the gateway (for example, node 2). This means that even if node 3 cannot talk to the DNS server directly, it can still cache the entries from a different node.
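To make the entry-transfer behavior concrete, here is a short Python sketch. Note the assumption: the article does not describe cbind’s actual hashing algorithm, so this uses rendezvous (highest-random-weight) hashing, one well-known consistent hashing scheme, as a stand-in. The host names are made up for the demonstration.

```python
import hashlib


def gateway_for(name: str, nodes) -> int:
    """Pick a gateway node for a DNS name via rendezvous hashing (stand-in)."""
    def score(node: int) -> int:
        digest = hashlib.sha256(f"{node}:{name}".encode()).digest()
        return int.from_bytes(digest[:8], "big")
    # The node with the highest score for this name is its gateway.
    return max(nodes, key=score)


names = [f"host{i}.example.com" for i in range(3000)]
before = {n: gateway_for(n, [1, 2]) for n in names}     # node 3 down
after = {n: gateway_for(n, [1, 2, 3]) for n in names}   # node 3 joins
moved = [n for n in names if before[n] != after[n]]

print(f"{len(moved) / len(names):.0%} of entries changed gateway")
print("all moved entries went to node 3:", all(after[n] == 3 for n in moved))
```

With this scheme, entries only ever move to the node that joined, and entries owned by nodes 1 and 2 are untouched when node 3 leaves, which is the property that keeps transfers to a minimum.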

So, upon startup, a node’s cbind process attempts to contact, or ‘ping’, the DNS servers. Once a reply is received, cbind moves into an up state and notifies GMP that the isi_cbind_d service is running on this node. GMP, in turn, then informs the cbind processes across the rest of the cluster that the node is up and available.

Conversely, if several DNS requests to an external server fail for a given node (node X), or its isi_cbind_d process is terminated, GMP is notified that the isi_cbind_d service is down on that node, and in turn notifies the rest of the cluster. When a cbind process on another node (node Y) receives this notification, its consistent hash algorithm is updated to reflect that node X is down. The cluster stallset is not informed of this change. Instead, the DNS requests whose gateway has changed will eventually time out and be deleted.

As such, the cbind request and response processes can be summarized as follows:

  1. A client on the node sends a DNS query on the additional loopback address 127.42.x.x which is received by cbind.
  2. The cbind daemon uses the consistent hash algorithm to calculate the gateway value of the DNS query and uses the gateway to index the cluster stallset.
  3. If there is a cache hit, a response is sent to the client and the transaction is complete.
  4. Otherwise, the DNS query is placed in the cluster stallset using the gateway as the index. If this is the gateway node then the request is sent to the external DNS server, otherwise the DNS request is forwarded to the gateway node.
  5. When the DNS server or gateway replies, another thread receives the DNS response and matches it to the query on the list. The response is forwarded to the client and the cluster stallset is updated.

Similarly, when a request is forwarded to the gateway node:

  1. The cbind daemon receives the request, calculates the gateway value of the DNS query using the consistent hash algorithm, and uses the gateway to index the cluster stallset.
  2. If there is a cache hit, a response is returned to the remote cbind process and the transaction is complete.
  3. Otherwise, the DNS query is placed in the cluster stallset using the gateway as the index and the request is sent to the external DNS server.
  4. When the DNS server or gateway returns, another thread receives the DNS response and matches it to the query on the list. The response is forwarded to the calling node and the cluster stallset is updated.
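The two request paths above can be condensed into a simplified Python model. This is a sketch under several assumptions: it is synchronous (no stalls, threads, timeouts or TTLs), the class and method names are hypothetical, and a simple CRC-modulo stands in for the consistent hash.

```python
import zlib


class CbindNode:
    """Toy model of a node's local cache and gateway cache."""

    def __init__(self, devid, cluster, upstream):
        self.devid = devid
        self.cluster = cluster      # shared dict: devid -> CbindNode
        self.upstream = upstream    # callable standing in for the external DNS server
        self.local_cache = {}       # entries learned from other gateway nodes
        self.gateway_cache = {}     # entries this node fetched from DNS
        cluster[devid] = self

    def gateway_for(self, name):
        # Stand-in for the consistent hash: NOT consistent, just deterministic.
        nodes = sorted(self.cluster)
        return nodes[zlib.crc32(name.encode()) % len(nodes)]

    def resolve(self, name):
        """Client-facing path: query arrives on the 127.42.x.x loopback."""
        gateway = self.gateway_for(name)
        if gateway == self.devid:
            return self.handle_forwarded(name)       # this node is the gateway
        if name in self.local_cache:
            return self.local_cache[name]            # local cache hit
        answer = self.cluster[gateway].handle_forwarded(name)
        self.local_cache[name] = answer
        return answer

    def handle_forwarded(self, name):
        """Gateway path: answer from the gateway cache or ask external DNS."""
        if name not in self.gateway_cache:
            self.gateway_cache[name] = self.upstream(name)
        return self.gateway_cache[name]
```

In this model, a name requested from every node in the cluster results in a single query to the external DNS server, issued by whichever node is the gateway for that name.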

If necessary, cbind DNS caching can be enabled or disabled via the ‘isi network groupnets’ command set, allowing the cache to be managed per groupnet:

# isi network groupnets modify --id=<groupnet-name> --dns-cache-enabled=<true/false>

The global ‘isi network dnscache’ command set can be useful for inspecting the cache configuration and limits:

# isi network dnscache view

Cache Entry Limit: 65536

  Cluster Timeout: 5

      DNS Timeout: 5

    Eager Refresh: 0

   Testping Delta: 30

  TTL Max Noerror: 3600

  TTL Min Noerror: 30

 TTL Max Nxdomain: 3600

 TTL Min Nxdomain: 15

    TTL Max Other: 60

    TTL Min Other: 0

 TTL Max Servfail: 3600

 TTL Min Servfail: 300

The following table describes these DNS cache parameters, which can be manually configured if desired.

Setting Description
TTL No Error Minimum Specifies the lower boundary on time-to-live for cache hits (default value=30s).
TTL No Error Maximum Specifies the upper boundary on time-to-live for cache hits (default value=3600s).
TTL Non-existent Domain Minimum Specifies the lower boundary on time-to-live for nxdomain (default value=15s).
TTL Non-existent Domain Maximum Specifies the upper boundary on time-to-live for nxdomain (default value=3600s).
TTL Other Failures Minimum Specifies the lower boundary on time-to-live for non-nxdomain failures (default value=0s).
TTL Other Failures Maximum Specifies the upper boundary on time-to-live for non-nxdomain failures (default value=60s).
TTL Lower Limit For Server Failures Specifies the lower boundary on time-to-live for DNS server failures (default value=300s).
TTL Upper Limit For Server Failures Specifies the upper boundary on time-to-live for DNS server failures (default value=3600s).
Eager Refresh Specifies the lead time to refresh cache entries that are nearing expiration (default value=0s).
Cache Entry Limit Specifies the maximum number of entries that the DNS cache can contain (default value=65536 entries).
Test Ping Delta Specifies the delta for checking the cbind cluster health (default value=30s).
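One way to read the TTL minimum and maximum settings is as a clamp applied to the TTL returned with each response, per response class. The Python below sketches that interpretation using the default values from the table. The function and dictionary names are illustrative, and the exact way cbind applies these bounds is an assumption.

```python
# Default (min, max) TTL bounds in seconds, taken from the table above.
TTL_BOUNDS = {
    "noerror": (30, 3600),
    "nxdomain": (15, 3600),
    "other": (0, 60),
    "servfail": (300, 3600),
}


def clamp_ttl(response_class: str, ttl: int) -> int:
    """Clamp a response TTL into the configured range for its class."""
    low, high = TTL_BOUNDS[response_class]
    return max(low, min(ttl, high))


print(clamp_ttl("noerror", 5))      # 30: raised to the minimum
print(clamp_ttl("noerror", 86400))  # 3600: capped at the maximum
print(clamp_ttl("nxdomain", 100))   # 100: already within bounds
```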

Also, if necessary, the cache can be globally flushed via the following CLI syntax:

# isi network dnscache flush -v

Flush complete.

OneFS also provides the ‘isi_cbind’ CLI utility, which can be used to communicate with the cbind daemon. This utility supports both regular CLI syntax and an interactive mode, in which commands are entered at a prompt. Interactive mode can be entered by invoking the utility without an argument, for example:

# isi_cbind

cbind:

cbind: quit

#

The following command options are available:

# isi_cbind help

        clear           - clear server statistics

        dump            - dump internal server state

        exit            - exit interactive mode

        flush           - flush cache

        quit            - exit interactive mode

        set             - change volatile settings

        show            - show server settings or statistics

        shutdown        - orderly server shutdown

An individual groupnet’s cache can be flushed as follows, in this case targeting the ‘client1’ groupnet:

# isi_cbind flush groupnet client1

Flush complete.

Note that all the cache settings are global and, as such, will affect all groupnet DNS caches.

The cache statistics are available via the following CLI syntax, for example:

# isi_cbind show cache

  Cache:

    entries:                 10         - entries installed in the cache

    max_entries:            338         - entries allocated, including for I/O and free list

    expired:                  0         - entries that reached TTL and were removed from the cache

    probes:                 508         - count of attempts to match an entry in the cache

    hits:                   498 (98%)   - count of times that a match was found

    updates:                  0         - entries in the cache replaced with a new reply

    response_time:     0.000005         - average turnaround time for cache hits
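For ad hoc monitoring, the counters in this output are easy to scrape. The following Python sketch parses a few lines copied from the sample above and derives the hit rate. The parser is illustrative only and assumes the ‘name: value’ layout shown here.

```python
import re

SAMPLE = """\
    entries:                 10         - entries installed in the cache
    probes:                 508         - count of attempts to match an entry in the cache
    hits:                   498 (98%)   - count of times that a match was found
"""


def parse_counters(text: str) -> dict:
    """Extract 'name: number' pairs from isi_cbind 'show' style output."""
    counters = {}
    for line in text.splitlines():
        match = re.match(r"\s*(\w+):\s+(\d+(?:\.\d+)?)", line)
        if match:
            counters[match.group(1)] = float(match.group(2))
    return counters


stats = parse_counters(SAMPLE)
hit_rate = stats["hits"] / stats["probes"]
print(f"hit rate: {hit_rate:.1%}")  # hit rate: 98.0%
```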

These cache stats can be cleared as follows:

# isi_cbind clear cache

Similarly, the DNS statistics can be viewed with the ‘show dns’ argument:

# isi_cbind show dns

  DNS server 1: (dns:10.21.25.10)

    queries:                862         - queries sent to this DNS server

    responses:              862 (100%)  - responses that matched a pending query

    spurious:             17315 (2008%) - responses that did not match a pending query

    dropped:              17315 (2008%) - responses not installed into the cache (error)

    timeouts:                 0 (  0%)  - times no response was received in time

    response_time:     0.001917         - average turnaround time from request to reply

  DNS server 2: (dns:10.21.25.11)

    queries:                861         - queries sent to this DNS server

    responses:              860 ( 99%)  - responses that matched a pending query

    spurious:             17314 (2010%) - responses that did not match a pending query

    dropped:              17314 (2010%) - responses not installed into the cache (error)

    timeouts:                 1 (  0%)  - times no response was received in time

    response_time:     0.001603         - average turnaround time from request to reply
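The percentages above can look odd at first glance, with values such as 2008%. They are consistent with each counter being expressed relative to the number of queries sent to that server and truncated to a whole number. That reading is inferred from the sample output rather than documented, so treat the Python below as a sanity check, not a specification.

```python
def pct_of_queries(count: int, queries: int) -> int:
    """Whole-number percentage of `count` relative to queries sent (truncated)."""
    return count * 100 // queries if queries else 0


# Values taken from the sample 'isi_cbind show dns' output above.
print(pct_of_queries(862, 862))    # 100
print(pct_of_queries(860, 861))    # 99
print(pct_of_queries(17315, 862))  # 2008
```

A spurious count far above 100% simply means that many more unmatched responses arrived than queries were sent.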


When running isi_cbind_d, the following additional options are available, and can be invoked with the following CLI flags and syntax:

Option Flag Description
Debug -d Set debug flag(s) to log specific components.  The flags are comma separated list from the following components:

all     Log all components.

cache   Log information relating to cache data.

cluster  Log information relating to cluster data.

flow    Log information relating to flow data.

lock    Log information relating to lock data.

link    Log information relating to link data.

memory  Log information relating to memory data.

network  Log information relating to network data.

refcount  Log information relating to cache object refcount data.

timing  Log information relating to cache timing data.

external   Special debug option to provide off-node DNS service.

Output -f isi_cbind_d will not detach from the controlling terminal and will print debugging messages to stderr.
Dump to -D Target file for isi_cbind dump output.
Port -p Uses specified port instead of default NS port of 53.

The isi_cbind_d process logs messages to syslog or to stderr, depending on the daemon’s mode. The log level can be changed by sending it a SIGUSR2 signal, which will toggle the debug flag to maximum or back to the original setting. For example:

# kill -USR2 `cat /var/run/isi_cbind_d.pid`

Also, when troubleshooting cbind, the following files can provide useful information:

File Description
/var/run/isi_cbind_d.pid the pid of the currently running process
/var/crash/isi_cbind_d.dump output file for internal state and statistics
/var/log/isi_cbind_d.log syslog output file
/etc/gconfig/cbind_config.gc configuration file
/etc/resolv.conf bind resolver configuration file

Additionally, the internal state data of isi_cbind_d can be dumped to a file specified with the -D option, described in the table above.

Astute observers will also notice the presence of an additional loopback address at 127.42.0.1:

lo0: flags=8049<UP,LOOPBACK,RUNNING,MULTICAST> metric 0 mtu 16384
        options=680003<RXCSUM,TXCSUM,LINKSTATE,RXCSUM_IPV6,TXCSUM_IPV6>
        inet6 ::1 prefixlen 128 zone 1
        inet6 fe80::1%lo0 prefixlen 64 scopeid 0x4 zone 1
        inet 127.0.0.1 netmask 0xff000000 zone 1
        inet 127.42.0.1 netmask 0xffff0000 zone 1
        groups: lo
        nd6 options=21<PERFORMNUD,AUTO_LINKLOCAL>
# grep 127 /etc/resolv.conf
nameserver 127.42.0.1
# sockstat | grep "127.42.0.1:53"

root     isi_cbind_ 4078  7  udp4   127.42.0.1:53         *:*

This entry ensures that outbound DNS queries are intercepted by cbind, which then either answers from its cache or reaches out to the DNS servers based on the groupnet configuration. Standard outbound queries use the default groupnet, while authentication (Auth) lookups are forwarded to the DNS servers of the appropriate groupnet.

OneFS NANON

Another functionality enhancement that debuts in the OneFS 9.4 release is increased support for clusters with partial front-end connectivity. In OneFS parlance, these are known as NANON clusters, the acronym abbreviating ‘Not All Nodes On Network’. Today, every PowerScale node in the portfolio includes both front-end and back-end network interfaces. Both of a node’s redundant back-end network ports, either Ethernet or InfiniBand, must be active and connected to the supplied cluster switches at all times, since these form a distributed systems bus and handle all the intra-cluster communication. However, while the typical cluster topology has all nodes connected to all the front-end client network(s), this is not always possible or even desirable. In certain scenarios, there are distinct benefits to not connecting all the nodes to the front-end network.

But first, some background. Imagine an active archive workload, for example. The I/O and capacity requirements of the workload’s active component can be satisfied by an all-flash F600 pool. In contrast, the inactive archive data is housed on a pool of capacity-optimized A3000 nodes. In this case, not connecting the A3000 nodes to the front-end network saves on switch ports, isolates the archive data pool from client I/O, and simplifies the overall configuration, while also potentially increasing security.

Such NANON cluster configurations are increasing in popularity, as customers elect not to connect the archive nodes in larger clusters in order to save cost and complexity, reduce load on capacity-optimized platforms, and create physically secure and air-gapped solutions. The recent introduction of the PowerScale P100 and B100 accelerator nodes also increases a cluster’s front-end connectivity flexibility.

The above NANON configuration is among the simplest of the partially connected cluster architectures. In this example, the deployment consists of five PowerScale nodes with only three of them connected to the network. The network is assumed to have full access to all necessary infrastructure services and client access.

More complex topologies can often include separating client and management networks, dedicated replication networks, multi-tenant and other separated front-end solutions, and often fall into the NANOAN, or Not All Nodes On All Networks, category. For example:

The management network can be assigned to Subnet0 on the cluster nodes, with a gateway priority of 10 (i.e. the default gateway), and the client network using Subnet1 with a gateway priority of 20. This would route all outbound traffic through the management network. Static routes or Source-Based Routing (SBR) can be configured to direct traffic to the appropriate gateway if issues arise with client traffic routing through the management network.

In this replication topology, nodes 1 through 3 on the source cluster are used for client connectivity, while nodes 4 and 5 on both the source and target clusters are dedicated for SyncIQ replication traffic.

Other more complex examples, such as multi-tenant cluster topologies, can be deployed to support workloads requiring connectivity to multiple physical networks.

The above topology can be configured with a management Groupnet containing Subnet0, and additional Groupnets, each with a subnet, for the client networks. For example:

# isi network groupnets list

ID         DNS Cache Enabled  DNS Search      DNS Servers   Subnets

--------------------------------------------------------------------

Client1    1                  c1.isilon.com   10.231.253.14 subnet1

Client2    1                  c2.isilon.com   10.231.254.14 subnet2

Client3    1                  c3.isilon.com   10.231.255.14 subnet3

Management 1                  mgt.isilon.com  10.231.252.14 subnet0

--------------------------------------------------------------------

Total: 4

Or from the WebUI via Cluster management > Network configuration > External network

The connectivity details of a particular subnet and pool can be queried with the ‘isi network pools status <groupnet.subnet.pool>’ CLI command, which provides details of node connectivity, as well as protocol health and general node state. For example, querying the management groupnet (Management.Subnet0.Pool0) for the six node cluster above, we see that nodes 1-4 are externally connected, whereas nodes 5 and 6 are not:

# isi network pools status Management.subnet0.pool0

Pool ID: Management.subnet0.pool0

SmartConnect DNS Overview:

       Resolvable: 4/6 nodes resolvable

Needing Attention: 2/6 nodes need attention

        SC Subnet: Management.subnet0


Nodes Needing Attention:

              LNN: 5

SC DNS Resolvable: False

       Node State: Up

        IP Status: Doesn't have any usable IPs

 Interface Status: 0/1 interfaces usable

Protocols Running: True

        Suspended: False

--------------------------------------------------------------------------------

              LNN: 6

SC DNS Resolvable: False

       Node State: Up

        IP Status: Doesn't have any usable IPs

 Interface Status: 0/1 interfaces usable

Protocols Running: True

        Suspended: False

There are three core OneFS components that have been enhanced in 9.4 in order to better support NANON configurations on a cluster. These are:

Name Component Description
Group Management GMP_SERVICE_EXT_CONNECTIVITY Allows GMP (Group Management Protocol) to report the cluster nodes’ external connectivity status.
MCP process isi_mcp Monitors for any GMP changes and, when detected, will try to start or stop the affected service(s) under its control.
SmartConnect isi_smartconnect_d Cluster’s network configuration and connection management service. If the SmartConnect daemon decides a node is NANON, OneFS will log the cluster’s status with GMP.

Here’s the basic architecture and inter-relation of the services.

The GMP external connectivity status is available via the ‘sysctl efs.gmp.group’ CLI command output.

For example, take a three node cluster with all nodes’ front-end interfaces connected:

GMP confirms that all three nodes are available, as indicated by the new ‘external_connectivity’ field:

# sysctl efs.gmp.group

efs.gmp.group: <79c9d1> (3) :{ 1-3:0-5, all_enabled_protocols: 1-3, isi_cbind_d: 1-3, lsass: 1-3, external_connectivity: 1-3 }

This new external connectivity status is also incorporated into a new ‘Ext’ column in the ‘isi status’ CLI command output, as indicated by a ‘C’ for connected or an ‘N’ for not connected. For example:

# isi status -q

                   Health Ext  Throughput (bps)  HDD Storage      SSD Storage

ID |IP Address     |DASR |C/N|  In   Out  Total| Used / Size     |Used / Size

---+---------------+-----+---+-----+-----+-----+-----------------+-----------

  1|10.219.64.11   | OK  | C |25.9M| 2.1M|28.0M|(10.2T/23.2T(44%)|

  2|10.219.64.12   | OK  | C | 840K| 123M| 124M|(10.2T/23.2T(44%)|

  3|10.219.64.13   | OK  | C | 225M| 466M| 691M|(10.2T/23.2T(44%)|

---+---------------+-----+---+-----+-----+-----+-----------------+-----------

Cluster Totals:              |  n/a|  n/a|  n/a|30.6T/69.6T( 37%)|

     Health Fields: D = Down, A = Attention, S = Smartfailed, R = Read-Only

           External Network Fields: C = Connected, N = Not Connected

Take the following three node NANON cluster:

GMP confirms that only nodes 1 and 3 are connected to the front-end network. The absence of node 2 from the ‘external_connectivity’ field implies that this node has no external connectivity:

# sysctl efs.gmp.group

efs.gmp.group: <79c9d1> (3) :{ 1-3:0-5, all_enabled_protocols: 1,3, isi_cbind_d: 1,3, lsass: 1,3, external_connectivity: 1,3 }
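When scripting around this output, the per-service node lists can be pulled out of the group string with a short parser. The Python below is a sketch that assumes the format shown in the two samples in this article (service name, colon, then a comma-separated list of node numbers or ranges). The function name is hypothetical and the full set of cluster nodes is passed in by hand.

```python
import re

# Matches 'service_name: 1-3' or 'service_name: 1,3' style fields.
FIELD = re.compile(r"([A-Za-z_]\w*):\s*(\d+(?:-\d+)?(?:,\d+(?:-\d+)?)*)")


def parse_group_services(group: str) -> dict:
    """Map each service name in an efs.gmp.group string to a set of node numbers."""
    services = {}
    for name, spec in FIELD.findall(group):
        nodes = set()
        for part in spec.split(","):
            first, _, last = part.partition("-")
            nodes.update(range(int(first), int(last or first) + 1))
        services[name] = nodes
    return services


group = ("<79c9d1> (3) :{ 1-3:0-5, all_enabled_protocols: 1,3, "
         "isi_cbind_d: 1,3, lsass: 1,3, external_connectivity: 1,3 }")
connected = parse_group_services(group)["external_connectivity"]
print("NANON nodes:", sorted({1, 2, 3} - connected))  # NANON nodes: [2]
```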

Similarly, the ‘isi status’ CLI output reports that node 2 is not connected, denoted by an ‘N’, in the ‘Ext’ column:

# isi status -q

                   Health Ext  Throughput (bps)  HDD Storage      SSD Storage

ID |IP Address     |DASR |C/N|  In   Out  Total| Used / Size     |Used / Size

---+---------------+-----+---+-----+-----+-----+-----------------+-----------

  1|10.219.64.11   | OK  | C | 9.9M|12.1M|22.0M|(10.2T/23.2T(44%)|

  2|10.219.64.12   | OK  | N |    0|    0|    0|(10.2T/23.2T(44%)|

  3|10.219.64.13   | OK  | C | 440M| 221M| 661M|(10.2T/23.2T(44%)|

---+---------------+-----+---+-----+-----+-----+-----------------+-----------

Cluster Totals:              |  n/a|  n/a|  n/a|30.6T/69.6T( 37%)|

     Health Fields: D = Down, A = Attention, S = Smartfailed, R = Read-Only

           External Network Fields: C = Connected, N = Not Connected

Under the hood, OneFS 9.4 sees the addition of a new SmartConnect network module to evaluate and determine whether the node has front-end network connectivity. This module leverages the GMP_SERVICE_EXT_CONNECTIVITY service and polls the node’s network settings every five minutes by default. SmartConnect’s evaluation and assessment criteria for network connectivity are as follows:

VLAN      VLAN IP   Interface   Interface IP   NIC    Network
(any)     (any)     Up          No             Up     No
(any)     (any)     Up          Yes            Up     Yes
Enabled   Yes       (any)       (any)          Up     Yes
(any)     (any)     (any)       (any)          Down   No
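One possible reading of these criteria, expressed as a Python predicate, is shown below. It is only an interpretation of the table: the function and parameter names are made up, and where rows overlap (a VLAN with an IP on a node whose base interface has none), this sketch assumes that an IP on either the interface or an enabled VLAN is sufficient.

```python
def has_external_connectivity(nic_up: bool,
                              interface_up: bool,
                              interface_has_ip: bool,
                              vlan_enabled: bool = False,
                              vlan_has_ip: bool = False) -> bool:
    """Interpretation of the SmartConnect connectivity table (assumed precedence)."""
    if not nic_up:
        return False            # NIC down: never connected
    if vlan_enabled and vlan_has_ip:
        return True             # an enabled VLAN with an IP is enough
    return interface_up and interface_has_ip


print(has_external_connectivity(nic_up=True, interface_up=True, interface_has_ip=False))
print(has_external_connectivity(nic_up=True, interface_up=True, interface_has_ip=True))
```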

OneFS 9.4 also adds an option to MCP, the master control process, which allows it to prevent certain services from being started if there is no external network. As such, the two services in 9.4 that now fall under MCP’s new NANON purview are:

Service Daemon Description
Audit isi_audit_cee Auditing of system configuration and protocol access events on the cluster.
SRS isi_esrs_d Allows remote cluster monitoring and support through Secure Remote Services (SRS).

There are two new MCP configuration tags, introduced to control services execution depending on external network connectivity:

Tag Description
require-ext-network Delay start of service if no external network connectivity.
stop-on-ext-network-loss Halt service if external network connectivity is lost.

These tags are used in the MCP service control scripts, located under /etc/mcp/sys/services. For example, in the SRS script:

# cat /etc/mcp/sys/services/isi_esrs_d

<?xml version="1.0"?>

<service name="isi_esrs_d" enable="0" display="1" ignore="0" options="require-quorum,stop-on-ext-network-loss">

      <isi-meta-tag id="isi_esrs_d">

            <mod-attribs>enable ignore display</mod-attribs>

      </isi-meta-tag>

      <description>ESRS Service Daemon</description>

      <process name="isi_esrs_d" pidfile="/var/run/isi_esrs_d.pid"

               startaction="start" stopaction="stop"

               depends="isi_tardis_d/isi_tardis_d"/>

      <actionlist name="start">

            <action>/usr/bin/isi_run -z 1 /usr/bin/isi_esrs_d</action>

      </actionlist>

      <actionlist name="stop">

            <action>/bin/pkill -F /var/run/isi_esrs_d.pid</action>

      </actionlist>

</service>
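To audit which services carry the new tags, the ‘options’ attribute of a service definition can be inspected programmatically. The Python sketch below parses a trimmed copy of the isi_esrs_d definition above using the standard library. The helper name is illustrative, and reading the live files under /etc/mcp/sys/services is left out.

```python
import xml.etree.ElementTree as ET

# Trimmed copy of the service definition shown above.
SERVICE_XML = """<service name="isi_esrs_d" enable="0" display="1" ignore="0"
         options="require-quorum,stop-on-ext-network-loss">
    <description>ESRS Service Daemon</description>
</service>"""

NANON_TAGS = {"require-ext-network", "stop-on-ext-network-loss"}


def nanon_options(xml_text: str) -> set:
    """Return the NANON-related tags present in a service's options attribute."""
    root = ET.fromstring(xml_text)
    options = {opt.strip() for opt in root.get("options", "").split(",") if opt.strip()}
    return options & NANON_TAGS


print(nanon_options(SERVICE_XML))  # {'stop-on-ext-network-loss'}
```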

This MCP NANON control will be expanded to additional OneFS services over the course of subsequent releases.

When it comes to troubleshooting NANON configurations, the MCP, SmartConnect and general syslog log files can provide valuable connectivity troubleshooting messages and timestamps. The pertinent logfiles are:

  • /var/log/messages
  • /var/log/isi_mcp
  • /var/log/isi_smartconnect

OneFS SmartConnect Diagnostics

SmartConnect, OneFS’ front-end load balancer and connection broker, is a bastion of the cluster that is intrinsically linked into many different areas of the product. Prior to OneFS 9.4, investigating the underlying cause of SmartConnect node resolution and/or connectivity issues typically required a variety of CLI tools and varied output, often making troubleshooting an unnecessarily cumbersome process.

With the addition of SmartConnect diagnostics, the initial troubleshooting process in OneFS 9.4 is now distilled down into a single ‘isi network pool status’ CLI command.

Issue OneFS 9.3 and Earlier Commands OneFS 9.4 Command
Node down isi stat  or  sysctl efs.gmp.group isi network pool status
Node smart-failing isi stat  or  sysctl efs.gmp.group isi network pool status
Node shutdown read-only isi stat  or  sysctl efs.gmp.group isi network pool status
Node rebooting sysctl efs.gmp.group isi network pool status
Node draining sysctl efs.gmp.group isi network pool status
Required protocols not running sysctl efs.gmp.group isi network pool status
Node suspended in network pool isi network pool view isi network pool status
Node interfaces down isi network interfaces list isi network pool status
Node missing IPs isi network interfaces list isi network pool status
Node IPs about to move No easy way to check isi network pool status

As the table above shows, this new diagnostic information source provides a single interface to view the primary causes affecting a node’s connectivity. For any given network pool, a summary of nodes and how many of them are resolvable is returned. Additionally, for each non-resolvable node, the detailed status is reported so the root cause can be pinpointed, providing enough context to narrow the scope of investigation to the correct component(s).

Specifically, the ‘isi network pool status’ CLI command in 9.4 now reports on the following attributes, making SmartConnect considerably easier to troubleshoot:

Attribute            Possible Values
SC DNS Resolvable    True, False
Node State           Up, Draining, Smartfailing, Shutting Down, Down
IP Status            Has usable IPs; Does not have usable IPs; Does not have any configured IPs
Interface Status     x/y interfaces usable
Protocols Running    True, False
Suspended            True, False

Note that, while no additional configuration is needed, the ISI_PRIV_NETWORK privilege is required on a cluster account in order to run this CLI command.

In the following example, the CLI output clearly indicates that node 3 is down and requires attention. As such, it has no usable interfaces or IPs, no protocols running, and is not resolvable via SmartConnect DNS:

# isi network pool status subnet0.pool0 --show-all

Pool ID: groupnet0.subnet0.pool0


SmartConnect DNS Overview:

       Resolvable: 2/3 nodes resolvable

Needing Attention: 1/3 nodes need attention

        SC Subnet: groupnet0.subnet0


Nodes:

              LNN: 1

SC DNS Resolvable: True

       Node State: Up

        IP Status: Has usable IPs

 Interface Status: 1/1 interfaces usable

Protocols Running: True

        Suspended: False

-----------------------------------------------------------------------

              LNN: 2

SC DNS Resolvable: True

       Node State: Up

        IP Status: Has usable IPs

 Interface Status: 1/1 interfaces usable

Protocols Running: True

        Suspended: False

-----------------------------------------------------------------------

              LNN: 3

SC DNS Resolvable: False

       Node State: Down

        IP Status: Doesn't have any usable IPs

 Interface Status: 0/1 interfaces usable

Protocols Running: False

        Suspended: False

-----------------------------------------------------------------------
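Since the output is a simple list of ‘key: value’ records per node, it lends itself to scripted checks. The following Python is a minimal sketch that parses an abbreviated copy of the output above and lists the nodes that are not resolvable. The parser is illustrative and assumes the layout shown here.

```python
SAMPLE = """\
Pool ID: groupnet0.subnet0.pool0
              LNN: 1
SC DNS Resolvable: True
       Node State: Up
-----------------------------------------------------------------------
              LNN: 3
SC DNS Resolvable: False
       Node State: Down
        IP Status: Doesn't have any usable IPs
"""


def parse_nodes(text: str) -> list:
    """Group 'key: value' lines into one dict per node, starting at each LNN line."""
    nodes, current = [], None
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if not sep:
            continue                      # separator or blank line
        key, value = key.strip(), value.strip()
        if key == "LNN":
            current = {"LNN": int(value)}
            nodes.append(current)
        elif current is not None:         # ignore header lines before the first node
            current[key] = value
    return nodes


unresolvable = [n["LNN"] for n in parse_nodes(SAMPLE)
                if n.get("SC DNS Resolvable") == "False"]
print("nodes needing attention:", unresolvable)  # nodes needing attention: [3]
```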

While this new diagnostic information in OneFS 9.4 is currently only available via the CLI, it will be added to the WebUI in a future release.

OneFS Healthcheck Auto-updates

Prior to OneFS 9.4, Healthchecks were frequently regarded by storage administrators as yet another patch that needs to be installed on a PowerScale cluster. As a result, their adoption was routinely postponed or ignored, potentially jeopardizing a cluster’s wellbeing. To address this, OneFS 9.4 introduces Healthcheck auto-updates, enabling new Healthchecks to be automatically downloaded and non-disruptively installed on a PowerScale cluster without any user intervention.

This new automated Healthcheck update framework helps accelerate the adoption of OneFS Healthchecks, by removing the need for manual checks, downloads and installation. In addition to reducing management overhead, the automated Healthchecks integrate with CloudIQ to update the cluster health score – further improving operational efficiency, while avoiding known issues affecting cluster availability.

Formerly known as Healthcheck patches, with OneFS 9.4 these are now renamed as Healthcheck definitions. The Healthcheck framework checks for updates to these definitions via Dell Secure Remote Services (SRS).

An auto-update configuration setting in the OneFS SRS framework controls whether the Healthcheck definitions are automatically downloaded and installed on a cluster. A OneFS platform API endpoint has been added to verify the Healthcheck version, and Healthchecks also optionally support OneFS compliance mode.

Healthcheck auto-update is enabled by default in OneFS 9.4, and is available for both existing and new clusters running 9.4, but can also be easily disabled from the CLI. If auto-update is on and SRS is enabled, the Healthcheck definition is downloaded to the desired staging location and then automatically and non-disruptively installed on the cluster. Any Healthcheck definitions that are automatically downloaded are signed and verified before being applied, to ensure their security and integrity.

So the Healthcheck auto-update execution process itself is as follows:

1. Auto-update queries the current Healthcheck version.

2. Checks Healthcheck definition availability via SRS.

3. Version comparison.

4. Downloads new Healthcheck definition package to the cluster.

5. Package is unpacked and installed.

6. Telemetry data is sent, and Healthcheck framework updated with new version.

On the cluster, the Healthcheck auto-update utility, ‘isi_healthcheck_update’, checks for a new package once a night by default. This Python script queries the cluster’s current Healthcheck definition version and the availability of new updates via SRS. It then compares the installed and available versions and, if a newer definition exists, downloads and installs it. Telemetry data is sent, and the /var/db/healthcheck_version.json file is created if it’s not already present. This JSON file is then updated with the new Healthcheck version information.
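The version check and version-file bookkeeping described above can be illustrated with a minimal Python sketch. This is not the actual ‘isi_healthcheck_update’ code: the function names are hypothetical, the dotted version format is assumed from the ‘32.0.3’ example, and a temporary file stands in for /var/db/healthcheck_version.json.

```python
import json
import tempfile
from pathlib import Path


def parse_version(text):
    """Turn a dotted version such as '32.0.3' into a comparable tuple (32, 0, 3)."""
    return tuple(int(part) for part in text.split("."))


def read_installed_version(version_file):
    """Return the recorded definition version, or None if the file does not exist yet."""
    if not version_file.exists():
        return None
    return json.loads(version_file.read_text())["version"]


def update_needed(installed, available):
    """An update is needed if nothing is recorded or the available version is higher."""
    return installed is None or parse_version(available) > parse_version(installed)


def record_version(version_file, version):
    """Create or overwrite the JSON version file with the newly installed version."""
    version_file.write_text(json.dumps({"version": version}))


# Demonstration against a temporary file (hypothetical stand-in for the real path).
with tempfile.TemporaryDirectory() as tmp:
    version_file = Path(tmp) / "healthcheck_version.json"
    available = "32.0.3"  # hypothetical version reported by the update source
    first_run = update_needed(read_installed_version(version_file), available)
    if first_run:
        record_version(version_file, available)
    second_run = update_needed(read_installed_version(version_file), available)
    recorded = read_installed_version(version_file)
```

Comparing versions as integer tuples rather than strings avoids ordering mistakes such as ‘32.0.10’ sorting before ‘32.0.3’.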

In order to configure and use the Healthcheck auto-update functionality, the following prerequisite steps are required:

  1. Upgrade cluster to OneFS 9.4 and commit the upgrade.
  2. In order to use the isi_healthcheck script, OneFS needs to be licensed and connected to the ESRS gateway. OneFS 9.4 also introduces a new option for ESRS, ‘SRS Download Enabled’, which must be set to ‘Yes’ (the default value) to allow the ‘isi_healthcheck_update’ utility to run. This can be done with the following syntax, in this example using 10.12.15.50 as the primary ESRS gateway:
# isi esrs modify --enabled=yes --primary-esrs-gateway=10.12.15.50 --srs-download-enabled=true

The ESRS configuration can be confirmed as follows:

# isi esrs view

                                    Enabled: Yes

                       Primary ESRS Gateway: 10.12.15.50

                     Secondary ESRS Gateway:

                        Alert on Disconnect: Yes

                       Gateway Access Pools: -

          Gateway Connectivity Check Period: 60

License Usage Intelligence Reporting Period: 86400

                           Download Enabled: No

                       SRS Download Enabled: Yes

          ESRS File Download Timeout Period: 50

           ESRS File Download Error Retries: 3

              ESRS File Download Chunk Size: 1000000

             ESRS Download Filesystem Limit: 80

        Offline Telemetry Collection Period: 7200

                Gateway Connectivity Status: Connected
  3. Next, the cluster is onboarded into CloudIQ via its web interface. This requires creating a site and then, from the ‘Add Product’ page, configuring the serial number of each node in the cluster along with the product type “ISILON_NODE” and site ID, and selecting ‘Submit’.

CloudIQ cluster onboarding typically takes a couple of hours and, when complete, the ‘Product Details’ page will show the ‘CloudIQ Status’, ‘ESRS Data’, and ‘CloudIQ Data’ fields as ‘Enabled’.

  4. Verify via cluster status that the cluster is available and connected in CloudIQ.

Once these pre-requisite steps are complete, auto-update can be enabled via the new ‘isi_healthcheck_update’ CLI command. For example, to enable:

# isi_healthcheck_update --enable

2022-05-02 22:21:27,310 - isi_healthcheck.auto_update - INFO - isi_healthcheck_update started

2022-05-02 22:21:27,513 - isi_healthcheck.auto_update - INFO - Enable autoupdate

Similarly, auto-update can also be easily disabled, either by:

# isi_healthcheck_update -s --disable

Or:

# isi esrs modify --srs-download-enabled=false

Auto-update also has the following gconfig global config options and default values:

# isi_gconfig -t healthcheck

Default values:

healthcheck_autoupdate.enabled (bool) = true

healthcheck_autoupdate.compliance_update (bool) = false

healthcheck_autoupdate.alerts (bool) = false

healthcheck_autoupdate.max_download_package_time (int) = 600

healthcheck_autoupdate.max_install_package_time (int) = 3600

healthcheck_autoupdate.number_of_failed_upgrades (int) = 0

healthcheck_autoupdate.last_failed_upgrade_package (char*) =

healthcheck_autoupdate.download_directory (char*) = /ifs/data/auto_upgrade_healthcheck/downloads

The isi_healthcheck_update utility is scheduled by cron and executed across all the nodes in the cluster, as follows:

# grep -i healthcheck /etc/crontab

# Nightly Healthcheck update

0       1       *       *       *       root    /usr/bin/isi_healthcheck_update -s

This default /etc/crontab entry executes auto-update once daily at 1am. However, this schedule can be adjusted to meet the needs of the local environment.

Auto-update checks for new package availability, compares the installed and available package versions, and downloads the new package if one is found. The package is then installed, telemetry data is sent, and the healthcheck_version.json file is updated with the new version.

After the Healthcheck update process has completed, the following CLI command can be used to view any automatically downloaded Healthcheck packages. For example:

# isi upgrade patches list

Patch Name               Description                                Status

-----------------------------------------------------------------------------

HealthCheck_9.4.0_32.0.3 [9.4.0 UHC 32.0.3] HealthCheck definition  Installed

-----------------------------------------------------------------------------

Total: 1

Additionally, viewing the json version file will also confirm this:

# cat /var/db/healthcheck_version.json

{"version": "32.0.3"}

In the unlikely event that auto-update runs into issues, the following troubleshooting steps can be of benefit:

  1. Confirm that Healthcheck auto-update is actually enabled:

Check the ESRS global config settings and verify they are set to ‘True’.

# isi_gconfig -t esrs esrs.enabled

esrs.enabled (bool) = true

# isi_gconfig -t esrs esrs.srs_download_enabled

esrs.srs_download_enabled (bool) = true

If not, run:

# isi_gconfig -t esrs esrs.enabled=true

# isi_gconfig -t esrs esrs.srs_download_enabled=true
  2. If an auto-update patch installation is not completed within 60 minutes, OneFS increments the unsuccessful installations counter for the current patch, and re-attempts installation the following day.
  3. If the unsuccessful installations counter exceeds 5 attempts, installation will be aborted. However, the following auto-update gconfig values can be reset to re-enable installation:
# isi_gconfig -t healthcheck healthcheck_autoupdate.last_failed_upgrade_package = ""

# isi_gconfig -t healthcheck healthcheck_autoupdate.number_of_failed_upgrades = 0
  4. In the event that a patch installation status is reported as ‘failed’, as below, the recommendation is to contact Dell Support to diagnose and resolve the issue:
# isi upgrade patches list

Patch Name               Description                                Status

-----------------------------------------------------------------------------

HealthCheck_9.4.0_32.0.3 [9.4.0 UHC 32.0.3] HealthCheck definition  Failed

-----------------------------------------------------------------------------

Total: 1

However, the following CLI command can be carefully used to repair the patch system by attempting to abort the most recent failed action:

# isi upgrade patches abort

The ‘isi upgrade archive --clear’ command stops the current upgrade and prevents it from being resumed:

# isi upgrade archive --clear

Once the upgrade status is reported as ‘unknown’ run:

# isi upgrade patches uninstall
  5. The ‘/var/log/isi_healthcheck.log’ file is also a useful source of detailed auto-update information.
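The retry behavior described in the troubleshooting steps can be modeled with a short Python sketch. This is an illustration only, not the OneFS implementation: the class and function names are hypothetical, the field names simply mirror the gconfig keys, and reading ‘exceeds 5 attempts’ as ‘counter greater than 5’ is an assumption.

```python
from dataclasses import dataclass

MAX_INSTALL_SECONDS = 3600   # mirrors the max_install_package_time default
MAX_FAILED_ATTEMPTS = 5      # threshold from the troubleshooting steps (assumed strict)


@dataclass
class AutoUpdateState:
    number_of_failed_upgrades: int = 0
    last_failed_upgrade_package: str = ""


def should_attempt(state, package):
    """Skip a package only once its failure counter has exceeded the threshold."""
    if state.last_failed_upgrade_package != package:
        return True
    return state.number_of_failed_upgrades <= MAX_FAILED_ATTEMPTS


def record_attempt(state, package, elapsed_seconds):
    """Update the counters after one nightly attempt and return True on success."""
    if elapsed_seconds <= MAX_INSTALL_SECONDS:
        state.number_of_failed_upgrades = 0
        state.last_failed_upgrade_package = ""
        return True
    if state.last_failed_upgrade_package != package:
        # A different package failed previously, so start counting afresh.
        state.number_of_failed_upgrades = 0
        state.last_failed_upgrade_package = package
    state.number_of_failed_upgrades += 1
    return False


# Simulate an install that times out (4000 seconds) every night.
state = AutoUpdateState()
package = "HealthCheck_9.4.0_32.0.3"
nights_attempted = 0
while should_attempt(state, package):
    record_attempt(state, package, elapsed_seconds=4000)
    nights_attempted += 1
blocked = not should_attempt(state, package)

# Resetting both values, as the gconfig commands do, allows installation again.
state.number_of_failed_upgrades = 0
state.last_failed_upgrade_package = ""
retry_allowed = should_attempt(state, package)
```

Tracking the failing package name alongside the counter means that a newer definition package is not penalized for failures of an older one.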

OneFS Signed Upgrades

Introduced as part of the OneFS 9.4 security enhancements, signed upgrades help maintain system integrity by preventing a cluster from being compromised by the installation of maliciously modified upgrade packages. This is required by several industry security compliance mandates, such as the DoD Network Device Management Security Requirements Guide, which stipulates “The network device must prevent the installation of patches, service packs, or application components without verification the software component has been digitally signed using a certificate that is recognized and approved by the organization”.

With this new OneFS 9.4 signed upgrade functionality, all packages must be cryptographically signed before they can be installed. This applies to all upgrade types including core OneFS, patches, cluster firmware, and drive firmware. The underlying components that comprise this feature include an updated .isi format for all package types plus a new OneFS Catalog to store the verified packages. In OneFS 9.4, the actual upgrades themselves are still performed via either the CLI or WebUI, and are very similar to previous versions.

Under the hood, the new signed upgrade process works as follows:

The primary change is that, in OneFS 9.4, everything goes through the catalog, which comprises four basic components. There’s a small SQLite database that tracks metadata, a library which has the basic logic for the catalog, the signature library based around OpenSSL which handles all of the verification, and a couple of directories to store the verified packages.

With signed upgrades, there’s a single file to download that contains the upgrade package, README text, and all signature data, and no file unpacking required.

The .isi file format is as follows:

A ‘readme’ text file can be incorporated directly in the second region of the package file, providing instructions, version compatibility requirements, etc.

The first region, which contains the main package data, is also compatible with previous OneFS versions that don’t support the .isi format. This allows a signed firmware or DSP package to be installed on OneFS 9.3 and earlier.

The new OneFS catalog provides a secure place to store verified .isi packages, and only the root account has direct access. The catalog itself is stored at /ifs/.ifsvar/catalog and all maintenance and interaction is via the ‘isi upgrade catalog’ CLI command set. The contents, or artifacts, of the catalog each have an ID which corresponds to the SHA256 hash of the file.

Any user account with ISI_PRIV_SYS_UPGRADE privilege can perform the following catalog-related actions, expressed as flags to the ‘isi upgrade catalog’ command:

Action Description
Clean Remove any catalog artifact files that have no database entries
Export Save a catalog item to a user specified file location
Import Verify and add a new .isi package file into the catalog
List List packages in the catalog
Readme Display the README text from a catalog item or .isi package file
Remove Manually remove a package from the catalog
Repair Re-verify all catalog packages and rebuild the database
Verify Verify the signature of a catalog item or .isi package file

Package verification leverages the OneFS OpenSSL library, which enables a SHA256 hash of the manifest to be verified against the certificate. As part of this process, the chain-of-trust for the included certificate is compared with the contents of the /etc/ssl/certs directory, and the distinguished name on the certificate is checked against the /etc/upgrade/identities file. Finally, the SHA256 hash of each data region is compared against the values from the manifest.
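The final step, comparing data region hashes against the manifest, can be sketched in Python as follows. This covers only the hash comparison: signature and chain-of-trust checking are omitted, and representing the manifest as a simple list of hex digests is an assumption made for illustration, since the real manifest layout is not described here.

```python
import hashlib
import hmac


def sha256_hex(data):
    """Return the SHA256 digest of a byte string as lowercase hex."""
    return hashlib.sha256(data).hexdigest()


def regions_match_manifest(regions, manifest_hashes):
    """Return True only if every region hashes to the value the manifest lists for it."""
    if len(regions) != len(manifest_hashes):
        return False
    return all(
        hmac.compare_digest(sha256_hex(region), expected)
        for region, expected in zip(regions, manifest_hashes)
    )


# Hypothetical two-region package: main payload plus README text.
regions = [b"package payload", b"README text"]
manifest_hashes = [sha256_hex(region) for region in regions]

intact = regions_match_manifest(regions, manifest_hashes)
tampered = regions_match_manifest([b"package payloaX", b"README text"], manifest_hashes)
```

Because the manifest itself is covered by the signature, a package whose regions match the manifest hashes can be trusted to the same degree as the signing certificate.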

The signature can be checked using the ‘isi upgrade catalog verify’ command. For example:

# isi upgrade catalog verify --file /ifs/install.isi

Item             Verified

--------------------------

/ifs/install.isi True

--------------------------

Total: 1

Additional install image details are available via the ‘isi_packager view’ command:

# isi_packager view --package /ifs/install.isi

== Region 1 ==

Type: OneFS Install Image

Name: OneFS_Install_0x90500B000000AC8_B_MAIN_2760(RELEASE)

Hash: ef7926cfe2255d7a620eb4557a17f7650314ce1788c623046929516d2d672304

Size: 397666098

== Footer Details ==

Format Version: 1

 Manifest Size: 296

Signature Size: 2838

Timestamp Size: 1495

 Manifest Hash: 066f5d6e6b12081d3643060f33d1a25fe3c13c1d13807f49f51475a9fc9fd191

Signature Hash: 5be88d23ac249e6a07c2c169219f4f663220d4985e58b16be793936053a563a3

Timestamp Hash: eca62a3c7c3f503ca38b5daf67d6be9d57c4fadbfd04dbc7c5d7f1ff80f9d948

== Signature Details ==

Fingerprint:     33fba394a5a0ebb11e8224a30627d3cd91985ccd

Issuer:          ISLN

Subject:         US / WA / Sea / Isln OneFS.

Organization:    Isln Powerscale OneFS

Expiration:      2022-09-07 22:00:22

Ext Key Usage:   codesigning

Packages in the catalog can be listed as follows:

# isi upgrade catalog list

ID    Type  Description                                               README

-----------------------------------------------------------------------------

cdb88 OneFS OneFS 9.4.0.0_build(2797)style(11) / B_MAIN_2797(RELEASE) -

3a145 DSP   Drive_Support_v1.39.1                                    Included

840b8 Patch HealthCheck_9.2.1_2021-09                                Included

aa19b Patch 9.3.0.2_GA-RUP_2021-12_PSP-1643                          Included

-----------------------------------------------------------------------------

Total: 4

Note that the package ID comprises the first few characters of the package’s SHA256 hash.
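As a rough illustration of how a short ID relates to the full hash, the following Python sketch derives an artifact ID from file contents and resolves a short prefix back to a full ID. The function names and sample payloads are hypothetical, and the five-character length is simply what the listings above display.

```python
import hashlib

ID_PREFIX_LENGTH = 5  # the catalog listings above show five-character IDs


def artifact_id(package_bytes):
    """Full artifact ID: the SHA256 hash of the package file contents."""
    return hashlib.sha256(package_bytes).hexdigest()


def short_id(full_id, length=ID_PREFIX_LENGTH):
    """Abbreviated ID as displayed in a listing."""
    return full_id[:length]


def resolve_prefix(prefix, full_ids):
    """Map a short ID back to exactly one full ID, rejecting ambiguous prefixes."""
    matches = [full_id for full_id in full_ids if full_id.startswith(prefix)]
    if len(matches) != 1:
        raise LookupError(f"prefix {prefix!r} matched {len(matches)} artifacts")
    return matches[0]


# Hypothetical package contents standing in for real .isi files.
packages = {
    "example-dsp": b"hypothetical DSP payload",
    "example-patch": b"hypothetical patch payload",
}
full_ids = [artifact_id(data) for data in packages.values()]
listing = [short_id(full_id) for full_id in full_ids]
resolved = resolve_prefix(listing[0], full_ids)
```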

Packages are automatically imported when used, and verified upon import. Verification and import can also be performed manually, if desired:

# isi upgrade catalog verify --file Drive_Support_v1.39.1.isi

Item                                      Verified

-------------------------------------------------

/ifs/packages/Drive_Support_v1.39.1.isi   True

-------------------------------------------------

# isi upgrade catalog import Drive_Support_v1.39.1.isi

Packages can also be exported from the catalog and copied to another cluster, for example. Generally, exported packages can be re-imported, too.

# isi upgrade catalog list

ID    Type  Description                                               README

-----------------------------------------------------------------------------

00b9c OneFS OneFS 9.4.0.0_build(2625)style(11) / B_MAIN_2625(RELEASE) -

3a145 DSP   Drive_Support_v1.39.1                                    Included

-----------------------------------------------------------------------------

Total: 5

# isi upgrade catalog export --id 3a145 --file /ifs/Drive_Support_v1.39.1.isi

However, auto-generated OneFS images cannot be reimported.

The README column of the ‘isi upgrade catalog list’ output indicates whether release notes are included for a .isi file or catalog item. If available, these can be viewed as follows:

# isi upgrade catalog readme --file HealthCheck_9.2.1_2021-09.isi | less

Updated: September 02, 2021

*****************************************************************************

HealthCheck_9.2.1_2021-09: Patch for OneFS 9.2.1.x.

This patch contains the 2021-09 RUP for the Isilon HealthCheck System

*****************************************************************************

This patch can be installed on clusters running the following OneFS version:

* 9.2.1.x

:

Within a readme file, details typically include a short description of the artifact, and also the minimum OneFS version the cluster is required to be running for installation.

Cleanup of patches and OneFS images is performed automatically upon commit, and any installed packages require the artifact to be present in the catalog for successful uninstall. Similarly, the committed OneFS image is required for both patch removal and cluster expansion via node addition.

Artifacts can be removed manually as follows:

# isi upgrade catalog remove --id 840b8

This will remove the specified artifact and all related metadata.

Are you sure? (yes/[no]): yes

However, always use caution when manually removing a package.

When it comes to catalog housekeeping, the ‘clean’ function will remove any catalog artifact files without database entries, although normally this happens automatically when an item is removed.

# isi upgrade catalog clean

This will remove any artifacts that do not have associated metadata in the database.

Are you sure? (yes/[no]): yes

Additionally, the catalog ‘repair’ function will rebuild the database and re-import all valid items, as well as re-verifying their signatures:

# isi upgrade catalog repair

This will attempt to repair the catalog directory. This will result in all stored artifacts being re-verified. Artifacts that fail to be verified will be deleted. Additionally, a new catalog directory will be initialized with the remaining artifacts.

Are you sure? (yes/[no]): yes

When installing a signed upgrade, patch, firmware or drive support package (DSP) on a cluster running OneFS 9.4, the command syntax used is fundamentally the same as in prior OneFS versions, with only the file extension itself having changed. The actual install file will have the ‘.isi’ extension, and the file containing the hash value for download verification will have a ‘.isi.sha256’ suffix. For example, take the OneFS 9.4 install files:

  • 4.0.0_Install.isi
  • 4.0.0_Install.isi.sha256

The following syntax can be used to initiate a parallel OneFS signed upgrade:

# isi upgrade start --install-image-path /ifs/install.isi --parallel

Alternatively, if the desired upgrade image package is already in the catalog, it can be installed using the ‘--install-image-id’ flag instead:

# isi upgrade start --install-image-id 00b9c --parallel

Or to upgrade a cluster’s firmware:

# isi upgrade firmware start --fw-pkg /ifs/IsiFw_Package_v10.3.7.isi --rolling

And upgrading a cluster’s firmware using the ID of a package that’s in the catalog:

# isi upgrade firmware start --fw-pkg-id cf01b --rolling

To initiate a simultaneous upgrade of a patch:

# isi upgrade patches install --patch /ifs/patch.isi --simultaneous

And finally, to initiate a simultaneous upgrade of a drive firmware package:

# isi_dsp_install Drive_Support_v1.39.1.isi

Note that patches and drive support firmware are not currently able to be installed by their package IDs.

The current version of node firmware that a cluster is running can be determined by viewing the contents of the isi_hwmon file, located in a log set under each node’s root path, i.e. <node-lnn>/isi_hwmon:

--- FirmwareCheck Diagnostics ---

Packages:

IsiFw_Package_v11.5.1.tar

Drive_Support_v1.41.1.tgz

Similarly, there is a JSON file in a cluster’s log gather that holds all the firmware versions of the components on the system, in addition to the firmware package used to install the firmware. This file is located on each node under the path:

<node-lnn>/upgrade_local.tar/var/ifs/upgrade/firmware_status.json

A committed upgrade image from the previous OneFS upgrade is automatically saved in the catalog, and also created automatically when a new cluster is configured. This image is required for new node joins, as well as when uninstalling patches. However, it’s worth noting that auto-created images will not have a signature and, while they may be exported, they cannot be re-imported back into the catalog.

In the event that the committed upgrade image is missing, CELOG events will be generated and the ‘isi upgrade catalog repair’ command output will display an error. Additionally, when it comes to troubleshooting the signed upgrade process, it can pay to check both /var/log/messages and /var/log/isi_papi_d.log, as well as the OneFS upgrade logs.

OneFS Data Reduction and Efficiency Reporting

One of the objectives of OneFS data reduction and efficiency reporting is to provide ‘industry standard’ statistics, allowing easier comprehension of cluster efficiency. It’s an ongoing process, and prior to OneFS 9.2 there was limited tracking of certain filesystem statistics – particularly application physical and filesystem logical – which meant that data reduction and storage efficiency ratios had to be estimated. This is no longer the case, and OneFS 9.2 and later provide accurate data reduction and efficiency metrics at per-file, quota, and cluster-wide granularity.

The following table provides descriptions for the various OneFS reporting metrics, while also attempting to rationalize their naming conventions with other general industry terminology:

OneFS Metric | Also Known As | Description
Protected logical | Application logical | Data size including sparse data, zero block eliminated data, and CloudPools data stubbed to a cloud tier.
Logical data | Effective; Filesystem logical | Data size excluding protection overhead and sparse data, and including data efficiency savings (compression and deduplication).
Zero-removal saved | - | Capacity savings from zero removal.
Dedupe saved | - | Capacity savings from deduplication.
Compression saved | - | Capacity savings from in-line compression.
Preprotected physical | Usable; Application physical | Data size excluding protection overhead and including storage efficiency savings.
Protection overhead | - | Size of erasure coding used to protect data.
Protected physical | Raw; Filesystem physical | Total footprint of data including protection overhead (FEC erasure coding) and excluding data efficiency savings (compression and deduplication).
Dedupe ratio | - | Deduplication ratio. Will be displayed as 1.0:1 if there are no deduplicated blocks on the cluster.
Compression ratio | - | Usable reduction ratio from compression, calculated by dividing ‘logical data’ by ‘preprotected physical’ and expressed as x:1.
Inlined data ratio | - | Efficiency ratio from storing small files’ data within their inodes, thereby not requiring any data or protection blocks for their storage.
Data reduction ratio | Effective to Usable | Usable efficiency ratio from compression and deduplication. Will display the same value as the compression ratio if there is no deduplication on the cluster.
Efficiency ratio | Effective to Raw | Overall raw efficiency ratio expressed as x:1.

So let’s take these metrics and look at what they represent and how they’re calculated.

  • Application logical, or protected logical, is the application data that can be written to the cluster, irrespective of where it’s stored.

  • Removing the sparse data from application logical results in filesystem logical, also known simply as logical data or effective. This can be data that was always sparse, was zero block eliminated, or data that has been tiered off-cluster via CloudPools, etc.

Note that filesystem logical was not accurately tracked in releases prior to OneFS 9.2, so metrics prior to this were somewhat estimated.

  • Next, data reduction techniques such as compression and deduplication further reduce filesystem logical to application physical, or pre-protected physical. This is the physical size of the application data residing on the filesystem drives, and does not include metadata, protection overhead, or data moved to the cloud.

  • Filesystem physical is application physical with data protection overhead added – including inode, mirroring and FEC blocks, etc. Filesystem physical is also referred to as protected physical.

  • The data reduction ratio is the amount that’s been reduced from the filesystem logical down to the application physical.

  • Finally, the storage efficiency ratio is the filesystem logical divided by the filesystem physical.

With the enhanced data reduction reporting in OneFS 9.2 and later, the actual statistics themselves are largely the same, just calculated more accurately.

The storage efficiency data was available in releases prior to OneFS 9.2, albeit somewhat estimated, but the data reduction metrics were introduced with OneFS 9.2.

The following tools are available to query these reduction and efficiency metrics at the file, quota, and cluster-wide granularity:

Realm | OneFS Command | OneFS Platform API
File | isi get -D | -
Quota | isi quota list -v | 12/quota/quotas
Cluster-wide | isi statistics data-reduction | 1/statistics/current?key=cluster.data.reduce.*
Detailed Cluster-wide | isi_cstats | 1/statistics/current?key=cluster.cstats.*

Note that the ‘isi_cstats’ CLI command provides some additional, behind-the-scenes details. The interface goes through platform API to fetch these stats.

The ‘isi statistics data-reduction’ CLI command is the most comprehensive of the data reduction reporting CLI utilities. For example:

# isi statistics data-reduction

                      Recent Writes Cluster Data Reduction

                           (5 mins)

--------------------- ------------- ----------------------

Logical data                  6.18M                  6.02T

Zero-removal saved                0                      -

Deduplication saved          56.00k                  3.65T

Compression saved             4.16M                  1.96G

Preprotected physical         1.96M                  2.37T

Protection overhead           5.86M                910.76G

Protected physical            7.82M                  3.40T

Zero removal ratio         1.00 : 1                      -

Deduplication ratio        1.01 : 1               2.54 : 1

Compression ratio          3.12 : 1               1.02 : 1

Data reduction ratio       3.15 : 1               2.54 : 1

Inlined data ratio         1.04 : 1               1.00 : 1

Efficiency ratio           0.79 : 1               1.77 : 1

--------------------- ------------- ----------------------


The ‘recent writes’ data to the left of the output provides precise statistics for the five-minute period prior to running the command. By contrast, the ‘cluster data reduction’ metrics on the right of the output are slightly less real-time but reflect the overall data and efficiencies across the cluster. Be aware that, in OneFS 9.1 and earlier, the right-hand column metrics are designated by the ‘Est’ prefix, denoting an estimated value. However, in OneFS 9.2 and later, the ‘logical data’ and ‘preprotected physical’ metrics are tracked and reported accurately, rather than estimated.

The ratio data in each column is calculated from the values above it. For instance, to calculate the data reduction ratio, the ‘logical data’ (effective) is divided by the ‘preprotected physical’ (usable) value. From the output above, this would be:

6.02 / 2.37 = 2.54             Or a Data Reduction ratio of 2.54:1

Similarly, the ‘efficiency ratio’ is calculated by dividing the ‘logical data’ (effective) by the ‘protected physical’ (raw) value. From the output above, this yields:

6.02 / 3.40 = 1.77             Or an Efficiency ratio of 1.77:1
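The same arithmetic can be reproduced with a few lines of Python, using the values from the sample ‘isi statistics data-reduction’ output above. The helper and variable names are illustrative only, and the inputs are the rounded figures displayed by the CLI rather than the exact byte counts OneFS tracks internally.

```python
def ratio(numerator, denominator):
    """Return numerator / denominator rounded to two places, as the CLI displays it."""
    return round(numerator / denominator, 2)


def as_x_to_1(value):
    """Format a ratio in the 'x : 1' style used in the CLI output."""
    return f"{value:.2f} : 1"


# Cluster-wide column (values in TB).
cluster_reduction = ratio(6.02, 2.37)   # logical data / preprotected physical
cluster_efficiency = ratio(6.02, 3.40)  # logical data / protected physical

# Recent writes column (values in MB).
recent_reduction = ratio(6.18, 1.96)
recent_efficiency = ratio(6.18, 7.82)

print("Data reduction ratio:", as_x_to_1(recent_reduction), "|", as_x_to_1(cluster_reduction))
print("Efficiency ratio:    ", as_x_to_1(recent_efficiency), "|", as_x_to_1(cluster_efficiency))
```

An efficiency ratio below 1, as in the recent writes column, indicates that protection overhead outweighed the data reduction savings for that data.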

OneFS SmartQuotas reports the capacity saving from in-line data reduction as a storage efficiency ratio. SmartQuotas reports efficiency as a ratio across the desired data set as specified in the quota path field. The efficiency ratio is for the full quota directory and its contents, including any overhead, and reflects the net efficiency of compression and deduplication. On a cluster with licensed and configured SmartQuotas, this efficiency ratio can be easily viewed from the WebUI by navigating to ‘File System > SmartQuotas > Quotas and Usage’. In OneFS 9.2 and later, in addition to the storage efficiency ratio, the data reduction ratio is also displayed.

Similarly, the same data can be accessed from the OneFS command line via the ‘isi quota quotas list’ CLI command. For example:

# isi quota quotas list

Type      AppliesTo  Path  Snap  Hard  Soft  Adv  Used  Reduction  Efficiency

------------------------------------------------------------------------------

directory DEFAULT    /ifs  No    -     -     -    6.02T 2.54 : 1   1.77 : 1

------------------------------------------------------------------------------

Total: 1

More detail, including both the physical (raw) and logical (effective) data capacities, is also available via the ‘isi quota quotas view <path> <type>’ CLI command. For example:

# isi quota quotas view /ifs directory

                        Path: /ifs

                        Type: directory

                   Snapshots: No

                    Enforced: No

                   Container: No

                      Linked: No

                       Usage

                           Files: 5759676

         Physical(With Overhead): 6.93T

        FSPhysical(Deduplicated): 3.41T

         FSLogical(W/O Overhead): 6.02T

        AppLogical(ApparentSize): 6.01T

                   ShadowLogical: -

                    PhysicalData: 2.01T

                      Protection: 781.34G

     Reduction(Logical/Data): 2.54 : 1

Efficiency(Logical/Physical): 1.77 : 1

To configure SmartQuotas for in-line data efficiency reporting, create a directory quota at the top-level file system directory of interest, for example /ifs. Creating and configuring a directory quota is a simple procedure and can be performed from the WebUI by navigating to ‘File System > SmartQuotas > Quotas and Usage’ and selecting ‘Create a Quota’. In the create pane, set the Quota type to ‘Directory quota’, add the preferred top-level path to report on, select ‘application logical size’ for Quota Accounting, and set the Quota Limits to ‘Track storage without specifying a storage limit’. Finally, select the ‘Create Quota’ button to confirm the configuration and activate the new directory quota.

The efficiency ratio is a single, point-in-time efficiency metric that is calculated per quota directory and includes the sum of in-line compression, zero block removal, in-line dedupe and SmartDedupe. This is in contrast to a history of stats over time, as reported in the ‘isi statistics data-reduction’ CLI command output, described above. As such, the efficiency ratio for the entire quota directory will reflect what is actually there.

OneFS Inline Dedupe

Among the features and functionality delivered in the new OneFS 9.4 release is the promotion of inline dedupe to enabled by default, further enhancing PowerScale’s dollar per TB economics, rack density, and value.

Part of the OneFS data reduction suite, inline dedupe initially debuted in OneFS 8.2.1. Until now, however, it needed to be manually enabled, so customers often simply didn’t use it. With this enhancement, new clusters running OneFS 9.4 will have inline dedupe on by default.

Cluster Configuration | Inline Dedupe | Inline Compression
New cluster running OneFS 9.4 | Enabled | Enabled
New cluster running OneFS 9.3 or earlier | Disabled | Enabled
Cluster with inline dedupe enabled that is upgraded to OneFS 9.4 | Enabled | Enabled
Cluster with inline dedupe disabled that is upgraded to OneFS 9.4 | Disabled | Enabled

That said, any clusters that upgrade to 9.4 will not see any change to their current inline dedupe config during upgrade. Additionally, there is also no change to the behavior for inline compression, which remains enabled by default in all OneFS versions from 8.1.3 onwards.

But before we examine the under the hood changes in OneFS 9.4, first, a quick dedupe refresher.

Currently OneFS inline data reduction, which encompasses compression, dedupe, and zero block removal, is supported on the F900, F600, F200 all-flash nodes, plus the F810, H5600, H700/7000, and A300/3000 Gen6.x chassis.

Within the OneFS data reduction pipeline, zero block removal is performed first, followed by dedupe, and then compression. This order allows each phase to reduce the scope of work for each subsequent phase.

Unlike SmartDedupe, which performs deduplication once data has been written to disk, or post-process, inline dedupe acts in real time, deduplicating data as it is ingested into the cluster. Storage efficiency is achieved by scanning the data for identical blocks as it is received and then eliminating the duplicates.

When inline dedupe discovers a duplicate block, it moves a single copy of the block to a special set of files known as shadow stores. These are file system containers that allow data to be stored in a sharable manner. As such, files stored under OneFS can contain both physical data and pointers, or references, to shared blocks in shadow stores.

Shadow stores are similar to regular files but are hidden from the file system namespace, so cannot be accessed via a pathname. A shadow store typically grows to a maximum size of 2GB, which is around 256K blocks, with each block able to be referenced by 32,000 files. If the reference count limit is reached, a new block is allocated, which may or may not be in the same shadow store. Additionally, shadow stores do not reference other shadow stores. And snapshots of shadow stores are not permitted because the data contained in shadow stores cannot be overwritten.
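The shadow store figures above can be checked with some quick arithmetic. The sketch below assumes the 8KB block size that inline dedupe operates on; the variable names are illustrative, and the last figure is only a theoretical upper bound rather than a documented OneFS limit.

```python
BLOCK_SIZE = 8 * 1024               # 8KB block (assumed shadow store block size)
SHADOW_STORE_MAX = 2 * 1024 ** 3    # 2GB typical maximum shadow store size
MAX_REFS_PER_BLOCK = 32_000         # files that can reference a single shadow block

blocks_per_store = SHADOW_STORE_MAX // BLOCK_SIZE
max_refs_per_store = blocks_per_store * MAX_REFS_PER_BLOCK

print(f"Blocks per shadow store: {blocks_per_store} ({blocks_per_store // 1024}K)")
print(f"Upper bound on references into one full store: {max_refs_per_store:,}")
```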

When a client writes a file to a node pool configured for inline dedupe on a cluster, the write operation is divided up into whole 8KB blocks. Each of these blocks is then hashed and its ‘fingerprint’ compared against an in-memory index for a match. At this point, one of the following will happen:

  1. If a match is discovered with an existing shadow store block, a byte-by-byte comparison is performed. If the comparison is successful, the data is removed from the current write operation and replaced with a shadow reference.
  2. When a match is found with another LIN, the data is written to a shadow store instead and replaced with a shadow reference. Next, a work request is generated and queued that includes the location for the new shadow store block, the matching LIN and block, and the data hash. A byte-by-byte data comparison is performed to verify the match and the request is then processed.
  3. If no match is found, the data is written to the file natively and the hash for the block is added to the in-memory index.
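The three outcomes above can be sketched as a toy model. This is a simplified illustration, not the OneFS implementation: the class and method names are hypothetical, the byte-by-byte check is done synchronously rather than via a queued work request, and blake2b stands in for the real hash purely because it ships with Python.

```python
import hashlib

BLOCK_SIZE = 8 * 1024  # whole 8KB blocks, as described above

def fingerprint(block: bytes) -> bytes:
    # Stand-in 128-bit hash; NOT the algorithm OneFS uses.
    return hashlib.blake2b(block, digest_size=16).digest()

class InlineDedupeIndex:
    """Toy in-memory index: fingerprint -> ("shadow" | "lin", block data)."""

    def __init__(self):
        self.index = {}

    def write_block(self, block: bytes) -> str:
        fp = fingerprint(block)
        entry = self.index.get(fp)
        if entry is None:
            # Case 3: no match. Write natively and remember the hash.
            self.index[fp] = ("lin", block)
            return "written natively"
        kind, existing = entry
        if existing != block:
            # Byte-by-byte check failed (hash collision): treat as no match.
            return "written natively"
        if kind == "shadow":
            # Case 1: verified match with an existing shadow store block.
            return "replaced with shadow reference"
        # Case 2: match with a block in another file (LIN).
        self.index[fp] = ("shadow", block)
        return "written to shadow store, work request queued"
```

Writing the same block three times through `write_block()` walks through cases 3, 2, and 1 in that order, which mirrors how a block graduates from a native write to a shared shadow store block.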

In order for inline dedupe to be performed on a write operation, the following conditions need to be true:

  • Inline dedupe must be globally enabled on the cluster.
  • The current operation is writing data (i.e., not a truncate or write zero operation).
  • The ‘no_dedupe’ flag is not set on the file.
  • The file is not a special file type, such as an alternate data stream (ADS) or an EC (endurant cache) file.
  • Write data includes fully overwritten and aligned blocks.
  • The write is not part of a ‘rehydrate’ operation.
  • The file has not been packed (containerized) by SFSE (small file storage efficiency).

OneFS inline dedupe uses the 128-bit CityHash algorithm, which is fast and distributes hashes well, although it is not, strictly speaking, a cryptographic hash. This contrasts with OneFS’ post-process SmartDedupe, which uses SHA-1 hashing.

Each node in a cluster with inline dedupe enabled has its own in-memory hash index that it compares block ‘fingerprints’ against. The index lives in system RAM and is allocated using physically contiguous pages and accessed directly with physical addresses. This avoids the need to traverse virtual memory mappings and does not incur the cost of translation lookaside buffer (TLB) misses, minimizing dedupe performance impact.

The maximum size of the hash index is governed by a pair of sysctl settings, one of which caps the size at 16GB, and the other which limits the maximum size to 10% of total RAM. The stricter of these two constraints applies. While these settings are configurable, the recommended best practice is to use the default configuration. Any changes to these settings should only be performed under the supervision of Dell support.
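To illustrate how the two limits interact, here is a minimal sketch. The function and parameter names are hypothetical (the actual sysctl names are not given here); only the 16GB cap and the 10% figure come from the text.

```python
GIB = 1024**3

def max_index_bytes(total_ram_bytes, abs_cap_bytes=16 * GIB, ram_percent=10):
    # The stricter (smaller) of the two limits applies.
    return min(abs_cap_bytes, total_ram_bytes * ram_percent // 100)

# A node with 64GB of RAM is bound by the 10% rule,
# whereas a node with 384GB of RAM is bound by the 16GB cap.
small = max_index_bytes(64 * GIB)
large = max_index_bytes(384 * GIB)
```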

Since inline dedupe and SmartDedupe use different hashing algorithms, the indexes for each are not shared directly. However, the work performed by each dedupe solution can be leveraged by each other.  For instance, if SmartDedupe writes data to a shadow store, when those blocks are read, the read hashing component of inline dedupe will see those blocks and index them.

When a match is found, inline dedupe performs a byte-by-byte comparison of each block to be shared to avoid the potential for a hash collision. Data is prefetched prior to the byte-by-byte check and then compared against the L1 cache buffer directly, avoiding unnecessary data copies and adding minimal overhead. Once the matching blocks have been compared and verified as identical, they are then shared by writing the matching data to a common shadow store and creating references from the original files to this shadow store.

Inline dedupe samples every whole block written and handles each block independently, so it can aggressively locate duplicate blocks. If a contiguous run of matching blocks is detected, inline dedupe will merge the results into regions and process them efficiently.
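The idea of merging a contiguous run of matching blocks into regions can be sketched as follows. This is only an illustration of the general technique; the function name and the (start, length) representation are assumptions, not OneFS internals.

```python
def merge_into_regions(matching_blocks):
    """Collapse block numbers into (start, length) runs of contiguous blocks."""
    regions = []
    for lbn in sorted(set(matching_blocks)):
        if regions and lbn == regions[-1][0] + regions[-1][1]:
            # Block directly follows the previous run: extend it.
            start, length = regions[-1]
            regions[-1] = (start, length + 1)
        else:
            # Gap found: start a new region.
            regions.append((lbn, 1))
    return regions

print(merge_into_regions([4, 5, 6, 10, 12, 13]))  # [(4, 3), (10, 1), (12, 2)]
```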

Inline dedupe also detects dedupe opportunities from the read path: blocks are hashed as they are read into L1 cache and inserted into the index. If an entry already exists for that hash, inline dedupe knows there is a block sharing opportunity between the block it just read and the one previously indexed. It combines that information and queues a request to an asynchronous dedupe worker thread. As such, it is possible to deduplicate a data set purely by reading it all. To help mitigate the performance impact, the hashing is performed out-of-band in the prefetch path, rather than in the latency-sensitive read path.

The original inline dedupe control path had a limitation: it provided no gconfig control setting to represent inline dedupe being disabled by default. In OneFS 9.4, two separate mechanisms work together to distinguish between a new cluster and an upgrade of an existing cluster. The first applies upon upgrade to 9.4: if no inline dedupe config is present on the existing cluster, the upgrade explicitly sets it to disabled in gconfig. This has no effect on the cluster’s behavior, since inline dedupe was already disabled. Similarly, if the upgrading cluster already has an inline dedupe setting in gconfig, OneFS takes no action.

The other half of the functionality is that, when booting OneFS 9.4, a node looks in gconfig to see if there’s an inline dedupe setting. If no config is present, OneFS enables it by default. Therefore new OneFS 9.4 clusters automatically enable dedupe, and existing clusters retain their legacy setting upon upgrade.
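The interaction between the upgrade step and the boot-time check can be modeled in a few lines. This is a sketch of the logic as described, with gconfig represented as a plain dictionary; the key name `inline_dedupe` and both function names are hypothetical.

```python
def upgrade_to_9_4(gconfig: dict) -> dict:
    # Upgrade step: if nothing was ever configured, pin the legacy default.
    cfg = dict(gconfig)
    cfg.setdefault("inline_dedupe", "disabled")
    return cfg

def effective_mode_at_boot(gconfig: dict) -> str:
    # Boot step on 9.4: an absent setting means enabled by default.
    return gconfig.get("inline_dedupe", "enabled")

# New 9.4 cluster: no setting present, so dedupe is enabled.
new_cluster = effective_mode_at_boot({})
# Upgraded cluster with no prior setting: explicitly disabled by the upgrade.
upgraded = effective_mode_at_boot(upgrade_to_9_4({}))
```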

Since inline dedupe’s configuration is binary, either on or off across a cluster, it can easily be controlled manually via the OneFS command line interface (CLI). As such, the ‘isi dedupe inline settings modify’ CLI command can be used to either enable or disable dedupe at will, whether before, during, or after the upgrade.

For example, inline dedupe can be globally disabled and verified via the following CLI command:

# isi dedupe inline settings view

Mode: enabled

# isi dedupe inline settings modify --mode disabled

# isi dedupe inline settings view

Mode: disabled

Similarly, the following syntax will enable inline dedupe:

# isi dedupe inline settings view

Mode: disabled

# isi dedupe inline settings modify --mode enabled

# isi dedupe inline settings view

Mode: enabled

While there are no visible userspace changes when files are deduplicated, if deduplication has occurred, both the ‘disk usage’ and the ‘physical blocks’ metrics reported by the ‘isi get -DD’ CLI command will be reduced. Additionally, at the bottom of the command’s output, the logical block statistics will report the number of shadow blocks. For example:

Metatree logical blocks:   zero=260814 shadow=362 ditto=0 prealloc=0 block=2 compressed=0
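If this line needs to be checked programmatically, for instance to confirm that a file has a non-zero shadow block count, it can be parsed with a short script. The sketch below assumes the ‘key=value’ layout shown in the sample line above; the function name is illustrative.

```python
import re

line = ("Metatree logical blocks:   zero=260814 shadow=362 ditto=0 "
        "prealloc=0 block=2 compressed=0")

def parse_logical_blocks(text):
    # Collect every key=value pair on the line into a dictionary of integers.
    return {key: int(value) for key, value in re.findall(r"(\w+)=(\d+)", text)}

counts = parse_logical_blocks(line)
print(counts["shadow"])  # 362
```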

Inline dedupe can also be paused from the CLI as follows:

# isi dedupe inline settings modify --mode paused

# isi dedupe inline settings view

Mode: paused

However, it’s worth noting that this global setting states what you’d like to happen, after which each node attempts to enact the new configuration but can’t guarantee the change, because not all node types support inline dedupe. For example, the following output is from a heterogeneous cluster with a three-node F200 pool that supports inline dedupe and a four-node H400 pool that does not.

Here, we can see that inline dedupe is globally enabled on the cluster:

# isi dedupe inline settings view

Mode: enabled

However, the ‘isi_for_array isi_inline_dedupe_status’ command can be used to display the actual setting and state of each node:

# isi dedupe inline settings view

Mode: enabled

# isi_for_array -s isi_inline_dedupe_status

1: OK Node setting enabled is correct

2: OK Node setting enabled is correct

3: OK Node setting enabled is correct

4: OK Node does not support inline dedupe and current is disabled

5: OK Node does not support inline dedupe and current is disabled

6: OK Node does not support inline dedupe and current is disabled

7: OK Node does not support inline dedupe and current is disabled

Additionally, any changes to the dedupe configuration are also logged to /var/log/messages, and can be found by grepping for ‘inline_dedupe’.

So, in a nutshell: In-line compression has always been enabled by default since its introduction in OneFS 8.1.3. For new clusters running 9.4 and above, inline dedupe is on by default. For clusters running 9.3 and earlier, inline dedupe remains disabled by default. And existing clusters that upgrade to 9.4 will not see any change to their current inline dedupe config during upgrade.

And here’s the OneFS in-line data reduction platform support matrix for good measure:

PowerScale OneFS 9.4

Arriving in time for Dell Technologies World 2022, the new PowerScale OneFS 9.4 release shipped on Monday 4th April 2022.

OneFS 9.4 brings with it a wide array of new features and functionality, including:

Feature | Description
SmartSync Data Mover | Introduction of a new OneFS SmartSync data mover, allowing flexible data movement and copying, incremental resyncs, push and pull data transfer, and one-time file to object copy. Complementary to SyncIQ, SmartSync provides an additional option for data transfer, including to object storage targets such as ECS, AWS and Azure.
IB to Ethernet Backend Migration | Non-disruptive rolling InfiniBand to Ethernet back-end network migration for legacy Gen6 clusters.
Secure Boot | Secure boot support is extended to include the F900, F600, F200, H700/7000, and A300/3000 platforms.
Smarter SmartConnect Diagnostics | Identifies non-resolvable nodes and provides their detailed status, allowing the root cause to be easily pinpointed.
In-line Dedupe | In-line deduplication will be enabled by default on new OneFS 9.4 clusters. Clusters upgraded to OneFS 9.4 will maintain their current dedupe configuration.
Healthcheck Auto-updates | Automatic monitoring, download, and installation of new healthcheck packages as they are released.
CloudIQ Protocol Statistics | New protocol statistics ‘count’ keys are added, allowing performance to be measured over a specified time window and providing point-in-time protocol information.
SRS Alerts and CELOG Event Limiting | Prevents CELOG from sending unnecessary event types to Dell SRS and restricts CELOG alerts from customer-created channels.
CloudPools Statistics | Automated statistics gathering on CloudPools accounts and policies, providing insights for planning and troubleshooting CloudPools-related activities.

We’ll be taking a deeper look at some of these new features in blog articles over the course of the next few weeks.

Meanwhile, the new OneFS 9.4 code is available for download on the Dell Online Support site, in both upgrade and reimage file formats.

Enjoy your OneFS 9.4 experience!

OneFS Metadata Overview

OneFS uses two principal data structures to enable information about each object, or metadata, within the file system to be searched, managed and stored efficiently and reliably. These structures are:

  • Inodes
  • B-trees

OneFS uses inodes to store file attributes and pointers to file data locations on disk, and each file, directory, link, etc, is represented by an inode.

Within OneFS, inodes come in two sizes, either 512B or 8KB. The size that OneFS uses is determined primarily by the physical and logical block formatting of the drives in a diskpool.

All OneFS inodes have both static and dynamic sections.  The static section space is limited and valuable since it can be accessed in a single I/O, and does not require a distributed lock to access. It holds fixed-width, commonly used attributes like POSIX mode bits, owner, and size.

In contrast, the dynamic portion of an inode allows new attributes to be added, if necessary, without requiring an inode format update. This can be done by simply adding a new type value with code to serialize and de-serialize it. Dynamic attributes are stored in the stream-style type-length-value (TLV) format, and include protection policies, OneFS ACLs, embedded b-tree roots, domain membership info, etc.
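To make the TLV idea concrete, here is a generic type-length-value encoder and decoder. The 2-byte type and 2-byte length header, the type numbers, and the function names are all hypothetical and chosen purely for illustration; the actual OneFS on-disk layout is not described here.

```python
import struct

# Hypothetical header: 2-byte type, 2-byte length, followed by the value bytes.
HEADER = struct.Struct("<HH")

def encode_tlv(attrs):
    return b"".join(HEADER.pack(t, len(v)) + v for t, v in attrs)

def decode_tlv(buf):
    attrs, offset = [], 0
    while offset < len(buf):
        attr_type, length = HEADER.unpack_from(buf, offset)
        offset += HEADER.size
        attrs.append((attr_type, buf[offset:offset + length]))
        offset += length
    return attrs

blob = encode_tlv([(1, b"+2d:1n"), (7, b"\x01\x02")])
decoded = decode_tlv(blob)
```

Because every record carries its own length, a reader can step over a type value it does not recognize, which is what allows new attribute types to be introduced without changing the overall format.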

If necessary, OneFS can also use extension blocks, which are 8KB blocks, to store any attributes that cannot fully fit into the inode itself. Additionally, OneFS data services such as SnapshotIQ also commonly leverage inode extension blocks.

Inodes are dynamically created and stored in locations across all the cluster’s drives, and OneFS uses  b-trees (actually B+ trees) for their indexing and rapid retrieval. The general structure of a OneFS b-tree includes a top-level block, known as the ‘root’. B-tree blocks which reference other b-trees are referred to as ‘inner blocks’, and the last blocks at the end of the tree are called ‘leaf blocks’.

Only the leaf blocks actually contain metadata, whereas the root and inner blocks provide a balanced index of addresses allowing rapid identification of and access to the leaf blocks and their metadata.

A LIN, or logical inode, is accessed every time a file, directory, or b-tree is accessed. The function of the LIN tree is to store the mapping between a unique LIN number and its inode mirror addresses.

The LIN is represented as a 64-bit hexadecimal number. Each file is assigned a single LIN and, since LINs are never reused, it is unique for the cluster’s lifespan. For example, the file /ifs/data/test/f1 has the following LIN:

# isi get -D /ifs/data/test/f1 | grep LIN:

*  LIN:                1:2d29:4204

Similarly, its parent directory, /ifs/data/test, has:

# isi get -D /ifs/data/test | grep LIN:

*  LIN:                1:0353:bb59

*  LIN:                1:0009:0004

*  LIN:                1:2d29:4204

The LIN tree entry for the file above includes the mapping between the LIN and its three mirrored inode disk addresses:

# isi get -D /ifs/data/test/f1 | grep "inode"

* IFS inode: [ 92,14,524557565440:512, 93,19,399535074304:512, 95,19,610321964032:512 ]

Taking the first of these inode addresses, 92,14,524557565440:512, the following can be inferred, reading from left to right:

  • It’s on node 92.
  • Stored on drive lnum 14.
  • At block address 524557565440.
  • And is a 512-byte inode.
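The same left-to-right reading can be applied to every mirror in the ‘IFS inode’ line with a small parser. This sketch assumes the ‘node,drive,offset:size’ layout shown in the sample output above; the class and function names are illustrative.

```python
from typing import NamedTuple

class InodeAddr(NamedTuple):
    node: int      # node number
    drive: int     # drive lnum
    offset: int    # block address
    size: int      # inode size in bytes

def parse_inode_addrs(line):
    # Take the text between the square brackets and split it into fields.
    inner = line[line.index("[") + 1:line.index("]")]
    fields = [f.strip() for f in inner.split(",")]
    addrs = []
    for i in range(0, len(fields), 3):
        offset, size = fields[i + 2].split(":")
        addrs.append(InodeAddr(int(fields[i]), int(fields[i + 1]),
                               int(offset), int(size)))
    return addrs

sample = ("* IFS inode: [ 92,14,524557565440:512, "
          "93,19,399535074304:512, 95,19,610321964032:512 ]")
mirrors = parse_inode_addrs(sample)
print(mirrors[0])
```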

The file’s parent LIN can also be easily determined:

# isi get -D /ifs/data/test/f1 | grep -i "Parent Lin"

*  Parent Lin          1:0353:bb59
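For scripting purposes, the colon-separated LIN notation can be converted to and from a plain integer. This sketch assumes that the colons are purely display grouping of the hexadecimal digits, which is an assumption rather than something stated above; the function names are illustrative.

```python
def lin_to_int(lin_text: str) -> int:
    # Assumption: the colons only group hex digits for display.
    return int(lin_text.replace(":", ""), 16)

def int_to_lin(value: int) -> str:
    # Pad to at least nine hex digits, then group the last eight as 4:4.
    hexed = f"{value:x}".zfill(9)
    return f"{hexed[:-8]}:{hexed[-8:-4]}:{hexed[-4:]}"

file_lin = lin_to_int("1:2d29:4204")
parent_lin = lin_to_int("1:0353:bb59")
```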

In addition to the LIN tree, OneFS also uses b-trees to support file and directory access, plus the management of several other data services. That said, the three principal b-trees that OneFS employs are:

Category | B+ Tree Name | Description
Files | Metatree or Inode Format Manager (IFM B-tree) | Stores a mapping of Logical Block Number (LBN) to protection group. It is responsible for storing the physical location of file blocks on disk.
Directories | Directory Format Manager (DFM B-tree) | Stores directory entries (file names and directories/sub-directories). It includes the full /ifs namespace and everything under it.
System | System B-tree (SBT) | Standardized B+ tree implementation to store records for OneFS internal use, typically related to a particular feature, including: Diskpool DB, IFS Domains, WORM, Idmap. Quota (QDB) and Snapshot Tracking Files (STF) are actually separate/unique B+ tree implementations.

OneFS also relies heavily on several other metadata structures too, including:

  • Shadow Store – Dedupe/clone metadata structures including SINs
  • QDB – Quota Database structures
  • System B+ Tree Files
  • STF – Snapshot Tracking Files
  • WORM
  • IFM Indirect
  • Idmap
  • System Directories
  • Delta Blocks
  • Logstore Files

Both inodes and b-tree blocks are mirrored on disk.  Mirror-based protection is used exclusively for all OneFS metadata because it is simple and lightweight, thereby avoiding the additional processing of erasure coding.  Since metadata typically only consumes around 2% of the overall cluster’s capacity, the mirroring overhead for metadata is minimal.

The number of inode mirrors (minimum 2x up to 8x) is determined by the nodepool’s achieved protection policy and the metadata type. Below is a mapping of the default number of mirrors for all metadata types.

Protection Level Metadata Type Number of Mirrors
+1n File inode 2 inodes per file
+2d:1n File inode 3 inodes per file
+2n File inode 3 inodes per file
+3d:1n File inode 4 inodes per file
+3d:1n1d File inode 4 inodes per file
+3n File inode 4 inodes per file
+4d:1n File inode 5 inodes per file
+4d:2n File inode 5 inodes per file
+4n File inode 5 inodes per file
2x->8x File inode Same as protection level. I.e. 2x == 2 inode mirrors
+1n Directory inode 3 inodes per file
+2d:1n Directory inode 4 inodes per file
+2n Directory inode 4 inodes per file
+3d:1n Directory inode 5 inodes per file
+3d:1n1d Directory inode 5 inodes per file
+3n Directory inode 5 inodes per file
+4d:1n Directory inode 6 inodes per file
+4d:2n Directory inode 6 inodes per file
+4n Directory inode 6 inodes per file
2x->8x Directory inode +1 protection level. I.e. 2x == 3 inode mirrors
LIN root/master 8x
LIN inner/leaf Variable – per-entry protection
IFM/DFM b-tree Variable – per-entry protection
Quota database b-tree (QDB) 8x
SBT System b-tree (SBT) Variable – per-entry protection
Snapshot tracking files (STF) 8x

Note that, by default, directory inodes are mirrored at one level higher than the achieved protection policy, since directories are more critical and make up the OneFS single namespace.  The root of the LIN Tree is the most critical metadata type and is always mirrored at 8x.
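The file and directory rows of the table follow a regular pattern, which can be captured in a small helper. This is a sketch derived only from the table rows above, not from OneFS code: the function name is hypothetical, and capping directory mirrors at 8x is an assumption based on the stated 2x to 8x range.

```python
import re

def inode_mirrors(protection: str, is_directory: bool = False) -> int:
    mirror = re.fullmatch(r"([2-8])x", protection)
    if mirror:
        # Mirrored protection: 2x == 2 inode mirrors, and so on.
        count = int(mirror.group(1))
    else:
        # FEC protection such as +2d:1n or +3n: leading number plus one.
        fec = re.fullmatch(r"\+([1-4])(?:n|d:\d+n(?:\d+d)?)", protection)
        if not fec:
            raise ValueError(f"unrecognized protection level: {protection}")
        count = int(fec.group(1)) + 1
    if is_directory:
        # Directories get one extra mirror (assumed cap of 8).
        count = min(count + 1, 8)
    return count

print(inode_mirrors("+2d:1n"))                     # 3
print(inode_mirrors("+2d:1n", is_directory=True))  # 4
```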

OneFS SSD strategy governs where and how much metadata is placed on SSD or HDD.  There are five SSD Strategies, and these can be configured via OneFS’ file pool policies:

SSD Strategy | Description
L3 Cache | All SSDs in a node pool are used as a read-only eviction cache from L2 cache. Currently used data and metadata will fill the entire capacity of the SSDs in this mode. Note: L3 mode does not guarantee all metadata will be on SSD, so this may not be the most performant mode for metadata-intensive workflows.
Metadata Read | One metadata mirror is placed on SSD. All other mirrors will be on HDD for hybrid and archive models. This mode can boost read performance for metadata-intensive workflows.
Metadata Write | All metadata mirrors are placed on SSD. This mode can boost both read and write performance when there is significant demand on metadata IO. Note: It is important to understand the SSD capacity requirements needed to support Metadata strategies. Therefore, we are developing the Metadata Reporting Script below which will assist in SSD metadata sizing activities.
Data | Place data on SSD. This is not a widely used strategy, as hybrid and archive nodes have limited SSD capacities, and metadata should take priority on SSD for best performance.
Avoid | Avoid using SSD for a specific path. This is not a widely used strategy but could be handy if you had archive workflows that did not require SSD and wanted to dedicate your SSD space to other, more important paths/workflows.

Fundamentally, OneFS metadata placement is determined by the following attributes:

  • The model of the nodes in each node pool (F-series, H-series, A-series).
  • The current SSD Strategy on the node pool, configured using the default filepool policy and custom administrator-created filepool policies.
  • The cluster’s global storage pool settings.

The following CLI commands can be used to verify the current SSD strategy and metadata placement details on a cluster. For example, in order to check whether L3 Mode is enabled on a specific node pool:

# isi storagepool nodepool list

ID     Name                       Nodes  Node Type IDs  Protection Policy  Manual

----------------------------------------------------------------------------------

1      h500_30tb_3.2tb-ssd_128gb  1      1              +2d:1n             No

In the output above, there is a single H500 node pool reported with an ID of ‘1’. The details of this pool can be displayed as follows:

# isi storagepool nodepool view 1

                 ID: 1

               Name: h500_30tb_3.2tb-ssd_128gb

              Nodes: 1, 2, 3, 4, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40

      Node Type IDs: 1

  Protection Policy: +2d:1n

             Manual: No

         L3 Enabled: Yes

L3 Migration Status: l3

               Tier: -

              Usage

                Avail Bytes: 321.91T

            Avail SSD Bytes: 0.00

                   Balanced: No

                 Free Bytes: 329.77T

             Free SSD Bytes: 0.00

                Total Bytes: 643.13T

            Total SSD Bytes: 0.00

    Virtual Hot Spare Bytes: 7.86T

Note that if, as in this case, L3 is enabled on a node pool, any changes to this pool’s SSD Strategy configuration via file pool policies, etc, will not be honored. This will remain until L3 cache has been disabled and the SSDs reformatted for use as metadata mirrors.

The following CLI syntax can be used to check the cluster’s default file pool policy configuration:

# isi filepool default-policy view

          Set Requested Protection: default

               Data Access Pattern: concurrency

                  Enable Coalescer: Yes

                    Enable Packing: No

               Data Storage Target: anywhere

                 Data SSD Strategy: metadata

           Snapshot Storage Target: anywhere

             Snapshot SSD Strategy: metadata

                        Cloud Pool: -

         Cloud Compression Enabled: -

          Cloud Encryption Enabled: -

              Cloud Data Retention: -

Cloud Incremental Backup Retention: -

       Cloud Full Backup Retention: -

               Cloud Accessibility: -

                  Cloud Read Ahead: -

            Cloud Cache Expiration: -

         Cloud Writeback Frequency: -

      Cloud Archive Snapshot Files: -

                                ID: -

And to list all FilePool Policies configured on a cluster:

# isi filepool policies list

To view a specific FilePool Policy:

# isi filepool policies view <Policy Name>

OneFS also provides global storagepool configuration settings which control additional metadata placement. For example:

# isi storagepool settings view

     Automatically Manage Protection: files_at_default

Automatically Manage Io Optimization: files_at_default

Protect Directories One Level Higher: Yes

       Global Namespace Acceleration: disabled

       Virtual Hot Spare Deny Writes: Yes

        Virtual Hot Spare Hide Spare: Yes

      Virtual Hot Spare Limit Drives: 2

     Virtual Hot Spare Limit Percent: 0

             Global Spillover Target: anywhere

                   Spillover Enabled: Yes

        SSD L3 Cache Default Enabled: Yes

                     SSD Qab Mirrors: one

            SSD System Btree Mirrors: one

            SSD System Delta Mirrors: one

The CLI output below includes descriptions of the relevant metadata options available.

# isi storagepool settings modify -h | egrep -i options -A 30

Options:

    --automatically-manage-protection (all | files_at_default | none)

        Set whether SmartPools manages files' protection settings.

    --automatically-manage-io-optimization (all | files_at_default | none)

        Set whether SmartPools manages files' I/O optimization settings.

    --protect-directories-one-level-higher <boolean>

        Protect directories at one level higher.

    --global-namespace-acceleration-enabled <boolean>

        Global namespace acceleration enabled.

    --virtual-hot-spare-deny-writes <boolean>

        Virtual hot spare: deny new data writes.

    --virtual-hot-spare-hide-spare <boolean>

        Virtual hot spare: reduce amount of available space.

    --virtual-hot-spare-limit-drives <integer>

        Virtual hot spare: number of virtual drives.

    --virtual-hot-spare-limit-percent <integer>

        Virtual hot spare: percent of total storage.

    --spillover-target <str>

        Spillover target.

    --spillover-anywhere

        Set global spillover to anywhere.

    --spillover-enabled <boolean>

        Spill writes into pools within spillover_target as needed.

    --ssd-l3-cache-default-enabled <boolean>

        Default setting for enabling L3 on new Node Pools.

    --ssd-qab-mirrors (one | all)

        Controls number of mirrors of QAB blocks to place on SSDs.

    --ssd-system-btree-mirrors (one | all)

        Controls number of mirrors of system B-tree blocks to place on SSDs.

    --ssd-system-delta-mirrors (one | all)

        Controls number of mirrors of system delta blocks to place on SSDs.

OneFS defaults to protecting directories one level higher than the configured protection policy and retaining one mirror of system b-trees on SSD.  For optimal performance on hybrid platform nodes, the recommendation is to place all metadata mirrors on SSD, assuming the capacity is available.  Be aware, however, that the metadata SSD mirroring options only become active if L3 Mode is disabled.

Additionally, global namespace acceleration (GNA) is a legacy option that allows nodes without SSD to place their metadata on nodes with SSD.  All currently shipping PowerScale node models include at least one SSD drive.

 

OneFS Neighborhoods

Heterogeneous PowerScale clusters can be built with a wide variety of node styles and capacities, in order to meet the needs of a varied data set and wide spectrum of workloads. Isilon nodes are broken into several classes, or tiers, according to their functionality. These node styles encompass several hardware generations, and fall loosely into four main tiers:

OneFS neighborhoods add another level of resilience into the OneFS failure domain concept.

As we saw in the previous article, disk pools represent the smallest unit within the storage pools hierarchy. OneFS provisioning works on the premise of dividing similar nodes’ drives into sets, or disk pools, with each pool representing a separate failure domain. These are protected by default at +2d:1n (the ability to withstand two disk failures or one entire node failure). In Gen6 chassis, disk pools are laid out across all five sleds in each node. For example, a node with three drives per sled will have the following disk pool configuration:

Node pools are groups of disk pools, spread across similar, or compatible, OneFS storage nodes. Multiple groups of different node types can work together in a single, heterogeneous cluster.

In OneFS, a failure domain is the portion of a dataset that can be negatively impacted by a specific component failure. A disk pool comprises a group of drives spread across multiple compatible nodes, and a node usually has drives in multiple disk pools which share the same node boundaries. Since each piece of data or metadata is fully contained within a single disk pool, OneFS considers the disk pool as its failure domain.

PowerScale chassis-based hybrid and archive nodes utilize sled protection, where each drive in a sled is automatically located in a different disk pool. This ensures that if a sled is removed, rather than a failure domain losing four drives, the affected failure domains each only lose one drive.

OneFS neighborhoods help organize and limit the width of a disk pool. Neighborhoods also contain all the disk pools within a certain node boundary, aligned with the disk pools’ node boundaries. As such, a node will often have drives in multiple disk pools, but a node will only be in a single neighborhood. Fundamentally, neighborhoods, node pools, and tiers are all layers on top of disk pools, and node pools and tiers are used for organizing neighborhoods and disk pools.

So the primary function of neighborhoods is to improve OneFS reliability in general, and guard against data unavailability. With the PowerScale all-flash F-series nodes, OneFS has an ideal size of 20 nodes per node pool, and a maximum size of 39 nodes. On the addition of the 40th node, the nodes automatically divide, or split, into two neighborhoods of twenty nodes.

Neighborhood | F-series Nodes | H-series and A-series Nodes
Smallest Size | 3 | 4
Ideal Size | 20 | 10
Maximum Size | 39 | 19

In contrast, the Gen6 chassis based platforms, such as the PowerScale H-series and A-series, have an ideal neighborhood size of 10 nodes per node pool, and an automatic split occurs on the addition of the 20th node, or 5th chassis. This smaller neighborhood size helps the Gen6 hardware protect against simultaneous node-pair journal failures and full chassis failures. With the Gen6 platform and partner node protection, where possible, nodes will be placed in different neighborhoods – and hence different failure domains. Partner node protection is possible once the cluster reaches five full chassis (20 nodes) when, after the first neighborhood split, OneFS places partner nodes in different neighborhoods:

Partner node protection increases reliability because if both nodes go down, they are in different failure domains, so their failure domains only suffer the loss of a single node.

With chassis-level protection, when possible, each of the four nodes within a chassis will be placed in a separate neighborhood. Chassis protection becomes possible at 40 nodes, as the neighborhood split at 40 nodes enables every node in a chassis to be placed in a different neighborhood. As such, when a 38 node Gen6 cluster is expanded to 40 nodes, the two existing neighborhoods will be split into four 10-node neighborhoods:

Chassis-level protection ensures that if an entire chassis failed, each failure domain would only lose one node.

The distribution of nodes and drives in pools is governed by gconfig values, such as the ‘pool_ideal_size’ parameter which indicates the preferred number of nodes in a pool. For example:

# isi_gconfig smartpools | grep -i ideal

smartpools.diskpools.pool_ideal_size (int) = 20

The most common causes of a neighborhood split are:

  1. Nodes were added to the node pool and the neighborhood must be split to accommodate them, for example the nodepool went from 39 to 40 (20+20) or from 59 to 60 (20+20+20).
  2. Nodes were removed from a nodepool into a manual nodepool.
  3. Compatibility settings were changed, which made some existing nodes incompatible.

After a split, typically the Smartpools/SetProtectPlus and AutoBalance jobs run, restriping files so that the new disk pools are balanced.

The following CLI command can be used to identify the correlation between the cluster’s nodes and OneFS neighborhoods, or failure domains:

# sysctl efs.lin.lock.initiator.coordinator_weights

The command output reports the node composition of each neighborhood (failure_domain), as well as the active nodes (up_nodes) in each, and their relative weighting (weights).

With larger clusters, neighborhoods also help facilitate OneFS’ parallel cluster upgrade option. Parallel upgrade provides upgrade efficiency within node pools on larger clusters, allowing the simultaneous upgrading of one node per neighborhood until the pool is complete. By doing this, the upgrade duration is dramatically reduced, while ensuring that end-users still continue to have full access to their data.

During a parallel upgrade, the upgrade framework selects one node from each neighborhood, to run the upgrade job on simultaneously. So in this case, node 13 from neighborhood 1, node 2 from neighborhood 2, node 27 from neighborhood 3 and node 40 from neighborhood 4 will be upgraded at the same time. Considering they are all in different neighborhoods or failure domains, it will not impact the current running workload.  After the first pass completes, the upgrade framework will select another node from each neighborhood and upgrade them, and so on until the cluster is fully upgraded.
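The pass-by-pass selection described above can be sketched as follows. The neighborhood membership shown is made up for illustration, the function name is hypothetical, and the real upgrade framework may choose nodes in a different order; the sketch only shows the "one node per neighborhood per pass" idea.

```python
from itertools import zip_longest

def upgrade_passes(neighborhoods):
    """Return a list of passes, each holding at most one node per neighborhood."""
    passes = []
    for batch in zip_longest(*neighborhoods.values()):
        # Smaller neighborhoods run out of nodes first; skip their empty slots.
        passes.append([node for node in batch if node is not None])
    return passes

# Hypothetical layout: neighborhood ID -> member nodes.
example = {1: [11, 12, 13], 2: [1, 2], 3: [27, 28, 29]}
print(upgrade_passes(example))  # [[11, 1, 27], [12, 2, 28], [13, 29]]
```

Note that the number of passes equals the size of the largest neighborhood, which is why that value drives the overall upgrade duration.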

For example, consider a hundred-node PowerScale H700 cluster. With an ideal layout, there would be 10 neighborhoods, each containing ten nodes. The equation for estimating the completion time of a parallel upgrade is as follows:

Estimated time = (per-node upgrade duration) × (max number of nodes per neighborhood)

Assuming an upgrade time of 20 minutes per node, this would be:

20 × 10 = 200 minutes

So the estimated duration of the hundred node parallel upgrade is 200 minutes, or just under 3 ½ hours. This is in contrast to a rolling upgrade, which would be an order of magnitude greater at 2000 minutes, or almost a day and a half.
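The comparison above can be reproduced with a couple of one-line functions, which makes it easy to plug in other cluster sizes. The function names are illustrative, and the estimate assumes a uniform upgrade time per node.

```python
def parallel_upgrade_minutes(per_node_minutes, max_nodes_per_neighborhood):
    # One node per neighborhood is upgraded in each pass.
    return per_node_minutes * max_nodes_per_neighborhood

def rolling_upgrade_minutes(per_node_minutes, total_nodes):
    # Nodes are upgraded strictly one at a time.
    return per_node_minutes * total_nodes

parallel = parallel_upgrade_minutes(20, 10)   # 200 minutes
rolling = rolling_upgrade_minutes(20, 100)    # 2000 minutes
print(parallel, rolling, rolling // parallel)
```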