Cluster – Page 19 – Unstructured Data Quick Tips

OneFS Protection Overhead

There have been a number of questions from the field recently around how to calculate the OneFS storage protection overhead for different cluster sizes and protection levels. But first, a quick overview of the fundamentals…

OneFS supports several protection schemes. These include the ubiquitous +2d:1n, which protects against two drive failures or one node failure. The best practice is to use the recommended protection level for a particular cluster configuration. This recommended level of protection is clearly marked as ‘suggested’ in the OneFS WebUI storage pools configuration pages and is typically configured by default. For all current Gen6 hardware configurations, the recommended starting protection level is “+2d:1n’.

The hybrid protection schemes are particularly useful for the chassis-based high-density node configurations, where the probability of multiple drives failing far surpasses that of an entire node failure. In the unlikely event that multiple devices have simultaneously failed, such that the file is “beyond its protection level”, OneFS will re-protect everything possible and report errors on the individual files affected to the cluster’s logs.

OneFS also provides a variety of mirroring options ranging from 2x to 8x, allowing from two to eight mirrors of the specified content. Metadata, for example, is mirrored at one level above FEC by default. For example, if a file is protected at +2n, its associated metadata object will be 3x mirrored.

The full range of OneFS protection levels are as follows:

Protection Level	Description
+1n	Tolerate failure of 1 drive OR 1 node
+2d:1n	Tolerate failure of 2 drives OR 1 node
+2n	Tolerate failure of 2 drives OR 2 nodes
+3d:1n	Tolerate failure of 3 drives OR 1 node
+3d:1n1d	Tolerate failure of 3 drives OR 1 node AND 1 drive
+3n	Tolerate failure of 3 drives or 3 nodes
+4d:1n	Tolerate failure of 4 drives or 1 node
+4d:2n	Tolerate failure of 4 drives or 2 nodes
+4n	Tolerate failure of 4 nodes
2x to 8x	Mirrored over 2 to 8 nodes, depending on configuration

The charts below show the ‘ideal’ protection overhead across the range of OneFS protection levels and node counts. For each field in this chart, the overhead percentage is calculated by dividing the sum of the two numbers by the number on the right.

x+y => y/(x+y)

So, for a five node cluster protected at +2d:1n, OneFS uses an 8+2 layout – hence an ‘ideal’ overhead of 20%.

8+2 => 2/(8+2) = 20%

Number of nodes	[+1n]	[+2d:1n]	[+2n]	[+3d:1n]	[+3d:1n1d]	[+3n]	[+4d:1n]	[+4d:2n]	[+4n]
3	2 +1 (33%)	4 + 2 (33%)	—	6 + 3 (33%)	3 + 3 (50%)	—	8 + 4 (33%)	—	—
4	3 +1 (25%)	6 + 2 (25%)	—	9 + 3 (25%)	5 + 3 (38%)	—	12 + 4 (25%)	4 + 4 (50%)	—
5	4 +1 (20%)	8+ 2 (20%)	3 + 2 (40%)	12 + 3 (20%)	7 + 3 (30%)	—	16 + 4 (20%)	6 + 4 (40%)	—
6	5 +1 (17%)	10 + 2 (17%)	4 + 2 (33%)	15 + 3 (17%)	9 + 3 (25%)	—	16 + 4 (20%)	8 + 4 (33%)	—

The ‘x+y’ numbers in each field in the table also represent how files are striped across a cluster for each node count and protection level.

Take for example, with +2n protection on a 6-node cluster, OneFS will write a stripe across all 6 nodes, and use two of the stripe units for parity/ECC and four for data.

In general, for FEC protected data the OneFS protection overhead will look something like below.

Note that the protection overhead % (in brackets) is a very rough guide and will vary across different datasets, depending on quantities of small files, etc.

Number of nodes	[+1n]	[+2d:1n]	[+2n]	[+3d:1n]	[+3d:1n1d]	[+3n]	[+4d:1n]	[+4d:2n]	[+4n]
3	2 +1 (33%)	4 + 2 (33%)	—	6 + 3 (33%)	3 + 3 (50%)	—	8 + 4 (33%)	—	—
4	3 +1 (25%)	6 + 2 (25%)	—	9 + 3 (25%)	5 + 3 (38%)	—	12 + 4 (25%)	4 + 4 (50%)	—
5	4 +1 (20%)	8 + 2 (20%)	3 + 2 (40%)	12 + 3 (20%)	7 + 3 (30%)	—	16 + 4 (20%)	6 + 4 (40%)	—
6	5 +1 (17%)	10 + 2 (17%)	4 + 2 (33%)	15 + 3 (17%)	9 + 3 (25%)	—	16 + 4 (20%)	8 + 4 (33%)	—
7	6 +1 (14%)	12 + 2 (14%)	5 + 2 (29%)	15 + 3 (17%)	11 + 3 (21%)	4 + 3 (43%)	16 + 4 (20%)	10 + 4 (29%)	—
8	7 +1 (13%)	14 + 2 (12.5%)	6 + 2 (25%)	15 + 3 (17%)	13 + 3 (19%)	5 + 3 (38%)	16 + 4 (20%)	12 + 4 (25%)	—
9	8 +1 (11%)	16 + 2 (11%)	7 + 2 (22%)	15 + 3 (17%)	15 + 3 (17%)	6 + 3 (33%)	16 + 4 (20%)	14 + 4 (22%)	5 + 4 (44%)
10	9 +1 (10%)	16 + 2 (11%)	8 + 2 (20%)	15 + 3 (17%)	15 + 3 (17%)	7 + 3 (30%)	16 + 4 (20%)	16 + 4 (20%)	6 + 4 (40%)
12	11 +1 (8%)	16 + 2 (11%)	10 + 2 (17%)	15 + 3 (17%)	15 + 3 (17%)	9 + 3 (25%)	16 + 4 (20%)	16 + 4 (20%)	6 + 4 (40%)
14	13 +1 (7%)	16 + 2 (11%)	12 + 2 (14%)	15 + 3 (17%)	15 + 3 (17%)	11 + 3 (21%)	16 + 4 (20%)	16 + 4 (20%)	10 + 4 (29%)
16	15 +1 (6%)	16 + 2 (11%)	14 + 2 (13%)	15 + 3 (17%)	15 + 3 (17%)	13 + 3 (19%)	16 + 4 (20%)	16 + 4 (20%)	12 + 4 (25%)
18	16 +1 (6%)	16 + 2 (11%)	16 + 2 (11%)	15 + 3 (17%)	15 + 3 (17%)	15 + 3 (17%)	16 + 4 (20%)	16 + 4 (20%)	14 + 4 (22%)
20	16 +1 (6%)	16 + 2 (11%)	16 + 2 (11%)	16 + 3 (16%)	16 + 3 (16%)	16 + 3 (16%)	16 + 4 (20%)	16 + 4 (20%)	14 + 4 (22%)
30	16 +1 (6%)	16 + 2 (11%)	16 + 2 (11%)	16 + 3 (16%)	16 + 3 (16%)	16 + 3 (16%)	16 + 4 (20%)	16 + 4 (20%)	14 + 4 (22%)

The protection level of the ﬁle is how the system decides to layout the ﬁle. A ﬁle may have multiple protection levels temporarily (because the ﬁle is being restriped) or permanently (because of a heterogeneous cluster). The protection level is speciﬁed as “n + m/b@r” in its full form. In the case where b, r, or both equal 1, it may be elided to get “n + m/b”, “n + m@r”, or “n + m”.

Layout Attribute	Description
N	Number of data drives in a stripe.
+m	Number of FEC drives in a stripe.
/b	Number of drives per stripe allowed on one node.
@r	Number of drives to include in the layout of a file.

The OneFS protection definition in terms of node and/or drive failures has the advantage of configuration simplicity. However, it does mask some of the subtlety of the interaction between stripe width and drive spread, as represented by the n+m/b notation displayed by the ‘isi get’ CLI command. For example:

# isi get README.txt

POLICY    LEVEL PERFORMANCE COAL  FILE

default   6+2/2 concurrency on    README.txt

In particular, both +3/3 and +3/2 allow for a single node failure or three drive failures and appear the same according to the web terminology. Despite this, they do in fact have diﬀerent characteristics. +3/2 allows for the failure of any one node in combination with the failure of a single drive on any other node, which +3/3 does not. +3/3, on the other hand, allows for potentially better space eﬃciency and performance because up to three drives per node can be used, rather than the 2 allowed under +3/2.

That said, the protection level does have a minor affect on write performance. The largest impact is from the first number after the + in the protection level. For example, the the ‘+2’ levels (including +2n or +2d:1n) may be around 3% faster than the ‘+3’ levels (including +3n, +3d:1n1d, or +3d:1n). There is also some variation on depending on the number of nodes in a cluster, but typically this is less significant.

Another factor to keep in mind is OneFS neighborhoods. A neighborhood is a fault domains within a node pool, and their purpose is to improve reliability in general – and guard against data unavailability from the accidental removal of Gen6 drive sleds. For self-contained nodes like the PowerScale F210, OneFS has an ideal size of 20 nodes per node pool, and a maximum size of 39 nodes. On the addition of the 40^th node, the nodes split into two neighborhoods of twenty nodes.

With the Gen6 platform, the ideal size of a neighborhood changes from 20 to 10 nodes. This 10-node ideal neighborhood size helps protect the Gen6 architecture against simultaneous node-pair journal failures and full chassis failures. Partner nodes are nodes whose journals are mirrored. Rather than each node storing its journal in NVRAM as in the PowerScale platforms, the Gen6 nodes’ journals are stored on SSDs – and every journal has a mirror copy on another node. The node that contains the mirrored journal is referred to as the partner node. There are several reliability benefits gained from the changes to the journal. For example, SSDs are more persistent and reliable than NVRAM, which requires a charged battery to retain state. Also, with the mirrored journal, both journal drives have to die before a journal is considered lost. As such, unless both of the mirrored journal drives fail, both of the partner nodes can function as normal.

With partner node protection, where possible, nodes will be placed in different neighborhoods – and hence different failure domains. Partner node protection is possible once the cluster reaches five full chassis (20 nodes) when, after the first neighborhood split, OneFS places partner nodes in different neighborhoods:

Partner node protection increases reliability because if both nodes go down, they are in different failure domains, so their failure domains only suffer the loss of a single node.

With chassis protection, when possible, each of the four nodes within a chassis will be placed in a separate neighborhood. Chassis protection becomes possible at 40 nodes, as the neighborhood split at 40 nodes enables every node in a chassis to be placed in a different neighborhood. As such, when a 38 node Gen6 cluster is expanded to 40 nodes, the two existing neighborhoods will be split into four 10-node neighborhoods:

Chassis protection ensures that if an entire chassis failed, each failure domain would only lose one node.

Better Protection with Dell EMC ECS Object Lock

Dell EMC ECS supported WORM (write-once-read-many) based retention from ECS 2.X. However, to gain more compatibility with more applications, ECS support the object lock feature from 3.6.2 version which is compatible with the capabilities of Amazon S3 object lock.

Dell EMC ECS object lock protects object versions from accidental or malicious deletion such as a ransomware attack. It does this by allowing object versions to enter a Write Once Read Many (WORM) state where access is restricted based on attributes set on the object version.

Object lock is designed to meet compliance requirements such as SEC 17a4(f), FINRA Rule 4511(c), and CFTC Rule 17.

Object lock overview

Object lock prevents object version deletion during a user-defined retention period. Immutable S3 objects are protected using object- or bucket-level configuration of WORM and retention attributes. The retention policy is defined using the S3 API or bucket-level defaults. Objects are locked for the duration of the retention period, and legal hold scenarios are also supported.

There are two lock types for object lock:

Retention period — Specifies a fixed period of time during which an object version remains locked. During this period, your object version is WORM-protected and can’t be overwritten or deleted.
Legal hold — Provides the same protection as a retention period, but it has no expiration date. Instead, a legal hold remains in place until you explicitly remove it. legal holds are independent from retention periods.

There are two mode for the retention period:

Governance mode — users can’t overwrite or delete an object version or alter its lock settings unless they have special permissions. With governance mode, you protect objects against being deleted by most users, but you can still grant some users permission to alter the retention settings or delete the object if necessary. You can also use governance mode to test retention-period settings before creating a compliance-mode retention period.
Compliance mode — a protected object version can’t be overwritten or deleted by any user, including the root user in your account. When an object is locked in compliance mode, its retention mode can’t be changed, and its retention period can’t be shortened. Compliance mode helps ensure that an object version can’t be overwritten or deleted for the duration of the retention period.

Object lock and lifecycle

Objects under lock are protected from lifecycle deletions.

Lifecycle logic is made difficult due to variety of behavior of different locks. From lifecycle point of view there are locks without a date, locks with date that can be extended, and locks with date that can be decreased.

For compliance mode, the retain until date can’t be decreased, but can be increased:
For governance mode, the lock date can increase, decrease, or get removed.
For legal hold, the lock is indefinite.

Some key points for the S3 object lock with ECS

Object lock requires FS (File System) disabled on bucket in ECS 3.6.2 version.
Object lock requires ADO (Access During Outage) disabled on bucket in ECS 3.6.2 version.
Object lock is only supported by S3 API, not UI workflows in ECS 3.6.2 version.
Object lock only works with IAM, not legacy accounts.
Object lock works only in versioned buckets.
Enabling locking on the bucket automatically makes it versioned.
Once bucket locking is enabled, it is not possible to disable object lock or suspend versioning for the bucket.
A bucket has default configuration include a retention mode (governance or compliance) and also a retention period (which is days or years).
Object locks apply to individual object versions only.
Different versions of a single object can have different retention modes and periods.
Lock prevents an object from being deleted or overwritten. Overwritten does not mean that new versions can’t be created (new version can be created with their own lock settings).
Object can still be deleted; it will create a delete marker and the version still exists and is locked.
Compliance mode is stricter, locks can’t be removed, decreased, or downgraded to governance mode.
Governance mode is less strict, it can be removed, bypassed, elevated to compliance mode.
Object can still be deleted, but the version still exists and is locked.
Updating an object version’s metadata, as occurs when you place or alter an object lock, doesn’t overwrite the object version or reset its Last-Modified timestamp.
Retention period can be placed on an object explicitly, or implicitly through a bucket default setting.
Placing a default retention setting on a bucket doesn’t place any retention settings on objects that already exist in the bucket.
Changing a bucket’s default retention period doesn’t change the existing retention period for any objects in that bucket.
object lock and traditional bucket/object ECS retention can co-exist.

ECS object lock condition keys

Access control using IAM policies is an important part of the object lock functionality. The s3:BypassGovernanceRetention permission is important since it is required to delete a WORM-protected object in Governance mode. IAM policy conditions have been defined below to allow you to limit what retention period and legal hold can be specified in objects.

Condition Key	Description
s3:object-lock-legal-hold	Enables enforcement of the specified object legal hold status
s3:object-lock-mode	Enables enforcement of the specified object retention mode
s3:object-lock-retain-until-date	Enables enforcement of a specific retain-until-date
s3:object-lock-remaining-retention-days	Enables enforcement of an object relative to the remaining retention days

ECS object lock API examples

This section lists s3curl examples of object Lock APIs. Put and Get object lock APIs can be used with and without versionId parameter. If no versionId parameter is used, then the action applies to the latest version.

Operation	API request examples
Create lock-enabled bucket	s3curl.pl –id=ecsflex –createBucket — http://${s3ip}/mybucket -H “x-amz-bucket-object-lock-enabled: true”
Enable object lock on existing bucket	s3curl.pl –id=ecsflex — http://${s3ip}/my-bucket?enable-objectlock -X PUT
Get bucket default lock configuration	s3curl.pl –id=ecsflex — http://${s3ip}/my-bucket?object-lock
Put bucket default lock configuration	s3curl.pl –id=ecsflex — http://${s3ip}/my-bucket?object-lock -X PUT \ -d “<ObjectLockConfiguration><ObjectLockEnabled>Enabled</ ObjectLockEnabled> <Rule><DefaultRetention><Mode>GOVERNANCE</Mode><Days>1</Days></ DefaultRetention></Rule></ObjectLockConfiguration>”
Get legal hold	s3curl.pl –id=ecsflex — http://${s3ip}/my-bucket/obj?legal-hold
Put legal hold on create	s3curl.pl –id=ecsflex –put=/root/100b.file — http://${s3ip}/ my-bucket/obj -H “x-amz-object-lock-legal-hold: ON”
Put legal hold on existing object	s3curl.pl –id=ecsflex — http://${s3ip}/my-bucket/obj?legalhold -X PUT -d “<LegalHold><Status>OFF</Status></LegalHold>”
Get retention	s3curl.pl –id=ecsflex — http://${s3ip}/my-bucket/obj?retention
Put retention on create	s3curl.pl –id=ecsflex –put=/root/100b.file — http://${s3ip}/ my-bucket/obj -H “x-amz-object-lock-mode: GOVERNANCE” -H “x-amz-object-lock-retain-until-date: 2030-01-01T00:00:00.000Z”
Put retention on existing object	s3curl.pl –id=ecsflex — http://${s3ip}/my-bucket/obj? retention -X PUT -d “<Retention><Mode>GOVERNANCE</ Mode><RetainUntilDate>2030-01-01T00:00:00.000Z</ RetainUntilDate></Retention>”
Put retention on existing object (with bypass)	s3curl.pl –id=ecsflex — http://${s3ip}/my-bucket/obj? retention -X PUT -d “<Retention><Mode>GOVERNANCE</ Mode><RetainUntilDate>2030-01-01T00:00:00.000Z</ RetainUntilDate></Retention>” -H “x-amz-bypass-governance-retention: true”

OneFS and SMB Encryption

Received a couple of recent questions around SMB encryption, which is supported in addition to the other components of the SMB3 protocol dialect that OneFS supports, including multi-channel, continuous availability (CA), and witness.

OneFS allows encryption for SMB3 clients to be configured on a per share, zone, or cluster-wide basis. When configuring encryption at the cluster-wide level, OneFS provides the option to also allow unencrypted connections for older, non-SMB3 clients.

The following CLI command will indicate whether SMB3 encryption has already been configured globally on the cluster:

# isi smb settings global view | grep -i encryption

    Support Smb3 Encryption: No

The following table lists what behavior a variety of Microsoft Windows and Apple Mac OS versions will support with respect to SMB3 encryption:

Operating System	Description
Windows Vista/Server 2008	Can only access non-encrypted shares if cluster is configured to allow non-encrypted connections
Windows 7/Server 2008 R2	Can only access non-encrypted shares if cluster is configured to allow non-encrypted connections
Windows 8/Server 2012	Can access encrypted share (and non-encrypted shares if cluster is configured to allow non-encrypted connections)
Windows 8.1/Server 2012 R2	Can access encrypted share (and non-encrypted shares if cluster is configured to allow non-encrypted connections)
Windows 10/Server 2016	Can access encrypted share (and non-encrypted shares if cluster is configured to allow non-encrypted connections)
OSX10.12	Can access encrypted share (and non-encrypted shares if cluster is configured to allow non-encrypted connections)

Note that only operating systems which support SMB3 encryption can work with encrypted shares. These operating systems can also work with unencrypted shares, but only if the cluster is configured to allow non-encrypted connections. Other operating systems can access non-encrypted shares only if the cluster is configured to allow non-encrypted connections.

If encryption is enabled for an existing share or zone, and if the cluster is set to only allow encrypted connections, only Windows 8/Server 2012 and later and OSX 10.12 will be able to access that share or zone. Encryption cannot be turned on or off at the client level.

The following CLI procedures will configure SMB3 encryption on a specific share, rather than globally across the cluster:

As a prerequisite, ensure that the cluster and clients are bound and connected to the desired Active Directory domain (for example in this case, ad1.com).

To create a share with SMB3 encryption enabled from the CLI:

# mkdir -p /ifs/smb/data_encrypt
# chmod +a group "AD1\\Domain Users" allow generic_all /ifs/smb/data_encrypt
# isi smb shares create DataEncrypt /ifs/smb/data_encrypt --smb3-encryption-enabled true
# isi smb shares permission modify DataEncrypt --wellknown Everyone -d allow -p full

To verify that an SMB3 client session is actually being encrypted, launch a remote desktop protocol (RDP) session to the Windows client, log in as administrator, and perform the following:

Ensure a packet capture and analysis tool such as Wireshark is installed.
Start Wireshark capture using the capture filter “port 445”
Map the DataEncrypt share from the second node in the cluster
Create a file on the desktop on the client (eg. README-W10.txt).
Copy the README-W10.txt file from the Desktop on the client to the DataEncrypt shares using Windows explorer.exe
Stop the Wireshark capture
Set the Wireshark the display filter to “smb2 and ip.addr for node 1
1. Examine the SMB2_NEGOTIATE packet exchange to verify the capabilities, negotiated contexts and protocol dialect (3.1.1)
2. Examine the SMB2_TREE_CONNECT to verify the that encryption support has not been enabled for this share
3. Examine the SMB2_WRITE requests to ensure that the file contents are readable.
Set the Wireshark the display filter to “smb2 and ip.addr for node 2
1. Examine the SMB2_NEGOTIATE packet exchange to verify the capabilities, negotiated contexts and protocol dialect (3.1.1)
2. Examine the SMB2_TREE_CONNECT to verify the that encryption support has been enabled for this share
3. Examine the communication following the successful SMB2_TREE_CONNECT response that the packets are encrypted
: Save the Wireshark Capture to the DataEncrypt share using the name Win10-SMB3EncryptionDemo.pcap.

Additionally, the following CLI syntax can also be used to quickly verify if an SMB3 session is truly encrypted by querying the platform API (pAPI) endpoint:

# curl -vk -u root https://127.0.0.1:8080/platform/16/protocols/smb/sessions | egrep "computer|client_type|encryption"

Enter host password for user 'root':
…
"client_type" : "SMB 3.1.1",

"computer" : "192.168.1.7",

"encryption" : false,

#

SMB3 encryption can also be applied globally to a cluster. This will mean that all the SMB communication with the cluster will be encrypted, not just with individual shares. SMB clients that don’t support SMB3 encryption will only be able to connect to the cluster so long as it is configured to allow non-encrypted connections. The following table presents the available global SMB3 encryption config options:

Setting	Description
Disabled	Encryption for SMBv3 clients in not enabled on this cluster.
Enable SMB3 encryption	Permits encrypted SMBv3 client connections to Isilon clusters, but does not make encryption mandatory. Unencrypted SMBv3 clients can still connect to the cluster when this option is enabled. Note that this setting does not actively enable SMBv3 encryption: To encrypt SMBv3 client connections to the cluster, you must first select this option and then activate encryption on the client side. This setting applies to all shares in the cluster.
Reject unencrypted SMB3 client connections	Makes encryption mandatory for all SMBv3 client connections to the cluster. When this setting is active, only encrypted SMBv3 clients can connect to the cluster. SMBv3 clients that do not have encryption enabled are denied access. This setting applies to all shares in the cluster.

The following CLI syntax will configure global SMB3 encryption:

# isi smb settings global modify --support-smb3-encryption=yes

Verify the global encryption settings on a cluster by running:

# isi smb settings global view | grep -i encrypt

  Reject Unencrypted Access: Yes

    Support Smb3 Encryption: Yes

Global SMB3 encryption can also be enabled from the WebUI by browsing to Protocols > Windows Sharing (SMB) > SMB Server Settings:

OneFS Quota Accounting

Had a couple of recent enquiries from the field regarding SmartQuotas performance. So in this article we’ll explore one of the more obscure tuning parameters of OneFS SmartQuotas.

Under the hood, SmartQuotas quota data is maintained in Quota Accounting Blocks (QABs). Each QAB contains a large number of accounting records, which need to be updated whenever a particular user adds or removes data from the quota domain, the area of the filesystem on which quotas are enabled. If a large quantity of clients are simultaneously accessing the quota domain, these blocks can become highly contended and a potential bottleneck. Similarly, if a single client (or small number of clients) consistently makes a large number of small writes to files within a single quota, write performance could again be impacted

To address this, quota accounts have a mechanism to help avoid hot spots on those nodes which are storing QABs. This can be addressed using Quota Account Constituents, or QACs, which help parallelize the accounting. QACs can boost the performance of quota accounting by creating additional QAB mirrors, which are distributed across the cluster.

QAC configuration is via the sysctl ‘efs.quota.reorganize.qac_ratio’, which increases the number of accounting constituents, which are in turn spread across a much larger number of nodes and drives. This provides better scalability by increasing aggregate throughput and reduces latencies on heavy create/delete activities when quotas are configured.

Using this parameter, the internally calculated QAC count for each quota is multiplied by the specified value. If a workflow experiences write performance issues, and it has many writes to files or directories governed by a single quota, then increasing the QAC ratio may significantly improve write performance.

The qac_ratio can be reconfigured to from its default value of none up to the maximum value of 8 via the following CLI command:

# isi_sysctl_cluster efs.quota.reorganize.qac_ratio=8

To verify the persistent change, run:

# cat /etc/mcp/override/sysctl.conf | grep qac_ratio

Although increasing the QAC count via this sysctl can improve performance on write heavy quota domains, some amount of experimentation may be required until the ideal QAC ratio value is found.

Adjusting the ‘qac_ratio’ sysctl parameter can adversely affect write performance if you apply a value that is too high, or if you apply the parameter in an environment that does not have diminished write performance due to quota contention.

To help assess write performance while tuning the QAC ration, write latency (TimeAvg) for the NFSv3 protocol, for example, can continuously be monitored by running the following CLI command:

# isi statistics protocol --protocols nfs3 --classes write --output TimeAvg --format top

OneFS Hardware Fault Tolerance

There have been several inquiries recently around PowerScale clusters and hardware fault tolerance, above and beyond file level data protection via erasure coding. So it seemed like a useful topic for a blog article, and here are some of the techniques which OneFS employs to help protect data against the threat of hardware errors:

File system journal

Every PowerScale node is equipped with a battery backed NVRAM file system journal. Each journal is used by OneFS as stable storage, and guards write transactions against sudden power loss or other catastrophic events. The journal protects the consistency of the file system and the battery charge lasts up to three days. Since each member node of a cluster contains an NVRAM controller, the entire OneFS file system is therefore fully journaled.

Proactive device failure

OneFS will proactively remove, or SmartFail, any drive that reaches a particular threshold of detected Error Correction Code (ECC) errors, and automatically reconstruct the data from that drive and locate it elsewhere on the cluster. Both SmartFail and the subsequent repair process are fully automated and hence require no administrator intervention.

Data integrity

ISI Data Integrity (IDI) is the OneFS process that protects file system structures against corruption via 32-bit CRC checksums. All OneFS blocks, both for file and metadata, utilize checksum verification. Metadata checksums are housed in the metadata blocks themselves, whereas file data checksums are stored as metadata, thereby providing referential integrity. All checksums are recomputed by the initiator, the node servicing a particular read, on every request.

In the event that the recomputed checksum does not match the stored checksum, OneFS will generate a system alert, log the event, retrieve and return the corresponding error correcting code (ECC) block to the client and attempt to repair the suspect data block.

Protocol checksums

In addition to blocks and metadata, OneFS also provides checksum verification for Remote Block Management (RBM) protocol data. As mentioned above, the RBM is a unicast, RPC-based protocol used over the back-end cluster interconnect. Checksums on the RBM protocol are in addition to the InfiniBand hardware checksums provided at the network layer, and are used to detect and isolate machines with certain faulty hardware components and exhibiting other failure states.

Dynamic sector repair

OneFS includes a Dynamic Sector Repair (DSR) feature whereby bad disk sectors can be forced by the file system to be rewritten elsewhere. When OneFS fails to read a block during normal operation, DSR is invoked to reconstruct the missing data and write it to either a different location on the drive or to another drive on the node. This is done to ensure that subsequent reads of the block do not fail. DSR is fully automated and completely transparent to the end-user. Disk sector errors and Cyclic Redundancy Check (CRC) mismatches use almost the same mechanism as the drive rebuild process.

MediaScan

MediaScan’s role within OneFS is to check disk sectors and deploy the above DSR mechanism in order to force disk drives to fix any sector ECC errors they may encounter. Implemented as one of the phases of the OneFS job engine, MediaScan is run automatically based on a predefined schedule. Designed as a low-impact, background process, MediaScan is fully distributed and can thereby leverage the benefits of a cluster’s parallel architecture.

IntegrityScan

IntegrityScan, another component of the OneFS job engine, is responsible for examining the entire file system for inconsistencies. It does this by systematically reading every block and verifying its associated checksum. Unlike traditional ‘fsck’ style file system integrity checking tools, IntegrityScan is designed to run while the cluster is fully operational, thereby removing the need for any downtime. In the event that IntegrityScan detects a checksum mismatch, a system alert is generated and written to the syslog and OneFS automatically attempts to repair the suspect block.

The IntegrityScan phase is run manually if the integrity of the file system is ever in doubt. Although this process may take several days to complete, the file system is online and completely available during this time. Additionally, like all phases of the OneFS job engine, IntegrityScan can be prioritized, paused or stopped, depending on the impact to cluster operations and other jobs.

Fault isolation

Because OneFS protects its data at the file-level, any inconsistencies or data loss is isolated to the unavailable or failing device—the rest of the file system remains intact and available.

For example, a ten node, S210 cluster, protected at +2d:1n, sustains three simultaneous drive failures—one in each of three nodes. Even in this degraded state, I/O errors would only occur on the very small subset of data housed on all three of these drives. The remainder of the data striped across the other two hundred and thirty-seven drives would be totally unaffected. Contrast this behavior with a traditional RAID6 system, where losing more than two drives in a RAID-set will render it unusable and necessitate a full restore from backups.

Similarly, in the unlikely event that a portion of the file system does become corrupt (whether as a result of a software or firmware bug, etc) or a media error occurs where a section of the disk has failed, only the portion of the file system associated with this area on disk will be affected. All healthy areas will still be available and protected.

As mentioned above, referential checksums of both data and meta-data are used to catch silent data corruption (data corruption not associated with hardware failures).The checksums for file data blocks are stored as metadata, outside the actual blocks they reference, and thus provide referential integrity.

Accelerated drive rebuilds

The time that it takes a storage system to rebuild data from a failed disk drive is crucial to the data reliability of that system. With the advent of four terabyte drives, and the creation of increasingly larger single volumes and file systems, typical recovery times for multi-terabyte drive failures are becoming multiple days or even weeks. During this MTTDL period, storage systems are vulnerable to additional drive failures and the resulting data loss and downtime.

Since OneFS is built upon a highly distributed architecture, it’s able to leverage the CPU, memory and spindles from multiple nodes to reconstruct data from failed drives in a highly parallel and efficient manner. Because a PowerScale cluster is not bound by the speed of any particular drive, OneFS is able to recover from drive failures extremely quickly and this efficiency grows relative to cluster size. As such, a failed drive within a cluster will be rebuilt an order of magnitude faster than hardware RAID-based storage devices. Additionally, OneFS has no requirement for dedicated ‘hot-spare’ drives.

Automatic drive firmware updates

Clusters support automatic drive firmware updates for new and replacement drives, as part of the non-disruptive firmware update process. Firmware updates are delivered via drive support packages, which both simplify and streamline the management of existing and new drives across the cluster. This ensures that drive firmware is up to date and mitigates the likelihood of failures due to known drive issues. As such, automatic drive firmware updates are an important component of OneFS’ high availability and non-disruptive operations strategy.

OneFS Protocol Auditing

Auditing can detect potential sources of data loss, fraud, inappropriate entitlements, access attempts that should not occur, and a range of other anomalies that are indicators of risk. This can be especially useful when the audit associates data access with specific user identities.

In the interests of data security, OneFS provides ‘chain of custody’ auditing by logging specific activity on the cluster. This includes OneFS configuration changes plus NFS, SMB, S3, and HDFS client protocol activity, which are required for organizational IT security compliance, as mandated by regulatory bodies like HIPAA, SOX, FISMA, MPAA, etc.

OneFS auditing uses Dell EMC’s Common Event Enabler (CEE) to provide compatibility with external audit applications.

A cluster can write audit events across up to five CEE servers per node in a parallel, load-balanced configuration. This allows OneFS to deliver an end to end, enterprise grade audit solution which efficiently integrates with third party solutions like Varonis DatAdvantage.

OneFS auditing provides control over exactly what protocol activity is audited. For example:

Stops collection of unneeded audit events that 3^rd party applications do not register for
Reduces the number of audit events collected to only what is needed. Less unneeded events are stored on ifs and sent off cluster.

OneFS protocol auditing events are configurable at CEE granularity, with each OneFS event mapping directly to a CEE event. This allows customers to configure protocol auditing to collect only what their auditing application requests, reducing both the number of events discarded by CEE and stored on /ifs.

The ‘isi audit settings’ command syntax and corresponding platformAPI are used to specify the desired events for the audit filter to collect.

A ‘detail_type’ field within OneFS internal protocol audit events allows a direct mapping to CEE audit events. For example:

“protocol":"SMB2",

"zoneID":1,

"zoneName":"System",

"eventType":"rename",

"detailType":"rename-directory",

"isDirectory":true,

"clientIPAddr":"10.32.xxx.xxx",

"fileName":"\\ifs\\test\\New folder",

"newFileName":"\\ifs\\test\\ABC",

"userSID":"S-1-22-1-0",

"userID":0,

Old audit events are processed and mapped to the same CEE audit events as in previous releases. Backwards compatibility is maintained with previous audit events such that old versions ignore the new field. There are no changes to external audit events sent to CEE or syslog.

New default audit events when creating an access zone

Here are the protocol audit events:

New OneFS Audit Event	Pre-8.2 Audit Event
create_file	create
create_directory	create
open_file_write	create
open_file_read	create
open_file_noaccess	create
open_directory	create
close_file_unmodified	close
close_file_modified	close
close_directory	close
delete_file	delete
delete_directory	delete
rename_file	rename
rename_directory	rename
set_security_file	set_security
set_security_directory	set_security
get_security_file,	get_security
get_security_directory	get_security
write_file	write
read_file	read

Audit Event

logon

logoff

tree_connect

The ‘isi audit settings’ CLI command syntax is a follows:

Usage:

    isi audit <subcommand>

Subcommands:

    settings    Manage settings related to audit configuration.

    topics      Manage audit topics.

    logs        Delete out of date audit logs manually & monitor process.

    progress    Get the audit event time.

All options that take <events> use the protocol audit events:

# isi audit settings view –zone=<zone>

# isi audit settings modify --audit-success=<events> --zone=<zone>

# isi audit settings modify --audit-failure=<events> --zone=<zone>

# isi audit settings modify --syslog-audit-events=<events> --zone=<zone>

When it comes to troubleshooting audit on a cluster, the ‘isi_audit_viewer’ utility can be used to list protocol audit events collected.

# isi_audit_viewer -h

Usage: isi_audit_viewer [ -n <nodeid> | -t <topic> | -s <starttime>|

         -e <endtime> | -v ]

         -n <nodeid> : Specify node id to browse (default: local node)

         -t <topic>  : Choose topic to browse.

            Topics are "config" and "protocol" (default: "config")

         -s <start>  : Browse audit logs starting at <starttime>

         -e <end>    : Browse audit logs ending at <endtime>

         -v verbose  : Prints out start / end time range before printing

             records

The new audit event type is in the ‘detail_type’ field. Additionally, any errors that are encountered while processing audit events, and when delivering them to an external CEE server, are written to the log file ‘/var/log/isi_audit_cee.log’. Additionally, the protocol specific logs will contain any issues the audit filter has collecting while auditing events.

These protocol log files are:

Protocol	Log file
HDFS	/var/log/hdfs.log
NFS	/var/log/nfs.log
SMB	/var/log/lwiod.log
S3	/var/log/s3.log

Note that, on large clusters were there is heavy 100,000 of audit writes, when running the isi_audit_viewer utility across the cluster with ‘isi_for_array’, it can potentially lead to memory and other issues – especially if outputting to a directory under /ifs. As such, consider directing the output to an non-IFS location such as /var/temp. Also, the isi_audit_viewer ‘-s’ (start time) and ‘-e’ (end time) flags can be used to limit a search (ie. for 1-5 minutes), helping reduce the size of data.

OneFS NFS Netgroups

A OneFS network group, or netgroup, defines a network-wide group of hosts and users. As such, they can be used to restrict access to shared NFS filesystems, etc. Network groups are stored in a network information services, such as LDAP, NIS, or NIS+, rather than in a local file. Netgroups help to simplify the identification and management of people and machines for access control.

The isi_netgroup_d service provides netgroup lookups and caching for consumers of the ‘isi_nfs’ library. Only mountd and the ‘isi nfs’ command-line interface use this service. The isi_netgroup_d daemon maintains a fast, persistent cluster-coherent cache containing netgroups and netgroup members. isi_netgroup_d enforces netgroup TTLs and netgroup retries. A persistent cache database (SQLite) exists to store and recover cache data across reboots. Communication with isi_netgroup_d is via RPC and it will register its service and port with the local rpcbind.

Within OneFS, the netgroup cache possesses the following gconfig configuration parameters:

# isi_gconfig -t nfs-config | grep cache

shared_config.bypass_netgroup_cache_daemon (bool) = false

netcache_config.nc_ng_expiration (uint32) = 3600000

netcache_config.nc_ng_lifetime (uint32) = 604800

netcache_config.nc_ng_retry_wait (uint32) = 30000

netcache_config.ncdb_busy_timeout (uint32) = 900000

netcache_config.ncdb_write (uint32) = 43200000

netcache_config.nc_max_hosts (uint32) = 200000

Similarly, the following files are used by the isi_netgroup_d daemon:

File	Purpose
/var/run/isi_netgroup_d.pid	The pid of the currently running isi_netgroup_d
/ifs/.ifs/modules/nfs/nfs_config.gc	Server configuration file
/ifs/.ifs/modules/nfs/netcache.db	Persistent cache database
/var/log/isi_netgroup_d.log	Log output file

In general, using IP addresses works better than hostnames for netgroups. This is because hostnames require a DNS lookup and resolution from FQDN to IP address. Using IP addresses directly saves this overhead.

Resolving a large set of hosts in the allow/deny list is significantly faster when using netgroups. Entering a large host list in the NFS export means OneFS has to look up the hosts for each individual NFS export. In Netgroups, once looked up, it is cached by netgroups, so it doesn’t have to be looked up again if there are overlap between exports. It is also better to use an LDAP (or NIS) server when using Netgroups instead of the flat file. If you have a large list of hosts in the netgroups file, it can take a while to resolve as it is single threaded and sequential. LDAP/NIS provider based netgroups lookup is parallelized.

The OneFS netgroup cache has a default limit in gconfig of 200,000 host entries.

# isi_gconfig -t nfs-config | grep max

netcache_config.nc_max_hosts (uint32) = 200000

So what is the waiting period between when /etc/netgroup is updated to when the NFS export realizes the change? OneFS uses a netgroup cache and both its expiration and lifetime are both tunable. The netgroup expiration and lifetime can be configured with this following CLI command:

# isi nfs netgroup modify

--expiration or -e <duration>

Set the netgroup expiration time.

--lifetime or -l <duration>

Set the netgroup lifetime.

OneFS also provides the ‘isi nfs netgroups flush’ CLI command, which can be used to force a reload of the file.

# isi nfs netgroup flush

        [--host <string>]

        [{--verbose | -v}]

        [{--help | -h}]


Options:

    --host <string>

        IP address of the node to flush. Defaults is all nodes.


  Display Options:

    --verbose | -v

        Display more detailed information.

    --help | -h

        Display help for this command.

However, it is not recommended to flush the cache as a part of normal cluster operation. Refresh will walk the file and update the cache as needed.

Another area of caution is applying a netgroup with unresolved hostname(s). This will also slow down resolution of the hosts in the file when a refresh or startup of node happens. The best practice is to ensure that each host in the netgroups file is resolvable in DNS, or to just use IP addresses rather than names in the netgroup.

When it come to switching to a netgroup for clients already on an export, a netgroup can be added and clients removed in one step (#1 –add-client netgroup –remove-clients 1,2,3,etc). The cluster allows a mix of netgroup and host entries, so duplicates are tolerated. However, it’s worth noting that if there are unresolvable hosts in both areas, the startup resolution time will take that much longer.

OneFS & Files Per Directory

Had several recent enquiries from the field recently asking about the low impact methods to count the number of files in large directories containing hundreds of thousands to millions of files).

Unfortunately, there’s no ‘silver bullet’ command or data source available that will provide that count instantaneously: Something will have to perform a treewalk to gather these stats. That said, there are a couple of approaches to this, each with its pros and cons:

If the customer has a SmartQuotas license, they can configure an advisory directory quota on the directories they want to check. As mentioned, the first job run will require working the directory tree, but they can get fast, low impact reports moving forward.
Another approach is using traditional UNIX commands, either from the OneFS CLI or, less desirably, from a UNIX client. The two following commands will both take time to run: “

# ls -f /path/to/directory | wc –l

# find /path/to/directory -type f | wc -l

It’s worth noting that when counting files with ls, you’ll probably get faster results by omitting the ‘-l’ flag and using ‘-f’ flag instead. This is because ‘-l’ resolves UID & GIDs to display users/groups, which creates more work thereby slowing the listing. In contrast, ‘-f’ allows the ‘ls’ command to avoid sorting the output. This should be faster, and reduce memory consumption when listing extremely large numbers of files.

Ultimately, there really is no quick way to walk a file system and count the files – especially since both ls and find are single threaded commands. Running either of these in the background with output redirected to a file is probably the best approach.

Depending on your arguments for the ls or find command, you can gather a comprehensive set of context info and metadata on a single pass.

# find /path/to/scan -ls > output.file

It will take quite a while for the command to complete, but once you have the output stashed in a file you can pull all sorts of useful data from it.

Assuming a latency of 10ms per file it would take 33 minutes for 200,000 files. While this estimate may be conservative, there are typically multiple protocol ops that need to be done to each file, and they do add up. Plus, as mentioned before, ‘ls’ is a single threaded command.

If possible, ensure the directories of interest are stored on a file pool that has at least one of the metadata mirrors on SSD (metadata-read).
Windows Explorer can also enumerate the files in a directory tree surprisingly quickly. All you get is a file count, but it can work pretty well.
If the directory you wish to know the file count for just happens to be /ifs, you can run the LinCount job, which will tell you how many LINs there are in the file system.

Lincount (relatively) quickly scans the filesystem and returns the total count of LINs (logical inodes). The LIN count is essentially equivalent to the total file and directory count on a cluster. The job itself runs by default at the LOW priority, and is the fastest method of determining object count on OneFS, assuming no other job has run to completion.

The following syntax can be used to kick off the Lincount job from the OneFS CLI:

# isi job start lincount

The output from this will be along the lines of “Added job [52]”.

Note: The number in square brackets is the job ID.

To view results, run the following command from the CLI:

# isi job reports view [job ID]

For example:

# isi job reports view 52

LinCount[52] phase 1 (2021-07-06T09:33:33)

------------------------------------------

Elapsed time   1 seconds

Errors         0

Job mode       LinCount

LINs traversed 1722

SINs traversed 0

The "LINs traversed" metric indicates that 1722 files and directories were found.

Note: The Lincount job will also include snapshot revisions of LINs in its count.

Alternatively, if another treewalk job has run against the directory you wish to know the count for, you might be in luck.

At any rate, hundreds of thousands of files is a large number to store in one directory. To reduce the directory enumeration time, where possible divide the files up into multiple subdirectories.

When it comes to NFS, the behavior is going to partially depend on whether the client is doing READDIRPLUS operations vs READDIR. READDIRPLUS is useful if the client is going to need the metadata. However, ff all you’re trying to do is list the filenames, it actually makes that operation much slower.

If you only read the filenames in the directory, and you don’t attempt to stat any associated metadata, then this requires a relatively small amount of I/O to pull the names from the meta-tree, and should be fairly fast.

If this has already been done recently, some or all of the blocks are likely to already be in L2 cache. As such, a subsequent operation won’t need to read from hard disk and will be substantially faster.

NFS is more complicated regarding what it will and won’t cache on the client side, particularly with the attribute cache and the timeouts that are associated with it.

Here are some options from fastest to slowest:

If NFS is using READDIR, as opposed to READDIRPLUS, and the ‘ls’ command is invoked with the appropriate arguments to prevent it polling metadata or sorting the output, execution will be relatively swift.
If ‘ls’ polls the metadata (or if NFS uses READDIRPLUS) but doesn’t sort the results, output will be fairly immediately, but will take longer to complete overall.
If ‘ls’ sorts the output, nothing will be displayed until ls has read everything and sorted it, then you’ll get the output in a deluge at the end.

OneFS MCP

Affectionately named after TRON’s ‘Master Control Program’, MCP is OneFS’ main utility for distributed service control across a cluster. MCP is responsible for starting, monitoring, and restarting failed services on a cluster. It also monitors configuration files and acts upon configuration changes, propagating local file changes to the rest of the cluster. As such, it performs a similar function to the Windows ‘service control manager’ (SCM) or MacOS ‘launchd’.

MCP is actually comprised of three different processes, one for each of its modes:

Master
Failsafe
Forker

These can be seen when viewing the running processes on a healthy node:

# ps -auxw | grep -i mcp | grep -v grep

root    5400    0.4  0.0  60760  19928  -  Ss   11Jun21    170:08.18 isi_mcp: master (isi_mcp)

root    5179    0.0  0.0  32760  13632  -  Is   11Jun21      0:00.01 isi_mcp: failsafe (isi_mcp)

root    5181    0.0  0.0  31476  12572  -  Is   11Jun21      0:00.36 isi_mcp: forker (isi_mcp)

The ‘Master’ is the central MCP process and does the bulk of the work. It monitors files and services, including the failsafe process, and delegates actions to the forker process.

The role of the ‘Forker’ is to receive command-line actions from the master, execute them, and return the resulting exit codes. It receives actions from the master process over a UNIX domain socket. If the forker is inadvertently or intentionally killed, it’s automatically restarted by the master process. If necessary, MCP will continue trying to restart the forker at an increasing interval. If, after around ten minutes of unsuccessfully attempting to restart the forker, MCP will fire off a CELOG alert, and continue trying. A second alert would then be sent after thirty minutes.

The ‘Failsafe’ process is responsible for starting, monitoring, restarting, and stopping both the Master and Forker. It’s a single threaded process that, if killed, will shut down all three MCP services. If this occurs, the three services will stay down until they are restarted with the ‘isi_mcp’ CLI command. If the master fails and can’t be restarted, MCP will continue attempting to restart it and fire alerts in the same manner as described above for the forker service.

MCP monitors the following files:

File Type	Function
/etc/mcp/sys/files/*	Configuration files monitored by MCP.
/etc/mcp/sys/services/*	Services that MCP starts and monitors.
/etc/ifs/array.xml	Cluster configuration file.
/etc/mcp/override/*	All files in override directory propagated to all nodes and entered in global mlist.
/etc/mcp/mlist.xml	Local mlist (mlists are used to manage and track the above files)
/ifs/.ifsvar/etc/mcp/mlist.xml	Master mlist

The following command will list the open files that MCP is currently monitoring on a node:

# for i in `sysctl efs.bam.busy_vnodes | grep -i mcp | awk '{print $4}' | sed -E 's/)//'`; do isi get -L $i | awk '{print $8}'; done

MCP monitors the configuration files in /etc/mcp/sys/files. While monitoring the configuration files MCP does two things:

Performs the file change action
Propagates config file changes to other nodes

Consider the XML configuration file for the ndmpd service, for example:

# cat /etc/mcp/sys/services/ndmpd

<?xml version="1.0"?>

<service name="ndmpd" enable="0" display="1" options="require-quorum,kill-on-sigquorum,require-post-ifs">

      <isi-meta-tag id="ndmp_service">

        <mod-attribs>enable</mod-attribs>

      </isi-meta-tag>

      <description>Network Data Management Protocol Daemon</description>

      <process name="isi_ndmp_d" pidfile="/var/run/isi_ndmp_d.pid"

               startaction="start" stopaction="stop"/>

      <actionlist name="start">

        <action>/usr/bin/isi_ndmp_d</action>

      </actionlist>

      <actionlist name="stop">

        <action>/usr/bin/killall isi_ndmp_d</action>

      </actionlist>

</service>

Much of what MCP does in response to an event notification is defined by the ‘actionlist’ in a config file. This is simply a list of commands to be executed, with action lists for starting and stopping services, and also for specific configuration files changes (for example, importing a product license).

Many of the local configuration files need to be uniform across the cluster so, unless the ‘notify =0’ flag is set, the master process also copies changed files to /ifs for MCP on other nodes to use.

MCP starts and watches already running services in accordance with their service description files, stored under /etc/mcp/sys/services. These are XML files which describe how a service is to be started when enabled or stopped when disabled.

The XML file also lists under ‘options’ the conditions of the node and/or cluster that must be met before the service can be started (for example above, ‘require-quorum’ or ‘require-post-ifs’, etc).

When a service is monitored, MCP ensures the correct state of the service on a node. If a service is marked ‘enable’, MCP will run the start action until the PID confirms it as running. When a service is marked ‘disable’, MCP will kill the service via its PID. The full list of services and their current state can be viewed with the following CLI command:

# isi services -a

MCP monitors services by observing their PID files (under /var/run), plus the process table itself, to determine if a process is already running or not. It compares this state against the ‘enabled/disabled’ state for the service and determines whether any start or stop actions are required. Services may also be configured to terminate if the cluster loses quorum with the option ‘kill-on-sigquorum’ in their XML file.

Another type of configuration file that MCP monitors is also known as a service override file, which live under /etc/mcp/override. These override files are used to store any current settings for options which have been changed from the defaults. The override files are always shared across the cluster via MCP’s configuration propagation mechanism.

The Master MCP process creates merged lists, or mlists, that are used to track and coordinate the process of managing the XML config and service description files. There are two types of mlist: Local and Master. The master process will automatically create the local mlist at startup if it doesn’t already exist. However, the master mlist is created later since MCP starts and begins operations before /ifs is mounted.

Here’s the mlist entry for the cluster’s NTP service, for example:

    <file>

      <path>/etc/mcp/templates/ntp.conf</path>

      <md5>7772b5d50494c85043933321c21dbb8d</md5>

      <timestamp>1623466667</timestamp>

      <revision>1</revision>

      <array_id>1</array_id>

    </file>

The local mlist has an entry for every file identified in the MCP file configuration files (/etc/mcp/sys/files), an entry for every configuration file (/etc/mcp/sys/files & procs), an entry for an override file for each service (may or may not exist), an entry for /etc/ifs/array.xml. It also contains an entry for the master mlist (/ifs/.ifsvar/etc/mcp/mlist.xml).

# grep mlist.xml mlist.xml

      <path>/ifs/.ifsvar/etc/mcp/mlist.xml</path>

The mlist has an entry for every local file that’s shared across the cluster and the override service files. A coordinator lock file prevents different nodes from making changes to /ifs at the same time.

If MCP attempts to start a service and fails, as long as the service is enabled, it will wait for an interval before attempting to start the service again. This interval doubles in size each time, until it reaches 256 seconds then remains at this frequency.

OneFS SmartFail

OneFS protects data stored on failing nodes or drives in a cluster through a process called Smartfail. During the process, OneFS places a device into quarantine and, depending on the severity of the issue, the data on it into a read-only state. While a device is quarantined, OneFS reprotects the data on the device by distributing the data to other devices.

After all data eviction or reconstruction is complete, OneFS logically removes the device from the cluster and the node or drive can be physically replaced. OneFS only automatically Smartfails devices as a last resort. Nodes and/or drives can also be manually Smartfailed. However, it is strongly recommended to first consult Dell Technical Support.

Occasionally a device might fail before OneFS detects a problem. If a drive fails without being Smartfailed, OneFS automatically starts rebuilding the data to available free space on the cluster. However, because a node might recover from a transient issue, if a node fails, OneFS does not start rebuilding data unless it is logically removed from the cluster.

A node that is unavailable and reported by isi status as ‘D’, or down, can be Smartfailed. If the node is hard down, likely with a significant hardware issue, the Smartfail process will take longer because data has to be recalculated from the FEC protection parity blocks. That said, it’s well worth attempting to bring the node up if at all possible – especially if the cluster and/or node pools is at the default +2D:1N protection. The concern here is that, with a node down, there is a risk of data loss if a drive or other component goes bad during the Smartfail process.

If possible, and assuming the disk content is still intact, it can often be quicker to have the node hardware repaired. In this case, the entire node’s chassis (or compute module in the case of Gen 6 hardware) could be replaced and the old disks inserted with original content on them. This should only be performed at the recommendation and under the supervision of Dell Technical support. If the node is down as a result of a journal inconsistency, it will have to be Smartfailed out. In this case, Support should be engaged to determine an appropriate action plan.

The recommended procedure for Smartfailing a node is as follows. In this example, we’ll assume that node 4 is down:

From the CLI of any node except node 4, run the following command to Smartfail out the node:

# isi devices node smartfail --node-lnn 4

Verify that node is removed from the cluster.

# isi status –q

(An ‘—S-’ will appear in node 4’s ‘Health’ column to indicate it has been Smartfailed).

Monitor the successful completion of the job engine’s MultiScan, FlexProtect/FlexProtectLIN jobs:

# isi job status

Un-cable and remove the node from the rack for disposal

As mentioned above, there are two primary Job Engine jobs that run as a result of a Smartfail:

MultiScan
FlexProtect or FlexProtectLIN

MultiScan performs the work of the both AutoBalance and Collect jobs simultaneously, and it is triggered after every group change. The reason is that new file layouts and file deletions that happen during a disruption to the cluster may be imperfectly balanced or, in the case of deletions, simply lost.

The Collect job reclaims free space from previously unavailable nodes or drives. A mark and sweep garbage collector, it identifies everything valid on the filesystem in the first phase, then in the second phase scans the drives freeing anything that isn’t marked valid.

AutoBalance ensures that, when node and drive usage across the cluster are out of balance. This job scans through all the drives looking for files to re-layout to make use of the less filled devices.

The purpose of the FlexProtect job is to scan the file system after a device failure to ensure that all files remain protected. Incomplete protection levels are fixed, in addition to missing data or parity blocks caused by drive or node failures. This job is started automatically after Smartfailing a drive or node. If a Smartfailed device was the reason the job started, the device is marked gone (completely removed from the cluster) at the end of the job.

Although a new node can be added to a cluster at any time, it’s best to avoid major group changes during a Smartfail operation. This helps avoid any unnecessary interruptions of a critical job engine data reprotection job. However, since a node is down, there is a window of risk while the cluster is rebuilding the data from that cluster. Under pressing circumstances the Smartfail operation can be paused, the node added, and then Smartfail resumed once the new node has happily joined the cluster.

Be aware that, if the node you are adding is the same node that was Smartfailed, the cluster maintains a record of that node and may prevent the re-introduction of that node until the Smartfail is complete. To mitigate risk, Dell Technical Support should definitely be involved to ensure data integrity.

The time for a Smartfail to complete is hard to predict with any accuracy, and is dependent on:

Attribute	Description
OneFS release	Determines OneFS job engine version and how efficiently it operates.
System hardware	Drive types, CPU, RAM, etc.
File system	Quantity and type of data (ie. small vs large files), protection, tunables, etc.
Cluster load	Processor and CPU utilization, capacity utilization, etc.

Typical Smartfail runtimes range from minutes for fairly empty, idle nodes with SSD and SAS drives to days for nodes with large SATA drives and a high capacity utilization. The FlexProtect job already runs at the highest job engine priority (value=1) and medium impact by default. As such, there isn’t much that can be done to speed up this job, beyond reducing cluster load.

Smartfail is also a valuable tool for proactive cluster node replacement, for example during a hardware refresh. Provided cluster quorum is not broken, a Smartfail can be initiated on multiple nodes concurrently – but never more than n/2 – 1 nodes (rounded up)!

If replacing an entire node-pool as part of a tech refresh, a SmartPools filepool policy can be crafted to migrate the data to another nodepool across the back-end network. When complete, the nodes can then be Smartfailed out, which should progress swiftly since they are now empty.

If multiple nodes are Smartfailed simultaneously, at the final stage of the process the node remove is serialized with around 60 seconds pause between each. The Smartfail job places the selected nodes in read-only mode while it copies the protection stripes to the cluster’s free space. Using SmartPools to evacuate data from a node or set of nodes in preparation to remove them is generally a good idea, and is usually a relatively fast process.

SmartPools’ Virtual Hot Spare (VHS) functionality helps ensure that node pools maintain enough free space to successfully re-protect data in the event of a Smartfail. Though configured globally, VHS actually operates at the node pool level so that nodes with different size drives reserve the appropriate VHS space. This helps ensures that, while data may move from one disk pool to another during repair, it remains on the same class of storage. VHS reservations are cluster wide and configurable as either a percentage of total storage (0-20%) or as a number of virtual drives (1-4), with the default being 10%.

Note, a Smartfail is not guaranteed to remove all data on a node. Any pool in a cluster that’s flagged with the ‘System’ flag can store /ifs/.ifsvar data. A filepool policy to move the regular data won’t address this data. Also, since SmartPools ‘spillover’ may have occurred at some point, there are no guarantees that an ‘empty’ node is completely devoid of data. For this reason, OneFS still has to search the tree for files that may have blocks residing on the node.

Nodes can be easily Smartfailed via the OneFS WebUI by navigating to Cluster Management > Hardware Configuration and selecting ‘Actions > More > Smartfail Node’ for the desired node(s):

Similarly, the following CLI commands initiate and halt the node Smartfail process respectively. Firstly, the ‘isi devices node smartfail’ command kicks off the Smartfail process on a node and removes it from the cluster.

# isi devices node smartfail -h

Syntax

# isi devices node smartfail

[--node-lnn <integer>]

[--force | -f]

[--verbose | -v]

If necessary, the ‘isi devices node stopfail’ command can be used to discontinue the Smartfail process on a node.

# isi devices node stopfail -h

Syntax

isi devices node stopfail

[--node-lnn <integer>]

[--force | -f]

[--verbose | -v]

Similarly, individual drives within a node can be Smartfailed with the ‘isi devices drive smartfail’ CLI command.

# isi devices drive smartfail { <bay> | --lnum <integer> | --sled <string> }

        [--node-lnn <integer>]

        [{--force | -f}]

        [{--verbose | -v}]

        [{--help | -h}]

When it comes to Smartfailing PowerScale chassis-based nodes, there are a couple of other things to be aware of regarding the mirrored journal:

When you smartfail a node in a node pair, you do not have to smartfail its partner node.
A node will still run indefinitely with its partner missing. However, this significantly increases the window of risk since there’s no journal mirror to rely on (in addition to lack of redundant power supply, etc).
If you do smartfail a single node in a pair, the journal is still protected by the vault and powerfail memory persistence.