OneFS Quota Domains

In the previous article, we looked at the use of protection domains in OneFS, focusing on SyncIQ replication, SmartLock immutable archiving, and Snapshots and SnapRevert.

Under the hood, SmartQuotas is also based on the concept of domains – the linchpins of quota accounting. Since OneFS is a single file system, it relies on accounting domains for defining the scope of a quota in place of the typical volume boundaries found in most storage systems. As such, a domain defines which files belong to a quota, accounts for each resource type in that set and defines the top-level directory configuration point.

For SmartQuotas, the three main resource types are:

Resource Type   Description
-------------   -----------------------------------------------
Directory       A specific directory and all its subdirectories
User            A specific user
Group           All members of a specific group

A domain defined as “name@folder” is the set of files under “folder” owned by “name”, where “name” can be either a user or a group. The files accounted for include all files reachable from the given path, without traversing any soft links. The owner “name” can be ALL, and “/ifs”, the OneFS root directory, acts as an effective ALL for “folder”.

With SmartQuotas it’s easy to create traditional domain types quickly by using “ALL”. The following are examples of domain types:

  • All files belonging to user Jane: user:Jane@/ifs
  • All files under /ifs/home, belonging to any user: ALL@/ifs/home
  • All files under /ifs/home that belong to user Jane: user:Jane@/ifs/home

Domains cannot be created on anything but directories. More specifically, domains are associated with the actual directories themselves, not directory paths. For example, if the domain is ALL@/ifs/home/data, but /ifs/home/data gets renamed to /ifs/home/files, the domain stays with the directory.

Domains can also be nested and may overlap. For example, say a hard quota is set on /ifs/data/marketing for 5TB. 1TB soft quotas are then placed on individual users in the marketing department. This ensures that the marketing directory as a whole never exceeds 5TB, while limiting the users in the marketing department to 1TB each.
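For example, a minimal CLI sketch of this layout, reusing the quota-creation flags from the default-directory examples later in this article (the per-user specifier syntax is an assumption and may vary by release, and the user name is hypothetical):

# isi quota create --path=/ifs/data/marketing --type=directory --hard-threshold=5T
# isi quota create --path=/ifs/data/marketing --type=user --user=jane --soft-threshold=1T --soft-grace=7D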

A default quota domain is one that does not account for any specific set of files but instead specifies a policy for new domains that match a specific trigger. In other words, default domains are configuration templates for actual domains. SmartQuotas uses the identity notation ‘default-user’, ‘default-group’, and ‘default-directory’ to describe domains with default policies. For example, the domain default-user@/ifs/home becomes specific-user@/ifs/home for each specific-user that is not otherwise defined. All enforcements on default-user are copied to specific-user when specific-user allocates within the domain, and the new inherited domain quota is termed a linked quota. There may be overlapping defaults (i.e. default-user@/ifs and default-user@/ifs/home may both be defined).
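As a sketch, such a default-user template might be created as follows (the ‘default-user’ type value is an assumption based on the default-directory syntax shown below, and the thresholds are illustrative):

# isi quota create --path=/ifs/home --type=default-user --soft-threshold=500G --soft-grace=7D

Each user who subsequently allocates under /ifs/home would then receive their own linked quota, cloned from this template.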

Default quota domains drastically simplify quota management for large environments by providing a mechanism to define top-level template configurations from which many actual quotas are cloned, or linked. When a default quota domain is configured on a directory, any subdirectories created directly underneath it automatically inherit the quota limits specified in the parent domain. This streamlines the provisioning and management of quotas in large enterprise environments. Furthermore, default directory quotas can co-exist with user and/or group quotas and legacy default quotas.

Default directory quotas have been available since OneFS 8.2, in addition to the default user and group quotas available in earlier releases. For example:

  • Create a default-directory quota:
# isi quota create --path=/ifs/parent-dir --type=default-directory --hard-threshold=10M
  • Modify a default-directory quota:
# isi quota modify --path=/ifs/parent-dir --type=default-directory --advisory-threshold=6M --soft-threshold=7M --soft-grace=1D
  • List default-directory quotas:
# isi quota list
Type              AppliesTo  Path            Snap  Hard   Soft  Adv   Used
--------------------------------------------------------------------------
default-directory DEFAULT    /ifs/parent-dir No    10.00M -     6.00M 0.00
--------------------------------------------------------------------------
Total: 1
  • Delete a default-directory quota:
# isi quota delete --path=/ifs/parent-dir --type=default-directory

If the enforcements on a default domain change, SmartQuotas will automatically propagate the changes to the Linked Quota domains. If a default quota domain is deleted, SmartQuotas will delete all children marked as inherited. An administrator may also choose to delete the default without deleting the children, but this will break inheritance on all inherited children.

For example, creating or deleting a subdirectory directly beneath a default-directory quota’s parent folder automatically creates or removes the corresponding inherited (linked) directory quota.
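As a sketch, with the default-directory quota from the example above in place on /ifs/parent-dir (the subdirectory name is hypothetical):

# mkdir /ifs/parent-dir/proj1
(a linked directory quota for /ifs/parent-dir/proj1 now appears in ‘isi quota list’)
# rmdir /ifs/parent-dir/proj1
(the linked quota is automatically removed)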

A domain may be in one of three accounting states, as follows:

Quota Accounting State   Description
----------------------   -----------
Ready                    A domain in the ready state is fully accounted. SmartQuotas displays ‘ready’ domains in all interfaces, and all enforcements apply to such domains.
Accounting               A domain is placed in the accounting state when it is waiting on accounting updates.
Deleting                 After a request to delete a domain, SmartQuotas places the domain in the deleting state until tear-down is complete. Domain removal may be a lengthy process.

SmartQuotas displays accounting domains in all interfaces, including usage data, but indicates that they are still in the process of being accounted. SmartQuotas applies all enforcements to accounting domains, even though this means it might reject an allocation that would have been permitted had the QuotaScan completed.

Domains in the deleting state are hidden from all interfaces, and the top-level directory of a domain may be deleted while the domain is still in the deleting state (assuming no domains in the ready or accounting state are defined on the directory). No enforcements are applied to domains in the deleting state.

A quota scan is performed while the domain is in the accounting state. This can occur during quota creation, to account the new domain, or during quota deletion, to un-account it. A QuotaScan is required when creating a quota on a non-empty directory; if quotas are created up front on an empty directory, no QuotaScan is necessary.

In addition, a QuotaScan job may be started from the WebUI or the command line interface via the ‘isi job’ command. Any path specified on the command line is treated as the root of a tree that should be processed. This is provided primarily as a means to re-scan a directory for maintenance reasons.
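For example, to kick off a manual QuotaScan run (per the job syntax used throughout this series; any per-path option is release-dependent and omitted here):

# isi job jobs start quotascan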

There are three main processes, or daemons, associated with SmartQuotas:

  • isi_quota_notify_d
  • isi_quota_sweeper_d
  • isi_quota_report_d

The job of the notification daemon, isi_quota_notify_d, is to listen for ‘limit exceeded’ and ‘link denied’ events and generate notifications for each. It also responds to configuration change events and instructs the quota database (QDB) to generate ‘expired’ and ‘violated’ over-threshold notifications.

A quota sweeper daemon, isi_quota_sweeper_d, is responsible for a number of quota housekeeping tasks such as propagating default changes, domain and notification rule garbage collection and kicking off QuotaScan jobs when necessary.

Finally, the reporting daemon, isi_quota_report_d, is responsible for generating quota reports. Since the QDB only maintains real-time resource usage, reports are necessary for providing point-in-time views of a quota domain’s usage. These historical reports are useful for trend analysis of quota resource usage.

OneFS 8.2 and subsequent releases use the rpc.quotad service to facilitate client-side quota reporting on UNIX and Linux clients via the native ‘quota’ tools. The service, which runs on TCP/UDP port 762, is enabled by default, and its control sits under the NFS global settings.
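For example, a quick sketch from a Linux client with the cluster export NFS-mounted (the user name is a placeholder):

# quota -v -u jane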

Additionally, in OneFS 8.2 and later, users see the available capacity defined by their soft and/or hard user and group quotas, rather than the entire cluster capacity or parent directory quotas. This avoids the ‘illusion’ of available space that is not actually associated with their quotas.

OneFS Protection Domains

In OneFS, a domain defines a set of behaviors for a collection of files under a specified directory tree. More specifically, a protection domain is a marker which prevents a configured subset of files and directories from being deleted or modified.

If a directory has a protection domain applied to it, that domain will also affect all of the files and subdirectories under that top-level directory. As we’ll see, in some instances, OneFS creates protection domains automatically, but they can also be configured manually.

With the recent introduction of domain-based snapshots, OneFS now supports four types of protection domain:

  • SnapRevert domains
  • SmartLock domains
  • SyncIQ domains
  • Snapshot domains

The process of restoring a snapshot in full to its top-level directory can easily be accomplished by the SnapRevert job. This enables cluster administrators to quickly revert to a previous, known-good recovery point – for example, in the event of a virus or malware outbreak. The SnapRevert job can be run from the Job Engine WebUI or CLI, and simply requires adding the desired snapshot ID.
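For example, a sketch of the sequence from the CLI: first look up the ID of the desired snapshot, then pass it to the SnapRevert job (the ‘--snapid’ option name is an assumption; verify the exact syntax on your release):

# isi snapshot snapshots list
# isi job jobs start snaprevert --snapid 42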

SnapRevert domains are assigned to directories that are contained in snapshots to prevent files and directories from being modified while a snapshot is being reverted. OneFS does not automatically create SnapRevert domains. The SnapRevert domain is described as a ‘restricted writer’ domain, in OneFS jargon. Essentially, this is a piece of extra filesystem metadata and associated locking that prevents a domain’s files being written to while restoring a last known good snapshot.

Because the SnapRevert domain is essentially just a metadata attribute, or marker, placed onto a file or directory, a preferred practice is to create the domain before there is data. This avoids having to wait for DomainMark or DomainTag (the aptly named Job Engine jobs that mark a domain’s files) to walk the entire tree, setting that attribute on every file and directory within it.

There are two main components to SnapRevert:

  • The file system domain that the objects are put into.
  • The job that reverts everything back to what’s in a snapshot.

The SnapRevert job itself actually uses a local SyncIQ policy to copy data out of the snapshot, discarding any changes to the original directory. When the SnapRevert job completes, the original data is left in the directory tree. In other words, after the job completes, the file system (HEAD) is exactly as it was at the point in time that the snapshot was taken. The LINs for the files/directories don’t change, because what’s there is not a copy.

The SnapRevert job can either be scheduled or manually run from the OneFS WebUI by navigating to Cluster Management > Job Operations > Job Types > SnapRevert and clicking the ‘Start Job’ button.

A snapshot can’t be reverted until a SnapRevert domain has been created on its top-level directory. If necessary, SnapRevert domains can also be nested. For example, domains could be successfully created on both /ifs/snap1 and /ifs/snap1/snap2. Also, a SnapRevert domain can easily be deleted once you no longer need to restore snapshots of that directory.

It’s worth noting that CloudPools also supports SnapRevert for SmartLink (stub) files. For example, if CloudPools archived “/ifs/cold_data”, the files in this directory would be replaced with stubs and the data moved off to the cloud provider of choice. If you then created a domain for the directory and ran the SnapRevert job, the original files would be restored to the directory, and CloudPools would remove any cloud data that was created as part of the original archive process.

SmartLock domains are assigned to WORM (write once, read many) immutable archive directories to prevent committed files from being modified or deleted. OneFS automatically sets up a SmartLock domain when a SmartLock directory is created. Note that a SmartLock domain cannot be manually deleted. However, if you remove a SmartLock directory, OneFS automatically deletes the associated SmartLock domain.

Once a file is SmartLocked (WORM committed), it cannot ever be modified or moved, and it cannot be deleted until its ‘committed until’, or expiry, date has passed. Even once the expiry date has passed (i.e. the file is in an ‘expired’ state), it still cannot be modified or moved. All you can do with an expired file is either delete it or extend its ‘committed until’ date into the future.
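As a sketch, committing a file in a SmartLock directory is typically a two-step affair: set the desired retention date by adjusting the file’s access time, then remove write permissions to commit it (the path and date here are hypothetical, and the behavior should be verified on your release):

# touch -a -t 202601010000 /ifs/worm/file1
# chmod a-w /ifs/worm/file1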

SyncIQ domains can be assigned to both the source and target directories of replication policies. OneFS automatically creates a SyncIQ domain for the target directory of a replication policy the first time that the policy is run. OneFS also automatically creates a SyncIQ domain for the source directory of a replication policy during the failback process.

A SyncIQ domain can be manually created for a source directory before initiating the failback process, by configuring the policy for accelerated failback. However, a SyncIQ domain that marks the target directory of a replication policy cannot be deleted.

SnapshotIQ also uses a domain-based model for governance of scheduled snapshots in OneFS 8.2 and later releases. By utilizing the OneFS IFS domains infrastructure, recurring snapshot efficiency and performance is increased by limiting the scope of governance to a smaller, well defined domain boundary.

IFS domains provide a mark job that proactively marks all the files in the domain. Once a domain has been fully marked, creating a new snapshot on it does not trigger any further ‘painting’ operations: the new snapshot ID is simply added to the domain’s data section, avoiding a significant portion of the resource overhead of taking a new snapshot. Domains are re-used whenever possible.

Creating two domains of the same type on the same directory causes the second domain to become an alias of the first. Aliases don’t require marking, since they share the existing marks, which benefits both snapshots and snapshot schedules taken on the same directory. For all these reasons, the number of I/O and locking operations needed to resolve snapshot governance is greatly reduced. And because the snapshot IDs are stored in a single location, as opposed to being stored on individual inodes, garbage collection of snapshot IDs upon snapshot deletion is also greatly simplified.

The illustration above shows an example of domain-based snapshots, in which a snapshot was taken on the ‘projects’ directory, and then on the directory named ‘video’. File v1.mp4 is tagged with both domain IDs, making it more efficient to determine snapshot governance.

A snapshot of file v1.mp4 creates a snap_ID in the domain’s SBT (system b-tree), providing a single place to store snapshot metadata. In previous OneFS versions, snapIDs were stored in the inode, which resulted in duplication of snap_IDs and increased metadata usage.

Note that only snapshots taken after upgrade to OneFS 8.2 will use IFS domains backing. Any snapshots created prior to upgrade will not be converted and will remain in their original form.

Additionally, the new domain-based snapshot functionality in OneFS 8.2 brings other benefits including:

  • Improved management of SnapIDs
  • Reduced number of operations needed to resolve snapshot governance
  • More efficient use of metadata
  • The automatic exclusion of the cluster’s /ifs/.ifsvar subtree from all root (/ifs) snapshots – although this behavior is configurable
  • An enhanced write cache, or coalescer, to better support parallel snapshot creates
  • An improved snapshot create path, reducing contention on the STF during copy-on-write

Sync and snap domains can easily be created to enable snapshot revert and replication failover operations. SmartLock domains, however, cannot be manually created, since OneFS automatically creates a domain upon creation of a SmartLock directory.

For example, the following CLI syntax will create a SnapRevert domain for /ifs/snap1:

# isi job jobs start domainmark --root /ifs/snap1 --dm-type SnapRevert

The equivalent operation is also available from the WebUI.

You can delete a replication or snapshot revert domain if you want to move directories out of the domain. SmartLock domains, on the other hand, cannot be manually removed; they are automatically removed upon deletion of the SmartLock directory.

The following CLI command will delete a SnapRevert domain on /ifs/snap1:

# isi job jobs start domainmark --root /ifs/snap1 --dm-type SnapRevert --delete

Again, the same can be achieved from the WebUI.

Protection domains can (and usually should) be manually created before they are required by OneFS to perform certain actions. However, manually creating protection domains can limit the ability to interact with the data marked by the domain.

OneFS 8.2 and later releases provide an ‘isi_pdm’ CLI utility for managing protection domains, with the following syntax:

# isi_pdm -h
usage: isi_pdm [-h] [-v]
               {base,domains,exclusions,operations,ifsvar-sysdom} ...

positional arguments:
  {base,domains,exclusions,operations,ifsvar-sysdom}
    base                Read base domains.
    domains             Read or manipulate domain instances.
    exclusions          Add or list domain exclusions.
    operations          Read pending pdm operations.
    ifsvar-sysdom       Manage .ifsvar system domain.

optional arguments:
  -h, --help            show this help message and exit
  -v, --verbose

For example:

# isi_pdm domains list /ifs/data All
[ 2.0100, 315.0100 ]
# isi_pdm exclusions list 2.0100
{
    DomID = 16.8100
    Owner LIN = 1:0000:0001
}

Domain membership can also be viewed via the ‘isi get’ command.
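For example, a sketch of checking a file for domain membership via the detailed output (assuming domain information is surfaced by the ‘-D’ detail flag on your release):

# isi get -D /ifs/snap1/file1 | grep -i domain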

Here are some OneFS domain recommendations, constraints, and considerations:

  • Copying a large number of files into a protection domain can be a lengthy process, since each file must be marked individually as belonging to the protection domain.
  • The best practice is to create protection domains for directories while the directories are empty, and then add files to the directory.
  • The isi sync policies create command includes an ‘--accelerated-failback true’ option, which automatically marks the domain and can save considerable time during failback (see the sketch after this list).
  • If you use SyncIQ to create a replication policy for a SmartLock compliance directory, the SyncIQ and SmartLock compliance domains must be configured at the same root directory level. A SmartLock compliance domain cannot be nested inside a SyncIQ domain.
  • If a domain is currently preventing the modification or deletion of a file, you cannot create a protection domain for a directory that contains that file. For example, if /ifs/data/smartlock/file.txt is set to a WORM state by a SmartLock domain, you cannot create a SnapRevert domain for /ifs/data/.
  • Directories cannot be moved in or out of protection domains. However, you can move a directory to another location within the same protection domain.
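For example, a sketch of creating a replication policy with accelerated failback enabled, so that the source-side SyncIQ domain is marked up front (the policy name, paths, and target host are hypothetical, and the positional argument order should be verified on your release):

# isi sync policies create mypolicy sync /ifs/data/source target-cluster /ifs/data/target --accelerated-failback true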

OneFS MultiScan, AutoBalance, & Collect

As we’ve seen throughout the recent file system maintenance job articles, OneFS utilizes file system scans to perform such tasks as detecting and repairing drive errors, reclaiming freed blocks, etc. Since these scans typically involve complex sequences of operations, they are implemented via syscalls and coordinated by the Job Engine. These jobs are generally intended to run as minimally disruptive background tasks in the cluster, using spare or reserved capacity.

FS Maintenance Job   Description
------------------   ---------------------------------------------
AutoBalance          Restores node and drive free-space balance
Collect              Reclaims leaked blocks
FlexProtect          Replaces the traditional RAID rebuild process
MediaScan            Scrubs disks for media-level errors
MultiScan            Runs the AutoBalance and Collect jobs concurrently

In this final article of the series, we’ll turn our attention to MultiScan. This job is a combination of the AutoBalance job, which rebalances data across drives, and the Collect job, which recovers leaked blocks from the file system. In addition to reclaiming unused capacity after drive replacements, snapshot and data deletes, and the like, MultiScan also helps expose and remediate any file system inconsistencies.

The OneFS job engine defines two exclusion sets that govern which jobs can execute concurrently on a cluster. MultiScan straddles both of the job engine’s exclusion sets, with AutoBalance (and AutoBalanceLin) in the restripe set, and Collect in the mark set.

The restriping exclusion set is per-phase instead of per job, which helps to more efficiently parallelize restripe jobs when they don’t need to lock down resources. However, with the marking exclusion set, OneFS can only accommodate a single marking job at any point in time.

MultiScan is an unscheduled job that runs by default at ‘LOW’ impact and executes AutoBalance and Collect simultaneously. It is triggered by cluster group change events, which include node boot, shutdown, reboot, drive replacement, and so on. While AutoBalance executes each time the MultiScan job is triggered, Collect typically won’t be run more often than once every two weeks. AutoBalance and/or Collect are typically only run manually if MultiScan has been disabled.

When a new node or drive is added to the cluster, its blocks are almost entirely free, whereas the rest of the cluster is usually considerably more full, capacity-wise. AutoBalance restores the balance of free blocks in the cluster. As such, AutoBalance runs if a cluster’s nodes have a greater than 5% imbalance in capacity utilization. In addition, AutoBalance also fixes recovered writes that occurred due to transient unavailability and also addresses fragmentation.

If the cluster’s nodes contain SSDs, AutoBalanceLin (as opposed to the regular AutoBalance job) runs most efficiently by performing a LIN scan using a flash-backed metadata mirror. When a cluster is unbalanced, there is not an obvious subset of files to filter, since the files to be restriped are the ones which are not using the node or drive with less free space. In the case of an added node or drive, no files will be using it. As a result, almost any file scanned is enumerated for restripe.

As mentioned, the Collect job reclaims leaked blocks using a mark and sweep process. In traditional UNIX systems this function is typically performed by the ‘fsck’ utility. With OneFS, however, the other traditional functions of fsck are not required, since the transaction system keeps the file system consistent. Leaks only affect free space.

Collect’s ‘mark and sweep’ gets its name from the in-memory garbage collection algorithm. First, the in-use blocks and any new allocations are marked with the current generation in the Mark phase. When this is complete, the drives are swept of any blocks which don’t have the current generation in the Sweep phase.

In addition to automatic job execution following a group change event, MultiScan can also be initiated on demand. The following CLI syntax will kick off a manual job run:

# isi job start multiscan
Started job [209]
# isi job list
ID   Type      State   Impact  Pri  Phase  Running Time
--------------------------------------------------------
209  MultiScan Running Low     4    1/4    1s
--------------------------------------------------------
Total: 1

The Multiscan job’s progress can be tracked via a CLI command as follows:

# isi job jobs view 209
               ID: 209
             Type: MultiScan
            State: Running
           Impact: Low
           Policy: LOW
              Pri: 4
            Phase: 1/4
       Start Time: 2021-01-03T20:15:16
     Running Time: 34s
     Participants: 1, 2, 3
         Progress: Collect: 225 LINs, 0 errors
                   AutoBalance: 225 LINs, 0 errors
                   LIN Estimate based on LIN count of 2793 done on Jan 04 20:02:57 2021
                   LIN Based Estimate:  3m 2s Remaining (8% Complete)
                   Block Based Estimate: 5m 48s Remaining (4% Complete)
                   0 errors total
Waiting on job ID: -
      Description: Collect, AutoBalance

The LIN (logical inode) statistics above include both files and directories.

Be aware that the estimated LIN percentage can occasionally be misleading/anomalous. If concerned, verify that the stated total LIN count is roughly in line with the file count for the cluster’s dataset. Even if the LIN count is in doubt, the estimated block progress metric should always be accurate and meaningful.

If the job is in its early stages and no estimate can be given yet, isi job will instead report its progress as “Started”. Note that all progress is reported per phase, with MultiScan phase 1 performing the lion’s share of the work; phases 2-4 are comparatively short.

A job’s resource usage can be traced from the CLI as such:

# isi job statistics view
     Job ID: 209
      Phase: 1
   CPU Avg.: 11.46%
Memory Avg.
        Virtual: 301.06M
       Physical: 28.71M
        I/O
            Ops: 3513425
          Bytes: 26.760G

Finally, upon completion, the Multiscan job report, detailing all four stages, can be viewed by using the following CLI command with the job ID as the argument:

# isi job reports view 209
MultiScan[209] phase 1 (2021-01-03T20:02:57)
--------------------------------------------
Elapsed time          307 seconds (5m7s)
Working time          307 seconds (5m7s)
Errors                0
Rebalance/LINs        2793
Rebalance/Files       2416
Rebalance/Directories 377
Rebalance/Errors      0
Rebalance/Bytes       372607773184 bytes (347.018G)
Collect/LINs          2788
Collect/Files         2411
Collect/Directories   377
Collect/Errors        0
Collect/Bytes         130187742208 bytes (121.247G)

MultiScan[209] phase 2 (2021-01-03T20:02:57)
--------------------------------------------
Elapsed time     0 seconds
Working time     0 seconds
Errors           0
LINs traversed   0
LINs processed   0
SINs traversed   0
SINs processed   0
Files seen       0
Directories seen 0
Total bytes      0 bytes

MultiScan[209] phase 3 (2021-01-03T20:02:58)
--------------------------------------------
Elapsed time          1 seconds
Working time          1 seconds
Errors                0
Rebalance/SINs        0
Rebalance/Files       0
Rebalance/Directories 0
Rebalance/Errors      0
Rebalance/Bytes       0 bytes
Collect/SINs          0
Collect/Files         0
Collect/Directories   0
Collect/Errors        0
Collect/Bytes         0 bytes
Unbalanced diskpools  Pool_Name = h600_18tb_3.2tb-ssd_256gb:2, free_blocks = 8693136159, total_blocks = 8715355092
                      Pool_Name = h600_18tb_3.2tb-ssd_256gb:3, free_blocks = 7259260440, total_blocks = 7262795910

MultiScan[209] phase 4 (2021-01-03T20:03:17)
--------------------------------------------
Elapsed time 19 seconds
Working time 19 seconds
Errors       0
Drives swept 33
LINs freed   0
Inodes freed 128359
Bytes freed  80022503424 bytes (74.527G)
Keys freed   0
Inodes lost  0

OneFS FlexProtect

The FlexProtect job is responsible for maintaining the appropriate protection level of data across the cluster. For example, it ensures that a file which is configured to be protected at +2n is actually protected at that level. Given this, FlexProtect is arguably the most critical of the OneFS maintenance jobs, because it governs the mean time to repair (MTTR) of the cluster, which has an exponential impact on the mean time to data loss (MTTDL). Any failure or delay has a direct impact on the reliability of the OneFS file system.

In addition to FlexProtect, there is also a FlexProtectLin job. FlexProtectLin is run by default when there is a copy of file system metadata available on solid state drive (SSD) storage. FlexProtectLin typically offers significant runtime improvements over its conventional disk based counterpart.

As such, the primary purpose of FlexProtect is to repair nodes and drives which need to be removed from the cluster. In the case of a cluster group change, for example the addition or subtraction of a node or drive, OneFS automatically informs the job engine, which responds by starting a FlexProtect job. Any drives and/or nodes to be removed are marked with OneFS’ ‘restripe_from’ capability. The job engine coordinator notices that the group change includes a newly-smart-failed device and then initiates a FlexProtect job in response.

FlexProtect falls within the job engine’s restriping exclusion set and, similar to AutoBalance, comes in two flavors: FlexProtect and FlexProtectLin.

Run automatically after a drive or node removal or failure, FlexProtect locates any unprotected files on the cluster and repairs them as rapidly as possible. The FlexProtect job runs by default with an impact level of ‘medium’ and a priority level of ‘1’. The regular version of FlexProtect comprises the following six distinct phases:

Job Phase        Description
---------        -----------
Drive Scan       The job engine scans the disks for inodes needing repair. When one is found, it sets the LIN’s ‘needs repair’ flag for use in the next phase.
LIN Verify       This phase scans the OneFS LIN tree to address the drive scan’s limitations.
LIN Re-verify    The prior phases can miss protection group and metatree transfers. FlexProtect may have already repaired the destination of a transfer, but not the source. If a LIN is being restriped during a metatree transfer, it is added to a persistent queue, and this phase processes that queue.
Repair           LINs with the ‘needs repair’ flag set are passed to the restriper for repair. This phase needs to progress quickly, so the job engine workers execute in parallel across the cluster.
Check            This phase ensures that all LINs were repaired by the previous phases as expected.
Device Removal   The successfully repaired nodes and drives that were marked ‘restripe from’ at the beginning of phase 1 are removed from the cluster in this phase. Any additional nodes and drives which were subsequently failed remain in the cluster, with the expectation that a new FlexProtect job will handle them shortly.

Be aware that prior to OneFS 8.2, FlexProtect is the only job allowed to run if a cluster is in degraded mode, such as when a drive has failed, for example. Other jobs will automatically be paused and will not resume until FlexProtect has completed and the cluster is healthy again. In OneFS 8.2 and later, FlexProtect does not pause when there is only one temporarily unavailable device in a disk pool, when a device is smartfailed, or for dead devices.

The FlexProtect job executes in userspace and generally repairs any components marked with the ‘restripe from’ bit as rapidly as possible. Within OneFS, a reference to the LIN tree is placed inside each inode, and a B-tree describes the mapping between a logical offset and the physical data blocks.

In order for FlexProtect to avoid the overhead of traversing the whole path from LIN tree reference -> LIN tree -> B-tree -> logical offset -> data block, it leverages a OneFS construct known as the ‘width device list’ (WDL). The WDL enables FlexProtect to perform fast drive scanning of inodes, because the inode contents alone are sufficient to determine the need for restripe. The WDL keeps a list of the drives in use by a particular file; it is stored as an attribute within the inode and is thus protected by mirroring. There are two WDL attributes in OneFS, one for data and one for metadata. The WDL is primarily used by FlexProtect to determine whether an inode references a degraded node or drive. New or replaced drives are automatically added to the WDL as part of new allocations.

As mentioned previously, the FlexProtect job has two distinct variants. In the FlexProtectLin version of the job, the Drive Scan and LIN Verify phases are unnecessary and therefore removed, while the other phases remain identical. FlexProtectLin is preferred when at least one metadata mirror is stored on SSD, providing substantial job performance benefits.

In addition to automatic job execution after a drive or node removal or failure, FlexProtect can also be initiated on demand. The following CLI syntax will kick off a manual job run:

# isi job start flexprotect
Started job [274]
# isi job list
ID   Type        State   Impact  Pri  Phase  Running Time
----------------------------------------------------------
274  FlexProtect Running Medium  1    1/6    4s
----------------------------------------------------------
Total: 1


The FlexProtect job’s progress can be tracked via a CLI command as follows:

# isi job jobs view 274
               ID: 274
             Type: FlexProtect
            State: Succeeded
           Impact: Medium
           Policy: MEDIUM
              Pri: 1
            Phase: 6/6
       Start Time: 2020-12-04T17:13:38
     Running Time: 17s
     Participants: 1, 2, 3
         Progress: No work needed
Waiting on job ID: -
      Description: {"nodes": "{}", "drives": "{}"}

Upon completion, the FlexProtect job report, detailing all six stages, can be viewed by using the following CLI command with the job ID as the argument:

# isi job reports view 274
FlexProtect[274] phase 1 (2020-12-04T17:13:44)
----------------------------------------------
Elapsed time 6 seconds
Working time 6 seconds
Errors       0
Drives       33
LINs         250
Size         363108486755 bytes (338.171G)
ECCs         0

FlexProtect[274] phase 2 (2020-12-04T17:13:55)
----------------------------------------------
Elapsed time 11 seconds
Working time 11 seconds
Errors       0
LINs         33
Zombies      0

FlexProtect[274] phase 3 (2020-12-04T17:13:55)
----------------------------------------------
Elapsed time 0 seconds
Working time 0 seconds
Errors       0
LINs         0
Zombies      0

FlexProtect[274] phase 4 (2020-12-04T17:13:55)
----------------------------------------------
Elapsed time 0 seconds
Working time 0 seconds
Errors       0
LINs         0
Zombies      0

FlexProtect[274] phase 5 (2020-12-04T17:13:55)
----------------------------------------------
Elapsed time 0 seconds
Working time 0 seconds
Errors       0
Drives       0
LINs         0
Size         0 bytes
ECCs         0

FlexProtect[274] phase 6 (2020-12-04T17:13:55)
----------------------------------------------
Elapsed time       0 seconds
Working time       0 seconds
Errors             0
Nodes marked gone  {}
Drives marked gone {}

While a FlexProtect job is running, the following command will detail which LINs the job engine workers are currently accessing:

# sysctl efs.bam.busy_vnodes | grep isi_job_d
vnode 0xfffff802938d18c0 (lin 0) is fd 11 of pid 2850: isi_job_d
vnode 0xfffff80294817460 (lin 1:0002:0008) is fd 12 of pid 2850: isi_job_d
vnode 0xfffff80294af3000 (lin 1:0002:001a) is fd 20 of pid 2850: isi_job_d
vnode 0xfffff8029c7c7af0 (lin 1:0002:001b) is fd 17 of pid 2850: isi_job_d
vnode 0xfffff802b280dd20 (lin 1:0002:000a) is fd 14 of pid 2850: isi_job_d

Using the ‘isi get -L’ command, a LIN address can be translated to show the actual file name and its path. For example:

# isi get -L 1:0002:0008
A valid path for LIN 0x100020008 is /ifs/.ifsvar/run/isi_job_d.lock

OneFS IntegrityScan

Under normal conditions, OneFS typically relies on checksums, identity fields, and magic numbers to verify file system health and correctness. Within OneFS, system and data integrity can be subdivided into four distinct phases:

Here’s what each of these phases entails:

Phase         Description
-----         -----------
Detection     The act of scanning the file system and detecting data block instances that are not what OneFS expects to see at that logical point. Internally, OneFS stores a checksum, or IDI (Isi Data Integrity), for every allocated block under /ifs.
Enumeration   Notifying the cluster administrator of any file system damage uncovered in the detection phase, for example by logging to the /var/log/idi.log file. System panics may also occur if the damage identified is not one that OneFS can reasonably recover from.
Isolation     The act of cauterizing the file system, ensuring that any damage identified during the detection phase does not spread beyond the file(s) that are already affected. This usually involves removing all references to the file(s) from the file system.
Repair        Repairing any damage discovered and removing the ‘corruption’ from OneFS. Typically, a dynamic sector repair (DSR) is all that is required to rebuild a block that fails IDI.

Focused on the detection phase, the primary OneFS tool for uncovering system integrity issues is IntegrityScan. This job is run across the cluster to discover instances of damaged files and to provide an estimate of the spread of the damage.

Unlike traditional ‘fsck’ style file system integrity checking tools (including OneFS’ isi_cpr utility), IntegrityScan is explicitly designed to run while the cluster is fully operational – thereby removing the need for any downtime. It does this by systematically reading every block and verifying its associated checksum. In the event that IntegrityScan detects a checksum mismatch, it generates an alert, logs the error to the IDI logs (/var/log/idi.log), and provides a full report upon job completion.

IntegrityScan is typically run manually if the integrity of the file system is ever in doubt. By default, the job runs at an impact level of ‘Medium’ and a priority of ‘1’, and accesses the file system via a LIN scan. Although IntegrityScan itself may take several hours or days to complete, the file system remains online and completely available during this time. Additionally, like all OneFS Job Engine jobs, IntegrityScan can be re-prioritized, paused or stopped, depending on its impact to cluster operations. Along with Collect and MultiScan, IntegrityScan is part of the job engine’s marking exclusion set.

OneFS can only accommodate a single marking job at any point in time. However, since the file system is fully journalled, IntegrityScan is only needed in exceptional situations. There are two principal use cases for IntegrityScan:

  • Identifying and repairing corruption on a production cluster. Certain forms of corruption may be suggestive of a bug, in which case IntegrityScan can be used to determine the scope of the corruption and the likelihood of spreading. It can also fix some forms of corruption.
  • Repairing a file system after a lost journal. This use case is much like traditional fsck. This scenario should be treated with care as it is not guaranteed that IntegrityScan fixes everything. This is a use case that will require additional product changes to make feasible.

IntegrityScan can be initiated manually, on demand. The following CLI syntax will kick off a manual job run:

# isi job start integrityscan
Started job [283]
# isi job list
ID   Type          State   Impact  Pri  Phase  Running Time
------------------------------------------------------------
283  IntegrityScan Running Medium  1    1/2    1s
------------------------------------------------------------
Total: 1

With LIN scan jobs, even though the metadata is of variable size, the job engine can fairly accurately predict how much effort will be required to scan all LINs. The IntegrityScan job’s progress can be tracked via a CLI command, as follows:

# isi job jobs view 283
               ID: 283
             Type: IntegrityScan
            State: Running
           Impact: Medium
           Policy: MEDIUM
              Pri: 1
            Phase: 1/2
       Start Time: 2020-12-05T22:20:58
     Running Time: 31s
     Participants: 1, 2, 3
         Progress: Processed 947 LINs and approx. 7464 MB: 867 files, 80 directories; 0 errors
                   LIN & SIN Estimate based on LIN & SIN count of 3410 done on Dec 5 22:00:10 2020 (LIN) and Dec 5 22:00:10 2020 (SIN)
                   LIN & SIN Based Estimate:  1m 12s Remaining (27% Complete)
                   Block Based Estimate: 10m 47s Remaining (4% Complete)
Waiting on job ID: -
      Description:

The LIN (logical inode) statistics above include both files and directories.

Be aware that the estimated LIN percentage can occasionally be misleading/anomalous. If concerned, verify that the stated total LIN count is roughly in line with the file count for the cluster’s dataset. Even if the LIN count is in doubt, the estimated block progress metric should always be accurate and meaningful.

If the job is in its early stages and no estimation can be given (yet), isi job will instead report its progress as ‘Started’. Note that all progress is reported per phase.

A job’s resource usage can be traced from the CLI as such:

# isi job statistics view
     Job ID: 283
      Phase: 1
   CPU Avg.: 30.27%
Memory Avg.
        Virtual: 302.27M
       Physical: 24.04M
        I/O
            Ops: 2223069
          Bytes: 16.959G

Finally, upon completion, the IntegrityScan job report, detailing both job stages, can be viewed by using the following CLI command with the job ID as the argument:

# isi job reports view 283
IntegrityScan[283] phase 1 (2020-12-05T22:34:56)
------------------------------------------------
Elapsed time     838 seconds (13m58s)
Working time     838 seconds (13m58s)
Errors           0
LINs traversed   3417
LINs processed   3417
SINs traversed   0
SINs processed   0
Files seen       3000
Directories seen 415
Total bytes      178641757184 bytes (166.373G)

IntegrityScan[283] phase 2 (2020-12-05T22:34:56)
------------------------------------------------
Elapsed time     0 seconds
Working time     0 seconds
Errors           0
LINs traversed   0
LINs processed   0
SINs traversed   0
SINs processed   0
Files seen       0
Directories seen 0
Total bytes      0 bytes

In addition to the IntegrityScan job, OneFS also contains an ‘isi_iscan_report’ utility. This tool collates the errors from the IDI log files (/var/log/idi.log) generated on the various nodes, producing a report file which can be used as input to the ‘isi_iscan_query’ tool. It also reports the number of errors seen for each file containing IDI errors. At the end of the run, the report file can be found at /ifs/.ifsvar/idi/tjob.<pid>/log.repo.

The associated ‘isi_iscan_query’ utility can then be used to parse the log.repo report file and filter by node, time range, or block address (baddr). The syntax for the isi_iscan_query tool is:

/usr/sbin/isi_iscan_query filename [FILTER FIELD] [VALUE]

FILTER FIELD:
    node <logical node number> e.g. 1, 2, 3, ...
    timerange <start time> <end time> e.g. 2020-12-05T17:38:02Z 2020-12-06T17:38:56Z
    baddr <block address> e.g. 2,1,185114624:8192
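For example, a sketch of filtering the report for errors seen on a single node (the pid in the report path is hypothetical):

# isi_iscan_query /ifs/.ifsvar/idi/tjob.1234/log.repo node 2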

OneFS MediaScan

As we’ve seen previously, OneFS utilizes file system scans to perform such tasks as detecting and repairing drive errors, reclaiming freed blocks, etc. These scans are typically complex sequences of operations, so they are implemented via syscalls and coordinated by the Job Engine. These jobs are generally intended to run as minimally disruptive background tasks in the cluster, using spare or reserved capacity.

The file system maintenance jobs which are critical to the function of OneFS are:

FS Maintenance Job   Description
------------------   ---------------------------------------------
AutoBalance          Restores node and drive free-space balance
Collect              Reclaims leaked blocks
FlexProtect          Replaces the traditional RAID rebuild process
MediaScan            Scrubs disks for media-level errors
MultiScan            Runs the AutoBalance and Collect jobs concurrently

MediaScan’s role within the file system protection framework is to periodically check for and resolve drive bit errors across the cluster. This proactive data integrity approach helps guard against a phenomenon known as ‘bit rot’, and the resulting specter of hardware induced silent data corruption.

The MediaScan job reads all of OneFS’ allocated blocks in order to trigger any latent drive sector errors, in a process known as ‘disk scrubbing’. Drive sector errors may occur due to physical effects which, over time, could negatively affect the protection of the file system. Periodic disk scrubbing helps ensure that sector errors do not accumulate and lead to data integrity issues.

Sector errors are a relatively common drive fault. They are sometimes referred to as ‘ECCs’ since drives have internal error correcting codes associated with sectors. A failure of these codes to correct the contents of the sector generates an error on a read of the sector.

ECCs have a wide variety of causes. There may be a permanent problem, such as physical damage to the platter, or a more transient problem, such as the head not being located properly when the sector was read. For transient problems, the drive has the ability to retry automatically. However, such retries can be time consuming and prevent further processing.

OneFS typically has the redundancy available to overwrite the bad sector with the proper contents. This is called Dynamic Sector Repair (DSR). It is preferable for the file system to perform DSR than to wait for the drive to retry and possibly disrupt other operations. When supported by the particular drive model, a retry time threshold is also set so that disruption is minimized and the file system can attempt to use its redundancy.

In addition, MediaScan maintains a list of sectors to avoid after an error has been detected. Sectors are added to the list upon the first error. Subsequent I/Os consult this list and, if a match is found, immediately return an error without actually sending the request to the drive, minimizing further issues.

If the file system can successfully write over a sector, it is removed from the list; the assumption is that the drive will reallocate the sector on write. If the file system can’t reconstruct the block, it may be necessary to retry the I/O, since there is no other way to access the data. To do so, the kernel’s ECC list must be cleared. This happens at the end of the MediaScan job run, but occasionally must also be done manually in order to access a particular block.

The drive’s own error-correction mechanism can handle some bit rot. When it fails, the error is reported to the MediaScan job. In order for the file system to repair the sector, the owner must be located. The owning structure in the file system has the redundancy that can be used to write over the bad sector, for example an alternate mirror of a block.

Most of the logic in MediaScan handles searching for the owner of the bad sector; the process can be very different depending on the type of structure, but is usually quite expensive. As such, it is often referred to as the ‘haystack’ search, since nearly every inode may be inspected to find the owner. MediaScan works by directly accessing the underlying cylinder groups and disk blocks via a linear drive scan and has more job phases than most job engine jobs for two main reasons:

  • First, significant effort is made to avoid the expense of the haystack search.
  • Second, every effort is made to try all means possible before alerting the administrator.

Here are the eight phases of MediaScan:

Phase   Name                Description
-----   ----                -----------
1       Drive Scan          Scans each drive using the ifs_find_ecc() system call, which issues I/O for all allocated blocks and inodes.
2       Random Drive Scan   Finds additional ‘marginal’ ECCs that would not have been detected by the previous phase.
3       Inode Scan          Inode ECCs can be located more quickly from the LIN tree, so this phase scans the LIN tree to determine the (LIN, snapshot ID) referencing any inode ECCs.
4       Inode Repair        Repairs inode ECCs with known (LIN, snapshot ID) owners, plus any LIN tree block ECCs where the owner is the LIN tree itself.
5       Inode Verify        Verifies that any ECCs not fixed in the previous phase still exist. First, it checks whether the block has been freed; then it clears the ECC list and retries the I/O to verify that the sector is still failing.
6       Block Repair        Drives are scanned and compared against the list of ECCs. When ECCs are found, the (LIN, snapshot ID) owner is returned and the restriper repairs the ECCs in those files. This phase is often referred to as the ‘haystack search’.
7       Block Verify        Once all file system repair attempts have completed, ECCs are again verified by clearing the ECC list and reissuing the I/O.
8       Alert               Any ECCs remaining after repair and verification represent a danger of data loss. This phase logs the errors at the syslog ERR level.

MediaScan falls within the job engine’s restriping exclusion set, and is run as a low-impact, low-priority background process. It is executed automatically by default at 12am on the first Saturday of each month, although this can be reconfigured if desired.
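For example, the current schedule can be inspected, and optionally changed, via the job types interface (the ‘--schedule’ option and its date-pattern string are assumptions; verify the syntax on your release):

# isi job types view mediascan
# isi job types modify mediascan --schedule 'every Saturday at 12:00 AM'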

In addition to scheduled job execution, MediaScan can also be initiated on demand. The following CLI syntax will kick off a manual job run:

# isi job jobs start mediascan
Started job [251]
# isi job jobs list
ID   Type      State   Impact  Pri  Phase  Running Time
--------------------------------------------------------
251  MediaScan Running Low     8    1/8    1s
--------------------------------------------------------
Total: 1

The MediaScan job’s progress can be tracked via a CLI command as follows:

# isi job jobs view 251
               ID: 251
             Type: MediaScan
            State: Running
           Impact: Low
           Policy: LOW
              Pri: 8
            Phase: 1/8
       Start Time: 2020-11-23T22:16:23
     Running Time: 1m 30s
     Participants: 1, 2, 3
         Progress: Found 0 ECCs on 2 drives; last completed: 2:0; 0 errors
Waiting on job ID: -
      Description:

A job’s resource usage can be traced from the CLI as such:

# isi job statistics view
     Job ID: 251
      Phase: 1
   CPU Avg.: 0.21%
Memory Avg.
        Virtual: 318.41M
       Physical: 28.92M
        I/O
            Ops: 391
          Bytes: 3.05M

Finally, upon completion, the MediaScan job report, detailing all eight stages, can be viewed by using the following CLI command with the job ID as the argument:

# isi job reports view 251

OneFS Job Engine – File System Protection & Management

In addition to the per-job impact controls described in the previous blog article, additional impact management is also provided by the notion of job exclusion sets. For multiple concurrent job execution, exclusion sets, or classes of similar jobs, determine which jobs can run simultaneously. A job is not required to be part of any exclusion set, and jobs may also belong to multiple exclusion sets. Currently, there are two exclusion sets that jobs can be part of:

  • Restriping
  • Marking

The fundamental responsibility of the jobs within the Job Engine Restripe exclusion set is to ensure that the data on /ifs is:

  • Protected at the desired level.
  • Balanced across nodes.
  • Properly accounted for.

OneFS does this by running various file system maintenance jobs either manually, via a predefined schedule, or based on a cluster event, like a group change. These jobs include:

The FlexProtect job is responsible for maintaining the appropriate protection level of data across the cluster.  For example, it ensures that a file which is supposed to be protected at +2, is actually protected at that level.

Run automatically after a drive or node removal or failure, FlexProtect locates any unprotected files on the cluster, and repairs them as quickly as possible.  The FlexProtect job includes the following distinct phases:

  • Drive Scan. FlexProtect scans the cluster’s drives, looking for files and inodes in need of repair. When one is found, the job opens the LIN and repairs it and the corresponding data blocks using the restripe process.
  • LIN Verification. Once the drive scan is complete, the LIN verification phase scans the inode (LIN) tree and verifies, re-verifies and resolves any outstanding re-protection tasks.
  • Device Removal. In this final phase, FlexProtect removes the successfully repaired drives(s) or node(s) from the cluster.

In addition to FlexProtect, there is also a FlexProtectLin job. FlexProtectLin is run by default when there is a copy of file system metadata available on solid state drive (SSD) storage. FlexProtectLin typically offers significant runtime improvements over its conventional disk-based counterpart.

Unlike previous releases, in OneFS 8.2 and later FlexProtect does not pause when there is only one temporarily unavailable device in a disk pool, when a device is smartfailed, or for dead devices.

The MultiScan job, which combines the functionality of AutoBalance and Collect, is automatically run after a group change which adds a device to the cluster. AutoBalance(Lin) and/or Collect are only run manually if MultiScan has been disabled.

Scalability enhancements in OneFS 8.2 and later mean that fewer group change notifications are received. This results in MultiScan being triggered less frequently. To compensate for this, MultiScan is now started when:

  • Data is unbalanced within one or more disk pools, which triggers MultiScan to start the AutoBalance phase only.
  • When drives have been unavailable for long enough to warrant a Collect job, which triggers MultiScan to start both its AutoBalance and Collect phases.

The goal of the AutoBalance job is to ensure that each node has the same amount of data on it, in order to balance data evenly across the cluster. AutoBalance, along with the Collect job, is run after any cluster group change, unless there are any storage nodes in a “down” state.

Upon visiting each file, AutoBalance performs the following two operations:

  • File level rebalancing
  • Full array rebalancing

For file level rebalancing, AutoBalance evenly spreads data across the cluster’s nodes in order to achieve balance within a particular file. And with full array rebalancing, AutoBalance moves data between nodes to achieve an overall cluster balance within a 5% delta across nodes.

There is also an AutoBalanceLin job available, which is automatically run in place of AutoBalance when the cluster has a metadata copy available on SSD. AutoBalanceLin provides an expedited job runtime.

The Collect job is responsible for locating unused inodes and data blocks across the file system. Collect runs by default after a cluster group change, in conjunction with AutoBalance, as part of the MultiScan job.

In its first phase, Collect performs a marking job, scanning all the inodes (LINs) and identifying their associated blocks. Collect marks all the blocks which are currently allocated and in use, and any unmarked blocks are identified as candidates to be freed for reuse, so that the disk space they occupy can be reclaimed and re-allocated. All metadata must be read in this phase in order to mark every reference, and must be done completely, to avoid sweeping in-use blocks and introducing allocation corruption.

Collect’s second phase scans all the cluster’s drives and performs the freeing up, or sweeping, of any unmarked blocks so that they can be reused.

MediaScan’s role within the file system protection framework is to periodically check for and resolve drive bit errors across the cluster. This proactive data integrity approach helps guard against a phenomenon known as ‘bit rot’, and the resulting specter of hardware induced silent data corruption.

MediaScan is run as a low-impact, low-priority background process, based on a predefined schedule (monthly, by default).

First, MediaScan’s search and repair phase checks the disk sectors across all the drives in a cluster and, where necessary, utilizes OneFS’ dynamic sector repair (DSR) process to resolve any ECC sector errors that it encounters. For any ECC errors which can’t immediately be repaired, MediaScan will first try to read the disk sector again several times in the hopes that the issue is transient, and the drive can recover. Failing that, MediaScan will attempt to restripe files away from irreparable ECCs. Finally, the MediaScan summary phase generates a report of the ECC errors found and corrected.

The IntegrityScan job is responsible for examining the entire live file system for inconsistencies. It does this by systematically reading every block and verifying its associated checksum. Unlike traditional ‘fsck’ style file system integrity checking tools, IntegrityScan is designed to run while the cluster is fully operational, thereby removing the need for any downtime. In the event that IntegrityScan detects a checksum mismatch, it generates an alert, logs the error to the IDI logs and provides a full report upon job completion.

IntegrityScan is typically run manually if the integrity of the file system is ever in doubt. Although the job itself may take several days or more to complete, the file system is online and completely available during this time. Additionally, like all phases of the OneFS job engine, IntegrityScan can be prioritized, paused or stopped, depending on the impact to cluster operations.
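
As a sketch, an administrator who has reason to doubt file system integrity might start the job at a low impact level to minimize the effect on client workloads (the policy and priority values here are purely illustrative):

# isi job jobs start IntegrityScan --policy LOW --priority 6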

The Job Engine includes a number of feature support jobs that are related to, and support the operation of, the OneFS data and storage management modules, including SmartQuotas, SnapshotIQ, SmartPools, SmartDedupe, etc. Each of these modules requires a cluster-wide license to run. In the event that a feature has not been licensed, attempts to start the associated supporting job will fail with a warning to that effect.

If the SmartPools data tiering product is unlicensed on a cluster, the SetProtectPlus job will run instead, to apply the default file policy. SetProtectPlus is then automatically disabled if SmartPools is activated on the cluster.

Another principal consumer of the SmartPools job and filepool policies is OneFS Small File Storage Efficiency (SFSE). This feature maximizes the space utilization of a cluster by decreasing the amount of physical storage required to house the small files that comprise a typical medical dataset. Efficiency is achieved by scanning the on-disk data for small files, which are protected by full copy mirrors, and packing them into shadow stores. These shadow stores are then parity protected, rather than mirrored, and typically provide storage efficiency of 80% or greater.

If both SmartPools and CloudPools are licensed and both have policies configured, the scheduled SmartPools job will also trigger a CloudPools job when it’s executed. Only the SmartPools job will be visible from the Job Engine WebUI, but the following command can be used to view and control the associated CloudPools jobs:

# isi cloud job <action> <subcommand>

In addition to the standard CloudPools archive, recall, and restore jobs, there are typically four CloudPools jobs involved with cache management and garbage collection, of which the first three are continuously running:

  • Cache-writeback
  • Cache-invalidation
  • Cloud-garbage-collection
  • Local-garbage-collection
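
For example, to list these jobs and their current state, syntax along the following lines can be used (this assumes the plural ‘isi cloud jobs’ form of the command, which may vary by release):

# isi cloud jobs list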

Similarly, the SmartDedupe data efficiency product has two jobs associated with it. The first, DedupeAssessment, is an unlicensed job that can be run to determine the space savings available across a dataset. The second, SmartDedupe, actually performs the data deduplication, and requires a valid product license key in order to run.
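
A typical workflow, sketched below, is to run the unlicensed assessment first and review the estimated savings before starting a licensed deduplication run. Note that the deduplication job itself appears as ‘Dedupe’ in the job schedule table later in this article:

# isi job jobs start DedupeAssessment
# isi dedupe reports list
# isi job jobs start Dedupe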

OneFS Job Engine Exclusion Sets

In order for the OneFS Job Engine framework to support concurrent job execution, it provides the concept of exclusion sets. These are classes of similar jobs that determine which jobs can run simultaneously.

A job is not required to be part of any exclusion set, and jobs may also belong to multiple exclusion sets. There are currently two exclusion sets that jobs can be part of:

  • Restriping
  • Marking

Let’s first take a look at the restriping exclusion set.

OneFS protects data by writing file blocks across multiple drives on different nodes. This process is known as ‘restriping’ in the OneFS lexicon. The Job Engine defines a restripe exclusion set that contains the jobs involving file system management, protection, and on-disk layout.

The restripe exclusion set contains the following jobs:

Job Name Job Description Access Method
AutoBalance Balances free space in the cluster. Drive + LIN
AutoBalanceLin Balances free space in the cluster. LIN
FlexProtect Rebuilds and re-protects the file system to recover from a failure scenario. Drive + LIN
FlexProtectLin Re-protects the file system. LIN
MediaScan Scans drives for media-level errors. Drive + LIN
MultiScan Runs Collect and AutoBalance jobs concurrently. LIN
SetProtectPlus Applies the default file policy. This job is disabled if SmartPools is activated on the cluster. LIN
ShadowStoreProtect Protects shadow stores that are referenced by a LIN with a higher requested protection level. LIN
SmartPools Moves data between the tiers of nodes within the same cluster. Also executes the CloudPools functionality if licensed and configured. LIN
Upgrade Manages OneFS upgrades. LIN

The restriping exclusion set is enforced per phase, rather than per job. This helps to more efficiently parallelize restripe jobs when they don’t need to lock down resources. Restriping jobs only block each other when the current phase may perform restriping. This is most evident with MultiScan, whose final phase only sweeps rather than restripes. Similarly, MediaScan, which rarely ever restripes, is usually able to run to completion without contending with other restriping jobs.

For example, in the output below, the two restripe jobs, MediaScan and AutoBalanceLin, are both running their respective first job phases. ShadowStoreProtect, also a restriping job, is in a ‘waiting’ state, blocked by AutoBalanceLin.

Running and queued jobs:

ID    Type               State       Impact  Pri  Phase  Running Time
----------------------------------------------------------------------
26850 AutoBalanceLin     Running     Low     4    1/3    20d 18h 19m
26910 ShadowStoreProtect Waiting     Low     6    1/1    -
28133 MediaScan          Running     Low     8    1/8    1d 15h 37m
----------------------------------------------------------------------

MediaScan restripes in phases 3 and 5 of the job, and only if there are disk errors (ECCs) which require data reprotection. If MediaScan reaches its third job phase with ECCs, it will pause until AutoBalanceLin is no longer running. If MediaScan’s priority were in the range 1-3, it would cause AutoBalanceLin to pause instead.

If two jobs happen to reach their restriping phases simultaneously and the jobs have different priorities, the higher priority job (i.e. the priority value closer to 1) will continue to run, and the other will pause. If the two jobs have the same priority, the one already in its restriping phase will continue to run, and the one newly entering its restriping phase will pause.
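
If needed, a running job’s priority or impact policy can be adjusted on the fly to influence this arbitration. For instance, the following sketch (using a hypothetical job ID) raises a job’s priority so that it wins any restripe contention:

# isi job jobs modify 26910 --priority 2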

Next, we’ll examine the marking exclusion set.

The OneFS file system actively marks the blocks which are actually in use by the file system. A good example of this class of job is IntegrityScan, which traverses the live file system, marking every block of every LIN in the cluster to proactively detect and resolve any issues with the structure of data in a cluster.

The jobs that comprise the marking exclusion set are:

Job Name Job Description Access Method
Collect Reclaims disk space that could not be freed while a node or drive was unavailable or suffering from various failure conditions. Drive + LIN
IntegrityScan Performs online verification and correction of any file system inconsistencies. LIN
MultiScan Runs Collect and AutoBalance jobs concurrently. LIN

Jobs may also belong to both exclusion sets. An example of this is MultiScan, since it includes both AutoBalance and Collect.

Multiple jobs from the same exclusion set will not run at the same time. For example, Collect and IntegrityScan cannot be executed simultaneously, as they are both members of the marking jobs exclusion set. Similarly, MediaScan and SetProtectPlus won’t run concurrently, as they are both part of the restripe exclusion set.

The majority of the jobs do not belong to an exclusion set. These are typically the data services feature support jobs, and they can coexist and contend with any of the other jobs.

Exclusion sets do not change the scope of the individual jobs themselves, so any runtime improvements via parallel job execution are the result of job management and impact control. The Job Engine monitors node CPU load and drive I/O activity per worker thread every twenty seconds to ensure that maintenance jobs do not cause cluster performance problems.

If a job affects overall system performance, Job Engine reduces the activity of maintenance jobs and yields resources to clients. Impact policies limit the system resources that a job can consume and when a job can run. You can associate jobs with impact policies, ensuring that certain vital jobs always have access to system resources.
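
The available impact policies can be inspected from the CLI. For example, the following command lists the default policies (typically HIGH, MEDIUM, LOW, and OFF_HOURS) plus any custom policies that have been defined:

# isi job policies list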

It’s worth noting that the job engine exclusion sets are pre-defined, and cannot be modified or reconfigured.

OneFS Job Engine Monitoring & Reporting

The OneFS Job Engine provides detailed monitoring and statistics gathering, with insight into both individual jobs and the Job Engine itself. A variety of Job Engine-specific metrics, including per-job resource usage, are available via the OneFS CLI ‘isi job statistics list’ command.

For example:

# isi job statistics list

Job ID  Phase  CPU Avg.  Virt. Mem. Avg.  Phys. Mem. Avg.  I/O Ops  I/O Bytes
----------------------------------------------------------------------------
87      1      0.52%     77.44M           37.00M           1115337  8.51G
88      1      2.35%     75.84M           36.80M           16192    125.66M
89      1      0.48%     6.01M            2.67M            0        0.00
----------------------------------------------------------------------------

In verbose mode, worker statistics and job level resource usage can also be viewed:

# isi job statistics list -v | more

Job ID: 87
 Phase: 1
 Nodes
             Node: 1
              PID: 38164
              CPU: 0.00% (0.00% min, 4.30% max, 0.40% avg)
     Virtual Mem.: 76.92M (74.38M min, 76.92M max, 76.53M avg)
    Physical Mem.: 36.33M (35.93M min, 36.33M max, 36.13M avg)
         I/O Read: 16634 ops, 129.95M
        I/O Write: 38899 ops, 303.90M
          Workers: 1 (9.00 STW avg.)

             Node: 2
              PID: 66095
              CPU: 0.68% (0.34% min, 4.98% max, 1.22% avg)
     Virtual Mem.: 77.05M (74.38M min, 77.05M max, 76.92M avg)
    Physical Mem.: 42.07M (41.39M min, 42.07M max, 41.97M avg)
         I/O Read: 16602 ops, 129.70M
        I/O Write: 39173 ops, 306.04M
          Workers: 1 (0.00 STW avg.)
…

Additionally, the status of the Job Engine workers is available via the OneFS CLI using the ‘isi job statistics view’ command. For example:

# isi job statistics view --job-id 857

Job ID: 857
 Phase: 1
 Nodes
    Node: 1
     PID: 26224
     CPU: 7.96% (0.00% min, 28.96% max, 4.60% avg)
 Virtual: 187.23M (187.23M min, 187.23M max, 187.23M avg)
Physical: 19.01M (18.52M min, 19.33M max, 18.96M avg)
    Read: 931043 ops, 7.099G
   Write: 1610213 ops, 12.269G
 Workers: 1 (0.00 STW avg.)

Job events, including pause/resume, waiting, phase completion, job success, failure, etc, are reported under the ‘Job Events’ tab of the WebUI. Additional information for each event is available via the “View Details” button for the appropriate job events entry in the WebUI. These are accessed by navigating to Cluster Management > Job Operations > Job Events.
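
The same event history is also available from the CLI via the ‘isi job events list’ command, which can be filtered, for example by job type (the flag usage here is illustrative):

# isi job events list --job-type collect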

A comprehensive job report is also provided for each phase of a job. This report contains detailed information on runtime, CPU, drive and memory utilization, the number of data and metadata objects scanned, and other work details or errors specific to the job type.

After a job finishes, a report can be viewed from the CLI by specifying the job ID. The ‘isi job reports list’ command displays a list of all recent jobs, including their job IDs, and the ‘isi job reports view’ command is then run with a specific job ID. For example, the following command displays the report of a Collect job with an ID of 857:

# isi job reports view 857

Collect[857] phase 1 (2020-11-14T11:39:57)
------------------------------------------
LIN scan
Elapsed time:              6506 seconds
LINs traversed:            433423
Files seen:                396980
Directories seen:          36439
Errors:                    0
Total blocks:              27357443452 (13678721726 KB)
CPU usage:                 max 28% (dev 1), min 0% (dev 1), avg 4%
Virtual memory size:       max 193300K (dev 1), min 191728K (dev 1), avg 1925
Resident memory size:      max 21304K (dev 1), min 18884K (dev 2), avg 20294K
Read:                      11637860 ops, 95272875008 bytes (90859.3M)
Write:                     20717079 ops, 169663891968 bytes (161804.1M)

While a job is running, an Active Job Details report is also available. This provides contextual information, including elapsed time, current job phase, job progress status, etc.

For inode (LIN) based jobs, progress as an estimated percentage completion is also displayed, based on processed LIN counts.
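
For example, the current phase and progress of an active job can be checked from the CLI with the following syntax, using its job ID (hypothetical here):

# isi job jobs view 857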

Detailed and granular job performance information and statistics are now available in a job’s report. These new statistics include per-job-phase CPU and memory utilization (including minimum, maximum, and average), and total read and write IOPS and throughput.

OneFS performance resource management provides statistics for the resources used by jobs – both cluster-wide and per-node. This information is available via the isi statistics workload CLI command. Available in a ‘top’ format, this command displays the top jobs and processes, and periodically updates the information.

For example, the following syntax shows, and indefinitely refreshes, the top five processes on a cluster:

# isi statistics workload --limit=5 --format=top

last update:  2020-11-14T16:45:25 (s)ort: default

CPU   Reads Writes      L2    L3    Node  SystemName        JobType
1.4s  9.1k  0.0         3.5k  497.0 2     Job:  237         IntegrityScan[0]
1.2s  85.7  714.7       4.9k  0.0   1     Job:  238         Dedupe[0]
1.2s  9.5k  0.0         3.5k  48.5  1     Job:  237         IntegrityScan[0]
1.2s  7.4k  541.3       4.9k  0.0   3     Job:  238         Dedupe[0]
1.1s  7.9k  0.0         3.5k  41.6  2     Job:  237         IntegrityScan[0]

The resource statistics tracked per job, per job phase, and per node include CPU, reads, writes, and L2 & L3 cache hits. In contrast to the generic output of the standard ‘top’ utility, this per-job granularity makes it easier to diagnose individual job resource issues.

OneFS Job Engine Management

In this next installment of the Job Engine series, we take a look at the configuration, control and management of jobs.

Although OneFS runs several critical system maintenance jobs automatically when necessary, the majority of the Job Engine’s jobs have no default schedule and can be manually started by a cluster administrator. One example is the Collect job, which reclaims free space that previously could not be freed because a node or drive was unavailable. The following CLI command syntax runs the Collect job with a ‘medium’ impact policy and a higher priority:

# isi job jobs start Collect --policy MEDIUM --priority 2

Started job [9]

When the job starts, a message such as the one above appears. In this output, [9] represents the job ID number, which can be used as the argument to run other commands on the job.

For example, the following CLI command will cancel this collect job with ID 9:

# isi job jobs cancel 9

Similarly, to start a job from the OneFS WebUI, navigate to Cluster Management > Job Operations > Job Type and click the ‘Start’ button for the desired job type, for example AutoBalance.

Other jobs such as ComplianceStoreDelete, FilePolicy, FSAnalyze, MediaScan, ShadowStoreDelete, SmartPools, and WormQueue are normally started via a schedule. The default job execution schedule is shown in the table below.

Job Name Default Job Schedule
AutoBalance Manual
AutoBalanceLIN Manual
AVScan Manual
ChangelistCreate Manual
CloudPoolsLin/Treewalk Manual
Collect Manual
ComplianceStoreDelete The 2nd Saturday of every month at 12am
Dedupe Manual
DedupeAssessment Manual
DomainMark/Tag Manual
FilePolicy Every day at 22:00
FlexProtect Manual
FlexProtectLIN Manual
FSAnalyze Every day at 22:00
IndexUpdate Manual
IntegrityScan Manual
LinCount Manual
MediaScan The 1st Saturday of every month at 12am
MultiScan Manual
PermissionRepair Manual
QuotaScan Manual
SetProtectPlus Manual
ShadowStoreDelete Every Sunday at 12:00am
SmartPools Every day at 22:00
SmartPoolsTree Manual
SnapRevert Manual
SnapshotDelete Manual
TreeDelete Manual
WormQueue Every day at 02:00

The full list of jobs and schedules can be viewed via the CLI using the following syntax:

# isi job types list --verbose

This information can also be displayed via the WebUI, by navigating to Cluster Management > Job Operations > Job Types.

System maintenance jobs can be customized for a particular environment and/or workflow by configuring a schedule and modifying the default priority level or impact level for a particular job type. For example, the following CLI syntax can be used to set a schedule for the MediaScan job to run every Saturday morning at 9 AM. Note that the ‘--force’ option overrides the confirmation step.

# isi job types modify mediascan --schedule 'every Saturday at 09:00' --force

Similarly, the following command removes the schedule for the job:

# isi job types modify mediascan --clear-schedule --force

All subsequent iterations of the MediaScan job type will then run with the new settings. However, if a MediaScan job is already in progress, it will continue to use the old settings.
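
The updated configuration for a job type can be confirmed with the following syntax:

# isi job types view mediascan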

Finally, the following syntax modifies the default priority level and impact level for the MediaScan job type:

# isi job types modify mediascan --priority 2 --policy medium

To create or edit a job’s schedule from the WebUI, click on the “View / Edit” button for the desired job, located in the “Actions” column of the “Job Types” WebUI tab above. From here, check the “Scheduled” radio button, and select between a Daily, Weekly, Monthly, or Yearly schedule, as appropriate. For each of these time period options, it’s possible to schedule the job to run either once or multiple times on each specified day.

The Job Engine schedule for certain feature supporting jobs can be configured directly from the feature’s WebUI area, as well as from the Job Engine WebUI management pages. An example of this is Antivirus and the AVScan job.

The OneFS Job Engine can also initiate certain jobs on its own. For example, if the SnapshotIQ process detects that a snapshot has been marked for deletion, it will automatically queue a SnapshotDelete job.

The Job Engine will also execute jobs in response to certain system event triggers. In the case of a cluster group change, such as the addition or removal of a node or drive, OneFS automatically informs the Job Engine, which responds by starting a FlexProtect job: the coordinator notices that the group change includes a newly smart-failed device and initiates a FlexProtect job in response.

Job administration and execution can be controlled via the WebUI, the command line interface (CLI), or the OneFS RESTful platform API. For each of these control methods, additional administrative security can be configured using role-based access control (RBAC). By restricting access via the ISI_PRIV_JOB_ENGINE privilege, it is possible to allow only a subset of cluster administrators to configure, manage, and execute Job Engine functionality, as desired for the security requirements of a particular environment.
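
As a sketch, a dedicated job administration role could be created and granted this privilege along the following lines (the role and user names here are hypothetical):

# isi auth roles create JobAdmin --description "Job Engine administration"
# isi auth roles modify JobAdmin --add-priv ISI_PRIV_JOB_ENGINE --add-user jsmith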

In addition to being started and stopped, a job launched by any of the methods described above can also be paused.

Once paused, a job can easily be resumed, and execution will continue from where it left off. This is managed by the Job Engine’s checkpointing system, described below.

Pausing and resuming can also be performed from the CLI:

# isi job jobs pause 28

# isi job jobs resume 28

For example, the above syntax will pause and resume the job with ID 28.