OneFS Endurant Cache

The Endurant Cache, or EC, is OneFS’ caching mechanism for synchronous writes – or writes that require a stable write acknowledgement to be returned to an NFS client.

The EC operates in conjunction with the OneFS write cache, or coalescer, to ingest, protect and aggregate small, synchronous NFS writes. The incoming write blocks are staged to NVRAM, ensuring the integrity of the write, even during the unlikely event of a node’s power loss.  Furthermore, EC also creates multiple mirrored copies of the data, further guaranteeing protection from single node and, if desired, multiple node failures.

EC improves the latency associated with synchronous writes by reducing the time to acknowledgement back to the client. This process removes the Read-Modify-Write (R-M-W) operations from the acknowledgement latency path, while also leveraging the coalescer to optimize writes to disk. The EC is also tightly coupled with OneFS’ multi-threaded I/O (Multi-writer) process, to support concurrent writes from multiple client writer threads to the same file. And the design of EC ensures that the cached writes do not impact snapshot performance.

The Endurant Cache uses write logging to combine and protect small writes at random offsets into 8K linear writes. To achieve this, the writes go to a mirrored files, or Logstores. The response to a stable write request can be sent once the data is committed to the Logstore. Logstores can be written to by several threads from the same node, and are highly optimized to enable low-latency concurrent writes.

Note that if a write uses the EC, the coalescer must also be used. If the coalescer is disabled on a file, but the EC is enabled, the coalescer will still be active, with all data backed by the EC.

Imagine an NFS client that wishes to write a file to a PowerScale cluster over NFS with the O_SYNC flag set, requiring a confirmed or synchronous write acknowledgement. The following sequence of events will occur to facilitate a stable write.

  1. The client, connected to node 2, begins the write process sending protocol level blocks. 4K is the optimal block size for the Endurant Cache.

  1. The NFS client’s writes are temporarily stored in the write coalescer portion of node 2’s RAM. The Write Coalescer aggregates uncommitted blocks so that the OneFS can, ideally, write out full protection groups where possible, reducing latency over protocols that allow “unstable” writes. Writing to RAM has far less latency that writing directly to disk.
  2. Once in the write coalescer, the Endurant Cache log-writer process writes mirrored copies of the data blocks in parallel to the EC Log Files.

The protection level of the mirrored EC log files is the same as that of the data being written by the NFS client.

  1. Once the data copies are received into the EC Log Files, a stable write exists and a write acknowledgement (ACK) is returned to the NFS client confirming the stable write has occurred. The client assumes the write is completed and can close the write session.

  1. The Write Coalescer then processes the file just like a non-EC write at this point. The Write Coalescer fills and is routinely flushed as required as an asynchronous write via to the Block Allocation Manager (BAM) and the BAM Safe Write (BSW) path processes.
  2. The file is split into 128K Data Stripe Units (DSUs), parity protection (FEC) is calculated and FEC Stripe Units (FSUs) are created.

  1. The layout and write plan is then determined, and the stripe units are written to their corresponding nodes’ L2 Cache and NVRAM. The EC logfiles are cleared from NVRAM at this point. OneFS uses a Fast Invalid Path process to de-allocate the EC Log Files from NVRAM.

  1. Stripe Units are then flushed to physical disk.
  2. Once written to physical disk, the Data Stripe Unit (DSU) and FEC Stripe Unit (FSU) copies created during the write are cleared from NVRAM but remain in L2 cache until flushed to make room for more recently accessed data.

The number of logfile mirrors that are created by EC is always one more than the on-disk protection level of the file. For example:

File Protection Level # EC Mirrored Copies
+1n 2
2x 3
+2d:1n 3
+2n 3
+3d:1n1d 4
+3n 4
+4n 5

The EC mirrors are only used if the initiator node is lost. In the unlikely event that this occurs, the participant nodes replay their EC journals and complete the writes.

If the write is an EC candidate, the data remains in the coalescer, an EC write is constructed, and the appropriate coalescer region is marked as EC. The EC write is a write into a logstore (hidden mirrored file) and the data is placed into the journal.

Assuming the journal is sufficiently empty, the write is held there (cached) and only flushed to disk when the journal is full, thereby saving additional disk activity.

An optimal workload for EC involves small-block synchronous, sequential writes – something like an audit or redo log, for example. In that case, the coalescer will accumulate a full protection group’s worth of data and be able to perform an efficient FEC write.

The happy medium is a small-block sync (vmdk) type load where the I/O rate is low, and the client is latency-sensitive. In this case, the latency will be reduced and, if the I/O rate is low enough, it won’t create serious pressure.

The undesirable scenario is when the cluster is already spindle-bound and the workload is such that it generates a lot of journal pressure. In this case, EC is just going to aggravate things.

Although on by default, setting the boolean sysctl efs.bam.ec.mode value to ‘1’ will enable the Endurant Cache:

# isi_for_array –s isi_sysctl_cluster efs.bam.ec.mode=1

EC can also be enabled & disabled per directory:

# isi set -c [on|off|endurant_all|coal_only] <directory_name>

To enable the coalescer but switch of EC, run:

# isi set -c coal_only

And to disable the Endurant Cache completely:

# isi_for_array –s isi_sysctl_cluster efs.bam.ec.mode=0

A return value of zero on each node from the following command will verify that EC is disabled across the cluster:

# isi_for_array –s sysctl efs.bam.ec.stats.write_blocks efs.bam.ec.stats.write_blocks: 0

If the output to this command is incrementing, EC is delivering stable writes.

As mentioned previously, EC applies to stable writes, namely:

  • Writes with O_SYNC and/or O_DIRECT flags set
  • Files on synchronous NFS mounts

When it comes to analyzing any performance issues involving EC workloads, consider the following:

  • What changed with the workload?
  • If upgrading OneFS, did the prior version also have EC enable?
  • If the workload has moved to new cluster hardware:
  • Was there a large change in spindle or node count?
  • Has the OneFS protection level changed?
  • Is the SSD strategy the same?
  • Does the performance issue occur during periods of high CPU utilization?
  • Which part of the workload is creating a deluge of stable writes?

Disabling EC is typically done cluster-wide and this can adversely impact certain workflow elements. If the EC load is localized to a subset of the files being written, an alternative way to reduce the EC heat might be to disable the coalescer buffers for some particular target directories, which would be a more targeted adjustment. This can be configured via the isi set –c off command.

One of the more likely causes of performance degradation is from applications aggressively flushing over-writes and, as a result, generating a flurry of ‘commit’ operations. This can generate heavy read/modify/write (r-m-w) cycles, inflating the average disk queue depth, and resulting in significantly slower random reads. The isi statistics protocol CLI command output will indicate whether the ‘commit’ rate is high.

Bear in mind that synchronous writes do not require using the NFS ‘sync’ mount option! Any programmer who is concerned with write persistence can simply specify an O_FSYNC or O_DIRECT flag on the open() operation to force synchronous write semantics for that fie handle. With Linux, writes using O_DIRECT will be separately accounted-for in the Linux ‘mountstats’ output.

Note: Although almost exclusively associated with NFS, the EC code is actually protocol-agnostic. If writes are synchronous (write-through) and are either misaligned or smaller than 8k, they have the potential to trigger EC, regardless of the protocol.

The Endurant Cache can often provide a significant latency benefit for small (eg. 4K), random synchronous writes – albeit at a cost of some additional work for the system.

However, it’s worth bearing the following in mind:

  • EC is not intended for more general purpose I/O.
  • There is a finite amount of EC available. As load increases, EC can potentially ‘fall behind’ and end up being a bottleneck.
  • Endurant Cache does not improve read performance, since it’s strictly part of the write process.
  • EC will not increase performance of asynchronous writes – only synchronous writes.

OneFS TreeDelete

There have been several recent enquires about large scale file deletes, so a quick article on this topic seemed appropriate.

For example, imagine a workflow that involves creating and deleting thousands or millions of files of varying sizes each day. The serial deletion of these files at the NFS and SMB host level would be incredibly slow and inefficient. Fortunately, OneFS has a purpose-built tool for this: the TreeDelete job.

Within the OneFS Job Engine, TreeDelete is a single phase job which runs by default with ‘medium’ impact and a default priority value of 4.

The command line syntax to kick off an instance of this job is:

# isi job jobs start treedelete –-paths <path>

Or, alternatively, via the platform API:

# curl -k -u username:password -H 'Content-Type: application/json' --request POST --data '{"paths": ["<path>"], "type": "treedelete", "allow_dup": true}' 'https://<cluster_IP>:8080/platform/1/job/jobs'

Note that the ‘path’ argument for these commands must be within the cluster’s /ifs partition. TreeDelete will not work on any of the other OneFS filesystem partitions, such as /root, /var, etc, which will fail with “non-valid partition” error. Also, if attempting to delete /ifs/.ifsvar, the TreeDelete job will fail with “Invalid path specified”. Beyond these however, TreeDelete does not prompt to ensure that you’ve selected the desired directory path to remove, etc. So, to avoid any unpleasant surprises, check twice before running the TreeDelete job to ensure that you have configured the job correctly and specified the correct path(s):  There is no ‘undo’ button.

Multiple directory paths can be specified as part of this command. For example:

# isi job jobs start treedelete --paths /ifs/dir1  --paths /ifs/dir2 --paths /ifs/dir3 –paths /ifs/dir4

Deleting more than 60 paths in a single TreeDelete job command has been successful. And you’ll likely hit the command line max length well before finding a tree delete path limit. Additionally, you can always queue up to thirty TreeDelete jobs, if desired.

TreeDelete job progress is reported a percentage in “isi stat” output:

 TreeDelete (12)            MEDIUM     02/16 11:34  00:01:11    24%  /ifs/data/recycle

Upon completion, the job status will be reported as such:

Fri Mar 5 23:19:26 2021 Daemon[57225]: TreeDelete job deleted 131028 files/dirs, 322GB, with 0 errors

Plus, a full job report, containing deleted data counts and capacities, cluster resource utilization, and job engine stats, etc, is available as follows:

# isi job reports view 12 -v

TreeDelete[12] phase 1 (2021-03-05T23:19:26)

--------------------------------------------

Paths                                     [ "/ifs/data/trash" ]

Files                                     131028

Directories                               101

Apparent size                             267387005115

Physical size                             357773443072

JE/Coordinator/Merge microseconds         { sum = 55, mean = 11, stdev = 8.12404 }

JE/Error Count                            0

JE/Group at phase end                     [ "<1,8> :{ 1-4:0-14, smb: 1-4, nfs: 1-4, swift: 1-4, all_enabled_protocols: 1-4, isi_cbind_d: 1-4, lsass: 1-4, s3: 1-4 }" ]

JE/Manager/Merge microseconds             { sum = 150, mean = 5.17241, stdev = 2.76766 }

JE/Stats/CPU avg                          13.73%

JE/Stats/CPU max                          506.84%

JE/Stats/CPU max node                     1

JE/Stats/CPU min                          0.00%

JE/Stats/CPU min node                     2

JE/Stats/IO/Current job/Read bytes        0 bytes

JE/Stats/IO/Current job/Reads             0

JE/Stats/IO/Current job/Write bytes       741941248 bytes (707.57M)

JE/Stats/IO/Current job/Writes            90569

JE/Stats/IO/Non-JE/Read bytes             0 bytes

JE/Stats/IO/Non-JE/Reads                  0

JE/Stats/IO/Non-JE/Write bytes            26492928 bytes (25.27M)

JE/Stats/IO/Non-JE/Writes                 3234

JE/Stats/IO/Other jobs/Read bytes         0 bytes

JE/Stats/IO/Other jobs/Reads              0

JE/Stats/IO/Other jobs/Write bytes        36274176 bytes (34.59M)

JE/Stats/IO/Other jobs/Writes             4428

JE/Stats/Memory/RSS size avg              43867136 bytes (41.83M)

JE/Stats/Memory/RSS size max              46411776 bytes (44.26M)

JE/Stats/Memory/RSS size max node         4

JE/Stats/Memory/RSS size min              42795008 bytes (40.81M)

JE/Stats/Memory/RSS size min node         3

JE/Stats/Memory/VM size avg               91016338 bytes (86.80M)

JE/Stats/Memory/VM size max               93835264 bytes (89.49M)

JE/Stats/Memory/VM size max node          1

JE/Stats/Memory/VM size min               90251264 bytes (86.07M)

JE/Stats/Memory/VM size min node          3

JE/Time elapsed                           35 seconds

JE/Time working                           35 seconds

JE/Worker/Finalize item microseconds      { sum = 0, mean = 0, stdev = -- }

JE/Worker/Finalize task microseconds      { sum = 20, mean = 0.689655, stdev = 1.93165 }

JE/Worker/Next item microseconds          { sum = 103567, mean = 20.1846, stdev = 113.937 }

JE/Worker/Process item microseconds       { sum = 34027935, mean = 6644.78, stdev = 4183.41 }

JE/Worker/Process item total microseconds { sum = 34027935, mean = 6644.78, stdev = 4183.41 }




TreeDelete[12] Job Summary

--------------------------

Final Job State  Succeeded

Phase Executed

TreeDelete requires OneFS to perform a treewalk within the filesystem namespace as the first task, in order to determine how much work it will need to perform.  For instance, a TreeDelete job starting at /ifs/temp will traverse the directory hierarchy down to the lowest-level subdirectories.  For more complex TreeDelete configurations, the job will traverse all the configured policy paths, so it’s fair to say that TreeDelete can be a relatively metadata-heavy process.

As such, enabling metadata write acceleration (all metadata housed on SSDs) has the potential to speed up TreeDelete substantially. However, for optimal speed here, there are other considerations.

Layout can gain you a lot, provided you’re also smart about running multiple threads to do the deletions. If you have lots of smaller directories, delete performance is likely to be very good. If you have a few wide directories, it’s not likely to help much.

Be aware that, even though TreeDelete is multithreaded, if the deletion happens in a directory, it still requires an exclusive lock on the directory. This would slow the deletion down as the job’s worker thread will have to wait on getting a lock to do the deletion. So, if you have the available I/O, you can literally delete files stored in ten separate directory ten times as quickly deleting the same files from a single directory.

So ‘manually’ spreading the delete load has the potential to be faster. Also, deleting large files is quite expensive in terms of free-space management. For files larger than a tunable threshold in size, each node will, by default, spin up a background thread to delete the file.

A general rule is:

More directories = more TreeDelete parallelism = better performance.

Running the TreeDelete job at high impact, rather than the default medium, will increase the number of threads. However, this is only recommended for an idle cluster – not if there is other work happening.

Another option for data housekeeping on a cluster is using TreeDelete to maintain a common ‘recycle bin’. This can be done by create a directory like /ifs/recycle for users to dump their unwanted files in, via Windows file explorer, or the UNIX/linux ‘mv’ command, etc.  Then periodically manually run or set up a cron job for the treedelete job. For example:

# isi job jobs start treedelete --path /ifs/recycle --priority 10 --policy low

Note that the TreeDelete job will also delete the /ifs/recycle folder. If this is a problem, you can also:

  • Set an advisory quota on the parent directory, which will prevent it from being deleted. However, this requires SmartQuotas to be licensed on the cluster.
  • Add a cron job to recreate the recycle directory after the TreeDelete has completed.
  • Use a symlink for the current recycle directory. When you want to empty it, switch the symlink to a new empty directory, then start the job to delete the old directory.
  • Add subdirectories to the parent directory and just delete the ‘junk’ subdirectory. Ie. /ifs/recycle/junk1, /ifs/recycle/junk2, /ifs/recycle/junk3. This will potentially have the added benefit of more directory parallelism, and potentially better performance.

Be aware of the potential for file name collisions in the recycle bin. If two users both attempt to move files with the same name the recycle bin, the second one will require delete permissions to the first one.

In order to immediately reclaim the deleted file space from a TreeDelete job run, it may be necessary to remove all snapshot policies in that project path, delete those snapshots, then move the project into the recycle bin and let TreeDelete take over. Even the moving data to the recycle bin tree can force SnapshotIQ to preserve blocks if the snap existed before prior to the data being moved.  To delete a project entirely, you must remove all snaps associated with that tree to actually get all of your space back.  This can potentially include snapshots taken higher in the tree. Be aware that some other jobs, such as the FilesystemAnalyze, ChangelistCreate, etc, can keep a snapshot at the /ifs level sitting around to make incremental FSA jobs run faster.

As mentioned above, be aware that TreeDelete will not, by default, delete a directory that is the root directory of a quota. However, TreeDelete in OneFS 8.2 and later contains a flag to remove the top level quota, if one exists, so the final rmdir does not fail. For example:

# isi job jobs start treedelete –delete-quotas --path /ifs/recycle

It’s also fairly straightforward to write simple scripts to run TreeDelete. For example, the following shell command looks for subdirectories one level under the parent path, ‘/ifs/recycle’, and instructs TreeDelete to remove them:

# for i in `find /ifs/recycle -type d –depth 1`; do isi job jobs start treedelete --path $i; done

Another approach can be used in situations (ie. Windows environments) where the directory names can contain whitespace:

# find /ifs/recycle -type d –depth 1 -print0 | xargs -0 -I % isi job jobs start treedelete --path “%”

This command will also combine each line of the file into a single line and pass to the path argument.

Note that if you have more than 30 directories, the command will likely fail because the job engine  cannot queue more than thirty jobs. However, when emptying out recycle bin(s), if you create a time-stamped sub-directory and move everything for deletion into it, and then TreeDelete this directory, the 30-job limit is avoided.

This is a common challenge across a range of verticals, and particular in the EDA realm, where there are a couple of creative solutions. In addition to custom tools, solution also include discovering the files via find and moving them to a /ifs/data/recycle/date/batch-number/…100,000 files. Treedelete can then be run on a schedule, one at a time per batch number, with low impact and off hours so as not to impact key work flow times.

For example, a TREEDELETE_OFF_HOURS job impact policy can be created, which might include ‘SAT 00:00 to Sun 00:00 AND MON 00:00 to 06:00 LOW’. The default impact of the job could then be reconfigured from MEDIUM to TREEDLETE_OFF_HOURS. This would mean that any time that a TreeDelete job is run without specifying ‘-o medium’, it would automatically inherit and execute the TREEDELETE_OFF_HOURS schedule.

When moving directories to a recycle bin, beware of not crossing quota domains. The performance impact will be significant since the quota traverse will require a copy, delete, and then the subsequent TreeDelete.

So there you have it – a couple of examples in which the TreeDelete job can simplify and improve the wall clock time for data removal.

Snapdiff with Changelist

Snapdiff with changelist has been introduced in OneFS 8.2.2.0. It can return what regions of a file have been changed. However this functionality is hidden from WebUI and CLI and I see many customers are very interested in this new feature. This article will walk you through an example.

To create a snapdiff changelist, you have to add the option “–create-diffs” which is disabled by default.

# isi job start changelistcreate --newer-snapid=24 --older-snapid=22 --create-diffs

Note, you will not see this option in the help or description page of the CLI command.

After that you can either use isi_changelist_mod command to view the outcome. To include snapdiff deltas in the output, you have to add the option “–x”

# isi_changelist_mod -a 22_24 --x
st_ino=4306829722 st_mode=0100700 st_size=2837 st_atime=1614240223 st_mtime=1614240223 st_ctime=1614240223 st_flags=285212896 cl_flags=ENTRY_MODIFIED path=/ifs/test/test.txt
offset:0 size:2837 type: data

You can also leverage PAPI for the same purpose. The corresponding PAPI endpoint is:

platform/10/snapshot/changelists/<CHANGELIST>/diff-regions/<LIN>

Note, it’s only available starting from platform/10. To get a very detailed description of the PAPI you can use the following URL:

platform/10/snapshot/changelists/<CHANGELIST>/diff-regions/<LIN>/?describe
Resource URL: /platform/10/snapshot/changelists/<CHANGELIST>/diff-regions/<LIN>

    Overview: This resource represents the collection of snap diff regions.

     Methods: GET

********************************************************************************

Method GET: Get snap diff regions of a file.

URL: GET /platform/10/snapshot/changelists/<CHANGELIST>/diff-regions/<LIN>

Query arguments:
 resume=<string> Continue returning results from previous call using this token
                 (token should come from the previous call, resume cannot be
                 used with other options).
 limit=<integer> Return no more than this many results at once (see resume).
offset=<integer> 

GET response body schema:
{
  "type": [
    {
      "additionalProperties": false, 
      "type": "object", 
      "description": "A list of errors that may be returned.", 
      "properties": {
        "errors": {
          "minItems": 1, 
          "items": {
            "additionalProperties": false, 
            "type": "object", 
            "description": "An object describing a single error.", 
            "properties": {
              "field": {
                "minLength": 1, 
                "type": "string", 
                "description": "The field with the error if applicable.", 
                "maxLength": 8192
              }, 
              "message": {
                "minLength": 1, 
                "type": "string", 
                "description": "The error message.", 
                "maxLength": 8192
              }, 
              "code": {
                "minLength": 1, 
                "type": "string", 
                "description": "The error code.", 
                "maxLength": 8192
              }
            }
          }, 
          "type": "array", 
          "maxItems": 65535
        }
      }
    }, 
    {
      "additionalProperties": false, 
      "type": "object", 
      "properties": {
        "diff_regions": {
          "minItems": 0, 
          "items": {
            "type": "object", 
            "properties": {
              "byte_count": {
                "required": true, 
                "minimum": 0, 
                "type": "integer", 
                "description": "Byte count of change region.", 
                "maximum": 18446744073709551615
              }, 
              "region_type": {
                "required": true, 
                "description": "Type of change region.", 
                "minLength": 4, 
                "enum": [
                  "sparse", 
                  "data", 
                  "unchanged"
                ], 
                "maxLength": 9, 
                "type": "string"
              }, 
              "start_offset": {
                "required": true, 
                "minimum": 0, 
                "type": "integer", 
                "description": "Starting byte offset of change region.", 
                "maximum": 18446744073709551615
              }
            }
          }, 
          "type": "array", 
          "maxItems": 18446744073709551615
        }, 
        "resume": {
          "minLength": 0, 
          "type": [
            "string", 
            "null"
          ], 
          "description": "Provide this token as the 'resume' query argument to continue listing results.", 
          "maxLength": 8192
        }
      }
    }
  ]
}

Here is the output for the same example:

URL:

https://192.168.116.188:8080/platform/10/snapshot/changelists/22_24/diff-regions/4306829722

Outcome:

{
"diff_regions" : 
[

{
"byte_count" : 2837,
"region_type" : "data",
"start_offset" : 0
}
],
"resume" : null
}

 

Creating OneFS Changelists

In the previous article, we examined the context around OneFS changelists and the ChangelistCreate job. Next, let’s step through an example of creating and managing a simple changelist via the OneFS CLI.

Here’s a basic procedure:

  1. First, create a directory (ifs/data/test1) and add couple of files:
# mkdir -p -v /ifs/data/test1

# echo f1 > /ifs/data/test1/f1

# echo f2 > /ifs/data/test1/f2
  1. Next, take a snapshot of the directory:
# isi snap snaps create /ifs/data/test1
  1. Modify the data (edit f1, remove f2, add f3):
# echo f1 >> /ifs/data/test1/f1

# rm /ifs/data/test1/f2

# echo f3 > /ifs/data/test1/f3
  1. Take a second snapshot:
# isi snap snaps create /ifs/data/test1
  1. View the snapshot ID’s:
# isi snap snaps ls

ID   Name  Path        

------------------------

3    s3    /ifs/data/test1

5    s5    /ifs/data/test1

------------------------

Total: 2
  1. Start a ChangelistCreate job using the two snapshot IDs above, and allow it to run to completion. Once the job no longer appears in the ‘isi job jobs ls’ command output, you know it’s done:
# isi job jobs start ChangelistCreate --older-snapid 3 --newer-snapid 5

Started job [3]

# isi job jobs ls

ID   Type             State   Impact  Pri  Phase  Running Time

---------------------------------------------------------------

3    ChangelistCreate Running Low     5    2/4    9s          

---------------------------------------------------------------

Total: 1

Once the job no longer appears in the ‘isi job jobs ls’ command output, you know it’s done:

# isi job jobs ls

ID Type State Impact Pri Phase Running Time

-------------------------------------------

-------------------------------------------

Total: 0
  1. List all the cluster’s changelists:
# isi_changelist_mod -l

3_5
  1. Describe all the changelists:
# isi_changelist_mod -i –all 

name=3_5 num_entries=4 owner=3 path=/ifs/data/test1
  1. Display all the entries in the new changelist (in this case, 3_5):
# isi_changelist_mod -a 3_5

st_ino=4297195572 st_mode=040755 st_size=40 st_atime=1429572712 st_mtime=1429572712 st_ctime=1429572712 st_flags=224 cl_flags=00 path=/ifs/data/test1

st_ino=4297261063 st_mode=0100644 st_size=6 st_atime=1429572699 st_mtime=1429572699 st_ctime=1429572699 st_flags=224 cl_flags=00 path=/ifs/data/test1/f1

st_ino=4297261065 st_mode=0100644 st_size=3 st_atime=1429572712 st_mtime=1429572712 st_ctime=1429572712 st_flags=224 cl_flags=01 path=/ifs/data/test1/f3

st_ino=4297261064 st_mode=0100644 st_size=3 st_atime=1429572687 st_mtime=1429572687 st_ctime=1429572687 st_flags=224 cl_flags=02 path=/ifs/data/test1/f2


  1. Display all the entries with by path only & with added and removed prefixes (+/-):
# isi_changelist_mod -a 3_5 --p --v

/ifs/data/test1

/ifs/data/test1/f1

+/ifs/data/test1/f3

-/ifs/data/test1/f2
  1. Delete (or kill, -k) the new changelist:
# isi_changelist_mod -k 3_5

In addition to the CLI, the OneFS PlatformAPI (pAPI) can also be used to programmatically interact with changelists. Here are the associated API methods available.

pAPI Changelist Method Description
/snapshot/changelists List all changelists.
/snapshot/changelists/<CHANGELIST> Retrieve basic information on a changelist.
/snapshot/changelists/<CHANGELIST>/diff-regions/<LIN> Get snap diff regions of a file.
/snapshot/changelists/<CHANGELIST>/entries Get entries from a changelist.
/snapshot/changelists/<CHANGELIST>/entries/<ID> Get a single entry from the changelist by ID.
/snapshot/changelists/<CHANGELIST>/lins Get entries from a changelist.
/snapshot/changelists/<CHANGELIST>/lins/<LIN> Get a single entry from the changelist by LIN.

For example, to retrieve all changelists on the local node:

# curl -k -v -X GET --header "Content-Type:application/json" -u <username>:<password> https://localhost:8080/platform/1/snapshot/changelists

Additionally, the Job Engine API methods can be used to interface with the ChangelistCreate job.

pAPI Job Method Description
/job/events List job events.
/job/job-summary View job engine status.
/job/jobs List running and paused jobs.Queue a new instance of a job type.
/job/jobs/<JID> View a single job instance.Modify a running or paused job instance.
/job/policies List job impact policies.Create a new job impact policy.
/job/policies/<NAME> View a single job impact policy.Modify/delete a job impact policy.
/job/recent List recently completed jobs.
/job/reports List job reports.
/job/statistics View job engine statistics.
/job/types List job types.
/job/types/<NAME> Retrieve job type information.Modify the job type.

For example, to start a job:

# curl https://localhost:8080/platform/1/job/jobs -k -u <username>:<password> -v --data '{"type": "ChangelistCreate", "changelistcreate_params" : {"older_snapid" : 2, "newer_snapid" : 6}}'

Here’s an example of the PlatformAPI XML changelist output for an added file (file1).

{

            "atime": {

                "nsec": 0,

                "sec": 1612177389

            },

            "btime": {

                "nsec": 0,

                "sec": 1612177389

            },

            "change_types": [

                "ENTRY_ADDED"

            ],

            "ctime": {

                "nsec": 0,

                "sec": 1612177389

            },

            "data_pool": -3,

            "file_type": "regular",

            "gid": 0,

            "id": "68723787424",

            "lin": 4295236714,

            "metadata_pool": -3,

            "mtime": {

                "nsec": 0,

                "sec": 1612177389

            },

            "parent_lin": 4295163972,

            "path": "/ifs/data/test1/file1",

            "physical_size": 512,

            "size": 0,

            "uid": 0,

            "user_flags": [

                "uarch",

                "inherit",

                "writecache",

                "wcinherit",

                "shasntfsacl"

            ]

}

Changelist data is stored in an STF system B-Tree (SBT), where the key to the SBT is the ID field. The keys are stored in ascending order, and iteration of an SBT occurs in ascending key order. Changelist entries are returned in ascending order by ID in the different clients. Changelists are indexed in ascending order by ID field. The ID field is based partially on LIN and operation (plus some additional bits to allow for operations on the N links of a file).

The correlation between ID field and LIN in the XML output above is as follows:

Decimal                           Hexadecimal

"id": "68723787424",              100041C6A0

"lin": 4295236714,                100041C6A

In each instance, the difference between ID and LIN in hexadecimal notation is the addition of a trailing 0 to the 9-digit LIN value to make a 10-digit ID.

So in older pAPI output (ie. OneFS 8.x), where only the ID is displayed, the LIN can be calculated from the ID.

Unlink operations will appear at the end of a changelist based on sort order.

Be aware that the platformAPI call /platform/*/snapshot/changelists/<CHANGELIST>/lins has been deprecated and is replaced by /platform/*/snapshot/changelists/<CHANGELIST>/entries.

Logging for changelist creation (Job Engine) can be found in /var/log/isi_job_d.log.

OneFS ChangelistCreate Job

Received a couple of recent questions around the OneFS ChangelistCreate job, so thought it would make for a useful topic to explore over the course of the next couple of blog articles.

A changelist is essentially a catalog of the attributes and objects which changed between two checkpoints. For example, a list of the files, dirs, etc which were added, removed, modified, etc, typically between two file system snapshots. In OneFS, changelists are implemented as system B-trees (SBTs), with entries being indexed by ID or LIN, and containing information such as path, type and resulting size of the corresponding object. The SnapshotIQ framework is leveraged as the underlying mechanism for taking the required snapshots.

Historically, changelists were primarily utilized by SyncIQ as the foundation for differential replication. However, they are now used more widely, such as by FSAnalyze for InsightIQ, and IndexUpdate for the SmartPools FilePolicy job. Changelists are also of considerable interest to partners and vendors looking to integrate third party data protection and management software, solutions and tools with OneFS.

The OneFS Job Engine contains a class of jobs which utilize a ‘changelist’, rather than LIN-based scanning. The changelist approach analyzes two snapshots to find the LINs which changed, or delta,  between the snapshots, and then examines and catalogs the detailed changes.

The ChangelistCreate job supports the creation of a changelist for any two snapshots with a common root path. The job itself is started manually and runs by default with a LOW impact policy and a priority of 5.

# isi job types view ChangelistCreate

         ID: ChangelistCreate

Description: Create a list of changes between two snapshots with matching root paths.

    Enabled: Yes

     Policy: LOW

   Schedule: -

   Priority: 5

When a new ChangelistCreate job starts, it checks to see whether a finalized changelist with matching snapshot IDs already exists. If so, it completes quickly and successfully. Next, the job checks whether an in-progress changelist with matching snapshot IDs already exists. In this case, it retrieves and inspects the metadata entry (LIN 1) to find the ID of the job most recently operating on the changelist. If the ID references an active (e.g. paused) job, then the new job will exit with an EINPROGRESS error. Otherwise, the new job sets the metadata job ID value to its own job ID and proceeds with changelist creation.

ChangelistCreate scans all the snapshot tracking files (STFs) from the older snapshot (inclusive) to the newer snapshot (exclusive) in order to build a comprehensive list of applicable, changed LINs. When complete, it  compiles a changelist through a combination of tree walks, stats, and path lookups.

Under the hood, the ChangelistCreate job has four distinct phases:

Phase Description
1.    Summarize Performs basic validation and snapshot locking, determining any restart state, and summary STF creation.
2.    Examine Handles the stat, path lookup, scoped tree walks, and changelist entry creation activity.
3.    Merge Moves entries from a temporary ‘split-lin’ changelist to the primary changelist.
4.    Enumerate Calculates the final changelist entry count.

If a ChangelistCreate job fails or is cancelled before a changelist is finalized, the partial results can typically still be used by a subsequent job, with a couple of caveats. Specifically, stoppage at the summarize phase will result in the loss of all summary STF creation work, and stoppage at the ‘examine’ phase will result in any subsequent job(s) still iterating over all LINs in the summary STF but avoiding scoped tree walks, etc when a changelist entry already exists.

The ChangeListCreate ‘examine’ phase employs recursive logic, which is used to divide the task into work items to allow for distributed, interruptible processing.

For example, consider the processing of a LIN for a regular file, with no hardlinks or alternate data streams, which was changed at some point between the older and newer snapshots:

  1. The job engine calls a task’s ‘item_process’ routine, which, seeing that its work stack is empty, reads the next LIN from its allotted range in the summary STF, and then creates a work item for the LIN.
  2. This work item is pushed on to the work stack and  a handler invoked.
  3. The handler runs, sees that the LIN references a file, sets the work item’s type approiately, and returns.
  4. The ‘item_process’ routine addresses the newest work item on the stack.
  5. The  handler runs, a changelist entry is created for the LIN, and the handler pops the work item off the stack and returns.
  6. At this point the work stack is empty and the cycle repeats.

A depth-first approach ensures that a changelist entry is not created for a parent directory until entries have been created for all descendants covered by the recursive logic (the exception being LINs whose work items are split, which are written to a temporary “split-lin” changelist). As a result, if a prior job is interrupted, a subsequent job can easily identify and ignore branches that have already been fully processed.

From the administrator’s perspective, in addition to the regular job engine controls, OneFS also provides the ‘isi_changelist_mod’ CLI utility as the primary way to interact with changelists:

# isi_changelist_mod

Description:

    Manage snapshot changelists.

Usage:

    isi_changelist_mod -a cl_name ...            Display all entries.

    isi_changelist_mod -h                        Display help.

    isi_changelist_mod -i {cl_name | --all} ...  Describe changelist(s).

    isi_changelist_mod -k {cl_name | --all}      Kill changelist(s).

    isi_changelist_mod -l                        List changelists.


Options for entry display (-a) and changelist describe (-i):

    --B         Replace non-printable path characters with octal codes.

                See ascii(7).

    --b         As --B, but use C escape codes whenever possible,

                e.g. \t for TAB.

    --p         Only display path.

    --q         Replace non-printable path characters with '?'.

    --s         Append '/' to paths of directories.

    --t         Use shquote(3) on path strings to make them suitable for

                command-line arguments.

    --w         Display raw path strings. (Default.)


Additional options for entry display (-a):

    --d den     Fractional entry range denominator (2 <= den <= 1024).

    --n num     Fractional entry range numerator (1 <= num <= den).

    --v         Prepend '+' and '-' to paths of added and removed entries.

    For changelist entry st_* field descriptions, see stat(2).

Here’s an example command output entry for a particular changed file:

st_ino=4297261065 st_mode=0100644 st_size=3 st_atime=1429572712 st_mtime=1429572712 st_ctime=1429572712 st_flags=224 cl_flags=01 path=/ifs/data/test1/f3

The ‘st_*’ fields above are derived from ‘stat’ output, and correspond to the following:

ST Field Description
st_ino File’s inode number.
st_mode File type and mode .
st_size Size of the file (if it is a regular file or symbolic link).
st_atime Last access time of file data (epoch).
st_mtime Time of last modification of file data (epoch).
st_ctime File’s last status change timestamp (epoch time of last change to the inode)
st_flags User defined flags enabled for the file.

Similarly, the changelist entry status information field ‘cl_flags’ can have one of the following values:

CL Field Description
01 Added or moved to.
02 Removed of moved from.
04 Path changed – moved to/from.
10 Contains Alternate Data Stream(s)
20 Is an Alternate Data Stream
40 Hardlinks exist.

So, from the example above, we can see that the object has the following pertinent attributes:

  • Its inode number is 4297261065
  • It resides in the file system at /ifs/data/test1/f3.
  • The st_mode=0100644 shows it’s a file (value=1)
    • with user/group/all mode bits indicating user=read/write (6), group=read (4), everyone=read(4).
  • It’s newly created file, as indicated by the cl_flags=01 field’

If desired, OneFS can also be allow multiple ChangeListCreate jobs to run concurrently:

# isi_gconfig -t job-config jobs.types.changelistcreate.allow_multiple_instances=true

In the next article, we’ll walk through an example of creating, accessing and managing a changelist via both the CLI and platform API.

OneFS Caching – L3 Performance and Sizing

In the final article in this caching series we’ll take a look at some of the L3 cache’s performance benefits and attributes – plus how to size the cache and other considerations and good practices.

One of the goals of L3 is to deliver solid benefits right out of the box for a wide variety of workloads. However, L3 cache usually provides more benefit for random and aggregated workloads than for sequential and optimized workflows – typically delivering similar IOPS as SmartPools metadata-read strategy, for user data retrieval (reads).

Although the benefit of L3 caching is highly workflow dependent, the following general rules can be assumed:

  • During data prefetch operations, streaming requests are intentionally sent directly to the spinning disks (HDDs), while utilizing the L3 cache SSDs for random IO.
  • SmartPools metadata-write strategy may be the better choice for metadata write and/or overwrite heavy workloads, for example EDA and certain HPC workloads.
  • L3 cache can deliver considerable latency improvements for repeated random read workflows over both non-L3 nodepools and SmartPools metadata-read configured nodepools.
  • L3 can also provide improvements for parallel workflows, by reducing the impact to streaming throughput from random  reads (streaming meta-data).
  • The performance of OneFS job engine jobs can also be increased by L3 cache

L3 cache is enabled by default for Isilon A200, A200 and the older Gen5 NL and HD nodes that contain SSDs, and cannot be disabled. On these platforms, L3 cache runs in a metadata only mode. By storing just metadata blocks, L3 cache optimizes the performance of operations such as system protection and maintenance jobs, in addition to metadata intensive workloads.

Figuring out the size of the active data, or working set, for your environment is the first step in an L3 cache SSD sizing exercise.

L3 cache utilizes all available SSD space over time. As a rule, L3 cache benefits more with more available SSD space. However, sometimes losing spindle count hurts more than adding cache helps a workflow. If possible add a larger capacity SSD rather than multiple smaller SSDs.

L3 cache sizing involves calculating the correct amount of SSD space to fit the working data set. This can be done by using the isi_cache_stats command to periodically capture L2 cache statistics on an existing cluster.

Run the following commands based on the workload activity cycle, at job start and job end. Initially run isi_cache_stats –c in order to reset, or zero out, the counters. Then run isi_cache_stats –v at workload activity completion and save the output. This will help determine an accurate indication of the size of the working data set, by looking at the L2 cache miss rates for both data and metadata on a single node.

These cache miss counters are displayed as 8KB blocks. So an L2_data_read.miss value of 1024 blocks represents 8 MB of actual missed data.

The formula for calculating the working set size is:

(L2_data_read.miss + L2_meta_read.miss) = working_set size

Once the working set size has been calculated, a good rule of thumb is to size L3 SSD capacity per node according to the following formula:

L2 capacity + L3 capacity >= 150% of working set size.

There are diminishing returns for L3 cache after a certain point. With too high an SSD to working set size ratio, the cache hits decrease and fail to add greater benefit. Conversely, when compared to SmartPools SSD strategies, another benefit of using SSDs for L3 cache is that performance will degrade much more gracefully if metadata does happen to exceed the SSD capacity available.

Repeated random read workloads will typically benefit most from L3 cache via latency improvements. When sizing L3 SSD capacity, the recommendation is to use a small number (ideally no more than two) of large capacity SSDs rather than multiple small SSDs to achieve the appropriate capacity of SSD(s) that will fit your working data set.

When it comes to replacing failed L3 cache SSDs, the same procedure should be employed as for replacing other storage drives. However, L3 cache SSDs do not require FlexProtect or AutoBalance to run post replacement, so it’s typically a much faster process.

For a legacy node pool using a SmartPools metadata-write strategy, the conventional wisdom is to avoid converting it to L3 cache unless:

  1. The SSDs are seriously underutilized.
  2. The overall I/O mix has changed and represents a significant drop in metadata write percentage.
  3. The SSDs in the pool are oversubscribed and spilling over to hard disk.
  4. Your primary concern is SSD longevity.

OneFS Caching – The Key to L3

Unlike L1 and L2 cache, which are always present and operational in storage nodes, L3 cache is enabled per node pool via a simple on or off configuration setting. Other than this, there are no additional visible configuration settings to change. When enabled, L3 consumes all the SSD in node pool. Also, L3 cannot coexist with other SSD strategies, with the exception of Global Namespace Acceleration. However, since they’re exclusively reserved, L3 Cache node pool SSDs cannot participate in GNA.

Note that L3 cache is typically enabled by default on any new node pool containing SSDs.

Once the SSDs have been reformatted and are under the control of L3 cache, the WebUI removes them from usable storage:

There is also a global setting which governs whether to enable L3 cache by default for new node pools.

When converting the SSDs in a particular nodepool to use L3 cache rather than SmartPools, progress can be estimated by periodically tracking SSD space (used capacity) usage over the course of the conversion process. Additionally, the Job impact policy of the Flexprotect_Plus or SmartPools job responsible for the L3 conversion can be reprioritized to run faster or slower. This has the effect of conversely increasing or decreasing the impact of the conversion process on cluster resources.

OneFS provides tools to accurately assess the performance of the various levels of cache at a point in time. These cache statistics can be viewed from the OneFS CLI using the isi_cache_stats command. Statistics for L1, L2 and L3 cache are displayed for both data and metadata.

# isi_cache_stats

Totals            

l1_data: a 446G 100% r 579G  85% p 134G  89%, l1_encoded: a 0.0B   0% r 0.0B   0% p 0.0B   0%, l1_meta: r  82T 100% p 219M  92%,

l2_data: r 376G  78% p 331G  81%, l2_meta: r 604G  96% p 1.7G   4%,

l3_data: r   6G  19% p 0.0B   0%, l3_meta: r  24G  99% p 0.0B   0%

For more detailed and formatted output, a verbose option of the command is available using the ‘isi_cache_stats -v’ option:

# isi_cache_stats -v

------------------------- Totals -------------------------

l1_data:

        async read (8K blocks):
                aread.start:              58665103 / 100.0%
                aread.hit:                58433375 /  99.6%
                aread.miss:                 231378 /   0.4%
                aread.wait:                    350 /   0.0%


        read (8K blocks):
                read.start:               89234355 / 100.0%
                read.hit:                 58342417 /  65.4%
                read.miss:                13082048 /  14.7%
                read.wait:                  246797 /   0.3%
                prefetch.hit:             17563093 /  19.7%


        prefetch (8K blocks):
                prefetch.start:           19836713 / 100.0%
                prefetch.hit:             17563093 /  88.5%


l1_encoded:
        async read (8K blocks):
                aread.start:                     0 /   0.0%
                aread.hit:                       0 /   0.0%
                aread.miss:                      0 /   0.0%
                aread.wait:                      0 /   0.0%


        read (8K blocks):
                read.start:                      0 /   0.0%
                read.hit:                        0 /   0.0%
                read.miss:                       0 /   0.0%
                read.wait:                       0 /   0.0%
                prefetch.hit:                    0 /   0.0%


        prefetch (8K blocks):
                prefetch.start:                  0 /   0.0%
                prefetch.hit:                    0 /   0.0%


l1_meta:
        read (8K blocks):
                read.start:              11030213475 / 100.0%
                read.hit:                11019567231 /  99.9%
                read.miss:                 8070087 /   0.1%
                read.wait:                 2548102 /   0.0%
                prefetch.hit:                28055 /   0.0%


        prefetch (8K blocks):
                prefetch.start:              30483 / 100.0%
                prefetch.hit:                28055 /  92.0%


l2_data:
        read (8K blocks):
                read.start:               63393624 / 100.0%
                read.hit:                  5916114 /   9.3%
                read.miss:                 4289278 /   6.8%
                read.wait:                 9815412 /  15.5%
                prefetch.hit:             43372820 /  68.4%


        prefetch (8K blocks):
                prefetch.start:           53327065 / 100.0%
                prefetch.hit:             43372820 /  81.3%


l2_meta:
        read (8K blocks):
                read.start:               82823463 / 100.0%
                read.hit:                 78959108 /  95.3%
                read.miss:                 3643663 /   4.4%
                read.wait:                    1758 /   0.0%
                prefetch.hit:               218934 /   0.3%


        prefetch (8K blocks):
                prefetch.start:            5517237 / 100.0%
                prefetch.hit:               218934 /   4.0%


l3_data:
        read (8K blocks):
                read.start:                4418424 / 100.0%
                read.hit:                   817632 /  18.5%
                read.miss:                 3600792 /  81.5%
                read.wait:                       0 /   0.0%
                prefetch.hit:                    0 /   0.0%


        prefetch (8K blocks):
                prefetch.start:                  0 /   0.0%
                prefetch.hit:                    0 /   0.0%


l3_meta:
        read (8K blocks):
                read.start:                3104472 / 100.0%
                read.hit:                  3087217 /  99.4%
                read.miss:                   17255 /   0.6%
                read.wait:                       0 /   0.0%
                prefetch.hit:                    0 /   0.0%


        prefetch (8K blocks):
                prefetch.start:                  0 /   0.0%
                prefetch.hit:                    0 /   0.0%


l1_all:
        prefetch.start:                   19867196 / 100.0%
        prefetch.misses:                         0 /   0.0%

l2_all:
        prefetch.start:                   58844302 / 100.0%
        prefetch.misses:                     48537 /   0.1%

It’s worth noting that for L3 cache, the prefetch statistics will always read zero, since it’s a pure eviction cache and does not utilize data or metadata prefetch.

Due to balanced data distribution, automatic rebalancing, and distributed processing, OneFS is able to leverage additional CPUs, network ports, and memory as the system grows. This also allows the caching subsystem (and, by virtue, throughput and IOPS) to scale linearly with the cluster size.

OneFS Caching – Workings and Mechanics

In this article we’ll dig into the workings and mechanics of OneFS read caching a bit deeper…

L1 cache interacts with the L2 cache on any node it requires data from, and the L2 cache interacts with both the storage subsystem and L3 cache. L3 cache can be enabled or disabled at a nodepool level. L3 cached blocks are stored on one or more SSDs within the node and each node in the same node nodepool has to have L3 cache enabled

Here are the relative latency of OneFS Cache Hits and Misses:

Cache Hit Miss
L1 10us L2
L2 100us L3, (or Hard Disk)
L3 200us Hard Disk
Hard Disk 1-10ms x

Note: These latency numbers may vary in an active cluster.

L2 is typically more beneficial than L1 because a hit avoids a higher latency operation. An L1 cache hit avoids a back-end round-trip to fetch the data, whereas an L2 cache hit avoids a SATA disk seek in the worst case. This is a dramatic difference in both relative and absolute terms. For SATA drives, an L2 miss is two orders of magnitude above a hit compared to one for L1, and a single back-end round-trip is typically a small portion of a full front-end operation.

L2 is preferable because it is accessible to all nodes. Assuming a workflow with any overlap among nodes, it is preferable to have the cluster’s DRAM holding L2 data rather than L1. In L2, a given data block is only cached once and invalidated much less frequently. This is why storage nodes are configured with a drop-behind policy on file data. Nodes without disks will not drop behind since there is no L2 data to cache.

When a read request arrives from a client, OneFS determines whether the requested data is in local cache. Any data resident in local cache is read immediately. If data requested is not in local cache, it is read from disk. For data not on the local node, a request is made from the remote nodes on which it resides. On each of the other nodes, another cache lookup is performed. Any data in the cache is returned immediately, and any data not in the cache is retrieved from disk. When the data has been retrieved from local and remote cache (and possibly disk), it is returned back to the client.

Each level of OneFS’ cache hierarchy utilizes a different strategy for cache eviction, to meet the particular needs of that cache type. For L1 cache in storage nodes, cache aging is based on a drop-behind algorithm. L2 cache utilizes a Least Recently Used algorithm, or LRU, since it is relatively simple to implement, low-overhead, and performs well in general. By contrast, the L3 cache employs a first-in, first-out eviction policy (or FIFO) since it’s writing to what is effectively a specialized linear filesystem on SSD.

For OneFS, a drawback of LRU is that it is not scan resistant. For example, a OneFS Job Engine job or backup process that scans a large amount of data can cause the L2 cache to be flushed. This is mitigated to a large degree by the L3 cache. Other eviction policies have the ability to promote frequently accessed entries such that they are not evicted by scanning entries, which are accessed only once.

OneFS uses two primary sources of information for predicting a file’s access pattern and pre-populate the cache with data and metadata blocks before they’re requested:

  1. OneFS attributes that can be set on files and directories to provide hints to the filesystem.
  2. The actual read activity occurring on the file.

This technique is known as ‘prefetching’, whereby the latency of an operation is mitigated by predictively copying data into a cache before it has been requested. Data prefetching is employed frequently and is a significant benefactor of the OneFS flexible file allocation strategy.

Flexible allocation involves determining the best layout for a file based on several factors, including cluster size (number of nodes), file size, and protection level (e.g.+2 or +3). The performance effect of flexible allocation is to place a file on the largest number of drives possible, given the above constraints.

The most straightforward application of prefetch is file data, where linear access is common for unstructured data, such as media files. Reading and writing of such files generally starts at the beginning and continues unimpeded to the end of the file. After a few requests, it becomes highly likely that a file is being streamed to the end.

OneFS data prefetch strategies can be configured either from the command line or via SmartPools. File data prefetch behavior can be controlled down to a per-file granularity using the ‘isi set/get’ command’s access pattern setting. The available selectable file access patterns include concurrency (the default), streaming, and random.

# isi get tstfile1

POLICY    LEVEL PERFORMANCE COAL  FILE

default   6+2/2 streaming on    tstfile1

# isi set -l random tstfile1

# isi get tstfile1

POLICY    LEVEL PERFORMANCE COAL  FILE

default   6+2/2 random      on    tstfile1

Metadata prefetch occurs for the same reason as file data. Metadata scanning operations, such as finds and treewalks, can benefit. However, the use of metadata prefetch is less common because most accesses are random and unpredictable.

OneFS also provides a mechanism for prefetching files based on their nomenclature. In film and TV production, “streaming” often takes a different form as opposed to streaming an audio file. Each frame in a movie will often be contained in an individual file. As such, streaming reads a set of image files and prefetching across files is important. The files are often a subset of a directory, so directory entry prefetch does not apply. Ideally, this would be controlled by a client application, however in practice this rarely occurs.

To address this, OneFS has a file name prefetch facility. While file name prefetch is disabled by default, as with file data prefetch, it can be enabled with file access settings. When enabled, file name prefetch guesses the next sequence of files to be read by matching against several generic naming patterns.

Flexible file handle affinity (FHA) is a read-side algorithm designed to better utilize the internal threads used to read files. Using system configuration options and read access profiling, the number of operations per thread can be tuned to improve the efficiency of reads. FHA maps file handles to worker threads according to a combination of system settings, locality of the read requests (in terms of how close the requested addresses are), and the latency of the thread(s) serving requests to a particular client.

Note that prefetch does not apply to the L3 cache, since L3 is populated with ‘interesting’ L2 blocks dropped from memory by L2’s least recently used cache eviction algorithm.

Blocks evicted from L2 are candidates for inclusion in L3, and a filter is employed to reduce the quantity and increase the value of incoming blocks. Because L3 is a first in, first out (FIFO) cache, filtering is performed ahead of time. By selecting blocks that are more likely to be read again, L3 can both limit SSD churn and enhance the quality of the L3 cache contents.

The L3 filter uses several heuristics to evaluate which candidate blocks will likely be most valuable and should go to L3 cache. In general, L3 prefers metadata/inode to data blocks. And the guiding principle for data blocks is that the per-block cost of re-reading a sequential cluster of blocks from disk is much lower than performing random reads from disk. For example, if a block is a “random” read (ie. there are no neighboring blocks on this disk in L2), then it is always included in L3. Conversely, If the block is part of a sequential cluster of 16 or more blocks (128KB), it is not evicted to L3. As such, the L3 cache can be most effective, per capacity, by addressing random reads.

The most frequently accessed data and metadata on a node should just remain in L2 cache and not get evicted to L3. For the next tier of cached data that’s accessed frequently enough to live in L3, but not frequently enough to always live in RAM, there’s a mechanism in place to keep these semi-frequently accessed blocks in L3.

To maintain this L3 cache persistence, when the kernel goes to read a metadata or data block, the following steps are performed:

1) First, L1 cache is checked. Then, if no hit, L2 cache is consulted.

2) If a hit is found in memory, it’s done.

3) If not in memory, L3 is then checked.

4) If there’s an L3 hit, and that item is near the end of the L3 FIFO (last 10%), a flag is set on the block which causes it to be evicted into L3 again when it is evicted out of L2.

Additionally, any un-cached job engine metadata requests will always come from disk and bypass L3 cache, so they do not displace user-cached blocks from L3 cache. As new versions are written, the journal notifies L3, which invalidates and removes the dirty block(s) from its cache.

OneFS Caching Architecture

There have been a number of recent enquiries from the field around how caching is performed in OneFS. So it seemed like an ideal time to review this topic over the next couple of articles.

Caching occurs at multiple different levels, and for a variety of types of data. In this article we’ll concentrate on the caching of file system structures in main memory and on SSD.

OneFS’ caching infrastructure design is predicated on aggregating each individual node’s cache into one cluster wide, globally accessible pool of memory. This is achieved by using an efficient messaging system that allows all the nodes’ memory caches to be available to each and every node in the cluster.

For remote memory access, OneFS utilizes the Sockets Direct Protocol (SDP) over an Ethernet or Infiniband backend interconnect on the cluster. SDP provides an efficient, socket-like interface between nodes which, by using a switched star topology, ensures that remote memory addresses are only ever one hop away. While not as fast as local memory, remote memory access is still very fast due to the low latency of the dedicated backend interconnect.

OneFS uses up to three levels of read cache, plus an NVRAM-backed write cache, or write coalescer. The first two types of read cache, level 1 (L1) and level 2 (L2), are memory (RAM) based, and analogous to the cache used in CPUs. A third tier of read cache, called SmartFlash, or Level 3 cache (L3), is also configurable on nodes that contain solid state drives (SSDs). L3 cache is an eviction cache that is populated by L2 cache blocks as they are aged out from memory.

The OneFS caching subsystem is coherent across the cluster. This means that if the same content exists in the private caches of multiple nodes, this cached data is consistent across all instances. For example, consider the following scenario:

  1. Node 2 and Node 4 each have a copy of data located at an address in shared cache.
  2. Node 4, in response to a write request, invalidates node 2’s copy.
  3. Node 4 then updates the value.
  4. Node 2 must re-read the data from shared cache to get the updated value.

OneFS utilizes the MESI Protocol to maintain cache coherency, implementing an “invalidate-on-write” policy to ensure that all data is consistent across the entire shared cache. The various states that in-cache data can take are:

M – Modified: The data exists only in local cache, and has been changed from the value in shared cache. Modified data is referred to as ‘dirty’.

E – Exclusive: The data exists only in local cache, but matches what is in shared cache. This data referred to as ‘clean’.

S – Shared: The data in local cache may also be in other local caches in the cluster.

I – Invalid: A lock (exclusive or shared) has been lost on the data.

L1 cache, or front-end cache, is memory that is nearest to the protocol layers (e.g. NFS, SMB, etc) used by clients, or initiators, connected to that node. The main task of L1 is to prefetch data from remote nodes. Data is pre-fetched per file, and this is optimized in order to reduce the latency associated with the nodes’ IB back-end network. Since the IB interconnect latency is relatively small, the size of L1 cache, and the typical amount of data stored per request, is less than L2 cache.

L1 is also known as remote cache because it contains data retrieved from other nodes in the cluster. It is coherent across the cluster, but is used only by the node on which it resides, and is not accessible by other nodes. Data in L1 cache on storage nodes is aggressively discarded after it is used. L1 cache uses file-based addressing, in which data is accessed via an offset into a file object. The L1 cache refers to memory on the same node as the initiator. It is only accessible to the local node, and typically the cache is not the primary copy of the data. This is analogous to the L1 cache on a CPU core, which may be invalidated as other cores write to main memory. L1 cache coherency is managed via a MESI-like protocol using distributed locks, as described above.

It’s worth noting that L1 cache is utilized differently in accelerator nodes, which don’t contain any disk drives. Instead, the entire read cache is L1 cache, since all the data is fetched from other storage nodes. Also, cache aging is based on a least recently used (LRU) eviction policy, as opposed to the drop-behind algorithm typically used in a storage node’s L1 cache. Because an accelerator’s L1 cache is large, and the data in it is much more likely to be requested again, so data blocks are not immediately removed from cache upon use. However, metadata & update heavy workloads don’t benefit as much, and an accelerator’s cache is only beneficial to clients directly connected to the node.

L2, or back-end cache, refers to local memory on the node on which a particular block of data is stored. L2 reduces the latency of a read operation by not requiring a seek directly from the disk drives. As such, the amount of data prefetched into L2 cache for use by remote nodes is much greater than that in L1 cache.

L2 is also known as local cache because it contains data retrieved from disk drives located on that node and then made available for requests from remote nodes. Data in L2 cache is evicted according to a Least Recently Used (LRU) algorithm. Data in L2 cache is addressed by the local node using an offset into a disk drive which is local to that node. Since the node knows where the data requested by the remote nodes is located on disk, this is a very fast way of retrieving data destined for remote nodes. A remote node accesses L2 cache by doing a lookup of the block address for a particular file object. As described above, there is no MESI invalidation necessary here and the cache is updated automatically during writes and kept coherent by the transaction system and NVRAM.

L3 cache is a subsystem which caches evicted L2 blocks on a node. Unlike L1 and L2, not all nodes or clusters have an L3 cache, since it requires solid state drives (SSDs) to be present and exclusively reserved and configured for caching use. L3 serves as a large, cost-effective way of extending a node’s read cache from gigabytes to terabytes. This allows clients to retain a larger working set of data in cache, before being forced to retrieve data from higher latency spinning disk. The L3 cache is populated with “interesting” L2 blocks dropped from memory by L2’s least recently used cache eviction algorithm.

Unlike RAM based caches, since L3 is based on persistent flash storage, once the cache is populated, or warmed, it’s highly durable and persists across node reboots, etc. L3 uses a custom log-based filesystem with an index of cached blocks. The SSDs provide very good random read access characteristics, such that a hit in L3 cache is not that much slower than a hit in L2.

To utilize multiple SSDs for cache effectively and automatically, L3 uses a consistent hashing approach to associate an L2 block address with one L3 SSD. In the event of an L3 drive failure, a portion of the cache will obviously disappear, but the remaining cache entries on other drives will still be valid. Before a new L3 drive may be added to the hash, some cache entries must be invalidated.

OneFS also uses a dedicated inode cache in which recently requested inodes are kept. The inode cache frequently has a large impact on performance, because clients often cache data, and many network I/O activities are primarily requests for file attributes and metadata, which can be quickly returned from the cached inode.

OneFS provides tools to accurately assess the performance of the various levels of cache at a point in time. These cache statistics can be viewed from the OneFS CLI using the isi_cache_stats command. Statistics for L1, L2 and L3 cache are displayed for both data and metadata.

# isi_cache_stats
Totals l1_data: a 409G 100% r 542G 84% p 134G 89%, l1_encoded: a 0.0B 0% r 0.0B 0% p% p 331G 81%, l2_meta: r 597G 96% p 1.7G 4%, l3_data: r 6G 18% p 0.0B 0%, l3_meta: r 22G 99

For more detailed and formatted output, a verbose option of the command is available using the following syntax:

# isi_cache_stats -v

It’s worth noting that for L3 cache, the prefetch statistics will always read zero, since it’s a pure eviction cache and does not utilize data or metadata prefetch.

Due to balanced data distribution, automatic rebalancing, and distributed processing, OneFS is able to leverage additional CPUs, network ports, and memory as the system grows. This also allows the caching subsystem (and, by virtue, throughput and IOPS) to scale linearly with the cluster size.

OneFS Quota Domains

In the previous article, we looked at the use of protection domains in OneFS, focusing on SyncIQ replication, SmartLock immutable archiving, and Snapshots and SnapRevert.

Under the hood, SmartQuotas is also based on the concept of domains – the linchpins of quota accounting. Since OneFS is a single file system, it relies on accounting domains for defining the scope of a quota in place of the typical volume boundaries found in most storage systems. As such, a domain defines which files belong to a quota, accounts for each resource type in that set and defines the top-level directory configuration point.

For SmartQuotas, the three main resource types are:

Resource Type Description
Directory A specific directory and all its subdirectories
User A specific user
Group All members of a specific group

A domain defined as “name@folder” would be the set of files under “folder”, owned by “name”, which could be either a user or a group. The files accounted include all files reachable from the given path, without traversing any soft links. The owner “name” can be ALL, and “/ifs”, the OneFS root directory is also an effective ALL for “folder”.

With SmartQuotas it’s easy to create traditional domain types quickly by using “ALL”. The following are examples of domain types:

  • All files belonging to user Jane: user:Jane@/ifs
  • All files under /ifs/home, belonging to any user: ALL@/ifs/home.
  • All files under /ifs/home that belong to user Jane: user:Jane@/ifs/home

Domains cannot be created on anything but directories. More specifically, domains are associated with the actual directories themselves, not directory paths. For example, if the domain is ALL@/ifs/home/data, but /ifs/home/data gets renamed to /ifs/home/files, the domain stays with the directory.

Domains can also be nested and may overlap. For example, say a hard quota is set on /ifs/data/marketing for 5TB. 1TB soft quotas are then placed on individual users in the marketing department. This ensures that the marketing directory as a whole never exceeds 5TB, while limiting the users in the marketing department to 1TB each.

A default quota domain is one that does not account for any specific set of files but instead specifies a policy for new domains that match a specific trigger. In other words, default domains are configuration templates for actual domains. SmartQuotas use the identity notation ‘default-user’, ‘default-group’, and ‘default directory’ to describe domains with default policies. For example, the domain default-user@/ifs/home becomes specific-user@/ifs/home for each specific-user that is not otherwise defined. All enforcements on default-user are copied to specific-user when specific-user allocates within the domain and the new inherited domain quota is termed as a Linked Quota. There may be overlapping defaults (i.e. default-user@/ifs and default-user@/ifs/home may both be defined).

Default quota domains help drastically simplify quota management for large environments by providing a mechanism to define top level template configurations from which many actual quotas are cloned, or linked. When a default quota domain is configured on a directory, any subdirectories created directly underneath this will automatically inherit the quota limits specified in the parent domain. This streamlines the provisioning and management quotas for large enterprise environments. Furthermore, default directory quotas can co-exist with user and/or group quotas and legacy default quotas.

Default directory quotas have been available since OneFS 8.2, in addition to the default user and group quotas available in earlier releases. For example:

  • Create default-directory quota
# isi quota create --path=/ifs/parent-dir --type=default-directory --hard-threshold=10M
  • Modify Default directory quota
# isi quota modify --path=/ifs/parent-dir --type=default-directory --advisory-threshold=6M --soft-threshold=7M --soft-grace=1D
  • List default-directory quota
# isi quota list                 

  Type              AppliesTo  Path            Snap  Hard   Soft  Adv  Used

  --------------------------------------------------------------------------

  default-directory DEFAULT    /ifs/parent-dir No    10.00M -    6.00M 0.00

  --------------------------------------------------------------------------

  Total: 1
  • Delete Default directory quota
# isi quota delete --path=/ifs/parent-dir --type=default-directory

If the enforcements on a default domain change, SmartQuotas will automatically propagate the changes to the Linked Quota domains. If a default quota domain is deleted, SmartQuotas will delete all children marked as inherited. An administrator may also choose to delete the default without deleting the children, but this will break inheritance on all inherited children.

For example, the creation & deletion of sub-directory under default directory folder causes inherited directory quota creation and removal:

A domain may be in one of three accounting states, as follows:

Quota Accounting States Description
Ready A domain in the ready state is fully accounted. SmartQuotas displays “ready” domains in all interfaces and all enforcements apply to such domains.
Accounting A domain is placed in the Accounting state when it’s waiting on accounting updates.
Deleting After a request to delete a domain, SmartQuotas will place the domain in the deleting state until tear-down is complete. Domain removal may be a lengthy process.

SmartQuotas displays accounting domains in all interfaces including usage data but indicate they are in the process of being “Accounted”. SmartQuotas applies all enforcements to accounting domains, even when it might reject an allocation that would have proceeded if it had completed the QuotaScan.

Domains in the deleting state are hidden from all interfaces and the top-level directory of a domain may be deleted while the domain is still in the deleting state (assuming there are no domains in “Ready” or “Accounting” state defined on the directory). No enforcements are applied for domains in “Deleting” state.

A quota scan is performed when the domain is in an Accounting State. This can occur during quota creation to account the new domain if a quota has been set for the domain and quota deletion to un-account the domain. A QuotaScan is required when creating a quota on a non-empty directory. If quotas are created up-front on an empty directory, no QuotaScan is necessary.

In addition, a QuotaScan job may be started from the WebUI or command line interface using the “isi job” command. Any path specified on the command line is treated as the root of a tree that should be processed. This is provided primarily as a means to re-scan a directory or maintenance reasons.

There are main three processes or daemons associated with SmartQuotas:

  • isi_quota_notify_d
  • isi_quota_sweeper_d
  • isi_quota_report_d

The job of the notification daemon, isi_quota_notify_d, is to listen for ‘limit exceeded’ and ‘link denied’ events and generate notifications for each. It also responds to configuration change events and instructs the QDB to generate ‘expired’ and ‘violated’ over-threshold notifications.

A quota sweeper daemon, isi_quota_sweeper_d, is responsible for a number of quota housekeeping tasks such as propagating default changes, domain and notification rule garbage collection and kicking off QuotaScan jobs when necessary.

Finally, the reporting daemon, isi_quota_report_d, is responsible for generating quota reports. Since the QDB only produces real-time resource usage, reports are necessary for providing point-in-time vies of a quota domain’s usage. These historical reports are useful for trend analysis of quota resource usage.

OneFS 8.2 and subsequent releases use the rpc.quotad service to facilitate client-side quota reporting on UNIX and Linux clients via native ‘quota’ tools. The service which runs on tcp/udp port 762 is enabled by default, and control is under NFS global settings.

Additionally, in OneFS 8.2 and later, users can now see their available user capacity set by soft and/or hard user and group quotas rather than the entire cluster capacity or parent directory-quotas. This avoids the ‘illusion’ of seeing available space that may not be associated with their quotas.