OneFS Customizable CELOG Alerts

Another feature enhancement introduced in the new OneFS 9.1 release is customizable CELOG event thresholds. This new functionality allows cluster administrators to customize the alerting thresholds for several filesystem capacity-based events. These configurable events and their default threshold values include:

Event Default Thresholds
SYS_DISK_VARFULL info (75%), warn (85%), crit (90%)
SYS_DISK_VARCRASHFULL warn (90%)
SYS_DISK_ROOTFULL warn (90%), crit (95%)
SYS_DISK_POOLFULL info (70%), warn (80%), crit (90%), emerg (97%)
SYS_DISK_SSDFULL info (75%), warn (85%), crit (90%)
SNAP_RESERVE_FULL warn (90%), crit (99%)
FILESYS_FDUSAGE info (85%), warn (90%), crit (95%)

These event thresholds can be easily set from the OneFS WebUI, CLI, or platform API. For configuration via the WebUI, browse to Cluster Management > Events and Alerts > Thresholds, as follows:

The desired event can be configured from the OneFS WebUI by clicking on the associated ‘Edit Thresholds’ button. For example, to lower the critical threshold for the FILESYS_FDUSAGE event from 95% to 92%:

Note that no two of an event’s thresholds can be set to the same value, and they must be correctly ordered: an informational threshold must be lower than warning, and critical must be higher than warning. For example, the SYS_DISK_POOLFULL defaults ascend from info (70%) through warn (80%) and crit (90%) to emerg (97%).

Alternatively, event threshold configuration can also be performed via the OneFS CLI ‘isi event thresholds’ command set.
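
For example, the FILESYS_FDUSAGE critical threshold change described above could be made along the following lines (a sketch using that event’s ID from the listing below):

# isi event thresholds modify 800010006 --crit 92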

The list of configurable CELOG events can be displayed with the following CLI command:

# isi event threshold list
ID ID Name
-------------------------------
100010001 SYS_DISK_VARFULL
100010002 SYS_DISK_VARCRASHFULL
100010003 SYS_DISK_ROOTFULL
100010015 SYS_DISK_POOLFULL
100010018 SYS_DISK_SSDFULL
600010005 SNAP_RESERVE_FULL
800010006 FILESYS_FDUSAGE
-------------------------------

Full details, including the thresholds, are shown with the addition of the ‘-v’ verbose flag:

# isi event threshold list -v
ID: 100010001
ID Name: SYS_DISK_VARFULL
Description: Percentage at which /var partition is near capacity
Defaults: info (75%), warn (85%), crit (90%)
Thresholds: info (75%), warn (85%), crit (90%)
--------------------------------------------------------------------------------
ID: 100010002
ID Name: SYS_DISK_VARCRASHFULL
Description: Percentage at which /var/crash partition is near capacity
Defaults: warn (90%)
Thresholds: warn (90%)
--------------------------------------------------------------------------------
ID: 100010003
ID Name: SYS_DISK_ROOTFULL
Description: Percentage at which /(root) partition is near capacity
Defaults: warn (90%), crit (95%)
Thresholds: warn (90%), crit (95%)
--------------------------------------------------------------------------------
ID: 100010015
ID Name: SYS_DISK_POOLFULL
Description: Percentage at which a nodepool is near capacity
Defaults: info (70%), warn (80%), crit (90%), emerg (97%)
Thresholds: info (70%), warn (80%), crit (90%), emerg (97%)
--------------------------------------------------------------------------------
ID: 100010018
ID Name: SYS_DISK_SSDFULL
Description: Percentage at which an SSD drive is near capacity
Defaults: info (75%), warn (85%), crit (90%)
Thresholds: info (75%), warn (85%), crit (90%)
--------------------------------------------------------------------------------
ID: 600010005
ID Name: SNAP_RESERVE_FULL
Description: Percentage at which snapshot reserve space is near capacity
Defaults: warn (90%), crit (99%)
Thresholds: warn (90%), crit (99%)
--------------------------------------------------------------------------------
ID: 800010006
ID Name: FILESYS_FDUSAGE
Description: Percentage at which the system is near capacity for open file descriptors
Defaults: info (85%), warn (90%), crit (95%)
Thresholds: info (85%), warn (90%), crit (95%)

Similarly, the following CLI syntax can be used to display the existing thresholds for a particular event – in this case the SYS_DISK_VARFULL /var partition full alert:

# isi event thresholds view 100010001

         ID: 100010001

    ID Name: SYS_DISK_VARFULL

Description: Percentage at which /var partition is near capacity

   Defaults: info (75%), warn (85%), crit (90%)

 Thresholds: info (75%), warn (85%), crit (90%)

The following command will reconfigure the thresholds from the defaults of 75%|85%|90% to 70%|75%|85%:

# isi event thresholds modify 100010001 --info 70 --warn 75 --crit 85

# isi event thresholds view 100010001

         ID: 100010001

    ID Name: SYS_DISK_VARFULL

Description: Percentage at which /var partition is near capacity

   Defaults: info (75%), warn (85%), crit (90%)

 Thresholds: info (70%), warn (75%), crit (85%)

And finally, to reset the thresholds back to their default values:

#  isi event thresholds reset 100010001

Are you sure you want to reset info, warn, crit from event 100010001?? (yes/[no]): yes

# isi event thresholds view 100010001

         ID: 100010001

    ID Name: SYS_DISK_VARFULL

Description: Percentage at which /var partition is near capacity

   Defaults: info (75%), warn (85%), crit (90%)

 Thresholds: info (75%), warn (85%), crit (90%)

Configuring OneFS SyncIQ Encryption

Unlike previous OneFS versions, SyncIQ is disabled by default in OneFS 9.1 and later. Once SyncIQ has been enabled by the cluster admin, a global encryption flag is automatically set, requiring all SyncIQ policies to be encrypted. Similarly, when upgrading a PowerScale cluster to OneFS 9.1, the global encryption flag is also set. However, be aware that the flag is not enabled on upgrade to OneFS 9.1 or later for clusters that already have existing SyncIQ policies configured.

The following procedure can be used to configure SyncIQ encryption from the OneFS CLI:

  1. Ensure both source and target clusters are running OneFS 8.2 or later.
  2. Next, create X.509 certificates, one for each of the source and target clusters, signed by a certificate authority.
Certificate Type Abbreviation
Certificate Authority <ca_cert_id>
Source Cluster Certificate <src_cert_id>
Target Cluster Certificate <tgt_cert_id>

These can be generated using publicly available tools, such as OpenSSL: http://slproweb.com/products/Win32OpenSSL.html.
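
As a rough sketch of this step, a CA plus a certificate for each cluster could be generated with OpenSSL along the following lines (the file names, key sizes and subject values here are purely illustrative):

# openssl genrsa -out ca.key 4096

# openssl req -x509 -new -key ca.key -days 3650 -subj "/CN=SyncIQ-CA" -out ca.pem

# openssl genrsa -out source.key 4096

# openssl req -new -key source.key -subj "/CN=source-cluster" -out source.csr

# openssl x509 -req -in source.csr -CA ca.pem -CAkey ca.key -CAcreateserial -days 365 -out source.pem

The last three commands are then repeated for the target cluster’s key and certificate (for example, target.key and target.pem).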

  3. Add the newly created certificates to the appropriate source cluster stores. Each cluster gets the certificate authority certificate, its own certificate, and its peer’s certificate:
# isi sync certificates server import <src_cert_id> <src_key>

# isi sync certificates peer import <tgt_cert_id>

# isi cert authority import <ca_cert_id>
  4. On the source cluster, set the SyncIQ cluster certificate:
# isi sync settings modify --cluster-certificate-id=<src_cert_id>
  5. Add the certificates to the appropriate target cluster stores:
# isi sync certificates server import <tgt_cert_id> <tgt_key>

# isi sync certificates peer import <src_cert_id>

# isi cert authority import <ca_cert_id>
  6. On the target cluster, set the SyncIQ cluster certificate:
# isi sync settings modify --cluster-certificate-id=<tgt_cert_id>
  7. A global option is available in OneFS 9.1 which requires that all incoming and outgoing SyncIQ policies are encrypted. Be aware that executing this command impacts any existing SyncIQ policies that do not have encryption enabled: such policies will fail, so only enable this option once all existing policies have encryption enabled. To enable it, execute the following command:
# isi sync settings modify --encryption-required=True
  8. On the source cluster, create an encrypted SyncIQ policy:
# isi sync policies create <pol_name> sync <src_dir> <target_ip> <tgt_dir> --target-certificate-id=<tgt_cert_id>

Or modify an existing policy on the source cluster:

# isi sync policies modify <pol_name> --target-certificate-id=<tgt_cert_id>
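
Once created or modified, the policy can be checked and, if desired, run manually. As a sketch, assuming the policy name used above:

# isi sync policies view <pol_name>

# isi sync jobs start <pol_name>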

OneFS 9.1 also facilitates SyncIQ encryption configuration via the OneFS WebUI, in addition to the CLI. For the source, server certificates can be added and managed by navigating to Data Protection > SyncIQ > Settings and clicking on the ‘add certificate’ button:

And certificates can be imported onto the target cluster by browsing to Data Protection > SyncIQ > Certificates and clicking on the ‘add certificate’ button. For example:

So that’s what’s required to get encryption configured across a pair of clusters. There are several additional optional encryption configuration parameters available. These include:

  • Updating the policy to use a specified SSL cipher suite:
# isi sync policies modify <pol_name> --encryption-cipher-list=<suite>
  • Configuring the target cluster to check the revocation status of incoming certificates:
# isi sync settings modify --ocsp-address=<address> --ocsp-issuer-certificate-id=<ca_cert_id>
  • Modifying how frequently encrypted connections are renegotiated on a cluster:
# isi sync settings modify --renegotiation-period=24H
  • Requiring that all incoming and outgoing SyncIQ policies are encrypted:
# isi sync settings modify --encryption-required=True

To troubleshoot SyncIQ encryption, first check the reports for the SyncIQ policy in question. The reason for the failure should be indicated in the report. If the issue was due to a TLS authentication failure, then the error message from the TLS library will also be provided in the report. Also, more detailed information can often be found in /var/log/messages on the source and target clusters, including:

  • ID of the certificate that caused the failure.
  • Subject name of the certificate that caused the failure.
  • Depth at which the failure occurred in the certificate chain.
  • Error code and reason for the failure.
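
For example, the relevant entries can often be located with a simple log search on either cluster (the grep pattern here is purely illustrative):

# grep -iE 'tls|certificate' /var/log/messages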

Before enabling SyncIQ encryption, be aware of the potential performance implications. While encryption only adds minimal overhead to the transmission, it may still negatively impact a production workflow. Be sure to test encrypted replication in a lab environment that emulates the environment before deploying in production.

Note that both the source and target cluster must be upgraded and committed to OneFS 8.2 or later, prior to configuring SyncIQ encryption.

In the event that SyncIQ encryption needs to be disabled, be aware that this can only be performed via the CLI and not the WebUI:

# isi sync settings modify --encryption-required=false

If encryption is disabled under OneFS 9.1, the following warnings will be displayed on creating a SyncIQ policy.

From the WebUI:

And via the CLI:

# isi sync policies create pol2 sync /ifs/data 192.168.1.2 /ifs/data/pol1

********************************************

WARNING: Creating a policy without encryption is dangerous.

Are you sure you want to create a SyncIQ policy without setting encryption?

Your data could be vulnerable without encrypted protection.

Type ‘confirm create policy’ to proceed.  Press enter to cancel:

OneFS SyncIQ and Encrypted Replication

Introduced in OneFS 9.1, SyncIQ encryption is integral to protecting data in-flight during inter-cluster replication over the WAN. This helps prevent man-in-the-middle attacks, mitigating remote replication security concerns and risks.

SyncIQ encryption helps to secure data transfer between OneFS clusters, benefiting customers who undergo regular security audits and/or need to comply with government regulations.

  • SyncIQ policies support end-to-end encryption for cross-cluster communications.
  • Certificates are easy to manage with the SyncIQ certificate store.
  • Certificate revocation is supported through the use of an external OCSP responder.
  • Clusters can now be configured to require that all incoming and outgoing SyncIQ policies be encrypted, via a simple change to the SyncIQ global settings.

SyncIQ encryption relies on cryptography, using a public and private key pair to encrypt and decrypt replication sessions. These keys are mathematically related: data encrypted with one key is decrypted with the other key, confirming the identity of each cluster. SyncIQ uses the common X.509 Public Key Infrastructure (PKI) standard, which defines certificate requirements.

A Certificate Authority (CA) serves as a trusted third party, which issues and revokes certificates. Each cluster’s certificate store has the CA, its own certificate, and the peer’s certificate, establishing a trusted ‘passport’ mechanism.

A SyncIQ job can attempt either an encrypted or unencrypted handshake, depending on the policy and global encryption configuration.

Under the hood, SyncIQ utilizes TLS protocol version 1.2 and OpenSSL version 1.0.2o. Customers are responsible for creating their own X.509 certificates, and SyncIQ peers must store each other’s end entity certificates. A TLS authentication failure will cause the corresponding SyncIQ job to immediately fail, and a CELOG event notifies the user of a SyncIQ encryption failure.

On the source cluster, the SyncIQ job’s coordinator process passes the target cluster’s public cert to its primary worker (pworker) process. The target monitor and sworker threads receive a list of approved source cluster certs. The pworkers can then establish secure connections with their corresponding sworkers (secondary workers).

SyncIQ traffic encryption is enabled on a per-policy basis. The CLI includes the ‘isi certificates’ and ‘isi sync certificates’ commands for the configuration of TLS certificates:

# isi cert -h

Description:

    Configure cluster TLS certificates.

Required Privileges:

    ISI_PRIV_CERTIFICATE

Usage:

    isi certificate <subcommand>

        [--timeout <integer>]

        [{--help | -h}]

Subcommands:

  Certificate Management:

    authority    Configure cluster TLS certificate authorities.

    server       Configure cluster TLS server certificates.

    settings     Configure cluster TLS certificate settings.

The following policy configuration fields are included:

Config Field Detail
--target-certificate-id <string> The ID of the target cluster certificate being used for encryption.
--ocsp-issuer-certificate-id <string> The ID of the certificate authority that issued the certificate whose revocation status is being checked.
--ocsp-address <string> The address of the OCSP responder to which to connect.
--encryption-cipher-list <string> The cipher list being used with encryption. For SyncIQ targets, this list serves as a list of supported ciphers. For SyncIQ sources, the ciphers will be attempted in the order listed.

In order to configure a policy for encryption, the ‘--target-certificate-id’ must be specified. The user inputs the ID of the desired certificate as defined in the certificate manager. If self-signed certificates are being utilized, they must first be manually copied to the peer cluster’s certificate store.

For authentication, there is a strict comparison of the public certs to the expected values. If a cert chain (that has been signed by the CA) is selected to authenticate the connection, the chain of certificates will need to be added to the cluster’s certificate authority store. Both methods use the ‘SSL_VERIFY_FAIL_IF_NO_PEER_CERT’ option when establishing the SSL context. Note that once encryption is enabled (by setting the appropriate policy fields), modification of the certificate IDs is allowed. However, removal and reverting to unencrypted syncs will prompt for confirmation before proceeding.
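
As a quick sketch, the contents of each cluster’s certificate stores can be reviewed with the ‘list’ subcommands of the command families shown above (assuming these subcommands are available in the release in use):

# isi cert authority list

# isi sync certificates server list

# isi sync certificates peer list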

We’ll take a look at the SyncIQ encryption configuration procedures and options in the second article of this series.

OneFS Fast Reboots

As part of engineering’s on-going PowerScale ‘always-on’ initiative, OneFS offers a fast reboot service that focuses on decreasing the duration and lessening the impact of planned node reboots on clients. It does this by automatically reducing the size of the lock cache on all nodes before a group change event.

By shortening group change windows, this new faster reboot service is particularly advantageous for cluster upgrades and planned shutdowns, helping to reduce the window of unavailability for clients connected to a rebooting node.

The fast reboot service is automatically enabled on installation of, or upgrade to, OneFS 9.1, and requires no further configuration. However, be aware that for upgrades it will only begin to apply when moving from OneFS 9.1 to a future release.

Under the hood, this feature works by proactively de-staging all the lock management work, and removing it from the client latency path. This means that the time taken during group change activity – handling the locks, negotiating which coordinator has which lock, etc – is moved to an earlier window of time in the process. So, for example, for a planned cluster reboot or shutdown, instead of doing a lock dance during the group change window, the lazy lock queue is proactively drained for a period of up to 5 minutes, in order to move that activity to earlier in the process. This directly benefits OneFS upgrades, by shrinking the time for the actual group change. For a typical size cluster, this is reduced to approximately 1 second – down from around 17 seconds in prior releases. And engineering have been testing this feature with up to 5 million locks per domain.

There are several useful new and updated sysctls that indicate the status of the reboot service.

Firstly, efs.gmp.group has been enhanced to include both reboot and draining fields, which confirm which node(s) the reboot service is active on and whether locks are being drained:

# sysctl efs.gmp.group

efs.gmp.group: <35baa7> (3) :{ 1-3:0-5, nfs: 3, isi_cbind_d: 1-3, lsass: 1-3, drain: 1, reboot: 1 }

To complement this, the lki_draining sysctl confirms whether draining is still occurring:

# sysctl efs.lk.lki_draining

efs.lk.lki_draining: 1

OneFS has around 20 different lock domains, each with its own queue. These queues each contain lazy locks, which are locks that are not currently in use, but are just being held by the node in case it needs to use them again.

The stats from the various lock domain queues are aggregated and displayed as a current total by the lazy_queue_size sysctl:

# sysctl efs.lk.lazy_queue_size

efs.lk.lazy_queue_size: 460658

And finally, the following sysctl indicates whether any of the lazy queues are above their reboot threshold:

# sysctl efs.lk.lazy_queue_above_reboot

efs.lk.lazy_queue_above_reboot: 0

In addition to the sysctls, and to aid with troubleshooting and debugging, the reboot service writes its status information about the locks being drained, etc, to /var/log/isi_shutdown.log.
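
For example, this log can be followed in real time during a planned node reboot or shutdown using standard tooling:

# tail -f /var/log/isi_shutdown.log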

While draining is underway, log messages confirm that the node has activated the reboot service and is waiting for the lazy queues to be drained. These messages are printed every 60 seconds until complete.

Once done, a log message is then written confirming that the lazy queues have been drained, and that the node is about to reboot or shut down.

So there you have it – the new faster reboot service and low-impact group changes, completing the next milestone in the OneFS ‘always on’ journey.

Introducing OneFS 9.1

Dell PowerScale OneFS version 9.1 has been released and is now generally available for download and cluster installation and upgrade.

This new OneFS 9.1 release embraces the PowerScale tenets of simplified management, increased performance, and extended flexibility, and introduces the following new features:

  • CAVA-based anti-virus support
  • Granular configuration of node and cluster-level events and alerting
  • Improved restart of backups for better RTO and RPO
  • Faster performance for access to CloudPools tiered files
  • Faster detection and resolution of node or resource unavailability
  • Flexible audit configuration for compliance and business needs
  • Encryption of replication traffic for increased security
  • Simplified in-product license activation for clusters connected via SRS

We’ll be looking more closely at this new OneFS 9.1 functionality in forthcoming blog articles.

OneFS SmartDedupe – Assessment & Estimation

To complement the actual SmartDedupe job, a dry-run Dedupe Assessment job is also provided to help estimate the amount of space savings that will be seen by running deduplication on a particular directory or set of directories. The dedupe assessment job reports a total potential space savings. The assessment does not differentiate between a fresh run and a case where a previous dedupe job has already shared some blocks in the files under that directory, and it does not provide incremental differences between instances of the job. Isilon recommends running the assessment job once on a specific directory prior to starting an actual dedupe job on that directory.

The assessment job runs similarly to the actual dedupe job, but uses a separate configuration. It also does not require a product license and can be run prior to purchasing SmartDedupe in order to determine whether deduplication is appropriate for a particular data set or environment. This can be configured from the WebUI by browsing to File System > Deduplication > Settings and adding the desired directory path(s) in the ‘Assess Deduplication’ section.


Alternatively, the following CLI syntax will achieve the same result:

# isi dedupe settings modify --add-assess-paths /ifs/data

Once the assessment paths are configured, the job can be run from either the CLI or WebUI. For example:

Or, from the CLI:

# isi job types list | grep -i assess

DedupeAssessment   Yes      LOW  

# isi job jobs start DedupeAssessment

Once the job is running, its progress can be viewed by first listing the jobs to determine its job ID:

# isi job jobs list

ID   Type             State   Impact  Pri  Phase  Running Time

---------------------------------------------------------------

919  DedupeAssessment Running Low     6    1/1    -

---------------------------------------------------------------

Total: 1

And then viewing the job ID as follows:

# isi job jobs view 919

               ID: 919

             Type: DedupeAssessment

            State: Running

           Impact: Low

           Policy: LOW

              Pri: 6

            Phase: 1/1

       Start Time: 2019-01-21T21:59:26

     Running Time: 35s

     Participants: 1, 2, 3

         Progress: Iteration 1, scanning files, scanned 61 files, 9 directories, 4343277 blocks, skipped 304 files, sampled 271976 blocks, deduped 0 blocks, with 0 errors and 0 unsuccessful dedupe attempts

Waiting on job ID: -

      Description: /ifs/data

The running job can also be controlled and monitored from the WebUI:

Under the hood, the dedupe assessment job uses a separate index table from the actual dedupe process. Plus, for the sake of efficiency, the assessment job samples fewer candidate blocks than the main dedupe job, and obviously does not actually perform deduplication. This means that the assessment will often provide a slightly conservative estimate of the actual deduplication efficiency that’s likely to be achieved.

Using the sampling and consolidation statistics, the assessment job provides a report which estimates the total dedupe space savings in bytes. This can be viewed from the CLI using the following syntax:

# isi dedupe reports view 919

    Time: 2020-09-21T22:02:18

  Job ID: 919

Job Type: DedupeAssessment

 Reports

        Time: 2020-09-21T22:02:18

     Results:

Dedupe job report:{

    Start time = 2020-Sep-21:21:59:26

    End time = 2020-Sep-21:22:02:15

    Iteration count = 2

    Scanned blocks = 9567123

    Sampled blocks = 383998

    Deduped blocks = 2662717

    Dedupe percent = 27.832

    Created dedupe requests = 134004

    Successful dedupe requests = 134004

    Unsuccessful dedupe requests = 0

    Skipped files = 328

    Index entries = 249992

    Index lookup attempts = 249993

    Index lookup hits = 1

}

Elapsed time:                      169 seconds

Aborts:                              0

Errors:                              0

Scanned files:                      69

Directories:                        12

1 path:

/ifs/data

CPU usage:                         max 81% (dev 1), min 0% (dev 2), avg 17%

Virtual memory size:               max 341652K (dev 1), min 297968K (dev 2), avg 312344K

Resident memory size:              max 45552K (dev 1), min 21932K (dev 3), avg 27519K

Read:                              0 ops, 0 bytes (0.0M)

Write:                             4006510 ops, 32752225280 bytes (31235.0M)

Other jobs read:                   0 ops, 0 bytes (0.0M)

Other jobs write:                  41325 ops, 199626240 bytes (190.4M)

Non-JE read:                       1 ops, 8192 bytes (0.0M)

Non-JE write:                      22175 ops, 174069760 bytes (166.0M)

Or from the WebUI, by browsing to Cluster Management > Job Operations > Job Types:

As indicated, the assessment report for job 919 in this case estimated potential data savings of 27.8% from deduplication.

Note that the SmartDedupe dry-run estimation job can be run without any licensing requirements, allowing an assessment of the potential space savings that a dataset might yield before making the decision to purchase the full product.

OneFS SmartDedupe – Performance Considerations

As with many things in life, deduplication is a compromise. In order to gain increased levels of storage efficiency, additional cluster resources (CPU, memory and disk IO) are utilized to find and execute the sharing of common data blocks.

Another important performance impact consideration with dedupe is the potential for data fragmentation. After deduplication, files that previously enjoyed contiguous on-disk layout will often have chunks spread across less optimal file system regions. This can lead to slightly increased latencies when accessing these files directly from disk, rather than from cache.

To help reduce this risk, SmartDedupe will not share blocks across node pools or data tiers, and will not attempt to deduplicate files smaller than 32KB in size. On the other end of the spectrum, the largest contiguous region that will be matched is 4MB.

Because deduplication is a data efficiency tool rather than a performance-enhancing one, in most cases the consideration will be around managing cluster impact. This applies both to client data access performance, since, by design, multiple files will be sharing common data blocks, and to dedupe job execution, as additional cluster resources are consumed to detect and share commonality.

The first deduplication job run will often take a substantial amount of time to run, since it must scan all files under the specified directories to generate the initial index and then create the appropriate shadow stores. However, deduplication job performance will typically improve significantly on the second and subsequent job runs (incrementals), once the initial index and the bulk of the shadow stores have already been created.

If incremental deduplication jobs do take a long time to complete, this is most likely indicative of a data set with a high rate of change. If a deduplication job is paused or interrupted, it will automatically resume the scanning process from where it left off.

As mentioned previously, deduplication is a long running process that involves multiple job phases that are run iteratively. SmartDedupe typically processes around 1TB of data per day, per node.
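
As a rough illustration of this rate, an initial dedupe job against 40TB of data on a four-node cluster would be expected to take on the order of ten days (40TB at 1TB per node per day across four nodes).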

Deduplication can significantly increase the storage efficiency of data. However, the actual space savings will vary depending on the specific attributes of the data itself. As mentioned above, the deduplication assessment job can be run to help predict the likely space savings that deduplication would provide on a given data set.

For example, virtual machines files often contain duplicate data, much of which is rarely modified. Deduplicating similar OS type virtual machine images (VMware VMDK files, etc, that have been block-aligned) can significantly decrease the amount of storage space consumed. However, the potential for performance degradation as a result of block sharing and fragmentation should be carefully considered first.

OneFS SmartDedupe does not deduplicate across files that have different protection settings. For example, if two files share blocks, but file1 is parity protected at +2:1, and file2 has its protection set at +3, SmartDedupe will not attempt to deduplicate them. This ensures that all files and their constituent blocks are protected as configured.  Additionally, SmartDedupe won’t deduplicate files that are stored on different node pools. For example, if file1 and file2 are stored on tier 1 and tier 2 respectively, and tier1 and tier2 are both protected at +2:1, OneFS won’t deduplicate them. This helps guard against performance asymmetry, where some of a file’s blocks could live on a different tier, or class of storage, than others.

OneFS performance resource management provides statistics for the resources used by jobs – both cluster-wide and per-node. This information is provided via the ‘isi statistics workload’ CLI command. Available in a ‘top’ format, this command displays the top jobs and processes, and periodically updates the information.

For example, the following syntax shows, and indefinitely refreshes, the top five processes on a cluster:

# isi statistics workload --limit 5 --format=top

last update:  2020-09-23T16:45:25 (s)ort: default

CPU  Reads Writes    L2   L3   Node SystemName      JobType

1.4s 9.1k 0.0        3.5k 497.0 2    Job:  237       IntegrityScan[0]

1.2s 85.7 714.7      4.9k 0.0  1    Job:  238       Dedupe[0]

1.2s 9.5k 0.0        3.5k 48.5 1    Job:  237       IntegrityScan[0]

1.2s 7.4k 541.3      4.9k 0.0  3    Job: 238        Dedupe[0]

1.1s 7.9k 0.0        3.5k 41.6 2    Job:  237       IntegrityScan[0]

From the output, we can see that two job engine jobs are in progress: Dedupe (job ID 238), which runs at low impact and priority level 4, is contending with IntegrityScan (job ID 237), which runs by default at medium impact and priority level 1.

The resource statistics tracked per job, per job phase, and per node include CPU, reads, writes, and L2 & L3 cache hits. Unlike the output from the ‘top’ command, this makes it easier to diagnose individual job resource issues, etc.

Below are some examples of typical space reclamation levels that have been achieved by running SmartDedupe on various data types. Be aware though that these space savings values are provided solely as rough guidance. Since no two data sets are alike (unless they’re replicated), actual results can and will vary considerably from these examples.

Workflow / Data Type Typical Space Savings
Virtual Machine Data 35%
Home Directories / File Shares 25%
Email Archive 20%
Engineering Source Code 15%
Media Files 10%

SmartDedupe is included as a core component of OneFS but requires a valid product license key in order to activate. An unlicensed cluster will show a SmartDedupe warning until a valid product license has been applied to the cluster.

For optimal cluster performance, observing the following SmartDedupe best practices is recommended.

  • Deduplication is most effective when applied to data sets with a low rate of change – for example, archived data.
  • Enable SmartDedupe to run at subdirectory level(s) below /ifs.
  • Avoid adding more than ten subdirectory paths to the SmartDedupe configuration policy.
  • SmartDedupe is ideal for home directories, departmental file shares and warm and cold archive data sets.
  • Run SmartDedupe against a smaller sample data set first to evaluate performance impact versus space efficiency.
  • Schedule deduplication to run during the cluster’s low usage hours – i.e. overnight, weekends, etc.
  • After the initial dedupe job has completed, schedule incremental dedupe jobs to run every two weeks or so, depending on the size and rate of change of the dataset.
  • Always run SmartDedupe with the default ‘low’ impact Job Engine policy.
  • Run the dedupe assessment job on a single root directory at a time. If multiple directory paths are assessed in the same job, you will not be able to determine which directory should be deduplicated.
  • When replicating deduplicated data, to avoid running out of space on the target, it is important to verify that the logical data size (i.e. the amount of storage space saved plus the actual storage space consumed) does not exceed the total available space on the target cluster (see the worked example following this list).
  • Run a deduplication job on an appropriate data set prior to enabling a snapshots schedule.
  • Where possible, perform any snapshot restores (reverts) before running a deduplication job. And run a dedupe job directly after restoring a prior snapshot version.
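
As a simple illustration of the replication guideline above: if a deduplicated dataset physically consumes 60TB and SmartDedupe reports a further 40TB of savings, its logical size is approximately 100TB, and the target cluster therefore needs at least 100TB of available space for that dataset.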

With dedupe, there’s always trade-off between cluster resource consumption (CPU, memory, disk), the potential for data fragmentation and the benefit of increased space efficiency. Therefore, SmartDedupe is not ideally suited for high performance workloads.

  • Depending on an application’s I/O profile and the effect of deduplication on the data layout, read and write performance and overall space savings can vary considerably.
  • SmartDedupe will not permit block sharing across different hardware types or node pools to reduce the risk of performance asymmetry.
  • SmartDedupe will not share blocks across files with different protection policies applied.
  • OneFS metadata, including the deduplication index, is not deduplicated.
  • Deduplication is a long running process that involves multiple job phases that are run iteratively.
  • SmartDedupe will not attempt to deduplicate files smaller than 32KB in size.
  • Dedupe job performance will typically improve significantly on the second and subsequent job runs, once the initial index and the bulk of the shadow stores have already been created.
  • SmartDedupe will not deduplicate the data stored in a snapshot. However, snapshots can certainly be created of deduplicated data.
  • If deduplication is enabled on a cluster that already has a significant amount of data stored in snapshots, it will take time before the snapshot data is affected by deduplication. Newly created snapshots will contain deduplicated data, but older snapshots will not.
  • Any file on a cluster that is ‘un-deduped’ is automatically marked ‘do not dedupe’. In order to reapply deduplication to an un-deduped file, specific flags on the shadow store need to be cleared. For example, to check the current setting:

    # isi get -D /ifs/data/test | grep -i dedupe

    *  Do not dedupe:      0

    Un-dedupe the file via isi_sstore:

    # isi_sstore undedupe /ifs/data/test

    Verify the setting:

    # isi get -D /ifs/data/test | grep -i dedupe

    *  Do not dedupe:      1

    If the file should participate in dedupe again, reset the ‘do not dedupe’ flag:

    # isi_sstore attr --no_dedupe=false <path>

SmartDedupe is one of several components of OneFS that enable OneFS to deliver a very high level of raw disk utilization. Another major storage efficiency attribute is the way that OneFS natively manages data protection in the file system. Unlike most file systems that rely on hardware RAID, OneFS protects data at the file level and, using software-based erasure coding, allows most customers to enjoy raw disk space utilization levels in the 80% range or higher. This is in contrast to the industry mean of around 50-60% raw disk capacity utilization. SmartDedupe serves to further extend this storage efficiency headroom, bringing an even more compelling and demonstrable TCO advantage to primary file based storage.

SmartDedupe post-process deduplication is compatible with OneFS in-line data reduction (which we’ll cover in another blog post series) and vice versa. In-line compression is able to compress OneFS shadow stores. However, for SmartDedupe to process compressed data, the SmartDedupe job will have to decompress it first in order to perform deduplication, which is an additional resource overhead.

OneFS SmartDedupe – Monitoring & Management

As we saw in the previous article in this series, SmartDedupe operates at the directory level, targeting all files and directories underneath one or more root directories.

SmartDedupe not only deduplicates identical blocks in different files, it also matches and shares identical blocks within a single file. For two or more files to be deduplicated, the two following attributes must be the same:

  • Disk pool policy ID
  • Protection policy

If either of these attributes differs between two or more matching files, their common blocks will not be shared. SmartDedupe also does not deduplicate files smaller than 32 KB, because the resource consumption overhead outweighs the small storage efficiency benefit.

There are two principal elements to managing deduplication in OneFS. The first is the configuration of the SmartDedupe process itself. The second involves the scheduling and execution of the Dedupe job. These are both described below.

SmartDedupe works on data sets which are configured at the directory level, targeting all files and directories under each specified root directory. Multiple directory paths can be specified as part of the overall deduplication job configuration and scheduling.

Similarly, the dedupe directory paths can also be configured from the CLI via the isi dedupe settings modify command. For example, the following command targets /ifs/data and /ifs/home for deduplication:

# isi dedupe settings modify --paths /ifs/data,/ifs/home

Bear in mind that the permissions required to configure and modify deduplication settings are separate from those needed to run a deduplication job. For example, a user’s role must have job engine privileges to run a deduplication job. However, in order to configure and modify dedupe configuration settings, they must have the deduplication role privileges.

SmartDedupe can be run either on-demand (started manually) or via a predefined schedule. This is configured via the cluster management ‘Job Operations’ section of the WebUI.

The recommendation is to schedule and run deduplication during off-hours, when the rate of data change on the cluster is low. If clients are continually writing to files, the amount of space saved by deduplication will be minimal because the deduplicated blocks are constantly being removed from the shadow store.

To modify the parameters of the dedupe job itself, run the isi job types modify command. For example, the following command configures the deduplication job to be run every Saturday at 12:00 AM:

# isi job types modify Dedupe --schedule "Every Saturday at 12:00 AM"

For most clusters, after the initial deduplication job has completed, the recommendation is to run an incremental deduplication job once every two weeks.

The amount of disk space currently saved by SmartDedupe can be determined by viewing the cluster capacity usage chart and deduplication reports summary table in the WebUI. The cluster capacity chart and deduplication reports can be found by navigating to File System Management > Deduplication > Summary.

In addition to the bar chart and accompanying statistics (above), which graphically represent the data set and space efficiency in actual capacity terms, the dedupe job report overview field also displays the SmartDedupe savings as a percentage.

SmartDedupe space efficiency metrics are also provided via the ‘isi dedupe stats’ CLI command:

# isi dedupe stats

      Cluster Physical Size: 676.8841T

          Cluster Used Size: 236.3181T

  Logical Size Deduplicated: 29.2562T

             Logical Saving: 25.5125T

Estimated Size Deduplicated: 42.5774T

  Estimated Physical Saving: 37.1290T

In OneFS 8.2.1 and later, SmartQuotas has been enhanced to report the capacity saving from deduplication, and data reduction in general, as a storage efficiency ratio. SmartQuotas reports efficiency as a ratio across the desired data set as specified in the quota path field. The efficiency ratio is for the full quota directory and its contents, including any overhead, and reflects the net efficiency of compression and deduplication. On a cluster with licensed and configured SmartQuotas, this efficiency ratio can be easily viewed from the WebUI by navigating to ‘File System > SmartQuotas > Quotas and Usage’.

Similarly, the same data can be accessed from the OneFS command line via the ‘isi quota quotas list’ CLI command. For example:

# isi quota quotas list

Type      AppliesTo  Path           Snap  Hard  Soft  Adv  Used    Efficiency

-----------------------------------------------------------------------------

directory DEFAULT    /ifs           No    -     -     -    2.3247T 1.29 : 1

-----------------------------------------------------------------------------

Total: 1

More detail, including both the physical (raw) and logical (effective) data capacities, is also available via the ‘isi quota quotas view <path> <type>’ CLI command. For example:

# isi quota quotas view /ifs directory

                        Path: /ifs

                        Type: directory

                   Snapshots: No

 Thresholds Include Overhead: No

                       Usage

                           Files: 4245818

         Physical(With Overhead): 1.80T

           Logical(W/O Overhead): 2.33T

Efficiency(Logical/Physical): 1.29 : 1

…

To configure SmartQuotas for data efficiency reporting, create a directory quota at the top-level file system directory of interest, for example /ifs. Creating and configuring a directory quota is a simple procedure and can be performed from the WebUI, as follows:

Navigate to ‘File System > SmartQuotas > Quotas and Usage’ and select ‘Create a Quota’. In the create pane, set the Quota type to ‘Directory quota’, add the preferred top-level path to report on, select ‘File system logical size’ for Quota Accounting, and set the Quota Limits to ‘Track storage without specifying a storage limit’. Finally, select the ‘Create Quota’ button to confirm the configuration and activate the new directory quota.
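
A roughly equivalent tracking quota can also be created from the CLI. As a minimal sketch (accounting and notification options are omitted here and vary by release):

# isi quota quotas create /ifs directory

# isi quota quotas list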

The efficiency ratio is a single, point-in-time efficiency metric that is calculated per quota directory and includes the sum of SmartDedupe plus in-line data reduction. This is in contrast to a history of stats over time, as reported in the ‘isi statistics data-reduction’ CLI command output. As such, the efficiency ratio for the entire quota directory reflects what is actually stored there. This quota-based efficiency data is also available via the platform API as of OneFS 8.2.2.

The OneFS WebUI cluster dashboard also now displays a storage efficiency tile, which shows physical and logical space utilization histograms and reports the capacity saving from in-line data reduction as a storage efficiency ratio. This dashboard view is displayed by default when opening the OneFS WebUI in a browser and can be easily accessed by navigating to ‘File System > Dashboard > Cluster Overview’.

The Job Engine parallel execution framework provides comprehensive run time and completion reporting for the deduplication job.

Once the dedupe job has started working on a directory tree, the resulting space savings it achieves can be monitored in real time. While SmartDedupe is underway, job status is available at a glance via the progress column in the active jobs table. This information includes the number of files, directories and blocks that have been scanned, skipped and sampled, and any errors that may have been encountered.

Additional progress information is provided in an Active Job Details status update, which includes an estimated completion percentage based on the number of logical inodes (LINs) that have been counted and processed.

Once the SmartDedupe job has run to completion, or has been terminated, a full dedupe job report is available. This can be accessed from the WebUI by navigating to Cluster Management > Job Operations > Job Reports, and selecting the ‘View Details’ action button on the desired Dedupe job line item.

The job report contains the following relevant dedupe metrics.

Report Field Description of Metric
Start time When the dedupe job started.
End time When the dedupe job finished.
Scanned blocks Total number of blocks scanned under configured path(s).
Sampled blocks Number of blocks that OneFS created index entries for.
Created dedupe requests Total number of dedupe requests created. A dedupe request gets created for each matching pair of data blocks. For example, if three data blocks all match, two requests are created: one to pair file1 and file2, and another to pair file2 and file3.
Successful dedupe requests Number of dedupe requests that completed successfully.
Failed dedupe requests Number of dedupe requests that failed. If a dedupe request fails, it does not mean that the job also failed. A deduplication request can fail for any number of reasons. For example, the file might have been modified since it was sampled.
Skipped files Number of files that were not scanned by the deduplication job. The primary reason is that the file has already been scanned and hasn’t been modified since. Another reason for a file to be skipped is if it’s less than 32KB in size. Such files are considered too small and don’t provide enough space saving benefit to offset the fragmentation they will cause.
Index entries Number of entries that currently exist in the index.
Index lookup attempts Cumulative total number of lookups that have been done by prior and current deduplication jobs. A lookup is when the deduplication job attempts to match a block that has been indexed with a block that hasn’t been indexed.
Index lookup hits Total number of lookup hits that have been done by earlier deduplication jobs plus the number of lookup hits done by this deduplication job. A hit is a match of a sampled block with a block in the index.

Dedupe job reports are also available from the CLI via the ‘isi job reports view <job_id>’ command.

From an execution and reporting stance, the Job Engine considers the ‘dedupe’ job to comprise a single process or phase. The Job Engine events list will report that Dedupe Phase1 has ended and succeeded. This indicates that an entire SmartDedupe job, including all four internal dedupe phases (sampling, duplicate detection, block sharing, & index update), has successfully completed. For example:

# isi job events list --job-type dedupe

Time                Message

------------------------------------------------------

2020-09-01T13:39:32 Dedupe[1955] Running

2020-09-01T13:39:32 Dedupe[1955] Phase 1: begin dedupe

2020-09-01T14:20:32 Dedupe[1955] Phase 1: end dedupe

2020-09-01T14:20:32 Dedupe[1955] Succeeded

For deduplication reporting across multiple OneFS clusters, SmartDedupe is also integrated with Isilon’s InsightIQ cluster reporting and analysis product. A report detailing the space savings delivered by deduplication is available via InsightIQ’s File Systems Analytics module.

OneFS SmartDedupe

Received several questions from the field recently around OneFS SmartDedupe, so this seemed like a useful topic to delve into. For the first article, we’ll dig into SmartDedupe’s underlying architecture.

In essence, SmartDedupe helps to maximize the storage efficiency of a cluster by decreasing the amount of physical storage required to house any given dataset. Efficiency is achieved by scanning the on-disk data for identical blocks and then eliminating the duplicates. This approach is commonly referred to as post-process, or asynchronous, deduplication. This is in contrast to the real-time, in-line dedupe that’s performed on certain nodes as part of OneFS in-line data reduction, which will be explored in a future series of blog articles. That said…

On discovering duplicate blocks, SmartDedupe moves a single copy of those blocks to a special set of files known as shadow stores. During this process, duplicate blocks are removed from the actual files and replaced with pointers to the shadow stores.

With post-process deduplication, new data is first stored on the storage device and then a subsequent process analyzes the data looking for commonality. This means that initial file write or modify performance is not impacted, since no additional computation is required in the write path.

Under the covers, SmartDedupe is comprised of five principal components:

  • Deduplication Control Path
  • Deduplication Job
  • Deduplication Engine
  • Shadow Store
  • Deduplication Infrastructure

The SmartDedupe job  is a highly distributed background process that orchestrates deduplication across all the nodes in the cluster. Job control encompasses file system scanning, detection and sharing of matching data blocks, in concert with the Deduplication Engine.

The SmartDedupe control path is the user interface portion, comprising the OneFS WebUI, command line interface and platform API, and is responsible for managing the configuration, scheduling and control of the deduplication job.

SmartDedupe works on data sets which are configured at the directory level, targeting all files and directories under each specified root directory. Multiple directory paths can be specified as part of the overall deduplication job configuration and scheduling. By design, the deduplication job will automatically ignore (not deduplicate) the reserved cluster configuration information located under the /ifs/.ifsvar/ directory, and also any file system snapshots.

It’s worth noting that the RBAC permissions required to configure and modify the deduplication settings are separate from those needed to actually run a deduplication job. For example, a user’s role must have job engine privileges to run a deduplication job. However, in order to configure and modify dedupe configuration settings, they must have the deduplication role privileges.

‘Fingerprinting’ is the part of the dedupe process where unique digital signatures, or fingerprints, are calculated using the SHA-1 hashing algorithm, one for each 8KB data block in the sampled set.
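
As a purely conceptual illustration of fingerprinting (and not how OneFS implements it), the following commands, run against a copy of a file on a standard Linux host, carve the file into 8KB blocks, compute a SHA-1 hash of each block, and then count how many blocks share the same fingerprint. The paths are hypothetical:

# mkdir /tmp/blocks

# split -a 6 -b 8192 /tmp/file.copy /tmp/blocks/blk.

# find /tmp/blocks -type f | xargs sha1sum | awk '{print $1}' | sort | uniq -c | sort -rn | head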

When SmartDedupe runs for the first time, it scans the data set and selectively samples blocks from it, creating the fingerprint index. This index contains a sorted list of the digital fingerprints, or hashes, and their associated blocks. After the index is created, the fingerprints are checked for duplicates. When a match is found, during the sharing phase, a byte-by-byte comparison of the blocks is performed to verify that they are absolutely identical and to ensure there are no hash collisions. Then, if they are determined to be identical, the block’s pointer is updated to the already existing data block and the new, duplicate data block is released.

Hash computation and comparison is only utilized during the sampling phase. For the actual block sharing phase, full data comparison is employed. SmartDedupe also operates on the premise of variable length deduplication, where the block matching window is increased to encompass larger runs of contiguous matching blocks.

As we saw in the previous  article, OneFS shadow stores are file system containers that allow data to be stored in a sharable manner. This allows files to contain both physical data and pointers, or references, to shared blocks in shadow stores.

For example, consider the shadow store information for a regular, undeduped file:

# isi get -DDD file.orig | grep -i shadow

*  Shadow refs:        0

         zero=36 shadow=0 ditto=0 prealloc=0 block=28

A second copy of this file is then created and then deduped:

# isi get -DDD file.* | grep -i shadow

*  Shadow refs:        28

         zero=36 shadow=28 ditto=0 prealloc=0 block=0

*  Shadow refs:        28

         zero=36 shadow=28 ditto=0 prealloc=0 block=0

As we can see, the block count of the original file has now become zero and the shadow block count for both the original file and its copy has become ‘28’. Additionally, if another file copy is added and deduplicated, the same shadow store info and count is reported for all three files.

It’s worth noting that, even if duplicate file(s) are removed, the original file still retains the shadow store layout.

Dedupe is performed in parallel across the cluster by the OneFS Job Engine via a dedicated deduplication job, which distributes worker threads across all nodes. This distributed work allocation model allows SmartDedupe to scale linearly as an Isilon cluster grows and additional nodes are added.

The control, impact management, monitoring and reporting of the deduplication job is performed by the Job Engine in a similar manner to other storage management and maintenance jobs on the cluster.

While deduplication can run concurrently with other cluster jobs, only a single instance of the deduplication job, albeit with multiple workers, can run at any one time. Although the overall performance impact on a cluster is relatively small, the deduplication job does consume CPU and memory resources.

Architecturally, the deduplication job and supporting dedupe infrastructure are made up of the following four phases:

Because the SmartDedupe job is typically long running, each of the phases are executed for a set time period, performing as much work as possible before yielding to the next phase. When all four phases have been run, the job returns to the first phase and continues from where it left off. Incremental dedupe job progress tracking is available via the OneFS Job Engine reporting infrastructure.

Phase 1 – Sampling

In the sampling phase, SmartDedupe performs a tree-walk of the configured data set in order to collect deduplication candidates for each file. The rationale is that a large percentage of shared blocks can be detected with only a relatively small sample of data blocks represented in the index table.

By default, the sampling phase selects one block from every sixteen blocks of a file as a deduplication candidate. For each candidate, a key/value pair consisting of the block’s fingerprint (SHA-1 hash) and file system location (logical inode number and byte offset) is inserted into the index. Once a file has been sampled, the file is flagged and won’t be re-scanned until it has been modified. This drastically improves the performance of subsequent deduplication jobs.
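
For example, a 1GB file comprises 131,072 8KB blocks, so on the first pass roughly 8,192 of those blocks (one in sixteen) will be fingerprinted and added to the index.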

Phase 2 – Duplicate Detection

During the duplicate detection phase, the dedupe job scans the index table for fingerprints (or hashes) that match those of the candidate blocks.

If the index entries of two files match, a request entry is generated. In order to improve deduplication efficiency, a request entry also contains pre and post limit information. This information contains the number of blocks in front of and behind the matching block within which the block sharing phase should search for a larger matching data chunk, and typically aligns to a OneFS protection group’s boundaries.

Phase 3 – Block Sharing

For the block sharing phase the deduplication job calls into the shadow store library and dedupe infrastructure to perform the sharing of the blocks.

Multiple request entries are consolidated into a single sharing request, which is processed by the block sharing phase, and ultimately results in the deduplication of the common blocks. The file system searches for contiguous matching regions before and after the matching blocks in the sharing request; if any such regions are found, they will also be shared. Blocks are shared by writing the matching data to a common shadow store and creating references from the original files to this shadow store.

Phase 4 – Index Update

The index table is populated with the sampled and matching block information gathered during the previous three phases. After a file has been scanned by the dedupe job, OneFS may not find any matching blocks in other files on the cluster. Once a number of other files have been scanned, if a file continues to not share any blocks with other files on the cluster, OneFS will remove the index entries for that file. This helps prevent OneFS from wasting cluster resources searching for unlikely matches. SmartDedupe scans each file in the specified data set once, after which the file is marked, preventing subsequent dedupe jobs from rescanning the file until it has been modified.

OneFS Shadow Stores – Part 2

In the previous article, we looked at an overview of the shadow store and its three primary use cases within OneFS. Now, let’s look at shadow store mechanics, reporting, and job engine integration.

Under the hood, OneFS provides a SIN cache, which helps facilitate shadow store allocations. This provides a mechanism to create a shadow store on demand when required, and then cache that shadow store in memory on the local node so that it can be shared with subsequent allocators. The SIN cache separates stores by disk pool, protection policy and whether or not the store is a container.

When referencing data in a shadow store, blocks are identified with a SIN (shadow identification number) and LBN pair. A file with shadow store blocks will have protection group (PG) information that points to SINs. For example:

# isi get -DD /ifs/data/file.dup | head -100
POLICY  W  LEVEL PERFORMANCE COAL  ENCODING      FILE              IADDRS
default  4+2/2 concurrency on    UTF-8         file.dup     <1,6,35008000:512>, <2,3,236753920:512>, <3,5,302813184:512>
...
PROTECTION GROUPS
       lbn 0: 4+2/2
               4000:0001:0067:0009@0#64
               0,0,0:8192#32

The ‘isi get’ CLI command will display information about a particular shadow store when using the ‘-L’ flag and the SIN:

# isi get -DDL <SIN>
# isi get -DDL 4000:0001:003c:0005 | head -20
isi: Could not find a path to LIN:0x40000001003c0005/SNAP:18446744073709551615: Invalid argument
No valid path for LIN 0x40000001003c0005
POLICY  W  LEVEL PERFORMANCE COAL  ENCODING      FILE              IADDRS
+2:1  18   4+2/2 concurrency off   N/A           <unlinked>        <1,9,168098816:512>, <2,6,269270016:512>, <3,6,33850368:512> ct:  1337648672 rt: 0
*************************************************
* IFS inode: [ 1,9,168098816:512, 2,6,269270016:512, 3,6,33850368:512 ]
*************************************************
*
*  Inode Version:      6
*  Dir Version:        2
*  Inode Revision:     1
*  Inode Mirror Count: 3
*  Recovered Flag:     0
*  Recovered Groups:   0
*  Link Count:         2
*  Size:               133660672
*  Mode:               0100000
*  Flags:              0
*  Physical Blocks:    19251
*  LIN:                4000:0001:003c:0005

The protection group information for a SIN will also contain ‘reference count’ (refcount) information. For example:

lbn 384: 4+2/2
               1,4,5054464:8192#16
               1,7,450527232:8192#16
               2,9,411435008:8192#16
               2,11,556056576:8192#16
               3,5,678928384:8192#16
               3,8,579436544:8192#16
               REF(    384): { 3, 3, 3, 3, 3, 3, 3, 3 }
               REF(    392): { 3, 3, 3, 3, 3, 3, 3, 3 }
               REF(    400): { 3, 3, 3, 3, 3, 3, 3, 3 }
               REF(    408): { 3, 3, 3, 3, 3, 3, 3, 3 }
               REF(    416): { 3, 3, 3, 3, 3, 3, 3, 3 }
               REF(    424): { 3, 3, 3, 3, 3, 3, 3, 3 }
               REF(    432): { 3, 3, 3, 3, 3, 3, 3, 3 }
               REF(    440): { 3, 3, 3, 3, 3, 3, 3, 3 }

The isi_sstore stats command can be used to display aggregate container statistics, alongside those of regular, or block, shadow stores. The output also includes storage efficiency stats. For example:

# isi_sstore stats
Block SIN stats:
33 GB user data takes 6 MB in shadow stores, using 11 MB physical space.
10792K physical average per shadow store.
5708.92 refs per block.
Reference efficiency 99.9825%.
Storage efficiency 57.0892%

Container SIN stats:
0 B user data takes 0 B in shadow stores, using 0 B physical space.

Raw counts={ type 0 num_ss=1 lsize=6209536 pblk=1349 refs=4328123 }{ type 1 num_ss=0 lsize=0 pblk=0 refs=0 }

Running the ‘isi_sstore list’ command in its verbose (-v) form also displays the type of SIN, the ‘fragmentation score’ (frag score) metric and whether a container is ‘underfull’, amongst other things:

# isi_sstore list -v | head -n 2
SIN                 lsize   psize    refs    filesize date         sin type underfull frag score
4000:0001:0002:0000 6209536 11003392 4328123 2121080K Jan 29 21:09 block    no 0.01

When it comes to the job engine, there are several jobs (in addition to the dedupe job and SmartPools for small file packing) that interact with and cater to shadow stores. These include the FlexProtect, Collect, ShadowStoreDelete, ShadowStoreProtect, and SinReport jobs.

The FlexProtect job has two phases that are of particular relevance to shadow stores:

  1. The ‘LIN reverify’ phase: Metatree transfers are allowed, even if a file is under repair. Since a metatree transfer proceeds in the opposite direction from the LIN scan, the LIN table needs to be re-verified to ensure that no file is missed by the first LIN verify. Note that both the LIN verify and reverify passes scan only the LIN portion of the LIN table.
  2. The ‘SIN verify’ phase: Once it has been determined that all the LINs are good, the SINs are inspected to ensure they are all correct. This is necessary since a cloning operation during FlexProtect, for example, might have moved an unrepaired block to a shadow store. This phase scans only the SIN portion of the table.

In general, the Collect job isn't required for (logical) blocks stored in shadow stores, since the freeing system is resilient to failure. The one exception is that references from files which have been intentionally leaked (by removing the file's LIN table entry) will not be freed, so Collect deals with these.

The ShadowStoreDelete job examines each shadow store for allocated blocks that have no external references (other than the shadow store's own reference) and frees those blocks. If all the blocks in a shadow store have been freed, the shadow store itself is removed. A good practice is to run the ShadowStoreDelete job prior to running IntegrityScan on clusters that use file clones, SmartDedupe, or small file storage efficiency.
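
The gist of that garbage collection pass can be pictured as follows (a hypothetical sketch; the shadow store objects and the external_refs(), free_block() and remove_store() callbacks stand in for the real file system internals):

def shadow_store_delete(shadow_stores, external_refs, free_block, remove_store):
    """Free shadow store blocks with no external references; remove empty stores."""
    for store in shadow_stores:
        remaining = 0
        for lbn in store.allocated_blocks():
            if external_refs(store.sin, lbn) == 0:    # only the store's own reference is left
                free_block(store.sin, lbn)
            else:
                remaining += 1
        if remaining == 0:
            remove_store(store.sin)                   # nothing left in this shadow store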

The ShadowStoreProtect job updates the protection level of shadow stores which are referenced by a LIN with a higher requested protection. Shadow stores that require a protection level change are added to a persistent queue (PQ) and consumed by this job.

There is also a SinReport job engine job, which can be run to find LINs with SINs (that is, files referencing shadow stores) within the file system.

All of the jobs which can change protection contain an additional phase for SINs. For every LIN pointing to a particular SIN, if the LIN's new protection policy is higher than that of the shadow store, the job will update the SIN's protection policy. In the SIN phase, the highest recorded policy is used to protect the shadow store. In the case of disk pools, shadow stores may inherit the effective protection from the disk pool, but not the disk pool itself.
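
In pseudo-Python, the SIN phase of those jobs reduces to something like the following (illustrative only; the policy accessors are placeholders, and protection policies are assumed to be directly comparable values):

def update_sin_protection(sin, referencing_lins, get_lin_policy,
                          get_sin_policy, set_sin_policy):
    """Raise a shadow store's protection to the highest policy of any referencing LIN."""
    highest = max(get_lin_policy(lin) for lin in referencing_lins)
    if highest > get_sin_policy(sin):
        set_sin_policy(sin, highest)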

As we have seen, to a large degree shadow stores store data just as regular files do. However, blocks from regular files are moved or copied to shadow stores, and the original blocks in the source file are replaced with references to the blocks in the shadow store. If any of the logical blocks in the source file are written to, a copy-on-write (COW) event is triggered, which causes a local allocation of a block for the source file to replace the shadow reference. There may be multiple files with references to the same logical block in a shadow store. When all external references to a block in a shadow store have been released, that block is unused and will never be referenced again. The background garbage collection job, ShadowStoreDelete, periodically scans all the shadow stores and frees these unreferenced blocks. Once all the blocks in a shadow store have been released, the shadow store itself can be removed.
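
The copy-on-write behaviour described above can be sketched like this (hypothetical; allocate_block(), write_block() and the reference bookkeeping functions are placeholders for the file system internals):

def write_to_shadowed_block(file_lin, offset, new_data, get_reference,
                            allocate_block, write_block, set_local_block,
                            release_reference):
    """COW: replace a shadow reference with a locally allocated block on write."""
    ref = get_reference(file_lin, offset)    # (SIN, LBN) pair if shadowed, else None
    baddr = allocate_block(file_lin)         # local block allocation for the source file
    write_block(baddr, new_data)
    set_local_block(file_lin, offset, baddr) # the file now points at its own block
    if ref is not None:
        release_reference(*ref)              # drop the reference into the shadow store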

Be aware that files which reference shadow stores may also behave differently from regular files in that reading shadow-store references can be slower than reading data directly. Specifically, reading non-cached shadow-store references is slower than reading non-cached data. Reading cached shadow-store references takes no more time than reading cached data.

When files that reference shadow stores are replicated to another Isilon cluster or backed up via NDMP, the shadow stores are not transferred to the target Isilon cluster or backup device. The files are transferred as if they contained the data that they reference from shadow stores. On the target Isilon cluster or backup device, the files consume the same amount of space as if they had not referenced shadow stores.

When OneFS creates a shadow store, OneFS assigns the shadow store to a storage pool of a file that references the shadow store. If you delete the storage pool that a shadow store resides on, the shadow store is moved to a pool occupied by another file that references the shadow store.

OneFS does not delete a shadow-store block immediately after the last reference to the block is deleted. Instead, OneFS waits until the ShadowStoreDelete job is run to delete the unreferenced block. If a large number of unreferenced blocks exist on the cluster, OneFS might report a negative deduplication savings until the ShadowStoreDelete job is run.

Shadow stores are protected at least as much as the most protected file that references them. For example, if one file that references a shadow store resides in a storage pool with +2 protection and another file that references the same shadow store resides in a storage pool with +3 protection, the shadow store is protected at +3.

Quotas account for files that reference shadow stores as if the files contained the data referenced from shadow stores; from the perspective of a quota, shadow-store references do not exist. However, if a quota includes data protection overhead, the quota does not account for the data protection overhead of shadow stores.