OneFS SmartPools Transfer Limits

The new OneFS 9.5 release introduces the first phase of engineering’s Smarter SmartPools initiative, and delivers a new feature called SmartPools transfer limits.

The goal of SmartPools transfer limits is to address spill over. Previously, when file pool policies were executed, OneFS had no guardrails to protect against overfilling the destination or target storage pool. So if a pool was overfilled, data would unexpectedly spill over into other storage pools.

The effects of an overflow would result in storagepool usage exceeding 100%, and the SmartPools job itself doing a considerable amount of unnecessary work, trying to send files to a given storagepool. But since the pool was full, it would then have to send those files off to another storage pool that was below capacity. This would result in data going where it wasn’t intended, and the potential for individual files to end up getting split between pools. Also, if the full pool was on the most performant storage in the cluster, all subsequent newly created data would now land on slower storage, affecting its throughput and latency. The recovery from a spillover can be fairly cumbersome since it’s tough for the cluster to regain balance, and urgent system administration may be required to free space on the affected tier.

In order to address this, SmartPools Transfer Limits allows a cluster admin to configure a storagepool capacity-usage threshold, expressed as a percentage, and beyond which file pool policies stop moving data to that particular storage pool.

These transfer limits only take effect when running jobs that apply filepool policies, such as SmartPools, SmartPoolsTree, and FilePolicy.

The main benefits of this feature are two-fold:

  • Safety, in that OneFS avoids undesirable actions, so the customer is prevented from getting into escalation situations, because SmartPools won’t overfill storage pools.
  • Performance, since transfer limits avoid unnecessary work, and allow the SmartPools job to finish sooner.

Under the hood, a cluster’s storagepool SSD and HDD usage is calculated using the same algorithm as reported by the ‘isi storagepools list’ CLI command. This means that a pool’s VHS (virtual hot spare) reserved capacity is respected by SmartPools transfer limits. When a SmartPools job is running, there is at least one worker on each node processing a single LIN at any given time. In order to calculate the current HDD and SSD usage per storagepool, the worker must read from the diskpool database. To circumvent this potential bottleneck, the filepool policy algorithm caches the diskpool database contents in memory for up to 10 seconds.

Transfer limits are stored in gconfig, and a separate entry is stored within the ‘smartpools.storagepools’ hierarchy for each explicitly defined transfer limit.

Note that in the SmartPools lexicon, ‘storage pool’ is a generic term denoting either a tier or nodepool. Additionally, SmartPools tiers comprise one or more constituent nodepools.

Each gconfig transfer limit entry stores a limit value and the diskpool database identifier of the storagepool that the transfer limit applies to. Additionally, a ‘transfer limit state’ field specifies which of three states the limit is in:

Limit State Description
Default Fallback to the default transfer limit.
Disabled Ignore transfer limit.
Enabled The corresponding transfer limit value is valid.

A SmartPools transfer limit does not affect the general ingress, restriping, or reprotection of files, regardless of how full the storage pool is where that file is located.  So if you’re creating or modifying a file on the cluster, it will be created there anyway. This will continue up until the pool reaches 100% capacity, at which point it will then spill over.

The default transfer limit is 90% of a pool’s capacity, and this applies to all storage pools where the cluster admin hasn’t explicitly set a threshold. Another thing to note is that the default limit doesn’t get set until a cluster upgrade to OneFS 9.5 has been committed. So if you’re running a SmartPools policy job during an upgrade, you’ll have the preexisting behavior, which is send the file to wherever the file pool policy instructs it to go. It’s also worth noting that, even though the default transfer limit is set on commit, if a job was running over that commit edge, you’d have to pause and resume it for the new limit behavior to take effect. This is because the new configuration is loaded lazily when the job workers are started up, so even though the configuration changes, a pause and resume is needed to pick up those changes.

SmartPools itself needs to be licensed on a cluster in order for transfer limits to work. And limits can be configured at the tier or nodepool level. But if you change the limit of a tier, it automatically applies to all its child nodepools, regardless of any prior child limit configurations. The transfer limit feature can also be disabled, which results in the same spillover behavior OneFS always displayed, and any configured limits will not be respected.

Note that a filepool policy’s transfer limits algorithm does not consider the size of the file when deciding whether to move it to the policy’s target storagepool, regardless of whether the file is empty, or a large file. Similarly, a target storagepool’s usage must exceed its transfer limit before the filepool policy will stop moving data to that target pool. The assumption here is that any storagepool usage overshoot is insignificant in scale compared to the capacity of a cluster’s storagepool.

A SmartPools file pool policy allow you to send snapshot or HEAD data blocks to different targets, if so desired.

Because the transfer limit applies to the storagepool itself, and not to the file pool policy, it’s important to note that, if you’ve got varying storagepool targets and one file pool policy, you may have a situation where the head data blocks do get moved. But if the snapshot is pointing at a storage pool that has exceeded its transfer limit, it’s blocks will not be moved.

File pool policies also allow you to specify how a mixed node’s SSDs are used: Either as L3 cache, or as an SSD strategy for head and snapshot blocks. If the SSDs in a node are configured for L3, they are not being used for storage, so any transfer limits are irrelevant to it. As an alternative to L3 cache, SmartPools offers three main categories of SSD strategy:  Avoid, which means send all blocks to HDD, Data, which means send everything to SSD, and then metadata read or read-write, which send varying numbers of metadata mirrors to SSD, and data blocks to hard disk.

To reflect this, SmartPools transfer limits are slightly nuanced when it comes to SSD strategies. That is, if the storagepool target contains both HDD and SSD, the usage capacity of both mediums needs to be below the transfer limit in order for the file to be moved to that target. For example, take two node pools, NP1 and NP2.

A file pool policy, Pol1, is configured, that matches all files under /ifs/dir1, with an SSD strategy of metadata write, and pool NP1 as the target for HEAD’s data blocks. For snapshots, the target is NP2, with an ‘avoid’ SSD strategy, so just writing to hard disk for both snapshot data and metadata.

When a SmartPools job runs and attempts to apply this file pool policy, it sees that SSD usage is above the 85% configured transfer limit for NP1. So, even though the hard disk capacity usage is below the limit, neither HEAD data nor metadata will be sent to NP1.

For the snapshot, the SSD usage is also above the NP2 pool’s transfer limit of 90%.

However, since the SSD strategy is ‘avoid’, and because the hard disk usage is below the limit, the snapshot’s data and metadata get successfully sent to the NP2 HDDs.

PowerScale OneFS 9.5

Dell PowerScale is already powering up the new year with the launch of the innovative OneFS 9.5 release, which shipped today (24th January 2023).

With data integrity and protection being top of mind in this era of unprecedented corporate cyber threats, OneFS 9.5 brings an array of new security features and functionality to keep your unstructured data and workloads more secure than ever, as well as delivering significant performance gains on the PowerScale nodes – such as up to 55% higher performance on all-flash F600 and F900 nodes as compared with the previous OneFS release.[3]

Table Description automatically generated

OneFS and hardware security features

New PowerScale OneFS 9.5 security enhancements include those that help address US Federal and DoD mandates, such as FIPS 140-2, Common Criteria, and DISA STIGs – in addition to general enterprise data security requirements. Multi-factor authentication (MFA), single sign-on (SSO) support, data encryption in-flight and at rest, TLS 1.2, USGv6R1 IPv6 support, SED Master Key rekey, plus a new host-based firewall are all part of OneFS 9.5.

15TB and 30TB self-encrypting (SED) SSDs now enable PowerScale platforms running OneFS 9.5 to scale up to 186 PB of encrypted raw capacity per cluster – all within a single volume and filesystem, and before any additional compression and deduplication benefit.

Delivering federal-grade security to protect data under a zero trust model 

Security-wise, the United States Government has stringent requirements for infrastructure providers such as Dell Technologies, requiring vendors to certify that products comply with requirements such as USGv6, STIGs, DoDIN APL, and so on. Activating the OneFS 9.5 cluster hardening option implements a default maximum security configuration with AES and SHA cryptography, which automatically renders a cluster FIPS 140-2 compliant.

OneFS 9.5 introduces SAML-based single sign-on (SSO) from both the command line and WebUI using a redesigned login screen. OneFS SSO is compatible with identity providers (IDPs) such as Active Directory Federation Services, and is also multi-tenant aware, allowing independent configuration for each of a cluster’s Access Zones.

Federal APL requirements mandate that a system must validate all certificates in a chain up to a trusted CA root certificate. To address this, OneFS 9.5 introduces a common Public Key Infrastructure (PKI) library to issue, maintain, and revoke public key certificates. These certificates provide digital signature and encryption capabilities, using public key cryptography to provide identification and authentication, data integrity, and confidentiality. This PKI library is used by all OneFS components that need PKI certificate verification support, such as SecureSMTP, ensuring that they all meet Federal PKI requirements.

This new OneFS 9.5 PKI and certificate authority infrastructure enables multi-factor authentication, allowing users to swipe a CAC or PIV smartcard containing their login credentials to gain access to a cluster, rather than manually entering username and password information. Additional account policy restrictions in OneFS 9.5 automatically disable inactive accounts, provide concurrent administrative session limits, and implement a delay after a failed login.

As part of FIPS 140-2 compliance, OneFS 9.5 introduces a new key manager, providing a secure central repository for secrets such as machine passwords, Kerberos keytabs, and other credentials, with the option of using MCF (modular crypt format) with SHA256 or SHA512 hash types. OneFS protocols and services may be configured to support FIPS 140-2 data-in-flight encryption compliance, while SED clusters and the new Master Key re-key capability provide FIPS 140-2 data-at-rest encryption. Plus, any unused or non-compliant services are easily disabled.

On the network side, the Federal APL has several IPv6 (USGv6) requirements that are focused on allowing granular control of individual components of a cluster’s IPv6 stack, such as duplicate address detection (DAD) and link local IP control. Satisfying both STIG and APL requirements, the new OneFS 9.5 front-end firewall allows security admins to restrict the management interface to specified subnet and implement port blocking and packet filtering rules from the cluster’s command line or WebUI, in accordance with federal or corporate security policy.

Improving performance for the most demanding workloads

OneFS 9.5 unlocks dramatic performance gains, particularly for the all-flash NVMe platforms, where the PowerScale F900 can now support line-rate streaming reads. SmartCache enhancements allow OneFS 9.5 to deliver streaming read performance gains of up to 55% on the F-series nodes, F600 and F9003, delivering benefit to media and entertainment workloads, plus AI, machine learning, deep learning, and more.

Enhancements to SmartPools in OneFS 9.5 introduce configurable transfer limits. These limits include maximum capacity thresholds, expressed as a percentage, above which SmartPools will not attempt to move files to a particular tier, boosting both reliability and tiering performance.

Granular cluster performance control is enabled with the debut of PowerScale SmartQoS, which allows admins to configure limits on the maximum number of protocol operations that NFS, S3, SMB, or mixed protocol workloads can consume.

Enhancing enterprise-grade supportability and serviceability

OneFS 9.5 enables SupportAssist, Dell’s next generation remote connectivity system for transmitting events, logs, and telemetry from a PowerScale cluster to Dell Support. SupportAssist provides a full replacement for ESRS, as well as enabling Dell Support to perform remote diagnosis and remediation of cluster issues.

Upgrading to OneFS 9.5 

The new OneFS 9.5 code is available on the Dell Technologies Support site, as both an upgrade and reimage file, allowing both installation and upgrade of this new release.

We’ll be taking a deeper look at the new  OneFS 9.5 features and functionality in additional blog articles over the course of the next few weeks.

[1] Based on Dell analysis, August 2021.

[2] Based on Dell analysis comparing cybersecurity software capabilities offered for Dell PowerScale vs. competitive products, September 2022.

[3] Based on Dell internal testing, January 2023. Actual results will vary.

OneFS SmartQuotas Efficiency Reporting

In this final article in the OneFS SmartQuotas series we focus on data reduction and storage efficiency reporting:

SmartQuotas reports both data reduction and efficiency as a ratio across the desired dataset as specified in the quota path field. These efficiency and data reduction ratios are for the full quota directory and its contents, including any overhead, and reflects the net efficiency of both compression and deduplication.

The ‘isi quota quotas view’ CLI command provides considerably more detailed storage capacity and efficiency metrics. These include the following:

Metric Description
AppLogical The application data that can be written to the cluster, irrespective of where it’s stored from.
FSLogical Removing sparse data (data that was always sparse, was zero block eliminated, or data that’s been moved to the cloud, etc) results in filesystem logical, which is the non-sparse data stored on the filesystem.
AppPhysical Data reduction techniques, such as compression and dedupe, reduce filesystem logical to application physical, or pre-protected physical. This is the physical size application data on the filesystem disks, and does not include metadata, protection overhead, or data moved to the cloud.
FSPhysical Application physical with data protection overhead added – including inode, mirroring and FEC blocks, etc. Filesystem physical is also referred to as protected physical.
Reduction The data reduction ratio is the amount that’s been reduced from the filesystem logical down to the application physical.
Efficiency Storage efficiency ratio is the filesystem logical divided by the filesystem physical.

With OneFS, the relationship between the capacity, data reduction and storage efficiency elements is as follows:

SmartQuotas reports the capacity saving from in-line data reduction as a storage efficiency ratio across the desired data set, or quota domain, as specified in the quota path field. The efficiency ratio is for the full quota directory and its contents, including any overhead, and reflects the net efficiency of compression and deduplication. On a cluster with licensed and configured SmartQuotas, this efficiency ratio can be easily viewed from the WebUI by navigating to ‘File System > SmartQuotas > Quotas and Usage’. In OneFS 9.2 and later, in addition to the storage efficiency ratio, the data reduction ratio is also displayed.

Similarly, the same data can be accessed from the OneFS command line via is ‘isi quota quotas list’ CLI command. For example:

# isi quota quotas list Type      AppliesTo  Path  Snap  Hard  Soft  Adv  Used  Reduction  Efficiency ------------------------------------------------------------------------------ directory DEFAULT    /ifs  No    -     -     -    6.02T 2.54 : 1   1.77 : 1 ------------------------------------------------------------------------------

Total: 1

More detail, including both the physical (raw) and logical (effective) data capacities, is also available via the ‘isi quota quotas view <path> <type>’ CLI command. For example:

# isi quota quotas view /ifs directory                         Path: /ifs                         Type: directory                    Snapshots: No                     Enforced: No                    Container: No                       Linked: No                        Usage                            Files: 5759676          Physical(With Overhead): 6.93T         FSPhysical(Deduplicated): 3.41T          FSLogical(W/O Overhead): 6.02T         AppLogical(ApparentSize): 6.01T                    ShadowLogical: -                     PhysicalData: 2.01T                       Protection: 781.34G      Reduction(Logical/Data): 2.54 : 1 Efficiency(Logical/Physical): 1.77 : 1

To configure SmartQuotas for in-line data efficiency reporting, create a directory quota at the top-level file system directory of interest, for example /ifs. Creating and configuring a directory quota is a simple procedure and can be performed from the WebUI by navigate to ‘File System > SmartQuotas > Quotas and Usage’ and selecting ‘Create a Quota’. In the create pane, field, set the Quota type to ‘Directory quota’, add the preferred top-level path to report on, select ‘application logical size’ for Quota Accounting, and set the Quota Limits to ‘Track storage without specifying a storage limit’. Finally, select the ‘Create Quota’ button to confirm the configuration and activate the new directory quota.

The efficiency ratio is a single, current-in time efficiency metric that is calculated per quota directory and includes the sum of in-line compression, zero block removal, in-line dedupe and SmartDedupe. This is in contrast to a history of stats over time, as reported in the ‘isi statistics data-reduction’ CLI command output, described above. As such, the efficiency ratio for the entire quota directory will reflect what is actually there.

When using SyncIQ replication on a cluster pair that are also running SmartQuotas, the quotas are matched one-to-one across the replication set. Multiple quotas are supported within a source directory or domain structure, and the target directory is included in a quota domain.

During replication SyncIQ ignores quota limits. However, if a quota is over limit, quotas still prevent users from adding additional data. SyncIQ will never automatically delete an existing target quota. Instead, a SyncIQ will fail, as opposed to deleting an existing quota. This may occur during an initial sync where the target directory has an existing quota under it, or if a source directory is deleted that has a quota on it on the target. The quota still remains and requires administrative removal if desired

Application logical quotas, available in OneFS 8.2 and later, provide a quota accounting metric, which accounts for, reports and enforces on the actual space consumed and available for storage, independent of whether files are on-premises or cloud-tiered.

In addition to data-protection overhead, the option is provided on whether to include snapshot data when calculating a quota’s usage limits.

SmartQuotas will only report on snapshots created after the quota domain was created. This is because determining quota governance (including QuotaScan job) for existing snapshots is a very time and resource consuming operation. However, as snapshots age out, SmartQuotas will gradually accrue accounting information for the entire set of relevant snapshots.

Compressed and deduplicated files appear no differently than regular files to standard quota policies. However, for deduplicated files, if the quota is configured to include data-protection overhead, the additional space used by the shadow store will not be accounted for by the quota.

OneFS and QLC SED Drives

A couple of days ago on 5th January, Dell announced support for quad-level cell (QLC) self-encrypting (SED) flash media for PowerScale. Specifically, the F900 and F600 all-flash NVMe platforms are now available with 15.4TB and 30.7TB QLC SED NVMe drives.

These new QLC SED drives offer a compelling blend of security, capacity, performance, reliability and affordability – and will be particularly beneficial for sensitive workloads and datasets requiring at-rest encryption.

The details of the new QLC SED drive options for the F600 and F900 platforms are as follows:

PowerScale Node Chassis specs

(per node)

Raw capacity

(per node)

Max Raw capacity
(252 node cluster)
F900 2U with 24 NVMe SSD drives 737.28TB with 30.72TB QLC

368.6TB with 15.36TB QLC

185.79PB with 30.72TB QLC

92.83PB with 15.36TB QLC

F600 1U with 8 NVMe SSD drives 245.76TB with 30.72TB QLC

122.88TB with 15.36TB QLC

61.93PB with 30.72TB QLC

30.96PB with 15.36TB QLC

This allows a PowerScale F900 cluster with the 30.7TB QLC SED drives to grow up to 185.79PB of raw encrypted data capacity in a single volume, coupled with predictable linear performance scaling!

The new QLC SED drives double the all-flash capacity footprint for encrypted data, as compared to previous generations – while delivering robust environmental efficiencies in consolidated rack space, power and cooling. What’s more, PowerScale F600 and F900 nodes containing QLC SED drives can deliver the same level of performance as TLC SED drives, thereby delivering vastly superior economics and value.

QLC-based F600 and F900 SED nodes can easily be rapidly and non-disruptively integrated into existing PowerScale clusters.

Before we get into the details, a quick terminology review:

Term Details
DARE Data-at-rest encryption
FIPS Federal Information Processing Standard 140 (currently at version 3: FIPS 140-3)
ISE Instant Secure Erase (Drives that support crypto erase but are not SEDs)
Non-FIPS SED drive that supports data-at-rest encryption, but has not yet been FIPS 140-3 certified.
QLC Quad-level cell, high capacity SSD (4 bits per cell).
SED Self-encrypting drive that supports data-at-rest encryption (includes both FIPS and non-FIPS drives).
SSD Solid State Drive, using flash memory for storage rather than spinning magnetic media.
TLC Tri-level cell SSD (3 bits per cell).

With the introduction of a new version of the FIPS 140 standard (FIPS 140-3), these new QLC SED drives fall under the ‘non-FIPS’ category above, and are currently intended for customers that need data-at-rest encryption but do not explicitly require US FIPS certification. That said, FIPS 140-3 certification of these QLC SED SSD drives is in porgess and will be completed later this year.

Under the hood, PowerScale support for these new drives requires the addition of a new ‘QLC SED-Non-FIPS’ OneFS drive category. Since the overall data-at-rest protection provided by a cluster is determined by the lowest protection offered by any component in the cluster, if a cluster contains any SED-Non-FIPS drives, it cannot claim to provide FIPS-certified protection. As such, actions that would reduce the protection level provided by a cluster are blocked.

OneFS 9.4.0.8 now recognizes the following drive types with their corresponding SED compliance level:

SED Drive Type Compliance Level
SED-NON-FIPS 0
SED-FIPS 1
SED-FIPS-140-2 2
SED-FIPS-140-3 3

For the curious, the compliance level can be queried via a SED node’s drives-psi.conf file. For example:

# cat /etc/psi.conf.d/drives-psi.conf | grep -i compliance

compliance_level = 0;

From the WebUI, the ‘drive details’ pop-up window in OneFS 9.4.0.8 is extended to display the drive’s compliance status via a new ‘SED Compliance Level’ field. This can be viewed by navigating to Hardware configuration > Drives and selecting ‘View details’ for the desired drive:

The ‘isi device drive view’ CLI command in OneFS 9.4.0.8 also reports the ‘SED Compliance Level’ field:

# isi device drive view 10

Lnn: 1

Location: Bay 10

Lnum: N/A

Device: /dev/nvd2

Baynum: 10

Handle: 364

Serial: PHAC2044006Y15PHGN

Model: Dell Ent NVMe SED P5316 RI 15.36TB

Tech: NVME

Media: SSD

Media Class: QLC

SED Compliance Level: SED-NON-FIPS 

Blocks: 30001856512

Logical Block Length: 512

Physical Block Length: 512 W

WN: 01000000010000005CD2E4B110325551

State: WRONG_TYPE

Purpose: UNKNOWN

Purpose Description: A drive whose purpose is unknown

Present: Yes

Percent Formatted: 0

Or from the ‘isi status –node’ CLI command, which is also enhanced to display a new node-level ‘SED Compliance Level’ attribute:

# isi status --node 1

Node LNN:               1

Node ID:                1

Node Name:              tme-1

Node IP Address:       10.9.24.76

Node Health:            -A—

Node Ext Conn:          C

Node SN:                8QMKR33

SED Compliance Level:   SED-NON-FIPS 

Member of Node Pools:   n/a

Member of Tiers:        n/a

Node Capacity:          19.0T

Available:              19.0T (> 99%)

Used:                   1.1G (< 1%)

Similarly, the node compliance level is reported in the OneFS 9.4.0.8 WebUI for each drive in Hardware Configuration->Nodes->Node Details. For example:

Additionally, PowerScale F600 and F900 nodes must be running OneFS 9.4.0.8 and DSP v1.43.2 or later in order to support QLC SED drives. In the event of a QLC SED drive failure, it must be replaced with another QLC  SED drive. More specifically:

Node Type Drive Type Drive Supported
ISE ISE Yes
ISE SED-Non-FIPS No
ISE SED-FIPS No
SED-Non-FIPS ISE No
SED-Non-FIPS SED-Non-FIPS Yes
SED-Non-FIPS SED-FIPS Yes
SED-FIPS ISE No
SED-FIPS SED-Non-FIPS No
SED-FIPS SED-FIPS Yes

If the wrong type of drive is inadvertently added to a node, the ‘SYS_DISK_WRONGTYPE’ CELOG event will provide a detailed description of why the drive is incorrect.

Also, per the OneFS compatibility rules, joins of SED-Non-FIPS nodes to SED-FIPS clusters are also blocked.

Minimum Node in Cluster Joining Node Type Join Supported
SED-Non-FIPS ISE No
SED-Non-FIPS SED-Non-FIPS Yes
SED-Non-FIPS SED-FIPS Yes
SED-FIPS ISE No
SED-FIPS SED-Non-FIPS No
SED-FIPS SED-FIPS Yes

Finally, any attempts to downgrade a QLC SED node to a version prior to OneFS 9.4.0.8 will be blocked.

OneFS SmartQuotas Accounting and Reporting

In this next article in the OneFS SmartQuotas series we turn our attention to quota accounting and reporting:

SmartQuotas has four main resources used in quota accounting:

Accounting Resource Description
Physical Size This includes all the on-disk storage associated with files and directories, with the exception of some metadata objects including the LIN tree, snapshot tracking files (STFs). For deduplicated data and file clones, each file’s 8 KB reference to a shadow store is included in the physical space calculation.
File system logical size File system logical size calculation approximates disk usage on ‘typical’ storage arrays by ignoring the erasure code, or FEC, protection overhead that OneFS employs. For regular files, the logical data space is the amount of storage required to house a particular file if it was 1x mirrored. Logical space also incorporates a file’s metadata resources.
Application Logical Size Reports total logical data store across different tiers, including CloudPools. This allows users to view quotas and free space as an application would view it, in terms of how much capacity is available to store logical data regardless of data reduction or tiering technology.
Inodes SmartQuotas counts the number of logical inodes, which allows accounting for files without any ambiguity from hard links or protection.

 When configuring a quota, these are accounting resource options are available as enforcement limits. For example, from the OneFS WebUI:

Application logical size quotas are available in OneFS 8.2 and later. Existing quotas can easily be configured to use application logical size upon upgrading from an earlier OneFS version. The benefits of application logical size quotas include:

  • Snapshots, protection overhead, deduplication, compression, and location of files all have no effect on quota consumption
  • Removes previous limitation where SmartQuotas only reported on-cluster storage, ignoring cloud consumption
  • Presents view that aligns with Windows storage accounting
  • Enables accounting and enforcing quota on actual file sizes
  • Precisely accounts for small files
  • Enables enforcing quotas on a path irrespective of the physical location of file.

The following table describes how SmartQuotas accounts for a 1KB file with the various datatypes:

Data Type Accounting
File: physical size Every non-sparse 8 KB disk block a file consumes including protection
File: file system logical size Every non-sparse 8 KB disk block a file consumes excluding protection
File: application logical size Actual size of file (rather than total of 8 KB disk blocks consumed)
CloudPools file: file system logical size Size of CloudPools SmartLink stub file (8 KB)
CloudPools file: application logical size Actual size of file on cloud storage (rather than local stub file)
Directories Sum of all directory entries
Symlinks Data size
ACL and similar Data size
Alternate data stream Each ADS is charged as a file and a container as a directory

The example below shows each method of accounting for a 1KB file.

Method Details
Logical size accounting Sum of physical sizes of all files/directories without overhead.
Physical size accounting Sum of physical sizes of all files/dirs with protection overhead.
Application Logical Accounting Sum of actual sizes of all files/directories.

So the logical size is reported as 8 KB, or one block, physical size reports 24 KB (file with 3x mirroring protection), and application logical shows its actual size of 1 KB.

Other resources encountered during quota accounting include:

Resource Description
Hard Link Each logical inode is accounted exactly once in every domain to which it belongs. If an inode is present in multiple domains, it is accounted in multiple domains. Alternatives such as shared accounting were considered. However, if inodes are not accounted once in every domain, it is possible for the deletion of a hard link in one domain to put another domain over quota.
Alternate Data Stream (ADS) A file with an alternate data stream or resource fork is accounted as the sum of the resource usage of the individual file, the usage for the container directory and the usage for each ADS. SmartQuotas handles the rename of a file with ADS synchronously, despite the fact that the ADS container is just a directory. SmartQuotas will store an accounting summary on the ADS container to handle renames.
Directory Rename A directory rename presents a unique challenge to a per-directory quota system. Renames of directories within a domain are trivial – if both the source and target directories have the same domain membership, no accounting changes. However, non-empty directories are not permitted to be moved when the SmartQuotas configuration is different on the source and the target parent directories. If a user trusts the client operating systems to copy files and preserve all the necessary attributes, then the user may set dir_rename_errno to EXDEV, which causes most UNIX and Windows clients to do a copy and delete of the directory tree to affect the move.
Snapshot Accounting If wanted, a quota domain can also include snapshot usage in its accounting. SmartQuotas will only support snapshots created after the quota domain was created. This is because determining quota governance (including QuotaScan job) for existing snapshots is a very time and resource consuming operation. As most administrators cycle their snapshots through timed expirations, SmartQuotas will eventually accrue enough accounting information to include the entire set of relevant snapshots on the system.

SmartQuotas supports flexible reporting options that enable administrators to more effectively manage cluster resources and analyze usage statistics. The goal of Quota Reporting is to provide a summarized view of the past or present state of the Quota Domains. There are three methods of data collection and reporting that are supported:

Reporting Method Detail
Scheduled Scheduled reports are generated and saved on a regular interval.
Ad-hoc Ad-hoc reports are generated and saved per request of the user.
Live Live reports are generated for immediate and temporary viewing

 A summary of general quota usage info can be viewed from the CLI via the ‘isi quota quotas list’ command syntax. Or from the WebUI, by navigating to File System > SmartQuotas > Quotas and Usage.

For each quota entry, additional information and context is available via the ‘isi quota quotas view <quota_name>’ CLI command, or by clicking on the WebUI ‘View / Edit’ button:

Client-side quota reporting includes support for rpc.quotad, which allows NFS clients to view quota consumption for both hard and soft quotas using the native Linux and UNIX ‘quota’ CLI utilities. There is also the ability to view available user capacity set by soft and/or hard user or group quotas, rather than the entire cluster capacity or parent directory-quotas.

The quota reports and summaries are typically stored in the /ifs/.isilon/smartquotas/reports directory, but this location is configurable. Each generated report includes the quota domain definition, state, usage, and global configuration settings. By default, ten reports and ten summaries are kept at a time, and older versions are purged. This can be configured from the WebUI, by navigating to File System > SmartQuotas > Settings:

On demand reports can also be created at any time to view the current state of the storage quotas system. These live reports can be saved manually.

Reports and summaries are prefixed by either ‘ad hoc’ or ‘scheduled’ to aid with identification.

The OneFS CLI export functionality makes use of the same data generation and storage format as quota reporting but should not require any extra requirements beyond the three types of reports. After the collection of the raw reporting data, data summaries can be produced given a set of filtering parameters and sorting type.

Reports can be viewed from historical sampled data or a live system. In either case, the reports are views of usage data at a given time. SmartQuotas does not provide reports on aggregated data over time (trending reports). However, the raw data can be used by a Quota Administrator to answer trending questions.

A quota report is a time-stamped XML file that starts off with global configuration settings and global notification rules:

# cat scheduled_quota_report_1465786800.xml

    <global-config>

        <quota-global-config>

            <reports>

                <schedule-pattern>1100000000|every sunday at 11pm</schedule-pattern>

                <schedule-dir>/ifs/.isilon/smartquotas/reports</schedule-dir>

                <schedule-copies>10</schedule-copies>

                <adhoc-dir>/ifs/.isilon/smartquotas/reports</adhoc-dir>

                <adhoc-copies>10</adhoc-copies>

            </reports>

        </quota-global-config>

    </global-config>

    <global-notify>

    </global-notify>

    <domains>

        <domain type="default-group" snaps="0" lin="0x0000000100020006">

            <path>/ifs/home</path>

            <inactive/>

            <enforcements default-resource="logical">

            </enforcements>

            <notifications use="global"/>

        </domain>

        <domain type="group" snaps="0" lin="0x0000000100020006" id="0">

            <inherited/>

            <id-name>wheel</id-name>

            <usage resource="physical">109568</usage>

            <usage resource="logical">32929</usage>

            <usage resource="inodes">6</usage>

            <path>/ifs/home</path>

            <inactive/>

            <enforcements default-resource="logical">

            </enforcements>

            <notifications use="default"/>

        </domain>

        <domain type="group" snaps="0" lin="0x0000000100020006" id="10">

            <inherited/>

            <id-name>admin</id-name>

            <usage resource="physical">28160</usage>

            <usage resource="logical">8208</usage>

            <usage resource="inodes">2</usage>

            <path>/ifs/home</path>

            <inactive/>

            <enforcements default-resource="logical">

            </enforcements>

            <notifications use="default"/>

        </domain>

        <domain type="group" snaps="0" lin="0x0000000100020006" id="1800">

            <inherited/>

            <id-name>Isilon Users</id-name>

            <usage resource="physical">1811456</usage>

            <usage resource="logical">705620</usage>

            <usage resource="inodes">42</usage>

            <path>/ifs/home</path>

            <inactive/>

            <enforcements default-resource="logical">

            </enforcements>

            <notifications use="default"/>

        </domain>

        <domain type="user" snaps="0" lin="0x0000000100020596" id="2002">

            <id-name>nick</id-name>

            <usage resource="physical">1001984</usage>

            <usage resource="logical">483743</usage>

            <usage resource="inodes">12</usage>

            <path>/ifs/home/nick</path>

            <enforcements default-resource="logical">

                <enforcement type="soft" resource="logical">

                    <limit>10485760</limit>

                    <grace>7776000</grace>

                </enforcement>

                <enforcement type="advisory" resource="logical">

                    <limit>5242880</limit>

                </enforcement>

            </enforcements>

            <notifications>

                <quota-notify-map tag="1"></quota-notify-map>

            </notifications>

        </domain>

    </domains>

</quota-report>

When listing domains, both inode and path, as well as name and ID, are stored with each domain. Quota Notification Rules are read and inserted into a domain entry only if the domain is not inherited to avoid any performance impact of reading the Quota Notification Rules with each domain.

SmartQuotas can be configured to produce scheduled reports to help monitor, track, and analyze storage use on a OneFS powered cluster.

Quota reports are managed by configuring settings that provide control over when reports are scheduled, how they are generated, where and how many are stored and how they are viewed. The maximum number of scheduled reports that are available for viewing in the web-administration interface can be configured for each report type. When the maximum number of reports is stored, the system automatically deletes the oldest reports to make space for new reports as they are generated.

SmartQuotas can be easily configured to generate quota report settings to generate the quota report on a specified schedule. These settings determine whether and when scheduled reports are generated, and where and how the reports are stored. Even if scheduled reports are disabled, you can still run unscheduled reports at any time.

The method to do this is:

  1. From the OneFS WebUI, go to File System Management > SmartQuotas > Settings.
  2. (Optional) On the Quota settings page, for Scheduled Reporting, click On. The Report Frequency option appears.
  3. Click Change schedule and select the report frequency that you want to set from the list.
  4. Select the reporting schedule options that you want.
  5. Click Save.

Reports are generated according to your criteria and can be viewed in the Generated Reports Archive.

In addition to scheduled quota reports, you can generate a report to capture usage statistics at a point in time. Before you can generate a quota report, quotas must exist and no QuotaScan jobs can be running.

The following procedure will achieve this:

  1. Click File System Management > SmartQuotas > Generated Reports Archive.
  2. In the Generated Quota Reports Archive area, click Generate a quota report.
  3. Click Generate Report.

The new report appears in the Quota Reports list.

You can locate quota reports, which are stored as XML files, and use your own tools and transforms to view them. This task can only be performed from the OneFS command-line interface.

A procedure for this is as follows:

  1. Open a secure shell (SSH) connection to any node in the cluster and log in.
  2. Go to the directory where quota reports are stored. The following path is the default quota report location:
/ifs/.isilon/smartquotas/reports

If quota reports are not in the default directory, you can run the isi quota settings command to find the directory where they are stored.

  1. At the command prompt, run the ls command.

To view a list of all quota reports in the directory, run the following command:

# ls -a *.xml

To view a specific quota report in the directory, run the following command:

# ls <filename>.xml

OneFS SmartQuotas Notifications

A crucial part of the OneFS SmartQuotas system is to provide user notifications regarding quota enforcement violations, both when a violation event occurs and while violation state persists on a scheduled basis.

An enforcement quota may have several notification rules associated with it. Each notification rule specifies a condition and an action to be performed when the condition is met. Notification rules are considered part of enforcements. Clearing an enforcement also clears any notification rules associated with it.

Enforcement quotas support the following notification settings:

Quota Notification Setting Description
Global default Uses the global default notification for the specified type of quota.
Custom – basic Enables creation of basic custom notifications that apply to a specific quota. Can be configured for any or all the threshold types (hard, soft, or advisory) for the specified quota.
Custom – advanced Enables creation of advanced, custom notifications that apply to a specific quota. Can be configured for any or all of the threshold types (hard, soft, or advisory) for the specified quota.
None Disables all notifications for the quota.

A quota notification condition is an event which may trigger an action defined by a notification rule. These notification rules may specify a schedule (for example, “every day at 5:00 AM”) for performing an action or immediate notification of a certain condition. Examples of notification conditions include:

  • Notify when a threshold is exceeded; at most, once every 5 minutes
  • Notify when allocation is denied; at most, once an hour
  • Notify while over threshold, daily at 2 AM
  • Notify while grace period expired weekly, on Sundays at 2 AM

Notifications are triggered for events grouped by the following two categories:

Type Description
Instant notification Includes the write-denied notification triggered when a hard threshold denies a write and the threshold-exceeded notification, triggered at the moment a hard, soft, or advisory threshold is exceeded. These are one-time notifications because they represent a discrete event in time.
Ongoing notification Generated on a scheduled basis to indicate a persisting condition, such as a hard, soft, or advisory threshold being over a limit or a soft threshold’s grace period being expired for a prolonged period.

Each notification rule can perform either one or none of the following notification actions.

Quota Notification Action Description
Alert Sends an alert for one of the quota actions, detailed below.
Email Manual Address Sends email to a specific address, or multiple addresses (OneFS 8.2 and later).
Email Owner Emails an owner mapping based on its identity source.

The email owner mapping is as follows:

Mapping Description
Active directory Lookup is performed against the domain controller (DC). If the user does not have an email setting, a configurable transformation from user name and DC fully qualified domain name is performed in order to generate an email address.
LDAP LDAP user email resolution is similar to AD users. In this case, only the email attribute looked up in the LDAP server is configurable by an administrator based on the LDAP schema for the user account information.
NIS Only the configured email transformation for the NIS fully qualified domain name is used.
Local users Only the configured email transformation is used.

The actual quota notification is handled by a daemon, isi_quota_notify_d, which performs the following functions:

  • Processes kernel notification events that get sent out. They are matched to notification rules to generate instant notifications (or other actions as specified in the notification rule)
  • Processes notification schedules – The daemon will check notification rules on a scheduled basis. These rules specify what violation condition should trigger a notification on a regular scheduled basis.
  • Performs notifications based on rule configuration to generate email messages or alert notifications.
  • Manages persistent notification states so that pending events are processed in the event of a restart.
  • Handles rescan requests when quotas are created or modified

SmartQuotas provides email templates for advisory, grace, and regular notification configuration, which can be found under /etc/ifs. The advisory limit email template (/etc/ifs/quota_email_advisory_template.txt) for example, displays:

Subject: Disk quota exceeded

The <ISI_QUOTA_DOMAIN_TYPE> quota on path <ISI_QUOTA_PATH> owned by <ISI_QUOTA_OWNER> has   exceeded the <ISI_QUOTA_TYPE> limit.

The quota limit is <ISI_QUOTA_THRESHOLD>, and <ISI_QUOTA_USAGE> is currently in use. <ISI_QUOTA_HARD_LIMIT> Contact your system administrator for details.

An email template contains text, and, optionally, variables that represent quota values. The following table lists the SmartQuotas variables that may be included in an email template.

Variable Description Example
ISI_QUOTA_DOMAIN_TYPE Quota type. Valid values are: directory, user, group, default-directory, default-user, default-group default-directory
ISI_QUOTA_EXPIRATION Expiration date of grace period Fri Jan 8 12:34:56 PST 2021
ISI_QUOTA_GRACE Grace period, in days 5 days
ISI_QUOTA_HARD_LIMIT Includes the hard limit information of the quota to make advisory/soft email notifications more informational. You have 30 MB left until you reach the hard quota limit of 50 MB.
ISI_QUOTA_NODE Hostname of the node on which the quota event occurred us-wa-1
ISI_QUOTA_OWNER Name of quota domain owner jsmith
ISI_QUOTA_PATH Path of quota domain /ifs/home/jsmith
ISI_QUOTA_THRESHOLD Threshold value 20 GB
ISI_QUOTA_TYPE Threshold type Advisory
ISI_QUOTA_USAGE Disk space in use 10.5 GB

Note that the default quota templates under /etc/ifs send are configured to send email notifications with a plain text MIME type. However, editing a template to start with an HTML tag (<html>) will allow an email client to interpret and display it as HTML content. For example:

<html><Body>

<h1>Quota Exceeded</h1><p></p>

<hr>

<p> The path <ISI_QUOTA_PATH> has exceeded the threshold <ISI_QUOTA_THRESHOLD> for this <ISI_QUOTA_TYPE> quota. </p>

</body></html>

Various system alerts are sent out to the standard cluster Alerting system when specific events occur. These include:

Alert Type Level Event Description
NotifyFailed Warning An attempt to process a notification rule failed externally, such as an undelivered email.
NotifyConfig Warning A notification rule failed due to a configuration issue, such as a non-existent user or missing email address.
NotifyExceed Warning A child quota’s advisory/soft/hard limit is greater than any of parent quota’s hard limit.
ThresholdViolation Info A quota threshold was exceeded. The conditions under which this alert is triggered are defined by notification rules.
DomainError Error An invariant was violated that resulted in a forced domain rescan.

 

Unveiling Lakehouse – Explaining Data Lakehouse as Cloud-native DWP Part2

In this article I focus on how the data lakehouse architecture compares with the classic data warehouse architecture. I imagine the data lakehouse architecture as an attempt to implement some of the core requirements of data warehouse architecture in a modern, cloud-native design. I will explore the advantages of cloud-native design, including the ability to dynamically provision resources in response to specific events, predetermined patterns, and other triggers. I also explore data lakehouse architecture as its own unique approach to addressing new or different types of practices, use cases, and consumers.

In an important sense, data lakehouse architecture is an effort to adapt the data warehouse and its architecture to the cloud, while also addressing a larger set of novel use cases, practices, and consumers. This claim is not as counterintuitive or daunting as it may seem. We can think of data warehouse architecture as a technical specification that enumerates and describes the set of requirements (features and capabilities) that the ideal data warehouse system must address, but does not specify how to design or implement the data warehouse. Designers are free to engineer their own novel implementations of the warehouse, such as what Joydeep Sen Sarma and Ashish Thusoo attempted with Apache Hive, a SQL interpreter for Hadoop, or what Google did with BigQuery, its NoSQL query-as-a-service offering.

The data lakehouse is a similar example. If a data lakehouse implementation addresses the set of requirements specified by data warehouse architecture, it can be considered a data warehouse.

In the What is Data Lakehouse? – Unstructured Data Quick Tips (unstructureddatatips.com), we saw that data lakehouse architecture differs from the monolithic design of classic data warehouse implementations and the more tightly coupled designs of big data-era platforms like Hadoop+Hive or PaaS warehouses like Snowflake.

So, how is data lakehouse architecture different and why?

Adapting Data Warehouse Architecture to Cloud

The classic implementation of data warehouse architecture is based on outdated expectations, especially regarding how the warehouse’s functions and resources are instantiated, connected, and accessed. For example, early implementers of data warehouse architecture expected the warehouse to be physically implemented as an RDBMS and for its components to connect to each other using a low-latency, high-throughput bus. They also expected SQL to be the only way to access and manipulate data in the warehouse.

Another expectation was that the data warehouse would be online and available all the time, and its functions would be tightly coupled to each other. This was a feature of its implementation in an RDBMS, but it made it impractical (and impossible) to scale the warehouse’s resources independently.

None of these expectations are true in the cloud. We are familiar with the cloud as a metaphor for virtualization, which is the use of software to abstract and define various virtual resources, and for the scale-up/scale-down elasticity that is a defining characteristic of the cloud.

However, we may not spend as much time thinking about the cloud as a metaphor for event-driven provisioning of virtualized hardware, and the ability to provision software in response to events.

This on-demand dimension is arguably the most important practical benefit of the cloud’s elasticity and a significant difference between the data lakehouse and the classic data warehouse.

The Data Lakehouse as Cloud-native Data Warehouse

Event-driven design on this scale requires a different set of hardware and software requirements, which cloud-native software engineering concepts, technologies, and methods address. Instead of monolithic applications that run on always-on, always-available, physically implemented hardware resources, cloud-native design allows developers to instantiate discrete software functions as loosely coupled services in response to specific events. These loosely coupled services correspond to the functions of an application, and applications are composed of these loosely coupled services, like the data lakehouse and its layered architecture.

What makes the data lakehouse cloud-native? It is cloud-native when it decomposes most, if not all, of the software functions implemented in data warehouse architecture. These functions include:

        • One or more functions that can store, retrieve, and modify data;
        • one or more functions that can perform various operations (such as joins) on data;
        • one or more functions that expose interfaces for users and jobs to store, retrieve, modify data and specify different types of operations to perform on data;
        • one or more functions that manage and enforce data access and integrity safeguards;
        • one or more functions that generate or manage technical and business metadata;
        • one or more functions that manage and enforce data consistency safeguards when two or more users/jobs try to modify the same data simultaneously or when a new user and job tries to update data currently being accessed by prior users/jobs.

Using this as a guideline, we can say that a “pure” or “ideal” implementation of data lakehouse architecture would include:

      • The lakehouse service itself, which in addition to SQL query provides metadata management, data federation, and data cataloging capabilities. It also serves as a semantic layer by creating, maintaining, and versioning modeling logic, such as denormalized views applied to data in the lake.
      • The data lake, which at minimum provides schema enforcement and the ability to store, retrieve, modify, and schedule operations on objects/blobs in object storage. It also usually provides data profiling and discovery, metadata management, data cataloging, data engineering, and optionally data federation capabilities. It enforces access and data integrity safeguards across its zones and ideally generates and manages technical metadata for the data in these zones.
      • An object storage service that provides a scalable, cost-effective storage substrate and handles the work of storing, retrieving, and modifying data stored in file objects.

There are different ways to implement the data lakehouse. One option is to combine all these functions into a single omnibus platform, a data lake with its own data lakehouse, like what Databricks, Dremio, and others have done with their data lakehouse implementations.

Why Does Cloud-native Design Matter?

This raises some obvious questions. Why do this? What are the advantages of a loosely coupled architecture compared to the tightly integrated architecture of the classic data warehouse? As mentioned, one benefit of loose coupling is the ability to scale resources independently of each other, such as allocating more compute without adding storage or network resources. It also eliminates some dependencies that can cause software to break, so a change in one service will not necessarily impact or break other services, and the failure of a service will not necessarily cause other services to fail or lose data. Cloud-native design also uses mechanisms like service orchestration to manage and address service failures.

Another benefit of loose coupling is the potential to eliminate dependencies from reliance on a specific vendor’s or provider’s software. If services communicate and exchange data with each other solely through publicly documented APIs, it should be possible to replace a service that provides a set of functions (like SQL query) with an equivalent service. This is the premise of pure or ideal data lakehouse architecture, where each component is effectively commoditized (with equivalent services available from major cloud infrastructure providers, third-party SaaS and/or PaaS providers, and as open-source offerings) and reduces the risk of provider-specific lock-in.

The Data Lakehouse as Event-driven Data Warehouse

Cloud-native software design also expects the provisioning and deprovisioning of the hardware and software resources for loosely coupled cloud-native services to happen automatically. In other words, provisioning a cloud-native service means provisioning its enabling resources, and terminating a cloud-native service means to deprovision these resources. In a way, cloud-native design wants to make hardware and to some extent software disappear as variables in deploying, managing, maintaining, and especially scaling business services.

From the perspective of consumers and expert users, there are only services – tools that do things.

For example, if an ML engineer designs a pipeline to extract and transform data from 100 GBs of log files, a cloud-native compute engine will dynamically provision compute instances to process the workload. Once the engineer’s workload finishes, the engine will automatically terminate these instances.

Ideally, neither the engineer nor the usual IT support people (DBAs, systems and network administrators, so forth) need to do anything to provision these compute instances or the software and hardware resources they depend on. Instead, this all happens automatically – for example, in response to an API call initiated by the engineer. The classic on-premises data warehouse was not designed with this kind of cloud-native, event-driven computing paradigm in mind.

The Data Lakehouse as Its Own Thing

The data lakehouse is supposed to be its own thing, providing the six functions listed above. However, it depends on other services – specifically, an object storage service and optionally a data lake service – to provide basic data storage and core data management functions. In addition, data lakehouse architecture implements novel software functions that have no obvious parallel in classic data warehouse architecture and are unique to the data lakehouse. These functions include:

      • One or more functions that can access, store, retrieve, modify, and perform operations (like joins) on data stored in object storage and/or third-party services. The lakehouse simplifies access to data in Amazon S3, AWS Lake Formation, Amazon Redshift, so forth
      • One or more functions that can discover, profile, catalog, and/or facilitate access to distributed data stored in object storage and/or third-party services. For example, a modeler creates denormalized views that combine data stored in the data lakehouse and in the staging zone of an AWS Lake Formation (a data lake), and designs advanced models incorporating data from an Amazon Redshift sales data mart.

However, in this respect, the lakehouse is not different from a PaaS data warehouse service, which we will explore in depth in future articles.

OneFS SmartQuotas Execution, Operation, and Governance

SmartQuotas employs the OneFS job engine to execute its work. Specifically, the QuotaScan job updates the accounting for quota domains created on an existing directory path. Although it is typically run without any intervention, the administrator has the option of manually control if necessary or desirable.

The OneFS job engine is based on a delegation hierarchy made up of coordinator, director, manager, and worker processes.

Once a SmartQuotas job is initially allocated, the job engine uses a shared work distribution model in order to execute the work, and each job is identified by a unique Job ID. When a job is launched, whether it’s scheduled, started manually, or responding to a cluster event, the Job Engine spawns a child process from the isi_job_d daemon running on each node. This job engine daemon is also known as the parent process.

The entire job engine’s orchestration is handled by the coordinator, which is a process that runs on one of the nodes in a cluster. While the actual work item allocation is managed by the individual nodes, the coordinator node takes control, divides up the job, and evenly distributes the resulting tasks across the nodes in the cluster. It is also responsible for starting and stopping jobs, and also for processing work results as they are returned during the execution of a job.

Each node in the cluster has a job engine director process, which runs continuously and independently in the background. The director process is responsible for monitoring, governing and overseeing all job engine activity on a particular node, constantly waiting for instruction from the coordinator to start a new job. The director process serves as a central point of contact for all the manager processes running on a node, and as a liaison with the coordinator process across nodes.

Manager processes are responsible for arranging the flow of tasks and task results throughout the duration of a job. Each manager controls and assigns work items to multiple worker threads working on items for the designated job. Under direction from the coordinator and director, a manager process maintains the appropriate number of active threads for a configured impact level, and for the node’s current activity level.

Each worker thread is given a task, if available, which it processes item-by-item until the task is complete or the manager un-assigns the task. Towards the end of a job phase, the number of active threads decreases as workers finish up their allotted work and become idle. Nodes which have completed their work items just remain idle, waiting for the last remaining node to finish its work allocation. When all tasks are done, the job phase is considered to be complete, and the worker threads are terminated.

By default, QuotaScan runs with a ‘low’ impact policy and a low-priority value of ‘6’.

If quotas are created on empty directories, governance will instantaneously propagate from parent to child incrementally. If the directory is not empty, the QuotaScan job is used to update the governance.

A domain created on a non-empty directory will not be marked as ready. This triggers a QuotaScan job to be started, and it performs a treewalk to traverse the directory tree under the domain root.

The QuotaScan job is the cluster maintenance process responsible for scanning the cluster to performing accounting activities to bring the determined governance to each inode. In essence, the job is a distributed tree walk that is performed based on the state of the domain.

Under the hood, SmartQuotas is based on the concept of domains – the linchpins of quota accounting. Since OneFS is a single file system, it relies on domains for defining the scope of a quota in place of the typical volume boundaries found in most storage systems. As such, a domain defines which files belong to a quota, accounts for each resource type in that set and defines the top-level directory configuration point.

For SmartQuotas, the three main resource types are:

Resource Type Description
Directory A specific directory and all its subdirectories
User A specific user
Group All members of a specific group

A domain defined as “name@folder” would be the set of files under “folder,” owned by “name,” which could be either a user or a group. The files accounted include all files reachable from the given path, without traversing any soft links. The owner “name” can be ALL, and “/ifs,” the OneFS root directory, is also an effective ALL for “folder.”

With SmartQuotas, it is easy to create traditional domain types quickly by using “ALL.” The following are examples of domain types:

  • All files belonging to user Jane: user:Jane@/ifs
  • All files under /ifs/home, belonging to any user: ALL@/ifs/home.
  • All files under /ifs/home that belong to user Jane: user:Jane@/ifs/home

Domains cannot be created on anything but directories. More specifically, domains are associated with the actual directories themselves, not directory paths. For example, if the domain is ALL@/ifs/home/data, but /ifs/home/data gets renamed to /ifs/home/files, the domain stays with the directory.

Domains can also be nested and may overlap. For example, say a hard quota is set on /ifs/data/marketing for 5 TB. 1 TB soft quotas are then placed on individual users in the marketing department. This ensures that the marketing directory as a whole never exceeds 5 TB, while limiting the users in the marketing department to 1 TB each.

A default quota domain is one that does not account for any specific set of files but instead specifies a policy for new domains that match a specific trigger. In other words, default domains are configuration templates for actual domains. SmartQuotas use the identity notation ‘default-user’, ‘default-group’, and ‘default directory’ to describe domains with default policies. For example, the domain default-user@/ifs/home becomes specific-user@/ifs/home for each specific-user that is not otherwise defined. All enforcements on default-user are copied to specific-user when specific-user allocates within the domain and the new inherited domain quota is termed as a Linked Quota. There may be overlapping defaults (default-user@/ifs and default-user@/ifs/home may both be defined).

Default quota domains help drastically simplify quota management for large environments by providing a mechanism to define top-level template configurations from which many actual quotas can be cloned, or linked. When a default quota domain is configured on a directory, any subdirectories created directly underneath this will automatically inherit the quota limits specified in the parent domain. This streamlines the provisioning and management quotas for large enterprise environments. Furthermore, default directory quotas can co-exist with user and/or group quotas and legacy default quotas.

Default directory quotas have been available since OneFS 8.2, in addition to the default user and group quotas available in earlier releases. For example:

  • Create default-directory quota
# isi quota create --path=/ifs/parent-dir --type=default-directory --hard-threshold=10M
  • Modify Default directory quota
# isi quota modify --path=/ifs/parent-dir --type=default-directory --advisory-threshold=6M --soft-threshold=7M --soft-grace=1D
  • List default-directory quota
# isi quota list                 




  Type              AppliesTo  Path            Snap  Hard   Soft  Adv  Used




  --------------------------------------------------------------------------




  default-directory DEFAULT    /ifs/parent-dir No    10.00M -    6.00M 0.00




  --------------------------------------------------------------------------




  Total: 1
  • Delete Default directory quota
# isi quota delete --path=/ifs/parent-dir --type=default-directory

If the enforcements on a default domain change, SmartQuotas will automatically propagate the changes to the Linked Quota domains. If a default quota domain is deleted, SmartQuotas will delete all children marked as inherited. An administrator may also choose to delete the default without deleting the children, but this will break inheritance on all inherited children.

For example, the creation & deletion of sub-directory under default directory folder causes inherited directory quota creation and removal:

A quota domain may be in one of three accounting states as described in the following table:

Quota Accounting States Description
Ready A domain in the ready state is fully accounted. SmartQuotas displays “ready” domains in all interfaces and all enforcements apply to such domains.
Accounting A domain is placed in the Accounting state when it is waiting on accounting updates.
Deleting After a request to delete a domain, SmartQuotas will place the domain in the deleting state until tear-down is complete. Domain removal may be a lengthy process.

SmartQuotas displays accounting domains in all interfaces including usage data but indicate they are in the process of being “Accounted.” SmartQuotas applies all enforcements to accounting domains, even when it might reject an allocation that would have proceeded if it had completed the QuotaScan.

Domains in the deleting state are hidden from all interfaces, and the top-level directory of a domain may be deleted while the domain is still in the deleting state (assuming there are no domains in “Ready” or “Accounting” state defined on the directory). No enforcements are applied for domains in “Deleting” state.

A quota scan is performed when the domain is in an Accounting State. This can occur during quota creation to account the new domain if a quota has been set for the domain and quota deletion to un-account the domain. A QuotaScan is required when creating a quota on a non-empty directory. If quotas are created up-front on an empty directory, no QuotaScan is necessary.

A QuotaScan job may be started either from the WebUI or CLi with the following syntax:.

# isi job jobs start quotascan

Any path specified on the command line is treated as the root of a tree that should be processed. This is provided primarily as a means to rescan a directory or maintenance reasons.

In addition to the core isi_smartquoatas service, there are three processes, or daemons, associated with SmartQuotas:

Daemon Details
isi_quota_notify_d Listens for ‘limit exceeded’ and ‘link denied’ events and generate notifications for each. Also responds to configuration change events and instructs the QDB to generate ‘expired’ and ‘violated’ over-threshold notifications.
isi_quota_report_d Generates quota reports. Since the QDB only produces real-time resource usage, reports are necessary for providing point-in-time vies of a quota domain’s usage. These historical reports are useful for trend analysis of quota resource usage.
isi_quota_sweeper_d Responsible for quota housekeeping tasks such as propagating default changes, domain and notification rule garbage collection, and kicking off QuotaScan jobs when necessary.

 

These can be viewed as follows:

# isi services -a | grep -i quota

   isi_smartquotas      SmartQuotas Service                      Enabled

# ps -auxw | grep -i quota

root    4852    0.0  0.0  26708   8404  -  Is   Sat20        0:00.00 /usr/sbin/isi_quota_report_d

root    4860    0.0  0.0  26812   8424  -  Is   Sat20        0:00.00 /usr/sbin/isi_quota_notify_d

root    4872    0.0  0.0  26836   8488  -  Is   Sat20        0:00.00 /usr/sbin/isi_quota_sweeper_d

OneFS 8.2 and later also include the rpc.quotad service to facilitate client-side quota reporting on UNIX and Linux clients using native ‘quota’ tools. The service which runs on tcp/udp port 762 is enabled by default, and control is under NFS global settings.

Also, users can view their available user capacity set by soft or hard user and group quotas rather than the entire cluster capacity or parent directory-quotas. This avoids the ‘illusion’ of seeing available space that may not be associated with their quotas.

SmartQuotas is included as a core component of OneFS but requires a valid product license key in order to activate. This license key can be purchased through your Dell EMC account team. An unlicensed cluster will show a SmartQuotas warning until a valid product license has been purchased and applied to the cluster.

License keys can be easily added through the ‘Activate License’ section of the OneFS WebUI, accessed by going to Cluster Management > Licensing.

OneFS SmartQuotas Architectural Fundamentals

As we saw in the previous article in this series, at a high level, there are three main elements to a OneFS quota:

Element Description
Domain Define which files and directories belong to a quota
Resource The quantity being limited
Enforcement Specify the limits and what actions are taken when those thresholds are exceeded

We’ll look at each of these elements over the course of this series of articles. But first, let’s delve into the architecture.

Under the hood, SmartQuotas hinges on the quota domain and quota database, and the general operational flow is as follows:

Each quota is governed by a OneFS domain, which defines the quota’s scope and includes a set of usage levels, limits, and configuration options. Most of this information is organized and managed by the file system and stored in the quota database (QDB). This database is represented in a B-tree structure, known as the quota tree, and allows both scalability and fast random access. Because of its importance, the quota database is protected at OneFS’ highest metadata level. The quota accounting blocks (QABs) within individual records are protected at the same level as the associated directory.

A quota domain is made up of the following principal parts:

Component Description
Quota domain key Where the unique identifier for the domain is stored.
Quota domain header (QDH) Contains various state and configuration information that affects the domain as a whole.
Quota domain enforcements Manages quota limits, including whether they have been reached or exceeded, notification information, and the quota grace period.
Quota domain account (QDA) Handles tracking of usage levels for the domain. The QDA tracks physical, logical, and file resource types for each domain.

The QDB is a data structure that stores quota domain record (QDR). Resource allocation and governance changes are recorded in the quota operation associated with a transaction, totaled and applied persistently to the QDRs.

Within QDB, a quota domain record stores all configuration and state associated with a domain. The record can be broken down into three components:

Component Description
Configuration Fields within quota config, such as whether the domain is a container. Despite the name, this includes some state fields like the Ready flag.
Enforcements A list of quota enforcements, which include the limit, grace period, and notification state. Although the structure is flexible, only three enforcements are allowed and only for a single resource.
Account The quota account for the domain.

The on-disk format of the QDR is as follows The structure is dynamic, based on the configured enforcements and state of the account, so the on-disk structures look different than the in-memory structures.

Quota domain locks synchronize access to quota domain records in the QDB.

The main challenge for quota domain locks is that the need to exclusively lock a quota domain is not known until the accounting is fully determined. In fact, it may not be until responses from transaction deltas are received before this is reported to the initiator. To address this, Quota Domain Locks use optimistic restarts.

Quota Account Blocks (QABs) enable high-performance accounting using transaction deltas. Since when the quota usage info if viewed it is stale anyway, locking is simplified by using an exclusive domain lock for coherent reads of usage.

Each QAB contains a large number of Quota Accounting records, which need to be updated whenever a particular user adds or removes data from an area of the file system on which quotas are enabled (quota domain). If a large quantity of clients are simultaneously accessing the quota domain, these blocks can become highly contended and a potential bottleneck. Similarly, if a single client (or small number of clients) consistently makes a large number of small writes to files within a single quota, write performance could again be impacted.

To address this, quota accounts have a mechanism to help avoid hot spots on the nodes storing QABs. Quota Account Constituents (QACs) help parallelize the quota accounting by including additional QAB mirrors distributed across other nodes in the cluster.

Configuration is managed through a sysctl, efs.quota.reorganize.qac_ratio , which increases the number of quota accounting constituents. This provides better scalability and reduces latencies on heavy create/delete activities when quotas are used.

Using this parameter, the internally calculated QAC count for each quota is multiplied by the specified value. If a workflow experiences write performance issues, and it has many writes to files or directories governed by a single quota, then increasing the QAC ratio may significantly improve write performance.

The sysctl efs.quota.reorganize.qac_ratio can be reconfigured to its maximum value of 8 from its default value of 1 using the following CLI command:

# isi_sysctl_cluster efs.quota.reorganize.qac_ratio=8

To verify the persistent change, run:

# cat /etc/mcp/override/sysctl.conf | grep qac_ratio

efs.quota.reorganize.qac_ratio=8 #added by script

Although increasing the QAC count through this sysctl can improve performance on write heavy quota domains, some amount of experimentation may be required until the ideal QAC ratio value is found. Adjusting the parameter can adversely affect write performance if you apply a value that is too high, or if you apply the parameter in an environment that does not have diminished write performance due to quota contention.

Additionally, OneFS provides a CLI command, which can restripe the QABs to improve their performance.

# isi_restripe_qabs retune

This utility can be run either on demand or periodically to randomly redistribute QABs for all existing quotas. It does this by ignoring the default ‘rebalance’ layout and running a ‘retune’ layout strategy instead, thereby alleviating the performance impact from an imbalanced QAB layout.

Unveiling Lakehouse – What is Data Lakehouse? Part1

What is Data Lakehouse?

This article on the data lakehouse will aim to introduce the data lakehouse and describe what is new and different about it.

The Data Lakehouse Explained

The term “lakehouse” is derived from the two foundational technologies the data lake and the data warehouse. Lakehouse is a concept or data paradigm that can be built using different set of technologies to fulfill the objectives.

At a high level, the data lakehouse consists of the following components:

      • Data lakehouse
      • Data lake
      • Object storage

The data lakehouse describes a data warehouse-like service that runs against a data lake, which sits on top of an object storage. These services are distributed in the sense that they are not consolidated into a single, monolithic application, as with a relational database. They are independent in the sense that they are loosely coupled or decoupled — that is, they expose well-documented interfaces that permit them to communicate and exchange data with one another. Loose coupling is a foundational concept in distributed software architecture and a defining characteristic of cloud services and cloud-native design.

How Does the Data Lakehouse Work? 

From the top to the bottom of the data lakehouse stack, each constituent service is more specialized than the service that sits “underneath” it.

      • Data lakehouse: The data lakehouse is a highly specialized abstraction layer or a semantic layer. That exposes data in the lake for operational reporting, ad hoc query, historical analysis, planning and forecasting, and other data warehousing workloads.
      • Data lake: The data lake is a less specialized abstraction layer. That schematizes and manages the objects contained in an underlying object storage service, and schedules operations to be performed on them. The data lake can efficiently ingest and store data of every type. Like structured relational data (which it persists in a columnar object format), semi structured (text, logs, documents), and un or multi structured (files of any type) data.
      • Object storage: As the foundation of the lakehouse stack, object storage consists of an even more basic abstraction layer: A performant and cost-effective means of provisioning and scaling storage, on-demand storage.

Again, for data lakehouse to work, the architecture must be loosely coupled. For example, several public cloud SQL query services, when combined with cloud data lake and object storage services, can be used to create the data lakehouse. This solution is the “ideal” data lakehouse in the sense that it is a rigorous implementation of a formal, loosely coupled architectural design. The SQL query service runs against the data lake service, which sits on top of an object storage service. Subscribers instantiate prebuilt queries, views, and data modeling logic in the SQL query service, which functions like a semantic layer. And this whole solution is the data lakehouse.

This implementation is distinct from the data lakehouse services that Databricks, Dremio, and others market. These implementations are coupled to a specific data lake implementation, with the result that deploying the lakehouse means, in effect, deploying each vendor’s data lake service, too.

The formal rigor of an ideal data lakehouse implementation has one obvious benefit: It is notionally easier to replace one type of service (for example, a SQL query) with an equivalent commercial or open-source service.

What Is New and Different About the Data Lakehouse?

It all starts with the data lake. Again, the data lakehouse is a higher-level abstraction superimposed over the data in the lake. The lake usually consists of several zones, the names, and purposes of which vary according to implementation. At a minimum, Lakehouse consist of the following:

      • one or more ingest or landing zones for data.
      • one or more staging zones, in which experts work with and engineer data; and
      • one or more “curated” zones, in which prepared and engineered data is made available for access.

Usually, the data lake is home to all an organization’s useful data. This data is already there. So, the data lakehouse begins with query against this data where it lives.

It is in the curated (GOLD) zone of the data lake that the data lakehouse itself lives. Although it is also able to access and query against data that is stored in the lake’s other zones. In this way the data lakehouse can support not only traditional data warehousing use cases, but also innovative use cases such as data science and machine learning and artificial intelligence engineering.

The following are the advantages of the data lakehouse.

  1. More agile and less fragile than the data warehouse

Querying against data in the lake eliminates the multistep process involved in moving the data, engineering it and moving it again before loading it into the warehouse. (In extract, load, transform [ELT], data is engineered in the warehouse itself. This removes a second data movement operation.) This process is closely associated with the use of extract, transform, load (ETL) software. With the data lakehouse, instead of modeling data twice — first, during the ETL phase, and, second, to design denormalized views for a semantic layer, or to instantiate data modeling and data engineering logic in code — experts need only perform this second modeling step.

The result is less complicated (and less costly) ETL, and a less fragile data lakehouse.

  1. Query against data in place in the data lake

Querying against the data lakehouse makes sense because all an organization’s business-critical data is already there — that is, in the data lake. Data gets stored into the lake from sensors and other sources, from workload, business apps and services, from online transaction processing systems, from subscription feeds, and so on.

The strong claim is that the extra ability to query against data in the whole of the lake — that is, its staging and non-curated zones — can accelerate data delivery for time-sensitive use cases. A related claim is that it is useful to query against data in the lakehouse, even if an organization already has a data warehouse, at least for some time-sensitive use cases or practices.

The weak claim is that the lakehouse is a suitable replacement for the data warehouse.

  1. Query against relational, semi-structured, and multi-structured data

The data lakehouse sits atop the data lake, which ingests, stores and manages data of every type. Moreover, the lake’s curated zone need not be restricted solely to relational data: Organizations can store and model time series, graph, document, and other types of data there. Even though this is possible with a data warehouse, it is not cost-effective.

  1. More rapidly provision data for time-sensitive use cases

Expert users — say, scientists working on a clinical trial — can access raw trial results in the data lake’s non-curated ingest zone, or in a special zone created for this purpose. This data is not provisioned for access by all users; only expert users who understand the clinical data are permitted to access and work with it. Again, this and similar scenarios are possible because the lake functions as a central hub for data collection, access, and governance. The necessary data is already there, in the data lake’s raw or staging zones, “outside” the data lakehouse’s strictly governed zone. The organization is just giving a certain class of privileged experts early access to it.

  1. Better support for DevOps and software engineering

Unlike the classic data warehouse, the lake and the lakehouse expose various access APIs, in addition to a SQL query interface.

For example, instead of relying on ODBC/JDBC interfaces and ORM techniques to acquire and transform data from the lakehouse — or using ETL software that mandates the use of its own tool-specific programming language and IDE design facility — a software engineer can use preferred dev tools and cloud services, so long as these are also supported by team’s DevOps toolchain. The data lake/lakehouse, with its diversity of data exchange methods, its abundance of co-local compute services, and, not least, the access it affords to raw data, is arguably a better “player” in the DevOps universe than is the data warehouse. In theory, it supports a larger variety of use cases, practices, and consumers — especially expert users.

True, most RDBMSs, especially cloud PaaS RDBMSs, now support access using RESTful APIs and language-specific SDKs. This does not change the fact that some experts, particularly software engineers, are not — at all — charmed of the RDBMS.

Another consideration is that the data warehouse, especially, is a strictly governed repository. The data lakehouse imposes its own governance strictures, but the lake’s other zones can be less strictly governed. This makes the combination of the data lake + data lakehouse suitable for practices and use cases that require time-sensitive, raw, lightly prepared, so on, data (such as ML engineering).

  1. Support more and different types of analytic practices.

For expert users, the data lakehouse simplifies the task of accessing and working with raw or semi-/multi-structured data.

Data scientists, ML, and AI engineers, and, not least, data engineers can put data into the lake, acquire data from it, and take advantage of its co-locality with an assortment of intra-cloud compute services to engineer data. Experts need not use SQL; rather, they can work with their preferred languages, libraries, services and tools (notebooks, editors, and favorite CLI shells). They can also use their preferred conceptual vocabularies. So, for example, experts can build and work with data pipelines, as distinct to designing ETL jobs. In place of an ETL tool, they can use a tool such as Apache Airflow to schedule, orchestrate, and monitor workflows.

Summary

It is impossible to untie the value and usefulness of the data lakehouse from that of the data lake. In theory, the combination of the two — that is, the data lakehouse layered atop the data lake — outperforms the usefulness, flexibility, and capabilities of the data warehouse. The discussion above sometimes refers separately to the data lake and to the data lakehouse. What is usually, however, is the co-locality of the data lakehouse with the data lake — the “data lake/house,” if you like.