December 2024 – Unstructured Data Quick Tips

OneFS MetadataIQ Cluster-side Configuration

In the prior article in this series, we took an in-depth look at MetadataIQ’s architecture and operation. Now, we turn our focus to its configuration and deployment.

MetadataIQ is introduced in OneFS 9.10 and requires a cluster to be running a fresh install or committed upgrade of 9.10 or later in order to run. The installation package for the OneFS 9.10 release is available at the Dell Support site.

Once the download has completed, the cluster can be upgraded and committed to OneFS 9.10 either from the CLI using the ‘isi upgrade cluster’ command, or via the WebUI by navigating to Cluster management > Upgrade.

Post upgrade to, or installation of, OneFS 9.10, the /ifs/.ifsvar/modules/metadataiq directory is created, which houses various MetadataIQ components and logs.

In addition to a PowerScale cluster running OneFS 9.10, MetadataIQ also requires that the following dependencies are met:

MetadataIQ requires the OneFS ISI_PRIV_SNAPHSHOT privilege in order to run.

Confirm that SnapshotIQ is licensed across the cluster and that the snapshot service is enabled.

# isi license view snapshotiq | grep Status

Status: Evaluation

# isi services -a | grep -i snapshot

isi_snapshot_d       Snapshot Daemon                          Enabled

Verify that the ElasticSearch packages are installed on the cluster by running the following CLI command:

# python -m pip list | grep -i elasticsearch

elasticsearch        6.3.1

elasticsearch-midx   8.14.0

MetadataIQ configuration and management in OneFS 9.10 is currently limited to the command line (CLI) or the platform API (pAPI) endpoints.

On the PowerScale cluster, the ‘isi metadataiq’ CLI command execution syntax is as follows:

Usage:

isi metadataiq {<action> | <subcommand>}

[--timeout <integer>]

[{--help | -h}]


Actions:

reset       Reset MetadataIQ to defaults.

resync      Rebuild metadata index from initial state.


Subcommands:

settings    Manage MetadataIQ settings.

As such, the following options are available for the ‘isi metadataiq’ CLI command:

OneFS MetadataIQ requires initialization and configuration before it can be successfully deployed. To this end, the ‘isi metadataiq settings modify’ CLI command can be used to edit the desired configuration fields, and the syntax is as follows:

Usage:

isi metadataiq settings modify

[--max-threads <integer>]

[--excluded-lnns <integer> | --clear-excluded-lnns | --add-excluded-lnns

<integer> | --remove-excluded-lnns <integer>]

[--nshards <integer>]

[--fetch-size <integer>]

[--work-queue-size <integer>]

[--verify-certificate <boolean>]

[--hostname <string>]

[--host-port <integer>]

[--path <string>]

[--schedule <string>]

[--changelist-job-retries <integer>]

[--changelist-job-tolerable-pause-hours <integer>]

[--changelist-job-tolerable-state-request-failures <integer>]

[--ca-certificate-path <string>]

[--api-key <string>]

[{--verbose | -v}]

[{--help | -h}]

When configuring MetadataIQ, the following client system credentials and parameters are required:

API ID
API Key
ElasticSearch database hostname
ElasticSearch port (typically 9200)
CA certificate path

For example:

# isi metadataiq settings modify --verify-certificate <boolean> --ca-certificate-path <string> --api-key <string> --hostname <string> --host-port <integer> --path <string> --schedule <string>

The ‘path’ parameter must be a path to a directory under the cluster’s /ifs filesystem. If left unspecified, the metadata path defaults to /ifs.

Note also that the ‘host-port’ value for the ElasticSearch database is typically TCP port 9200.

These configuration parameters are stored within gconfig on each node in the /etc/gconfig/metadataiq_config.gc file.

The OneFS platform API (pAPI) also offers an equivalent set of MetadataIQ configuration endpoints to the CLI, accessible under /platform/21/metadataiq/settings/.

For example:

# curl --insecure --basic --user <uname:passwd> https://<cluster_ip>:8080/platform/21/metadataiq/settings/
{
"settings" :
{
"consumer" :
{
"database_info" :
{
"api_key" : "A key is configured",
"certificate_path" :
 
"fa0e6e93f47c8d0074832c47bffd630ab1faad065f3d129b32e0aa7ae8de8595",
"database_type" : "ELK database",
"host_port" : 9200,
"hostname" : "https://10.224.101.214:9200",
"verify_certificate" : true
},
"excluded_lnns" : [],
"fetch_size" : 2048,

"max_threads" : 8,

"number_shards" : 8,

"work_queue_size" : 16

},

"producer" :

{

"changelist_job_retries" : 2,

"changelist_job_tolerable_pause_hours" : 24,

"changelist_job_tolerable_state_request_failures" : 720,

"path" : "/ifs/data",

"schedule" : "every day every 2 hours"

}

}

}

The following CLI commands can be used to control the operation of the MetadataIQ services:

Service	Daemon/Utility	Enable/disable/view Commands
Producer	isi_metadataiq_producer_d	isi services isi_metadataiq_producer_d enable isi services isi_metadataiq_producer_d disable isi services isi_metadataiq_producer_d
Consumer	isi_metadataiq_consumer_d	isi services isi_metadataiq_consumer_d enable isi services isi_metadataiq_consumer_d disable isi services isi_metadataiq_producer_d
Transfer	isi_metadataiq_transfer	# isi_metadataiq_transfer # isi_metadataiq_transfer –check # isi_metadataiq_transfer –consumer-checkpoint # isi_metadataiq_transfer –-map-version # isi_metadataiq_transfer –-show-mappings

For example, the two MetadataIQ services can be enabled from the CLI as follows:

# isi services -a isi_metadataiq_producer_d enable

The service 'isi_metadataiq_producer_d' has been enabled.

# isi services -a isi_metadataiq_consumer_d enable

The service 'isi_metadataiq_consumer_d' has been enabled.

In the next article in this series, we’ll examine the process involved in deploying and configuring an ElasticSearch instance and Kibana visualization portal.

OneFS MetadataIQ Architecture and Operation

In this second article in the series, we’ll take an in-depth look at MetadataIQ’s architecture and operation.

The OneFS MetadataIQ framework is based on the following core components:

A ‘MetadataIQ cycle’ describes the complete series of steps run by the MetadataIQ service daemons, which represent the full sequence, from determining the changes between two snapshots through updating the ElasticSearch database.

On the cluster side there are three core MetadataIQ components that are added in OneFS 9.10: The Producer Service, Consumer Service, and Transfer agent.

The producer service daemon, isi_metadataiq_producer_d, is responsible for running a metadata scan of the specified OneFS file system path, according to a configured schedule.

When first started, or in response to a configuration change, the producer daemon first loads its configuration, which instructs it on parameters such as the file system path to use, what schedule to run on, etc. Once the producer has found a valid schedule configuration string, it will start its first execution process, or ‘producer cycle’, performing the following actions:

A new snapshot of the configured file system path is taken.
Next, a ChangelistCreate job is started between the previous snapshot and the newly-taken snapshot checkpoints.
This ChangelistCreate job instance is monitored and, if necessary, restarted per a configurable number of retry attempts.
A consumer checkpoint file (CCP) is generated.
Finally, cleanup is performed and the old snapshot removed.

It’s worth noting that, in this initial version of MetadataIQ, only a single path may be configured.

Internally, the consumer checkpoint (CCP) is a JSON file containing a system b-tree created by the ChangelistCreate job, providing a delta between two input snapshots. These CCP files are created under the /ifs/.ifsvar/modules/metadataiq/cp/ directory with a ‘Checkpoint’ nomenclature followed by an incrementing ID. For example:

# ls /ifs/.ifsvar/modules/metadataiq/cp/

Checkpoint_0_2.json

To aid identification, the producer daemon creates its snapshots with a naming convention that includes both a ‘MetadataIQ’ prefix and creation timestamp naming convention for easy identification, plus an expiration value of one year. For example:

# isi snapshot snapshots list | grep -i metadata

3278 MetadataIQ_1730914225                        /ifs/data

During a producer cycle, once a CCP file is successfully generated, the old snapshot from the on-going cycle gets cleaned up. This snapshot deletion actually occurs in two phases: While the producer daemon initiates snapshot removal, the actual deletion is performed by the Job Engine’s SnapshotDelete job. As such, the contents of a ‘deleted’ snapshot may still exist until a SnapshotDelete job has run to completion and actually cleaned it up.

If a prior MetadataIQ execution cycle has not already completed, the old snapshot ID will automatically be set to HEAD (i.e.ID=0), and only the new snapshot will be used to report the current metadata state under the configured path.

Next, the MetadataIQ consumer service takes over the operations.

The consumer service relies on a set of database configuration parameters, including the CA certificate attributes, in order to securely connect to the remote ElasticSearch instance. These include:

Note that, in OneFS 9.10, MetadataIQ only supports the ElasticSearch database as its off-cluster metadata store.

Additionally, the consumer daemon, isi_metadataiq_consumer_d, also has a couple of configurable parameters to control and tune its behavior. There are:

The consumer daemon checks the queue for the arrival of a new checkpoint (CCP) file. When a CCP arrives, the daemon instructs the transfer agent to upload the metadata to the Elasticsearch database. The consumer daemon also continuously monitors the successful execution of the transfer agent, restarting it if needed.

If there happens to be more than one CCP in the queue, the consumer daemon will always select the file with the oldest timestamp.

The actual mechanics of uploading the consumer checkpoint (CCP) file to the remote database are handled by the transfer agent.

The transfer agent (isi_metadataiq_transfer), a python script, which is spawned on demand by the consumer daemon.

The transfer script is invoked with the path to a CCP file specifying the target changelist.
Next, the transfer script attempts to take an advisory lock on the CCP file to prevent more than one instance of the transfer script working on the same CCP file at a given time. This advisory lock is released whenever the transfer script completes or terminates.
After acquiring the advisory lock, the transfer script validates that both the CCP file and changelist exist, and that the ElasticSearch database connection and mapping are valid. It will configure the index mappings if the index does not already exist.
If everything is fine in the above step, the transfer script will start its ‘draining loop’, fetching and batching changelist entries and allocating them to a worker thread pool for data processing and transfer.
Once the changelist is fully transferred to the ElasticSearch database, the transfer script removes the CCP file and changelist.
Finally, the transfer script releases its advisory lock and exits normally.

In the event of a failure, the transfer agent is automatically restarted by the consumer daemon.

After the initial cluster setup and config, each additional MetadataIQ job run populates the remote ElasticSearch database with updated metadata from the configured dataset.

The ElasticSearch database and Kibana visualization portal reside on an off-cluster Linux host.

ElasticSearch typically uses TCP port 9200 by default, for communication and receiving metadata updates from the PowerScale cluster(s). Kibana typically runs by default on TCP port 5601.

Periodic synchronizations are needed to keep the database updated with new metadata changes, and the recommendation is to configure a dataset-appropriate schedule for the MetadataIQ job. And with the services enabled, the first MetadataIQ cycle begins as soon as a valid schedule has been configured.

In the next article in this series, we’ll examine the process involved in standing up and configuring a MetadataIQ environment.

OneFS MetadataIQ

Prominent amongst the payload of the new OneFS 9.10 release is MetadataIQ – PowerScale’s new global metadata namespace solution.

Incorporating the ElasticSearch database and Kibana visualization dashboard, MetadataIQ facilitates data indexing and querying across multiple geo-distributed clusters.

OneFS MetadataIQ is purpose-built to provide robust metadata capabilities, allowing customers to index and discover the data they need for their workflows during file creation and modification, negating the need for time-consuming treewalks . The metadata catalog may be used for queries, data visualization, and data lifecycle management. As customers add analytics workflows, the ability to simply and efficiently query data, wherever it may reside, is vital for the time-to-results they require.

The MetadataIQ framework is used to transfer file system metadata from a cluster to an external ElasticSearch database instance. Internally, MetadataIQ leverages the venerable OneFS Job Engine’s ChangeListCreate job, which tracks the delta, or changelist, between two snapshots. MetadataIQ parses entries in each changelist in batches, updating the metadata index residing off-cluster in an ElasticSearch database. This database can store the metadata from multiple PowerScale clusters, providing a global catalog of an organization’s unstructured data repositories.

The exported OneFS file system metadata contains the fields and attributes which are typically reported by the ubiquitous ‘stat’ CLI command, including:

In addition to these standard metadata attributes, MetadataIQ also includes a number of cluster-specific fields, including path, LINs and parent LINs, disk and nodepool membership, associated snapshots, etc. The full schema, including the metadata categories, fields, types, and descriptions, will be presented in a future blog article in this series.

Behind the scenes, OneFS MetadataIQ comprises the following principal components:

The high level architecture of the MetadataIQ framework is as follows:

A ‘MetadataIQ cycle’ is the complete series of steps run by the MetadataIQ daemons, which represent a full sequence of analyzing the changes between two snapshots and updating the ElasticSearch database.

On the cluster side there are three core MetadataIQ components that are added in OneFS 9.10: The Producer Service, which coordinates snapshot generation and changelist job execution, according to a specified schedule, generating a consumer checkpoint file. The Consumer Service, which detects new checkpoint files and manages and monitors the database connectivity. And the Transfer agent, which, under the purview of the Consumer, actually performs the metadata uploads to the remote ElasticSearch instance.

Off-cluster, a Linux server running the Docker container service hosts the ‘ELK’ stack. This includes the ElasticSearch database that houses the metadata index, paired with the Kibana dashboard, which provides the query and data visualization engine.

After the initial cluster setup and config, each additional MetadataIQ job run populates the remote ElasticSearch database with updated metadata from the configured dataset. Periodic synchronizations are needed to keep the database updated with new metadata changes, and the recommendation is to configure a dataset-appropriate schedule for the MetadataIQ job. And with the services enabled, the first MetadataIQ cycle will start as soon as a valid schedule has been configured.

From the Kibana UI, the ‘Discover’ page notifies of a new data source once MetadataIQ has been successfully configured and run on the cluster. After this, creating a data view is a simple as selecting the desired index and clicking the ‘Save data view to Kibana’ button. Once done, Kibana’s ‘Discover’ mode allows you to craft and run data queries by creating one or more filters to display a desired subset of metadata entries. Moving from ‘Discover’ mode to ‘Dashboard’ allows rich custom visualizations to be created. Kibana provides multiple data presentation options, such as a bar chart, here representing ‘file and physical size distributions’ data.

Or a pie chart, here displaying metadata from multiple clusters.

Up-leveled data can be collated and represented in the dashboard as interactive charts, and clicking or hovering over the desired region allows you to easily access the details.

So there you have it – the new OneFS MetadataIQ, providing smart, efficient, and scalable metadata querying and management across a federated PowerScale metadata index. Plus, in addition to streamlining data access and boosting operation efficiency, MetadataIQ can also facilitate AI workflows such as intelligent chunking and retrieval-augmented generation (RAG).

In the next article in this series, we’ll take an in-depth look at MetadataIQ’s architecture and operation.

PowerScale InsightIQ 5.2

It’s been a prolific week for PowerScale! Hot on the heels of the OneFS 9.10 launch comes the unveiling of the new InsightIQ 5.2 release. InsightIQ delivers powerful performance monitoring and reporting functionality, helping maximize PowerScale cluster performance. This includes advanced analytics to optimize applications, correlate cluster events, and accurately forecast future storage needs.

So what new goodness does the InsightIQ 5.2 release deliver? Added functionality includes expanded ecosystem support, enhanced reporting, and streamlined upgrade and migration.

The InsightIQ (IIQ) ecosystem is expanded in 5.2 to now include Red Hat Enterprise Linux (RHEL) versions 9.4 and 8.10. This allows customers who are running current RHEL code to use InsightIQ 5.x to monitor the latest OneFS versions. Additionally, InsightIQ Simple can now be installed on VMware Workstation 17, allowing IIQ 5.2 to be deployed on non-production lab environments for trial or demo purposes – without incurring a VMware charge.

On the reporting front, dashboard and report visibility has been enhanced to allow a greater number of clusters to be viewed via the dashboard’s performance overview screen. This enables users to easily compare a broad array of multi-cluster metrics on a single pane without the need for additional scrolling and navigation.

Additionally, IIQ 5.2 also expands the maximum and minimum range for a sample point across all performance reports. This allows cluster administrators to more easily identify a potential issue with the full fidelity of metrics displayed, whereas previously down sampling to an average value may have masked an anomaly.

Support and serviceability-wise, IIQ 5.2 brings additional upgrade and migration functionality. Specifically, cluster admins can perform simple, non-disruptive in-place upgrades from IIQ 5.1 to IIQ 5.2. Additionally, IIQ 4.4.1 instances can also now be directly migrated to the new IIQ 5.2 release without the need to export or import any data, or reconfiguring any settings.

Function	Attribute	Description
OS Support	Simple ecosystem support	InsightIQ Simple 5.2.0 can be deployed on the following platforms: · VMware virtual machine running ESXi version 7.0U3 or 8.0U3 · VMware Workstation 17 (free version) InsightIQ Simple 5.2.0 can monitor PowerScale clusters running OneFS versions 9.3 through 9.10, excluding 9.6.
	Scale ecosystem support	InsightIQ Scale 5.2.0 can be deployed on Red Hat Enterprise Linux versions 8.10 or 9.4 (English language versions). InsightIQ Scale 5.2.0 can monitor PowerScale clusters running OneFS versions 9.3 through 9.10, excluding 9.6.
Upgrade	In-place upgrade from InsightIQ 5.1.x to 5.2.0	The upgrade script supports in-place upgrades from InsightIQ 5.1.x.
	Direct database migration from InsightIQ 4.4.1 to InsightIQ 5.2.0	Direct data migration from an InsightIQ 4.4.1 database to InsightIQ 5.2.0 is supported.
Reporting	Maximum and minimum ranges on all reports	All live Performance Reports display a light blue zone that indicates the range of values for a metric within the sample length. The light blue zone is shown regardless of whether any filter is applied. With this enhancement, users can observe trends in values on filtered graphs.
	More graphs on a page	Reports are redesigned to maximize the number of graphs that can appear on each page. · Excess white space is eliminated. · The report parameters section collapses when the report is run. The user can expand it manually. · Graph heights are decreased when possible. · Page scrolling occurs while the collapsed parameters section remains fixed at the top.
User interface	What’s New dialog	All InsightIQ users can view a brief introduction to new functionality in the latest release of InsightIQ. Access the dialog from the banner area of the InsightIQ web application. Click About > What’s New.
	Compact cluster performance view on the Dashboard	The Dashboard is redesigned to improve usability. · Summary information for six clusters appears in the initial Dashboard view. A sectional scrollbar controls the view for additional clusters. · The capacity section has its own scrollbar. · The navigation side bar is collapsible into space-saving icons. Use the << icon at the bottom of the side bar to collapse it.

Meanwhile, the new InsightIQ 5.2 code is available on the Dell Support site, allowing both installation of and upgrade to this new release.

PowerScale OneFS 9.10

Dell PowerScale is already scaling up the holiday season with the launch of the innovative OneFS 9.10 release, which shipped today (10^th December 2024). This new 9.10 offering is an all-rounder, introducing PowerScale innovations in capacity, performance, security, serviceability, data management, and general ease of use.

OneFS 9.10 delivers the next version of PowerScale’s common software platform for both on-prem and cloud deployments. This can make it a solid fit for traditional file shares and home directories, vertical workloads like M&E, healthcare, life sciences, financial services, and next-gen AI, ML and analytics applications.

PowerScale’s clustered scale-out architecture can be deployed on-site, in co-lo facilities, or as customer managed Amazon AWS and Microsoft Azure deployments, providing core to edge to cloud flexibility, plus the scale and performance and needed to run a variety of unstructured workflows on-prem or in the public cloud.

With data security, detection, and monitoring being top of mind in this era of unprecedented cyber threats, OneFS 9.10 brings an array of new features and functionality to keep your unstructured data and workloads more available, manageable, and secure than ever.

Hardware Innovation

On the platform hardware front, OneFS 9.10 unlocks dramatic capacity and performance and enhancements – particularly for the all-flash F910 node, which sees the introduction of support for 61TB QLC SSDs, plus 200Gb Ethernet front and backend networking.

Additionally, the H and A-series chassis-based hybrid platforms also see a significant density and per-watt efficiency improvement with the introduction of 24TB HDDs. This includes both ICE and FIPS drives, accommodating both regular and SED clusters.

Networking and performance

For successful large-scale AI model customization and training and other HPC workloads, compute farms need data served to them quickly and efficiently. To achieve this, compute and storage must be sized and deployed accordingly to eliminate potential bottlenecks in the infrastructure.

To meet this demand, OneFS 9.10 introduces support for low latency front-end and back-end HDR Infiniband network connectivity on the F710 and F910 all-flash platform, providing up to 200Gb/s of bandwidth with sub-microsecond latency. This can directly benefit generative AI and machine learning environments, plus other workloads involving highly concurrent streaming reads and writes of different files from individual, high throughput capable Linux servers. In conjunction with the OneFS multipath driver, and GPUdirect support, the choice of either HDR Infiniband or 200GbE can satisfy the networking and data requirements of demanding technical workloads such as ADAS model training, seismic analysis, and complex transformer-based AI workloads, deep learning systems, and trillion-parameter generative AI models.

Metadata Indexing

Also debuting in OneFS 9.10 is MetadataIQ, a new global metadata namespace solution. Incorporating the ElasticSearch database and Kibana visualization dashboard, MetadataIQ facilitates data indexing and querying across multiple geo-distributed clusters.

MetadataIQ efficiently transfers file system metadata from a cluster to an external ELK instance, allowing customers to index and discover the data they need for their workflows and analytics needs. This metadata catalog may be used for queries, data visualization, and data lifecycle management. As workflows are added, MetadataIQ simply and efficiently queries data, wherever it may reside, delivering vital time-to-results.

Internally, MetadataIQ leverages the venerable OneFS ChangeListCreate job, which tracks the delta between two snapshots, batch processing and updating the off-cluster metadata index residing in an ElasticSearch database. This index can store metadata from multiple PowerScale clusters, providing a global catalog of an organization’s unstructured data repositories.

Security

In OneFS 9.10, OpenSSL is upgraded from version 1.0.2 to version 3.0.14. This makes use of the newly validated OpenSSL 3 FIPS module, which all of the OneFS daemons make use of. But probably the most significant feature in the OpenSSL 3 upgrade is the addition of library support for the TLS 1.3 ciphers, designed to meet stringent Federal requirements. OneFS 9.10 adds TLS 1.3 support for the WebUI and KMIP key management servers, and verifies that 1.3 is supported for LDAP, CELOG alerts, audit events, syslog forwarding, SSO, and SyncIQ.

Support and Monitoring

OneFS 9.10 also includes healthcheck enhancements to aid the customer in understanding cluster state and providing resolution guidance in case of failures. In particular, current healthcheck results are displayed in the WebUI landing page to indicate the real-time health of the system. Also included is detailed failure information, troubleshooting steps, and resolution guidance – including links to pertinent knowledge base articles. Healthchecks are also logically grouped based on category and frequency, and historical checks are also easily accessible.

Dell Technologies Connectivity Services also replaces the former SupportAssist in OneFS 9.10, with the associated updating of user-facing Web and command line interfaces. Intended for transmitting events, logs, and telemetry from PowerScale to Dell support, Dell Technologies Connectivity Services provides a full replacement for SupportAssist. With predictive issue detection and proactive remediation, Dell Technologies Connectivity Services helps rapidly identify, diagnose, and resolve cluster issues, improving productivity by replacing manual routines with automated support. Delivering a consistent remote support experience across the Dell storage portfolio, Dell Technologies Connectivity Services is intended for all sites that can send telemetry off-cluster to Dell over the internet, and is included with all support plans (features vary based on service level agreement).

In summary, OneFS 9.10 brings the following new features and functionality to the Dell PowerScale ecosystem:

OneFS 9.10 Feature	Description
Networking	· Front-end and back-end HDR Infiniband networking option for the F910 and F710 platforms.
Platform	· Support for F910 nodes with 61TB QLC SSD drives and a 200Gb/s back-end Ethernet network. · Support for 24TB HDDs on A-series and H-series nodes.
Metadata Indexing	· Introduction of MetadataIQ off-cluster metadata indexing and discovery solution.
Security	· OpenSSL 3.0 and TLS 1.3 transport layer security support.
Support and Monitoring	· Healthcheck WebUI enhancements · Dell Technologies Connectivity

We’ll be taking a deeper look at OneFS 9.10’s new features and functionality in future blog articles over the course of the next few weeks.

Meanwhile, the new OneFS 9.10 code is available on the Dell Support site, as both an upgrade and reimage file, allowing both installation and upgrade of this new release.

For existing clusters running a prior OneFS release, the recommendation is to open a Service Request with to schedule an upgrade. To provide a consistent and positive upgrade experience, Dell EMC is offering assisted upgrades to OneFS 9.10 at no cost to customers with a valid support contract. Please refer to Knowledge Base article KB544296 for additional information on how to initiate the upgrade process.

OneFS Automatic Maintenance Mode

Another piece of functionality that OneFS 9.9 brings to the table is automatic maintenance mode (AMM). AMM builds upon and extends the manual CELOG maintenance mode capability, which has been an integral part of OneFS since the 9.2 release.

Cluster maintenance operations such as upgrades, patch installation, rolling reboots, hardware replacement, etc, typically generate a significant increase in cluster events and alerts. This can be overwhelming for the cluster admin, who’s trying to focus on the maintenance task at hand and, as such, well aware of the issue. So, as the name suggests, the general notion of OneFS maintenance mode is to provide a method of temporarily suspending these cluster notifications.

So during a maintenance window with maintenance mode enabled, OneFS will continue to log events but not generate alerts for them. As such, all events that occurred during the maintenance window can then be reviewed upon manually disabling maintenance mode. Active event groups will automatically resume generating alerts when the scheduled maintenance period ends.

Until OneFS 9.9, activating maintenance mode has been a strictly manual process. For example, to enable CELOG maintenance mode from the OneFS WebUI select Cluster Management > Events and Alerts and click on the ‘Enable maintenance mode’ button:

Alas, as with most manually initiated and terminated processes, they’re only as reliable as the operator. As such, purely manual operation runs the risk of missed critical alerts if maintenance mode is not disabled after a maintenance window has concluded.

In contrast, the new OneFS 9.9 AMM functionality automatically places clusters or nodes in maintenance mode based on predefined triggers, such as the following:

AMM Trigger	Description	Action
Simultaneous upgrade	OneFS full cluster upgrade and all nodes simultaneous reboot.	Cluster enters maintenance mode at upgrade start and exits maintenance mode when the last node finishes upgrade.
Upgrade rollback	Reverting a OneFS upgrade to the previous version prior to upgrade commit.	Cluster enters maintenance mode at rollback start, and exits maintenance mode when the last node finishes its downgrade.
Node Reboot	Rebooting a PowerScale node.	Node is added to maintenance mode as reboot starts, and exits maintenance mode when reboot completes.
Node addition/removal	Joining or removing a node to/from a PowerScale cluster.	Node is added to maintenance mode as join/removal starts, and exits maintenance mode when join/removal is completed.

During maintenance mode, CELOG alerts are suppressed, ensuring that the cluster or node can undergo necessary updates or modifications without generating a flurry of notifications. This feature is particularly useful for organizations that need to perform regular maintenance tasks but want to minimize disruptions to their workflows (and keep their cluster admins sane).

When a maintenance window is triggered, such as for a rolling upgrade, the entire cluster enters maintenance mode at the start and exits when the last piece of the upgrade operation has completed. Similarly, when a node is rebooted, it is added to maintenance mode at the start of the reboot and removed when the rebooting finishes.

Automatic maintenance mode windows have a maximum time limit of 199 hours. This is in order to prevent an indefinite maintenance mode conditions and avoid the cluster being left in limbo, any associated issues. Plus, the cluster admin can easily manually override AMM and end the maintenance window at any time.

OneFS AMM offers a range of configuration options, including the ability to control automatic activation of maintenance mode, set manual maintenance mode durations, and specify start times. AMM also keeps a detailed history of all maintenance mode events, providing valuable insights for troubleshooting and system optimization.

Under the hood, there’s a new gconfig tree in OneFS 9.9 named ‘maintenance’, which holds the configuration for both automatica and manual maintenance mode:

Attribute	Description
active	Indicates if maintenance mode is active
auto_enable	Controls automatic activation of maintenance mode
manual_window_enabled	Indicates if a manual maintenance mode is active
manual_window_hours	The number of hours a manual maintenance window will be active
manual_window_start	The start time of the current manual maintenance window
maintenance_nodes	List of node LNNs in maintenance mode (0 indicates cluster wide)

For example:

# isi_gconfig -t maintenance

[root] {version:1}

maintenance.auto_enable (bool) = true

maintenance.active (bool) = false

maintenance.manual_window_enabled (bool) = false

maintenance.manual_window_hours (int) = 8

maintenance.manual_window_start (int) = 1732062847

maintenance.maintenance_nodes (char*) = []

These attributes are also reported by the ‘isi cluster maintenance status’ CLI command. For example:

# isi cluster maintenance status

       Auto Maintenance Mode Enabled: Yes

             Maintenance Mode Active: No

   Manual Maintenance Window Enabled: No

  Manual Maintenance Window Duration: 8 Hours

Manual Maintenance Window Start Time: -

  Manual Maintenance Window End Time: -

There’s also a new OneFS Tardis configuration tree, also named ‘maintenance’, which includes both a list of the components supported by maintenance mode and their status, and a historical list of all the maintenance mode events and their timestamps on a cluster.

Branch	Attribute	Description
Components		List of components supported by maintenance mode.
	Active	Active – indicates if this component is currently in maintenance mode.
	Enabled	Enabled – indicates if this component can go into maintenance mode.
	Name	Name – is the name of the component this settings block controls.
History		List of all maintenance mode events on the cluster.
	Start	Start – timestamp for when this maintenance event started.
	End	Timestamp for when this maintenance event ended.
	Mode	Either ‘auto’ or ‘manual’, indicating how maintenance event was started.

These attributes and their values can be queried by the ‘isi cluster maintenance components view’ and isi cluster maintenance history view’ CLI commands respectively. For example:

# isi cluster maintenance components view

Name            Enabled   Active

---------------------------------

Event Alerting  Yes       Yes

Also:

# isi cluster maintenance history view

Mode    Start Time                End Time

----------------------------------------------------------

auto    Sun Nov  3 15:12:45 2024  Sun Nov  3 18:41:05 2024

manual  Tue Nov 19 19:43:14 2024  Tue Nov 19 19:43:43 2024

manual  Wed Nov 20 00:05:22 2024  Wed Nov 20 00:05:48 2024

Similarly, the maintenance mode’s prior 90 day history from the WebUI:

If the legacy ‘isi event maintenance’ CLI syntax is invoked, a gentle reminder to use the new ‘isi cluster maintenance’ command set is returned:

# isi event maintenance

'isi cluster maintenance' is now used to manage maintenance mode windows.

AMM can be easily enabled from the OneFS 9.9 CLI as follows:

# isi cluster maintenance settings modify --auto-enable true

# isi cluster maintenance settings view | grep -i auto

     Auto Maintenance Mode Enabled: Yes

Similarly from the WebUI, under Cluster management > Events and alerts > Maintenance mode:

When accessing a cluster with maintenance mode activated, the following warning will be broadcast:

Maintenance mode is active. Check 'isi cluster maintenance status' for more details.

Similarly from the WebUI:

So once AMM has been enabled on a cluster, how do things look when a triggering event occurs? For simultaneous upgrade, the following sequence occurs:

Initially, AMM is reported as being enabled, but maintenance mode is inactive:

# isi cluster maintenance

       Auto Maintenance Mode Enabled: Yes

             Maintenance Mode Active: No

   Manual Maintenance Window Enabled: No

  Manual Maintenance Window Duration: 8 Hours

Manual Maintenance Window Start Time: -

  Manual Maintenance Window End Time: -

Next, a simultaneous upgrade is initiated:

# isi upgrade cluster start --simultaneous /ifs/data/install.isi

You are about to start a simultaneous UPGRADE.  Are you sure?  {yes/[no]}:  yes

Verifying the specified package and parameters.

The upgrade has been successfully initiated.

‘isi upgrade view [--interactive | -i]’ or the web ui can be used to monitor the process.

A maintenance mode window is automatically started and reported as active:

# isi cluster maintenance

       Auto Maintenance Mode Enabled: Yes

             Maintenance Mode Active: Yes

   Manual Maintenance Window Enabled: No

  Manual Maintenance Window Duration: 8 Hours

Manual Maintenance Window Start Time: -

  Manual Maintenance Window End Time: -

During a simultaneous upgrade, the gconfig ‘maintenance_nodes’ parameter will report a cluster-wide event (value =0, which indicates all nodes):

# isi_gconfig -t maintenance | grep -i node

maintenance.maintenance_nodes (char*) = [0]

Once the upgrade has completed and is ready to commit, the maintenance mode window is no longer active:

# isi upgrade view

Upgrade Status:

Current Upgrade Activity: OneFS upgraded

   Cluster Upgrade State: Ready to commit

   Upgrade Process State: Running

      Upgrade Start Time: 2024-11-21T15:12:20.803000

      Current OS Version: 9.9.0.0_build(4)style(5)



# isi cluster maintenance

       Auto Maintenance Mode Enabled: Yes

             Maintenance Mode Active: No

   Manual Maintenance Window Enabled: No

  Manual Maintenance Window Duration: 8 Hours

Manual Maintenance Window Start Time: -

  Manual Maintenance Window End Time: -

Finally, the maintenance history shows the duration of the AMM window:

# isi cluster maintenance history view

Mode    Start Time                End Time

----------------------------------------------------------

auto    Sun Nov  21 15:12:45 2024  Sun Nov  3 18:41:05 2024

Similarly, for a node reboot scenario. Once the reboot command has been run on a node, the cluster automatically activates a maintenance mode window:

# reboot

System going down IMMEDIATELY

…

# isi cluster maintenance

       Auto Maintenance Mode Enabled: Yes

             Maintenance Mode Active: Yes

   Manual Maintenance Window Enabled: No

  Manual Maintenance Window Duration: 8 Hours

Manual Maintenance Window Start Time: -

  Manual Maintenance Window End Time: -

A this point, the gconfig ‘maintenance_nodes’ parameter will report the logical node number (LNN) of the rebooting node:

# isi_gconfig -t maintenance | grep -i node

maintenance.maintenance_nodes (char*) = [3]

In this example, the rebooting node has both LNN 3 and node ID 3 (although matching LNN and IDs are not always the case):

# isi_nodes %{id} , %{lnn}~ | grep "3 ,"

3 , 3~

Finally, the maintenance window is reported inactive when the node is back up and running again:

# isi cluster maintenance

       Auto Maintenance Mode Enabled: Yes

             Maintenance Mode Active: No

   Manual Maintenance Window Enabled: No

  Manual Maintenance Window Duration: 8 Hours

Manual Maintenance Window Start Time: -

  Manual Maintenance Window End Time: -

When upgrading from OneFS versions prior to 9.9, if a maintenance mode window was manually enabled prior to the upgrade, it will continue to be active after the upgrade. The maintenance window is manual, and the maintenance window hours are set to 199 during the upgrade, then restored to the default of 8 on commit. Maintenance mode history is also migrated during the upgrade process. If necessary, an upgrade rollback restores ‘isi event maintenance’.

In the event of any issues during maintenance mode operations, error conditions and details are written to the /var/log/isi_maintenance_mode_d.log file. This file can be set to debug level logging for more verbose information. Additionally, the /var/log/isi_shutdown.log and isi_upgrade_logs files can often provide further insights and context.

Logfile	Description
/var/log/isi_maintenance_mode_d.log	• Will show errors that occur during maintenance mode operations • Can be set to debug level logging for more details
/var/log/isi_shutdown.log	Cluster and node shutdown log
isi_upgrade_logs	Cluster upgrade logs