PowerScale InsightIQ 5.2

It’s been a prolific week for PowerScale! Hot on the heels of the OneFS 9.10 launch comes the unveiling of the new InsightIQ 5.2 release. InsightIQ delivers powerful performance monitoring and reporting functionality, helping maximize PowerScale cluster performance. This includes advanced analytics to optimize applications, correlate cluster events, and accurately forecast future storage needs.

So what new goodness does the InsightIQ 5.2 release deliver? Added functionality includes expanded ecosystem support, enhanced reporting, and streamlined upgrade and migration.

The InsightIQ (IIQ) ecosystem is expanded in 5.2 to now include Red Hat Enterprise Linux (RHEL) versions 9.4 and 8.10. This allows customers who are running current RHEL code to use InsightIQ 5.x to monitor the latest OneFS versions. Additionally, InsightIQ Simple can now be installed on VMware Workstation 17, allowing IIQ 5.2 to be deployed on non-production lab environments for trial or demo purposes – without incurring a VMware charge.

On the reporting front, dashboard and report visibility has been enhanced to allow a greater number of clusters to be viewed via the dashboard’s performance overview screen. This enables users to easily compare a broad array of multi-cluster metrics on a single pane without the need for additional scrolling and navigation.

Additionally, IIQ 5.2 extends maximum and minimum ranges for sample points to all performance reports. This allows cluster administrators to more easily identify potential issues with the full fidelity of metrics displayed, whereas previously downsampling to an average value may have masked an anomaly.
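The value of retaining per-sample minima and maxima can be seen in a small illustrative Python sketch (not InsightIQ code; the latency figures are invented): averaging during downsampling hides a short-lived spike that a min/max envelope preserves.

```python
# Illustration (not InsightIQ code): why per-sample min/max ranges matter.
# Averaging a bucket of raw samples can hide a short-lived spike that the
# min/max envelope preserves.

def downsample(samples, bucket_size):
    """Collapse raw samples into (avg, min, max) per bucket."""
    buckets = []
    for i in range(0, len(samples), bucket_size):
        chunk = samples[i:i + bucket_size]
        buckets.append((sum(chunk) / len(chunk), min(chunk), max(chunk)))
    return buckets

# Hypothetical 1-second latency samples (ms) with a brief spike at t=5.
raw = [2, 3, 2, 3, 2, 95, 3, 2, 3, 2]
buckets = downsample(raw, 5)

# The averages look unremarkable, but the max of the second bucket
# exposes the anomaly that averaging alone would mask.
averages = [round(avg, 1) for avg, _, _ in buckets]
maxima = [mx for _, _, mx in buckets]
```

Here the downsampled averages stay low, while the max column surfaces the 95 ms outlier.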

Support and serviceability-wise, IIQ 5.2 brings additional upgrade and migration functionality. Specifically, cluster admins can perform simple, non-disruptive in-place upgrades from IIQ 5.1 to IIQ 5.2. Additionally, IIQ 4.4.1 instances can now be directly migrated to the new IIQ 5.2 release without the need to export or import any data or to reconfigure any settings.

Function | Attribute | Description

OS Support | Simple ecosystem support | InsightIQ Simple 5.2.0 can be deployed on the following platforms:

·         VMware virtual machine running ESXi version 7.0U3 or 8.0U3

·         VMware Workstation 17 (free version)

InsightIQ Simple 5.2.0 can monitor PowerScale clusters running OneFS versions 9.3 through 9.10, excluding 9.6.

OS Support | Scale ecosystem support | InsightIQ Scale 5.2.0 can be deployed on Red Hat Enterprise Linux versions 8.10 or 9.4 (English language versions). InsightIQ Scale 5.2.0 can monitor PowerScale clusters running OneFS versions 9.3 through 9.10, excluding 9.6.

Upgrade | In-place upgrade from InsightIQ 5.1.x to 5.2.0 | The upgrade script supports in-place upgrades from InsightIQ 5.1.x.

Upgrade | Direct database migration from InsightIQ 4.4.1 to InsightIQ 5.2.0 | Direct data migration from an InsightIQ 4.4.1 database to InsightIQ 5.2.0 is supported.

Reporting | Maximum and minimum ranges on all reports | All live Performance Reports display a light blue zone that indicates the range of values for a metric within the sample length. The light blue zone is shown regardless of whether any filter is applied. With this enhancement, users can observe trends in values on filtered graphs.

Reporting | More graphs on a page | Reports are redesigned to maximize the number of graphs that can appear on each page.

·         Excess white space is eliminated.

·         The report parameters section collapses when the report is run. The user can expand it manually.

·         Graph heights are decreased when possible.

·         Page scrolling occurs while the collapsed parameters section remains fixed at the top.

User interface | What’s New dialog | All InsightIQ users can view a brief introduction to new functionality in the latest release of InsightIQ. Access the dialog from the banner area of the InsightIQ web application. Click About > What’s New.

User interface | Compact cluster performance view on the Dashboard | The Dashboard is redesigned to improve usability.

·         Summary information for six clusters appears in the initial Dashboard view. A sectional scrollbar controls the view for additional clusters.

·         The capacity section has its own scrollbar.

·         The navigation side bar is collapsible into space-saving icons. Use the << icon at the bottom of the side bar to collapse it.

Meanwhile, the new InsightIQ 5.2 code is available on the Dell Support site, allowing both installation of and upgrade to this new release.

PowerScale OneFS 9.10

Dell PowerScale is already scaling up the holiday season with the launch of the innovative OneFS 9.10 release, which shipped today (10th December 2024). This new 9.10 offering is an all-rounder, introducing PowerScale innovations in capacity, performance, security, serviceability, data management, and general ease of use.

OneFS 9.10 delivers the next version of PowerScale’s common software platform for both on-prem and cloud deployments. This can make it a solid fit for traditional file shares and home directories, vertical workloads like M&E, healthcare, life sciences, financial services, and next-gen AI, ML and analytics applications.

PowerScale’s clustered scale-out architecture can be deployed on-site, in co-lo facilities, or as customer managed Amazon AWS and Microsoft Azure deployments, providing core to edge to cloud flexibility, plus the scale and performance needed to run a variety of unstructured workflows on-prem or in the public cloud.

With data security, detection, and monitoring being top of mind in this era of unprecedented cyber threats, OneFS 9.10 brings an array of new features and functionality to keep your unstructured data and workloads more available, manageable, and secure than ever.

Hardware Innovation

On the platform hardware front, OneFS 9.10 unlocks dramatic capacity and performance enhancements – particularly for the all-flash F910 node, which sees the introduction of support for 61TB QLC SSDs, plus 200Gb Ethernet front-end and back-end networking.

Additionally, the H and A-series chassis-based hybrid platforms also see a significant density and per-watt efficiency improvement with the introduction of 24TB HDDs. This includes both ISE and FIPS drives, accommodating both regular and SED clusters.

Networking and performance

For successful large-scale AI model customization and training and other HPC workloads, compute farms need data served to them quickly and efficiently. To achieve this, compute and storage must be sized and deployed accordingly to eliminate potential bottlenecks in the infrastructure.

To meet this demand, OneFS 9.10 introduces support for low latency front-end and back-end HDR InfiniBand network connectivity on the F710 and F910 all-flash platforms, providing up to 200Gb/s of bandwidth with sub-microsecond latency. This can directly benefit generative AI and machine learning environments, plus other workloads involving highly concurrent streaming reads and writes of different files from individual, high throughput capable Linux servers. In conjunction with the OneFS multipath driver and GPUDirect support, the choice of either HDR InfiniBand or 200GbE can satisfy the networking and data requirements of demanding technical workloads such as ADAS model training, seismic analysis, deep learning systems, and trillion-parameter transformer-based generative AI models.

Metadata Indexing

Also debuting in OneFS 9.10 is MetadataIQ, a new global metadata namespace solution. Incorporating the ElasticSearch database and Kibana visualization dashboard, MetadataIQ facilitates data indexing and querying across multiple geo-distributed clusters.

MetadataIQ efficiently transfers file system metadata from a cluster to an external ELK instance, allowing customers to index and discover the data they need for their workflows and analytics needs. This metadata catalog may be used for queries, data visualization, and data lifecycle management. As workflows are added, MetadataIQ simply and efficiently queries data, wherever it may reside, delivering vital time-to-results.

Internally, MetadataIQ leverages the venerable OneFS ChangeListCreate job, which tracks the delta between two snapshots, batch processing and updating the off-cluster metadata index residing in an ElasticSearch database. This index can store metadata from multiple PowerScale clusters, providing a global catalog of an organization’s unstructured data repositories.
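As a rough illustration of the indexing side, the following Python sketch (not MetadataIQ internals; the field names and the “metadataiq” index name are invented for illustration) shows how changelist-style metadata entries could be shaped into an Elasticsearch bulk-index payload, which is newline-delimited JSON:

```python
import json

# Hypothetical sketch (not MetadataIQ code): turning changelist-style
# metadata entries into an Elasticsearch bulk-index payload (NDJSON).
# Field names and the index name "metadataiq" are illustrative assumptions.

def to_bulk_ndjson(cluster, entries, index="metadataiq"):
    """Build the newline-delimited body for the ES /_bulk endpoint."""
    lines = []
    for entry in entries:
        # Prefix doc IDs with the cluster name so one index can hold
        # metadata from multiple PowerScale clusters.
        doc_id = f"{cluster}:{entry['path']}"
        lines.append(json.dumps({"index": {"_index": index, "_id": doc_id}}))
        lines.append(json.dumps({"cluster": cluster, **entry}))
    return "\n".join(lines) + "\n"

entries = [
    {"path": "/ifs/data/seismic/run1.dat", "size": 4096, "mtime": 1732062847},
    {"path": "/ifs/data/seismic/run2.dat", "size": 8192, "mtime": 1732062901},
]
payload = to_bulk_ndjson("cluster-a", entries)
```

Each entry produces an action line plus a document line, so the payload could be POSTed to an Elasticsearch `/_bulk` endpoint in one round trip.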

Security

In OneFS 9.10, OpenSSL is upgraded from version 1.0.2 to version 3.0.14, and all of the OneFS daemons now use the newly validated OpenSSL 3 FIPS module. Probably the most significant aspect of the OpenSSL 3 upgrade is library support for the TLS 1.3 ciphers, designed to meet stringent Federal requirements. OneFS 9.10 adds TLS 1.3 support for the WebUI and KMIP key management servers, and verifies that TLS 1.3 is supported for LDAP, CELOG alerts, audit events, syslog forwarding, SSO, and SyncIQ.
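From the client side, a script can insist on TLS 1.3 when connecting to a TLS 1.3-capable endpoint such as the WebUI. This minimal Python sketch (the hostname and port in the comment are placeholders; no connection is actually made) builds such a context with the standard ssl module:

```python
import ssl

# Sketch: a client context that refuses anything below TLS 1.3.
# The hostname/port in the comment below are placeholder assumptions.

ctx = ssl.create_default_context()
ctx.minimum_version = ssl.TLSVersion.TLSv1_3

# A handshake would then fail against any endpoint limited to TLS 1.2, e.g.:
#
#   import socket
#   with socket.create_connection(("cluster.example.com", 8080)) as sock:
#       with ctx.wrap_socket(sock, server_hostname="cluster.example.com") as tls:
#           print(tls.version())
```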

Support and Monitoring

OneFS 9.10 also includes healthcheck enhancements to aid the customer in understanding cluster state and providing resolution guidance in case of failures. In particular, current healthcheck results are displayed in the WebUI landing page to indicate the real-time health of the system. Also included is detailed failure information, troubleshooting steps, and resolution guidance – including links to pertinent knowledge base articles. Healthchecks are also logically grouped based on category and frequency, and historical checks are also easily accessible.

Dell Technologies Connectivity Services replaces the former SupportAssist in OneFS 9.10, with associated updates to the user-facing Web and command line interfaces. Intended for transmitting events, logs, and telemetry from PowerScale to Dell support, Dell Technologies Connectivity Services provides predictive issue detection and proactive remediation, helping rapidly identify, diagnose, and resolve cluster issues while improving productivity by replacing manual routines with automated support. Delivering a consistent remote support experience across the Dell storage portfolio, it is intended for all sites that can send telemetry off-cluster to Dell over the internet, and is included with all support plans (features vary based on service level agreement).

In summary, OneFS 9.10 brings the following new features and functionality to the Dell PowerScale ecosystem:

OneFS 9.10 Feature | Description

Networking | ·         Front-end and back-end HDR InfiniBand networking option for the F910 and F710 platforms.

Platform | ·         Support for F910 nodes with 61TB QLC SSD drives and a 200Gb/s back-end Ethernet network.

·         Support for 24TB HDDs on A-series and H-series nodes.

Metadata Indexing | ·         Introduction of the MetadataIQ off-cluster metadata indexing and discovery solution.

Security | ·         OpenSSL 3.0 and TLS 1.3 transport layer security support.

Support and Monitoring | ·         Healthcheck WebUI enhancements.

·         Dell Technologies Connectivity Services.

We’ll be taking a deeper look at OneFS 9.10’s new features and functionality in future blog articles over the course of the next few weeks.

Meanwhile, the new OneFS 9.10 code is available on the Dell Support site, as both an upgrade and reimage file, allowing both installation and upgrade of this new release.

For existing clusters running a prior OneFS release, the recommendation is to open a Service Request to schedule an upgrade. To provide a consistent and positive upgrade experience, Dell EMC is offering assisted upgrades to OneFS 9.10 at no cost to customers with a valid support contract. Please refer to Knowledge Base article KB544296 for additional information on how to initiate the upgrade process.

OneFS Automatic Maintenance Mode

Another piece of functionality that OneFS 9.9 brings to the table is automatic maintenance mode (AMM). AMM builds upon and extends the manual CELOG maintenance mode capability, which has been an integral part of OneFS since the 9.2 release.

Cluster maintenance operations such as upgrades, patch installation, rolling reboots, and hardware replacement typically generate a significant increase in cluster events and alerts. This can be overwhelming for the cluster admin, who is trying to focus on the maintenance task at hand and is already well aware of the underlying activity. So, as the name suggests, the general notion of OneFS maintenance mode is to provide a method of temporarily suspending these cluster notifications.

During a maintenance window with maintenance mode enabled, OneFS continues to log events but does not generate alerts for them. All events that occurred during the maintenance window can then be reviewed upon manually disabling maintenance mode. Active event groups will automatically resume generating alerts when the scheduled maintenance period ends.
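This log-everything, alert-selectively behavior can be modeled in a few lines of Python (a toy illustration, not CELOG code; the event names are invented):

```python
# Toy model (not CELOG source) of the maintenance-mode behavior described
# above: events are always logged, but alerts are only emitted when no
# maintenance window is active; suppressed events remain reviewable.

class EventLog:
    def __init__(self):
        self.maintenance = False
        self.events = []      # everything, always recorded
        self.alerts = []      # only what actually gets sent out

    def capture(self, event):
        self.events.append(event)
        if not self.maintenance:
            self.alerts.append(event)

log = EventLog()
log.capture("disk_warning_1")       # normal operation: alert fires
log.maintenance = True
log.capture("node_reboot_notice")   # window active: logged, not alerted
log.maintenance = False
log.capture("disk_warning_2")       # alerts resume after the window
```

All three events remain in the log for post-window review, but only the two outside the window generate alerts.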

Until OneFS 9.9, activating maintenance mode had been a strictly manual process. For example, to enable CELOG maintenance mode from the OneFS WebUI, select Cluster Management > Events and Alerts and click the ‘Enable maintenance mode’ button:

Alas, a manually initiated and terminated process is only as reliable as its operator. As such, purely manual operation runs the risk of missed critical alerts if maintenance mode is not disabled after a maintenance window has concluded.

In contrast, the new OneFS 9.9 AMM functionality automatically places clusters or nodes in maintenance mode based on predefined triggers, such as the following:

AMM Trigger | Description | Action

Simultaneous upgrade | OneFS full cluster upgrade with a simultaneous reboot of all nodes. | Cluster enters maintenance mode at upgrade start and exits maintenance mode when the last node finishes its upgrade.

Upgrade rollback | Reverting a OneFS upgrade to the previous version prior to upgrade commit. | Cluster enters maintenance mode at rollback start, and exits maintenance mode when the last node finishes its downgrade.

Node reboot | Rebooting a PowerScale node. | Node is added to maintenance mode as the reboot starts, and exits maintenance mode when the reboot completes.

Node addition/removal | Joining or removing a node to/from a PowerScale cluster. | Node is added to maintenance mode as the join/removal starts, and exits maintenance mode when the join/removal is completed.

During maintenance mode, CELOG alerts are suppressed, ensuring that the cluster or node can undergo necessary updates or modifications without generating a flurry of notifications. This feature is particularly useful for organizations that need to perform regular maintenance tasks but want to minimize disruptions to their workflows (and keep their cluster admins sane).

When a maintenance window is triggered, such as for a rolling upgrade, the entire cluster enters maintenance mode at the start and exits when the last piece of the upgrade operation has completed. Similarly, when a node is rebooted, it is added to maintenance mode at the start of the reboot and removed when the rebooting finishes.

Automatic maintenance mode windows have a maximum time limit of 199 hours. This prevents an indefinite maintenance mode condition, which could leave the cluster in limbo along with any associated issues. Plus, the cluster admin can manually override AMM and end the maintenance window at any time.

OneFS AMM offers a range of configuration options, including the ability to control automatic activation of maintenance mode, set manual maintenance mode durations, and specify start times. AMM also keeps a detailed history of all maintenance mode events, providing valuable insights for troubleshooting and system optimization.

Under the hood, there’s a new gconfig tree in OneFS 9.9 named ‘maintenance’, which holds the configuration for both automatic and manual maintenance mode:

Attribute Description
active Indicates if maintenance mode is active
auto_enable Controls automatic activation of maintenance mode
manual_window_enabled Indicates if a manual maintenance mode is active
manual_window_hours The number of hours a manual maintenance window will be active
manual_window_start The start time of the current manual maintenance window
maintenance_nodes List of node LNNs in maintenance mode (0 indicates cluster wide)

For example:

# isi_gconfig -t maintenance

[root] {version:1}

maintenance.auto_enable (bool) = true

maintenance.active (bool) = false

maintenance.manual_window_enabled (bool) = false

maintenance.manual_window_hours (int) = 8

maintenance.manual_window_start (int) = 1732062847

maintenance.maintenance_nodes (char*) = []
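For scripting purposes, output in this format can be parsed into a dictionary. The following hypothetical Python helper (not a supported tool) works against the sample text shown above:

```python
import re

# Hypothetical helper: parsing 'isi_gconfig -t maintenance' output into a
# Python dict. SAMPLE is pasted from the example output above.

SAMPLE = """\
maintenance.auto_enable (bool) = true
maintenance.active (bool) = false
maintenance.manual_window_enabled (bool) = false
maintenance.manual_window_hours (int) = 8
maintenance.manual_window_start (int) = 1732062847
maintenance.maintenance_nodes (char*) = []
"""

def parse_gconfig(text):
    """Convert 'maintenance.<key> (<type>) = <value>' lines to a dict."""
    settings = {}
    for m in re.finditer(r"maintenance\.(\w+) \((\w+\*?)\) = (.+)", text):
        key, typ, raw = m.groups()
        if typ == "bool":
            settings[key] = raw == "true"
        elif typ == "int":
            settings[key] = int(raw)
        else:
            settings[key] = raw   # e.g. the char* node list stays a string
    return settings

cfg = parse_gconfig(SAMPLE)
```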

These attributes are also reported by the ‘isi cluster maintenance status’ CLI command. For example:

# isi cluster maintenance status

       Auto Maintenance Mode Enabled: Yes

             Maintenance Mode Active: No

   Manual Maintenance Window Enabled: No

  Manual Maintenance Window Duration: 8 Hours

Manual Maintenance Window Start Time: -

  Manual Maintenance Window End Time: -

There’s also a new OneFS Tardis configuration tree, also named ‘maintenance’, which includes both a list of the components supported by maintenance mode and their status, and a historical list of all the maintenance mode events and their timestamps on a cluster.

Branch | Attribute | Description
Components | | List of components supported by maintenance mode.
| Active | Indicates if this component is currently in maintenance mode.
| Enabled | Indicates if this component can go into maintenance mode.
| Name | The name of the component this settings block controls.
History | | List of all maintenance mode events on the cluster.
| Start | Timestamp for when this maintenance event started.
| End | Timestamp for when this maintenance event ended.
| Mode | Either ‘auto’ or ‘manual’, indicating how the maintenance event was started.

These attributes and their values can be queried by the ‘isi cluster maintenance components view’ and ‘isi cluster maintenance history view’ CLI commands respectively. For example:

# isi cluster maintenance components view

Name            Enabled   Active

---------------------------------

Event Alerting  Yes       Yes

Also:

# isi cluster maintenance history view

Mode    Start Time                End Time

----------------------------------------------------------

auto    Sun Nov  3 15:12:45 2024  Sun Nov  3 18:41:05 2024

manual  Tue Nov 19 19:43:14 2024  Tue Nov 19 19:43:43 2024

manual  Wed Nov 20 00:05:22 2024  Wed Nov 20 00:05:48 2024
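For reporting, window durations can be derived from history rows like these. A short Python sketch using the timestamps from the sample output above:

```python
from datetime import datetime

# Sketch: computing maintenance-window durations (in seconds) from rows
# like those in the 'isi cluster maintenance history view' output above.

ROWS = [
    ("auto",   "Sun Nov  3 15:12:45 2024", "Sun Nov  3 18:41:05 2024"),
    ("manual", "Tue Nov 19 19:43:14 2024", "Tue Nov 19 19:43:43 2024"),
]

FMT = "%a %b %d %H:%M:%S %Y"   # matches the ctime-style timestamps

durations = {}
for mode, start, end in ROWS:
    delta = datetime.strptime(end, FMT) - datetime.strptime(start, FMT)
    durations.setdefault(mode, []).append(int(delta.total_seconds()))
```

Here the automatic window lasted just under three and a half hours, while the manual one was closed after less than a minute.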

Similarly, the prior 90-day maintenance mode history can be viewed from the WebUI:

If the legacy ‘isi event maintenance’ CLI syntax is invoked, a gentle reminder to use the new ‘isi cluster maintenance’ command set is returned:

# isi event maintenance

'isi cluster maintenance' is now used to manage maintenance mode windows.

AMM can be easily enabled from the OneFS 9.9 CLI as follows:

# isi cluster maintenance settings modify --auto-enable true

# isi cluster maintenance settings view | grep -i auto

     Auto Maintenance Mode Enabled: Yes

Similarly from the WebUI, under Cluster management > Events and alerts > Maintenance mode:

When accessing a cluster with maintenance mode activated, the following warning will be broadcast:

Maintenance mode is active. Check 'isi cluster maintenance status' for more details.

Similarly from the WebUI:

So once AMM has been enabled on a cluster, how do things look when a triggering event occurs? For simultaneous upgrade, the following sequence occurs:

Initially, AMM is reported as being enabled, but maintenance mode is inactive:

# isi cluster maintenance

       Auto Maintenance Mode Enabled: Yes

             Maintenance Mode Active: No

   Manual Maintenance Window Enabled: No

  Manual Maintenance Window Duration: 8 Hours

Manual Maintenance Window Start Time: -

  Manual Maintenance Window End Time: -

Next, a simultaneous upgrade is initiated:

# isi upgrade cluster start --simultaneous /ifs/data/install.isi

You are about to start a simultaneous UPGRADE.  Are you sure?  {yes/[no]}:  yes

Verifying the specified package and parameters.

The upgrade has been successfully initiated.

‘isi upgrade view [--interactive | -i]’ or the web ui can be used to monitor the process.

A maintenance mode window is automatically started and reported as active:

# isi cluster maintenance

       Auto Maintenance Mode Enabled: Yes

             Maintenance Mode Active: Yes

   Manual Maintenance Window Enabled: No

  Manual Maintenance Window Duration: 8 Hours

Manual Maintenance Window Start Time: -

  Manual Maintenance Window End Time: -

During a simultaneous upgrade, the gconfig ‘maintenance_nodes’ parameter will report a cluster-wide event (value = 0, indicating all nodes):

# isi_gconfig -t maintenance | grep -i node

maintenance.maintenance_nodes (char*) = [0]
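A small illustrative helper (the four-node LNN list is hypothetical) shows how this node list can be interpreted, with 0 expanding to every node in the cluster:

```python
# Illustrative helper: interpreting the 'maintenance_nodes' list, where
# LNN 0 means the whole cluster is in maintenance mode.

def nodes_in_maintenance(maintenance_nodes, cluster_lnns):
    """Expand the gconfig node list to the set of affected LNNs."""
    if 0 in maintenance_nodes:
        return set(cluster_lnns)              # cluster-wide window
    return set(maintenance_nodes) & set(cluster_lnns)

cluster = [1, 2, 3, 4]   # hypothetical four-node cluster
```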

Once the upgrade has completed and is ready to commit, the maintenance mode window is no longer active:

# isi upgrade view

Upgrade Status:

Current Upgrade Activity: OneFS upgraded

   Cluster Upgrade State: Ready to commit

   Upgrade Process State: Running

      Upgrade Start Time: 2024-11-21T15:12:20.803000

      Current OS Version: 9.9.0.0_build(4)style(5)



# isi cluster maintenance

       Auto Maintenance Mode Enabled: Yes

             Maintenance Mode Active: No

   Manual Maintenance Window Enabled: No

  Manual Maintenance Window Duration: 8 Hours

Manual Maintenance Window Start Time: -

  Manual Maintenance Window End Time: -

Finally, the maintenance history shows the duration of the AMM window:

# isi cluster maintenance history view

Mode    Start Time                End Time

----------------------------------------------------------

auto    Thu Nov 21 15:12:45 2024  Thu Nov 21 18:41:05 2024

The sequence is similar for a node reboot. Once the reboot command has been run on a node, the cluster automatically activates a maintenance mode window:

# reboot

System going down IMMEDIATELY

…

# isi cluster maintenance

       Auto Maintenance Mode Enabled: Yes

             Maintenance Mode Active: Yes

   Manual Maintenance Window Enabled: No

  Manual Maintenance Window Duration: 8 Hours

Manual Maintenance Window Start Time: -

  Manual Maintenance Window End Time: -

At this point, the gconfig ‘maintenance_nodes’ parameter will report the logical node number (LNN) of the rebooting node:

# isi_gconfig -t maintenance | grep -i node

maintenance.maintenance_nodes (char*) = [3]

In this example, the rebooting node has both LNN 3 and node ID 3 (although matching LNN and IDs are not always the case):

# isi_nodes %{id} , %{lnn}~ | grep "3 ,"

3 , 3~

Finally, the maintenance window is reported inactive when the node is back up and running again:

# isi cluster maintenance

       Auto Maintenance Mode Enabled: Yes

             Maintenance Mode Active: No

   Manual Maintenance Window Enabled: No

  Manual Maintenance Window Duration: 8 Hours

Manual Maintenance Window Start Time: -

  Manual Maintenance Window End Time: -

When upgrading from OneFS versions prior to 9.9, if a maintenance mode window was manually enabled prior to the upgrade, it will continue to be active after the upgrade. The maintenance window is manual, and the maintenance window hours are set to 199 during the upgrade, then restored to the default of 8 on commit. Maintenance mode history is also migrated during the upgrade process. If necessary, an upgrade rollback restores ‘isi event maintenance’.

In the event of any issues during maintenance mode operations, error conditions and details are written to the /var/log/isi_maintenance_mode_d.log file. This file can be set to debug level logging for more verbose information. Additionally, the /var/log/isi_shutdown.log and isi_upgrade_logs files can often provide further insights and context.

Logfile | Description
/var/log/isi_maintenance_mode_d.log | Shows errors that occur during maintenance mode operations. Can be set to debug level logging for more details.
/var/log/isi_shutdown.log | Cluster and node shutdown log
isi_upgrade_logs | Cluster upgrade logs

 

OneFS CELOG Superfluous Alert Suppression

With recent enhancements to PowerScale healthchecks and CELOG events, it became apparent that a greater-than-desirable quantity of alerts was being sent to customers and Dell support. Many of these alerts were relatively benign, yet consumed a non-trivial amount of system and human resources. Such an overly chatty monitoring system runs the risk of noise fatigue, and with it the potential to miss a critical alert.

Included in the OneFS 9.9 payload is a new ‘superfluous alert suppression’ feature. The goal of this new functionality is to reduce the transmission of unnecessary alerts to both cluster administrators and Dell support. To this end, two new event categories are introduced in OneFS 9.9:

Category | ID | Description
DELL_SUPPORT | 9900000000 | Only events in DELL_SUPPORT will be sent to Dell support.
SYSTEM_ADMIN | 9800000000 | Events in SYSTEM_ADMIN will be sent to the cluster admin by default.

With these new event categories and CELOG enhancements, any ‘informational’ (non-critical) events will no longer trigger an alert by default. As such, only warning, critical, and emergency events that include ‘DELL_SUPPORT’ will now be sent to Dell. Similarly, just warning, critical, and emergency events in ‘SYSTEM_ADMIN’ will be sent to the cluster admin by default.

Under the hood, OneFS superfluous alert suppression leverages the existing CELOG reporting mechanism, filtering alert generation via stricter alerting rules.

Architecturally, CELOG captures events and stores them in its event database. From here, the reporting service parses the inbound events and fires alerts as needed to the cluster administrator and/or Dell support. CELOG uses Tardis for its configuration management, and changes are received from the user interface and forwarded to the Tardis configuration service and database by the platform API handlers. Additionally, CELOG uses a series of JSON config files to store its event, category, reporting, and alerting rules.

When a new event is captured, the reporting module matches the event against the reporting rules and sends out an alert if the conditions are met. In OneFS 9.9, the CELOG workflow is not materially changed. Rather, filtering is achieved through more stringent reporting rules, resulting in the transmission of fewer but more important alerts.
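The stricter rule matching can be sketched as follows (a toy Python model, not CELOG source; event dictionaries are invented): an alert fires only when the event's severity is warning or above and its category is listed in the rule.

```python
# Toy model (not CELOG code) of the stricter reporting rules: an alert is
# forwarded on a channel only if the event's category appears in the rule's
# category list AND its severity is warning, critical, or emergency.

ALERT_SEVERITIES = {"warning", "critical", "emergency"}

def should_alert(event, rule_categories):
    """Decide whether an event triggers an alert for a given rule."""
    return (event["severity"] in ALERT_SEVERITIES
            and event["category"] in rule_categories)

dell_rule = {"9900000000"}   # DELL_SUPPORT category ID from the table above

info_event = {"category": "9900000000", "severity": "information"}
crit_event = {"category": "9900000000", "severity": "critical"}
admin_event = {"category": "9800000000", "severity": "warning"}
```

With these rules, the informational event is logged but never alerted, and the SYSTEM_ADMIN warning never reaches the Dell call-home channel.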

The newly introduced ‘Dell Support’ (ID: 9900000000) and ‘System Admin’ (ID: 9800000000) categories and their associated IDs are described in the ‘/etc/celog/categories.json’ file as follows:

"9800000000": {

        "id": "9800000000",

        "id_name": "SYSTEM_ADMIN",

        "name": "System Admin events"

    },
    "9900000000": {

        "id": "9900000000",

        "id_name": "DELL_SUPPORT",

        "name": "Dell Support events"

    },

Similarly, the event configurations in the /etc/celog/events.json config file now contain both ‘dell_support_category’ and ‘system_admin_category’ boolean parameters for each event type, which can be set to either ‘true’ or ‘false’:

"100010001": {

    "attachments": [

        "dfvar",

        "largevarfiles",

        "largelogs"

    ],

    "category": "100000000",

    "frequency": "10s",

    "id": "100010001",

    "id_name": "SYS_DISK_VARFULL",

    "name": "The /var partition is near capacity (>{val:.1f}% used)",

    "type": "node",

    "dell_support_category": true, 

    "system_admin_category": true 

},
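These boolean flags are what the per-category event listings are derived from. The following Python sketch (using a trimmed, hypothetical excerpt of events.json rather than the real file) reproduces the effect of filtering event types by category:

```python
import json

# Sketch: approximating what 'isi event types list --category=...' shows by
# filtering the per-event boolean flags from events.json. The excerpt below
# is trimmed and hypothetical, modeled on the SYSTEM_ADMIN/DELL_SUPPORT
# listings in the article.

EVENTS_JSON = json.loads("""{
  "100010001": {"id_name": "SYS_DISK_VARFULL",
                "dell_support_category": true,
                "system_admin_category": true},
  "100010005": {"id_name": "SYS_DISK_SASPHYTOPO",
                "dell_support_category": false,
                "system_admin_category": true}
}""")

def events_in_category(events, flag):
    """Return sorted event names whose given category flag is set."""
    return sorted(e["id_name"] for e in events.values() if e.get(flag))

admin = events_in_category(EVENTS_JSON, "system_admin_category")
dell = events_in_category(EVENTS_JSON, "dell_support_category")
```

In this excerpt, SYS_DISK_SASPHYTOPO would alert the cluster admin but not call home to Dell, mirroring the category listings shown later in the article.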

The reporter file, /etc/celog/celog.reporter.json, also sees updated predefined alerting rules in OneFS 9.9. Specifically, the ‘categories’ field is no longer set to ‘all’; instead, the category ID is specified. A new ‘severities’ field now specifies the criticality level: ‘warning’, ‘critical’, or ‘emergency’. In the example below, only events of ‘warning’ severity and above will be sent to the defined call-home channel, in this case category ID 9900000000, indicating Dell support:

"SupportAssist New": {

            "condition": "NEW",

            "channel_ids": [3],

            "name": "SupportAssist New",

            "eventgroup_ids": [],

            "categories": ["9900000000"],

            "severities" : ["warning", "critical", "emergency"],

            "limit": 0,

            "interval": 0,

            "transient": 0

        },

When creating new custom alert rules in OneFS 9.9, the category and severity alerting fields will automatically default to ‘SYSTEM_ADMIN’ and ‘warning’, ‘critical’, and ‘emergency’. For example from the CLI:

# isi event alerts create <name> \

<condition> \

<channel> \

--category SYSTEM_ADMIN \ 

--severity=warning,critical,emergency

Similarly via the WebUI under Cluster management > Events and alerts > Alert management > Create alert rule:

By applying these changes to the configuration and alerting rules, and without modifying the CELOG infrastructure at all, this new functionality can significantly reduce the quantity, while increasing the quality and relevance, of OneFS alerts that both customers and support receive.

Alert and event definitions and rules can be viewed per category from the CLI as follows, which can be useful for investigative and troubleshooting purposes:

# isi event alerts view "SupportAssist New"

      Name: Dell Technologies connectivity services New

Eventgroup: -

  Category: 9900000000

       Sev: ECW

   Channel: Dell Technologies connectivity services

 Condition: NEW

Note that the ‘severity’ (Sev) field contains the value ‘ECW’, which translates to emergency, critical, and warning.

Also, the event types that are included in each category can be easily viewed from the CLI. For example, the event types associated with the SYSTEM_ADMIN category:

# isi event types list --category=9800000000

ID            Name                 Category     Description

100010001 SYS_DISK_VARFULL        100000000    The /var partition is near capacity (>{val:.1f}% used)

100010002 SYS_DISK_VARCRASHFULL 100000000    The /var/crash partition is near capacity ({val:.1f}% used)  

100010003 SYS_DISK_ROOTFULL      100000000    The /(root) partition is near capacity ({val:.1f}% used)

100010005 SYS_DISK_SASPHYTOPO    100000000    A SAS PHY topology problem or change was detected on {chas}, location {location}

100010006 SYS_DISK_SASPHYERRLOG 100000000    A drive's error log counter indicates there may be a problem on {chas}, location {location}

100010007 SYS_DISK_SASPHYBER     100000000    The SAS link connected to {chas} {exp} PHY {phy} has exceeded the maximum Bit Error Rate (BER)

And similarly for the DELL_SUPPORT category:

# isi event types list --category=9900000000

ID            Name                Category     Description

100010001 SYS_DISK_VARFULL        100000000    The /var partition is near capacity (>{val:.1f}% used)

100010002 SYS_DISK_VARCRASHFULL 100000000    The /var/crash partition is near capacity ({val:.1f}% used)

100010003 SYS_DISK_ROOTFULL      100000000    The /(root) partition is near capacity ({val:.1f}% used)

100010006 SYS_DISK_SASPHYERRLOG 100000000    A drive's error log counter indicates there may be a problem on {chas}, location {location}

100010007 SYS_DISK_SASPHYBER     100000000    The SAS link connected to {chas} {exp} PHY {phy} has exceeded the maximum Bit Error Rate (BER)

100010008 SYS_DISK_SASPHYDISABLED 100000000    The SAS link connected to {chas} {exp} PHY {phy} has been disabled for exceeding the maximum Bit Error Rate (BER)

While the automatic superfluous alert suppression functionality described above is new in OneFS 9.9, manual alert suppression has been available since OneFS 9.4:

Here, filtering logic in the CELOG framework allows individual event types to be easily suppressed (and un-suppressed) as desired, albeit manually.

Additionally, OneFS also provides a ‘maintenance mode’ for temporary cluster-wide alert suppression during either a scheduled or ad-hoc maintenance window. For example:

When enabled, OneFS will continue to log events, but no alerts will be generated until the maintenance period either ends or is disabled. CELOG will automatically resume alert generation for active event groups as soon as the maintenance period concludes.

We’ll explore OneFS maintenance mode further in the next blog article.

OneFS Pre-upgrade Healthchecks – Management and Monitoring

In this second article in this series, we take a closer look at the management and monitoring of OneFS Pre-upgrade Healthchecks.

When it comes to running pre-upgrade checks, there are two execution paths: either as the precursor to an actual upgrade, or as a stand-alone assessment. As such, the general workflow for the upgrade pre-checks in both assessment and NDU modes is as follows:

The ‘optional’ and ‘mandatory’ hooks of the Upgrade framework queue up a pre-check evaluation request to the HealthCheck framework. The results are then stored in an assessment database, which allows a comprehensive view of the pre-checks.

As of OneFS 9.9, the list of pre-upgrade checks includes:

Checklist Item Description
battery_test_status Check nvram.xml and battery status to get the battery health result
check_frontpanel_firmware Checks if the front panel reports None after a node firmware package install.
check_m2_vault_card Checks for the presence of the M.2 vault card in Generation 6 nodes and confirms SMART status threshold has not been exceeded on that device
custom_cronjobs Warn the administrator if there are custom cron jobs defined on the cluster.
check_boot_order Checks BootOrder in bios_settings.ini on Generation 5 nodes to determine if at risk for https://www.dell.com/support/kbdoc/25523
check_drive_firmware Checks firmware version of drives for known issues
check_local_users Recommends backing up sam.db prior to an upgrade to 9.5 or higher where current version is less than 9.5
check_ndmp_upgrade_timeout Checks for LNN changes that have occurred since the isi_ndmp_d processes started which can cause issues during the HookDataMigrationUpgrade phase of a OneFS upgrade
check_node_upgrade_compatibility Checks node upgrade compatibility for OneFS upgrades by comparing it against known supported versions
check_node_firmware_oncluster Checks to verify if the cluster can run into issues due to firmware of certain devices.
check_security_hardening Check if the security hardening (FIPS and STIG mode) is applied on the cluster.
check_services_monitoring Checks that enabled services are being monitored.
check_upgrade_agent_port Checks the port used by the isi_upgrade_agent_d daemon to ensure it is not in use by other processes
check_upgrade_network_impact Checks for the risk of inaccessible network pools during a parallel upgrade
check_cfifo_thread_locking Checks if node may be impacted by DTA000221299, cluster deadlocking from Coalescer First In First Out (CFIFO) thread contention
ftp_root_permissions Checks if FTP is enabled and informs users about potential FTP login issues after upgrading.
flex_protect_fail Warns if the most recent FlexProtect or FlexProtectLin job failed.
files_open Checks for dangerous levels of open files on a node.
ifsvar_acl_perms Checks ACL permissions for ifsvar and ifsvar/patch directory
job_engine_enabled Service isi_job_d enabled
mediascan_enabled Determines if MediaScan is enabled.
mcp_running_status Status of MCP Process.
smartconnect_enabled Determines if SmartConnect enabled and running.
flexnet_running Determines if Flexnet is running.
opensm_masters Determines if backend fabric has proper number of opensm masters.
duplicate_gateway_priorities Checks for subnets with duplicate gateway priorities.
boot_drive_wear Boot drive wear level.
dimm_health_status Warns if there are correctable DIMM Errors on Gen-4 and Gen-6.
node_capacity Check the cluster and node pool capacity.
leak_freed_blocks Check if the sysctl ‘efs.lbm.leak_freed_blocks’ is set to 0 for all nodes.
reserve_blocks Check if the sysctl ‘efs.bam.layout.reserved_blocks’ is set to the default values of 32000 for all nodes.
root_partition_capacity Check root (/) partition capacity usage.
var_partition_capacity Check ‘/var’ partition capacity usage.
smb_v1_in_use Check to see if SMBv1 is enabled on the cluster. If it is enabled, provide an INFO level alert to the user. Also check if any current clients are using SMBv1 if it is enabled and provide that as part of the alert.
synciq_daemon_status Check if all SyncIQ daemons are running.
synciq_job_failure Check if any latest SyncIQ job report shows failed and gather the failure information.
synciq_job_stalling Checks if any running SyncIQ jobs are stalling.
synciq_job_throughput Check if any SyncIQ job is running with no throughput.
synciq_pworker_crash Check for pworker crashes, and related stack information, generated when the latest SyncIQ jobs failed with worker crash errors.
synciq_service_status Check if SyncIQ service isi_migrate is enabled.
synciq_target_connection Check SyncIQ policies for target connection problems.
system_time Check to warn if the system time is set to a time in the far future.
rpcbind_disabled Checks if rpcbind is disabled, which can potentially cause issues on startup
check_ndmp Checks for running NDMP sessions
check_flush Checks for running flush processes / active pre_flush screen sessions
checkKB516613 Checks if any node meets criteria for KB 000057267
upgrade_blocking_jobs Checks for running jobs that could impact an upgrade
patches_infra Warns if INFRA patch on the system is out of date
cloudpools_account_status Cloud Accounts showing unreachable when installing 9.5.0.4(PSP-3524) or 9.5.0.5 (PSP-3793) patch
nfs_verify_riptide_exports Verify the existence of nfs-exports-upgrade-complete file.
upgrade_version Pre-upgrade check to warn about lsass restart.

In OneFS 9.8 and earlier, the upgrade pre-check assessment CLI command set did not provide a method for querying the details.

To address this, OneFS 9.9 now includes the ‘isi upgrade assess view’ CLI syntax, which displays a detailed summary of the error status and resolution steps for any failed pre-checks. For example:

# isi upgrade assess view

PreCheck Summary:
Status: Completed with warnings
Percentage Complete: 100%
Started on: 2024-11-05T00:27:50.535Z
Check Name Type LNN(s) Message
----------------------------------------------------------------------------------------------------------------------------------------------------------------
custom_cronjobs Optional 1,3     Custom cron jobs are defined on the cluster. Automating
tasks on a PowerScale cluster is most safely done
with a client using the PowerScale OneFS API to
access the cluster. This is particularly true if you
are trying to do some type of monitoring task. To
learn more about the PowerScale OneFS API, see the
OneFS API Reference for your version of OneFS.
Locations of modifications found: /usr/local/etc/cron.d/
----------------------------------------------------------------------------------------------------------------------------------------------------------------
Total: 1

In the example above, the assessment view of a failed optional pre-check is flagged as a warning, whereas a failed mandatory pre-check is logged as an error and the upgrade is blocked with a 'not ready for upgrade' status. For example:

# isi upgrade assess view

PreCheck Summary:
             Status: Completed with errors - not ready for upgrade
Percentage Complete: 100%
       Completed on: 2024-11-02T21:44:54.938Z

Check Name       Type      LNN(s)  Message
----------------------------------------------------------------------------------------------------------------------------------------------------------------
ifsvar_acl_perms Mandatory -     An underprivileged user (not in wheel group) has
access to the ifsvar directory. Run 'chmod -b 770
/ifs/.ifsvar' to reset the permissions back to
the default permissions to resolve the security risk.
Then, run 'chmod +a# 0 user ese allow traverse
                                 /ifs/.ifsvar' to add the system-level SupportAssist
User back to the /ifs/.ifsvar ACL.
----------------------------------------------------------------------------------------------------------------------------------------------------------------
Total: 1

Here, the pre-check summary alerts to the presence of insecure ACLs on a critical OneFS directory, while also providing comprehensive remediation instructions. The upgrade could not proceed in this case due to a mandatory pre-check failure.
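Since the summary distinguishes warning and error outcomes, the status line lends itself to simple scripting. The following is a hypothetical helper (not a OneFS utility; the function name and parsing are assumptions) that classifies an assessment by scraping the 'Status:' line of the 'isi upgrade assess view' output shown above:

```shell
#!/bin/sh
# Hypothetical helper, not a OneFS utility: classify the outcome of a
# pre-upgrade assessment by scraping the 'Status:' line of
# 'isi upgrade assess view' output, whose format matches the samples above.

classify_assessment() {
    # Reads assess-view output on stdin; prints ok, warn, or error.
    status=$(grep -m1 'Status:' | sed 's/.*Status:[[:space:]]*//')
    case "$status" in
        *error*)   echo "error" ;;   # mandatory failure: upgrade is blocked
        *warning*) echo "warn"  ;;   # optional failure: upgrade may proceed
        *)         echo "ok"    ;;
    esac
}

# On a cluster this would be driven by the real command:
#   isi upgrade assess view | classify_assessment
```

A wrapper like this could, for instance, page an admin on "error" but merely log on "warn".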

A OneFS upgrade can be initiated with the following CLI syntax:

# isi upgrade cluster start --parallel -f /ifs/install.isi
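A natural extension is to gate the start command on a clean assessment. The sketch below is a hypothetical wrapper, not a supported workflow: the 'ready_to_upgrade' function is an assumption of mine, keying off the 'not ready for upgrade' marker seen in the sample output earlier in this article.

```shell
#!/bin/sh
# Hypothetical wrapper, shown only as a sketch: proceed with the upgrade
# only when the assessment summary does not carry the 'not ready for
# upgrade' marker seen in the sample 'isi upgrade assess view' output.

ready_to_upgrade() {
    # Reads assess-view output on stdin; succeeds when no mandatory
    # pre-check failure is reported.
    ! grep -q 'not ready for upgrade'
}

# On a cluster, the gate would wrap the documented start command:
#   isi upgrade assess
#   if isi upgrade assess view | ready_to_upgrade; then
#       isi upgrade cluster start --parallel -f /ifs/install.isi
#   fi
```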

If a pre-check fails, the upgrade status can be checked with the ‘isi upgrade view’ CLI command. For example:

# isi upgrade view

Upgrade Status:

Current Upgrade Activity: OneFS upgrade
   Cluster Upgrade State: error
                           (see output of  isi upgrade nodes list)
   Upgrade Process State: Stopped
      Upgrade Start Time: 2024-11-03T15:12:20.803000
      Current OS Version: 9.9.0.0_build(1)style(11)
      Upgrade OS Version: 9.9.0.0_build(4299)style(11)
        Percent Complete: 0%

Nodes Progress:

     Total Cluster Nodes: 3
       Nodes On Older OS: 3
          Nodes Upgraded: 0
Nodes Transitioning/Down: 0

A Pre-upgrade check has failed please run "isi upgrade assess view" for results.
If you would like to retry a failed action on the required nodes, use the command
"isi upgrade cluster retry-last-action --nodes". If you would like to roll back
the upgrade, use the command "isi upgrade cluster rollback".

LNN                                                        Version   Status
------------------------------------------------------------------------------
9.0.0  committed

Note that, in addition to retry and rollback options, the above output recommends running the ‘isi upgrade assess view’ CLI command to see the specific details of the failed pre-check(s). For example:

# isi upgrade assess view

PreCheck Summary:
Status: Warnings found during upgrade
Percentage Complete: 50%
Completed on: 2024-11-02T00:11:21.705Z
Check Name Type LNN(s) Message
----------------------------------------------------------------------------------------------------------------------------------------------------------------

custom_cronjobs Optional 1-3 Custom cron jobs are defined on the cluster. Automating
tasks on a PowerScale cluster is most safely done with a
client using the PowerScale OneFS API to access the
cluster. This is particularly true if you are trying to do
some type of monitoring task. To learn more about the
PowerScale OneFS API, see the OneFS API Reference for
your version of OneFS. Locations of modifications found:
/usr/local/etc/cron.d/
----------------------------------------------------------------------------------------------------------------------------------------------------------------
Total: 1

In the above, the pre-check summary alerts of a failed optional check due to the presence of custom (non-default) crontab entries in the cron schedule. In this case, the upgrade can still proceed, if desired.

While OneFS 9.8 and earlier releases do have the ability to skip the optional pre-upgrade checks, this can only be configured prior to the upgrade commencing:

# isi upgrade start --skip-optional ...

However, OneFS 9.9 provides a new 'skip optional' argument for the 'isi upgrade retry-last-action' command, allowing optional checks to also be bypassed while an upgrade is already in progress:

# isi upgrade retry-last-action --skip-optional ...

The ‘isi healthcheck evaluation list’ CLI command can also be useful for reporting pre-upgrade checking completion status. For example:

# isi healthcheck evaluation list

ID                                 State            Failures                        Logs
------------------------------------------------------------------------------------------------------------------------------------
pre_upgrade_optional20240508T1932  Completed - Fail WARNING: custom_cronjobs (1-4) /ifs/.ifsvar/modules/healthcheck/results/evaluations/pre_upgrade_optional20240508T1932
pre_upgrade_mandatory20240508T1935 Completed - Pass -                              /ifs/.ifsvar/modules/healthcheck/results/evaluations/pre_upgrade_mandatory20240508T1935
------------------------------------------------------------------------------------------------------------------------------------
Total: 2

In the above example, the mandatory pre-upgrade checks all pass without issue. However, a warning is logged, alerting of an optional check failure due to the presence of custom (non-default) crontab entries. More details and mitigation steps for this check failure can be obtained by running the 'isi upgrade assess view' CLI command. In this case, the upgrade can still proceed, if desired.
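For routine monitoring, the failing runs can be surfaced with standard text tools. Here is a minimal sketch, assuming the 'Completed - Pass' / 'Completed - Fail' state format shown above (the 'failing_evaluations' function name is my own):

```shell
#!/bin/sh
# Minimal sketch: surface only failing pre-upgrade evaluations from the
# 'isi healthcheck evaluation list' output. Assumes each row carries a
# 'Completed - Pass' or 'Completed - Fail' state, as in the sample above.

failing_evaluations() {
    grep 'pre_upgrade' | grep -e '- Fail'
}

# Typical use on a cluster:
#   isi healthcheck evaluation list | failing_evaluations
```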

OneFS Pre-upgrade Healthchecks

Another piece of useful functionality that debuted in OneFS 9.9 is the enhanced integration of pre-upgrade healthchecks (PUHC) with the PowerScale non-disruptive upgrade (NDU) process.

Specifically, this feature complements the OneFS NDU framework by adding the ability to run pre-upgrade healthchecks as part of the NDU state machine, while providing a comprehensive view and control of the entire pre-check process. This means that OneFS 9.9 and later can now easily and efficiently include upgrade pre-checks by leveraging the existing healthcheck patch process.

These pre-upgrade healthchecks (PUHC) can either be run as an independent assessment (isi upgrade assess) or as an integral part of a OneFS upgrade. In both scenarios, the same pre-upgrade checks are run by the assessment and the actual upgrade process.

Prior to OneFS 9.9, there was no WebUI support for a pre-upgrade healthcheck assessment. This meant that an independent assessment had to be run from the CLI:

# isi upgrade assess

Additionally, there was no ‘view’ option for this ‘isi upgrade assess’ command. So after starting a pre-upgrade assessment, the only way to see which checks were failing was to parse the upgrade logs in order to figure out what was going on. For example, with the ‘isi_upgrade_logs’ CLI utility:

# isi_upgrade_logs -h

Usage: isi_upgrade_logs [-a|--assessment][--lnn][--process {process name}][--level {start level,end level}][--time {start time,end time}][--guid {guid} | --devid {devid}]

 + No parameter this utility will pull error logs for the current upgrade process

 + -a or --assessment - will interrogate the last upgrade assessment run and display the results

 Additional options that can be used in combination with 'isi_upgrade_logs' command:

  --guid     - dump the logs for the node with the supplied guid

  --devid    - dump the logs for the node/s with the supplied devid/s

  --lnn      - dump the logs for the node/s with the supplied lnn/s

  --process  - dump the logs for the node with the supplied process name

  --level    - dump the logs for the supplied level range

  --time     - dump the logs for the supplied time range

  --metadata - dump the logs matching the supplied regex

  --get-fw-report - get firmware report

                    =nfp-devices : Displays report of devices present in NFW package

                    =full        : Displays report of all devices on the node

                    Default value for No option provided is "nfp-devices".

When run with the ‘-a’ flag, ‘isi_upgrade_logs’ queries the archived logs from the latest assessment run:

# isi_upgrade_logs -a

Or by node ID or LNN:

# isi_upgrade_logs --lnn

# isi_upgrade_logs --devid

So, when running healthchecks as part of an upgrade in OneFS 9.8 or earlier, whenever any check failed, typically all that was reported was a generic check ‘hook fail’ alert. For example, a mandatory pre-check failure was reported as follows:

As can be seen, only general pre-upgrade insight was provided, without details such as which specific check(s) were failing.

Similarly from the upgrade logs:

Identifying in the upgrade logs that the PUHC hook scripts ran:

18 2024-11-05T02:19:21 /usr/sbin/isi_upgrade_agent_d Debug Queueing up hook script: /usr/share/upgrade/event-actions/pre-upgrade-mandatory/isi_puhc_mandatory
18 2024-11-05T02:12:21 /usr/sbin/isi_upgrade_agent_d Debug Queueing up hook script: /usr/share/upgrade/event-actions/pre-upgrade-optional/isi_puhc_optional
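As a small illustration, those hook-script entries can be isolated from the log output with a simple filter. This is a sketch only, assuming the 'Queueing up hook script' wording and isi_puhc_* script names in the sample log lines above; the 'puhc_hook_entries' function is my own:

```shell
#!/bin/sh
# Sketch: isolate the PUHC hook-script entries from upgrade agent log
# output. The 'hook script' wording and isi_puhc_* names match the sample
# log lines above; input would normally come from 'isi_upgrade_logs -a'.

puhc_hook_entries() {
    grep 'hook script' | grep 'isi_puhc_'
}
```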

Additionally, when starting an upgrade in OneFS 9.8 or earlier, there was no opportunity to either skip any superfluous optional checks or quiesce any irrelevant or unrelated failing checks.

By way of contrast, OneFS 9.9 now includes the ability to run a pre-upgrade assessment (Precheck) directly from the WebUI via Cluster management > Upgrade > Overview > Start Precheck.

Similarly, a ‘view’ option is also added to the ‘isi upgrade assess’ CLI command syntax in OneFS 9.9. For example:

# isi upgrade assess view

PreCheck Summary:

             Status: Completed with errors - not ready for upgrade
Percentage Complete: 100%
       Completed on: 2024-11-04T21:44:54.938Z

Check Name       Type      LNN(s)  Message
----------------------------------------------------------------------------------------------------------------------------------------------------------------
ifsvar_acl_perms Mandatory -       An underprivileged user (not in wheel group) has access to the ifsvar directory. Run 'chmod -b 770 /ifs/.ifsvar' to reset the permissions back to the default permissions to resolve the security risk. Then, run 'chmod +a# 0 user ese allow traverse /ifs/.ifsvar' to add the system-level SupportAssist User back to the /ifs/.ifsvar ACL.
----------------------------------------------------------------------------------------------------------------------------------------------------------------

Total: 1

Or from the WebUI:

This means that the cluster admin now gets a first-hand view of exactly which check(s) are failing, plus their appropriate mitigation steps. As such, the time to resolution can often be drastically improved by avoiding the need to manually comb the log files in order to troubleshoot cluster pre-upgrade issues.

OneFS delineates between mandatory (blocking) and optional (non-blocking) pre-checks:

Evaluation Type Description
Mandatory PUHC These checks will block an upgrade on failure. As such, the options are to either fix the underlying issue causing the check to fail, or to roll back the upgrade.
Optional PUHC These can be treated as a warning. On failure, either the underlying condition can be resolved, or the check skipped, allowing the upgrade to continue.
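This blocking/non-blocking distinction can also be mechanized. The following is a hedged sketch, not product functionality: it scans failed-check result rows (check name in the first column, type in the second, per the 'isi upgrade assess view' samples above) and reports any upgrade-blocking Mandatory failures.

```shell
#!/bin/sh
# Sketch only: scan assess-view result rows (check name in column 1, type
# in column 2, per the samples above) and print the names of Mandatory
# failures. Exits 0 when a blocking failure is found, 1 otherwise.

blocking_failures() {
    awk '$2 == "Mandatory" { found = 1; print $1 }
         END { exit found ? 0 : 1 }'
}
```

Piping the result rows through such a filter would let an automation harness distinguish "fix before upgrading" from "warn and continue".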

Also provided is the ability to pick and choose which specific optional checks are run prior to an upgrade. This can also alleviate redundant effort and save considerable overhead.

Architecturally, pre-upgrade health checks operate as follows:

The ‘optional’ and ‘mandatory’ hooks of the Upgrade framework queue up a pre-check evaluation request to the HealthCheck framework. The results are then stored in an assessment database, which allows a comprehensive view of the pre-checks.

The array of upgrade pre-checks is pretty extensive and is tailored to a target OneFS version.

# isi healthcheck checklists list | grep -i pre_upgrade

pre_upgrade         Checklist to determine pre upgrade cluster health, many items in this list use the target_version parameter

A list of the individual checks can be viewed from the WebUI under Cluster management > Healthcheck > Healthchecks > pre_upgrade:

In the next article in this series, we’ll take a closer look at the management and monitoring of OneFS Pre-upgrade Healthchecks.