Resolving a Faulty Inter-Site SAN Link
A significant SAN outage resulted in extended downtime for an insurance-sector company. CDS initially performed remote diagnostics in an attempt to determine the root cause.
During this time, CDS identified several potential issues that may have contributed to sync failures between sites. Because the root cause of the synchronization issues could not be accurately determined remotely, CDS dispatched an experienced level 3 engineer onsite to work directly with the client to resolve the issue.
To reduce its risk, the client also engaged the SAN’s manufacturer, which monitored progress to ensure the client was receiving accurate advice. These regular reviews not only confirmed that CDS was making the correct diagnoses, but also gave the client the confidence it needed to implement the CDS team’s suggestions.
Over the course of two days, one of CDS’ highly experienced level 3 engineers worked onsite with the customer’s own team to locate the cause of the sync failure and implement fixes that addressed a number of underlying issues. CDS identified a pattern of previously unnoticed problems that had led up to the complete failure of the system.
During the course of the investigation, CDS discovered and resolved issues including:
- Unbalanced workloads that were causing incoming I/O spikes, resulting in SRDF/A leg drops. CDS then worked with the client’s in-house IT team to balance data requests more evenly throughout the day and reduce future resource spikes.
- A link problem with a fabric layer switch port. CDS helped the client liaise with the manufacturer of the switch to further analyse and resolve the issue.
- Invalid track errors. CDS assisted with modifying the star options file to reduce and eliminate them. The change was left in place until the number of invalid tracks fell from 20 million to zero.
CDS also conducted periodic checks over the course of a weekend to ensure that systems were returning to normal and that the async error rate continued to fall. After 17 hours, all errors had been resolved and full operational capacity was restored.