Psychological Safety and Learning in the Face of IT Crises: Insights from the CrowdStrike Outage
Psychological safety is a crucial element for innovation and growth in the digital age. Can it persist in a global IT outage?
Welcome to the Data Score newsletter, composed by DataChorus LLC. The newsletter is your go-to source for insights into the world of data-driven decision-making. Whether you're an insight seeker, a unique data company, a software-as-a-service provider, or an investor, this newsletter is for you. I'm Jason DeRise, a seasoned expert in the field of data-driven insights. As one of the first 10 members of UBS Evidence Lab, I was at the forefront of pioneering new ways to generate actionable insights from alternative data. Before that, I successfully built a sell-side equity research franchise based on proprietary data and non-consensus insights. After moving on from UBS Evidence Lab, I’ve remained active in the intersection of data, technology, and financial insights. Through my extensive experience as a purchaser and creator of data, I have gained a unique perspective, which I am sharing through the newsletter.
On July 19th, 2024, a configuration update to the CrowdStrike Falcon sensor triggered a global outage, which insurer Parametrix estimated would cost Fortune 500 companies over $5 billion in direct losses. https://www.reuters.com/technology/fortune-500-firms-see-54-bln-crowdstrike-losses-says-insurer-parametrix-2024-07-24/
Many industries were affected by the Blue Screen of Death (BSOD, the stop error screen that Windows displays after a fatal system error, also known as a system crash). Among airlines, Delta Air Lines was seemingly hit harder than its peers, while Southwest notably saw less impact than others.
Delta’s CEO, Ed Bastian, claimed $500 million in losses.
Bastian, speaking from Paris, where he traveled last week, told CNBC's "Squawk Box" on Wednesday that the carrier would seek damages from the disruptions, adding, "We have no choice." "If you're going to be having access, priority access to the Delta ecosystem in terms of technology, you've got to test the stuff. You can't come into a mission critical 24/7 operation and tell us we have a bug," Bastian said. https://www.cnbc.com/amp/2024/07/31/delta-ceo-crowdstrike-microsoft-outage-cost-the-airline-500-million.html
The initial response
Early responses to the disaster were important: when accurate statements are absent, people fill the gaps with whatever “truth” they can imagine. CrowdStrike's early admission of fault, along with Microsoft's acknowledgment, alleviated fears of a cyberattack.
Teams across the globe and across many industries worked to restore the specific applications that were down due to the outage, including restoring individuals' access to those systems. They did so even though some of their own staff were potentially affected by the BSOD, and while handling an influx of tickets and automated exception reports.
Is there a blame game inside CrowdStrike?
Of course, some people also wanted to use the moment for comedy by pretending to take the blame. And, of course, people across the internet can’t tell satire from reality. As an aside, that makes me wonder how an AI model could tell satire from reality as it consumes text from across the internet during training.
This made me think about the real team and managers overseeing this situation, and I wondered if they would face blame and serious consequences. One could only imagine the sinking feeling in the team’s stomach as the reality of the deployment became apparent.
It’s commonly acknowledged across the tech industry that accidentally taking down an application with a deployment is “a rite of passage” (similarly, it’s “a rite of passage” for a developer to accidentally run up a massive compute bill on the cloud due to bad code). In companies with a good culture where technologists feel psychologically safe, these are seen as learning opportunities. In the earliest days of Evidence Lab, someone accidentally deleted the database with all the data and analytics. The person responsible wasn’t let go. They continued their career with UBS and took on more responsibility along the way. While it was a major setback for Evidence Lab, it was insignificant in the broader context of UBS or the financial markets.
However, the CrowdStrike/Microsoft outage is exceptional. Typically, an application error doesn't result in billions of dollars in economic loss. Can this still be a learning and growth opportunity for CrowdStrike, or will there be blame resulting in terminations?
Using prediction market data on CEO tenure as a data point
One data point indicating a risk to psychological safety at CrowdStrike is the prediction market's 32% probability, as of August 4th, that the CEO will leave his job.
To put that 32% chance in context, we need to compare it with the probabilities for CEOs of other companies.
Source: https://manifold.markets/Mira_/which-ceos-will-leave-their-jobs
I guess the glass-half-full point of view is that the CrowdStrike CEO is more likely than not to remain in his position. But the probability is elevated compared with other CEOs, which suggests the market expects accountability in the form of job loss due to the situation. It seems to be trending lower from the original coin-flip probability implied by the market.
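For readers who want the arithmetic behind that reading, here is a minimal Python sketch. It assumes the market price can be read directly as a probability, ignoring fees and thin liquidity. The 32% figure and the 50% "coin flip" starting point come from the text above; the function names are hypothetical and exist only for illustration.

```python
def stay_probability(p_leave: float) -> float:
    """Complement of the market-implied probability of a CEO change."""
    if not 0.0 <= p_leave <= 1.0:
        raise ValueError("probability must be between 0 and 1")
    return 1.0 - p_leave


def odds_against(p_leave: float) -> float:
    """Odds against a departure, expressed as 'X to 1'."""
    if not 0.0 < p_leave <= 1.0:
        raise ValueError("probability must be greater than 0 and at most 1")
    return (1.0 - p_leave) / p_leave


# Figures taken from the article; treated here as plain probabilities.
current = 0.32   # market-implied chance of a CEO change as of August 4th
initial = 0.50   # the original "coin flip" level

print(f"Implied chance the CEO stays: {stay_probability(current):.0%}")
print(f"Odds against a departure: {odds_against(current):.2f} to 1")
print(f"Move since the coin flip: {(current - initial) * 100:+.0f} percentage points")
```

Framing the number as odds makes the comparison with other CEOs easier to reason about, since a small absolute probability can still be several times the baseline.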
Can CrowdStrike turn this into a learning moment to shore up its processes and systems, so that it can adapt its software to new cyber threats without taking down the applications that depend on it?
Learning from errors
Some of the learnings within CrowdStrike are being made public, but there are also learnings downstream, as organizations affected by the outage had to respond rapidly.
The article uses multiple real data points from the CrowdStrike outage to provide context and practical advice for creating a learning culture as part of the Data Playbook.