CrowdStrike explains how it BSOD'd the world

Today: CrowdStrike releases its root-cause analysis on last month's software update debacle, how a delay for Nvidia's next-generation chip will impact enterprise AI, and the latest funding rounds in enterprise tech.

LaGuardia Airport's baggage claim was a sad place to be on July 19th. (Credit: Wikimedia Commons user Smishra1/cc 4.0)



Falcon jest

It has been nearly three weeks since CrowdStrike released a small software update that caused the largest IT outage in the history of the planet, which stranded travelers, hobbled banks, and elevated the concept of phased deployment to a much wider audience. All outages are teachable moments for enterprise tech, and this one is no exception.

CrowdStrike released a comprehensive root-cause analysis Tuesday outlining how an update to its Falcon security software bricked 8.5 million Windows PCs and servers, causing the most widespread "blue screen of death" event ever and requiring customers to manually reboot many of those machines. "While this scenario … is now incapable of recurring, it informs the process improvements and mitigation steps that CrowdStrike is deploying to ensure further enhanced resilience," the company told customers and investors.

  • Like products from other security vendors, CrowdStrike's software has access to the lowest levels of Microsoft Windows so that what it calls "sensors" can quickly detect threats and issues, in hopes of fixing them before they can cause bigger problems.
  • That software must be updated frequently as new threats emerge, and in February of this year CrowdStrike updated Falcon with "a new sensor capability" that was designed to gather data about "possible novel attack techniques that may abuse certain Windows mechanisms."
  • That sensor contained 20 fields for data input, but on July 19th, CrowdStrike released an update to that sensor that asked for data across 21 fields.
  • "The mismatch resulted in an out-of-bounds memory read, causing a system crash," CrowdStrike said, and the rest was history.
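For readers who want to see the shape of that bug, here is a minimal Python sketch of a 20-versus-21 field mismatch. Every name in it is hypothetical and none of it comes from CrowdStrike's code; the real fault was in a Windows kernel driver, where an unchecked read runs past the end of a buffer instead of raising a tidy exception. Python's IndexError stands in for the crash, and the count check shows the kind of validation that could reject such an update up front.

```python
# Hypothetical illustration only; none of these names come from CrowdStrike's code.
SENSOR_INPUT_COUNT = 20    # values the sensor supplies (per the report)
TEMPLATE_FIELD_COUNT = 21  # fields the July 19 update tried to read


def read_field_unchecked(inputs, field_index):
    """Read a field with no validation, as a stand-in for the faulty path."""
    return inputs[field_index]  # raises IndexError when field_index >= len(inputs)


def validate_template(input_count, field_count):
    """Accept a template only if it references no more fields than are supplied."""
    return field_count <= input_count


inputs = [f"value-{i}" for i in range(SENSOR_INPUT_COUNT)]

# Comparing the counts first lets a bad update be rejected instead of crashing.
update_is_safe = validate_template(len(inputs), TEMPLATE_FIELD_COUNT)

try:
    read_field_unchecked(inputs, TEMPLATE_FIELD_COUNT - 1)  # the 21st field
    crashed = False
except IndexError:
    crashed = True
```

In this sketch the validation flags the update as unsafe before anything is read, while the unchecked path fails on the 21st field.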

Given that software is a human endeavor (for now), all companies will make occasional mistakes as they write updates. However, most of those mistakes are caught either before deployment through an internal review process or after they are deployed to a small number of customers, which contains the damage if something is broken.

  • In its report, CrowdStrike identified six different opportunities where it could have detected the faulty updates before chaos ensued, and described how it has already changed its software development practices in response.
  • Five of those mistakes involved incomplete or unimaginative testing procedures, and future updates to Falcon sensors will go through a much more rigorous process.
  • But the mistake that had many people across enterprise tech scratching their heads was the fact that CrowdStrike immediately deployed that update to a huge number of its customers. 
  • That will no longer be the case for Falcon updates: "Each Template Instance should be deployed in a staged rollout," CrowdStrike said, and customers will have more control over how and when they deploy sensor updates across their own infrastructure.
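A staged rollout of the kind described above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not CrowdStrike's deployment pipeline: the function name, the stage percentages, and the 1% failure threshold are all made up for the example. The idea is simply that each stage reaches a larger slice of machines, and the rollout halts if too many machines in a stage fail.

```python
# Hypothetical sketch of a staged (phased) rollout; not CrowdStrike's pipeline.
def staged_rollout(hosts, stage_percents, deploy, max_failure_rate=0.01):
    """Deploy to growing slices of hosts, halting if a stage fails too often.

    deploy(host) returns True on success.
    Returns (number of hosts that received the update, whether it was halted).
    """
    updated = 0
    for percent in stage_percents:
        target = len(hosts) * percent // 100  # cumulative host count for this stage
        batch = hosts[updated:target]
        if not batch:
            continue
        failures = sum(1 for host in batch if not deploy(host))
        updated = target
        if failures / len(batch) > max_failure_rate:
            return updated, True  # stop before reaching the remaining hosts
    return updated, False


hosts = [f"host-{i}" for i in range(1000)]


def bad_update(host):
    return False  # assumed worst case: every machine crashes


def good_update(host):
    return True


stages = [1, 10, 50, 100]  # assumed percentages of the fleet per stage
bad_result = staged_rollout(hosts, stages, bad_update)
good_result = staged_rollout(hosts, stages, good_update)
```

With these made-up numbers, a bad update stops after the first 1% of the fleet (10 of 1,000 hosts), while a healthy one proceeds through every stage.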

CrowdStrike also announced that it has retained "two independent third-party software security vendors" to go over Falcon with a fine-toothed comb in hopes of detecting any other looming problems that weren't exposed by the July update. But the most significant fallout from this incident could be changes to the way Microsoft works with third-party Windows security companies.

  • If vendors like CrowdStrike ran their software in "user space," rather than the Windows kernel, bad updates would crash the application software but not the entire machine.
  • Two weeks ago Microsoft's John Cable floated a trial balloon, suggesting that the company "must prioritize change and innovation in the area of end-to-end resilience" by looking at security design principles that "provide an isolated compute environment that does not require kernel mode drivers to be tamper resistant."
  • Tuesday, CrowdStrike agreed, with a caveat: "Significant work remains for the Windows ecosystem to support a robust security product that doesn’t rely on a kernel driver for at least some of its functionality," it wrote in its report.

This would be a big shift for Windows and the market for third-party security software.

  • It's now obvious how much damage can be done with kernel-level access, but if Microsoft locks down the Windows kernel, Windows users will be much more dependent on Microsoft for security.
  • And Microsoft's security track record over the last several years is … not great.

Package still out for delivery

The race to procure GPUs, especially Nvidia's top-of-the-line chips, has taken on comical proportions over the last two years. But nobody was laughing at the news that Nvidia's next-generation Blackwell chips, originally expected to arrive in the fourth quarter, could face significant delays.

The Information reported Friday evening that Nvidia customers have been told to expect Blackwell delays "up to three months or more due to design flaws," which could push their mainstream availability into 2025. According to SemiAnalysis, the problem arose as Nvidia and TSMC tried to make new chip packaging technology work with Blackwell, which "is the first high volume design to be packaged with TSMC’s CoWoS-L technology."

Financial analysts don't expect the delay to have a huge impact on Nvidia's financial situation, given that companies will still be lined up to buy those chips once they do arrive. But it could slow down AI developers that had anticipated putting that increased horsepower into training new models this year.


Enterprise funding

Groq raised $640 million in new funding, which values the AI chip startup at $2.8 billion.

Abnormal Security landed $250 million in Series D funding as the email and SaaS security company targets a wider group of customers.

Contextual AI raised $80 million in Series A funding for its tool designed to help companies implement the popular RAG (retrieval-augmented generation) technique for reducing LLM hallucinations.

Protect AI scored $60 million in Series B funding to help secure AI models and applications.


The Runtime roundup

Dell told sales employees that "we are getting leaner" as part of an AI-inspired business-chasing reorganization of those groups, according to Bloomberg.

Zoom is now competing directly with Microsoft and Google for workplace collaboration customers after making Zoom Docs generally available Monday.

Problems with Microsoft Azure's content-delivery network caused a brief outage Monday, Microsoft's third non-CrowdStrike incident in as many weeks.

Atlassian acknowledged that it will likely take its large enterprise customers much longer to move to its cloud services than expected, if they make the leap at all.


Thanks for reading — see you Thursday!
