Security
A very basic error caused the CrowdStrike outage. Windows security may never be the same
A software update with one more variable than expected crashed 8.5 million Windows computers. Should Windows security vendors continue to have access to the kernel?
It has been nearly three weeks since CrowdStrike released a small software update that caused the largest IT outage in the history of the planet, which stranded travelers, hobbled banks, and elevated the concept of phased deployment to a much wider audience. All outages are teachable moments for enterprise tech, and this one is no exception.
CrowdStrike released a comprehensive root-cause analysis Tuesday outlining how an update to its Falcon security software bricked 8.5 million Windows PCs and servers, causing the most widespread "blue screen of death" event ever and requiring customers to manually reboot many of those machines. "While this scenario … is now incapable of recurring, it informs the process improvements and mitigation steps that CrowdStrike is deploying to ensure further enhanced resilience," the company told customers and investors.
Like products from other security vendors, CrowdStrike's security software has access to the lowest levels of Microsoft Windows so that it can quickly detect threats and issues using what it calls "sensors," in hopes of fixing them before they can cause bigger problems. That software must be updated frequently as new threats emerge, and in February of this year CrowdStrike updated Falcon with "a new sensor capability" that was designed to gather data about "possible novel attack techniques that may abuse certain Windows mechanisms."
That sensor contained 20 fields for data input, but on July 19th, CrowdStrike released an update to that sensor that asked for data across 21 fields. "The mismatch resulted in an out-of-bounds memory read, causing a system crash," CrowdStrike said, and the rest is history.
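To make the mismatch concrete, here is a minimal Python sketch of the kind of check that catches it: compare the number of fields an update expects against the number the sensor actually supplies, and refuse to read past the end of the input array. The counts of 20 and 21 come from CrowdStrike's description above; the function names and data layout are hypothetical and do not reflect Falcon's actual code, which runs as a Windows kernel driver rather than in Python.

```python
from __future__ import annotations

SENSOR_INPUT_COUNT = 20  # fields the sensor supplies (figure from the article)


def missing_fields(template_field_count: int,
                   supplied: int = SENSOR_INPUT_COUNT) -> list[int]:
    """Return the zero-based field indexes an update expects but the sensor lacks."""
    return list(range(supplied, template_field_count))


def safe_read(inputs: list[str], index: int) -> str | None:
    """Bounds-checked read: return None rather than reading past the array."""
    if 0 <= index < len(inputs):
        return inputs[index]
    return None


# Hypothetical input values, one per field the sensor supplies.
inputs = [f"value-{i}" for i in range(SENSOR_INPUT_COUNT)]

print(missing_fields(21))     # the 21st field (index 20) has no backing data
print(safe_read(inputs, 20))  # the checked read reports "no value" instead of crashing
```

In kernel-mode code an unchecked read at that index does not raise a tidy error; it touches memory the driver does not own, which is why the result was a system crash rather than a failed update.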
Given that software is a human endeavor (for now), all companies will make occasional mistakes as they write updates. However, most of those mistakes are caught either before deployment through an internal review process or after they are deployed to a small number of customers, which contains the damage if something is broken.
In its report, CrowdStrike identified six different opportunities where it could have detected the faulty update before chaos ensued, and described how it has already changed its software development practices in response. Five of those missed opportunities involved incomplete or unimaginative testing procedures, and future updates to Falcon sensors will go through a much more rigorous process.
But the mistake that had many people across enterprise tech scratching their heads was the fact that CrowdStrike immediately deployed that update to a huge number of its customers. That will no longer be the case for Falcon updates: "Each Template Instance should be deployed in a staged rollout," CrowdStrike said, and customers will have more control over how and when they deploy sensor updates across their own infrastructure.
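For readers unfamiliar with the idea, a staged rollout can be sketched in a few lines of Python: push the update to a small cohort, check that those machines are still healthy, and only then widen the deployment. The stage sizes, host names, and the `deploy` and `healthy` callbacks below are assumptions made for illustration; CrowdStrike has not published its rollout logic in this form.

```python
from __future__ import annotations

from typing import Callable, Sequence


def staged_rollout(
    hosts: Sequence[str],
    stage_percents: Sequence[int],
    deploy: Callable[[str], None],
    healthy: Callable[[str], bool],
) -> tuple[list[str], bool]:
    """Deploy to growing cohorts; halt at the first cohort with an unhealthy host.

    stage_percents are cumulative, e.g. (1, 10, 100) means 1%, then up to 10%,
    then everyone. Returns the hosts that were updated and whether the rollout
    ran to completion.
    """
    updated: list[str] = []
    start = 0
    for percent in stage_percents:
        end = min(len(hosts), max(start + 1, len(hosts) * percent // 100))
        cohort = hosts[start:end]
        for host in cohort:
            deploy(host)
        updated.extend(cohort)
        if not all(healthy(host) for host in cohort):
            return updated, False  # stop before the update spreads further
        start = end
    return updated, True


# Hypothetical fleet where every machine that takes the update crashes.
fleet = [f"host-{i}" for i in range(1000)]
touched, completed = staged_rollout(
    fleet, (1, 10, 100), deploy=lambda h: None, healthy=lambda h: False
)
print(len(touched), completed)
```

With a bad update, the damage in this sketch stops at the first cohort of 10 machines instead of reaching all 1,000, which is the whole point of the practice.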
CrowdStrike also announced that it has retained "two independent third-party software security vendors" to go over Falcon with a fine-toothed comb in hopes of detecting any other looming problems that weren't exposed by the July update.
However, the most significant fallout from this incident could be a change in the way Microsoft works with third-party Windows security companies. If vendors like CrowdStrike ran their software in "user space," rather than in the Windows kernel, bad updates would crash the application software but not the entire machine.
Two weeks ago Microsoft's John Cable floated a trial balloon, suggesting that the company "must prioritize change and innovation in the area of end-to-end resilience" by looking at security design principles that "provide an isolated compute environment that does not require kernel mode drivers to be tamper resistant."
On Tuesday, CrowdStrike agreed with a caveat: "Significant work remains for the Windows ecosystem to support a robust security product that doesn’t rely on a kernel driver for at least some of its functionality," it wrote in its report.
This would be a big shift for Windows and the market for third-party security software.
It's now obvious how much damage can be done with kernel-level access, but if Microsoft locks down the Windows kernel, Windows users will be much more dependent on Microsoft for security. And Microsoft's security track record over the last several years is … not great.
(This post originally appeared in the Runtime newsletter on Aug. 6th.)