07.23.2024

Analysis: How CrowdStrike Happened and Why Knowing Your Software Matters

This blog details the catastrophic incident caused by a bug in CrowdStrike's ELAM driver, attributed to minimal testing. It illustrates the potential consequences of relying on insufficiently tested drivers and the necessity of robust testing, validation mechanisms, and application control to prevent similar incidents.

Why Knowing Your Software and Change Control Matter: How to Prevent the Next Global IT Meltdown

On July 19th, 2024, we woke up to the shocking news that airlines, hospitals, banks, and many other businesses were deeply impacted by a bug in CrowdStrike’s Falcon Sensor, which collects telemetry for its EDR product. 

CrowdStrike Bug: A Vulnerability or Malware?  

The CrowdStrike Falcon kernel driver took an unhandled exception that crashed the “System” process.  The crash was not triggered by any threat actor-provided code. Therefore, it can be safely concluded that the outage was not caused by malware, even though the effects may appear similar.

What Are Kernel Drivers?

To collect runtime cyber security telemetry, EDRs like CrowdStrike’s Falcon make use of a class of drivers known as "ELAM drivers." These are kernel drivers, which means they are highly invasive. If an ELAM driver crashes, the computer will not even boot up, triggering an irrecoverable “Blue Screen of Death,” or BSOD. ELAM drivers are designed to be very hard to unload because malware may attempt to remove the driver and render the EDR “blind.” This means a crash in the ELAM driver requires physically walking up to the impacted machine and removing the faulty driver. If the computer happens to be protected by disk encryption technology (such as Microsoft’s BitLocker), there is an added complication: if the BitLocker keys are not available, recovery may involve reformatting the machine and starting afresh.

CrowdStrike’s “Content File” Triggered an ELAM BSOD

The BSOD associated with this CrowdStrike incident is shown below. The BSOD error code is “Critical Process Died.”

[Screenshot: CrowdStrike BSOD showing the “Critical Process Died” stop code]

The “critical” process in question turned out to be the “System” process, which implements the kernel of the Windows OS.

[Screenshot: crash details identifying the “System” process]

It's important to note that the “System” process is the most critical Windows OS process because it is the progenitor of all runtime processes on the Windows Operating System. If this process dies, the computer will not boot into the OS, preventing any userland process from spawning, which means the computer is effectively a brick. 

According to an alleged Google whistleblower’s post on X (formerly Twitter), the crash dump from his machine shows that the “System” process crashed due to a bug in the CrowdStrike ELAM driver csagent.sys. A relevant part of the crash dump is presented in the screenshot below.

[Screenshot: excerpt from the crash dump implicating csagent.sys]

The data in this screenshot can be interpreted as follows. A function at offset 0xe35a1 in the CrowdStrike ELAM driver, csagent.sys, was invoked with two parameters. The second parameter was used as-is to read memory at address 0x0000009c, which triggered an unhandled exception (code 0xC0000005, a read access violation) in the kernel, i.e., the “System” process.
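To make the failure mode concrete, here is a minimal userland C sketch, assuming a hypothetical record type; the names are illustrative and are not CrowdStrike’s actual structures. Dereferencing a field of a NULL record reads a near-zero address such as 0x0000009c. In userland this kills one process; in a kernel driver it kills the “System” process itself, and a simple validity check is all it takes to turn the fault into a rejected input.

```c
#include <stdint.h>
#include <stdio.h>

/* Hypothetical record standing in for a parsed "content file" entry;
 * the names are illustrative, not CrowdStrike's actual structures. */
typedef struct {
    uint32_t flags;  /* sits at a small offset inside the record */
} content_entry_t;

/* Unsafe: trusts a pointer derived from input data. If 'entry' is NULL,
 * reading entry->flags dereferences a near-zero address and faults
 * with a read access violation (0xC0000005 on Windows). */
uint32_t read_flags_unsafe(const content_entry_t *entry) {
    return entry->flags;  /* crashes when entry == NULL */
}

/* Defensive: validate before dereferencing. In userland a miss here is
 * one dead process; in a kernel driver it is a BSOD. */
int read_flags_safe(const content_entry_t *entry, uint32_t *out) {
    if (entry == NULL || out == NULL)
        return -1;  /* reject bad input instead of faulting */
    *out = entry->flags;
    return 0;
}

int main(void) {
    uint32_t flags;
    if (read_flags_safe(NULL, &flags) != 0)
        fprintf(stderr, "rejected invalid content entry\n");
    return 0;
}
```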

Why Do EDRs Use ELAM Drivers?  

EDRs use ELAM drivers to collect the detailed telemetry described below to decide whether the computer is under attack. Ordinarily, collecting this much telemetry from userland code can have a perceptible impact on the computer's performance and latency. The EDR ELAM driver, being a kernel driver, eases the performance and latency pain, but at a cost: an unhandled exception in the driver triggers catastrophic consequences, as seen in the case of the CrowdStrike ELAM driver.

As stated previously, EDRs apply very complex behavioral and heuristic rules to the telemetry collected by the ELAM driver to discern whether a cyber attack is in progress. Some of the telemetry the EDR collects is as follows (a hypothetical sketch of how such an event might be structured appears after this list):

  1. System Activity Logs: Detailed logs of system activities, including file access, process creation, and network connections. 
  2. User Behavior: Information on user actions, such as login attempts, file modifications, and application usage. 
  3. Network Traffic: Data on inbound and outbound network traffic, including IP addresses, ports, and protocols used. 
  4. File Integrity: Changes to critical system files and configurations, helping to detect unauthorized modifications. 
  5. Memory Usage: Information on memory usage patterns and anomalies that might indicate malicious activity. 
  6. Endpoint Health: Metrics related to the health and performance of the endpoint, such as CPU usage, disk activity, and running processes. 
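As flagged above, the sketch below models a single telemetry event of the kind just listed. It is purely illustrative: the event types and field names are hypothetical and do not correspond to any vendor's actual format.

```c
#include <stdint.h>
#include <time.h>

/* Hypothetical event taxonomy mirroring the telemetry categories above. */
typedef enum {
    EVT_FILE_ACCESS,      /* system activity: file open/read/write */
    EVT_PROCESS_CREATE,   /* system activity: new process spawned  */
    EVT_NET_CONNECTION,   /* network traffic: inbound/outbound flow */
    EVT_LOGIN_ATTEMPT,    /* user behavior */
    EVT_FILE_INTEGRITY,   /* change to a critical file or config */
    EVT_RESOURCE_SAMPLE   /* endpoint health: CPU, disk, memory */
} event_type_t;

/* One collected event; a real agent would queue these for analysis. */
typedef struct {
    event_type_t type;
    time_t       timestamp;
    uint32_t     pid;          /* process that triggered the event */
    char         path[260];    /* file or image path, when relevant */
    uint32_t     remote_ip;    /* network events: peer address */
    uint16_t     remote_port;  /* network events: peer port */
} telemetry_event_t;

int main(void) {
    telemetry_event_t ev = {
        .type      = EVT_PROCESS_CREATE,
        .timestamp = time(NULL),
        .pid       = 4321,
        .path      = "C:\\Windows\\System32\\notepad.exe",
    };
    (void)ev;  /* in a real EDR, this record would be shipped off-host */
    return 0;
}
```

Note how little in such a record distinguishes a legitimate process launch from a malicious one; that ambiguity is the subject of the next section.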

AI/ML or SOC Analysis to Evaluate Threats 

When it comes to detecting a cyber attack using such telemetry, all the above evidence can best be described as circumstantial. This is because there is very little to distinguish a legitimate app’s behavior, when described in the above-mentioned terms, from the behavior of malicious code.

To avoid blocking a user’s applications on the back of such weak, circumstantial evidence, EDRs require post-processing of the telemetry data by SOC analysts (an effort that is now being ceded to automated AI/ML analysis, but that's another blog post!) to make the final determination of whether the behavior is legitimate or malicious. This determination can easily go south if the SOC analyst or the AI/ML isn’t fully trained on the application’s legitimate behavior. The matter is compounded by the fact that a user may have deployed any of millions of applications, each of which may be upgraded periodically. This is what leads to false positives, SOC analyst burnout, and AI/ML making faulty decisions when faced with novel threats.

The Perfect Storm Leading to The Outage 

There are many circumstances that had to align perfectly for this CrowdStrike ELAM bug to reach catastrophic proportions:

  • CrowdStrike Released Driver with Minimal Testing 

Standard SAST tools detect null-pointer dereferences in code very precisely. For this reason, as the New York Times reported, Matt Mitchell of Crypto Harlem observed that had CrowdStrike tested the driver internally, it would have immediately realized that this faulty driver would inflict great harm on CrowdStrike’s users.

  • Microsoft’s Role in the CrowdStrike Bug

Microsoft’s Windows OS does not load kernel drivers unless they are signed by both the security vendor and Microsoft. Like Virsec’s driver, the CrowdStrike driver was very likely signed by Microsoft. As you can see from the screenshot below, Microsoft performs eight operations, including scanning and validation, on any driver submitted for signing. Therefore, Microsoft may have had the opportunity to discover that the CrowdStrike driver was buggy. What happened to the worldwide community would likely have occurred in Microsoft’s test labs as well.

[Screenshot: Microsoft Partner Center driver-submission workflow]

  • CrowdStrike Pushed the Upgrade in Bulk

Statistics show that roughly 40% of software patches are faulty and need a second patch to fix the first. A recent example is the Log4j patch, which required four patches to the vulnerable software before the underlying vulnerability was fully fixed. Had CrowdStrike built a defensive mechanism into its rollout that upgraded one computer first and verified the upgrade succeeded before pushing it to 8.5 million computers, we might not be looking at the catastrophic failure caused by the CrowdStrike bug.
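As a hedged sketch of what such a defensive mechanism could look like, the C skeleton below deploys to a small canary group and verifies health before touching the rest of the fleet. deploy_to() and host_healthy() are hypothetical stand-ins for real deployment and health-check plumbing, not CrowdStrike’s tooling.

```c
#include <stdbool.h>
#include <stdio.h>

/* Hypothetical plumbing: push the update to one host / check that it
 * came back healthy (e.g., rebooted cleanly with the new sensor loaded). */
bool deploy_to(int host_id)    { printf("deploying to host %d\n", host_id); return true; }
bool host_healthy(int host_id) { (void)host_id; return true; }

/* Stage 1: canaries only; a bad patch stops here, not at 8.5M hosts.
 * Stage 2: the rest of the fleet, ideally in widening waves. */
bool staged_rollout(int num_hosts, int canary_count) {
    for (int h = 0; h < canary_count; h++) {
        if (!deploy_to(h) || !host_healthy(h)) {
            fprintf(stderr, "canary %d failed; halting rollout\n", h);
            return false;
        }
    }
    for (int h = canary_count; h < num_hosts; h++)
        deploy_to(h);
    return true;
}

int main(void) {
    staged_rollout(100, 5);  /* 5 canaries guard the other 95 hosts */
    return 0;
}
```

Real-world rollouts typically widen in waves (1 host, then 10, then 100, and so on); even this crude single-stage canary would have capped the blast radius.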

  • End Users Bypassed Change Management Processes 

Normally, an enterprise tests a patch before applying it in bulk. CrowdStrike’s end users blindly trusted the patch that came from CrowdStrike and applied it in bulk.

What Kind of Drivers Does Virsec Use to Collect Its Telemetry?  

In contrast, Virsec uses a class of drivers that Microsoft calls "primitive drivers" to collect its telemetry. This type of driver operates in userland and does not impact the kernel. Should the need arise to remove the primitive driver, it can be done remotely and in bulk. Unlike with the CrowdStrike ELAM driver, no truck roll is required!

Virsec’s telemetry driver also leverages a watchdog driver to ensure that it cannot be removed by malware.

Architectural Advantage of Virsec’s Visibility and Cyber Protection

Unlike EDR telemetry, which is circumstantial at best (as described above), Virsec’s cyber attack detection algorithm leverages unambiguous, “direct” evidence that cannot be subverted by even the most sophisticated threat actor.

This means that when Virsec’s telemetry alerts on a cyber attack, an adversarial cyber incident is definitely in progress. How can we be so sure? For a cyber attack to succeed, the threat actor must be able to launch malicious code of their choosing. This is because continuing to let pristine application code execute does not further the threat actor’s agenda.  

Therefore, Virsec’s telemetry focuses on determining whether the code that is about to be executed came from an authorized and trusted application. This very simple yet highly effective architectural advantage (application control) yields four highly consequential benefits for Virsec (a minimal sketch of the underlying check appears after the list):

  1. Zero dwell time. Concluding that the code about to execute is alien code immediately prevents the threat actor from gaining more dwell time. In the case of sophisticated malware, even small amounts of dwell time can result in extensive damage, as we saw in the recent MGM attack. 

  2. No false positives. Since pristine, authorized application code occupies well-defined locations in memory, determining whether the code about to execute came from the application or from the threat actor is very simple. As a result, Virsec’s security controls do not deliver false positives and do not need post-processing assistance from SOC Analysts or AI/ML. 

  3. No catastrophic failures. Virsec’s telemetry is collected from userland processes rather than from the kernel’s critical “System” process. As a result, there is no danger of catastrophic failures like the one seen with the CrowdStrike driver bug.

  4. Zero impact to system performance. Virsec relies on very limited and lightweight telemetry to establish whether the code about to execute came from the App or a threat actor. This means that Virsec’s telemetry does not impact performance or latency.  
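As referenced above, the following C sketch illustrates the general idea of the check: authorized application code occupies well-defined regions of memory, so an instruction address outside every trusted region is alien code. This is a simplified illustration, not Virsec’s actual implementation; the region table and lookup are hypothetical.

```c
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

/* [start, end) of a trusted module's code in memory. In practice this
 * table would be built from the authorized application's loaded images. */
typedef struct {
    uintptr_t start;
    uintptr_t end;
} code_region_t;

/* Direct evidence: either the instruction pointer falls inside a
 * trusted region, or the code about to run did not come from the app. */
bool address_is_trusted(uintptr_t ip,
                        const code_region_t *regions, int count) {
    for (int i = 0; i < count; i++)
        if (ip >= regions[i].start && ip < regions[i].end)
            return true;   /* code came from an authorized application */
    return false;          /* alien code: block before it ever runs */
}

int main(void) {
    code_region_t trusted[] = { { 0x400000, 0x450000 } };  /* example range */
    uintptr_t ip = 0x7ffd1234;  /* e.g., shellcode on the stack or heap */
    printf("execute allowed: %s\n",
           address_is_trusted(ip, trusted, 1) ? "yes" : "no");
    return 0;
}
```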

Key Takeaways for Enterprises 

  • Know Your Software: Use change management tools such as Virsec’s TrustSight™ to gain insight into the nature and extent of code changes included in the patch before it is applied.
  • Adhere to Change Management Best Practices: Do not roll out updates in bulk without first ascertaining that the patch is not defective. Test patches on a small set of machines before rolling them out broadly.

Conclusion 

Virsec's next-gen autonomous application control permits only verified and trusted software to run, effectively blocking any unauthorized code. This proactive default-deny, allow-on-trust strategy is pivotal in maintaining seamless business operations by establishing and upholding stringent security and change management trust policies. By blocking the execution of unauthorized code, Virsec's approach safeguards server workloads from business disruption and exploitation by various threats such as vulnerabilities, malware, ransomware, zero-day exploits, and unidentified attacks.