The biggest technology story this past month was a seemingly minor software problem that quickly became a huge problem across multiple industries. The CrowdStrike debacle, which grew into a systemic airline industry failure, can teach us to look closely at who our software vendors are and how best to protect a business by safeguarding its systems from these kinds of outages and failures. Now that the root cause of the software issue at the center of it has been published, we’ll take a closer look at software updates and how they’re tested and distributed, as well as one way that having a managed services provider keeps you ahead of what could be catastrophic software failures.
“Small” Errors Can Become Widespread Problems
CrowdStrike’s Falcon is an EDR (endpoint detection and response) platform that protects endpoints, that is, workstations and servers alike. Its primary goal is to look at what’s going on inside a workstation or server and determine whether that machine’s behavior is abnormal or suspicious.
CrowdStrike’s software runs on each endpoint in a client organization, making sure that the machine isn’t copying a lot of files, talking to unknown servers, or showing other suspicious patterns of behavior. If it finds that a workstation is behaving strangely, it is able to intervene in real time to stop what may be an attack, like ransomware or a malicious user deleting your business’ data. Crown Computers’ clients may recognize this protection as the same kind of protection offered by Sophos.
Because these applications are designed to algorithmically look for patterns of behavior, those patterns are what need to be updated so that your machines are protected with the latest information. This kind of update is what broke CrowdStrike’s EDR, sending airlines into a situation where many of their workstations and servers wouldn’t run at all. Because part of CrowdStrike’s software runs inside the Windows kernel, the core of the operating system, an error in CrowdStrike means that Windows has to stop running altogether. When the machine restarts, though, it hits the same error again, creating a loop of blue screens and unusable computers.
CrowdStrike recently published its root-cause analysis of the issue, acknowledging that the cause was a programming error: the author of the update told the program to look for 21 of something where there were only 20 of that something to find. Reading the 21st meant reaching into memory that the program had no business reading. Since that’s “impossible,” Windows had to shut itself down to keep its memory secure… on every single computer that the update had been applied to.
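To make the “21 versus 20” problem concrete, here is a minimal sketch in Python. It is not CrowdStrike’s code: the names `EXPECTED_FIELDS`, `find_missing_fields`, and `supplied` are made up for illustration, and Python itself would raise an error rather than read stray memory. The point is only to show the kind of bounds check that catches the mismatch before anything tries to read a value that isn’t there.

```python
# Hypothetical illustration only; none of these names come from CrowdStrike.
EXPECTED_FIELDS = 21  # how many values the update said to look for


def find_missing_fields(values, expected=EXPECTED_FIELDS):
    """Return the positions the update expects but that were never supplied."""
    return [index for index in range(expected) if index >= len(values)]


# Only 20 values are actually available (positions 0 through 19).
supplied = [f"value-{i}" for i in range(20)]

missing = find_missing_fields(supplied)
if missing:
    # Refuse the update instead of reading past the end of the data.
    print(f"Update rejected: no value for position(s) {missing}")
```

A check like this turns a crash into a rejected update, which is a much easier problem to recover from.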
Software Suppliers and Stability
We regularly write, on this blog, that it’s of the utmost importance to update software to secure your business’ network. When a piece of software has a vulnerability in it, the software’s maker is (usually) responsible for creating and distributing a fix. When they no longer support a piece of software, they no longer fix these vulnerabilities. When that happens, any piece of hardware that relies on the software should be retired, and if a new version of the software is available, the device should be upgraded to that newer, supported version.
The CrowdStrike disaster wasn’t this kind of update, though; this was an update that was supposed to make all of the machines running the software more secure, which is what the product is for. This was an error not just of programming, but of testing and reliability practices that should have been followed before rolling out the update. Instead of just pushing the new definitions/algorithms to all of the machines running their software, there should have been some testing, or a limited release, to make sure that the software is stable and won’t break all of the machines running the software.
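A limited release like the one described above can be sketched in a few lines of Python. This is an illustration under our own assumptions, not any vendor’s actual deployment tooling: the `Ring` groups, the `apply_update` and `is_healthy` callbacks, and the machine names are all hypothetical. The idea is to update a small test group first and stop the moment that group shows trouble.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Ring:
    """A hypothetical group of machines that receive an update together."""
    name: str
    machines: List[str]


def staged_rollout(rings: List[Ring],
                   apply_update: Callable[[str], None],
                   is_healthy: Callable[[str], bool],
                   max_failure_rate: float = 0.0) -> dict:
    """Update one ring at a time; halt at the first ring with too many failures."""
    completed = []
    for ring in rings:
        failures = 0
        for machine in ring.machines:
            apply_update(machine)
            if not is_healthy(machine):
                failures += 1
        rate = failures / len(ring.machines) if ring.machines else 0.0
        if rate > max_failure_rate:
            # Later rings are never touched, so the damage stays contained.
            return {"halted_at": ring.name, "completed": completed}
        completed.append(ring.name)
    return {"halted_at": None, "completed": completed}


# Simulate a bad update that breaks every machine it reaches.
rings = [
    Ring("test-group", ["test-01", "test-02"]),
    Ring("early-adopters", ["desk-01", "desk-02", "desk-03"]),
    Ring("everyone", ["srv-01", "srv-02", "kiosk-01"]),
]
touched = []
outcome = staged_rollout(rings, touched.append, lambda machine: False)
print(outcome, touched)
```

In this simulation only the two test machines ever receive the bad update; the rollout halts before it reaches the servers and kiosks that the business depends on.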
When problems are this widespread and the companies involved are doing big business, the parties are likely to seek compensation in court. Delta estimated, for instance, that the software problem was the initial cause of a disruption to as many as 500,000 travelers (me included). With information systems at the center of all of their work, it can be really costly and time-consuming to go to each one of their workstations, servers, and kiosks to get them back up and running.
What Are Your Critical Systems and How Do You Protect Them from Bad Updates?
Part of the issue here is understanding that software vendors will get it wrong sometimes. It’s for this reason that IT teams have processes to research, test, and validate new updates from suppliers that might break something when they’re installed. We try to communicate to our clients that patches are necessary for stability and security, but it’s our testing and research process that validates that your systems will be stable with the newest updates. This is because updates are changes to already existing software; every once in a while, these changes will have ill effects on existing configurations and setups that rely on the patched software.
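As a rough illustration of what a validation gate can look like, here is a small Python sketch. It is a simplified example built on assumptions, not a description of any particular team’s process: the function name, the seven-day waiting period, and the inputs are all hypothetical.

```python
from datetime import date, timedelta


def approved_for_production(released_on, today, passed_test_group,
                            known_issues, soak_days=7):
    """Approve a patch only if it passed testing, has no reported
    problems, and has been out for at least `soak_days` days."""
    soaked = (today - released_on) >= timedelta(days=soak_days)
    return passed_test_group and not known_issues and soaked


# A patch released ten days ago that passed testing with no reported issues.
ready = approved_for_production(date(2024, 7, 1), date(2024, 7, 11), True, [])

# A patch released today is held back, even if early testing looked fine.
too_new = approved_for_production(date(2024, 7, 19), date(2024, 7, 19), True, [])

print(ready, too_new)
```

The waiting period is a trade-off: urgent security fixes may justify a shorter delay, while updates to critical systems may deserve a longer one.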
Just installing all of the updates as they’re made available isn’t enough for your critical systems. Having a plan for a backup system is an important way to make sure that business processes don’t stop when a software vendor releases an update that has ill effects on your current setup. This plan doesn’t have to mean going back to pen and paper; it could include having virtual systems on standby, hardware systems on standby, or alternative configurations and methods that can be used even though they are less automated and streamlined than your typical SOPs. These workarounds might even make it into your Disaster Recovery Plan; if you already have one, it could be the starting point for these Plan B procedures.
Do You Use Your Computer, or Your Software?
Most modern software companies push updates without saying whether they are security updates or feature updates. The latter are the ones that enable new features, functions, and designs. Users often find feature updates either inconsistent and confusing or exciting because of the greater functionality. Both kinds of updates, though, can lead to problems after being installed. Sometimes those problems are caught in further testing before the updates are rolled out to your organization, which we here at Crown Computers believe to be a great advantage of having a tech partner that you can trust.
But at the end of the day, the software is what its makers make it to be. If you rely on a single piece of software without any alternatives, that software will have to be made to work for your business to move forward. Quite a few people in the tech world believe that software isn’t getting better or more stable, but that industry practices are slipping overall and will likely continue along the current trend. Having a team of experts who can keep your business on its feet even when the software companies make a misstep is a great source of value for your organization.
-Written by Derek Jeppsen on Behalf of Sean Goss and Crown Computers Team