by Adi Gadwale, Chief Enterprise Architect at General Dynamics
The New York Times declares “all phones and computers at risk”! How? How can every modern processor going back to the early '90s have the same architectural choices leading to a flaw today? How can a vulnerability be so bad that it can’t be fixed until the next generation of processors?
The latest Spectre and Meltdown vulnerabilities are hard to explain and hard to understand. I’ve had technical and non-technical friends and family ask me and I’ve struggled to explain how these vulnerabilities exist, how much of a danger they pose, or why fixing them results in large performance penalties. Here is how I explain it to my mother, who is very smart, but not a "computer person"!
The explanation begins with something called “speculative execution”, a trick used to enhance performance in almost all modern chips. Speculative execution allows a program to begin using memory and executing instructions before completing the security check to see if the action should be allowed.
But what does this really mean?
Imagine we’re in a track race. We have a large number of participants, the race gun goes off and dozens of participants set off on a sprint.
Unfortunately some of them begin running before the starter's pistol goes off. We have two options. First, we could immediately call off the race. Send everybody back to the starting point. Get out the video cameras and the judges in the back room. Review everybody’s start, disqualify the ones that had a false start and start the race all over again.
This may be the right way to do things, but you can imagine this could be slow and tedious.
Our second option would be to let the race just continue to the finish line. While the participants are sprinting, our judges are reviewing all the starting tapes in the back room. By the time the race is done, we know who needs to be disqualified. As soon as the race is done, we move the winners to the podium and dismiss the false starters. A much faster and cleaner racing experience!
We let the race “speculatively” complete, discard the inappropriate operations before they damage the sporting world, pick a winner and quickly move to the next race.
Where are the vulnerabilities? They are subtle! If I have six contestants ahead of me, psychologically, I might not run as fast, assuming I have little chance of winning, not knowing that all six of them had false starts! All participants might try to start just a little bit early - if they time it just right, they have a huge advantage and that might be worth the risk since they get to complete the race anyway.
Speculative execution on the processor works in a similar way. Rather than keep the race track or CPU idle, operations are completed while memory and security checks happen in parallel. If the security check fails, the operation is discarded, and if the check passes, the overall system just operated a whole lot faster. Everybody always thought this was a great idea and it still is, but it turns out it has some subtle flaws which can be exploited, especially when combined with one other subtle flaw called shared memory mapping.
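For readers who like to see the arithmetic, here is a small Python sketch of that trade-off. It is a toy model, not how a real processor is built: the `Operation` class, the nanosecond figures and the operation names are all made up for illustration. It compares waiting for each check before doing the work with starting both at once and dropping the result when the check fails.

```python
from dataclasses import dataclass


@dataclass
class Operation:
    name: str
    check_ns: int   # hypothetical cost of the security check
    work_ns: int    # hypothetical cost of the actual work
    allowed: bool   # outcome of the security check


def run_check_first(ops):
    """Finish each check before starting the work it guards."""
    elapsed, kept = 0, []
    for op in ops:
        elapsed += op.check_ns
        if op.allowed:
            elapsed += op.work_ns
            kept.append(op.name)
    return elapsed, kept


def run_speculatively(ops):
    """Start the work and the check together; discard the work if the check fails."""
    elapsed, kept = 0, []
    for op in ops:
        if op.allowed:
            # Both ran in parallel, so we only wait for the slower of the two.
            elapsed += max(op.check_ns, op.work_ns)
            kept.append(op.name)
        else:
            # The work is abandoned as soon as the check reports a failure.
            elapsed += op.check_ns
    return elapsed, kept


# Made-up workload: two legitimate operations and one that must be refused.
OPS = [
    Operation("read own data", check_ns=30, work_ns=40, allowed=True),
    Operation("write own data", check_ns=30, work_ns=50, allowed=True),
    Operation("read kernel data", check_ns=30, work_ns=40, allowed=False),
]

CHECK_FIRST = run_check_first(OPS)
SPECULATIVE = run_speculatively(OPS)
print("check first:", CHECK_FIRST)
print("speculative:", SPECULATIVE)
```

Both approaches keep exactly the same results, and the forbidden read is dropped either way. The speculative version simply finishes sooner, which is why the trick looked like a pure win for so long.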
Let's play Battleships. Each of us has a hidden board where we’ve laid out our battleships. We call out positions to each other for every turn. Every time I call out a position - “B8!” - you tell me if I’ve hit one of your ships or it is a miss.
Modern computers have conceptually similar boards in memory for each running application, as well as for the operating system, whose board is called kernel space. Applications are not allowed to see each other's boards or the kernel's board, to ensure security and privacy. This ensures that the solitaire game running on your PC does not read your email running in your browser, and that neither can peek into the kernel's board and read your passwords. If they pass the right security checks, they may be allowed to share information. Your browser can access internet content by using the network connection managed in kernel space. Your document writer can save a document to disk using the drive access in kernel space. Such controlled interaction and access is essential for a functioning computer.
But wait - we could make things faster if we operated speculatively - act first and verify legitimacy in parallel to speed things up. If the security check fails, don’t complete the action. Simply discard the results of the action before they are completed. This works well in almost all cases. Except when there are subtle vulnerabilities.
Let’s get back to our game of Battleships. I yell “B8”, but you start moving your finger on your board even before I’m done - you move your finger as soon as I say “B” so we can play as fast as possible. Before I say “8”, you are already moving your finger along the “B” column, ready to respond lightning fast if it is a hit or a miss. This works great, until I figure out that you have tells. If there are a lot of ships in the “B” column, I can see your arm move slower. I can’t see your board, but I can now start inferring indirect information from your reaction time. I see a micro-expression, a twitch and a shadow.
This is how modern day vulnerabilities work. Not because they immediately reveal everything but because they begin to leak small and subtle information that can be pieced together and exploited.
Every time the processor discards an inappropriate action, the timing and other indirect signals can be exploited to discover memory information that should have been inaccessible.
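To make that leak concrete, here is a deliberately simplified Python simulation. It is not an exploit and it touches no real hardware: the `ToyCPU` class, the two made-up latencies and the secret value are all hypothetical. The one idea it borrows from the real attacks is that a discarded operation can still leave a footprint in the processor's cache, and that the footprint can be found by timing.

```python
SECRET = b"B8"        # hypothetical secret sitting in protected memory
CACHE_HIT_NS = 1      # made-up latency for data already in the cache
CACHE_MISS_NS = 100   # made-up latency for data that must be fetched


class ToyCPU:
    """A toy processor: 256 probe lines and a record of which are cached."""

    def __init__(self, protected):
        self._protected = protected
        self.cached_lines = set()

    def flush(self):
        """Empty the cache so every probe line starts out slow."""
        self.cached_lines.clear()

    def speculative_read(self, index):
        """Model a forbidden read that begins before its security check ends."""
        value = self._protected[index]   # the work starts speculatively
        self.cached_lines.add(value)     # side effect: probe line `value` is now cached
        return None                      # the check fails, so the result is discarded

    def probe_time(self, line):
        """How long the attacker waits when touching one probe line."""
        return CACHE_HIT_NS if line in self.cached_lines else CACHE_MISS_NS


def leak_byte(cpu, index):
    """Recover one secret byte from timing alone, never from a returned value."""
    cpu.flush()
    assert cpu.speculative_read(index) is None   # nothing is handed back directly
    timings = [cpu.probe_time(line) for line in range(256)]
    return timings.index(min(timings))           # the one fast line gives it away


def leak_all(cpu, length):
    return bytes(leak_byte(cpu, i) for i in range(length))


RECOVERED = leak_all(ToyCPU(SECRET), len(SECRET))
print(RECOVERED)
```

The attacker in this sketch never receives the secret. Every forbidden read returns nothing, exactly as the security check intends. The secret comes out anyway, one byte at a time, because one probe line answers faster than the other 255. That is the slow arm in our game of Battleships.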
Meltdown exposes kernel data to regular user programs. Spectre allows programs to spy on other programs, the operating system and on shared programs from other customers running in a cloud environment.
Speculation is needed for high performance. Patching by simply restricting speculation results in significant drops in performance for many types of software. We will likely claw back to today’s levels of performance in the next one or two chip generations. In the interim, we can use more hardware to compensate for the performance drop.
Chip makers will be in a race to produce the next generation of chips with equivalent performance but without the vulnerabilities. Engineering a fundamentally new solution carries the risk of new vulnerabilities in the next generation of architecture, especially with the rush to bring something new to the market.
If you expect to do large refreshes of servers and desktops or major new investments that need to be viable for 3-5 years, consider delaying or staggering those investments by a quarter while new chip architecture road maps take shape. Alternatively, look for short term leases to minimize the risk of holding on to lower performing assets for many years.
Cloud providers will need to apply patches to address the extreme theoretical scenario that a rogue customer could steal data from other customers in the same infrastructure. In the short term, this will result in a significant performance penalty that will have to be overcome with additional hardware investments. This may result in short term price increases or, at the very least, a hiatus in the long term trend of constantly lowering cloud computing prices.
While security vendors all scramble to address these vulnerabilities, I have my eye on Virsec, whose focus on memory corruption and trusted execution led to a meeting with their co-founder a few months ago.
It's rare for a vulnerability to be this widespread, but it likely won't be the last time. The same good practices we always follow - endpoint protection, user education, only executing code from trusted sources, applying security updates - are still good hygiene and should continue while we wait for long term mitigation for these vulnerabilities.