Since I started this blog in 2013, I have tried to be careful not to express political opinions, and I have no intention of changing that policy now. However, a column by Jim Geraghty in the Washington Post this week made it clear that, when it comes to dealing with China’s escalating attacks, including the series of attacks labeled Salt Typhoon, US critical infrastructure (including the electric power industry) can’t expect much help from the federal government.
Mr. Geraghty points out that “…on Dec. 3, the Financial Times reported that the Trump administration had ‘halted plans to impose sanctions on China’s Ministry of State Security over a massive cyber espionage campaign in order to avoid derailing the trade truce presidents Donald Trump and Xi Jinping struck in October.’”
I’m not going to second-guess whether President Trump is justified in taking that action. However, these are the cards the power industry has been dealt. We need to do whatever we can to counter the relentless attacks that China is unleashing on us. It’s time to take a walk along the ramparts and identify the most likely entry points for those attacks.
Fortunately, since Chinese (and Russian, Iranian and North Korean) attacks on the North American power grid have been going on for years and have never (as far as I know) even come close to causing an outage, it’s safe to say that the power industry’s defenses, including the current NERC CIP standards and the upcoming CIP-015, are at least adequate for today’s threats.
But what about tomorrow's threats? One type of threat is inevitable: attacks on the grid that will arise once electric utilities start utilizing control systems that are implemented in the cloud. Why do I say these threats are inevitable, when today utilities hardly use the cloud at all for control systems?
In my last post, I pointed to three reasons why it’s inevitable that electric utilities will increasingly want to utilize BES Cyber Systems (BCS) located in the cloud, once the current compliance risk of using the cloud has been eliminated (as described in the same post). Cloud use here includes both BCS implemented on a cloud platform by the utility itself and software as a service (SaaS) that meets the definition of BES Cyber Asset (including sub-15-minute impact on the BES)[i].
The first reason is that providers of on-premises software and security services are increasingly moving to the cloud as their primary or even exclusive platform. Even though in some cases these providers continue to offer an on-premises product, most future improvements in the software are only available in the cloud version.
This trend is already happening in a big way with security monitoring software (in fact, a lot of this category of software has been cloud-based from the start), but it will inevitably happen with software used for grid operations as well. In five years, it’s likely that all electric utilities in the US will utilize at least a few control systems (or components of control systems) that are deployed in the cloud, either by the entity directly (e.g., BCS in the cloud) or as SaaS that has a sub-15-minute BES impact.
This is why the electric power industry needs to start considering the threats that will come with growing utilization of the cloud to house the systems that operate and monitor the Bulk Electric System (BES). The industry also needs to decide what can be done about those threats – other than accepting that electric utilities will never be able to utilize the cloud for their OT systems while maintaining the security of the BES.
The second reason is that NERC entities are realizing that software delivered as SaaS is inherently more secure than software installed on premises. This is because newly discovered vulnerabilities are usually fixed by the SaaS provider as soon as a patch is available; the end user doesn’t have to do anything, not even apply a patch, to fix the vulnerability. In fact, the end user may never even know that the vulnerability was present in the software[ii].
The third reason is that new cloud-based security monitoring services have appeared that can gather information on new threats in real time from sources all over the world. The service provider can use that information to protect their customers from a new threat almost as soon as the threat is detected. Because of the EACMS problem, however, NERC entities with high or medium impact BES environments are usually unable to make use of these services.
Moreover, when an on-premises security monitoring service moves to the cloud (which is happening more and more frequently), it usually becomes unusable by NERC entities, due to the same problem. Thus, having to comply with the current NERC CIP requirements now makes it much harder to secure on-premises systems than it would be otherwise.
Is the cloud safe?
IT professionals in other industries may find the above discussion strange, since they have been using the cloud for years and, while they realize there are plenty of cyber threats in the cloud, they have come to realize that, for the most part, their data and systems are more secure in the cloud than they are on premises. This is because a Cloud Service Provider (CSP) has far more resources to throw at security than any electric utility has to secure its on-premises systems. If the cloud is safe enough for most other industries, why isn’t it safe for the power industry?
The reason is simple: Many participants in the electric power industry, as well as many members of the public, don’t believe the cloud is safe. In my most recent post, I noted three main reasons why this is the case.
The first reason is that NERC entities with high and medium impact BES environments know they will be found in violation of between 50 and 120 NERC CIP Requirements and Requirement Parts if they locate or utilize BES systems (BCS, EACMS or PACS) in the cloud. This isn’t because any CIP requirement explicitly forbids use of the cloud, but because no CSP will ever agree to provide the huge volume of compliance evidence that the entity would need to prove their compliance during an audit.
The post I just cited describes seven simple changes to the wording of current CIP requirements and glossary definitions, which I believe will completely solve this problem. These changes could easily be drafted, approved and implemented within 2-3 years.
The second reason is that many electric utilities are worried that, even if they only deploy a fraction of their BES Cyber Systems (BCS) in the cloud, those systems could be the vector through which someone in the Chinese People’s Liberation Army reaches out and compromises their entire on-premises control network. These utilities will need to see a lot of their peers starting to use the cloud for their OT networks before they do the same.
The third reason is the most important: Many NERC entities, as well as many members of the general public who follow these issues closely, believe that when multiple utilities start deploying systems that monitor and/or control the BES in the cloud, this will open the BES up to threats that are unique to the cloud; I call these “cloud native” threats. These threats are not addressed by the current NERC CIP standards or by other regulations on critical infrastructure, since those threats don’t apply to on-premises systems at all. What are cloud native threats, and why are they so different from the threats addressed by the current CIP standards (including CIP-015, which will become enforceable in 2028)?
The NERC CIP standards address threats to the smooth operation of systems owned and/or operated by individual NERC entities (these are primarily electric utilities and independent power producers, including renewable energy producers) that monitor and/or control the BES. In NERC CIP, requirements for remote access controls, patch management, malware detection and removal, vulnerability assessment, incident response, removable media controls, etc. help protect those systems from compromise due to various causes, including malicious foreign actors, software vulnerabilities, and…well, pure cluelessness.
If an on premises BES Cyber System is compromised, the damage will primarily be to the individual utility. Since on premises control systems in different utilities are not connected to each other, except for data exchange between Control Centers, compromise of a single system should not lead to a large problem on the BES.
On the other hand, because many utilities can use the same cloud infrastructure or SaaS product at the same time, cloud native threats can potentially affect all those utilities at the same time. This is why those threats can have a much greater impact on the BES than can threats to on-premises systems. In other words, since cloud native threats inherently apply to groups of NERC entities, they inherently apply to the BES itself.
An example of a cloud native threat
One of the most important cloud native threats is the threat of a widespread cloud outage. The Amazon Web Services outage (actually, outages) two months ago was a vivid example of the massive problems that a widespread and prolonged cloud outage can cause. In fact, few people reading this post are likely to have escaped being affected by it. However, even though the IT operations of some electric utilities were undoubtedly affected, it’s very unlikely that any OT operations were, since so few BES systems are deployed in the cloud today.
What would have been the BES impact if, for example, ten low impact Control Centers, all in one Interconnect, had been deployed on one of the major platforms when it went down? Or if three medium or high impact Control Centers, also in the same Interconnect, had been in the same situation? I agree with you…there would probably have been very little impact on the BES, even though the individual utilities or Independent Power Producers that operated those Control Centers would each have been affected, at least financially.
However, what might have happened if 100 low impact Control Centers or 50 medium or high impact Control Centers, also all in one Interconnect, had gone down together due to a CSP outage? It’s likely there would have been a significant BES impact, if not a true catastrophe. Where can we draw the line between insignificant and significant BES impact?
Before we can answer that question, we need to determine how “impact” will be measured. It might be measured by total load served by electric utility customers of the CSP in each Interconnect, with the line drawn higher based on the size of the Interconnect (that is, for the outage to significantly impact the BES, the CSP’s customers would need to serve a lot more load if they’re in WECC than if they’re in ERCOT, since WECC is much larger than ERCOT).
Once we determine how to measure the risk posed by the threat, we need to figure out how to mitigate that risk. In this case, the mitigation might be to limit use of a particular cloud (e.g., the AWS cloud or the Azure cloud) to say 60 low impact Control Centers in WECC but only 20 in ERCOT, and 20 medium or high impact Control Centers in WECC but only five in ERCOT.
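A cap like this is straightforward to express in code. Here is a minimal sketch of the onboarding check a CSP might apply; the cap numbers echo the illustrative figures above, and all names (`CAPS`, `can_onboard`) and limits are hypothetical, not proposed policy:

```python
from collections import Counter

# Hypothetical per-Interconnect caps on Control Centers hosted by one CSP.
# These numbers are illustrative only, echoing the example in the text.
CAPS = {
    ("WECC", "low"): 60,
    ("ERCOT", "low"): 20,
    ("WECC", "medium_or_high"): 20,
    ("ERCOT", "medium_or_high"): 5,
}

def can_onboard(existing, interconnect, impact):
    """Return True if one more Control Center of this impact level can be
    hosted by the CSP in this Interconnect without exceeding the cap."""
    count = Counter(existing)[(interconnect, impact)]
    cap = CAPS.get((interconnect, impact))
    if cap is None:
        return True  # no cap defined for this combination
    return count + 1 <= cap

# Example: with 19 low impact Control Centers already in ERCOT, a 20th is
# allowed, but a 21st would be rejected.
existing = [("ERCOT", "low")] * 19
print(can_onboard(existing, "ERCOT", "low"))   # True
print(can_onboard(existing + [("ERCOT", "low")], "ERCOT", "low"))   # False
```

In practice the cap would presumably key on load served rather than a simple count of Control Centers, as the preceding paragraph suggests, but the shape of the check is the same.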
Of course, there is no easy way to determine what is the maximum acceptable level of risk due to this threat (or any other cloud native threat to the BES). What’s certain is that the decision needs to be made by the consensus of a group of stakeholders, who stand to be negatively affected if the threat comes to pass. Perhaps the most important of those stakeholders are the cloud providers, since they will lose huge amounts of money and reputation if they experience a widespread outage.
I suggest there needs to be a working group that identifies cloud native threats to the BES, and determines mitigations for them. It will include four types of participants:
1. Power industry experts from both the cybersecurity and operations sides of the industry.
2. Representatives[iii] of cloud providers, including Platform CSPs, SaaS providers and Managed Security Service Providers (MSSPs). It is essential to include them, since providers understand how the cloud works better than anyone else.
3. Independent cloud security experts, who do not work for CSPs.
4. Consumers – residential, commercial and industrial. After all, they’re the ones whose livelihoods and lifestyles depend on having a reliable supply of electric power.
This group won’t just work on this one problem. The group’s charter will be to develop guidelines for cloud providers (and sometimes electric utilities) for mitigating all cloud native threats, not just the threat of widespread outages. In this recent post, in the section titled "Cloud-only risks”, I listed about 12 cloud native threats that were identified by the current drafting team when they developed their revised Standards Authorization Request (SAR) in 2024. I also pointed to four previous posts in which I identified five other cloud native threats (although I was using the term “cloud-only risks” at the time[iv]). There are probably many more cloud native threats that a cloud expert could easily identify.
For each of these threats, the group will first determine whether it’s a real threat – i.e., whether there is a reasonable likelihood that it will be realized. For every threat determined to be real, the group will make sure they can properly state the threat, including defining any new terms that are needed. Next, the group will identify one or more mitigations for the threat. It will usually be the responsibility of cloud providers to mitigate a cloud native threat.
For some threats, the mitigation will also apply to cloud users. For example, in the above example, the cloud providers will need to place restrictions on the numbers and perhaps types of utilities and independent power producers that can utilize their services. However, the utilities and IPPs themselves will need to cooperate and not make a big fuss when they’re told they can no longer use the cloud provider they have chosen.
There’s another important point about this group: It will continue to meet for years, if not decades. This is because it’s inevitable that new cloud native threats will continue to be identified as long as the cloud is in existence. While many of these threats will turn out not to be real, they all need at least to be evaluated. I can see the group having monthly meetings to identify new threats and decide how they will deal with them. Since some threats may require a lot of deliberation (as in the example above), the whole group may not be able to consider each new threat; they might need to split into multiple subgroups, to keep handling the threats at least somewhat promptly.
How will the guidelines be “enforced”?
You may have noticed that I’m not even considering the idea that cloud native threats will be addressed in new NERC CIP requirements. For someone who hasn’t read my posts recently, that might be surprising. Even three months ago, I was expecting these threats to be the subject of new CIP requirements, to be drafted by the current NERC “Cloud CIP” Standards Drafting Team.
However, in the last three months my thinking has changed. Even if the SDT wants to draft a new CIP requirement for each of the 16 or so cloud native threats that they and I have identified (which I’m not sure they do), I now realize doing that will take literally decades.
Here’s why I say that: In this post, immediately below the discussion of “Cloud-only risks” that I pointed to earlier, I discussed another example of a cloud native threat: SaaS multi-tenancy. After discussing that threat, I described 11 steps the SDT will need to take for each cloud native threat, in order to have the corresponding requirement approved by NERC and FERC and become enforceable. I estimated that performing those 11 steps will take 250 hours per threat (very possibly an underestimate).
I then divided that by 150, my estimate of the hours the SDT will spend in meetings this year. That estimate turned out to be too high, but I’ll use it for now (a lower number – around 130, which I think is the actual figure – only reinforces my argument).
Dividing 250 by 150 gives 1.67 years per cloud native threat. Multiplying that by the 16 threats (itself a big underestimate, since there are certainly far more than 16 cloud native threats today) yields about 27 years. That’s a very big number. Even if you think I’m off by a factor of two, it will still be 13 ½ years. Clearly, it is fruitless even to try to address cloud native risks in the CIP standards.
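The arithmetic behind that estimate is simple enough to check. A quick sketch using the figures from the text (all of which, as noted, are rough estimates):

```python
# Rough estimates from the text; every one of these is an assumption.
hours_per_threat = 250      # SDT work needed per cloud native threat
sdt_hours_per_year = 150    # SDT meeting hours per year (actual may be ~130)
num_threats = 16            # threats identified so far; almost certainly a floor

years_per_threat = hours_per_threat / sdt_hours_per_year
total_years = years_per_threat * num_threats

print(round(years_per_threat, 2))  # 1.67 years per threat
print(round(total_years, 1))       # 26.7, i.e. about 27 years in total
```

Note that lowering the meeting hours to 130 per year, or raising the threat count above 16, only pushes the total higher.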
But I’ve already said that, unless we address cyber threats that come with using BES systems in the cloud, nobody (including the general public and NERC entities) will be willing to permit wide use of the cloud by systems that control the power grid. What’s Plan B for addressing cloud native threats?
Plan B is to move cloud native threats completely out of the NERC standards development process, since that process is far too cumbersome. The first step will be to form a group like the one I described above. That group will determine which of those threats are real (i.e., they have a nonzero risk of occurring). For each real threat, the group will develop voluntary guidelines for the cloud providers to mitigate the threat. Since the cloud providers will be at the table (whether physical or virtual or both) during the whole process, they will presumably be strongly motivated to follow the guidelines that they played a big part in creating.
Obviously, there will be glitches in this process. However, I believe that within 2-3 years after this group starts working, version 1.0 of the guidelines will be finished, while work on version 2.0 will have already started. In other words, the national security threat caused by cloud native risks to the Bulk Electric System can be addressed in three years through the process I’m proposing, vs. 27 years through the NERC standards development process. Which is better?
If you would like to be part of the group planning this effort, please drop me an email.
If you would like to comment on what you have read here, I would love to hear from you. Please email me at [email protected] or comment on this blog’s Substack community chat.
Tom Alrich’s Blog, too is a reader-supported publication. You can view new posts for one month after they come out by becoming a free subscriber. You can also access my 1300 existing posts dating back to 2013, as well as support my work, by becoming a paid subscriber for $30 for one year.
[i] I didn’t refer to the definition of BES Cyber System, since a BCS is defined as just a collection of BES Cyber Assets (which are devices); it is the BCA that has a sub-15-minute impact on the BES, etc. If there is to be a cloud version of BCS, it can’t simply incorporate the BCA definition as it stands, since that definition applies to “devices” – which can’t be identified when it comes to the cloud. I suggested in the same post that, if the BCS concept is ever extended to the cloud, “cloud BCS” could be based on a modified version of the BCA definition, in which “device” is replaced by “system”; there would also need to be a NERC Glossary definition of System, since there isn’t one today.
[ii] This isn’t necessarily a good thing, since if one SaaS provider discovers a vulnerability in their product and patches it without reporting it in a new CVE record, other software providers (both SaaS and on premises software) will never learn about the vulnerability. Moreover, the scanner vendors will never include it in their scans. There is now a proposal before the CVE Program to allow a CVE Numbering Authority (which is very often the developer of the software in which the new vulnerability was found) to make clear in the record (in a machine-readable fashion) that the product listed is no longer affected by the new CVE, although other products may be affected. If you would like to read an introduction to the CVE program that explains what’s above, go here.
[iii] By “representatives”, I don’t mean there should be a formal system that decides which organizations can provide representatives and how many each can provide. It will be up to each organization to decide whether to participate at all and who will represent them, although there will also need to be a limit of say 2-3 people per organization.
[iv] I have very recently decided to distinguish threats from risks. A threat that has a realistic chance of being realized is a risk. For example, consider the threat of being killed by lightning. The risk is small but nonzero, since about 24,000 people worldwide are killed by lightning each year, out of an 8.2 billion population. However, the risk of being killed by a meteorite (a different threat) is essentially zero. Google AI says, “While there are many historical accounts and folklore about people being killed by meteorites, there are no definitively confirmed, scientifically verified cases of a person being killed by a direct meteorite impact.” Therefore, it would be a waste of time to try to mitigate the risk of being killed by a meteorite (if it’s even possible to mitigate the risk).