Written by: Keigo Ito, Ph.D., Senior Data Scientist, Capgemini; Drew Swanson, Energy, Utility, Chemical Expert Sales Manager, Capgemini; Ranjeet Vaishnav, Energy Solutions Director, Capgemini; Drake Ryans, Strategist, Capgemini
The Northeastern Blackout of 2003[1], which resulted in a power loss for 55 million people, was a shocking reminder that, despite its remarkable stability, the US power grid is still vulnerable to a severe blackout. Questions like "how did it happen?" and "how can we prevent it?" still lurk in the minds of those who were affected.
The months-long investigation[2] found that the severity of the blackout was partly due to a phenomenon called a cascading failure[3]. A cascading failure is a type of failure in an interconnected system in which the failure of one part of the system causes the failure of other parts through positive feedback. Cascading failures in a power grid are difficult to prevent because there is a multitude of pathways through which power outages can propagate. Indeed, Wikipedia defines a "cascade effect" (i.e., a cascading failure) as "an inevitable and sometimes unforeseen chain of events"[4] in an interconnected system.
If a cascading failure is an intrinsic adverse effect of the power grid, is it possible to analyze the systemic risks of a cascading failure and use the insight to prevent it? To do this, we turn our attention to graph analytics.
Graph analytics is a subset of data analytics that specializes in analyzing graph data. In computer science, the term "graph data" refers to a data structure in which objects and their relationships are represented by nodes and edges. Think of a mind map if you want to visualize what graph data looks like. In a mind map, ideas correspond to nodes, and the connections between the ideas are comparable to edges. Physical infrastructure can be represented as graph data, just like thoughts and ideas can be digitized into a mind map. Social networks, supply chains, and, of course, power grids can all be represented as graph data.
Graph analytics offers a method to programmatically travel through graph data, known as graph traversal. Graph traversal is the technology behind your GPS's ability to guide you from your home (node A) to your office (node B). When used with a graph representation of a power grid, it allows us to perform a cascading failure simulation by tracking how an initial power outage traverses the grid. By performing many hypothetical cascading failure simulations and finding patterns in the results, graph analytics enables us to identify vulnerabilities in the power grid.
In the following sections, I describe how I constructed a digital copy (known as a digital twin) of the Texas power grid from synthetic data, and how I used graph traversal with the power grid digital twin to perform a risk assessment during my Cortex research project. Cortex is a research project that aims to develop state-of-the-art data analytics capabilities at Hybrid Intelligence, the world's leading data science and AI consultancy team within Capgemini Engineering.
How to make a digital twin of the Texas power grid
To construct a digital twin of the Texas power grid, we paired a graph database with a synthetic data set designed to closely resemble the real grid's structure and power consumption. Using a graph database as our central tool enabled us to easily map the physical assets and their relationships to their digital counterparts. Power plants and substations were mapped onto the graph nodes, and transmission lines were mapped onto the edges. This representation made it possible to use a graph traversal algorithm to query connections between substations and therefore track how a disruption on one line might spread through the network.
We began by selecting an appropriate data set from which to build a power grid digital twin, namely the dataset "ACTIVSg2000", which is available from the Zenodo.org data repository[5]. The data describes the properties of individual power plants, substations, buses, and transmission lines, and the amount of power consumed at a given hour, all at just the right level of fidelity to build the power grid digital twin. We then stored descriptive data about assets, such as the highest amount of power consumed at each substation (max load) or the maximum amount of power each line can carry (line rate), as node and edge properties. Comparing a substation's max load to the line rate of a transmission line that supplies it with electricity enables us to execute a load test to determine whether a line will stay operational or go offline for a given transmission load.
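As a minimal sketch of this node-and-edge representation, the following Python snippet builds a toy three-substation grid with plain dictionaries and implements the load test described above. The substation names, max loads, and line rates are invented for illustration, not taken from ACTIVSg2000:

```python
from collections import defaultdict

# Hypothetical substations (nodes) with their max load, in MW.
substations = {
    "S1": {"max_load": 120.0},
    "S2": {"max_load": 80.0},
    "S3": {"max_load": 60.0},
}

# Hypothetical transmission lines (edges) with their line rate, in MW.
lines = {
    ("S1", "S2"): {"line_rate": 100.0},
    ("S2", "S3"): {"line_rate": 70.0},
    ("S1", "S3"): {"line_rate": 50.0},
}

# Adjacency list so a traversal can query connections between substations.
adjacency = defaultdict(list)
for (u, v) in lines:
    adjacency[u].append(v)
    adjacency[v].append(u)

def load_test(line, load):
    """A line stays operational only if the load does not exceed its rate."""
    return load <= lines[line]["line_rate"]

print(load_test(("S1", "S2"), 90.0))  # True: within the 100 MW rate
print(load_test(("S1", "S3"), 55.0))  # False: exceeds the 50 MW rate
```

In a production graph database these properties would live on the nodes and edges themselves; the dictionaries here only stand in for that storage.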
Since our digital twin records the interdependence of each asset, it is particularly suited for performing simulations of cascading failures, because we can track how the disruption of one line might spread through the network using a graph traversal algorithm.
Cascade analysis
We used the breadth-first search algorithm to simulate how blackouts spread through the power grid. Breadth-first search is suitable because it visits all nodes at a given level of a structural hierarchy before descending to the next level.
By simulating an initial line failure and testing how the load on the failed line is compensated by neighboring lines, we can determine whether the initial failure will cause secondary failures. Graph traversal with breadth-first search allows us to simulate how the compensated load propagates and reaches upstream transmission lines. By combining load testing and breadth-first search, we can simulate how an initial transmission line disruption cascades through the power grid network.
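The combination of load testing and breadth-first search can be sketched as follows. This is a simplified model under assumptions of our own choosing: a four-substation toy grid with invented line rates, and shed load divided evenly among the operational lines adjacent to a failed line's endpoints (real load-flow redistribution is considerably more involved):

```python
from collections import deque

# Hypothetical line capacities in MW; edges are keyed by frozenset so
# that direction does not matter.
line_rate = {
    frozenset({"A", "B"}): 100.0,
    frozenset({"B", "C"}): 60.0,
    frozenset({"B", "D"}): 90.0,
    frozenset({"C", "D"}): 40.0,
}
neighbors = {}
for edge in line_rate:
    u, v = tuple(edge)
    neighbors.setdefault(u, set()).add(v)
    neighbors.setdefault(v, set()).add(u)

def simulate_cascade(initial_line, initial_load):
    """Breadth-first propagation of a transmission line failure.

    When a line fails, its load is shed evenly onto the still-operational
    lines adjacent to its endpoints; each of those lines is load-tested,
    and any that fail are enqueued to shed their load in turn.
    """
    failed = {initial_line}
    queue = deque([(initial_line, initial_load)])
    while queue:
        line, load = queue.popleft()
        # Operational lines adjacent to the failed line's endpoints.
        candidates = []
        for node in line:
            for nxt in neighbors[node]:
                edge = frozenset({node, nxt})
                if edge not in failed and edge not in candidates:
                    candidates.append(edge)
        if not candidates:
            continue
        share = load / len(candidates)  # even split: a simplifying assumption
        for edge in candidates:
            if share > line_rate[edge]:  # load test fails -> secondary failure
                failed.add(edge)
                queue.append((edge, share))
    return failed

# A modest initial load is absorbed; a heavy one takes down the whole grid.
print(len(simulate_cascade(frozenset({"A", "B"}), 120.0)))  # 1
print(len(simulate_cascade(frozenset({"A", "B"}), 200.0)))  # 4
```

The breadth-first queue is what guarantees that all lines one hop from a failure are load-tested before the cascade is followed any further out.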
Single simulation results
One of the most illustrative studies we performed was a cascading failure simulation under conditions that mimic those of the 2003 Northeastern blackout. The disruption (indicated in red in Figure 1) occurs in small clusters scattered all over the power grid. This apparently irregular spreading happens because some parts of the power grid are designed with more resilience than others. When a high load propagates through a well-designed region of the grid, substations can distribute the load effectively and manage to handle the stress. Lines and substations fail due to overloading when the load reaches a poorly designed or less well-maintained area.
Figure 1. Animation of a cascading failure simulation using the power grid of Texas. Red = offline, amber = at risk, yellow = carrying compensation load.
Impact analysis of the single simulation
While the simulation in Figure 1 shows the sequence of spreading disruptions, it does not show how power outages impact domestic consumers. Since population density varies by region, translating the line outages to the affected population provides insight into the outage's economic and social consequences.
The impact analysis of the above simulation (Figure 2) shows that the outage goes through three distinct phases. The impact in the initial phase is relatively limited because the cascade spreads within a sparsely populated area (Phase 1). No new outages appear while the load is traversing the more resilient part of the grid (Phase 2). However, once the load reaches less resilient parts of the grid near large cities, the impacted population grows rapidly (Phase 3).
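Translating line outages into affected population can be sketched like this; the populations and failure sequence below are hypothetical stand-ins, not output from the actual Texas simulation:

```python
# Hypothetical populations served by each substation.
population = {"S1": 5_000, "S2": 120_000, "S3": 800_000}

# Substations knocked offline at each simulation step; the empty step
# mimics a phase where the cascade crosses a resilient region.
failure_sequence = [["S1"], [], ["S2", "S3"]]

def cumulative_impact(sequence):
    """Cumulative affected population after each cascade step."""
    total, impacts = 0, []
    for step in sequence:
        total += sum(population[s] for s in step)
        impacts.append(total)
    return impacts

print(cumulative_impact(failure_sequence))  # [5000, 5000, 925000]
```

Plotting such a cumulative curve against simulation time is what produces the phased shape visible in Figure 2.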
 Figure 2. Impact analysis of a cascading failure simulation.
Systemic risk assessment
A single simulation like the one discussed above cannot provide an accurate risk assessment for the power grid, because the result applies only to one specific failure case. A more reliable approach is to run an ensemble of simulations covering the different possible initial points of failure and, from the results, identify transmission lines that fail frequently or cause significant outages downstream.
To quantify the systemic risks, we identified two key factors: (1) failure frequency, i.e., how often a power line fails across all simulations, and (2) affected population, i.e., the total number of consumers impacted by the initial failure. We developed a risk metric by multiplying failure frequency by affected population. Combining these factors into a single metric is essential for understanding systemic risk, because inspecting one factor alone can be misleading. For example, some power lines may have a low failure frequency but be critical to a densely populated area. In this case, even a one-time failure (however slim the chances may be) can still affect a large population, posing a significant risk to the system. A high score in the resulting metric indicates a power line with both a high failure frequency and a large affected population.
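A minimal sketch of the risk metric, using invented per-line tallies; note how the rarely failing but high-population line outranks the frequently failing, low-population one:

```python
# Hypothetical per-line tallies aggregated across all simulations.
stats = {
    "line_17": {"failure_frequency": 42, "affected_population": 1_500},
    "line_03": {"failure_frequency": 3,  "affected_population": 900_000},
    "line_88": {"failure_frequency": 35, "affected_population": 400_000},
}

def risk_score(line):
    """Systemic risk = failure frequency x affected population."""
    s = stats[line]
    return s["failure_frequency"] * s["affected_population"]

# Rank lines by systemic risk, highest first. line_03 fails rarely but
# still outranks line_17 because its failures hit a large population.
ranked = sorted(stats, key=risk_score, reverse=True)
print(ranked)  # ['line_88', 'line_03', 'line_17']
```

Either factor alone would rank these lines differently, which is exactly why the product is used.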
A scatter plot of failure frequency vs. affected population enables us to quickly identify high-failure, high-impact lines (Figure 3-a). The ten lines with the highest risk score (Figure 3-b) are in the upper right corner (highlighted in red in Figure 3-a).
Figure 3. Visualization of risk scores. (a) A scatter plot of failure frequency vs. affected population. (b) A bar graph showing risk scores of the ten riskiest lines.
Contingency plans
In our final investigation, we explored whether our simulation outputs could be used to create a contingency plan for minimizing cascading failures. We analyzed the sequence of line failures from the first simulation and identified three lines that, if disconnected manually, could stop the spread of the power outage. By applying this contingency plan, the outage was successfully contained to a relatively small scale and evaded large metropolitan areas (see Figure 4-a and compare to the original simulation shown in Figure 1). The impact of the outage with intervention (Figure 4-b, magenta line) was reduced to 14% of the expected impact without intervention (Figure 4-b, blue line). In this example, we created a contingency plan using the result of one simulation, but contingency plans can be derived for all of the simulations.
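The effect of such an intervention can be illustrated with a deliberately simplified model in which the cascade is a linear chain of line failures and a pre-emptively disconnected line halts all further propagation; the line names, ordering, and populations are hypothetical:

```python
# Hypothetical failure sequence from a prior simulation, with the
# population affected when each line goes down.
population_by_line = {"L1": 10_000, "L2": 20_000, "L3": 2_000_000}
cascade_order = ["L1", "L2", "L3"]

def impacted_population(disconnected):
    """Total affected population given a set of pre-disconnected lines."""
    affected = 0
    for line in cascade_order:
        if line in disconnected:
            break  # cascade cannot propagate past a disconnected line
        affected += population_by_line[line]
    return affected

baseline = impacted_population(set())    # no intervention
with_plan = impacted_population({"L3"})  # disconnect L3 pre-emptively
print(with_plan, baseline)  # sacrificing two small lines spares the big one
```

The real analysis searches the simulated failure sequences for the few lines whose removal cuts the cascade off before it reaches high-population areas; this toy version only shows why such a cut point can shrink the impact so drastically.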
Figure 4. Visualization of simulation with intervention. (a) The final frame of the simulation. (b) Impact analysis of the simulation with (magenta lines) and without (blue lines) intervention.
5 things to consider when using graph analytics and a digital twin to perform a cascade analysis
1. Add real-time data ingestion
The digital twin that we created is necessarily limited in scope and capability by research time and available data. One major limitation is the lack of real-time data.
Real-time data ingestion is a critical component. Adding it would be possible if SCADA, IIoT sensors, and a reliable data streaming infrastructure were available to augment the static network model. If implemented, the digital twin could continuously monitor the current conditions of the power grid, detect disruptions, and suggest to grid operators, in real time, what actions to take to minimize the impact of a disruption.
2. Integrate relevant data from multiple sources
For a digital twin to represent the state of assets in real time, we need to combine data from many different systems: SCADA, IIoT sensors, EAM, maintenance schedules, etc. Supporting infrastructure must also be prepared, including secure and reliable data streaming mechanisms and dedicated, specialized data storage. All of this can be costly to install on a greenfield project or to retrofit into an existing OT/IT estate, so it is crucial to understand the use case and the return on investment anticipated from the digital twin.
3. Validate against historical data
Simulation results must be compared to actual historical data to have confidence in the digital twin. Historical data contain records of unusual, unexpected, or disruptive events, and including these elements will ensure that the digital twin can accurately detect and respond to such events. Feeding the digital twin only expected "steady state" operational data during its development and commissioning is counterproductive, as it will lead to erroneous output when rare events occur.
4. Incorporate external data
Another way to enrich the value of the digital twin is to incorporate external data. For example, data on natural gas production and consumption can improve the predictive capability of a power grid digital twin because it provides information on the supply of electricity coming from gas-powered generators. But reliable real-time data may be unavailable or costly to purchase. Ultimately, as with hardware and software investment, the cost of augmenting data will be a question of value.
5. Identify acceptable data latency
It took only two hours for 55 million consumers to lose power during the 2003 Northeastern Blackout. A digital twin must be able to propose a contingency plan quickly enough to be valuable. Timing, of course, will depend on the nature of the problem and the field of application. For example, the speed of a cascading failure in a power grid is very different from that in a logistics network. Whether a digital twin can recommend a contingency plan within an acceptable timeframe is essential to its usefulness.
Closing Thoughts
As society gradually moves toward sustainable energy and demand for renewable (yet intermittent) power sources grows, it will become increasingly important to focus on grid reliability. Conversations about future-proofing the power grid often revolve around micro-grid and smart-grid technologies. However, innovative techniques like graph theory can provide unique insights into undetected risks within the power grid and help prevent $6 billion catastrophes like the Northeastern Blackout of 2003. We hope that our efforts spark questions and proposals about how organizations might leverage graph analytics and digital twin technologies to improve the reliability and resilience of infrastructure, systems, and businesses.
References
1. https://en.wikipedia.org/wiki/Northeast_blackout_of_2003
2. Final Report on the August 14, 2003 Blackout in the United States and Canada: Causes and Recommendations. https://www.energy.gov/sites/prod/files/oeprod/DocumentsandMedia/BlackoutFinal-Web.pdf
3. https://en.wikipedia.org/wiki/Cascading_failure
4. https://en.wikipedia.org/wiki/Cascade_effect
5. A formatted (CSV) version of the dataset used in this project is available from https://zenodo.org/record/3905429#.YX1vrmDMKUn. The original source of this data, which is in PowerWorld format, and its details are available at https://electricgrids.engr.tamu.edu/electricgrid-test-cases/activsg2000.