New AI-Driven Method Stabilizes Grid Voltage Amid Renewable Surge
As electric vehicles (EVs) and renewable energy sources like solar and wind become mainstream, power distribution networks face unprecedented challenges. Voltage instability, once a rare occurrence, is now a daily reality in many active distribution systems. The root of the problem lies in the unpredictable and often rapid power fluctuations generated by photovoltaic arrays and fast-charging EV stations. These fluctuations, sometimes exceeding 15% of rated capacity within a minute, can push node voltages beyond safe limits, threatening grid reliability and equipment lifespan.
Traditional voltage regulation methods, relying on devices such as voltage regulators (VRs), switchable capacitor reactors (SCRs), and energy storage systems (ESS), are struggling to keep pace. These devices operate on different time scales—VRs and SCRs are slow-acting, designed to minimize mechanical wear, while inverters from distributed generators (DGs) and static var compensators (SVCs) can respond in seconds. This mismatch in response times creates a complex coordination problem. When high-speed inverters and low-speed mechanical devices interact without a unified control strategy, it can lead to excessive switching, control saturation, and even voltage collapse.
Attempts to solve this using conventional optimization models have hit a wall. The problem is inherently non-convex and involves a mix of continuous and discrete decision variables, such as the continuous charging power of a battery and the discrete tap position of a voltage regulator. Solving this as a large-scale mixed-integer nonlinear optimization problem is computationally prohibitive, especially for real-time applications: the problem is NP-hard, and solution times grow exponentially with network size. This has spurred interest in data-driven approaches, particularly deep reinforcement learning (DRL), which can learn optimal control policies from experience without needing an explicit mathematical model of the entire grid.
However, existing DRL methods have their own limitations. Deep Q-Networks (DQN), for example, are effective for discrete actions but fail when continuous variables are involved. Conversely, algorithms like Deep Deterministic Policy Gradient (DDPG) excel with continuous actions but cannot handle discrete ones. In a network with multiple VRs and SCRs, each with numerous discrete tap positions, a DQN approach leads to a “curse of dimensionality,” where the action space becomes so vast that learning becomes inefficient and unstable. Multi-agent DRL solutions have been proposed to mitigate this, but they introduce new complexities in coordination and convergence.
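To see how quickly the joint discrete action space blows up for a DQN, consider a back-of-the-envelope count; the device numbers below are hypothetical, chosen for illustration rather than taken from the paper:

```python
# Back-of-the-envelope count of the joint discrete action space a DQN would
# have to enumerate. Device counts are hypothetical, for illustration only.
vr_taps, num_vrs = 33, 3      # 3 voltage regulators, 33 tap positions each
scr_states, num_scrs = 2, 4   # 4 switchable capacitor banks, on/off

joint_actions = vr_taps**num_vrs * scr_states**num_scrs
print(joint_actions)          # 33**3 * 2**4 = 574,992 discrete joint actions
```

Even this modest device count yields over half a million joint actions, each of which a plain DQN would need a separate output for.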
A breakthrough solution has emerged from a collaborative research effort led by Jian Zhang from Hefei University of Technology, Mingjian Cui from Tianjin University, and Yigang He from Wuhan University. Their work, published in the prestigious Transactions of China Electrotechnical Society, presents a novel dual-timescale voltage coordination strategy that blends the strengths of data-driven learning with the rigor of physical modeling. This hybrid approach is designed to overcome the fundamental limitations of purely model-free and purely model-based methods alike.
The core innovation lies in its hierarchical, two-layer control architecture. The first layer operates on a slow timescale—typically hourly—and is responsible for setting the long-term operating points of slow-acting devices: the tap ratios of VRs, the switching states of SCRs, and the charging/discharging power of ESS. This is where the complexity of mixed discrete-continuous actions is most acute. To solve this, the researchers developed an adapted Deep Deterministic Policy Gradient (DDPG) algorithm. The key insight is a three-step process: relaxation, forecasting, and correction.
In the relaxation phase, the discrete tap positions of VRs and SCRs are treated as continuous variables, allowing the DDPG’s actor network to output a “prototype action” that includes both continuous (ESS power) and relaxed discrete (VR ratio, SCR state) components. In the forecasting phase, instead of forcing this prototype into a single discrete action, the algorithm searches the discrete action space for the K nearest neighbors to the relaxed values. This creates a small, manageable set of candidate actions. In the correction phase, each of these K candidates is paired with the continuous ESS power from the prototype and evaluated by the critic network to determine its expected long-term value. The action with the highest value is then selected for implementation.
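To make the mechanism concrete, here is a minimal sketch of that selection step. The actor and critic are stand-in functions, and the discrete grid, K value, and all numbers are illustrative; none of them come from the paper:

```python
import numpy as np

# Illustrative discrete options for the slow devices (not the paper's values).
VR_RATIOS = np.linspace(0.95, 1.05, 11)   # voltage-regulator tap ratios
SCR_STATES = np.array([0.0, 1.0])         # capacitor bank off/on

# The joint discrete grid, small enough to enumerate for this sketch.
DISCRETE_GRID = np.array([[r, s] for r in VR_RATIOS for s in SCR_STATES])

def actor(state):
    """Stand-in for the DDPG actor: a 'prototype action' holding a relaxed
    VR ratio, a relaxed SCR state, and a continuous ESS power setpoint."""
    return np.array([1.0 + 0.02 * state[0], 0.6, 0.3 * state[1]])

def critic(state, action):
    """Stand-in for the DDPG critic Q(s, a); any smooth scorer works here."""
    target = np.array([1.0, 1.0, 0.2])
    return -np.sum((action - target) ** 2) - 0.1 * np.sum(state ** 2)

def select_action(state, k=5):
    proto = actor(state)
    # Relaxation: proto[:2] are the discrete devices treated as continuous.
    # Forecasting: find the K nearest discrete neighbors of the relaxed part.
    dists = np.linalg.norm(DISCRETE_GRID - proto[:2], axis=1)
    candidates = DISCRETE_GRID[np.argsort(dists)[:k]]
    # Correction: pair each candidate with the continuous ESS power,
    # score with the critic, and keep the highest-value action.
    full_actions = [np.append(c, proto[2]) for c in candidates]
    scores = [critic(state, a) for a in full_actions]
    return full_actions[int(np.argmax(scores))]

print(select_action(np.array([0.5, -0.2])))
```

The sketch covers only action selection; in the full training loop, the chosen action would also be executed and stored in a replay buffer for actor and critic updates.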
This “relaxation-forecasting-correction” mechanism is a masterstroke. It allows the smooth, gradient-based learning of DDPG to guide the search in the discrete space, avoiding the combinatorial explosion of a pure DQN approach. Because only K candidates (e.g., 20 or 40) are evaluated per decision, the computational load stays low, yet the policy can still converge to a near-optimal solution. The researchers demonstrated that this method trains far more stably and converges much faster than existing multi-agent DQN methods.
The second layer of the control system operates on a fast timescale—every 5 to 15 minutes—and handles the rapid voltage fluctuations caused by the intermittent nature of renewables and EV charging. At this layer, the slow-timescale decisions (VR taps, SCR states, ESS power) are treated as fixed parameters. Given this context, the system then calculates the optimal reactive power output for all DG inverters and SVCs. This is done not with another DRL agent, but by solving a physics-based quadratic programming (QP) model derived from the branch flow equations of the distribution network.
This is where the physical modeling shines. The QP model explicitly enforces the laws of physics—Kirchhoff’s laws and power flow equations—ensuring that the solution is physically feasible and respects all operational constraints, such as inverter capacity limits and voltage bounds. By minimizing the sum of squared voltage deviations across all nodes, the model produces a mathematically optimal reactive power dispatch for the current network state. Because the slow-timescale variables are fixed, this QP problem is convex and can be solved to global optimality in milliseconds using standard solvers like MOSEK.
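For intuition, here is a minimal sketch of such a fast-timescale QP in cvxpy, assuming a linearized sensitivity model v = v0 + S·q rather than the paper's exact branch-flow formulation; the network size, sensitivity matrix, and limits are all illustrative:

```python
import cvxpy as cp
import numpy as np

# Toy data: 4 monitored nodes, 3 inverters/SVCs (all values illustrative).
v0 = np.array([1.03, 0.97, 1.04, 0.96])   # voltages before dispatch (pu)
S = np.array([[0.020, 0.010, 0.005],      # dv/dq sensitivity matrix (pu/pu)
              [0.010, 0.025, 0.010],
              [0.005, 0.010, 0.030],
              [0.004, 0.008, 0.020]])
q_max = np.array([0.5, 0.3, 0.4])         # reactive capacity limits (pu)

q = cp.Variable(3)                        # inverter/SVC reactive outputs
v = v0 + S @ q                            # linearized node voltages

problem = cp.Problem(
    cp.Minimize(cp.sum_squares(v - 1.0)), # sum of squared voltage deviations
    [cp.abs(q) <= q_max,                  # inverter capacity limits
     v >= 0.95, v <= 1.05])               # statutory voltage band
problem.solve()                           # convex QP, solved in milliseconds

print("optimal cost:", problem.value)
print("reactive dispatch (pu):", np.round(q.value, 4))
```

Because every term is linear or convex quadratic in q, any standard QP solver returns the global optimum, which is exactly what makes this layer fast enough for real-time dispatch.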
The two layers are tightly coupled through the reward structure of the DRL agent. The cost for each hourly period in the MDP is defined as the sum of the optimal objective values from all the fast-timescale QP optimizations within that hour. In essence, the DRL agent learns to set the slow-timescale variables in a way that minimizes the total “effort” required by the fast-timescale controllers to maintain voltage stability. This creates a feedback loop where the long-term policy is directly informed by the short-term physical consequences of its actions.
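In code, that coupling could look like the following sketch; `solve_fast_qp`, the slot count, and the sign convention are assumptions for illustration, not the paper's implementation:

```python
def hourly_cost(slow_action, fast_states, solve_fast_qp, slots_per_hour=4):
    """Cost of one slow-timescale (hourly) decision: the sum of the optimal
    fast-timescale QP objectives over that hour's 15-minute slots."""
    return sum(solve_fast_qp(slow_action, fast_states[t])
               for t in range(slots_per_hour))

# The DRL agent receives reward = -hourly_cost(...), so it learns slow-device
# settings that minimize the fast controllers' total corrective effort.
```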
The performance of this dual-timescale strategy was rigorously tested on two standard IEEE benchmark systems: the 33-node balanced network and the 123-node unbalanced network. In both cases, the results were compelling. When the slow-timescale devices were set to random or fixed values, the average hourly voltage deviation cost was catastrophically high, indicating severe and frequent voltage violations. A traditional, single-timescale mixed-integer QP model, which optimizes all devices simultaneously over a 24-hour horizon, was used as a gold standard for near-optimal performance. This model achieved very low costs but required over 78 seconds for the 33-node system and 189 seconds for the 123-node system—far too slow for real-time use.
The proposed method, after training on 600 days of simulated data, achieved performance remarkably close to this gold standard. For the 33-node system, the average cost converged to 0.0262 (pu), compared to the optimal 0.0207 (pu). For the 123-node system, it reached 0.0410 (pu) versus the optimal 0.0349 (pu). The true triumph, however, was speed. The average time to compute all control actions for a single day was just 1.7 to 2.1 seconds for the 33-node system and 8.9 to 10.5 seconds for the 123-node system, a speedup of 36.7 and 18.0 times, respectively, over the traditional optimization. More importantly, the per-slot computation time was a mere 7.4 milliseconds and 73 milliseconds, well within the requirements for real-time control.
The research also provided valuable insights into the practical tuning of the algorithm. The number of nearest neighbors, K, was found to be a critical hyperparameter. Setting K=1, which essentially forces the prototype action to the single closest discrete point, resulted in a highly unstable training process with large fluctuations in performance. In contrast, setting K=20 for the 33-node system and K=40 for the 123-node system led to a smooth, rapid convergence to a stable, high-performance policy. This demonstrates that a small degree of “exploration” in the discrete space is essential for robust learning.
Another key finding was the superior performance of this single-agent, adapted DDPG approach compared to multi-agent DQN methods reported in the literature. The training curves showed a much faster and smoother convergence, which the authors attribute to the vastly smaller effective search space and the inherent stability of the DDPG framework. This is a significant advantage, as unstable training can make an algorithm impractical for real-world deployment.
The implications of this work are far-reaching. It provides a practical, scalable solution for voltage control in the modern, active distribution grid. The method is inherently flexible and can be applied to both balanced and unbalanced three-phase networks, making it suitable for real-world urban and rural settings. The separation of timescales mirrors the natural dynamics of the grid, where slow-acting mechanical devices set the stage for fast-acting electronic devices to fine-tune the voltage profile.
For utilities and grid operators, this represents a powerful tool for integrating higher levels of renewable energy and EV charging without costly infrastructure upgrades. By proactively managing voltage with a coordinated, intelligent strategy, they can prevent overvoltage and undervoltage events, extend the life of equipment like voltage regulators, and maintain power quality for all customers. The use of DRL also means the system can adapt to changing conditions over time, learning from new patterns of load and generation.
The researchers acknowledge that their current work uses a fixed training dataset and does not fully test the algorithm’s generalization to unseen conditions. Future work will focus on online learning and testing with rolling validation sets to ensure the controller remains robust in a dynamic, evolving grid. Nevertheless, the foundation they have laid is solid. By combining the adaptability of artificial intelligence with the reliability of physical laws, they have created a control system that is not only faster and more effective but also more trustworthy. This hybrid approach may well become the blueprint for the next generation of intelligent grid management systems.
Jian Zhang, Mingjian Cui, Yigang He, Transactions of China Electrotechnical Society, DOI: 10.19595/j.cnki.1000-6753.tces.222273