EV Aggregator Optimization Breakthrough with Hybrid Reinforcement Learning

As the global transition to electric mobility accelerates, the role of electric vehicles (EVs) is evolving beyond mere transportation tools. They are increasingly being recognized as dynamic, bidirectional energy assets capable of supporting grid stability and participating in energy markets. However, the challenge of harnessing this potential at scale remains complex. A single EV offers minimal energy capacity, insufficient to meet the thresholds for direct market participation. This is where Electric Vehicle Aggregators (EVAs) come into play—entities that pool thousands of EVs into a unified, flexible resource capable of engaging in energy arbitrage and providing ancillary services such as frequency regulation and peak shaving.

Despite their promise, current EVA operational models face significant limitations. Traditional optimization methods, while mathematically rigorous, often rely on deterministic assumptions or require extensive modeling of uncertain variables like user behavior and market prices. These approaches can lead to overly conservative or suboptimal strategies when real-world conditions deviate from predictions. Meanwhile, rule-based power distribution methods within charging stations lack adaptability, failing to balance economic gains with battery health and user satisfaction.

A groundbreaking study led by Kong Yueping and Yang Shihai from the Marketing Service Center of State Grid Jiangsu Electric Power Co., Ltd. introduces a novel solution: a hybrid action reinforcement learning algorithm designed to optimize both market bidding and internal power allocation decisions simultaneously. Published in Computer Engineering with a DOI of 10.19678/j.issn.1000-3428.0068701, this research marks a significant leap forward in intelligent EVA management systems.

The core innovation lies in the integration of continuous and discrete decision-making within a single reinforcement learning framework. Unlike conventional models that treat bidding and power distribution as separate problems, this approach unifies them under a joint optimization process. The algorithm employs continuous actions to fine-tune the amount of energy and reserve capacity the aggregator bids into the day-ahead and balancing markets. Simultaneously, it uses discrete actions to dynamically switch between different power allocation strategies among the connected EVs—specifically, proportional distribution and priority-based distribution.

This dual-action mechanism enables the aggregator to respond more intelligently to fluctuating market signals and changing fleet conditions. For instance, during periods of low electricity prices, the system may choose to charge aggressively while using a proportional strategy that evenly distributes power across vehicles, preserving their individual flexibility for future use. Conversely, when discharging during high-price intervals, the model might switch to a priority strategy, selecting vehicles with higher state-of-charge and shorter remaining parking times to discharge first, thereby minimizing battery degradation and ensuring user needs are met.
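The two allocation modes described above can be sketched in a few lines. This is an illustrative simplification, not the paper's implementation: the `EV` fields and the priority ordering (highest state-of-charge first, earliest departure as tiebreaker) are assumptions based on the behavior the article describes.

```python
from dataclasses import dataclass

@dataclass
class EV:
    soc: float                  # state of charge, 0..1
    hours_to_departure: float   # remaining parking time
    max_power_kw: float         # charger/battery power limit

def proportional_allocation(evs, total_kw):
    """Split total power across EVs in proportion to their power limits."""
    cap = sum(ev.max_power_kw for ev in evs)
    if cap == 0:
        return [0.0] * len(evs)
    return [total_kw * ev.max_power_kw / cap for ev in evs]

def priority_discharge(evs, total_kw):
    """Discharge first from EVs with high SoC and imminent departure."""
    order = sorted(range(len(evs)),
                   key=lambda i: (-evs[i].soc, evs[i].hours_to_departure))
    out = [0.0] * len(evs)
    remaining = total_kw
    for i in order:
        p = min(evs[i].max_power_kw, remaining)
        out[i] = p
        remaining -= p
        if remaining <= 0:
            break
    return out
```

The discrete action in the hybrid policy then amounts to choosing which of these two functions handles the current time step.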

Crucially, the algorithm is built upon a refined model of EV aggregation flexibility—one that incorporates the real-time value of flexibility as determined by wholesale market prices. Previous models often assessed flexibility in physical terms alone—how much power could be shifted when—but neglected the economic dimension. This new method evaluates flexibility not just by its magnitude, but by its revenue-generating potential at any given hour. By maximizing the total daily value of flexibility, the model ensures that operational decisions are aligned with financial performance.

To achieve this, the researchers developed a preprocessing step that calculates, for each EV, the upper and lower bounds of its feasible charging power over time. These bounds are derived from a constrained optimization problem that considers the vehicle’s arrival and departure times, initial and required final energy levels, battery capacity limits, and charging rate constraints. Importantly, the objective function weights each unit of flexibility by the corresponding hour’s market price, ensuring that flexibility is preserved where it is most valuable.
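A common way to express such bounds is as a cumulative-energy envelope per vehicle: the upper trajectory charges at full power until the battery is full, while the lower trajectory defers charging as long as the departure target can still be met. The sketch below shows only this physical envelope; the paper's preprocessing additionally weights each hour's flexibility by the market price, which is omitted here.

```python
def energy_envelope(e_init, e_req, e_cap, p_max, hours):
    """Cumulative-energy bounds (kWh) for one EV over its parking window.

    Upper bound: charge at maximum power until capacity is reached.
    Lower bound: the least energy that still allows the departure
    requirement e_req to be met at full power in the remaining hours.
    """
    upper, lower = [], []
    for t in range(1, hours + 1):
        upper.append(min(e_init + p_max * t, e_cap))
        lower.append(max(e_init, e_req - p_max * (hours - t)))
    return upper, lower
```

The gap between the two trajectories at any hour is that vehicle's flexibility; summing the envelopes over the fleet yields the aggregate flexibility the agent can bid.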

Once the aggregate flexibility envelope is established—representing the total upward and downward regulation capacity available to the EVA—the reinforcement learning agent takes over. The agent operates within a Markov Decision Process (MDP) framework, where each time step corresponds to a decision epoch, typically aligned with market bidding intervals. The state space includes both exogenous factors—such as current and forecasted energy and reserve prices—and endogenous variables like the total energy stored across all connected EVs, the current aggregate power demand, and the flexibility headroom remaining.
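In code, assembling such a state is simply a concatenation of the exogenous and endogenous quantities into one observation vector. The exact field set below is an assumption based on the variables the article lists.

```python
def build_state(prices_now, prices_forecast, total_energy_kwh,
                aggregate_demand_kw, headroom_up_kw, headroom_down_kw):
    """Flatten market prices (exogenous) and fleet quantities
    (endogenous) into a single observation vector for the agent."""
    return [*prices_now, *prices_forecast,
            total_energy_kwh, aggregate_demand_kw,
            headroom_up_kw, headroom_down_kw]
```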

The reward function is carefully designed to reflect the true economic and operational goals of the EVA. It penalizes market procurement costs, battery degradation, unmet charging demands, and deviations from scheduled power delivery. Battery degradation is modeled using a polynomial cost function that accounts for nonlinear wear effects associated with high charging and discharging rates. This ensures that the agent does not exploit vehicles in a way that shortens their lifespan, which would be economically unsustainable in the long run.
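A minimal sketch of such a reward, with the degradation term polynomial in the charge/discharge rate so that high rates are penalized superlinearly, might look as follows. The coefficients are placeholders, not the paper's values.

```python
def step_reward(market_cost, charge_kw, discharge_kw,
                unmet_demand_kwh, schedule_deviation_kw,
                k1=0.01, k2=0.001, lam_unmet=5.0, lam_dev=2.0):
    """Reward = -(procurement cost + degradation + penalties).

    Degradation grows quadratically with the total power rate, so
    aggressive cycling costs the agent more than gentle cycling.
    """
    rate = charge_kw + discharge_kw
    degradation = k1 * rate + k2 * rate ** 2
    return -(market_cost + degradation
             + lam_unmet * unmet_demand_kwh
             + lam_dev * abs(schedule_deviation_kw))
```

Because the agent maximizes cumulative reward, every cost and penalty enters with a negative sign; tuning the `lam_*` weights trades off user satisfaction against market revenue.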

The learning algorithm itself is an enhanced version of Proximal Policy Optimization (PPO), adapted to handle hybrid action spaces. Traditional PPO is well-suited for continuous control tasks but struggles with mixed discrete-continuous outputs. The team’s solution features a dual-branch actor network: one branch outputs parameters for a beta distribution governing the continuous bidding actions, while the other produces a Bernoulli probability for the discrete strategy selection. Both branches share a common feature extraction layer, allowing the model to learn a unified representation of the environment state before branching into specialized decision paths.

A shared critic network estimates the state value function, providing a consistent baseline for evaluating the quality of actions. This architecture avoids the complexity of estimating state-action values for hybrid spaces and promotes coordination between the two types of decisions. The training process uses clipped surrogate objectives to ensure stable policy updates, preventing large, destabilizing changes that could derail learning.
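The clipped surrogate objective mentioned above is compact enough to state directly. For each sample, PPO takes the minimum of the unclipped and clipped ratio-weighted advantage, which caps how far one update can move the policy:

```python
def clipped_surrogate(ratio, advantage, eps=0.2):
    """PPO clipped objective for one sample.

    ratio:     pi_new(a|s) / pi_old(a|s)
    advantage: estimated advantage of action a in state s
    eps:       clip range; updates gain nothing from pushing the
               ratio outside [1 - eps, 1 + eps].
    """
    clipped = min(max(ratio, 1.0 - eps), 1.0 + eps)
    return min(ratio * advantage, clipped * advantage)
```

In the hybrid setting, the ratio combines the Beta branch's density for the continuous action with the Bernoulli branch's probability for the discrete one, so both decisions are updated under the same stability guarantee.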

The results of the simulation study are compelling. Using realistic data including hourly wholesale electricity prices and synthetic EV charging profiles based on truncated normal and uniform distributions, the model was tested against several benchmark approaches. These included standard PPO algorithms paired with fixed allocation strategies—either proportional or priority-based—as well as Soft Actor-Critic (SAC) variants, another popular deep reinforcement learning method.

Across multiple performance metrics, the proposed Hybrid PPO (HPPO) algorithm outperformed all competitors. It achieved the highest cumulative reward, indicating superior long-term profitability. More importantly, it demonstrated a nuanced understanding of trade-offs. During nighttime charging windows, when electricity prices were lowest, the algorithm favored the proportional strategy, distributing charging power evenly to maintain the bidirectional flexibility of the entire fleet. This approach prevented individual batteries from reaching full charge too early, preserving their ability to respond to upward regulation signals later in the day.

In contrast, during afternoon and evening discharge periods, the HPPO agent consistently switched to the priority strategy. By preferentially discharging vehicles with ample charge and imminent departure, it minimized the number of deep discharge cycles on any single battery. This strategic switching led to significantly lower battery degradation costs compared to static strategy models. The data showed that while the priority-only approach reduced degradation, it sometimes led to premature saturation or depletion of certain vehicles, reducing overall system flexibility. The proportional-only approach, while maintaining flexibility, incurred higher degradation due to more frequent partial cycling across all vehicles.

The HPPO model struck an optimal balance, achieving both high revenue and low operational costs. On average, it reduced total daily operating costs by 1.9% compared to the priority-based PPO and by 3.2% compared to the proportional variant. Even more striking was its performance relative to SAC-based methods, which failed to converge within the training timeframe and exhibited higher variance in outcomes. This highlights a key advantage of on-policy algorithms like PPO in environments where data distribution shifts rapidly—a common scenario in dynamic energy markets.

From a computational standpoint, the HPPO model proved efficient. With an average per-step inference time of just over 5 milliseconds, it is well within the range required for real-time decision-making in fast-moving markets. The training convergence time was also reasonable, allowing the model to stabilize after approximately one million steps—a feasible target for offline training using historical data.

One of the most significant implications of this work is its potential to reshape how EVAs interact with both the grid and their customers. By maximizing the value of flexibility while respecting user constraints, the algorithm supports a win-win scenario: grid operators gain access to reliable, responsive resources, EV owners receive compensation for participation without compromising their mobility needs, and aggregators improve their bottom line.

Moreover, the model’s ability to dynamically adapt its internal power allocation strategy represents a shift from rigid, rule-based systems to intelligent, context-aware management. This adaptability is essential as EV fleets grow in size and diversity, encompassing everything from daily commuter cars to ride-sharing vehicles with unpredictable schedules.

The research also underscores the importance of moving beyond purely physical models of flexibility. By anchoring flexibility valuation in real market prices, the model ensures that decisions are economically rational rather than merely technically feasible. This market-aware perspective is critical as power systems transition toward greater decentralization and price-responsive demand.

Looking ahead, the authors suggest several avenues for extension. These include incorporating additional market types such as capacity markets or local flexibility markets, integrating vehicle-to-home (V2H) and vehicle-to-building (V2B) use cases, and enhancing the model to handle heterogeneous battery chemistries and aging effects more explicitly. There is also potential to integrate user preference modeling, allowing EV owners to specify their tolerance for battery wear or preferred charging timelines.

In practical terms, this algorithm could be deployed within existing EVA platforms, either as a cloud-based service or embedded in local control systems at large charging hubs. Its modular design allows for incremental integration, starting with market bidding optimization and gradually incorporating dynamic strategy switching as confidence in the model grows.

For utilities and grid operators, the widespread adoption of such intelligent aggregation systems could dramatically increase the availability of fast-responding, low-carbon flexibility. This would enhance grid resilience, facilitate higher penetration of renewable energy, and reduce reliance on fossil-fuel peaking plants.

For EV owners, the benefits are equally tangible. Participation in grid services could translate into lower charging costs, direct financial incentives, or even loyalty rewards. The assurance that their vehicle’s battery is being managed in a way that minimizes wear adds a layer of trust and transparency that is often missing in current V2G programs.

The success of this hybrid reinforcement learning approach also signals a broader trend in energy system optimization: the move from model-based to data-driven decision-making. As machine learning techniques mature and computational resources become more accessible, we are likely to see a proliferation of AI-powered energy management systems that learn from experience rather than relying solely on predefined rules.

This does not mean the end of traditional optimization methods. On the contrary, the two approaches can be complementary. Model-based methods provide valuable insights into system behavior and help generate training data, while data-driven models excel in handling uncertainty and adapting to changing conditions.

What makes this research particularly impactful is its grounding in real-world operational challenges. The authors, based at a major utility’s marketing service center, bring practical industry experience to the table. Their focus on economic viability, battery longevity, and user satisfaction reflects the priorities of actual market participants, not just academic abstractions.

As the number of EVs on the road continues to climb—projected to reach nearly 100 million in China by 2030 and over 140 million by 2040—the need for smart aggregation solutions will only intensify. Without intelligent coordination, uncontrolled charging could strain local grids and undermine the environmental benefits of electrification. With tools like the one developed by Kong, Yang, and their colleagues, however, EVs can become a stabilizing force in the energy transition.

In conclusion, this study presents a sophisticated yet practical framework for optimizing electric vehicle aggregator operations. By combining a market-aware flexibility model with a hybrid-action reinforcement learning algorithm, it achieves a level of coordination and economic efficiency previously unattainable. The ability to dynamically switch between power allocation strategies based on real-time conditions represents a major step forward in the quest to unlock the full potential of EVs as grid resources.

The implications extend far beyond a single algorithm. They point to a future where distributed energy resources are not just managed, but intelligently orchestrated—where software learns to balance technical constraints, economic signals, and human needs in real time. As the energy landscape evolves, such innovations will be essential to building a cleaner, more resilient, and more responsive power system.

Source: Kong Yueping, Yang Shihai, et al., Marketing Service Center of State Grid Jiangsu Electric Power Co., Ltd. Published in Computer Engineering, DOI: 10.19678/j.issn.1000-3428.0068701.
