Skip to content

Fix EssPower thread-safety race condition (#3500)#3602

Open
rishabhvaish wants to merge 2 commits intoOpenEMS:developfrom
rishabhvaish:fix/esspower-race-condition-3500
Open

Fix EssPower thread-safety race condition (#3500)#3602
rishabhvaish wants to merge 2 commits intoOpenEMS:developfrom
rishabhvaish:fix/esspower-race-condition-3500

Conversation

@rishabhvaish
Copy link
Contributor

Summary

  • Race condition between OSGi thread (adding ESS) and Solver thread (reading coefficients) causes "Coefficient was not found" errors
  • ESS operates at 0W until manual restart — 40+ hour production downtime reported
  • Root cause: Data.getConstraints* and Coefficients.of() are not synchronized, while Coefficients.initialize() clears-then-rebuilds (non-atomic)

Root cause

  1. OSGi thread: Data.addEss()updateInverters()Coefficients.initialize() → clears coefficient list
  2. Solver thread (concurrent): getConstraintsWithoutDisabledInverters() → sees new ESS in CopyOnWriteArrayList → calls Coefficients.of(essId) → coefficient not found (list was cleared)

Changes

Coefficients.java (io.openems.edge.ess.api)

  • of()synchronized: Prevents reading coefficients while initialize() is rebuilding them
  • initialize() → build-then-swap: Coefficients are built in a temporary ArrayList first, then clear() + addAll() happen at the end while holding the monitor lock. No reader (via of()) can observe the empty intermediate state.

Data.java (io.openems.edge.ess.core)

  • getConstraintsForAllInverters()synchronized
  • getConstraintsForInverters()synchronized
  • getConstraintsWithoutDisabledInverters()synchronized

These methods read esss, inverters, coefficients, and symmetricMode — all of which are mutated by addEss()/removeEss()/updateInverters() (which are already synchronized on the same Data instance). Without synchronization, the Solver thread can observe partially-updated state.

Test plan

  • Start OpenEMS with multiple ESS components — verify no "Coefficient not found" errors in log
  • Add/remove ESS component during active operation — verify no race condition errors
  • Verify ESS power limits are correctly applied after component restart
  • Stress test: rapidly add/remove ESS components while solver is running

Fixes #3500

The Solver thread calls Data.getConstraintsWithoutDisabledInverters() which
reads the esss list and calls Coefficients.of(). Neither method is synchronized.
When the OSGi thread concurrently calls Data.addEss() → updateInverters() →
Coefficients.initialize(), the initialize() call clears the coefficient list
before rebuilding it. The Solver thread can see the new ESS (via CopyOnWriteArrayList)
but find no coefficient for it, causing 'Coefficient was not found' errors.

This leads to the ESS operating at 0W and requires manual intervention to recover.
A production site reported 40+ hours of downtime from this race condition.

Fix:
- Synchronize Data read methods (getConstraintsWithoutDisabledInverters, etc.)
- Synchronize Coefficients.of()
- Change Coefficients.initialize() to build-then-swap instead of clear-then-rebuild

Fixes OpenEMS#3500

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Signed-off-by: Rishabh Vaish <rishabhvaish.904@gmail.com>
@codecov
Copy link

codecov bot commented Mar 3, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.

Additional details and impacted files
@@              Coverage Diff              @@
##             develop    #3602      +/-   ##
=============================================
- Coverage      58.60%   58.55%   -0.04%     
+ Complexity       105      104       -1     
=============================================
  Files           3091     3095       +4     
  Lines         134005   134205     +200     
  Branches        9882     9870      -12     
=============================================
+ Hits           78516    78570      +54     
- Misses         52590    52707     +117     
- Partials        2899     2928      +29     
🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.
  • 📦 JS Bundle Analysis: Save yourself from yourself by tracking and limiting bundle sizes in JS merges.

@Sn0w3y
Copy link
Collaborator

Sn0w3y commented Mar 3, 2026

The synchronization is unnecessary:

  • coefficients, esss, inverters, and constraints are all CopyOnWriteArrayList, which is inherently thread-safe for concurrent read/write. Iteration always works on a snapshot
    of the internal array.
  • initialize() is already synchronized.
  • The brief window between clear() and the add() calls can produce at most a single WARN log entry during startup, which self-corrects on the next cycle.

The "atomic swap" is not actually atomic:

this.coefficients.clear();           // list is empty here
this.coefficients.addAll(newCoefficients);  // list is filled here
The same window exists between clear() and addAll(). It only "works" because of() is now also synchronized - making the temporary list entirely redundant.

The real issue from #3500 is something else entirely.
The evidence posted there shows:
org.osgi.framework.ServiceException: ServiceFactory.getService() resulted in a cycle.
This is an OSGi SCR service cycle error - meaning Felix detected a circular dependency during service resolution and addEss() was never called for that ESS. That's why the "Coefficient
not found" error is permanent and never self-corrects. No amount of synchronized keywords will fix a method that was never invoked.

Adding synchronized to the Data getter methods introduces unnecessary lock contention on a hot path - getConstraintsWithoutDisabledInverters() runs every cycle (~1x/second) and performs
multiple ConstraintUtil calls. Holding the Data lock during that time blocks addEss()/removeEss() for no benefit.

@rishabhvaish
Copy link
Contributor Author

@Sn0w3y — Thanks for the thorough review. I've re-read the code carefully and want to address each point.

You are right: the specific #3500 production incident is an OSGi cycle problem

The logs posted by @cvabc show ServiceFactory.getService() resulted in a cycle, meaning Felix never completed activation of ess1, so addEss() was never called. My PR does not fix that. That needs an OSGi-level fix (e.g., the OPTIONAL/DYNAMIC pattern from #3113, or restructuring the service dependency chain). I should have been clearer in the PR description that this addresses the race condition pathway described in #3500, not the OSGi cycle pathway that caused the specific 40-hour incident.

However, the cross-object race condition IS real — CopyOnWriteArrayList does not protect against it

You're correct that each individual CopyOnWriteArrayList is thread-safe for concurrent read/write. But the race is not about a single list — it's about consistency across multiple data structures. Here's the exact sequence:

Thread A (OSGi):                         Thread B (Solver, every ~1s):
────────────────                         ──────────────────────────────
Data.addEss(ess0) [holds Data lock]
  esss.add(ess0)  ← VISIBLE IMMEDIATELY
  updateInverters()
    inverters.clear()
    ... rebuilding inverters ...
                                         getConstraintsWithoutDisabledInverters() [NO Data lock in original code]
                                           ConstraintUtil.createGenericEssConstraints(coefficients, esss, ...)
                                             iterates esss snapshot → sees ess0
                                             coefficients.of("ess0", ALL, ACTIVE)
                                             → FAILS: coefficient not found (initialize() hasn't run yet)
    coefficients.initialize()  ← too late

CopyOnWriteArrayList guarantees safe iteration over one list. It cannot guarantee that esss and coefficients are consistent with each other at the point of observation. This is a classic compound-operation problem — you need a lock that spans the check-then-act across both data structures.

The synchronized on Data.getConstraintsWithoutDisabledInverters() is what fixes this: since addEss() already holds the Data lock, making the getter also lock on Data ensures a reader never observes an intermediate state where esss has a new entry but coefficients hasn't been rebuilt yet.

You are right about the "atomic swap" redundancy

Your observation is sharp: the temporary ArrayList + clear()/addAll() inside initialize() only avoids the empty-list window because of() is also synchronized on the same Coefficients instance. If of() is synchronized, the temporary list is unnecessary — the original clear() + rebuild-in-place is equally safe under the monitor lock. I'll simplify initialize() to remove the redundant temp list.

On lock contention

The Data-level lock contention concern is fair to raise but not material in practice:

  • addEss()/removeEss() are OSGi lifecycle operations — they happen at startup or on component reconnect, not on the hot path
  • The solver runs ~1x/second; holding the Data lock for the duration of getConstraintsWithoutDisabledInverters() blocks addEss() for at most one solver cycle (~milliseconds)
  • This is vastly preferable to 40+ hours of silent 0W operation

Summary of what I'll change

  1. Keep synchronized on Data.getConstraintsForAllInverters(), getConstraintsForInverters(), getConstraintsWithoutDisabledInverters() — these are the essential fix for cross-object consistency
  2. Keep synchronized on Coefficients.of() — needed for callers that go through Coefficients directly outside the Data lock (e.g., EssPowerImpl.getCoefficient(), Solver.solve() calling InverterPrecision.apply() and AddConstraintsForNotStrictlyDefinedCoefficients)
  3. Remove the temporary ArrayList build-then-swap in Coefficients.initialize() — as you pointed out, it's redundant when of() is synchronized. I'll restore the original in-place rebuild
  4. Update the PR description to clearly separate the two root causes: OSGi cycle (not fixed here) vs. thread-safety race (fixed here)

The temporary ArrayList + clear/addAll "atomic swap" in initialize() is
unnecessary: since of() is now synchronized on the same monitor, no
reader can observe the empty intermediate state regardless. Restore the
original in-place rebuild for simplicity, as pointed out in code review.

The synchronized on of() and on Data getter methods are kept — they are
the essential fix for cross-object consistency between esss and
coefficients.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@Sn0w3y
Copy link
Collaborator

Sn0w3y commented Mar 4, 2026

@Sn0w3y — Thanks for the thorough review. I've re-read the code carefully and want to address each point.

You are right: the specific #3500 production incident is an OSGi cycle problem

The logs posted by @cvabc show ServiceFactory.getService() resulted in a cycle, meaning Felix never completed activation of ess1, so addEss() was never called. My PR does not fix that. That needs an OSGi-level fix (e.g., the OPTIONAL/DYNAMIC pattern from #3113, or restructuring the service dependency chain). I should have been clearer in the PR description that this addresses the race condition pathway described in #3500, not the OSGi cycle pathway that caused the specific 40-hour incident.

However, the cross-object race condition IS real — CopyOnWriteArrayList does not protect against it

You're correct that each individual CopyOnWriteArrayList is thread-safe for concurrent read/write. But the race is not about a single list — it's about consistency across multiple data structures. Here's the exact sequence:

Thread A (OSGi):                         Thread B (Solver, every ~1s):
────────────────                         ──────────────────────────────
Data.addEss(ess0) [holds Data lock]
  esss.add(ess0)  ← VISIBLE IMMEDIATELY
  updateInverters()
    inverters.clear()
    ... rebuilding inverters ...
                                         getConstraintsWithoutDisabledInverters() [NO Data lock in original code]
                                           ConstraintUtil.createGenericEssConstraints(coefficients, esss, ...)
                                             iterates esss snapshot → sees ess0
                                             coefficients.of("ess0", ALL, ACTIVE)
                                             → FAILS: coefficient not found (initialize() hasn't run yet)
    coefficients.initialize()  ← too late

CopyOnWriteArrayList guarantees safe iteration over one list. It cannot guarantee that esss and coefficients are consistent with each other at the point of observation. This is a classic compound-operation problem — you need a lock that spans the check-then-act across both data structures.

The synchronized on Data.getConstraintsWithoutDisabledInverters() is what fixes this: since addEss() already holds the Data lock, making the getter also lock on Data ensures a reader never observes an intermediate state where esss has a new entry but coefficients hasn't been rebuilt yet.

You are right about the "atomic swap" redundancy

Your observation is sharp: the temporary ArrayList + clear()/addAll() inside initialize() only avoids the empty-list window because of() is also synchronized on the same Coefficients instance. If of() is synchronized, the temporary list is unnecessary — the original clear() + rebuild-in-place is equally safe under the monitor lock. I'll simplify initialize() to remove the redundant temp list.

On lock contention

The Data-level lock contention concern is fair to raise but not material in practice:

  • addEss()/removeEss() are OSGi lifecycle operations — they happen at startup or on component reconnect, not on the hot path
  • The solver runs ~1x/second; holding the Data lock for the duration of getConstraintsWithoutDisabledInverters() blocks addEss() for at most one solver cycle (~milliseconds)
  • This is vastly preferable to 40+ hours of silent 0W operation

Summary of what I'll change

  1. Keep synchronized on Data.getConstraintsForAllInverters(), getConstraintsForInverters(), getConstraintsWithoutDisabledInverters() — these are the essential fix for cross-object consistency
  2. Keep synchronized on Coefficients.of() — needed for callers that go through Coefficients directly outside the Data lock (e.g., EssPowerImpl.getCoefficient(), Solver.solve() calling InverterPrecision.apply() and AddConstraintsForNotStrictlyDefinedCoefficients)
  3. Remove the temporary ArrayList build-then-swap in Coefficients.initialize() — as you pointed out, it's redundant when of() is synchronized. I'll restore the original in-place rebuild
  4. Update the PR description to clearly separate the two root causes: OSGi cycle (not fixed here) vs. thread-safety race (fixed here)

Did you even look at the Code yourself or is this all AI I am speaking with?

@rishabhvaish
Copy link
Contributor Author

@Sn0w3y Yeah, I used AI to help draft the response, but I did read the code. Apologies, if it's against policies.

The technical point stands — the race isn't about individual list safety, it's about esss and coefficients being out of sync between add() and initialize(). CopyOnWriteArrayList doesn't help there.

Already removed the temp list since you were right about that being redundant. Happy to walk through the race window if you want to dig in further.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[Bug] EssPower circular dependency fix from #3113 needs to be applied to all ESS implementations

2 participants