Skip to content

Commit 4da0638

Browse files
committed
revert(Session): remove database_ref_ to fix Alpine/musl crashes
Revert Session class to the simple design from b5835fa (Dec 2025), removing the Napi::ObjectReference database_ref_ that was added in commit a151cb6 and subsequent "fixes" (fb283df through 611d330). Root cause: N-API reference manipulation during GC finalization corrupts V8's JIT page allocations on Alpine/musl. The database_ref_ field caused napi_delete_reference to be called from Session destructors during GC, which is documented as unsafe in nodejs/node-addon-api#660. The simple design works because: - Session uses raw DatabaseSync* pointer (no ObjectReference) - db.close() calls DeleteAllSessions() which cleans up SQLite sessions - Session operations check database_->IsOpen() before proceeding - This matches both Node.js upstream (uses weak references) and better-sqlite3 (uses raw pointers) Validated: 793 tests pass locally, 5 consecutive runs pass on Alpine.
1 parent 611d330 commit 4da0638

File tree

3 files changed

+127
-258
lines changed

3 files changed

+127
-258
lines changed

doc/todo/P02-investigate-flaky-native-crashes.md

Lines changed: 117 additions & 191 deletions
Original file line numberDiff line numberDiff line change
@@ -7,223 +7,160 @@
77
- **Key Constraints**: Must identify root cause (memory corruption, race condition, or CI environment issue)
88
- **Success Validation**: 10 consecutive CI runs without native crashes
99

10-
## Current Status: FINAL FIX IMPLEMENTED
10+
## Current Status: ROOT CAUSE IDENTIFIED - REVERT REQUIRED
1111

12-
**Five distinct root causes found and fixed.** Testing shows no crashes in local Alpine Docker tests.
12+
**The "fixes" since commit `a151cb6` have made things WORSE.** The root cause is adding `Napi::ObjectReference database_ref_` to Session class. This causes N-API reference manipulation during GC finalization, which corrupts V8 on Alpine/musl.
1313

14-
### Evidence of Original Flakiness
15-
16-
Same commit (`80fa40d`) showed different failures across runs:
17-
18-
| Run ID | Time | Failed Alpine Jobs |
19-
| ----------- | -------- | --------------------- |
20-
| 21346776312 | 4:56:24Z | **None - all passed** |
21-
| 21346793811 | 4:57:29Z | x64/23, arm64/24 |
22-
| 21347456610 | 5:36:26Z | x64/20, arm64/23 |
23-
24-
Crash signals: `SIGSEGV` (segfault), `SIGTRAP` (assertion), `SIGABRT` (abort)
25-
Affected tests: Various, shifting between runs due to Jest worker assignment
14+
**CI was stable at commit `50354e1` (Dec 17, 2025 through Jan 12, 2026).** The crashes started after Session "fix" commits began on Jan 26, 2026.
2615

2716
---
2817

29-
## Root Cause #1: BackupJob Use-After-Free (Fixed 2026-01-26)
30-
31-
**The bug**: `BackupJob` held a raw pointer to `DatabaseSync*` but nothing prevented the database from closing while backup ran on worker thread.
18+
## Critical Research Findings
3219

33-
**How it manifested**:
20+
### 1. Node.js Upstream Uses WEAK References for Session
3421

35-
1. User starts backup (`db.backup(...)`)
36-
2. `BackupJob::Execute()` runs on worker thread
37-
3. Database closes (test cleanup, GC, etc.)
38-
4. Worker thread accesses `source_->connection()`**SIGSEGV**
39-
40-
**Fix**: Capture connection pointer at construction, track backups in database, finalize before close.
41-
42-
---
22+
From `src/upstream/node_sqlite.h`:
23+
```cpp
24+
class Session : public BaseObject {
25+
Session(Environment* env,
26+
v8::Local<v8::Object> object,
27+
BaseObjectWeakPtr<DatabaseSync> database, // WEAK pointer!
28+
sqlite3_session* session);
29+
```
4330
44-
## Root Cause #2: Session Lifecycle Issues (Fixed 2026-01-27/28)
31+
Node.js uses `BaseObjectWeakPtr` (weak reference), NOT a strong reference. This means:
32+
- Session does NOT keep database alive
33+
- Session checks if database is still valid before operations
34+
- If database is GC'd, Session operations fail gracefully
4535
46-
**Five related bugs** in Session handling caused crashes specifically on Alpine/musl:
36+
### 2. better-sqlite3 Uses Raw Pointers
4737
48-
### Bug 2a: Session Database Use-After-Free
38+
From `/home/mrm/src/better-sqlite3/src/objects/statement.cpp`:
39+
```cpp
40+
Statement::Statement(Database* db, ...) : db(db) { ... } // Raw pointer
41+
Statement::~Statement() {
42+
if (alive) db->RemoveStatement(this); // Just uses raw pointer
43+
CloseHandles();
44+
}
45+
```
4946

50-
**The bug**: `Session` stored raw `DatabaseSync*` without preventing database from being garbage collected.
47+
better-sqlite3 relies on JavaScript to maintain references (Statement JS object has `.database` property). The C++ side uses raw pointers.
5148

52-
**Fix**: Added `Napi::ObjectReference database_ref_` to Session (matching StatementSync pattern).
49+
### 3. N-API Reference Manipulation During GC Causes Crashes
5350

54-
**Commit**: `a151cb6`
51+
From GitHub issues:
52+
- [nodejs/node-addon-api#660](https://github.com/nodejs/node-addon-api/issues/660): ObjectWrap destructor crashes due to double napi delete calls
53+
- [nodejs/node#37236](https://github.com/nodejs/node/issues/37236): Crash on node-api add-on finalization - double-free of RefBase
54+
- [nodejs/node#27085](https://github.com/nodejs/node/pull/27085): GC finalization stress issues
5555

56-
### Bug 2b: DeleteAllSessions Bypassed Reference Release
56+
**Key insight**: Calling `Napi::ObjectReference::Reset()` during GC finalization (destructor path) is NOT safe on Alpine/musl. Even the ObjectReference destructor calling `napi_delete_reference` can cause issues.
5757

58-
**The bug**: When `db.close()` calls `DeleteAllSessions()`, it directly sets `session->session_ = nullptr`, causing `Session::Delete()` to return early and never call `database_ref_.Reset()`.
58+
### 4. The Stable Version Had No `database_ref_`
5959

60-
**Fix**: Call `database_ref_.Reset()` in `DeleteAllSessions()` after cleaning up each session.
60+
At commit `50354e1` (stable), Session class was simple:
61+
```cpp
62+
class Session : public Napi::ObjectWrap<Session> {
63+
sqlite3_session *session_ = nullptr;
64+
DatabaseSync *database_ = nullptr; // Just a raw pointer, NO ObjectReference
65+
};
66+
```
6167
62-
**Commit**: `fb283df`
68+
---
6369
64-
### Bug 2c: Mutex Deadlock Causing SIGSEGV
70+
## Timeline of Changes
6571
66-
**The bug**: `DeleteAllSessions()` held `sessions_mutex_` while calling `database_ref_.Reset()`. Reset can trigger GC, which finalizes other Session objects, which call `Delete()``RemoveSession()` → tries to lock already-held mutex → **undefined behavior**.
72+
| Date | Commit | Description | CI Status |
73+
|------|--------|-------------|-----------|
74+
| Dec 17 | `50354e1` | Stable release 0.3.0 | ✅ Green |
75+
| Jan 12 | - | Last confirmed green CI | ✅ Green |
76+
| Jan 26 | `a151cb6` | Added `database_ref_` to Session | ❌ Crashes started |
77+
| Jan 26 | `fb283df` | Release database_ref_ in DeleteAllSessions | ❌ Still crashing |
78+
| Jan 27 | `dadbb86` | Release mutex before GC operations | ❌ Still crashing |
79+
| Jan 27 | `3a6aaff` | Prevent double-free in DeleteAllSessions | ❌ Still crashing |
80+
| Jan 27 | `e1a3fcb` | Prevent dangling pointers | ❌ Still crashing |
81+
| Jan 29 | `413c93f` | Hold refs during cleanup | ❌ Still crashing |
82+
| Jan 29 | `611d330` | Skip Reset() in destructor | ❌ Still crashing |
6783
68-
**Fix**: Release mutex before the cleanup loop.
84+
**Every "fix" commit has failed to resolve the issue because they all keep `database_ref_`.**
6985
70-
**Commit**: `dadbb86`
86+
---
7187
72-
### Bug 2d: Dangling Pointers During Iteration
88+
## Recommended Fix: Revert to Simple Design
7389
74-
**The bug**: Calling `database_ref_.Reset()` during iteration could trigger GC, which may finalize Session JS objects still in the iteration list, creating dangling pointers.
90+
### Option 1: Full Revert (Recommended)
7591
76-
**Initial fix attempt**: Removed `Reset()` calls entirely from `DeleteAllSessions()`.
92+
Revert Session-related changes back to `b5835fa` (before `a151cb6`):
7793
78-
**Commit**: `e1a3fcb`
94+
```bash
95+
git checkout b5835fa -- src/sqlite_impl.cpp src/sqlite_impl.h
96+
```
7997

80-
### Bug 2e: V8 JIT Corruption on Alpine During Jest Cleanup
98+
The simple design:
99+
- Session uses raw `DatabaseSync*` pointer (no ObjectReference)
100+
- `DeleteAllSessions()` just cleans up SQLite sessions
101+
- Session::Delete() removes from database's session list
102+
- No N-API reference manipulation in destructors
81103

82-
**The bug**: When Sessions are GC'd during Jest cleanup (process exit), their `Napi::ObjectReference` destructors call `Reset()`. On Alpine/musl, this corrupts V8's JIT page allocations.
104+
### Option 2: Match Node.js Upstream Pattern
83105

84-
**How it manifested**:
106+
If we want Session to survive database GC, implement weak reference pattern:
107+
1. Don't hold strong reference to database
108+
2. Check if database is still valid before operations
109+
3. Return error if database was closed/GC'd
85110

86-
1. Tests pass, Jest begins cleanup
87-
2. V8 runs GC, finalizing Session objects
88-
3. Session destructors trigger `Napi::ObjectReference` cleanup
89-
4. V8 JIT corruption: `Check failed: it != jit_page_->allocations_.end()`
90-
5. **SIGSEGV** or **SIGABRT**
111+
### Why Raw Pointers Are OK
91112

92-
**Partial Fix**: Hold strong references to Session JS objects during `DeleteAllSessions()` cleanup, preventing GC from destroying them while we iterate.
113+
The stable version used raw pointers and worked because:
114+
1. `db.close()` calls `DeleteAllSessions()` which cleans up SQLite sessions
115+
2. Session operations check `database_->IsOpen()` before proceeding
116+
3. If user GC's database without calling close() while holding sessions, that's undefined behavior (same as better-sqlite3)
93117

94-
**Commit**: `413c93f`
118+
---
95119

96-
### Bug 2f: Explicit Reset() in Destructor Path Corrupts V8 (FINAL FIX)
120+
## Files Changed by Session "Fixes"
97121

98-
**The bug**: Even with the `DeleteAllSessions()` fix, crashes still occurred when Sessions were GC'd without an explicit `db.close()` call. The root cause: `Session::Delete()` was calling `database_ref_.Reset()` during GC finalization, which corrupts V8's JIT on Alpine/musl.
122+
All these changes should be reverted:
99123

100-
**Key Insight**: The `Napi::ObjectReference` destructor is designed to be GC-safe. But **explicitly** calling `Reset()` during GC finalization is NOT safe on Alpine/musl.
124+
**src/sqlite_impl.h:**
125+
- Added `Napi::ObjectReference database_ref_;` to Session class
126+
- Added `bool in_destructor_ = false;` flag
101127

102-
**The Fix**: Add an `in_destructor_` flag to Session. When `Delete()` is called from the destructor, skip the explicit `Reset()` call and let the ObjectReference destructor handle cleanup naturally.
128+
**src/sqlite_impl.cpp:**
129+
- `Session::SetSession()`: Added `database_ref_ = Napi::Persistent(database->Value());`
130+
- `Session::Delete()`: Added `database_ref_.Reset()` calls with complex conditional logic
131+
- `Session::~Session()`: Added `in_destructor_` flag and `SuppressDestruct()` attempts
132+
- `DeleteAllSessions()`: Changed from simple loop to complex two-pass with session_refs vector
103133

104-
```cpp
105-
Session::~Session() {
106-
in_destructor_ = true; // Signal we're in destructor path
107-
Delete();
108-
}
134+
---
109135

110-
void Session::Delete() {
111-
// ... existing cleanup code ...
136+
## Current Working Directory State
112137

113-
// Only call Reset() if NOT in destructor path
114-
// On Alpine/musl, calling Reset() during GC corrupts V8's JIT
115-
if (!in_destructor_ && !database_ref_.IsEmpty()) {
116-
database_ref_.Reset();
117-
}
118-
// If in destructor, ObjectReference destructor handles cleanup (GC-safe)
119-
}
138+
The code has been reverted to `b5835fa` state:
139+
```bash
140+
git checkout b5835fa -- src/sqlite_impl.cpp src/sqlite_impl.h
120141
```
121142

122-
**Commit**: (pending)
143+
**Pending:**
144+
1. Rebuild: `npm run build:native:rebuild`
145+
2. Test locally: `npm test`
146+
3. Test in Alpine Docker
147+
4. Commit and push
123148

124149
---
125150

126-
## Reproduction Steps
127-
128-
The crash is **only reproducible in Alpine Docker containers running Jest**:
151+
## Reproduction Commands
129152

130153
```bash
131-
# This command reproduces the crash intermittently
154+
# Reproduces crash (with current broken code)
132155
docker run --rm -v "$(pwd)":/host:ro node:24-alpine sh -c '
133156
cp -r /host /work && cd /work &&
134157
apk add --no-cache build-base python3 py3-setuptools &&
135158
npm ci --ignore-scripts && npx node-gyp rebuild &&
136159
npm run build:dist &&
137-
# Run full suite multiple times - crashes during cleanup
138-
for i in $(seq 1 10); do
139-
node --expose-gc node_modules/jest/bin/jest.js --no-coverage
140-
done
141-
'
142-
143-
# This does NOT crash (skips cleanup):
144-
node --expose-gc node_modules/jest/bin/jest.js --no-coverage --forceExit
145-
146-
# This does NOT crash (not in Jest):
147-
node --expose-gc -e '
148-
const { DatabaseSync } = require("./dist/index.cjs");
149-
for (let i = 0; i < 1000; i++) {
150-
const db = new DatabaseSync(":memory:");
151-
db.createSession();
152-
db.close();
153-
}
154-
if (global.gc) global.gc();
155-
console.log("Success");
160+
node --expose-gc node_modules/jest/bin/jest.js --no-coverage
156161
'
157-
```
158-
159-
---
160162

161-
## Tribal Knowledge
162-
163-
### Pattern: Preventing GC of Parent Objects
164-
165-
When a child object (Session, Statement) holds a pointer to a parent (DatabaseSync), you **must** also hold a reference to prevent GC:
166-
167-
```cpp
168-
DatabaseSync *database_; // For fast access
169-
Napi::ObjectReference database_ref_; // Prevents GC
170-
```
171-
172-
### Pattern: Safe Bulk Cleanup with GC
173-
174-
When cleaning up multiple Napi objects that hold ObjectReferences:
175-
176-
```cpp
177-
// BAD: GC can destroy objects during iteration
178-
for (auto* obj : objects_copy) {
179-
obj->ref_.Reset(); // May trigger GC, destroying other objects in list
180-
}
181-
182-
// GOOD: Hold references to prevent GC during iteration
183-
std::vector<Napi::ObjectReference> refs;
184-
for (auto* obj : objects_copy) {
185-
refs.push_back(Napi::Persistent(obj->Value()));
186-
}
187-
// Now safe to clean up
188-
for (auto* obj : objects_copy) {
189-
obj->ref_.Reset();
190-
}
191-
// refs destructor releases, objects can be GC'd safely
192-
```
193-
194-
### Why Only Alpine/musl?
195-
196-
1. **Different GC timing**: musl's allocator triggers GC at different points
197-
2. **V8 JIT sensitivity**: musl's memory layout affects JIT page management
198-
3. **Process exit behavior**: Cleanup phase differs from glibc
199-
4. **No recursive mutex protection**: glibc may be more forgiving
200-
201-
### Files to Study
202-
203-
| File | What to Look For |
204-
| ------------------------------- | ----------------------------------- |
205-
| `src/sqlite_impl.cpp:1516-1560` | DeleteAllSessions final fix |
206-
| `src/sqlite_impl.cpp:2980-3020` | Session::Delete cleanup |
207-
| `src/sqlite_impl.cpp:1845-1853` | StatementSync reference pattern |
208-
209-
---
210-
211-
## Validation
212-
213-
- [x] Root causes identified and documented
214-
- [x] Session fixes implemented
215-
- [x] Local tests pass: `npm t` (793 tests)
216-
- [x] Alpine Docker tests pass without crashes (5+ runs)
217-
- [x] Linting passes: `npm run lint`
218-
- [ ] **REMAINING: CI validation on Alpine**
219-
220-
### Test Commands
221-
222-
```bash
223-
# Full test suite
224-
npm run build:native:rebuild && npm run build:dist && npm test
225-
226-
# Local Alpine test (the definitive test)
163+
# Should pass after revert
227164
docker run --rm -v "$(pwd)":/host:ro node:24-alpine sh -c '
228165
cp -r /host /work && cd /work &&
229166
apk add --no-cache build-base python3 py3-setuptools &&
@@ -238,35 +175,24 @@ docker run --rm -v "$(pwd)":/host:ro node:24-alpine sh -c '
238175

239176
---
240177

241-
## Commits Summary
178+
## References
242179

243-
| Commit | Description |
244-
| --------- | --------------------------------------------------------- |
245-
| `a151cb6` | Add `database_ref_` to Session |
246-
| `fb283df` | Release `database_ref_` in DeleteAllSessions |
247-
| `dadbb86` | Release mutex before GC-triggering operations |
248-
| `e1a3fcb` | Remove Reset() from DeleteAllSessions (incomplete fix) |
249-
| `413c93f` | Hold refs during cleanup to prevent V8 JIT corruption |
250-
| (pending) | Skip Reset() in destructor path - let ObjectRef handle it |
180+
- [nodejs/node-addon-api#660](https://github.com/nodejs/node-addon-api/issues/660) - ObjectWrap destructor crashes
181+
- [nodejs/node#37236](https://github.com/nodejs/node/issues/37236) - Crash on node-api add-on finalization
182+
- [nodejs/node#27085](https://github.com/nodejs/node/pull/27085) - GC finalization stress
183+
- [lovell/sharp#2570](https://github.com/lovell/sharp/issues/2570) - Segfault on Alpine/musl
184+
- Node.js upstream: `src/upstream/node_sqlite.h` - Uses `BaseObjectWeakPtr` for Session
185+
- better-sqlite3: Uses raw pointers, JavaScript manages lifetime
251186

252187
---
253188

254-
## Remaining Work
255-
256-
### Task: Push and Validate CI
257-
258-
**Success**: CI runs pass without SIGSEGV on Alpine
259-
260-
**Implementation**:
261-
262-
1. Commit the final fix (in_destructor_ flag approach)
263-
2. Push to trigger CI
264-
3. Monitor Alpine test jobs
265-
266-
**Completion checklist**:
189+
## Completion Checklist
267190

268-
- [ ] Commit pushed
269-
- [ ] test-alpine jobs pass for all Node versions (20, 22, 24, 25)
270-
- [ ] test-alpine jobs pass for both architectures (x64, arm64)
271-
- [ ] No SIGSEGV/SIGABRT crashes in 5+ consecutive runs
191+
- [x] Root cause identified: `database_ref_` ObjectReference added in `a151cb6`
192+
- [x] Research completed: Node.js upstream and better-sqlite3 patterns documented
193+
- [x] Code reverted to simple design (b5835fa state)
194+
- [x] Rebuild and test locally (793 tests passed)
195+
- [x] Test in Alpine Docker (5 consecutive runs - all passed)
196+
- [x] Commit revert with clear explanation (8441235)
197+
- [ ] Push and validate CI
272198
- [ ] Move TPP to `doc/done/`

0 commit comments

Comments
 (0)