
Conversation

@tlambert03 (Owner) commented Nov 27, 2025

This PR introduces a new spectrum submission form (v2) and changes how spectrum data is stored in the
database.

Data Storage Changes:

  • Spectrum data is now stored in an optimized format: min_wave, max_wave (SmallIntegerField), and y_values
    (BinaryField containing float32 values). This replaces the legacy data field (an ArrayField of
    [[wavelength, value], ...] pairs). A minimal sketch of the round-trip follows this list.
  • Data is always interpolated to 1nm spacing on save, so wavelengths are derived from min_wave/max_wave
    rather than stored explicitly.
  • peak_wave is now a persisted field (SmallIntegerField) computed on save, rather than a computed
    property.
  • A scale_factor field stores the physical scaling constant (extinction coefficient, quantum yield, or
    cross-section depending on subtype).
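
To make the round-trip concrete, here is a minimal sketch of how values could be packed and unpacked under this format (assumes numpy and sorted input wavelengths; the actual model helpers may differ):

    import numpy as np

    def pack_spectrum(wavelengths, values):
        """Interpolate onto a 1 nm grid and pack the values as float32 bytes."""
        min_wave = int(np.ceil(min(wavelengths)))
        max_wave = int(np.floor(max(wavelengths)))
        grid = np.arange(min_wave, max_wave + 1)  # 1 nm spacing, endpoints inclusive
        y = np.interp(grid, wavelengths, values).astype(np.float32)  # assumes wavelengths are sorted
        return min_wave, max_wave, y.tobytes()

    def unpack_spectrum(min_wave, max_wave, y_values):
        """Recover (wavelength, value) arrays from the stored fields."""
        y = np.frombuffer(y_values, dtype=np.float32)
        x = np.arange(min_wave, max_wave + 1)
        assert len(x) == len(y)
        return x, y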

Form Processing (v2):

  • CSV parsing and column selection happen client-side in JavaScript.
  • Spectrum data is normalized client-side (except for filters/cameras which store absolute transmission/QE
    values).
  • The form accepts a DOI string for primary_reference, which is converted to a Reference model instance
    via get_or_create.
  • Duplicate checking validates against both existing database spectra and within the same submission.
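
A rough sketch of the DOI handling described above (the Reference import path and field name are assumptions; the real form may normalize DOIs differently):

    def resolve_reference(doi: str):
        """Turn a submitted DOI string into a Reference row, creating it if needed."""
        from references.models import Reference  # assumed import path

        doi = doi.strip().lower()  # DOIs are case-insensitive; store a canonical form
        ref, _created = Reference.objects.get_or_create(doi=doi)
        return ref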

Database Constraints:

  • A CheckConstraint ensures exactly one owner is set per spectrum.
  • A CheckConstraint validates that subtypes are valid for their category (e.g., cameras can only have qe
    subtype).
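
For illustration, the single-owner rule can be expressed roughly like this (a sketch showing only two of the owner foreign keys; the real constraint in the migration covers every owner field):

    from django.db import models
    from django.db.models import Q

    # lives in Spectrum.Meta.constraints; spectrum_valid_category_subtype is built the same way
    SINGLE_OWNER = models.CheckConstraint(
        name="spectrum_single_owner",
        check=(
            Q(owner_fluor__isnull=False, owner_filter__isnull=True)
            | Q(owner_fluor__isnull=True, owner_filter__isnull=False)
        ),
    )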

UX Fixes:

  • Removing all spectrum cards now resets the UI to the column picker instead of leaving an empty "step 3"
    view, and clears the file input so the same file can be re-selected.

codecov bot commented Nov 27, 2025

Codecov Report

❌ Patch coverage is 77.91878% with 87 lines in your changes missing coverage. Please review.
✅ Project coverage is 62.40%. Comparing base (588555f) to head (b28db80).
⚠️ Report is 1 commit behind head on main.

| Files with missing lines | Patch % | Lines |
|---|---|---|
| backend/proteins/models/spectrum.py | 77.57% | 37 Missing ⚠️ |
| backend/proteins/forms/spectrum_v2.py | 77.48% | 34 Missing ⚠️ |
| backend/proteins/extrest/entrez.py | 27.27% | 8 Missing ⚠️ |
| backend/proteins/models/microscope.py | 33.33% | 4 Missing ⚠️ |
| backend/proteins/views/spectra.py | 93.02% | 3 Missing ⚠️ |
| backend/proteins/views/ajax.py | 91.66% | 1 Missing ⚠️ |
Additional details and impacted files
@@            Coverage Diff             @@
##             main     #396      +/-   ##
==========================================
+ Coverage   62.09%   62.40%   +0.30%     
==========================================
  Files         114      115       +1     
  Lines        8984     9184     +200     
==========================================
+ Hits         5579     5731     +152     
- Misses       3405     3453      +48     


@tlambert03 (Owner, Author) commented:

@claude please review this PR

Focus first on the change made to the Spectrum schema.

  • Are they fully backwards compatible?
  • Are they better?
  • Have we missed anything?

then focus on the new spectrum form

  • is it better than the old form?
  • is it well tested?
  • any lingering bugs you see?

claude bot commented Nov 27, 2025

Claude finished @tlambert03's task.


PR Review: Spectrum Schema Refactor

Review Checklist

  • Review Spectrum schema changes (migration + model)
  • Assess backwards compatibility
  • Evaluate if changes are improvements
  • Check for missing elements
  • Review new spectrum form implementation
  • Compare new vs old form
  • Evaluate test coverage
  • Look for potential bugs

Schema Changes Review

1. Backwards Compatibility ✅

Excellent backwards compatibility! The migration and model changes are well-designed:

Migration (0060_spectrum_optimized_storage.py):

  • ✅ Uses conditional operations (ConditionalAddField, ConditionalRemoveField, etc.) to make the migration idempotent
  • ✅ Adds new fields as nullable first, migrates data, then makes them required
  • ✅ Includes skip logic if data is already migrated (lines 62-64)
  • ✅ Removes the legacy data field only after migration completes
  • ✅ Handles memoryview from database correctly
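
Condensed, the add-nullable → backfill → tighten pattern described above looks roughly like this (not the actual migration code; the conversion is simplified and assumes the legacy wavelengths are already sorted):

    import numpy as np
    from django.db import migrations

    BATCH = 500  # batch size noted above

    def backfill_spectra(apps, schema_editor):
        """Backfill min_wave/max_wave/y_values from the legacy `data` field in batches."""
        Spectrum = apps.get_model("proteins", "Spectrum")
        batch = []
        for spec in Spectrum.objects.filter(y_values__isnull=True).iterator(chunk_size=BATCH):
            if not spec.data:
                continue  # spectra with no data are skipped (the real migration warns about them)
            x, y = zip(*spec.data)  # legacy [[wavelength, value], ...] pairs
            spec.min_wave, spec.max_wave = int(min(x)), int(max(x))
            grid = np.arange(spec.min_wave, spec.max_wave + 1)  # 1 nm spacing
            spec.y_values = np.interp(grid, x, y).astype(np.float32).tobytes()
            batch.append(spec)
            if len(batch) >= BATCH:
                Spectrum.objects.bulk_update(batch, ["min_wave", "max_wave", "y_values"])
                batch.clear()
        if batch:
            Spectrum.objects.bulk_update(batch, ["min_wave", "max_wave", "y_values"])

    # wired into operations as: migrations.RunPython(backfill_spectra, migrations.RunPython.noop)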

Model (backend/proteins/models/spectrum.py):

  • ✅ Provides data property getter/setter for backwards compatibility (lines 552-580)
  • ✅ Legacy SpectrumData field class kept for migration compatibility (lines 202-208)
  • ✅ All existing methods (d3dict(), waverange(), avg(), etc.) preserved with same interfaces
  • ✅ Existing cached properties x and y work identically to before

One minor issue: The sorted_ex2em function was removed from spectrum.py but was still imported in microscope.py. This was correctly fixed by:

  1. Removing the import in microscope.py:17
  2. Creating a local _sorted_ex2em function in microscope.py:242-245

This is a good change: it removes an unnecessary cross-module dependency.

2. Are Changes Better? ✅ Yes, significantly!

Storage Optimization:

  • ~75% space savings: binary float32 (4 bytes/value) vs the legacy ArrayField of [[wavelength, value], ...] pairs
  • Derived wavelengths: No need to store wavelengths since they're derived from min_wave/max_wave with 1nm steps
  • Better for large datasets: FPbase has thousands of spectra, so this will significantly reduce DB size

Data Integrity:

  • Database constraints: New CHECK constraints ensure:
    • Exactly one owner per spectrum (spectrum_single_owner)
    • Valid category/subtype combinations (spectrum_valid_category_subtype)
  • Validation in model: clean() method validates y_values length matches wavelength range (lines 429-435)
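
The length check mentioned in the last bullet is conceptually simple; a standalone sketch of the core invariant (the real clean() may validate more than this):

    import numpy as np
    from django.core.exceptions import ValidationError

    def validate_spectrum_length(min_wave: int, max_wave: int, y_values: bytes) -> None:
        """y_values must contain exactly one float32 per 1 nm step, endpoints inclusive."""
        expected = max_wave - min_wave + 1
        n_values = len(np.frombuffer(y_values, dtype=np.float32))
        if n_values != expected:
            raise ValidationError(
                f"y_values holds {n_values} points, but the wavelength range implies {expected}."
            )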

Performance:

  • New indexes:
    • spectrum_metadata_idx: Covering index for common metadata-only queries (line 326)
    • spectrum_approved_fluor_idx: Partial index for most common query pattern (line 333)
    • Existing composite indexes preserved
  • Efficient queries: get_spectra_list() uses only() and select_related() to minimize DB hits

Code Quality:

  • Type hints: Added TypedDict for D3Dict, better type annotations throughout
  • Cleaner data model: Separation of storage (y_values, min_wave, max_wave) from computed values (peak_wave)
  • scale_factor field: Properly stores physical constants (EC, QY, GM) separately from normalized data

3. Missing Elements? Some considerations:

Migration Concerns:

  1. Batch processing: Migration uses 500-record batches (line 66) - good for large datasets
  2. Progress feedback: Print statements show progress - good for long migrations
  3. Error handling: Warns about spectra with no data (line 79) but continues - could log these IDs for investigation
  4. Rollback: No reverse migration provided - this is probably fine since the data is preserved via the data property

Model Improvements:

  1. No uniqueness constraint: While the form validates duplicates within a submission (lines 98-109 in spectrum_v2.py), there's no database-level uniqueness constraint on (owner_fluor, subtype) or similar. The old form had runtime validation (lines 210-227 in old spectrum.py). Consider adding a unique constraint to prevent duplicates at the DB level; a sketch follows this list.

  2. peak_wave computation: Uses _compute_peak_wave() which looks for Y ≈ 1.0 or max Y (lines 449-465). This assumes normalized data. For spectra with scale_factor set, this works correctly because the setter normalizes the data (lines 640-642).

  3. Documentation: The new format is well-documented in help_text, but could benefit from a migration guide in the docs explaining the change for external consumers of the API.
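
On point 1, conditional unique constraints could enforce this at the database level. A sketch (the constraint names and the exact set of owner fields are assumptions):

    from django.db import models
    from django.db.models import Q

    # candidate additions to Spectrum.Meta.constraints
    UNIQUE_PER_OWNER = [
        models.UniqueConstraint(
            fields=["owner_fluor", "subtype"],
            condition=Q(owner_fluor__isnull=False),
            name="unique_owner_fluor_subtype",
        ),
        models.UniqueConstraint(
            fields=["owner_filter", "subtype"],
            condition=Q(owner_filter__isnull=False),
            name="unique_owner_filter_subtype",
        ),
    ]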


New Spectrum Form Review

4. Is the New Form Better? ✅ Yes, much better!

Old Form (spectrum.py):

  • Single spectrum submission only
  • Manual data entry or simple 2-column CSV
  • Limited normalization options
  • No preview/visualization
  • Basic validation

New Form (spectrum_v2.py + JavaScript):

  • Multi-spectrum support: Upload one CSV with multiple columns
  • Flexible column selection: Any CSV layout accepted, user picks columns
  • Visual preview: Chart.js visualization of each spectrum (lines 280+ in spectrum-chart.js)
  • Smart normalization: Auto-normalization with preview, manual peak selection
  • Duplicate detection: Checks for duplicates within submission AND against existing DB
  • Better UX: Step-by-step workflow (upload → select columns → configure → preview → submit)
  • Richer metadata: Per-spectrum pH, solvent, scale_factor, peak_wave

Architecture:

  • Clean separation: Django handles validation/saving, JavaScript handles UX/preprocessing
  • Type safety: SpectrumJSONData TypedDict matches frontend SpectrumJSON typedef (lines 19-27 in spectrum_v2.py)
  • Modular JS: Separated concerns (csv-parser, column-picker, normalization, chart, form-controller)
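
For orientation, the per-spectrum payload might look something like the following TypedDict; the field names here are inferred from the review, and the real SpectrumJSONData in spectrum_v2.py is authoritative:

    from typing import TypedDict

    class SpectrumJSONDataSketch(TypedDict, total=False):
        data: list[list[float]]    # [[wavelength, value], ...] after client-side processing
        category: str
        subtype: str
        owner: str                 # e.g. a protein or dye name
        ph: float | None
        solvent: str | None
        scale_factor: float | None
        peak_wave: int | None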

5. Test Coverage

E2E Tests (test_spectrum_form_v2.py):

  • ✅ 745 lines of comprehensive Playwright tests
  • ✅ Tests basic page load, auth, file upload
  • ✅ Tests column selection workflow
  • ✅ Tests spectrum card configuration
  • ✅ Functional pytest style (good!)
  • ✅ Uses proper waits (no time.sleep()) - adheres to project guidelines

Coverage is good based on the visible portion (roughly the first 200 lines of the file). The tests cover:

  • Page access control
  • File upload → column picker flow
  • Single and multiple column selection
  • Spectrum card category/subtype selection

Unit Tests (test_spectrum_v2_duplicates.py):

  • ✅ Tests duplicate detection logic
  • ✅ Tests case-insensitive matching
  • ✅ Tests different subtypes are allowed for same owner
  • ✅ Clean, focused tests

Missing tests:

  1. No backend unit tests for the new form's _get_or_create_owner() logic
  2. No tests for the migration itself (though this is often acceptable)
  3. Limited frontend unit tests (though E2E coverage is good)

6. Potential Bugs

Backend (spectrum_v2.py):

  1. Lines 286-288: Auto-approval for staff users

    status=Spectrum.STATUS.approved
    if self.user and self.user.is_staff
    else Spectrum.STATUS.pending,

    This is fine, but consider logging auto-approvals for an audit trail.

  2. Line 282: Empty solvent handling

    solvent=spec_data.get("solvent") or "",

    This converts None to "". Consistent with model's blank=True, but be aware this means you can't distinguish "no solvent specified" from "empty string solvent".

  3. Lines 198-216: Protein lookup by name

    owner_fluor = State.objects.select_related("protein").get(
        protein__name__iexact=owner_name
    )

    Uses iexact (case-insensitive), which is good, but if there are multiple states for a protein, it takes the first one (line 215). This could be unexpected; consider warning when multiple states exist (a sketch follows this list).

  4. Line 292: Setting data via property

    spectrum.data = spec_data["data"]

    This uses the backwards-compatibility property setter. Works fine, but could call _set_spectrum_data() directly for clarity.
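
On point 3, one way to surface the ambiguity when a protein has more than one state (a sketch; the logger and exact query shape are illustrative):

    import logging

    from proteins.models import State  # assumed import path

    logger = logging.getLogger(__name__)

    def lookup_owner_state(owner_name: str):
        """Resolve a protein name to a State, warning when the choice is ambiguous."""
        states = list(
            State.objects.select_related("protein").filter(protein__name__iexact=owner_name)
        )
        if not states:
            return None
        if len(states) > 1:
            logger.warning(
                "Protein %r has %d states; defaulting to the first one", owner_name, len(states)
            )
        return states[0]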

Frontend (form-controller.js):

  1. Line 119: DOI validation regex

    const CONFIG = { doiPattern: /^10\.\d{4,}\/[^\s]+$/ }

    This is a reasonable DOI pattern, but real DOIs can contain additional characters (e.g. parentheses and semicolons). Consider using a more permissive pattern or noting the limitation; one candidate appears after this list.

  2. Lines 156-200: Spectrum processing

    • Creates spectrum objects with interpolation to 1nm
    • Processing happens in processSpectrum() (not shown in the 200-line preview)
    • The normalization logic itself is not visible in the preview, so it could not be fully verified here and deserves a closer manual check.
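
On point 1, a widely cited Crossref-recommended pattern is somewhat broader than the one above; shown here in Python, and it translates directly to a JS regex literal:

    import re

    # Crossref-recommended pattern for modern DOIs (case-insensitive)
    DOI_RE = re.compile(r"^10\.\d{4,9}/[-._;()/:A-Z0-9]+$", re.IGNORECASE)

    assert DOI_RE.match("10.1000/xyz123")          # DOI Handbook format example
    assert DOI_RE.match("10.5555/12345678")        # format example only
    assert not DOI_RE.match("doi:10.1000/xyz123")  # a leading "doi:" prefix should be stripped first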

Migration (0060_spectrum_optimized_storage.py):

  1. Line 45: Peak detection logic

    return x[y.index(max_y)]

    If max_y appears multiple times, index() returns the first occurrence. This could give different results than the model's _compute_peak_wave() which uses argmax on numpy array (line 463). Consider using consistent logic.

  2. Line 49: Hardcoded search for value 1

    return x[y.index(1)]

    Searches for exact 1, but model uses np.isclose(y_arr, 1.0, atol=1e-6) (line 459). Migration should use same tolerance for consistency.
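
A single shared helper along these lines would keep the migration and the model consistent (a sketch; the tolerance matches the np.isclose call quoted above):

    import numpy as np

    def compute_peak_wave(min_wave: int, y: np.ndarray) -> int:
        """Peak = first wavelength where y is ~1.0 (normalized data), else the argmax."""
        near_one = np.flatnonzero(np.isclose(y, 1.0, atol=1e-6))
        idx = int(near_one[0]) if near_one.size else int(np.argmax(y))
        return min_wave + idx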


Recommendations

High Priority:

  1. Add database uniqueness constraint for (owner_fluor, subtype), (owner_filter, subtype), etc. to prevent duplicate spectra at the DB level
  2. Align peak detection logic between migration and model (use same tolerance/method)
  3. Add backend unit tests for _get_or_create_owner() and form validation edge cases

Medium Priority:

  1. Consider logging when multiple protein states exist and we pick the first one
  2. Add migration guide in docs for external API consumers
  3. Log auto-approved spectra for audit trail

Low Priority:

  1. Use more permissive DOI regex or document limitations
  2. Consider calling _set_spectrum_data() directly instead of using property setter
  3. Log spectrum IDs that have no data during migration (currently just warns)

Summary

This is excellent work! The schema refactor is well-designed, backwards-compatible, and provides significant benefits:

  • ✅ 75% space savings
  • ✅ Better data integrity with DB constraints
  • ✅ Improved performance with optimized indexes
  • ✅ Clean, maintainable code

The new form is a huge improvement over the old one:

  • ✅ Multi-spectrum support
  • ✅ Better UX with visual preview
  • ✅ Smart normalization
  • ✅ Comprehensive test coverage

Main concerns:

  1. Missing DB-level uniqueness constraints
  2. Peak detection logic inconsistency between migration and model
  3. Could use a few more backend unit tests

Verdict: This is ready to merge with minor improvements. The identified issues are not blockers, but addressing them would make this even better.


- Implemented `is_valid_doi` function to check DOI existence via Crossref API.
- Integrated DOI validation in `clean_primary_reference` method of SpectrumFormV2.
- Updated spectrum form template to display form errors.
- Improved CSV parser to validate column count and wavelength data.
- Enhanced form controller to manage confirmation checkbox state and restore form state after validation errors.
- Added tests for DOI validation and form submission scenarios.
- Added a unique constraint to the Spectrum model to ensure unique (owner, subtype) combinations.
- Updated the spectrum form template to improve UI with a smaller font size for the column picker.
- Implemented category code retrieval for spectrum owners in the similar spectrum owners AJAX view.
- Introduced new tests for peak snapping behavior in the spectrum form, ensuring correct peak selection and normalization.
- Enhanced duplicate detection logic for proteins and dyes, disabling existing subtypes in the dropdown.
- Improved the column picker JavaScript to adjust preview rows and table height.
- Refactored duplicate checker logic to return existing subtypes and handle exact matches more effectively.
- Updated form controller to manage owner slug and improve user experience with scale factor selection.
@tlambert03 changed the title from "Refactor Spectrum schema to store just y values" to "Refactor Spectrum schema, and spectrum submit form" on Nov 29, 2025
@tlambert03 merged commit ff36d59 into main on Nov 29, 2025
9 checks passed
@tlambert03 deleted the spectrum-migrate branch on November 29, 2025 at 02:41