
ValueError: could not convert integer scalar when running dyn.tl.dynamics on large harmonized dataset (~4M cells) #732

@superlsd

Description


Dear developers,

Thank you for the amazing work you have put into Dynamo. I am encountering an issue when running Dynamo on a large harmonized dataset (~4M cells) that was preprocessed and transformed externally across several datasets. Similar analyses with smaller datasets (up to ~100k cells) ran smoothly, but this larger dataset fails at the following step:

dyn.tl.dynamics(
    harmonized_adata,
    assumption_mRNA='ss',
    model='deterministic',
    est_method='ols',
    re_smooth=False,
    del_2nd_moments=True,
    cores=1,
)

which raises the following error:

    adata.layers["velocity_S"][cur_cells_ind, valid_ind_] = vel_S
    ret = csr_sample_offsets(M, N, self.indptr, self.indices, n_samples,
          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ValueError: could not convert integer scalar
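
One thing I wondered about: scipy sparse matrices use 32-bit index arrays by default, and 4,000,000 × 2,500 is ~10^10 potential entries, well past the int32 limit, so the error might be an index-overflow symptom of the assignment into velocity_S. I am not sure this is actually the cause, but a quick check along these lines might help narrow it down (the upcast at the end is only a guess on my side, not something I have validated at this scale):

import numpy as np
import scipy.sparse as sp

# Inspect the layer that dynamo assigns into (velocity_S in the traceback).
if "velocity_S" in harmonized_adata.layers and sp.issparse(harmonized_adata.layers["velocity_S"]):
    layer = harmonized_adata.layers["velocity_S"]
    print("format:", layer.format)
    print("indptr dtype:", layer.indptr.dtype, "| indices dtype:", layer.indices.dtype)
    print("nnz:", layer.nnz, "| int32 max:", np.iinfo(np.int32).max)

    # Possible workaround (untested at this scale): upcast the index arrays
    # to int64 before the assignment step runs again.
    layer = layer.copy()
    layer.indptr = layer.indptr.astype(np.int64)
    layer.indices = layer.indices.astype(np.int64)
    harmonized_adata.layers["velocity_S"] = layer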

After some debugging, I implemented a custom patch (safe_set_velocity) ensuring that all relevant layers are stored as sparse matrices and that assignments are shape-consistent:

import numpy as np
import scipy.sparse as sp

def safe_set_velocity(
    adata,
    vel_U,
    vel_S,
    vel_N,
    vel_T,
    vel_P,
    _group,
    cur_grp,
    cur_cells_bools,
    valid_bools_,
    ind_for_proteins,
):
    """
    Safe replacement for Dynamo's set_velocity with shape checks and auto-init.
    """
    print("✅ safe_set_velocity CALLED!")

    cells = np.where(cur_cells_bools)[0]
    genes = np.where(valid_bools_)[0]

    n_cells, n_genes = adata.n_obs, adata.n_vars

    # --- ensure the target layer exists ---
    if "velocity_S" not in adata.layers.keys():
        print(f"[safe_set_velocity] Creating adata.layers['velocity_S'] with shape ({n_cells}, {n_genes})")
        adata.layers["velocity_S"] = sp.csr_matrix((n_cells, n_genes), dtype=float)

    # --- get a dense working copy for assignment ---
    layer = adata.layers["velocity_S"]
    if sp.issparse(layer):
        dense = layer.toarray()
    else:
        # already dense; keep the existing values instead of resetting to zeros
        dense = np.asarray(layer, dtype=float)

    # --- prepare vel_S ---
    if sp.issparse(vel_S):
        vel_S = vel_S.toarray()
    vel_S = np.asarray(vel_S)
    vel_S = np.squeeze(vel_S)

    # --- fix shape if necessary ---
    if vel_S.shape != (len(cells), len(genes)):
        if vel_S.T.shape == (len(cells), len(genes)):
            print("[safe_set_velocity] Transposing vel_S to match target slice")
            vel_S = vel_S.T
        elif vel_S.ndim == 1 and vel_S.shape[0] == len(genes):
            vel_S = np.tile(vel_S, (len(cells), 1))
        elif vel_S.ndim == 1 and vel_S.shape[0] == len(cells):
            vel_S = np.tile(vel_S[:, None], (1, len(genes)))
        elif vel_S.size == 1:
            # vel_S was already converted to a dense ndarray above, so .item()
            # is enough to extract the scalar value
            scalar_value = vel_S.item()
            vel_S = np.full((len(cells), len(genes)), scalar_value)
        else:
            raise ValueError(
                f"[safe_set_velocity] Cannot reshape vel_S {vel_S.shape} → ({len(cells)}, {len(genes)})"
            )

    # --- perform safe assignment ---
    try:
        dense[np.ix_(cells, genes)] = vel_S
        adata.layers["velocity_S"] = sp.csr_matrix(dense)
        print(f"[safe_set_velocity] ✅ Assigned block {vel_S.shape} to velocity_S[{len(cells)}x{len(genes)}]")
    except Exception as e:
        print(f"[safe_set_velocity] ❌ Assignment failed: {e}")
        raise

    return adata
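
For completeness, this is roughly how a patch like this can be wired in. I am not certain of the exact import location across Dynamo versions; the sketch below assumes dyn.tl.dynamics resolves set_velocity through dynamo.tools.dynamics, so that is the name being rebound:

import dynamo as dyn
import dynamo.tools.dynamics as dyn_dynamics

# Rebind the name looked up inside the dynamics module (assumed to be
# `set_velocity`; adjust if your Dynamo version imports it differently).
dyn_dynamics.set_velocity = safe_set_velocity

dyn.tl.dynamics(
    harmonized_adata,
    assumption_mRNA='ss',
    model='deterministic',
    est_method='ols',
    re_smooth=False,
    del_2nd_moments=True,
    cores=1,
)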

This allowed me to bypass the original error, but at the next step (dyn.tl.cell_velocities) I ran into another issue during the PCA calculation: ValueError: Input X contains NaN.

I verified that no NaNs are present in adata.X, adata.layers["velocity"], or X_plus_V, which makes me think that my patch may not be the ideal solution.
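
For reference, a check along these lines (a sketch; it also counts infinities, since inf values can turn into NaN after arithmetic such as X + V, and the layer names may need adjusting):

import numpy as np
import scipy.sparse as sp

def report_nonfinite(name, mat):
    """Print how many NaN / inf entries a dense or sparse matrix holds."""
    data = mat.data if sp.issparse(mat) else np.asarray(mat)
    print(f"{name}: NaN = {np.isnan(data).sum()}, inf = {np.isinf(data).sum()}")

report_nonfinite("X", harmonized_adata.X)
report_nonfinite("velocity_S", harmonized_adata.layers["velocity_S"])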

💭 Questions
1. Do you have any recommendations for resolving the ValueError: could not convert integer scalar in large harmonized datasets?
2. Would it be acceptable to run dyn.tl.dynamics separately on individual subsets (e.g. per dataset, before harmonization) and then merge the results (as sketched below), or would that bias the downstream velocity estimation?
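
For question 2, the mechanics I have in mind are roughly the following (a sketch; the "dataset" column in .obs is an assumption on my side, and the per-gene fit parameters written to .var would differ between subsets, which is part of what I am unsure about):

import anndata as ad
import dynamo as dyn

pieces = []
for name in harmonized_adata.obs["dataset"].unique():
    sub = harmonized_adata[harmonized_adata.obs["dataset"] == name].copy()
    dyn.tl.dynamics(
        sub,
        assumption_mRNA='ss',
        model='deterministic',
        est_method='ols',
        re_smooth=False,
        del_2nd_moments=True,
        cores=1,
    )
    pieces.append(sub)

# Stack the subsets back together; layers such as velocity_S are concatenated
# along cells, while per-gene results in .var are only kept where identical.
merged = ad.concat(pieces, join="inner", merge="same")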

📊 Additional Info
• adata.X: dense float32 array
• All layers (spliced, unspliced, etc.) are dense float32 arrays
• adata dimensions: 4,000,000 × 2,500
• No NaNs detected in any layer or .X
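
One back-of-the-envelope note on size, in case it matters: at 4,000,000 × 2,500, each dense float32 layer is roughly 40 GB, so storing the raw layers as sparse beforehand may also be worth considering (a sketch; I do not know whether Dynamo prefers a particular format here):

import scipy.sparse as sp

# 4,000,000 cells x 2,500 genes x 4 bytes (float32) ≈ 40 GB per dense layer.
print(4_000_000 * 2_500 * 4 / 1e9, "GB per dense float32 layer")

# Convert raw count layers to CSR if they are mostly zeros (an option to
# consider, not a documented Dynamo requirement as far as I know).
for key in ("spliced", "unspliced"):
    if key in harmonized_adata.layers and not sp.issparse(harmonized_adata.layers[key]):
        harmonized_adata.layers[key] = sp.csr_matrix(harmonized_adata.layers[key])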

Thank you very much for your time and for maintaining such a powerful package.
Any guidance on how to proceed would be greatly appreciated!

Best,
Salvo D. Lombardo
