Neg partial log likelihood loss is 0 every time the batch size is 1 #35

mayurmallya · 2024-05-23T18:13:21Z

Hi there,

Thanks for sharing this wonderful library!

I was trying to run a survival analysis using the Cox proportional hazards model and due to the GPU constraints, I have to go with the batch size of 1. And every time I run the model, I observe that the loss value is always 0 when I'm using cox.neg_partial_log_likelihood.

I looked into the implementation of the _partial_likelihood_cox and it seems that the log_denominator gets the same value as the log_hz_sorted when the batch size is 1, resulting in the loss to be 0.

I was wondering if there is a workaround for this issue, please let me know. Also attaching the link to the corresponding code

torchsurv/src/torchsurv/loss/cox.py

Line 174 in 799eb30

log_denominator = torch.logcumsumexp(log_hz_sorted.flip(0), dim=0).flip(0)

Thank you in advance!

The text was updated successfully, but these errors were encountered:

melodiemonod · 2024-05-24T09:38:38Z

Dear @mayurmallya,

Thank you for your interest and for your question!

Summary
When the batch size is 1, the Cox partial log likelihood evaluates to 0. This behavior is due to the nature of the formula and the essence of the Cox partial log likelihood, not an issue in the code. To get meaningful results, you need to increase the sample size. The more participants you include, the more informative your partial log likelihood will be. However, there is a trade-off between efficiency and computational power.

More details
The Cox partial likelihood is constructed from the product of conditional probabilities, specifically the probability that that subject experience an event compared to all the other subjects who are still at risk of having an event. Unlike most likelihood functions, the likelihood depends on all observations in the set; you cannot calculate it separetly for subjects 1-5 and 6-10 and then multiply them together.

Including more subjects in your sample results in a more refined and accurate ranking, which improves the estimation of the log hazards. Conversely, with only one subject, the likelihood provides no information for parameter estimation (because the subject is compared to ... no one).

In maths
From the documentation, the Cox partial log likelihood is

$$ pll = \sum_{i: \delta_i = 1} \left(\log \theta_i - \log\left(\sum_{j \in R(\tau_i)} \theta_j \right) \right) $$

With only one subject $i$, if $\delta_i = 0$ then the sum is null, and if $\delta_i = 1$,

$$ pll = \left(\log \theta_1 - \log\left(\theta_1 \right) \right) = 0. $$

Side note
The partial log likelihood function can be viewed as a ranking measure. As a side note, Raykar et al. (2007) demonstrated that maximizing the Cox partial likelihood is approximately equivalent to maximizing the C-index. You can find more details in their paper here.

I hope this helps,

Melodie

mayurmallya · 2024-05-24T17:58:12Z

Thank you @melodiemonod for the detailed answer, much appreciated!

If I have 300 samples in the dataset, I believe the ideal case scenario would be a batch size of 300 (right?). But because of the computational constraints, the batch size would be lower, let's say 10. In that case the likelihood would be calculated for subjects 1-10 and 10-20 separately right?

I'm just trying to understand what you meant by-

Unlike most likelihood functions, the likelihood depends on all observations in the set; you cannot calculate it separetly for subjects 1-5 and 6-10 and then multiply them together.

Also, based on your experience, what batch size would you recommend? Or is it simply higher the better?

Thank you once again :)

melodiemonod · 2024-05-30T12:17:53Z

Hi @mayurmallya,

1/

I'm just trying to understand what you meant by-

Unlike most likelihood functions, the likelihood depends on all observations in the set; you cannot calculate it separetly for subjects 1-5 and 6-10 and then multiply them together.

You cannot decompose the log likelihood as follows when dealing the cox partial likelihood
$p(Y_1, Y_2) \neq p(Y_1) \times p(Y_2)$

2/

Also, based on your experience, what batch size would you recommend? Or is it simply higher the better?

It's a tradeoff between converging faster and computational power. The primary constraint is often the memory available on your GPU or TPU. On the other hand, larger batch sizes provide more accurate estimates of the gradient, potentially leading to more stable and faster convergence. Practical guidelines advise to start small: Begin with a small batch size (e.g., 32 or 64) to ensure your model trains correctly without memory issues. Monitor performance and track that the loss and accuracy on both the training and validation sets to see if increasing the batch size improves performance. If yes and if you have the memory capacity, gradually increase the batch size to see if it speeds up training without compromising the model’s ability to generalize.

Best regards

Melodie

mayurmallya · 2024-06-03T18:27:34Z

Thank you @ahmedhshahin and @melodiemonod
Much appreciated! :)

PierpaoloV · 2024-10-23T09:11:09Z

Hi!
Thank you for your nice library!!

I was wondering:

I am working with inputs that are tensor of dimension [x, 1024], with x bein the number of patches in a histology image.
The goal is to predict survival and I would like to use a CoxPH model with this function.
Because of the nature of the input I cannot initializate a dataloader with batch size >1, so what I am doing during the training step is actually to perform n forward passes (usually 8) and then stacking all the predictions, the times, the events and then performing a single loss calculation and backward.
I see that the model somehow trains as the C-index over the test set is >0.6 but I am wondering whether I could do that step differently?

melodiemonod self-assigned this May 24, 2024

melodiemonod added good first issue Good for newcomers question Further information is requested labels May 24, 2024

mayurmallya closed this as completed Jun 3, 2024

Novartis deleted a comment from ahmedhshahin Jun 26, 2024

tcoroller pinned this issue Jul 8, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Neg partial log likelihood loss is 0 every time the batch size is 1 #35

Neg partial log likelihood loss is 0 every time the batch size is 1 #35

mayurmallya commented May 23, 2024

melodiemonod commented May 24, 2024

mayurmallya commented May 24, 2024

melodiemonod commented May 30, 2024

mayurmallya commented Jun 3, 2024

PierpaoloV commented Oct 23, 2024

Neg partial log likelihood loss is 0 every time the batch size is 1 #35

Neg partial log likelihood loss is 0 every time the batch size is 1 #35

Comments

mayurmallya commented May 23, 2024

melodiemonod commented May 24, 2024

mayurmallya commented May 24, 2024

melodiemonod commented May 30, 2024

mayurmallya commented Jun 3, 2024

PierpaoloV commented Oct 23, 2024