Influence of the binning #10

Open
burunduk57 opened this issue Nov 29, 2023 · 3 comments

Comments

@burunduk57

[Attached figures: DBauer-binbig, DBauer-binssmall, Grossfield]

Dear Daniel,

Thank you for implementing WHAM; I found it very convenient. However, I have an issue with the influence of binning on the shape of the PMF. I've attached figures with PMFs obtained for the same dataset with the Grossfield WHAM program and with yours, using different numbers of bins. In the Grossfield implementation, once the number of bins exceeds 100, the PMF becomes relatively smooth and stays the same as the number of bins increases further. The two other figures show PMFs from your WHAM (SDs are not shown; these are the free energy values themselves). A relatively smooth PMF is obtained with the smallest number of bins (30), while with an increasing number of bins the PMF becomes more and more wavy. What do you think could be done about this?

Thank you,
Sofya

@dnlbauer
Owner

Hi Sofya,

It's been a while since I developed this library, but as far as I remember, most of the code is very similar to Grossfield's. There shouldn't be a significant difference in the results if used on the same dataset.

I wonder if the wavy lines might be a result of insufficient samples in the individual bins. If you increase the number of bins, the bin width gets smaller, and therefore fewer samples fall into each bin. If the count per bin approaches 0, the algorithm becomes unstable. What leads me to this suspicion is that the region around the global minimum (where we can expect a higher sample count) seems less wavy overall. Could you check the number of samples in the individual bins?
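One way to run that check is sketched below. This is only an illustration under stated assumptions: the time series are assumed to be pooled into a flat list of reaction-coordinate values, and the helper names `bin_counts` and `sparse_bins`, the threshold of 10 samples, and the synthetic Gaussian data are all hypothetical and not part of either WHAM tool.

```python
import random


def bin_counts(samples, n_bins, lo, hi):
    """Count samples in n_bins equal-width bins on [lo, hi).

    Values outside the range are ignored.
    """
    width = (hi - lo) / n_bins
    counts = [0] * n_bins
    for x in samples:
        if lo <= x < hi:
            # min() guards against a float rounding up to n_bins
            idx = min(int((x - lo) / width), n_bins - 1)
            counts[idx] += 1
    return counts


def sparse_bins(counts, min_count=10):
    """Return indices of bins holding fewer than min_count samples."""
    return [i for i, c in enumerate(counts) if c < min_count]


# Synthetic stand-in for pooled umbrella-sampling data (assumption):
# six windows, each a Gaussian around its restraint centre.
random.seed(0)
centers = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0]
samples = [random.gauss(c, 0.3) for c in centers for _ in range(3000)]

for n_bins in (30, 100, 500):
    counts = bin_counts(samples, n_bins, -1.0, 6.0)
    # Print bin count, smallest per-bin population, number of sparse bins
    print(n_bins, min(counts), len(sparse_bins(counts)))
```

If the number of sparse bins grows quickly with the bin count, that would be consistent with the suspicion above; if even the finest binning leaves every bin well populated, the cause probably lies elsewhere.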

Also, from the plots it's not entirely clear to me what I am looking at:

  • Am I correct that "b100" stands for 100 bins, "b500" for 500 bins, etc.? Or are these different datasets, with the bin count differing only between the plots? Can you please share more details on that and ideally also list all settings you used for WHAM?
  • Which of the three plots are from Grossfield's tool and which were made with this tool? I can spot wavy lines in all three of them for at least one "b" number. The bottom one seems to be the least wavy, except for b50, which again has ups and downs.
  • Any idea why b60 in the middle plot looks so different?

Cheers,
Daniel

@burunduk57
Author

Hi, thank you for the answer.

  1. Yes, indeed, b in the legend is the number of bins. The dataset is the same. I'm attaching the whole folder with the dataset and scripts.
    wham-test.tar.gz
    The total number of data points per window is 30'000, which should be sufficient for any binning.
  2. The first two plots are PMFs obtained with your tool, and the third is from Grossfield's. For Grossfield's tool, the typical behavior is that the PMF is wavy with a small number of bins and becomes smoother as the number of bins increases, with no significant change in its shape on further increase. The third plot illustrates this.
  3. I wonder about that too. It could be a mistake in running WHAM, but I double-checked, and the run is set up the same way as the others.

Best regards,
Sofya

@dnlbauer
Owner

dnlbauer commented Dec 2, 2023

Hi,

I played around with your dataset and can reproduce what you observed. I get the same results for your settings.

I also looked a bit into the raw data and can't find anything wrong with it. The histograms look well defined to me, and there are enough data points, with no obvious regions lacking proper overlap:

Individual histograms from all timeseries:
image

Combined histograms for bin counts 10, 30, 100, 1000:
image
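The visual overlap check can also be put into numbers. The sketch below is one possible way to do it, not what either WHAM tool does internally: it computes the overlap coefficient (sum of bin-wise minima of two normalized histograms) for each pair of adjacent windows. The function names and the toy data are hypothetical, and the windows are assumed to be ordered along the reaction coordinate.

```python
def normalized_hist(samples, n_bins, lo, hi):
    """Normalized histogram of samples on [lo, hi) with n_bins equal bins."""
    width = (hi - lo) / n_bins
    counts = [0] * n_bins
    for x in samples:
        if lo <= x < hi:
            # min() guards against a float rounding up to n_bins
            counts[min(int((x - lo) / width), n_bins - 1)] += 1
    total = sum(counts)
    if total == 0:
        return [0.0] * n_bins
    return [c / total for c in counts]


def overlap(p, q):
    """Overlap coefficient: 1.0 for identical histograms, 0.0 for disjoint."""
    return sum(min(a, b) for a, b in zip(p, q))


def adjacent_overlaps(windows, n_bins, lo, hi):
    """Overlap between each window and its right-hand neighbour."""
    hists = [normalized_hist(w, n_bins, lo, hi) for w in windows]
    return [overlap(hists[i], hists[i + 1]) for i in range(len(hists) - 1)]


# Toy data (assumption): the first two windows coincide, the third is apart.
windows = [[0.1, 0.2], [0.1, 0.2], [0.8, 0.9]]
print(adjacent_overlaps(windows, n_bins=2, lo=0.0, hi=1.0))
```

Note that the value depends on the bin count used for the histograms, so it is best compared across windows at a fixed binning.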

Also, trying these things didn't solve the issue:

  • leaving out some of the dataset
  • changing the min/max lambda values
  • skipping the first 600 data points of each series
  • using an older version of WHAM

This really puzzles me, since I never experienced something like this with other data sets.
