PDB loader throws errors - Note: fixed in release branch #3054

butlerpd · 2024-08-17T21:15:30Z

Describe the bug
Loading a PDB in the GSC throws a bunch of read errors into the log file. It seems like the program may be expecting to find atoms named C, N, O, which are in the last column (though not even available in all PDB files) but in some cases at least it seems to be reading the identity column which has things like CA, CB, CG, H1 etc. (alpha carbon etc), and there are many variations of these included in the PDB. But it does not seem to be consistent. All PDB files read in seem to have a slew of

17:57:06 - ERROR: string index out of range.

Others however also have a bunch of warning lines like:

17:57:06 - WARNING: Warning: set the sld of Cg to zero.

In those cases there seems to be at least one traceback error after the read (when it now auto calculates the Rg) though that may be a function of the fact that both files tested had the hydrogen atoms included and in some case two. The second error is what showed up when only one error was generated, while this is the order in the second case. Other types of files may cause different errors/permutations:

17:57:06 - ERROR: Traceback (most recent call last): File "sas\qtgui\Calculators\GenericScatteringCalculator.py", line 560, in change_data_type File "sas\qtgui\Calculators\GenericScatteringCalculator.py", line 912, in update_gui File "sas\qtgui\Calculators\GenericScatteringCalculator.py", line 923, in update_Rg File "sas\sascalc\calculator\geni.py", line 494, in radius_of_gyration File "sas\sascalc\calculator\geni.py", line 557, in center_of_mass File "periodictable\core.py", line 270, in symbol ValueError: unknown element H1
17:57:06 - ERROR: Traceback (most recent call last): File "sas\qtgui\Calculators\GenericScatteringCalculator.py", line 812, in complete_loading File "sas\qtgui\Calculators\GenericScatteringCalculator.py", line 912, in update_gui File "sas\qtgui\Calculators\GenericScatteringCalculator.py", line 923, in update_Rg File "sas\sascalc\calculator\geni.py", line 494, in radius_of_gyration File "sas\sascalc\calculator\geni.py", line 557, in center_of_mass File "periodictable\core.py", line 270, in symbol ValueError: unknown element H1

Given the many WARNINGSaboutindex out of range` or setting the SLD of the various elements to zero then running anyway is rather concerning. In particular if the calculation is ignoring atoms and/or setting them their sld to zero would cause incorrect answers.

To Reproduce
Steps to reproduce the behavior:

Go to the Generic Scattering Calculator in the tools menu
Load one of the attached PDB files (each produces a different variant as described above)
Check the log explorer
See above warnings and/or errors.

Expected behavior
The file should be read in correctly, OR throw an exception that warns the user that the PDB cannot be read (and do not load it)

SasView version (please complete the following information):

Version: 6.0.0b2

Operating system (please complete the following information):

OS: Windows 10

Additional context
Identified during NIST CNR summer school and thus being labelled 6.0.0. Whether the error existed in 5.0.6 has not yet been verified. However, if it is silently producing wrong answers I would argue fixing it now, if possible, would still be appropriate

FILES to REPRODUCE
NOTE: these should be renamed as *.pdb before using. Github does not allow that extension hence the change to .txt here

The text was updated successfully, but these errors were encountered:

wpotrzebowski · 2024-08-19T11:25:29Z

#2968 seems to be related

krzywon · 2024-08-19T15:46:57Z

The string index out of range errors are not new to v6.0.0, but the other errors are. If they are related, then we should fix them all for the release. Otherwise, the index error should wait for a point release.

lucas-wilkins · 2024-08-21T10:16:44Z

I fixed it before, just didn't get to the beta. @timsnow has confirmed.

butlerpd · 2024-08-30T21:19:07Z

Actually the fix only fixes the first error string index out of range and only for the first example file. However, the problem with the other files is that they do not conform (in a very subtle way) to the currently published PDB format v 3.3, which states that the element name is right justified. in the other two example files they are left justified.

Apparently, there is a lot of history of violent arguments about what the correct format should be (or what should be a valid PDB file). For our purposes, as long as we are "rolling our own" I suggest we stick to the recommended v3.3 standard. We should however do a lot better job of using the exception to not load the file and send a message to the log that the pdb does not match the v3.3 standard definition. This would remove the problem of later hanging when pressing compute while giving more useful (and less verbose) feedback to the user.

lucas-wilkins · 2024-09-05T15:31:31Z

@butlerpd Unfortunately, it's pretty complicated. I think whatever the solution, it's going to be hard to get it to work automatically, as the extent and manner of what is represented in a PDB can vary quite a lot. The most obvious thing being whether H is present or not. There are other things too though, like HOH entries, and the alpha, beta, gamma carbons etc.

butlerpd · 2024-09-19T14:57:44Z

I agree. see #3055 (comment) Your previous fix @lucas-wilkins does make the reader read anything that conform properly to the current published standard.

However, as mentioned in that issue also (#3055 (comment)) I think we can easily prevent the hanging when reading "malformed" files.

For a later discussion would be I think: Do we just say we will only deal with files that strictly adhere to the standard and provide all the atoms of interest or do we try to use a python reader (e.g. the Bio.PDB module of the biopython.py package - BSD-3 license) to handle the reading more elegantly? Personally I think documenting ourselves as "strict" would be the way to go until someone who feels it is important to their science to be more forgiving decides to get involved and contribute?

butlerpd added Critical High priority SasView 6.0.0 Required for 6.0.0 release labels Aug 17, 2024

butlerpd mentioned this issue Aug 17, 2024

Full Debye calculator hangs #3055

Closed

juliuskarliczek self-assigned this Aug 21, 2024

lucas-wilkins closed this as completed Aug 21, 2024

lucas-wilkins changed the title ~~PDB loader throws errors~~ PDB loader throws errors - Note: fixed in release branch Aug 21, 2024

juliuskarliczek removed their assignment Aug 21, 2024

lucas-wilkins reopened this Sep 5, 2024

krzywon added the Discuss At The Call Issues to be discussed at the fortnightly call label Sep 19, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

PDB loader throws errors - Note: fixed in release branch #3054

PDB loader throws errors - Note: fixed in release branch #3054

butlerpd commented Aug 17, 2024

wpotrzebowski commented Aug 19, 2024

krzywon commented Aug 19, 2024

lucas-wilkins commented Aug 21, 2024

butlerpd commented Aug 30, 2024

lucas-wilkins commented Sep 5, 2024

butlerpd commented Sep 19, 2024

PDB loader throws errors - Note: fixed in release branch #3054

PDB loader throws errors - Note: fixed in release branch #3054

Comments

butlerpd commented Aug 17, 2024

wpotrzebowski commented Aug 19, 2024

krzywon commented Aug 19, 2024

lucas-wilkins commented Aug 21, 2024

butlerpd commented Aug 30, 2024

lucas-wilkins commented Sep 5, 2024

butlerpd commented Sep 19, 2024