Error in make_dataset and question about filtering #6

Open
@octavian-ganea

Hi,

Thanks for these great resources. I have two questions:

  1. Could you please detail exactly which filtering criteria prune_pairs.py applies, and whether they were already applied to the 42,826 pairs listed in the paper?
  2. I tried to run make_dataset on a subset of DIPS but got the error below. Could you please help? Thanks.
$ python src/make_dataset.py ../raw/pdb/ ../interim
2021-09-06 13:35:29,892 INFO 10990: making final data set from interim data
2021-09-06 13:35:33,994 INFO 10990: 2566 requested keys, 0 produced keys, 2566 work keys
2021-09-06 13:35:34,058 INFO 10990: Processing 2566 inputs.
2021-09-06 13:35:34,058 INFO 10990: Sequential Mode.
2021-09-06 13:35:34,058 INFO 10990: Reading ../raw/pdb/17/317d.pdb1.gz
Traceback (most recent call last):
  File "src/make_dataset.py", line 45, in <module>
    main()
  File "miniconda/miniconda3/lib/python3.8/site-packages/click/core.py", line 1134, in __call__
    return self.main(*args, **kwargs)
  File "miniconda/miniconda3/lib/python3.8/site-packages/click/core.py", line 1059, in main
    rv = self.invoke(ctx)
  File "miniconda/miniconda3/lib/python3.8/site-packages/click/core.py", line 1401, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "miniconda/miniconda3/lib/python3.8/site-packages/click/core.py", line 767, in invoke
    return __callback(*args, **kwargs)
  File "src/make_dataset.py", line 30, in main
    pa.parse_all(input_dir, parsed_dir, num_cpus)
  File "miniconda/miniconda3/lib/python3.8/site-packages/atom3/parse.py", line 57, in parse_all
    par.submit_jobs(parse, inputs, num_cpus)
  File "miniconda/miniconda3/lib/python3.8/site-packages/parallel.py", line 62, in submit_jobs
    out = [function(*args) for args in inputs]
  File "miniconda/miniconda3/lib/python3.8/site-packages/parallel.py", line 62, in <listcomp>
    out = [function(*args) for args in inputs]
  File "miniconda/miniconda3/lib/python3.8/site-packages/atom3/parse.py", line 64, in parse
    df = struct.parse_structure(pdb_filename, one_model=False)
  File "miniconda/miniconda3/lib/python3.8/site-packages/atom3/structure.py", line 61, in parse_structure
    biopy_structure = db.parse_biopython_structure(structure_filename)
  File "miniconda/miniconda3/lib/python3.8/site-packages/atom3/database.py", line 59, in parse_biopython_structure
    biopy_structure = parser.get_structure('pdb', gzip.open(pdb_filename))
  File "miniconda/miniconda3/lib/python3.8/site-packages/Bio/PDB/PDBParser.py", line 100, in get_structure
    self._parse(lines)
  File "miniconda/miniconda3/lib/python3.8/site-packages/Bio/PDB/PDBParser.py", line 121, in _parse
    self.header, coords_trailer = self._get_header(header_coords_trailer)
  File "miniconda/miniconda3/lib/python3.8/site-packages/Bio/PDB/PDBParser.py", line 139, in _get_header
    header_dict = _parse_pdb_header_list(header)
  File "miniconda/miniconda3/lib/python3.8/site-packages/Bio/PDB/parse_pdb_header.py", line 199, in _parse_pdb_header_list
    pdbh_dict["structure_reference"] = _get_references(header)
  File "miniconda/miniconda3/lib/python3.8/site-packages/Bio/PDB/parse_pdb_header.py", line 38, in _get_references
    if re.search(r"\AREMARK   1", l):
  File "miniconda/miniconda3/lib/python3.8/re.py", line 201, in search
    return _compile(pattern, flags).search(string)
TypeError: cannot use a string pattern on a bytes-like object
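
A note on the traceback, in case it helps narrow things down: the last frames suggest that `gzip.open(pdb_filename)` in atom3/database.py hands Biopython a binary-mode handle (the default mode of `gzip.open` is `"rb"`), so the header parser ends up running a `str` regex against `bytes` lines. The sketch below reproduces that mismatch with the standard library only. The tiny file and the `count_remark1` helper are hypothetical stand-ins for illustration, not atom3 or Biopython code, and I have not verified that opening the file in text mode is the right fix for atom3 itself.

```python
import gzip
import os
import re
import tempfile

# Same pattern that appears in the parse_pdb_header.py frame of the traceback.
PATTERN = r"\AREMARK   1"


def count_remark1(lines):
    """Count lines matching PATTERN (hypothetical stand-in for the header scan)."""
    return sum(1 for line in lines if re.search(PATTERN, line))


def demo():
    # Hypothetical two-line file, only here to keep the sketch self-contained.
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "example.pdb1.gz")
        with gzip.open(path, "wt") as fh:
            fh.write("HEADER    DNA\nREMARK   1 REFERENCE 1\n")

        # Default mode is binary ("rb"): lines are bytes, so the str pattern fails.
        try:
            with gzip.open(path) as fh:
                count_remark1(fh)
            binary_error = None
        except TypeError as exc:
            binary_error = str(exc)

        # Text mode ("rt"): lines are str, and the scan goes through.
        with gzip.open(path, "rt") as fh:
            text_count = count_remark1(fh)

    return binary_error, text_count


binary_error, text_count = demo()
print(binary_error)
print(text_count)
```

If this is indeed the cause, passing `"rt"` as the mode in the `gzip.open` call inside `parse_biopython_structure` might be enough, though that is an assumption on my side.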
