Project author: magnitov

Project description:
The code from the paper "An automated protocol for modelling peptide substrates to proteases"
Language: Python
Repository: git://github.com/magnitov/protease_annotation_pipeline.git
Created: 2020-11-11T08:46:41Z
Project page: https://github.com/magnitov/protease_annotation_pipeline

License: MIT License


Annotation of protease structures

Rodrigo Ochoa, Mikhail Magnitov, Roman A Laskowski, Pilar Cossio & Janet M Thornton. An automated protocol for modelling peptide substrates to proteases. BMC Bioinformatics (2020)

Purpose

These scripts provide an example workflow for annotating protease-peptide structures prior to the modelling step described in https://github.com/rochoa85/Modelling-Protease-Substrates. The first script annotates the proteases with the necessary information from publicly available resources. The second script assigns binding pockets to each substrate residue using the annotation of the reference substrate and protease.

Extra files required

To perform the annotation of the ligand and protein chains, two files from the PDBsum database are required:

  1. The rasmol.dfn file, which contains information on the ligand chain and residues. It is available for each PDB entry at https://www.ebi.ac.uk/thornton-srv/databases/pdbsum/PDB_code/rasmol.dfn

  2. The grow.out file, which contains information on the protein chains and residues bound to the ligand. It is available for each PDB entry at https://www.ebi.ac.uk/thornton-srv/databases/pdbsum/PDB_code/grow.out

NOTE: Put these files into the annotation folder (see repository for an example).
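As a minimal sketch of how the two PDBsum files could be fetched for one entry, the helper below builds the URLs from the pattern given above and saves the files locally. The function names, the lowercasing of the PDB code, and the annotation/<PDB code>/ folder layout are assumptions made for illustration; check the repository example for the layout the scripts actually expect.

```python
from pathlib import Path
from urllib.request import urlretrieve

# URL pattern taken from the text above; PDB_code is replaced by the entry code.
PDBSUM_BASE = "https://www.ebi.ac.uk/thornton-srv/databases/pdbsum"
PDBSUM_FILES = ("rasmol.dfn", "grow.out")


def pdbsum_urls(pdb_code):
    """Return a {file name: URL} mapping for one PDB entry."""
    code = pdb_code.strip().lower()  # assumption: codes are written in lowercase
    return {name: f"{PDBSUM_BASE}/{code}/{name}" for name in PDBSUM_FILES}


def fetch_pdbsum_files(pdb_code, folder="annotation"):
    """Download both files into <folder>/<pdb_code>/ (hypothetical layout)."""
    target = Path(folder) / pdb_code.strip().lower()
    target.mkdir(parents=True, exist_ok=True)
    saved = []
    for name, url in pdbsum_urls(pdb_code).items():
        destination = target / name
        urlretrieve(url, str(destination))  # performs the network request
        saved.append(destination)
    return saved
```

Calling fetch_pdbsum_files("3tjv") would request both files for the sample entry; pdbsum_urls("3tjv") alone only builds the URLs and needs no network access.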

To perform the annotation of the IDs and catalytic residues, two files from the PDBsum and M-CSA databases are required:

  1. Annotation of PDB structures with EC numbers and UniProt IDs. This file can be downloaded from: http://www.ebi.ac.uk/thornton-srv/databases/pdbsum/data/pdb_chain_sp_ec (36.7 Mb)

  2. Information about the catalytic residues, including the residues found by homology. This file can be downloaded from: https://www.ebi.ac.uk/thornton-srv/m-csa/api/homologues_residues.json (295.4 Mb)

To perform the annotation of the pockets, one additional file from the PDB is required:

  1. Chemical Component Dictionary. The file can be downloaded from: ftp://ftp.wwpdb.org/pub/pdb/data/monomers/components.cif (317.7 Mb)

Implementation

Step 1: Annotating the dataset

First, proteases are annotated with the ligand and protein PDB chains, EC number, and UniProt and MEROPS IDs. The EC number and UniProt ID are looked up in the downloaded PDBsum database file, and the MEROPS ID is retrieved through the UniProt REST API. Second, the protein chains bound to the ligand are annotated with a list of catalytic residues, using the downloaded M-CSA database file.

NOTE: If any protein chains are not annotated with catalytic residues, one can perform a homology search using the protease sequence in order to infer them.
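The MEROPS lookup described above can be illustrated with a short sketch. It is not the code used by 01_run_annotation.py: it assumes the flat-text entry endpoint of the current UniProt REST API (rest.uniprot.org) and reads the MEROPS cross-reference from the DR lines of the entry. The function names and the two-line sample entry are hypothetical and only serve to show the parsing step.

```python
import re
from urllib.request import urlopen

# Assumed endpoint: flat-text UniProtKB entry from the current UniProt REST API.
UNIPROT_TXT_URL = "https://rest.uniprot.org/uniprotkb/{accession}.txt"

# Cross-reference lines in the flat-text format look like:
# DR   MEROPS; S01.147; -.
MEROPS_LINE = re.compile(r"^DR\s+MEROPS;\s*([^;]+);", re.MULTILINE)


def extract_merops_ids(entry_text):
    """Return all MEROPS IDs found in the DR lines of a flat-text entry."""
    return [match.strip() for match in MEROPS_LINE.findall(entry_text)]


def fetch_merops_ids(accession):
    """Download one UniProtKB entry (network request) and extract MEROPS IDs."""
    with urlopen(UNIPROT_TXT_URL.format(accession=accession)) as response:
        return extract_merops_ids(response.read().decode("utf-8"))


# Hand-written excerpt in flat-text style, used to show the parser offline.
sample_entry = "DR   PDBsum; 3TJV; -.\nDR   MEROPS; S01.147; -.\n"
merops_ids = extract_merops_ids(sample_entry)
```

Keeping the parser separate from the download makes it possible to check the parsing on saved entries without contacting the server.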

The basic command line to run the annotation script is:

    python 01_run_annotation.py --input INPUT [--output OUTPUT]

    arguments:
      --input INPUT      Dataset with protease-ligand pairs
      --output OUTPUT    Path to save the annotated dataset

Step 2: Identification of pockets

To identify the pocket residues, we use the PDB coordinates of the reference substrate residues. The reference substrate should span pockets S4-S4' or, failing that, be the longest substrate available. Protease residues with at least one atom within a defined distance of a substrate residue are considered to form the binding pocket for that residue. We then project the pocket annotation onto other substrates within the protease family.
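The distance criterion described above can be sketched in a few lines. This is an illustration under stated assumptions, not the implementation in 02_find_pockets.py: atoms are given as plain coordinate tuples, the comparison is inclusive (<=), and the residue labels and coordinates in the toy data are invented for the example.

```python
import math


def pocket_residues(protein_atoms, ligand_atoms, radius):
    """Return protein residues with at least one atom within `radius` of any
    atom of one substrate residue.

    protein_atoms: list of (residue_label, (x, y, z)) pairs
    ligand_atoms:  list of (x, y, z) tuples for one substrate residue
    """
    found = set()
    for label, coords in protein_atoms:
        if label in found:
            continue  # one atom within range is enough for the residue
        if any(math.dist(coords, atom) <= radius for atom in ligand_atoms):
            found.add(label)
    return sorted(found)


# Toy coordinates (invented): two residues near the substrate atom, one far away.
protein = [
    ("HIS59", (0.0, 0.0, 0.0)),
    ("HIS59", (1.5, 0.0, 0.0)),
    ("SER197", (4.0, 0.0, 0.0)),
    ("LEU150", (12.0, 0.0, 0.0)),
]
substrate_residue = [(3.0, 0.0, 0.0)]
nearby = pocket_residues(protein, substrate_residue, radius=4.5)
```

Running the same function once per substrate residue gives one residue list per pocket, which is the shape of the annotation shown in the examples.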

NOTE: This analysis requires two additional steps. The first is selecting a reference structure. The second is superimposing the proteases of interest onto the selected reference structure (we suggest using PDBeFold) and saving the new, superimposed coordinates of the structures.

The basic command line to run the pockets identification script is:

    python 02_find_pockets.py --input INPUT --ref REF [--extra_ref EXTRA_REF]
                              --radius RADIUS --cutoff CUTOFF [--output OUTPUT]

    arguments:
      --input INPUT            Dataset with protease-ligand pairs to annotate
      --ref REF                Reference protease-ligand structure
      --extra_ref EXTRA_REF    Extra reference protease-ligand structure for S1
                               pocket refinement
      --radius RADIUS          Distance to search for pocket residues
      --cutoff CUTOFF          Distance to project target reference residues on
                               ligand residues
      --output OUTPUT          Path to save the dataset with annotated pockets
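For readers who want to wrap the script or validate arguments before a long run, the interface above can be mirrored with argparse. This is a reconstruction from the usage text only, not the script's source: the float types for --radius and --cutoff, the comma-separated handling of --extra_ref, and the default output name are assumptions.

```python
import argparse


def build_parser():
    """Argument parser mirroring the documented usage (a reconstruction)."""
    parser = argparse.ArgumentParser(prog="02_find_pockets.py")
    parser.add_argument("--input", required=True,
                        help="Dataset with protease-ligand pairs to annotate")
    parser.add_argument("--ref", required=True,
                        help="Reference protease-ligand structure")
    parser.add_argument("--extra_ref", default=None,
                        help="Extra reference structure(s) for S1 pocket refinement")
    parser.add_argument("--radius", type=float, required=True,
                        help="Distance to search for pocket residues")
    parser.add_argument("--cutoff", type=float, required=True,
                        help="Distance to project reference residues on ligand residues")
    parser.add_argument("--output", default="annotated_pockets.csv",  # assumed default
                        help="Path to save the dataset with annotated pockets")
    return parser


# Parse the argument values used in the Examples section of this document.
args = build_parser().parse_args([
    "--input", "sample_data_pockets.csv",
    "--ref", "3tjv",
    "--extra_ref", "1iau,1ox1,1hne,1ab9",
    "--radius", "4.5",
    "--cutoff", "2",
])
# Assumption: several extra references are passed as one comma-separated value.
extra_refs = args.extra_ref.split(",") if args.extra_ref else []
```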

Examples

Step 1: Annotating the dataset

For this step, we provide sample data for 7 proteases from major protease families in sample_data_annotation.csv. Each line contains a PDB code followed by the ligand sequence:

    3tjv,PRO-THR-SER-TYR-ALA-GLY-ASP-ASP-SER
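A minimal sketch of reading this input format is shown below. It assumes each row holds exactly a PDB code and a dash-separated ligand sequence and that the file has no header row; the function name is hypothetical and the annotation script may read the file differently.

```python
import csv
import io


def read_pairs(handle):
    """Read (PDB code, ligand sequence) rows and derive the ligand size."""
    pairs = []
    for row in csv.reader(handle):
        if len(row) < 2:
            continue  # skip blank or incomplete lines
        pdb_code, ligand = row[0].strip(), row[1].strip()
        pairs.append({
            "PDB": pdb_code,
            "Ligand": ligand,
            "Ligand_size": len(ligand.split("-")),
        })
    return pairs


# The sample line from above, read from an in-memory file.
sample = io.StringIO("3tjv,PRO-THR-SER-TYR-ALA-GLY-ASP-ASP-SER\n")
pairs = read_pairs(sample)
```

For a file on disk, the same function can be called as read_pairs(open("sample_data_annotation.csv", newline="")).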

The following command runs the annotation: python 01_run_annotation.py --input sample_data_annotation.csv

After running this command, the annotated dataset will be saved to annotated_dataset.csv. It will include the following fields:

    PDB: 3tjv
    Ligand: PRO-THR-SER-TYR-ALA-GLY-ASP-ASP-SER
    Ligand_size: 9
    Ligand_chain: B
    Ligand_residues_number: 1, 2, 3, 4, 5, 6, 7, 8, 9
    Protein_chains: A
    Protein_residues: A100PHE, A144TYR, A150LEU, A174ASN, A192GLY, A193PHE, A194LYS, A195GLY, A196ASP, A197SER, A212SER, A213TYR, A214GLY, A215ASN, A216LYS, A218GLY, A32PHE, A34GLN, A41ARG, A42LYS, A43ARG, A44CYS, A59HIS, A75LYS
    EC_number: 3.4.21.-
    UniProt_ID: P20718
    MEROPS_ID: S01.147
    Catalytic_residues: A103ASN, A191THR, A194LYS, A195GLY, A196ASP, A197SER, A198GLY, A59HIS, A99ASN
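The residue labels in these fields combine the chain, residue number, and residue name (for example, A59HIS). As a sketch of how they could be post-processed, the snippet below splits the labels and lists the catalytic residues that are also in contact with the ligand. The regular expression assumes a single-letter chain ID and no insertion codes, and the helper names are hypothetical.

```python
import re

# Assumed label layout: one-letter chain ID, residue number, three-letter name.
RESIDUE_LABEL = re.compile(r"^([A-Za-z])(-?\d+)([A-Z]{3})$")


def parse_residue(label):
    """Split a label such as 'A59HIS' into ('A', 59, 'HIS')."""
    match = RESIDUE_LABEL.match(label.strip())
    if match is None:
        raise ValueError(f"Unrecognised residue label: {label!r}")
    chain, number, name = match.groups()
    return chain, int(number), name


def parse_field(value):
    """Parse a comma-separated field into a set of residue tuples."""
    return {parse_residue(item) for item in value.split(",") if item.strip()}


protein_residues = parse_field(
    "A100PHE, A144TYR, A150LEU, A174ASN, A192GLY, A193PHE, A194LYS, A195GLY, "
    "A196ASP, A197SER, A212SER, A213TYR, A214GLY, A215ASN, A216LYS, A218GLY, "
    "A32PHE, A34GLN, A41ARG, A42LYS, A43ARG, A44CYS, A59HIS, A75LYS"
)
catalytic_residues = parse_field(
    "A103ASN, A191THR, A194LYS, A195GLY, A196ASP, A197SER, A198GLY, A59HIS, A99ASN"
)

# Catalytic residues that also contact the ligand, ordered by chain and number.
shared = sorted(protein_residues & catalytic_residues)
```

Parsing the number as an integer also gives a numeric sort order, whereas the fields above are sorted as text (A100PHE before A32PHE).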

Step 2: Identification of pockets

For this step, we provide sample data for 10 annotated serine proteases in sample_data_pockets.csv. It includes our reference structure, four additional structures used to refine the pockets, and five example structures.

The following command runs the pockets search: python 02_find_pockets.py --input sample_data_pockets.csv --ref 3tjv --extra_ref 1iau,1ox1,1hne,1ab9 --radius 4.5 --cutoff 2

After running this command, the annotated pockets will be displayed in the shell:

    Pocket residues annotated:
    S4: ASN174, GLY214, PHE100, PHE171, TYR213
    S3: ASN215, GLY214, LYS194, LYS216, PHE100, TYR213
    S2: HIS59, LYS194, PHE100, SER212, TYR213
    S1: ARG43, ASN215, ASP196, CYS44, GLY192, GLY195, GLY214, GLY218, GLY222, HIS59, LEU211, LYS194, PHE193, SER197, SER212, THR191, THR219, TYR213
    S1': ARG43, CYS44, GLY195, HIS59, LYS194, PHE35, SER197
    S2': ARG41, ARG43, GLY195, LYS194, LYS42
    S3': ARG41, ARG43, LEU150, LYS42, TYR144
    S4': ARG41, ARG43, GLN34, LEU150, LYS42, LYS75, PHE32
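If the shell output is captured (for example, redirected to a file), it can be turned into a dictionary for further analysis. The sketch below assumes the "pocket: residue, residue, ..." layout shown above; the function name is hypothetical, and the report string reuses three of the lines from this example.

```python
def parse_pocket_report(text):
    """Convert 'S1: RES1, RES2' lines into a {pocket: [residues]} dictionary."""
    pockets = {}
    for line in text.splitlines():
        name, separator, residues = line.partition(":")
        name = name.strip()
        # Skip the title line and anything that is not a pocket entry.
        if not separator or not name.startswith("S") or not residues.strip():
            continue
        pockets[name] = [res.strip() for res in residues.split(",") if res.strip()]
    return pockets


report = """Pocket residues annotated:
S2: HIS59, LYS194, PHE100, SER212, TYR213
S1': ARG43, CYS44, GLY195, HIS59, LYS194, PHE35, SER197
S2': ARG41, ARG43, GLY195, LYS194, LYS42
"""
pockets = parse_pocket_report(report)

# Residues that contribute to both the S1' and S2' pockets.
shared_prime = sorted(set(pockets["S1'"]) & set(pockets["S2'"]))
```

Neighbouring pockets often share residues, so a set intersection like the one above is a quick way to see how much two pockets overlap.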

The assigned pockets, together with the residues and residue numbers of all truncated ligands in the dataset, will be saved to annotated_pockets.csv. It will include the following new fields (shown for PDB 3tjv, which is the reference):

    Truncated_ligand: PRO-THR-SER-TYR-ALA-GLY-ASP-ASP
    Truncated_numbering: 1, 2, 3, 4, 5, 6, 7, 8
    Truncated_pocket_names: S4, S3, S2, S1, S1', S2', S3', S4'
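The three fields are parallel lists, so each ligand residue can be paired with its number and pocket by position. The sketch below does this for the values shown above and locates the residues on either side of the scissile bond, which by convention lies between the S1 and S1' pockets. The function name is hypothetical.

```python
def pocket_assignments(ligand, numbering, pocket_names):
    """Zip the three fields into (residue number, residue name, pocket) tuples."""
    residues = ligand.split("-")
    numbers = [int(item) for item in numbering.split(",")]
    pockets = [item.strip() for item in pocket_names.split(",")]
    if not (len(residues) == len(numbers) == len(pockets)):
        raise ValueError("The three fields must have the same number of items")
    return list(zip(numbers, residues, pockets))


assignments = pocket_assignments(
    "PRO-THR-SER-TYR-ALA-GLY-ASP-ASP",
    "1, 2, 3, 4, 5, 6, 7, 8",
    "S4, S3, S2, S1, S1', S2', S3', S4'",
)

# Residues flanking the scissile bond (bound in S1 and S1').
by_pocket = {pocket: (number, name) for number, name, pocket in assignments}
p1, p1_prime = by_pocket["S1"], by_pocket["S1'"]
```

For the reference ligand this pairs TYR 4 with S1 and ALA 5 with S1'. The ninth residue of the original ligand (SER) has no pocket assigned and is absent from the truncated fields.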

Support

If this workflow is useful for your research and you need any advice, please contact us by email: mikhail.magnitov@phystech.edu