UniProt genome annotation data in hg19/GRCh37 coordinates
This repository was last updated for the March 2017 UniProtKB release. I have since moved on to using the BigBed files from UniProt instead of the Bed files used here. That more recent work is available in the repository uniprot_genomic.
UniProt provides human genome annotation data that map amino acid annotations directly to reference genome coordinates, but the data are available only in hg38 coordinates. See this publication for more info:
This repository converts this data to hg19 coordinates and makes it available.
Besides the conversion to hg19 coordinates, a few changes are made here to suit our purpose, which is to identify whether query amino acids have any UniProt annotation. See the ‘Processing pipeline’ section for details.
Restructured, hg19-converted Bed files. This is probably what you are interested in.
Two merged files, each containing selected sequence annotations of interest, as listed below.
a. Merged file - Type 1 has the following annotation types merged into a single file.
1 Active site
2 Binding site for any chemical group
3 Calcium binding region
4 Cross-link between proteins
5 Disulfide bond
6 Glycosylation-PTM
7 Interesting site
8 Lipidation-PTM
9 Metal binding site
10 Motif
11 Nucleotide binding region
12 Other PTM
13 Signal peptide
14 Transit peptide
15 Zinc finger region
b. Merged file - Type 0 has the following annotation types merged into a single file.
1 Active peptide
2 Chain
3 Coiled coil
4 DNA binding domain
5 Domain
6 Intramembrane
7 Natural variant
8 Region of interest
9 Repeated motifs or domains
10 Topological domain
11 Transmembrane region
Reformat Bed files as follows:
a. Replace the score column (5th column), which is zero by default in the UniProt-provided data, with the corresponding sequence annotation type, as shown below.
Original format by UniProt:
>chr1 7970956 7970959 Q99497 0 + 7970956 7970959 255,102,102 1 3 0 . Nucleophile. Pubmed:20304780, Pubmed:25416785
Format we used here (after conversion to hg19):
>chr1 8031016 8031019 Q99497 Active site + 8031016 8031019 255,102,102 1 3 0 . Nucleophile. Pubmed:20304780, Pubmed:25416785
b. Restructure the rows in the Bed files that cover non-continuous amino acids, as in the example below.
```
Original format by UniProt (this line has coordinates for three, non-continuous amino acids):
>chr1 1633782 1633815 O75900 0 + 1633782 1633815 0,153,0 3 3,3,3 0,12,30 . Zinc; catalytic.
Format we used here (one amino acid per row, if non-continuous):
>chr1 1569161 1569164 O75900 Metal binding site + 1569161 1569164 0,153,0 1 3 0 . Zinc; catalytic.
>chr1 1569173 1569176 O75900 Metal binding site + 1569173 1569176 0,153,0 1 3 0 . Zinc; catalytic.
>chr1 1569191 1569194 O75900 Metal binding site + 1569191 1569194 0,153,0 1 3 0 . Zinc; catalytic.
```
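The two reformatting steps above can be sketched in Python roughly as follows. This is an illustrative sketch, not the code used in this repository: the function names `relabel_score`, `split_blocks` and `reformat_line` are hypothetical, the input is assumed to be tab-separated BED12 with the extra UniProt columns appended, and the annotation type is assumed to be known for each input file. The sketch works on the coordinates it is given and does not perform the hg38-to-hg19 conversion. It writes every block as its own row, which matches the example above where each block is a single codon.

```python
def relabel_score(fields, annotation_type):
    """Return a copy of the BED fields with column 5 (score) set to the annotation type."""
    out = list(fields)
    out[4] = annotation_type
    return out


def split_blocks(fields):
    """Split a multi-block BED12 row into one single-block row per block."""
    block_count = int(fields[9])
    if block_count == 1:
        return [list(fields)]

    chrom_start = int(fields[1])
    # blockSizes and blockStarts are comma-separated and may end with a comma
    sizes = [int(x) for x in fields[10].split(",") if x]
    offsets = [int(x) for x in fields[11].split(",") if x]

    rows = []
    for size, offset in zip(sizes[:block_count], offsets[:block_count]):
        start = chrom_start + offset  # blockStarts are relative to chromStart
        end = start + size
        row = list(fields)
        row[1], row[2] = str(start), str(end)  # chromStart, chromEnd
        row[6], row[7] = str(start), str(end)  # thickStart, thickEnd
        row[9], row[10], row[11] = "1", str(size), "0"  # one block per row
        rows.append(row)
    return rows


def reformat_line(line, annotation_type):
    """Apply both steps to one tab-separated BED line; return the new lines."""
    fields = line.rstrip("\n").split("\t")
    relabeled = relabel_score(fields, annotation_type)
    return ["\t".join(row) for row in split_blocks(relabeled)]
```

Applied to the original O75900 row above with the type `Metal binding site`, this yields three rows starting at 1633782, 1633794 and 1633812. These are still hg38 positions; the hg19 positions shown in the example come from the separate coordinate conversion.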
The resulting Bed files are probably what you need if you are looking for an hg19 replacement for the UniProt-provided hg38 genome coordinates.
We further merge the sequence annotation types of interest to us into two Bed files.
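As a rough illustration of this merge step, the sketch below sorts reformatted rows into the Type 1 and Type 0 groups listed earlier. It is not the repository's own code: the names `TYPE_1`, `TYPE_0`, `merged_group` and `merge_rows` are hypothetical, rows are assumed to be tab-separated, and the labels in column 5 are assumed to match the list entries exactly.

```python
# Annotation types that go into "Merged file - Type 1"
TYPE_1 = {
    "Active site", "Binding site for any chemical group",
    "Calcium binding region", "Cross-link between proteins",
    "Disulfide bond", "Glycosylation-PTM", "Interesting site",
    "Lipidation-PTM", "Metal binding site", "Motif",
    "Nucleotide binding region", "Other PTM", "Signal peptide",
    "Transit peptide", "Zinc finger region",
}

# Annotation types that go into "Merged file - Type 0"
TYPE_0 = {
    "Active peptide", "Chain", "Coiled coil", "DNA binding domain",
    "Domain", "Intramembrane", "Natural variant", "Region of interest",
    "Repeated motifs or domains", "Topological domain",
    "Transmembrane region",
}


def merged_group(annotation_type):
    """Return 1 or 0 for the merged file a type belongs to, or None if in neither."""
    if annotation_type in TYPE_1:
        return 1
    if annotation_type in TYPE_0:
        return 0
    return None


def merge_rows(lines):
    """Group reformatted BED lines by merged file, using the label in column 5."""
    merged = {0: [], 1: []}
    for line in lines:
        stripped = line.rstrip("\n")
        group = merged_group(stripped.split("\t")[4])
        if group is not None:  # types outside both lists are skipped
            merged[group].append(stripped)
    return merged
```

Keeping the two type lists as plain sets makes it easy to move a type between groups or to build a different selection for another use case.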
Download the resulting merged bed files:
UniProt’s license applies to the genome coordinate data available in this repository. Thanks to UniProt for permitting us to distribute this data in hg19 format. The data is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. The code in this repository is distributed under the MIT license.