项目作者: davetang

项目描述 :
Bioinformatics analysis of SARS-CoV-2
高级语言: HTML
项目地址: git://github.com/davetang/sars_cov_2.git
创建时间: 2020-03-05T05:09:50Z
项目社区:https://github.com/davetang/sars_cov_2

开源协议:

下载


Table of Contents

Created by gh-md-toc

README

Call Omicron variants

The Severe acute respiratory syndrome coronavirus 2
(SARS-CoV-2)

is an RNA virus currently causing the 2019–20 coronavirus pandemic. This
repository contains code and notes for the analysis of SARS-CoV-2.

I also wrote some blog posts related to the analyses in this repository.

  1. https://davetang.org/muse/2020/03/05/sequence-analysis-sars-cov-2/

  2. https://davetang.org/muse/2020/03/06/sequence-analysis-of-sars-cov-2-part-2/

  3. https://davetang.org/muse/2020/03/12/sequence-analysis-of-sars-cov-2-part-3/

  4. https://davetang.org/muse/2022/01/26/omicron-variants/ (this analysis
    workflow is also implemented as a GitHub Actions
    workflow
    )

Tools

Install Mambaforge and
then use mamba to create the environment that has all the necessary tools.

  1. mamba env create --file environment.yml
  2. conda activate sars_cov_2

The Conda package for snpEff is out of
date and does not download the reference sequences properly. We need to install
this manually.

  1. cd src
  2. wget https://snpeff.blob.core.windows.net/versions/snpEff_latest_core.zip
  3. unzip snpEff_latest_core.zip
  4. cd snpEff
  5. java -jar snpEff.jar download NC_045512.2

Sequences

Reference sequence

Reference sequence NC_045512.
Download GFF for NC_045512 from https://www.ncbi.nlm.nih.gov/sars-cov-2/.

  1. mkdir tmp && cd tmp
  2. datasets download genome accession GCF_009858895.2 --filename GCF_009858895.2.zip --include genome,gtf,gbff
  3. unzip GCF_009858895.2.zip
  4. gzip ncbi_dataset/data/GCF_009858895.2/GCF_009858895.2_ASM985889v3_genomic.fna
  5. mv ncbi_dataset/data/GCF_009858895.2/GCF_009858895.2_ASM985889v3_genomic.fna.gz ../raw
  6. cd .. && rm -rf tmp

Variants

See SARS-CoV-2 Variant Classifications and Definitions and https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8342008/ for more information.

Download variants of concern using script/download_variants.sh. The script will keep trying to download a lineage until successful; this was implemented because for variants with a lot of sequences (Alpha and Delta), the download would disconnect quite often (despite having a very fast Internet connection).

  • Alpha (B.1.1.7)
  • Beta (B.1.351, B.1.351.2, B.1.351.3)
  • Delta (B.1.617.2, AY.1, AY.2, AY.3)
  • Gamma (P.1, P.1.1, P.1.2)
  • Omicron (B.1.1.529, BA.1, BA.2, BA.3)

Summarise using dataformat.

  1. bin/macos/dataformat tsv virus-genome --package raw/variants/SARS-CoV-2-P.1.1.20210819.zip --fields accession,virus-pangolin,release-date,isolate-lineage | head -5
  2. Accession Virus Pangolin Classification Release date Isolate Lineage
  3. MZ799138.1 P.1.1 2021-08-15 SARS-CoV-2/human/USA/2105200446/2021
  4. MZ788313.1 P.1.1 2021-08-13 SARS-CoV-2/human/USA/TX-DSHS-7363/2021
  5. MZ770359.1 P.1.1 2021-08-12 SARS-CoV-2/human/USA/UT-UPHL-210729378097/2021
  6. MZ746322.1 P.1.1 2021-08-10 SARS-CoV-2/human/USA/UT-UPHL-210729378097/2021
  7. # confirm that accession MZ157012 is a delta variant
  8. bin/macos/dataformat tsv virus-genome --package raw/variants/SARS-CoV-2-B.1.617.2.20210819.zip --fields accession,virus-pangolin,release-date,isolate-lineage | grep MZ157012
  9. MZ157012.1 B.1.617.2 2021-05-11 SARS-CoV-2/human/NPL/LMB11/2021

Count number of sequences per lineage.

  1. for file in $(ls raw/variants/*.zip); do
  2. echo ${file};
  3. bin/macos/dataformat tsv virus-genome --package ${file} | wc -l
  4. done
  5. raw/variants/SARS-CoV-2-AY.1.20210819.zip
  6. 254
  7. raw/variants/SARS-CoV-2-AY.2.20210819.zip
  8. 1036
  9. raw/variants/SARS-CoV-2-AY.3.20210819.zip
  10. 5678
  11. raw/variants/SARS-CoV-2-B.1.1.7.20210819.zip
  12. 467304
  13. raw/variants/SARS-CoV-2-B.1.351.2.20210819.zip
  14. 47
  15. raw/variants/SARS-CoV-2-B.1.351.20210819.zip
  16. 4311
  17. raw/variants/SARS-CoV-2-B.1.351.3.20210819.zip
  18. 340
  19. raw/variants/SARS-CoV-2-B.1.617.2.20210819.zip
  20. 152212
  21. raw/variants/SARS-CoV-2-P.1.1.20210819.zip
  22. 1004
  23. raw/variants/SARS-CoV-2-P.1.2.20210819.zip
  24. 746
  25. raw/variants/SARS-CoV-2-P.1.20210819.zip
  26. 16109

Omicron

BA.1.

  1. bin/macos/dataformat tsv virus-genome --package raw/variants/SARS-CoV-2-BA.1.20220225.zip --fields accession,virus-pangolin,release-date,isolate-lineage,geo-location | head -5
  2. Accession Virus Pangolin Classification Release date Isolate Lineage Geographic location
  3. OL815078.1 BA.1 2021-12-11 SARS-CoV-2/human/USA/NY-CDC-IBX854867598434/2021 USA: New York
  4. OL808790.1 BA.1 2021-12-10 SARS-CoV-2/human/USA/MA-MGB-02636/2021 USA: Massachusetts
  5. OL802692.1 BA.1 2021-12-10 SARS-CoV-2/human/USA/NY-CDCBI-CRSP_GHHFUDBUXELXFY54/2021 USA: New York
  6. OL800703.1 BA.1 2021-12-10 SARS-CoV-2/human/DEU/FFM-ZAF0396/2021 Germany: Hesse
  7. bin/macos/dataformat tsv virus-genome --package raw/variants/SARS-CoV-2-BA.1.20220225.zip | wc -l
  8. 8792232

BA.2.

  1. bin/macos/dataformat tsv virus-genome --package raw/variants/SARS-CoV-2-BA.2.20220225.zip --fields accession,virus-pangolin,release-date,isolate-lineage,geo-location | head -5
  2. Accession Virus Pangolin Classification Release date Isolate Lineage Geographic location
  3. OM364005.1 BA.2 2022-01-24 SARS-CoV-2/human/USA/WA-S16397/2022 USA: Washington,King County
  4. OM362060.1 BA.2 2022-01-24 SARS-CoV-2/human/USA/NY-CDC-LC0466930/2022 USA: New York
  5. OM354922.1 BA.2 2022-01-24 SARS-CoV-2/human/USA/CA-CDC-STM-X4Q4J65PP/2022 USA: California
  6. OM349979.1 BA.2 2022-01-24 SARS-CoV-2/human/IND/MH-ICMR-NIV-INSACOG-GSEQ-7029/2021 India
  7. bin/macos/dataformat tsv virus-genome --package raw/variants/SARS-CoV-2-BA.2.20220225.zip | wc -l
  8. 97503
  9. bin/macos/dataformat tsv virus-genome --package raw/variants/SARS-CoV-2-BA.2.20220225.zip --fields geo-location | grep -v "^Geographic" | cut -f1 -d':' | sort | uniq -c | sort -k1rn
  10. 6896 United Kingdom
  11. 1623 USA
  12. 301 Switzerland
  13. 50 Germany
  14. 32 Bangladesh
  15. 30 Denmark
  16. 25 Slovakia
  17. 18 South Africa
  18. 16 Bahrain
  19. 5 China
  20. 4 Austria
  21. 3 Japan
  22. 1 India
  23. bin/macos/dataformat tsv virus-genome --package raw/variants/SARS-CoV-2-BA.2.20220225.zip --fields accession,virus-pangolin,release-date,isolate-lineage,geo-location | grep Japan
  24. BS002318.1 BA.2 2022-01-26 hCoV-19/Japan/SZ-NIG-4-C118/2022 Japan:Shizuoka, Hamamatsu
  25. BS002317.1 BA.2 2022-01-26 hCoV-19/Japan/SZ-NIG-4-C117/2022 Japan:Shizuoka, Hamamatsu
  26. BS002303.1 BA.2 2022-01-26 hCoV-19/Japan/SZ-NIG-4-C67/2022 Japan:Shizuoka, Hamamatsu

Genomes

Download Coronavirus genomes using datasets.

  1. mkdir -p raw/genome
  2. today=$(date +%Y%m%d)
  3. bin/datasets download virus genome tax-name sars2 --filename raw/genome/sars2.${today}.zip

973,966 genome sequences as of Thu Jul 22 17:37:22 JST 2021 up from 7,031 genome sequences from Mon Jul 20 14:32:46 JST 2020.

  1. cd raw/genome/
  2. unzip sars2.20210722.zip
  3. cat ncbi_dataset/data/genomic.fna | grep "^>" | wc -l
  4. 973966

Look for MN908947.

  1. cat ncbi_dataset/data/genomic.fna | grep MN908947
  2. >MN908947.3 Severe acute respiratory syndrome coronavirus 2 isolate Wuhan-Hu-1, complete genome
  3. >MT576029.1 Severe acute respiratory syndrome coronavirus 2 isolate SARS-CoV-2/human/ESP/2019-nCoV-MN908947-cOVID-96_19/2020, complete genome

Look for MZ157012 (an isolate from lineage B.1.617.2 or also known as the Delta variant).

  1. cat ncbi_dataset/data/genomic.fna | grep MZ157012
  2. >MZ157012.1 Severe acute respiratory syndrome coronavirus 2 isolate SARS-CoV-2/human/NPL/LMB11/2021 ORF1ab polyprotein (ORF1ab) gene, partial cds; ORF1a polyprotein (ORF1ab) gene, complete cds; surface glycoprotein (S) and ORF3a protein (ORF3a) genes, partial cds; envelope protein (E), membrane glycoprotein (M), and ORF6 protein (ORF6) genes, complete cds; ORF7a protein (ORF7a) gene, partial cds; ORF8 gene, partial sequence; and nucleocapsid phosphoprotein (N) gene, partial cds

Extract MN908947.3 and MZ157012.1.

  1. echo -e "MN908947.3\nMZ157012.1" > raw/wanted.txt
  2. bin/seqtk subseq raw/genome/ncbi_dataset/data/genomic.fna raw/wanted.txt > raw/wanted.fa

Proteins

The download virus protein command downloads complete protein sequences (excluding partial sequences) and annotation data as a zip file in the BDBag (Big Data Bag) format.

  1. bin/datasets download virus protein S --filename raw/SARS2-spike.zip

The default protein dataset for a given protein includes the following for all complete SARS2 RefSeq and GenBank genomes:

  • amino acid sequences in FASTA (.faa) format
  • protein structures in PDB (.pdb) format
  • nucleotide coding (CDS) sequences in FASTA (.fna) format
  • data report containing taxonomy, isolate, host and other metadata (data_report.yaml)
  • annotation and amino acid sequences in the GenPept flat file format (protein.gpff)
  • a README containing details on sequence file data content and other general information (virus_dataset.md)
  • a list of files and file types (dataset_catalog.json)
  1. |-- ncbi_dataset
  2. | |-- bag-info.txt
  3. | |-- bagit.txt
  4. | |-- data
  5. | | |-- cds.fna
  6. | | |-- data_report.yaml
  7. | | |-- dataset_catalog.json
  8. | | |-- pdb
  9. | | | |-- 6VYB.pdb
  10. | | | |-- 6VYO.pdb
  11. | | | |-- 6W37.pdb
  12. | | | |-- 6W4H.pdb
  13. | | | |-- 6W9C.pdb
  14. | | | |-- 6W9Q.pdb
  15. | | | |-- 6WEY.pdb
  16. | | | |-- 6WJI.pdb
  17. | | | |-- 6WLC.pdb
  18. | | | |-- 7BQY.pdb
  19. | | | `-- 7BV2.pdb
  20. | | |-- protein.faa
  21. | | |-- protein.gpff
  22. | | `-- virus_dataset.md
  23. | |-- fetch.txt
  24. | |-- manifest-md5.txt
  25. | `-- tagmanifest-md5.txt
  26. `-- README.md

Extract only the amino acid FASTA sequences.

  1. for file in $(ls raw/protein/SARS2*.zip); do
  2. base=$(basename $file .zip);
  3. unzip -jp $file ncbi_dataset/data/protein.faa | gzip > raw/protein/${base}.faa.gz
  4. done
  5. zcat raw/protein/SARS2.20200720.S.faa.gz | grep "^>" | wc -l
  6. 7015

Number of unique sequences.

  1. for file in $(ls raw/protein/*.faa.gz); do
  2. echo $file;
  3. script/unique_seq.pl -f $file | wc -l;
  4. done
  5. raw/protein/SARS2.20200720.E.faa.gz
  6. 30
  7. raw/protein/SARS2.20200720.M.faa.gz
  8. 95
  9. raw/protein/SARS2.20200720.N.faa.gz
  10. 360
  11. raw/protein/SARS2.20200720.nsp10.faa.gz
  12. 36
  13. raw/protein/SARS2.20200720.nsp11.faa.gz
  14. 7
  15. raw/protein/SARS2.20200720.nsp13.faa.gz
  16. 254
  17. raw/protein/SARS2.20200720.nsp14.faa.gz
  18. 307
  19. raw/protein/SARS2.20200720.nsp15.faa.gz
  20. 166
  21. raw/protein/SARS2.20200720.nsp16.faa.gz
  22. 146
  23. raw/protein/SARS2.20200720.nsp1.faa.gz
  24. 134
  25. raw/protein/SARS2.20200720.nsp2.faa.gz
  26. 388
  27. raw/protein/SARS2.20200720.nsp3.faa.gz
  28. 862
  29. raw/protein/SARS2.20200720.nsp4.faa.gz
  30. 198
  31. raw/protein/SARS2.20200720.nsp5.faa.gz
  32. 118
  33. raw/protein/SARS2.20200720.nsp6.faa.gz
  34. 120
  35. raw/protein/SARS2.20200720.nsp7.faa.gz
  36. 31
  37. raw/protein/SARS2.20200720.nsp8.faa.gz
  38. 69
  39. raw/protein/SARS2.20200720.nsp9.faa.gz
  40. 48
  41. raw/protein/SARS2.20200720.ORF10.faa.gz
  42. 28
  43. raw/protein/SARS2.20200720.orf1ab.faa.gz
  44. 5773
  45. raw/protein/SARS2.20200720.orf1a.faa.gz
  46. 3756
  47. raw/protein/SARS2.20200720.ORF3a.faa.gz
  48. 290
  49. raw/protein/SARS2.20200720.ORF6.faa.gz
  50. 58
  51. raw/protein/SARS2.20200720.ORF7a.faa.gz
  52. 88
  53. raw/protein/SARS2.20200720.ORF7b.faa.gz
  54. 1
  55. raw/protein/SARS2.20200720.ORF8.faa.gz
  56. 99
  57. raw/protein/SARS2.20200720.RdRp.faa.gz
  58. 342
  59. raw/protein/SARS2.20200720.S.faa.gz
  60. 779

Nucleotide sequences

The page https://www.ncbi.nlm.nih.gov/genbank/sars-cov-2-seqs/ contains a list of SARS-CoV-2 sequences. Download latest list of SARS-CoV-2 nucleotide sequence IDs.

  1. outfile=acc.$(date +%F).txt.gz
  2. wget https://www.ncbi.nlm.nih.gov/sars-cov-2/download-nuccore-ids/ -O - | gzip > raw/$outfile
  3. zcat raw/$outfile | wc -l
  4. 10001

We can use efetch to download an assembled SARS-CoV-2 genome sequence in various formats.

  1. mkdir raw
  2. efetch -db sequences -format genbank -id MN908947 > raw/MN908947.genbank
  3. efetch -db sequences -format fasta -id MN908947 > raw/MN908947.fa
  4. efetch -db sequences -format fasta_cds_aa -id MN908947

We can use esearch to query the number of sequences associated with the term “coronavirus”. There were 41,874 results as of 2020/03/01.

  1. esearch -db nuccore -query coronavirus
  2. <ENTREZ_DIRECT>
  3. <Db>nuccore</Db>
  4. <WebEnv>NCID_1_121026101_130.14.22.33_9001_1583054896_904858778_0MetA0_S_MegaStore</WebEnv>
  5. <QueryKey>1</QueryKey>
  6. <Count>41874</Count>
  7. <Step>1</Step>
  8. </ENTREZ_DIRECT>

We can pipe the output from esearch to efetch to fetch all sequences.

  1. today=$(date "+%Y%m%d")
  2. time esearch -db nuccore -query coronavirus | efetch -db sequences -format fasta > raw/coronavirus_$today.fa
  3. real 10m39.511s
  4. user 0m53.739s
  5. sys 0m7.705s
  6. cat raw/coronavirus_$today.fa | grep "^>" | wc -l
  7. 43581
  8. cat raw/coronavirus_$today.fa | grep MN908947
  9. >MN908947.3 Severe acute respiratory syndrome coronavirus 2 isolate Wuhan-Hu-1, complete genome

I wrote a simple Perl script to calculate the length of the FASTA sequences.

  1. today=$(date "+%Y%m%d")
  2. script/fasta_stats.pl -f raw/coronavirus_$today.fa | gzip > result/coronavirus_${today}_stat.txt.gz

Identifying variants in lineages

The SARS-CoV-2 Delta variant is a variant of lineage B.1.617 of SARS-CoV-2; it belongs to lineage B.1.617.2. It has mutations in the gene encoding the SARS-CoV-2 spike protein causing the substitutions T478K, P681R and L452R. The workflow below will download the FASTA sequence of one particular Delta variant (accession MZ157012), align it with the SARS-CoV-2 reference sequence, generate a VCF file based on the alignment, and finally annotate the differences to the reference sequence.

  1. id=MZ157012
  2. efetch -db sequences -format fasta -id ${id} | gzip > raw/${id}.fa.gz
  3. gunzip -c raw/GCF_009858895.2_ASM985889v3_genomic.fna.gz raw/MZ157012.fa.gz | gzip > raw/ref_vs_delta.fa.gz
  4. gunzip -c raw/ref_vs_delta.fa.gz | muscle -out raw/ref_vs_delta.aln.fa
  5. snp-sites -v -o raw/ref_vs_delta.vcf raw/ref_vs_delta.aln.fa
  6. cat raw/ref_vs_delta.vcf | sed 's/ID=1/ID=NC_045512.2/;s/^1/NC_045512.2/' > blah
  7. mv -f blah raw/ref_vs_delta.vcf
  8. java -Xmx8g -jar bin/snpEff/snpEff.jar NC_045512.2 raw/ref_vs_delta.vcf > raw/ref_vs_delta.ann.vcf

There are four mutations in the spike protein in MZ157012; two intersect with the reported substitutions in the Wikipedia article: L452R and T478K but P681R was not found.

  1. cat raw/ref_vs_delta.ann.vcf | perl -nle 'next if /^#/; @s=split(/\|/); next unless $s[3] eq "S"; print $s[10]'
  2. p.Leu452Arg
  3. p.Thr478Lys
  4. p.Asp614Gly
  5. p.Asp950Asn

The above workflow can be run by using script/var_to_vcf.sh; make sure you activate the Conda environment before running the script. We will use the script to generate VCF files for BA.1 (OL815078) and BA.2 (OM364005) variants.

  1. script/var_to_vcf.sh -i OL815078
  2. cat raw/ref_vs_OL815078.ann.vcf | perl -nle 'next if /^#/; @s=split(/\|/); next unless $s[3] eq "S"; print $s[10]'
  3. p.Ala67Val
  4. p.Thr95Ile
  5. p.Gly339Asp
  6. p.Ser371Pro
  7. p.Ser371Phe
  8. p.Ser373Pro
  9. p.Ser375Phe
  10. p.Lys417Asn
  11. p.Thr547Lys
  12. p.Asp614Gly
  13. p.His655Tyr
  14. p.Asn679Lys
  15. p.Pro681His
  16. p.Ala701Val
  17. p.Asn764Lys
  18. p.Asp796Tyr
  19. p.Asn856Lys
  20. p.Gln954His
  21. p.Asn969Lys
  22. p.Leu981Phe
  23. p.Asp1146Asp
  24. script/var_to_vcf.sh -i OM364005
  25. cat raw/ref_vs_OM364005.ann.vcf | perl -nle 'next if /^#/; @s=split(/\|/); next unless $s[3] eq "S"; print $s[10]'
  26. p.Thr19Ile
  27. p.Gly142Asp
  28. p.Val213Gly
  29. p.Gly339Asp
  30. p.Ser371Phe
  31. p.Ser373Pro
  32. p.Ser375Phe
  33. p.Thr376Ala
  34. p.Asp405Asn
  35. p.Arg408Ser
  36. p.Lys417Asn
  37. p.Asn440Lys
  38. p.Ser477Asn
  39. p.Thr478Lys
  40. p.Glu484Ala
  41. p.Gln493Arg
  42. p.Gln498Arg
  43. p.Asn501Tyr
  44. p.Tyr505His
  45. p.Asp614Gly
  46. p.His655Tyr
  47. p.Asn679Lys
  48. p.Pro681His
  49. p.Asn764Lys
  50. p.Asp796Tyr
  51. p.Gln954His
  52. p.Asn969Lys
  53. p.Asn1023Ser
  54. p.Asp1146Asp

Spike protein variants specific to BA.1 (OL815078 <) and BA.2 (OM364005 >).

  1. diff <(cat raw/ref_vs_OL815078.ann.vcf | perl -nle 'next if /^#/; @s=split(/\|/); next unless $s[3] eq "S"; print $s[10]') <(cat raw/ref_vs_OM364005.ann.vcf | perl -nle 'next if /^#/; @s=split(/\|/); next unless $s[3] eq "S"; print $s[10]')
  2. 1,2c1,3
  3. < p.Ala67Val
  4. < p.Thr95Ile
  5. ---
  6. > p.Thr19Ile
  7. > p.Gly142Asp
  8. > p.Val213Gly
  9. 4d4
  10. < p.Ser371Pro
  11. 7a8,10
  12. > p.Thr376Ala
  13. > p.Asp405Asn
  14. > p.Arg408Ser
  15. 9c12,19
  16. < p.Thr547Lys
  17. ---
  18. > p.Asn440Lys
  19. > p.Ser477Asn
  20. > p.Thr478Lys
  21. > p.Glu484Ala
  22. > p.Gln493Arg
  23. > p.Gln498Arg
  24. > p.Asn501Tyr
  25. > p.Tyr505His
  26. 14d23
  27. < p.Ala701Val
  28. 17d25
  29. < p.Asn856Lys
  30. 20c28
  31. < p.Leu981Phe
  32. ---
  33. > p.Asn1023Ser

Gamma variant: This variant of SARS-CoV-2 has been named lineage P.1 and has 17 amino acid substitutions, ten of which are in its spike protein, including these three designated to be of particular concern: N501Y, E484K and K417T.

BLAST

Create BLAST database using the sequences associated with the term “coronavirus”.

  1. mkdir db
  2. makeblastdb -dbtype nucl \
  3. -in raw/coronavirus_20200426.fa \
  4. -input_type fasta \
  5. -title coronavirus_20200426 \
  6. -out db/coronavirus_20200426
  7. Building a new DB, current time: 04/26/2020 09:29:19
  8. New DB name: /Users/dtang/github/sars_cov_2/db/coronavirus_20200426
  9. New DB title: coronavirus_20200426
  10. Sequence type: Nucleotide
  11. Keep MBits: T
  12. Maximum file size: 1000000000B
  13. Adding sequences from FASTA; added 43581 sequences in 13.0157 seconds.

After creating the database we can blast the assembled genome to all the sequences we fetched to see if it matches other coronaviruses.

  1. # -evalue <Real> - Expectation value (E) threshold for saving hits
  2. # Default = `10'
  3. blastn -outfmt 7 -query raw/MN908947.fa -db db/coronavirus_20200426 > result/MN908947_blast.txt
  4. blastn -evalue 1 -outfmt 7 -query raw/MN908947.fa -db db/coronavirus_20200426 | wc -l
  5. 506
  6. # accessions with BLAST hits
  7. cat result/MN908947_blast.txt | grep -v "^#" | cut -f2 | sort -u | wc -l
  8. 500
  9. cat result/MN908947_blast.txt | grep -v "^#" | cut -f2 | sort -u > result/MN908947_matched.txt

The script extract_fasta.pl will extract specific FASTA entries. Below we fetch all the sequences that our query sequence matched (500 in total).

  1. script/extract_fasta.pl -i result/MN908947_matched.txt -f raw/coronavirus_20200426.fa > result/MN908947_matched.fa
  2. cat result/MN908947_matched.fa | grep "^>" | wc -l
  3. 500

We will calculate some simple FASTA stats on the matched sequences.

  1. script/fasta_stats.pl -f result/MN908947_matched.fa > result/MN908947_matched_stats.txt

Parse results

The script parse_outfmt7.pl simply parses the BLAST result and outputs the results in a more readable format.

  1. script/parse_outfmt7.pl -i result/MN908947_blast.txt -p 80 -l 10000 -f raw/coronavirus_20200426.fa | less

ClustalW

Create a multiple sequence alignment of MN908947 to other bat coronaviruses.

  1. script/extract_fasta.pl -i raw/wanted.txt -f raw/coronavirus_20200426.fa > raw/wanted.fa
  2. clustalw -infile=raw/wanted.fa
  3. script/extract_fasta.pl -i raw/MN908947_MN996532.txt -f raw/coronavirus_20200301.fa > raw/MN908947_MN996532.fa
  4. clustalw -infile=raw/MN908947_MN996532.fa

SRA

The page https://www.ncbi.nlm.nih.gov/genbank/sars-cov-2-seqs/ contains a YAML file with accession information on the SARS-CoV-2 sequences currently deposited on NCBI databases.

  1. wget -N https://www.ncbi.nlm.nih.gov/core/assets/genbank/files/ncov-sequences.yaml -O raw/ncov-sequences.yaml
  2. head raw/ncov-sequences.yaml
  3. updated: 'Thursay Apr 23 15:15 2020 EST'
  4. genbank-sequences: [
  5. {
  6. "accession": "NC_045512",
  7. "accession_list": "<a href=\"https://www.ncbi.nlm.nih.gov/nuccore/NC_045512\">NC_045512</a>",
  8. "collection_date": "2019-12",
  9. "country": "China"
  10. },
  11. {
  12. cat raw/ncov-sequences.yaml | grep "sra-run\"" | wc -l
  13. 288

Did we get all the GenBank sequences using esearch and efetch? We should directly use the accessions in the YAML file as 417 sequences are missing.

  1. cat raw/ncov-sequences.yaml | grep 'accession"' | perl -nle 's/.*"(\w+)",/$1/; print' > raw/genbank_list.txt
  2. cat raw/genbank_list.txt | wc -l
  3. 1622
  4. script/extract_fasta.pl -i raw/genbank_list.txt -f raw/coronavirus_20200426.fa > raw/genbank_list.fa 2> raw/genbank_list_missing.txt
  5. cat raw/genbank_list.fa | grep "^>" | wc -l
  6. 1205
  7. cat raw/genbank_list_missing.txt | wc -l
  8. 417

Download using efetch and check out the distribution of sequence lengths.

  1. my_accession=$(cat raw/genbank_list.txt | tr '\n' ',' | sed 's/,$//')
  2. efetch -db sequences -format fasta -id $my_accession > raw/genbank_efetch.fa
  3. cat raw/genbank_efetch.fa | grep "^>" | wc -l
  4. 1621
  5. script/fasta_stats.pl -f raw/genbank_efetch.fa | gzip > result/genbank_stat.txt.gz
  6. # get stats from https://github.com/arq5x/filo/
  7. gunzip -c result/genbank_stat.txt.gz | cut -f2 | grep -v "Length" | stats
  8. Total lines: 1621
  9. Sum of lines: 45891887
  10. Ari. Mean: 28310.8494756323
  11. Geo. Mean: 23869.0792681778
  12. Median: 29844
  13. Mode: 29882 (N=208)
  14. Anti-Mode: 64 (N=1)
  15. Minimum: 64
  16. Maximum: 29945
  17. Variance: 41665645.6466825
  18. StdDev: 6454.89315532663
  19. # some short sequences
  20. gunzip -c result/genbank_stat.txt.gz | sort -k2n | head -6
  21. Accession Length A C G T Unknown
  22. MT293547.1 64 13 16 12 23 0
  23. MT273658.1 84 20 15 22 27 0
  24. MT163712.1 87 20 17 23 27 0
  25. MN938387.1 107 38 14 22 33 0
  26. MN938388.1 107 38 14 22 33 0
  27. cat raw/genbank_efetch.fa | grep MT293547.1
  28. >MT293547.1 Severe acute respiratory syndrome coronavirus 2 isolate SARS-CoV-2/human/IRQ/KRD/2020 envelope protein (E) gene, partial cds

Use wget to obtain metadata of all SARS-CoV-2 raw sequences upload to the SRA. The YAML file no longer uses project accessions (SRP242226), so we will get information on each run instead (SRR10948550).

  1. mkdir sra
  2. # "sra-run": "SRR10948550"
  3. wget -O sra/SRR10948550_info.csv 'http://trace.ncbi.nlm.nih.gov/Traces/sra/sra.cgi?save=efetch&db=sra&rettype=runinfo&term=SRR10948550'

Checkout metadata using csvkit.

  1. csvcut -n sra/SRR10948550_info.csv
  2. 1: Run
  3. 2: ReleaseDate
  4. 3: LoadDate
  5. 4: spots
  6. 5: bases
  7. 6: spots_with_mates
  8. 7: avgLength
  9. 8: size_MB
  10. 9: AssemblyName
  11. 10: download_path
  12. 11: Experiment
  13. 12: LibraryName
  14. 13: LibraryStrategy
  15. 14: LibrarySelection
  16. 15: LibrarySource
  17. 16: LibraryLayout
  18. 17: InsertSize
  19. 18: InsertDev
  20. 19: Platform
  21. 20: Model
  22. 21: SRAStudy
  23. 22: BioProject
  24. 23: Study_Pubmed_id
  25. 24: ProjectID
  26. 25: Sample
  27. 26: BioSample
  28. 27: SampleType
  29. 28: TaxID
  30. 29: ScientificName
  31. 30: SampleName
  32. 31: g1k_pop_code
  33. 32: source
  34. 33: g1k_analysis_group
  35. 34: Subject_ID
  36. 35: Sex
  37. 36: Disease
  38. 37: Tumor
  39. 38: Affection_Status
  40. 39: Analyte_Type
  41. 40: Histological_Type
  42. 41: Body_Site
  43. 42: CenterName
  44. 43: Submission
  45. 44: dbgap_study_accession
  46. 45: Consent
  47. 46: RunHash
  48. 47: ReadHash
  49. csvcut -c Run,Sample,BioSample,spots,LibraryStrategy,LibrarySource,LibraryLayout,Platform,Model,SampleType,CenterName sra/SRR10948550_info.csv | csvlook
  50. | Run | Sample | BioSample | spots | LibraryStrategy | LibrarySource | LibraryLayout | Platform | Model | SampleType | CenterName |
  51. | ----------- | ---------- | ------------ | ------- | --------------- | ------------- | ------------- | --------------- | ------ | ---------- | --------------------- |
  52. | SRR10948550 | SRS6014638 | SAMN13871323 | 425,717 | RNA-Seq | GENOMIC | SINGLE | OXFORD_NANOPORE | MinION | simple | HKU-SHENZHEN HOSPITAL |
  53. | | | | | | | | | | | |

Download all metadata using SRR accessions from raw/ncov-sequences.yaml.

  1. my_accession=$(cat raw/ncov-sequences.yaml | grep 'sra-run"' | perl -nle 's/.*"(\w+)",/$1/; print')
  2. for acc in $my_accession; do
  3. wget -O sra/${acc}_info.csv "http://trace.ncbi.nlm.nih.gov/Traces/sra/sra.cgi?save=efetch&db=sra&rettype=runinfo&term=$acc"
  4. done
  5. # concenate into one metadata file
  6. rm -f sra/metadata.txt
  7. touch sra/metadata.txt
  8. for file in `ls sra/SRR*info.csv`; do
  9. echo $file;
  10. csvcut -c Run,Sample,BioSample,spots,LibraryStrategy,LibrarySource,LibraryLayout,Platform,Model,SampleType,CenterName $file >> sra/metadata.txt
  11. done
  12. head sra/metadata.txt
  13. Run,Sample,BioSample,spots,LibraryStrategy,LibrarySource,LibraryLayout,Platform,Model,SampleType,CenterName
  14. SRR10902284,SRS6014638,SAMN13871323,261890,RNA-Seq,METAGENOMIC,SINGLE,OXFORD_NANOPORE,MinION,simple,UNIVERSITY OF HONG KONG
  15. ,,,,,,,,,,
  16. Run,Sample,BioSample,spots,LibraryStrategy,LibrarySource,LibraryLayout,Platform,Model,SampleType,CenterName
  17. SRR10903401,SRS6007144,SAMN13872787,476632,RNA-Seq,METATRANSCRIPTOMIC,PAIRED,ILLUMINA,Illumina MiSeq,simple,WUHAN UNIVERSITY
  18. ,,,,,,,,,,
  19. Run,Sample,BioSample,spots,LibraryStrategy,LibrarySource,LibraryLayout,Platform,Model,SampleType,CenterName
  20. SRR10903402,SRS6007143,SAMN13872786,676694,RNA-Seq,METATRANSCRIPTOMIC,PAIRED,ILLUMINA,Illumina MiSeq,simple,WUHAN UNIVERSITY
  21. ,,,,,,,,,,
  22. Run,Sample,BioSample,spots,LibraryStrategy,LibrarySource,LibraryLayout,Platform,Model,SampleType,CenterName
  23. # technology
  24. cat sra/metadata.txt | grep -v "^," | grep -v "^Run" | cut -f9 -d','| sort | uniq -c
  25. 1 BGISEQ-500
  26. 248 GridION
  27. 7 Illumina HiSeq 3000
  28. 18 Illumina MiSeq
  29. 1 Illumina MiniSeq
  30. 2 Ion Torrent S5
  31. 3 MinION
  32. 8 NextSeq 500
  33. # center
  34. cat sra/metadata.txt | grep -v "^," | grep -v "^Run" | cut -f11 -d','| sort | uniq -c
  35. 1 "SHANGHAI PUBLIC HEALTH CLINICAL CENTER & SCHOOL OF PUBLIC HEALTH
  36. 11 "WUHAN INSTITUTE OF VIROLOGY
  37. 1 BGI
  38. 2 HKU-SHENZHEN HOSPITAL
  39. 1 THE SCRIPPS RESEARCH INSTITUTE
  40. 2 UFBA
  41. 1 UNIVERSIDAD TECNOLOGICA DE PEREIRA
  42. 1 UNIVERSITY OF HONG KONG
  43. 3 UNIVERSITY OF MELBOURNE
  44. 14 UNIVERSITY OF WASHINGTON
  45. 8 UNIVERSITY OF WISCONSIN - MADISON
  46. 243 WUHAN UNIVERSITY
  47. cat sra/metadata.txt | grep MELBOURNE
  48. SRR11267570,SRS6201528,SAMN14167851,779208,RNA-Seq,VIRAL RNA,SINGLE,OXFORD_NANOPORE,GridION,simple,UNIVERSITY OF MELBOURNE
  49. SRR11350376,SRS6201528,SAMN14167851,779208,RNA-Seq,VIRAL RNA,SINGLE,OXFORD_NANOPORE,GridION,simple,UNIVERSITY OF MELBOURNE
  50. SRR11300652,SRS6313628,SAMN14371025,430923,RNA-Seq,VIRAL RNA,SINGLE,OXFORD_NANOPORE,GridION,simple,UNIVERSITY OF MELBOURNE

Download all Illumina data for further analysis. (I have not analysed Oxford Nanopore before, so I won’t download these yet.)

  1. for acc in `cat sra/metadata.txt | grep ILLUMINA | cut -f1 -d','`; do
  2. echo $acc
  3. fasterq-dump -p --outdir raw/fastq $acc
  4. done

SRR10971381

We will analysis the data from SRR10971381 following the preprocessing steps outlined in https://github.com/galaxyproject/SARS-CoV-2/blob/master/1-PreProcessing/pp_wf.png.

Use SRA Toolkit to download FASTQ sequences from the SRA. First use prefetch, a command-line for downloading SRA, dbGaP, and ADSP data, to download sra files. See https://www.ncbi.nlm.nih.gov/books/NBK242621/ for more information.

Using prefetch kept resulting in timeout errors. Perhaps https://github.com/ncbi/sra-tools/wiki/06.-Connection-Timeouts will help?

  1. prefetch --output-directory raw SRR10971381
  2. # connection keeps dying
  3. wget https://sra-download.ncbi.nlm.nih.gov/traces/sra46/SRR/010714/SRR10971381
  4. md5sum SRR10971381 > SRR10971381.md5sum
  5. cat SRR10971381.md5sum
  6. 5496488662893a836e23541b84bfb7cd SRR10971381

Next use fastq-dump to convert SRA data into FASTQ.

  1. fastq-dump --split-files ./SRR10971381
  2. 2020-03-10T14:51:00 fastq-dump.2.10.3 err: name not found while resolving query within virtual file system module - failed to resolve accession './SRR10971381' - Cannot resolve accession ( 404 )
  3. Read 28282964 spots for ./SRR10971381
  4. Written 28282964 spots for ./SRR10971381

Preprocess using fastp.

  1. fastp --thread 8 -i SRR10971381_1.fastq -I SRR10971381_2.fastq -o SRR10971381_1_fastp.fastq -O SRR10971381_2_fastp.fastq
  2. Read1 before filtering:
  3. total reads: 28282964
  4. total bases: 4017125680
  5. Q20 bases: 1783384314(44.3945%)
  6. Q30 bases: 1735659038(43.2065%)
  7. Read2 before filtering:
  8. total reads: 28282964
  9. total bases: 4013917534
  10. Q20 bases: 1723725994(42.9437%)
  11. Q30 bases: 1652844944(41.1779%)
  12. Read1 after filtering:
  13. total reads: 13054241
  14. total bases: 1786633510
  15. Q20 bases: 1671652872(93.5644%)
  16. Q30 bases: 1634552420(91.4878%)
  17. Read2 aftering filtering:
  18. total reads: 13054241
  19. total bases: 1782180210
  20. Q20 bases: 1625652911(91.2171%)
  21. Q30 bases: 1568126467(87.9892%)
  22. Filtering result:
  23. reads passed filter: 26108482
  24. reads failed due to low quality: 30441256
  25. reads failed due to too many N: 16190
  26. reads failed due to too short: 0
  27. reads with adapter trimmed: 582728
  28. bases trimmed due to adapters: 30162896
  29. Duplication rate: 5.57505%
  30. Insert size peak (evaluated by paired-end reads): 150
  31. JSON report: fastp.json
  32. HTML report: fastp.html
  33. fastp --thread 8 -i SRR10971381_1.fastq -I SRR10971381_2.fastq -o SRR10971381_1_fastp.fastq -O SRR10971381_2_fastp.fastq --thread 8
  34. fastp v0.20.0, time used: 544 seconds

Quality control using FastQC.

  1. mkdir fastqc_out
  2. fastqc -o fastqc_out -f fastq SRR10971381_1_fastp.fastq SRR10971381_2_fastp.fastq

Use BWA MEM to map raw reads back to MN908947.

  1. mkdir bwa_index
  2. cp MN908947.fa bwa_index
  3. cd bwa_index
  4. bwa index MN908947.fa
  5. bwa mem -t 8 raw/bwa_index/MN908947.fa raw/SRR10971381/SRR10971381_1_fastp.fastq raw/SRR10971381/SRR10971381_2_fastp.fastq | samtools sort - -o result/SRR10971381_MN908947.bam

Stats on the BAM file.

  1. samtools flagstat -@8 SRR10971381_MN908947.bam
  2. 26137357 + 0 in total (QC-passed reads + QC-failed reads)
  3. 0 + 0 secondary
  4. 28875 + 0 supplementary
  5. 0 + 0 duplicates
  6. 174256 + 0 mapped (0.67% : N/A)
  7. 26108482 + 0 paired in sequencing
  8. 13054241 + 0 read1
  9. 13054241 + 0 read2
  10. 136034 + 0 properly paired (0.52% : N/A)
  11. 136414 + 0 with itself and mate mapped
  12. 8967 + 0 singletons (0.03% : N/A)
  13. 0 + 0 with mate mapped to a different chr
  14. 0 + 0 with mate mapped to a different chr (mapQ>=5)
  15. samtools view -F 0x804 -f 2 -b SRR10971381_MN908947.bam > SRR10971381_MN908947_mapped.bam

Variant calling.

  1. bcftools mpileup -f raw/MN908947.fa result/SRR10971381_MN908947_mapped.bam | bcftools call -mv -Ov -o result/SRR10971381_MN908947_mapped.vcf

Appendix

Entrez Direct Functions

See Entrez Direct (EDirect) for more information.

Navigation functions support exploration within the Entrez databases:

  • esearch performs a new Entrez search using terms in indexed fields.
  • elink looks up neighbors (within a database) or links (between databases).
  • efilter filters or restricts the results of a previous query.

Records can be retrieved in specified formats or as document summaries:

  • efetch downloads records or reports in a designated format.

Desired fields from XML results can be extracted without writing a program:

  • xtract converts EDirect XML output into a table of data values.

Several additional functions are also provided:

  • einfo obtains information on indexed fields in an Entrez database.
  • epost uploads unique identifiers (UIDs) or sequence accession numbers.
  • nquire sends a URL request to a web page or CGI service.