Supplementary scripts for Nelson et al. (2020) paper on ORF3d
Supplementary scripts for Nelson et al. (2020) paper on SARS-CoV-2 ORF3d:
Nelson CW, Ardern Z, Goldberg TL, Meng C, Kuo C-H, Ludwig C, Kolokotronis S-O, Wei X. 2020.
Dynamically evolving novel overlapping gene as a factor in the SARS-CoV-2 pandemic. eLife 9: e59633. DOI: 10.7554/eLife.59633
fig1B.bash
ORF_length.R
aligned_fasta2haplotypes.pl
riboseq_sliding_window.R
riboseq.R
mass-spec_vs_riboseq.R
generate_random_protein.py
tally_epitope_coverage.py
epitope_MHCI.R
epitope_MHCII.R
hydrophobicity_profiles_ORF3a.R
SARS-CoV-2_locate_genes.pl
generate_seqs_from_VCF.py
three_levels_diversity.R
selection_vs_expression.R
extract_seqs_by_timepoint.py
extract_seqs_by_location.py
extract_positions_by_timepoint.py
temporal_pi.R
selection_sliding_windows.R
extract_variable_columns_MSA.py
extract_variable_columns_MSA_aa.py
Fig7.m
Fig7b.m
filter_vcf.py
summarize_intrahost_by_site.py
Fig8.m
Fig8_supp1.m
extract_fasta_by_sites.pl
extract_seq_subset.py
translate_nt_seq_file.pl
All supplementary are available at Zenodo under record ID 4052729.
For easy access, the four most important supplementary data files are available in the /data/
directory of this repository:
SARS-related-CoV_ALN.fasta
: whole-genome multiple sequence alignment of n=21 genomes of the species Severe acute respiratory syndrome-related coronavirus (between-taxa analysis). See manuscript for details. Note that the pangolin-CoV GD/1 sequence has been masked as N
, because GISAID permission is required for data access.
SARS-related-CoV_ALN.gtf
: Gene Transfer Format (GTF) file giving gene positions within SARS-related-CoV_ALN.fasta.
SARS-CoV-2.gtf
: Gene Transfer Format (GTF) file giving gene positions within the SARS-CoV-2 Wuhan-Hu-1 genome genome. Records are ordered by start site in the genome. Full gene regions are given, as well as just those segments of the gene overlapping (OL) or not overlapping (NOL) other genes. For example, the positions of ORF3d are 25524-25697; the positions of ORF3a codons overlapping ORF3d (ORF3a_OL_ORF3d) are 25522-25698; and the positions of ORF3a codons involved in the ORF3a/ORF3c/ORF3d triple overlap (ORF3a_OL_ORF3c_OL_ORF3d) are 25522-25584. ORF3b refers to the first short (23 codon) ORF in SARS-CoV-2 and is considered OL; the remainder of the region homologous to ORF3b in SARS-CoV is not considered OL.
Supplementary_Tables.xlsx
: Supplementary Tables referred to in the manuscript.
Scripts are arranged by Figure, and therefore by analysis. Although we are not able to provide all input data files because of the GISAID privacy agreement, users can apply for their own access. Every effort has been made to describe all steps and input, here or in each script’s comments. Where applicable, scripts are arranged in the order they should be executed. The scripts are of two types:
Command-line. These scripts are intended to be executed from the bash command line with the specified arguments.
Manual analysis. These R scripts document the bulk of our data analyses and visualizations. They are intended to be executed manually line-by-line in R/RStudio. The user should replace path names and arguments with the appropriate values for the user’s analysis and directories. Attention has been drawn to lines or variables that should be modified by the user with the flag: CHANGE THIS
.
fig1b_genome_annotation.bash
(command-line script)
SARS-related-CoV_ALN.fasta
: multiple alignment .fasta
file of n=21 genomes of the species Severe acute respiratory syndrome-related coronavirus. Note that the pangolin-CoV GD/1 sequence has been masked as N
, because GISAID permission is required for data access.SARS-related-CoV_ALN.gtf
: Gene Transfer Format (GTF) file giving gene positions within SARS-related-CoV_ALN.fasta
.sbc_rename2.nw
: Newick tree for sarbecovirus alignmentparameters_input.txt
parameters_input2.txt
Fig1b.png
Example:
fig1b_genome_annotation.bash
ORF_length.R
(manual analysis script)
frameshift_results.txt
, produced by frameshift_analysis.bash
.ORF_length_heatmap_data.tsv
, p-values for each internal overlapping gene within a previously annotated gene, redundatly annotated on site-by-site basis.aligned_fasta2haplotypes.pl
(command-line script)
.fasta
file containing a multiple sequence amino acid alignment of one protein product of interest.tsv
table: column 1 contain contains the number of occurrences (n) of the haplotype; column 2 contains the haplotype sequence itselfExample:
aligned_fasta2haplotypes.pl SARSCOV2_ORF3d_aa.fasta
riboseq_sliding_window.R
(command-line script)
Input. Unnamed arguments in the following order:
INFILE_FRAMES
: file with condition metadata for samples (here, riboseq_sample_metadata.tsv). Columns should be in the following order: sample, condition, time, and treatment, where condition is the combination of time and treatment (see script and file)INFILE_READS
: file with read data; columns described within script; produced by concatenating the output of ribosome profiling read mapping at the command line as follows:
grep -E "\s" SRR*.txt | gsed -E s/_all/\\t/ | gsed -E s/ntd-reads_framed_zerobased.txt:/\\t/ | gsed -E s/\\s+/\\t/ > mapped_reads_by_readlength.tsv
3. `WIN_SIZE`: size of the sliding window (integer)
4. `READ_LEN`: length of reads to consider (integer)
* **Output**.
1. Table in `.tsv` format giving the sum of reads in each frame for each condition and position (start of window).
* **Example**:
Rscript riboseq_sliding_window.R riboseq_sample_metadata.tsv mapped_reads_by_readlength.tsv 30 30
riboseq.R
(manual analysis script)
mapped_reads_by_readlength.tsv
, ribosome profiling read depth by sample, read length, position, and frame, after mapping of data from Finkel et al. (2020)riboseq_sample_metadata.tsv
, file with condition metadata for samples with columns in the following order: sample, condition, time, and treatment, where condition is the combination of time and treatment (see file)mapped_reads_by_readlength_sumSamples_win*read*.tsv
, results of the sliding window script riboseq_sliding_window.R for various window sizes (win*) and read length (read*) summarizing the proportion of reads mapping to each frame for each condition (treatment/time)mass-spec_vs_riboseq.R
(manual analysis script)
riboseq_upstream_peaks.tsv
, ribosome profile profiling read depth near gene start sites, shown in Figure 2A, derived from analysis of data from Finkel et al. (2020) and wrangled in the script riboseq.Rmapped_reads_by_readlength_wCDS.tsv
, ribosome profiling read depth by sample, read length, position, and frame. Source data underlying Figure 2C, derived from analysis of data from Finkel et al. (2020), produced in the script riboseq.R.expression_data_by_gene.tsv
, summary of expression estimates from both ribosome profiling and mass spectrometry for all gene detectable by mass spectrometrygenerate_random_protein.py
(command-line script)
.fasta
format..fasta
format.fasta
format containing the randomized protein sequence(s).Example:
generate_random_protein.py ORF3d_aa.fasta 57 1000
tally_epitope_coverage.py
(command-line script)
.tsv
format; NetMHCpan output file with 5 columns in this order: ID, NB (number MHC alleles bound based on NetMHCpan output), product, codon_start, codon_end .tsv
format giving the sum of bound epitopes overlapping each site.Example:
tally_epitope_coverage.py ORF3d_random.tsv 9
epitope_MHCI.R
(manual analysis script)
NetMHCpan_Wuhan-Hu-1.tsv
, NetMHCpan results for the basic protein set of the Wuhan-Hu-1 genome, plus ORF3dNetMHCpan_Wuhan-Hu-1_additional.tsv
, NetMHCpan results for ORF3c, ORF3d-2, and ORF3b (short)MHCI_epitope_summary_tally.tsv
, result of tally_epitope_coverage.py applied to the proteins encoded by Wuhan-Hu-1MHCI_short_unannot_ORFs.tsv
, result of tally_epitope_coverage.py applied to the short unannotated ORFs encoded by Wuhan-Hu-1NetMHCpan_*_random_tally.tsv
, result of tally_epitope_coverage.py applied to 1000 randomized peptides based on each of the proteins encoded by Wuhan-Hu-1MHCI_epitope_summary_test.tsv
, compiled results for MHC class I analysis, used for Figure 3Aepitope_MHCII.R
(manual analysis script)
NetMHCIIpan_Wuhan-Hu-1.tsv
, NetMHCIIpan results for the basic protein set of the Wuhan-Hu-1 genome, plus ORF3dNetMHCIIpan_Wuhan-Hu-1_additional.tsv
, NetMHCIIpan results for ORF3c, ORF3d-2, and ORF3b (short)MHCII_epitope_summary_tally.tsv
, result of tally_epitope_coverage.py applied to the proteins encoded by Wuhan-Hu-1MHCII_short_unannot_ORFs.tsv
, result of tally_epitope_coverage.py applied to the short unannotated ORFs encoded by Wuhan-Hu-1NetMHCIIpan_*_random_tally.tsv
, result of tally_epitope_coverage.py applied to 1000 randomized peptides based on each of the proteins encoded by Wuhan-Hu-1MHCII_epitope_summary_test.tsv
, compiled results for MHC class II analysis, used for Figure 3Ahydrophobicity_profiles_ORF3a.R
(manual analysis script)
hydrophobicity_profiles_ORF3a.csv
, output of the VOLPES server using the unitless hydrophobicity scale “FAC1” (Factor 1) with a sliding window of 25 amino acidshydrophobicity_profiles_ORF3a_corr.tsv
, correlations between hydrophobicity profiles in a sliding window of 25 residuesSARS-CoV-2_locate_genes.pl
(command-line script)
.fasta
file containing a multiple sequence alignment of SARS-CoV-2 whole-genome nucleotide sequences. Note that the script may fail for alignments with gaps or highly diverged from the Wuhan-Hu-1 genotype. .gtf
file containing gene coordinates. Note that some genes may be missed if the alignment contains sequences highly diverged from SARS-CoV-2.Example:
SARS-CoV-2_locate_genes.pl SARS-CoV-2_ALN.fasta
generate_seqs_from_VCF.py
(command-line script)
.fasta
with randomly interspersed variants (from VCF), for use with OLGenie (Nelson et al. 2020) or any software that requires a multiple sequence alignment input and does not depend upon linkage..fasta
file containing exactly one (1) reference sequence (e.g., SARS-CoV-2 Wuhan-Hu-1).vcf
file containing the variants of interest*.fasta
multiple sequence alignment file with variants randomy interpersed at the frequencies defined in the VCF file.Example:
generate_seqs_from_VCF.py reference.fasta variants.vcf 1000
three_levels_diversity.R
(manual analysis script)
between-taxa_codon_OL_category.tsv
, metadata file containing the OLG category for each codon in each product in the Severe acute respiratory syndrome-related coronavirus alignment, SARS-related-CoV_ALN.fastabetween-taxa_SNPGenie.tsv
, SNPGenie (non-OLG dN/dS, , Jukes-Cantor corrected) results for the alignment, SARS-related-CoV_ALN.fastabetween-taxa_SNPGenie_taxaRestricted.tsv
, SNPGenie results for our alignment, SARS-related-CoV_ALN.fasta, but limiting to taxa that have intact ORF3d (SARS-CoV-2 and pangolin-CoV GX/P5L) or lacking full-length ORF3b (9 sequences; see Figure 4)between-taxa_OLGenie.tsv
, OLGenie (OLG dN/dS, Jukes-Cantor corrected) results for most products and frames of annotated genes in SARS-related-CoV_ALN.fastabetween-taxa_OLGenie2.tsv
, OLGenie results for ORF3a in the ss13 frames to analyze ORF3c for SARS-related-CoV_ALN.fastabetween-taxa_OLGenie_addNN.tsv
, curated OLGenie results for SARS-related-CoV_ALN.fasta, incorporating the Wei-Zhang method for codon 71 of ORF3a (NN change in ORF3d)between-host_SNPGenie.tsv
, SNPGenie (non-OLG πN/πS) results for our SARS-CoV-2 (GISAID) alignmentbetween-host_OLGenie.tsv
, OLGenie (OLG πN/πS) results for our SARS-CoV-2 (GISAID) alignmentwithin-host_SNPGenie.tsv
, SNPGenie results for our deeply sequenced within-host samples, before taking means by codonwithin-host_OLGenie.tsv
, OLGenie results for our deeply sequenced within-host samples, based on pseudo-alignments produced with generate_seqs_from_VCF.py, before taking means by codonbetween-taxa_FINAL.tsv
, final, bootstrapped, OLG-aware Jukes-Cantor corrected dN/dS results for between-taxa (Severe acute respiratory syndrome-related coronavirus) databetween-host_FINAL.tsv
, final, bootstrapped, OLG-aware πN/πS results for between-host (GISAID) datawithin-host_FINAL.tsv
: final, bootstrapped, OLG-aware πN/πS results for within-host (SRR) dataselection_vs_expression.R
(manual analysis script)
expression_data_by_gene_frame.tsv
, summary of expression estimates from both ribosome profiling and mass spectrometry for all gene detectable by mass spectrometry, by frame (source data of Figure 2B)selection_three_levels.txt
, final πN/πS or dN/dS estimates underlying Figure 5 for each gene be evolutionary level (between-taxa, between-host, within-host) and region type (non-overlapping, overlapping).extract_seqs_by_timepoint.py
(command-line script)
.tsv
file with a one-row header; this contains the sampling dates of each sequence. The script expects the sequence ID in column 1 (index 0) and the date in column 4 (index 3).fasta
multiple sequence alignment file, where the headers are the GISAID IDs, i.e., they match the accession IDs in the GISAID table*.fasta
multiple sequence alignment file for each 14 day window, containing just those sequences sampled during that period. For example, the first file will have the name *_0to14.fasta
(window 1), the second file will have the name *_7to21.fasta
(window 2), and so on.Example:
extract_seqs_by_timepoint.py gisaid_cov2020_acknowledgement_table.tsv SARS-CoV-2_ALN.fasta
extract_seqs_by_location.py
(command-line script)
.tsv
file with a one-row header; this contains the sampling dates of each sequence. The script expects the sequence ID in column 1 (index 0), location in column 3 (index 2), and the date in column 4 (index 3).fasta
multiple sequence alignment file, where the headers are the GISAID IDs, i.e., they match the accession IDs in the GISAID tableLocation
columns*.fasta
multiple sequence alignment file with the name given by input argument 3 above, containing only those sequences matching the location given in argument 4 above.Example:
extract_seqs_by_location.py gisaid_cov2020_acknowledgement_table.tsv SARS-CoV-2_ALN.fasta SARS-CoV-2_ALN_Asia.fasta Asia
extract_positions_by_timepoint.py
(command-line script)
.tsv
file with a one-row header; this contains the sampling dates of each sequence. The script expects the sequence ID in column 1 (index 0), and the date in column 4 (index 3).fasta
multiple sequence alignment file, where the headers are the GISAID IDs, i.e., they match the accession IDs in the GISAID table.tsv
file with a single column: a list of sites (1-based genome positions) to track.tsv
).tsv
table with the name given by input argument 4 above, where each row is a unique combination of timepoint and genome position, and columns report the numbers and frequencies of the alleles for that timepoint and position.Example:
extract_positions_by_timepoint.py gisaid_cov2020_acknowledgement_table.tsv SARS-CoV-2_ALN_Asia.fasta sites_to_track.tsv SARS-CoV-2_ALN_Asia_tracked.fasta 14 1
temporal_pi.R
(manual analysis script)
snpgenie_within_group.pl
) to analyze the results of extract_seqs_by_timepoint.py
, separately for each window.within_group_codon_results.tsv
, combined across timepointstimepoint_seq_IDs.txt
, a table with sequence IDs (column ID
, i.e., EPI_) for each time window (column time_period
, e.g.*, “0to14”)gisaid_cov2020_acknowledgement_table.tsv
selection_sliding_windows.R
(manual analysis script) SARS-CoV-2-ref_ORF3a_ss12_windows_prepangolin.txt
, OLGenie (dNN/dNS) results in a sliding window (window size=50 codons, step size=1 codon) for SARS-CoV-2 against each other taxon (every pair) for ORF3a reading frame ss12 (i.e., the frame of ORF3d)SARS-CoV-2-ref_ORF3a_ss12_windows_pangolin.txt
, updated OLGenie results in a sliding window for SARS-CoV-2 against pangolin GX/P5L, i.e., the only other taxon in which ORF3d is intact, meant to replace the data in SARS-CoV-2-ref_ORF3a_ss12_windows.txtSARS-CoV-2-ref_N_ss13_windows.tsv
, OLGenie results in a sliding window for SARS-CoV-2 against each other taxon (every pair) for N reading frame ss13 (i.e., the frame of ORF9b and ORF9c)SARS-CoV-ref_ORF3a_ss13_windows.txt
, OLGenie results in a sliding window for SARS-CoV against each other taxon (every pair) for ORF3a reading frame ss13 (i.e., the frame of ORF3c and ORF3b)SARS-CoV-ref_N_ss13_windows.txt
, OLGenie results in a sliding window for SARS-CoV against each other taxon (every pair) for N reading frame ss13 (i.e., the frame of ORF9b and ORF9c)extract_variable_columns_MSA.py
(command-line script)
.fasta
file containing aligned nucleotide sequences *_variants_MAF[\d\.].fasta
, prints a .fasta
multiple sequence alignment of just the variable sites*_variants_MAF[\d\.].tsv
, prints a .tsv
table giving the nucleotide present in each sequence (row) at each variable site (column)Example:
extract_variable_columns_MSA.py SARS-CoV-2_ALN.fasta 0.02
extract_variable_columns_MSA_aa.py
(command-line script)
extract_variable_columns_MSA.py
, but for amino acid sequences, necessary to account for STOP codons (potential non-word characters).Example:
extract_variable_columns_MSA_aa.py SARS-CoV-2_ORF3d_aa_ALN.fasta 0.02
Fig7.m
SARSCOV2_MAFFT_processed_variants_MAF0.02_ManualCheck_Fig7.xlsx
, file from output, with manual correction of the entry with different ordering format of the collection dateFig7b.m
Fig7b.jpg
—> input in Fig7.mFig7a.jpg
(artwork, input in Fig7.m)Fig7.m
(figure collage): output Fig7.jpgfilter_VCF.py
(command-line script)
.vcf
(variant call format) files in the working directory. For use with SNPGenie (Nelson et al. 2015) or any software requiring FDR-filtered variants in .vcf
files..vcf
files in the working directory, and five unnamed arguments in the following order: *_filtered.vcf
) VCF file for each VCF file in the working directory, including only those variants that pass the FDR threshold.Example:
filter_VCF.py 1 0 29903 0.002 401
summarize_intrahost_by_site.py
(command-line script)
.vcf
(variant call format) files in the working directory..vcf
files in the working directory, and two unnamed arguments in the following order: .fasta
file containing exactly one (1) reference sequence (e.g., SARS-CoV-2 Wuhan-Hu-1).gtf
file containing containing genes to be annotated in the output table, with up to two genes overlapping each site (others will be ignored)*_site_database.tsv
. There are four rows for each position in the genome (defined by input 1), corresponding to each of the four possible nucleotide changes (including self-nucleotide). For example, a position with A in the reference (REF), there will four possible single nucleotide changes (ALT): A (self), C, G, and T. Each row is also labelled with up to two genes overlapping the site, and the codon, codon position, and amino acid encoded by each gene. Finally, each column following the metadata is a sample, giving the number of REF,ALT
reads in that same at that position, if its VCF file contains a record. Note that the beginning of the file is largely unpopulated, as the first rows correspond to the 5’-UTR region lacking genes and coverage.Example:
summarize_intrahost_by_site.py NC_045512.fasta NC_045512.gtf
Fig8.m
NC_045512_site_database_altA.tsv
NC_045512_site_database_altC.tsv
NC_045512_site_database_altG.tsv
NC_045512_site_database_altT.tsv
extract_fasta_by_sites.pl
(command-line script)
.fasta
file containing containing one or more aligned sequences .gtf
file containing CDS products to extract. They need not really be CDS, but should be labelled as such in the file. Only works for forward-strand (+) products..fasta
file for each CDS record in the GTF file. Just the DNA segment corresponding to the CDS coordinates of each record will be present in the resulting .fasta
files.Example:
extract_fasta_by_sites.pl Wuhan_Hu_1.fasta Wuhan_Hu_1.gtf
extract_seq_subset.py
(command-line script)
.fasta
based on header ID..fasta
headers) .fasta
format *Output. .fasta
format based on the file given by argument 2, but including only those sequences with headers provided in the file given by argument 1.Example:
extract_seq_subset.py seq_ID_list.txt SARS-COV-2_ALN.fasta
translate_nt_seq_file.pl
(command-line script)
.fasta
file containing the original (un-aligned) coding nucleotide sequences (complete codon sets, i.e., multiples of 3)Example:
translate_nt_seq_file.pl coding_nt_seqs.fasta
This work was supported by a Postdoctoral Research Fellowship from Academia Sinica (to C.W.N. under P.I. Wen-Hsiung Li); funding from the Bavarian State Government and National Philanthropic Trust (to Z.A. under P.I. Siegfried Scherer); NSF IOS grants #1755370 and #1758800 (to S.-O.K.); and the University of Wisconsin-Madison John D. MacArthur Professorship Chair (to T.L.G). Copyright-free images were obtained from Pixabay. The authors thank the GISAID platform and the originating and submitting laboratories who kindly uploaded SARS-CoV-2 sequences to the GISAID EpiCov™ Database for public access (Supplement). The authors thank Maciej F. Boni, Reed A. Cartwright, John Flynn, Kyle Friend, Dan Graur, Robert S. Harbert, Cheryl Hayashi, David G. Karlin, Niloufar Kavian, Kin-Hang (Raven) Kok, Wen-Hsiung Li, Meiyeh Lu, David A. Matthews, Lisa Mirabello, Apurva Narechania, Felix Li Jin, and attendees of the UC Berkeley popgen journal club for useful information and discussion; Andrew E. Firth, Alexander Gorbalenya, Irwin Jungreis, Manolis Kellis, Raven Kok, Angelo Pavesi, Kei Sato, Manuela Sironi, and Noam Stern-Ginossar for an invaluable discussion regarding standardizing nomenclature; Helen Piontkivska, Patricia Wittkopp, Antonis Rokas, and one anonymous reviewer for critical suggestions; Ming-Hsueh Lin for immense feedback on figures; and special thanks to Priya Moorjani, Jacob Tennessen, Montgomery Slatkin, Yun S. Song, Jianzhi George Zhang, Xueying Li, Hongxiang Zheng, Qinqin Yu, Meredith Yeager, and Michael Dean for commenting on earlier drafts of the manuscript.
When using this software, please refer to and cite:
Nelson CW, Ardern Z, Goldberg TL, Meng C, Kuo C-H, Ludwig C, Kolokotronis S-O, Wei X. 2020.
Dynamically evolving novel overlapping gene as a factor in the SARS-CoV-2 pandemic. eLife 9: e59633. DOI: 10.7554/eLife.59633
and this page:
If you have questions about our scripts or study, please first thoroughly read the documentation and in-line comments relevant to the script of interest. If these do not answer your question, please click on the Issues tab at the top of this page and search to see if your question has already been answered; if not, begin a new issue, so that others might benefit from the discussion.
Other queries should be addressed to the corresponding authors: