项目作者: quanc1989

项目描述 :
Characterization of Structural Variation in Chinese samples
高级语言: OpenEdge ABL
项目地址: git://github.com/quanc1989/SV-ONT-Tibetan.git
创建时间: 2020-10-12T16:25:06Z
项目社区:https://github.com/quanc1989/SV-ONT-Tibetan

开源协议:MIT License

下载


SV-ONT-Tibetan

This repository includes data and scripts to analyze structural variations of 25 Chinese samples using nanopore sequencing.

Citing information

If you use the methods or pipeline in this repository, please cite our paper.
Also, you can get detailed information about these methods in the supplementary method of our paper.

Quan, C., Li, Y., Liu, X. et al. Characterization of structural variation in Tibetans reveals new evidence of high-altitude adaptation and introgression. Genome Biol 22, 159 (2021). https://doi.org/10.1186/s13059-021-02382-3

Summary of SV callsets

Description Format Location
SV callsets vcf.gz example/merge.ont.genotyped.SURVIVOR.sorted.reheader.corrected.INDELtoSYMBOL.sorted.local.addINFO.addFST.addCIPOS.addCIEND.vcf.gz
SV genotypes vcf.gz example/merge.genotypes.corrected.delCHR.svtk.vcf.gz
Sample list xlsx example/samples.xlsx
SV annotation tsv.zip example/merge.genotypes.corrected.delCHR.svtk.annotsv.public.tsv.zip
SV Distribution tsv example/merge.genotypes.corrected.delCHR.svtk.dist
SV hom&het tsv example/merge.genotypes.corrected.delCHR.svtk.gte1ngs.LDpruned.het
Fixation index tsv example/merge.paragraph.genotypes.TIBvsHAN.20k_5k.windowed.weir.fst
Gene annotations gtf.gz 0_raw_data/gencode.v32lift37.canonical_annotation.gtf.gz

Pipeline for Multi-sample SV-calling and annotation

Script: pipeline.sv-calling.sh

Requirements

Summary

In the bash file pipeline.sv-calling.sh, we use sample data to demonstrate the complete process of detecting and annotating structural variations based on nanopore sequencing technology.

  1. Firstly, long reads were mapped to GRCh37 human reference from NCBI without alternate sequences. Mapping was performed with NGMLR with ONT default parameters.

  2. Then SV calling was performed on each sample using Sniffles, NanoSV, and SVIM. These tools have been reported to be compatible with NGMLR and show better accuracy and sensitivity than others. Five minimum supporting reads with at least 50 bp length was required. The insertion sequence and read ID was required for each method, and the rest are all default parameters.

  3. SURVIVOR was used to merge the SVs per each sample, which are supported by at least two methods with a maximum allowed pairwise distance of 1,000 bp between breakpoints. Meanwhile, SVs obtained from different tools are not necessary to agree on the SV-type or the strand, so that we could capture as many potential breakpoints as possible. Finally, we merged the SVs obtained from all the samples as long as one sample supports it.

  4. At last, we need to get a fully genotyped multi-samples dataset. We re-ran Sniffles across all the samples with all these potential regions (—Ivcf) and finally combined SVs with SURVIVOR. This time, we asked SURVIVOR only to report SVs supported by at least one sample, and they have to agree on the SV-type. Furthermore, we used a hard threshold with five minimum supporting reads, and all non-missing genotypes less than this threshold were modified to reference.

  5. After SVs were discovered within a small population using long-read sequencing, we then genotyped these SVs with a relatively large amount of NGS data accumulated in previous studies. Paragraph, which is a new graph-based method, was used to genotype each NGS genome. We set the maximum allowed read count for SVs to 20 times the mean genome coverage for each dataset described above. We replaced all genotypes which failed to pass any filters by Paragraph with missing genotypes (./.).

  6. We annotated SVs for a range of potential effects on coding sequences using SVTK and AnnotSV.

  1. ![模型示意图](pipeline-sv-calling.png)

Pipeline for demographic inference and simulation

Script: pipeline.demographic_inference.sh

Other Requirements

Summary

In the bash file pipeline.demographic_inference.sh, we performed demographic inference by easySFS and make whole-genome simulation by msprime.

  1. Firstly, we used SNVs to estimate the demographic history of Tibetans and Hans. An unfolded joint site frequency spectrum (SFS) of non-genic autosomal bases was estimated using easySFS.

  2. To explore the alternative demographic models for the population model YRI-TIB-HAN, we used the diffusion approximation method of ∂a∂i to analyze the joint SFS.

  3. When the best-fit demographic model was recognized, we used msprime to perform whole-genome coalescent simulations. To approximately account for mutational heterogeneity across the genome, we applied a three-step framework described in a previous study (Hsieh PH, et al. 2019).


Pipeline for detection of introgression signals

Script: pipeline.detection_introgression.sh

Other Requirements

Summary

In the bash file pipeline.detection_introgression.sh, we applied the D-statistic and fd-statistic for each simualtion.


Visualization for chracteristics of SVs

Script: pipeline.plot.R

Examples:

  • SV distribution

    1. library(circlize)
    2. karyo_plot <- as.data.frame(data_sv_details_all_karyo)
    3. karyo_plot$SUPP <- as.integer(karyo_plot$SUPP)
    4. karyo_plot_BND <- karyo_plot[karyo_plot$SVTYPE=='TRA',]
    5. karyo_plot_BND$seqnames <- paste('chr',karyo_plot_BND$seqnames, sep = '')
    6. karyo_plot_BND_link <- karyo_plot_BND[,c('CHR2','END','END')]
    7. colnames(karyo_plot_BND_link) <- c('seqnames', 'start', 'end')
    8. karyo_plot_BND_link$seqnames <- paste('chr', as.character(karyo_plot_BND_link$seqnames), sep = "")
    9. karyo_plot_BND_link$start <- as.integer(karyo_plot_BND_link$start)
    10. karyo_plot_BND_link$end <- as.integer(karyo_plot_BND_link$end)
    11. array_seqnames <- paste('chr', karyo_plot$seqnames,sep = '')
    12. circos.initializeWithIdeogram(species = 'hg19')
    13. group_shared <- karyo_plot$SUPP==25
    14. group_major <- karyo_plot$SUPP%in%seq(13,24)
    15. group_polymorphic <- karyo_plot$SUPP%in%seq(2,12)
    16. group_singleton <- karyo_plot$SUPP==1
    17. circos.trackHist(factors=array_seqnames,
    18. track.height = 0.1,
    19. x=karyo_plot$start,col = "#999999", border = '#999999',
    20. bg.border = NA, bin.size = 500000)
    21. circos.trackHist(factors=array_seqnames[group_shared],
    22. track.height = 0.1,
    23. x=karyo_plot[group_shared,]$start,col = "red", border = 'red',
    24. bg.border = NA,bin.size = 500000)
    25. circos.trackHist(factors=array_seqnames[group_major], track.height = 0.1, x=karyo_plot[group_major,]$start,col = "purple", border = "purple",bg.border = NA,bin.size = 500000)
    26. circos.trackHist(factors=array_seqnames[group_polymorphic], track.height = 0.1, x=karyo_plot[group_polymorphic,]$start,col = "blue",border = "blue",bg.border = NA,bin.size = 500000)
    27. circos.trackHist(factors=array_seqnames[group_singleton], track.height = 0.1, x=karyo_plot[group_singleton,]$start,col = "light blue",border = "light blue",bg.border = NA,bin.size = 500000)

    SV distribution

  • Sample Distribution for NGS data

    1. library(maps)
    2. library(mapdata)
    3. library(maptools);
    4. china_map=readShapePoly('china-province-border-data/bou2_4p.shp');
    5. china_map@data$cName <- iconv(china_map@data$NAME, from = "GBK")
    6. x <- china_map@data
    7. xs <- data.frame(x,id=seq(0:924)-1)
    8. china_map1 <- fortify(china_map)
    9. china_map_data <- join(china_map1, xs, type = "full", )
    10. tmp <- summary(factor(data_samples_info[data_samples_info$Platform=='NGS',]$cLocation))
    11. NAME <- names(tmp)
    12. pop <- tmp
    13. pop <- data.frame(NAME, pop)
    14. colnames(pop) <- c('cName', 'pop')
    15. china_map_pop <- join(china_map_data, pop, type = "full")
    16. ggplot(china_map_pop, aes(x = long, y = lat, group = group, fill = pop)) +
    17. geom_polygon() +
    18. geom_path(color = "grey40") +
    19. coord_map() +
    20. xlab('') +
    21. ylab("") +
    22. theme(legend.title=element_blank(),
    23. legend.text = element_text(size = 8,face = "bold"),
    24. axis.text.x = element_blank(),
    25. axis.text.y = element_blank())

    SV distribution

  • Sample Distribution for NGS data

    1. df_sorted <- arrange(data_sample_details_ONT, id, Type)
    2. df_cumsum <- ddply(df_sorted, "id",
    3. transform,
    4. ypos=cumsum(Discovery) - 0.5*Discovery)
    5. ggplot(data=df_cumsum, aes(x=id, y=Discovery, fill = Type)) +
    6. geom_bar(stat="identity") +
    7. theme_minimal()+
    8. theme(legend.title=element_blank(),
    9. legend.direction = 'horizontal',
    10. legend.position=c(0.5,0.98),
    11. legend.spacing.x = unit(0.1, 'cm'),
    12. legend.key.size=unit(0.3, 'cm'),
    13. legend.text = element_text(size = 6,face = "bold"),
    14. axis.text.x = element_text(size = 6,face = "bold", angle = 60, vjust = 0.6),
    15. axis.text.y = element_text(size = 6,face = "bold"),
    16. axis.title.y = element_text(size = 6,face = "bold"),
    17. axis.title.x.bottom = element_text(margin = margin(-15,0,0,0))) +
    18. scale_fill_manual(values=c('light blue', 'blue', 'purple','dark red')) +
    19. xlab("") +
    20. ylab("Discovery")

    SV distribution

  • telomere enrichment

    1. ggplot(data = data_sv_dist[data_sv_dist$All!=0,],
    2. aes(x = Dist, y = All)) +
    3. geom_point(size = 0.05, color="blue") +
    4. annotate("rect", fill = "dark gray", alpha = 0.5,
    5. xmin = 0, xmax = 5,
    6. ymin = -Inf, ymax = Inf)+
    7. xlim(0, 150) +
    8. # ylim(0, 250) +
    9. ylab('SVs Per 500kbp') +
    10. xlab('Telomere Distance') +
    11. theme_minimal()+
    12. theme(
    13. axis.text.x = element_text(size = 6,face = "bold"),
    14. axis.text.y = element_text(size = 6,face = "bold"),
    15. axis.title.y = element_text(size = 6,face = "bold"),
    16. axis.title.x = element_text(size = 6,face = "bold"))

    SV distribution

  • Group Support

    1. data_sv_group <- plyr::count(data_plot,'GROUP_SUPP')
    2. data_sv_group$GROUP_SUPP <- factor(data_sv_group$GROUP_SUPP)
    3. ggplot(data_sv_group, aes(x="",y=freq,fill=GROUP_SUPP)) +
    4. geom_bar(width=1,stat="identity") + coord_polar("y",start=0) +
    5. geom_text(aes(y=freq/4+c(0,cumsum(freq)[-length(freq)])),
    6. label=percent(data_sv_group$freq/sum(data_sv_group$freq)),
    7. size=2,
    8. color='white',
    9. fontface="bold") +
    10. scale_fill_manual(values=config_color_group_supp) +
    11. theme_minimal() +
    12. theme(legend.position = c(0.5,0),
    13. legend.direction = 'horizontal',
    14. legend.spacing.x = unit(0.1, 'cm'),
    15. legend.key.size=unit(0.3, 'cm'),
    16. legend.title=element_blank(),
    17. legend.text = element_text(size = 6,face = "bold"),
    18. panel.border = element_blank(),
    19. panel.grid = element_blank(),
    20. axis.ticks = element_blank(),
    21. axis.text = element_blank(),
    22. axis.title.x = element_blank(),
    23. axis.title.y = element_blank())

    SV distribution

  • SV length

    1. ggplot(data=data_plot[data_plot$SVTYPE!='TRA',], aes(x=SVLEN, color = SVTYPE)) +
    2. geom_freqpoly(binwidth=1/5, size=0.8) +
    3. scale_x_continuous(trans = 'log10',
    4. breaks=c(100,1000,10000,100000,1000000,10000000),
    5. labels =c('100bp','1kb','10kb','100kb','1Mb', '10Mb'),
    6. limits = c(50, 10000000)
    7. ) +
    8. scale_y_continuous(trans = 'log10') +
    9. theme_minimal() +
    10. theme(legend.title=element_blank(),
    11. legend.position=c(0.9,0.9),
    12. legend.direction = 'vertical',
    13. legend.spacing.x = unit(0.1, 'cm'),
    14. legend.key.size=unit(0.3, 'cm'),
    15. legend.text = element_text(size = 6,face = "bold"),
    16. axis.text.x = element_text(size = 6,face = "bold"),
    17. axis.text.y = element_text(size = 6,face = "bold"),
    18. axis.title.y = element_text(size = 6,face = "bold"),
    19. axis.title.x = element_text(size = 6,face = "bold"),
    20. axis.title.x.bottom = element_text(margin = margin(-10,0,0,0))) +
    21. scale_fill_manual(values=config_color_svtype) +
    22. xlab('') +
    23. ylab("SV Count")

    SV distribution

  • SV length group

    1. data_sv_group <- plyr::count(data_plot,c('GROUP_LEN','GROUP_SUPP'))
    2. data_sv_group$label <- 0
    3. for (type_sv in levels(data_sv_group$GROUP_LEN)){
    4. data_sv_group[data_sv_group$GROUP_LEN==type_sv,]$label <- sum(data_sv_group[data_sv_group$GROUP_LEN==type_sv,]$freq)
    5. }
    6. ggplot(data_sv_group, aes(x = GROUP_LEN, y = freq, fill = GROUP_SUPP)) +
    7. geom_bar(position = "fill",stat = "identity") +
    8. scale_y_continuous(labels = scales::percent_format(), expand = expand_scale(mult = .1)) +
    9. coord_flip() +
    10. theme_minimal()+
    11. theme(legend.title=element_blank(),
    12. legend.position="bottom",
    13. legend.spacing.x = unit(0.1, 'cm'),
    14. legend.key.size=unit(0.3, 'cm'),
    15. legend.text = element_text(size = 6,face = "bold"),
    16. axis.text.x = element_text(size = 6,face = "bold"),
    17. axis.text.y = element_text(size = 6,face = "bold"),
    18. axis.title.y = element_text(size = 6,face = "bold"),
    19. axis.title.x.bottom = element_text(margin = margin(-10,0,0,0))) +
    20. scale_fill_manual(values=config_color_group_supp) +
    21. xlab("") +
    22. ylab("") +
    23. geom_text(aes(label = label, y= ..prop..), stat= "count", hjust = -0.1, size=1)

    SV distribution

  • SV type group

    1. data_sv_group <- plyr::count(data_plot,c('SVTYPE','GROUP_SUPP'))
    2. data_sv_group$SVTYPE <- factor(data_sv_group$SVTYPE)
    3. data_sv_group$label <- 0
    4. for (type_sv in levels(data_sv_group$SVTYPE)){
    5. data_sv_group[data_sv_group$SVTYPE==type_sv,]$label <- sum(data_sv_group[data_sv_group$SVTYPE==type_sv,]$freq)
    6. }
    7. data_sv_group$SVTYPE <- factor(data_sv_group$SVTYPE, levels = c( 'DEL', 'INS', 'DUP', 'INV', 'TRA'))
    8. ggplot(data_sv_group, aes(x = SVTYPE, y = freq, fill = GROUP_SUPP)) +
    9. geom_bar(position = "fill",stat = "identity") +
    10. scale_y_continuous(labels = scales::percent_format(), expand = expand_scale(mult = .1)) +
    11. coord_flip() +
    12. theme_minimal()+
    13. theme(legend.title=element_blank(),
    14. legend.position="bottom",
    15. legend.spacing.x = unit(0.1, 'cm'),
    16. legend.key.size=unit(0.3, 'cm'),
    17. legend.text = element_text(size = 6,face = "bold"),
    18. axis.text.x = element_text(size = 6,face = "bold"),
    19. axis.text.y = element_text(size = 6,face = "bold"),
    20. axis.title.y = element_text(size = 6,face = "bold"),
    21. axis.title.x.bottom = element_text(margin = margin(-15,0,0,0))) +
    22. scale_fill_manual(values=config_color_group_supp) +
    23. xlab("") +
    24. ylab("") +
    25. geom_text(aes(label = label, y= ..prop..), stat= "count", hjust = -0.1, size=2)

    SV distribution

  • SV breakpoints CI
    ```R

    1. data_tmp <- data_plot
    2. data_tmp$GROUP_CI <- NA
    3. data_tmp[data_tmp$MAXCI_POS<=1000&data_tmp$MAXCI_END<=1000,]$GROUP_CI <- '500bp-1kb'
    4. data_tmp[data_tmp$MAXCI_POS<=500&data_tmp$MAXCI_END<=500,]$GROUP_CI <- '250bp-500bp'
    5. data_tmp[data_tmp$MAXCI_POS<=250&data_tmp$MAXCI_END<=250,]$GROUP_CI <- '100bp-250bp'
    6. data_tmp[data_tmp$MAXCI_POS<=100&data_tmp$MAXCI_END<=100,]$GROUP_CI <- '0-100bp'
    7. data_tmp$GROUP_CI <- factor(data_tmp$GROUP_CI, levels = c('0-100bp', '100bp-250bp', '250bp-500bp', '500bp-1kb'))
  1. df.new<-ddply(data_tmp,.(GROUP_SUPP),plyr::summarise,
  2. prop=prop.table(table(GROUP_CI)),
  3. SUPP=names(table(GROUP_CI)))
  4. df.new$SUPP <- factor(df.new$SUPP, levels = c('0-100bp', '100bp-250bp', '250bp-500bp', '500bp-1kb'))
  5. ggplot(df.new, aes(SUPP, prop, fill=GROUP_SUPP)) +
  6. geom_bar(stat="identity",position = 'dodge') +
  7. theme_minimal()+
  8. theme(legend.title=element_blank(),
  9. legend.position=c(0.5,0.9),
  10. legend.direction = 'horizontal',
  11. legend.spacing.x = unit(0.1, 'cm'),
  12. legend.key.size=unit(0.3, 'cm'),
  13. legend.text = element_text(size = 6,face = "bold"),
  14. axis.text.x = element_text(size = 6,face = "bold"),
  15. axis.text.y = element_text(size = 6,face = "bold"),
  16. axis.title.y = element_text(size = 6,face = "bold"),
  17. axis.title.x = element_text(size = 6,face = "bold")
  18. ) +
  19. scale_fill_manual(values=config_color_group_supp) +
  20. xlab("Max Interval of Breakpoints") +
  21. ylab("Prop")
  22. ```
  23. ![SV distribution](plots/breakpoints.ci.group_supp.prop.png)
  • SV breakpoints repeat length

    1. ggplot(data=data_sv_repeat, aes(x=abs(SVLEN), fill = RepClass)) +
    2. geom_histogram() +
    3. scale_x_continuous(trans = 'log10',breaks = c(100,300,2000,6000,20000),
    4. labels = c('100bp','300bp','2kb','6kb','20kb'),limits = c(50,20000)) +
    5. theme_minimal()+
    6. theme(legend.title=element_blank(),
    7. legend.position=c(0.8,0.65),
    8. legend.spacing.x = unit(0.1, 'cm'),
    9. legend.key.size=unit(0.3, 'cm'),
    10. legend.text = element_text(size = 6,face = "bold"),
    11. axis.text.x = element_text(size = 6,face = "bold", angle = 45),
    12. axis.text.y = element_text(size = 6,face = "bold"),
    13. axis.title.y = element_text(size = 6,face = "bold"),
    14. axis.title.x = element_text(size = 6,face = "bold")) +
    15. scale_fill_manual(values=config_color_repeat) +
    16. xlab('') +
    17. ylab("SV Count")

    SV distribution

  • SV breakpoints gc repeat

    1. ggplot(data=data_sv_repeat, aes(x=GCcontent, fill = RepClass)) +
    2. geom_histogram() +
    3. theme_minimal()+
    4. theme(legend.title=element_blank(),
    5. legend.position='bottom',
    6. legend.spacing.x = unit(0.1, 'cm'),
    7. legend.key.size=unit(0.5, 'cm'),
    8. legend.text = element_text(size = 10,face = "bold"),
    9. axis.text.x = element_text(size = 10,face = "bold", angle = 45),
    10. axis.text.y = element_text(size = 10,face = "bold"),
    11. axis.title.y = element_text(size = 10,face = "bold"),
    12. axis.title.x = element_text(size = 10,face = "bold")) +
    13. scale_fill_manual(values=config_color_repeat) +
    14. xlab('GC Content') +
    15. ylab("SV Count")

    SV distribution

  • SV Database AF

    1. ggplot(data=data_plot, aes(x=AF_Database, fill = GROUP_SUPP)) +
    2. geom_histogram() +
    3. theme_minimal()+
    4. # scale_y_continuous(trans = 'log', breaks = c(1000,10000),labels = c('1000','10000')) +
    5. theme(legend.title=element_blank(),
    6. legend.position='bottom',
    7. legend.spacing.x = unit(0.1, 'cm'),
    8. legend.key.size=unit(0.5, 'cm'),
    9. legend.text = element_text(size = 10,face = "bold"),
    10. axis.text.x = element_text(size = 10,face = "bold", angle = 45),
    11. axis.text.y = element_text(size = 10,face = "bold"),
    12. axis.title.y = element_text(size = 10,face = "bold"),
    13. axis.title.x = element_text(size = 10,face = "bold")) +
    14. scale_fill_manual(values=config_color_group_supp) +
    15. xlab('AF') +
    16. ylab("Discovery")

    SV distribution

  • SV Genotyping
    ```R

    1. data_tmp <- data_plot
    2. data_tmp$GROUP_SUPP_NGS <- as.character(data_tmp$GROUP_SUPP_NGS)
    3. data_tmp <- data_tmp[!is.na(data_tmp$MR_NGS),]
    4. # data_tmp <- data_tmp[!is.na(data_tmp$MR_NGS)&data_tmp$MR_NGS<0.05,]
    5. data_tmp[is.na(data_tmp$GROUP_SUPP_NGS),]$GROUP_SUPP_NGS <- 'No Support'
    6. data_tmp$GROUP_SUPP_NGS <- factor(data_tmp$GROUP_SUPP_NGS, levels = c('No Support','Singleton', 'Polymorphic', 'Major', 'Shared'))
    7. data_tmp_1 <- plyr::count(data_tmp[data_tmp$MR_NGS<0.05,],'GROUP_SUPP_NGS')
    8. data_tmp_2 <- plyr::count(data_tmp[data_tmp$MR_NGS<0.05&data_tmp$GROUP_SUPP_NGS=='No Support',],
    9. 'GROUP_SUPP')
    10. data_tmp_3 <- plyr::count(data_tmp[data_tmp$MR_NGS<0.05&data_tmp$GROUP_SUPP_NGS!='No Support',],
    11. 'GROUP_SUPP')
    12. ids_pie = c('SV', 'MR_0', 'MR_1', 'No_Support', 'Support')
    13. labels_pie = c('SV genotyping', 'MR < 0.05', 'MR >= 0.05', 'genotyped AF = 0', 'genotyped AF > 0')
    14. parents_pie = c('', 'SV', 'SV', 'MR_0', 'MR_0')
    15. values_pie = c(nrow(data_tmp),
    16. sum(data_tmp$MR_NGS<0.05),
    17. sum(data_tmp$MR_NGS>=0.05),
    18. data_tmp_1[data_tmp_1$GROUP_SUPP_NGS=='No Support',]$freq,
    19. sum(data_tmp_1$freq)-data_tmp_1[data_tmp_1$GROUP_SUPP_NGS=='No Support',]$freq)
    20. for (label_condition in data_tmp_2$GROUP_SUPP) {
    21. ids_pie <- c(ids_pie, paste(label_condition, 'No_Support', sep = '_'))
    22. labels_pie <- c(labels_pie, label_condition)
    23. parents_pie <- c(parents_pie, 'No_Support')
    24. values_pie <- c(values_pie, data_tmp_2[data_tmp_2$GROUP_SUPP==label_condition,]$freq)
    25. }
    26. for (label_condition in data_tmp_3$GROUP_SUPP) {
    27. ids_pie <- c(ids_pie, paste(label_condition, 'Support', sep = '_'))
    28. labels_pie <- c(labels_pie, label_condition)
    29. parents_pie <- c(parents_pie, 'Support')
    30. values_pie <- c(values_pie, data_tmp_3[data_tmp_3$GROUP_SUPP==label_condition,]$freq)
    31. }
  1. data_sv_group <- subset(data_tmp, GROUP_SUPP_NGS=='No Support' & MR_NGS <= 0.05)
  2. data_sv_group$GROUP_REP <- 'None'
  3. data_sv_group[data_sv_group$RepClass!='None',]$GROUP_REP <- 'Repclass'
  4. data_sv_group[data_sv_group$RepClass=='None'&data_sv_group$SD!='',]$GROUP_REP <- 'SD'
  5. data_sv_group[data_sv_group$GROUP_REP=='None'&
  6. (data_sv_group$Repeats_type_left!='None'|data_sv_group$Repeats_type_right!='None'),]$GROUP_REP <- 'flanked'
  7. data_tmp_4 <- plyr::count(data_sv_group,c('GROUP_SUPP','GROUP_REP'))
  8. for (label_group in data_tmp_2$GROUP_SUPP){
  9. for (label_condition in unique(data_tmp_4$GROUP_REP)) {
  10. ids_pie <- c(ids_pie, paste(label_condition, label_group, 'No_Support', sep = '_'))
  11. labels_pie <- c(labels_pie, label_condition)
  12. parents_pie <- c(parents_pie, paste(label_group,'No_Support', sep = '_'))
  13. values_pie <- c(values_pie, data_tmp_4[data_tmp_4$GROUP_SUPP==label_group&
  14. data_tmp_4$GROUP_REP==label_condition,]$freq)
  15. }
  16. }
  17. if (!require("processx")) install.packages("processx")
  18. fig <- plot_ly(
  19. ids=ids_pie,
  20. labels=labels_pie,
  21. parents=parents_pie,
  22. values=values_pie,
  23. type='sunburst',
  24. branchvalues = 'total'
  25. )
  26. orca(fig, "../data/result_genotypted/plots/genotyping.pie.svg")
  27. ```
  28. ![SV distribution](plots/sv.genotyping.pie.svg)
  29. * SV Genotyping HardyWeinberg
  30. ```R
  31. library(HardyWeinberg)
  32. plot.HWE <- function(dat,pop=NULL,title=NULL,full.legend=F,lab.cex=1){
  33. require(HardyWeinberg,quietly=T)
  34. #Gather HW p-values & colors
  35. HWE.mat <- dat
  36. HW.p <- HWChisqStats(X=HWE.mat,x.linked=F,pvalues=T)
  37. HW.cols <- rep("#4DAC26",times=length(HW.p))
  38. HW.cols[which(HW.p<0.05)] <- "#81F850"
  39. HW.cols[which(HW.p<0.05/length(HW.p))] <- "#AC26A1"
  40. #Generate HW plot frame
  41. par(mar=c(1,1,1,1),bty="n")
  42. plot(x=1.15*c(-1/sqrt(3),1/sqrt(3)),y=c(-0.15,1.15),type="n",
  43. xaxt="n",yaxt="n",xlab="",ylab="",xaxs="i",yaxs="i")
  44. segments(x0=c(-1/sqrt(3),0,1/sqrt(3)),
  45. x1=c(0,1/sqrt(3),-1/sqrt(3)),
  46. y0=c(0,1,0),y1=c(1,0,0))
  47. HWTernaryPlot(X=HWE.mat,n=max(HWE.mat,na.rm=T),newframe=F,
  48. vbounds=F,mafbounds=F,
  49. region=1,vertexlab=NA,
  50. alpha=0.05,
  51. curvecols=c("#4DAC26","#81F850",NA,NA),pch=NA)
  52. #Add axes
  53. text(x=c(-1/sqrt(3),1/sqrt(3)),y=0,labels=c("0/0","1/1"),
  54. pos=1,cex=lab.cex,xpd=T,font=2)
  55. text(x=0,y=1,labels="0/1",pos=3,cex=lab.cex,xpd=T,font=2)
  56. #Finish HW plot
  57. HWTernaryPlot(X=HWE.mat,n=max(HWE.mat,na.rm=T),newframe=F,
  58. vbounds=F,mafbounds=F,
  59. region=1,vertexlab=NA,
  60. alpha=0.03/nrow(HWE.mat),
  61. curvecols=c("#4DAC26","#AC26A1",NA,NA),
  62. pch=21,cex=0.3,signifcolour=F,markercol=NA,
  63. markerbgcol=adjustcolor(HW.cols,alpha=0.25))
  64. segments(x0=c(-1/sqrt(3),0,1/sqrt(3)),
  65. x1=c(0,1/sqrt(3),-1/sqrt(3)),
  66. y0=c(0,1,0),y1=c(1,0,0))
  67. #Add legend
  68. n.pass <- length(which(HW.p>=0.05))
  69. print(paste("PASS: ",n.pass/length(HW.p),sep=""))
  70. n.nom <- length(which(HW.p<0.05 & HW.p>=0.05/nrow(HWE.mat)))
  71. print(paste("NOMINAL FAILS: ",n.nom/length(HW.p),sep=""))
  72. n.bonf <- length(which(HW.p<0.05/nrow(HWE.mat)))
  73. print(paste("BONFERRONI FAILS: ",n.bonf/length(HW.p),sep=""))
  74. legend("right",pch=19,col=c("#4DAC26","#81F850","#AC26A1"),pt.cex=1,
  75. legend=c(paste(round(100*(n.pass/nrow(HWE.mat)),0),"%",sep=""),
  76. paste(round(100*(n.nom/nrow(HWE.mat)),0),"%",sep=""),
  77. paste(round(100*(n.bonf/nrow(HWE.mat)),0),"%",sep="")),
  78. bty="n",bg=NA,cex=lab.cex)
  79. text(x=par("usr")[2],y=par("usr")[4]-(0.2*(par("usr")[4]-par("usr")[3])),pos=2,cex=lab.cex,
  80. labels=paste(title,"\n \n ",sep=""),font=2)
  81. text(x=par("usr")[2],y=par("usr")[4]-(0.2*(par("usr")[4]-par("usr")[3])),pos=2,cex=lab.cex,
  82. labels=paste(" \n",prettyNum(max(apply(HWE.mat,1,sum),na.rm=T),big.mark=",")," Samples\n ",sep=""))
  83. text(x=par("usr")[2],y=par("usr")[4]-(0.2*(par("usr")[4]-par("usr")[3])),pos=2,cex=lab.cex,
  84. labels=paste(" \n \n",prettyNum(nrow(HWE.mat),big.mark=",")," SV",sep=""))
  85. }
  86. data_genotypes_normal <- data_genotypes[
  87. (!is.na(data_sv_details_all$AF_ALL_NGS))&
  88. data_sv_details_all$MR_NGS<0.05&
  89. data_sv_details_all$AF_ALL_NGS>0.01&
  90. data_sv_details_all$AF_ALL_NGS<1,
  91. colnames(data_genotypes)%in%data_samples_info[data_samples_info$Platform=='NGS',]$SampleID]
  92. data_genotypes_normal[data_genotypes_normal==-1] <- 0
  93. data_genotypes_hw <- data_genotypes_normal
  94. data_genotypes_hw$AA <- rowSums(data_genotypes_normal==0)
  95. data_genotypes_hw$AB <- rowSums(data_genotypes_normal==1)
  96. data_genotypes_hw$BB <- rowSums(data_genotypes_normal==2)
  97. data_genotypes_hw <- data_genotypes_hw[,c('AA','AB','BB')]
  98. # png(filename = paste(prefix_filename,'genotyping.hardyweinberg.png',sep = '.'), width = 1000, height = 1000, res = val_res)
  99. plot.HWE(data_genotypes_hw, lab.cex = 1)
  100. ```
  101. ![SV distribution](plots/sv.genotyping.hardyweinberg.png)
  • SV Genotyping

    1. tmp <- subset(data_plot, SUPP_NGS>0)
    2. df.new<-ddply(tmp,.(SVTYPE),plyr::summarise,
    3. prop=prop.table(table(GROUP_SUPP_NGS)),
    4. SUPP=names(table(GROUP_SUPP_NGS)))
    5. df.new$SUPP <- factor(df.new$SUPP, levels = c('Singleton', 'Polymorphic', 'Major', 'Shared'))
    6. ggplot(df.new, aes(SUPP, prop, fill=SVTYPE)) +
    7. geom_bar(stat="identity",position = 'dodge') +
    8. theme_minimal()+
    9. theme(legend.title=element_blank(),
    10. legend.position=c(0.9,0.8),
    11. panel.border = element_blank(),
    12. panel.grid = element_blank(),
    13. panel.background = element_blank(),
    14. legend.spacing.x = unit(0.1, 'cm'),
    15. legend.key.size=unit(0.3, 'cm'),
    16. legend.text = element_text(size = 6,face = "bold"),
    17. axis.text.x = element_text(size = 6,face = "bold"),
    18. axis.text.y = element_text(size = 6,face = "bold"),
    19. axis.title.y = element_text(size = 6,face = "bold"),
    20. axis.title.x.bottom = element_text(margin = margin(-15,0,0,0))) +
    21. scale_fill_manual(values=config_color_svtype) +
    22. xlab("") +
    23. ylab("Prop")

    SV distribution

  • SV Genotyping SUPP

    1. ggplot(data=data_tmp, aes(x=SUPP, fill = GROUP_SUPP_NGS)) +
    2. geom_bar() +
    3. theme_minimal()+
    4. theme(legend.title=element_blank(),
    5. legend.position=c(0.5,0.9),
    6. legend.direction = 'horizontal',
    7. legend.spacing.x = unit(0.1, 'cm'),
    8. legend.key.size=unit(0.3, 'cm'),
    9. legend.text = element_text(size = 6,face = "bold"),
    10. axis.text.x = element_text(size = 6,face = "bold", angle = 45),
    11. axis.text.y = element_text(size = 6,face = "bold"),
    12. axis.title.y = element_text(size = 6,face = "bold"),
    13. axis.title.x = element_text(size = 6,face = "bold")) +
    14. scale_fill_manual(values=config_color_group_supp_han) +
    15. xlab('Samples Support') +
    16. ylab("Discovery")

    SV distribution

  • SV Genotyping LD

    1. ggplot(data_sv_group, aes(x=SVTYPE, y=LD,group=SVTYPE)) +
    2. geom_boxplot(aes(fill=SVTYPE),outlier.size = 0.1) +
    3. theme_minimal()+
    4. theme(legend.title=element_blank(),
    5. legend.position='bottom',
    6. # panel.border = element_blank(),
    7. # panel.grid = element_blank(),
    8. panel.background = element_blank(),
    9. legend.spacing.x = unit(0.1, 'cm'),
    10. legend.key.size=unit(0.5, 'cm'),
    11. legend.text = element_text(size = 10,face = "bold"),
    12. axis.text.x = element_text(size = 10,face = "bold"),
    13. axis.text.y = element_text(size = 10,face = "bold"),
    14. axis.title.y = element_text(size = 10,face = "bold"),
    15. axis.title.x = element_text(size = 10,face = "bold"),
    16. axis.title.x.bottom = element_text(margin = margin(-10,0,0,0))) +
    17. scale_fill_manual(values=config_color_svtype) +
    18. xlab('') +
    19. ylab("max LD")

    SV distribution

  • SV gene enrichment

    1. data.gene.background <- read.xlsx('example/enrich.xlsx')
    2. for (index_row in seq(nrow(data.gene.background))) {
    3. tmp <- fisher.test(matrix(c(data.gene.background[index_row,]$A1,
    4. data.gene.background[index_row,]$A0,
    5. data.gene.background[index_row,]$N1,
    6. data.gene.background[index_row,]$N0),nrow = 2))
    7. data.gene.background[index_row,]$Odds.Ratio <- tmp$estimate
    8. data.gene.background[index_row,]$p.value <- tmp$p.value
    9. }
    10. data.gene.background$Gene.Set <- factor(data.gene.background$Gene.Set, levels = rev(data.gene.background$Gene.Set[order(data.gene.background$Odds.Ratio,decreasing = TRUE)]))
    11. ggplot(data=data.gene.background[2:11,],aes(x=Gene.Set,
    12. y=A1,
    13. fill=p.value)) +
    14. geom_bar(stat="identity") + coord_flip() +
    15. theme(panel.background=element_rect(fill='transparent',colour = 'black'),
    16. legend.key.size=unit(0.2, 'cm'),
    17. legend.title=element_blank(),
    18. legend.text = element_text(size = 2.5,face = "bold"),
    19. axis.text.x = element_text(size = 2.5,face = "bold"),
    20. axis.text.y = element_text(color="black",size=2.5,face = "bold"),
    21. axis.title.y = element_text(size = 2.5,face = "bold")) +
    22. scale_fill_gradient(low="red",high="blue", breaks=c(0.05, 0.1), limits=c(0, 0.15)) +
    23. xlab("") +
    24. ylab("")

    SV distribution

  • SV Genotyping VAF

    1. data_tmp <- data_plot[!is.na(data_plot$AF_ALL_NGS)&data_plot$MR_NGS<0.05&data_plot$AF_ALL_NGS>0&data_plot$AF_ALL_NGS<0.1,]
    2. pdf(file = paste(prefix_filename,'genotyping.pop.vaf.group_pop_detail','pdf',sep = '.'), width = 3, height = 3)
    3. ggplot(data=data_tmp, aes(AF_ALL_NGS, stat(count), fill = GROUP_POP_DETAIL_NGS)) +
    4. geom_density(position = "fill",bw=0.005) +
    5. scale_fill_manual(values=config_color_group_pop)+
    6. theme_minimal()+
    7. theme(legend.title=element_blank(),
    8. legend.position='bottom',
    9. legend.spacing.x = unit(0.1, 'cm'),
    10. legend.key.size=unit(0.3, 'cm'),
    11. legend.text = element_text(size = 6,face = "bold"),
    12. axis.text.x = element_text(size = 6,face = "bold"),
    13. axis.text.y = element_text(size = 6,face = "bold"),
    14. axis.title.y = element_text(size = 6,face = "bold"),
    15. axis.title.x = element_text(size = 6,face = "bold"),
    16. axis.title.x.bottom = element_text(margin = margin(5,0,0,0))) +
    17. xlab('VAF') +
    18. ylab("Proportion")
    19. dev.off()

    SV distribution

  • SV Genotyping PCA

    1. group_sample <- as.data.frame(data_samples_info[!data_samples_info$Source%in%c('T15H10','HUM'),c('SampleID','Group','Group_Detail','Location')])
    2. group_sample <- group_sample[order(group_sample$SampleID),]
    3. rownames(group_sample) <- group_sample$SampleID
    4. data_genotypes_normal <- data_genotypes[!is.na(data_sv_details_all$AF_ALL_NGS)&data_sv_details_all$MR_NGS<0.05&data_sv_details_all$AF_ALL_NGS>0.01,colnames(data_genotypes)%in%group_sample$SampleID]
    5. data_genotypes_normal <- data_genotypes_normal[, order(colnames(data_genotypes_normal))]
    6. data_genotypes_normal[data_genotypes_normal==-1] <- 0
    7. # tmp <- (data_genotypes_normal- rowMeans(data_genotypes_normal))/(1+(2*sqrt(data_sv_details_all$AF_ALL_NGS*(1-data_sv_details_all$AF_ALL_NGS))))
    8. p <- (1 + rowSums(data_genotypes_normal))/(2+2*nrow(group_sample))
    9. tmp <- (data_genotypes_normal- rowMeans(data_genotypes_normal))/2*sqrt(p*(1-p))
    10. data_sv_details_all_matrix.pca <- prcomp(t(as.matrix(tmp)))
    11. PCi<-data.frame(data_sv_details_all_matrix.pca$x,Group_Detail=group_sample$Group_Detail)
    12. ggplot(PCi,aes(x=PC1,y=PC2,color=Group_Detail))+
    13. geom_point(size=0.8)+
    14. scale_color_manual(values = config_color_group_pop) +
    15. theme_minimal()+
    16. theme(legend.title=element_blank(),
    17. legend.position=c(0.1,0.9),
    18. legend.spacing.x = unit(0.1, 'cm'),
    19. legend.key.size=unit(0.3, 'cm'),
    20. # panel.background = element_blank(),
    21. # # panel.border = element_rect(size=0.6, colour = "black"),
    22. # axis.line = element_line(size=0.6, colour = "black"),
    23. # axis.line.x.top = element_line(size=0.6, colour = "black"),
    24. legend.text = element_text(size = 6,face = "bold"),
    25. axis.text.x = element_text(size = 6,face = "bold"),
    26. axis.text.y = element_text(size = 6,face = "bold"),
    27. axis.title.y = element_text(size = 6,face = "bold"),
    28. axis.title.x = element_text(size = 6,face = "bold"),
    29. axis.title.x.bottom = element_text(margin = margin(5,0,0,0))
    30. )

    SV distribution

  • SV Genotyping PHt
    ```R

    1. group_sample <- as.data.frame(data_samples_info[!data_samples_info$Source%in%c('T15H10'),
    2. c('SampleID','Group','Group_Detail','Location')])
    3. group_sample <- group_sample[order(group_sample$SampleID),]
    4. group_sample$PHt <- 0
    5. group_sample$PHot <- 0
    6. rownames(group_sample) <- group_sample$SampleID
    7. data_genotypes_normal <- data_genotypes[!is.na(data_sv_details_all$AF_ALL_NGS)&data_sv_details_all$MR_NGS<0.05&data_sv_details_all$AF_ALL_NGS>0.01,colnames(data_genotypes)%in%group_sample$SampleID]
    8. data_genotypes_normal <- data_genotypes_normal[, order(colnames(data_genotypes_normal))]
    9. data_genotypes_normal[data_genotypes_normal==-1] <- 0
    10. data_genotypes_normal_type <- data_sv_details_all[!is.na(data_sv_details_all$AF_ALL_NGS)&data_sv_details_all$MR_NGS<0.05&data_sv_details_all$AF_ALL_NGS>0.01,]$SVTYPE
  1. for (tmp_sample in group_sample$SampleID) {
  2. group_sample[group_sample$SampleID==tmp_sample,]$PHt <- sum(data_genotypes_normal[data_genotypes_normal_type%in%c('INS','DEL'),tmp_sample]==1)
  3. group_sample[group_sample$SampleID==tmp_sample,]$PHot <- sum(data_genotypes_normal[data_genotypes_normal_type%in%c('INS','DEL'),tmp_sample]==2)
  4. }
  5. group_sample_svtype <- rbind(group_sample,group_sample)
  6. group_sample_svtype$SVTYPE <- c(rep('DEL', nrow(group_sample)),
  7. rep('INS', nrow(group_sample)))
  8. for (tmp_svtype in unique(group_sample_svtype$SVTYPE)) {
  9. for (tmp_sample in group_sample$SampleID) {
  10. group_sample_svtype[group_sample_svtype$SampleID==tmp_sample&group_sample_svtype$SVTYPE==tmp_svtype,]$PHt <- sum(data_genotypes_normal[data_genotypes_normal_type==tmp_svtype,tmp_sample]==1)
  11. group_sample_svtype[group_sample_svtype$SampleID==tmp_sample&group_sample_svtype$SVTYPE==tmp_svtype,]$PHot <- sum(data_genotypes_normal[data_genotypes_normal_type==tmp_svtype,tmp_sample]==2)
  12. }
  13. }
  14. # group_sample <-group_sample[!group_sample$Group_Detail%in%c('HUM','AFR'),]
  15. group_sample <-group_sample[!group_sample$Group_Detail%in%c('HUM'),]
  16. group_sample_svtype <-group_sample_svtype[!group_sample_svtype$Group_Detail%in%c('HUM'),]
  17. # group_sample <- group_sample[!group_sample$SampleID%in%c('WGC025266D', 'WGC025273D'),]
  18. group_sample$Location <- factor(group_sample$Location, levels = unique(c(
  19. group_sample[group_sample$Group_Detail=='TIBG',]$Location,
  20. group_sample[group_sample$Group_Detail=='TIBL',]$Location,
  21. group_sample[group_sample$Group_Detail=='HANN',]$Location,
  22. group_sample[group_sample$Group_Detail=='HANS',]$Location,
  23. group_sample[group_sample$Group_Detail=='AFR',]$Location
  24. )))
  25. pairwise.t.test(group_sample$PHot, g = group_sample$Group_Detail, p.adjust.method = 'bonferroni')
  26. pairwise.t.test(group_sample$PHt, g = group_sample$Group_Detail, p.adjust.method = 'bonferroni')
  27. p1 <- ggplot(group_sample, aes(x=Group_Detail, y=PHot, fill=Group_Detail)) +
  28. geom_violin() +
  29. coord_flip() +
  30. # geom_signif(comparisons = list(c('TIBG','TIBL')), map_signif_level=TRUE)
  31. #geom_signif(comparisons = list(levels(factor(group_sample$Group_Detail))), map_signif_level=TRUE)
  32. scale_fill_manual(values=config_color_group_pop) +
  33. theme_minimal()+
  34. theme(legend.title=element_blank(),
  35. legend.position='none',
  36. legend.spacing.x = unit(0.1, 'cm'),
  37. legend.key.size=unit(0.5, 'cm'),
  38. legend.text = element_text(size = 6,face = "bold"),
  39. panel.border = element_blank(),
  40. panel.grid = element_blank(),
  41. panel.background = element_blank(),
  42. axis.line.x = element_line(size=0.6, colour = "black"),
  43. axis.line.y = element_line(size=0.6, colour = "black"),
  44. axis.text.x = element_text(size = 6,face = "bold",vjust = 0.6),
  45. axis.text.y = element_text(size = 6,face = "bold"),
  46. axis.title.y = element_text(size = 6,face = "bold"),
  47. axis.title.x = element_text(size = 6,face = "bold"),
  48. axis.title.x.bottom = element_text(margin = margin(5,0,0,0))) +
  49. xlab('') +
  50. ylab("#hom")
  51. p2 <- ggplot(group_sample, aes(x=Group_Detail, y=PHt, fill=Group_Detail)) +
  52. geom_violin() +
  53. coord_flip() +
  54. scale_fill_manual(values=config_color_group_pop) +
  55. theme_minimal()+
  56. theme(legend.title=element_blank(),
  57. legend.position='none',
  58. legend.spacing.x = unit(0.1, 'cm'),
  59. legend.key.size=unit(0.5, 'cm'),
  60. panel.border = element_blank(),
  61. panel.grid = element_blank(),
  62. panel.background = element_blank(),
  63. axis.line.x = element_line(size=0.6, colour = "black"),
  64. axis.text.y = element_blank(),
  65. axis.ticks.y = element_blank(),
  66. legend.text = element_text(size = 6,face = "bold"),
  67. axis.text.x = element_text(size = 6,face = "bold",vjust = 0.6),
  68. axis.title.x = element_text(size = 6,face = "bold"),
  69. axis.title.x.bottom = element_text(margin = margin(5,0,0,0))) +
  70. xlab('') +
  71. ylab("#het")
  72. # ggarrange(p1, p2,ncol=2,nrow = 1)
  73. multiplot(p1,p2,cols = 2)
  74. ```
  75. ![SV distribution](plots/sv.genotyping.pop.hetANDhom.png)
  • SV Genotyping FST

    1. library(qqman)
    2. library(Cairo)
    3. Fstfile <- read.table(paste('example/merge.paragraph.genotypes.TIBvsHAN.20k_5k.windowed.weir.fst', sep = ''), header = T, stringsAsFactors = F)
    4. SNP <- paste(Fstfile[,1], Fstfile[,2], sep = ':')
    5. Fstfile <- cbind(SNP, Fstfile)
    6. colnames(Fstfile) <- c('SNP', 'CHR', 'POS','END','Bins', 'Fst','mean')
    7. Fstfile[Fstfile$CHR == 'chrX',]$CHR <- 'chr23'
    8. Fstfile$CHR <- as.numeric(str_split_fixed(Fstfile$CHR,'chr',2)[,2])
    9. filePNG <-paste(prefix_filename,'genotyping.pop.fst.pdf',sep = '.')
    10. CairoPNG(file=filePNG, width = 1500, height = 500)
    11. CairoPDF(file = filePNG,
    12. width = 8, height = 4, onefile = TRUE, family = "Helvetica")
    13. colorset <- c('#FF0000', '#FFD700', '#2E8B57', '#7FFFAA', '#6495ED', '#0000FF', '#FF00FF')
    14. manhattan(Fstfile, chr='CHR', bp='POS', p='Fst', snp='SNP', col=colorset,
    15. logp=FALSE,
    16. suggestiveline=0.15,
    17. genomewideline=FALSE,
    18. ylab='Fst',
    19. ylim=c(0,0.6),
    20. font.lab=4,
    21. cex.lab=0.8,
    22. cex=0.1,
    23. chrlabs = c(1:22, "X"))
    24. dev.off()

    SV distribution

  • SV Genotyping admixture

    1. # Assign the first argument to prefix
    2. prefix=paste('example/admixture/merge.paragraph.genotypes.local.reformat.addPopInfo.addFST.LDpruned',sep = '')
    3. # Get individual names in the correct order
    4. labels <- data.frame('ind' <- colnames(data_sv_details_raw@gt)[27:217], 'pop' <- NA)
    5. names(labels)<-c("ind","pop")
    6. labels <- labels[labels$ind%in%data_samples_info$SampleID,]
    7. index_sample <- 1
    8. while (index_sample <= nrow(labels)) {
    9. labels[index_sample,]$pop <- as.character(data_samples_info[data_samples_info$SampleID==labels[index_sample,]$ind,]$Group_Detail)
    10. index_sample <- index_sample + 1
    11. }
    12. # Add a column with population indices to order the barplots
    13. # Use the order of populations provided as the fourth argument (list separated by commas)
    14. labels$n<-factor(labels$pop,levels=unlist(strsplit('AFR,HANS,HANN,TIBL,TIBG',",")))
    15. levels(labels$n)<-c(1:length(levels(labels$n)))
    16. labels$n<-as.integer(as.character(labels$n))
    17. # read in the different admixture output files
    18. maxK=7
    19. tbl<-lapply(2:maxK, function(x) read.table(paste0(prefix,".",x,".Q")))
    20. # Prepare spaces to separate the populations/species
    21. rep<-as.vector(table(labels$n))
    22. spaces<-0
    23. for(i in 1:length(rep)){spaces=c(spaces,rep(0,rep[i]-1),0.5)}
    24. spaces<-spaces[-length(spaces)]
    25. # Plot the cluster assignments as a single bar for each individual
    26. par(mfrow=c(maxK-1,1),mar=c(0,1,0,0),oma=c(2,1,9,1),
    27. mgp=c(0,0.2,0),xaxs="i",cex.lab=1.2,cex.axis=0.8)
    28. bp<-barplot(t(as.matrix(tbl[[2]][order(labels$n),])),
    29. col=rainbow(n=2),
    30. xaxt="n",
    31. border=NA,
    32. ylab="K=2",
    33. yaxt="n",
    34. space=spaces)
    35. axis(3,at=bp,labels=labels$ind[order(labels$n)],las=2,tick=F,cex=0.6)
    36. lapply(3:(maxK-1),
    37. function(x) barplot(t(as.matrix(tbl[[x]][order(labels$n),])),
    38. col=rainbow(n=x),xaxt="n",
    39. border=NA,
    40. ylab=paste0("K=",x),
    41. yaxt="n",
    42. space=spaces))
    43. axis(1,at=c(which(spaces==0.5),
    44. bp[length(bp)])-diff(c(1,which(spaces==0.5),bp[length(bp)]))/2,
    45. tick = F,
    46. labels=unlist(strsplit('AFR,HANS,HANN,TIBL,TIBG',",")))

    SV distribution