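Project author: felixfan

Project description:
Python for HLA analysis: summary, association analysis, zygosity test and interaction test
Language: Python
Repository: git://github.com/felixfan/PyHLA.git
Created: 2015-04-28T02:46:06Z
Project community: https://github.com/felixfan/PyHLA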



PyHLA: tests for association between HLA alleles and diseases

Dec 12, 2016

1. Introduction

Python for HLA analysis: summary, association analysis, zygosity test and interaction test.

2. Installation

PyHLA uses Python 2 (Python 2.7 or higher) and the following Python modules: pandas, NumPy, SciPy, StatsModels and, for the GUI only, PyQt4.

The easiest way to install Python and the required packages is to install a free scientific Python distribution such as Anaconda or Enthought Canopy, which already bundles the core scientific Python packages, including SciPy, pandas, NumPy, StatsModels and PyQt4.

If you prefer to install all the packages yourself, try the following steps.

2.1 Install Python

If you use Windows and have not installed Python 2 yet, you can download the installer from here; the latest version is 2.7.11 (22 April 2016). Download the installer for your machine and install it like any other software.

Linux and Mac OS come with Python 2.7 pre-installed. Open the terminal and type python --version to see the version of Python on your machine. In case Python is not installed on your machine, you can download the installer for Mac and just click it to install it. Users of Ubuntu Linux simply type (untested):

    sudo apt-get install build-essential python2.7

Users of RedHat or RedHat-derived distros (Fedora, CentOS) type (untested):

    sudo yum groupinstall "Development tools"
    sudo yum install python27

2.2 Install Python Modules

If you have Python 2 >=2.7.9, you will already have pip. Open the terminal (or Windows command prompt) and type the following commands to install Python modules.

    sudo pip install pandas
    sudo pip install numpy
    sudo pip install git+http://github.com/scipy/scipy/
    sudo pip install statsmodels

Install PyQt4 (optional, for GUI only).

  • Windows OS: Binary installers of PyQt4 for Windows are available here.

  • Mac OS (untested):

    brew install pyqt

  • Ubuntu Linux (untested):

    sudo apt-get install python-qt4

  • CentOS and RPM-based Linux (untested):

    sudo yum install PyQt4

If you fail to install PyQt4, please follow this guide to install it.

2.3 Getting Started

The latest PyHLA is available here.

Alternatively, you can clone the repository with the command

    git clone https://github.com/felixfan/PyHLA.git

Once you have downloaded PyHLA, typing

    python PyHLA.py -h

will print a list of all command-line options.

To start the GUI, type

    python gPyHLA.py

3. Tutorials

3.1 Input

3.1.1 HLA Types File (--input)

The input file is a white-space (space or tab) delimited file. The first two columns are mandatory: Individual ID and Phenotype. Individual IDs are alphanumeric and should uniquely identify a person. The phenotype can be either a quantitative trait or an affection status. Affection status should be coded as 1 for unaffected and 2 for affected.

HLA types (column 3 onwards) should also be white-space delimited. Every gene must have two alleles specified. Alleles (see Nomenclature of HLA Alleles) do not all need to have the same resolution. However, if you want to test association at 4 digits, all alleles should have at least 4-digit resolution. A missing genotype is denoted as NA.

A header line is NOT needed. For example, here are two individuals typed for 6 genes (one row = one person):

    0001 2 A*02:07:01 A*11:01:01 B*51:01:01 B*51:01:01 C*14:02:01 C*14:02:01 DQA1*01:04:01 DQA1*01:04:01 DQB1*03:03:02 DQB1*05:02:01 DRB1*07:01:01 DRB1*14:54:01
    0002 1 A*24:02:01 A*33:03:01 B*15:25:01 B*58:01:01 C*03:02:02 C*04:03 NA NA DQB1*03:01:01 DQB1*03:01:01 DRB1*03:01:01 DRB1*12:02:01

There is one case and one control. The six genes are HLA-A, HLA-B, HLA-C, HLA-DQA1, HLA-DQB1 and HLA-DRB1. Each gene has two columns. Individual 0002 has no HLA types for HLA-DQA1 (two NA). All alleles have six-digit resolution, except that one HLA-C allele of individual 0002 has only four-digit resolution. This is fine as long as we only want to test association at two- or four-digit resolution.

Note: The allele names in the above example do not have the HLA prefix. Allele names with the HLA prefix can also be used as input, e.g. A*02:07:01 A*11:01:01 is the same as HLA-A*02:07:01 HLA-A*11:01:01. See the example files input0.txt and input1.txt for a case-control trait and a quantitative trait, respectively.
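The row layout described above can be read with a few lines of Python. This is only a minimal sketch, not PyHLA's own parser: the function name parse_hla_line and the returned structure are assumptions made for illustration, and the sample row is a shortened version of individual 0002.

```python
def parse_hla_line(line):
    """Split one row of the HLA types file into ID, phenotype and allele pairs.

    Hypothetical helper, not part of PyHLA.
    """
    fields = line.split()                      # any run of spaces or tabs
    if len(fields) < 4 or len(fields) % 2 != 0:
        raise ValueError("expected ID, phenotype and two alleles per gene")
    iid, phenotype = fields[0], fields[1]
    alleles = []
    for name in fields[2:]:
        if name.startswith("HLA-"):            # the HLA prefix is optional
            name = name[4:]
        alleles.append(None if name == "NA" else name)
    # Consecutive columns belong to the same gene.
    pairs = list(zip(alleles[0::2], alleles[1::2]))
    return iid, phenotype, pairs


row = "0002 1 A*24:02:01 A*33:03:01 C*03:02:02 C*04:03 NA NA"
iid, phenotype, pairs = parse_hla_line(row)
print(pairs)
```

Missing genotypes are returned as None here so that later steps can skip them; that choice is part of the sketch, not something the file format prescribes.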

3.1.2 Exclude Alleles File (--exclude)

Alleles to be excluded from analysis. One allele per line.

    A*01:01:02
    C*01:03

3.1.3 Covariates file (--covar)

The covariates file is a white-space (space or tab) delimited file. The first row is a header. Each subsequent row describes one individual: the first column is the individual ID (IID) and the remaining columns hold the trait measurements, one column per trait.

For example, here are two individuals with three traits:

    IID age sex bmi
    0001 28 1 20.70
    0002 23 0 16.29

Note: Trait names should not include any white-space. The order of individuals in the covariates file does not have to match the genotype input file, nor does the number of individuals. Only individuals present in both files are included in the analysis. See covar.txt for an example.
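The matching of individuals between the two files can be pictured as follows. This is a sketch under stated assumptions: read_covariates is a hypothetical helper, all covariate values are assumed to be numeric, and the genotyped IDs are made up to show that order and count may differ.

```python
def read_covariates(lines):
    """Parse covariate rows into {IID: {trait: value}} (hypothetical helper)."""
    traits = lines[0].split()[1:]              # header row without the IID column
    table = {}
    for line in lines[1:]:
        fields = line.split()
        if not fields:                         # skip blank lines
            continue
        values = [float(v) for v in fields[1:]]
        table[fields[0]] = dict(zip(traits, values))
    return table


covar_lines = ["IID age sex bmi", "0001 28 1 20.70", "0002 23 0 16.29"]
covariates = read_covariates(covar_lines)

# IDs from the genotype file: different order, and 0003 has no covariates.
genotyped_ids = ["0002", "0003", "0001"]
common_ids = [iid for iid in genotyped_ids if iid in covariates]
print(common_ids)
```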

3.2 Data Summary

Summary statistics for the data at three levels: gene level, allele level, and population level.

  • Gene level summary: if the sample size is n and there is no missing data, each gene appears 2n times.

  • Allele level summary: the number and frequency of each allele.

  • Population level summary: the number and frequency of individuals carrying each allele.
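The difference between the three levels is easiest to see on a toy example. The sketch below uses made-up genotypes for a single gene (not data from input0.txt) and is not PyHLA's implementation; it only illustrates that a homozygous individual contributes two copies at the allele level but counts once at the population level.

```python
from collections import Counter

# Toy data: one gene, three individuals, two alleles each.
genotypes = {
    "s1": ("A*02:01", "A*02:01"),
    "s2": ("A*02:01", "A*11:01"),
    "s3": ("A*11:01", "A*24:02"),
}

gene_count = 0
allele_count = Counter()      # allele level: every copy counts
carrier_count = Counter()     # population level: each individual counts once

for first, second in genotypes.values():
    for allele in (first, second):
        gene_count += 1
        allele_count[allele] += 1
    for allele in {first, second}:             # set removes the duplicate copy
        carrier_count[allele] += 1

n = len(genotypes)
allele_freq = {a: c / float(gene_count) for a, c in allele_count.items()}
carrier_freq = {a: c / float(n) for a, c in carrier_count.items()}
print(gene_count)
```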

3.2.1 Options

    --input input0.txt  [Mandatory]
    --summary [Mandatory]
    --digit 4 [Default]
    --out output.txt [Default]
    --print [Optional]

3.2.1.1 HLA Types File (--input)

See section 3.1.1.

3.2.1.2 Data Summary (--summary)

This option tells PyHLA to perform the data summary analysis.

3.2.1.3 Digits resolution (--digit)

The summary can be based on two, four or six digits. When two is used, alleles such as A*02:01 and A*02:06 are combined as A*02. Default value is 4.
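How alleles collapse at a given resolution can be sketched as below. The helper truncate_allele is hypothetical and assumes the colon-separated naming shown in this document, where each field counts as two digits; it is not taken from the PyHLA source.

```python
def truncate_allele(allele, digit=4):
    """Reduce an allele name to 2, 4 or 6 digits (hypothetical helper)."""
    if digit not in (2, 4, 6):
        raise ValueError("digit must be 2, 4 or 6")
    gene, fields = allele.split("*")
    keep = fields.split(":")[: digit // 2]     # each field counts as two digits
    return gene + "*" + ":".join(keep)


print(truncate_allele("A*02:01:01", 4))   # A*02:01
print(truncate_allele("A*02:06", 2))      # A*02
```

An allele typed at a lower resolution than requested is returned unchanged by this sketch; how PyHLA itself treats such alleles is not covered here.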

3.2.1.4 Output file name (--out)

Default value is output.txt.

3.2.1.5 Print output to screen (--print)

Specifying --print prints all results to the screen (results are still written to the output file).

3.2.2 Example

    python PyHLA.py --input example/input0.txt --summary --print

Output:

    Sample size: 2000
    Number of cases: 1158
    Number of controls: 842

    Gene level summary
    ------------------------------------------------------------------
    Gene CaseCount CtrlCount TotalCount
    A 2316 1684 4000
    B 2316 1684 4000
    C 2316 1684 4000
    DQA1 2316 1684 4000
    DQB1 2316 1684 4000
    DRB1 2316 1684 4000

    Allele level summary
    ------------------------------------------------------------------
    Allele CaseCount CtrlCount TotalCount CaseFreq CtrlFreq TotalFreq
    A*01:01 25 14 39 0.0108 0.0083 0.0097
    A*01:22N 5 4 9 0.0022 0.0024 0.0022
    A*01:81 37 22 59 0.0160 0.0131 0.0147
    A*02:01 158 98 256 0.0682 0.0582 0.0640
    A*02:03 109 85 194 0.0471 0.0505 0.0485
    ...(truncated)

    Population level summary
    ------------------------------------------------------------------
    Allele popCaseCount popCaseFreq popCtrlCount popCtrlFreq
    A*01:01 19 0.0164 10 0.0119
    A*01:22N 5 0.0043 4 0.0048
    A*01:81 37 0.0320 22 0.0261
    A*02:01 151 0.1304 96 0.1140
    A*02:03 108 0.0933 82 0.0974
    ...(truncated)

3.3 Allele Association Analysis

Methods for association analysis between HLA alleles and diseases.

3.3.1 Options

    --input input0.txt  [Mandatory]
    --assoc [Mandatory]
    --digit 4 [Default]
    --test fisher [Default]
    --model allelic [Default]
    --freq 0 [Default]
    --adjust FDR [Default]
    --out output.txt [Default]
    --print [Optional]
    --perm N [Optional]
    --seed S [Optional]
    --exclude EXCLUDE.txt [Optional]
    --covar COVAR.txt [Optional, for logistic and linear regression only]
    --covar-name COVARNAME [Optional, for logistic and linear regression only]

3.3.1.1 HLA Types File (--input)

See section 3.1.1.

3.3.1.2 Allele Association Analysis (--assoc)

This option tells PyHLA to perform the allele association analysis.

3.3.1.3 Digits resolution (--digit)

Association can be tested at two, four or six digits. When two is used, alleles such as A*02:01 and A*02:06 are combined as A*02. Default value is 4.

3.3.1.4 Methods for association test (--test)

    chisq     Pearson chi-squared test (for disease traits, 2 x 2 contingency table)
    fisher    Fisher's exact test (for disease traits, 2 x 2 contingency table)
    logistic  logistic regression (for disease traits)
    linear    linear regression (for quantitative traits)

Default value is fisher.

3.3.1.5 Genetic model to test (--model)

When the Pearson chi-squared test or Fisher’s exact test is used, three genetic models can be specified.

    allelic  compares one allele against all other alleles grouped together
    dom      compares individuals who carry the allele against individuals who do not
    rec      compares individuals homozygous for the allele against all other individuals

When linear or logistic regression is used, additive, dom and rec can be used. Assuming A*01:01 is the test allele, genotypes are coded as follows:

    Genotype            additive  recessive  dominant
    'A*01:01 A*01:01'   2         1          1
    'A*01:01 A*01:02'   1         0          1
    'A*01:02 A*01:03'   0         0          0

Default value is allelic.
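The coding table for regression models translates directly into code. The function below is a hypothetical helper written to mirror that table, not an excerpt from PyHLA.

```python
def code_genotype(genotype, test_allele, model="additive"):
    """Return the regression code for one genotype (hypothetical helper).

    genotype is a pair of allele names, e.g. ("A*01:01", "A*01:02").
    """
    copies = sum(1 for allele in genotype if allele == test_allele)
    if model == "additive":
        return copies                      # 0, 1 or 2 copies of the test allele
    if model == "dom":
        return 1 if copies >= 1 else 0     # carrier vs. non-carrier
    if model == "rec":
        return 1 if copies == 2 else 0     # homozygote vs. everyone else
    raise ValueError("model must be additive, dom or rec")


for genotype in [("A*01:01", "A*01:01"), ("A*01:01", "A*01:02"), ("A*01:02", "A*01:03")]:
    codes = [code_genotype(genotype, "A*01:01", m) for m in ("additive", "rec", "dom")]
    print(codes)
```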

3.3.1.6 Minimal allele/allele group frequency (--freq)

A value between 0 and 1. Only alleles/allele groups with a frequency higher than this threshold are included in the association analysis. Default value is 0.05. When --perm is specified, setting --freq to a value greater than 0 is recommended to reduce permutation time.

3.3.1.7 Adjustment for multiple testing (--adjust)

    Bonferroni  Bonferroni single-step adjusted p-values
    Holm        Holm (1979) step-down adjusted p-values
    FDR         Benjamini & Hochberg (1995) step-up FDR control
    FDR_BY      Benjamini & Yekutieli (2001) step-up FDR control
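For readers unfamiliar with these procedures, the sketch below shows the textbook form of three of them in plain Python. It is an illustration only: adjust_pvalues is a hypothetical function, FDR_BY is omitted for brevity, and PyHLA may well delegate this step to a statistics library instead.

```python
def adjust_pvalues(pvalues, method="FDR"):
    """Adjust p-values for multiple testing (textbook sketch, not PyHLA code)."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])   # indices, smallest p first
    adjusted = [0.0] * m
    if method == "Bonferroni":
        return [min(1.0, p * m) for p in pvalues]
    if method == "Holm":
        running = 0.0
        for rank, i in enumerate(order):                 # step down from the smallest p
            running = max(running, (m - rank) * pvalues[i])
            adjusted[i] = min(1.0, running)
        return adjusted
    if method == "FDR":
        running = 1.0
        for rank in range(m - 1, -1, -1):                # step up from the largest p
            i = order[rank]
            running = min(running, pvalues[i] * m / float(rank + 1))
            adjusted[i] = running
        return adjusted
    raise ValueError("method must be Bonferroni, Holm or FDR")


raw = [0.01, 0.04, 0.03, 0.005]
print(adjust_pvalues(raw, "FDR"))
```

The adjusted values are returned in the same order as the input, so they can be attached to the result rows as a P_adj column.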

3.3.1.8 Output file name (--out)

Default value is output.txt.

3.3.1.9 Print output to screen (--print)

Specifying --print prints all results to the screen (results are still written to the output file).

3.3.1.10 Permutation (--perm)

Number of permutations to perform.

For each permutation run, a simulated dataset is constructed from the original dataset by randomizing the assignment of phenotype status among individuals. The same individuals are used, maintaining the same LD structure and the original case/control ratio.

Only simulated datasets with the same common alleles between cases and controls as the original dataset are used, so assigning a value greater than zero to --freq can speed up the permutation.
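The label-shuffling idea can be sketched as follows. Several things here are assumptions rather than PyHLA behaviour: the function name, the use of the difference in carrier proportion as the test statistic (PyHLA permutes the statistic of the chosen test), and the (count + 1) / (N + 1) form of the empirical p-value. The filtering of simulated datasets described above is not reproduced.

```python
import random


def permutation_pvalue(phenotypes, carriers, n_perm=1000, seed=None):
    """Empirical p-value for a difference in carrier proportion (sketch).

    phenotypes: list of 1 (control) / 2 (case); carriers: list of 0/1 flags.
    """
    rng = random.Random(seed)          # seed=None lets Python pick its own seed

    def statistic(labels):
        cases = [c for c, y in zip(carriers, labels) if y == 2]
        ctrls = [c for c, y in zip(carriers, labels) if y == 1]
        return abs(sum(cases) / float(len(cases)) - sum(ctrls) / float(len(ctrls)))

    observed = statistic(phenotypes)
    labels = list(phenotypes)
    at_least_as_large = 0
    for _ in range(n_perm):
        rng.shuffle(labels)            # genotypes and case/control ratio stay fixed
        if statistic(labels) >= observed:
            at_least_as_large += 1
    return (at_least_as_large + 1.0) / (n_perm + 1.0)


toy_phenotypes = [2] * 10 + [1] * 10
toy_carriers = [1] * 10 + [0] * 10     # every case carries the allele, no control does
p_perm = permutation_pvalue(toy_phenotypes, toy_carriers, n_perm=1000, seed=1)
print(p_perm)
```

Passing the same seed twice gives the same p-value, which is the purpose of a --seed style option.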

3.3.1.11 Random seed (--seed)

Random seed for permutation. A number used to initialize the basic random number generator. By default, the current system time is used.

3.3.1.12 Exclude Alleles (--exclude)

Alleles to be excluded. One allele per line.

    A*01:01:02
    C*01:03

3.3.1.13 Covariates file (--covar)

One or more covariates can be included in linear and logistic regression.

The covariates file is a white-space (space or tab) delimited file. The first row is a header. Each subsequent row describes one individual: the first column is the individual ID (IID) and the remaining columns hold the trait measurements, one column per trait.

For example, here are two individuals with three traits:

    IID age sex bmi
    0001 28 1 20.70
    0002 23 0 16.29

Note: Trait names should not include any white-space.

Note: --covar only takes effect when --test linear or --test logistic is specified.

Note: The order of individuals in the covariates file does not have to match the genotype input file, nor does the number of individuals. Only individuals present in both files are included in the analysis.

3.3.1.14 Covariates name (--covar-name)

To select a particular subset of covariates, use the --covar-name covarnames option.

covarnames is a string of trait names (from the header row of the covariates file) joined with commas (,).

For example,

    --covar cov.txt # use all covariates in cov.txt
    --covar cov.txt --covar-name bmi # only use 'bmi'
    --covar cov.txt --covar-name age,bmi # use both 'age' and 'bmi'
    --covar cov.txt --covar-name age,sex,bmi # use all three covariates

Note: if --covar-name covarnames is not specified, all covariates in cov.txt are used.

3.3.2 Allele Association Analysis Examples

3.3.2.1 Output of Allele Association Analysis

The output contains several fields, depending on which options were used.

    Allele    Allele name
    Gene      Gene name
    A_case    Count of this allele in cases
    B_case    Count of other alleles in cases
    A_ctrl    Count of this allele in controls
    B_ctrl    Count of other alleles in controls
    F_case    Frequency of this allele in cases
    F_ctrl    Frequency of this allele in controls
    Freq      Frequency of this allele in cases and controls
    Chisq     Chi-square
    DF        Degrees of freedom
    P_Chisq   P-value for Pearson's chi-squared test
    P_FET     P-value for Fisher's exact test
    P_Logit   P-value for logistic regression
    P_Linear  P-value for linear regression
    OR        Odds ratio
    beta      Regression coefficient
    L95       Lower bound of 95% confidence interval for odds ratio or regression coefficient
    U95       Upper bound of 95% confidence interval for odds ratio or regression coefficient
    P_adj     Multiple testing adjusted p-value
    P_perm    P-value for permutation test
    PermN     Number of permutations with a statistic larger than that of the original data
    PermNA    Number of permutations with an NA statistic

3.3.2.2 Disease trait (Case/Control Study)

3.3.2.2.1 Fisher’s exact test and Pearson’s chi-squared test

Fisher’s exact test is the default option.

    python PyHLA.py --input example/input0.txt --assoc --digit 4 --freq 0.05 --adjust FDR
    python PyHLA.py --input example/input0.txt --assoc --digit 4 --freq 0.05 --adjust FDR --perm 10000

Pearson’s chi-squared test:

    python PyHLA.py --input example/input0.txt --assoc --digit 4 --freq 0.05 --adjust FDR --test chisq
    python PyHLA.py --input example/input0.txt --assoc --digit 4 --freq 0.05 --adjust FDR --test chisq --model dom
    python PyHLA.py --input example/input0.txt --assoc --digit 4 --freq 0.05 --adjust FDR --test chisq --model rec

For each allele, a 2 x 2 contingency table is created containing the count of this allele and the count of the other alleles of the same gene in cases and controls. The total number of tests is the number of alleles whose frequency in cases or controls is higher than the threshold specified by --freq.

The output includes: Allele, A_case, B_case, A_ctrl, B_ctrl, F_case, F_ctrl, Freq, OR, L95, U95, P_adj. The output of Pearson’s chi-squared test also includes: Chisq, DF, P_Chisq. The output of Fisher’s exact test also includes: P_FET. When --perm is used, P_perm, PermN and PermNA are added to the output.
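To make the allelic 2 x 2 table concrete, the sketch below builds it from the A*02:01 counts shown in the data summary example (158 of 2316 alleles in cases, 98 of 1684 in controls) and computes a Pearson chi-squared statistic. The function names are hypothetical, and the statistic is the uncorrected textbook formula with one degree of freedom; whether PyHLA applies a continuity correction is not stated here.

```python
import math


def allelic_table(allele_case, total_case, allele_ctrl, total_ctrl):
    """Return (A_case, B_case, A_ctrl, B_ctrl): test allele vs. all other alleles."""
    return (allele_case, total_case - allele_case,
            allele_ctrl, total_ctrl - allele_ctrl)


def pearson_chisq(a, b, c, d):
    """Pearson chi-squared statistic and p-value for a 2 x 2 table (1 df)."""
    n = a + b + c + d
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    if denom == 0:                              # an empty row or column
        return float("nan"), float("nan")
    chisq = n * (a * d - b * c) ** 2 / float(denom)
    p = math.erfc(math.sqrt(chisq / 2.0))       # upper tail of chi-squared, 1 df
    return chisq, p


a, b, c, d = allelic_table(158, 2316, 98, 1684)
chisq, p = pearson_chisq(a, b, c, d)
print((a, b, c, d))
```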

3.3.2.2.2 Logistic Regression

    python PyHLA.py --input example/input0.txt --assoc --digit 4 --freq 0.05 --adjust FDR --test logistic --model additive
    python PyHLA.py --input example/input0.txt --assoc --digit 4 --freq 0.05 --adjust FDR --test logistic --model additive --perm 10000
    python PyHLA.py --input example/input0.txt --assoc --digit 4 --freq 0.05 --adjust FDR --test logistic --model additive --covar example/covar.txt --covar-name age,bmi

The total number of tests is the number of alleles whose frequency in cases or controls is higher than the threshold specified by --freq.

The output includes: Allele, A_case, B_case, A_ctrl, B_ctrl, F_case, F_ctrl, Freq, L95, U95, P_adj, OR, and P_Logit. When --perm is used, P_perm, PermN and PermNA are added to the output.

3.3.2.3 Quantitative trait

3.3.2.3.1 Linear Regression

    python PyHLA.py --input example/input1.txt --assoc --digit 4 --freq 0.05 --adjust FDR --test linear --model additive
    python PyHLA.py --input example/input1.txt --assoc --digit 4 --freq 0.05 --adjust FDR --test linear --model additive --perm 10000
    python PyHLA.py --input example/input1.txt --assoc --digit 4 --freq 0.05 --adjust FDR --test linear --model additive --covar example/covar.txt --covar-name sex,age,bmi

The total number of tests is the number of alleles whose frequency is higher than the threshold specified by --freq.

The output includes: Allele, Freq, L95, U95, P_adj, beta, and P_Linear. When --perm is used, P_perm, PermN and PermNA are added to the output.

3.4 Amino Acid Alignment

For each gene, the amino acid sequences of all alleles are aligned together. Protein sequence alignments were downloaded from IMGT/HLA; Release 3.23.0 (2016-01-19) was used.

3.4.1 Options

    --input input0.txt  [Mandatory]
    --align [Mandatory]
    --out output.txt [Default]
    --print [Optional]
    --consensus [Optional]

3.4.1.1 HLA Types File (--input)

See section 3.1.1.

3.4.1.2 Amino Acid Alignment (--align)

This option tells PyHLA to perform amino acid alignment.

3.4.1.3 Output file name (--out)

Default value is output.txt.

3.4.1.4 Print output to screen (--print)

Specifying --print prints all results to the screen (results are still written to the output file).

3.4.1.5 Consensus Amino Acid Sequence (--consensus)

When low-resolution HLA typing is used in the input file, the program takes the consensus string of all possible high-resolution HLA typings, marking polymorphic amino acid positions as unknown. For example, when C*06:53, which cannot be found in the alignment file, is used as input, the consensus sequence of the two higher-resolution typings C*06:53:01 and C*06:53:02 is used (other alleles may well have more than two). If --consensus is not specified, the sequence of C*06:53:01 is used by default.
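The consensus idea can be illustrated on two short, made-up sequences. This is a sketch only: consensus_sequence is a hypothetical helper, the sequences are not real HLA proteins, and the use of "X" as the marker for an unknown position is an assumption.

```python
def consensus_sequence(sequences, unknown="X"):
    """Column-wise consensus of equal-length aligned sequences (sketch)."""
    if not sequences:
        raise ValueError("need at least one sequence")
    length = len(sequences[0])
    if any(len(seq) != length for seq in sequences):
        raise ValueError("aligned sequences must have equal length")
    # Keep a residue where all sequences agree, otherwise mark it unknown.
    return "".join(column[0] if len(set(column)) == 1 else unknown
                   for column in zip(*sequences))


print(consensus_sequence(["MRVMAPR", "MRVTAPR"]))   # MRVXAPR
```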

3.5 Amino Acid Association

If there is more than one amino acid at a position, a test is performed for each amino acid to check whether it is distributed differently between cases and controls.

3.5.1 Options

    --input input0.txt  [Mandatory]
    --assoc-AA [Mandatory]
    --test [Default]
    --out output.txt [Default]
    --print [Optional]
    --consensus [Optional]

3.5.1.1 HLA Types File (--input)

See section 3.1.1.

3.5.1.2 Amino Acid Association (--assoc-AA)

This option tells PyHLA to perform amino acid association analysis.

3.5.1.3 Methods for association test (--test)

Currently, only --test fisher and --test chisq are available for amino acid association analysis. See section 3.3.1.4 for details about these two tests.

3.5.1.4 Output file name (--out)

Default value is output.txt.

3.5.1.5 Print output to screen (--print)

Specifying --print prints all results to the screen (results are still written to the output file).

3.5.1.6 Consensus Amino Acid Sequence (--consensus)

See section 3.4.1.5.

3.5.2 Example of the Output

    python PyHLA.py --input example/input0.txt --assoc-AA --consensus

By default, Fisher’s exact test is used. Each ID contains three parts: gene, position and residue. A_case and B_case are the numbers of cases that carry and do not carry the residue at this position, respectively. A_ctrl and B_ctrl are the numbers of controls that carry and do not carry the residue at this position, respectively. P denotes the p-value of the test. OR is the odds ratio calculated with Haldane’s correction of Woolf’s method. ACR lists the alleles in which the residue is present.

    ID A_case B_case A_ctrl B_ctrl P OR ACR
    A_9_F 566 592 399 443 0.52589 1.06 A*01:01,A*01:22N,A*01:81,A*02:01,A*02:03,A*02:07,A*02:112,A*02:264,A*02:265,A*02:43N,A*03:01,A*32:01,A*34:08,A*36:01
    A_9_S 403 755 291 551 0.92425 1.01 A*23:01,A*24:02,A*24:03,A*24:07,A*24:20,A*24:59,A*30:01,A*30:04
    A_9_T 364 794 251 591 0.46171 1.08 A*29:01,A*31:01,A*33:03
    ...
    DRB1_13_C 19 1139 4 838 0.01809 3.19 DRB1*12:20
    DRB1_13_F 342 816 244 598 0.80365 1.03 DRB1*01:01,DRB1*01:02,DRB1*09:01,DRB1*09:05,DRB1*09:06,DRB1*09:09,DRB1*09:12,DRB1*09:15,DRB1*09:16,DRB1*10:01
    DRB1_13_G 438 720 327 515 0.67497 0.96 DRB1*08:02,DRB1*08:03,DRB1*08:09,DRB1*08:18,DRB1*12:01,DRB1*12:02,DRB1*12:15,DRB1*12:17,DRB1*12:18,DRB1*12:19,DRB1*12:21,DRB1*12:31N,DRB1*14:04
    DRB1_13_H 276 882 194 648 0.70858 1.04 DRB1*04:01,DRB1*04:03,DRB1*04:04,DRB1*04:05,DRB1*04:06,DRB1*04:08,DRB1*04:10,DRB1*04:23,DRB1*04:71
    DRB1_13_R 378 780 258 584 0.35570 1.10 DRB1*15:01,DRB1*15:02,DRB1*15:30,DRB1*15:58,DRB1*16:01,DRB1*16:02
    DRB1_13_S 510 648 388 454 0.38704 0.92 DRB1*03:01,DRB1*04:66,DRB1*11:01,DRB1*11:04,DRB1*11:06,DRB1*11:54,DRB1*13:01,DRB1*13:02,DRB1*13:12,DRB1*13:13,DRB1*13:19,DRB1*13:47,DRB1*14:03,DRB1*14:05,DRB1*14:54
    DRB1_13_Y 97 1061 75 767 0.68689 0.93 DRB1*07:01,DRB1*09:07
    ...
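The OR column can be reproduced from the four counts. The sketch below adds 0.5 to every cell, which is the usual form of Haldane's correction and matches the A_9_F and DRB1_13_C rows above; the function name is hypothetical, and whether PyHLA applies the correction to every table or only to tables containing a zero is an assumption.

```python
def haldane_odds_ratio(a_case, b_case, a_ctrl, b_ctrl):
    """Odds ratio with 0.5 added to every cell (Haldane's correction)."""
    return ((a_case + 0.5) * (b_ctrl + 0.5)) / ((b_case + 0.5) * (a_ctrl + 0.5))


print(round(haldane_odds_ratio(566, 592, 399, 443), 2))   # A_9_F row: 1.06
print(round(haldane_odds_ratio(19, 1139, 4, 838), 2))     # DRB1_13_C row: 3.19
```

The correction keeps the odds ratio finite when one of the cells is zero, which is common for rare residues.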

3.6 Zygosity Test

When an allele or residue is associated (p < 0.05) with the disease, three tests are performed to identify whether the homozygous or heterozygous state differentiates susceptibility to the disease.

Table 1. Homozygosity association

    category  class1  class2
    case      hom     absent
    control   hom     absent

Table 2. Heterozygosity association

    category  class1  class2
    case      het     absent
    control   het     absent

Table 3. Zygosity association

    category  class1  class2
    case      hom     het
    control   hom     het
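The three tables can be assembled by counting homozygous, heterozygous and absent individuals separately in cases and controls. The sketch below does this at the allele level for toy genotypes; zygosity_tables is a hypothetical helper, and the phenotype coding (1 = control, 2 = case) follows section 3.1.1.

```python
def zygosity_tables(genotypes, phenotypes, target):
    """Build the three 2 x 2 tables as (case1, case2, ctrl1, ctrl2) tuples."""
    counts = {2: {"hom": 0, "het": 0, "absent": 0},    # cases
              1: {"hom": 0, "het": 0, "absent": 0}}    # controls
    for (first, second), status in zip(genotypes, phenotypes):
        copies = int(first == target) + int(second == target)
        counts[status][("absent", "het", "hom")[copies]] += 1
    case, ctrl = counts[2], counts[1]
    return {
        "homozygosity": (case["hom"], case["absent"], ctrl["hom"], ctrl["absent"]),
        "heterozygosity": (case["het"], case["absent"], ctrl["het"], ctrl["absent"]),
        "zygosity": (case["hom"], case["het"], ctrl["hom"], ctrl["het"]),
    }


toy_genotypes = [("A", "A"), ("A", "B"), ("B", "B"), ("A", "A"), ("A", "C")]
toy_status = [2, 2, 2, 1, 1]
tables = zygosity_tables(toy_genotypes, toy_status, "A")
print(tables["zygosity"])
```

Each tuple can then be passed to a 2 x 2 test such as Fisher's exact test or the chi-squared test.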

3.6.1 Options

    --input input0.txt  [Mandatory]
    --zygosity [Mandatory]
    --test [Default]
    --level [Default]
    --out output.txt [Default]
    --print [Optional]
    --consensus [Optional, for residue level only]
    --digit [Default, for allele level only]
    --freq [Default, for allele level only]

3.6.1.1 HLA Types File (--input)

See section 3.1.1.

3.6.1.2 Zygosity test (--zygosity)

This option tells PyHLA to perform the zygosity test.

3.6.1.3 Methods for zygosity test (--test)

Currently, only --test fisher and --test chisq are available for the zygosity test. See section 3.3.1.4 for details about these two tests.

3.6.1.4 Level to test (--level)

Two levels are available: --level residue for the amino acid test and --level allele for the allele test. Default is --level residue.

3.6.1.5 Output file name (--out)

Default value is output.txt.

3.6.1.6 Print output to screen (--print)

Specifying --print prints all results to the screen (results are still written to the output file).

3.6.1.7 Consensus sequence (--consensus)

For residue level only. When low-resolution HLA typing is used in the input file, the program takes the consensus string of all possible high-resolution HLA typings, marking polymorphic amino acid positions as unknown. See section 3.4.1.5.

3.6.1.8 Digits resolution (--digit)

For allele level only. Association can be tested at two, four or six digits. When two is used, alleles such as A*02:01 and A*02:06 are combined as A*02. Default value is 4.

3.6.1.9 Minimal allele/allele group frequency (--freq)

For allele level only. A value between 0 and 1. Only alleles/allele groups with a frequency higher than this threshold are included in the association analysis. Default value is 0.05.

3.6.2 Examples

3.6.2.1 Residue level

    python PyHLA.py --input example/input0.txt --zygosity --consensus

By default, Fisher’s exact test is used. Each ID contains three parts: gene, position and residue. Hom_P, Het_P and Zyg_P are the p-values for testing homozygosity association, heterozygosity association and zygosity association, respectively. Hom_OR, Het_OR and Zyg_OR are the corresponding odds ratios. OR is the odds ratio calculated with Haldane’s correction of Woolf’s method.

    ID Hom_P Het_P Zyg_P Hom_OR Het_OR Zyg_OR
    A_57_P 1.0000 1.0000 0.0131 1.3833 0.0909 15.2161
    A_57_R 1.0000 1.0000 0.0131 0.0909 1.3833 0.0657
    B_45_T 0.0428 0.1309 0.0078 1.9192 0.8610 2.2291
    B_62_G 0.4933 0.0598 0.3149 1.8058 0.7937 2.2750
    B_65_R 0.4933 0.0598 0.3149 1.8058 0.7937 2.2750
    B_66_N 0.4933 0.0598 0.3149 1.8058 0.7937 2.2750
    B_67_M 0.4933 0.0598 0.3149 1.8058 0.7937 2.2750
    B_67_Y 0.2916 0.0191 0.9075 1.3092 1.2547 1.0434
    B_70_Q 0.2916 0.0191 0.9075 1.3092 1.2547 1.0434
    B_70_S 0.4933 0.0598 0.3149 1.8058 0.7937 2.2750
    B_74_D 0.3947 0.0229 0.8511 1.1900 1.2407 0.9591
    B_77_S 0.1520 0.1292 0.0149 0.8678 1.2379 0.7011
    B_80_I 0.0889 0.0302 0.0356 2.2191 0.7994 2.7760
    B_80_N 0.4483 0.0712 0.0240 0.9223 1.2692 0.7267
    B_82_R 0.4483 0.0712 0.0240 0.9223 1.2692 0.7267
    B_83_G 0.4483 0.0712 0.0240 0.9223 1.2692 0.7267
    B_152_V 0.0156 0.0037 0.3312 1.2808 1.4701 0.8712
    C_1_G 1.0000 0.0464 1.0000 4.3333 5.9962 0.7227
    C_165_E 1.0000 0.0464 1.0000 4.3333 5.9962 0.7227
    DQA1_25_Y 0.6999 0.0247 0.0360 0.9623 0.6726 1.4308
    DQB1_14_M 0.1584 0.0020 0.0096 0.8634 0.4410 1.9576
    DQB1_53_L 0.4933 0.0354 0.1100 0.9339 0.7434 1.2563
    DQB1_84_Q 0.4933 0.0354 0.1100 0.9339 0.7434 1.2563
    DQB1_85_L 0.4933 0.0354 0.1100 0.9339 0.7434 1.2563
    DQB1_86_E 0.4933 0.0354 0.1100 0.9339 0.7434 1.2563
    DQB1_87_L 0.4933 0.0354 0.1100 0.9339 0.7434 1.2563
    DQB1_89_T 0.4933 0.0354 0.1100 0.9339 0.7434 1.2563
    DQB1_90_T 0.4933 0.0354 0.1100 0.9339 0.7434 1.2563
    DQB1_116_I 1.0000 0.0117 1.0000 5.0000 6.9270 0.7218
    DQB1_125_S 1.0000 0.0117 1.0000 5.0000 6.9270 0.7218
    DQB1_126_H 1.0000 0.0117 1.0000 5.0000 6.9270 0.7218
    DQB1_133_Q 1.0000 0.0117 1.0000 5.0000 6.9270 0.7218
    DQB1_135_D 1.0000 0.0327 1.0000 2.0769 2.8857 0.7197
    DRB1_11_A 1.0000 0.0334 1.0000 0.3623 0.4906 0.7386
    DRB1_13_C 1.0000 0.0181 1.0000 0.2308 0.3136 0.7358
    DRB1_73_A 1.0000 0.0391 0.0421 0.9951 0.4925 2.0206

3.6.2.2 Allele level

    python PyHLA.py --input example/input0.txt --zygosity --level allele --freq 0.05

By default, Fisher’s exact test and 4-digit alleles are used.

    ID Hom_P Het_P Zyg_P Hom_OR Het_OR Zyg_OR
    B*58:01 0.4925 0.0830 0.3151 1.8212 0.8034 2.2669
    DQA1*02:01 0.1676 0.3188 0.0602 0.5688 1.1873 0.4791

3.7 Interaction Test

When an allele or residue is associated (p < 0.05) with the disease, tests for independence, difference in association, combined action, interaction and linkage disequilibrium (LD) are used to determine the strongest association.

Table 1. Number of individuals with/without (+/-) factor A and/or factor B.

    Factor A  Factor B  Number of Cases  Number of Controls
    +         +         x1               y1
    +         -         x2               y2
    -         +         x3               y3
    -         -         x4               y4

Table 2. Summary of the ten tests (2 x 2 tables).

    Comparison                          a      b      c      d      Test [Number]
    A vs. non-A                         x1+x2  x3+x4  y1+y2  y3+y4  [1] A associated?
    B vs. non-B                         x1+x3  x2+x4  y1+y3  y2+y4  [2] B associated?
    ++ vs. -+                           x1     x3     y1     y3     [3] A associated in B-positives?
    +- vs. --                           x2     x4     y2     y4     [4] A associated in B-negatives?
    ++ vs. +-                           x1     x2     y1     y2     [5] B associated in A-positives?
    -+ vs. --                           x3     x4     y3     y4     [6] B associated in A-negatives?
    +- vs. -+                           x2     x3     y2     y3     [7] Difference between A and B association?
    ++ vs. --                           x1     x4     y1     y4     [8] Combined A-B association?
    Association A and B in Cases        x1     x2     x3     x4     [9] Linkage disequilibrium in cases
    Association A and B in Controls     y1     y2     y3     y4     [10] Linkage disequilibrium in controls

The results are interpreted as follows:

  1. Both test 3 and test 4 are significant: A is associated with the disease independently of B.
  2. Both test 5 and test 6 are significant: B is associated with the disease independently of A.
  3. Both test 3 and test 5 are significant: A and B show interaction.
  4. Test 7 is significant: Difference between A and B is associated with the disease.
  5. Test 8 is significant: A and B have combined action.
  6. Test 9 is significant: A and B are in LD in cases.
  7. Test 10 is significant: A and B are in LD in controls.
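The ten 2 x 2 tables follow mechanically from the eight counts x1..x4 and y1..y4. The sketch below builds them exactly as listed in Table 2; the function name and the example counts are made up for illustration.

```python
def interaction_tables(x, y):
    """Return the ten 2 x 2 tables (a, b, c, d) keyed by test number.

    x = (x1, x2, x3, x4) are case counts and y = (y1, y2, y3, y4) are control
    counts for the factor combinations ++, +-, -+ and --.
    """
    x1, x2, x3, x4 = x
    y1, y2, y3, y4 = y
    return {
        1: (x1 + x2, x3 + x4, y1 + y2, y3 + y4),   # A vs. non-A
        2: (x1 + x3, x2 + x4, y1 + y3, y2 + y4),   # B vs. non-B
        3: (x1, x3, y1, y3),                       # A in B-positives
        4: (x2, x4, y2, y4),                       # A in B-negatives
        5: (x1, x2, y1, y2),                       # B in A-positives
        6: (x3, x4, y3, y4),                       # B in A-negatives
        7: (x2, x3, y2, y3),                       # difference between A and B
        8: (x1, x4, y1, y4),                       # combined A-B association
        9: (x1, x2, x3, x4),                       # LD in cases
        10: (y1, y2, y3, y4),                      # LD in controls
    }


tables10 = interaction_tables((10, 20, 30, 40), (5, 15, 25, 35))
print(tables10[1])
```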

3.7.1 Options

    --input input0.txt  [Mandatory]
    --interaction [Mandatory]
    --test [Default]
    --level [Default]
    --out output.txt [Default]
    --print [Optional]
    --consensus [Optional, for residue level only]
    --digit [Default, for allele level only]
    --freq [Default, for allele level only]

3.7.1.1 HLA Types File (--input)

See section 3.1.1.

3.7.1.2 Interaction test (--interaction)

This option tells PyHLA to perform the interaction test.

3.7.1.3 Test to be used (--test)

Only --test fisher and --test chisq can be used here. Default is --test fisher.

3.7.1.4 Level to test (--level)

Two levels are available: --level residue for the amino acid test and --level allele for the allele test. Default is --level residue.

3.7.1.5 Output file name (--out)

Default value is output.txt.

3.7.1.6 Print output to screen (--print)

Specifying --print prints all results to the screen (results are still written to the output file).

3.7.1.7 Consensus sequence (--consensus)

For residue level only. When low-resolution HLA typing is used in the input file, the program takes the consensus string of all possible high-resolution HLA typings, marking polymorphic amino acid positions as unknown. See section 3.4.1.5.

3.7.1.8 Digits resolution (--digit)

For allele level only. Association can be tested at two, four or six digits. When two is used, alleles such as A*02:01 and A*02:06 are combined as A*02. Default value is 4.

3.7.1.9 Minimal allele/allele group frequency (--freq)

For allele level only. A value between 0 and 1. Only alleles/allele groups with a frequency higher than this threshold are included in the association analysis. Default value is 0.05.

3.7.2 Examples

3.7.2.1 Residue level

    python PyHLA.py --input example/input0.txt --interaction --consensus

By default, Fisher’s exact test is used. Each ID contains three parts: gene, position and residue. OR is the odds ratio calculated with Haldane’s correction of Woolf’s method. P3-P10 and OR3-OR10 are the p-values and odds ratios for tests 3 to 10 listed in Table 2, respectively.

    ID1 ID2 P3 P4 P5 P6 P7 P8 P9 P10 OR3 OR4 OR5 OR6 OR7 OR8 OR9 OR10
    A_57_P B_45_T 0.3897 0.0365 0.0456 1.0000 0.4365 0.0234 1.0000 1.0000 4.71 11.64 1.21 3.00 3.88 14.14 0.58 1.44
    A_57_P B_62_G 1.0000 0.0148 0.0473 1.0000 1.0000 0.0076 1.0000 1.0000 1.69 14.61 1.27 11.00 1.33 18.59 0.23 1.97
    A_57_P B_65_R 1.0000 0.0148 0.0473 1.0000 1.0000 0.0076 1.0000 1.0000 1.69 14.61 1.27 11.00 1.33 18.59 0.23 1.97
    A_57_P B_66_N 1.0000 0.0148 0.0473 1.0000 1.0000 0.0076 1.0000 1.0000 1.69 14.61 1.27 11.00 1.33 18.59 0.23 1.97
    A_57_P B_67_M 1.0000 0.0148 0.0473 1.0000 1.0000 0.0076 1.0000 1.0000 1.69 14.61 1.27 11.00 1.33 18.59 0.23 1.97
    A_57_P B_67_Y 0.2025 0.0646 0.0330 1.0000 0.1607 0.0913 1.0000 1.0000 6.14 10.49 0.82 1.40 7.49 8.59 0.61 1.04
    A_57_P B_70_Q 0.2025 0.0646 0.0330 1.0000 0.1607 0.0913 1.0000 1.0000 6.14 10.49 0.82 1.40 7.49 8.59 0.61 1.04
    A_57_P B_70_S 1.0000 0.0148 0.0473 1.0000 1.0000 0.0076 1.0000 1.0000 1.69 14.61 1.27 11.00 1.33 18.59 0.23 1.97
    A_57_P B_74_D 0.1988 0.0636 0.0363 1.0000 0.1590 0.0888 1.0000 1.0000 6.24 10.59 0.82 1.40 7.56 8.73 0.78 1.33
    A_57_P B_77_S 0.0146 1.0000 0.0490 1.0000 0.0071 1.0000 1.0000 1.0000 14.67 1.74 0.77 0.09 19.13 1.33 5.57 0.66
    A_57_P B_80_I 1.0000 0.0162 0.0148 1.0000 1.0000 0.0080 1.0000 0.3334 1.65 14.16 1.28 11.00 1.29 18.12 0.46 3.98
    A_57_P B_80_N 0.0346 0.3667 0.0279 1.0000 0.0186 0.4307 1.0000 0.5405 11.91 5.22 0.76 0.33 15.65 3.97 4.53 1.98
    A_57_P B_82_R 0.0346 0.3667 0.0279 1.0000 0.0186 0.4307 1.0000 0.5405 11.91 5.22 0.76 0.33 15.65 3.97 4.53 1.98
    A_57_P B_83_G 0.0346 0.3667 0.0279 1.0000 0.0186 0.4307 1.0000 0.5405 11.91 5.22 0.76 0.33 15.65 3.97 4.53 1.98
    A_57_P B_152_V 0.0347 0.3631 0.0228 1.0000 0.0179 0.4312 1.0000 0.5309 11.89 5.30 0.75 0.33 15.89 3.96 4.59 2.04
    A_57_P C_1_G 1.0000 0.0129 0.0243 1.0000 1.0000 1.0000 1.0000 1.0000 0.23 15.31 0.17 11.00 1.39 2.54 0.00 0.09
    A_57_P C_165_E 1.0000 0.0129 0.0243 1.0000 1.0000 1.0000 1.0000 1.0000 0.23 15.31 0.17 11.00 1.39 2.54 0.00 0.09
    A_57_P DQA1_25_Y 0.0120 1.0000 0.0219 1.0000 0.0599 1.0000 1.0000 1.0000 15.72 0.98 1.46 0.09 10.74 1.43 12.88 0.80
    A_57_P DQB1_14_M 0.0123 1.0000 0.0046 1.0000 0.1487 1.0000 1.0000 1.0000 15.58 0.69 2.06 0.09 7.57 1.42 42.74 1.89
    A_57_P DQB1_53_L 0.0287 0.4765 0.0488 1.0000 0.0528 0.4110 1.0000 0.5746 12.91 3.32 1.30 0.33 9.96 4.30 6.97 1.79
    ...

3.7.2.2 Allele level

    python PyHLA.py --input example/input0.txt --interaction --level allele --freq 0.01

By default, Fisher’s exact test and 4-digit alleles are used.

    ID1 ID2 P3 P4 P5 P6 P7 P8 P9 P10 OR3 OR4 OR5 OR6 OR7 OR8 OR9 OR10
    A*11:77 B*35:01 1.0000 0.0793 1.0000 0.0595 0.7396 0.4127 0.6210 1.0000 0.39 0.67 0.35 0.59 1.13 0.23 0.44 0.76
    A*11:77 B*58:01 0.8144 0.0493 0.3127 0.0879 0.0072 1.0000 0.1053 0.6684 0.87 0.61 1.79 1.25 0.49 1.09 1.83 1.28
    A*11:77 DQA1*02:01 0.5562 0.1022 0.7598 0.1128 0.6857 0.2500 0.4217 0.3392 0.66 0.68 0.75 0.78 0.87 0.51 1.49 1.54
    A*11:77 DRB1*04:66 0.2609 0.1014 1.0000 0.0209 0.0033 0.4209 0.6364 0.5004 0.11 0.69 0.35 2.18 0.32 0.24 0.35 2.20
    B*35:01 DQA1*02:01 1.0000 0.0359 1.0000 0.0851 0.3555 0.6945 0.5018 1.0000 0.92 0.56 1.24 0.76 0.74 0.70 1.52 0.92
    B*35:01 DRB1*04:66 0.2609 0.0813 1.0000 0.0209 0.0027 0.4214 1.0000 0.3831 0.11 0.62 0.39 2.18 0.28 0.24 0.58 3.20
    B*58:01 DQA1*02:01 0.1916 0.1199 1.0000 0.0622 0.0091 0.6062 0.2718 1.0000 1.69 1.23 0.99 0.72 1.70 1.22 1.37 0.99
    B*58:01 DRB1*04:66 0.4124 0.0413 1.0000 0.0192 0.1095 1.0000 0.8205 0.3967 0.51 1.29 0.94 2.39 0.54 1.21 0.86 2.19
    DQA1*02:01 DRB1*04:66 0.1732 0.1198 1.0000 0.0144 0.0045 0.6998 1.0000 0.1343 0.30 0.79 0.92 2.39 0.33 0.72 1.17 3.05

4. License

This project is licensed under GNU GPL v2.

5. Citation

Yanhui Fan, You-Qiang Song. PyHLA: tests for association between HLA alleles and diseases. BMC Bioinformatics 2017, 18:90.

6. References

  • Sham PC, Curtis D: Monte Carlo tests for associations between disease and alleles at highly polymorphic loci. Ann Hum Genet 1995, 59:97-105.

  • Lancaster AK, Single RM, Solberg OD, Nelson MP, Thomson G: PyPop update – a software pipeline for large-scale multilocus population genomics. Tissue Antigens 2007, 69:192-197.

  • Kanterakis S, Magira E, Rosenman KD, Rossman M, Talsania K, Monos DS: SKDM human leukocyte antigen (HLA) tool: A comprehensive HLA and disease associations analysis software. Human Immunology 2008, 69(8):522-525.

  • El Galta R, Hsu L, Houwing-Duistermaat JJ: Methods to test for association between a disease and a multi-allelic marker applied to a candidate region. BMC Genetics 2005, 6:S101-S101.

  • Svejgaard A, Ryder LP: HLA and disease associations: Detecting the strongest association. Tissue Antigens 1994, 43(1):18-27.