# MSIsensor-RNA: Microsatellite instability detection using RNA sequencing data
MSIsensor-RNA is a member of the MSIsensor family for microsatellite instability (MSI) detection using RNA expression data, including microarray, RNA-seq, and single-cell RNA-seq (scRNA-seq) data. MSIsensor-RNA computes MSI status from the expression of MSI-associated genes. MSIsensor-RNA performs well in terms of AUC, sensitivity, specificity and robustness.
Peng Jia, Xuanhao Yang, Xiaofei Yang, Tingjie Wang, Yu Xu, Kai Ye, MSIsensor-RNA: Microsatellite Instability Detection for Bulk and Single-cell Gene Expression Data, Genomics, Proteomics & Bioinformatics, 2024, qzae004, https://doi.org/10.1093/gpbjnl/qzae004
MSIsensor-RNA is free for non-commercial use
by academic, government, and non-profit/not-for-profit institutions. A
commercial version of the software is available and licensed through
Xi’an Jiaotong University. For more information, please contact
Peng Jia (pengjia@stu.xjtu.edu.cn) or Kai Ye (kaiye@xjtu.edu.cn).
Microsatellite instability (MSI) is an indispensable biomarker in cancer therapy and prognosis,
particularly in immunotherapy. Our previous tools for MSI detection from
next-generation sequencing data, MSIsensor and MSIsensor-pro, are widely used in clinical
research projects. In particular, MSIsensor is the MSI scoring method chosen for the first
FDA-approved pan-cancer panel, MSK-IMPACT. However, most DNA-based methods,
including MSIsensor and MSIsensor-pro, evaluate MSI from genomic mutations, which are a
consequence of MSI, rather than from its direct cause, deficiency of the mismatch
repair (MMR) system. In addition, the need to select microsatellite sites and thresholds
for different populations, sequencing panels and cancer types impedes standardized
MSI detection in clinical practice. To solve these problems, we launched a new member of the
MSIsensor family, MSIsensor-RNA, a standalone tool for MSI detection with MMR-associated
genes from tumor RNA sequencing data. MSIsensor-RNA performs well in terms of
AUC, sensitivity, specificity and robustness. It also costs less in sequencing and
computation and, unlike NGS-based methods such as MSIsensor and MSIsensor-pro, does not
require selecting microsatellite sites and thresholds for different populations.
### Install from source
```shell script
conda create -n myenv "python>=3.6"
conda activate myenv
git clone https://github.com/xjtu-omics/msisensor-rna.git
cd msisensor-rna
pip3 install .
```
### Install with docker
```shell script
docker pull pengjia1110/msisensor-rna:latest
docker run -v /local/path:/docker/path pengjia1110/msisensor-rna:latest msisensor-rna
```
### Usage
```shell script
msisensor-rna
```
### Key Commands:
#### **genes**
* **Function**. Select informative genes for microsatellite instability detection.
* **Parameters**

        -h, --help
            show this help message and exit
        -i INPUT, --input INPUT
            The path of input file. e.g. xxx.csv [required]
        -o OUTPUT, --output OUTPUT
            The output file of gene information. e.g. xxx.csv [required]
        -thresh_t THREADS, --threads THREADS
            The threads used to run this program. [default=4]
        -thresh_cov THRESH_COV, --thresh_cov THRESH_COV
            Threshold for coefficient of variation of gene expression value of all samples (Mean/Std). [default=0.5]
        -thresh_p THRESH_P_RANKSUM, --thresh_p_ranksum THRESH_P_RANKSUM
            Threshold for Pvalue of rank sum test between MSI-H and MSS samples. [default=0.01]
        -thresh_auc THRESH_AUCSCORE, --thresh_AUCscore THRESH_AUCSCORE
            Threshold for AUC score: AUC score was calculating by the sklearn package. [default=0.65]
        -p POSITIVE_NUM, --positive_num POSITIVE_NUM
            The minimum positive sample of MSI for training. [default = 10]
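The three thresholds of the genes command refer to per-gene statistics: a coefficient of variation, a rank-sum test p-value between MSI-H and MSS samples, and an AUC score. The following is a minimal standard-library sketch of how such statistics can be computed for one gene. It is an illustration, not the MSIsensor-RNA implementation: the function names are hypothetical, the coefficient of variation is taken as standard deviation over mean, the p-value uses a normal approximation without tie correction, and how the tool compares each statistic with its threshold is not specified here.

```python
import math
import statistics


def average_ranks(values):
    """Return 1-based ranks, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # mean of the 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def gene_statistics(msi_h, mss):
    """Summarise one gene's expression in MSI-H versus MSS samples (hypothetical helper)."""
    values = list(msi_h) + list(mss)
    n1, n2 = len(msi_h), len(mss)
    # Coefficient of variation over all samples (assumed here: std / |mean|).
    mean = statistics.mean(values)
    cov = statistics.pstdev(values) / abs(mean) if mean else math.inf
    # Mann-Whitney U statistic of the MSI-H group, from the pooled ranks.
    ranks = average_ranks(values)
    u = sum(ranks[:n1]) - n1 * (n1 + 1) / 2
    # AUC of the gene used as a single-feature score with MSI-H as positive class.
    auc = u / (n1 * n2)
    # Two-sided rank-sum p-value, normal approximation without tie correction.
    z = (u - n1 * n2 / 2) / math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12)
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return {"cov": cov, "p_ranksum": p_value, "auc": auc}


# Made-up expression values for one gene.
print(gene_statistics([0.2, 0.4, 0.3], [5.0, 6.0, 5.5, 4.8]))
```

A gene that is expressed at a lower level in MSI-H samples yields an AUC below 0.5 in this sketch, so `max(auc, 1 - auc)` would be the direction-independent value.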
#### **train**
* **Function**. Train a model for microsatellite instability detection.
* **Parameters**

        -h, --help
            show this help message and exit
        -i INPUT, --input INPUT
            The path of input file. [required]
        -m MODEL, --model MODEL
            The trained model of the input file. [required]
        -t CANCER_TYPE, --cancer_type CANCER_TYPE
            The cancer type for this training. e.g. CRC, STAD, PanCancer etc.
        -c {SVM,RandomForest,LogisticRegression,MLPClassifier,GaussianNB,AdaBoostClassifier}, --classifier {SVM,RandomForest,LogisticRegression,MLPClassifier,GaussianNB,AdaBoostClassifier}
            The machine learning classifier for MSI detection. [default = RandomForest]
        -di INPUT_DESCRIPTION, --input_description INPUT_DESCRIPTION
            The description of the input file. [default = None]
        -dm MODEL_DESCRIPTION, --model_description MODEL_DESCRIPTION
            Description for this trained model.
        -p POSITIVE_NUM, --positive_num POSITIVE_NUM
            The minimum positive sample of MSI for training. [default = 10]
        -a AUTHOR, --author AUTHOR
            The author who trained the model. [default = None]
        -e EMAIL, --email EMAIL
            The email of the author. [default = None]
#### **show**
* **Function**. Show the information of the model and add more details.
* **Parameters**

        -h, --help
            show this help message and exit
        -m MODEL, --model MODEL
            The trained model path. [required]
        -t CANCER_TYPE, --cancer_type CANCER_TYPE
            Rename the cancer type. e.g. CRC, STAD, PanCancer etc. [default = None]
        -di INPUT_DESCRIPTION, --input_description INPUT_DESCRIPTION
            Add description for the input file. [default = None]
        -dm MODEL_DESCRIPTION, --model_description MODEL_DESCRIPTION
            Add description for this trained model. [default = None]
        -g GENE_LIST, --gene_list GENE_LIST
            The path for the genes must be included for this model. [default = None]
#### **detection**
* **Function**. Microsatellite instability detection.
* **Parameters**

        -h, --help
            show this help message and exit
        -i INPUT, --input INPUT
            The path of input file. [required]
        -o OUTPUT, --output OUTPUT
            The path of output file prefix. [required]
        -m MODEL, --model MODEL
            The path of the microsatellite regions. [required]
        -d RUN_DIRECTLY, --run_directly RUN_DIRECTLY
            Run the program directly without any Confirm. [default = False]
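The options listed above can be combined into command lines for scripted runs. The sketch below only assembles argument lists from the documented flags; it assumes that each command is invoked as `msisensor-rna <command>`, and the helper names and file names (`train.csv`, `model.pkl`, `expr.csv`, `out/sample`) are hypothetical placeholders.

```python
import shlex

# Classifier names accepted by -c, as listed in the train parameters.
CLASSIFIERS = {
    "SVM", "RandomForest", "LogisticRegression",
    "MLPClassifier", "GaussianNB", "AdaBoostClassifier",
}


def train_command(input_csv, model_path, cancer_type=None, classifier="RandomForest"):
    """Build the argument list for a train run (hypothetical helper)."""
    if classifier not in CLASSIFIERS:
        raise ValueError(f"unknown classifier: {classifier}")
    cmd = ["msisensor-rna", "train", "-i", input_csv, "-m", model_path, "-c", classifier]
    if cancer_type:
        cmd += ["-t", cancer_type]
    return cmd


def detection_command(input_csv, output_prefix, model_path, run_directly=False):
    """Build the argument list for a detection run (hypothetical helper)."""
    cmd = ["msisensor-rna", "detection", "-i", input_csv, "-o", output_prefix, "-m", model_path]
    if run_directly:
        cmd += ["-d", "True"]  # skip the interactive Yes/No confirmation
    return cmd


print(shlex.join(train_command("train.csv", "model.pkl", cancer_type="CRC")))
print(shlex.join(detection_command("expr.csv", "out/sample", "model.pkl", run_directly=True)))
```

The resulting lists can be passed to `subprocess.run(cmd, check=True)` once the tool is installed.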
#### The input file for informative gene selection and model training (-i option in the train command)
Prepare your training file in comma-separated (csv) format.
The first column should be the sample ID, the second column the MSI status,
and the third and subsequent columns the gene expression values. We recommend
providing normalized expression values (for example, z-score normalization of log2(FPKM+1)).
The following is an example:

| SampleID | msi | MLH1 | LINC01006 | … | NHLRC1 |
| --- | --- | --- | --- | --- | --- |
| NA0001 | MSI-H | 0.209 | 1.209 | … | 0.393 |
| CA0002 | MSS | 5.690 | 0.620 | … | 4.902 |
| … | … | … | … | … | … |
| CA100 | MSS | 9.960 | 0.920 | … | 5.002 |
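A training file in this layout can be produced with a few lines of Python. The sketch below is one possible way to do it, assuming that the z-score is taken per gene across all samples after a log2(FPKM+1) transform; the function names, sample IDs and FPKM values are made up for illustration.

```python
import csv
import io
import math
import statistics


def zscore_log2(fpkm_values):
    """z-score of log2(FPKM + 1) for one gene across all samples."""
    logged = [math.log2(v + 1) for v in fpkm_values]
    mean = statistics.mean(logged)
    std = statistics.pstdev(logged)
    if std == 0:
        return [0.0 for _ in logged]  # constant gene: no variation to scale
    return [(v - mean) / std for v in logged]


def write_training_csv(handle, sample_ids, msi_status, fpkm_by_gene):
    """Write SampleID, msi and one normalized column per gene."""
    genes = list(fpkm_by_gene)
    normalized = {g: zscore_log2(fpkm_by_gene[g]) for g in genes}
    writer = csv.writer(handle)
    writer.writerow(["SampleID", "msi"] + genes)
    for i, (sample_id, status) in enumerate(zip(sample_ids, msi_status)):
        writer.writerow([sample_id, status] + [f"{normalized[g][i]:.3f}" for g in genes])


buffer = io.StringIO()
write_training_csv(
    buffer,
    ["S1", "S2", "S3"],
    ["MSI-H", "MSS", "MSS"],
    {"MLH1": [0.0, 1.0, 3.0], "NHLRC1": [7.0, 3.0, 1.0]},
)
print(buffer.getvalue())
```

When writing to a real file, open it with `open(path, "w", newline="")` so that the `csv` module controls the line endings.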
#### The trained model (-m option in the train, show and detection commands)
The trained model is saved as a pickle file. In the train command, we recommend adding
descriptions with `-di`, `-dm`, `-a` and `-e` so that others who use the model can learn more about it.
In the show command, you can view the information of your model and change some descriptions with `-di` and `-dm`.
You can also use the `-g` option to write the list of genes the model needs to a file.
In the detection command, you must check the model and enter Yes or No to continue with the prediction step;
use `-d True` to skip this prompt.
#### The input file for the detection command (-i option in the detection command)
Prepare your input file for MSI prediction in comma-separated (csv) format.
The first column should be the sample ID, and the second and subsequent columns the gene expression values.
The gene names must include all genes in the model (use the `-g` option of the show command to see the
gene list of the model).
The following is an example:

| SampleID | MLH1 | LINC01006 | … | NHLRC1 |
| --- | --- | --- | --- | --- |
| NA0001 | 0.209 | 1.209 | … | 0.393 |
| CA0002 | 5.690 | 0.620 | … | 4.902 |
| … | … | … | … | … |
| CA100 | 9.960 | 0.920 | … | 5.002 |
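Because the input must contain every gene the model uses, it can help to check the header before running a prediction. The following sketch compares the CSV header with a list of required genes. The helper name is hypothetical, and `required_genes` is assumed to be a Python list you have built yourself, for example from the gene list written by the show command, whose file format is not described here.

```python
import csv
import io


def missing_genes(csv_handle, required_genes):
    """Return the required genes that are absent from the CSV header."""
    header = next(csv.reader(csv_handle))
    present = set(header[1:])  # the first column holds the sample ID
    return [gene for gene in required_genes if gene not in present]


# Made-up input with two gene columns.
example = io.StringIO("SampleID,MLH1,LINC01006\nNA0001,0.209,1.209\n")
print(missing_genes(example, ["MLH1", "LINC01006", "NHLRC1"]))
```

An empty result means that all required genes are present as columns.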
If you have any questions, please contact Peng Jia (pengjia@stu.xjtu.edu.cn) or Kai Ye (kaiye@xjtu.edu.cn).