A Python-based Effective Feature Generation Tool from DNA, RNA, and Protein Sequences
We can directly download by clicking the link.
Note: The package will download in zip format (.zip)
named PyFeat-master.zip
.
or,
Cloning a repository syncs it to our local machine (Example for Linux-based OS). After clone, we can add and edit files and then push and pull updates.
user@machine:~$ git clone https://github.com/mrzResearchArena/PyFeat.git
user@machine:~$ git clone git@github.com:mrzResearchArena/PyFeat.git
Note #1: If the clone was successful, a new sub-directory appears on our local drive. This directory has the same name (PyFeat) as the GitHub
repository that we cloned.
Note #2: We can run any Linux-based command from any valid location or path, but by default, a command generally runs from /home/user/
.
Note #2.1: user
is the name of our computer but your computer name can be different (Example: /home/bioinformatics/
).
Major (Generate Features):
Minor (Performance Measures):
Using PIP:
pip install <package name>
user@machine:~$ pip install scikit-learn
or,
Using anaconda environment:
conda install <package name>
user@machine:~$ conda install scikit-learn
Run command on your console or terminal.
user@machine:~$ python main.py --sequenceType=DNA --fullDataset=1 --optimumDataset=1 --fasta=/home/user/PyFeat/Datasets/DNA/FASTA.txt --label=/home/user/PyFeat/Datasets/DNA/Labels.txt --kTuple=3 --kGap=5 --pseudoKNC=1 --zCurve=1 --gcContent=1 --cumulativeSkew=1 --atgcRatio=1 --monoMono=1 --monoDi=1 --monoTri=1 --diMono=1 --diDi=1 --diTri=1 --triMono=1 --triDi=1
or,
user@machine:~$ python main.py -seq=DNA -full=1 -optimum=1 -fa=/home/user/PyFeat/Datasets/DNA/FASTA.txt -la=/home/user/PyFeat/Datasets/DNA/Label.txt -ktuple=3 -kgap=5 -pseudo=1 -zcurve=1 -gc=1 -skew=1 -atgc=1 -f11=1 -f12=1 -f13=1 -f21=1 -f22=1 -f23=1 -f31=1 -f32=1
user@machine:~$ python main.py --sequenceType=Protein --testDataset=1 --fasta=/home/user/PyFeat/Datasets/Protein/independentFASTA.txt --label=/home/user/PyFeat/Datasets/Protein/independentLabel.txt --kTuple=3 --kGap=5 --pseudoKNC=1 --zCurve=1 --gcContent=1 --cumulativeSkew=1 --atgcRatio=1 --monoMono=1 --monoDi=1 --monoTri=1 --diMono=1 --diDi=1 --diTri=1 --triMono=1 --triDi=1
or,
user@machine:~$ python main.py -seq=Protein -test=1 -fa=/home/user/PyFeat/Datasets/Protein/independentFASTA.txt -la=/home/user/PyFeat/Datasets/Protein/independentLabel.txt -ktuple=3 -kgap=5 -pseudo=1 -zcurve=1 -gc=1 -skew=1 -atgc=1 -f11=1 -f12=1 -f13=1 -f21=1 -f22=1 -f23=1 -f31=1 -f32=1
[ Comment: The = sign
is optional. ]
Note #1: It will generate a full dataset named fullDataset.csv (if -full=1 or,
—fullDataset==1)
Note #2: It will generate a selected features dataset named optimumDataset.csv (if -optimum=1 or,
—optimumDataset==1), and It will also track the selected features index.
Note #3: It will generate a full dataset named testDataset.csv (if -test=1 or,
—testDataset==1) [ Especially for the independent (testing) dataset purpose. ] [ 3.1.2. ]
Note #4: The process will run smoothly for valid FASTA sequences and row-wise class label.
Argument | Corresponding Optional Argument | Type | Default | Help |
---|---|---|---|---|
—sequenceType | -seq | string | —sequenceType=DNA | We can use DNA, RNA, and protein or prot as option; Case is not sensitive. |
—fasta | -fa | string | Enter a UNIX-like path; Example: /home/user/FASTA.txt | |
—label | -la | string | Enter a UNIX-like path; Example: /home/user/Label.txt | |
—kGap | -kgap | integer | —kGap=5 | The number of gaps ranging from 1 to 5 inclusive; Example: -kGap=5 |
—kTuple | -ktuple | integer | —kTuple=3 | The number of nucleotides ranging from 1 to 3 inclusive; Example: -kTuple=3 |
—fullDataset | -full | integer | —fullDataset=0 | Set —fullDataset=1, if we don’t want to save full dataset. |
—testDataset | -test | integer | —testDataset=0 | Set —testDataset=1, if we don’t want to save test dataset. |
—optimumDataset | -optimum | integer | —optimumDataset=0 | Set —optimumDataset=1, if we don’t want to save optimum dataset. |
—pseudoKNC | -pseudo | integer | —pseudoKNC=0 | Set —pseudoKNC=1, if we want to generate features. |
—zCurve | -zcurve | integer | —zCurve=0 | Set —zCurve=1, if we want to generate features. |
—gcContent | -gc | integer | —gcContent=0 | Set —gcContent=1, if we want to generate features. |
—cumulativeSkew | -skew | integer | —cumulativeSkew=0 | Set —cumulativeSkew=1, if we want to generate features. |
—atgcRatio | -atgc | integer | —atgcRatio=0 | Set —atgcRatio=1, if we want to generate features. |
—monoMono | -f11 | integer | —monoMono=0 | Set —monoMono=1, if we want to generate features. |
—monoDi | -f12 | integer | —monoDi=0 | Set —monoDi=1, if we want to generate features. |
—monoTri | -f13 | integer | —monoTri=0 | Set —monoTri=1, if we want to generate features. |
—diMono | -f21 | integer | —diMono=0 | Set —diMono=1, if we want to generate features. |
—diDi | -f22 | integer | —diDi=0 | Set —diDi=1, if we want to generate features. |
—diTri | -f23 | integer | —diTri=0 | Set —diTri=1, if we want to generate features. |
—triMono | -f31 | integer | —triMono=0 | Set —triMono=1, if we want to generate features. |
—triDi | -f32 | integer | —triDi=0 | Set —triDi=1, if we want to generate features. |
Feature Name | Feature Structure / Formula | Number of Features | Applicable |
---|---|---|---|
zCurve | x_axis = (A+G)-(C+T); y_axis = (A+C)-(G+T); z_axis = (A+T)-(G+C) | 3 features for DNA/RNA | DNA, RNA |
gcContent | ( (G+C)/(A+C+G+T) ) x 100 % | 1 features for DNA/RNA | DNA, RNA |
atgcRatio | (A+T)/(G+C) | 1 features for DNA/RNA | DNA, RNA |
cumulativeSkew | gcSkew=(G-C)/(G+C); atSkew=(A-T)/(A+T) | 2 features for DNA/RNA | DNA, RNA |
pseudoKNC | X, XX, XXX | when —kTuple=3, 84 features for DNA/RNA and 8,420 features for protein | DNA, RNA, Protein |
monoMonoKGap | X_X | when —kGap=1, 16 features for DNA/RNA and 400 features for protein | DNA, RNA, Protein |
monoDiKGap | X_XX | when —kGap=1, 64 features for DNA/RNA and 8,000 features for protein | DNA, RNA, Protein |
monoTriKGap | X_XXX | when —kGap=1, 256 features for DNA/RNA and 160,000 features for protein | DNA, RNA, Protein |
diMonoKGap | XX_X | when —kGap=1, 64 features for DNA/RNA and 8,000 features for protein | DNA, RNA, Protein |
diDiKGap | XX_XX | when —kGap=1, 256 features for DNA/RNA and 160,000 features for protein | DNA, RNA, Protein |
diTriKGap | XX_XXX | when —kGap=1, 1024 features for DNA/RNA and 3,200,000 features for protein | DNA, RNA, Protein |
triMonoKGap | XXX_X | when —kGap=1, 256 features for DNA/RNA and 160,000 features for protein | DNA, RNA, Protein |
triDiKGap | XXX_XX | when —kGap=1, 1024 features for DNA/RNA and 3,200,000 features for protein | DNA, RNA, Protein |
Note: When sequence becomes DNA, RNA, and Protein then X = {A,C,G,T}, X = {A,C,G,U}, and
X = {A,C,D,E,F,G,H,I,K,L,M,N,P,Q,R,S,T,V,W,Y} respectively.
Arguments Details for the Features Generation
and Feature Description
are provided in Tables 1 and Table 2, respectively.
user@machine:~$ python runClassifiers.py --nFCV=10 --dataset=optimumDataset.csv --auROC=1 --boxPlot=1
Note #1: It will provide classification results (evaluationResults.txt) from the user provides binary class dataset (.csv format).
Note #2: Generate a ROC Curve (auROC.png).
Note #3: Generate an accuracy comparison via boxPlot (AccuracyBoxPlot.png).
Argument | Corresponding Optional Argument | Type | Default | Help |
---|---|---|---|---|
—nFCV | -cv | integer | —nFCV=10 | How many numbers of cross-validation? |
—dataset | -data | string | —dataset=optimumDataset.csv | Enter a UNIX-like path for a .csv file; Example: /home/User/dataset.csv |
—auROC | -roc | integer | —auROC=1 | Set —auROC=0, if we didn’t want to generate the ROC Curve. |
—boxPlot | -box | integer | —boxPlot=1 | Set —boxPlot=0, if we didn’t want to generate the accuracy box-plot. |
user@machine:~$ python trainModel.py --dataset=optimumDataset.csv --model=LR
Note #1: It will provide a dumpModel.pkl from the user provides binary class dataset (.csv format).
Argument | Corresponding Optional Argument | Type | Default | Help |
---|---|---|---|---|
—dataset | -data | string | —dataset=optimumDataset.csv | Enter a UNIX-like path for a .csv file; Example: /home/User/dataset.csv |
—model | -m | string | —model=LR | We can use LR, SVM, KNN, DT, SVM, NB, Bagging, RF, AB, GB, and LDA as an option; All options are case sensitive. |
—K | -k | integer | —K=5 | Only for the KNN classifier; Number of neighbor |
Note: LR, SVM, KNN, DT, NB, Bagging, RF, AB, GB, and LDA represents Logistics Regression, Support Vector Machine, k-Nearest Neighbor, Decision Tree, Naive Bayes, Bagging, Random Forest, AdaBoost, Gradient Boosting, Linear Discriminant Analysis classifier respectively.
user@machine:~$ python evaluateModel.py --optimumDatasetPath=optimumDataset.csv --testDatasetPath=testDataset.csv
Note #1: Here, optimumDataset.csv, and testDataset.csv using as a traing dataset and test dataset respectively.
Argument | Corresponding Optional Argument | Type | Default | Help |
---|---|---|---|---|
—optimumDatasetPath | -optimumPath | string | —optimumDatasetPath=optimumDataset.csv | Enter a UNIX-like path for a .csv file; Example: /home/User/dataset.csv |
—testDatasetPath | -testPath | string | —testDatasetPath=testDataset.csv | Enter a UNIX-like path for a .csv file; Example: /home/User/dataset.csv |
[1] Bin Liu, Fule Liu, Longyun Fang, Xiaolong Wang, and Kuo-Chen Chou. repdna: a
python package to generate various modes of feature vectors for dna sequences by in-
corporating user-defined physicochemical properties and sequence-order effects. Bioin-
formatics, 31(8):1307–1309, 2014.
[2] Dong-Sheng Cao, Qing-Song Xu, and Yi-Zeng Liang. propy: a tool to generate various
modes of chous pseaac. Bioinformatics, 29(7):960–962, 2013.
[3] Zhen Chen, Pei Zhao, Fuyi Li, André Leier, Tatiana T Marquez-Lago, Yanan Wang,
Geoffrey I Webb, A Ian Smith, Roger J Daly, Kuo-Chen Chou, et al. ifeature: a python
package and web server for features extraction and selection from protein and peptide
sequences. Bioinformatics, 1:4, 2018.
[4] Md Rafsan Jani, Md Toha Khan Mozlish, Sajid Ahmed, Dewan Md Farid, and Swakkhar
Shatabda. irecspot-ef: Effective sequence based features for recombination hotspot
prediction. Computers in biology and medicine, 2018.
[5] Bin Liu, Fule Liu, Longyun Fang, Xiaolong Wang, and Kuo-Chen Chou. reprna: a web
server for generating various feature vectors of rna sequences. Molecular Genetics and
Genomics, 291(1):473–481, 2016.
[6] Nan Xiao, Dong-Sheng Cao, Min-Feng Zhu, and Qing-Song Xu. protr/protrweb: R
package and web server for generating various numerical representation schemes of pro-
tein sequences. Bioinformatics, 31(11):1857–1859, 2015.
[7] Bin Liu. Bioseq-analysis: a platform for dna, rna and protein sequence analysis based
on machine learning approaches. Briefings in bioinformatics, 2017.
[8] Bin Liu, Hao Wu, Deyuan Zhang, Xiaolong Wang, and Kuo-Chen Chou. Pse-analysis:
a python package for dna/rna and protein/peptide sequence analysis based on pseudo
components and kernel methods. Oncotarget, 8(8):13338, 2017.
[9] Bin Liu, Fule Liu, Xiaolong Wang, Junjie Chen, Longyun Fang, and Kuo-Chen Chou.
Pse-in-one: a web server for generating various modes of pseudo components of dna,
rna, and protein sequences. Nucleic acids research, 43(W1):W65–W71, 2015.
[10] Fabian Pedregosa, Gaël Varoquaux, Alexandre Gramfort, Vincent Michel, Bertrand Thirion, Olivier
Grisel, Mathieu Blondel, Peter Prettenhofer, Ron Weiss, Vincent Dubourg, et al. Scikit-learn: Machine
learning in python. Journal of machine learning research, 12(Oct):2825–2830, 2011.
[11] Eamonn Keogh and Abdullah Mueen. Curse of dimensionality. In Encyclopedia of
Machine Learning and Data Mining, pages 314–315. Springer, 2017.
[12] Ruihu Wang. Adaboost for feature selection, classification and its relation with svm, a
review. Physics Procedia, 25:800–807, 2012.
[13] Hao Lin, En-Ze Deng, Hui Ding, Wei Chen, and Kuo-Chen Chou. ipro54-pseknc: a
sequence-based predictor for identifying sigma-54 promoters in prokaryote with pseudo
k-tuple nucleotide composition. Nucleic acids research, 42(21):12961–12972, 2014.
[14] Wei Chen, Pengmian Feng, Hui Ding, and Hao Lin. Pai: Predicting adenosine to inosine
editing sites by using pseudo nucleotide compositions. Scientific reports, 6:35123, 2016.
[15] Hao Lin, Zhi-Yong Liang, Hua Tang, and Wei Chen. Identifying sigma70 promoters
with novel pseudo nucleotide composition. IEEE/ACM transactions on computational
biology and bioinformatics, 2017.
[16] Mahmoud Ghandi, Dongwon Lee, Morteza Mohammad-Noori, and Michael A Beer.
Enhanced regulatory sequence prediction using gapped k-mer features. PLoS computa-
tional biology, 10(7):e1003711, 2014.
[17] Shahana Yasmin Chowdhury, Swakkhar Shatabda, and Abdollah Dehzangi. Idnaprot-
es: Identification of dna-binding proteins using evolutionary and structural features.
Scientific Reports, 7(1):14938, 2017.