PIDGIN Version 2: Prediction IncluDinG INactivity: Version 2 (LEGACY VERSION)
Author : Lewis Mervin, lhm30@cam.ac.uk
Supervisor : Dr. A. Bender
Platt Scaled Random Forest Protein Target Prediction Tool trained on SARs from PubChem (Mined 21/06/16) and ChEMBL21
Molecular Descriptors : 2048-bit Morgan Binary Fingerprints (RDKit) - ECFP4
All rights reserved 2016
Follow these steps on Linux/OSX:
Download and install Anaconda2 for Python 2.7 from https://www.continuum.io/downloads
conda install -c https://conda.anaconda.org/rdkit rdkit
conda install scikit-learn=0.17 (PIDGINv2 uses Scikit-learn v0.17)
git clone https://github.com/lhm30/PIDGINv2/ (recommended) or download/extract the zip from the GitHub webpage
Download the files from https://tinyurl.com/y39kqc83 (leave .pkl.zip files compressed)
conda install pydot graphviz
predict_raw.py filename.csv N_cores organism
This script outputs the Platt-scaled (sigmoid) probabilities of the Random Forest classifier for the compounds in a matrix.
If using “Organism” it must be as specified in the classes_in_model.txt and enclosed by quotes (“”).
Example of how to run the code:
python predict_raw.py input.csv 4
Output is a matrix of Platt-scaled probabilities for an input list of compounds calculated on a machine using 4 cores
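To illustrate what a Platt-scaled probability is, the sketch below applies the sigmoid transform p = 1 / (1 + exp(A*s + B)) to a raw classifier score s. The function name and the values of A and B are hypothetical and chosen only for illustration. In practice the parameters are fitted on held-out data, and the calibration code used by PIDGIN may differ.

```python
import math

def platt_scale(score, a, b):
    """Map a raw classifier score to a probability with Platt's sigmoid.

    a and b are illustrative parameters; they would normally be fitted
    on held-out data rather than chosen by hand.
    """
    return 1.0 / (1.0 + math.exp(a * score + b))

# Hypothetical parameters and raw scores, for illustration only.
A, B = -4.0, 2.0
for raw in (0.1, 0.5, 0.9):
    print(raw, round(platt_scale(raw, A, B), 3))
```

With a negative A, higher raw scores map to higher calibrated probabilities.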
predict_binary.py filename.csv N_cores tpr_threshold organism
This script generates binary predictions for the models after application of a user-specified predicted true-positive rate (TPR) threshold.
The choice of required TPR applies a given confidence when binarizing predictions (i.e. an acceptable True Positive (TP) rate).
If using “Organism” it must be as specified in the classes_in_model.txt and enclosed by quotes (“”).
Example of how to run the code:
python predict_binary.py input.csv 30 0.5
where 30 cores are used to produce predictions, and 0.5 would apply a 50% TPR confidence threshold
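One way to read a TPR threshold: for each model, find the probability cutoff at which the requested fraction of known actives is still predicted active, then binarize new predictions at that cutoff. The sketch below is a minimal illustration under that assumption. The held-out probabilities and the function names are hypothetical and are not taken from predict_binary.py, which may derive its cutoffs differently.

```python
import math

def cutoff_for_tpr(active_probs, tpr):
    """Return the highest cutoff that recovers at least `tpr` of known actives."""
    ranked = sorted(active_probs, reverse=True)
    # Number of known actives that must score at or above the cutoff.
    needed = max(1, math.ceil(tpr * len(ranked)))
    return ranked[needed - 1]

def binarize(probs, cutoff):
    """Turn probabilities into 0/1 predictions at the given cutoff."""
    return [1 if p >= cutoff else 0 for p in probs]

# Hypothetical held-out probabilities for known actives of one model.
known_active_probs = [0.92, 0.81, 0.67, 0.55, 0.40, 0.33]
cutoff = cutoff_for_tpr(known_active_probs, 0.5)
print(cutoff, binarize([0.7, 0.5, 0.9], cutoff))
```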
predict_enriched.py filename.csv N_cores tpr_threshold DisGeNET_threshold organism
This script calculates enriched targets, NCBI BioSystems pathways and DisGeNET diseases for a library of compounds, when compared to precomputed target predictions from a background set of 2,000,000 compounds from PubChem (bg_predictions.txt).
The protocol corrects for promiscuous models and biases in the training data, and indicates which targets are statistically associated with compounds in filename.csv.
Target predictions for filename.csv are compared against PubChem predictions using the Prediction Ratio (ref. http://tinyurl.com/predratio), Odds Ratio and Fisher's exact test p-values.
For tables with large numbers, the (inexact) chi-square test implemented in the chi2 test function should be used instead. Pathway and DisGeNET predictions are compared against PubChem predictions using the Prediction Ratio, Odds Ratio and chi-square test of independence p-values.
bg_predictions.txt contains rows of target models with corresponding columns for the number of background compounds from PubChem at a given TPR threshold (to 2DP).
DisGeNET_diseases.txt contains disease data used to annotate target predictions. DisGeNET gene-disease score takes into account the number and type of sources (level of curation, organisms), and the number of publications supporting the association. The score ranges from 0 to 1 to give confidence for annotations. A DisGeNET_threshold can be supplied at runtime when annotating predictions with diseases (0.06 threshold applied by default, which includes associations from curated sources/animal models supporting them or reported in 20-200 papers). More info on the score here: http://disgenet.org/web/DisGeNET/menu/dbinfo#score
If using “Organism” it must be as specified in the classes_in_model.txt and enclosed by quotes (“”)
Example of how to run the code:
python predict_enriched.py input.csv 4 0.5 0.06 "Homo sapiens (Human)"
The output is a ranked list of targets that are more statistically associated with the input compounds. A low Prediction Ratio, Odds Ratio or p-value indicates higher enrichment for a target/pathway/disease when compared to the background rate.
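The ratio metrics can be sketched from four counts: how many library and background compounds are predicted active for a target, and the size of each set. The definitions below are an assumption: both ratios are written as background over library so that low values indicate enrichment, consistent with the statement above. The function names and the example counts are hypothetical, and the exact formulas used by predict_enriched.py may differ.

```python
def prediction_ratio(lib_hits, lib_size, bg_hits, bg_size):
    """Background hit frequency divided by library hit frequency (assumed form)."""
    if lib_hits == 0:
        return float("inf")  # no library compound predicted: no enrichment
    return (bg_hits / bg_size) / (lib_hits / lib_size)

def odds_ratio(lib_hits, lib_size, bg_hits, bg_size):
    """Background odds of a hit divided by library odds of a hit (assumed form)."""
    lib_misses = lib_size - lib_hits
    bg_misses = bg_size - bg_hits
    if lib_hits == 0 or bg_misses == 0:
        return float("inf")
    if lib_misses == 0:
        return 0.0  # every library compound is a hit
    return (bg_hits / bg_misses) / (lib_hits / lib_misses)

# Hypothetical counts: 40 of 100 library compounds vs 100,000 of 2,000,000 background.
print(prediction_ratio(40, 100, 100000, 2000000))
print(odds_ratio(40, 100, 100000, 2000000))
```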
predict_enriched_two_libraries.py input_active_library.csv input_inactive_library.csv N_cores tpr_threshold DisGeNET_threshold organism
This script calculates enriched targets, NCBI BioSystems pathways and DisGeNET diseases for two compound libraries (e.g. phenotypically active compounds versus phenotypically inactive compounds).
The protocol corrects for promiscuous models and biases in the training data, and indicates which targets are statistically associated with compounds in input_active_library.csv.
Target predictions for input_active_library.csv are compared against input_inactive_library.csv predictions using the Prediction Ratio (ref. http://tinyurl.com/predratio), Odds Ratio and Fisher's exact test p-values.
For tables with large numbers, the (inexact) chi-square test implemented in the chi2 test function should be used instead. Pathway and DisGeNET predictions are compared against input_inactive_library.csv predictions using the Prediction Ratio, Odds Ratio and chi-square test of independence p-values.
Organism must be as specified in the classes_in_model.txt and enclosed by quotes (“”)
Example of how to run the code:
python predict_enriched_two_libraries.py filename_1.csv filename_2.csv 10 0.9 0.3 "Homo sapiens (Human)"
The output is a ranked list of targets that are more statistically associated with the input compounds. A low Prediction Ratio, Odds Ratio or p-value indicates higher enrichment for a target/pathway/disease when compared to the inactive compound set.
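For a single target, the comparison between the two libraries reduces to a 2x2 table: compounds predicted active or not, in the active library or the inactive library. The sketch below computes a one-sided Fisher's exact test p-value for such a table using only the standard library. The function name and the example counts are hypothetical, and whether the script uses a one-sided or two-sided test is not stated here, so the one-sided form is an assumption.

```python
from math import comb

def fisher_exact_greater(a, b, c, d):
    """One-sided Fisher's exact test p-value for the table [[a, b], [c, d]].

    Rows: active library, inactive library.
    Columns: predicted active for the target, not predicted active.
    Returns the probability of seeing `a` or more hits in the active
    library if hits were distributed independently of library membership.
    """
    row1, col1, total = a + b, a + c, a + b + c + d
    denom = comb(total, col1)
    p = 0.0
    # Sum the hypergeometric tail from the observed count upwards.
    for k in range(a, min(row1, col1) + 1):
        p += comb(row1, k) * comb(total - row1, col1 - k) / denom
    return p

# Hypothetical counts: 30 of 100 actives vs 10 of 100 inactives predicted for a target.
print(fisher_exact_greater(30, 70, 10, 90))
```

The exact test enumerates table configurations, so for tables with large counts a chi-square test is the cheaper choice, as noted above.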
predict_per_comp.py filename_1.csv N_cores tpr_threshold DisGeNET_threshold organism
This script calculates target, pathway and disease hits per compound and represents them in a matrix. The DisGeNET threshold and organism are optional. Organism must be as specified in the classes_in_model.txt and enclosed by quotes (“”)
Example of how to run the code:
python predict_per_comp.py input.csv 30 0.5 0.3 "Homo sapiens (Human)"
predict_target_fingerprints.py filename_1.csv N_cores tpr_threshold organism
This script calculates target probabilities per compound in a simplified, transposed matrix (columns are targets). These can be used as a fingerprint/descriptor for biological space.
tpr_threshold may be set to “None” to generate raw descriptors rather than fingerprints.
Organism filter is optional. If filtering predictions by organism, this must be as specified in the classes_in_model.txt and enclosed by quotes (“”)
Example of how to run the code:
python predict_target_fingerprints.py input.csv 30 0.5 "Homo sapiens (Human)"
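The layout described above (one row per compound, one column per target, binary or raw) can be sketched as follows. This is an illustration only: the function name and target names are hypothetical, and the threshold is treated here as a plain probability cutoff, which is an assumption, since the script itself takes a TPR threshold.

```python
def target_fingerprints(probs_by_target, threshold=None):
    """Build one row per compound with one column per target.

    probs_by_target maps a target name to a list of per-compound probabilities.
    With threshold=None the raw probabilities are kept; otherwise each value
    is binarized (1 if probability >= threshold, else 0).
    NOTE: a plain probability cutoff is an assumption made for this sketch.
    """
    targets = sorted(probs_by_target)
    n_compounds = len(probs_by_target[targets[0]])
    rows = []
    for i in range(n_compounds):
        row = [probs_by_target[t][i] for t in targets]
        if threshold is not None:
            row = [1 if p >= threshold else 0 for p in row]
        rows.append(row)
    return targets, rows

# Hypothetical predictions for two compounds against two targets.
predictions = {"TARGET_B": [0.2, 0.8], "TARGET_A": [0.9, 0.1]}
print(target_fingerprints(predictions, threshold=0.5))
print(target_fingerprints(predictions, threshold=None))
```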
predict_enriched_two_libraries_decision_tree.py filename_1.csv filename_2.csv N_cores tpr_threshold DisGeNET_threshold minimum_sample_split minimum_leaf_split max_depth organism
This script calculates target, pathway and disease hits enrichment and visualises the target predictions in a decision tree (jpg file). The DisGeNET threshold and organism are optional. As always, organism must be enclosed by quotes (“”)
Example of how to run the code:
python predict_enriched_two_libraries_decision_tree.py cytotox_library.csv nontoxic_background.csv 10 0.5 0.5 2 2 5 "Homo sapiens (Human)"
predict_enriched_decision_tree.py filename_1.csv N_cores tpr_threshold DisGeNET_threshold minimum_sample_split minimum_leaf_split max_depth N_kmeans_clusters organism
This script calculates target, pathway and disease hit enrichment and visualises the target predictions in a decision tree (jpg file). It uses k-means clustering to group predictions within the input dataset, as a way to split the input data into hypothetical modes-of-action. Because the clustering is unsupervised, the choice of the number of clusters is subjective. The DisGeNET threshold and organism are optional. As always, organism must be enclosed by quotes (“”)
Example of how to run the code:
python predict_enriched_decision_tree.py cytotox_library.csv 10 0.5 0.5 2 2 5 5 "Homo sapiens (Human)"
sim_to_train.py filename.csv N_cores
This script conducts Tanimoto coefficient (Tc) similarity analysis between the input compounds in filename.csv and the training data in PIDGIN. This can support the interpretation of predictions by indicating which compounds are driving them. Two files are produced. The first is a matrix similar to the output of the predict_raw script above, containing compound vs. target similarities instead of raw predictions. The second is a detailed breakdown of the nearest-neighbour compounds in the training set (i.e. their affinity, confidence and the organism each was extracted from, since ortholog bioactivity data is also used).
Example of how to run the code:
python sim_to_train.py cytotox_library.csv 10
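For binary fingerprints, the Tanimoto coefficient is the number of bits set in both fingerprints divided by the number of bits set in either. The sketch below represents each fingerprint as a set of on-bit indices and finds the most similar training compound. The function names, compound names and bit sets are hypothetical; sim_to_train.py works on the Morgan fingerprints of the actual training data.

```python
def tanimoto(bits_a, bits_b):
    """Tanimoto coefficient between two fingerprints given as on-bit indices."""
    a, b = set(bits_a), set(bits_b)
    union = len(a | b)
    if union == 0:
        return 0.0  # two empty fingerprints: define similarity as 0
    return len(a & b) / union

def nearest_neighbour(query_bits, training):
    """Return (name, Tc) of the training compound most similar to the query.

    training maps a compound name to its set of on-bit indices.
    """
    scored = ((name, tanimoto(query_bits, bits)) for name, bits in training.items())
    return max(scored, key=lambda pair: pair[1])

# Hypothetical fingerprints, for illustration only.
training_set = {"train_1": {1, 2, 3, 4}, "train_2": {7, 8}}
print(nearest_neighbour({1, 2, 3}, training_set))
```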
==========================================================================================