
Drug Theoretics and Cheminformatics Laboratory

Cheminformatics Tools

QSAR Tools

Important Note: The same software tools are now also available from the official website of Jadavpur University (Kolkata, India). You can freely access the software tools from either site, but you are advised to cite the link http://teqip.jdvu.ac.in/QSAR_Tools/

Note: If Java is not installed on your computer, please install Java (click here) before using the following software tools. If you have any queries regarding the software tools, please feel free to contact kunalroy_in@yahoo.com.

*Research articles citing this website (i.e. dtclab.webs.com/software-tools) are listed at the bottom of this web page.

**Please cite the reference article(s) of the respective tools, along with the website link.

** The Java libraries utilized in developing these software tools are also mentioned (below 'List of research articles...').

*Program uploaded on 11 November 2014

A genetic algorithm (GA) is a search heuristic that mimics the process of natural selection. Where an exhaustive search is impractical, heuristic methods are used to speed up the process of finding a satisfactory solution. Genetic algorithms belong to the larger class of evolutionary algorithms (EA), which generate solutions to optimization problems using techniques inspired by natural evolution, such as inheritance, crossover, mutation, and selection. Here, the Genetic Algorithm tool 1.0 applies a genetic algorithm to select significant variables (descriptors) during QSAR model development.
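As an illustration of the idea, the sketch below runs a small genetic algorithm over descriptor subsets encoded as 0/1 masks. It is not the code of the Genetic Algorithm tool: the function name `select_descriptors`, its parameters, and the toy fitness function are assumptions made for this example. In practice the fitness would be a model-quality score, for instance a cross-validated statistic of an MLR model built from the selected descriptors.

```python
import random


def select_descriptors(n_descriptors, fitness, pop_size=20, generations=40,
                       mutation_rate=0.05, seed=0):
    """Search for a 0/1 descriptor mask that maximizes fitness(mask).

    Assumes n_descriptors >= 2 and pop_size >= 2.
    """
    rng = random.Random(seed)
    # Initial population: random masks (1 = descriptor included).
    population = [[rng.randint(0, 1) for _ in range(n_descriptors)]
                  for _ in range(pop_size)]
    for _ in range(generations):
        ranked = sorted(population, key=fitness, reverse=True)
        next_gen = [ranked[0][:]]  # elitism: the best mask survives unchanged
        while len(next_gen) < pop_size:
            # Selection: each parent is the better of two random individuals.
            mother = max(rng.sample(population, 2), key=fitness)
            father = max(rng.sample(population, 2), key=fitness)
            # Crossover: the child inherits a prefix from one parent
            # and the remainder from the other.
            cut = rng.randint(1, n_descriptors - 1)
            child = mother[:cut] + father[cut:]
            # Mutation: flip each bit with a small probability.
            child = [1 - bit if rng.random() < mutation_rate else bit
                     for bit in child]
            next_gen.append(child)
        population = next_gen
    best = max(population, key=fitness)
    return best, fitness(best)


def toy_fitness(mask):
    """Hypothetical score: descriptors 0, 2 and 5 are 'informative';
    every other selected descriptor costs a small penalty."""
    informative = {0, 2, 5}
    hits = sum(mask[i] for i in informative)
    extras = sum(mask) - hits
    return hits - 0.5 * extras


best_mask, best_score = select_descriptors(8, toy_fitness)
print(best_mask, best_score)
```

Because the best mask of each generation is copied unchanged into the next one, the best score can never decrease from one generation to the next.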


To Download and Run: Click on the download button (it will direct you to Google Drive) and then press "Ctrl+S" (Windows) or "Cmd+S" (Mac) to save the zip file. Extract the .zip file and click on the .jar file to run the program.

Note: The program folder consists of three folders: "Data", "Lib" and "Output". For convenience, you may keep input files in the "Data" folder and save output files in the "Output" folder. The "Lib" folder consists of library files required for running the program. Check the format of the training set input file (.xlsx) before using the program (sample files are provided in the "Data" folder). *A manual is provided in the program folder.

*Program uploaded on 11 November 2014

Nano-Profiler (endpoint-dependent analogues identification software) is a tool that predicts the respective property/endpoint of nanoparticles using nano-QSAR models already reported in the literature and present in the provided database (inside the "Data" folder), and then performs clustering to find analogues based on the respective predicted endpoint. We have also included three more clustering methods for analogue identification: the k-Medoids algorithm (slow and exhaustive; searches for the best ‘k’ medoids), the modified k-Medoid algorithm (fast; searches for optimum ‘k’ medoids), and a Euclidean distance-based method.


To Download and Run: Click on the download button (it will direct you to Google Drive) and then press "Ctrl+S" (Windows) or "Cmd+S" (Mac) to save the zip file. Extract the .zip file and click on the .jar file to run the program.

Note: The program folder consists of three folders: "Data", "Lib" and "Output". For convenience, you may keep input files in the "Data" folder and save output files in the "Output" folder. The "Lib" folder consists of library files required for running the program. Check the format of the input file (.xlsx) before using the program (sample files are provided in the "Data" folder). *A manual is provided in the program folder (pdf).

*Updated program uploaded on 14 August 2014

Applicability Domain - Model Disturbance Index (AD-MDI) is a tool to define the applicability domain (AD) of unknown samples based on a concept proposed by Yan et al. (see reference below). For more information on defining the AD and finding the AD of query compounds, please read the article. This program is an updated version of the previous release (now removed). The only difference between the two versions is that the updated version can optionally determine the AD of query compounds, which the previous version could not.


To Download and Run: Click on the download button (it will direct you to Google Drive) and then press "Ctrl+S" (Windows) or "Cmd+S" (Mac) to save the zip file. Extract the .zip file and click on the .jar file to run the program.

Note: The program folder consists of three folders: "Data", "Lib" and "Output". For convenience, you may keep input files in the "Data" folder and save output files in the "Output" folder. The "Lib" folder consists of library files required for running the program. Check the format of the training set input file (.xlsx) and the query file (.xlsx) before using the program (sample files are provided in the "Data" folder).


***Please check the input file format: sample files are provided in the program folder, inside the folder named "Data".

Training and test set file (*.xlsx): serial number column (first column), descriptor columns, activity/property column (last column).

Query file (*.xlsx): serial number column (first column), descriptor columns (through the last column).

*Manual is provided in the program folder (pdf)

Reference article for AD-MDI

Yan, Jun, et al. "A Combinational Strategy of Model Disturbance and Outlier Comparison to Define Applicability Domain in Quantitative Structural Activity Relationship." Molecular Informatics (2014). (click here)

*Program uploaded on 1 August 2014

This tool performs stepwise MLR using two methods: 1) using alpha value, 2) using F value. Users can also select a data pre-treatment option to remove constant and inter-correlated descriptors prior to performing stepwise MLR. Three output files are generated: 1) LogFile.txt: names of the descriptors (constant and/or inter-correlated) removed based on the variance and correlation-coefficient cut-offs; 2) SMLR.txt: information regarding the descriptors selected/removed along with validation parameters at each step, based on F-value (F-to-Enter, F-to-Remove) or alpha-value (alpha-to-Enter, alpha-to-Remove) cut-offs; 3) an xlsx file: the set of descriptors selected (along with the activity/property column) after performing stepwise MLR.


To Download and Run: Click on the download button (it will direct you to Google Drive) and then press "Ctrl+S" (Windows) or "Cmd+S" (Mac) to save the zip file. Extract the .zip file and click on the .jar file to run the program.

Note: The program folder consists of three folders: "Data", "Lib" and "Output". For convenience, you may keep the input file in the "Data" folder and save output files in the "Output" folder. The "Lib" folder consists of library files required for running the program. Check the format of the data file (.xlsx) before using the program (a sample file is provided in the "Data" folder).


***Please check the input file format (*.xlsx): descriptor/property columns (from the first column), activity column (last column).

*Manual is provided in the program folder (pdf).

*Program uploaded on 1 August 2014

*Minor change on 18 August 2014

Modified K-Medoid is a simple and fast algorithm for K-medoids clustering (see reference below). The algorithm is a local heuristic that runs just like K-means clustering when updating the medoids. It tends to select the k most centrally located objects as initial medoids. The distance matrix is calculated only once and is reused for finding new medoids at every iterative step.
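The steps described above can be sketched as follows. This is a minimal illustration based on the published description of the method, not the code of the tool; the function name `modified_k_medoids`, the use of Euclidean distance, and the `max_iter` guard are assumptions. It also assumes the data contain at least two distinct points.

```python
from math import dist  # Euclidean distance, Python 3.8+


def assign(d, medoids):
    """Label every object with the index of its nearest medoid."""
    return [min(medoids, key=lambda m: d[i][m]) for i in range(len(d))]


def modified_k_medoids(points, k, max_iter=100):
    """Return (medoid_indices, labels) for the given points."""
    n = len(points)
    # The distance matrix is computed once and reused in every iteration.
    d = [[dist(p, q) for q in points] for p in points]
    row_sum = [sum(row) for row in d]
    # Initial medoids: the k objects with the smallest
    # v_j = sum_i d_ij / sum_l d_il, i.e. the most central objects.
    v = [sum(d[i][j] / row_sum[i] for i in range(n)) for j in range(n)]
    medoids = sorted(sorted(range(n), key=lambda j: v[j])[:k])
    for _ in range(max_iter):
        labels = assign(d, medoids)
        # Update step: inside each cluster, the new medoid is the member
        # with the smallest total distance to the other members.
        new_medoids = []
        for m in medoids:
            members = [i for i in range(n) if labels[i] == m]
            if not members:  # can only happen with duplicate points
                new_medoids.append(m)
                continue
            new_medoids.append(
                min(members, key=lambda c: sum(d[c][i] for i in members)))
        new_medoids.sort()
        if new_medoids == medoids:  # converged: medoids no longer change
            break
        medoids = new_medoids
    return medoids, assign(d, medoids)


# Two well-separated groups on a line.
medoids, labels = modified_k_medoids([[0], [1], [2], [10], [11], [12]], k=2)
print(medoids, labels)  # [1, 4] [1, 1, 1, 4, 4, 4]
```

In this example the two most central objects (indices 2 and 3) are picked first, and a single update step moves the medoids to the middle of each group.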

To Download and Run: Click on the download button (it will direct you to Google Drive) and then press "Ctrl+S" (Windows) or "Cmd+S" (Mac) to save the zip file. Extract the .zip file and click on the .jar file to run the program.

Note: The program folder consists of three folders: "Data", "Lib" and "Output". For convenience, you may keep the input file in the "Data" folder and save output files in the "Output" folder. The "Lib" folder consists of library files required for running the program. Check the format of the data file (.xlsx) before using the program (a sample file is provided in the "Data" folder).

***Please check the input file format (*.xlsx): serial number column (first column), property columns (through the last column).

*Manual is provided in the program folder (pdf)

 Modified K-Medoid Reference article
Park, Hae-Sang, and Chi-Hyuck Jun. "A simple and fast algorithm for K-medoids clustering." Expert Systems with Applications 36.2 (2009): 3336-3341. (Click here)

*Program uploaded on 09 June 2014

* Minor update on 6 August 2014

This tool removes constant and highly correlated descriptors, based on user-specified variance and correlation cut-off values, using the V-WSP algorithm (see reference below).

To Download and Run: Click on the download button (it will direct you to Google Drive) and then press "Ctrl+S" (Windows) or "Cmd+S" (Mac) to save the zip file. Extract the .zip file and click on the .jar file to run the program.

Note: The program folder consists of three folders: "Data", "Lib" and "Output". For convenience, you may keep the input file in the "Data" folder and save output files in the "Output" folder. The "Lib" folder consists of library files required for running the program. Check the format of the data file (.xlsx) before using the program (a sample file is provided in the "Data" folder).

***Please check the input file format (*.xlsx): property columns only (from the first column through the last column).

*Manual is provided in the program folder (pdf)

Reference article for Data Pre-treatment using V-WSP algorithm
Ballabio, Davide, et al. "A novel variable reduction method adapted from space-filling designs." Chemometrics and Intelligent Laboratory Systems (2014). (click here)

Dataset division GUI is a user-friendly application that includes three different methods for dividing a dataset into training and test sets: Kennard-Stone based, Euclidean distance based (or diversity based), and activity/property based.
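Of the three methods, the Kennard-Stone selection is the easiest to show in a few lines. The sketch below is an illustration of the general algorithm, not the code of the tool: it starts from the two most distant compounds and then repeatedly adds the compound that is farthest from everything already selected. The function name `kennard_stone` and the use of Euclidean distance on raw descriptor values are assumptions, and `n_train` is assumed to be at least 2.

```python
from math import dist  # Euclidean distance, Python 3.8+


def kennard_stone(rows, n_train):
    """Split row indices into (training, test) with Kennard-Stone selection."""
    n = len(rows)
    # Start with the pair of compounds that are farthest apart.
    first, second = max(
        ((i, j) for i in range(n) for j in range(i + 1, n)),
        key=lambda pair: dist(rows[pair[0]], rows[pair[1]]))
    selected = [first, second]
    remaining = [i for i in range(n) if i not in selected]
    while len(selected) < n_train and remaining:
        # Pick the compound whose nearest selected neighbour is farthest away.
        nxt = max(remaining,
                  key=lambda i: min(dist(rows[i], rows[s]) for s in selected))
        selected.append(nxt)
        remaining.remove(nxt)
    return selected, remaining


# One descriptor per compound, for readability.
train_idx, test_idx = kennard_stone([[0], [1], [2], [5], [10]], n_train=3)
print(train_idx, test_idx)  # [0, 4, 3] [1, 2]
```

The training set therefore covers the extremes of the descriptor space first, and the test set is made of compounds that lie between already selected training compounds.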


To Download: Click on the download button (it will direct you to Google Drive) and then press "Ctrl+S" (Windows) or "Cmd+S" (Mac) to save the zip file. Extract the .zip file and click on the .jar file to run the program.

Note: You may keep the dataset file (.csv) in the “data” folder and may select the “output” folder for storing output files.

***Please check the input file format (*.csv): serial number column (first column), descriptor columns, activity/property column (last column).

Reference Article related to Dataset Division  
"Does Rational Selection of Training and Test Sets Improve the Outcome of QSAR Modeling?" published in  J. Chem. Inf. Model. (Click here).
Kennard, Ronald W., and Larry A. Stone. "Computer aided design of experiments." Technometrics 11.1 (1969): 137-148. (Click here)

The k-medoids algorithm is a clustering algorithm related to the k-means algorithm and the medoid shift algorithm. Both the k-means and k-medoids algorithms are partitional (breaking the dataset up into groups) and both attempt to minimize the distance between the points assigned to a cluster and a point designated as the center of that cluster. In contrast to the k-means algorithm, k-medoids chooses actual data points as centers (medoids). In this tool, the cost is calculated using the Manhattan distance. For detailed information, see Wikipedia: click here.
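To make the cost function concrete, the short sketch below assigns each point to its nearest medoid and sums the Manhattan distances. It only illustrates the quantity that k-medoids tries to minimize; the function names are assumptions and the search over candidate medoids is left out.

```python
def manhattan(a, b):
    """Manhattan (city-block) distance between two points."""
    return sum(abs(x - y) for x, y in zip(a, b))


def assign_and_cost(points, medoid_indices):
    """Return (labels, total_cost) for a given choice of medoids."""
    labels, total = [], 0
    for p in points:
        # Nearest medoid; ties are broken by the smaller medoid index.
        distance, medoid = min(
            (manhattan(p, points[m]), m) for m in medoid_indices)
        labels.append(medoid)
        total += distance
    return labels, total


points = [(1, 1), (2, 1), (8, 8), (9, 9)]
labels, cost = assign_and_cost(points, medoid_indices=[0, 2])
print(labels, cost)  # [0, 0, 2, 2] 3
```

A full k-medoids run compares this cost across different medoid choices and keeps the configuration with the lowest total.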


To Download: Click on the download button (it will direct you to Google Drive) and then press "Ctrl+S" (Windows) or "Cmd+S" (Mac) to save the zip file. Extract the .zip file and click on the .jar file to run the program.

***Please check the input file format (*.csv): serial number column (first column), descriptor columns, activity/property column (last column).

This tool develops QSAR models using MLR and calculates internal, external and overall validation parameters of the developed models. It also checks the Golbraikh and Tropsha model acceptability criteria, and computes Euclidean distance-based normalized mean distances to determine the applicability domain (AD). Users may also use the online tool to perform the same operations plus Y-randomization at http://aptsoftware.co.in/DTCMLRWeb/index.jsp


***Please check the input file format (*.csv): (First) Descriptors..columns, activity/property column (Last column). *No serial number column

Reference Articles related to MLR plus Validation Program
1. "On various metrics used for validation of predictive QSAR models with applications in virtual screening and focused library design" published in Comb. Chem. High Throughput Screen. This article describes all the general validation parameters very well. (Click here)

2. "Some case studies on application of "rm2" metrics for judging quality of quantitative structure-activity relationship predictions: Emphasis on scaling of response data." published in J. Comput. Chem. This article is about the novel "rm^2" parameter developed in our lab. (Click here)

3. "Beware of q2!" published in J. Mol. Graph. Model. and "Best practices for QSAR model development, validation, and exploitation" published in Mol. Inf. describe Golbraikh and Tropsha's criteria well. (Click Here and Click Here)

4. "Quantitative structure-activity relationship prediction of blood-to-brain partitioning behavior using support vector machine" published in Eur. J. Pharm. Sci. journal. In this article, the Euclidean distance-based Applicability domain scatter plot is demonstrated. (Click here)

This tool calculates all external validation parameters and also checks Golbraikh and Tropsha's acceptable model criteria.


This tool removes constant and highly correlated descriptors based on user-specified variance and correlation cut-off values.

The Train-Test version helps when applying data pretreatment to a training set: the descriptors removed from the training set are also removed from the corresponding test set.
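The pretreatment idea can be sketched as below. This is a plain variance and pairwise-correlation filter written for illustration, not the tool's code: the function names, the dictionary-of-columns layout, and the rule that the later descriptor of a correlated pair is the one dropped are all assumptions.

```python
from math import sqrt
from statistics import pvariance


def pearson_r(x, y):
    """Pearson correlation coefficient of two equally long, non-constant columns."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def select_columns(train, variance_cutoff, correlation_cutoff):
    """Return the names of the descriptors that survive both filters."""
    # Step 1: drop constant and near-constant descriptors.
    varying = [name for name, col in train.items()
               if pvariance(col) > variance_cutoff]
    # Step 2: keep a descriptor only if it is not highly correlated with
    # any descriptor that has already been kept.
    selected = []
    for name in varying:
        if all(abs(pearson_r(train[name], train[kept])) <= correlation_cutoff
               for kept in selected):
            selected.append(name)
    return selected


train = {"a": [1, 2, 3, 4], "b": [2, 4, 6, 8],
         "c": [5, 5, 5, 5], "d": [1, 0, 1, 0]}
test = {"a": [5, 6], "b": [10, 12], "c": [5, 5], "d": [1, 1]}

kept = select_columns(train, variance_cutoff=0.0001, correlation_cutoff=0.99)
# Train-Test behaviour: the decision is made on the training set only,
# and the same columns are then kept in the test set.
test_filtered = {name: test[name] for name in kept}
print(kept)  # ['a', 'd']
```

Here "c" is removed because it is constant and "b" because it is perfectly correlated with "a"; the test set is reduced to the same two columns without looking at its own statistics.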


To Download GUI version: Click on the download button (it will direct you to Google Drive) and then press "Ctrl+S" (Windows) or "Cmd+S" (Mac) to save the zip file. Extract the .zip file and click on the .jar file to run the program.


This tool checks whether the compounds of the test set are representative of the training set (i.e. whether the test set structures are within the applicability domain). It is based on distance scores calculated by the Euclidean distance norm.

This tool is used to observe structural diversity in the selected dataset, in terms of distance scores (calculated by the Euclidean distance norm).

Note: To observe diversity among the compounds present in a dataset, draw a scatter plot of Normalized Mean Distance vs. the respective Activity/Property.
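One common way to obtain such distance scores is sketched below: for every compound, take the mean Euclidean distance to all other compounds, then rescale the means to the range 0 to 1. Whether the tool computes the score in exactly this way is an assumption, and the function name is hypothetical.

```python
from math import dist  # Euclidean distance, Python 3.8+


def normalized_mean_distances(rows):
    """Mean distance of each compound to all others, rescaled to [0, 1]."""
    n = len(rows)
    means = [sum(dist(rows[i], rows[j]) for j in range(n) if j != i) / (n - 1)
             for i in range(n)]
    lo, hi = min(means), max(means)
    if hi == lo:
        # All compounds are equally far from each other.
        return [0.0] * n
    return [(m - lo) / (hi - lo) for m in means]


# The third compound sits far away from the other two.
scores = normalized_mean_distances([[0, 0], [1, 0], [10, 0]])
print(scores)
```

The most central compound gets a score of 0 and the most remote one a score of 1, so compounds with scores close to 1 are the ones to inspect first when judging the applicability domain or the diversity of a dataset.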

Reference Articles related to Diversity Validation

An interesting article on a diversity assessment method was recently published in J. Chem. Inf. Model. (Click here, 2013). According to this article, descriptors based on atom topology (i.e., fingerprint-based descriptors and pharmacophore-based descriptors) can be successfully used for diversity assessment of the compounds present in a dataset.


Users may also use the normalized mean distance to calculate the "MODelability Index" (MODI), a quantitative means to quickly assess whether predictive QSAR model(s) can be obtained for a given chemical dataset, which is based on the "activity cliffs" concept. The MODelability Index was recently published in J. Chem. Inf. Model. (Click here, 2013)

This test checks the robustness of the QSAR model by building several random models via shuffling the dependent variable while keeping the independent variables unchanged. To pass the test, the resulting random models are expected to have significantly lower r^2 and q^2 values than the original model over several trials. Another parameter, cRp^2, is also calculated; it should be more than 0.5 to pass this test.
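The cRp^2 check can be sketched as follows. The formula used here, cRp^2 = R * sqrt(R^2 - Rr^2), where Rr^2 is the mean r^2 of the randomized models, is how the parameter is commonly reported in the QSAR literature; treating it as the tool's exact computation is an assumption, as are the function names. Fitting the models themselves is left out, so the r^2 values are passed in directly.

```python
import random
from math import sqrt


def shuffled_responses(y, n_trials, seed=0):
    """Return n_trials shuffled copies of the response column y."""
    rng = random.Random(seed)
    trials = []
    for _ in range(n_trials):
        y_perm = list(y)
        rng.shuffle(y_perm)  # descriptors stay untouched, only y is permuted
        trials.append(y_perm)
    return trials


def c_rp2(r2_model, r2_random_models):
    """cRp^2 = R * sqrt(R^2 - mean r^2 of the randomized models)."""
    mean_random = sum(r2_random_models) / len(r2_random_models)
    # If the random models are as good as the real one, the score is 0.
    difference = max(r2_model - mean_random, 0.0)
    return sqrt(r2_model) * sqrt(difference)


# Hypothetical numbers: the real model has r^2 = 0.80 and two random
# models reach r^2 = 0.05 and 0.15.
score = c_rp2(0.80, [0.05, 0.15])
print(round(score, 3), score > 0.5)
```

Each shuffled response column would be used to refit the model, and the r^2 values of those refits are what goes into the second argument.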

This tool performs Leave-‘n’-Out cross-validation in the true sense: it performs MLR calculations leaving out each possible combination of ‘n’ compounds, where n is a user-defined value. Since Leave-n-out is computationally expensive, try to keep the value of ‘n’ small (up to 4 for 100 compounds in the training set). Check the generated output text file for the validation parameters (Q^2 and SDEP).

Note: The number of combinations (displayed when you start running the program) and the progress bar will help you decide the value of ‘n’. Please check the sample input file (*the last column must be the activity column, and there must be no compound number/serial number column).
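The number of combinations grows very quickly with ‘n’, which is why small values are recommended. The sketch below counts the combinations and enumerates the splits; the function name `leave_n_out_splits` is hypothetical and the MLR fit for each split is left out.

```python
from itertools import combinations
from math import comb  # Python 3.8+


def leave_n_out_splits(n_compounds, n):
    """Yield (training_indices, left_out_indices) for every left-out group."""
    all_indices = range(n_compounds)
    for left_out in combinations(all_indices, n):
        held_out = set(left_out)
        training = [i for i in all_indices if i not in held_out]
        yield training, list(left_out)


# Number of models that must be fitted for 100 training compounds.
for n in (1, 2, 3, 4):
    print(n, comb(100, n))

# A tiny dataset can be enumerated completely.
splits = list(leave_n_out_splits(5, 2))
print(len(splits), splits[0])  # 10 ([2, 3, 4], [0, 1])
```

For 100 compounds, n = 4 already requires 3,921,225 model fits, compared with 100 fits for ordinary leave-one-out.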

This tool selects the best descriptor combination out of a set of descriptors, based on internal and external validation parameters obtained by performing MLR calculations. It is very helpful for selecting the best descriptors out of the set of contributing descriptors obtained after stepwise MLR or GFA; the number of descriptors to select is user-defined and should be based on the number of compounds in the training set. Users can also define an r^2 cut-off, which is useful to reduce computational time.

Note: Please check the sample input file. The format is the same for both the training and test set files (*the last column must be the activity column, and there must be no compound number/serial number column).

This tool normalizes the data by scaling the values to the range 0 to 1.
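Scaling to the range 0 to 1 is usually done column by column with min-max scaling, x' = (x - min) / (max - min). The sketch below assumes this is the scaling meant here; the function name and the handling of constant columns are assumptions.

```python
def min_max_scale(column):
    """Scale a list of numbers to the range [0, 1]."""
    lo, hi = min(column), max(column)
    if hi == lo:
        # A constant column has no spread; map every value to 0.0.
        return [0.0 for _ in column]
    return [(x - lo) / (hi - lo) for x in column]


print(min_max_scale([2, 4, 6]))  # [0.0, 0.5, 1.0]
```

The smallest value of each column becomes 0 and the largest becomes 1, so descriptors measured on very different scales become directly comparable.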

Contact

All of the above programs have been developed in Java and validated on known data sets.


For any query/suggestions contact :

Pravin Ambure (ambure.pharmait@gmail.com) 

Drug Theoretics and Cheminformatics Laboratory (DTC)

Department of Pharmaceutical Technology

Jadavpur University

Kolkata -700032


Acknowledgement

The programmer is highly grateful to the Department of Biotechnology, Government of India, for providing financial assistance.

* The following software tools were developed during six months (March - August 2014) of participation in the international project "NanoBridges" at Gdansk University, Poland (http://nanobridges.eu/), which has received funding from the People Programme (Marie Curie Actions) of the European Union Seventh Framework Programme:

1. Stepwise MLR

2. Modified k-Medoid

3. vWSP

4. AD-MDI

5. Genetic Algorithm

6. Nano Profiler


Last Updated on 13/11/2014

C++ Programs

This program calculates qualitative validation parameters for linear discriminant analysis and pharmacophore/toxicophore analysis, such as sensitivity, specificity, accuracy, precision, F-measure, Matthews correlation coefficient (MCC), geometric means (G-means), Cohen's kappa, Güner-Henry score and recall, for a selected threshold based on the confusion matrix.
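Most of these parameters follow directly from the four cells of a binary confusion matrix. The Python sketch below uses the standard textbook definitions and is not a port of the C++ program; the function name is hypothetical, the Güner-Henry score is omitted because it needs database-level counts, and all row and column totals are assumed to be non-zero.

```python
from math import sqrt


def classification_metrics(tp, fp, tn, fn):
    """Standard metrics from true/false positive and negative counts."""
    total = tp + fp + tn + fn
    sensitivity = tp / (tp + fn)  # identical to recall
    specificity = tn / (tn + fp)
    accuracy = (tp + tn) / total
    precision = tp / (tp + fp)
    f_measure = 2 * precision * sensitivity / (precision + sensitivity)
    mcc = (tp * tn - fp * fn) / sqrt(
        (tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    g_means = sqrt(sensitivity * specificity)
    # Cohen's kappa: observed agreement corrected for chance agreement.
    expected = ((tp + fp) * (tp + fn) + (fn + tn) * (fp + tn)) / total ** 2
    kappa = (accuracy - expected) / (1 - expected)
    return {
        "sensitivity": sensitivity, "specificity": specificity,
        "accuracy": accuracy, "precision": precision,
        "f_measure": f_measure, "mcc": mcc,
        "g_means": g_means, "kappa": kappa,
    }


# Hypothetical confusion matrix: 50 actives and 50 inactives.
metrics = classification_metrics(tp=40, fp=5, tn=45, fn=10)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

For this example the sensitivity is 0.8, the specificity 0.9, the accuracy 0.85 and Cohen's kappa 0.7.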

This C++ program has been validated on known data

For any query/suggestion contact :

Rahul Balasaheb Aher (rahulba26@gmail.com)

Drug Theoretics and Cheminformatics Laboratory (DTC)

Department of Pharmaceutical Technology

Jadavpur University

Kolkata -700032


This website has been cited by the following Research Articles (based on google scholar search) 

1. Ambure, Pravin, and Kunal Roy. "Exploring structural requirements of leads for improving activity and selectivity against CDK5/p25 in Alzheimer's disease: an in silico approach." RSC Advances 4.13 (2014): 6702-6709.


2. Toropov, Andrey A., and Alla P. Toropova. "Optimal descriptor as a translator of eclectic data into endpoint prediction: Mutagenicity of fullerene as a mathematical function of conditions." Chemosphere (2013).


3. Roy, Kunal, Rudra Narayan Das, and Paul LA Popelier. "Quantitative structure–activity relationship for toxicity of ionic liquids to Daphnia magna: Aromaticity vs. lipophilicity." Chemosphere 112 (2014): 120-127.


4. Garg, Rajni, and Carr J. Smith. "Predicting the bioconcentration factor of highly hydrophobic organic chemicals." Food and Chemical Toxicology (2014).


5. Das, Rudra N., and Kunal Roy. "Predictive in silico modeling of ionic liquids towards inhibition of acetyl cholinesterase enzyme of Electrophorus electricus: A predictive toxicology approach." Industrial & Engineering Chemistry Research (2013).


6. Wang, Dan-Dan, et al. "QSAR studies for the acute toxicity of nitrobenzenes to the Tetrahymena pyriformis." Journal of the Serbian Chemical Society 00 (2014): 25-25.


7. Mridha, Priyanka, Pallabi Pal, and Kunal Roy. "Chemometric modelling of triphenylmethyl derivatives as potent anticancer agents." Molecular Simulation ahead-of-print (2014): 1-18.


8. Partha Pratim Roy, Sarbani Dey Ray, and Supratim Ray. "Combined experimental and in silico approaches for exploring antiperoxidative potential of structurally diverse classes of antioxidants on docetaxel-induced lipid peroxidation using 4-HNE as the model marker." Bioorganic Chemistry (2014).


9. Pramanik, Subrata, and Kunal Roy. "Exploring QSTR modeling and toxicophore mapping for identification of important molecular features contributing to the chemical toxicity in Escherichia coli." Toxicology in Vitro 28.2 (2014): 265-272.


10. Balupuri, A., et al. "Docking-based 3D-QSAR study of pyridyl aminothiazole derivatives as checkpoint kinase 1 inhibitors." SAR and QSAR in Environmental Research ahead-of-print (2014): 1-21.


11. Roy, Kunal, and Supratik Kar. "The rm2 metrics and regression through origin approach: reliable and useful validation tools for predictive QSAR models (Commentary on ‘Is regression through origin useful in external validation of QSAR models?’)." European Journal of Pharmaceutical Sciences 62 (2014): 111–114.


12. de Campos, Luana Janaína, and Eduardo Borges de Melo. "MODELING STRUCTURE-ACTIVITY RELATIONSHIPS OF PRODIGININES WITH ANTIMALARIAL ACTIVITY USING GA/MLR AND OPS/PLS." Journal of Molecular Graphics and Modelling (2014). 


13.  Kunal Roy, Rudra Narayan Das, Paul LA Popelier. "Quantitative structure–activity relationship for toxicity of ionic liquids to Daphnia magna: Aromaticity vs. lipophilicity". Chemosphere 112 (2014): 120-127


14.  Vats, Chakshu, et al. "Computational design of novel flavonoid analogues as potential AChE inhibitors: analysis using group-based QSAR, molecular docking and molecular dynamics simulations." Structural Chemistry (2014): 1-10.


15. Roy K, Popelier PLA, Chemometric modeling of the chromatographic lipophilicity parameter logk0 of ionic liquid cations with ETA and QTMS descriptors. J Mol Liq, 2014, http://dx.doi.org/10.1016/j.molliq.2014.10.018.


16. Singh, Shalini. "Computational design and chemometric QSAR modeling of Plasmodium falciparum carbonic anhydrase inhibitors." Bioorganic & Medicinal Chemistry Letters (2014).  http://www.sciencedirect.com/science/article/pii/S0960894X14011603


17. Shamsara, Jamal, and Ahmad Shahir-Sadr. "A predictive HQSAR model for a series of tricycle core containing MMP-12 inhibitors with dibenzofuran ring." http://downloads.hindawi.com/journals/ijmc/aip/630807.pdf 


18. Aher, Rahul B., and Kunal Roy. "First Report on Two-Fold Classification of Plasmodium falciparum Carbonic Anhydrase Inhibitors Using QSAR Modeling Approaches." Combinatorial chemistry & high throughput screening 17.9 (2014): 745-755.

Java Libraries used in developing the above software tools are as follows :

1. Java Statistical Classes (JSC) : click here

2. Apache Commons Mathematics Library : click here

3. Apache POI - the Java API for Microsoft Documents : click here

4. XMLBeans : click here

5. JMathPlot - interactive 2D and 3D plots : click here