
Drug Theoretics and Cheminformatics Laboratory

Cheminformatics Tools

QSAR Tools

Important Note: The same software tools are now also available from the official Website of Jadavpur University (Kolkata, INDIA).  Although you can freely access the software tools from any one of the sites, you are advised to cite the link http://teqip.jdvu.ac.in/QSAR_Tools/ 

To become a registered user (free of charge) of this site for academic/commercial purposes, kindly download and sign the License Agreement Form and send it to kunalroy_in@yahoo.com

**Researchers from various institutes have already signed the license agreement. The names of the representative institutes are as follows:
  • Department of Physics, SCSVMV University, Tamil Nadu, INDIA.
  • Institute of Pharmaceutical Sciences, Guru Ghasidas Vishwavidhyalaya, Bilaspur, INDIA
  • Department of Pharmaceutical Sciences, Assam University, Assam, INDIA
  • Division of Pharmaceutical Chemistry, GIPER, Uttarakhand, INDIA
  • Department of Chemical Technology, University of Calcutta, Kolkata, INDIA
  • Institute of Biochemistry and Biophysics, Tehran University, IRAN
  • Research Institute for Fundamental Sciences (RIFS), University of Tabriz, IRAN
  • Dr. H. S. Gour University, Madhya Pradesh, INDIA
  • Al-Ameen College of Pharmacy, Bangalore, INDIA
  • Interdiscip. Center for Nano-Toxicity, Dept. of Chem. & BioChem., Jackson State University, USA
  • Smt. Kashibai Navale College of Pharmacy, STES, Pune, INDIA
  • Citech, Département de Biologie structurale et chimie, Pasteur Institute, Paris, FRANCE

Note: If Java is not installed on your computer, please install Java (click here) before using the following software tools. If you have any queries regarding the software tools, please feel free to contact kunalroy_in@yahoo.com.

*List of research articles citing this website: Click here 

**Please cite the reference article/s of the respective tools, along with the web site link

** The Java libraries utilized in developing these software tools are also mentioned below.

** If your input file is very large, or if a large output file is likely to be generated, prefer the .csv file type over .xlsx and .xls. The xls/xlsx file types have a memory limit, which may cause incomplete execution of the program (the program will stop and throw a Java heap size error). This can be avoided (up to a limit) by using a CSV file.

*Program uploaded on 26 May 2016

*Updated on 27 May 2016

*Last Updated on 6 June 2016

The double cross-validation process comprises two nested cross-validation loops, which are referred to as the internal and external cross-validation loops. In the outer (external) loop of double cross-validation, all data objects are divided into two subsets referred to as the training and test sets. The training set is used in the inner (internal) loop of double cross-validation for model building and model selection, while the test set is used exclusively for model assessment. In the internal loop, the training set is repeatedly split into calibration and validation data sets. The calibration objects are used to develop different models, whereas the validation objects are used to estimate the models' errors. Finally, the model with the lowest prediction error on the validation sets in the inner loop is selected. Then, the test objects in the outer loop are employed to assess the predictive performance of the selected model. This method of multiple splits of the training set into calibration and validation sets obviates the bias introduced in variable selection when a single training set of fixed composition is used [see Reference article for further reading].

Double Cross-validation Tool (version 1.0) performs the double cross-validation process as mentioned above. Here, the user has to provide the training and test set information (descriptors and the response variable) in the respective input files.
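The nested procedure described above can be sketched in a few lines of Python. This is only an illustrative sketch and not the code of the Java tool: the function names, the repeated random calibration/validation splits, and the use of single-descriptor linear models as the candidate models are all assumptions made to keep the example short.

```python
import random
from statistics import mean

def fit_line(xs, ys):
    """Least-squares fit of y = a + b*x for a single descriptor."""
    mx, my = mean(xs), mean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    b = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sxx if sxx else 0.0
    return my - b * mx, b

def mse(model, xs, ys):
    """Mean squared prediction error of a fitted (a, b) model."""
    a, b = model
    return mean((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))

def double_cv(train_X, train_y, test_X, test_y,
              n_splits=5, val_fraction=0.25, seed=0):
    """Candidate models (assumption): one linear model per descriptor column."""
    rng = random.Random(seed)
    n, p = len(train_X), len(train_X[0])
    n_val = max(1, int(n * val_fraction))
    inner_error = [0.0] * p
    # Inner loop: repeated calibration/validation splits of the training set.
    for _ in range(n_splits):
        idx = list(range(n))
        rng.shuffle(idx)
        val, cal = idx[:n_val], idx[n_val:]
        for j in range(p):
            model = fit_line([train_X[i][j] for i in cal],
                             [train_y[i] for i in cal])
            inner_error[j] += mse(model,
                                  [train_X[i][j] for i in val],
                                  [train_y[i] for i in val])
    # Model selection: lowest accumulated validation error.
    best = min(range(p), key=inner_error.__getitem__)
    # Outer loop: refit on the whole training set, assess on the test set only.
    model = fit_line([row[best] for row in train_X], train_y)
    test_error = mse(model, [row[best] for row in test_X], test_y)
    return best, test_error
```

The test set is never touched during model selection; it is used once, at the end, to estimate the prediction error of the selected model.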


To Download and Run: Click on the download button (it will direct you to Google Drive) and then press "Ctrl+S" (Windows) or "Cmd+S" (Mac) to save it as a zip file. Extract the .zip file and click on the .jar file to run the program.

Note: The program folder consists of three folders: "Data", "Lib" and "Output". For convenience, the user may keep input files in the "Data" folder and save output files in the "Output" folder. The "Lib" folder consists of the library files required for running the program. Check the format of the training set and test set input files (.xlsx/.xls/.csv) before using the program (sample files are provided in the "Data" folder). *A provisional manual is provided in the program folder.

File Format:  Compound number (first column), Descriptors (Subsequent Columns), Activity/Property (Last column)

Reference article for Double Cross-Validation
Baumann, D. and Baumann, K., 2014. Reliable estimation of prediction errors for QSAR models under model uncertainty using double cross-validation. J. Cheminformatics, 6(1), p.47. (Click Here)

*Program uploaded on 13 April 2016

To Download and Run: Click on the download button (it will direct you to Google Drive) and then press "Ctrl+S" (Windows) or "Cmd+S" (Mac) to save it as a zip file. Extract the .zip file and click on the .jar file to run the program.

Note: The program folder consists of three folders: "Data", "Lib" and "Output". For convenience, the user may keep input files in the "Data" folder and save output files in the "Output" folder. The "Lib" folder consists of the library files required for running the program. Check the format of the training set and test set input files (.xlsx/.xls/.csv) before using the program (sample files are provided in the "Data" folder). *A manual is provided in the program folder.

File Format:  Compound number (first column), Descriptors (Subsequent Columns), Activity/Property (Last column)

* Program uploaded on 2 March 2016

* A minor bug resolved!!! on 26 April 2016

*Last updated (Minor) on 4 May 2016

*Last updated on 3 June 2016

A genetic algorithm (GA) is a search heuristic that mimics the process of natural selection. Where an exhaustive search is impractical, heuristic methods are used to speed up the process of finding a satisfactory solution. Genetic algorithms belong to the larger class of evolutionary algorithms (EA), which generate solutions to optimization problems using techniques inspired by natural evolution, such as inheritance, crossover, mutation, and selection. Here, the Genetic Algorithm tool 4.0 applies the genetic algorithm to select significant variables (descriptors) during QSAR model development, using a fitness function based on the recently reported MAE-based criteria.
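As an illustration of the idea, the following is a minimal Python sketch of a genetic algorithm for descriptor selection. It is not the implementation used in the tool: the function name, the parameter defaults, tournament selection, one-point crossover and elitism are assumptions, and `fitness` is a hypothetical user-supplied function (in the tool it is derived from the MAE-based criteria, which are not reproduced here).

```python
import random

def genetic_selection(n_descriptors, fitness, pop_size=20, generations=30,
                      crossover_rate=0.8, mutation_rate=0.05, seed=0):
    """Each individual is a list of booleans: True means 'descriptor selected'.
    `fitness` maps such a list to a number (higher is better)."""
    rng = random.Random(seed)
    pop = [[rng.random() < 0.5 for _ in range(n_descriptors)]
           for _ in range(pop_size)]
    history = []
    for _ in range(generations):
        elite = max(pop, key=fitness)
        history.append(fitness(elite))
        next_pop = [elite[:]]  # elitism: the best individual always survives
        while len(next_pop) < pop_size:
            # Tournament selection of two parents.
            p1 = max(rng.sample(pop, 3), key=fitness)
            p2 = max(rng.sample(pop, 3), key=fitness)
            child = p1[:]
            if rng.random() < crossover_rate:
                cut = rng.randrange(1, n_descriptors)  # one-point crossover
                child = p1[:cut] + p2[cut:]
            # Mutation: flip each bit with a small probability.
            child = [(not g) if rng.random() < mutation_rate else g
                     for g in child]
            next_pop.append(child)
        pop = next_pop
    best = max(pop, key=fitness)
    history.append(fitness(best))
    return best, history
```

Because of elitism, the best fitness recorded in `history` never decreases from one generation to the next. The sketch assumes at least two descriptors and a population of at least three individuals.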


To Download and Run: Click on the download button (it will direct you to Google Drive) and then press "Ctrl+S" (Windows) or "Cmd+S" (Mac) to save it as a zip file. Extract the .zip file and click on the .jar file to run the program.

Note: The program folder consists of three folders: "Data", "Lib" and "Output". For convenience, the user may keep input files in the "Data" folder and save output files in the "Output" folder. The "Lib" folder consists of the library files required for running the program. Check the format of the training set and test set input files (.xlsx/.xls/.csv) before using the program (sample files are provided in the "Data" folder). *A manual is provided in the program folder.

File Format:  Compound number (first column), Descriptors (Subsequent Columns), Activity/Property (Last column)

Reference Articles for 'Fitness Function' employed in Genetic Algorithm Tools
[1] Ambure, Pravin, and Kunal Roy. "Understanding the structural requirements of cyclic sulfone hydroxyethylamines as hBACE1 inhibitors against Aβ plaques in Alzheimer's disease: a predictive QSAR approach." RSC Advances 6, 34 (2016): 28171-28186. DOI: 10.1039/C6RA04104C (Click Here)

[2] Roy, Kunal*, Rudra Narayan Das, Pravin Ambure, and Rahul B. Aher. "Be aware of error measures. Further studies on validation of predictive QSAR models." Chemometrics and Intelligent Laboratory Systems, Volume 152, 15 March 2016, Pages 18–33. doi:10.1016/j.chemolab.2016.01.008  (Click Here)

*Program updated on 2 March 2016

* Last updated on 3 June 2016

This tool selects the best descriptor combination out of a set of descriptors by evaluating all possible combinations of the descriptors in the input file. Along with conventional parameters like R2, Q2, Q2F1 and Q2F2, the prediction quality for the training as well as the test set is judged using the recently reported MAE-based criteria (see the reference article for the XternalValidationPlus Tool below). Further, using the MAE-based metrics, a QSAR score is computed that can be used to select the best QSAR models. The user can define r^2 cut-off and inter-correlation cut-off values, which are useful to reduce the computational time and to remove models with inter-correlated descriptors, respectively.

To Download and Run: Click on the download button (it will direct you to Google Drive) and then press "Ctrl+S" (Windows) or "Cmd+S" (Mac) to save it as a zip file. Extract the .zip file and click on the .jar file to run the program.

Note: Keep your input file in the same folder as the BestSubsetSelection.jar file. Sample input files are provided in the program folder.
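The enumeration step, together with the inter-correlation cut-off, can be sketched as follows. This is an illustrative Python sketch, not the tool's code: the function name, the descriptor names and the precomputed correlation dictionary `corr` are hypothetical, and model fitting and scoring of each surviving subset are left out.

```python
from itertools import combinations

def candidate_subsets(names, corr, size, corr_cutoff=0.9):
    """Yield descriptor subsets of the given size that contain no pair of
    inter-correlated descriptors. `corr[(a, b)]` holds the Pearson correlation
    of each descriptor pair, keyed in the same order as `names`."""
    for subset in combinations(names, size):
        pairs = combinations(subset, 2)
        if all(abs(corr[pair]) <= corr_cutoff for pair in pairs):
            yield subset

names = ["logP", "MW", "TPSA"]
corr = {("logP", "MW"): 0.95, ("logP", "TPSA"): -0.2, ("MW", "TPSA"): 0.1}
print(list(candidate_subsets(names, corr, size=2)))
# prints [('logP', 'TPSA'), ('MW', 'TPSA')]
```

Skipping inter-correlated subsets before any model is built is what saves computational time when the number of combinations is large.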

File Format:  Compound number (first column), Descriptors (Subsequent Columns), Activity/Property (Last column)

*Program uploaded on 23 September 2015

**Program updated on 6 January 2016 : Minor Updates

***Programs updated on 13 April 2016: More on Systematic Error

Xternal Validation Plus is a tool that computes all the required external validation parameters and also judges the prediction quality of a QSAR model based on the MAE-based criteria.


To Download and Run: Click on the download button (it will direct you to Google Drive) and then press "Ctrl+S" (Windows) or "Cmd+S" (Mac) to save it as a zip file. Extract the .zip file and click on the .jar file to run the program.

Note: The program folder consists of three folders: "Data", "Lib" and "Output". For convenience, the user may keep input files in the "Data" folder and save output files in the "Output" folder. The "Lib" folder consists of the library files required for running the program. Check the format of the test set input file (.xlsx/.xls/.csv) before using the program (a sample file is provided in the "Data" folder). *A manual is provided in the program folder.

File Format:  [First column] Observed Response Values (Test set only), [Second Column] Predicted Response Values (Test set only).
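For orientation, a few quantities that can be computed from the two columns described above (observed and predicted test set responses) are sketched below in Python. This is an assumed, simplified illustration and not the tool's code: it computes the mean absolute error (MAE), the standard deviation of the absolute errors, and Q2F2, which uses the test set mean. The thresholds of the MAE-based criteria and metrics that need training set information (such as Q2F1) are not reproduced here; see the reference article.

```python
from statistics import mean, stdev

def external_validation(observed, predicted):
    """Error-based metrics from observed and predicted test set responses."""
    abs_err = [abs(o - p) for o, p in zip(observed, predicted)]
    obs_mean = mean(observed)
    # Sum of squared prediction errors and total sum of squares of the test set.
    press = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    ss_test = sum((o - obs_mean) ** 2 for o in observed)
    return {
        "MAE": mean(abs_err),
        "SD_abs_err": stdev(abs_err),
        "Q2F2": 1 - press / ss_test,
    }

result = external_validation([1, 2, 3, 4], [1.1, 1.9, 3.2, 3.8])
print(round(result["MAE"], 3), round(result["Q2F2"], 3))
# prints 0.15 0.98
```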

Reference Article for XternalValidationPlus Tool
Roy K, Das RN, Ambure P, Aher RB, Be aware of error measures. Further studies on validation of predictive QSAR models. Chemom Intell Lab Sys, 2016, 152, 18–33.  doi:10.1016/j.chemolab.2016.01.008 (Click Here)

*Program uploaded on 18 February 2015

*Manual Updated on 6 January 2016 : Minor Update

*Minor bug resolved!!! on 26 April 2016

*Last Updated (Minor) on 4 May 2016

The “AD using Standardization approach” is a tool to find out compounds (test set/query compounds) that are outside the applicability domain of the built QSAR model and it also detects outliers present in the training set compounds.


To Download and Run: Click on the download button (it will direct you to Google Drive) and then press "Ctrl+S" (Windows) or "Cmd+S" (Mac) to save it as a zip file. Extract the .zip file and click on the .jar file to run the program.

Note: The program folder consists of three folders: "Data", "Lib" and "Output". For convenience, the user may keep input files in the "Data" folder and save output files in the "Output" folder. The "Lib" folder consists of the library files required for running the program. Check the format of the training set and test set input files (.xlsx/.xls/.csv) before using the program (sample files are provided in the "Data" folder). *A manual is provided in the program folder.

File Format:  Compound number (first column), Descriptors (Subsequent Columns).
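A rough Python sketch of a standardization-based applicability domain check is given below. It follows our reading of the reference article listed below and should be checked against it: the threshold of 3, the 1.28 multiplier, the use of the sample standard deviation, and all function and variable names are assumptions of this sketch, not details confirmed by this page.

```python
from statistics import mean, stdev

def ad_standardization(train, query, threshold=3.0):
    """Return one boolean per query compound: True = inside the domain.
    Descriptors are standardized with the training set mean and SD."""
    cols = list(zip(*train))
    mus = [mean(c) for c in cols]
    sds = [stdev(c) for c in cols]
    inside = []
    for row in query:
        # Absolute standardized values; constant descriptors are skipped.
        s = [abs(x - m) / sd for x, m, sd in zip(row, mus, sds) if sd > 0]
        if max(s) <= threshold:
            inside.append(True)       # every descriptor is within range
        elif min(s) > threshold:
            inside.append(False)      # every descriptor is out of range
        else:
            # Mixed case: decide on mean + 1.28 * SD of the standardized values.
            s_new = mean(s) + 1.28 * stdev(s)
            inside.append(s_new <= threshold)
    return inside
```

The same check can be applied to the training set compounds themselves to flag possible outliers.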


Reference Article for AD using Standardization Approach
Roy, Kunal, Supratik Kar, and Pravin Ambure. "On a simple approach for determining applicability domain of QSAR models." Chemometrics and Intelligent Laboratory Systems 145 (2015): 22-29. doi:10.1016/j.chemolab.2015.04.013 (Click Here)
*Program uploaded on 11 November 2014
*Version 1.2 uploaded on 13 January 2015
*Version 1.3(Modified Fitness Function) uploaded on 16th March 2015
*Important updated version 1.4 on 27 May 2015 (A bug resolved!!)

A genetic algorithm (GA) is a search heuristic that mimics the process of natural selection. Where an exhaustive search is impractical, heuristic methods are used to speed up the process of finding a satisfactory solution. Genetic algorithms belong to the larger class of evolutionary algorithms (EA), which generate solutions to optimization problems using techniques inspired by natural evolution, such as inheritance, crossover, mutation, and selection. Here, the Genetic Algorithm tool 1.2 applies the genetic algorithm to select significant variables (descriptors) during QSAR model development.


To Download and Run: Click on the download button (it will direct you to Google Drive) and then press "Ctrl+S" (Windows) or "Cmd+S" (Mac) to save it as a zip file. Extract the .zip file and click on the .jar file to run the program.

Note: The program folder consists of three folders: "Data", "Lib" and "Output". For convenience, the user may keep input files in the "Data" folder and save output files in the "Output" folder. The "Lib" folder consists of the library files required for running the program. Check the format of the training set input file (.xlsx/.xls/.csv) before using the program (sample files are provided in the "Data" folder). *A manual is provided in the program folder.

File Format:  Compound number (first column), Descriptors (Subsequent Columns), Activity/Property (Last column)



Reference Article
Ambure, P., Aher, R. B., Gajewicz, A., Puzyn, T., & Roy, K. (2015). “NanoBRIDGES” software: Open access tools to perform QSAR and nano-QSAR modeling. Chemometrics and Intelligent Laboratory Systems,  Volume 147, 15 October 2015, Pages 1–13.  doi:10.1016/j.chemolab.2015.07.007 
*Program uploaded on 11 November 2014
*Version 1.2 uploaded on 13 January 2015
*Nano QSAR models database updated on 5 June 2015

Nano-Profiler (endpoint-dependent analogues identification software) is a tool that predicts the respective property/endpoint of nanoparticles using nano-QSAR models already reported in the literature and present in the provided database (inside the "Data" folder), and then performs clustering to find analogues based on the respective predicted endpoint. We have also included three more clustering methods for analogue identification, i.e. the k-Medoids algorithm (slow and exhaustive; searches for the best ‘k’ medoids), modified k-Medoid (fast; searches for the optimum ‘k’ medoids) and a Euclidean distance-based method.


To Download and Run: Click on the download button (it will direct you to Google Drive) and then press "Ctrl+S" (Windows) or "Cmd+S" (Mac) to save it as a zip file. Extract the .zip file and click on the .jar file to run the program.

Note: The program folder consists of three folders: "Data", "Lib" and "Output". For convenience, the user may keep input files in the "Data" folder and save output files in the "Output" folder. The "Lib" folder consists of the library files required for running the program. Check the format of the input file (.xlsx/.xls/.csv) before using the program (sample input files are provided in the "Data" folder). *A manual (pdf) is provided in the program folder.

 File Format:  Compound number (first column), Descriptors (Subsequent Columns), Activity/Property (Last column)


Reference Article
Ambure, P., Aher, R. B., Gajewicz, A., Puzyn, T., & Roy, K. (2015). “NanoBRIDGES” software: Open access tools to perform QSAR and nano-QSAR modeling. Chemometrics and Intelligent Laboratory Systems,  Volume 147, 15 October 2015, Pages 1–13.  doi:10.1016/j.chemolab.2015.07.007
 *Updated program uploaded on 14 August 2014
 *Version 1.2 uploaded on 13 January 2015

The Applicability Domain - Model Disturbance Index (AD-MDI) program is a tool to define the applicability domain (AD) of unknown samples based on a concept proposed by Yan et al. (see reference below). For more information on defining the AD and finding the AD of query compounds, please read the article (see reference below). This program is an updated version of the previous one (now removed). The only difference between the two versions is that in the updated version the AD of query compounds can optionally be determined, which was not possible in the previous version.


To Download and Run: Click on the download button (it will direct you to Google Drive) and then press "Ctrl+S" (Windows) or "Cmd+S" (Mac) to save it as a zip file. Extract the .zip file and click on the .jar file to run the program.

Note: The program folder consists of three folders: "Data", "Lib" and "Output". For convenience, the user may keep input files in the "Data" folder and save output files in the "Output" folder. The "Lib" folder consists of the library files required for running the program. Check the format of the training and test set input files (.xlsx/.xls/.csv) and the query file (.xlsx/.xls/.csv) before using the program (sample input files are provided in the "Data" folder).

 File Format:  Compound number (first column), Descriptors (Subsequent Columns), Activity/Property (Last column)


Reference article for AD-MDI

Yan, Jun, et al. "A Combinational Strategy of Model Disturbance and Outlier Comparison to Define Applicability Domain in Quantitative Structural Activity Relationship." Molecular Informatics (2014). (click here)


Ambure, P., Aher, R. B., Gajewicz, A., Puzyn, T., & Roy, K. (2015). “NanoBRIDGES” software: Open access tools to perform QSAR and nano-QSAR modeling. Chemometrics and Intelligent Laboratory Systems,  Volume 147, 15 October 2015, Pages 1–13.  doi:10.1016/j.chemolab.2015.07.007

*Program uploaded on 1 August 2014 

*Version 1.2 uploaded on 13 January 2015

*Last updated on 3 June 2016

This tool performs stepwise MLR using one of two methods: 1) using the alpha value, or 2) using the F value. The user can also select the data pre-treatment option to remove constant and inter-correlated descriptors prior to performing stepwise MLR. Three output files are generated: 1) LogFile.txt: consists of the names of the descriptors (constant and/or inter-correlated) removed based on the variance and correlation-coefficient cut-offs; 2) SMLR.txt: information regarding the descriptors selected/removed along with validation parameters at each step, based on the F-value (F-to-Enter, F-to-Remove) or alpha-value (alpha-to-Enter, alpha-to-Remove) cut-offs; 3) xlsx/xls/csv file: consists of the set of descriptors selected (along with the activity/property column) after performing stepwise MLR.


To Download and Run: Click on the download button (it will direct you to Google Drive) and then press "Ctrl+S" (Windows) or "Cmd+S" (Mac) to save it as a zip file. Extract the .zip file and click on the .jar file to run the program.

Note: The program folder consists of three folders: "Data", "Lib" and "Output". For convenience, the user may keep the input file in the "Data" folder and save output files in the "Output" folder. The "Lib" folder consists of the library files required for running the program. Check the format of the input file (.xlsx/.xls/.csv) before using the program (a sample input file is provided in the "Data" folder).

*Manual is provided in the program folder (pdf).

 File Format:  Compound number (first column), Descriptors (Subsequent Columns), Activity/Property (Last column)


Reference Article
Ambure, P., Aher, R. B., Gajewicz, A., Puzyn, T., & Roy, K. (2015). “NanoBRIDGES” software: Open access tools to perform QSAR and nano-QSAR modeling. Chemometrics and Intelligent Laboratory Systems,  Volume 147, 15 October 2015, Pages 1–13.  doi:10.1016/j.chemolab.2015.07.007

*Program uploaded on 1 August 2014

*Version 1.2 uploaded on 13 January 2015

Modified K-Medoid is a simple and fast algorithm for K-medoids clustering (see reference below). The algorithm is a local heuristic that runs just like K-means clustering when updating the medoids. This method tends to select the k most centrally located objects as initial medoids. The algorithm calculates the distance matrix once and uses it for finding new medoids at every iterative step.

To Download and Run: Click on the download button (it will direct you to Google Drive) and then press "Ctrl+S" (Windows) or "Cmd+S" (Mac) to save it as a zip file. Extract the .zip file and click on the .jar file to run the program.

Note: The program folder consists of three folders: "Data", "Lib" and "Output". For convenience, the user may keep the input file in the "Data" folder and save output files in the "Output" folder. The "Lib" folder consists of the library files required for running the program. Check the format of the data file (.xlsx/.xls/.csv) before using the program (a sample input file is provided in the "Data" folder).
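The steps just described (distance matrix computed once, centrally located objects as initial medoids, K-means-like medoid update) can be sketched in Python as follows. This is a simplified illustration based on our reading of the reference article, not the tool's code: the function names are hypothetical, Euclidean distance is assumed, and the points are assumed to be distinct so that no cluster becomes empty.

```python
from math import dist

def k_medoids(points, k, max_iter=100):
    n = len(points)
    # Distance matrix is computed once and reused in every iteration.
    d = [[dist(points[i], points[j]) for j in range(n)] for i in range(n)]
    row_sums = [sum(row) for row in d]
    # Initial medoids: the k objects with the smallest normalized distance sum,
    # i.e. the most centrally located objects.
    v = [sum(d[i][j] / row_sums[i] for i in range(n)) for j in range(n)]
    medoids = sorted(range(n), key=v.__getitem__)[:k]

    def assign(meds):
        clusters = {m: [] for m in meds}
        for i in range(n):
            clusters[min(meds, key=lambda m: d[i][m])].append(i)
        return clusters

    for _ in range(max_iter):
        clusters = assign(medoids)
        # Update step: the new medoid of a cluster is the member with the
        # smallest total distance to the other members.
        new = [min(members, key=lambda c: sum(d[c][j] for j in members))
               for members in clusters.values()]
        if set(new) == set(medoids):
            break
        medoids = new
    return sorted(medoids), assign(medoids)

points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
medoids, clusters = k_medoids(points, k=2)
print(medoids)
# prints [0, 3]
```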

*Manual is provided in the program folder (pdf)

 File Format:  Compound number (first column), Descriptors (Subsequent Columns)

 Modified K-Medoid Reference article
Park, Hae-Sang, and Chi-Hyuck Jun. "A simple and fast algorithm for K-medoids clustering." Expert Systems with Applications 36.2 (2009): 3336-3341. (Click here)
Ambure, P., Aher, R. B., Gajewicz, A., Puzyn, T., & Roy, K. (2015). “NanoBRIDGES” software: Open access tools to perform QSAR and nano-QSAR modeling. Chemometrics and Intelligent Laboratory Systems,  Volume 147, 15 October 2015, Pages 1–13.  doi:10.1016/j.chemolab.2015.07.007

*Program uploaded on 09 June 2014

*Version 1.2 uploaded on 13 January 2015

This tool removes constant and highly correlated descriptors, based on user-specified variance and correlation cut-off values, using the V-WSP algorithm (see reference below).

To Download and Run: Click on the download button (it will direct you to Google Drive) and then press "Ctrl+S" (Windows) or "Cmd+S" (Mac) to save it as a zip file. Extract the .zip file and click on the .jar file to run the program.

Note: The program folder consists of three folders: "Data", "Lib" and "Output". For convenience, the user may keep the input file in the "Data" folder and save output files in the "Output" folder. The "Lib" folder consists of the library files required for running the program. Check the format of the data file (.xlsx/.xls/.csv) before using the program (a sample input file is provided in the "Data" folder).

*Manual is provided in the program folder (pdf)

 File Format:  Compound number (first column), Descriptors (Subsequent Columns), Activity/Property (Last column)

Reference article for Data Pre-treatment using V-WSP algorithm
Ballabio, Davide, et al. "A novel variable reduction method adapted from space-filling designs." Chemometrics and Intelligent Laboratory Systems (2014). (click here)
Ambure, P., Aher, R. B., Gajewicz, A., Puzyn, T., & Roy, K. (2015). “NanoBRIDGES” software: Open access tools to perform QSAR and nano-QSAR modeling. Chemometrics and Intelligent Laboratory Systems,  Volume 147, 15 October 2015, Pages 1–13.  doi:10.1016/j.chemolab.2015.07.007
*Version 1.2 uploaded on 13 January 2015

Dataset Division GUI is a user-friendly application that includes three different methods for dividing a dataset into training and test sets: Kennard-Stone based, Euclidean distance based (or diversity based) and activity/property based division.


To Download: Click on the download button (it will direct you to Google Drive) and then press "Ctrl+S" (Windows) or "Cmd+S" (Mac) to save it as a zip file. Extract the .zip file and click on the .jar file to run the program.

Note: The user may keep the dataset file in the “data” folder and may select the “output” folder for storing output files (a sample input file is provided in the “data” folder).
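Of the three methods, the Kennard-Stone selection is the easiest to illustrate. The Python sketch below is an assumed, minimal version and not the tool's code: the function name is hypothetical, Euclidean distance in descriptor space is assumed, and no descriptor scaling is applied.

```python
from itertools import combinations
from math import dist

def kennard_stone(points, n_train):
    """Return (training indices, test indices) for a list of descriptor rows."""
    n = len(points)
    # Start with the two most distant compounds.
    first, second = max(combinations(range(n), 2),
                        key=lambda ij: dist(points[ij[0]], points[ij[1]]))
    selected = [first, second]
    remaining = [i for i in range(n) if i not in selected]
    while len(selected) < n_train and remaining:
        # Add the compound whose nearest selected neighbour is farthest away.
        nxt = max(remaining,
                  key=lambda i: min(dist(points[i], points[s]) for s in selected))
        selected.append(nxt)
        remaining.remove(nxt)
    return selected, remaining

train, test = kennard_stone([(0,), (1,), (2,), (3,), (10,)], n_train=3)
print(train, test)
# prints [0, 4, 3] [1, 2]
```

The compounds left in `remaining` form the test set, so the training set spans the descriptor space as uniformly as possible.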

 File Format:  Compound number (first column), Descriptors (Subsequent Columns), Activity/Property (Last column)

Reference Article related to Dataset Division  
"Does Rational Selection of Training and Test Sets Improve the Outcome of QSAR Modeling?" published in  J. Chem. Inf. Model. (Click here).
Kennard, Ronald W., and Larry A. Stone. "Computer aided design of experiments." Technometrics 11.1 (1969): 137-148. (Click here)

The k-medoids algorithm is a clustering algorithm related to the k-means algorithm and the medoid shift algorithm. Both the k-means and k-medoids algorithms are partitional (breaking the dataset up into groups), and both attempt to minimize the distance between the points labeled as belonging to a cluster and a point designated as the center of that cluster. In contrast to the k-means algorithm, k-medoids chooses data points as centers (medoids). In this case, the cost is calculated using the Manhattan distance. For detailed information, see Wikipedia (Click here).


To Download: Click on the download button (it will direct you to Google Drive) and then press "Ctrl+S" (Windows) or "Cmd+S" (Mac) to save it as a zip file. Extract the .zip file and click on the .jar file to run the program.

***Please check the input file format (*.csv): serial number column (first column), descriptor columns, activity/property column (last column).

*Version 1.2 uploaded on 13 January 2015
*Version 1.3 uploaded on 6 January 2016: Major Update
* Last updated on 3 June 2016

This tool develops QSAR models by performing MLR and calculates internal and external validation parameters [1,2] of the developed models. It further judges the quality of the test set predictions, based on the prediction errors (i.e. MAE-based criteria), as GOOD, MODERATE or BAD. It also checks the Golbraikh and Tropsha model acceptability criteria [3] and determines the applicability domain (AD) using two methods, i.e. the standardization approach [5] and the Euclidean distance-based method [4]. One can also perform the Y-randomization test.


To Download and Run: Click on the download button (it will direct you to Google Drive) and then press "Ctrl+S" (Windows) or "Cmd+S" (Mac) to save it as a zip file. Extract the .zip file and click on the .jar file to run the program.

Note: The program folder consists of three folders: "Data", "Lib" and "Output". For convenience, the user may keep the input file in the "Data" folder and save output files in the "Output" folder. The "Lib" folder consists of the library files required for running the program. Check the format of the data file (.xlsx/.xls/.csv) before using the program (sample input files are provided in the "Data" folder).

*Manual is provided in the program folder (pdf)

File Format:  Compound number (first column), Descriptors (Subsequent Columns), Activity/Property (Last column)

Reference Articles related to MLR plus Validation Program
1. "On various metrics used for validation of predictive QSAR models with applications in virtual screening and focused library design", published in Com. Chem. High. T. Scr. This article describes all the general validation parameters very well. (Click here)

2. "Some case studies on application of "rm2" metrics for judging quality of quantitative structure-activity relationship predictions: Emphasis on scaling of response data", published in J. Compu. Chem. This article is about the novel "rm^2" parameter developed in our lab. (Click here)

3. "Beware of q2!", published in J. Mol. Graph. Model., and "Best practices for QSAR model development, validation, and exploitation", published in Mol. Inf., describe the Golbraikh and Tropsha criteria well. (Click Here and Click Here)

4. "Quantitative structure-activity relationship prediction of blood-to-brain partitioning behavior using support vector machine", published in Eur. J. Pharm. Sci. In this article, the Euclidean distance-based applicability domain scatter plot is demonstrated. (Click here)

5.  Roy, Kunal, Supratik Kar, and Pravin Ambure. "On a simple approach for determining applicability domain of QSAR models." Chemometrics and Intelligent Laboratory Systems 145 (2015): 22-29. (Click here)

This tool removes constant and highly correlated descriptors based on user-specified variance and correlation cut-off values.

The Train-Test version helps when applying data pretreatment to a training set: the descriptors removed from the training set are also removed from the corresponding test set.


To Download GUI version: Click on the download button (it will direct you to Google Drive) and then press "Ctrl+S" (Windows) or "Cmd+S" (Mac) to save it as a zip file. Extract the .zip file and click on the .jar file to run the program. Sample input files are provided in the program folder.
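A minimal Python sketch of this kind of pretreatment is shown below. It is an assumed illustration and not the tool's code: the function names, the default cut-offs, and the rule that the later of two inter-correlated descriptors is the one removed are all assumptions.

```python
from statistics import mean, pvariance

def pearson(a, b):
    """Pearson correlation of two non-constant columns."""
    ma, mb = mean(a), mean(b)
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    var_a = sum((x - ma) ** 2 for x in a)
    var_b = sum((y - mb) ** 2 for y in b)
    return cov / (var_a * var_b) ** 0.5

def select_descriptors(train, var_cutoff=0.0001, corr_cutoff=0.9):
    """Return the column indices kept after pretreating the training set."""
    cols = list(zip(*train))
    kept = []
    for j, col in enumerate(cols):
        if pvariance(col) <= var_cutoff:
            continue  # constant or near-constant descriptor
        if any(abs(pearson(col, cols[k])) > corr_cutoff for k in kept):
            continue  # inter-correlated with a descriptor already kept
        kept.append(j)
    return kept

def apply_selection(rows, kept):
    """Keep the same columns in any other set, e.g. the test set."""
    return [[row[j] for j in kept] for row in rows]

train = [[1, 2, 5, 1], [2, 4, 5, 0], [3, 6, 5, 1], [4, 8, 5, 0]]
kept = select_descriptors(train)
print(kept, apply_selection([[9, 18, 5, 1]], kept))
# prints [0, 3] [[9, 1]]
```

The selection is decided on the training set only and then applied unchanged to the test set, which is the point of the Train-Test version.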


This tool checks whether the compounds of the test set are representative of the training set (i.e. whether the test set structures are within the applicability domain or not). It is based on distance scores calculated by the Euclidean distance norm.

This tool shows the structural diversity in the selected dataset in terms of distance scores (calculated by the Euclidean distance norm).

Note: To observe the diversity among the compounds present in the dataset, plot a ‘scatter plot’ of Normalized Mean Distance vs. the respective Activity/Property.

Reference Articles related to Diversity Validation

An interesting article regarding a diversity assessment method was recently published in J. Chem. Inf. Model. (Click here, 2013). According to this article, descriptors based on atom topology (i.e., fingerprint-based descriptors and pharmacophore-based descriptors) can be successfully used for diversity assessment of the compounds present in a dataset.


The user may also use the normalized mean distance to calculate the "MODelability Index" (MODI), a quantitative means to quickly assess whether predictive QSAR model(s) can be obtained for a given chemical dataset, which is based on the "activity cliffs" concept. The MODelability Index was recently published in J. Chem. Inf. Model. (Click here, 2013)

*Version 1.2 uploaded on 13 January 2015

This test is performed to check the robustness of the QSAR model by building several random models via shuffling the dependent variable while keeping the independent variables unchanged. To pass the test, the resultant random models are expected to have significantly low r^2 and q^2 values over several trials. Another parameter, cRp^2, is also calculated, which should be more than 0.5 for passing this test.

To Download and Run: Click on the download button (it will direct you to Google Drive) and then press "Ctrl+S" (Windows) or "Cmd+S" (Mac) to save it as a zip file. Extract the .zip file and click on the .jar file to run the program.

Note: Keep your input file in the same folder as the yRandomization.jar file. Sample input files are provided in the program folder.
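The shuffling procedure can be sketched in Python as follows. This is a simplified illustration and not the tool's code: it uses a single-descriptor model so that r^2 is just the squared Pearson correlation (the tool itself works with MLR models), the function names are hypothetical, and cRp^2 is computed as R × sqrt(R^2 − Rr^2), where Rr^2 is the mean r^2 of the random models.

```python
import random
from statistics import mean

def r_squared(x, y):
    """Squared Pearson correlation (r^2 of a one-descriptor linear model)."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy ** 2 / (sxx * syy)

def y_randomization(x, y, n_trials=50, seed=0):
    rng = random.Random(seed)
    r2 = r_squared(x, y)
    random_r2 = []
    for _ in range(n_trials):
        shuffled = y[:]            # shuffle the response only;
        rng.shuffle(shuffled)      # the descriptor values stay unchanged
        random_r2.append(r_squared(x, shuffled))
    rr2 = mean(random_r2)
    crp2 = r2 ** 0.5 * max(0.0, r2 - rr2) ** 0.5
    return r2, rr2, crp2

x = list(range(10))
y = [3 * v + 2 for v in x]
r2, rr2, crp2 = y_randomization(x, y)
print(r2, rr2 < r2, crp2 > 0.5)
```

A model passes when the random models are clearly worse than the original one, which is reflected in a large cRp^2.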

Manual is provided in the program folder (pdf)

File Format: Compound number (first column), Descriptors (Subsequent Columns), Activity/Property (Last column)

This tool performs Leave-‘n’-Out cross-validation in the true sense, i.e. the program performs MLR calculations leaving out each possible combination of ‘n’ compounds, where n is a user-defined value. Since Leave-n-Out is computationally expensive, try to keep the value of ‘n’ low (up to 4 for 100 compounds in the training set). Check the generated output text file for the validation parameters (Q^2 and SDEP).

Note: The number of combinations (displayed when you start running the program) and the progress bar will help you decide the value of ‘n’. Please check the sample input file (*the last column must be the activity column, and there must be no compound no./serial no. column).

Additional Tools: Simple but Useful Tools

To normalize the data by scaling between  0 to 1. 
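For reference, scaling a column of values to the range 0 to 1 (min-max scaling) amounts to the following. This is a short Python sketch with a hypothetical function name; returning zeros for a constant column is an assumption of the sketch.

```python
def min_max_scale(values):
    """Scale a list of numbers linearly so that min -> 0 and max -> 1."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]  # constant column: assumption
    return [(v - lo) / (hi - lo) for v in values]

print(min_max_scale([2, 4, 6]))
# prints [0.0, 0.5, 1.0]
```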

This tool is helpful when you want to assign new names to the molecules present in an SDF. The first line of each molecule represents the molecule name (the molecule name can be missing); also note that in an SDF each molecule is separated from the next by the ‘$$$$’ notation. You can observe this if you open the SDF in any text editor like Notepad, WordPad or Notepad++. Please see the input file/s provided in the tool folder for further clarification.

This tool helps in checking whether the molecule 'name' (and not the file name) has been saved correctly when structures are drawn in structure editors like ChemDraw. Note that the file name and the molecule name are different, and one can check the molecule name by opening the MOL file in any text editor like Notepad, WordPad or Notepad++. The first line usually consists of the molecule name (the molecule name can be missing). For instance, while drawing more than one structure with the same common scaffold, we save each file as ‘*.mol’ with some name, say “molecule1.mol”, and then we just change the required functional group, keeping the common scaffold the same, and save the next molecule with a different name like “molecule2.mol”. However, in such cases the file name often changes but the molecule name (inside the MOL file) remains the same, which creates problems while analyzing the molecules in software that displays molecule names and not file names as the molecule identifier. Using this tool, you can identify those molecules for which the file name and the molecule name are different, so that you can correct them by manually changing the molecule names in the MOL file using any of the text editors mentioned above. Please see the input file/s provided in the tool folder for further clarification.
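The SDF layout described above (name on the first line of each molecule, ‘$$$$’ between molecules) makes renaming a matter of replacing the line that follows each separator. A small Python sketch under that assumption is shown below; the function name is hypothetical and this is not the tool's code.

```python
def rename_sdf(sdf_text, new_names):
    """Replace the name line of each molecule with the next entry of new_names."""
    out = []
    at_record_start = True   # the very first line is a molecule name
    k = 0
    for line in sdf_text.splitlines():
        if at_record_start and k < len(new_names):
            out.append(new_names[k])   # name line (possibly empty) is replaced
            k += 1
        else:
            out.append(line)
        # The line after a '$$$$' separator starts the next molecule.
        at_record_start = line.strip() == "$$$$"
    return "\n".join(out) + "\n"
```

Molecules beyond the end of `new_names` keep their original names.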

To assign file name as molecule name, the following tool can be used:

This tool extracts any property value from an SDF (only if that property is present in the SDF file). A typical case is a downloaded compound database in which the activity (or any other property) value is embedded in the SDF. When the SDF contains many molecules, extracting the activity value of each molecule by hand for further analysis is cumbersome. This tool extracts any such property along with the molecule name and saves both in a text file (the output file). Please see the input file/s provided in the tool folder for further clarification.

This tool extracts from the parent SDF file only those molecules whose names (not file names) are listed in a text file provided by the user, and saves them as an SDF. Please see the input file/s provided in the tool folder for further clarification.
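The extraction step can be sketched as below. This is only an illustration under stated assumptions: records end with ‘$$$$’, the molecule name is the first line of each record, and a property is stored as a header line of the form "> <TAG>" followed by its value on the next line. The class name, the method name and the example tag are hypothetical.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration; not the tool's actual source code.
final class SdfPropertySketch {

    /** Maps each molecule name to the value of the data item named `tag`. */
    static Map<String, String> extract(List<String> sdfLines, String tag) {
        Map<String, String> result = new LinkedHashMap<>();
        String name = "";
        boolean atRecordStart = true;
        for (int i = 0; i < sdfLines.size(); i++) {
            String line = sdfLines.get(i);
            if (atRecordStart) {
                name = line.trim();          // first line of a record is the molecule name
                atRecordStart = false;
            } else if (line.trim().equals("$$$$")) {
                atRecordStart = true;        // next line starts a new record
            } else if (line.startsWith(">") && line.contains("<" + tag + ">")
                    && i + 1 < sdfLines.size()) {
                // Only the first value line is read; molecules sharing a name overwrite each other.
                result.put(name, sdfLines.get(i + 1).trim());
            }
        }
        return result;
    }
}
```

The lines could be read with `java.nio.file.Files.readAllLines`, and the resulting name/value pairs written out one per line to produce a text file similar to the one described above.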

This simple tool computes the number of all possible combinations using the nCr (combination) formula. It can be helpful when deciding ‘r’ in leave-r-out cross-validation (available here). Given the values of ‘n’ and ‘r’, where n = total number of compounds in the training set and r = number of compounds removed during cross-validation, the number of combinations computed by this tool corresponds to the number of models that will eventually be developed and evaluated. The value of ‘r’ can therefore be adjusted to the computational power available. Please see the input file/s provided in the tool folder for further clarification.

This tool is especially helpful during QSAR model development when working with Excel sheets. It extracts only those compounds whose compound identifier (a number, not an alphanumeric or alphabetic label) is listed in a text file, and saves the extracted data as a separate Excel file. The compound identifiers in the text file (provided by the user) are matched against the numbers in the first column of the parent input file, and for each match the entire row is extracted and saved in the output file. The first column (containing the compound identifier) must therefore be present in the input file. Please see the input file/s provided in the tool folder for further clarification.
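As an illustration of the count involved, nCr = n! / (r! (n - r)!). The sketch below is not the tool's own code; the class and method names are hypothetical. It uses `BigInteger` because the count grows very quickly with ‘r’.

```java
import java.math.BigInteger;

// Hypothetical illustration; not the tool's actual source code.
final class CombinationCountSketch {

    /** Number of ways to choose r compounds out of n (0 if r is out of range). */
    static BigInteger nCr(int n, int r) {
        if (r < 0 || r > n) {
            return BigInteger.ZERO;
        }
        r = Math.min(r, n - r);              // symmetry: nCr == nC(n-r)
        BigInteger result = BigInteger.ONE;
        for (int i = 1; i <= r; i++) {
            // After step i the value equals C(n - r + i, i), so the division is exact.
            result = result.multiply(BigInteger.valueOf(n - r + i))
                           .divide(BigInteger.valueOf(i));
        }
        return result;
    }
}
```

For example, a training set of n = 30 compounds with r = 3 gives 30C3 = 4060 models, while n = 50 with r = 5 already gives 2,118,760.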

 File Format:  Compound number (first column), Descriptors or any property (Subsequent Columns).

This tool is especially helpful during QSAR model development when working with Excel sheets and you need to select a few descriptors out of all the descriptors (in columns) present in the Excel file (.xlsx, .xls, .csv). This situation arises when you have selected, say, 40-50 descriptors (out of 1000 or more) after an initial GFA analysis or molecular spectrum analysis and now want to extract only those selected descriptors from the original file containing all the descriptors. The tool extracts only the required descriptors, whose names are listed in a text file provided by the user, and saves the extracted data as a separate Excel file. The descriptor names in the text file are matched against the header names (first row) of the input file, and for each match the corresponding column is extracted. The names of the descriptors to be selected must therefore be identical in the text file and in the Excel input file. By default, the tool also extracts the first column (serial number) and the last column (activity), even if their header names are not listed in the text file. If your input file has no serial number column (first column) and/or no activity/property column (last column), remove the respective columns from the generated output file. Please see the input file/s provided in the tool folder for further clarification.

 File Format:  Compound number (first column), Descriptors (Subsequent Columns), Activity/Property (Last column)

This tool randomly shuffles the entire descriptor matrix while keeping the activity/property values unchanged. This is useful for preparing an X-randomized model. Please see the input file/s provided in the tool folder for further clarification.

 File Format:  Compound number (first column), Descriptors (Subsequent Columns), Activity/Property (Last column)

This tool randomly shuffles the activity values while keeping the descriptor matrix unchanged. This is useful for preparing a Y-randomized model. Please see the input file/s provided in the tool folder for further clarification.

 File Format:  Compound number (first column), Descriptors (Subsequent Columns), Activity/Property (Last column)

Y-randomization is one of the recommended validation methods in QSAR, and it requires developing 10, 50 or more Y-randomized models. This tool randomly shuffles the activity values while keeping the descriptor matrix unchanged, and it can generate more than one Y-randomized model: the user specifies the number of Y-randomized models to be generated. Although an MLR Y-randomization tool is also available on this website, this tool is useful when Y-randomized models are required for some other analysis. Please see the input file/s provided in the tool folder for further clarification.

 File Format:  Compound number (first column), Descriptors (Subsequent Columns), Activity/Property (Last column)
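The shuffling idea behind Y-randomization can be sketched as follows. This is a minimal illustration, not the tool's implementation: the class name, the method name and the seed parameter are assumptions, and only the activity column is handled (the descriptor rows are simply left in their original order).

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Random;

// Hypothetical illustration; not the tool's actual source code.
final class YRandomizationSketch {

    /**
     * Returns `count` independently shuffled copies of the activity column.
     * Each copy can be paired with the unchanged descriptor matrix to build
     * one Y-randomized model.
     */
    static List<List<Double>> shuffledActivities(List<Double> activity, int count, long seed) {
        Random rng = new Random(seed);       // fixed seed makes the runs reproducible
        List<List<Double>> runs = new ArrayList<>();
        for (int i = 0; i < count; i++) {
            List<Double> copy = new ArrayList<>(activity); // never modify the original
            Collections.shuffle(copy, rng);
            runs.add(copy);
        }
        return runs;
    }
}
```

Every shuffled copy contains exactly the same activity values as the original column, only assigned to different compounds, which is what breaks the descriptor-activity relationship in the randomized models.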

Contact

All the above programs have been developed in Java and are validated on known data sets.


For any query/suggestions contact :

Pravin Ambure (ambure.pharmait@gmail.com) 

Drug Theoretics and Cheminformatics Laboratory (DTC)

Department of Pharmaceutical Technology

Jadavpur University

Kolkata -700032


Acknowledgement

The programmer is highly grateful to the Department of Biotechnology, Government of India, for providing financial assistance.

* The following software tools were developed during 6 months (March - August 2014) of participation in the international project "NanoBridges" at Gdansk University, Poland (http://nanobridges.eu/), which received funding from the People Programme (Marie Curie Actions) of the European Union Seventh Framework Programme:

1. Stepwise MLR

2. Modified k-Medoid

3. vWSP

4. AD-MDI

5. Genetic Algorithm

6. Nano Profiler


Last Updated on 26 May 2016

C++ Programs

This program calculates qualitative validation parameters for linear discriminant analysis and pharmacophore/toxicophore analysis, such as sensitivity, specificity, accuracy, precision, F-measure, Matthews correlation coefficient (MCC), geometric mean (G-means), Cohen's kappa, Güner-Henry score and recall, for a selected threshold based on the confusion matrix.
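Several of these parameters follow directly from the four cells of a two-class confusion matrix (TP, FN, FP, TN). The sketch below uses the standard textbook definitions; it is written in Java for illustration only, is not the C++ program itself, and omits the Güner-Henry score. The class name is hypothetical. Recall is the same quantity as sensitivity.

```java
// Hypothetical illustration using standard definitions; not the program's source code.
final class ConfusionMatrixMetricsSketch {
    private final double tp, fn, fp, tn;

    ConfusionMatrixMetricsSketch(long tp, long fn, long fp, long tn) {
        this.tp = tp; this.fn = fn; this.fp = fp; this.tn = tn;
    }

    // A zero denominator yields NaN rather than an exception.
    double sensitivity() { return tp / (tp + fn); }   // also called recall
    double specificity() { return tn / (tn + fp); }
    double accuracy()    { return (tp + tn) / (tp + fn + fp + tn); }
    double precision()   { return tp / (tp + fp); }

    double fMeasure() {
        double p = precision(), r = sensitivity();
        return 2 * p * r / (p + r);
    }

    double mcc() {
        return (tp * tn - fp * fn)
                / Math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn));
    }

    double gMean() { return Math.sqrt(sensitivity() * specificity()); }

    double cohenKappa() {
        double n = tp + fn + fp + tn;
        double observed = (tp + tn) / n;
        // Agreement expected by chance from the row and column totals.
        double expected = ((tp + fp) * (tp + fn) + (tn + fn) * (tn + fp)) / (n * n);
        return (observed - expected) / (1 - expected);
    }
}
```

With TP = 40, FN = 10, FP = 5 and TN = 45, for instance, these definitions give sensitivity 0.8, specificity 0.9, accuracy 0.85 and Cohen's kappa 0.7.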

Program for the calculation of periodic table-based nano-descriptors for metals, nonmetals and semimetals.  * Manual is provided in the program folder (pdf)
For rational division of the dataset into training and test sets based on the Mahalanobis distance.  * Manual is provided in the program folder (pdf)
For rational division of the dataset into training and test sets based on the Euclidean distance.  * Manual is provided in the program folder (pdf)
To check the modelability of the dataset for nano-QSAR model development.  * Manual is provided in the program folder (pdf)
Standardization of the data/descriptor matrix based on the mean and standard deviation.  * Manual is provided in the program folder (pdf)
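Standardization by mean and standard deviation is usually the z-score, z = (x - mean) / SD, applied to each descriptor column. The sketch below is a Java illustration and not the C++ program itself; the class name is hypothetical, and the use of the sample standard deviation (divisor n - 1) is an assumption, since the page does not say which divisor the program uses.

```java
// Hypothetical illustration; not the program's actual source code.
final class StandardizeSketch {

    /** Returns the z-scores of one descriptor column. */
    static double[] zScores(double[] column) {
        int n = column.length;
        if (n < 2) {
            throw new IllegalArgumentException("need at least two values");
        }
        double sum = 0.0;
        for (double v : column) {
            sum += v;
        }
        double mean = sum / n;

        double squaredDeviations = 0.0;
        for (double v : column) {
            squaredDeviations += (v - mean) * (v - mean);
        }
        // Assumption: sample standard deviation (n - 1 in the denominator).
        double sd = Math.sqrt(squaredDeviations / (n - 1));

        double[] z = new double[n];
        for (int i = 0; i < n; i++) {
            // Assumption: a constant column (SD = 0) is mapped to all zeros.
            z[i] = (sd == 0.0) ? 0.0 : (column[i] - mean) / sd;
        }
        return z;
    }
}
```

After this transformation each column has mean 0 and unit standard deviation, so descriptors measured on different scales contribute comparably to distance calculations.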

For rational division of dataset and for diversity validation.  * Manual is provided in the program folder (pdf)

For rational division of dataset and for diversity validation.  * Manual is provided in the program folder (pdf)

Contact

All the above C++ programs have been validated on known data sets.

For any query/suggestion contact :

Rahul Balasaheb Aher (rahulba26@gmail.com)

Drug Theoretics and Cheminformatics Laboratory (DTC)

Department of Pharmaceutical Technology

Jadavpur University

Kolkata -700032


Last Updated on 22/5/2015

Acknowledgement
* The following C++ programs were developed during 7 months (May - December 2014) of participation in the international project "NanoBridges" at Gdansk University, Poland (http://nanobridges.eu/), which received funding from the People Programme (Marie Curie Actions) of the European Union Seventh Framework Programme:

1. Elemental Descriptors Calculator
2. Dataset Modelability (Modelability Index)
3. Dataset division using Kennard-Stone (based on Mahalanobis distance)
4. Dataset division using Kennard-Stone (based on Euclidean distance)
5. Mahalanobis Distance
6. Euclidean distance
7. Standardize 


Java Libraries used in developing the above software tools are as follows :

1. Java Statistical Classes (JSC)

2. Apache Commons Mathematics Library

3. Apache POI - the Java API for Microsoft Documents

4. XMLBeans

5. JMathPlot - interactive 2D and 3D plots

Website designed by : Pravin Ambure (ambure.pharmait@gmail.com)