Genetic algorithms and self-organizing maps: a powerful combination for modeling complex QSAR and QSPR problems.
Modeling non-linear descriptor-target activity/property relationships with many dependent descriptors has been a long-standing challenge in the design of biologically active molecules. In an effort to address this problem, we couple the supervised self-organizing map with the genetic algorithm. Although self-organizing maps are non-linear and topology-preserving techniques that holdgreat potential for modeling and decoding relationships, the large number of descriptors in typical quantitative structure-activity relationship or quantitative structure-property relationship analysis may lead to spurious correlation(s) and/or difficulty in the interpretation of resulting models. To reduce the number of descriptors to a manageable size, we chose the genetic algorithm for descriptor selection because of its flexibility and efficiency in solving complex problems. Feasibility studies were conducted using six differentdatasets, of moderate-to-large size and moderate-to-great diversity; each with adifferent biological endpoint. Since favorable training set statistics do not necessarily indicate a highly predictive model, the quality of all models was confirmed by withholding a portion of each dataset for external validation. We also address the variability introduced onto modeling through dataset partitioning and through the stochastic nature of the combined genetic algorithmsupervised self-organizing map method using the z-score and other tests. Experiments show that the combined method provides comparable accuracy to the supervised self-organizing map alone, but using significantly fewer descriptors in the models generated. We observed consistently better results than partial least squares models. We conclude that the combination of genetic algorithms with the supervised self-organizing map shows great potential as a quantitative structure-activity/property relationship modeling tool.