Algorithmic Procedures for Compound Selection for the MLSMR Collection
The MLSMR collection of over 300K compounds was built up primarily in three stages of 100K compounds each. Below you will find a broad description of the algorithmic procedures used in constructing the set. For more details, please contact Jamie Driscoll at NIMH.
Stage 1 (the 1st 100K set): Compounds in the MLSMR collection are grouped into one of the following five categories: (a) specialty sets (SS), comprising bioactive compounds such as known drugs and toxins, (b) non-commercial compounds, mainly from academic labs, (c) targeted libraries (TL), (d) natural products (NP), and (e) diversity compounds (DC), that is, a diverse set that complements the first three categories. Compounds fulfilling the criteria for more than one category are assigned to the highest qualifying category based on the ranking DC < TL < SS < NP.
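As an illustration of the assignment rule, the following minimal Python sketch picks the highest-ranked category for a compound that qualifies for several. The names `CATEGORY_RANK` and `assign_category` are hypothetical, and the non-commercial category is left out because the stated ranking does not place it.

```python
# Ranking taken from the text: DC < TL < SS < NP (a higher number wins).
# Assumption: the non-commercial category is omitted because the text
# does not say where it falls in the ranking.
CATEGORY_RANK = {"DC": 0, "TL": 1, "SS": 2, "NP": 3}


def assign_category(qualifying):
    """Return the highest-ranked category among those a compound qualifies for.

    `qualifying` is any iterable of category codes; unknown codes are ignored.
    Returns None when no ranked category qualifies.
    """
    ranked = [code for code in qualifying if code in CATEGORY_RANK]
    if not ranked:
        return None
    return max(ranked, key=CATEGORY_RANK.__getitem__)
```

For example, a compound that meets both the DC and SS criteria would be assigned to SS under this rule.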
Standard vendor-supplied purity of > 90%, availability (10 mg), and re-supply (20 mg) were required. For most categories (see below), the following additional criteria were added: calculated water solubility of > 20 ug/mL (using the Tetko AlogPS program), and suitability for screening as assessed by implementing excluded functionality filters using Daylight SMARTS.
It is important to note that the purity and identity of each compound is experimentally assessed by DPI (http://mlsmr.evotec.com/MLSMR_HomePage/quality.html) using LC-UV/ELSD/MS, before the compound is included as part of the MLSMR collection.
The target for the SS set was a highly selective collection of ~1500 proven bioactive compounds, hand-picked under the rubric of “compounds known to interact with biological systems in a functional manner”. This includes, for example, registered drugs and known toxins, but not necessarily screening hits or compounds with activity against individual targets in in vitro assays. Note that most NP compounds will fulfill the criteria for the SS set, so given the above assignment ranking rules, the NP set should be included within the complete SS set. These compounds were not explicitly filtered for diversity, Lipinski Rule of Five, calculated water solubility, or undesirable substructure content.
To be included in the NP set, an exact structural match must exist in either the Chapman and Hall Natural Products Database or the Wiley AntiBase, or a reference in the modern literature must unambiguously cite a natural source or delineate its biosynthetic origin. The NP set contains only homogeneous, purified natural products with known structures. Primary extracts and broths are excluded, as are synthetic derivatives of otherwise bona fide natural products. These compounds were to satisfy the following criteria: (a) 10 mg availability, (b) MW < 2500, (c) purity of > 90%. The compounds were not filtered for diversity. The calculated water solubility and substructure content filters (mentioned above) were initially applied, but were later deemed unreasonably restrictive and were removed (see below), as were the additional physico-chemical targets of cLogP < 5, HBA < 20, and HBD < 10.
The 15K TL set was roughly equally divided into protease, kinase, GPCR, ion channel, and nuclear receptor targets. The target class was assigned as specified by the chosen vendors without further analysis. Lipinski Rule of Five, calculated water solubility, and substructure content filters were applied to this set.
For the 80K DC set, Daylight fingerprints and clustering procedures were used to generate the final list of compounds to purchase. Topological fingerprints for the diversity analysis were calculated using the Daylight Similarity toolkit using a 4096 bit fingerprint (without folding) based on all pathways up to 14 bonds in length. The DC diversity selection was made independently from the other three (much smaller) categories; no attempt was made to adjust the selection to remove any putative bias to the overall diversity profile.
The diversity of each chosen vendor’s filtered set was assessed using the Willett average pairwise cosine similarities method: each set measured between 0.72 and 0.75. A set of diverse structures within each compound class was chosen according to the following algorithm:
- From the first vendor, select the structure closest to the cosine centroid (Willett method).
- Change to the next vendor.
- Calculate the Tanimoto similarity between each selected structure and each of the current vendor’s remaining structures.
- From among the current vendor’s remaining structures, choose the structure most dissimilar to the selected set (Willett MaxMin algorithm).
- Add the structure to the selected set.
- Return to the second step (change to the next vendor) and repeat until an adequate number of diverse structures is selected.
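The steps above can be sketched in plain Python as follows. This is a minimal illustration under stated assumptions, not the procedure as actually implemented: fingerprints are represented as sets of on-bit indices rather than Daylight fingerprints, vendors are visited round-robin, and the names `tanimoto`, `cosine_to_centroid`, and `select_diverse` are hypothetical.

```python
import math


def tanimoto(a, b):
    """Tanimoto similarity of two fingerprints given as sets of on-bit indices."""
    union = len(a | b)
    return len(a & b) / union if union else 1.0


def cosine_to_centroid(fp, bit_counts, n):
    """Cosine similarity between a binary fingerprint and the mean of n fingerprints."""
    dot = sum(bit_counts[bit] for bit in fp) / n
    norm_fp = math.sqrt(len(fp))
    norm_centroid = math.sqrt(sum((c / n) ** 2 for c in bit_counts.values()))
    return dot / (norm_fp * norm_centroid) if norm_fp and norm_centroid else 0.0


def select_diverse(vendors, n_wanted):
    """Pick up to n_wanted compound ids across vendors.

    vendors: list of dicts {compound_id: set_of_on_bits}, one dict per vendor.
    Assumption: the first vendor's dict is not empty.
    """
    pools = [dict(v) for v in vendors]  # shallow copies; inputs are not mutated

    # Seed: the first vendor's structure closest to that vendor's cosine centroid.
    first = pools[0]
    counts = {}
    for fp in first.values():
        for bit in fp:
            counts[bit] = counts.get(bit, 0) + 1
    seed = max(first, key=lambda cid: cosine_to_centroid(first[cid], counts, len(first)))
    selected = [(seed, first.pop(seed))]

    turn = 0
    while len(selected) < n_wanted and any(pools):
        # Change to the next vendor, skipping vendors with nothing left.
        turn = (turn + 1) % len(pools)
        pool = pools[turn]
        if not pool:
            continue

        # MaxMin: score each candidate by its highest Tanimoto similarity to the
        # selected set, then take the candidate with the lowest such score.
        def nearest_similarity(cid):
            return max(tanimoto(pool[cid], fp) for _, fp in selected)

        pick = min(pool, key=nearest_similarity)
        selected.append((pick, pool.pop(pick)))

    return [cid for cid, _ in selected]
```

Ties are broken by dictionary insertion order here, which is an arbitrary choice made only to keep the sketch deterministic.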
Compounds were collected in micro-clusters containing up to five structures. The micro-clusters were designed to provide (1) incipient SAR around a screening hit, and (2) re-supply of a similar compound for depleted compounds that cannot be replenished.
To meet the micro-cluster objectives, MLSMR acquired up to four nearest neighbors for each selected diverse structure, provided the Tanimoto similarity was at least 0.85 (to set a baseline similarity to the selected diverse structure) but not more than 0.99 (to ensure exclusion of duplicates).
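The neighbor rule can be illustrated with a short Python sketch. It assumes fingerprints are stored as sets of on-bit indices and that candidates are ranked most similar first; the function names `tanimoto` and `micro_cluster` are hypothetical, while the limits (up to four neighbors, similarity between 0.85 and 0.99) come from the text.

```python
def tanimoto(a, b):
    """Tanimoto similarity of two fingerprints given as sets of on-bit indices."""
    union = len(a | b)
    return len(a & b) / union if union else 1.0


def micro_cluster(seed_fp, candidates, max_neighbors=4, low=0.85, high=0.99):
    """Return ids of up to max_neighbors candidates with low <= similarity <= high.

    candidates: dict {compound_id: set_of_on_bits}.
    Results are ordered most similar first; ties are broken by compound id.
    """
    scored = []
    for cid, fp in candidates.items():
        sim = tanimoto(seed_fp, fp)
        if low <= sim <= high:  # the upper bound excludes duplicates
            scored.append((sim, cid))
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [cid for _, cid in scored[:max_neighbors]]
```

Together with the selected diverse structure itself, four neighbors give the micro-cluster size of up to five structures.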
Stage 2 (the 2nd 100K set): Compounds for this round were selected using largely the same procedures as in Stage 1, with the following exceptions:
- Excluded functionality filters were modified to remove the most restrictive of the “druggability” criteria to reflect the fact that MLI’s aims are to find chemical probes and not necessarily drugs.
- Two different diversity approaches were adopted in this round as described below.
- For the TL and DC collections, the physico-chemical property requirements from Stage 1 were relaxed. Four physico-chemical categories, (A), (B), (C), and (D), were devised with the requirement that roughly >=25%, >=50%, >=75%, and 100% of the entire MLSMR collection would belong to categories (A) through (D), respectively.
- (A): MW <= 300; ClogP <= 3; HBD <= 3; HBA <= 6; calculated solubility >= 40 ug/mL
- (B): MW <= 400; ClogP <= 4; HBD <= 4; HBA <= 8; calculated solubility >= 30 ug/mL
- (C): MW <= 500; ClogP <= 5; HBD <= 5; HBA <= 10; calculated solubility >= 20 ug/mL
- (D): MW <= 600; ClogP <= 6; HBD <= 6; HBA <= 12; calculated solubility >= 10 ug/mL
- In addition to the Tetko calculated solubility, the ACD Labs Solubility Batch software was also incorporated into the calculation of aqueous solubilities.
- In addition to the Daylight topological fingerprints used in Stage 1, another diversity metric based on the MDL MACCS keys was introduced. These methods were used iteratively so that around half of the compounds selected came from each method.
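The four physico-chemical categories (A) through (D) listed above are nested, so a compound that satisfies (A) also satisfies (B), (C), and (D). The following Python sketch shows one way to find the tightest category for a compound and to check the cumulative fractions against the stated targets. It assumes a single calculated solubility value per compound (the text does not say how the Tetko and ACD Labs values were combined), and the names `CATEGORY_LIMITS`, `tightest_category`, and `cumulative_fractions` are hypothetical.

```python
# Each row: name, max MW, max ClogP, max HBD, max HBA, min solubility (ug/mL).
# Thresholds are copied from the category definitions; tightest category first.
CATEGORY_LIMITS = [
    ("A", 300, 3, 3, 6, 40),
    ("B", 400, 4, 4, 8, 30),
    ("C", 500, 5, 5, 10, 20),
    ("D", 600, 6, 6, 12, 10),
]


def tightest_category(mw, clogp, hbd, hba, solubility):
    """Return the tightest category the compound satisfies, or None if it fails (D)."""
    for name, max_mw, max_clogp, max_hbd, max_hba, min_sol in CATEGORY_LIMITS:
        if (mw <= max_mw and clogp <= max_clogp and hbd <= max_hbd
                and hba <= max_hba and solubility >= min_sol):
            return name
    return None


def cumulative_fractions(compounds):
    """Fraction of compounds satisfying each category.

    compounds: non-empty list of dicts with keys mw, clogp, hbd, hba, solubility.
    Because the categories are nested, the fraction for (B) includes (A), and so on.
    """
    order = [row[0] for row in CATEGORY_LIMITS]
    tightest = [tightest_category(**c) for c in compounds]
    fractions = {}
    for i, name in enumerate(order):
        allowed = set(order[: i + 1])
        fractions[name] = sum(t in allowed for t in tightest) / len(compounds)
    return fractions
```

The resulting fractions can then be compared with the rough targets of 25%, 50%, 75%, and 100%.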
Stage 3 (the 3rd 100K set): In this case all of the criteria for filters established in Stage 2 were retained with the following differences:
- The diversity implementation was altered.
- The DTP/NCI compound collection was included along with commercial vendor supplied compounds for this round of purchases.
One of the aims of building a specific set for screening, apart from diversity, is to ease the creation of potential SAR hypotheses that medicinal chemists can use for improving molecules obtained from the screening collection. Since the natural language of medicinal chemists is substructures, chemotypes, or scaffolds obtained from 2D representations of molecules, a computational procedure that is faithful to this language was used for building out the MLSMR collection. Specifically, an implementation of the Bemis-Murcko definition of scaffolds was used for all compound selection.
The set of known biologically active molecules made available by GVK Biosciences was used for defining this space. Note that the SS set, based on a more restrictive concept of “proven biologically active compounds”, is intended to be highly exclusive and should not be confused with the GVK or other exhaustive bioactive literature databases. The GVK collection is obtained from published articles and patents and was human curated. The detailed GVK target/disease annotations were grouped into generic bins (e.g., proteases, kinases, GPCRs, HIV, heart disease, etc.) to ensure some level of “biological diversity”. About 30K compounds purchased in this round were targeted using known bioactives.
Detailed Approach: The GVK collection, our vendor collection, and the existing MLSMR collection were all converted into Bemis-Murcko scaffolds.
- “Known Biologicals”: Compounds in the vendor collection with a scaffold-level similarity of >0.96 to the GVK set, based on topological torsions, were tagged as “known biologicals”.
- A machine learning based predictive model that separates the existing MLSMR collection from the vendor collection at the scaffold level was built. This extensively cross-validated model was used to rank-order the scaffolds in the vendor collection that are the most different from the existing Stage 2 collection. Topological torsions were used as descriptors.
- Finally, compounds were selected from the scaffolds in this rank-ordered list that contained at least 5-10 compounds. If a scaffold did not contain enough compounds, compounds from neighboring scaffolds (based on a scaffold-level distance matrix using topological torsions) were selected for inclusion.
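The final selection step can be sketched as follows. This is a rough Python illustration only: it takes the rank-ordered scaffold list, scaffold membership, and neighbor lists as given inputs, uses a minimum of five compounds per scaffold as an assumed reading of "at least 5-10", and the names `pick_compounds`, `members`, and `neighbors` are hypothetical.

```python
def pick_compounds(ranked_scaffolds, members, neighbors, min_size=5):
    """Select compounds scaffold by scaffold, backfilling from neighboring scaffolds.

    ranked_scaffolds: scaffold ids, most different from the existing collection first.
    members: dict {scaffold_id: [compound_id, ...]}.
    neighbors: dict {scaffold_id: [neighbor_scaffold_id, ...]}, nearest first.
    min_size: assumed minimum number of compounds wanted per scaffold.
    Returns a list of (scaffold_id, [compound_id, ...]) in rank order.
    """
    chosen = []
    used = set()  # compounds already selected for an earlier scaffold
    for scaffold in ranked_scaffolds:
        group = [c for c in members.get(scaffold, []) if c not in used]
        # Backfill from neighboring scaffolds until the group is large enough.
        for neighbor in neighbors.get(scaffold, []):
            if len(group) >= min_size:
                break
            for compound in members.get(neighbor, []):
                if len(group) >= min_size:
                    break
                if compound not in used and compound not in group:
                    group.append(compound)
        used.update(group)
        chosen.append((scaffold, group))
    return chosen
```

A scaffold with only three compounds would, under these assumptions, be topped up with two compounds from its nearest neighboring scaffold.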