MLSMR Compounds
Algorithmic Procedures for Compound Selection for the MLSMR Collection
The MLSMR collection of over 300K compounds was built up primarily in three stages of 100K compounds each. Below is a broad description of the algorithmic procedures used in constructing the set. For more details, please contact Jamie Driscoll at NIMH.
Stage 1 (the 1st 100K set): Compounds in the MLSMR collection are generically grouped into one of the following five categories: (a) specialty sets (SS), comprising bioactive compounds such as known drugs and toxins, (b) non-commercial compounds, mainly from academic labs, (c) targeted libraries (TL), (d) natural products (NP), and (e) diversity compounds (DC), that is, a diverse set that complements the first three categories. Compounds fulfilling the criteria for more than one category are assigned to the highest qualifying category based on the ranking DC < TL < SS < NP. Standard vendor-supplied purity of > 90%, availability (10 mg), and re-supply (20 mg) were required. For most categories (see below), the following additional criteria were applied: calculated water solubility of > 20 µg/mL (using the Tetko AlogPS program), and suitability for screening as assessed by excluded-functionality filters implemented with Daylight SMARTS. It is important to note that the purity and identity of each compound are experimentally assessed by DPI (http://mlsmr.glpg.com/MLSMR_HomePage/quality.html) using LC-UV/ELSD/MS before the compound is included in the MLSMR collection.
The target for the SS set was a highly selective collection of ~1500 proven bioactive compounds, hand-picked under the rubric of “compounds known to interact with biological systems in a functional manner”. This includes, for example, registered drugs and known toxins, but not necessarily screening hits or compounds whose only evidence is activity against individual targets in in-vitro assays. Note that most NP compounds will fulfill the criteria for the SS set, so given the above assignment ranking rules, the NP set should be included within the complete SS set. These compounds were not explicitly filtered for diversity, Lipinski Rule of Five, calculated water solubility, or undesirable substructure content.
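The Stage 1 assignment rule (a compound that qualifies for several categories goes to the highest-ranked one, with DC < TL < SS < NP) can be sketched as follows. This is only an illustration: the names CATEGORY_RANK and assign_category are hypothetical, and because the text does not place the non-commercial category in the ranking, it is left out here.

```python
# Ranking as stated in the text: DC < TL < SS < NP (NP is the highest).
# Assumption: the non-commercial category is omitted because the text
# does not say where it falls in this ranking.
CATEGORY_RANK = {"DC": 0, "TL": 1, "SS": 2, "NP": 3}


def assign_category(qualifying):
    """Return the highest-ranked category a compound qualifies for.

    `qualifying` is any iterable of category codes. Codes that are not
    part of the stated ranking are ignored. Returns None if no ranked
    category applies.
    """
    ranked = [c for c in qualifying if c in CATEGORY_RANK]
    if not ranked:
        return None
    return max(ranked, key=CATEGORY_RANK.__getitem__)


if __name__ == "__main__":
    # A natural product that is also a known drug is filed under NP.
    print(assign_category({"SS", "NP"}))
```

Under this rule a compound qualifying for both SS and NP is stored as NP, which is why the text notes that the NP set should be counted as part of the complete SS set.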
To be included in the NP set, a compound must have an exact structural match in either the Chapman and Hall Natural Products Database or the Wiley AntiBase, or a reference in the modern literature unambiguously citing a natural source or delineating its biosynthetic origin. The NP set contains only homogeneous, purified natural products with known structures. Primary extracts and broths are excluded, as are synthetic derivatives of otherwise bona fide natural products. These compounds had to satisfy the following criteria: (a) 10 mg availability, (b) MW < 2500, (c) purity of > 90%. The compounds were not filtered for diversity. The calculated water solubility and substructure content filters (mentioned above) were initially applied, but were later deemed unreasonably restrictive and were removed (see below), as were the additional physico-chemical targets of cLogP < 5, HBA < 20 and HBD < 10.
The 15K TL set was roughly equally divided into protease, kinase, GPCR, ion channel, and nuclear receptor targets. The target class was assigned as specified by the chosen vendors without further analysis. Lipinski Rule of Five, calculated water solubility, and substructure content filters were applied to this set.
For the 80K DC set, Daylight fingerprints and clustering procedures were used to generate the final list of compounds to purchase. Topological fingerprints for the diversity analysis were calculated with the Daylight Similarity toolkit as 4096-bit fingerprints (without folding) based on all pathways up to 14 bonds in length. The DC diversity selection was made independently from the other three (much smaller) categories; no attempt was made to adjust the selection to remove any putative bias to the overall diversity profile. The diversity of each chosen vendor’s filtered set was assessed using the Willett average pairwise cosine similarities method: each set measured between 0.72 and 0.75.
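To make the average pairwise cosine similarity measure concrete, here is a minimal sketch. It assumes each fingerprint is represented as a Python set of "on" bit positions (a stand-in for the Daylight fingerprints, which are not reproduced here), and all function names are hypothetical. The second function uses the algebraic identity that the squared length of the sum of unit vectors equals N plus twice the sum of pairwise cosines, which avoids the quadratic loop; whether the original analysis used this exact form is not stated in the text.

```python
import math
from itertools import combinations


def _check(fps):
    """Both routines need at least two non-empty fingerprints."""
    if len(fps) < 2 or any(len(fp) == 0 for fp in fps):
        raise ValueError("need at least two non-empty fingerprints")


def cosine(a, b):
    """Cosine similarity of two binary fingerprints given as bit sets."""
    return len(a & b) / math.sqrt(len(a) * len(b))


def mean_pairwise_cosine_naive(fps):
    """Average cosine similarity over all unordered pairs, O(N^2)."""
    _check(fps)
    pairs = list(combinations(fps, 2))
    return sum(cosine(a, b) for a, b in pairs) / len(pairs)


def mean_pairwise_cosine_centroid(fps):
    """Same value via the sum of unit-normalised fingerprints, O(N * bits).

    If S is the sum of the unit vectors, then |S|^2 = N + 2 * (sum of
    pairwise cosines), so the mean over pairs is (|S|^2 - N) / (N * (N - 1)).
    """
    _check(fps)
    n = len(fps)
    total = {}
    for fp in fps:
        weight = 1.0 / math.sqrt(len(fp))  # each set bit contributes 1/|fp|^0.5
        for bit in fp:
            total[bit] = total.get(bit, 0.0) + weight
    squared_norm = sum(v * v for v in total.values())
    return (squared_norm - n) / (n * (n - 1))


if __name__ == "__main__":
    toy = [{1, 2, 3}, {2, 3, 4}, {1, 4, 5, 6}]
    print(mean_pairwise_cosine_naive(toy), mean_pairwise_cosine_centroid(toy))
```

Both functions return the same number for the same input; the centroid form matters only for large sets, where enumerating every pair becomes expensive.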
A set of diverse structures within each compound class was chosen according to the following algorithm: Compounds were collected in micro-clusters containing up to five structures. The micro-clusters were designed to provide (1) incipient SAR around a screening hit, and (2) re-supply of a similar compound for depleted compounds that cannot be replenished. To meet the micro-cluster objectives, MLSMR acquired up to four nearest neighbors for each selected diverse structure, provided the Tanimoto similarity was at least 0.85 (to set a baseline similarity to the selected diverse structure) but not more than 0.99 (to ensure exclusion of duplicates).
Stage 2 (the 2nd 100K set): Procedures similar to those of Stage 1 were mostly used in selecting compounds for inclusion in this round, with the following exceptions:
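The Stage 1 micro-cluster rule (one selected diverse structure plus up to four nearest neighbors with Tanimoto similarity between 0.85 and 0.99) can be sketched as below. This is an illustration under stated assumptions, not the procedure actually run: fingerprints are represented as Python sets of "on" bit positions, the function names are hypothetical, and ties are broken by compound ID purely to keep the output deterministic.

```python
def tanimoto(a, b):
    """Tanimoto similarity of two binary fingerprints given as bit sets."""
    union = len(a | b)
    return len(a & b) / union if union else 0.0


def micro_cluster(seed_id, fingerprints, lo=0.85, hi=0.99, max_neighbors=4):
    """Return [seed, neighbor, ...] with at most `max_neighbors` neighbors.

    A neighbor qualifies when lo <= Tanimoto(seed, neighbor) <= hi:
    the lower bound keeps the cluster close to the seed, the upper
    bound excludes duplicates and near-duplicates.
    """
    seed_fp = fingerprints[seed_id]
    candidates = []
    for compound_id, fp in fingerprints.items():
        if compound_id == seed_id:
            continue
        similarity = tanimoto(seed_fp, fp)
        if lo <= similarity <= hi:
            candidates.append((similarity, compound_id))
    # Most similar first; compound ID as an arbitrary tie-breaker.
    candidates.sort(key=lambda item: (-item[0], item[1]))
    return [seed_id] + [cid for _, cid in candidates[:max_neighbors]]


if __name__ == "__main__":
    toy = {
        "seed": set(range(100)),
        "dup": set(range(100)),  # similarity 1.00, rejected as a duplicate
        "far": set(range(50)),   # similarity 0.50, rejected as too distant
        "a": set(range(95)),     # similarity 0.95, accepted
        "b": set(range(90)),     # similarity 0.90, accepted
    }
    print(micro_cluster("seed", toy))
```

With the seed included, a cluster never exceeds five structures, matching the stated micro-cluster size.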
Stage 3 (the 3rd 100K set): In this case, all of the filter criteria established in Stage 2 were retained, with the following differences:
One of the aims of building a specific set for screening, apart from diversity, is to ease the creation of potential SAR hypotheses that medicinal chemists can use for improving molecules obtained from the screening collection. Since the natural language of medicinal chemists is substructures, chemotypes, or scaffolds obtained from 2D representations of molecules, a computational procedure that is faithful to this language was used for building out the MLSMR collection. Specifically, an implementation of the Bemis-Murcko definition of scaffolds was used for all compound selection. The set of known biologically active molecules made available by GVK Biosciences was used for defining this space. Note that the SS set, based on a more restrictive concept of “proven biologically active compounds”, is intended to be highly exclusive and should not be confused with the GVK or other exhaustive bioactive literature databases. The GVK collection is obtained from published articles and patents and was human curated. The detailed GVK target/disease annotations were grouped into generic bins (e.g., proteases, kinases, GPCRs, HIV, heart disease, etc.) to ensure some level of “biological diversity”. About 30K compounds purchased in this round were targeted using known bioactives.
Detailed Approach: Specifically, both the GVK and our vendor collections, along with the existing MLSMR collection, were converted into Bemis-Murcko scaffolds.
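As a rough illustration of what converting a molecule into a Bemis-Murcko scaffold involves, the sketch below works on a bare molecular graph: it repeatedly strips terminal atoms so that only ring systems and the linkers connecting them remain. The text does not say which implementation was used, so everything here is an assumption: the adjacency-dictionary input, the function names, and the choice to ignore atom and bond types and hydrogens. A production workflow would normally rely on a cheminformatics toolkit instead.

```python
def from_edges(edges):
    """Build an adjacency dict {atom: set(neighbors)} from bond pairs."""
    adjacency = {}
    for a, b in edges:
        adjacency.setdefault(a, set()).add(b)
        adjacency.setdefault(b, set()).add(a)
    return adjacency


def murcko_framework(adjacency):
    """Return the subgraph left after pruning all side-chain atoms.

    Atoms with at most one remaining neighbor are removed repeatedly.
    What survives is the set of ring atoms plus any linker atoms lying
    on a path between rings. Acyclic molecules reduce to an empty graph.
    """
    graph = {atom: set(nbrs) for atom, nbrs in adjacency.items()}
    pending = [atom for atom, nbrs in graph.items() if len(nbrs) <= 1]
    while pending:
        atom = pending.pop()
        if atom not in graph:
            continue  # already removed via another path
        for neighbor in graph.pop(atom):
            graph[neighbor].discard(atom)
            if len(graph[neighbor]) <= 1:
                pending.append(neighbor)
    return graph


if __name__ == "__main__":
    ring = [(i, (i + 1) % 6) for i in range(6)]
    # Six-membered ring with a two-atom side chain on atom 0.
    molecule = from_edges(ring + [(0, 6), (6, 7)])
    print(sorted(murcko_framework(molecule)))
```

Compounds that reduce to the same framework can then be grouped together, which is the sense in which scaffolds give a shared vocabulary across the GVK, vendor, and MLSMR sets.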



