Chemistry LibreTexts

6.2: Similarity Coefficients


    Many similarity metrics have been proposed and some commonly used metrics in cheminformatics are listed below, along with their mathematical definitions for binary features.

    [Figure: common features.png]

    Metric name                                  Formula for binary variables                   Minimum  Maximum
    Tanimoto (Jaccard) coefficient               \( S_{AB} = \frac{c}{a + b - c} \)             0        1
    Dice coefficient (Hodgkin index)             \( S_{AB} = \frac{2c}{a + b} \)                0        1
    Cosine coefficient (Carbo index)             \( S_{AB} = \frac{c}{\sqrt{ab}} \)             0        1
    Soergel distance                             \( D_{AB} = \frac{a + b - 2c}{a + b - c} \)    0        1
    Euclidean distance                           \( D_{AB} = \sqrt{a + b - 2c} \)               0        \( \sqrt{N} \) α
    Hamming (Manhattan or city-block) distance   \( D_{AB} = a + b - 2c \)                      0        \( N \) α

    α \( N \) is the length of the molecular fingerprints.

    Here \( a \) and \( b \) are the numbers of bits set in the fingerprints of molecules A and B, respectively, and \( c \) is the number of bits set in both.
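
    The formulas in the table can be sketched in a few lines of Python. This is a minimal illustration, not a library implementation: the fingerprints are toy 8-bit lists invented for the example, the function names are hypothetical, and the symbols a, b, and c are assumed to mean the number of bits set in A, in B, and in both.

```python
import math

def bit_counts(fp_a, fp_b):
    """Return (a, b, c) for two equal-length binary fingerprints (lists of 0/1)."""
    if len(fp_a) != len(fp_b):
        raise ValueError("fingerprints must have the same length")
    a = sum(fp_a)                                # bits set in A
    b = sum(fp_b)                                # bits set in B
    c = sum(x & y for x, y in zip(fp_a, fp_b))   # bits set in both
    return a, b, c

# Similarity coefficients (undefined when both fingerprints are all zeros).
def tanimoto(a, b, c):
    return c / (a + b - c)

def dice(a, b, c):
    return 2 * c / (a + b)

def cosine(a, b, c):
    return c / math.sqrt(a * b)

# Distance (dissimilarity) coefficients.
def soergel(a, b, c):
    return (a + b - 2 * c) / (a + b - c)

def euclidean(a, b, c):
    return math.sqrt(a + b - 2 * c)

def hamming(a, b, c):
    return a + b - 2 * c

# Toy fingerprints, made up for illustration only.
fp_a = [1, 1, 0, 1, 0, 0, 1, 0]
fp_b = [1, 0, 0, 1, 1, 0, 1, 0]
a, b, c = bit_counts(fp_a, fp_b)   # a = 4, b = 4, c = 3

for name, func in [("Tanimoto", tanimoto), ("Dice", dice), ("Cosine", cosine),
                   ("Soergel", soergel), ("Euclidean", euclidean), ("Hamming", hamming)]:
    print(f"{name:10s} {func(a, b, c):.3f}")
```

    For these toy fingerprints the two molecules share three of their four set bits, so the Tanimoto coefficient is 3/5 and the Hamming distance is 2 (the number of positions at which the fingerprints differ).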

    In the above table, the first three metrics (Tanimoto, Dice, and Cosine coefficients) are similarity metrics (\( S_{AB} \)), which evaluate how similar two molecules are to each other. The other three (Soergel, Euclidean, and Hamming distances) are distance or dissimilarity metrics (\( D_{AB} \)), which quantify how dissimilar the molecules are. These dissimilarity measures can be converted into similarity measures in a simple way. For example, for dissimilarity metrics whose possible values range from 0 to 1 (e.g., Soergel distance), the similarity score (\( S_{AB} \)) between two molecules can be computed by subtracting the dissimilarity score from unity:

    \( S_{AB} = 1 - D_{AB} \)

    Note that the Soergel distance between two molecules is the complement of their Tanimoto coefficient (that is, the two values sum to 1), although the two metrics were developed independently of each other.
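
    The complement relationship follows directly from the binary-variable formulas. As a short worked derivation, assuming \( a \) and \( b \) denote the numbers of bits set in the fingerprints of A and B and \( c \) the number of bits set in both:

    \[
    1 - S_{AB}^{\mathrm{Tanimoto}}
      = 1 - \frac{c}{a + b - c}
      = \frac{(a + b - c) - c}{a + b - c}
      = \frac{a + b - 2c}{a + b - c}
      = D_{AB}^{\mathrm{Soergel}}
    \]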

    If a distance metric has an upper bound greater than 1 (e.g., Euclidean or Hamming distance), the following equation [1] can be used to convert the dissimilarity score to a similarity score:

    \( S_{AB} = \frac{1}{1 + D_{AB}} \)

    According to this equation, if two molecules are identical to each other, the distance (\( D_{AB} \)) between them is zero, and their similarity score (\( S_{AB} \)) becomes 1. On the other hand, as the \( D_{AB} \) value increases (i.e., for dissimilar molecules), the \( S_{AB} \) score approaches 0.
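
    The behaviour of this conversion is easy to check numerically. The sketch below is illustrative only; the function name is hypothetical and the distances are arbitrary sample values, not data from any study.

```python
def similarity_from_distance(d_ab):
    """Map a non-negative distance onto (0, 1] using S = 1 / (1 + D)."""
    if d_ab < 0:
        raise ValueError("distance must be non-negative")
    return 1.0 / (1.0 + d_ab)

# Identical molecules (D = 0) give S = 1; larger distances push S toward 0.
for d in (0, 1, 3, 9, 99):
    print(d, similarity_from_distance(d))
```

    The score never reaches exactly 0 for a finite distance, which is why the text says it approaches 0.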

    An important question about molecular similarity evaluation is “how similar is similar?” To answer this question, it is necessary to have a similarity threshold that can be used to determine whether molecules are similar enough. In 1996, Patterson et al. [2] analyzed sets of active compounds selected from scientific articles and showed that a Tanimoto coefficient of 0.85 or greater reflected a high probability of two compounds having the same activity. Since then, this Tanimoto value of 0.85 has been used in many studies as a general threshold for molecular similarity evaluation. However, as demonstrated in several studies [3], different molecular fingerprints give different similarity score distributions. For example, a Tanimoto score of 0.85 computed from MACCS keys corresponds to a different probability of the two compounds sharing the same activity than the same Tanimoto value (0.85) computed from ECFPs. The programming assignments for this chapter will help you understand the impact of different molecular fingerprints on computed similarity coefficient values.
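
    Applying such a threshold amounts to a simple filter over pairwise scores. The following sketch assumes fingerprints are stored as sets of on-bit positions; the molecule names, the bit sets, and both function names are made up for illustration, and 0.85 is used only as the default because it is the value discussed above. As the text cautions, an appropriate cutoff depends on the fingerprint type.

```python
def tanimoto_from_sets(on_bits_a, on_bits_b):
    """Tanimoto coefficient for fingerprints given as sets of on-bit positions."""
    union = len(on_bits_a | on_bits_b)
    if union == 0:
        return 0.0  # assumed convention for two empty fingerprints
    return len(on_bits_a & on_bits_b) / union

def similar_pairs(fingerprints, threshold=0.85):
    """Return (name_1, name_2, score) for every pair scoring at or above the threshold."""
    names = sorted(fingerprints)
    hits = []
    for i, first in enumerate(names):
        for second in names[i + 1:]:
            score = tanimoto_from_sets(fingerprints[first], fingerprints[second])
            if score >= threshold:
                hits.append((first, second, score))
    return hits

# Hypothetical toy fingerprints (sets of on-bit positions).
toy = {
    "mol_1": {0, 1, 2, 3, 4, 5, 6},
    "mol_2": {0, 1, 2, 3, 4, 5, 6, 7},
    "mol_3": {0, 1, 2, 8, 9},
}

print(similar_pairs(toy))          # only the mol_1 / mol_2 pair (7/8 = 0.875) passes
print(similar_pairs(toy, 0.25))    # a looser cutoff admits all three pairs
```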


    6.2: Similarity Coefficients is shared under a not declared license and was authored, remixed, and/or curated by LibreTexts.
