8.3: Python Assignment
Machine Learning Basics
lecture08_machine-learning.ipynb
- Download the .ipynb file and run it in your own Jupyter notebook.
- You can use the notebook you created in section 1.5 or the Jupyter hub at LibreTexts: https://jupyter.libretexts.org (see your instructor if you do not have access to the hub).
- This page is an HTML version of the above .ipynb file.
- If you have questions on this assignment, use the hypothes.is annotation tool on this web page to post a question (or comment) to the 2019OLCCStu class group. Contact your instructor if you do not know how to access the 2019OLCCStu group within the hypothes.is system.
Required Modules
- pandas
- numpy
- time
- requests
- io
- rdkit
- sklearn
Objectives
- Build binary classification models that predict activity/inactivity of small molecules against human aromatase using supervised learning methods.
- Evaluate the developed models using common performance measures (accuracy, balanced accuracy, sensitivity, specificity, and AUC-ROC).
Import bioactivity data from PubChem
In this notebook, we will develop a model that predicts a small molecule's activity against human aromatase (https://pubchem.ncbi.nlm.nih.gov/protein/EAW77416), which is encoded by the CYP19A1 gene (https://pubchem.ncbi.nlm.nih.gov/gene/1588). The model will predict the activity of a molecule from its structure, represented with molecular fingerprints.
For model development, we will use the Tox21 bioassay data for human aromatase, archived in PubChem (https://pubchem.ncbi.nlm.nih.gov/bioassay/743139). The bioactivity data presented on that page can be downloaded by clicking the "Download" button and then read into a data frame. Alternatively, you can load the data directly into a data frame from the download URL, as shown in cell [1] below.
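For reference, the manual route (downloading first, then reading the file) would look something like this; a minimal sketch, assuming the downloaded CSV was saved in the working directory under the hypothetical name 'aromatase_aid743139.csv' (the actual file name assigned by PubChem will differ):

    import pandas as pd

    # 'aromatase_aid743139.csv' is a placeholder for the file saved via the "Download" button.
    df_raw = pd.read_csv('aromatase_aid743139.csv')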
In [1]:
import pandas as pd
import numpy as np

url = 'https://pubchem.ncbi.nlm.nih.gov/assay/pcget.cgi?query=download&record_type=datatable&actvty=all&response_type=save&aid=743139'
df_raw = pd.read_csv(url)
In [2]:
df_raw.head(7)
Out[2]:
PUBCHEM_RESULT_TAG | PUBCHEM_SID | PUBCHEM_CID | PUBCHEM_ACTIVITY_OUTCOME | PUBCHEM_ACTIVITY_SCORE | PUBCHEM_ACTIVITY_URL | PUBCHEM_ASSAYDATA_COMMENT | Activity Summary | Antagonist Activity | Antagonist Potency (uM) | Antagonist Efficacy (%) | Viability Activity | Viability Potency (uM) | Viability Efficacy (%) | Sample Source | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | RESULT_TYPE | NaN | NaN | NaN | NaN | NaN | NaN | STRING | STRING | FLOAT | FLOAT | STRING | FLOAT | FLOAT | STRING |
1 | RESULT_DESCR | NaN | NaN | NaN | NaN | NaN | NaN | Type of compound activity based on both the ar... | Type of compound activity in the aromatase ant... | The concentration of sample yielding half-maxi... | Percent inhibition of aromatase. | Type of compound activity in the cell viabilit... | The concentration of sample yielding half-maxi... | Percent inhibition of cell viability. | Where sample was obtained. |
2 | RESULT_UNIT | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | MICROMOLAR | PERCENT | NaN | MICROMOLAR | PERCENT | NaN |
3 | 1 | 144203552.0 | 12850184.0 | Inactive | 0.0 | NaN | NaN | inactive | inactive | NaN | 0 | inactive | NaN | 0 | NCI |
4 | 2 | 144203553.0 | 89753.0 | Inactive | 0.0 | NaN | NaN | inactive | inactive | NaN | 0 | inactive | NaN | 0 | NCI |
5 | 3 | 144203554.0 | 9403.0 | Inactive | 0.0 | NaN | NaN | inactive | inactive | NaN | 0 | inactive | NaN | 0 | NCI |
6 | 4 | 144203555.0 | 13218779.0 | Inactive | 0.0 | NaN | NaN | inactive | inactive | NaN | 0 | inactive | NaN | 0 | NCI |
Note: Rows 0-2 provide descriptions of each column (data type, description, units, etc.). These rows need to be removed.
In [3]:
df_raw = df_raw[3:]
df_raw.head(5)
Out[3]:
PUBCHEM_RESULT_TAG | PUBCHEM_SID | PUBCHEM_CID | PUBCHEM_ACTIVITY_OUTCOME | PUBCHEM_ACTIVITY_SCORE | PUBCHEM_ACTIVITY_URL | PUBCHEM_ASSAYDATA_COMMENT | Activity Summary | Antagonist Activity | Antagonist Potency (uM) | Antagonist Efficacy (%) | Viability Activity | Viability Potency (uM) | Viability Efficacy (%) | Sample Source | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
3 | 1 | 144203552.0 | 12850184.0 | Inactive | 0.0 | NaN | NaN | inactive | inactive | NaN | 0 | inactive | NaN | 0 | NCI |
4 | 2 | 144203553.0 | 89753.0 | Inactive | 0.0 | NaN | NaN | inactive | inactive | NaN | 0 | inactive | NaN | 0 | NCI |
5 | 3 | 144203554.0 | 9403.0 | Inactive | 0.0 | NaN | NaN | inactive | inactive | NaN | 0 | inactive | NaN | 0 | NCI |
6 | 4 | 144203555.0 | 13218779.0 | Inactive | 0.0 | NaN | NaN | inactive | inactive | NaN | 0 | inactive | NaN | 0 | NCI |
7 | 5 | 144203556.0 | 142766.0 | Inconclusive | 25.0 | NaN | NaN | inconclusive antagonist (cytotoxic) | active antagonist | 15.5454 | -115.803 | active antagonist | 14.9601 | -76.8218 | NCI |
The column names in this data frame contain white spaces and special characters. For simplicity, let's rename the columns so that they contain no spaces or special characters other than the underscore ("_").
In [4]:
df_raw.columns
Out[4]:
Index(['PUBCHEM_RESULT_TAG', 'PUBCHEM_SID', 'PUBCHEM_CID', 'PUBCHEM_ACTIVITY_OUTCOME', 'PUBCHEM_ACTIVITY_SCORE', 'PUBCHEM_ACTIVITY_URL', 'PUBCHEM_ASSAYDATA_COMMENT', 'Activity Summary', 'Antagonist Activity', 'Antagonist Potency (uM)', 'Antagonist Efficacy (%)', 'Viability Activity', 'Viability Potency (uM)', 'Viability Efficacy (%)', 'Sample Source'], dtype='object')
In [5]:
col_names_map = {
    'PUBCHEM_RESULT_TAG'        : 'pc_result_tag',
    'PUBCHEM_SID'               : 'sid',
    'PUBCHEM_CID'               : 'cid',
    'PUBCHEM_ACTIVITY_OUTCOME'  : 'activity_outcome',
    'PUBCHEM_ACTIVITY_SCORE'    : 'activity_score',
    'PUBCHEM_ACTIVITY_URL'      : 'activity_url',
    'PUBCHEM_ASSAYDATA_COMMENT' : 'assay_data_comment',
    'Activity Summary'          : 'activity_summary',
    'Antagonist Activity'       : 'antagonist_activity',
    'Antagonist Potency (uM)'   : 'antagonist_potency',
    'Antagonist Efficacy (%)'   : 'antagonist_efficacy',
    'Viability Activity'        : 'viability_activity',
    'Viability Potency (uM)'    : 'viability_potency',
    'Viability Efficacy (%)'    : 'viability_efficacy',
    'Sample Source'             : 'sample_source'
}
In [6]:
df_raw = df_raw.rename(columns = col_names_map)
df_raw.columns
Out[6]:
Index(['pc_result_tag', 'sid', 'cid', 'activity_outcome', 'activity_score', 'activity_url', 'assay_data_comment', 'activity_summary', 'antagonist_activity', 'antagonist_potency', 'antagonist_efficacy', 'viability_activity', 'viability_potency', 'viability_efficacy', 'sample_source'], dtype='object')
Check the number of compounds for each activity group
First, we need to understand what our data look like. In particular, we are interested in the activity class of each tested compound, because we are developing a model that classifies small molecules according to their activities against the target. This information is available in the "activity_outcome" and "activity_summary" columns.
In [7]:
df_raw.groupby(['activity_outcome']).count()
Out[7]:
pc_result_tag | sid | cid | activity_score | activity_url | assay_data_comment | activity_summary | antagonist_activity | antagonist_potency | antagonist_efficacy | viability_activity | viability_potency | viability_efficacy | sample_source | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
activity_outcome | ||||||||||||||
Active | 379 | 379 | 378 | 379 | 0 | 0 | 379 | 379 | 378 | 379 | 379 | 115 | 359 | 379 |
Inactive | 7562 | 7562 | 7466 | 7562 | 0 | 0 | 7562 | 7562 | 0 | 7562 | 7562 | 324 | 7449 | 7562 |
Inconclusive | 2545 | 2545 | 2493 | 2545 | 0 | 0 | 2545 | 2545 | 2111 | 2136 | 2545 | 1206 | 2450 | 2545 |
Based on the data in the activity_outcome column, there are 379 actives, 7562 inactives, and 2545 inconclusives.
In [8]:
df_raw.groupby(['activity_outcome','activity_summary']).count()
Out[8]:
pc_result_tag | sid | cid | activity_score | activity_url | assay_data_comment | antagonist_activity | antagonist_potency | antagonist_efficacy | viability_activity | viability_potency | viability_efficacy | sample_source | ||
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
activity_outcome | activity_summary | |||||||||||||
Active | active antagonist | 379 | 379 | 378 | 379 | 0 | 0 | 379 | 378 | 379 | 379 | 115 | 359 | 379 |
Inactive | inactive | 7562 | 7562 | 7466 | 7562 | 0 | 0 | 7562 | 0 | 7562 | 7562 | 324 | 7449 | 7562 |
Inconclusive | active agonist | 612 | 612 | 571 | 612 | 0 | 0 | 612 | 612 | 612 | 612 | 60 | 590 | 612 |
inconclusive | 44 | 44 | 44 | 44 | 0 | 0 | 44 | 0 | 0 | 44 | 19 | 42 | 44 | |
inconclusive agonist | 414 | 414 | 409 | 414 | 0 | 0 | 414 | 212 | 223 | 414 | 12 | 397 | 414 | |
inconclusive agonist (cytotoxic) | 59 | 59 | 59 | 59 | 0 | 0 | 59 | 41 | 45 | 59 | 59 | 59 | 59 | |
inconclusive antagonist | 367 | 367 | 364 | 367 | 0 | 0 | 367 | 227 | 230 | 367 | 8 | 313 | 367 | |
inconclusive antagonist (cytotoxic) | 1049 | 1049 | 1046 | 1049 | 0 | 0 | 1049 | 1019 | 1026 | 1049 | 1048 | 1049 | 1049 |
Now, we can see that, in the activity_summary column, the inconclusive compounds are further classified into subclasses, which include:
- active agonist
- inconclusive
- inconclusive agonist
- inconclusive antagonist
- inconclusive agonist (cytotoxic)
- inconclusive antagonist (cytotoxic)
As implied in the title of this assay record (https://pubchem.ncbi.nlm.nih.gov/bioassay/743139), this assay aims to identify aromatase inhibitors. Therefore, all active antagonists (in the activity summary column) were declared to be active compounds (in the activity outcome column).
On the other hand, the assay also identified 612 active agonists (in the activity summary column), and they are declared to be inconclusive (in the activity outcome column).
With that said, "inactive" compounds in this assay are those that are neither active agonists nor active antagonists.
It is important to understand that the criteria used to decide whether a compound is active in a given assay are chosen by the data source that submitted the assay data to PubChem. For the purpose of this assignment (which aims to develop a binary classifier that tells whether a molecule is active or inactive against the target), we should clarify what we mean by "active" and "inactive".
- active : any compounds that can change (either increase or decrease) the activity of the target. This is equivalent to either active antagonists or active agonists in the activity summary column.
- inactive : any compounds that do not change the activity of the target. This is equivalent to inactive compounds in the activity summary column.
Select active/inactive compounds for model building
Now we want to select only the active and inactive compounds from the data frame (that is, active agonists, active antagonists, and inactives based on the "activity summary" column).
In [9]:
df = df_raw[ (df_raw['activity_summary'] == 'active agonist'   ) |
             (df_raw['activity_summary'] == 'active antagonist') |
             (df_raw['activity_summary'] == 'inactive'         ) ]
len(df)
Out[9]:
8553
In [10]:
print(len(df['sid'].unique()))
print(len(df['cid'].unique()))
8553
6864
Note that the number of CIDs is not the same as the number of SIDs. There are two important potential reasons for this observation.
First, not all substances (SIDs) in PubChem have associated compounds (CIDs) because some substances failed during structure standardization. [Remember that, in PubChem, substances are depositor-provided structures and compounds are unique structures extracted from substances through structure standardization.] Because our model will use structural information of molecules to predict their bioactivity, we need to remove substances without associated CIDs (i.e., no standardized structures).
Second, some compounds are associated with more than one substance. In the context of this assay, this means that a compound may have been tested multiple times in different samples (which are designated as different substances). It is not uncommon for different samples of the same chemical to give conflicting activities (e.g., active agonist in one sample but inactive in another). In this practice, we remove such compounds with conflicting activities.
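For reference, a compact way to count how many CIDs carry conflicting activity summaries (the actual removal is carried out step by step in the cells below); a short sketch assuming the df defined above:

    # Number of distinct activity_summary values per CID; a count greater than 1 indicates a conflict.
    n_summaries = df.groupby('cid')['activity_summary'].nunique()
    print((n_summaries > 1).sum())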
Drop substances without associated CIDs.
First, check if there are substances without associated CIDs.
In [11]:
df.isna().sum()
Out[11]:
pc_result_tag             0
sid                       0
cid                     138
activity_outcome          0
activity_score            0
activity_url           8553
assay_data_comment     8553
activity_summary          0
antagonist_activity       0
antagonist_potency     7563
antagonist_efficacy       0
viability_activity        0
viability_potency      8054
viability_efficacy      155
sample_source             0
dtype: int64
There are 138 records whose "cid" column is NULL, and we want to remove those records.
In [12]:
df = df.dropna( subset=['cid'] )
len(df)
Out[12]:
8415
In [13]:
print(len(df['sid'].unique()))
print(len(df['cid'].unique()))
8415
6863
In [14]:
df.isna().sum() # Check if the NULL values disappeared in the "cid" column
Out[14]:
pc_result_tag             0
sid                       0
cid                       0
activity_outcome          0
activity_score            0
activity_url           8415
assay_data_comment     8415
activity_summary          0
antagonist_activity       0
antagonist_potency     7467
antagonist_efficacy       0
viability_activity        0
viability_potency      7919
viability_efficacy      154
sample_source             0
dtype: int64
Remove CIDs with conflicting activities
Now identify compounds with conflicting activities and remove them.
In [15]:
cid_conflict = []
idx_conflict = []

for mycid in df['cid'].unique() :

    outcomes = df[ df.cid == mycid ].activity_summary.unique()

    if len(outcomes) > 1 :
        idx_tmp = df.index[ df.cid == mycid ].tolist()
        idx_conflict.extend(idx_tmp)
        cid_conflict.append(mycid)

print("#", len(cid_conflict), "CIDs with conflicting activities [associated with",
      len(idx_conflict), "rows (SIDs).]")
# 65 CIDs with conflicting activities [associated with 146 rows (SIDs).]
In [16]:
df.loc[idx_conflict,:].head(10)
Out[16]:
pc_result_tag | sid | cid | activity_outcome | activity_score | activity_url | assay_data_comment | activity_summary | antagonist_activity | antagonist_potency | antagonist_efficacy | viability_activity | viability_potency | viability_efficacy | sample_source | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
8 | 6 | 144203557.0 | 16043.0 | Inactive | 0.0 | NaN | NaN | inactive | inactive | NaN | 0 | inactive | NaN | 0 | NCI |
5956 | 5954 | 144209507.0 | 16043.0 | Active | 43.0 | NaN | NaN | active antagonist | active antagonist | 54.4827 | -73.4024 | inconclusive antagonist | NaN | NaN | SigmaAldrich |
6850 | 6848 | 144210401.0 | 16043.0 | Inactive | 0.0 | NaN | NaN | inactive | inactive | NaN | 0 | inactive | NaN | 0 | SIGMA |
52 | 50 | 144203601.0 | 443939.0 | Inactive | 0.0 | NaN | NaN | inactive | inactive | NaN | 0 | inactive | NaN | 0 | NCI |
6130 | 6128 | 144209681.0 | 443939.0 | Active | 61.0 | NaN | NaN | active antagonist | active antagonist | 1.65519 | -115.932 | active antagonist | 12.1763 | -120.598 | Toronto Research |
66 | 64 | 144203615.0 | 2170.0 | Inactive | 0.0 | NaN | NaN | inactive | inactive | NaN | 0 | inactive | NaN | 0 | BIOMOL |
9118 | 9116 | 144212669.0 | 2170.0 | Active | 50.0 | NaN | NaN | active antagonist | active antagonist | 16.5803 | -115.202 | inconclusive antagonist | 61.1306 | -80.7706 | SIGMA |
106 | 104 | 144203655.0 | 2554.0 | Inconclusive | 20.0 | NaN | NaN | active agonist | active agonist | 2.87255 | 73.7025 | inactive | NaN | 0 | SigmaAldrich |
5920 | 5918 | 144209471.0 | 2554.0 | Inactive | 0.0 | NaN | NaN | inactive | inactive | NaN | 0 | inactive | NaN | 0 | SIGMA |
6964 | 6962 | 144210515.0 | 2554.0 | Inactive | 0.0 | NaN | NaN | inactive | inactive | NaN | 0 | inactive | NaN | 0 | SIGMA |
In [17]:
df = df.drop(idx_conflict)
In [18]:
df.groupby('activity_summary').count()
Out[18]:
pc_result_tag | sid | cid | activity_outcome | activity_score | activity_url | assay_data_comment | antagonist_activity | antagonist_potency | antagonist_efficacy | viability_activity | viability_potency | viability_efficacy | sample_source | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
activity_summary | ||||||||||||||
active agonist | 537 | 537 | 537 | 537 | 537 | 0 | 0 | 537 | 537 | 537 | 537 | 58 | 517 | 537 |
active antagonist | 343 | 343 | 343 | 343 | 343 | 0 | 0 | 343 | 342 | 343 | 343 | 108 | 326 | 343 |
inactive | 7389 | 7389 | 7389 | 7389 | 7389 | 0 | 0 | 7389 | 0 | 7389 | 7389 | 318 | 7278 | 7389 |
In [19]:
print(len(df['sid'].unique()))
print(len(df['cid'].unique()))
8269
6798
Remove redundant data
The above code cells [in 3-(2)] do not remove compounds tested multiple times when the results are consistent [e.g., active agonist in all samples (substances)]. The rows corresponding to these compounds are redundant, so we want to remove them, keeping only one row per compound.
In [20]:
df = df.drop_duplicates(subset='cid')   # remove duplicate rows except for the first occurring row
print(len(df['sid'].unique()))
print(len(df['cid'].unique()))
6798
6798
Adding "numeric" activity classes
In general, the inputs and outputs to machine learning algorithms need to have numerical forms.
In this practice, the input (molecular structure) will be represented with binary fingerprints, which already have a numerical form (0 or 1). However, the output (activity) is currently a string (e.g., 'active agonist', 'active antagonist'). Therefore, we want to add an additional 'activity' column, which contains numeric codes representing the active and inactive compounds:
- 1 for actives (either active agonists or active antagonists)
- 0 for inactives
Note that we are merging the two classes "active agonist" and "active antagonist", because we are going to build a binary classifier that distinguishes actives from inactives.
In [21]:
df['activity'] = [ 0 if x == 'inactive' else 1 for x in df['activity_summary'] ]
Check if the new column 'activity' is added to (the end of) the data frame.
In [22]:
df.head(3)
Out[22]:
pc_result_tag | sid | cid | activity_outcome | activity_score | activity_url | assay_data_comment | activity_summary | antagonist_activity | antagonist_potency | antagonist_efficacy | viability_activity | viability_potency | viability_efficacy | sample_source | activity | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
3 | 1 | 144203552.0 | 12850184.0 | Inactive | 0.0 | NaN | NaN | inactive | inactive | NaN | 0 | inactive | NaN | 0 | NCI | 0 |
4 | 2 | 144203553.0 | 89753.0 | Inactive | 0.0 | NaN | NaN | inactive | inactive | NaN | 0 | inactive | NaN | 0 | NCI | 0 |
5 | 3 | 144203554.0 | 9403.0 | Inactive | 0.0 | NaN | NaN | inactive | inactive | NaN | 0 | inactive | NaN | 0 | NCI | 0 |
Double-check the count of active/inactive compounds.
In [23]:
df.groupby('activity_summary').count()
Out[23]:
pc_result_tag | sid | cid | activity_outcome | activity_score | activity_url | assay_data_comment | antagonist_activity | antagonist_potency | antagonist_efficacy | viability_activity | viability_potency | viability_efficacy | sample_source | activity | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
activity_summary | |||||||||||||||
active agonist | 451 | 451 | 451 | 451 | 451 | 0 | 0 | 451 | 451 | 451 | 451 | 44 | 432 | 451 | 451 |
active antagonist | 291 | 291 | 291 | 291 | 291 | 0 | 0 | 291 | 290 | 291 | 291 | 88 | 275 | 291 | 291 |
inactive | 6056 | 6056 | 6056 | 6056 | 6056 | 0 | 0 | 6056 | 0 | 6056 | 6056 | 269 | 5970 | 6056 | 6056 |
In [24]:
df.groupby('activity').count()
Out[24]:
pc_result_tag | sid | cid | activity_outcome | activity_score | activity_url | assay_data_comment | activity_summary | antagonist_activity | antagonist_potency | antagonist_efficacy | viability_activity | viability_potency | viability_efficacy | sample_source | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
activity | |||||||||||||||
0 | 6056 | 6056 | 6056 | 6056 | 6056 | 0 | 0 | 6056 | 6056 | 0 | 6056 | 6056 | 269 | 5970 | 6056 |
1 | 742 | 742 | 742 | 742 | 742 | 0 | 0 | 742 | 742 | 741 | 742 | 742 | 132 | 707 | 742 |
Create a smaller data frame that only contains CIDs and activities.
Let's create a smaller data frame that only contains CIDs and activities. This data frame will be merged with a data frame containing molecular fingerprint information.
In [25]:
df_activity = df[['cid','activity']]
In [26]:
df_activity.head(5)
Out[26]:
cid | activity | |
---|---|---|
3 | 12850184.0 | 0 |
4 | 89753.0 | 0 |
5 | 9403.0 | 0 |
6 | 13218779.0 | 0 |
12 | 637566.0 | 0 |
Download structure information for each compound from PubChem
Now we want to get the structure information (as isomeric SMILES) of these compounds from PubChem.
In [27]:
cids = df.cid.astype(int).tolist()
In [28]:
chunk_size = 200
num_cids = len(cids)

if num_cids % chunk_size == 0 :
    num_chunks = int( num_cids / chunk_size )
else :
    num_chunks = int( num_cids / chunk_size ) + 1

print("# CIDs = ", num_cids)
print("# CID Chunks = ", num_chunks, "(chunked by ", chunk_size, ")")
# CIDs = 6798
# CID Chunks = 34 (chunked by 200 )
In [29]:
import time
import requests
from io import StringIO

df_smiles = pd.DataFrame()
list_dfs = []   # temporary list of data frames

for i in range(0, num_chunks) :

    idx1 = chunk_size * i
    idx2 = chunk_size * (i + 1)

    cidstr = ",".join( str(x) for x in cids[idx1:idx2] )

    url = ('https://pubchem.ncbi.nlm.nih.gov/rest/pug/compound/cid/' + cidstr +
           '/property/IsomericSMILES/TXT')
    res = requests.get(url)

    data = pd.read_csv( StringIO(res.text), header=None, names=['smiles'] )
    list_dfs.append(data)

    time.sleep(0.2)

    if ( i % 5 == 0 ) :
        print("Processing Chunk ", i)

#    if ( i == 2 ) : break    #- for debugging

df_smiles = pd.concat(list_dfs, ignore_index=True)
df_smiles[ 'cid' ] = cids
df_smiles.head(5)
Processing Chunk 0
Processing Chunk 5
Processing Chunk 10
Processing Chunk 15
Processing Chunk 20
Processing Chunk 25
Processing Chunk 30
Out[29]:
smiles | cid | |
---|---|---|
0 | C(C(=O)[C@H]([C@@H]([C@H](C(=O)[O-])O)O)O)O.C(... | 12850184 |
1 | C([C@H]([C@H]([C@@H]([C@H](C(=O)[O-])O)O)O)O)O... | 89753 |
2 | C[C@]12CC[C@H]3[C@H]([C@@H]1CC[C@@H]2OC(=O)CCC... | 9403 |
3 | C[C@@]12CC[C@@H](C1(C)C)C[C@H]2OC(=O)CSC#N | 13218779 |
4 | CC(=CCC/C(=C/CO)/C)C | 637566 |
In [30]:
len(df_smiles)
Out[30]:
6798
In [31]:
df_smiles = df_smiles[['cid','smiles']]
df_smiles.head(5)
Out[31]:
cid | smiles | |
---|---|---|
0 | 12850184 | C(C(=O)[C@H]([C@@H]([C@H](C(=O)[O-])O)O)O)O.C(... |
1 | 89753 | C([C@H]([C@H]([C@@H]([C@H](C(=O)[O-])O)O)O)O)O... |
2 | 9403 | C[C@]12CC[C@H]3[C@H]([C@@H]1CC[C@@H]2OC(=O)CCC... |
3 | 13218779 | C[C@@]12CC[C@@H](C1(C)C)C[C@H]2OC(=O)CSC#N |
4 | 637566 | CC(=CCC/C(=C/CO)/C)C |
Generate MACCS keys from SMILES.
In [32]:
from rdkit import Chem
from rdkit.Chem import MACCSkeys
In [33]:
fps = dict()

for idx, row in df_smiles.iterrows() :

    mol = Chem.MolFromSmiles(row.smiles)

    if mol == None :
        print("Can't generate MOL object:", "CID", row.cid, row.smiles)
    else:
        fps[row.cid] = [row.cid] + list(MACCSkeys.GenMACCSKeys(mol).ToBitString())
Can't generate MOL object: CID 28145 [NH4+].[NH4+].F[Si-2](F)(F)(F)(F)F
Can't generate MOL object: CID 28127 F[Si-2](F)(F)(F)(F)F.[Na+].[Na+]
In [34]:
# Generate column names
fpbitnames = []
fpbitnames.append('cid')

for i in range(0, 167):   # from MACCS000 to MACCS166
    fpbitnames.append( "maccs" + str(i).zfill(3) )

df_fps = pd.DataFrame.from_dict(fps, orient='index', columns=fpbitnames)
In [35]:
df_fps.head(5)
Out[35]:
cid | maccs000 | maccs001 | maccs002 | maccs003 | maccs004 | maccs005 | maccs006 | maccs007 | maccs008 | ... | maccs157 | maccs158 | maccs159 | maccs160 | maccs161 | maccs162 | maccs163 | maccs164 | maccs165 | maccs166 | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
12850184 | 12850184 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 1 |
89753 | 89753 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 1 |
9403 | 9403 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 1 | 0 | 1 | 1 | 0 | 1 | 1 | 1 | 1 | 0 |
13218779 | 13218779 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 1 | 0 | 1 | 1 | 1 | 0 | 1 | 1 | 1 | 0 |
637566 | 637566 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 0 |
5 rows × 168 columns
Merge activity data and fingerprint information
In [36]:
df_activity.head(3)
Out[36]:
cid | activity | |
---|---|---|
3 | 12850184.0 | 0 |
4 | 89753.0 | 0 |
5 | 9403.0 | 0 |
In [37]:
df_fps.head(3)
Out[37]:
cid | maccs000 | maccs001 | maccs002 | maccs003 | maccs004 | maccs005 | maccs006 | maccs007 | maccs008 | ... | maccs157 | maccs158 | maccs159 | maccs160 | maccs161 | maccs162 | maccs163 | maccs164 | maccs165 | maccs166 | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
12850184 | 12850184 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 1 |
89753 | 89753 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 1 |
9403 | 9403 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 1 | 0 | 1 | 1 | 0 | 1 | 1 | 1 | 1 | 0 |
3 rows × 168 columns
In [38]:
df_data = df_activity.join(df_fps.set_index('cid'), on='cid')
In Section 5, there were two CIDs for which the MACCS keys could not be generated. They need to be removed from df_data.
In [39]:
df_data[df_data.isna().any(axis=1)]
Out[39]:
cid | activity | maccs000 | maccs001 | maccs002 | maccs003 | maccs004 | maccs005 | maccs006 | maccs007 | ... | maccs157 | maccs158 | maccs159 | maccs160 | maccs161 | maccs162 | maccs163 | maccs164 | maccs165 | maccs166 | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
2293 | 28145.0 | 0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
9077 | 28127.0 | 0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
2 rows × 169 columns
In [40]:
df_data = df_data.dropna()
len(df_data)
Out[40]:
6796
Save df_data in CSV for future use.
In [41]:
df_data.to_csv('df_data.csv')
Preparation for model building
Loading the data into X and y.
In [42]:
df_data.head(3)
Out[42]:
cid | activity | maccs000 | maccs001 | maccs002 | maccs003 | maccs004 | maccs005 | maccs006 | maccs007 | ... | maccs157 | maccs158 | maccs159 | maccs160 | maccs161 | maccs162 | maccs163 | maccs164 | maccs165 | maccs166 | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
3 | 12850184.0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 1 |
4 | 89753.0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 1 |
5 | 9403.0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 1 | 0 | 1 | 1 | 0 | 1 | 1 | 1 | 1 | 0 |
3 rows × 169 columns
In [43]:
X = df_data.iloc[:,2:]
y = df_data['activity'].values
In [44]:
X.head(3)
Out[44]:
maccs000 | maccs001 | maccs002 | maccs003 | maccs004 | maccs005 | maccs006 | maccs007 | maccs008 | maccs009 | ... | maccs157 | maccs158 | maccs159 | maccs160 | maccs161 | maccs162 | maccs163 | maccs164 | maccs165 | maccs166 | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
3 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 1 |
4 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 1 |
5 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 1 | 0 | 1 | 1 | 0 | 1 | 1 | 1 | 1 | 0 |
3 rows × 167 columns
In [45]:
print(len(y))   # Number of all compounds
y.sum()         # Number of actives
6796
Out[45]:
742
Remove zero-variance features
Some features in X are not helpful in distinguishing actives from inactives, because they are set ON for all compounds or OFF for all compounds. Such features need to be removed because they would consume more computational resources without improving the model.
In [46]:
from sklearn.feature_selection import VarianceThreshold
In [47]:
X.shape #- Before removal
Out[47]:
(6796, 167)
In [48]:
sel = VarianceThreshold()
X = sel.fit_transform(X)
X.shape   #- After removal
Out[48]:
(6796, 163)
In this case, four features had zero variance and were removed. Note that one of them is the first bit (maccs000) of the MACCS keys; this bit is only a "dummy" placeholder, included so that the real bits 1~166 can be named maccs001, maccs002, ..., maccs166.
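If you want to see exactly which bits were dropped, the fitted selector exposes a boolean mask of retained features; a minimal sketch, assuming sel from the cell above and the fpbitnames list defined earlier (its first entry, 'cid', is skipped because it was not part of X):

    # get_support() returns True for retained features and False for zero-variance features.
    dropped = [ name for name, keep in zip(fpbitnames[1:], sel.get_support()) if not keep ]
    print(dropped)   # should list the four zero-variance bits, including maccs000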
Train-Test-Split (a 9:1 ratio)
Now split the data set into a training set (90%) and test set (10%). The training set will be used to train the model. The developed model will be tested against the test set.
In [49]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = \
    train_test_split(X, y, shuffle=True, random_state=3100, stratify=y, test_size=0.1)

print(X_train.shape, X_test.shape, y_train.shape, y_test.shape)
print(y_train.sum(), y_test.sum())
(6116, 163) (680, 163) (6116,) (680,)
668 74
Balance the training set through downsampling
Check the dimension of the training data set.
In [50]:
print(len(y_train))
print(len(X_train))
print(len(X_train[0]))
6116
6116
163
Check the number of active and inactive compounds.
In [51]:
print("# inactives : ", len(y_train) - y_train.sum()) print("# actives : ", y_train.sum())
# inactives :  5448
# actives :  668
The data set is highly imbalanced [the inactive to active ratio is 8.16 (=5448 / 668)]. To address this issue, let's downsample the majority class (inactive compounds) to balance the data set.
In [52]:
# Indices of each class's observations
idx_inactives = np.where( y_train == 0 )[0]
idx_actives   = np.where( y_train == 1 )[0]

# Number of observations in each class
num_inactives = len(idx_inactives)
num_actives   = len(idx_actives)

# Randomly sample from inactives without replacement
np.random.seed(0)
idx_inactives_downsampled = np.random.choice(idx_inactives, size=num_actives, replace=False)

# Join together downsampled inactives with actives
X_train = np.vstack((X_train[idx_inactives_downsampled], X_train[idx_actives]))
y_train = np.hstack((y_train[idx_inactives_downsampled], y_train[idx_actives]))
It is noteworthy that np.vstack is used for X_train while np.hstack is used for y_train. The direction of stacking differs because X_train is a 2-D array and y_train is a 1-D array.
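A toy example (made-up arrays, not the assay data) may make the difference in stacking direction clearer:

    import numpy as np

    a2d = np.zeros((2, 3))   # 2-D, like X_train: rows = compounds, columns = fingerprint bits
    b1d = np.zeros(2)        # 1-D, like y_train: one label per compound

    print(np.vstack((a2d, a2d)).shape)   # (4, 3) -- rows stacked vertically
    print(np.hstack((b1d, b1d)).shape)   # (4,)   -- labels simply concatenated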
Confirm that the downsampled data set has the correct dimension and active/inactive counts.
In [53]:
print("# inactives : ", len(y_train) - y_train.sum()) print("# actives : ", y_train.sum())
# inactives :  668
# actives :  668
In [54]:
print(len(y_train))
print(len(X_train))
print(len(X_train[0]))
1336
1336
163
Build a model using the training set.
Now we are ready to build predictive models using machine learning algorithms available in the scikit-learn library (https://scikit-learn.org/). This notebook will use Naive Bayes and decision tree classifiers, because they are relatively fast and simple.
In [55]:
from sklearn.naive_bayes import BernoulliNB       #-- Naive Bayes
from sklearn.tree import DecisionTreeClassifier   #-- Decision Tree
In [56]:
from sklearn.metrics import classification_report
from sklearn.metrics import confusion_matrix
from sklearn.metrics import accuracy_score
from sklearn.metrics import roc_auc_score
Naive Bayes
In [57]:
clf = BernoulliNB() # set up the NB classification model
In [58]:
clf.fit( X_train ,y_train ) # Train the model by fitting it to the data.
Out[58]:
BernoulliNB(alpha=1.0, binarize=0.0, class_prior=None, fit_prior=True)
In [59]:
y_true, y_pred = y_train, clf.predict( X_train ) # Apply the model to predict the training compound's activity.
In [60]:
CMat = confusion_matrix( y_true, y_pred )   #-- generate confusion matrix
print(CMat)
# [[TN, FP],
#  [FN, TP]]
[[462 206]
 [199 469]]
In [61]:
acc = accuracy_score( y_true, y_pred )
sens = CMat[ 1 ][ 1 ] / ( CMat[ 1 ][ 0 ] + CMat[ 1 ][ 1 ] )   # TP / (FN + TP)
spec = CMat[ 0 ][ 0 ] / ( CMat[ 0 ][ 0 ] + CMat[ 0 ][ 1 ] )   # TN / (TN + FP)
bacc = (sens + spec) / 2

y_score = clf.predict_proba( X_train )[:, 1]
auc = roc_auc_score( y_true, y_score )
In [62]:
print("#-- Accuracy = ", acc) print("#-- Balanced Accuracy = ", bacc) print("#-- Sensitivity = ", sens) print("#-- Specificity = ", spec) print("#-- AUC-ROC = ", auc)
#-- Accuracy = 0.6968562874251497
#-- Balanced Accuracy = 0.6968562874251496
#-- Sensitivity = 0.7020958083832335
#-- Specificity = 0.6916167664670658
#-- AUC-ROC = 0.7496985818781599
When applied to predict the activity of the training compounds, the NB classifier gave an accuracy of 0.70 and an AUC-ROC of 0.75. However, the real performance of the model should be evaluated with the test set data, which were not used for model training.
In [63]:
y_true, y_pred = y_test, clf.predict(X_test) #-- Apply the model to predict the test set compounds' activity.
In [64]:
CMat = confusion_matrix( y_true, y_pred )   #-- generate confusion matrix
print(CMat)
# [[TN, FP],
#  [FN, TP]]
[[412 194]
 [ 28  46]]
In [65]:
acc = accuracy_score( y_true, y_pred )
sens = CMat[ 1 ][ 1 ] / ( CMat[ 1 ][ 0 ] + CMat[ 1 ][ 1 ] )
spec = CMat[ 0 ][ 0 ] / ( CMat[ 0 ][ 0 ] + CMat[ 0 ][ 1 ] )
bacc = (sens + spec) / 2

y_score = clf.predict_proba( X_test )[:, 1]
auc = roc_auc_score( y_true, y_score )

print("#-- Accuracy = ", acc)
print("#-- Balanced Accuracy = ", bacc)
print("#-- Sensitivity = ", sens)
print("#-- Specificity = ", spec)
print("#-- AUC-ROC = ", auc)
#-- Accuracy = 0.6735294117647059
#-- Balanced Accuracy = 0.6507448042101507
#-- Sensitivity = 0.6216216216216216
#-- Specificity = 0.6798679867986799
#-- AUC-ROC = 0.724099099099099
For the test set, the accuracy is 0.67 and the AUC-ROC is 0.72. These values are somewhat smaller (by about 0.03) than those for the training set. Also note that the accuracy is no longer the same as the balanced accuracy (which is the average of the sensitivity and specificity).
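As a cross-check, the balanced accuracy computed by hand above should match scikit-learn's built-in function; a short sketch, assuming y_true and y_pred from the test-set cells above (balanced_accuracy_score is available in recent scikit-learn versions):

    from sklearn.metrics import balanced_accuracy_score

    # For binary labels this is the average of sensitivity and specificity.
    print(balanced_accuracy_score(y_true, y_pred))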
Some additional performance information may be obtained using classification_report().
In [66]:
print( classification_report(y_true, y_pred))
              precision    recall  f1-score   support

           0       0.94      0.68      0.79       606
           1       0.19      0.62      0.29        74

    accuracy                           0.67       680
   macro avg       0.56      0.65      0.54       680
weighted avg       0.86      0.67      0.73       680
Decision Tree
In [67]:
clf = DecisionTreeClassifier( random_state=0 ) # set up the DT classification model
In [68]:
clf.fit( X_train ,y_train ) # Train the model by fitting it to the data (using the default values for all parameters)
Out[68]:
DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=None, max_features=None, max_leaf_nodes=None, min_impurity_decrease=0.0, min_impurity_split=None, min_samples_leaf=1, min_samples_split=2, min_weight_fraction_leaf=0.0, presort=False, random_state=0, splitter='best')
In [69]:
y_true, y_pred = y_train, clf.predict( X_train ) # Apply the model to predict the training compound's activity.
In [70]:
CMat = confusion_matrix( y_true, y_pred )   #-- generate confusion matrix
print(CMat)
# [[TN, FP],
#  [FN, TP]]
[[663   5]
 [  3 665]]
In [71]:
acc = accuracy_score( y_true, y_pred )
sens = CMat[ 1 ][ 1 ] / ( CMat[ 1 ][ 0 ] + CMat[ 1 ][ 1 ] )   # TP / (FN + TP)
spec = CMat[ 0 ][ 0 ] / ( CMat[ 0 ][ 0 ] + CMat[ 0 ][ 1 ] )   # TN / (TN + FP)
bacc = (sens + spec) / 2

y_score = clf.predict_proba( X_train )[:, 1]
auc = roc_auc_score( y_true, y_score )
In [72]:
print("#-- Accuracy = ", acc) print("#-- Balanced Accuracy = ", bacc) print("#-- Sensitivity = ", sens) print("#-- Specificity = ", spec) print("#-- AUC-ROC = ", auc)
#-- Accuracy = 0.9940119760479041
#-- Balanced Accuracy = 0.9940119760479043
#-- Sensitivity = 0.9955089820359282
#-- Specificity = 0.9925149700598802
#-- AUC-ROC = 0.9998890691670551
When applied to predict the activity of the training compounds, the DT classifier resulted in very high scores (>0.99) for all five performance measures considered here. However, it does not necessarily mean that the model will perform very well for the test set compounds. Let's apply the model to the test set.
In [73]:
y_true, y_pred = y_test, clf.predict(X_test) #-- Apply the model to predict the test set compounds' activity.
In [74]:
CMat = confusion_matrix( y_true, y_pred )   #-- generate confusion matrix
print(CMat)
# [[TN, FP],
#  [FN, TP]]
[[422 184]
 [ 32  42]]
In [75]:
acc = accuracy_score( y_true, y_pred )
sens = CMat[ 1 ][ 1 ] / ( CMat[ 1 ][ 0 ] + CMat[ 1 ][ 1 ] )
spec = CMat[ 0 ][ 0 ] / ( CMat[ 0 ][ 0 ] + CMat[ 0 ][ 1 ] )
bacc = (sens + spec) / 2

y_score = clf.predict_proba( X_test )[:, 1]
auc = roc_auc_score( y_true, y_score )

print("#-- Accuracy = ", acc)
print("#-- Balanced Accuracy = ", bacc)
print("#-- Sensitivity = ", sens)
print("#-- Specificity = ", spec)
print("#-- AUC-ROC = ", auc)
#-- Accuracy = 0.6823529411764706
#-- Balanced Accuracy = 0.6319686022656319
#-- Sensitivity = 0.5675675675675675
#-- Specificity = 0.6963696369636964
#-- AUC-ROC = 0.6298724467041299
When the DT model was applied to the test set, all performance measures were much worse than those for the training set. This is a typical example of overfitting.
Model building through cross-validation
In the above section, the models were developed using the default values for many optional hyperparameters, which cannot be learned by the training algorithm itself. For example, when building a decision tree model, one must specify how deep the tree may grow, the minimum number of compounds required to split a node, the minimum number of compounds allowed in a single leaf, etc.
The cells below demonstrate how to perform hyperparameter optimization through 10-fold cross-validation. In this example, five values are considered for each of three decision tree hyperparameters (max_depth, min_samples_split, and min_samples_leaf), giving a total of 125 combinations of parameter values (= 5 x 5 x 5). For each combination, 10 models are generated (through 10-fold cross-validation) and their average performance is tracked. The goal is to find the parameter value combination that gives the highest average performance score (e.g., 'roc_auc') from the 10-fold cross-validation.
In [76]:
from sklearn.model_selection import GridSearchCV
In [77]:
scores = [ 'roc_auc', 'balanced_accuracy' ]
In [78]:
ncvs = 10

max_depth_range         = np.linspace( 3, 7, num=5, dtype='int32' )
min_samples_split_range = np.linspace( 3, 7, num=5, dtype='int32' )
min_samples_leaf_range  = np.linspace( 2, 6, num=5, dtype='int32' )

param_grid = dict( max_depth=max_depth_range,
                   min_samples_split=min_samples_split_range,
                   min_samples_leaf=min_samples_leaf_range )

clf = GridSearchCV( DecisionTreeClassifier( random_state=0 ),
                    param_grid=param_grid, cv=ncvs, scoring=scores,
                    refit='roc_auc', return_train_score=True, iid=False )
In [79]:
clf.fit( X_train, y_train )
print("Best parameter set", clf.best_params_)
Best parameter set {'max_depth': 5, 'min_samples_leaf': 4, 'min_samples_split': 3}
If necessary, it is possible to look into the performance data for each parameter value combination (stored in clf.cv_results_), as shown in the following cell.
In [80]:
means_1a = clf.cv_results_['mean_train_roc_auc']
stds_1a  = clf.cv_results_['std_train_roc_auc']
means_1b = clf.cv_results_['mean_test_roc_auc']
stds_1b  = clf.cv_results_['std_test_roc_auc']

means_2a = clf.cv_results_['mean_train_balanced_accuracy']
stds_2a  = clf.cv_results_['std_train_balanced_accuracy']
means_2b = clf.cv_results_['mean_test_balanced_accuracy']
stds_2b  = clf.cv_results_['std_test_balanced_accuracy']

iterobjs = zip( means_1a, stds_1a, means_1b, stds_1b,
                means_2a, stds_2a, means_2b, stds_2b,
                clf.cv_results_['params'] )

for m1a, s1a, m1b, s1b, m2a, s2a, m2b, s2b, params in iterobjs :
    print( "Grid %r : %0.4f %0.04f %0.4f %0.04f %0.4f %0.04f %0.4f %0.04f"
           % ( params, m1a, s1a, m1b, s1b, m2a, s2a, m2b, s2b ) )
Grid {'max_depth': 3, 'min_samples_leaf': 2, 'min_samples_split': 3} : 0.7714 0.0042 0.7542 0.0420 0.7264 0.0052 0.7019 0.0370
Grid {'max_depth': 3, 'min_samples_leaf': 2, 'min_samples_split': 4} : 0.7714 0.0042 0.7542 0.0420 0.7264 0.0052 0.7019 0.0370
Grid {'max_depth': 3, 'min_samples_leaf': 2, 'min_samples_split': 5} : 0.7714 0.0042 0.7542 0.0420 0.7264 0.0052 0.7019 0.0370
Grid {'max_depth': 3, 'min_samples_leaf': 2, 'min_samples_split': 6} : 0.7714 0.0042 0.7542 0.0420 0.7264 0.0052 0.7019 0.0370
Grid {'max_depth': 3, 'min_samples_leaf': 2, 'min_samples_split': 7} : 0.7714 0.0042 0.7542 0.0420 0.7264 0.0052 0.7019 0.0370
. . .
Grid {'max_depth': 7, 'min_samples_leaf': 6, 'min_samples_split': 4} : 0.8873 0.0072 0.7375 0.0524 0.8068 0.0064 0.6855 0.0413
Grid {'max_depth': 7, 'min_samples_leaf': 6, 'min_samples_split': 5} : 0.8873 0.0072 0.7375 0.0524 0.8068 0.0064 0.6855 0.0413
Grid {'max_depth': 7, 'min_samples_leaf': 6, 'min_samples_split': 6} : 0.8873 0.0072 0.7375 0.0524 0.8068 0.0064 0.6855 0.0413
Grid {'max_depth': 7, 'min_samples_leaf': 6, 'min_samples_split': 7} : 0.8873 0.0072 0.7375 0.0524 0.8068 0.0064 0.6855 0.0413
Uncomment the following cell to look into the additional performance data stored in clf.cv_results_.
In [81]:
#print(clf.cv_results_)
It is important to understand that each model built through 10-fold cross-validation during hyperparameter optimization uses only 90% of the compounds in the training set, with the remaining 10% used to test that model. After all parameter value combinations have been evaluated, the best parameter values are selected and used to rebuild a model from all compounds in the training set. GridSearchCV() takes care of this last step automatically, so there is no need to call clf.fit() again after hyperparameter optimization.
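If you want to inspect the refit model directly, the fitted GridSearchCV object exposes it (along with the best cross-validated score); a minimal sketch:

    print(clf.best_score_)       # mean cross-validated roc_auc for the best parameter combination
    print(clf.best_estimator_)   # the DecisionTreeClassifier refit on the full (downsampled) training set

Calling clf.predict() or clf.predict_proba(), as in the next cells, automatically delegates to this refit estimator.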
In [82]:
y_true, y_pred = y_train, clf.predict( X_train ) # Apply the model to predict the training compound's activity.
In [83]:
CMat = confusion_matrix( y_true, y_pred )   #-- generate confusion matrix
print(CMat)
# [[TN, FP],
#  [FN, TP]]
[[543 125]
 [179 489]]
In [84]:
acc = accuracy_score( y_true, y_pred )
sens = CMat[ 1 ][ 1 ] / ( CMat[ 1 ][ 0 ] + CMat[ 1 ][ 1 ] )   # TP / (FN + TP)
spec = CMat[ 0 ][ 0 ] / ( CMat[ 0 ][ 0 ] + CMat[ 0 ][ 1 ] )   # TN / (TN + FP)
bacc = (sens + spec) / 2

y_score = clf.predict_proba( X_train )[:, 1]
auc = roc_auc_score( y_true, y_score )
In [85]:
print("#-- Accuracy = ", acc) print("#-- Balanced Accuracy = ", bacc) print("#-- Sensitivity = ", sens) print("#-- Specificity = ", spec) print("#-- AUC-ROC = ", auc)
#-- Accuracy = 0.7724550898203593
#-- Balanced Accuracy = 0.7724550898203593
#-- Sensitivity = 0.7320359281437125
#-- Specificity = 0.812874251497006
#-- AUC-ROC = 0.8336452095808383
Compare these performance data with those from section 8-(2) (for the training set). When the default values were used, the DT model gave >0.99 for all performance measures, but the current model (developed with hyperparameter optimization) has much lower values, ranging from 0.73 to 0.83. Again, however, what really matters is the performance against the test set, which contains data not used for model training.
In [86]:
y_true, y_pred = y_test, clf.predict(X_test) #-- Apply the model to predict the test set compounds' activity.
In [87]:
CMat = confusion_matrix( y_true, y_pred )   #-- generate confusion matrix
print(CMat)
# [[TN, FP],
#  [FN, TP]]
[[457 149]
 [ 26  48]]
In [88]:
acc = accuracy_score( y_true, y_pred )
sens = CMat[ 1 ][ 1 ] / ( CMat[ 1 ][ 0 ] + CMat[ 1 ][ 1 ] )
spec = CMat[ 0 ][ 0 ] / ( CMat[ 0 ][ 0 ] + CMat[ 0 ][ 1 ] )
bacc = (sens + spec) / 2

y_score = clf.predict_proba( X_test )[:, 1]
auc = roc_auc_score( y_true, y_score )

print("#-- Accuracy = ", acc)
print("#-- Balanced Accuracy = ", bacc)
print("#-- Sensitivity = ", sens)
print("#-- Specificity = ", spec)
print("#-- AUC-ROC = ", auc)
#-- Accuracy = 0.7426470588235294
#-- Balanced Accuracy = 0.7013870305949514
#-- Sensitivity = 0.6486486486486487
#-- Specificity = 0.7541254125412541
#-- AUC-ROC = 0.7496209080367496
Now we can see that the model from hyperparameter optimization gives better performance against the test set than the model developed with the default parameter values. Importantly, the model from hyperparameter optimization also shows smaller differences in the performance measures between the training and test sets, indicating that the overfitting issue has been alleviated substantially.
Exercises
In this assignment, we will build predictive models using the same aromatase data.
Step 1 Show the following information to make sure that the activity data in the df_activity data frame are still available.
- The first five lines of df_activity
In [89]:
# Write your code in this cell.
- The counts of active/inactive compounds in df_activity
In [90]:
# Write your code in this cell.
Step 2 Show the following information to make sure the structure data is still available.
- The first five lines of df_smiles
In [91]:
# Write your code in this cell.
- the number of rows of df_smiles
In [92]:
# Write your code in this cell.
Step 3 Generate the (ECFP-equivalent) circular fingerprints from the SMILES strings.
- Use RDKit to generate 1024-bit-long circular fingerprints.
- Set the radius of the circular fingerprint to 2.
- Store the fingerprints in a data_frame called df_fps (along with the CIDs).
- Print the dimension of df_fps.
- Show the first five lines of df_fps.
In [93]:
# Write your code in this cell
Step 4 Merge the df_activity and df_fps data frames into a data frame called df_data
- Join the two data frames using the CID column as keys.
- Remove the rows that have any NULL values (i.e., compounds for which the fingerprints couldn't be generated).
- Print the dimension of df_data.
- Show the first five lines of df_data.
In [94]:
# Write your code in this cell.
Step 5 Prepare input and output data for model building
- Load the fingerprint data into 2-D array (X) and the activity data into 1-D array (y).
- Show the dimension of X and y.
In [95]:
# Write your code in this cell.
- Remove zero-variance features from X (if any).
In [96]:
# Write your code in this cell.
- Split the data set into training and test sets (90% vs 10%) (using random_state=3100).
- Print the dimension of X and y for the training and test sets.
In [97]:
# Write your code in this cell.
- Balance the training data set through downsampling.
- Show the number of inactive/active compounds in the downsampled training set.
In [98]:
# Write your code in this cell.
Step 6 Build a Random Forest model using the balanced training data set.
- First read the following documents about random forests (https://scikit-learn.org/stable/modules/ensemble.html#forest and https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html#sklearn.ensemble.RandomForestClassifier).
- Use 10-fold cross validation to select the best value for the "n_estimators" parameter that maximizes the balanced accuracy. Test 40 values from 5 to 200 with an increment of 5 (e.g., 5, 10, 15, 20, ..., 190, 195, 200).
- For parameters 'max_depth', 'min_samples_leaf', and 'min_samples_split', use the best values found in Section 9.
- For other parameters, use the default values.
- For each parameter value, print the mean balanced accuracies (for both training and test from cross validation).
In [99]:
# Write your code in this cell.
Step 7 Apply the developed RF model to predict the activity of the training set compounds.
- Report the confusion matrix.
- Report the accuracy, balanced accuracy, sensitivity, specificity, and AUC-ROC.
In [100]:
# Write your code in this cell.
Step 8 Apply the developed RF model to predict the activity of the test set compounds.
- Report the accuracy, balanced accuracy, sensitivity, specificity, and AUC-ROC.
In [101]:
# Write your code in this cell.
Step 9 Read a recent paper published in Chem. Res. Toxicol. (https://doi.org/10.1021/acs.chemrestox.7b00037) and answer the following questions (in no more than five sentences for each question).
- What different approaches did the paper take to develop prediction models (compared to those used in this notebook)?
- How different are the models reported in the paper from those constructed in this notebook (in terms of the performance measures)?
- What would you do to develop models with improved performance?
Write your answers in this cell.
In [ ]: