Processing math: 100%
Skip to main content
Library homepage
 

Text Color

Text Size

 

Margin Size

 

Font Type

Enable Dyslexic Font
Chemistry LibreTexts

8.3: Python Assignment

( \newcommand{\kernel}{\mathrm{null}\,}\)

Machine Learning Basics

Downloadable Files

lecture08_machine-learning.ipynb

  • Download the ipynb file and run your Jupyter notebook.
    • You can use the notebook you created in section 1.5 or the Jupyter hub at LibreText: https://jupyter.libretexts.org (see your instructor if you do not have access to the hub).
    • This page is an html version of the above .ipynb file.
      • If you have questions on this assignment you should use this web page and the hypothes.is annotation to post a question (or comment) to the 2019OLCCStu class group. Contact your instructor if you do not know how to access the 2019OLCCStu group within the hypothes.is system.

Required Modules

  • pandas
  • numpy
  • time
  • requests
  • io
  • rdkit
  • sklearn

Objectives

  • Build binary classification models that predict activity/inactivity of small molecules against human aromatase using supervised learning methods.
  • Evaluate the performance of the developed models using performance measures.

Import bioactivity data from PubChem

In this notebook, we will develop a prediction model for small molecule's activity against human aromatase (https://pubchem.ncbi.nlm.nih.gov/protein/EAW77416), which is encoded by the CYP19A1 gene (https://pubchem.ncbi.nlm.nih.gov/gene/1588). The model will predict the activity of a molecule based on the structure of the molecule (represented with molecular fingerprints).

For model development, we will use the Tox21 bioassay data for human aromatase, archived in PubChem (https://pubchem.ncbi.nlm.nih.gov/bioassay/743139). The bioactivity data presented on this page can be downloaded by clicking the "Download" button available on this page and then read the data into a data frame. Alternatively, you can directly load the data into a data frame as shown in the cell below.

In [1]:

import pandas as pd
import numpy as np

url = 'https://pubchem.ncbi.nlm.nih.gov/assay/pcget.cgi?query=download&record_type=datatable&actvty=all&response_type=save&aid=743139'
df_raw = pd.read_csv(url)

In [2]:

df_raw.head(7)

Out[2]:

PUBCHEM_RESULT_TAG PUBCHEM_SID PUBCHEM_CID PUBCHEM_ACTIVITY_OUTCOME PUBCHEM_ACTIVITY_SCORE PUBCHEM_ACTIVITY_URL PUBCHEM_ASSAYDATA_COMMENT Activity Summary Antagonist Activity Antagonist Potency (uM) Antagonist Efficacy (%) Viability Activity Viability Potency (uM) Viability Efficacy (%) Sample Source
0 RESULT_TYPE NaN NaN NaN NaN NaN NaN STRING STRING FLOAT FLOAT STRING FLOAT FLOAT STRING
1 RESULT_DESCR NaN NaN NaN NaN NaN NaN Type of compound activity based on both the ar... Type of compound activity in the aromatase ant... The concentration of sample yielding half-maxi... Percent inhibition of aromatase. Type of compound activity in the cell viabilit... The concentration of sample yielding half-maxi... Percent inhibition of cell viability. Where sample was obtained.
2 RESULT_UNIT NaN NaN NaN NaN NaN NaN NaN NaN MICROMOLAR PERCENT NaN MICROMOLAR PERCENT NaN
3 1 144203552.0 12850184.0 Inactive 0.0 NaN NaN inactive inactive NaN 0 inactive NaN 0 NCI
4 2 144203553.0 89753.0 Inactive 0.0 NaN NaN inactive inactive NaN 0 inactive NaN 0 NCI
5 3 144203554.0 9403.0 Inactive 0.0 NaN NaN inactive inactive NaN 0 inactive NaN 0 NCI
6 4 144203555.0 13218779.0 Inactive 0.0 NaN NaN inactive inactive NaN 0 inactive NaN 0 NCI

Note: Lines 0-2 provide the descriptions for each column (data type, descriptions, units, etc). These rows need be removed.

In [3]:

df_raw = df_raw[3:]
df_raw.head(5)

Out[3]:

PUBCHEM_RESULT_TAG PUBCHEM_SID PUBCHEM_CID PUBCHEM_ACTIVITY_OUTCOME PUBCHEM_ACTIVITY_SCORE PUBCHEM_ACTIVITY_URL PUBCHEM_ASSAYDATA_COMMENT Activity Summary Antagonist Activity Antagonist Potency (uM) Antagonist Efficacy (%) Viability Activity Viability Potency (uM) Viability Efficacy (%) Sample Source
3 1 144203552.0 12850184.0 Inactive 0.0 NaN NaN inactive inactive NaN 0 inactive NaN 0 NCI
4 2 144203553.0 89753.0 Inactive 0.0 NaN NaN inactive inactive NaN 0 inactive NaN 0 NCI
5 3 144203554.0 9403.0 Inactive 0.0 NaN NaN inactive inactive NaN 0 inactive NaN 0 NCI
6 4 144203555.0 13218779.0 Inactive 0.0 NaN NaN inactive inactive NaN 0 inactive NaN 0 NCI
7 5 144203556.0 142766.0 Inconclusive 25.0 NaN NaN inconclusive antagonist (cytotoxic) active antagonist 15.5454 -115.803 active antagonist 14.9601 -76.8218 NCI

The column names in this data frame contain white spaces and special characters. For simplicity, let's rename the columns (no spaces or special characters except for the "_" character.)

In [4]:

df_raw.columns

Out[4]:

Index(['PUBCHEM_RESULT_TAG', 'PUBCHEM_SID', 'PUBCHEM_CID',
       'PUBCHEM_ACTIVITY_OUTCOME', 'PUBCHEM_ACTIVITY_SCORE',
       'PUBCHEM_ACTIVITY_URL', 'PUBCHEM_ASSAYDATA_COMMENT', 'Activity Summary',
       'Antagonist Activity', 'Antagonist Potency (uM)',
       'Antagonist Efficacy (%)', 'Viability Activity',
       'Viability Potency (uM)', 'Viability Efficacy (%)', 'Sample Source'],
      dtype='object')

In [5]:

col_names_map = {'PUBCHEM_RESULT_TAG' : 'pc_result_tag', 
                 'PUBCHEM_SID' : 'sid', 
                 'PUBCHEM_CID' : 'cid',
                 'PUBCHEM_ACTIVITY_OUTCOME' : 'activity_outcome', 
                 'PUBCHEM_ACTIVITY_SCORE' : 'activity_score',
                 'PUBCHEM_ACTIVITY_URL' : 'activity_url', 
                 'PUBCHEM_ASSAYDATA_COMMENT' : 'assay_data_comment', 
                 'Activity Summary' : 'activity_summary',
                 'Antagonist Activity' : 'antagonist_activity', 
                 'Antagonist Potency (uM)' : 'antagonist_potency', 
                 'Antagonist Efficacy (%)' : 'antagonist_efficacy',
                 'Viability Activity' : 'viability_activity', 
                 'Viability Potency (uM)' : 'viability_potency',
                 'Viability Efficacy (%)' : 'viability_efficacy', 
                 'Sample Source' : 'sample_source' }

In [6]:

df_raw = df_raw.rename(columns = col_names_map)
df_raw.columns

Out[6]:

Index(['pc_result_tag', 'sid', 'cid', 'activity_outcome', 'activity_score',
       'activity_url', 'assay_data_comment', 'activity_summary',
       'antagonist_activity', 'antagonist_potency', 'antagonist_efficacy',
       'viability_activity', 'viability_potency', 'viability_efficacy',
       'sample_source'],
      dtype='object')

Check the number of compounds for each activity group

First, we need to understand what our data look like. Especially, we are interested in the activity class of the tested compounds because we are developing a model that classifies small molecules according to their activities against the target. This information is available in the "activity_outcome" and "activity_summary" columns.

In [7]:

df_raw.groupby(['activity_outcome']).count()

Out[7]:

pc_result_tag sid cid activity_score activity_url assay_data_comment activity_summary antagonist_activity antagonist_potency antagonist_efficacy viability_activity viability_potency viability_efficacy sample_source
activity_outcome
Active 379 379 378 379 0 0 379 379 378 379 379 115 359 379
Inactive 7562 7562 7466 7562 0 0 7562 7562 0 7562 7562 324 7449 7562
Inconclusive 2545 2545 2493 2545 0 0 2545 2545 2111 2136 2545 1206 2450 2545

Based on the data in the activity_outcome column, there are 379 actives, 7562 inactives, and 2545 inconclusives.

In [8]:

df_raw.groupby(['activity_outcome','activity_summary']).count()

Out[8]:

pc_result_tag sid cid activity_score activity_url assay_data_comment antagonist_activity antagonist_potency antagonist_efficacy viability_activity viability_potency viability_efficacy sample_source
activity_outcome activity_summary
Active active antagonist 379 379 378 379 0 0 379 378 379 379 115 359 379
Inactive inactive 7562 7562 7466 7562 0 0 7562 0 7562 7562 324 7449 7562
Inconclusive active agonist 612 612 571 612 0 0 612 612 612 612 60 590 612
inconclusive 44 44 44 44 0 0 44 0 0 44 19 42 44
inconclusive agonist 414 414 409 414 0 0 414 212 223 414 12 397 414
inconclusive agonist (cytotoxic) 59 59 59 59 0 0 59 41 45 59 59 59 59
inconclusive antagonist 367 367 364 367 0 0 367 227 230 367 8 313 367
inconclusive antagonist (cytotoxic) 1049 1049 1046 1049 0 0 1049 1019 1026 1049 1048 1049 1049

Now, we can see that, in the activity_summary column, the inconclusive compounds are further classified into subclasses, which include:

  • active agonist
  • inconclusive
  • inconclusive agonist
  • inconclusive antagonist
  • inconclusive agonist (cytotoxic)
  • inconclusive antagonist (cytotoxic)

As implied in the title of this assay record (https://pubchem.ncbi.nlm.nih.gov/bioassay/743139), this assay aims to identify aromatase inhibitors. Therefore, all active antagonists (in the activity summary column) were declared to be active compounds (in the activity outcome column).

On the other hand, the assay also identified 612 active agonists (in the activity summary column), and they are declared to be inconclusive (in the activity outcome column).

With that said, "inactive" compounds in this assay means those which are neither active agonists nor active antagonist.

It is important to understand that the criteria used for determining whether a compound is active or not in a given assay are selected by the data source who submitted that assay data to PubChem. For the purpose of this assignment (which aims to develop a binary classifier that tells if a molecule is active or inactive against the target), we should clarify what we mean by "active" and "inactive".

  • active : any compounds that can change (either increase or decrease) the activity of the target. This is equivalent to either active antagonists or active agonists in the activity summary column.
  • inactive : any compounds that do not change the activity of the target. This is equivalent to inactive compounds in the activity summary column.

Select active/inactive compounds for model building

Now we want to select only the active and inactive compounds from the data frame (that is, active agonists, active antagonists, and inactives based on the "activity summary" column).

In [9]:

df = df_raw[ (df_raw['activity_summary'] == 'active agonist' ) | 
             (df_raw['activity_summary'] == 'active antagonist' ) |
             (df_raw['activity_summary'] == 'inactive' ) ]

len(df)

Out[9]:

8553

In [10]:

print(len(df['sid'].unique()))
print(len(df['cid'].unique()))
8553
6864

Note that the number of CIDs is not the same as the number of SIDs. There are two important potential reasons for this observation.

First, not all substances (SIDs) in PubChem have associated compounds (CIDs) because some substances failed during structure standardization. [Remember that, in PubChem, substances are depositor-provided structures and compounds are unique structures extracted from substances through structure standardization.] Because our model will use structural information of molecules to predict their bioactivity, we need to remove substances without associated CIDs (i.e., no standardized structures).

Second, some compounds are associated with more than one substances. In the context of this assay, it means that a compound may be tested multiple times in different samples (which are designated as different substances). It is not uncommon that different samples of the same chemical may result in conflicting activities (e.g., active agonist in one sample but inactive in another sample). In this practice, we remove such compounds with conflicting activities.

Drop substances without associated CIDs.

First, check if there are subtances without associated CIDs.

In [11]:

df.isna().sum()

Out[11]:

pc_result_tag             0
sid                       0
cid                     138
activity_outcome          0
activity_score            0
activity_url           8553
assay_data_comment     8553
activity_summary          0
antagonist_activity       0
antagonist_potency     7563
antagonist_efficacy       0
viability_activity        0
viability_potency      8054
viability_efficacy      155
sample_source             0
dtype: int64

There are 138 records whose "cid" column is NULL, and we want to remove those records.

In [12]:

df = df.dropna( subset=['cid'] )
len(df)

Out[12]:

8415

In [13]:

print(len(df['sid'].unique()))
print(len(df['cid'].unique()))
8415
6863

In [14]:

df.isna().sum()   # Check if the NULL values disappeared in the "cid" column

Out[14]:

pc_result_tag             0
sid                       0
cid                       0
activity_outcome          0
activity_score            0
activity_url           8415
assay_data_comment     8415
activity_summary          0
antagonist_activity       0
antagonist_potency     7467
antagonist_efficacy       0
viability_activity        0
viability_potency      7919
viability_efficacy      154
sample_source             0
dtype: int64

Remove CIDs with conflicting activities

Now identify compounds with conflicting activities and remove them.

In [15]:

cid_conflict = []
idx_conflict = []

for mycid in df['cid'].unique() :
    
    outcomes = df[ df.cid == mycid ].activity_summary.unique()
    
    if len(outcomes) > 1 :
        
        idx_tmp = df.index[ df.cid == mycid ].tolist()
        idx_conflict.extend(idx_tmp)
        cid_conflict.append(mycid)

print("#", len(cid_conflict), "CIDs with conflicting activities [associated with", len(idx_conflict), "rows (SIDs).]")
# 65 CIDs with conflicting activities [associated with 146 rows (SIDs).]

In [16]:

df.loc[idx_conflict,:].head(10)

Out[16]:

pc_result_tag sid cid activity_outcome activity_score activity_url assay_data_comment activity_summary antagonist_activity antagonist_potency antagonist_efficacy viability_activity viability_potency viability_efficacy sample_source
8 6 144203557.0 16043.0 Inactive 0.0 NaN NaN inactive inactive NaN 0 inactive NaN 0 NCI
5956 5954 144209507.0 16043.0 Active 43.0 NaN NaN active antagonist active antagonist 54.4827 -73.4024 inconclusive antagonist NaN NaN SigmaAldrich
6850 6848 144210401.0 16043.0 Inactive 0.0 NaN NaN inactive inactive NaN 0 inactive NaN 0 SIGMA
52 50 144203601.0 443939.0 Inactive 0.0 NaN NaN inactive inactive NaN 0 inactive NaN 0 NCI
6130 6128 144209681.0 443939.0 Active 61.0 NaN NaN active antagonist active antagonist 1.65519 -115.932 active antagonist 12.1763 -120.598 Toronto Research
66 64 144203615.0 2170.0 Inactive 0.0 NaN NaN inactive inactive NaN 0 inactive NaN 0 BIOMOL
9118 9116 144212669.0 2170.0 Active 50.0 NaN NaN active antagonist active antagonist 16.5803 -115.202 inconclusive antagonist 61.1306 -80.7706 SIGMA
106 104 144203655.0 2554.0 Inconclusive 20.0 NaN NaN active agonist active agonist 2.87255 73.7025 inactive NaN 0 SigmaAldrich
5920 5918 144209471.0 2554.0 Inactive 0.0 NaN NaN inactive inactive NaN 0 inactive NaN 0 SIGMA
6964 6962 144210515.0 2554.0 Inactive 0.0 NaN NaN inactive inactive NaN 0 inactive NaN 0 SIGMA

In [17]:

df = df.drop(idx_conflict)

In [18]:

df.groupby('activity_summary').count()

Out[18]:

pc_result_tag sid cid activity_outcome activity_score activity_url assay_data_comment antagonist_activity antagonist_potency antagonist_efficacy viability_activity viability_potency viability_efficacy sample_source
activity_summary
active agonist 537 537 537 537 537 0 0 537 537 537 537 58 517 537
active antagonist 343 343 343 343 343 0 0 343 342 343 343 108 326 343
inactive 7389 7389 7389 7389 7389 0 0 7389 0 7389 7389 318 7278 7389

In [19]:

print(len(df['sid'].unique()))
print(len(df['cid'].unique()))
8269
6798

Remove redundant data

The above code cells [in 3-(2)] do not remove compounds tested multiple times if the testing results are consistent [e.g., active agonist in all samples (substances)]. The rows corresponding to these compounds are redundant, so we want remove them except for only one row for each compound.

In [20]:

df = df.drop_duplicates(subset='cid')  # remove duplicate rows except for the first occurring row.
print(len(df['sid'].unique()))
print(len(df['cid'].unique()))
6798
6798

Adding "numeric" activity classes

In general, the inputs and outputs to machine learning algorithms need to have numerical forms.
In this practice, the input (molecular structure) will be represented with binary fingerprints, which already have numerical forms (0 or 1). However, the output (activity) is currently in a string format (e.g., 'active agonist', 'active antagonist'). Therefore, we want to add an additional, 'activity' column, which contains numeric codes representing the active and inactive compounds:

  • 1 for actives (either active agonists or active antagonists)
  • 0 for inactives

Note that we are merging the two classes "active agonist" and "active antagonist", because we are going to build a binary classifer that distinguish actives from inactives.

In [21]:

df['activity'] = [ 0 if x == 'inactive' else 1 for x in df['activity_summary'] ]

Check if the new column 'activity' is added to (the end of) the data frame.

In [22]:

df.head(3)

Out[22]:

pc_result_tag sid cid activity_outcome activity_score activity_url assay_data_comment activity_summary antagonist_activity antagonist_potency antagonist_efficacy viability_activity viability_potency viability_efficacy sample_source activity
3 1 144203552.0 12850184.0 Inactive 0.0 NaN NaN inactive inactive NaN 0 inactive NaN 0 NCI 0
4 2 144203553.0 89753.0 Inactive 0.0 NaN NaN inactive inactive NaN 0 inactive NaN 0 NCI 0
5 3 144203554.0 9403.0 Inactive 0.0 NaN NaN inactive inactive NaN 0 inactive NaN 0 NCI 0

Double-check the count of active/inactive compounds.

In [23]:

df.groupby('activity_summary').count()

Out[23]:

pc_result_tag sid cid activity_outcome activity_score activity_url assay_data_comment antagonist_activity antagonist_potency antagonist_efficacy viability_activity viability_potency viability_efficacy sample_source activity
activity_summary
active agonist 451 451 451 451 451 0 0 451 451 451 451 44 432 451 451
active antagonist 291 291 291 291 291 0 0 291 290 291 291 88 275 291 291
inactive 6056 6056 6056 6056 6056 0 0 6056 0 6056 6056 269 5970 6056 6056

In [24]:

df.groupby('activity').count()

Out[24]:

pc_result_tag sid cid activity_outcome activity_score activity_url assay_data_comment activity_summary antagonist_activity antagonist_potency antagonist_efficacy viability_activity viability_potency viability_efficacy sample_source
activity
0 6056 6056 6056 6056 6056 0 0 6056 6056 0 6056 6056 269 5970 6056
1 742 742 742 742 742 0 0 742 742 741 742 742 132 707 742

Create a smaller data frame that only contains CIDs and activities.

Let's create a smaller data frame that only contains CIDs and activities. This data frame will be merged with a data frame containing molecular fingerprint information.

In [25]:

df_activity = df[['cid','activity']]

In [26]:

df_activity.head(5)

Out[26]:

cid activity
3 12850184.0 0
4 89753.0 0
5 9403.0 0
6 13218779.0 0
12 637566.0 0

Download structure information for each compound from PubChem

Now we want to get structure information of the compounds from PubChem (in isomeric SMILES).

In [27]:

cids = df.cid.astype(int).tolist()

In [28]:

chunk_size = 200
num_cids = len(cids)

if num_cids % chunk_size == 0 :
    num_chunks = int( num_cids / chunk_size )
else :
    num_chunks = int( num_cids / chunk_size ) + 1

print("# CIDs = ", num_cids)
print("# CID Chunks = ", num_chunks, "(chunked by ", chunk_size, ")")
# CIDs =  6798
# CID Chunks =  34 (chunked by  200 )

In [29]:

import time
import requests
from io import StringIO

df_smiles = pd.DataFrame()
list_dfs = []  # temporary list of data frames

for i in range(0, num_chunks) :
    
    idx1 = chunk_size * i
    idx2 = chunk_size * (i + 1)
    cidstr = ",".join( str(x) for x in cids[idx1:idx2] )

    url = ('https://pubchem.ncbi.nlm.nih.gov/rest/pug/compound/cid/' + cidstr + '/property/IsomericSMILES/TXT')
    res = requests.get(url)
    data = pd.read_csv( StringIO(res.text), header=None, names=['smiles'] )
    list_dfs.append(data)
    
    time.sleep(0.2)
    
    if ( i % 5 == 0 ) :
        print("Processing Chunk ", i)   

#    if ( i == 2 ) : break  #- for debugging

df_smiles = pd.concat(list_dfs,ignore_index=True)
df_smiles[ 'cid' ] = cids
df_smiles.head(5)
Processing Chunk  0
Processing Chunk  5
Processing Chunk  10
Processing Chunk  15
Processing Chunk  20
Processing Chunk  25
Processing Chunk  30

Out[29]:

smiles cid
0 C(C(=O)[C@H]([C@@H]([C@H](C(=O)[O-])O)O)O)O.C(... 12850184
1 C([C@H]([C@H]([C@@H]([C@H](C(=O)[O-])O)O)O)O)O... 89753
2 C[C@]12CC[C@H]3[C@H]([C@@H]1CC[C@@H]2OC(=O)CCC... 9403
3 C[C@@]12CC[C@@H](C1(C)C)C[C@H]2OC(=O)CSC#N 13218779
4 CC(=CCC/C(=C/CO)/C)C 637566

In [30]:

len(df_smiles)

Out[30]:

6798

In [31]:

df_smiles = df_smiles[['cid','smiles']]
df_smiles.head(5)

Out[31]:

cid smiles
0 12850184 C(C(=O)[C@H]([C@@H]([C@H](C(=O)[O-])O)O)O)O.C(...
1 89753 C([C@H]([C@H]([C@@H]([C@H](C(=O)[O-])O)O)O)O)O...
2 9403 C[C@]12CC[C@H]3[C@H]([C@@H]1CC[C@@H]2OC(=O)CCC...
3 13218779 C[C@@]12CC[C@@H](C1(C)C)C[C@H]2OC(=O)CSC#N
4 637566 CC(=CCC/C(=C/CO)/C)C

Generate MACCS keys from SMILES.

In [32]:

from rdkit import Chem
from rdkit.Chem import MACCSkeys

In [33]:

fps=dict()

for idx, row in df_smiles.iterrows() :
    
    mol = Chem.MolFromSmiles(row.smiles)
    
    if mol == None :
        print("Can't generate MOL object:", "CID", row.cid, row.smiles)
    else:
        fps[row.cid] = [row.cid] + list(MACCSkeys.GenMACCSKeys(mol).ToBitString())
Can't generate MOL object: CID 28145 [NH4+].[NH4+].F[Si-2](F)(F)(F)(F)F
Can't generate MOL object: CID 28127 F[Si-2](F)(F)(F)(F)F.[Na+].[Na+]

In [34]:

# Generate column names
fpbitnames = []

fpbitnames.append('cid')

for i in range(0,167):   # from MACCS000 to MACCS166
    fpbitnames.append( "maccs" + str(i).zfill(3) )

df_fps = pd.DataFrame.from_dict(fps, orient='index', columns=fpbitnames)

In [35]:

df_fps.head(5)

Out[35]:

cid maccs000 maccs001 maccs002 maccs003 maccs004 maccs005 maccs006 maccs007 maccs008 ... maccs157 maccs158 maccs159 maccs160 maccs161 maccs162 maccs163 maccs164 maccs165 maccs166
12850184 12850184 0 0 0 0 0 0 0 0 0 ... 1 0 1 0 0 0 0 1 0 1
89753 89753 0 0 0 0 0 0 0 0 0 ... 1 0 1 0 0 0 0 1 0 1
9403 9403 0 0 0 0 0 0 0 0 0 ... 1 0 1 1 0 1 1 1 1 0
13218779 13218779 0 0 0 0 0 0 0 0 0 ... 1 0 1 1 1 0 1 1 1 0
637566 637566 0 0 0 0 0 0 0 0 0 ... 1 0 0 1 0 0 0 1 0 0

5 rows × 168 columns

Merge activity data and fingerprint information

In [36]:

df_activity.head(3)

Out[36]:

cid activity
3 12850184.0 0
4 89753.0 0
5 9403.0 0

In [37]:

df_fps.head(3)

Out[37]:

cid maccs000 maccs001 maccs002 maccs003 maccs004 maccs005 maccs006 maccs007 maccs008 ... maccs157 maccs158 maccs159 maccs160 maccs161 maccs162 maccs163 maccs164 maccs165 maccs166
12850184 12850184 0 0 0 0 0 0 0 0 0 ... 1 0 1 0 0 0 0 1 0 1
89753 89753 0 0 0 0 0 0 0 0 0 ... 1 0 1 0 0 0 0 1 0 1
9403 9403 0 0 0 0 0 0 0 0 0 ... 1 0 1 1 0 1 1 1 1 0

3 rows × 168 columns

In [38]:

df_data = df_activity.join(df_fps.set_index('cid'), on='cid')

In Section 5, there were two CIDs for which the MACCS keys could not be generated. They need to be removed from df_data.

In [39]:

df_data[df_data.isna().any(axis=1)]

Out[39]:

cid activity maccs000 maccs001 maccs002 maccs003 maccs004 maccs005 maccs006 maccs007 ... maccs157 maccs158 maccs159 maccs160 maccs161 maccs162 maccs163 maccs164 maccs165 maccs166
2293 28145.0 0 NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
9077 28127.0 0 NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN

2 rows × 169 columns

In [40]:

df_data = df_data.dropna()
len(df_data)

Out[40]:

6796

Save df_data in CSV for future use.

In [41]:

df_data.to_csv('df_data.csv')

Preparation for model building

Loading the data into X and y.

In [42]:

df_data.head(3)

Out[42]:

cid activity maccs000 maccs001 maccs002 maccs003 maccs004 maccs005 maccs006 maccs007 ... maccs157 maccs158 maccs159 maccs160 maccs161 maccs162 maccs163 maccs164 maccs165 maccs166
3 12850184.0 0 0 0 0 0 0 0 0 0 ... 1 0 1 0 0 0 0 1 0 1
4 89753.0 0 0 0 0 0 0 0 0 0 ... 1 0 1 0 0 0 0 1 0 1
5 9403.0 0 0 0 0 0 0 0 0 0 ... 1 0 1 1 0 1 1 1 1 0

3 rows × 169 columns

In [43]:

X = df_data.iloc[:,2:]
y = df_data['activity'].values

In [44]:

X.head(3)

Out[44]:

maccs000 maccs001 maccs002 maccs003 maccs004 maccs005 maccs006 maccs007 maccs008 maccs009 ... maccs157 maccs158 maccs159 maccs160 maccs161 maccs162 maccs163 maccs164 maccs165 maccs166
3 0 0 0 0 0 0 0 0 0 0 ... 1 0 1 0 0 0 0 1 0 1
4 0 0 0 0 0 0 0 0 0 0 ... 1 0 1 0 0 0 0 1 0 1
5 0 0 0 0 0 0 0 0 0 0 ... 1 0 1 1 0 1 1 1 1 0

3 rows × 167 columns

In [45]:

print(len(y))    # Number of all compounds
y.sum()          # Number of actives
6796

Out[45]:

742

Remove zero-variance features

Some features in X are not helpful in distinguishing actives from inactives, because they are set ON for all compounds or OFF for all compounds. Such features need to be removed because they would consume more computational resources without improving the model.

In [46]:

from sklearn.feature_selection import VarianceThreshold

In [47]:

X.shape  #- Before removal

Out[47]:

(6796, 167)

In [48]:

sel = VarianceThreshold()
X=sel.fit_transform(X)
X.shape  #- After removal

Out[48]:

(6796, 163)

In this case, four features had zero variances. Note that one of them is the first bit (maccs000) of the MACCS keys, which is added as a "dummy" to name each of bits 1~166 as maccs001, maccs002, ... maccs166.

Train-Test-Split (a 9:1 ratio)

Now split the data set into a training set (90%) and test set (10%). The training set will be used to train the model. The developed model will be tested against the test set.

In [49]:

from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = \
    train_test_split(X, y, shuffle=True, random_state=3100, stratify=y, test_size=0.1)

print(X_train.shape, X_test.shape, y_train.shape, y_test.shape)
print(y_train.sum(), y_test.sum())
(6116, 163) (680, 163) (6116,) (680,)
668 74

Balance the training set through downsampling

Check the dimension of the training data set.

In [50]:

print(len(y_train))
print(len(X_train))
print(len(X_train[0]))
6116
6116
163

Check the number of actives and inactives compound.

In [51]:

print("# inactives : ", len(y_train) - y_train.sum())
print("# actives   : ", y_train.sum())
# inactives :  5448
# actives   :  668

The data set is highly imbalanced [the inactive to active ratio is 8.16 (=5448 / 668)]. To address this issue, let's downsample the majority class (inactive compounds) to balance the data set.

In [52]:

# Indicies of each class' observations
idx_inactives = np.where( y_train == 0 )[0]
idx_actives   = np.where( y_train == 1 )[0]

# Number of observations in each class
num_inactives = len(idx_inactives)
num_actives   = len(idx_actives)

# Randomly sample from inactives without replacement
np.random.seed(0)
idx_inactives_downsampled = np.random.choice(idx_inactives, size=num_actives, replace=False)

# Join together downsampled inactives with actives
X_train = np.vstack((X_train[idx_inactives_downsampled], X_train[idx_actives]))
y_train = np.hstack((y_train[idx_inactives_downsampled], y_train[idx_actives]))

It is noteworthy that np.vstack is used for X_train and np.hstack is used for Y_train. The direction of stacking is different because X_train is a 2-D array and y_train is a 1-D array.

Confirm that the downsampled data set has the correct dimension and active/inactive counts.

In [53]:

print("# inactives : ", len(y_train) - y_train.sum())
print("# actives   : ", y_train.sum())
# inactives :  668
# actives   :  668

In [54]:

print(len(y_train))
print(len(X_train))
print(len(X_train[0]))
1336
1336
163

Build a model using the training set.

Now we are ready to build predictive models using machine learning algorithms available in the scikit-learn library (https://scikit-learn.org/). This notebook will use Naive Bayes and decisiont tree, because they are relatively fast and simple.

In [55]:

from sklearn.naive_bayes import BernoulliNB        #-- Naive Bayes
from sklearn.tree import DecisionTreeClassifier    #-- Decision Tree

In [56]:

from sklearn.metrics import classification_report
from sklearn.metrics import confusion_matrix 
from sklearn.metrics import accuracy_score
from sklearn.metrics import roc_auc_score

Naive Bayes

In [57]:

clf = BernoulliNB()            # set up the NB classification model

In [58]:

clf.fit( X_train ,y_train )    # Train the model by fitting it to the data.

Out[58]:

BernoulliNB(alpha=1.0, binarize=0.0, class_prior=None, fit_prior=True)

In [59]:

y_true, y_pred = y_train, clf.predict( X_train )    # Apply the model to predict the training compound's activity.

In [60]:

CMat = confusion_matrix( y_true, y_pred )    #-- generate confusion matrix
print(CMat)    # [[TN, FP], 
               #  [FN, TP]]
[[462 206]
 [199 469]]

In [61]:

acc  = accuracy_score( y_true, y_pred )

sens = CMat[ 1 ][ 1 ] / ( CMat[ 1 ][ 0 ] + CMat[ 1 ][ 1 ] )    # TP / (FN + TP)
spec = CMat[ 0 ][ 0 ] / ( CMat[ 0 ][ 0 ] + CMat[ 0 ][ 1 ] )    # TN / (TN + FP )
bacc = (sens + spec) / 2

y_score = clf.predict_proba( X_train )[:, 1]
auc = roc_auc_score( y_true, y_score )

In [62]:

print("#-- Accuracy          = ", acc)
print("#-- Balanced Accuracy = ", bacc)
print("#-- Sensitivity       = ", sens)
print("#-- Specificity       = ", spec)
print("#-- AUC-ROC           = ", auc)
#-- Accuracy          =  0.6968562874251497
#-- Balanced Accuracy =  0.6968562874251496
#-- Sensitivity       =  0.7020958083832335
#-- Specificity       =  0.6916167664670658
#-- AUC-ROC           =  0.7496985818781599

When applied to predict the activity of the training compounds, the NB classifier resulted in the accuracy of 0.70 and AUC-ROC of 0.75. However, the real performance of the model should be evaluated with the test set data, which are not used for model training.

In [63]:

y_true, y_pred = y_test, clf.predict(X_test)    #-- Apply the model to predict the test set compounds' activity.

In [64]:

CMat = confusion_matrix( y_true, y_pred )    #-- generate confusion matrix
print(CMat)    # [[TN, FP], 
               #  [FN, TP]]
[[412 194]
 [ 28  46]]

In [65]:

acc  = accuracy_score( y_true, y_pred )

sens = CMat[ 1 ][ 1 ] / ( CMat[ 1 ][ 0 ] + CMat[ 1 ][ 1 ] )
spec = CMat[ 0 ][ 0 ] / ( CMat[ 0 ][ 0 ] + CMat[ 0 ][ 1 ] )
bacc = (sens + spec) / 2

y_score = clf.predict_proba( X_test )[:, 1]
auc = roc_auc_score( y_true, y_score )

print("#-- Accuracy          = ", acc)
print("#-- Balanced Accuracy = ", bacc)
print("#-- Sensitivity       = ", sens)
print("#-- Specificity       = ", spec)
print("#-- AUC-ROC           = ", auc)
#-- Accuracy          =  0.6735294117647059
#-- Balanced Accuracy =  0.6507448042101507
#-- Sensitivity       =  0.6216216216216216
#-- Specificity       =  0.6798679867986799
#-- AUC-ROC           =  0.724099099099099

For the test set, the accuracy is 0.67 and the AUC-ROC is 0.72. These values are somewhat smaller (by 0.03) than those for the training set. Also note that the accuracy is no longer the same as the balanced accruacy (which is the average of the sensitivity and specificity).

Some additional performance information may be obtained using classification_report().

In [66]:

print( classification_report(y_true, y_pred))
              precision    recall  f1-score   support

           0       0.94      0.68      0.79       606
           1       0.19      0.62      0.29        74

    accuracy                           0.67       680
   macro avg       0.56      0.65      0.54       680
weighted avg       0.86      0.67      0.73       680

Decision Tree

In [67]:

clf = DecisionTreeClassifier( random_state=0 )    # set up the DT classification model

In [68]:

clf.fit( X_train ,y_train )    # Train the model by fitting it to the data (using the default values for all parameters)

Out[68]:

DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=None,
                       max_features=None, max_leaf_nodes=None,
                       min_impurity_decrease=0.0, min_impurity_split=None,
                       min_samples_leaf=1, min_samples_split=2,
                       min_weight_fraction_leaf=0.0, presort=False,
                       random_state=0, splitter='best')

In [69]:

y_true, y_pred = y_train, clf.predict( X_train )    # Apply the model to predict the training compound's activity.

In [70]:

CMat = confusion_matrix( y_true, y_pred )    #-- generate confusion matrix
print(CMat)    # [[TN, FP], 
               #  [FN, TP]]
[[663   5]
 [  3 665]]

In [71]:

acc  = accuracy_score( y_true, y_pred )

sens = CMat[ 1 ][ 1 ] / ( CMat[ 1 ][ 0 ] + CMat[ 1 ][ 1 ] )    # TP / (FN + TP)
spec = CMat[ 0 ][ 0 ] / ( CMat[ 0 ][ 0 ] + CMat[ 0 ][ 1 ] )    # TN / (TN + FP )
bacc = (sens + spec) / 2

y_score = clf.predict_proba( X_train )[:, 1]
auc = roc_auc_score( y_true, y_score )

In [72]:

print("#-- Accuracy          = ", acc)
print("#-- Balanced Accuracy = ", bacc)
print("#-- Sensitivity       = ", sens)
print("#-- Specificity       = ", spec)
print("#-- AUC-ROC           = ", auc)
#-- Accuracy          =  0.9940119760479041
#-- Balanced Accuracy =  0.9940119760479043
#-- Sensitivity       =  0.9955089820359282
#-- Specificity       =  0.9925149700598802
#-- AUC-ROC           =  0.9998890691670551

When applied to predict the activity of the training compounds, the DT classifier resulted in very high scores (>0.99) for all five performance measures considered here. However, it does not necessarily mean that the model will perform very well for the test set compounds. Let's apply the model to the test set.

In [73]:

y_true, y_pred = y_test, clf.predict(X_test)    #-- Apply the model to predict the test set compounds' activity.

In [74]:

CMat = confusion_matrix( y_true, y_pred )    #-- generate confusion matrix
print(CMat)    # [[TN, FP], 
               #  [FN, TP]]
[[422 184]
 [ 32  42]]

In [75]:

acc  = accuracy_score( y_true, y_pred )

sens = CMat[ 1 ][ 1 ] / ( CMat[ 1 ][ 0 ] + CMat[ 1 ][ 1 ] )
spec = CMat[ 0 ][ 0 ] / ( CMat[ 0 ][ 0 ] + CMat[ 0 ][ 1 ] )
bacc = (sens + spec) / 2

y_score = clf.predict_proba( X_test )[:, 1]
auc = roc_auc_score( y_true, y_score )

print("#-- Accuracy          = ", acc)
print("#-- Balanced Accuracy = ", bacc)
print("#-- Sensitivity       = ", sens)
print("#-- Specificity       = ", spec)
print("#-- AUC-ROC           = ", auc)
#-- Accuracy          =  0.6823529411764706
#-- Balanced Accuracy =  0.6319686022656319
#-- Sensitivity       =  0.5675675675675675
#-- Specificity       =  0.6963696369636964
#-- AUC-ROC           =  0.6298724467041299

When the DT model was applied to the test set, all performance measures were much worse than those for the training set. This is a typical example of outfitting.

Model building through cross-validation

In the above section, the models were developed using the default values for many optional hyperparamters, which cannot be learned by the training algorithm. For example, when building a decision tree model, one should specify how the tree should be deep, how many compounds should be allowed in a single leaf, what is the minimum number of compounds in a single leaf, etc.

The cells below demonstrate how to perform hyperparameter optimization through 10-fold cross-validation. In this example, five values for each of three hyperparameters used in decision tree are considered (max_depth, min_samples_split, and min_samples_leaf), resulting in a total of 125 combination of the parameter values (=5 x 5 x 5). For each combination, 10 models are generated (through 10-fold cross validation) and the average performance will be tracked. The goal is to find the parameter value combination that results in the highest average performance score (e.g., 'roc_auc') from the 10-fold cross validation.

In [76]:

from sklearn.model_selection import GridSearchCV

In [77]:

scores = [ 'roc_auc', 'balanced_accuracy' ]

In [78]:

ncvs = 10

max_depth_range         = np.linspace( 3, 7, num=5, dtype='int32' )
min_samples_split_range = np.linspace( 3, 7, num=5, dtype='int32' )
min_samples_leaf_range  = np.linspace( 2, 6, num=5, dtype='int32' )

param_grid = dict( max_depth=max_depth_range,
                   min_samples_split=min_samples_split_range,
                   min_samples_leaf=min_samples_leaf_range )

clf = GridSearchCV( DecisionTreeClassifier( random_state=0 ),
                    param_grid=param_grid, cv=ncvs, scoring=scores, refit='roc_auc',
                    return_train_score = True, iid=False)

In [79]:

clf.fit( X_train, y_train )
print("Best parameter set", clf.best_params_)
Best parameter set {'max_depth': 5, 'min_samples_leaf': 4, 'min_samples_split': 3}

If necessary, it is possible to look into the performance data for each parameter value combination (stored in clf.cvresults), as shown in the following cell.

In [80]:

means_1a = clf.cv_results_['mean_train_roc_auc']
stds_1a  = clf.cv_results_['std_train_roc_auc']

means_1b = clf.cv_results_['mean_test_roc_auc']
stds_1b  = clf.cv_results_['std_test_roc_auc']

means_2a = clf.cv_results_['mean_train_balanced_accuracy']
stds_2a  = clf.cv_results_['std_train_balanced_accuracy']

means_2b = clf.cv_results_['mean_test_balanced_accuracy']
stds_2b  = clf.cv_results_['std_test_balanced_accuracy']

iterobjs = zip( means_1a, stds_1a, means_1b, stds_1b,
                means_2a, stds_2a, means_2b, stds_2b, clf.cv_results_['params'] )

for m1a, s1a, m1b, s1b, m2a, s2a, m2b, s2b, params in iterobjs :

    print( "Grid %r : %0.4f %0.04f %0.4f %0.04f %0.4f %0.04f %0.4f %0.04f"
           % ( params, m1a, s1a, m1b, s1b, m2a, s2a, m2b, s2b))
Grid {'max_depth': 3, 'min_samples_leaf': 2, 'min_samples_split': 3} : 0.7714 0.0042 0.7542 0.0420 0.7264 0.0052 0.7019 0.0370
Grid {'max_depth': 3, 'min_samples_leaf': 2, 'min_samples_split': 4} : 0.7714 0.0042 0.7542 0.0420 0.7264 0.0052 0.7019 0.0370
Grid {'max_depth': 3, 'min_samples_leaf': 2, 'min_samples_split': 5} : 0.7714 0.0042 0.7542 0.0420 0.7264 0.0052 0.7019 0.0370
Grid {'max_depth': 3, 'min_samples_leaf': 2, 'min_samples_split': 6} : 0.7714 0.0042 0.7542 0.0420 0.7264 0.0052 0.7019 0.0370
Grid {'max_depth': 3, 'min_samples_leaf': 2, 'min_samples_split': 7} : 0.7714 0.0042 0.7542 0.0420 0.7264 0.0052 0.7019 0.0370
.
.
.

Grid {'max_depth': 7, 'min_samples_leaf': 6, 'min_samples_split': 4} : 0.8873 0.0072 0.7375 0.0524 0.8068 0.0064 0.6855 0.0413
Grid {'max_depth': 7, 'min_samples_leaf': 6, 'min_samples_split': 5} : 0.8873 0.0072 0.7375 0.0524 0.8068 0.0064 0.6855 0.0413
Grid {'max_depth': 7, 'min_samples_leaf': 6, 'min_samples_split': 6} : 0.8873 0.0072 0.7375 0.0524 0.8068 0.0064 0.6855 0.0413
Grid {'max_depth': 7, 'min_samples_leaf': 6, 'min_samples_split': 7} : 0.8873 0.0072 0.7375 0.0524 0.8068 0.0064 0.6855 0.0413

Uncomment the following cell to look into additional performance data stored in cvresult.

In [81]:

#print(clf.cv_result_)

It is important to understand that each model built through 10-fold cross-validation during hyperparameter optimization uses only 90% of the compounds in the training set and the remaining 10% is used for testing that model. After all parameter value combinations are evaluated, the best parameter values are selected and used to rebuild a model from all compounds in the training set. GridSearchCV() takes care of this last step automatically. Therefore, there is no need to take an extra step to build a model using cls.fit() after hyperparameter optimization.

In [82]:

y_true, y_pred = y_train, clf.predict( X_train )    # Apply the model to predict the training compound's activity.

In [83]:

CMat = confusion_matrix( y_true, y_pred )    #-- generate confusion matrix
print(CMat)    # [[TN, FP], 
               #  [FN, TP]]
[[543 125]
 [179 489]]

In [84]:

acc  = accuracy_score( y_true, y_pred )

sens = CMat[ 1 ][ 1 ] / ( CMat[ 1 ][ 0 ] + CMat[ 1 ][ 1 ] )    # TP / (FN + TP)
spec = CMat[ 0 ][ 0 ] / ( CMat[ 0 ][ 0 ] + CMat[ 0 ][ 1 ] )    # TN / (TN + FP )
bacc = (sens + spec) / 2

y_score = clf.predict_proba( X_train )[:, 1]
auc = roc_auc_score( y_true, y_score )

In [85]:

print("#-- Accuracy          = ", acc)
print("#-- Balanced Accuracy = ", bacc)
print("#-- Sensitivity       = ", sens)
print("#-- Specificity       = ", spec)
print("#-- AUC-ROC           = ", auc)
#-- Accuracy          =  0.7724550898203593
#-- Balanced Accuracy =  0.7724550898203593
#-- Sensitivity       =  0.7320359281437125
#-- Specificity       =  0.812874251497006
#-- AUC-ROC           =  0.8336452095808383

Compare these performance data with those from section 8-(2) (for the training set). When the default values were used, the DT model gave >0.99 for all performance measures, but the current models (developed using hyperparameter optimization) have much lower values, ranging from 0.73 to 0.83. Again, however, what really matters is the performance against the test set, which contains the data not used for model training.

In [86]:

y_true, y_pred = y_test, clf.predict(X_test)    #-- Apply the model to predict the test set compounds' activity.

In [87]:

CMat = confusion_matrix( y_true, y_pred )    #-- generate confusion matrix
print(CMat)    # [[TN, FP], 
               #  [FN, TP]]
[[457 149]
 [ 26  48]]

In [88]:

acc  = accuracy_score( y_true, y_pred )

sens = CMat[ 1 ][ 1 ] / ( CMat[ 1 ][ 0 ] + CMat[ 1 ][ 1 ] )
spec = CMat[ 0 ][ 0 ] / ( CMat[ 0 ][ 0 ] + CMat[ 0 ][ 1 ] )
bacc = (sens + spec) / 2

y_score = clf.predict_proba( X_test )[:, 1]
auc = roc_auc_score( y_true, y_score )

print("#-- Accuracy          = ", acc)
print("#-- Balanced Accuracy = ", bacc)
print("#-- Sensitivity       = ", sens)
print("#-- Specificity       = ", spec)
print("#-- AUC-ROC           = ", auc)
#-- Accuracy          =  0.7426470588235294
#-- Balanced Accuracy =  0.7013870305949514
#-- Sensitivity       =  0.6486486486486487
#-- Specificity       =  0.7541254125412541
#-- AUC-ROC           =  0.7496209080367496

Now we can see that the model from hyperparameter optimization gives better performance data against the test set, compared to the model developed using the default parameter values. Importantly, the model from hyperparameter optimization shows smaller differences in performance measures between the training and test sets, indicatiing that the issue of outffiting has been alleviated substantially.

Exercises

In this assignment, we will build predictive models using the same aromatase data.

step 1 Show the following information to make sure that the activity data in the df_activity data frame is still available.

  • The first five lines of df_activity

In [89]:

# Write your code in this cell.
  • The counts of active/inactive compounds in df_activity

In [90]:

# Write your code in this cell.

Step 2 Show the following information to make sure the structure data is still available.

  • The first five lines of df_smiles

In [91]:

# Write your code in this cell.
  • the number of rows of df_smiles

In [92]:

# Write your code in this cell.

Step 3 Generate the (ECFP-equivalent) circular fingerprints from the SMILES strings.

  • Use RDKit to generate 1024-bit-long circular fingerprints.
  • Set the radius of the circular fingerprint to 2.
  • Store the fingerprints in a data_frame called df_fps (along with the CIDs).
  • Print the dimension of df_fps.
  • Show the first five lines of df_fps.

In [93]:

# Write your code in this cell

Step 4 Merge the df_activity and df_fps data frames into a data frame called df_data

  • Join the two data frames using the CID column as keys.
  • Remove the rows that have any NULL values (i.e., compounds for which the fingerprints couldn't be generated).
  • Print the dimension of df_data.
  • Show the first five lines of df_data.

In [94]:

# Write your code in this cell.

Step 5 Prepare input and output data for model building

  • Load the fingerprint data into 2-D array (X) and the activity data into 1-D array (y).
  • Show the dimension of X and y.

In [95]:

# Write your code in this cell.
  • Remove zero-variance features from X (if any).

In [96]:

# Write your code in this cell.
  • Split the data set into training and test sets (90% vs 10%) (using random_state=3100).
  • Print the dimension of X and y for the training and test sets.

In [97]:

# Write your code in this cell.
  • Balance the training data set through downsampling.
  • Show the number of inactive/active compounds in the downsampled training set.

In [98]:

# Write your code in this cell.

Step 6 Building a Random Forest model using the balanced training data set.

In [99]:

# Write your code in this cell.

Step 7 Apply the developed RF model to predict the activity of the training set compounds.

  • Report the confusion matrix.
  • Report the accuracy, balanced accurayc, sensitivity, specificity, and auc-roc.

In [100]:

# Write your code in this cell.

Step 8 Apply the developed RF model to predict the activity of the test set compounds.

  • Report the accuracy, balanced accurayc, sensitivity, specificity, and auc-roc.

In [101]:

# Write your code in this cell.

Step 9 Read a recent paper published in Chem. Res. Toxicol. (https://doi.org/10.1021/acs.chemrestox.7b00037) and answer the following questions (in no more than five sentences for each question).

  • What different approaches did the paper take to develop prediction models (compared to those used in this notebook)?
  • How different are the models reported in the paper from those constructed in this paper (in terms of the performance measures)?
  • What would you do to develop models with improved performance?

Write your answers in this cell.

In [ ]:



8.3: Python Assignment is shared under a not declared license and was authored, remixed, and/or curated by LibreTexts.

Support Center

How can we help?