5.5: Python Assignment
Excel_multiple_linear_regression_assignment.docx
- Download the .ipynb file and run it as a Jupyter notebook.
- You can use the notebook environment you created in section 1.5 or the JupyterHub at LibreTexts: https://jupyter.libretexts.org (see your instructor if you do not have access to the hub).
- This page is an HTML version of the .ipynb file above.
- If you have questions on this assignment, use this web page and hypothes.is annotations to post a question (or comment) to the 2019OLCCStu class group. Contact your instructor if you do not know how to access the 2019OLCCStu group within the hypothes.is system.
- You will need to have the BP.csv and 102BP.csv files in the same directory as the QSPR.ipynb Jupyter notebook file.
- Your instructor may have you complete the Excel multiple linear regression assignment (linked above) in addition to this notebook.
Required Modules
- RDKit
- mordred (in your conda environment, type: conda install -c mordred-descriptor mordred)
- pandas
- statsmodels (in your conda environment, type: conda install statsmodels)
Objectives
- Use RDKit to calculate molecular descriptors
- Use molecular descriptors to create a predictive model for boiling points of alkanes.
- Use statsmodels to perform multiple linear regression and visualize the results
Quantitative Structure-Property Relationships
Quantitative Structure-Property Relationships (QSPR) and Quantitative Structure-Activity Relationships (QSAR) use statistical models to relate a set of predictor values to a response variable. Molecules are described using a set of descriptors, and mathematical relationships can then be developed to explain observed properties. In QSPR and QSAR, physico-chemical properties or theoretical descriptors of chemicals are used to predict either a physical property or a biological outcome.
Molecular Descriptors
A molecular descriptor is the “final result of a logical and mathematical procedure, which transforms chemical information encoded within a symbolic representation of a molecule into a useful number or the result of some standardized experiment” (Todeschini, R.; Consonni, V. Molecular Descriptors for Chemoinformatics; Wiley-VCH: Weinheim, 2009). You are already familiar with descriptors such as molecular weight or the number of heavy atoms, and we have queried PubChem for data such as XLogP. We’ll examine just a few simple descriptors, but thousands have been developed for applications in QSPR.
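As a warm-up, here is a minimal sketch (an addition, not part of the original notebook) showing a few of these familiar descriptors computed with RDKit's built-in functions; note that RDKit's MolLogP is a calculated Crippen logP, which is comparable to, but not the same as, PubChem's XLogP.
from rdkit import Chem
from rdkit.Chem import Descriptors

mol = Chem.MolFromSmiles('CCCCC')   # n-pentane
print(Descriptors.MolWt(mol))       # molecular weight
print(mol.GetNumHeavyAtoms())       # number of heavy (non-hydrogen) atoms
print(Descriptors.MolLogP(mol))     # RDKit's Crippen logP estimate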
Using rdkit and mordred to calculate descriptors
So far we have been working through the algorithms for these indices by hand. That is time consuming for an individual, but software can do it far more easily. We will use the rdkit and mordred Python libraries to help us out.
In [1]:
from rdkit import Chem                          # import the Chem module from rdkit
from mordred import Calculator, descriptors     # import the mordred descriptor library
calc = Calculator(descriptors, ignore_3D=True)  # create a Calculator loaded with all (2D) descriptors
len(calc.descriptors)                           # how many different descriptors are available in the library
Out[1]:
1613
Wiener Index
We already calculated the Wiener index for n-pentane and 2-methylpentane. Now let’s have mordred do it for us.
In [2]:
from mordred import WienerIndex
pentane = Chem.MolFromSmiles('CCCCC')            # use rdkit to create a mol object from the SMILES string for n-pentane
methyl_pentane = Chem.MolFromSmiles('CCCC(C)C')  # and for 2-methylpentane
wiener_index = WienerIndex.WienerIndex()         # create descriptor instance for the Wiener index
result1 = wiener_index(pentane)                  # calculate the Wiener index for n-pentane
result2 = wiener_index(methyl_pentane)           # and for 2-methylpentane
print("The Wiener index for n-pentane is: ", result1)  # display result
print("The Wiener index for 2-methylpentane is: ", result2)
The Wiener index for n-pentane is:  20
The Wiener index for 2-methylpentane is:  32
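As a sanity check (this sketch is an addition, not part of the original notebook), the Wiener index can also be computed directly from RDKit's topological distance matrix: it is half the sum of all pairwise bond-count distances.
from rdkit import Chem

pentane = Chem.MolFromSmiles('CCCCC')
dist = Chem.GetDistanceMatrix(pentane)  # matrix of pairwise bond-count distances
print(int(dist.sum() / 2))              # half the sum of all entries; should print 20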
Zagreb Indices
And we can do the same for the different Zagreb indices for n-pentane and 2-methylpentane.
In [3]:
from mordred import ZagrebIndex
zagreb_index1 = ZagrebIndex.ZagrebIndex(version=1)  # create descriptor instance for Zagreb index 1
zagreb_index2 = ZagrebIndex.ZagrebIndex(version=2)  # create descriptor instance for Zagreb index 2
result_Z1 = zagreb_index1(pentane)                  # calculate Z1 descriptor value for n-pentane
result_Z2 = zagreb_index2(pentane)                  # calculate Z2 descriptor value for n-pentane
print("The Zagreb index 1 for n-pentane is:", result_Z1)
print("The Zagreb index 2 for n-pentane is:", result_Z2)
result_Z1 = zagreb_index1(methyl_pentane)           # and for 2-methylpentane as well
result_Z2 = zagreb_index2(methyl_pentane)
print("The Zagreb index 1 for 2-methylpentane is:", result_Z1)
print("The Zagreb index 2 for 2-methylpentane is:", result_Z2)
The Zagreb index 1 for n-pentane is: 14.0
The Zagreb index 2 for n-pentane is: 12.0
The Zagreb index 1 for 2-methylpentane is: 20.0
The Zagreb index 2 for 2-methylpentane is: 18.0
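Similarly, the Zagreb indices can be reproduced from atom degrees (again, an added sketch rather than part of the original notebook): M1 is the sum of the squared degrees of the atoms, and M2 is the sum over bonds of the product of the degrees of the two bonded atoms.
mol = Chem.MolFromSmiles('CCCC(C)C')  # 2-methylpentane
m1 = sum(atom.GetDegree() ** 2 for atom in mol.GetAtoms())
m2 = sum(bond.GetBeginAtom().GetDegree() * bond.GetEndAtom().GetDegree()
         for bond in mol.GetBonds())
print(m1, m2)                         # should print 20 18, matching mordred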
As you can see from the code above, each descriptor has its own class and calling convention. The full list of descriptors and the corresponding syntax is documented at http://mordred-descriptor.github.io/documentation/master/api/modules.html
Looping through a list of molecules
Now that we have an understanding of how rdkit and mordred work to calculate our descriptors, let’s simplify the code using a looping structure:
In [4]:
smiles = ["CCC", "CCCC", "CCCCC", "CCCC(C)C", "CC(C)C(C)C"]  # store SMILES strings in a list
for smile in smiles:
    mol = Chem.MolFromSmiles(smile)  # convert SMILES string to a mol object
    result_Z1 = zagreb_index1(mol)   # calculate Z1 descriptor value
    result_Z2 = zagreb_index2(mol)   # calculate Z2 descriptor value
    print("The Zagreb index 1 for", smile, "is:", result_Z1)
    print("The Zagreb index 2 for", smile, "is:", result_Z2)
    print()
The Zagreb index 1 for CCC is: 6.0
The Zagreb index 2 for CCC is: 4.0

The Zagreb index 1 for CCCC is: 10.0
The Zagreb index 2 for CCCC is: 8.0

The Zagreb index 1 for CCCCC is: 14.0
The Zagreb index 2 for CCCCC is: 12.0

The Zagreb index 1 for CCCC(C)C is: 20.0
The Zagreb index 2 for CCCC(C)C is: 18.0

The Zagreb index 1 for CC(C)C(C)C is: 22.0
The Zagreb index 2 for CC(C)C(C)C is: 21.0
Using descriptors to predict molecular properties
For this exercise we will take a series of alkanes and create an equation that allows us to predict their boiling points. We will start with a 30-molecule alkane training set, calculate various descriptors, and see how well they predict the physical property of boiling point.
For this exercise we will be using the pandas (Python Data Analysis) library to help us read, write and manage data. We will also use matplotlib to generate graphs.
Boiling Point data
Let’s start by reading and graphing a set of boiling point data. First we read our csv file into a pandas “dataframe”. Notice that we can generate a nicely formatted table from our dataframe by just entering the name of the dataframe on the last line.
In [5]:
import pandas as pd         # import the Python Data Analysis Library with the shortened name pd
df = pd.read_csv("BP.csv")  # read the file into a pandas dataframe
df                          # print the dataframe
Out[5]:
  | compound | name | BP_C | BP_K | SMILES | MW |
---|---|---|---|---|---|---|
0 | 1 | Methane | -162.2 | 110.95 | C | 16.043 |
1 | 2 | Ethane | -88.6 | 184.55 | CC | 30.070 |
2 | 3 | propane | -42.2 | 230.95 | CCC | 44.100 |
3 | 4 | butane | -0.1 | 273.05 | CCCC | 58.120 |
4 | 5 | 2-methylpropane | -11.2 | 261.95 | CC(C)C | 58.120 |
5 | 6 | pentane | 36.1 | 309.25 | CCCCC | 72.150 |
6 | 7 | 2-methylbutane | 27.0 | 300.15 | CC(C)CC | 72.150 |
7 | 8 | 2,2-dimethylpropane | 9.5 | 282.65 | CC(C)(C)C | 72.150 |
8 | 9 | hexane | 68.8 | 341.95 | CCCCCC | 86.180 |
9 | 10 | 2-methylpentane | 60.9 | 334.05 | CC(C)CCC | 86.180 |
10 | 11 | 3-methylpentane | 63.3 | 336.45 | CC(CC)CC | 86.180 |
11 | 12 | 2,2-dimethylbutane | 49.8 | 322.95 | CC(C)(CC)C | 86.180 |
12 | 13 | 2,3-dimethylbutane | 58.1 | 331.25 | CC(C)C(C)C | 86.180 |
13 | 14 | heptane | 98.5 | 371.65 | CCCCCCC | 100.200 |
14 | 15 | 3-ethylpentane | 93.5 | 366.65 | C(C)C(CC)CC | 100.200 |
15 | 16 | 2,2-dimethylpentane | 79.2 | 352.35 | CC(C)(CCC)C | 100.200 |
16 | 17 | 2,3-dimethylpentane | 89.8 | 362.95 | CC(C)C(CC)C | 100.200 |
17 | 18 | 2,4-dimethylpentane | 80.6 | 353.75 | CC(C)CC(C)C | 100.200 |
18 | 19 | 2-methylhexane | 90.1 | 363.25 | CC(C)CCCC | 100.205 |
19 | 20 | 3-methylhexane | 91.8 | 364.95 | CC(CC)CCC | 100.200 |
20 | 21 | octane | 125.6 | 398.75 | CCCCCCCC | 114.230 |
21 | 22 | 3-methylheptane | 118.9 | 392.05 | CC(CC)CCCC | 114.232 |
22 | 23 | 2,2,3,3-tetramethylbutane | 106.5 | 379.65 | CC(C)(C(C)(C)C)C | 114.230 |
23 | 24 | 2,3,3-trimethylpentane | 114.7 | 387.85 | CC(C)C(CC)(C)C | 114.230 |
24 | 25 | 2,3,4-trimethylpentane | 113.7 | 386.85 | CC(C)C(C(C)C)C | 114.230 |
25 | 26 | 2,2,4-trimethylpentane | 99.3 | 372.45 | CC(C)(CC(C)C)C | 114.230 |
26 | 27 | nonane | 150.7 | 423.85 | CCCCCCCCC | 128.250 |
27 | 28 | 2-methyloctane | 143.0 | 416.15 | CC(C)CCCCCC | 128.259 |
28 | 29 | decane | 174.2 | 447.35 | CCCCCCCCCC | 142.280 |
29 | 30 | 2-methylnonane | 166.9 | 440.05 | CC(C)CCCCCCC | 142.280 |
Graphing the data
Now we can graph the data using matplotlib.
In [16]:
import matplotlib.pyplot as plt
plt.scatter(df.MW, df.BP_K)  # plot of boiling point (in K) vs molecular weight
plt.xlabel('Molecular Weight')
plt.ylabel('Boiling Point in Kelvin')
plt.show()
Clearly we have multiple molecules with the same molecular weight but different boiling points, so molecular weight by itself is not the best predictor of boiling point (the short sketch below quantifies this with a one-variable fit). Other descriptors, such as the Wiener or Zagreb indices, may do better. Let’s add various descriptors to the dataframe.
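Here is a small added sketch (not part of the original notebook) that puts a number on the molecular-weight-only relationship using a one-variable ordinary least squares fit with statsmodels.
import statsmodels.api as sm

X_mw = sm.add_constant(df[["MW"]])         # molecular weight plus an intercept
mw_model = sm.OLS(df["BP_K"], X_mw).fit()  # ordinary least squares fit of BP_K on MW alone
print(mw_model.rsquared)                   # R-squared of the MW-only model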
Adding descriptors to the dataset
We can now calculate the Wiener and Zagreb indices for each of our hydrocarbons and add them to the dataframe.
In [7]:
# create new lists to store results we calculate
result_Wiener = []
result_Z1 = []
result_Z2 = []

for index, row in df.iterrows():             # iterate through each row of the CSV data
    SMILE = row['SMILES']                    # get SMILES string from the row
    mol = Chem.MolFromSmiles(SMILE)          # convert SMILES string to a mol object
    result_Wiener.append(wiener_index(mol))  # calculate Wiener index descriptor value
    result_Z1.append(zagreb_index1(mol))     # calculate Zagreb (Z1) descriptor value
    result_Z2.append(zagreb_index2(mol))     # calculate Zagreb (Z2) descriptor value

df['Wiener'] = result_Wiener  # add the Wiener index results to the dataframe
df['Z1'] = result_Z1          # add the Zagreb 1 results to the dataframe
df['Z2'] = result_Z2          # add the Zagreb 2 results to the dataframe
df                            # print the updated dataframe
Out[7]:
  | compound | name | BP_C | BP_K | SMILES | MW | Wiener | Z1 | Z2 |
---|---|---|---|---|---|---|---|---|---|
0 | 1 | Methane | -162.2 | 110.95 | C | 16.043 | 0 | 0.0 | 0.0 |
1 | 2 | Ethane | -88.6 | 184.55 | CC | 30.070 | 1 | 2.0 | 1.0 |
2 | 3 | propane | -42.2 | 230.95 | CCC | 44.100 | 4 | 6.0 | 4.0 |
3 | 4 | butane | -0.1 | 273.05 | CCCC | 58.120 | 10 | 10.0 | 8.0 |
4 | 5 | 2-methylpropane | -11.2 | 261.95 | CC(C)C | 58.120 | 9 | 12.0 | 9.0 |
5 | 6 | pentane | 36.1 | 309.25 | CCCCC | 72.150 | 20 | 14.0 | 12.0 |
6 | 7 | 2-methylbutane | 27.0 | 300.15 | CC(C)CC | 72.150 | 18 | 16.0 | 14.0 |
7 | 8 | 2,2-dimethylpropane | 9.5 | 282.65 | CC(C)(C)C | 72.150 | 16 | 20.0 | 16.0 |
8 | 9 | hexane | 68.8 | 341.95 | CCCCCC | 86.180 | 35 | 18.0 | 16.0 |
9 | 10 | 2-methylpentane | 60.9 | 334.05 | CC(C)CCC | 86.180 | 32 | 20.0 | 18.0 |
10 | 11 | 3-methylpentane | 63.3 | 336.45 | CC(CC)CC | 86.180 | 31 | 20.0 | 19.0 |
11 | 12 | 2,2-dimethylbutane | 49.8 | 322.95 | CC(C)(CC)C | 86.180 | 28 | 24.0 | 22.0 |
12 | 13 | 2,3-dimethylbutane | 58.1 | 331.25 | CC(C)C(C)C | 86.180 | 29 | 22.0 | 21.0 |
13 | 14 | heptane | 98.5 | 371.65 | CCCCCCC | 100.200 | 56 | 22.0 | 20.0 |
14 | 15 | 3-ethylpentane | 93.5 | 366.65 | C(C)C(CC)CC | 100.200 | 48 | 24.0 | 24.0 |
15 | 16 | 2,2-dimethylpentane | 79.2 | 352.35 | CC(C)(CCC)C | 100.200 | 46 | 28.0 | 26.0 |
16 | 17 | 2,3-dimethylpentane | 89.8 | 362.95 | CC(C)C(CC)C | 100.200 | 46 | 26.0 | 26.0 |
17 | 18 | 2,4-dimethylpentane | 80.6 | 353.75 | CC(C)CC(C)C | 100.200 | 48 | 26.0 | 24.0 |
18 | 19 | 2-methylhexane | 90.1 | 363.25 | CC(C)CCCC | 100.205 | 52 | 24.0 | 22.0 |
19 | 20 | 3-methylhexane | 91.8 | 364.95 | CC(CC)CCC | 100.200 | 50 | 24.0 | 23.0 |
20 | 21 | octane | 125.6 | 398.75 | CCCCCCCC | 114.230 | 84 | 26.0 | 24.0 |
21 | 22 | 3-methylheptane | 118.9 | 392.05 | CC(CC)CCCC | 114.232 | 76 | 28.0 | 27.0 |
22 | 23 | 2,2,3,3-tetramethylbutane | 106.5 | 379.65 | CC(C)(C(C)(C)C)C | 114.230 | 58 | 38.0 | 40.0 |
23 | 24 | 2,3,3-trimethylpentane | 114.7 | 387.85 | CC(C)C(CC)(C)C | 114.230 | 62 | 34.0 | 36.0 |
24 | 25 | 2,3,4-trimethylpentane | 113.7 | 386.85 | CC(C)C(C(C)C)C | 114.230 | 65 | 32.0 | 33.0 |
25 | 26 | 2,2,4-trimethylpentane | 99.3 | 372.45 | CC(C)(CC(C)C)C | 114.230 | 66 | 34.0 | 32.0 |
26 | 27 | nonane | 150.7 | 423.85 | CCCCCCCCC | 128.250 | 120 | 30.0 | 28.0 |
27 | 28 | 2-methyloctane | 143.0 | 416.15 | CC(C)CCCCCC | 128.259 | 114 | 32.0 | 30.0 |
28 | 29 | decane | 174.2 | 447.35 | CCCCCCCCCC | 142.280 | 165 | 34.0 | 32.0 |
29 | 30 | 2-methylnonane | 166.9 | 440.05 | CC(C)CCCCCCC | 142.280 | 158 | 36.0 | 34.0 |
Now we can see how each of these descriptors is related to the boiling points of the respective compounds.
In [8]:
plt.scatter(df.Wiener, df.BP_K)  # plot of BP versus Wiener index
plt.xlabel('Wiener Index')
plt.ylabel('Boiling Point in Kelvin')
plt.show()
In [9]:
plt.scatter(df.Z1, df.BP_K)  # plot of BP versus Zagreb M1
plt.xlabel('Zagreb M1')
plt.ylabel('Boiling Point in Kelvin')
plt.show()
In [10]:
plt.scatter(df.Z2, df.BP_K)  # plot of BP versus Zagreb M2
plt.xlabel('Zagreb M2')
plt.ylabel('Boiling Point in Kelvin')
plt.show()
Molecular weight was somewhat predictive, but problematic. The other indices appear to give us additional ways to predict boiling point.
One option is to write this data to a new CSV file and perform a regression analysis in Microsoft Excel. Exporting the data is straightforward, and your instructor may provide instructions on how to analyze it in Excel.
In [11]:
df.to_csv('bp_descriptor_data.csv', encoding='utf-8', index=False)
Multiple regression analysis using statsmodels
The statsmodels package provides numerous tools for performing statistical analysis in Python. In this case, we want to perform a multiple linear regression using all of our descriptors (molecular weight, Wiener index, and the two Zagreb indices) to predict the boiling point.
In [12]:
import statsmodels.api as sm          # import the statsmodels library as sm
X = df[["MW", "Wiener", "Z1", "Z2"]]  # select our independent variables
X = sm.add_constant(X)                # add an intercept to our model
y = df[["BP_K"]]                      # select BP as our dependent variable
model = sm.OLS(y, X).fit()            # set up our model
predictions = model.predict(X)        # make the predictions
print(model.summary())                # print out statistical summary
                            OLS Regression Results
==============================================================================
Dep. Variable:                   BP_K   R-squared:                       0.994
Model:                            OLS   Adj. R-squared:                  0.994
Method:                 Least Squares   F-statistic:                     1124.
Date:                Wed, 16 Oct 2019   Prob (F-statistic):           8.16e-28
Time:                        19:04:48   Log-Likelihood:                -93.019
No. Observations:                  30   AIC:                             196.0
Df Residuals:                      25   BIC:                             203.0
Df Model:                           4
Covariance Type:            nonrobust
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
const         55.5695      6.745      8.238      0.000      41.677      69.462
MW             4.4325      0.203     21.853      0.000       4.015       4.850
Wiener        -0.6411      0.064     -9.960      0.000      -0.774      -0.509
Z1            -4.3920      1.238     -3.549      0.002      -6.941      -1.843
Z2             0.2982      0.943      0.316      0.754      -1.644       2.240
==============================================================================
Omnibus:                        3.756   Durbin-Watson:                   1.285
Prob(Omnibus):                  0.153   Jarque-Bera (JB):                2.259
Skew:                          -0.583   Prob(JB):                        0.323
Kurtosis:                       3.671   Cond. No.                         755.
==============================================================================

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
Note that we now have coefficients for an equation that can be used to predict the boiling points of molecules not included in our dataset. The equation is:
Predicted BP = 4.4325 * MW - 0.6411 * Wiener - 4.3920 * Z1 + 0.2982 * Z2 + 55.5695
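Rather than copying the coefficients by hand, they can also be pulled directly from the fitted model (a small added sketch, not part of the original notebook):
print(model.params)  # pandas Series holding the fitted const, MW, Wiener, Z1, Z2 coefficients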
We can use this equation to predict the boiling point of a new molecule. However, before we do, we need to explore the validity of the model.
Model summary and analysis using partial regression plots
A quick look at the results summary shows that the model has an excellent R-squared value. Upon more careful examination, you may notice that one of our descriptors has a very large p-value, which suggests that the Z2 descriptor is not contributing much to the model. Partial regression plots give a graphical view that makes this more obvious.
In [13]:
fig = plt.figure(figsize=(10, 8))
fig = sm.graphics.plot_partregress_grid(model, fig=fig)
Part of the reason that Z2 may not be predictive in this model is that it is collinear with the Z1 descriptor; the two descriptors are calculated in very similar ways (as outlined in the LibreTexts page for this activity). Later in this exercise we will explore dropping this descriptor.
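One quick way to check for collinearity (an added sketch, not part of the original notebook) is to look at the correlation matrix of the descriptors and their variance inflation factors; a large VIF flags a descriptor that is largely explained by the others.
from statsmodels.stats.outliers_influence import variance_inflation_factor

desc = df[["MW", "Wiener", "Z1", "Z2"]]
print(desc.corr())                        # pairwise correlations between the descriptors
X_vif = sm.add_constant(desc)
for i, name in enumerate(X_vif.columns):  # the 'const' row can be ignored
    print(name, variance_inflation_factor(X_vif.values, i))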
How good is our model?
Let's look at a plot of actual versus predicted boiling points and regress the observed values against the predictions.
In [14]:
pred_bp = model.fittedvalues.copy()  # use our model to create a set of predicted BPs
fig, ax = plt.subplots(figsize=(8, 5))
lmod = sm.OLS(pred_bp, df.BP_K)      # linear regression of observed vs predicted BPs
res = lmod.fit()                     # run the fitting
plt.scatter(pred_bp, df.BP_K)        # plot of observed vs predicted BPs
plt.ylabel('observed boiling point (K)')
plt.xlabel('predicted boiling point (K)')
plt.show()
print(res.summary())                 # print linear regression stats summary
                                 OLS Regression Results
=======================================================================================
Dep. Variable:                      y   R-squared (uncentered):               1.000
Model:                            OLS   Adj. R-squared (uncentered):          1.000
Method:                 Least Squares   F-statistic:                      1.213e+05
Date:                Wed, 16 Oct 2019   Prob (F-statistic):                4.52e-54
Time:                        19:05:22   Log-Likelihood:                     -93.015
No. Observations:                  30   AIC:                                  188.0
Df Residuals:                      29   BIC:                                  189.4
Df Model:                           1
Covariance Type:            nonrobust
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
BP_K           0.9998      0.003    348.263      0.000       0.994       1.006
==============================================================================
Omnibus:                        3.635   Durbin-Watson:                   1.284
Prob(Omnibus):                  0.162   Jarque-Bera (JB):                2.167
Skew:                           0.574   Prob(JB):                        0.338
Kurtosis:                       3.643   Cond. No.                         1.00
==============================================================================

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
The model appears to have very good predictability (R-squared = 1.000 for the observed-versus-predicted fit above) within the original 30-molecule data set. One way to test the model is to take a new molecule, calculate its descriptors, and see how well its boiling point is predicted. One molecule that is not in the training set is 2-methylheptane, with the following data: MW = 114.232, Wiener index = 79, Z1 = 28, Z2 = 26, boiling point = 390.6 K.
Plugging these values into the equation above, Predicted BP = 4.4325 * MW - 0.6411 * Wiener - 4.3920 * Z1 + 0.2982 * Z2 + 55.5695, gives 396.0 K. The model therefore gives a 1.4% error when predicting a boiling point outside the training set.
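The same prediction can be reproduced with model.predict instead of plugging numbers into the equation by hand (an added sketch, not part of the original notebook; the column order must match the design matrix used to fit the model).
new_molecule = pd.DataFrame({"const": [1.0], "MW": [114.232],
                             "Wiener": [79], "Z1": [28], "Z2": [26]})  # 2-methylheptane descriptors from above
print(model.predict(new_molecule))  # approximately 396 K, versus the observed 390.6 K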
We mentioned earlier that Z2 may not be very predictive in this model. We can remove that variable and rerun the analysis to see if we can improve the model.
In [15]:
import statsmodels.api as sm    # import the statsmodels library as sm
X = df[["MW", "Wiener", "Z1"]]  # select our independent variables, this time without Z2
X = sm.add_constant(X)          # add an intercept to our model
y = df[["BP_K"]]                # select BP as our dependent variable
model = sm.OLS(y, X).fit()      # set up our model
predictions = model.predict(X)  # make the predictions
print(model.summary())          # print out statistical summary
                            OLS Regression Results
==============================================================================
Dep. Variable:                   BP_K   R-squared:                       0.994
Model:                            OLS   Adj. R-squared:                  0.994
Method:                 Least Squares   F-statistic:                     1552.
Date:                Wed, 16 Oct 2019   Prob (F-statistic):           1.99e-29
Time:                        19:05:30   Log-Likelihood:                -93.078
No. Observations:                  30   AIC:                             194.2
Df Residuals:                      26   BIC:                             199.8
Df Model:                           3
Covariance Type:            nonrobust
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
const         55.4979      6.624      8.378      0.000      41.882      69.113
MW             4.4114      0.188     23.433      0.000       4.024       4.798
Wiener        -0.6397      0.063    -10.139      0.000      -0.769      -0.510
Z1            -4.0260      0.430     -9.364      0.000      -4.910      -3.142
==============================================================================
Omnibus:                        2.917   Durbin-Watson:                   1.326
Prob(Omnibus):                  0.233   Jarque-Bera (JB):                1.624
Skew:                          -0.507   Prob(JB):                        0.444
Kurtosis:                       3.521   Cond. No.                         741.
==============================================================================

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
The model again appears to have very good predictability (R-squared = 0.994) within the original 30-molecule data set. Let's reexamine 2-methylheptane: MW = 114.232, Wiener index = 79, Z1 = 28, boiling point = 390.6 K.
Plugging these values into the new equation, Predicted BP = 4.4114 * MW - 0.6397 * Wiener - 4.0260 * Z1 + 55.4979, gives 396.2 K. This model also gives about a 1.4% error when predicting a boiling point outside the training set.
We can see that Z2 doesn't really change the result of the calculation and is probably best left out to simplify the model.
Keep in mind that we have completed this analysis with only a training set of 30 molecules. If the training set had more molecules, you should be able to develop a better model.
Assignment
You originally ran this analysis on a 30-molecule data set (BP.csv). You also have available a 102-molecule data set (102BP.csv).
- Complete the above analysis using the expanded data set to determine if a better predictive model can be obtained with a larger training set. Note that 2-methylheptane is in this new dataset so you will need to choose a new test molecule.
When you have completed the analysis, you will create a new analysis:
- Choose four new topological or other calculated descriptors from the Mordred documentation: http://mordred-descriptor.github.io/documentation/master/api/modules.html
- Complete a simple linear regression for each of your new descriptors.
- Complete a multiple linear regression to create an equation that best relates the boiling point data to your descriptors.
- Create a separate sheet that has your regression data.
- Make a plot of Actual vs Predicted BP for your regression.
- Choose a new molecule not in the dataset (not 2-methylheptane, be creative and use chemical intuition).
- Use your multiple linear regression equation to predict this molecule’s BP and look up the literature value.
- Write a short one- to two-page paper that includes:
- What your new chosen descriptors mean
- Which of your new descriptors correlate with one another
- The overall equation you calculated
- How you chose the molecule to test
- How closely the multiple linear regression predicts the boiling point of your molecule