Chemistry LibreTexts

1.9: Python Assignment 1

  • Getting Molecular Properties through PUG-REST

     

    Downloadable Files

    Lecture01_Basics.ipynb

    • Download and run the above file in your Jupyter notebook. 
    • This page is an html version of the above file. 
      • If you have questions on this assignment you should use this web page and the hypothes.is annotation to post a question (or comment) to the 2019OLCCStu class group.  If you are not on the discussion group you should contact your instructor for the link to join.

     

     

    Objectives

    • Learn the basic approach to getting data from PubChem through PUG-REST
    • Retrieve a single property of a single compound.
    • Retrieve a single property of multiple compounds
    • Retrieve multiple properties of multiple compounds.
    • Write a for loop to make the same kind of requests.
    • Process a large amount of data by splitting them into smaller chunks

     

    The Shortest Code to Get PubChem Data

     

    Let's suppose that we want to get the molecular formula of water from PubChem through PUG-REST. You can get this data from your web browser (Chrome, Safari, Internet Explorer, etc.) via the following URL:
    https://pubchem.ncbi.nlm.nih.gov/rest/pug/compound/name/water/property/MolecularFormula/txt
    Getting the same data using a computer program is not very difficult. This task can be done with three lines of code.

    Line 1: First, the "requests" Python library (https://3.python-requests.org/) is imported. The "requests" library contains pre-written code that allows you to access information on the web.

    In [1]:

    import requests
    

    Note: if you receive an error indicating that you do not have the requests library, you should go back to your anaconda prompt and type

    pip install requests

     

    Line 2: Get the desired information using the function get( ) in the requests library. The PUG-REST request URL, enclosed within a pair of quotes (''), is provided within the parentheses. The result will be stored in a variable called res.

    In [2]:

    res = requests.get('https://pubchem.ncbi.nlm.nih.gov/rest/pug/compound/name/water/property/MolecularFormula/txt')
    

    Line 3: The res variable (which means "result" or "response") contains not only the requested data but also some information about the request. To view the returned data, you need to get the data from res and print it out.

    In [3]:

    print(res.text)
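
Because res carries information about the request as well as the returned data, it can be worth checking that the request succeeded before using res.text. The following is a minimal sketch, not part of the original assignment: the helper name get_pubchem_text and the 30-second timeout are our own assumptions.

```python
import requests

def get_pubchem_text(url, timeout=30):
    """Return the body of a PUG-REST response as text, or raise on an HTTP error."""
    res = requests.get(url, timeout=timeout)   # timeout (in seconds) is our own choice
    res.raise_for_status()                     # raises requests.HTTPError for 4xx/5xx responses
    return res.text.strip()                    # drop the trailing newline

# Example call (needs internet access), left commented out on purpose:
# print(get_pubchem_text('https://pubchem.ncbi.nlm.nih.gov/rest/pug/compound/name/water/property/MolecularFormula/txt'))
```

With this helper, a misspelled compound name should surface as an exception rather than as an error message silently printed in place of the data.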
    

     

    As another example, the following code retrieves the number of heavy (non-hydrogen) atoms of butadiene.

    In [4]:

    res = requests.get('https://pubchem.ncbi.nlm.nih.gov/rest/pug/compound/name/butadiene/property/HeavyAtomCount/txt')
    print(res.text)
    

     

    Note that in this example, we did not import the requests library because it has already been imported (in the very first example, which retrieved the molecular formula of water).

     

    LibreText Reading:

    Review Section 1.6.2 Rest Architecture before doing this assignment, and reference back to the Compound Properties Table as needed.

     

    Exercise 1a: Retrieve the molecular weight of ethanol in a "text" format. 

    In [7]:

    # Write your code in this cell:
    

    Exercise 1b: Retrieve the number of hydrogen-bond acceptors of aspirin in a "text" format.

    In [8]:

    # Write your code in this cell:
    

     

    Formulating PUG-REST request URLs using variables

     

    In the previous examples, the PUG-REST request URLs were directly provided to the requests.get( ), by explicitly typing the URL within the parentheses. However, it is also possible to provide the URL using a variable. The following example shows how to formulate the PUG-REST request URL using variables and pass it to requests.get( ).

    In [9]:

    pugrest = "https://pubchem.ncbi.nlm.nih.gov/rest/pug"
    pugin   = "compound/name/water"
    pugoper = "property/MolecularFormula"
    pugout  = "txt"
    
    url     = pugrest + '/' + pugin + '/' + pugoper + '/' + pugout
    print(url)
    

    https://pubchem.ncbi.nlm.nih.gov/rest/pug/compound/name/water/property/MolecularFormula/txt

     

    A PUG-REST request URL encodes three pieces of information (input, operation, output), preceded by the prologue common to all requests. In the above code cell, these pieces of information are stored in four different variables (pugrest, pugin, pugoper, pugout) and combined into a new variable url.

    One can also generate the same URL using the join( ) function, available for a string.

    In [10]:

    url = "/".join( [pugrest, pugin, pugoper, pugout] )
    print(url)
    
    https://pubchem.ncbi.nlm.nih.gov/rest/pug/compound/name/water/property/MolecularFormula/txt
    

    Here, the strings stored in the four variables are joined by the "/" character as a separator. Note that the four variables are enclosed within square brackets ([]), meaning that a list containing them as elements is provided to join( ).

    Then, the url can be passed to requests.get( ).

    In [11]:

    res = requests.get(url)
    print(res.text)
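
The URL assembly shown above can also be wrapped in a small function so that the same pattern is reused for different compounds and properties. This is only a sketch: the function name make_url and the use of urllib.parse.quote to percent-encode names containing spaces are our own assumptions, not something the assignment requires.

```python
from urllib.parse import quote

PUGREST = "https://pubchem.ncbi.nlm.nih.gov/rest/pug"   # prologue common to all requests

def make_url(name, prop, output="txt"):
    """Assemble a PUG-REST URL for one property of a compound given by name."""
    pugin = "compound/name/" + quote(name, safe="")   # percent-encode spaces etc. in the name
    pugoper = "property/" + prop
    return "/".join([PUGREST, pugin, pugoper, output])

print(make_url("water", "MolecularFormula"))
print(make_url("acetic acid", "MolecularWeight"))
```

The first call prints the same URL as the cells above; the second shows how a space in a name becomes %20.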
    

     

    Warning:  Avoid using in or input as a variable name in python. In python, in is a reserved keyword and input is the name of a built-in function. In the example above, the variables are prefixed with "pug" to avoid this naming conflict.
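
If you are unsure whether a name is safe to use for a variable, the standard library can tell you. The short check below is a sketch using the keyword and builtins modules; the three candidate names are simply the ones discussed in the warning above.

```python
import builtins
import keyword

for candidate in ["in", "input", "pugin"]:
    is_kw = keyword.iskeyword(candidate)        # True for reserved words such as "in"
    is_builtin = hasattr(builtins, candidate)   # True for built-in names such as "input"
    print(candidate, "| keyword:", is_kw, "| built-in name:", is_builtin)
```

Using a keyword as a variable name is a syntax error, whereas reusing a built-in name is allowed but hides the built-in function for the rest of the session.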

     

    Making multiple requests using a for loop

     

    The approach in the previous section (which uses variables to construct a request URL) looks inconvenient compared to the three-line code shown at the beginning, where the request URL is provided directly to requests.get( ). If you are making only one request, it is simpler to provide the URL directly to requests.get( ) than to assign the pieces to variables, construct the URL from them, and pass it to the function.
    However, if you are making a large number of requests, it would be very time consuming to type the respective request URLs for all requests. In that case, you want to store common parts as variables and use them in a loop. For example, suppose that you want to retrieve the SMILES strings of 5 chemicals.

    In [12]:

    names = [ 'cytosine', 'benzene', 'motrin', 'aspirin', 'zolpidem' ]
    

    Now the chemical names are stored in a list called names . Using a for loop, you can loop over each chemical name, formulating the request URL and retrieving the desired data, as shown below.

    In [13]:

    pugrest = "https://pubchem.ncbi.nlm.nih.gov/rest/pug"
    pugoper = "property/CanonicalSMILES"
    pugout  = "txt"
    
    for myname in names:    # loop over each element in the "names" list
        
        pugin = "compound/name/" + myname
        
        url = "/".join( [pugrest, pugin, pugoper, pugout] )
        res = requests.get(url)
        print(myname, ":", res.text)
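
The text returned by PUG-REST ends with a newline character, so printing res.text directly leaves a blank line after each result. A name can also resolve to more than one compound record, in which case several lines may come back. The sketch below shows one way to tidy this up; the helper name first_value is hypothetical, and the sample strings are typed in by hand rather than fetched from PubChem.

```python
def first_value(text):
    """Return the first non-empty line of a PUG-REST 'txt' response, without the newline."""
    lines = [line.strip() for line in text.splitlines() if line.strip()]
    return lines[0] if lines else None

print(first_value("C1=CC=CC=C1\n"))   # sample text typed in by hand, not fetched
print(first_value(""))                # an empty response gives None
```

Inside the loop, print(myname, ":", first_value(res.text)) would then print one compact line per chemical name.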
    

     

    Warning: When you make many programmatic access requests using a loop, you should keep your request rate at or below five requests per second. Please read the following document to learn more about PubChem's usage policies: https://pubchemdocs.ncbi.nlm.nih.gov/programmatic-access$_RequestVolumeLimitations
    Violation of the usage policies may result in the user being temporarily blocked from accessing PubChem (or NCBI) resources.

    In the for-loop example above, there are only five input chemical names to process, so it is unlikely to violate the five-requests-per-second limit. However, if you have thousands of names to process, the above code will likely exceed the limit (because requests of this kind usually finish very quickly). Therefore, the request rate should be limited using the sleep( ) function in the time module. For simplicity, let's suppose that you have 12 chemical names to process (in reality, you could have many more).

    In [14]:

    names = [ 'water', 'benzene', 'methanol', 'ethene', 'ethanol', \
              'propene','1-propanol', '2-propanol', 'butadiene', '1-butanol', \
              '2-butanol', 'tert-butanol']
    

     

    LibreText Reading:

    In analyzing the code of the following example you should reference Code Example  9.1.1 of  Section 9.1.2 of Appendix 9.1

    In [15]:

    import time
    
    pugrest = "https://pubchem.ncbi.nlm.nih.gov/rest/pug"
    pugoper = "property/CanonicalSMILES"
    pugout  = "txt"
    
    for i in range(len(names)):    # loop over each index (position) in the "names" list
        
        pugin = "compound/name/" + names[i]    # names[i] = the ith element in the names list.
        
        url = "/".join( [pugrest, pugin, pugoper, pugout] )
        res = requests.get(url)
        print(names[i], ":", res.text)
    
        if ( i % 5 == 4 ) :  # the % is the modulo operator and returns the remainder of a calculation (if i = 4, 9, ...)
            time.sleep(1)
    

     

    There are three things noteworthy in the above example (compared to the previous examples with the five chemical name queries).

    • The for loop iterates over the indices from 0 to len(names) − 1, that is, [0, 1, 2, 3, ..., 11].
    • The variable i is used (in names[i]) to generate the input part (pugin) of the PUG-REST request URL.
    • The variable i is used (in the if statement) to pause the program for one second after every five requests.
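
To see exactly when the pause happens, the condition i % 5 == 4 can be evaluated without sending any requests. The sketch below reuses the 12 names from the example and also shows enumerate( ), a common alternative to range(len(names)) that yields the index and the element together; nothing in it contacts PubChem.

```python
names = ['water', 'benzene', 'methanol', 'ethene', 'ethanol',
         'propene', '1-propanol', '2-propanol', 'butadiene', '1-butanol',
         '2-butanol', 'tert-butanol']

# Indices after which time.sleep(1) would run in the loop above
pause_after = [i for i in range(len(names)) if i % 5 == 4]
print(pause_after)

# enumerate() gives the index and the element together
for i, name in enumerate(names):
    note = "  <- pause for 1 s after this request" if i % 5 == 4 else ""
    print(i, name, note)
```

With 12 names, the pause falls after the fifth and tenth requests (indices 4 and 9).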

    It should be noted that the request volume limit can be lowered through the dynamic traffic control at times of excessive load (https://pubchemdocs.ncbi.nlm.nih.gov/dynamic-request-throttling). Throttling information is provided in the HTTP header response, indicating the system-load state and the per-user limits. Based on this throttling information, the user should moderate the speed at which requests are sent to PubChem. We will cover this topic later in this course.

    Exercise 3a: Retrieve the XLogP values of linear alkanes with 1 to 12 carbons.

    • Use the chemical names as inputs.
    • Use a for loop to retrieve the XLogP value for each alkane.
    • Use the sleep() function to stop the program for one second for every five requests.

    In [16]:

    # Write your code in this cell:
    

    Exercise 3b: Retrieve the isomeric SMILES of the 20 common amino acids.

    • Use the chemical names as inputs. Because the 20 common amino acids in living organisms predominantly exist as one chiral form (the L-form), the names should be prefixed with "L-" (e.g., "L-alanine", rather than "alanine"), except for "glycine" (which does not have a chiral center).
    • Use a for loop to retrieve the isomeric SMILES for each amino acid.
    • Use the sleep() function to stop the program for one second for every five requests.

    In [17]:

    # Write your code in this cell:
    

     

    Getting multiple molecular properties

     

    All the examples we have seen in this notebook retrieved a single molecular property for a single compound (although we were able to get a desired property for a group of compounds using a for loop). However, it is possible to get multiple properties for multiple compounds with a single request.

    The following example retrieves the hydrogen-bond donor count, hydrogen-bond acceptor count, XLogP, and TPSA for 5 compounds (represented by their PubChem Compound IDs (CIDs)) in a comma-separated values (CSV) format.

    In [18]:

    pugrest = "https://pubchem.ncbi.nlm.nih.gov/rest/pug"
    pugin   = "compound/cid/4485,4499,5026,5734,8082"
    pugoper = "property/HBondDonorCount,HBondAcceptorCount,XLogP,TPSA"
    pugout  = "csv"
    
    url = "/".join( [pugrest, pugin, pugoper, pugout] )   # Construct the URL
    print(url)
    print("-" * 30)   # Print "-" 30 times (to print a line for readability)
    
    res = requests.get(url)
    print(res.text)
    
    https://pubchem.ncbi.nlm.nih.gov/rest/pug/compound/cid/4485,4499,5026,5734,8082/property/HBondDonorCount,HBondAcceptorCount,XLogP,TPSA/csv
    ------------------------------
    "CID","HBondDonorCount","HBondAcceptorCount","XLogP","TPSA"
    4485,1,7,2.200,110.0
    4499,1,7,3.300,110.0
    5026,1,8,4.300,123.0
    5734,1,5,0.2,94.6
    8082,1,1,0.800,12.0
    
    

    In [19]:

    res.text.rstrip()
    

    Out[19]:

    '"CID","HBondDonorCount","HBondAcceptorCount","XLogP","TPSA"\n4485,1,7,2.200,110.0\n4499,1,7,3.300,110.0\n5026,1,8,4.300,123.0\n5734,1,5,0.2,94.6\n8082,1,1,0.800,12.0'
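
Once the CSV text is in hand, it is often convenient to turn it into Python values rather than leave it as one long string. The following sketch uses the csv and io modules from the standard library; the sample text is typed in by hand (two rows in the format PubChem returns), not fetched. Note that a later cell in this notebook uses csv as a variable name, which would hide the csv module if both are run in the same session.

```python
import csv
import io

text = ('"CID","HBondDonorCount","HBondAcceptorCount","XLogP","TPSA"\n'
        '5734,1,5,0.2,94.6\n'
        '8082,1,1,0.800,12.0\n')      # sample text typed in by hand, not fetched

rows = list(csv.DictReader(io.StringIO(text)))   # one dictionary per compound
for row in rows:
    # every value is read as a string, so convert numbers explicitly
    print(row["CID"], float(row["XLogP"]), float(row["TPSA"]))
```

Each row is a dictionary keyed by the column names in the header line, so properties can be looked up by name instead of by position.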

    PubChem has a standard time limit of 30 seconds per request. When you try to retrieve too many properties for too many compounds with a single request, it can take longer than the 30-second limit and a time-out error will be returned. Therefore, you may need to split the compound list into smaller chunks and process one chunk at a time.

    In [20]:

    cids = [ 443422,  72301,   8082,    4485,    5353740, 5282230, 5282138, 1547484, 941361, 5734,  \
             5494,    5422,    5417,    5290,    5245,    5026,    4746,    4507,    4499,   4497,  \
             4494,    4474,    4418,    4386,    4009,    4008,    3949,    3926,    3878,   3784,  \
             3698,    3547,    3546,    3336,    3333,    3236,    3076,    2585,    2520,   2351,  \
             2312,    2162,    1236,    1234,    292331,  275182,  235244,  108144,  104972, 77157, \
             5942250, 5311217, 4564402, 4715169, 5311501]
    

    In [21]:

    chunk_size = 10
    
    if ( len(cids) % chunk_size == 0 ) : # check if total number of cids is divisible by 10 with no remainder
        num_chunks = len(cids) // chunk_size # sets number of chunks
    else : # if divide by 10 results in remainder
        num_chunks = len(cids) // chunk_size + 1 # add one more chunk
    
    print("# Number of CIDs:", len(cids) )
    print("# Number of chunks:", num_chunks )
    
    # Number of CIDs: 55
    # Number of chunks: 6
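
The same chunk count can be obtained with ceiling division, and the chunks themselves can be produced with list slicing. This is a sketch under our own naming (split_into_chunks is a hypothetical helper), and it uses 55 placeholder numbers in place of the real CID list.

```python
import math

def split_into_chunks(items, chunk_size):
    """Return a list of consecutive slices of items, each with at most chunk_size elements."""
    return [items[start:start + chunk_size] for start in range(0, len(items), chunk_size)]

demo = list(range(1, 56))                 # 55 placeholder numbers standing in for CIDs
chunks = split_into_chunks(demo, 10)

print(len(chunks))                        # number of chunks
print(math.ceil(len(demo) / 10))          # same count from ceiling division
print(len(chunks[-1]))                    # size of the last, shorter chunk
```

Slicing past the end of a list is safe in Python, which is why the last chunk simply comes out shorter (5 items here) without any special handling.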
    

    In [22]:

    pugrest = "https://pubchem.ncbi.nlm.nih.gov/rest/pug"
    pugoper = "property/HBondDonorCount,HBondAcceptorCount,XLogP,TPSA"
    pugout  = "csv"
    
    csv = ""   #sets a variable called csv to save the comma separated output
    
    for i in range(num_chunks) : # sets number of requests to number of data chunks as determined above
        
        idx1 = chunk_size * i        # start index of the current chunk (a moving window over cids)
        idx2 = chunk_size * (i + 1)  # end index (exclusive) of the current chunk
    
        pugin = "compound/cid/" + ",".join([ str(x) for x in cids[idx1:idx2] ]) # build pug input for chunks of data
        url = "/".join( [pugrest, pugin, pugoper, pugout] )   # Construct the URL
        
        res = requests.get(url)
    
        if ( i == 0 ) : # if this is the first request, store result in empty csv variable
            csv = res.text 
        else :          # if this is a subsequent request, add the request to the csv variable adding a new line between chunks
            csv = csv + "\n".join(res.text.split()[1:]) + "\n" 
        
        if (i % 5 == 4):  
            time.sleep(1)
    
    print(csv)
    
    "CID","HBondDonorCount","HBondAcceptorCount","XLogP","TPSA"
    443422,0,5,3.1,40.2
    72301,0,5,3.2,40.2
    8082,1,1,0.800,12.0
    4485,1,7,2.200,110.0
    5353740,2,5,3.5,76.0
    5282230,2,5,3.2,84.9
    5282138,1,8,4.400,120.0
    1547484,0,2,5.800,6.5
    941361,0,4,6.000,6.5
    5734,1,5,0.2,94.6
    5494,0,6,5.0,57.2
    5422,0,8,6.4,61.9
    5417,0,5,3.2,40.2
    5290,2,5,2.6,62.2
    5245,5,8,-3.1,148.0
    5026,1,8,4.300,123.0
    4746,1,1,6.8,12.0
    4507,1,7,2.900,110.0
    4499,1,7,3.300,110.0
    4497,1,8,3.100,120.0
    4494,1,8,2.900,134.0
    4474,1,8,3.800,114.0
    4418,1,5,4.100,45.2
    4386,2,3,4.400,49.3
    4009,2,5,3.5,76.0
    4008,1,9,5.600,117.0
    3949,0,7,4.9,34.2
    3926,1,5,6.0,35.6
    3878,2,5,1.4,90.7
    3784,1,8,4.300,104.0
    3698,2,3,-0.2,68.0
    3547,1,5,1.0,70.7
    3546,3,5,-0.5,132.0
    3336,1,1,5.5,12.0
    3333,1,5,3.900,64.6
    3236,0,2,3.8,20.3
    3076,0,6,3.1,84.4
    2585,3,5,4.200,75.7
    2520,0,6,3.800,64.0
    2351,0,3,5.3,15.7
    2312,0,2,4.6,12.5
    2162,2,7,3.000,99.9
    1236,1,8,6.800,114.0
    1234,0,7,3.800,73.2
    292331,2,3,3.900,49.3
    275182,1,8,6.1,72.9
    235244,1,8,6.7,72.9
    108144,2,5,3.9,117.0
    104972,1,6,3.300,72.7
    77157,1,4,3.2,49.8
    5942250,2,5,3.5,76.0
    5311217,1,7,4.500,90.9
    4564402,0,4,4.1,45.5
    4715169,2,3,-1.6,63.3
    5311501,0,4,4.4,43.7
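
The loop above drops the repeated header line with res.text.split(), which splits on any whitespace. That works here because none of the requested properties contain spaces, but a property whose values include spaces would be broken apart. A sketch of a more defensive merge, splitting on line breaks only, is shown below; the helper name merge_csv_chunks is hypothetical, and the two chunks are typed in by hand using rows from the output above.

```python
def merge_csv_chunks(chunk_texts):
    """Combine CSV texts that share one header line, keeping the header only once."""
    merged = []
    for k, text in enumerate(chunk_texts):
        lines = [line for line in text.splitlines() if line.strip()]
        if k == 0:
            merged.extend(lines)        # first chunk: keep header and data
        else:
            merged.extend(lines[1:])    # later chunks: skip the repeated header
    return "\n".join(merged) + "\n"

# Two hand-made chunks using rows from the output above
header = '"CID","HBondDonorCount","HBondAcceptorCount","XLogP","TPSA"\n'
chunk1 = header + '443422,0,5,3.1,40.2\n72301,0,5,3.2,40.2\n'
chunk2 = header + '5494,0,6,5.0,57.2\n'
print(merge_csv_chunks([chunk1, chunk2]))
```

Collecting each res.text in a list inside the loop and merging once at the end would give the same combined table as the cell above.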
    
    

    Exercise 4a: Below is the list of CIDs of known anti-inflammatory agents (obtained from PubChem via the URL: https://www.ncbi.nlm.nih.gov/pccompound?LinkName=mesh_pccompound&from_uid=68000893). Download the following properties of those compounds in a comma-separated format: heavy atom count, rotatable bond count, molecular weight, XLogP, hydrogen bond donor count, hydrogen bond acceptor count, TPSA, and isomeric SMILES.

    • Split the input CID list into small chunks (with a chunk size of 100 CIDs).
    • Process one chunk at a time using a for loop.
    • Do not forget to add sleep() to comply with the usage policy.

    In [23]:

    cids = [ 471, 1981, 2005, 2097, 2151, 2198, 2206, 2214, 2244, 2307, 2308, 2313, 2355, 2396, 2449, 2462, 2466, 2581, 2662, 2794, 2863, 3000, 3003, 3033, 3056, 3059, 3111, 3177, 3194, 3230, 3242, 3282, 3308, 3332, 3335, 3342, 3360, 3371, 3379, 3382, 3384, 3394, 3495, 3553, 3612, 3672, 3715, 3716, 3718, 3778, 3824, 3825, 3826, 3935, 3946, 3965, 4009, 4037, 4038, 4044, 4075, 4159, 4237, 4386, 4409, 4413, 4487, 4488, 4495, 4534, 4553, 4614, 4641, 4671, 4692, 4781, 4888, 4895, 4921, 5059, 5090, 5147, 5161, 5208, 5228, 5339, 5352, 5359, 5362, 5468, 5469, 5475, 5480, 5509, 5733, 5743, 5744, 5745, 5753, 5754, 5755, 5834, 5865, 5875, 5876, 5877, 6094, 6213, 6215, 6247, 6436, 6741, 7090, 7497, 8522, 9053, 9231, 9642, 9782, 9878, 10114, 10154, 10170, 10185, 10206, 12555, 12938, 13802, 14982, 15209, 16490, 16533, 16623, 16639, 16752, 16923, 17198, 19161, 20469, 21102, 21700, 21800, 21826, 21975, 22419, 23205, 26098, 26248, 26318, 28718, 28871, 30869, 30870, 30951, 31307, 31378, 31508, 31635, 31799, 31800, 32153, 32327, 32798, 33958, 35375, 35455, 35935, 36833, 37425, 38081, 38503, 39212, 39941, 40000, 40632, 41643, 43261, 44219, 47462, 47795, 50294, 50295, 51717, 54445, 54585, 57782, 59757, 60164, 60490, 60542, 60712, 60726, 60864, 61486, 62074, 62924, 63006, 63019, 64704, 64738, 64746, 64747, 64927, 64945, 64971, 64982, 65394, 65464, 65655, 65679, 65762, 66249, 67417, 68700, 68704, 68706, 68731, 68749, 68819, 68865, 68869, 68917, 71246, 71354, 71364, 71386, 71398, 71414, 71415, 71771, 72158, 72300, 73400, 82153, 84003, 84429, 90763, 91626, 91670, 100472, 102011, 104762, 104943, 107641, 107738, 107793, 108068, 108130, 114753, 114840, 114917, 114999, 115239, 119032, 119286, 119365, 119607, 119828, 119871, 121928, 121957, 122139, 122179, 122182, 123619, 123673, 123723, 124978, 128191, 128229, 128571, 133021, 134896, 146364, 151075, 151166, 152165, 155354, 155761, 156391, 158103, 159557, 162666, 164676, 167928, 168928, 174093, 174277, 176155, 177976, 180604, 183088, 189821, 
192156, 196122, 196840, 196841, 200674, 201587, 219121, 222786, 229860, 235244, 236702, 259846, 263373, 275182, 292331, 425990, 439503, 439533, 441335, 441336, 442534, 442993, 443943, 443949, 443967, 444036, 445154, 445858, 446925, 479503, 485711, 490428, 501254, 522325, 546807, 578771, 584547, 610479, 633091, 633097, 636374, 636398, 656604, 656656, 656852, 657238, 667550, 927704, 969510, 969516, 1548887, 1548910, 2737488, 3033890, 3033980, 3045402, 3051696, 3055172, 4129359, 4306515, 4483645, 5018304, 5185849, 5280802, 5280914, 5280915, 5281004, 5281071, 5281515, 5281522, 5281792, 5282183, 5282193, 5282230, 5282387, 5282402, 5282492, 5283542, 5283734, 5284538, 5284539, 5311051, 5311052, 5311066, 5311067, 5311093, 5311101, 5311108, 5311169, 5311180, 5318517, 5320420, 5322111, 5352624, 5353725, 5353726, 5353740, 5353864, 5354499, 5377381, 5420804, 5420805, 5458396, 5472495, 5481958, 5701991, 5702036, 5702148, 5702212, 5702252, 5702287, 5745214, 5942250, 6420050, 6429274, 6437368, 6437387, 6438873, 6447131, 6453785, 6473881, 6509979, 6708733, 6710677, 6714002, 6917783, 6917852, 6917894, 6918172, 6918173, 6918332, 6918445, 6918452, 6918612, 6925666, 7060958, 7251185, 9554199, 9798098, 9799453, 9841438, 9843941, 9846332, 9865808, 9868219, 9869053, 9871508, 9875547, 9883509, 9897518, 9897771, 9907157, 9913795, 9919776, 9926694, 9934547, 10363606, 10918539, 11158972, 11513733, 11561674, 11616712, 11870423, 11949636, 11954221, 11954316, 11954353, 11954369, 11957468, 11961431, 11972243, 11972532, 12300053, 12313906, 12313911, 12606303, 12634263, 12714644, 12874922, 13018150, 13020033, 13041095, 14010989, 14515707, 14798494, 15895902, 16051947, 16132369, 16213022, 16213698, 16218996, 16219353, 16220118, 16759566, 16760658, 17750985, 17753757, 18526330, 18632363, 18647121, 18943026, 20054915, 21120116, 21637635, 21637642, 21893738, 21893804, 21982135, 22141508, 22811280, 23509770, 23631982, 23653552, 23657872, 23663407, 23663409, 23663418, 23663959, 23663989, 23665411, 
23665999, 23667642, 23669636, 23674183, 23674255, 23674745, 23675763, 23680530, 23681059, 23684814, 23688663, 23693301, 23694214, 23702389, 24181458, 24721429, 24761485, 24799587, 24847961, 24847981, 24867460, 24867465, 24867475, 24883465, 24916955, 25077872, 25113755, 25796773, 40469526, 44119558, 44202892, 44260118, 44266812, 44386560, 45006151, 45006158, 45039955, 45356876, 45356931, 45357558, 45357932, 45358013, 45358120, 45358130, 45358140, 45358148, 45358149, 45488525, 46174093, 46397498, 46780650, 46780910, 46783539, 46783786, 46783814, 46863906, 46878350, 46882877, 50989825, 51026956, 51340230, 51398089, 53384387, 53394893, 53486221, 53486290, 53486322, 54194814, 54605501, 54675840, 54676228, 54677470, 54677971, 54677972, 54677977, 54682045, 54684589, 54690031, 54697648, 54708862, 54714524, 56841932, 56842111, 56845155, 57347755, 57486087, 67668959, 67804972, 67986221, 70470286, 70678885, 71306882, 71587162, 72774967, 72941490, 72941625, 73758129, 73759663, 73759808, 74787565, 77906397, 78577433, 90488794, 91711382, 91826463, 91873711, 91881846, 92131836, 92462493, 102004404, 102601886, 117072385, 117072403, 117072410, 118701141, 118701402, 118984459, 122130078, 122130111, 122130185, 122130213, 122130768, 122173054, 122173183, 122361610, 123134657, 124081055, 124463365, 126968472, 126968501, 126968801, 126969212, 126969455, 129009998, 129010022, 129010033, 129010043, 129316829, 129317859, 129317898, 129628207, 129628892, 129670532, 129735029, 131632430, 131635023, 131676243, 131750284, 131954647, 131954667, 132399051, 132399058, 133112890, 133126366, 133126370, 133562807, 133659920, 133687604, 134129698, 134159361, 134460917, 134612785, 134687786, 134688123, 134688323, 134688977, 134689786, 134693106, 134693125, 134693234, 134694728, 134694860, 135413496, 135413505, 135414247, 135484078, 135515521, 135565709, 136040192, 137177332, 137699687, 137705034, 137705717, 137705725, 137705994, 137706376, 137706400, 137795135, 138059757, 138107776, 138113311, 
138113507, 138113581, 138114182, 138114743]
    len(cids)
    

    Out[23]:

    708

    In [24]:

    # Write your code in this cell.