Section 3D. MASCOT Database Search

    The peptide mass fingerprint search in the MASCOT database is used to identify a protein from mass spectrometry data. The following MALDI-TOF mass spectrum shows the masses detected after digesting a protein with trypsin. This protein is from the organism Escherichia coli (E. coli) and was cut from a 2D gel and analyzed by students in Analytical Chemistry at Indiana University.

    MALDI-TOF_mass_spectrum.png

    m/z S/N

    832.312 10.3
    842.53 23
    1045.541 3.7
    1179.58 2.3
    1210.545 5.2
    1247.598 5.7
    1283.741 74.7
    1307.679 2.1
    1401.67 3.1
    1403.711 2.3
    1473.801 43.2
    1521.775 37.2
    1537.764 6
    1618.801 7.3
    1626.907 10.4
    1648.88 2.6
    1811.976 53
    2141.078 8.5
    2391.068 2.3
    2753.496 12.3

    Some peaks in the mass spectrum do not come from peptides from the protein we are trying to identify. Peaks that should be eliminated before performing the Mascot search are:

    Trypsin autolysis peaks: Trypsin cleaves the protein of interest after lysine and arginine residues. However, trypsin will also cleave other trypsin molecules. The peaks due to trypsin autolysis are: 514.63, 842.5, 906.05, 1006.15, 1045.12, 1736.97, 1768.99, 2158.48, 2211.4, 2239.1

    Matrix Clusters: The MALDI matrix may also combine with Na+ and K+ during ionization. These peaks are 855.1, 861.1, 871.1, 877.1, 1060.1. If the scientist had good lab technique, these ions would be eliminated or greatly reduced in a sample preparation step.
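
    The screening described above can be sketched in a few lines of Python. This is only an illustration: the function name and the 0.5 Da matching window are assumptions, not part of the MASCOT workflow, and the window should be adjusted to the mass accuracy of the instrument.

```python
# Contaminant masses quoted in the text above.
TRYPSIN_AUTOLYSIS = [514.63, 842.5, 906.05, 1006.15, 1045.12,
                     1736.97, 1768.99, 2158.48, 2211.4, 2239.1]
MATRIX_CLUSTERS = [855.1, 861.1, 871.1, 877.1, 1060.1]


def remove_contaminants(peaks, contaminants, tol=0.5):
    """Keep only peaks farther than `tol` Da from every contaminant mass.

    `tol` is an assumed matching window, not a value given in the text.
    """
    return [p for p in peaks
            if all(abs(p - c) > tol for c in contaminants)]


# m/z values from the table above.
peaks = [832.312, 842.53, 1045.541, 1179.58, 1210.545, 1247.598, 1283.741,
         1307.679, 1401.67, 1403.711, 1473.801, 1521.775, 1537.764, 1618.801,
         1626.907, 1648.88, 1811.976, 2141.078, 2391.068, 2753.496]

kept = remove_contaminants(peaks, TRYPSIN_AUTOLYSIS + MATRIX_CLUSTERS)
removed = [p for p in peaks if p not in kept]
print(len(peaks), "peaks in,", len(kept), "peaks kept")
print("removed:", removed)
```

    A window that is too narrow leaves contaminant peaks in the list; one that is too wide may discard real peptide peaks.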

    Reading Question

    1. Examine the m/z data in the previous table and figure. Which peaks should be removed before entering m/z data into the MASCOT database?

    Peptide Mass Mapping Search Parameters

    Before m/z information is entered into the MASCOT database, a number of search parameters must be set appropriately. The image shows a data entry page for a peptide mass fingerprint search. Let’s examine the meaning of each of the following search parameters: Database, Enzyme, Taxonomy, Fixed Modifications, Variable Modifications, Protein Mass, Peptide Tolerance, Mass Values, Monoisotopic or Average Mass, and Data Input.

    data_entry_page.png

    The following section on search parameters has been adapted from the Peptide Mass Fingerprint Tutorial in the MASCOT Database.

    Database and Taxonomy: The first choice you have to make is which database to search. Some databases contain sequences from a single organism. Others contain entries from multiple organisms, but usually include the taxonomy for each entry, so that entries for a specific organism can be selected during a search using a taxonomy filter.

    If your target organism is well characterized, such as human or mouse or yeast, Swiss-Prot is the recommended choice. The entries are all high quality and well annotated. Because Swiss-Prot is non-redundant, it is relatively small, which makes it easier to get a statistically significant match. If you know what is in the sample, you can restrict the search to an organism or family by means of the taxonomy filter, but remember that you can never rule out contaminants.

    If you are interested in a bacterium or a plant, you may find that it is poorly represented in Swiss-Prot, and it would be better to try one of the comprehensive protein databases, which aim to include all known protein sequences. The two best known are NCBInr and UniRef100. These are very large databases, and you will almost certainly want to select a limited taxonomy. But never choose a narrow taxonomy without looking at the counts of entries and understanding the classification. In the current Swiss-Prot, for example, there are 26,139 entries for rodentia, of which all but 1,602 are for mouse and rat. So, even if your target organism is hamster, it isn’t a good idea to choose ‘other rodentia’. It is better to search rodentia and hope to get a match to a homologous protein from mouse or rat.

    Enzyme: Choose the enzyme used to digest the protein. Trypsin is commonly used and will be the enzyme utilized in all the data in this module.

    Missed Cleavages: The number of missed cleavages refers to the completeness of the enzyme digest. Did the enzyme trypsin cleave after every lysine and arginine residue in the protein? Or were some cleavages missed? The number of allowed missed cleavages should be set empirically, by running a standard and/or trying different values to see which gives the best score.
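
    To make the idea of missed cleavages concrete, here is a minimal sketch of an in-silico tryptic digest. It assumes the simple rule stated above (cleave after every lysine, K, and arginine, R) and ignores refinements such as the commonly applied exception for a following proline. The function name and the example sequence are invented for illustration.

```python
import re


def tryptic_peptides(sequence, missed_cleavages=0):
    """List peptides from cleaving after every K or R.

    Up to `missed_cleavages` adjacent fully cleaved fragments are joined,
    which mimics an incomplete digest. Simplified rule: no proline exception.
    """
    # Fully cleaved fragments: each ends in K or R, except possibly the last.
    fragments = re.findall(r"[^KR]*[KR]|[^KR]+$", sequence)
    peptides = []
    for i in range(len(fragments)):
        last = min(i + missed_cleavages, len(fragments) - 1)
        for j in range(i, last + 1):
            peptides.append("".join(fragments[i:j + 1]))
    return peptides


# Hypothetical sequence, not taken from any real protein.
seq = "MKTAYIAKQRQISFVK"
print(tryptic_peptides(seq, missed_cleavages=0))
print(tryptic_peptides(seq, missed_cleavages=1))
```

    Allowing one missed cleavage adds every pair of neighbouring fragments to the list of candidate peptides, so the search has more calculated masses to compare against the experimental data.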

    Modifications in database searching are handled in two ways.

    First, there are the fixed modifications. The most common example is the reduction and alkylation of cysteine. This reaction is performed to break disulfide bonds and prevent them from reforming. In the absence of disulfide bonds, the protein will be unfolded and the enzyme will be more effective in digesting the protein. Since all cysteines are modified, this is effectively just a change in the mass of cysteine. It carries no penalty in terms of search speed or specificity.

    The alkylation agent used is iodoacetamide (select modification carbamidomethyl). In proteins, the reduced thiol group in cysteine is alkylated with iodoacetamide in the reaction shown:

    iodoacetamide.png

    In contrast, most post-translational modifications do not apply to all instances of a residue. For example, phosphorylation might affect just one serine in a protein containing many serines and threonines. These variable or non-quantitative modifications are expensive in the sense that they increase the search space. This is because the software has to permute out all the possible arrangements of modified and unmodified residues that fit to the peptide molecular mass. As more and more modifications are considered, the number of combinations and permutations increases geometrically, and we get a so-called combinatorial explosion.
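
    The growth in search space can be illustrated with a short count. The sketch below assumes a single variable modification (phosphorylation on S, T, or Y) in which each candidate residue is either modified or not, giving 2^n arrangements for n candidate sites. The peptides are invented for illustration, and a real search engine only needs to consider the arrangements that fit the measured mass.

```python
def count_arrangements(peptide, modifiable="STY"):
    """Number of modified/unmodified arrangements for one variable modification."""
    n_sites = sum(peptide.count(residue) for residue in modifiable)
    return 2 ** n_sites


# Hypothetical peptides with 1, 4, and 9 candidate sites.
for peptide in ["AGSLK", "SGSTLYK", "SSTSTYSTSR"]:
    print(peptide, count_arrangements(peptide))
```

    Each additional variable modification multiplies the count again, which is why it pays to select as few variable modifications as possible.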

    One common variable modification is the oxidation of methionine shown:

    methionine_ oxidation.png

    Protein Mass: If the protein mass is known from its position in a 2D gel, this value can be entered. Usually, this adds little to the score, and the general advice is to leave this field blank.

    Peptide Tolerance: Making an estimate of the mass accuracy doesn’t have to be a guessing game. The Mascot Protein View report includes graphs of mass errors.

    One way to evaluate the mass accuracy of the mass spectrometer is to run a standard and look at the error graphs for the correct match. Another method of evaluating mass accuracy is to compare the experimental value of a trypsin autolysis peak with the theoretical value.

    In the data set provided, one trypsin autolysis peak had a measured mass of 1045.54 and the theoretical mass is 1045.12. The measurement indicates the mass spectrometer has mass error of approximately 0.42 Da.

    (Note: Da is the same as amu).
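
    The arithmetic behind this estimate, together with the equivalent relative error in parts per million (ppm), can be written as a small helper. The function name is an assumption made for this sketch.

```python
def mass_error(measured, theoretical):
    """Return the mass error in Da and in parts per million (ppm)."""
    error_da = measured - theoretical
    error_ppm = error_da / theoretical * 1e6
    return error_da, error_ppm


# Trypsin autolysis peak from the data set in the text.
error_da, error_ppm = mass_error(1045.54, 1045.12)
print(f"{error_da:.2f} Da, {error_ppm:.0f} ppm")
```

    A peptide tolerance somewhat larger than the observed error is then a reasonable starting value for the search.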

    Mass values: Most frequently MALDI produces the singly charged molecular ion (MH+). Your peak list will only contain Mr values (relative molecular mass) if the peak picking software has ‘de-charged’ the measured m/z values. Peak picking software may be programmed to do this because the data contained a mixture of charge states.
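
    The relationship between a measured m/z value and Mr can be sketched as follows. The sketch assumes positive ions formed by adding protons (mass about 1.007276 Da each). The function names are illustrative.

```python
PROTON_MASS = 1.007276  # Da


def mz_to_mr(mz, charge=1):
    """'De-charge' a measured m/z value to the relative molecular mass Mr."""
    return charge * (mz - PROTON_MASS)


def mr_to_mz(mr, charge=1):
    """m/z expected for a molecule of mass Mr carrying `charge` protons."""
    return (mr + charge * PROTON_MASS) / charge


# A singly charged (MH+) peak from the table of m/z values above.
print(round(mz_to_mr(1283.741), 3))
```

    Choosing MH+ when the peak list actually contains Mr values (or the reverse) shifts every mass by about 1 Da, which is enough to spoil a search with a tight peptide tolerance.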

    Most modern instruments produce monoisotopic mass values. You will only have average masses if the entire isotope distribution has been centroided into a single peak, which usually implies very low resolution.

    The following MALDI-TOF mass spectrum of a protein digest zooms in on the mass region of different peptides near m/z 1500. The isotope distribution in a peptide with m/z 1515.7 is shown. The natural abundance of carbon-12 is 98.90% and carbon-13 is 1.10%. Therefore, peptides with a large number of carbon atoms will contain significant contributions to the M+1 peak and M+2 peak from carbon-13 atoms. The monoisotopic peptide contains all carbon-12 atoms. The M+1 peak has one carbon-13 atom and the M+2 peak has two carbon-13 atoms.

    MALDI-TOF_mass spectrum_protein_digest.png
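
    The relative heights of the M, M+1, and M+2 peaks can be estimated from the carbon-13 abundance with a binomial model. The sketch below considers carbon only (isotopes of H, N, O, and S are ignored) and uses 67 carbon atoms as an assumed, illustrative count for a peptide near m/z 1500. It is not the composition of the peptide in the figure.

```python
from math import comb

C13_ABUNDANCE = 0.0110  # natural abundance of carbon-13 quoted in the text


def carbon_isotope_pattern(n_carbons, max_extra=3):
    """Probability of 0, 1, ..., max_extra carbon-13 atoms among n_carbons."""
    p = C13_ABUNDANCE
    return [comb(n_carbons, k) * p**k * (1 - p)**(n_carbons - k)
            for k in range(max_extra + 1)]


# 67 carbons is an assumption made for illustration.
pattern = carbon_isotope_pattern(67)
for k, prob in enumerate(pattern):
    label = "M" if k == 0 else f"M+{k}"
    print(f"{label}: {prob / pattern[0]:.2f} relative to the monoisotopic peak")
```

    In this model the M+1 to M ratio grows in direct proportion to the number of carbon atoms, so heavier peptides show progressively larger isotope peaks.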

    Data Input: The first requirement for a Peptide Mass Fingerprint (PMF) search is a peak list (a list of m/z values). Peak lists are text files and come in various different formats. You can also copy and paste a list of values into the query area of the search form, or even type them in. Each m/z value goes on a separate line. If you also have an intensity value for the peak, this follows the m/z value, separated by a space or a tab.
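
    A peak list in the layout just described (one m/z value per line, optionally followed by an intensity) can be produced with a short helper. The function name is an assumption, and the second number in the example is the S/N value from the first table, used here as a stand-in for intensity.

```python
def format_peak_list(peaks):
    """Return one peak per line: m/z, then an optional intensity after a space."""
    lines = []
    for peak in peaks:
        if isinstance(peak, (tuple, list)):
            mz, intensity = peak
            lines.append(f"{mz} {intensity}")
        else:
            lines.append(f"{peak}")
    return "\n".join(lines)


print(format_peak_list([(832.312, 10.3), (1283.741, 74.7), (1473.801, 43.2)]))
```

    The resulting text can be pasted into the query area of the search form or saved as a plain text file.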

    Reading Questions

    1. You are analyzing a protein from E. coli.

    a. What is the advantage of setting the taxonomy to E. coli?

    b. What is a disadvantage of setting the taxonomy to E. coli instead of a more general class of bacteria to which E. coli belongs (Proteobacteria)?

    2. a. What is meant by the search parameter “missed cleavages?”

    b. How will one missed cleavage affect the number of peptides created after digestion with trypsin?

    3. A common fixed modification is carbamidomethyl. Why is a protein chemically modified in this way?

    4. Briefly describe one method for determining the peptide tolerance (or the mass accuracy) of the mass spectrometer.

    Performing a Peptide Mass Mapping Search: Now that you understand the various search parameters, you are ready to perform a peptide mass fingerprint search in MASCOT.

    1. Go to www.matrixscience.com and choose “Mascot search database”, “Peptide mass fingerprint”, and “Perform search”.
    2. A good set of search parameters to start with are:
      Database: SwissProt
      Taxonomy: Escherichia coli
      Enzyme: Trypsin
      Missed Cleavages: 1
      Fixed Modification: Carbamidomethyl
      Variable Modification: Oxidation of M
      Protein Mass: leave blank
      Peptide Tolerance: ±1 Da
      Mass Values: MH+
      Monoisotopic
      Report Top 5 Hits
    3. We will start with the MALDI-TOF data for the protein from E. coli cut from a 2D gel. Copy and paste the m/z values from the table. Don’t forget to remove the trypsin autolysis peaks and matrix clusters from the data set.
      m/z
      832.312
      842.53
      1045.541
      1179.58
      1210.545
      1247.598
      1283.741
      1307.679
      1401.67
      1403.711
      1473.801
      1521.775
      1537.764
      1618.801
      1626.907
      1648.88
      1811.976
      2141.078
      2391.068
      2753.496
    4. Record the search results.

      What protein has the highest score?
      What is its protein score? What score is needed for significance?
      Click on the identity of the protein for information about sequence and which peptide masses were found experimentally.
      How many mass values were searched?
      How many mass values were matched?
      What was the percent sequence coverage?

      Instructor Note: Students can run the search on computers and then compare with the results shown here.

    Results Summary: The first results screen identifies the protein as cysteine synthase A with a protein score of 120. Scores outside the green region (>56) are significant. A score of 120 indicates that there is a high probability that the protein has been correctly identified.

    cysteine_synthase_A.png

    The optimum data set for a peptide mass fingerprint is, of course, all of the correct peptides and none of the wrong ones. By correct, we mean that the textbook enzyme cleavage rules were followed, and only specified modifications are present. Sadly, real life data are generally far from ideal, and it is almost unknown to get every single experimental mass value matching and 100% sequence coverage. However, it is not always recognized that having too many peptide mass values can create similar difficulties to having too few.

    Imagine a tryptic digest of a 20 kDa protein. We would expect something around 20 perfect cleavage peptides. If the digest was incomplete, or there was a non-quantitative modification, we might expect to double the number of peptides observed.

    If 100 peaks are taken from the mass spectrum of this digest and submitted to Mascot then either 60 to 80 peaks are noise or there are extensive non-quantitative modifications. Either possibility is bad news for search specificity. The peaks which cannot be matched correctly will still contribute to the population of random matches.

    The Mowse Scoring Algorithm is described in [Pappin, 1993]. (Reference available on MASCOT database)

    The first stage of a Mowse search is to compare the calculated peptide masses for each entry in the sequence database with the set of experimental data. Each calculated value which falls within a given mass tolerance of an experimental value counts as a match.
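
    The matching step can be sketched as a simple tolerance comparison. This shows only the counting stage described above, not the Mowse weighting or Mascot's scoring. The calculated masses in the example are invented for illustration.

```python
def count_matches(calculated, experimental, tol=1.0):
    """Count calculated masses within `tol` Da of at least one experimental mass."""
    return sum(
        any(abs(calc - exp) <= tol for exp in experimental)
        for calc in calculated
    )


experimental = [1283.741, 1473.801, 1521.775]   # taken from the peak table
calculated = [1283.70, 1474.50, 1600.00]        # hypothetical database peptides

print(count_matches(calculated, experimental, tol=1.0))
print(count_matches(calculated, experimental, tol=0.1))
```

    Tightening the tolerance removes matches that were only close by chance, but it also removes correct matches if the tolerance is smaller than the true mass error of the instrument.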

    Rather than just counting the number of matching peptides, Mowse uses empirically determined factors to assign a statistical weight to each individual peptide match.

    Probability Based Scoring

    Mascot incorporates a probability based implementation of the Mowse algorithm. The Mowse algorithm is an excellent starting point because it accurately models the behavior of a proteolytic enzyme. By casting the Mowse score into a probabilistic framework a simple rule can be used to judge whether a result is significant or not.

    Matches using mass values are always handled on a probabilistic basis. The total score is the absolute probability that the observed match is a random event. Reporting probabilities directly can be confusing because they encompass a very wide range of magnitudes, and also because a "high" score is a "low" probability. For this reason, we report scores as -10*LOG10(P), where P is the absolute probability. A probability of 10^-20 thus becomes a score of 200.
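
    The conversion between probability and score is a one-line calculation in each direction. The function names below are chosen for this sketch.

```python
import math


def score_from_probability(p):
    """Score reported as -10 * log10(P)."""
    return -10 * math.log10(p)


def probability_from_score(score):
    """Invert the score to recover the probability of a random match."""
    return 10 ** (-score / 10)


print(score_from_probability(1e-20))
print(probability_from_score(120))
```

    Every 10 points of score corresponds to a factor of ten in probability, so a score of 120 means the match would arise at random with a probability of about 10^-12.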

    Significance Level

    Given an absolute probability that a match is random, and knowing the size of the sequence database being searched, it becomes possible to provide an objective measure of the significance of a result. A commonly accepted threshold is that an event is significant if it would be expected to occur at random with a frequency of less than 5%. This is the value which is reported on the master results page.

    The master results page for a typical peptide mass fingerprint search reports that “Scores greater than 56 are significant (p<0.05).” A protein score of 120 is a good result because it lies well above this threshold, leaving little room for doubt.
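
    One way to see why the threshold depends on the size of the database is to assume that the 5% chance of a random match is shared among N candidate entries, so that the threshold takes the form -10*LOG10(0.05/N). This form is an assumption made for illustration. The text above does not give the exact calculation Mascot performs, and the values of N below are arbitrary.

```python
import math


def significance_threshold(n_entries, alpha=0.05):
    """Score threshold under the assumed form -10 * log10(alpha / N)."""
    return -10 * math.log10(alpha / n_entries)


# Arbitrary database sizes, chosen only to show the trend.
for n in (5_000, 50_000, 500_000):
    print(n, round(significance_threshold(n), 1))
```

    Under this assumption, every tenfold increase in the number of entries searched raises the score needed for significance by 10, which is one reason a well-chosen taxonomy filter can turn a borderline match into a significant one.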

    After clicking on the protein with the top score, additional information from the search is displayed (as shown in the screen capture below). The molecular weight (34,525 Da) and pI value (5.83) are provided. If the protein was cut from a 2D gel, the position in the gel should correlate with the molecular weight and pI value of the protein identified. The protein sequence coverage was 49%, and 13 of the 18 mass values that were searched matched the protein of interest.

    search_additional_info.png

    Discussion Questions

    1. The protein identified has a very high score; however, less than half of the sequence was matched.

    Why can a protein have a high score even with low sequence coverage?

    2. What are some experimental reasons for low sequence coverage? In other words, why are some peptides not found in the MALDI-TOF data?

    Peptide Mass Mapping for Protein Identification

    In this computer exercise, Mascot search parameters will be varied to explore their effect on protein score. The following MALDI-TOF data for an E. coli protein cut from a 2D gel will initially give a low protein score using default search parameters. The parameters will then be changed in a systematic way to see if a significant protein score can be achieved.

    Open the Excel File: Proteomics Data AST4

    The data is shown here as well.

    MALDI-TOF_data_E.coli.png

    m/z Intensity

    842.589 1962
    864.585 483
    874.496 99
    886.559 93
    936.473 91
    976.436 87
    993.491 1285
    1015.477 93
    1045.643 571
    1067.664 99
    1145.477 125
    1155.598 142
    1202.521 520
    1254.658 153
    1316.769 246
    1333.777 186
    1428.78 1282
    1450.796 69
    1525.82 119
    1675.941 862
    1753.983 1171
    1804.047 307
    1884.98 508
    2013.05 271
    2094.001 147
    2508.378 155
    2530.349 141
    2662.519 162
    2691.481 333
    2807.501 476
    3337.949 307
    3794.978 221
    3809.024 338

    1. Use the following default search parameters to run a Peptide Mass Fingerprint search in Mascot to identify the protein with the file name AST4.

      Database: SwissProt
      Taxonomy: All
      Enzyme: Trypsin
      Missed Cleavages: 1
      Fixed Modification: Carbamidomethyl
      Variable Modification: Oxidation of M
      Protein Mass: leave blank
      Peptide Tolerance: ±1 Da
      Mass Values: MH+
      Monoisotopic
      Report Top 5 Hits

      What protein has the highest score?
      What is its protein score? What score is needed for significance?
      How many mass values were searched?
      How many mass values were matched?
      What was the percent sequence coverage?

    2. Use the same masses and initial search parameters and change the taxonomy to Metazoa (Animals) because the sample was from chicken.

      What protein has the highest score?
      Protein score? What score is needed for significance?
      How many mass values were searched?
      How many mass values were matched?
      What was the percent sequence coverage?
      The score for significance changed between all taxonomies and Metazoa. What do you think is the reason for the change?

    3. Retain the taxonomy as Metazoa and vary the number of missed cleavages.

      Zero missed cleavages:
      What protein has the highest score?
      Protein score? What score is needed for significance?
      How many mass values were searched?
      How many mass values were matched?
      What was the percent sequence coverage?

      Two missed cleavages:
      What protein has the highest score?
      Protein score? What score is needed for significance?
      How many mass values were searched?
      How many mass values were matched?
      What was the percent sequence coverage?

      Summarize how the number of missed cleavages parameter affects protein score. Can you explain why the score changes?

    4. Vary the mass tolerance.

      Repeat the search choosing 1 missed cleavage and a peptide tolerance of 0.5 Da.
      What protein has the highest score?
      Protein score? What score is needed for significance?
      How many mass values were searched?
      How many mass values were matched?
      What was the percent sequence coverage?

      Repeat the search choosing 1 missed cleavage and a peptide tolerance of 0.3 Da.
      What protein has the highest score?
      Protein score? What score is needed for significance?
      How many mass values were searched?
      How many mass values were matched?
      What was the percent sequence coverage?

      Repeat the search choosing 1 missed cleavage and a peptide tolerance of 0.1 Da.
      What protein has the highest score?
      Protein score? What score is needed for significance?
      How many mass values were searched?
      How many mass values were matched?
      What was the percent sequence coverage?

      Summarize how the mass tolerance affects score. Can you explain why the score changes?

    5. Set the number of missed cleavages to 1 and mass tolerance to 0.30 Da. Choose only carbamidomethyl (C) as a fixed modification and no variable modification.

      What protein has the highest score?
      Protein score? What score is needed for significance?
      How many mass values were searched?
      How many mass values were matched?
      What was the percent sequence coverage?

      Summarize how the selection of modifications affects score. Can you explain why the score changed?

    Homework (Peptide Mass Mapping)

    1. To turn in for each protein:
      1. Record the search parameters that were used to generate the highest probability score.
      2. What masses were used and which were discarded? Explain your reasoning.
      3. Record the protein identity, probability score, molecular weight, pI, number of mass values searched and matched, and percent sequence coverage.
      4. Interpret the results.

      ***Note: The reduction and alkylation procedure was not efficient in this set of data. It may help scores not to choose carbamidomethyl as a fixed modification. (Do not select any fixed modifications if you cannot get a significant protein score.)

    2. Imagine that you have performed a 2D gel separation of proteins from healthy and cancerous cells and have identified a protein implicated in the cancerous state by cutting out the spot, digesting with trypsin, and performing peptide mass mapping. What else could you do experimentally to increase your level of certainty that the protein from the gel spot was identified correctly?
    3. The following data were obtained by analytical chemistry students at Indiana University. The students grew E. coli samples at two different temperatures (37°C and 46°C). The cells were lysed, the proteins were isolated, and 2D gel electrophoresis was performed. Based upon differences in the 2D gel pattern between the high and low temperature E. coli samples, one protein spot (from the high temperature sample) was analyzed because it was suspected to be a heat shock protein.

      Change search parameters for the following data set to see if you can achieve a significant score for a heat shock (chaperone) protein.

      Sample: High temp. E. coli gel, between MW band 5 (50 kDa) and MW band 6 (75 kDa)

      m/z Signal

      719.3664 262.6387
      842.4825 4195.775
      855.0103 204.8235
      861.033 273.0225
      864.4673 410.5345
      877.0083 428.2205
      892.9921 220.0347
      1045.548 444.506
      1567.886 1874.742
      1694.763 476.5625
      1707.792 276.5812
      1845.939 1238.335
      1939.982 243.3293
      2402.261 209.4211


    This page titled Section 3D. MASCOT Database Search is shared under a CC BY-NC-SA 4.0 license and was authored, remixed, and/or curated by Contributor.