Section 3D. MASCOT Database Search
The peptide mass fingerprint search in the MASCOT database is used to identify a protein from mass spectrometry data. The following MALDI-TOF mass spectrum shows the masses detected after digesting a protein with trypsin. This protein is from the organism Escherichia coli (E. coli). It was cut from a 2D gel and analyzed by students in Analytical Chemistry at Indiana University.
m/z | S/N |
---|---|
832.312 | 10.3 |
842.53 | 23 |
1045.541 | 3.7 |
1179.58 | 2.3 |
1210.545 | 5.2 |
1247.598 | 5.7 |
1283.741 | 74.7 |
1307.679 | 2.1 |
1401.67 | 3.1 |
1403.711 | 2.3 |
1473.801 | 43.2 |
1521.775 | 37.2 |
1537.764 | 6 |
1618.801 | 7.3 |
1626.907 | 10.4 |
1648.88 | 2.6 |
1811.976 | 53 |
2141.078 | 8.5 |
2391.068 | 2.3 |
2753.496 | 12.3 |
Some peaks in the mass spectrum do not come from peptides from the protein we are trying to identify. Peaks that should be eliminated before performing the Mascot search are:
Trypsin autolysis peaks: Trypsin cleaves the protein of interest after lysine and arginine residues. However, trypsin will also cleave other trypsin molecules. The peaks due to trypsin autolysis are: 514.63, 842.5, 906.05, 1006.15, 1045.12, 1736.97, 1768.99, 2158.48, 2211.4, 2239.1
Matrix Clusters: The MALDI matrix may also combine with Na+ and K+ during ionization. These peaks are 855.1, 861.1, 871.1, 877.1, 1060.1. If the scientist had good lab technique, these ions would be eliminated or greatly reduced in a sample preparation step.
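The clean-up step described above can be sketched in a few lines of Python. This is a minimal illustration rather than part of the MASCOT workflow: the function name `split_contaminants` and the 0.5 Da matching window are assumptions made for the example, while the contaminant masses are the ones listed in the text.

```python
# Contaminant m/z values as listed in the text above.
TRYPSIN_AUTOLYSIS = [514.63, 842.5, 906.05, 1006.15, 1045.12,
                     1736.97, 1768.99, 2158.48, 2211.4, 2239.1]
MATRIX_CLUSTERS = [855.1, 861.1, 871.1, 877.1, 1060.1]


def split_contaminants(peaks, contaminants, tol_da=0.5):
    """Return (kept, removed) lists of m/z values.

    A peak is removed when it lies within tol_da of any contaminant mass.
    tol_da is an assumed matching window, not a value given in the module.
    """
    kept, removed = [], []
    for mz in peaks:
        if any(abs(mz - c) <= tol_da for c in contaminants):
            removed.append(mz)
        else:
            kept.append(mz)
    return kept, removed


# Usage: pass the m/z column of a peak table (first rows of the table above).
example_peaks = [832.312, 842.53, 1045.541, 1179.58]
kept, removed = split_contaminants(example_peaks,
                                   TRYPSIN_AUTOLYSIS + MATRIX_CLUSTERS)
print("kept:", kept)
print("removed:", removed)
```

The window should be at least as wide as the mass error of the instrument. If it is too narrow, a contaminant peak measured slightly off its theoretical mass will slip through.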
1. Examine the m/z data in the previous table and figure. Which peaks should be removed before entering m/z data into the MASCOT database?
Peptide Mass Mapping Search Parameters
Before m/z information is entered into the MASCOT database, a number of search parameters must be set appropriately. The image shows a data entry page for a peptide mass fingerprint search. Let’s examine the meaning of each of the following search parameters: Database, Enzyme, Taxonomy, Fixed Modifications, Variable Modifications, Protein Mass, Peptide Tolerance, Mass Values, Monoisotopic or Average Mass, and Data Input.
The following section on search parameters has been adapted from the Peptide Mass Fingerprint Tutorial in the MASCOT Database.
Database and Taxonomy: The first choice you have to make is which database to search. Some databases contain sequences from a single organism. Others contain entries from multiple organisms, but usually include the taxonomy for each entry, so that entries for a specific organism can be selected during a search using a taxonomy filter.
If your target organism is well characterized, such as human or mouse or yeast, Swiss-Prot is the recommended choice. The entries are all high quality and well annotated. Because Swiss-Prot is non-redundant, it is relatively small, which makes it easier to get a statistically significant match. If you know what is in the sample, you can restrict the search to an organism or family by means of the taxonomy filter, but remember that you can never rule out contaminants.
If you are interested in a bacterium or a plant, you may find that it is poorly represented in Swiss-Prot, and it would be better to try one of the comprehensive protein databases, which aim to include all known protein sequences. The two best known are NCBInr and UniRef100. These are very large databases, and you will almost certainly want to select a limited taxonomy. However, never choose a narrow taxonomy without looking at the counts of entries and understanding the classification. In the current Swiss-Prot, for example, there are 26,139 entries for Rodentia, of which all but 1,602 are for mouse and rat. So, even if your target organism is hamster, it isn’t a good idea to choose ‘other Rodentia’. It is better to search Rodentia and hope to get a match to a homologous protein from mouse or rat.
Enzyme: Choose the enzyme used to digest the protein. Trypsin is commonly used and will be the enzyme utilized in all the data in this module.
Missed Cleavages: The number of missed cleavages refers to the completeness of the enzyme digest. Did the enzyme trypsin cleave after every lysine and arginine residue in the protein? Or were some cleavages missed? The number of allowed missed cleavages should be set empirically, by running a standard and/or trying different values to see which gives the best score.
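To make the idea of missed cleavages concrete, the following sketch performs an in silico tryptic digest. It is a simplified illustration: it cleaves after every K or R, as described above, and does not model refinements such as the common rule that trypsin does not cleave before proline. The function name `tryptic_peptides` and the toy sequence are hypothetical.

```python
def tryptic_peptides(sequence, missed_cleavages=0):
    """List the peptides from cleaving after every K or R.

    Peptides with up to `missed_cleavages` uncut internal sites are
    included by joining adjacent fully cleaved pieces.
    """
    pieces, start = [], 0
    for i, residue in enumerate(sequence):
        if residue in "KR":
            pieces.append(sequence[start:i + 1])
            start = i + 1
    if start < len(sequence):
        pieces.append(sequence[start:])  # C-terminal peptide

    peptides = []
    for n_joined in range(1, missed_cleavages + 2):
        for first in range(len(pieces) - n_joined + 1):
            peptides.append("".join(pieces[first:first + n_joined]))
    return peptides


toy = "MAGKTLRSEKGV"  # hypothetical sequence, not from the data set
print(tryptic_peptides(toy, 0))
print(tryptic_peptides(toy, 1))
```

Each additional allowed missed cleavage enlarges the list of theoretical peptides that the search engine must compare against the experimental masses.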
Modifications in database searching are handled in two ways.
First, there are the fixed modifications. The most common example is the reduction and alkylation of cysteine. This reaction is performed to break disulfide bonds and prevent them from reforming. In the absence of disulfide bonds, the protein will be unfolded and the enzyme will be more effective in digesting the protein. Since all cysteines are modified, this is effectively just a change in the mass of cysteine. It carries no penalty in terms of search speed or specificity.
The alkylation agent used is iodoacetamide (select modification carbamidomethyl). In proteins, the reduced thiol group in cysteine is alkylated with iodoacetamide in the reaction shown:
In contrast, most post-translational modifications do not apply to all instances of a residue. For example, phosphorylation might affect just one serine in a protein containing many serines and threonines. These variable or non-quantitative modifications are expensive in the sense that they increase the search space. This is because the software has to permute out all the possible arrangements of modified and unmodified residues that fit to the peptide molecular mass. As more and more modifications are considered, the number of combinations and permutations increases geometrically, and we get a so-called combinatorial explosion.
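The difference in cost between fixed and variable modifications can be sketched as follows, using carbamidomethylation of cysteine as the fixed modification and oxidation of methionine as a variable one. The mass shifts are the monoisotopic values commonly tabulated for these modifications (they are not given in this module), and the function name, peptide sequence, and starting mass are hypothetical.

```python
from itertools import product

CARBAMIDOMETHYL_C = 57.021464  # Da, added to every Cys (assumed tabulated value)
OXIDATION_M = 15.994915        # Da, added to an oxidized Met (assumed tabulated value)


def modified_masses(unmodified_mass, sequence):
    """Return (number of arrangements, sorted distinct masses) for one peptide.

    The fixed modification shifts the mass once per cysteine, so it adds no
    extra candidates. Each methionine may or may not be oxidized, giving
    2**n arrangements for n methionines.
    """
    fixed = unmodified_mass + CARBAMIDOMETHYL_C * sequence.count("C")
    n_met = sequence.count("M")
    arrangements = list(product([0, 1], repeat=n_met))
    masses = {round(fixed + OXIDATION_M * sum(a), 4) for a in arrangements}
    return len(arrangements), sorted(masses)


# Hypothetical peptide with two Met and one Cys; 1000.0 Da is a placeholder mass.
n_arrangements, masses = modified_masses(1000.0, "MACMK")
print(n_arrangements, masses)
```

With several variable modifications selected at once, the arrangements for each one multiply together, which is the combinatorial growth described above.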
One common variable modification is the oxidation of methionine shown:
Protein Mass: If the protein mass is known from its position in a 2D gel, this value can be entered. Usually, this adds little to the score, and the general advice is to leave this field blank.
Peptide Tolerance: Making an estimate of the mass accuracy doesn’t have to be a guessing game. The Mascot Protein View report includes graphs of mass errors.
One way to evaluate the mass accuracy of the mass spectrometer is to run a standard and look at the error graphs for the correct match. Another method of evaluating mass accuracy is to compare the experimental value of a trypsin autolysis peak with the theoretical value.
In the data set provided, one trypsin autolysis peak had a measured mass of 1045.54, and the theoretical mass is 1045.12. The measurement indicates that the mass spectrometer has a mass error of approximately 0.42 Da.
(Note: Da is the same as amu).
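The trypsin autolysis comparison above is simple arithmetic, sketched here with the two masses quoted in the text. Expressing the error in parts per million as well is an addition for illustration (ppm is another common unit for mass tolerance), and the function name `mass_error` is hypothetical.

```python
def mass_error(measured, theoretical):
    """Return (error in Da, error in ppm) for one reference peak."""
    error_da = measured - theoretical
    error_ppm = error_da / theoretical * 1e6
    return error_da, error_ppm


error_da, error_ppm = mass_error(1045.54, 1045.12)
print(f"{error_da:.2f} Da, {error_ppm:.0f} ppm")
```

An error of 0.42 Da at a mass near 1045 corresponds to roughly 400 ppm, so a peptide tolerance much tighter than this would be expected to reject correct matches for this data set.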
Mass values: Most frequently MALDI produces the singly charged molecular ion (MH+). Your peak list will only contain Mr values (relative molecular mass) if the peak picking software has ‘de-charged’ the measured m/z values. Peak picking software may be programmed to do this because the data contained a mixture of charge states.
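The relationship between a measured m/z value and the neutral relative molecular mass Mr can be written as a short helper. This is a sketch under the usual assumption that each charge comes from one added proton. The proton mass constant is a standard value, not one given in this module, and the function name is hypothetical.

```python
PROTON_MASS = 1.007276  # Da, mass of a proton (standard value, not from the module)


def mz_to_mr(mz, charge=1):
    """Convert a measured m/z of an [M + zH]z+ ion to the neutral mass Mr."""
    return mz * charge - charge * PROTON_MASS


# For a singly charged MALDI ion (MH+), Mr is the m/z minus one proton mass.
print(round(mz_to_mr(1283.741), 3))
```

Selecting MH+ in the search form tells the search engine to apply this correction itself, so the measured m/z values can be entered unchanged.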
Most modern instruments produce monoisotopic mass values. You will only have average masses if the entire isotope distribution has been centroided into a single peak, which usually implies very low resolution.
The following MALDI-TOF mass spectrum of a protein digest zooms in on the mass region of different peptides near m/z 1500. The isotope distribution in a peptide with m/z 1515.7 is shown. The natural abundance of carbon-12 is 98.90% and carbon-13 is 1.10%. Therefore, peptides with a large number of carbon atoms will contain significant contributions to the M+1 peak and M+2 peak from carbon-13 atoms. The monoisotopic peptide contains all carbon-12 atoms. The M+1 peak has one carbon-13 atom and the M+2 peak has two carbon-13 atoms.
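The carbon-13 contribution to the isotope pattern can be estimated with a binomial model, sketched below. It considers carbon only and ignores the heavy isotopes of hydrogen, nitrogen, oxygen, and sulfur, so it is a rough approximation. The carbon count of 67 is an assumed, illustrative value for a peptide of about 1.5 kDa, not a number taken from the spectrum.

```python
from math import comb

P_C13 = 0.0110  # natural abundance of carbon-13 quoted in the text


def carbon_isotope_fractions(n_carbons, max_extra=2):
    """Binomial probability of a peptide containing 0, 1, ... carbon-13 atoms.

    Index 0 is the monoisotopic species (all carbon-12), index 1 the M+1
    contribution from one carbon-13, index 2 the M+2 contribution from two.
    """
    return [comb(n_carbons, k) * P_C13**k * (1 - P_C13)**(n_carbons - k)
            for k in range(max_extra + 1)]


# n_carbons=67 is an assumption made for illustration.
for k, fraction in enumerate(carbon_isotope_fractions(67)):
    print(f"M+{k}: {fraction:.3f}")
```

As the number of carbon atoms grows, the monoisotopic fraction falls and the M+1 and M+2 fractions rise, which is why the isotope envelope broadens for larger peptides.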
Data Input: The first requirement for a Peptide Mass Fingerprint (PMF) search is a peak list (a list of m/z values). Peak lists are text files and come in various different formats. You can also copy and paste a list of values into the query area of the search form, or even type them in. Each m/z value goes on a separate line. If you also have an intensity value for the peak, this follows the m/z value, separated by a space or a tab.
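The pasted format described above (one m/z value per line, optionally followed by an intensity after a space) can be produced from a table with a small helper. This is a minimal sketch, and the function name `format_peak_list` is hypothetical. The example values are two rows from the AST4 data table later in this section.

```python
def format_peak_list(peaks):
    """Format (m/z, intensity) pairs as one peak per line.

    Use None for the intensity when only the m/z value is known.
    """
    lines = []
    for mz, intensity in peaks:
        if intensity is None:
            lines.append(f"{mz}")
        else:
            lines.append(f"{mz} {intensity}")
    return "\n".join(lines)


print(format_peak_list([(993.491, 1285), (1428.78, 1282)]))
```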
1. You are analyzing a protein from E. coli.
a. What is the advantage of setting the taxonomy to E. coli?
b. What is a disadvantage of setting the taxonomy to E. coli instead of a more general class of bacteria to which E. coli belongs (Proteobacteria)?
2. a. What is meant by the search parameter “missed cleavages”?
b. How will one missed cleavage affect the number of peptides created after digestion with trypsin?
3. A common fixed modification is carbamidomethyl. Why is a protein chemically modified in this way?
4. Briefly describe one method for determining the peptide tolerance (or the mass accuracy) of the mass spectrometer.
Performing a Peptide Mass Mapping Search: Now that you understand the various search parameters, you are ready to perform a peptide mass fingerprint search in MASCOT.
- Go to www.matrixscience.com and choose “Mascot search database”, “Peptide mass fingerprint”, and “Perform search”.
- A good set of search parameters to start with is:
Database: SwissProt
Taxonomy: Escherichia coli
Enzyme: Trypsin
Missed Cleavages: 1
Fixed Modification: Carbamidomethyl
Variable Modification: Oxidation of M
Protein Mass: leave blank
Peptide Tolerance: ±1 Da
Mass Values: MH+
Monoisotopic
Report Top 5 Hits
- We will start with the MALDI-TOF data for the protein from E. coli cut from a 2D gel. Copy and paste the m/z values in the table. Don’t forget to remove the trypsin autolysis peaks and matrix clusters from the data set.
m/z: 832.312 842.53 1045.541 1179.58 1210.545 1247.598 1283.741 1307.679 1401.67 1403.711 1473.801 1521.775 1537.764 1618.801 1626.907 1648.88 1811.976 2141.078 2391.068 2753.496
- Record the search results.
What protein has the highest score?
What is its protein score? What score is needed for significance?
Click on the identity of the protein for information about sequence and which peptide masses were found experimentally.
How many mass values were searched?
How many mass values were matched?
What was the percent sequence coverage?
Instructor Note: Students can run the search on computers and then compare with the results shown here.
Results Summary: The first results screen identifies the protein as cysteine synthase A with a protein score of 120. Scores outside the green region (>56) are significant. A score of 120 indicates that there is a high probability that the protein has been correctly identified.
The optimum data set for a peptide mass fingerprint is, of course, all of the correct peptides and none of the wrong ones. By correct, we mean that the textbook enzyme cleavage rules were followed, and only specified modifications are present. Sadly, real life data are generally far from ideal, and it is almost unknown to get every single experimental mass value matching and 100% sequence coverage. However, it is not always recognized that having too many peptide mass values can create similar difficulties to having too few.
Imagine a tryptic digest of a 20 kDa protein. We would expect something around 20 perfect cleavage peptides. If the digest was incomplete, or there was a non-quantitative modification, we might expect to double the number of peptides observed.
If 100 peaks are taken from the mass spectrum of this digest and submitted to Mascot then either 60 to 80 peaks are noise or there are extensive non-quantitative modifications. Either possibility is bad news for search specificity. The peaks which cannot be matched correctly will still contribute to the population of random matches.
The Mowse Scoring Algorithm is described in [Pappin, 1993]. (Reference available on MASCOT database)
The first stage of a Mowse search is to compare the calculated peptide masses for each entry in the sequence database with the set of experimental data. Each calculated value which falls within a given mass tolerance of an experimental value counts as a match.
Rather than just counting the number of matching peptides, Mowse uses empirically determined factors to assign a statistical weight to each individual peptide match.
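The first stage described above, counting calculated peptide masses that fall within the tolerance of an experimental value, can be sketched as follows. This shows only the counting step. The empirically determined weights that Mowse assigns to each match are not reproduced here, and the function name and the example masses are hypothetical.

```python
def count_matches(calculated, experimental, tol_da=1.0):
    """Count calculated peptide masses within tol_da of any experimental mass."""
    return sum(
        any(abs(calc - exp) <= tol_da for exp in experimental)
        for calc in calculated
    )


# Hypothetical masses chosen to show the effect of the tolerance window.
calculated = [1000.0, 1500.4, 2000.0]
experimental = [1000.3, 1500.0, 1750.0]
print(count_matches(calculated, experimental, tol_da=0.5))
print(count_matches(calculated, experimental, tol_da=0.1))
```

A wide tolerance admits more matches for the correct protein but also more chance matches for every other entry in the database. A tolerance narrower than the real mass error discards correct matches.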
Probability Based Scoring
Mascot incorporates a probability based implementation of the Mowse algorithm. The Mowse algorithm is an excellent starting point because it accurately models the behavior of a proteolytic enzyme. By casting the Mowse score into a probabilistic framework a simple rule can be used to judge whether a result is significant or not.
Matches using mass values are always handled on a probabilistic basis. The total score is the absolute probability that the observed match is a random event. Reporting probabilities directly can be confusing because they encompass a very wide range of magnitudes, and also because a "high" score is a "low" probability. For this reason, we report scores as -10*LOG10(P), where P is the absolute probability. A probability of 10^-20 thus becomes a score of 200.
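The conversion between probability and score follows directly from the formula -10*LOG10(P). The two helper functions below are a minimal sketch with hypothetical names.

```python
import math


def probability_to_score(p):
    """Score = -10 * log10(P), where P is the probability of a random match."""
    return -10 * math.log10(p)


def score_to_probability(score):
    """Invert the score formula: P = 10 ** (-score / 10)."""
    return 10 ** (-score / 10)


print(probability_to_score(1e-20))  # the example from the text, a score of 200
print(score_to_probability(120))    # probability behind a score of 120
```

Because the scale is logarithmic, every increase of 10 in the score corresponds to a tenfold decrease in the probability that the match is random.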
Significance Level
Given an absolute probability that a match is random, and knowing the size of the sequence database being searched, it becomes possible to provide an objective measure of the significance of a result. A commonly accepted threshold is that an event is significant if it would be expected to occur at random with a frequency of less than 5%. This is the value which is reported on the master results page.
The master results page for a typical peptide mass fingerprint search reports that “Scores greater than 56 are significant (p<0.05).” The protein with the score of 120 is a nice result because the highest score is highly significant, leaving little room for doubt.
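One way to see how a significance threshold can depend on database size is the sketch below. It assumes, for illustration only, that the threshold is the score at which the expected frequency of a random match across all searched entries equals 5%, that is, a per-entry probability of 0.05 divided by the number of entries. The exact calculation used by Mascot is not given in this module, and the function name and database sizes are hypothetical.

```python
import math


def significance_threshold(n_entries, alpha=0.05):
    """Score at which a random match is expected with frequency alpha.

    Assumes threshold = -10 * log10(alpha / n_entries). This is an
    illustrative model, not a documented Mascot formula.
    """
    return -10 * math.log10(alpha / n_entries)


for n_entries in (20_000, 500_000):  # hypothetical numbers of searched entries
    print(n_entries, round(significance_threshold(n_entries), 1))
```

Under this assumption, searching more entries raises the score required for significance, because there are more opportunities for a chance match.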
After clicking on the protein with the top score, additional information from the search is displayed (as shown in the screen capture below). The molecular weight (34,525 Da) and pI value (5.83) are provided. If the protein was cut from a 2D gel, the position in the gel should correlate with the molecular weight and pI value of the protein identified. The protein sequence coverage was 49%, and 13 of the 18 mass values that were searched matched the protein of interest.
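Percent sequence coverage is the fraction of residues in the protein that fall inside at least one matched peptide. The sketch below shows the calculation for a list of matched residue ranges. The function name, the protein length, and the ranges are hypothetical and are not the values from the search result.

```python
def sequence_coverage(protein_length, matched_ranges):
    """Percent of residues covered by matched peptides.

    matched_ranges holds (start, end) residue positions, 1-based and
    inclusive. Overlapping peptides are counted only once.
    """
    covered = set()
    for start, end in matched_ranges:
        covered.update(range(start, end + 1))
    return 100 * len(covered) / protein_length


# Hypothetical 100-residue protein with three matched peptides, two overlapping.
print(sequence_coverage(100, [(1, 10), (5, 20), (91, 100)]))
```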
1. The protein identified has a very high score; however, less than half of the sequence was matched.
Why can a protein have a high score even with low sequence coverage?
2. What are some experimental reasons for low sequence coverage? In other words, why are some peptides not found in the MALDI-TOF data?
Peptide Mass Mapping for Protein Identification
In this computer exercise, Mascot search parameters will be varied to explore their effect on protein score. The following MALDI-TOF data for a protein cut from a 2D gel will initially give a low protein score using default search parameters. The parameters will then be changed in a systematic way to see if a significant protein score can be achieved.
Open the Excel File: Proteomics Data AST4
The data is shown here as well.
m/z | Intensity |
---|---|
842.589 | 1962 |
864.585 | 483 |
874.496 | 99 |
886.559 | 93 |
936.473 | 91 |
976.436 | 87 |
993.491 | 1285 |
1015.477 | 93 |
1045.643 | 571 |
1067.664 | 99 |
1145.477 | 125 |
1155.598 | 142 |
1202.521 | 520 |
1254.658 | 153 |
1316.769 | 246 |
1333.777 | 186 |
1428.78 | 1282 |
1450.796 | 69 |
1525.82 | 119 |
1675.941 | 862 |
1753.983 | 1171 |
1804.047 | 307 |
1884.98 | 508 |
2013.05 | 271 |
2094.001 | 147 |
2508.378 | 155 |
2530.349 | 141 |
2662.519 | 162 |
2691.481 | 333 |
2807.501 | 476 |
3337.949 | 307 |
3794.978 | 221 |
3809.024 | 338 |
- Use the following default search parameters to run a Peptide Mass Fingerprint search in Mascot to identify the protein with the file name AST4.
Database: SwissProt
Taxonomy: All
Enzyme: Trypsin
Missed Cleavages: 1
Fixed Modification: Carbamidomethyl
Variable Modification: Oxidation of M
Protein Mass: leave blank
Peptide Tolerance: ±1 Da
Mass Values: MH+
Monoisotopic
Report Top 5 Hits
What protein has the highest score?
What is its protein score? What score is needed for significance?
How many mass values were searched?
How many mass values were matched?
What was the percent sequence coverage?
- Use the same masses and initial search parameters, but change the taxonomy to Metazoa (Animals) because the sample was from chicken.
What protein has the highest score?
Protein score? What score is needed for significance?
How many mass values were searched?
How many mass values were matched?
What was the percent sequence coverage?
The score for significance changed between all taxonomies and Metazoa. What do you think is the reason for the change?
- Retain the taxonomy as Metazoa and vary the number of missed cleavages.
Zero missed cleavages:
What protein has the highest score?
Protein score? What score is needed for significance?
How many mass values were searched?
How many mass values were matched?
What was the percent sequence coverage?
Two missed cleavages:
What protein has the highest score?
Protein score? What score is needed for significance?
How many mass values were searched?
How many mass values were matched?
What was the percent sequence coverage?
Summarize how the number of missed cleavages parameter affects protein score. Can you explain why the score changes?
- Vary the mass tolerance.
Repeat the search choosing 1 missed cleavage and a peptide tolerance of 0.5 Da.
What protein has the highest score?
Protein score? What score is needed for significance?
How many mass values were searched?
How many mass values were matched?
What was the percent sequence coverage?
Repeat the search choosing 1 missed cleavage and a peptide tolerance of 0.3 Da.
What protein has the highest score?
Protein score? What score is needed for significance?
How many mass values were searched?
How many mass values were matched?
What was the percent sequence coverage?
Repeat the search choosing 1 missed cleavage and a peptide tolerance of 0.1 Da.
What protein has the highest score?
Protein score? What score is needed for significance?
How many mass values were searched?
How many mass values were matched?
What was the percent sequence coverage?
Summarize how the mass tolerance affects score. Can you explain why the score changes?
- Set the number of missed cleavages to 1 and mass tolerance to 0.30 Da. Choose only carbamidomethyl (C) as a fixed modification and no variable modification.
What protein has the highest score?
Protein score? What score is needed for significance?
How many mass values were searched?
How many mass values were matched?
What was the percent sequence coverage?
Summarize how the selection of modifications affects score. Can you explain why the score changed?
Homework (Peptide Mass Mapping)
- To turn in for each protein:
- Record the search parameters that were used to generate the highest probability score.
- What masses were used and which were discarded? Explain your reasoning.
- Record the protein identity, probability score, molecular weight, pI, number of mass values searched and matched, and percent sequence coverage.
- Interpret the results.
***Note: The reduction and alkylation procedure was not efficient in this set of data. It may help scores not to choose carbamidomethyl as a fixed modification. (Do not select any fixed modifications if you cannot get a significant protein score.)
- Imagine that you have performed a 2D gel separation of proteins from healthy and cancerous cells and have identified a protein implicated in the cancerous state by cutting out the spot, digesting with trypsin, and performing peptide mass mapping. What else could you do experimentally to increase your level of certainty that the protein from the gel spot was identified correctly?
- The following data were obtained by analytical chemistry students at Indiana University. The students grew E. coli samples at two different temperatures (37°C and 46°C). The cells were lysed, the proteins were isolated, and 2D gel electrophoresis was performed. Based upon differences in the 2D gel pattern between the high- and low-temperature E. coli samples, one protein spot (from the high-temperature sample) was analyzed because it was suspected to be a heat shock protein.
Change the search parameters for the following data set to see if you can achieve a significant score for a heat shock (chaperone) protein.
Sample: High temp. E. coli gel, between MW band 5 (50 kDa) and MW band 6 (75 kDa)
m/z | Signal |
---|---|
719.3664 | 262.6387 |
842.4825 | 4195.775 |
855.0103 | 204.8235 |
861.033 | 273.0225 |
864.4673 | 410.5345 |
877.0083 | 428.2205 |
892.9921 | 220.0347 |
1045.548 | 444.506 |
1567.886 | 1874.742 |
1694.763 | 476.5625 |
1707.792 | 276.5812 |
1845.939 | 1238.335 |
1939.982 | 243.3293 |
2402.261 | 209.4211 |