Advanced protein methods & techniques in biochemistry 9788132346005, 8132346009, 1283507587, 9781283507585

This book explains advanced protein methods and techniques in biochemistry.


English Pages 95 Year 2012


First Edition, 2012

ISBN 978-81-323-4600-5

© All rights reserved.

Published by: The English Press 4735/22 Prakashdeep Bldg, Ansari Road, Darya Ganj, Delhi - 110002 Email: [email protected] 

Table of Contents

Introduction
Chapter 1 - Immunostaining
Chapter 2 - Immunoprecipitation
Chapter 3 - Immunoelectrophoresis and Western Blot
Chapter 4 - Enzyme Assay
Chapter 5 - Protein Nuclear Magnetic Resonance Spectroscopy
Chapter 6 - Protein Structure Prediction and Protein Sequencing
Chapter 7 - Proteomics
Chapter 8 - Structural Alignment
Chapter 9 - Protein Biosynthesis and Peptide Mass Fingerprinting

Introduction

Protein methods are the techniques used to study proteins. There are genetic methods for studying proteins, methods for detecting proteins, methods for isolating and purifying proteins and other methods for characterizing the structure and function of proteins, often requiring that the protein first be purified.

Genetic methods

• Conceptual translation: many proteins are never directly sequenced, but their sequence of amino acids is known by "conceptual translation" of a known mRNA sequence.
• Site-directed mutagenesis allows new variants of proteins to be produced and tested for how structural changes alter protein function.
  o Insertion of protein tags such as the His-tag.
• Evolutionary analysis of sequence changes in different species using software such as BLAST.
• Proteins that are involved in human diseases can be identified by matching alleles to disease and other phenotypes using methods such as calculation of LOD scores.
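Conceptual translation, as described in the list above, can be illustrated with a short Python sketch. It reads an mRNA string codon by codon using the standard genetic code, starting at the first AUG and stopping at the first in-frame stop codon. The function name translate_mrna and the example sequence are illustrative assumptions, not taken from any particular software package, and real tools also handle alternative reading frames and ambiguous bases.

```python
from itertools import product

# Standard genetic code with bases ordered U, C, A, G; '*' marks a stop codon.
BASES = "UCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate_mrna(mrna: str) -> str:
    """Translate from the first AUG to the first in-frame stop codon.

    Hypothetical helper for illustration only. DNA-style input (T instead
    of U) is accepted; any other unexpected character raises KeyError.
    """
    mrna = mrna.upper().replace("T", "U")
    start = mrna.find("AUG")
    if start == -1:
        return ""  # no start codon found
    protein = []
    for i in range(start, len(mrna) - 2, 3):
        amino_acid = CODON_TABLE[mrna[i:i + 3]]
        if amino_acid == "*":
            break
        protein.append(amino_acid)
    return "".join(protein)


# Made-up example sequence: two leader bases, then an open reading frame.
print(translate_mrna("GGAUGGCCAUUGUAAUGGGCCGCUGAAAG"))  # MAIVMGR
```

Building the codon table from the 64-letter amino acid string keeps the sketch short; a hand-written dictionary would work equally well.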

Detecting proteins

• Microscopy, protein immunostaining
• Protein immunoprecipitation
• Immunoelectrophoresis
• Immunoblotting
• BCA Protein Assay
• Western blot
• Spectrophotometry
• Enzyme assay

Protein purification

• Protein Isolation
  o chromatography methods
• Protein Extraction and Solubilization
• Protein Concentration Determination Methods, Bradford protein assay
• Concentrating Protein Solutions
• Gel electrophoresis
  o Gel Electrophoresis Under denaturing conditions
  o Gel Electrophoresis Under non-denaturing conditions
  o 2D Gel Electrophoresis
• Electrofocusing

Protein structures

• X-ray crystallography
• Protein NMR

Protein-DNA interactions

• ChIP-on-chip
• ChIP-Sequencing
• DamID
• Microscale Thermophoresis

Other methods

• Hydrogen-deuterium exchange
• Mass spectrometry
• Molecular dynamics
• Protein structure prediction
• Protein sequencing
• Protein structural alignment
• Protein ontology
• Protein synthesis
• Proteomics
• Peptide mass fingerprinting
• Ligand binding assay
• Eastern blotting
• Metabolic labeling
  o heavy isotope labeling
  o radioactive isotope labeling

Chapter- 1

Immunostaining

Micrograph of an immunostained section of a brain tumour. GFAP immunostain.

Immunostaining is a general term in biochemistry that applies to any use of an antibody-based method to detect a specific protein in a sample. The term immunostaining was originally used to refer to the immunohistochemical staining of tissue sections, as first described by Albert Coons in 1941. Now, however, immunostaining encompasses a broad range of techniques used in histology, cell biology, and molecular biology that utilise antibody-based staining methods.

Immunostaining Techniques

Immunohistochemistry

Immunohistochemistry labels individual proteins, such as TH (green) in the axons of sympathetic autonomic neurons.

Immunohistochemistry or IHC refers to the process of detecting antigens (e.g., proteins) in cells of a tissue section by exploiting the principle of antibodies binding specifically to antigens in biological tissues. IHC takes its name from the roots "immuno," in reference to antibodies used in the procedure, and "histo," meaning tissue (compare to immunocytochemistry). Immunohistochemical staining is widely used in the diagnosis of abnormal cells such as those found in cancerous tumors. Specific molecular markers are characteristic of particular cellular events such as proliferation or cell death (apoptosis). IHC is also widely used in basic research to understand the distribution and localization of biomarkers and differentially expressed proteins in different parts of a biological tissue.

Visualising an antibody-antigen interaction can be accomplished in a number of ways. In the most common instance, an antibody is conjugated to an enzyme, such as peroxidase, that can catalyse a colour-producing reaction. Alternatively, the antibody can be tagged with a fluorophore, such as fluorescein or rhodamine.

Sample preparation

While using the right antibodies to target the correct antigens and amplify the signal is important for visualization, complete preparation of the sample is critical to maintain cell morphology, tissue architecture and the antigenicity of target epitopes. This requires proper tissue collection, fixation and sectioning. Depending on the purpose and the thickness of the experimental sample, either thin (about 4-40 μm) sections are sliced from the tissue of interest, or, if the tissue is not very thick and is penetrable, it is used whole. The slicing is usually accomplished with a microtome, and slices are mounted on slides. "Free-floating IHC" uses slices that are not mounted; these slices are normally produced using a vibrating microtome.

Depending on the method of fixation and tissue preservation, the sample may require additional steps to make the epitopes available for antibody binding, including deparaffinization and antigen retrieval; these steps often make the difference between staining and no staining. Additionally, depending on the tissue type and the method of antigen detection, endogenous biotin or enzymes may need to be blocked or quenched, respectively, prior to antibody staining. Unlike immunocytochemistry, the tissue does not need to be permeabilized because this has already been accomplished by the microtome blade during sample preparation. Detergents like Triton X-100 are generally used in immunohistochemistry to reduce surface tension, allowing less reagent to be used to achieve better and more even coverage of the sample.

Although antibodies show preferential avidity for specific epitopes, they may partially or weakly bind to sites on nonspecific proteins (also called reactive sites) that are similar to the cognate binding sites on the target antigen. In the context of antibody-mediated antigen detection, nonspecific binding causes high background staining that can mask the detection of the target antigen. To reduce background staining in IHC, ICC and any other immunostaining application, the samples are incubated with a buffer that blocks the reactive sites to which the primary or secondary antibodies may otherwise bind. Common blocking buffers include normal serum, non-fat dry milk, BSA or gelatin, and commercial blocking buffers with proprietary formulations are available for greater efficiency.

Sample Labeling

Antibody types

The antibodies used for specific detection can be polyclonal or monoclonal. Polyclonal antibodies are made by injecting animals with peptide antigen and, after a secondary immune response is stimulated, isolating antibodies from whole serum. Thus, polyclonal antibodies are a heterogeneous mix of antibodies that recognize several epitopes. Monoclonal antibodies show specificity for a single epitope and are therefore considered more specific to the target antigen than polyclonal antibodies.

For IHC detection strategies, antibodies are classified as primary or secondary reagents. Primary antibodies are raised against an antigen of interest and are typically unconjugated (unlabelled), while secondary antibodies are raised against immunoglobulins of the primary antibody species. The secondary antibody is usually conjugated to a linker molecule, such as biotin, that then recruits reporter molecules, or the secondary antibody is directly bound to the reporter molecule itself.

IHC reporters

Reporter molecules vary based on the nature of the detection method; the most popular methods are enzyme-mediated chromogenic detection and fluorophore-mediated fluorescent detection. With chromogenic reporters, an enzyme label is reacted with a substrate to yield an intensely colored product that can be analyzed with an ordinary light microscope. While the list of enzyme substrates is extensive, alkaline phosphatase (AP) and horseradish peroxidase (HRP) are the two enzymes used most extensively as labels for protein detection. An array of chromogenic, fluorogenic and chemiluminescent substrates is available for use with either enzyme, including DAB or BCIP/NBT, which produce a brown or purple staining, respectively, wherever the enzymes are bound. Reaction with DAB can be enhanced using nickel, producing a deep purple/black staining.

Fluorescent reporters are small, organic molecules used for IHC detection and traditionally include FITC, TRITC and AMCA, while commercial derivatives, including the Alexa Fluors and Dylight Fluors, show similarly enhanced performance but vary in price. For chromogenic and fluorescent detection methods, densitometric analysis of the signal can provide semi- and fully-quantitative data, respectively, to correlate the level of reporter signal to the level of protein expression or localization.

The direct method of immunohistochemical staining uses one labelled antibody, which binds directly to the antigen being stained for.

The indirect method of immunohistochemical staining uses one antibody against the antigen being probed for, and a second, labelled, antibody against the first.

Target antigen detection methods

The direct method is a one-step staining method and involves a labeled antibody (e.g. FITC-conjugated antiserum) reacting directly with the antigen in tissue sections. While this technique uses only one antibody and is therefore simple and rapid, its sensitivity is lower because there is little signal amplification, in contrast to indirect methods, and it is less commonly used than indirect methods.

The indirect method involves an unlabeled primary antibody (first layer) that binds to the target antigen in the tissue and a labeled secondary antibody (second layer) that reacts with the primary antibody. As mentioned above, the secondary antibody must be raised against the IgG of the animal species in which the primary antibody has been raised. This method is more sensitive than direct detection strategies because of signal amplification due to the binding of several secondary antibodies to each primary antibody if the secondary antibody is conjugated to the fluorescent or enzyme reporter.

Further amplification can be achieved if the secondary antibody is conjugated to several biotin molecules, which can recruit complexes of avidin-, streptavidin- or NeutrAvidin protein-bound enzyme. The difference between these three biotin-binding proteins is their individual binding affinity to endogenous tissue targets, which leads to nonspecific binding and high background; the ranking of these proteins based on their nonspecific binding affinities, from highest to lowest, is: 1) avidin, 2) streptavidin and 3) NeutrAvidin protein.

The indirect method, aside from its greater sensitivity, also has the advantage that only a relatively small number of standard conjugated (labeled) secondary antibodies needs to be generated. For example, a labeled secondary antibody raised against rabbit IgG, which can be purchased "off the shelf," is useful with any primary antibody raised in rabbit. With the direct method, it would be necessary to label each primary antibody for every antigen of interest.

Counterstains

After immunohistochemical staining of the target antigen, a second stain is often applied to provide contrast that helps the primary stain stand out. Many of these stains show specificity for discrete cellular compartments or antigens, while others will stain the whole cell. Both chromogenic and fluorescent dyes are available for IHC, providing a wide array of reagents to fit most experimental designs; hematoxylin, Hoechst stain and DAPI are commonly used.

IHC Troubleshooting

In immunohistochemical techniques, there are several steps prior to the final staining of the tissue antigen, and many potential problems affect the outcome of the procedure. The major problem areas in IHC staining include strong background staining, weak target antigen staining and autofluorescence. Endogenous biotin or reporter enzymes or primary/secondary antibody cross-reactivity are common causes of strong background staining, while weak staining may be caused by poor enzyme activity or primary antibody potency. Furthermore, autofluorescence may be due to the nature of the tissue or the fixation method. These aspects of IHC tissue prep and antibody staining must be systematically addressed to identify and overcome staining issues.

Diagnostic IHC markers

Immunohistochemical staining of normal kidney with CD10

IHC is an excellent detection technique and has the tremendous advantage of being able to show exactly where a given protein is located within the tissue examined. It is also an effective way to examine the tissues. This has made it a widely-used technique in the neurosciences, enabling researchers to examine protein expression within specific brain structures. Its major disadvantage is that, unlike immunoblotting techniques where staining is checked against a molecular weight ladder, it is impossible to show in IHC that the staining corresponds with the protein of interest. For this reason, primary antibodies must be well-validated in a Western Blot or similar procedure. The technique is even more widely used in diagnostic surgical pathology for typing tumors (e.g. immunostaining for E-cadherin to differentiate between DCIS (ductal carcinoma in situ: stains positive) and LCIS (lobular carcinoma in situ: does not stain positive)).

• Carcinoembryonic antigen (CEA): used for identification of adenocarcinomas. Not specific for site.
• Cytokeratins: used for identification of carcinomas but may also be expressed in some sarcomas.
• CD15 and CD30: used for Hodgkin's disease
• Alpha fetoprotein: for yolk sac tumors and hepatocellular carcinoma
• CD117 (KIT): for gastrointestinal stromal tumors (GIST)
• CD10 (CALLA): for renal cell carcinoma and acute lymphoblastic leukemia
• Prostate specific antigen (PSA): for prostate cancer
• Estrogens and progesterone staining for tumour identification
• Identification of B-cell lymphomas using CD20
• Identification of T-cell lymphomas using CD3

Directing therapy

A variety of molecular pathways are altered in cancer and some of the alterations can be targeted in cancer therapy. Immunohistochemistry can be used to assess which tumors are likely to respond to therapy, by detecting the presence or elevated levels of the molecular target.

Chemical inhibitors

Tumor biology allows for a number of potential intracellular targets. Many tumors are hormone dependent. The presence of hormone receptors can be used to determine if a tumor is potentially responsive to antihormonal therapy. One of the first therapies was the antiestrogen, tamoxifen, used to treat breast cancer. Such hormone receptors can be detected by immunohistochemistry. Imatinib, an intracellular tyrosine kinase inhibitor, was developed to treat chronic myelogenous leukemia, a disease characterized by the formation of a specific abnormal tyrosine kinase. Imatinib has proven effective in tumors that express other tyrosine kinases, most notably KIT. Most gastrointestinal stromal tumors express KIT, which can be detected by immunohistochemistry.

Monoclonal antibodies

Many proteins shown to be highly upregulated in pathological states by immunohistochemistry are potential targets for therapies utilising monoclonal antibodies. Monoclonal antibodies, due to their size, are utilized against cell surface targets. Among the overexpressed targets are the members of the epidermal growth factor receptor (EGFR) family, transmembrane proteins with an extracellular receptor domain regulating an intracellular tyrosine kinase. Of these, HER2/neu (also known as Erb-B2) was the first to be developed. The molecule is highly expressed in a variety of cancer cell types, most notably breast cancer. As such, antibodies against HER2/neu have been FDA approved for clinical treatment of cancer under the drug name Herceptin. Commercially available immunohistochemical tests include the Dako HercepTest and Ventana Pathway. Similarly, EGFR (HER-1) is overexpressed in a variety of cancers including head and neck and colon. Immunohistochemistry is used to determine patients who may benefit from therapeutic antibodies such as Erbitux (cetuximab). Commercial systems to detect EGFR by immunohistochemistry include the Dako pharmDx.

Flow cytometry

A flow cytometer can be used for the direct analysis of cells expressing one or more specific proteins. Cells are immunostained in solution using methods similar to those used for immunofluorescence, and then analysed by flow cytometry. Flow cytometry has several advantages over IHC, including: the ability to define distinct cell populations by their size and granularity; the capacity to gate out dead cells; improved sensitivity; and multi-colour analysis to measure several antigens simultaneously. However, flow cytometry can be less effective at detecting extremely rare cell populations, and there is a loss of architectural relationships in the absence of a tissue section. Flow cytometry also has a high capital cost associated with the purchase of a flow cytometer.

Western blotting

Western blotting allows the detection of specific proteins from extracts made from cells or tissues, before or after any purification steps. Proteins are generally separated by size using gel electrophoresis before being transferred to a synthetic membrane via dry, semi-dry, or wet blotting methods. The membrane can then be probed with antibodies using methods similar to immunohistochemistry, but without a need for fixation. Detection is typically performed using peroxidase-linked antibodies to catalyse a chemiluminescent reaction. Western blotting is a routine molecular biology method that can be used to semi-quantitatively compare protein levels between extracts. The size separation prior to blotting allows the protein molecular weight to be gauged by comparison with known molecular weight markers.
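The comparison against molecular weight markers can be sketched numerically. The Python snippet below assumes the commonly used approximation that, on a denaturing gel, the logarithm of molecular weight falls roughly linearly with migration distance; it fits a straight line through the marker bands and reads off the unknown band. The function name estimate_mw_kda and all marker values are hypothetical and for illustration only.

```python
import math


def estimate_mw_kda(markers, band_distance_mm):
    """Estimate a band's size from marker bands on the same gel.

    markers: list of (migration distance in mm, known size in kDa).
    Fits log10(size) against distance by ordinary least squares, which
    assumes the log-linear relationship holds over the marker range.
    """
    xs = [distance for distance, _ in markers]
    ys = [math.log10(size) for _, size in markers]
    n = len(markers)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    slope = (
        sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
        / sum((x - mean_x) ** 2 for x in xs)
    )
    intercept = mean_y - slope * mean_x
    return 10 ** (slope * band_distance_mm + intercept)


# Hypothetical ladder: (distance migrated in mm, marker size in kDa).
ladder = [(12.0, 150.0), (20.0, 75.0), (29.0, 37.0), (38.0, 20.0)]
print(round(estimate_mw_kda(ladder, 25.0), 1))
```

The estimate is only as good as the marker fit, and bands that run outside the marker range should be treated with caution.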

Enzyme-linked immunosorbent assay

The enzyme-linked immunosorbent assay or ELISA is a diagnostic method for quantitatively or semi-quantitatively determining protein concentrations from blood plasma, serum or cell/tissue extracts in a multi-well plate format (usually 96 wells per plate). Broadly, proteins in solution are adsorbed to ELISA plates. Antibodies specific for the protein of interest are used to probe the plate. Background is minimised by optimising blocking and washing methods (as for IHC), and specificity is ensured via the presence of positive and negative controls. Detection methods are usually colorimetric or chemiluminescence based.
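One common way to turn colorimetric plate readings into concentrations is to run a series of standards of known concentration alongside the samples. The sketch below is a simplified illustration under that assumption: it interpolates linearly between neighbouring standards, whereas laboratories often fit a sigmoidal curve instead. The function name and all values are hypothetical.

```python
def concentration_from_absorbance(standards, absorbance):
    """Interpolate a sample concentration from a standard curve.

    standards: list of (concentration, absorbance) pairs for known samples.
    Uses piecewise linear interpolation between neighbouring standards;
    readings outside the range of the standards are rejected.
    """
    points = sorted(standards, key=lambda pair: pair[1])
    for (c0, a0), (c1, a1) in zip(points, points[1:]):
        if a0 <= absorbance <= a1:
            if a1 == a0:
                return c0
            return c0 + (absorbance - a0) * (c1 - c0) / (a1 - a0)
    raise ValueError("absorbance outside the range of the standards")


# Hypothetical standards: (concentration in ng/mL, absorbance reading).
curve = [(0.0, 0.05), (10.0, 0.25), (20.0, 0.45), (40.0, 0.80)]
print(round(concentration_from_absorbance(curve, 0.35), 2))  # 15.0
```

Rejecting out-of-range readings rather than extrapolating reflects the usual practice of diluting and re-running samples that fall above the top standard.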

Immuno-electron microscopy

Electron microscopy or EM can be used to study the detailed microarchitecture of tissues or cells. Immuno-EM allows the detection of specific proteins in ultrathin tissue sections. Antibodies labelled with heavy metal particles (e.g. gold) can be directly visualised using transmission electron microscopy. While powerful in detecting the sub-cellular localisation of a protein, immuno-EM can be technically challenging and expensive, and requires rigorous optimisation of tissue fixation and processing methods.

Methodological overview

In immunostaining methods, an antibody is used to detect a specific protein epitope. These antibodies can be monoclonal or polyclonal. Detection of this first or primary antibody can be accomplished in multiple ways.

• The primary antibody can be directly labeled using an enzyme or fluorophore.
• The primary antibody can be labeled using a small molecule which interacts with a high affinity binding partner that can be linked to an enzyme or fluorophore. The biotin-streptavidin pair is one commonly used high affinity interaction.
• The primary antibody can be probed for using a broader species-specific secondary antibody that is labeled using an enzyme or fluorophore.
• In the case of electron microscopy, antibodies are linked to a heavy metal particle.

As previously described, enzymes such as horseradish peroxidase or alkaline phosphatase are commonly used to catalyse reactions that give a coloured or chemiluminescent product. Fluorescent molecules can be visualised using fluorescence microscopy or confocal microscopy.

Applications of Immunostaining

Immunostaining has numerous applications, most typically in clinical diagnostics and laboratory research. Clinically, IHC is used in histopathology for the diagnosis of specific types of cancers based on molecular markers.

In laboratory science, immunostaining can be used for a variety of applications based on investigating the presence or absence of a protein, its tissue distribution, its sub-cellular localisation, and/or changes in protein expression or degradation.

Chapter- 2

Immunoprecipitation

Immunoprecipitation (IP) is the technique of precipitating a protein antigen out of solution using an antibody that specifically binds to that particular protein. This process can be used to isolate and concentrate a particular protein from a sample containing many thousands of different proteins. Immunoprecipitation requires that the antibody be coupled to a solid substrate at some point in the procedure.

Types of immunoprecipitation

Individual protein immunoprecipitation (IP)

Individual protein immunoprecipitation involves using an antibody that is specific for a known protein to isolate that particular protein out of a solution containing many different proteins. These solutions will often be in the form of a crude lysate of a plant or animal tissue. Other sample types could be bodily fluids or other samples of biological origin.

Protein complex immunoprecipitation (Co-IP)

Immunoprecipitation of intact protein complexes (i.e. antigen along with any proteins or ligands that are bound to it) is known as co-immunoprecipitation (Co-IP). Co-IP works by selecting an antibody that targets a known protein that is believed to be a member of a larger complex of proteins. By targeting this known member with an antibody it may become possible to pull the entire protein complex out of solution and thereby identify unknown members of the complex. This works when the proteins involved in the complex bind to each other tightly, making it possible to pull multiple members of the complex out of solution by latching onto one member with an antibody. This concept of pulling protein complexes out of solution is sometimes referred to as a "pull-down". Co-IP is a powerful technique that is used regularly by molecular biologists to analyze protein-protein interactions. Identifying the members of protein complexes may require several rounds of precipitation with different antibodies for a number of reasons:

• A particular antibody often selects for a subpopulation of its target protein that has the epitope exposed, thus failing to identify any proteins in complexes that hide the epitope. This can be seen in that it is rarely possible to precipitate even half of a given protein from a sample with a single antibody, even when a large excess of antibody is used.
• The first round of IP will often result in the identification of many new proteins that are putative members of the complex being studied. The researcher will then obtain antibodies that specifically target one of the newly identified proteins and repeat the entire immunoprecipitation experiment. This second round of precipitation may result in the recovery of additional new members of a complex that were not identified in the previous experiment. As successive rounds of targeting and immunoprecipitations take place, the number of identified proteins may continue to grow. The identified proteins may not ever exist in a single complex at a given time, but may instead represent a network of proteins interacting with one another at different times for different purposes.
• Repeating the experiment by targeting different members of the protein complex allows the researcher to double-check the result. Each round of pull-downs should result in the recovery of both the original known protein as well as other previously identified members of the complex (and even new additional members). By repeating the immunoprecipitation in this way, the researcher verifies that each identified member of the protein complex was a valid identification. If a particular protein can only be recovered by targeting one of the known members but not by targeting others of the known members, then that protein's status as a member of the complex may be subject to question.
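The cross-checking logic in the last point can be expressed as a small bookkeeping exercise. The Python sketch below is an illustration only: it takes the proteins recovered in each round of pull-down and separates those recovered with at least two different bait proteins from those seen with a single bait. The threshold of two baits, the function name and the protein labels are assumptions made for this example, not an established criterion.

```python
def classify_candidates(pulldowns):
    """Split recovered proteins by how many different baits recovered them.

    pulldowns: dict mapping a bait protein to the set of proteins recovered
    when that bait was targeted. The bait itself is ignored in its own round.
    Returns (supported, questionable): proteins recovered with two or more
    baits, and proteins recovered with exactly one bait.
    """
    counts = {}
    for bait, recovered in pulldowns.items():
        for protein in recovered - {bait}:
            counts[protein] = counts.get(protein, 0) + 1
    supported = {p for p, n in counts.items() if n >= 2}
    questionable = {p for p, n in counts.items() if n == 1}
    return supported, questionable


# Hypothetical rounds of Co-IP; the labels A, B, C and X are placeholders.
rounds = {
    "A": {"A", "B", "C", "X"},
    "B": {"A", "B", "C"},
    "C": {"A", "B", "C"},
}
supported, questionable = classify_candidates(rounds)
print(sorted(supported))     # ['A', 'B', 'C']
print(sorted(questionable))  # ['X']
```

Here protein X appears only when A is targeted, so by the reasoning above its membership in the complex would remain open to question.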

Chromatin immunoprecipitation (ChIP)

Chromatin immunoprecipitation (ChIP) is a method used to determine the location of DNA binding sites on the genome for a particular protein of interest. This technique gives a picture of the protein-DNA interactions that occur inside the nucleus of living cells or tissues. The in vivo nature of this method is in contrast to other approaches traditionally employed to answer the same questions.

The principle underpinning this assay is that DNA-binding proteins (including transcription factors and histones) in living cells can be cross-linked to the DNA that they are binding. By using an antibody that is specific to a putative DNA binding protein, one can immunoprecipitate the protein-DNA complex out of cellular lysates. The crosslinking is often accomplished by applying formaldehyde to the cells (or tissue), although it is sometimes advantageous to use a more defined and consistent crosslinker such as DTBP. Following crosslinking, the cells are lysed and the DNA is broken into pieces 0.2–1 kb in length by sonication. At this point the immunoprecipitation is performed, resulting in the purification of protein-DNA complexes. The purified protein-DNA complexes are then heated to reverse the formaldehyde cross-linking of the protein and DNA complexes, allowing the DNA to be separated from the proteins.

The identity and quantity of the DNA fragments isolated can then be determined by PCR. The limitation of performing PCR on the isolated fragments is that one must have an idea which genomic region is being targeted in order to generate the correct PCR primers. This limitation can be circumvented by cloning the isolated genomic DNA into a plasmid vector and then using primers that are specific to the cloning region of that vector. Alternatively, when one wants to find where the protein binds on a genome-wide scale, a DNA microarray can be used (ChIP-on-chip or ChIP-chip), allowing for the characterization of the cistrome. In addition, ChIP-Sequencing has recently emerged as a new technology that can localize protein binding sites in a high-throughput, cost-effective fashion.

RNA immunoprecipitation (RIP)

RNA immunoprecipitation (RIP) is similar to chromatin immunoprecipitation (ChIP) outlined above, but rather than targeting DNA binding proteins, it targets RNA binding proteins. RIP is also an in vivo method in that live cells are exposed to formaldehyde in order to create cross-links between RNA and RNA-binding proteins. Cells are then lysed and the immunoprecipitation is performed with an antibody that targets the protein of interest. By isolating the protein, the RNA will also be isolated as it is cross-linked to the protein. The purified RNA-protein complexes can be separated by reversing the cross-link, and the identity of the RNA can be determined by cDNA sequencing or RT-PCR.

Tagged proteins

One of the major technical hurdles with immunoprecipitation is the great difficulty in generating an antibody that specifically targets a single known protein. To get around this obstacle, many groups will engineer tags onto either the C- or N-terminal end of the protein of interest. The advantage here is that the same tag can be used time and again on many different proteins and the researcher can use the same antibody each time. The advantages of using tagged proteins are so great that this technique has become commonplace for all types of immunoprecipitation, including all of the types of IP detailed above. Examples of tags in use are the Green Fluorescent Protein (GFP) tag, the Glutathione-S-transferase (GST) tag and the FLAG-tag. While the use of a tag to enable pull-downs is convenient, it raises some concerns regarding biological relevance because the tag itself may either obscure native interactions or introduce new and unnatural interactions.

Methods

The two general methods for immunoprecipitation are the direct capture method and the indirect capture method.

Direct

Antibodies that are specific for a particular protein (or group of proteins) are immobilized on a solid-phase substrate such as superparamagnetic microbeads or on microscopic agarose (non-magnetic) beads. The beads with bound antibodies are then added to the protein mixture, and the proteins that are targeted by the antibodies are captured onto the beads via the antibodies; in other words, they become immunoprecipitated.

Indirect

Antibodies that are specific for a particular protein, or a group of proteins, are added directly to the mixture of protein. The antibodies have not been attached to a solid-phase support yet. The antibodies are free to float around the protein mixture and bind their targets. After a period of incubation, beads coated in protein A/G are added to the mixture of antibody and protein. At this point, the antibodies, which are now bound to their targets, will stick to the beads. From this point on, the direct and indirect protocols converge because the samples now have the same ingredients. Both methods give the same end result, with the protein or protein complexes bound to the antibodies, which themselves are immobilized onto the beads.

Selection

An indirect approach is sometimes preferred when the concentration of the protein target is low or when the specific affinity of the antibody for the protein is weak. The indirect method is also used when the binding kinetics of the antibody to the protein are slow for a variety of reasons. In most situations, the direct method is the default, and the preferred, choice.

Technological advances

Agarose

Historically, the solid-phase support for immunoprecipitation used by the majority of scientists has been highly porous agarose beads (also known as agarose resins or slurries). This technology has two advantages: a very high potential binding capacity, as virtually the entire sponge-like structure of the agarose particle (50 to 150μm in size) is available for binding antibodies (which will in turn bind the target proteins), and the use of standard laboratory equipment for all aspects of the IP protocol, with no need for specialized equipment.

The advantage of an extremely high binding capacity must be carefully balanced against the quantity of antibody that the researcher is prepared to use to coat the agarose beads. Because antibodies can be a cost-limiting factor, it is best to calculate backward from the amount of protein that needs to be captured (depending upon the analysis to be performed downstream), to the amount of antibody that is required to bind that quantity of protein (with a small excess added to account for inefficiencies of the system), and back still further to the quantity of agarose that is needed to bind that quantity of antibody. In cases where antibody saturation is not required, this technology is unmatched in its ability to capture extremely large quantities of target protein.

The caveat here is that the "high capacity advantage" can become a "high capacity disadvantage" when the enormous binding capacity of the sepharose/agarose beads is not completely saturated with antibodies. It often happens that the amount of antibody available to the researcher for an immunoprecipitation experiment is not sufficient to saturate the agarose beads to be used. In these cases the researcher can end up with agarose particles that are only partially coated with antibodies, and the portion of the binding capacity that is not coated with antibody is then free to bind anything that will stick. The result is an elevated background signal due to non-specific binding of lysate components to the beads, which can make data interpretation difficult. While some may argue that for these reasons it is prudent to match the quantity of agarose (in terms of binding capacity) to the quantity of antibody to be bound, a simple way to reduce non-specific binding to agarose beads and increase specificity is to preclear the lysate, which is highly recommended for any immunoprecipitation.
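The backward calculation described above (protein to capture, then antibody, then agarose) can be sketched in a few lines of Python. This is only an illustrative sketch: the function name `plan_ip_reagents`, the 20% antibody excess and the bead capacity of 10 µg antibody per µl of beads are assumptions rather than values from the text, and real capacities should be taken from the bead supplier's data sheet. It also assumes an IgG of roughly 150 kDa whose two antigen-binding sites are both occupied, which is an idealization.

```python
def plan_ip_reagents(target_ug, target_kda,
                     antibody_kda=150.0,             # approximate mass of an IgG
                     sites_per_antibody=2,           # IgG is bivalent; full occupancy is idealized
                     antibody_excess=1.2,            # hypothetical 20% excess for inefficiencies
                     bead_capacity_ug_per_ul=10.0):  # hypothetical; check the supplier data sheet
    """Work backward from target protein mass to antibody mass and bead volume."""
    # 1 kDa equals 1 ug per nmol, so ug divided by kDa gives nmol.
    target_nmol = target_ug / target_kda
    antibody_nmol = target_nmol / sites_per_antibody * antibody_excess
    antibody_ug = antibody_nmol * antibody_kda
    bead_ul = antibody_ug / bead_capacity_ug_per_ul
    return {"antibody_ug": antibody_ug, "bead_ul": bead_ul}


if __name__ == "__main__":
    plan = plan_ip_reagents(target_ug=5.0, target_kda=50.0)
    print(f"antibody: {plan['antibody_ug']:.1f} ug, beads: {plan['bead_ul']:.2f} ul")
```

With these assumed numbers, capturing 5 µg of a 50 kDa protein calls for about 9 µg of antibody and under 1 µl of beads, which suggests how easily beads end up only partially coated when a larger bead volume is used.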

Preclearing

Lysates are complex mixtures of proteins, lipids, carbohydrates and nucleic acids, and one must assume that some amount of non-specific binding to the IP antibody, Protein A/G or the beaded support will occur and negatively affect the detection of the immunoprecipitated target(s). In most cases, preclearing the lysate at the start of each immunoprecipitation experiment is a way to remove potentially reactive components from the cell lysate prior to the immunoprecipitation, to prevent the non-specific binding of these components to the IP beads or antibody. The basic preclearing procedure is described below, wherein the lysate is incubated with beads alone, which are then removed and discarded prior to the immunoprecipitation. This approach, though, does not account for non-specific binding to the IP antibody, which can be considerable.

Therefore, an alternative method of preclearing is to incubate the protein mixture with exactly the same components that will be used in the immunoprecipitation, except that a non-target, irrelevant antibody of the same antibody subclass as the IP antibody is used instead of the IP antibody itself. This approach attempts to match the conditions and components of the actual immunoprecipitation as closely as possible, in order to remove any non-specific cell constituent without capturing the target protein (unless, of course, the target protein binds non-specifically to some other IP component, which should be properly controlled for by analyzing the discarded beads used to preclear the lysate). The target protein can then be immunoprecipitated with a reduced risk of non-specific binding interfering with data interpretation.

Superparamagnetic beads

While the vast majority of immunoprecipitations are performed with agarose beads, the use of superparamagnetic beads for immunoprecipitation is a much newer approach that is only recently gaining in popularity as an alternative to agarose beads for IP applications. Unlike agarose, magnetic beads are solid and can be spherical, depending on the type of bead, and antibody binding is limited to the surface of each bead. While these beads do not have the advantage of a porous center to increase the binding capacity, magnetic beads are significantly smaller than agarose beads (1 to 4μm), and the greater number of magnetic beads per volume collectively gives magnetic beads an effective surface area-to-volume ratio for optimum antibody binding.

Commercially available magnetic beads can be classified by size uniformity into monodisperse and polydisperse beads. Monodisperse beads, also called microbeads, are uniform in size, and therefore all beads exhibit identical physical characteristics, including the binding capacity and the level of attraction to magnets. Polydisperse beads, while similar in size to monodisperse beads, show a wide range in size variability (1 to 4μm) that can influence their binding capacity and magnetic capture. Although both types of beads are commercially available for immunoprecipitation applications, the higher quality monodisperse superparamagnetic beads are better suited to automated protocols because of their consistent size, shape and performance. Monodisperse and polydisperse superparamagnetic beads are offered by many companies, including Invitrogen, Thermo Scientific, and Millipore.
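The surface area-to-volume argument can be illustrated with simple sphere geometry. The sketch below treats every bead as a smooth solid sphere, for which the ratio reduces to 6 divided by the diameter. That assumption is reasonable for the outer surface of solid magnetic beads, but it deliberately ignores the internal pore surface of agarose, so it says nothing about agarose's total binding capacity. The function name is illustrative only.

```python
import math


def sphere_surface_to_volume(diameter_um):
    """Outer surface area divided by volume for a solid sphere, in 1/um."""
    radius = diameter_um / 2.0
    surface = 4.0 * math.pi * radius ** 2
    volume = (4.0 / 3.0) * math.pi * radius ** 3
    return surface / volume  # algebraically equal to 6 / diameter


if __name__ == "__main__":
    # Bead diameters taken from the size ranges quoted in the text.
    for label, diameter in [("magnetic, 1 um", 1.0), ("magnetic, 4 um", 4.0),
                            ("agarose, 50 um", 50.0), ("agarose, 150 um", 150.0)]:
        print(f"{label}: {sphere_surface_to_volume(diameter):.3f} per um")
```

Because the ratio scales inversely with diameter, a given packed volume of small beads exposes far more outer surface than the same volume of large beads.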

Agarose vs. Magnetic Beads

Proponents of magnetic beads claim that the beads exhibit a faster rate of protein binding than agarose beads for immunoprecipitation applications, although standard agarose bead-based immunoprecipitations have been performed in 1 hour. Claims have also been made that magnetic beads are better for immunoprecipitating extremely large protein complexes because of the complete lack of an upper size limit for such complexes, although there is no unbiased evidence supporting this claim. Magnetic bead technology does subject samples to less physical stress, because magnetic separation replaces the repeated centrifugation needed with agarose, and this may contribute greatly to increasing the yield of labile (fragile) protein complexes. Additional factors, though, such as the binding capacity, the cost of the reagent, the requirement for extra equipment and the capability to automate IP processes, should be considered in the selection of an immunoprecipitation support.

Binding Capacity

Proponents of both agarose and magnetic beads can argue whether the vast difference in the binding capacities of the two beads favors one particular type of bead. In a bead-to-bead comparison, agarose beads have significantly greater surface area and therefore a greater binding capacity than magnetic beads, due to the large bead size and sponge-like structure. But the variable pore size of the agarose causes a potential upper size limit that may affect the binding of extremely large proteins or protein complexes to internal binding sites, and therefore magnetic beads may be better suited for immunoprecipitating large proteins or protein complexes than agarose beads, although there is a lack of independent comparative evidence that proves either case.

Some argue that the significantly greater binding capacity of agarose beads may be a disadvantage because of the larger capacity for non-specific binding. Others may argue for the use of magnetic beads because of the greater quantity of antibody required to saturate the total binding capacity of agarose beads, which would be an economic disadvantage of using agarose. While these arguments are correct outside the context of their practical use, these lines of reasoning ignore two key aspects of the principle of immunoprecipitation, which demonstrate that the decision to use agarose or magnetic beads is not simply determined by binding capacity.

First, non-specific binding is not limited to the antibody-binding sites on the immobilized support; any surface of the antibody or component of the immunoprecipitation reaction can bind to non-specific lysate constituents, and therefore non-specific binding will still occur even when completely saturated beads are used. This is why it is important to preclear the sample before the immunoprecipitation is performed.

Second, the ability to capture the target protein is directly dependent upon the amount of immobilized antibody used, and therefore, in a side-by-side comparison of agarose and magnetic bead immunoprecipitation, the most protein that either support can capture is limited by the amount of antibody added. So the decision to saturate any type of support depends on the amount of protein required, as described in the Agarose section above.

Cost

The price of using either type of support is a key determining factor in using agarose or magnetic beads for immunoprecipitation applications. A typical first-glance calculation on the cost of magnetic beads compared to sepharose beads may make the sepharose beads appear less expensive. But magnetic beads may be competitively priced compared to agarose for analytical-scale immunoprecipitations, depending on the IP method used and the volume of beads required per IP reaction.

Using the traditional batch method of immunoprecipitation as listed below, where all components are added to a tube during the IP reaction, the physical handling characteristics of agarose beads necessitate a minimum quantity of beads for each IP experiment (typically in the range of 25 to 50μl beads per IP). This is because sepharose beads must be concentrated at the bottom of the tube by centrifugation and the supernatant removed after each incubation, wash, etc. This imposes absolute physical limitations on the process, as pellets of agarose beads smaller than 25 to 50μl are difficult if not impossible to identify visually at the bottom of the tube. With magnetic beads, there is no minimum quantity of beads required, due to magnetic handling, and therefore, depending on the target antigen and IP antibody, it is possible to use considerably fewer magnetic beads.

Conversely, spin columns may be employed instead of normal microfuge tubes to significantly reduce the amount of agarose beads required per reaction. Spin columns contain a filter that allows all IP components except the beads to flow through during a brief centrifugation, and therefore provide a method to use significantly fewer agarose beads with minimal loss.

Equipment

As mentioned above, only standard laboratory equipment is required for the use of agarose beads in immunoprecipitation applications, while high-power magnets are required for magnetic bead-based IP reactions. While the magnetic capture equipment may be cost-prohibitive, the rapid completion of immunoprecipitations using magnetic beads may be a financially beneficial approach when grants are due, because a 30-minute protocol with magnetic beads, compared to overnight incubation at 4°C with agarose beads, may result in more data generated in a shorter length of time.

Automation

An added benefit of using magnetic beads is that automated immunoprecipitation devices are becoming more readily available. These devices not only reduce the amount of work and time needed to perform an IP, but they can also be used for high-throughput applications.

Summary

While clear benefits of using magnetic beads include the increased reaction speed, more gentle sample handling and the potential for automation, the choice of using agarose or magnetic beads based on the binding capacity of the support medium and the cost of the product may depend on the protein of interest and the IP method used. As with all assays, empirical testing is required to determine which method is optimal for a given application.

Protocol

Background

Once the solid substrate bead technology has been chosen, antibodies are coupled to the beads and the antibody-coated beads can be added to the heterogeneous protein sample (e.g. homogenized tissue). At this point, antibodies that are immobilized on the beads will bind to the proteins that they specifically recognize. Once this has occurred, the immunoprecipitation portion of the protocol is actually complete, as the specific proteins of interest are bound to the antibodies that are themselves immobilized on the beads. Separation of the immunocomplexes from the lysate is an extremely important series of steps, because the protein(s) must remain bound to each other (in the case of co-IP) and bound to the antibody during the wash steps that remove non-bound proteins and reduce background.

When working with agarose beads, the beads must be pelleted out of the sample by briefly spinning in a centrifuge with forces between 600 and 3,000 x g (times the standard gravitational force). This step may be performed in a standard microcentrifuge tube, but for faster separation, greater consistency and higher recoveries, the process is often performed in small spin columns with a pore size that allows liquid, but not agarose beads, to pass through. After centrifugation, the agarose beads will form a very loose, fluffy pellet at the bottom of the tube. The supernatant containing contaminants can be carefully removed so as not to disturb the beads. The wash buffer can then be added to the beads and, after mixing, the beads are again separated by centrifugation.

With superparamagnetic beads, the sample is placed in a magnetic field so that the beads can collect on the side of the tube. This procedure is generally complete in approximately 30 seconds, and the remaining (unwanted) liquid is pipetted away. Washes are accomplished by resuspending the beads (off the magnet) in the washing solution and then concentrating the beads back on the tube wall (by placing the tube back on the magnet). The washing is generally repeated several times to ensure adequate removal of contaminants. If the superparamagnetic beads are homogeneous in size and the magnet has been designed properly, the beads will concentrate uniformly on the side of the tube and the washing solution can be easily and completely removed.

After washing, the precipitated protein(s) are eluted and analyzed by gel electrophoresis, mass spectrometry, western blotting, or any number of other methods for identifying constituents in the complex. Protocol times for immunoprecipitation vary greatly due to a variety of factors, with protocol times increasing with the number of washes necessary or with the slower reaction kinetics of porous agarose beads.
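Centrifuges are often set in revolutions per minute rather than in multiples of g, so the 600 to 3,000 x g range quoted above has to be converted for a given rotor. A minimal sketch of the standard conversion, RCF = 1.118 × 10⁻⁵ × r × rpm² with the radius r in centimetres, follows. The rotor radius of 7 cm is a hypothetical value chosen for illustration; the real radius comes from the rotor's documentation.

```python
import math

RCF_CONSTANT = 1.118e-5  # standard factor for radius in cm and speed in rpm


def rcf_from_rpm(rpm, radius_cm):
    """Relative centrifugal force (multiples of g) at a given rotor speed."""
    return RCF_CONSTANT * radius_cm * rpm ** 2


def rpm_from_rcf(rcf, radius_cm):
    """Rotor speed in rpm needed to reach a given relative centrifugal force."""
    return math.sqrt(rcf / (RCF_CONSTANT * radius_cm))


if __name__ == "__main__":
    radius_cm = 7.0  # hypothetical microcentrifuge rotor radius
    for rcf in (600, 3000):
        print(f"{rcf} x g -> {rpm_from_rcf(rcf, radius_cm):.0f} rpm")
```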

Steps

1. Lyse cells and prepare sample for immunoprecipitation.
2. Pre-clear the sample by passing it over beads alone, or over beads bound to an irrelevant antibody, to soak up any proteins that bind non-specifically to the IP components.
3. Incubate solution with antibody against the protein of interest. Antibody can be attached to solid support before this step (direct method) or after this step (indirect method). Continue the incubation to allow antibody-antigen complexes to form.
4. Precipitate the complex of interest, removing it from bulk solution.
5. Wash precipitated complex several times. Between washes, spin the tube when using agarose beads, or place it on a magnet when using superparamagnetic beads, and then remove the supernatant. After the final wash, remove as much supernatant as possible.
6. Elute proteins from the solid support using low-pH or SDS sample loading buffer.
7. Analyze complexes or antigens of interest. This can be done in a variety of ways:
   1. SDS-PAGE (sodium dodecyl sulfate-polyacrylamide gel electrophoresis) followed by gel staining.
   2. SDS-PAGE followed by gel staining, cutting out individual stained protein bands, and sequencing the proteins in the bands by MALDI mass spectrometry.
   3. Transfer and Western blot using another antibody for proteins that were interacting with the antigen, followed by chemiluminescent visualization.

Chapter- 3

Immunoelectrophoresis and Western Blot

Immunoelectrophoresis

Immunoelectrophoresis is a general name for a number of biochemical methods for separation and characterization of proteins based on electrophoresis and reaction with antibodies. All variants of immunoelectrophoresis require immunoglobulins, also known as antibodies, that react with the proteins to be separated or characterized. The methods were developed and used extensively during the second half of the 20th century. In somewhat chronological order: immunoelectrophoretic analysis (one-dimensional immunoelectrophoresis ad modum Grabar), crossed immunoelectrophoresis (two-dimensional quantitative immunoelectrophoresis ad modum Clarke and Freeman or ad modum Laurell), rocket immunoelectrophoresis (one-dimensional quantitative immunoelectrophoresis ad modum Laurell), fused rocket immunoelectrophoresis ad modum Svendsen and Harboe, and affinity immunoelectrophoresis ad modum Bøg-Hansen.

Agarose as 1% gel slabs of about 1 mm thickness buffered at high pH (around 8.6) is traditionally preferred for the electrophoresis as well as the reaction with antibodies. Agarose was chosen as the gel matrix because it has large pores that allow free passage and separation of proteins, but provides an anchor for the immunoprecipitates of protein and specific antibodies. The high pH was chosen because antibodies are practically immobile at high pH. Immunoprecipitates may be seen in the wet agarose gel, but are stained with protein stains like Coomassie Brilliant Blue in the dried gel. In contrast to SDS-gel electrophoresis, electrophoresis in agarose allows native conditions, preserving the native structure and activities of the proteins under investigation. Immunoelectrophoresis therefore allows characterization of enzyme activities, ligand binding and so on, in addition to electrophoretic separation.

The immunoelectrophoretic analysis ad modum Grabar is the classical method of immunoelectrophoresis.
Proteins are separated by electrophoresis, then antibodies are applied in a trough next to the separated proteins, and immunoprecipitates are formed after a period of diffusion of the separated proteins and antibodies against each other. The introduction of the immunoelectrophoretic analysis gave a great boost to protein chemistry; some of the very first results were the resolution of proteins in biological fluids and biological extracts. Among the important observations made were the great number of different proteins in serum, the existence of several immunoglobulin classes and their electrophoretic heterogeneity.

Crossed immunoelectrophoresis is also called two-dimensional quantitative immunoelectrophoresis ad modum Clarke and Freeman or ad modum Laurell. In this method the proteins are first separated during the first-dimension electrophoresis; then, instead of diffusing towards the antibodies, the proteins are electrophoresed into an antibody-containing gel in the second dimension. Immunoprecipitation takes place during the second-dimension electrophoresis, and the immunoprecipitates have a characteristic bell shape. Each precipitate represents one antigen, and its position depends on the amount of protein as well as the amount of specific antibody in the gel, so relative quantification can be performed. The sensitivity and resolving power of crossed immunoelectrophoresis are greater than those of the classical immunoelectrophoretic analysis, and there are multiple variations of the technique useful for various purposes. Crossed immunoelectrophoresis has been used for studies of proteins in biological fluids, particularly human serum, and biological extracts.

Rocket immunoelectrophoresis is one-dimensional quantitative immunoelectrophoresis. The method was used for quantitation of human serum proteins before automated methods became available.

Fused rocket immunoelectrophoresis is a modification of one-dimensional quantitative immunoelectrophoresis used for detailed measurement of proteins in fractions from protein separation experiments.

Affinity immunoelectrophoresis is based on changes in the electrophoretic pattern of proteins through biospecific interaction or complex formation with other macromolecules or ligands.
Affinity immunoelectrophoresis has been used for estimation of binding constants, for instance with lectins, or for characterization of proteins with specific features like glycan content or ligand binding. Some variants of affinity immunoelectrophoresis are similar to affinity chromatography in their use of immobilized ligands.

The open structure of the immunoprecipitate in the agarose gel will allow additional binding of radioactively labeled antibodies to reveal specific proteins. This variation has been used for identification of allergens through reaction with IgE.

Two factors explain why immunoelectrophoretic methods are not widely used. First, they are rather work intensive and require some manual expertise. Second, they require rather large amounts of polyclonal antibodies. Today gel electrophoresis followed by electroblotting is the preferred method for protein characterization because of its ease of operation, its high sensitivity, and its low requirement for specific antibodies. In addition, proteins are separated by gel electrophoresis on the basis of their apparent molecular weight, which is not accomplished by immunoelectrophoresis. Nevertheless, immunoelectrophoretic methods are still useful when non-reducing conditions are needed.

Western blot

Western blot analysis of proteins separated by SDS-PAGE

The Western blot (alternatively, protein immunoblot) is an extremely useful analytical technique used to detect specific proteins in a given sample of tissue homogenate or extract. It uses gel electrophoresis to separate native or denatured proteins by the length of the polypeptide (denaturing conditions) or by the 3-D structure of the protein (native, non-denaturing conditions). The proteins are then transferred to a membrane (typically nitrocellulose or PVDF), where they are probed (detected) using antibodies specific to the target protein.

There are now many reagent companies that specialize in providing antibodies (both monoclonal and polyclonal antibodies) against tens of thousands of different proteins. Commercial antibodies can be expensive, although the unbound antibody can be reused between experiments. This method is used in the fields of molecular biology, biochemistry, immunogenetics and other molecular biology disciplines. Other related techniques include using antibodies to detect proteins in tissues and cells by immunostaining and enzyme-linked immunosorbent assay (ELISA). The method originated from the laboratory of George Stark at Stanford. The name Western blot was given to the technique by W. Neal Burnette and is a play on the name Southern blot, a technique for DNA detection developed earlier by Edwin Southern. Detection of RNA is termed northern blotting and the detection of post-translational modification of protein is termed eastern blotting.

Steps in a Western blot

Tissue preparation

Samples may be taken from whole tissue or from cell culture. In most cases, solid tissues are first broken down mechanically using a blender (for larger sample volumes), using a homogenizer (smaller volumes), or by sonication. Cells may also be broken open by one of the above mechanical methods. However, bacterial, viral or environmental samples can also be the source of protein, and thus Western blotting is not restricted to cellular studies. Assorted detergents, salts, and buffers may be employed to encourage lysis of cells and to solubilize proteins. Protease and phosphatase inhibitors are often added to prevent the digestion of the sample by its own enzymes. Tissue preparation is often done at cold temperatures to avoid protein denaturing and degradation. A combination of biochemical and mechanical techniques, including various types of filtration and centrifugation, can be used to separate different cell compartments and organelles.

Gel electrophoresis

The proteins of the sample are separated using gel electrophoresis. Separation of proteins may be by isoelectric point (pI), molecular weight, electric charge, or a combination of these factors. The nature of the separation depends on the treatment of the sample and the nature of the gel. This is a very useful way to identify a protein.

By far the most common type of gel electrophoresis employs polyacrylamide gels and buffers loaded with sodium dodecyl sulfate (SDS). SDS-PAGE (SDS polyacrylamide gel electrophoresis) maintains polypeptides in a denatured state once they have been treated with strong reducing agents to remove secondary and tertiary structure (e.g. reducing disulfide bonds [S-S] to sulfhydryl groups [SH and SH]), and thus allows separation of proteins by their molecular weight. Sampled proteins become covered in the negatively charged SDS and move to the positively charged electrode through the acrylamide mesh of the gel. Smaller proteins migrate faster through this mesh, and the proteins are thus separated according to size (usually measured in kilodaltons, kDa). The concentration of acrylamide determines the resolution of the gel: the greater the acrylamide concentration, the better the resolution of lower molecular weight proteins; the lower the acrylamide concentration, the better the resolution of higher molecular weight proteins. Proteins travel only in one dimension along the gel for most blots.

Samples are loaded into wells in the gel. One lane is usually reserved for a marker or ladder, a commercially available mixture of proteins having defined molecular weights, typically stained so as to form visible, coloured bands. When voltage is applied along the gel, proteins migrate into it at different speeds. These different rates of advancement (different electrophoretic mobilities) separate the proteins into bands within each lane.
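The marker lane can be turned into a rough size estimate for an unknown band. The sketch below assumes the commonly used approximation that, within the useful range of a given gel, the logarithm of molecular weight falls roughly linearly with migration distance. It fits a least-squares line through the ladder bands and reads the unknown off that line. The ladder sizes and distances are hypothetical values for illustration, and the function names are not from any particular software package.

```python
import math


def fit_ladder(distances_mm, weights_kda):
    """Least-squares line through (distance, log10(MW)) for the marker lane."""
    n = len(distances_mm)
    logs = [math.log10(w) for w in weights_kda]
    mean_x = sum(distances_mm) / n
    mean_y = sum(logs) / n
    sxx = sum((x - mean_x) ** 2 for x in distances_mm)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(distances_mm, logs))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def estimate_kda(distance_mm, slope, intercept):
    """Apparent molecular weight of a band from its migration distance."""
    return 10 ** (slope * distance_mm + intercept)


if __name__ == "__main__":
    # Hypothetical ladder: migration distances (mm) and marker sizes (kDa).
    ladder_mm = [10.0, 18.0, 27.0, 37.0, 48.0]
    ladder_kda = [150.0, 100.0, 50.0, 25.0, 15.0]
    slope, intercept = fit_ladder(ladder_mm, ladder_kda)
    print(f"band at 30 mm is roughly {estimate_kda(30.0, slope, intercept):.0f} kDa")
```

The result is an apparent molecular weight only; bands outside the range spanned by the ladder should not be extrapolated with much confidence.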

It is also possible to use a two-dimensional (2-D) gel which spreads the proteins from a single sample out in two dimensions. Proteins are separated according to isoelectric point (pH at which they have neutral net charge) in the first dimension, and according to their molecular weight in the second dimension.

Transfer

In order to make the proteins accessible to antibody detection, they are moved from within the gel onto a membrane made of nitrocellulose or polyvinylidene difluoride (PVDF). The membrane is placed on top of the gel, and a stack of filter papers is placed on top of that. The entire stack is placed in a buffer solution which moves up the paper by capillary action, bringing the proteins with it. Another method for transferring the proteins is called electroblotting and uses an electric current to pull proteins from the gel into the PVDF or nitrocellulose membrane. The proteins move from within the gel onto the membrane while maintaining the organization they had within the gel. As a result of this "blotting" process, the proteins are exposed on a thin surface layer for detection (see below).

Both varieties of membrane are chosen for their non-specific protein binding properties (i.e. they bind all proteins equally well). Protein binding is based upon hydrophobic interactions, as well as charged interactions between the membrane and protein. Nitrocellulose membranes are cheaper than PVDF, but are far more fragile and do not stand up well to repeated probings.

The uniformity and overall effectiveness of transfer of protein from the gel to the membrane can be checked by staining the membrane with Coomassie Brilliant Blue or Ponceau S dyes. Ponceau S is the more common of the two, chiefly because it is water soluble, which makes it easy to destain the membrane and then probe it as described below.

Blocking

Since the membrane has been chosen for its ability to bind protein, and as both antibodies and the target are proteins, steps must be taken to prevent interactions between the membrane and the antibody used for detection of the target protein. Blocking of non-specific binding is achieved by placing the membrane in a dilute solution of protein, typically 3-5% bovine serum albumin (BSA) or non-fat dry milk (both are inexpensive) in Tris-Buffered Saline (TBS), with a minute percentage of detergent such as Tween 20 or Triton X-100. The protein in the dilute solution attaches to the membrane in all places where the target proteins have not attached. Thus, when the antibody is added, there is no room on the membrane for it to attach other than on the binding sites of the specific target protein. This reduces "noise" in the final product of the Western blot, leading to clearer results and fewer false positives.
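The quantities for a blocking solution follow directly from the percentages. A minimal sketch is given below, assuming the protein percentage is weight per volume (grams per 100 ml) and the detergent percentage is volume per volume. The text only says the detergent is present at "a minute percentage", so the default of 0.1% used here is an assumed, illustrative figure.

```python
def blocking_buffer_recipe(volume_ml, protein_pct_wv=5.0, detergent_pct_vv=0.1):
    """Amounts of blocking protein and detergent for a given buffer volume.

    protein_pct_wv: grams of BSA or dry milk per 100 ml (3-5% in the text).
    detergent_pct_vv: assumed detergent percentage by volume (illustrative).
    """
    protein_g = volume_ml * protein_pct_wv / 100.0
    detergent_ul = volume_ml * detergent_pct_vv / 100.0 * 1000.0  # ml to ul
    return {"protein_g": protein_g, "detergent_ul": detergent_ul}


if __name__ == "__main__":
    recipe = blocking_buffer_recipe(volume_ml=50.0)
    print(f"{recipe['protein_g']:.2f} g protein, "
          f"{recipe['detergent_ul']:.0f} ul detergent, made up to 50 ml with TBS")
```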

Detection

During the detection process the membrane is "probed" for the protein of interest with a modified antibody linked to a reporter enzyme. When exposed to an appropriate substrate, the enzyme drives a colourimetric reaction and produces a colour. For a variety of reasons, this traditionally takes place in a two-step process, although there are now one-step detection methods available for certain applications.

Two steps

Primary antibody

Antibodies are generated when a host species or immune cell culture is exposed to the protein of interest (or a part thereof). Normally, this is part of the immune response, whereas here they are harvested and used as sensitive and specific detection tools that bind the protein directly. After blocking, a dilute solution of primary antibody (generally between 0.5 and 5 micrograms/mL) is incubated with the membrane under gentle agitation. Typically, the solution is composed of buffered saline solution with a small percentage of detergent, and sometimes with powdered milk or BSA. The antibody solution and the membrane can be sealed and incubated together for anywhere from 30 minutes to overnight. It can also be incubated at different temperatures, with warmer temperatures being associated with more binding, both specific (to the target protein, the "signal") and non-specific ("noise").
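Reaching a working concentration in the 0.5 to 5 micrograms/mL range is a simple dilution problem (C1 × V1 = C2 × V2). The sketch below computes how much antibody stock to add. The stock concentration of 1 mg/mL and the 10 mL final volume in the example are hypothetical; the actual stock concentration is stated on the antibody's data sheet.

```python
def stock_volume_ul(stock_ug_per_ml, final_ug_per_ml, final_volume_ml):
    """Volume of antibody stock (in ul) needed for the working solution."""
    if final_ug_per_ml > stock_ug_per_ml:
        raise ValueError("final concentration cannot exceed stock concentration")
    # C1 * V1 = C2 * V2, solved for V1 and converted from ml to ul.
    return final_ug_per_ml * final_volume_ml / stock_ug_per_ml * 1000.0


if __name__ == "__main__":
    # Hypothetical example: 1 mg/mL stock diluted to 1 ug/mL in 10 mL of buffer.
    volume = stock_volume_ul(stock_ug_per_ml=1000.0, final_ug_per_ml=1.0,
                             final_volume_ml=10.0)
    print(f"add {volume:.1f} ul of stock to 10 mL of buffer")
```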

Secondary antibody

After rinsing the membrane to remove unbound primary antibody, the membrane is exposed to another antibody, directed at a species-specific portion of the primary antibody. This is known as a secondary antibody and, due to its targeting properties, tends to be referred to as "anti-mouse," "anti-goat," etc. Antibodies come from animal sources (or animal-sourced hybridoma cultures); an anti-mouse secondary will bind to almost any mouse-sourced primary antibody, which allows some cost savings by letting an entire lab share a single source of mass-produced antibody, and provides far more consistent results.

The secondary antibody is usually linked to biotin or to a reporter enzyme such as alkaline phosphatase or horseradish peroxidase. Several secondary antibodies can bind to one primary antibody, which enhances the signal. Most commonly, a horseradish peroxidase-linked secondary is used to cleave a chemiluminescent agent, and the reaction product produces luminescence in proportion to the amount of protein. A sensitive sheet of photographic film is placed against the membrane, and exposure to the light from the reaction creates an image of the antibodies bound to the blot. A cheaper but less sensitive approach utilizes a 4-chloronaphthol stain with 1% hydrogen peroxide; reaction of peroxide radicals with 4-chloronaphthol produces a dark brown stain that can be photographed without using specialized photographic film.

As with the ELISPOT and ELISA procedures, the enzyme can be provided with a substrate molecule that will be converted by the enzyme to a colored reaction product that will be visible on the membrane (see the figure below with blue bands).

Another method of secondary antibody detection utilizes a near-infrared (NIR) fluorophore-linked antibody. Light produced from the excitation of a fluorescent dye is static, making fluorescent detection a more precise and accurate measure of the difference in signal produced by labeled antibodies bound to proteins on a Western blot. Proteins can be accurately quantified because the signal generated by the different amounts of proteins on the membranes is measured in a static state, as compared to chemiluminescence, in which light is measured in a dynamic state.

A third alternative is to use a radioactive label rather than an enzyme coupled to the secondary antibody, such as labeling an antibody-binding protein like Staphylococcus Protein A, or streptavidin (which binds biotin-linked antibodies), with a radioactive isotope of iodine. Since other methods are safer, quicker, and cheaper, this method is now rarely used; however, an advantage of this approach is the sensitivity of autoradiography-based imaging, which enables highly accurate protein quantification when combined with optical software (e.g. Optiquant).

One step

Historically, the probing process was performed in two steps because of the relative ease of producing primary and secondary antibodies in separate processes. This gives researchers and corporations huge advantages in terms of flexibility, and adds an amplification step to the detection process. Given the advent of high-throughput protein analysis and lower limits of detection, however, there has been interest in developing one-step probing systems that would allow the process to occur faster and with fewer consumables. This requires a probe antibody that both recognizes the protein of interest and carries a detectable label; such probes are often available for known protein tags. The primary probe is incubated with the membrane in a manner similar to that for the primary antibody in a two-step process, and then is ready for direct detection after a series of wash steps.

Western blot using radioactive detection system

Analysis

After the unbound probes are washed away, the Western blot is ready for detection of the probes that are labeled and bound to the protein of interest. In practical terms, not all Westerns reveal protein only at one band in a membrane. Size approximations are taken by comparing the stained bands to those of the marker or ladder loaded during electrophoresis. The process is repeated for a structural protein, such as actin or tubulin, that should not change between samples. The amount of target protein is indexed to the structural protein to control between groups. This practice corrects for the amount of total protein on the membrane in case of errors or incomplete transfers.
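The indexing of target protein to a structural protein described above can be sketched as a per-lane ratio. This is only a minimal illustration: the function name and the band intensities are hypothetical, and real densitometry would also involve background subtraction, which is not shown.

```python
def normalize_to_loading_control(target_bands, control_bands):
    """Divide each lane's target-band intensity by its loading-control intensity.

    Both arguments are lists of densitometry readings, one per lane, in the
    same lane order (hypothetical, background-corrected values).
    """
    if len(target_bands) != len(control_bands):
        raise ValueError("need one control reading per target reading")
    return [t / c for t, c in zip(target_bands, control_bands)]


# Lane 2 has twice the raw target signal, but also twice the actin signal,
# so the normalized amounts are the same.
ratios = normalize_to_loading_control([1000.0, 2000.0], [500.0, 1000.0])
```

A raw twofold difference between lanes that disappears after normalization points to unequal loading or transfer rather than a change in the target protein.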

Colorimetric detection The colorimetric detection method depends on incubation of the Western blot with a substrate that reacts with the reporter enzyme (such as peroxidase) that is bound to the secondary antibody. This converts the soluble dye into an insoluble form of a different color that precipitates next to the enzyme and thereby stains the membrane. Development of the blot is then stopped by washing away the soluble dye. Protein levels are evaluated through densitometry (how intense the stain is) or spectrophotometry.

Chemiluminescent detection Chemiluminescent detection methods depend on incubation of the Western blot with a substrate that will luminesce when exposed to the reporter on the secondary antibody. The light is then detected by photographic film, and more recently by CCD cameras which capture a digital image of the Western blot. The image is analysed by densitometry, which evaluates the relative amount of protein staining and quantifies the results in terms of optical density. Newer software allows further data analysis such as molecular weight analysis if appropriate standards are used.

Radioactive detection

Radioactive labels do not require enzyme substrates; instead, medical X-ray film is placed directly against the Western blot, where it develops as it is exposed to the label and creates dark regions that correspond to the protein bands of interest. The importance of radioactive detection methods is declining because they are very expensive, carry high health and safety risks, and ECL (enhanced chemiluminescence) provides a useful alternative.

Fluorescent detection

The fluorescently labeled probe is excited by light, and the resulting emission is detected by a photosensor, such as a CCD camera equipped with appropriate emission filters, which captures a digital image of the Western blot and allows further data analysis such as molecular weight analysis and quantitative Western blot analysis. Fluorescence is considered to be among the most sensitive detection methods for blotting analysis.

Secondary probing One major difference between nitrocellulose and PVDF membranes relates to the ability of each to support "stripping" antibodies off and reusing the membrane for subsequent antibody probes. While there are well-established protocols available for stripping nitrocellulose membranes, the sturdier PVDF allows for easier stripping, and for more reuse before background noise limits experiments. Another difference is that, unlike nitrocellulose, PVDF must be soaked in 95% ethanol, isopropanol or methanol before use. PVDF membranes also tend to be thicker and more resistant to damage during use.

2-D gel electrophoresis

2-dimensional SDS-PAGE uses the principles and techniques outlined above. 2-D SDS-PAGE, as the name suggests, involves the migration of polypeptides in 2 dimensions. For example, in the first dimension polypeptides are separated according to isoelectric point, while in the second dimension polypeptides are separated according to their molecular weight. The isoelectric point of a given protein is determined by the relative number of positively (e.g. lysine and arginine) and negatively (e.g. glutamate and aspartate) charged amino acids, with positively charged amino acids contributing to a high isoelectric point and negatively charged amino acids contributing to a low isoelectric point. Samples could also be separated first under nonreducing conditions using SDS-PAGE and then under reducing conditions in the second dimension, which breaks apart the disulfide bonds that hold subunits together. SDS-PAGE might also be coupled with urea-PAGE for a 2-dimensional gel. In principle, this method allows for the separation of all cellular proteins on a single large gel. A major advantage of this method is that it often distinguishes between different isoforms of a particular protein, e.g. a protein that has been phosphorylated (by addition of a negatively charged group). Proteins that have been separated can be cut out of the gel and then analysed by mass spectrometry, which identifies the protein. Please refer to reference articles for examples of the application of 2-D SDS-PAGE.

Medical diagnostic applications

• The confirmatory HIV test employs a Western blot to detect anti-HIV antibody in a human serum sample. Proteins from known HIV-infected cells are separated and blotted on a membrane as above. Then, the serum to be tested is applied in the primary antibody incubation step; free antibody is washed away, and a secondary anti-human antibody linked to an enzyme signal is added. The stained bands then indicate the proteins to which the patient's serum contains antibody.

• A Western blot is also used as the definitive test for bovine spongiform encephalopathy (BSE, commonly referred to as 'mad cow disease').

• Some forms of Lyme disease testing employ Western blotting.

• Western blot can also be used as a confirmatory test for Hepatitis B infection.

• In veterinary medicine, Western blot is sometimes used to confirm FIV+ status in cats.

Chapter- 4

Enzyme Assay

Beckman DU640 UV/Vis spectrophotometer

Enzyme assays are laboratory methods for measuring enzymatic activity. They are vital for the study of enzyme kinetics and enzyme inhibition.

Enzyme units Amounts of enzymes can either be expressed as molar amounts, as with any other chemical, or measured in terms of activity, in enzyme units.

Enzyme activity

Enzyme activity = moles of substrate converted per unit time = rate × reaction volume. Enzyme activity is a measure of the quantity of active enzyme present and is thus dependent on conditions, which should be specified. The SI unit is the katal, 1 katal = 1 mol s⁻¹, but this is an excessively large unit. A more practical and commonly used unit is the enzyme unit (U): 1 U = 1 μmol min⁻¹, which corresponds to 16.67 nanokatals.
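The unit relationships above can be checked with a few lines of arithmetic. The sketch below assumes the rate is given in μmol L⁻¹ min⁻¹ and the volume in litres; the function names are illustrative only.

```python
# 1 U = 1 µmol/min = 1e-6 mol / 60 s; expressed in nanokatals (1e-9 mol/s).
NANOKATAL_PER_UNIT = 1e-6 / 60.0 * 1e9


def activity_in_units(rate_umol_per_l_min, volume_l):
    """Enzyme activity (U) = rate x reaction volume."""
    return rate_umol_per_l_min * volume_l


def units_to_nanokatal(units):
    """Convert enzyme units (µmol/min) to nanokatals (nmol/s)."""
    return units * NANOKATAL_PER_UNIT


# A 1 mL reaction converting 50 µmol per litre per minute:
activity = activity_in_units(50.0, 0.001)   # about 0.05 U
nkat = units_to_nanokatal(1.0)              # about 16.67 nkat
```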

Specific activity

The specific activity of an enzyme is another common measure. It is the activity of an enzyme per milligram of total protein (expressed in μmol min⁻¹ mg⁻¹), that is, the amount of product formed by an enzyme in a given amount of time under given conditions per milligram of total protein. Specific activity is equal to the rate of reaction multiplied by the volume of reaction divided by the mass of total protein, and it gives a measurement of the purity of the enzyme. The SI unit is katal kg⁻¹, but a more practical unit is μmol min⁻¹ mg⁻¹. Specific activity is measured at a specific (usually saturating) substrate concentration, and is usually constant for a pure enzyme. To eliminate errors arising from differences in cultivation batches, misfolded enzyme and similar causes, an active site titration needs to be done. This measures the amount of active enzyme, for example by titrating the active sites present with an irreversible inhibitor. The specific activity should then be expressed as μmol min⁻¹ mg⁻¹ of active enzyme.

Related terminology

The rate of a reaction is the concentration of substrate disappearing (or product produced) per unit time (mol L⁻¹ s⁻¹). The % purity is 100% × (specific activity of enzyme sample / specific activity of pure enzyme). An impure sample has a lower specific activity because some of its mass is not actually enzyme, so if the specific activity of the 100% pure enzyme is known, the purity of a sample can be calculated.
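As a worked illustration of the two definitions above, the following sketch computes a specific activity and a % purity. The numbers are invented for the example, and the function names are not taken from any particular software.

```python
def specific_activity(rate_umol_per_l_min, volume_l, total_protein_mg):
    """Specific activity in µmol min^-1 mg^-1: rate x volume / mass of total protein."""
    if total_protein_mg <= 0:
        raise ValueError("total protein mass must be positive")
    return rate_umol_per_l_min * volume_l / total_protein_mg


def percent_purity(sample_specific_activity, pure_specific_activity):
    """% purity = 100 x (specific activity of sample / specific activity of pure enzyme)."""
    return 100.0 * sample_specific_activity / pure_specific_activity


# Hypothetical sample: 120 µmol L^-1 min^-1 in a 1 mL assay containing 0.002 mg protein.
sample = specific_activity(120.0, 0.001, 0.002)   # about 60 µmol min^-1 mg^-1
purity = percent_purity(sample, 240.0)            # about 25 % if the pure enzyme gives 240
```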

Types of assay

All enzyme assays measure either the consumption of substrate or production of product over time. A large number of different methods of measuring the concentrations of substrates and products exist and many enzymes can be assayed in several different ways. Biochemists usually study enzyme-catalysed reactions using four types of experiments:

• Initial rate experiments. When an enzyme is mixed with a large excess of the substrate, the enzyme-substrate intermediate builds up in a fast initial transient. Then the reaction reaches steady-state kinetics, in which the concentration of enzyme-substrate intermediates remains approximately constant over time and the reaction rate changes relatively slowly. Rates are measured for a short period after the attainment of the quasi-steady state, typically by monitoring the accumulation of product with time. Because the measurements are carried out for a very short period and because of the large excess of substrate, the free substrate concentration can be taken as approximately equal to the initial substrate concentration. The initial rate experiment is the simplest to perform and analyze, being relatively free from complications such as back-reaction and enzyme degradation. It is therefore by far the most commonly used type of experiment in enzyme kinetics.

• Progress curve experiments. In these experiments, the kinetic parameters are determined from expressions for the species concentrations as a function of time. The concentration of the substrate or product is recorded over time after the initial fast transient and for a sufficiently long period to allow the reaction to approach equilibrium. While they are less common now, progress curve experiments were widely used in the early period of enzyme kinetics.

• Transient kinetics experiments. In these experiments, reaction behaviour is tracked during the initial fast transient as the intermediate reaches the steady-state kinetics period. These experiments are more difficult to perform than either of the above two classes because they require rapid mixing and observation techniques.

• Relaxation experiments. In these experiments, an equilibrium mixture of enzyme, substrate and product is perturbed, for instance by a temperature, pressure or pH jump, and the return to equilibrium is monitored. The analysis of these experiments requires consideration of the fully reversible reaction. Moreover, relaxation experiments are relatively insensitive to mechanistic details and are thus not typically used for mechanism identification, although they can be under appropriate conditions.
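For the initial rate experiment described in the list above, the rate is the slope of product concentration against time over the short early window. A minimal sketch, assuming hypothetical time points in seconds and product readings in μM, with the window length chosen by the experimenter:

```python
def initial_rate(times, product, window):
    """Least-squares slope of product vs. time, using only points with t <= window."""
    points = [(t, p) for t, p in zip(times, product) if t <= window]
    if len(points) < 2:
        raise ValueError("need at least two points inside the window")
    n = len(points)
    mean_t = sum(t for t, _ in points) / n
    mean_p = sum(p for _, p in points) / n
    sxx = sum((t - mean_t) ** 2 for t, _ in points)
    sxy = sum((t - mean_t) * (p - mean_p) for t, p in points)
    if sxx == 0:
        raise ValueError("time points inside the window must differ")
    return sxy / sxx


# Hypothetical progress curve: linear for the first 30 s, then slowing down.
times = [0, 10, 20, 30, 120]
product = [0.0, 5.0, 10.0, 15.0, 40.0]
rate = initial_rate(times, product, window=30)   # slope of the early, linear part
```

Restricting the fit to the early window excludes the later point at 120 s, where the curve has already flattened and would lower the apparent rate.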

Enzyme assays can be split into two groups according to their sampling method: continuous assays, where the assay gives a continuous reading of activity, and discontinuous assays, where samples are taken, the reaction stopped and then the concentration of substrates/products determined.

Temperature-controlled cuvette holder in a spectrophotometer

Continuous assays Continuous assays are most convenient, with one assay giving the rate of reaction with no further work necessary. There are many different types of continuous assays.

Spectrophotometric

In spectrophotometric assays, you follow the course of the reaction by measuring a change in how much light the assay solution absorbs. If this light is in the visible region you can actually see a change in the color of the assay; these are called colorimetric assays. The MTT assay, a redox assay using a tetrazolium dye as substrate, is an example of a colorimetric assay. UV light is often used, since the common coenzymes NADH and NADPH absorb UV light at 340 nm in their reduced forms, but not in their oxidized forms. An oxidoreductase using NADH as a substrate could therefore be assayed by following the decrease in UV absorbance at a wavelength of 340 nm as it consumes the coenzyme.

Direct versus coupled assays

Coupled assay for hexokinase using glucose-6-phosphate dehydrogenase

Even when the enzyme reaction does not result in a change in the absorbance of light, it can still be possible to use a spectrophotometric assay for the enzyme by using a coupled assay. Here, the product of one reaction is used as the substrate of another, easily detectable reaction. For example, figure 1 shows the coupled assay for the enzyme hexokinase, which can be assayed by coupling its production of glucose-6-phosphate to NADPH production, using glucose-6-phosphate dehydrogenase.
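An absorbance change at 340 nm, whether from a direct NADH assay or from a coupled assay that produces NADPH, can be turned into a concentration change with the Beer-Lambert law (A = ε·c·l). The sketch below assumes the commonly quoted molar absorption coefficient of about 6220 M⁻¹ cm⁻¹ for NADH and NADPH at 340 nm and a 1 cm cuvette; neither value is stated in the text above, so both should be checked against the instrument and conditions actually used.

```python
# Assumed value (not from the text): molar absorption coefficient of NAD(P)H at 340 nm.
EPSILON_NADH_340 = 6220.0  # M^-1 cm^-1


def nadh_rate_um_per_min(delta_a340_per_min, path_length_cm=1.0,
                         epsilon=EPSILON_NADH_340):
    """Convert a change in A340 per minute into a NAD(P)H concentration change (µM/min).

    Beer-Lambert: A = epsilon * c * l, so dc/dt = (dA/dt) / (epsilon * l).
    A negative result means the coenzyme is being consumed.
    """
    molar_per_min = delta_a340_per_min / (epsilon * path_length_cm)
    return molar_per_min * 1e6


# An absorbance falling by 0.0622 per minute corresponds to roughly
# 10 µM of NADH oxidized per minute in a 1 cm cuvette.
rate = nadh_rate_um_per_min(-0.0622)
```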

Fluorometric

Fluorescence is when a molecule emits light of one wavelength after absorbing light of a different wavelength. Fluorometric assays use a difference in the fluorescence of substrate from product to measure the enzyme reaction. These assays are in general much more sensitive than spectrophotometric assays, but can suffer from interference caused by impurities and the instability of many fluorescent compounds when exposed to light. An example of these assays is again the use of the nucleotide coenzymes NADH and NADPH. Here, the reduced forms are fluorescent and the oxidised forms non-fluorescent. Oxidation reactions can therefore be followed by a decrease in fluorescence and reduction reactions by an increase. Synthetic substrates that release a fluorescent dye in an enzyme-catalyzed reaction are also available, such as 4-methylumbelliferyl-β-D-galactoside for assaying β-galactosidase.

Calorimetric

Chemiluminescence of Luminol

Calorimetry is the measurement of the heat released or absorbed by chemical reactions. These assays are very general, since many reactions involve some change in heat and, with use of a microcalorimeter, not much enzyme or substrate is required. These assays can be used to measure reactions that are impossible to assay in any other way.

Chemiluminescent Chemiluminescence is the emission of light by a chemical reaction. Some enzyme reactions produce light and this can be measured to detect product formation. These types of assay can be extremely sensitive, since the light produced can be captured by photographic film over days or weeks, but can be hard to quantify, because not all the light released by a reaction will be detected.

The detection of horseradish peroxidase by enhanced chemiluminescence (ECL) is a common method of detecting antibodies in western blotting. Another example is the enzyme luciferase, which is found in fireflies and naturally produces light from its substrate luciferin.

Light Scattering

Static light scattering measures the product of weight-averaged molar mass and concentration of macromolecules in solution. Given a fixed total concentration of one or more species over the measurement time, the scattering signal is a direct measure of the weight-averaged molar mass of the solution, which will vary as complexes form or dissociate. Hence the measurement quantifies the stoichiometry of the complexes as well as their kinetics. Light scattering assays of protein kinetics are a very general technique that does not require an enzyme.

Microscale Thermophoresis

Microscale Thermophoresis (MST) measures the size, charge and hydration entropy of molecules/substrates in real time. The thermophoretic movement of a fluorescently labeled substrate changes significantly as it is modified by an enzyme. This enzymatic activity can be measured with high time resolution in real time. The material consumption of the all-optical MST method is very low: only 5 µl of sample volume and a 10 nM enzyme concentration are needed to measure the enzymatic rate constants for activity and inhibition. MST can measure the modification of two different substrates at once (multiplexing) if the substrates are labeled with different fluorophores. Thus substrate competition experiments can be performed.

Discontinuous assays Discontinuous assays are when samples are taken from an enzyme reaction at intervals and the amount of product production or substrate consumption is measured in these samples.

Radiometric

Radiometric assays measure the incorporation of radioactivity into substrates or its release from substrates. The radioactive isotopes most frequently used in these assays are ¹⁴C, ³²P, ³⁵S and ¹²⁵I. Since radioactive isotopes can allow the specific labelling of a single atom of a substrate, these assays are both extremely sensitive and specific. They are frequently used in biochemistry and are often the only way of measuring a specific reaction in crude extracts (the complex mixtures of enzymes produced when you lyse cells). Radioactivity is usually measured in these procedures using a scintillation counter.

Chromatographic Chromatographic assays measure product formation by separating the reaction mixture into its components by chromatography. This is usually done by high-performance liquid chromatography (HPLC), but can also use the simpler technique of thin layer chromatography. Although this approach can need a lot of material, its sensitivity can be increased by labelling the substrates/products with a radioactive or fluorescent tag. Assay sensitivity has also been increased by switching protocols to improved chromatographic instruments (e.g. ultra-high pressure liquid chromatography) that operate at pump pressure a few-fold higher than HPLC instruments.

Factors to control in assays

• Salt concentration: Most enzymes cannot tolerate extremely high salt concentrations. The ions interfere with the weak ionic bonds of proteins. Typical enzymes are active in salt concentrations of 1-500 mM. As usual there are exceptions, such as the halophilic (salt-loving) algae and bacteria.

• Effects of temperature: All enzymes work within a range of temperature specific to the organism. Increases in temperature generally lead to increases in reaction rates. There is a limit to the increase, because beyond it higher temperatures lead to a sharp decrease in reaction rates. This is due to the denaturation (alteration) of protein structure resulting from the breakdown of the weak ionic and hydrogen bonding that stabilizes the three-dimensional structure of the enzyme active site. The "optimum" temperature for human enzymes is usually between 35 and 40 °C. Normal human body temperature is 37 °C. Human enzymes start to denature quickly at temperatures above 40 °C. Enzymes from thermophilic archaea found in hot springs are stable up to 100 °C. However, the idea of an "optimum" rate of an enzyme reaction is misleading, as the rate observed at any temperature is the product of two rates, the reaction rate and the denaturation rate. If you were to use an assay measuring activity for one second, it would give high activity at high temperatures; however, if you were to use an assay measuring product formation over an hour, it would give you low activity at these temperatures.

• Effects of pH: Most enzymes are sensitive to pH and have specific ranges of activity. All have an optimum pH. The pH can stop enzyme activity by denaturing (altering) the three-dimensional shape of the enzyme through the breaking of ionic and hydrogen bonds. Most enzymes function between a pH of 6 and 8; however, pepsin in the stomach works best at a pH of 2 and trypsin at a pH of 8.

• Substrate saturation: Increasing the substrate concentration increases the rate of reaction (enzyme activity). However, enzyme saturation limits reaction rates. An enzyme is saturated when the active sites of all the molecules are occupied most of the time. At the saturation point, the reaction will not speed up, no matter how much additional substrate is added. The graph of the reaction rate will plateau.

• Level of crowding: Large amounts of macromolecules in a solution will alter the rates and equilibrium constants of enzyme reactions, through an effect called macromolecular crowding.
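The point made under temperature, that the apparent "optimum" depends on how long the assay runs, can be illustrated with a toy model. The sketch below assumes that both catalysis and denaturation follow Arrhenius behaviour and that denaturation is first order; every parameter value is invented for illustration and does not describe any real enzyme.

```python
import math

R = 8.314  # gas constant, J mol^-1 K^-1


def arrhenius(k_ref, e_act, temp_k, t_ref_k):
    """Rate constant at temp_k, given its value k_ref at the reference temperature."""
    return k_ref * math.exp(-e_act / R * (1.0 / temp_k - 1.0 / t_ref_k))


def product_formed(temp_k, duration_s,
                   k_cat_ref=1.0,      # catalytic rate at 40 C (arbitrary units, assumed)
                   e_cat=50e3,         # activation energy of catalysis, J/mol (assumed)
                   k_den_ref=1e-4,     # denaturation rate at 40 C, s^-1 (assumed)
                   e_den=250e3,        # activation energy of denaturation, J/mol (assumed)
                   t_ref_k=313.15):
    """Product formed while active enzyme decays as exp(-k_den * t).

    Integrating k_cat * exp(-k_den * t) from 0 to duration_s gives
    k_cat * (1 - exp(-k_den * duration)) / k_den.
    """
    k_cat = arrhenius(k_cat_ref, e_cat, temp_k, t_ref_k)
    k_den = arrhenius(k_den_ref, e_den, temp_k, t_ref_k)
    return k_cat * (1.0 - math.exp(-k_den * duration_s)) / k_den


warm, hot = 313.15, 343.15                                       # 40 C and 70 C
short_assay = (product_formed(warm, 1), product_formed(hot, 1))
long_assay = (product_formed(warm, 3600), product_formed(hot, 3600))
```

With these assumed parameters the one-second assay yields more product at 70 °C, while the one-hour assay yields far more at 40 °C, because at 70 °C nearly all of the enzyme is denatured within the first few seconds.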

Chapter- 5

Protein Nuclear Magnetic Resonance Spectroscopy

Pacific Northwest National Laboratory's high magnetic field (800 MHz) NMR spectrometer being loaded with a sample.

Protein nuclear magnetic resonance spectroscopy (usually abbreviated protein NMR) is a field of structural biology in which NMR spectroscopy is used to obtain information about the structure and dynamics of proteins. The field was pioneered by Richard R. Ernst and Kurt Wüthrich, among others. Protein NMR techniques are continually being used and improved in both academia and the biotech industry. Structure determination by NMR spectroscopy usually consists of several consecutive phases, each using a separate set of highly specialized techniques. The sample is prepared, resonances are assigned, restraints are generated and a structure is calculated and validated.

Sample preparation

The NMR sample is prepared in a thin walled glass tube

Protein nuclear magnetic resonance is performed on aqueous samples of highly purified protein. Usually the sample consists of between 300 and 600 microlitres with a protein concentration in the range 0.1 – 3 millimolar. The protein can be obtained either from a natural source or from an expression system using recombinant DNA techniques. Recombinantly expressed proteins are usually easier to produce in sufficient quantity, and recombinant expression makes isotopic labelling possible.

The purified protein is usually dissolved in a buffer solution and adjusted to the desired solvent conditions. The NMR sample is prepared in a thin walled glass tube.

Data collection

Protein NMR utilizes multidimensional nuclear magnetic resonance experiments to obtain information about the protein. Ideally, each distinct nucleus in the molecule experiences a distinct chemical environment and thus has a distinct chemical shift by which it can be recognized. However, in large molecules such as proteins the number of resonances can typically be several thousand and a one-dimensional spectrum inevitably has incidental overlaps. Therefore multidimensional experiments are performed which correlate the frequencies of distinct nuclei. The additional dimensions decrease the chance of overlap and have a larger information content since they correlate signals from nuclei within a specific part of the molecule. Magnetization is transferred into the sample using pulses of electromagnetic (radiofrequency) energy and between nuclei using delays; the process is described with so-called pulse sequences. Pulse sequences allow the experimenter to investigate and select specific types of connections between nuclei.

The array of nuclear magnetic resonance experiments used on proteins falls into two main categories: one where magnetization is transferred through the chemical bonds, and one where the transfer is through space, irrespective of the bonding structure. The first category is used to assign the different chemical shifts to a specific nucleus, and the second is primarily used to generate the distance restraints used in the structure calculation, and in the assignment with unlabelled protein.

Depending on the concentration of the sample, on the magnetic field of the spectrometer, and on the type of experiment, a single multidimensional nuclear magnetic resonance experiment on a protein sample may take hours or even several days to obtain a suitable signal-to-noise ratio through signal averaging, and to allow for sufficient evolution of magnetization transfer through the various dimensions of the experiment. Other things being equal, higher-dimensional experiments will take longer than lower-dimensional experiments.

Typically the first experiment to be measured with an isotope-labelled protein is a 2D heteronuclear single quantum correlation (HSQC) spectrum, where "heteronuclear" refers to nuclei other than 1H. In theory the heteronuclear single quantum correlation has one peak for each H bound to a heteronucleus. Thus in the 15N-HSQC one signal is expected for each amino acid residue, with the exception of proline, which has no amide hydrogen due to the cyclic nature of its backbone. Tryptophan and certain other residues with N-containing sidechains also give rise to additional signals. The 15N-HSQC is often referred to as the fingerprint of a protein because each protein has a unique pattern of signal positions. Analysis of the 15N-HSQC allows researchers to evaluate whether the expected number of peaks is present and thus to identify possible problems due to multiple conformations or sample heterogeneity. The relatively quick heteronuclear single quantum correlation experiment helps determine the feasibility of doing subsequent longer, more expensive, and more elaborate experiments. It is not possible to assign peaks to specific atoms from the heteronuclear single quantum correlation alone.
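The expected number of backbone peaks in a 15N-HSQC follows directly from the rule above: one per residue, except proline. A minimal sketch, using a made-up one-letter sequence; the optional `skip_n_terminus` flag reflects the common observation that the N-terminal amine is usually not seen, which is an assumption added here and not a statement from the text.

```python
def expected_backbone_hsqc_peaks(sequence, skip_n_terminus=False):
    """Count residues expected to give a backbone amide peak in a 15N-HSQC.

    Every residue except proline ("P") contributes one peak. Side-chain
    signals (e.g. from tryptophan) are extra and are not counted here.
    """
    residues = sequence.upper()
    if skip_n_terminus:
        residues = residues[1:]  # assumption: N-terminal amine not observed
    return sum(1 for aa in residues if aa != "P")


# Hypothetical 8-residue peptide with two prolines.
peaks = expected_backbone_hsqc_peaks("MKPLAPGW")                        # 6
peaks_no_nterm = expected_backbone_hsqc_peaks("MKPLAPGW", skip_n_terminus=True)  # 5
```

Comparing such a count with the number of peaks actually observed is the kind of quick check described above for spotting multiple conformations or sample heterogeneity.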

Resonance assignment In order to analyze the nuclear magnetic resonance data, it is important to get a resonance assignment for the protein. That is to find out which chemical shift corresponds to which atom. This is typically achieved by sequential walking using information derived from several different types of NMR experiment. The exact procedure depends on whether the protein is isotopically labelled or not, since a lot of the assignment experiments depend on carbon-13 and nitrogen-15.

Comparison of COSY and TOCSY 2D spectra for an amino acid like glutamate or methionine. The TOCSY shows off-diagonal crosspeaks between all protons in the spectrum, but the COSY only has crosspeaks between neighbours.

Homonuclear nuclear magnetic resonance

With unlabelled protein the usual procedure is to record a set of two-dimensional homonuclear nuclear magnetic resonance experiments, including conventional correlation spectroscopy (COSY), total correlation spectroscopy (TOCSY) and nuclear Overhauser effect spectroscopy (NOESY). A two-dimensional nuclear magnetic resonance experiment produces a two-dimensional spectrum. The units of both axes are chemical shifts. The COSY and TOCSY transfer magnetization through the chemical bonds between adjacent protons. The conventional correlation spectroscopy experiment is only able to transfer magnetization between protons on adjacent atoms, whereas in the total correlation spectroscopy experiment the protons are able to relay the magnetization, so it is transferred among all the protons that are connected by adjacent atoms. Thus in conventional correlation spectroscopy, an alpha proton transfers magnetization to the beta protons, the beta protons transfer to the alpha and gamma protons, if any are present, then the gamma protons transfer to the beta and the delta protons, and the process continues. In total correlation spectroscopy, the alpha and all the other protons are able to transfer magnetization to the beta, gamma, delta and epsilon protons if they are connected by a continuous chain of protons. The continuous chain of protons is the sidechain of an individual amino acid. Thus these two experiments are used to build so-called spin systems, that is, to build a list of the chemical shifts of the peptide proton, the alpha protons and all the protons from each residue's sidechain. Which chemical shifts correspond to which nuclei in the spin system is determined by the conventional correlation spectroscopy connectivities and the fact that different types of protons have characteristic chemical shifts.

To connect the different spin systems in a sequential order, the nuclear Overhauser effect spectroscopy experiment has to be used. Because this experiment transfers magnetization through space, it will show crosspeaks for all protons that are close in space regardless of whether they are in the same spin system or not. The neighbouring residues are inherently close in space, so the assignments can be made from the NOESY peaks that connect one spin system to another. One important problem with homonuclear nuclear magnetic resonance is overlap between peaks. This occurs when different protons have the same or very similar chemical shifts. This problem becomes greater as the protein becomes larger, so homonuclear nuclear magnetic resonance is usually restricted to small proteins or peptides.
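The difference between the COSY and TOCSY connectivity patterns described above can be sketched by treating a spin system as a simple linear chain of proton positions. This is an idealization (real sidechains branch, and peaks can be missing or overlapped); the proton labels are illustrative.

```python
from itertools import combinations


def cosy_crosspeaks(chain):
    """Crosspeaks only between protons on adjacent positions in the chain."""
    return {frozenset(pair) for pair in zip(chain, chain[1:])}


def tocsy_crosspeaks(chain):
    """Crosspeaks between every pair of protons in the same continuous chain."""
    return {frozenset(pair) for pair in combinations(chain, 2)}


# Idealized spin system: amide, alpha, beta and gamma protons of one residue.
spin_system = ["HN", "HA", "HB", "HG"]
cosy = cosy_crosspeaks(spin_system)     # HN-HA, HA-HB, HB-HG
tocsy = tocsy_crosspeaks(spin_system)   # all six pairs, including HN-HG
```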

Nitrogen-15 nuclear magnetic resonance

Schematic of an HNCA and HNCOCA for four sequential residues. The nitrogen-15 dimension is perpendicular to the screen. Each window is focused on the nitrogen chemical shift of that amino acid. The sequential assignment is made by matching the alpha carbon chemical shifts. In the HNCA each residue sees the alpha carbon of itself and the preceding residue. The HNCOCA only sees the alpha carbon of the preceding residue.

Carbon-13 and nitrogen-15 nuclear magnetic resonance

When the protein is labelled with carbon-13 and nitrogen-15 it is possible to record an experiment that transfers magnetisation over the peptide bond, and thus connect different spin systems through bonds. This is usually done using some of the following experiments: HNCO, HNCACO, HNCA, HNCOCA, HNCACB and CBCACONH. All six experiments consist of an HSQC plane expanded with a carbon dimension. In the HNCACO the spectrum contains peaks at the chemical shifts of the carbonyl carbons in the residue of the HSQC peak and the previous one in the sequence. The HNCO only contains the chemical shift from the previous residue, and it is thus possible to assign the carbonyl carbon shifts that correspond to each HSQC peak and the one previous to that one. Sequential assignment can then be undertaken by matching the shifts of each spin system's own and previous carbons. The HNCA and HNCOCA work similarly, just with the alpha carbons rather than the carbonyls, and the HNCACB and the CBCACONH contain both the alpha carbon and the beta carbon. Usually several of these experiments are required to resolve overlap in the carbon dimension. This procedure is usually less ambiguous than the NOESY-based method, since it is based on through-bond transfer. In the NOESY-based methods, additional peaks from protons that are close in space but do not belong to sequential residues will appear, confusing the assignment process. When the sequential assignment has been made it is usually possible to assign the sidechains using HCCH-TOCSY, which is basically a TOCSY experiment resolved in an additional carbon dimension.
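The matching of "own" and "previous" carbon shifts can be sketched as follows. Each spin system is assumed to carry its own alpha carbon shift and the alpha carbon shift of the preceding residue (as would be read from an HNCA/HNCOCA pair); the shift values, the spin system names and the 0.2 ppm tolerance are all invented for the example.

```python
def link_spin_systems(spin_systems, tolerance=0.2):
    """Find sequential links between spin systems by matching alpha carbon shifts.

    spin_systems maps a name to (own_ca_shift, previous_ca_shift) in ppm.
    Returns (preceding, following) pairs where the following system's
    "previous" shift matches the preceding system's own shift.
    """
    links = []
    for name, (_, ca_previous) in spin_systems.items():
        for candidate, (ca_own, _) in spin_systems.items():
            if candidate != name and abs(ca_own - ca_previous) <= tolerance:
                links.append((candidate, name))
    return links


# Hypothetical shifts (ppm): B follows A, and C follows B.
systems = {
    "A": (58.1, 52.3),
    "B": (62.4, 58.0),
    "C": (55.0, 62.5),
}
links = link_spin_systems(systems)   # [("A", "B"), ("B", "C")]
```

When two candidates match within the tolerance, the function returns both links, which mirrors the overlap problem mentioned above and is why several carbon types are usually matched together.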

Restraint generation

In order to make structure calculations, a number of experimentally determined restraints have to be generated. These fall into different categories; the most widely used are distance restraints and angle restraints.

Distance restraints A crosspeak in a NOESY experiment signifies spatial proximity between the two nuclei in question. Thus each peak can be converted into a maximum distance between the nuclei, usually between 1.8 and 6 angstroms. The intensity of a NOESY peak is proportional to the distance to the minus 6th power, so the distance is determined according to the intensity of the peak. The intensity-distance relationship is not exact, so usually a distance range is used. It is of great importance to assign the NOESY peaks to the correct nuclei based on the chemical shifts. If this task is performed manually it is usually very labor intensive, since proteins usually have thousands of NOESY peaks. Some computer programs such as UNIO, CYANA and ARIA/CNS perform this task automatically on manually pre-processed listings of peak positions and peak volumes, coupled to a structure calculation. So far, only the ATNOS/CANDID approach implemented in the UNIO software package works directly on the raw NOESY data, without the need for iteratively refined peak lists, which is intended to make the NOESY spectral analysis more objective and efficient. To obtain as accurate assignments as possible it is a great advantage to have access to carbon-13 and nitrogen-15 NOESY experiments, since they help to resolve overlap in the proton dimension. This leads to faster and more reliable assignments, and in turn to better structures.
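The conversion from peak intensity to a distance range described above can be sketched as follows. This is a minimal illustration, assuming a single reference peak of known distance is used for calibration; the function names, the 25% padding of the upper bound and the default bounds are assumptions made for the example, not the procedure of any particular program.

```python
def noe_distance(intensity, ref_intensity, ref_distance):
    """Estimate a distance (angstroms) from a NOESY peak intensity.

    Uses intensity ~ distance**-6, calibrated against a reference peak
    whose distance is known (an assumed calibration strategy).
    """
    if intensity <= 0 or ref_intensity <= 0:
        raise ValueError("intensities must be positive")
    return ref_distance * (ref_intensity / intensity) ** (1.0 / 6.0)


def distance_restraint(intensity, ref_intensity, ref_distance,
                       lower=1.8, tolerance=0.25, cap=6.0):
    """Return (lower, upper) bounds in angstroms.

    The upper bound is the estimated distance padded by `tolerance`
    (an assumed fraction) and capped at `cap`, because the
    intensity-distance relationship is not exact.
    """
    r = noe_distance(intensity, ref_intensity, ref_distance)
    upper = min(cap, r * (1.0 + tolerance))
    return lower, max(lower, upper)
```

A peak 64 times weaker than a reference peak at 2 angstroms is placed at about 4 angstroms, which gives a restraint of roughly 1.8 to 5 angstroms with these settings.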

Angle restraints In addition to distance restraints, restraints on the torsion angles of the chemical bonds, typically the psi and phi angles, can be generated. One approach is to use the Karplus equation to generate angle restraints from coupling constants. Another approach uses the chemical shifts to generate angle restraints. Both methods use the fact that the geometry around the alpha carbon affects the coupling constants and chemical shifts, so given the coupling constants or the chemical shifts, a qualified guess can be made about the torsion angles.
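As an illustration of the coupling-constant route, the Karplus equation has the form J = A cos²(θ) + B cos(θ) + C. The sketch below applies it to the HN-Hα coupling, taking θ = φ - 60°. The coefficients are an assumed, commonly quoted parameterization and are not taken from this text; the function names are likewise chosen for the example.

```python
import math

# Assumed Karplus coefficients in hertz for the HN-HA coupling.
# They are illustrative; use the values recommended by your own reference.
A, B, C = 6.51, -1.76, 1.60


def karplus_j(phi_deg, a=A, b=B, c=C, offset_deg=-60.0):
    """Predicted three-bond coupling constant (Hz) for a given phi angle."""
    theta = math.radians(phi_deg + offset_deg)
    return a * math.cos(theta) ** 2 + b * math.cos(theta) + c


def phi_candidates(j_obs, tol=0.5, step=1.0):
    """Scan phi from -180 to 180 degrees and return every value whose
    predicted coupling lies within `tol` Hz of the observed coupling."""
    n = int(round(360.0 / step))
    angles = [-180.0 + i * step for i in range(n)]
    return [phi for phi in angles if abs(karplus_j(phi) - j_obs) <= tol]
```

Because the curve is not one-to-one, a single coupling constant usually maps to several candidate ranges of φ, which is why such restraints are normally expressed as allowed intervals rather than single values.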

Orientation restraints

The blue arrows represent the orientation of the N-H bond of selected peptide bonds. By determining the orientation of a sufficient number of bonds relative to the external magnetic field, the structure of the protein can be determined. From PDB record 1KBH.

The analyte molecules in a sample can be partially ordered with respect to the external magnetic field of the spectrometer by manipulating the sample conditions. Common techniques include addition of bacteriophages or bicelles to the sample, or preparation of the sample in a stretched polyacrylamide gel. This creates a local environment that favours certain orientations of nonspherical molecules. Normally in solution NMR the dipolar couplings between nuclei are averaged out because of the fast tumbling of the molecule. The slight overpopulation of one orientation means that a residual dipolar coupling remains to be observed. The dipolar coupling is commonly used in solid-state NMR and provides information about the orientation of the bond vectors relative to a single global reference frame. Typically the orientation of the N-H vector is probed in an HSQC-like experiment. Initially residual dipolar couplings were used for refinement of previously determined structures, but attempts at de novo structure determination have also been made.

Hydrogen-Deuterium exchange NMR spectroscopy is nucleus-specific. Thus it can distinguish between hydrogen and deuterium. The amide protons in the protein exchange readily with the solvent, and if the solvent contains a different isotope, typically deuterium, the reaction can be monitored by NMR spectroscopy. How rapidly a given amide exchanges reflects its solvent accessibility. Thus amide exchange rates can give information on which parts of the protein are buried, hydrogen bonded, etc. A common application is to compare the exchange of a free form versus a complex. The amides that become protected in the complex are assumed to be in the interaction interface.

Structure calculation

Nuclear magnetic resonance structure determination generates an ensemble of structures. The structures will only converge if the data are sufficient to dictate a specific fold. In the structures shown, this is the case for only a part of the structure. From PDB 1SSU.

The experimentally determined restraints can be used as input for the structure calculation process. Researchers, using computer programs such as CYANA or XPLOR-NIH, attempt to satisfy as many of the restraints as possible, in addition to general properties of proteins such as bond lengths and angles. The algorithms convert the restraints and the general protein properties into energy terms, and then try to minimize the energy. The process results in an ensemble of structures that, if the data were sufficient to dictate a certain fold, will converge.
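How a distance restraint becomes an energy term can be sketched with a flat-bottomed harmonic penalty: zero inside the allowed range and quadratic outside it. This functional form, the force constant `k` and the data layout are assumptions made for illustration; CYANA and XPLOR-NIH use their own target functions.

```python
import math


def restraint_energy(coords, restraints, k=1.0):
    """Sum of flat-bottomed harmonic penalties over distance restraints.

    coords     -- dict mapping an atom label to an (x, y, z) tuple
    restraints -- iterable of (atom_i, atom_j, lower, upper) in angstroms
    k          -- assumed force constant (arbitrary energy units)
    """
    total = 0.0
    for atom_i, atom_j, lower, upper in restraints:
        d = math.dist(coords[atom_i], coords[atom_j])  # Python 3.8+
        if d < lower:
            total += k * (lower - d) ** 2
        elif d > upper:
            total += k * (d - upper) ** 2
        # inside [lower, upper] the restraint is satisfied: no penalty
    return total
```

A minimizer would move the coordinates to lower this sum together with the covalent geometry terms; repeating the calculation from different starting points yields the ensemble described above.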

Dynamics In addition to structures, nuclear magnetic resonance can yield information on the dynamics of various parts of the protein. This usually involves measuring relaxation times such as T1 and T2 to determine order parameters, correlation times, and chemical exchange rates. NMR relaxation is a consequence of local fluctuating magnetic fields within a molecule. Local fluctuating magnetic fields are generated by molecular motions. In this way measurements of relaxation times can provide information on motions within a molecule at the atomic level. In NMR studies of protein dynamics the nitrogen-15 isotope is the preferred nucleus to study because its relaxation times are relatively simple to relate to molecular motions. This, however, requires isotope labeling of the protein. The T1 and T2 relaxation times can be measured using various types of HSQC-based experiments. The types of motions which can be detected are motions that occur on a time-scale ranging from about 10 picoseconds to about 10 nanoseconds. In addition, slower motions, which take place on a time-scale ranging from about 10 microseconds to 100 milliseconds, can also be studied. However, since nitrogen atoms are mainly found in the backbone of a protein, the results mainly reflect the motions of the backbone, which is the most rigid part of a protein molecule. Thus, the results obtained from nitrogen-15 relaxation measurements may not be representative of the whole protein. Therefore techniques utilizing relaxation measurements of carbon-13 and deuterium have recently been developed, which enable systematic studies of motions of the amino acid side chains in proteins.

NMR spectroscopy on large proteins Traditionally nuclear magnetic resonance spectroscopy has been limited to relatively small proteins or protein domains. This is in part caused by problems resolving overlapping peaks in larger proteins, but this has been alleviated by the introduction of isotope labelling and multidimensional experiments. Another more serious problem is the fact that in large proteins the magnetization relaxes faster, which means there is less time to detect the signal. This in turn causes the peaks to become broader and weaker, and eventually disappear. Two techniques have been introduced to attenuate the relaxation: transverse relaxation optimized spectroscopy (TROSY) and deuteration of proteins. By using these techniques it has been possible to study proteins in complex with the 900 kDa chaperone GroES-GroEL.

Automation of the process Structure determination by NMR has traditionally been a time-consuming process, requiring interactive analysis of the data by a highly trained scientist. There has been considerable interest in automating the process to increase the throughput of structure determination and to make protein NMR accessible to non-experts. The two most time-consuming processes involved are the sequence-specific resonance assignment (backbone and side-chain assignment) and the NOE assignment tasks. Several different computer programs have been published that target individual parts of the overall NMR structure determination process in an automated fashion. Most progress has been achieved for the task of automated NOE assignment. So far, only the FLYA and the UNIO approaches have been proposed to perform the entire protein NMR structure determination process in an automated manner without any human intervention. Efforts have also been made to standardize the structure calculation protocol to make it quicker and more amenable to automation.

Chapter- 6

Protein Structure Prediction and Protein Sequencing

Protein structure prediction Protein structure prediction is the prediction of the three-dimensional structure of a protein from its amino acid sequence — that is, the prediction of its secondary, tertiary, and quaternary structure from its primary structure. Structure prediction is fundamentally different from the inverse problem of protein design. Protein structure prediction is one of the most important goals pursued by bioinformatics and theoretical chemistry; it is highly important in medicine (for example, in drug design) and biotechnology (for example, in the design of novel enzymes). Every two years, the performance of current methods is assessed in the CASP experiment (Critical Assessment of Techniques for Protein Structure Prediction).

Secondary structure Secondary structure prediction is a set of techniques in bioinformatics that aim to predict the local secondary structures of proteins and RNA sequences based only on knowledge of their primary structure - amino acid or nucleotide sequence, respectively. For proteins, a prediction consists of assigning regions of the amino acid sequence as likely alpha helices, beta strands (often noted as "extended" conformations), or turns. The success of a prediction is determined by comparing it to the results of the DSSP algorithm applied to the crystal structure of the protein; for nucleic acids, it may be determined from the hydrogen bonding pattern. Specialized algorithms have been developed for the detection of specific well-defined patterns such as transmembrane helices and coiled coils in proteins, or canonical microRNA structures in RNA. The best modern methods of secondary structure prediction in proteins reach about 80% accuracy; this high accuracy allows the use of the predictions in fold recognition and ab initio protein structure prediction, classification of structural motifs, and refinement of sequence alignments. The accuracy of current protein secondary structure prediction methods is assessed in weekly benchmarks such as LiveBench and EVA.
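The accuracy figures quoted for secondary structure prediction are usually per-residue three-state accuracies, often called Q3. A minimal sketch follows, assuming the prediction and the DSSP-derived reference have already been reduced to the same three one-letter states (H, E, C); that reduction and the function name are assumptions of the example.

```python
def q3_accuracy(predicted, observed):
    """Percentage of residues whose predicted state matches the reference.

    Both arguments are equal-length strings over the states
    H (helix), E (strand) and C (coil or other).
    """
    if len(predicted) != len(observed):
        raise ValueError("strings must have the same length")
    if not observed:
        raise ValueError("empty input")
    matches = sum(p == o for p, o in zip(predicted, observed))
    return 100.0 * matches / len(observed)
```

For example, `q3_accuracy("HHHHEEEECC", "HHHHEEECCC")` gives 90.0, because nine of the ten residues agree.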

Background Early methods of secondary structure prediction, introduced in the 1960s and early 1970s, focused on identifying likely alpha helices and were based mainly on helix-coil transition models. Significantly more accurate predictions that included beta sheets were introduced in the 1970s and relied on statistical assessments based on probability parameters derived from known solved structures. These methods, applied to a single sequence, are typically at most about 60-65% accurate, and often underpredict beta sheets. The evolutionary conservation of secondary structures can be exploited by simultaneously assessing many homologous sequences in a multiple sequence alignment, by calculating the net secondary structure propensity of an aligned column of amino acids. In concert with larger databases of known protein structures and modern machine learning methods such as neural nets and support vector machines, these methods can achieve up to 80% overall accuracy in globular proteins. The theoretical upper limit of accuracy is around 90%, partly due to idiosyncrasies in DSSP assignment near the ends of secondary structures, where local conformations vary under native conditions but may be forced to assume a single conformation in crystals due to packing constraints. Limitations are also imposed by secondary structure prediction's inability to account for tertiary structure; for example, a sequence predicted as a likely helix may still be able to adopt a beta-strand conformation if it is located within a beta-sheet region of the protein and its side chains pack well with their neighbors. Dramatic conformational changes related to the protein's function or environment can also alter local secondary structure.

Chou-Fasman method The Chou-Fasman method was among the first secondary structure prediction algorithms developed and relies predominantly on probability parameters determined from relative frequencies of each amino acid's appearance in each type of secondary structure. The original Chou-Fasman parameters, determined from the small sample of structures solved in the mid-1970s, produce poor results compared to modern methods, though the parameterization has been updated since it was first published. The Chou-Fasman method is roughly 50-60% accurate in predicting secondary structures.
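The probability parameters that the method relies on can be illustrated by computing a propensity for each amino acid: the fraction of that residue found in a given state divided by the fraction of all residues in that state. The sketch below covers only this parameter-estimation idea, not the nucleation and extension rules of the full method, and the function name and input layout are assumptions.

```python
from collections import Counter


def propensities(sequences, structures, state="H"):
    """Propensity of each amino acid for one secondary structure state.

    sequences  -- list of amino acid strings
    structures -- list of equal-length state strings (for example H/E/C)
    A value above 1 means the residue is enriched in `state`.
    """
    total = Counter()
    in_state = Counter()
    for seq, ss in zip(sequences, structures):
        if len(seq) != len(ss):
            raise ValueError("sequence and structure lengths differ")
        for aa, s in zip(seq, ss):
            total[aa] += 1
            if s == state:
                in_state[aa] += 1
    n_state = sum(in_state.values())
    if n_state == 0:
        raise ValueError("state never observed in the training data")
    overall = n_state / sum(total.values())
    return {aa: (in_state[aa] / total[aa]) / overall for aa in total}
```

With a realistic set of solved structures in place of toy input, values of this kind are what a propensity-based predictor averages over a stretch of residues.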

GOR method The GOR method, named for the three scientists who developed it - Garnier, Osguthorpe, and Robson - is an information theory-based method developed not long after Chou-Fasman. It uses the more powerful probabilistic techniques of Bayesian inference. The method is a specific optimized application of mathematics and algorithms developed in a series of papers by Robson and colleagues. The GOR method is capable of continued extension by such principles, and has gone through several versions. The GOR method takes into account not only the probability of each amino acid having a particular secondary structure, but also the conditional probability of the amino acid assuming each structure given the contributions of its neighbors (it does not assume that the neighbors have that same structure). The approach is both more sensitive and more accurate than that of Chou and Fasman because amino acid structural propensities are only strong for a small number of amino acids such as proline and glycine. Weak contributions from each of many neighbors can add up to a strong effect overall. The original GOR method was roughly 65% accurate and was dramatically more successful in predicting alpha helices than beta sheets, which it frequently mispredicted as loops or disorganized regions. Later GOR methods also considered pairs of amino acids, significantly improving performance. The major difference from the following technique is perhaps that the weights in an implied network of contributing terms are assigned a priori, from statistical analysis of proteins of known structure, not by feedback to optimize agreement with a training set of such proteins.
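The idea of adding up weak contributions from neighboring residues can be sketched as a sliding-window sum. The table `info`, its keys, and the window of eight residues on either side are assumptions for illustration; in a real implementation the values would be derived from statistics on proteins of known structure.

```python
def gor_predict(sequence, info, states="HEC", half_window=8):
    """Predict one state per residue from summed neighbor contributions.

    info maps (state, offset, residue) to a numeric contribution, where
    offset is the neighbor position relative to the residue being
    predicted. Missing entries count as zero (hypothetical layout).
    """
    prediction = []
    for i in range(len(sequence)):
        scores = {}
        for state in states:
            total = 0.0
            for offset in range(-half_window, half_window + 1):
                j = i + offset
                if 0 <= j < len(sequence):
                    total += info.get((state, offset, sequence[j]), 0.0)
            scores[state] = total
        # ties are resolved in favour of the state listed first
        prediction.append(max(states, key=lambda s: scores[s]))
    return "".join(prediction)
```

Note that a neighbor contributes according to its own identity and position only, which matches the statement that neighbors are not assumed to share the predicted structure.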

Machine learning Neural network methods use training sets of solved structures to identify common sequence motifs associated with particular arrangements of secondary structures. These methods are over 70% accurate in their predictions, although beta strands are still often underpredicted due to the lack of three-dimensional structural information that would allow assessment of hydrogen bonding patterns that can promote formation of the extended conformation required for the presence of a complete beta sheet. Support vector machines have proven particularly useful for predicting the locations of turns, which are difficult to identify with statistical methods. The requirement of relatively small training sets has also been cited as an advantage to avoid overfitting to existing structural data. Extensions of machine learning techniques attempt to predict more fine-grained local properties of proteins, such as backbone dihedral angles in unassigned regions. Both SVMs and neural networks have been applied to this problem.

Other improvements It is reported that in addition to the protein sequence, secondary structure formation depends on other factors. For example, it is reported that secondary structure tendencies depend also on local environment, solvent accessibility of residues, protein structural class and even the organism from which the proteins are obtained. Based on such observations, some studies have shown that secondary structure prediction can be improved by addition of information about protein structural class, solvent accessibility and also contact number of residues. Sequence covariation methods rely on the existence of a data set composed of multiple homologous RNA sequences with related but dissimilar sequences. These methods analyze the covariation of individual base sites in evolution; maintenance at two widely separated sites of a pair of base-pairing nucleotides indicates the presence of a structurally required hydrogen bond between those positions. The general problem of pseudoknot prediction has been shown to be NP-complete.

Tertiary structure The practical role of protein structure prediction is now more important than ever. Massive amounts of protein sequence data are produced by modern large-scale DNA sequencing efforts such as the Human Genome Project. Despite community-wide efforts in structural genomics, the output of experimentally determined protein structures (typically by time-consuming and relatively expensive X-ray crystallography or NMR spectroscopy) is lagging far behind the output of protein sequences. Protein structure prediction remains an extremely difficult and unresolved undertaking. The two main problems are the calculation of protein free energy and finding the global minimum of this energy. A protein structure prediction method must explore the space of possible protein structures, which is astronomically large. These problems can be partially bypassed in "comparative" or homology modeling and fold recognition methods, in which the search space is pruned by the assumption that the protein in question adopts a structure that is close to the experimentally determined structure of another homologous protein. On the other hand, the de novo or ab initio protein structure prediction methods must explicitly resolve these problems.

Ab initio protein modelling Ab initio- or de novo- protein modelling methods seek to build three-dimensional protein models "from scratch", i.e., based on physical principles rather than (directly) on previously solved structures. There are many possible procedures that either attempt to mimic protein folding or apply some stochastic method to search possible solutions (i.e., global optimization of a suitable energy function). These procedures tend to require vast computational resources, and have thus only been carried out for tiny proteins. To predict protein structure de novo for larger proteins will require better algorithms and larger computational resources like those afforded by either powerful supercomputers (such as Blue Gene or MDGRAPE-3) or distributed computing (such as Folding@home, the Human Proteome Folding Project and Rosetta@Home). Although these computational barriers are vast, the potential benefits of structural genomics (by predicted or experimental methods) make ab initio structure prediction an active research field. As an intermediate step towards predicted protein structures, contact map predictions have been proposed.

Comparative protein modelling Comparative protein modelling uses previously solved structures as starting points, or templates. This is effective because it appears that although the number of actual proteins is vast, there is a limited set of tertiary structural motifs to which most proteins belong. It has been suggested that there are only around 2,000 distinct protein folds in nature, though there are many millions of different proteins. These methods may also be split into two groups:

• Homology modeling is based on the reasonable assumption that two homologous proteins will share very similar structures. Because a protein's fold is more evolutionarily conserved than its amino acid sequence, a target sequence can be modeled with reasonable accuracy on a very distantly related template, provided that the relationship between target and template can be discerned through sequence alignment. It has been suggested that the primary bottleneck in comparative modelling arises from difficulties in alignment rather than from errors in structure prediction given a known-good alignment. Unsurprisingly, homology modelling is most accurate when the target and template have similar sequences.
• Protein threading scans the amino acid sequence of an unknown structure against a database of solved structures. In each case, a scoring function is used to assess the compatibility of the sequence to the structure, thus yielding possible three-dimensional models. This type of method is also known as 3D-1D fold recognition due to its compatibility analysis between three-dimensional structures and linear protein sequences. This method has also given rise to methods performing an inverse folding search by evaluating the compatibility of a given structure with a large database of sequences, thus predicting which sequences have the potential to produce a given fold.

Side chain geometry prediction Accurate packing of the amino acid side chains represents a separate problem. Methods that specifically address the problem of predicting side chain geometry include dead-end elimination and the self-consistent mean field methods. The side chain conformations with low energy are usually determined on a rigid polypeptide backbone, using a set of discrete side chain conformations known as "rotamers" (conformational isomers). The methods attempt to identify the set of rotamers that minimizes the model's overall energy. These methods use rotamer libraries, which are collections of rotamers (favorable multi-angle conformations) for each residue type in proteins. Rotamer libraries may contain information about the conformation, its frequency, and the variance about mean dihedral angles, which can be used in sampling. Rotamer libraries are derived from structural bioinformatics or other statistical analysis of side-chain conformations in known experimental structures of proteins, such as by clustering the observed conformations for tetrahedral carbons near the staggered (60°, 180°, -60°) values. Rotamer libraries can be backbone-independent, secondary-structure-dependent, or backbone-dependent. Backbone-independent rotamer libraries make no reference to backbone conformation, and are calculated from all available side chains of a certain type (for instance, the first example of a rotamer library, done by Ponder and Richards at Yale in 1987). Secondary-structure-dependent libraries present different dihedral angles and/or rotamer frequencies for α-helix, β-sheet, or coil secondary structures. Backbone-dependent rotamer libraries present conformations and/or frequencies dependent on the local backbone conformation as defined by the backbone dihedral angles φ and ψ, regardless of secondary structure.

The modern versions of these "libraries" as used in most software are presented as multidimensional distributions of probability or frequency, where the peaks correspond to the dihedral-angle conformations considered as individual rotamers in the lists. Some versions are especially sensitive to the prohibited regions in that conformational space and are used primarily for structure validation, while others emphasize relative frequencies in the favorable regions and are the form used primarily for structure prediction, such as the Dunbrack rotamer "libraries". The side chain packing methods are most useful for analyzing the protein's hydrophobic core, where side chains are more closely packed; they have more difficulty addressing the looser constraints and higher flexibility of surface residues, which often occupy multiple rotamer conformations rather than just one.
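The simplest form of a backbone-independent rotamer library, in which observed side-chain dihedral angles are grouped around the staggered values, can be sketched as follows. The labels p, t and m for the +60°, 180° and -60° bins, the function names and the input format are assumptions for illustration; real libraries also record variances and handle several chi angles jointly.

```python
from collections import Counter

# Staggered reference values in degrees; the labels are illustrative.
STAGGERED = {"p": 60.0, "t": 180.0, "m": -60.0}


def angular_difference(a, b):
    """Smallest absolute difference between two angles, in degrees."""
    d = (a - b) % 360.0
    return min(d, 360.0 - d)


def classify_chi(chi_deg):
    """Assign a dihedral angle to the nearest staggered bin."""
    return min(STAGGERED,
               key=lambda name: angular_difference(chi_deg, STAGGERED[name]))


def rotamer_frequencies(chi_values):
    """Fraction of observations falling in each staggered bin."""
    if not chi_values:
        raise ValueError("no observations")
    counts = Counter(classify_chi(c) for c in chi_values)
    return {name: counts[name] / len(chi_values) for name in STAGGERED}
```

The angular difference wraps around at ±180°, so an angle of -170° is correctly assigned to the 180° bin.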

Prediction of structural classes Statistical methods have been developed for predicting structural classes of proteins based on their amino acid composition, pseudo amino acid composition and functional domain composition.

Quaternary structure In the case of complexes of two or more proteins, where the structures of the proteins are known or can be predicted with high accuracy, protein–protein docking methods can be used to predict the structure of the complex. Information on the effect of mutations at specific sites on the affinity of the complex helps to understand the complex structure and to guide docking methods.

Software MODELLER is a popular software tool for producing homology models using methodology derived from NMR spectroscopy data processing. SwissModel provides an automated web server for basic homology modeling. I-TASSER was ranked the best server for protein structure prediction in the 2006-2008 CASP experiments (CASP7 and CASP8). HHpred, bioinfo.pl, Robetta, and Phyre are widely used servers for protein structure prediction. HHsearch is a free software package for protein threading and remote homology detection. RAPTOR is protein threading software based on integer programming. The basic algorithm for threading is fairly straightforward to implement. Abalone is a molecular dynamics program for folding simulations with explicit or implicit water models.

TIP is a knowledgebase of STRUCTFAST models and precomputed similarity relationships between sequences, structures, and binding sites. Several distributed computing projects concerning protein structure prediction have also been implemented, such as Folding@home, Rosetta@home, the Human Proteome Folding Project, Predictor@home, and TANPAKU. The Foldit program seeks to investigate the pattern-recognition and puzzle-solving abilities inherent to the human mind in order to create more successful computer protein structure prediction software. Computational approaches provide a fast alternative route to antibody structure prediction. Recently developed high-resolution structure prediction algorithms for the antibody Fv region, such as RosettaAntibody, have been shown to generate high-resolution homology models which have been used for successful docking. Reviews of software for structure prediction are available in the literature. The progress and challenges in protein structure prediction have been reviewed in Zhang 2008.

Automatic structure prediction servers

A target structure (ribbons) and 354 template-based predictions superimposed (gray Cα backbones); from CASP8

CASP, which stands for Critical Assessment of Techniques for Protein Structure Prediction, is a community-wide, worldwide experiment for protein structure prediction that has taken place every two years since 1994. CASP provides research groups with an opportunity to objectively test their structure prediction methods and delivers an independent assessment of the state of the art in protein structure modeling to the research community and software users. Even though the primary goal of CASP is to help advance the methods of identifying protein three-dimensional structure from its amino acid sequence, many view the experiment more as a "world championship" in this field of science. More than 100 research groups from all over the world participate in CASP on a regular basis, and it is not uncommon for entire groups to suspend their other research for months while they focus on getting their servers ready for the experiment and on performing the detailed predictions.

Selection of target proteins In order to ensure that no predictor can have prior information about a protein's structure that would put them at an advantage, it is important that the experiment is conducted in a double-blind fashion: neither predictors nor the organizers and assessors know the structures of the target proteins at the time when predictions are made. Targets for structure prediction are either structures soon to be solved by X-ray crystallography or NMR spectroscopy, or structures that have just been solved (mainly by one of the structural genomics centers) and are kept on hold by the Protein Data Bank. If the given sequence is found to be related by common descent to a protein sequence of known structure (called a template), comparative protein modeling may be used to predict the tertiary structure. Templates can be found using sequence alignment methods such as BLAST or FASTA, or protein threading methods, which are better at finding distantly related templates. Otherwise, de novo protein structure prediction must be applied, which is much less reliable but can sometimes yield models with the correct fold. Truly new folds are becoming quite rare among the targets, making that category smaller than desirable.

Evaluation

Cumulative plot of α-carbon accuracy, of all predicted models for target T0398 in CASP8, with the two best models labeled

The primary method of evaluation is a comparison of the predicted model α-carbon positions with those in the target structure. The comparison is shown visually by cumulative plots of distances between pairs of equivalent α-carbons in the alignment of the model and the structure, such as shown in the figure (a perfect model would stay at zero all the way across), and is assigned a numerical score GDT-TS (Global Distance Test - Total Score) describing the percentage of well-modeled residues in the model with respect to the target. Free modeling (template-free, or de novo) is also evaluated visually by the assessors, since the numerical scores do not work as well for finding loose resemblances in the most difficult cases. High-accuracy template-based predictions were evaluated in CASP7 by whether they worked for molecular-replacement phasing of the target crystal structure, with successes followed up later, and in CASP8 by full-model (not just α-carbon) model quality and full-model match to the target. Evaluation of the results is carried out in the following prediction categories:

• tertiary structure prediction (all CASPs)
• secondary structure prediction (dropped after CASP5)
• prediction of structure complexes (CASP2 only; a separate experiment, CAPRI, carries on this subject)
• residue-residue contact prediction (starting CASP4)
• disordered regions prediction (starting CASP5)
• domain boundary prediction (CASP6-CASP8)
• function prediction (starting CASP6)
• model quality assessment (starting CASP7)
• model refinement (starting CASP7)
• high-accuracy template-based prediction (starting CASP7)

The tertiary structure prediction category was further subdivided into:

• homology modeling
• fold recognition (also called protein threading, although strictly speaking threading is a method rather than a category)
• de novo structure prediction, now referred to as 'New Fold' as many methods apply evaluation, or scoring, functions that are biased by knowledge of native protein structures, such as an artificial neural network.

Starting with CASP7, categories have been redefined to reflect developments in methods. The 'Template based modeling' category includes all former comparative modeling, homologous fold based models and some analogous fold based models. The 'Template free modeling' category includes models of proteins with previously unseen folds and hard analogous fold based models. The CASP results are published in special supplement issues of the scientific journal Proteins, all of which are accessible through the CASP website. A lead article in each of these supplements describes specifics of the experiment while a closing article evaluates progress in the field.
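The GDT-TS score used in the evaluation can be sketched as the average, over the distance cutoffs 1, 2, 4 and 8 angstroms, of the percentage of α-carbons that lie within each cutoff of their position in the target. The sketch below assumes the per-residue distances come from one fixed superposition supplied by the caller; the actual procedure searches over many superpositions to maximize each percentage, so this simplified version can only underestimate the score.

```python
def gdt_ts(distances, cutoffs=(1.0, 2.0, 4.0, 8.0)):
    """Simplified GDT-TS from per-residue alpha-carbon distances.

    distances -- model-to-target distances in angstroms, one per
                 aligned residue, from a single superposition
    Returns a score between 0 and 100.
    """
    if not distances:
        raise ValueError("no distances given")
    n = len(distances)
    fractions = [sum(d <= c for d in distances) / n for c in cutoffs]
    return 100.0 * sum(fractions) / len(cutoffs)
```

For four residues at 0.5, 1.5, 3.0 and 9.0 angstroms the fractions are 1/4, 2/4, 3/4 and 3/4, giving a score of 56.25.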

Protein sequencing Protein sequencing determines the amino acid sequence of a protein, as well as which conformation the protein adopts and the extent to which it is complexed with any non-peptide molecules. Discovering the structures and functions of proteins in living organisms is an important tool for understanding cellular processes, and allows drugs that target specific metabolic pathways to be invented more easily. The two major direct methods of protein sequencing are mass spectrometry and the Edman degradation reaction. It is also possible to generate an amino acid sequence from the DNA or mRNA sequence encoding the protein, if this is known. In addition, there are a number of other reactions which can be used to gain more limited information about protein sequences and can be used as preliminaries to the aforementioned methods of sequencing or to overcome specific inadequacies within them.
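Generating an amino acid sequence from a known coding DNA or mRNA sequence amounts to reading codons through the standard genetic code. The sketch below assumes the input starts in the correct reading frame, contains only unambiguous bases, and uses the standard code; the function and variable names are chosen for the example.

```python
BASES = "TCAG"
# Standard genetic code, codons ordered TTT, TTC, TTA, TTG, TCT, ...
AMINO_ACIDS = ("FFLLSSSSYY**CC*W" "LLLLPPPPHHQQRRRR"
               "IIIMTTTTNNKKSSRR" "VVVVAAAADDEEGGGG")

CODON_TABLE = {
    b1 + b2 + b3: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, b1 in enumerate(BASES)
    for j, b2 in enumerate(BASES)
    for k, b3 in enumerate(BASES)
}


def translate(nucleotides):
    """Translate a coding sequence into one-letter amino acids.

    Accepts DNA or mRNA (U is treated as T), reads from the first
    base, and stops at the first stop codon or the last full codon.
    """
    seq = nucleotides.upper().replace("U", "T")
    protein = []
    for pos in range(0, len(seq) - 2, 3):
        amino_acid = CODON_TABLE[seq[pos:pos + 3]]
        if amino_acid == "*":  # stop codon
            break
        protein.append(amino_acid)
    return "".join(protein)
```

For example, `translate("ATGGCCAAATGGTAA")` returns `"MAKW"`. A sequence obtained this way says nothing about post-translational processing, which is one reason direct sequencing methods remain useful.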

Determining amino acid composition It is often desirable to know the unordered amino acid composition of a protein prior to attempting to find the ordered sequence, as this knowledge can be used to facilitate the discovery of errors in the sequencing process or to distinguish between ambiguous results. Knowledge of the frequency of certain amino acids may also be used to choose which protease to use for digestion of the protein. A generalised method for doing this is as follows:
1. Hydrolyse a known quantity of protein into its constituent amino acids.
2. Separate and quantify the amino acids in some way.

Hydrolysis Hydrolysis is done by heating a sample of the protein in 6 Molar hydrochloric acid to 100-110 degrees Celsius for 24 hours or longer. Proteins with many bulky hydrophobic groups may require longer heating periods. However, these conditions are so vigorous that some amino acids (serine, threonine, tyrosine, tryptophan, glutamine and cystine) are degraded. To circumvent this problem, Biochemistry Online suggests heating separate samples for different times, analysing each resulting solution, and extrapolating back to zero hydrolysis time. Rastall suggests a variety of reagents to prevent or reduce degradation - thiol reagents or phenol to protect tryptophan and tyrosine from attack by chlorine, and pre-oxidising cysteine. He also suggests measuring the quantity of ammonia evolved to determine the extent of amide hydrolysis.

Separation The amino acids can be separated by ion-exchange chromatography or hydrophobic interaction chromatography. An example of the former is given by the NTRC using sulfonated polystyrene as a matrix, adding the amino acids in acid solution and passing a buffer of steadily increasing pH through the column. Amino acids will be eluted when the pH reaches their respective isoelectric points. The latter technique may be employed through the use of reversed phase chromatography. Many commercially available C8 and C18 silica columns have demonstrated successful separation of amino acids in solution in less than 40 minutes through the use of an optimised elution gradient.

Quantitative analysis Once the amino acids have been separated, their respective quantities are determined by adding a reagent that will form a coloured derivative. If the amounts of amino acids are in excess of 10 nmol, ninhydrin can be used for this - it gives a yellow colour when reacted with proline, and a vivid purple with other amino acids. The concentration of amino acid is proportional to the absorbance of the resulting solution. With very small quantities, down to 10 pmol, fluorescamine can be used as a marker: this forms a fluorescent derivative on reacting with an amino acid.
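
Because the absorbance of the coloured derivative is proportional to the amount of amino acid, each amount can be estimated by scaling against a standard of known quantity and then expressed as a mole percentage. The sketch below assumes one standard per amino acid measured under the same conditions; the function names and all absorbance values are hypothetical.

```python
def amounts_from_absorbance(sample_abs, standard_abs, standard_nmol):
    """Scale each sample absorbance by the response of a known standard."""
    return {aa: sample_abs[aa] / standard_abs[aa] * standard_nmol
            for aa in sample_abs}


def mole_percent(amounts):
    """Convert absolute amounts into a percentage composition."""
    total = sum(amounts.values())
    return {aa: 100.0 * n / total for aa, n in amounts.items()}


# Hypothetical readings: each standard contains 20 nmol of the amino acid.
standard_abs = {"Gly": 0.50, "Ala": 0.40, "Leu": 0.25}
sample_abs = {"Gly": 0.75, "Ala": 0.40, "Leu": 0.125}

amounts = amounts_from_absorbance(sample_abs, standard_abs, 20.0)
composition = mole_percent(amounts)
for aa in composition:
    print(aa, f"{amounts[aa]:.1f} nmol", f"{composition[aa]:.1f} %")
```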

N-terminal amino acid analysis

Sanger's method of peptide end-group analysis: (A) derivatization of the N-terminal end with Sanger's reagent (DNFB); (B) total acid hydrolysis of the dinitrophenyl peptide.

Determining which amino acid forms the N-terminus of a peptide chain is useful for two reasons: to aid the ordering of individual peptide fragments' sequences into a whole chain, and because the first round of Edman degradation is often contaminated by impurities and therefore does not give an accurate determination of the N-terminal amino acid. A generalised method for N-terminal amino acid analysis follows:
1. React the peptide with a reagent which will selectively label the terminal amino acid.
2. Hydrolyse the protein.
3. Determine the amino acid by chromatography and comparison with standards.

There are many different reagents which can be used to label terminal amino acids. They all react with amine groups and will therefore also bind to amine groups in the side chains of amino acids such as lysine - for this reason it is necessary to be careful in interpreting chromatograms to ensure that the right spot is chosen. Two of the more common reagents are Sanger's reagent (1-fluoro-2,4-dinitrobenzene) and dansyl derivatives such as dansyl chloride. Phenylisothiocyanate, the reagent for the Edman degradation, can also be used.

The same questions apply here as in the determination of amino acid composition, with the exception that no stain is needed, as the reagents produce coloured derivatives and only qualitative analysis is required, so the amino acid does not have to be eluted from the chromatography column, just compared with a standard. Another consideration to take into account is that, since any amine groups will have reacted with the labelling reagent, ion exchange chromatography cannot be used, and thin layer chromatography or high pressure liquid chromatography should be used instead.

C-terminal amino acid analysis The number of methods available for C-terminal amino acid analysis is much smaller than the number of available methods of N-terminal analysis. The most common method is to add carboxypeptidases to a solution of the protein, take samples at regular intervals, and determine the terminal amino acid by analysing a plot of amino acid concentrations against time.
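
Reading a carboxypeptidase time course can be sketched as ranking the amino acids by how early they appear in solution. The ranking rule used here (total concentration summed over the sampling times) is a simple heuristic assumed for illustration, and the concentrations are hypothetical; real time courses are complicated by repeated residues and by the enzyme's differing rates for different amino acids.

```python
def likely_release_order(time_course):
    """Rank residues from earliest released to latest released.

    time_course maps a residue name to its concentration at each sampling
    time. A residue released early accumulates a larger summed concentration.
    """
    return sorted(time_course, key=lambda aa: sum(time_course[aa]), reverse=True)


# Hypothetical concentrations (arbitrary units) at four sampling times.
course = {
    "Ser": [0.1, 0.3, 0.6, 0.8],
    "Leu": [0.4, 0.7, 0.9, 1.0],
    "Ala": [0.0, 0.1, 0.2, 0.4],
}

order = likely_release_order(course)
print("release order:", order)
print("inferred C-terminal end: ..." + "-".join(reversed(order)))
```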

Edman degradation

The Edman degradation is a very important reaction for protein sequencing, because it allows the ordered amino acid composition of a protein to be discovered. Automated Edman sequencers are now in widespread use, and are able to sequence peptides up to approximately 50 amino acids long. A reaction scheme for sequencing a protein by the Edman degradation follows - some of the steps are elaborated on subsequently.
1. Break any disulfide bridges in the protein with an oxidising agent like performic acid or a reducing agent like 2-mercaptoethanol. A protecting group such as iodoacetic acid may be necessary to prevent the bonds from re-forming.
2. Separate and purify the individual chains of the protein complex, if there is more than one.
3. Determine the amino acid composition of each chain.
4. Determine the terminal amino acids of each chain.
5. Break each chain into fragments under 50 amino acids long.
6. Separate and purify the fragments.
7. Determine the sequence of each fragment.
8. Repeat with a different pattern of cleavage.
9. Construct the sequence of the overall protein.

Digestion into peptide fragments

Peptides longer than about 50-70 amino acids cannot be sequenced reliably by the Edman degradation. Because of this, long protein chains need to be broken up into small fragments which can then be sequenced individually. Digestion is done either by endopeptidases such as trypsin or pepsin or by chemical reagents such as cyanogen bromide. Different enzymes give different cleavage patterns, and the overlap between fragments can be used to construct an overall sequence.
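
The use of two cleavage patterns can be illustrated in silico. The sketch below assumes the usual textbook rules, which are not stated in the text above: trypsin cuts after lysine (K) or arginine (R) unless the next residue is proline (P), and cyanogen bromide cuts after methionine (M). The example sequence and the function names are hypothetical.

```python
import re


def trypsin_digest(seq):
    """Cut after K or R, except when the next residue is P."""
    return [p for p in re.split(r"(?<=[KR])(?!P)", seq) if p]


def cnbr_digest(seq):
    """Cut after every M (cyanogen bromide)."""
    return [p for p in re.split(r"(?<=M)", seq) if p]


def bridging_fragments(left, right, other_digest):
    """Fragments of the other digest that span the junction left|right."""
    hits = []
    for frag in other_digest:
        for k in range(1, len(frag)):
            if left.endswith(frag[:k]) and right.startswith(frag[k:]):
                hits.append(frag)
                break
    return hits


protein = "GAMKLSRPEMTKVW"          # hypothetical peptide, one-letter codes
tryptic = trypsin_digest(protein)   # fragments from the first digest
cnbr = cnbr_digest(protein)         # fragments from the second digest

print(tryptic)
print(cnbr)
# Each cyanogen bromide fragment found here confirms that two tryptic
# fragments are neighbours in the original chain.
for left, right in zip(tryptic, tryptic[1:]):
    print(left, "+", right, "bridged by", bridging_fragments(left, right, cnbr))
```

In a real experiment the order of the fragments is unknown, so every pair of fragments would have to be tested for a bridging peptide rather than only the adjacent ones.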

The Edman degradation reaction

The peptide to be sequenced is adsorbed onto a solid surface - one common substrate is glass fibre coated with polybrene, a cationic polymer. The Edman reagent, phenylisothiocyanate (PITC), is added to the adsorbed peptide, together with a mildly basic buffer solution of 12% trimethylamine. This reacts with the amine group of the N-terminal amino acid. The terminal amino acid derivative can then be selectively detached by the addition of anhydrous acid. The derivative then isomerises to give a substituted phenylthiohydantoin which can be washed off and identified by chromatography, and the cycle can be repeated. The efficiency of each step is about 98%, which allows about 50 amino acids to be reliably determined.
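
The link between the per-cycle efficiency and the practical read length can be made explicit: if each cycle succeeds with probability of about 0.98, the fraction of chains still in phase after n cycles is 0.98 raised to the power n. The short sketch below only evaluates that expression; the function name is illustrative.

```python
def cumulative_yield(step_efficiency, cycles):
    """Fraction of peptide chains still in register after the given cycles."""
    return step_efficiency ** cycles


for n in (10, 30, 50, 70):
    print(f"after {n} cycles: {cumulative_yield(0.98, n):.1%} in phase")
```

After 50 cycles only about 36% of the chains remain in register, so the signal from the correct residue becomes increasingly hard to separate from the background of out-of-phase chains.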

Limitations of the Edman degradation Because the Edman degradation proceeds from the N-terminus of the protein, it will not work if the N-terminal amino acid has been chemically modified or if it is concealed within the body of the protein. It also requires the use of either guesswork or a separate procedure to determine the positions of disulfide bridges.

Mass spectrometry

The other major direct method by which the sequence of a protein can be determined is mass spectrometry. This method has been gaining popularity in recent years as new techniques and increasing computing power have facilitated it. Mass spectrometry can, in principle, sequence any size of protein, but the problem becomes computationally more difficult as the size increases. Peptides are also easier to prepare for mass spectrometry than whole proteins, because they are more soluble. One method of delivering the peptides to the spectrometer is electrospray ionization, for which John Bennett Fenn won the Nobel Prize in Chemistry in 2002. The protein is digested by an endoprotease, and the resulting solution is passed through a high pressure liquid chromatography column. At the end of this column, the solution is sprayed out of a narrow nozzle charged to a high positive potential into the mass spectrometer. The charge on the droplets causes them to fragment until only single ions remain. The peptides are then fragmented and the mass-to-charge ratios of the fragments measured. (It is possible to detect which peaks correspond to multiply charged fragments, because these will have auxiliary peaks corresponding to other isotopes - the distance between these other peaks is inversely proportional to the charge on the fragment.) The mass spectrum is analysed by computer and often compared against a database of previously sequenced proteins in order to determine the sequences of the fragments. This process is then repeated with a different digestion enzyme, and the overlaps in the sequences are used to construct a sequence for the protein.
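
The inverse relationship between isotope peak spacing and charge can be turned into a small calculation. The sketch assumes positive-ion mode with protonated ions, takes the spacing between neighbouring isotope peaks to be the carbon-13 to carbon-12 mass difference divided by the charge, and uses hypothetical peak positions; the constant and function names are illustrative.

```python
C13_C12_MASS_DIFF = 1.00335   # Da, mass difference between 13C and 12C
PROTON_MASS = 1.00728         # Da


def charge_from_isotope_spacing(mz_peaks):
    """Estimate the charge from the mean spacing of sorted isotope peaks."""
    spacings = [b - a for a, b in zip(mz_peaks, mz_peaks[1:])]
    mean_spacing = sum(spacings) / len(spacings)
    return round(C13_C12_MASS_DIFF / mean_spacing)


def neutral_mass(mz, charge):
    """Neutral mass of an ion carrying `charge` protons."""
    return charge * (mz - PROTON_MASS)


# Hypothetical isotope cluster with peaks about 0.5 m/z apart.
peaks = [500.77, 501.27, 501.77]
z = charge_from_isotope_spacing(peaks)
m = neutral_mass(peaks[0], z)
print(f"charge = {z}+, neutral monoisotopic mass = {m:.3f} Da")
```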

Predicting protein sequence from DNA/RNA sequences

The amino acid sequence of a protein can also be determined indirectly from the mRNA or, in organisms that do not have introns (e.g. prokaryotes), the DNA that codes for the protein. If the sequence of the gene is already known, this is straightforward. However, it is rare that the DNA sequence of a newly isolated protein will be known, so if this method is to be used, the coding sequence first has to be found. One way to do this is to sequence a short section of the protein, perhaps 15 amino acids long, by one of the above methods, and then use this sequence to generate a complementary marker for the protein's RNA. This can then be used to isolate the mRNA coding for the protein, which can then be replicated in a polymerase chain reaction to yield a significant amount of DNA, which can then be sequenced relatively easily. The amino acid sequence of the protein can then be deduced from this. However, it is necessary to take into account the possibility of amino acids being removed after the mRNA has been translated.
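
Deducing the amino acid sequence from an mRNA amounts to reading codons with the genetic code. The sketch below assumes the standard genetic code, starts at the first AUG it finds and stops at the first stop codon; the example mRNA is hypothetical, and the function does not model removal of residues after translation.

```python
from itertools import product

# Standard genetic code, codons ordered with bases T, C, A, G.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(codon): aa
               for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}


def translate(mrna):
    """Translate from the first start codon up to the first stop codon."""
    dna = mrna.upper().replace("U", "T")
    start = dna.find("ATG")
    if start < 0:
        return ""
    protein = []
    for i in range(start, len(dna) - 2, 3):
        aa = CODON_TABLE[dna[i:i + 3]]
        if aa == "*":          # stop codon
            break
        protein.append(aa)
    return "".join(protein)


print(translate("GGAUGGCUAAAUGGUAAUU"))
```

The mature protein may be shorter than the translated sequence, for example when the initial methionine or a signal peptide is removed.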

Chapter- 7

Proteomics

Robotic preparation of MALDI mass spectrometry samples on a sample carrier

Proteomics is the large-scale study of proteins, particularly their structures and functions. Proteins are vital parts of living organisms, as they are the main components of the physiological metabolic pathways of cells. The term "proteomics" was first coined in 1997 to make an analogy with genomics, the study of the genes. The word "proteome" is a blend of "protein" and "genome", and was coined by Marc Wilkins in 1994 while working on the concept as a PhD student. The proteome is the entire complement of proteins, including the modifications made to a particular set of proteins, produced by an organism or system. This will vary with time and distinct requirements, or stresses, that a cell or organism undergoes.

Complexity of the problem After genomics, proteomics is considered the next step in the study of biological systems. It is much more complicated than genomics mostly because while an organism's genome is more or less constant, the proteome differs from cell to cell and from time to time. This is because distinct genes are expressed in distinct cell types. This means that even the basic set of proteins which are produced in a cell needs to be determined. In the past this was done by mRNA analysis, but this was found not to correlate with protein content. It is now known that mRNA is not always translated into protein, and the amount of protein produced for a given amount of mRNA depends on the gene it is transcribed from and on the current physiological state of the cell. Proteomics confirms the presence of the protein and provides a direct measure of the quantity present.

Post-translational modifications Not only does the translation from mRNA cause differences, many proteins are also subjected to a wide variety of chemical modifications after translation. Many of these post-translational modifications are critical to the protein's function.

Phosphorylation One such modification is phosphorylation, which happens to many enzymes and structural proteins in the process of cell signaling. The addition of a phosphate to particular amino acids—most commonly serine and threonine mediated by serine/threonine kinases, or more rarely tyrosine mediated by tyrosine kinases—causes a protein to become a target for binding or interacting with a distinct set of other proteins that recognize the phosphorylated domain. Because protein phosphorylation is one of the most-studied protein modifications many "proteomic" efforts are geared to determining the set of phosphorylated proteins in a particular cell or tissue-type under particular circumstances. This alerts the scientist to the signaling pathways that may be active in that instance.

Ubiquitination Ubiquitin is a small protein that can be affixed to certain protein substrates by enzymes called E3 ubiquitin ligases. Determining which proteins are poly-ubiquitinated can be helpful in understanding how protein pathways are regulated. This is therefore an additional legitimate "proteomic" study. Similarly, once it is determined what substrates are ubiquitinated by each ligase, determining the set of ligases expressed in a particular cell type will be helpful.

Additional modifications

Listing all the protein modifications that might be studied in a "proteomics" project would require a discussion of most of biochemistry; therefore, a short list will serve here to illustrate the complexity of the problem. In addition to phosphorylation and ubiquitination, proteins can be subjected to (among others) methylation, acetylation, glycosylation, oxidation and nitrosylation. Some proteins undergo all of these modifications, often in time-dependent combinations, aptly illustrating the potential complexity one has to deal with when studying protein structure and function.

Distinct proteins are made under distinct settings Even if one is studying a particular cell type, that cell may make different sets of proteins at different times, or under different conditions. Furthermore, as mentioned, any one protein can undergo a wide range of post-translational modifications. Therefore a "proteomics" study can become quite complex very quickly, even if the object of the study is very restricted. In more ambitious settings, such as when a biomarker for a tumor is sought - when the proteomics scientist is obliged to study sera samples from multiple cancer patients - the amount of complexity that must be dealt with is as great as in any modern biological project.

Limitations to genomic study Scientists are very interested in proteomics because it gives a much better understanding of an organism than genomics. First, the level of transcription of a gene gives only a rough estimate of its level of expression into a protein. An mRNA produced in abundance may be degraded rapidly or translated inefficiently, resulting in a small amount of protein. Second, as mentioned above many proteins experience post-translational modifications that profoundly affect their activities; for example some proteins are not active until they become phosphorylated. Methods such as phosphoproteomics and glycoproteomics are used to study post-translational modifications. Third, many transcripts give rise to more than one protein, through alternative splicing or alternative post-translational modifications. Fourth, many proteins form complexes with other proteins or RNA molecules, and only function in the presence of these other molecules. Finally, protein degradation rate plays an important role in protein content.

Methods of studying proteins Determining proteins which are post-translationally modified One way in which a particular protein can be studied is to develop an antibody which is specific to that modification. For example, there are antibodies which only recognize certain proteins when they are tyrosine-phosphorylated, known as phospho-specific antibodies; also, there are antibodies specific to other modifications. These can be used to determine the set of proteins that have undergone the modification of interest.

For sugar modifications, such as glycosylation of proteins, certain lectins have been discovered which bind sugars. These too can be used. A more common way to determine a post-translational modification of interest is to subject a complex mixture of proteins to electrophoresis in "two dimensions", which simply means that the proteins are electrophoresed first in one direction and then in another. This allows small differences in a protein to be visualized by separating a modified protein from its unmodified form. This methodology is known as "two-dimensional gel electrophoresis". Recently, another approach called PROTOMAP has been developed, which combines SDS-PAGE with shotgun proteomics to enable detection of changes in gel migration such as those caused by proteolysis or post-translational modification.

Determining the existence of proteins in complex mixtures

Classically, antibodies to particular proteins or to their modified forms have been used in biochemistry and cell biology studies. These are among the most common tools used by practicing biologists today. For more quantitative determinations of protein amounts, techniques such as ELISAs can be used. For proteomic study, more recent techniques such as matrix-assisted laser desorption/ionization (MALDI) and, increasingly, electrospray ionization (ESI) have been employed for rapid determination of proteins in particular mixtures.

Computational methods in studying protein biomarkers

Computational predictive models have shown that extensive and diverse feto-maternal protein trafficking occurs during pregnancy and can be readily detected non-invasively in maternal whole blood. This computational approach circumvented a major limitation of fetal proteomic analysis of maternal blood: the abundance of maternal proteins interfering with the detection of fetal proteins. Computational models can use fetal gene transcripts previously identified in maternal whole blood to create a comprehensive proteomic network of the term neonate. Such work shows that the fetal proteins detected in a pregnant woman’s blood originate from a diverse group of tissues and organs of the developing fetus. The proteomic networks contain many biomarkers that are proxies for development and illustrate the potential clinical application of this technology as a way to monitor normal and abnormal fetal development. An information theoretic framework has also been introduced for biomarker discovery, integrating biofluid and tissue information. This new approach takes advantage of functional synergy between certain biofluids and tissues, with the potential for clinically significant findings not possible if tissues and biofluids were considered individually. By conceptualizing tissue-biofluid relationships as information channels, significant biofluid proxies can be identified and then used for guided development of clinical diagnostics. Candidate biomarkers are then predicted based on information transfer criteria across the tissue-biofluid channels. Significant biofluid-tissue relationships can be used to prioritize clinical validation of biomarkers.

Establishing protein-protein interactions

Most proteins function in collaboration with other proteins, and one goal of proteomics is to identify which proteins interact. This is especially useful in determining potential partners in cell signaling cascades. Several methods are available to probe protein-protein interactions. The traditional method is yeast two-hybrid analysis. Newer methods include protein microarrays, immunoaffinity chromatography followed by mass spectrometry, dual polarisation interferometry and microscale thermophoresis, as well as experimental methods such as phage display and computational methods.

Practical applications of proteomics One of the most promising developments to come from the study of human genes and proteins has been the identification of potential new drugs for the treatment of disease. This relies on genome and proteome information to identify proteins associated with a disease, which computer software can then use as targets for new drugs. For example, if a certain protein is implicated in a disease, its 3D structure provides the information to design drugs to interfere with the action of the protein. A molecule that fits the active site of an enzyme, but cannot be released by the enzyme, will inactivate the enzyme. This is the basis of new drug-discovery tools, which aim to find new drugs to inactivate proteins involved in disease. As genetic differences among individuals are found, researchers expect to use these techniques to develop personalized drugs that are more effective for the individual.

Biomarkers

The FDA defines a biomarker as, “A characteristic that is objectively measured and evaluated as an indicator of normal biologic processes, pathogenic processes, or pharmacologic responses to a therapeutic intervention”. Understanding the proteome, the structure and function of each protein and the complexities of protein-protein interactions will be critical for developing the most effective diagnostic techniques and disease treatments in the future. An interesting use of proteomics is using specific protein biomarkers to diagnose disease. A number of techniques make it possible to test for proteins produced during a particular disease, which helps to diagnose the disease quickly. These techniques include western blot, immunohistochemical staining, enzyme-linked immunosorbent assay (ELISA) and mass spectrometry.

Current research methodologies

There are many approaches to characterizing the human proteome, which is estimated to exceed 100,000 unique forms (about 25,000 genes plus post-translational modifications). In addition, the first promising attempts to decipher the proteome of animal tumors have recently been reported.

Chapter- 8

Structural Alignment

Structural alignment of thioredoxins from humans and the fly Drosophila melanogaster. The proteins are shown as ribbons, with the human protein in red, and the fly protein in yellow. Generated from PDB 3TRX and 1XWC.

Structural alignment attempts to establish homology between two or more polymer structures based on their shape and three-dimensional conformation. This process is usually applied to protein tertiary structures but can also be used for large RNA molecules. In contrast to simple structural superposition, where at least some equivalent residues of the two structures are known, structural alignment requires no a priori knowledge of equivalent positions. Structural alignment is a valuable tool for the comparison of proteins with low sequence similarity, where evolutionary relationships between proteins cannot be easily detected by standard sequence alignment techniques. Structural alignment can therefore be used to imply evolutionary relationships between proteins that share very little common sequence. However, caution should be used in using the results as evidence for shared evolutionary ancestry because of the possible confounding effects of convergent evolution by which multiple unrelated amino acid sequences converge on a common tertiary structure.

Structural alignments can compare two sequences or multiple sequences. Because these alignments rely on information about all the query sequences' three-dimensional conformations, the method can only be used on sequences where these structures are known. These are usually found by X-ray crystallography or NMR spectroscopy. It is possible to perform a structural alignment on structures produced by structure prediction methods. Indeed, evaluating such predictions often requires a structural alignment between the model and the true known structure to assess the model's quality. Structural alignments are especially useful in analyzing data from structural genomics and proteomics efforts, and they can be used as comparison points to evaluate alignments produced by purely sequence-based bioinformatics methods.

The outputs of a structural alignment are a superposition of the atomic coordinate sets and a minimal root mean square deviation (RMSD) between the structures. The RMSD of two aligned structures indicates their divergence from one another. Structural alignment can be complicated by the existence of multiple protein domains within one or more of the input structures, because changes in relative orientation of the domains between two structures to be aligned can artificially inflate the RMSD.

Data produced by structural alignment The minimum information produced from a successful structural alignment is a set of superposed three-dimensional coordinates for each input structure. (Note that one input element may be fixed as a reference and therefore its superposed coordinates do not change.) The fitted structures can be used to calculate mutual RMSD values, as well as other more sophisticated measures of structural similarity such as the global distance test (GDT, the metric used in CASP). The structural alignment also implies a corresponding one-dimensional sequence alignment from which a sequence identity, or the percentage of residues that are identical between the input structures, can be calculated as a measure of how closely the two sequences are related.
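
Both summary measures mentioned above are simple to compute once the superposition and the implied sequence alignment are available. The sketch below assumes the coordinates are already superposed and paired residue by residue, and it counts identity over aligned (non-gap) columns only, which is one of several conventions in use; the coordinates and sequences are hypothetical.

```python
import math


def rmsd(coords_a, coords_b):
    """Root mean square deviation between paired, already superposed atoms."""
    if len(coords_a) != len(coords_b):
        raise ValueError("coordinate lists must have the same length")
    squared = sum(math.dist(p, q) ** 2 for p, q in zip(coords_a, coords_b))
    return math.sqrt(squared / len(coords_a))


def sequence_identity(aligned_a, aligned_b, gap="-"):
    """Percentage of aligned (non-gap) columns holding identical residues."""
    pairs = [(x, y) for x, y in zip(aligned_a, aligned_b)
             if x != gap and y != gap]
    identical = sum(1 for x, y in pairs if x == y)
    return 100.0 * identical / len(pairs)


# Hypothetical alpha-carbon positions (angstroms) after superposition.
structure_a = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (2.0, 0.0, 0.0)]
structure_b = [(0.0, 0.0, 1.0), (1.0, 0.0, 1.0), (2.0, 0.0, 1.0)]
print(f"RMSD = {rmsd(structure_a, structure_b):.2f} A")

# Hypothetical alignment implied by the structural alignment.
print(f"identity = {sequence_identity('MKV-LA', 'MRVGLA'):.0f} %")
```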

Types of comparisons

Because protein structures are composed of amino acids whose side chains are linked by a common protein backbone, a number of different possible subsets of the atoms that make up a protein macromolecule can be used in producing a structural alignment and calculating the corresponding RMSD values. When aligning structures with very different sequences, the side chain atoms generally are not taken into account because their identities differ between many aligned residues. For this reason it is common for structural alignment methods to use by default only the backbone atoms included in the peptide bond. For simplicity and efficiency, often only the alpha carbon positions are considered, since the peptide bond has a minimally variant planar conformation. Only when the structures to be aligned are highly similar or even identical is it meaningful to align side-chain atom positions, in which case the RMSD reflects not only the conformation of the protein backbone but also the rotameric states of the side chains. Other comparison criteria that reduce noise and bolster positive matches include secondary structure assignment, native contact maps or residue interaction patterns, measures of side chain packing, and measures of hydrogen bond retention.

Structural superposition The most basic possible comparison between protein structures makes no attempt to align the input structures and requires a precalculated alignment as input to determine which of the residues in the sequence are intended to be considered in the RMSD calculation. Structural superposition is commonly used to compare multiple conformations of the same protein (in which case no alignment is necessary, since the sequences are the same) and to evaluate the quality of alignments produced using only sequence information between two or more sequences whose structures are known. This method traditionally uses a simple least-squares fitting algorithm, in which the optimal rotations and translations are found by minimizing the sum of the squared distances among all structures in the superposition. More recently, maximum likelihood and Bayesian methods have greatly increased the accuracy of the estimated rotations, translations, and covariance matrices for the superposition. Algorithms based on multidimensional rotations and modified quaternions have been developed to identify topological relationships between protein structures without the need for a predetermined alignment. Such algorithms have successfully identified canonical folds such as the four-helix bundle. The SuperPose method is sufficiently extensible to correct for relative domain rotations and other structural pitfalls.

Algorithmic complexity

Optimal solution

The optimal "threading" of a protein sequence onto a known structure and the production of an optimal multiple sequence alignment have been shown to be NP-complete. However, this does not imply that the structural alignment problem is NP-complete. Strictly speaking, an optimal solution to the protein structure alignment problem is only known for certain protein structure similarity measures, such as the measures used in protein structure prediction experiments, GDT_TS and MaxSub. These measures can be rigorously optimized using an algorithm capable of maximizing the number of atoms in two proteins that can be superimposed under a predefined distance cutoff. Unfortunately, the algorithm for the optimal solution is not practical, since its running time depends not only on the lengths but also on the intrinsic geometry of the input proteins.
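
The quantity being maximized, the number of atom pairs lying within a distance cutoff, is easy to evaluate for one fixed superposition; the hard part is the search over all superpositions, which this sketch does not attempt. The GDT_TS-style score below averages the fractions found at cutoffs of 1, 2, 4 and 8 angstroms, the commonly quoted definition, which is stated here as an assumption; the function names and coordinates are hypothetical.

```python
import math


def fraction_within(coords_a, coords_b, cutoff):
    """Fraction of paired atoms closer than `cutoff` in this superposition."""
    close = sum(1 for p, q in zip(coords_a, coords_b)
                if math.dist(p, q) <= cutoff)
    return close / len(coords_a)


def gdt_ts_fixed(coords_a, coords_b, cutoffs=(1.0, 2.0, 4.0, 8.0)):
    """GDT_TS-like score for a single, fixed superposition (no search)."""
    fractions = [fraction_within(coords_a, coords_b, c) for c in cutoffs]
    return 100.0 * sum(fractions) / len(cutoffs)


# Hypothetical paired atoms displaced by 0.5, 1.5, 3.0 and 10.0 angstroms.
model = [(0.0, 0.0, 0.0)] * 4
target = [(0.5, 0.0, 0.0), (1.5, 0.0, 0.0), (3.0, 0.0, 0.0), (10.0, 0.0, 0.0)]
score = gdt_ts_fixed(model, target)
print(f"score for this superposition: {score:.2f}")
```

A true GDT_TS value is the maximum of such scores over many candidate superpositions, so the figure from a single superposition is only a lower bound.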

Approximate solution Approximate polynomial-time algorithms for structural alignment that produce a family of "optimal" solutions within an approximation parameter for a given scoring function have been developed. Although these algorithms theoretically classify the approximate protein structure alignment problem as "tractable", they are still computationally too expensive for large scale protein structure analysis. As a consequence, practical algorithms that converge to the global solutions of the alignment, given a scoring function, do not exist. Most algorithms are, therefore, heuristic, but algorithms that guarantee the convergence to at least local maximizers of the scoring functions, and are practical, have been developed.

Representation of structures

Protein structures have to be represented in some coordinate-independent space to make them comparable. This is typically achieved by constructing a sequence-to-sequence matrix or series of matrices that encompass comparative metrics, rather than absolute distances relative to a fixed coordinate space. An intuitive representation is the distance matrix, which is a two-dimensional matrix containing all pairwise distances between some subset of the atoms in each structure (such as the alpha carbons). The matrix increases in dimensionality as the number of structures to be simultaneously aligned increases. Reducing the protein to a coarse metric such as secondary structure elements (SSEs) or structural fragments can also produce sensible alignments, despite the loss of information from discarding distances, as noise is also discarded. Choosing a representation to facilitate computation is critical to developing an efficient alignment mechanism.
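
The coordinate independence of the distance matrix can be checked directly: rotating and translating a structure changes every coordinate but leaves every pairwise distance unchanged. The sketch below uses four hypothetical alpha-carbon positions and an arbitrary rotation about the z axis; the helper names are illustrative.

```python
import math


def distance_matrix(coords):
    """All pairwise distances between the given atoms."""
    return [[math.dist(p, q) for q in coords] for p in coords]


def rotate_z_and_shift(coords, angle, shift):
    """Rotate about the z axis by `angle` radians, then translate by `shift`."""
    c, s = math.cos(angle), math.sin(angle)
    return [(c * x - s * y + shift[0], s * x + c * y + shift[1], z + shift[2])
            for x, y, z in coords]


# Hypothetical alpha-carbon coordinates in angstroms.
ca_atoms = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (3.8, 3.8, 0.0), (0.0, 3.8, 3.8)]
moved = rotate_z_and_shift(ca_atoms, angle=1.1, shift=(10.0, -4.0, 2.5))

original_dm = distance_matrix(ca_atoms)
moved_dm = distance_matrix(moved)

same = all(abs(a - b) < 1e-9
           for row_a, row_b in zip(original_dm, moved_dm)
           for a, b in zip(row_a, row_b))
print("distance matrices identical:", same)
```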

Methods Structural alignment techniques have been used in comparing individual structures or sets of structures as well as in the production of "all-to-all" comparison databases that measure the divergence between every pair of structures present in the Protein Data Bank (PDB). Such databases are used to classify proteins by their fold.

SSM

Secondary Structure Matching (SSM), or PDBeFold, at the Protein Data Bank in Europe uses graph matching followed by C-alpha alignment to compute alignments.

DALI

Illustration of the atom-to-atom vectors calculated in SSAP. From these vectors a series of vector differences, e.g., between (FA) in Protein 1 and (SI) in Protein 2 would be constructed. The two sequences are plotted on the two dimensions of a matrix to form a difference matrix between the two proteins. Dynamic programming is applied to all possible difference matrices to construct a series of optimal local alignment paths that are then summed to form the summary matrix, on which a second round of dynamic programming is performed.

A common and popular structural alignment method is DALI, the distance alignment matrix method, which breaks the input structures into hexapeptide fragments and calculates a distance matrix by evaluating the contact patterns between successive fragments. Secondary structure features that involve residues that are contiguous in sequence appear on the matrix's main diagonal; other diagonals in the matrix reflect spatial contacts between residues that are not near each other in the sequence. When these diagonals are parallel to the main diagonal, the features they represent are parallel; when they are perpendicular, their features are antiparallel. This representation is memory-intensive because the features in the square matrix are symmetrical (and thus redundant) about the main diagonal. When two proteins' distance matrices share the same or similar features in approximately the same positions, they can be said to have similar folds with similar-length loops connecting their secondary structure elements.

DALI's actual alignment process requires a similarity search after the two proteins' distance matrices are built; this is normally conducted via a series of overlapping submatrices of size 6x6. Submatrix matches are then reassembled into a final alignment via a standard score-maximization algorithm - the original version of DALI used a Monte Carlo simulation to maximize a structural similarity score that is a function of the distances between putative corresponding atoms. In particular, more distant atoms within corresponding features are exponentially downweighted to reduce the effects of noise introduced by loop mobility, helix torsions, and other minor structural variations. Because DALI relies on an all-to-all distance matrix, it can account for the possibility that structurally aligned features might appear in different orders within the two sequences being compared.

The DALI method has also been used to construct a database known as FSSP (Fold classification based on Structure-Structure alignment of Proteins, or Families of Structurally Similar Proteins) in which all known protein structures are aligned with each other to determine their structural neighbors and fold classification. There is a searchable database based on DALI, as well as a downloadable program and web search based on a standalone version known as DaliLite.

Combinatorial extension

The combinatorial extension (CE) method is similar to DALI in that it too breaks each structure in the query set into a series of fragments that it then attempts to reassemble into a complete alignment. A series of pairwise combinations of fragments called aligned fragment pairs, or AFPs, are used to define a similarity matrix through which an optimal path is generated to identify the final alignment. Only AFPs that meet given criteria for local similarity are included in the matrix as a means of reducing the necessary search space and thereby increasing efficiency. A number of similarity metrics are possible; the original definition of the CE method included only structural superpositions and inter-residue distances but has since been expanded to include local environmental properties such as secondary structure, solvent exposure, hydrogen-bonding patterns, and dihedral angles.

An alignment path is calculated as the optimal path through the similarity matrix by linearly progressing through the sequences and extending the alignment with the next possible high-scoring AFP. The initial AFP that nucleates the alignment can occur at any point in the sequence matrix. Extensions then proceed with the next AFP that meets given distance criteria restricting the alignment to low gap sizes. The size of each AFP and the maximum gap size are required input parameters but are usually set to empirically determined values of 8 and 30 respectively. Like DALI and SSAP, CE has been used to construct an all-to-all fold classification database from the known protein structures in the PDB.
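The path-extension step can be sketched as a greedy loop. This is a simplified illustration, not the published CE algorithm: the tuple layout (start in protein A, start in protein B, score), the precomputed scores, and the rule of always taking the best-scoring admissible AFP are assumptions. Only the fragment size of 8 and the maximum gap of 30 come from the text.

```python
def extend_path(afps, seed, size=8, max_gap=30):
    """Greedily extend an alignment path from a seed AFP.

    afps: list of (start_a, start_b, score) tuples (hypothetical layout).
    An AFP may follow the current one only if it starts at or after the end
    of the current fragment in both proteins, with a gap of at most max_gap.
    """
    path = [seed]
    while True:
        start_a, start_b, _ = path[-1]
        end_a, end_b = start_a + size, start_b + size
        candidates = [
            afp for afp in afps
            if 0 <= afp[0] - end_a <= max_gap and 0 <= afp[1] - end_b <= max_gap
        ]
        if not candidates:
            return path
        # Take the highest-scoring admissible AFP as the next path element.
        path.append(max(candidates, key=lambda afp: afp[2]))

afps = [(0, 0, 1.0), (8, 10, 0.9), (8, 9, 0.5), (60, 60, 0.8), (20, 22, 0.7)]
path = extend_path(afps, seed=afps[0])
# path is [(0, 0, 1.0), (8, 10, 0.9), (20, 22, 0.7)]; the AFP at (60, 60)
# is rejected because its gap in protein A exceeds 30.
```

The loop always terminates because each accepted AFP starts strictly later in both proteins than the previous one.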

GANGSTA+

GANGSTA+ is a combinatorial algorithm for non-sequential structural alignment of proteins and similarity search in databases. It uses a combinatorial approach on the secondary structure level to evaluate similarities between two protein structures based on contact maps. Different assignment modes for secondary structure elements (SSEs) can be used. The assignment of SSEs can be performed respecting the sequential order of the SSEs in the polypeptide chains of the considered protein pair (sequential alignment) or by ignoring this order (non-sequential alignment). Furthermore, SSE pairs can optionally be aligned in reverse orientation. The highest ranking SSE assignments are transferred to the residue level by a point-matching approach. To obtain an initial common set of atomic coordinates for both proteins, pairwise attractive interactions of the C-alpha atom pairs are defined by inverse Lorentzians and energy minimized.

MAMMOTH

MAtching Molecular Models Obtained from THeory. As its name suggests, MAMMOTH was originally developed for comparing models coming from structure prediction (THeory) since it is tolerant of large unalignable regions, but it has proven to work well with experimental models, especially when looking for remote homology. Benchmarks on targets of blind structure prediction (the CASP experiment) and automated GO annotation have shown it is tightly rank correlated with human curated annotation. A highly complete database of MAMMOTH-based structure annotations for the predicted structures of unknown proteins covering 150 genomes facilitates genomic scale normalization.

MAMMOTH-based structure alignment methods decompose the protein structure into short peptides (heptapeptides) which are compared with the heptapeptides of another protein. The similarity score between two heptapeptides is calculated using a unit-vector RMS (URMS) method. These scores are stored in a similarity matrix, and the optimal residue alignment is calculated with a hybrid (local-global) dynamic programming algorithm. Protein similarity scores calculated with MAMMOTH are derived from the likelihood of obtaining a given structural alignment by chance.

MAMMOTH-mult is an extension of the MAMMOTH algorithm used to align related families of protein structures. This algorithm is very fast and produces consistent and high quality structural alignments. Multiple structural alignments calculated with MAMMOTH-mult produce structurally-implied sequence alignments that can be further used for multiple-template homology modeling, HMM-based protein structure prediction, and profile-type PSI-BLAST searches.

RAPIDO

Rapid Alignment of Proteins In terms of DOmains. RAPIDO is a web server for the 3D alignment of crystal structures of different protein molecules, in the presence of conformational changes. Similar to what CE does as a first step, RAPIDO identifies fragments that are structurally similar in the two proteins using an approach based on difference distance matrices. The Matching Fragment Pairs (MFPs) are then represented as nodes in a graph which are chained together to form an alignment by means of an algorithm for the identification of the longest path on a DAG (Directed Acyclic Graph). A final refinement step is performed to improve the quality of the alignment.

After aligning the two structures, the server applies a genetic algorithm for the identification of conformationally invariant regions. These regions correspond to groups of atoms whose interatomic distances are constant (within a defined tolerance). In doing so RAPIDO takes into account the variation in the reliability of atomic coordinates by employing weighting functions based on the refined B-values. The regions identified as conformationally invariant by RAPIDO represent reliable sets of atoms for the superposition of the two structures that can be used for a detailed analysis of changes in the conformation. In addition to the functionalities provided by existing tools, RAPIDO can identify structurally equivalent regions even when these consist of fragments that are distant in terms of sequence and separated by other movable domains.
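Chaining fragment pairs by a longest-path search on a DAG can be sketched as follows. This is a generic illustration rather than RAPIDO's code: the tuple layout (start in protein 1, start in protein 2, length), the edge rule that one fragment pair must end before the next begins in both proteins, and the use of fragment length as the node weight are all assumptions.

```python
def longest_chain(fragments):
    """Return the heaviest chain of compatible fragment pairs.

    fragments: list of (start_a, start_b, length) tuples (hypothetical layout).
    An edge p -> k exists when p ends before k starts in both proteins, so
    the graph is acyclic and sorting by start position is a topological order.
    """
    order = sorted(range(len(fragments)), key=lambda k: fragments[k][:2])
    best = {}  # index -> (total aligned length, predecessor index or None)
    for pos, k in enumerate(order):
        a, b, n = fragments[k]
        best[k] = (n, None)
        for p in order[:pos]:
            pa, pb, pn = fragments[p]
            compatible = pa + pn <= a and pb + pn <= b
            if compatible and best[p][0] + n > best[k][0]:
                best[k] = (best[p][0] + n, p)
    if not best:
        return []
    # Walk back from the best end node to recover the chain.
    k = max(best, key=lambda idx: best[idx][0])
    chain = []
    while k is not None:
        chain.append(fragments[k])
        k = best[k][1]
    return chain[::-1]

mfps = [(0, 0, 5), (3, 20, 10), (6, 6, 4), (12, 11, 6), (30, 2, 8)]
chain = longest_chain(mfps)
# chain is [(0, 0, 5), (6, 6, 4), (12, 11, 6)], covering 15 residues.
```

The quadratic scan over predecessors is adequate for a sketch; a production tool would likely use a more refined scoring and search strategy.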

SABERTOOTH

SABERTOOTH uses structural profiles to perform structural alignments. The underlying structural profiles express the global connectivity of each residue. Despite this very condensed vectorial representation, the tool recognizes structural similarities with an accuracy and alignment quality comparable to established alignment tools based on coordinates. Furthermore, the computation time of the algorithm scales favourably with chain length. Since the algorithm is independent of the details of the structural representation, the framework can be generalized to sequence-to-sequence and sequence-to-structure comparison within the same setup, and it is therefore more general than other tools.

SSAP

The SSAP (Sequential Structure Alignment Program) method uses double dynamic programming to produce a structural alignment based on atom-to-atom vectors in structure space. Instead of the alpha carbons typically used in structural alignment, SSAP constructs its vectors from the beta carbons for all residues except glycine, a method which thus takes into account the rotameric state of each residue as well as its location along the backbone. SSAP works by first constructing a series of inter-residue distance vectors between each residue and its nearest non-contiguous neighbors on each protein. A series of matrices are then constructed containing the vector differences between neighbors for each pair of residues for which vectors were constructed. Dynamic programming applied to each resulting matrix determines a series of optimal local alignments which are then summed into a "summary" matrix to which dynamic programming is applied again to determine the overall structural alignment.

SSAP originally produced only pairwise alignments but has since been extended to multiple alignments as well. It has been applied in an all-to-all fashion to produce a hierarchical fold classification scheme known as CATH (Class, Architecture, Topology, Homology), which has been used to construct the CATH Protein Structure Classification database.
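The final dynamic programming pass over a summary matrix can be sketched with a standard global alignment recurrence. This is only the second level of the double dynamic programming scheme, under simplifying assumptions: the summary scores are taken as given, and the linear gap penalty and function name are hypothetical choices, not SSAP's actual parameters.

```python
def align(score, gap=1.0):
    """Global alignment over a precomputed residue-by-residue score matrix.

    score[i][j] is the (assumed) summary score for pairing residue i of
    protein 1 with residue j of protein 2. Returns (total, aligned_pairs).
    """
    n, m = len(score), len(score[0])
    dp = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        dp[i][0] = -gap * i
    for j in range(1, m + 1):
        dp[0][j] = -gap * j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            dp[i][j] = max(
                dp[i - 1][j - 1] + score[i - 1][j - 1],  # pair i with j
                dp[i - 1][j] - gap,                      # skip residue i
                dp[i][j - 1] - gap,                      # skip residue j
            )
    # Trace back from the corner to recover the aligned residue pairs.
    pairs, i, j = [], n, m
    while i > 0 and j > 0:
        if dp[i][j] == dp[i - 1][j - 1] + score[i - 1][j - 1]:
            pairs.append((i - 1, j - 1))
            i, j = i - 1, j - 1
        elif dp[i][j] == dp[i - 1][j] - gap:
            i -= 1
        else:
            j -= 1
    return dp[n][m], pairs[::-1]

summary = [
    [5.0, 0.0, 0.0, 0.0],
    [0.0, 0.0, 5.0, 0.0],
    [0.0, 0.0, 0.0, 5.0],
]
total, pairs = align(summary)
# total is 14.0 and pairs is [(0, 0), (1, 2), (2, 3)]: residue 1 of the
# second protein is left unaligned at the cost of one gap.
```

In the full method, the same kind of recurrence is first run on each residue-pair difference matrix, and the resulting local paths are accumulated to build the summary matrix used here.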

TOPOFIT

In the TOPOFIT method, similarity of protein structures is analyzed using three-dimensional Delaunay triangulation patterns derived from the backbone representation. It has been found that structurally related proteins have a common spatial invariant part, a set of tetrahedrons, mathematically described as a common spatial sub-graph volume of the three-dimensional contact graph derived from Delaunay tessellation (DT). Based on this property of protein structures, the common volume superimposition (TOPOFIT) method produces structural alignments of proteins. The superimposition of the DT patterns allows one to objectively identify a common number of equivalent residues in the structural alignment. In other words, TOPOFIT identifies a feature point on the RMSD/Ne curve, the topomax point, up to which two structures correspond to each other, including backbone and inter-residue contacts, while a growing number of mismatches between the DT patterns occurs at larger RMSD (Ne) beyond the topomax point. The topomax point is present in all alignments from different protein structural classes; therefore, the TOPOFIT method identifies common, invariant structural parts between proteins.

The TOPOFIT method adds new opportunities for the comparative analysis of protein structures and for more detailed studies on understanding the molecular principles of tertiary structure organization and functionality. It helps to detect conformational changes and topological differences in variable parts, which are particularly important for studies of variations in active/binding sites and for protein classification.

Recent developments

Improvements in structural alignment methods constitute an active area of research, and new or modified methods are often proposed that are claimed to offer advantages over the older and more widely distributed techniques. A recent example, TM-align, uses a novel method for weighting its distance matrix, to which standard dynamic programming is then applied. The weighting is proposed to accelerate the convergence of dynamic programming and correct for effects arising from alignment lengths. In a benchmarking study, TM-align has been reported to improve in both speed and accuracy over DALI and CE.

However, as algorithmic improvements and computer performance have erased purely technical deficiencies in older approaches, it has become clear that there is no one universal criterion for the 'optimal' structural alignment. TM-align, for instance, is particularly robust in quantifying comparisons between sets of proteins with great disparities in sequence lengths, but it only indirectly captures hydrogen bonding or secondary structure order conservation, which might be better metrics for alignment of evolutionarily related proteins. Thus recent developments have focused on optimizing particular attributes such as speed, quantification of scores, correlation to alternative gold standards, or tolerance of imperfection in structural data or ab initio structural models. An alternative methodology that is gaining popularity is to use the consensus of various methods to ascertain protein structural similarities.

In choosing among modern algorithms, investigators should consider how well an algorithm is optimized for the intended application, as well as its penetration into the particular field, so as to facilitate comparison with other authors' results. For example, MAMMOTH has been specialized for speed and correlation to human annotation, and is thus suited for large-scale structural genomics studies: MAMMOTH has been adopted by Rosetta@home and the World Community Grid's Yeast and Human Proteome Folding projects because it is designed for remote structural homology detection even with relatively inaccurate or incomplete predicted structure models. The venerable DALI is perhaps the most ubiquitous in the literature and, due to its integration with other European Bioinformatics Institute web-based tools, the EBI DALI is easily approached by researchers interested in singleton applications.

Historically, it was initially unclear if comparisons that preserved sequence order would be more sensitive than ones that simply compare architecture or contacts without regard to secondary segment ordering. Early versions of DALI allowed a choice of non-sequential alignment at a great cost in speed. Non-sequential methods lost favor as segment order preserving methods outperformed them in speed, quantification of high-confidence similarity scores, and amenability to adoption of rich scoring heuristics. However, in some applications, recognition of conserved but out-of-order structural motifs is vital, and some forms of experimental data collection, such as cryo-electron microscopy, generally resolve regular secondary elements but not their connection order. This has renewed interest in non-sequential approaches. Some examples are GANGSTA+ and TOPOFIT.
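The length correction associated with TM-align is usually expressed through the TM-score, whose commonly published form is sketched below. The function name is hypothetical, and the lower bound on the distance scale d0 and the rejection of very short targets are safeguards of this sketch, not part of the text above.

```python
def tm_score(distances, target_length):
    """TM-score for a set of aligned residue pairs.

    distances: distance in angstroms between each aligned residue pair after
    superposition. target_length: number of residues in the reference
    protein, which normalizes the score so that unaligned residues count
    against it.
    """
    if target_length <= 15:
        raise ValueError("this sketch assumes a target longer than 15 residues")
    # Length-dependent distance scale; the 0.5 floor is an assumption that
    # keeps the scale positive for short chains.
    d0 = max(1.24 * (target_length - 15) ** (1.0 / 3.0) - 1.8, 0.5)
    return sum(1.0 / (1.0 + (d / d0) ** 2) for d in distances) / target_length

perfect = tm_score([0.0] * 100, 100)   # every residue aligned exactly
partial = tm_score([0.0] * 50, 100)    # only half of the target aligned
loose = tm_score([5.0] * 100, 100)     # all aligned, but 5 angstroms apart
```

Because each pair contributes at most 1 and the sum is divided by the target length, the score stays between 0 and 1 regardless of protein size, which is what makes comparisons across proteins of very different lengths meaningful.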

RNA structural alignment

Structural alignment techniques have traditionally been applied exclusively to proteins, as the primary biological macromolecules that assume characteristic three-dimensional structures. However, large RNA molecules also form characteristic tertiary structures, which are mediated primarily by hydrogen bonds formed between base pairs as well as base stacking. Functionally similar noncoding RNA molecules can be especially difficult to extract from genomics data because structure is more strongly conserved than sequence in RNA as well as in proteins, and the more limited alphabet of RNA decreases the information content of any given nucleotide at any given position.

A recent method for pairwise structural alignment of RNA sequences with low sequence identity has been published and implemented in the program FOLDALIGN. However, this method is not truly analogous to protein structural alignment techniques because it computationally predicts the structures of the RNA input sequences rather than requiring experimentally determined structures as input. Although computational prediction of the protein folding process has not been particularly successful to date, RNA structures without pseudoknots can often be sensibly predicted using free energy-based scoring methods that account for base pairing and stacking.

Software

Choosing a software tool for structural alignment can be a challenge due to the large variety of available packages that differ significantly in methodology and reliability. A partial solution to this problem has been made publicly accessible through the ProCKSI webserver. A more complete list of currently available and freely distributed structural alignment software can be found under structural alignment software. Properties of some structural alignment servers and software packages are summarized and tested with examples at Structural Alignment Tools in Proteopedia.Org.

Chapter- 9

Protein Biosynthesis and Peptide Mass Fingerprinting

Protein Biosynthesis

RNA is transcribed in the nucleus; once completely processed, it is transported to the cytoplasm and translated by the ribosome (not shown).

Protein synthesis is the process in which cells build proteins. The term is sometimes used to refer only to protein translation but more often it refers to a multi-step process, beginning with amino acid synthesis and transcription of nuclear DNA into messenger RNA, which is then used as input to translation.

The cistron DNA is transcribed into a variety of RNA intermediates. The last version is used as a template in synthesis of a polypeptide chain.

Proteins can often be synthesized directly from genes by translating mRNA. When a protein needs to be available on short notice or in large quantities, a protein precursor is produced. A proprotein is an inactive protein containing one or more inhibitory peptides that can be activated when the inhibitory sequence is removed by proteolysis during post-translational modification. A preprotein is a form that contains a signal sequence (an N-terminal signal peptide) that specifies its insertion into or through membranes, i.e., targets it for secretion. The signal peptide is cleaved off in the endoplasmic reticulum. Preproproteins have both sequences (inhibitory and signal) still present.

For synthesis of protein, a succession of tRNA molecules charged with appropriate amino acids has to be brought together with an mRNA molecule and matched up by base-pairing through their anticodons with each of its successive codons. The amino acids then have to be linked together to extend the growing protein chain, and the tRNAs, relieved of their burdens, have to be released. This whole complex of processes is carried out by a giant multimolecular machine, the ribosome, formed of two main chains of RNA, called ribosomal RNA (rRNA), and more than 50 different proteins. This molecular juggernaut latches onto the end of an mRNA molecule and then trundles along it, capturing loaded tRNA molecules and stitching together the amino acids they carry to form a new protein chain. Protein biosynthesis, although very similar, differs between prokaryotes and eukaryotes.

Amino acid synthesis

Amino acids are the monomers that are polymerized to produce proteins. Amino acid synthesis is the set of biochemical processes (metabolic pathways) that build the amino acids from carbon sources like glucose. Many organisms have the ability to synthesize only a subset of the amino acids they need. Adult humans, for example, need to obtain 9 of the 20 standard amino acids from their food.

Transcription

Simple diagram of transcription elongation

In transcription an mRNA chain is generated, with one strand of the DNA double helix in the genome as template. This strand is called the template strand. Transcription can be divided into 3 stages: Initiation, Elongation and Termination, each regulated by a large number of proteins such as transcription factors and coactivators that ensure the correct gene is transcribed.

Transcription occurs in the cell nucleus, where the DNA is held. The DNA of the cell is made up of two helical strands, each with a backbone of sugar and phosphate, held together by the paired bases. The sugar and the phosphate are joined together by covalent bonds. The DNA is "unzipped" by the enzyme helicase, leaving the single nucleotide chain open to be copied. RNA polymerase reads the DNA strand from the 3 prime (3') end to the 5 prime (5') end, while it synthesizes a single strand of messenger RNA in the 5' to 3' direction. The general RNA structure is very similar to the DNA structure, but in RNA the nucleotide uracil takes the place that thymine occupies in DNA. The single strand of mRNA leaves the nucleus through nuclear pores, and migrates into the cytoplasm.

The first product of transcription differs between prokaryotic and eukaryotic cells. In prokaryotic cells the product is mRNA, which needs no post-transcriptional modification. In eukaryotic cells the first product is called the primary transcript, which needs post-transcriptional modification (capping with 7-methylguanosine, tailing with a poly-A tail) to give hnRNA (heterogeneous nuclear RNA). hnRNA then undergoes splicing of introns (non-coding parts of the gene) via spliceosomes to produce the final mRNA.
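The direction and base-pairing rules of transcription can be illustrated with a short sketch. The function names and the example sequence are made up for illustration, and the sketch ignores promoters, termination signals and all post-transcriptional processing.

```python
# DNA template base -> RNA base added by the polymerase.
COMPLEMENT = {"A": "U", "T": "A", "G": "C", "C": "G"}

def transcribe(template_3_to_5):
    """Build the mRNA (written 5' to 3') from a template strand written 3' to 5'.

    Reading the template from its 3' end while writing the RNA from its 5'
    end means the two strings line up position by position.
    """
    return "".join(COMPLEMENT[base] for base in template_3_to_5.upper())

def transcribe_from_5_to_3(template_5_to_3):
    """Same as transcribe, for a template written in the usual 5' to 3' order."""
    return transcribe(template_5_to_3[::-1])

mrna = transcribe("TACGGCATT")
# mrna is "AUGCCGUAA": uracil appears wherever the template has adenine.
same = transcribe_from_5_to_3("TTACGGCAT")
```

Keeping the strand orientation explicit in the function names avoids the common mistake of complementing a sequence without reversing it.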

Translation

Diagram showing the translation of mRNA and the synthesis of proteins by a ribosome

The synthesis of proteins is known as translation. Translation occurs in the cytoplasm, where the ribosomes are located. Ribosomes are made of a small and a large subunit that surround the mRNA. In translation, messenger RNA (mRNA) is decoded to produce a specific polypeptide according to the rules specified by the trinucleotide genetic code. This uses an mRNA sequence as a template to guide the synthesis of a chain of amino acids that form a protein. Translation proceeds in four phases: activation, initiation, elongation, and termination (all describing the growth of the amino acid chain, or polypeptide, that is the product of translation).

In activation, the correct amino acid (AA) is joined to the correct transfer RNA (tRNA). While this is not technically a step in translation, it is required for translation to proceed. The AA is joined by its carboxyl group to the 3' OH of the tRNA by an ester bond. When the tRNA has an amino acid linked to it, it is termed "charged". Initiation involves the small subunit of the ribosome binding to the 5' end of the mRNA with the help of initiation factors (IF), other proteins that assist the process. Elongation occurs when the next aminoacyl-tRNA (charged tRNA) in line binds to the ribosome along with GTP and an elongation factor. Termination of the polypeptide happens when the A site of the ribosome faces a stop codon (UAA, UAG, or UGA). When this happens, no tRNA can recognize it, but a release factor can recognize these nonsense codons and causes the release of the polypeptide chain.

The capacity to disable or inhibit translation in protein biosynthesis is exploited by antibiotics such as anisomycin, cycloheximide, chloramphenicol, tetracycline, streptomycin, erythromycin and puromycin.

In summary, translation is the process of converting the mRNA codon sequence into an amino acid polypeptide chain:

1. Amino acid activation
2. Initiation - A ribosome attaches to the mRNA and starts translating at the start codon (usually AUG, sometimes GUG or UUG), which codes for formylmethionine (fMet) in prokaryotes.
3. Elongation - For each codon, as the ribosome moves down the mRNA strand, a tRNA whose anticodon pairs with that codon brings the corresponding amino acid.
4. Termination - Reading of the final mRNA codon (the stop codon), which ends the synthesis of the peptide chain and releases it.
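The decoding step can be sketched in a few lines. The sketch uses the standard genetic code and starts at the first AUG it finds; the function name and example sequences are made up, and alternative start codons, ribosome binding and every regulatory detail are deliberately ignored.

```python
BASES = "UCAG"
# Standard genetic code, codons ordered with U, C, A, G at each position;
# "*" marks the stop codons UAA, UAG and UGA.
AMINO = ("FFLLSSSSYY**CC*W" "LLLLPPPPHHQQRRRR"
         "IIIMTTTTNNKKSSRR" "VVVVAAAADDEEGGGG")
CODON_TABLE = {
    a + b + c: AMINO[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}

def translate(mrna):
    """Translate an mRNA (5' to 3') from the first AUG to the first stop codon.

    Returns the peptide in one-letter code, or "" if there is no AUG.
    """
    mrna = mrna.upper()
    start = mrna.find("AUG")
    if start == -1:
        return ""
    peptide = []
    for pos in range(start, len(mrna) - 2, 3):
        residue = CODON_TABLE[mrna[pos:pos + 3]]
        if residue == "*":  # stop codon: release the chain
            break
        peptide.append(residue)
    return "".join(peptide)

short = translate("GGAUGCCGUAAUU")   # "MP"
longer = translate("AUGUUUGGCUGA")   # "MFG"
```

Each loop iteration corresponds to one elongation cycle, and the `break` plays the role of the release factor recognizing a stop codon.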

Events following protein translation

The events following biosynthesis include post-translational modification and protein folding. During and after synthesis, polypeptide chains often fold to assume so-called native secondary and tertiary structures. This is known as protein folding. Many proteins undergo post-translational modification. This may include the formation of disulfide bridges or attachment of any of a number of biochemical functional groups, such as acetate, phosphate, various lipids and carbohydrates. Enzymes may also remove one or more amino acids from the leading (amino) end of the polypeptide chain, or cleave the chain internally, which can leave a protein consisting of two polypeptide chains connected by disulfide bonds.

In general, protein molecules are believed to be modified by small chemical groups post-translationally. Chemical modifications of proteins, such as phosphorylation of serine / threonine, acetylation or methylation of lysine, hydroxylation of proline / lysine, formylation of glycine, glycosylation of serine / threonine / asparagine, acylation of cysteine, myristoylation of glycine, biotinylation of lysine, ubiquitination, etc., are very important for properly understanding the biological functions of a given protein. These much-studied post-translational modifications have become well-established with the discovery of the respective enzymes (kinases for phosphorylation, acetylases for acetylation, methyl transferases for methylation, etc.), which carry out the chemical modifications on the specific amino acid residues. All of these modifications are still believed to happen as post-translational events. There is no study yet on when exactly one particular modification occurs on a given amino acid residue in a given protein. Does it happen when the protein is already formed, or when the amino acid chain is being synthesized, or before the translation of the primary chain has begun?
Since these chemical modifications are related to the biological functions of a protein, it is easy to think that they happen to the whole protein molecule, after the protein primary chain is fully synthesized. If that is the case, however, we have to consider the fact that the primary chains fold instantly (in a similar way as newly synthesized DNA strands form helixes) to attain their compact globular conformation. As most of the primary chains are fairly long [a 5 kDa protein may have 40-45 amino acid residues in its primary chain], it is likely that the newly formed amino acid chain tries to remain intact by folding, thereby avoiding its breakdown by the many proteases present within the cytoplasm. Moreover, no capping event that protects the N-terminal end of the primary sequence (similar to the 5' capping that protects mRNAs) has been discovered for protein primary structure. So, by folding, the primary chain perhaps avoids protease attack. However, once it is folded, it may be very difficult for the respective enzyme molecule to find the particular amino acid residue within the complexity of that compactly folded conformation. In addition, the enzymatic modification of a given amino acid requires the presence and association of the appropriate enzyme, necessary cofactors, etc. This association occurs much more easily when the amino acid residues in the primary structure are readily available for binding; in other words, it is much more difficult for the enzyme molecules to find and bind their substrate amino acid residue in a mature protein molecule after its three-dimensional conformation has been attained.
So one can think that the modifications may happen while the primary chain of the protein is being synthesized during translation on the mRNA strand; the amino acid residues on the primary chain can be modified instantly and enzymatically by kinases, acetylases, hydroxylases, methyltransferases, etc. to initiate proper folding for protection in order to avoid degradation by proteases, thereby gaining the globular form.

Also, the reader can imagine another scenario in which free amino acid molecules, once formed within a cell and available, are modified enzymatically in that free state before entering translation; this means that, while the primary chain is being synthesized, the pre-modified amino acid molecules are ready to be incorporated. This way, as the primary chain is synthesized, it does not have to be modified by the modifying enzymes anymore, and can fold instantaneously without concern of degradation. This also raises new thoughts that (i) phosphate, acetyl, methyl, biotinyl, acyl, etc. groups may have the ability to inhibit protease action on the primary chains; (ii) most or all of the amino acid residues get modified by small chemical groups (so far, only some chemical groups are known). It is still not fully known exactly when and how the actual modification of a given amino acid residue occurs, or at which stage of synthesis, within a protein molecule.

Once the chemically-modified, protease-insensitive, intact protein molecule is generated, it must perform its biological function, which requires its being activated. The activation is probably done by a second set of enzymatic reactions in which these chemical groups are removed from the amino acid residues (or added back). So, the second type of post-translational modifications are the opposite reactions of the above-described type, which are dephosphorylation by phosphatases or phosphoryl transferases, deacetylation by deacetylases or acetyltransferases, demethylation by demethylases or methyl-transferases, ubiquitination, SUMOylation, glycosylation, biotinylation, etc. These take place on the whole protein molecule, either generating its activated form to execute a particular function or deactivating it before its total degradation.
As the chemical groups are removed (or added), it is quite clear that the proteins can go through different states of structural/conformational change. Thus, after translation, a protein can change its conformation dynamically while the chemical groups are removed or added enzymatically; these conformational changes in the protein structure help the protein to proceed through its lifecycle, until it is ubiquitinated for its total degradation.

Peptide mass fingerprinting

Peptide mass fingerprinting (PMF) (also known as protein fingerprinting) is an analytical technique for protein identification that was developed in 1993 by several groups independently. In this method, the unknown protein of interest is first cleaved into smaller peptides, whose absolute masses can be accurately measured with a mass spectrometer such as MALDI-TOF or ESI-TOF. These masses are then compared to either a database containing known protein sequences or even the genome. This is achieved by using computer programs that translate the known genome of the organism into proteins, then theoretically cut the proteins into peptides, and calculate the absolute masses of the peptides from each protein. They then compare the masses of the peptides of the unknown protein to the theoretical peptide masses of each protein encoded in the genome. The results are statistically analyzed to find the best match.

The advantage of this method is that only the masses of the peptides have to be known, so de novo peptide sequencing, which can be time consuming, is not necessary. A disadvantage is that the protein sequence has to be present in the database of interest. Additionally, most PMF algorithms assume that the peptides come from a single protein. The presence of a mixture can significantly complicate the analysis and potentially compromise the results. PMF-based protein identification therefore typically requires an isolated protein. Mixtures of more than 2-3 proteins typically require the additional use of MS/MS-based protein identification to achieve sufficient specificity of identification (6). Consequently, typical PMF samples are isolated proteins from two-dimensional gel electrophoresis (2D gels) or isolated SDS-PAGE bands. Additional analyses by MS/MS can either be direct, e.g., MALDI-TOF/TOF analysis, or downstream nanoLC-ESI-MS/MS analysis of gel spot eluates.

Sample preparation

Protein samples can be derived from SDS-PAGE and are then subject to some chemical modifications. Disulfide bridges in proteins are reduced and cysteine amino acids are carboxymethylated chemically or acrylamidated during the gel electrophoresis.

Then the proteins are cut into several fragments using proteolytic enzymes such as trypsin, chymotrypsin or Glu-C. A typical sample:protease ratio is 50:1. The proteolysis is typically carried out overnight and the resulting peptides are extracted with acetonitrile and dried under vacuum. The peptides are then dissolved in a small amount of distilled water or further concentrated and purified using ZipTip Pipette tips and are ready for mass spectrometric analysis.

Mass spectrometric analysis

The digested protein can be analyzed with different types of mass spectrometers such as ESI-TOF or MALDI-TOF. MALDI-TOF is often the preferred instrument because it allows a high sample throughput and several proteins can be analyzed in a single experiment, if complemented by MS/MS analysis.

A small fraction of the peptide solution (usually 1 microliter or less) is pipetted onto a MALDI target and a chemical called a matrix is added to the peptide mix. The matrix molecules are required for the desorption of the peptide molecules. Matrix and peptide molecules co-crystallize on the MALDI target and are ready to be analyzed.

The target is inserted into the vacuum chamber of the mass spectrometer and the analysis of peptide masses is initiated by a pulsed laser beam which transfers high amounts of energy into the matrix molecules. The energy transfer is sufficient to promote the transition of matrix molecules and peptides from the solid state into the gas phase, where they are ionized. The ions are then accelerated in the electric field of the mass spectrometer and fly towards an ion detector where their arrival is detected as an electric signal. Their mass-to-charge ratio is proportional to the square of their time of flight (TOF) in the drift tube and can be calculated accordingly.
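For an idealized linear TOF analyzer, the link between flight time and mass-to-charge ratio follows from equating the electrical energy gained during acceleration with the kinetic energy of the ion. The sketch below makes that relation concrete. The 20 kV acceleration voltage and 1 m drift length are hypothetical values, and real instruments calibrate against known masses and correct for effects (initial velocity spread, delayed extraction, reflectron geometry) that are ignored here.

```python
import math

ELEMENTARY_CHARGE = 1.602176634e-19  # coulombs
DALTON = 1.66053906660e-27           # kilograms

def flight_time(mz, voltage, length):
    """Drift time in seconds for an ion of given m/z (daltons per charge).

    From z*e*U = (1/2)*m*v**2 and t = L / v, so t = L * sqrt(m / (2*z*e*U)).
    """
    return length * math.sqrt(mz * DALTON / (2.0 * ELEMENTARY_CHARGE * voltage))

def mz_from_time(t, voltage, length):
    """Invert flight_time: m/z grows with the square of the flight time."""
    return 2.0 * ELEMENTARY_CHARGE * voltage * t * t / (length * length * DALTON)

t_1000 = flight_time(1000.0, voltage=20000.0, length=1.0)
t_4000 = flight_time(4000.0, voltage=20000.0, length=1.0)
ratio = t_4000 / t_1000  # a fourfold heavier ion takes twice as long
```

With these assumed settings a peptide ion of m/z 1000 arrives after roughly 16 microseconds, which shows why fast detector electronics are needed.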

Computational analysis

The mass spectrometric analysis produces a list of molecular weights, which is often called a peak list. The peptide masses are then compared to large databases such as Swiss-Prot and GenBank, which contain protein sequence information. Software programs cut all these proteins into peptides in silico, using the cleavage rules of the same enzyme that was used to digest the sample (for example trypsin). The absolute mass of all these peptides is then theoretically calculated. A comparison is made between the peak list of measured peptide masses and all the masses from the calculated peptides. The results are statistically analyzed and possible matches are returned in a results table.
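The comparison can be sketched end to end on a toy database. Several simplifications are assumed: trypsin is modeled by the common rule "cleave after K or R unless the next residue is P" with no missed cleavages, neutral monoisotopic masses are used (a real peak list usually contains protonated ions), no chemical modifications are considered, and candidates are ranked by a plain count of matching peaks rather than a statistical score. The protein names and sequences are invented.

```python
# Monoisotopic residue masses in daltons.
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
WATER = 18.01056  # added once per peptide for the free termini

def trypsin_digest(sequence):
    """Cut after K or R, except when the following residue is P."""
    peptides, start = [], 0
    for i, residue in enumerate(sequence):
        following = sequence[i + 1] if i + 1 < len(sequence) else ""
        if residue in "KR" and following != "P":
            peptides.append(sequence[start:i + 1])
            start = i + 1
    if start < len(sequence):
        peptides.append(sequence[start:])
    return peptides

def peptide_mass(peptide):
    """Neutral monoisotopic mass of an unmodified peptide."""
    return sum(RESIDUE_MASS[residue] for residue in peptide) + WATER

def count_matches(peaks, sequence, tolerance=0.2):
    """Number of measured peaks explained by the theoretical digest."""
    theoretical = [peptide_mass(p) for p in trypsin_digest(sequence)]
    return sum(any(abs(peak - t) <= tolerance for t in theoretical) for peak in peaks)

def best_match(peaks, database, tolerance=0.2):
    """Name of the database entry that explains the most peaks."""
    return max(database, key=lambda name: count_matches(peaks, database[name], tolerance))

database = {"protA": "AGKPLRMSK", "protB": "GGGGKAAAAR"}  # invented entries
peaks = [peptide_mass("GGGGK"), peptide_mass("AAAAR")]    # simulated peak list
hit = best_match(peaks, database)  # "protB"
```

Counting matches is the simplest possible ranking; it does not account for protein size or the chance of random matches, which is why real search engines report a statistical score instead.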