Intrinsically Disordered Proteins: Methods and Protocols (Methods in Molecular Biology, 2141) 1071605232, 9781071605233

The edition details methods to study intrinsically disordered proteins (IDPs) including recent topics such as extremely


English Pages 969 [934] Year 2020

Table of contents:
Preface
Acknowledgements
Contents
Contributors
Part I: Sequence Properties
Chapter 1: Disorder for Dummies: Functional Mutagenesis of Transient Helical Segments in Disordered Proteins
1 Introduction
2 Materials
3 Methods
3.1 Disorder Predictions and Sequence Alignments
3.2 Prediction of Transient Helicity
3.3 Measuring Transient Helicity with CD Spectroscopy
3.4 Measuring Transient Helicity with NMR
3.5 Mutant Design
4 Notes
References
Chapter 2: Computational Prediction of Intrinsic Disorder in Protein Sequences with the disCoP Meta-predictor
1 Introduction
2 Materials
3 Methods
3.1 Running disCoP
3.2 Results Generated by disCoP
4 Case Study
5 Notes
References
Chapter 3: Computational Prediction of Disordered Protein Motifs Using SLiMSuite
1 Introduction
2 Materials
2.1 SLiMSuite Installation
2.2 SLiMSuite Directory Structure
2.3 Python 2.x
2.4 SLiMSuite Servers
2.5 SLiMSuite Tools
2.5.1 SLiMProb: Prediction of Predefined SLiMs
2.5.2 SLiMFinder: De Novo SLiM Prediction
2.5.3 QSLiMFinder: Query-Focused De Novo SLiM Prediction
2.5.4 CompariMotif: Motif-Motif Comparisons
2.5.5 GABLAM: Global Alignment from BLAST Local Alignment Matrix
2.5.6 GOPHER: Generation of Orthologous Proteins from Homology-Based Estimation of Relationships
2.5.7 SLiMFarmer: SLiMSuite Job Farming Wrapper
2.5.8 SLiMParser: SLiMSuite REST Job Generation and Parsing Tool
2.6 SLiMSuite Dependencies
2.6.1 BLAST+ Homology Search
2.6.2 IUPred Disorder Predictor
2.6.3 ClustalW2 Multiple Sequence Alignment
2.6.4 ClustalOmega Multiple Sequence Alignment
2.6.5 R
2.7 Input Data for Motif Prediction
2.7.1 Data Types
Multi-protein Datasets
Single Proteins
Multiple Multi-protein Datasets
Proteomes
Motif Data
2.7.2 SLiMSuite FASTA Format
2.7.3 UniProt Text Files
2.7.4 Motif Formats
2.7.5 Example Data Table
3 Methods
3.1 Disorder Masking
3.1.1 Masking from Disorder Prediction Scores
3.1.2 Masking UniProt Features
3.1.3 Custom Masking of FASTA Files
3.2 Disordered Motif Prediction
3.2.1 Prediction of New Disordered Instances of Known SLiMs (SLiMProb)
3.2.2 Predicting Novel SLiMs De Novo in a Set of Proteins (SLiMFinder)
3.2.3 Query-Focused De Novo SLiM Prediction (QSLiMFinder)
3.2.4 Restricting De Novo SLiM Prediction to a Specific Protein Region (QSLiMFinder)
3.2.5 Identifying Known Motifs from De Novo Predictions (CompariMotif)
3.2.6 Comparing Different De Novo Prediction Runs (CompariMotif)
3.2.7 Searching De Novo Predictions Against Another Set of Proteins (SLiMProb)
4 Notes
References
Chapter 4: How to Annotate and Submit a Short Linear Motif to the Eukaryotic Linear Motif Resource
1 Introduction
2 Materials
2.1 Material Required to Submit a Motif to ELM
3 Methods
3.1 How to Annotate and Submit a Motif Instance
3.2 How to Annotate and Submit a Motif Class
3.3 How to Submit a Candidate Motif Class
3.4 How to Update a Motif Class
4 Notes
References
Chapter 5: Analyzing the Sequences of Intrinsically Disordered Regions with CIDER and localCIDER
1 Introduction
2 Materials
3 Methods
3.1 Using the CIDER Webserver
3.2 Using the localCIDER Software Package
3.3 Sequence Parameters Calculated by CIDER
3.3.1 Fraction of Charged Residues
3.3.2 Net Charge per Residue
3.3.3 Kappa (κ)
3.3.4 Fraction Hydropathy and Fraction Disorder-Promoting Residues
3.3.5 Diagram of States
3.3.6 The localCIDER Software Package
3.3.7 Example: A Case Study on p27 (Residues 100-198)
3.3.8 Limitations of CIDER Sequence Analysis
3.3.9 Additional Approaches/Principles for the Analysis of Disordered Protein Sequences
4 Notes
References
Chapter 6: Exploring Protein Intrinsic Disorder with MobiDB
1 Introduction
2 Materials
3 Methods
3.1 Discovering Protein Intrinsic Disorder Using MobiDB 3.0
3.1.1 Curated Data
3.1.2 Derived Data
3.1.3 Predicted Data
3.1.4 Consensus Data
3.2 Applications
3.2.1 Working Case: Beta-Catenin
3.2.2 Disorder Flavors
3.2.3 MobiDB-Lite Could Help to Identify Order-Disorder Transitions
4 Notes
References
Part II: Evolution
Chapter 7: An Easy Protocol for Evolutionary Analysis of Intrinsically Disordered Proteins
1 Introduction
2 Materials
3 Methods
3.1 Finding Homologous Sequences
3.2 Using BLAST to Generate a Dataset
3.2.1 Isoform Selection
3.2.2 Renaming Headers
3.3 Making a High-Quality Multiple Sequence Alignment
3.3.1 Multiple Sequence Alignment Construction
3.3.2 Multiple Sequence Alignment Visualization
3.4 Building a Phylogenetic Tree
3.4.1 Phylogenetic Reconstruction
3.4.2 Tree Visualization
3.4.3 Tree Analysis
3.5 Predicting Disorder
3.5.1 Intrinsic Disorder Prediction
3.6 Evolutionary Dynamics of Intrinsic Disorder
3.7 Reflections and Future Directions
4 Notes
References
Part III: Production
Chapter 8: Expression and Purification of an Intrinsically Disordered Protein
1 Introduction
2 Materials
2.1 Bacterial Host and Plasmid Construct
2.2 Rich Bacterial Media
2.3 M9 Minimal Media
2.4 Purification Steps 1 and 2
2.5 High-Performance Liquid Chromatography (HPLC) Buffers
2.6 Ulp1 Purification
2.7 Tris-Tricine Acrylamide Gel
3 Methods
3.1 Protein Expression
3.2 Cell Lysis
3.3 Purification Step 1
3.4 Purification Step 2
3.5 Ulp1 Expression and Purification
3.6 Making a Tris-Tricine Polyacrylamide Gel
3.7 Tris-Tricine Polyacrylamide Gel Electrophoresis
4 Notes
References
Chapter 9: Production of Intrinsically Disordered Proteins for Biophysical Studies: Tips and Tricks
1 Introduction
1.1 Considerations to Keep in Mind Before Going to the Laboratory
1.1.1 Length of Primary Structure
1.1.2 Amino Acid Composition
1.1.3 Preventing Unwanted Degradation
2 Materials
2.1 Denaturing-Based Approaches
2.1.1 Heat Treatment
2.1.2 Isoelectric Precipitation
2.2 Chromatography-Based Approaches
2.2.1 Ion-Exchange Chromatography
2.2.2 Reversed-Phase Chromatography
3 Methods
3.1 Denaturing-Based Approaches
3.1.1 Purification by Heat Treatment
3.1.2 Isoelectric Precipitation
3.2 Chromatography-Based Approaches
3.2.1 Ion-Exchange Chromatography
3.2.2 Reversed-Phase Chromatography
4 Notes
References
Chapter 10: Recombinant Production of Monomeric Isotope-Enriched Aggregation-Prone Peptides: Polyglutamine Tracts and Beyond
1 Introduction
2 Materials
2.1 Transformation and Expression
2.2 Purification and Monomerization
2.3 NMR Sample Preparation
3 Methods
3.1 Transformation of E. coli Competent Cells
3.2 Expression of the His6-Sumo-Qn Fusion Construct in Isotope-Enriched Minimal Medium
3.3 Purification of the Isolated Qn Peptides
3.4 Purity Polishing and Monomerization of the Qn Peptides
3.5 Sample Preparation for NMR Measurements
4 Notes
References
Chapter 11: Cell-Free Protein Synthesis of Small Intrinsically Disordered Proteins for NMR Spectroscopy
1 Introduction
2 Materials
2.1 General Considerations and Necessary Material and Equipment
2.2 Extract Cultivation
2.3 Extract Preparation and Mg2+ Optimum Determination
2.4 CFPS Reagents
3 Methods
3.1 Extract Cultivation
3.2 Extract Preparation
3.3 Plasmid Preparation
3.4 CFPS Reaction Setup
3.5 Extract Activity Test and Mg2+ Optimization
3.6 Expression Assessment
4 Notes
References
Part IV: Dynamics, Ensembles, and Structures
Chapter 12: Structural Analyses of Intrinsically Disordered Proteins by Small-Angle X-Ray Scattering
1 Introduction
2 Materials
2.1 Batch Mode
2.2 SEC-SAXS Mode
3 Methods
3.1 SAXS Measurements in Batch Mode
3.2 Size-Exclusion Chromatography Coupled to SAXS (SEC-SAXS)
3.3 Primary Data Analysis
3.3.1 Checking for Concentration-Dependent Effects
3.3.2 Initial Estimates of Flexibility
3.3.3 Pair-Wise Distance Distribution Function
3.4 Ensemble Optimization Method for the Analysis of SAXS Profiles
3.4.1 Generation of an Ensemble of Conformations
3.4.2 Selection of a Sub-ensemble of Conformations that Describes the SAXS Data
3.4.3 Quantitative Estimation of Flexibility of the Ensemble
4 Notes
References
Chapter 13: Determining Rg of IDPs from SAXS Data
1 Introduction
2 Materials
3 Methods
3.1 General Steps to Determine Radius of Gyration
3.2 Determination of Rg of Histatin 5 by the Guinier Approach Using PRIMUS
3.3 Determination of the Pair Distance Distribution Function in PRIMUS
4 Notes
References
Chapter 14: Obtaining Hydrodynamic Radii of Intrinsically Disordered Protein Ensembles by Pulsed Field Gradient NMR Measuremen...
1 Introduction
2 Materials
2.1 Instruments
2.2 Materials and Solutions
2.3 Software
3 Methods
3.1 Gradient Calibration
3.2 Sample Preparation
3.2.1 Lyophilization
3.2.2 Sample Dilution
3.3 Diffusion Measurement and Hydrodynamic Radius Determination
3.3.1 Absolute Diffusion Constant
3.3.2 Internal Standard
4 Notes
References
Chapter 15: Quantitative Protein Disorder Assessment Using NMR Chemical Shifts
1 Introduction
2 Materials
2.1 Computation of RCCS Reference Values Using the POTENCI Web Application
2.2 Computation of RCCS Reference Values Using a Command-Line Python Script
2.3 Computation of CheZOD Z-Scores Using a Web Application
2.4 Computation of CheZOD Z-Scores from a Command-Line Interface Python Script
3 Methods
3.1 Computation of RCCS Reference Values Using a Web Application
3.2 Computation of RCCS Reference Values Using a Command-Line Python Script
3.3 Computation of CheZOD Z-Scores Using a Web Application
3.4 Computation of CheZOD Z-Scores from a Command-Line Interface Python Script
3.5 Interpretation of the CheZOD Z-Scores
4 Notes
References
Chapter 16: Determination of pKa Values in Intrinsically Disordered Proteins
1 Introduction
2 Materials
2.1 Media and Reagents
2.2 Buffers
2.3 Other Materials
3 Methods
3.1 Protein Expression in M9 Minimal Media
3.2 Cell Lysis and Protein Purification
3.3 NMR Spectroscopy
3.3.1 Sample Preparation
3.3.2 1H-15N Heteronuclear Single Quantum Correlation (HSQC) Spectra for Protein Backbone Assignment
3.3.3 Aromatic 1H-13C HSQC Spectra for Measuring His pKas
3.3.4 2D 13CO(i)/1HN(i+1) Spectra for Measuring Asp and Glu pKas
3.4 Data Processing, Analysis, and Model Fitting
3.4.1 Processing and Annotating NMR Spectra
3.4.2 Model Fitting and pKa Determination Using MATLAB
4 Notes
References
Chapter 17: Paris-DÉCOR: A Protocol for the Determination of Fast Protein Backbone Amide Hydrogen Exchange Rates
1 Introduction
2 Materials
3 Methods
3.1 Data Acquisition
3.2 Data Processing
3.3 Determination of Hydrogen Exchange Rates by Numerical Fitting
4 Notes
References
Part V: Ensembles by Computation
Chapter 18: Predicting Conformational Properties of Intrinsically Disordered Proteins from Sequence
1 Introduction
1.1 R1: FCR < 0.25 and |NCPR| < 0.25; Globules and Tadpoles
1.2 R2: 0.25 ≤ FCR ≤ 0.35 and |NCPR| ≤ 0.35; Range from Globules to Coils Depending on Context
1.3 R3: FCR > 0.35 and |NCPR| ≤ 0.35; Charge Patterning Dictates Compaction from Coil to Hairpin
1.4 R4: FCR > 0.35 and |NCPR| > 0.35; Coils and Semiflexible Rods
2 Materials
2.1 IDR Sequence(s)
2.2 Sequence Analysis Tools
3 Methods
3.1 Predicting Qualitative Conformational Properties from Global Amino Acid Composition
3.2 Classifying Experimentally Determined IDRs within the Diagram of States
3.3 Extracting Conformational Information from Charge Patterning
3.3.1 Calculating κ
3.3.2 Calculating SCD
3.3.3 Comparing κ and SCD
3.3.4 How Do κ and SCD Correlate with Conformational Properties?
3.3.5 Weaknesses of κ
3.3.6 Weaknesses of SCD
3.4 Extracting Conformational Information from Other Sequence Patterns
3.4.1 Patterning of Expansion Driving Residues
3.4.2 General Patterning
3.5 Applying Patterning Metrics to Understand the Effects of Phosphorylation on Conformation
3.6 Extracting Quantitative Conformational Information from Sequence
3.6.1 Extracting Rh from Sequence
3.6.2 Comparing Conformation to a Reference Coil: Chain Expansion Parameter
3.7 The Future of Predicting Conformation from Sequence
4 Notes
References
Chapter 19: Enhanced Molecular Dynamics Simulations of Intrinsically Disordered Proteins
1 Introduction
2 Materials
2.1 Software
3 Methods
3.1 Preparation of the Molecular System
3.2 Simulations
4 Notes
References
Chapter 20: Computational Protocol for Determining Conformational Ensembles of Intrinsically Disordered Proteins
1 Introduction
2 Materials
2.1 Force Field
2.2 Simulation Software
2.3 Experimental Data
3 Methods
3.1 Ensemble Reweighting
3.1.1 Generation of Initial Ensemble
3.1.2 Reweighting Scheme
3.1.3 Avoiding Overfitting
3.1.4 How to Actually Find the Optimal Weights?
3.1.5 Assessing the Final Weights
3.2 Validating and Using the Reweighted Ensemble
4 Notes
References
Chapter 21: Computing, Analyzing, and Comparing the Radius of Gyration and Hydrodynamic Radius in Conformational Ensembles of ...
1 Introduction
2 Materials
2.1 Experimental Data and Sequence of Sic1
2.2 Software
3 Methods
3.1 Generating Ensembles
3.2 Calculating Rg and Rh from Ensembles
3.3 A Bayesian/Maximum Entropy Approach
3.4 Calculating SAXS Data from Ensembles
3.5 Reweighting Sic1 Ensembles Against SAXS and NMR Diffusion Experiments
3.6 Summary
4 Notes
References
Part VI: Determinants of Interactions
Chapter 22: Binding Thermodynamics to Intrinsically Disordered Protein Domains
1 Introduction
1.1 Isothermal Titration Calorimetry
1.2 Thermodynamic Information Obtained from ITC at One Temperature
1.3 ITC Results for gp120
1.4 Determination of the Change in Heat Capacity Associated with Binding
1.5 Experimental Design
2 Materials
2.1 Reagents and Supplies
2.2 Isothermal Titration Calorimeter
2.3 Experimental Systems
3 Methods
3.1 Sample Preparation
3.2 Experimental Procedure
3.3 Analysis of the Data
3.4 Estimation of the Degree of Conformational Structuring
4 Notes
References
Chapter 23: Analysis of Multivalent IDP Interactions: Stoichiometry, Affinity, and Local Concentration Effect Measurements
1 Introduction
2 Materials
2.1 Buffers
2.2 Proteins
2.3 Equipment
2.4 Software
3 Methods
3.1 Protein Production
3.2 NMR Titration for Affinity Measurement
3.3 ITC Titration for Affinity Measurement
3.4 Determination of Stoichiometry
3.5 Estimation of Local Concentration Effect
3.6 Combining Results from the Different Analyses
4 Notes
References
Chapter 24: NMR Lineshape Analysis of Intrinsically Disordered Protein Interactions
1 Introduction
2 Materials
2.1 NMR Spectrometer
2.2 NMR Tubes
2.3 NMR Samples
2.4 Software
3 Data Acquisition
3.1 Sample Concentrations
3.2 Experimental Setup
3.3 Performing the NMR Titration
3.4 Data Analysis
3.4.1 Analysis of 1D 1H Spectra
3.4.2 Processing
3.4.3 Two-Dimensional Lineshape Analysis
4 Notes
References
Chapter 25: Measuring Effective Concentrations Enforced by Intrinsically Disordered Linkers
1 Introduction
2 Materials
2.1 DNA Constructs
2.2 Expression and Purification of Biosensor
2.3 Expression and Purification of Competitor Peptide
2.4 Measurement of Effective Concentration
3 Methods
3.1 Cloning of Linker Sequence into Reporter Plasmid
3.2 Expression and Purification of Fluorescent Biosensor
3.2.1 Biosensor Expression
3.2.2 Generation of Lysate
3.2.3 Biosensor Purification by Ni-Affinity Chromatography
3.2.4 Biosensor Purification by Strep-Tag Affinity Chromatography
3.3 Expression and Purification of MBD2 Peptide
3.4 Competition Titration
3.5 Data Analysis
4 Notes
References
Chapter 26: Determining the Protective Activity of IDPs Under Partial Dehydration and Freeze-Thaw Conditions
1 Introduction
2 Material
2.1 Stock Solutions
2.2 Treatment Buffers
2.3 Protein Solutions
2.4 Reaction Buffers
2.5 Equipment
3 Methods
3.1 Master Mix Preparation
3.2 Low Water Availability Assay: Freeze-Thaw Treatment
3.3 Low Water Availability Assay: Dehydration Treatment
3.4 Measurement of LDH or ADH Activity: Spectrophotometer Setting
3.5 Measurement of LDH or ADH Activity: Enzyme Activity Quantification
4 Notes
References
Chapter 27: Screening Intrinsically Disordered Regions for Short Linear Binding Motifs
1 Introduction
2 Materials
2.1 General Materials
2.2 Purification of du-ssDNA
2.3 PCR Amplification of Custom-Designed Oligonucleotide Pool and dsDNA Quantification
2.4 Synthesis of dsDNA Phagemid Library
2.5 Electroporation and Propagation of the ProP-PD Library
2.6 High-Throughput Expression and Purification
2.7 ProP-PD Selections and Pooled Phage ELISA
2.8 Preparing Samples for Next-Generation Sequencing (NGS)
3 Methods
3.1 Phage Library Construction
3.1.1 Purification of du-ssDNA
3.1.2 PCR Amplification of the Custom-Designed Oligonucleotide Pool and dsDNA Quantification
3.1.3 Synthesis of a dsDNA Phagemid Library
3.1.4 Electroporation and Propagation of Phage Library
3.2 High-Throughput Expression and Purification of Bait Proteins
3.2.1 Protein Expression
3.2.2 Protein Purification
3.3 HTP ProP-PD Selection and Pooled Phage ELISA
3.3.1 Phage Display Selection
3.3.2 Phage Pool ELISA (Day 5)
3.4 Preparing Samples for NGS
3.4.1 Barcoding and Amplification
3.4.2 Normalization, Pooling, and Cleaning of Amplified Products
Normalization of Samples
Pooling and Cleaning
3.5 NGS Data Handling
3.5.1 NGS Data Demultiplexing and Cleanup
3.5.2 Data Analysis and Storage
4 Notes
References
Part VII: Interactions on Surfaces
Chapter 28: Probing IDP Interactions with Membranes by Fluorescence Spectroscopy
1 Introduction
1.1 Tau: An Intrinsically Disordered Protein
1.2 Tau-Membrane Interactions
1.3 Application of Fluorescence Spectroscopy for Probing Tau-Membrane Interactions
2 Materials
2.1 Recombinant Protein Production and Purification
2.2 Small Unilamellar Vesicle (SUV) Preparation
2.3 Acrylodan Fluorescence
2.4 Tryptophan Fluorescence
3 Methods
3.1 Recombinant Protein Expression and Purification
3.1.1 1 L Growth of Unlabeled Tau Protein
3.1.2 Purification of Unlabeled Tau from 1 L
3.1.3 Site-Directed Mutagenesis
3.2 SUV Lipid Vesicle Preparation
3.3 Acrylodan Fluorescence
3.4 Tryptophan Fluorescence
3.5 Data Analysis
4 Notes
References
Chapter 29: Protocol for Investigating the Interactions Between Intrinsically Disordered Proteins and Membranes by Neutron Ref...
1 Introduction
2 Materials
3 Methods
3.1 NR Cell and Substrate Cleaning
3.2 POPC Vesicle Suspension
3.3 Formation of the Supported Lipid Bilayer
3.4 Interaction Between NHE1-LID and the POPC Lipid Bilayer
3.5 Data Analysis
4 Notes
References
Chapter 30: Interactions of IDPs with Membranes Using Dark-State Exchange NMR Spectroscopy
1 Introduction
1.1 Application of NMR for Probing IDP-Membrane Interactions
1.2 Membrane Interactions of Intrinsically Disordered Proteins
2 Materials
2.1 NMR
3 Methods
3.1 NMR
3.1.1 1H-15N HSQC Sample Preparation
3.1.2 NMR Spectrometer Setup
3.1.3 1H-15N HSQC Experimental Run and Setup
3.1.4 15N-T2 Relaxation Experimental Setup
3.1.5 15N-DEST Experimental Setup
3.2 Data Analysis
3.2.1 NMR Signal Processing
3.2.2 NMR Spectral Peak Picking
3.2.3 NMR Data Analysis and Interpretation: 1H-15N HSQC Intensity Ratios
3.2.4 NMR Data Analysis and Interpretation: 15N-T2 Relaxation
3.2.5 NMR Data Analysis and Interpretation: 15N-DEST
4 Notes
References
Part VIII: Binding Kinetics and Mechanisms
Chapter 31: Determination of Binding Kinetics of Intrinsically Disordered Proteins by Surface Plasmon Resonance
1 Introduction
2 Materials
3 Methods
3.1 Immobilization
3.2 Experimental Measurements
3.2.1 Single Cycle Protocol: Non-steady-State Kinetics
3.2.2 Single Cycle Protocol: Affinity at Steady-State Binding
3.3 Data Analysis
3.3.1 Data Analysis of Non-steady-State Kinetics (Figs. 1 and 2)
3.3.2 Data Analysis of Equilibrium Binding Constants at Steady-State Binding (Fig. 3)
4 Notes
References
Chapter 32: Measuring and Analyzing Binding Kinetics of Coupled Folding and Binding Reactions Under Pseudo-First-Order Conditi...
1 Introduction
2 Materials
2.1 Chemicals and Reactants
2.2 Instruments
2.3 Additional Equipment
3 Method
3.1 Equilibrium Measurements
3.2 Instrument Setup for Stopped-Flow Mixing Experiments
3.3 Sample Preparations for Stopped-Flow Mixing Experiments
3.4 Association Kinetics Under Pseudo-First-Order Conditions
3.5 Dissociation Kinetics by Displacement Under Pseudo-First-Order Conditions
4 Notes
References
Chapter 33: Understanding Binding-Induced Folding by Temperature Jump
1 Introduction
2 Materials
3 Methods
3.1 Protein Samples Preparation
3.2 Setup of the Instrument Optics
3.3 Setup of Discharge Unit
3.4 Temperature-Jump Experiment: Data Acquisition and Analysis
3.5 Analysis of kobs Dependence in Pseudo-First-Order Binding Experiments
4 Notes
References
Chapter 34: Determining Binding Kinetics of Intrinsically Disordered Proteins by NMR Spectroscopy
1 Introduction
1.1 Theoretical Descriptions of Chemical Exchange
1.2 Relaxation-Compensated CPMG Dispersion Experiments for Studying Chemical Exchange
1.3 The Carver and Richards Equation
1.4 GLOVE
1.5 Kinetic Models of Protein-Protein Interactions
2 Materials
2.1 Sample and Buffer
2.2 Spectrometer and Software
2.3 Pulse Sequence for the Relaxation-Compensated Constant-Time CPMG Experiment
3 Methods
3.1 Sample Preparation
3.2 Identification of the Appropriate Stoichiometric Range for the CPMG Titration Experiment
3.2.1 The HSQC Spectrum of the IDP Is Well-Resolved in Both the Free and Bound Form
3.2.2 The HSQC Spectrum of the IDP Is Well-Resolved in Both the Free and Bound Form
3.2.3 The HSQC Spectrum of the IDP Is Well-Resolved in the Free Form, but Severe Line-Broadening Is Observed in the Bound Form
3.3 CPMG Titration Data Acquisition
3.4 Dispersion Data Fitting Using the GLOVE Software
3.4.1 Initial Data Processing
3.4.2 Extract Peak Intensities Using pkfit or fudaFIT
3.4.3 Convert Peak Intensity to Using cpmg2glove
3.4.4 Fitting of Titration Points at Each Concentration Ratio
3.4.5 Fitting of a Titration Series to the Binding Model of Choice
4 Notes
References
Part IX: Higher Order-Phase Separation and Fibrillation
Chapter 35: Determination of Protein Phase Diagrams by Centrifugation
1 Introduction
2 Materials
2.1 Equipment
2.2 Stock Solutions
2.3 Working Solutions
3 Methods
3.1 Equipment Setup
3.2 Sample Preparation
3.3 Data Collection
3.3.1 Collection of Light Phase Concentration Data
3.3.2 Collection of Dense Phase Concentration Data
3.4 Data Processing
3.5 Example Calculations
3.5.1 Sample Preparation
First-Time Mapping of a Phase Diagram
Subsequent Data Collections
3.5.2 Calculating Concentration from A280 Readings
3.5.3 Calculating Error Bars
4 Notes
References
Chapter 36: In Vitro Transition Temperature Measurement of Phase-Separating Proteins by Microscopy
1 Introduction
2 Materials
2.1 Proteins and Buffers
2.2 Equipment
2.3 Software
3 Methods
3.1 Microscope and Thermal Stage Setup
3.2 Sample Preparation
3.3 Data Collection
3.4 Analysis
4 Notes
References
Chapter 37: Walking Along a Protein Phase Diagram to Determine Coexistence Points by Static Light Scattering
1 Introduction
2 Materials
2.1 Protein Sample Preparation
2.2 Static Light Scattering Measurements
2.3 Absorbance Measurements at 280 nm
3 Methods
3.1 Buffer Exchange for Protein That Is Under Denaturing Conditions
3.2 Preparation of the Light Scattering Instrument
3.3 Sample Preparation for Light Scattering
3.4 Light Scattering Measurements
3.5 Data Analysis
4 Notes
References
Chapter 38: Expression and Purification of Intrinsically Disordered Aβ Peptide and Setup of Reproducible Aggregation Kinetics ...
1 Introduction
2 Materials
2.1 Luria Broth (LB) Medium
2.2 LB/Agar Plates
2.3 Overnight Express Medium for Auto Induction
2.4 M9 Minimal Medium
2.5 Chromatography Resins, Columns, Chemicals, and Equipment
2.6 Buffers
3 Methods
3.1 Ca2+-Competent E. coli Cells
3.2 Transformation of the Expression Plasmid into E. coli
3.3 Expression in Overnight Express Medium for Auto Induction (See Note 20)
3.4 Expression of Aβ(M1-42) Peptide in M9 Minimal Medium (See Note 20)
3.5 Expression of Aβ Peptide in LB Medium (See Note 33)
3.6 Purification of Aβ(M1-40) or Aβ(M1-42) Without Urea (See Notes 34 and 35)
3.7 Purification of Aβ(M1-40) or Aβ(M1-42) with Urea
3.8 Modified Protocol for More Aggregation-Prone Aβ Variants (A2V, E22G, etc.)
3.9 Modified Protocol for Aβ Variants with a Different Net Charge (D7N, E22G, E22K, etc.)
3.10 Preparation of Samples for Kinetics Experiments
3.11 Kinetic Experiment by ThT Fluorescence
3.12 Kinetics Experiments with Pre-formed Seeds
3.13 Outsourced Synthesis and Cloning of Gene for Amyloid β Peptide Expression
4 Notes
References
Chapter 39: Measuring Interactions Between Tau and Aggregation Inducers with Single-Molecule Förster Resonance Energy Transfer
1 Introduction
2 Materials
2.1 Materials for Sample Preparation and Purification
2.1.1 Plasmids and Strains
2.1.2 Solutions, Media, and Buffers
2.2 SmFRET Instrument
3 Methods
3.1 PEG-PLL
3.2 smFRET Standards
3.3 Design of Tau Constructs for smFRET Measurements
3.3.1 Introduction of Cysteines for Labeling
3.3.2 Creation of Tau Isoforms and Fragments
3.4 Tau Expression and Purification
3.5 Labeling Tau for smFRET Measurements
3.6 Preparation of Coverslips
3.7 Checking Alignment by Measuring Diffusion Times of Fluorescent Standards
3.8 Calibrate Instrument with smFRET Standards
3.9 SmFRET Measurements of Tau
3.10 Analysis of smFRET Data
3.11 SmFRET Measurements with polyP or Other Aggregation Inducers
4 Notes
References
Part X: Modification and Targeting
Chapter 40: Detection of Multisite Phosphorylation of Intrinsically Disordered Proteins Using Phos-tag SDS-PAGE
1 Introduction
2 Materials
2.1 Mn2+-Phos-tag™ SDS-PAGE Gels
2.2 Zn2+-Phos-tag™ SDS-PAGE Gels
2.3 Colloidal Coomassie Staining
2.4 In Vitro Kinase Assay and Autoradiography
3 Methods
3.1 In Vitro Kinase Assay
3.2 Preparing Phos-tag™ SDS-PAGE Gels
3.3 Electrophoresis
3.4 Post-electrophoresis Methods
3.4.1 Coomassie Staining
3.4.2 Western Blotting
3.4.3 Autoradiography
4 Notes
References
Chapter 41: Multiple Site-Specific Phosphorylation of IDPs Monitored by NMR
1 Introduction
2 Materials
2.1 Stock Solutions Preparation and Storage
2.2 Sample Preparation
2.3 NMR Spectra Acquisition and Analysis
3 Methods
3.1 Preparation of the IDP Stock
3.2 Optimization of Phosphorylation Conditions
3.3 Assignment of NMR Spectra in the Conditions Used for Phosphorylation Reactions
3.4 Phosphorylation Monitoring by NMR
3.4.1 Phosphorylation Monitoring Using Continuous NMR Readout
3.4.2 Phosphorylation Monitoring Using Quenched Reactions
3.5 Analysis of NMR Data
4 Notes
References
Chapter 42: Detection of Multisite Phosphorylation of Intrinsically Disordered Proteins Using Quantitative Mass-Spectrometry
1 Introduction
2 Materials
2.1 In Vitro Kinase Assay
2.2 SDS Polyacrylamide Gels
2.3 Coomassie Blue G-250 Staining
2.4 In-Gel Digestion
2.5 Sample Clean-Up on StageTips
2.6 LC-MS/MS Analysis
3 Methods
3.1 Kinase Assay
3.2 SDS Polyacrylamide Gel Electrophoresis
3.3 Coomassie Blue G-250 Staining
3.4 Autoradiography
3.5 In-Gel Digestion of Proteins
3.6 Sample Clean-Up on StageTips
3.7 LC-MS/MS Analyses
3.8 Mass Spectrometry Data Analysis
4 Notes
References
Chapter 43: Targeting an Intrinsically Disordered Protein by Covalent Modification
1 Introduction
2 Materials
2.1 Warheads
2.2 Covalent Modification of Calpastatin and Calpain Inhibition
3 Methods
3.1 Characterization of Warheads
3.2 Covalent Modification of Calpastatin
3.2.1 Modification Reaction
3.2.2 Following Modification by DTNB
3.2.3 Following Protein Modification by MS
Mass Determination of Proteins
Protein Sequence Analysis
3.2.4 Identification of the Modification by SDS-PAGE
3.3 CD Spectroscopy
3.4 Inhibition Assay
3.4.1 Determination of Michaelis-Menten Constant (Km)
3.4.2 Inhibition Assay: Determination of Inhibitory Constant (Ki)
3.4.3 Data Analysis
3.5 Conclusions and Recommendations
4 Notes
References
Part XI: In Cell and Interactomes
Chapter 44: Recording In-Cell NMR-Spectra in Living Mammalian Cells
1 Introduction
2 Materials
2.1 Solutions and Media
2.2 Instruments/Equipment
3 Methods
3.1 Expression and Purification of Labeled Proteins in Escherichia coli
3.2 Preparation of HEK-293 Cells for Electroporation
3.3 In-Cell NMR-Sample Preparation
3.4 NMR Measurements
4 Notes
References
Chapter 45: In-Cell NMR of Intrinsically Disordered Proteins in Mammalian Cells
1 Introduction
2 Materials
2.1 Expression and Purification of 15N-Labelled N-Terminally Acetylated αSyn in Bacteria
2.1.1 Equipment and Reagents for Transformation
2.1.2 Equipment and Reagents for Preculture
2.1.3 Equipment and Reagents for LB Culture
2.1.4 Equipment and Reagents for Minimal Media-Culture and Induction
2.1.5 Equipment and Reagents for Rapid Purification of αSyn
2.2 Cell Culture and Manipulation
2.2.1 Equipment and Reagents for Culturing HEK-293 Cells
2.2.2 Equipment and Reagents for αSyn Electroporation
2.2.3 Equipment and Reagents for Cell Harvesting and Collection into the NMR Tube
2.2.4 Equipment and Reagents for NMR
3 Methods
3.1 αSyn Expression and Purification
3.1.1 Bacterial Transformation
3.1.2 Preculture, 2x LB Culture, and Expression in Minimal Media
3.1.3 Periplasmic Purification
3.1.4 Protein Clearance by Heat Shock
3.1.5 Protein Clearance by Precipitation with AS
3.2 αSyn Intracellular Delivery (See Note 12)
3.2.1 Cell Culture
3.2.2 αSyn Delivery by Electroporation
3.3 Cell Collection and NMR
3.3.1 Cell Collection
3.3.2 Setting Up the Spectrometer and NMR Measurements (See Note 23)
3.3.3 Spectrum Visualization and Basic Data Analysis
3.3.4 Determination of Cell Leakage
4 Notes
References
Chapter 46: Analyzing IDPs in Interactomes
1 Introduction
2 Materials
3 Methods
3.1 Retrieving Sequence Information from the UniProt Database
3.2 Generating PPIs for a Query Protein
3.2.1 Using UniProt to Retrieve a Set of PPIs for a Query Protein
3.2.2 Using BioGRID to Retrieve a PPI Network for a Query Protein
3.2.3 Using IntAct to Retrieve a PPI Network for a Query Protein
3.2.4 Using DIP to Retrieve a PPI Network for a Query Protein
3.2.5 Using MINT to Retrieve a PPI Network for a Query Protein
3.2.6 Using HPRD to Retrieve a PPI Network for a Query Protein
3.2.7 Using KEGG to Search for PPI Networks (or Pathways) Associated with a Query Protein
3.2.8 Using STRING to Search for PPI Network of a Query Protein
3.2.9 Using APID to Search for PPI Network of a Query Protein
3.2.10 Advantages and Limitations of the Different Network Tools
3.3 Generating PPIs for a Set of Query Proteins
3.3.1 Using STRING to Generate a PPI Network for a Set of Query Proteins
3.3.2 Using APID to Generate a PPI Network for a Set of Query Proteins
3.4 Analyzing Intrinsic Disorder Predisposition of a Query Protein (or a Set of Query Proteins) and Interactors from the Corre...
3.4.1 Looking at the Global Disorder Status of All the Members of a PPI Network Protein
Analysis of Protein Amino Acid Composition: Applying Compositional Profiler Tool to Obtain Fractional Amino Acid Composition o...
Using CH-Plot and CDF Binary Predictors and a Combined CH-CDF Approach
Using Per-residue Predictors
3.4.2 Detailed Characterization of the Peculiarities of Disorder Distribution and Disorder-Based Functionality in a Set of Sel...
3.4.3 Web-Based Means for the Visualization of Disorder Distribution in a Protein
3.4.4 Intrinsic Disorder-Based Functional Analyses
4 Notes
References
Correction to: Intrinsically Disordered Proteins
Index
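
The region boundaries named in the Chapter 18 headings (R1 to R4, defined by the fraction of charged residues, FCR, and the net charge per residue, NCPR) can be illustrated with a short sketch. This is not code from the book or from localCIDER. The function names are hypothetical, Asp/Glu are counted as negative and Lys/Arg as positive with His treated as neutral, and the R2 bounds and the R3 |NCPR| bound are taken as inclusive; all of these are assumptions made for illustration.

```python
def fcr_ncpr(sequence):
    """Return (FCR, NCPR) for a one-letter amino acid sequence.

    Assumption: D/E are negative, K/R are positive, His is neutral.
    """
    seq = sequence.upper()
    if not seq:
        raise ValueError("sequence must not be empty")
    n_pos = seq.count("K") + seq.count("R")
    n_neg = seq.count("D") + seq.count("E")
    fcr = (n_pos + n_neg) / len(seq)   # fraction of charged residues
    ncpr = (n_pos - n_neg) / len(seq)  # net charge per residue (signed)
    return fcr, ncpr


def classify_region(fcr, ncpr):
    """Map (FCR, NCPR) to the R1-R4 labels used in the Chapter 18 headings.

    Assumption: bounds shown without a strict inequality are inclusive.
    """
    a = abs(ncpr)
    if fcr < 0.25 and a < 0.25:
        return "R1"  # globules and tadpoles
    if 0.25 <= fcr <= 0.35 and a <= 0.35:
        return "R2"  # globules to coils, context dependent
    if fcr > 0.35 and a <= 0.35:
        return "R3"  # charge patterning dictates compaction
    if fcr > 0.35 and a > 0.35:
        return "R4"  # coils and semiflexible rods
    # |NCPR| can never exceed FCR, so this only triggers on inconsistent input.
    raise ValueError("inconsistent FCR/NCPR values")


if __name__ == "__main__":
    for s in ("GGGGGGGGGG", "KEKGGGGGGG", "KEKEKEKEKE", "KKKKKKKKKK"):
        print(s, classify_region(*fcr_ncpr(s)))
```

Because |NCPR| can never exceed FCR for a real sequence, the four conditions cover every sequence-derived pair; the final error branch only guards against values supplied by hand.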

Author / Uploaded
Birthe B. Kragelund (editor)
Karen Skriver (editor)

Methods in Molecular Biology 2141

Birthe B. Kragelund and Karen Skriver, Editors

Intrinsically Disordered Proteins: Methods and Protocols

METHODS IN MOLECULAR BIOLOGY

Series Editor: John M. Walker, School of Life and Medical Sciences, University of Hertfordshire, Hatfield, Hertfordshire, UK

For further volumes: http://www.springer.com/series/7651

For over 35 years, biological scientists have come to rely on the research protocols and methodologies in the critically acclaimed Methods in Molecular Biology series. The series was the first to introduce the step-by-step protocols approach that has become the standard in all biomedical protocol publishing. Each protocol is provided in readily-reproducible step-by-step fashion, opening with an introductory overview, a list of the materials and reagents needed to complete the experiment, and followed by a detailed procedure that is supported with a helpful notes section offering tips and tricks of the trade as well as troubleshooting advice. These hallmark features were introduced by series editor Dr. John Walker and constitute the key ingredient in each and every volume of the Methods in Molecular Biology series. Tested and trusted, comprehensive and reliable, all protocols from the series are indexed in PubMed.

Edited by

Birthe B. Kragelund, Structural Biology and NMR Laboratory, Department of Biology, University of Copenhagen, København N, Denmark

Karen Skriver, Linderstrøm-Lang Centre for Protein Science, Department of Biology, University of Copenhagen, Copenhagen, Denmark

ISSN 1064-3745, ISSN 1940-6029 (electronic)
Methods in Molecular Biology
ISBN 978-1-0716-0523-3, ISBN 978-1-0716-0524-0 (eBook)
https://doi.org/10.1007/978-1-0716-0524-0

© Springer Science+Business Media, LLC, part of Springer Nature 2020, Corrected Publication 2021

Chapters 24, 40, and 42 are licensed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/). For further details see license information in the chapters.

This work is subject to copyright. All rights are reserved by the Publisher, whether the whole or part of the material is concerned, specifically the rights of translation, reprinting, reuse of illustrations, recitation, broadcasting, reproduction on microfilms or in any other physical way, and transmission or information storage and retrieval, electronic adaptation, computer software, or by similar or dissimilar methodology now known or hereafter developed.

The use of general descriptive names, registered names, trademarks, service marks, etc. in this publication does not imply, even in the absence of a specific statement, that such names are exempt from the relevant protective laws and regulations and therefore free for general use.

The publisher, the authors, and the editors are safe to assume that the advice and information in this book are believed to be true and accurate at the date of publication. Neither the publisher nor the authors or the editors give a warranty, expressed or implied, with respect to the material contained herein or for any errors or omissions that may have been made. The publisher remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

This Humana imprint is published by the registered company Springer Science+Business Media, LLC, part of Springer Nature. The registered company address is: 1 New York Plaza, New York, NY 10004, U.S.A.

Preface

Intrinsically disordered proteins (IDPs) challenge current views on biomolecular communication because these proteins participate in functional biological complexes despite their lack of 3D structures, a scenario which until recently was unimaginable. An important key to unlocking the enigmatic nature of IDPs is the development of appropriate experimental methods and tools for their analysis, and derivation of mechanistic models. Due to their dynamic nature, IDPs expand the types of possible complexes, and they are highly challenging to investigate with traditional methods. However, at a remarkable rate, novel modes of operation are being uncovered, leading to new methods, tools, and strategies developed and modified specifically for studies of IDPs. Therefore, a Methods in Molecular Biology book on Intrinsically Disordered Proteins: Methods and Protocols is warranted. This edition follows, but is independent from, the first successful books on the topic in this series published almost 10 years ago.

The edition covers methods to study IDPs in general and includes methods to study very recent topics such as extremely high-affinity disordered complexes, kinetics that evades established concepts, liquid–liquid phase separation, and novel disorder-driven allosteric mechanisms. Written in the highly successful Methods in Molecular Biology™ series format, the chapters include the kind of detailed description and implementation advice that is crucial for dissemination of good practice and for getting optimal and reliable results in the laboratory. Thorough and intuitive, Intrinsically Disordered Proteins: Methods and Protocols helps scientists with different backgrounds to further their investigations of these fascinating and dynamic molecules.

Karen Skriver, Copenhagen, Denmark
Birthe B. Kragelund, København, Denmark

Acknowledgements

First and foremost, we owe a big thank you to all of the more than one hundred authors who, with impressive engagement, have each contributed a chapter to this special issue on Intrinsically Disordered Proteins (IDPs). We have been overwhelmed and impressed by the quality of the chapters, the many important details of hands-on tips and tricks, and the pride taken in presenting a useful protocol. We have met true excitement, which altogether underscores the eagerness and readiness to share important methods and protocols that is characteristic of the IDP field. We owe a special thank you to those of you who stepped in at the last minute. Thank you all!

Second, we are grateful to all our students and postdocs who have taken the time and effort to go through the protocols with a user’s eye, and for your many suggestions to improve the usefulness of the protocols. Lasse, Pernille, Inna, Sarah, Edoardo, Katrine, Christian, Frederik, Maria—thank you!

Finally, we acknowledge the support from John Walker and Patrick Marton from Springer for valuable feedback during the process and the support of the REPIN Centre funded by the Novo Nordisk Foundation, enabling us to allocate the needed time to get it all together.

Karen and Birthe

Contents

Preface
Acknowledgements
Contributors

PART I SEQUENCE PROPERTIES

1 Disorder for Dummies: Functional Mutagenesis of Transient Helical Segments in Disordered Proteins
Gary W. Daughdrill

2 Computational Prediction of Intrinsic Disorder in Protein Sequences with the disCoP Meta-predictor
Christopher J. Oldfield, Xiao Fan, Chen Wang, A. Keith Dunker, and Lukasz Kurgan

3 Computational Prediction of Disordered Protein Motifs Using SLiMSuite
Richard J. Edwards, Kirsti Paulsen, Carla M. Aguilar Gomez, and Åsa Pérez-Bercoff

4 How to Annotate and Submit a Short Linear Motif to the Eukaryotic Linear Motif Resource
Marc Gouw, Jesús Alvarado-Valverde, Jelena Čalyševa, Francesca Diella, Manjeet Kumar, Sushama Michael, Kim Van Roey, Holger Dinkel, and Toby J. Gibson

5 Analyzing the Sequences of Intrinsically Disordered Regions with CIDER and localCIDER
Garrett M. Ginell and Alex S. Holehouse

6 Exploring Protein Intrinsic Disorder with MobiDB
Alexander Miguel Monzon, András Hatos, Marco Necci, Damiano Piovesan, and Silvio C. E. Tosatto

PART II EVOLUTION

7 An Easy Protocol for Evolutionary Analysis of Intrinsically Disordered Proteins
Janelle Nunez-Castilla and Jessica Siltberg-Liberles

PART III PRODUCTION

8 Expression and Purification of an Intrinsically Disordered Protein
Karamjeet K. Singh and Steffen P. Graether

9 Production of Intrinsically Disordered Proteins for Biophysical Studies: Tips and Tricks
Christian Parsbæk Pedersen, Pernille Seiffert, Inna Brakti, and Katrine Bugge

10 Recombinant Production of Monomeric Isotope-Enriched Aggregation-Prone Peptides: Polyglutamine Tracts and Beyond
Albert Escobedo, Giulio Chiesa, and Xavier Salvatella

11 Cell-Free Protein Synthesis of Small Intrinsically Disordered Proteins for NMR Spectroscopy
Linnéa Isaksson and Anders Pedersen

PART IV DYNAMICS, ENSEMBLES, AND STRUCTURES

12 Structural Analyses of Intrinsically Disordered Proteins by Small-Angle X-Ray Scattering
Amin Sagar, Dmitri Svergun, and Pau Bernadó

13 Determining Rg of IDPs from SAXS Data
Ellen Rieloff and Marie Skepö

14 Obtaining Hydrodynamic Radii of Intrinsically Disordered Protein Ensembles by Pulsed Field Gradient NMR Measurements
Sarah Leeb and Jens Danielsson

15 Quantitative Protein Disorder Assessment Using NMR Chemical Shifts
Jakob T. Nielsen and Frans A. A. Mulder

16 Determination of pKa Values in Intrinsically Disordered Proteins
Brandon Payliss and Anthony Mittermaier

17 Paris-DÉCOR: A Protocol for the Determination of Fast Protein Backbone Amide Hydrogen Exchange Rates
Rupashree Dass and Frans A. A. Mulder

PART V ENSEMBLES BY COMPUTATION

18 Predicting Conformational Properties of Intrinsically Disordered Proteins from Sequence
Kiersten M. Ruff

19 Enhanced Molecular Dynamics Simulations of Intrinsically Disordered Proteins
Matteo Masetti, Mattia Bernetti, and Andrea Cavalli

20 Computational Protocol for Determining Conformational Ensembles of Intrinsically Disordered Proteins
Robert B. Best

21 Computing, Analyzing, and Comparing the Radius of Gyration and Hydrodynamic Radius in Conformational Ensembles of Intrinsically Disordered Proteins
Mustapha Carab Ahmed, Ramon Crehuet, and Kresten Lindorff-Larsen

PART VI DETERMINANTS OF INTERACTIONS

22 Binding Thermodynamics to Intrinsically Disordered Protein Domains
Arne Schön and Ernesto Freire

23 Analysis of Multivalent IDP Interactions: Stoichiometry, Affinity, and Local Concentration Effect Measurements
Samuel Sparks, Ryo Hayama, Michael P. Rout, and David Cowburn

24 NMR Lineshape Analysis of Intrinsically Disordered Protein Interactions
Christopher A. Waudby and John Christodoulou

25 Measuring Effective Concentrations Enforced by Intrinsically Disordered Linkers
Charlotte S. Sørensen and Magnus Kjaergaard

26 Determining the Protective Activity of IDPs Under Partial Dehydration and Freeze-Thaw Conditions
David F. Rendón-Luna, Paulette S. Romero-Pérez, Cesar L. Cuevas-Velazquez, José L. Reyes, and Alejandra A. Covarrubias

27 Screening Intrinsically Disordered Regions for Short Linear Binding Motifs
Muhammad Ali, Leandro Simonetti, and Ylva Ivarsson

PART VII INTERACTIONS ON SURFACES

28 Probing IDP Interactions with Membranes by Fluorescence Spectroscopy
Diana Acosta, Tapojyoti Das, and David Eliezer

29 Protocol for Investigating the Interactions Between Intrinsically Disordered Proteins and Membranes by Neutron Reflectometry
Alessandra Luchini and Lise Arleth

30 Interactions of IDPs with Membranes Using Dark-State Exchange NMR Spectroscopy
Tapojyoti Das, Diana Acosta, and David Eliezer

PART VIII BINDING KINETICS AND MECHANISMS

31 Determination of Binding Kinetics of Intrinsically Disordered Proteins by Surface Plasmon Resonance
Julie M. Leth and Michael Ploug

32 Measuring and Analyzing Binding Kinetics of Coupled Folding and Binding Reactions Under Pseudo-First-Order Conditions
Kristine Steen Jensen

33 Understanding Binding-Induced Folding by Temperature Jump
Angelo Toto, Francesca Troilo, Francesca Malagrinò, and Stefano Gianni

34 Determining Binding Kinetics of Intrinsically Disordered Proteins by NMR Spectroscopy
Ke Yang, Munehito Arai, and Peter E. Wright

PART IX HIGHER ORDER-PHASE SEPARATION AND FIBRILLATION

35 Determination of Protein Phase Diagrams by Centrifugation
Nicole M. Milkovic and Tanja Mittag

36 In Vitro Transition Temperature Measurement of Phase-Separating Proteins by Microscopy
Jack Holland, Michael D. Crabtree, and Timothy J. Nott

37 Walking Along a Protein Phase Diagram to Determine Coexistence Points by Static Light Scattering
Ivan Peran, Erik W. Martin, and Tanja Mittag

38 Expression and Purification of Intrinsically Disordered Aβ Peptide and Setup of Reproducible Aggregation Kinetics Experiment
Sara Linse

39 Measuring Interactions Between Tau and Aggregation Inducers with Single-Molecule Förster Resonance Energy Transfer
Sanjula P. Wickramasinghe and Elizabeth Rhoades

PART X MODIFICATION AND TARGETING

40 Detection of Multisite Phosphorylation of Intrinsically Disordered Proteins Using Phos-tag SDS-PAGE
Mihkel Örd and Mart Loog

41 Multiple Site-Specific Phosphorylation of IDPs Monitored by NMR
Manon Julien, Chafiaa Bouguechtouli, Ania Alik, Rania Ghouil, Sophie Zinn-Justin, and François-Xavier Theillet

42 Detection of Multisite Phosphorylation of Intrinsically Disordered Proteins Using Quantitative Mass-Spectrometry
Ervin Valk, Artemi Maljavin, and Mart Loog

43 Targeting an Intrinsically Disordered Protein by Covalent Modification
László Petri, Attila Mészáros, Hung Huy Nguyen, Péter Ábrányi-Balogh, Kris Pauwels, Guy Vandenbussche, György Miklós Keserű, and Peter Tompa

PART XI IN CELL AND INTERACTOMES

44 Recording In-Cell NMR-Spectra in Living Mammalian Cells
Irena Matečko-Burmann and Björn M. Burmann

45 In-Cell NMR of Intrinsically Disordered Proteins in Mammalian Cells
Juan A. Gerez, Natalia C. Prymaczok, and Roland Riek

46 Analyzing IDPs in Interactomes
Vladimir N. Uversky

Correction to: Intrinsically Disordered Proteins
Index

Contributors

PÉTER ÁBRÁNYI-BALOGH • Medicinal Chemistry Research Group, Research Centre for Natural Sciences, Hungarian Academy of Sciences, Budapest, Hungary
DIANA ACOSTA • Department of Biochemistry, Weill Cornell Medical College of Cornell University, New York, NY, USA; Brain and Mind Research Institute, Weill Cornell Medical College of Cornell University, New York, NY, USA
CARLA M. AGUILAR GOMEZ • School of Biotechnology and Biomolecular Sciences, University of New South Wales, Sydney, NSW, Australia
MUSTAPHA CARAB AHMED • Department of Biology, Structural Biology and NMR Laboratory, Linderstrøm-Lang Centre for Protein Science, University of Copenhagen, Copenhagen N, Denmark
MUHAMMAD ALI • Department of Chemistry, BMC, Uppsala, Sweden
ANIA ALIK • Université Paris-Saclay, CEA, CNRS, Institute for Integrative Biology of the Cell (I2BC), Gif-sur-Yvette, France
JESÚS ALVARADO-VALVERDE • Structural and Computational Biology Unit, European Molecular Biology Laboratory, Heidelberg, Germany; Faculty of Biosciences, Collaboration for Joint PhD Degree Between EMBL and Heidelberg University, Heidelberg, Germany
MUNEHITO ARAI • Department of Life Sciences, Graduate School of Arts and Sciences, The University of Tokyo, Tokyo, Japan
LISE ARLETH • Structural Biophysics, X-ray and Neutron Science, Niels Bohr Institute, University of Copenhagen, Copenhagen, Denmark
PAU BERNADÓ • Centre de Biochimie Structurale, INSERM, CNRS, Université de Montpellier, Montpellier, France
MATTIA BERNETTI • Scuola Internazionale Superiore di Studi Avanzati, Trieste, Italy
ROBERT B. BEST • Laboratory of Chemical Physics, National Institute of Diabetes and Digestive and Kidney Diseases, National Institutes of Health, Bethesda, MD, USA
CHAFIAA BOUGUECHTOULI • Université Paris-Saclay, CEA, CNRS, Institute for Integrative Biology of the Cell (I2BC), Gif-sur-Yvette, France
INNA BRAKTI • Department of Biology, Structural Biology and NMR Laboratory and the Linderstrøm-Lang Centre for Protein Science, University of Copenhagen, Copenhagen N, Denmark
KATRINE BUGGE • Department of Biology, Structural Biology and NMR Laboratory and the Linderstrøm-Lang Centre for Protein Science, University of Copenhagen, Copenhagen N, Denmark
BJÖRN M. BURMANN • Wallenberg Centre for Molecular and Translational Medicine, University of Gothenburg, Göteborg, Sweden; Department of Chemistry and Molecular Biology, University of Gothenburg, Göteborg, Sweden
JELENA ČALYŠEVA • Structural and Computational Biology Unit, European Molecular Biology Laboratory, Heidelberg, Germany; Faculty of Biosciences, Collaboration for Joint PhD Degree Between EMBL and Heidelberg University, Heidelberg, Germany
ANDREA CAVALLI • Department of Pharmacy and Biotechnology, Alma Mater Studiorum—Università di Bologna, Bologna, Italy; Computational and Chemical Biology, Istituto Italiano di Tecnologia, Genoa, Italy
GIULIO CHIESA • Institute for Research in Biomedicine (IRB Barcelona), The Barcelona Institute of Science and Technology, Barcelona, Spain; Joint BSC-IRB Research Programme in Computational Biology, Barcelona, Spain; Department of Biomedical Engineering and Biological Design Center, Boston University, Boston, MA, USA
JOHN CHRISTODOULOU • Institute of Structural and Molecular Biology, UCL and Birkbeck College, London, UK
ALEJANDRA A. COVARRUBIAS • Departamento de Biología Molecular de Plantas, Instituto de Biotecnología, Universidad Nacional Autónoma de México, Cuernavaca, Morelos, Mexico
DAVID COWBURN • Department of Biochemistry, Albert Einstein College of Medicine, Bronx, NY, USA
MICHAEL D. CRABTREE • Department of Biochemistry, University of Oxford, Oxford, UK
RAMON CREHUET • Department of Biology, Structural Biology and NMR Laboratory, Linderstrøm-Lang Centre for Protein Science, University of Copenhagen, Copenhagen N, Denmark; Institute for Advanced Chemistry of Catalonia (IQAC-CSIC), Barcelona, Spain
CESAR L. CUEVAS-VELAZQUEZ • Departamento de Biología Molecular de Plantas, Instituto de Biotecnología, Universidad Nacional Autónoma de México, Cuernavaca, Morelos, Mexico; Departamento de Bioquímica, Facultad de Química, Universidad Nacional Autónoma de México, Mexico City, México
JENS DANIELSSON • Department of Biochemistry and Biophysics, Arrhenius Laboratories of Natural Sciences, Stockholm University, Stockholm, Sweden
TAPOJYOTI DAS • Department of Biochemistry, Weill Cornell Medical College of Cornell University, New York, NY, USA; Brain and Mind Research Institute, Weill Cornell Medical College of Cornell University, New York, NY, USA
RUPASHREE DASS • Department of Chemistry, and Interdisciplinary Nanoscience Center (iNANO), Aarhus University, Aarhus, Denmark
GARY W. DAUGHDRILL • Department of Cell Biology, Microbiology, and Molecular Biology, University of South Florida, Tampa, FL, USA
FRANCESCA DIELLA • Structural and Computational Biology Unit, European Molecular Biology Laboratory, Heidelberg, Germany
HOLGER DINKEL • Max Planck Institute for Biological Cybernetics, Tübingen, Germany
A. KEITH DUNKER • Department of Biochemistry and Molecular Biology, Center for Computational Biology and Bioinformatics, Indiana University School of Medicine, Indianapolis, IN, USA
RICHARD J. EDWARDS • School of Biotechnology and Biomolecular Sciences, University of New South Wales, Sydney, NSW, Australia
DAVID ELIEZER • Department of Biochemistry, Weill Cornell Medical College of Cornell University, New York, NY, USA; Brain and Mind Research Institute, Weill Cornell Medical College of Cornell University, New York, NY, USA
ALBERT ESCOBEDO • Institute for Research in Biomedicine (IRB Barcelona), The Barcelona Institute of Science and Technology, Barcelona, Spain; Joint BSC-IRB Research Programme in Computational Biology, Barcelona, Spain
XIAO FAN • Department of Pediatrics, Columbia University, New York, NY, USA
ERNESTO FREIRE • Department of Biology, The Johns Hopkins University, Baltimore, MD, USA; Department of Biology/Department of Biophysics and Biophysical Chemistry, Zanvyl Krieger School of Arts & Sciences, Johns Hopkins University, Baltimore, MD, USA
JUAN A. GEREZ • Laboratory of Physical Chemistry, Department of Chemistry and Applied Biosciences, ETH Zurich, Zurich, Switzerland
RANIA GHOUIL • Institute for Integrative Biology of the Cell (I2BC), CEA, CNRS, Université Paris-Saclay, Gif-sur-Yvette, France
STEFANO GIANNI • Dipartimento di Scienze Biochimiche “A. Rossi Fanelli”, Istituto Pasteur-Fondazione Cenci Bolognetti and Istituto di Biologia e Patologia Molecolari del CNR, Sapienza Università di Roma, Rome, Italy
TOBY J. GIBSON • Structural and Computational Biology Unit, European Molecular Biology Laboratory, Heidelberg, Germany
GARRETT M. GINELL • Graduate Program in Biochemistry, Biophysics, and Structural Biology, Division of Biological and Biomedical Sciences, Washington University in St. Louis, St. Louis, MO, USA; Department of Biomedical Engineering, Center for the Science and Engineering of Living Systems, Washington University in St. Louis, St. Louis, MO, USA
MARC GOUW • Structural and Computational Biology Unit, European Molecular Biology Laboratory, Heidelberg, Germany
STEFFEN P. GRAETHER • Department of Molecular and Cellular Biology, University of Guelph, Guelph, ON, Canada
ANDRÁS HATOS • Department of Biomedical Sciences, University of Padua, Padua, Italy
RYO HAYAMA • Laboratory of Cellular and Structural Biology, The Rockefeller University, New York, NY, USA; Panasonic Co., Secaucus, NJ, USA
ALEX S. HOLEHOUSE • Department of Biomedical Engineering, Center for the Science and Engineering of Living Systems, Washington University in St. Louis, St. Louis, MO, USA; Department of Biochemistry and Molecular Biophysics, Washington University School of Medicine, St. Louis, MO, USA
JACK HOLLAND • Department of Biochemistry, University of Oxford, Oxford, UK
LINNÉA ISAKSSON • Department of Chemistry and Molecular Biology, University of Gothenburg, Gothenburg, Sweden
YLVA IVARSSON • Department of Chemistry, BMC, Uppsala, Sweden
KRISTINE STEEN JENSEN • Department for Biophysical Chemistry, Center for Molecular Protein Science, LTH, Lund University, Lund, Sweden
MANON JULIEN • Institute for Integrative Biology of the Cell (I2BC), CEA, CNRS, Université Paris-Saclay, Gif-sur-Yvette, France
GYÖRGY MIKLÓS KESERŰ • Medicinal Chemistry Research Group, Research Centre for Natural Sciences, Hungarian Academy of Sciences, Budapest, Hungary
MAGNUS KJAERGAARD • Department of Molecular Biology and Genetics, Aarhus University, Aarhus, Denmark; The Danish Research Institute for Translational Neuroscience (DANDRITE), Aarhus, Denmark; Center for Proteins in Memory—PROMEMO, Danish National Research Foundation, Aarhus, Denmark
MANJEET KUMAR • Structural and Computational Biology Unit, European Molecular Biology Laboratory, Heidelberg, Germany
LUKASZ KURGAN • Department of Computer Science, Virginia Commonwealth University, Richmond, VA, USA
SARAH LEEB • Department of Biochemistry and Biophysics, Arrhenius Laboratories of Natural Sciences, Stockholm University, Stockholm, Sweden
JULIE M. LETH • Finsen Laboratory, Rigshospitalet, The Biotech Research and Innovation Centre (BRIC), University of Copenhagen, Copenhagen, Denmark
KRESTEN LINDORFF-LARSEN • Department of Biology, Structural Biology and NMR Laboratory, Linderstrøm-Lang Centre for Protein Science, University of Copenhagen, Copenhagen N, Denmark
SARA LINSE • Biochemistry and Structural Biology, Lund University, Lund, Sweden
MART LOOG • Institute of Technology, University of Tartu, Tartu, Estonia ALESSANDRA LUCHINI • Structural Biophysics, X-ray and Neutron Science, Niels Bohr Institute, University of Copenhagen, Copenhagen, Denmark FRANCESCA MALAGRINO` • Dipartimento di Scienze Biochimiche “A. Rossi Fanelli”, Istituto Pasteur-Fondazione Cenci Bolognetti and Istituto di Biologia e Patologia Molecolari del ` di Roma, Rome, Italy CNR, Sapienza Universita ARTEMI MALJAVIN • Institute of Technology, University of Tartu, Tartu, Estonia ERIK W. MARTIN • Department of Structural Biology, St. Jude Children’s Research Hospital, Memphis, TN, USA MATTEO MASETTI • Department of Pharmacy and Biotechnology, Alma Mater Studiorum— ` di Bologna, Bologna, Italy Universita IRENA MATECˇKO-BURMANN • Department of Psychiatry and Neurochemistry, University of Gothenburg, Go¨teborg, Sweden; Wallenberg Centre for Molecular and Translational Medicine, University of Gothenburg, Go¨teborg, Sweden ATTILA ME´SZA´ROS • VIB Center for Structural Biology (CSB), Brussels, Belgium; Structural Biology Brussels (SBB), Vrije Universiteit Brussel (VUB), Brussels, Belgium SUSHAMA MICHAEL • Structural and Computational Biology Unit, European Molecular Biology Laboratory, Heidelberg, Germany NICOLE M. MILKOVIC • Department of Structural Biology, St. Jude Children’s Research Hospital, Memphis, TN, USA TANJA MITTAG • Department of Structural Biology, St. Jude Children’s Research Hospital, Memphis, TN, USA ANTHONY MITTERMAIER • Department of Chemistry, McGill University, Montreal, QC, Canada ALEXANDER MIGUEL MONZON • Department of Biomedical Sciences, University of Padua, Padua, Italy FRANS A. A. 
MULDER • Interdisciplinary Nanoscience Center (iNANO), Aarhus University, Aarhus C, Denmark; Department of Chemistry, Aarhus University, Aarhus C, Denmark MARCO NECCI • Department of Biomedical Sciences, University of Padua, Padua, Italy HUNG HUY NGUYEN • VIB Center for Structural Biology (CSB), Brussels, Belgium; Structural Biology Brussels (SBB), Vrije Universiteit Brussel (VUB), Brussels, Belgium JAKOB T. NIELSEN • Interdisciplinary Nanoscience Center (iNANO), Aarhus University, Aarhus C, Denmark; Department of Chemistry, Aarhus University, Aarhus C, Denmark TIMOTHY J. NOTT • Department of Biochemistry, University of Oxford, Oxford, UK JANELLE NUNEZ-CASTILLA • Department of Biological Sciences, Biomolecular Sciences Institute, Florida International University, Miami, FL, USA CHRISTOPHER J. OLDFIELD • Department of Computer Science, Virginia Commonwealth University, Richmond, VA, USA ¨ RD • Institute of Technology, University of Tartu, Tartu, Estonia MIHKEL O KIRSTI PAULSEN • School of Biotechnology and Biomolecular Sciences, University of New South Wales, Sydney, NSW, Australia KRIS PAUWELS • VIB Center for Structural Biology (CSB), Brussels, Belgium; Structural Biology Brussels (SBB), Vrije Universiteit Brussel (VUB), Brussels, Belgium BRANDON PAYLISS • Department of Biochemistry, University of Toronto, Toronto, ON, Canada ANDERS PEDERSEN • Swedish NMR Centre, University of Gothenburg, Gothenburg, Sweden


CHRISTIAN PARSBÆK PEDERSEN • Department of Biology, Structural Biology and NMR Laboratory and the Linderstrøm-Lang Centre for Protein Science, University of Copenhagen, Copenhagen N, Denmark
IVAN PERAN • Department of Structural Biology, St. Jude Children’s Research Hospital, Memphis, TN, USA
ÅSA PÉREZ-BERCOFF • School of Biotechnology and Biomolecular Sciences, University of New South Wales, Sydney, NSW, Australia
LÁSZLÓ PETRI • Medicinal Chemistry Research Group, Research Centre for Natural Sciences, Hungarian Academy of Sciences, Budapest, Hungary
DAMIANO PIOVESAN • Department of Biomedical Sciences, University of Padua, Padua, Italy
MICHAEL PLOUG • Finsen Laboratory, Rigshospitalet, The Biotech Research and Innovation Centre (BRIC), University of Copenhagen, Copenhagen, Denmark
NATALIA C. PRYMACZOK • Laboratory of Physical Chemistry, Department of Chemistry and Applied Biosciences, ETH Zurich, Zurich, Switzerland
DAVID F. RENDÓN-LUNA • Departamento de Biología Molecular de Plantas, Instituto de Biotecnología, Universidad Nacional Autonoma de México, Cuernavaca, Morelos, Mexico
JOSÉ L. REYES • Departamento de Biología Molecular de Plantas, Instituto de Biotecnología, Universidad Nacional Autonoma de México, Cuernavaca, Morelos, Mexico
ELIZABETH RHOADES • Graduate Group in Biochemistry and Molecular Biophysics, Perelman School of Medicine, University of Pennsylvania, Philadelphia, PA, USA; Department of Chemistry, University of Pennsylvania, Philadelphia, PA, USA
ROLAND RIEK • Laboratory of Physical Chemistry, Department of Chemistry and Applied Biosciences, ETH Zurich, Zurich, Switzerland
ELLEN RIELOFF • Division of Theoretical Chemistry, Department of Chemistry, Lund University, Lund, Sweden
PAULETTE S. ROMERO-PÉREZ • Departamento de Biología Molecular de Plantas, Instituto de Biotecnología, Universidad Nacional Autonoma de México, Cuernavaca, Morelos, Mexico
MICHAEL P. ROUT • Laboratory of Cellular and Structural Biology, The Rockefeller University, New York, NY, USA
KIERSTEN M. RUFF • Department of Biomedical Engineering, Washington University in St. Louis, St. Louis, MO, USA
AMIN SAGAR • Centre de Biochimie Structurale, INSERM, CNRS, Université de Montpellier, Montpellier, France
XAVIER SALVATELLA • Institute for Research in Biomedicine (IRB Barcelona), The Barcelona Institute of Science and Technology, Barcelona, Spain; Joint BSC-IRB Research Programme in Computational Biology, Barcelona, Spain; ICREA, Barcelona, Spain
ARNE SCHÖN • Department of Biology, The Johns Hopkins University, Baltimore, MD, USA
PERNILLE SEIFFERT • Department of Biology, Structural Biology and NMR Laboratory and the Linderstrøm-Lang Centre for Protein Science, University of Copenhagen, Copenhagen N, Denmark
JESSICA SILTBERG-LIBERLES • Department of Biological Sciences, Biomolecular Sciences Institute, Florida International University, Miami, FL, USA
LEANDRO SIMONETTI • Department of Chemistry, BMC, Uppsala, Sweden
KARAMJEET K. SINGH • Department of Molecular and Cellular Biology, University of Guelph, Guelph, ON, Canada
MARIE SKEPÖ • Division of Theoretical Chemistry, Department of Chemistry, Lund University, Lund, Sweden


CHARLOTTE S. SØRENSEN • Department of Molecular Biology and Genetics, Aarhus University, Aarhus, Denmark; The Danish Research Institute for Translational Neuroscience (DANDRITE), Aarhus, Denmark
SAMUEL SPARKS • Department of Biochemistry, Albert Einstein College of Medicine, Bronx, NY, USA; Silicon Therapeutics, Boston, MA, USA
DMITRI SVERGUN • European Molecular Biology Laboratory, Hamburg Unit, EMBL c/o DESY, Hamburg, Germany
FRANÇOIS-XAVIER THEILLET • Institute for Integrative Biology of the Cell (I2BC), CEA, CNRS, Université Paris-Saclay, Gif-sur-Yvette, France
PETER TOMPA • VIB Center for Structural Biology (CSB), Brussels, Belgium; Structural Biology Brussels (SBB), Vrije Universiteit Brussel (VUB), Brussels, Belgium; Institute of Enzymology, Research Centre for Natural Sciences of the Hungarian Academy of Sciences, Budapest, Hungary
SILVIO C. E. TOSATTO • Department of Biomedical Sciences, University of Padua, Padua, Italy; CNR Institute of Neuroscience, Padua, Italy
ANGELO TOTO • Dipartimento di Scienze Biochimiche “A. Rossi Fanelli”, Istituto Pasteur-Fondazione Cenci Bolognetti and Istituto di Biologia e Patologia Molecolari del CNR, Sapienza Università di Roma, Rome, Italy
FRANCESCA TROILO • Dipartimento di Scienze Biochimiche “A. Rossi Fanelli”, Istituto Pasteur-Fondazione Cenci Bolognetti and Istituto di Biologia e Patologia Molecolari del CNR, Sapienza Università di Roma, Rome, Italy
VLADIMIR N. UVERSKY • Department of Molecular Medicine, Morsani College of Medicine, University of South Florida, Tampa, FL, USA; USF Health Byrd Alzheimer’s Research Institute, Morsani College of Medicine, University of South Florida, Tampa, FL, USA; Laboratory of New Methods in Biology, Institute for Biological Instrumentation, Russian Academy of Sciences, Pushchino, Moscow Region, Russian Federation
ERVIN VALK • Institute of Technology, University of Tartu, Tartu, Estonia
GUY VANDENBUSSCHE • Laboratory for the Structure and Function of Biological Membranes, Centre for Structural Biology and Bioinformatics, Université Libre de Bruxelles (ULB), Brussels, Belgium
KIM VAN ROEY • Structural and Computational Biology Unit, European Molecular Biology Laboratory, Heidelberg, Germany
CHEN WANG • Department of Medicine, Columbia University, New York, NY, USA
CHRISTOPHER A. WAUDBY • Institute of Structural and Molecular Biology, UCL and Birkbeck College, London, UK
SANJULA P. WICKRAMASINGHE • Graduate Group in Biochemistry and Molecular Biophysics, Perelman School of Medicine, University of Pennsylvania, Philadelphia, PA, USA
PETER E. WRIGHT • Department of Integrative Structural and Computational Biology, The Scripps Research Institute, La Jolla, CA, USA
KE YANG • Department of Integrative Structural and Computational Biology, The Scripps Research Institute, La Jolla, CA, USA
SOPHIE ZINN-JUSTIN • Université Paris-Saclay, CEA, CNRS, Institute for Integrative Biology of the Cell (I2BC), Gif-sur-Yvette, France

Part I Sequence Properties

Chapter 1

Disorder for Dummies: Functional Mutagenesis of Transient Helical Segments in Disordered Proteins

Gary W. Daughdrill

Abstract

Most cytosolic eukaryotic proteins contain a mixture of ordered and disordered regions. Disordered regions facilitate cell signaling by concentrating sites for posttranslational modifications and protein–protein interactions into arrays of short linear motifs that can be reorganized by RNA splicing. The evolution of disordered regions looks different from their ordered counterparts. In some cases, selection is focused on maintaining protein binding interfaces and PTM sites, but sequence heterogeneity is common. In other cases, simple properties like charge, length, or end-to-end distance are maintained. Many disordered protein binding sites contain some transient secondary structure that may resemble the structure of the bound state. α-Helical secondary structure is common and a wide range of fractional helicity is observed in different disordered regions. Here we provide a simple protocol to identify transient helical segments and design mutants that can change their structure and function.

Key words Protein disorder, Transient helicity, α-Helix, Protein evolution, Protein–protein interactions, IDP, CD spectroscopy, NMR

1 Introduction

By the end of the last century, it was well established that protein disorder was a common, functionally relevant feature of eukaryotic proteomes [1–10]. However, hypotheses to test relationships between structure and function were fragmented due to the limited number of examples where high-resolution data on the structure and dynamics of disordered regions was available. There was also firm resistance from a structural biology community reluctant to ignore 50 years of protein folding precedent. As more disordered regions were investigated, it became clear that tertiary structures were not present, but unstable secondary structures were common [9, 11–16]. In particular, transient α-helices are frequently observed in disordered regions that form protein binding sites [17–20]. In most cases, these transient α-helices become stabilized when bound to another protein [12, 16, 21–24]. This is a good

Birthe B. Kragelund and Karen Skriver (eds.), Intrinsically Disordered Proteins: Methods and Protocols, Methods in Molecular Biology, vol. 2141, https://doi.org/10.1007/978-1-0716-0524-0_1, © Springer Science+Business Media, LLC, part of Springer Nature 2020


Table 1
Disordered proteins with transient helical structures

Name      Organism   Helical a.a.   a.a. mutations          Δ helicity (a)   Functional effects (a)
p53       Human      9              P → A                   ↑↑               ↑↑
MdmX      Human      n.d.           W → Q, S                n.d.             ↓↓
cMyb      Human      20             L, S, N → P, A, G, V    ↑ or ↓↓          ↑ or ↓↓
Sgs1      Yeast      15             F → P                   ↓↓               ↓↓
COR15A    Plant      >30            G → A                   ↑↑               ↑↑

(a) One arrow denotes less than a twofold change and two arrows denote at least a twofold change. Functional effects include protein binding, growth following hydroxyurea treatment, and liposome stabilization [27]

thing for molecular and structural biologists interested in understanding relationships between structure and function of their favorite disordered proteins because the physical properties of α-helices are well understood, and it is relatively straightforward to design mutants for an α-helix to interrogate structure and function [25, 26].

I will briefly describe a protocol my group has refined over the years to identify protein interaction sites in disordered regions, assess transient helical structure, and design mutants that can either disrupt or augment structure and function. I will discuss several examples my lab has worked on in the last decade as I describe the protocol (Table 1). In four of the examples (p53, cMyb, Sgs1, and COR15A), we measured residue-specific helicity using nuclear magnetic resonance (NMR) spectroscopy [27–30]. In the fifth example (MdmX), we are currently testing whether a disordered region that inhibits p53 binding has transient helical structure [31]. For p53, cMyb, Sgs1, and COR15A, we made mutations that either increase or decrease helicity and tested the effects on protein binding and in other functional assays. For MdmX, we mutated aromatic residues and tested the effects on autoinhibition of p53 binding.

As with any study of protein structure and function, understanding evolution is an important starting point. Early studies of protein disorder suggested that the sequence evolution of disordered regions was less constrained than that of ordered proteins due to a lack of long-range interactions [13, 32, 33]. While this is the case for a number of disordered regions, including phosphorylation sites, amino acids that are buried upon binding to other proteins are often conserved [10, 17, 34, 35]. Figure 1 shows a summary of a study we performed to measure amino acid conservation in a database of disordered proteins [34]. The figure shows amino acid conservation for ordered and disordered proteins. The shaded regions show the range of conservation observed for buried and surface residues in ordered proteins,

[Figure 1 plot. y-axis: conservation for alignments with >85% identity (0.8 to 1). x-axis: amino acid type, in the order S I T N A M V D R H C E F Q K G L Y P W. Legend: buried residues in ordered proteins; surface residues in ordered proteins; disordered residues.]

Fig. 1 The probability that different amino acid types are conserved in ordered and disordered proteins based on gapless alignments of sequences with greater than 85% identity. The shaded regions show the range of conservation observed for buried and surface residues in ordered proteins, and the points show conservation for amino acids in disordered proteins

and the points show conservation for amino acids in disordered proteins. These data provide important context for understanding how to identify disordered regions that form transient helical secondary structures that are stabilized by protein binding. Some of the most conserved residues in disordered proteins are tryptophan, tyrosine, and to a lesser extent phenylalanine. These amino acids are some of the most conserved but also some of the least frequent residues in disordered proteins [34]. This suggests they are functionally important, which is supported by the fact that they are frequently buried at the interface in the bound state [36]. In contrast, residues like S, T, N, and D are less conserved and are usually found on the solvent-exposed side of amphipathic helices. This leads to an interesting complication: the interfacial residues will be conserved, but the solvent-exposed residues may not be. Care must therefore be taken when performing multiple protein sequence alignments because there may not be a high level of sequence conservation across the binding site even though the interfacial residues are conserved [37].
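The point about patchy conservation across a binding site can be made concrete with a short script. The following is a minimal sketch, not part of the published protocol: it takes the seven mammalian p53 sequences printed in Fig. 2c (human residues 12–32) and reports which alignment columns are not identical in every sequence. The function names and the all-or-nothing identity criterion are illustrative assumptions; CLUSTAL's own consensus line also distinguishes strongly and weakly similar columns, which this sketch does not attempt.

```python
from collections import Counter

# Ungapped alignment of the Mdm2 binding site as printed in Fig. 2c
# (human p53 residues 12-32 and the matching segment of six homologues).
P53_MDM2_SITE = {
    "Human":      "PPLSQETFSDLWKLLPENNVL",
    "Macaque":    "PPLSQETFSDLWKLLPENNVL",
    "Rabbit":     "PPLSQETFSDLWKLLPENNLL",
    "Guinea Pig": "PPLSQETFSDLWKLLPENNVL",
    "Cow":        "PPLSQETFSDLWNLLPENNLL",
    "Dog":        "PPLSQETFSELWNLLPENNVL",
    "Mouse":      "LPLSQETFSGLWKLLPPEDIL",
}


def column_identity(seqs):
    """Fraction of sequences sharing the most common residue, per column."""
    if len({len(s) for s in seqs}) != 1:
        raise ValueError("sequences must be aligned to the same length")
    return [Counter(col).most_common(1)[0][1] / len(col) for col in zip(*seqs)]


def variable_positions(seqs, first_residue):
    """Residue numbers (reference numbering) of columns that are not 100% identical."""
    return [
        first_residue + i
        for i, frac in enumerate(column_identity(seqs))
        if frac < 1.0
    ]


if __name__ == "__main__":
    seqs = list(P53_MDM2_SITE.values())
    print(variable_positions(seqs, first_residue=12))
```

For this alignment the variable columns include residues 21 and 24, which sit inside the binding site, while the aromatic residues F19 and W23 are invariant. Applied to more distant homologues, the same calculation will usually show many more variable columns, which is why restricting the alignment to close homologues can help.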

2 Materials

The methods described in Subheadings 3.1 and 3.2 require that you know the sequence of the protein and closely related homologues. We will use the protein sequences to estimate disorder propensity using IUPred2A and to perform multiple protein sequence alignments using CLUSTAL W [38–41]. You also need a computer with access to the Internet. In Subheadings 3.3–3.5, I discuss using nuclear magnetic resonance (NMR) spectroscopy to make residue-specific measurements of transient secondary


structure and a variety of methods to test protein structure and function including site-directed mutagenesis, circular dichroism (CD) spectroscopy, isothermal titration calorimetry (ITC), quantitative real-time PCR, RNA sequencing, flow cytometry, and growth assays. I will not describe these methods in detail but will provide appropriate references. Whether or not you need to use any of these methods will depend on your system and the hypothesis you are trying to test.

3 Methods

3.1 Disorder Predictions and Sequence Alignments

A disorder prediction is usually a good place to start. It will delineate regions of order from regions of disorder and can provide valuable information about potential protein binding sites. Figure 2a shows a plot of disorder propensity for the human tumor suppressor p53, and Fig. 2b shows a domain schematic. p53 contains a transactivation domain (TAD1/2), a proline-rich region (PRR), a DNA binding domain (DBD), a tetramerization domain (TET), and a regulatory domain (RD). The disorder prediction in Fig. 2a was made using IUPred2A, which is one of a number of programs that you can use for disorder predictions [40–42]. To perform a prediction, go to the IUPred2A site at https://iupred2a.elte.hu, enter the accession number or paste the sequence in the window, select from the available options (either long or short disorder), and submit the job. A new window will appear that contains a plot of disorder propensity as a function of sequence as well as a list of the disorder propensity values. This list can be copied and pasted into any spreadsheet program to generate a disorder propensity plot. In general, it is recommended that you use multiple disorder predictors, and there are servers available for this purpose [43], but for this demonstration IUPred2A is adequate.

Figure 2a shows that about 50% of human p53 is predicted to be disordered. My lab and other groups have shown that the N-terminal region, residues 1–93, is disordered [12, 44, 45]. Other groups have shown that in monomeric p53 both the tetramerization domain and the C-terminal regulatory domain are also disordered [46–48].

In addition to disorder predictions, you should perform a multiple protein sequence alignment to see if you can identify conserved amino acids that may form part of the binding interface. As mentioned in the introduction, this may be challenging if the region containing the protein binding site has undergone extensive sequence changes.
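For readers who prefer a script to a spreadsheet, the list of disorder propensity values from the prediction step can be summarized directly. The sketch below is a hedged illustration rather than an official IUPred2A client: it assumes the copied text has one residue per row with whitespace-separated columns in the order position, residue, score (any further columns are ignored), and that comment lines start with `#`. The function names, the default 0.5 cutoff (the order/disorder boundary used in Fig. 2a), and the minimum stretch length are illustrative choices.

```python
def parse_scores(text):
    """Parse rows of 'position residue score'; skip blank and '#' lines."""
    rows = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        fields = line.split()
        rows.append((int(fields[0]), fields[1], float(fields[2])))
    return rows


def fraction_disordered(rows, cutoff=0.5):
    """Fraction of residues whose score is above the cutoff."""
    return sum(score > cutoff for _, _, score in rows) / len(rows)


def low_score_stretches(rows, cutoff=0.5, min_len=3):
    """Return (start, end) residue ranges where the score stays at or
    below the cutoff for at least min_len consecutive residues."""
    found, start, last = [], None, None
    for pos, _, score in rows:
        if score <= cutoff:
            if start is None:
                start = pos
            last = pos
        else:
            if start is not None and last - start + 1 >= min_len:
                found.append((start, last))
            start = None
    # Close a stretch that runs to the end of the sequence.
    if start is not None and last - start + 1 >= min_len:
        found.append((start, last))
    return found
```

A dip that marks a binding site inside a disordered region does not always cross 0.5, so it can be worth rerunning `low_score_stretches` with a higher cutoff and comparing the result with the plot by eye.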
In some cases it may be necessary to narrow the alignment down to the sequence that corresponds to the dip in the disorder plot and restrict the alignment to closely related homologues because sequence conservation may be low at noninteracting sites, as shown for residues 21 and 24 in Fig. 2c. Figure 2c

[Figure 2 panels. (a) Disorder propensity (0 to 1) versus residue number (0 to 400). (b) Domain schematic with labels Tad1, Tad2, PRR, DBD/Core, TD, and RD and boundary labels 1, 40, 61, 93, 312, 360, 393. (c) Sequence alignment:]

Human       PPLSQETFSDLWKLLPENNVL
Macaque     PPLSQETFSDLWKLLPENNVL
Rabbit      PPLSQETFSDLWKLLPENNLL
Guinea Pig  PPLSQETFSDLWKLLPENNVL
Cow         PPLSQETFSDLWNLLPENNLL
Dog         PPLSQETFSELWNLLPENNVL
Mouse       LPLSQETFSGLWKLLPPEDIL
             ******** **:*** :::*

Fig. 2 Disorder propensity plot identifies intermolecular protein binding site. (a) Disorder prediction of human p53. The line at 0.5 indicates the expected boundary between order and disorder. (b) Domain schematic of p53. TAD (transactivation domain), PRR (proline-rich region), DBD (DNA binding domain), TD (tetramerization domain), and RD (regulatory domain). (c) Multiple protein sequence alignment of human p53 residues 12–32. Asterisks indicate identical amino acids and colons indicate mostly identical amino acids. Disorder propensity values are from IUPred2A and CLUSTAL X was used for the sequence alignment

shows a multiple protein sequence alignment of the Mdm2 binding site for mammalian p53 homologues. Mdm2 is an E3 ubiquitin ligase that negatively regulates p53. The alignment is for human p53 residues 12–32. Notice in Fig. 2a that this region corresponds to a dip in the disorder plot, which is often (but not always) indicative of protein binding sites in disordered proteins.

Figure 3 shows another example of using a disorder propensity plot to identify protein binding sites. Figure 3a shows a plot of disorder tendencies for MdmX. MdmX is a homologue of Mdm2 and a negative regulator of p53. It forms a heterodimer with Mdm2 to ensure that p53 levels are kept low in the absence of cellular stress [49, 50]. Figure 3b shows a domain schematic for MdmX, which contains a p53 binding domain (p53bd), an autoinhibitory domain

[Figure 3 panels. (a) Disorder propensity (0 to 1) versus residue number (ticks every 40 residues up to 480). (b) Domain schematic with labels p53BD, Linker, WW, Zn, and Ring and position labels 100, 200, 300, 400. (c) Sequence alignment:]

Human       FEEWDVAGLPWWFLGNLRSNY
Macaque     FEEWDVAGLPWWFLGNLRSNY
Rabbit      FEEWDVAGLPWWFLGNLRNNY
Guinea Pig  FEEWDVAGLPWWFLGNLRSNC
Cow         FEEWDVAGLPWWFLGNLRNNY
Dog         FEEWDVAGLPWWFLGNLRSNY
Mouse       FEEWDVAGLPWWFLGNLRNNC
            ******************.*

Fig. 3 Disorder propensity plot identifies intramolecular protein binding site. (a) Disorder prediction of human MdmX. (b) Domain schematic of human MdmX. p53BD (p53 binding domain), WW (motif containing two adjacent tryptophans), Zn (zinc finger domain), Ring (inactive E3 ring domain). (c) Multiple protein sequence alignment of human MdmX residues 190–210. Asterisks indicate identical amino acids and colons indicate mostly identical amino acids. Disorder propensity values are from IUPred2A and CLUSTAL X was used for the sequence alignment

that contains two tryptophans (WW), a zinc finger domain (Zn), and an inactive E3 ring domain (Ring). The p53bd is ordered and binds to the same disordered segment of p53 as Mdm2 [51]. This domain binds tightly to TAD1 of p53 but full length MdmX does not [31]. We helped solve this conundrum by showing that the WW domain loops around and binds to the p53bd and inhibits the binding of p53 TAD1. The disorder plot in Fig. 3a was the first clue we had that this was happening. We noticed the dip in the disorder plot around residue 200 and identified conserved tryptophan residues at this site (Fig. 3c). Using short peptides, we showed the WW domain binds to p53bd and overlaps the TAD1 binding site [31]. To promote a more disordered state, we changed the pair of tryptophans at residues 200 and 201 to serine and glycine, respectively. This decreased the binding to p53bd and inhibited p53 transactivation in a reporter gene assay due to increased binding to TAD1


[31]. We were also able to show that intramolecular binding between p53bd and the WW domain reduced the binding of TAD1 to p53bd by 400-fold [52].

3.2 Prediction of Transient Helicity

If the strategy outlined in Subheading 3.1 allows you to identify potential protein binding sites in your favorite disordered protein, there are several ways to determine whether this binding site is helical. Of course, our favorite approach is to measure residue-specific helicity using NMR spectroscopy. We will describe a couple of important examples where we do this in Subheading 3.4, but we understand that using this approach may not be practical for everyone. The simplest thing to do is try to predict the presence of helical segments based on the amino acid sequence, but care must be taken because transient helical segments in disordered proteins are difficult to identify using secondary structure predictors. We have had the most success using Agadir (an algorithm to predict the helical content of peptides) [25]. Agadir was developed using helix-coil transition theory and only considers short-range interactions. The performance of Agadir was evaluated using CD data from 1200 short peptides. It is well established that disordered proteins behave like short peptides, which may explain why Agadir is able to predict the low levels of transient helicity observed in disordered regions. If you think you have identified a protein binding site based on sequence but there is no predicted helix, it may form a beta or irregular structure in the bound state (see Note 1).

We have used Agadir in concert with NMR spectroscopy and CD to identify transient helical segments and design mutants to augment or disrupt the helix [28, 30]. It is also possible to use Agadir independently without the need for NMR or CD [53]. In one case we used Agadir to help predict a transient helical segment in the long RecQ-like DNA helicase from yeast (Sgs1) that forms a binding site for topoisomerase III (TopIII) [28].
We confirmed the presence of this helical segment and precisely defined its boundaries using NMR spectroscopy, which was important for the functional analysis we performed (see Subheading 3.5). We also used Agadir in a study of the myeloblastosis oncoprotein, c-Myb. c-Myb contains a helical segment that is 20 amino acids long with an average helicity of 40%. We used CD to measure the helicity of a 53-amino acid fragment of wt c-Myb and 9 mutants [30]. Figure 4a shows the helicity estimates from Agadir plotted against the values we measured with CD. There is a good correlation between the predictions and measurements, but Agadir systematically underestimated percent helicity. Pappu and colleagues obtained a similar result with the PUMA protein [54].

[Figure 4 plots. (a) % helicity (Agadir) versus % helicity (CD), axes up to 30%; linear fit y = 0.7522x, R² = 0.7701. (b) % helicity (NMR) versus % helicity (CD), axes up to 30%; linear fit y = 0.9877x, R² = 0.9785.]

Fig. 4 Prediction and measurement of transient helicity for c-Myb and variants. (a) Correlation of % helicity between Agadir predictions and CD measurements for wt c-Myb and nine variants. (b) Correlation of % helicity between NMR and CD measurements for wt c-Myb and five variants. The red point corresponds to wt c-Myb

3.3 Measuring Transient Helicity with CD Spectroscopy

The simplest way to measure transient helicity is to use CD and base your estimate of percent helicity on the molar ellipticity at 222 nm (see Note 2). CD is relatively inexpensive, requires a moderate amount of protein, and does not require any labeling. There are two important considerations when using CD to measure transient helicity in disordered regions. First, CD does not provide residue-specific information. Second, the signal you get will depend on what fraction of the disordered region has transient helicity as well as the amount of helicity. For instance, it will be difficult to detect a 10-amino acid segment that is 50% helical in a disordered fragment that is 100 amino acids long. Since there has not been a systematic assessment of transient helicity in disordered proteins, it is difficult to know what percent helicity to expect in a disordered region, and like so many features of protein structure under selection, the percent helicity you observe will depend on the system and process. In a previous report, Han and colleagues collected transient helicity measurements from the literature for 25 disordered proteins and observed an average value less than 30% [55]. If most transient helical segments in disordered proteins are below this level, the effectiveness of CD to assess transient helicity will be limited.
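The dilution effect described above is simple arithmetic, and a short sketch makes it easy to check a planned construct before ordering it. The function below is an illustration under two stated assumptions that are not from the protocol itself: residues outside the listed segments contribute no helical signal, and the CD-derived percent helicity scales linearly with the number of helical residues. The function name is hypothetical.

```python
def fragment_averaged_helicity(segments, fragment_length):
    """Expected percent helicity averaged over a whole fragment.

    segments: list of (number_of_residues, fractional_helicity) tuples
    fragment_length: total number of residues in the construct
    """
    if fragment_length <= 0:
        raise ValueError("fragment_length must be positive")
    helical_residues = sum(n * frac for n, frac in segments)
    return 100.0 * helical_residues / fragment_length


# Example from the text: a 10-residue segment that is 50% helical
# in a 100-residue disordered fragment.
print(fragment_averaged_helicity([(10, 0.5)], 100))  # 5.0
```

Under the same assumptions, the 20-residue c-Myb segment with 40% average helicity in a 53-residue fragment (Subheading 3.2) would average out to roughly 15%, whereas the 10-in-100 example gives only 5%, which helps explain why shortening the construct can bring a weak helical signal into a measurable range.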


However, if the transient helicity is high enough to measure by CD, as it was for c-Myb, you should observe an excellent correspondence with percent helicity values obtained from NMR spectroscopy (Fig. 4b). Whether or not the helix population for your system is high enough to measure by CD depends on a number of factors, including the quality of the sample and the CD spectra. Based on experience we usually do not use measurements of fractional helicity from CD spectra below 10%. In Fig. 4b one of the c-Myb mutants (L300P) falls into this category, and this point is not well correlated with the rest of the data. If you suspect a short helical segment and you are working with a longer polypeptide, you can always have a shorter peptide synthesized to see if you can detect a helical signal.

3.4 Measuring Transient Helicity with NMR

We recently published a book chapter that reviews how to use NMR chemical shifts to determine residue-specific secondary structure populations for disordered proteins [56]. See this reference if you are interested in the details of the procedure. Here I will summarize two examples where we have used this approach to identify transient helicity and then designed mutants to increase or decrease helicity and change function. When performing NMR spectroscopy on disordered proteins, you will have to deal with the resonance overlap problem. See Note 3 for one strategy to reduce resonance overlap based on minimal fragment design.

One might ask whether precise levels of transient helicity are functionally relevant. We showed very clearly that the low percent helicity observed for the p53 TAD is necessary for controlling the binding affinity with Mdm2 because of the energetic penalty associated with coupled folding and binding [29]. However, increasing percent helicity for c-Myb had a small effect on binding to the KIX domain of the CREB binding protein (CBP) [7].

The p53 tumor suppressor was an early example of a eukaryotic protein with mixed regions of disorder and order [44, 45, 57–59]. As discussed in Subheading 3.1, the disordered N-terminal TAD contains a transient helical segment important for binding the negative regulator, Mdm2. Mdm2 is an E3 ubiquitin ligase that targets p53 for proteasome degradation [60]. A short segment of the p53 N-terminus from F19-L25 forms a stable helix when bound to Mdm2. In the absence of Mdm2, this segment of p53 has only weak helical character (Fig. 5a), which can be reliably detected only by NMR. Figure 5a shows a plot of secondary chemical shifts (SCS) of Cαs and helical frequency values obtained from the δ2D method [61]. The only input required for δ2D is a table containing the backbone and Cβ chemical shifts. The output includes the populations for helix, beta, PPII, and coil.
Consecutive positive SCSs of Cαs are observed for helical segments, and negative values are observed for beta structures. Helical frequency can also be calculated directly from chemical shifts without using δ2D (see Note 4).
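The rule of thumb about consecutive positive Cα secondary shifts can be applied with a few lines of code once the assignments are in hand. The following is a minimal sketch and is not the δ2D method or the calculation in Note 4: it simply subtracts a random coil reference value from each observed Cα shift and reports runs of consecutive positive differences. The function names, the minimum run length of three residues, and the zero threshold are illustrative assumptions, and the random coil values must be supplied by the user from a published reference set appropriate for the sample conditions.

```python
def secondary_shifts(observed, random_coil):
    """Secondary chemical shift per residue: observed minus random coil (ppm)."""
    if len(observed) != len(random_coil):
        raise ValueError("observed and random_coil must have the same length")
    return [obs - rc for obs, rc in zip(observed, random_coil)]


def positive_runs(scs, min_len=3, threshold=0.0):
    """Return (first_index, last_index) for each run of at least min_len
    consecutive values greater than threshold (0-based indices)."""
    runs, start = [], None
    for i, value in enumerate(scs):
        if value > threshold:
            if start is None:
                start = i
        else:
            if start is not None and i - start >= min_len:
                runs.append((start, i - 1))
            start = None
    # Close a run that extends to the last residue.
    if start is not None and len(scs) - start >= min_len:
        runs.append((start, len(scs) - 1))
    return runs
```

Add the residue number of the first assigned residue to the returned indices to recover sequence numbering. Raising `threshold` slightly above zero is one way to avoid calling a run on differences that are within the uncertainty of the random coil reference.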

[Figure 5 panels. (a) wt and (b) P27A: ΔδCα (ppm) and helical frequency (0 to 1) plotted per residue for p53 residues 12–29 (wt sequence PPLSQETFSDLWKLLPEN; P27A sequence PPLSQETFSDLWKLLAEN). (c) Structure image. (d) p53 TAD sequence alignment, with the flanking proline columns set apart:]

human    MEEPQSDPSVEP P LSQETFSDLWKLL P ENNVLSPLPSQAMDDLMLSPDDIE
mouse    MEESQSDISLEL P LSQETFSGLWKLL P PEDILPSP--HCMDDLLLPQ-DVE
chicken  --MAEEMEPLLE P --TEVFMDLWSML P ----YSMQQLPLPEDHSNWQELSP
fish     --MADLAENVSL P LSQESFEDLWKMN - --LNLVAVQPPETESWVGYDNFMM

Fig. 5 Transient helicity in p53 TAD. SCS of Cαs and helical frequency values from δ2D for p53 residues 12–29. (a) wt. (b) P27A. (c) Structure of p53 TAD-Mdm2 complex (PDB ID 1YCR). A surface model of Mdm2 is shown in gray and p53 TAD is shown as a ribbon structure in red. (d) Protein sequence alignment of p53 TAD for vertebrate homologues. CLUSTAL X was used for the sequence alignment
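The prolines that flank the helical segment in panels (a) and (d) can also be located programmatically, which is convenient when screening several candidate binding sites. The sketch below is illustrative only: the function name is hypothetical, the sequence is the wt p53 segment for residues 12–29 shown in the figure, and the helix boundaries F19–L25 are those given in the text for the Mdm2-bound state.

```python
def flanking_prolines(seq, first_residue, helix_start, helix_end):
    """Return the residue numbers of the nearest proline N-terminal to
    helix_start and the nearest proline C-terminal to helix_end.
    Either value is None when no such proline exists in seq."""
    numbered = [(first_residue + i, aa) for i, aa in enumerate(seq)]
    before = [num for num, aa in numbered if aa == "P" and num < helix_start]
    after = [num for num, aa in numbered if aa == "P" and num > helix_end]
    return (before[-1] if before else None, after[0] if after else None)


# wt p53 residues 12-29; helix F19-L25 as described for the bound state.
P53_12_29 = "PPLSQETFSDLWKLLPEN"
print(flanking_prolines(P53_12_29, 12, 19, 25))  # (13, 27)
```

The C-terminal flanking proline returned here, P27, is the position mutated to alanine in panel (b).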

We wanted to determine if the level of helicity observed for free p53 was important for binding to Mdm2 and expected there should be an energetic penalty associated with folding and binding. To make this region of p53 more helical, we mutated the conserved proline residues that flank the Mdm2 binding site. Figure 5b shows how transient helicity was increased by mutating a single proline at position 27 to alanine (P27A). The P27A variant of p53 bound more tightly to Mdm2. This resulted in more effective ubiquitylation and degradation of p53 and decreased the transcriptional activity of p53 in response to DNA damage [29]. In particular, decreased p21 expression resulted in reduced cell cycle arrest at the G1 checkpoint. Figure 5c shows the structure of p53 TAD bound to Mdm2, and Fig. 5d is a sequence alignment of p53 TAD highlighting the conserved prolines that flank the Mdm2 binding site.

In another example, we investigated the relationship between helical stability and binding affinity for the disordered TAD from the myeloblastosis oncoprotein, c-Myb, and its ordered binding partner, KIX [30]. We designed a series of c-Myb mutants that increased or decreased helix stability without changing the binding interface with KIX. This included a complementary series of alanine, glycine, proline, and valine mutants at three noninteracting sites. The glycine mutants were used as a reference state to correlate binding affinity and helix stability. We measured the helicity of wt c-Myb and the mutants using CD and NMR. Based on the residue-specific helicity measurements from NMR, we identified a contiguous helical segment in c-Myb that was divided into two conformationally distinct segments. The N-terminal segment, from K291 to L301, was more helical than expected for a disordered protein with

[Figure 6 panels. (a) wt and (b) P289A: ΔδCα (ppm) and helical frequency (0 to 1) plotted per residue for c-Myb residues 286–318 (wt sequence DEDPEKEKRIKELELLLMSTENELKGQQVLPTQ; P289A sequence DEDAEKEKRIKELELLLMSTENELKGQQVLPTQ). (c) Structure image. (d) c-Myb TAD sequence alignment, in the column blocks recovered from the figure:]

human    VPQPAAAAI QRHYNDED PEKEKRIKELELLLM STENELKGQQVLPTQNHTC
mouse    VPQPAAAAI QRHYNDED PEKEKRIKELELLLM STENELKGQQALPTQNHTC
chicken  VPQPAAAAI QRHYNDED PEKEKRIKELELLLM STENELKGQQALPTQNHTA
fish     FPQHGTAAI QRHYSDED PEKEKRVKEIEMLLM STENELKGQQALPIS--MN

Fig. 6 Transient helicity in c-Myb TAD. SCS of Cαs and helical frequency values for c-Myb residues 286–318. (a) wt. (b) P289A. (c) Structure of the c-Myb TAD-KIX complex (PDB ID 1SB0). A surface model of KIX is shown in gray and c-Myb TAD is shown as a ribbon structure in red. (d) Protein sequence alignment of c-Myb TAD for vertebrate homologues. CLUSTAL X was used for the sequence alignment

an average helicity greater than 60% (Fig. 6a), and the C-terminal segment, from S304 to L315, had an average helicity less than 10%. Different effects on binding were observed when the helicity of these two segments was increased or decreased. Mutants in the N-terminal segment that increased helicity had no effect on the binding affinity to KIX, while helix destabilizing glycine and proline mutants reduced binding affinity by more than 1 kcal/mol. Figure 6b shows an example of how helicity was modestly increased when an N-terminal flanking proline was mutated to alanine. Mutants that either increased or decreased helical stability in the C-terminal segment had almost no effect on binding. Analysis of binding free energies and helical stability for several of the mutants suggested that multiple conformations are accessible in the bound state. Figure 6c shows the structure of c-Myb TAD bound to the KIX domain of the CREB binding protein, and Fig. 6d is a sequence alignment of c-Myb TAD highlighting the conserved prolines that flank the KIX binding site.

Figures 5d and 6d show a curious feature we have identified for some disordered protein binding sites with transient helicity. The regions that become helical are flanked by conserved proline residues [62]. The fact that conserved prolines are frequently found flanking helical regions suggests they are functionally important. It is tempting to speculate that flanking prolines keep helicity low and control the energetic penalty for coupled folding and binding. This appears to be the case for p53 but not c-Myb or the mixed lineage leukemia (MLL) protein [63].
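The binding effects in this chapter are quoted both as free energies (more than 1 kcal/mol for c-Myb) and as fold changes in binding (400-fold for MdmX in Subheading 3.1), so it can help to convert between the two. The sketch below uses the standard relation ΔΔG = RT ln(fold change). The temperature of 298.15 K is an assumption made here for illustration, not a value taken from the cited studies, and the function names are hypothetical.

```python
import math

R_KCAL = 1.987204e-3  # gas constant in kcal mol^-1 K^-1


def ddg_from_fold_change(fold_change, temperature_k=298.15):
    """Free energy difference (kcal/mol) for a given fold change in Kd."""
    if fold_change <= 0:
        raise ValueError("fold_change must be positive")
    return R_KCAL * temperature_k * math.log(fold_change)


def fold_change_from_ddg(ddg_kcal, temperature_k=298.15):
    """Fold change in Kd for a given free energy difference (kcal/mol)."""
    return math.exp(ddg_kcal / (R_KCAL * temperature_k))


print(round(fold_change_from_ddg(1.0), 1))   # 5.4
print(round(ddg_from_fold_change(400), 2))   # 3.55
```

At the assumed temperature, 1 kcal/mol corresponds to a change in Kd of a little over fivefold, and a 400-fold change corresponds to about 3.5 kcal/mol.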

Gary W. Daughdrill

Fig. 7 Transient helicity in Sgs1 N-terminal domain. SCS of Cαs and helical frequency values for Sgs1 residues 24–38 (wt sequence EDKDFVFQAIQKHIA). (a) wt. (b) F30P. Axes: ΔδCα (ppm) and helical frequency, plotted per residue

3.5 Mutant Design

In addition to the mutants described in Subheading 3.4, I will also describe a mutagenesis study of a transient helical segment in Sgs1, a RecQ-like helicase in yeast that is important for genome stability [64]. Sgs1 contains a 650-amino acid disordered region at the N-terminus [28]. We identified 3 segments in the first 120 amino acids of Sgs1 with transient helical secondary structure. The helical segments span residues 9–20, 24–35, and 88–97. In Subheading 3.4 we discussed removing prolines that flank transient helical segments as a strategy for augmenting function. For Sgs1 we interfered with function by inserting proline residues at different positions in the helical segments. Figure 7a shows the SCS of Cαs and the helical frequencies for Sgs1 residues 24–38, and Fig. 7b shows the reduction in transient helicity of this segment when F30 was changed to proline. Insertion of a single proline at this position was functionally similar to deleting the sgs1 gene in a hydroxyurea hypersensitivity assay (Table 2). We were able to confirm this phenotype was due to disrupting the interaction between Sgs1 and TopIII. A proline inserted in the middle of the helical segment for residues 88–97 had no effect on growth following hydroxyurea treatment. The transient helical segment from residues 24–35 was further probed with proline mutations (Table 2). Mutating a hydrophobic cluster in the middle of the helix that included V29,

Mutagenesis of IDPs

Table 2 Summary of results from hydroxyurea (HU) hypersensitivity assays of Sgs1

Sgs1 variant    Relative growth following HU treatment
wt              +++++
sgs1Δ           ++
L9P             ++
W15P            ++
K17P            +++++
T21P            +++++
D25P            +++
K26P            +++++
V29P            ++
F30P            ++
I33P            +++
Q34P            +++++
I37P            +++++

F30, and I33 had the largest functional effects. Residues on the periphery of the helix had weaker effects except for D25. We were also able to identify a weaker helical segment from residues 9–20 that included a rare tryptophan residue. Proline mutants in this segment also reduced TopIII binding and increased sensitivity to hydroxyurea treatment (Table 2).
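The pattern in Table 2 can be summarized programmatically. The sketch below transcribes the number of "+" symbols for each variant and groups the proline mutants by how closely they resemble wild type or the deletion strain. The three class names and their cutoffs are illustrative assumptions, not categories used in the original study.

```python
# Relative growth scores transcribed from Table 2 (count of '+' symbols).
growth = {
    "wt": 5, "sgs1Δ": 2,
    "L9P": 2, "W15P": 2, "K17P": 5, "T21P": 5,
    "D25P": 3, "K26P": 5, "V29P": 2, "F30P": 2,
    "I33P": 3, "Q34P": 5, "I37P": 5,
}


def classify(score, wt_score, null_score):
    """Assign a hypothetical phenotype class relative to the two controls."""
    if score <= null_score:
        return "null-like"
    if score < wt_score:
        return "intermediate"
    return "wt-like"


controls = ("wt", "sgs1Δ")
by_class = {}
for variant, score in growth.items():
    if variant in controls:
        continue
    label = classify(score, growth["wt"], growth["sgs1Δ"])
    by_class.setdefault(label, []).append(variant)

for label, variants in by_class.items():
    print(label, variants)
```

Grouped this way, the null-like mutants fall in the 9–20 segment (L9P, W15P) and in the hydrophobic core of the 24–35 segment (V29P, F30P), consistent with the description in the text.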

4 Notes

1. Dealing with Non-helical Segments. If you already have evidence that a disordered region contains a protein binding site and there is a dip in the disorder plot corresponding to a cluster of hydrophobic residues, but this region is not predicted to be helical, then your binding site may form a beta strand or extended structure in the bound state [10]. These regions are challenging to identify using NMR spectroscopy because the SCSs of disordered regions that form either beta strands or extended structures tend to go in the opposite direction of a helical segment. If there is evidence in the SCSs for an extended/beta strand region, you should consider the possibility of intra- or intermolecular beta-sheet formation. In our experience, most regions that are not helical will be dominated by coil-like structures with some small fraction of beta strand. In this case it is difficult to identify the structural boundaries of a binding site. At this point a deletion study may be the most practical option. Since disordered regions rarely contain long-range intramolecular contacts, any functional effects you observe will not be related to destabilizing these interactions. However, the possibility of unstable intramolecular interactions that lead to even a low population of beta-strand formation cannot be ignored.

2. Calculating Helix Populations Using Molar Ellipticity at 222 nm. For the CD data presented, helix populations were calculated based on the molar ellipticity at 222 nm using the method developed by Munoz and Serrano [25, 65]. Here, the percent helix of a polypeptide is given by:

\[ \%\,\mathrm{helix} = \frac{100}{1 + \dfrac{\mathrm{MRE} - \mathrm{MRE}_{\mathrm{helix}}}{\mathrm{MRE}_{\mathrm{coil}} - \mathrm{MRE}}} \]

where MRE (mean residue ellipticity) is equal to \( (100 \times \theta_{222})/(l \times n \times c) \), l is the path length in cm, n is the number of amino acids in the polypeptide, and c is the concentration in M. The standard value for MREhelix is corrected for chain length and temperature so that \( \mathrm{MRE}_{\mathrm{helix}} = -40{,}000 \times (1 - (1/n)) + (100 \times T) \), and the standard value for MREcoil is corrected for temperature so that \( \mathrm{MRE}_{\mathrm{coil}} = 640 - (45 \times T) \). For both corrections T is the temperature in degrees Celsius.

3. Resonance Overlap in NMR Spectra. A common problem in performing NMR spectroscopy on disordered proteins is a lack of chemical shift dispersion. This leads to resonance overlap in the NMR spectra, making assignments difficult. To reduce resonance overlap, design IDR fragments between 80 and 100 amino acids long. A fragment of this length will allow you to incorporate functionally relevant motif(s) like protein-protein interaction sites and be large enough to detect using SDS-PAGE but small enough to reduce resonance overlap.

4. Calculating Helix Populations Directly from Chemical Shifts. The relationship between protein secondary structure and backbone chemical shifts was reported almost 30 years ago [66, 67]. Estimating helical populations from chemical shifts depends on a reliable database of residue-specific random coil chemical shifts, and there is still debate about whether any of the available random coil chemical shift libraries capture the complete conformational diversity of disordered or denatured proteins. Residue-specific random coil chemical shifts have been either determined using model peptides in denaturing conditions or estimated using chemical shift databases [68–74]. Even small deviations from random coil chemical shifts (RCCS) can be used to calculate helix populations. For instance, the average helical shift for Cα residues in ordered proteins can be used to estimate helical populations using the following relationship:

\[ \sum_{i,j} \left( \Delta\delta C\alpha_{i,j} / 3.1 \right) \]

where ΔδCα is the α-carbon SCS for residues i through j, and 3.1 ppm is the average α-carbon SCS for all residue types found in helices of ordered proteins [75]. Of course, not all residue types will have the same α-carbon SCS when they assume a helical structure. A more precise method relies on residue-specific α-carbon SCS values [76–78]. For this method of calculation, the formula would be:

\[ \sum_{i} \left( \Delta\delta C\alpha_{i} / (H\delta_{i} - RC\delta_{i}) \right) \]

where ΔδCαi is the α-carbon secondary chemical shift for residue i, Hδi is the helical chemical shift for residue type i, and RCδi is the RCCS of residue type i.
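The calculations in Notes 2 and 4 can be sketched in a few lines of Python. This is an illustrative sketch under stated assumptions rather than the author's implementation: the function names are hypothetical, θ222 is assumed to be in degrees, the CD formula is written in its algebraically equivalent two-state form (0% at MREcoil, 100% at MREhelix), and the chemical shift sums are divided by the number of residues to report an average population for the segment.

```python
def mean_residue_ellipticity(theta_222_deg, path_cm, n_res, conc_molar):
    """MRE = (100 * theta222) / (l * n * c); theta222 assumed to be in degrees."""
    return (100.0 * theta_222_deg) / (path_cm * n_res * conc_molar)


def mre_helix(n_res, temp_c):
    """Reference MRE of a full helix, corrected for chain length and temperature."""
    return -40000.0 * (1.0 - 1.0 / n_res) + 100.0 * temp_c


def mre_coil(temp_c):
    """Reference MRE of a random coil, corrected for temperature."""
    return 640.0 - 45.0 * temp_c


def percent_helix_cd(mre_obs, n_res, temp_c):
    """Two-state percent helix: 0 at MREcoil, 100 at MREhelix."""
    helix = mre_helix(n_res, temp_c)
    coil = mre_coil(temp_c)
    return 100.0 * (mre_obs - coil) / (helix - coil)


def percent_helix_scs(delta_ca_ppm, full_helix_shift=3.1):
    """Average percent helix of a segment from C-alpha secondary shifts (ppm).

    Dividing by the number of residues is an assumption made here so that
    the result is a per-segment average.
    """
    fractions = [d / full_helix_shift for d in delta_ca_ppm]
    return 100.0 * sum(fractions) / len(fractions)


def percent_helix_scs_residue(delta_ca_ppm, helix_shift_ppm, coil_shift_ppm):
    """Same as above, using residue-specific helix and random coil shifts."""
    fractions = [
        d / (h - rc)
        for d, h, rc in zip(delta_ca_ppm, helix_shift_ppm, coil_shift_ppm)
    ]
    return 100.0 * sum(fractions) / len(fractions)


# Example: a 20-residue peptide at 25 degrees Celsius with a hypothetical MRE.
print(round(percent_helix_cd(-5000.0, 20, 25.0), 1))
```

The residue-specific helix and random coil shifts passed to the last function would come from one of the libraries cited above; no values are hard-coded here because the choice of library is left to the user.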

Acknowledgments

G.W.D. is supported by the National Institutes of Health (CA14124406 and GM115556).

References

1. Romero P, Obradovic Z, Kissinger CR et al (1998) Thousands of proteins likely to have long disordered regions. Pac Symp Biocomput 3:437–448
2. Dunker AK, Obradovic Z, Romero P et al (2000) Intrinsic protein disorder in complete genomes. Genome Inform Ser Workshop Genome Inform 11:161–171
3. Dunker AK, Lawson JD, Brown CJ et al (2001) Intrinsically disordered protein. J Mol Graph Model 19(1):26–59
4. Kriwacki RW, Hengst L, Tennant L et al (1996) Structural studies of p21Waf1/Cip1/Sdi1 in the free and Cdk2-bound state: conformational disorder mediates binding diversity. Proc Natl Acad Sci U S A 93(21):11504–11509
5. Lefevre JF, Dayie KT, Peng JW et al (1996) Internal mobility in the partially folded DNA binding and dimerization domains of GAL4: NMR analysis of the N-H spectral density functions. Biochemistry 35(8):2674–2686
6. Daughdrill GW, Chadsey MS, Karlinsey JE et al (1997) The C-terminal half of the anti-sigma factor, FlgM, becomes structured when bound to its target, sigma 28. Nat Struct Biol 4(4):285–291
7. Radhakrishnan I, Perez-Alvarado GC, Parker D et al (1997) Solution structure of the KIX domain of CBP bound to the transactivation domain of CREB: a model for activator:coactivator interactions. Cell 91(6):741–752
8. Donne DG, Viles JH, Groth D et al (1997) Structure of the recombinant full-length hamster prion protein PrP(29-231): the N terminus is highly flexible. Proc Natl Acad Sci U S A 94(25):13452–13457
9. Wright PE, Dyson HJ (1999) Intrinsically unstructured proteins: re-assessing the protein structure-function paradigm. J Mol Biol 293(2):321–331
10. van der Lee R, Buljan M, Lang B et al (2014) Classification of intrinsically disordered regions and proteins. Chem Rev 114(13):6589–6631. https://doi.org/10.1021/cr400525m
11. Daughdrill GW, Pielak GJ, Uversky VN et al (2005) Natively disordered proteins. In: Buchner J, Kiefhaber T (eds) Protein folding handbook, vol 3. WILEY-VCH, Darmstadt, pp 275–357
12. Vise PD, Baral B, Latos AJ et al (2005) NMR chemical shift and relaxation measurements provide evidence for the coupled folding and binding of the p53 transactivation domain. Nucleic Acids Res 33(7):2061–2077
13. Daughdrill GW, Narayanaswami P, Gilmore SH et al (2007) Dynamic behavior of an intrinsically unstructured linker domain is conserved in the face of negligible amino acid sequence conservation. J Mol Evol 65(3):277–288
14. Gely S, Lowry DF, Bernard C et al (2010) Solution structure of the C-terminal X domain of the measles virus phosphoprotein and interaction with the intrinsically disordered C-terminal domain of the nucleoprotein. J Mol Recognit 23(5):435–447. https://doi.org/10.1002/jmr.1010
15. Dyson HJ, Wright PE (2001) Nuclear magnetic resonance methods for elucidation of structure and dynamics in disordered states. Methods Enzymol 339:258–270
16. Dyson HJ, Wright PE (2002) Coupling of folding and binding for unstructured proteins. Curr Opin Struct Biol 12(1):54–60
17. Cheng Y, Oldfield CJ, Meng J et al (2007) Mining alpha-helix-forming molecular recognition features with cross species sequence alignments. Biochemistry 46(47):13468–13477. https://doi.org/10.1021/bi7012273
18. Mohan A, Oldfield CJ, Radivojac P et al (2006) Analysis of molecular recognition features (MoRFs). J Mol Biol 362(5):1043–1059. https://doi.org/10.1016/j.jmb.2006.07.087
19. Oldfield CJ, Cheng Y, Cortese MS et al (2005) Coupled folding and binding with alpha-helix-forming molecular recognition elements. Biochemistry 44(37):12454–12470
20. Fuxreiter M, Simon I, Friedrich P et al (2004) Preformed structural elements feature in partner recognition by intrinsically unstructured proteins. J Mol Biol 338(5):1015–1026
21. Dyson HJ (2013) Coupled folding and binding. In: Roberts GCK (ed) Encyclopedia of biophysics. Springer Berlin Heidelberg, Berlin, Heidelberg, pp 381–385. https://doi.org/10.1007/978-3-642-16712-6_174
22. Gianni S, Dogan J, Jemth P (2016) Coupled binding and folding of intrinsically disordered proteins: what can we learn from kinetics? Curr Opin Struct Biol 36:18–24. https://doi.org/10.1016/j.sbi.2015.11.012
23. Spolar RS, Record MT Jr (1994) Coupling of local folding to site-specific binding of proteins to DNA. Science 263(5148):777–784
24. Sugase K, Dyson HJ, Wright PE (2007) Mechanism of coupled folding and binding of an intrinsically disordered protein. Nature 447(7147):1021–1025
25. Munoz V, Serrano L (1994) Elucidating the folding problem of helical peptides using empirical parameters. Nat Struct Biol 1(6):399–409
26. Lacroix E, Viguera AR, Serrano L (1998) Elucidating the folding problem of alpha-helices: local motifs, long-range electrostatics, ionic-strength dependence and prediction of NMR parameters. J Mol Biol 284(1):173–191. https://doi.org/10.1006/jmbi.1998.2145
27. Sowemimo OT, Knox-Brown P, Borcherds W et al (2019) Conserved Glycines control disorder and function in the cold-regulated protein, COR15A. Biomol Ther 9(3). https://doi.org/10.3390/biom9030084
28. Kennedy JA, Daughdrill GW, Schmidt KH (2013) A transient alpha-helical molecular recognition element in the disordered N-terminus of the Sgs1 helicase is critical for chromosome stability and binding of Top3/Rmi1. Nucleic Acids Res 41(22):10215–10227. https://doi.org/10.1093/nar/gkt817
29. Borcherds W, Theillet FX, Katzer A et al (2014) Disorder and residual helicity alter p53-Mdm2 binding affinity and signaling in cells. Nat Chem Biol 10(12):1000–1002. https://doi.org/10.1038/nchembio.1668
30. Poosapati A, Gregory E, Borcherds WM et al (2018) Uncoupling the folding and binding of an intrinsically disordered protein. J Mol Biol 430(16):2389–2402. https://doi.org/10.1016/j.jmb.2018.05.045
31. Chen L, Borcherds W, Wu S et al (2015) Autoinhibition of MDMX by intramolecular p53 mimicry. Proc Natl Acad Sci U S A 112(15):4624–4629. https://doi.org/10.1073/pnas.1420833112
32. Brown CJ, Takayama S, Campen AM et al (2002) Evolutionary rate heterogeneity in proteins with long disordered regions. J Mol Evol 55(1):104–110
33. Radivojac P, Obradovic Z, Brown CJ et al (2002) Improving sequence alignments for intrinsically disordered proteins. In: Pac Symp Biocomput, pp 589–600
34. Brown CJ, Johnson AK, Daughdrill GW (2010) Comparing models of evolution for ordered and disordered proteins. Mol Biol Evol 27(3):609–621. https://doi.org/10.1093/molbev/msp277
35. Ahrens JB, Rahaman J, Siltberg-Liberles J (2018) Large-scale analyses of site-specific evolutionary rates across eukaryote proteomes reveal confounding interactions between intrinsic disorder, secondary structure, and functional domains. Genes (Basel) 9(11):E553. https://doi.org/10.3390/genes9110553
36. Gunasekaran K, Tsai CJ, Nussinov R (2004) Analysis of ordered and disordered protein complexes reveals structural features discriminating between stable and unstable monomers. J Mol Biol 341(5):1327–1341. https://doi.org/10.1016/j.jmb.2004.07.002
37. Borcherds W, Kashtanov S, Wu H et al (2013) Structural divergence is more extensive than sequence divergence for a family of intrinsically disordered proteins. Proteins 81(10):1686–1698. https://doi.org/10.1002/prot.24303
38. Higgins DG, Thompson JD, Gibson TJ (1996) Using CLUSTAL for multiple sequence alignments. Methods Enzymol 266:383–402
39. Thompson JD, Higgins DG, Gibson TJ (1994) CLUSTAL W: improving the sensitivity of progressive multiple sequence alignment through sequence weighting, position-specific gap penalties and weight matrix choice. Nucleic Acids Res 22(22):4673–4680. https://doi.org/10.1093/nar/22.22.4673
40. Dosztanyi Z, Csizmok V, Tompa P et al (2005) The pairwise energy content estimated from amino acid composition discriminates between folded and intrinsically unstructured proteins. J Mol Biol 347(4):827–839
41. Meszaros B, Erdos G, Dosztanyi Z (2018) IUPred2A: context-dependent prediction of protein disorder as a function of redox state and protein binding. Nucleic Acids Res 46(W1):W329–W337. https://doi.org/10.1093/nar/gky384
42. Deng X, Eickholt J, Cheng J (2012) A comprehensive overview of computational protein disorder prediction methods. Mol BioSyst 8(1):114–121. https://doi.org/10.1039/c1mb05207a
43. Lieutaud P, Ferron F, Uversky AV et al (2016) How disordered is my protein and what is its disorder for? A guide through the "dark side" of the protein universe. Intrinsically Disord Proteins 4(1):e1259708. https://doi.org/10.1080/21690707.2016.1259708
44. Dawson R, Muller L, Dehner A et al (2003) The N-terminal domain of p53 is natively unfolded. J Mol Biol 332(5):1131–1141
45. Lee H, Mok KH, Muhandiram R et al (2000) Local structural elements in the mostly unstructured transcriptional activation domain of human p53. J Biol Chem 275(38):29426–29432
46. Laptenko O, Tong DR, Manfredi J et al (2016) The tail that wags the dog: how the disordered C-terminal domain controls the transcriptional activities of the p53 tumor-suppressor protein. Trends Biochem Sci 41(12):1022–1034. https://doi.org/10.1016/j.tibs.2016.08.011
47. Ayed A, Mulder FA, Yi GS et al (2001) Latent and active p53 are identical in conformation. Nat Struct Biol 8(9):756–760. https://doi.org/10.1038/nsb0901-756
48. Weinberg RL, Freund SM, Veprintsev DB et al (2004) Regulation of DNA binding of p53 by its C-terminal domain. J Mol Biol 342(3):801–811. https://doi.org/10.1016/j.jmb.2004.07.042
49. Finch RA, Donoviel DB, Potter D et al (2002) Mdmx is a negative regulator of p53 activity in vivo. Cancer Res 62(11):3221–3225
50. Migliorini D, Lazzerini Denchi E, Danovi D et al (2002) Mdm4 (Mdmx) regulates p53-induced growth arrest and neuronal cell death during early embryonic mouse development. Mol Cell Biol 22(15):5527–5538
51. Popowicz GM, Czarna A, Holak TA (2008) Structure of the human Mdmx protein bound to the p53 tumor suppressor transactivation domain. Cell Cycle 7(15):2441–2443. https://doi.org/10.4161/cc.6365
52. Borcherds W, Becker A, Chen L et al (2017) Optimal affinity enhancement by a conserved flexible linker controls p53 mimicry in MdmX. Biophys J 112(10):2038–2042. https://doi.org/10.1016/j.bpj.2017.04.017
53. Kennedy JA, Syed S, Schmidt KH (2015) Structural motifs critical for in vivo function and stability of the RecQ-mediated genome instability protein Rmi1. PLoS One 10(12):e0145466. https://doi.org/10.1371/journal.pone.0145466
54. Harmon TS, Crabtree MD, Shammas SL et al (2016) GADIS: algorithm for designing sequences to achieve target secondary structure profiles of intrinsically disordered proteins. Protein Eng Des Sel 29(9):339–346. https://doi.org/10.1093/protein/gzw034
55. Lee SH, Kim DH, Lee SH et al (2012) Understanding pre-structured motifs (PreSMos) in intrinsically unfolded proteins. Curr Protein Pept Sci 13(1):34–54
56. Borcherds WM, Daughdrill GW (2018) Using NMR chemical shifts to determine residue-specific secondary structure populations for intrinsically disordered proteins. Methods Enzymol 611:101–136. https://doi.org/10.1016/bs.mie.2018.09.011
57. Cho Y, Gorina S, Jeffrey PD et al (1994) Crystal structure of a p53 tumor suppressor-DNA complex: understanding tumorigenic mutations. Science 265(5170):346–355
58. Clore GM, Omichinski JG, Sakaguchi K et al (1994) High-resolution structure of the oligomerization domain of p53 by multidimensional NMR. Science 265(5170):386–391
59. Lee W, Harvey TS, Yin Y et al (1994) Solution structure of the tetrameric minimum transforming domain of p53. Nat Struct Biol 1(12):877–890
60. Fang S, Jensen JP, Ludwig RL et al (2000) Mdm2 is a RING finger-dependent ubiquitin protein ligase for itself and p53. J Biol Chem 275(12):8945–8951. https://doi.org/10.1074/jbc.275.12.8945
61. Camilloni C, De Simone A, Vranken WF et al (2012) Determination of secondary structure populations in disordered states of proteins using nuclear magnetic resonance chemical shifts. Biochemistry 51(11):2224–2231. https://doi.org/10.1021/bi3001825
62. Lee C, Kalmar L, Xue B et al (2014) Contribution of proline to the pre-structuring tendency of transient helical secondary structure elements in intrinsically disordered proteins. Biochim Biophys Acta 1840(3):993–1003. https://doi.org/10.1016/j.bbagen.2013.10.042
63. Crabtree MD, Borcherds W, Poosapati A et al (2017) Conserved helix-flanking Prolines modulate intrinsically disordered protein:target affinity by altering the lifetime of the bound complex. Biochemistry 56(18):2379–2384. https://doi.org/10.1021/acs.biochem.7b00179
64. Watt PM, Louis EJ, Borts RH et al (1995) Sgs1: a eukaryotic homolog of E. coli RecQ that interacts with topoisomerase II in vivo and is required for faithful chromosome segregation. Cell 81(2):253–260
65. Munoz V, Serrano L (1995) Elucidating the folding problem of helical peptides using empirical parameters. III. Temperature and pH dependence. J Mol Biol 245(3):297–308. https://doi.org/10.1006/jmbi.1994.0024
66. Wishart DS, Sykes BD, Richards FM (1991) Relationship between nuclear magnetic resonance chemical shift and protein secondary structure. J Mol Biol 222(2):311–333
67. Wishart DS, Sykes BD (1994) The 13C chemical-shift index: a simple method for the identification of protein secondary structure using 13C chemical-shift data. J Biomol NMR 4(2):171–180
68. Wishart DS, Bigam CG, Holm A et al (1995) H-1, C-13 and N-15 random coil NMR chemical-shifts of the common amino-acids. 1. Investigations of nearest-neighbor effects (Vol 5, Pg 67, 1995). J Biomol NMR 5(3):332–332
69. Nielsen JT, Mulder FAA (2018) POTENCI: prediction of temperature, neighbor and pH-corrected chemical shifts for intrinsically disordered proteins. J Biomol NMR 70(3):141–165. https://doi.org/10.1007/s10858-018-0166-5
70. Tamiola K, Acar B, Mulder FA (2010) Sequence-specific random coil chemical shifts of intrinsically disordered proteins. J Am Chem Soc 132(51):18000–18003. https://doi.org/10.1021/ja105656t
71. Kjaergaard M, Poulsen FM (2011) Sequence correction of random coil chemical shifts: correlation between neighbor correction factors and changes in the Ramachandran distribution. J Biomol NMR 50(2):157–165. https://doi.org/10.1007/s10858-011-9508-2
72. Zhang HY, Neal S, Wishart DS (2003) RefDB: a database of uniformly referenced protein chemical shifts. J Biomol NMR 25(3):173–195. https://doi.org/10.1023/A:1022836027055
73. De Simone A, Cavalli A, Hsu STD et al (2009) Accurate random coil chemical shifts from an analysis of loop regions in native states of proteins. J Am Chem Soc 131(45):16332. https://doi.org/10.1021/ja904937a
74. Schwarzinger S, Kroon GJA, Foss TR et al (2000) Random coil chemical shifts in acidic 8 M urea: implementation of random coil shift data in NMRView. J Biomol NMR 18(1):43–48. https://doi.org/10.1023/A:1008386816521
75. Dyson HJ, Wright PE (2002) Insights into the structure and dynamics of unfolded proteins from nuclear magnetic resonance. In: Unfolded proteins, vol 62. Advances in Protein Chemistry. Academic Press Inc, San Diego, pp 311–340
76. Wishart DS (2011) Interpreting protein chemical shift data. Prog Nucl Magn Reson Spectrosc 58(1–2):62–87. https://doi.org/10.1016/j.pnmrs.2010.07.004
77. Neal S, Nip AM, Zhang HY et al (2003) Rapid and accurate calculation of protein H-1, C-13 and N-15 chemical shifts. J Biomol NMR 26(3):215–240. https://doi.org/10.1023/A:1023812930288
78. Wishart DS, Nip AM (1998) Protein chemical shift analysis: a practical guide. Biochem Cell Biol 76(2–3):153–163. https://doi.org/10.1139/bcb-76-2-3-153

Chapter 2

Computational Prediction of Intrinsic Disorder in Protein Sequences with the disCoP Meta-predictor

Christopher J. Oldfield, Xiao Fan, Chen Wang, A. Keith Dunker, and Lukasz Kurgan

Abstract

Intrinsically disordered proteins are either entirely disordered or contain disordered regions in their native state. These proteins and regions function without the prerequisite of a stable structure and were found to be abundant across all kingdoms of life. Experimental annotation of disorder lags behind the rapidly growing number of sequenced proteins, motivating the development of computational methods that predict disorder in protein sequences. DisCoP is a user-friendly webserver that provides accurate sequence-based prediction of protein disorder. It relies on a meta-architecture in which the outputs generated by multiple disorder predictors are combined to improve predictive performance. The architecture of disCoP is presented, and its accuracy relative to several other disorder predictors is briefly discussed. We describe usage of the web interface and explain how to access and read results generated by this computational tool. We also provide an example of prediction results and their interpretation. The disCoP webserver is publicly available at http://biomine.cs.vcu.edu/servers/disCoP/.

Key words Intrinsically disordered proteins, IDP, Bioinformatics, Webserver, Meta-architecture

1 Introduction

Intrinsically disordered proteins (IDPs) form broad structural ensembles and lack stable folded structure in isolation under physiological conditions [1–6]. These proteins have also been called partially folded, natively denatured, natively unfolded, natively disordered, intrinsically unstructured, intrinsically denatured, and intrinsically unfolded [1]. IDPs have one or more intrinsically disordered regions (IDRs) and in some cases they are fully disordered. Recent computational studies estimate that eukaryotic organisms have between 3% and 17% of fully disordered proteins and that between 30% and 50% of proteins in their proteomes have at least one long IDR (30 or more consecutive amino acid residues long) [7–13]. IDPs also occupy a large part of proteomes in

Birthe B. Kragelund and Karen Skriver (eds.), Intrinsically Disordered Proteins: Methods and Protocols, Methods in Molecular Biology, vol. 2141, https://doi.org/10.1007/978-1-0716-0524-0_2, © Springer Science+Business Media, LLC, part of Springer Nature 2020


Christopher J. Oldfield et al.

bacteria, archaea, and viruses [7, 10–12, 14–20]. They are instrumental for numerous cellular functions including signaling [21– 24], regulation of transcription [25, 26], translation [27], chromatin condensing [28–31], and molecular interactions with proteins and nucleic acids [28, 32–39], to name just a few. Intrinsic disorder was shown to be enriched in alternatively spliced regions [40–43] and in posttranslational modification sites [43–45]. Moreover, IDPs are being explored as drug targets [46, 47], which is motivated by their association with a number of human diseases [48, 49]. Sequences of IDRs are substantially different from the sequences of structured regions and proteins. For example, IDRs are enriched in polar amino acids, depleted in large hydrophobic and aromatic amino acids, and have relatively low sequence complexity [50–53]. These differences underlie the development of accurate computational methods for the prediction of disorder in protein chains. Over 70 computational disorder predictors were developed over the last few decades [4, 54–65]. Many of the recently published methods rely on meta-architectures that combine outputs produced by several disorder predictors to (re)predict disorder. The meta-predictors include (in chronological order) VSL2 [66], metaPrDOS [67], PreDisorder [68], NN-CDF [69], MD [70], PONDR-FIT [71], MFDp [72], CSpritz [73], MetaDisorder [74], ESpritz [75], MFDp2 [76, 77], DisMeta [78], disCoP [79], DISOPRED3 [80], and MobiDB-lite [81]. This type of predictive architecture is motivated by studies that empirically demonstrate that outputs from the meta-predictors are more accurate when compared to the results produced by their input single predictors [79, 82]. However, the improved accuracy comes at a cost of a longer runtime and inconvenience. The long runtime stems from the fact that multiple disorder predictions have to be computed and combined together. 
The inconvenience is due to the fact that outputs of several disorder predictors must be collected by the user. The latter drawback is alleviated by some meta-predictors that incorporate computation of the input disorder predictors into their publicly available implementations. A recently published example of a convenient meta-predictor is disCoP (disorder Consensus-based Predictor) [79]. The disCoP method is available as a user-friendly webserver that automates the entire prediction process. Users only need to enter the sequence of their proteins and click the “Run” button to obtain disorder prediction. Moreover, benchmarking tests show that disCoP provides accurate predictions, with area under the receiver operating characteristic (ROC) curve (AUC) = 0.85 and Matthews correlation coefficient (MCC) = 0.50. DisCoP was compared empirically to 20 other disorder predictors including several meta-predictors such as ESpritz (AUC = 0.83 and MCC = 0.48), CSpritz (AUC = 0.83 and MCC = 0.45), MD (AUC = 0.82 and

Intrinsic Disorder from Sequence – disCoP


MCC = 0.45), MFDp (AUC = 0.82 and MCC = 0.45), and PONDR-FIT (AUC = 0.78 and MCC = 0.41). These tests concluded that the predictive performance of disCoP is statistically significantly better (p-value < 0.01) [79]. To sum up, the two main advantages of disCoP are the availability of the convenient webserver and good predictive performance. This chapter describes the underlying meta-architecture of disCoP, explains its web interface, and provides detailed instructions on how to generate predictions with this computational tool. We also explain how to read and interpret the results generated by this meta-predictor using a case study that concerns prediction of intrinsic disorder for the chromatin accessibility complex 16kD protein.
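For readers less familiar with the two performance measures quoted above, the following minimal sketch shows how AUC and MCC can be computed for per-residue disorder predictions. The scores and labels are made-up toy values, the function names are hypothetical, and this is not the benchmarking code used in [79].

```python
import math


def confusion_counts(scores, labels, threshold=0.5):
    """Count (tp, tn, fp, fn) when residues scoring above threshold are called disordered."""
    tp = tn = fp = fn = 0
    for score, label in zip(scores, labels):
        predicted = 1 if score > threshold else 0
        if predicted == 1 and label == 1:
            tp += 1
        elif predicted == 0 and label == 0:
            tn += 1
        elif predicted == 1 and label == 0:
            fp += 1
        else:
            fn += 1
    return tp, tn, fp, fn


def mcc(tp, tn, fp, fn):
    """Matthews correlation coefficient from a binary confusion matrix."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return 0.0 if denom == 0 else (tp * tn - fp * fn) / denom


def auc(scores, labels):
    """Area under the ROC curve via pairwise ranking (ties count as 0.5)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Toy example: 1 = experimentally disordered residue, 0 = ordered residue.
toy_scores = [0.9, 0.8, 0.4, 0.6, 0.3, 0.2, 0.1, 0.7]
toy_labels = [1, 1, 1, 0, 0, 0, 0, 1]

print(auc(toy_scores, toy_labels))
print(mcc(*confusion_counts(toy_scores, toy_labels)))
```

AUC evaluates the numeric propensities over all possible thresholds, whereas MCC evaluates the binary predictions at a single threshold, which is why both values are usually reported together.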

2 Materials

1. Sequences of proteins to be predicted. The sequences must be formatted using the FASTA format (see Note 1). Up to five protein sequences can be submitted at one time as either a file upload or using a text entry field (see Note 2).

2. disCoP: The webserver, which is freely available at http://biomine.cs.vcu.edu/servers/disCoP/, is designed to be simple to use. All computations are performed on the server side, and thus the only requirements for submitting predictions are an Internet connection and a modern web browser (Firefox, Internet Explorer, or Chrome). The webserver visualizes the results directly in the web browser window and also delivers these results to the user-provided email address.

The meta-architecture of the disCoP webserver is shown in Fig. 1. The input protein sequence goes through a three-stage process to generate putative IDRs. In stage 1, the sequence is processed by four disorder predictors: SPINE-D [83], DISOclust [84], DISOPRED2 [85], and MD [70]. This collection of four predictors was selected from among 20 disorder predictors using an empirical procedure that aims to maximize predictive performance [79]. Each of the four methods outputs a numeric propensity for disorder and a binary disorder annotation (disordered versus ordered) for each residue in the input protein chain. In stage 2, these predictions are processed to produce features that numerically quantify information which is relevant for the disorder prediction. The features are calculated using sliding windows that aggregate and summarize putative disorder information among neighboring (in the sequence) amino acid residues. This reduces the risk of making spurious predictions. The windows are represented by dashed boxes in Fig. 1. A balanced and complementary set of seven features is collected by considering both types of


Fig. 1 Prediction process implemented in the disCoP predictor. The outputs of the four disorder predictors (SPINE-D, DISOclust, DISOPRED2, and MD) generated in stage 1 include the propensity scores and the corresponding putative IDRs, which are shown using the green horizontal bars. The dashed boxes with gray shading denote the sliding windows that are used to compute the seven features in stage 2. In stage 3, the binomial deviation regression model predicts the putative propensities for disorder from the seven features. The putative IDRs generated by disCoP are shown at the bottom of the figure, and they correspond to the residues for which the putative propensities for disorder are > 0.5. The example shows results produced for the chromatin accessibility complex 16kD protein (UniProt id: Q9V452)

outputs (propensities and binary) generated by each of the four disorder predictors. Stage 3 uses these features as input to a trained regression model to produce disCoP’s predictions in the form of numeric propensities for disorder. These propensities range between 0 and 1, where higher propensity scores are indicative of a higher likelihood of intrinsic disorder. The disCoP’s webserver further processes these propensities to generate binary predictions, which correspond to the putative IDRs. Residues with propensities >0.5 are predicted to be disordered, while the remaining residues are predicted as ordered/structured (see Note 3).
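The effect of window-based aggregation and of the 0.5 cutoff can be illustrated with a short sketch. This is not the disCoP feature set or regression model, whose details are given in [79]; the window size, the use of a simple mean, the function names, and the toy propensities are all assumptions chosen for illustration.

```python
def window_mean(values, half_width=1):
    """Mean over a window centered on each residue, truncated at chain ends."""
    smoothed = []
    for i in range(len(values)):
        lo = max(0, i - half_width)
        hi = min(len(values), i + half_width + 1)
        smoothed.append(sum(values[lo:hi]) / (hi - lo))
    return smoothed


def call_idrs(propensities, threshold=0.5):
    """Return 1-based inclusive (start, end) segments with propensity > threshold."""
    segments = []
    start = None
    for position, p in enumerate(propensities, start=1):
        if p > threshold and start is None:
            start = position
        elif p <= threshold and start is not None:
            segments.append((start, position - 1))
            start = None
    if start is not None:
        segments.append((start, len(propensities)))
    return segments


# Toy per-residue propensities from a single hypothetical predictor.
raw = [0.9, 0.8, 0.7, 0.2, 0.9, 0.1, 0.2, 0.1, 0.7, 0.9]

print(call_idrs(raw))                  # includes an isolated call at residue 5
print(call_idrs(window_mean(raw, 1)))  # the isolated call is absorbed or removed
```

In this toy case the unsmoothed scores yield a single-residue "region" at position 5, while the windowed scores yield two contiguous regions, which is the kind of spurious prediction that window-based features are intended to suppress.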

3 Methods

Predictions are submitted at the main disCoP webpage at http://biomine.cs.vcu.edu/servers/disCoP/. Notifications of completed predictions are given by email, and thus an email address is required for each submission. These notifications provide a link to the prediction results, which can be viewed in a browser window and/or downloaded as a parsable text file. The predictions can be accessed at a later time and they are kept on the webserver for at least 3 months.
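Before submitting, it can be convenient to check locally that the input is valid FASTA and respects the five-sequence limit stated in Materials. The sketch below is a hypothetical pre-submission check, not part of the disCoP webserver; in particular, restricting sequences to the 20 standard amino acid letters is an assumption, and the maximum sequence length described in Note 2 is not checked.

```python
STANDARD_AA = set("ACDEFGHIKLMNPQRSTVWY")
MAX_SEQUENCES = 5  # limit stated in the Materials section


def parse_fasta(text):
    """Parse FASTA text into a list of (header, sequence) tuples."""
    records = []
    header = None
    chunks = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                records.append((header, "".join(chunks)))
            header = line[1:].strip()
            chunks = []
        else:
            if header is None:
                raise ValueError("sequence data found before the first '>' line")
            chunks.append(line.upper())
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


def check_submission(text):
    """Return a list of problems; an empty list means none were found."""
    problems = []
    records = parse_fasta(text)
    if not records:
        problems.append("no sequences found")
    if len(records) > MAX_SEQUENCES:
        problems.append(f"{len(records)} sequences given, at most {MAX_SEQUENCES} allowed")
    for header, seq in records:
        if not seq:
            problems.append(f"{header}: empty sequence")
        unexpected = sorted(set(seq) - STANDARD_AA)
        if unexpected:
            problems.append(f"{header}: unexpected characters {''.join(unexpected)}")
    return problems


example = ">protein_1\nMKVLLA\nGDE\n>protein_2\nacdefgh\n"
print(check_submission(example))
```

An empty list indicates that the text parsed cleanly under these assumptions; it does not guarantee that the server will accept the submission.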

3.1 Running disCoP

Three easy steps are required to submit sequences for prediction (Fig. 2, labels 1, 2, and 3):

1. Enter FASTA-formatted sequences (see Note 1) in one of two ways:
(a) Upload a file of FASTA-formatted sequences.
(b) Type or paste the FASTA-formatted sequences into the white text entry field. An example of a properly formatted sequence can be obtained by clicking the “Example” button located below the text entry field. Clicking the “Reset sequence(s)” button clears both submission options. There are limits on both the number and the length of sequences that can be submitted for prediction (see Note 2).

Fig. 2 The disCoP prediction submission webpage. Orange/yellow circles indicate the three steps to submit sequences for predictions, discussed in the text


2. Provide an email address (see Note 4). This email is only used to send notification of completed predictions.

3. Click “Run disCoP” to start the prediction.

Clicking “Run disCoP” takes the user to a status page that reports on the current state of the submitted prediction. Submissions to several different bioinformatics webservers located at the http://biomine.cs.vcu.edu site (see Note 5) are entered into the same queue system (see Note 6). The status page reports the current position in the queue and shows when prediction for this submission begins. The runtime needed to complete prediction for an average length protein sequence (about 250 amino acids) is approximately 10 min. The prediction can take over 40 min when submitting five longer protein sequences. After the prediction is completed, the status page automatically redirects the user to the prediction results page. This also triggers an email with the location of the results page that is sent to the user-provided email address. There is no need to keep the status page open while predictions are running since the notification email is always sent when the prediction is finished.

3.2 Results Generated by disCoP

The results page can be reached by leaving the status page open for the duration of the prediction or by following the link provided in the email. The email (Fig. 3) provides a job identifier together with the location of the prediction results page and the text file with the results. Each submission is assigned a unique 14-digit identifier (Fig. 3, label 1) that is shown at the top of the notification email (see Note 7). In the case of issues with the completion and/or contents of the prediction, the identifier can be used to trace the corresponding submission. The email message includes a direct link to the webpage with the results of the prediction (Fig. 3, label 2) and to the text file with the results (Fig. 3, label 3).

Fig. 3 The disCoP notification email. The email provides a unique job identifier and links to the results, indicated with orange/yellow circles and discussed in the text

This results page (Fig. 4) includes a link to a text file results.csv with prediction results (Fig. 4, label 1) and a visualization of the predictions (Fig. 4, label 2) (see Note 8). The text file contains protein identifiers, sequences, binary predictions, and propensity scores for each of the submitted protein sequences. These data are in the comma-separated values (CSV) format to ease parsing. An example of the CSV format results file is shown in Fig. 5. Each sequence is represented by three lines:

1. The protein name taken from the FASTA header provided by the user followed by the protein sequence. The individual amino acids are comma-separated to ease aligning them to the corresponding predictions listed in the two subsequent lines.

2. Binary predictions of disorder, where D denotes disordered residues and O denotes ordered residues.

3. Propensity for disorder, ranging between 1 for high propensity and 0 for low propensity. Residues with propensity >0.5 are annotated as D in the second line.

Fig. 4 The disCoP prediction results webpage. The red D and green n denote the putative disordered residues and putative non-disordered (structured) residues, respectively. The corresponding putative propensity scores are provided directly underneath. Orange/yellow circles indicate important features of this page, discussed in the text

The visualization (Fig. 4, label 2) shows the binary annotations of the putative IDRs (using red highlights) and putative ordered regions (green highlights) for each residue in the input protein chain. Each binary annotation is associated with the propensity score, which is provided directly underneath. The scores are in the range between 0 and 99, where residues with scores >50 are predicted in binary as disordered.
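The three-line-per-protein layout of the CSV results file lends itself to simple scripted parsing. The following Python 3 sketch is one possible reader, written under stated assumptions rather than from the server's specification: it assumes that the first field of the first line is the protein name and the remaining fields are single residues, and it tolerates (by skipping) any label field on the second and third lines. The function name and the example record are hypothetical, and a protein name that itself contains a comma would break this simple approach.

```python
import csv
import io


def parse_discop_csv(handle):
    """Parse three-line records: name + residues, D/O calls, propensities."""
    # Drop blank lines so that records can be read in strict groups of three.
    rows = [row for row in csv.reader(handle) if any(f.strip() for f in row)]
    if len(rows) % 3 != 0:
        raise ValueError("expected three lines per protein")
    records = []
    for i in range(0, len(rows), 3):
        header, binary_row, score_row = rows[i:i + 3]
        name = header[0].strip()
        residues = [f.strip() for f in header[1:] if f.strip()]
        calls = [f.strip() for f in binary_row if f.strip() in ("D", "O")]
        scores = []
        for field in score_row:
            try:
                scores.append(float(field))
            except ValueError:
                continue  # skip any empty or non-numeric label field
        if not (len(residues) == len(calls) == len(scores)):
            raise ValueError("length mismatch for " + name)
        records.append({"name": name, "sequence": "".join(residues),
                        "binary": "".join(calls), "propensity": scores})
    return records


# Made-up four-residue record, used only to exercise the parser.
example = io.StringIO(">demo,M,G,E,P\nD,D,O,O\n0.81,0.62,0.40,0.12\n")
for rec in parse_discop_csv(example):
    print(rec["name"], rec["sequence"], rec["binary"], rec["propensity"])
```

For a real results file, the same function can be given an open file handle, for example `open("results.csv", newline="")`. The length check is a useful safeguard, since the three lines of each record are meant to align residue by residue.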


Fig. 5 Example of the CSV format results file for the disCoP prediction. The example shows results produced for the chromatin accessibility complex 16kD protein (UniProt id: Q9V452)

4 Case Study

The protein CHRAC16 is a component of the chromatin accessibility complex (CHRAC), formed by interaction of CHRAC16 and CHRAC14 with the ATP-utilizing chromatin assembly and remodeling factor (ACF) complex. The crystal structure of the CHRAC14/16 dimer has been determined [86], which revealed two disordered regions, one at each terminus (Fig. 6, top panel). The N- and C-terminal IDRs play a role in ACF binding and in modulating DNA binding affinity, respectively [86]. Disorder predictions for CHRAC16 demonstrate the improvement of disCoP predictions relative to its component predictions from SPINE-D, DISOPRED, DISOclust, and MD. This is shown in Fig. 6 by comparing the number of incorrect predictions (hatched portions of the score lines) between disCoP and the other four methods. For comparison, prediction scores from the disCoP server’s CSV output file were plotted along with prediction scores of the component predictors (Fig. 6, middle and bottom panels).

Fig. 6 Known intrinsically disordered and structured regions in CHRAC16 compared to disorder predictions. (Top panel) Structurally characterized regions are shown: two intrinsically disordered regions (red) and one structured region (blue). (Middle and bottom panels) Intrinsic disorder prediction scores given by SPINE-D (cyan), DISOPRED (orange), DISOclust (green), MD (purple), and disCoP (pink, shown alone in the bottom panel) are shown, where values above 0.5 are predictions of disorder and below 0.5 are predictions of structure. Hatched portions of the score lines indicate incorrect predictions

The four component predictors of disCoP generally perform well for Drosophila CHRAC16 (Fig. 6, middle panel); SPINE-D, DISOPRED, DISOclust, and MD predict 85%, 87%, 84%, and 69% of residues correctly, respectively. Both SPINE-D and DISOPRED predict disordered and ordered regions correctly, but predict the two disordered regions to be shorter than found experimentally. DISOclust and MD both predict too much disorder, with MD predicting much of the structured region to be disordered. In contrast, the disCoP prediction is highly accurate (Fig. 6, bottom panel), predicting 98% of residues correctly. Similar to SPINE-D and DISOPRED, disCoP slightly underpredicts disorder at the N-terminus and C-terminus, but only by two residues and one residue, respectively.
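The percentages quoted above are per-residue accuracies: the fraction of residues whose predicted state agrees with the experimental annotation. A minimal Python 3 sketch of that calculation follows. The function name is hypothetical, and the strings are a made-up 20-residue example, not the CHRAC16 annotation or any predictor's actual output.

```python
def fraction_correct(annotation, prediction):
    """Fraction of residues whose predicted state matches the annotation."""
    if len(annotation) != len(prediction) or not annotation:
        raise ValueError("inputs must be non-empty and of equal length")
    matches = sum(a == p for a, p in zip(annotation, prediction))
    return matches / len(annotation)


# Made-up example: disordered termini (D) flanking an ordered core (O).
known = "D" * 5 + "O" * 10 + "D" * 5
# A prediction that shortens the N-terminal IDR by two residues
# and the C-terminal IDR by one residue.
predicted = "D" * 3 + "O" * 13 + "D" * 4

print(round(fraction_correct(known, predicted), 2))  # 0.85
```

In this toy case three of the 20 residues are miscalled, all of them at the boundaries of the disordered regions, which mirrors the kind of underprediction at the termini described in the text.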

5 Notes

1. The FASTA format for the protein sequences is explained at https://en.wikipedia.org/wiki/FASTA_format. Briefly, each protein is represented by multiple lines, where the first line begins with “>” followed by the name and description of the protein, and the subsequent lines provide the sequence using the one-letter amino acid encoding, usually with no more than 80 characters per line. An example follows:

>Q9V452
MGEPRSQPPVERPPTAETFLPLSRVRTIMKSSMDTGLITNEVLFLMTKCTELFVRHLAGA
AYTEEFGQRPGEALKYEHLSQVVNKNKNLEFLLQIVPQKIRVHQFQEMLRLNRSAGSDDD
DDDDDDDDEEESESESESDE

The disCoP server also accepts the entire protein sequence on a single line after the header, i.e., the user has the option of providing the sequence in one line or breaking it up into multiple lines.

2. Up to five FASTA-formatted sequences can be submitted at one time. Moreover, the programs used to implement the disCoP predictor limit the length of submitted protein sequences to the range between 26 residues and 1000 residues. These limits apply both to the text entry field and to file uploads. Submissions exceeding either of these limits receive an error notification from the server (“You entered 10 proteins. Up to 5 proteins allowed!” or “Input sequence is 1024 amino acids long. The minimal allowed length is 26 amino acids and the maximal length is 1000. Please re-submit your sequence.”), and prediction is disallowed. Requests with more than five proteins have to be broken into multiple submissions, each with five or fewer sequences (also, see Note 6). The users must combine the results from different submissions manually.

3. Analysis of the predictions generated by disCoP benefits from examining the propensity scores in addition to the binary predictions. High values of the propensity scores which are below the 0.5 threshold (and which consequently do not result in the binary prediction of IDRs) may suggest the presence of disorder if combined with other data. Benchmarks show that the threshold = 0.5 corresponds to predictions with a sensitivity of about 65% and a low (15%) false-positive rate, resulting in a rather conservative set of disorder predictions. This means that residues that were not predicted as disordered in the binary output but that have high propensity scores have an elevated likelihood of disorder, although treating them as disordered comes with a higher false-positive rate.

4. Rather than requiring an active browser connection for the duration of the entire prediction, notifications of completed predictions are sent to the email address provided by the user.

5. The http://biomine.cs.vcu.edu site includes several other predictors, such as (in alphabetical order) CONNECTOR [87], CRYSTALP2 [88], Cypred [89], DFLpred [90], DisCon [91], DisoRDPbind [92, 93], DMRpred [94], DRNApred [95], fDETECT [96, 97], fMoRFpred [98], funDNApred [99], hybridNAP [100], ILbind [101], MFDp [72], MFDp2 [76, 77], MoRFpred [102, 103], NsitePred [104], PPCpred [105], QUARTER [106, 107], RAPID [13], SSCon [108], and SLIDER [9].

6. The http://biomine.cs.vcu.edu site utilizes a first-come-first-served queue. However, the number of simultaneous submissions across all webservers (see Note 5) that are received from the same IP address is limited to three. Users who submit too frequently receive a message to resubmit after one of their pending submissions is completed. This limit aims to equalize access to this resource across users by not allowing any one user to submit an excessive number of jobs that would severely delay/block access for the other users.

7. Both links to the results are based on the unique job identifier and they are not posted online. This means that the other users of this webserver are unable to access the results, preserving privacy of the submission.

8. Users should save the email and the links to the results. The results can be accessed only via the links that are provided in the notification email and on the results webpage.
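The submission limits in Note 2 can be checked locally before anything is sent to the server, and larger datasets can be split into batches of five. The Python 3 sketch below is a hypothetical helper that uses only the limits quoted in Note 2 (at most five sequences per submission, 26 to 1000 residues each); it does not contact the server, and the function names are illustrative.

```python
def read_fasta(text):
    """Return a list of (header, sequence) tuples from FASTA-formatted text."""
    records, header, chunks = [], None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        elif header is None:
            raise ValueError("sequence data found before a '>' header line")
        else:
            chunks.append(line)  # sequences may span one or many lines
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


MIN_LEN, MAX_LEN, MAX_PER_JOB = 26, 1000, 5  # limits quoted in Note 2


def plan_submissions(records):
    """Check sequence lengths, then split records into batches of at most five."""
    for header, seq in records:
        if not MIN_LEN <= len(seq) <= MAX_LEN:
            raise ValueError("%s: length %d is outside %d-%d"
                             % (header, len(seq), MIN_LEN, MAX_LEN))
    return [records[i:i + MAX_PER_JOB]
            for i in range(0, len(records), MAX_PER_JOB)]


# Seven made-up 30-residue sequences need two submissions.
fasta = "".join(">p%d\n%s\n" % (i, "A" * 30) for i in range(7))
batches = plan_submissions(read_fasta(fasta))
print([len(batch) for batch in batches])  # [5, 2]
```

Because at most three simultaneous submissions are accepted from one IP address (see Note 6), batches produced this way should be submitted a few at a time, and the results combined manually afterwards.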

Acknowledgments

This research was supported in part by the Robert J. Mattauch Endowment funds and the National Science Foundation grant 1617369 to Lukasz Kurgan.

References

1. Dunker AK, Babu MM, Barbar E et al (2013) What’s in a name? Why these proteins are intrinsically disordered. Intrinsically Disord Proteins 1:e24157
2. Dunker AK, Obradovic Z (2001) The protein trinity--linking function and disorder. Nat Biotechnol 19:805–806
3. Habchi J, Tompa P, Longhi S et al (2014) Introducing protein intrinsic disorder. Chem Rev 114:6561–6588
4. Lieutaud P, Ferron F, Uversky AV et al (2016) How disordered is my protein and what is its disorder for? A guide through the “dark side” of the protein universe. Intrinsically Disord Proteins 4:e1259708
5. Uversky VN, Gillespie JR, Fink AL (2000) Why are “natively unfolded” proteins unstructured under physiologic conditions? Proteins 41:415–427
6. Wright PE, Dyson HJ (1999) Intrinsically unstructured proteins: re-assessing the protein structure-function paradigm. J Mol Biol 293:321–331
7. Dunker AK, Obradovic Z, Romero P et al (2000) Intrinsic protein disorder in complete genomes. Genome Inform Ser Workshop Genome Inform 11:161–171
8. Oates ME, Romero P, Ishida T et al (2013) D(2)P(2): database of disordered protein predictions. Nucleic Acids Res 41:D508–D516


9. Peng Z, Mizianty MJ, Kurgan L (2014) Genome-scale prediction of proteins with long intrinsically disordered regions. Proteins 82:145–158
10. Peng Z, Yan J, Fan X et al (2015) Exceptionally abundant exceptions: comprehensive characterization of intrinsic disorder in all domains of life. Cell Mol Life Sci 72:137–151
11. Ward JJ, Sodhi JS, McGuffin LJ et al (2004) Prediction and functional analysis of native disorder in proteins from the three kingdoms of life. J Mol Biol 337:635–645
12. Xue B, Dunker AK, Uversky VN (2012) Orderly order in protein intrinsic disorder distribution: disorder in 3500 proteomes from viruses and the three domains of life. J Biomol Struct Dyn 30:137–149
13. Yan J, Mizianty MJ, Filipow PL et al (2013) RAPID: fast and accurate sequence-based prediction of intrinsic disorder content on proteomic scale. Biochim Biophys Acta 1834:1671–1680
14. Charon J, Theil S, Nicaise V et al (2016) Protein intrinsic disorder within the Potyvirus genus: from proteome-wide analysis to functional annotation. Mol BioSyst 12:634–652
15. Fan X, Xue B, Dolan PT et al (2014) The intrinsic disorder status of the human hepatitis C virus proteome. Mol BioSyst 10:1345–1363
16. Meng F, Badierah RA, Almehdar HA et al (2015) Unstructural biology of the Dengue virus proteins. FEBS J 282:3368–3394
17. Wang C, Uversky VN, Kurgan L (2016) Disordered nucleiome: abundance of intrinsic disorder in the DNA- and RNA-binding proteins in 1121 species from eukaryota, bacteria and archaea. Proteomics 16:1486–1498
18. Xue B, Blocquel D, Habchi J et al (2014) Structural disorder in viral proteins. Chem Rev 114:6880–6911
19. Xue B, Mizianty MJ, Kurgan L et al (2012) Protein intrinsic disorder as a flexible armor and a weapon of HIV-1. Cell Mol Life Sci 69:1211–1259
20. Xue B, Williams RW, Oldfield CJ et al (2010) Archaic chaos: intrinsically disordered proteins in Archaea. BMC Syst Biol 4(Suppl 1):S1
21. Dunker AK, Cortese MS, Romero P et al (2005) Flexible nets. The roles of intrinsic disorder in protein interaction networks. FEBS J 272:5129–5148
22. Dyson HJ, Wright PE (2005) Intrinsically unstructured proteins and their functions. Nat Rev Mol Cell Biol 6:197–208
23. Galea CA, Wang Y, Sivakolundu SG et al (2008) Regulation of cell division by intrinsically unstructured proteins: intrinsic flexibility, modularity, and signaling conduits. Biochemistry 47:7598–7609
24. Uversky VN, Oldfield CJ, Dunker AK (2005) Showing your ID: intrinsic disorder as an ID for recognition, regulation and cell signaling. J Mol Recognit 18:343–384
25. Fuxreiter M, Tompa P, Simon I et al (2008) Malleable machines take shape in eukaryotic transcriptional regulation. Nat Chem Biol 4:728–737
26. Liu J, Perumal NB, Oldfield CJ et al (2006) Intrinsic disorder in transcription factors. Biochemistry 45:6873–6888
27. Peng Z, Oldfield CJ, Xue B et al (2014) A creature with a hundred waggly tails: intrinsically disordered proteins in the ribosome. Cell Mol Life Sci 71:1477–1504
28. Dyson HJ (2012) Roles of intrinsic disorder in protein-nucleic acid interactions. Mol BioSyst 8:97–104
29. Meng F, Na I, Kurgan L et al (2016) Compartmentalization and functionality of nuclear disorder: intrinsic disorder and protein-protein interactions in intra-nuclear compartments. Int J Mol Sci 17:E24
30. Peng Z, Mizianty MJ, Xue B et al (2012) More than just tails: intrinsic disorder in histone proteins. Mol BioSyst 8:1886–1901
31. Sandhu KS (2009) Intrinsic disorder explains diverse nuclear roles of chromatin remodeling proteins. J Mol Recognit 22:1–8
32. Chen JW, Romero P, Uversky VN et al (2006) Conservation of intrinsic disorder in protein domains and families: II. Functions of conserved disorder. J Proteome Res 5:888–898
33. Chowdhury S, Zhang J, Kurgan L (2018) In Silico prediction and validation of novel RNA binding proteins and residues in the human proteome. Proteomics 18:e1800064
34. Cumberworth A, Lamour G, Babu MM et al (2013) Promiscuity as a functional trait: intrinsically disordered regions as central players of interactomes. Biochem J 454:361–369
35. Fuxreiter M, Toth-Petroczy A, Kraut DA et al (2014) Disordered proteinaceous machines. Chem Rev 114:6806–6843
36. Haynes C, Oldfield CJ, Ji F et al (2006) Intrinsic disorder is a common feature of hub proteins from four eukaryotic interactomes. PLoS Comput Biol 2:890–901
37. Peng Z, Sakai Y, Kurgan L et al (2014) Intrinsic disorder in the BK channel and its interactome. PLoS One 9:e94331
38. Tompa P, Csermely P (2004) The role of structural disorder in the function of RNA and protein chaperones. FASEB J 18:1169–1175

39. Wu Z, Hu G, Yang J et al (2015) In various protein complexes, disordered protomers have large per-residue surface areas and area of protein-, DNA- and RNA-binding interfaces. FEBS Lett 589:2561–2569
40. Buljan M, Chalancon G, Dunker AK et al (2013) Alternative splicing of intrinsically disordered regions and rewiring of protein interactions. Curr Opin Struct Biol 23:443–450
41. Korneta I, Bujnicki JM (2012) Intrinsic disorder in the human spliceosomal proteome. PLoS Comput Biol 8:e1002641
42. Romero PR, Zaidi S, Fang YY et al (2006) Alternative splicing in concert with protein intrinsic disorder enables increased functional diversity in multicellular organisms. Proc Natl Acad Sci U S A 103:8390–8395
43. Zhou JH, Zhao SW, Dunker AK (2018) Intrinsically disordered proteins link alternative splicing and post-translational modifications to complex cell signaling and regulation. J Mol Biol 430:2342–2359
44. Kurotani A, Tokmakov AA, Kuroda Y et al (2014) Correlations between predicted protein disorder and post-translational modifications in plants. Bioinformatics 30:1095–1103
45. Xie H, Vucetic S, Iakoucheva LM et al (2007) Functional anthology of intrinsic disorder. 3. Ligands, post-translational modifications, and diseases associated with intrinsically disordered proteins. J Proteome Res 6:1917–1932
46. Cheng Y, Legall T, Oldfield CJ et al (2006) Rational drug design via intrinsically disordered protein. Trends Biotechnol 24:435–442
47. Hu G, Wu Z, Wang K et al (2016) Untapped potential of disordered proteins in current druggable human proteome. Curr Drug Targets 17:1198–1205
48. Midic U, Oldfield CJ, Dunker AK et al (2009) Unfoldomics of human genetic diseases: illustrative examples of ordered and intrinsically disordered members of the human diseasome. Protein Pept Lett 16:1533–1547
49. Uversky VN, Oldfield CJ, Dunker AK (2008) Intrinsically disordered proteins in human diseases: introducing the D2 concept. Annu Rev Biophys 37:215–246
50. Campen A, Williams RM, Brown CJ et al (2008) TOP-IDP-scale: a new amino acid scale measuring propensity for intrinsic disorder. Protein Pept Lett 15:956–963
51. Li X, Romero P, Rani M et al (1999) Predicting protein disorder for N-, C-, and internal regions. Genome Inform Ser Workshop Genome Inform 10:30–40
52. Romero P, Obradovic Z, Kissinger C et al (1997) Identifying disordered regions in proteins from amino acid sequence. Int Conf Neural Netw 91:90–95
53. Romero P, Obradovic Z, Li X et al (2001) Sequence complexity of disordered protein. Proteins 42:38–48
54. Atkins J, Boateng S, Sorensen T et al (2015) Disorder prediction methods, their applicability to different protein targets and their usefulness for guiding experimental studies. Int J Mol Sci 16:19040
55. Deng X, Eickholt J, Cheng J (2012) A comprehensive overview of computational protein disorder prediction methods. Mol BioSyst 8:114–121
56. Dosztányi Z, Mészáros B, Simon I (2010) Bioinformatical approaches to characterize intrinsically disordered/unstructured proteins. Brief Bioinform 11:225–243
57. Dosztányi Z, Tompa P (2008) Prediction of protein disorder. In: Kobe B, Guss M, Huber T (eds) Structural proteomics. Humana Press, Totowa, New Jersey, pp 103–115
58. Ferron F, Longhi S, Canard B et al (2006) A practical overview of protein disorder prediction methods. Proteins 65:1–14
59. He B, Wang K, Liu Y et al (2009) Predicting intrinsic disorder in proteins: an overview. Cell Res 19:929–949
60. Li J, Feng Y, Wang X et al (2015) An overview of predictors for intrinsically disordered proteins over 2010–2014. Int J Mol Sci 16:23446
61. Meng F, Uversky V, Kurgan L (2017) Computational prediction of intrinsic disorder in proteins. Curr Protoc Protein Sci 88:2.16.1–2.16.14
62. Meng F, Uversky VN, Kurgan L (2017) Comprehensive review of methods for prediction of intrinsic disorder and its molecular functions. Cell Mol Life Sci 74:3069–3090
63. Monastyrskyy B, Kryshtafovych A, Moult J et al (2014) Assessment of protein disorder region predictions in CASP10. Proteins 82(Suppl 2):127–137
64. Necci M, Piovesan D, Dosztanyi Z et al (2017) A comprehensive assessment of long intrinsic protein disorder from the DisProt database. Bioinformatics 34(3):445–452
65. Pentony M, Ward J, Jones D (2010) Computational resources for the prediction and analysis of native disorder in proteins. In: Hubbard SJ, Jones AR (eds) Proteome bioinformatics. Humana Press, Totowa, New Jersey, pp 369–393
66. Peng K, Radivojac P, Vucetic S et al (2006) Length-dependent prediction of protein intrinsic disorder. BMC Bioinformatics 7:208


67. Ishida T, Kinoshita K (2008) Prediction of disordered regions in proteins based on the meta approach. Bioinformatics 24:1344–1348
68. Deng X, Eickholt J, Cheng J (2009) PreDisorder: ab initio sequence-based prediction of protein disordered regions. BMC Bioinformatics 10:436
69. Xue B, Oldfield CJ, Dunker AK et al (2009) CDF it all: consensus prediction of intrinsically disordered proteins based on various cumulative distribution functions. FEBS Lett 583:1469–1474
70. Schlessinger A, Punta M, Yachdav G et al (2009) Improved disorder prediction by combination of orthogonal approaches. PLoS One 4:e4433
71. Xue B, Dunbrack RL, Williams RW et al (2010) PONDR-FIT: a meta-predictor of intrinsically disordered amino acids. Biochim Biophys Acta 1804:996–1010
72. Mizianty MJ, Stach W, Chen K et al (2010) Improved sequence-based prediction of disordered regions with multilayer fusion of multiple information sources. Bioinformatics 26:i489–i496
73. Walsh I, Martin AJM, Di Domenico T et al (2011) CSpritz: accurate prediction of protein disorder segments with annotation for homology, secondary structure and linear motifs. Nucleic Acids Res 39:W190–W196
74. Kozlowski LP, Bujnicki JM (2012) MetaDisorder: a meta-server for the prediction of intrinsic disorder in proteins. BMC Bioinformatics 13:1–11
75. Walsh I, Martin AJM, Di Domenico T et al (2012) ESpritz: accurate and fast prediction of protein disorder. Bioinformatics 28:503–509
76. Mizianty MJ, Peng ZL, Kurgan L (2013) MFDp2: Accurate predictor of disorder in proteins by fusion of disorder probabilities, content and profiles. Intrinsically Disord Proteins 1:e24428
77. Mizianty MJ, Uversky V, Kurgan L (2014) Prediction of intrinsic disorder in proteins using MFDp2. Methods Mol Biol 1137:147–162
78. Huang YJ, Acton TB, Montelione GT (2014) DisMeta: a meta server for construct design and optimization. Methods Mol Biol 1091:3–16
79. Fan X, Kurgan L (2014) Accurate prediction of disorder in protein chains with a comprehensive and empirically designed consensus. J Biomol Struct Dyn 32:448–464

80. Jones DT, Cozzetto D (2015) DISOPRED3: precise disordered region predictions with annotated protein-binding activity. Bioinformatics 31:857–863
81. Necci M, Piovesan D, Dosztanyi Z et al (2017) MobiDB-lite: fast and highly specific consensus prediction of intrinsic disorder in proteins. Bioinformatics 33:1402–1404
82. Peng Z, Kurgan L (2012) On the complementarity of the consensus-based disorder prediction. Pac Symp Biocomput:176–187
83. Zhang T, Faraggi E, Xue B et al (2012) SPINE-D: accurate prediction of short and long disordered regions by a single neural-network based method. J Biomol Struct Dyn 29:799–813
84. McGuffin LJ (2008) Intrinsic disorder prediction from the analysis of multiple protein fold recognition models. Bioinformatics 24:1798–1804
85. Ward JJ, McGuffin LJ, Bryson K et al (2004) The DISOPRED server for the prediction of protein disorder. Bioinformatics 20:2138–2139
86. Hartlepp KF, Fernandez-Tornero C, Eberharter A et al (2005) The histone fold subunits of Drosophila CHRAC facilitate nucleosome sliding through dynamic DNA interactions. Mol Cell Biol 25:9886–9896
87. Wang C, Kurgan L (2018) Review and comparative assessment of similarity-based methods for prediction of drug-protein interactions in the druggable human proteome. Brief Bioinform 20(6):2066–2087
88. Kurgan L, Razib AA, Aghakhani S et al (2009) CRYSTALP2: sequence-based protein crystallization propensity prediction. BMC Struct Biol 9:50
89. Kedarisetti P, Mizianty MJ, Kaas Q et al (2014) Prediction and characterization of cyclic proteins from sequences in three domains of life. Biochim Biophys Acta 1844:181–190
90. Meng F, Kurgan L (2016) DFLpred: High-throughput prediction of disordered flexible linker regions in protein sequences. Bioinformatics 32:i341–i350
91. Mizianty MJ, Zhang T, Xue B et al (2011) In-silico prediction of disorder content using hybrid sequence representation. BMC Bioinformatics 12:245
92. Peng Z, Kurgan L (2015) High-throughput prediction of RNA, DNA and protein binding regions mediated by intrinsic disorder. Nucleic Acids Res 43:e121
93. Peng Z, Wang C, Uversky VN et al (2017) Prediction of disordered RNA, DNA, and protein binding regions using DisoRDPbind. Methods Mol Biol 1484:187–203
94. Meng F, Kurgan L (2018) High-throughput prediction of disordered moonlighting regions in protein sequences. Proteins 86(10):1097–1110
95. Yan J, Kurgan L (2017) DRNApred, fast sequence-based method that accurately predicts and discriminates DNA- and RNA-binding residues. Nucleic Acids Res 45:e84
96. Meng F, Wang C, Kurgan L (2018) fDETECT webserver: fast predictor of propensity for protein production, purification, and crystallization. BMC Bioinformatics 18:580
97. Mizianty MJ, Fan X, Yan J et al (2014) Covering complete proteomes with X-ray structures: a current snapshot. Acta Crystallogr D Biol Crystallogr 70:2781–2793
98. Yan J, Dunker AK, Uversky VN et al (2016) Molecular recognition features (MoRFs) in three domains of life. Mol BioSyst 12:697–710
99. Amirkhani A, Kolahdoozi M, Wang C et al (2018) Prediction of DNA-binding residues in local segments of protein sequences with Fuzzy Cognitive Maps. IEEE/ACM Trans Comput Biol Bioinform. https://doi.org/10.1109/TCBB.2018.2890261
100. Zhang J, Ma Z, Kurgan L (2017) Comprehensive review and empirical analysis of hallmarks of DNA-, RNA- and protein-binding residues in protein chains. Brief Bioinform 20(4):1250–1268
101. Hu G, Gao J, Wang K et al (2012) Finding protein targets for small biologically relevant ligands across fold space using inverse ligand binding predictions. Structure 20:1815–1822
102. Disfani FM, Hsu WL, Mizianty MJ et al (2012) MoRFpred, a computational tool for sequence-based prediction and characterization of short disorder-to-order transitioning binding regions in proteins. Bioinformatics 28:i75–i83
103. Oldfield CJ, Uversky VN, Kurgan L (2018) Predicting functions of disordered proteins with MoRFpred. Methods Mol Biol 1851:337–352
104. Chen K, Mizianty MJ, Kurgan L (2012) Prediction and analysis of nucleotide-binding residues using sequence and sequence-derived structural descriptors. Bioinformatics 28:331–341
105. Mizianty MJ, Kurgan L (2011) Sequence-based prediction of protein crystallization, purification and production propensity. Bioinformatics 27:i24–i33
106. Hu G, Wu Z, Oldfield C et al (2018) Quality assessment for the putative intrinsic disorder in proteins. Bioinformatics 35(10):1692–1700
107. Wu Z, Hu G, Wang K et al (2017) Exploratory analysis of quality assessment of putative intrinsic disorder in proteins. In: 6th International Conference on Artificial Intelligence and Soft Computing. Zakopane, Poland, pp 722–732
108. Yan J, Marcus M, Kurgan L (2014) Comprehensively designed consensus of standalone secondary structure predictors improves Q3 by over 3%. J Biomol Struct Dyn 32:36–51

Chapter 3

Computational Prediction of Disordered Protein Motifs Using SLiMSuite

Richard J. Edwards, Kirsti Paulsen, Carla M. Aguilar Gomez, and Åsa Pérez-Bercoff

Abstract

Short linear motifs (SLiMs) are important mediators of interactions between intrinsically disordered regions of proteins and their interaction partners. Here, we detail instructions for the computational prediction of SLiMs in disordered protein regions, using the main tools of the SLiMSuite package: (1) SLiMProb identifies and calculates enrichment of predefined motifs in a set of proteins; (2) SLiMFinder predicts SLiMs de novo in a set of proteins, accounting for evolutionary relationships; (3) QSLiMFinder increases SLiMFinder sensitivity by focusing SLiM prediction on a specific query protein/region; (4) CompariMotif compares predicted SLiMs to known SLiMs or other SLiM predictions to identify common patterns. For each tool, command-line and online server examples are provided. Detailed notes provide additional advice on different applications of SLiMSuite, including batch running of multiple datasets and conservation masking using alignments of predicted orthologues.

Key words Short linear motif, Minimotif, Protein–protein interactions, Motif prediction, Domain-motif interactions, SLiM

1 Introduction

Protein sequences can be broadly divided into two major classes with respect to their structure: globular domains with relatively stable three-dimensional structures, and intrinsically disordered regions (IDRs) that are conformationally flexible and lack persistent structure [1, 2]. IDRs have been implicated in a wide range of fundamental biological processes, particularly in the context of cell signaling and other dynamic interactions [3, 4]. Whereas globular domains largely adhere to the “structure-function” paradigm, where the three-dimensional structure of the protein governs their activity and interactions, IDRs predominantly derive their function through conformationally flexible interactions with other proteins [5]. One important class of IDR-protein interactions is those mediated by short linear motifs (SLiMs), which are specific sequence patterns that bind interaction partners [5, 6]. SLiMs are thought to underpin millions of protein-protein interactions and post-translational modifications [7]. However, their small size (generally 2–15 amino acid residues long), sequence degeneracy (typically 2–5 specific residues), and often low binding affinity make SLiM identification difficult, both experimentally [8] and computationally [9, 10].

Here, we present detailed instructions for predicting intrinsically disordered protein motifs using the SLiMSuite package for SLiM discovery and analysis [10, 11]. Protocols cover the two main tasks in SLiM prediction: (1) identifying occurrences of known SLiMs and (2) de novo prediction of new SLiM patterns. For more detail on these tasks and an overview of other tools available, see [10, 12]. Identification of IDRs is clearly important for intrinsically disordered motif prediction, and we also present some methods for incorporating IDR predictions into SLiM discovery. For details of the disorder prediction tools used and more comprehensive coverage of other available methods, see [13].

SLiMSuite is freely available under a GNU General Public License at https://github.com/slimsuite/SLiMSuite. SLiMSuite servers are available at http://www.slimsuite.unsw.edu.au/servers.php.

Birthe B. Kragelund and Karen Skriver (eds.), Intrinsically Disordered Proteins: Methods and Protocols, Methods in Molecular Biology, vol. 2141, https://doi.org/10.1007/978-1-0716-0524-0_3, © Springer Science+Business Media, LLC, part of Springer Nature 2020

2 Materials

This protocol is entirely computational. In this section, we outline the main software tools and data formats used in the protocol. Commands for locally running programs are written for a UNIX or Mac terminal. They have not been tested on Windows machines. Precise details of the implementation may depend on the system you are using.

2.1 SLiMSuite Installation

SLiMSuite is available for download as a git repository (see Note 1) or stand-alone tarball from https://github.com/slimsuite/SLiMSuite. For basic functionality, no other setup should be necessary beyond downloading and unzipping the package in the desired directory on your local computer. Visit the Releases page of the GitHub repository and click to download the relevant slimsuite.YYYY-MM-DD.tgz “Asset.” The YYYY-MM-DD part of the filename indicates the release date. It is recommended to use the latest release. This file can be unpacked using standard archiving tools. On a UNIX system:

tar -xzf slimsuite.YYYY-MM-DD.tgz

SLiMSuite Motif Prediction


For the rest of this protocol, the main installation path will be replaced with $SLIMSUITE (see Note 2).

2.2 SLiMSuite Directory Structure

Once unzipped, the download will unpack a top-level slimsuite/ directory with the following subdirectories:

1. data/ contains example data for testing programs (see Subheading 2.7.5).

2. docs/ contains documentation, including manuals.

3. extras/ contains accessory programs that are not part of the main program suite.

4. legacy/ contains superseded programs that are no longer supported.

5. libraries/ contains all the Python libraries used by the main tools (and extras), some of which have stand-alone functionality.

6. settings/ contains INI files to set default options (see Note 3).

7. tools/ contains the main program suite (see Subheading 2.5).

It is recommended that analyses are performed outside these directories for ease of reinstallation (see Notes 4–6).

2.3 Python 2.x

For basic functionality, SLiMSuite only requires Python 2.x installed on your system. If you do not have Python, you can download it for free from http://www.python.org/download/. The modules are written in Python 2 and most have been tested with version 2.7.x. Do not try running SLiMSuite with Python 3.

2.4 SLiMSuite Servers

SLiMSuite tools for disordered SLiM prediction are also available as online servers at http://www.slimsuite.unsw.edu.au/servers.php. To run SLiMSuite tools online, no installation is necessary. SLiMSuite servers have been tested in Chrome and Safari but should work in any browser. Example REST server URLs are given in the Methods (Subheading 3) (see Notes 7–14 for additional details using the servers).

2.5 SLiMSuite Tools

The SLiMSuite tools used in this chapter are summarized in Table 1. A brief overview of each is given below. Additional documentation is available from the main server page (see Subheading 2.4 and Note 8). General reviews of computational SLiM discovery can be found in [10, 12].


Table 1 The main tools used in this chapter

Tool | Subheadings | Description | REST (a)
CompariMotif | 2.5.4 | A unique motif-motif comparison tool for identifying similar SLiMs. Used for clustering results of predictions and identifying known motifs. | Y
GABLAM | 2.5.5 | BLAST-based protein similarity scoring and clustering. Used for (Q)SLiMFinder and SLiMProb adjustments for evolutionary relationships. | N
GOPHER | 2.5.6 | Automated orthologue prediction and alignment algorithm. Used for (Q)SLiMFinder/SLiMProb conservation-based masking and SLiMPrints prediction. | Y
QSLiMFinder | 2.5.3 | Query-based variant of SLiMFinder with increased sensitivity and specificity, ideal for SLiM discovery from host-pathogen interactions or where at least one interaction is established experimentally. | Y
SLiMFarmer | 2.5.7 | SLiMSuite wrapper for multi-threaded batch processing. | N
SLiMFinder | 2.5.2 | The first de novo SLiM prediction based on a statistical model of overrepresented motifs in unrelated proteins. Repeatedly achieves the greatest specificity in benchmarking. | Y
SLiMMaker | See Note 22 | A simple tool for converting aligned peptides or SLiM occurrences into a regular expression motif. | Y
SLiMParser | 2.5.8 | SLiMSuite wrapper for generating and retrieving SLiMSuite REST server jobs. | N (b)
SLiMProb | 2.5.1 | Unique tool providing biological context (disorder and conservation) for searches of predefined SLiMs along with under- and overrepresentation statistics, correcting for evolutionary relationships. Formerly called SLiMSearch 1.x but renamed to avoid confusion with SLiMSearch2. | Y

(a) Tool can be run online at http://rest.slimsuite.unsw.edu.au
(b) Can be used to run tools online at http://rest.slimsuite.unsw.edu.au

2.5.1 SLiMProb: Prediction of Predefined SLiMs

SLiMProb (formerly known as SLiMSearch 1.0) [14] searches a set of protein sequences for occurrences of predefined SLiM sequence patterns. SLiM patterns may come from the literature, user definitions, or a SLiM database like ELM (see Subheading 2.7.4 and Note 14). SLiMProb calculates their probability of occurrence given the amino acid composition of the sequences. When the datasets are small enough, information about the statistical over- or underrepresentation can be returned, using a modification of the SLiMChance algorithm of SLiMFinder that corrects for the evolutionary relationships (see Note 20). Masking options also permit the incorporation of biological context, including protein disorder, into SLiM prediction.


2.5.2 SLiMFinder: De Novo SLiM Prediction

SLiMFinder [15] is a de novo SLiM prediction program that searches for new SLiMs in a set of proteins. “Ambiguous” positions (where 2+ different amino acids are permitted) and variable length wildcards (any amino acid) can be incorporated into the search. SLiMFinder uses two algorithms. SLiMBuild identifies possible motif patterns in the input data, based on sequence masking (see Note 23) and evolutionary relationships between proteins (see Note 20). Once convergently evolved motifs with fixed positions are identified, these are combined to create motifs incorporating ambiguity (see Note 21). SLiMChance calculates the probability of motifs being enriched over expectation by chance. It corrects for the size, composition, and evolutionary relationships of the dataset (see Note 20). The significance value (Sig) estimates the likelihood of any motif being as overrepresented or more, based on enrichment probability. Sig should be interpreted as an empirical P value and acceptable values set according to the user’s desired balance between statistical power (low false-negative rate) and stringency (low false-positive rate).

2.5.3 QSLiMFinder: Query-Focused De Novo SLiM Prediction

QSLiMFinder [16] is a modification of the basic SLiMFinder tool to specifically look for SLiMs shared by a query sequence and one or more additional sequences. To do this, SLiMBuild first identifies all motifs that are present in the query sequence before removing it (and any related proteins) from the dataset. The rest of the search takes place using the remainder of the dataset but only using motifs found in the query. The final correction for multiple testing is made using a motif space defined by the original query sequence, rather than the full potential motif space used by the original SLiMFinder. This is offset against the reduction of support that results from removing the query sequence. Due to similarities in input and output between the two programs, SLiMFinder and QSLiMFinder will be collectively referred to as “(Q)SLiMFinder” in this chapter.

2.5.4 CompariMotif: Motif-Motif Comparisons

CompariMotif [17] takes two lists of SLiMs as regular expressions (Table 2) and compares them to each other, identifying which motifs have some degree of overlap and identifying the relationships between those motifs. Its primary use in de novo SLiM discovery is to compare predicted motifs with a list of previously published motifs (e.g., the eukaryotic linear motif (ELM) resource [18]) in order to identify candidates for known motifs (see Subheading 3.2.1). It can also be used to identify common motifs between different SLiM discovery runs.


Table 2 SLiM regex elements

Regex | SLiMSuite | Description
A | A | A single fixed amino acid, A, using standard IUPAC letters.
[ILV] | [ILV] | Either I, L, or V. Can have any number of possible amino acids.
[^P] or [^DE] | [^P] or [^DE] | Exclude one or more amino acids.
. | X or . | Wildcard. Any amino acid.
.{n} | .{n} or X{n} | A repeat of n wildcard positions.
.{m,n} | .{m,n} or X{m,n} | A repeat of at least m and at most n wildcard positions (m can be zero).
^ | ^ | N-terminus of protein.
$ | $ | C-terminus of protein.
(p1|p2) | (p1|p2) | Either regex pattern p1 or p2.
r{n} | r{n} | n repetitions of r, where r is one of the above regex elements.
r{m,n} | r{m,n} | At least m and up to n repetitions of r, where r is one of the above regex elements.
 |  | At least m of a stretch of n residues must match r, where r is one of the above regex elements (single amino acid, ambiguity or exclusion list).
 |  | Exactly m of a stretch of n residues must match r, and the rest must match b, where r and b are each one of the above regex elements.
 | (ABC) | A, B, and C in any order.
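As a rough illustration of how the simpler elements in Table 2 map onto a general-purpose regular expression engine, the following Python 3 sketch converts the SLiMSuite X wildcard to "." and reports motif occurrences with 1-based positions. This is not SLiMSuite code: the function names and the toy sequence are made up for illustration, and SLiMSuite-specific elements beyond the X wildcard are not handled.

```python
import re


def slim_to_regex(pattern):
    """Translate the SLiMSuite X wildcard into the standard regex wildcard.

    Assumption: only the X wildcard needs converting; other
    SLiMSuite-specific elements are outside the scope of this sketch.
    """
    return pattern.replace("X", ".")


def find_occurrences(pattern, sequence):
    """Return (start, end, match) tuples using 1-based, inclusive positions."""
    # A zero-width lookahead allows overlapping occurrences to be reported.
    regex = re.compile("(?=(%s))" % slim_to_regex(pattern))
    return [
        (m.start() + 1, m.start() + len(m.group(1)), m.group(1))
        for m in regex.finditer(sequence)
    ]


# Toy sequence invented for this example (not a real protein).
hits = find_occurrences("QX{2}[IL]X{2}FF", "MSQKTLDSFFKAA")
print(hits)
```

The lookahead is a design choice for this sketch so that overlapping matches are not silently skipped; whether a given tool reports overlapping occurrences should be checked in its manual.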

2.5.5 GABLAM: Global Alignment from BLAST Local Alignment Matrix

GABLAM [19] is used for establishing evolutionary relationships between the input proteins (see Note 20). GABLAM performs an all-by-all BLAST+ [20] blastp protein search and converts the local alignments into a series of global (i.e., full-length) pairwise comparisons.

2.5.6 GOPHER: Generation of Orthologous Proteins from Homology-Based Estimation of Relationships

GOPHER [21] is used to generate alignments of predicted orthologues for conservation masking (see Note 24). Unlike most orthologue prediction tools, GOPHER is explicitly query-focused and does not enforce orthology relationships between the predicted orthologues themselves. Furthermore, GOPHER will only keep the most closely related predicted orthologue from each species, to prevent species with gene expansions from dominating the dataset. GOPHER uses estimated phylogenetic relationships based on GABLAM global similarities to differentiate paralogues and orthologues.
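The "one sequence per species" idea can be sketched in a few lines of Python 3. This is a simplified illustration only and is not the GOPHER algorithm, which also uses estimated phylogenetic relationships to separate paralogues from orthologues. The sketch assumes sequence names in the Gene_SPCODE__AccNum form described in Subheading 2.7.2 and a single similarity value per hit; the function name and all values are hypothetical.

```python
def closest_per_species(hits):
    """Keep the highest-similarity hit for each species code.

    hits: iterable of (sequence_name, similarity) pairs, with names
    assumed to follow the Gene_SPCODE__AccNum convention.
    """
    best = {}
    for name, similarity in hits:
        # The species code is the last "_"-separated field before "__".
        spcode = name.split("__")[0].rsplit("_", 1)[-1]
        if spcode not in best or similarity > best[spcode][1]:
            best[spcode] = (name, similarity)
    return best


# Hypothetical hits and similarity values, for illustration only.
example_hits = [
    ("A_HUMAN__X1", 80.0),
    ("B_HUMAN__X2", 92.5),
    ("C_MOUSE__X3", 75.0),
]
kept = closest_per_species(example_hits)
print(kept)
```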


2.5.7 SLiMFarmer: SLiMSuite Job Farming Wrapper

SLiMFarmer can be used to “farm” out multiple SLiMProb or (Q)SLiMFinder runs across multiple threads (see Note 17).

2.5.8 SLiMParser: SLiMSuite REST Job Generation and Parsing Tool

SLiMParser is provided in the SLiMSuite download to run and/or parse results from SLiMSuite REST calls (see Note 11).

2.6 SLiMSuite Dependencies

Very simple motif prediction does not require the use of any additional programs. However, the recommended protocols for disordered motif prediction will require the use of one or more third-party software tools. These will need to be installed on the system and SLiMSuite given the path to the file or bin directory (see below) needed to access the relevant programs via a simple command-line call (see Note 4).

2.6.1 BLAST+ Homology Search

BLAST+ [20] is used by SLiMSuite for multiple tasks, including establishing and correcting for evolutionary relationships (see Note 20) and generating sets of predicted orthologues (see Note 24). It is strongly recommended that BLAST+ is installed on your system if running SLiMSuite locally. BLAST+ is in fact a suite of programs, and the path containing these executables should be provided using blast+path=PATH/, if the BLAST+ bin has not been added to the environment path (see Note 4), e.g.:

blast+path=/usr/local/ncbi/blast/bin/

2.6.2 IUPred Disorder Predictor

IUPred2 [22] is the recommended disorder predictor to be used with SLiMSuite (see Subheading 3.1.1). Alternative methods for identifying disordered regions are also available (see Subheading 3.1 and Notes 25 and 26).

2.6.3 ClustalW2 Multiple Sequence Alignment

ClustalW2 [23] is used as the default phylogenetic inference and backup multiple sequence alignment tool. ClustalW2 is not generally used for motif discovery workflows.

2.6.4 ClustalOmega Multiple Sequence Alignment

ClustalOmega [24] is used as the default multiple sequence alignment tool. In motif discovery, it is primarily used in the context of making predicted orthologue alignments with GOPHER (see Note 24).

2.6.5 R

R [25] is used by some SLiMSuite tools for generating visualizations and/or markdown/HTML outputs. R is not currently required by any of the tools used in this protocol, but this requirement might change with future updates.


2.7 Input Data for Motif Prediction

This section covers the core input data types (see Subheading 2.7.1) and formats (see Subheading 2.7.2) required for motif analysis with SLiMSuite.

2.7.1 Data Types

Multi-protein Datasets

Standard input for SLiMSuite is a set of multiple protein sequences (see Subheading 2.7.2 for accepted formats) that share a common feature, generally a common interaction partner. (Q)SLiMFinder de novo SLiM discovery (see Subheading 3.2.2) requires multiple unrelated proteins for input as it explicitly invokes a model of independent (convergent) evolution to determine overrepresentation (see Note 20). The major benefit of SLiMSuite over other methods is its ability to correct for evolutionary relationships in the input data for these calculations [10, 15, 19, 26]. If too few unrelated proteins are found, (Q)SLiMFinder will exit. SLiMProb identification of known motifs (see Subheading 3.2.1) has no such requirement.

Single Proteins

SLiMSuite is optimized for predicting SLiMs in multi-protein datasets. Nevertheless, SLiMProb motif predictions can be run on single proteins. (Q)SLiMFinder de novo SLiM discovery needs multiple proteins for input. For de novo disorder SLiM discovery in single proteins, please see SLiMPrints [27].

Multiple Multi-protein Datasets

SLiMSuite tools can be run in a batch mode where multiple datasets each consisting of multiple sequences are run one after the other (see Note 15 for details).

Proteomes

SLiMSuite corrections for evolutionary relationships tend to break down with very large datasets (e.g., several hundred proteins), especially with numerous multi-domain proteins. It is therefore recommended that very large datasets are run with efilter=F (see Note 20). For whole proteome searches of known SLiMs, please see SLiMSearch [28].

Motif Data

SLiMProb prediction of known motifs requires a set of protein motifs defined as SLiMSuite regular expressions (see Subheading 2.7.4). These can be user-defined motifs, published SLiMs downloaded from the ELM database [18], or the output of de novo SLiM prediction from (Q)SLiMFinder.

2.7.2 SLiMSuite FASTA Format

One of the most common input and output formats for SLiMSuite is FASTA format, which is a very simple, human-readable sequence format. Despite the simplicity of FASTA, there are many subformat variants in which the sequence name is formatted with specific information. Many of these will work and be recognized by SLiMSuite programs, but it also has its own favored subformat, which is preferentially used for input/output.


SLiMSuite FASTA format is:

>Gene_SPCODE__AccNum [Description]
SEQUENCE

where:

• Gene is not used for anything and is purely for easy visual identification.

• SPCODE is the species code. Where possible, UniProt species mnemonics should be used, but any short code can be used as long as (a) it contains uppercase letters and numbers only (no symbols) and (b) it is consistently used within a species/database (i.e., you can make it up as long as all sequences from the same species use the same code).

• AccNum is the accession number, which is used as the unique sequence identifier.

• Description is optional and can contain any other text, separated from the sequence name by whitespace. (NB. The square brackets above are not required; the notation is purely to indicate that this is an optional element.)

SEQUENCE can be on one or more lines and contain spaces. However, it is best to have a single SEQUENCE line with no whitespace (some programs may enforce this). UniProt FASTA downloads should be automatically recognized and converted where needed. SLiMSuite contains tools for reformatting a variety of input formats into SLiMSuite FASTA format, including UniProt, GenBank, and a variety of FASTA inputs.
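To make the naming convention concrete, the following Python 3 sketch reads FASTA records and splits a Gene_SPCODE__AccNum name into its parts. It is an independent illustration rather than the SLiMSuite parser (which also recognizes other subformats); the function names and the example records are hypothetical.

```python
def read_fasta(lines):
    """Yield (header, sequence) pairs, joining lines and removing whitespace."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line, []
        elif line:
            # Sequences may be split over lines and may contain spaces.
            chunks.append("".join(line.split()))
    if header is not None:
        yield header, "".join(chunks)


def parse_slimsuite_name(header):
    """Split '>Gene_SPCODE__AccNum Description' into a dictionary of fields."""
    parts = header.lstrip(">").split(None, 1)
    name = parts[0]
    description = parts[1].strip() if len(parts) > 1 else ""
    gene_species, sep, accnum = name.partition("__")
    if not sep:
        raise ValueError("Not in Gene_SPCODE__AccNum form: %r" % name)
    # SPCODE contains no symbols, so it is the final "_"-separated field.
    gene, _, spcode = gene_species.rpartition("_")
    return {"gene": gene, "spcode": spcode, "accnum": accnum,
            "description": description}


# Hypothetical records, invented for illustration.
records = list(read_fasta([
    ">GENE1_YEAST__X00001 Example protein",
    "MKT LLV",
    "AAG",
    ">GENE2_HUMAN__X00002",
    "MSS",
]))
fields = parse_slimsuite_name(records[0][0])
print(fields)
```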

2.7.3 UniProt Text Files

SLiMSuite is also able to recognize and parse UniProt text files (sometimes referred to as “DAT files”). Details of the UniProt plain text format can be found in the UniProt Knowledgebase user manual: https://web.expasy.org/docs/userman.html#entrystruc. If running on a computer with an Internet connection, SLiMSuite can also download UniProt entries directly from a list of UniProt accession numbers (see the UniProt help pages for more details: https://www.uniprot.org/help/accession_numbers). For most applications, SLiMSuite programs will extract only the species and sequence from UniProt entries. UniProt annotation can also be used for feature-based disorder masking (see Subheading 3.1.2).

2.7.4 Motif Formats

The term “motif” has a variety of meanings in different fields. Here, we will be using it as a synonym for “short linear motif” (SLiM). SLiMProb searches of known motifs (see Subheading 3.2.1) and


CompariMotif exploration of predicted motifs (see Subheading 3.2.5) require the input of a set of SLiM motif patterns. In SLiMSuite, a motif consists of three basic elements:

1. Name. The motif “Name” is a single word (i.e., no whitespace) that provides a unique human-readable identifier for the motif. If not provided, the motif pattern will also be used for the name.

2. Pattern. SLiMSuite tools are based around regular expression pattern descriptors (see Table 2 and [10] for more details).

3. Description. The description is free text providing additional details. If not provided, the motif name will also be used for its description.

For maximal information, SLiMSuite has its own motif format, known as “SLiMSuite” or “PRESTO” format, which has one line per motif in the form:

Name Pattern # Description

Names and patterns cannot have whitespace. For example:

CLV_C14_Caspase3-7 [DSTE][^P][^DEWHFYC]D[GSAN] # Caspase-3 and Caspase-7 cleavage site. [39 ELM instances]
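A motif line in this format can be split into its three elements with a few lines of Python 3. The sketch below is an illustration of the format as described here, not the SLiMSuite parser; the function name is hypothetical. It applies the stated fallbacks (pattern used as name, name used as description) and skips blank lines and lines beginning with #.

```python
def parse_motif_line(line):
    """Parse 'Name Pattern # Description'; return None for blank/comment lines."""
    line = line.strip()
    if not line or line.startswith("#"):
        return None
    definition, _, description = line.partition("#")
    fields = definition.split()
    if len(fields) == 1:
        # A plain pattern: the pattern doubles as the motif name.
        name = pattern = fields[0]
    else:
        name, pattern = fields[0], fields[1]
    # If no description is given, the name is used instead.
    description = description.strip() or name
    return {"name": name, "pattern": pattern, "description": description}


motif = parse_motif_line(
    "CLV_C14_Caspase3-7 [DSTE][^P][^DEWHFYC]D[GSAN] "
    "# Caspase-3 and Caspase-7 cleavage site. [39 ELM instances]"
)
print(motif)
```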

Lines beginning with # are ignored. SLiMSuite will also recognize a plain list of regular expression patterns (Table 2) and ELM class definitions downloaded directly from ELM (see Note 14). Summary results tables from (Q)SLiMFinder can also be used as motif input (see Subheading 3.2.5).

2.7.5 Example Data Table

Example data for running SLiMSuite tools is provided in the SLiMSuite data/ directory and freely available to download from Open Science Framework [29] (Table 3).

3 Methods

Most of the protocols in this section are provided in two forms: (1) as a command-line program call and (2) as a call to the SLiMSuite REST server. The REST servers provide an easy entry point for the tools and will be suitable for general small-scale analyses. Some of these have simple web forms available for running the tool (see Subheading 2.4 and Note 10). Users interested in large-scale or bespoke analyses are encouraged to download and run SLiMSuite locally.


Table 3 Example data used in SLiMSuite protocols

File | Type | Description | SLiMSuite (a)
POL30.Scer.apid-hq.acc | Text | UniProt accession numbers for “high-quality” interactors with Saccharomyces cerevisiae protein POL30 (human PCNA orthologue) from the APID [34] interaction database. | Y
POL30.Scer.apid-hq.dat | UniProt | UniProt download of “high-quality” interactors with Saccharomyces cerevisiae protein POL30 (human PCNA orthologue) from the APID [34] interaction database. | Y
POL30.Scer.apid-hq.fas | FASTA | Protein sequence FASTA file for “high-quality” interactors with Saccharomyces cerevisiae protein POL30 (human PCNA orthologue) from the APID [34] interaction database. | Y
uniprot.yeast.147537.2019-05-14.fas | FASTA | Yeast proteomes for GOPHER orthologue prediction. A gzipped version of this file is available at OSF [29] along with details of its construction. | N
elm2019.motifs | Motif | ELM [18] motif classes (downloaded 2019-05-02) in SLiMSuite motif format. | Y
LIG_PCNA_PIPBox_1.motif | Motif | A single ELM motif for simplified analysis. | Y

(a) Whether provided in the SLiMSuite data/ directory. All data is available at OSF [29]

3.1 Disorder Masking

SLiMSuite is not exclusively designed for the prediction of motifs in intrinsically disordered regions; prediction tools can be run without any disorder masking. However, SLiMs are generally found in disordered regions [6], and incorporating a disorder-masking strategy has been shown to improve results [14–16, 30]. (Q)SLiMFinder and SLiMProb therefore share a number of core disorder-masking strategies. The main purpose of disorder masking is to exclude regions of the input proteins that are not likely to contain SLiMs. This reduces the search space for SLiM discovery. This in turn has two benefits: (1) programs will run faster; (2) statistical power to detect overrepresentation will be increased. However, it should be noted that de novo SLiM prediction appears to be more sensitive to the loss of signal (i.e., accidentally masking out true SLiM occurrences) than the addition of noise (i.e., failing to mask out regions without SLiMs). It is therefore recommended that disorder masking is used conservatively. This section gives an overview of the main disorder-masking methods implemented for SLiMSuite. These fall broadly into masking from disorder predictions (see Subheading 3.1.1), masking UniProt features (see Subheading 3.1.2), and custom user masking of input FASTA files (see Subheading 3.1.3). For additional masking options, please see Note 23.


3.1.1 Masking from Disorder Prediction Scores

Disorder masking is activated using the dismask=T command, with disorder=X setting the prediction strategy used (see Note 3 for more information about setting command-line options). Please note that to perform disorder masking on the command line, you will need to have the appropriate software tools installed and/or disorder scores pre-computed (see Note 25). For details of disorder prediction tools, please refer to the appropriate literature (e.g., [13]). SLiMSuite currently has the following disorder prediction methods implemented:

1. disorder=iupred2. This will run the IUPred2A API [22] in either short (iumethod=short or disorder=iushort2, default), long (iumethod=long or disorder=iulong2), or redox (iumethod=redox or disorder=iuredox2) mode. Alternatively, iumethod=anchor (or disorder=anchor2) will run ANCHOR2 instead. IUPred2/ANCHOR2 returns a value for each residue, where a residue is determined to be disordered if the value is >0.5. For SLiM discovery, a more relaxed threshold of 0.2 or 0.3 is normally used to limit false negatives. This is probably particularly important for de novo SLiM prediction, where loss of signal is problematic. This threshold can be modified using iucut=X and is set to iucut=0.2 by default. Any use of IUPred2 or ANCHOR2 for disorder masking should cite [22]. You must have a live web connection to use this method. IUPred2 retrieves results based on accession numbers rather than sequences and will therefore only work for UniProt proteins.

2. disorder=iupred. This will run a local installation of IUPred [31] (see Subheading 2.6.2) in either long (iumethod=long) or short (iumethod=short, default) mode. This has to be installed locally (iupath=PATH to set executable command) and is available on request from the IUPred website (see https://iupred2a.elte.hu/ for more details). Results are processed as for IUPred2.

3. disorder=anchor. ANCHOR [32] scores can be used in place of IUPred. This also has to be installed locally (anchor=PATH to set executable command). It is available on request from the IUPred website (see above) and any use of results should cite the method. ANCHOR masking will also use the iucut=X setting.

4. disorder=foldindex. This will run FoldIndex [33] directly from the website (https://fold.weizmann.ac.il/) and more simply returns a list of disordered regions without any stringency settings. You must have a live web connection to use this method.

Alternatively, pre-computed disorder prediction scores from other methods can be provided (see Note 25).
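The effect of the score threshold can be illustrated with a minimal Python 3 sketch that masks every residue whose per-residue disorder score does not exceed the cutoff. This is not SLiMSuite code: the function name, the use of X as the masking character, the strict ">" comparison, and the toy scores are all assumptions made for illustration.

```python
def mask_ordered_residues(sequence, scores, iucut=0.2, mask_char="X"):
    """Mask residues whose disorder score is not above iucut.

    Assumptions: one score per residue, a strict '>' comparison, and
    X as the masking character.
    """
    if len(sequence) != len(scores):
        raise ValueError("Need exactly one score per residue")
    return "".join(
        aa if score > iucut else mask_char
        for aa, score in zip(sequence, scores)
    )


# Toy sequence and invented scores, for illustration only.
seq = "MKTAYIAK"
toy_scores = [0.05, 0.1, 0.25, 0.6, 0.8, 0.15, 0.3, 0.2]
relaxed = mask_ordered_residues(seq, toy_scores, iucut=0.2)
strict = mask_ordered_residues(seq, toy_scores, iucut=0.5)
print(relaxed, strict)
```

Comparing the two masked strings shows why a relaxed cutoff is preferred for de novo prediction: fewer residues are masked, so fewer true motif occurrences are at risk of being lost.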


3.1.2 Masking UniProt Features

As an alternative (or in addition) to masking predicted disorder, SLiMSuite can mask input based on UniProt features. All features of a given type will be masked, so some editing of UniProt input files (see Subheading 2.7.3) might be required if only a subset are to be masked. UniProt feature masking is activated with the ftmask=LIST command, where LIST specifies the feature types to mask. Setting ftmask=T is equivalent to setting ftmask=EM,DOMAIN,TRANSMEM (see the SLiMFinder manual for more details).

3.1.3 Custom Masking of FASTA Files

There are two main strategies for custom disorder masking of input by embedding masking information directly in the FASTA file without disrupting the evolutionary relationships identified:

1. Regions to be masked can be added to the protein description in the FASTA file in the form #X-Y for a disordered region and &X-Y for an ordered region. This will be recognised by dismask=T disorder=parse.

2. Mask out either lowercase (casemask=Lower) or uppercase (casemask=Upper) sequences.
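Both strategies can be mimicked in a short Python 3 sketch, shown here only to make the conventions concrete. It is not SLiMSuite code: the function names, the exact placement of the #X-Y and &X-Y tags within the description, and the use of X as the masking character are assumptions, and the example strings are invented.

```python
import re


def parse_regions(description):
    """Extract (#X-Y) disordered and (&X-Y) ordered regions from a description."""
    disordered = [(int(a), int(b))
                  for a, b in re.findall(r"#(\d+)-(\d+)", description)]
    ordered = [(int(a), int(b))
               for a, b in re.findall(r"&(\d+)-(\d+)", description)]
    return disordered, ordered


def case_mask(sequence, casemask="Lower", mask_char="X"):
    """Mask lowercase or uppercase residues; unmasked residues are uppercased."""
    if casemask not in ("Lower", "Upper"):
        raise ValueError("casemask must be 'Lower' or 'Upper'")
    masked = []
    for aa in sequence:
        hide = aa.islower() if casemask == "Lower" else aa.isupper()
        masked.append(mask_char if hide else aa.upper())
    return "".join(masked)


# Invented description and sequence, for illustration only.
regions = parse_regions("Example protein #1-25 &26-120 #121-150")
lower_masked = case_mask("mktaYIAKQRqlt", casemask="Lower")
upper_masked = case_mask("mktaYIAKQRqlt", casemask="Upper")
print(regions, lower_masked, upper_masked)
```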

3.2 Disordered Motif Prediction

This section covers the basic tasks in disordered motif prediction, with associated Notes describing some more advanced variants. Additional options are covered in the manuals for the relevant tools, available through the SLiMSuite GitHub site. Users are also encouraged to contact the author if they have specific SLiM prediction tasks that are not covered by these protocols. Each tool generates a large number of output files. The sections below will focus on where to find the main results of interest. Please refer to the manuals for more details of available outputs (see Note 5 for renaming and controlling the output files). For simplicity, the protocols in this section will assume default disorder masking (disorder=iushort2 iucut=0.2) unless otherwise indicated (see Subheading 3.1).

3.2.1 Prediction of New Disordered Instances of Known SLiMs (SLiMProb)

SLiMProb (see Subheading 2.5.1) takes a set of protein motifs (see Subheading 2.7.4) and will search for occurrences of them in a set of proteins. SLiMProb calls take the general form:

python $SLIMSUITE/tools/slimprob.py seqin= motif= dismask=T

where the value given to seqin= is a sequence file in FASTA (see Subheading 2.7.2) or UniProt (see Subheading 2.7.3) format and the value given to motif= is a motif file (see Subheading 2.7.4) or regular expression pattern. For example, to search a set of proteins with a single motif:

python $SLIMSUITE/tools/slimprob.py seqin=POL30.Scer.apid-hq.fas motif=LIG_PCNA_PIPBox_1.motif dismask=T

or regular expression:

python $SLIMSUITE/tools/slimprob.py seqin=POL30.Scer.apid-hq.fas motif=Q..[IL]..FF dismask=T

or a set of motifs (see manual for input filtering):

python $SLIMSUITE/tools/slimprob.py seqin=POL30.Scer.apid-hq.fas motif=elm2019.motifs dismask=T

The two main result files of interest produced are the summary results table (slimprob.csv) and the SLiM occurrence table (slimprob.occ.csv) (see Note 5). These are comma-separated files, which can be opened in other programs (e.g., MS Excel or R) for further analysis. The summary table contains one line per motif for each dataset (see Note 15) (Tables 4 and 5). This provides the total number of occurrences in the dataset, along with over- and underrepresentation statistics (see Note 20). The occurrence table contains one line per motif occurrence and includes the information on which proteins have which motif and where (Table 6). See Note 27 for additional occurrence statistics that can be calculated and see Note 28 for occurrence visualization.

If you have an active Internet connection, SLiMProb can also be run on a set of UniProt accession numbers, either given as a comma-separated list:

python $SLIMSUITE/tools/slimprob.py uniprotid=P04051,P04819,P25336,P25847,P26793 motif=LIG_PCNA_PIPBox_1.motif dismask=T

or in a file:

python $SLIMSUITE/tools/slimprob.py uniprotid=POL30.Scer.apid-hq.acc motif=LIG_PCNA_PIPBox_1.motif dismask=T

Table 4 Common summary results fields for SLiMProb and (Q)SLiMFinder

Field | Description
Dataset | Dataset name. Generally the input filename without its file extension. This will be the first part of any dataset-specific filenames in the output directory.
RunID | Run identifier set by runid=X. The date and time is used if no ID is given. This allows the results of several runs to be compiled in a single results file and easily distinguished.
Masking | Summary of masking options for the run. See manual for details.
Build | SLiMBuild settings. See manual for details.
RunTime | The time taken for the dataset to run (HH:MM:SS).
SeqNum | Number of sequences in dataset.
UPNum | Number of unrelated protein clusters (UPC) in dataset.
AANum | Total number of unmasked AA in dataset.

Table 5 Main SLiMProb motif summary results fields

Field | Description
Motif | Name of the SLiM being searched.
Pattern | Pattern of SLiM being searched.
IC | Normalized information content (see manual).
N_Occ | Total number of occurrences across all sequences.
E_Occ | Expected number of total occurrences.
p_Occ | The probability of N_Occ+ observations given E_Occ.
pUnd_Occ | The probability of N_Occ or fewer observations given E_Occ.
N_Seq | Total number of sequences with 1+ occurrences.
E_Seq | Expected number of sequences with 1+ occurrences.
p_Seq | The probability of N_Seq+ observations given E_Seq.
pUnd_Seq | The probability of N_Seq or fewer observations given E_Seq.
N_UPC | Total number of unrelated sequences with 1+ occurrences.
E_UPC | Expected number of unrelated sequences with 1+ occurrences.
p_UPC | The probability of N_UPC+ observations given E_UPC.
pUnd_UPC | The probability of N_UPC or fewer observations given E_UPC.
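Because the SLiMProb result files are comma-separated, they can also be read with a general-purpose scripting language. The Python 3 sketch below loads a summary table and lists motifs whose p_Occ falls at or below a chosen cutoff. It assumes the Motif, N_Occ, and p_Occ field names of the summary table; the function names, the 0.05 cutoff, and all example values are invented for illustration and do not come from a real run.

```python
import csv
import io


def load_results(handle):
    """Read a results table into a list of dictionaries keyed by field name."""
    return list(csv.DictReader(handle))


def overrepresented(rows, p_cut=0.05):
    """Return (Motif, N_Occ, p_Occ) for rows with p_Occ at or below p_cut."""
    return [
        (row["Motif"], int(row["N_Occ"]), float(row["p_Occ"]))
        for row in rows
        if float(row["p_Occ"]) <= p_cut
    ]


# Made-up values for illustration only. A real analysis would instead use
# open("slimprob.csv", newline="") in place of the StringIO object.
example = io.StringIO(
    "Dataset,Motif,Pattern,N_Occ,E_Occ,p_Occ\n"
    "demo,MotifA,Q..[IL]..FF,4,0.5,0.002\n"
    "demo,MotifB,[ST]P,12,11.0,0.42\n"
)
rows = load_results(example)
enriched = overrepresented(rows)
print(enriched)
```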

Table 6 Main occurrence table results fields for SLiMProb and (Q)SLiMFinder

Field | Description
Dataset | See Table 4.
RunID | See Table 4.
Masking | See Table 4.
Motif | Motif name (SLiMProb only).
Rank | Motif rank ((Q)SLiMFinder only).
Sig | The corrected p-value of the motif ((Q)SLiMFinder only).
Seq | Sequence name.
AccNum | Sequence accession number (SLiMProb only).
Start_Pos | Start position of motif occurrence.
End_Pos | End position of motif occurrence.
Prot_Len | Total length of protein sequence.
Pattern | Pattern of returned/searched SLiM.
Match | Protein sequence region matching pattern.
Variant | Motif variant matching sequence.
Support | Total number of sequences containing motif.
MisMatch | Number of mismatches (SLiMProb only).
Desc | Protein description.
UPC | The UPC identifier number (SLiMProb only).
PepSeq (a) | Peptide sequence including flanking residues (see manual).
PepDesign (a) | Possible synthesis issues with peptide sequence (see manual).

(a) May not be output, depending on settings

For large input datasets and where enrichment/depletion statistics are otherwise not required, users might want to switch off the evolutionary filter using efilter=F (see Note 20). SLiMProb can also be run on a set of UniProt accession numbers through the online REST server. The REST server takes the same commands as the local Python code, with whitespace removed and each command prefixed with &:

http://rest.slimsuite.unsw.edu.au/slimprob&uniprotid=P04051,P04819,P25336,P25847,P26793&motif=LIG_PCNA_PIPBox_1&dismask=T

When a job is first launched on the server, it will be allocated an 11-digit job identifier (“JobID”) in the form 19052800004. If the server is busy, this job will be placed in the queue. Otherwise, it will run straight away. Either way, it will launch a URL to retrieve the JobID:

http://rest.slimsuite.unsw.edu.au/retrieve&jobid=19052800004&rest=format

This URL can be saved and reloaded any time to check progress of the job. When the job is complete, a results page will load with each output in a separate tab. Outputs can be individually retrieved by changing the &rest=X setting to match the title of the relevant tab. For example, to retrieve the SLiMProb occurrence table:


http://rest.slimsuite.unsw.edu.au/retrieve&jobid=19052800004&rest=occ
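The URL convention described above (program name, then each command prefixed with & and no whitespace) can be sketched in a few lines of Python. This is only an illustration written for Python 3 and is not part of SLiMSuite; the function names rest_url and retrieve_url are hypothetical helpers, and the sketch assumes option values contain no characters that need URL escaping.

```python
REST_BASE = "http://rest.slimsuite.unsw.edu.au/"


def rest_url(prog, **options):
    """Build a REST URL: program name followed by &arg=value for each option."""
    parts = [prog]
    for arg, value in options.items():
        # Lists of accession numbers are joined with commas, as on the command line.
        if isinstance(value, (list, tuple)):
            value = ",".join(str(v) for v in value)
        # REST commands must not contain whitespace, so remove any that is present.
        value = "".join(str(value).split())
        parts.append("&{}={}".format(arg, value))
    return REST_BASE + "".join(parts)


def retrieve_url(jobid, rest="format"):
    """Build the URL used to check or retrieve a job, selecting one output with rest=X."""
    return rest_url("retrieve", jobid=jobid, rest=rest)


if __name__ == "__main__":
    print(rest_url("slimprob",
                   uniprotid=["P04051", "P04819", "P25336", "P25847", "P26793"],
                   motif="LIG_PCNA_PIPBox_1", dismask="T"))
    print(retrieve_url("19052800004", rest="occ"))
```

Keyword arguments keep their order in Python 3.7 and later, so the generated URL lists the commands in the order they were given.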

The outfmt tab will also list the available outputs. For additional details of using the REST servers and the data available, see Notes 7–14.

3.2.2 Predicting Novel SLiMs De Novo in a Set of Proteins (SLiMFinder)

SLiMFinder is designed to look for convergently evolved motifs that are shared between unrelated proteins. The run command is similar to SLiMProb, but no motif file is given:

python $SLIMSUITE/tools/slimfinder.py seqin= dismask=T

For example, to search for motifs in the example dataset of yeast proteins:

python $SLIMSUITE/tools/slimfinder.py seqin=POL30.Scer.apid-hq.fas dismask=T

or pulling down UniProt entries:

python $SLIMSUITE/tools/slimfinder.py uniprotid=POL30.Scer.apid-hq.acc dismask=T

Via the REST server (with a subset of the POL30.Scer.apid-hq.acc accession numbers):

http://rest.slimsuite.unsw.edu.au/slimfinder&uniprotid=P25336,P12887,P38766,P04819,P47110,P26793&dismask=T

As with SLiMProb, the two main result files of interest produced are the summary results table (slimfinder.csv) (Tables 4 and 7; see Note 15) and the SLiM occurrence table (slimfinder.occ.csv) (Table 6; see Note 27). The summary table is of most interest, as it contains the predicted motif patterns (if any) and their estimated significance (Table 7). One of the things that sets SLiMFinder apart from most other de novo protein motif discovery tools is the estimation of the probability of observing any motifs enriched in the data [15, 26]. SLiMChance has been shown to be slightly conservative,

and so the default significance threshold is set to 0.1. Even so, the stringency of SLiMChance means that most searches will not return motifs unless there is a strong signal. The REST server is also set to return the top 100 patterns only. (It is rare for a dataset to return so many unless evolutionary filtering is switched off and/or the significance threshold relaxed.) These settings can be altered with probcut=X and topranks=X, respectively. It should be noted that the motif patterns returned are unlikely to precisely match the true SLiM. Significant motifs will be grouped into motif “clouds” of patterns that are physically overlapping on two or more non-wildcard positions in 2+ sequences, which indicates probable variants of a more complex SLiM. These are summarized in the *.cloud.txt output file (see Note 22).

Table 7 (Q)SLiMFinder motif summary results fields

Field     Description
MotNum    Number of motifs meeting the output requirements.
Rank      Rank of returned SLiM. If no motifs of any kind are returned, the dataset will have a single results line with Rank 0.
Pattern   Pattern of returned SLiM or “-,” no motifs returned; “,” dataset too big; “!,” dataset run failed.
IC        Normalized information content of motif (see manual).
Occ       Total number of occurrences across all sequences.
Support   Total number of sequences containing motif.
UP        Total number of unrelated proteins containing motif.
ExpUP     Expected number of unrelated proteins containing motif.
Prob      The uncorrected probability of the motif (the probability of k+ observations of a predefined motif).
Sig       The corrected p-value of the motif.
Cloud     Identifier of motif cloud to which the SLiM belongs (numbered starting at 1 for the most significant motif) (see Note 22).
CloudSeq  Number of sequences covered by that motif cloud.
CloudUP   Number of unrelated protein clusters covered by motif cloud.

3.2.3 Query-Focused De Novo SLiM Prediction (QSLiMFinder)

SLiMFinder treats all proteins equally and looks for all possible motif patterns, as constrained by the SLiMBuild settings (see Note 19). Sometimes, a specific protein is of particular interest. This may be because it has higher-quality data supporting a SLiM-mediated interaction (such as a solved structure of a protein-protein interaction) or because SLiMs in that protein are of particular interest (such as a molecular mimicry in a viral protein). QSLiMFinder [16] is a modified version of SLiMFinder that builds the motif space on a specific query and then searches for (convergent) enrichment in the rest of the dataset. QSLiMFinder is run with the same commands as SLiMFinder with the addition of a query=X command to identify the query. This can either be a sequence identifier (name or accession number; see Subheading 2.7.2) or a number specifying the position in the input file. The default setting is query=1, which means that the first protein in the dataset will be used as the query in the absence of a specified query:

python $SLIMSUITE/tools/qslimfinder.py seqin= dismask=T

This is the same as:

python $SLIMSUITE/tools/qslimfinder.py seqin= dismask=T query=1

Generally, it is safer to explicitly specify the query by name, so that an error will be generated if the query is not found. For example, the POL30.Scer.apid-hq.fas file contains high-quality interactors of yeast POL30 from the APID database [34]. The CDC9-POL30 interaction can be considered as a particularly high-quality binary interaction, as it is supported by PDB structure 2OD8 [35]. This information can be used to restrict the motif search space to be in CDC9 (P04819) using QSLiMFinder:

python $SLIMSUITE/tools/qslimfinder.py seqin=POL30.Scer.apid-hq.fas dismask=T query=P04819

Or via the REST server (with a subset of the proteins for clarity):

http://rest.slimsuite.unsw.edu.au/qslimfinder&uniprotid=P25336,P12887,P38766,P04819,P47110,P26793&dismask=T&query=P04819


3.2.4 Restricting De Novo SLiM Prediction to a Specific Protein Region (QSLiMFinder)

Chain B of the PDB structure 2OD8 in the example above is a 22 aa segment of P04819 corresponding to residues 32–53. This can be given to QSLiMFinder to further increase the sensitivity using the qregion=X,Y parameter:

python $SLIMSUITE/tools/qslimfinder.py seqin=POL30.Scer.apid-hq.fas dismask=T query=P04819 qregion=32,53

Or via the REST server (with a subset of the proteins):

http://rest.slimsuite.unsw.edu.au/qslimfinder&uniprotid=P25336,P12887,P38766,P04819,P47110,P26793&dismask=T&query=P04819&qregion=32,53

From v2.3.0, QSLiMFinder uses position numbering starting from 1 (previous versions start at zero). It is possible to list several regions to include, separated with commas:

qregion=start1,end1,start2,end2,..,startN,endN.

For example, to keep the peptide region and the first ten residues of the protein:

http://rest.slimsuite.unsw.edu.au/qslimfinder&uniprotid=P25336,P12887,P38766,P04819,P47110,P26793&dismask=T&query=P04819&qregion=1,10,32,53

To focus on the C-terminus, negative numbers can be given to count back from the end of the sequence. For example, to keep the last ten residues:

http://rest.slimsuite.unsw.edu.au/qslimfinder&uniprotid=P25336,P12887,P38766,P04819,P47110,P26793&dismask=T&query=P04819&qregion=-10,-1
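The qregion numbering rules (positions counted from 1, several start/end pairs in one list, and negative numbers counting back from the end of the sequence) can be illustrated with a short Python 3 sketch. This is not the QSLiMFinder implementation: resolve_qregion and keep_regions are hypothetical helpers, and the use of X as the masking character is an assumption made for the illustration.

```python
def resolve_qregion(qregion, seqlen):
    """Convert a qregion string into 1-based inclusive (start, end) pairs.

    Negative positions count back from the end, so -1 is the last residue.
    """
    values = [int(v) for v in qregion.split(",")]
    if len(values) % 2:
        raise ValueError("qregion needs start,end pairs: %s" % qregion)
    regions = []
    for start, end in zip(values[0::2], values[1::2]):
        if start < 0:
            start = seqlen + start + 1
        if end < 0:
            end = seqlen + end + 1
        if not 1 <= start <= end <= seqlen:
            raise ValueError("region %d,%d outside sequence of length %d"
                             % (start, end, seqlen))
        regions.append((start, end))
    return regions


def keep_regions(sequence, regions, mask="X"):
    """Mask every residue outside the given regions (mask character is an assumption)."""
    kept = set()
    for start, end in regions:
        kept.update(range(start - 1, end))  # convert to 0-based indices
    return "".join(aa if i in kept else mask for i, aa in enumerate(sequence))


if __name__ == "__main__":
    print(resolve_qregion("1,10,32,53", 755))
    print(resolve_qregion("-10,-1", 100))
    print(keep_regions("ABCDEFGHIJ", resolve_qregion("2,4,-2,-1", 10)))
```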

Note that all masking options are additive; any residues in a specified query region that are not predicted to be disordered will be masked out by dismask=T. This can result in a situation where the entire query protein has been masked out and no motif prediction can take place (see Note 23).

3.2.5 Identifying Known Motifs from De Novo Predictions (CompariMotif)

When you have a lot of motif predictions, it can be tiresome and error-prone to manually scan them for things that look familiar. CompariMotif [17] compares sets of motifs for similarity. One common task in SLiM prediction is to compare (Q)SLiMFinder results to known SLiMs, e.g., from the ELM database [18]. To compare SLiM predictions to known motifs:

python $SLIMSUITE/tools/comparimotif_V3.py motifs=slimfinder.csv searchdb=elm2019.motifs

The main comparimotif.compare.tdt tab-delimited output will contain details of any motif matches. Predicted motifs will be named in the form Dataset|RunID|Rank|Pattern. It is recommended to open this file in MS Excel and sort on Score to look at the highest scoring hits first. CompariMotif is designed to help users make informed manual decisions; there is currently no definitive set of CompariMotif rules to define a strong/weak match. Please see the CompariMotif paper [17] and manual for more details. SLiM predictions from the (Q)SLiMFinder servers can also be fed into the online CompariMotif server. This is achieved by giving the CompariMotif server the relevant REST output as input. The CompariMotif server is called with the same general command as the stand-alone program:

http://rest.slimsuite.unsw.edu.au/comparimotif&motifs=X&searchdb=X

Prediction results can be specified in the form jobid:NNNNNNNNNNN:main. For example, to compare SLiMFinder predictions from job 19052400010 to ELM:

http://rest.slimsuite.unsw.edu.au/comparimotif&motifs=jobid:19052400010:main&searchdb=elm2019
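As an alternative to sorting the comparimotif.compare.tdt file in a spreadsheet, the same ranking can be sketched with the Python 3 standard library. The sketch below assumes only what is stated above: the file is tab-delimited with a header row and has a numeric Score column. The other column names in the demonstration data (Name1, Name2) and the example rows are hypothetical placeholders, not real CompariMotif output.

```python
import csv
import io


def top_hits(handle, n=10):
    """Return the n highest-scoring rows from a tab-delimited results table."""
    rows = list(csv.DictReader(handle, delimiter="\t"))
    # Score is read as text, so convert to float before sorting.
    rows.sort(key=lambda row: float(row["Score"]), reverse=True)
    return rows[:n]


if __name__ == "__main__":
    # Made-up rows standing in for a real results file.
    demo = io.StringIO(
        "Name1\tName2\tScore\n"
        "motifA\tknown1\t2.5\n"
        "motifB\tknown2\t4.0\n"
        "motifC\tknown3\t3.1\n"
    )
    for row in top_hits(demo, n=2):
        print(row["Name1"], row["Name2"], row["Score"])
```

For a real run, the StringIO object would be replaced with an open file handle, e.g. open("comparimotif.compare.tdt").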

3.2.6 Comparing Different De Novo Prediction Runs (CompariMotif)

CompariMotif can also be used to compare the results of different (Q)SLiMFinder analyses. This can be helpful for identifying shared motifs and common themes. To compare two (Q)SLiMFinder runs:

python $SLIMSUITE/tools/comparimotif_V3.py motifs=slimfinder.csv searchdb=qslimfinder.csv

For batch analysis of multiple datasets (see Note 15), it can be useful to compare the results file to itself. In this case, CompariMotif does not need a searchdb=FILE setting:

python $SLIMSUITE/tools/comparimotif_V3.py motifs=slimfinder.csv

SLiM predictions from the (Q)SLiMFinder servers can also be compared with the online CompariMotif server as outlined above. For example, to compare SLiMFinder predictions from job 19052400010 to QSLiMFinder job 19052400025:

http://rest.slimsuite.unsw.edu.au/comparimotif&motifs=jobid:19052400010:main&searchdb=jobid:19052400025:main

Or to compare them against themselves:

http://rest.slimsuite.unsw.edu.au/comparimotif&motifs=jobid:19052400010:main

3.2.7 Searching De Novo Predictions Against Another Set of Proteins (SLiMProb)

In the same way that (Q)SLiMFinder results can be used for CompariMotif searches, they can also be given to SLiMProb to search for the predicted SLiM patterns in a set of proteins:

python $SLIMSUITE/tools/slimprob.py uniprotid=P52701,P13051,P18858,P38936,P26358 motifs=slimfinder.csv dismask=T

Or on the server:

http://rest.slimsuite.unsw.edu.au/slimprob&uniprotid=P52701,P13051,P18858,P38936,P26358&motifs=jobid:19052400010:main

4 Notes

These notes have been organized into broad themes to help the user find the relevant advice.

Running and Installing SLiMSuite

1. Downloading SLiMSuite as a git repository. SLiMSuite is available to download at https://github.com/slimsuite/SLiMSuite. The most flexible way to download SLiMSuite is by cloning the git repository. This will enable the user to pull down updates without having to download and extract the entire repository. To clone SLiMSuite on the command line:

git clone https://github.com/slimsuite/SLiMSuite.git


The repository can also be cloned using the GitHub Desktop app or another git GUI. Users unfamiliar with git and GitHub are encouraged to refer to the GitHub user guides (https://guides.github.com/).

2. Setting the SLiMSuite path. If you are running on a UNIX system (recommended) and want to use the example commands as given, you can create a $SLIMSUITE environment variable that points to your SLiMSuite download:

SLIMSUITE=/share/apps/slimsuite

Alternatively, you might want to set up a symbolic link to the slimsuite directory in your run directory to simplify commands:

ln -s $SLIMSUITE slimsuite
python slimsuite/tools/slimsuite.py -help

If the second command works, you have set up your link OK.

3. SLiMSuite command-line parameters. Command-line options have two parts: the argument and the value. These can be fed to programs in one of two formats:

argument=value
-argument value

These two lines have equivalent functions. The two styles can be mixed within a program call, e.g.:

python program.py arg1=val1 -arg2 val2

Options can also be supplied within a special *.ini text file. This is simply a text file containing a list of commands, which is passed to the program using the ini=FILE command:

python program.py arg1=val1 ini=myini.ini -arg2 val2

Arguments are processed in the order they are received and supersede previous settings. In the above example, any arg1 setting in myini.ini would supersede val1, whereas val2 would supersede any arg2 setting in myini.ini. More details can be found at the SLiMSuite blog (http://slimsuite.blogspot.com/).
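The ordering rule in Note 3 can be made concrete with a small Python 3 sketch that reads both option styles and expands an ini file at the point where it appears. It is an illustration only and not the SLiMSuite parser: parse_options and read_ini_file are hypothetical names, and the ini file is assumed to hold whitespace-separated commands.

```python
def read_ini_file(path):
    """Read commands from an ini file (assumed to be whitespace-separated)."""
    with open(path) as handle:
        return handle.read().split()


def parse_options(argv, read_ini=read_ini_file):
    """Parse arg=value and -arg value options in order; later settings win."""
    settings = {}
    tokens = list(argv)
    i = 0
    while i < len(tokens):
        token = tokens[i]
        if "=" in token:
            arg, value = token.split("=", 1)
            i += 1
        elif token.startswith("-") and i + 1 < len(tokens):
            arg, value = token, tokens[i + 1]
            i += 2
        else:
            i += 1  # bare word without a value: ignored in this sketch
            continue
        arg = arg.lstrip("-")
        if arg == "ini":
            # Commands from the ini file are applied at this position in the list.
            settings.update(parse_options(read_ini(value), read_ini))
        else:
            settings[arg] = value
    return settings


if __name__ == "__main__":
    fake_ini = {"myini.ini": ["arg1=ini1", "arg2=ini2"]}
    print(parse_options(["arg1=val1", "ini=myini.ini", "-arg2", "val2"],
                        read_ini=fake_ini.__getitem__))
```

With the made-up ini content shown, arg1 ends up with the ini value and arg2 with val2, matching the behavior described in the note.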


A list of command-line options can be accessed using the help (or -h or -help or --help) command, e.g.:

python $SLIMSUITE/tools/slimfinder.py help

Or via the REST server without giving any option, e.g.:

http://rest.slimsuite.unsw.edu.au/slimfinder

4. Recommended procedures for running SLiMSuite programs. Running SLiMSuite can produce a lot of output files, especially if batch running on multiple input files. It is not uncommon to try running searches with different parameter settings or to make mistakes during early runs. It is therefore recommended to keep the SLiMSuite installation, input data, and run directories separate. In this way, runs can easily be archived or deleted without interfering with either the installation or the data, e.g.:

python $SLIMSUITE/tools/slimfinder.py -seqin $SLIMSUITE/data/POL30.Scer.apid-hq.fas -dismask T

Example commands elsewhere in this chapter have not followed this recommendation purely to keep the commands as short as possible.

5. Renaming and controlling output. By default, output will go into files named after the SLiMSuite tool. For example, SLiMProb will generate slimprob.log, slimprob.csv, and slimprob.occ.csv, with additional dataset files saved in SLiMProb/. These can be modified using log=FILE, resfile=FILE (setting the name of the main *.csv file, also sets root of all main outputs), and resdir=PATH. The log and main outputs can be set together using basefile=BASEFILE (see Note 16). By default, existing files will be appended rather than overwritten. This can be switched off with append=F, in which case existing files will be backed up with a *.bak suffix, unless backups=F. To overwrite the log file rather than appending to it, set newlog=T (see Note 6).

6. Error handling, troubleshooting, and SLiMSuite log output. The *.log file records all the important information for a run, including the command-line parameters and progress during the run. It is therefore strongly recommended to keep log files with all run jobs and to set meaningful filenames with log=FILE or basefile=FILE (see Note 5). Log files are tab-delimited files with three fields:


(a) Log code: #XXXXX. This enables certain types of log lines to be extracted easily.
(b) Run time: HH:MM:SS. The time since the run started.
(c) Information: ASCII text. The log information.

If something goes wrong with a run, the first place to look is the log file. Errors themselves are recorded with the code #ERR. The source of the error may have occurred earlier in the run, so always check the log file for expected behavior. A run may also generate warnings (#WARN), which will highlight unexpected behavior or setting combinations that are not recommended.

Using the SLiMSuite REST Server

7. Retrieving REST jobs from the server front page. All server jobs will be allocated an 11-digit “JobID” in the form 19052500004. Jobs can be retrieved at the main server page (http://www.slimsuite.unsw.edu.au/servers.php) in the “Retrieve REST Server job” by entering the allocated code in the “JobID:” box and clicking “Check/Retrieve.” This will launch the same URL as the job itself in a new browser tab. To simply check the status of a queued/running job, change the dropdown option from “Retrieve” to “Check.” This will then return the status as Queued, Running, or Finished. See Table 8 for JobIDs to retrieve example runs used in this chapter.
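The three-field log format described in Note 6 lends itself to simple scripted checks. The Python 3 sketch below pulls out the #ERR and #WARN lines from a log; it relies only on the tab-delimited layout and the two codes named in that note. The function names and the example log lines (including the #LOG code) are invented for illustration.

```python
def parse_log(lines):
    """Yield (code, run_time, information) for each well-formed log line."""
    for line in lines:
        fields = line.rstrip("\n").split("\t", 2)
        if len(fields) == 3 and fields[0].startswith("#"):
            yield tuple(fields)


def errors_and_warnings(lines):
    """Return only the entries recorded with the #ERR or #WARN codes."""
    return [entry for entry in parse_log(lines) if entry[0] in ("#ERR", "#WARN")]


if __name__ == "__main__":
    # Invented example lines; real log content will differ.
    example = [
        "#LOG\t00:00:00\tRun started\n",
        "#WARN\t00:00:02\tUnusual setting combination\n",
        "#ERR\t00:00:05\tInput file not found\n",
    ]
    for code, run_time, info in errors_and_warnings(example):
        print(code, run_time, info)
```

For a real run, the list would be replaced with an open file handle such as open("slimprob.log").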

8. Running REST jobs from the server front page. At the bottom of the main server page (http://www.slimsuite.unsw.edu.au/servers.php) is a simple form to “Run REST Server Job” with three boxes: Prog, Options, and Password (optional) (see Note 9). To run jobs from this page, enter the relevant tool in the Prog box and the command-line options in the Options box, and then click “Submit to REST Server.” These can be given as either space-separated arg=val commands or a string of &arg=val commands matching the REST URL. For example, these will all generate the same run:

(a) uniprotid=P04051,P04819,P25336,P25847,P26793 motif=LIG_PCNA_PIPBox_1 dismask=T
(b) &uniprotid=P04051,P04819,P25336,P25847,P26793&motif=LIG_PCNA_PIPBox_1&dismask=T
(c) &uniprotid=P04051,P04819,P25336,P25847,P26793 motif=LIG_PCNA_PIPBox_1 &dismask=T

If the Options box is left blank, clicking “Submit to REST Server” will open up the documentation for that tool. If the Prog box is also blank, the main REST server help will open.
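The equivalence of the three option styles in Note 8 can be shown with a one-function Python 3 sketch that reduces any of them to the same &arg=val string. How the server itself handles the Options box is not documented here, so this is an assumption about the outcome rather than a description of the server code; to_rest_options is a hypothetical name, and the sketch assumes that values never contain whitespace or & characters.

```python
import re


def to_rest_options(options):
    """Normalize space-separated and/or &-prefixed commands to one &arg=val string."""
    # Split on any run of whitespace and/or ampersands, dropping empty pieces.
    commands = [c for c in re.split(r"[\s&]+", options.strip()) if c]
    return "".join("&" + command for command in commands)


if __name__ == "__main__":
    ids = "P04051,P04819,P25336,P25847,P26793"
    styles = [
        "uniprotid=%s motif=LIG_PCNA_PIPBox_1 dismask=T" % ids,
        "&uniprotid=%s&motif=LIG_PCNA_PIPBox_1&dismask=T" % ids,
        "&uniprotid=%s motif=LIG_PCNA_PIPBox_1 &dismask=T" % ids,
    ]
    for style in styles:
        print(to_rest_options(style))
```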

Table 8 Example server jobs

Subheading  JobID^a      Command
3.2.1       19052800004  slimprob&uniprotid=P04051,P04819,P25336,P25847,P26793&motif=LIG_PCNA_PIPBox_1&dismask=T
3.2.2       19052800005  slimfinder&uniprotid=P25336,P12887,P38766,P04819,P47110,P26793&dismask=T
3.2.3       19052800006  qslimfinder&uniprotid=P25336,P12887,P38766,P04819,P47110,P26793&dismask=T&query=P04819
3.2.4       19052800007  qslimfinder&uniprotid=P25336,P12887,P38766,P04819,P47110,P26793&dismask=T&query=P04819&qregion=32,53
3.2.4       19052800008  qslimfinder&uniprotid=P25336,P12887,P38766,P04819,P47110,P26793&dismask=T&query=P04819&qregion=1,10,32,53
3.2.4       19052800009  qslimfinder&uniprotid=P25336,P12887,P38766,P04819,P47110,P26793&dismask=T&query=P04819&qregion=-10,-1
3.2.5       19052800010  comparimotif&motifs=jobid:19052400010:main&searchdb=elm2019
3.2.6       19052800011  comparimotif&motifs=jobid:19052400010:main&searchdb=jobid:19052400025:main
3.2.6       19052800012  comparimotif&motifs=jobid:19052400010:main
3.2.7       19052800013  slimprob&uniprotid=P52701,P13051,P18858,P38936,P26358&motifs=jobid:19052400010:main

^a Retrieve using http://rest.slimsuite.unsw.edu.au/retrieve&jobid=XXXXXXXXXXX (see Note 7)

9. Data protection and passwords. Any data or jobs submitted to the server can be accessed by anyone with the JobID. It is possible to add limited protection by adding &password=X to the REST options (or entering something in the Password box if running through the server front page (see Note 8)). This password will then need to be entered to retrieve any job information/results. Please note, however, that there is no data encryption and the server hosts will still be able to see all the details of the job run, including the password. It is strongly recommended that sensitive or protected data is run locally using the stand-alone version of SLiMSuite.

10. Running REST jobs using specific program servers. The original SLiMFinder [36] and SLiMSearch [14] servers have been retired and replacement servers are under development. Simple web forms for generating REST jobs are available at http://www.slimsuite.unsw.edu.au/servers.php. At time of publication, server forms were available for SLiMProb, SLiMFinder, and QSLiMFinder. These server pages provide the capability to run the REST servers on custom sequences in FASTA format (see Subheading 2.7.2) in addition to UniProt sequences. These can be pasted into a text box or uploaded as a file. Note that there is a size limit to input provided this way. Large jobs should be run locally.

11. Running REST server jobs with SLiMParser. The SLiMSuite download also contains a tool, SLiMParser, for running and/or parsing results from the REST server. The main SLiMParser input is given to the restin=X command and can take one of three forms:

(a) A full results file saved from the REST server for a job (&rest=full).
(b) An 11-digit JobID.
(c) A full REST command URL.

REST output for a job can then be downloaded by setting restout=T. By default, output will be named after the JobID. Alternatively, restbase=BASEFILE can set the output file prefix. Optionally, restoutdir=PATH will save files to a different location. For example:

python $SLIMSUITE/tools/slimparser.py restin="http://rest.slimsuite.unsw.edu.au/slimprob&uniprotid=P04051,P04819,P25336,P25847,P26793&motif=LIG_PCNA_PIPBox_1&dismask=T" restout=T restbase=slimprob

12. Using SLiMParser for API REST calls. The SLiMSuite servers can be accessed programmatically. Please see the server API page (http://rest.slimsuite.unsw.edu.au/docs&page=API) for details. SLiMParser can also be used to execute this function through command-line calls, using rest=X and pureapi=T to print the selected output to STDOUT in conjunction with silent=T to avoid additional STDOUT output and log generation:

(a) Submit the job and retrieve the JobID:

python $SLIMSUITE/tools/slimparser.py restin="http://rest.slimsuite.unsw.edu.au/slimprob&uniprotid=P04051,P04819,P25336,P25847,P26793&motif=LIG_PCNA_PIPBox_1&dismask=T" pureapi silent rest=jobid

(b) To wait until the job is complete, check the job status and wait until “Finished” is returned (see Note 7):

python $SLIMSUITE/tools/slimparser.py restin=19052500004 rest=check pureapi silent

(c) Once the run is complete, retrieve the relevant output(s):

python $SLIMSUITE/tools/slimparser.py restin=19052500004 rest=main pureapi silent
python $SLIMSUITE/tools/slimparser.py restin=19052500004 rest=occ pureapi silent
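For scripted use, the status check in step (b) can be wrapped in a polling loop. The Python 3 sketch below is one possible arrangement and not part of SLiMSuite: wait_until_finished and slimparser_check are hypothetical helpers, the polling interval and limit are arbitrary, and it is an assumption that the rest=check call prints exactly one of Queued, Running, or Finished to STDOUT.

```python
import os
import subprocess
import time


def slimparser_check(jobid):
    """Run the step (b) command and return its STDOUT (assumed to be the status word)."""
    script = os.path.expandvars("$SLIMSUITE/tools/slimparser.py")
    result = subprocess.run(
        ["python", script, "restin=%s" % jobid, "rest=check", "pureapi", "silent"],
        capture_output=True, text=True, check=True)
    return result.stdout.strip()


def wait_until_finished(check_status, interval=30.0, max_checks=120, sleep=time.sleep):
    """Call check_status() until it returns Finished; return the number of checks made."""
    for attempt in range(1, max_checks + 1):
        status = check_status()
        if status == "Finished":
            return attempt
        if status not in ("Queued", "Running"):
            raise RuntimeError("Unexpected job status: %r" % status)
        sleep(interval)
    raise TimeoutError("Job not finished after %d checks" % max_checks)


if __name__ == "__main__":
    # Simulated statuses so the sketch runs without contacting the server.
    statuses = iter(["Queued", "Running", "Finished"])
    print(wait_until_finished(lambda: next(statuses), interval=0))
```

In real use, the check would be supplied as lambda: slimparser_check("19052500004"), with an interval long enough to avoid loading the server.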

SLiMParser will execute step a as part of step c. A separate step b call is only required if the user wants to assess the job status and make different decisions accordingly.

13. Protein data available at the REST servers. The SLiMSuite servers have a number of protein datasets available for analysis, including human and yeast interactome data. Please see the REST server “Alias” page (http://rest.slimsuite.unsw.edu.au/alias) and SLiMSuite GitHub Wiki for more details.

14. Motif data available through the REST servers. The SLiMSuite servers have a number of protein motifs and motif datasets available for analysis, including ELM data [18]. Please see the REST server “Alias” page (http://rest.slimsuite.unsw.edu.au/alias) and SLiMSuite GitHub Wiki for more details.

Batch Analysis: Running Multiple Datasets

15. Running SLiMSuite tools in batch mode. SLiMFinder is designed to run on multiple datasets with a low false-positive rate. To run SLiMProb or (Q)SLiMFinder on a set of protein files, use the batch=FILELIST command in place of seqin=FILE, e.g. to run on every *.dat and *.fas file in the current directory:

python $SLIMSUITE/tools/slimfinder.py batch="*.dat,*.fas" dismask=T

(This is actually the default program behavior in the absence of any input command.) Note that wildcards are permitted. If wildcards are used, care must be taken that they are not expanded by the UNIX shell. FILELIST can also be a plain text file with one list element per line. When running in batch mode, all results will go into the same main results files. Each dataset will also have its own results files output into the resdir=PATH path (see Note 5). It is therefore important that datasets with the same name in different directories are not passed to the program. More details on running in batch mode can be found in the SLiMFinder manual and GitHub Wiki.

16. Rerunning SLiMSuite with different settings. If SLiMProb or QSLiMFinder are rerun on the same datasets, they will try to reload intermediate files (including masked sequence data and SLiMBuild motifs) from the resdir=PATH directory. This makes it very fast to rerun programs with different filtering criteria. If a batch run (see Note 15) was incomplete, setting pickup=T will skip any datasets that have already run (as loaded from the main *.csv file and matched by runid=X). To rerun data without reusing results, it is advisable to set basefile=X, resdir=PATH and runid=X to something different from the original run. Alternatively, force=T can be set to force the regeneration of all files with the same locations. SLiMProb and (Q)SLiMFinder will also look for existing intermediate files in the buildpath=PATH directory, which has the same default as resdir=PATH (e.g., SLiMProb/, SLiMFinder/, or QSLiMFinder/). To reuse existing intermediates but keep results files separate, point this to the resdir of the previous run. To avoid reusing these files, point this directory elsewhere or run with force=T. More details can be found in the SLiMSuite documentation.

17. Farming jobs out to use multiple threads. To make use of multiple CPUs, batch running multiple datasets can be executed through SLiMFarmer rather than directly in SLiMProb or (Q)SLiMFinder. Please see the SLiMSuite documentation and GitHub Wiki for details.
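Before launching a batch run of the kind described in Note 15, it can be useful to preview which files a wildcard list will match and to check for clashing dataset names. The Python 3 sketch below does this with the standard glob module. It is a pre-flight check written for illustration and does not reproduce how SLiMSuite itself expands batch=FILELIST; in particular, treating the file name without its extension as the dataset name is an assumption.

```python
import glob
import os


def expand_filelist(filelist, directory="."):
    """Expand a comma-separated list of wildcard patterns into matching file paths."""
    files = []
    for pattern in filelist.split(","):
        for path in sorted(glob.glob(os.path.join(directory, pattern.strip()))):
            if path not in files:
                files.append(path)
    return files


def duplicate_names(files):
    """Group paths that share a dataset name (assumed: file name minus extension)."""
    by_name = {}
    for path in files:
        name = os.path.splitext(os.path.basename(path))[0]
        by_name.setdefault(name, []).append(path)
    return {name: paths for name, paths in by_name.items() if len(paths) > 1}


if __name__ == "__main__":
    matched = expand_filelist("*.dat,*.fas")
    print("%d files matched" % len(matched))
    for name, paths in duplicate_names(matched).items():
        print("Dataset name used more than once:", name, paths)
```

When the command is launched from Python with subprocess and an argument list (no shell), the pattern string is passed through unexpanded, which is one way to avoid the shell expansion problem mentioned in the note.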


18. Whole proteome preprocessing. Batch running multiple datasets with overlapping protein content can be made faster by preprocessing the proteome for disorder and/or conservation masking scores. Please see the SLiMSuite documentation and GitHub Wiki for details.

Advanced Features and Protocols

19. Controlling motif search space with SLiMBuild. The SLiMChance statistics implemented in (Q)SLiMFinder make use of the finite motif space being searched due to the SLiMBuild settings. Searches can be made more flexible, but less sensitive and slower, or vice versa, by modifying these settings. The key SLiMBuild options are:

(a) termini=T/F. Whether to add termini characters (^ & $) to search sequences [True],
(b) minwild=X. Minimum number of consecutive wildcard positions to allow [0],
(c) maxwild=X. Maximum number of consecutive wildcard positions to allow [2],
(d) slimlen=X. Maximum length of SLiMs to return (no. non-wildcard positions) [5],
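As a rough illustration of how these options bound the motif space, the longest possible motif has slimlen defined positions with maxwild wildcards in each of the slimlen - 1 gaps between them. The Python 3 sketch below computes that length and, as an extra illustration that is not taken from the SLiMSuite documentation, counts the wildcard spacing layouts available to a motif with slimlen defined positions. Both function names are hypothetical.

```python
def max_motif_length(slimlen=5, maxwild=2):
    """Longest motif: slimlen defined positions plus maxwild wildcards in each gap."""
    return slimlen + maxwild * (slimlen - 1)


def wildcard_layouts(slimlen=5, minwild=0, maxwild=2):
    """Number of ways to choose a wildcard run length for each of the slimlen - 1 gaps."""
    return (maxwild - minwild + 1) ** (slimlen - 1)


if __name__ == "__main__":
    print(max_motif_length())   # 13 with the default settings
    print(wildcard_layouts())   # 81 spacing layouts for five defined positions
```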

Default settings are termini=T minwild=0 maxwild=2 slimlen=5. This means that a motif can be up to 13 amino acids long (5 defined positions, plus 2 wildcards at each of 4 positions). Please see the SLiMFinder documentation for other SLiMBuild motif space options. In addition to the motif construction options, (Q)SLiMFinder also requires a pattern to be observed a minimum number of times. This is to reduce memory usage (and computation) by prefiltering rarer patterns that are unlikely to be significantly enriched. (This filtering does not affect the SLiMChance correction for motif space.) By default, a pattern must occur in at least 5%, and at least 3, of the unrelated protein clusters, whichever is greater. This is controlled by minocc=X and absmin=X (see Note 21).

20. SLiMChance statistics and adjusting for evolutionary relationships. The “Unrelated Proteins” (UP) algorithm identifies relationships with BLAST+ that are later clustered into “Unrelated Protein Clusters” (UPC) where no protein has a BLAST-detectable relationship with another from a different UPC [19]. This is used to filter out all the motifs that do not occur in enough unrelated proteins (see Note 19) and adjust the probability of the observed enrichment to take evolutionary relationships into account [15, 26]. Without this adjustment, results can have massive biases due to shared evolutionary history, which often swamp any signal of convergent SLiM evolution. For very large datasets, this can take a long time (due to the all-by-all BLAST+ search) and result in excess grouping of multi-domain proteins. Where the adjustment for evolutionary relationships is unwanted or problematic, it can be switched off with efilter=F, with the concomitant risk of biased statistics.

21. Motif ambiguity. Ambiguity in (Q)SLiMFinder predictions is controlled in a similar fashion to TEIRESIAS [37], using an “equivalence list” (equiv=LIST) of sets of amino acids that are free to replace each other at ambiguous positions. The default sets are AGS,ILMVF,FYW,FYH,KRH,DE,ST. (Q)SLiMFinder first finds fixed position patterns, with a relaxed minimum occurrence of 2 UPC (ambocc=0.05 absminamb=2), and then combines them using the equivalence groups. A position can only be in a single group: AGS and ST will never be combined into a [GST] position, for example. Wildcard spacers are similarly grouped into variable length wildcards (returned as “.{min,max}”). This can be switched off with wildvar=F. For increased ambiguity (but slower runtimes and more false positives), both kinds of ambiguity can be combined with combamb=T. Note: Due to the lack of independence between different variants of an ambiguous motif, the SLiMChance statistics do not currently correctly adjust for ambiguity. For this reason, especially with QSLiMFinder, there can be a tendency for ambiguous patterns to receive artificially inflated significance. It is recommended that high-throughput analyses are run with cloudfix=T, which restricts output of ambiguous motifs to those that “cloud” with a fixed pattern (see Note 22). Ambiguity can be switched off completely with ambiguity=F.

22. Motif clouding. Significant motifs are grouped into motif “clouds” of patterns that are physically overlapping on two or more non-wildcard positions in 2+ sequences. These are summarized in the *.cloud.txt output file.
Motif clouds represent probable variants of a more complex SLiM. The *.cloud.txt file contains a consensus pattern for each cloud made by SLiMMaker [16] from all the occurrences of motifs in the cloud. Degrees of overlap between different clouds are also reported.

23. Additional masking options. Masking is important as it reduces the sequence search space, which increases statistical power and reduces the likelihood of returning irrelevant patterns. In addition to the disorder, UniProt feature and custom case masking (see Subheading 3.1), SLiMSuite has masking options for:

(a) Low-complexity masking (compmask=X,Y, mask where 5/8 residues are the same amino acid, by default).
(b) N-terminal methionines (metmask=T/F, on by default).
(c) Position masking of specific amino acids in specific positions (posmask=LIST, mask alanines at position 2 (2:A), by default).
(d) Conservation-based masking based on multiple sequence alignments for input proteins (see Note 24 and SLiMFinder manual) (consmask=T/F, off by default).
(e) Motif masking (motifmask=MOTIFS, off by default).

Overzealous masking can reduce the signal if it incorrectly masks true positives, which can cause problems in small datasets or hunts for biased motif patterns. All masking can be switched off with masking=F.

24. Generating orthologue alignments with GOPHER for conservation masking. The easiest way to generate alignments for conservation masking (see Note 23) is to use GOPHER (see Subheading 2.5.6), which can be switched on with usegopher=T and setting a file of proteins as the source of possible orthologues with orthdb=FILE. This file will need to be reformatted in SLiMSuite format (see Subheading 2.7.2). For vertebrates, the Quest for Orthologs proteomes are recommended [38]. This is the default for the consmask=T option on the REST servers (&orthdb=qfo_ref). The server also has a set of yeast proteomes for yeast analysis (&orthdb=yeasts). Please contact the author to get additional proteome sets added. If running locally, it is also recommended to set gopherdir=PATH so that alignments can be reused for different datasets. GOPHER can also be run stand-alone on the full proteome prior to SLiM prediction:

python $SLIMSUITE/tools/gopher.py seqin=FILE orthdb=FILE gopherdir=PATH

See the GOPHER documentation for additional options.

25. Providing pre-computed disorder predictions. In addition to the implemented disorder prediction methods (see Subheading 3.1), SLiMSuite can use pre-computed prediction values from any tool. If iuscoredir=PATH is set, then SLiMSuite will look in that directory for .txt files named using the accession number or md5 hash of the protein together with a label that matches the disorder= setting. To use the accession number (see Subheading 2.7.2), set md5acc=F. The content of the file

SLiMSuite Motif Prediction


is the sequence name, followed by a simple whitespace-separated list of disorder scores for each residue. Disorder will be predicted using the iucut=X threshold, so the scoring system should range from low being order and high being disordered.

26. Region smoothing for disorder masking. Disorder predictions are generated at the per-residue level. These can be smoothed out to avoid extremely short regions of order or disorder using the minregion=X setting. In our hands, minregion=5 seems to work well.

27. Calculating additional statistics for predicted SLiM occurrences. In addition to the core motif occurrence information (Table 6), SLiMSuite can also calculate a number of other statistics, controlled by slimcalc=LIST, where LIST contains one or more of:

(a) Cons = Evolutionary conservation. A wide range of conservation calculation options are available. See SLiMFinder manual for details. These require alignments (see Notes 23 and 24).
(b) SA = Surface accessibility using a very crude estimate based on Janin and Wodak [39] over a 7 aa window. Each amino acid gets a SA value based on it and the three amino acids either side. These values are then averaged over the length of the SLiM.
(c) Hyd = Hydropathy, calculated using the Eisenberg scale [40] over an 11 aa window, centered on each amino acid. The mean is then taken across the SLiM.
(d) Fold = Mean FoldIndex disorder.
(e) IUP = Mean IUPred disorder.
(f) Chg = The absolute and net charge at pH 7.
(g) Comp = Simple complexity measure. The number of different amino acids observed across the length of the SLiM occurrence, divided by the maximum possible number, which is the length of the motif or 20, whichever is smaller. For example, a PxxPx[KR] motif occurrence with a sequence PASPPR would have a complexity of 4/6 = 0.6667.
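As a rough illustration of the Comp calculation described in (g), the following Python sketch counts the distinct residues in a motif occurrence and divides by the smaller of the occurrence length and 20. The function name slim_complexity is hypothetical and is not part of SLiMSuite; it only mirrors the description given here.

```python
def slim_complexity(occurrence: str) -> float:
    """Distinct residues divided by min(occurrence length, 20).

    Hypothetical helper that follows the textual description of the
    Comp statistic; it is not the SLiMSuite implementation.
    """
    seq = occurrence.upper()
    if not seq:
        raise ValueError("occurrence must contain at least one residue")
    max_possible = min(len(seq), 20)  # cannot exceed 20 amino acid types
    return len(set(seq)) / max_possible


# Worked example from the text: PASPPR has 4 distinct residues over 6 positions.
print(round(slim_complexity("PASPPR"), 4))  # 0.6667
```

Longer occurrences are capped at 20 in the denominator, so a 30-residue occurrence containing all 20 amino acids would score 1.0 under this sketch.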

Calculations can be extended to flanking regions, using If winsize 0.15), extremely high (κ > 0.4) or low (κ < 0.05) κ values may be obtained which, due to the substantial charge imbalance, may lack any major significance. As a general rule of thumb, for polyampholytes with an FCR > 0.3, a κ above 0.25 is generally quite high, while a κ below 0.12 is quite low. κ provides a quantitative measure of the extent of charge mixing. As a result, it provides insight into both charge-mediated chain compaction (intramolecular contacts) and also regarding putative regions that may drive intermolecular interactions via charge

Sequence Analysis of IDRs


patches. With this in mind, κ is most informative for sequences with an FCR of above 0.25 and an NCPR between −0.1 and +0.1. This range encompasses ~30% of IDRs in the human proteome. In particular, special attention should be given to polyampholytic sequences with an FCR beyond 0.4, where charge patterning is expected to have a major impact on the conformational behavior of polyampholytes. An important question that is frequently raised for two sequences of similar composition is how different two κ values should be before the patterning can be called substantially different. This is a challenging question, in no small part because other sequence features in addition to charge patterning can have a substantial impact on the biophysical behavior (and so in turn also function) of an IDR. For other systems, a difference in κ of around 0.05 has been sufficient in some, but not all, studies to engender a change in chain compaction [21, 25, 37, 43]. Whether or not this has a functional impact is an orthogonal question and one that can be tested (see Note 5).

3.3.4 Fraction Hydropathy and Fraction Disorder-Promoting Residues

The sequence-average hydropathy and the sequence-average fraction of disorder-promoting residues can be thought of as two complementary parameters that provide some insight into the likelihood that the sequence engages in hydrophobic-mediated interactions. The fraction of disorder-promoting residues comes from an analysis that identified sets of residues that are overrepresented in IDRs (so-called disorder-promoting residues) [44]. For the complete set of human IDRs, ~70% of IDRs contain a fraction of disorder-promoting residues of 0.80 ± 0.06 (Fig. 6). For IDRs that are depleted for these types of residues, one may expect local binding sites or folding-upon-binding phenomena to occur due to the presence of “order-promoting” residues [45–47]. The hydropathy value calculated by CIDER uses the Kyte-Doolittle hydrophobicity scale, a widely used amino-acid-specific hydrophobicity scale [48]. By this analysis ~70% of IDRs from the human proteome have a sequence-average hydropathy score of 3.5 ± 0.45 (Fig. 7) [48]. Protein–protein interactions are often mediated by hydrophobic residues, a feature that holds true for many binding motifs in IDRs [14, 49, 50]. With this in mind, enrichment of hydrophobic residues in an IDR may indicate an increased likelihood for binding, although this is a relatively crude metric for assessing the functional role of an IDR. As one might expect, linear hydropathy analysis can also aid in the identification of putative binding sites. However, we emphasize that CIDER and localCIDER are not well poised to robustly identify binding motifs, and many additional software tools have been developed for this explicit purpose [50–53].
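A minimal sketch of the sequence-average hydropathy calculation follows. It uses the published Kyte-Doolittle values [48] and, as an assumption that is not stated explicitly above, shifts each value by +4.5 so that the scale runs from 0 to 9; with this shift the sketch reproduces the hydropathy of 3.277 reported for the p27 C-terminal IDR in Fig. 8. The function name mean_hydropathy is illustrative only and is not the CIDER or localCIDER interface.

```python
# Kyte-Doolittle hydropathy values (Kyte and Doolittle, 1982).
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def mean_hydropathy(sequence: str, shift: float = 4.5) -> float:
    """Sequence-average Kyte-Doolittle hydropathy.

    The +4.5 shift (scale of 0 to 9) is an assumption chosen to match
    the values quoted in this chapter; set shift=0 for the raw scale.
    """
    seq = sequence.upper()
    if not seq:
        raise ValueError("sequence must contain at least one residue")
    return sum(KYTE_DOOLITTLE[aa] + shift for aa in seq) / len(seq)


# C-terminal IDR of p27 (residues 100-198), as listed in Fig. 8.
P27_IDR = (
    "KVPAQESQDVSGSRPAAPLIGAPANSEDTHLVDPKTDPSD"
    "SQTGLAEQCAGIRKRPATDDSSTQNKRANRTEENVSDGSP"
    "NAGSVEQTPKKPGLRRRQT"
)
print(round(mean_hydropathy(P27_IDR), 3))  # 3.277
```

A nonstandard residue raises a KeyError in this sketch, which is a deliberate choice so that unexpected characters are not silently ignored.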

Garrett M. Ginell and Alex S. Holehouse

[Figure 6: histogram; x-axis, fraction disorder promoting (0.5–1); y-axis, number of sequences (0–2000); annotation: ~70% of human IDRs]

Fig. 6 The distribution of the fraction of disorder-promoting residues for the IDRs in the human proteome. The majority (~70%) of human IDRs fall within a window of 0.8 ± 0.06

3.3.5 Diagram of States

The diagram of states was developed by Das and Pappu and helps provide a visual classification of the charge properties of disordered sequences [37]. The diagram places a sequence in a two-dimensional space defined by the fraction of positively charged residues (f+, x-axis) and the fraction of negatively charged residues (f−, y-axis) (Fig. 2). This space is divided into five possible regions (R1–R5). For sequences in R3, R4, and R5, the charge content and charge patterning are likely to have a substantial impact on the conformational behavior of the sequence, while for sequences in R1 and R2, other types of residues may be more influential (such as aromatic, hydrophobic, or proline residues). It was originally proposed that IDRs depleted in charged residues (those in R1) may be highly compact or even molten-globule-like, a sentiment echoed in the subsequent CIDER paper [37, 40]. While some sequences that lie in R1 may behave in this way, over the last few years, results from multiple labs have provided many examples of R1 sequences that are not substantially compact [27, 54–58]. This reflects, at least in part, the complex interplay between heteropolymeric sequences, distinct types of amino acids, and non-charge-mediated forms of favorable protein–solvent interactions. However, there are also examples of sequences that do fall within this region (or in R2) that are more compact [23, 38, 59–64], exemplifying the complexity of biological

[Figure 7: histogram; x-axis, hydropathy (Kyte-Doolittle) (1–5); y-axis, number of sequences (0–2500); annotation: ~70% of human IDRs]

Fig. 7 The distribution of the mean hydropathy as calculated using the Kyte-Doolittle hydrophobicity scale for the IDRs in the human proteome. The majority (~70%) of human IDRs fall within a window of 3.5 ± 0.45

sequence-to-ensemble relationships. Importantly, the regional classification of an IDR on the diagram of states should not be taken as providing “proof” of any kind of biophysical behavior, but instead should motivate the types of sequence features that may be of interest for further examination. As a final but important point, the boundaries between the different regions on the diagram of states should not be taken as hard thresholds. The subdivision of the f+/f− plane requires specific lines, but these lines are not meant to provide strict boundaries across which radically different behavior is expected immediately on either side.

3.3.6 The localCIDER Software Package

We have focused here on analysis that can be performed using the CIDER webserver. However, the same analysis (and much more) can be performed on a proteome-wide scale using the localCIDER package. The CIDER webserver is deliberately capped at ten sequences per run—if a user has many more sequences than this, localCIDER may be a much more effective option. In addition to the analyses described above, localCIDER provides tools for a wide variety of additional analytical approaches including other types of patterning (e.g., Ω), the application of reduced alphabets, analysis of sequence complexity, and various other sequence features (see


Note 6) [27]. A complete summary of the localCIDER functionality is provided with the online localCIDER documentation (http://pappulab.github.io/localCIDER/).

3.3.7 Example: A Case Study on p27(100–198)

In the interest of providing a pedagogical experience for the reader, we have chosen to present a case study, using the sequence analysis of the C-terminal IDR from p27 (CDKN1B) as our example. We chose this IDR for two reasons: it has been extensively characterized biophysically by Das et al., both computationally and experimentally, and, as a stand-alone sequence, it provides tangible examples of several of the features discussed in this chapter [25]. A CIDER analysis of p27(100–198) reveals an FCR of 0.28, which is on the higher side (Fig. 8). In contrast, the NCPR is exactly 0.0, demonstrating that the p27 C-terminal IDR is a perfectly symmetrical polyampholyte. Such a combination of near-zero NCPR and relatively high FCR suggests that charge patterning may play a role in determining the overall conformational behavior. Indeed, the κ value associated with the C-terminal IDR of p27 is 0.303, a relatively high value, suggesting that charge shuffle mutants might have a substantial impact on the conformational preferences of the p27-IDR. In keeping with this hypothesis, mutants were designed and tested computationally and experimentally, revealing a well-defined dependence of the global dimensions on κ [25]. Interestingly, despite the fact that charge patterning provides rheostatic control over the global dimensions, the functional readout (phosphorylation efficiency) shows a more complex and non-monotonic relationship with charge patterning. Such a result highlights the inherent challenges in relating sequence to function in IDRs: despite uncovering a predictive biophysical model for conformational behavior, this description is insufficient to fully explain the biologically relevant functional readout. In the case of p27, an auxiliary interaction motif was then identified by considering the linear charge density, providing mechanistic insight into the observed functional behavior.
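The charge parameters quoted in this case study can be checked with a few lines of Python. The sketch below assumes, in line with Note 2, that Lys and Arg each carry +1, that Asp and Glu each carry −1, and that histidine and the termini are neutral. The function name charge_fractions is hypothetical and is not the localCIDER interface.

```python
POSITIVE = set("KR")  # assumed +1 at neutral pH
NEGATIVE = set("DE")  # assumed -1 at neutral pH


def charge_fractions(sequence: str) -> dict:
    """Return f+, f-, FCR and NCPR for a sequence (histidine treated as neutral)."""
    seq = sequence.upper()
    if not seq:
        raise ValueError("sequence must contain at least one residue")
    n_pos = sum(aa in POSITIVE for aa in seq)
    n_neg = sum(aa in NEGATIVE for aa in seq)
    return {
        "f_plus": n_pos / len(seq),
        "f_minus": n_neg / len(seq),
        "FCR": (n_pos + n_neg) / len(seq),   # fraction of charged residues
        "NCPR": (n_pos - n_neg) / len(seq),  # net charge per residue
    }


# C-terminal IDR of p27 (residues 100-198), as listed in Fig. 8.
P27_IDR = (
    "KVPAQESQDVSGSRPAAPLIGAPANSEDTHLVDPKTDPSD"
    "SQTGLAEQCAGIRKRPATDDSSTQNKRANRTEENVSDGSP"
    "NAGSVEQTPKKPGLRRRQT"
)
result = charge_fractions(P27_IDR)
print(round(result["FCR"], 3), result["NCPR"])  # 0.283 0.0
```

The sequence contains 14 positively and 14 negatively charged residues out of 99, which gives the FCR of 0.283 and the NCPR of exactly zero reported in Fig. 8. The f_plus and f_minus values are the coordinates used to place a sequence on the diagram of states.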

3.3.8 Limitations of CIDER Sequence Analysis

As with any tool or analysis protocol, there exist a number of important limitations associated with CIDER/localCIDER analysis. Recognizing those limitations is critical for interpreting the underlying analysis and for drawing general conclusions. An important limitation, discussed in Note 2, is the assumption of model compound pKas, an assumption which recent work has called into question [65]. Another general limitation is the assumption of additivity with respect to physicochemical properties. This assumption is not without reason—seminal work from Tanford, Record, Bolen, and others has robustly established the general validity of an additive model for protein-solvent interaction in heteropolymers [32, 66–

>p27100-198
KVPAQESQDVSGSRPAAPLIGAPANSEDTHLVDPKTDPSD
SQTGLAEQCAGIRKRPATDDSSTQNKRANRTEENVSDGSP
NAGSVEQTPKKPGLRRRQT

FCR                : 0.283
NCPR               : 0.000
kappa              : 0.303
hydropathy         : 3.277
disorder promoting : 0.828

Fig. 8 Sequence analysis of the C-terminal IDR from p27 (p27(100–198)). The main sequence parameters described in this chapter are presented

72]. Nevertheless, emergent chain-chain interactions in IDRs may influence local hydrophobicity, charge, and hydrogen-bonding potential in a nonadditive manner. With this in mind, the types of analysis provided by CIDER and localCIDER are inherently limited to a linear (one-dimensional) representation. We anticipate that new methods that move toward higher-order analysis will provide new types of complementary insight. Finally, an important albeit relatively specific limitation arises with respect to charge (κ), charge and proline (Ω), or arbitrary types of patterning computed through κ-like parameters [27, 37, 40, 63]. These parameters lose coherence and meaning as the sequence composition becomes highly asymmetric with respect to the residues found in one of the patterning groups (see Note 7). From a biological standpoint, this is of limited practical concern. In general, a sequence for which a patterning parameter yields a value greater than 1 is also well beyond the limit in which that parameter would have any real meaning due to the sequence composition. However, in the context of sequence design, this can introduce sharp discontinuities and non-monotonic behavior in terms of patterning parameters with respect to composition. For a more extended discussion on this, we recommend the supplementary information associated with the original CIDER paper [40].

3.3.9 Additional Approaches/Principles for the Analysis of Disordered Protein Sequences

The sequence analysis performed through CIDER and localCIDER provides a means to elucidate specific compositional attributes within an amino acid sequence. This analysis considers the standard 20-amino-acid alphabet and is inherently limited to features that can be uncovered using this “encoding” scheme. Furthermore, CIDER and localCIDER take a physics-based approach to sequence analysis, an approach distinct from many well-known types of bioinformatics tools which are implicitly or explicitly knowledge-based. Physics-based and knowledge-based tools offer complementary approaches for understanding the sequence determinants of biophysical behavior and function in IDRs. With


this in mind, we will briefly outline additional knowledge-based approaches that provide information of direct relevance to the types of analysis that CIDER and localCIDER can offer. A crucial set of tools for physics-based approaches are those that allow the accurate and efficient prediction of IDRs from sequence data alone (i.e., disorder predictors). Disorder predictors are typically trained using experimental data to allow them to predict regions of local disorder. A large number of distinct predictors have been developed over the years, with metapredictors taking a central role for proteome-wide prediction more recently [73–78]. Unlike stand-alone disorder predictors, metapredictors compute disorder profiles using several different stand-alone predictors and provide a consensus analysis. While powerful, metapredictors often do not allow a user to supply their own arbitrary sequence as they typically pre-calculate consensus parameters on a whole-proteome scale. As well as providing information on predictions, the most recent incarnation of the metapredictor MobiDB (MobiDB 3.0) provides a wealth of additional information from a number of curated databases, offering users a wide range of useful information for naturally occurring sequences [78]. If de novo disorder predictions are needed, we have found that MobiDB-lite provides a convenient command-line tool that can be integrated into larger bioinformatics pipelines [79]. Similarly, IUPred remains a convenient, fast, and powerful stand-alone predictor [53, 77]. In addition to predictors, various databases exist that categorize experimentally verified IDRs, the most extensive of which is DisProt [80]. Posttranslational modifications (PTMs) provide a means to dynamically regulate the chemical identity of an amino acid sequence, substantially expanding the standard 20-residue alphabet. Given the emphasis placed in this chapter on the role of charged residues, the importance of PTMs should be clear. 
Phosphorylation can add negative charge, acetylation can remove positive charge, and methylation can modify the chemical nature of charged sidechains [28, 81–83]. The integration of sequence analysis with extant PTM data represents an important aspect not explicitly covered by CIDER and localCIDER, although the development of bioinformatics pipelines that facilitate this type of analysis is straightforward. The interplay among different PTMs and their collective impact on function (and structure propensity) is also critical and represents an aspect of IDR regulation that likely plays an important role in many contexts [84]. Short linear motifs (SLiMs), also known as eukaryotic linear motifs (ELMs), have been the subject of extensive discussion and analysis, and their importance in dictating biological function is undeniable [50, 85, 86]. An open challenge in the field is the accurate de novo prediction of SLiMs, an issue well motivated by


the fact that two identical pseudo-SLiMs found in different proteins will not necessarily share equivalent binding properties, hinting at a broader context-based regulation [87, 88]. Various tools and databases for identifying SLiMs have been developed over the years, and these may offer complementary insight to CIDER-based analysis when available [51, 89]. Finally, new approaches to describe and classify IDRs are needed to develop our understanding of how specific types of disordered domains may mediate function. Approaches that combine physics-based and knowledge-based informatics offer an opportunity to overcome the limitations associated with each class of methods individually, although much work remains to be done here. An ultimate goal is to construct generalized discovery pipelines, in which sequence analysis reveals an initial hypothesis, this hypothesis is tested through a combination of computational and in vitro experiments, and the functional relevance of the underlying behavior is finally tested in cells. While such a pipeline represents the ideal scenario, developments over the last five years have made substantial progress in allowing specific, information-rich mutations to be identified [14, 21, 23, 25, 29, 30, 90]. We anticipate that as new approaches for sequence analysis emerge and high-throughput experiments become more tractable, the rate of novel discoveries in the context of IDR-mediated function will further increase.

4 Notes

1. The CIDER webserver will remove nonstandard characters from the sequence (e.g., spaces, numbers) so sequences can be copied and pasted directly from UniProt. The CIDER webserver is not a disorder predictor, and the underlying assumption is that the sequences provided have already been predicted/demonstrated to be disordered. The webserver allows for up to ten sequences to be analyzed simultaneously.

2. Among the most important limitations is the assumption of model compound pKa values in calculating charge-related information. In the context of IDRs and unfolded proteins, fluctuations in the local chemical environment can have a significant impact on the pKa values of charged residues [91–93]. New methods that can dynamically estimate the pKa values of titratable groups for biophysical ensembles will play a key role in further developing tools for sequence analysis. A related limitation is that for charge-based analysis, CIDER assumes histidine to be a neutral residue. The model compound pKa of the histidine sidechain is 6.8, meaning that if we assume the


intracellular pH to be around 7.1, the majority of histidine should, in theory, be deprotonated and neutral. However, this assumes both a uniform pH and unmodified pKa values, neither of which is likely to be universally valid. We also do not consider the N- and C-termini of polypeptides as contributing to the charge status of an IDR, in part because IDRs are often analyzed as locally excised regions taken from a larger protein, in which case the “termini” are an artificial modification formed due to excision. In addition, the termini are by definition not within the main chain, and as such their impact on the overall conformational behavior is less significant than if they were embedded far from the ends.

3. The default sliding window used for linear sequence analysis is five residues. This value was chosen as it is approximately the blob size for polypeptides [37]. However, within localCIDER, changing the window size is simple and can reveal sequence features that emerge over longer length scales. For example, increasing the window size to 10 can uncover regions with a global net charge that would be “invisible” with a window size of 5. However, at least in our experience, at a window size beyond 20–30 residues, local sequence features are largely “smeared” out due to local averaging.

4. A detailed mathematical description of how to calculate κ is provided in both the original paper in which the parameter is defined and in the CIDER paper (notably see the supplementary information) [37, 40]. With this in mind, we will avoid repetition, but instead provide a more general and intuitive description of how κ is calculated. κ provides a measure of how similar the overall NCPR is compared to the local NCPR along the primary sequence.
Specifically, in calculating κ the overall charge asymmetry (σ) for the entire sequence is calculated first, then the charge asymmetry for each five- to six-residue blob (σi) is calculated, and the average deviation between the overall charge asymmetry and the local charge asymmetry is determined (δ). This provides an un-normalized measure of the extent to which the local charge density deviates from the global charge density. To normalize, a new sequence with the same composition but the most segregated possible charge patterning is generated, and the δ for this sequence is calculated (δmax). This value acts as a composition-specific normalization factor such that κ = δ/δmax.

5. In the Notch transcriptional circuit, using sequence design to titrate κ for the Notch IDR from 0.15 to 0.74 (wild-type κ = 0.31) yields close to the maximum possible change in the global chain dimensions as well as major shifts in transcriptional activation [21]. However, the transcriptional change does not perfectly correlate with global dimensions, implying an


interplay between several different factors. In work by Nott et al., charge patterning for a polyampholytic IDR was shown to influence liquid–liquid phase separation, a result that was further generalized in elegant work by Lin and Chan [94, 95].

6. A reduced alphabet refers to taking the standard 20 amino acids and grouping certain residues together based on some selection criteria. How these groups are decided depends on the question of interest but typically will reflect some feature that unites different residue types. For example, a common type of reduced alphabet is to combine polar (Gln, Asn, Gly, Thr, Ser, His, Cys), hydrophobic (Ile, Leu, Val, Met, Ala, Tyr, Trp, Phe), charged (Asp, Glu, Lys, Arg), and then proline as its own group. Sequence complexity reflects the number of different amino acid types contained within a sequence (or local sequence region). The most common methods approach this question from an information-theory perspective, asking how many bits of information are needed to encode a sequence of characters (amino acids) [96].

7. In principle, κ and patterning parameters that are derivatives of κ should always lie between 0 and 1. However, for sequences with a composition that is highly depleted in one or more of the residue groups, a value greater than 1 can be obtained. For example, when calculating κ, a sequence of 24 Glu, 24 Asp, 2 Arg, and 2 Lys (and no other residues) is highly depleted for positively charged residues and, depending on the patterning, could yield a κ value greater than 1. The main takeaway from such a result would be that (a) the sequence is highly segregated but (b) global charge patterning is likely not a key determinant of function or conformational behavior, although local charge density may be important and can be assessed via the linear NCPR and FCR analysis.

References

1. Whisstock JC, Lesk AM (2003) Prediction of protein function from protein sequence and structure. Q Rev Biophys 36:307–340
2. Sadowski MI, Jones DT (2009) The sequence–structure relationship and protein function prediction. Curr Opin Struct Biol 19:357–362
3. Kühlbrandt W (2004) Biology, structure and mechanism of P-type ATPases. Nat Rev Mol Cell Biol 5:282–295
4. Caruthers JM, McKay DB (2002) Helicase structure and mechanism. Curr Opin Struct Biol 12:123–133
5. Hollenstein K, Dawson RJP, Locher KP (2007) Structure and mechanism of ABC transporter proteins. Curr Opin Struct Biol 17:412–418
6. Reetz MT, Carballeira JD (2007) Iterative saturation mutagenesis (ISM) for rapid directed evolution of functional enzymes. Nat Protoc 2:891–903
7. McLaughlin RN Jr, Poelwijk FJ, Raman A, Gosal WS, Ranganathan R (2012) The spatial architecture of protein function and adaptation. Nature 491:138–142
8. Fowler DM, Araya CL, Fleishman SJ, Kellogg EH, Stephany JJ, Baker D, Fields S (2010) High-resolution mapping of protein sequence-function relationships. Nat Methods 7:741–746

9. Kitzman JO, Starita LM, Lo RS, Fields S, Shendure J (2015) Massively parallel single-amino-acid mutagenesis. Nat Methods 12:203–206, 4 p following 206
10. Dorrity MW, Queitsch C, Fields S (2019) High-throughput identification of dominant negative polypeptides in yeast. Nat Methods 16:413–416
11. van der Lee R, Buljan M, Lang B, Weatheritt RJ, Daughdrill GW, Dunker AK, Fuxreiter M, Gough J, Gsponer J, Jones DT, Kim PM, Kriwacki RW, Oldfield CJ, Pappu RV, Tompa P, Uversky VN, Wright PE, Babu MM (2014) Classification of intrinsically disordered regions and proteins. Chem Rev 114:6589–6631
12. Wright PE, Dyson HJ (1999) Intrinsically unstructured proteins: re-assessing the protein structure-function paradigm. J Mol Biol 293:321–331
13. Dyson HJ, Wright PE (2005) Intrinsically unstructured proteins and their functions. Nat Rev Mol Cell Biol 6:197–208
14. Staller MV, Holehouse AS, Swain-Lenz D, Das RK, Pappu RV, Cohen BA (2018) A high-throughput mutational scan of an intrinsically disordered acidic transcriptional activation domain. Cell Syst 6:444–455.e6
15. Boothby TC, Tapia H, Brozena AH, Piszkiewicz S, Smith AE, Giovannini I, Rebecchi L, Pielak GJ, Koshland D, Goldstein B (2017) Tardigrades use intrinsically disordered proteins to survive desiccation. Mol Cell 65:975–984.e5
16. Wright PE, Dyson HJ (2014) Intrinsically disordered proteins in cellular signalling and regulation. Nat Rev Mol Cell Biol 16:18–29
17. Powers SK, Holehouse AS, Korasick DA, Schreiber KH, Clark NM, Jing H, Emenecker R, Han S, Tycksen E, Hwang I, Sozzani R, Jez JM, Pappu RV, Strader LC (2019) Nucleo-cytoplasmic partitioning of ARF proteins controls auxin responses in Arabidopsis thaliana. Mol Cell 76(1):177–190. https://doi.org/10.1016/j.molcel.2019.06.044
18. Uversky VN, Oldfield CJ, Dunker AK (2008) Intrinsically disordered proteins in human diseases: introducing the D2 concept. Annu Rev Biophys 37:215–246
19. Brown CJ, Johnson AK, Dunker AK, Daughdrill GW (2011) Evolution and disorder. Curr Opin Struct Biol 21:441–446
20. Buske PJ, Levin PA (2013) A flexible C-terminal linker is required for proper FtsZ assembly in vitro and cytokinetic ring formation in vivo. Mol Microbiol 89:249–263
21. Sherry KP, Das RK, Pappu RV, Barrick D (2017) Control of transcriptional activity by design of charge patterning in the intrinsically disordered RAM region of the notch receptor. Proc Natl Acad Sci U S A 114:E9243–E9252
22. Keul ND, Oruganty K, Schaper Bergman ET, Beattie NR, McDonald WE, Kadirvelraj R, Gross ML, Phillips RS, Harvey SC, Wood ZA (2018) The entropic force generated by intrinsically disordered segments tunes protein function. Nature 563:584–588
23. Riback JA, Katanski CD, Kear-Scott JL, Pilipenko EV, Rojek AE, Sosnick TR, Drummond DA (2017) Stress-triggered phase separation is an adaptive, evolutionarily tuned response. Cell 168:1028–1040.e19
24. Ota H, Fukuchi S (2017) Sequence conservation of protein binding segments in intrinsically disordered regions. Biochem Biophys Res Commun 494:602–607
25. Das RK, Huang Y, Phillips AH, Kriwacki RW, Pappu RV (2016) Cryptic sequence features within the disordered protein p27Kip1 regulate cell cycle signaling. Proc Natl Acad Sci U S A 113:5616–5621
26. Bouchard JJ, Otero JH, Scott DC, Szulc E, Martin EW, Sabri N, Granata D, Marzahn MR, Lindorff-Larsen K, Salvatella X, Schulman BA, Mittag T (2018) Cancer mutations of the tumor suppressor SPOP disrupt the formation of active, phase-separated compartments. Mol Cell 72:19–36.e8
27. Martin EW, Holehouse AS, Grace CR, Hughes A, Pappu RV, Mittag T (2016) Sequence determinants of the conformational properties of an intrinsically disordered protein prior to and upon multisite phosphorylation. J Am Chem Soc 138:15323–15335
28. Ortega E, Rengachari S, Ibrahim Z, Hoghoughi N, Gaucher J, Holehouse AS, Khochbin S, Panne D (2018) Transcription factor dimerization activates the p300 acetyltransferase. Nature 562:538–544
29. Wang J, Choi J-M, Holehouse AS, Lee HO, Zhang X, Jahnel M, Maharana S, Lemaitre R, Pozniakovsky A, Drechsel D, Poser I, Pappu RV, Alberti S, Hyman AA (2018) A molecular grammar governing the driving forces for phase separation of prion-like RNA binding proteins. Cell 174:688–699.e16
30. Das RK, Ruff KM, Pappu RV (2015) Relating sequence encoded information to form and function of intrinsically disordered proteins. Curr Opin Struct Biol 32:102–112
31. Mao AH, Lyle N, Pappu RV (2013) Describing sequence-ensemble relationships for intrinsically disordered proteins. Biochem J 449:307–318
32. Holehouse AS, Pappu RV (2018) Collapse transitions of proteins and the interplay among backbone, sidechain, and solvent interactions. Annu Rev Biophys 47:19–39
33. Finn RD, Bateman A, Clements J, Coggill P, Eberhardt RY, Eddy SR, Heger A, Hetherington K, Holm L, Mistry J, Sonnhammer ELL, Tate J, Punta M (2014) Pfam: the protein families database. Nucleic Acids Res 42:D222–D230
34. Mao AH, Crick SL, Vitalis A, Chicoine CL, Pappu RV (2010) Net charge per residue modulates conformational ensembles of intrinsically disordered proteins. Proc Natl Acad Sci U S A 107:8183–8188
35. Müller-Späth S, Soranno A, Hirschfeld V, Hofmann H, Rüegger S, Reymond L, Nettels D, Schuler B (2010) Charge interactions can dominate the dimensions of intrinsically disordered proteins. Proc Natl Acad Sci U S A 107:14609–14614
36. Marsh JA, Forman-Kay JD (2010) Sequence determinants of compaction in intrinsically disordered proteins. Biophys J 98:2383–2390
37. Das RK, Pappu RV (2013) Conformations of intrinsically disordered proteins are influenced by linear sequence distributions of oppositely charged residues. Proc Natl Acad Sci U S A 110:13392–13397
38. Sørensen CS, Kjaergaard M (2019) Effective concentrations enforced by intrinsically disordered linkers are governed by polymer physics. bioRxiv 577536. https://doi.org/10.1101/577536
39. Hofmann H, Soranno A, Borgia A, Gast K, Nettels D, Schuler B (2012) Polymer scaling laws of unfolded and intrinsically disordered proteins quantified with single-molecule spectroscopy. Proc Natl Acad Sci U S A 109:16155–16160
40. Holehouse AS, Das RK, Ahad JN, Richardson MOG, Pappu RV (2017) CIDER: resources to analyze sequence-ensemble relationships of intrinsically disordered proteins. Biophys J 112:16–21
41. Conda: Conda documentation (n.d.). https://docs.conda.io/en/latest/. Accessed June 12, 2019
42. Borgia A, Borgia MB, Bugge K, Kissling VM, Heidarsson PO, Fernandes CB, Sottini A, Soranno A, Buholzer KJ, Nettels D, Kragelund BB, Best RB, Schuler B (2018) Extreme disorder in an ultrahigh-affinity protein complex. Nature 555:61
43. Cohan MC, Posey AE, Grigsby SJ, Mittal A, Holehouse AS, Buske PJ, Levin PA, Pappu RV (2018) Evolved sequence features within the intrinsically disordered tail influence FtsZ assembly and bacterial cell division. bioRxiv 301622. https://doi.org/10.1101/301622
44. Campen A, Williams RM, Brown CJ, Meng J, Uversky VN, Dunker AK (2008) TOP-IDP-scale: a new amino acid scale measuring propensity for intrinsic disorder. Protein Pept Lett 15:956–963
45. Dyson HJ, Wright PE (2002) Coupling of folding and binding for unstructured proteins. Curr Opin Struct Biol 12:54–60
46. Sugase K, Dyson HJ, Wright PE (2007) Mechanism of coupled folding and binding of an intrinsically disordered protein. Nature 447:1021–1025
47. Rogers JM, Oleinikovas V, Shammas SL, Wong CT, De Sancho D, Baker CM, Clarke J (2014) Interplay between partner and ligand facilitates the folding and binding of an intrinsically disordered protein. Proc Natl Acad Sci U S A 111:15420–15425
48. Kyte J, Doolittle RF (1982) A simple method for displaying the hydropathic character of a protein. J Mol Biol 157:105–132
49. Ravarani CNJ, Erkina TY, De Baets G, Dudman DC, Erkine AM, Madan Babu M (2018) High-throughput discovery of functional disordered regions: investigation of transactivation domains. Mol Syst Biol 14:e8190
50. Davey NE, Van Roey K, Weatheritt RJ, Toedt G, Uyar B, Altenberg B, Budd A, Diella F, Dinkel H, Gibson TJ (2012) Attributes of short linear motifs. Mol BioSyst 8:268–281
51. Gouw M, Michael S, Sámano-Sánchez H (2018) The eukaryotic linear motif resource–2018 update. Nucleic Acids Res 46(D1):D428–D434. https://academic.oup.com/nar/article-abstract/46/D1/D428/4612965
52. Van Roey K, Uyar B, Weatheritt RJ, Dinkel H, Seiler M, Budd A, Gibson TJ, Davey NE (2014) Short linear motifs: ubiquitous and functionally diverse protein interaction modules directing cell regulation. Chem Rev 114:6733–6778
53.
Me´sza´ros B, Erdos G, Doszta´nyi Z (2018) IUPred2A: context-dependent prediction of protein disorder as a function of redox state and protein binding. Nucleic Acids Res 46: W329–W337 54. Fuertes G, Banterle N, Ruff KM, Chowdhury A, Mercadante D, Koehler C, Kachala M, Estrada Girona G, Milles S,

124

Garrett M. Ginell and Alex S. Holehouse

Mishra A, Onck PR, Gr€a ter F, EstebanMartı´n S, Pappu RV, Svergun DI, Lemke EA (2017) Decoupling of size and shape fluctuations in heteropolymeric sequences reconciles discrepancies in SAXS vs. FRET measurements. Proc Natl Acad Sci U S A 114:E6342–E6351 55. Mercadante D, Milles S, Fuertes G, Svergun DI, Lemke EA, Gr€a ter F (2015) Kirkwoodbuff approach rescues Overcollapse of a disordered protein in canonical protein force Fields. J Phys Chem B 119:7975–7984 56. Gibbs EB, Lu F, Portz B, Fisher MJ, Medellin BP, Laremore TN, Zhang YJ, Gilmour DS, Showalter SA (2017) Phosphorylation induces sequence-specific conformational switches in the RNA polymerase II C-terminal domain. Nat Commun 8:15233 57. Portz B, Lu F, Gibbs EB, Mayfield JE, Rachel Mehaffey M, Zhang YJ, Brodbelt JS, Showalter SA, Gilmour DS (2017) Structural heterogeneity in the intrinsically disordered RNA polymerase II C-terminal domain. Nat Commun 8:15231 58. Riback JA, Bowman MA, Zmyslowski AM, Knoverek CR, Jumper JM, Hinshaw JR, Kaye EB, Freed KF, Clark PL, Sosnick TR (2017) Innovative scattering analysis shows that hydrophobic disordered proteins are expanded in water. Science 358:238–241 59. Warner JB 4th, Ruff KM, Tan PS, Lemke EA, Pappu RV, Lashuel HA (2017) Monomeric Huntingtin exon 1 has similar overall structural features for wild-type and pathological Polyglutamine lengths. J Am Chem Soc 139:14456–14469 60. Mukhopadhyay S, Krishnan R, Lemke EA, Lindquist S, Deniz AA (2007) A natively unfolded yeast prion monomer adopts an ensemble of collapsed and rapidly fluctuating structures. Proc Natl Acad Sci U S A 104:2649–2654 61. Crick SL, Jayaraman M, Frieden C, Wetzel R, Pappu RV (2006) Fluorescence correlation spectroscopy shows that monomeric polyglutamine molecules form collapsed structures in aqueous solutions. Proc Natl Acad Sci U S A 103:16764–16769 62. 
Peran I, Holehouse AS, Carrico IS, Pappu RV, Bilsel O, Raleigh DP (2019) Unfolded states under folding conditions accommodate sequence-specific conformational preferences with random coil-like dimensions. Proc Natl Acad Sci U S A 116:12301–12310. 63. Martin EW, Holehouse AS, Peran I, Farag M, Incicco JJ, Bremer A, Grace CR, Soranno A, Pappu RV, Mittag T (2020) Valence and patterning of aromatic residues determine the

phase behavior of prion-like domains. Science 367:694–699 64. Longhi S, Receveur-Bre´chot V, Karlin D, Johansson K, Darbon H, Bhella D, Yeo R, Finet S, Canard B (2003) The C-terminal domain of the measles virus nucleoprotein is intrinsically disordered and folds upon binding to the C-terminal moiety of the phosphoprotein. J Biol Chem 278:18638–18648 65. Fossat MJ, Pappu RV (2019) Q-canonical Monte Carlo sampling for Modeling the linkage between charge regulation and conformational Equilibria of peptides. J Phys Chem B 123:6952–6967 66. Tanford C (1968) Protein denaturation. Adv Protein Chem 23:121–282 67. Tanford C (1964) Isothermal unfolding of globular proteins in aqueous urea solutions. J Am Chem Soc 86:2050–2059 68. Nozaki Y, Tanford C (1963) The solubility of amino acids and related compounds in aqueous urea solutions. J Biol Chem 238:4074–4081 69. Auton M, Holthauzen LMF, Bolen DW (2007) Anatomy of energetic changes accompanying urea-induced protein denaturation. Proc Natl Acad Sci U S A 104:15317–15322 70. Auton M, Bolen DW (2004) Additive transfer free energies of the peptide backbone unit that are independent of the model compound and the choice of concentration scale. Biochemistry 43:1329–1342 71. Felitsky DJ, Record MT Jr (2003) Thermal and urea-induced unfolding of the marginally stable lac repressor DNA-binding domain: a model system for analysis of solute effects on protein processes. Biochemistry 42:2202–2217 72. Cannon JG, Anderson CF, Record MT Jr (2007) Urea-amide preferential interactions in water: quantitative comparison of model compound data with biopolymer results using water accessible surface areas. J Phys Chem B 111:9675–9685 73. Oates ME, Romero P, Ishida T, Ghalwash M, Mizianty MJ, Xue B, Doszta´nyi S, Uversky VN, Obradovic Z, Kurgan L, Dunker AK, Gough J (2013) D2P2: database of disordered protein predictions. Nucleic Acids Res 41:D508–D516 74. 
Ward JJ, Sodhi JS, McGuffin LJ, Buxton BF, Jones DT (2004) Prediction and functional analysis of native disorder in proteins from the three kingdoms of life. J Mol Biol 337:635–645 75. Ward JJ, McGuffin LJ, Bryson K, Buxton BF, Jones DT (2004) The DISOPRED server for the prediction of protein disorder. Bioinformatics 20:2138–2139

Sequence Analysis of IDRs 76. He B, Wang K, Liu Y, Xue B, Uversky VN, Dunker AK (2009) Predicting intrinsic disorder in proteins: an overview. Cell Res 19:929–949 77. Doszta´nyi Z, Csizmok V, Tompa P, Simon I (2005) IUPred: web server for the prediction of intrinsically unstructured regions of proteins based on estimated energy content. Bioinformatics 21:3433–3434 78. Piovesan D, Tabaro F, Paladin L, Necci M, Micetic I, Camilloni C, Davey N, Doszta´nyi Z, Me´sza´ros B, Monzon AM, Parisi G, Schad E, Sormanni P, Tompa P, Vendruscolo M, Vranken WF, Tosatto SCE (2018) MobiDB 3.0: more annotations for intrinsic disorder, conformational diversity and interactions in proteins. Nucleic Acids Res 46:D471–D476 79. Necci M, Piovesan D, Doszta´nyi Z, Tosatto SCE (2017) MobiDB-lite: fast and highly specific consensus prediction of intrinsic disorder in proteins. Bioinformatics 33:1402–1404 80. Piovesan D, Tabaro F, Micˇetic´ I, Necci M, Quaglia F, Oldfield CJ, Aspromonte MC, Davey NE, Davidovic´ R, Doszta´nyi Z, Elofsson A, Gasparini A, Hatos A, Kajava AV, Kalmar L, Leonardi E, Lazar T, MacedoRibeiro S, Macossay-Castillo M, Meszaros A, Minervini G, Murvai N, Pujols J, Roche DB, Salladini E, Schad E, Schramm A, Szabo B, Tantos A, Tonello F, Tsirigos KD, Veljkovic´ N, Ventura S, Vranken W, Warholm P, Uversky VN, Dunker AK, Longhi S, Tompa P, Tosatto SCE (2017) DisProt 7.0: a major update of the database of disordered proteins. Nucleic Acids Res 45: D1123–D1124 81. Monahan Z, Ryan VH, Janke AM, Burke KA, Rhoads SN, Zerze GH, O’Meally R, Dignon GL, Conicella AE, Zheng W, Best RB, Cole RN, Mittal J, Shewmaker F, Fawzi NL (2017) Phosphorylation of the FUS low-complexity domain disrupts phase separation, aggregation, and toxicity. EMBO J 36:2951–2967 82. 
Qamar S, Wang G, Randle SJ, Ruggeri FS, Varela JA, Lin JQ, Phillips EC, Miyashita A, Williams D, Stro¨hl F, Meadows W, Ferry R, Dardov VJ, Tartaglia GG, Farrer LA, Kaminski Schierle GS, Kaminski CF, Holt CE, Fraser PE, Schmitt-Ulms G, Klenerman D, Knowles T, Vendruscolo M, St George-Hyslop P (2018) FUS phase separation is modulated by a molecular chaperone and methylation of arginine Cation-π interactions. Cell 173:720–734.e15 83. Hofweber M, Hutten S, Bourgeois B, Spreitzer E, Niedner-Boblenz A, Schifferer M, Ruepp M-D, Simons M, Niessing D, Madl T, Dormann D (2018) Phase separation of FUS is

125

suppressed by its nuclear import receptor and arginine methylation. Cell 173:706–719.e13 84. Woodsmith J, Kamburov A, Stelzl U (2013) Dual coordination of post translational modifications in human protein networks. PLoS Comput Biol 9:e1002933 85. Tompa P, Davey NE, Gibson TJ, Babu MM (2014) A million peptide motifs for the molecular biologist. Mol Cell 55:161–169 86. Davey NE, Morgan DO (2016) Building a regulatory network with short linear sequence motifs: lessons from the Degrons of the anaphase-promoting complex. Mol Cell 64:12–23 87. Davey NE, Seo M, Yadav VK, Jeon J, Nim S, Krystkowiak I, Blikstad C, Dong D, Markova N, Kim PM, Ivarsson Y (2017) Discovery of short linear motif-mediated interactions through phage display of intrinsically disordered regions of the human proteome. FEBS J 284:485–498 88. Richardson MOG, Holehouse AS, Langstein I, Korber P, Pappu RV (2018) Large-Scale Analysis of the Evolution of Functions Mediated by Intrinsically Disordered Regions. Biophys J 114:79a 89. Davey NE, Haslam NJ, Shields DC, Edwards RJ (2011) SLiMSearch 2.0: biological context for short linear motifs in proteins. Nucleic Acids Res 39:W56–W60 90. Zarin T, Tsai CN, Nguyen Ba AN, Moses AM (2017) Selection maintains signaling function of a highly diverged intrinsically disordered region. Proc Natl Acad Sci U S A 114: E1450–E1459 91. Franzmann TM, Jahnel M, Pozniakovsky A, Mahamid J, Holehouse AS, Nu¨ske E, Richter D, Baumeister W, Grill SW, Pappu RV, Hyman AA, Alberti S (2018) Phase separation of a yeast prion protein promotes cellular fitness. Science 359:eaao5654 92. Oliveberg M, Arcus VL, Fersht AR (1995) pKA values of carboxyl groups in the native and denatured states of Barnase: the pKA values of the denatured state are on average 0.4 units lower than those of model compounds. Biochemistry 34:9424–9433 93. Meng W, Raleigh DP (2011) Analysis of electrostatic interactions in the denatured state ensemble of the N-terminal domain of L9 under native conditions. 
Proteins 79:3500–3510 94. Lin Y-H, Chan HS (2017) Phase separation and single-chain compactness of charged disordered proteins are strongly correlated. Biophys J 112:2043–2046

126

Garrett M. Ginell and Alex S. Holehouse

95. Nott TJ, Petsalaki E, Farber P, Jervis D, Fussner E, Plochowietz A, Craggs TD, Bazett-Jones DP, Pawson T, Forman-Kay JD, Baldwin AJ (2015) Phase transition of a disordered nuage protein generates environmentally responsive membraneless organelles. Mol Cell 57:936–947

96. Romero P, Obradovic Z, Li X, Garner EC, Brown CJ, Dunker AK (2001) Sequence complexity of disordered protein. Proteins 42:38–48

Chapter 6

Exploring Protein Intrinsic Disorder with MobiDB

Alexander Miguel Monzon, András Hatos, Marco Necci, Damiano Piovesan, and Silvio C. E. Tosatto

Abstract

Nowadays, it is well established that many proteins or regions under physiological conditions lack a fixed three-dimensional structure and are intrinsically disordered. MobiDB is the main repository of protein disorder and mobility annotations, combining different data sources to provide an exhaustive overview of intrinsic disorder. MobiDB includes curated annotations from other databases, indirect disorder evidence from structural data, and disorder predictions from protein sequences. It provides an easy-to-use web server to visualize and explore disorder information. This chapter describes the data available in MobiDB, emphasizing how to use and access the intrinsic disorder data. MobiDB is available at URL http://mobidb.bio.unipd.it.

Key words Intrinsic disorder, Structural flexibility, MobiDB, Database, Disorder annotation, Disorder prediction, IDP, Intrinsically disordered proteins, IDR, Intrinsically disordered regions

1 Introduction

The characterization of proteins and/or regions lacking a static tertiary structure under physiological conditions (intrinsically disordered proteins [IDPs] or intrinsically disordered regions [IDRs]) defies the classical structure-function paradigm. Nowadays, IDPs and IDRs are well characterized and proven to participate in many molecular and biological processes like DNA or RNA binding, transcription, translation, and cell-cycle regulation and often in pathologies associated with misfolding and aggregation [1, 2]. Moreover, 40% of the residues in higher eukaryotic proteomes are predicted to be part of IDPs/IDRs [3, 4]. IDRs are generally unable to form a stable hydrophobic core and a structured domain due to the lack of bulky hydrophobic amino acids [5, 6]. Therefore, IDRs tend to be surface exposed and to interact with other molecules [7].

Birthe B. Kragelund and Karen Skriver (eds.), Intrinsically Disordered Proteins: Methods and Protocols, Methods in Molecular Biology, vol. 2141, https://doi.org/10.1007/978-1-0716-0524-0_6, © Springer Science+Business Media, LLC, part of Springer Nature 2020


During the last decades, the understanding of the functional role of IDPs/IDRs has advanced rapidly, with the discovery of new mechanisms [1]. A key finding was the identification of intrinsically disordered domains (IDDs), short linear motifs (SLiMs), and preformed structural elements (PSEs), which participate in many interactions and provide new functions [8–11]. Additionally, advances in the experimental detection of intrinsically disordered regions have had a big impact on structural biology [12, 13]. Missing electron densities from X-ray diffraction and cryo-electron microscopy (cryo-EM) experiments, solution techniques such as NMR spectroscopy [14, 15], and single-molecule fluorescence [16] provide complementary information for the study of IDPs/IDRs.

MobiDB has been one of the major contributors, providing disorder consensus predictions and annotations for all proteins available in UniProt and moving the field forward since its first release in 2012 [17]. MobiDB benefits from the DisProt database as the reference repository of manually curated IDRs [18]. DisProt represents the gold standard for the development of intrinsic disorder predictors, which are fundamental for MobiDB automatic annotation. As other curated databases focused on disorder have appeared, such as Intrinsically Disordered proteins with Extensive Annotations and Literature (IDEAL) [19], Disordered Binding Site (DIBS) [10], Mutual Folding Induced by Binding (MFIB) [20], the Eukaryotic Linear Motif (ELM) resource [9], and FuzDB [21], MobiDB has integrated their information. However, despite the large interest in the field and the availability of dedicated resources, the relationship between (un)structure and function of IDRs is still poorly understood, with only a few widely studied exceptions like α-synuclein, tau, p53, and E1A proteins [7].

This chapter describes how to explore the world of intrinsically disordered proteins through the MobiDB database. It aims to provide a practical guide for managing and processing the information the database contains, to explain the different levels of information confidence, and to promote best practices for using and interpreting disorder predictions.

2 Materials

The MobiDB database can be found at URL http://mobidb.bio.unipd.it. The home page (Fig. 1) shows general information about the integrated resources, database version, and citation.

Fig. 1 The MobiDB home page has two search boxes to look for proteins in the database by different criteria. Additionally, this page shows the database version, news, and the external resources which have been integrated in the database

Additionally, the home page has two text boxes to search proteins in the database by using the following criteria: (1) UniProt, UniParc, and UniRef accession codes (e.g., for "Microtubule-associated protein tau," P10636, UPI0000EE80B7, and UniRef100_P10636, respectively), (2) reference proteome accession (e.g., UP000005640 for Homo sapiens), (3) NCBI taxon ID (e.g., 10090 for Mus musculus), and (4) text search using UniProt-style queries (e.g., "human AND antigen" for all entries containing both terms). More information about how to search in the database can be found by clicking the question marks on the home page. When a search returns more than one protein, the database shows a table listing all retrieved proteins, with the following columns by default: UniProt ID, organism, length, gene name, protein name, origin dataset, and MobiDB-lite disorder content. Moreover, some example entries can be accessed by clicking on their names below the central text box on the home page.
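The identifier types accepted by the search boxes can be told apart by their format before a query is submitted. The sketch below is only an illustration of that idea: the regular expressions follow the publicly documented UniProt, UniParc, UniRef, and proteome accession conventions, while the function name and the returned labels are assumptions and not part of MobiDB.

```python
import re

# UniProt accession pattern as published in the UniProt documentation.
UNIPROT_ACC = (
    r"(?:[OPQ][0-9][A-Z0-9]{3}[0-9]"
    r"|[A-NR-Z][0-9](?:[A-Z][A-Z0-9]{2}[0-9]){1,2})"
)

# Order matters: the most specific patterns are tried first.
# The labels on the left are illustrative, not MobiDB terminology.
PATTERNS = [
    ("uniref", re.compile(rf"UniRef(?:100|90|50)_(?:{UNIPROT_ACC}|UPI[0-9A-F]{{10}})")),
    ("uniparc", re.compile(r"UPI[0-9A-F]{10}")),
    ("proteome", re.compile(r"UP[0-9]{9}")),
    ("uniprot", re.compile(UNIPROT_ACC)),
    ("taxon", re.compile(r"[0-9]+")),
]


def classify_query(query: str) -> str:
    """Guess which kind of identifier a search string is (hypothetical helper)."""
    text = query.strip()
    for label, pattern in PATTERNS:
        if pattern.fullmatch(text):
            return label
    # Anything else is treated as a free-text, UniProt-style query.
    return "text"


if __name__ == "__main__":
    for example in ["P10636", "UPI0000EE80B7", "UniRef100_P10636",
                    "UP000005640", "10090", "human AND antigen"]:
        print(example, "->", classify_query(example))
```

Checking the format first makes it easy to catch typing mistakes, such as a taxon ID entered with a thousands separator, before running a large batch of searches.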

3 Methods

3.1 Discovering Protein Intrinsic Disorder Using MobiDB 3.0

At the time of writing, MobiDB 3.0 [22] is the most up-to-date version of MobiDB. Since its creation in 2012 [17] and first major renewal in 2014 [23], MobiDB has become a central resource for large-scale protein intrinsic disorder and mobility annotations.


MobiDB comprises different types of disorder annotations and different quality levels of disorder evidence. The information is organized in three types of annotations: intrinsically disordered regions (IDRs), linear interacting peptides (LIPs), and dynamic structure/secondary structure populations. While the meaning of intrinsic disorder should be clear at this point, the concepts of LIPs and secondary structure populations may be less straightforward. LIPs are structural fragments that fold upon binding to other molecules while preserving an elongated structure. Dynamic structure, instead, refers either to the propensity of a residue to assume a specific secondary structure conformation or to the abundance of a specific conformation in a population of molecules in solution. The second definition is adopted when this information is derived from nuclear magnetic resonance (NMR) secondary chemical shifts.

The three types of annotations are further classified according to the quality of the source of information they come from. High-quality data are imported from manually curated external databases. Intermediate-quality annotations are derived from experimental data such as X-ray diffraction structures, NMR structures, and NMR chemical shifts. Disorder predictions are provided by different computational methods and represent the lowest quality and confidence, even though they have by far the largest coverage, since they include all UniProt proteins. In order to provide the most reliable disorder annotation for each protein, MobiDB combines all its data sources into a consensus annotation, prioritizing curated and indirect evidence over predictions.

The web server provides a user-friendly interface that makes it easy to explore disorder annotations for a given protein. A fully dynamic feature viewer manages the visualization of disordered regions and allows the generation of high-quality images for publication. When available, the MobiDB annotation is also projected onto a protein structure and shown in a 3D viewer. Additionally, the web server offers programmatic access through RESTful services (see Note 1).

3.1.1 Curated Data

High-quality annotations in MobiDB come from external resources which provide manually curated evidence. Intrinsically disordered regions are extracted from three databases: DisProt [18], FuzDB [21], and UniProt [24]. In addition, Pfam [25] and Gene3D [26] annotations are integrated to show complementary information on folded (not disordered) domains.

DisProt [18] is the largest database of manually curated intrinsic disorder annotations [27]. Records in DisProt are evidence-centric, and quality tags are added to an IDR when the annotation is ambiguous, for example, if an experiment was performed under extremely nonphysiological conditions (AMBEXP label) or an engineered sequence has been used (AMBSEQ label) (see Note 2) [18]. MobiDB only contains DisProt annotations without ambiguity tags. In the latest version of MobiDB, DisProt disordered regions are propagated by homology, exploiting GeneTree alignments [28] with a similarity constraint of 80%. To ensure alignment significance, only regions with at least ten residues are propagated. This procedure increases the number of proteins with IDR annotations from 800 to about 10,000.

A significant fraction of disordered regions fold upon binding. In MobiDB they are provided by the following databases: (1) IDEAL [19], where these regions are called ProS; (2) Database of Disordered Binding Sites (DIBS) [10]; (3) Mutual Folding Induced by Binding (MFIB) [20]; and (4) ELM [9]. DIBS contains disordered regions which fold upon binding to a globular partner, whereas MFIB contains only entries where both interactors are disordered. IDEAL provides a collection of manually curated IDPs/IDRs, including functional annotations such as protein binding regions and posttranslational modifications. ELM is the main resource focused on the detection and annotation of eukaryotic linear motifs.

3.1.2 Derived Data
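As a brief aside on the homology transfer described in Subheading 3.1.1, the two filters mentioned there (a similarity of at least 80% and regions of at least ten residues) can be written as a small selection step. This is a minimal sketch under stated assumptions: the `Region` class, the function name, and the idea of passing one similarity value per source/target pair are hypothetical, and the text does not specify how MobiDB computes similarity from the GeneTree alignments.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Region:
    """A disordered region with 1-based, inclusive coordinates (illustrative)."""
    start: int
    end: int

    @property
    def length(self) -> int:
        return self.end - self.start + 1


def propagable_regions(regions: List[Region],
                       similarity: float,
                       min_similarity: float = 0.80,
                       min_length: int = 10) -> List[Region]:
    """Select regions eligible for transfer to a homolog.

    `similarity` is assumed to be a fraction between 0 and 1 describing the
    source/target pair; how it is derived from the alignment is not modeled.
    """
    if similarity < min_similarity:
        return []
    return [region for region in regions if region.length >= min_length]


if __name__ == "__main__":
    curated = [Region(1, 25), Region(40, 45), Region(100, 109)]
    print(propagable_regions(curated, similarity=0.85))
```

In a real pipeline the surviving regions would still have to be mapped through the alignment onto the coordinates of the target sequence, which this sketch does not attempt.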

Indirect experimental data sources provide annotations from X-ray experiments, NMR three-dimensional models, and NMR chemical shift data. The annotation from X-ray and NMR structures is derived by processing the entire Protein Data Bank (PDB) with the Mobi 2.0 software [29]. Chemical shift data, available from the Biological Magnetic Resonance Data Bank (BMRB) [30], are processed by an internal pipeline.

Indirect disorder information is derived from the PDB in three different ways by considering (1) missing, (2) high temperature, and (3) mobile residues. Missing residues are those for which the detected electron density is not sharp enough to model the corresponding structure. This happens when the residues are not aligned in the different unit cells of the crystal, i.e., are not fixed. In MobiDB, they are detected simply by comparing the sequence used in the experiment, i.e., PDB SEQRES, with the resolved residues, i.e., those with an ATOM field in the PDB file. High temperature residues are those with high normalized B-factors in X-ray and cryo-EM structures (see Note 3). Mobile residues are assigned from NMR structures by superimposing all models in the PDB file and calculating the distance between equivalent Cα atoms. Positions which change significantly across different models can be interpreted as mobile/disordered residues (see Note 4) [31].

LIPs are also derived from PDB structures by measuring the local inter/intra-chain contact ratio in protein complexes. Because LIP regions preserve a linear structure even when interacting with another protein, they tend to form inter-chain contacts with the partner rather than intra-chain contacts. Contacts are obtained with the RING [32] software, which generates the residue interaction network of each structure (see Note 5).

NMR spectroscopy provides valuable information about protein dynamics in solution. In particular, it provides quantitative information on structural fluctuations and different conformational populations [33]. Chemical shifts are useful to quantify these fluctuations over a wide range of timescales, providing a statistical view of secondary structure elements in different molecular populations [33]. Chemical shift data from BMRB [30] provide the basis to calculate secondary structure populations. In MobiDB, secondary structure populations and backbone flexibility are calculated with the δ2D [34] and Random Coil Index (RCI) [35] software from two-dimensional spectra (see Note 6). In order to obtain accurate mappings, secondary structure populations are calculated only for those residues where chemical shifts are available for at least three different atom types [34]. MobiDB also reports the experimental conditions, such as the presence of binding partners, lipids, pH, etc., since they have a strong impact on the biological function of the protein [33]. All this information can be explored through the feature viewer (see later). When an entry has multiple tracks of chemical shift data associated, a consensus is provided at the top. Additionally, the feature viewer can be expanded, and hovering the mouse over a chemical shift annotation shows the experimental conditions and the title of the corresponding BMRB entry in a tooltip.

Lastly, MobiDB includes indirect information about the conformational diversity in globular regions extracted from the CoDNaS database [36]. CoDNaS estimates the degree of conformational diversity of a particular protein from redundant structures of the same sequence (conformers) by structural superposition.

3.1.3 Predicted Data
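The missing-residue detection described in Subheading 3.1.2 (SEQRES compared with the residues that have ATOM records) can be illustrated with a minimal sketch. The function name and the representation of resolved residues as 1-based SEQRES positions are assumptions made for illustration; this is not the Mobi 2.0 implementation, and parsing of PDB files is deliberately left out.

```python
from typing import Iterable, List, Tuple


def missing_regions(seqres: str,
                    observed: Iterable[int],
                    min_length: int = 1) -> List[Tuple[int, int]]:
    """Return missing-residue regions as 1-based, inclusive (start, end) pairs.

    seqres   -- sequence used in the experiment (PDB SEQRES)
    observed -- SEQRES positions that have at least one ATOM record
    """
    resolved = set(observed)
    regions: List[Tuple[int, int]] = []
    start = None
    for position in range(1, len(seqres) + 1):
        if position not in resolved:
            if start is None:
                start = position          # a missing stretch begins here
        elif start is not None:
            regions.append((start, position - 1))
            start = None
    if start is not None:                 # stretch running to the C-terminus
        regions.append((start, len(seqres)))
    return [(s, e) for s, e in regions if e - s + 1 >= min_length]


if __name__ == "__main__":
    # Toy chain: only positions 4 to 9 were modeled in the structure.
    print(missing_regions("MKTAYIAKQRQI", range(4, 10)))
```

Note that a terminus removed from the construct before crystallization never appears in SEQRES, so it cannot be reported by a comparison of this kind.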

MobiDB provides disorder predictions for all UniProt entries using the following predictors: DisEMBL [37], ESpritz [38], GlobPlot [39], IUPred [40], JRONN [41], and VSL2b [42]. Each predictor provides a binary classification for each residue (ordered or disordered), which is shown in two different colors on the feature viewer. A consensus disorder prediction is shown at the top and is generated using the outputs of all the abovementioned software (see next section for more information).

Complementary to disorder predictions, MobiDB integrates other tools such as DynaMine [43], ANCHOR [44], and FeSS [45]. Protein dynamics is closely related to protein disorder, and DynaMine [43] is a fast and accurate predictor of backbone dynamics that uses only sequence information. DynaMine provides a score from zero to one for each amino acid, where one indicates complete order (stable/rigid conformation) and zero indicates a highly dynamic residue (multiple conformations/disorder) (see Note 7).

ANCHOR [44] is a method for predicting disordered binding regions using only sequence information. In order to predict regions capable of undergoing a disorder-to-order transition upon binding, ANCHOR combines three key properties: (1) the residue is part of a long disordered region, which filters out globular domains; (2) the residue is not able to form enough favorable contacts with other residues sequentially close in the isolated state; and (3) the residue is able to form sufficient favorable interactions with globular proteins upon binding. To predict these properties, the method relies on the pairwise energy estimation approach used in the disorder predictor IUPred [40]. ANCHOR provides LIP annotations for all protein sequences in the database.

Finally, FeSS is part of the FELLS method [45], which predicts secondary structure propensity in three states (helix, sheet, coil) based on the same single-sequence neural network architecture as ESpritz [38]. The FeSS prediction can be interpreted in analogy to the secondary structure populations obtained by δ2D from chemical shifts. In addition, two low-complexity predictors, SEG [46] and Pfilt [47], are used to highlight regions of low sequence complexity, which are particularly abundant in disordered proteins. These methods analyze the amino acid composition of the sequence to find compositionally biased regions.

3.1.4 Consensus Data
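Subheading 3.1.3 notes that SEG and Pfilt look for compositionally biased regions. The sketch below shows the general idea with a sliding-window Shannon entropy; it is a much-simplified stand-in and reproduces neither tool. The function names, the window of 12 residues, and the 2.2-bit cutoff are illustrative assumptions.

```python
import math
from collections import Counter
from typing import List


def window_entropy(window: str) -> float:
    """Shannon entropy (in bits) of the amino acid composition of a window."""
    counts = Counter(window)
    total = len(window)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


def low_complexity_mask(sequence: str,
                        window: int = 12,
                        threshold: float = 2.2) -> List[bool]:
    """Flag residues covered by at least one window with low entropy.

    Window size and threshold are illustrative values, not tool settings
    taken from the chapter.
    """
    mask = [False] * len(sequence)
    for start in range(len(sequence) - window + 1):
        if window_entropy(sequence[start:start + window]) <= threshold:
            for index in range(start, start + window):
                mask[index] = True
    return mask


if __name__ == "__main__":
    toy = "Q" * 14 + "ACDEFGHIKLMNPQRSTVWY"
    mask = low_complexity_mask(toy)
    # Lower case marks residues flagged as compositionally biased.
    print("".join(aa.lower() if flag else aa for aa, flag in zip(toy, mask)))
```

Because every residue in a low-entropy window is flagged, the mask extends a few positions past the repeat itself; dedicated tools refine these boundaries in further steps.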

One of the key and exclusive features of MobiDB is the consensus generated from different annotation sources. For disorder predictions, two consensus types are generated, MobiDB-lite and "simple." The MobiDB-lite consensus [48] provides highly specific predictions (low false-positive rate) of long disordered regions (at least 20 residues long) (see Subheading 3.2.3). To define a residue as disordered, at least five out of eight predictors have to agree. This consensus is shown on the feature viewer in two representations: a continuous line showing the fraction of methods voting for disorder and a discrete view showing disordered regions, calculated by post-processing the agreement among predictions. MobiDB-lite provides an additional characterization of the predicted regions related to their functional role. Different types of disordered regions are classified according to the fraction of charged residues and net charge [49]: weak polyampholytes (D_WC), negative polyelectrolytes (D_NPE), positive polyelectrolytes (D_PPE), and polyampholytes (D_PA). This information is shown when the mouse is placed over a region in the MobiDB-lite track.

The "simple" consensus is not shown on the web interface but is available in the JSON file, which contains the entire information stored in the database for a given protein and is available for download. This type of consensus is less conservative than MobiDB-lite and provides regions of any length, without restriction or post-processing. Disorder is assigned based on an agreement of at least one out of eight votes.

For indirect annotation sources (PDB and BMRB), a consensus is additionally generated for each different aspect: missing, mobile, and high temperature residues and LIPs. The consensus combines information from different chains in three categories: structure, conflict, and disorder for disorder-related annotations, or LIP, conflict, and non-LIP for LIP-related annotations. To classify a residue as structured or disordered, 90% of the available evidence (e.g., missing residue regions in PDB chains) needs to agree. A residue that falls between these two classifications is classified as in "conflict." A comprehensive consensus of indirect annotations is shown on the feature viewer and is generated by merging the consensuses of each source (missing, mobile, and high temperature residues). In this case, when a single component disagrees, the final classification is conflict/ambiguous.

Two different consensuses are also shown for the curated annotation, one for disorder and one for LIPs. The consensus in this case is generated by merging all regions without checking the agreement. However, if a Gene3D structured domain is available, overlapping residues appear in conflict. Finally, an overall consensus is generated considering curated and indirect sources and the MobiDB-lite prediction, prioritizing curated and indirect data in analogy to previous versions of MobiDB [23].

3.2 Applications
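The voting rules of Subheading 3.1.4 can be summarized in a few lines of code. The sketch below assumes that each predictor is represented as a list of per-residue booleans and uses hypothetical function names. It implements only the thresholds stated in the text (five of eight votes and a minimum length of 20 residues for the MobiDB-lite style consensus, one of eight votes for the "simple" consensus, 90% agreement for indirect evidence) and does not reproduce the MobiDB-lite post-processing.

```python
from typing import List, Sequence, Tuple


def regions_from_mask(mask: Sequence[bool],
                      min_length: int = 1) -> List[Tuple[int, int]]:
    """Collapse a per-residue mask into 1-based, inclusive regions."""
    regions: List[Tuple[int, int]] = []
    start = None
    for position, flag in enumerate(mask, start=1):
        if flag and start is None:
            start = position
        elif not flag and start is not None:
            regions.append((start, position - 1))
            start = None
    if start is not None:
        regions.append((start, len(mask)))
    return [(s, e) for s, e in regions if e - s + 1 >= min_length]


def vote_consensus(predictions: Sequence[Sequence[bool]],
                   min_votes: int,
                   min_length: int = 1) -> List[Tuple[int, int]]:
    """Regions where at least `min_votes` predictors call disorder."""
    votes = [sum(column) for column in zip(*predictions)]
    return regions_from_mask([v >= min_votes for v in votes], min_length)


def classify_indirect(n_disordered: int,
                      n_total: int,
                      threshold: float = 0.9) -> str:
    """Classify one residue from indirect evidence (e.g., PDB chains)."""
    if n_total == 0:
        return "no evidence"
    if n_disordered / n_total >= threshold:
        return "disorder"
    if (n_total - n_disordered) / n_total >= threshold:
        return "structure"
    return "conflict"


if __name__ == "__main__":
    # Toy input: eight predictors over a 30-residue sequence.
    majority = [True] * 25 + [False] * 5
    minority = [False] * 27 + [True] * 3
    silent = [False] * 30
    predictors = [majority] * 5 + [minority] + [silent] * 2
    print("lite-like:", vote_consensus(predictors, min_votes=5, min_length=20))
    print("simple:   ", vote_consensus(predictors, min_votes=1))
    print("indirect: ", classify_indirect(n_disordered=3, n_total=10))
```

Keeping the vote count and the minimum length as parameters makes the difference between the two prediction consensuses explicit: they share the same procedure and differ only in how strict the thresholds are.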

3.2.1 Working Case: Beta-Catenin

Beta-catenin (or catenin beta-1) is a component of the cadherin protein complex in the canonical Wnt signaling pathway [50– 52]. This protein has a dual function acting in the regulation and coordination of gene transcription and cell–cell adhesion [53, 54]. The structured core of the protein is composed of several armadillo repeats which fold in a typically rigid domain [55]. The N- and C-terminal regions bordering the armadillo repeats seem to be disordered in solution, playing a central role in beta-catenin function [56]. In MobiDB, this protein can be found using the UniProt identifier P35222. The MobiDB home page has two search boxes (one in the middle and another on the top-right corner) where the user can write the UniProt, UniRef, UniParc, Proteome, or taxonomy ID to find a protein (Fig. 1). After writing the UniProt ID and pressing the search button, the protein entry page appears, showing general information about this protein (gene, name, organism, localization, taxonomy, among others) and an overview of the annotation in the feature viewer (Fig. 2). From the feature viewer in the entry page, two disordered regions at the C- and N-termini, covering about 30% of the residues, are visualized in the full consensus track. Looking at the curated data, these two regions are annotated by DisProt predicted (transfer by homology) and are almost perfectly separated by a folded domain

Exploring MobiDB


Fig. 2 The protein entry page contains all annotations for a given protein in MobiDB. At the top, it shows general information such as disorder content (%), protein name, localization, and length, among others. The star next to the UniProt ID indicates the degree of evidence available for this protein: (1) a fully colored star means manually curated evidence, (2) a half-colored star means indirect evidence, and (3) an uncolored star means only predictions. At the bottom, this page displays the feature viewer where all the annotations are shown. The feature viewer has five sections depending on which annotation the user wants to visualize. Additionally, the viewer allows the user to zoom in and out on specific regions, download a high-quality image, see the protein sequence, and open the 3D viewer (only in the indirect section). Red, gray, and blue bars indicate disordered, ambiguous, and ordered regions, respectively

annotated by Gene3D (Fig. 3). These disordered regions have been annotated by homology from the mouse beta-catenin (see Subheading 3.1.1). In addition to disorder, some LIPs, provided by the ELM, DIBS, and IDEAL databases, are concentrated at the protein ends. For example, a conserved N-terminal SLiM is known from the literature to bind an E3 ubiquitin ligase when phosphorylated [57]. The ELM database [9] provides further details about the functional implications of the MOD_GSK3_1 (phosphorylation motif) and DEG_SCF_TRCP1_1 (degron motif) sites in beta-catenin, and DIBS [10] identifies a disordered binding site. In the


Alexander Miguel Monzon et al.

Fig. 3 Curated annotations for catenin beta-1 as shown in the feature viewer in MobiDB. Disordered regions are colored in red, while ordered regions are in blue. Folded regions identified by Gene3D and Pfam domains are colored in green and light blue, respectively. LIP annotations are shown on the feature viewer in violet. Numbers on the x-axis are protein residue numbers

indirect annotation section (derived data), MobiDB shows information for LIPs and missing and high temperature residues. Missing residue data agrees with curated sources, as the N-terminus is completely disordered in one structure (PDB code 2Z6H, chain A). The C-terminus is completely removed when generating crystals for X-ray experiments; therefore it is not available in the SEQRES and not detected in the missing residues track. Additionally, based on the structural information, some LIPs are derived and in agreement with the SLiMs annotated in ELM. MobiDB-lite also predicts two IDRs at the C- and N-termini, meaning that the majority of other predictors also agree. In addition, ANCHOR [44] predicts binding regions around ELM and DIBS motifs, overlapping with IDRs identified by the majority of annotation sources.

3.2.2 Disorder Flavors

A series of very different phenomena have been defined under the same name of intrinsic disorder (ID). Various approaches have been adopted to classify different types of ID [1]. The simplest feature to consider is region length. Short and long IDRs are substantially different [58, 59] both in terms of function [60] and evolution [61]. Another feature is the sequence conservation [62] which has been used to distinguish conserved IDRs (constrained disorder) and non-conserved sequences which instead only preserve the disorder content (flexible disorder). Sequence composition is another way to classify disorder, as it clearly influences the physical behavior of the protein [63]. At least three different disorder flavors, which differ in amino acid composition, sequence locations, and biological function, were identified by an automatic sequence classification and disorder prediction [64]. A more recent classification


considers the fraction of charged residues (FCR) as a simple way to distinguish between extended, coil-like, and more compact, molten globule-like IDRs [49]. Different predictors are biased to capture different ID flavors, and their limited agreement makes ID investigation a complex issue [65]. MobiDB-lite produces a consensus prediction from many different predictors and classifies predicted regions following an improved version of the classification proposed in [49]. In conclusion, MobiDB provides the most complete picture of different ID flavors covering the whole protein universe as defined by UniProt [66] and captures different aspects of disorder at both the structural and functional levels.

3.2.3 MobiDB-Lite Could Help to Identify Order–Disorder Transitions

MobiDB-lite predictions are conservative but help to detect disorder/order transitions. MobiDB-lite predictions also include different flavors of disordered regions (see Subheading 3.2.2). Due to limited experimental data, most MobiDB entries contain only predicted information. MobiDB-lite was designed to provide highly specific annotations: the algorithm is very conservative, so if a region is annotated as disordered, the user can be highly confident that it is disordered under physiological conditions when unbound [48]. One limitation of this method is that it cannot capture functional ID regions that are shorter than 20 residues and/or undergo disorder-to-order transitions. To illustrate this behavior, we focus on two real examples. A short, folded region bordered by predicted disordered regions could indicate a potential disorder-to-order transition [67]. This behavior can be seen in the prediction for 4E-binding protein 1 (4EBP1, UniProt identifier Q13541). 4EBP1 is a repressor protein which controls eIF4E activity [68]. MobiDB-lite predicts residues 49 to 63 as ordered, a short region bordered by disordered regions. However, two predictors, ESpritz-DisProt and VSL2b, identify this region as disordered. ANCHOR [44] also predicts a disordered binding region from residues 38 to 66. From the predicted data, we can observe a potential LIP, which undergoes a disorder-to-order transition upon binding to eukaryotic translation initiation factor 4E (eIF4E). This observation is supported by NMR studies, in which 4EBP1 is completely disordered on its own [69] but undergoes a disorder-to-order transition [70] upon binding to and inhibiting eIF4E (Fig. 4). A similar phenomenon can be observed for p53 (UniProt identifier P04637) from residues 320 to 356, which constitute the C-terminal region containing the tetramerization domain. MobiDB-lite predicts this region as structured, but it is annotated as disordered in DisProt and can also undergo a disorder-to-order transition [71] during interaction.
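The pattern described above, a short ordered stretch flanked by predicted disorder, can be searched for with a few lines of code. The following is a minimal sketch, not part of MobiDB or MobiDB-lite: the function name `ordered_islands`, the one-letter state encoding ("D" for disordered, "S" for structured), and the default length cutoff of 20 residues are assumptions made for illustration. The toy string only mimics the 4EBP1 situation (ordered at residues 49 to 63); it is not a real prediction.

```python
from typing import List, Tuple


def ordered_islands(states: str, max_len: int = 20) -> List[Tuple[int, int]]:
    """Return 1-based (start, end) of ordered runs ('S') that are at most
    max_len residues long and flanked on both sides by disorder ('D')."""
    islands = []
    i, n = 0, len(states)
    while i < n:
        if states[i] == "S":
            j = i
            while j < n and states[j] == "S":
                j += 1
            # The run is states[i:j]; both neighbors must exist and be 'D'.
            flanked = i > 0 and j < n and states[i - 1] == "D" and states[j] == "D"
            if flanked and (j - i) <= max_len:
                islands.append((i + 1, j))
            i = j
        else:
            i += 1
    return islands


# Hypothetical per-residue string: disorder, an ordered stretch at
# residues 49-63, then disorder again (illustrative values only).
toy = "D" * 48 + "S" * 15 + "D" * 55
print(ordered_islands(toy))
```

Regions flagged this way are only candidates for a disorder-to-order transition; as in the 4EBP1 example, they should be checked against other predictors, binding-region predictions, and experimental evidence.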


Fig. 4 Disorder-to-order transition in 4EBP1. The MobiDB-lite disorder consensus prediction is shown at the top with the predictors used to produce the consensus. At the bottom is a fragment of the 4EBP1 structure (PDB code: 1WKW) from residues 47 to 66 bound to eukaryotic translation initiation factor 4E. This region (colored in red) can undergo a disorder-to-order transition when interacting with eIF4E to form a small α-helix. Numbers on the x-axis are residue numbers

4 Notes

1. Many bioinformatics resources have both a graphical user interface (GUI) and an application programming interface (API). MobiDB supports API access via RESTful (REpresentational State Transfer) web services [22]. In this way, all MobiDB data are accessible through custom software clients via HTTP URLs. These services are useful for performing complex bioinformatics analysis or to import MobiDB data into third-party resources through standard HTTP requests.


For example, the DisProt database uses the MobiDB APIs to list which PDB structures are available for a given protein [18].

2. It is hard to interpret ID content unambiguously from some IDP experiments. The annotation interface of DisProt allows the curators to report ambiguous information for each piece of region evidence. The “AMBEXP” tag is used to report ambiguous experimental evidence, for example, when the experiment was performed under extreme conditions, when the structure is of low resolution, or when the position of the disorder within the sequence is unclear. The “AMBSEQ” tag indicates that the experiment was carried out on a non-native sequence (i.e., engineered or a fragment) or that the literature evidence is ambiguous, among other cases.

3. Residues are flagged as high B-factor when they have B-factors above the theoretical B-factor (Wilson B) for the corresponding resolution increased by 25%, e.g., the Wilson B-factor for a structure with a resolution between 1.25 and 1.50 Å is 13.50 Å².

4. The formula $SD = 1/(1 + (d/d_0)^2)$ is used to compute and scale the distances between the same Cαs in all superimpositions, where $d$ is the distance between two Cαs and $d_0$ is a normalization factor [31]. The average and standard deviation of the scaled distances are computed. The program also computes the standard deviation of the φ and ψ angles for all the amino acids in all the models. These are used to assign mobility to residues neighboring those assigned as mobile according to the SD average and its standard deviation. The parameter $d_0$ is set to 4 Å. The thresholds for SD, its standard deviation, and the φ and ψ angle standard deviations are set to 0.85, 0.09, and 20° and 20°, respectively. These thresholds have been optimized to maximize the agreement with disorder definitions on 19 NMR structure targets evaluated in CASP8 [72].

5. Residue interaction networks (RINs) are a way of representing protein structures where nodes are residues and arcs are the physicochemical interactions among them. The RING software [32] allows the identification of covalent and non-covalent bonds in protein structures, including hydrogen bonds, salt bridges, disulfide bonds, van der Waals and π-cation interactions, and π–π stacking. The software calculates intra- and interchain contacts, even considering ligand and solvent atoms.

6. The δ2D method is able to translate a set of chemical shifts into probabilities of occupation of secondary structure (SS) elements [34]. This approach is based on a generalization of another method which can define the random coil chemical shifts from the amino acid sequence of a protein [73]. The method uses four sequence-based SS predictors to obtain the


ideal chemical shifts for α-helix, β-sheet, and polyproline II (PPII). These predictions are considered as the chemical shifts that would be measured for a given sequence in a state with a 100% population of a particular SS [34]. Then, δ2D combines these predictions and compares the results with the experimental chemical shifts to extract the most probable SS populations. The RCI method was developed by fitting backbone chemical shift curated data to an extensive set of protein MD simulations and is able to detect motions in the range of picoseconds to nanoseconds [35].

7. DynaMine is based on a linear regression approach to predict protein flexibility directly from sequence information. DynaMine uses the chemical shifts of 2015 proteins to generate a curated dataset which contains per-residue information about fast protein backbone movements. This dataset is used to analyze statistically and quantitatively the backbone dynamics properties for each amino acid, allowing identification of those amino acids with a tendency to promote order-to-disorder transitions.
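The numerical rules in Notes 3 and 4 can be made concrete with a small sketch. The code below is illustrative only and is not taken from the MobiDB or Mobi software: the function names are hypothetical, the B-factor cutoff is read as "Wilson B multiplied by 1.25", and the use of the population standard deviation is an assumption, since the notes do not specify which estimator the program uses.

```python
import statistics
from typing import Sequence, Tuple


def high_bfactor_cutoff(wilson_b: float, increase: float = 0.25) -> float:
    """Cutoff above which a residue would be flagged as high B-factor
    (Note 3), read here as the Wilson B increased by 25%."""
    return wilson_b * (1.0 + increase)


def scaled_distance(d: float, d0: float = 4.0) -> float:
    """Scaled distance SD = 1 / (1 + (d / d0)^2) from Note 4, with d and
    d0 in angstroms. SD is 1 for identical positions and tends to 0."""
    return 1.0 / (1.0 + (d / d0) ** 2)


def sd_summary(distances: Sequence[float], d0: float = 4.0) -> Tuple[float, float]:
    """Mean and (population) standard deviation of the scaled distances
    for one C-alpha across all pairwise superimpositions."""
    scaled = [scaled_distance(d, d0) for d in distances]
    return statistics.mean(scaled), statistics.pstdev(scaled)


# Wilson B of 13.50 A^2 for the 1.25-1.50 A resolution range (Note 3).
print(high_bfactor_cutoff(13.50))
# Hypothetical C-alpha distances (in angstroms) for a single residue.
print(sd_summary([0.0, 4.0, 4.0, 0.0]))
```

With the thresholds given in Note 4, the mean and standard deviation returned by `sd_summary` would be compared against 0.85 and 0.09, respectively; the exact decision rule, including the φ and ψ angle criterion, is described in [31] and is not reproduced here.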

Acknowledgments

The authors would like to thank Diana Battistella for helping us with manuscript proofreading.

References

1. Van Der Lee R, Buljan M, Lang B et al (2014) Classification of intrinsically disordered regions and proteins. Chem Rev 114:6589–6631
2. Tompa P, Schad E, Tantos A et al (2015) Intrinsically disordered proteins: emerging interaction specialists. Curr Opin Struct Biol 35:49–59
3. Pancsa R, Tompa P (2012) Structural disorder in eukaryotes. PLoS One 7(4):e34687
4. Xue B, Dunker AK, Uversky VN (2012) Orderly order in protein intrinsic disorder distribution: disorder in 3500 proteomes from viruses and the three domains of life. J Biomol Struct Dyn 30:137–149
5. Uversky VN, Gillespie JR, Fink AL (2000) Why are “natively unfolded” proteins unstructured under physiologic conditions? Proteins 41:415–427
6. Romero P, Obradovic Z, Li X et al (2001) Sequence complexity of disordered protein. Proteins 42:38–48
7. Davey NE (2019) The functional importance of structure in unstructured protein regions. Curr Opin Struct Biol 56:155–163
8. Fuxreiter M, Simon I, Friedrich P et al (2004) Preformed structural elements feature in partner recognition by intrinsically unstructured proteins. J Mol Biol 338:1015–1026
9. Gouw M, Michael S, Sámano-Sánchez H et al (2018) The eukaryotic linear motif resource - 2018 update. Nucleic Acids Res 46(D1):D428–D434
10. Schad E, Fichó E, Pancsa R et al (2018) DIBS: a repository of disordered binding sites mediating interactions with ordered proteins. Bioinformatics 34:535–537
11. Van Roey K, Uyar B, Weatheritt RJ et al (2014) Short linear motifs: ubiquitous and functionally diverse protein interaction modules directing cell regulation. Chem Rev 114:6733–6778
12. Callaway E (2015) The revolution will not be crystallized: a new method sweeps through structural biology. Nature 525:172–174
13. Cheng Y (2015) Single-particle Cryo-EM at crystallographic resolution. Cell 161:450–457
14. Felli IC, Pierattelli R (2012) Recent progress in NMR spectroscopy: toward the study of intrinsically disordered proteins of increasing size and complexity. IUBMB Life 64:473–481
15. Theillet F-X, Binolfi A, Bekei B et al (2016) Structural disorder of monomeric α-synuclein persists in mammalian cells. Nature 530:45–50
16. Schuler B, Soranno A, Hofmann H et al (2016) Single-molecule FRET spectroscopy and the polymer physics of unfolded and intrinsically disordered proteins. Annu Rev Biophys 45:207–231
17. Di Domenico T, Walsh I, Martin AJM et al (2012) MobiDB: a comprehensive database of intrinsic protein disorder annotations. Bioinformatics 28:2080–2081
18. Piovesan D, Tabaro F, Mičetić I et al (2017) DisProt 7.0: a major update of the database of disordered proteins. Nucleic Acids Res 45:D1123–D1124
19. Fukuchi S, Amemiya T, Sakamoto S et al (2014) IDEAL in 2014 illustrates interaction networks composed of intrinsically disordered proteins and their binding partners. Nucleic Acids Res 42:D320–D325
20. Fichó E, Reményi I, Simon I et al (2017) MFIB: a repository of protein complexes with mutual folding induced by binding. Bioinformatics 33:3682–3684
21. Miskei M, Antal C, Fuxreiter M (2017) FuzDB: database of fuzzy complexes, a tool to develop stochastic structure-function relationships for protein complexes and higher-order assemblies. Nucleic Acids Res 45:D228–D235
22. Piovesan D, Tabaro F, Paladin L et al (2018) MobiDB 3.0: more annotations for intrinsic disorder, conformational diversity and interactions in proteins. Nucleic Acids Res 46:D471–D476
23. Potenza E, Di Domenico T, Walsh I et al (2015) MobiDB 2.0: an improved database of intrinsically disordered and mobile proteins. Nucleic Acids Res 43:D315–D320
24. UniProt Consortium (2019) UniProt: a worldwide hub of protein knowledge. Nucleic Acids Res 47:D506–D515
25. El-Gebali S, Mistry J, Bateman A et al (2019) The Pfam protein families database in 2019. Nucleic Acids Res 47:D427–D432
26. Lewis TE, Sillitoe I, Dawson N et al (2018) Gene3D: extensive prediction of globular domains in proteins. Nucleic Acids Res 46:D1282
27. Necci M, Piovesan D, Tosatto SCE (2018) Where differences resemble: sequence-feature analysis in curated databases of intrinsically disordered proteins. Database 2018:bay127. https://doi.org/10.1093/database/bay127
28. Vilella AJ, Severin J, Ureta-Vidal A et al (2009) EnsemblCompara GeneTrees: complete, duplication-aware phylogenetic trees in vertebrates. Genome Res 19:327–335
29. Piovesan D, Tosatto SCE (2018) Mobi 2.0: an improved method to define intrinsic disorder, mobility and linear binding regions in protein structures. Bioinformatics 34:122–123
30. Ulrich EL, Akutsu H, Doreleijers JF et al (2008) BioMagResBank. Nucleic Acids Res 36:D402–D408
31. Martin AJM, Walsh I, Tosatto SCE (2010) MOBI: a web server to define and visualize structural mobility in NMR protein ensembles. Bioinformatics 26:2916–2917
32. Piovesan D, Minervini G, Tosatto SCE (2016) The RING 2.0 web server for high quality residue interaction networks. Nucleic Acids Res 44:W367–W374
33. Sormanni P, Piovesan D, Heller GT et al (2017) Simultaneous quantification of protein order and disorder. Nat Chem Biol 13:339–342
34. Camilloni C, De Simone A, Vranken WF et al (2012) Determination of secondary structure populations in disordered states of proteins using nuclear magnetic resonance chemical shifts. Biochemistry 51:2224–2231
35. Berjanskii MV, Wishart DS (2005) A simple method to predict protein flexibility using secondary chemical shifts. J Am Chem Soc 127:14970–14971
36. Monzon AM, Rohr CO, Fornasari MS et al (2016) CoDNaS 2.0: a comprehensive database of protein conformational diversity in the native state. Database 2016:baw038
37. Linding R, Jensen LJ, Diella F et al (2003) Protein disorder prediction: implications for structural proteomics. Structure 11:1453–1459
38. Walsh I, Martin AJM, Di Domenico T et al (2012) ESpritz: accurate and fast prediction of protein disorder. Bioinformatics 28:503–509
39. Linding R, Russell RB, Neduva V et al (2003) GlobPlot: exploring protein sequences for globularity and disorder. Nucleic Acids Res 31:3701–3708
40. Dosztányi Z, Csizmok V, Tompa P et al (2005) IUPred: web server for the prediction of intrinsically unstructured regions of proteins based on estimated energy content. Bioinformatics 21:3433–3434
41. Yang ZR, Thomson R, McNeil P et al (2005) RONN: the bio-basis function neural network technique applied to the detection of natively disordered regions in proteins. Bioinformatics 21:3369–3376
42. Peng K, Radivojac P, Vucetic S et al (2006) Length-dependent prediction of protein intrinsic disorder. BMC Bioinformatics 7:208
43. Cilia E, Pancsa R, Tompa P et al (2013) From protein sequence to dynamics and disorder with DynaMine. Nat Commun 4:2741
44. Mészáros B, Simon I, Dosztányi Z (2009) Prediction of protein binding regions in disordered proteins. PLoS Comput Biol 5:e1000376
45. Piovesan D, Walsh I, Minervini G et al (2017) FELLS: fast estimator of latent local structure. Bioinformatics 33:1889–1891
46. Wootton JC (1994) Non-globular domains in protein sequences: automated segmentation using complexity measures. Comput Chem 18:269–285
47. Jones DT, Swindells MB (2002) Getting the most from PSI-BLAST. Trends Biochem Sci 27:161–164
48. Necci M, Piovesan D, Dosztányi Z et al (2017) MobiDB-lite: fast and highly specific consensus prediction of intrinsic disorder in proteins. Bioinformatics 33:1402–1404
49. Das RK, Pappu RV (2013) Conformations of intrinsically disordered proteins are influenced by linear sequence distributions of oppositely charged residues. Proc Natl Acad Sci U S A 110:13392–13397
50. Peifer M, Rauskolb C, Williams M et al (1991) The segment polarity gene armadillo interacts with the wingless signaling pathway in both embryonic and adult pattern formation. Development 111:1029–1043
51. Noordermeer J, Klingensmith J, Perrimon N et al (1994) Dishevelled and armadillo act in the wingless signalling pathway in Drosophila. Nature 367:80–83
52. Peifer M, Berg S, Reynolds AB (1994) A repeating amino acid motif shared by proteins with diverse cellular roles. Cell 76:789–791
53. Kraus C, Liehr T, Hülsken J et al (1994) Localization of the human beta-catenin gene (CTNNB1) to 3p21: a region implicated in tumor development. Genomics 23:272–274
54. MacDonald BT, Tamai K, He X (2009) Wnt/beta-catenin signaling: components, mechanisms, and diseases. Dev Cell 17:9–26
55. Huber AH, Nelson WJ, Weis WI (1997) Three-dimensional structure of the armadillo repeat region of beta-catenin. Cell 90:871–882
56. Xing Y, Takemaru K-I, Liu J et al (2008) Crystal structure of a full-length beta-catenin. Structure 16:478–487
57. Wu G, Xu G, Schulman BA et al (2003) Structure of a beta-TrCP1-Skp1-beta-catenin complex: destruction motif binding and lysine specificity of the SCF(beta-TrCP1) ubiquitin ligase. Mol Cell 11:1445–1456
58. Radivojac P, Obradovic Z, Smith DK et al (2004) Protein flexibility and intrinsic disorder. Protein Sci 13:71–80
59. Schlessinger A, Schaefer C, Vicedo E et al (2011) Protein disorder--a breakthrough invention of evolution? Curr Opin Struct Biol 21:412–418
60. Ward JJ, Sodhi JS, McGuffin LJ et al (2004) Prediction and functional analysis of native disorder in proteins from the three kingdoms of life. J Mol Biol 337:635–645
61. Brown CJ, Takayama S, Campen AM et al (2002) Evolutionary rate heterogeneity in proteins with long disordered regions. J Mol Evol 55:104–110
62. Bellay J, Han S, Michaut M et al (2011) Bringing order to protein disorder through comparative genomics and genetic interactions. Genome Biol 12:R14
63. Uversky VN (2002) What does it mean to be natively unfolded? Eur J Biochem 269:2–12
64. Vucetic S, Brown CJ, Dunker AK et al (2003) Flavors of protein disorder. Proteins 52:573–584
65. Walsh I, Giollo M, Di Domenico T et al (2015) Comprehensive large-scale assessment of intrinsic protein disorder. Bioinformatics 31:201–208
66. UniProt Consortium (2014) Activities at the universal protein resource (UniProt). Nucleic Acids Res 42:D191–D198
67. Uversky VN, Dunker AK (2010) Understanding protein non-folding. Biochim Biophys Acta 1804:1231–1264
68. Yanagiya A, Suyama E, Adachi H et al (2012) Translational homeostasis via the mRNA cap-binding protein, eIF4E. Mol Cell 46:847–858
69. Fletcher CM, Wagner G (1998) The interaction of eIF4E with 4E-BP1 is an induced fit to a completely disordered protein. Protein Sci 7:1639–1642
70. Mader S, Lee H, Pause A et al (1995) The translation initiation factor eIF-4E binds to a common motif shared by the translation factor eIF-4 gamma and the translational repressors 4E-binding proteins. Mol Cell Biol 15:4990–4997
71. Kannan S, Lane DP, Verma CS (2016) Long range recognition and selection in IDPs: the interactions of the C-terminus of p53. Sci Rep 6:23750
72. Noivirt-Brik O, Prilusky J, Sussman JL (2009) Assessment of disorder predictions in CASP8. Proteins 77(Suppl 9):210–216
73. De Simone A, Cavalli A, Hsu S-TD et al (2009) Accurate random coil chemical shifts from an analysis of loop regions in native states of proteins. J Am Chem Soc 131:16332–16333

Part II Evolution

Chapter 7

An Easy Protocol for Evolutionary Analysis of Intrinsically Disordered Proteins

Janelle Nunez-Castilla and Jessica Siltberg-Liberles

Abstract

We present an easy protocol for evolutionary analysis of proteins, with an emphasis on studying the evolutionary dynamics of disordered regions. Using the p53 protein family as an example, we provide a guide for finding homologous sequences in a database and refining a dataset before constructing the evolutionary context by building a phylogenetic tree. We show how a multiple sequence alignment and phylogeny for a protein family can be further partitioned into smaller datasets in order to investigate the changes in disorder content across the phylogeny. Based on the evolutionary context, we also investigate site-specific conservation of disorder. Last, we address how to evaluate the evolutionary dynamics of disorder-to-order transitions.

Key words Protein family, IUPred, Conservation, Protein, IDP, IDR, Phylogenetic tree, Evolution

1 Introduction

Protein families consist of homologous protein sequences that have evolved from a common ancestor. These homologous sequences are related by speciation (orthologs) or by gene duplication (paralogs). Sequences evolve, primarily, by amino acid substitutions, but also through insertions and deletions (indels). Typically, orthologs tend to be more similar to each other in sequence substitution patterns than they are to paralogs, such that orthologs form monophyletic clades [1] (Fig. 1a). The primary reason for the increased rate of sequence divergence for paralogs is relaxed selective pressure caused by the immediate functional redundancy that follows a gene duplication event. The most common scenario after a gene duplication event is for one copy to become a pseudogene. Retention of multiple paralogs is often due to divergence in function or

Electronic supplementary material: The online version of this chapter (https://doi.org/10.1007/978-1-0716-0524-0_7) contains supplementary material, which is available to authorized users. Birthe B. Kragelund and Karen Skriver (eds.), Intrinsically Disordered Proteins: Methods and Protocols, Methods in Molecular Biology, vol. 2141, https://doi.org/10.1007/978-1-0716-0524-0_7, © Springer Science+Business Media, LLC, part of Springer Nature 2020


Fig. 1 Examples of evolutionary contexts. (a) The first evolutionary context is illustrated by a fictional phylogenetic tree. The phylogeny has the protein name and the species it is from at the tips. All sequences called protein_A are found under node A, and this means that this is a monophyletic clade. We call this Clade A (blue). Similarly, all sequences called protein_B are found under node B. We call this Clade B (red). The proteins in Clade A are orthologs to each other. Orthologs are related by speciation (nodes marked S) and often have a highly similar function. The proteins in Clade A are paralogs to the proteins in Clade B. Paralogs are related by gene duplication (nodes marked GD) and tend to diverge to some extent in function or dosage. Through functional divergence, the sequences in Clade A and the sequences in Clade B often display clade-specific patterns of amino acid substitutions that can be used to form distinct clades for the paralogous sequences as seen in this fictional example. To analyze this phylogeny, we may, e.g., compare Clade A to Clade B. This tree is rooted by outgroup based on the single copy in fruit fly. If the phylogeny lacks a clear outgroup, the tree can be rooted at midpoint. The second evolutionary context (b) illustrates the concept of a multiple sequence alignment site. When making a multiple sequence alignment, we aim to put the corresponding residue from different sequences into the same site, indicating that these are related by ancestry. Thus, alignment sites are homologous sites. Sites that have experienced indel events are shown as gaps in the alignment because a homologous residue cannot be identified. In the example, multiple sequence alignment site 9 has a conserved G that corresponds to residue position 7 in sequence 2; residue position 8 in sequences 1, 3, and 5; and residue position 9 in sequence 4 in the unaligned sequences.
Multiple sequence alignment site 10 has a mixture of L and I that corresponds to residue position 8 in sequence 2; residue position 9 in sequences 1, 3, and 5; and residue position 10 in sequence 4 in the unaligned sequences. A site in a multiple sequence alignment is often referred to as an alignment site, a homologous site, or, simply, a site

Protein Family Analysis of Intrinsic Disorder


differential expression levels to maintain dosage [1]. When paralogs diverge in function, the corresponding homologous sites in two different paralogs often experience different selective pressures (Fig. 1b). Different selective pressures derive from different site-specific contributions to function for the two paralogs. Consequently, not all sites in a protein family accumulate amino acid substitutions at the same rate, and while there are many reasons for this phenomenon [2], for the purpose of this protocol, we will focus on intrinsic disorder. Intrinsically disordered proteins (IDPs) have been found to tolerate high sequence substitution rates [3] but often maintain function as long as the disordered property is conserved [4]. However, disordered sites that also have secondary structure have been found to be less prone to accumulate amino acid substitutions [5, 6]. These sites are likely of critical functional importance and may be able to form transient or interaction-dependent secondary structure. Further, indels of one or more amino acids tend to occur in disordered regions [7]. Altogether, this implies that some disordered proteins may be evolving very rapidly, and consequently, finding homologous sequences and aligning them accurately becomes a challenge. If there are too few homologous sequences and if the alignment is of low quality, the evolutionary analysis can be misleading. This easy protocol is not recommended for such cases or for highly repetitive low-complexity sequences. The main purpose of this protocol is to study the majority of proteins that contain a mixture of ordered and disordered regions. To provide an evolutionary context, homologous proteins (related by ancestry) will be studied as a protein family that contains a phylogenetic tree that describes how the sequences are related based on the amino acids at different sites in their multiple sequence alignment (MSA) and a statistical model of evolution [8].
Based on these protein families, the evolutionary analysis can focus on comparing different parts of the phylogeny or MSA and on the evolutionary conservation per alignment site (Fig. 1b). This protocol is written with the experimental molecular biologist in mind and for a small-scale study of one protein family at a time although it can easily be scaled up to a large-scale study.
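The relationship between an alignment site and the residue positions in the unaligned sequences (Fig. 1b) can be sketched in a few lines. This is a minimal illustration, not part of the protocol's tools: the function name `residue_position` and the toy aligned sequences are hypothetical and were chosen only so that site 9 maps to positions 8, 7, and 9, as in the figure legend.

```python
from typing import Optional


def residue_position(aligned_seq: str, site: int, gap: str = "-") -> Optional[int]:
    """Map a 1-based alignment site to the 1-based residue position in
    the unaligned sequence. Returns None if the site holds a gap."""
    if not 1 <= site <= len(aligned_seq):
        raise ValueError("site is outside the alignment")
    if aligned_seq[site - 1] == gap:
        return None
    # Count the residues (non-gap characters) up to and including the site.
    return sum(1 for char in aligned_seq[:site] if char != gap)


# Hypothetical aligned sequences (not those shown in Fig. 1b).
toy_alignment = {
    "sequence_1": "MK-TAYIAGL",
    "sequence_2": "M--TAYIAGI",
    "sequence_4": "MKQTAYIAGL",
}
for name, aligned in toy_alignment.items():
    print(name, residue_position(aligned, 9), residue_position(aligned, 10))
```

Keeping this mapping explicit matters later in the analysis, because per-residue disorder predictions are made on unaligned sequences but compared per alignment site.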

2 Materials

To execute the protocol, different webservers (Table 1) and applications (Table 2) are needed. All are freely available resources and should work with standard desktops and laptops. In many cases, these resources offer a Help tab, a question mark, a tutorial, or Q&A that are helpful for novice users.


Table 1 Webservers

Webservers | Basic description
NCBI (Blastp), https://blast.ncbi.nlm.nih.gov | Finds similar sequences in a large database [46]
NCBI common tree, https://www.ncbi.nlm.nih.gov/Taxonomy/CommonTree/wwwcmt.cgi | Shows how different species are related
MAFFT, https://mafft.cbrc.jp/alignment/server/ | Multiple sequence alignment (MSA) algorithm [19]
PhyML, http://www.atgc-montpellier.fr/phyml/ | Phylogenetic tree reconstruction [26]
ETE toolkit tree viewer, http://etetoolkit.org/treeview/ | Online tree viewer that displays the phylogeny next to the MSA [12]
Pfam, https://pfam.xfam.org/ | Domain database [45]
IUPred 2A, https://iupred2a.elte.hu/ | Predictor of intrinsic disorder propensity [33]

Table 2 Applications

Applications | Basic description
Jalview, http://www.jalview.org/ | Multiple sequence alignment visualization and analysis [23]
SeaView, http://doua.prabi.fr/software/seaview | Multiple sequence alignment visualization and analysis [47]
FigTree, http://tree.bio.ed.ac.uk/software/figtree/ | Tree visualization, including rooting
Anaconda2/3 with Spyder and python 2/3, https://www.anaconda.com/ | Python/R data science platform
Python/excel | Statistical analysis

3 Methods

This protocol covers how to construct a protein family and how to analyze intrinsic disorder based on a phylogenetic context (Fig. 2).

3.1 Finding Homologous Sequences

One critical aspect of building a protein family is its sequence composition. In order to compare sequences in a protein family, they must be homologous. When setting out to build a protein


Fig. 2 A schematic of the main steps outlined in the protocol

family for evolutionary analysis, it is important to first use a set of different search strategies against different databases to get an idea of the origin of the protein family and, based on the objective of the study, which taxa to include in the study (see Note 1). For a given

152

Janelle Nunez-Castilla and Jessica Siltberg-Liberles

protein sequence (the query), start by determining which taxa or domains of life or phyla contain homologs of the protein. For this purpose, we recommend Basic Local Alignment Search Tool (BLAST) at the National Center for Biotechnology Information (NCBI) where you can choose between different databases such as the nr database, the refseq_protein database, and the landmark database (see Note 2). We will start our exploration of which taxa contain homologs of the query using the landmark and refseq_protein databases. In our example, we will use p53 from human (NCBI accession code: NP_000537.3): 1. Perform a BLAST search using the Blastp algorithm on the NCBI server (https://blast.ncbi.nlm.nih.gov) using NP_000537.3 as query. Specify the Model organism (landmark) database. In May 2019, the search returns 75 hits. The taxonomy report indicates that most hits have been found in the three vertebrates, but there are three hits in one of the invertebrates, and there is one hit in one bacteria (Fig. 3a). Further, if we look at the first three hits in the main report, it shows three hits from Homo sapiens (Fig. 3b). The first hit is our query, and it has 100% query coverage and 100% sequence identity. The second hit has a query coverage of 90% and 100% sequence identity. The third hit has a query coverage of 86% and 97.65% sequence identity. These are all good hits in the sense that query coverage and sequence identity are high and the E-value (the chance of observing this hit at random in the given database) is low (see Note 3). If we look at the bottom three, we note that the query coverage and sequence identity are lower and the E-values are higher for all three hits. Depending on the purpose of the study, we have to decide what minimum query coverage, minimum sequence identity, and maximum E-value are acceptable to include the BLAST hits as homologs in our dataset. 
It is typically better to be more inclusive for starters, and in this case, we will include the third hit from the bottom, but not the two thereafter due to their high E-values. The second to the last hit is for the sole bacteria in the original hit list; excluding this hit means that we did not find a homolog in bacteria after all. The very last hit is from Danio rerio, one of the vertebrates, and even if it is excluded, there are 13 other sequences from this species left (Fig. 3a). The default setting for BLAST is to show the first 100 hits and to list these by E-value. It is therefore beneficial that our BLAST search led us to exclude the two last hits, because that means that all hits in this database were found. Had all our hits had a query coverage of 100%, a sequence identity of 90%, and an E-value of 0.0, we would need to change the default setting for how many hits are returned and rerun the BLAST search. We would have encountered such a scenario if we had chosen the refseq_protein database (see Note 4).

Fig. 3 Initial BLAST results. (a) The Blastp search with the query (NP_000537.3) against the Model organism database results in 75 hits. Most sequences are from vertebrates, but three sequences are from Drosophila melanogaster and one is from Streptococcus. (b) The top hits show that the search identified the query as the best hit. Importantly, the bottom three hits (c) show that the default setting of 100 sequences is enough to identify all sequences of interest because the two last hits have very poor E-values. The second to last sequence is from Streptococcus. It has a very short query coverage of 9% and will be excluded. The last sequence has higher query coverage but lower sequence identity to the query. Although these numbers are rather similar to the third to last hit, the last hit is much longer than the query, and it will be excluded

2. Our BLAST search against the Model organism database resulted in 38 hits in Homo sapiens, 19 hits in Mus musculus, 13 hits in Danio rerio, and 3 hits in Drosophila melanogaster. The multiple hits in each species indicate that there may be alternative splice forms (aka isoforms) or multiple homologous genes in each species (paralogs), or perhaps a combination of isoforms and paralogs (see Note 5). If we select all sequences, except the last two that we decided above to exclude, and click on GenPept, our 73 hits are shown in the main Protein database in NCBI (Fig. 4a). If we change the database shown to Gene, we can see that the 73 proteins come from 10 genes, 3 genes per vertebrate and 1 gene from Drosophila melanogaster (Fig. 4b). This confirms that a combination of isoforms and paralogs explains why there were so many hits. When isoforms have been identified, caution must be taken to only include one isoform per gene in order to not bias the phylogeny (Fig. 4c and d). However, determining which isoform to include is not always trivial. Some studies choose to include the longest isoform, and others rely on other metrics. We will return to this topic below.

3. Run another Blastp search with the same query against the refseq_protein database. Given the size of this database, it is necessary to increase the number of hits from 100 (under Algorithm parameters at the bottom of the page, change the Max Target Sequences to the desired number). Increasing it to 5000 will display all >2600 hits for our starting query in this database. Most hits are from vertebrates, but some seem to be from invertebrates. Rerunning the same Blastp search while excluding vertebrates yields about 400 sequences. Excluding metazoans yields fewer than 10 hits, and most of these hits are in single-celled eukaryotes. These hits have rather low query coverage, but several have the same Pfam domains as p53, and they have been documented as p53 homologs before (see Note 6) [9–11].

Fig. 4 Selecting the longest isoform. (a) GenPept shows all 73 hits. Under Find related data, selecting Gene (b) shows the 10 genes these proteins are from. By selecting one gene at a time (c) and returning to the refseq_protein database, the protein isoforms from the same genes can be identified (d) so that the longest isoform can be chosen

3.2 Using BLAST to Generate a Dataset

After our initial exploration of the taxonomic distribution of p53, we selected 23 species against which to perform our BLAST search. Of our selected species, 21 were metazoans and 2 were choanoflagellates. The metazoans can be further broken down into 13 chordates (11 of them vertebrates), 1 placozoan, 1 poriferan, 1 cnidarian, 1 echinoderm, and 4 arthropods. We used p53 from Homo sapiens (NP_000537.3) as our query and chose to BLAST against the refseq_protein database. To specify our 23 organisms, we added each taxon individually under the Organism option for the Search Set and took note of the taxid number for each taxon. Lastly, we increased the max target sequences to 250.

3.2.1 Isoform Selection

Before we can begin selecting which isoforms we want to include in our dataset, we need to check if there are any sequences that we want to exclude right away based on query coverage, sequence identity, and E-value. For our BLAST results, the highest E-value was 2e-07, which is still relatively low. Therefore, no sequences were excluded based on E-value. While some hits had lower query coverage (21–35%), these were not immediately excluded as we wanted our initial dataset to be as inclusive as possible and the E-values were still acceptable (see Note 3). Similarly, we did not exclude sequences on the basis of low sequence identity as that would have removed many of the invertebrate sequences.
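The exclusion criteria described above can also be applied programmatically once the hit table has been exported. The following is a minimal sketch, assuming a hypothetical four-column tab-separated export (accession, percent identity, percent query coverage, E-value); the column order, the thresholds, and all accessions other than the query are illustrative placeholders and not values from this study.

```python
import csv
import io

# Hypothetical tab-separated export: accession, % identity, % query coverage, E-value.
# XP_EXAMPLE1 and XP_EXAMPLE2 are made-up placeholder accessions.
EXAMPLE_HITS = (
    "NP_000537.3\t100.00\t100\t0.0\n"
    "XP_EXAMPLE1\t42.10\t35\t2e-07\n"
    "XP_EXAMPLE2\t31.50\t9\t4.3\n"
)


def filter_hits(handle, max_evalue=1e-5, min_coverage=0.0, min_identity=0.0):
    """Return the accessions of hits that pass all three thresholds."""
    kept = []
    for accession, identity, coverage, evalue in csv.reader(handle, delimiter="\t"):
        if (float(evalue) <= max_evalue
                and float(coverage) >= min_coverage
                and float(identity) >= min_identity):
            kept.append(accession)
    return kept


# Inclusive first pass: filter on E-value only, as in the text above.
kept_hits = filter_hits(io.StringIO(EXAMPLE_HITS), max_evalue=1e-5)
print(kept_hits)
```

Keeping the coverage and identity thresholds at zero mirrors the inclusive strategy described here; they can be raised later if the alignment shows that low-coverage hits are problematic.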

We will begin to form our dataset first by selecting all hits and then removing (by deselecting) the isoforms that we are no longer interested in, using predefined criteria to guide isoform selection (see Note 7). To select the best isoform per gene, it is first important to determine how many genes are present for a given organism in our results. To do this, we select All sequences and view the GenPept page, as described above. If any sequences were excluded due to poor query coverage, sequence identity, or E-value, we would not select All but instead be sure to only select the sequences that are still potential candidates for our dataset. Once at the GenPept page, we can find the related Gene database information for our hits, where we can see the number of genes that correspond to each species (Fig. 4). Once we know how many and which genes are present within a given species, we can go back to the GenPept page in order to more easily see the length of the isoforms so that the longest can be selected as the representative for each gene. The other isoforms in the results can then be deselected. Additionally, we caution against only relying on the sequence names and not taking into account gene ID information (see Note 8). As you construct your dataset, we recommend systematically going through one species at a time as you decide which sequences to include and exclude from your dataset. It is important to exercise care and patience during this process to minimize human error and avoid the accidental inclusion of the incorrect or undesired isoform. Once we only have the sequences of interest selected, we can download the sequences in FASTA format by clicking Download and selecting FASTA (complete sequence) (see Note 9).

3.2.2 Renaming Headers

Now that we have generated our initial dataset, we recommend renaming the sequence headers in the FASTA file before moving on to any downstream analyses. The new headers should be relatively short while remaining informative and devoid of special characters. We recommend maintaining the unique sequence accession number and adding a unique species abbreviation in the shortened headers. Using unique species abbreviations in the headers will simplify the downstream tree analysis. Based on the taxids for the taxa used during the Blastp search, a species tree can be built using either NCBI's Common Tree (https://www.ncbi.nlm.nih.gov/Taxonomy/CommonTree/wwwcmt.cgi) or the NCBI taxonomy module (from ETE [12, 13]) for making a species tree based on NCBI taxids. The NCBI taxonomy module provides increased resolution of species relationships and was used here (see supplementary material for the script commontree.py). We then manually added the species abbreviations used in our headers after the taxonomic name for each species (Fig. 5).
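Header renaming is easy to get wrong by hand, so it may help to script it. The sketch below assumes NCBI-style headers of the form ">accession description [Species name]" and an abbreviation-then-accession naming scheme; both the scheme and the abbreviation HOMSA are assumptions made for illustration (NEMVE is used in this chapter), and the sequence shown is a placeholder.

```python
import re

# Species abbreviations supplied by the user. NEMVE appears in this chapter;
# HOMSA is an assumed abbreviation used here only for illustration.
ABBREVIATIONS = {"Homo sapiens": "HOMSA", "Nematostella vectensis": "NEMVE"}

# Captures the accession (first token) and the species in the trailing brackets.
HEADER_PATTERN = re.compile(r"^>(\S+).*\[([^\[\]]+)\]\s*$")


def rename_header(line, abbreviations):
    """Turn an NCBI-style FASTA header into '>ABBREVIATION_accession'."""
    match = HEADER_PATTERN.match(line)
    if match is None:
        raise ValueError("unexpected header format: " + line.strip())
    accession, species = match.groups()
    return ">{}_{}".format(abbreviations[species], accession)


def rename_fasta(lines, abbreviations):
    """Rename header lines and pass sequence lines through unchanged."""
    return [rename_header(line, abbreviations) if line.startswith(">")
            else line.rstrip("\n")
            for line in lines]


example = [
    ">NP_000537.3 cellular tumor antigen p53 isoform a [Homo sapiens]",
    "MKTAYIAK",  # placeholder residues, not the real sequence
]
renamed = rename_fasta(example, ABBREVIATIONS)
print(renamed)
```

A species missing from the abbreviation table raises a KeyError, which is deliberate: it flags an unexpected organism in the download instead of silently producing an uninformative header.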

Fig. 5 Taxonomic selection. Species tree of the species included in the study. The species abbreviation used in tree figures is shown within parentheses. The placement of taxonomic groups used in the study (vertebrates, metazoan, and choanoflagellates) is emphasized. Throughout this protocol, we refer to non-vertebrates as all species not included in the vertebrate set and invertebrates as all species not included in the vertebrate and choanoflagellate sets. The species tree was generated based on the organisms' taxids from NCBI with Python using the "NCBI taxonomy module" in ETE [12, 13]. See supplementary material for the script commontree.py

3.3 Making a High-Quality Multiple Sequence Alignment

When homologous sequences have been identified, these sequences must be aligned. Proteins evolve by amino acid substitutions and by insertions and deletions (indels). The purpose of creating a multiple sequence alignment is to place homologous residues from the different sequences in the same site and to add gaps as needed to account for indels as accurately as possible, although gapped regions are difficult to align [14]. If the sequence composition of the dataset is changed, either by adding or removing a sequence or by, e.g., truncating a sequence, the multiple sequence alignment must be redone. The amino acid identities at homologous sites across the multiple sequence alignment can be used to infer a protein family phylogeny that displays how the sequences in the dataset are related to each other. There are many algorithms for making multiple sequence alignments, e.g., MUSCLE [15, 16], PRANK [17], T-Coffee [18], and MAFFT [19, 20] (see Note 10).

3.3.1 Multiple Sequence Alignment Construction

We will build a multiple sequence alignment using MAFFT from the MAFFT webserver for our initial dataset with the renamed headers:

1. Upload the FASTA file with the initial dataset to the MAFFT webserver https://mafft.cbrc.jp/alignment/server/.

2. Alignment strategy can be automatic (default), where the program chooses the best alignment strategy depending on the data. You can also select the alignment strategy yourself. Here, we recommend L-INS-i as it has been shown to be more accurate than other methods [21, 22].

3. Submit job.

4. After the program has finished, click View to see the alignment. You have the option to launch an MSAviewer in your current window or a new window.

5. To save a FASTA file with your alignment, click Export and select Export alignment (FASTA).
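For readers who prefer a local installation over the webserver, the same alignment strategy can be requested from the MAFFT command line, where L-INS-i corresponds to the options --localpair --maxiterate 1000. The sketch below is an optional alternative and not part of the protocol as described; it assumes that the mafft executable is installed and on the PATH, and the file names are hypothetical.

```python
import shutil
import subprocess


def linsi_command(fasta_in):
    """Build the MAFFT command line corresponding to the L-INS-i strategy."""
    return ["mafft", "--localpair", "--maxiterate", "1000", fasta_in]


def run_linsi(fasta_in, fasta_out):
    """Align fasta_in with L-INS-i and write the aligned FASTA to fasta_out."""
    if shutil.which("mafft") is None:
        raise RuntimeError("mafft executable not found on PATH")
    # MAFFT writes the alignment to standard output.
    with open(fasta_out, "w") as out:
        subprocess.run(linsi_command(fasta_in), stdout=out, check=True)


# Hypothetical file name; print the command without running it.
print(" ".join(linsi_command("initial_dataset.fasta")))
```

Calling run_linsi("initial_dataset.fasta", "initial_dataset.aln.fasta") would then produce the aligned FASTA file that the webserver offers under Export.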

3.3.2 Multiple Sequence Alignment Visualization

1. Open the alignment viewer Jalview [23]. Jalview can be downloaded from http://www.jalview.org/Download.

2. Establish min–max lengths. Datasets often need to be refined to remove sequences that may affect downstream analyses. These sequences are easier to identify in the context of the multiple sequence alignment. Thus, it is recommended to establish min–max sequence length limits to remove sequences that negatively affect the quality of the dataset. Depending on the biological question you are trying to answer, you may also choose to remove sequences that have very large indel regions (whether large insertions or deletions). Typically, a large number of gaps in the alignment reduces its quality, which can in turn reduce the quality of the phylogeny that is reconstructed based on that alignment. From viewing the alignment of the initial dataset, it was observed that some sequences are quite a bit shorter than other sequences. To reduce the number of gaps in the alignment, four sequences that were shorter than 196 residues (half the length of human p53) were removed (see Note 11).

3. Remove sequences that are too long or too short according to your set criteria. To remove a "bad" sequence in Jalview, select the sequence you want to remove, right click, hover over Selection and then Edit, and click Cut. Whenever sequences are removed from the initial dataset, the methodology must include specific details for how sequence removal was determined so that the work can be reproduced (if you intend to publish it).

4. Save the edited FASTA file. We want to export the remaining sequences for a refined dataset. First, click Tools, select Preferences, click Output, and deselect all checked boxes under File Output. The default setting in Jalview adds a start-end range to the end of the header when exporting a file by outputting to a textbox. Changing this setting will allow us to maintain our headers without any additional information. Then, we want to remove all of the gaps by selecting Edit and clicking Remove All Gaps. To save a FASTA file of your cleaned dataset, select File (next to Edit), hover over Output to Textbox, choose FASTA format, and save the file. We can now realign our sequences and iterate the procedure as needed.

5. Realign the edited FASTA file by repeating steps 1–4 of this section. The resulting alignment will from here on be referred to as dataset 1: MSA_ALL_PROTEIN.

6. Reinspect the new alignment. In MSA_ALL_PROTEIN, large indels were still present, and therefore, additional partitions of the dataset may be of interest. In this case, two additional approaches were taken:

(a) We identified the Pfam domains for the human p53 sequence that was our query for the BLAST search that generated the initial dataset (see Note 6). A second dataset was constructed based on the region that corresponded to the p53 DNA-binding domain based on Pfam domain prediction [24, 25] of the query sequence. This dataset was based on the p53 DNA-binding domain boundaries (as predicted by Pfam) ± 10 residues. This dataset can be created from the first multiple sequence alignment by selecting the entire column for the site 10 residues past the most C-terminal residue in the p53 DNA-binding domain for the query sequence in Jalview and executing the command remove right. Thereafter, the column for the site 10 residues before the most N-terminal residue in the domain was selected, and the command remove left was executed (see Note 12). The resulting shorter sequences were extracted from Jalview, and a new MAFFT alignment was constructed as described above. The resulting alignment will from here on be referred to as dataset 2: MSA_ALL_DOMAIN.

(b) We excluded all non-vertebrate sequences from MSA_ALL_PROTEIN. This third dataset was constructed by extracting only the vertebrate sequences from MSA_ALL_PROTEIN in Jalview and building a new MAFFT alignment as described above. The resulting alignment will from here on be referred to as dataset 3: MSA_VERTEBRATE_PROTEIN. For an overview of the three dataset partitions, see Fig. 6.

Fig. 6 Sequence datasets. Multiple sequence alignment of the full-length sequences showing amino acid residues (colored blocks) and gaps (gray lines) generated by TreeView from the ETE Toolkit (http://etetoolkit.org/treeview/). This alignment will be referred to as MSA_ALL_PROTEIN. The red dashed lines correspond to the p53 DNA-binding domain in all sequences. The black dashed lines correspond to the vertebrate sequences

3.4 Building a Phylogenetic Tree

To perform an evolutionary analysis, we need to reconstruct the phylogenetic tree for the sequences in the dataset. The phylogenetic tree for a protein family aims to reflect its evolutionary history, although it may not always exactly follow the expected species tree, especially if the sequences have low or excessive divergence. However, the phylogeny often tends to at least group orthologous sequences (sequences related by speciation) in the same clade and paralogous sequences (sequences related by gene duplication) in different clades. The current state of the art for phylogenetic reconstruction includes maximum likelihood and Bayesian approaches [8]. As probabilistic model-based methods, these approaches need an appropriate model of evolution that describes how amino acids are substituted in order to infer phylogenies with statistical certainty. The model of evolution is determined through model testing, a necessary step before building a phylogenetic tree. PhyML is a phylogenetic reconstruction software that is an example of a maximum likelihood approach [26]. It is quick and can be run using the command line or on a webserver (http://www.atgc-montpellier.fr/phyml/). The webserver performs automatic model selection [27]. MrBayes is an example of a Bayesian approach [28]. It typically takes longer than PhyML and needs to be run on the command line. MrBayes can use a mixed-model approach instead of model testing.

3.4.1 Phylogenetic Reconstruction

1. For this easy protocol, we will use PhyML as an example. PhyML requires a multiple sequence alignment file in phylip format (see Note 13). The standard phylip format poses strict restrictions on sequence headers, limiting their length to only 10 characters. This often creates difficulties, as creating informative and unique headers within a 10-character limit can be challenging. A relaxed phylip format allows preservation of the original header name by relaxing the 10-character limit. To this end, we will use SeaView (http://doua.prabi.fr/software/seaview) [22] to convert our multiple sequence alignments to phylip format while maintaining the desired headers. SeaView has a default limit of 30 characters for headers. If this is still too restrictive for your data, you can increase this limit. To do so, click Props and select Customize. Where phylip names width is indicated, write in your desired header length and then hit Apply. You can then save your multiple sequence alignment in the relaxed phylip format by selecting File, clicking Save as, and choosing *.phy as the file extension.

2. To build the phylogeny, upload the phylip file to the PhyML webserver. Our sequence data is based on amino acids, so the data type needs to be changed to reflect this. Everything else is left at the default settings because we want PhyML to select the model for us, although model parameters can also be set manually on the PhyML webserver. It should be noted that the PhyML default for evaluating branch supports across the phylogeny is a fast likelihood-based method (SH-like support) [26]. Although we use the default SH-like support for all trees in this protocol, we recommend changing the setting from SH-like support to performing 100 bootstraps or more when the datasets have been completely finalized, unless the datasets are very large.

3. When the reconstruction is completed, the PhyML results are emailed in a conveniently packed zip file. The zip file contains information about the best model of evolution from the model test and statistics about the tree building session. Importantly, it contains the tree file in Newick tree format.
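The difference between strict and relaxed phylip format described in step 1 can be made concrete with a short script. The following is a minimal sketch of a FASTA-to-relaxed-phylip conversion in sequential layout, offered as an illustration of the format and not as a replacement for SeaView; the headers and the short aligned sequences are made-up placeholders.

```python
def read_fasta(lines):
    """Parse FASTA lines into a list of (name, sequence) tuples."""
    records, name, chunks = [], None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if name is not None:
                records.append((name, "".join(chunks)))
            name, chunks = line[1:].split()[0], []
        else:
            chunks.append(line)
    if name is not None:
        records.append((name, "".join(chunks)))
    return records


def to_relaxed_phylip(records):
    """Format aligned records as sequential relaxed phylip (names not cut to 10 characters)."""
    if not records:
        raise ValueError("no sequences given")
    lengths = {len(seq) for _, seq in records}
    if len(lengths) != 1:
        raise ValueError("all aligned sequences must have the same length")
    width = max(len(name) for name, _ in records) + 2
    lines = ["{} {}".format(len(records), lengths.pop())]
    for name, seq in records:
        # Names must not contain whitespace; padding separates name and sequence.
        lines.append(name.ljust(width) + seq)
    return "\n".join(lines) + "\n"


# Placeholder alignment with hypothetical headers.
example_records = read_fasta(
    [">HOMSA_NP_000537.3", "MEE-", "PQ", ">NEMVE_EXAMPLE", "ME--PQ"]
)
phylip = to_relaxed_phylip(example_records)
print(phylip)
```

The first line of the output holds the number of sequences and the alignment length; each following line holds one full-length name and its aligned sequence.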

3.4.2 Tree Visualization

To open the Newick tree file and to view the tree, we will use FigTree (http://tree.bio.ed.ac.uk/software/figtree/). As the tree is opened, FigTree provides the option to rename the values associated with the nodes or branches. Here, these values reflect the SH-like branch supports calculated by PhyML [26]. The sequence names in the phylogenetic tree are called leaves or tips:

1. Root the tree, either by outgroup or by midpoint rooting:

(a) To root by outgroup, there must be an outgroup sequence or an outgroup clade included in the dataset. For instance, if the tree contains orthologs from mammals and chicken, the sequence from chicken can be used as an outgroup. To root by outgroup in FigTree, select the branch leading to the leaf for chicken by clicking on the leaf name, and then click reroot. To use an entire outgroup clade to root the tree, select the branch leading to that clade followed by reroot.

(b) If there is no clear outgroup, you can root your tree by midpoint. When you root by midpoint, the root is placed in the middle of the longest path between the most distant leaves in your tree. To root by midpoint in FigTree, go to Tree and Midpoint Root.

2. Arrange in decreasing node order. In FigTree, go to Tree and Decreasing Node Order.

3. Show branch support either by number or symbol under Node label or Node shape, respectively. If shape is used, remember to add a Legend that explains what the colors and/or size of the shape mean.

4. Adjust font size of all text to make it easily readable.

5. Decorate the tree for easy analysis: color clades and leaves as desired.

In our example, we rooted the tree for MSA_ALL_PROTEIN by its choanoflagellate outgroup and arranged it in decreasing node order. SH-like supports are displayed as node shapes. The tree built on MSA_ALL_DOMAIN was rooted by midpoint because the sequences from choanoflagellates did not form a monophyletic group. Similarly, the tree built on MSA_VERTEBRATE_PROTEIN, which does not have an outgroup sequence, was rooted by midpoint. Choanoflagellate sequence names were colored green, invertebrate names were colored blue, and vertebrate names were colored gray. The branches for three vertebrate clades representing p53, p63, and p73 that formed after gene duplications were colored pink, red, and blue, respectively, while the non-vertebrate branches were colored black.
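To make the definition of midpoint rooting in step 1(b) concrete, the sketch below locates the midpoint of the longest leaf-to-leaf path in a small tree. It illustrates the definition only and does not reproduce FigTree's implementation; the tree is represented as a dictionary of neighbors with branch lengths, and the node names and lengths are made up.

```python
def farthest(tree, start):
    """Return (node, distance, path) for the node farthest from start."""
    best = (start, 0.0, [start])
    stack = [(start, None, 0.0, [start])]
    while stack:
        node, parent, dist, path = stack.pop()
        if dist > best[1]:
            best = (node, dist, path)
        for neighbor, length in tree[node].items():
            if neighbor != parent:
                stack.append((neighbor, node, dist + length, path + [neighbor]))
    return best


def midpoint(tree):
    """Return (left, right, offset): the root lies on branch left-right, offset away from left."""
    any_node = next(iter(tree))
    leaf_a, _, _ = farthest(tree, any_node)       # one end of the longest path
    _, total, path = farthest(tree, leaf_a)       # the other end and the full path
    half, walked = total / 2.0, 0.0
    for left, right in zip(path, path[1:]):
        length = tree[left][right]
        if walked + length >= half:
            return left, right, half - walked
        walked += length
    return None  # a tree with a single node has no branch to root on


# Made-up unrooted tree: leaves A, B, C joined at internal node X.
example_tree = {
    "X": {"A": 1.0, "B": 2.0, "C": 5.0},
    "A": {"X": 1.0},
    "B": {"X": 2.0},
    "C": {"X": 5.0},
}
print(midpoint(example_tree))
```

In this example the most distant leaves are C and B (path length 7.0), so the root falls on the branch between C and X, 3.5 units from C, which leaves both C and B at the same distance from the root.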

Fig. 7 Phylogenetic tree of the full-length sequences. Choanoflagellate, invertebrate, and vertebrate sequence names at the tips are shown in green, blue, and gray, respectively. The vertebrate clade has three separate subclades: p53 (pink), p73 (blue), and p63 (red). The filled circle at each node represents SH-like support colored according to legend (blue is highly supported; red is not supported). The tree was rooted with the monophyletic choanoflagellate clade (that includes all choanoflagellate sequences) as outgroup, shown in decreasing node order and visualized with FigTree (http://tree.bio.ed.ac.uk/software/figtree/)

3.4.3 Tree Analysis

The tree for MSA_ALL_PROTEIN (Fig. 7) is in rather good agreement with the species tree (Fig. 5). The largest discrepancy is seen for the cnidarian anemone Nematostella vectensis (NEMVE) that has three homologs in this phylogeny. These three copies are likely the result of gene duplication. One of these paralogs is close to where it should be based on the species tree; the other two are in a clade with the sea urchin, Strongylocentrotus purpuratus (STRPU), and lancelet, Branchiostoma floridae (BRAFL), forming the closest sister clade to vertebrates. To elucidate the evolutionary history of these sequences within invertebrates, additional sequences would need to be added. The vertebrate sequences form a monophyletic clade. The p73 clade is in agreement with the species tree. The p53 clade is missing sequences from opossum, Monodelphis domestica (MONDO), and platypus, Ornithorhynchus anatinus (ORNAN), that are not included in the dataset, and the sequence from anole lizard, Anolis carolinensis (ANOCA), forms the outgroup of the clade. It should also be noted that the longer branches in the p53 clade mean that these sequences are evolving more rapidly than the sequences in the p73 and p63 clades. The p63 clade has poor resolution due to its sequences being extremely conserved. From the SH-like branch support values, we note that most of the early branches are well supported, while some of the more recent branches are not. Overall, we judge that these trees are of acceptable quality for this study, but that may not always be the case (see Note 14). Sometimes, the initial tree analysis will point us toward problems in the dataset suggesting that the dataset can be refined so that the multiple sequence alignment is improved and, ultimately, a better tree can be built. This can be time-consuming but can be limited if the dataset is carefully selected from the beginning of the study.

3.5 Predicting Disorder

To perform an evolutionary analysis of intrinsic disorder, we need to predict intrinsic disorder propensity for all sequences in our three datasets (see Note 15). There are many different predictors available that are typically based on either a scoring function or a machine learning approach that utilizes an evolutionary sequence profile to predict intrinsic disorder based on sequence data alone (for review, see [29]). In this protocol, we will perform an evolutionary analysis of intrinsic disorder, and it is therefore preferable to choose a predictor that does not assume that intrinsic disorder is evolutionarily conserved. Thus, we chose IUPred, a predictor that infers intrinsic disorder based on a scoring function and not evolutionary profiles. For each residue in a protein sequence, the scoring function for IUPred aims to estimate a pseudo-energy based on the interaction potential of one residue with its neighboring residues in the linear sequence context. It is assumed that residues in globular, foldable protein regions are able to form contacts and that residues that do not form such contacts have a propensity toward intrinsic disorder [30, 31]. IUPred-long (the default) is a widely used predictor with an accuracy in the range of 62–85% [32]. The IUPred webserver has recently been updated to IUPred2A with only minor bug fixes to the original IUPred predictor [33], so we assume that IUPred2A has similar accuracy.

3.5.1 Intrinsic Disorder Prediction

The IUPred2A webserver (https://iupred2a.elte.hu/) allows us to submit an input file with the unaligned sequences in our datasets. We simply uploaded the input file, provided an email address, kept the default settings, and submitted the job. It is important that the sequences in the input file are unaligned. If gaps are included, IUPred will attempt to generate a prediction for the gap characters, which is inaccurate. For our example, we removed all gaps from the three datasets used to build trees and ran these through IUPred2A. For the second dataset that has only the part of the sequences that correspond to the p53 DNA-binding domain, it should be noted that since IUPred uses a windowing approach, performing the predictions on partial sequences may slightly alter the results. An alternative way is to run the prediction on the entire sequence and to only include the disorder propensities for the residues that are part of the domain when calculating the percentage of disorder. In this easy protocol, for dataset 2, disorder is predicted only on the part of the sequence that corresponds to the p53 DNA-binding domain. For each residue in the protein, IUPred returns a value between 0 and 1 that reflects the disorder propensity for that residue. IUPred does not automatically calculate the percentage of disorder for a given sequence. To determine the percentage of disorder in our sequences, we need to decide which cutoff to use. IUPred was developed to have a cutoff = 0.5, where residues with disorder propensities of 0.5 or greater were classified as unstructured or intrinsically disordered and residues with disorder propensities below 0.5 were classified as ordered. It has been shown that the accuracy of IUPred may be improved if the cutoff = 0.4 [34]. To parse the output files from IUPred2A, we used a custom Python script, iupred_parser.py (see supplementary material for this chapter). This script maps the predictions from IUPred2A onto the multiple sequence alignment to generate two matrices: one of continuous values (0–1) and one of binary values (0 = order, 1 = disorder). The binary values are determined by a user-provided cutoff value. The binary matrix facilitates the easy analysis of disorder conservation at a site in, e.g., Excel. The script also generates two FASTA files based on the binary predictions mapped onto the multiple sequence alignment: one with values 0 and 1 and another with values K (order) and E (disorder). The latter allows for visualization of the binary disorder predictions using TreeView from the ETE Toolkit (http://etetoolkit.org/treeview/; Fig. 8). Finally, the script provides the percent disorder of each sequence based on the user-provided cutoff value. We ran the iupred_parser.py script using a cutoff of 0.4 on the IUPred2A output files for all three datasets.
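The mapping and cutoff logic described above can be sketched in a few lines. This is not the authors' iupred_parser.py, only a minimal illustration of the same idea: per-residue scores are binarized at a user-provided cutoff, mapped onto an aligned sequence as K (order) and E (disorder) with gaps preserved, and summarized as percent disorder. The function names, the aligned sequence, and the scores are made-up placeholders.

```python
def percent_disorder(scores, cutoff=0.4):
    """Percentage of residues whose disorder propensity is at or above the cutoff."""
    disordered = sum(1 for score in scores if score >= cutoff)
    return 100.0 * disordered / len(scores)


def map_to_alignment(aligned_seq, scores, cutoff=0.4, gap="-"):
    """Map per-residue scores onto an aligned sequence as K (order) / E (disorder)."""
    n_residues = len(aligned_seq) - aligned_seq.count(gap)
    if n_residues != len(scores):
        raise ValueError("number of scores does not match number of residues")
    remaining = iter(scores)
    states = []
    for character in aligned_seq:
        if character == gap:
            states.append(gap)  # gaps receive no prediction
        else:
            states.append("E" if next(remaining) >= cutoff else "K")
    return "".join(states)


# Placeholder aligned sequence and made-up propensities for its four residues.
aligned = "MK-TA"
scores = [0.80, 0.35, 0.40, 0.10]
print(map_to_alignment(aligned, scores))  # binary states in alignment coordinates
print(percent_disorder(scores))           # percent disorder for the ungapped sequence
```

Because the scores are computed on the ungapped sequence and only afterwards placed into alignment coordinates, gap characters never receive a prediction, in line with the requirement that the IUPred input be unaligned.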

Fig. 8 Evolution of disorder. Multiple sequence alignment of the p53 DNA-binding domain showing amino acid residues colored based on their IUPred prediction: ordered residues in blue and disordered residues in red and gaps as gray lines. The phylogeny is rooted by midpoint because the four sequences from the two choanoflagellates (MONBR and SALRO) do not form a monophyletic group and cannot be used as an outgroup. The figure was based on the FASTA file (.ek.fa) from the iupred_parser.py script and the phylogenetic tree that was built from the corresponding multiple sequence alignment allowing visualization of the IUPred predictions on the alignment in an evolutionary context using TreeView from the ETE Toolkit (http://etetoolkit.org/treeview/)

Protein Family Analysis of Intrinsic Disorder

3.6 Evolutionary Dynamics of Intrinsic Disorder


Based on the phylogenies and IUPred2A predictions, different hypotheses can be tested. In our example, we first tested if there was more disorder in vertebrates than in non-vertebrates for the entire p53 protein and for only the p53 DNA-binding domain, and we also compared the amount of disorder in the different vertebrate clades. To this end, we used Excel on the output file from iupred_parser.py to divide the percentages of disorder into the partitions of our choice, vertebrates vs non-vertebrates (datasets 1 and 2) and the p53, p63, and p73 clades (dataset 3). Next, we used Python to visualize the data for the different partitions as boxplots (Fig. 9) and Mann-Whitney U tests to test for significance (Table 3) (see Note 16). It must be noted that by simply comparing the distribution of percent disorder for different partitions, the phylogenetic structure within the partitions is ignored. The comparisons of disorder percentage per partition show that there is significantly more disorder in vertebrate p53 proteins than in non-vertebrates, and the same holds for the p53 DNA-binding domain only. On the protein level, for the three paralogous vertebrate clades, only the comparison between p63 and p73 is significant, but on the p53 DNA-binding domain level, all comparisons are significant.
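As a rough illustration of the significance test named above, the following Python sketch computes a Mann-Whitney U statistic with a two-sided p-value from the normal approximation (with tie and continuity corrections), using only the standard library. It is an assumption-laden sketch, not the authors' supplementary script: the function names are hypothetical, the input percentages are made up, and `bonferroni_alpha` shows one common reading of a Bonferroni correction (dividing α by the number of comparisons), which may differ from the "simplified" correction used for Table 3. In practice, a statistics package such as SciPy (`scipy.stats.mannwhitneyu`) or R would normally be used, and small samples are better served by an exact test.

```python
import math


def rank_with_ties(values):
    """Return 1-based ranks, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average = (i + j) / 2.0 + 1.0  # mean of ranks i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = average
        i = j + 1
    return ranks


def mann_whitney_u(x, y):
    """Return (U, two-sided p) using the normal approximation."""
    n1, n2 = len(x), len(y)
    pooled = list(x) + list(y)
    ranks = rank_with_ties(pooled)
    u1 = sum(ranks[:n1]) - n1 * (n1 + 1) / 2.0
    u = min(u1, n1 * n2 - u1)  # report the smaller of U1 and U2
    n = n1 + n2
    counts = {}
    for value in pooled:
        counts[value] = counts.get(value, 0) + 1
    tie_term = sum(t ** 3 - t for t in counts.values())
    variance = n1 * n2 / 12.0 * ((n + 1) - tie_term / (n * (n - 1)))
    sigma = math.sqrt(max(variance, 0.0))
    if sigma == 0.0:
        return u, 1.0  # all values identical: no evidence of a difference
    z = (abs(u - n1 * n2 / 2.0) - 0.5) / sigma  # continuity correction
    p = math.erfc(z / math.sqrt(2.0))           # two-sided tail area
    return u, min(p, 1.0)


def bonferroni_alpha(alpha, n_comparisons):
    """Per-test threshold under a plain Bonferroni correction (assumed form)."""
    return alpha / n_comparisons


# Made-up percent-disorder values for two partitions
non_vertebrates = [12.0, 15.5, 18.0, 20.1, 22.4]
vertebrates = [25.0, 27.3, 30.8, 33.1, 35.6, 40.2]
u_stat, p_value = mann_whitney_u(non_vertebrates, vertebrates)
print(u_stat, p_value, p_value < bonferroni_alpha(0.05, 3))
```

As the text notes, a test of this kind treats the sequences in each partition as independent observations and therefore ignores the phylogenetic structure within the partitions.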

Fig. 9 Disorder comparisons. Boxplots showing percent disorder per full-length protein (a) or per p53 DNA-binding domain (b) for the non-vertebrate sequences (non-v) and the vertebrate sequences (v). For the vertebrate sequences, percent disorder for each paralogous clade (p53, p63, and p73) is shown per protein (a) and per domain (b). The median percent disorder per dataset is shown in yellow. This figure was generated with Matplotlib in Python


Table 3 Testing for significance between different partitions

Mann-Whitney U                      U statistic   p-value         Significant^a?

Non-vertebrates vs. vertebrates
  Protein                           146           1.08 × 10^-3    Yes
  p53 domain                        109           7.50 × 10^-5    Yes

Vertebrates per clade
  p53 vs. p63 protein               47            3.22 × 10^-1    No
  p53 vs. p73 protein               28            5.53 × 10^-2    No
  p63 vs. p73 protein               23            4.45 × 10^-3    Yes

Vertebrates per clade
  p53 vs. p63 domain                11            1.18 × 10^-3    Yes
  p53 vs. p73 domain                23            2.37 × 10^-3    Yes
  p63 vs. p73 domain                4             4.28 × 10^-3    Yes

^a Compared to a simplified Bonferroni multiple hypothesis testing correction for α = 0.05
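The paragraph that follows describes a per-site disorder percentage computed in Excel from the binary matrix. For readers who prefer a script, the same calculation can be sketched in Python. The sketch assumes the binary predictions are available as equal-length strings of 0, 1, and - (one per aligned sequence); the function name and the toy rows are hypothetical. With `count_gaps=True` the denominator is the total number of residues plus gaps at the site, as in the formula given in the text; with `count_gaps=False` gaps are ignored, which reproduces the misleading case of a nearly all-gap site appearing fully disordered.

```python
def site_disorder_percent(rows, count_gaps=True):
    """Percent of sequences predicted disordered at each alignment site.

    rows: equal-length strings over '0' (order), '1' (disorder), '-' (gap).
    Returns one value per site; None where the denominator would be zero.
    """
    n_sites = len(rows[0])
    if any(len(row) != n_sites for row in rows):
        raise ValueError("all rows must have the alignment length")
    percentages = []
    for column in zip(*rows):
        disordered = column.count("1")
        residues = len(column) - column.count("-")
        denominator = len(column) if count_gaps else residues
        if denominator == 0:
            percentages.append(None)  # fully gapped site, gaps ignored
        else:
            percentages.append(100.0 * disordered / denominator)
    return percentages


# Toy binary alignment: four sequences, three sites
rows = ["110", "1-0", "0-1", "1--"]
print(site_disorder_percent(rows))                    # [75.0, 25.0, 25.0]
print(site_disorder_percent(rows, count_gaps=False))  # site 2 becomes 100.0
```

Running the function separately on the rows of each partition gives the per-partition profiles that are compared in the text.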

Next, we used Excel and the matrix output file (from the iupred_parser.py script) with the disorder predictions as binary states to calculate the percent disorder per alignment site across the p53 DNA-binding domain, since this part of the alignment has fewer gaps. Disorder percentage was calculated for each alignment site as

$$\mathrm{Disorder}\,(\%) = \frac{\text{Number of residues predicted to be disordered}}{\text{Total number of residues} + \text{Gaps}} \times 100$$

Disorder percentage can also be calculated ignoring gaps. In that case, otherwise fully gapped sites with one insertion that is disordered would appear to be 100% conserved in disorder, which is misleading. Alternatively, sites with gaps can be excluded from the analysis. The advantage of excluding gapped sites is that these tend to be less well aligned [14]. If gapped sites are removed, they must not be removed until after the predictions have been completed, since the IUPred predictions depend on the sequence context. As we here chose to focus on dataset 2, we divided the percentages for the sequences in this dataset into a non-vertebrate partition and into the three vertebrate partitions based on the three paralogous clades p53, p63, and p73, as shown in Fig. 6. For each partition, we calculated the site-specific conservation of intrinsic disorder as shown above. Intrinsic disorder appears almost completely conserved at most sites for the p63 and p73 clades but is incrementally less conserved for the p53 clade and for the non-vertebrates (Fig. 10). However, all four partitions have conserved disorder around alignment site 301, suggesting that this disordered region is of critical importance to the function of the p53 DNA-binding domain.

Fig. 10 Conservation of disorder. Percent disorder per site in the p53 DNA-binding domain for (a) non-vertebrates, (b) p53, (c) p63, and (d) p73. This figure was generated in Excel

3.7 Reflections and Future Directions

The protocol presented aims to get the reader ready to start performing their own evolutionary analysis of intrinsic disorder across protein families. This easy protocol emphasizes the necessity of carefully selecting the sequences to be included in the analysis. It is often beneficial to partition the data in different ways, as this allows different questions to be asked and answered. Here, we relied on the information in the multiple sequence alignment and its corresponding phylogeny to partition the data. Partitions can also be based on other factors, but if so, it is recommended to also consider the actual phylogenetic context.

This is intended to be an easy protocol, and the analysis above is just the tip of the iceberg of what can be done. Ignoring the complete phylogenetic structure is a simplification and not formally correct. Instead, phylogenetically independent contrasts can be used to study how traits like disorder percentage evolve over a phylogeny [35]. We and others have started to investigate the site-specific transition rate between disorder and order on evolutionary time scales [9, 36–39], and similar analyses can be performed with phylogenies and output files from the iupred_parser.py script generated here. It is known that disordered regions can become structured in response to a change in their environment, such as a pH decrease, phosphorylation, or binding to another biomolecule [40], and non-conserved disorder between homologous sequences will likely affect how they respond to such changes by modifying the disorder/order transition in real time [41]. Therefore, detecting non-conserved disorder can help in detecting functional divergence [41]. Work that addresses these questions in silico followed by experimental verification will provide valuable insights into how intrinsic disorder in proteins evolves.

4 Notes

1. When building a protein family for an unfamiliar protein, our preference is to gather information about the taxonomic distribution of the protein using NCBI's BLAST servers and databases. Depending on the question that we are trying to explore, we may then choose to limit the taxa that we are working with to only a certain taxonomic grouping (e.g., vertebrates only), or we may choose to select a few representative species that encompass the full taxonomic distribution in which our protein was found, as in the example we discuss.

2. The nr database is the largest database, and although nr stands for nonredundant, there is a lot of redundancy on the actual amino acid sequence level in this database. The refseq_protein database contains better annotated sequences for entire proteomes, and the protein sequences are linked to their corresponding nucleotide sequences. In March 2019, refseq_protein had proteomes from 88,816 species. The landmark database contains the proteomes from 27 model organisms in the refseq_protein database. The model organisms consist of 13 bacterial species, 2 archaeal species, and 12 eukaryotes. The eukaryotes can be further broken down to two plant species, two yeast species, one amoebozoan, one euglenozoan, one alveolate, and five metazoans (two invertebrates and three vertebrates). Other protein databases include the Protein Data Bank (PDB) [42], which contains the protein sequences from the structures in the PDB, and Swiss-Prot/UniProt [43], which contains manually annotated proteins. These two protein databases often contain valuable functional information but do not attempt to contain entire proteomes.

3. A good BLAST hit would have high query coverage, high sequence identity, and low E-value. However, what to consider good depends on the objective of your study. E-values depend on additional factors such as the length of your query and the size of the database. As sequence databases continue to grow, additional BLAST hits will result. For additional information on how to use BLAST properly, please see the Help tab on the NCBI BLAST server.

4. To decide on which strategy to use for a particular protein, you need to explore the distribution of its sequence space by using different BLAST databases or the literature. Some proteins are ubiquitous to the tree of life, while others are lineage specific. Some proteins have many paralogs, while others have none. Some proteins are very long, while others are short. Some proteins are highly conserved, while others are rapidly evolving. Depending on your objective, you need to decide which database and which species to include in your study. There are additional settings that can be modified in BLAST, and the specific information for each is available by a click on its question mark.

5. Regardless of the database used (Model organism or refseq_protein), isoforms will need to be dealt with. A more experienced bioinformatician, once they have chosen their representative species, may elect to create a custom local BLAST database using the canonical reference proteomes from UniProt [44]. The canonical reference proteomes contain the canonical isoform for each protein and solve the problem of which isoform to include. Further, this allows one to compose their own "Model organism database" based on their needs. For the purposes of this protocol, however, we will be using NCBI's BLAST webservers and databases and will discuss how we can handle the issue of isoforms when constructing our dataset.

6. Pfam can assist you in determining which regions of your protein are important for specific functions, such as DNA binding, oligomerization, or catalytic function. Based on Hidden Markov Models (HMMs), it attempts to predict where in your sequence certain regions correspond to a specific protein family (sequence-based evolutionary units) or protein domain (globular structural units). Pfam families and domains make up 97.2% of this database; for additional classifications, see [45]. Pfam protein families do not necessarily cover the full length of the sequence, and these are slightly different from the protein families covered in this protocol, where all homologous proteins are considered to be a protein family. The presence of the same Pfam family or domain in two highly divergent protein sequences indicates homology (albeit remote homology). Further, combining the functional annotation from Pfam with our protein families broadens the scope of our study. We can study the evolution not only of the full-length proteins but also of specific Pfam domains (or Pfam family regions).

7. When selecting which isoform to include in the dataset, we generally chose the longest protein coded by a particular gene. On occasion, you may encounter two isoforms that are the same length. In that event, we recommend making a selection based on which isoform has the lower E-value. If the E-values are the same, then we recommend basing the decision on which isoform has the higher query coverage and/or percent sequence identity. However, exceptions to these guidelines may occasionally be made. In the case of p53 for Equus caballus, the two longest isoforms, XP_023573921.1 and NP_001189334.1, are the same length. Both had an E-value of 0.0 and both had 100% query coverage. They differed in percent sequence identity to the query, with the former hit having 82.74% sequence identity and the latter having 82.49% identity. Based on the proposed guidelines for selecting which isoform to include in the dataset, we would have selected the hit with the higher percent sequence identity. Instead, we chose to select the experimentally verified isoform (NP_001189334.1) over the predicted isoform (XP_023573921.1). Another exception arose when selecting the representative p73 isoform for Mus musculus. Here, the longest isoform was a predicted protein (XP_006538783.2), while an experimentally verified protein (NP_035772.3) was the second longest isoform. Both hits had identical query coverage and percent sequence identity and nearly identical E-values (5e-84 and 3e-84, respectively). As with the p53 from Equus caballus, we chose to include the experimentally verified isoform as opposed to the predicted isoform. While this manual curation of the dataset is beneficial for individual small-scale studies, it is not feasible for large-scale studies. In all cases, the actual sequence accession numbers for the final sequence sets must be recorded.

8. For instance, Anolis carolinensis has four genes belonging to the p53 protein family due to a gene duplication in the gene that codes for p63. If we were selecting the sequences for our dataset based only on their annotations and the assumption that each species has one copy of each paralog, we would erroneously exclude one of the p63 proteins from our dataset.

9. FASTA format is a text-based way to represent sequence data using single-letter codes for amino acids or nucleotides. A ">" symbol distinguishes the sequence identifier (also referred to as a sequence header) from the other lines that contain actual sequence information. FASTA files may contain aligned or unaligned sequences. Ex.

>SEQUENCE_1
MADTTA-AGLIFYKL
>SEQUENCE_2
MADTT--AGILFYKL
>SEQUENCE_3
MAETTA-AGIIFY-L
>SEQUENCE_4
MAESTAAAGLLFY-L
>SEQUENCE_5
MAESTA-AGLIFY-L

10. There are numerous algorithms for making multiple sequence alignments. Many of these algorithms are found on the tool page for multiple sequence alignment from the European Bioinformatics Institute (EBI, https://www.ebi.ac.uk/Tools/msa/). These are also implemented in the alignment viewer Jalview (http://www.jalview.org/) and can be found under Web Service, followed by Alignment.

11. When refining the dataset using min–max sequence length cutoffs, the goal is to remove sequences that are clear outliers in the dataset based on their length. Sequences with significant length variations introduce more gaps to the alignment, which can affect the quality of the phylogeny that is built based on that alignment. There is no perfect formula for deciding what the min–max values should be; it should be decided based on the goal of the study. For instance, if the goal is to have a dataset composed of sequences with similar domain composition that can be easily aligned, the sequences should be kept more similar in length. When in doubt, we recommend initially not removing any sequences and using the original multiple sequence alignment to build a phylogeny. The multiple sequence alignment and phylogeny can then be inspected in tandem to better inform whether some sequences are out of place (compared to the species tree, for example) and ought to be removed. Refining the dataset is an iterative process. The researcher can start refining with more generous (i.e., more inclusive) cutoffs, realigning the remaining sequences after removing outlier sequences, building the phylogeny, and again determining if more sequences need removal in order to ensure best quality.

12. The Pfam domain boundaries for the p53 DNA-binding domain of the human p53 protein were used to create the partitioned dataset containing only the p53 DNA-binding domain. While this domain is expected to be about the same length across sequences, we chose to include 10 residues beyond the human domain boundaries in order to be more inclusive in the event of minor deviation in domain length across sequences. An alternative approach that requires a bit more effort would be to cut out the portion of each sequence that corresponds to the predicted domain boundaries for that sequence based on Pfam predictions. However, for simplicity or larger datasets, the approach used here is adequate.

13. Phylip format is a text-based file format that stores a multiple sequence alignment (it is not used for unaligned sequences) for proteins or nucleic acids using single-letter codes. At the top of a phylip file, there are two numbers separated by one or more spaces: the first indicates how many sequences are in the alignment, and the second indicates the length of the alignment. The headers and alignment follow the first line. On the left of the file, the sequence headers can be found. In a strict phylip file, which tends to be the standard, the header cannot exceed 10 characters. A relaxed phylip file does not impose length restrictions on the header. Following the header and on the same line, the sequence begins, split into groups of 10 characters separated by a space. Because the entire sequence cannot fit on one line, all sequences in the alignment continue in the same order on a separate block as necessary to write out the full sequence. Ex.

5 15
SEQUENCE_1 MADTTA-AGL IFYKL
SEQUENCE_2 MADTT--AGI LFYKL
SEQUENCE_3 MAETTA-AGI IFY-L
SEQUENCE_4 MAESTAAAGL LFY-L
SEQUENCE_5 MAESTA-AGL IFY-L

14. If we wanted to sort out more of the evolutionary events in invertebrates, we could add a few more species, make a new alignment, and rebuild the tree. Or, we could try to find the missing sequences from opossum, Monodelphis domestica (MONDO), and platypus, Ornithorhynchus anatinus (ORNAN), in a different database, add these to the dataset, realign, and rebuild the tree.


15. If experimental data are available on which residues in all sequences in our datasets are intrinsically disordered and which are ordered, they can be used instead, but that is rarely the case.

16. To perform tests of significance, we used Python 2 with Spyder in Anaconda2, but we recommend that readers use their preferred statistics package, such as R or SPSS. To compare the amount of disorder for two datasets (data1 and data2) using Spyder, see supplementary material.

References

1. Gabaldón T, Koonin EV (2013) Functional and evolutionary implications of gene orthology. Nat Rev Genet 14:360–366. https://doi.org/10.1038/nrg3456
2. Echave J, Spielman SJ, Wilke CO (2016) Causes of evolutionary rate variation among protein sites. Nat Rev Genet 17:109–121. https://doi.org/10.1038/nrg.2015.18
3. Brown CJ, Takayama S, Campen AM et al (2002) Evolutionary rate heterogeneity in proteins with long disordered regions. J Mol Evol 55:104–110
4. van der Lee R, Buljan M, Lang B et al (2014) Classification of intrinsically disordered regions and proteins. Chem Rev 114:6589–6631. https://doi.org/10.1021/cr400525m
5. Ahrens J, Rahaman J, Siltberg-Liberles J (2018) Large-scale analyses of site-specific evolutionary rates across eukaryote proteomes reveal confounding interactions between intrinsic disorder, secondary structure, and functional domains. Genes (Basel) 9:553. https://doi.org/10.3390/genes9110553
6. Ahrens J, Dos Santos HG, Siltberg-Liberles J (2016) The nuanced interplay of intrinsic disorder and other structural properties driving protein evolution. Mol Biol Evol 33:2248–2256. https://doi.org/10.1093/molbev/msw092
7. Light S, Sagit R, Sachenkova O et al (2013) Protein expansion is primarily due to indels in intrinsically disordered regions. Mol Biol Evol 30:2645–2653. https://doi.org/10.1093/molbev/mst157
8. Anisimova M, Liberles DA, Philippe H et al (2013) State-of the art methodologies dictate new standards for phylogenetic analysis. BMC Evol Biol 13:161. https://doi.org/10.1186/1471-2148-13-161
9. Dos Santos HG, Nunez-Castilla J, Siltberg-Liberles J (2016) Functional diversification after gene duplication: Paralog specific regions of structural disorder and phosphorylation in p53, p63, and p73. PLoS One 11:e0151961. https://doi.org/10.1371/journal.pone.0151961
10. Richter DJ, King N (2013) The genomic and cellular foundations of animal origins. Annu Rev Genet 47:509–537. https://doi.org/10.1146/annurev-genet-111212-133456
11. Suga H, Chen Z, de Mendoza A et al (2013) The Capsaspora genome reveals a complex unicellular prehistory of animals. Nat Commun 4:2325. https://doi.org/10.1038/ncomms3325
12. Huerta-Cepas J, Serra F, Bork P (2016) ETE 3: reconstruction, analysis, and visualization of phylogenomic data. Mol Biol Evol 33:1635–1638. https://doi.org/10.1093/molbev/msw046
13. Huerta-Cepas J, Dopazo J, Gabaldón T et al (2010) ETE: a python environment for tree exploration. BMC Bioinformatics 11:24. https://doi.org/10.1186/1471-2105-11-24
14. Golubchik T, Wise MJ, Easteal S, Jermiin LS (2007) Mind the gaps: evidence of bias in estimates of multiple sequence alignments. Mol Biol Evol 24:2433–2442. https://doi.org/10.1093/molbev/msm176
15. Edgar RC (2004) MUSCLE: multiple sequence alignment with high accuracy and high throughput. Nucleic Acids Res 32:1792–1797. https://doi.org/10.1093/nar/gkh340
16. Edgar RC (2004) MUSCLE: a multiple sequence alignment method with reduced time and space complexity. BMC Bioinformatics 5:113. https://doi.org/10.1186/1471-2105-5-113
17. Löytynoja A (2014) Phylogeny-aware alignment with PRANK. Methods Mol Biol 1079:155–170
18. Notredame C, Higgins DG, Heringa J (2000) T-coffee: a novel method for fast and accurate multiple sequence alignment. J Mol Biol 302:205–217. https://doi.org/10.1006/jmbi.2000.4042
19. Katoh K, Misawa K, Kuma K, Miyata T (2002) MAFFT: a novel method for rapid multiple sequence alignment based on fast Fourier transform. Nucleic Acids Res 30:3059–3066. https://doi.org/10.1093/nar/gkf436
20. Katoh K, Toh H (2008) Recent developments in the MAFFT multiple sequence alignment program. Brief Bioinform 9:286–298. https://doi.org/10.1093/bib/bbn013
21. Thompson JD, Linard B, Lecompte O, Poch O (2011) A comprehensive benchmark study of multiple sequence alignment methods: current challenges and future perspectives. PLoS One 6:e18093. https://doi.org/10.1371/journal.pone.0018093
22. Long H, Li M, Fu H (2016) Determination of optimal parameters of MAFFT program based on BAliBASE3.0 database. Springerplus 5:736. https://doi.org/10.1186/s40064-016-2526-5
23. Waterhouse AM, Procter JB, Martin DMA et al (2009) Jalview version 2: a multiple sequence alignment editor and analysis workbench. Bioinformatics 25:1189–1191. https://doi.org/10.1093/bioinformatics/btp033
24. Finn RD, Bateman A, Clements J et al (2014) Pfam: the protein families database. Nucleic Acids Res 42:D222–D230. https://doi.org/10.1093/nar/gkt1223
25. Finn RD, Coggill P, Eberhardt RY et al (2016) The Pfam protein families database: towards a more sustainable future. Nucleic Acids Res 44:D279–D285. https://doi.org/10.1093/nar/gkv1344
26. Guindon S, Dufayard J-F, Lefort V et al (2010) New algorithms and methods to estimate maximum-likelihood phylogenies: assessing the performance of PhyML 3.0. Syst Biol 59:307–321. https://doi.org/10.1093/sysbio/syq010
27. Lefort V, Longueville J-E, Gascuel O (2017) SMS: smart model selection in PhyML. Mol Biol Evol 34:2422–2424. https://doi.org/10.1093/molbev/msx149
28. Ronquist F, Huelsenbeck JP (2003) MrBayes 3: Bayesian phylogenetic inference under mixed models. Bioinformatics 19:1572–1574. https://doi.org/10.1093/bioinformatics/btg180
29. Meng F, Uversky VN, Kurgan L (2017) Comprehensive review of methods for prediction of intrinsic disorder and its molecular functions. Cell Mol Life Sci 74:3069–3090. https://doi.org/10.1007/s00018-017-2555-4
30. Dosztányi Z, Csizmok V, Tompa P, Simon I (2005) The pairwise energy content estimated from amino acid composition discriminates between folded and intrinsically unstructured proteins. J Mol Biol 347:827–839. https://doi.org/10.1016/j.jmb.2005.01.071
31. Dosztányi Z, Csizmok V, Tompa P, Simon I (2005) IUPred: web server for the prediction of intrinsically unstructured regions of proteins based on estimated energy content. Bioinformatics 21:3433–3434. https://doi.org/10.1093/bioinformatics/bti541
32. Di Domenico T, Walsh I, Tosatto SCE (2013) Analysis and consensus of currently available intrinsic protein disorder annotation sources in the MobiDB database. BMC Bioinformatics 14(Suppl 7):S3. https://doi.org/10.1186/1471-2105-14-S7-S3
33. Mészáros B, Erdős G, Dosztányi Z (2018) IUPred2A: context-dependent prediction of protein disorder as a function of redox state and protein binding. Nucleic Acids Res 46:W329–W337. https://doi.org/10.1093/nar/gky384
34. Fuxreiter M, Tompa P, Simon I (2007) Local structural disorder imparts plasticity on linear motifs. Bioinformatics 23:950–956. https://doi.org/10.1093/bioinformatics/btm035
35. Felsenstein J (1985) Phylogenies and the comparative method. Am Nat 125:1–15. http://www.jstor.org/stable/2461605
36. Dos Santos HG, Siltberg-Liberles J (2016) Paralog-specific patterns of structural disorder and phosphorylation in the vertebrate SH3–SH2–tyrosine kinase protein family. Genome Biol Evol 8:2806–2825. https://doi.org/10.1093/gbe/evw194
37. Ortiz JF, MacDonald ML, Masterson P et al (2013) Rapid evolutionary dynamics of structural disorder as a potential driving force for biological divergence in flaviviruses. Genome Biol Evol 5:504–513. https://doi.org/10.1093/gbe/evt026
38. Fahmi M, Ito M (2019) Evolutionary approach of intrinsically disordered CIP/KIP proteins. Sci Rep 9:1575. https://doi.org/10.1038/s41598-018-37917-5
39. Rahaman J, Siltberg-Liberles J (2016) Avoiding regions symptomatic of conformational and functional flexibility to identify antiviral targets in current and future coronaviruses. Genome Biol Evol 8:3471–3484. https://doi.org/10.1093/gbe/evw246
40. Smock RG, Gierasch LM (2009) Sending signals dynamically. Science 324:198–203. https://doi.org/10.1126/science.1169377
41. Ahrens JB, Nunez-Castilla J, Siltberg-Liberles J (2017) Evolution of intrinsic disorder in eukaryotic proteins. Cell Mol Life Sci 74:3163–3174. https://doi.org/10.1007/s00018-017-2559-0
42. Rose PW, Prlić A, Bi C et al (2015) The RCSB Protein Data Bank: views of structural biology for basic and applied research and education. Nucleic Acids Res 43:D345–D356. https://doi.org/10.1093/nar/gku1214
43. UniProt Consortium (2019) UniProt: a worldwide hub of protein knowledge. Nucleic Acids Res 47:D506–D515. https://doi.org/10.1093/nar/gky1049
44. The UniProt Consortium (2014) UniProt: a hub for protein information. Nucleic Acids Res 43:D204–D212. https://doi.org/10.1093/nar/gku989
45. El-Gebali S, Mistry J, Bateman A et al (2019) The Pfam protein families database in 2019. Nucleic Acids Res 47:D427–D432. https://doi.org/10.1093/nar/gky995
46. Altschul SF, Gish W, Miller W et al (1990) Basic local alignment search tool. J Mol Biol 215:403–410. https://doi.org/10.1016/S0022-2836(05)80360-2
47. Gouy M, Guindon S, Gascuel O (2010) SeaView version 4: a multiplatform graphical user interface for sequence alignment and phylogenetic tree building. Mol Biol Evol 27:221–224. https://doi.org/10.1093/molbev/msp259

Part III Production

Chapter 8

Expression and Purification of an Intrinsically Disordered Protein

Karamjeet K. Singh and Steffen P. Graether

Abstract

Intrinsically disordered proteins (IDPs) describe a group of proteins that do not have a regular tertiary structure and typically have very little ordered secondary structure. Despite not following the biochemical dogma of "structure determines function" and "function determines structure," IDPs have been identified as having numerous biological functions. We describe here the steps to express and purify the intrinsically disordered stress response protein, Late embryogenesis abundant protein 3-2 from Arabidopsis thaliana (AtLEA 3-2), with 15N and 13C isotopes in E. coli, although the protocol can be adapted for any IDP with or without isotopic labeling. The atlea 3-2 gene has been cloned into the pET-SUMO vector that, in addition to the SUMO portion, encodes an N-terminal hexahistidine sequence (His-tag). This vector allows for the SUMO-AtLEA 3-2 fusion protein to be purified using Ni-affinity chromatography and, through the use of ubiquitin-like-specific protease 1 (Ulp1, a SUMO protease), results in an AtLEA 3-2 with a native N-terminus. We also describe the expression and purification of Ulp1 itself.

Key words IDP, Intrinsically disordered proteins, Isotopic labeling, LEA proteins, Minimal media, NMR, Purification, Recombinant expression, SUMO protease, SUMO tag, Ulp1

Birthe B. Kragelund and Karen Skriver (eds.), Intrinsically Disordered Proteins: Methods and Protocols, Methods in Molecular Biology, vol. 2141, https://doi.org/10.1007/978-1-0716-0524-0_8, © Springer Science+Business Media, LLC, part of Springer Nature 2020

1 Introduction

Intrinsically disordered proteins (IDPs) are a relatively recently identified phenomenon, broadly described as a group of proteins that contain very little defined 3D structure yet still have important biological functions [1, 2]. Over 20 years of research have shown that IDPs have many roles, including transcription [3], cell cycle control [4], and protection from stress damage [5, 6]. It is estimated that ~20% of proteins encoded by eukaryotic genomes are disordered [7], yet many IDPs have not been studied in detail [8], making them attractive targets for characterization. With this interest also comes a need to be able to produce relatively large quantities of protein (milligram scale) for structural studies. While eukaryotic expression systems have been used to produce recombinant proteins [9, 10], the prokaryotic E. coli system is by far the most popular, due to its relatively low cost and ease of use [11]. Though many aspects of IDP purification will be similar or even identical to those for ordered proteins, this chapter notes some of the key differences.

We describe here a protocol for the production of an isotopically labeled, fully disordered protein for detailed NMR experiments using an E. coli recombinant expression system [12]. The protocol uses as an example a plant stress protein that is studied in our laboratory, the late embryogenesis abundant 3-2 protein from Arabidopsis thaliana (AtLEA 3-2, AT3G53770.1). The protein is fused with a His-tagged SUMO domain that is used as part of the purification process and results in a mature AtLEA 3-2 with a completely native N-terminal sequence after cleavage using the protease ubiquitin-like-specific protease 1 (Ulp1) [13]. Ulp1 recognizes the structure of the SUMO domain rather than just a cleavage site, allowing for precise removal of this domain and leaving no residues from a recognition sequence that may interfere with the residual structure of an IDP or its function.

Because of their inability to crystallize, X-ray crystallography is not typically used to characterize the structure of highly disordered proteins. Therefore, characterization of IDPs, such as AtLEA 3-2, is performed using various biophysical techniques (Fourier-transform infrared spectroscopy, fluorescence resonance energy transfer, and circular dichroism [14]), but residue-specific information without mutation or modification can only be obtained by nuclear magnetic resonance (NMR) spectroscopy, and in-cell NMR promises characterization of IDPs in a near-native environment [15]. Another benefit of NMR is that it is not an "all or nothing" technique, with intermediate NMR experiments still providing relevant structural information (e.g., amounts of residual structure, levels of disorder, and identifying residues involved in ligand binding). The protocol presented here pertains to the production of AtLEA 3-2 in M9 minimal media with 13C-glucose and 15NH4Cl isotopes for detailed analysis of dynamics and residual structure by NMR but can be easily adapted to other IDPs and/or for growth in rich, unlabeled media for other biophysical experiments.

2 Materials

For buffers and solutions, use water with a resistivity of at least 18.2 MΩ cm, whereas for larger-volume bacterial media preparations, distilled water can be used. For chemicals, use reagent grade or higher. Prepare and store all solutions and buffers at 4 °C unless stated otherwise. Bacterial media should be sterilized the day before use. Sterilization of rich media can be performed using an autoclave; sterilization of solutions and minimal media can be

Producing Recombinant IDPs

performed using a sterile vacuum filter unit for volumes 100 mL and a sterile syringe filter for volumes 99%) of SUMO-AtLEA 3-2, add Ulp1 (a SUMO protease) in an enzyme/protein (w/w) ratio of 1:500. Perform the digestion reaction at 30 °C for 1 h before moving to 4 °C overnight.

3.4 Purification Step 2

1. Repeat steps 1–4 in Subheading 3.3 to separate cleaved AtLEA 3-2 protein from the His-tagged SUMO domain, uncleaved SUMO-AtLEA 3-2 protein, and any E. coli proteins that bound during the first purification. The cleaved AtLEA 3-2 will be in the column flowthrough and wash fractions.
2. Collect the flowthrough and wash as 1 mL fractions using a fraction collector, and collect the elution samples in a 50 mL conical tube.
3. Assess which fractions contain AtLEA 3-2 using Tris-tricine PAGE (see Note 9). Pool the fractions.
4. To desalt and remove small contaminants that may not be visible by PAGE, perform reversed-phase (RP)-HPLC.
5. Add TFA to the sample (0.651 μL/mL sample). Prepare an analytical-scale C4 or C18 RP-HPLC column as directed by the manufacturer.
6. Set the flowrate to 1 mL/min, and wash the RP-HPLC column with two CV of HPLC Buffer B, followed by two CV of HPLC Buffer A for baseline equilibration.
7. Inject the sample, and elute the protein using a buffer gradient of 0–100% HPLC Buffer B over 1 h. Monitor A280 to detect elution of the protein (see Note 8). Collect 1 mL fractions in glass tubes.
8. Determine which fractions contain the AtLEA 3-2 protein by running samples using Tris-tricine PAGE. Pool fractions in a 50 mL conical tube, and store the protein at −80 °C.

9. To remove the TFA and acetonitrile and prepare the protein for long-term storage, lyophilization can be used. If lyophilization is damaging to the protein, then alternative approaches are needed to remove these compounds [17].

3.5 Ulp1 Expression and Purification

1. It is assumed that the Ulp1 plasmid (Ulp1 gene, coding residues 403–621 in the pET 28b vector) [13] has been transformed into competent E. coli BL21(DE3) cells.
2. Grow the transformed bacteria overnight in a small volume (5 mL) of LB media with 50 μg/mL kanamycin (5 μL of stock) at 37 °C with shaking at 250 rpm.
3. Transfer 2 mL of the overnight culture into each of two baffled flasks containing 500 mL of LB media with 50 μg/mL kanamycin (500 μL of stock). Grow the cells at 37 °C with shaking at 250 rpm.
4. Grow the large-volume cultures until an OD600 of 1 is reached (approximately 3–4 h) before transferring the flasks to a shaker-incubator set to 30 °C with shaking at 250 rpm.
5. After 1 h at 30 °C, induce expression by adding IPTG to a final concentration of 0.4 mM (500 μL of 0.4 M IPTG per flask). Let the cells incubate for 3 h.
6. Pellet the cells by centrifugation at 6000 × g for 30 min.
7. Resuspend the cells in 15 mL of Ulp1 Cell Resuspension buffer. After resuspension, add more Ulp1 Cell Resuspension buffer to attain a final total volume of 25 mL. Freeze at −20 °C for overnight storage.
8. Thaw cells at room temperature with occasional hand agitation. Add 2.4 mL 4 M NaCl, 54.8 μL Triton X-100, 4.2 mg DTT, and 18.6 mg imidazole per 25 mL resuspended pellet to give final concentrations of 350 mM NaCl, 1 mM DTT, 0.2% (w/v) Triton X-100, and 10 mM imidazole.
9. Sonicate the cellular suspension on ice for a total of 10 min with sonication on for 10 s and off for 10 s, on power setting 5, with a probe tip sonicator.
10. Remove cellular debris, unlysed cells, and insoluble proteins by centrifugation at 32,000 × g for 30 min at 4 °C. Filter supernatant through a 0.2 μm Whatman™ GD/X syringe filter.
11. Prepare a 5 mL HisTrap column or equivalent by washing the column with five CV Ulp1 Binding Buffer. Load the cellular supernatant onto the column.
12. Wash the column with five CV of Ulp1 Binding Buffer or until the absorbance at 280 nm (A280) returns to baseline.

13. Elute the Ulp1 protease with a linear gradient of 10 CV between Ulp1 Binding Buffer and Ulp1 Elution Buffer at a flowrate of 2.5 mL/min over 20 min. Collect 500 μL fractions using a fraction collector.
14. Identify the fractions that contain the protein by running Tris-tricine PAGE. Pool fractions containing the purified Ulp1, and dialyze at 4 °C against three changes of 2 L of Ulp1 Dialysis Buffer.
15. Determine the protein concentration of Ulp1 by measuring the A280 and using a mass extinction coefficient of 1.17 L g⁻¹ cm⁻¹. Aliquot the protease as 250 μL samples. Flash freeze using liquid N2, and store at −80 °C until use.

3.6 Making a Tris-Tricine Polyacrylamide Gel

1. Prepare the separating gel by adding 3.89 mL water, 5 mL Tricine Gel Buffer, 4.5 mL 40% acrylamide (19:1 acrylamide/bis) (see Note 12), and 2.0 g glycerol in a 50 mL Erlenmeyer flask. Add 50 μL APS and 15 μL TEMED, and cast gel within a gel cassette (10 cm × 10 cm × 2 mm). Allow space for the stacking layer, and gently overlay the unpolymerized gel solution with water-saturated isobutanol.
2. Once the separating layer has polymerized, pour off the layer of isobutanol and rinse the top of the gel with water. Remove water using a paper towel touched to the edge of the separating gel.
3. Prepare the stacking layer by adding 3.89 mL water, 1.5 mL Tricine Gel Buffer, and 0.81 mL acrylamide in a 50 mL beaker. Add 50 μL APS and 25 μL TEMED. Add the solution to the top of the separating gel. Gently insert a 10-well comb immediately, taking care not to introduce air bubbles.
4. If gels are not to be used that day, wrap in water-soaked paper towels, and store at 4 °C for up to 5 days.
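The recipe above can be checked with a short calculation of the final acrylamide percentage in each layer. The following is only a sketch under stated assumptions that are not part of the protocol: a glycerol density of 1.26 g/mL, additive volumes, negligible APS and TEMED volumes, and the stacking layer being made from the same 40% acrylamide stock as the separating gel. The function name is hypothetical.

```python
# Assumptions (not from the protocol): glycerol density of 1.26 g/mL,
# additive volumes, APS/TEMED volumes ignored, and the stacking layer
# prepared from the same 40% acrylamide stock as the separating gel.
GLYCEROL_DENSITY_G_PER_ML = 1.26
STOCK_ACRYLAMIDE_PERCENT = 40.0


def final_acrylamide_percent(acrylamide_ml, other_volumes_ml, glycerol_g=0.0):
    """Return the final acrylamide percentage after dilution of the stock."""
    glycerol_ml = glycerol_g / GLYCEROL_DENSITY_G_PER_ML
    total_ml = acrylamide_ml + sum(other_volumes_ml) + glycerol_ml
    return STOCK_ACRYLAMIDE_PERCENT * acrylamide_ml / total_ml


# Separating gel: 4.5 mL stock, 3.89 mL water, 5 mL gel buffer, 2.0 g glycerol
separating = final_acrylamide_percent(4.5, [3.89, 5.0], glycerol_g=2.0)
# Stacking gel: 0.81 mL stock, 3.89 mL water, 1.5 mL gel buffer
stacking = final_acrylamide_percent(0.81, [3.89, 1.5])

print(f"separating gel: {separating:.1f}% acrylamide")
print(f"stacking gel:   {stacking:.1f}% acrylamide")
```

Under these assumptions the separating layer works out to roughly a 12% gel and the stacking layer to roughly 5%; the same function can be reused when scaling the recipe to a different cassette volume.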

3.7 Tris-Tricine Polyacrylamide Gel Electrophoresis

1. Take a 10 μL aliquot of the sample and add 10 μL of 2× Tris-Loading Buffer.
2. Heat the samples at 90 °C for 10 min.
3. Set up the electrophoresis chamber by locking in the gels according to the manufacturer’s instructions. Add Cathode Buffer to the upper buffer chamber (negative electrode) and Anode Buffer to the lower buffer chamber (positive electrode).
4. Load 20 μL of each sample in the wells of the gel.
5. Run the gel at 75 V for 15 min to allow the sample to enter the separating gel, then 150 V for 1 h. In cases where a higher resolution of the gel bands is required, lower the second voltage to 120 V.

Fig. 1 Recombinant expression, purification, and cleavage of AtLEA 3-2. The Tris-tricine gel shows the results of the major steps of the purification process. Ladder, broad-range molecular weight marker; Preinduction, whole bacterial cell lysate before adding IPTG; Post-induction, whole bacterial cell lysate 16 h after adding IPTG; Load, the sample loaded onto the first Ni-affinity purification; Predigestion, pooled Ni-affinity fractions before Ulp1 digestion; Post-digestion, after Ulp1 digestion and the sample loaded onto the second Ni-affinity purification; FPLC#2 Fractions, pooled Ni-affinity fractions after Ulp1 digestion; HPLC Fractions, pooled fractions after RP-HPLC purification. Arrows and labels on the right indicate where the various proteins migrate

6. Following electrophoresis, pry the gel plates open with a plastic spatula. The gel will remain on one of the glass plates. Rinse the gel gently with water, and transfer to a clean plastic container.
7. Add Stain Solution until the gel is submerged. Let the gel sit in stain for 30 min with gentle shaking.
8. Remove Stain Solution and add Destain Solution until the gel is submerged. Let the gel sit in destain for 1–2 h with gentle shaking.
9. Remove Destain Solution and store the gel in water. Take a photograph of the gel for analysis (see Note 13); see Fig. 1.

4 Notes

1. For cloned genes containing rare codons, E. coli strains such as Rosetta (DE3) may be a more appropriate choice. In some cases, the expressed IDPs are insoluble and may end up in inclusion bodies, and other vectors with different tags may be needed [12].
2. Baffled flasks ensure that bacteria are sufficiently aerated. An alternative to purchasing baffled flasks is to use Erlenmeyer flasks and have them modified by a glass-blowing shop.
3. The amount of glucose for optimal protein expression should be first determined using unlabeled glucose. In cases where only 15N labeling is required, the amount of glucose added to the media can be doubled to 4 g glucose per liter M9.
4. To avoid potential degradation from heating and sample loss during the autoclave process, it is advised to sterilize glucose using a syringe filter.
5. Several of the growth and expression parameters (induction cell density, induction temperature, induction time, IPTG concentration, glucose concentration) may need to be optimized for other IDP expression systems. For test cultures to optimize expression, 50 mL of media in 250 mL shaker flasks are recommended, since in our experience 5 mL of media in test tubes does not faithfully reproduce the aeration conditions and bulk volume in a shaker flask.
6. Most IDPs will have optimal expression levels 3–4 h after induction.
7. If the protein is completely disordered or the tag does not encode a structured domain, an alternative approach is to boil cells for 20 min to lyse them and cause E. coli host proteins to aggregate [18, 19].
8. IDPs not containing tryptophan or tyrosine residues will not absorb appreciably at 280 nm. Alternatively, absorbance at 214 nm can be monitored to detect the peptide bond, but nonprotein organic molecules will also contribute to the absorbance.
9. Tris-tricine gels are better for separating proteins RgRC suggests that the IDP is more extended than a random coil.
It can also be useful to derive Flory’s exponent from the slope of the normalized Kratky plot at high q range using a molecular form-factor-based method [46]. This service can be accessed online at http://sosnick.uchicago.edu/SAXSonIDPs (Fig. 4c).

3.3.3 Pair-Wise Distance Distribution Function

The pair-wise distance distribution function, P(r), is the histogram of the distances between all the possible pairs of electrons within a molecule. The P(r) is generally obtained with an indirect Fourier transform of the intensity profiles using the assumptions that P(r) is zero at r = 0 and r = Dmax, where Dmax is the maximum linear dimension of the particle [47]. The P(r) can provide insight into the shape of the protein and, being a real-space representation, gives a more intuitive view of the data. The P(r) is generally expected to land smoothly in a concave fashion on the X-axis (Fig. 5). It should be noted here that it might be difficult to choose a single correct Dmax. This is particularly true for IDPs, and an initial indication might be taken from the Rg value derived from a Guinier analysis (see Note 13). In PrimusQT, the P(r) distribution can be obtained by using the Distance Distribution button in the Analysis tab. The lower and upper limits of the range can then be adjusted to exclude the points showing an upward curve in the low-q region and the highly noisy part in the high-q region.
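To make the real-space meaning of P(r) concrete, the sketch below builds a P(r)-like histogram directly from a set of coordinates and recovers Dmax and Rg from it. This is an illustration only and not the indirect Fourier transform used by PrimusQT: point-like beads of a hypothetical random-walk chain stand in for the electrons, the function names are invented for this example, and the relation Rg² = Σ r² P(r) / (2 Σ P(r)) is the standard real-space estimate of Rg.

```python
import numpy as np


def pair_distance_histogram(coords, n_bins=50):
    """Histogram of all pairwise distances between points (a P(r)-like curve)."""
    diff = coords[:, None, :] - coords[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=-1))
    upper = np.triu_indices(len(coords), k=1)  # count each pair once
    d = dist[upper]
    dmax = d.max()  # largest distance in the object
    counts, edges = np.histogram(d, bins=n_bins, range=(0.0, dmax))
    r = 0.5 * (edges[:-1] + edges[1:])  # bin centres
    return r, counts.astype(float), dmax


def rg_from_pr(r, p):
    """Real-space estimate: Rg^2 = sum(r^2 P(r)) / (2 sum(P(r)))."""
    return np.sqrt(np.sum(r ** 2 * p) / (2.0 * np.sum(p)))


def rg_from_coords(coords):
    """Rg computed directly as the RMS distance from the centroid."""
    centered = coords - coords.mean(axis=0)
    return np.sqrt((centered ** 2).sum(axis=1).mean())


# Hypothetical test object: a 200-bead random walk with 3.8 A steps.
rng = np.random.default_rng(0)
steps = rng.normal(size=(200, 3))
steps *= 3.8 / np.linalg.norm(steps, axis=1, keepdims=True)
coords = np.cumsum(steps, axis=0)

r, p, dmax = pair_distance_histogram(coords)
print(f"Dmax = {dmax:.1f} A")
print(f"Rg from P(r)        = {rg_from_pr(r, p):.1f} A")
print(f"Rg from coordinates = {rg_from_coords(coords):.1f} A")
```

The two Rg values should agree closely, which is a useful sanity check when comparing the Rg from a P(r) analysis with the Rg from a Guinier fit.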

Amin Sagar et al.

Fig. 5 Distance distribution analysis. (a) The distance distribution analysis for the Unique domain with the appropriate value of Dmax. Panels (b) and (c) show the scenarios where Dmax has been under- and overestimated, respectively

3.4 Ensemble Optimization Method for the Analysis of SAXS Profiles

IDPs or proteins hosting large intrinsically disordered regions (IDRs) adopt an astronomical number of conformations in solution. In the timescales used for SAXS data acquisition, the time-averaged SAXS intensity is equivalent to an ensemble-averaged intensity. Therefore, flexible systems are described more suitably as ensembles of conformations. The following section describes the analysis of SAXS data obtained on highly flexible proteins using the ensemble optimization method (EOM) [48, 49], which is a part of the ATSAS suite of programs. Similar analyses can be done with other programs, e.g., minimal ensemble search (MES) [50], basis-set supported SAXS (BSS-SAXS) [20], or ensemble refinement of SAXS (EROS) [21]. However, EOM is more widely used, and we therefore describe it in more detail below. The EOM approach consists of three steps:

1. Generation of a large ensemble of conformers representing the conformational space sampled by the protein in solution (see Notes 14 and 15).
2. Selection of a sub-ensemble of conformers that collectively describes the SAXS data.
3. Description of the structural properties of the selected ensemble including size and flexibility.

In EOM, the program RanCh (Random Chain) handles the first step, i.e., generation of an ensemble, while the program GAJOE (Genetic Algorithm Judging Optimization of Ensembles) handles the latter two steps. These two programs can be run automatically by using the program EOM or can be called upon individually. The following section describes the usage of the program EOM.

SAXS Analysis of IDPs

3.4.1 Generation of an Ensemble of Conformations

1. The program can be started by issuing the command eom from a terminal.
2. The interactive prompt allows defining the core symmetry, overall symmetry, and the percentage of symmetric structures (see Note 16).
3. The ensembles can be generated with three levels of compactness called compact chain, native, and random coil, in decreasing order of compactness. Random coil uses a Cα dihedral angle distribution consistent with chemically denatured proteins, while native uses a Cα dihedral angle distribution consistent with disordered proteins. On average, random coil models will be more extended than those defined as native-like. The compact option not only uses a Cα dihedral angle distribution consistent with disordered proteins but also forces the generated structures to be more compact.
4. In case of IDPs with no globular domains, the program can be executed by providing the sequence information in the form of a one-letter code text file and a three-column SAXS intensity profile having momentum transfer (q), scattering intensity (I(q)), and the associated experimental error or standard deviation (σ(q)). Multiple SAXS curves (e.g., from deletion mutants) can also be given to the program to be simultaneously fitted by the algorithm.
5. In case of proteins with globular domains with available structures/models and disordered regions, the number of globular domains must be specified, and the structure(s) of the globular domains can be supplied in the form of pdb files. In this case, the ensemble would consist of the supplied structures of domains in various orientations joined by the generated linkers/unstructured regions. If these globular domains are expected to obey a symmetry, the latter can be specified in the core symmetry. In most cases, the overall symmetry is absent as the full-length disordered portions are not expected to be symmetric.
6. All or some of the supplied structures can be fixed at their original positions, i.e., the positions in the input pdb files, to avoid generation of biologically nonsensical models.
7. If the distances between certain residues are known, they can be fed into the program in the form of a contact file to constrain these distances in the generated structures. An online tool can be used to help in writing the contact conditions file (https://www.embl-hamburg.de/biosaxs/atsas-online/contacts.php).
8. A large number of models should be generated in order to adequately sample the different sizes and shapes adopted by the biomolecule in solution; these models are then subjected to selection by the genetic algorithm in the next step. Ten thousand models are generally sufficient for most systems (see Note 15).
9. After execution, RanCh produces the following output files:
   (a) junX.eom: This file contains the theoretical intensity profiles of all the generated models. X stands for the prefix of the generated pdb files that can be specified during the execution of the program.
   (b) RanchX.log: This log file contains all the parameters used for running RanCh.
   (c) Size_listX.txt: This file contains the values of Rg and Dmax for all the conformations generated by RanCh.

3.4.2 Selection of a Sub-ensemble of Conformations that Describes the SAXS Data

1. The generated ensemble is then fed to GAJOE to select a sub-ensemble that best describes the measured SAXS data. The program tries to find a set of N conformations whose average theoretical SAXS profile (Eq. (5))

\[ I(q) = \frac{1}{N} \sum_{n=1}^{N} I_n(q) \tag{5} \]

fits the experimental SAXS profile with the lowest χ² value, where χ² is defined as (Eq. (6)):

\[ \chi^2 = \frac{1}{K-1} \sum_{j=1}^{K} \left[ \frac{\mu I(s_j) - I_{\exp}(s_j)}{\sigma(s_j)} \right]^2 \tag{6} \]

Here K is the number of experimental points, μ is a scaling factor, σ(s_j) is the experimental error, and s_j is the jth momentum transfer value (written q elsewhere in this chapter). The size of the sub-ensemble can either be set to a fixed number or be optimized during the minimization procedure. While minimizing the number of conformations to describe the data is a useful approach for proteins/complexes adopting a few well-defined conformations, this restriction may not be optimal for describing the behavior of IDPs, and setting the ensemble size to a fixed, relatively large number between 20 and 50 is recommended (see Note 17).
2. The output files generated after the execution of GAJOE contain the information about the parameters used for running the program and a quantitative description of the structural properties of the ensemble. For each experimental curve, a numbered folder with the name curve_m is generated, where m is the number in the order in which the curves were supplied. A brief description of the output files is as follows:
   (a) GA00n/curve_m/logFile_00n_m.log: A log file containing the parameters used to run the genetic algorithm.
   (b) GA00n/curve_m/profiles_00n_m.fit: This file contains the fit of the mth experimental SAXS profile and the theoretical SAXS profile of the selected sub-ensemble for the nth GAJOE run in the current directory. All the following files follow the same nomenclature for n and m.
   (c) GA00n/curve_m/Rg_distr_00n_m.dat: The first four lines of this file contain the average and histogram radii of gyration of the ensemble and of the selected sub-ensemble. The subsequent table has a three-column format, where the first column has the values of Rg followed by their frequencies in the total ensemble and the selected sub-ensemble, respectively.
   (d) GA00n/curve_m/Dmax_distr_00n_m.dat: This file has the same format as (c) but presents the values of Dmax instead of Rg.
   (e) GA00n/curve_m/CaCa_distr_00n_m.dat: This file has the same format as (c) but presents the average values of Cα-Cα distances instead of Rg.
   (f) GA00n/curve_m/Volume_distr_00n_m.dat: This file has the same format as (c) but presents the excluded volumes instead of Rg.
   (g) Statistics_flexibility.svg: This scalable vector graphics (svg) file has the values of Rflex, Rσ, and standard mathematical descriptors of distributions: standard deviation, average absolute deviation, kurtosis, skewness, and geometric average. Standard deviation is a measure of the variation of the distribution from the average. Average absolute deviation is the average of the absolute deviation from the average value. For normal distributions, it is expected to be ~0.8 times the standard deviation. Kurtosis and skewness measure the “peakedness” and asymmetry of the probability distribution, respectively. Geometric mean is a measure of the central tendency, defined as the nth root of the product of n numbers. The meaning of the metrics Rflex and Rσ is described later in more detail.
   (h) GA00n/curve_m/pdbs: This folder contains the structures of the conformations comprising the best-fitting sub-ensemble in PDB format.
3. It is a good practice to perform several EOM runs with an increasing number of conformations in the sub-ensemble (Fig. 6). Rg and Dmax distributions derived from this analysis should stay similar, indicating that the information about the overall size and shape of the protein in solution is captured by the approach. Note that when using a reduced number of conformations in the sub-ensemble, spikier distributions may be obtained.
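The scoring used in step 1 (Eqs. 5 and 6) can be sketched in a few lines. The example below is not GAJOE and does not implement its genetic algorithm; it only shows how a candidate sub-ensemble would be averaged and scored. Several parts are assumptions made for illustration: the synthetic pool of Guinier-shaped profiles, the noise model, the function names, and the choice of the least-squares value for the scaling factor μ.

```python
import numpy as np


def ensemble_average(profiles):
    """Eq. 5: average of N theoretical profiles, shape (N, K) -> (K,)."""
    return np.asarray(profiles, dtype=float).mean(axis=0)


def best_scale(i_calc, i_exp, sigma):
    """Weighted least-squares scaling factor mu (an assumed way to fix mu)."""
    w = 1.0 / sigma ** 2
    return np.sum(w * i_exp * i_calc) / np.sum(w * i_calc ** 2)


def chi_square(i_calc, i_exp, sigma):
    """Eq. 6: chi^2 between a scaled theoretical and an experimental profile."""
    mu = best_scale(i_calc, i_exp, sigma)
    residuals = (mu * i_calc - i_exp) / sigma
    return np.sum(residuals ** 2) / (len(i_exp) - 1)


# Hypothetical pool: 1000 Guinier-shaped profiles with Rg between 15 and 45 A.
rng = np.random.default_rng(1)
q = np.linspace(0.01, 0.3, 60)
rg_pool = rng.uniform(15.0, 45.0, size=1000)
pool = np.exp(-(q[None, :] * rg_pool[:, None]) ** 2 / 3.0)

# Synthetic "experiment": a 20-member sub-ensemble, scaled by 2.5, plus noise.
true_idx = rng.choice(len(pool), size=20, replace=False)
i_true = ensemble_average(pool[true_idx])
sigma = 0.02 * 2.5 * i_true + 1e-4
i_exp = 2.5 * i_true + rng.normal(scale=sigma)

chi_true = chi_square(i_true, i_exp, sigma)
chi_single = chi_square(pool[np.argmin(rg_pool)], i_exp, sigma)
print(f"chi^2, generating sub-ensemble: {chi_true:.2f}")
print(f"chi^2, single compact model:    {chi_single:.2f}")
```

The sub-ensemble that generated the data scores close to 1, whereas a single compact conformer fits far worse, which illustrates why an ensemble description is preferred for flexible systems.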

Fig. 6 EOM results using different sub-ensemble sizes. The distribution of (a) Rg and (b) Dmax for EOM analysis of the Unique domain of Src kinase with a fixed number of conformations in the sub-ensemble ranging from 5 to 50. It can be seen that the distributions are spikier with a small number of conformations and become smoother with an increasing number of conformations. However, increasing the number of conformations in the ensemble does not give rise to any artifacts like artificial bimodality (or multimodality). The gray area represents the distributions of the pool

3.4.3 Quantitative Estimation of Flexibility of the Ensemble

1. The comparison of the Rg/Dmax/Volume distributions of the selected sub-ensembles with those of the initial pools provides important insights into the size and flexibility of the biomolecule (see Notes 18 and 19). The width of the distribution indicates the size variability of the selected structures and is therefore indicative of flexibility. In other words, molecules that are more flexible give rise to broader distributions. The position of the peak of the distribution conveys the size of the sub-ensemble relative to the pool of randomly generated structures. For example, a shift of the peak position toward smaller Rg values indicates that the molecule adopts more compact conformations than the randomly generated structures and vice versa. The presence of multiple peaks indicates coexistence of multiple conformational/associational states (see the example of Ref. [51]).
2. EOM also outputs two metrics called Rflex and Rσ in order to quantitatively compare the distributions of the selected sub-ensemble to the total pool. Rflex is a measure of the information content of the distribution. Narrower distributions generally have more information than wider distributions. EOM expresses Rflex in the form of a percentage ranging from 0% (no flexibility, single rigid structure) to 100% (as much flexibility as in the total pool). The second metric, Rσ, illustrates the variance of the selected sub-ensemble with respect to the total pool. Rσ approaches 1.0 when the selected ensemble is almost as flexible as the total pool. EOM solutions are acceptable when Rflex of the selected sub-ensemble is smaller than that of the pool and Rσ < 1.0, indicating a more compact set of conformations than the pool, or when Rflex of the selected sub-ensemble is larger than that of the pool and Rσ > 1.0, indicating a highly flexible set of conformations. However, a combination of significantly smaller Rflex of the sub-ensemble compared to the pool and Rσ > 1 is a warning sign and rather points to poor data quality.
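The way these two metrics behave can be illustrated with a toy calculation. The sketch below is not the EOM implementation: it assumes that an Rflex-like quantity can be approximated by the Shannon entropy of the Rg histogram, normalized so that a uniform distribution gives 100%, and that an Rσ-like quantity is the ratio of the standard deviations of the selected sub-ensemble and the pool. The exact definitions and binning used by EOM are not given in this chapter and should be taken from the EOM documentation; the Rg values and function names below are hypothetical.

```python
import numpy as np


def normalized_entropy_percent(values, bins, value_range):
    """Shannon entropy of a histogram, scaled so a uniform histogram is 100%.

    Assumed stand-in for an Rflex-like metric, not the EOM definition.
    """
    counts, _ = np.histogram(values, bins=bins, range=value_range)
    prob = counts[counts > 0] / counts.sum()
    entropy = -np.sum(prob * np.log(prob)) / np.log(bins)
    return 100.0 * entropy


def sigma_ratio(selected, pool):
    """Assumed Rsigma-like metric: std of the selection over std of the pool."""
    return np.std(selected) / np.std(pool)


# Hypothetical Rg values (A): a broad pool and a compact, narrow selection.
rng = np.random.default_rng(2)
rg_pool = rng.normal(loc=30.0, scale=6.0, size=10000)
rg_selected = rng.normal(loc=24.0, scale=2.0, size=50)

value_range = (rg_pool.min(), rg_pool.max())
flex_pool = normalized_entropy_percent(rg_pool, bins=30, value_range=value_range)
flex_sel = normalized_entropy_percent(rg_selected, bins=30, value_range=value_range)
r_sigma = sigma_ratio(rg_selected, rg_pool)

print(f"entropy-based flexibility, pool:      {flex_pool:.0f}%")
print(f"entropy-based flexibility, selection: {flex_sel:.0f}%")
print(f"sigma ratio (selection / pool):       {r_sigma:.2f}")
```

In this toy case the selection has a lower entropy-based flexibility than the pool and a sigma ratio below 1, the combination described above as an acceptable solution for a set of conformations that is more compact than the pool.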

4

Notes 1. The sample volume required for a single measurement in the older generation synchrotrons and in-house instruments is about 50 μl. In modern, high-brilliance synchrotrons, the sample requirement is reduced to ~20 μl, and it can be further reduced to sub-μl range with the introduction of microfluidic devices. 2. The buffer composition must exactly match the composition of the sample: even small differences in the chemical composition between the buffer and the sample may lead to severe artifacts after background subtraction. Our preferred option is to use the last dialysis buffer for the background measurements. Phosphate buffers should be avoided in favor of TRIS, MOPS, or MES, which scavenge the free radicals formed due to radiation. Addition of 3–5% glycerol and/or 1–2 mM dithiothreitol (DTT) is further recommended to limit the radiation damage. 3. Columns with smaller void volume can save beam time but can sometimes compromise on the resolution. The choice of the column should be based on the purpose of the experiment and the size of the protein/complex. SEC runs should be done prior to the allocated beam time to confirm that the column provides the desired resolution. 4. The concentration needs to be determined precisely to have an accurate estimate of the molecular weight. The Bradford assays are generally not accurate enough for this purpose, and spectrophotometric methods (absorbance at 280 nm) are more suitable. Refractrometers can be used for proteins lacking aromatic amino acids (a situation common for many IDPs) [52]. 5. High-brilliance synchrotrons enable fast measurements but can also induce radiation damage in the samples, which can be a major obstacle in collecting good-quality SAXS data. Radiation damage refers to the irreversible aggregation, unfolding, or fragmentation of the proteins induced by high energy and flux of the X-ray beam striking the sample. 
The process is initiated by the photolysis of water leading to generation of free hydroxyl and hydroperoxyl radicals which quickly react with the backbone and side chains of proteins [53–55]. This problem is partially addressed at the synchrotrons by collecting

264

Amin Sagar et al.

data under continuous flow. Suitable solvent additives can further reduce the radiation damage. A detailed study of the use of additives for reducing radiation damage in SAXS experiments can be found in Ref. [55]. If, for some reason, such additives cannot be used, increasing the flow rate might be helpful. 6. It is crucial to collect SAXS data at different protein concentrations to ascertain the absence of any concentration-dependent effects. Typically, 3–5 concentrations are studied. If the protein is prone to aggregation, it is strongly advised to bring the sample to the synchrotron at low concentration and concentrate on site. After concentrating, the sample should be centrifuged, concentration remeasured, and SAXS data collected as soon as possible to minimize the presence of aggregates. 7. The sample to be studied with SAXS must be highly pure and monodisperse. The contaminants of higher molecular weight are especially undesirable because the scattering signal is proportional to the square of molecular weight. Therefore, even small quantities of high-molecular-weight contaminants can dominate the scattering. Purity can be assessed by SDS-PAGE, and contamination by nucleic acids can usually be detected by checking the ratio of absorbance at 280 and 260 nm. However, SDS-PAGE is not suitable to identify the presence of aggregates and other techniques; e.g., native PAGE, dynamic light scattering, and analytical ultracentrifugation must be used. SEC can be employed to both check for and remove the aggregated protein. A non-aggregated sample collected from the SEC should be rerun on the column after concentration (if required), to check if it stays in the non-aggregated form. In case of proteins, where aggregation is detected after concentrating the non-aggregated fraction, the use of SEC-SAXS mode is highly recommended. 8. It is extremely important to have an accurate scattering profile from the buffer for subtraction. 
The over- or underestimation of the contribution from the buffer can lead to artifacts in the analyses. The best practice is to dialyze the concentrated protein against the appropriate buffer and use the dialysate as blank for the buffer subtraction. Generally, the scattering data on the buffer is collected before and after each sample. 9. The buffer composition should be chosen to yield close-toideal solutions minimizing the interactions between the protein molecules, unless otherwise is desired. The concentration of salts should be kept to the minimum required for the solubility of the protein. The presence of salts/additives in the buffer decreases the contrast between the protein and the buffer and can negatively impact the data quality. Typically, NaCl

SAXS Analysis of IDPs

265

concentrations more than 0.5 M and glycerol more than 10% should be avoided. Crowding agents like polyethylene glycol (PEG) can be used if the purpose is to study the effect of molecular crowding which can be of relevance to systems with IDPs. SAXS studies in the presence of PEG have been conducted to understand the effect of crowding on the conformation of RNA helicase elF4A [56], RNA (a bacterial group I ribozyme) [57], and complex of DNA polymerase (gp5) with its processivity factor (trx) and helicase-primase (gp4D) complex along with DNA [58]. 10. The pairwise CorMap can be used to assess the similarity of the frames selected as buffer and sample [41]. For a set of similar frames, the CorMap will show a random pattern without any specific features. However, if the frames are different, the pairwise CorMap would show patches with positive and negative correlations. Pairwise CorMap analysis can be performed by opening the frames selected in CHROMIXS in Primus (file/ open selected frames in Primus) and using the Data Comparison option in Primus (Processing/Data Comparison). The p-value can then be used to determine if the frames are significantly different. 11. The partial specific volume of the protein, ν, can be calculated experimentally by densitometry or estimated on the basis of the sequence using the program SEDNTERP (http://bitcwiki.sr. unh.edu/index.php/Main_Page). The contrast can be calculated using the relation Δρ ¼ (ρprot ρsolvν)ro, where ρprot is the number of electrons per mass of dry protein and ρsolv is the number of electrons per volume of the solvent. 12. The Guinier approximation is considered to be valid for a momentum transfer range of q < 1.3 Rg for well-folded proteins. This range can lead to smaller Rgs for IDPs, and a smaller range of s < 1.1 Rg has been suggested [59]. A fourth-order correction to the Guinier approximation has also been suggested to have better estimates of Rg for IDPs and increase the range to ~q < 1.5 Rg [60]. 13. 
A relatively small population of highly extended conformations may lead to an underestimation of Dmax for IDPs as derived from the primary analysis of the SAXS data [42, 61]. 14. All methods ensuring exhaustive sampling of the conformational space can be used to generate the initial pool of conformations, e.g., long timescale or enhanced sampling molecular dynamics (MD) simulations, kinematics-based methods (RRT programs from IMP (http://salilab.org/imp/ download.html)), coil database building strategies [62–64], dihedral angle sampling (SASSIE [65]), and Monte Carlo methods. Molecular dynamics (MD) simulations are often

266

Amin Sagar et al.

considered a less suitable way to generate ensembles due to limited exploration of the conformational space and inaccuracy of the force fields. However, recent improvements in the hardware and force fields make MD-based strategies also appropriate for the generation of IDP ensembles [66]. 15. In our experience, 10,000 models generated by RanCh suffice to have an adequate sampling of the overall descriptors of the molecule like Rg and Dmax in almost all situations. This is adequate to describe the SAXS data, which is inherently low resolution. Certainly, the sampling is not sufficient at an atomic level as IDPs can adopt an astronomically large number of conformations that cannot be exhaustively sampled by any of the methods. 16. The first prompt while running EOM asks the user for the core symmetry. The supported symmetries are p1 to p19 (from no symmetry to 19-fold), p22, p32, p42. . ., p122, and p222. The second prompt asks for the overall symmetry of the particle. The acceptable answers are S (for generating only symmetric structures), A (for generating structures which have the specified core symmetric but no symmetry for the rest of the particle), and M (for a mixture of symmetric and asymmetric particles). If [M]ix is selected for overall symmetry, then the next prompt asks for the percentage of symmetric structures in the pool, where entering a value of 50, for example, will lead of a pool with 50% each of symmetric and asymmetric structures. 17. While running GAJOE, we fix the number of conformations in the sub-ensemble by setting both maximum and minimum number of conformations per ensemble to the same fixed value (e.g., 20) and not allowing curve repetition. 18. It might be noted that the repeated cycles of EOM can converge to different individual conformations, which is expected given the low resolution of SAXS data. However, the distribution of the overall shape and size parameters (Rg and Dmax) of the sub-ensembles selected by EOM should be similar. 
19. The results of EOM should be interpreted with care. The selected conformations should not be treated as the actual conformations adopted by an IDP in the solution. Instead, the distributions of the low-resolution parameters of the selected ensemble such as Rg and Dmax should be considered as the structural description of the conformations adopted by the protein in solution.
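The comparison suggested in Notes 18 and 19 can be made quantitative with a simple histogram overlap. The sketch below is only an illustration: it does not call EOM, it assumes that the Rg values of the selected conformations have already been extracted from two independent runs, and the function names are hypothetical.

```python
def histogram_fractions(values, lo, hi, bins):
    """Fraction of values falling in each of `bins` equal-width bins on [lo, hi]."""
    counts = [0] * bins
    width = (hi - lo) / bins
    for v in values:
        # The upper edge (v == hi) is assigned to the last bin.
        idx = min(int((v - lo) / width), bins - 1)
        counts[idx] += 1
    return [c / len(values) for c in counts]


def rg_distribution_overlap(rg_run_a, rg_run_b, bins=20):
    """Overlap coefficient between two Rg distributions (0 = disjoint, 1 = identical)."""
    if not rg_run_a or not rg_run_b:
        raise ValueError("both runs must contain at least one Rg value")
    lo = min(min(rg_run_a), min(rg_run_b))
    hi = max(max(rg_run_a), max(rg_run_b))
    if hi == lo:
        return 1.0  # every value is identical
    frac_a = histogram_fractions(rg_run_a, lo, hi, bins)
    frac_b = histogram_fractions(rg_run_b, lo, hi, bins)
    # Shared area of the two normalised histograms.
    return sum(min(a, b) for a, b in zip(frac_a, frac_b))
```

An overlap close to 1 indicates that two runs describe the same Rg distribution even when the individual conformations differ. The number of bins is an arbitrary choice and should be adapted to the number of selected conformations.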

SAXS Analysis of IDPs


Acknowledgments
This work was supported by the Labex EpiGenMed, an “Investissements d’Avenir” program (ANR-10-LABX-12-01). The CBS is a member of France-BioImaging (FBI) and the French Infrastructure for Integrated Structural Biology (FRISBI), two national infrastructures supported by the French National Research Agency (ANR-10-INBS-04-01 and ANR-10-INBS-05, respectively). D.S. acknowledges support by iNEXT, grant number 653706, funded by the Horizon 2020 program of the European Union.

References
1. Feigin LA, Svergun DI (1987) Structure analysis by small-angle X-ray and neutron scattering. Springer Science & Business Media, Berlin
2. Svergun DI, Koch MHJ (2002) Advances in structure analysis using small-angle scattering in solution. Curr Opin Struct Biol 12:654–660
3. Putnam CD, Hammel M, Hura GL, Tainer JA (2007) X-ray solution scattering (SAXS) combined with crystallography and computation: defining accurate macromolecular structures, conformations and assemblies in solution. Q Rev Biophys 40:191–285
4. Mertens HDT, Svergun DI (2010) Structural characterization of proteins and complexes using small-angle X-ray solution scattering. J Struct Biol 172:128–141
5. Jacques DA, Trewhella J (2010) Small-angle scattering for structural biology - expanding the frontier while avoiding the pitfalls. Protein Sci 19:642–657
6. Rambo RP, Tainer JA (2013) Super-resolution in solution X-ray scattering and its applications to structural systems biology. Annu Rev Biophys 42:415–441
7. Tuukkanen AT, Spilotros A, Svergun DI (2017) Progress in small-angle scattering from biological solutions at high-brilliance synchrotrons. IUCrJ 4:518–528
8. Bernadó P, Shimizu N, Zaccai G et al (2018) Solution scattering approaches to dynamical ordering in biomolecular systems. Biochim Biophys Acta Gen Subj 1862:253–274
9. Svergun DI (1999) Restoring low resolution structure of biological macromolecules from solution scattering using simulated annealing. Biophys J 76:2879–2886
10. Franke D, Svergun DI (2009) DAMMIF, a program for rapid ab-initio shape determination in small-angle scattering. J Appl Crystallogr 42:342–346

11. Evrard G, Mareuil F, Bontems F et al (2011) DADIMODO: a program for refining the structure of multidomain proteins and complexes against small-angle scattering data and NMR-derived restraints. J Appl Crystallogr 44:1264–1271
12. Petoukhov MV, Svergun DI (2005) Global rigid body modeling of macromolecular complexes against small-angle scattering data. Biophys J 89:1237–1250
13. Tuukkanen AT, Svergun DI (2014) Weak protein-ligand interactions studied by small-angle X-ray scattering. FEBS J 281:1974–1987
14. Herranz-Trillo F, Groenning M, van Maarschalkerweerd A et al (2017) Structural analysis of multi-component amyloid systems by chemometric SAXS data decomposition. Structure 25:5–15
15. Doniach S (2001) Changes in biomolecular conformation seen by small angle X-ray scattering. Chem Rev 101:1763–1778
16. Bernadó P, Blackledge M (2010) Proteins in dynamic equilibrium. Nature 468:1046–1048
17. Mylonas E, Hascher A, Bernadó P et al (2008) Domain conformation of tau protein studied by solution small-angle X-ray scattering. Biochemistry 47:10345–10353
18. Garcia-Pino A, Balasubramanian S, Wyns L et al (2010) Allostery and intrinsic disorder mediate transcription regulation by conditional cooperativity. Cell 142:101–111
19. Ribeiro EDA, Pinotsis N, Ghisleni A et al (2014) The structure and regulation of human muscle α-actinin. Cell 159:1447–1460
20. Yang S, Blachowicz L, Makowski L, Roux B (2010) Multidomain assembled states of Hck tyrosine kinase in solution. Proc Natl Acad Sci U S A 107:15757–15762
21. Różycki B, Kim YC, Hummer G (2011) SAXS ensemble refinement of ESCRT-III CHMP3 conformational transitions. Structure 19:109–116
22. Bernadó P, Svergun DI (2012) Analysis of intrinsically disordered proteins by small-angle X-ray scattering. Methods Mol Biol 896:107–122
23. Receveur-Brechot V, Durand D (2012) How random are intrinsically disordered proteins? A small angle scattering perspective. Curr Protein Pept Sci 13:55–75
24. Kikhney AG, Svergun DI (2015) A practical guide to small angle X-ray scattering (SAXS) of flexible and intrinsically disordered proteins. FEBS Lett 589:2570–2577
25. Kachala M, Valentini E, Svergun DI (2015) Application of SAXS for the structural characterization of IDPs. Adv Exp Med Biol 870:261–289
26. Cordeiro TN, Herranz-Trillo F, Urbanek A et al (2017) Small-angle scattering studies of intrinsically disordered proteins and their complexes. Curr Opin Struct Biol 42:15–23
27. Franke D, Petoukhov MV, Konarev PV et al (2017) ATSAS 2.8: a comprehensive data analysis suite for small-angle scattering from macromolecular solutions. J Appl Crystallogr 50:1212–1225
28. Hopkins JB, Gillilan RE, Skou S (2017) BioXTAS RAW: improvements to a free open-source program for small-angle X-ray scattering data reduction and analysis. J Appl Crystallogr 50:1545–1553
29. Brookes E, Pérez J, Cardinali B et al (2013) Fibrinogen species as resolved by HPLC-SAXS data processing within the UltraScan solution Modeler (US-SOMO) enhanced SAS module. J Appl Crystallogr 46:1823–1833
30. Brookes E, Vachette P, Rocco M, Pérez J (2016) US-SOMO HPLC-SAXS module: dealing with capillary fouling and extraction of pure component patterns from poorly resolved SEC-SAXS data. J Appl Crystallogr 49:1827–1841
31. Jeffries CM, Graewert MA, Blanchet CE et al (2016) Preparing monodisperse macromolecular samples for successful biological small-angle X-ray and neutron-scattering experiments. Nat Protoc 11:2122–2153
32. Mathew E, Mirza A, Menhart N (2004) Liquid-chromatography-coupled SAXS for accurate sizing of aggregating proteins. J Synchrotron Radiat 11:314–318
33. Pérez J, Nishino Y (2012) Advances in X-ray scattering: from solution SAXS to achievements with coherent beams. Curr Opin Struct Biol 22:670–678

34. Graewert MA, Franke D, Jeffries CM et al (2015) Automated pipeline for purification, biophysical and X-ray analysis of biomacromolecular solutions. Sci Rep 5:10734
35. Vestergaard B (2016) Analysis of biostructural changes, dynamics, and interactions - small-angle X-ray scattering to the rescue. Arch Biochem Biophys 602:69–79
36. Round AR, Franke D, Moritz S et al (2008) Automated sample-changing robot for solution scattering experiments at the EMBL Hamburg SAXS station X33. J Appl Crystallogr 41:913–917
37. Hura GL, Menon AL, Hammel M et al (2009) Robust, high-throughput solution structural analyses by small angle X-ray scattering (SAXS). Nat Methods 6:606–612
38. Shkumatov AV, Chinnathambi S, Mandelkow E, Svergun DI (2011) Structural memory of natively unfolded tau protein detected by small-angle X-ray scattering. Proteins 79:2122–2131
39. Bucciarelli S, Midtgaard SR, Pedersen MN et al (2018) Size-exclusion chromatography small-angle X-ray scattering of water soluble proteins on a laboratory instrument. J Appl Crystallogr 51:1623–1632
40. Panjkovich A, Svergun DI (2018) CHROMIXS: automatic and interactive analysis of chromatography-coupled small-angle X-ray scattering data. Bioinformatics 34:1944–1946
41. Franke D, Jeffries CM, Svergun DI (2015) Correlation map, a goodness-of-fit test for one-dimensional X-ray scattering spectra. Nat Methods 12:419–422
42. Bernadó P (2010) Effect of interdomain dynamics on the structure determination of modular proteins by small-angle scattering. Eur Biophys J 39:769–780
43. Le Guillou JC, Zinn-Justin J (1977) Critical exponents for the n-vector model in three dimensions from field theory. Phys Rev Lett 39:95–98
44. Kohn JE, Millett IS, Jacob J et al (2004) Random-coil behavior and the dimensions of chemically unfolded proteins. Proc Natl Acad Sci U S A 101:12491–12496
45. Bernadó P, Blackledge M (2009) A self-consistent description of the conformational behavior of chemically denatured proteins from NMR and small angle scattering. Biophys J 97:2839–2845
46. Riback JA, Bowman MA, Zmyslowski AM et al (2017) Innovative scattering analysis shows that hydrophobic disordered proteins are expanded in water. Science 358:238–241

47. Svergun DI (1992) Determination of the regularization parameter in indirect-transform methods using perceptual criteria. J Appl Crystallogr 25:495–503
48. Bernadó P, Mylonas E, Petoukhov MV et al (2007) Structural characterization of flexible proteins using small-angle X-ray scattering. J Am Chem Soc 129:5656–5664
49. Tria G, Mertens HDT, Kachala M, Svergun DI (2015) Advanced ensemble modelling of flexible macromolecules using X-ray solution scattering. IUCrJ 2:207–217
50. Pelikan M, Hura GL, Hammel M (2009) Structure and flexibility within proteins as identified through small angle X-ray scattering. Gen Physiol Biophys 28:174–189
51. Lira-Navarrete E, de las Rivas M, Compañón I et al (2015) Dynamic interplay between catalytic and lectin domains of GalNAc-transferases modulates protein O-glycosylation. Nat Commun 6:6937
52. Calçada EO, Korsak M, Kozyreva T (2015) Recombinant intrinsically disordered proteins for NMR: tips and tricks. Adv Exp Med Biol 870:187–213
53. Garrison WM (1987) Reaction mechanisms in the radiolysis of peptides, polypeptides, and proteins. Chem Rev 87:381–398
54. Kuwamoto S, Akiyama S, Fujisawa T (2004) Radiation damage to a protein solution, detected by synchrotron X-ray small-angle scattering: dose-related considerations and suppression by cryoprotectants. J Synchrotron Radiat 11:462–468
55. Jeffries CM, Graewert MA, Svergun DI, Blanchet CE (2015) Limiting radiation damage for high-brilliance biological solution scattering: practical experience at the EMBL P12 beamline PETRAIII. J Synchrotron Radiat 22:273–279
56. Akabayov SR, Akabayov B, Richardson CC, Wagner G (2013) Molecular crowding enhanced ATPase activity of the RNA helicase eIF4A correlates with compaction of its quaternary structure and association with eIF4G. J Am Chem Soc 135:10040–10047
57. Kilburn D, Roh JH, Guo L et al (2010) Molecular crowding stabilizes folded RNA structure by the excluded volume effect. J Am Chem Soc 132:8690–8696
58. Akabayov B, Akabayov SR, Lee S-J et al (2013) Impact of macromolecular crowding on DNA replication. Nat Commun 4:1615
59. Borgia A, Zheng W, Buholzer K et al (2016) Consistent view of polypeptide chain expansion in chemical denaturants from multiple experimental methods. J Am Chem Soc 138:11714–11726
60. Zheng W, Best RB (2018) An extended Guinier analysis for intrinsically disordered proteins. J Mol Biol 430:2540–2553
61. Heller WT (2005) Influence of multiple well defined conformations on small-angle scattering of proteins in solution. Acta Crystallogr D Biol Crystallogr 61:33–44
62. Bernadó P, Blanchard L, Timmins P et al (2005) A structural model for unfolded proteins from residual dipolar couplings and small-angle X-ray scattering. Proc Natl Acad Sci U S A 102:17002–17007
63. Ozenne V, Bauer F, Salmon L et al (2012) Flexible-meccano: a tool for the generation of explicit ensemble descriptions of intrinsically disordered proteins and their associated experimental observables. Bioinformatics 28:1463–1470
64. Estaña A, Sibille N, Delaforge E et al (2019) Realistic ensemble models of intrinsically disordered proteins using a structure-encoding coil database. Structure 27:381–391.e2
65. Curtis JE, Raghunandan S, Nanda H, Krueger S (2012) SASSIE: a program to study intrinsically disordered biological molecules and macromolecular ensembles using experimental scattering restraints. Comput Phys Commun 183:382–389
66. Robustelli P, Piana S, Shaw DE (2018) Developing a molecular dynamics force field for both folded and disordered protein states. Proc Natl Acad Sci U S A 115:E4758–E4766
67. Arbesú M, Maffei M, Cordeiro TN et al (2017) The unique domain forms a fuzzy intramolecular complex in Src family kinases. Structure 25:630–640.e4

Chapter 13
Determining Rg of IDPs from SAXS Data
Ellen Rieloff and Marie Skepö

Abstract
There is great interest within the research community in understanding the structure–function relationship of intrinsically disordered proteins (IDPs); however, the heterogeneous distribution of conformations that IDPs can adopt limits the applicability of conventional structural biology methods. Here, scattering techniques, such as small-angle X-ray scattering, can contribute. In this chapter, we describe how to make a model-free determination of the radius of gyration by using two different approaches, the Guinier analysis and the pair distance distribution function. The ATSAS package (Franke et al., J Appl Crystallogr 50:1212–1225, 2017) has been used for the evaluation, and throughout the chapter, examples are given to illustrate the discussed phenomena, as well as the pros and cons of the different approaches.

Key words Radius of gyration, Flexible proteins, Intrinsically disordered proteins, Scattering, Guinier, Pair distance distribution function, PRIMUS, GNOM, ATSAS

1 Introduction
In contrast to well-folded proteins, intrinsically disordered proteins (IDPs) sample a broad ensemble of rapidly interconverting conformations. This intrinsic flexibility can make it complicated to study IDPs and intrinsically disordered regions using traditional biophysical techniques. In recent years, small-angle X-ray scattering (SAXS) has emerged among the methods that are useful for characterizing IDPs, since it gives information about the interatomic distances within a biomolecule. The radius of gyration (Rg) is the root-mean-square distance between the different atoms of the IDP and its center of mass and gives information about the average degree of expansion (see Eq. (1), as well as the schematic illustration in Fig. 1).

$$R_g = \left( \frac{1}{n} \sum_{i=1}^{n} \left( \mathbf{r}_i - \mathbf{r}_{cm} \right)^2 \right)^{1/2} \tag{1}$$

Birthe B. Kragelund and Karen Skriver (eds.), Intrinsically Disordered Proteins: Methods and Protocols, Methods in Molecular Biology, vol. 2141, https://doi.org/10.1007/978-1-0716-0524-0_13, © Springer Science+Business Media, LLC, part of Springer Nature 2020


Fig. 1 Schematic illustration of the radius of gyration, which is the root-mean-square of the distance of the different atoms from the center of mass. The black line illustrates one conformation of the IDP in a typical ensemble. The gray dot indicates the center of mass (cm)
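Eq. (1) can be evaluated directly from atomic coordinates, for example for the conformations of a computed ensemble. The following minimal sketch assumes equally weighted atoms, as in Eq. (1), and averages an ensemble as the root-mean-square of the per-conformation values, which is an assumption of this illustration. The function names are hypothetical.

```python
import math


def radius_of_gyration(coords):
    """Rg of one conformation from Eq. (1); coords is a list of (x, y, z) tuples."""
    n = len(coords)
    # Centre of mass with equal weights for all atoms.
    r_cm = [sum(atom[k] for atom in coords) / n for k in range(3)]
    # Sum of squared distances of every atom from the centre of mass.
    sq_sum = sum(
        sum((atom[k] - r_cm[k]) ** 2 for k in range(3)) for atom in coords
    )
    return math.sqrt(sq_sum / n)


def ensemble_rg(conformations):
    """Root-mean-square Rg over a list of conformations."""
    mean_rg2 = sum(radius_of_gyration(c) ** 2 for c in conformations) / len(
        conformations
    )
    return math.sqrt(mean_rg2)
```

The result has the same length unit as the input coordinates. Note that this coordinate-based value does not include any hydration or contrast effects, so it is not expected to match an experimental SAXS value exactly.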

In Eq. (1), n is the number of atoms, whereas ri and rcm are the coordinates of atom i and of the center of mass of the molecule, respectively. When determined by SAXS analysis, Rg corresponds to the root-mean-square distance of all electrons from the center of mass of the protein, after the background electrons, i.e., the electrons in the buffer, have been subtracted. Rg is a useful quantity to monitor when studying how the system is affected by varying the solvent conditions or the sequence of the IDP [1–4]. If the sample of interest is polydisperse, a more in-depth understanding can be obtained by monitoring Rg while using SAXS in combination with high-performance liquid chromatography [5].

In this chapter, we assume that the reader has already collected SAXS data and completed the initial processing of the scattering intensity, I(q). The magnitude of the scattering vector, q, is defined as q = 4π sin θ/λ, where 2θ is the scattering angle and λ is the wavelength. Many methods are available for determining Rg from SAXS data. Here, two approaches will be described: (1) the Guinier approximation [6] and (2) the pair distance distribution function, P(r). The former states that for very small scattering angles, the scattering intensity depends on only two parameters, I(0) and Rg:

$$I(q) = I(0) \exp\left( -\frac{q^2 R_g^2}{3} \right), \tag{2}$$

where I(0) is the forward scattering (the scattering intensity extrapolated to q = 0). In practice, this means that ln I(q) plotted versus q² should be a linear function for a particle of any shape, where the slope holds information about Rg. The validity of the Guinier method can be characterized by the product qmaxRg, where qmax is the largest value of q used in the linear regression. As a rule of thumb, for folded proteins the approximation is valid up to qmaxRg ≈ 1.3 [7, 8], whereas for IDPs the corresponding number is qmaxRg ≈ 1.1 [9], or even as small as qmaxRg < 0.8 [10]. Attractive interactions between the IDPs, and nonspecific aggregates, result in an overestimation of Rg, whereas repulsive interactions between the proteins (see Note 1) result in an underestimation.

The second approach uses the pair distance distribution function, which in real space is a histogram of the distances, r, between all possible pairs of electrons within the protein. The scattering pattern, I(q), is a Fourier transform of its P(r) function, and vice versa:

$$P(r) = \frac{r^2}{2\pi^2} \int_0^{\infty} q^2 I(q) \frac{\sin(qr)}{qr} \, dq. \tag{3}$$

By definition, P(r) equals zero at r = 0, is nonnegative, terminates smoothly at the maximum dimension Dmax, and is zero for r > Dmax. The P(r) function can be assessed from the experimental scattering data using indirect Fourier transformation [11, 12], and Rg can be determined from P(r) using the following equation:

$$R_g^2 = \frac{\int_0^{D_{max}} r^2 P(r) \, dr}{2 \int_0^{D_{max}} P(r) \, dr}. \tag{4}$$

The P(r) distribution often gives a more reliable estimate than the Guinier approximation. This is more pronounced for IDPs [13], since the full scattering pattern is taken into consideration. A caveat is that the more extended the protein is, the greater is the requirement for experimental data to be collected at smaller angles, i.e., closer to the primary beam. The two approaches discussed above are described in detail in Subheading 3 (step-by-step procedure), with examples of how they can be applied in the ATSAS package [14]. For a comment on the notations and units used in this chapter and in the ATSAS package, see Note 2.
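Once a P(r) function is available on a grid of r values, Eq. (4) reduces to two numerical integrals. The sketch below uses the trapezoidal rule and checks the result against the analytical P(r) of a homogeneous sphere, for which Rg² = 3D²/20 with D the diameter. The sphere is used only as a test case with a known answer; the function names are hypothetical, and this is not how GNOM computes Rg internally.

```python
import math


def trapezoid(y, x):
    """Trapezoidal integral of samples y over the grid x."""
    return sum(0.5 * (y[i] + y[i + 1]) * (x[i + 1] - x[i]) for i in range(len(x) - 1))


def rg_from_pr(r, p):
    """Rg from a sampled pair distance distribution function, Eq. (4)."""
    numerator = trapezoid([ri * ri * pi for ri, pi in zip(r, p)], r)
    denominator = 2.0 * trapezoid(p, r)
    return math.sqrt(numerator / denominator)


def sphere_pr(r, diameter):
    """Unnormalised P(r) of a homogeneous sphere (test case only)."""
    x = r / diameter
    return x * x * (2.0 - 3.0 * x + x ** 3)


if __name__ == "__main__":
    d = 40.0  # sphere diameter, in the same length unit as r
    r_grid = [d * i / 2000 for i in range(2001)]
    p_grid = [sphere_pr(ri, d) for ri in r_grid]
    print(rg_from_pr(r_grid, p_grid), math.sqrt(3.0 / 20.0) * d)
```

Because Eq. (4) is a ratio of integrals, the normalisation of P(r) cancels, so the function can be applied directly to the P(r) column of an output file regardless of its scaling.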

2 Materials
The SAXS data used for the examples in this chapter were all collected at the BM29 beamline at the European Synchrotron Radiation Facility (ESRF) in Grenoble, France. SAXS data for the small disordered protein Histatin 5 are used in all examples except for Figs. 3 and 7, where data from measurements of the disordered protein Statherin are used. More information about the experiments can be found in refs. [15–17]. Readers interested in using the data for educational purposes are welcome to contact the authors. The ATSAS package [14] has been used for the evaluation of the data. Installation packages for Windows, Linux, and Mac are available at www.embl-hamburg.de/biosaxs/download.html. For academic users, the download is free of charge after registration.

3 Methods
The Guinier analysis allows model-free determination of Rg from scattering data in the limit of very small scattering angles. The natural logarithm of Eq. (2):

$$\ln I(q) = \ln I(0) - \frac{R_g^2 q^2}{3}, \tag{5}$$

shows the linear relation between ln I(q) and q². When ln I(q) is plotted against q², ln I(0) corresponds to the intercept and −Rg²/3 to the slope, as illustrated in Fig. 2. Since the maximum usable q-value can vary between different IDPs, it is important to try different values (see Note 3) and gain experience. The approximation is only valid if the region is linear, and nonlinearity in the lowest q-region signals interparticle interactions. Hence, a Guinier plot serves as an important check for protein–protein interactions. An upturn at low q indicates interparticle attraction, i.e., the presence of aggregates, while a downturn indicates interparticle repulsion, as illustrated in Fig. 3. The general steps of performing Guinier analysis are summarized in Subheading 3.1. How to perform the analysis in PRIMUS is exemplified on Histatin 5 in Subheading 3.2.

Fig. 2 A schematic Guinier plot, showing the intercept and the slope of the fitted straight line, from which the forward scattering and the radius of gyration can be obtained

Fig. 3 Guinier plot indicative of (a) aggregation and (b) repulsion

3.1 General Steps to Determine Radius of Gyration

1. Plot ln I(q) versus q² over a range up to a qmax that appears linear.
2. Make a linear regression to find the slope, k, of the line.
3. Calculate Rg from the slope according to $R_g = \sqrt{-3k}$.
4. Check whether Rg fulfills qmaxRg < 1.1; otherwise, repeat from step 1 with a smaller qmax.
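The four steps above can be sketched as a short loop that shrinks the fitting range until the qmaxRg criterion is met. This is a minimal illustration and not the AUTORG algorithm: it assumes that q is sorted in ascending order, that all intensities are positive, that the lowest-q points are usable, and it performs an unweighted fit that ignores the experimental errors. The function names and default values are hypothetical.

```python
import math


def linear_fit(x, y):
    """Unweighted least-squares line; returns (slope, intercept)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def guinier_rg(q, intensity, qmax_rg_limit=1.1, min_points=5):
    """Return (Rg, I(0), number of points used) from a Guinier fit."""
    n = len(q)
    while n >= min_points:
        # Steps 1 and 2: ln I(q) versus q^2 and linear regression.
        x = [qi * qi for qi in q[:n]]
        y = [math.log(ii) for ii in intensity[:n]]
        slope, intercept = linear_fit(x, y)
        if slope < 0:
            # Step 3: Rg from the slope k, Rg = sqrt(-3k).
            rg = math.sqrt(-3.0 * slope)
            # Step 4: accept only if qmax * Rg is below the limit.
            if q[n - 1] * rg < qmax_rg_limit:
                return rg, math.exp(intercept), n
        n -= 1  # otherwise repeat with a smaller qmax
    raise ValueError("no fitting range satisfies the qmax*Rg limit")
```

The units follow the input: q in Å⁻¹ gives Rg in Å, and q in nm⁻¹ gives Rg in nm. In practice the first few points may also need to be excluded, and the residuals should always be inspected, as described for PRIMUS in Subheading 3.2.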

3.2 Determination of Rg of Histatin 5 by the Guinier Approach Using PRIMUS

1. Open the scattering data using PRIMUS. Switch to the “Analysis” tab and press “Radius of Gyration.” This launches a new window, showing the data plotted as ln(I) versus q² in the low-q regime, as well as the corresponding residuals. Additional information shown is I(0) and Rg with error estimates (see Note 4), the qRg limits, and the data range given in data point indices.
2. The program automatically runs AUTORG, a routine for Rg determination; hence, the values initially stated correspond to this automatic procedure. A description of the automatic procedure is found in the ATSAS manual. The result of AUTORG for Histatin 5 is given in the left-hand panel of Fig. 4.
3. For a manual adjustment of Rg, the q-range can be changed to find the best linear fit. If there is curvature in the residuals, i.e., nonlinear behavior, the range should be reduced, from both directions if necessary. Start by adjusting the left limit to exclude any largely deviating points. Continue by reducing the right limit until qRg < 1.1, or further if the residuals still show nonlinear behavior. The goal is for the residuals to be evenly distributed around zero. The result of manually fitting the data of Histatin 5 is shown in the right-hand panel of Fig. 4.


Fig. 4 Upper row: Guinier fit of Histatin 5 for two different qmaxRg. Lower row: Residuals of the fitted line for the corresponding fits
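The visual inspection of the residuals in step 3 can be complemented by a crude numerical indicator. The sketch below is an assumption of this illustration and not a feature of PRIMUS: it fits a straight line to ln I(q) versus q² over a chosen index range and compares the mean residual in the outer thirds of the range with the mean in the middle third. A value far from zero suggests systematic curvature. The function names are hypothetical, and the fit is unweighted.

```python
import math


def guinier_residuals(q, intensity, first, last):
    """Residuals of an unweighted Guinier fit on points first..last (inclusive)."""
    x = [qi * qi for qi in q[first:last + 1]]
    y = [math.log(ii) for ii in intensity[first:last + 1]]
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    slope = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y)) / sum(
        (xi - mean_x) ** 2 for xi in x
    )
    intercept = mean_y - slope * mean_x
    return [yi - (slope * xi + intercept) for xi, yi in zip(x, y)]


def curvature_indicator(residuals):
    """Mean residual of the outer thirds minus the mean of the middle third."""
    n = len(residuals)
    if n < 3:
        raise ValueError("at least three residuals are required")
    third = n // 3
    outer = residuals[:third] + residuals[n - third:]
    middle = residuals[third:n - third]
    return sum(outer) / len(outer) - sum(middle) / len(middle)
```

With noisy experimental data the indicator fluctuates around zero even for a good fit, so it should be judged relative to the scatter of the residuals and never replace the inspection of the residual plot.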

In contrast to the Guinier approximation, which only uses the low-q region, determination of Rg from P(r) relies on more or less the entire collected SAXS spectrum. The pair distance distribution function is the real-space transformation of the intensity curve and can be obtained by the indirect Fourier transform method [11, 12], which is implemented, for example, in GNOM in the ATSAS package and in the Web application BayesApp (http://www.bayesapp.org). Other software also includes methods for obtaining P(r), for example, SasView (http://www.sasview.org) and ScÅtter (available at www.bioisis.net). From P(r) and the maximum dimension of the protein, Dmax, Rg can be calculated according to Eq. (4); it is usually given as an output from the software used. In Subheading 3.3, the steps in determining Rg by the pair distance distribution function are shown for Histatin 5 in PRIMUS. Other programs require different procedures (see Note 5).

3.3 Determination of the Pair Distance Distribution Function in PRIMUS

1. Open the scattering data in PRIMUS. Switch to the “Analysis” tab and press “Distance Distribution.” This opens a new window with two panels, initially showing the result from AUTOGNOM, the automatic calculation of P(r). The right panel shows the distance distribution, while the left panel shows the back-transformed intensity from the distance distribution together with the experimental scattering data. Each adjustment in the graphical interface calls the program GNOM, and the result is presented. For IDPs, manual modification is normally necessary, since the automatic process often gives a too abrupt decrease to zero at large r, as exemplified in Fig. 5a. The goal is to find a P(r) that smoothly increases from r = 0 and terminates smoothly at the maximum dimension. A typical P(r) of an IDP has an extended tail due to the variety of conformations the protein can adopt, where the tail corresponds to an almost fully stretched-out protein. This is in contrast to globular proteins, which show a rather symmetric bell-shaped curve due to their less flexible and more compact conformation.
2. If the scattering in the high-q region is overly noisy, that part can be excluded by reducing the range.
3. Uncheck “rmax = 0,” which releases the constraint that P(r) should be zero at the maximum pair distance, Dmax. This gives the P(r) shown in Fig. 5b, illustrating the need for a larger Dmax.
4. Increase Dmax until P(r) smoothly approaches zero at large r; see Fig. 5c. If that does not occur, the scattering data might be affected by interparticle interference (see Note 6) or suffer from poor buffer matching (see Note 7). Check “rmax = 0” again to ensure a complete decrease to zero.
5. To check the quality of the solution, try to increase Dmax even further. Ideally, I(0) and Rg should be stable, and P(r) should be zero beyond the Dmax found in step 4. In addition, the back-transformed intensity should fit the experimental scattering curve.
6. Repeat the process with a different range of the input data. If the result is approximately the same, this indicates a good solution and a data set of good quality.

It is good practice to compare the results from the Guinier analysis and the P(r), without letting one approach influence the fitting of the other, and thereby check the self-consistency of the scattering data. Ideally, the results from the two methods should agree. The limitations of the methods are further described in Note 8.

4 Notes
1. High concentrations of highly charged IDPs give rise to repulsion that is detectable in the SAXS data even at high ionic strength (150 mM). Therefore, to obtain reliable estimates of Rg, lower protein concentrations should be used, and the effect of protein concentration should be checked by performing measurements in a concentration series.


Fig. 5 Pair distance distribution function for Histatin 5 obtained from GNOM. (a) The result from AUTOGNOM, (b) release of the “rmax = 0” constraint, and (c) the final result after increasing Dmax

2. Instead of q, the ATSAS package uses s as the notation for the magnitude of the scattering vector, although the definitions are identical. Usually the measured SAXS data are given with q in units of nm⁻¹; however, they can easily be converted to Å⁻¹, as has been done with the data in this chapter. It is important to notice that the units of the output from PRIMUS depend on the units of the input; hence, if q is given in Å⁻¹, the calculated Rg is given in Å, whereas if q is given in nm⁻¹, Rg is given in nm. The same applies to the unit of length in the pair distance distribution function.

3. The determined Rg depends on the q-range used, as illustrated for Histatin 5 in Fig. 6. The Guinier analysis has been shown to systematically underestimate Rg of IDPs for a fitting range limited by qmaxRg ≈ 1.3; hence, it is important to use a limit of qmaxRg ≈ 1.1. The suitable range depends on both the protein sequence and the solvent conditions, and as a rule of thumb, the larger the deviation from a sphere, the smaller the fitting range that can be used. Notice that for small peptides such as Histatin 5, more data points can be used than for a larger protein; hence, for longer IDPs, the uncertainty can be large when reducing the fitting range. If a larger qmax is required, there are alternative methods available for IDPs, such as the Debye function or the extended Guinier analysis. The Debye function [18]:

$$\frac{I(q)}{I(0)} = \frac{2\left[ \exp\left( -(qR_g)^2 \right) - 1 + (qR_g)^2 \right]}{(qR_g)^4}, \tag{6}$$

is an analytical expression for the form factor of a Gaussian chain. It has been shown to be applicable to disordered and denatured proteins in the range up to qRg ≈ 3 [19], at least if their behavior is close to that of a Gaussian chain. If the IDP in question deviates more from Gaussian chain behavior, a restricted range of qRg < 1.4 can be more suitable [13]. Gaussian chain behavior is characterized by a Flory exponent of 0.5, and recently, Zheng and Best showed that the deviation from the Guinier approximation is related to the Flory exponent. They developed an extension of the Guinier approximation by including an extra higher-order term, dependent on the Flory exponent, ν [20]:

$$\ln\left[ \frac{I(q)}{I(0)} \right] = -\frac{1}{3} q^2 R_g^2 + 0.0479\,(\nu - 0.212)\, q^4 R_g^4. \tag{7}$$

By introducing a dependence between Rg and ν based on a self-avoiding walk, there are only two free parameters in the fitting. The dependence is described by the following equation:

$$R_g = \sqrt{\frac{\gamma(\gamma + 1)}{2(\gamma + 2\nu)(\gamma + 2\nu + 1)}}\; b N^{\nu}. \tag{8}$$

Here γ ≈ 1.1615 [21], the prefactor b ≈ 0.55 nm for proteins [22], and N is the number of peptide bonds in the chain. This extended Guinier analysis was shown to be applicable up to qmaxRg ≈ 2 [20].

Fig. 6 Radius of gyration of Histatin 5 versus range of Guinier fitting region

4. The AUTORG process performs Guinier analysis on multiple intervals within the suitable Guinier range before selecting the best fit. The presented error, however, takes into account both the error in the best fit and the error of other possible intervals and thereby accounts for systematic errors in Rg to a certain extent. When the Guinier fit is manually adjusted, the error is determined only from the fit in the selected range and, as a consequence, is underestimated. For more information, please see the article by Svergun et al. [23].

5. A guide on how to use ScÅtter to determine the P(r) distribution can be found at www.bioisis.net. The Web application BayesApp does not require any input other than the data file and has, in the authors’ experience, worked well for IDPs.

6. Interparticle interference can be visible in the P(r) in the large-r region, which corresponds to the low-q region. In the case of repulsive interactions, P(r) decreases below zero, as shown in Fig. 7a, which can lead to an underestimated Dmax if Dmax is taken as the distance at which P(r) first crosses zero. In the case of aggregation, P(r) will instead have problems reaching zero, which is seen as a long tail that never reaches zero or, alternatively, as a bump at the end instead of a smooth decline (see Fig. 7b). In both cases, the data are of insufficient quality and, hence, should not be used.

Fig. 7 Pair distance distribution functions of problematic data sets obtained from GNOM, where (a) shows repulsion in the system, visible as the decrease below zero, and (b) displays aggregation in the system, shown as a bumpy tail

7. As stated, poor buffer subtraction can give a bump in the P(r), similar to the presence of aggregates in the sample. Buffer mismatching in the low-q region also influences the Guinier analysis; hence, for an accurate determination of Rg, it is important to ensure good matching of the buffer. General advice on sample preparation for biological SAXS measurements and on how to check the buffer matching can be found in, for example, refs. [24–26].

8. The Guinier analysis is performed only in the low-q regime, whereas the P(r) approach uses more or less the full scattering curve. Thus, the Guinier analysis is more susceptible to experimental noise, which gives rise to larger uncertainties in the determined Rg. In addition (see also Note 3), the Guinier analysis is known to underestimate Rg if qmax is too large; hence, the P(r) approach can be considered more reliable. However, the reproducibility of the Guinier analysis is usually better, since it is an easier method to use. Especially if extensive manual adjustment is made to P(r), the result can vary between users. Therefore, the authors recommend presenting the results from both methods, especially since agreement signals consistency of the data.
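The expressions in Note 3 are simple to evaluate numerically, which can help when comparing them with the Guinier approximation over a given q-range. The sketch below only evaluates Eqs. (6) to (8) as written in this chapter; it does not perform the two-parameter fit described by Zheng and Best, and the function names are hypothetical. The constants are the values of γ and b quoted in Note 3.

```python
import math

GAMMA = 1.1615  # exponent gamma quoted in Note 3
B_NM = 0.55     # prefactor b for proteins in nm, quoted in Note 3


def debye_intensity_ratio(q, rg):
    """I(q)/I(0) of a Gaussian chain, Eq. (6); requires q > 0."""
    x = (q * rg) ** 2
    return 2.0 * (math.exp(-x) - 1.0 + x) / (x * x)


def extended_guinier_log_ratio(q, rg, nu):
    """ln[I(q)/I(0)] from the extended Guinier expression, Eq. (7)."""
    x = (q * rg) ** 2
    return -x / 3.0 + 0.0479 * (nu - 0.212) * x * x


def rg_from_flory_exponent(nu, n_bonds, b=B_NM, gamma=GAMMA):
    """Rg (in the units of b) from the Flory exponent and chain length, Eq. (8)."""
    ratio = gamma * (gamma + 1.0) / (
        2.0 * (gamma + 2.0 * nu) * (gamma + 2.0 * nu + 1.0)
    )
    return math.sqrt(ratio) * b * n_bonds ** nu


if __name__ == "__main__":
    # Compare the three expressions at a few values of q*Rg (Rg = 1, arbitrary units).
    for q_rg in (0.5, 1.1, 2.0):
        guinier = -q_rg ** 2 / 3.0
        debye = math.log(debye_intensity_ratio(q_rg, 1.0))
        extended = extended_guinier_log_ratio(q_rg, 1.0, 0.5)
        print(q_rg, guinier, debye, extended)
```

In an actual extended Guinier fit, Eq. (8) would be substituted into Eq. (7) so that I(0) and ν are the only free parameters. Remember that b is given in nm, so q must be in nm⁻¹ for such a fit to be consistent (see Note 2).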

Acknowledgments

We are grateful to Dr. Mark Tully at the European Synchrotron Radiation Facility (ESRF), Grenoble, France, and Dr. Samuel Lenton at the Division of Theoretical Chemistry, Lund University, Sweden, for valuable comments and proofreading of the text.

References

1. Das RK, Pappu RV (2013) Conformations of intrinsically disordered proteins are influenced by linear sequence distributions of oppositely charged residues. Proc Natl Acad Sci U S A 110:13392–13397. https://doi.org/10.1073/pnas.1304749110
2. Hoffmann A, Kane A, Nettels D et al (2007) Mapping protein collapse with single-molecule fluorescence and kinetic synchrotron radiation circular dichroism spectroscopy. Proc Natl Acad Sci U S A 104:105–110. https://doi.org/10.1073/pnas.0604353104
3. Hofmann H, Soranno A, Borgia A et al (2012) Polymer scaling laws of unfolded and intrinsically disordered proteins quantified with single-molecule spectroscopy. Proc Natl Acad Sci U S A 109:16155–16160. https://doi.org/10.1073/pnas.1207719109
4. Mao AH, Crick SL, Vitalis A et al (2010) Net charge per residue modulates conformational ensembles of intrinsically disordered proteins. Proc Natl Acad Sci U S A 107:8183–8188. https://doi.org/10.1073/pnas.0911107107
5. David G, Perez J (2009) Combined sampler robot and high-performance liquid chromatography: a fully automated system for biological small-angle X-ray scattering experiments at the synchrotron SOLEIL SWING beamline. J Appl Crystallogr 42:892–900. https://doi.org/10.1107/s0021889809029288
6. Guinier A (1939) La diffraction des rayons X aux tres petits angles; application a l’etude de phenomenes ultramicroscopiques. Ann Phys 12:161–237
7. Kikhney AG, Svergun DI (2015) A practical guide to small angle X-ray scattering (SAXS) of flexible and intrinsically disordered proteins. FEBS Lett 589:2570–2577. https://doi.org/10.1016/j.febslet.2015.08.027
8. Putnam CD, Hammel M, Hura GL et al (2007) X-ray solution scattering (SAXS) combined with crystallography and computation: defining accurate macromolecular structures, conformations and assemblies in solution. Q Rev Biophys 40:191–285. https://doi.org/10.1017/s0033583507004635
9. Borgia A, Zheng W, Buholzer K et al (2016) Consistent view of polypeptide chain expansion in chemical denaturants from multiple experimental methods. J Am Chem Soc 138:11714–11726. https://doi.org/10.1021/jacs.6b05917
10. Receveur-Brechot V, Durand D (2012) How random are intrinsically disordered proteins? A small angle scattering perspective. Curr Protein Pept Sci 13:55–75. https://doi.org/10.2174/138920312799277901
11. Glatter O (1977) Data evaluation in small-angle scattering - calculation of radial electron-density distribution by means of indirect Fourier transformation. Acta Phys Austriaca 47:83–102
12. Svergun DI (1992) Determination of the regularization parameter in indirect-transform methods using perceptual criteria. J Appl Crystallogr 25:495–503. https://doi.org/10.1107/s0021889892001663
13. Perez J, Vachette P, Russo D et al (2001) Heat-induced unfolding of neocarzinostatin, a small all-beta protein investigated by small-angle X-ray scattering. J Mol Biol 308:721–743. https://doi.org/10.1006/jmbi.2001.4611
14. Franke D, Petoukhov MV, Konarev PV et al (2017) ATSAS 2.8: a comprehensive data analysis suite for small-angle scattering from macromolecular solutions. J Appl Crystallogr 50:1212–1225. https://doi.org/10.1107/s1600576717007786
15. Cragnell C, Durand D, Cabane B et al (2016) Coarse-grained modeling of the intrinsically disordered protein Histatin 5 in solution: Monte Carlo simulations in combination with SAXS. Proteins 84:777–791. https://doi.org/10.1002/prot.25025
16. Jephthah S, Staby L, Kragelund BB et al (2019) Temperature dependence of IDPs in simulations: what are we missing? J Chem Theory Comput 15:2672–2683. https://doi.org/10.1021/acs.jctc.8b01281
17. Rieloff E, Tully MD, Skepö M (2019) Assessing the intricate balance of intermolecular interactions upon self-association of intrinsically disordered proteins. J Mol Biol 431:511–523. https://doi.org/10.1016/j.jmb.2018.11.027
18. Debye P (1946) Molecular-weight determination by light scattering. J Phys Colloid Chem 51:18–32
19. Calmettes P, Durand D, Desmadril M et al (1994) How random is a highly denatured protein. Biophys Chem 53:105–113. https://doi.org/10.1016/0301-4622(94)00081-6
20. Zheng W, Best RB (2018) An extended Guinier analysis for intrinsically disordered proteins. J Mol Biol 430:2540–2553. https://doi.org/10.1016/j.jmb.2018.03.007
21. Le Guillou JC, Zinn-Justin J (1977) Critical exponents for the n-vector model in three dimensions from field theory. Phys Rev Lett 39:95–98. https://doi.org/10.1103/PhysRevLett.39.95
22. Zheng W, Zerze GH, Borgia A et al (2018) Inferring properties of disordered chains from FRET transfer efficiencies. J Chem Phys 148:123329. https://doi.org/10.1063/1.5006954
23. Petoukhov MV, Konarev PV et al (2007) ATSAS 2.1 - towards automated and web-supported small-angle scattering data analysis. J Appl Crystallogr 40:S223–S228. https://doi.org/10.1107/s0021889807002853
24. Graewert MA, Jeffries CM (2017) Sample and buffer preparation for SAXS. In: Chaudhuri B, Munoz IG, Qian S, Urban VS (eds) Biological small angle scattering: techniques, strategies and tips. Advances in experimental medicine and biology, vol 1009. Springer, Singapore, pp 11–30
25. Brennich M, Pernot P, Round A (2017) How to analyze and present SAS data for publication. In: Chaudhuri B, Munoz IG, Qian S, Urban VS (eds) Biological small angle scattering: techniques, strategies and tips. Advances in experimental medicine and biology, vol 1009. Springer, Singapore, pp 47–64
26. Grishaev A (2012) Sample preparation, data collection, and preliminary data analysis in biomolecular solution X-ray scattering. Curr Protoc Protein Sci 70:17.14.11–17.14.18. https://doi.org/10.1002/0471140864.ps1714s70

Chapter 14

Obtaining Hydrodynamic Radii of Intrinsically Disordered Protein Ensembles by Pulsed Field Gradient NMR Measurements

Sarah Leeb and Jens Danielsson

Abstract

In the disordered state, a protein exhibits a high degree of structural freedom, in both space and time. For an ensemble of disordered or unfolded proteins, this means that the ensemble comprises a high diversity of structures, ranging from compact collapsed states to fully extended polypeptide chains. In addition, each chain is highly dynamic and undergoes conformational changes and local dynamics on both fast and slow timescales. The size properties of disordered proteins are thus best described as ensemble averages. A straightforward measure of the size is the hydrodynamic radius, RH, of the ensemble. Since the disordered state is conformationally fluid, the observed RH does not refer to a particular shape or fold. Instead, it should be interpreted as a measure for the average compaction of the structural ensemble. In addition to characterizing the disordered ensemble itself, RH can be used to, with good precision, monitor changes in the ensemble size properties upon functional interactions of the disordered protein, e.g., dimerization, ligand binding, and folding pathways. Here, we present a step-by-step protocol for diffusion measurements using pulsed field gradient nuclear magnetic resonance (PFG NMR) spectroscopy. We describe how to calibrate the magnetic field gradient and offer different schemes for sample preparation. Finally, we describe how to obtain RH directly from the diffusion coefficient as well as from using an internal standard as a reference.

Key words Diffusion coefficient, Gradient calibration, Unfolded ensemble, NMR, PFG NMR, Hydrodynamic radius, IDPs

1 Introduction

The disordered protein ensemble (whether it is the unfolded state of an otherwise folded protein, intrinsically disordered proteins, or artificially denatured proteins) is a generically heterogeneous, highly dynamic entity. Typically, the physical properties of such an ensemble of structures have to be defined as ensemble averages or time averages, since snapshots of a single highly dynamic protein chain lack physical relevance (Fig. 1) [1–4]. The size of the disordered ensemble, i.e., how it extends in space, can be described

Birthe B. Kragelund and Karen Skriver (eds.), Intrinsically Disordered Proteins: Methods and Protocols, Methods in Molecular Biology, vol. 2141, https://doi.org/10.1007/978-1-0716-0524-0_14, © Springer Science+Business Media, LLC, part of Springer Nature 2020


Fig. 1 The disordered ensemble. The size of a folded and an unfolded protein can be related to its hydrodynamic radius, RH. For the folded state in (a), the structural ensemble is very homogeneous and yields a narrow distribution of hydrodynamic radii, shown as the blue distribution in (c). The hydrodynamic size of the disordered state (b) is, on the other hand, much more heterogeneous and less intuitively defined. The orange circles in (c) correspond to the radius of a sphere that diffuses with the same speed as the unfolded polypeptide that it encloses. The heterogeneous ensemble of structures results in a wide size distribution for the disordered ensemble shown in (c). PFG NMR determination of the hydrodynamic radius from the diffusion coefficient does not give information on the width of the distribution but merely reports on the ensemble average, i.e., the peak of the distribution (indicated by black arrows)

using classical polymer statistical physics, where the individual amino acids are treated as mean field monomers, with limited specific properties. Using this formalism, the end-to-end distance and radii of gyration (Rg) for unfolded polypeptides under various conditions can be estimated [5, 6], and the size can be shown to obey simple power laws of the length of the protein. For example, $R_g = \alpha_0 M^{\alpha_1}$, where M can be the molecular mass of the protein [7] or the number of residues [8, 9], α0 is a pre-factor depending mainly on the monomer size and persistence length [8], and, finally, α1 is the exponent reporting on the extendedness (or compaction) of the chain ensemble [5, 8, 10]. The exponent α1 is dependent on the interaction between the solvent and the monomers in the polypeptide and ranges from α1 ≈ 0.3 for a compact folded state to α1 = 0.59 for an extended state in a good solvent [3, 7–9]. For the unfolded or disordered state of proteins, α1 typically ranges from 0.4 to 0.59 depending on the sequence composition [11] and the solvent [3, 8]. The radius of gyration is linked to the hydrodynamic radius (RH) in a nontrivial way. For the folded, globular state, they scale linearly, $R_H = \kappa R_g$, where κ relates to the hydration layer [12]. In contrast, no simple link exists for the unfolded or disordered state, but empirical relationships between Rg and RH have been suggested [13, 14]. The hydrodynamic radius is, in turn, directly

Translational Diffusion of IDPs


linked to how the disordered ensemble diffuses in a solvent, and translational diffusion measurements have been widely applied in the biophysical characterization of proteins, as they can be used to estimate the hydrodynamic (or Stokes) radius of, e.g., a disordered protein ensemble [15, 16]. Relating the hydrodynamic radius to the translational diffusion coefficient (Dt) rests on the fact that a particle undergoing Brownian motion in a solvent is exposed to opposing frictional forces (f) [17], which depend on the microscopic viscosity of the solvent (η) as well as the size and shape of the particle. In the case of disordered proteins, the ensemble is a distribution of both sizes and shapes that results in ensemble-averaged (or time-averaged) diffusion behavior (Fig. 1). Translational diffusion can be quantified by the mean square displacement per time unit, $\langle z^{2}\rangle = 6D_t\,\Delta t$, where the Stokes-Einstein equation relates Dt to RH [17]:

\[ D_t = \frac{k_B T}{f} = \frac{k_B T}{6\pi\eta R_H} \tag{1} \]

where kB is the Boltzmann constant and T is the temperature. For the disordered ensemble, the bulk hydrodynamic radius is the ensemble average of the distribution of sizes and shapes and corresponds to the radius of a sphere that diffuses at the measured rate through a solvent of viscosity η. The shape dependence becomes particularly apparent when comparing the hydrodynamic radii of a folded and unfolded version of the same protein. Despite having the exact same mass, their hydrodynamic radii will differ substantially due to the unfolded chain being much more extended (Fig. 1). Direct determination of translational diffusion is nowadays routinely performed using pulsed field gradient NMR (PFG NMR) spectroscopy, a method developed already in the mid-1960s by Stejskal and Tanner [18]. The method relies on a simple spin-echo but with a position-dependent phase labeling within the echo. In short, the method can be described as follows: After transferring the magnetization to the xy-plane, the spins of a molecule are given a position-dependent phase label by applying a magnetic field gradient (Fig. 2). In this way, the total magnetic field strength that individual spins are exposed to will vary depending on their position in the sample, and as a consequence, their coherence and the resulting magnetization are lost. For a gradient along the z-axis, the phase shift is $\phi = \gamma g z$, where γ is the gyromagnetic ratio, g is the gradient strength, and z is the spin position in z-direction in the detection volume. Immediately applying a gradient of reverse amplitude but same strength would then refocus the spin coherence with minimal signal loss (Fig. 2) and a total phase given by $\Delta\phi = \gamma g z + \gamma(-g)z = 0$. In diffusion measurements, however, a time delay is introduced between the two opposing gradient pulses,


Fig. 2 Determining the diffusion coefficient by NMR. Schematic overview of the PFG NMR experiment for determination of the diffusion coefficient, Dt. In the top panel, the simplified pulse sequence is shown, including the basic necessary elements. The black rectangles are hard pulses, with a narrow rectangle corresponding to a 90° pulse and the wider rectangle to a 180° pulse, and the gray half oblates correspond to the gradient pulses with an arbitrary shape with length δ, separated by the diffusion delay Δ. In the scheme, all chemical shift defocusing has been omitted, and only phase shifts from the gradients are included. Any chemical shift defocusing will be refocused at the end of the echo where the FID is recorded. The second lane shows the effect on a non-diffusing spin, which in block a, directly after a 90° pulse, is in the xy-plane, perpendicular to the static magnetic field. In block b, the gradient pulse gives the spin a position-dependent phase shift, ϕ. In block c, the 180° pulse has inverted the phase, which effectively alters the sign of the phase shift to −ϕ. In block d, the spin experiences the exact same gradient pulse again, as the position in the z-direction is unchanged. As a result, the same phase label is given, and since $-\phi + \phi = 0$, it is perfectly refocused. In the case of free diffusion during the delay Δ, shown in the third lane, the spins change place between blocks b and d, resulting in a non-perfect refocusing. Assuming free diffusion, the average change in position, Δz, is given by the diffusion coefficient and the diffusion delay, $\langle\Delta z\rangle = \sqrt{6D_t\Delta}$, which results in $\phi_1 + \phi_2 \neq 0$. The fourth lane shows the net magnetization from the diffusing spins in the detection volume. The loss of magnetization is related to the diffusion delay, gradient strength, and diffusion coefficient according to Eq. (2)

allowing the molecule (and therefore the spins) to move in space. When the refocusing gradient pulse is applied, the positions of the spins have changed due to Brownian motion, and the coherence is not fully refocused, which is manifested as a loss of signal (Fig. 2). The total phase shift is then $\Delta\phi = \gamma g z_1 + \gamma(-g)z_2 \neq 0$ if $z_1 \neq z_2$. Hence, the more rapidly the signal decays as a function of delay


time, the faster the nuclear spins have been displaced in the sample. The loss of NMR signal (S) is related to Dt by the Stejskal-Tanner equation [18]:

\[ S = S_0\, e^{-(\gamma g \delta)^{2}\,\Delta'\,D_t} \tag{2} \]

where γ is the gyromagnetic ratio, g the gradient strength ($g = \partial B/\partial z$), δ is the gradient pulse length, and Δ′ is the generalized diffusion delay, depending on the pulse sequence and gradient pulse shape [19]. From Eq. (2), one can see that alternatively to varying the diffusion delay between the two opposing gradient pulses, the strength or length of the gradient pulses can be incrementally increased while keeping the delay time constant. Since stronger gradients or longer gradient pulses enhance the sensitivity towards local displacement, the result will be the same: the faster a molecule diffuses, the more rapidly its signal decays. Using variable gradient strengths is often the preferred method, since it is free from complicating relaxation effects due to varying delay times [20, 21]. Nevertheless, if anomalous diffusion is studied, e.g., diffusion studies in complex environments, the diffusion delay should be used as the variable [22–24]. Determination of the translational diffusion coefficient of an unfolded protein ensemble can give information on the general dimensional features, such as the compaction or extendedness of the ensemble [7, 8, 10]. The scaling exponent can therefore be used to, e.g., categorize the ensemble in terms of its solvent interactions. However, more importantly, small, induced changes in the ensemble size can accurately be determined by PFG NMR, enabling more complex characterizations of the ensemble, where modulations of its size are involved. For example, the response on compaction due to alterations in sequence can be quantified with good precision, enabling the characterization of preferred intramolecular long-range contacts [10, 11]. The hydrodynamic radius is also a sensitive reporter on weak intermolecular interactions and can be used to, e.g., elucidate if a protein in solution is predominantly found in a monomeric or oligomeric state.
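As an illustration of how Eq. (2) is used with variable gradient strength, the sketch below fits ln S against (γgδ)²Δ′ by ordinary least squares, so that the negative slope gives Dt. It is a simplified example under stated assumptions: the function name is hypothetical, SI units are used throughout (1 G/cm = 0.01 T/m), the proton gyromagnetic ratio is assumed, and the generalized delay Δ′ must be supplied by the user for the pulse sequence in question.

```python
import math

GAMMA_1H = 2.675e8  # proton gyromagnetic ratio in rad s^-1 T^-1


def fit_diffusion_coefficient(gradients, signals, delta, big_delta, gamma=GAMMA_1H):
    """Linear least-squares fit of ln(S) versus (gamma*g*delta)^2 * Delta'.

    gradients: gradient strengths g in T/m
    signals:   measured signal intensities S (all positive)
    delta:     gradient pulse length in s
    big_delta: generalized diffusion delay Delta' in s
    Returns (Dt in m^2/s, S0).
    """
    x = [(gamma * g * delta) ** 2 * big_delta for g in gradients]
    y = [math.log(s) for s in signals]
    n = float(len(x))
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    # Eq. (2): ln S = ln S0 - x * Dt, so Dt is minus the slope
    return -slope, math.exp(intercept)


if __name__ == "__main__":
    # Synthetic, noise-free decay for a hypothetical Dt of 1.0e-10 m^2/s
    grads = [0.05 * i for i in range(1, 11)]  # 5 to 50 G/cm, in T/m
    sig = [1000.0 * math.exp(-(GAMMA_1H * g * 2.0e-3) ** 2 * 0.1 * 1.0e-10)
           for g in grads]
    d_t, s0 = fit_diffusion_coefficient(grads, sig, 2.0e-3, 0.1)
    print("Dt = %.3e m^2/s, S0 = %.1f" % (d_t, s0))
```

With real data, a nonlinear fit of the exponential weighted by the measurement noise may be preferable, because the log transform distorts the error of the weakest points.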
By studying the concentration dependence of Dt, very weak self-interactions can be detected and quantified [25]. In fact, any process that involves changes in particle size can be monitored. For instance, titrating a ligand into the sample and tracking the change in protein or ligand diffusion coefficient as the bound species becomes increasingly populated allow the determination of the dissociation constant for the binding reaction [26]. This method allows for quantification also of very weak interactions, i.e., Kd in the mM range. In some cases, even small conformational transformations of a disordered ensemble upon changing the solvent conditions or adding a ligand can be observed, provided that this leads to a change in the


ensemble shape and/or size distribution and the ligand is of negligible size [27]. This protocol describes how to determine the hydrodynamic radius of an unfolded ensemble using PFG NMR translational diffusion measurements. We describe sample handling, gradient calibration, diffusion measurements, and simple first-order analysis.
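The conversion from a measured diffusion coefficient to RH via Eq. (1) can be sketched as follows. The function names are hypothetical, and the viscosity must match the actual solvent and temperature; the value used in the example (0.89 mPa s, water at 25 °C) is only an assumption for illustration. The second function covers the internal-standard route mentioned in the abstract: if protein and reference diffuse in the same sample, T and η cancel in the ratio of two Stokes-Einstein expressions.

```python
import math

K_B = 1.380649e-23  # Boltzmann constant in J/K


def hydrodynamic_radius(d_t, temperature, viscosity):
    """Eq. (1) solved for RH (m); d_t in m^2/s, temperature in K, viscosity in Pa s."""
    return K_B * temperature / (6.0 * math.pi * viscosity * d_t)


def radius_from_reference(d_protein, d_reference, rh_reference):
    """RH of the protein from an internal standard measured in the same sample.

    The known radius of the reference (rh_reference) is an input that has to be
    taken from the literature; it is not provided here.
    """
    return rh_reference * d_reference / d_protein


if __name__ == "__main__":
    # Hypothetical Dt of 1.0e-10 m^2/s in water at 25 degrees C
    rh = hydrodynamic_radius(1.0e-10, 298.15, 8.9e-4)
    print("RH = %.2f nm" % (rh * 1e9))
```

For the example values, the result is about 2.45 nm. The reference-based route avoids the need for an exact viscosity and temperature, which is why it is often more robust between samples.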

2 Materials

In general, there are no specific requirements for NMR PFG diffusion measurements that differ from those required for a general NMR sample. In this section, we define the materials and solutions necessary for preparing a typical sample for high-precision measurements of translational diffusion of a disordered protein ensemble. The conditions can be changed and optimized in terms of solvent, pH, ionic strength (within the range tolerated by the hardware), buffer, and protein concentration.

2.1 Instruments

1. NMR spectrometer and probe equipped with a z-gradient (preferably >50 G/cm).
2. Vacuum chamber (for lyophilization).
3. Thermometer that can be inserted into the NMR spectrometer. (If not available, use a deuterated methanol sample for temperature calibration.)
4. Tabletop centrifuge that can reach up to 17,000 × g.

2.2 Materials and Solutions

All solutions should be prepared in deionized water (resistivity ρ of 18.2 MΩ cm at 25 °C) if not specified otherwise.
1. NMR tubes that are optimized for the field strength of the spectrometer that is to be used. In the following protocol, we will in general be referring to round-bottomed tubes of ⌀ 5 mm that require a minimum sample volume of 500 μL (see Notes 1 and 2).
2. Eppendorf tubes, 1.5 mL.
3. Parafilm (for lyophilization).
4. Syringe needle, ⌀ ~1 mm (for lyophilization).
5. D2O, filtered, 99.8% deuterated.
6. Buffer stock solutions: Depending on the pH of interest for the measurements, different buffering substances can be used. Typically used buffers in NMR samples are 2-(N-morpholino)ethanesulfonic acid (MES) and 4-(2-hydroxyethyl)piperazine-1-ethanesulfonic acid (HEPES), but phosphate or acetate buffers are not uncommon either. Technically, NMR spectroscopy does not set any restrictions on the type of buffer that can be


used. Recommended stock concentrations are 200–500 mM (see Note 3).
7. Internal standard stock solutions: 10 mM α-cyclodextrin in D2O, 40 mM DSS in D2O or H2O, or 1,4-dioxane, anhydrous, 99.8%. Any of these can be used.
8. Sample for gradient calibration: 1 mM α-cyclodextrin in D2O (see Note 4).
9. Sample for temperature calibration: 500 μL of methanol-d4, 99.8% deuterated, sealed under atmospheric pressure (only if no thermometer is available).
10. Protein sample for diffusion measurement: The general requirements on protein NMR samples apply, i.e., relatively high protein concentration (~100 μM in case of cryoprobe and ~300 μM for a non-cryogenically cooled probe), low ionic strength ( 3, can be considered as statistical evidence for local order. A combined consideration of all available backbone chemical shifts provides better evidence of local order (and disorder) [17, 20]. Therefore, we introduced the Chemical shift Z-score for assessing Order/Disorder (the CheZOD Z-score [21], also referred to herein as the Z-score) based on statistical analysis of wSCSs. First, the summed weighted chi-square deviation from RCCS is calculated from the assigned chemical shifts for a particular residue and its nearest neighbors:

\[ \chi^{2}(i) = \sum_{j=i-1,\,i,\,i+1}\; \sum_{n} \min\left(\left|\Delta_{jn}\right|, 4\right)^{2} \tag{4} \]

where Δjn is the wSCS for nucleus n of residue j, and the number is truncated to four since a value around three is sufficient to be considered statistical evidence of structure. Secondly, χ² is transformed to an approximately normal distributed number, L, by sums of fractional powers of χ² [22]:

\[ L = \rho^{1/6} - \frac{1}{2}\rho^{1/3} + \frac{1}{3}\rho^{1/2}, \qquad \rho = \frac{\chi^{2}}{N} \tag{5} \]

where N is the number of assigned chemical shifts for the amino acid triplet. Finally, the CheZOD Z-score (Z) is defined by converting L to a standard normal distributed number by correcting with the known mean and standard deviation for L [22]:

\[ Z = \frac{L - \mu_L(N)}{\sigma_L(N)} \tag{6} \]

CheZOD Z-scores were found to be consistent with other procedures for assessing disorder, such as missing density in X-ray crystallographic structures, structural variation in NMR-derived structure ensembles, and simulated Molecular Dynamics (MD) trajectories, but were proven to be more precise and accurate [23]. Z-scores were subsequently applied to set a new standard for benchmarking the performance of computational methods for predicting disorder. The CheZOD Z-score is a quantitative statistical descriptor for how much a set of experimentally determined chemical shifts deviates from RCCSs. By the nature of their definition (Eqs. 4–6), Z-scores follow a standard normal distribution with unit variance for completely disordered residues. Therefore, Z = 3 can be regarded as a threshold above which chemical shifts identify the population of ordered conformations (e.g., helix formation). When applied to structured segments of proteins, Z-scores span a large range, with a maximum of Z = 16.15. To investigate the magnitudes and sequence patterns of protein order and disorder, Z-scores were calculated for all the residues of 117 proteins that were allegedly disordered based on the title or keywords provided by the authors [21]. Through inspection of the Z-score profiles


Jakob T. Nielsen and Frans A. A. Mulder

along the protein sequences, a complex picture emerged, revealing a great diversity in disorder. On the one hand, proteins that are mostly disordered were frequently found to have a small number of structured regions comprised of varying numbers of residues and with diverse fractional populations of order. Conversely, proteins that are mostly structured often have flexible termini or loop regions with variable size and flexibility (see Fig. 2 in Ref. 21). This finding mirrors the broad repertoire of IDPs with levels of disorder coding for different biological functions, ranging from completely flexible through partially disordered with interaction hot spots to structured domains joined by flexible linkers [24]. The Z-scores for a large collection of proteins were observed to follow a clearly bimodal distribution (see Fig. 3 in Ref. 21), reflecting that residues typically classify either as mostly disordered or as mostly ordered. Residues corresponding to being semidisordered, roughly defined as falling in the region 3 < Z < 8, were found to be less common than disordered (Z < 3) or ordered residues (Z > 8) in the protein collection investigated.
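The chain of Eqs. (4)–(6) can be illustrated with a short Python sketch. This is not the CheZOD implementation itself: the function names are hypothetical, and the expressions for the mean and standard deviation of L are not given in the text, so the polynomial forms below are an assumption (a published normal approximation for the chi-square distribution). They do reproduce the maximum of Z = 16.15 quoted above when 21 shifts are all truncated at four. The class boundaries follow the ranges given in the preceding paragraph.

```python
import math


def chezod_z(wscs_values):
    """Z-score for one residue from the weighted SCSs of residues i-1, i, i+1.

    wscs_values: flat list of all weighted secondary chemical shifts
                 available for the triplet (N values).
    """
    n = len(wscs_values)
    if n == 0:
        raise ValueError("at least one assigned chemical shift is required")
    n = float(n)
    # Eq. (4): truncate each deviation at 4 before squaring
    chi2 = sum(min(abs(d), 4.0) ** 2 for d in wscs_values)
    rho = chi2 / n
    # Eq. (5): sum of fractional powers
    l_val = rho ** (1.0 / 6.0) - 0.5 * rho ** (1.0 / 3.0) + rho ** 0.5 / 3.0
    # ASSUMPTION: mean and standard deviation of L as functions of N
    mu = 5.0 / 6.0 - 1.0 / (9.0 * n) - 7.0 / (648.0 * n ** 2) + 25.0 / (2187.0 * n ** 3)
    sigma = math.sqrt(1.0 / (18.0 * n) + 1.0 / (162.0 * n ** 2)
                      - 37.0 / (11664.0 * n ** 3))
    # Eq. (6)
    return (l_val - mu) / sigma


def classify(z):
    """Coarse classes using the ranges quoted in the text."""
    if z < 3.0:
        return "disordered"
    if z > 8.0:
        return "ordered"
    return "semi-disordered"


if __name__ == "__main__":
    z = chezod_z([10.0] * 21)  # 21 strongly deviating shifts
    print("Z = %.2f (%s)" % (z, classify(z)))
```

Truncating before squaring caps the contribution of any single shift at 16, which keeps one outlying assignment from dominating the score of a residue.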

2 Materials

The following sections describe the application of two software tools for quantitatively assessing disorder and order in proteins from assigned chemical shifts. The first software to be discussed, POTENCI [11], predicts RCCS reference values from amino acid sequence. The other software, CheZOD, calculates CheZOD Z-scores of disorder for proteins from assigned chemical shifts. POTENCI is implicitly included in the implementation of CheZOD. It is assumed that sequence-specific chemical shift assignments have already been obtained [25, 26] (see also Subheading 1). POTENCI and CheZOD are both available as Web server applications and as stand-alone command-line interface Python scripts and, depending on which approach is chosen, there are different requirements. Using the online server application bypasses the need for installing Python and further dependencies, which is practical if only a few protein sequences are to be evaluated. Alternatively, calculation via the command-line interface provides more freedom and is convenient for running multiple calculations. Python and some scientific Python packages are required for running the applications locally (see below).

2.1 Computation of RCCS Reference Values Using the POTENCI Web Application

1. Find the POTENCI application at http://www.protein-nmr.org/. Currently, it is available at https://st-protein02.chem.au.dk/potenci/ (but exact server locations may change over time).

Quantitative Protein Disorder by NMR


2. The protein sequence is entered as one-letter amino acid code. Any nonstandard amino acids will be skipped, while sequence numbering is retained. A minimum length of 5 residues is required (see Note 1).
3. The sample conditions (pH, temperature, and ionic strength) must be specified (see Subheading 3.1).

2.2 Computation of RCCS Reference Values Using a Command-Line Python Script

1. Download the latest version of the POTENCI application (hereunder assumed to be potenci1_3.py) from the GitHub page: https://github.com/protein-nmr/POTENCI.
2. Install Python, for example, with the command-line tool EDM: https://www.enthought.com/product/enthought-deployment-manager/ (see Note 2).
3. Install the dependencies from a shell command line. Here, “edm” is an alias for C:\Users\username\AppData\Local\Programs\Enthought\edm\edm.bat on Windows, ./edm/bin/edm on Linux, and /opt/edm/bin/edm on Mac OS:
   >>> edm install numpy
   >>> edm install scipy
   or:
   >>> pip install scipy numpy

4. The protein sequence and sample conditions must be specified as in Subheading 2.1, items 2 and 3.

2.3 Computation of CheZOD Z-Scores Using a Web Application

CheZOD converts the deviations of the observed chemical shifts from RCCSs to a measure of local protein disorder. Here, the RCCSs are computed by POTENCI. CheZOD can be run either for a local data set or for a published data set from the Biological Magnetic Resonance Bank (BMRB) database using the identifier (BMRB ID).
1. Find the CheZOD Web server application at http://www.protein-nmr.org/. Currently, it is available at https://st-protein.chem.au.dk/chezod.
2. In case the target is a published BMRB entry, only the corresponding entry ID is required.
3. Alternatively, prepare a data file in NMR-STAR format (v2.1) with assigned chemical shifts, amino acid sequence, and experimental conditions (see Note 3). Published NMR-STAR files can also be downloaded by ftp (see Note 4).

2.4 Computation of CheZOD Z-Scores from a Command-Line Interface Python Script

The CheZOD python command-line script has the same type of requirements as the Web application. 1. Download the latest version of the CheZOD application from the GitHub page: https://github.com/protein-nmr/CheZOD.


2. Install Python as in Subheading 2.2, item 2 and dependencies as in Subheading 2.2, item 3. Install two further dependencies if you would like plots of weighted secondary chemical shifts and Z-scores to be produced:
   >>> edm install pylab
   >>> edm install pyqt

3. Target published BMRB entry ID or NMR-STAR file (v2.1) with assigned chemical shifts, amino acid sequence, and experimental conditions (see Note 3) as described above (Subheading 2.3, items 2 and 3).

3 Methods

This section provides instructions for running the CheZOD and POTENCI programs on a server (Subheadings 3.1 and 3.3) or as command-line interface stand-alone Python scripts (Subheadings 3.2 and 3.4). A description of how to interpret the Z-scores in terms of local disorder and order in proteins is provided (Subheading 3.5).

3.1 Computation of RCCS Reference Values Using a Web Application

RCCSs will be calculated using POTENCI [11] via the server Web application.
1. Go to the POTENCI application (see Subheading 2.1, item 1).
2. Enter your protein sequence in the text box using one-letter amino acid code as described in Subheading 2.1, item 2 (see Note 1).
3. Specify sample pH in the corresponding input field (default is 7.0). pH can strongly affect RCCSs for the titratable amino acids (Asp, Glu, Cys, Tyr, His, Lys, and Arg) and their nearest neighbors, and POTENCI will adjust RCCSs accordingly (for the effects of pH on amino acid chemical shifts, the interested reader is referred to the work of Kjærgaard et al. [13] and Platzer et al. [14]). The exact implementation of pH correction is explained under point 5 below.
4. Specify sample temperature (in K; default value is 298 K). Temperature influences the RCCSs for all nuclei and 1HN chemical shifts in particular (see Note 5). For a systematic investigation of the effects on small peptides, see, for example, Kjærgaard et al. [13].
5. Specify sample ionic strength (in M; default is 0.1). The ionic strength has an effect on the estimated pKa values of the side chains of the protein, which in turn affects the average protonation state of each side chain in the protein sequence.


The latter is computed by the program pepKalc by Tamiola et al. [15]. Given the average protonation state together with the limiting RCCSs for the titratable amino acids at low pH (protonated) and high pH (deprotonated), POTENCI computes the most accurate RCCS values at any pH.
6. Click the “Submit” button.
7. Output RCCSs are returned to the user as a text file in ShiftY format (space separated, see http://shifty.wishartlab.com/help.html) (see Notes 1 and 6). This output file contains the sequence entered by the user in the first line, followed by a table with residue number, amino acid type, and RCCSs for 15N, 13C′, 13Cα, 13Cβ, 1HN, 1Hα, and 1Hβ, respectively (see Note 7). In addition, the input parameters and the results of the computation are returned in a text box below the submission button, from where these can be copied.

3.2 Computation of RCCS Reference Values Using a Command-Line Python Script

1. Run the POTENCI application (potenci1_3.py below) with python on the command line:
>>> python_exe potenci1_3.py seqstring pH temperature ionic_strength [pkacsvfile] > logfile

2. “python_exe” is an alias for C:\Users\username\.edm\envs\edm\python.exe on Windows, .edm/envs/edm/bin/python on Linux, and /usr/bin/python on Mac OS. The Python version must be Python 2 (validated to work with python2.7).
3. The command-line argument, seqstring, specifies the amino acid sequence as in Subheading 3.1, step 2. The arguments pH, temperature, and ionic_strength (see Subheading 3.1, steps 3–5) must be provided in that specific order, using the units defined in Subheading 3.1 (see Note 8).
4. pKa values for amino acids with titratable side chains can optionally be provided in a csv-type file (do not type the square brackets around the file name). By default, this is not required, and pKa values are computed using the pepKalc [15] approach, which is integrated in the POTENCI python code. Providing pKa values bypasses the pepKalc computation and speeds up the calculation considerably.
5. Output RCCSs are generated as described in Subheading 3.1, step 7. Note that pKa predictions (see Note 9) are stored locally as a text file, and this file is reloaded if the sequence, temperature, and ionic strength are the same as in a previous run. As the pH-dependent protonation states are then precomputed, RCCSs can be generated much faster for any pH value for the same protein sequence under otherwise identical conditions (see Note 10).
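The output table described in Subheading 3.1, step 7 can be read back into a script for further processing. The following is a minimal sketch, assuming a whitespace-separated table whose columns follow the order given in step 7 and treating any row that does not fit this pattern as a header or comment. The function name, the short nucleus labels, and the numbers in the example are hypothetical and are not taken from POTENCI itself.

```python
# Column order of the seven RCCS columns as listed in Subheading 3.1, step 7.
# The short labels are our own choice, not POTENCI's.
NUCLEI = ("N", "C", "CA", "CB", "HN", "HA", "HB")


def parse_rccs_table(text):
    """Return (sequence, table) from a ShiftY-style RCCS text output.

    table maps residue number -> (amino acid, {nucleus: shift or None}).
    A value of 0.0 is stored as None because it means "not applicable"
    (see Note 7).
    """
    lines = [ln.strip() for ln in text.splitlines() if ln.strip()]
    sequence = lines[0]  # first line holds the sequence entered by the user
    table = {}
    for line in lines[1:]:
        fields = line.split()
        if len(fields) != 2 + len(NUCLEI):
            continue  # assumption: header or comment row, skip it
        try:
            resnum = int(fields[0])
            values = [float(v) for v in fields[2:]]
        except ValueError:
            continue  # assumption: non-numeric row, skip it
        shifts = {n: (v if v != 0.0 else None) for n, v in zip(NUCLEI, values)}
        table[resnum] = (fields[1], shifts)
    return sequence, table


# Hypothetical example; the numbers are placeholders, not POTENCI output.
example = """\
MGASK
2 G 110.0 174.0 45.0 0.0 8.5 4.0 0.0
3 A 124.0 178.0 52.5 19.0 8.3 4.3 1.4
"""
sequence, table = parse_rccs_table(example)
print(sequence, sorted(table))
```

Storing "not applicable" entries as None rather than 0.0 avoids accidentally treating them as real chemical shifts in later arithmetic.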


3.3 Computation of CheZOD Z-Scores Using a Web Application

CheZOD can be run either for a local data set or for a published data set from the BMRB database using its identifier (BMRB ID). CheZOD output is returned by email. See details below.

1. Go to the CheZOD Web server application address (Subheading 2.3, item 1).
2. Enter your email address in the specified box.
3. You may run CheZOD for a published BMRB ID or provide a local NMR-STAR file (version 2.1). In the first case, choose a valid BMRB ID as job ID (this is the number of a BMRB entry; e.g., 16340 corresponds to the assigned chemical shifts of Calbindin D9k). The data will then be downloaded automatically from the BMRB ftp site for the specified entry (see Note 11).
4. Alternatively, when uploading an NMR-STAR file from your computer, choose any convenient identifier (such as the name of your protein) as job ID.
5. In the latter case, an NMR-STAR file (v2.1) with assigned chemical shifts and experimental conditions (see Note 3) must be uploaded (“Choose File” button) or pasted into the “NMRSTAR data” text window. Published NMR-STAR files can be downloaded by ftp as well (see Note 4).
6. Click the “Submit” button.
7. SCSs are returned by email in a text file named “shiftjobid.txt,” where jobid is your specified job ID. Each line specifies a single SCS, corresponding to a single atom. The columns are: residue number (see Note 12), residue type (one letter), atom type, assigned chemical shift, pentapeptide sequence used by POTENCI (Eq. 1) (see Note 13), and SCS. At the bottom of this file, chemical shift offset corrections derived by POTENCI are provided (see Note 14).
8. The output file, “zscoresjobid.txt,” is returned by email with the CheZOD Z-scores [21]. The output has the following columns: residue name, residue number (see Note 12), and Z-score.
9. A plot (see Fig. 1 for an example) containing two aligned panels, one with the weighted SCSs (top; see Note 15) and one with the Z-scores (bottom), is sent by email to the user in pdf format.
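The Z-score file described in step 8 can be loaded with a few lines of code. The sketch below assumes that the three columns (residue name, residue number, Z-score) are separated by whitespace and that any other line can be ignored; both are assumptions, as are the function names. It also lists residue numbers that are absent from the file, which is expected for residues without assigned chemical shifts (see Note 11).

```python
def parse_zscores(text):
    """Parse lines of 'residue_name residue_number z_score'."""
    rows = []
    for line in text.splitlines():
        fields = line.split()
        if len(fields) != 3:
            continue  # assumption: blank line or header, skip it
        try:
            rows.append((fields[0], int(fields[1]), float(fields[2])))
        except ValueError:
            continue  # assumption: non-numeric line, skip it
    return rows


def missing_residues(rows):
    """Residue numbers not reported between the first and last residue."""
    present = {number for _, number, _ in rows}
    if not present:
        return []
    return [n for n in range(min(present), max(present) + 1) if n not in present]


# Hypothetical file content; the values are placeholders.
example = """\
M 1 1.2
K 2 0.8
S 4 9.5
"""
rows = parse_zscores(example)
print(rows)
print(missing_residues(rows))
```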

3.4 Computation of CheZOD Z-Scores from a Command-Line Interface Python Script

The CheZOD python command-line script uses the same type of input as the Web application.

1. Input parameters needed for POTENCI (within CheZOD) as well as assigned chemical shifts must be provided in an NMR-STAR 2.1 formatted file (see Note 3) via BMRB ID or local file as described above (Subheading 2.3, item 3). If a published BMRB entry is the target, the NMR-STAR file will be fetched automatically using wget (Linux) or curl (Windows/Mac). If you experience problems with this part of the code (not exhaustively tested on all possible platforms), we recommend downloading the file manually.
2. Run CheZOD with Python 2 from the command line using the BMRB ID (see Subheading 3.2, step 1):
>>> python_exe chezod1_0.py bmrbid [-p] > logfile &
“-p” is the default and produces plots, whereas “-n” skips plotting.
3. Output RCCSs and pKa values are generated by POTENCI within CheZOD with format and names as described above (Subheading 3.1, step 7 and Subheading 3.2, step 5).
4. Output text files are generated as described above (Subheading 3.3, steps 7 and 8) with SCSs (“shiftsbmrbid.txt”) and Z-scores (“zscoresbmrbid.txt”) (see Note 10).
5. A plot is produced as described above (Subheading 3.3, step 9) containing both SCSs and Z-scores as shown in Fig. 1. It is possible to interactively zoom, adjust the range, and save the plot in different file formats using the plot window menu.

Fig. 1 Plots of wSCS and CheZOD Z-score profiles produced by the CheZOD python application. Z-scores and wSCSs as a function of residue number for TDP-43 [28] showing (a) the RNA recognition motif 1 (RRM1, BMRB ID 18765) and (b) the C-terminal domain (BMRB ID 26823). Top panels show weighted SCSs as blue, red, black, green, cyan, magenta, and yellow dots for 13C′, 13Cα, 13Cβ, 1Hα, 1HN, 15N, and 1Hβ, respectively. Bottom panels provide CheZOD Z-score profiles. A reference line for Z = 3 indicates the threshold for the emergence of order (black broken line), whereas Z = 8 signifies the border between partial disorder and order (green broken line)

3.5 Interpretation of the CheZOD Z-Scores

The Z-scores computed in Subheadings 3.3 or 3.4 can be interpreted in terms of local disorder and order. Brief guidelines on how to interpret Z-scores and SCSs are provided below (see also Subheading 1). Examples of SCS and CheZOD Z-score profiles are provided in the literature [21] and in a commented example (Fig. 1).

1. Z-scores below 3 indicate complete disorder; the variation observed is primarily a consequence of the finite accuracy of the POTENCI and pepKalc models, the number of assigned chemical shifts, and the precision in the measurement of the chemical shifts (see Note 16).
2. Observations of 3 < Z < 8 correspond to intermediate and pliant local order. This can, for example, be (1) formation of a fractional population of helical conformations, apparent from positive SCSs for 13C′ and 13Cα and negative SCSs for 1Hα, 13Cβ, 15N, and, to a smaller extent, 1HN; (2) extended backbone conformation, for example, due to strong charge repulsion or β-strand formation (this is accompanied by opposite trends in SCSs compared to helix formation) [27]; and (3) a very flexible loop or linker between two ordered segments with no systematic trend in SCSs.


3. Z-scores above 8 (up to a maximum value of 16.15) are observed for more ordered residues. Flexible loops with partly restricted motion have values at the lower end of this scale, whereas the higher end corresponds to completely ordered secondary structure elements as well as rigid turns or other noncanonical structure (see Note 17). α-Helices and β-sheets give rise to highly systematic trends in SCSs as described above.
4. A comparison of Z-scores and SCSs for a full protein sequence gives a complete picture of local disorder and order. As an example, Fig. 1 shows CheZOD Z-score profiles for two domains of the protein TDP-43 (transactive response DNA binding protein 43 kDa) [28]. In panel (a), the RNA recognition motif 1 (RRM1, BMRB ID 18765, residues 96–186 with N-terminal His-tag) [29] is shown, and in (b), the low complexity C-terminal domain (CTD), which contains the amyloidogenic core region (BMRB ID 26823, residues 268–414) [29]. The CheZOD Z-score profile for the RRM1 domain (Fig. 1a) reveals that this domain is structured, as seen by high Z-scores, which are mostly above Z = 12, and a large variation in SCSs. Low Z-scores (Z < 3) at both ends of the sequence indicate disordered termini for this domain (the N-terminus carries an added His-tag, for which chemical shifts are not available). Medium Z-scores (8 < Z < 10) detect a flexible loop centered at residue 55. The CheZOD Z-score profile for the CTD (Fig. 1b) shows that this domain is almost totally disordered, as judged by very low Z-scores (Z < 3). Notwithstanding this overall disordered nature, the middle segment, residues 55–70, shows variation in the SCSs and displays Z-scores in the intermediate range (3 < Z < 11), indicating partial ordering. Population of helical structure for this segment is evinced by the systematic patterns in the SCSs (as described in Subheading 3.5, step 2). A close inspection of the SCSs to the right of the abovementioned segment reveals similar systematic trends in the signs of the SCSs, albeit with a much lower amplitude, suggesting a small propensity (less than 10%) for the population of helix-like structure also for residues 73–76.
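The thresholds given in steps 1–3 can be applied programmatically to a Z-score profile. The sketch below labels each residue using the cutoffs Z = 3 and Z = 8 and merges consecutive residues with the same label into segments. The function names and the class labels are our own, and assigning a Z-score of exactly 3 or 8 to the higher class is an arbitrary choice made for this illustration. Such a coarse labelling does not replace inspection of the SCS patterns described above.

```python
def classify(z):
    """Label a residue from its CheZOD Z-score using the cutoffs 3 and 8."""
    if z < 3.0:
        return "disordered"
    if z < 8.0:
        return "intermediate"
    return "ordered"


def segments(zscores):
    """Merge consecutive residues with the same label.

    zscores: list of (residue_number, z) sorted by residue number.
    Returns a list of (first_residue, last_residue, label). A gap in the
    residue numbering always starts a new segment.
    """
    result = []
    for number, z in zscores:
        label = classify(z)
        if result and result[-1][2] == label and result[-1][1] == number - 1:
            first, _, _ = result[-1]
            result[-1] = (first, number, label)
        else:
            result.append((number, number, label))
    return result


# Hypothetical profile; residue 5 is missing on purpose.
profile = [(1, 1.0), (2, 2.0), (3, 5.0), (4, 12.0), (6, 13.0)]
print(segments(profile))
```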

4 Notes

1. Any nonstandard amino acids will be skipped, while sequence numbering is retained. A minimum length of five residues is required. RCCSs are not provided for the terminal amino acids due to lack of appropriate correction factors. Currently, phosphorylation, N-terminal acetylation, and other modifications are not handled, although these may become available in the future.


2. The python EDM package can be downloaded from https://www.enthought.com. Enthought products are free of charge for academic users. For python support, see https://support.enthought.com/hc/en-us. It is also possible to use the GUI-type distribution, Canopy: https://www.enthought.com/product/canopy.
3. See the format definition at http://www.bmrb.wisc.edu/dictionary/htmldocs/nmr_star/dictionary_files/chemshift_J_full_simple.txt. If the star file is created manually, the full syntax is not required, only the records “_Entry_title,” “_Mol_residue_sequence,” “assigned_chemical_shifts,” and the sample_conditions loop with pH, temperature, and ionic strength as needed for POTENCI (see example line below). If sample conditions are not provided, default values will be used (see Subheading 3.1, steps 3–5).
1 2 ALA N N 123.67 0.1 1
This line represents assigned chemical shift data for the columns corresponding to the NMR-STAR labels Atom_shift_assign_ID, Residue_seq_code, Residue_label, Atom_name, Atom_type, Chem_shift_value, Chem_shift_value_error, and Chem_shift_ambiguity_code, respectively.
4. From the BMRB website or one of the mirror sites: http://www.bmrb.wisc.edu, http://www.bmrb.wisc.edu/ftp/pub/bmrb/entry_directories/
5. Temperature coefficients (βi, Eq. 1) are up to 9.1 ppb/K for 1HN, amounting to a correction of 2.99 for the wSCSs (see Eq. 3) at 278 K, compared to the next-highest corrections of 0.94 and 0.60 for 13C′ and 15N, respectively, at 278 K.
6. Averages of the methylene protons are provided for Gly Hα2/Hα3 and for Hβ2/Hβ3 side chain protons.
7. A value of 0.0 means “not applicable,” as for Gly 1Hβ and 13Cβ and for Pro 1HN.
8. An example command line is:
>>> python2.7 potenci1_3.py MADCATMAN 7.0 293 0.2 > log.txt

9. Output pKa values are written to a csv-type file with the columns Site, pKa value, pKa shift, and Hill coefficient, as specified by pepKalc [15].
10. There is additional data in the logfile, which is undocumented and not essential for the use of POTENCI or CheZOD. It was mostly used for debugging and various statistics.
11. CheZOD Z-scores are calculated based on assigned chemical shifts for the triplet (residues i − 1, i, and i + 1). The Z-scores can be calculated for any number of assigned chemical shifts (Eqs. 4–6), and in particular also in cases with missing assignments for the center residue i. If there are no assigned chemical shifts for the full triplet, Z-scores cannot be calculated, and these residues are skipped in the output file.
12. Residues are numbered using the _Residue_seq_code field in the BMRB file. Numbers in _Residue_author_seq_code are ignored.
13. The one-letter codes “n” and “c” denote the N- and C-termini, respectively.
14. Sometimes, the chemical shift reference can be misset, which can have a dramatic impact as a systematic bias in wSCSs (e.g., for all 1H or 13C shifts) and thereby systematically higher Z-scores. To address this issue, offset corrections are derived automatically within POTENCI using a statistical information-theoretic approach as described in detail in the POTENCI paper [21, 30, 31]. SCSs should have these values subtracted first if plotted manually from the shifts file.
15. wSCSs in the plot are offset-corrected (see Note 14) and weighted in order to scale with the natural variation among disordered residues by dividing the unscaled SCS (Eq. 3) by 0.1846, 0.1982, 0.1544, 0.4722, 0.0608, 0.02621, and 0.02154 ppm for 13C′, 13Cα, 13Cβ, 15N, 1HN, 1Hα, and 1Hβ, respectively [21]. The numbers were estimated using the standard deviation within the POTENCI training set.
16. The accuracy of POTENCI for the wSCSs is 1.0 by definition (Eqs. 3–6). Chemical shifts measured with lower accuracy can lead to larger scatter in wSCSs. In particular, if proton chemical shifts are derived from measurements in the indirect dimension with an insufficient number of increments, the measurement errors can be significantly larger than 1.0 for the corresponding wSCSs.
17. CheZOD Z-scores depend on the number of assigned chemical shifts for each tripeptide (maximum 21). Strictly speaking, Z-scores quantify the statistical significance of the deviation from complete disorder. At an average weighted absolute SCS (awaSCS) around unity, the Z-score is close to 0 regardless of the number of assigned chemical shifts. Furthermore, the threshold, Z = 3, is well defined even when fewer chemical shifts are available (e.g., with exclusively the amide 15N and 1HN). In contrast, at the highest end of the scale, Z-scores scale largely linearly with the number of chemical shifts. For example, the maximum value of Z = 16.15 is obtained when all seven chemical shifts are available for all residues in the tripeptide and the awaSCS > 4 (cutoff corresponding to four standard deviations from RCCS) for all atoms. The related value for the corresponding scenario with only three available assigned chemical shifts per residue is Z = 10.63. This means that larger values of Z-scores must be interpreted with greater caution if fewer assigned chemical shifts are available, and segments of disorder must be judged relative to the maximum Z-scores of the profile.
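The weighting described in Notes 14 and 15 amounts to a subtraction followed by a division. The sketch below takes the SCS as the observed shift minus the RCCS (the usual definition of a secondary chemical shift), subtracts an optional offset correction, and divides by the scale factors listed in Note 15. The function name and the short nucleus labels are hypothetical; the scale factors are copied from Note 15.

```python
# Scale factors in ppm from Note 15; the dictionary keys are our own labels.
SCALE_PPM = {
    "C": 0.1846,    # 13C'
    "CA": 0.1982,   # 13C-alpha
    "CB": 0.1544,   # 13C-beta
    "N": 0.4722,    # 15N
    "HN": 0.0608,   # 1HN
    "HA": 0.02621,  # 1H-alpha
    "HB": 0.02154,  # 1H-beta
}


def weighted_scs(nucleus, observed_ppm, rccs_ppm, offset_ppm=0.0):
    """Offset-corrected SCS divided by the nucleus-specific scale factor."""
    scs = observed_ppm - rccs_ppm - offset_ppm
    return scs / SCALE_PPM[nucleus]


# Consistency check against Note 5: a 9.1 ppb/K coefficient over the 20 K
# between 298 K and 278 K is 0.182 ppm, i.e. about 2.99 weighted units for 1HN.
print(round(weighted_scs("HN", 0.182, 0.0), 2))
```

Because all nuclei end up on the same dimensionless scale, weighted values can be compared or averaged across atom types, which is the basis for the awaSCS mentioned in Note 17.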

Acknowledgments

We are indebted to all researchers who have deposited NMR data in the BMRB and are grateful to the visionaries who have initiated this data collection initiative. Ian H. Gotliebsen is acknowledged for technical support.

References

1. van der Lee R, Buljan M, Lang B, Weatheritt RJ, Daughdrill GW, Dunker AK, Fuxreiter M, Gough J, Gsponer J, Jones DT, Kim PM, Kriwacki RW, Oldfield CJ, Pappu RV, Tompa P, Uversky VN, Wright PE, Babu MM (2014) Classification of intrinsically disordered regions and proteins. Chem Rev 114(13):6589–6631
2. Wright PE, Dyson HJ (1999) Intrinsically unstructured proteins: re-assessing the protein structure-function paradigm. J Mol Biol 293(2):321–331
3. Dyson HJ, Wright PE (2005) Intrinsically unstructured proteins and their functions. Nat Rev Mol Cell Biol 6(3):197–208
4. Romero P, Obradovic Z, Dunker AK (2004) Natively disordered proteins: functions and predictions. Appl Bioinforma 3(2-3):105–113
5. Uversky VN, Oldfield CJ, Dunker AK (2008) Intrinsically disordered proteins in human diseases: introducing the D2 concept. Annu Rev Biophys 37:215–246
6. Midic U, Oldfield CJ, Dunker AK, Obradovic Z, Uversky VN (2009) Unfoldomics of human genetic diseases: illustrative examples of ordered and intrinsically disordered members of the human diseasome. Protein Pept Lett 16(12):1533–1547
7. Jensen MR, Ruigrok RW, Blackledge M (2013) Describing intrinsically disordered proteins at atomic resolution by NMR. Curr Opin Struct Biol 23(3):426–435
8. Kragelj J, Ozenne V, Blackledge M, Jensen MR (2013) Conformational propensities of intrinsically disordered proteins from NMR chemical shifts. ChemPhysChem 14(13):3034–3045
9. Felli IC, Pierattelli R (2014) Novel methods based on (13)C detection to study intrinsically disordered proteins. J Magn Reson 241:115–125
10. Konrat R (2014) NMR contributions to structural dynamics studies of intrinsically disordered proteins. J Magn Reson 241:74–85
11. Nielsen JT, Mulder FAA (2018) POTENCI: prediction of temperature, neighbor and pH-corrected chemical shifts for intrinsically disordered proteins. J Biomol NMR 70(3):141–165
12. Wishart DS, Bigam CG, Holm A, Hodges RS, Sykes BD (1995) 1H, 13C and 15N random coil NMR chemical shifts of the common amino acids. I. Investigations of nearest-neighbor effects. J Biomol NMR 5(1):67–81
13. Kjaergaard M, Brander S, Poulsen FM (2011) Random coil chemical shift for intrinsically disordered proteins: effects of temperature and pH. J Biomol NMR 49(2):139–149
14. Platzer G, Okon M, McIntosh LP (2014) pH-dependent random coil (1)H, (13)C, and (15)N chemical shifts of the ionizable amino acids: a guide for protein pKa measurements. J Biomol NMR 60(2-3):109–129
15. Tamiola K, Scheek RM, van der Meulen P, Mulder FAA (2018) pepKalc: scalable and comprehensive calculation of electrostatic interactions in random coil polypeptides. Bioinformatics 34(12):2053–2060
16. Berjanskii MV, Wishart DS (2005) A simple method to predict protein flexibility using secondary chemical shifts. J Am Chem Soc 127(43):14970–14971
17. Marsh JA, Singh VK, Jia Z, Forman-Kay JD (2006) Sensitivity of secondary structure propensities to sequence differences between alpha- and gamma-synuclein: implications for fibrillation. Protein Sci 15(12):2795–2804
18. Camilloni C, De Simone A, Vranken WF, Vendruscolo M (2012) Determination of secondary structure populations in disordered states of proteins using nuclear magnetic resonance chemical shifts. Biochemistry 51(11):2224–2231
19. Kjaergaard M, Poulsen FM (2012) Disordered proteins studied by chemical shifts. Prog Nucl Magn Reson Spectrosc 60:42–51
20. Tamiola K, Mulder FAA (2012) Using NMR chemical shifts to calculate the propensity for structural order and disorder in proteins. Biochem Soc Trans 40:1014–1020
21. Nielsen JT, Mulder FAA (2016) There is diversity in disorder—“in all chaos there is a cosmos, in all disorder a secret order”. Front Mol Biosci 3:4
22. Canal L (2005) A normal approximation for the chi-square distribution. Comput Stat Data Anal 48(4):803–808
23. Nielsen JT, Mulder FAA (2019) Quality and bias of protein disorder predictors. Sci Rep 9(1):5137
24. Uversky VN (2019) Intrinsically disordered proteins and their “mysterious” (meta)physics. Front Phys 7:10
25. Sattler M, Schleucher J, Griesinger C (1999) Heteronuclear multidimensional NMR experiments for the structure determination of proteins in solution. Prog Nucl Magn Reson Spectrosc 34:93–158
26. Brutscher B, Felli IC, Gil-Caballero S, Hosek T, Kummerle R, Piai A, Pierattelli R, Solyom Z (2015) NMR methods for the study of intrinsically disordered proteins structure, dynamics, and interactions: general overview and practical guidelines. Adv Exp Med Biol 870:49–122
27. Wang Y, Jardetzky O (2002) Investigation of the neighboring residue effects on protein chemical shifts. J Am Chem Soc 124(47):14075–14084
28. Chen-Plotkin AS, Lee VMY, Trojanowski JQ (2010) TAR DNA-binding protein 43 in neurodegenerative disease. Nat Rev Neurol 6:211
29. Chang CK, Chiang MH, Toh EK, Chang CF, Huang TH (2013) Molecular mechanism of oxidation-induced TDP-43 RRM1 aggregation and loss of function. FEBS Lett 587(6):575–582
30. Akaike H (1974) A new look at the statistical model identification. IEEE Trans Automat Contr 19(6):716–723
31. Akaike H (1985) Prediction and entropy. In: Atkinson A, Fienberg S (eds) A Celebration of statistics. Springer, New York, pp 1–24. https://doi.org/10.1007/978-1-4613-8560-8_1

Chapter 16

Determination of pKa Values in Intrinsically Disordered Proteins

Brandon Payliss and Anthony Mittermaier

Birthe B. Kragelund and Karen Skriver (eds.), Intrinsically Disordered Proteins: Methods and Protocols, Methods in Molecular Biology, vol. 2141, https://doi.org/10.1007/978-1-0716-0524-0_16, © Springer Science+Business Media, LLC, part of Springer Nature 2020

Abstract

Electrostatic interactions in intrinsically disordered proteins (IDPs) and regions (IDRs) can strongly influence their conformational sampling. Side chain pKa values provide information on the electrostatic interaction energies of individual side chains and are required to accurately determine the molecular net charge and charge distribution. Nuclear magnetic resonance (NMR) spectroscopy is the premier method for measuring side chain pKa values as it can detect the ionization states of individual side chains in an IDP or IDR simultaneously. In this section, we outline the use of NMR spectroscopy to determine side chain-specific pKas for each of the nine aspartates, five glutamates, and one histidine contained in a highly acidic 35-residue intrinsically disordered peptide.

Key words IDPs, Nuclear magnetic resonance spectroscopy, NMR, Biophysics, pKa, Titration

1 Introduction

Intrinsically disordered proteins (IDPs) and regions (IDRs) lack persistent secondary and tertiary structure, making them inherently dynamic and flexible molecules that can populate diverse ensembles of conformations. There is considerable evidence that the conformational sampling preferences exhibited by IDP ensembles can impact their biological functions [1–3]. Thus, there is much interest in understanding how the amino acid sequence and posttranslational modifications of IDPs govern their conformational ensembles. One particularly important factor is electrostatic attraction and repulsion between charged amino acid side chains. The number of positively charged basic side chains and negatively charged acidic side chains and their distribution in the amino acid sequence have been linked to both the effective radius and the shape of IDP conformational ensembles [4, 5]. Intriguingly, some common posttranslational modifications, for example, phosphorylation and acetylation, alter the local (as well as overall) charge and


could thereby modulate long-range conformational sampling and biological function [6–9]. Thus, methods to evaluate electrostatic interaction energies on a per-residue basis are of great interest. A simple and powerful way to probe side chain electrostatic interactions is to measure residue-specific ionization constants. These are typically expressed in terms of pKa,

\[ \mathrm{p}K_a = -\log_{10}\{K_a\} \tag{1} \]

where Ka is the equilibrium constant for the dissociation of the acidic proton,

\[ K_a = \frac{[\mathrm{H_3O^+}][\mathrm{A^-}]}{[\mathrm{AH}]} \tag{2} \]

AH and A− are the protonated and deprotonated forms of an acidic side chain (Asp, Glu, and the carboxyl terminus), and H3O+ is the hydronium ion. For basic residues (His, Arg, Lys, and the amino terminus), Ka is defined according to

\[ K_a = \frac{[\mathrm{H_3O^+}][\mathrm{B}]}{[\mathrm{BH^+}]} \tag{3} \]

where BH+ and B are the protonated and deprotonated forms. Larger vs. smaller values of pKa correspond to tighter vs. weaker affinity of the side chain for the acidic protons. A simple way to conceptualize the meaning of a pKa value is to recall that it corresponds to the pH at which the side chain is 50% protonated, where

\[ \mathrm{pH} = -\log_{10}\{[\mathrm{H_3O^+}]\} \tag{4} \]

pKa values are informative on side chain electrostatic interactions, since a favorable electrostatic attraction between a negatively charged acidic group and a positively charged basic group will tend to increase the tendency of the acidic group to deprotonate (decrease its pKa) and decrease the tendency of the basic group to deprotonate (increase its pKa). Conversely, unfavorable electrostatic repulsion between like charges will tend to decrease the tendency of acidic groups to deprotonate (increased pKa) and increase the tendency of basic groups to deprotonate (decreased pKa). It should be noted that in folded globular proteins, burial of an ionizable side chain in the protein interior can lead to the same pattern of up/downshifted pKas for acidic/basic residues [10, 11], although this is unlikely to be as prevalent for IDPs, which do not have closely packed hydrophobic cores. Furthermore, the magnitudes of the pKa shifts can be used to calculate quantitatively the overall electrostatic interaction energies [9]. pKa values have been measured for all standard amino acids in the context of short model peptides [12] to serve as reference values in these calculations.
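The relations above can be made concrete with a short numerical sketch. It combines Eqs. 1–4 into the fraction of a site that is protonated at a given pH and converts a pKa shift into a free energy using the standard thermodynamic relation ΔG = RT ln(10) ΔpKa. The function names are our own, and the sign convention for the free energy (positive for an upshifted pKa) is an assumption made for this illustration; the procedure of ref. [9] may differ in detail.

```python
import math

R_J_PER_MOL_K = 8.314462  # molar gas constant


def fraction_protonated(ph, pka):
    """Fraction of a site in the protonated form, from Eqs. 1-4."""
    return 1.0 / (1.0 + 10.0 ** (ph - pka))


def free_energy_kj_per_mol(delta_pka, temperature_k=298.0):
    """Free energy equivalent of a pKa shift, RT ln(10) * delta_pKa."""
    return R_J_PER_MOL_K * temperature_k * math.log(10.0) * delta_pka / 1000.0


print(fraction_protonated(4.0, 4.0))            # pH equal to pKa
print(round(free_energy_kj_per_mol(1.0), 2))    # one pKa unit at 298 K
```

At pH = pKa the site is half protonated, and one pKa unit corresponds to roughly 5.7 kJ/mol at 298 K, which gives a feel for the size of the electrostatic energies probed by such measurements.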


The premier method for determining residue-specific pKa values in proteins is solution nuclear magnetic resonance (NMR) spectroscopy, as this technique allows one to visualize unique signals originating from each ionizable residue of a protein simultaneously. The frequencies (chemical shifts) of the nuclei close to the site of ionization are sensitive to the ionization state. In general, proton association and dissociation is fast on the NMR chemical shift timescale, meaning that peaks in an NMR spectrum gradually shift from locations corresponding to the protonated state to locations corresponding to the deprotonated state as the pH is raised. The fraction of ionization at a given pH is simply the fraction of the total shift in peak position exhibited in spectra collected at that pH. pKa measurements of Arg, Lys, and His can typically be performed with two-dimensional 1H/13C heteronuclear single-quantum coherence (HSQC) spectra [12]. Directly attached proton/carbon pairs give rise to peaks which appear at the 1H chemical shift in the directly detected dimension and at the 13C chemical shift in the indirect dimension. Suitable 1H/13C spin pairs include 1Hε1/13Cε1 for His, 1Hε/13Cε for Lys, and 1Hδ/13Cδ for Arg [13, 14]. For Asp and Glu residues, Tollinger et al. have developed a 2D NMR experiment in which peaks appear at the side chain carboxyl 13C chemical shift in the indirect dimension and at the chemical shift of the amide 1H of the following residue in the direct dimension [15]. A series of chemical shifts are measured as the pH is titrated across the ionization transition regions of the residues of interest and are then fitted to a modified Henderson-Hasselbalch equation (see Subheading 3.4.2, step 4) to yield pKa values on a per-residue basis. Note that this equation also yields a phenomenological Hill coefficient for each residue, which accounts for pH-dependent changes in the electrostatic environment of the residue throughout the titration [16].
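To illustrate how a pKa and a Hill coefficient follow from a titration curve, the sketch below uses one commonly used form of the modified Henderson-Hasselbalch equation, δ(pH) = (δ_AH + δ_A · 10^{n(pH − pKa)}) / (1 + 10^{n(pH − pKa)}), and a simple linearized fit. This is not the fitting procedure of Subheading 3.4.2, whose equation may be parametrized differently; here the limiting shifts of the protonated and deprotonated states are assumed to be known from the plateaus of the titration, and all names and numbers are hypothetical.

```python
import math


def model_shift(ph, pka, hill, shift_prot, shift_deprot):
    """Chemical shift at a given pH under fast exchange (assumed form)."""
    x = 10.0 ** (hill * (ph - pka))
    return (shift_prot + shift_deprot * x) / (1.0 + x)


def fit_pka_linearized(ph_values, shifts, shift_prot, shift_deprot):
    """Estimate (pKa, Hill coefficient) by linear regression.

    The deprotonated fraction f is the fraction of the total shift change;
    log10(f / (1 - f)) = hill * (pH - pKa) is then a straight line in pH.
    Points too close to either plateau are discarded because the logarithm
    amplifies their measurement error.
    """
    xs, ys = [], []
    for ph, shift in zip(ph_values, shifts):
        f = (shift - shift_prot) / (shift_deprot - shift_prot)
        if 0.02 < f < 0.98:
            xs.append(ph)
            ys.append(math.log10(f / (1.0 - f)))
    if len(xs) < 2:
        raise ValueError("need at least two points in the transition region")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / sxx
    intercept = mean_y - slope * mean_x
    return -intercept / slope, slope


# Synthetic, noise-free data for a hypothetical carboxyl 13C resonance.
ph_points = [3.0, 3.5, 4.0, 4.5, 5.0, 5.5]
data = [model_shift(ph, 4.2, 0.9, 177.0, 181.0) for ph in ph_points]
pka_fit, hill_fit = fit_pka_linearized(ph_points, data, 177.0, 181.0)
print(round(pka_fit, 3), round(hill_fit, 3))
```

A nonlinear least-squares fit that also optimizes the two limiting shifts is generally preferable for real data; the linearized version is shown only because it makes the roles of the pKa (midpoint) and the Hill coefficient (slope) explicit.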
In this chapter, we describe the bacterial expression and purification of an IDR studied in our lab, the 35-residue disordered C-terminal tail from yeast γ-tubulin (γCT). NMR protein studies typically require isotopic enrichment with 13C and/or 15N, and this is most easily accomplished by heterologous expression in E. coli grown in minimal media. The γCT is produced as a cleavable SUMO (Small Ubiquitin-like Modifier) fusion, which enhances protein solubility and yield during bacterial expression, and therefore represents a good example of producing IDPs for NMR analysis. This is followed by sections detailing the NMR spectroscopy protocol and titration data analysis. For the γCT, this approach yields the side chain pKas of nine Asp, five Glu, and one His residue. Despite the large proportion of negatively charged residues (40% Asp and Glu) and blocks of charged residues (e.g., D14, D15, E16 and E23, E24, D25), we were nevertheless able to unambiguously determine the pKa for each side chain. In principle, the


methodology outlined here could be readily applied to other IDPs of varying sequence composition and charge distribution in order to measure the side chain-specific pKas of Asp, Glu, and His residues.

2 Materials

Prepare all solutions using ultrapure water (18 MΩ-cm, 25 °C). Media and media components can be stored at room temperature and buffers stored at 4 °C, unless otherwise stated. Where indicated, autoclave liquid media or reagents using a liquid cycle (121 °C) with a minimum sterilization time of 25 min. Follow standard biosafety procedures in working with and disposing of waste contaminated with Escherichia coli. Refer to chemical Safety Data Sheets (SDS) and local regulations when handling and disposing of chemicals and waste materials.

In this study, we designed a 6xHis-SUMO-γCT fusion protein (γCT sequence: AAEQD SYLDD VLVDD ENMVG ELEED LDADG DHKLV) by cloning the DNA encoding the γCT from Saccharomyces cerevisiae into a pET2S-T vector containing the ampR selection marker using standard molecular biology techniques.
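The residue counts quoted for the γCT (nine Asp, five Glu, and one His in 35 residues) can be checked directly from the sequence given above. The sketch below counts the residue types and lists the positions of the acidic and histidine side chains; the variable and function names are our own.

```python
from collections import Counter

# Sequence as printed above, with the spacing between blocks of five removed.
GAMMA_CT = "".join("AAEQD SYLDD VLVDD ENMVG ELEED LDADG DHKLV".split())


def sites(sequence, residue_types):
    """Return labels such as 'D14' for every residue of the given types."""
    return [aa + str(i) for i, aa in enumerate(sequence, start=1) if aa in residue_types]


counts = Counter(GAMMA_CT)
acidic_fraction = (counts["D"] + counts["E"]) / float(len(GAMMA_CT))

print(len(GAMMA_CT), counts["D"], counts["E"], counts["H"])
print(acidic_fraction)
print(sites(GAMMA_CT, "DEH"))
```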

2.1 Media and Reagents

1. Luria Broth (LB) Media: Add 12.5 g of LB into each of 8 × 2 L Erlenmeyer flasks. Add 0.5 L of ultrapure water to each flask. Cover flasks with aluminum foil and autoclave. Allow media to cool to room temperature before adding the required antibiotic(s).
2. 10× M9 Salts: Dissolve 67.8 g Na2HPO4, 30 g KH2PO4, and 5 g NaCl into 1 L of ultrapure water. Adjust pH to 7.4 and autoclave.
3. 1 M MgSO4 stock solution: Dissolve 12.04 g MgSO4 into 100 mL of ultrapure water and autoclave.
4. 0.1 M CaCl2 stock solution: Dissolve 1.11 g of CaCl2 into 100 mL of ultrapure water and autoclave.
5. 10 mg/mL thiamine: Dissolve 15 mg of thiamine into 1.5 mL of ultrapure water. Thiamine is light sensitive and should be stored in the dark. Filter sterilize using a 0.22 μm syringe filter.
6. 10 mg/mL D-biotin: Dissolve 15 mg into 1.5 mL of 3× M9 Salts. Filter sterilize using a 0.22 μm syringe filter.
7. 20% (w/v) U-13C D-glucose: Dissolve 6.0 g of U-13C6 D-glucose into 30.0 mL of ultrapure water. Filter sterilize using a 0.22 μm syringe filter.
8. 100 mM isopropyl β-D-1-thiogalactopyranoside (IPTG): Dissolve 0.47 g of IPTG into 20 mL of ultrapure water. Filter sterilize using a 0.22 μm syringe filter.


9. M9 Minimal Media: Dissolve 1.0 g of 15NH4Cl into 880 mL of ultrapure water. Add 100 mL of 10× M9 Salts. Adjust to pH 7.4 and autoclave. Once cooled to room temperature, aseptically transfer 15.0 mL of 20% U-13C D-glucose, followed by 1.0 mL of each of the following stock solutions: 10 mg/mL thiamine, 10 mg/mL D-biotin, 1 M MgSO4, and 0.1 M CaCl2. Add antibiotic(s) as required.

2.2 Buffers

1. Lysis Buffer: 50 mM sodium phosphate pH 7.5, 500 mM NaCl, 20 mM imidazole. Prepare 0.5 M NaH2PO4 and 0.5 M Na2HPO4 solutions. Add 2.25 mL of 0.5 M NaH2PO4 and 7.75 mL of 0.5 M Na2HPO4 per 100 mL of Lysis Buffer required. Dissolve 2.92 g NaCl and 0.136 g imidazole per 100 mL of Lysis Buffer prior to adjusting the pH to 7.5.
2. Ni-NTA Elution Buffer: 50 mM sodium phosphate pH 7.5, 500 mM NaCl, 300 mM imidazole. Add 20 mL of 0.5 M NaH2PO4, 2.92 g of NaCl, and 2.042 g imidazole per 100 mL of Ni-NTA Elution Buffer. Once dissolved, adjust the pH to 7.5 and store at 4 °C.
3. Size Exclusion Chromatography (SEC) Buffer: 25 mM sodium phosphate pH 7.2, 10 mM acetate, 100 mM NaCl, 0.22 μm filtered and degassed. Add 22.5 mL of 0.5 M NaH2PO4 and 77.5 mL of 0.5 M Na2HPO4 per 1 L of SEC buffer. Dissolve 5.84 g of NaCl per 1 L of SEC buffer. Add 575 μL of glacial acetic acid, and adjust the pH to 7.2 before filtering through a 0.22 μm vacuum filter. Allow the buffer to stand under vacuum for at least 10 min to degas. Store at 4 °C.
4. NMR Buffer: 25 mM sodium phosphate pH 7.2, 10 mM acetate, 100 mM NaCl, 50 μM EDTA, 50 μM NaN3, 10% D2O. Prepare a 0.05 M stock solution of NaN3 by dissolving 0.163 g of NaN3 into 5 mL of ultrapure water. Prepare a 0.05 M stock solution of pH 8.0 ethylenediaminetetraacetic acid (EDTA) by adding 1.861 g of disodium EDTA dihydrate into 100 mL of ultrapure water. Add a stir bar and dissolve one to two pellets of NaOH. Adjust the pH to 8.0 and autoclave. To prepare the NMR buffer, use 100 mL of SEC buffer, and add 100 μL of 0.05 M EDTA pH 8.0 and 100 μL of 0.05 M NaN3. Store at 25 °C.
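The masses quoted in the recipes above follow from mass = concentration × volume × molar mass, and the final concentration of an added stock follows from C1·V1 = C2·V2. The sketch below reproduces a few of the quoted numbers as a check when scaling a recipe to a different volume. The function names are our own, and the molar masses are standard values for the anhydrous compounds rather than values given in the protocol.

```python
# Molar masses in g/mol (standard values, not stated in the protocol).
MOLAR_MASS = {
    "NaCl": 58.44,
    "imidazole": 68.08,
}


def grams_needed(compound, molarity, volume_l):
    """Mass of solid required for a target molarity in a given volume."""
    return molarity * volume_l * MOLAR_MASS[compound]


def diluted_molarity(stock_molarity, stock_volume_l, final_volume_l):
    """Final concentration after adding a small stock volume (C1*V1 = C2*V2).

    The volume added is assumed to be negligible relative to the final volume.
    """
    return stock_molarity * stock_volume_l / final_volume_l


print(round(grams_needed("NaCl", 0.5, 0.1), 2))        # Lysis Buffer, per 100 mL
print(round(grams_needed("imidazole", 0.02, 0.1), 3))  # Lysis Buffer, per 100 mL
print(round(grams_needed("imidazole", 0.3, 0.1), 3))   # Elution Buffer, per 100 mL
print(diluted_molarity(0.05, 100e-6, 0.1))             # 100 uL of 0.05 M stock in 100 mL
```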

2.3 Other Materials

1. IPTG (99%). D-biotin (99%). Thiamine hydrochloride (99%). Magnesium sulfate dihydrate (99% MgSO4). Calcium chloride dihydrate (99% CaCl2). Ampicillin (96%). Chloramphenicol (98%). Sodium phosphate dibasic (99% Na2HPO4) and monobasic (99% NaH2PO4). Potassium phosphate monobasic (99% KH2PO4). Sodium chloride (99% NaCl). Imidazole (99%). Disodium ethylenediaminetetraacetic acid dihydrate (99% EDTA). Sodium azide (99% NaN3). Glacial acetic acid (100% CH3COOH). Deuterium oxide (99.9% D2O). 15N-labelled ammonium chloride (99% 15NH4Cl). Uniformly 13C-labelled D-glucose (99%, U-13C6 D-glucose). Ulp1 protease (SUMO protease 1).

2. HisTrap HP (1 mL). HiLoad™ 16/600 Superdex 75 pg. Small volume (3 mm) pH electrode. Erlenmeyer flasks (2 L). Coffee grinder. 3 kDa MWCO and 10 kDa MWCO centrifugal concentrators. 5 mm NMR tubes (length: 178 mm). Pasteur pipette for NMR tubes (length: 203 mm). Syringe filters (polyethersulfone (PES), 0.22 μm). Vacuum filters (500 mL capacity, PES, 0.22 μm).

3. Varian INOVA 500 MHz NMR spectrometer with a room temperature probe or an equivalent NMR spectrometer with a field ≥500 MHz.

3 Methods

3.1 Protein Expression in M9 Minimal Media

1. Select a single colony of E. coli BL21 (DE3) containing the modified pET2S-T vector from a fresh LB agar plate. Aseptically transfer to 50 mL of sterile LB contained in a 250 mL Erlenmeyer flask. In a shaker incubator, grow an overnight starter culture (225 rpm, 28 °C, Peak Finding.

7. Add peaks corresponding to the Cε1 of His for each 1H-13C HSQC collected at different pH values (see Note 18).

8. Select all Cε1 peaks by clicking the mouse and dragging over the peaks. On the menu bar, select Peak > Selected Peaks. This table will show various information about the selected peaks. Right-click on the table, and export the “Spectrum” and “Position F2” columns into a tab-separated (.txt) file (see Note 19). Name the file aromatic_hsqc_peaks.txt.

9. A similar methodology can be used to track peak movement in the 1H-15N HSQC, followed by the 13COi/1HNi+1 spectra (see Note 20). Note that overlapping Asp or Glu peaks may resolve throughout the titration. For example, E23 and E16 from the γCT overlap completely until the pH is reduced to 4.9 (Fig. 4).

3.4.2 Model Fitting and pKa Determination Using MATLAB

1. After all peaks have been assigned and peak lists generated, the data can be imported and analyzed in MATLAB [22]. Using the 1H-13C HSQC data as an example, a simple script for importing these data is as follows:

    His_shifts = importdata('aromatic_hsqc_peaks.txt');
    His = His_shifts.data;
    x = His(:,1);
    y = His(:,2);


Fig. 4 Tracking peak movement in the 13COi/1HNi+1 spectra with decreasing pH. Panel (a) Movement of the D10 resonance over a pH range of 7.2–2.3 in steps of 0.15–0.3 for a total of 13 data points. Panel (b) E23 and E16 resonances overlap at pH 5.9 but are resolved when the pH is reduced to

7.2, in which case we recommend including 10 mM borate in the NMR buffer.

16. A detailed description regarding experimental setup on a Varian spectrometer can be found at http://daffy.uah.edu/resources/pKa_experiment.pdf, and the pulse sequence can be obtained from Dr. Lewis Kay’s research group at http://pound.med.utoronto.ca. The pulse sequence is more formally known as hbgcbgccbgcaconnh_mt__v2_500.c [15]. For this experiment, use a center position of 179 ppm and a spectral width of 12 ppm for the indirect (13CO) dimension. Set the number of complex points to 512:H(F2)/128:C(F1). The acquisition time should be approximately 1.5 h per spectrum.

17. Under our experimental conditions, the total change in ionic strength was 50 mM after all pH manipulations, which includes adjustment to pH 8.8 with 0.5 M NaOH, followed by reduction to pH 2.5 using 0.5 M HCl, where the initial and final ionic strengths are 135 mM and 185 mM, respectively. The total change in ionic strength can be calculated using the known concentrations and volumes added for HCl and NaOH. The pKa of an ionizable side chain is highly sensitive to its local environment, and the impact of ionic strength on site-specific pKa values will ultimately depend on the IDP and experimental conditions. We also caution the reader that tuning the NMR probe may be difficult when the ionic strength is >200 mM in a 5 mm NMR tube. If pKa values at ionic strengths >200 mM are desired, use 3 mm NMR tubes.

18. On the menu bar, open Peak > Peak Finding. Select the Region Peak tab, and set reasonable boundaries for each dimension before selecting Find Peaks! Repeat this process for each spectrum by changing the selection from the Peak List drop-down menu.

19. Modify the .txt file such that the first column contains the pH and the second column contains the corresponding 13CO chemical shift data.

20. In order to assign Asp and Glu peaks from the 13COi/1HNi+1 experiment, it may be necessary to perform 3D experiments to assign 1H-15N HSQC resonances unambiguously; accurate assignment of 1H-15N HSQC peaks is necessary in order to track Asp or Glu peaks in the 13COi/1HNi+1 spectra.


21. This Henderson-Hasselbalch equation has been modified to contain the Hill coefficient n. δobs is the observed chemical shift, δoffset is the chemical shift of the nucleus in its protonated state, and Δδ is the change in the chemical shift over the course of the titration. In MATLAB, the custom equation may be entered as “a+(b.*10.^(n*(c-x)))./(1+10.^(n*(c-x))),” where the variables a, b, n, and c are δoffset, Δδ, n, and pKa, respectively.

22. If model fitting is unsuccessful, inspect the scatter plot to provide an estimate of the pKa by determining the pH of the midpoint. Provide this value as an initial guess.

23. The prediction bounds of the model can be represented graphically by selecting Tools > Prediction Bounds > 95%. The smaller the distance between the prediction bounds and the model fit, the better the model fitting. Pay specific attention to the confidence bounds for each fitted parameter, especially for the pKa. Evaluate the distance between the upper (U) and lower (L) 95% confidence bounds; a reasonable fit would return U–L < 0.2 for the pKa.
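For readers who want to inspect the model of Note 21 outside MATLAB, the following Python sketch evaluates the same expression and implements the midpoint estimate suggested in Note 22. It is an illustration only: the function names are hypothetical, no fitting is performed, and the midpoint routine assumes a single monotonic transition sampled on both sides of the pKa.

```python
def hh_model(ph, delta_offset, delta_range, n, pka):
    """Modified Henderson-Hasselbalch curve with Hill coefficient n.

    Mirrors the MATLAB expression a+(b.*10.^(n*(c-x)))./(1+10.^(n*(c-x)))
    with a=delta_offset, b=delta_range, c=pka, and x=ph.
    """
    t = 10.0 ** (n * (pka - ph))
    return delta_offset + delta_range * t / (1.0 + t)


def midpoint_pka_guess(ph_values, shifts):
    """Initial pKa guess: the pH at which the shift crosses the halfway point
    between its smallest and largest observed values (linear interpolation).
    Returns None if no crossing is found."""
    pairs = sorted(zip(ph_values, shifts))
    half = (min(shifts) + max(shifts)) / 2.0
    for (p0, s0), (p1, s1) in zip(pairs, pairs[1:]):
        if s0 != s1 and (s0 - half) * (s1 - half) <= 0:
            return p0 + (half - s0) * (p1 - p0) / (s1 - s0)
    return None


if __name__ == "__main__":
    # Synthetic titration (hypothetical values, not data from this chapter)
    ph = [2.0 + 0.5 * i for i in range(11)]
    shifts = [hh_model(p, 176.0, 3.0, 1.0, 4.2) for p in ph]
    print(midpoint_pka_guess(ph, shifts))
```

The value returned by `midpoint_pka_guess` is meant only as a starting point for the fit; the fitted pKa and its confidence bounds should still come from the model fitting described above.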

Acknowledgments

The authors would like to thank Dr. Tara Sprules and the Quebec/Eastern Canada High Field (QANUC) NMR Facility at McGill University (https://nmrlab.mcgill.ca/qanuc) in Montreal, Canada, for providing technical support and access to the Varian INOVA 500 MHz NMR spectrometer used in this work.

References

1. Marsh JA, Forman-Kay JD (2010) Sequence determinants of compaction in intrinsically disordered proteins. Biophys J 98:2383–2390
2. Mao AH, Lyle N, Pappu RV (2013) Describing sequence-ensemble relationships for intrinsically disordered proteins. Biochem J 449:307–318
3. Wright PE, Dyson HJ (2015) Intrinsically disordered proteins in cellular signalling and regulation. Nat Rev Mol Cell Biol 16:18–29
4. Mao AH, Crick SL, Vitalis A et al (2010) Net charge per residue modulates conformational ensembles of intrinsically disordered proteins. Proc Natl Acad Sci U S A 107:8183–8188
5. Das RK, Pappu RV (2013) Conformations of intrinsically disordered proteins are influenced by linear sequence distributions of oppositely charged residues. Proc Natl Acad Sci U S A 110:13392–13397
6. Bah A, Vernon RM, Siddiqui Z et al (2014) Folding of an intrinsically disordered protein by phosphorylation as a regulatory switch. Nature 519:106–109
7. El Turk F, De Genst E, Guilliams T et al (2018) Exploring the role of post-translational modifications in regulating α-synuclein interactions by studying the effects of phosphorylation on nanobody binding. Protein Sci 27:1262–1274
8. Harris J, Shadrina M, Oliver C et al (2018) Concerted millisecond timescale dynamics in the intrinsically disordered carboxyl terminus of γ-tubulin induced by mutation of a conserved tyrosine residue. Protein Sci 27:531–545
9. Payliss BJ, Vogel J, Mittermaier AK (2019) Side chain electrostatic interactions and pH-dependent expansion of the intrinsically disordered, highly acidic carboxyl-terminus of γ-tubulin. Protein Sci 28:1095–1105
10. Mehler EL, Fuxreiter M, Simon I et al (2002) The role of hydrophobic microenvironments in modulating pKa shifts in proteins. Proteins Struct Funct Genet 48:283–292
11. Isom DG, Castaneda CA, Cannon BR et al (2011) Large shifts in pKa values of lysine residues buried inside a protein. Proc Natl Acad Sci U S A 108:5260–5265
12. Platzer G, Okon M, McIntosh LP (2014) pH-dependent random coil 1H, 13C, and 15N chemical shifts of the ionizable amino acids: a guide for protein pKa measurements. J Biomol NMR 60:109–129
13. Gao G, DeRose EF, Kirby TW et al (2006) NMR determination of lysine pKa values in the pol λ lyase domain: mechanistic implications. Biochemistry 45:1785–1794
14. Fitch CA, Platzer G, Okon M et al (2015) Arginine: its pKa value revisited. Protein Sci 24:752–761
15. Tollinger M, Forman-Kay JD, Kay LE (2002) Measurement of side-chain carboxyl pKa values of glutamate and aspartate residues in an unfolded protein by multinuclear NMR spectroscopy. J Am Chem Soc 124:5714–5717
16. Markley JL (1975) Observation of histidine residues in proteins by means of nuclear magnetic resonance spectroscopy. Acc Chem Res 8:70–80
17. Babu M, Krogan N, Awrey D et al (2009) Systematic characterization of the protein interaction network and protein complexes in Saccharomyces cerevisiae using tandem affinity purification and mass spectrometry. Methods Mol Biol 548:187–207
18. Reynolds WF, Burns DC (2012) Getting the most out of HSQC and HMBC spectra. In: Annual reports on NMR spectroscopy, vol 76. Academic Press, Cambridge, pp 1–21
19. Delaglio F, Grzesiek S, Vuister G et al (1995) NMRPipe: a multidimensional spectral processing system based on UNIX pipes. J Biomol NMR 6:277–293
20. Vranken WF, Boucher W, Stevens TJ et al (2005) The CCPN data model for NMR spectroscopy: development of a software pipeline. Proteins Struct Funct Genet 59:687–696
21. Lee W, Tonelli M, Markley JL (2015) NMRFAM-SPARKY: enhanced software for biomolecular NMR spectroscopy. Bioinformatics 31:1325–1327
22. MATLAB and Curve Fitting Toolbox Release 2018b, The MathWorks, Inc., Natick, MA, USA
23. Marley J, Lu M, Bracken C (2001) A method for efficient isotopic labeling of recombinant proteins. J Biomol NMR 20:71–75
24. Godin JP, Fay LB, Hopfgartner G (2007) Liquid chromatography combined with mass spectrometry for 13C isotopic analysis in life science research. Mass Spectrom Rev 26:751–774
25. Hoopes JT, Elberson MA, Preston RJ et al (2015) Protein labeling in Escherichia coli with 2H, 13C, and 15N. In: Methods in enzymology, vol 565. Academic Press, Cambridge, pp 27–44
26. Neuhoff V, Arold N, Taube D et al (1988) Improved staining of proteins in polyacrylamide gels including isoelectric focusing gels with clear background at nanogram sensitivity using Coomassie brilliant blue G-250 and R-250. Electrophoresis 9:255–262
27. Grzesiek S, Bax A (1992) An efficient experiment for sequential backbone assignment of medium-sized isotopically enriched proteins. J Magn Reson 99:201–207
28. Grzesiek S, Bax A (1992) Correlating backbone amide and side chain resonances in larger proteins by multiple relayed triple resonance NMR. J Am Chem Soc 114:6291–6293
29. Kay LE, Ikura M, Tschudin R et al (1990) Three-dimensional triple-resonance NMR spectroscopy of isotopically enriched proteins. J Magn Reson 89:496–514
30. Clubb RT, Thanabal V, Wagner G (1992) A constant-time three-dimensional triple-resonance pulse scheme to correlate intraresidue 1HN, 15N, and 13C′ chemical shifts in 15N-13C-labelled proteins. J Magn Reson 97:213–217
31. Rule GS, Hitchens TK (2006) Fundamentals of protein NMR spectroscopy. Springer, New York
32. Cavanagh J, Fairbrother WJ, Palmer AG III et al (2010) Protein NMR spectroscopy: principles and practice, 2nd edn. Academic Press, Cambridge

Chapter 17

Paris-DÉCOR: A Protocol for the Determination of Fast Protein Backbone Amide Hydrogen Exchange Rates

Rupashree Dass and Frans A. A. Mulder

Abstract

Determining hydrogen exchange kinetics in proteins can shed light on their structure and dynamics. Nuclear magnetic resonance (NMR) spectroscopy is an important analytical technique to determine exchange rates. In this chapter, we describe a new method (Paris-DÉCOR) to determine fast protein amide backbone hydrogen exchange rates in the range 10 to 10⁴ s⁻¹. Measuring fast exchange rates is particularly important for the study of intrinsically disordered proteins, where there is very little protection from exchange to the solvent by the formation of persistent structure. We provide a protocol to set up the experiment as well as MATLAB scripts for the numerical simulation that is needed to determine the exchange rates.

Key words NMR, Hydrogen exchange, Proteins, Spin-spin decorrelation, Intrinsically disordered proteins

Birthe B. Kragelund and Karen Skriver (eds.), Intrinsically Disordered Proteins: Methods and Protocols, Methods in Molecular Biology, vol. 2141, https://doi.org/10.1007/978-1-0716-0524-0_17, © Springer Science+Business Media, LLC, part of Springer Nature 2020

1 Introduction

Backbone amide hydrogen exchange rates of peptides and proteins can provide information about hydrogen bonding, secondary structure, conformational dynamics, and stability. The formation of structure reduces amide hydrogen exchange rates relative to those observed in their solvent-exposed reference state. The reduction relative to these intrinsic rates constitutes an important means to determine the relative stability to partial unfolding in folded proteins and emerges as a means to determine the extent of structure formation in intrinsically disordered proteins (IDPs). Nuclear magnetic resonance (NMR) spectroscopy is a widely used analytical tool to measure hydrogen exchange rates, and a number of methods have been developed to determine exchange kinetics that are applicable to different timescales. In this chapter, we give a step-by-step guide for determining exchange rates using the newly developed Paris-DÉCOR [1] approach, which is suited for measuring very rapid hydrogen exchange in the approximate range 10–10,000 s⁻¹. The measurement of such very fast exchange rates has not been developed hitherto but has become of interest with the increased number of studies dealing with IDPs.

The Paris-DÉCOR method is based on previous work by Skrynnikov and Ernst [2] and Pelupessy and co-workers [3]. The first authors developed the conceptual framework for measuring the rate of hydrogen exchange via the spin-lattice relaxation rate of two-spin order terms such as 2NzHz for an amide backbone pair. They rigorously showed that exchange events with solvent lead to a decorrelation of the involved spin operators (hence the reference to DECOR), which then manifests itself as an additional exponential loss rate that equals the kinetic rate for exchange with solvent. Building on this approach, the Paris group of Pelupessy and co-workers [3] demonstrated that the exchange rate can be more conveniently measured through decorrelation of antiphase operators. In their ingenious approach, different admixtures of Ny and 2NxHz are produced by Carr-Purcell-Meiboom-Gill [4, 5] (CPMG) echo trains applied to the 15N channel. In contrast to Ny, the antiphase operator 2NxHz is subject to decorrelation by exchange events, such that exchange leads to pulse-spacing-dependent signal loss (detected in the experiment by reduced peak intensities). In addition, a reference experiment is performed in which high-power composite pulse 1H decoupling is applied, such that the generation of antiphase operators is strongly suppressed.

In practice, a pulse sequence is chosen that produces an in-phase coherence Ny, which is then subjected to a constant time period during which a variable number of 15N CPMG pulses are applied. Subsequently, the magnetization is transferred to a suitable non-labile nucleus for detection. Amide proton detection is not possible because these protons exchange too fast to be observed. So far, we have successfully implemented the Paris-DÉCOR approach in 2D H(CACO)N and 2D CON experiments. Peak intensities of the spectra obtained with different numbers of 15N CPMG pulses are then divided by the intensity measured in the reference experiment.

In the analysis, exchange rates are obtained through a numerical fitting procedure. Briefly, expected peak intensities are computed by evolution of a subset of density operators during the constant time period. A best fit between experimental and computed intensity ratios at all CPMG spacings is then obtained through simplex minimization of the error target function. In this way, fast protein backbone amide exchange rates are determined for each residue separately, as the computation uses the RF field strengths and offsets, as explained in the literature [3].

2 Materials

1. A sample of uniformly [15N/13C]-labelled protein (see Note 1) in Tris buffer pH 9.0 (see Note 2), without added D2O (see Note 3).

2. For the combined purpose of chemical shift referencing, and to provide a lock signal, a capillary containing 50 mM DSS (4,4-dimethyl-4-silapentane-1-sulfonic acid) in D2O is used (see Note 3).

3. NMR instrument. High-field Bruker NMR spectrometer equipped for triple resonance experiments, operating at 500 MHz or higher. For experiments that use 1H detection, a standard probe head. For experiments that require 13C detection, a cryogenically cooled probe head, ideally of the type designed for 13C reception.

3 Methods

3.1 Data Acquisition

1. The protocol explained below is applicable for Bruker spectrometers. To set up a general NMR experiment, one may refer to Bruker TopSpin manuals and additionally consult Chapter 4 of the book Structural Biology: Practical NMR Applications by Quincy Teng [6]. It is assumed that all steps are taken that are normally needed to record triple resonance experiments.

2. Pulse sequence and parameter sets are available from http://www.protein-nmr.org. These are straightforward modifications of Bruker library experiments and are “prosol” compatible for automatic calculation of pulse widths (see Note 4).

3. Use the “getprosol” command to set the correct pulse lengths for all pulses on 1H, 13C, and 15N channels.

4. Set the parameter d25 to equal the length of the total CPMG period on the 15N channel. Typically, we employ 20 ms (see Note 5).

5. Evaluate relevant parameters (suggestions in square brackets) like d1 [1 s], dummy scans [4], and number of scans [multiple of 8] (see Note 6).

6. Select appropriate carrier offsets, spectral widths, and number of data points for direct and indirect time domains (see Note 7).

7. Run experiments without 1H decoupling (spectra “A”) by ensuring that zgoptns is undefined (i.e., field left blank). The loop counter l4 defines the number of 180° pulses during the CPMG period and should be an even number to maximize the well-known self-compensatory effects of the CPMG sequence with respect to off-resonance artifacts. Acquire several 1D datasets (e.g., by running a single increment) for different values of l4. We recommend running at least four experiments (l4 = 2, 4, 8, 16). Due to the finite 15N 180° pulse width, the maximum number of pulses will depend on the power handling of the probe and the length of d25. A maximum of 16 is often adequate, because at that point very little antiphase magnetization develops and so very little attenuation is obtained.

8. Acquire a reference spectrum (spectrum “B”) with zgoptns equal to “–DLABEL_REF” and the maximum value for l4 to be used. This will ensure 1H decoupling is applied during CPMG. Ensure that the 1H decoupling RF field is strong (>4 kHz; see Note 8).

9. Execute the series of 2D experiments (one reference and a series of experiments with different values of l4), each with otherwise identical settings. Having a large number of dummy scans (to the equivalent of 3 min) ensures that the sample temperature is stable when data are being acquired. A full series of experiments is generally obtained within 24 h.

3.2 Data Processing

1. For processing of the acquired spectra, one may use the TopSpin processing software available on Bruker spectrometers or NMRPipe [7] available from https://www.ibbr.umd.edu/nmrpipe/install.html.

2. Depending on the pulse sequence to be used, it is assumed that corresponding H(CACO)N or CON spectra have already been assigned (see Note 9). It is typically most convenient to identify all the peaks in the reference spectrum (with 1H decoupling) as this spectrum is expected to display maximal intensity. Assignments can then be transferred to the spectra obtained with different numbers of 180° pulses.

3.3 Determination of Hydrogen Exchange Rates by Numerical Fitting

Rate determination by the Paris-DÉCOR method is implemented in the computational suite MATLAB (Fig. 1) [8]. Proceed as follows:

1. Download all the .m scripts from https://github.com/proteinnmr or through links provided at http://www.protein-nmr.org. It is important that all the .m files are located in the same folder on your computer.

2. Input parameters such as peak heights/intensities, noise levels, and experimental parameters such as the length of the CPMG period, the number of CPMG 180° pulses, the RF field strength of the hard 1H 90° pulse (expressed in Hz), and the magnitude of the 1JNH coupling constant (93 Hz) should be entered in “input_script.m.” Follow the formatting definitions provided in the example file from the GitHub distribution.

3. Start MATLAB. Locate and run “input_script.m.” The output is saved on the workspace as a matrix “k,” containing the exchange rate for each amide hydrogen (see Note 10).

4. The uncertainty in the determined rates is obtained by running “input_script.m” after removal of the comment character % at the start of line 28 of the script. A Monte Carlo procedure is now invoked, which generates synthetic data based on the input uncertainties provided by the user (see Note 11). The total time this analysis takes depends on the number of iterations defined by the parameter “nmc.”

Fig. 1 Variation in intensity ratio A/B with the number of CPMG pulses as obtained experimentally (blue points) and by fitting the density operator (points connected by green line). Error bars produced by synthetic data generated by the Monte Carlo procedure are shown in red. Exchange rates in units of s⁻¹ (in this example, for residues Phe4 and Gln24 of human α-synuclein) are indicated at the top of each panel
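The logic of the Monte Carlo step can be sketched in a few lines of Python. This is not the authors' MATLAB implementation: the function name and arguments are hypothetical, the same noise level is assumed for all spectra, and the `fit` argument is a placeholder for the actual density-operator fit, which is not reproduced here. Only the iteration count `nmc` borrows its name from the script parameter mentioned in step 4.

```python
import random
import statistics


def monte_carlo_uncertainty(intensities_a, intensity_b, noise, fit, nmc=200, seed=0):
    """Estimate the spread of a fitted value over nmc synthetic datasets.

    intensities_a: peak intensities from the spectra without 1H decoupling ("A")
    intensity_b:   peak intensity from the reference spectrum ("B")
    noise:         standard deviation of the spectral noise (assumed equal for all spectra)
    fit:           callable mapping a list of A/B ratios to one fitted value
                   (placeholder for the real fitting routine)
    """
    rng = random.Random(seed)
    fitted = []
    for _ in range(nmc):
        # Add normally distributed noise to every intensity, then form A/B ratios.
        b = rng.gauss(intensity_b, noise)
        ratios = [rng.gauss(a, noise) / b for a in intensities_a]
        fitted.append(fit(ratios))
    return statistics.mean(fitted), statistics.stdev(fitted)


if __name__ == "__main__":
    # Stand-in "fit": the mean ratio. A real analysis would return an exchange rate.
    mean_ratio = lambda ratios: sum(ratios) / len(ratios)
    print(monte_carlo_uncertainty([90.0, 80.0, 70.0, 65.0], 100.0, 2.0, mean_ratio))
```

The standard deviation over the synthetic fits serves as the uncertainty estimate; increasing `nmc` stabilizes that estimate at the cost of run time.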

4 Notes

1. The quality of the data further depends on familiar parameters, such as the protein concentration and the intrinsic sensitivity of the pulse sequence employed. For IDPs, measured with a cryogenically cooled probe head, we typically use 200 μM for NMR experiments that utilize 1H observation and 1 mM for those with 13C observation. 0.01% NaN3 is included to preserve the protein sample for a longer time. The total volume of the sample is typically 600 μL, when using standard 5 mm NMR tubes.

2. The Paris-DÉCOR approach is most sensitive to hydrogen exchange rates in the regime 100–1000 s⁻¹. Where possible, the pH of the sample can be adjusted to increase the accuracy of the method. For example, protein backbone amide rates at room temperature (298 K) will fall in the above-quoted range when the pH of the solution is around 9.

3. A 50 mM solution can be placed in a capillary of the type commonly employed to determine the melting temperature of solids. The capillary should be free of air bubbles and is then sealed over a flame. The capillary is subsequently put in the NMR tube containing the protein solution, which ensures it is concentrically located, such that shimming and referencing procedures are not adversely affected. Through this procedure, the protein solution itself does not contain D2O, which will otherwise complicate the analysis [3].

4. “Prosol” compatibility ensures that the command “getprosol” can be used in order to compute any shaped and low-power pulses (e.g., for decoupling) once the high-power 90° pulses are available.

5. Longer values will cause stronger attenuation due to exchange but adversely affect signal-to-noise.

6. This choice will depend on the concentration of the protein and instrument/probe sensitivity and should be tested once for the reference experiment (which has the highest intrinsic sensitivity).

7. Typical carrier positions are in the middle of the respective 15N and 13C′ chemical shift ranges. As Pro residues are of no interest here, the carriers are best placed around 117 ppm and 176 ppm, respectively. Aliasing may occur when the spectral width of the indirect dimension is chosen narrower than the range spanned by all contributing resonances, which is typically done to achieve high resolution. Note that in this scenario, Pro signals (which appear around 136 ppm) may be aliased in the 15N dimension and may overlap with peaks of interest. Should this occur, then the spectral width can be adjusted slightly until no further overlap is present. To identify all Pro signals, run the CON spectrum once with a 15N spectral width of 80 ppm. Side chain Arg 13Cζ/15Nε correlations can be identified in the same way (these occur at 168 ppm for 13Cζ and 86 ppm for 15Nε) and are also accessible by our approach. For the spectra shown in Fig. 2, acquired at 950 MHz, we used 1024 (total) points for 13C′ and 512 for 15N. The spectral widths were 7878 Hz (13C′) and 3561 Hz (15N). The carrier frequencies were 176.4 ppm (13C′) and 123.3 ppm (15N).

8. The B1 field strength expressed in Hz equals 1/(4 × pw90), where pw90 is the 90° pulse width of the decoupling field.

9. We typically use triple resonance experiments and the assignment software Sparky [9].

10. In order to save time and produce valid output, the script only analyzes peaks for which the signal-to-noise ratio is larger than five. The output generated is a per-residue exchange rate, an uncertainty interval for the exchange rate, and a plot showing the experimental A/B values and best-fit computed A/B values, as shown in Fig. 1.

11. Random noise is added to the intensities of each peak, which follows a normal distribution based on the experimental uncertainty. For every Monte Carlo iteration, the fitting procedure is done with this synthetic data. After many iterations, one has an increasingly good estimate of the uncertainty in the fitting procedure [10]. We typically use 200 iterations.

Fig. 2 Excerpt of 2D CON spectra for 1 mM human α-synuclein in H2O at 298 K and pH 9, recorded with the experiment for hydrogen exchange rate determination by Paris-DÉCOR, using d25 = 20 ms. (a) Reference spectrum. (b) Spectrum with two 180° CPMG pulses
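Two small calculations recur when setting up the experiment: converting a 90° pulse width to an RF field strength (Note 8) and judging how densely the 180° pulses fill the CPMG period for a given l4. The sketch below illustrates both under stated assumptions. The function names are hypothetical, and the spacing calculation assumes that the l4 pulses are spread evenly over d25 and ignores the finite pulse width, which may differ from how the pulse program actually defines its delays.

```python
def b1_field_hz(pw90_us):
    """RF field strength in Hz from a 90-degree pulse width in microseconds,
    using B1 = 1 / (4 * pw90)."""
    return 1.0 / (4.0 * pw90_us * 1e-6)


def max_pw90_us(min_b1_hz):
    """Longest 90-degree pulse width (microseconds) that still reaches min_b1_hz."""
    return 1e6 / (4.0 * min_b1_hz)


def nominal_cpmg_spacing_ms(d25_ms, l4):
    """Average interval between successive 180-degree pulses, assuming l4 pulses
    spread evenly over d25 and zero pulse width (simplifying assumptions)."""
    if l4 <= 0 or l4 % 2:
        raise ValueError("l4 should be a positive even number")
    return d25_ms / l4


if __name__ == "__main__":
    print(max_pw90_us(4000.0))  # pulse width matching a 4 kHz decoupling field
    for l4 in (2, 4, 8, 16):
        print(l4, nominal_cpmg_spacing_ms(20.0, l4))
```

For the 4 kHz threshold quoted in step 8 of Subheading 3.1, this corresponds to a 90° pulse of 62.5 μs or shorter on the decoupling channel.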

Acknowledgments

This work was supported by a postdoctoral fellowship to R.D. from the VILLUM Foundation.

References

1. Dass R, Corlianò E, Mulder FAA (2019) Measurement of very fast exchange rates of individual amide protons in proteins by NMR spectroscopy. ChemPhysChem 20(2):231–235
2. Skrynnikov NR, Ernst RR (1999) Detection of intermolecular chemical exchange through decorrelation of two-spin order. J Magn Reson 137:276–280
3. Kateb F, Pelupessy P, Bodenhausen G (2007) Measuring fast hydrogen exchange rates by NMR spectroscopy. J Magn Reson 184(1):108–113
4. Carr HY, Purcell EM (1954) Effects of diffusion on free precession in nuclear magnetic resonance experiments. Phys Rev 94(3):630–638
5. Meiboom S, Gill D (1958) Modified spin-echo method for measuring nuclear relaxation times. Rev Sci Instrum 29(8):688–691
6. Teng Q (2013) Structural biology: practical NMR applications, 2nd edn. Springer, Dordrecht
7. Delaglio F, Grzesiek S, Vuister G et al (1995) NMRPipe: a multidimensional spectral processing system based on UNIX pipes. J Biomol NMR 6(3):277–293
8. The MathWorks, Inc. MATLAB and Statistics Toolbox Release 2012b. Natick, Massachusetts, United States
9. Goddard TD, Kneller DG SPARKY 3. University of California, San Francisco
10. Lerche I, Mudford BS (2005) How many Monte Carlo simulations does one need to do? Energy Explor Exploit 23(6):405–427

Part V Ensembles by Computation

Chapter 18

Predicting Conformational Properties of Intrinsically Disordered Proteins from Sequence

Kiersten M. Ruff

Abstract

Intrinsically disordered proteins (IDPs) can adopt a range of conformations from globules to swollen coils. This large range of conformational preferences for different IDPs raises the question of how conformational preferences are encoded by sequence. Global compositional features of a sequence such as the fraction of charged residues and the net charge per residue engender certain conformational biases. However, more specific sequence features such as the patterning of oppositely charged residues, expansion driving residues, or residues that can undergo posttranslational modifications can also influence the conformational ensembles of an IDP. Here, we outline how to calculate important global compositional features and patterning metrics that can be used to classify IDPs into different conformational classes and predict relative changes in conformation for sequences with the same amino acid composition. Although increased effort has been devoted to determining conformational properties of IDPs in recent years, quantitative predictions of conformation directly from sequence remain difficult and often inaccurate. Thus, if quantitative predictions of conformational properties are desired, then sequence-specific simulations must be performed.

Key words Conformation, Prediction, Amino acid composition, Sequence patterning, Compaction, IDPs

1 Introduction

Intrinsically disordered proteins (IDPs) and intrinsically disordered regions (IDRs) have been implicated in a diverse set of cellular functions [1–5]. Here, we will use the term IDRs to account for both disordered regions within a larger protein and proteins that are completely disordered. IDRs are defined as sequences that fail to adopt a stable fold autonomously at physiological conditions. However, this does not imply all IDRs are nondescript random coils [6, 7]. Instead, as with the diverse set of functions they perform, IDRs also adopt a diverse set of conformational properties and biases as compared to folded and chemically denatured proteins (Fig. 1) [8–12]. The observed conformational diversity among IDRs raises two important questions: (1) How are conformational

Birthe B. Kragelund and Karen Skriver (eds.), Intrinsically Disordered Proteins: Methods and Protocols, Methods in Molecular Biology, vol. 2141, https://doi.org/10.1007/978-1-0716-0524-0_18, © Springer Science+Business Media, LLC, part of Springer Nature 2020


[Figure 1 plot: Rh (y-axis, 0–100) versus Sequence Length, N (x-axis, 0–700). Legend: Folded, Marsh et al.; Chemically Denatured, Marsh et al.; Region 1 IDRs, Table 2; Region 2 IDRs, Table 2; Region 3 IDRs, Table 2; Region 4 IDRs, Table 2; Fraction Proline ≥ 0.15 IDRs, Table 2. Curve labels: Rh = 4.92N^0.285 and Rh = 2.33N^0.549.]

Fig. 1 Experimentally determined hydrodynamic radius (Rh) values plotted as a function of sequence length (N) for IDRs listed in Table 2 (circles), folded proteins (yellow squares), and chemically denatured proteins (yellow diamonds) taken from the inventory collated by Marsh and Forman-Kay [12]. Rh is a measure of protein size. IDRs are colored by the region they fall into in the diagram of states: Region 1, blue; Region 2, lime green; Region 3, dark green; Region 4, red; fraction of proline residues greater than or equal to 15%, pink. The solid and dashed lines refer to the fits of Rh = R0N^ν from Marsh and Forman-Kay [12] for folded and chemically denatured proteins, respectively. Whereas folded and chemically denatured proteins follow a power law relationship closely, IDRs show a divergence from a simple relationship. For a given sequence length, IDRs span a whole range of Rh values and can be as compact as folded proteins, as well as more expanded than chemically denatured proteins
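The two reference curves in Fig. 1 are instances of the power law Rh = R0N^ν and can be evaluated directly to bracket the size expected for an IDR of a given length. The following Python sketch is illustrative only: the function name is hypothetical, the prefactors and exponents are the values annotated on the plot (taken here to be the Marsh and Forman-Kay fits for folded and chemically denatured proteins), and the Å unit is an assumption, since the caption does not state one.

```python
def rh_power_law(n_residues, r0, nu):
    """Hydrodynamic radius from the scaling relation Rh = R0 * N**nu."""
    return r0 * n_residues ** nu


# (R0, nu) pairs as annotated on the Fig. 1 curves; units assumed to be angstroms.
FOLDED = (4.92, 0.285)
DENATURED = (2.33, 0.549)

if __name__ == "__main__":
    for n in (50, 100, 200, 400):
        compact = rh_power_law(n, *FOLDED)
        expanded = rh_power_law(n, *DENATURED)
        print(n, round(compact, 1), round(expanded, 1))
```

An experimentally measured Rh for an IDR of length N can then be compared with these two reference values to judge whether the chain is closer to the compact or the expanded limit, keeping in mind that IDRs can fall outside this range, as the caption notes.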

properties encoded in sequence? (2) Can we predict conformational properties directly from sequence without the aid of simulation or experiment? Two sequence features are particularly important to consider for determining conformational biases from sequence. The first is global amino acid composition. This feature refers to parameters that can be calculated without taking the order of the residues into account, for example, the fraction of positively charged residues (f+), the fraction of negatively charged residues (f−), and the fraction of charged residues (FCR) within a sequence. The second feature is sequence patterning. In this case, the actual ordering of residues is important for encoding conformational biases. One example is the patterning of oppositely charged residues. In terms of global amino acid composition, IDRs are generally distinct from folded proteins. Specifically, IDRs are deficient in hydrophobic residues but are enriched in polar and charged

Predicting Conformational Properties from Sequence


residues [13]. However, IDRs can be further divided into compositional subclasses, and these subclasses have inherent conformational preferences [10, 14, 15]. The balance between chain–chain and chain–solvent interactions dictates an IDR’s conformational preferences [14, 16]. IDRs that prefer chain–chain interactions will adopt compact/globular conformations on average. In contrast, conformations will be expanded/coil-like on average for IDRs where chain–solvent interactions dominate. When chain–chain and chain–solvent interactions are perfectly counterbalanced, the conformational ensemble of an IDR is maximally heterogeneous and akin to a Flory Random Coil (FRC) [17, 18]. Synthesis of results from experiments and simulations for archetypal IDRs led to the creation of the diagram of states (Fig. 2) [10, 14, 15]. This synthesis classifies IDRs into four different compositional subclasses based on the fraction of positively charged residues (f+) and the fraction of negatively charged residues (f−) within each sequence. Here, f+ and f− can be combined to yield two main measures: (1) the fraction of charged residues, FCR = f+ + f−, and (2) the net charge per residue, NCPR = f+ − f−. Assuming physiologically relevant conditions (pH ~ 7, temperature ~ 37 °C, etc.), fixed charge state, low hydrophobicity, and low proline content (

[Figure 2 plot: fraction of negatively charged residues (f−) versus fraction of positively charged residues (f+), each from 0 to 1, with regions R1 to R4 marked and a color bar for sequence length (20 to 160). Region labels: R1: FCR < 0.25 & |NCPR| < 0.25; Globules and tadpoles. R3: FCR > 0.35 & |NCPR| ≤ 0.35; Charge patterning dictates compaction from coil to hairpin. R4: FCR > 0.35 & |NCPR| > 0.35; Coils and semi-flexible rods.]

Fig. 2 Diagram of states for classification of IDRs into conformational classes (R1–R4) based on f+ and f−. All unique DisProt sequences of length greater than 15 are plotted on the diagram of states and colored by their sequence length (color bar)

that these sequences cannot be easily classified by conformation due to composition alone. Instead, additional sequence features, such as sequence patterning, will dictate the conformational preferences of these sequences.

1.3 R3: FCR > 0.35 and |NCPR| ≤ 0.35; Charge Patterning Dictates Compaction from Coil to Hairpin

Region 3 (R3) consists of IDRs that are strong polyampholytes. An example includes the IDR within the human PQBP-1 protein which is involved in splicing and transcriptional regulation [15, 24, 25]. Given the high fraction of charged residues in these sequences, the patterning of oppositely charged residues is important for the conformational preferences of these sequences. The details of charge patterning will be discussed later (Subheading 3.3). Generally, well-mixed sequences adopt more expanded conformations, whereas segregated sequences form hairpin-like stretches due to blocks of oppositely charged residues interacting with each other [15]. However, this expansion versus compaction is relative and dependent on composition.

1.4 R4: FCR > 0.35 and |NCPR| > 0.35; Coils and Semiflexible Rods

Region 4 (R4) consists of IDRs that are strong polyelectrolytes, that is, sequences that have a bias for one type of charge over another. Examples include protamines, which are arginine-rich proteins that are important for DNA stabilization during spermatogenesis [10, 26], and a linker histone chaperone, prothymosin-α [11, 27]. This combination of bias for one type of charge and high FCR implies that these sequences often adopt expanded conformations as a result of repulsive interactions between like charges and the preference for charged residues to be solvated [28]. Thus, as sequences in this region become stronger polyelectrolytes, chain–solvent interactions increase, and they transition from coils to semiflexible rods [10].

Below, we describe a protocol to extract global amino acid composition features, such as FCR and |NCPR|, and thus the region of the diagram of states, as well as how to qualitatively predict the conformational properties of an IDR(s) (Subheading 3.1). However, as described above, conformational preferences of IDRs that fall in R2 and R3 cannot be determined by amino acid composition alone. Instead, sequence patterning will mediate conformational preferences. Current sequence patterning metrics that have been utilized for IDRs, however, can only make predictions on relative changes in conformation for sequences with the same amino acid composition. Therefore, there does not exist a step-by-step protocol for predicting conformational properties for an individual IDR that falls into R2 or R3 on the diagram of states. Given this, the remaining Methods sections of this chapter will not be written as protocols but instead highlight what conformational properties can and cannot be predicted given a certain set of sequence features, as well as how these sequence features are calculated.
These subsections will be divided into the following topics: classifying experimentally determined IDRs within the diagram of states (Subheading 3.2), extracting conformational information from charge patterning (Subheading 3.3), extracting conformational information from other sequence patterns (Subheading 3.4), applying patterning metrics to understand the effects of phosphorylation on conformation (Subheading 3.5), extracting quantitative conformational information from sequence (Subheading 3.6), and the future of predicting conformation from sequence (Subheading 3.7).

2 Materials

2.1 IDR Sequence(s)

The sequence analyses described in this chapter should only be performed on an IDR, and the remaining sections assume the reader has already identified the IDR(s) from the protein(s) of interest. An IDR can make up a portion of the protein or be the entire protein. An IDR(s) within a protein of interest can be


identified using experimental data, the protein disorder database DisProt (http://www.disprot.org/), or various disorder prediction algorithms. How to predict disorder from a protein of interest has been described in detail in a previous protocol [29], and the accuracy of different predictors has recently been assessed [30]. However, below is a non-comprehensive list of online tools that make predictive identifications of disordered regions or identify experimentally determined IDRs given a sequence or UniProt protein identifier [31].

1. IUPred2A (https://iupred2a.elte.hu/plot) utilizes IUPred2 [32, 33] and ANCHOR2 [34, 35] to make predictive identifications of disordered regions and disordered binding regions, respectively, for one or a set of sequences. A stand-alone version of the program package can also be downloaded for high-throughput predictions.

2. PONDR (http://www.pondr.com/) makes predictive identifications of disordered regions for single sequences with the choice of one or multiple algorithms trained using different types of experimental data and disordered regions [36–43].

3. MobiDB (http://mobidb.bio.unipd.it/) predicts disorder for any sequence with a UniProt identifier [44]. More specifically, MobiDB-lite combines eight different prediction algorithms to derive a consensus prediction for long IDRs. The program can also be downloaded for large-scale analyses.

4. Experimentally determined IDRs can be extracted from the DisProt database (http://www.disprot.org/) [45]. The Browse feature allows users to search for a protein of interest using the protein’s name, UniProt identifier, or sequence. Additionally, the entire set of IDR sequences can be extracted for analysis (Subheading 3.2).

If the user plans on examining more than one IDR sequence, then the IDR sequences should be saved in FASTA format. For a single IDR sequence, FASTA or plain text format will generally work best.

2.2 Sequence Analysis Tools

CIDER (http://pappulab.wustl.edu/CIDER/analysis/) is an online tool that can be used to extract both global amino acid composition and sequence patterning features for up to ten IDR sequences (Subheading 3.1) [46]. For analysis of a large number of IDRs, a stand-alone version of the program package named localCIDER can be downloaded (http://pappulab.github.io/localCIDER/).


[Figure 3 flowchart:
Get disordered sequence from either disorder predictor or experiment.
Run sequence through CIDER / localCIDER to determine region in diagram of states.
Region 1: Is the fraction of proline residues < 15%, length < 100, & FCR < 0.2? Yes: likely compact. No: cannot predict conformation from composition alone; patterning will matter but still cannot predict whether an individual sequence is compact / expanded.
Region 2 and Region 3: cannot predict conformation from composition alone; patterning will matter but still cannot predict whether an individual sequence is compact / expanded.
Region 4: likely expanded.]

Fig. 3 Flowchart for determining conformational preferences from an IDR’s global amino acid composition

3 Methods

3.1 Predicting Qualitative Conformational Properties from Global Amino Acid Composition

This subsection describes a protocol to predict qualitative conformational properties of an IDR(s) from sequence using only global amino acid composition (Fig. 3). Here, FCR and NCPR are calculated using the Web server computational tool named CIDER [46].

1. Go to http://pappulab.wustl.edu/CIDER/analysis/.

2. Paste a single IDR sequence in either free form (white space and numbers are ignored) or FASTA format, or up to ten IDR sequences in FASTA format, in the input box.

3. Select Upload sequence(s).

4. This loads the output page, which lists the sequence properties for each input sequence in a table, including length, FCR, NCPR, Kyte-Doolittle hydropathy score [47], percent of disorder-promoting residues, the region the sequence falls in on the diagram of states, and a warning if the IDR is highly proline rich (see Note 1). A charge patterning parameter, κ, is also listed. The details of this parameter will be discussed below (Subheading 3.3). The output page also plots each sequence on the diagram of states, and this figure can be saved as a PNG file by the user.

5. The diagram of states region can be used to predict qualitative conformational properties from global amino acid composition (Fig. 3). If the IDR resides in R1 and its fraction of proline residues is less than 15%, its length is less than 100 residues, and its FCR is less than 0.2, then the sequence is likely globally compact compared to an FRC. If the IDR resides in R1 and its fraction of proline residues is greater than or equal to 15% or its length is greater than or equal to 100 residues or its FCR is greater than or equal to 0.2, then a qualitative prediction cannot be made for this sequence. This is because high proline fraction and long sequence lengths can lead to more extended conformations than predicted from FCR and |NCPR| alone (see Notes 1 and 2). Additionally, the FCR boundary between R1 and R2 is ad hoc, and thus as sequences approach an FCR of 0.25, the accuracy of predictions is reduced. As explained in Subheading 1, if the IDR resides in R2 or R3, then its conformational properties cannot be predicted by global amino acid composition alone. Instead, sequence patterning will dictate its conformational preferences. However, there are several problems with using sequence patterning to predict conformational properties. First, there are many types of sequence patterns that can be considered for IDRs (Subheadings 3.3 and 3.4), and the interplay between these patterns has not been well studied. Additionally, sequence patterning can only yield predictions on relative global compaction/expansion when comparing sequences with the same global amino acid composition (i.e., same FCR and |NCPR|). Thus, there is currently no way to predict global conformational properties compared to an FRC of a single IDR that resides in R2 or R3. However, Subheadings 3.3 and 3.4 will describe how to calculate patterning metrics for IDRs and what one can and cannot predict in terms of conformation from these features. If the IDR resides in R4, then the sequence is likely to be globally expanded as compared to an FRC.

6. If users wish to analyze the sequence properties of a large number of IDRs, then the software package localCIDER can be downloaded from http://pappulab.github.io/localCIDER/ [46]. localCIDER is a Python package that can extract the same sequence features as CIDER. localCIDER takes in a sequence as a string or a file in which a single IDR sequence is listed in FASTA format. Thus, if multiple IDR sequences are to be examined, then each sequence will need its own FASTA file. Furthermore, localCIDER gives the user access to calculations of additional sequence features that are not part of the CIDER Web server, including more local sequence properties, as well as user-defined parameters such as patterning of a user-defined set of residues (Subheading 3.4.2). The set of functions of localCIDER are divided into six main categories: single-value sequence analysis functions (including FCR, NCPR), position-specific sequence analysis functions, phosphorylation functions, sequence permutation functions, plotting functions, and additional miscellaneous functions. More detailed descriptions of the features of localCIDER are beyond the scope of this chapter but can be found at http://pappulab.github.io/localCIDER/. However, specific sequence features that can be extracted from localCIDER and provide conformational predictions have been discussed (FCR, NCPR) and will be further discussed below (κ, Ω).

7. After localCIDER is downloaded and installed, the following code can be utilized to run a localCIDER function (i.e., get_phase_plot_region()) using Python:

    import localcider
    from localcider.sequenceParameters import SequenceParameters

    # Load in sequence in FASTA format from the file named filename
    mySeq = SequenceParameters(sequenceFile="filename")

    # Get the region of the diagram of states in which the sequence resides
    mySeqRegion = mySeq.get_phase_plot_region()
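The composition-based decision procedure of step 5 and Fig. 3 can also be sketched in plain Python, which may help when checking CIDER output by hand. This is a minimal illustration and not the CIDER implementation. It assumes that only Lys/Arg and Asp/Glu count as charged residues, and it assumes that R2 is the band 0.25 ≤ FCR ≤ 0.35, which is inferred here by elimination from the R1, R3, and R4 boundaries. The function names are hypothetical.

```python
POSITIVE = set("KR")  # assumption: Lys and Arg carry +1
NEGATIVE = set("DE")  # assumption: Asp and Glu carry -1


def composition(seq):
    """Return (f_plus, f_minus, FCR, NCPR, f_pro) for a one-letter sequence."""
    seq = seq.upper()
    n = len(seq)
    f_plus = sum(aa in POSITIVE for aa in seq) / n
    f_minus = sum(aa in NEGATIVE for aa in seq) / n
    f_pro = seq.count("P") / n
    return f_plus, f_minus, f_plus + f_minus, f_plus - f_minus, f_pro


def region(fcr, ncpr):
    """Region of the diagram of states (1 to 4) from FCR and NCPR."""
    if fcr < 0.25:   # |NCPR| <= FCR, so |NCPR| < 0.25 holds as well
        return 1
    if fcr <= 0.35:  # assumed R2 band, inferred by elimination
        return 2
    return 3 if abs(ncpr) <= 0.35 else 4


def qualitative_prediction(seq):
    """Follow the flowchart of Fig. 3 for a single IDR sequence."""
    _, _, fcr, ncpr, f_pro = composition(seq)
    reg = region(fcr, ncpr)
    if reg == 1:
        if f_pro < 0.15 and len(seq) < 100 and fcr < 0.2:
            return reg, "likely compact"
        return reg, "no qualitative prediction"
    if reg == 4:
        return reg, "likely expanded"
    return reg, "not predictable from composition alone"


print(qualitative_prediction("EKEKEKEKEK"))
# (3, 'not predictable from composition alone')
```

For real analyses, the region returned by CIDER or by get_phase_plot_region() in localCIDER should take precedence over this sketch.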

3.2 Classifying Experimentally Determined IDRs within the Diagram of States

DisProt is a database of experimentally determined IDRs (http://www.disprot.org/) [45]. Users who want to use DisProt for statistical analyses of IDRs must be aware of several caveats. First, for a single DisProt entry, the same IDR (identified with the term Disorder by the Structural state branch of the Disorder Ontology) will be listed multiple times if this IDR has been characterized using different experimental techniques (Ex: DP00005) or in different papers (Ex: DP00007). Second, within a DisProt entry, identified IDRs may be a subset of (Ex: DP00008) or overlap in linear sequence space with (Ex: DP00018) other IDRs listed. Third, some regions listed in the Structural state branch are actually ordered regions (Ex: DP00549). Thus, if the user downloads the DisProt data using either the FASTA or TSV format with the Name space Structural state and the Fields Regions, then the file will contain sequences of all the regions listed in the branch Structural state, which includes all of the caveats listed above. To circumvent these caveats, DisProt has added a consensus field which


merges subset and overlapping regions and classifies each combined region as a function region (F), order state (S), disorder state (D), transition (T), or transition with interaction (I). Thus, for statistical analyses of IDRs, users should extract the sequences that correspond to the consensus disorder state regions (IDRs) for each DisProt entry. If users want to analyze all consensus IDRs in DisProt, then the following steps can be taken:

1. Go to http://www.disprot.org/.

2. Select Download at the top of the page.

3. Download the database in JSON format.

4. Use Python to extract the sequences of all consensus IDRs from the downloaded JSON file.
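The extraction in step 4 might look like the sketch below. The JSON key names used here ("data", "sequence", "disprot_consensus", "full", "start", "end", "type") and the 1-based inclusive coordinates are assumptions about the layout of the downloaded file and should be checked against the actual download, since the layout can differ between DisProt releases. The function name and the file name in the usage comment are hypothetical. The function also applies the length filter described in this subheading and drops identical sequences.

```python
import json


def consensus_idrs(disprot, min_length=16):
    """Collect unique consensus-disorder sequences from a parsed DisProt dump.

    ASSUMED layout (verify against the downloaded file):
      disprot["data"]                     -> list of entries
      entry["sequence"]                   -> full protein sequence
      entry["disprot_consensus"]["full"]  -> list of {"start", "end", "type"}
    with 1-based inclusive coordinates and type "D" for disorder state.
    """
    unique = set()
    for entry in disprot.get("data", []):
        sequence = entry.get("sequence", "")
        consensus = entry.get("disprot_consensus", {}).get("full", [])
        for reg in consensus:
            if reg.get("type") != "D":
                continue  # keep only consensus disorder state regions
            idr = sequence[reg["start"] - 1:reg["end"]]
            if len(idr) >= min_length:  # i.e., longer than 15 residues
                unique.add(idr)         # the set drops identical sequences
    return sorted(unique)


# Usage, with a hypothetical file name:
# with open("disprot.json") as handle:
#     idrs = consensus_idrs(json.load(handle))
```

Each returned sequence can then be written to its own FASTA file or passed as a string to localCIDER.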

5. Optional: remove identical sequences that come from orthologs.

As of March 19, 2019, DisProt contains 1236 unique IDRs. We restricted our analysis to IDRs longer than 15 amino acid residues, given that short sequences are inappropriate to classify in the diagram of states. This is because the conformations of short sequences are dominated by entropic effects and thus should all behave relatively similarly regardless of composition [48]. With this length filter, the database was reduced to 879 sequences, with 40% of sequences falling in R1, 35% of sequences in R2, 22% of sequences in R3, and 3% of sequences in R4 (Fig. 2). This classification implies that the conformations of at least 57% of known IDRs (regions R2 and R3) cannot be predicted by composition alone. This percentage is likely even higher given that the R1 and R2 boundary is ad hoc and the R2 boundary may be more appropriately moved to a lower FCR value (0.2), especially as sequences become longer (see Note 2). Overall, for a large portion of IDRs, examining global compositional properties alone does not yield specific predictions about conformation, and thus other sequence features must be examined.

Conformational preferences are also dictated by sequence patterning, particularly for IDRs that fall in R2 and R3 of the diagram of states. As sequences increase in FCR, the patterning of oppositely charged residues has an increased impact on conformational preferences. However, other features such as the patterning of expansion driving residues, aromatic residues, or posttranslational modifications may also control conformational preferences, particularly for sequences in R2 where the FCR is lower. In the next two sections, we outline the patterning metrics utilized for IDRs.

3.3 Extracting Conformational Information from Charge Patterning

Patterning of oppositely charged residues has thus far been the most well-studied patterning feature of IDRs. Charge patterning has been shown to modulate the conformations of sequences from R2 and R3 both computationally and experimentally [15, 22, 49–52]. There are two main metrics that have been introduced to quantify charge patterning in an IDR sequence: (1) κ introduced by Das and Pappu and (2) sequence charge decoration (SCD) introduced by Sawle and Ghosh [15, 50].

3.3.1 Calculating κ

κ is defined as:

$$\delta_{\mathrm{seq}} = \frac{\sum_{i=1}^{n_w} (\sigma_i - \sigma)^2}{n_w}, \qquad \kappa = \frac{\delta_{\mathrm{seq}}}{\delta_{\max}}$$

here, σ is the overall charge asymmetry given by:

$$\sigma = \frac{(f_+ - f_-)^2}{f_+ + f_-}$$

For each of the n_w sliding windows, the per-window charge asymmetry, σ_i, is calculated. Then, δ_seq is the mean squared deviation of the per-window charge asymmetry from the overall/global charge asymmetry. To bound κ between zero and one, δ_seq is normalized by the value obtained for the most charge-segregated sequence that can be realized given the amino acid composition (δ_max) (see Note 3). κ is calculated for two window lengths (five and six), and then the final κ is the average of the two values. Thus, as κ increases from zero to one, oppositely charged residues become more segregated for a given composition.

κ can be calculated for a single sequence or a set of sequences using the Web server CIDER (http://pappulab.wustl.edu/CIDER/analysis/) or the get_kappa() function in localCIDER (http://pappulab.github.io/localCIDER/) [46].

3.3.2 Calculating SCD

SCD is defined as:

$$\mathrm{SCD} = \frac{1}{N} \sum_{m=2}^{N} \sum_{n=1}^{m-1} q_m q_n (m - n)^{1/2}$$

here, N is the number of residues, and q_m and q_n are the charges of residues m and n, respectively. SCD can be calculated using the get_SCD() function in localCIDER (http://pappulab.github.io/localCIDER/) [46].

3.3.3 Comparing κ and SCD

For a given amino acid composition, κ and SCD are generally highly correlated. Figure 4a shows the correlation between κ and SCD for three representative IDRs and their κ permutants that have been studied in the literature [15, 22, 49]. As κ goes from zero to one (i.e., oppositely charged residues become more demixed/segregated), SCD becomes more negative. The R² values for the individual IDR correlations are shown in Fig. 4b. If the R² is calculated over all sequences shown in Fig. 4a, then the correlation between κ and SCD decreases (R² = 0.63). This is because it is inappropriate to compare κ across different amino acid compositions. By definition, κ is calculated by normalizing to the most segregated sequence with the given composition. This definition implies that a κ of 0.2 will not have the same meaning for two sequences if the FCR and |NCPR| of the two sequences are different.

3.3.4 How Do κ and SCD Correlate with Conformational Properties?

It has been shown both computationally and experimentally that, for a given amino acid composition, the demixing of oppositely charged residues leads to a conformational compaction as κ goes to one and SCD becomes more negative (Fig. 4c and d) [15, 22, 49–52]. Simulations have shown that the compaction is due to a preference for hairpin-like architectures in which blocks of oppositely charged residues hinge the chain in order to interact [15]. Quantitatively, this compaction can be assessed through the sequence length-normalized radius of gyration, Rg/√N, where N is the sequence length and Rg is given by:

$$R_g = \sqrt{\frac{1}{2n^2} \sum_{i,j} (\mathbf{r}_i - \mathbf{r}_j)^2}$$

here, r_i is the position vector of atom i and n is the number of atoms. By dividing by √N, sequences of different lengths can be directly compared. Figure 4 depicts how Rg/√N varies as a function of κ (Fig. 4c) and SCD (Fig. 4d) for simulations of three representative IDR amino acid compositions [15, 22, 49]. Each IDR generally shows a decrease in Rg/√N as κ increases or SCD decreases. However, for different amino acid compositions, the same κ or SCD value can correspond to vastly different degrees of compaction given by Rg/√N (Fig. 4c and d). This implies that both κ and SCD can yield information on the relative compactness for a given amino acid composition but cannot, alone, quantitatively predict the degree of compaction across different amino acid compositions. In other words, two sequences with the same κ or SCD but different amino acid compositions cannot be assumed to have the same average conformational properties, such as Rg/√N.

[Figure 4 plots: (a) SCD (0 to −30) versus κ (0 to 1) for (EK)25, p27, and NotchRAM; (c) Rg/√N (2 to 4.5) versus κ (0 to 1); (d) Rg/√N (2 to 4.5) versus SCD (−30 to 0).]

Panel b:

                                 (EK)25    p27     NotchRAM
    Length                       50        107     132
    FCR                          1.00      0.27    0.35
    NCPR                         0.00      0.01    −0.06
    Region on Diagram of States  3         2       2
    R²                           0.91      0.87    0.68

Fig. 4 Evaluation of κ and SCD for three IDR sequences previously studied: synthetic (EK)25 [15], the C-terminal IDR of p27 [22], and the RAM region of the Notch receptor [49]. (a and b) Correlation between κ and SCD for each of the three sequences. (b) Sequence properties of the three sequences. (c and d) Relationship of Rg/√N from simulations with κ and SCD, respectively. The black dashed line corresponds to the Rg/√N associated with the reference FRC. This implies that points above this dashed line will be globally more expanded than an FRC

3.3.5 Weaknesses of κ

As mentioned above, one weakness of κ is that it is inappropriate to compare κ values across different amino acid compositions. Additionally, because κ is calculated by using sliding windows, long-range effects are ignored. For example, Table 1 shows three sequences with κ = 1, yet the spacing between the oppositely charged blocks varies. The distance between oppositely charged blocks is likely to affect the degree of compaction these sequences prefer. However, κ cannot distinguish between these three sequences since κ only considers the charge asymmetry within windows of five and six residues.
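The point about window-based metrics can be illustrated with a small pure-Python sketch. It computes the un-normalized δ_seq for window lengths five and six (δ_max, and therefore κ itself, is not computed here) together with SCD, following the definitions in Subheadings 3.3.1 and 3.3.2. This is not the localCIDER implementation. The charge assignments (Lys/Arg = +1, Asp/Glu = −1), the convention that a window without charged residues has zero asymmetry, and all function names are assumptions made for illustration. The three test sequences are rebuilt from the block layout of Table 1.

```python
from math import sqrt

CHARGE = {"K": 1, "R": 1, "D": -1, "E": -1}  # assumed charge assignments


def asymmetry(segment):
    """sigma = (f+ - f-)^2 / (f+ + f-); taken as 0 if no charged residues."""
    pos = sum(CHARGE.get(aa, 0) == 1 for aa in segment)
    neg = sum(CHARGE.get(aa, 0) == -1 for aa in segment)
    if pos + neg == 0:
        return 0.0
    return (pos - neg) ** 2 / ((pos + neg) * len(segment))


def delta(seq, window):
    """Mean squared deviation of per-window asymmetry from the global value."""
    overall = asymmetry(seq)
    windows = [seq[i:i + window] for i in range(len(seq) - window + 1)]
    return sum((asymmetry(w) - overall) ** 2 for w in windows) / len(windows)


def scd(seq):
    """Sequence charge decoration as defined in Subheading 3.3.2."""
    q = [CHARGE.get(aa, 0) for aa in seq]
    total = 0.0
    for m in range(1, len(q)):
        for n in range(m):
            total += q[m] * q[n] * sqrt(m - n)
    return total / len(q)


def blocks(flank, gap):
    """Ala flank, five Glu, Ala spacer, five Lys, Ala flank (40 residues)."""
    return "A" * flank + "E" * 5 + "A" * gap + "K" * 5 + "A" * flank


table1 = [blocks(6, 18), blocks(8, 14), blocks(10, 10)]
for seq in table1:
    print(seq, round(delta(seq, 5), 4), round(delta(seq, 6), 4),
          round(scd(seq), 3))
```

Because no window of five or six residues spans both charged blocks, the window-based deviations are the same for all three sequences, whereas SCD changes with the spacing between the blocks.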

3.3.6 Weaknesses of SCD

Unlike κ, SCD explicitly accounts for long-range interactions. However, SCD has its own set of biases. Figure 5a shows each unique DisProt sequence of length greater than 15 on the diagram of states. Here, each sequence is colored by its SCD value. Sequences with |NCPR| ~ 0 generally show low SCD values (blue colors), whereas sequences with |NCPR| > 0.35 generally show large SCD values (reddish colors). Intermediate |NCPR| values show varied SCD values. Plotting SCD versus |NCPR| shows that the relationship between SCD and |NCPR| at these intermediate |NCPR| values is length dependent (Fig. 5b). Long sequences can have positive SCD values at |NCPR| < 0.1, whereas shorter sequences generally need |NCPR| > 0.15 to reach these same SCD values. The dependence of SCD on |NCPR| and length is expected given how SCD is defined (Subheading 3.3.2). However, recognizing the intrinsic biases of SCD is important if this metric is to be used to quantitatively predict conformational properties (Subheading 3.6.2). Additionally, given the dependency of SCD on both composition and length, there is not a one-to-one correlation between SCD and conformational properties. Thus, like κ, it appears to be inappropriate to compare SCD across different amino acid compositions.

Table 1 Three sequences with the same amino acid composition and κ = 1

    Sequence                                    κ
    AAAAAAEEEEEAAAAAAAAAAAAAAAAAAKKKKKAAAAAA    1.00
    AAAAAAAAEEEEEAAAAAAAAAAAAAAKKKKKAAAAAAAA    1.00
    AAAAAAAAAAEEEEEAAAAAAAAAAKKKKKAAAAAAAAAA    1.00

[Figure 5 plots: (a) f− versus f+ (each 0 to 1) with a color bar for SCD (−2 to 6); (b) |NCPR| (0 to 0.5) versus SCD (−2 to 6) with a color bar for sequence length.]

Fig. 5 Evaluation of biases associated with SCD. (a) DisProt sequences of length greater than 15 residues plotted on the diagram of states and colored by their SCD (color bar). (b) SCD versus |NCPR| for each DisProt sequence of length greater than 15 and colored by its length (color bar)

3.4 Extracting Conformational Information from Other Sequence Patterns

3.4.1 Patterning of Expansion Driving Residues

Other sequence patterns besides the pattern of oppositely charged residues can also influence conformational preferences. Martin et al. observed that the unphosphorylated sequence of the Ash1 IDR (FCR = 0.2, NCPR = +0.18, fraction of proline residues = 0.14) has a uniform distribution of proline and charged residues across the sequence [53]. This led them to assess how the mixing of proline and charged residues (Pro, Lys, Arg, Glu, Asp) versus all other residues correlated with the compaction of Ash1 based on scrambled sequences. They termed this mixing parameter Ω, where Ω is analogous to κ except for the patterning asymmetry parameter, σ, now defined as:

$$\sigma = \frac{(f_{+/-/P} - f_{\mathrm{others}})^2}{f_{+/-/P} + f_{\mathrm{others}}}$$

here, f_{+/−/P} is the fraction of charged and proline residues, and f_others is the fraction of all other amino acids (i.e., f_others = 1 − f_{+/−/P}). Ω can be calculated using the get_Omega() function in localCIDER (http://pappulab.github.io/localCIDER/). The unphosphorylated Ash1 IDR sequence yields Ω = 0.1. Martin et al. then shuffled the positions of the prolines and charged residues and found that Rg decreased as Ω increased from zero to one (Pearson’s correlation coefficient = 0.81). These results suggest that the patterning of expansion driving residues (prolines and charged residues) can lead to sequence-encoded conformational preferences. Specifically, uniform mixing of expansion driving residues can lead to a preference for expanded conformations.


3.4.2 General Patterning

A general patterning parameter, Ω_X, can also be deployed to quantify the distribution of a single residue or a set of residues, named set X. In this case, the patterning asymmetry parameter, σ, is now defined as:

$$\sigma = \frac{(f_X - f_{\mathrm{others}})^2}{f_X + f_{\mathrm{others}}}$$

here, f_X is the fraction of residues in set X, and f_others is the fraction of all other amino acids (i.e., f_others = 1 − f_X). Ω_X can be calculated using the get_kappa_X(grp1) function in localCIDER (http://pappulab.github.io/localCIDER/), where grp1 is the list of residues in set X. For example, if one wants to calculate the patterning of aromatic residues within a sequence, then grp1 = ['W','Y','F']. However, given how δ_max is calculated within localCIDER, it is only appropriate to calculate Ω_X if the global f_X is greater than 0.1 (see Note 4).

3.5 Applying Patterning Metrics to Understand the Effects of Phosphorylation on Conformation

Both Ω and SCD have been used as metrics to understand the effects of phosphorylation on conformational preferences. Martin et al. found that Ω maintained a similar value upon phosphorylation of the Ash1 IDR (unphosphorylated = 0.10; phosphorylated = 0.13) [53]. The low Ω values for both unphosphorylated and phosphorylated Ash1 IDR were used to help explain the invariance in global dimensions (Rg) measured for both sequences. Additionally, SCD has been used to identify phosphorylation sites that can induce maximal change in protein conformation [54]. The assumption is that a large change in SCD following minimal phosphorylation (introduction of phosphomimetic(s) in the sequence) indicates that those phosphosites are likely to lead to large changes in the conformational properties of the sequence. Thus, by comparing specific phosphosites, or sets of phosphosites, across a protein using SCD, predictions can be made as to which site(s) will have the largest effect on changing conformation. However, examination of a larger set of phosphorylated sequences will be needed to determine how general both these findings are.
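The SCD-based comparison of phosphosites described above can be sketched as a single-site scan. This is an illustration under stated assumptions, not the procedure of [54]: each Ser, Thr, or Tyr is replaced in turn by a Glu phosphomimetic (charge −1), SCD is recomputed from the definition in Subheading 3.3.2, and sites are ranked by the magnitude of the change. The charge assignments, the choice of Glu as the mimetic, and the function names are all hypothetical choices.

```python
from math import sqrt

CHARGE = {"K": 1, "R": 1, "D": -1, "E": -1}  # assumed charge assignments


def scd(seq):
    """Sequence charge decoration as defined in Subheading 3.3.2."""
    q = [CHARGE.get(aa, 0) for aa in seq]
    total = sum(q[m] * q[n] * sqrt(m - n)
                for m in range(1, len(q)) for n in range(m))
    return total / len(q)


def phosphomimetic_scan(seq, sites="STY", mimic="E"):
    """Rank candidate phosphosites by the SCD change of one substitution.

    Returns (position, residue, delta_SCD) tuples with 1-based positions,
    sorted by decreasing magnitude of delta_SCD.
    """
    base = scd(seq)
    changes = []
    for i, aa in enumerate(seq):
        if aa in sites:
            mutant = seq[:i] + mimic + seq[i + 1:]
            changes.append((i + 1, aa, scd(mutant) - base))
    return sorted(changes, key=lambda c: abs(c[2]), reverse=True)


print(phosphomimetic_scan("KS"))
# [(2, 'S', -0.5)]
```

Because SCD weights each charge pair by the square root of its sequence separation, a site far from an existing charged block can produce a larger change than a site adjacent to it, which is worth keeping in mind when interpreting the ranking.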

3.6 Extracting Quantitative Conformational Information from Sequence

The above sequence features only yield qualitative conformational information for a given sequence. However, one may want to predict quantitative conformational properties from sequence without the aid of additional simulations or experiments. The following two subsections highlight efforts to quantitatively predict conformational properties of IDRs directly from sequence. Although some strides have been made to uncover sequence features that correlate with deviations from coil-like behavior, the results from these efforts show that a general framework to quantitatively predict IDR conformation from sequence is still lacking. This implies additional studies will be needed to uncover how conformational properties are encoded in sequence.


3.6.1 Extracting Rh from Sequence

As can be observed from Fig. 1, a single unifying relationship between Rh and chain length alone is not observed for IDRs. Thus, Marsh et al. tried to determine what sequence features lead to deviations from a simple power law relationship for IDRs [12]. They found that the fraction of proline residues, the net charge of the sequence, and whether or not the sequence had a flanking polyhistidine tag had the greatest influence on the compaction of an IDR, given the features they tested. Given these results, the scaling relationship between Rh and chain length was adjusted to include terms that accounted for these sequence features. Although this method generally improves the prediction of Rh over using chain length alone for the IDRs examined, the mean percent error for the IDRs in Table 2 is still 15% (Fig. 6). This implies that this method does not account for the full set of sequence features needed to accurately describe the size of an IDR. Tomasso et al. used a similar strategy to Marsh et al. to predict Rh from sequence but instead adjusted the scaling relationship to include the average polyproline II (PPII) propensity of the sequence [55]. Figure 6 shows the measured versus predicted Rh values for the IDRs from Table 2 using this method. The Tomasso et al. method shows a similar mean percent error (18%) as compared to the Marsh et al. method. Additionally, Tomasso et al. found that the average PPII propensity alone could not recover the experimental Rh values for all 22 IDRs they studied. Importantly, it was observed that charge–charge interactions also influence the degree of compaction of an IDR.
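As a point of reference for the adjusted scaling relationships discussed above, the chain-length-only power laws of Fig. 1 can be evaluated directly. The sketch below uses the two fits shown in Fig. 1 (Rh = 4.92N^0.285 for folded proteins and Rh = 2.33N^0.549 for chemically denatured proteins, from Marsh and Forman-Kay [12]). It assumes Rh is expressed in Å, and it assumes the percent error is the absolute difference relative to the measured value. It does not implement the proline, net charge, polyhistidine tag, or PPII corrections, and the function names are hypothetical.

```python
def rh_power_law(n_residues, r0, nu):
    """Rh = R0 * N**nu for a chain of n_residues residues."""
    return r0 * n_residues ** nu


def rh_folded(n_residues):
    """Fit for folded proteins shown in Fig. 1 (assumed units: angstrom)."""
    return rh_power_law(n_residues, 4.92, 0.285)


def rh_denatured(n_residues):
    """Fit for chemically denatured proteins shown in Fig. 1."""
    return rh_power_law(n_residues, 2.33, 0.549)


def percent_error(predicted, measured):
    """Absolute percent error relative to the measured value (assumed form)."""
    return 100.0 * abs(predicted - measured) / measured


for n in (50, 100, 200, 400):
    print(n, round(rh_folded(n), 1), round(rh_denatured(n), 1))
```

A measured Rh for an IDR of length N can then be placed between, below, or above these two reference values, which is the comparison that Fig. 1 makes graphically.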

3.6.2 Comparing Conformation to a Reference Coil: Chain Expansion Parameter

Whereas the previous efforts focused on global amino acid composition features, Ghosh and colleagues developed a theory to predict IDR compaction compared to an FRC as a function of sequence charge patterning, three-body repulsion, and excluded volume interactions [50, 54]. As a result, the chain expansion parameter, x, can be predicted for each sequence and is defined as:

$$x = \frac{R_E^2}{R_{E,\mathrm{FRC}}^2}$$

Here, $R_E^2$ corresponds to the mean squared end-to-end distance of the given sequence, and $R_{E,\mathrm{FRC}}^2$ corresponds to the mean squared end-to-end distance of the reference FRC of the same length. Thus, x quantifies the degree of compaction/expansion of a chain compared to an FRC of the same length. Additionally, x can be determined from experimentally determined Rg values using the following approximation (assuming IDRs follow uniform FRC scaling, which may not always be an appropriate assumption [6, 7]):

172

551

609

215

Nsp1n

Nup116m

Nup100n

Nup49

26.9

48.7

46.5

27.1

Length Rh

Name [68]

References

MFGLNKASSTPAGGLFGQASGASTGNANTGFSFGGTQTGQNTGPSTGGLFGAKPAG STGGLGASFGQQQQQSQTNAFGGSATTGGGLFGNKPNNTANTGGGLFGANSNSN SGSLFGSNNAQTSRGLFGNNNTNNINNSSSGMNNASAGLFGSKPAGGTSLFGN TSTSSAPAQNQGMFGAKPAGTSLFGNNAGNTTTGGGLFGSKPTGATSLFGSS

[68]

[68] FGNNRPMFGGSNLSFGSNTSSFGGQQSQQPNSLFGNSNNNNNSTSNNAQSGFGGF TSAAGSNSNSLFGNNNTQNNGAFGQSMGATQNSPFGSLNSSNASNGNTFGG SSSMGSFGGNTNNAFNNNSNSTNSPFGFNKPNTGGTLFGSQNNNSAGTSSLFGG QSTSTTGTFGNTGSSFGTGLNGNGSNIFGAGNNSQSNTTGSLFGNQQSSAFGTNN QQGSLFGQQSQNTNNAFGNQNQLGGSSFGSKPVGSGSLFGQSNNTLGNTTNN RNGLFGQMNSSNQGSSNSGLFGQNSMNSSTQGVFGQNNNQMQINGNNNN SLFGKANTFSNSASGGLFGQNNQQQGSGLFGQNSQTSGSSGLFGQNNQKQPNTF TQSNTGIGLFGQNNNQQQQSTGLFGAKPAGTTGSLFGGNSSTQPNSLFGTTNVP TSNTQSQQGNSLFGATKLTNMPFGGNPTANQSGSGNSLFGTKPASTTGSLFGNNTA STTVPSTNGLFGNNANNSTSTTNTGLFGAKPDSQSKPALGGGLFGNSNSNSSTIG QNKPVFGGTTQNTGLFGATGTNSSAVGSTGKLFGQNNNTLNVGTQNVPPVNN TTQNALLGTTAVPSLQQAPVTN

[68] RKFGTSQNGTGTTFNNPQGTTNTGFGIMGNNNSTTSATTGGLFGQKPATGMFGTG TGSGGGFGSGATNSTGLFGSSTNLSGNSAFGANKPATSGGLFGNTTNNPTNGTNN TGLFGQQNSNTNGGLFGQQQNSFGANNVSNGGAFGQVNRGAFPQQQTQQG SGGIFGQSNANANGGAFGQQQGTGALFGAKPASGGLFGQSAGSKAFGMNTNPTG TTGGLFGQTNQQQSGGGLFGQQQNSNAGGLFGQNNQSQNQSGLFGQQN SSNAFGQPQQQGGLFGSKPAGGLFGQQQGASTFASGNAQNNSIFGQNN QQQQSTGGLFGQQNNQSQSQPGGLFGQTNQNNNQPFGQNGLQQPQQNN SLFGAKPTGFGNTSLFSNSTTNQSNGISGNNLQQQSGGLFQNKQQPASGGLFGSKP SNTVGGGLFGNNQVANQNNPASTSGGLFGSKPATGSLFGGTNSTAPNASSGGIFG SNNASNTAATTNSTGLFGNKPVGAGASTSAGGLFGNNNNSSLNNSNGSTGLFGSNN TSQSTNAGGLFQNNTSTNTSGGGLFS

MNFNTPQQNKTPFSFGTANNNSNTTNQNSSTGAGAFGTGQSTFGFNNSAPNN TNNANSSITPAFGSNNTGNTAFGNSNPTSNVFGSNNSTTNTFGSNSAGTSLFGSSSA QQTKSNGTAGGNTFGSSSLFNNSTNSNTTKPAFGGLNFGGGNNTTPSSTGNAN TSNNLFGATA

Sequence

Table 2 List of IDRs with experimentally measured Rh values (in Å)

364 Kiersten M. Ruff

212

255

242

279

441

151

Nup42

Nup57

Nup145N

Nup1c

Nup159

Nup60

31.3

55.4

32.4

28.2

31.9

28.4

[68]

[68]

[68]

GDKPPSSAFNFSFNTSRNVEPTENAYKSENAPSASSKEFNFTNLQAKPLVGKPKTEL TKGDSTPVQPDLSVTPQKSSSKGFVFNSVQKKSRSNLSQENDNEGKHISASIDNDF SEEKAEEFDFNVPVVSKQLGNGLVDENKVEAFKSLYTF

(continued)

[68]

[68] SGFTFLKTQPAAANSLQSQSSSTFGAPSFGSSAFKIDLPSVSSTSTGVASSEQDATDPA SAKPVFGKPAFGAIAKEPSTSEYAFGKPSFGAPSFGSGKSSVESPASGSAFGKPSFGTP SFGSGNSSVEPPASGSAFGKPSFGTPSFGSGNSSAEPPASGSAFGKPSFGTSAFGTASSNE TNSGSIFGKAAFGSSSFAPANNELFGSNFTISKPTVDSPKEVDSTSPFPSSGDQSEDESK SDVDSSSTPFGTKPNTSTKPKTNAFDFGSSSFGSGFSKALESVGSDTTFKFGTQASPF SSQLGNKSPFSSFTKDDTENGSLSKGSTSEINDDNEEHESNGPNVSGNDLTDSTVE QTSSTRLPETPSDEDGEVVEEEAQKSPIGKLTETIKKSANIDMAGLKNPVFGNH VKAKSESPFSAFATNITKPSSTTPAFSFGNSTMN

SAFSFGTANTNGTNASANSTSFSFNAPATGNGTTTTSNTSGTNIAGTFNVGKPDQSIA SGNTNGAGSAFGFSSSGTAATGAASNQSSFNFGNNGAGGLNPFTSA TSSTNANAGLFNKPPSTNAQNVNVPSAFNFTGNNSTPGGGSVFNMNGNTNAN TVFAGSNNQPHQSQTPSFNTNSSFTPSTVPNINFSGLNGGITNTATNALRP SDIFGANAASGSNSNVTNPSSIFGGAGGVPTTSFGQPQSAPNQMGMGTNNGM SMGGGVMANRKIARMRHSKR

MFNKSVNSGFTFGNQNTSTPTSTPAQPSSSLQFPQKSTGLFGNVNVNANTSTPSP SGGLFNANSNANSISQQPANNSLFGNKPAQPSGGLFGATNNTTSKSAGSLFGNNNA TANSTGSTGLFSGSNNIASSTQNGGLFGNSNNNNITSTTQNGGLFGKP TTTPAGAGGLFGNSSSTNSTTGLFGSNNTQSSTGIFGQKPGASTTGGLFGNNGASFP RSGETTGTMSTNPYGINISNVPMAVA

MFGFSGSNNGFGNKPAGSTGFSFGQNNNNTNTQPSASGFGFGGSQPNSGTA TTGGFGANQATNTFGSNQQSSTGGGLFGNKPALGSLGSSSTTASGTTATGTGLFG QQTAQPQQSTIGGGLFGNKPTTTTGGLFGNSAQNNSTTSGGLFGNKVGSTG SLMGGNSTQNTSNMNAGGLFGAKPQNTTATTGGLFGSKPQGSTTNGGLFGSG TQNNNTLGGGGLFGQSQQPQTNTAPGLGNTVSTQPSFAWSKPSTGS

[68] MSAFGNPFTSGAKPNLSNTSGINPFTNNAASTNNMGGSAFGRPSFGTANTMTGG TTTSAFGMPQFGTNTGNTGNTSISAFGNTSNAAKPSAFGAPAFGSSAPINVNPP STTSAFGAPSFGSTGFGAMAATSNPFGKSPGSMGSAFGQPAFGANKTAIPSSSVSNSNN SAFGAASNTPLTTTSPFGSLQQNASQNASSTSSAFGKPTFGAATN

578

376

431

191

Nup1m

Nup2

Nsp1m

Nup145Ns

29.8

65.3

59.8

67.9

Length Rh

Name

Table 2 (continued)

DMPRSITSSLSDVNGKSDAEPKPIENRRTYSFSSSVSGNAPLPLASQSSLVSRLSTRLKA TQKSTSPNEIFSPSYSKPWLNGAGSAPLVDDFFSSKMTSLAPNENSIFPQNGFNFL SSQRADLTELRKLKIDSNRSAAKKLKLLSGTPAITKKHMQDEQDSSENEPIANAD SVTNIDRKENRDNNLDNTYL

NANKPAFSFGATTNDDKKTEPDKPAFSFNSSVGNKTDAQAPTTGFSFGSQLGGNK TVNEAAKPSLSFGSGSAGANPAGASQPEPTTNEPAKPALSFGTATSDNKTTNTTPSF SFGAKSDENKAGATSKPAFSFGAKPEEKKDDNSSKPAFSFGAKSNEDKQDGTAKPAF SFGAKPAEKNNNETSKPAFSFGAKSDEKKDGDASKPAFSFGAKPDENKASATSKPAF SFGAKPEEKKDDNSSKPAFSFGAKSNEDKQDGTAKPAFSFGAKPAEKNNNETSKPAF SFGAKSDEKKDGDASKPAFSFGAKSDEKKDSDSSKPAFSFGTKSNEKKDSGSSKPAF SFGAKPDEKKNDEVSKPAFSFGAKANEKKESDESKSAFSFGSKPTGKEEGDGAKAAI SFGAKPEEQKSSDTSKPAFTFGAQKDNEKKTEES

DSVFSFGPKKENRKKDESDSENDIEIKGPEFKFSGTVSSDVFKLNPSTDKNEKKTE TNAKPFSFSSATSTTEQTKSKNPLSLTEATKTNVDNNSKAEASFTFGTKHAAD SQNNKPSFVFGQAAAKPSLEKSSFTFGSTTIEKKNDENSTSNSKPEKSSDSNDSNPSF SFSIPSKNTPDASKPSFSFGVPNSSKNETSKPVFSFGAATPSAKEASQEDDNNNVEKP SSKPAFNLISNAGTEKEKESKKDSKPAFSFGISNGSESKDSDKPSLPSAVDGENDKKEA TKPAFSFGINTNTTKTADTKAPTFTFGSSALADNKEDVKKPFSFGTSQPNNTPSF SFGKTTANLPANSSTSPAPSIPSTGFKFSLPFEQKGS

DAIQKKDNKDKEGNAGGDQKTSENRNNIKSSISNGNLATGPNLTSEIEDLRADINSN RLSNPQKNLLLKGPASTVAKTAPIQESFVPNSERSGTPTLKKNIEPKKDKESIVLP TVGFDFIKDNETPSKKTSPKATSSAGAVFKSSVEMGKTDKSTKTAEAPTLSFNF SQKANKTKAVDNTVPSTTLFNFGGKSDTVTSASQPFKFGKTSEKSENHTESDAPPK STAPIFSFGKQEENGDEGDDENEPKRKRRLPVSEDTNTKPLFDFGKTGDQKE TKKGESEKDASGKPSFVFGASDKQAEGTPLFTFGKKADVTSNIDSSAQFTFGKAA TAKETHTKPSETPATIVKKPTFTFGQSTSENKISEGSAKPTFSFSKSEEERKSSPI SNEAAKPSFSFPGKPVDVQAPTDDKTLKPTFSFTEPAQKDSSVVSEPKKPSFTFASSK TSQPKPLFSFGKSDAAKEPPGSNTSFSFTKPPANETDKRPTPPSFTFGGSTTNN TTTTSTKPSFSFGAPESMKSTASTAAANTEKLSNGFSFTKFNHNKEKSNSPTSFFDGSA SSTPIPVLGKPTDATGNTTSK

Sequence

[68]

[68]

[68]

[68]

References

240

282

498

Nup62

Nup214

Nup98

111

Nup116

95

196

Nup116s

Nsp1

190

Nup100s

56

34

37

26.8

20.4

39.1

36.6

[68]

[68]

[68]

[69]

(continued)

[69] MFNKSFGTPFGGGTGGFGTTSTFGQNTGFGTTSGGAFGTSAFGSSNNTGGLFGN SQTKPGGLFGTSSFSQPATSTSTGFGFGTSTGTANTLFGTASTGTSLFSSQNNAFA QNKPTGFGNFGTSTSSGGLFGTTNTTSNPFGSTSGSLFGPSSFTAAPTGTTIKFNPPTG TDTMVKAGVSTNISTKHQCITAMKEYESKSLEELRLEDYQANRKGPQNQVGAG TTTGLFGSSPATSSATGLFSSSTTNSGFAYGQNKTAFGTSTTGFGTNPGGLFGQQN QQTTSLFSKPFGQATTTQNTGFSFGNTSTIGQPSTNTMGLFGVTQASQPGGLFGTA TNTSTGTAFGTGTGLFGQTNTGFGAVGSTLFGNNKLTTFGSSTTSAPSFG TTSGGLFGNKPTLTLGTNTNTSNFGFGTNTSGNSIFGSKPAPGTLGTGLGAGFG TALGAGQASLFGNNQPKIGGPLGTGAFGAPGFNTTTATLGFGAPQAPVALTDPNA SAAQQ

SPGFGQGGSVFGGTSAATTTAATSGFSFCQASGFGSSNTGSVFGQAASTGGIVFG QQSSSSSGSVFGSGNTGRGGGFFSGLGGKPSQDAANKNPFSSASGGFGSTATSN TSNLFGNSGAKTFGGFASSSFGEQKPTGTFSSGGGSVASQGFGFSSPNKTGGFGAAP VFGSPPTFGGSPGFGGVPAFGSAPAFTSPLGSTGGKVFGEGTAAASAGGFGFGSSSN TTSFGTLASQNAPTFGSLSQQTSGFGTQSSGFSGFGSGTGGFSFGSNNSSVQGFGG WRS

[69] MSGFNFGGTGAPTGGFTFGTAKTATTTPATGFSFSTSGTGGFNFGAPFQPATSTP STGLFSLATQTPATQTTGFTFGTATLASGGTGFSLGIGASKLNLSNTAATPAMANP SGFGLGSSNLTNAISSTVTSSQGTAPTGFVFGPSTTSVAPATTSGGFSFTGGSTAQP SGFNIGSAGNSAQPTAPATLPFTPATPAATTAGATQPAAPTPTATITSTGPSLFASIATAP TSSATTGLSLC

PAFSFGAKPDENKASATSKPAFSFGAKPEEKKDDNSSKPAFSFGAKSNEDKQDGTAKPAF [68] SFGAKPAEKNNNETSKPAFSFGAKSDEKKDGDASK

GALFGAKPASGGLFGQSAGSKAFGMNTNPTGTTGGLFGQTNQQQSGGGLFGQQQN SNAGGLFGQNNQSQNQSGLFGQQNSSNAFGQPQQQGGLFGSKPAGGLFG QQQGAST

QPSATKIKADERKKASLTNAYKMIPKTLFTAKLKTNNSVMDKAQIKVDPKLSI SIDKKNNQIAISNQQEENLDESILKASELLFNPDKRSFKNLINNRKMLIASEEKNNG SQNNDMNFKSKSEEQETILGKPKMDEKETANGGERMVLSSKNDGEDSATKHH SRNMDEENKENVADLQKQEYSEDDKKAVFADVAE

EQLFSKISIPNSITNPVKATTSKVNADMKRNSSLTSAYRLAPKPLFAPSSNGDAKFQK WGKTLERSDRGSSTSNSITDPESSYLNSNDLLFDPDRRYLKHLVIKNNKNLN VINHNDDEASKVKLVTFTTESASKDDQASSSIAASKLTEKAHSPQTDLKDDHDE STPDPQSKSPNGSTSIPMIENEKISS

40

61

92

SBD

CTL9-I98A

720

Nup2p

Abeta(1–40)

602

Nup153

21.7

25.6

References

AAEELANAKKLKEQLEKLTVTIPAKAGEGGRLFGSITSKQAAESLQAQHGLKLDK RKIELADAIRALGYTNVPVKLHPEVTATLKVHVTEQK

GSMMSASSQSPNPNNPAEYCSTIPPLEYCSTIPPLQQAQASGALSSPPPTVMVPVG VLKHP

[12]∗

[12]∗

[12]∗

[70] MAKRVADAQIQRETYDSNESDDDVTPSTKVASSAVMNRRKIAMPKRRMAFKPFGSAK SDETKQASSFSFLNRADGTGEAQVDNSPTTESNSRLKALNLQFKAKVDDL VLGKPLADLRPLFTRYELYIKNILEAPVKSIENPTQTKGNDAKPAKVEDVQKSSD SSSEDEVKVEGPKFTIDAKPPISDSVFSFGPKKENRKKDESDSENDIEIKGPEFKFSG TVSSDVFKLNPSTDKNEKKTETNAKPFSFSSATSTTEQTKSKNPLSLTEATKTNVDNN SKAEASFTFGTKHAADSQNNKPSFVFGQAAAKPSLEKSSFTFGSTTIEKKNDENSTSN SKPEKSSDSNDSNPSFSFSIPSKNTPDASKPSFSFGVPNSSKNETSKPVFSFGAATP SAKEASQEDDNNNVEKPSSKPAFNLISNAGTEKEKESKKDSKPAFSFGISNGSESKD SDKPSLPSAVDGENDKKEATKPAFSFGINTNTTKTADTKAPTFTFGSSALADNKED VKKPFSFGTSQPNNTPSFSFGKTTANLPANSSTSPAPSIPSTGFKFSLPFEQKG SQTTTNDSKEESTTEATGNESQDATKVDATPEESKPINLQNGEEDEVALFSQKAKLM TFNAETKSYDSRGVGEMKLLKKKDDPSKVRLLCRSDGMGNVLLNATVVDSFK YEPLAPGNDNLIKAPTVAADGKLVTYIVKFKQKEEGRSFTKAIEDAKKEMK

[69] CESAKPGTKSGFKGFDTSSSSSNSAASSSFKFGVSSSSSGPSQTLTSTGNFKFGD QGGFKIGVSSDSGSINPMSEGFKFSKPIGDFKFGVSSESKPEEVKKDSKNDNFKFGL SSGLSNPVSLTPFQFGVSNLGQEEKKEELPKSSSAGFSFGTGVINSTPAPANTIVTSENK SSFNLGTIETKSASVAPFTCKTSEAKKEEMPATKGGFSFGNVEPASLPSASVFVLG RTEEKQQEPVTSTSLVFGKKADNEEPKCQPVFSFGNSEQTKDENSSKSTFSFSMTKP SEKESEQPAKATFAFGAQTSTTADQGAAKPVFSFLNNSSSSSSTPATSAGGGIFG SSTSSSNPPVATFVFGQSSNPVSSSAFGNTAESSTSQSLLFSQDSKLATTSSTGTAVTPF VFGPGASSNNTTTSGFGFGATTTSSSAGSSFVFGTGPSAPSASPAFGANQTPTFG QSQGASQPNPPGFGSISSSTALFPTGSQPAPPTFGTVSSSSQPPVFGQQPSQSAFGSG TTPNSSSAFQFGSSTTNFNFTNNSPSGVFTFGANSSTPAASAQPSGSGGFPFNQSPAAF TVGSNGKNVFSSSGTSFSGRKIKTAVRRRK

Sequence

14.36 DAEFRHDSGYEVHHQKLVFFAEDVGSNKGAIIGLMVGGVV

79

51

Length Rh

Name

Table 2 (continued)

112

140

189

198

234

237

TC1

Alpha-synuclein

CFTR R region

Tau K45

RYBP

3D7-6H MSP2

61

110

Prothymosin alpha

EHD-L16A

104

Sm1

20.1

34.3

39.5

45

32

28.2

26.5

33.7

23.4 [12]∗

[12]∗

[12]∗

[12]∗

TNDEKRPRTAFSSEQAARLKREFNENRYLTERRRQQLSSELGLNEAQIKIWFQNK RAKIKK

(continued)

[12]∗

MIKNESKYSNTFINNAYNMSIRRSMAESKPSTGAGGSAGGSAGGSAGGSAGGSAGGSAG [12]∗ SGDGNGADAEGSSSTPATTTTTKTTTTTTTTNDAEASTSTSSENPNHKNAE TNPKGKGEVQEPNQANKETQNNSNVQQDSQTKSNVPPTQDADTKSPTAQPE QAENSAPTAEQTESPELQSAPENKGTGQHGHMHGSRNNHPQNTSDSQKEC TDGNKENCGAATSLLNNSSNHHHHHH

HHHHHHMTMGDKKSPTRPKRQAKPAADEGFWDCSVCTFRNSAEAFKCSICDVRKG TSTRKPRINSQLVAQQVAQQYATPPPPKKEKKEKVEKQDKEKPEKDKEISPSVTKKN TNKKTKPKSDILKDPPSEANSIQSANATTKTSETNHTSRPRLKNVDRSTAQQLA VTVGNVTVIITDFKEKTRSSSTSSSTVTSSAGSEQQNQSSSGSESTDKG SSRSSTPKGDMSAVNDESF

MSSPGSPGTPGSRSRTPSLPTPPTREPKKVAVVRTPPKSPSSAKSRLQTAPVPMPDLKNVK [12]∗ SKIGSTENLKHQPGGGKVQIINKKLDLSNVQSKCGSKDNIKHVPGGGSVQIVYKP VDLSKVTSKCGSLGNIHHKPGGGQVEVKSEKLDFKDRVQSKIGSLDNITH VPGGGNKKIETHKLTFRENAKAKTDHGAEIVY

GAMESAERRNSILTETLHRFSLEGDAPVSWTETKKQSFKQTGEFGEKRKNSILNPINSI [12]∗ RKFSIVQKTPLQMNGIEEDSDEPLERRLSLVPDSEQGEAILPRISVISTGPTLQA RRRQSVLNLMTHSVNQGQNIHRKTTASTRKVSLAPQANLTELDIYSRRLSQETGLEI SEEINEEDLKECLFDDME

MDVFMKGLSKAKEGVVAAAEKTKQGVAEAAGKTKEGVLYVGSKTKEGVVHGVA TVAEKTKEQVTNVGGAVVTGVTAVAQKTVEGAGSIAAATGFVKKDQLGKNEEGAP QEGILEDMPVDPDNEAYEMPSEEGYQDYEPEA

HHHHHHMKAKRSHQAIIMSTSLRVSPSIHGYHFDTASRKKAVGNIFENTDQESLERLF [12]∗ RNSGDKKAEERAKIIFAIDQDVEEKTRALMALKKRTKDKLFQFLKLRKYSIKVH

MSDAAVDTSSEITTKDLKEKKEVVEEAENGRDAPANGNANEENGEQEADNE VDEEEEEGGEEEEEEEEGDGEEEDGDEDEEAESATGKRAAEDDEDDDVDTKKQK TDEDD

MQNSQDYFYAQNRCQQQQAPSTLRTVTMAEFRRVPLPPMAEVPMLSTQNSMGSSA SASASSLEMWEKDLEERLNSIDHDMNNNKFGSGELKSMFNQGKVEEMDF

115

126

136

140

168

Nup116

Cad136

NL3-cyt

Fos-AD

94

LJIDP1

ARS1

89

Vmw65

108

87

PDE-gamma

TyrRS(D1)

73

35

28.3

28.1

25.2

27.4

21

[12]∗

[12]∗

References

GSHMSVASLDLTGGLPEVATPESEEAFTLPLLNDPEPKPSVEPVKSISSMELK TEPFDDFLFPASSRPSGSETARSVPDMDLSGSFYAADWEPLHSGSLGMGPMA TELEPLCTPVVTCTPSCTAYTSSFVFTYPEADSFPSCAAAHRKGSSSNEPSSDSLSSP TLLAL

MGSSHHHHHHSSGLVPRGSHMAYRKDKRRQEPLRQPSP QRGAGAPELGAAPEEELAALQLGPTHHECEAGPPHDTLRLTALPDYTLTL RRSPDDIPLMTPNTITMIPNSLVGLQTLHPYNTFAAGFNSTGLPHSHSTTRV

RLEQYTSAVVGNKAAKPAKPAASDLPVPAEGVRNIKSMWEKGNVFSSPGGTGTPNKE TAGLKVGVSSRINEWLTKTPEGNKSPAPKPSDLRPGDVSGKRNLWEKQSVEKPAA SSSKVTATGKKSETNGLRQFEKEP

GSRRASVGSGALFGAKPASGGLFGQSAGSKAFGMNTNPTGTTGGLFGQTN QQQSGGGLFGQQQNSNAGGLFGQNNQSQNQSGLFGQQNSSNAFGQP QQQGGLFGSKPAGGLFGQQQGASTHHHHHH

MEEEKHHHHHLFHHKDKAEEGPVDYEKEIKHHKHLEQIGKLGTVAAGA YALHEKHEAKKDPEHAHKHKIEEEIAAAAAVGAGGFAFHEHHEKKDAKKEEKKKL RGDTTISSKLLF

MALFSGDIANLTAAEIEQGFKDVPSFVHEGGDVPLVELLVSAGISPSKRQAREDI QNGAIYVNGERLQDVGAILTAEHRLEGRFTVIRRGKKKYYLIRYALEHHHHHH

[12]∗

[12]∗

[12]∗

[12]∗

[12]∗

[12]∗

[12]∗

GSAGHTRRLSTAPPTDVSLGDELHLDGEDVAMAHADALDDFDLDMLGDGDSPGPGF [12]∗ TPHDSAPYGALDMADFEFEQMFTDALGIDEYGG

MNLEPPKAEIRSATRVMGGPVTPRKGPPKFKQRQTRQFKSKPPKKG VQGFGDDIPGMEGLGTDITVICPWEAFNHLELHELAQYGII

MEEPQSDPSVEPPLSQETFSDLWKLLPENNVLSPLPSQAMDDLMLSPDDIEQWF TEDPGPDEAPRMPEAAPRV

Sequence

24.52 MARSFTNIKAISALVAEEFSNSLARRGYAATAQSAGRVGASMSGKMGSTKSGEEKAAA REKVSWVPDPVTGYYKPENIKEIDVAELRSAVLGKN

28

24.8

23.8

Length Rh

p53-TAD

Name

Table 2 (continued)

134

140

174

β-Synuclein

α-Synuclein

Stathmin

136

144

Caldesmon 636–771 fragment

pf1 gene 5 protein

75

127

γ-Synuclein

Heat stable protein kinase inhibitor

150

Fibronectin binding domain B

49

93

Wheat EM protein

Osteocalcin

34

Alpha-fetoprotein

29.5

28.1

22.3

18.4

33

32.3

32

30.4

30.7

28.2

15.5

[71]∗

MNMFATQGGVVELWVTKTDTYTSTKTGEIYASVQSIAPIPEGARGNAKGFEISEYNIEP TLLDAIVFEGQPVLCKFASVVRPTQDRFGRITNTQVLVDLLAVGGKPMAPTAQAPA RPQAQAQAPRPAQQPQGQDKQDKSPDAKA

RLEQYTSAVVGNKAAKPAKPAASDLPVPAEGVRNIKSMWEKGNVFSSPGGTGTPNKE TAGLKVGVSSRINEWLTKTPEGNKSPAPKPSDLRPGDVSGKRNLWEKQSVEKPAA SSSKVTATGKKSETNGLRQFEKEP

MTDVETTYADFIASGRTGRRNAIHDILVSSASGNSNELALKLAGLDINKTEGEEDA QRSSTEQSGEAQGEAAKSE

YLDSGLGAPVPYPDPLEPKREVCELNPNCDELADHIGFQEAYQRFYGPV

MASSDIQVKELEKRASGQAFELILSPRSKESVPEFPLSPPKKKDLSLEEIQKKLEAAEE RRKSHEAEVLKQLAEKREHEKEVLQKAIEENNNFSKMAEEKLTHKMEANKENREA QMAAKLERLREKMYFWTHGPGAHPAQISAEQSCLHSVPALCPALGLQSALITWSDL SHHH

MDVFMKGLSKAKEGVVAAAEKTKQGVAEAAGKTKEGVLYVGSKTKEGVVHGVA TVAEKTKEQVTNVGGAVVTGVTAVAQKTVEGAGSIAAATGFVKKDQLGKNEEGAP QEGILEDMPVDPDNEAYEMPSEEGYQDYEPEA

MDVFMKGLSMAKEGVVAAAEKTKQGVTEAAEKTKEGVLYVGSKTREGVVQGVA SVAEKTKEQASHLGGAVFSGAGNIAAATGLVKREEFPTDLKPEEVA QEAAEEPLIEPLMEPEGESYEDPPQEEYQEYEPEA

MDVFKKGFSIAKEGVVGAVEKTKQGVTEAAEKTKEGVMYVGAKTKEN VVQSVTSVAEKTKEQANAVSEAVVSSVNTVATKTVEEAENIAVTSGVVRKEDLRPSAP QQEGEASKEKEEVAEEAQSGGD

KKGKGKIARKKGKSKVSRKEPYIHSLKRDSANKSNFLQKNVILEEESLKTELLKEQSE TRKEKIQKQQDEYKGMTQGSLNSLSGESGELEEPIESNEIDLTIDSDLRPKSSL QGIAGSNSISYTDEIEEEDYDQYYLDEYDEEDEEEIRL

(continued)

[71]∗

[71]∗

[71]∗

[71]∗

[71]∗

[71]∗

[71]∗

[71]∗

[71]∗

MASGQQERSQLDRKAREGETVVPGGTGGKSLEAQENLAEGRSRGGQTRREQMGEEG [71]∗ YSQMGRKGGLSTNDESGGDRAAREGIDIDESKFKTKS

LMAITRKMAATAATCCQLSEDKLLACGEGAADII

202

248

110

395

555

DARRP-32

Manganese-stabilizing protein

Calreticulin, human C fragment

Calsequestrin, rabbit

SdrD protein, B1–B5 fragment

54.7

45

46.2

32.7

34

Length Rh

Name

Table 2 (continued) References

VYKIGNYVWEDTNKNGVQELGEKGVGNVTVTVFDNNTNTKVGEAVTKEDG SYLIPNLPNGDYRVEFSNLPKGYEVTPSKQGNNEELDSNGLSSVITVNGKDNL SADLGIYKPKYNLGDYVWEDTNKNGIQDQDEKGISGVTVTLKDENGNVLK TVTTDADGKYKFTDLDNGNYKVEFTTPEGYTPTTVTSGSDIEKDSNGLTTTG VINGADNMTLDSGFYKTPKYNLGNYVWEDTNKDGKQDSTEKGISG VTVTLKNENGEVLQTTKTDKDGKYQFTGLENGTYKVEFETPSGYTPTQVGSG TDEGIDSNGTSTTGVIKDKDNDTIDSGFYKPTYNLGDYVWEDTNKNG VQDKDEKGISGVTVTLKDENDKVLKTVTTDENGKYQFTDLNNGTYKVEFETPSG YTPTSVTSGNDTEKDSNGLTTTGVIKDADNMTLDSGFYKTPKYSLGDYVWYD SNKDGKQDSTEKGIKDVKVTLLNEKGEVIGTTKTDENGKYCFDNLDSGKYK VIFEKPAGLTQTGTNTTEDDKDADGGEVDVTITDHDDFTLDNGYYEEET

MNAADRMGARVALLLLLVLGSPQSGVHGEEGLDFPEYDGVDRVINVNAKNYKN VFKKYEVLALLYHEPPEDDKASQRQFEMEELILELAAQVLEDKGVGFGLVD SEKDAAVAKKLGLTEEDSIYVFKEDEVIEYDGEFSADTLVEFLLDVLEDPVELIEGE RELQAFENIEDEIKLIGYFKNKDSEHYKAFKEAAEEFHPYIPFFATFDSKVAKKL TLKLNEIDFYEAFMEEPVTIPDKPNSEEEIVNFVEEHRRSTLRKLKPESMYE TWEDDMDGIHIVAFAEEADPDGYEFLEILKSVAQDNTDNPDLSIIWIDPDDFPLLVP YWEKTFDIDLSAPQIGVVNVTDADSVWMEMDDEEDLPSAEELEDWLEDVLEGEIN TEDDDDEDDDDDDDD

YDNFGVLGLDLWQVKSGTIFDNFLITNDEAYAEEFGNETWGVTKAAEKQMKDK QDEEQRLKEEEEDKKRKEEEEAEDKEDDEDKDEDEEDEEDKEEDEEEDVPG QAKDEL

EGGKRLTYDEIQSKTYLEVKGTGTANQCPTVEGGVDSFAFKPGKYTAKKFCLEPTKFA VKAEGISKNSGPDFQNTKLMTRLTYTLDEIEGPFEVSSDGTVKFEEKDGIDYAA VTVQLPGGERVPFLFTIKQLVASGKPESFSGDFLVPSYRGSSFLDPKGRGGSTGYDNA VALPAGGRGDEEELQKENNKNVASSKGTITLSVTSSKPETGEVIGVFQSLQPSD TDLGAKVPKDVKIEGVWYAQLEQQ

[71]∗

[71]∗

[71]∗

[71]∗

[71]∗ MDPKDRKKIQFSVPAPPSQLDPRQVEMIRRRRPTPAMLFRLSEHSSPEEEASPHQRA SGEGHHLKSKRSNPCAYTPPSLKAVQRIAESHLQSISNLGENQASEEEDELGELRELG YPREEEEEEEEEDEEEEEDSQAEVLKGSRGSAGQKTTYGQGLEGPWERPPPLDGP QRDGSSEDQVEDPALNEPGEEPQRPAHPEPGT

Sequence

93

93

93

97

202

97

260

p53(1–93) ALA

p53(1–93) PRO

Hdm2-ABD

HIF1-α-403

Mlph (147–240)

Mlph (147–403)

417

Calreticulin bovine

p53(1–93)

174

Topoisomerase I

49

28

44.3

25.7

27.4

30.4

32.4

44.2

58.5

[71]∗

[55]∗

[55]∗

[55]∗

[55]∗

[55]∗

(continued)

RLQGGGGSEPSLEEGNGDSEQTDEDGDLDTEARDQPLNSKKKKRLLSFRDVDFEED [55]∗ SDHLVQPCSQTLGLSSVPESAHSLQSLSGEPYSEDTTSLEPEGLEETGARALGCRPSPE VQPCSPLPSGEDAHAELDSPAASCKSAFGTTAMPGTDDVRGKHLPSQYLADVD TSDEDSIQGPRAASQHSKRRARTVPETQILELNKRMSAVEHLLVHLENTVLPPSA QEPTVETHPSADTEEETLRRRLEELTSNISGSSTSSE

RLQGGGGSEPSLEEGNGDSEQTDEDGDLDTEARDQPLNSKKKKRLLSFRDVDFEED SDHLVQPCSQTLGLSSVPESAHSLQSLSGEPYSEDTTSLEP

[55]∗ PAAGDTIISLDFGSNDTETDDQQLEEVPLYNDVMLPSPNEKLQNINLAMSPLPTAE TPKPLRSSADPALNQEVALKLEPNPESLELSFTMPQIQDQTPSPSDGSTRQSSPEPNSP SEYCFYVDSDMVNEFKLELVEKLFAEDTEAKNPFSTQDTDLDLEMLAPYIPMDDDF QLRSFDQLSPLESSSASPESASPQSTVTVFQ

ERSSSSESTGTPSNPDLDAGVSEHSGDWLDQDSVSDQFSVEFEVESLDSEDYSLSEEG QELSDEDDEVYQVTVYQAGESDTDSFEEDPEISLADYWK

MEEGQSDGSVEGGLSQETFSDLWKLLGENNVLSGLGSQAMDDLMLSGDDIEQWF TEDGGGDEAGRMGEAAGGVAGAGAAGTGAAGAGAGSWGL

MEEPQSDPSVEPPLSQETFSDLWKLLPENNVLSPLPSQGMDDLMLSPDDIEQWF TEDPGPDEGPRMPEGGPPVGPGPGGPTPGGPGPGPSWPL

MEEPQSDPSVEPPLSQETFSDLWKLLPENNVLSPLPSQAMDDLMLSPDDIEQWF TEDPGPDEAPRMPEAAPPVAPAPAAPTPAAPAPAPSWPL

[71]∗ MLLPVPLLLGLLGLAAADPTVYFKEQFLDGDGWTERWIESKHKPDFGKFVLSSGKF YGDQEKDKGLQTSQDARFYALSARFEPFSNKGQTLVVQFTVKHEQNIDCGGG YVKLFPAGLDQTDMHGDSEYNIMFGPDICGPGTKKVHVIFNYKGKNVLINKDI RCKDDEFTHLYTLIVRPNNTYEVKIDNSQVESGSLEDDWDFLPPKKIKDPDAAKPED WDDRAKIDDPTDSKPEDWDKPEHIPDPDAKKPEDWDEEMDGEWEPPVIQNPE YKGEWKPRQIDNPEYKGIWIHPEIDNPEYSPDSNIYAYENFAVLGLDLWQVKSG TIFDNFLITNDEAYAEEFGNETWGVTKAAEKQMKDKQDEE QRLHEEEEEKKGKEEEEADKDDDEDKDEDEEDEDEKEEEEEEDAAAGQAKDEL

MSGDHLHNDSQIEADFRLNDSHKHKDKHKDREHRHKEHKKEKDREKSKHSN SEHKDSEKKHKEKEKTKHKDGSSEKHKDKHKDRDKEKRKEEKVRA SGDAKIKKEKENGFSSPPQIKDEPEDDGYFVPPKEDIKPLKRPRDEDDADYKPKKIK TEDTKKEKKRKLEEEEDGKLK

73

206

146

170

202

132

132

132

SNAP25

ShB-C

HIF1-α-530

Securin

RAM PT1

RAM PT3

RAM PT4

[55]∗

[55]∗

[55]∗

References

27.52 MERKRRRQHGQLWFPEGFKVSEASKKKRRLFDMQDVVDRWQELEMDTL SENHAPDNASRQDWNRVEDLQLLTGLEPTGLDHQDKKDDLKFDAPGGAPKAE SAMAPTPPQGEVDADCMDVNVRGPDGFTPLLE

27.87 EERKRRRQHGQLWFPEGFKVSEASKKKRRWEDVKDATQVWDTKLGELK SHLGMMNNRLGDRRQDLPDPENDQADLSEAHQQTALDPAMLDPFDLKFEVGD SAMAPTPPQGEVDADCMDVNVRGPDGFTPLLE

[49]

[49]

[49]

MATLIYVDKENGEPGTRVVAKDGLKLGSGPSIKALDGRSQVSTPRFGKTFDAPPALPKA [55]∗ TRKALGTVNRATEKSVKTKGPLKQKQPSFSAKKMTEKTVKAKSSVPASDDA YPEIEKFFPFNPLDFESFDLPEEHQIAHLPLSGVPLMILDEERELEKLFQLGPPSP VKMPSPPWESNLLQSPSSILSTLDVELPPVCCDIDI

NEFKLELVEKLFAEDTEAKNPFSTQDTDLDLEMLAPYIPMDDDFQLRSFDQLSPLE [55]∗ SSSASPESASPQSTVTVFQQTQIQEPTANATTTTATTDELKTVTKDRMEDIKILIASPSP THIHKETTSATSSPYRDTQSRTASPNRAGKGVIEQTEKSHPRSPNVLSVALSQR

MTLGQHMKKSSLSESSSDMMDLDDGVESTPGLTETHPGRSAVAPFLGAQQQQQQP VASSLSMSIDKQLQHPLQQLTQTQLYQQQQQQQQQQQNGFKQQQQQTQQQL QQQQSHTINASAAAATSGSGSSGLTMRHNNALAVSIETDV

MAEDADMRNELEEMQRRADQLADESLESTRRMLQLVEESKDAGIRTLVMLDEQGE QLERIEEGMDQINKDMKEAEKNLTDLGKFCGLCVCPCNKLKSSDAYKKAWGNN QDGVVASQPARVVDEREQMAISGGFIRRVTNDARENEMDENLEQVSGIIGNL RHMALDMGNEIDTQNRQIDRIMEKADSNKTRIDEANQRATKMLGSG

VRTSACRSLFGPVDHEELSRELQARLAELNAEDQNRWDYDFQQDMPLRGPGRL QWTEVDSDSVPAFYRETVQV

Sequence

29.54 DDRKRRRQHGQLWFPEGFKVSEASKKKRREDLEKTVVQELTWPALLANKESQTE RNDLLLLGDFKDGEPNGMALDSMHVPAGPMFRDEQDARWDQHKDQDSAMAP TPPQGEVDADCMDVNVRGPDGFTPLLE

39.7

38.3

32.9

39.7

24

Length Rh

p57-ID

Name

Table 2 (continued)

132

132

132

132

RAM WT

RAM PT7∗

RAM PT8∗

RAM PT9

MARKRRRQHGQLWFPEGFKVSEASKKKRREPLGEDSVGLKPLKNASDGALMDDN QNEWGDEDLETKKFRFEEPVVLPDLDDQTDHRQWTQQHLDAADLRMSAMAP TPPQGEVDADCMDVNVRGPDGFTPLLE

22.12 RKRKRRRQHGQLWFPEGFKVSEASKKKRRAAQAQNEEHEDDLEQVAVNMGKFD VLDSLPDDLGLEDEETLDDDMPHQDAPLFGLDGLNWWRRQTPKMSKTSAMAP TPPQGEVDADCMDVNVRGPDGFTPLLE

24.51 MARKRRRQHGQLWFPEGFKVSEASKKKRRKPLGDDSVGLKPLDNASEGALMEDN QNEWGDDDLETEEFDFEDPVVLPRLRKQTKHRQWTQQHLDAADLDMSAMAP TPPQGEVDADCMDVNVRGPDGFTPLLE

26.51 MARKRRRQHGQLWFPEGFKVSEASKKKRRKPLGRKSVGLDPLENASDGALMEDN QNEWGEDDLDTEDFRFKKPVVLPDLEDQTEHDQWTQQHLDAARLDMSAMAP TPPQGEVDADCMDVNVRGPDGFTPLLE

26.6

27.91 MARKRRRQHGQLWFPEGFKVSEASKKKRRRPLGEDSVGLEPLDNASDGALMEEN QNDWGDDKLDTERFRFDDPVVLPDLDEQTDHKQWTQQHLKAAKLEMSAMAP TPPQGEVDADCMDVNVRGPDGFTPLLE

∗ denotes that the protein was taken from a collated inventory in that reference

132

RAM PT5∗

[49]

[49]

[49]

[49]

[49]

[Figure 6: scatter plot of predicted Rh versus experimental Rh, with separate series for the Marsh et al. and Tomasso et al. methods; both axes span 0 to 100]

Fig. 6 Comparison of Rh values determined experimentally (Table 2) with predicted Rh values calculated using two different methods

$$x = \frac{R_E^2}{R_{E,\mathrm{FRC}}^2} \approx \frac{6 R_g^2}{N b l}$$

Here, N is the sequence length, b is the bond length, and l is the Kuhn length [7, 17, 18, 50]. The bond length, or average distance between two Cα atoms, is generally taken to be 3.8 Å [11]. Fitting $R_{E,\mathrm{FRC}}^2$ from FRC simulations for the 409 disordered linkers from Harmon et al., we find l = 10.3 Å [56]. With these parameters in hand, we can calculate x for known IDRs using the Rg values determined from small-angle X-ray scattering experiments (Table 3). Figure 7 compares these x values to the predicted x values using the theory of Ghosh and colleagues. Here, each marker corresponds to a single sequence, colored by the |NCPR| for that sequence. Figure 7 shows that |NCPR| is not generally predictive of whether a sequence will be collapsed or expanded experimentally, whereas the prediction of x by theory is highly biased by |NCPR|. Specifically, theory suggests that any sequence with a low |NCPR| will be predicted to be collapsed, any sequence with a high |NCPR| will be predicted to be expanded, and there exists some variability in conformation in the intermediate |NCPR| range. Given that |NCPR| is not highly correlated with conformational preferences determined experimentally, these results suggest that this theory is still missing important sequence features that dictate the conformational preferences of IDRs.
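As a worked illustration of the approximation x ≈ 6Rg²/(Nbl), the short Python sketch below evaluates x from a measured Rg and a sequence length, using the b = 3.8 Å and l = 10.3 Å values quoted in the text. The function name and the "expanded"/"compact" labels are hypothetical conveniences; the only data used are the Nup49 (N49) length and Rg listed in Table 3.

```python
import math

BOND_LENGTH = 3.8   # b in angstroms, value quoted in the text
KUHN_LENGTH = 10.3  # l in angstroms, value quoted in the text


def chain_expansion(rg, n_residues, b=BOND_LENGTH, l=KUHN_LENGTH):
    """Approximate chain expansion parameter x = 6 * Rg**2 / (N * b * l).

    rg is in angstroms; the result is dimensionless. Assumes uniform
    FRC scaling, as noted in the text.
    """
    if rg <= 0 or n_residues <= 0:
        raise ValueError("rg and n_residues must be positive")
    return 6.0 * rg ** 2 / (n_residues * b * l)


if __name__ == "__main__":
    # Nucleoporin Nup49 (N49) from Table 3: length 38, Rg 15.9
    x = chain_expansion(rg=15.9, n_residues=38)
    label = "expanded" if x > 1.0 else "compact"
    print(f"x = {x:.3f}, ln(x) = {math.log(x):.3f} ({label} relative to the FRC)")
```

For this entry x comes out very close to 1, that is, near the FRC reference. Values of ln(x) above or below zero correspond to the expanded and collapsed sides of the cutoff lines drawn in Fig. 7.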

Table 3 List of IDRs with Rg values measured from small-angle X-ray scattering experiments

Name    Length    Rg    Error    References

Nucleoporin Nup49 (N49)

38

15.9

1.3

[7]

Heh2 (NLS)

46

24

3

[7]

VSV protein Phosphoprotein P

67

24

1

[72]

LS

70

27.9

1

[77]

Spinophilin

78

16.39

Nup153_NUS

82

24.9

Sic1

92

32.1

[74] [75]

[73] 1.3

[7]

Chloroplastic Calvin cycle protein

100

23

Antitermination protein N (from lambda phage)

107

38

3.5

[76]

Nup153_NUL

112

30

3

[7]

DARPP-32 (aka protein phosphatase 1 regulatory subunit 1B)

118

28.28

[73]

II-1

141

41

[77]

FhuA

142

33.4

[78]

N98

153

28.6

Protein phosphatase inhibitor 2

156

34.6

Nsp1

178

41

3

[7]

IBB

99

32

2

[7]

Ash1

83

28.5

3.4

[53]

pAsh1 Phosphosites assumed

83

27.5

1.2

[53]

PIR domain (GRB14)

98

27

RpII215

75

28

0.7

[80]

RpII215 hyper pSer5

75

28.3

0.3

[80]

312

51.8

[81] [82]

RpII215

1.3

[7] [73]

[79]

ACTR

80

25

Msh6

304

56

2

[83]

AN16

176

50

2

[84]

HrpO

148

35

Gamma-syn

127

61

1

[86]

Beta-syn

134

49

1

[86]

Alpha-syn

140

41

1

[86]

[85]

(continued)

Table 3 (continued)

Name                            Length  Rg      Error   References
NTail                           125     27.2    0.5     [87]
ERM                             122     39.6    0.7     [88]
Neuroligin-3                    118     33      3       [89]
eIF4E binding protein (4E-BP)   117     48.6    0.2     [90]
Prothymosin alpha               111     37.8    0.9     [91]
Fez1                            103     36      1       [92]
HIV-TAT                         101     33      1.05    [93]
p53 1–91                        91      28.7    0.3     [94]
Tau–ht40                        441     65      3       [95]
Tau–K32                         197     42      3       [95]
Tau–K16                         175     39      3       [95]
Tau–K18                         129     38      3       [95]
Tau–ht23                        352     53      3       [95]
Tau–K27                         166     37      2       [95]
Tau–K17                         144     36      2       [95]
Tau–K19                         98      35      1       [95]
Tau–K44                         283     52      2       [95]
Tau–K10                         167     40      1       [95]
Tau–K25                         185     41      2       [95]
Tau–K23                         258     49      2       [95]
Tau–K32 AT8 AT100               197     41      3       [95]
Tau–ht23 S214E                  352     54      3       [95]
Tau–ht23 AT8 AT100              352     52      3       [95]
Tau–K18 P301L                   129     35      2       [95]
Tau–K18 ΔK280                   128     79      10      [95]
Tau–K18 ΔK280 I277P I308P       128     35      2       [95]
Histatin                        24      13.2    0.01    [96]
CortactinCR                     324     46.7            [97]
Pertactin-NTD                   334     51.3    0.1     [78]
Reduced_RnaseH                  124     33.6    0.1     [78]
Nup153 fragment                 79      24      5       [98]

(continued)

Table 3 (continued)

Name                                               Length  Rg      Error   References
LOX-PP                                             147     37      0.4     [99]
H1_CTD                                             98      25      0.2     [100]
p27_WT (v31)                                       107     28.1    1.8     [22]
p27_v14                                            107     29.4    1.3     [22]
p27_v15                                            107     29.2    1       [22]
p27_v44                                            107     24.9    1.3     [22]
p27_v56                                            107     23.3    1       [22]
p27_v78                                            107     22.1    0.3     [22]
Ki-1/57                                            292     47      2       [101]
CTCF-R domain (WT)                                 185     32.5    1.8     [102]
CTCF-R domain phosphorylated–phosphosites assumed  185     29.2    0.4     [102]
hNHE1cdt                                           130     37.5    0       [103]
pMBP                                               171     54      0       [104]
HMPV                                               61      27.4    0.5     [105]
redAFP                                             81      22.2    0.1     [106]
CSD1 (with overhang)                               150     35.4    0       [107]
PAGE4 WT                                           103     36.2    1.1     [108]
PAGE4 HIPK1 phosphorylated                         103     34.7    1.2     [108]
PAGE4 CLK2 phosphorylated                          103     49.8    1.9     [108]
ERalpha-NTD                                        187     31      0.2     [109]
PNT4 WT                                            107     28.4    0.2     [52]
PNT4 WT low kappa                                  107     32      0.7     [52]
PNT4 WT high kappa                                 107     24.6    0.1     [52]
NTail WT                                           127     30      0.1     [52]
NTail low kappa                                    127     31      0.6     [52]
NTail high kappa                                   127     26      0.1     [52]
BACH2                                              190     40.1    1.08    [110]
CaD136                                             136     40.8    0.8     [111]
Phosphoprotein                                     100     31      5       [112]
Sp1-QA                                             63      25.6    1.5     [113]
Sp1-QB                                             147     36.2    0.3     [113]

[Figure 7: scatter plot of ln(Predicted x) versus ln(Experimental x), both axes spanning −2 to 3, with markers colored by |NCPR| (color scale 0.02 to 0.14). Quadrant labels: correctly predicted to be expanded, incorrectly predicted to be expanded, correctly predicted to be collapsed, incorrectly predicted to be collapsed]

Fig. 7 Comparison of experimental and predicted x values for IDRs whose Rg values have been determined by small-angle X-ray scattering experiments (Table 3). Each IDR is colored by its |NCPR|. Squares represent sequences with a proline fraction greater than 15%. Black lines correspond to the cutoff between x values corresponding to more compact or expanded conformations compared to an FRC. The top right and bottom left quadrants correspond to IDRs that have been correctly predicted to be expanded and collapsed, respectively. In contrast, the top left and bottom right quadrants correspond to IDRs that have been incorrectly classified as expanded and collapsed, respectively

3.7 The Future of Predicting Conformation from Sequence

At this juncture, IDR conformations cannot be quantitatively predicted from sequence [12, 54, 55, 57]. Instead, simulations and/or experiments are necessary to extract conformational properties of a given IDR. Thus, the question becomes: How can we improve upon the current prediction methods to quantitatively predict IDR conformational properties? A one-size-fits-all theory/model is unlikely to capture the wide diversity of IDR sequences. As computers and algorithms get faster, one potential route is to simulate many sequences, on the order of 100s–1000s, in a particular composition/sequence region and use machine learning techniques to uncover the relationship between sequence features and conformation. These regions will likely have to evolve past the current diagram of states to include additional considerations of patterning, length, and proline content. Until such a method can be implemented, the best way to determine conformational properties of a given IDR without doing experiments is to run simulations. In recent years, atomistic simulation methods to study IDRs have improved in both efficiency and accuracy, making medium-throughput studies of IDRs accessible [58–61]. Additionally, IDR-specific coarse-grained methods have been developed which further enhance the throughput of determining conformational properties for many IDRs [57, 62, 63]. However, one still needs to be aware of force field-specific biases, including over-compaction of conformational preferences observed for many explicit solvent force fields and the assumptions utilized by coarse-grained models when using simulations to predict conformation (see Note 5) [64]. A recent study performed coarse-grained simulations on IDRs that were previously studied using small-angle X-ray scattering [62]. It was observed that even when compositionally similar IDRs have comparable ensemble-averaged conformational properties, such as length-normalized Rg, the underlying ensembles can be drastically different. These results suggest that even if a method to quantitatively predict ensemble-averaged conformational properties, such as Rg, Rh, or x, from sequence is developed, important conformational information may still be missing. For instance, two sequences may have similar ensemble-averaged conformational properties but have different underlying conformational features that modulate the function of the sequence. These conformational features, such as conformational heterogeneity and sequence-specific contacts, will be more difficult to predict directly from sequence. Thus, simulations and/or experiments may still need to be conducted in order to connect conformational properties to function.
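Any machine learning approach of the kind proposed above would start from per-sequence features. The following Python sketch computes three of the composition features discussed in this chapter (FCR, NCPR, and proline fraction) for a single sequence. It is only an illustration under stated assumptions: the function name is hypothetical, Asp/Glu are counted as negative and Lys/Arg as positive, and histidine is treated as neutral. Tools such as CIDER/localCIDER should be preferred for real analyses, including patterning parameters that this sketch does not compute.

```python
NEGATIVE = set("DE")  # assumption: Asp and Glu carry -1
POSITIVE = set("KR")  # assumption: Lys and Arg carry +1; His treated as neutral


def sequence_features(sequence):
    """Return length, FCR, NCPR, and proline fraction for a protein sequence."""
    seq = sequence.strip().upper()
    if not seq:
        raise ValueError("sequence must not be empty")
    n = len(seq)
    n_pos = sum(1 for aa in seq if aa in POSITIVE)
    n_neg = sum(1 for aa in seq if aa in NEGATIVE)
    return {
        "length": n,
        "fcr": (n_pos + n_neg) / n,    # fraction of charged residues
        "ncpr": (n_pos - n_neg) / n,   # net charge per residue (signed)
        "proline_fraction": seq.count("P") / n,
    }


if __name__ == "__main__":
    # Abeta(1-40) sequence as listed in Table 2
    abeta40 = "DAEFRHDSGYEVHHQKLVFFAEDVGSNKGAIIGLMVGGVV"
    for name, value in sequence_features(abeta40).items():
        print(f"{name}: {value}")
```

Feature dictionaries like this one, computed for many simulated sequences and paired with a simulated Rg, Rh, or x, could serve as the inputs and targets of a regression model. Ensemble-level details such as sequence-specific contacts are not captured by these averaged features.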

4 Notes

1. A high fraction of proline residues (≳15%) can promote more expanded conformations than may be expected by just considering FCR and |NCPR|. For one, proline is highly soluble and thus prefers to be solvated [28]. Additionally, proline can promote stiffness in conformations due to its own structural restrictions [65]. Thus, the boundaries of the diagrams of states may be different for IDRs with high proline content.

2. Length can also influence conformational properties of IDRs. To show this, the average length-normalized Rg was calculated for the 409 IDR linkers simulated by Harmon et al. as a function of sequence length and diagram of states region [56]. Figure 8 shows that the average length-normalized Rg generally increases in each region as the sequence becomes longer. These results imply that longer sequences may tend to be more expanded even if they have the same compositional properties as shorter sequences.

3. δmax can only truly be identified by enumerating all sequence combinations; however, this is often infeasible. Thus, CIDER/localCIDER use computational tricks to identify the δmax of an

[Figure 8: average length-normalized Rg as a function of sequence length for diagram-of-states regions R1, R2, R3, and R4]

30% residues) [2, 3]. These include the so-called intrinsically disordered regions (IDRs), which represent unstructured portions within well-folded proteins and between well-folded domains. Among transcription factors and signaling proteins, the percentage of long disordered regions increases up to 70%, underscoring the relevance of IDPs (or IDRs) in both physiological and pathological conditions [2, 3]. The intrinsic flexibility coupled with high conformational heterogeneity leads to multiple functional advantages for IDPs/IDRs that cannot be achieved by conventionally ordered

Birthe B. Kragelund and Karen Skriver (eds.), Intrinsically Disordered Proteins: Methods and Protocols, Methods in Molecular Biology, vol. 2141, https://doi.org/10.1007/978-1-0716-0524-0_19, © Springer Science+Business Media, LLC, part of Springer Nature 2020


Matteo Masetti et al.

structures [4]. Indeed, while a dominant folded structure is necessary to accomplish catalysis, transport, and other functions strictly dependent on well-defined three-dimensional features [5, 6], intrinsic disorder can be exploited to bind biomolecular counterparts specifically but transiently [7]. Owing to their ability to easily interconvert among distinct conformations, IDPs also excel in the recognition of several structurally unrelated partners, often exploiting disorder-to-order transitions upon binding [7]. This adaptability is reflected by the fact that IDPs are often found at “hubs” (central nodes with many partners) of large protein–protein interaction networks, being involved in key processes of cell biology like signaling and regulation of transcription and translation [8]. In this context, IDPs complement the functional role of ordered proteins, and it has been suggested they provide an evolutionarily successful strategy to achieve complex functions while economizing on the size of the genome [3, 9]. Unfortunately, this complexity is mirrored by the seriousness of the diseases associated with IDP misregulation. Point mutations, exposure to toxins, and alterations in post-translational modification of these proteins can easily lead to misfolding and in turn to aggregation and/or fibril formation [10]. It is therefore not surprising that many complex pathologies, such as cancer and neurodegenerative diseases, are related to aberrant IDP functionality [2, 10]. From this standpoint, IDPs have recently emerged as potential drug targets, even though their intrinsic flexibility and the lack of well-defined pockets and/or protein surfaces pose serious challenges for conventional drug discovery endeavors [11, 12]. IDPs have long been neglected by the scientific community because the available methods for structure determination are biased toward folded structures.
Indeed, well-established experimental techniques like X-ray crystallography are ill-suited to characterize the highly dynamic and heterogeneous ensemble of conformations covered by IDPs. Conversely, NMR spectroscopy, small-angle X-ray scattering (SAXS), and fluorescence Förster resonance energy transfer (FRET) are currently emerging as leading experimental techniques to tackle the problem of structural characterization of IDPs [3]. Describing the structure of proteins in terms of an ensemble is, however, not a trivial task, and all these methods rely on a pool of structures obtained by computational means [7, 13]. Two distinct approaches can be followed to this aim: selecting a limited number of structures from large conformational libraries, or generating them through the physically motivated sampling provided by molecular dynamics (MD) simulations [7, 13]. Provided that a reliable force field able to capture the relevant interactions at a molecular level is available, MD simulations are in principle able to generate the Boltzmann-weighted ensemble of structures by solving Newton’s equations of motion [14, 15]. Two main

Enhanced Sampling of IDPs


challenges exist for MD simulations in the field of IDPs. First and foremost, classical mechanics force fields have mostly been developed to treat structured proteins. When subsequently employed to model IDP conformational ensembles, they proved unable to stabilize the more extended states typically observed in experiments on disordered proteins, favoring instead the collapse into more compact structures [16–18]. In part, this is related to the limited computational power available, which until very recently did not allow folding and unfolding events to be observed through MD simulations and hence parameterization inaccuracies to be spotted. This aspect will not be further covered in this chapter. Here, we only mention that the development of accurate force fields to treat IDPs is an active area of research, and several efforts have been undertaken to validate them and correct their known deficiencies [19–26]. The second issue arising when simulating IDPs is related to the abovementioned limitation of sampling. Indeed, huge computational power would be required to exhaustively explore all the relevant states visited by these proteins. A viable option is to employ so-called enhanced sampling methods. These are sampling schemes specifically designed to speed up the exploration of phase space and to statistically characterize the main free energy minima and associated barriers in the regime of limited computational resources [27, 28]. Typically, two classes of enhanced sampling methods can be considered: those that rely on the notion of collective variables (CVs) and those that do not (often referred to as “tempering methods”) [27]. CVs are (linear or nonlinear) combinations of the atomic positions (q(x), where x is the system’s configuration) that can be used to describe the rare event one wishes to accelerate, for example, the sampling of the conformational space of an IDP, with the help of external biases.
In particular, metadynamics is one such CV-based enhanced sampling method, in which the exploration of the phase space is boosted by adding to the potential energy of the system (the force field) Gaussian-shaped potentials deposited at regular time intervals in the CV space (VG(q(x),t)). An added value of metadynamics is that, when suitable convergence conditions are met, the history-dependent bias also provides an unbiased estimate of the underlying free energy surface [29, 30]. It is important to note that the choice of CVs is critical for the efficiency and convergence not only of metadynamics but of all CV-based free energy calculation techniques. Several variants of metadynamics have been developed over the years. Here, we focus on the broadly employed well-tempered variant (WTMetaD), which often provides better convergence behavior than the original method [30]. The theory of WTMetaD has been recently reviewed in excellent papers [31, 32], including an interesting application presented in a chapter of this book series [33], and will not be further covered here. Instead of relying on CVs,


tempering methods enhance the exploration of phase space taking advantage of more convenient statistical ensembles [27]. This is usually (but not necessarily) accomplished by running in parallel several replicas of the system at different temperatures (Parallel Tempering, PT, or Replica Exchange MD, REMD) [34] or using scaled potential energy functions (Hamiltonian REMD [35] and variants) [36, 37]. At regular time intervals, exchanges between replicas are attempted according to a simple Metropolis Monte Carlo scheme, with the acceptance probability:

$$p(i \rightleftarrows j) = \min\left\{1, e^{\Delta^{PT}_{ij}}\right\}$$

with

$$\Delta^{PT}_{ij} = \left(\frac{1}{k_B T_i} - \frac{1}{k_B T_j}\right)\left[V(x_i) - V(x_j)\right],$$

where V(x) is the potential energy associated with configurations xi and xj of the system at temperatures Ti and Tj, respectively. As a consequence, low-temperature replicas are enriched in configurations sampled at high temperature, ensuring an efficient crossing of energy barriers [27]. CV-based and tempering methods can be considered as conceptually distinct enhanced sampling methods that can be conveniently combined in order to take advantage of their individual benefits and overcome their limitations. For example, metadynamics can be very effective in reconstructing the free energy surface for rare events once proper CVs have been identified through chemical intuition or careful analysis of previous simulations. Unfortunately, this is not always possible, and in real case scenarios, choosing adequate CVs can be rather difficult. Indeed, during a typical setup procedure, the user first performs trial simulations using CVs that are expected to reasonably describe the relevant event under investigation. Subsequently, by monitoring the system behavior during such preliminary runs, the quality of the choice is assessed, as these initial CVs can prove unsuitable or insufficient to guide the system toward a wide exploration of the configurational space.
Thus, the analysis of such trial runs can suggest the use of more suitable CVs, if not inspire the definition of new ones. When suboptimal CVs are employed, hysteresis effects and convergence problems in the reconstruction of the free energy are expected. Tempering methods do not require CVs to be defined, even though boosting all degrees of freedom at once (or a large part of them) can lead to less efficient exploration of the event one wishes to observe, and convergence of equilibrium populations can require very long simulation times. In the context of protein folding, metadynamics has long been successfully combined with PT/REMD (PTMetaD) [38], as the latter method provides an unspecific acceleration of all the orthogonal degrees of freedom not explicitly accounted for


by the chosen CVs. As a consequence, the convergence on the reconstructed free energy is highly improved. The exchange probability now must take into account the bias potential released in each replica [38]:

$$\Delta^{PTMetaD}_{ij} = \Delta^{PT}_{ij} + \frac{1}{k_B T_i}\left[V_{G,i}(q(x_i),t) - V_{G,i}(q(x_j),t)\right] + \frac{1}{k_B T_j}\left[V_{G,j}(q(x_j),t) - V_{G,j}(q(x_i),t)\right].$$
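The two exchange criteria can be sketched in a few lines of Python. This is only an illustration of the formulas given in the text, not the code used by GROMACS or PLUMED; the function names and the numerical energies in the usage note are hypothetical, and energies are assumed to be in kJ/mol.

```python
import math

# Gas constant in kJ/(mol K); assumption: energies are expressed in kJ/mol,
# the unit used for the bias parameters in this chapter.
KB = 0.008314462618


def delta_pt(v_i, v_j, t_i, t_j):
    """Exponent of a plain PT/REMD swap: (1/kB Ti - 1/kB Tj) * [V(x_i) - V(x_j)]."""
    return (1.0 / (KB * t_i) - 1.0 / (KB * t_j)) * (v_i - v_j)


def delta_ptmetad(v_i, v_j, t_i, t_j, vg_i_xi, vg_i_xj, vg_j_xj, vg_j_xi):
    """PTMetaD exponent: the PT term plus the two bias-potential terms.

    vg_i_xi is the bias of replica i evaluated at the CV value of x_i,
    vg_i_xj the same bias evaluated at the CV value of x_j, and so on.
    """
    bias_i = (vg_i_xi - vg_i_xj) / (KB * t_i)
    bias_j = (vg_j_xj - vg_j_xi) / (KB * t_j)
    return delta_pt(v_i, v_j, t_i, t_j) + bias_i + bias_j


def acceptance(delta):
    """Metropolis acceptance probability min(1, exp(delta))."""
    return 1.0 if delta >= 0.0 else math.exp(delta)
```

For instance, `acceptance(delta_pt(-1000.0, -990.0, 298.0, 310.8))` is below 1 because the colder replica holds the lower energy, whereas swapping the two energies gives an exponent that is positive, so the move is always accepted.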

A caveat in this combined approach is related to the performance of PT/REMD. In fact, for the method to be efficient, neighboring replicas must be close enough in temperature space to allow a high acceptance ratio in the Metropolis Monte Carlo moves. This requirement is met whenever a sufficient overlap between energy distributions is obtained. Unfortunately, the relative potential energy fluctuations decrease as the number of degrees of freedom N of the investigated system grows (ΔE ∝ √N, whereas the average energy grows as N) [35]. That is, in order to cover a certain temperature range, a large number of replicas is typically required. This, in turn, translates to high computational costs to carry out efficient PT/REMD simulations, strongly limiting the use of this enhanced sampling technique to small systems. However, this practical issue can be attenuated by taking advantage of the Well-Tempered Ensemble (WTE) [39]. Within the WTE formalism, the potential energy of the system is employed as a CV in a conventional WTMetaD simulation. It has been shown that this procedure has the effect of enhancing the fluctuations of the potential energy while preserving the average value of the unbiased ensemble. Moreover, when combined with a PT/REMD scheme (PT-WTE), the larger energy fluctuations (to an extent determined by the applied bias factor) allow covering a given temperature range with a significantly smaller number of replicas [39]. This approach, complemented with WTMetaD, where the bias is also applied over conventional CVs, is usually referred to as Parallel Tempering Metadynamics in the Well-Tempered Ensemble, or PTMetaD-WTE [40]. In this case, the exchange probability of the underlying PT/REMD scheme is formally the same as for PTMetaD, with the only exception that the bias on the potential energy as a CV enters the exchange probability too.
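The scaling argument above can be written out as a short worked relation. It is a sketch under the usual assumption of an extensive energy; the symbols σ_E (standard deviation of the potential energy) and γ (bias factor) are introduced here for illustration, and the √γ enhancement is the one quoted later in this chapter for the WTE setup.

```latex
\begin{align}
  \langle E \rangle \propto N, \qquad \sigma_E \propto \sqrt{N}
  \quad &\Longrightarrow \quad
  \frac{\sigma_E}{\langle E \rangle} \propto \frac{1}{\sqrt{N}}, \\
  \sigma_E^{\mathrm{WTE}} \simeq \sqrt{\gamma}\,\sigma_E
  \quad &\Longrightarrow \quad
  \gamma = 50:\ \sqrt{50} \approx 7.1 .
\end{align}
```

The energy distributions of neighboring replicas are separated by a gap that grows with N, while their widths grow only as √N, so larger systems need a denser temperature ladder unless the fluctuations are enhanced.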
In this chapter, following a recently published article [41], we will use PTMetaD-WTE to explore the relevant states of a well-known IDP, the C-terminal domain NTAIL (comprising amino acid residues 443–524) of the Sendai virus nucleoprotein [42]. Moreover, we will focus our attention on residues 472–497, since the central portion of the peptide (amino acid residues 476–492) shows a relatively high helical propensity, giving rise to differently folded states. In particular, H1 (residues 479–484), H2 (476–488), and H3 (478–492) are prevailing states


Fig. 1 Proposed conformational equilibrium in solution for the IDP NTAIL. The central region of the sequence, comprising residues 467–492, displays a high α-helical propensity, allowing this IDP to access differently folded states. Three arginines, through which NTAIL is able to make interactions with its biomolecular target, are highlighted

with an estimated helix population of 36%, 28%, and 11%, respectively, with the remaining 25% being unfolded (or random coil, RC; see Fig. 1) [43]. Notwithstanding the usefulness of PTMetaD-WTE in characterizing complex molecular systems, as witnessed by the rich literature exploiting this method [44–49], we stress that several other effective enhanced sampling approaches can be envisioned to study IDPs [50–52].

2 Materials

2.1 Software

As already mentioned in the previous paragraph, several enhanced sampling methods can be exploited to explore the conformational ensemble of IDPs. For each technique, different setup choices can lead to different combinations of software and scripting tools. In particular, the protocol that we suggest can be reproduced with the programs reported below, but other valid choices can be made. Moreover, for each software, the specific version that we employed to carry out the calculations is reported. The reader must bear in mind that different program versions can lead to slightly different results and that the provided input files are not necessarily supported by newer software versions. We refer to the documentation of each program for installation instructions and general information. A schematic representation of the whole workflow followed to carry out the simulations is provided to the reader in Fig. 2.

Fig. 2 Schematic workflow followed to perform the simulations described in this chapter

• Amber [53]. This is a suite of biomolecular simulation programs. It includes MD engines to carry out simulations with parallel CPU and GPU hardware, several useful auxiliary programs and scripting utilities to set up and analyze biomolecular systems, as well as the native Amber force field family. The Amber package is available for a license fee of $500 for academic or nonprofit organizations from http://ambermd.org/GetAmber.php (tested version: Amber14).

• GROMACS [54]. This is a package for performing molecular dynamics simulations. It can be compiled to carry out calculations with parallel CPU hardware, and it also supports very efficient GPU-based acceleration. The package also includes several auxiliary programs to set up and analyze biomolecular systems and many of the most popular force fields for biomolecular simulations. It can be freely downloaded from http://www.gromacs.org/ (tested version: GROMACS 4.6.7. Note that, starting from version 5.0, all of the tools that came with the software have become modules of a single binary named “gmx.” Please refer to the official website for further details).

• PLUMED [55]. This is an open-source plug-in for free energy calculations in molecular systems that works with some of the most popular MD engines, including GROMACS. The software can be freely downloaded from http://www.plumed.org/ (tested version: PLUMED 2.1. This version runs on CPU hardware).

• VMD [56]. This is a popular molecular visualization program that can also be used to set up molecular systems and analyze MD simulations. The software can be freely downloaded from https://www.ks.uiuc.edu/Research/vmd/ (tested version: VMD-1.9.2).

• Gnuplot [57]. This is a cross-platform command line plotting program. It can be freely downloaded from http://www.gnuplot.info/ (tested version: Gnuplot-4.6).

3 Methods

3.1 Preparation of the Molecular System

In the original article, the PTMetaD-WTE simulation was performed starting from a representative structure of the H1 state of the NTAIL peptide (see Fig. 1) [41]. Since this structure is not publicly available yet, we will build the peptide starting from its primary structure.

1. The Amber suite of programs provides an excellent starting point for any biomolecular simulation. Here, we will exploit the functionalities of the tleap module of Amber to build the molecular system from scratch (see Note 1). The peptide, with acetyl (ACE) and N-methylamide (NME) capping groups to neutralize the N- and C-termini, respectively, can be easily obtained as follows (see Note 2):

(a) foo = sequence {ACE ASN ASP GLU ... SER ALA THR NME}

(b) savepdb foo pept.pdb

(c) quit

The resulting file pept.pdb will be the input for the subsequent setup of the molecular system actually employed for the simulations (see Note 3).

2. Having obtained a starting structure for NTAIL, the next step is to generate the initial configuration of the whole molecular system, including explicit water molecules and ions, and the corresponding topology required by GROMACS. First of all, we need to choose an appropriate force field for our simulations. We choose the ff99SB∗ [58] modification of the Amber ff99SB [59] force field combined with the ff99SB-ILDN corrections [60]. This force field is not present in the GROMACS distribution and must be downloaded from the “User Contribution” section of the website (http://www.gromacs.org/Downloads/User_contributions/Force_fields). From a practical perspective, for the software to read this external force field, the downloaded directory containing all the parameter files should be placed either in the directory where the initial configuration of the system is built or, more generally and where administration privileges allow, in the share/gromacs/top path (i.e., where all the native force fields that come with the software are located) inside the GROMACS installation directory. We can now build the topology for the peptide with the pdb2gmx tool of GROMACS:

(a) pdb2gmx -ignh -f pept.pdb -o ntail.gro -p ntail.top

Right after the force field has been specified interactively, pdb2gmx will also ask for the water model to employ. For our system, as a reasonable compromise between accuracy of the simulations and computational cost, we chose the TIP3P water model, which was already employed in previous works in the context of IDPs producing satisfactory results (see Note 4).

3. The next step is to specify the shape and size of the simulation box containing the solvated peptide. We will use a cubic box with a side length determined by the diameter of the system plus a buffer of 18 Å in each direction (see Note 5). Then, we will fill the empty volume of the simulation box with water molecules. To do so, we will use the editconf and genbox tools in sequence:

(a) editconf -f ntail.gro -o ntail-box.gro -bt cubic -d 1.8 -c

(b) genbox -cp ntail-box.gro -cs spc216.gro -p ntail.top -o ntail-wat_only.gro
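To get a feeling for how the starting conformation affects the system size, the box side and the approximate number of water molecules can be estimated before running editconf and genbox. The sketch below is a back-of-the-envelope aid only: the function names are hypothetical, the water number density (about 33.4 molecules per nm³ near ambient conditions) is a rounded value, the volume occupied by the peptide is ignored, and the two solute diameters in the usage note are illustrative guesses rather than values measured on NTAIL.

```python
# Approximate number density of liquid water near ambient conditions (molecules per nm^3).
WATERS_PER_NM3 = 33.4


def cubic_box_side(solute_diameter_nm, buffer_nm=1.8):
    """Side of a cubic box: solute diameter plus the buffer on both sides."""
    return solute_diameter_nm + 2.0 * buffer_nm


def estimate_waters(side_nm):
    """Rough upper estimate of the water count, ignoring the solute volume."""
    return round(WATERS_PER_NM3 * side_nm ** 3)


if __name__ == "__main__":
    # Hypothetical diameters: a compact helical peptide versus a fully extended chain.
    for label, diameter in (("compact", 2.5), ("extended", 9.0)):
        side = cubic_box_side(diameter)
        print(f"{label}: side = {side:.1f} nm, about {estimate_waters(side)} waters")
```

Because the water count grows with the cube of the box side, an extended starting structure can inflate the system by roughly an order of magnitude, which is the concern raised in Note 3.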

4. We then use the tool genion to add the counterions to the system (see Note 6). In particular, genion replaces water molecules with the type of ions specified by the user (see Note 7). It requires a run input file (ions.tpr file) obtained through the GROMACS preprocessor grompp (see the GROMACS manual for further information):

(a) grompp -f ions.mdp -c ntail-wat_only.gro -p ntail.top -o ions.tpr

(b) genion -s ions.tpr -o ntail-wat_ions.gro -neutral -p ntail.top -pname NA -nname CL

The whole system can now be visualized with VMD (see Fig. 3).

3.2 Simulations

Unless otherwise stated, all the simulations are performed using the following general setup parameters:

• Constraints for bonds involving hydrogen atoms: LINCS algorithm [61] with default parameters.

• Cutoff for nonbonded interactions: 12 Å.

• Long-range electrostatics: Particle Mesh Ewald method [62], with 1.6 Å grid spacing and cubic interpolation.

• Leap-frog integrator [63], with a time step of 2 fs.

• Thermostat: Velocity rescaling [64], with a coupling time of 0.1 ps.

• Barostat: Parrinello-Rahman [65], with a coupling time of 2.0 ps.

Fig. 3 Starting from a pdb structure (folded region highlighted in blue, unfolded ones in white), the system is prepared for the MD simulations by specifying a simulation box of appropriate size (blue edges), filling it with solvent molecules, and adding a proper number of ions (yellow spheres)

In the following steps, we will report all the relevant details to equilibrate the replicas, to perform the preliminary PT-WTE simulation, and, finally, to carry out the actual production run of PTMetaD-WTE.

1. Since there might be clashes between the atoms of the newly prepared system, it is advisable to perform an energy minimization before starting the MD phases. A run consisting of 5000 steps of steepest descent is in general enough to remove all the possible bad contacts.

2. We now need to thermalize the system at different temperature values for the subsequent PT-WTE run. In particular, we will use eight replicas (replica index, 0–7) with temperatures T0 = 298.00, T1 = 310.80, T2 = 324.15, T3 = 338.07, T4 = 352.60, T5 = 367.73, T6 = 383.53, and T7 = 400.00 K (see Note 8). All replicas must be independently equilibrated at the target temperature (T0–7) in the canonical ensemble (also known as NVT ensemble, i.e., the statistical ensemble characterized by constant number of particles, N, constant volume, V, and constant temperature, T) (see Note 9).
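The eight temperatures listed in step 2 agree, to within about 0.01 K, with a geometric progression between 298 and 400 K. Whether the ladder was actually generated this way is an assumption, but the following sketch (with the hypothetical helper `geometric_ladder`) shows how such a ladder can be produced for a different range or number of replicas.

```python
def geometric_ladder(t_min, t_max, n_replicas):
    """Temperatures with a constant ratio between neighboring replicas."""
    ratio = (t_max / t_min) ** (1.0 / (n_replicas - 1))
    return [t_min * ratio ** i for i in range(n_replicas)]


# Ladder for the setup of this chapter: eight replicas between 298 and 400 K.
ladder = geometric_ladder(298.0, 400.0, 8)

# Values quoted in step 2, kept here only for comparison.
listed = [298.00, 310.80, 324.15, 338.07, 352.60, 367.73, 383.53, 400.00]

if __name__ == "__main__":
    for i, (computed, quoted) in enumerate(zip(ladder, listed)):
        print(f"T{i}: computed {computed:.2f} K, quoted {quoted:.2f} K")
```

A constant temperature ratio is a common starting point because, for a roughly constant heat capacity, it tends to give similar exchange rates along the ladder; the actual acceptance should still be checked on the running simulation.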


3. If we were to run a PT/REMD simulation with the eight replicas as prepared in the previous step, we would obtain a very low acceptance ratio due to the poor overlap between energy distributions. Therefore, we will now enhance the energy fluctuations of each replica by performing a WTMetaD simulation using the potential energy as a CV (WTE). Moreover, we combine WTE with the PT/REMD formalism to allow configurational exchanges (PT-WTE). To do so, we need to prepare, for each replica:

(a) a GROMACS input file: prod_nvt_${i}.mdp

(b) a PLUMED input file: plumed.dat.${i} (see Note 10)

where the ${i} flag stands for the index of the ith replica (0, 1, ..., 7). These files will be identical except for the nominal temperature specified.

4. Within the PLUMED input file, all the relevant instructions for the program to carry out the PT-WTE simulation must be defined. To this end, the METAD action tells PLUMED to activate a metadynamics simulation using “ene” as CV. Here, “ene” is a label that we use to refer to the ENERGY collective variable. We use a frequency for Gaussian deposition of 250 steps, a Gaussian height of 2.5 kJ/mol, and a width of 500 kJ/mol (values specified for the PACE, HEIGHT, and SIGMA keywords, respectively). The Gaussian bias, for each replica, will be saved in a file named “HILLS_PTWTE.${i}.” In WTE, the energy fluctuations will be enhanced by a factor equal to the square root of the bias factor. Here, we set the BIASFACTOR to 50, which corresponds approximately to seven times larger fluctuations compared to those experienced in the canonical ensemble. The final PLUMED input file will look like:

ene: ENERGY
wte: METAD ARG=ene PACE=250 HEIGHT=2.5 SIGMA=500.0 FILE=HILLS_PTWTE BIASFACTOR=50.0 TEMP=${i}
PRINT ARG=ene,wte.bias STRIDE=50 FILE=COLVAR
ENDPLUMED

As can be noticed, the METAD action is also associated with a label (“wte”), which is used as an argument (ARG keyword) by the PRINT directive. In particular, here we ask PLUMED to print in a file named “COLVAR.${i}” (FILE keyword) the values of the “ene” CV and the total bias experienced by the system (“wte.bias”) every 50 MD steps.

5. The PT-WTE simulation can finally be started with the usual mdrun tool of GROMACS-4.6.7, as in conventional PT/REMD runs. The only caveat, here, is to activate the PLUMED plug-in with the dedicated flag -plumed. The number of replicas and the frequency of exchange attempts (every 100 steps) are controlled by the GROMACS native flags -multi and -replex, respectively:

mdrun -s ntail.tpr -plumed plumed.dat -multi 8 -replex 100
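Since the per-replica PLUMED files of steps 3 and 4 differ only in the temperature, they can be generated with a short script rather than edited by hand. The sketch below is one possible way to do it, not the procedure used by the authors: the function name `write_plumed_inputs` is hypothetical, and it assumes that the TEMP keyword should carry the nominal temperature of each replica in kelvin. The file naming follows the plumed.dat.${i} convention given in the text.

```python
from pathlib import Path

# Nominal temperatures of the eight replicas, as listed in step 2.
TEMPERATURES = [298.00, 310.80, 324.15, 338.07, 352.60, 367.73, 383.53, 400.00]

# PT-WTE input of step 4; only TEMP changes from one replica to the next (assumption).
TEMPLATE = """ene: ENERGY
wte: METAD ARG=ene PACE=250 HEIGHT=2.5 SIGMA=500.0 FILE=HILLS_PTWTE BIASFACTOR=50.0 TEMP={temp:.2f}
PRINT ARG=ene,wte.bias STRIDE=50 FILE=COLVAR
ENDPLUMED
"""


def write_plumed_inputs(outdir, temperatures=TEMPERATURES):
    """Write plumed.dat.0, plumed.dat.1, ... into outdir and return their paths."""
    outdir = Path(outdir)
    outdir.mkdir(parents=True, exist_ok=True)
    paths = []
    for index, temp in enumerate(temperatures):
        path = outdir / f"plumed.dat.{index}"
        path.write_text(TEMPLATE.format(temp=temp))
        paths.append(path)
    return paths


if __name__ == "__main__":
    for written in write_plumed_inputs("."):
        print("wrote", written)
```

The same loop can be reused for the GROMACS prod_nvt_${i}.mdp files by swapping the template for an .mdp file in which the reference temperature is the placeholder.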

6. The enhanced fluctuations of the PT-WTE run can be monitored with the Gnuplot plotting tool, using the following command line:

plot "COLVAR.${i}" u 1:2 w l

Repeating the same command for all the COLVAR files, one should get a plot similar to Fig. 4. The aim of this procedure is to achieve an improved overlap between the potential energies of neighboring replicas, so that the exchanges can occur. As such, it is sufficient to extend the PT-WTE run until this condition is fulfilled. For instance, as shown in Fig. 4, for our system, a simulation time of 5 ns was adequate to observe a substantial increase in the overlap and thus record exchanges between the neighboring replicas.

Fig. 4 Time series of the energy fluctuations at increasing (from purple to black) reference temperatures. Taking advantage of the WTE, the overlap between adjacent replicas becomes substantially improved by increasing the temperature

7. The bias deposited in the previous steps will be instrumental to ensure a good acceptance ratio in the subsequent PTMetaD-WTE. Similarly to step 4, we now need to prepare a new set of GROMACS and PLUMED input files for each replica, differing only in the value of the nominal temperature. In this case, however, we need to instruct PLUMED to read the previously released bias (saved in “HILLS_PTWTE.${i}”) as a constant bias. To this aim, we will use the RESTART keyword and a fake deposition frequency of 999,999,999 in the “wte”-labeled metadynamics. This will trick PLUMED into reading the bias, but due to the unreasonably low frequency of deposition, it will actually not have the chance to modify it. Moreover, in the same input file, we also need to specify that we will perform WTMetaD (label: “metad”) as a function of the CVs that we decide to explicitly bias (this second bias will be saved in a file called “HILLS_PTMetaDWTE.${i}”).

8. Since the folded states of NTAIL involve the formation of α-helix structures, we use the ALPHARMSD collective variable to describe the helicity content of the peptide [66]. As a second degree of freedom, we use the radius of gyration (GYRATION), which is a general purpose CV suited to describe the degree of compactness of a polypeptide sequence [67]. These CVs will be labeled as “a” and “g,” respectively, within the PLUMED input file, and to make them functional, we need to specify several other parameters controlled by specific keywords as reported in the Web manual. Here, we will only focus on the most important ones. Concerning the ALPHARMSD variable, we need to tell PLUMED which residues of the protein we wish to check for the helicity content. This feature is controlled by the RESIDUES keyword. In our case, the content of formed α-helix must be evaluated on the whole sequence, so we will use “all” as the value of the keyword, meaning that all residues will be considered for this action. This requires instructing PLUMED with the peptide topology, and this is in turn controlled by the MOLINFO action, which is fed with a reference PDB that we supply as an input (specified via the STRUCTURE keyword). Conversely, the atoms employed to evaluate the GYRATION CV are reported in a comma-separated list as an argument of the ATOMS keyword (in this case, the PDB index of the 26 Cα atoms of the peptide). Finally, the WHOLEMOLECULES action is mandatory if we want PLUMED to reconstruct the chains involved in the specific CV (with the keyword ENTITY0=1-427) when using MD engines like GROMACS that break up the molecules when crossing the sides of the simulation box.

9. WTMetaD is then performed using a Gaussian deposition frequency of 500 steps, a Gaussian height of 0.4184 kJ/mol, a bias factor of 15, and Gaussian widths of 0.15 and 0.025 nm in the space of “a” and “g,” respectively. The whole PTMetaD-WTE input file should look as follows:

RESTART
MOLINFO STRUCTURE=reference.pdb
WHOLEMOLECULES STRIDE=1 ENTITY0=1-427
a: ALPHARMSD RESIDUES=all TYPE=DRMSD R_0=0.08 NN=8 MM=12
g: GYRATION TYPE=RADIUS ATOMS=9,23,35,50,62,78,89,101,120,135,159,183,202,212,229,253,272,282,297,321,345,362,377,389,400,410
ene: ENERGY
wte: METAD ARG=ene PACE=999999999 HEIGHT=2.5 SIGMA=500.0 FILE=HILLS_PTWTE BIASFACTOR=50.0 TEMP=298.0
metad: METAD ARG=a,g PACE=500 HEIGHT=0.4184 SIGMA=0.15,0.025 FILE=HILLS_PTMetaDWTE BIASFACTOR=15.0 TEMP=${i}
PRINT ARG=a,g,ene,wte.bias,metad.bias STRIDE=50 FILE=COLVAR
ENDPLUMED
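The comma-separated list of Cα indices for the GYRATION action can be extracted from the reference PDB instead of being typed by hand. The following sketch relies only on the fixed-column layout of PDB ATOM records (serial number in columns 7–11, atom name in columns 13–16); the function names are hypothetical, and it assumes that the serial numbers in the reference PDB match the atom numbering seen by PLUMED, which should be verified for the actual system.

```python
def ca_serials(pdb_text):
    """Return the serial numbers of all CA atoms found in ATOM records."""
    serials = []
    for line in pdb_text.splitlines():
        # Columns 13-16 hold the atom name; columns 7-11 hold the serial number.
        if line.startswith("ATOM") and line[12:16].strip() == "CA":
            serials.append(int(line[6:11]))
    return serials


def atoms_keyword(serials):
    """Format the serials as the value of the PLUMED ATOMS keyword."""
    return "ATOMS=" + ",".join(str(serial) for serial in serials)


if __name__ == "__main__":
    # reference.pdb is the file passed to MOLINFO in the input above.
    with open("reference.pdb") as handle:
        print(atoms_keyword(ca_serials(handle.read())))
```

For the 26-residue peptide of this chapter, the resulting list should contain 26 entries, one per residue, excluding the capping groups.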

10. We can finally launch the PTMetaD-WTE simulation with a command similar to the one used in step 5. The effectiveness of the bias provided by the previous WTE run can be assessed by checking the diffusion of the replicas across the temperature ladder (see Fig. 5) and, if desired, by computing the round-trip time. The journey of each replica in temperature space can be followed with the demux.pl script included in GROMACS. Specifically, running the script on the MD simulation log file produced by the GROMACS tool mdrun,

demux.pl md0.log

generates the replica_temp.xvg file. In replica_temp.xvg, each replica is represented by a column, where the different indices, in ascending order starting from zero, designate the different temperatures visited (see Note 11). In principle, the simulation should be extended until a converged free energy surface as a function of the two CVs is obtained (the utility sum_hills coming with the PLUMED package can be used to this aim) (see Note 12). Based on our experience, the previous setup requires at least 300 ns (per replica) of sampling to explore the whole CV space. Then, the simulation can be further extended to reduce the statistical error, depending on the availability of computational resources.

Fig. 5 Diffusion of replica 0 in the temperature space. As the plot shows, the initial configuration, from which the simulation was started at the lowest temperature, roamed freely across the different temperatures considered. The first 50 ns of the PTMetaD-WTE simulation are displayed
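The round-trip time mentioned in step 10 can be estimated directly from replica_temp.xvg. The sketch below is one possible definition, not a GROMACS or PLUMED utility: a round trip is counted when a replica goes from the lowest temperature index to the highest and back to the lowest. It assumes that the first column of the file is the time and that each following column holds the temperature index of one replica, with comment and header lines starting with "#" or "@"; the function names are hypothetical.

```python
def read_xvg(text):
    """Parse numeric rows of an .xvg file, skipping '#' and '@' lines."""
    rows = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line[0] in "#@":
            continue
        rows.append([float(field) for field in line.split()])
    return rows


def round_trip_times(times, indices, top):
    """Durations of completed trips: index 0 -> index top -> index 0."""
    trips = []
    start = None          # time at which the current trip left the bottom
    reached_top = False   # whether the top index was visited since 'start'
    for time, index in zip(times, indices):
        if index == 0:
            if start is None:
                start = time
            elif reached_top:
                trips.append(time - start)
                start = time
                reached_top = False
        elif index == top and start is not None:
            reached_top = True
    return trips


if __name__ == "__main__":
    with open("replica_temp.xvg") as handle:
        rows = read_xvg(handle.read())
    times = [row[0] for row in rows]
    n_replicas = len(rows[0]) - 1
    for replica in range(n_replicas):
        column = [int(row[replica + 1]) for row in rows]
        trips = round_trip_times(times, column, top=n_replicas - 1)
        mean = sum(trips) / len(trips) if trips else float("nan")
        print(f"replica {replica}: {len(trips)} round trips, mean time {mean}")
```

Short and similar round-trip times across replicas suggest that the WTE bias is giving the ladder enough overlap; a replica that never completes a trip points to a bottleneck between two temperatures.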

4 Notes

1. In the vast majority of non-IDP applications, the system is generated starting from available PDB structures. This is mostly the case when the simulations are devoted to disentangling the behavior of complex globular proteins found in their specific native states. As such, reliable structures from which to initiate the simulations become a strong requirement. Indeed, if not already determined by experiments, reconstructing such complex structural organization from scratch would not be a trivial task. Conversely, when the initial conformation is not relevant, it is reasonable to build a linear polymer directly from the sequence of amino acids. This is for instance true when, over fractions of the simulation timescale, the nature of the solute allows for substantial and frequent conformational rearrangements, or when the employed sampling technique promotes the exploration of distant regions of the conformational space. In such cases, a convenient solution to construct the linear peptide chain is to use the aforementioned commands within the tleap module of Amber.

2. The tleap module of Amber requires a force field library to be loaded. Since the force field that we will use for simulations will be chosen in the following steps exploiting GROMACS tools, the force field loaded by tleap is in this case entirely irrelevant. Thus, prior to specifying the desired amino acid sequence, tleap needs to be first instructed, for instance, with the following command:

source leaprc.protein.ff14SB

3. tleap will build the peptide in a fully linear conformation. On the one hand, this is an optimal choice if one is interested in benchmarking the enhanced sampling method. On the other hand, a 26-residue-long sequence in a linear conformation will lead to huge simulation boxes when explicit solvent is added (see also Note 5). Therefore, depending on the available computational resources, it might be convenient to manually bend the peptide by operating on the phi/psi backbone angles. This can be easily done with the Molefacture module of VMD. Another possibility would be running a quick implicit solvent simulation with the Amber MD code to obtain a more compact conformation of the peptide (see, e.g., the tutorial: http://ambermd.org/tutorials/basic/tutorial3/).

4. As we already mentioned in the Introduction, the force field used in this chapter is not specifically designed for IDPs. Notably, the inaccuracies of standard force fields in modeling disordered states are increasingly being recognized as related to a poor representation of protein-water interactions. Thus, improvements in this respect have been recently pursued [20, 26, 68]. For instance, corrections on the water dispersion interactions [20] or a scaling of the Lennard-Jones well depth between the water oxygen and all the protein atoms [26] have been shown to produce results in better agreement with experimental observations. Therefore, depending on the purpose of the calculations, more reliable force fields can be employed. However, notice that this can easily translate into a substantial increase in the computational burden involved, as these choices often correspond to employing more computationally demanding water models.

5. The shape and size of the system come from a careful compromise with the available computational resources. In general, the simulation box should be large enough to prevent the solute in the unit cell from interacting with its images in the periodic array.
In the case of folded structures, setting the box basing on the diameter of the protein (largest distance between solutes’ atoms) is a wise option. When using cubic boxes, this choice leads to a large number of solvent molecules at the edges of the box which are unnecessarily making the simulations more demanding. For this reason, other shapes of the box can be envisioned, like a triclinic, rhombic dodecahedron, or truncated octahedron system. When the purpose of the simulation is to study folding/unfolding events like in our case, one should also consider an extra buffer to allow the peptide to adopt sufficiently extended conformations. Starting from the H1 state of NTAIL, this condition was reasonably met with a ˚ from the solute and the side of the box (see buffer of 18 A Fig. 3).


6. The force field parameters used to treat the counterions are a critical and often overlooked choice in simulations. We suggest using a recent reparameterization of the most common ions, consistent with the Amber family of force fields, provided by Joung and Cheatham [69]. These parameters can be easily added to the ffnonbonded.itp file of the GROMACS port of the Amber force fields.

7. There is apparently no agreement in the IDP literature on the number of ions that should be added to the system. Here, we simply add the number of ions required to neutralize the net total charge, but concentrations of 100–150 mM for at least sodium and chloride should be considered when one is interested in better mimicking physiological conditions. More generally, when the outcomes of simulations are to be compared with experimental data produced under particular ionic conditions, the system setup should reproduce those conditions as closely as possible.

8. When running PT/REMD simulations, one must choose the temperature range to be spanned, the number of replicas, and the nominal temperature assigned to each replica. As a rule of thumb, the highest temperature must be high enough to efficiently cross the potential energy barriers expected along the process one wishes to observe. In this specific case, we use PT/REMD to improve the efficiency of WTMetaD sampling; therefore, 400 K as the highest temperature should be enough. This assumption should be checked in retrospect, for example, by monitoring the diffusion of the system in configuration space on the timescale of tens of nanoseconds. In conventional PT/REMD, the spanned temperature range is also related to the number of replicas that can be run in parallel with the available computational resources.
As already mentioned in the Introduction, replicas must be close enough in temperature space to allow efficient exchanges, and this requirement is met only if a good overlap in potential energy distributions is observed between adjacent replicas. Assuming a constant heat capacity for the system along the spanned temperature range, another useful rule of thumb to make optimal use of resources is to distribute replicas according to a geometric progression [70]:

\[ T_i = T_{\min} \left( \frac{T_{\max}}{T_{\min}} \right)^{\frac{i-1}{N-1}} \]

with $i = 1, \ldots, N$, where $N$ is the number of replicas.
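The geometric progression of Note 8 is simple to tabulate before setting up the replicas. The following is a minimal sketch only: the function name, the choice of 300 K as the lowest temperature, and the use of eight replicas are illustrative assumptions (only the 400 K upper limit is taken from the text).

```python
def geometric_ladder(t_min, t_max, n_replicas):
    """Temperatures T_i = T_min * (T_max / T_min) ** ((i - 1) / (N - 1)), i = 1..N."""
    if n_replicas < 2:
        raise ValueError("at least two replicas are needed")
    ratio = t_max / t_min
    # range() counts from 0, so the loop index j corresponds to i - 1.
    return [t_min * ratio ** (j / (n_replicas - 1)) for j in range(n_replicas)]


# Assumed example values: 300 K and 8 replicas are hypothetical choices.
ladder = geometric_ladder(300.0, 400.0, 8)
print(" ".join(f"{t:.2f}" for t in ladder))
```

In such a ladder the ratio between adjacent temperatures is constant, which is what gives comparable potential energy overlap between neighbors under the constant heat capacity assumption. The exchange rates observed in a short test run remain the real check.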

9. The replicas can be equilibrated in parallel (i.e., independently of each other) or, more conveniently, sequentially. That is, the equilibrated system at temperature $T_i$ will serve


as the starting configuration for both the ith replica of the following PT/REMD simulation and the equilibration of the next replica in the temperature ladder ($T_{i+1}$).

10. All the PLUMED input files required to reproduce the results reported in this paper are available on PLUMED-NEST (www.plumed-nest.org) [71], the public repository of the PLUMED consortium, as plumID:19.020.

11. The script demux.pl that comes with the GROMACS software outputs two files: replica_temp.xvg and replica_index.xvg. The latter can be used together with the trjcat tool of GROMACS to reconstruct the continuous trajectory of each replica, that is, the trajectory of each initial configuration in its journey across temperature space. This is achieved via:

    trjcat -f traj_comp?.xtc -demux replica_index.xvg

In principle, if proper diffusion is achieved, each system should display the same behavior and thus explore the CV space equally. As a practical note, when the simulation is produced as a series of sequential runs, all the log files must be merged into a single md0.log file before it is fed to demux.pl. The same reasoning holds for the trajectories given to trjcat as input along with replica_index.xvg (i.e., for each replica, all the trajectory segments should be merged into a single one).

12. By enhancing the potential energy fluctuations, the WTE ensemble distorts the canonical distribution of states. Thus, if an accurate reconstruction of the free energy is needed, this effect has to be taken into account during a reweighting procedure [72]. To this end, the bias gathered from the two metadynamics runs must first be computed, namely, the one employing the potential energy as a CV (PT-WTE step) and the one using the more typical definition of CVs while keeping static the bias deposited on the potential energy (PTMetaD-WTE step). While the metad.bias of the latter can be easily retrieved via the driver utility of PLUMED, this is not the case for the bias on the potential energy, as it requires computing that energy at each trajectory frame. In this case, the -rerun flag of the GROMACS mdrun tool must be used to read the simulation trajectory and obtain the required metad.bias. The total bias acting on each trajectory frame is then the sum of the two computed biases, and the corresponding weight can be used to build a histogram and in turn retrieve the free energy. Alternatively, the reweighting step can be avoided by following the smart workaround proposed by Sutto and Gervasio [47]. Accordingly, a “white” replica at


the temperature of interest is added to the PTMetaD-WTE simulation. Such a replica does not hold any constant energy bias, but it will nonetheless exchange efficiently with the neighboring replicas, thanks to the enhanced fluctuations provided to them by the WTE ensemble. Thus, the free energy along the chosen CVs can be directly reconstructed from the white replica without the need for any reweighting procedure.
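The reweighting route described in Note 12 can be sketched as follows. This is only an illustration under stated assumptions: both biases are treated as static per-frame values (e.g., already extracted with the PLUMED driver and a GROMACS rerun), the weight of a frame is taken as the exponential of the total bias divided by kT, and the function and variable names are hypothetical. A time-dependent bias needs the more careful treatment of ref. [72].

```python
import math

KB = 0.0083144626  # Boltzmann constant in kJ/(mol K)


def free_energy_profile(cv, bias_energy, bias_cv, temperature, n_bins, cv_min, cv_max):
    """Weighted histogram of one CV, returned as free energies (kJ/mol, minimum at 0)."""
    kt = KB * temperature
    total_bias = [b1 + b2 for b1, b2 in zip(bias_energy, bias_cv)]
    shift = max(total_bias)  # subtract the maximum to avoid overflow in exp()
    weights = [math.exp((v - shift) / kt) for v in total_bias]
    width = (cv_max - cv_min) / n_bins
    hist = [0.0] * n_bins
    for s, w in zip(cv, weights):
        if s < cv_min or s >= cv_max:
            continue  # frames outside the histogram range are ignored
        hist[int((s - cv_min) / width)] += w
    norm = sum(hist)
    if norm == 0.0:
        raise ValueError("no frames fall inside the histogram range")
    fes = [-kt * math.log(h / norm) if h > 0.0 else math.inf for h in hist]
    f_min = min(fes)
    return [f - f_min for f in fes]


# Toy input (hypothetical): two frames, the second carrying a bias of kT ln 2.
kt = KB * 300.0
cv_values = [0.25, 0.75]
wte_bias = [0.0, 0.0]
metad_bias = [0.0, kt * math.log(2.0)]
fes = free_energy_profile(cv_values, wte_bias, metad_bias, 300.0, 2, 0.0, 1.0)
print(fes)
```

In the toy input the more strongly biased frame receives twice the weight, so its bin ends up lower in free energy by kT ln 2.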

Acknowledgments

The authors wish to thank Fabio Pietrucci and Giovanni Bussi for their invaluable feedback and useful discussion.

References

1. Wright PE, Dyson HJ (1999) Intrinsically unstructured proteins: re-assessing the protein structure-function paradigm. J Mol Biol 293:321–331
2. Uversky VN (2011) Intrinsically disordered proteins from A to Z. Int J Biochem Cell Biol 43:1090–1103
3. Gibbs EB, Showalter SA (2015) Quantitative biophysical characterization of intrinsically disordered proteins. Biochemistry 54:1314–1326
4. Tompa P (2012) Intrinsically disordered proteins: a 10-year recap. Trends Biochem Sci 37:509–516
5. Uversky VN, Oldfield CJ, Midic U et al (2009) Unfoldomics of human diseases: linking protein intrinsic disorder with diseases. BMC Genomics 10:S1–S7
6. Habchi J, Tompa P, Longhi S et al (2014) Introducing protein intrinsic disorder. Chem Rev 114:6561–6588
7. Varadi M, Vranken W, Guharoy M et al (2015) Computational approaches for inferring the functions of intrinsically disordered proteins. Front Mol Biosci 2:45
8. Dyson HJ, Wright PE (2005) Intrinsically unstructured proteins and their functions. Nat Rev Mol Cell Biol 6:197–208
9. Gunasekaran K, Tsai C-J, Kumar S et al (2003) Extended disordered proteins: targeting function with less scaffold. Trends Biochem Sci 28:81–85
10. Uversky VN, Oldfield CJ, Dunker AK (2008) Intrinsically disordered proteins in human diseases: introducing the D2 concept. Annu Rev Biophys 37:215–246
11. Joshi P, Vendruscolo M (2015) Druggability of intrinsically disordered proteins. In: Felli IC, Pierattelli R (eds) Intrinsically disordered proteins studied by NMR spectroscopy. Advances in experimental medicine and biology, vol 870. Springer, Cham, p 383
12. Recanatini M (2018) How dynamic docking simulations can help to tackle tough drug targets. Future Med Chem 10:2763–2765
13. Tompa P, Varadi M (2014) Predicting the predictive power of IDP ensembles. Structure 22:177–178
14. Masetti M, Rocchia W (2014) Molecular mechanics and dynamics: numerical tools to sample the configuration space. Front Biosci (Landmark Ed) 19:578–604
15. De Vivo M, Masetti M, Bottegoni G et al (2016) Role of molecular dynamics and related methods in drug discovery. J Med Chem 59:4035–4061
16. Nettels D, Müller-Späth S, Küster F et al (2009) Single-molecule spectroscopy of the temperature-induced collapse of unfolded proteins. Proc Natl Acad Sci U S A 106:20740–20745
17. Merchant KA, Best RB, Louis JM et al (2007) Characterizing the unfolded states of proteins using single-molecule FRET spectroscopy and molecular simulations. Proc Natl Acad Sci U S A 104:1528–1533
18. Voelz VA, Jäger M, Yao S et al (2012) Slow unfolded-state structuring in acyl-CoA binding protein folding revealed by simulation and experiment. J Am Chem Soc 134:12565–12577
19. Best RB, Mittal J (2010) Protein simulations with an optimized water model: cooperative helix formation and temperature-induced unfolded state collapse. J Phys Chem B 114:14916–14923
20. Piana S, Donchev AG, Robustelli P et al (2015) Water dispersion interactions strongly influence simulated structural properties of disordered protein states. J Phys Chem B 119:5113–5123
21. Ye W, Ji D, Wang W et al (2015) Test and evaluation of ff99IDPs force field for intrinsically disordered proteins. J Chem Inf Model 55:1021–1029
22. Palazzesi F, Prakash MK, Bonomi M et al (2015) Accuracy of current all-atom force-fields in modeling protein disordered states. J Chem Theory Comput 11:2–7
23. Huang J, MacKerell AD (2018) Force field development and simulations of intrinsically disordered proteins. Curr Opin Struct Biol 48:40–48
24. Liu H, Song D, Lu H et al (2018) Intrinsically disordered protein-specific force field CHARMM36IDPSFF. Chem Biol Drug Des 92:1722–1735
25. Robustelli P, Piana S, Shaw DE (2018) Developing a molecular dynamics force field for both folded and disordered protein states. Proc Natl Acad Sci U S A 115:4758–4766
26. Best RB, Zheng W, Mittal J (2014) Balanced protein–water interactions improve properties of disordered proteins and non-specific protein association. J Chem Theory Comput 10:5113–5124
27. Abrams C, Bussi G (2014) Enhanced sampling in molecular dynamics using Metadynamics, replica-exchange, and temperature-acceleration. Entropy 16:163
28. Camilloni C, Pietrucci F (2018) Advanced simulation techniques for the thermodynamic and kinetic characterization of biological systems. Adv Phys X 3:1477531
29. Laio A, Parrinello M (2002) Escaping free-energy minima. Proc Natl Acad Sci U S A 99:12562–12566
30. Barducci A, Bussi G, Parrinello M (2008) Well-tempered metadynamics: a smoothly converging and tunable free-energy method. Phys Rev Lett 100:020603
31. Barducci A, Bonomi M, Parrinello M (2011) Metadynamics. Wiley Interdiscip Rev Comput Mol Sci 1:826–843
32. Bussi G, Branduardi D (2015) Free-energy calculations with Metadynamics: theory and practice. In: Parrill AL, Lipkowitz KB (eds) Reviews in computational chemistry, vol 28. Springer, New York, p 1
33. Elvati P, Violi A (2012) Free energy calculation of Permeant–membrane interactions using molecular dynamics simulations. In: Reineke J (ed) Nanotoxicity: methods and protocols. Humana Press, Totowa, NJ, p 189
34. Sugita Y, Okamoto Y (1999) Replica-exchange molecular dynamics method for protein folding. Chem Phys Lett 314:141–151
35. Fukunishi H, Watanabe O, Takada S (2002) On the Hamiltonian replica exchange method for efficient sampling of biomolecular systems: application to protein structure prediction. J Chem Phys 116:9058–9067
36. Liu P, Kim B, Friesner RA et al (2005) Replica exchange with solute tempering: a method for sampling biological systems in explicit water. Proc Natl Acad Sci U S A 102:13749–13754
37. Bussi G (2014) Hamiltonian replica exchange in GROMACS: a flexible implementation. Mol Phys 112:379–384
38. Bussi G, Gervasio FL, Laio A et al (2006) Free-energy landscape for β hairpin folding from combined parallel tempering and Metadynamics. J Am Chem Soc 128:13435–13441
39. Bonomi M, Parrinello M (2010) Enhanced sampling in the well-tempered ensemble. Phys Rev Lett 104:190601
40. Deighan M, Bonomi M, Pfaendtner J (2012) Efficient simulation of explicitly solvated proteins in the well-tempered ensemble. J Chem Theory Comput 8:2189–2192
41. Bernetti M, Masetti M, Pietrucci F et al (2017) Structural and kinetic characterization of the intrinsically disordered protein SeV NTAIL through enhanced sampling simulations. J Phys Chem B 121:9572–9582
42. Skiadopoulos MH, Surman SR, Riggs JM et al (2002) Sendai virus, a murine Parainfluenza virus type 1, replicates to a level similar to human PIV1 in the upper and lower respiratory tract of African green monkeys and chimpanzees. Virology 297:153–160
43. Jensen MR, Houben K, Lescop E et al (2008) Quantitative conformational analysis of partially folded proteins from residual dipolar couplings: application to the molecular recognition element of Sendai virus nucleoprotein. J Am Chem Soc 130:8055–8061
44. Barducci A, Bonomi M, Parrinello M (2010) Linking well-tempered Metadynamics simulations with experiments. Biophys J 98:44–46
45. Barducci A, Bonomi M, Prakash MK et al (2013) Free-energy landscape of protein oligomerization from atomistic simulations. Proc Natl Acad Sci U S A 110:4708–4713
46. Palazzesi F, Barducci A, Tollinger M et al (2013) The allosteric communication pathways in KIX domain of CBP. Proc Natl Acad Sci U S A 110:14237–14242
47. Sutto L, Gervasio FL (2013) Effects of oncogenic mutations on the conformational free-energy landscape of EGFR kinase. Proc Natl Acad Sci U S A 110:10616–10621
48. Lovera S, Morando M, Pucheta-Martinez E et al (2015) Towards a molecular understanding of the link between Imatinib resistance and kinase conformational dynamics. PLoS Comput Biol 11:e1004578
49. Kuzmanic A, Sutto L, Saladino G et al (2017) Changes in the free-energy landscape of p38α MAP kinase through its canonical activation and binding events as studied by enhanced molecular dynamics simulations. Elife 6:e22175
50. Granata D, Baftizadeh F, Habchi J et al (2015) The inverted free energy landscape of an intrinsically disordered peptide by simulations and experiments. Sci Rep 5:15449
51. Rossetti G, Musiani F, Abad E et al (2016) Conformational ensemble of human α-synuclein physiological form predicted by molecular simulations. Phys Chem Chem Phys 18:5702–5706
52. Bellucci L, Bussi G, Di Felice R et al (2017) Fibrillation-prone conformations of the amyloid-β-42 peptide at the gold/water interface. Nanoscale 9:2279–2290
53. Salomon-Ferrer R, Götz AW, Poole D et al (2013) Routine microsecond molecular dynamics simulations with AMBER on GPUs. 2. Explicit solvent particle mesh Ewald. J Chem Theory Comput 9:3878–3888
54. Abraham MJ, Murtola T, Schulz R et al (2015) GROMACS: high performance molecular simulations through multi-level parallelism from laptops to supercomputers. SoftwareX 1:19–25
55. Tribello GA, Bonomi M, Branduardi D et al (2014) PLUMED 2: new feathers for an old bird. Comput Phys Commun 185:604–613
56. Humphrey W, Dalke A, Schulten K (1996) VMD: Visual molecular dynamics. J Mol Graph 14:33–38
57. Racine J (2006) Gnuplot 4.0: a portable interactive plotting utility. J Appl Econ 21:133–141
58. Best RB, Hummer G (2009) Optimized molecular dynamics force fields applied to the helix-coil transition of polypeptides. J Phys Chem B 113:9004–9015
59. Hornak V, Abel R, Okur A et al (2006) Comparison of multiple Amber force fields and development of improved protein backbone parameters. Proteins 65:712–725
60. Lindorff-Larsen K, Piana S, Palmo K et al (2010) Improved side-chain torsion potentials for the Amber ff99SB protein force field. Proteins 78:1950–1958
61. Hess B, Bekker H, Berendsen HJC et al (1998) LINCS: a linear constraint solver for molecular simulations. J Comput Chem 18:1463–1472
62. Darden T, York D, Pedersen L (1993) Particle mesh Ewald: an N·log(N) method for Ewald sums in large systems. J Chem Phys 98:10089–10092
63. Hockney RW, Goel SP, Eastwood JW (1974) Quiet high-resolution computer models of a plasma. J Comput Phys 14:148–158
64. Bussi G, Donadio D, Parrinello M (2007) Canonical sampling through velocity rescaling. J Chem Phys 126:014101
65. Parrinello M, Rahman A (1981) Polymorphic transitions in single crystals: a new molecular dynamics method. J Appl Phys 52:7182–7190
66. Pietrucci F, Laio A (2009) A collective variable for the efficient exploration of protein Beta-sheet structures: application to SH3 and GB1. J Chem Theory Comput 5:2197–2201
67. Vymětal J, Vondrášek J (2011) Gyration- and inertia-tensor-based collective coordinates for Metadynamics. Application on the conformational behavior of Polyalanine peptides and Trp-cage folding. J Phys Chem A 115:11455–11465
68. Huang J, Rauscher S, Nawrocki G et al (2016) CHARMM36m: an improved force field for folded and intrinsically disordered proteins. Nat Methods 14:71–73
69. Joung IS, Cheatham TE (2008) Determination of alkali and halide monovalent ion parameters for use in explicitly solvated biomolecular simulations. J Phys Chem B 112:9020–9041
70. Nadler W, Hansmann UHE (2008) Optimized explicit-solvent replica exchange molecular dynamics from scratch. J Phys Chem B 112:10386–10387
71. Bonomi M, Bussi G, Camilloni C et al (2019) Promoting transparency and reproducibility in enhanced molecular simulations. Nat Methods 16:670–673
72. Bonomi M, Barducci A, Parrinello M (2009) Reconstructing the equilibrium Boltzmann distribution from well-tempered metadynamics. J Comput Chem 30:1615–1621

Chapter 20

Computational Protocol for Determining Conformational Ensembles of Intrinsically Disordered Proteins

Robert B. Best

Abstract

In modelling experimental measurements from intrinsically disordered proteins, it is essential to account for the very broad distribution of structures which they populate. A natural method for doing this is via computer simulations, particularly those that generate a reasonably accurate initial molecular ensemble. In this chapter, I present a reweighting approach that may be used to determine a conformational ensemble by combining experimental information with molecular simulations. The advantages and difficulties associated with this approach are briefly discussed.

Key words Bayesian ensemble refinement, Ensemble, Coarse-graining, Maximum entropy principle, IDP, Simulation

1 Introduction

Determining the structures of intrinsically disordered proteins (IDPs) is important because these structures often contain clues to their function [1]. In contrast to the situation for folded macromolecules, the structures of IDPs cannot be described as small fluctuations about a single structure. We know this because in general it is not possible to fit measurements from IDPs to a single-structure model, and in fact single-molecule experiments such as single-molecule Förster resonance energy transfer (smFRET) demonstrate that the distributions of pair distances within IDPs are extremely broad, consistent with a disordered polymer chain [2, 3]. Therefore, it is essential to explicitly consider such diversity when constructing models to describe experimental data, which are usually ensemble averaged or averaged over some timescale. For example, one can determine an initial ensemble from a molecular dynamics (MD) simulation with a reasonable force field, which may be fairly accurate yet still fail to reproduce experimental data within error. Experimental data may then be used to reweight the simulation ensemble

Birthe B. Kragelund and Karen Skriver (eds.), Intrinsically Disordered Proteins: Methods and Protocols, Methods in Molecular Biology, vol. 2141, https://doi.org/10.1007/978-1-0716-0524-0_20, © Springer Science+Business Media, LLC, part of Springer Nature 2020


[4]. Confounding such a solution is that the number of relevant structural observables which can be measured is usually much smaller than in the case of folded proteins, while the conformational space populated is vastly larger. Thus, the structure determination problem is highly underdetermined, and this must be addressed in any such method. A second issue, which is less problematic in conventional single-structure determination, is that of sampling: any simulation must be run long enough to generate a sufficiently representative ensemble of the disordered state.

In this chapter, I describe one method for determining structural ensembles of intrinsically disordered proteins. In this approach, which is possibly the simplest to implement, a simulation of the unfolded protein is run and subsequently reweighted to match experimental data. Alternative methods, which are technically more demanding, are (1) using experimental data as a bias in running simulations, so as to ensure agreement with experiment [5], or (2) training a minimalist coarse-grained model to match experiment [6–8]. Biased simulations are advantageous when the structures sampled by the original simulation are too dissimilar from those populated in reality for reweighting to be possible. However, this is usually due to deficiencies in the simulation force field, which can often be overcome by using a better force field. Biased runs also require a large-enough ensemble (each member corresponding to a parallel process in the refinement) to capture the true diversity of the disordered state, which is obviously challenging for IDPs. Training a coarse-grained model is much more application specific and usually requires some physical insight into the system before starting. For these reasons, this chapter focuses on the reweighting approach, which is the most general and straightforwardly applicable method.

2 Materials

2.1 Force Field

Molecular simulations of biomolecules are run with an energy function known as a force field, which is a simplified model that approximates the true potential energy surface. Any method of ensemble generation will benefit from an accurate force field. For example, in the case of reweighting, an accurate initial model will allow more of the initial ensemble to be used, resulting in a higher-quality final ensemble. The subject of good force fields for IDPs has been reviewed several times recently, and the interested reader is referred to those articles for a more detailed discussion of this topic [9, 10]. For simulations using explicit solvent, some examples of reasonable force fields are Amber ff03ws [11], Amber 99sbws [11], Amber 99SB-disp [12], and CHARMM36m [13] (using the version with the scaled water model described in the


supporting information of that work). However, running a simulation of an IDP with explicit water requires a very large simulation cell (see below). Therefore, an alternative may be to use a good implicit solvent model. Although such models may be somewhat less accurate, this can be corrected to some extent by the experimental input. Implicit solvent simulations are generally simpler to set up and run and need only a small fraction of the resources required for explicit solvent. In this case, fewer candidate force fields are suitable for IDPs, the best being the ABSINTH implicit solvent model [14] or alternatively EEF1 [15, 16]. If using other implicit solvent models such as most Generalized Born models [17], the simulations will have to be run at a very high temperature (e.g., 600–800 K) to avoid the over-collapsed structures typically favored by these models [16].

2.2 Simulation Software

For a simple reweighting procedure, the choice of simulation code to run the initial simulations is very flexible, the only requirement being that it implements the desired force field. The open-source simulation codes GROMACS [18] and OpenMM [19] both implement a variety of force fields, and the latter has a very efficient implementation on graphics processing units (GPUs). The ABSINTH force field mentioned above is currently implemented only in the CAMPARI Monte Carlo code [20]. For implementing experimental biases in simulations (if that method is used), the requirements are more demanding, as the code must implement ensemble-averaged restraints. One of the more actively developed codes for implementing such restraints is PLUMED [21], which can be used as a plug-in for several MD codes but has the most complete functionality in conjunction with GROMACS.

2.3 Experimental Data

A wide variety of experimental data are suitable for reweighting or biasing ensembles of IDPs, as recently reviewed [22]. The only requirement is that it should be possible to write the instantaneous experimental observable as a function of the protein coordinates. Examples of suitable data include scalar couplings, residual dipolar couplings (RDCs), and paramagnetic relaxation enhancements (PREs) from NMR, FRET efficiencies from smFRET, and scattering functions from small-angle and/or wide-angle X-ray scattering. Experimental observables which depend on the history in the simulation, such as NMR NOE buildup rates, photon correlation functions from smFRET, or intermediate scattering functions from neutron scattering or light scattering, would not be straightforward to include, because there is no practical and rigorously correct way to reweight simulations to reproduce time-dependent data.

3 Methods

3.1 Ensemble Reweighting

3.1.1 Generation of Initial Ensemble

1. Ensembles drawn from Protein Data Bank (PDB) statistics. The simplest way to generate an initial ensemble of protein structures is to reconstruct coordinates by drawing from a statistical database of backbone and side-chain dihedral angles from the PDB while at the same time respecting excluded volume (i.e., atoms are not allowed to overlap with each other). Usually, only loop or coil regions from the PDB are used to construct the databases, as their conformational propensities should be most similar to those of IDPs. Such models are generally very cheap to run, and the best known is the flexible meccano algorithm [23]. This type of model can be a good initial guess if there is already evidence that the protein is largely unstructured and relatively expanded. If there is some compaction (e.g., suggested by SAXS) or evidence of local, cooperatively formed secondary structure such as helices or hairpins (e.g., from NMR chemical shifts or scalar couplings), then statistical ensembles may be less appropriate: such algorithms are relatively unlikely to produce more compact structures or a significant population of secondary structure.

2. Ensembles from MD/Monte Carlo simulations. An alternative to drawing from PDB statistics is to sample structures according to a force field energy. This of course requires running an initial molecular simulation of the intrinsically disordered protein. For this purpose, either MD or Monte Carlo (MC) sampling may be used, depending on what is more convenient in the simulation package being used. Although this seems fairly straightforward, a number of factors need to be considered to obtain the best results:

(a) Choice of initial structure. Every simulation needs to be started from an initial set of coordinates. Unlike the case for folded proteins, there is no initial choice with special significance. One method to generate initial structures is via excluded volume methods such as flexible meccano [23], referred to above. Alternatively, one can generate trial initial structures by running simulations starting from an extended state in vacuum at a high temperature (e.g., 500 K). The protein will of course quickly collapse, and so initial states of varying degrees of collapse can be chosen to run the simulations (care should be taken that no proline residues have isomerized in this initial high-temperature run; see Note 1). Good initial structures would be those with dimensions close to those expected for an excluded volume chain. For example, for a protein


with $N$ residues, an empirical relation gives the radius of gyration ($R_g$) as:

\[ R_g = R_0 N^{\nu} \tag{1} \]

with $R_0 = 0.1927$ nm and $\nu = 0.598$ [24]. It is generally better to use structures which are too expanded than too collapsed, as the latter are more likely to be long-lived kinetic traps. In addition, it is always a good idea to run several simulations starting from somewhat different initial structures (e.g., slightly more expanded/collapsed). If the properties of the resulting ensembles (before reweighting) are very similar, this suggests that the simulation sampling may have been sufficient; if they are different, then it is certainly not sufficient!

(b) Sampling method. Because of the lack of structure in most IDPs, it may suffice to run a very long, unbiased simulation (e.g., 1–10 μs if using MD), as there should not be long-lived kinetic traps. On the other hand, even the formation and breaking of local structure such as helices and hairpins can occur on a microsecond timescale, so it may be desirable to enhance sampling in some way in initial ensemble generation. One method may be to run replica exchange MD (or MC) [25]. The standard temperature replica exchange MD (T-REMD) [26], which accelerates the crossing of high-energy barriers, should work well for implicit solvent models. However, for the large system sizes needed to run IDPs with explicit solvent (at least IDPs longer than 30–40 residues), it will not scale well; in that case, replica exchange with solute tempering (REST or REST-2) [27], or replica exchange between different protein-water interaction strengths [28], may offer more scalable alternatives. The choice of optimal parameters for running such simulations is slightly outside the scope of this protocol but is already covered very well in the literature [29]. Replica exchange methods have an additional advantage in that they generate a number of different ensembles (e.g., at different temperatures for T-REMD).
This is useful because one can then choose for reweighting the ensemble which is initially closest to experiment (i.e., that for which the computed observables are closest to the experimental values), thus minimizing the perturbation required to match experiment (see further discussion below). (For information on typical ensemble sizes, see Note 2.)

(c) “Convergence.” Having run the simulation with the chosen force field and sampling method, it is necessary to check whether the sampling is sufficient to generate an


initial ensemble which is representative of the equilibrium ensemble in that force field. The most common way of checking this is via the “convergence” of properties in two simulations initiated from different initial conditions toward common values [30–32]. Note that having such a converged ensemble is not strictly necessary if one is doing reweighting, since all that is required is a “reasonable” initial guess, and the experimental weighting can help to correct imperfections in both the force field and the sampling. However, obtaining converged ensemble properties is a helpful indicator because it generally suggests that the simulations are not stuck in a limited set of nearby local minima but are broadly exploring the allowed conformational space (see Note 3).

3.1.2 Reweighting Scheme
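Before moving on to the weights themselves, the empirical scaling relation of Eq. 1 (Subheading 3.1.1) can be evaluated in a few lines when screening candidate starting structures. The sketch below is illustrative only: the selection helper, its name, and the list of snapshot radii of gyration are hypothetical, and the rule it applies (prefer candidates at least as expanded as the target) is just one way of encoding the advice that over-expanded starting points are safer than collapsed ones.

```python
R0_NM = 0.1927  # prefactor of Eq. 1, in nm
NU = 0.598      # scaling exponent of Eq. 1


def expected_rg(n_residues):
    """Radius of gyration (nm) expected for an excluded-volume chain (Eq. 1)."""
    return R0_NM * n_residues ** NU


def pick_starting_structure(candidate_rgs, n_residues):
    """Index of the candidate closest to the target Rg among those at least as expanded.

    If no candidate is that expanded, the most expanded one is returned.
    """
    target = expected_rg(n_residues)
    expanded = [(rg - target, i) for i, rg in enumerate(candidate_rgs) if rg >= target]
    if expanded:
        return min(expanded)[1]
    return max(range(len(candidate_rgs)), key=lambda i: candidate_rgs[i])


# Hypothetical Rg values (nm) of snapshots from a high-temperature vacuum run
# of a 100-residue chain.
snapshot_rgs = [1.8, 2.6, 3.1, 3.9]
target_rg = expected_rg(100)
chosen = pick_starting_structure(snapshot_rgs, 100)
print(f"target Rg = {target_rg:.2f} nm, chosen snapshot index = {chosen}")
```

The same `expected_rg` value can serve as a rough reference when comparing the average dimensions of independent runs during the convergence check.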

Reweighting an initial ensemble to match an experimental observable is a simply stated problem. For each structure $i$ saved from the simulation, one independently chooses a weight $w_i$ to minimize the difference between the actual experimental ensemble-averaged data points $\langle a_k \rangle_{\mathrm{expt}}$ and the corresponding ensemble-averaged data computed from the reweighted simulation, $\langle a_k \rangle_{\mathrm{sim}}$ (here, the $k$th observable $a_k$ could be a J-coupling, RDC, FRET efficiency, etc.). That is, one seeks to minimize $\chi^2$, defined as

\[ \chi^2 = \sum_k \frac{\left( \langle a_k \rangle_{\mathrm{sim}} - \langle a_k \rangle_{\mathrm{expt}} \right)^2}{\sigma_{k,\mathrm{sim}}^2 + \sigma_{k,\mathrm{expt}}^2} \tag{2} \]

Here, the errors of the simulation and experiment, $\sigma_{k,\mathrm{sim}}^2$ and $\sigma_{k,\mathrm{expt}}^2$, respectively, are used to determine the relative importance of each data point in the fit (see Note 4). The observables are computed from the simulation as a simple weighted average:

\[ \langle a_k \rangle_{\mathrm{sim}} = \frac{\sum_i w_i \, a_k(x_i)}{\sum_i w_i} \tag{3} \]

in which $w_i$ is the weight of frame $i$ with coordinates $x_i$ and computed observable $a_k(x_i)$, and the sum runs over the frames.
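The weighted average of Eq. 3 and the χ² of Eq. 2 translate directly into code. The following is a minimal sketch with hypothetical function names and toy numbers; it assumes the per-frame observables have already been computed and stored as one list per observable, `observables[k][i]` holding $a_k(x_i)$.

```python
def weighted_average(weights, values):
    """Eq. 3: sum_i w_i a(x_i) / sum_i w_i for one observable."""
    total = sum(weights)
    return sum(w * v for w, v in zip(weights, values)) / total


def chi_squared(weights, observables, expt, sigma_sim, sigma_expt):
    """Eq. 2, with observables[k][i] = a_k(x_i) and one error pair per observable."""
    chi2 = 0.0
    for a_k, e_k, s_sim, s_expt in zip(observables, expt, sigma_sim, sigma_expt):
        diff = weighted_average(weights, a_k) - e_k
        chi2 += diff ** 2 / (s_sim ** 2 + s_expt ** 2)
    return chi2


# Toy data (hypothetical): three frames, one observable.
frame_weights = [1.0, 1.0, 2.0]
frame_observables = [[1.0, 2.0, 3.0]]
chi2 = chi_squared(frame_weights, frame_observables, [2.0], [0.3], [0.4])
print(chi2)
```

For the toy input the weighted average is 2.25, the deviation from the target 0.25, and the combined variance 0.25, giving χ² = 0.25.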

3.1.3 Avoiding Overfitting

An obvious question that arises from the above description of reweighting is how unique the set of chosen weights is. For example, if there are 10,000 structures in the initial ensemble and maybe 10–100 experimental data points, one can easily imagine that it is possible to find many sets of weights which could match experiment; that is, the problem is very underdetermined. Therefore, some additional assumptions are needed to prevent this degeneracy of solutions. One possible procedure one could imagine is to select a “minimal ensemble”; that is, choose only the smallest number of structures that are sufficient to fit the data, but no more. However, this does not really solve the overfitting problem, because it is quite

Ensemble Models for IDPs


possible (indeed likely) that there are many such minimal ensembles which fit the data but otherwise have quite different properties. The real problem is that the number of degrees of freedom in the true ensemble will always be vastly larger than the number of experimental data points that can be supplied, and so it is necessary to rely somewhat on the distribution of structures obtained in the initial ensemble. Most reweighting procedures that are now in use try to match the experimental data while minimally perturbing the original ensemble. An example of what is meant by this is shown in Fig. 1, in which two reweighting solutions are obtained to match an experimental observable. In the first (Fig. 1a), essentially only structures x for which a(x) is very close to $\langle a \rangle_{\mathrm{expt}}$ are chosen; thus, the average $\langle a \rangle_{\mathrm{sim}}$ necessarily matches the experiment. In the second (Fig. 1b), a much broader distribution of structures is chosen which adheres more closely to the original simulation, but for which the average $\langle a \rangle_{\mathrm{sim}}$ nonetheless matches $\langle a \rangle_{\mathrm{expt}}$. Intuitively, we know that $\langle a \rangle_{\mathrm{sim}}$ is likely to come from a broad distribution, and so extreme solutions such as that in Fig. 1a are unlikely. More likely is the more modest perturbation shown in Fig. 1b. This is often expressed in terms of the principle of minimum information or maximum entropy; that is, we want to maximize the entropy of the resulting distribution. A practical procedure for achieving this is to add an entropy “prior” based on the Shannon entropy of the weights, defined as:

$$S(\{w_i\}) = -\sum_i w_i \ln w_i \qquad (4)$$

For weights normalized to sum to one, this entropy has the largest possible value when all the weights are equal; that is, the distribution is identical to the original unperturbed ensemble. Thus, maximizing the entropy of the weights keeps the reweighted ensemble close to the original. The entropy can be directly incorporated in the scoring function used for choosing the weights as [4]:

$$G(\{w_i\}) = \chi^2(\{w_i\}) - \theta S(\{w_i\}) \qquad (5)$$

where the parameter θ controls the strength of the entropy prior. The choice of θ is critical: if it is too small, the prior will have no effect; the experimental data might be matched perfectly, but the ensemble distribution could deviate greatly from the original simulation. As θ is increased, the distribution should match the simulation increasingly well, but without necessarily making the agreement with experiment much worse (since it is still possible to fit the data in other ways). Finally, if θ is made too large, the similarity to the simulation will be too strong, and it will no longer be possible to fit the data. The ideal θ is the largest one where it is still possible to fit the experimental data (see Note 5 for choice of θ). Fig. 2 shows an illustration of this for a simple example of fitting some smFRET data obtained on the protein R17 [33].
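The entropy prior and the scoring function can be sketched in a few lines. The example below assumes that the weights are normalized to sum to one before the entropy is evaluated and that the entropy carries the usual negative sign, S = −Σ w ln w, so that uniform weights give the maximum value ln N; the function names are hypothetical.

```python
import numpy as np

def shannon_entropy(weights):
    """Shannon entropy of the normalized weights (assumed sign: S = -sum w ln w)."""
    w = np.asarray(weights, dtype=float)
    p = w / w.sum()
    p = p[p > 0]  # frames with zero weight contribute nothing to the sum
    return float(-np.sum(p * np.log(p)))

def score(chi2, weights, theta):
    """Scoring function G = chi2 - theta * S; lower values are better."""
    return chi2 - theta * shannon_entropy(weights)

# Uniform weights give the maximum entropy, ln(N); a single dominant
# frame gives an entropy of zero.
s_uniform = shannon_entropy(np.ones(4))
s_single = shannon_entropy([1.0, 0.0, 0.0, 0.0])
g_example = score(2.0, np.ones(4), theta=0.5)
```

Because the entropy enters G with a negative sign, a larger θ rewards weight sets that stay close to uniform, which is the behaviour described above.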

Fig. 1 Reweighting an ensemble to match a single observable, q. Each panel plots the probability distribution P(q) against q, showing the simulated distribution Psim(q), the reweighted distribution Ppert(q), and the means μsim and μexpt. (a) Original distribution of q in simulation (blue curve, Psim(q)) with mean μsim to be reweighted to match experimental mean, μexpt (black dashed line). The simplest procedure for doing this is by selecting only those conformations which have q very close to the observable (magenta curve, Ppert(q)). A more likely (and minimally perturbing) reweighting is shown in (b), in which only the side peak in the original distribution at high q is eliminated. In (c), we illustrate the problem of trying to match experimental data which lies far from the probable region in the original distribution; as a result, only a few extreme outliers are chosen, which is clearly unlikely to be representative of the true experimental ensemble. This can be avoided by generating a higher-quality initial ensemble (e.g., with a better simulation force field)

3.1.4 How to Actually Find the Optimal Weights?

So far, the reweighting process has been discussed in a fairly abstract manner on the assumption that we know how to search for the optimal weight set $\{w_i\}$. One approach to finding them would be to use a gradient-based optimization, but the simplest method may be an MC search in weight space. Such a search also has the capability to escape local minima in parameter space, should they exist. Briefly, the algorithm is as follows:

1. Calculate the experimental observables for each frame in the simulation (i.e., $a_k(x_i)$).
2. Set the initial weights $w_i = 1$ for all frames.
3. Make a trial move by perturbing the weight of one frame. Many trial moves are possible, for example, picking at random one frame i and trying $w_{i,\mathrm{trial}} = w_i \exp[R]$, where R is a random number drawn from a uniform distribution on the interval [−1, 1].
4. Accept or reject the move using the Metropolis MC acceptance criterion and the scoring function $G(\{w_i\})$. That is, if $\Delta G = G(\{w_1, \ldots, w_{i-1}, w_{i,\mathrm{trial}}, w_{i+1}, \ldots, w_N\}) - G(\{w_1, \ldots, w_{i-1}, w_i, w_{i+1}, \ldots, w_N\}) < 0$, then the move is accepted and $w_i$ is updated to $w_{i,\mathrm{trial}}$. If $\Delta G \geq 0$, then a random number R is drawn from a uniform distribution on the interval [0, 1]. If $R < \exp[-\Delta G]$, then the move is accepted; otherwise, it is rejected and the current $w_i$ is retained.
5. After each move, go back to step 3, try another move, and continue until the specified number of moves is completed or until $G(\{w_i\})$ converges to a stable value.

This MC weight optimization needs to be performed at a number of values of θ (see Note 5) in order to optimize both the choice of θ and the set of weights $\{w_i\}$. In this way, it should be possible to find the least perturbing set of weights. An example of this is shown in Fig. 2.
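The five steps above can be sketched as follows. This is a minimal illustration, not a reference implementation: the function names, the toy data, the number of moves, and the choice to return the best-scoring weight set encountered (rather than the final one) are all assumptions made for the example. The simulation and experimental variances are passed as a single summed array, `var_tot`.

```python
import numpy as np

def score(weights, obs, expt, var_tot, theta):
    """Return (G, chi2) for one weight set, with G = chi2 - theta * S."""
    p = weights / weights.sum()
    chi2 = np.sum((p @ obs - expt) ** 2 / var_tot)
    nz = p[p > 0]
    entropy = -np.sum(nz * np.log(nz))
    return chi2 - theta * entropy, chi2

def mc_reweight(obs, expt, var_tot, theta, n_moves=5000, seed=0):
    """Metropolis MC search in weight space (steps 2 to 5)."""
    rng = np.random.default_rng(seed)
    n_frames = obs.shape[0]
    w = np.ones(n_frames)                            # step 2: uniform start
    g, chi2 = score(w, obs, expt, var_tot, theta)
    best_g, best_chi2, best_w = g, chi2, w.copy()
    for _ in range(n_moves):
        i = rng.integers(n_frames)                   # step 3: pick one frame
        old = w[i]
        w[i] = old * np.exp(rng.uniform(-1.0, 1.0))  # multiplicative trial move
        g_trial, chi2_trial = score(w, obs, expt, var_tot, theta)
        dg = g_trial - g
        if dg < 0 or rng.random() < np.exp(-dg):     # step 4: Metropolis test
            g, chi2 = g_trial, chi2_trial
            if g < best_g:                           # convenience: remember best set
                best_g, best_chi2, best_w = g, chi2, w.copy()
        else:
            w[i] = old                               # rejected: restore old weight
    return best_w / best_w.sum(), best_chi2

# Toy example (hypothetical): one observable spread evenly over 50 frames,
# with an "experimental" mean of 0.7 that differs from the simulated mean of 0.5.
obs = np.linspace(0.0, 1.0, 50).reshape(-1, 1)
expt = np.array([0.7])
var_tot = np.array([1e-4])
chi2_start = score(np.ones(50), obs, expt, var_tot, theta=1.0)[1]
weights, chi2_final = mc_reweight(obs, expt, var_tot, theta=1.0)
```

In practice the same search would be repeated over a range of θ values, recording χ2 and the entropy for each, as discussed in Note 5.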

3.1.5 Assessing the Final Weights

Finding a set of weights which fits the data is not sufficient, even if a maximum entropy prior is used to keep them as uniform as possible. For example, if only the very tail of the simulated ensemble is in any way compatible with experiment, the only solution which can fit the data will have non-negligible weights only for a very small number of structures. This could be the case, for example, if the protein was highly collapsed in the original simulation, but in fact the experimental data suggest it is quite expanded. Then the only way to match experiment would be for the weights to pick out the tiny minority of structures in the original ensemble which are more

Fig. 2 Illustration of reweighting to match smFRET data. Panel axes: (a) transfer efficiency versus [Urea] / M; (b) χ2 versus entropy S, with one curve per urea concentration (0.62, 1.13, 1.61, 2.12, 2.62, 3.11, 4.22, 5.07, 7.06, and 9.02 M); (c) R / nm versus [Urea] / M; (d) Rg / nm versus [Urea] / M; (e) p(r) versus r / nm; (f) p(rg) versus rg / nm. (a) smFRET transfer efficiencies for R17 for three different pairs of labelled residues (three different colors), as a function of urea concentration [33, 36]. Transfer efficiencies decrease as the urea concentration increases, suggesting an expansion of the unfolded state. (b) Different reweightings are performed in order to match experiment at each denaturant concentration (i.e., matching data for all FRET pairs simultaneously), with varying strength of the entropy prior. The ensemble with the maximum entropy before χ2 diverges is chosen in each case (indicated by a red arrow in the case of the 7.06 M ensemble (cyan curve)). The mean (c) end-end distance R and (d) radius of gyration Rg indeed increase with denaturant concentration, as also reflected in their respective distributions in (e) and (f) (color code is black, 1.13 M urea; magenta, 5.07 M urea; cyan, 9.02 M urea). (Figure adapted from [36])

expanded (see Fig. 1c). Clearly, such an ensemble is unlikely to be representative of the true distribution sampled in experiment. There are a couple of possible checks for this outcome:


1. Plotting distributions of properties of the reweighted ensemble, for example, distributions of pair distances, radius of gyration, Rg, etc. These distributions should be smooth and not have sharp spikes, which typically appear when too much weight is assigned to only a few structures.
2. Computing the size of the partition function for the weights, that is,

$$Z_w = \frac{\sum_i w_i}{\max[\{w_i\}]} \qquad (6)$$

where $\max[\{w_i\}]$ is the largest weight, which plays the role of the zero-energy ground state. As is well known for partition functions with vanishing zero-point energy, $Z_w$ effectively counts the number of states which have significant weight in the ensemble. Ideally, $Z_w$ should be a significant fraction (at least 10–20%) of the total number of frames in the original simulation.

3.2 Validating and Using the Reweighted Ensemble

Once the optimal set of weights $\{w_i\}$ has been determined, these, in conjunction with the set of structures $\{x_i\}$, constitute the reweighted ensemble and can be used for further calculations, such as computing other observables not used in the reweighting procedure:

$$\langle \tilde{a} \rangle_{\mathrm{sim}} = \frac{\sum_i w_i \, \tilde{a}(x_i)}{\sum_i w_i} \qquad (7)$$

This allows us to predict the values of observables that have not yet been measured or, more commonly, to calculate the values of observables that have been kept aside and not used in fitting weights. While this practice may seem superficially similar to the use of Rfree in X-ray crystallography, the meaning is somewhat different in the context of IDP ensembles. Finding the structure which minimizes the deviation from the experimental diffraction intensities in X-ray crystallography is usually an overdetermined problem; that is, not all the data are needed, and therefore, it is expected that if a small fraction are left out from the fitting, they should be accurately predicted by the model; the use of Rfree is mainly to test whether adding additional degrees of freedom to the model (e.g., anisotropic B-factors) is really improving the model or instead leading to overfitting. In contrast, for IDP ensembles, the problem of determining an ensemble is essentially always going to be underdetermined by the experimental data, and we have already discussed how a maximum entropy prior can be used to solve the problem of overfitting. Therefore, testing the effect of additional data or different data on the resulting ensemble is not so much testing for overfitting, as much as whether the new experimental data contain additional information. If adding further data does not change the result, then the new data are redundant. Thus, in comparing smFRET and SAXS experiments, where the experimental data are expected to yield similar information on global dimensions of the protein, it was


found that generating ensembles using SAXS (from unlabeled protein) yielded similar results to FRET data from a protein labelled with a hydrophilic dye pair, confirming that the two experiments were yielding similar information as expected [33]. However, it must be noted that inconsistency between ensembles generated from different data sets does not necessarily mean that the data sets are themselves inconsistent. An obvious example would be using NMR J-couplings or SAXS scattering functions to reweight. Since the information content of these two experiments is respectively extremely local versus global information, the minimal bias to the simulation to match each experiment could well yield different ensembles. This does not mean that either ensemble is “wrong,” rather that they each contain different information, and using both data sets together would yield a still better ensemble.
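Two of the quantities used to assess and validate a reweighted ensemble are simple to compute once the weights are available. The sketch below evaluates a held-out observable as a weighted average and estimates the effective number of contributing frames. The function names are hypothetical, and the effective count is normalized here by the largest weight, an assumption chosen so that uniform weights return exactly the number of frames and the result can be read as a fraction of the ensemble.

```python
import numpy as np

def predict_observable(values, weights):
    """Weighted average of an observable that was not used in the fit."""
    w = np.asarray(weights, dtype=float)
    return float(np.dot(w, np.asarray(values, dtype=float)) / w.sum())

def effective_states(weights):
    """Effective number of frames carrying significant weight.

    Assumption: normalized by the largest weight, so the value lies
    between 1 (one dominant frame) and the number of frames (uniform).
    """
    w = np.asarray(weights, dtype=float)
    return float(w.sum() / w.max())

# Toy checks (hypothetical numbers).
held_out = predict_observable([1.0, 2.0, 3.0], [1.0, 1.0, 2.0])
z_uniform = effective_states(np.ones(100))
z_single = effective_states([1.0, 0.0, 0.0, 0.0])
fraction_used = z_uniform / 100
```

A `fraction_used` well below roughly 0.1 to 0.2 would suggest that only a small part of the original ensemble is compatible with the data, as in the situation illustrated in Fig. 1c.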

4 Notes

1. To check for cis prolines in initial structures: The ω-dihedral angle between the proline and the preceding residue, that is, the dihedral Cα-C-N-Cα, needs to be calculated. If |ω| < 90°, then the proline is cis; otherwise, it is trans (usually, it will be close to 0° if cis and close to 180° if trans). Usually, prolines will be predominantly trans in IDPs, although the exact population does depend on sequence context and posttranslational modifications [34]. There are a number of commonly available programs that can compute these torsion angles, including the MDAnalysis python package [35] (http://www.mdanalysis.org) or the programs dang or dangle from the Richardson group (http://kinemage.biochem.duke.edu).

2. Initial ensemble sizes (before reweighting): There is no hard and fast rule about how many structures are needed, but typically ensemble sizes starting with 10^4 structures may be adequate. Having a larger ensemble helps when a stronger reweighting is required to match experiment, as it allows a larger number of statistically independent coordinate sets to contribute to the average.

3. Properties to check for converged sampling: A good global property to check for IDPs is the radius of gyration Rg, which can be easily computed by most MD codes as well as the MDAnalysis tool mentioned in Note 1. In addition, it makes sense to focus on properties which are directly related to the specific experimental observables that will be used for reweighting, for example, pair distances in the case of FRET or NMR PREs or dihedral angles in the case of NMR scalar couplings. In each case, the aim is that the averages (and distributions) of such properties should be similar when they are computed from simulations starting from different initial conditions. Because such averages often take some time to reach their equilibrium value, it can be good to plot them as a function of time (possibly with a running average over a given time window to smooth fluctuations) in order to determine which portion of the simulation can be considered “equilibrated” and used for reweighting and to compute averages.

4. Error estimation from simulation and experiment: The errors $\sigma^2_{k,\mathrm{sim}}$ and $\sigma^2_{k,\mathrm{expt}}$ are required to compute the deviation from experiment as measured by $\chi^2$. The experimental errors are meant to be standard errors (i.e., the expected standard deviation of the result if the same experiment were to be repeated many times). Ideally, this would be determined individually for each observable, but otherwise an estimate of the typical error could be used. For the simulations, there are two sources of error. The first is statistical error from the simulation owing to the fact that the averages are computed from a finite amount of sampling. The most common way of estimating this is via block averaging. To compute this, first determine which part of the simulation can be considered “equilibrated” (see Note 3). Divide this part of the simulation into a number N of equal-sized blocks in time; usually, between 10 and 30 blocks is a good number. Compute the average observable $a_i$ for each block i as the weighted average of a over that block (using the current weights in the reweighting procedure). The statistical error estimate for the observable a is then

$$\sigma^2_{\mathrm{stat}} = \frac{1}{N}\left[\frac{1}{N}\sum_i a_i^2 - \left(\frac{1}{N}\sum_i a_i\right)^2\right]$$

A second source of error for simulations is the uncertainty associated with computing the experimental observable from the simulation, $\sigma^2_{\mathrm{pred}}$, since the methods used to compute these observables are often empirical, approximate methods. Prime examples are the prediction of chemical shifts or scalar couplings. Generally, it is possible to obtain some estimate of the prediction error from the original publication (e.g., for shift predictor or Karplus equation). At the least, the RMS deviation between fit and data is usually reported, which is an underestimate of the true prediction error but a reasonable starting point. Since the statistical and prediction error are independent, they can be added to give the overall simulation error for each observable as $\sigma^2_{\mathrm{sim}} = \sigma^2_{\mathrm{stat}} + \sigma^2_{\mathrm{pred}}$.

5. How to choose the optimal θ in Eq. 5: There is no rule for a typical value of the parameter θ which is optimal to use in reweighting; it will depend on the system, the initial ensemble, and the experimental data used. The best way to choose it is to determine ensembles by reweighting with different values of θ. As seen in Fig. 2b, increasing θ (and hence the entropy of the weights) initially results in little change of $\chi^2$. Although all of these solutions would match experiment, most of them would be overfitting. At some point, however, $\chi^2$ will increase sharply because θ is too large and the fit to the data is not good anymore. The optimal θ is the largest one before this increase occurs, thus obtaining the maximum entropy solution while still fitting the data. Note also that choosing the ensemble based on the value of $\chi^2$ itself is generally not as robust as the procedure described here. Although $\chi^2$ should be approximately equal to the number of observables in an ideal world, in reality its value is affected by the accuracy of the error estimates, as well as correlations between the errors in different observables.
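The block-averaging estimate described in Note 4 can be sketched as below. The function names and toy series are assumptions for illustration; frames that do not fill a complete block at the end of the series are simply dropped, which is one of several reasonable conventions.

```python
import numpy as np

def block_error_sq(values, weights=None, n_blocks=10):
    """Squared statistical error of a weighted average from block averaging.

    values and weights are per-frame arrays in time order; trailing frames
    that do not fill a complete block are discarded.
    """
    a = np.asarray(values, dtype=float)
    w = np.ones_like(a) if weights is None else np.asarray(weights, dtype=float)
    usable = (len(a) // n_blocks) * n_blocks
    a_blocks = a[:usable].reshape(n_blocks, -1)
    w_blocks = w[:usable].reshape(n_blocks, -1)
    # Weighted average of the observable within each block.
    block_means = (a_blocks * w_blocks).sum(axis=1) / w_blocks.sum(axis=1)
    # Variance of the block means divided by the number of blocks.
    variance = np.mean(block_means**2) - np.mean(block_means) ** 2
    return float(variance / n_blocks)

def total_sim_error_sq(var_stat, var_pred):
    """Combine independent statistical and prediction errors."""
    return var_stat + var_pred

# Toy checks (hypothetical): two blocks with means 0 and 2.
err_two_blocks = block_error_sq([0.0, 0.0, 2.0, 2.0], n_blocks=2)
err_constant = block_error_sq(np.full(100, 3.0), n_blocks=10)
var_sim = total_sim_error_sq(err_two_blocks, 0.25)
```

The resulting `var_sim` would be supplied as the simulation variance in the χ2 of Eq. 2.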

Acknowledgments

R.B. is supported by the Intramural Research Program of the National Institute of Diabetes and Digestive and Kidney Diseases of the National Institutes of Health.

References

1. Das RK, Huang Y, Phillips AH et al (2016) Cryptic sequence features within the disordered protein p27Kip1 regulate cell cycle signaling. Proc Natl Acad Sci U S A 113:5616–5621
2. Gopich IV, Szabo A (2012) Theory of the energy transfer efficiency and fluorescence lifetime distribution in single-molecule FRET. Proc Natl Acad Sci U S A 109:7747–7752
3. Chung HS, Louis JM, Gopich IV (2016) Analysis of fluorescence lifetime and energy transfer efficiency in single-molecule photon trajectories of fast-folding proteins. J Phys Chem B 120:680–699
4. Hummer G, Köfinger J (2015) Bayesian ensemble refinement by replica simulations and reweighting. J Chem Phys 143:243150
5. Lindorff-Larsen K, Best RB, Depristo MA et al (2005) Simultaneous determination of protein structure and dynamics. Nature 433:128–132
6. Borgia A, Borgia MB, Bugge K et al (2018) Extreme disorder in an ultrahigh-affinity protein complex. Nature 555:61–66
7. Holmstrom ED, Holla A, Zheng W et al (2018) Accurate transfer efficiencies, distance distributions, and ensembles of unfolded and intrinsically disordered proteins from single-molecule FRET. Methods Enzymol 611:297–325
8. Holmstrom ED, Liu Z, Nettels D et al (2019) Disordered RNA chaperones can enhance nucleic acid folding via local charge screening. Nat Commun 10:2453
9. Best RB (2017) Computational and theoretical advances in studies of intrinsically disordered proteins. Curr Opin Struct Biol 42:147–154
10. Huang J, MacKerell AD (2019) Force field development and simulations of intrinsically disordered proteins. Curr Opin Struct Biol 48:40–48
11. Best RB, Zheng W, Mittal J (2014) Balanced protein-water interactions improve properties of disordered proteins and non-specific protein association. J Chem Theory Comput 10:5113–5124
12. Robustelli P, Piana S, Shaw DE (2018) Developing a molecular dynamics force field for both folded and disordered protein states. Proc Natl Acad Sci U S A 115(21):E4758–E4766
13. Huang J, Rauscher S, Nawrocki G et al (2016) CHARMM36m: an improved force field for folded and intrinsically disordered proteins. Nat Methods 14:71–73
14. Vitalis A, Pappu RV (2008) ABSINTH: a new continuum solvation model for simulations of polypeptides in aqueous solutions. J Comput Chem 30:673–699
15. Lazaridis T, Karplus M (1999) Effective energy function for proteins in solution. Proteins 35:133–152
16. Bottaro S, Lindorff-Larsen K, Best RB (2013) Variational optimization of an all-atom implicit solvent force field to match explicit solvent simulation data. J Chem Theory Comput 9:5641–5652
17. Onufriev AV, Case DA (2019) Generalized Born implicit solvent models for biomolecules. Annu Rev Biophys 48:275–296
18. Hess B, Kutzner C, Van der Spoel D et al (2008) GROMACS4: algorithms for highly efficient, load-balanced, and scalable molecular simulation. J Chem Theory Comput 4(3):435–447
19. Eastman P, Swails J, Chodera JD et al (2017) OpenMM 7: rapid development of high performance algorithms for molecular dynamics. PLoS Comput Biol 13(7):e1005659. https://doi.org/10.1371/journal.pcbi.1005659
20. Vitalis A, Pappu RV (2009) Methods for Monte Carlo simulations of biomacromolecules. Annu Rep Comput Chem 5:49–76
21. Tribello GA, Bonomi M, Branduardi D et al (2014) Plumed 2: new feathers for an old bird. Comput Phys Commun 185:604–613
22. Gibbs EB, Showalter SA (2015) Quantitative biophysical characterization of intrinsically disordered proteins. Biochemistry 54:1314–1326
23. Ozenne V, Bauer F, Salmon L et al (2012) Flexible-meccano: a tool for the generation of explicit ensemble descriptions of intrinsically disordered proteins and their associated experimental observables. Bioinformatics 28(11):1463–1470
24. Kohn JE, Millett IS, Jacob J et al (2004) Random-coil behavior and the dimensions of chemically unfolded proteins. Proc Natl Acad Sci U S A 101(34):12491–12496
25. Frenkel D, Smit B (2001) Understanding molecular simulation: from algorithms to applications, 2nd edn. Academic Press, Cambridge, Massachusetts
26. Sugita Y, Okamoto Y (1999) Replica-exchange molecular dynamics methods for protein folding. Chem Phys Lett 314:141–151
27. Liu P, Kim B, Friesner RA et al (2005) Replica exchange with solute tempering: a method for sampling biological systems in explicit water. Proc Natl Acad Sci U S A 102:13749–13754
28. Bellaiche MMJ, Best RB (2018) Molecular determinants of Aβ42 adsorption to amyloid fibril surfaces. J Phys Chem Lett 9(22):6437–6443
29. Nadler W, Hansmann UHE (2007) Dynamics and optimal number of replicas in parallel tempering simulations. Phys Rev E 76:065701
30. Best RB, Hummer G (2009) Optimized molecular dynamics force fields applied to the helix-coil transition of polypeptides. J Phys Chem B 113:9004–9015
31. Best RB, Mittal J (2010) Balance between α and β structures in ab initio protein folding. J Phys Chem B 114:8790–8798
32. Domanski J, Sansom MSP, Stansfeld P et al (2018) Balancing force field protein-lipid interactions to capture transmembrane helix-helix association. J Chem Theory Comput 14:1706–1715
33. Borgia A, Zheng W, Buholzer K et al (2016) Consistent view of polypeptide chain expansion in chemical denaturants from multiple experimental methods. J Am Chem Soc 138:11714–11726
34. Gibbs EB, Lu F, Portz B et al (2017) Phosphorylation induces sequence-specific conformational switches in the RNA polymerase II C-terminal domain. Nat Commun 8:15233
35. Michaud-Agrawal N, Denning EJ, Woolf TB et al (2011) MDAnalysis: a toolkit for the analysis of molecular dynamics simulations. J Comput Chem 32:2319–2327
36. Holmstrom ED, Holla A, Zheng W et al (2018) Accurate transfer efficiencies, distance distributions, and ensembles of unfolded and intrinsically disordered proteins from single-molecule FRET. Methods Enzymol 611:287–325

Chapter 21

Computing, Analyzing, and Comparing the Radius of Gyration and Hydrodynamic Radius in Conformational Ensembles of Intrinsically Disordered Proteins

Mustapha Carab Ahmed, Ramon Crehuet, and Kresten Lindorff-Larsen

Abstract

The level of compaction of an intrinsically disordered protein may affect both its physical and biological properties, and can be probed via different types of biophysical experiments. Small-angle X-ray scattering (SAXS) probes the radius of gyration (Rg), whereas pulsed-field-gradient nuclear magnetic resonance (NMR) diffusion, fluorescence correlation spectroscopy, and dynamic light scattering experiments can be used to determine the hydrodynamic radius (Rh). Here we show how to calculate Rg and Rh from a computationally generated conformational ensemble of an intrinsically disordered protein. We further describe how to use a Bayesian/Maximum Entropy procedure to integrate data from SAXS and NMR diffusion experiments, so as to derive conformational ensembles in agreement with those experiments.

Key words Radius of gyration, Hydrodynamic radius, Conformational ensemble, Compaction, Intrinsically disordered protein

Birthe B. Kragelund and Karen Skriver (eds.), Intrinsically Disordered Proteins: Methods and Protocols, Methods in Molecular Biology, vol. 2141, https://doi.org/10.1007/978-1-0716-0524-0_21, © Springer Science+Business Media, LLC, part of Springer Nature 2020

1 Introduction

In contrast to natively folded proteins, intrinsically disordered proteins (IDPs) generally lack well-defined three-dimensional structures. Consequently, they explore a large number of distinct conformations, and their conformational properties are thus best described in statistical terms. One useful and informative way of representing this large conformational ensemble is through a distribution of the radius of gyration (Rg) of the IDP. The ensemble average $\langle R_g \rangle$ gives a rough measure of how compact a protein is and may, for example, be compared to the values for other proteins of similar lengths. For a given configuration of a protein, the Rg may be calculated as the mass-weighted root-mean-square distance to the center of mass:

$$R_g = \left( \frac{\sum_i m_i \, \lVert r_i \rVert^2}{\sum_i m_i} \right)^{1/2} \qquad (1)$$

where $m_i$ is the mass of atom i and $r_i$ is the position of atom i with respect to the center of mass of the molecule. Experimentally, one may obtain an estimate of the ensemble-averaged value of the Rg of a protein by a Guinier analysis of small-angle X-ray scattering (SAXS) profiles [1] or using various extended models of the scattering data [2, 3]. For the sake of simplicity, we will loosely refer to the experimental value as Rg, omitting the bracket notation and only use brackets for explicitly averaging computed values. Here, we note also that Rg calculated using Eq. 1 is not directly comparable to that obtained from analyses of SAXS data due to contributions to the scattering data from the solvent layer around the disordered protein [4, 5]. Similarly, but via different physical principles, the hydrodynamic radius of a protein also reports on the overall expansion of a protein. The hydrodynamic radius (Rh), also called the Stokes radius, is defined as the radius of a theoretical hard sphere that would have the same translational diffusion coefficient as the considered particle. The translational diffusion coefficient (Dt) of a protein may in turn be determined, e.g., by pulsed-field gradient nuclear magnetic resonance (NMR) diffusion experiments, fluorescence correlation spectroscopy, and dynamic light scattering measurements, and is related to Rh through the Stokes–Einstein equation [6]:

$$D_t = \frac{k_B T}{6 \pi \eta R_h} \qquad (2)$$

where $k_B$ is the Boltzmann constant, T is the temperature, and η is the viscosity of the solvent. Because both Rg and Rh probe the compaction of a disordered protein, and because they may contain complementary information about the distribution of states [7], there have been several studies on the relationship between the Rg and Rh for disordered proteins and polymers [7–10]. One such approach uses hydrodynamic modelling of protein conformations [11–13] to relate protein structure to Rh [7, 10]. In line with theoretical expectations, the authors found that the ratio Rg/Rh depends substantially on the compaction of the protein chain, so that compact states have ratios of approximately 0.8 and expanded conformations have ratios between 1.2 and 1.6. Because the relative level of compaction of the chain, when quantified by Rg, also depends on the chain length, the ratio Rg/Rh also depends on the number of residues of the protein (N). Recently, these two effects were combined into a single, physically motivated and empirically parameterized equation that enables one to calculate Rh for a configuration of an IDP from its Rg [14]:

$$\frac{R_g}{R_h}\left(N, R_g\right) = \frac{\alpha_1 \left(R_g - \alpha_2 N^{0.33}\right)}{N^{0.60} - N^{0.33}} + \alpha_3 \qquad (3)$$

In addition to Rg and N (number of residues of the protein chain), the equation contains three parameters that were fitted to maximize agreement between the model and hydrodynamic calculations ($\alpha_1 = (0.216 \pm 0.001)$ Å$^{-1}$, $\alpha_2 = (4.06 \pm 0.02)$ Å, and $\alpha_3 = 0.821 \pm 0.002$). As discussed further below, since conformational averaging acts on the diffusion properties, the ensemble-averaged value that should be compared to an experimentally measured Rh will not in general be the same as the linear average over the values of each conformation ($\langle R_h \rangle$). Also, note that the equation was parameterized using Rg values calculated from the Cα coordinates only. Values of Rg calculated in this way are generally very close to those calculated from all protein atoms, but this parameterization makes it possible to use the approach to calculate Rh also for coarse-grained Cα-only models.

Here we provide a step-by-step protocol to calculate Rg (and subsequently Rh using Eq. 3) from a computationally generated conformational ensemble of an IDP. Together with calculations of SAXS data from simulations it is possible to compare the simulations to measurements of compaction. In cases where the computed and experimental quantities are not in perfect agreement, one may go one step further and refine the computational ensemble using the experimental data. We thus also demonstrate how to refine the ensembles by integrating experimental SAXS and Rh measurements, and thereby generate conformational ensembles that take into account both the physical principles encoded in the simulations and information from experiments. In addition to the motivation and description provided in this paper we also make available a Jupyter (Python) notebook with guided examples for performing analysis and generating many of the figures discussed here. We do not, however, provide instructions for how to generate conformational ensembles, and the reader is expected to have a basic understanding of the Python programming language to use the examples presented.
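To make Eqs. 2 and 3 concrete, the short sketch below inverts the Stokes–Einstein relation to obtain Rh from a diffusion coefficient and evaluates the empirical Rg-to-Rh conversion for a single conformation. The function names and the example inputs (Rg = 30 Å for a 92-residue chain, water viscosity of 0.89 mPa s at 298 K) are assumptions for illustration; the units of the fitted parameters follow from dimensional analysis of Eq. 3, and the uncertainties on the parameters are ignored.

```python
import math

K_B = 1.380649e-23  # Boltzmann constant in J/K

def rh_from_diffusion(d_t, temperature, viscosity):
    """Eq. 2 solved for Rh (metres), with Dt in m^2/s, T in K, and viscosity in Pa s."""
    return K_B * temperature / (6.0 * math.pi * viscosity * d_t)

# Fitted parameters quoted in the text (central values only).
ALPHA1 = 0.216  # 1/Angstrom (unit inferred from dimensional analysis)
ALPHA2 = 4.06   # Angstrom
ALPHA3 = 0.821  # dimensionless

def rh_from_rg(rg, n_res):
    """Eq. 3: Rh (Angstrom) of one conformation from its Rg (Angstrom) and chain length."""
    ratio = ALPHA1 * (rg - ALPHA2 * n_res**0.33) / (n_res**0.60 - n_res**0.33) + ALPHA3
    return rg / ratio

# Hypothetical example inputs.
rh_example = rh_from_rg(30.0, 92)

# Round trip through Eq. 2 for Rh = 21.5 Angstrom in water at 298 K.
rh_metres = 21.5e-10
d_t = K_B * 298.0 / (6.0 * math.pi * 8.9e-4 * rh_metres)
rh_back = rh_from_diffusion(d_t, 298.0, 8.9e-4)
```

Note that `rh_from_rg` applies to individual conformations; how the per-conformation values are averaged before comparison with experiment is a separate question, as pointed out above.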

2 Materials

2.1 Experimental Data and Sequence of Sic1

1. We used the following sequence for the Sic1 protein: GSMTPSTPPR SRGTRYLAQP SGNTSSSALM QGQKTPQKPS QNLVPVTPST TKSFKNAPLL APPNSNMGMT SPFNGLTSPQ RSPFPKSSVK RT.
2. We obtained SAXS data for Sic1 [15] from the Protein Ensemble Database [16] entry PED9AAA (http://pedb.vib.be/accession.php?ID=PED9AAA).
3. We used the previously measured [17] experimental value of Rh (21.5 ± 1.1 Å).

2.2 Software

1. Flexible-Meccano [18] available from http://www.ibs.fr/research/scientific-output/software/flexible-meccano/?lang=en.
2. CAMPARI v3.0 [19] available from https://sourceforge.net/projects/campari/.
3. PULCHRA v3.06 [20] available from http://www.pirx.com/pulchra/index.shtml.
4. Pepsi-SAXS v1.4 [21] available from https://team.inria.fr/nano-d/software/pepsi-saxs/.
5. BME [22] available from https://github.com/KULL-Centre/BME.
6. MDTraj v1.9.3 [23] available from http://mdtraj.org/1.9.3/.
7. A Python Jupyter notebook (https://jupyter.org/) for performing the calculations and analyses described in this paper is available from https://github.com/KULL-Centre/papers/edit/master/2019/IDP-methods-Ahmed-et-al/.

3 Methods

3.1 Generating Ensembles

We chose the 92-residue protein Sic1 as an example for our calculations, as this protein has been studied extensively by both SAXS and various NMR methods [15, 17]. We used Campari [19] and Flexible-Meccano [18] to generate two conformational ensembles of Sic1 in its unphosphorylated state. For the Campari ensemble (Ensemble 1) we used Monte Carlo sampling with the ABSINTH v3.2 implicit solvent model [24] at a temperature of 298 K. The Sic1 protein was contained in a spherical simulation cell with a radius of 150 Å and an ion concentration of 140 mM, matching the experimental conditions [15]. For the Flexible-Meccano ensemble (Ensemble 2) we generated conformations sampling random coil configurations as described [18]. As Flexible-Meccano only generates a model of the protein backbone, we used PULCHRA [20] with default settings to add side chains to these structures. These side chain coordinates are necessary when we calculate SAXS data from the conformational ensembles. In total we generated 32,000 structures for Ensemble 1 and 10,000 structures for Ensemble 2.

Computing, Analyzing, and Comparing the Radius of Gyration and. . .

3.2 Calculating Rg and Rh from Ensembles


Many simulation and protein analysis software packages have the option of calculating the Rg of a protein. In this example we use readily available, open source software. For calculating the Rg of the conformations we use MDTraj, a Python module for protein analysis [23]. Below we provide Python code demonstrating how to load the ensemble and calculate Rg for each structure, and then calculate Rh for each structure using Eq. 3. In the example we have collected all conformations of the ensemble in a trajectory file (here Ensemble1.trr). Depending on the file format of the trajectory file, one may also need a coordinate file (structure.pdb) or a topology file. Once these files are loaded, MDTraj is used to calculate Rg for each structure in the ensemble, which in turn is converted into Rh using Eq. 3.

Python Code for Calculating Rg and Rh:
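The original listing is not reproduced in this text. As a stand-in, the following is a minimal sketch of the calculation, not the authors' code. It assumes the file names mentioned above (Ensemble1.trr and structure.pdb), computes Rg from the Cα atoms only with equal weights (in line with how Eq. 3 was parameterized), and converts MDTraj's nanometre output to Å. The function names rg_to_rh and rg_rh_from_ensemble are illustrative choices.

```python
import numpy as np

# Fitted parameters of Eq. 3 (alpha1 in 1/Angstrom, alpha2 in Angstrom, alpha3 unitless)
ALPHA1, ALPHA2, ALPHA3 = 0.216, 4.06, 0.821


def rg_to_rh(rg, n_residues):
    """Convert Rg (Angstrom) to Rh (Angstrom) for each conformation using Eq. 3."""
    rg = np.asarray(rg, dtype=float)
    n033 = n_residues ** 0.33
    n060 = n_residues ** 0.60
    ratio = ALPHA1 * (rg - ALPHA2 * n033) / (n060 - n033) + ALPHA3  # this is Rg/Rh
    return rg / ratio


def rg_rh_from_ensemble(trajectory="Ensemble1.trr", topology="structure.pdb"):
    """Load an ensemble with MDTraj and return per-frame Rg and Rh in Angstrom."""
    import mdtraj as md  # imported here so that rg_to_rh works without MDTraj

    traj = md.load(trajectory, top=topology)
    ca = traj.atom_slice(traj.topology.select("name CA"))  # keep C-alpha atoms only
    rg = md.compute_rg(ca) * 10.0  # MDTraj reports nm; atoms are equally weighted
    rh = rg_to_rh(rg, ca.n_atoms)  # assumes one C-alpha atom per residue
    return rg, rh
```

Calling rg_rh_from_ensemble() returns two arrays with one value per structure, which can be passed directly to a histogram routine.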


Once Rg and Rh have been calculated for each structure, these can be used to generate histograms of Rg (Fig. 1a) and Rh (Fig. 1b), and the average Rh and the average Rg can be calculated for comparison to experimental values. Here we calculate the average Rh (see Note 1) and the average Rg (see Note 2) as:

$$\langle R_h \rangle_{\mathrm{trans}} = \left( -\ln \left\langle \exp\left(-R_h^{-1}\right) \right\rangle \right)^{-1} \tag{4}$$

$$\langle R_g \rangle_{\mathrm{trans}} = \left\langle R_g^{2} \right\rangle^{1/2} \tag{5}$$

Here, we have introduced the notation ⟨Rh⟩trans and ⟨Rg⟩trans to represent averages via transformations that are aimed to represent the averaging that occurs in the experiments (see Notes 1 and 2 for further discussion). We also show the (linearly) calculated averages of Rh and Rg in the plots (Fig. 1), though as explained below, a better comparison to the experimental data requires calculations of SAXS intensities from the conformational ensemble (see Note 3). As described above, the ratio Rg/Rh depends substantially on the compaction of the protein chain, so that compact states have ratios of ≈0.8 and expanded conformations have ratios between 1.2 and 1.6 [14]. For a protein of 92 residues, the switch-over point where Rg/Rh = 1 lies at conformations with Rg ≈ 27 Å (see Note 4). Thus, conformations with Rg < 27 Å have Rh > Rg whereas conformations with Rg > 27 Å have Rh < Rg. In this way, the distribution of Rh is “pushed” towards the middle and has less density in the tails compared to the distribution of Rg (Fig. 1).
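The two transformed averages can be sketched in a few lines of NumPy. This is an illustration rather than the code from the accompanying notebook. The function names and the optional weights argument (uniform when omitted, so that the same functions could be reused for a reweighted ensemble) are assumptions, and Rh is taken to be in Å, since the intensity-based average is not invariant to the choice of units.

```python
import numpy as np


def _normalized(weights, n):
    """Return normalized weights; uniform weights if none are given."""
    w = np.full(n, 1.0 / n) if weights is None else np.asarray(weights, dtype=float)
    return w / w.sum()


def average_rh_trans(rh, weights=None):
    """Intensity-based average of Rh (Eq. 4): (-ln <exp(-1/Rh)>)^-1."""
    rh = np.asarray(rh, dtype=float)
    w = _normalized(weights, rh.size)
    return -1.0 / np.log(np.sum(w * np.exp(-1.0 / rh)))


def average_rg_trans(rg, weights=None):
    """Root-mean-square average of Rg (Eq. 5): <Rg^2>^(1/2)."""
    rg = np.asarray(rg, dtype=float)
    w = _normalized(weights, rg.size)
    return np.sqrt(np.sum(w * rg ** 2))
```

For a broad distribution the intensity-based average of Rh falls below the linear mean, while the root-mean-square average of Rg lies above it.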

Fig. 1 Analyzing compaction in ensembles of Sic1. Probability distribution of (a) Rg and (b) Rh calculated from two ensembles that we generated of Sic1. Here, Ensemble 1 (red) was generated using Campari and Ensemble 2 (blue) was generated using Flexible-Meccano as described in the main text. (a) Solid vertical lines represent the ensemble average Rg (⟨Rg⟩trans; see Note 2 for the definition) of Ensemble 1 (red) and Ensemble 2 (blue). (b) Solid vertical lines represent the ensemble average Rh (⟨Rh⟩trans) calculated using Eq. 3 and as discussed in Note 1 from Ensemble 1 (red) and Ensemble 2 (blue). The experimental values of Rg and Rh are shown in black. The errors of the distributions and averages of Rg and Rh (shown as shaded areas) were estimated by block averaging using five blocks
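The block averaging mentioned in the caption can be sketched as follows. This is one common way of doing it and may differ in detail from the procedure used for the figure. It assumes that the per-frame values are stored in sampling order, splits them into contiguous blocks, and reports the standard error of the block estimates. The function name and the choice of standard error are illustrative assumptions.

```python
import numpy as np


def block_average(values, n_blocks=5, statistic=np.mean):
    """Return (mean of block estimates, standard error) for a per-frame series."""
    values = np.asarray(values, dtype=float)
    blocks = np.array_split(values, n_blocks)  # contiguous blocks in frame order
    estimates = np.array([statistic(block) for block in blocks])
    error = estimates.std(ddof=1) / np.sqrt(n_blocks)
    return estimates.mean(), error
```

Passing a different statistic, for example a function that returns the root-mean-square value of a block, gives the corresponding error estimate for a transformed average.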


The distributions of Rg (Fig. 1a) and Rh (Fig. 1b) from Ensemble 1 and Ensemble 2 and the resulting averages can also be compared to the experimental values from SAXS and NMR [15, 17]. These results reveal two different scenarios for the two ensembles. First, ⟨Rh⟩trans calculated from Ensemble 1 (Campari) is in good agreement with the experimentally determined value of Rh (Fig. 1b). At the same time, the calculated value of ⟨Rg⟩trans is substantially lower than the average Rg value estimated from SAXS experiments (Fig. 1a). Second, for Ensemble 2 (Flexible-Meccano) we observe the opposite scenario, where the calculated ⟨Rg⟩trans is close to the value estimated by SAXS (Fig. 1a), and the calculated ⟨Rh⟩trans is substantially greater than the experimental value (Fig. 1b). Disagreement between experiment and simulation is often indicative of problems with the molecular force fields or sampling [25], though differences may also arise from problems in, e.g., the model used to calculate experimental data from structural ensembles [5, 26]. While it is possible to improve molecular force fields directly against experimental data [27], we below describe how one can refine a specific ensemble against one or more sets of experimental measurements.

3.3 A Bayesian/Maximum Entropy Approach

Above we have analyzed two ensembles and used Eq. 3 to estimate Rh which in turn could be averaged and compared to NMR diffusion experiments. We also calculated Rg from the protein coordinates, though as noted this value is not directly comparable to the experimental measurements due to solvation effects [5]. Nevertheless, the results suggested discrepancies between experiments and simulations. Although there have been continued improvements in methods and force fields for sampling the conformational landscape of IDPs, it is still not uncommon that simulations are not in perfect agreement with experiments. In such cases, it is possible to bias the simulation to construct an ensemble that is in better agreement than the unbiased ensemble [22, 28–32]. We here use such a method to construct two new ensembles by reweighting the Campari and Flexible-Meccano ensembles with the experimental data, thus obtaining ensembles that are in better overall agreement with the SAXS and NMR diffusion experiments. Specifically, we use experimental SAXS data [15] and NMR diffusion measurements of Rh [17], and use our recently described Bayesian/Maximum Entropy (BME) protocol to reweight the conformational ensembles [22]. We focus solely on the technical details of the approach rather than the biological relevance. Also, we exemplify using two experimental measures of compaction, but the approach is more generally applicable (see Note 5).


Briefly described, BME is based on a combined Bayesian/Maximum Entropy framework and enables one to refine a simulation using multiple sources of (potentially noisy) data. The purpose of the reweighting is to derive a new set of weights for each configuration in a previously generated ensemble so that the reweighted ensemble satisfies two criteria: (1) it matches the experimental data better than the original ensemble and (2) it achieves this improved agreement by a minimal perturbation of the original ensemble. For additional details see Bottaro et al. [22] and references therein. In the current examples, both Ensemble 1 and 2 were generated as unbiased ensembles and so the initial weights of all structures are uniform ($w_j^0 = 1/n$), where n is the number of structures in the ensemble. The reweighting approach described above may in practice be achieved by updating the weights, wj, of each configuration in the input ensemble by minimizing a function (the negative log-likelihood) [22, 28]:

$$\mathcal{L}(w_1 \ldots w_n) = \frac{1}{2}\chi^2(w_1 \ldots w_n) - \theta S_{\mathrm{rel}}(w_1 \ldots w_n) \tag{6}$$

Here, the χ² quantifies the agreement between the experimental data and the corresponding values calculated from the reweighted ensemble. The second term contains the relative entropy, Srel, which measures the deviation between the original ensemble and the reweighted ensemble:

$$S_{\mathrm{rel}} = -\sum_{j}^{n} w_j \log\left(\frac{w_j}{w_j^0}\right)$$

The temperature-like parameter θ tunes the balance between fitting the data accurately (low χ²) and not deviating too much from the prior (Srel close to zero). In practice, we determine this hyperparameter by evaluating the compromise between the two terms in L [22, 28] (see Note 6). When more than one set of experimental data is included in BME, the deviations between calculated and experimental values are summed in a global χ² function, which is the sum of a χ² function for each set of data. In practice, it turns out that a different, and in some cases more efficient, approach is to minimize L using the method of Lagrange multipliers, and this is the approach we take here [22, 28, 33] using the BME code, which is freely available at https://github.com/KULL-Centre/BME.

3.4 Calculating SAXS Data from Ensembles

The first step in the reweighting protocol is to collect the necessary data and structure it correctly for input in BME. We first calculate the SAXS intensity profiles by fitting to the experimental curve for each structure of the two ensembles using Pepsi-SAXS [21]. Pepsi-SAXS has free parameters for the solvation layer that are calculated for each fit. To decrease the risk of overfitting, we used a two-step procedure. First, we fitted the parameters to each structure. Second, we calculated the averages of the resulting fitted values of the solvation parameters and re-ran Pepsi-SAXS with these parameters fixed to those averages. Alternative methods for calculating SAXS from conformational ensembles exist [4] and may also be used (see Notes 7 and 8). We then structure the input files as shown below for the SAXS data. The experimental SAXS input file contains the following three columns: the momentum transfer (q), the intensity (I(q)), and the error (σI(q)). Each of these three columns is m rows long, where m is the number of experimental data points. The input file for the calculated values contains n rows (the number of structures in the ensemble) and m + 1 columns. The first column labels the individual structure/frame from the ensemble. Further details on how to structure the input files for other data can be found in the original description of BME and in the online examples [22].

File Format for Experimental SAXS Data:

File Format for Calculated SAXS Data:
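The two file listings are not reproduced in this text. The sketch below only illustrates the layout described in the paragraph above: three columns and m rows for the experimental data, and n rows with a label followed by m intensities for the calculated data. Whitespace-delimited text, the function names, the frame labels, and the numbers are all assumptions made for illustration. Any header lines that BME expects are not shown, so the BME online examples should be consulted for the exact format.

```python
def format_experimental(q, intensity, sigma):
    """One row per data point: q, I(q), and the error on I(q)."""
    rows = zip(q, intensity, sigma)
    return "\n".join(f"{a:.6e} {b:.6e} {c:.6e}" for a, b, c in rows) + "\n"


def format_calculated(labels, profiles):
    """One row per structure: a label followed by the m calculated intensities."""
    lines = []
    for label, profile in zip(labels, profiles):
        lines.append(" ".join([str(label)] + [f"{v:.6e}" for v in profile]))
    return "\n".join(lines) + "\n"


# Synthetic numbers for illustration only (m = 3 data points, n = 2 structures)
exp_text = format_experimental([0.01, 0.02, 0.03], [1.00, 0.95, 0.88], [0.02, 0.02, 0.03])
calc_text = format_calculated(["frame_0", "frame_1"],
                              [[1.01, 0.96, 0.90], [0.99, 0.93, 0.85]])
```

Writing exp_text and calc_text to disk gives a pair of files whose row and column counts match the description above.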

Once these calculations have been done, we may load the data in Python and run BME.


Python Code to Run BME:
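The original listing is not reproduced in this text, and we do not attempt to guess the interface of the BME package here. Instead, the following NumPy sketch illustrates the Lagrange-multiplier minimization that underlies this kind of reweighting [22, 28, 33], assuming Gaussian errors. It is not the BME code. The function name reweight, its arguments, and the damped Newton solver are assumptions made for illustration. The weights take the form w_j ∝ w0_j exp(−Σ_i λ_i F_ij), and the multipliers λ minimize Γ(λ) = ln Z(λ) + Σ_i λ_i F_i^exp + (θ/2) Σ_i λ_i² σ_i².

```python
import numpy as np


def reweight(calc, expt, sigma, theta, w0=None, max_iter=200, tol=1e-10):
    """Return weights that minimize chi2/2 - theta*S_rel via Lagrange multipliers."""
    calc = np.asarray(calc, dtype=float)    # shape (n structures, m data points)
    expt = np.asarray(expt, dtype=float)    # shape (m,)
    var = np.asarray(sigma, dtype=float) ** 2
    n, m = calc.shape
    w0 = np.full(n, 1.0 / n) if w0 is None else np.asarray(w0, dtype=float) / np.sum(w0)

    def weights(lam):
        log_w = np.log(w0) - calc @ lam
        log_w -= log_w.max()                # avoid overflow in exp
        w = np.exp(log_w)
        return w / w.sum()

    def gamma(lam):
        # ln Z(lam) + lam.F_exp + theta/2 * sum(lam^2 * sigma^2)
        a = np.log(w0) - calc @ lam
        log_z = a.max() + np.log(np.sum(np.exp(a - a.max())))
        return log_z + lam @ expt + 0.5 * theta * np.sum(lam ** 2 * var)

    lam = np.zeros(m)
    for _ in range(max_iter):
        w = weights(lam)
        avg = w @ calc
        grad = expt - avg + theta * var * lam
        if np.linalg.norm(grad) < tol:
            break
        centered = calc - avg
        hess = (centered * w[:, None]).T @ centered + theta * np.diag(var)
        step = np.linalg.solve(hess, grad)
        t = 1.0
        while gamma(lam - t * step) > gamma(lam) and t > 1e-8:
            t *= 0.5                        # backtracking keeps Gamma from increasing
        lam = lam - t * step
    return weights(lam)
```

Several data sets (for example the SAXS intensities and the transformed Rh value) can be combined by concatenating their columns in calc, expt, and sigma. Calling the function for a range of θ values gives the kind of scan used to choose the hyperparameter.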

3.5 Reweighting Sic1 Ensembles Against SAXS and NMR Diffusion Experiments

We used the methods described above to determine a reweighted ensemble of Sic1 that takes into account both the prior information encoded in the initial ensemble (from Campari or Flexible-Meccano) and the experimental measurements of compaction from NMR diffusion and SAXS. Before reweighting was applied, Ensemble 1 appeared too compact when judged by agreement with the Rg value extracted from the SAXS data, but was in good agreement with the NMR diffusion data (Fig. 1). In contrast, Ensemble 2 was in good agreement with the SAXS-derived Rg, but appeared too expanded when compared to the NMR diffusion measurements (Fig. 1). The goal was therefore to examine whether one could construct an ensemble that provides a useful compromise between the two data sets. We note here that the NMR diffusion data were recorded at 278 K [17], whereas the SAXS data were obtained at room temperature [15], though we only expect a modest change in compaction in this temperature range [34, 35]. We note also that our goal is not to discuss in detail the conformational ensemble of Sic1, but rather to showcase how one may combine different measures of compaction.

We reweighted the two ensembles against the NMR and SAXS data and compared to the unweighted ensembles (Fig. 2). The first step is to choose the temperature-like hyperparameter, θ, that sets the balance between fitting the data and not deviating too much from the input ensemble. The latter may be quantified by calculating the fraction of the frames in the input ensemble, Neff = exp(Srel), that effectively contributes to the calculated ensemble averages after reweighting. Thus, Neff = 1 corresponds to the initial unweighted ensemble and a low value of Neff indicates that only a small fraction of the original ensemble has been selected to improve agreement with experiments. We scanned values of θ and calculated the agreement with both the SAXS and NMR diffusion data at each value of θ and for each of the two ensembles (Fig. 2a, b). Note that we here plot a reduced χ² (χ²red) for each of the two experiments individually but that the optimization acts to reduce the sum of the two non-reduced χ² values. Here we define χ²red as χ²red = m⁻¹χ², where m is the number of data points. Since there are 179 points in the SAXS measurements, this sum contains a large contribution from the SAXS data (see Note 8). In our analyses here, we chose θ = 100 for Ensemble 1 and θ = 7 for Ensemble 2, though in practical applications it would be advised to examine the results of other choices (see Note 6).

The effect of reweighting can be seen both on the distribution of Rg (Fig. 2c, d) and Rh (Fig. 2e, f). The more compact Ensemble 1 is shifted to include more expanded structures, bringing ⟨Rg⟩trans substantially closer to the value estimated from the SAXS data, while only increasing the calculated Rh value 15% above the experimental value. Similarly, the more expanded Ensemble 2 is shifted to give greater weight to more compact configurations, bringing the calculated Rh closer to experiment while only shifting ⟨Rg⟩trans down by 13%. While it is convenient to examine the distribution of Rg before and after reweighting, the actual reweighting is done against the SAXS data and not the estimated Rg. As explained above, the solvent layer around the protein also contributes to the SAXS measurements, and there may be a 5–10% difference between the Rg calculated from the protein coordinates and the value estimated by SAXS [5]. We thus also show the agreement between the experimental and calculated SAXS curves (Fig. 2g, h). It is clear that the reweighted SAXS curves are substantially closer to the experimental data, though there still remains some discrepancy in the low-q range for Ensemble 1.

Fig. 2 Constructing ensembles to improve agreement with experiments. We used BME reweighting with SAXS and Rh data for Ensemble 1 (a, c, e, g) and Ensemble 2 (b, d, f, h). We label the Rh data as I(Rh) as we here use intensity-based averaging of the measurements (see Note 1). (a, b) We plot Neff (the effective number of frames left after reweighting) vs. χ² when the scaling parameter θ is varied (top axis). The left axes show χ²red for each of the two experiments, whereas the right axis shows the total χ² that is the sum of the two (non-reduced) χ² values (see Note 8). For further analyses we chose θ = 100 (Ensemble 1) and θ = 7 (Ensemble 2). (c–f) We show the distribution of Rg (c, d) and Rh (e, f) before (red) and after (blue) reweighting the two ensembles. Averages over these distributions (both before and after reweighting) are shown either as standard (linear) averages (dashed lines) or “transformed” averages (⟨Rg⟩trans and ⟨Rh⟩trans as described in Notes 1 and 2). (g, h) We show the calculated SAXS intensity from the original ensemble and the refined ensembles, compared to the experimental data. In panels c–f the experimental data are shown as black lines and the errors are shown as shades. Errors in calculated values were estimated by block averaging using five blocks
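The two quantities monitored when scanning θ can be sketched as follows. The sketch assumes the sign convention in which Srel is zero for the unperturbed weights and negative otherwise, so that Neff = exp(Srel) lies between 0 and 1, and it uses the simple normalization of χ² by the number of data points. The function names are illustrative.

```python
import numpy as np


def relative_entropy(w, w0=None):
    """S_rel = -sum_j w_j log(w_j / w0_j); zero when w equals w0, negative otherwise."""
    w = np.asarray(w, dtype=float)
    w0 = np.full(w.size, 1.0 / w.size) if w0 is None else np.asarray(w0, dtype=float)
    mask = w > 0                      # terms with w_j = 0 contribute nothing
    return -np.sum(w[mask] * np.log(w[mask] / w0[mask]))


def n_eff(w, w0=None):
    """Effective fraction of frames, N_eff = exp(S_rel)."""
    return np.exp(relative_entropy(w, w0))


def chi2_reduced(calc, expt, sigma, w):
    """chi2 of the weighted ensemble average, divided by the number of data points m."""
    avg = np.asarray(w, dtype=float) @ np.asarray(calc, dtype=float)
    resid = (avg - np.asarray(expt, dtype=float)) / np.asarray(sigma, dtype=float)
    return np.sum(resid ** 2) / resid.size
```

As a check on the convention, weights concentrated on one of four frames give Neff = 0.25, and uniform weights give Neff = 1.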

3.6 Summary

We have shown here how it is possible to calculate Rh from a conformational ensemble using Eq. 3 and compare to experimental data obtained, e.g., from NMR diffusion measurements. Such measurements provide an alternative view of the compaction to that obtained, e.g., from SAXS experiments, and indeed it has previously been shown that simultaneous refinement against Rh and Rg can provide insight into the shape of the distribution of Rg [7]. We chose the protein Sic1 for exemplifying our analyses since the level of expansion has been measured for this protein using both SAXS and NMR diffusion measurements. Since the data were recorded at slightly different conditions and temperatures, we do not aim to make strong conclusions about the conformational ensemble of Sic1 and have used it here mostly to showcase the methods for analyses. We generated two ensembles and show that one is in relatively good agreement with the NMR diffusion data whereas the other is in better agreement with the SAXS data. At this moment the origins of these differences are unclear. Variation in experimental conditions such as temperature may affect both Rg and Rh [35]. Also, it is possible that our approach for calculating Rh is not always sufficiently accurate since it is inherently limited to the accuracy achievable by hydrodynamic modelling [14], and an important question for future research is whether we can provide better models to link conformation and calculated values of Rh. Finally, despite continued improvement in methods for calculating SAXS data from ensembles [4] there are still potential sources of error from, e.g., solvation effects [5]. Nevertheless, we note that by reweighting the ensembles against both sets of experiments it is possible to construct an ensemble that provides a reasonable balance between the two. 
As more proteins are studied by both NMR and SAXS it should be possible to test and improve our relationship between Rg and Rh, thus enabling further insight into the rules that govern compaction of IDPs.

4 Notes

1. When calculating averages over ensembles, in particular for broad ensembles such as for IDPs, it is important to take the correct form of averaging into account. The best way to calculate averages over experimental quantities will depend both on the type of experiment and often also, e.g., on the time scales for conformational averaging. Throughout this paper we make the assumption that averages can be calculated as time-independent averages over the conformational ensemble. In the case of measurements of the hydrodynamic radius, Rh, we have explored two different types of averaging. In case the experiment measures the average diffusion coefficient then, according to Eq. 2, the average should be calculated as ⟨Rh⟩trans = ⟨Rh⁻¹⟩⁻¹. Here we have introduced the notation ⟨Rh⟩trans to represent that the averaging takes place on a transformed value (in this case proportional to Rh⁻¹). When Rh is measured by pulsed-field gradient NMR diffusion measurements [36], the NMR signal intensity, I, is proportional to exp(−Rh⁻¹) and it may therefore be more appropriate to perform the averaging over this functional form, as shown in Eq. 4. It is this intensity-based averaging that we use here, though in practice we have found it to give essentially the same result as using ⟨Rh⁻¹⟩⁻¹.

2. Similar to the issue of averaging Rh discussed in Note 1 above, we use Eq. 5 when calculating averages over the radius of gyration. This kind of averaging mimics the averaging in the low-q regime of SAXS curves. Note, however, that ⟨Rg⟩trans calculated in this way should not directly be compared to experimental values of Rg since the latter include solvation effects. While solvation also affects the hydrodynamic radius, effects of solvation are implicitly absorbed into the transformation in Eq. 3.

3. Notes 1 and 2 discuss the transformations that are relevant for comparing calculated and experimental quantities. We note, however, that during the reweighting protocol and when one makes quantitative comparisons between experiments and computation it is in general better to compare to the direct experimental quantities. In the case of SAXS experiments we thus judge agreement and perform reweighting against the experimentally measured intensities. In the case of the Rh measured for Sic1 by NMR diffusion experiments, we transform the experimental value of Rh (and its error), as well as the values calculated for each structure, using the function I ∝ exp(−Rh⁻¹), as described in our associated Jupyter notebook. We note that in the future it might be more appropriate to perform such fitting directly against the measured intensities as a function of the gradient strength.

4. The level of compaction as quantified by the value of Rg at which Rg = Rh (Rg⁰) can be estimated by rearranging Eq. 3 to obtain:

$$R_g^0 = \alpha_1^{-1}\left(1 - \alpha_3\right)\left(N^{0.60} - N^{0.33}\right) + \alpha_2 N^{0.33}$$

For a protein with N = 92 one obtains Rg⁰ = 27 Å.

5. We have here described approaches to refine ensembles against SAXS and NMR diffusion measurements. The BME method has also been used for IDPs with NMR chemical shifts [37], and may also readily be applied to SANS data, NOEs, scalar couplings, or other measurements that can be calculated as averages over configurational ensembles.


6. Currently, the value of the hyperparameter θ (which sets the balance between information from the data and the force field) is set manually. In certain cases it may be possible to set it via a cross-validation approach [37] or it may be integrated out as a Bayesian “nuisance parameter” [28].

7. We have here used Pepsi-SAXS to calculate X-ray scattering curves from a conformational ensemble due to its ease of use and relatively high computational efficiency. The latter is particularly important for large conformational ensembles. We note, however, that several other methods exist, and we suggest that users in particular keep solvent effects in mind when calculating and interpreting SAXS data [4, 5]. In the Jupyter notebook available online we provide a script that performs a two-pass run of Pepsi-SAXS to find a reasonable value of solvent-related parameters in the calculations.

8. When plotting χ²red in Fig. 2, we calculate it by normalizing χ² by the number of experimental data points: χ²red = m⁻¹χ². We note that this is an approximation: the number of degrees of freedom can be smaller because parameters are fitted, such as those involved in calculating the SAXS curves. Also, in the case of reweighting, the weights themselves may be considered as free parameters. Thus, we note that the reweighting does not involve this normalization, and that χ²red is only shown in Fig. 2 to give the reader an impression of the level of agreement. We also note that when fitting the Rh the resulting sum in χ² only contains a single term. Finally, we note that we here simply combine the χ² from the SAXS and NMR diffusion experiments by adding up the two individual χ² terms. In the current implementation, BME does not enable automatic balancing of independent experiments and instead sets this balance by the error estimates of the individual experiments. We note, however, that while the SAXS data for Sic1 contain 179 individual data points, the amount of information in SAXS experiments typically corresponds to a smaller number of parameters [38], and a more careful balance between the information in the SAXS and NMR diffusion experiments should take such effects into account [39].
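The arithmetic behind Note 4 can be checked with a few lines of Python. This is a sketch that uses the parameter values quoted with Eq. 3; the function name is an illustrative choice.

```python
ALPHA1, ALPHA2, ALPHA3 = 0.216, 4.06, 0.821  # parameters of Eq. 3


def crossover_rg(n_residues):
    """Rg (Angstrom) at which Rg = Rh, obtained by rearranging Eq. 3."""
    n033 = n_residues ** 0.33
    n060 = n_residues ** 0.60
    return (1.0 - ALPHA3) * (n060 - n033) / ALPHA1 + ALPHA2 * n033


print(round(crossover_rg(92), 1))  # 26.9, i.e. about 27 Angstrom as stated in Note 4
```

The crossover value increases with chain length, so the Rg at which Rh and Rg coincide has to be recalculated for each protein.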

Acknowledgements

We thank Dr. Tanja Mittag for providing feedback on the manuscript, Dr. Andreas Haahr Larsen for general discussions about SAXS experiments and calculations, and Dr. Martin Blackledge for suggesting to use intensity-based averaging for Rh. The research described here was supported by a grant from the Lundbeck Foundation to the BRAINSTRUC structural biology initiative.


References

1. Guinier A, Fournet G (1955) Small angle X-ray scattering. Wiley, New York
2. Zheng W, Best RB (2018) An extended Guinier analysis for intrinsically disordered proteins. J Mol Biol 430(16):2540–2553
3. Riback JA, Bowman MA, Zmyslowski AM, Knoverek CR, Jumper JM, Hinshaw JR, Kaye EB, Freed KF, Clark PL, Sosnick TR (2017) Innovative scattering analysis shows that hydrophobic disordered proteins are expanded in water. Science 358(6360):238–241
4. Hub JS (2018) Interpreting solution X-ray scattering data using molecular simulations. Curr Opin Struct Biol 49:18–26
5. Henriques J, Arleth L, Lindorff-Larsen K, Skepö M (2018) On the calculation of SAXS profiles of folded and intrinsically disordered proteins from computer simulations. J Mol Biol 430(16):2521–2539
6. Edward JT (1970) Molecular volumes and the Stokes-Einstein equation. J Chem Educ 47(4):261
7. Choy WY, Mulder FA, Crowhurst KA, Muhandiram D, Millett IS, Doniach S, Forman-Kay JD, Kay LE (2002) Distribution of molecular size within an unfolded state ensemble using small-angle X-ray scattering and pulse field gradient NMR techniques. J Mol Biol 316(1):101–112
8. Burchard W, Schmidt M, Stockmayer W (1980) Information on polydispersity and branching from combined quasi-elastic and integrated scattering. Macromolecules 13(5):1265–1272
9. Oono Y, Kohmoto M (1983) Renormalization group theory of transport properties of polymer solutions. I. Dilute solutions. J Chem Phys 78(1):520–528
10. Lindorff-Larsen K, Kristjansdottir S, Teilum K, Fieber W, Dobson CM, Poulsen FM, Vendruscolo M (2004) Determination of an ensemble of structures representing the denatured state of the bovine acyl-coenzyme A binding protein. J Am Chem Soc 126(10):3291–3299
11. de la Torre JG, Huertas ML, Carrasco B (2000) Calculation of hydrodynamic properties of globular proteins from their atomic-level structure. Biophys J 78(2):719–730
12. Ortega A, Amorós D, De La Torre JG (2011) Prediction of hydrodynamic and other solution properties of rigid proteins from atomic- and residue-level models. Biophys J 101(4):892–898
13. Amorós D, Ortega A, García de la Torre J (2013) Prediction of hydrodynamic and other solution properties of partially disordered proteins with a simple, coarse-grained model. J Chem Theory Comput 9(3):1678–1685
14. Nygaard M, Kragelund BB, Papaleo E, Lindorff-Larsen K (2017) An efficient method for estimating the hydrodynamic radius of disordered protein conformations. Biophys J 113(3):550–557
15. Mittag T, Marsh J, Grishaev A, Orlicky S, Lin H, Sicheri F, Tyers M, Forman-Kay JD (2010) Structure/function implications in a dynamic complex of the intrinsically disordered sic1 with the cdc4 subunit of an SCF ubiquitin ligase. Structure 18(4):494–506
16. Varadi M, Kosol S, Lebrun P, Valentini E, Blackledge M, Dunker AK, Felli IC, Forman-Kay JD, Kriwacki RW, Pierattelli R, et al (2013) pE-DB: a database of structural ensembles of intrinsically disordered and of unfolded proteins. Nucleic Acids Res 42(D1):D326–D335
17. Mittag T, Orlicky S, Choy WY, Tang X, Lin H, Sicheri F, Kay LE, Tyers M, Forman-Kay JD (2008) Dynamic equilibrium engagement of a polyvalent ligand with a single-site receptor. Proc Natl Acad Sci U S A 105(46):17772–17777
18. Ozenne V, Bauer F, Salmon L, Huang Jr, Jensen MR, Segard S, Bernadó P, Charavay C, Blackledge M (2012) Flexible-Meccano: a tool for the generation of explicit ensemble descriptions of intrinsically disordered proteins and their associated experimental observables. Bioinformatics 28(11):1463–1470
19. Vitalis A, Pappu RV (2009) Methods for Monte Carlo simulations of biomacromolecules. Annu Rep Comput Chem 5:49–76
20. Rotkiewicz P, Skolnick J (2008) Fast procedure for reconstruction of full-atom protein models from reduced representations. J Comput Chem 29(9):1460–1465
21. Grudinin S, Garkavenko M, Kazennov A (2017) Pepsi-SAXS: an adaptive method for rapid and accurate computation of small-angle X-ray scattering profiles. Acta Crystallogr D 73:449–464
22. Bottaro S, Bengtsen T, Lindorff-Larsen K (2020) Integrating molecular simulation and experimental data: a Bayesian/maximum entropy reweighting approach. In: Gáspári Z (ed) Structural Bioinformatics (pp. 219–240). Humana, New York, NY
23. McGibbon RT, Beauchamp KA, Harrigan MP, Klein C, Swails JM, Hernández CX, Schwantes CR, Wang LP, Lane TJ, Pande VS (2015) Mdtraj: a modern open library for the analysis of molecular dynamics trajectories. Biophys J 109(8):1528–1532
24. Vitalis A, Pappu RV (2009) Absinth: a new continuum solvation model for simulations of polypeptides in aqueous solutions. J Comput Chem 30(5):673–699
25. Bottaro S, Lindorff-Larsen K (2018) Biophysical experiments and biomolecular simulations: a perfect match? Science 361(6400):355–360
26. van Gunsteren WF, Daura X, Hansen N, Mark AE, Oostenbrink C, Riniker S, Smith LJ (2018) Validation of molecular simulation: an overview of issues. Angew Chem Int Ed 57(4):884–902
27. Norgaard AB, Ferkinghoff-Borg J, Lindorff-Larsen K (2008) Experimental parameterization of an energy function for the simulation of unfolded proteins. Biophys J 94(1):182–192
28. Hummer G, Köfinger J (2015) Bayesian ensemble refinement by replica simulations and reweighting. J Chem Phys 143(24):243150
29. Boomsma W, Ferkinghoff-Borg J, Lindorff-Larsen K (2014) Combining experiments and simulations using the maximum entropy principle. PLoS Comput Biol 10(2):e1003406
30. Bonomi M, Heller GT, Camilloni C, Vendruscolo M (2017) Principles of protein structural ensemble determination. Curr Opin Struct Biol 42:106–116
31. Cesari A, Reißer S, Bussi G (2018) Using the maximum entropy principle to combine simulations and solution experiments. Computation 6(1):15
32. Hermann MR, Hub JS (2019) SAXS-restrained ensemble simulations of intrinsically disordered proteins with commitment to the principle of maximum entropy. J Chem Theory Comput 15(9):5103–5115
33. Cesari A, Gil-Ley A, Bussi G (2016) Combining simulations and solution experiments as a paradigm for RNA force field refinement. J Chem Theory Comput 12(12):6192–6200
34. Kjaergaard M, Nørholm AB, Hendus-Altenburger R, Pedersen SF, Poulsen FM, Kragelund BB (2010) Temperature-dependent structural changes in intrinsically disordered proteins: formation of α-helices or loss of polyproline II? Protein Sci 19(8):1555–1564
35. Jephthah S, Staby L, Kragelund BB, Skepö M (2019) Temperature dependence of intrinsically disordered proteins in simulations: What are we missing? J Chem Theory Comput 15(4):2672–2683
36. Wilkins DK, Grimshaw SB, Receveur V, Dobson CM, Jones JA, Smith LJ (1999) Hydrodynamic radii of native and denatured proteins measured by pulse field gradient NMR techniques. Biochemistry 38(50):16424–16431
37. Crehuet R, Buigues PJ, Salvatella X, Lindorff-Larsen K (2019) Bayesian-Maximum-Entropy reweighting of IDP ensembles based on NMR chemical shifts. Entropy 21:898
38. Vestergaard B, Hansen S (2006) Application of Bayesian analysis to indirect Fourier transformation in small-angle scattering. J Appl Crystallogr 39(6):797–804
39. Larsen AH, Arleth L, Hansen S (2018) Analysis of small-angle scattering data using model fitting and Bayesian regularization. J Appl Crystallogr 51(4):1151–1161

Part VI Determinants of Interactions

Chapter 22 Binding Thermodynamics to Intrinsically Disordered Protein Domains

Arne Schön and Ernesto Freire

Abstract

Many proteins are intrinsically disordered or contain one or more disordered domains. These domains can participate in binding interactions with other proteins or small ligands. Binding to intrinsically disordered protein domains requires the folding or structuring of those regions such that they can establish well-defined stoichiometric interactions. Since binding is then coupled to folding, the energetics of both events are reflected in the measured binding thermodynamics. In this protocol, we illustrate the thermodynamic differences between binding coupled to folding and binding independent of folding for the same protein. As an example, we use the HIV-1 envelope glycoprotein gp120, which contains structured as well as disordered domains. In the experiments presented, we discuss the binding of gp120 to molecules that bind to disordered regions and trigger structuring (CD4 or MAb 17b) and to a molecule that binds to structured regions and does not induce conformational structuring (MAb b12).

Key words ITC, IDP, Calorimetry, Protein folding, gp120, HIV, Interactions, Folding upon binding

1 Introduction

1.1 Isothermal Titration Calorimetry

Isothermal titration calorimetry, ITC, is a technique in which the heat associated with binding is measured while one reactant contained in a syringe is stepwise added to the other in the calorimetric reaction cell. While several techniques are available for binding studies, only ITC has the advantage of providing the energetic contributions to the binding reaction. ITC provides directly the association constant and the enthalpy change associated with a binding reaction. The Gibbs energy of binding, ΔG, and the entropy change, ΔS, are calculated from the affinity and enthalpy change. While a single value of ΔG determines the binding affinity, different combinations of ΔH and ΔS values can define the same ΔG. Since ΔH and ΔS originate from different types of interactions, their knowledge provides critical information to understand the nature of a binding reaction, the redesigning of ligand molecules

Birthe B. Kragelund and Karen Skriver (eds.), Intrinsically Disordered Proteins: Methods and Protocols, Methods in Molecular Biology, vol. 2141, https://doi.org/10.1007/978-1-0716-0524-0_22, © Springer Science+Business Media, LLC, part of Springer Nature 2020


Fig. 1 The thermodynamic signature allows a rapid visual representation of the enthalpy (green) and entropy (red) contributions to the Gibbs energy (blue). Light colors are used to illustrate the sign of the main contributions to the overall energetics. The magnitudes are arbitrary and are used only for illustration purposes

[1–5], as well as specific applications such as the discovery and optimization of drug candidates [6–14]. The enthalpy and entropy contributions to the Gibbs energy can be visualized in the thermodynamic signature [7, 15, 16] as shown in Fig. 1. Intrinsic ligand/macromolecule interactions as well as coupled reactions contribute to the thermodynamic signature. Among the most common coupled reactions is the presence of protonation or deprotonation associated with binding. The presence of protonation/deprotonation can be assessed by performing the binding reaction in buffers with different enthalpies of protonation [17–19]. Among the intrinsic ligand/macromolecule interactions, two major terms contribute to the enthalpy change, and also two major terms contribute to the entropy change. They are indicated in Fig. 1 using lighter colors. The formation of hydrogen bonds and van der Waals interactions upon binding contribute favorably to the binding enthalpy. Conversely, the burial of polar groups that do not participate in hydrogen bonding contributes unfavorably to the binding enthalpy due to the positive enthalpy of polar desolvation [20]. The desolvation of nonpolar and polar

ITC on Disordered Proteins


groups contributes favorably to the binding entropy [20]. On the other hand, structuring, including folding of disordered protein regions, contributes unfavorably to the binding entropy as it reduces the conformational degrees of freedom. The binding of ligands to intrinsically disordered proteins or to intrinsically disordered protein domains is coupled to the folding of those regions, and consequently, the binding thermodynamics reflects this phenomenon. Folding is coupled to a favorable enthalpic component due to the formation of hydrogen bonds and van der Waals interactions within the previously disordered domain and an unfavorable entropic component arising from the decrease in conformational degrees of freedom associated with the refolding or structuring process. As an example of ligand binding to a protein containing disordered domains, we selected the HIV-1 envelope glycoprotein gp120 for the experiments presented in this protocol. gp120 mediates the first step in HIV infection [21, 22]. Prior to binding, gp120 contains significant regions which are intrinsically disordered and become structured upon binding to the human CD4 receptor or to certain antibodies [23–25]. The conformational structuring coupled to binding is reflected in the binding thermodynamics measured by ITC. In the experiments presented here, the binding of gp120 to molecules that bind to disordered regions and trigger structuring (CD4 or MAb 17b) and to molecules that bind to structured regions and do not induce conformational structuring (MAb b12) is discussed.

1.2 Thermodynamic Information Obtained from ITC at One Temperature

The binding of a ligand, L, to a macromolecule, M, containing a single site (or multiple identical, independent sites) is represented by:

$$\mathrm{M} + \mathrm{L} \rightleftharpoons \mathrm{ML} \tag{1}$$

where Eq. 1 represents the binding per site. The affinity, Ka, for the binding reaction is determined by the Gibbs energy of binding, ΔG:

$$K_a = \frac{[\mathrm{ML}]}{[\mathrm{M}][\mathrm{L}]} = e^{-\Delta G/RT} \tag{2}$$

where R is the gas constant, 1.9872 cal/(K mol), and T is the absolute temperature in kelvin. ΔG is defined in terms of the enthalpy, ΔH, and the entropy, ΔS, changes:

$$\Delta G = -RT \ln K_a = \Delta H - T\Delta S \tag{3}$$

A single ITC experiment (Fig. 2) performed at one temperature provides all the necessary information to define the thermodynamic signature: the Gibbs energy of binding, ΔG; the binding enthalpy, ΔH; and the entropy change, ΔS [2, 26]. Additionally, ITC also determines the stoichiometry of the reaction.
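As a minimal sketch of how Eqs. 2 and 3 turn the two fitted quantities (Ka and ΔH) into the full thermodynamic signature, the Python snippet below computes ΔG from an association constant and obtains −TΔS by difference. The function names and the input values (Ka = 1 × 10^8 per molar, ΔH = −8.0 kcal/mol, 25 °C) are illustrative assumptions and are not data from this chapter.

```python
import math

R_KCAL = 1.9872e-3  # gas constant in kcal/(K mol), the value quoted in the text


def gibbs_energy(ka, temp_k):
    """Return dG = -RT ln(Ka) in kcal/mol (Eq. 3); ka is in 1/M."""
    return -R_KCAL * temp_k * math.log(ka)


def thermodynamic_signature(ka, delta_h, temp_k=298.15):
    """Return (dG, dH, -TdS) in kcal/mol from Ka (1/M) and dH (kcal/mol)."""
    delta_g = gibbs_energy(ka, temp_k)
    minus_t_delta_s = delta_g - delta_h  # rearranged from dG = dH - TdS
    return delta_g, delta_h, minus_t_delta_s


# Hypothetical inputs, chosen only to illustrate the arithmetic
dg, dh, mtds = thermodynamic_signature(ka=1.0e8, delta_h=-8.0)
print(f"dG = {dg:.2f}, dH = {dh:.2f}, -TdS = {mtds:.2f} kcal/mol")
```

Reporting −TΔS rather than ΔS keeps all three terms in the same units, so that the enthalpy and entropy bars of the signature add up to ΔG.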

Fig. 2 Output from a typical ITC experiment where a ligand, L, is injected into the protein, M, in the cell. The upper panel shows the heat flow, dQ/dt (μcal/s), as a function of time (min). Integration of the area under the peaks gives the heat, Q (kcal/mol of injectant), associated with each injection. The lower panel shows a plot of Q as a function of the molar ratio, [L]/[M], in the cell (filled circles) together with a nonlinear least squares fit of the data to a one-site binding model (solid line). Nonlinear least squares analysis of the data allows determination of ΔG, ΔH, ΔS, and the stoichiometry of binding

1.3 ITC Results for gp120

b12 is an antibody that binds to the CD4 binding region of gp120 and does not induce conformational structuring [27]. The binding of b12 to gp120 is shown in the left panel of Fig. 3. The thermodynamic signatures for the respective binding reactions are shown below each ITC experiment. The binding is characterized by a ΔG of −10.4 kcal/mol, with favorable contributions from both the binding enthalpy (ΔH = −7.8 kcal/mol) and the entropy (TΔS = 2.6 kcal/mol). The magnitude of these values is characteristic of protein/protein interactions in which the binding energetics is primarily defined by the interactions occurring at the interacting surfaces, for example, barnase/barstar, ΔH = 19.9 and TΔS = 0.3 kcal/mol [28]; elastase/ovomucoid third domain, ΔH = 0.6 and TΔS = 13.9 kcal/mol [29]; nicotinic acetylcholine receptor/conotoxin, ΔH = 1.4 kcal/mol and TΔS = 8.8 kcal/mol [30]; or ferredoxin:NADP reductase/ferredoxin, ΔH = 0.3 kcal/mol and TΔS = 9.0 kcal/mol [31]. Additional examples are summarized and tabulated in [32].

Fig. 3 Calorimetric titrations of gp120 with MAb b12 (left panel), sCD4 (middle panel), and MAb 17b (right panel). The thermodynamic signatures for the respective binding reactions are shown below each ITC experiment. Binding of sCD4 and 17b to gp120 is associated with the unusually large favorable enthalpy and unfavorable entropy changes observed when binding is coupled to large conformational structuring. The thermodynamic signature is completely different for MAb b12 which binds to gp120 without inducing any conformational structuring. In the titration with sCD4, 2.0 μM gp120 in the calorimetric cell was titrated with 26.7 μM sCD4. In the experiments with the two antibodies, the cell was filled with gp120 at a concentration of 2.5 μM, and injections were made of either MAb 17b or b12 at 15.0 and 19.2 μM, respectively. Note that the concentrations of antigen binding sites are used for the antibodies b12 and 17b (twice the antibody concentration)

The situation is completely different when binding is coupled to folding. In this case, in addition to the resulting interface interactions, the energetics of structuring is also included in the measured thermodynamics. This is the case when the soluble form of the human cell surface receptor CD4 (sCD4) or the antibody 17b binds to gp120 [24, 25, 33]. The ITC experiments for sCD4 and MAb 17b binding to gp120 are shown in the center and right panels of Fig. 3, respectively. The thermodynamic signatures are shown below the ITC experiments. The binding thermodynamics of both sCD4 and 17b are characterized by similar ΔG values


(−11.7 and −11.5 kcal/mol) but very large favorable binding enthalpies (−39.0 and −28.7 kcal/mol) and, most importantly, extremely large unfavorable entropy contributions (−TΔS = 27.3 and 17.2 kcal/mol). The much larger favorable enthalpies arise from the formation of hydrogen bonds and van der Waals interactions upon refolding, while the extremely large unfavorable entropies indicate that the entropy loss associated with the reduced number of degrees of freedom predominates over any entropy gains due to desolvation (Fig. 1). It must be noted that folding of disordered domains is not only observed when two proteins bind. Small molecules are also able to trigger the structuring of disordered domains, as has been demonstrated for gp120 [11, 13, 14, 34].

1.4 Determination of the Change in Heat Capacity Associated with Binding

ITC experiments performed at different temperatures allow the determination of the change in heat capacity, ΔCp, associated with binding. ΔCp is defined as the temperature derivative of the enthalpy at constant pressure according to Eq. 4:

$$\Delta C_p = \left( \frac{\partial \Delta H}{\partial T} \right)_p \tag{4}$$

and is simply obtained as the slope of a plot of the enthalpy change as a function of temperature. The change in heat capacity depends mainly on how much of the solvent-exposed surface area becomes buried upon binding [18, 35, 36]. Burial of hydrophobic residues with the concomitant release of hydration water is associated with a decrease in heat capacity, and the ΔCp associated with the binding event is consequently negative. Figure 4 shows plots of the enthalpy as a function of temperature for the binding of MAb b12, MAb 17b, and sCD4 to gp120. The resulting values are −0.4, −1.3, and −1.8 kcal/(K mol), respectively. As expected, the heat capacity changes for the systems coupled to structuring (CD4 and 17b) are about three- to fourfold larger, as folding causes the burial of a significantly larger number of hydrophobic groups.
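The slope determination described above can be sketched in a few lines of Python. The three points below are synthetic: they were constructed from the sCD4 values quoted in the text (a binding enthalpy of −39.0 kcal/mol at 25 °C and a heat capacity change of −1.8 kcal/(K mol)) and are not the measured data behind Fig. 4. The helper name `linear_fit` is likewise only an assumption for illustration.

```python
def linear_fit(x, y):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(x)
    x_mean = sum(x) / n
    y_mean = sum(y) / n
    sxx = sum((xi - x_mean) ** 2 for xi in x)
    sxy = sum((xi - x_mean) * (yi - y_mean) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return slope, y_mean - slope * x_mean


# Synthetic enthalpies (kcal/mol) at three temperatures (deg C); not measured data
temps_c = [15.0, 25.0, 35.0]
delta_h = [-21.0, -39.0, -57.0]

# The slope is dCp in kcal/(K mol); a 1 K step equals a 1 deg C step
delta_cp, intercept = linear_fit(temps_c, delta_h)
print(f"dCp = {delta_cp:.2f} kcal/(K mol)")
```

With only three temperatures the fit has a single degree of freedom, so in practice the uncertainty of each ΔH value matters as much as the slope itself.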

1.5 Experimental Design

In order to accurately determine not only the enthalpy change (and stoichiometry) but also the affinity of binding, it is necessary to choose the concentrations of the reactants correctly. The best guide is to estimate a range for the expected affinity and select a concentration of the protein in the cell, [M]T, such that the parameter c = Ka[M]T = [M]T/Kd is between 1 and 1000 depending on the actual binding affinity [5]. As a general rule, the concentration of the ligand [L]T in the syringe should typically be 10- to 20-fold higher than [M]T. If the concentration [M]T is lower than Kd, the c-value will be too low (c < 1), and the degree of binding will be small throughout the titration without reaching complete saturation. If the c-value is too high (c > 1000), all of the ligand is consumed in each injection until the protein in the cell is saturated

Fig. 4 Plots of the enthalpy change, ΔH (kcal/mol), as a function of temperature (°C) for the gp120 binding of sCD4 and the monoclonal antibodies 17b and b12. The enthalpy values were determined in ITC experiments performed at 15, 25, and 37 °C. Each slope represents the heat capacity change upon binding, ΔCp (b12, −0.4; 17b, −1.3; sCD4, −1.8 kcal/(K × mol)). ΔCp is negative in the three cases, which mainly reflects the release of ordered hydration water coupled to the association process. Binding of MAb b12 is associated with the smallest ΔCp because the dehydration is associated primarily with the binding interface. The conformational structuring associated with 17b and, especially, sCD4 binding involves a large number of residues that become buried away from the aqueous phase, which is reflected in the much larger negative values for ΔCp

and the equilibrium is not reflected in any gradual change of the heats during the course of the experiment. This happens if the binding is very tight and the reactant concentrations are several orders of magnitude higher than the Kd for the reaction. The best ITC experiments are obtained for c-values of 10–100. See also refs. [3, 37] for a discussion and illustration of the effects of the c-value on the shape of the titration curves.
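The effect of the c-value can be illustrated with a short calculation. The sketch below evaluates c = [M]T/Kd and, for several c-values, the fraction of sites occupied when ligand and protein are present at a 1:1 ratio. It uses the standard quadratic mass-balance solution for one-site binding in terms of total concentrations, which is not written out in this chapter, and it ignores dilution of the cell contents during the titration. The function names and the example Kd of 20 nM are hypothetical.

```python
import math


def c_value(m_total, kd):
    """c = [M]T / Kd, with both quantities in the same concentration units."""
    return m_total / kd


def complex_conc(m_total, l_total, kd):
    """[ML] for one-site binding from total concentrations (quadratic solution)."""
    b = m_total + l_total + kd
    return (b - math.sqrt(b * b - 4.0 * m_total * l_total)) / 2.0


def saturation_at_equivalence(c):
    """Fraction of sites occupied when [L]T = [M]T, for a given c-value."""
    # Concentrations are expressed in units of [M]T, so that Kd = 1/c
    return complex_conc(1.0, 1.0, 1.0 / c)


# Hypothetical case: 2 uM protein in the cell and an assumed Kd of 20 nM
print(f"c = {c_value(2.0, 0.02):.0f}")

for c in (1, 10, 100, 1000):
    frac = saturation_at_equivalence(c)
    print(f"c = {c:4d}: fraction bound at a 1:1 ratio = {frac:.3f}")
```

At low c a large share of the sites is still free at the equivalence point, which is why the titration curve becomes shallow and saturation is not reached; at very high c binding is essentially complete at every injection before equivalence, so the curve approaches a step and carries little information on Kd.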

2 Materials

2.1 Reagents and Supplies

1. HIV-1 envelope glycoprotein gp120 of one of the more common HIV strains (e.g., YU2, HXBc2, JR-FL).
2. The soluble form of the human CD4 receptor, sCD4, either having two or four domains (D1D2 or D1D2D3D4).
3. Monoclonal antibody 17b (MAb 17b) (Strategic BioSolutions, Newark, DE, USA).
4. Monoclonal antibody b12 (MAb b12) (Dr. J. Sodroski, Dana-Farber Cancer Institute, Harvard Medical School, Boston, MA, USA).
5. Dialysis cassettes, 0.5–3 mL, with 30 kDa cutoff.
6. Centrifugal filters with 10 kDa cutoff for concentration of dilute protein solutions.
7. Glass test tubes, 6 × 50 mm, for loading of the injector syringe.
8. Phosphate-buffered saline, PBS, pH 7.4: 137 mM NaCl, 2.7 mM KCl, 8.1 mM Na2HPO4, 1.8 mM KH2PO4.

2.2 Isothermal Titration Calorimeter

The titration calorimeter used in the experiments described below is a VP-ITC from MicroCal/Malvern Instruments (Northampton, MA, USA). A syringe that also serves as a stirrer delivers the ligand to the cell at preset intervals. The VP-ITC has a cell for the sample and a matching reference cell, both kept at constant temperature. A feedback power supplied to the sample cell keeps the temperature difference between the two cells close to zero. The feedback loop is active from the beginning of the experiment in response to a constant opposing reference power which slightly offsets the temperature of the reference cell. As a result, any heat released or absorbed during an injection is compensated by the power supplied to the sample cell. The feedback power is directly proportional to the thermal power, dQ/dt, and serves as the recorded output signal. A VP-ITC has a cell volume of ~1.4 mL, but instruments having smaller volumes down to 200 μL are available from the same or other manufacturers. It is important to point out that although the low-volume instruments currently on the market have higher sensitivity than the VP-ITC, the sensitivity per unit volume is lower. In order to achieve the same high-quality data using the low-volume instruments, it is therefore necessary to increase the concentrations of the reactants two- to threefold, as mentioned in the notes to the following section.

2.3 Experimental Systems

The HIV-1 envelope glycoprotein, gp120, contains intrinsically disordered domains and was selected as the target protein in the ITC experiments. The soluble form of the human cell surface receptor, sCD4, was chosen as one of the ligands as it induces a large-scale conformational structuring in gp120 upon binding [24, 25]. The two monoclonal antibodies 17b and b12 bind to gp120 with and without inducing any conformational structuring, respectively [33], and were also selected for the experiments.
1. gp120 used for the studies presented here is of the HIV-1 subtype B isolate YU2. YU2 gp120 has 470 residues and is heavily glycosylated with a molecular weight of 100–120 kDa depending on the degree of glycosylation. The molecular weight in the absence of glycans is 53.4 kDa, which is the value used in the conversion of the concentration from gram to mole basis. YU2 gp120 was expressed in 293F cells and purified as previously described [38]. The purified gp120 was stored as 20–40 μM aliquots in PBS at −80 °C.
2. sCD4 is the soluble form of the human CD4 receptor. The construct of sCD4 used here is the four-domain sCD4 (MW 45.2 kDa) and was purified as described elsewhere [39]. sCD4 was stored frozen as aliquots of 50–100 μM in PBS at −80 °C. Identical results will be obtained with the two-domain construct of sCD4 (MW 20.3 kDa).
3. MAb 17b is a monoclonal antibody which binds to a site that overlaps the coreceptor binding site in gp120 [23]. The 17b antibody (MW 150 kDa) has two equivalent antigen binding sites for gp120. MAb 17b was obtained from Strategic BioSolutions (Newark, DE, USA).
4. MAb b12 (MW 150 kDa) is a monoclonal antibody with two equivalent sites which bind to an epitope in gp120 that overlaps the binding site for CD4 [27]. MAb b12 was kindly provided by J. Sodroski (Dana-Farber Cancer Institute, Harvard Medical School, Boston, MA, USA).
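The two unit conversions implied above (mass to molar concentration using the glycan-free molecular weight, and antibody to binding-site concentration) can be sketched as follows. The function names and the example inputs (a 0.1068 mg/mL gp120 sample and a 15 μM antibody solution) are hypothetical and serve only to show the arithmetic.

```python
def molar_conc_um(mass_conc_mg_per_ml, mw_kda):
    """Convert a mass concentration in mg/mL to micromolar, given the MW in kDa."""
    # mg/mL equals g/L; dividing by g/mol gives mol/L, then scale to umol/L
    return mass_conc_mg_per_ml / (mw_kda * 1000.0) * 1.0e6


def binding_site_conc_um(antibody_conc_um, sites_per_antibody=2):
    """Concentration of antigen binding sites for an antibody with equivalent sites."""
    return antibody_conc_um * sites_per_antibody


GP120_MW_KDA = 53.4  # glycan-free molecular weight quoted in the text

# Hypothetical sample concentrations, chosen only for illustration
gp120_um = molar_conc_um(0.1068, GP120_MW_KDA)
sites_um = binding_site_conc_um(15.0)
print(f"gp120: {gp120_um:.2f} uM; antibody binding sites: {sites_um:.1f} uM")
```

Using the glycan-free molecular weight matters here: converting the same mass concentration with the glycosylated mass of 100–120 kDa would roughly halve the apparent molar concentration.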

3 Methods

3.1 Sample Preparation

All reagents must be dialyzed against the exact same buffer, and the dialysate is preferably used for further dilutions of the reagents. The use of matching buffers will minimize the heat of dilution of the buffer components.
1. gp120: Dialyze 1 mL of gp120 at 5 μM or higher overnight using a dialysis cassette at 4 °C against two buffer changes of 1 L PBS, pH 7.4, with stirring. Determine the concentration of gp120 after dialysis from the absorbance at 280 nm using an extinction coefficient of 79,000/M/cm. The concentration will be lower after dialysis but should not have dropped below 2 μM. A sample that is too dilute must be concentrated using a centrifugal filter.
2. sCD4: Dialyze 1 mL of sCD4 at a concentration of 40–50 μM against two buffer changes of 1 L PBS, pH 7.4, at 4 °C with stirring. The concentration after the dialysis should not be lower than 30 μM. A protein solution that is too dilute must be concentrated using a centrifugal filter. At the end of the dialysis, dilute an aliquot of the sample fivefold and measure the concentration at 280 nm using an extinction coefficient of 59,845/M/cm. For the measurements, prepare sCD4 at a concentration of 25–30 μM, diluting if necessary.
3. MAbs 17b and b12: Dialyze 1 mL of each antibody at a concentration of 20–25 μM against two buffer changes of 1 L PBS, pH 7.4, at 4 °C with stirring. At the end of the dialysis, dilute aliquots of the proteins fivefold and measure the concentration at 280 nm. The extinction coefficients for MAbs 17b and b12 are 144,830 (see Note 1) and 218,420/M/cm, respectively. The concentration after the dialysis should not be lower than 15 μM. Antibody solutions that are too dilute must be concentrated using a centrifugal filter. Note that the actual concentration of binding sites will be twice the antibody concentration.

3.2 Experimental Procedure

Results from ITC experiments are normally reported at 25 °C, but as a complete thermodynamic characterization of the interactions requires the determination of the change in heat capacity of binding, measurements are also performed at other temperatures, here at 15 and 35 °C. The following procedure applies directly to a VP-ITC (MicroCal/Malvern Instruments, Northampton, MA, USA) (see Note 2).
1. Make sure the calorimetric cell is clean. For cleaning, fill the cell with 0.5 M NaOH for 5–10 min followed by thorough rinsing with water.
2. Set the temperature of the instrument (15, 25, or 35 °C) (see Note 3).
3. Prepare 2.2 mL of 2 μM gp120 by dilution of the dialyzed stock solution with identically prepared PBS. Measure the exact concentration from the absorbance reading of the undiluted solution at 280 nm in triplicate measurements.
4. Prepare 500 μL of sCD4 at 30 μM and 500 μL of either of the two antibodies at 15 μM (30 μM binding sites) in PBS. Prepare the ligand solutions in test tubes having adequate diameter to allow efficient mixing (e.g., 12 mm). Prior to filling the syringe, transfer the solutions to 6 × 50 mm test tubes.
5. Cover the solutions with perforated Parafilm, and place under vacuum using a water aspirator or vacuum pump for 15 min to prevent the formation of gas bubbles during the calorimetric experiment. For the experiment at 15 °C, place the degassed gp120 solution on ice for a couple of minutes prior to filling the syringe.
6. Fill the reference cell of the calorimeter with water or buffer. Rinse the sample cell with buffer. Then, slowly inject gp120 solution into the sample cell without introducing air bubbles. Any air bubbles can easily be removed while filling the cell by forcefully injecting the last 100–200 μL. Remove any excess sample from the top of the cell. Save the remaining solution for concentration determination.
7. Fill the 250 μL injection syringe with the ligand solution. The injector syringe should be filled without introducing air bubbles, although a minor gap just below the plunger can be difficult to avoid. Rinse the syringe tip with water, and remove any excess water with a paper tissue without touching the exit pore of the tip. Keep the remainder of the ligand solution for concentration determination (see Note 4).
8. Insert the injection syringe into the calorimetric cell and, if the instrument was equilibrated at a lower temperature, change the thermostat setting to the experimental temperature.
9. Using the software for the ITC, set the instrumental run parameters: the number of injections (28), temperature (15, 25, or 35 °C), reference power (10 μcal/s), initial delay (60 s), stirring speed (350 rpm), file name (*.itc), feedback mode (high), injection volume (2 μL for the first injection and 10 μL for the remaining 27 injections), spacing between injections (300 s), and filter (3 s) (see Notes 5–7). The concentrations of the solutions in the cell and syringe can be entered before or after the experiment (during the analysis).
10. Start the experiment by clicking Start in the ITC software, and allow the instrument to equilibrate while stirring. Initiate the injections from the software when a stable baseline has been established.
11. After the experiment has been completed, clean the syringe and calorimeter cell thoroughly. For cleaning of the syringe, thorough rinsing with water followed by drying using an aspirator is normally adequate. A quick rinse with ethanol, methanol, or acetone will shorten the drying time. If more thorough cleaning is required, leave the syringe with 2% Contrad, Hellmanex, or similar laboratory detergent solution for 60 min prior to rinsing with large amounts of water.

3.3 Analysis of the Data

The upper panel of Fig. 2 shows the output signal, dQ/dt, as a function of time for a typical titration experiment. The heat associated with each injection corresponds to the area obtained by integration of each of the progressively decreasing peaks until saturation is achieved. The areas for the small peaks of similar size at the end of the experiment correspond to the heat associated with dilution and a small mechanical effect and need to be subtracted prior to the analysis of the data. The heat, Qi, corresponding to the area under each peak is given by Eq. 5:

$$Q_i = V_{cell}\,\Delta H\,\Delta[\mathrm{L}]_{b,i} \tag{5}$$

where Vcell is the volume of the calorimetric cell, and Δ[L]b,i is the increase in the concentration of bound ligand after the ith injection. For a single-site binding model, Eq. 5 can be written as:

$$Q_i = V_{cell}\,\Delta H\,[\mathrm{M}] \left( \frac{K_a[\mathrm{L}]_i}{1 + K_a[\mathrm{L}]_i} - \frac{K_a[\mathrm{L}]_{i-1}}{1 + K_a[\mathrm{L}]_{i-1}} \right) \tag{6}$$

where [L]i and [L]i−1 are the concentrations of free ligand after injections i and i − 1. Because only the total and not the free ligand concentrations are known, Eq. 6 must be rewritten in terms of total concentrations. For further details regarding the solution to this and more complex equations used in the analysis of the data, see, for example, refs. [1, 5]. The lower panel of Fig. 2 shows the integrated heat as a function of the molar concentration ratio together with the result from nonlinear regression of the data. The software provided by the manufacturer of the ITC instrument allows both integration of the areas and nonlinear regression of the data.

3.4 Estimation of the Degree of Conformational Structuring

In many situations, especially those involving monoclonal antibodies and disordered proteins, high-resolution crystallographic or NMR structures are not available, and a structure-based analysis is not possible. This is especially true for the protein that contains disordered domains prior to binding. Nevertheless, it has been possible for us to establish empirical correlations between experimentally determined ΔCp values for the folding of globular proteins and the number of residues, Nres, that undergo folding [40]:

$$\Delta C_p = -0.014 \times N_{res} \quad \mathrm{kcal/(K\ mol)} \tag{7}$$

Application of Eq. 7 to the experimental data indicates that for 17b (ΔCp = −1.3 kcal/(K mol)) and sCD4 (ΔCp = −1.8 kcal/(K mol)), the equivalent of 93–129 residues become structured. If we assume that the ΔCp for b12 is representative of the binding interface of a monoclonal antibody, then the ΔCp values expected for folding only for 17b and sCD4 would be −0.9 and −1.4 kcal/(K mol), respectively, corresponding to 64 and 100 folding residues, respectively.
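The arithmetic behind these estimates can be reproduced with the short sketch below. The per-residue coefficient is entered as −0.014 kcal/(K mol), taking the sign from the negative heat capacity changes shown in Fig. 4. The function name and the dictionary layout are illustrative assumptions, while the ΔCp values are those quoted in the text.

```python
DCP_PER_RESIDUE = -0.014  # kcal/(K mol) per folded residue (Eq. 7), sign assumed negative


def residues_structured(delta_cp, interface_delta_cp=0.0):
    """Estimate N_res from a binding dCp, optionally subtracting an interface term."""
    return (delta_cp - interface_delta_cp) / DCP_PER_RESIDUE


# Heat capacity changes quoted in the text, in kcal/(K mol)
measured = {"17b": -1.3, "sCD4": -1.8}
dcp_b12 = -0.4  # taken as representative of an antibody binding interface

for name, dcp in measured.items():
    total = round(residues_structured(dcp))
    folding_only = round(residues_structured(dcp, dcp_b12))
    print(f"{name}: {total} residues from total dCp, {folding_only} after interface correction")
```

Subtracting the b12 value treats the interface contribution as identical for all three ligands, which is the simplifying assumption stated in the text rather than a measured quantity.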

4 Notes

1. The extinction coefficient for MAb 17b of 144,830/M/cm was experimentally determined in our laboratory and differs significantly from the value calculated from the sequence (206,130/M/cm).
2. For the low-volume instrument ITC200 or similar, use two- or threefold higher concentrations of the reactants.
3. For an experimental temperature below 25 °C, equilibration of the instrument to a temperature one degree below the experimental temperature is usually recommended as this prevents overheating while the sample cell is filled with the sample solution.
4. Automated filling of the injector syringe is an option for several newer ITC systems.
5. The stirring speed can be much higher or lower depending on the ITC system and is recommended by the manufacturer. We typically use 300–350 rpm.
6. High feedback gives the fastest response time and the largest signal-to-noise ratio and is consequently the preferred choice for most ITC experiments. Low feedback is chosen when the association process is so slow that fast feedback will only increase the noise.
7. The spacing between the injections must be sufficient to allow the signal to return to its baseline.

References

1. Freire E, Mayorga OL, Straume M (1990) Isothermal titration calorimetry. Anal Chem 62(18):950A–959A
2. Velazquez Campoy A, Freire E (2005) ITC in the post-genomic era...? Priceless. Biophys Chem 115(2–3):115–124
3. Velazquez-Campoy A, Leavitt SA, Freire E (2015) Characterization of protein-protein interactions by isothermal titration calorimetry. Methods Mol Biol 1278:183–204
4. Velazquez-Campoy A, Ohtaka H, Nezami A, Muzammil S, Freire E (2004) Isothermal titration calorimetry. Curr Protoc Cell Biol Chapter 17:Unit 17.18
5. Wiseman T, Williston S, Brandts JF, Lin LN (1989) Rapid measurement of binding constants and heats of binding using a new titration calorimeter. Anal Biochem 179(1):131–137
6. Chaires JB (2008) Calorimetry and thermodynamics in drug design. Annu Rev Biophys 37:135–151
7. Freire E (2008) Do enthalpy and entropy distinguish first in class from best in class? Drug Discov Today 13(19–20):869–874
8. Freire E (2009) A thermodynamic approach to the affinity optimization of drug candidates. Chem Biol Drug Des 74(5):468–472
9. Ladbury JE, Klebe G, Freire E (2010) Adding calorimetric data to decision making in lead discovery: a hot tip. Nat Rev Drug Discov 9(1):23–27
10. Lalonde JM, Le-Khac M, Jones DM, Courter JR, Park J, Schon A, Princiotto AM, Wu X, Mascola JR, Freire E, Sodroski J, Madani N, Hendrickson WA, Smith AB 3rd (2013) Structure-based design and synthesis of an HIV-1 entry inhibitor exploiting X-ray and thermodynamic characterization. ACS Med Chem Lett 4(3):338–343
11. Madani N, Schon A, Princiotto AM, Lalonde JM et al (2008) Small-molecule CD4 mimics interact with a highly conserved pocket on HIV-1 gp120. Structure 16(11):1689–1701
12. Pancera M, Lai YT, Bylund T et al (2017) Crystal structures of trimeric HIV envelope with entry inhibitors BMS-378806 and BMS-626529. Nat Chem Biol 13(10):1115–1122
13. Schon A, Madani N, Klein JC et al (2006) Thermodynamics of binding of a low-molecular-weight CD4 mimetic to HIV-1 gp120. Biochemistry 45(36):10973–10980
14. Schon A, Madani N, Smith AB et al (2011) Some binding-related drug properties are dependent on thermodynamic signature. Chem Biol Drug Des 77(3):161–165
15. Freire E (2004) Isothermal titration calorimetry: controlling binding forces in lead optimization. Drug Discov Today Technol 1(3):295–299
16. Velazquez-Campoy A, Kiso Y, Freire E (2001) The binding energetics of first- and second-generation HIV-1 protease inhibitors: implications for drug design. Arch Biochem Biophys 390(2):169–175
17. Baker BM, Murphy KP (1996) Evaluation of linked protonation effects in protein binding reactions using isothermal titration calorimetry. Biophys J 71(4):2049–2055


18. Gomez J, Freire E (1995) Thermodynamic mapping of the inhibitor site of the aspartic protease endothiapepsin. J Mol Biol 252(3):337–350
19. Velazquez-Campoy A, Luque I, Todd MJ et al (2000) Thermodynamic dissection of the binding energetics of KNI-272, a potent HIV-1 protease inhibitor. Protein Sci 9(9):1801–1809
20. Cabani S, Gianni P, Mollica V, Lepori L (1981) Group contributions to the thermodynamic properties of non-ionic organic solutes in dilute aqueous solution. J Solut Chem 10(8):563–595
21. Dalgleish AG, Beverley PC, Clapham PR et al (1984) The CD4 (T4) antigen is an essential component of the receptor for the AIDS retrovirus. Nature 312(5996):763–767
22. Klatzmann D, Champagne E, Chamaret S et al (1984) T-lymphocyte T4 molecule behaves as the receptor for human retrovirus LAV. Nature 312(5996):767–768
23. Kwong PD, Wyatt R, Robinson J et al (1998) Structure of an HIV gp120 envelope glycoprotein in complex with the CD4 receptor and a neutralizing human antibody. Nature 393(6686):648–659
24. Leavitt SA, Schön A, Klein JC et al (2004) Interactions of HIV-1 proteins gp120 and Nef with cellular partners define a novel allosteric paradigm. Curr Protein Pept Sci 5(1):1–8
25. Myszka DG, Sweet RW, Hensley P et al (2000) Energetics of the HIV gp120-CD4 binding reaction. Proc Natl Acad Sci U S A 97(16):9026–9031
26. Leavitt S, Freire E (2001) Direct measurement of protein binding energetics by isothermal titration calorimetry. Curr Opin Struct Biol 11(5):560–566
27. Pantophlet R, Ollmann Saphire E, Poignard P et al (2003) Fine mapping of the interaction of neutralizing and nonneutralizing monoclonal antibodies with the CD4 binding site of human immunodeficiency virus type 1 gp120. J Virol 77(1):642–658
28. Frisch C, Schreiber G, Johnson CM et al (1997) Thermodynamics of the interaction of barnase and barstar: changes in free energy versus changes in enthalpy on mutation. J Mol Biol 267(3):696–706
29. Baker BM, Murphy KP (1997) Dissecting the energetics of a protein-protein interaction: the binding of ovomucoid third domain to elastase. J Mol Biol 268(2):557–569
30. Celie PHN, Kasheverov IE, Mordvintsev DY et al (2005) Crystal structure of nicotinic acetylcholine receptor homolog AChBP in complex with an α-conotoxin PnIA variant. Nat Struct Mol Biol 12(7):582–588
31. Jelesarov I, Bosshard HR (1994) Thermodynamics of ferredoxin binding to ferredoxin:NADP+ reductase and the role of water at the complex interface. Biochemistry 33(45):13321–13328
32. Stites WE (1997) Protein-protein interactions: interface structure, binding thermodynamics, and mutational analysis. Chem Rev 97(5):1233–1250
33. Kwong PD, Doyle ML, Casper DJ et al (2002) HIV-1 evades antibody-mediated neutralization through conformational masking of receptor-binding sites. Nature 420(6916):678–682
34. Kwon YD, LaLonde JM, Yang Y et al (2014) Crystal structures of HIV-1 gp120 envelope glycoprotein in complex with NBD analogues that target the CD4-binding site. PLoS One 9(1):e85940
35. Gomez J, Hilser VJ, Xie D et al (1995) The heat capacity of proteins. Proteins 22(4):404–412
36. Murphy KP, Bhakuni V, Xie D et al (1992) Molecular basis of co-operativity in protein folding. III. Structural identification of cooperative folding units and folding intermediates. J Mol Biol 227(1):293–306
37. Freire E, Kawasaki Y, Velazquez-Campoy A et al (2011) Characterisation of ligand binding by calorimetry. In: Biophysical approaches determining ligand binding to biomolecular targets: detection, measurement and modelling. The Royal Society of Chemistry, London, pp 275–299
38. Brower ET, Schön A, Freire E (2010) Naturally occurring variability in the envelope glycoprotein of HIV-1 and development of cell entry inhibitors. Biochemistry 49(11):2359–2367
39. Liu Y, Schön A, Freire E (2013) Optimization of CD4/gp120 inhibitors by thermodynamic-guided alanine-scanning mutagenesis. Chem Biol Drug Des 81(1):72–78
40. Freire E (2001) The thermodynamic linkage between protein structure, stability, and function. In: Murphy KP (ed) Protein structure, stability, and folding. Humana Press, Totowa, NJ, pp 37–68

Chapter 23

Analysis of Multivalent IDP Interactions: Stoichiometry, Affinity, and Local Concentration Effect Measurements

Samuel Sparks, Ryo Hayama, Michael P. Rout, and David Cowburn

Abstract

Nuclear magnetic resonance (NMR) titration and isothermal titration calorimetry can be combined to provide an assessment of how multivalent intrinsically disordered protein (IDP) interactions can involve enthalpy–entropy balance. Here, we describe the underlying technical details and additional methods, such as dynamic light scattering analysis, needed to assess these reactions. We apply this to a central interaction involving the disordered regions of phe–gly nucleoporins (FG-Nups) that contain multiple phenylalanine–glycine repeats which are of particular interest, as their interactions with nuclear transport factors (NTRs) underlie the paradoxically rapid yet also highly selective transport of macromolecules mediated by the nuclear pore complex (NPC). These analyses revealed that a combination of low per-FG motif affinity and the enthalpy–entropy balance prevents high-avidity interaction between FG-Nups and NTRs while the large number of FG motifs promotes frequent FG–NTR contacts, resulting in enhanced selectivity.

Key words NMR, IDPs, Isothermal titration calorimetry, Nucleoporins, Nuclear transport factors

Birthe B. Kragelund and Karen Skriver (eds.), Intrinsically Disordered Proteins: Methods and Protocols, Methods in Molecular Biology, vol. 2141, https://doi.org/10.1007/978-1-0716-0524-0_23, © Springer Science+Business Media, LLC, part of Springer Nature 2020

1 Introduction

Many cellular processes involve complex, multivalent interactions rather than apparently single “lock-and-key” recognition events [1–9]. It is widely recognized that multivalency (i.e., the presence of multiple interaction motifs on a ligand, possibly complemented by multiple interaction sites on a target) can lead to substantial increases in affinity via “avidity” in antibody recognition [10], signal transduction, and formation of biomolecular condensates [11, 12]. Many such interactions form stable, long-lived complexes. In contrast, some intrinsically disordered systems possess multivalent interactions yet maintain rapid exchange [13]. At one extreme of the spectrum of fuzzy, dynamic interactions are those of the lining of the inner portion of the nuclear pore complex (NPC), where varieties of phenylalanine–glycine-rich nucleoporins (FG-Nups) are immobilized at one terminus to the NPC’s rigid structure [14] and provide a barrier to cytoplasmic/nucleoplasmic exchange of materials other than that facilitated by specific nuclear transport receptors (NTRs). The FG-Nups have regular repeat sequence motifs [15], directly implying multivalent interactions, so an obvious question is how avidity is avoided in order to provide fast transit times of a few milliseconds for large cargoes and their NTRs [16]. A general answer has been suggested [17]: a modest affinity between FG-Nups and NTRs is offset by the entropic cost of restricting the intrinsic disorder of the chains, such that optimally the energies of binding and barrier are balanced and a macromolecule neither accumulates at nor is excluded from the NPC but passes rapidly and specifically. Confirmation requires assessment of the thermodynamic properties of FG-Nup/NTR complexes.

In this chapter, we describe the technical details of how the balance between the multivalent affinities of interactions from a repeated motif and the entropic disorder of connecting linkers can be assessed, following a shorter description of results published previously [18]. To address this, several complex issues needed to be resolved.

1. The most direct assessment of thermodynamic properties of an interaction is the measurement of reaction enthalpy by isothermal titration calorimetry (ITC). However, care needs to be taken, as the relatively weak (~millimolar) affinities of individual sites are recognized as being at the limit of ITC sensitivity [19], compromising the precision of measurement. Thus, studying low-affinity interactions requires substantial amounts of material (see Note 1). Other, less direct, methods may be more precise but lack accuracy (see Note 2).
2. The stoichiometry of the reactions of intrinsically disordered proteins (IDPs) with targets is potentially complicated by the large number of microstates associated with the intrinsic disorder of the linkers between the motifs (FGs for FG nucleoporins), and by the possible range of timescales associated with interconversions among states, interacting and noninteracting with the NTR. Can a simple association/dissociation be ascribed to the interaction? Some independent validation of molecular stoichiometry is of significant benefit (see Note 3).
3. The effect of multivalency needs to be probed by varying the number and position of the interaction motifs with comparable controls. In the current protocol, we use FG-repeat proteins from yeast NSP1 based on similar sequence lengths, and negative controls made by substitution of the sequence “FSFG” with “SSSG” (Fig. 1). The direct effect of multiple motifs is then addressed by variation of the number of FSFG motifs with standard adjacent linker separations (upper left, Fig. 1). The role of linker length is probed by increased separation of the active FSFG motifs (upper right, Fig. 1).

These variants are then titrated with an NTR, the protein NTF2, which is the principal carrier for Ran GDP required for NPC transport function. ITC provides a direct measure of enthalpy (nΔH) related directly to the equilibrium constant. Using nuclear magnetic resonance (NMR) spectroscopy, we obtain complementary information with greater precision related to the equilibrium constant. This is complemented by investigation of the stoichiometry of the complexes by dynamic light scattering (DLS).

Fig. 1 Design of FSFG constructs with varying degrees of valency (left, upper) and with varying distance between two FSFG motifs (right, upper). Full sequence of FSFG6 construct (bottom)
Construct labels shown in the figure: SSSG6, FSFG1, FSFG2, FSFG3, FSFG4, FSFG5, FSFG6, FSFG12. Example repeat segment shown: PAFSFGAKPEEKKDDNSSKPASSSGAK. Motif patterns shown: FFSSSS, FSFSSS, FSSFSS, FSSSSF, FSSSSS.
FSFG6 (13874.92 Da): MNETSKPAFSFGAKSDEKKDGDASKPAFSFGAKPDENKASATSKPAFSFGAKPEEKKDDNSSKPAFSFGAKSNEDKQDGTAKPAFSFGAKPAEKNNNETSKPAFSFGAKSDEKKDGDASKPALEHHHHHH

2 Materials

All solutions are prepared with deionized water from a Millipore Milli-Q system, typically with a resistivity of >17 MΩ·cm. Reagents are stored at room temperature.

2.1 Buffers

1. Buffer A: 20 mM HEPES-KOH, pH 6.8, 150 mM KCl, 2 mM MgCl2.

2.2 Proteins

1. NMR samples: 7.5% D2O was added to the NMR samples described in Methods to provide a lock signal. No additional chemical shift reference material was needed because co-dialysis of the components prior to mixing provides a constant environment. As a check, a significant fraction of peaks in this system is not perturbed in the titration.

2.3 Equipment

1. ITC: MicroCal Auto-iTC200.
2. NMR: All NMR experiments were conducted on Bruker spectrometers at 800 MHz.
3. DLS: DynaPro with plate reader.

2.4 Software

1. NMRPipe, CCPNMR, GraphPad, Origin, DynaPro DYNAMICS.

3 Methods

All procedures are carried out at 25 °C when temperature is selectable, otherwise at ambient temperature.

3.1 Protein Production

1. The various sequences of Fig. 1 discussed in Subheading 1 were constructed as follows. FSFG6 (identified in Fig. 1) and SSSG6 [20] DNA constructs codon-optimized for bacterial expression were synthesized. Gene fragments from the synthesized plasmids were ligated into either pET21b or pET24a vectors. The FSFG12 plasmid was constructed by inserting a restriction-digested FSFG6 plasmid fragment into a SpeI-digested FSFG6 plasmid. All the other FSFG variants were created by site-directed mutagenesis from the two parent constructs, FSFG6 and SSSG6. For NTF2, the genomic sequence from S. cerevisiae was used. All the proteins in this study are tagged with hexa-histidine at the C-terminus, and their primary sequences are listed in the supplemental material, p. S-7 of [18]. Samples are prepared and purified by standard methods [21, 22], including stable isotope labeling. For FG constructs, [U-15N] labeling was used. For NTF2, [U-2H,15N] labeling was used in order to obtain sufficiently slow transverse relaxation for NMR studies. Proteins were dissolved, dialyzed, and concentrated in Buffer A to typical stock concentrations of 0.2–2 mM (ligand FG constructs) and 6 mM (NTF2) for ITC, NMR, and DLS. Prior to use, diluted or undiluted solutions were centrifuged (~20,000 × g, 10 min) and ultra-filtered. The protein concentrations of the filtered samples are measured by bicinchoninic acid assay (BCA) [23], and the samples are diluted accordingly to the desired concentration with filtered and degassed dialysate.

3.2 NMR Titration for Affinity Measurement

1. NMR chemical shift data are analyzed using NMRPipe [24] and CCPNMR Analysis [25]. Titration experiments were performed in Buffer A using a fixed concentration of 15N-labeled FG construct (range 20–120 μM) and by preparation of separate samples for each titration point, typically ten in total (see Note 4). These conditions do not uniformly result in saturation but are appropriate for this system (see Note 3). An example of the spectrum is shown in Fig. 2.

Fig. 2 NMR titrations. (a) HSQC spectrum of FSFG6 titrated with NTF2 (axes: 1H (ppm), 15N (ppm)). (b) CSPs, plotted against [NTF2] (µM), for residues with overlapping shifts associated with the six occurrences of FSFG are fitted to determine a global Kd and fitted curves corresponding to the derived value normalized by the expected maximum CSP for each position. In this case, Kd was determined to be 560 ± 10 μM

2. Chemical shifts for both the FSFG residues and all the assigned NTF2 residues were extracted from each titration point, and the chemical shift perturbations (CSPs) were calculated as

\[ \Delta\delta = \sqrt{\left(\Delta\delta_{^{15}\mathrm{N}} \times 0.11\right)^{2} + \left(\Delta\delta_{^{1}\mathrm{H}}\right)^{2}}, \]

where Δδ15N and Δδ1H are the 15N and 1H chemical shift changes with respect to the free state. The CSPs were treated as a function of titrant concentration and fit to a standard equation [26]:

\[ \Delta\delta = \Delta\delta_{\max}\, \frac{(P_0 + X + K_D) - \sqrt{(P_0 + X + K_D)^{2} - 4 P_0 X}}{2 P_0}, \]

where Δδmax is the change in chemical shift at saturation, P0 is the fixed protein concentration, and X is the titrant concentration. Global fitting was performed with the above equation to derive KDs. An example of the chemical shift changes of residues Fsfg, fSfg, fsFg, fsfG and their corresponding fits is shown in Fig. 2b.

3.3 ITC Titration for Affinity Measurement
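As a numerical companion to the NMR analysis of Subheading 3.2, the following minimal Python sketch computes the combined CSP with the 0.11 weighting of the 15N shift change and evaluates the single-site binding isotherm given there. The function names, the synthetic noise-free data, and the golden-section search used to locate KD are assumptions made for illustration only; they are not the fitting procedure used in the study, which fits several residues globally.

```python
import math

N15_WEIGHT = 0.11  # scaling applied to the 15N shift change in the CSP definition


def csp(d_n15, d_h1, weight=N15_WEIGHT):
    """Combined chemical shift perturbation (ppm) from 15N and 1H shift changes."""
    return math.sqrt((d_n15 * weight) ** 2 + d_h1 ** 2)


def isotherm(x, kd, dmax, p0):
    """CSP at titrant concentration x for a 1:1 equilibrium (x, kd, p0 in the same units)."""
    s = p0 + x + kd
    return dmax * (s - math.sqrt(s * s - 4.0 * p0 * x)) / (2.0 * p0)


def fit_kd(xs, ys, p0, lo=1.0, hi=1.0e5, iters=200):
    """Least-squares KD by golden-section search on log(KD).

    For each trial KD the best dmax is obtained analytically, because the
    model is linear in dmax. Assumes a single minimum between lo and hi.
    """
    def sse(log_kd):
        kd = math.exp(log_kd)
        f = [isotherm(x, kd, 1.0, p0) for x in xs]
        dmax = sum(fi * yi for fi, yi in zip(f, ys)) / sum(fi * fi for fi in f)
        return sum((yi - dmax * fi) ** 2 for fi, yi in zip(f, ys)), dmax

    g = (math.sqrt(5.0) - 1.0) / 2.0
    a, b = math.log(lo), math.log(hi)
    for _ in range(iters):
        c, d = b - g * (b - a), a + g * (b - a)
        if sse(c)[0] < sse(d)[0]:
            b = d
        else:
            a = c
    log_kd = 0.5 * (a + b)
    return math.exp(log_kd), sse(log_kd)[1]


# Synthetic example (all values hypothetical): 100 uM labeled FG construct,
# ten titration points up to 2000 uM titrant, true KD 560 uM, maximum CSP 0.40 ppm.
p0 = 100.0
xs = [0, 50, 100, 200, 400, 600, 800, 1200, 1600, 2000]
ys = [isotherm(x, 560.0, 0.40, p0) for x in xs]
kd_fit, dmax_fit = fit_kd(xs, ys, p0)
```

With real data, the same isotherm would be fitted to several residues at once with a shared KD and a separate Δδmax per residue, and the titrant range shown here would not reach saturation, which is the situation discussed in Note 3.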

1. For each ITC experiment, separate stock solutions of FSFGx (identified in Fig. 1 upper) and NTF2 undergo two rounds of dialysis together against the same Buffer A and are then concentrated by centrifugal concentrators. Samples are filtered, and the final concentrations for titration are redetermined by BCA [23].
2. All the ITC experiments may be conducted on a MicroCal Auto-iTC200 (see Notes 1 and 5). FSFG constructs are placed in the cell, and NTF2 is titrated in from the syringe in all the experiments. Each experiment consists of three runs: FSFGx against NTF2, Buffer A against NTF2, and FSFGx against Buffer A (Fig. 3a). To obtain the normalized heat for the FSFGx–NTF2 interaction, data from the “Buffer against NTF2” and “FSFGx against Buffer” runs are subtracted from the “FSFGx against NTF2” run (Fig. 3b). The titration protocol involves either 19 (19 × 2 μL), 16 (10 × 2 μL followed by 6 × 3 μL), or 15 (15 × 2.5 μL) titration points following a 0.4 μL first injection, which is removed from the analysis. The choice of injection scheme does not affect the gross result, though injections with larger volumes generally improve the signal-to-noise ratio toward the end of the titration [19]. Typical cell concentrations of FSFGx are in the range 0.3–2.11 mM, while the syringe titrant NTF2 is 3.8–6.2 mM. These values are adjustable to obtain a favorable match of total heat released and of saturation.
3. The raw heat evolution data are integrated and analyzed by the ITC module within the Origin program, using the Wiseman isotherm for the heat Q per mole of ligand added per step ([X]t):

\[ \frac{dQ}{d[X]_t} = n\,\Delta H^{\circ} V_0 \left[ \frac{1}{2} + \frac{1 - X_r - r}{2\sqrt{(1 + X_r + r)^{2} - 4 X_r}} \right], \]

where r = Kd/[M]t for the step concentration of titrand [M]t, and Xr = [X]t/[M]t is the molar ratio.
4. The n value is rounded to the integer one (FSFG1–FSFG6) or two (FSFG12) (see Subheading 3.4). For the binding reactions with very low affinity and low enthalpy (i.e., FSFG1–FSFG3, |nΔH| < 5 kcal/mol), n and ΔH could not be determined independently by ITC, though their product (nΔH) was observable, as previously reported for other low-affinity systems [19, 27]. KDs could be reliably extracted from those reactions, as evidenced by their agreement with NMR measurements, confirming the accuracy of the ITC experiment and its fitting procedure. However, since we cannot reliably determine n and ΔH independently for FSFG1–FSFG3, only KDs and nΔH values (total enthalpy of the entire molecule) are reported for those constructs (Table S1 in ref. [18]), and n values are derived separately using DLS for the others. Of course, in the case of reactions more complex than nL + R = (nLR), there will be the complexity of interpretation common to such issues [28].
5. Enthalpy–entropy compensation curves are constructed by plotting TΔS against ΔH for constructs with increasing valency.

Fig. 3 (a) NDH (normalized heat per injection, kcal mol−1 of injectant, plotted against molar ratio) curves for each of the heat evolutions required for correct ITC measurement (black, lowest, FSFG6-NTF2; green, upper, buffer–NTF2; blue, center, FSFG6-buffer). Note (i) the discrepant first points at left, arising from the initial step variation of the motor-driven titrant syringe, which are left out during fitting, and (ii) the significant heat of dilution associated with the NTF2 dilution (green). (b) Reference-adjusted NDH curve for FSFG6 titrated with NTF2 (the green and blue NDH curves, left, were subtracted from the black, resulting in the points and the fitted curve (red))

3.4 Determination of Stoichiometry
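The ITC analysis of Subheading 3.3 (reference subtraction followed by the Wiseman isotherm) can be illustrated with a short Python sketch. It is only a sketch under stated assumptions: the function names and all numerical values are hypothetical, the cell volume is set to 1 so that the result reads as heat per mole of injectant, and the actual fitting in the protocol is done by the ITC module in Origin. The helper `c_value` returns c = [M]t/Kd = 1/r, the quantity discussed in Note 5.

```python
import math


def reference_corrected(sample, titrant_into_buffer, buffer_into_sample):
    """Subtract both control runs point by point and drop the first injection."""
    corrected = [s - a - b
                 for s, a, b in zip(sample, titrant_into_buffer, buffer_into_sample)]
    return corrected[1:]


def wiseman(xr, kd, m_t, n_dh, v0=1.0):
    """Wiseman isotherm, dQ/d[X]t.

    xr   : molar ratio [X]t/[M]t
    kd   : dissociation constant (same concentration units as m_t)
    m_t  : total titrand concentration in the cell
    n_dh : product n * delta-H (e.g., kcal/mol)
    v0   : cell volume; 1.0 gives heat per mole of injectant
    """
    r = kd / m_t  # r = 1/c
    root = math.sqrt((1.0 + xr + r) ** 2 - 4.0 * xr)
    return n_dh * v0 * (0.5 + (1.0 - xr - r) / (2.0 * root))


def c_value(kd, m_t):
    """c = [M]t / Kd; values below 1 correspond to the low-c regime."""
    return m_t / kd


# Hypothetical low-affinity case: 1 mM titrand in the cell, Kd = 2 mM, n*dH = -10 kcal/mol.
ratios = [0.0, 0.5, 1.0, 2.0, 3.0]
curve = [wiseman(x, kd=2.0, m_t=1.0, n_dh=-10.0) for x in ratios]
```

At c below 1 the curve is shallow and featureless, which is why n and ΔH become correlated in the fit and only their product is well determined, as described in step 4.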

1. The formed complexes are characterized by dynamic light scattering (DLS). Experiments can be run in a 384-well plate format with independent samples in triplicate. For each of the triplicate samples, ten acquisitions of 5 s are collected (see Note 6).
2. The protein complexes formed for ITC or NMR titrations are directly used.
3. The resulting intensity-weighted regularized autocorrelation data, a decay curve resulting from Brownian motion of the complexes, are averaged over the triplicate data set, and the average radius of hydration is calculated (Fig. 4).
4. Apparent radii of hydration are interpreted as consistent with 1:1 complexes for all ~12 kDa FG mimics and 1:2 for FSFG12/NTF2; that is, two NTF2 dimers interact with one FSFG12.
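The instrument software reports the radius of hydration directly from the regularized autocorrelation data. For orientation only, the sketch below shows the Stokes–Einstein relation, which converts a translational diffusion coefficient into a radius, followed by averaging over triplicate wells. The viscosity (taken as that of water at 25 °C, about 0.89 mPa·s, rather than that of Buffer A), the function name, and the diffusion coefficients are assumptions for illustration, not values from the study.

```python
import math
from statistics import mean, stdev

K_B = 1.380649e-23  # Boltzmann constant, J/K


def hydrodynamic_radius_nm(d_m2_s, temp_k=298.15, viscosity_pa_s=0.89e-3):
    """Stokes-Einstein relation: Rh = kB*T / (6*pi*eta*D), returned in nm."""
    return K_B * temp_k / (6.0 * math.pi * viscosity_pa_s * d_m2_s) * 1e9


# Hypothetical diffusion coefficients (m^2/s) from three independent wells.
triplicate_d = [8.1e-11, 8.3e-11, 8.2e-11]
radii = [hydrodynamic_radius_nm(d) for d in triplicate_d]
rh_mean, rh_sd = mean(radii), stdev(radii)
```

Because Rh scales inversely with D, a shift of the peak to larger radius on adding NTF2 reflects slower diffusion of the complex, which is the basis of the stoichiometry interpretation in step 4.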

Fig. 4 Determination of molecular stoichiometry by dynamic light scattering (DLS). (a) Intensity-weighted DLS of the free forms of FSFG6, FSFG12, and NTF2. (b) Intensity-weighted DLS comparing the free FSFG6 sample to those in the presence of increasing NTF2 concentrations. For the sample at 1:1 molar ratio, the intensity plot is derived from a mixture of free and 1:1 FSFG6:NTF2 molecular stoichiometry. No further increase in the peak position is observed at higher NTF2 concentrations. (c) Intensity-weighted DLS comparing the free FSFG12 sample to those in the presence of increasing NTF2 concentrations. The positions of the peaks in the presence of NTF2 are consistent with a shift from ~1:1 to ~2:1 NTF2:FSFG12 molecular stoichiometry at higher NTF2 concentrations. (d) Table reporting the radius of hydration, Rh, and the percentage by mass of the peak. In the case of FSFG12, two separate peaks could be resolved at 1:5 molar ratio with NTF2, where the smaller of the two likely represents the free protein component. Each measurement represents data averaged from independent samples in triplicate

3.5 Estimation of Local Concentration Effect

1. The local concentration effect can be measured by observation of the change in Kd with variation of the linker length between motifs. In the FG-Nup application, this was done by changing the distance between two FSFG motifs in the natural sequence with SSSG substitutions (Fig. 1, upper right). This results in additional separation equivalent to the insertion of 1, 2, 3, and 4 linked segments separating FSFG motifs.
2. To estimate and to illustrate the change in local motif concentration as a function of FSFG motif–motif distance, we calculate [FSFG]local for our designed FSFG2 constructs (Fig. 1, upper right) based on polymer theory. Here, we define [FSFG]local as the concentration of FSFG motifs within a probing volume defined by the spacer length, L. We assumed that the spacer between the two FSFG motifs behaves as a random coil polymer, consistent with experimental [21] and simulation [29] results. The calculated local concentration is based on a three-dimensional random flight model, which is equivalent to an ideal chain undergoing random walks (rather than self-avoiding random walks) [30]:

\[ [\mathrm{FSFG}]_{\mathrm{local}} = \frac{1}{N_A} \left(\frac{3}{2\pi}\right)^{3/2} \frac{1}{\langle r^{2} \rangle^{3/2}}, \]

where N_A is Avogadro’s number and ⟨r²⟩ is the mean square distance (dm²) between the ends of the spacer, given by a²n, where a is the intersegment distance in Å and n the number of segments. Using the distance distribution between adjacent FSFG motifs obtained from our previous simulation [29] as L = ⟨r²⟩^(1/2) = 33.0 Å for adjacent FSFG repeats (n = 15), we calculate a to be 8.5 Å (see Note 7). For each construct in the bivalent, differentially spaced, FSFG2 series, we calculate ⟨r²⟩^(1/2) from a as above (⟨r²⟩^(1/2) = a·n^(1/2)) and then calculate the range of [FSFG]local concentrations as above.
3. The [FSFG]local calculated above is simply the concentration of FSFG motifs that are within the explorable space of the target protein. The local concentration effect influences the overall binding reaction by either promoting formation of divalently interacting species (two interaction patches on NTF2 bound by two FSFG motifs on a molecule of dual-site NTR) or inducing a rebinding effect [31]. Simulation of this process [29] suggests the latter option.
4. Based on the calculations for the variant with dual motifs above, we can calculate [FSFG]local for the FSFG3–FSFG12 variants following the model proposed by Gargano et al. [32]. For the case here, we are dealing with a linear polymer and increased local concentration predominantly from vicinal motifs. As intuitively expected, the local concentration effect then increases modestly as the number of motifs increases. Other cases may differ with changes in interaction energy, n, and a. However, the plateauing observed as a function of motif number (Fig. 5) is contrary to a simple local concentration prediction, suggesting that a more detailed theory, including at least the additions of chain self-avoidance and of the excluded volume effect [30], is needed to account quantitatively for experiment.

3.6 Combining Results from the Different Analyses

As shown in Fig. 5a, there is excellent agreement between the dissociation constants measured by the two different methods, strongly supporting the hypothesis that the time-averaged microstates for this interaction, central to the nuclear pore complex’s function, are represented by simple chemical equilibria. The combination of methods has two significant features: (1) the agreement provides a clear underpinning for extending the thermodynamic analysis from ITC to the NMR data, which are generally more precise, and (2) ITC provides a direct analysis of enthalpy and entropy, as in Fig. 5b, that is not readily available by other methods.
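To make the link between the measured quantities explicit: given a dissociation constant and a calorimetric enthalpy, the free energy and entropy terms of the kind plotted in Fig. 5b follow from ΔG = RT ln(KD), with KD expressed in molar units relative to a 1 M standard state, and −TΔS = ΔG − ΔH. The following Python sketch applies these two relations; the function name and the input values are hypothetical and are not taken from the study.

```python
import math

R_KCAL = 1.987204e-3  # gas constant, kcal mol^-1 K^-1


def binding_thermodynamics(kd_molar, dh_kcal, temp_k=298.15):
    """Return (dG, -T*dS) in kcal/mol from KD (M) and the binding enthalpy (kcal/mol)."""
    dg = R_KCAL * temp_k * math.log(kd_molar)  # dG of binding = RT ln(KD)
    minus_tds = dg - dh_kcal                   # from dG = dH - T*dS
    return dg, minus_tds


# Hypothetical example: KD = 1 mM and dH = -10 kcal/mol at 25 degrees C.
dg, minus_tds = binding_thermodynamics(1.0e-3, -10.0)
```

In this hypothetical case a favorable enthalpy of −10 kcal/mol is largely cancelled by an unfavorable entropy term of about +6 kcal/mol, leaving a ΔG of only about −4 kcal/mol, which illustrates the kind of enthalpy–entropy balance the chapter describes.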

Fig. 5 (a) KD values (mM) plotted against the number of FSFG motifs for each of the FSFG1-FSFG12 constructs (Fig. 1, left) by NMR and ITC. Standard errors of the curve fitting and standard errors of the mean are plotted for NMR and ITC, respectively. (b) Changes in Gibbs free energy (ΔG), enthalpy (ΔH), and entropy (TΔS; shown as −TΔS in the plot legend), in kcal mol−1, for the interactions between FSFG4-FSFG12 constructs and NTF2 measured by ITC

4 Notes

1. ITC instruments may have cells of different sizes. Larger cells will provide larger heat changes for more precise measurements, but at the cost of the significant amounts of material required. In any broad survey study, the number of constructs to test may be large: for example, ref. [18] involved 14 different FG constructs, with each titration requiring two controls as well as repetitions (n = 3). Thus, the balance between production scale of materials and precision of measurement will require careful examination. The system used here was a MicroCal Auto-iTC200 with 200 μL of titrand in the cell. Note also that automated systems operating with sample trays will use significantly more material (~400 μL in our case) because of plumbing dead volumes.
2. When using ITC, the resulting heat is a direct measure of the heat of reaction, so that the thermodynamic quantity associated with the equilibrium is obtained without any other consideration. For other methods, there is typically an assumption that the measurable (e.g., chemical shift perturbation or relaxation property for NMR, intensities of absorption or fluorescence for optical methods) is a direct measure of concentrations of reagents and products. This may be a limited approximation for several reasons. (1) The theoretical underpinning of the measurable/concentration dependence may be incorrect, for example, misanalysis of relaxation properties by NMR [33]. (2) The equilibrium may contain microstates, and the measurable detects a subset, not reflective of the overall reaction. (3) The use of derivative tags (e.g., for fluorescence) may directly affect the equilibrium by providing additional interactions not present in the native case (e.g., [34]).
3. The conventional analysis of equilibria would rely on observation of saturation at the stoichiometric equivalence point. However, millimolar range affinities frequently preclude direct observation of saturation because of limitations of solubility. Contemporary statistical fitting methods [35] will provide reasonable estimates of stoichiometry (and its significance level) from the curvature of titrations without observation of saturation. In the case of the FG-Nup/NTR interaction and similarly nonstandard cases, the possibility of separate microstates [36] leads to the need for an independent estimate of stoichiometry from DLS (Subheading 3.4).
4. In general, a significant amount of surveying may be needed to establish a range of conditions for observing NMR spectral changes during titration (see Note 2). Considerations will include a sufficiently acidic pH such that solvent water exchange with amide 1H’s is minimal (typically, pH 6.8), and selection of a temperature to permit readily interpretable changes in chemical shifts or relaxation properties [33], typically R2.
5. It is good practice to use the manufacturer’s calibration procedure before extensive work to ensure that the operator can obtain reproducible results. Of the many possible issues, variations of room temperature, dust contamination of solutions, and gas bubbles in the titrant syringe are common. The titration protocol for these low c value systems is not as critical as for high-affinity systems. In brief, the c value is the ratio of the titrand (receptor) concentration to the dissociation constant. Turnbull et al. [19] set out the conditions for using low values of c as “(1) a sufficient portion of the binding isotherm is used for analysis, (2) the binding stoichiometry is known, (3) the concentrations of both ligand and receptor [are known with accuracy]”; these conditions are met in the study here.
6. To minimize contaminants, samples for DLS should be centrifuged immediately before pipetting into the sample tray. It is recommended that the sample tray be centrifuged briefly prior to the run to remove interference from air bubbles.
7. We chose to use a constant value of a for simplicity, although the value for FSFG-(linker)-FSFG is obviously slightly shorter than the equivalent one-half of FSFG-(linker)-SSSG-(linker)-FSFG.
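The random-flight estimate of Subheading 3.5, with the constant segment length discussed in Note 7, can be reproduced numerically. The Python sketch below first recovers a from L = 33.0 Å and n = 15 and then evaluates [FSFG]local for a series of spacer lengths. The function names are hypothetical, and the assumption that each additional SSSG-substituted repeat adds a further 15 segments to the spacer is made here purely for illustration.

```python
import math

N_A = 6.02214076e23       # Avogadro constant, 1/mol
ANGSTROM_PER_DM = 1.0e9   # 1 dm = 1e9 angstrom


def segment_length(l_rms_angstrom, n):
    """Intersegment distance a from <r^2>^(1/2) = a * sqrt(n)."""
    return l_rms_angstrom / math.sqrt(n)


def local_concentration_molar(a_angstrom, n):
    """[FSFG]local = (3/(2*pi))^(3/2) / (N_A * <r^2>^(3/2)), with <r^2> = a^2 * n in dm^2."""
    r2_dm2 = (a_angstrom / ANGSTROM_PER_DM) ** 2 * n
    return (3.0 / (2.0 * math.pi)) ** 1.5 / (N_A * r2_dm2 ** 1.5)


a = segment_length(33.0, 15)  # about 8.5 angstrom
# Assumed spacer lengths: adjacent motifs, then 1 to 4 intervening substituted repeats.
conc = {n: local_concentration_molar(a, n) for n in (15, 30, 45, 60, 75)}
```

Under these assumptions the local concentration falls as n^(−3/2), from roughly 15 mM for adjacent motifs to roughly 2 mM when the spacer is four times longer, so it remains above or comparable to the millimolar per-motif KD values over the spacings considered.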

References

1. Stevers LM, de Vink PJ, Ottmann C et al (2018) A thermodynamic model for multivalency in 14-3-3 protein-protein interactions. J Am Chem Soc 140(43):14498–14510
2. Harmon TS, Holehouse AS, Rosen MK et al (2017) Intrinsically disordered linkers determine the interplay between phase separation and gelation in multivalent proteins. eLife 6. https://doi.org/10.7554/eLife.30294
3. Gueroussov S, Weatheritt RJ, O’Hanlon D et al (2017) Regulatory expansion in mammals of multivalent hnRNP assemblies that globally control alternative splicing. Cell 170(2):324–339.e23
4. Vonnemann J, Liese S, Kuehne C et al (2015) Size dependence of steric shielding and multivalency effects for globular binding inhibitors. J Am Chem Soc 137(7):2572–2579
5. Dubacheva GV, Curk T, Auzely-Velty R et al (2015) Designing multivalent probes for tunable superselective targeting. Proc Natl Acad Sci U S A 112(18):5579–5584
6. Clark SA, Jespersen N, Woodward C et al (2015) Multivalent IDP assemblies: unique properties of LC8-associated, IDP duplex scaffolds. FEBS Lett 589(19 Pt A):2543–2551
7. Li P, Banjade S, Cheng HC et al (2012) Phase transitions in the assembly of multivalent signalling proteins. Nature 483(7389):336–340
8. Fasting C, Schalley CA, Weber M et al (2012) Multivalency as a chemical organization and action principle. Angew Chem Int Ed Engl 51(42):10472–10498
9. Cloninger MJ, Bilgiçer B, Li L et al (2012) Multivalency. In: Supramolecular chemistry. Wiley, Hoboken, NJ
10. Koenderman L (2019) Inside-out control of Fc-receptors. Front Immunol 10:544
11. Banani SF, Lee HO, Hyman AA et al (2017) Biomolecular condensates: organizers of cellular biochemistry. Nat Rev Mol Cell Biol 18(5):285–298
12. Cable J, Brangwynne C, Seydoux G et al (2019) Phase separation in biology and disease: a symposium report. Ann N Y Acad Sci 1452(1):3–11. PMC6751006
13. Wu H, Fuxreiter M (2016) The structure and dynamics of higher-order assemblies: amyloids, signalosomes, and granules. Cell 165(5):1055–1066
14. Kim SJ, Fernandez-Martinez J, Nudelman I et al (2018) Integrative structure and functional anatomy of a nuclear pore complex. Nature 555(7697):475–482
15. Yamada J, Phillips JL, Patel S et al (2010) A bimodal distribution of two distinct categories of intrinsically disordered structures with separate functions in FG nucleoporins. Mol Cell Proteomics 9(10):2205–2224
16. Grunwald D, Singer RH (2010) In vivo imaging of labelled endogenous beta-actin mRNA during nucleocytoplasmic transport. Nature 467(7315):604–607
17. Rout MP, Aitchison JD, Magnasco MO et al (2003) Virtual gating and nuclear transport: the hole picture. Trends Cell Biol 13(12):622–628
18. Hayama R, Sparks S, Hecht LM et al (2018) Thermodynamic characterization of the multivalent interactions underlying rapid and selective translocation through the nuclear pore complex. J Biol Chem 293(12):4555–4563
19. Turnbull WB, Daranas AH (2003) On the value of c: can low affinity systems be studied by isothermal titration calorimetry? J Am Chem Soc 125(48):14859–14866
20. Frey S, Richter RP, Gorlich D (2006) FG-rich repeats of nuclear pore proteins form a three-dimensional meshwork with hydrogel-like properties. Science 314(5800):815–817
21. Hough LE, Dutta K, Sparks S et al (2015) The molecular mechanism of nuclear transport revealed by atomic-scale measurements. eLife 4. https://doi.org/10.7554/eLife.10027
22. Shekhtman A, Ghose R, Goger M et al (2002) NMR structure determination and investigation using a reduced proton (REDPRO) labeling strategy for proteins. FEBS Lett 524(1–3):177–182
23. Smith PK, Krohn RI, Hermanson G et al (1985) Measurement of protein using bicinchoninic acid. Anal Biochem 150(1):76–85
24. Delaglio F, Grzesiek S, Vuister GW et al (1995) NMRPipe: a multidimensional spectral processing system based on UNIX pipes. J Biomol NMR 6(3):277–293
25. Vranken WF, Boucher W, Stevens TJ et al (2005) The CCPN data model for NMR spectroscopy: development of a software pipeline. Proteins 59(4):687–696
26. Williamson MP (2013) Using chemical shift perturbation to characterise ligand binding. Prog Nucl Magn Reson Spectrosc 73:1–16
27. Tellinghuisen J (2008) Isothermal titration calorimetry at very low c. Anal Biochem 373(2):395–397
28. Weiss JN (1997) The Hill equation revisited: uses and misuses. FASEB J 11(11):835–841
29. Raveh B, Karp JM, Sparks S et al (2016) Slide-and-exchange mechanism for rapid and selective transport through the nuclear pore complex. Proc Natl Acad Sci U S A 113(18):E2489–E2497
30. Krishnamurthy VM, Semetey V, Bracher PJ et al (2007) Dependence of effective molarity on linker length for an intramolecular protein-ligand system. J Am Chem Soc 129(5):1312–1320
31. Weber M, Bujotzek A, Haag R (2012) Quantifying the rebinding effect in multivalent chemical ligand-receptor systems. J Chem Phys 137(5):054111
32. Gargano JM, Ngo T, Kim JY et al (2001) Multivalent inhibition of AB(5) toxins. J Am Chem Soc 123(51):12909–12910
33. Waudby CA, Ramos A, Cabrita LD et al (2016) Two-dimensional NMR lineshape analysis. Sci Rep 6:24826
34. Feng H, Zhou BR, Bai Y (2018) Binding affinity and function of the extremely disordered protein complex containing human linker histone H1.0 and its chaperone ProTalpha. Biochemistry 57(48):6645–6648
35. Motulsky HJ, Ransnas LA (1987) Fitting curves to data using nonlinear regression: a practical and nonmathematical review. FASEB J 1(5):365–374
36. Dill K, Bromberg S (2012) Molecular driving forces: statistical thermodynamics in biology, chemistry, physics, and nanoscience. Garland Science

Chapter 24

NMR Lineshape Analysis of Intrinsically Disordered Protein Interactions

Christopher A. Waudby and John Christodoulou

Abstract

Interactions of intrinsically disordered proteins are central to their cellular functions, and solution-state NMR spectroscopy provides a powerful tool for characterizing both structural and mechanistic aspects of such interactions. Here we focus on the analysis of IDP interactions using NMR titration measurements. Changes in resonance lineshapes in two-dimensional NMR spectra upon titration with a ligand contain rich information on structural changes in the protein and the thermodynamics and kinetics of the interaction, as well as on the microscopic association mechanism. Here we present protocols for the optimal design of titration experiments, data acquisition, and data analysis by two-dimensional lineshape fitting using the TITAN software package.

Key words Nuclear magnetic resonance, Titrations, IDP, Binding, Kinetics

1 Introduction

Protein interactions are central to their function, whether that be enzymatic, structural, or regulatory. Regulatory proteins or protein domains are particularly prevalent in higher-order organisms, and order-disorder transitions can play important roles in interactions of these proteins [1]. Indeed, intrinsically disordered proteins (IDPs) or intrinsically disordered regions (IDRs), i.e., polypeptide sequences lacking any stable tertiary structure, have been labelled “interaction specialists” due to their ability to recognize many factors and integrate multiple signals with tuneable affinity and specificity [1, 2]. IDRs are abundant within signalling proteins and oncogenes [3], as well as within viral proteomes [4–6], while other IDPs and IDRs are implicated in a number of severe neurodegenerative disorders due to their propensity to misfold and aggregate because of their lack of stable structure [7, 8]. The dynamic nature of IDPs and IDRs poses a challenge to most structural biology techniques, but this is an area in which

Birthe B. Kragelund and Karen Skriver (eds.), Intrinsically Disordered Proteins: Methods and Protocols, Methods in Molecular Biology, vol. 2141, https://doi.org/10.1007/978-1-0716-0524-0_24, © The Author(s) 2020


solution-state NMR spectroscopy excels. From the identification of disordered regions and the characterization of their residual secondary and tertiary structure, both in vitro and within living cells [9–12], the study of posttranslational modifications [13, 14], the structural and dynamical characterization of interactions [6, 15–19], to the analysis of phase separation and aggregation [20, 21], NMR spectroscopy has played a key role in the field.

This chapter will focus on the application of NMR lineshape analysis to characterize interactions of IDPs and IDRs with small molecules or other macromolecules. However, the protocol we provide is equally applicable to interactions of folded protein domains and indeed to interactions of folded domains with IDPs and IDRs.

NMR spectra are sensitive to chemical exchange (i.e., dynamic molecular equilibria) across a wide range of timescales. Chemical exchange between two states can be characterized by the exchange rate, kex, which is the sum of the forward and backward reaction rates. For a simple two-state binding reaction in which a protein, P, is observed to form a complex, PL, in the presence of ligand, L:

$$\mathrm{P} + \mathrm{L} \underset{k_{\mathrm{off}}}{\overset{k_{\mathrm{on}}}{\rightleftharpoons}} \mathrm{PL} \tag{1}$$

The exchange rate is:

$$k_{\mathrm{ex}} = k_{\mathrm{on}}[\mathrm{L}] + k_{\mathrm{off}} \tag{2}$$

The dissociation constant Kd = koff/kon, and the free ligand concentration is

$$[\mathrm{L}] = \frac{1}{2}\left\{ [\mathrm{L}]_{\mathrm{tot}} - [\mathrm{P}]_{\mathrm{tot}} - K_{\mathrm{d}} + \sqrt{\left([\mathrm{L}]_{\mathrm{tot}} + [\mathrm{P}]_{\mathrm{tot}} + K_{\mathrm{d}}\right)^{2} - 4[\mathrm{P}]_{\mathrm{tot}}[\mathrm{L}]_{\mathrm{tot}}} \right\}$$

The appearance of an NMR resonance depends on the frequency difference between free and bound states, Δω, relative to the exchange rate (Fig. 1). If exchange is fast (kex ≫ Δω), then a single resonance will be observed with a population-weighted average chemical shift, while if exchange is slow (kex ≪ Δω), then two signals will be observed at the chemical shifts of the free and bound states, weighted according to the bound population. If the exchange rate is comparable to the frequency difference, a condition termed “intermediate exchange,” then extensive line broadening is observed. As different residues will have different chemical shift differences between their free and bound states, depending on changes in structure or chemical environment of the bound state, a range of fast/intermediate/slow exchange regimes are likely to be sampled across the polypeptide sequence. Note also that the free ligand concentration can only increase as ligand is added, and therefore according to Eq. 2, higher ligand concentrations will be associated with more rapid chemical exchange.
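The relationships above can be evaluated numerically. The following is a minimal Python sketch, not part of TITAN: the function names are hypothetical, concentrations are assumed to share one unit (μM here), and the factor-of-ten threshold used to label the exchange regime is an arbitrary assumption chosen only for illustration.

```python
import math

def free_ligand(p_tot, l_tot, kd):
    """Free ligand concentration [L] for P + L <=> PL (all values in the same units)."""
    b = l_tot - p_tot - kd
    return 0.5 * (b + math.sqrt((l_tot + p_tot + kd) ** 2 - 4.0 * p_tot * l_tot))

def exchange_rate(p_tot, l_tot, kd, koff):
    """kex = kon[L] + koff (Eq. 2), with kon = koff / Kd."""
    kon = koff / kd
    return kon * free_ligand(p_tot, l_tot, kd) + koff

def regime(kex, delta_omega, factor=10.0):
    """Crude label for the exchange regime; the threshold 'factor' is an assumption."""
    if kex > factor * abs(delta_omega):
        return "fast"
    if kex < abs(delta_omega) / factor:
        return "slow"
    return "intermediate"

# Hypothetical example: 1 mM protein, Kd = 2 uM, koff = 100 s^-1, one equivalent of ligand
p_tot, l_tot, kd, koff = 1000.0, 1000.0, 2.0, 100.0  # concentrations in uM
kex = exchange_rate(p_tot, l_tot, kd, koff)
print(f"[L] = {free_ligand(p_tot, l_tot, kd):.1f} uM, kex = {kex:.0f} s^-1, "
      f"regime for a 4400 s^-1 shift difference: {regime(kex, 4400.0)}")
```

Because kex depends on the free ligand concentration, the same residue may pass through different regimes during one titration, which is why the regime is evaluated per titration point rather than once per system.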

2D Lineshape Analysis of IDP Interactions

[Figure 1. Panel (a): reaction scheme P + L ⇌ PL with rate constants kon and koff, kex = kon[L] + koff, and frequency differences ΔωH and ΔωN. Panel (b): simulated spectra for koff = 1 s⁻¹, 10² s⁻¹, 10⁴ s⁻¹, and 10⁵ s⁻¹, ordered by increasing exchange rate. Axes: 1H chemical shift / ppm and 15N chemical shift / ppm.]

Fig. 1 (a) Definition of the exchange rate, kex, and frequency differences, ΔωH and ΔωN, for a protein-ligand interaction observed in a two-dimensional heteronuclear correlation experiment. (b) 1H,15N-HSQC spectra and projected 1H 1D cross sections for a simulated protein-ligand interaction (700 MHz, 1 mM protein concentration, Kd 2 μM, ΔωH 4400 s⁻¹ [1 ppm], ΔωN 220 s⁻¹ [0.5 ppm]) illustrating lineshapes that may arise under various exchange regimes, as indicated. Contour levels are constant across all spectra. Adapted from [40]
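The angular frequency differences quoted in the caption follow from the chemical shift differences and the spectrometer frequency. The short Python sketch below illustrates the conversion; the function name is hypothetical, and the 15N/1H frequency ratio is an assumption taken from the observe frequencies that appear in Listing 1 (60.797 and 599.927 MHz).

```python
import math

# Assumption: ratio of 15N to 1H Larmor frequencies, from the values quoted in Listing 1
RATIO_15N_1H = 60.797 / 599.927

def ppm_to_rad_per_s(delta_ppm, larmor_mhz):
    """Convert a chemical shift difference (ppm) into an angular frequency difference (s^-1)."""
    return 2.0 * math.pi * delta_ppm * larmor_mhz  # ppm x MHz gives Hz

h_mhz = 700.0                    # 1H Larmor frequency used in Fig. 1
n_mhz = h_mhz * RATIO_15N_1H     # corresponding 15N Larmor frequency
d_omega_h = ppm_to_rad_per_s(1.0, h_mhz)
d_omega_n = ppm_to_rad_per_s(0.5, n_mhz)
print(f"1H: {d_omega_h:.0f} s^-1, 15N: {d_omega_n:.0f} s^-1")
```

The results agree with the rounded values of 4400 s⁻¹ and 220 s⁻¹ given in the caption.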

The dependence of the observed NMR spectrum on the kinetics of the exchange process as well as the thermodynamics (the bound population) and structure (chemical shift differences and linewidths) is the essential reason why NMR spectroscopy provides a powerful tool to characterize molecular equilibria and binding reactions. The evolution of magnetization in an exchanging system may also be manipulated with rf fields and pulses, which has given rise to a range of methods such as exchange spectroscopy (EXSY) [22–24], CPMG or R1ρ relaxation dispersion [25–29], and chemical/dark-state exchange saturation transfer (CEST and DEST) [20, 30, 31], which may be used to characterize dynamic equilibria even within single samples [32]. However, in systems where titrations can be performed, i.e., where the concentration of one or more components can be altered to modulate the equilibrium, NMR titrations using standard 2D correlation experiments provide a powerful approach to the analysis of exchange that is often simpler and more intuitive to analyze than the methods described above.

In the extreme fast exchange limit, which is typically associated with very weak interactions, NMR titration data can be analyzed in terms of chemical shift perturbations alone, which provide a straightforward report of the binding process [33, 34]. In general, however, analyses based on a naïve assumption of fast exchange risk the introduction of systematic errors [35]. A better alternative is the use of NMR lineshape analysis (also referred to as “dynamic NMR”) to fit observed spectra to numerical solutions of the Bloch-McConnell or Liouville-von Neumann equations that govern the evolution of magnetization vectors or density operators in the presence of chemical exchange [36, 37].
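The systematic error that can arise from assuming fast exchange may be illustrated with a one-dimensional, two-site solution of the Bloch-McConnell equations. The Python sketch below is an illustration only and not the TITAN implementation: the function names, the transverse relaxation rate of 20 s⁻¹, and the example populations are assumptions. It locates the peak maximum of the simulated spectrum and converts it into the bound fraction that a chemical-shift-only analysis would report.

```python
import numpy as np

def two_site_spectrum(omega, delta_omega, kex, p_b, r2=20.0):
    """Absorption spectrum for free <=> bound exchange (angular frequencies in s^-1).

    The free state resonates at 0 and the bound state at delta_omega.
    """
    k_ab, k_ba = p_b * kex, (1.0 - p_b) * kex        # forward and backward rates
    m0 = np.array([1.0 - p_b, p_b], dtype=complex)   # equilibrium populations
    # Bloch-McConnell evolution matrix: dM/dt = A M
    a = np.array([[-r2 - k_ab, k_ba],
                  [k_ab, 1j * delta_omega - r2 - k_ba]], dtype=complex)
    identity = np.eye(2)
    # Fourier transform of the signal 1^T exp(A t) M0 is 1^T (i w I - A)^-1 M0
    return np.array([np.linalg.solve(1j * w * identity - a, m0).sum().real
                     for w in omega])

def apparent_bound_fraction(delta_omega, kex, p_b):
    """Bound fraction inferred from the peak maximum under a fast-exchange assumption."""
    omega = np.arange(-1000.0, delta_omega + 1000.0, 2.0)
    peak = omega[np.argmax(two_site_spectrum(omega, delta_omega, kex, p_b))]
    return peak / delta_omega

# True bound population of 0.25 and a 4400 s^-1 shift difference (hypothetical values)
fast = apparent_bound_fraction(4400.0, kex=1e6, p_b=0.25)
intermediate = apparent_bound_fraction(4400.0, kex=5e3, p_b=0.25)
print(f"apparent bound fraction: fast {fast:.3f}, intermediate {intermediate:.3f}")
```

In the fast-exchange case the peak position reports the true population, whereas in the intermediate case the peak maximum lies closer to the major (free) state, so the bound fraction is underestimated.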


NMR lineshape analysis was originally developed and applied to one-dimensional NMR spectra [37], but the analysis of proteins and other macromolecules, which contain on the order of 10³ 1H resonances, clearly requires multidimensional spectroscopy to resolve individual spin systems. Early approaches to lineshape analysis of two-dimensional spectra, which we term “pseudo-two-dimensional lineshape analysis,” were based on the extraction or integration of cross sections through resonances or across rectangular regions, to which traditional one-dimensional lineshape analysis could be applied [38, 39]. However, not only is this approach limited by the need to avoid overlapping resonances (a particularly common problem in crowded spectra of IDPs), but additional uncertainty is introduced by the need to normalize the area under each cross peak, which may be highly broadened by chemical exchange. Lastly, because the pseudo-two-dimensional approach does not accurately consider the evolution of magnetization through the full pulse sequence, we have also shown that this approach can introduce systematic errors by neglecting differential relaxation in the slow-intermediate exchange regime [40] and phase distortions and cross peaks arising from chemical exchange during mixing and chemical shift evolution periods [41].

To resolve these problems, we recently developed a true two-dimensional lineshape analysis approach and introduced a software package, TITAN, to carry out the fitting procedure [40]. The method is based on the use of a “virtual spectrometer” to propagate an initial density operator through the entire experimental pulse sequence [42], to generate a two-dimensional interferogram which can be processed in an identical manner to the experimental data. This approach eliminates the systematic errors described above; obviates the requirement to normalize individual resonances or spectra, provided that sample concentrations and acquisition parameters are fully specified; and can readily fit overlapping resonances, which is of particular value for IDPs. The analysis has since been applied to a range of systems [43–46].

Here, we present a protocol for the acquisition and analysis of NMR titration data to characterize an IDP interaction with a ligand (which may be a small molecule or another macromolecule) using two-dimensional lineshape analysis. We begin with a discussion of the experimental setup and optimal titration protocols and then present a step-by-step description of data analysis using the TITAN software package. While we highlight some points of particular relevance to IDPs, we note that the protocol is also applicable to other macromolecular systems.
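The idea of simulating a time-domain signal and then processing it exactly like experimental data can be sketched in one dimension. The Python example below is a deliberately simplified stand-in and makes several assumptions: it propagates two-site magnetization rather than a density operator, it ignores the pulse sequence entirely, and all function names and parameter values are hypothetical. TITAN itself simulates the complete two-dimensional experiment.

```python
import numpy as np

def simulate_fid(delta_omega, k_ab, k_ba, r2, n_points, sweep_width_hz):
    """Free induction decay of a two-site exchanging spin (free state at 0, bound at delta_omega)."""
    p_b = k_ab / (k_ab + k_ba)
    m0 = np.array([1.0 - p_b, p_b], dtype=complex)
    a = np.array([[-r2 - k_ab, k_ba],
                  [k_ab, 1j * delta_omega - r2 - k_ba]], dtype=complex)
    evals, evecs = np.linalg.eig(a)
    coeffs = np.linalg.solve(evecs, m0)        # initial magnetization in the eigenbasis
    weights = evecs.sum(axis=0) * coeffs       # detected amplitude of each eigenmode
    t = np.arange(n_points) / sweep_width_hz   # sampling times in seconds
    return t, np.exp(np.outer(t, evals)) @ weights

def process(t, fid, lb_hz, zero_fill):
    """Exponential window, first-point scaling, zero filling, and Fourier transform."""
    windowed = fid * np.exp(-np.pi * lb_hz * t)
    windowed[0] *= 0.5
    return np.fft.fftshift(np.fft.fft(windowed, n=zero_fill)).real

def frequency_axis_hz(n, sweep_width_hz):
    """Frequency axis matching the output of process()."""
    return np.fft.fftshift(np.fft.fftfreq(n, d=1.0 / sweep_width_hz))

# Fast exchange between equally populated states separated by 4400 s^-1 (about 700 Hz)
t, fid = simulate_fid(4400.0, k_ab=5e5, k_ba=5e5, r2=20.0, n_points=1024, sweep_width_hz=2000.0)
spectrum = process(t, fid, lb_hz=4.0, zero_fill=4096)
axis = frequency_axis_hz(4096, 2000.0)
print(f"peak at {axis[np.argmax(spectrum)]:.1f} Hz")
```

Because the simulated signal passes through the same window function and zero filling as the measured data, truncation and apodization effects appear identically in both, which is the property the virtual spectrometer approach relies on.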

2 Materials

2.1 NMR Spectrometer

NMR titration experiments may be acquired at any magnetic field strength. However, due to the limited chemical shift dispersion associated with IDPs, it is recommended that NMR spectrometers operating with a 1H Larmor frequency of 800 MHz (18.8 T) or greater are used for titration measurements. NMR spectrometers equipped with cryogenic probes will also provide greater experimental sensitivity and allow titrations to be completed more rapidly. Automation facilities, if available, may be used to acquire a complete series of pre-prepared titration samples, albeit at the cost of additional protein and ligand.

2.2 NMR Tubes

In our experience, regular 5 mm tubes are most convenient for carrying out NMR titrations, requiring a sample volume of 550 μL. The most crucial requirement in performing a titration is that the sample volume is accurately controlled, and for this reason, we do not recommend the use of Shigemi tubes as inserting and removing the plunger results in uncontrollable sample loss. However, if material is limited, then 400 μL may be used in a 5 mm Shigemi tube without inserting the plunger. Alternatively, if a series of independent samples is being prepared (e.g., due to the inability to prepare a high-concentration stock of titrant), then the use of 3 mm tubes, which require a sample volume of 170 μL, may provide a useful reduction in the total amount of sample required (at the cost of increased acquisition times to compensate for the reduction in experimental sensitivity).

2.3 NMR Samples

1. The protein to be observed should be prepared with uniform 15N labelling, according to standard expression protocols [47–49]. Although resonance assignments are not strictly necessary for the analysis of titration data, they greatly assist the interpretation of the results, and therefore if assignments are not already available, it is recommended to prepare a second sample with 13C/15N labelling in order to carry out a backbone assignment according to standard methods [50, 51].

2. Where possible, protein samples should be prepared in a buffer with a strong buffering capacity to avoid pH changes during addition of ligand. Be aware of the risk of protein degradation or aggregation, particularly for IDPs: if necessary, EDTA or protease inhibitors should be added, or the protein concentration reduced, in order to maintain sample integrity across the titration series. As only chemical shift changes are required for the analysis of titration data, rather than absolute values of chemical shifts, we recommend that the chemical shift reference DSS [52] is not included during titration experiments as in a number of cases it may interact with the protein being observed, potentially causing systematic errors in the results [53, 54]. Instead, referencing via the lock frequency has in our experience been found to provide acceptable stability.

3. A high-concentration ligand stock (ca. 5–20× the maximum final ligand concentration, as discussed below) should be prepared in an identical buffer to the protein to avoid changes to solution conditions (especially pH) occurring during the titration. Where the molecular weight of the ligand is sufficiently high, co-dialysis may be helpful, particularly if the IDP contains pH-sensitive histidine residues. Optionally, the ligand stock may be prepared in the presence of the 15N-labelled protein, to avoid dilution during the titration, but this is not essential as dilution can also be taken into account during the TITAN analysis. If, for solubility reasons, the ligand is dissolved in DMSO, d6-DMSO is preferred to avoid strong background signals from the solvent, and a control experiment should be performed to verify that DMSO, at the maximum concentration used during the titration, does not perturb the spectrum of the protein being observed.

2.4 Software

1. NMRPipe [55] is used for the processing and analysis of NMR data and is available from https://www.ibbr.umd.edu/nmrpipe/.

2. TITAN (v. 1.6) software for two-dimensional lineshape analysis is available from https://www.nmr-titan.com [40]. TITAN requires OS X or Linux to run as a stand-alone application, or it can be run in a scripted manner within MATLAB (version R2016b is currently officially supported).

3 Data Acquisition

3.1 Sample Concentrations

1. Protein and ligand concentrations must be chosen with some care to ensure accurate results can be obtained from the titration measurement. The protein concentration should not be more than an order of magnitude higher than the Kd, and optimal results are obtained when the protein concentration is comparable to the Kd. The final ligand concentration should be sufficient to fully saturate the binding: the greater of at least three times the Kd or two equivalents relative to the protein concentration. If even an approximate value of the Kd is unknown (e.g., from measurements on a related or similar system), an initial titration may be required to obtain an estimate, to be followed up with a more detailed characterization using optimized sample concentrations.

2. Approximately 8–12 titration points should be acquired (see Note 1). The majority of ligand concentrations should be in the range of 0–1.5 equivalents of protein, unless the protein concentration is much less than the Kd, in which case sufficiently high ligand concentrations should be used in order to saturate the binding (up to ca. 5× Kd). It is often helpful for ligand concentrations to be spaced nonlinearly, e.g., in quadratic increments: 0, 0.1, 0.25, 0.4, 0.6, 0.85, 1.15, 1.5, and 1.9 equivalents.

3. Protein and ligand stock concentrations should be determined as accurately as possible [56]. Centrifuge protein stocks at ca. 16,000 rpm (20,000 × g) for 30 min to remove precipitate or aggregates prior to measurement of concentration. This is most readily done via UV absorbance measurements (although as IDPs often contain few aromatic residues, alternative methods may on occasion need to be employed [57, 58]). Ensure that the stock concentration or dilution thereof has an absorbance that can be measured with high accuracy, i.e., in the range 0.2–0.6 units. If the ligand stock concentration cannot also be measured spectrophotometrically, ensure that a sufficiently large mass of sample (at least 10 mg) is weighed to avoid errors from the limited precision of the analytical balance. Volumetric glassware or calibrated pipettes should be used for liquid handling; graduated cylinders and plasticware such as serological pipettes and falcon tubes do not provide sufficient accuracy.

4. Protein concentrations of at least 10 μM are required (on modern spectrometers equipped with cryogenic probes), although such low concentrations may necessitate acquisition times of several hours for each spectrum. Higher protein concentrations, around 50–100 μM, will generally allow sufficiently high signal-to-noise ratios to be obtained in less than an hour.

5. A ligand stock should be prepared with a concentration ca. 5–20× that of the final desired concentration, to minimize dilution of the sample (see Note 2).

3.2 Experimental Setup

1. The sample temperature has a strong effect on the quality of 2D NMR spectra and should be optimized before carrying out a titration. Due to the line broadening effect of amide hydrogen exchange with the solvent, which is particularly significant for disordered, solvent-exposed residues, the highest-quality 1H,15N correlation spectra can generally be obtained for IDPs at lower temperatures, e.g., 278–283 K. This contrasts with typical folded proteins, for which higher temperatures result in more rapid rotational diffusion and hence sharper resonances (see Note 3).

2. Two-dimensional lineshape analysis proceeds by using a “virtual spectrometer” to calculate the evolution of magnetization through the same pulse sequence that was used to acquire the experimental data. It is therefore important to acquire data with a pulse sequence that is implemented within TITAN. At the time of writing, HSQC and HMQC experiments are fully implemented, but due to the more complex spin dynamics involved in magnetization transfer steps, TROSY, sensitivity-enhanced HSQC, and in-phase HSQC experiments are not compatible and therefore are not recommended (see Note 4). Our preferred 1H,15N correlation experiments are the SOFAST-HMQC [59, 60] and the HSQC (specifically, the FHSQC variant [61]). The SOFAST-HMQC provides excellent experimental sensitivity with short acquisition times and good resolution in t1 as the amide selective 1H refocusing pulse also refocuses the 3JHNHA scalar coupling. However, transverse relaxation and chemical exchange broadening are generally more severe in the indirect dimension of HMQC experiments compared to the HSQC, and therefore in some cases, the HSQC may provide more useful or complementary information. Whichever experiment is selected, it is important that good solvent suppression and flat baselines are obtained, without the need to apply first-order phase correction in direct or indirect dimensions (with the exception of 90°/180° phase correction for an initial half-dwell delay). Remarkably, this is not always the case for standard library sequences. Bruker format pulse programs are therefore provided with the TITAN download (v. 1.6 onward) for SOFAST-HMQC and HSQC experiments that we have found provide suitable high-quality data. Note that a Reburp rather than an r-Snob refocusing pulse is recommended in the SOFAST-HMQC experiment [60]. Short relaxation delays should be avoided in SOFAST-HMQC experiments used for two-dimensional lineshape analysis (e.g., d1 must be longer than ca. 300 ms), to ensure that resonance intensities are not strongly weighted by longitudinal relaxation, which may vary between free and bound states.

3. Record a preliminary experiment with a wide sweep width, in order to optimize the sweep width and offset. Acquisition times in the direct and indirect dimensions are a balance between obtaining sufficient resolution and avoiding an excessive number of points that may result in slow lineshape fitting. As a rough guide, we suggest acquisition times of 100 ms in the direct dimension and 30–50 ms in the indirect dimension. Folded peaks may also be fitted in TITAN, and this may be used to reduce the sweep width and hence the number of points to be sampled.
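The trade-off between acquisition time and number of points described in step 3 can be estimated with simple arithmetic. The Python sketch below assumes that the acquisition time equals the number of complex points divided by the sweep width; the function names are hypothetical, and the sweep widths are taken from the example in Listing 1 purely for illustration.

```python
import math

def complex_points(acq_time_s, sweep_width_hz):
    """Number of complex points needed to reach a target acquisition time."""
    return math.ceil(acq_time_s * sweep_width_hz)

def acquisition_time(n_complex, sweep_width_hz):
    """Acquisition time (s) obtained with a given number of complex points."""
    return n_complex / sweep_width_hz

sw_h, sw_n = 9615.385, 1823.985  # Hz, sweep widths from the example in Listing 1
n_direct = complex_points(0.100, sw_h)     # ca. 100 ms in the direct dimension
n_indirect = complex_points(0.040, sw_n)   # 40 ms, within the suggested 30-50 ms
print(n_direct, n_indirect)
print(f"{1000 * acquisition_time(1024, sw_h):.1f} ms for 1024 complex points")
```

Halving the sweep width, for example by allowing peaks to fold, halves the number of points required for the same acquisition time.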

3.3 Performing the NMR Titration

1. Record a 1D 1H spectrum of the unbound protein sample as a reference, followed by an appropriate 2D experiment, selected and set up as detailed above.

2. Remove the sample from the spectrometer, and add the required amount of ligand stock solution to the sample. Care should be taken while doing this so material is not lost due to adhesion to glass pipettes, Shigemi plungers, etc. Our recommended approach is that the sample should not be removed from the NMR tube, but instead the appropriate volume of ligand stock should be added to the top of the tube, which is recapped and closed with parafilm and then inverted several times to ensure complete mixing. A hand centrifuge can be used to return the sample to the bottom of the tube, and with careful handling, we find that bubbles can generally be avoided (which otherwise would greatly reduce the quality of the spectrum). Alternatively, the ligand may be delivered into the sample and mixed using a narrow metal spatula.

3. Reinsert the sample into the spectrometer, and after allowing time for the temperature to re-equilibrate, lock, shim, and recalibrate the 1H 90° pulse length (this is particularly important if the ligand is dissolved in DMSO rather than in the NMR buffer).

4. Record a new pair of 1D and 2D spectra, and repeat from step 2 until the titration is complete. Note that for analysis in TITAN, it is essential that acquisition parameters such as the spectrum widths or offsets are not changed between experiments. The number of scans and receiver gain may however be varied between points as required to obtain high-quality spectra.

5. Alternatively, if a series of samples have instead been prepared (e.g., because a high-concentration ligand stock could not be prepared), these may be run sequentially in automation mode. As a quality control measure, in this situation, we also recommend acquiring 1D 1H spectra for each sample.
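The ligand additions in step 2, and the dilution-corrected concentrations that are later entered into TITAN, can be planned in advance. The Python sketch below is a minimal illustration under stated assumptions: the ligand stock contains no protein, volumes are additive, no sample is lost between additions, and the function name and example values (50 μM protein, 550 μL, 1 mM stock) are hypothetical.

```python
def titration_plan(p0_um, v0_ul, stock_um, equivalents):
    """Cumulative stock volume and diluted concentrations for each ligand:protein ratio."""
    plan = []
    for eq in equivalents:
        # The ligand:protein ratio is unaffected by dilution, which fixes the cumulative volume
        v_add = eq * p0_um * v0_ul / stock_um
        v_tot = v0_ul + v_add
        protein = p0_um * v0_ul / v_tot      # diluted protein concentration (uM)
        ligand = stock_um * v_add / v_tot    # total ligand concentration (uM)
        plan.append((eq, v_add, protein, ligand))
    return plan

# Quadratic spacing of titration points, as suggested in Subheading 3.1
equivalents = [0, 0.1, 0.25, 0.4, 0.6, 0.85, 1.15, 1.5, 1.9]
for eq, v_add, protein, ligand in titration_plan(50.0, 550.0, 1000.0, equivalents):
    print(f"{eq:4.2f} eq: {v_add:6.2f} uL stock in total, "
          f"[P] = {protein:5.2f} uM, [L] = {ligand:5.2f} uM")
```

The volume to add at each step is the difference between successive cumulative volumes; a more concentrated stock reduces both the volumes and the dilution of the protein.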

3.4 Data Analysis

3.4.1 Analysis of 1D 1H Spectra

1D 1H spectra acquired across the titration series should be inspected for unexpected changes in protein intensity or linewidth, which may indicate aggregation or degradation. If a buffer was used containing non-exchangeable protons (e.g., Tris, HEPES, acetate), their chemical shifts may be used to verify that the sample pH remained constant throughout the titration.

3.4.2 Processing

2D spectra should be processed in NMRPipe, using linear prediction and exponential window functions. The strengths of the window functions should be chosen as a compromise between eliminating truncation artifacts (“sinc wiggles”) and optimizing sensitivity and minimizing overlap between adjacent signals. To accelerate analysis in TITAN, it is helpful to extract only the 1H chemical shift range containing signals to be analyzed, e.g., between 7 and 10 ppm (or less in the case of IDPs). No extraction should be applied in the indirect dimension. All spectra must be processed in an identical manner. To facilitate this, once fid.com and nmrproc.com processing scripts have been prepared for one spectrum, an automated script may be used to process the remaining spectra, adapted from the example below:

Listing 1.

    #!/bin/csh
    # automated processing of 2D spectra in experiments 1 to 11
    # processed spectra will be named spectrum-1.ft2, spectrum-2.ft2, ...
    foreach spec (1 2 3 4 5 6 7 8 9 10 11)
        # conversion from Bruker to nmrPipe format, adapted from fid.com:
        bruk2pipe -in ./$spec/ser \
          -bad 0.0 -noaswap -AMX -decim 16 -dspfvs 12 -grpdly -1 \
          -xN 2048 -yN 256 \
          -xT 1024 -yT 128 \
          -xMODE DQD -yMODE States-TPPI \
          -xSW 9615.385 -ySW 1823.985 \
          -xOBS 599.927 -yOBS 60.797 \
          -xCAR 4.611 -yCAR 118.959 \
          -xLAB HN -yLAB 15N \
          -ndim 2 -aq2D States \
          -out ./test.fid -verb -ov
        # 2D processing, adapted from nmrproc.com
        # with: solvent suppression via SOL filter
        #       4 Hz and 8 Hz exponential line broadening
        #       extraction of 1H dimension from 7--10 ppm
        #       linear baseline correction in 1H dimension
        #       linear prediction in indirect dimension
        nmrPipe -in ./test.fid \
        | nmrPipe -fn SOL \
        | nmrPipe -fn EM -lb 4.0 -c 0.5 \
        | nmrPipe -fn ZF -auto \
        | nmrPipe -fn FT -auto \
        | nmrPipe -fn PS -p0 152.00 -p1 0.00 -di -verb \
        | nmrPipe -fn EXT -x1 7ppm -xn 10ppm -sw \
        | nmrPipe -fn BASE -nw 10 -nl 0% 2% 98% 100% \
        | nmrPipe -fn TP \
        | nmrPipe -fn LP -fb \
        | nmrPipe -fn EM -lb 8.0 -c 1.0 \
        | nmrPipe -fn ZF -auto \
        | nmrPipe -fn FT -auto \
        | nmrPipe -fn PS -p0 -90.00 -p1 180.00 -di -verb \
          -ov -out ./spectrum-$spec.ft2
        rm ./test.fid
    end

3.4.3 Two-Dimensional Lineshape Analysis

1. Launch TITAN, either via the pre-compiled binary installation or within MATLAB using the command “TITAN” (having added the TITAN directory to the path as described in the documentation). The main interface is shown in Fig. 2 and provides a directed path through the analysis procedure.

2. “Select binding model...” will launch a dialog to specify the microscopic association mechanism to which experimental data will be fitted (Fig. 3). A number of binding models are available, describing a variety of situations. Some common models are summarized in Table 1.

3. “Set up titration points and select data...” will open a dialog to import experimental titration data (Fig. 4a) (see Note 5). For convenience, this data may be copied and pasted from an Excel spreadsheet, the format of which is shown in Fig. 4b. Depending on the binding model selected, protein and ligand concentrations must be specified for each titration point (corrected for sample dilution), as well as the number of scans (ns) and receiver gain (rg) used for each experiment. The experimental data should be in the form of NMRPipe format (.ft2) files generated by the processing steps above. Noise levels will be calculated automatically for each experiment, based on maximum likelihood estimation of a truncated Gaussian distribution, excluding intense regions associated with peaks. These can be manually overwritten if necessary. Note that accurate noise levels are critical for correct weighting of residuals across multiple spectra.

4. Select the pulse sequence used for data acquisition (“Set up pulse program...”). Several pulse sequences are available for analysis within TITAN, of which HSQC and HMQC are the most common experiments. It is important that the pulse sequence is correctly specified as this will strongly affect the results, as chemical exchange can have different effects on lineshapes in the indirect dimension of HSQC and HMQC experiments [40], as well as during coherence transfer periods [41]. Having selected an experiment type, further acquisition and processing parameters must be specified (Fig. 5). These parameters include the spectrometer frequency, frequency offsets, spectrum widths, the number of points in each dimension (after zero filling and linear prediction, if applied), and the exponential line broadening applied during processing. These parameters should all be correctly parsed from the input NMRPipe format data, except two that must be specified manually: the 1JIS heteronuclear scalar coupling between the spins being observed (ca. 92 Hz for amide spin systems) and the value of the 3JHNHA scalar coupling that is active during the final acquisition step. For fully protonated amide spin systems, the approximate value of 6.5 Hz has been sufficient for all cases the author has examined to date. However, if the protein being observed is perdeuterated, then the 3JHNHA coupling should be set to zero at this point (see Note 6).

[Figure 2. TITAN main window (v1.5-6-ga686). Workflow buttons: Select binding model...; Set up titration points and select data...; Set up pulse program...; Set up/edit spins and select ROIs...; Set up/edit model parameters...; Fit!; View results; Run bootstrap error analysis...; View results. Additional buttons: Load session...; Save session...; Help...]

Fig. 2 Screenshot of the main TITAN interface. The workflow, indicated by arrows, is progressively enabled as the user proceeds through the analysis

[Figure 3. Dialog “Please select a binding model:” listing: No exchange; Two state; Two state (flexible stoichiometry); Dimerisation; Induced fit; Conformational selection; 4-state binding (CS+IF); Three state (parallel); Three state (sequential); Two independent binding sites (4 state); Two binding sites (4 state); Chemical denaturation (two-state folding); Chemical denaturation (two-state folding, alternativ… [entry truncated in the screenshot]); Chemical denaturation (three-state folding); Bidentate ligand; Dimerisation after binding; Dimerisation vs binding; Ligand exchange (via ternary complex). Buttons: Cancel; OK.]

Fig. 3 Selection of a binding model. The two-state binding model selected here is suitable for simple protein-ligand interactions, but a variety of additional models are available as shown here and described in Table 1

2D Lineshape Analysis of IDP Interactions

489

Table 1 A summary of key binding models implemented within TITAN

Model: No exchange
Schematic: N/A
Fitting parameters: N/A
Comments: Suitable for fitting linewidths and peak positions in single spectra

Model: Two state
Schematic: P + L ⇌ PL (forward rate kon, reverse rate koff)
Fitting parameters: Kd = koff/kon; koff
Comments: Default model for a simple binding reaction. “P” represents the observed protein and “L” the ligand

Model: Two state (flexible stoichiometry)
Schematic: P + nL ⇌ PL (forward rate kon, reverse rate koff)
Fitting parameters: Kd = koff/kon; koff
Comments: n is a parameter that can be fitted to allow for uncertainty in the ligand stock concentration

Model: Induced fit
Schematic: (graphical scheme, not reproduced)
Fitting parameters: Kd = koff/kon; koff, kopen, kclose
Comments: In the context of IDPs, this mechanism may also describe folding upon binding

Model: Conformational selection
Schematic: (graphical scheme, not reproduced)
Fitting parameters: Kd = koff/kon; koff, kopen, kclose
Comments: In the context of IDPs, this mechanism may also describe binding upon folding

Model: Four-state exchange
Schematic: (graphical scheme, not reproduced)
Fitting parameters: Kd,app; koff,A, koff,B; KAB = kAB/kBA; KAB′ = kAB′/kBA′; kex = kAB + kBA; kex′ = kAB′ + kBA′
Comments: Binding via induced fit and conformational selection, for a protein with two states, “A” and “B”. Also applicable to coupled folding and binding. Ligand affinity is specified as Kd,app, the apparent dissociation constant for the equilibrium: (A + B) + L ⇌ (AL + BL)

Model: Dimerization
Schematic: 2M ⇌ D (forward rate kon, reverse rate koff)
Fitting parameters: Kd = koff/kon; koff
Comments: “M” represents the monomer and “D” the dimer. Dimers may be symmetric or asymmetric (see Note 10)

Model: Two binding sites
Schematic: (graphical scheme, not reproduced)
Fitting parameters: Kd,1 = koff,1/kon,1; Kd,2 = koff,2/kon,2; Kd,3 = koff,3/kon,3; koff,1, koff,2, koff,3, koff,4
Comments: Two ligand binding sites, labelled “B1” and “B2”, with positive or negative cooperativity leading toward the doubly bound state “B12”. Closure of the thermodynamic cycle determines Kd,4 = Kd,1 Kd,3/Kd,2
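To illustrate how the two-state parameters Kd and koff translate into an observed lineshape, the following is a minimal one-dimensional sketch based on the standard two-site exchange (Bloch-McConnell) expression. It is not TITAN's implementation, which simulates the full two-dimensional experiment including the pulse sequence; the function names are hypothetical, shifts are in Hz, rates in s⁻¹, and concentrations and Kd share one (arbitrary) unit.

```python
import math


def bound_concentration(p_total: float, l_total: float, kd: float) -> float:
    """[PL] for P + L <-> PL, from the quadratic mass-balance solution."""
    b = p_total + l_total + kd
    return (b - math.sqrt(b * b - 4.0 * p_total * l_total)) / 2.0


def two_state_lineshape(freq_hz, p_total, l_total, kd, koff,
                        shift_free_hz, shift_bound_hz, r2_free, r2_bound):
    """Intensity at freq_hz for a spin exchanging between free and bound states."""
    pl = bound_concentration(p_total, l_total, kd)
    kon = koff / kd                      # from Kd = koff/kon
    k_fb = kon * (l_total - pl)          # free -> bound, pseudo-first-order (s^-1)
    k_bf = koff                          # bound -> free (s^-1)
    pop_free = (p_total - pl) / p_total
    pop_bound = pl / p_total
    w = 2.0 * math.pi * freq_hz
    # A = i(w - Omega) + R - K, with K the two-site exchange matrix
    a11 = complex(r2_free + k_fb, w - 2.0 * math.pi * shift_free_hz)
    a22 = complex(r2_bound + k_bf, w - 2.0 * math.pi * shift_bound_hz)
    det = a11 * a22 - k_fb * k_bf
    # Sum of all elements of A^-1, weighted by the equilibrium populations
    signal = ((a22 + k_fb) * pop_free + (a11 + k_bf) * pop_bound) / det
    return signal.real


# Half-saturated protein (Kd = 1, P0 = 1, L0 = 1.5), shifts 100 Hz apart.
grid = range(-50, 151)
for koff in (1.0, 1.0e5):
    spectrum = [two_state_lineshape(f, 1.0, 1.5, 1.0, koff, 0.0, 100.0, 20.0, 20.0)
                for f in grid]
    peak_hz = grid[spectrum.index(max(spectrum))]
    print(f"koff = {koff:g} s^-1: tallest point at {peak_hz} Hz")
```

With a small koff the free and bound resonances remain separate (slow exchange), whereas a large koff collapses them into a single population-averaged peak (fast exchange); the more complex models in Table 1 extend the same idea to more states.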


Christopher A. Waudby and John Christodoulou

[Fig. 4 screenshot, panel (a): titration setup dialog with one row per spectrum (test-1.ft2 to test-11.ft2 of the FBPNbox example), each listing the protein concentration (41 falling to 32.8), ligand concentration (0 rising to 200.06), ns (16), rg, the file path, and the noise level. Buttons: Add row, Delete row, Copy to Excel, Paste from Excel, Preview spectra, Cancel, OK. Panel (b): spreadsheet view.]

Fig. 4 (a) Titration setup dialog. Protein and ligand concentrations (as required by the particular binding model selected) must be specified together with the acquisition parameters ns (number of scans) and rg (receiver gain). Spectrum noise levels are required to accurately weight residuals between spectra and are calculated automatically upon importing data. (b) For convenience, data for this dialog may be copied and pasted from a suitable Excel spreadsheet
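The role of the per-spectrum noise level can be sketched as follows: residuals from each spectrum are divided by that spectrum's noise before being summed, so that spectra recorded with different receiver gains or numbers of scans contribute on a common scale. This is an illustration only; `weighted_chi_square` and `robust_noise` are hypothetical names, and the median-absolute-deviation estimator is an assumed stand-in for whichever noise estimator TITAN uses internally.

```python
import statistics


def robust_noise(points):
    """Noise estimate from the median absolute deviation (MAD).

    The MAD is insensitive to a small number of intense points (peaks).
    """
    centre = statistics.median(points)
    mad = statistics.median([abs(p - centre) for p in points])
    return 1.4826 * mad  # scales the MAD to a Gaussian standard deviation


def weighted_chi_square(observed, fitted, noise):
    """Sum of squared residuals, each spectrum scaled by its own noise level."""
    total = 0.0
    for obs, fit, sigma in zip(observed, fitted, noise):
        total += sum(((o - f) / sigma) ** 2 for o, f in zip(obs, fit))
    return total


# Two toy "spectra": the second has ten-fold larger intensities and noise.
observed = [[2.0, -2.0, 1.0], [20.0, -20.0, 10.0]]
fitted = [[0.0, 0.0, 0.0], [0.0, 0.0, 0.0]]
noise = [1.0, 10.0]

chi2 = weighted_chi_square(observed, fitted, noise)
unweighted = sum((o - f) ** 2
                 for obs, fit in zip(observed, fitted)
                 for o, f in zip(obs, fit))
print(chi2, unweighted)
```

Without the weighting, the second spectrum would dominate the sum simply because of its larger intensity scale; with it, both spectra contribute equally.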

Some experiments, such as the HSQC, will prompt for further details on delays within the particular pulse sequence employed, which are required to correctly propagate chemical exchange through gradient selection delays and zz filters. If the HSQC sequence provided is used, then the default parameters will be correct.
5. Open the spin system editor (“Set up/edit spins and select ROIs...”). This interface is used to select residues and regions of the experimental spectra that are used for analysis. Spin systems are a key concept within TITAN: each spin system represents a single residue and, for every state specified by the binding model (e.g., free and bound states), carries information on direct and indirect chemical shifts (dI, dS) and linewidths (R2I, R2S) (see Note 7). Optionally, spin systems can be labelled with an assignment; otherwise a default numbered label is created. For each spectrum, a spin system is associated


[Fig. 5 screenshot: pulse sequence setup dialog. Direct dimension: base frequency 599.927 MHz; chemical shift limits 11.9987 and 5.9962 ppm; number of points 768; exponential window function 4 Hz. Indirect dimension: base frequency 60.797 MHz; transmitter offset 118.959 ppm; sweep width 30.0012 ppm; number of complex points 256; number of complex points after zero filling 512; exponential window function 8 Hz. 1J(IS) coupling 92 Hz; 3J(HNHA) coupling 6.5 Hz. Buttons: OK, Cancel.]

Fig. 5 Pulse sequence setup dialog. All fields should be parsed automatically from NMRPipe input files, except those highlighted yellow, for which appropriate values should be set by the user

with a region of interest (ROI). This is an area of the spectrum that will be simulated and used for fitting of the associated spin system. When the spin system editor is first opened, a new spin will be created automatically and the ROI editor launched (Fig. 6). The left-hand panel displays an overlay of all the experiments in the titration series, and any existing ROIs will also be marked on this panel. Use the toolbar to zoom in to the residue of interest, adjusting the contour levels as necessary, and then press “Select ROIs” to begin marking out ROIs. The first experiment will be displayed as a density plot in the right-hand panel; the color scale may be adjusted using the up/down


[Fig. 6 screenshots: ROI editor, panels (a) to (c). Controls shown: Assignment, Select ROIs (Re-select ROIs in panel c), Edit peak positions, OK. Panel (b) is headed “Editing spectrum 1 of 11” with the prompt “Select an ROI, spectrum 1 of 11. Use right-hand panel only! Left-click to add”. Panel (c) shows the assignment D724 with peak labels D724_P and D724_PL.]

Fig. 6 Screenshots of the ROI editor illustrating the setup of a new spin system. (a) When first opened, the editor shows an overlay of all spectra in the left-hand panel which may be zoomed and panned using the controls in the toolbar (toolbar not shown). (b) ROIs are marked out as a series of points in the right-hand panel. (c) Final state of the editor after selecting initial estimates of the free and bound state peak positions


arrow keys. Left click in this panel to mark out a polygon defining the boundary of the ROI; a right click or the space bar will add a final point and close the boundary. This process is then repeated for the remaining spectra. Distinct ROIs can be defined for each spectrum or, alternatively, the previous ROI can be copied using the “c” key (see Note 8 for a discussion of the optimal shapes of ROIs). Once ROIs have been selected for all spectra, initial estimates of peak positions are selected for all states in the binding model (e.g., free and bound) by clicking at the appropriate position in the left-hand panel. By default, linewidths of all states are set to 20 s⁻¹, which generally provides an acceptable starting point for fitting (see Note 7). Where resonances of multiple residues overlap, there is no restriction on ROIs also overlapping. However, these spin systems should then be associated together into a “spin group” by assigning an arbitrary label (e.g., “group 1,” “group 2”) to the relevant spin systems within the upper panel of the spin system editor dialog (Fig. 7).
6. Having created a series of spin systems and associated ROIs, and before closing the spin system editor, select the parameters that should be used for fitting using the lower half of the dialog (Fig. 7b). Two-dimensional lineshape analysis involves the fitting of many parameters, and so it is often helpful to carry out the fitting in two stages: first, fit only the free-state chemical shifts and linewidths using the first (unbound) spectrum only; then fix these chemical shifts and fit the remaining parameters, together with model parameters such as the dissociation constant, Kd, and the dissociation rate, koff, using the complete dataset.
7. Depending on the binding model selected above, a number of model parameters must be specified, which represent thermodynamic equilibrium and kinetic rate constants, such as Kd and koff values. These can be edited by selecting “Set up/edit model parameters...” If initially only the first spectrum is being used for fitting unbound chemical shifts, as described above, then fitting of model parameters should be turned off in this dialog (Fig. 8).
8. After following the preparatory steps above, the “Fit!” command will now be enabled. Fitting will overwrite parameter values, and so it is recommended to save the session before proceeding. Upon proceeding, a dialog will prompt to select spectra to be used in the fitting process (e.g., the first spectrum only for the initial optimization of unbound peak positions). While the fitting process is running, a plot of the chi-square residuals is displayed to show the progress of the optimization algorithm, and on completion a list of the fitted parameters will


[Fig. 7 screenshots: (a) ROI editor showing the overlapping residues D223 and F229, with peak labels D223_P, D223_PL, F229_P, and F229_PL. (b) Spin system editor. The upper panel (“Set up spins, select ROIs and assign spin groups”) lists the spins L275, S227, F229, D223, and K236, with F229 and D223 sharing spin group “2”. The lower panel (“Set spin system parameters, parameters to be fitted and min/max limits”) lists, for each quantity and state, the minimum, maximum, fit setting (yes/no), and current value.]

Fig. 7 Setting up spin groups for the fitting of overlapping resonances. (a) Two residues, D223 and F229, have been defined with overlapping ROIs within the ROI editor. (b) The overlapping residues have been associated with a common “spin group” within the spin system editor (red circle). The spin group is an arbitrary text label. ROIs for residues within the same spin group will be merged, and (overlapping) resonances therein will be fitted simultaneously

be displayed in a new window. Parameter labels are of the form “ASSIGNMENT_QUANTITY_MODEL STATE.” Note that the reported error comes from the estimated covariance matrix, and the use of bootstrap resampling methods (below) is recommended for more robust estimates.
9. At this stage, the fitting process may be repeated as indicated in the workflow in the main window (Fig. 2), after adjusting the fixed and free parameters as required.


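[Fig. 8 screenshot: model parameters editor, headed “Please set model parameters:”, with two parameter rows showing initial values of 10 and 1000, each with limits of 0 and 10000. Button: OK.]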

Fig. 8 Screenshot of the model parameters editor. Initial values for parameters such as Kd and koff can be specified, together with the allowable parameter range (e.g., which may be constrained on the basis of prior knowledge). Fitting of particular parameters may be activated and deactivated using the checkboxes
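A fitted value that ends up at one end of its allowed range is a sign that the data are not constraining that parameter. A minimal sketch of such a check is given below; the dictionary layout, the `at_limit` helper, and the example values are all hypothetical, with only the 0 to 10000 range mirroring the limits visible in the model parameters editor (Fig. 8).

```python
def at_limit(value: float, lower: float, upper: float, rel_tol: float = 1e-3) -> bool:
    """True if value lies within rel_tol of either end of the range [lower, upper]."""
    span = upper - lower
    return (value - lower) <= rel_tol * span or (upper - value) <= rel_tol * span


# Hypothetical fit results: name, fitted value, and the limits set before fitting.
results = [
    {"name": "Kd", "value": 12.3, "lower": 0.0, "upper": 10000.0, "fitted": True},
    {"name": "koff", "value": 10000.0, "lower": 0.0, "upper": 10000.0, "fitted": True},
]

# Only parameters that were actually fitted are checked.
flagged = [p["name"] for p in results
           if p["fitted"] and at_limit(p["value"], p["lower"], p["upper"])]
print("parameters at a limit:", flagged)
```

Any parameter reported by such a check should either have its range reconsidered on the basis of prior knowledge or be interpreted with caution.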

10. Once fitting is complete, a variety of plots are available to assess the quality of the fit. Overlaid contour plots of observed and fitted spectra are a straightforward way to compare the goodness of fit (Fig. 9a), but deviations in peak intensities, which may be a signature of a more complex binding mechanism, are not always obvious in such plots. Therefore, it is also useful to examine interactive overlays of 3D waterfall plots, which may give better insights into signal-to-noise levels and whether intensities are being fitted accurately (Fig. 9b). All plots may be saved as publication-quality vector graphics (eps format) using the toolbar. Lastly, fitted (simulated) spectra may be exported into NMRPipe format for visualization and analysis with other software packages such as CCPN Analysis or Sparky [62, 63].
11. Finally, once the fitting has been completed, the option to run a bootstrap error analysis will be enabled. This will repeat the previous fitting step, with the same starting parameters as used previously, using a series of spectra generated through resampling of residuals from the best-fit spectrum [40]. The user is prompted for the number of bootstrap replicas to generate and fit; at least 50 are recommended to obtain reliable estimates of parameter uncertainties. Once complete, a summary report is generated containing the mean and standard error of the fitted parameters. The fit results from individual bootstrap replicas can also be tabulated, but this information is perhaps more usefully displayed in the form of the parameter covariance matrix (Fig. 10). Parameters of interest should be inspected for strong correlations that may point to weaknesses or hidden uncertainties in the fitting procedure. For example, for a spin system in fast exchange, it may be difficult to differentiate between a relatively high-affinity interaction with a small chemical shift difference between free and bound states and a lower-affinity


[Fig. 9 plots: panels (a) and (b) each show the F229 / D223 spin group for spectra 1 to 11. Panel titles in (b) give the protein and ligand concentrations: 1: P0=41, L0=0; 2: P0=41, L0=8.1; 3: P0=40, L0=16; 4: P0=40, L0=24; 5: P0=40, L0=32; 6: P0=39, L0=47; 7: P0=38, L0=62; 8: P0=38, L0=76; 9: P0=37, L0=110; 10: P0=35, L0=140; 11: P0=33, L0=200.]

Fig. 9 Visualization of fitting results. (a) Overlaid contour plots of observed and fitted spectra (blue and red, respectively). (b) Three-dimensional views of observed and fitted spectra (gray and magenta, respectively)

[Fig. 10 plot: “Correlation matrix of fitted parameters”. Both axes: Parameter ID (20 to 140). Color scale: Correlation coefficient, from -1 to 1.]

Fig. 10 Density plot of the parameter covariance matrix derived from bootstrap error analysis. Parameter IDs are listed within the fitting output and can be explored interactively using the mouse cursor

interaction with a larger chemical shift difference, particularly if the available titration data did not reach saturation. In this situation, a strong correlation would therefore be expected between the fitted Kd and the fitted chemical shift of the bound state. In general, however, an analysis of multiple spin systems results in a greatly improved covariance structure. This points to the importance of globally fitting multiple spin systems exhibiting fast, intermediate, and slow chemical exchange or, where this is not possible, of ensuring that sufficient ligand is added to weak binding systems in fast chemical exchange so that they reach saturation and an accurate estimate of the bound state chemical shift can be determined.
12. The point at which a fit result is regarded as “acceptable” is ultimately subjective. A typical ROI contains ~500 points, resulting in ~10⁵ observations fitted to ~10² parameters (given 20 ROIs fitted across 10 spectra), and given the complexity of this simulation and fitting process, it is difficult to devise a simple and robust measure of the goodness of fit. Instead, we recommend that the user consider the following criteria in reaching a decision:
(a) Fitted spectra should accurately reproduce the observed experimental data. This should be assessed using both contour and 3D plots, as discussed above (Fig. 9).
(b) Selected ROIs should cover the full range of chemical shift differences and chemical exchange regimes observed.
(c) Sufficient ROIs should have been selected such that the inclusion of additional ROIs does not alter the fitting result (see Note 9).
(d) Fitted parameters should have physically reasonable values and should not have reached their minimum or maximum limits. Where extreme values occur, this indicates that a parameter is not being effectively constrained by the experimental data, and its value should be interpreted with caution.
(e) If a complex binding model is being fitted, the user should verify that a comparable quality of fit cannot be obtained using a simpler model (i.e., the principle of parsimony).
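The bootstrap error analysis of step 11, and the inspection of parameter correlations that follows from it, can be sketched on a toy problem. Here a straight-line fit stands in for the lineshape model, purely as an assumption for illustration; TITAN's own resampling scheme is described in [40] and is not reproduced here, and all function names are hypothetical.

```python
import random
import statistics


def fit_line(x, y):
    """Least-squares intercept and slope (the toy 'model' being fitted)."""
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


def residual_bootstrap(x, y, n_replicas=50, seed=0):
    """Refit synthetic datasets built from the best fit plus resampled residuals."""
    rng = random.Random(seed)
    intercept, slope = fit_line(x, y)
    best_fit = [intercept + slope * xi for xi in x]
    residuals = [yi - fi for yi, fi in zip(y, best_fit)]
    replicas = []
    for _ in range(n_replicas):
        synthetic = [fi + rng.choice(residuals) for fi in best_fit]
        replicas.append(fit_line(x, synthetic))
    return (intercept, slope), replicas


def correlation(u, v):
    """Pearson correlation coefficient between two lists of fitted values."""
    mu, mv = statistics.mean(u), statistics.mean(v)
    cov = sum((a - mu) * (b - mv) for a, b in zip(u, v))
    var_u = sum((a - mu) ** 2 for a in u)
    var_v = sum((b - mv) ** 2 for b in v)
    return cov / (var_u * var_v) ** 0.5


x = [float(i) for i in range(1, 11)]
noise = [0.3, -0.2, 0.1, -0.4, 0.2, -0.1, 0.4, -0.3, 0.05, -0.05]
y = [2.0 + 0.5 * xi + ni for xi, ni in zip(x, noise)]

best, replicas = residual_bootstrap(x, y, n_replicas=50)
intercepts = [r[0] for r in replicas]
slopes = [r[1] for r in replicas]
print("best fit:", best)
print("bootstrap SD:", statistics.stdev(intercepts), statistics.stdev(slopes))
print("intercept/slope correlation:", correlation(intercepts, slopes))
```

In this toy case the intercept and slope are strongly anticorrelated because all x values lie on one side of zero, which is analogous to the Kd versus bound-state chemical shift correlation described above for an unsaturated titration in fast exchange.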

4 Notes
1. These guidelines are based on a simple 1:1 association reaction, P + L ⇌ PL. The identification and analysis of more complex binding mechanisms may benefit from a greater number of points or, for example, varying the protein concentration in addition to that of the ligand.
2. If a high-concentration ligand stock cannot be prepared (e.g., due to limited solubility or aggregation), then a lower-concentration ligand stock may be prepared in the presence of the protein, to avoid sensitivity loss due to dilution. Alternatively, a series of individual titration samples should be prepared, either directly (using 3 mm NMR tubes for efficiency) or from serial dilutions of two samples prepared without ligand and with the maximum ligand concentration required. However, we note that in all these cases the total amount of ¹⁵N-labelled protein that is required is increased.
3. Experiments based on direct ¹³C detection, e.g., CON and CACO, have been developed that provide well-resolved resonances for disordered states under physiological conditions [64, 65], albeit with decreased sensitivity, and titrations have indeed been carried out using such methods [66]. However, these experiments are not currently implemented for two-dimensional lineshape analysis within TITAN and therefore will not be discussed further here.
4. To the extent that relaxation or chemical exchange during magnetization transfer steps can be neglected, other experiments such as transverse relaxation-optimized spectroscopy (TROSY) and sensitivity-enhanced HSQCs may still be analyzed as if acquired with a regular HSQC pulse sequence, but the user should be aware of the additional assumptions involved in such an analysis.
5. For illustrative purposes, data shown in this article use the FBPNbox example provided in the TITAN download.
6. Perdeuteration of proteins (the replacement of all non-labile protons by deuterons, by expression in D₂O and d7-glucose) is most commonly associated with high molecular weight and slowly tumbling systems, of approximately 30 kDa and above, and is usually applied in combination with TROSY experiments. IDPs and IDRs experience rapid rotational diffusion and therefore do not usually benefit from the application of TROSY experiments. Nevertheless, it has been reported that perdeuteration can significantly improve the quality of IDP spectra, by eliminating the ³J(HNHA) scalar coupling that otherwise increases ¹H linewidths by 6–12 Hz [67].
7. Linewidths in TITAN are denoted R2I and R2S, in the direct and indirect dimension, respectively. However, these values do not represent exact relaxation rates, but instead reflect approximate combinations of in-phase and anti-phase relaxation rates, as well as deviations from the constant ³J(HNHA) scalar coupling imposed on all residues (defined in the pulse program setup, Fig. 5). Therefore, in practical terms, it is not particularly helpful to set initial values for these parameters based on experimental relaxation measurements, and we recommend that caution should be exercised in any detailed interpretation of fitted values.
8. Our recommendations for the optimal shape and selection of ROIs are illustrated in Fig. 11. ROIs should extend approximately two to three linewidths from the center of resonances, in order that linewidths may be accurately determined (Fig. 11a). We also recommend that all ROIs for a given residue should cover the full region of the spectrum in which its resonances are observed (Fig. 11b), because the absence of resonances in such empty regions may in fact represent additional experimental restraints. However, overly large ROIs should be avoided (Fig. 11c), as these will only result in increased noise, slower fitting calculations, and potentially inadvertent overlaps with adjacent resonances. Conversely, tightly cropped ROIs (Fig. 11d) will reduce the accuracy of fitted linewidths and should be avoided.
9. As fitting within TITAN overwrites previous results and redefines the starting point for subsequent fits, there is a risk of becoming trapped in a local minimum when appending new spin systems and ROIs to an existing fit. We therefore recommend saving TITAN sessions before performing fits and


[Fig. 11 plots: four panels, (a) to (d), each showing an example ROI boundary drawn around the resonances of one residue across the titration. Panel (a) is marked as recommended; panels (b) to (d) are marked as not recommended.]

Fig. 11 Optimal selection of ROIs. (a) Recommended setup: ROIs extend approximately two to three linewidths from the center of resonances and contain the entire region of the spectrum within which resonances are observed across the titration. Note that ROIs are identical for all spectra; hence only a single boundary can be observed. (b) Not recommended: individual ROIs do not encircle the entire region within which resonances are observed across the titration. (c) Not recommended: too large a selection, resulting in slow fitting, increased noise, and overlap with an adjacent residue. (d) Not recommended: too tight a selection, limiting accuracy when fitting linewidths

reloading these sessions before defining and fitting additional spin systems.
10. Dimers are represented within TITAN using two separate spin states, i.e., the equilibrium 2M ⇌ D is represented internally as 2M ⇌ D1 + D2. This allows for the possibility that the dimer might be asymmetric [68]. To perform calculations for a symmetric dimer, the chemical shifts and linewidths of the states D1 and D2 should be linked. This can be done by associating the parameters with a number of available shared variables using the lower panel of the spin system editor (Fig. 12).
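The bookkeeping behind Note 10 can be sketched as follows. The population calculation assumes the convention Kd = [M]²/[D] with total monomer concentration [M] + 2[D], which may differ by a numerical factor from the definition used internally by TITAN; the parameter keys such as `dI_D1` and both function names are hypothetical.

```python
import math


def dimer_populations(m_total: float, kd: float) -> dict:
    """Spin-state populations for 2M <-> D1 + D2, assuming Kd = [M]^2 / [D]."""
    # Mass balance m_total = [M] + 2[D] gives 2[M]^2/Kd + [M] - m_total = 0.
    m_free = kd * (math.sqrt(1.0 + 8.0 * m_total / kd) - 1.0) / 4.0
    dimer = (m_total - m_free) / 2.0
    # Each dimer contributes one spin in state D1 and one in state D2.
    return {"M": m_free / m_total, "D1": dimer / m_total, "D2": dimer / m_total}


def link_symmetric(params: dict) -> dict:
    """Copy the D1 values onto D2, mimicking shared variables for a symmetric dimer."""
    linked = dict(params)
    for quantity in ("dI", "dS", "R2I", "R2S"):
        linked[f"{quantity}_D2"] = linked[f"{quantity}_D1"]
    return linked


pops = dimer_populations(m_total=3.0, kd=1.0)
params = link_symmetric({
    "dI_D1": 7.7409, "dS_D1": 126.6463, "R2I_D1": 20.0, "R2S_D1": 20.0,
    "dI_D2": 7.80, "dS_D2": 126.9, "R2I_D2": 25.0, "R2S_D2": 25.0,
})
print(pops)
print(params["dI_D2"], params["dS_D2"])
```

For an asymmetric dimer the linking step is simply omitted, so that the D1 and D2 states keep independent chemical shifts and linewidths while still sharing the same population.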


[Fig. 12 screenshot: spin system editor for “spin 1”. In the lower panel, each quantity has three rows (one per state) listing the minimum, maximum, fit setting, and value. Direct chemical shift (limits 5.9962 to 11.9987): “yes”, 7.7590; “shared 1”, 7.7409; “shared 1”, 7.7409. Indirect chemical shift (limits 104.0170 to 133.9596): “yes”, 127.0292; “shared 2”, 126.6463; “shared 2”, 126.6463. Direct linewidth (limits 1 to 1000): “yes”, 20; “shared 3”, 20; “shared 3”, 20. Indirect linewidth (limits 1 to 1000): “yes”, 20; “shared 4”, 20; “shared 4”, 20. Buttons: Add spin, OK.]

Fig. 12 Setting up shared parameters within the spin system editor. The example shown here is a symmetric dimer, defined by linking all properties of the asymmetric dimer states D1 and D2. For each of the D1 and D2 states, the direct and indirect chemical shifts, dI and dS, and the direct and indirect linewidths, R2I and R2S, are assigned to the global parameters “shared 1” to “shared 4” as indicated

Acknowledgments
This work was supported by a Wellcome Trust Investigator Award (to J.C., 206409/Z/17/Z).

References
1. Tompa P, Schad E, Tantos A et al (2015) Intrinsically disordered proteins: emerging interaction specialists. Curr Opin Struct Biol 35:49–59
2. Wright PE, Dyson HJ (2015) Intrinsically disordered proteins in cellular signalling and regulation. Nat Rev Mol Cell Biol 16:18–29
3. Uversky VN (2014) Wrecked regulation of intrinsically disordered proteins in diseases: pathogenicity of deregulated regulators. Front Mol Biosci 1:6
4. Xue B, Dunker AK, Uversky VN (2012) Orderly order in protein intrinsic disorder distribution: disorder in 3500 proteomes from viruses and the three domains of life. J Biomol Struct Dyn 30:137–149
5. Xue B, Blocquel D, Habchi J et al (2014) Structural disorder in viral proteins. Chem Rev 114:6880–6911
6. Milles S, Jensen MR, Lazert C et al (2018) An ultraweak interaction in the intrinsically disordered replication machinery is essential for measles virus function. Sci Adv 4:eaat7778
7. Uversky VN (2015) Intrinsically disordered proteins and their (disordered) proteomes in neurodegenerative disorders. Front Aging Neurosci 7:18
8. Chiti F, Dobson CM (2017) Protein misfolding, amyloid formation, and human disease: a summary of progress over the last decade. Annu Rev Biochem 86:27–68
9. Dedmon MM, Lindorff-Larsen K, Christodoulou J et al (2005) Mapping long-range interactions in alpha-synuclein using spin-label NMR and ensemble molecular dynamics simulations. J Am Chem Soc 127:476–477
10. Camilloni C, De Simone A, Vranken WF et al (2012) Determination of secondary structure populations in disordered states of proteins using nuclear magnetic resonance chemical shifts. Biochemistry 51:2224–2231
11. Waudby CA, Camilloni C, Fitzpatrick AWP et al (2013) In-cell NMR characterization of the secondary structure populations of a disordered conformation of α-Synuclein within E. coli cells. PLoS One 8:e72286
12. Theillet F-X, Binolfi A, Bekei B et al (2016) Structural disorder of monomeric α-synuclein persists in mammalian cells. Nature 530:45
13. Mylona A, Theillet F-X, Foster C et al (2016) Opposing effects of Elk-1 multisite phosphorylation shape its response to ERK activation. Science 354:233–237
14. Bah A, Vernon RM, Siddiqui Z et al (2015) Folding of an intrinsically disordered protein by phosphorylation as a regulatory switch. Nature 519(7541):106–109
15. Milles S, Mercadante D, Aramburu IV et al (2015) Plasticity of an ultrafast interaction between nucleoporins and nuclear transport receptors. Cell 163:734–745
16. Saio T, Guan X, Rossi P et al (2014) Structural basis for protein antiaggregation activity of the trigger factor chaperone. Science 344:597
17. Huang C, Rossi P, Saio T et al (2016) Structural basis for the antifolding activity of a molecular chaperone. Nature 537:202–206
18. Libich DS, Fawzi NL, Ying J et al (2013) Probing the transient dark state of substrate binding to GroEL by relaxation-based solution NMR. Proc Natl Acad Sci U S A 110:11361
19. Borgia A, Borgia MB, Bugge K et al (2018) Extreme disorder in an ultrahigh-affinity protein complex. Nature 555:61–66
20. Fawzi NL, Ying J, Ghirlando R et al (2011) Atomic-resolution dynamics on the surface of amyloid-β protofibrils probed by solution NMR. Nature 480:268–272
21. Wang A, Conicella AE, Schmidt HB et al (2018) A single N-terminal phosphomimic disrupts TDP-43 polymerization, phase separation, and RNA splicing. EMBO J 37:e97452
22. Jeener J, Meier BH, Bachmann P et al (1979) Investigation of exchange processes by two-dimensional NMR spectroscopy. J Chem Phys 71:4546–4553
23. Wagner G, Bodenhausen G, Müller N et al (1985) Exchange of two-spin order in nuclear magnetic resonance: separation of exchange and cross-relaxation processes. J Am Chem Soc 107:6440–6446
24. Farrow NA, Zhang O, Forman-Kay JD et al (1994) A heteronuclear correlation experiment for simultaneous determination of 15N longitudinal decay and chemical exchange rates of systems in slow equilibrium. J Biomol NMR 4:727–734
25. Loria JP, Rance M, Palmer AG (1999) A relaxation-compensated Carr-Purcell-Meiboom-Gill sequence for characterizing chemical exchange by NMR spectroscopy. J Am Chem Soc 121:2331–2332
26. Mulder FA, Mittermaier A, Hon B et al (2001) Studying excited states of proteins by NMR spectroscopy. Nat Struct Biol 8:932–935
27. Korzhnev DM, Orekhov VY, Kay LE (2005) Off-resonance R(1rho) NMR studies of exchange dynamics in proteins with low spin-lock fields: an application to a Fyn SH3 domain. J Am Chem Soc 127:713–721
28. Massi F, Peng JW (2018) Characterizing protein dynamics with NMR R1ρ relaxation experiments. Methods Mol Biol 1688:205–221
29. Gopalan AB, Hansen DF, Vallurupalli P (2018) CPMG experiments for protein minor conformer structure determination. Methods Mol Biol 1688:223–242
30. Vallurupalli P, Bouvignies G, Kay LE (2012) Studying “invisible” excited protein states in slow exchange with a major state conformation. J Am Chem Soc 134:8148–8161
31. Fawzi NL, Ying J, Torchia DA et al (2012) Probing exchange kinetics and atomic resolution dynamics in high-molecular-weight complexes using dark-state exchange saturation transfer NMR spectroscopy. Nat Protoc 7:1523–1533
32. Kovermann M, Rogne P, Wolf-Watz M (2016) Protein dynamics and function from solution state NMR spectroscopy. Q Rev Biophys 49:11348
33. Arai M, Ferreon JC, Wright PE (2012) Quantitative analysis of multisite protein-ligand interactions by NMR: binding of intrinsically disordered p53 transactivation subdomains with the TAZ2 domain of CBP. J Am Chem Soc 134:3792–3803
34. Karki I, Christen MT, Spiriti J et al (2016) Entire-dataset analysis of NMR fast-exchange titration spectra: a Mg2+ titration analysis for HIV-1 ribonuclease H domain. J Phys Chem B 120:12420–12431
35. Williamson MP (2013) Using chemical shift perturbation to characterise ligand binding. Prog Nucl Magn Reson Spectrosc 73:1–16
36. McConnell HM (1958) Reaction rates by nuclear magnetic resonance. J Chem Phys 28:430–431
37. Binsch G (1969) Unified theory of exchange effects on nuclear magnetic resonance line shapes. J Am Chem Soc 91:1304–1309
38. Greenwood AI, Rogals MJ, De S et al (2011) Complete determination of the Pin1 catalytic domain thermodynamic cycle by NMR lineshape analysis. J Biomol NMR 51:21–34
39. Günther UL, Schaffhausen B (2002) NMRKIN: simulating line shapes from two-dimensional spectra of proteins upon ligand binding. J Biomol NMR 22:201–209
40. Waudby CA, Ramos A, Cabrita LD et al (2016) Two-dimensional NMR lineshape analysis. Sci Rep 6:24826
41. Waudby CA, Frenkiel T, Christodoulou J (2019) Cross-peaks in simple 2D NMR experiments from chemical exchange of transverse magnetization. Angew Chem Int Ed Engl 58(26):8784–8788
42. Helgstrand M, Härd T, Allard P (2000) Simulations of NMR pulse sequences during equilibrium and non-equilibrium chemical exchange. J Biomol NMR 18:49–63
43. Renschler FA, Bruekner SR, Salomon PL et al (2018) Structural basis for the interaction between the cell polarity proteins Par3 and Par6. Sci Signal 11:eaam9899
44. McShan AC, Natarajan K, Kumirov VK et al (2018) Peptide exchange on MHC-I by TAPBPR is driven by a negative allostery release cycle. Nat Chem Biol 14:811–820
45. Corbeski I, Dolinar K, Wienk H et al (2018) DNA repair factor APLF acts as a H2A-H2B histone chaperone through binding its DNA interaction surface. Nucleic Acids Res 46:7138–7152
46. Acevedo LA, Kwon J, Nicholson LK (2019) Quantification of reaction cycle parameters for an essential molecular switch in an auxin-responsive transcription circuit in rice. Proc Natl Acad Sci U S A 116:2589–2594
47. Lian LY, Middleton DA (2001) Labelling approaches for protein structural studies by solution-state and solid-state NMR. Prog Nucl Magn Reson Spectrosc 39:171–190
48. Ohki SY, Kainosho M (2008) Stable isotope labeling methods for protein NMR spectroscopy. Prog Nucl Magn Reson Spectrosc 53:208–226
49. Azatian SB, Kaur N, Latham MP (2019) Increasing the buffering capacity of minimal media leads to higher protein yield. J Biomol NMR 73:11–17
50. Frueh DP (2014) Practical aspects of NMR signal assignment in larger and challenging proteins. Prog Nucl Magn Reson Spectrosc 78:47–75
51. Sattler M, Schleucher J (1999) Heteronuclear multidimensional NMR experiments for the structure determination of proteins in solution employing pulsed field gradients. Prog Nucl Magn Reson Spectrosc 34:93–158
52. Wishart DS, Bigam CG, Yao J et al (1995) 1H, 13C and 15N chemical shift referencing in biomolecular NMR. J Biomol NMR 6:135–140
53. Laurents DV, Gorman PM, Guo M et al (2005) Alzheimer’s Abeta40 studied by NMR at low pH reveals that sodium 4,4-dimethyl-4-silapentane-1-sulfonate (DSS) binds and promotes beta-ball oligomerization. J Biol Chem 280:3675–3685
54. Morash B, Sarker M, Rainey JK (2018) Concentration-dependent changes to diffusion and chemical shift of internal standard molecules in aqueous and micellar solutions. J Biomol NMR 71:79–89
55. Delaglio F, Grzesiek S, Vuister GW et al (1995) NMRPipe: a multidimensional spectral processing system based on UNIX pipes. J Biomol NMR 6:277–293
56. Boyce SE, Tellinghuisen J, BioRxiv JC, et al. Avoiding accuracy-limiting pitfalls in the study of protein-ligand interactions with isothermal titration calorimetry. biorxiv.org
57. Contreras-Martos S, Nguyen HH, Nguyen PN et al (2018) Quantification of intrinsically disordered proteins: a problem not fully appreciated. Front Mol Biosci 5:83
58. Anthis NJ, Clore GM (2013) Sequence-specific determination of protein and peptide concentrations by absorbance at 205 nm. Protein Sci 22:851–858
59. Schanda P, Brutscher B (2005) Very fast two-dimensional NMR spectroscopy for real-time investigation of dynamic events in proteins on the time scale of seconds. J Am Chem Soc 127:8014–8015
60. Schanda P, Kupce E, Brutscher B (2005) SOFAST-HMQC experiments for recording two-dimensional heteronuclear correlation spectra of proteins within a few seconds. J Biomol NMR 33:199–211
61. Mori S, Abeygunawardana C, Johnson MO et al (1995) Improved sensitivity of HSQC spectra of exchanging protons at short interscan delays using a new fast HSQC (FHSQC) detection scheme that avoids water saturation. J Magn Reson B 108:94–98
62. Vranken WF, Boucher W, Stevens TJ et al (2005) The CCPN data model for NMR spectroscopy: development of a software pipeline. Proteins 59:687–696
63. Lee W, Tonelli M, Markley JL (2015) NMRFAM-SPARKY: enhanced software for biomolecular NMR spectroscopy. Bioinformatics 31:1325–1327
64. Bermel W, Bertini I, Felli IC et al (2006) 13C-detected protonless NMR spectroscopy of proteins in solution. Prog Nucl Magn Reson Spectrosc 48:25–45
65. Bastidas M, Gibbs EB, Sahu D et al (2015) A primer for carbon-detected NMR applications to intrinsically disordered proteins in solution. Concepts Magn Reson Part A 44:54–66
66. De Genst EJ, Guilliams T, Wellens J et al (2010) Structure and properties of a complex of α-synuclein and a single-domain camelid antibody. J Mol Biol 402:326–343
67. Maltsev AS, Ying J, Bax A (2012) Deuterium isotope shifts for backbone (1)H, (15)N and (13)C nuclei in intrinsically disordered protein α-synuclein. J Biomol NMR 54:181–191
68. Sekhar A, Bain AD, Rumfeldt JAO et al (2015) Evolution of magnetization due to asymmetric dimerization: theoretical considerations and application to aberrant oligomers formed by apoSOD1(2SH). In: Physical chemistry chemical physics: PCCP

Open Access This chapter is licensed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license and indicate if changes were made. The images or other third party material in this chapter are included in the chapter’s Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the chapter’s Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder.

Chapter 25

Measuring Effective Concentrations Enforced by Intrinsically Disordered Linkers

Charlotte S. Sørensen and Magnus Kjaergaard

Abstract

Intrinsically disordered linkers control avidity, auto-inhibition, catalysis, and liquid-liquid phase separation in multidomain proteins. Linkers enforce effective concentrations that directly affect the kinetics and equilibrium positions of intramolecular reactions. Mechanistic understanding of the role of linkers thus requires measurements of the effective concentrations in supramolecular complexes. Here, we describe an experimental protocol for measuring the effective concentrations enforced by a linker using a competition assay. The experiment uses a FRET biosensor that is titrated by a competitor peptide. The assay is designed for parallel analysis of several constructs in a fluorescent plate reader and has been used to study hundreds of synthetic disordered linkers.

Key words Effective concentration, Intrinsically disordered protein (IDP), Linker, Fluorescent biosensor, Polymer physics

1 Introduction

Intrinsically disordered linkers are common in multidomain proteins, where they provide adaptable connections between domains. By joining domains of different functions, linkers give multidomain proteins functional properties that are not merely the sum of their parts. Linkers allow biochemical reactions such as binding and catalysis to occur within a single physically connected unit, and they are thus mechanistically similar to intramolecular reactions. Crucially, such intra-complex reactions are not concentration dependent but instead depend on the connections between the interacting domains. The kinetics and equilibria of intramolecular reactions are governed by effective concentrations that are often orders of magnitude above the total concentration [1, 2]. Effective concentrations play an important role in, e.g., avidity of multivalent interactions [1, 3], enzyme-targeting [4], auto-inhibition [5], or liquid-liquid phase separation [6]. To understand the functions of

Birthe B. Kragelund and Karen Skriver (eds.), Intrinsically Disordered Proteins: Methods and Protocols, Methods in Molecular Biology, vol. 2141, https://doi.org/10.1007/978-1-0716-0524-0_25, © Springer Science+Business Media, LLC, part of Springer Nature 2020


linkers quantitatively, it is thus necessary to understand how the properties of the linker determine the effective concentration.

Effective concentrations are easy to measure in principle but difficult in practice. They are therefore typically estimated based on theoretical descriptions of homopolymers [2, 7–9]. Intrinsically disordered linkers are, however, not homopolymers, and their properties depend strongly on the specific sequence of the linker [10]. To understand how linkers fine-tune biochemical equilibria, there is thus no substitute for experimental measurements of the effective concentration. Effective concentrations can be measured in competition experiments, where increasing concentrations of free ligand displace a tethered interaction [11]. When the free and tethered ligands are identical, the halfway point of the displacement occurs when the concentration of the free ligand equals the effective concentration of the tethered ligand. However, displacement of an intramolecular interaction rarely gives rise to an observable signal, and full displacement occurs at concentrations above the solubility of most proteins. Until recently, experimental determination of effective concentrations was thus limited to organic model systems, where fluorescent competing ligands were available [11].

Effective concentrations are often independent of what is being linked and are mainly a property of the linker [1, 12]. This suggests that the effective concentration enforced by a given linker can be measured in a convenient model system and then extrapolated back to its natural context. We recently developed a biosensor to facilitate measurement of the effective concentration (Fig. 1a) [10]. The biosensor consists of two interaction domains that form an antiparallel heterodimeric coiled coil (Fig. 1b) [13]. The two domains are connected by the linker whose effective concentration is to be measured. Furthermore, the biosensor contains two fluorescent protein domains that can undergo Förster resonance energy transfer (FRET) [14]. The FRET efficiency drops when the intramolecular interaction is outcompeted by a free peptide (Fig. 2a). The biosensor contains two purification tags, which allow for purification by standard methods regardless of the linker sequence. Compared to the wild-type complex, the interaction pair in the biosensor has a mutation that weakens the affinity 30-fold (Fig. 2b). This allows the wild-type competitor to outcompete the interaction at 30-fold lower concentrations and thus allows measurement of effective concentrations in the millimolar range. The system is designed for parallel analysis of many linkers and enabled us to probe how the linker sequence properties determine its compaction and thus the ensuing effective concentrations (Fig. 2c, d). Direct measurement of effective concentrations can thus also be used to probe the relationship between sequence and structure in intrinsically disordered proteins [10].
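The relationship between the titration midpoint, the 30-fold affinity correction, and the linker length can be made concrete with a minimal Python sketch. It assumes the simple competition picture described above (no ligand depletion) and uses the glycine-serine scaling law shown in Fig. 2d (Ceff = 330 mM × N^-1.46). The function names are hypothetical and serve only as an illustration; the scaling law should only be expected to hold for linkers that behave like the homopolymer-like linkers from which it was derived.

```python
# Illustrative helpers; names are hypothetical and not part of the published analysis.

AFFINITY_CORRECTION = 30.0  # affinity ratio of WT vs. V227A MBD2 (Fig. 2b)


def fraction_displaced(c_free_uM, c_eff_app_uM):
    """Fraction of biosensor opened by free competitor (ligand depletion ignored)."""
    return c_free_uM / (c_free_uM + c_eff_app_uM)


def true_c_eff_uM(c_eff_app_uM):
    """Correct the apparent effective concentration for the weaker tethered mutant."""
    return AFFINITY_CORRECTION * c_eff_app_uM


def gs_linker_c_eff_uM(n_residues, prefactor_mM=330.0, exponent=-1.46):
    """Scaling law from Fig. 2d for glycine-serine linkers; result in micromolar."""
    return prefactor_mM * 1000.0 * n_residues ** exponent


if __name__ == "__main__":
    for n in (20, 30, 40, 60, 120):
        c_eff = gs_linker_c_eff_uM(n)
        c_app = c_eff / AFFINITY_CORRECTION
        print(f"N = {n:3d}: Ceff ~ {c_eff:6.0f} uM, "
              f"expected titration midpoint ~ {c_app:4.0f} uM")
```

Such an estimate is mainly useful for checking in advance that the expected titration midpoint lies within the range of peptide concentrations that can be reached.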


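[Fig. 1 graphic omitted. Panel (a), "Biosensor for determination of effective concentration": map of the biosensor gene in the pET15b backbone with the elements p66α, MBD2 (V227A, marked ∗), mClover3, mRuby3, linker, 6xHis, and Strep-tag, and the unique restriction sites NdeI, AgeI, NheI, KpnI, SpeI, and XhoI. Panel (b): coiled coil formed by p66α and MBD2 with N- and C-termini labeled. Panel (c): SDS-PAGE gel (lane label GS20) with molecular weight markers at 116, 66.2, 45, 35, 25, 18.4, and 14.4 kDa and proteolytic fragments indicated.]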

Fig. 1 Overview of the biosensor used to measure effective concentrations. (a) The biosensor gene is embedded in a standard E. coli expression vector and contains a FRET pair and two interacting domains (MBD2 and p66α) connected by a variable linker. The biosensor contains a mutant MBD2 domain (∗) with weaker binding to p66α to allow free wild-type competitor to compete more efficiently. For convenience, the construct contains several unique restriction sites to allow future modification. (b) The interaction pair forms an antiparallel dimeric coiled coil, which is ideal as it places the N-terminus of one domain in close proximity to the C-terminus of the other. (c) The protein can be purified using the two affinity tags. Internal cleavage in mRuby3 results in proteolytic fragments at lower molecular weight that do not disrupt the measurement. Figure adapted from [10]

2 Materials

2.1 DNA Constructs

Plasmid DNA of biosensor (Addgene #132720).

2.2 Expression and Purification of Biosensor

1. Chemically competent E. coli BL21(DE3).
2. Water bath for heat shock, aerated plastic tubes (~13 mL) for starter cultures.
3. 500 mL baffled flasks for bacterial growth.
4. Shaking incubator.
5. Centrifuge for 50 mL tubes.
6. 1000× ampicillin stock: Prepare at 100 mg/mL in MQ water.
7. LB medium: Mix 10 g/L peptone, 5 g/L yeast extract, and 10 g/L NaCl and autoclave. The autoclaved medium can be prepared in advance.
8. LB agar plates with 100 μg/mL ampicillin.


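[Fig. 2 graphic omitted. Panel (a): schematic of the competition between the tethered V227A MBD2 (∗) and free WT MBD2 peptide. Panel (b): apparent FRET, IA/(IA + ID), versus MBD2 concentration (μM, logarithmic axis from 10^-1 to 10^3) for titration with WT and V227A peptides; the curves are offset by a factor of 30 in apparent Ceff. Panel (c): IA/(IA + ID) versus MBD2 concentration (μM) for linker lengths of 20, 30, 40, 60, and 120 residues. Panel (d): Ceff (μM) versus linker length (aa) on logarithmic axes with the fit Ceff = 330 mM × N^-1.46.]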

Fig. 2 Measurement of effective concentrations. (a) The biosensor is constructed to allow binding between MBD2 and p66α, which brings the fluorescent proteins close together and allows FRET. Free MBD2 peptide can displace the intramolecular ligand at sufficiently high peptide concentration, resulting in decreased FRET. The fusion protein contains a mutation in the MBD2 domain that weakens the interaction with p66α and allows wild-type MBD2 peptide to compete more efficiently. (b) A correction factor of 30, corresponding to the difference in affinity due to the V227A mutation, was determined by titration of the same construct with both WT and V227A peptides [10]. (c) Example of titration series for glycine-serine linkers of variable length. (d) For linker sequences that can be approximated by a homopolymer, the effective concentration can be described via a polymer-scaling law where the scaling coefficient reports on the compactness of the linker. Figure adapted from [10]

9. ZYM-5052 auto-induction medium [15]: The amount of medium depends on the number of constructs analyzed. Usually we prepare 1–2 L at a time and split into smaller bottles before autoclaving. Mix 10 g/L peptone, 5 g/L yeast extract, 25 mM Na2HPO4, 25 mM KH2PO4, 50 mM NH4Cl, 5 mM Na2SO4, 2 mM MgSO4, and trace metals from the stock described below. Adjust pH to 7.5 by addition of NaOH and autoclave. Immediately before use, add ampicillin to a final concentration of 100 μg/mL and carbon sources to a final concentration of 5 g/L glycerol, 0.5 g/L glucose, and 0.2 g/L lactose from a 20× stock. It is recommended that the sugars are sterile filtered rather than autoclaved.
10. 1000× trace metal solution: 50 mM FeCl3, 20 mM CaCl2, 10 mM each of MnCl2 and ZnSO4, and 2 mM each of CoCl2, CuCl2, NiCl2, Na2MoO4, Na2SeO3, and H3BO3 dissolved in 60 mM HCl [15] (see Note 1).


11. B-PER bacterial protein extraction kit (see Note 2). Contents: B-PER bacterial protein extraction reagent, lysozyme stock (50 mg/mL), and DNase I (2500 U/mL).
12. Protease inhibitor stocks: Pepstatin (1 g/L in ethanol), chymostatin (1 g/L in dimethyl sulfoxide), and leupeptin (1 g/L in MQ water). Store at −20 °C.
13. Filter paper, e.g., Whatman 113 V, and matching plastic funnel.
14. Empty gravity flow column, 20 mL.
15. Ni-Sepharose 6 Fast Flow slurry.
16. Ni-binding buffer: 20 mM Na2HPO4/NaH2PO4 pH 7.4, 0.5 M NaCl, 20 mM imidazole.
17. Ni-elution buffer: 20 mM Na2HPO4/NaH2PO4 pH 7.4, 0.5 M NaCl, 0.5 M imidazole.
18. 20% ethanol (v/v).
19. Strep-Tactin XT Superflow columns, 1 mL.
20. Strep-Tactin binding buffer: 100 mM Tris pH 8, 150 mM NaCl, 1 mM EDTA.
21. Strep-Tactin elution buffer: Dilute 10× BXT elution buffer (IBA Life Sciences) in 9 parts of MQ water.
22. 0.5 M NaOH.
23. TBS buffer: 20 mM Tris pH 7.4, 150 mM NaCl.
24. Dialysis tube with small diameter, MWCO 12–14 kDa.
25. Sealing clips for closing dialysis tubes.
26. Liquid nitrogen for flash freezing.
27. Drigalski spatula.

2.3 Expression and Purification of Competitor Peptide

1. Plasmid DNA for expression of the competitor peptide (Addgene #132721).
2. Supplies for E. coli transformation and growth as described in Subheading 2.2, items 1–7, plus equipment for larger-scale culture: 5 L baffled flasks and a centrifuge for 1 L bottles.
3. Ni-binding buffer (see Subheading 2.2, item 16).
4. Stepwise Ni-elution buffers: 20 mM Na2HPO4/NaH2PO4 pH 7.4, 0.5 M NaCl, containing 40, 60, 80, 100, 200, and 500 mM imidazole (see Note 3).
5. Thermomixer for 15 mL centrifuge tubes or water bath.
6. Supplies for Ni-affinity chromatography described in Subheading 2.2, items 13–15.
7. TBS buffer (see Subheading 2.2, item 23).
8. Dialysis tube with a molecular weight cutoff of 3.5 kDa.


9. Large (>5 L) container for dialysis.
10. Centrifugal concentrator: Centriprep Ultracel YM-3 (Millipore) 3 kDa NMWL Cat. #4302 (see Note 4).

2.4 Measurement of Effective Concentration

1. 96-well deep well plate.
2. 96-well plate.
3. Black 384-well plate.
4. Bovine serum albumin (BSA): Dissolve at 11 mg/mL in TBS.
5. Master mix for each biosensor analyzed: 0.55 μM biosensor, 5.5 mg/mL BSA in TBS.
6. Competitor peptide stock in TBS (~2.3 mM) from Subheading 3.3, step 14.
7. 8-channel pipette P20 (1–20 μL) (ideally two).
8. Fluorescence plate reader.
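The master mix volumes follow from simple dilution arithmetic, which can be sketched as below. The sketch assumes a biosensor stock of about 20 μM (the typical yield stated in Subheading 3.2.4) and the mixing of 2 μL master mix with 9 μL competitor peptide described in Subheading 3.4; the function names are hypothetical.

```python
# Dilution arithmetic for the master mix; helper names are illustrative only.

def master_mix_volumes(total_uL, biosensor_stock_uM,
                       biosensor_target_uM=0.55,
                       bsa_stock_mg_mL=11.0, bsa_target_mg_mL=5.5):
    """Return volumes (uL) of biosensor stock, BSA stock, and TBS for one master mix."""
    v_biosensor = total_uL * biosensor_target_uM / biosensor_stock_uM
    v_bsa = total_uL * bsa_target_mg_mL / bsa_stock_mg_mL
    v_tbs = total_uL - v_biosensor - v_bsa
    if v_tbs < 0:
        raise ValueError("Biosensor stock is too dilute for this master mix.")
    return {"biosensor": v_biosensor, "BSA": v_bsa, "TBS": v_tbs}


def final_in_well(concentration, added_uL, total_uL=11.0):
    """Concentration after mixing in the well (2 uL master mix + 9 uL peptide)."""
    return concentration * added_uL / total_uL


if __name__ == "__main__":
    print(master_mix_volumes(200.0, biosensor_stock_uM=20.0))
    print("biosensor in well (uM):", final_in_well(0.55, 2.0))
    print("BSA in well (mg/mL):", final_in_well(5.5, 2.0))
    print("peptide in well (uM) from 2350 uM stock:", final_in_well(2350.0, 9.0))
```

With these inputs the biosensor concentration in the well is 0.1 μM, which is the value of P used in the data analysis.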

3 Methods

3.1 Cloning of Linker Sequence into Reporter Plasmid

The DNA sequence corresponding to the linker that will be analyzed has to be cloned into the reporter construct between the unique NheI and KpnI sites. We find that the constructs are most cost-efficiently produced by commercial DNA synthesis companies, and the linker can typically be delivered directly in the reporter vector. The vector can be made available directly at GenScript upon request. Alternatively, the constructs can be prepared by standard DNA ligation cloning techniques [16].

3.2 Expression and Purification of Fluorescent Biosensor

3.2.1 Biosensor Expression

1. Transform the biosensor plasmid into chemically competent bacteria: Add 1 μL DNA stock to 50 μL chemically competent E. coli BL21(DE3). Incubate on ice for 15 min. Heat shock for 45 s at 42 °C using a water bath. Incubate the cells on ice for 5 min. Add 250 μL LB medium, and allow the cells to recover for 1 h at 37 °C in a shaking incubator. Add 100 μL of the cell suspension to an LB ampicillin agar plate and spread with a sterile Drigalski spatula. Incubate the plates upside down overnight at 37 °C.
2. Starter culture: Pick a colony from each plate using an inoculation loop, and transfer it to 5 mL LB with ampicillin. The cells should be cultured in a disposable tube with a loose-fitting lid allowing oxygenation. Incubate for 6 h at 37 °C in the shaking incubator at 120 rpm.
3. Add 50 mL ZYM-5052 with ampicillin to the baffled flask. Start the culture by addition of 1 mL starter culture. Incubate the cultures while shaking at 120 rpm in a shaking incubator at 37 °C for 2–3 h before reducing the temperature to 18 °C.


Typical shaking incubators allow up to 24 cultures to be expressed in parallel. Expression at higher temperatures can lead to insoluble protein. The cultures should be incubated until they have changed color to a bright orange, which usually occurs after 48–72 h (see Note 5).
4. Transfer the culture to a 50 mL centrifuge tube, and harvest the cells by centrifugation (15 min, 6000 × g). Discard the supernatant. Purification can be started immediately, or the bacterial pellets can be flash frozen in liquid nitrogen and stored at −20 °C for later purification.

3.2.2 Generation of Lysate

1. We typically process eight constructs in parallel as this matches the capacity of our centrifuge. Add 90 μL of each of the lysozyme and DNase stocks to 45 mL B-PER bacterial protein extraction reagent. Add 45 μL of each of the stocks of pepstatin, chymostatin, and leupeptin (see Note 6).
2. Resuspend each bacterial pellet in 5 mL of the supplemented bacterial protein extraction reagent using a disposable pipette. Incubate for 15 min at room temperature. Use the remaining lysing solution to balance the tubes.
3. Centrifuge the lysate for 15 min at 14,000 × g. Filter the supernatant through paper filter and funnel. At this point, the lysate can be flash frozen and stored at −20 °C for later purification.
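When a different number of constructs is processed, the lysis solution can be scaled proportionally. The sketch below assumes the ratios given in step 1 (2 μL of lysozyme stock and of DNase I stock, and 1 μL of each protease inhibitor stock, per mL of extraction reagent) and 5 mL of spare solution for balancing; the function name and the size of the spare volume are assumptions made for illustration.

```python
# Proportional scaling of the lysis solution; helper name is illustrative only.

def lysis_mix(n_constructs, mL_per_pellet=5.0, spare_mL=5.0):
    """Volumes needed to lyse n pellets, keeping the ratios used for 45 mL reagent."""
    reagent_mL = n_constructs * mL_per_pellet + spare_mL
    return {
        "B-PER reagent (mL)": reagent_mL,
        "lysozyme stock (uL)": 2.0 * reagent_mL,      # 90 uL per 45 mL
        "DNase I stock (uL)": 2.0 * reagent_mL,       # 90 uL per 45 mL
        "each protease inhibitor stock (uL)": 1.0 * reagent_mL,  # 45 uL per 45 mL
    }


if __name__ == "__main__":
    for name, volume in lysis_mix(8).items():
        print(f"{name}: {volume:g}")
```

For eight constructs this reproduces the volumes given in step 1.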

3.2.3 Biosensor Purification by Ni-Affinity Chromatography

1. Pierce a small hole in the cap of a 50 mL centrifuge tube, such that the hole snugly fits the tip of the gravity flow column. Insert the column into the hole. This system allows adjustment of the flow rate via the lid of the centrifuge tube (see Note 7).
2. Add 2 mL Ni-Sepharose slurry for a final column volume of 1 mL. Wash the column with 7 mL MQ water and equilibrate using 7 mL of Ni-binding buffer.
3. Load the lysate onto the column. If the lysate has been frozen, it should be filtered again. Ensure a flow rate of about 1 mL/min by adjusting the tightness of the centrifuge tube lid. The column should become bright red.
4. Wash the column with 7 mL Ni-binding buffer. If the column is still brightly colored, the flow-through and wash can be discarded.
5. Elute the fusion protein by addition of 5 mL Ni-elution buffer. Collect the eluate for further purification.
6. Clean the Ni-Sepharose column with 10 mL Ni-elution buffer or until all color has disappeared. Wash with 7 mL of MQ water and subsequently 10 mL of 20% ethanol. Store the column at 4 °C


with 20% ethanol solution on top. The column can be reused but should be discarded when the flow rate starts to decrease.

3.2.4 Biosensor Purification by Strep-Tag Affinity Chromatography

1. Mount the Strep-Tactin XT column in a 50 mL centrifuge tube as described in Subheading 3.2.3, step 1. Equilibrate the column with 5 mL Strep-Tactin binding buffer.
2. Load half of the eluate from the Ni-affinity column onto the Strep-Tactin column. Wash with 5 mL Strep-Tactin binding buffer (see Note 8).
3. Elute proteins with 5 mL Strep-Tactin elution buffer. Collect only the most concentrated part of the eluate, which is usually between an elution volume of 0.5 and 2.5 mL.
4. Wash the Strep-Tactin column with 5 mL MQ water, followed by 5 mL of 0.5 M NaOH to wash out the bound biotin. Equilibrate with 10 mL Strep-Tactin binding buffer (or until the flow-through has pH 8), and store at 4 °C.
5. Analyze the purified protein by SDS-PAGE to ensure purity. The fusion protein preparation invariably contains proteolytic cleavage fragments at lower molecular weight due to proteolytic sensitivity of mRuby3 [10] (Fig. 1c). The fragments remain tightly bound and cannot be easily removed. The internal cleavages do not affect the subsequent assay.
6. Add the purified fusion protein to a dialysis tube with a cutoff of 12–14 kDa. Close the tube with a sealing clip and a knot at either end. The sealing clips allow the different tubes to be labelled. Dialyze against 5 L TBS at 4 °C. After 4 h, move the dialysis tubes to 5 L of fresh TBS buffer and dialyze overnight.
7. Measure protein concentration spectrophotometrically. For linkers without aromatic residues, ε(280 nm) = 52,000 M⁻¹ cm⁻¹. For linkers with aromatic residues, the linker's contribution to ε can be estimated using the ProtParam web server hosted at www.expasy.org [17]. We typically get a yield of around 2 mL of 20 μM protein.
8. Flash freeze the dialyzed eluate and store in a freezer until analysis. Avoid extensive freeze-thaw cycles.
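The concentration determination in step 7 is an application of the Beer-Lambert law, c = A / (ε × l), which can be sketched as follows. The function names are hypothetical. The per-residue extinction coefficients for tryptophan (5500 M⁻¹ cm⁻¹) and tyrosine (1490 M⁻¹ cm⁻¹) are the values commonly used for sequence-based estimates; the sketch assumes that the linker contains no disulfide bonds and that ProtParam should be consulted for the definitive value.

```python
# Beer-Lambert concentration estimate; helper names are illustrative only.

BIOSENSOR_EPSILON = 52000.0  # M^-1 cm^-1 at 280 nm, linker without aromatic residues
EPSILON_TRP = 5500.0         # M^-1 cm^-1 per tryptophan (commonly used estimate)
EPSILON_TYR = 1490.0         # M^-1 cm^-1 per tyrosine (commonly used estimate)


def epsilon_with_linker(linker_sequence, base_epsilon=BIOSENSOR_EPSILON):
    """Add the estimated contribution of Trp and Tyr residues in the linker."""
    seq = linker_sequence.upper()
    return base_epsilon + EPSILON_TRP * seq.count("W") + EPSILON_TYR * seq.count("Y")


def concentration_uM(a280, epsilon_M_cm, path_cm=1.0, dilution_factor=1.0):
    """Concentration in micromolar from the absorbance at 280 nm."""
    return a280 / (epsilon_M_cm * path_cm) * dilution_factor * 1e6


if __name__ == "__main__":
    print("biosensor (uM):", concentration_uM(1.04, BIOSENSOR_EPSILON))
    print("epsilon for a linker containing one W and one Y:",
          epsilon_with_linker("GSGSWGSGSY"))
```

An absorbance of 1.04 in a 1 cm cuvette corresponds to the typical yield of about 20 μM biosensor.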

3.3 Expression and Purification of MBD2 Peptide

The titration experiments require a high concentration of the MBD2 peptide. To get enough protein, we usually prepare 6 L of bacterial culture at a time.

1. Transform chemically competent E. coli BL21(DE3), and make a 7 mL starter culture as described in Subheading 3.2.1, steps 1 and 2.


2. Prepare six 5 L baffled flasks with 1 L ZYM-5052 auto-induction medium in each. Distribute the starter culture between the flasks, and incubate in a shaking incubator for 24 h at 37 °C (see Note 9).
3. Harvest cells by centrifugation (6000 × g, 15 min). Dispose of the supernatant. At this stage, the cells can be flash frozen and stored for later purification.
4. Resuspend the cells in 25 mL Ni-binding buffer per liter of culture medium. Shake to break up major cell clumps. Distribute the cell suspension to 15 mL centrifuge tubes.
5. Incubate the cell suspension for 25 min at 80 °C using the thermomixer, while shaking at 400 rpm, before rapidly cooling the cell suspension on ice. This will lyse the cells and precipitate most folded proteins including proteases [18], but leave the intrinsically disordered MBD2 peptide in solution.
6. Centrifuge the cell suspension at 14,000 × g for 15 min. Save the supernatant and discard the pellet.
7. Prepare two gravity flow columns with 3 mL Ni-Sepharose each, corresponding to 6 mL slurry. Wash with 15 mL MQ H2O, and equilibrate in 15 mL Ni-binding buffer (see Note 10).
8. Filter the lysate using the filter paper and funnel, and load lysate corresponding to 3 L of growth media per column. Wash the column with 15 mL Ni-binding buffer (20 mM imidazole). Control the flow rate as in Subheading 3.2.3, step 1.
9. Elute the peptide stepwise with 15 mL of each Ni-elution buffer, starting with 40 mM imidazole and continuing with 60, 80, 100, 200, and 500 mM, collecting each eluate separately. Analyze samples by SDS-PAGE, and pool fractions that show a single protein band with an apparent molecular weight of ~10 kDa. Usually the 40 mM elution will have impurities, whereas the 500 mM elution does not contain protein.
10. Clean the column with 15 mL of 500 mM imidazole. Wash with 15 mL of MQ water followed by 20 mL of 20% ethanol. Store the column at 4 °C with 5 mL of 20% ethanol solution on top.
11. Load the pooled eluate into an appropriate length of dialysis tubing (MWCO = 3.5 kDa). Dialyze for 4 h against 5 L of TBS buffer at 4 °C. Transfer the dialysis tubing to 5 L of fresh TBS buffer and dialyze overnight. This is the last stage at which the MBD2 can be frozen, although ideally freezing should be avoided.
12. Prepare the centrifugal filter by filling it with 15 mL TBS and spinning it for 10 min at 3000 × g at 4 °C. Discard the TBS.
13. Add 15 mL of the dialysate to the centrifugal filter and concentrate by spinning at 3000 × g at 4 °C. Remove the flow-through, and measure


the A280 using a spectrophotometer. Top the centrifugal filter up to 15 mL with dialysate. Repeat the procedure until all dialysate has been added. Concentrate the remaining MBD2 solution until an A280 of ~3.5 is achieved, corresponding to a peptide concentration of ~2350 μM (ε280 = 1490 M⁻¹ cm⁻¹) (see Note 11).
14. Transfer the sample to a 1.5 mL centrifuge tube, and centrifuge at 14,000 × g for 10 min to remove aggregates. Measure A280 five times, and use the average to calculate the protein concentration. Store the protein on ice until use and do not freeze at this point.

3.4 Competition Titration

1. Prepare at least 200 μL master mix of each of the biosensor proteins to be measured with a final concentration of 0.55 μM biosensor and 5.5 mg/mL BSA. Distribute the master mix in equal portions over the 8 wells of one column of a 96-well plate to allow pipetting with a multichannel pipette (Fig. 3a).
2. Estimate the volume of competitor peptide needed at each concentration: 3 repeats of 9 μL each result in 27 μL per biosensor variant. Add 10% extra to be safe: V = 30 μL × Nvariants.
3. Add 2 V of peptide stock solution to a corner well of the deep well plate. Add V μL TBS to all other wells in two adjacent columns (wells 2–16 in Fig. 3b). Make a twofold dilution by transferring V μL from well 1 to well 2 following the pattern indicated in Fig. 3b. Mix by pipetting up and down five times. Continue by transferring from well 2 to 3 and so on until reaching well 16. Remove V μL from well 16 and discard to maintain identical volumes.
4. Use an 8-channel pipette to transfer 2 μL biosensor master mix to each well in a column on the 384-well plate (Fig. 3a). Ensure that the droplet stays in the well by depositing it against the wall of the well. Use three identical columns of each variant to obtain triplicates. Continue for each biosensor variant and note the order of samples.
5. Transfer 9 μL from each well in the deep well plate using a multichannel pipette. The two columns of the 96-well deep well plate should be interleaved into a single column in the 384-well plate in two pipetting operations (Fig. 3b). Mix the sample by carefully pipetting up and down five times.
6. Equilibrate the fluorescence plate reader to 30 °C (see Note 12).
7. Inspect the wells visually before fluorescence measurement. The samples should sit at the bottom of the well. Use an excitation wavelength of 500 nm with a bandwidth of 15 nm. For emission wavelengths, 535 and 600 nm are used for the


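[Fig. 3 graphic omitted. Panel (a), workflow: 1. Prepare dilution series in deep well plate. 2. Prepare biosensor master mix in 96-well plate. 3. Mix in each well of the 384-well plate: 2 μL biosensor and 9 μL competitor peptide. 4. Measure in plate reader. Panel (b), competitor peptide dilution: wells 1–16 of the dilution series are arranged in two adjacent columns of the deep well plate (odd-numbered and even-numbered wells) and are interleaved into consecutive wells 1, 2, 3, ... of a single column of the 384-well plate.]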

Fig. 3 Flowchart of effective concentration titration experiment. (a) In parallel analysis of many biosensor variants, the samples are most easily handled using multiwell plates. For each biosensor variant, the master mix is distributed across a row in a 96-well plate for pipetting by an 8-well multipipette. For the titration series, a twofold dilution series is prepared of the competitor in the deep well plate in the pattern shown in (b). This allows the titration to be transferred to the 384-well plate in order using two interleaved pipetting steps per row

donor and acceptor, respectively, with a 25 nm bandwidth for both.
8. Record fluorescence emission for each well. We use 5 flashes of 1 s excitation per well.
9. Before finishing the experimental work, calculate the apparent FRET efficiency as described below. Identify large outliers, which are typically due to bubbles. Prepare a fresh sample of these and remeasure.

3.5 Data Analysis

The analysis capabilities and output of fluorescence instruments vary, so we will describe a generic approach for determining the effective concentration using generic software.

1. Export the fluorescence emission intensities for both donor and acceptor as comma-separated values. Import the data into a data-handling program of your choice such as MATLAB or Excel. Generate a column of x-values corresponding to the MBD2 concentration.
2. Calculate the apparent FRET efficiency for each measurement from Eq. 1:

$$E = \frac{I_A}{I_A + I_D} \tag{1}$$

where ID and IA are the fluorescence emission from the donor and acceptor, respectively (see Note 13).
3. For each dataset, perform nonlinear fitting to Eq. 2:


$$\frac{I_A}{I_D + I_A} = E_1 - 0.5\,E_2\left[\left(c_{\mathrm{MBD2}} + c_{e,\mathrm{app}} + P\right) - \sqrt{\left(c_{\mathrm{MBD2}} + c_{e,\mathrm{app}} + P\right)^2 - 4P\,c_{\mathrm{MBD2}}}\right] \tag{2}$$

where E1 corresponds to the apparent FRET value in the closed state, E2 is the change in FRET upon opening, P is the concentration of the biosensor, and ce,app is the apparent effective concentration. cMBD2 is the peptide concentration and is the independent variable. Suggested initial guesses for the fitting parameters: E1 = 0.26, E2 = 0.14, ce,app = 10 μM, and P = 0.1 μM (fixed).
4. Calculate the true effective concentration by correcting for the difference in affinity between the mutant MBD2 in the biosensor and the wild-type competitor peptide (Eq. 3):

$$c_e = 30 \times c_{e,\mathrm{app}} \tag{3}$$

4 Notes

1. In many cases, trace metals can be left out as the bacteria will get the metals they need from the yeast extract. However, the metal content in such extracts can vary.
2. There are less expensive non-commercial alternatives; however, most of these involve longer lysing times.
3. The buffers can be prepared by making buffer stocks with 0 and 500 mM imidazole and mixing them to obtain the intermediate concentrations.
4. This product should ideally not be replaced. We have tried several centrifugal concentrators, and this was the only type capable of reaching sufficiently high concentrations.
5. Small volumes of bacterial culture are easier to handle in parallel, and we find that 50 mL culture produces ample protein for determination of effective concentrations.
6. The cells could also be lysed by physical methods such as sonication. However, the chemical lysis protocol allows processing of many constructs in parallel.
7. Similar or better purification could be achieved using FPLC-based protocols; however, the gravity flow affinity purification allows for purification of many variants in parallel.
8. The Strep-Tactin column has less capacity than the Ni column, so there may still be some colored protein in the flow-through. If the column is not brightly colored after the wash, it has lost capacity and should be discarded.


9. To achieve the high concentration needed in the competitive titration, it is important to use at least 6 L of bacterial culture.
10. To achieve pure protein in a single chromatographic step, it is important to saturate the column entirely with MBD2. A larger bed volume will result in less pure protein.
11. Concentrated MBD2 may get a gel-like appearance when stored on ice. The protein can be dissolved by heating to 50 °C and repeating Subheading 3.3, step 14.
12. We use 30 °C because this is the lowest stable temperature in a plate reader without active cooling. For other plate readers, this may be different.
13. These values should not be interpreted in terms of distances as there is considerable direct excitation of the acceptor fluorophore during the donor excitation.
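The data analysis in Subheading 3.5 can be sketched in Python using only the standard library. This is an illustration under stated assumptions rather than the authors' analysis script: all function names are hypothetical, the demonstration data are synthetic, and the fitting strategy is swapped for a simple profile search (a scan over log10 of ce,app with E1 and E2 obtained by linear least squares at each trial value) instead of a general nonlinear fitting routine. The sketch also divides the bracketed term of Eq. 2 by the fixed biosensor concentration P so that E2 is directly the dimensionless change in FRET; because P is fixed, this only rescales E2 and does not change the fitted ce,app.

```python
import math

P_UM = 0.1  # biosensor concentration in the well (0.55 uM x 2 uL / 11 uL)


def peptide_series_uM(stock_uM, n_points=16, peptide_uL=9.0, total_uL=11.0):
    """Final peptide concentrations of the twofold series after mixing in the well."""
    return [stock_uM * peptide_uL / total_uL / 2 ** i for i in range(n_points)]


def apparent_fret(i_acceptor, i_donor):
    """Eq. 1: apparent FRET efficiency from acceptor and donor emission."""
    return i_acceptor / (i_acceptor + i_donor)


def fraction_open(c, ce_app, p=P_UM):
    """Quadratic binding term of Eq. 2 divided by P (normalization is an assumption)."""
    s = c + ce_app + p
    return 0.5 * (s - math.sqrt(s * s - 4.0 * p * c)) / p


def linear_fit(xs, ys):
    """Least-squares line y = a + b*x; returns (a, b, sum of squared residuals)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    b = sxy / sxx
    a = my - b * mx
    sse = sum((y - a - b * x) ** 2 for x, y in zip(xs, ys))
    return a, b, sse


def fit_titration(conc_uM, fret, lo=-2.0, hi=4.0):
    """Estimate E1, E2, and ce,app by profiling over log10(ce,app)."""
    def sse_at(log_ce):
        f = [fraction_open(c, 10.0 ** log_ce) for c in conc_uM]
        return linear_fit(f, fret)[2]

    # Coarse scan, then golden-section refinement around the best grid point.
    grid = [lo + 0.1 * k for k in range(int(round((hi - lo) / 0.1)) + 1)]
    best = min(grid, key=sse_at)
    a, b = best - 0.1, best + 0.1
    g = (math.sqrt(5.0) - 1.0) / 2.0
    for _ in range(60):
        x1, x2 = b - g * (b - a), a + g * (b - a)
        if sse_at(x1) < sse_at(x2):
            b = x2
        else:
            a = x1
    ce_app = 10.0 ** ((a + b) / 2.0)
    e1, slope, _ = linear_fit([fraction_open(c, ce_app) for c in conc_uM], fret)
    # Eq. 3: correct for the 30-fold affinity difference of the V227A mutant.
    return {"E1": e1, "E2": -slope, "ce_app_uM": ce_app, "ce_uM": 30.0 * ce_app}


if __name__ == "__main__":
    conc = peptide_series_uM(2350.0)
    # Synthetic, noise-free data for illustration only (not measured values).
    data = [0.26 - 0.12 * fraction_open(c, 25.0) for c in conc]
    print(fit_titration(conc, data))
```

With real plate-reader data, `apparent_fret` would be applied to each well and the triplicates either averaged or fitted together; outliers caused by bubbles should be remeasured as described in Subheading 3.4 before fitting.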

Acknowledgments This work was supported by grants to M.K. from the “Young Investigator Program” of the Villum Foundation, Independent Research Fund Denmark (FTP), and PROMEMO—Center for Proteins in Memory – a center of excellence funded by the Danish National Research Foundation (grant number DNRF133). References 1. Li M, Cao H, Lai L et al (2018) Disordered linkers in multidomain allosteric proteins: entropic effect to favor the open state or enhanced local concentration to favor the closed state?: disordered linkers in multidomain allosteric proteins. Protein Sci 27:1600–1610 2. Timpe LC, Peller L (1995) A random flight chain model for the tether of the shaker K+ channel inactivation domain. Biophys J 69:2415–2418 3. Sørensen CS, Jendroszek A, and Kjaergaard M (2019) Linker dependence of avidity in multivalent interactions between disordered proteins. J Mol Biol 431:4784–4795 4. Szabo BB, Tamas H, Eva S et al (2019) Intrinsically disordered linkers impart processivity on enzymes by spatial confinement of binding domains. Int J Mol Sci 20:2119 5. Hoshi T, Zagotta W, Aldrich R (1990) Biophysical and molecular mechanisms of shaker potassium channel inactivation. Science 250:533–538

6. Harmon TS, Holehouse AS, Rosen MK et al (2017) Intrinsically disordered linkers determine the interplay between phase separation and gelation in multivalent proteins. eLife 6:e30294
7. Diestler DJ, Knapp EW (2010) Statistical mechanics of the stability of multivalent ligand-receptor complexes. J Phys Chem C 114:5287–5304
8. Borcherds W, Becker A, Chen L et al (2017) Optimal affinity enhancement by a conserved flexible linker controls p53 mimicry in MdmX. Biophys J 112:2038–2042
9. Sherry KP, Johnson SE, Hatem CL et al (2015) Effects of linker length and transient secondary structure elements in the intrinsically disordered notch RAM region on notch signaling. J Mol Biol 427:3587–3597
10. Sørensen CS, Kjaergaard M (2019) Effective concentrations enforced by intrinsically disordered linkers are governed by polymer physics. Proc Natl Acad Sci 116:23124–23131
11. Krishnamurthy VM, Semetey V, Bracher PJ et al (2007) Dependence of effective molarity on linker length for an intramolecular protein-ligand system. J Am Chem Soc 129:1312–1320
12. Gargano JM, Ngo T, Kim JY et al (2001) Multivalent inhibition of AB5 toxins. J Am Chem Soc 123:12909–12910
13. Gnanapragasam MN, Scarsdale JN, Amaya ML et al (2011) p66α–MBD2 coiled-coil interaction and recruitment of Mi-2 are critical for globin gene silencing by the MBD2–NuRD complex. Proc Natl Acad Sci 108:7487–7492
14. Bajar BT, Wang ES, Lam AJ et al (2016) Improving brightness and photostability of green and red fluorescent proteins for live cell imaging and FRET reporting. Sci Rep 6:20889
15. Studier FW (2005) Protein production by auto-induction in high-density shaking cultures. Protein Expr Purif 41:207–234
16. Green MR, Sambrook J (2012) Molecular cloning: a laboratory manual, 4th edn. Cold Spring Harbor Laboratory Press, Cold Spring Harbor, NY
17. Artimo P, Jonnalagedda M, Arnold K et al (2012) ExPASy: SIB bioinformatics resource portal. Nucleic Acids Res 40:W597–W603
18. Kalthoff C (2003) A novel strategy for the purification of recombinantly expressed unstructured protein domains. J Chromatogr B 786:247–254

Chapter 26

Determining the Protective Activity of IDPs Under Partial Dehydration and Freeze-Thaw Conditions

David F. Rendón-Luna, Paulette S. Romero-Pérez, Cesar L. Cuevas-Velazquez, José L. Reyes, and Alejandra A. Covarrubias

Abstract

Unlike structured proteins, intrinsically disordered proteins (IDPs) require ad hoc assays and strategies to characterize their dynamic structure and function. Late embryogenesis abundant (LEA) proteins are important plant IDPs closely related to the water-deficit stress response. Diverse hypothetical functions have been proposed for LEA proteins, such as membrane stabilizers during cold stress, oxidative regulators acting as metal ion-binding molecules, and protein protectants during dehydration and cold/freezing conditions. Here we present two detailed protocols to characterize IDPs with potential protein/enzyme protection activity under partial dehydration and freeze-thaw treatments.

Key words Intrinsically disordered proteins, Lactate dehydrogenase, LEA proteins, Alcohol dehydrogenase, Late embryogenesis abundant proteins, Desiccation, Dehydration

David F. Rendón-Luna and Paulette S. Romero-Pérez contributed equally to this work.

Birthe B. Kragelund and Karen Skriver (eds.), Intrinsically Disordered Proteins: Methods and Protocols, Methods in Molecular Biology, vol. 2141, https://doi.org/10.1007/978-1-0716-0524-0_26, © Springer Science+Business Media, LLC, part of Springer Nature 2020

1 Introduction

Data reported during the last decade have provided convincing evidence that there are proteins with high structural flexibility and instability, currently known as intrinsically disordered proteins (IDPs). These proteins are not only fully functional but also display more than one biological activity, and they have been identified in all living organisms in which they have been sought [1]. Structural disorder is also found in defined sections of ordered proteins, designated as intrinsically disordered regions (IDRs). One of the distinctive properties of IDPs and IDRs is their amino acid composition and distribution, with a high representation of charged and polar amino acid residues and a low fraction of hydrophobic ones [2]. Numerous functions have been attributed to these proteins, many of them related to signaling processes, in some cases acting as hubs in protein interaction networks [3]. They may recognize multiple partners given their structural flexibility, which allows them to adopt different structures depending on posttranslational modifications [4] and/or on the characteristics of the cellular milieu [5]. IDP versatility extends even further because they have also been implicated in liquid-liquid phase separation and the formation of proteinaceous membrane-less organelles [6]. IDPs involved in different processes have also been identified in plants [7]. Among the best characterized plant IDPs are those proteins highly accumulated during late embryogenesis, a stage in which orthodox seeds undergo dehydration, a required condition to resume plant development upon germination. These proteins, termed late embryogenesis abundant (LEA) proteins, also accumulate in response to all conditions in which water availability is reduced in plant cells (dehydration, salinity, low temperatures) [8], and some of them in response to other adverse conditions such as oxidative stress [9] or high temperatures [10, 11]. LEA-like proteins have also been described in some bacterial genera [12] and in other eukaryotic organisms, interestingly, in those showing anhydrobiotic characteristics [13, 14]. Proteins with physicochemical properties similar to those shown by IDPs and LEA proteins, and whose accumulation also responds to water-deficit conditions but with no sequence similarity to LEA proteins, have also been described in prokaryotic and eukaryotic organisms; these were designated as hydrophilins [15].
Even though the in vivo function of LEA proteins and other hydrophilins is still unclear, the establishment of partial dehydration and freeze-thaw in vitro assays has revealed a protective action of different LEA proteins, and some hydrophilins, on enzyme activities sensitive to these water-deficit conditions [16–21]. The in vitro assays to determine the protective activity of a protein under water-deficit conditions require the use of reporter proteins whose activity/structure is sensitive to the treatments imposed by the assays. The reporter proteins used hitherto in this kind of assay are lactate dehydrogenase (LDH), alcohol dehydrogenase (ADH) [22], malate dehydrogenase (MDH) [16], citrate synthase (CS) [23], rhodanese, fumarase [24], and catalase [25], with LDH being the most commonly used. Here, we present two detailed protocols to establish optimal experimental conditions for freeze-thaw and partial dehydration experiments using LDH and ADH as reporter proteins to determine the protective activity of IDPs (LEA proteins and other hydrophilins). We are aware of other dehydration methods reported to evaluate the protective activity of hydrophilins [23, 26]; however, in this work, we present those that we consider more effective. The methodology of these protocols is divided into three major procedures: (1) the master mix preparation, where important variables should be considered to correctly perform the experiments, (2) the low water availability treatment, and (3) determination of the activity of the reporter proteins.

2 Materials

Prepare all solutions with sterile distilled water and analytical grade reagents. Use gloves during all procedures. Treatment buffers can be prepared at room temperature and stored at 4 °C. Protein solutions must be freshly prepared and kept on ice. The assay cannot be paused once started because the reporter protein might lose activity.

2.1 Stock Solutions

1. Tetrasodium pyrophosphate buffer (100 mM tetrasodium pyrophosphate, pH 8.8): Dissolve 4.46 g of sodium pyrophosphate tetrabasic decahydrate (Na4P2O7 · 10 H2O) in 80 mL of water. Adjust pH to 8.8 using 9.5% H3PO4. Make up to 100 mL with water. Autoclave.
2. 1 M Tris–HCl, pH 7.5: Dissolve 12.11 g of Tris (ultrapure grade) in 80 mL of water. Adjust pH to 7.5 with concentrated HCl. Make up to 100 mL with water. Autoclave.
3. 1 M KCl: Dissolve 7.455 g of KCl in 80 mL of water. Make up to 100 mL with water. Autoclave.
4. 0.2 M Na2HPO4: Dissolve 2.84 g of Na2HPO4 in 80 mL of water. Make up to 100 mL with water. Autoclave.
5. 0.2 M NaH2PO4: Dissolve 2.76 g of NaH2PO4 · H2O in 80 mL of water. Make up to 100 mL with water. Autoclave.
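The weighed amounts listed above follow from mass = molarity × volume × molar mass. The short Python sketch below recomputes them as a cross-check. The molar masses are standard values supplied here as assumptions (they are not stated in the protocol), and the function name is purely illustrative.

```python
# Molar masses (g/mol) are assumed standard values, not taken from the protocol.
MOLAR_MASS = {
    "Na4P2O7·10H2O": 446.06,
    "Tris": 121.14,
    "KCl": 74.55,
    "Na2HPO4": 141.96,
    "NaH2PO4·H2O": 137.99,
}


def grams_needed(molarity_mol_per_l, volume_ml, molar_mass_g_per_mol):
    """Mass of solute (g) for a stock of the given molarity and final volume."""
    return molarity_mol_per_l * (volume_ml / 1000.0) * molar_mass_g_per_mol


# (compound, target molarity in mol/L); every stock is made up to 100 mL
STOCKS = [
    ("Na4P2O7·10H2O", 0.1),
    ("Tris", 1.0),
    ("KCl", 1.0),
    ("Na2HPO4", 0.2),
    ("NaH2PO4·H2O", 0.2),
]

for name, molarity in STOCKS:
    grams = grams_needed(molarity, 100, MOLAR_MASS[name])
    print(f"{name}: {grams:.3f} g per 100 mL")
```

The same helper can be reused when a stock is prepared at a different volume by changing the second argument.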

2.2 Treatment Buffers

1. LDH treatment buffer: 25 mM Tris–HCl, pH 7.5. Mix 2.5 mL of 1 M Tris–HCl, pH 7.5, with 97.5 mL of water.
2. ADH treatment buffer: 10 mM sodium phosphate buffer, pH 7.5. Mix 42 mL of 0.2 M Na2HPO4 and 8 mL of 0.2 M NaH2PO4. Make up to 100 mL with water.

2.3 Protein Solutions

1. LDH 100×: 25 μM of L-lactate dehydrogenase (L-LDH) from rabbit muscle. Dilute the required volume of L-LDH in LDH treatment buffer. For the preparation of this solution, consider the molecular mass of the L-LDH monomer (36.4 kDa) (see Note 1).
2. ADH 100×: 25 μM of alcohol dehydrogenase (ADH) from Saccharomyces cerevisiae. Dissolve the required mass of ADH in ADH treatment buffer. For the preparation of this solution, consider the molecular mass of the ADH monomer (35.25 kDa) (see Note 1).
3. Protectant 100×: 25 μM IDP. Dissolve the corresponding IDPs, or control proteins, in LDH or ADH treatment buffer according to the reporter protein used for the assay (see Note 1).

2.4 Reaction Buffers

1. LDH reaction buffer: 100 mM Tris–HCl, pH 7.5, 25 mM KCl, 148 μM NADH, 2.5 mM pyruvic acid. For 20 mL (required for five samples divided into six aliquots each), dissolve 3.5 mg of β-NADH in 30 μL of water and 8.8 mg of pyruvic acid in 20 μL of water. Then, mix 2 mL of 1 M Tris–HCl, pH 7.5, 0.5 mL of 1 M KCl, 20 μL of the β-NADH solution, and 10 μL of the pyruvic acid solution. Make up to 20 mL with water.
2. ADH reaction buffer: 10 mM tetrasodium pyrophosphate buffer, pH 8.8, 3.35% ethanol, 115 μM β-NAD. For 4.8 mL (required for one sample divided into six aliquots), dissolve 30.8 mg of β-NAD in 400 μL of water. Then, mix 480 μL of 100 mM tetrasodium pyrophosphate buffer, pH 8.8, 168 μL of absolute ethanol, and 360 μL of the β-NAD solution in 3.8 mL of water.

2.5 Equipment

1. Thermoblock for microfuge tubes.
2. Vacuum concentrator for microfuge tubes.
3. UV spectrophotometer with time-based kinetic/rate determination capability.
4. UV quartz cuvette (1 mL).
5. Vortex.

3 Methods

3.1 Master Mix Preparation

1. Mix 7 μL of LDH 100× or ADH 100× solution with 7 μL of protectant 100× solution for a working molar ratio of 1:1. For a different molar ratio, for example 1:10, use 70 μL of protectant 100× solution. Make up to 700 μL with LDH or ADH treatment buffer (see Notes 2 and 3).
2. Prepare one master mix for each variable to be tested, i.e., different protectants or molar ratios (see Notes 4 and 5).
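Because the reporter and protectant stocks are both 25 μM, the pipetted volume ratio equals the molar ratio, and 7 μL of reporter in 700 μL gives the 0.25 μM final concentration mentioned in Note 3. The following Python sketch tabulates the volumes for a few ratios and converts the 25 μM stocks to mg/mL from the monomer masses given in Subheading 2.3. The function names are hypothetical, and the sketch assumes that the protectant stock is exactly equimolar with the reporter stock.

```python
def master_mix_volumes(protectant_per_reporter, total_ul=700.0, reporter_ul=7.0):
    """Return (reporter, protectant, buffer) volumes in microlitres.

    Assumes both stocks are 25 uM, so the volume ratio equals the molar
    ratio (1:1 -> 7 uL each, 1:10 -> 70 uL of protectant).
    """
    protectant_ul = reporter_ul * protectant_per_reporter
    buffer_ul = total_ul - reporter_ul - protectant_ul
    if buffer_ul < 0:
        raise ValueError("requested ratio does not fit in the total volume")
    return reporter_ul, protectant_ul, buffer_ul


def stock_mg_per_ml(conc_um, monomer_kda):
    """Mass concentration (mg/mL) from molarity (uM) and monomer mass (kDa)."""
    # uM x kDa = 1e-6 mol/L x 1e3 g/mol = 1e-3 g/L, and 1 g/L equals 1 mg/mL
    return conc_um * monomer_kda / 1000.0


for ratio in (1, 5, 10, 20):
    reporter, protectant, buffer = master_mix_volumes(ratio)
    print(f"1:{ratio}  reporter {reporter} uL, protectant {protectant} uL, buffer {buffer} uL")

print("LDH 100x stock (mg/mL):", stock_mg_per_ml(25, 36.4))
print("ADH 100x stock (mg/mL):", stock_mg_per_ml(25, 35.25))
```

Keeping the reporter volume fixed and varying only the protectant volume means the reporter concentration is identical across all molar ratios being compared.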

3.2 Low Water Availability Assay: Freeze-Thaw Treatment

1. Vortex each master mix for 2–3 s. Aliquot 100 μL of each master mix into six 1.5 mL microfuge tubes. Three of the six tubes will be subjected to the freeze-thaw treatment, while the rest will remain on ice during the experiment as controls.
2. The freeze-thaw treatment consists of placing the tubes in liquid N2 for 1 min and then thawing them in a thermoblock at 25 °C for 5 min (see Note 6).
3. Repeat step 2 as many times as required for the complete inactivation of the reporter proteins (see Note 7).

3.3 Low Water Availability Assay: Dehydration Treatment

1. Record the weight (W_T) of each empty 1.5 mL microfuge tube needed for the experiment.
2. Vortex the prepared master mix for 2–3 s. Distribute the master mix into six weighed microfuge tubes, 75 μL per tube.
3. Record the weight of each of the six tubes containing master mix (W_T+MM), and calculate the weight of the master mix solution added (W_MM) as follows:

\[ W_{MM} = W_{T+MM} - W_{T} \]

4. If W_MM differs from 75 mg, adjust the volume to 75 μL with master mix solution.
5. Record the weight of each tube again (W_FT).
6. Three of the six tubes will be subjected to partial dehydration, while the rest will remain on ice during the experiment as control samples.
7. The partial dehydration treatment consists of subjecting the samples to progressive water loss using a vacuum concentrator at approximately 65 kPa (see Note 8).
8. Continue with the dehydration procedure (see Note 9), monitoring the weight of the sample tubes (W_x) until the desired percentage of water loss (%W_L) is reached (see Note 10):

\[ \%W_{L} = \frac{W_{x} \times 100}{W_{FT}} \]

9. When the desired %W_L is reached, calculate and add the volume of H2O needed to recover the initial weight of 75 mg.
10. Keep all tubes (including controls) on ice, and proceed to measurement of the enzymatic activity.

3.4 Measurement of LDH or ADH Activity: Spectrophotometer Setting

1. Set the UV spectrophotometer to 340 nm to follow NADH consumption or production (see Notes 11 and 12). For LDH, record the absorbance for 45 s, whereas for ADH extend the measurement up to 2 min. The measurement time can be adjusted, as it depends on the amount of enzyme used in each reaction.
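Before the activity measurement, the weight bookkeeping of the dehydration treatment (Subheading 3.3) can be handled with a small helper. The Python sketch below computes the master-mix weight from the two weighings and the water to add back. As an assumption based on Note 10, it expresses water loss as the weight lost relative to the initial master-mix weight; the printed %W_L expression may be intended differently, so the function names and this interpretation are illustrative only. It also takes 1 mg of water as 1 μL.

```python
def master_mix_weight(w_tube_plus_mix_mg, w_tube_mg):
    """W_MM = W_(T+MM) - W_T, all weights in mg."""
    return w_tube_plus_mix_mg - w_tube_mg


def percent_water_loss(w_full_tube_mg, w_now_mg, w_mix_mg):
    """Weight lost by evaporation as a percentage of the initial master-mix weight.

    Assumption: the loss is referred to the master mix alone (tube weight excluded).
    """
    return 100.0 * (w_full_tube_mg - w_now_mg) / w_mix_mg


def water_to_add_ul(w_full_tube_mg, w_now_mg):
    """Water (uL) needed to restore the initial weight, taking 1 mg of water as 1 uL."""
    return w_full_tube_mg - w_now_mg


# Hypothetical weighings (mg): empty tube, filled tube, tube after vacuum treatment
w_t, w_ft, w_x = 1000.0, 1075.0, 1015.0
w_mm = master_mix_weight(w_ft, w_t)
print("master mix (mg):", w_mm)
print("water loss (%):", percent_water_loss(w_ft, w_x, w_mm))
print("water to add (uL):", water_to_add_ul(w_ft, w_x))
```

Weighing in mg keeps the numbers directly comparable with the 75 μL aliquot, since the dilute aqueous master mix has a density close to that of water.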


Fig. 1 Kinetics of consumption (a) and of production (b) of NADH, by LDH or ADH, respectively. The slope is obtained considering the linear range (red line) of the curve. s seconds

Fig. 2 A representative quantification of the protective activity of a LEA protein using LDH as reporter enzyme in a freeze-thaw assay. BSA is used as a protecting reference protein, while lysozyme is a reference non-protecting protein. The data were obtained using a molar ratio of 1:10 (LDH/LEA). Vertical lines in each box represent data dispersion, while horizontal lines correspond to the median. Statistical analysis was performed applying the Tukey test (p < 0.05; n = 3)

3.5 Measurement of LDH or ADH Activity: Enzyme Activity Quantification

This section describes the determination of LDH and ADH activities. Use the appropriate reaction buffer.

1. Use 600 μL of buffer without NADH in the quartz cuvette to blank the spectrophotometer.
2. Pour 600 μL of LDH (see Note 13) or ADH reaction buffer into the clean quartz cuvette, and place it in the spectrophotometer.
3. For LDH or ADH activity determinations, take 15 μL of one of the master mix aliquots. Then, mix quickly and thoroughly with the reaction buffer in the cuvette, and immediately begin the absorbance measurements (see Note 14).
4. Record the slope in the linear range of the curve (absorbance over time). For LDH the activity corresponds to a negative slope, because it represents the consumption of NADH, whereas ADH activity leads to NADH production, resulting in a positive slope (Fig. 1a, b).
5. Repeat steps 3 and 4 with all samples (see Note 15).
6. The mean value of the slopes determined for all untreated samples is taken as 100% enzyme activity. The mean values of the slopes for treated samples can then be expressed relative to this 100% activity (Fig. 2).
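Steps 4 and 6 amount to a linear fit followed by a normalization. The Python sketch below illustrates one way to do this, assuming the readings passed in have already been restricted to the linear range; the least-squares fit, the function names, and the example traces are illustrative assumptions rather than the procedure built into any particular spectrophotometer software.

```python
def slope(times_s, absorbances):
    """Least-squares slope (absorbance units per second) of A340 versus time."""
    n = len(times_s)
    if n < 2 or n != len(absorbances):
        raise ValueError("need at least two paired readings")
    mean_t = sum(times_s) / n
    mean_a = sum(absorbances) / n
    num = sum((t - mean_t) * (a - mean_a) for t, a in zip(times_s, absorbances))
    den = sum((t - mean_t) ** 2 for t in times_s)
    return num / den


def percent_activity(treated_slopes, control_slopes):
    """Mean treated slope as a percentage of the mean untreated (control) slope."""
    mean_treated = sum(treated_slopes) / len(treated_slopes)
    mean_control = sum(control_slopes) / len(control_slopes)
    return 100.0 * mean_treated / mean_control


# Hypothetical LDH traces within the linear range (NADH is consumed, so slopes are negative)
t = [0, 5, 10, 15, 20]
control = slope(t, [0.80, 0.70, 0.60, 0.50, 0.40])
treated = slope(t, [0.80, 0.75, 0.70, 0.65, 0.60])
print("control slope:", control)
print("treated slope:", treated)
print("remaining activity (%):", percent_activity([treated], [control]))
```

Because the ratio of two slopes of the same sign is positive, the same normalization applies to the negative LDH slopes and the positive ADH slopes.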

4 Notes

1. Before performing any assays, the integrity and purity of the IDPs of interest should be verified. Special care should be taken in the determination of protein concentration, particularly in the case of IDPs. Various observations indicate that the commonly used protein quantification assays (Bradford method or absorbance at 280 nm) may lead to serious errors in the calculation of IDP concentration, with critical implications for data interpretation [27, 28]. In agreement with Contreras-Martos et al. [29], the ninhydrin method or the Qubit™ protein assay seems to be the most reliable quantification method for many IDPs. We strongly recommend referring to Contreras-Martos et al. [29] for details.
2. The magnitude of IDP protective activity depends on the reporter/protectant molar ratio. To obtain reliable data, the determination of the protection capacity of a protein should include at least five different molar ratios between 1:1 and 1:20. Also, it should be considered that LDH and ADH activities start to decrease once they are diluted in the reaction buffer. To slow this drop in activity, all procedures must be carried out on ice, unless otherwise indicated. Reliable LDH activity is measured at 25 °C; if necessary, place the reaction buffer in a water bath to maintain this temperature.
3. We estimated 0.25 μM as the optimal concentration of reporter protein. This final protein concentration avoids protection due to molecular crowding and allows an optimal enzymatic activity determination. The amounts of pyruvic acid and NADH for the LDH reaction buffer, and of ethanol and NAD for the ADH reaction buffer, used in these protocols are optimal for the determination of the corresponding enzyme activities.
4. Positive and negative controls should be used in these assays. Bovine serum albumin (BSA) is a good protein stabilizer; therefore, it can be used as a positive control of protection. Lysozyme and RNase A are small globular proteins (14.4 kDa and 13.7 kDa, respectively) that can be used as negative controls because they do not show a protective effect in these assays.
5. To facilitate sample handling, we recommend preparing up to four master mixes per experiment.
6. To avoid treatment variation between samples, freeze and thaw all the tubes at the same time.
7. We have found that seven cycles are enough for complete inactivation of LDH, while 12 cycles are required to inactivate ADH. It is also possible to determine reporter protein aggregation, which is detected even after two freeze-thaw cycles, by measuring light scattering as absorbance at 320 or 340 nm.
8. During the vacuum-dry treatment, the temperature in the concentrator chamber usually remains quite stable at room temperature (25 °C). To avoid an increase that could lead to heat denaturation of the protein samples, monitor the temperature by placing a thermometer inside the chamber. If you notice a temperature rise, cool down the chamber by placing a sealed bag of iced water on top.
9. Care should be taken when dehydration is close to 80% because at this point the rate of water loss increases. Avoid complete dryness because the enzyme will be fully inactivated regardless of the presence of protectant protein.
10. The percentage of dehydration is scored as the weight loss due to water evaporation in each tube.
11. We follow the consumption of NADH due to the reduction of pyruvic acid to lactate catalyzed by LDH.

\[ \text{Pyruvic acid} + \text{NADH} \overset{\text{LDH}}{\longleftrightarrow} \text{Lactic acid} + \text{NAD}^{+} \]

12. We follow the production of NADH due to the oxidation of ethanol to acetaldehyde catalyzed by ADH.

\[ \text{Ethanol} + \text{NAD}^{+} \overset{\text{ADH}}{\longleftrightarrow} \text{Acetaldehyde} + \text{NADH} \]

13. The amount of NADH used in the LDH reaction buffer is enough to give an initial A340 of 0.8 ± 0.1. After blanking with buffer without NADH, read the absorbance of LDH reaction buffer alone. If the initial A340 differs from 0.8 ± 0.1, prepare fresh LDH reaction buffer.
14. LDH consumes the NADH present in the LDH reaction buffer in approximately 1 min, so it is important to add the sample aliquot, mix, and read the absorbance as quickly as possible.
15. Even though samples remain on ice all the time, loss of activity is unavoidable. To compensate for this error in the measurements, determine the enzymatic activity of untreated control samples at the beginning, during, and at the end of the procedure.

References

1. Dyson HJ, Wright PE (2005) Intrinsically unstructured proteins and their functions. Nat Rev Mol Cell Biol 6(3):197–208. https://doi.org/10.1038/nrm1589
2. Uversky VN, Gillespie JR, Fink AL (2000) Why are "natively unfolded" proteins unstructured under physiologic conditions? Proteins 41(3):415–427. https://doi.org/10.1002/1097-0134(20001115)41:33.3.Co;2-Z
3. Dunker AK, Cortese MS, Romero P et al (2005) Flexible nets - the roles of intrinsic disorder in protein interaction networks. FEBS J 272(20):5129–5148. https://doi.org/10.1111/j.1742-4658.2005.04948.x
4. Darling AL, Uversky VN (2018) Intrinsic disorder and posttranslational modifications: the darker side of the biological dark matter. Front Genet 9. https://doi.org/10.3389/Fgene.2018.00158
5. Fonin AV, Darling AL, Kuznetsova IM et al (2018) Intrinsically disordered proteins in crowded milieu: when chaos prevails within the cellular gumbo. Cell Mol Life Sci 75(21):3907–3929. https://doi.org/10.1007/s00018-018-2894-9
6. Alberti S (2017) Phase separation in biology. Curr Biol 27(20):R1097–R1102. https://doi.org/10.1016/j.cub.2017.08.069
7. Covarrubias AA, Cuevas-Velazquez CL, Romero-Perez PS et al (2017) Structural disorder in plant proteins: where plasticity meets sessility. Cell Mol Life Sci 74(17):3119–3147. https://doi.org/10.1007/s00018-017-2557-2
8. Battaglia M, Olvera-Carrillo Y, Garciarrubio A et al (2008) The enigmatic LEA proteins and other hydrophilins. Plant Physiol 148(1):6–24. https://doi.org/10.1104/pp.108.120725
9. Liu Y, Wang L, Xing X et al (2013) ZmLEA3, a multifunctional group 3 LEA protein from maize (Zea mays L.), is involved in biotic and abiotic stresses. Plant Cell Physiol 54(6):944–959. https://doi.org/10.1093/pcp/pct047
10. Muvunyi BP, Yan Q, Wu F et al (2018) Mining late embryogenesis abundant (LEA) family genes in Cleistogenes songorica, a xerophyte perennial desert plant. Int J Mol Sci 19(11). https://doi.org/10.3390/Ijms19113430
11. Tang XL, Wang HY, Chu LY et al (2016) KvLEA, a new isolated late embryogenesis abundant protein gene from Kosteletzkya virginica responding to multiabiotic stresses. Biomed Res Int 2016:1. https://doi.org/10.1155/2016/9823697
12. Campos F, Cuevas-Velazquez C, Fares MA et al (2013) Group 1 LEA proteins, an ancestral plant protein group, are also present in other eukaryotes, and in the archeae and bacteria domains. Mol Gen Genomics 288(10):503–517. https://doi.org/10.1007/s00438-013-0768-2
13. Boothby TC, Tapia H, Brozena AH et al (2017) Tardigrades use intrinsically disordered proteins to survive desiccation. Mol Cell 65(6):975. https://doi.org/10.1016/j.molcel.2017.02.018
14. Kikawada T, Nakahara Y, Kanamori Y et al (2006) Dehydration-induced expression of LEA proteins in an anhydrobiotic chironomid. Biochem Biophys Res Commun 348(1):56–61. https://doi.org/10.1016/j.bbrc.2006.07.003
15. Garay-Arroyo A, Colmenero-Flores JM, Garciarrubio A et al (2000) Highly hydrophilic proteins in prokaryotes and eukaryotes are common during conditions of water deficit. J Biol Chem 275(8):5668–5674. https://doi.org/10.1074/jbc.275.8.5668
16. Reyes JL, Rodrigo MJ, Colmenero-Flores JM et al (2005) Hydrophilins from distant organisms can protect enzymatic activities from water limitation effects in vitro. Plant Cell Environ 28(6):709–718. https://doi.org/10.1111/j.1365-3040.2005.01317.x
17. Reyes JL, Campos F, Wei H et al (2008) Functional dissection of hydrophilins during in vitro freeze protection. Plant Cell Environ 31(12):1781–1790. https://doi.org/10.1111/j.1365-3040.2008.01879.x
18. Cuevas-Velazquez CL, Saab-Rincon G, Reyes JL et al (2016) The unstructured N-terminal region of Arabidopsis group 4 late embryogenesis abundant (LEA) proteins is required for folding and for chaperone-like activity under water deficit. J Biol Chem 291(20):10893–10903. https://doi.org/10.1074/jbc.M116.720318
19. Kim SX, Camdere G, Hu XC et al (2018) Synergy between the small intrinsically disordered protein Hsp12 and trehalose sustain viability after severe desiccation. Elife 7. https://doi.org/10.7554/eLife.383370.001
20. Dang NX, Hincha DK (2011) Identification of two hydrophilins that contribute to the desiccation and freezing tolerance of yeast (Saccharomyces cerevisiae) cells. Cryobiology 62(3):188–193. https://doi.org/10.1016/j.cryobiol.2011.03.002
21. Lopez-Martinez G, Rodriguez-Porrata B, Margalef-Catala M et al (2012) The STF2p hydrophilin from Saccharomyces cerevisiae is required for dehydration stress tolerance. PLoS One 7(3):e33324. https://doi.org/10.1371/journal.pone.0033324
22. Lv AM, Su LT, Liu XC et al (2018) Characterization of Dehydrin protein, CdDHN4-L and CdDHN4-S, and their differential protective roles against abiotic stress in vitro. BMC Plant Biol 18:299. https://doi.org/10.1186/S12870-018-1511-2
23. Goyal K, Walton LJ, Tunnacliffe A (2005) LEA proteins prevent protein aggregation due to water stress. Biochem J 388:151–157. https://doi.org/10.1042/Bj20041931
24. Grelet J, Benamar A, Teyssier E et al (2005) Identification in pea seed mitochondria of a late-embryogenesis abundant protein able to protect enzymes from drying. Plant Physiol 137(1):157–167. https://doi.org/10.1104/pp.104.052480
25. Hara M, Terashima S, Kuboi T (2001) Characterization and cryoprotective activity of cold-responsive dehydrin from Citrus unshiu. J Plant Physiol 158(10):1333–1339. https://doi.org/10.1078/0176-1617-00600
26. Furuki T, Shimizu T, Chakrabortee S et al (2012) Effects of group 3 LEA protein model peptides on desiccation-induced protein aggregation. Biochim Biophys Acta 1824(7):891–897. https://doi.org/10.1016/j.bbapap.2012.04.013
27. Szollosi E, Hazy E, Szasz C et al (2007) Large systematic errors compromise quantitation of intrinsically unstructured proteins. Anal Biochem 360(2):321–323. https://doi.org/10.1016/j.ab.2006.10.027
28. Weist S, Eravci M, Broedel O et al (2008) Results and reliability of protein quantification for two-dimensional gel electrophoresis strongly depend on the type of protein sample and the method employed. Proteomics 8(16):3389–3396. https://doi.org/10.1002/pmic.200800236
29. Contreras-Martos S, Nguyen HH, Nguyen PN et al (2018) Quantification of intrinsically disordered proteins: a problem not fully appreciated. Front Mol Biosci 5. https://doi.org/10.3389/Fmolb.2018.00083

Chapter 27

Screening Intrinsically Disordered Regions for Short Linear Binding Motifs

Muhammad Ali, Leandro Simonetti, and Ylva Ivarsson

Abstract

The intrinsically disordered regions of the proteome are enriched in short linear motifs (SLiMs) that serve as binding sites for peptide-binding proteins. These interactions are often of low-to-mid micromolar affinity and are challenging to screen for experimentally. However, a range of dedicated methods has been developed recently, which open the way for screening SLiM-based interactions on a large scale. A variant of phage display, termed proteomic peptide phage display (ProP-PD), has proven particularly useful for this purpose. Here, we describe a complete high-throughput ProP-PD protocol for screening intrinsically disordered regions for SLiMs. The protocol requires some basic bioinformatics skills for the design of the library and for data analysis but can be performed in a standard biochemistry lab. The protocol starts with the construction of a library, followed by the high-throughput expression and purification of bait proteins, the phage selection, and the analysis of the binding-enriched phage pools using next-generation sequencing. As the protocol generates rather large data sets, we also emphasize the importance of data management and storage.

Key words Interactions, Phage display, Next-generation sequencing, High-throughput purification, SLiM, Data management, IDP

Birthe B. Kragelund and Karen Skriver (eds.), Intrinsically Disordered Proteins: Methods and Protocols, Methods in Molecular Biology, vol. 2141, https://doi.org/10.1007/978-1-0716-0524-0_27, © Springer Science+Business Media, LLC, part of Springer Nature 2020

1 Introduction

Short linear motifs (SLiMs) are compact binding interfaces that are crucial for cell function. SLiMs are typically found within stretches of 3–12 amino acids, of which 3–4 amino acids serve as the main determinants of binding, although flanking regions contribute to affinity and specificity [1]. They are commonly found in intrinsically disordered regions (IDRs), which represent more than 30% of the human proteome. SLiMs are bound by specialized domains from more than 150 families [2]. Among the most abundant and most studied SLiM-binding domains in the human proteome are PDZ domains [3], SH3 domains [4], and WW domains [5]. SLiMs are also bound by enzymes such as kinases [6], phosphatases [7, 8], and E3 ligases [9]. SLiM-based interactions promote complex assembly and determine cellular protein localization, modification state, and protein stability. The ELM database, which is the most comprehensive manually curated database of SLiM-based interactions, holds over 3000 instances of motif-based interactions [10]. However, the instances compiled in the ELM database represent only a fraction of the SLiMs expected to hide in the proteome [11]. An underlying reason for this is that SLiM-based interactions often are of low-to-mid micromolar affinity, which makes them well suited to fulfill their cellular functions but inherently difficult to capture experimentally. Their limited footprints also make them difficult to predict accurately using bioinformatics.

Various approaches such as peptide arrays and combinatorial peptide phage display have been used to characterize the SLiM-binding preferences of specific domains or domain families [12–14]. One of the approaches that has emerged for charting SLiM-based interactions is proteomic peptide phage display (ProP-PD) [15–17]. ProP-PD is a variant of phage display in which the peptides to be displayed are designed based on the IDRs of a proteome of interest. The method combines bioinformatics, custom oligonucleotide library synthesis, peptide phage display, and next-generation sequencing (NGS) (Fig. 1). In our proof-of-concept study, we generated two ProP-PD libraries, one that displays the C-terminal regions of the human proteome and one displaying C-terminal peptides of viral proteomes [17]. More recently we generated a "human disorderome" library, designed to display hexadecameric peptides that tile all the IDRs of the human proteome [16]. This library has, for example, been used to uncover the SLiM-based interactions of scaffold proteins (e.g., PDZ domains [17, 18]) and phosphatases (e.g., calcineurin [19] and PP2A [7]). Following a similar strategy, the binding specificity of the Cdc14 phosphatase was recently elucidated using a yeast "disorderome" phage library [20]. We have further shown that the approach can be tuned to uncover phospho-regulation of SLiM-based interactions [21]. An added strength of the method is that it is scalable. As for combinatorial peptide phage display [13, 22], selections can be performed in 96-well format, which, paired with NGS, opens the way for screening SLiM-based interactions on a large scale.

ProP-PD selections provide data sets of binding peptides that can be used for generating consensus binding motifs. By matching the peptides to the library designs, ProP-PD further provides information on the identities of the host proteins of the peptides and suggests interactions of potential biological relevance. The approach is thus a valuable complement to other methods for interactome analysis. In this chapter, we describe a protocol for ProP-PD library construction, high-throughput (HTP) protein expression and purification, HTP phage display selections, and NGS analysis. As the experiment provides relatively large data sets, we also provide general guidelines for data analysis and storage.
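The idea of tiling a disordered region with hexadecameric peptides can be illustrated with a few lines of Python. This is a minimal sketch, not the library design procedure of this chapter: the peptide length of 16 follows the text, whereas the step between consecutive peptides, the handling of the C-terminal end, the function name, and the example sequence are all assumptions made for illustration.

```python
def tile_peptides(region, length=16, step=8):
    """Split a disordered-region sequence into overlapping peptides.

    `length` follows the hexadecameric peptides described in the text;
    `step` (offset between consecutive peptides) is an assumed value.
    A final peptide anchored at the C-terminus is added when needed so
    that the end of the region is always covered.
    """
    if len(region) <= length:
        return [region]
    starts = list(range(0, len(region) - length + 1, step))
    last = len(region) - length
    if starts[-1] != last:
        starts.append(last)
    return [region[s:s + length] for s in starts]


# Hypothetical disordered region (not taken from any real protein entry)
idr = "MSDSEKKPLTPSSEQEEDDLPPTPSRQGSAKKEEPSTSDG"
for peptide in tile_peptides(idr):
    print(peptide)
```

A smaller step increases the overlap between neighboring peptides, so that a motif is less likely to be split across a peptide boundary, at the cost of a larger library.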

Screening the Intrinsically DIsordered Regions


Fig. 1 Overview of the full protocol. The protocol starts with the construction of the phage library using a custom-designed oligonucleotide library and the expression and purification of bait proteins. We then describe how to use the material in high-throughput peptide phage display selection, how to evaluate the success of the selection through phage pool ELISA, and how to analyze the content of binding-enriched phage pools through NGS. We finally elaborate on how to manage and store the results

Muhammad Ali et al.

2 Materials

2.1 General Materials

1. LB agar plates: Add 10 g tryptone, 5 g yeast extract, 10 g NaCl, and 20 g agar to 1 L water. Autoclave.
2. M13KO7 helper phage.
3. Kanamycin stock (30 mg/mL): Dissolve 0.9 g of kanamycin in 30 mL of H2O. Filter sterilize, aliquot, and freeze.
4. Carbenicillin stock (100 mg/mL): Dissolve 1 g of carbenicillin in 10 mL of H2O. Filter sterilize, aliquot, and freeze.
5. Chloramphenicol stock (34 mg/mL): Dissolve 1.02 g of chloramphenicol in 30 mL of 70% ethanol. Filter sterilize, aliquot, and freeze.
6. Tetracycline stock (10 mg/mL): Dissolve 100 mg of tetracycline in 10 mL of 70% ethanol. Filter sterilize, aliquot, and freeze.
7. 2YT media: Dissolve 16 g tryptone, 10 g yeast extract, and 5 g NaCl in 1 L H2O. Autoclave.
8. Molecular-grade agarose.
9. GelRed (Biotium).
10. PEG/NaCl: Dissolve 200 g PEG-8000 and 146.1 g NaCl in 1 L H2O. Filter sterilize.
11. PBS: 137 mM NaCl, 2.7 mM KCl, 10 mM Na2HPO4, 1.8 mM KH2PO4 (pH 7.4). Autoclave or filter sterilize.
12. PBST: 0.05% (v/v) Tween-20 in sterile PBS.
13. PT: 0.05% (v/v) Tween-20, 2 g BSA in 1 L PBS. Filter sterilize.
14. WFI (water for injection) quality, cell culture tested, sterile.
15. Ultrapure glycerol.
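The antibiotic stocks listed above are later used at fixed working concentrations (e.g., 30 μg/mL kanamycin, 100 μg/mL carbenicillin). As a convenience, the dilution arithmetic (C1 × V1 = C2 × V2) can be sketched in Python as follows. The function name is hypothetical, and the calculation neglects the small volume added by the stock itself.

```python
def stock_volume_ul(stock_mg_per_ml: float, final_ug_per_ml: float, culture_ml: float) -> float:
    """Return the volume of stock (in microliters) needed for a given culture.

    Uses C1 * V1 = C2 * V2 and ignores the volume contributed by the stock.
    """
    if stock_mg_per_ml <= 0:
        raise ValueError("stock concentration must be positive")
    stock_ug_per_ul = stock_mg_per_ml          # 1 mg/mL equals 1 ug/uL
    total_ug = final_ug_per_ml * culture_ml    # mass of antibiotic required
    return total_ug / stock_ug_per_ul


# Kanamycin: 30 mg/mL stock, 30 ug/mL final, 30 mL culture
kan_ul = stock_volume_ul(30, 30, 30)
# Carbenicillin: 100 mg/mL stock, 100 ug/mL final, 500 mL culture
carb_ul = stock_volume_ul(100, 100, 500)
print(f"kanamycin: {kan_ul:.0f} uL, carbenicillin: {carb_ul:.0f} uL")
```

With the stock concentrations given here, the volume in microliters equals the culture volume in milliliters, because each stock is 1000 times more concentrated than its working concentration.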

2.2 Purification of dU-ssDNA

1. Phagemid encoding the P8 coat protein (e.g., PRSTOP4).
2. E. coli CJ236 ([F′ Tra+ Pil+ (CamR)] ung-1 relA1 dut-1 thi-1 spoT1 mcrA).
3. QIAprep Spin M13 kit.
4. Uridine: 0.025 μg/mL in H2O. Filter sterilize.
5. MLB buffer: 1 M sodium perchlorate in H2O, 30% (v/v) isopropanol. Filter sterilize.


2.3 PCR Amplification of Custom-Designed Oligonucleotide Pool and dsDNA Quantification


1. Oligonucleotide pool from commercial provider.
2. Forward and reverse primers complementary to the constant regions of the oligonucleotide pool.
3. Phusion high-fidelity PCR master mix with HF buffer.
4. GeneRuler 50 bp DNA ladder.
5. QIAquick nucleotide removal kit.
6. Quant-iT™ PicoGreen™ dsDNA assay kit.
7. TE buffer: 10 mM Tris–HCl (pH 8.0), 0.1 mM EDTA.

2.4 Synthesis of dsDNA Phagemid Library

1. ATP: 10 mM in H2O. Filter sterilize, aliquot, and freeze.
2. DTT: 100 mM in H2O. Filter sterilize, aliquot, and freeze.
3. 10× TM buffer: 0.1 M MgCl2, 0.5 M Tris-Cl (pH 7.5). Filter sterilize.
4. T4 polynucleotide kinase.
5. dNTP mix: 10 mM (pH 7.5).
6. T4 DNA ligase.
7. T7 DNA polymerase.
8. FastDigest SmaI.
9. QIAquick gel extraction kit.
10. Exonuclease I.

2.5 Electroporation and Propagation of the ProP-PD Library

1. E. coli SS320 cells preinfected with M13KO7.
2. 0.1 cm gap electrode cuvettes.
3. Bio-Rad gene pulser electroporation device.
4. SOC medium: Dissolve 5 g yeast extract, 20 g tryptone, 5 g NaCl, and 2 g KCl in 900 mL water. Adjust pH to 7.0 with NaOH and then the volume to 975 mL. Autoclave. Add 5 mL of filter-sterilized 2 M MgCl2 and 20 mL of filter-sterilized 1 M glucose.

2.6 High-Throughput Expression and Purification

1. Genes of bait proteins cloned in pETM-33 or pETM-41 vectors.
2. E. coli BL21 (DE3).
3. MagicMedia™ E. coli expression medium (Invitrogen).
4. 96-well mini tube system 0.6 mL (blue box) (Axygen).
5. 96-deep-well plate (Axygen).
6. Resuspension buffer: 5 mM imidazole, 5% glycerol, EDTA-free complete protease inhibitor cocktail (Roche), in PBS.
7. Lysis buffer: 1% (v/v) Triton X-100, 10 mg/mL lysozyme, 1 μg/mL DNase I in resuspension buffer.
8. Washing buffer: 30 mM imidazole, 5% (v/v) glycerol in PBS.
9. Elution buffer: 250 mM imidazole, 5% (v/v) glycerol in PBS.
10. Ni Sepharose or Ni-NTA agarose.
11. Filter bottom microplates.
12. Corning 96-well, nonbinding surface, polystyrene plate.
13. Quick Start™ Bradford protein assay (Bio-Rad).

2.7 ProP-PD Selections and Pooled Phage ELISA

1. Nunc MaxiSorp™ flat-bottom 96-well plates.
2. E. coli OmniMax: F′ {proAB lacIq lacZΔM15 Tn10(TetR) Δ(ccdAB)} mcrA Δ(mrr hsdRMS-mcrBC) Φ80(lacZ)ΔM15 Δ(lacZYA-argF)U169 endA1 recA1 supE44 thi-1 gyrA96 relA1 tonA panD.
3. Blocking solution: 0.5% BSA in PBS. Filter sterilize.
4. IPTG (isopropyl β-D-1-thiogalactopyranoside): 1 M in H2O. Filter sterilize.
5. Anti-M13 coat HRP-conjugated antibody (Sino Biological).
6. TMB microwell peroxidase substrate.
7. 0.6 M H2SO4: Dilute 50 mL of 12 M H2SO4 with H2O to a final volume of 1 L.

2.8 Preparing Samples for Next-Generation Sequencing (NGS)

1. Illumina MiSeq service.
2. NGS-grade oligonucleotides.
3. Phusion high-fidelity PCR master mix with HF buffer.
4. Mag-Bind TotalPure NGS.
5. QIAquick PCR purification kit.
6. QIAquick gel extraction kit.
7. Quant-iT™ PicoGreen™ dsDNA assay kit.

3 Methods

3.1 Phage Library Construction

The first step in ProP-PD is to design an oligonucleotide library based on a proteome of interest and to obtain the custom-designed oligonucleotide library from a commercial provider (Fig. 1). The library design depends on the research question, which is beyond the scope of this protocol. We have, for example, designed a phage library that displays IDRs of the human proteome using hexadecameric peptides that overlap by seven amino acids [16] and a C-terminal ProP-PD library of nonameric peptides that displays all C-termini of the proteome that hold known or predicted phosphosites [21]. For those interested, our published library designs can be obtained upon request. As part of the design process, constant flanking regions are added to all oligonucleotides in the library (Figs. 1 and 2). They are used for library construction and later for PCR amplification for sequencing purposes. The obtained oligonucleotide library is genetically fused to the major coat protein P8 of the filamentous M13 phage using a modified version of Kunkel mutagenesis [23]. Following this protocol, a phagemid is transformed into E. coli CJ236 cells that lack functional dUTPase and uracil-N-glycosylase. The phagemid-containing bacteria are infected with M13KO7 helper phage, which leads to the packing of single-stranded phagemid DNA (ssDNA), with uracil incorporated instead of thymine, into phage particles that are secreted into the medium. The dU-ssDNA is purified from the medium and used as template for synthesis of heteroduplex, double-stranded DNA (dsDNA) using the PCR-amplified and phosphorylated oligonucleotide library. The dsDNA library is then electroporated into E. coli SS320 cells preinfected with M13KO7. The bacteria are grown overnight, which leads to the production and amplification of the phage library that can then be harvested, as detailed below.

Fig. 2 Generation of a ProP-PD library. A custom oligonucleotide library is designed based on the IDRs of a proteome of interest. Constant regions flanking the sequences of interest are introduced; these regions are used for cloning and as primer targets for amplification. The design is synthesized by a company and later amplified and used to create a phagemid library, which is electroporated into E. coli (preinfected with M13KO7 helper phage) that produces and amplifies the phage library
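The tiling design mentioned above (hexadecameric peptides overlapping by seven amino acids) can be illustrated with a short Python sketch. This is not the published design script: the function name is hypothetical, and the handling of region ends (adding a final peptide flush with the C-terminal end, skipping regions shorter than one peptide) is an assumption made for illustration.

```python
def tile_peptides(region: str, length: int = 16, overlap: int = 7) -> list[str]:
    """Cut a disordered region into overlapping peptides of fixed length."""
    if not 0 <= overlap < length:
        raise ValueError("overlap must be non-negative and smaller than length")
    if len(region) < length:
        return []  # assumption: regions shorter than one peptide are skipped
    step = length - overlap
    last_start = len(region) - length
    starts = list(range(0, last_start + 1, step))
    # Assumption: add one final peptide so the end of the region is covered.
    if starts[-1] != last_start:
        starts.append(last_start)
    return [region[s:s + length] for s in starts]


# Hypothetical 25-residue region: two peptides sharing seven residues
example = "ACDEFGHIKLMNPQRSTVWYACDEF"
for peptide in tile_peptides(example):
    print(peptide)
```

With the default settings, consecutive peptides start nine residues apart, so each residue in the interior of a region is displayed in at least one peptide.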


3.1.1 Purification of dU-ssDNA

The following is a modified version of the QIAprep Spin M13 kit protocol for purification of dU-ssDNA, based on a previously described protocol [24]. The resulting ssDNA should be sufficient for the construction of a number of ProP-PD libraries of average size (10,000–1 million sequences).

1. Transform an appropriate phagemid vector into E. coli CJ236, and grow overnight at 37 °C on an LB plate containing 100 μg/mL carbenicillin.
2. The following day, inoculate a single colony into 1 mL 2YT media supplemented with M13KO7 helper phage (10^11 pfu/mL), and grow for 2 h at 37 °C with 200 rpm shaking. Use 100 μg/mL carbenicillin to maintain the phagemid and 25 μg/mL chloramphenicol to maintain the F′ episome (see Note 1).
3. Add 30 μg/mL kanamycin to select for bacteria infected with M13KO7 helper phage, and allow the culture to grow for 6 h at 37 °C with 200 rpm shaking.
4. Transfer the culture into 30 mL of 2YT supplemented with 100 μg/mL carbenicillin, 30 μg/mL kanamycin, and 0.025 μg/mL uridine. Incubate at 37 °C with 200 rpm shaking for 20 h.
5. Pellet the bacteria by centrifugation at 10,000 × g and 4 °C for 20 min. Transfer the phage-containing supernatant to a fresh 50 mL tube with conical bottom.
6. To precipitate the phages, add 7.5 mL of PEG/NaCl to the phage-containing supernatant, and incubate for 5 min on ice.
7. Centrifuge for 20 min at 12,000 × g to pellet the phage.
8. Remove the supernatant and spin for 2 min at 4000 × g. Aspirate the residual supernatant.
9. Resuspend the phage pellet in 0.5 mL PBS. Transfer the solution to a fresh 1.5 mL tube. Clear the solution of insoluble debris by centrifugation at 15,000 × g for 5 min. Transfer the supernatant to a clean tube.
10. To lyse the phages, add 700 μL MP buffer (Qiagen), mix by pipetting, and incubate for 2 min at RT.
11. Transfer the lysed phage solution to a QIAprep spin column in a fresh tube, and centrifuge at 6000 × g for 30 s. The phage DNA will be bound to the column matrix. Discard the flow-through.
12. Add 700 μL of MLB buffer to the column and spin it at 6000 × g for 1 min. Discard the flow-through. Repeat this step once more.
13. Finally, wash the DNA by adding 700 μL of PE buffer to the column, and spin it at 6000 × g for 1 min. Centrifuge again at 15,000 × g for 1 min to remove residual buffer from the column.
14. To elute the dU-ssDNA, place the clean column in a fresh 1.5 mL tube, add 100 μL of EB buffer from the QIAquick PCR purification kit to the column, incubate for 5 min at room temperature, and then centrifuge at 15,000 × g for 2 min.
15. Confirm the quality of the ssDNA by 1% (w/v) agarose gel electrophoresis using 1× TBE buffer and GelRed dye (see Note 2).
16. Quantify the DNA by absorbance at 260 nm using a NanoDrop.

3.1.2 PCR Amplification of the Custom-Designed Oligonucleotide Pool and dsDNA Quantification

A custom-designed oligonucleotide library encoding the desired peptide sequences can be obtained from a commercial provider (Fig. 2). The oligonucleotide pool is typically PCR amplified before being used for phage library construction as described below.

1. Prepare eight 50 μL PCR reactions each containing 0.5 μM of forward and reverse primers complementary to the constant regions of the ordered oligonucleotide library (Fig. 2), 80 ng of ssDNA template, and 25 μL of Phusion high-fidelity PCR master mix.
2. Use a thermocycler with this profile: denaturation at 98 °C for 10 s, annealing at 56 °C for 15 s, and amplification at 72 °C for 10 s. Repeat 18 times (see Note 3).
3. Verify the product by 2% agarose gel electrophoresis using a 50 bp ladder.
4. Pool four of the reactions and purify them using the QIAquick nucleotide removal kit. Follow the manufacturer's instructions but elute using 30 μL of elution buffer. This will remove excess salts and enzymes.
5. Quantify the amount of PCR product in duplicate using the Quant-iT PicoGreen dsDNA assay kit following the manufacturer's guidelines:
(a) Dilute the PicoGreen dye 1:400 in TE buffer.
(b) Prepare the lambda DNA standard as a twofold serial dilution in TE buffer, ranging from 100 ng/μL to 1.56 ng/μL.
(c) Mix 1 μL of DNA standard from each dilution with 25 μL of dye. Also mix 1 μL of the 10× diluted PCR-amplified sample with 25 μL of dye. Include a no-DNA control.
(d) Incubate for 5 min at room temperature, and read the fluorescence of the samples using a qPCR machine. Excitation, 480 nm; emission, 520 nm.
(e) Generate a standard curve based on the readings of the lambda DNA, and calculate the concentration of the PCR product based on the linear fit of the standard.
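The standard-curve calculation in step 5(e) can be sketched in Python as follows. This is a minimal illustration under stated assumptions: the fluorescence values are synthetic (generated from an arbitrary line), the function names are hypothetical, and an ordinary least-squares fit is used as one reasonable reading of "linear fit of the standard".

```python
def fit_line(x: list[float], y: list[float]) -> tuple[float, float]:
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(x)
    if n < 2 or n != len(y):
        raise ValueError("need at least two paired points")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    if sxx == 0:
        raise ValueError("standards must span more than one concentration")
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def sample_concentration(fluorescence: float, slope: float, intercept: float,
                         dilution_factor: float = 10.0) -> float:
    """Convert a fluorescence reading to ng/uL of the undiluted sample."""
    return (fluorescence - intercept) / slope * dilution_factor


# Twofold dilution series from 100 to ~1.56 ng/uL, plus the no-DNA control
standards = [100 / 2 ** i for i in range(7)] + [0.0]
# Synthetic readings for illustration only (arbitrary units)
readings = [20.0 * c + 50.0 for c in standards]

slope, intercept = fit_line(standards, readings)
conc = sample_concentration(1050.0, slope, intercept, dilution_factor=10.0)
print(f"PCR product: {conc:.1f} ng/uL")
```

Because the standards and the sample are mixed with dye in the same 1:25 ratio, readings can be compared directly; only the tenfold pre-dilution of the sample needs to be corrected for. Readings outside the range of the standards should be treated with caution.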


3.1.3 Synthesis of a dsDNA Phagemid Library

After PCR amplification and quantification, the oligonucleotide library is phosphorylated, annealed to ssDNA, and converted into a ccc-dsDNA (covalently closed circular dsDNA) library (Fig. 2). The remaining PCR primers from the amplification step are removed directly before phosphorylation by treating the sample with Exonuclease I.

1. Enzymatically remove the PCR primers from 0.6 μg of PCR-amplified oligonucleotide library using 10 units of Exonuclease I at 37 °C for 1 h followed by heat inactivation at 80 °C for 15 min. Flash cool on ice and use directly for the next step.
2. To phosphorylate, add 2 μL of 10 mM ATP, 1 μL of 100 mM DTT, and 2 μL of 10× TM buffer. Add 10 units of T4 polynucleotide kinase and H2O to a final volume of 20 μL. Incubate for 1 h at 37 °C.
3. For annealing to the template ssDNA, add the phosphorylated library to 10 μg of dU-ssDNA and 25 μL of 10× TM buffer, and adjust the reaction volume to 250 μL using H2O. Incubate at 90 °C for 3 min, 50 °C for 3 min, and 20 °C for 5 min.
4. Finally, add 10 μL of 10 mM ATP, 10 μL of 10 mM dNTP mix, 15 μL of 100 mM DTT, 30 Weiss units of T4 DNA ligase, and 30 units of T7 DNA polymerase to the annealed mixture, and incubate at 20 °C for a maximum of 16 h.
5. Heat inactivate the enzymes at 70 °C for 15 min, and digest the wild-type template with 10 units of FastDigest SmaI for 1 h at 37 °C (see Note 4).
6. To purify the product, add 1 mL of QG buffer, and load it onto two QIAquick spin columns. Centrifuge at 16,000 × g for 1 min and discard the flow-through. Add 750 μL of PE buffer to each column. Spin at 13,000 rpm for 1 min followed by a brief spin to remove residual buffer. Elute the DNA using 35 μL of ultrapure H2O.
7. To analyze the conversion of ssDNA to ccc-dsDNA, compare the starting template to the product by gel electrophoresis (1% agarose gel, 45 min, 80 mA) stained with GelRed. Most, if not all, of the starting template should have been converted to product. The dsDNA library can be frozen or used directly for electroporation.

3.1.4 Electroporation and Propagation of Phage Library


Once the ccc-dsDNA library is ready, it is electroporated into highly electro-competent E. coli SS320 cells preinfected with M13KO7 (Fig. 2), prepared as described elsewhere [24, 25]. The E. coli SS320 strain is used because of its very high transformation efficiency.

1. Thaw the supercompetent cells on ice. Mix them gently with the DNA, and transfer the mixture to an ice-cold electrode cuvette.
2. Electroporate using the following settings: 200 Ω, 1.8 kV, and 25 μF.
3. Recover the cells immediately using 1 mL of SOC medium pre-warmed to 37 °C, and add them to 23 mL of pre-warmed SOC medium. Wash the cuvette again with 1 mL of SOC medium to recover the rest.
4. Let the bacteria recover by incubating at 37 °C for 30 min with 200 rpm shaking.
5. After exactly 30 min, titrate the number of electroporated bacteria by making serial dilutions of the bacteria and spotting them on LB agar plates containing 100 μg/mL carbenicillin as selection marker. Incubate at 37 °C overnight.
6. Transfer the incubated culture to 500 mL 2YT media supplemented with 100 μg/mL carbenicillin and 30 μg/mL kanamycin. Incubate overnight at 37 °C.
7. Count the number of bacterial colonies from the titration in step 5. Calculate the total number of electroporated bacteria (i.e., in the full 25 mL of medium) that survived growth in the presence of carbenicillin and thus were successfully transformed with the phagemid (see Note 5). The number of transformants provides an upper limit of the library diversity. For a ProP-PD library, we aim for 100–1000 times more transformants than the size of the library design to ensure good coverage of the library design in the physically generated phage library.
8. Pellet the bacteria from step 6 by centrifugation at 8500 × g and 4 °C for 10 min.
9. Add the supernatant to a fresh tube containing PEG/NaCl corresponding to 1/5 of the final volume. Incubate on ice for 5 min.
10. Precipitate the phages by centrifugation at 16,000 × g and 4 °C for 20 min (see Note 6).
11. Dissolve the phage pellet in 20 mL of PBS containing 0.05% Tween-20 and 0.2% BSA.
12. Remove insoluble matter by centrifugation at 27,000 × g for 20 min. Transfer to a fresh tube and repeat if the solution is still turbid.
13. Measure OD268 of the dissolved phages to estimate the phage concentration (see Note 7).
14. To estimate the phage concentration more accurately, serially dilute the phages, and add them to log-phase E. coli OmniMax cells. Incubate for 30 min at 37 °C for infection, and plate them on LB agar plates containing 100 μg/mL carbenicillin as selection marker (see Note 5).
15. The phage library can be stored at −80 °C for long-term storage after addition of 10% glycerol. The library can be used directly or reamplified before use as described elsewhere [26].

3.2 High-Throughput Expression and Purification of Bait Proteins

The ProP-PD protocol described here is for high-throughput selections, which can be performed for hundreds of proteins in parallel (Fig. 3). Bait proteins can be purified following regular low-throughput protocols and then arrayed before the experiment, or they can be expressed and purified following the 96-well format HTP protocol outlined here. About 100 μg of protein is needed for one full selection (without replicates), which can be purified from as little as 2–4 mL of bacterial culture, although we recommend performing replicate selections. Typically, we use His-MBP or His-GST tagged bait proteins, expressed from the pETM-33 or pETM-41 vectors, as well as His-MBP and His-GST alone as controls. The HTP protein expression and purification protocol is a modified version of the protocol described by Huang and Sidhu [22].

3.2.1 Protein Expression

1. Transform the vectors (pETM-33 or pETM-41) encoding the bait proteins into E. coli BL21(DE3) cells, and plate them on LB agar plates containing 30 μg/mL kanamycin. Incubate at 37 °C overnight.
2. The following day, refresh the growth of the bacteria by inoculating them in 300 μL 2YT media in a 96-well mini tube system (or in a 96-deep-well plate) supplemented with 30 μg/mL kanamycin. Incubate at 37 °C with 200 rpm shaking for 16 h.
3. Prepare two 96-deep-well plates by adding 1.2 mL autoinducing MagicMedia containing 30 μg/mL kanamycin. Inoculate each well with 10 μL of overnight culture (see Note 8).
4. Let the bacteria grow for 6 h at 37 °C with 200 rpm shaking. Lower the temperature to 25 °C and let the protein expression proceed for 20 h.
5. Pool the cultures of each protein from both plates, and collect the bacterial pellet by centrifuging at 2000 × g for 20 min.
6. At this stage, bacterial pellets can be stored at −80 °C.


Fig. 3 High-throughput purification of bait proteins. Constructs in a suitable vector are transformed into E. coli BL21(DE3). The transformed bacteria are arrayed in the plate. Proteins are expressed in 96-well format, purified using a suitable resin in 96-well filter bottom plates and a vacuum manifold, and eluted in a nonbinding plate

3.2.2 Protein Purification

1. Thaw the pellets on ice, and resuspend them by adding 250 μL of resuspension buffer to each well and gently pipetting up and down. It is recommended to save 5 μL of suspension in a 96-well PCR plate at 25 °C to allow for confirmation of the identity of the expressed proteins through sequencing if needed (see Note 9).
2. To lyse the cells, add 750 μL of lysis buffer to the suspension; mix and incubate for 45 min at 4 °C with gentle shaking.
3. To separate the debris, centrifuge the plates at 2500 × g and 4 °C for 30 min. Save 10 μL of the lysates to allow for analysis of expression levels.
4. Transfer approximately 850 μL of the lysates into 25 μm membrane-filter bottom plates containing 100 μL of resuspended Ni Sepharose. The lower end of the plate is sealed using three layers of parafilm, and the top is sealed using adhesive aluminum foil (see Note 10).
5. Let the proteins bind for 2 h at 4 °C with gentle shaking.
6. Fit the filter plate into a plate vacuum manifold, and remove the lysate by applying vacuum for a few seconds.
7. Wash the beads three times, each time by adding 1 mL of washing buffer and applying vacuum.
8. To elute the bound proteins, seal the bottom of the plate, add 200 μL elution buffer, mix, and incubate for 5 min at RT. Elute the proteins into a nonbinding 96-well plate using the vacuum manifold.
9. Analyze the purity of the proteins using SDS-PAGE, and quantify the amount of each protein using the Bradford assay.
10. Use the proteins directly as baits for phage selections (preferred), or add 10–15% glycerol, and store them at −80 °C until further use.

3.3 HTP ProP-PD Selection and Pooled Phage ELISA

The ProP-PD protocol described here is based on the lower-throughput protocol described in our original method paper [16] and an HTP phage display protocol described elsewhere [22]. We recommend performing replicate selections in which the bait proteins are arrayed according to different layouts. To avoid selecting phages based on nonspecific binding to the plate or to the MBP/GST tags, phage pools are first incubated with immobilized negative controls (GST, MBP, or similar) before being transferred to the bait proteins. On the final day of selection (day 4), we typically immobilize proteins for phage pool ELISA, which is then performed the following day. Phage pool ELISA is a way to evaluate the outcome of the selection (described in Subheading 3.3.2). Typically, the enrichment of binding phage saturates after 3–4 days of selection, which means that the full protocol can be performed in a week.

3.3.1 Phage Display Selection

Day 0:
1. Coat the plates with bait proteins and negative controls by adding 10–20 μg of protein in 100 μL of PBS to each well of a 96-well MaxiSorp plate. Cover the plates with adhesive plastic sealing, and incubate overnight at 4 °C with shaking (see Note 11).
2. Inoculate a freshly streaked single colony of E. coli OmniMax from an LB agar plate (tetracycline) in 5 mL 2YT medium supplemented with 10 μg/mL tetracycline. Incubate overnight at 37 °C with 200 rpm shaking. This bacterial culture is then used to start new cultures throughout the selection days.

Day 1:
3. Remove the protein solutions from the MaxiSorp plates. Block the wells by adding 200 μL of blocking solution to each well, and incubate for 1 h at 4 °C under gentle agitation.
4. Inoculate 15 μL of the overnight culture of E. coli OmniMax into three separate 50 mL tubes containing 15 mL 2YT supplemented with 30 μg/mL kanamycin, 100 μg/mL carbenicillin, or 10 μg/mL tetracycline. Incubate at 37 °C with 200 rpm shaking (see Note 12).
5. To prepare the phage library for selection, dilute the appropriate volume of the phage library 20 times with PBS, and add PEG/NaCl to 1/5 of the final volume (see Note 13). Incubate on ice for 10 min, and precipitate the phages by centrifuging for 10 min at 14,000 × g. Remove the supernatant and spin the tube again for 2 min at 2000 × g. Aspirate the residual supernatant. Resuspend the phage pellet in an appropriate volume of PT buffer.
6. After 1 h, remove the blocking solution from the preselection wells.
7. Add 100 μL of phage library to each preselection well. Incubate for 1 h at 4 °C while shaking. Nonspecific phages will bind to the controls and are thus removed at this step.
8. Remove the blocking solution from the bait protein-coated wells, and wash them four times with 200 μL PBST buffer. Transfer the phage library from the control wells to the bait wells. Allow the phages to bind for 2 h at 4 °C while shaking (see Note 14).
9. After the incubation, remove unbound phages, and wash the wells four times with PBST buffer.
10. To elute the bound phages, add 100 μL of the log-phase E. coli OmniMax culture to each well. Cover the plate with gas-permeable sealing, and incubate at 37 °C for 30 min under agitation (see Note 15).
11. During this incubation, the phage library used as in-phage can be titered by serially diluting the library in 2YT media and then adding 5 μL of each dilution to 45 μL of log-phase E. coli OmniMax. Allow the phage to infect the bacteria for 30 min at 37 °C and then spot 5 μL from each dilution onto an LB agar plate supplemented with 100 μg/mL carbenicillin. Incubate the plates overnight at 37 °C, and then count colonies to calculate the "in-phage" number the following day (see Note 5).
12. After the 30 min elution, withdraw 5 μL of the bacterial solution to titer the eluted phage (out-phage). Add 10 μL M13KO7 helper phage (10^11 pfu/mL) to each well. Incubate the plates again for 45 min at 37 °C.
13. During this incubation, perform the out-phage titration using 2YT, and plate the dilutions on LB plates supplemented with 100 μg/mL carbenicillin to calculate the "out-phage" numbers the following day.
14. After 45 min of incubation, transfer the growing bacterial cultures into 1 mL 2YT media supplemented with 30 μg/mL kanamycin, 100 μg/mL carbenicillin, and 0.3 mM IPTG, and incubate at 37 °C overnight with 200 rpm shaking.

Days 2–4:
Repeat the phage selection for 4 days. The protocol is the same as for day 1 but uses harvested out-phage from the previous day instead of the phage library.
15. To harvest the phages from the previous day of selection, pellet the bacteria by centrifugation at 2000 × g for 25 min. Carefully aspirate 800 μL of phage supernatant from each well, and transfer it to a fresh 96-deep-well plate. Heat inactivate the remaining bacteria at 65 °C for 20 min. Let the plate cool down by placing it on ice for 5–10 min. Readjust the pH by addition of 88 μL of 10× PBS. Use 100 μL of the out-phage as in-phage for the next day of selection.
16. Withdraw 100 μL of out-phage from each well and place it into a 96-well PCR plate. Store in a −20 °C freezer.

3.3.2 Phage Pool ELISA (Day 5)

Phage pool ELISA is used to determine whether phage selection against a particular protein has been successful, i.e., whether the ratio of the bait protein ELISA signal to that of the control has increased over the days of selection. We recommend performing the ELISA directly after the phage display experiment is finished, starting on day 4 of the phage display experiment.

1. Immobilize 5–10 μg of bait and control proteins in 100 μL of PBS per well per selection day. Importantly, immobilize the negative control (GST/MBP) on the same plate as the bait protein to ensure that the signals are comparable. Incubate at 4 °C with shaking for 16 h.
2. Remove the protein solution, and block the wells by adding 200 μL blocking solution to each well. Incubate for 1 h at 4 °C with shaking.
3. Remove the blocking solution, and wash each well four times using 200 μL PBST buffer.
4. After washing, add 100 μL of amplified out-phage from each day of selection to the bait and control wells. Incubate for 2 h at 4 °C while shaking.
5. Remove the phage solution, and wash away unbound phage particles five times using 200 μL PBST buffer.
6. Add 100 μL of anti-M13 HRP-conjugated antibody diluted 5000 times in blocking solution. Incubate for 30 min at RT.
7. Wash away unbound antibody four times using 200 μL PBST. Wash one last time using 200 μL of PBS. Make sure to remove all the liquid by tapping the plate on paper.
8. To develop, add 100 μL of mixed TMB substrate to each well, and develop until a blue color appears (see Note 16).
9. Stop the reaction with 100 μL of 0.6 M H2SO4. This will turn the blue color yellow.
10. Record the absorbance of each well at 450 nm. Calculate the ratio between the A450 measured for the bait protein and that of the negative control. A ratio of more than 2 is indicative of a successful selection.

3.4 Preparing Samples for NGS

Phage pools from successful selections are used as templates for the generation of amplicons for analysis through NGS using Illumina MiSeq.
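Which selections count as successful follows from the A450 ratio criterion of Subheading 3.3.2 (bait signal divided by negative-control signal, with a ratio above 2 indicating success). A minimal Python sketch of that filter is given below. The function names, bait labels, and absorbance values are hypothetical and serve only as illustration.

```python
def elisa_ratio(bait_a450: float, control_a450: float) -> float:
    """Ratio of the bait signal to the negative-control signal."""
    if control_a450 <= 0:
        raise ValueError("control absorbance must be positive")
    return bait_a450 / control_a450


def successful_selections(readings: dict[str, tuple[float, float]],
                          threshold: float = 2.0) -> dict[str, float]:
    """Keep baits whose bait/control A450 ratio exceeds the threshold."""
    ratios = {bait: elisa_ratio(b, c) for bait, (b, c) in readings.items()}
    return {bait: r for bait, r in ratios.items() if r > threshold}


# Hypothetical day-4 readings: bait -> (A450 bait well, A450 control well)
day4 = {
    "bait_01": (1.20, 0.10),
    "bait_02": (0.15, 0.12),
    "bait_03": (0.60, 0.25),
}
for bait, ratio in successful_selections(day4).items():
    print(f"{bait}: ratio {ratio:.1f}")
```

In this made-up example, bait_01 and bait_03 pass the threshold and would be carried forward to sequencing, whereas bait_02 would not.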

3.4.1 Barcoding and Amplification

The first step is to amplify the DNA region encoding the peptides displayed on the P8 protein. The NGS-grade primers used for the amplification should be designed to contain Illumina adapters and unique barcodes (short oligonucleotide sequences). We use a barcoding strategy previously outlined by McLaughlin and Sidhu [25], but NGS facilities can typically provide advice on how to design barcodes and adapters. The steps recommended for the amplification and barcoding are outlined below.

1. Array sequencing primers in a 96-well PCR plate to match the number of phage pools to be sequenced from the high-throughput selection. Ensure that each pool gets a unique combination of barcodes so that it can be identified after sequencing.
2. Mix 0.5 μM of each primer, 2.5 μL phage pool, and H2O to a volume of 12.5 μL. Add 12.5 μL of Phusion HF master mix, and put the mixture directly into a preheated thermocycler (98 °C).
3. Denature the phage particles at 98 °C for 3 min. Then amplify the peptide-coding region for 20 cycles of (a) denaturation at 98 °C for 10 s, (b) annealing at 68 °C for 10 s, and (c) amplification at 72 °C for 12 s. Add a final extension step at 72 °C for 5 min.
4. Directly after finishing the program, remove the PCR plate, withdraw 1 μL of the reaction for analysis through agarose gel electrophoresis, and put the plate at −20 °C.
5. Analyze the amplified products using 2% agarose gel electrophoresis to verify the correct size of the generated amplicons.
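The requirement in step 1 that every pool receives a unique barcode combination can be illustrated with a small Python sketch. It assumes a simple row-by-column scheme (one forward barcode per plate row, one reverse barcode per column), which is only one possible layout and not necessarily the strategy of ref. 25. The barcode identifiers are placeholders, not real sequences.

```python
ROWS = "ABCDEFGH"
COLUMNS = range(1, 13)


def assign_barcodes(forward: list[str], reverse: list[str]) -> dict[str, tuple[str, str]]:
    """Map each well of a 96-well plate to a (forward, reverse) barcode pair."""
    if len(set(forward)) < len(ROWS) or len(set(reverse)) < len(COLUMNS):
        raise ValueError("need 8 distinct forward and 12 distinct reverse barcodes")
    layout = {}
    for row, fwd in zip(ROWS, forward):
        for col, rev in zip(COLUMNS, reverse):
            layout[f"{row}{col}"] = (fwd, rev)
    return layout


# Placeholder barcode names (hypothetical); replace with real primer barcodes
forward_barcodes = [f"F{i:02d}" for i in range(1, 9)]
reverse_barcodes = [f"R{i:02d}" for i in range(1, 13)]

plate = assign_barcodes(forward_barcodes, reverse_barcodes)
# Every well must carry a distinct combination
assert len(set(plate.values())) == len(plate) == 96
print(plate["A1"], plate["H12"])
```

Pooling several plates into one sequencing library would require barcode pairs that are also unique across plates, for example by using a different set of forward barcodes for each plate.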


3.4.2 Normalization, Pooling, and Cleaning of Amplified Products

Normalization of Samples

After the peptide-coding regions of the binding-enriched phage pools for each bait protein have been PCR amplified and barcoded, the resulting amplicons are pooled into a library for sequencing. We pool on the order of 500 amplicons into one library for MiSeq NGS analysis. Since there can be biases during the amplification of the templates from different phage pools, it is recommended to normalize the amplicons before pooling. The protocol detailed below uses Mag-Bind magnetic beads for the normalization, but there are other products that would work equally well.

1. Draw 20 μL of magnetic bead slurry and 20 μL of PCR product from each well, and combine them in a 96-well PCR plate. Mix gently by pipetting 15–20 times.
2. Incubate at RT for 5 min.
3. Place the plate on a magnetic stand, and wait until the solution is clear of beads.
4. Aspirate the supernatant, and wash the beads by adding 200 μL of 70% ethanol to the pellet while the plate remains on the magnetic stand.
5. Aspirate the ethanol and repeat the wash two more times.
6. Remove the 96-well plate from the magnetic stand, and let it air dry until the bead pellets crack (10–20 min).
7. Elute the bound DNA. Carefully add 12.5 μL of EB buffer from the QIAquick PCR purification kit to each magnetic bead pellet, and gently pipette up and down until the pellets have dissolved. Put the 96-well PCR plate on the magnetic stand.

Pooling and Cleaning

1. Pool 10 μL of eluate from each well of the 96-well plate into a 1.5 mL Eppendorf tube. Clear the solution of any remaining beads using the magnetic stand, and transfer the solution to a fresh 1.5 mL Eppendorf tube.
2. Use the QIAquick PCR purification kit to concentrate 50% of the amplicon pool, following the manufacturer's guidelines. Store the rest as a backup. Elute in 50 μL of EB buffer from the QIAquick PCR purification kit.
3. The PCR amplification of the peptide-coding region can result in some nonspecific products of different lengths. We therefore purify the desired products using agarose gel electrophoresis. For this purpose, electrophorese the pooled amplicons alongside a 50 bp ladder using a 2% agarose gel. Carefully excise the band of the correct size. Use the QIAquick gel extraction kit protocol with the following modifications to purify the pooled amplicons from the gel: do not heat the gel in QG buffer or vortex the sample when dissolving the gel slice, as this can result in degradation of small DNA fragments.
4. Elute the product in 30 μL of TE buffer.
5. Quantify the dsDNA concentration using PicoGreen, as described in Subheading 3.1.2.
6. Provide the requested amount and documentation to the NGS facility.

3.5 NGS Data Handling

Depending on the length of the designed library oligonucleotides, sequencing can be done as single reads or read pairs. With our aforementioned design [16], a 1 × 150 base read length is sufficient to cover the library oligonucleotide plus the constant regions and barcodes. After NGS, sequences are matched to the library design to identify the selected ligands. Given the complexity and amount of data generated by high-throughput ProP-PD, a data management plan is advised. Key points in such a plan are collecting all experimental information, naming files consistently, and arranging long-term storage for the raw data files.

3.5.1 NGS Data Demultiplexing and Cleanup

The MiSeq NGS analysis returns 18–22 million reads corresponding to all the pooled (multiplexed) samples. Depending on the NGS service used, reads can be delivered in different file formats, either as a single file containing all the reads or as multiple already demultiplexed files. In our case, we request the information as a single .fastq.gz compressed file, which we then clean up and demultiplex into separate .fasta files, one for each of the 500–600 reactions, using in-house developed Python scripts (Fig. 4). Briefly, we discard reads with low average quality scores (below 20) and reads for which the barcodes cannot be unambiguously determined (allowing a maximum of 1 mismatch per barcode); these account for ~1% and ~30% of the reads, respectively. After demultiplexing the reads, we trim the adapter and barcode sequences and count the number of times each oligonucleotide was found. Finally, oligonucleotides are translated to amino acid sequences, and peptides are matched to the library design. Peptides with unexpected lengths or sequences are discarded. In the end, ~50% of the initial NGS reads are preserved.
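The cleanup steps described above can be illustrated with a minimal Python sketch. It is not the in-house script referred to in the text. The read layout (barcode at the start of the read, peptide-coding insert at a fixed offset and length), the Phred+33 quality encoding, and all function and parameter names are assumptions made for illustration only.

```python
import gzip
from collections import Counter
from itertools import islice

# Standard genetic code, built from the conventional TCAG codon ordering.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {a + b + c: AMINO[16 * i + 4 * j + k]
               for i, a in enumerate(BASES)
               for j, b in enumerate(BASES)
               for k, c in enumerate(BASES)}


def read_fastq(path):
    """Yield (name, sequence, quality) tuples from a .fastq or .fastq.gz file."""
    opener = gzip.open if str(path).endswith(".gz") else open
    with opener(path, "rt") as handle:
        while True:
            record = list(islice(handle, 4))  # one read = four lines
            if len(record) < 4:
                break
            name, seq, _plus, qual = (line.strip() for line in record)
            yield name, seq, qual


def mean_quality(qual, offset=33):
    """Average per-base quality score (assumes Phred+33 encoding)."""
    return sum(ord(ch) - offset for ch in qual) / len(qual)


def mismatches(a, b):
    """Number of differing positions; a length difference counts as mismatches."""
    return sum(x != y for x, y in zip(a, b)) + abs(len(a) - len(b))


def assign_barcode(tag, barcodes, max_mismatch=1):
    """Return the sample name only if exactly one barcode is close enough."""
    hits = [s for s, bc in barcodes.items() if mismatches(tag, bc) <= max_mismatch]
    return hits[0] if len(hits) == 1 else None


def translate(dna):
    """Translate in frame 1; codons with ambiguous bases become 'X'."""
    usable = len(dna) - len(dna) % 3
    return "".join(CODON_TABLE.get(dna[i:i + 3], "X") for i in range(0, usable, 3))


def demultiplex(records, barcodes, insert_start, insert_length,
                min_quality=20, library=None):
    """Count peptides per sample after quality, barcode, and library filters."""
    barcode_length = len(next(iter(barcodes.values())))
    counts = {sample: Counter() for sample in barcodes}
    for _name, seq, qual in records:
        if not qual or mean_quality(qual) < min_quality:
            continue  # low average quality
        sample = assign_barcode(seq[:barcode_length], barcodes)
        if sample is None:
            continue  # barcode missing or ambiguous
        insert = seq[insert_start:insert_start + insert_length]
        if len(insert) != insert_length:
            continue  # unexpected length
        peptide = translate(insert)
        if library is not None and peptide not in library:
            continue  # not part of the library design
        counts[sample][peptide] += 1
    return counts
```

In use, `demultiplex(read_fastq("run.fastq.gz"), barcodes, insert_start, insert_length, library=design)` would return one peptide counter per sample; the file name and the `design` set of expected peptides are placeholders. A real pipeline would typically also handle barcodes at both ends of the amplicon and write per-sample .fasta files.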

3.5.2 Data Analysis and Storage

The generated information for each bait protein is compiled in a table, including which peptides were identified, the counts of each peptide, and which protein they belong to. NGS results vary depending on the bait, the library, the selection day, etc. On average, for one bait protein and one selection, approximately 500 unique peptides are retrieved, with NGS counts from 1 up to 10,000. The low-count peptides are generally background noise, while the true binders are found among the enriched peptides (unique peptides with high counts, see Fig. 4). Successful selections are identified by the presence of highly enriched binding peptides that are found in the replicate selections. Common features, such as motifs shared among the enriched peptides, can then be identified using tools such as MEME [27]. When using relatively small libraries (e.g., 100,000 sequences), the reproducibility in terms of enriched sequences is expected to be high (see Note 17).

Because data must be preserved for reanalysis, future sharing, and publication, it is important to develop a data management plan. This is especially true considering the amount of data generated by NGS. A data structure with unambiguous file naming should be established. We recommend archiving the raw NGS .fastq.gz files for long-term storage together with tables containing all experimental information (bait protein, phage library, phage ELISA results, and barcodes used for each selection), while the .fasta and other table files can be kept in a regular file-sharing system for fast and easy access (Fig. 4a).

Fig. 4 (a) Overview of data analysis and processing. Illumina MiSeq analysis of binding-enriched phage pools returns about 18 million sequences. The summary table of barcodes and templates used, together with barcode information, is used to demultiplex the data. The information is translated into peptide sequences and matched against the library design, resulting in annotated peptides. Replicate selections are combined for analysis of the reproducibility of the selections. Throughout the process, care should be taken to archive the raw data for longer-term storage, while making sure that the processed data can be accessed easily for further analysis. (b) Next-generation sequencing (NGS) data processing. The raw sequencing data is provided by the NGS service as a .fastq.gz compressed file. Each sequence in this file format comprises four lines: a name (orange), the sequence (green, with the barcode and constant regions in purple), a line with extra information that starts with a "+" sign, and the per-base quality scores for the sequence (light blue). Raw data is processed (sequences are demultiplexed, removing those with low average quality and trimming their barcodes and constant regions) and turned into .fasta files. Afterward, a table is built by translating the DNA sequences into amino acids and determining their peptide counts. Finally, each peptide is matched to the library, removing those that present mutations (like the peptide with 871 counts in the figure), and the different replica results are combined (Rep_1 to 3 in the figure) for comparison. This schematic illustrates one possible way of processing NGS data. (c) Hundreds of unique peptides are typically found for each bait protein and selection. Highly binding-enriched peptides have higher NGS counts and are considered higher confidence hits. Most peptides with low counts represent background noise

4 Notes

1. Start with a few colonies, as some of them might not be able to grow. Carbenicillin is recommended as an alternative to ampicillin, as it allows fewer satellite colonies to grow.
2. A single well-defined DNA band is usually visible, occasionally with fainter higher bands that may represent secondary structures.
3. If the enzyme, like Phusion, has high processivity and proofreading activity, hot start will make sure that it does not degrade the single-stranded oligonucleotide library used as template. The number of cycles is kept low to avoid introducing PCR bias.

4. Follow the guidelines of the commercial provider to adjust the amount and time used for the restriction. Typically, less than 20% wild-type template will be left after the Kunkel reaction. The step can be omitted.
5. Make a cell titer, for example, by serially diluting 5 μL of cells into 45 μL of media for 12 dilutions, and then spot 5 μL of each dilution on an LB agar plate containing the corresponding resistance marker. Grow overnight at 37 °C. The next day, count colonies from a dilution where they are visibly well separated. Back-calculate the total number of cells by multiplying the colony count by the dilution factor.
6. Respin the precipitate for 2 min to remove the residual supernatant completely, and pipette it out carefully.
7. Assume that OD268 = 1 is equivalent to 5 × 10^12 phage particles/mL.
8. For most proteins, 1 mL expression media is enough, but 3–4 mL media can be used if needed. Express them in multiple plates.
9. During HTP processing of proteins, samples can be mixed up at any stage of expression or purification. In case of doubt, the saved bacterial cultures can be used to confirm the identity of the expressed protein. Other potential quality measures could include MS analysis of the purified bait proteins.
10. Before adding lysates, check that the filters are intact by applying vacuum. After adding lysates, seal the whole assembly together using rubber bands or similar.
11. If performing duplicate selections, it is recommended to change the way the proteins are arrayed.
12. The culture with kanamycin is used to check for helper phage preinfection, while the culture with carbenicillin is used for library phage control. There should be no growth in these two tubes.
13. Each bait protein should be exposed to 1000 times more phage particles than the size of the library design.
14. Make sure the bacteria are growing and will reach the log phase when the 2 h incubation is over. If you find that the culture is overgrown, dilute it down to an OD600 of 0.3, and allow it to regrow to an OD600 of 0.8.
15. Before eluting, streak some of the E. coli OmniMax culture onto LB agar plates supplemented with carbenicillin, kanamycin, and tetracycline to check for cross-contamination.
16. Shake the plate while developing. It can take from a few seconds up to 10 min. Do not overdevelop the color, as the dynamic range is limited and this will falsely decrease the ratio between the bait and the negative control.
17. If the reproducibility is low, it might indicate that selections have been cross-contaminated or that there has been a mix-up of barcodes. It is then recommended to repeat the selection in duplicate.
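The arithmetic behind Notes 5, 7, and 13 can be written out as a short Python sketch. The conversion factor and the 10-fold dilution series (5 μL into 45 μL) come from the Notes. The function names, and the reading of the dilution step as an exponent of 10, are assumptions made for illustration.

```python
PHAGE_PER_ML_PER_OD268 = 5e12  # Note 7: OD268 = 1 corresponds to 5 x 10^12 phage/mL


def phage_per_ml(od268):
    """Estimate phage concentration (particles/mL) from an OD268 reading."""
    return od268 * PHAGE_PER_ML_PER_OD268


def phage_volume_ml(library_size, od268, fold_coverage=1000):
    """Volume of phage stock (mL) that gives each bait fold_coverage x the library size (Note 13)."""
    return fold_coverage * library_size / phage_per_ml(od268)


def cells_per_ml(colonies, dilution_step, spot_volume_ul=5.0, fold_per_step=10):
    """Back-calculate cells/mL from a spot titer (Note 5).

    dilution_step is the number of serial 10-fold dilutions (5 uL into 45 uL)
    that the counted spot has gone through.
    """
    dilution_factor = fold_per_step ** dilution_step
    return colonies * dilution_factor / (spot_volume_ul / 1000.0)


if __name__ == "__main__":
    # Example: a 100,000-member library and a phage stock with OD268 = 1.0.
    print(phage_volume_ml(100_000, 1.0))
    # Example: 12 colonies counted in the spot from the fourth dilution.
    print(cells_per_ml(12, 4))
```

The example inputs are illustrative; only the library size of 100,000 sequences is taken from the text, where it is given as an example of a relatively small library.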

Acknowledgments

This work was supported by the Swedish Foundation for Strategic Research (grant number SB16-0039) and the Swedish Research Council (grant number 2016-04965).

References

1. Davey NE, Van Roey K, Weatheritt RJ, Toedt G, Uyar B, Altenberg B, Budd A, Diella F, Dinkel H, Gibson TJ (2012) Attributes of short linear motifs. Mol BioSyst 8(1):268–281. https://doi.org/10.1039/c1mb05231d
2. Stein A, Mosca R, Aloy P (2011) Three-dimensional modeling of protein interactions and complexes is going 'omics. Curr Opin Struct Biol 21(2):200–208. https://doi.org/10.1016/j.sbi.2011.01.005
3. Ivarsson Y (2012) Plasticity of PDZ domains in ligand recognition and signaling. FEBS Lett 586(17):2638–2647. https://doi.org/10.1016/j.febslet.2012.04.015
4. Kaneko T, Li L, Li SS (2008) The SH3 domain – a family of versatile peptide- and protein-recognition module. Front Biosci 13:4938–4952. https://doi.org/10.2741/305
5. Ingham RJ, Colwill K, Howard C, Dettwiler S, Lim CS, Yu J, Hersi K, Raaijmakers J, Gish G, Mbamalu G, Taylor L, Yeung B, Vassilovski G, Amin M, Chen F, Matskova L, Winberg G, Ernberg I, Linding R, O'Donnell P, Starostine A, Keller W, Metalnikov P, Stark C, Pawson T (2005) WW domains provide a platform for the assembly of multiprotein networks. Mol Cell Biol 25(16):7092–7106. https://doi.org/10.1128/MCB.25.16.7092-7106.2005
6. Tanoue T, Adachi M, Moriguchi T, Nishida E (2000) A conserved docking motif in MAP kinases common to substrates, activators and regulators. Nat Cell Biol 2(2):110–116. https://doi.org/10.1038/35000065
7. Wu CG, Chen H, Guo F, Yadav VK, McIlwain SJ, Rowse M, Choudhary A, Lin Z, Li Y, Gu T, Zheng A, Xu Q, Lee W, Resch E, Johnson B, Day J, Ge Y, Ong IM, Burkard ME, Ivarsson Y, Xing Y (2017) PP2A-B′ holoenzyme substrate recognition, regulation and role in cytokinesis. Cell Discov 3:17027. https://doi.org/10.1038/celldisc.2017.27
8. Roy J, Cyert MS (2009) Cracking the phosphatase code: docking interactions determine substrate specificity. Sci Signal 2(100):re9. https://doi.org/10.1126/scisignal.2100re9
9. Zhang Q, Shi Q, Chen Y, Yue T, Li S, Wang B, Jiang J (2009) Multiple Ser/Thr-rich degrons mediate the degradation of ci/Gli by the Cul3-HIB/SPOP E3 ubiquitin ligase. Proc Natl Acad Sci U S A 106(50):21191–21196. https://doi.org/10.1073/pnas.0912008106
10. Gouw M, Michael S, Samano-Sanchez H, Kumar M, Zeke A, Lang B, Bely B, Chemes LB, Davey NE, Deng Z, Diella F, Gurth CM, Huber AK, Kleinsorg S, Schlegel LS, Palopoli N, Roey KV, Altenberg B, Remenyi A, Dinkel H, Gibson TJ (2018) The eukaryotic linear motif resource - 2018 update. Nucleic Acids Res 46(D1):D428–D434. https://doi.org/10.1093/nar/gkx1077
11. Tompa P, Davey NE, Gibson TJ, Babu MM (2014) A million peptide motifs for the molecular biologist. Mol Cell 55(2):161–169. https://doi.org/10.1016/j.molcel.2014.05.032
12. Stiffler MA, Chen JR, Grantcharova VP, Lei Y, Fuchs D, Allen JE, Zaslavskaia LA, MacBeath G (2007) PDZ domain binding selectivity is optimized across the mouse proteome. Science 317(5836):364–369. https://doi.org/10.1126/science.1144592
13. Teyra J, Huang H, Jain S, Guan X, Dong A, Liu Y, Tempel W, Min J, Tong Y, Kim PM, Bader GD, Sidhu SS (2017) Comprehensive analysis of the human SH3 domain family reveals a wide variety of non-canonical specificities. Structure 25(10):1598–1610.e1593. https://doi.org/10.1016/j.str.2017.07.017
14. Tonikian R, Zhang Y, Sazinsky SL, Currell B, Yeh JH, Reva B, Held HA, Appleton BA, Evangelista M, Wu Y, Xin X, Chan AC, Seshagiri S, Lasky LA, Sander C, Boone C, Bader GD, Sidhu SS (2008) A specificity map for the PDZ domain family. PLoS Biol 6(9):e239. https://doi.org/10.1371/journal.pbio.0060239
15. Blikstad C, Ivarsson Y (2015) High-throughput methods for identification of protein-protein interactions involving short linear motifs. Cell Commun Signal 13:38. https://doi.org/10.1186/s12964-015-0116-8
16. Davey NE, Seo MH, Yadav VK, Jeon J, Nim S, Krystkowiak I, Blikstad C, Dong D, Markova N, Kim PM, Ivarsson Y (2017) Discovery of short linear motif-mediated interactions through phage display of intrinsically disordered regions of the human proteome. FEBS J 284(3):485–498. https://doi.org/10.1111/febs.13995
17. Ivarsson Y, Arnold R, McLaughlin M, Nim S, Joshi R, Ray D, Liu B, Teyra J, Pawson T, Moffat J, Li SS, Sidhu SS, Kim PM (2014) Large-scale interaction profiling of PDZ domains through proteomic peptide-phage display using human and viral phage peptidomes. Proc Natl Acad Sci U S A 111(7):2542–2547. https://doi.org/10.1073/pnas.1312296111
18. Garrido-Urbani S, Garg P, Ghossoub R, Arnold R, Lembo F, Sundell GN, Kim PM, Lopez M, Zimmermann P, Sidhu SS, Ivarsson Y (2016) Proteomic peptide phage display uncovers novel interactions of the PDZ1-2 supramodule of syntenin. FEBS Lett 590(1):3–12. https://doi.org/10.1002/1873-3468.12037
19. Wigington C et al (2019) Systematic discovery of short linear motifs decodes calcineurin phosphatase signaling. https://www.biorxiv.org/content/10.1101/632547v1
20. Kataria M, Mouilleron S, Seo MH, Corbi-Verge C, Kim PM, Uhlmann F (2018) A PxL motif promotes timely cell cycle substrate dephosphorylation by the Cdc14 phosphatase. Nat Struct Mol Biol 25(12):1093–1102. https://doi.org/10.1038/s41594-018-0152-3
21. Sundell GN, Arnold R, Ali M, Naksukpaiboon P, Orts J, Guntert P, Chi CN, Ivarsson Y (2018) Proteome-wide analysis of phospho-regulated PDZ domain interactions. Mol Syst Biol 14(8):e8129. https://doi.org/10.15252/msb.20178129
22. Huang H, Sidhu SS (2011) Studying binding specificities of peptide recognition modules by high-throughput phage display selections. Methods Mol Biol 781:87–97. https://doi.org/10.1007/978-1-61779-276-2_6
23. Kunkel TA (1985) Rapid and efficient site-specific mutagenesis without phenotypic selection. Proc Natl Acad Sci U S A 82(2):488–492
24. Rajan S, Sidhu SS (2012) Simplified synthetic antibody libraries. Methods Enzymol 502:3–23. https://doi.org/10.1016/B978-0-12-416039-2.00001-X
25. McLaughlin ME, Sidhu SS (2013) Engineering and analysis of peptide-recognition domain specificities by phage display and deep sequencing. Methods Enzymol 523:327–349. https://doi.org/10.1016/B978-0-12-394292-0.00015-1
26. Ernst A, Sazinsky SL, Hui S, Currell B, Dharsee M, Seshagiri S, Bader GD, Sidhu SS (2009) Rapid evolution of functional complexity in a domain family. Sci Signal 2(87):ra50. https://doi.org/10.1126/scisignal.2000416
27. Bailey TL, Johnson J, Grant CE, Noble WS (2015) The MEME suite. Nucleic Acids Res 43(W1):W39–W49. https://doi.org/10.1093/nar/gkv416

Part VII Interactions on Surfaces

Chapter 28 Probing IDP Interactions with Membranes by Fluorescence Spectroscopy

Diana Acosta, Tapojyoti Das, and David Eliezer

Abstract

The microtubule-associated protein tau has been extensively studied as a culprit in Alzheimer's disease and other neurodegenerative diseases known as tauopathies. Challenges in structurally defining tau protein emerge from its disordered nature, which makes it difficult to crystallize, and hinder efforts to interpret tau protein's true function. The complexity of intrinsically disordered proteins (IDPs) necessitates a multifaceted approach to study their interactions, including multiple spectroscopic methods that can report on local protein environment and structure at individual residue positions. We and others have shown that in addition to binding to microtubules, tau binds to lipid membranes. Tau-membrane interactions may be relevant both to normal tau function and to tau aggregation and pathology. Here we describe the use of fluorescence spectroscopy as a probe of protein-membrane interactions to determine whether there is an interaction, which residues participate, and the extent/nature of the interface between the protein and the membrane. We provide a protocol for how the membrane interactions of tau protein, as an example, can be probed by fluorescence spectroscopy, including details of how the samples should be prepared and guidelines on how to interpret the results.

Key words Tau, Membrane interactions, Spectroscopy, Fluorescence, IDP, Membrane

1 Introduction

1.1 Tau: An Intrinsically Disordered Protein

Physiologically, tau function has been defined by its ability to bind, stabilize, and promote formation of microtubules [1–3]. Pathologically, tau is capable of forming fibrillar aggregates that are deposited within neurofibrillary tangles and are correlated with cognitive decline [2, 4]. Tau is composed of an amino-terminal domain (N-terminus), a microtubule-binding domain (MBD), and a carboxy-terminal domain (C-terminus). The microtubule-binding domain consists of four imperfect tandem repeats, 31–32 residues in length. Full-length human tau (Htau40) is the longest tau isoform, while smaller tau isoforms are derived by alternative splicing of two N-terminal inserts (generating 0–2N tau) and of the microtubule-binding domain's second repeat region (generating 3R or 4R tau), resulting in isoforms ranging in length from 352 to 441 amino acid residues.

Birthe B. Kragelund and Karen Skriver (eds.), Intrinsically Disordered Proteins: Methods and Protocols, Methods in Molecular Biology, vol. 2141, https://doi.org/10.1007/978-1-0716-0524-0_28, © Springer Science+Business Media, LLC, part of Springer Nature 2020

1.2 Tau-Membrane Interactions

Evidence suggests that tau’s N-terminal domain associates with the plasma membrane, which is important for establishing cell polarity during neuronal development [5]. The tau MBD also binds to membranes directly [6–9], via formation of short amphipathic helices in tau’s repeat regions. These helices appear to promote the formation of oligomeric tau/lipid complexes [6, 10]. Thus, tau-membrane interactions can regulate tau structure, function, and toxicity. The MBD has been extensively studied due to its affinity for microtubules (MTs) but also because it contains two 6-residue segments, PHF6 (306–311) and PHF6∗ (275–280), which can nucleate tau aggregation [11–13]. Formation of helical structure within the MBD when bound to membranes [7] or heterodimeric tubulin [14] suggests that structural disorder-to-order transitions in specific tau regions may mediate tau physiological interactions. Here we describe the use of lipid vesicles in combination with fluorescence spectroscopy to study tau-membrane interactions.

1.3 Application of Fluorescence Spectroscopy for Probing Tau-Membrane Interactions

Fluorescence spectroscopy can be a relatively simple and efficient optical technique for probing membrane interactions of IDPs. The fluorescence of many fluorophores, such as acrylodan and tryptophan, is sensitive to their local environment and undergoes a blue shift in a hydrophobic setting such as the interior of a lipid bilayer. This property is useful to probe membrane interactions of individual residues of an IDP. Recently, fluorescence approaches have been used to probe the interactions of tau with tubulin [14], and we have used it as described below to probe tau-membrane interactions. Additional techniques such as nuclear magnetic resonance (NMR) and electron spin resonance (ESR) can also be used to study proteins bound to lipid vesicles. ESR requires introduction of spin labels containing unpaired electrons into protein samples, which is typically accomplished using cysteine chemistry. It can be used both to probe the environment of individual spin labels placed at different positions within a protein and for measurements of inter-residue distances for proteins containing two (or more) spin labels [15]. While we have used ESR to study tau-membrane interactions [7], the methods and protocols involved are beyond the scope of this current chapter. Solution-state NMR spectroscopy requires isotopically labeled protein to resolve structural features of protein in solution. NMR has been used to provide insight into the effects of membrane interactions on structural characteristics of tau such as secondary structure formation. We describe our protocols for the use of NMR to probe how disordered proteins bind to lipid membranes in a subsequent chapter.

2 Materials

2.1 Recombinant Protein Production and Purification

1. BL21(DE3) E. coli cells.
2. Plasmids containing cDNA coding for tau protein constructs (e.g., tau fragment TK16, which spans residues 244–368 of full-length tau and encompasses all four repeat regions) and an antibiotic resistance gene (e.g., ampicillin), under control of an isopropyl-β-D-thiogalactopyranoside (IPTG)-inducible promoter.
3. Ampicillin (Amp) stock: 2.5 g in 50 mL double-distilled (dd) H2O (50 mg/mL).
4. 10 cm petri dish.
5. LB-amp agar plates: Weigh 18.5 g of Miller's LB agar, dissolve in 500 mL ddH2O, and autoclave. Add 500 μL Amp stock (50 mg/mL) to the 500 mL LB agar after it has been autoclaved and cooled at room temperature for 20–30 min. Add 25 mL of LB-amp agar to each 10 cm petri dish, and allow to cool until the agar has set. Store plates at 4 °C for up to 2 weeks (see Note 1).
6. 1 L LB media: Dissolve 25 g Miller's LB in 1 L ddH2O and autoclave.
7. M9 salts (5×): Weigh 64 g Na2HPO4·7H2O, 15 g KH2PO4, 2.5 g NaCl, and 5 g NH4Cl, and dissolve in 1 L ddH2O.
8. M9 minimal media for 15N recombinant protein growth: Combine in the following order: 700 mL ddH2O, 1 mL 1 M MgCl2, 100 μL 1 M CaCl2, 200 mL 5× M9 salts, 10 mL Basal Medium Eagle (BME) vitamins (100×), 1 g NH4Cl (15N), 4 g unlabeled D-glucose (anhydrous); adjust the volume to 1 L. Add 1 mL Amp stock (50 mg/mL), and filter through a 0.2 μm bottle top filter.
9. 0.8 M IPTG: 2 g IPTG in 10 mL ddH2O, filtered through a 0.2 μm bottle top filter.
10. Lysis buffer: 3 mM urea, 1 mM ethylenediaminetetraacetic acid (EDTA), 1 mM dithiothreitol (DTT), 10 mM Tris, and 1 mM phenylmethylsulfonyl fluoride (PMSF), pH 8.0.
11. Fast protein liquid chromatography (FPLC) dialysis buffer: 25 mM Tris, 20 mM NaCl, 1 mM EDTA, and 1 mM DTT, pH 8.0.
12. FPLC buffer A: 25 mM Tris-Cl, 20 mM NaCl, and 1 mM EDTA, pH 8.0, filtered through a 0.2 μm bottle top filter and degassed.
13. FPLC buffer B: 25 mM Tris-Cl, 1 M NaCl, and 1 mM EDTA, pH 8.0, filtered through a 0.2 μm bottle top filter and degassed.
14. Reversed-phase high-performance liquid chromatography (HPLC) dialysis buffer: 5% acetic acid in 1 L ddH2O.
15. HPLC buffer A: 0.1% trifluoroacetic acid (TFA), degassed.
16. HPLC buffer B: 90% acetonitrile and 0.1% TFA, degassed.
17. 3.5 kDa molecular weight cutoff (MWCO) dialysis membrane.
18. Bacterial culture incubator with shaker.
19. UV-Vis spectrophotometer.
20. Centrifuge with rotor for 1 L tubes.
21. Centrifuge with rotor for 50 mL tubes.
22. Ultracentrifuge for 25 mL tubes.
23. Probe sonicator for cell lysis.
24. FPLC system: HiPrep FF 16/10 CM Sepharose™ column with a fully automated liquid chromatography system (see Note 2).
25. HPLC system: C4 column monitored by an inline absorbance detector (221 nm).
26. Vacuum concentrator (SpeedVac™).
27. Lyophilizer.

2.2 Small Unilamellar Vesicle (SUV) Preparation

1. 1-Palmitoyl-2-oleoyl-glycero-3-phosphocholine (POPC) purchased as a stock solution in chloroform.
2. 1-Palmitoyl-2-oleoyl-sn-glycero-3-phospho-L-serine (POPS) purchased as a stock solution in chloroform.
3. 16 × 100 mm borosilicate glass tube.
4. Dry nitrogen stream.
5. Vacuum concentrator (SpeedVac™).
6. Phosphate buffer (acrylodan fluorescence) or tryptophan buffer (tryptophan fluorescence) for resuspension of vesicles for the desired assay (see below).
7. Bath sonicator.
8. Sorvall RC-M120 EX ultracentrifuge equipped with an S120AT2-0182 rotor (approx. 150,000 × g).

2.3 Acrylodan Fluorescence

1. Protein samples for acrylodan fluorescence should include cysteines at positions of interest for probing interactions; cysteines can be introduced via site-directed mutagenesis.
2. Purification buffer: 25 mM Tris-Cl, 100 mM NaCl, 1 mM EDTA, 1 mM DTT, pH 7.4.
3. PD-10 columns.
4. Labeling buffer: 20 mM Tris-Cl, 50 mM NaCl, pH 7.4.
5. 50 mM acrylodan: 5 mg in 443.87 μL dimethyl sulfoxide (DMSO).
6. Phosphate buffer: 20 mM Na2HPO4, 20 mM KCl, 1 mM MgCl2, 0.5 mM EDTA, pH 7.2.
7. 96-well plate reader for fluorescence and accompanying software (see Note 3).

2.4 Tryptophan Fluorescence

1. Protein samples for tryptophan fluorescence should include tryptophan at positions of interest for probing interactions. Tryptophan can be introduced via site-directed mutagenesis.
2. Tryptophan buffer: 150 mM NaCl, 50 mM Tris-Cl, 5 mM EDTA, pH 8.0.
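The stock recipes listed in Subheadings 2.1 and 2.3 can be checked with a few lines of Python. This is only a sketch of the underlying arithmetic. The molecular weights used (225.29 g/mol for acrylodan and 238.30 g/mol for IPTG) are assumed values that are not stated in the text and should be confirmed against the supplier's data sheet; the function names are hypothetical.

```python
ACRYLODAN_MW = 225.29  # g/mol, assumed; consistent with 5 mg in 443.87 uL at 50 mM
IPTG_MW = 238.30       # g/mol, assumed


def mass_concentration_mg_per_ml(mass_g, volume_ml):
    """Mass concentration (mg/mL) of a stock made from mass_g in volume_ml."""
    return mass_g * 1000.0 / volume_ml


def molarity(mass_g, mw_g_per_mol, volume_ml):
    """Molar concentration (mol/L) of mass_g dissolved to volume_ml."""
    return mass_g / mw_g_per_mol / (volume_ml / 1000.0)


def solvent_volume_ul(mass_mg, mw_g_per_mol, target_mm):
    """Solvent volume (uL) needed to bring mass_mg of solute to target_mm (mM)."""
    millimoles = mass_mg / mw_g_per_mol
    return millimoles / target_mm * 1e6


if __name__ == "__main__":
    print(mass_concentration_mg_per_ml(2.5, 50))      # Amp stock, mg/mL
    print(molarity(2.0, IPTG_MW, 10))                 # IPTG stock, mol/L
    print(solvent_volume_ul(5.0, ACRYLODAN_MW, 50))   # DMSO volume for acrylodan, uL
```

With these assumed molecular weights, the IPTG recipe works out to slightly above the nominal 0.8 M, which is a rounding of the kind commonly accepted for induction stocks.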

3 Methods

3.1 Recombinant Protein Expression and Purification

3.1.1 1 L Growth of Unlabeled Tau Protein

1. Combine 10 μL (OD600 ~ 0.4) of fresh BL21(DE3) cells with 1 μL (10–40 ng/μL) of the designated plasmid (e.g., coding for TK16) for transformation. Mix well, and place cells on ice for 5 min. Heat shock cells at 42 °C on a block heater for 45 s, and place on ice immediately for 2 min. Plate cells on LB-amp agar plates, and incubate at 37 °C overnight.
2. The following day, pick one colony, and grow in 4 mL LB media with 4 μL of Amp stock for 3 h in a shaker set at 37 °C (see Note 4).
3. Add 500 μL of Amp stock to 1 L LB media. Use the 4 mL culture to inoculate 100 mL of the LB media with Amp, and incubate in a shaker at 37 °C overnight.
4. Use the 100 mL overnight culture to inoculate the remaining 900 mL of LB media with Amp, and incubate in a shaker set at 37 °C for 4 h. Monitor growth closely by measuring OD600 using a UV-Vis spectrophotometer every hour. When OD600 = 0.6 (~1 h after inoculation), induce overexpression by adding 1 mL of 0.8 M IPTG to the 1 L growth for a final concentration of 0.8 mM IPTG, and continue to monitor. Once growth stagnates (~3 h after induction with IPTG) and OD600 is between 1.2 and 1.8, centrifuge the 1 L culture at 10,000 × g for 15 min at 4 °C (see Notes 5 and 6). Discard the supernatant, collect the cell pellet, and store at −20 °C until use for purification.
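The working concentrations implied by the volumes in the steps above follow from a simple dilution calculation, sketched here in Python. The function name and the option to account for the added stock volume are illustrative assumptions, not part of the protocol.

```python
def final_concentration(stock_conc, stock_volume_ml, culture_volume_ml,
                        include_added_volume=False):
    """Concentration after adding stock to a culture (same units as stock_conc).

    By default the small volume of added stock is ignored, as is usual
    when quoting nominal working concentrations.
    """
    total_ml = culture_volume_ml
    if include_added_volume:
        total_ml += stock_volume_ml
    return stock_conc * stock_volume_ml / total_ml


if __name__ == "__main__":
    # IPTG: 1 mL of 0.8 M (800 mM) stock into 1 L of culture, in mM.
    print(final_concentration(800.0, 1.0, 1000.0))
    # Amp: 4 uL of 50 mg/mL stock into 4 mL of starter culture, in mg/mL.
    print(final_concentration(50.0, 0.004, 4.0))
    # Amp: 500 uL of 50 mg/mL stock into 1 L of LB media, in mg/mL.
    print(final_concentration(50.0, 0.5, 1000.0))
```

With the stated volumes, the nominal values are 0.8 mM IPTG, 50 μg/mL Amp in the 4 mL starter culture, and 25 μg/mL Amp in the 1 L culture.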

3.1.2 Purification of Unlabeled Tau from 1 L

1. Thaw the bacterial pellet at room temperature, and resuspend in 50 mL Lysis buffer.
2. Lyse the cells by sonicating for 12 min on ice (2 × 6 min, stirring in between) until the solution becomes less opaque.


Table 1 HPLC program for tau protein purification

Time (min)    Flow (mL/min)    HPLC buffer A (%)    HPLC buffer B (%)
0             4.45             80                   20
5             4.45             58                   42
37            4.45             50                   50
39            4.45             0                    100
41            4.45             80                   20
51            4.45             0                    100
53            4.45             80                   20

Elution of the protein is dependent on an acetonitrile gradient, which is achieved by increasing concentrations of HPLC buffer B over time. Collect fractions when A221 > 0.00 and verify protein by SDS-PAGE (15% gel).

3. Centrifuge the sonicated sample for 1 h at 40,000 rpm in a Beckman ultracentrifuge using a Ti 50.2 rotor (approx. 145,000 × g).
4. Collect the supernatant, and dialyze once against 1 L FPLC dialysis buffer with a 3.5 kDa MWCO dialysis membrane; leave overnight.
5. The following day, purify the supernatant by cation-exchange chromatography with a NaCl gradient. Equilibrate a HiPrep FF 16/10 CM Sepharose™ column with two column volumes (CV) of FPLC buffer A, and then load the sample (see Note 7).
6. Elute sample fractions with a NaCl gradient created by a linear increase from 0 to 40% of FPLC buffer B over four CVs. Collect 2.5 mL fractions throughout elution. Among the protein-containing fractions (A280 > 0 mAU), identify those that contain tau using SDS-PAGE (15% gel). Pool fractions containing tau protein for HPLC. Dialyze the pooled fraction once against 1 L HPLC dialysis buffer with a 3.5 kDa MWCO membrane; leave overnight at 4 °C.
7. The following day, filter the protein sample through a 0.22 μm membrane.
8. Equilibrate the C4 column with 90% HPLC buffer A and 10% HPLC buffer B for 15 min. Then load the sample onto the column by switching the flow input to the sample vial. Elute with an acetonitrile gradient in 0.1% TFA (HPLC buffers A and B), as shown in Table 1.
9. Spin the eluted sample for 2 h in a vacuum concentrator to remove any acetonitrile from the HPLC buffers.
10. Dialyze the sample once against 4 L ddH2O with a 3.5 kDa MWCO membrane; leave overnight at 4 °C.
11. The following day, flash freeze the protein sample in liquid nitrogen, and lyophilize for 72 h; afterward store at −20 °C until needed. Confirm purity by 15% SDS-PAGE.

3.1.3 Site-Directed Mutagenesis

Site-specific mutations are required for acrylodan and tryptophan experiments and are introduced with a site-directed mutagenesis kit, as directed by the manufacturer. For acrylodan fluorescence, single cysteine mutations should be introduced at the sites being probed for membrane interaction (see Note 8). For tryptophan fluorescence, site-directed mutagenesis is needed to introduce tryptophan residues at specific sites of tau, because full-length tau contains no native tryptophan residues. After the cysteine and tryptophan mutant constructs have been confirmed by sequencing, the resulting plasmids can be used for growth and purification of tau as described above.

3.2 SUV Lipid Vesicle Preparation

Lipid vesicles can be composed of varying lipid mixtures. For the experimental techniques described below, we have used POPC/POPS SUV lipid vesicles at a 1:1 molar ratio. The user is free to use a lipid composition that best suits their experimental needs. Preparation of our SUV lipid vesicles is as follows:

1. From stock solutions in chloroform, combine POPC and POPS lipid volumes in a 16 × 100 mm borosilicate glass tube based on a molar ratio of POPC/POPS = 1:1 (see Note 9). For our fluorescence assays, typical volumes of 1–2 mL are made.
2. Dry the lipid mixture by flowing an inert gas over it (N2 or Ar) for 10–15 min.
3. Dry the lipid film further using a vacuum concentrator for 1–2 h.
4. Resuspend the dried lipid film in the designated buffer (phosphate buffer for acrylodan fluorescence or tryptophan buffer for tryptophan fluorescence), and vortex until the lipid film has been completely resuspended. The lipid film is completely resuspended if there is no residue remaining on the glass surface and the solution has become a milky suspension.
5. Measure the pH of the suspension, and adjust it to the experimental conditions if necessary.
6. Sonicate the lipid solution in a bath sonicator until clear (~10–30 min). It is advisable to put the tube at an appropriate depth (top of solution should be below the sonicator water level) and at an appropriate position (at an antinode in the standing waves produced by the bath sonicator) (see Note 10). This sonication method typically yields SUVs that are 30–50 nm in diameter.
7. Ultracentrifuge the clarified lipids at 60,000 rpm for 1 h at 4 °C to remove any remaining multilamellar or larger vesicles. We use a Sorvall RC-M120 EX ultracentrifuge with an S120AT2-0182 rotor (approx. 150,000 × g).
8. Carefully remove and store the supernatant as the final lipid sample used for experiments, and discard the pellet. Lipid concentration can be verified afterward by phosphate assays [16], and lipid size distribution can be quantified by dynamic light scattering or directly visualized and measured by electron microscopy (see Note 11).

3.3 Acrylodan Fluorescence

For both acrylodan fluorescence and tryptophan fluorescence, the procedures below are described for one tau mutant (one residue position). The procedures should be repeated for each tau mutant being probed if multiple residue-specific interactions are being studied across regions of tau. If set up correctly, a single 96-well plate can hold up to 24 mutants (including appropriate blanks), allowing 24 residues to be probed for membrane interactions at one time (Fig. 1).

1. Resuspend the protein in purification buffer at a concentration of 600 μM and a volume of 1.3 mL, and leave at room temperature for 15–30 min.
2. Filter through a 100 kDa centrifugal spin column at 4000 rpm on a benchtop centrifuge at 4 °C for 15 min or until completely run through.
3. Equilibrate the PD-10 column with labeling buffer for 5 CVs.
4. Buffer exchange the protein into labeling buffer in the following way: add ~1.3 mL of sample to the PD-10 column in two parts of 650 μL. Add 600 μL of labeling buffer twice to the PD-10 column. Add 750 μL of labeling buffer twice, and collect protein from elution. Add 500 μL of labeling buffer six times.
5. Add 10 μL of 50 mM acrylodan eight times (total 80 μL) to the protein in labeling buffer gradually, inverting the tube several times to mix well between additions, and cover the sample in foil.
6. Place the sample in a rotator for 4 h.
7. To remove residual acrylodan, spin down the sample for 5 min at 14,000 rpm (18,000 × g). We use a Beckman Microfuge 18 centrifuge.
8. Equilibrate a new PD-10 column with phosphate buffer for 5 CVs.
9. Buffer exchange labeled protein into phosphate buffer in the following way: add 1.5 mL of sample to the PD-10 column in two parts of 750 μL. Add 1 mL of phosphate buffer. Add 750 μL of phosphate buffer twice, and collect acrylodan-labeled protein from elution. Add 500 μL of phosphate buffer six times.

Membrane Interactions of IDPs

[Figure 1: 96-well plate layout. Tau labeled mutants #1–12 and #13–24, each block with the rows +PCPS, −PCPS, Lipid Only, and Buffer Only.]

Fig. 1 Sample setup for acrylodan and tryptophan fluorescence. Plate reader setup can allow for up to 24 protein mutants to be scanned at one time for fluorescence experiments. +PCPS indicates protein samples with lipids (bound-state). −PCPS indicates protein samples without lipids (free-state). Lipid only indicates lipid samples at a concentration matched to the bound-state samples. Buffer only indicates that only buffer is added to that row, as a blank for −PCPS samples

10. Make ~1:1500 (protein/lipid) molar ratio samples. For our samples we typically use 20 μM of protein with 30 mM SUV lipids.
11. Use phosphate buffer as a blank for free-state protein samples. Subtracting the blank spectrum from the spectra of protein-containing samples removes any fluorescence or noise signal from the buffer in the free-state protein sample, leaving only the protein fluorescence signal. A lipid sample containing 10 mM liposome in buffer without protein should be used as a blank for bound-state protein samples to subtract any noise signal arising from lipid or buffer, leaving only the protein fluorescence signal.
12. Use a plate reader with excitation set to 390 nm, cutoff at 420 nm, and emission at 400–600 nm. Scan bound-state and free-state protein as well as controls.
13. Export data to a text file.

3.4 Tryptophan Fluorescence

Unlike acrylodan fluorescence, tryptophan fluorescence does not require extra labeling procedures. Instead, after site-directed mutagenesis has successfully introduced tryptophan at specific sites, tryptophan emission at 300–450 nm can be probed upon excitation at 295 nm.

1. Make ~1:1500 (protein/lipid) molar ratio samples. For our experiments we typically use 20 μM tau protein and 30 mM POPC/POPS SUV lipids.


2. Use buffer as a blank for free-state protein samples. A lipid sample containing 30 mM POPC/POPS in buffer should be used as a blank for bound-state protein samples.
3. Use a plate reader with excitation set to 295 nm, no cutoff, and emission at 300–500 nm. Scan bound-state and free-state protein samples.
4. Export data to a text file.

3.5 Data Analysis

The following data processing and evaluation can be used for both acrylodan fluorescence and tryptophan fluorescence.

1. Data from the plate reader software can be imported into any data processing platform (e.g., Microsoft Excel). Fluorescence emission spectra (RFU vs. λ (nm)) should show a peak around 355 nm for Trp-labeled free-state samples (300–450 nm range) and around 520 nm for acrylodan-labeled free-state samples. Samples in the presence of lipid vesicles may show a shift in the peak maximum toward lower wavelengths, reflecting insertion of the fluorophore into the membrane. Figure 2a, b shows representative data for two tau TK16 Trp mutants.
2. Calculate the difference in maximum emission wavelength (Δλmax) between the free-state and bound-state samples. Δλmax results from a change in the environment of the fluorophore. Small values of Δλmax indicate small differences in fluorophore environment in the presence of lipids, suggesting no membrane interaction or insertion at this position. Larger values suggest membrane interaction or insertion (see Note 12).
3. If the measurement is repeated for multiple successive residues, a plot of Δλmax vs. residue position can be constructed. Any periodicity observed in such a plot may reflect a periodicity in the environment of successive residues in the targeted protein segment, which may in turn reflect, for example, the underlying periodicity of secondary structure elements. For example, a periodicity of 3.6 would be consistent with an α-helix secondary structure. A periodic function, Δλmax(n) = a + b cos(2πn/N + c), where N is the periodicity, a is the vertical shift, b is the amplitude, and c is the horizontal shift, can be fit to the data in order to estimate the periodicity across a given region, as shown in Fig. 2c.
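The analysis steps above can be sketched in a few lines of Python. This is only an illustrative sketch, not the authors' workflow: the function names are hypothetical, Δλmax is taken as the free-state peak minus the bound-state peak (an assumed sign convention, so that a blue shift is positive), and the periodicity is estimated by scanning candidate values of N and solving a linear least-squares problem for a, b, and c at each N. The data in the example are synthetic.

```python
import math

def peak_wavelength(wavelengths, sample_rfu, blank_rfu):
    """Wavelength of maximum emission after blank subtraction."""
    corrected = [s - b for s, b in zip(sample_rfu, blank_rfu)]
    i_max = max(range(len(corrected)), key=corrected.__getitem__)
    return wavelengths[i_max]

def delta_lambda_max(wavelengths, free, free_blank, bound, bound_blank):
    """Free-state minus bound-state peak position (assumed sign convention)."""
    return (peak_wavelength(wavelengths, free, free_blank)
            - peak_wavelength(wavelengths, bound, bound_blank))

def solve3(matrix, rhs):
    """Solve a 3x3 linear system by Gaussian elimination with pivoting."""
    a = [row[:] + [v] for row, v in zip(matrix, rhs)]
    for col in range(3):
        pivot = max(range(col, 3), key=lambda r: abs(a[r][col]))
        a[col], a[pivot] = a[pivot], a[col]
        for r in range(col + 1, 3):
            f = a[r][col] / a[col][col]
            for c in range(col, 4):
                a[r][c] -= f * a[col][c]
    x = [0.0, 0.0, 0.0]
    for i in (2, 1, 0):
        x[i] = (a[i][3] - sum(a[i][j] * x[j] for j in range(i + 1, 3))) / a[i][i]
    return x

def fit_fixed_period(residues, values, period):
    """Least-squares fit of a + b*cos(2*pi*n/period + c) at a fixed period."""
    w = 2.0 * math.pi / period
    m = [[0.0] * 3 for _ in range(3)]
    v = [0.0] * 3
    for n, y in zip(residues, values):
        phi = (1.0, math.cos(w * n), math.sin(w * n))
        for i in range(3):
            v[i] += phi[i] * y
            for j in range(3):
                m[i][j] += phi[i] * phi[j]
    a, p, q = solve3(m, v)
    # a + p*cos(wn) + q*sin(wn) == a + b*cos(wn + c)
    b, c = math.hypot(p, q), math.atan2(-q, p)
    sse = sum((y - (a + b * math.cos(w * n + c))) ** 2
              for n, y in zip(residues, values))
    return a, b, c, sse

def scan_periods(residues, values, periods):
    """Return the candidate period with the smallest residual and its fit."""
    best = min(periods, key=lambda N: fit_fixed_period(residues, values, N)[3])
    return best, fit_fixed_period(residues, values, best)

# Synthetic example (not measured data): residues 314-324, period 3.6.
residues = list(range(314, 325))
values = [10.0 + 5.0 * math.cos(2.0 * math.pi * n / 3.6 + 0.7) for n in residues]
grid = [2.5 + 0.1 * i for i in range(26)]  # candidate periods 2.5 ... 5.0
best_period, (a, b, c, sse) = scan_periods(residues, values, grid)
print(best_period, a, b, c)
```

With only about ten residues, several periods may fit real data almost equally well, so the residual should be inspected across the whole grid rather than trusting a single minimum.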

4 Notes

1. Old LB-amp plates (stored for longer than 2 weeks, sealed, at 4 °C) lose their ampicillin efficiency over time. Therefore, for plates being used for mutagenesis, it is best to use fresh plates to avoid satellite colonies.

[Figure 2 panels: (a) Q307W +/− PCPS (1:1), RFU vs. λ (nm); (b) K317 +/− PCPS, RFU vs. λ (nm); (c) residues 314–324 +/− PCPS (1:1), Δλmax vs. residue, periodicity = 3.6, R-square = 0.4969.]

Fig. 2 Tryptophan fluorescence of selected tau residues. (a) Calculating changes in max emission wavelength (Δλmax) for tau protein TK16 Q307W using tryptophan fluorescence. (b) Calculating Δλmax with acrylodan fluorescence for acrylodan-labeled TK16 K317C. (c) An example of fitting a periodic function, Δλmax(n) = a + b cos(2πn/N + c), where N is the periodicity, a is the vertical shift, b is the amplitude, and c is the horizontal shift, to fluorescence data. The black curve represents the experimental data, while the red curve represents a curve fitting model with a periodicity of 3.6

2. Although we use an automated FPLC system, a manual gradient can be generated with a gradient former and used in line with a weak cation-exchange column. However, all elutions must be checked for protein with SDS-PAGE before pooling elution samples for dialysis and HPLC.
3. Plate settings in the plate reader software should be carefully matched to the plate being used. For our experiments we use standard 96-well black polystyrene plates with flat bottoms and lids (Nunc™ F96 MicroWell™).
4. After 3 h, the 4 mL culture should no longer be a clear or transparent solution. A cloudy culture should be used to inoculate the 100 mL overnight culture. If the cultures are clear, leave them in the incubator for longer (up to 5 h).
5. If expression remains low (OD600 less than 1.0) for the 1 L growth after 3 h (post-induction with IPTG), the growth should be discarded. As an alternative, before inducing the 1 L growth with IPTG, the incubator can be set to a lower temperature (18 °C or 20 °C), and the growth can then be incubated in a shaker overnight.


6. If OD600 is decreasing during monitoring of growths, the culture should be discarded and restarted with a new colony.
7. After washing the FPLC column with two CVs of FPLC buffer A, all detector signals should be plateaued and stable. If the detector signal features spikes or is drifting, buffer should be passed through the column.
8. It is important to note that there are two native cysteines in full-length tau (Htau40). Therefore, in addition to introducing cysteine mutations at sites of interest, the native cysteines C291 and C322 should be changed (typically to alanine) to avoid confounding results for residue-specific interactions.
9. The shape and size of the tube used for sonicating lipid vesicles are important for optimizing sonication efficiency. For example, we specifically use 16 × 100 mm borosilicate glass tubes and have found that using tubes with smaller diameters can lead to longer sonication times and heterogeneous sonication. This is because the sonication power distribution is nonuniform in a bath sonicator. For every sonicator used, one should therefore determine the right size of tube for optimal sonication efficiency.
10. In our hands, lipid solutions become clear after sonication for 10–30 min. However, if lipids require a longer sonication time, they should be allowed to cool on ice before continuing sonication. Overheating of lipids should be avoided as this leads to degradation.
11. Phospholipids oxidize over time. It is therefore important to use fresh stocks of lipids to prepare vesicles and to store the stock properly sealed at −20 °C. Lipid stocks last up to 6 months when stored at −20 °C in a glass container with a Teflon cap that is additionally sealed with parafilm.
12. It is important to keep in mind that introducing acrylodan or tryptophan at any position may influence the protein structure and interactions. Stabilization of transient secondary structures in IDPs may constrain a labeled site to behave in a similar way to the unmodified protein, but this is not guaranteed and must be considered.

References

1. Mandelkow E-M, Mandelkow E (2012) Biochemistry and cell biology of tau protein in neurofibrillary degeneration. Cold Spring Harb Perspect Med 2:a006247
2. Wang Y, Mandelkow E (2016) Tau in physiology and pathology. Nat Rev Neurosci 17:22–35
3. Devred F, Barbier P, Douillard S et al (2004) Tau induces ring and microtubule formation from αβ-tubulin dimers under nonassembly conditions. Biochemistry 43:10520–10531
4. Förstl H, Kurz A (1999) Clinical features of Alzheimer's disease. Eur Arch Psychiatry Clin Neurosci 249:288–290
5. Brandt R, Léger J, Lee G (1995) Interaction of tau with the neural plasma membrane mediated by tau's amino-terminal projection domain. J Cell Biol 131:1327–1340
6. Shea TB (1997) Phospholipids alter tau conformation, phosphorylation, proteolysis, and association with microtubules: implication for tau function under normal and degenerative conditions. J Neurosci Res 50:114–122
7. Georgieva ER, Xiao S, Borbat PP et al (2014) Tau binds to lipid membrane surfaces via short amphipathic helices located in its microtubule-binding repeats. Biophys J 107:1441–1452
8. Barré P, Eliezer D (2006) Folding of the repeat domain of tau upon binding to lipid surfaces. J Mol Biol 362:312–326
9. Barré P, Eliezer D (2013) Structural transitions in tau k18 on micelle binding suggest a hierarchy in the efficacy of individual microtubule-binding repeats in filament nucleation. Protein Sci 22:1037–1048
10. Ait-Bouziad N, Lv G, Mahul-Mellier A-L et al (2017) Discovery and characterization of stable and toxic tau/phospholipid oligomeric complexes. Nat Commun 8:1678
11. Ganguly P, Do TD, Larini L et al (2015) Tau assembly: the dominant role of PHF6 (VQIVYK) in microtubule binding region repeat R3. J Phys Chem B 119:4582–4593
12. Friedhoff P, von Bergen M, Mandelkow EM et al (2000) Structure of tau protein and assembly into paired helical filaments. Biochim Biophys Acta 1502:122–132
13. von Bergen M, Barghorn S, Li L et al (2001) Mutations of tau protein in frontotemporal dementia promote aggregation of paired helical filaments by enhancing local β-structure. J Biol Chem 276:48165–48174
14. Li X-H, Culver JA, Rhoades E (2015) Tau binds to multiple tubulin dimers with helical structure. J Am Chem Soc 137:9218–9221
15. Eliezer D (2012) Distance information for disordered proteins from NMR and ESR measurements using paramagnetic spin labels. Methods Mol Biol (Clifton, NJ) 895:127–138
16. Snead D, Wragg RT, Dittman JS et al (2014) Membrane curvature sensing by the C-terminal domain of complexin. Nat Commun 5:4955

Chapter 29

Protocol for Investigating the Interactions Between Intrinsically Disordered Proteins and Membranes by Neutron Reflectometry

Alessandra Luchini and Lise Arleth

Abstract

Several intrinsically disordered proteins (IDPs) exhibit high affinity for lipid membranes. Among the different biophysical methods to probe protein-lipid interaction, neutron reflectometry (NR) can provide direct and detailed structural information on the location of the IDP with respect to the membrane. Supported lipid bilayers are commonly used as cell membrane models in such experiments. NR measurements can be collected on the supported lipid bilayer before and after the interaction with the IDP to characterize whether the protein molecules are mainly located on the membrane surface (interaction with the lipid headgroups), are penetrating into the hydrophobic region of the membrane (interaction with the lipid acyl chains), or are not interacting at all with the membrane. The lipid composition of the supported lipid bilayer can easily be tuned; hence the NR experiments can be designed to investigate selective IDP-lipid interactions. This chapter describes the fundamental steps for performing an NR experiment and the subsequent data analysis aimed at characterizing IDP-lipid bilayer interactions. The specific case of an intrinsically disordered region (IDR) from the membrane protein Na+/H+ exchanger isoform 1 (NHE1) is used as an example, but the same protocol can be easily adapted to other IDPs.

Key words Intrinsically disordered proteins, NHE1, Supported lipid bilayers, Neutron reflectometry, Cell membranes, Membrane interactions

1 Introduction

Lipid membranes are relevant interaction partners for intrinsically disordered proteins (IDPs) [1–3]. The biological impact of such interactions is still poorly understood, and understanding it will build on a detailed structural description of the location of the IDPs with respect to the membrane [4]. In particular, it is desirable to determine whether the IDP molecules interact with the lipid headgroups or whether they can penetrate into the hydrophobic region of the membrane and eventually cross it [5]. Besides locating the interacting protein molecules, characterizing the effect of the IDP

Birthe B. Kragelund and Karen Skriver (eds.), Intrinsically Disordered Proteins: Methods and Protocols, Methods in Molecular Biology, vol. 2141, https://doi.org/10.1007/978-1-0716-0524-0_29, © Springer Science+Business Media, LLC, part of Springer Nature 2020


on the overall membrane structure is also relevant for a full characterization of the system. Among the different biophysical methods available to study protein-lipid interactions, neutron reflection at the solid/liquid interface offers several advantages [6]. Neutron reflectometry (NR) allows for structural investigation of thin films deposited on a solid substrate in an aqueous environment with a resolution of a few Å. NR is a nondestructive technique, which can provide information on both the composition and the thickness of the deposited sample [7]. During an NR experiment aimed at investigating protein-lipid interactions, supported lipid bilayers are commonly used as models for cell membranes [8–10]. A supported lipid bilayer is composed of a planar lipid bilayer in the proximity of a solid substrate such as a silicon crystal. Supported lipid bilayers can be prepared with tunable lipid composition and have been widely exploited to investigate lipid interactions with proteins or peptides by means of different surface-sensitive techniques such as NR, quartz crystal microbalance with dissipation monitoring (QCM-D), and ellipsometry [11–14]. Because neutrons are sensitive to the isotopic composition of the sample [15], NR can provide a detailed description of both the hydrophilic (lipid headgroups) and hydrophobic (lipid acyl chains) regions of the lipid bilayer in terms of their thickness and solvation. Furthermore, because lipids and proteins can be distinguished in a neutron scattering experiment, NR is well suited for probing the interaction between IDPs and supported lipid bilayers. NR can indeed provide the desired information on the protein location in relation to the bilayer and, at the same time, on the overall membrane structure [5, 16]. Such information is contained in the scattering length density profile, ρ(z), which is the main output of the NR data analysis.
The scattering length density quantifies the interaction between the neutrons and the different types of nuclei in the sample. In the case of NR, ρ(z) provides the distribution of the different molecules in the sample in the direction perpendicular to the substrate surface, z [6]. In this chapter, we describe how NR can be used to investigate the interaction between an IDP and a lipid bilayer composed of 1-palmitoyl-2-oleoyl-sn-glycero-3-phosphocholine (POPC). The chosen IDP is an intrinsically disordered region (IDR) that is part of the large intracellular domain of the Na+/H+ exchanger isoform 1 (NHE1) [17, 18]. NHE1 is a ubiquitous membrane protein responsible for the regulation of cellular pH and volume [19]. The IDR constitutes the lipid interaction domain (LID) of NHE1 and is named NHE1-LID in the following sections. The main goals of the NR experiment are to (1) reveal the presence of an interaction between the NHE1-LID and the POPC-supported lipid bilayer; (2) locate the NHE1-LID molecules with respect to the membrane, i.e., if they interact with the phospholipid headgroup or penetrate into


the hydrophobic region of the membrane; and (3) evaluate the impact of the NHE1-LID-lipid interaction on the overall membrane structure. This chapter will focus on the relevant practical aspects of an NR experiment aimed at the investigation of the interaction between an IDP, i.e., NHE1-LID, and a supported lipid bilayer. A detailed description of the NR technique and instrumentation can be found elsewhere [7, 15, 20–22]. Although the specific case of NHE1-LID will be used here for the method discussion, the same experimental protocol can be easily adapted to other IDPs and supported lipid bilayers with different lipid composition.

2 Materials

1. POPC was purchased and used without further purification.
2. NHE1-LID: expressed and purified according to the protocol reported elsewhere [23], dissolved in ultrapure water (18 MΩ·cm) to a concentration of 2.5 μM. Dithiothreitol (DTT) was added to a concentration of 2.5 μM.
3. d-buffer: 20 mM HEPES, 150 mM NaCl, 2.5 μM DTT, pH = 7.4, in 100% D2O (see Note 1).
4. smw-buffer: 20 mM HEPES, 150 mM NaCl, 2.5 μM DTT, pH = 7.4, in 38% D2O and 62% H2O (v/v) (see Note 1).
5. h-buffer: 20 mM HEPES, 150 mM NaCl, 2.5 μM DTT, pH = 7.4, in 100% H2O (see Note 1).
6. Monocrystalline silicon substrate, 8 × 5 × 1.5 cm.
7. Chloroform and ethanol were used for sample preparation and NR cell cleaning.
8. A diluted piranha solution was prepared by mixing ultrapure water, sulfuric acid, and hydrogen peroxide at 5:4:1 (v/v/v) H2O/H2SO4/H2O2. Piranha solution is a mixture of sulfuric acid and hydrogen peroxide that very efficiently removes organic molecules from the substrate surface.
9. 2% Hellmanex solution for NR cell cleaning.
10. HPLC pump for injection of buffer or solvents in the NR cell.
11. 2.5 mL plastic syringe for manual injection of the samples in the NR cell.

3 Methods

Sample preparation was performed at room temperature, while the NR measurements were performed at 25 °C. Injections of the solutions in the NR solid/liquid cell [24] were performed either


by manual syringe injection or by using an HPLC pump. In particular, samples were injected manually with a syringe, while the buffers or the ultrapure water used after sample deposition were injected using the HPLC pump at a flow rate of 2 mL/min. Since the injected sample volume (~2 mL) is considerably smaller than the injected buffer or ultrapure water volume (~20 mL), a manual syringe injection is recommended for the sample, while the HPLC pump can be used for the solvents. The experimental data shown here were collected on the SURF reflectometer at the ISIS Neutron and Muon Source in Didcot, UK.

3.1 NR Cell and Substrate Cleaning

1. Clean all the NR cell components that will be in contact with the sample, e.g., the O-ring and the sample injection tubes, with 2% Hellmanex solution (see Note 2) followed by extensive rinsing with ultrapure water and ethanol (Fig. 1).
2. Prepare a diluted piranha solution for cleaning the silicon substrate (see Note 3). The piranha solution, even in its diluted form, is highly corrosive; hence take the relevant safety precautions during preparation and handling. Immerse the silicon substrate in the piranha solution at 80 °C. After 15 min, remove the substrate and rinse its surface extensively with ultrapure water. Store the substrate under water to avoid contamination of the cleaned surface.
3. Assemble the NR solid/liquid cell as reported in Fig. 1b and fill the cell with D2O. Collect a first set of NR data for the substrate in contact with D2O, and subsequently inject 20 mL of H2O before collecting another dataset. The main purpose of the NR measurements of the substrate in contact with D2O and H2O is to characterize the silicon oxide layer, which is present on the surface of the substrate, and to assess the overall substrate quality (i.e., surface roughness). Silicon oxide spontaneously grows on the surface of silicon crystals exposed to air, and it is characterized by a sufficiently different scattering length density compared to pure silicon (i.e., ρ = 3.46 × 10⁻⁶ Å⁻² and ρ = 2.07 × 10⁻⁶ Å⁻², respectively). The difference in ρ between the two components gives rise to a reflectivity signal.
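To illustrate how a difference in ρ across an interface produces a reflectivity signal, the following minimal sketch evaluates the Fresnel reflectivity of an ideally sharp interface between silicon and the solvent. It deliberately ignores the oxide layer and roughness, so it is a simplification and not the model used for the actual data analysis. The function names are hypothetical, the D2O and H2O scattering length densities are the d-buffer and h-buffer values quoted in this chapter, and the neutron beam is assumed to reach the interface through the silicon.

```python
import cmath
import math

RHO_SI = 2.07e-6    # Å^-2, silicon
RHO_D2O = 6.35e-6   # Å^-2, value quoted for d-buffer
RHO_H2O = -0.56e-6  # Å^-2, value quoted for h-buffer

def critical_q(rho_front, rho_back):
    """Critical momentum transfer (1/Å) below which reflection is total.

    Returns None when the backing SLD is not larger than the fronting SLD,
    in which case there is no total reflection.
    """
    delta = rho_back - rho_front
    return math.sqrt(16.0 * math.pi * delta) if delta > 0 else None

def fresnel_reflectivity(q, rho_front, rho_back):
    """Reflectivity of a sharp interface at momentum transfer q (1/Å)."""
    k0 = q / 2.0
    k1 = cmath.sqrt(k0 * k0 - 4.0 * math.pi * (rho_back - rho_front))
    r = (k0 - k1) / (k0 + k1)
    return abs(r) ** 2

for q in (0.005, 0.02, 0.05, 0.1):
    print(q,
          fresnel_reflectivity(q, RHO_SI, RHO_D2O),
          fresnel_reflectivity(q, RHO_SI, RHO_H2O))
print("critical q for Si/D2O:", critical_q(RHO_SI, RHO_D2O))
```

In this simplified picture the silicon/D2O interface shows total reflection at low q, whereas the silicon/H2O interface does not, which is one reason why the two solvents provide complementary information on the substrate.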

3.2 POPC Vesicle Suspension

The POPC-supported bilayer was prepared by vesicle fusion [25]. Hence, the first step in producing the POPC bilayer is preparing a vesicle suspension. Different protocols can be used for the preparation of a vesicle suspension [26]. Here, the one based on the redispersion and sonication of a POPC film was used. As described in Subheading 3.3, the vesicle size and lamellarity were controlled by sonication with a tip sonicator. The latter favors the formation of small unilamellar vesicles (SUVs).


Fig. 1 (a) Schematic illustration of the components of a solid/liquid NR cell produced using SolidWorks (https://www.solidworks.com). (1) and (5) are aluminum plates with an internal serpentine for water circulation. These components are connected to a water bath for temperature regulation of the cell. (2) is a plastic plate facing the substrate reflective surface. This component is connected to an HPLC pump (not represented in the figure) to inject the sample or solvent through the cell. (3) O-ring. (4) Silicon substrate. (2–4) are the cell components which will be in contact with the sample. (b) Illustration of the assembled NR cell

1. Weigh 2 mg of POPC powder in a glass vial, and dissolve the lipid in 0.25 mL of chloroform (see Note 4).
2. Dry the POPC chloroform solution with a nitrogen flow and subsequently under vacuum, until a lipid film is produced on the bottom of the glass vial. If prepared in advance, the film can be stored at −80 °C.
3. Resuspend the lipid film in 2 mL of h-buffer. Use a bath sonicator for 30 min to ensure a homogeneous lipid suspension. The sample will contain POPC vesicles of variable sizes and lamellarity and will appear turbid.

3.3 Formation of the Supported Lipid Bilayer

1. Sonicate the POPC vesicle suspension with a tip sonicator right before injection into the NR cell (see Note 5). Typically, 5 min of sonication is sufficient and the suspension will appear clearer. Sonication produces SUVs, which have a larger tendency than multilamellar vesicles to adsorb and fuse on the hydrophilic substrate surface.
2. Inject 2 mL of the POPC suspension into the NR cell. The injection can be performed either manually with a syringe or (if available) with a syringe pump, which enables a constant flow rate (typically 0.5–1 mL/min) during sample injection.


Fig. 2 (a) NR experimental data with the relative fitting curves for the POPC bilayer. Data corresponding to the three different buffer compositions are multiplied by a scaling factor (noted in the figure) to improve data visualization. (b) Scattering length density profile calculated from the optimization to the experimental data of the parameters in the theoretical model

3. Wait 20–30 min to let the POPC vesicles adsorb on the silicon substrate. Vesicle adsorption and rupture is a process that might require several tens of minutes [27]. It is important to incubate the substrate with the vesicles for a sufficiently long time (~30 min) to obtain a high surface coverage. Other surface-sensitive techniques such as QCM-D can be used to evaluate the incubation time [28].
4. Inject ultrapure water into the cell to induce an osmotic shock, which favors vesicle rupture and the formation of the supported lipid bilayer. Ten minutes of ultrapure water injection is typically sufficient.
5. Reintroduce the buffer in the cell before data collection. The strongest signal is produced when the lipid membrane is in contact with d-buffer [29] (see Note 6). Hence, inject 20 mL of d-buffer to guarantee an efficient isotope exchange.
6. Collect the NR data in at least three different contrasts: d-buffer (ρ = 6.35 × 10⁻⁶ Å⁻²), smw-buffer (ρ = 2.07 × 10⁻⁶ Å⁻²), and h-buffer (ρ = −0.56 × 10⁻⁶ Å⁻²) (see Note 7). Use the HPLC pump for buffer injection. Typically, 20 mL of buffer is sufficient to ensure a proper contrast exchange. The collected experimental NR data in the case of the POPC lipid bilayer are shown in Fig. 2a.
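The three contrasts can be related to each other with a short calculation. The sketch below assumes that the scattering length density of an H2O/D2O mixture is the volume-fraction-weighted average of the two pure solvents, and it neglects the small contribution of the buffer salts; the function names are hypothetical. Under these assumptions, a mixture of about 38% D2O reproduces the scattering length density of silicon, which corresponds to the composition of the smw-buffer listed in the Materials.

```python
# Scattering length densities in units of 10^-6 Å^-2 (values from this chapter).
RHO_D2O = 6.35
RHO_H2O = -0.56
RHO_SI = 2.07

def mixture_sld(d2o_fraction):
    """SLD of an H2O/D2O mixture, assuming ideal volume-fraction mixing."""
    return d2o_fraction * RHO_D2O + (1.0 - d2o_fraction) * RHO_H2O

def matching_d2o_fraction(target_sld):
    """D2O volume fraction at which the mixture SLD equals target_sld."""
    return (target_sld - RHO_H2O) / (RHO_D2O - RHO_H2O)

print("SLD of 38% D2O:", round(mixture_sld(0.38), 2))
print("D2O fraction matching silicon:", round(matching_d2o_fraction(RHO_SI), 3))
```

In a silicon-matched solvent the substrate and the bulk liquid have the same scattering length density, so the reflectivity in that contrast is determined by the layers between them.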


3.4 Interaction Between NHE1-LID and the POPC Lipid Bilayer


1. Replace the buffer with ultrapure water prior to the injection of the NHE1-LID solution. The NHE1-LID is particularly sensitive to the presence of salt in the solution; hence, to avoid unwanted protein precipitation, ultrapure water is introduced in the NR cell with the supported lipid bilayer prior to the protein injection (see Note 8). A 10-min injection of ultrapure water is suitable for this purpose. Use the HPLC pump for the water injection. Other protein buffers may apply to other protein systems.
2. Inject 2 mL of NHE1-LID solution. As for the POPC vesicle suspension, the NHE1-LID solution can be injected either manually with a syringe or with a syringe pump (0.5–1 mL/min flow rate).
3. Wait 30 min to let the NHE1-LID molecules interact with the lipid membrane (see Note 9).
4. Remove the excess protein by injecting 10 mL of ultrapure water (see Note 10). Because of potential aggregation of the NHE1-LID in the presence of salt, it is recommended to perform this operation with ultrapure water; this helps avoid undesired protein aggregation and precipitation in the NR cell. As in step 1, other buffers may apply to other systems.
5. Collect the NR data in at least the three different buffers: d-buffer, smw-buffer, and h-buffer (see Note 6). Use the HPLC pump for the injection of 20 mL of buffer. The collected experimental data in the case of NHE1-LID are reported in Fig. 3a.

3.5 Data Analysis

NR data analysis involves the comparison of a theoretical model for the sample structure to the collected experimental data. The theoretical model contains a defined number of parameters, which are varied in order to maximize the agreement between the theoretical NR curve and the experimental data. An NR theoretical model is typically based on either the Parratt formalism or the optical matrix method, both of which require dividing the sample into layers with different composition [30]. A total of four parameters are assigned to each layer in the model, i.e., thickness (t), scattering length density (ρ), solvent volume fraction (ϕs), and surface roughness (r). Here, we provide some of the fundamental steps for schematically representing the sample as a stack of layers in the specific case of a supported lipid bilayer interacting with an IDP. This representation of the sample can be used to build a theoretical model for the NR data analysis using one of the several software programs available for NR data analysis (www.reflectometry.net). From the optimization of the model parameters listed above to the experimental data, the scattering length density profile (ρ(z)) can be calculated (Figs. 2b and 3b). ρ(z) will contain the information on the average


Fig. 3 (a) NR experimental data with the relative fitting curves for the POPC bilayer after interaction with NHE1-LID. Data corresponding to the different buffer compositions are multiplied by a scaling factor (noted in the figure) to improve data visualization. (b) Scattering length density profile calculated from the optimization to the experimental data of the parameters in the theoretical model

distribution of the molecules composing the sample in the direction normal to the substrate surface. All the available software programs for NR data analysis provide the calculation of ρ(z) from the fit of the designed theoretical model to the experimental data.

1. Start by analyzing the data collected on the bare silicon substrate. The silicon oxide layer present on the substrate surface is not expected to change upon injection of the vesicles and formation of the lipid bilayer. Keep the structural parameters for the silicon oxide layer fixed during the next steps of data analysis to reduce the number of free parameters in the model. In this case, the parameters for the silicon oxide layer were t = (10 ± 1) Å; ρ = 3.46 × 10⁻⁶ Å⁻²; ϕs = (0.20 ± 0.05); r = (3 ± 1) Å.
2. Use three different layers to describe the supported lipid bilayer: two layers for the inner and outer lipid headgroups and one intermediate layer for the lipid acyl chains. Also include the silicon oxide layer in the model, and fix its structural parameters to the ones obtained in step 1. In this specific case, since the bilayer is composed only of POPC molecules, it is possible to calculate the scattering length density for both the lipid headgroup and acyl chains from their chemical composition and the molecular volumes reported in the literature [31] (ρ = 1.86 × 10⁻⁶ Å⁻² and ρ = −0.29 × 10⁻⁶ Å⁻², respectively). The structural parameters obtained for the characterized POPC bilayer are reported in Table 1 and are in good agreement with previous results [32]. If the data analysis shows that the inner and outer headgroup layers converge to the same parameter values, constrain these two layers to have the same structural parameters during further refinement of the fitting curves to minimize the number of free parameters. The scattering length density profile calculated for the POPC lipid bilayer is reported in Fig. 2b.

Table 1 Structural parameters obtained from the analysis of the NR data collected for the POPC bilayer before and after interaction with NHE1-LID

POPC bilayer
Layer                              t (Å)     ρ∙10⁶ (Å⁻²)      ϕs        r (Å)
Inner and outer headgroup layer    7 ± 1     1.8 ± 0.2        29 ± 5    3 ± 1
Acyl chains                        31 ± 1    −0.27 ± 0.03     18 ± 1    3 ± 1

POPC bilayer + NHE1-LID
Layer                              t (Å)     ρ∙10⁶ (Å⁻²)      ϕs        r (Å)
Inner and outer headgroup layer    8 ± 1     1.9 ± 0.2        45 ± 7    3 ± 1
Inner acyl chains                  13 ± 2    −0.27 ± 0.02     23 ± 2    3 ± 1
Outer acyl chains + NHE1-LID       22 ± 2    0.07 ± 0.01      23 ± 2    3 ± 1
NHE1-LID                           28 ± 2    3.1 ± 0.1 (d-buffer); 2.4 ± 0.1 (smw-buffer); 1.9 ± 0.1 (h-buffer)    79 ± 2    4 ± 1


Alessandra Luchini and Lise Arleth

Fig. 4 Schematic representation of the different hypotheses for the NHE1-LID interaction with the POPC bilayer. (a) The protein molecules interact with the lipid headgroups and are hence mainly located on the membrane surface; (b) the protein molecules penetrate into the hydrophobic region of the bilayer but are still mainly located in the outer leaflet; (c) the protein molecules are able to cross the membrane and are hence found in the hydrophobic region of both the outer and the inner leaflet. Dashed lines mark the different interfaces used to build the NR theoretical model

3. With the structural information on the pure lipid bilayer obtained in step 2, proceed with the analysis of the data collected after the addition of the IDP, in this case NHE1-LID. Compare the data collected with and without the protein. Visual inspection of the data trend can sometimes help in designing a model for the data analysis. If the data trend for the curves collected with and without the protein is identical, no detectable interaction occurred between the bilayer and the IDP. Build different theoretical models based on reasonable hypotheses for the IDP-bilayer interaction. The following scenarios can be considered in the case of NHE1-LID: (1) the protein molecules interact with the lipid headgroups and are hence mainly located on the membrane surface; (2) the protein molecules penetrate into the hydrophobic region of the bilayer but are still mainly located in the outer leaflet; (3) the protein molecules are able to cross the membrane and are hence found in the hydrophobic region of both the outer and the inner leaflet. Figure 4 shows a schematic representation of these potential locations of an IDP interacting with a lipid membrane, using NHE1-LID as an example (see Note 11). Compare the different theoretical models to the experimental data, and identify the one that best reproduces the data. In the case of NHE1-LID, model (b) in Fig. 4, corresponding to scenario (2), produced the best agreement with the experimental data. According to this hypothesis, the model for the NR data analysis should consider five different layers to describe the sample structure (i.e., layer 1, inner lipid headgroups; layer 2, inner acyl chain region; layer 3, outer acyl chain region + NHE1-LID; layer 4, outer headgroup layer + NHE1-LID; layer 5, NHE1-LID outside the bilayer) and an additional layer to describe the silicon oxide on the substrate surface (six layers in total). The protein volume fractions in layers 3 and 5 can be constrained to represent the same number of protein molecules, as in this specific case, if the protein molecules are hypothesized to be only partially inserted in the lipid bilayer (see Note 12).

4. Optimize the structural parameters of the best model to obtain an accurate description of the sample structure (see Note 13). The optimized structural parameters obtained for NHE1-LID interacting with the POPC lipid bilayer are reported in Table 1. In this case, we found that the NHE1-LID molecules partially penetrated the POPC bilayer without substantially affecting the overall membrane structure. The total membrane thickness and the structural parameters for the lipid headgroup layers remained almost unaffected compared to the POPC bilayer before the protein injection. The scattering length density profile calculated for NHE1-LID interacting with the POPC-supported lipid bilayer is reported in Fig. 3b.
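The layer-model analysis described in steps 1 to 4 can be illustrated with a minimal Python sketch. It is not the software used by the authors; it only shows, under stated assumptions, how a stack of layers (thickness, scattering length density, solvent volume fraction, roughness) is turned into a theoretical reflectivity curve that a fitting program would then compare with the data. The Parratt recursion with a Névot-Croce roughness factor is a common textbook choice and is assumed here. The silicon scattering length density (2.07 × 10⁻⁶ Å⁻²), the roughness of the silicon/silicon oxide interface, and the layer values (chosen to be of the order of those quoted in the text) are illustrative assumptions.

```python
import numpy as np

SLD_SI = 2.07e-6   # Å^-2, silicon (assumed literature value, not given in the text)
SLD_D2O = 6.35e-6  # Å^-2, D2O (value quoted in Note 1)


def layer_sld(sld_material, solvent_fraction, sld_solvent):
    """Effective SLD of a layer in which a volume fraction of solvent is present."""
    return (1.0 - solvent_fraction) * sld_material + solvent_fraction * sld_solvent


def reflectivity(q, sld, thickness, roughness):
    """Specular reflectivity of a layer stack (Parratt recursion).

    sld       : SLDs (Å^-2) from fronting to backing medium, length N + 2
    thickness : thicknesses (Å) of the N internal layers
    roughness : roughness (Å) of each of the N + 1 interfaces
    """
    q = np.asarray(q, dtype=float)
    sld = np.asarray(sld, dtype=float)
    # Normal wavevector component in every medium (complex below the critical edge)
    kz = np.sqrt((q[:, None] / 2.0) ** 2 - 4.0 * np.pi * (sld[None, :] - sld[0]) + 0j)
    n_interfaces = len(sld) - 1
    amplitude = np.zeros(len(q), dtype=complex)
    # Work upwards from the interface with the backing medium
    for j in range(n_interfaces - 1, -1, -1):
        k_up, k_down = kz[:, j], kz[:, j + 1]
        fresnel = (k_up - k_down) / (k_up + k_down)
        fresnel = fresnel * np.exp(-2.0 * k_up * k_down * roughness[j] ** 2)
        if j == n_interfaces - 1:
            amplitude = fresnel
        else:
            phase = np.exp(2j * k_down * thickness[j])
            amplitude = (fresnel + amplitude * phase) / (1.0 + fresnel * amplitude * phase)
    return np.abs(amplitude) ** 2


# (thickness Å, material SLD Å^-2, solvent volume fraction); illustrative values only
layers = [
    (10.0, 3.46e-6, 0.20),   # silicon oxide
    (7.0, 1.86e-6, 0.30),    # inner headgroups
    (31.0, -0.29e-6, 0.18),  # acyl chains
    (7.0, 1.86e-6, 0.30),    # outer headgroups
]
sld_stack = [SLD_SI] + [layer_sld(m, f, SLD_D2O) for _, m, f in layers] + [SLD_D2O]
thick = [t for t, _, _ in layers]
rough = [3.0] * (len(layers) + 1)  # 3 Å assumed at every interface

q = np.linspace(0.005, 0.25, 200)  # Å^-1
r_bare = reflectivity(q, [SLD_SI, SLD_D2O], [], [0.0])   # bare Si/D2O interface
r_bilayer = reflectivity(q, sld_stack, thick, rough)      # substrate + bilayer in D2O
q_critical = 4.0 * np.sqrt(np.pi * (SLD_D2O - SLD_SI))    # critical edge, Å^-1
```

In a real analysis the layer parameters would be varied by the fitting software until the calculated curves agree with the data in all contrasts simultaneously; only the solvent scattering length density changes from one contrast to the next, while the structural parameters are shared.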

4

Notes

1. Neutrons interact differently with the two hydrogen isotopes, protium and deuterium [33]. As a consequence, D2O and H2O have very different scattering length densities (6.35 × 10⁻⁶ Å⁻² and −0.56 × 10⁻⁶ Å⁻², respectively). Measurements with the sample or the bare substrate in contact with D2O are characterized by a lower background (one of the main sources of background is the incoherent scattering from the protium nuclei) and exhibit a critical edge [15] at low q. Hence, it is desirable to start the experiment with the first measurement in D2O, as the signal from the sample will be stronger and it will be easier to evaluate whether the sample is correctly aligned in the instrument (sample misalignment might affect both the shape and the position of the critical edge, and it might produce R < 1 before the critical edge).

2. Alternative detergent solutions can be used as well.

3. The main goal of the substrate cleaning step is to ensure that all the unwanted organic molecules are removed from the substrate surface and that the surface is indeed hydrophilic. This is a very important step, since a hydrophilic surface is a fundamental requirement for the subsequent formation of the supported lipid bilayer. Milder approaches for cleaning the substrate surface involve sonication of the silicon crystal in organic solvents, i.e., sequentially chloroform, acetone, and ethanol, followed by surface treatment with a UV ozone lamp or a plasma cleaner (20 min and 3 min, respectively). If the substrate is cleaned with the piranha solution well in advance of the beginning of the experiment (e.g., 1 or 2 days before), it is advisable to use the UV ozone lamp or the plasma cleaner treatment before assembling the NR cell. This further treatment will restore the surface hydrophilicity that the substrate might have lost if not used right after the cleaning with the piranha solution.

4. Alternatively, methanol, ethanol, or isopropanol can be used as well.

5. POPC vesicles tend to rearrange over time from SUVs to multilamellar vesicles. Hence, while the POPC suspension can be prepared a few days in advance of the NR experiment, it is advisable to perform the tip sonication right before injecting the solution into the NR cell. This will favor the vesicle fusion process and allow for the formation of a supported lipid bilayer with a better surface coverage.

6. A quick visual inspection of the NR curves collected after the deposition of the lipids can provide qualitative information on the successful formation of the supported lipid bilayer on the substrate surface. In particular, in the d-buffer contrast, the presence of lipids on the substrate surface should produce a higher reflectivity signal than the substrate alone. One of the main reasons for the failure of the supported lipid bilayer formation is that the substrate surface is not sufficiently hydrophilic. In that case, it is recommended to unmount the NR cell and start over from the substrate cleaning (Subheading 3.1).

7. NR data analysis involves the comparison of a theoretical model for the sample structure to the collected experimental data. The theoretical model contains a defined number of parameters, which are varied in order to maximize the agreement between the theoretical NR curve and the experimental data. In the case of a single NR curve, multiple combinations of parameter values can produce a reasonable fitting curve. Hence, to make the fitting procedure converge to a unique set of parameter values, NR curves for the same sample, but in different contrast situations, i.e., in contact with buffers prepared with different D2O and H2O ratios, are collected and simultaneously analyzed. The different D2O and H2O content in the buffer will produce a different contrast (difference in the scattering length density of the sample compared to the buffer), and thus the NR curves will show a different data trend although they correspond to the same sample structure. Three different contrasts are typically sufficient to obtain a reliable data analysis [34].

IDP-Membrane Interaction by Neutron Reflectometry


8. Aggregation in the presence of salt is a peculiar characteristic of NHE1-LID, which required the protein to be injected into the cell containing the lipid bilayer as a solution in ultrapure water. Other IDPs can behave differently and be stable in buffer solutions. In these cases, there is no need to introduce ultrapure water before injecting the protein solution.

9. The interaction between an IDP and a lipid membrane can take a variable amount of time to occur depending on the affinity of the protein for the specific lipid species present in the membrane. Before performing the NR experiment, other lab-based surface-sensitive techniques, such as quartz crystal microbalance with dissipation monitoring (QCM-D) [35] or surface plasmon resonance (SPR) [36], can be used to define the optimal incubation time required for the IDP molecules to interact with the lipid membrane. This incubation time might vary widely from one IDP to another and with the membrane lipid composition.

10. Because different contrasts are needed for NR data analysis, the sample investigated by NR must be stable under buffer flushing through the cell. Indeed, the main assumption during the simultaneous analysis of NR curves in different contrast situations is that the structure of the sample remains unaffected by replacing one contrast with another. For this reason, the protein excess is removed from the cell, and only the IDP molecules that are stably adsorbed on the lipid membrane (i.e., molecules that are not removed during buffer flushing through the cell) will be detected. In practice, IDP molecules that are in solution or that are weakly adsorbed on the membrane surface (the above-named protein excess) will be removed during a first water or buffer injection step. After collecting a first NR curve for the IDP adsorbed on the membrane (preferably in D2O or d-buffer), it is advisable to repeat the water/buffer injection and collect another NR curve in the same contrast as the first one to make sure that the sample is indeed not affected by further protein removal. Lab techniques such as QCM-D and SPR can again provide useful preliminary information on the sample stability under water/buffer flushing.

11. Figure 4 shows three extreme situations for the interaction between an IDP and a lipid membrane. Other intermediate scenarios in which the protein only partially penetrates into the outer and inner leaflets could also be considered. However, these would involve building a more complex model containing a larger number of layers than the ones reported here and hence also a larger number of parameters to be optimized.


12. By considering that a single NHE1-LID molecule is partially inserted among the lipid acyl chains in the outer membrane leaflet and partially outside the membrane and exposed to the solvent, the scattering length density of the outer acyl chain layer (ρouter acyl chain) can be expressed according to Eq. 1:

\[ \rho_{\text{outer acyl chain}} = \phi_{\text{NHE1-LID},3}\,\rho_{\text{NHE1-LID}} + \left(1 - \phi_{\text{NHE1-LID},3}\right)\rho_{\text{outer acyl chain}} \tag{1} \]

where ϕNHE1-LID,3 is the NHE1-LID volume fraction in the outer acyl chain layer (layer 3 in Fig. 4b), and ρouter acyl chain on the right-hand side refers to the scattering length density of the lipid acyl chains alone. Calculation of the NHE1-LID scattering length density (ρd-buffer = 3.15 × 10⁻⁶ Å⁻²; ρsmw-buffer = 2.41 × 10⁻⁶ Å⁻²; ρh-buffer = 1.87 × 10⁻⁶ Å⁻²) shows that, in particular in the h-buffer, the protein has a scattering length density very close to that of the lipid headgroups (i.e., 1.86 × 10⁻⁶ Å⁻²); hence an expression similar to Eq. 1 cannot be used for the lipid headgroup layers, i.e., the neutrons are not capable of distinguishing the NHE1-LID molecules from the lipid headgroups. On the other hand, a very high contrast is present between NHE1-LID and the lipid acyl chains and between NHE1-LID and the buffer. Hence, information on the amount of the IDP interacting with the membrane can be obtained only from the scattering length density values associated with layers 3 and 5 of Fig. 4b. By implementing Eq. 2 in the model for the NR data analysis, the number of NHE1-LID molecules present outside the bilayer and inside the bilayer was constrained to be the same. Since a layer i of thickness ti contains ϕNHE1-LID,i ti / VNHE1-LID,i protein molecules per unit area, equating this number for layers 3 and 5 gives:

\[ \phi_{\text{NHE1-LID},3} = \frac{V_{\text{NHE1-LID},3}\, t_5}{V_{\text{NHE1-LID},5}\, t_3}\,\phi_{\text{NHE1-LID},5} \tag{2} \]

In Eq. 2, VNHE1-LID,3 and VNHE1-LID,5 are, respectively, the protein volumes in layers 3 and 5, t3 and t5 are the thicknesses of layers 3 and 5, and ϕNHE1-LID,5 is the NHE1-LID volume fraction in the layer outside the lipid bilayer. VNHE1-LID,3 and VNHE1-LID,5 were calculated by considering the total protein volume (8169 Å³, as calculated from the protein sequence) and by introducing in the model an additional parameter (f) quantifying the fraction of the protein molecular volume inside and outside the membrane. A first optimization of the model parameters against the experimental data showed that the parameters for the two headgroup layers converged to similar values. For this reason, the theoretical curves reported in Fig. 3a were obtained by constraining these two layers to have the same structure. This gave a considerable reduction in the number of free parameters in the model.

13. In Fig. 3b the NHE1-LID interacting with the membrane is represented as a red area. This is because, in principle, the interaction between an IDP and a lipid bilayer can affect the protein secondary structure, and an IDP can potentially become structured upon interacting with the lipids. However, NR cannot directly detect the protein secondary structure. For this reason, the NHE1-LID structure was not represented in Fig. 3b.
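As a worked illustration of Notes 7 and 12, the short Python sketch below evaluates the scattering length density of an H2O/D2O mixture and applies the equal-number-of-molecules constraint between layers 3 and 5, written here as ϕ3 = (V3 t5)/(V5 t3) ϕ5, followed by Eq. 1. It is only a sketch under stated assumptions: the function names are hypothetical, the value of f (the fraction of the protein volume inside the membrane) is invented for the example because it is not reported in the text, ϕ5 is taken as one minus an assumed solvent fraction of 0.79, and the acyl chain scattering length density is taken as negative, as expected for hydrocarbon chains.

```python
V_PROTEIN = 8169.0   # Å^3, NHE1-LID molecular volume quoted in Note 12
SLD_D2O = 6.35e-6    # Å^-2, quoted in Note 1
SLD_H2O = -0.56e-6   # Å^-2, quoted in Note 1 (negative sign assumed)


def buffer_sld(d2o_fraction):
    """SLD of an H2O/D2O mixture for a given D2O volume fraction."""
    return d2o_fraction * SLD_D2O + (1.0 - d2o_fraction) * SLD_H2O


def phi_layer3(phi_5, t_3, t_5, f_inside):
    """Protein volume fraction in layer 3 for equal molecule numbers in layers 3 and 5."""
    v_3 = f_inside * V_PROTEIN           # protein volume per molecule inside the bilayer
    v_5 = (1.0 - f_inside) * V_PROTEIN   # protein volume per molecule outside the bilayer
    return (v_3 * t_5) / (v_5 * t_3) * phi_5


def outer_acyl_sld(phi_3, sld_protein, sld_acyl):
    """Eq. 1: SLD of the outer acyl chain layer containing protein."""
    return phi_3 * sld_protein + (1.0 - phi_3) * sld_acyl


# Illustrative inputs (thicknesses in Å); f_inside is a hypothetical value
t_3, t_5 = 22.0, 28.0
phi_5 = 0.21
f_inside = 0.30

phi_3 = phi_layer3(phi_5, t_3, t_5, f_inside)
# Molecules per unit area in each layer; the constraint makes these equal
molecules_3 = phi_3 * t_3 / (f_inside * V_PROTEIN)
molecules_5 = phi_5 * t_5 / ((1.0 - f_inside) * V_PROTEIN)
# SLD of layer 3 in the d-buffer contrast (protein SLD 3.15e-6 Å^-2)
sld_layer3_d = outer_acyl_sld(phi_3, 3.15e-6, -0.29e-6)
```

Because f enters both V3 and V5, the fitting program only needs ϕ5, t3, t5, and f as free parameters; ϕ3 follows from the constraint and is not fitted independently.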

Acknowledgments

The work was supported by grants from the Novo Nordisk Foundation Interdisciplinary Synergy program and the Lundbeck Foundation “BRAINSTRUC” project. The authors thank Birthe B. Kragelund and Stine F. Pedersen for sharing general advice on the sample and Emilie Skotte Pedersen for producing the NHE1-LID used to collect the reported NR data. The authors also thank the ISIS neutron source for the allocation of beamtime and Mario Campana for the support during the experiment. The authors thank Alessio Laloni for designing Fig. 1.

References

1. Uversky VN (2019) Intrinsically disordered proteins and their “Mysterious” (meta)physics. Front Phys. https://doi.org/10.3389/fphy.2019.00010
2. Theillet FX, Binolfi A, Frembgen-Kesner T et al (2014) Physicochemical properties of cells and their effects on intrinsically disordered proteins (IDPs). Chem Rev 114(13):6661–6714
3. Haxholm GW, Nikolajsen LF, Olsen JG et al (2015) Intrinsically disordered cytoplasmic domains of two cytokine receptors mediate conserved interactions with membranes. Biochem J 468(3):495–506. https://doi.org/10.1042/BJ20141243
4. Sigalov AB (2010) Membrane binding of intrinsically disordered proteins: critical importance of an appropriate membrane model. Self Nonself 1(2):129–132. https://doi.org/10.4161/self.1.2.11547
5. Whited AM, Johs A (2015) The interactions of peripheral membrane proteins with biological membranes. Chem Phys Lipids 192:51–59. https://doi.org/10.1016/j.chemphyslip.2015.07.015
6. Fragneto-Cusani G (2001) Neutron reflectivity at the solid/liquid interface: examples of applications in biophysics. J Phys Condens Matter 13(21):4973–4989
7. Penfold J (2002) Neutron reflectivity and soft condensed matter. Curr Opin Colloid Interface Sci 7(1):139–147. https://doi.org/10.1016/S1359-0294(02)00015-8
8. Pomorski TG, Nylander T, Cardenas M (2014) Model cell membranes: discerning lipid and protein contributions in shaping the cell. Adv Colloid Interf Sci 205:207–220. https://doi.org/10.1016/j.cis.2013.10.028
9. Fragneto G (2012) Neutrons and model membranes. Eur Phys J Spec Top 213(1):327–342. https://doi.org/10.1140/epjst/e2012-01680-5
10. Katsaras J, Gutberlet T (2001) Lipid bilayers: structure and interactions, Biological physics series. Springer, New York
11. Losche M (2002) Surface-sensitive X-ray and neutron scattering characterization of planar lipid model membranes and lipid/peptide interactions. Curr Top Membr 52:117–161
12. Wacklin HP, Bremec BB, Moulin M et al (2016) Neutron reflection study of the interaction of the eukaryotic pore-forming actinoporin equinatoxin II with lipid membranes reveals intermediate states in pore formation. Biochim Biophys Acta 1858(4):640–652. https://doi.org/10.1016/j.bbamem.2015.12.019
13. Vitiello G, Falanga A, Petruk AA et al (2015) Fusion of raft-like lipid bilayers operated by a membranotropic domain of the HSV-type I glycoprotein gH occurs through a cholesterol-dependent mechanism. Soft Matter 11(15):3003–3016. https://doi.org/10.1039/c4sm02769h
14. Richter RP, Brisson AR (2005) Following the formation of supported lipid bilayers on mica: A study combining AFM, QCM-D, and ellipsometry. Biophys J 88(5):3422–3433
15. Daillant J, Gibaud A (2009) X-ray and neutron reflectivity: principles and applications. Springer, New York, NY
16. Hellstrand E, Grey M, Ainalem M-L et al (2013) Adsorption of α-synuclein to supported lipid bilayers: positioning and role of electrostatics. ACS Chem Neurosci 4(10):1339–1351. https://doi.org/10.1021/cn400066t
17. Norholm AB, Hendus-Altenburger R, Bjerre G et al (2011) The intracellular distal tail of the Na+/H+ exchanger NHE1 is intrinsically disordered: implications for NHE1 trafficking. Biochemistry 50(17):3469–3480. https://doi.org/10.1021/bi1019989
18. Hendus-Altenburger R, Kragelund BB, Pedersen SF (2014) Structural dynamics and regulation of the mammalian SLC9A family of Na(+)/H(+) exchangers. Curr Top Membr 73:69–148. https://doi.org/10.1016/B978-0-12-800223-0.00002-5
19. Slepkov E, Fliegel L (2002) Structure and function of the NHE1 isoform of the Na+/H+ exchanger. Biochem Cell Biol 80(5):499–508
20. Clifton LANC, Lakey JH (2013) Examining protein–lipid complexes using neutron scattering, Methods in molecular biology (methods and protocols), vol 974. Springer, New York
21. Penfold J (1991) Instrumentation for neutron reflectivity. Phys B Condens Matter 173(1):1–10. https://doi.org/10.1016/0921-4526(91)90028-D
22. Thomas RK, Penfold J (1996) Neutron and X-ray reflectometry of interfacial systems in colloid and polymer chemistry. Curr Opin Colloid Interface Sci 1(1):23–33. https://doi.org/10.1016/S1359-0294(96)80040-9
23. Ruth Hendus-Altenburger JV, Pedersen ES, Luchini A, Araya-Secchi R, Prestel A, Bendsoe AH, Wakabayashi S, Cardenas M, Cuesta EP, Arleth L, Pedersen SF, Kragelund BB. The lipid-binding domain of human Na+/H+ exchanger 1 forms a helical lipid-protein co-structure essential for activity: a new principle for membrane protein regulation via disordered domains
24. Rennie AR, Hellsing MS, Lindholm E et al (2015) Note: Sample cells to investigate solid/liquid interfaces with neutrons. Rev Sci Instrum 86(1):016115. https://doi.org/10.1063/1.4906518
25. Lind TK, Cardenas M, Wacklin HP (2014) Formation of supported lipid bilayers by vesicle fusion: effect of deposition temperature. Langmuir 30(25):7259–7263
26. Patil YP, Jadhav S (2014) Novel methods for liposome preparation. Chem Phys Lipids 177:8–18. https://doi.org/10.1016/j.chemphyslip.2013.10.011
27. Richter R, Mukhopadhyay A, Brisson A (2003) Pathways of lipid vesicle deposition on solid surfaces: a combined QCM-D and AFM study. Biophys J 85(5):3035–3047
28. Richter RP, Berat R, Brisson AR (2006) Formation of solid-supported lipid bilayers: an integrated view. Langmuir 22(8):3497–3505. https://doi.org/10.1021/la052687c
29. Wacklin HP (2010) Neutron reflection from supported lipid membranes. Curr Opin Colloid Interface Sci 15(6):445–454. https://doi.org/10.1016/j.cocis.2010.05.008
30. Hamley IW, Pedersen JS (1994) Analysis of neutron and X-ray reflectivity data. I. Theory. J Appl Crystallogr 27(1):29–35. https://doi.org/10.1107/S0021889893006260
31. Kučerka N, Tristram-Nagle S, Nagle JF (2006) Structure of Fully Hydrated Fluid Phase Lipid Bilayers with Monounsaturated Chains. J Membr Biol 208(3):193–202. https://doi.org/10.1007/s00232-005-7006-8
32. Luchini A, Nzulumike ANO, Lind TK et al (2019) Towards biomimics of cell membranes: Structural effect of phosphatidylinositol triphosphate (PIP3) on a lipid bilayer. Colloids Surf B Biointerfaces 173:202–209. https://doi.org/10.1016/j.colsurfb.2018.09.031
33. Heinrich F (2016) Deuteration in biological neutron reflectometry. Methods Enzymol 566:211–230. https://doi.org/10.1016/bs.mie.2015.05.019
34. Krueger S (2001) Neutron reflection from interfaces with biological and biomimetic materials. Curr Opin Colloid Interface Sci 6(2):111–117. https://doi.org/10.1016/S1359-0294(01)00073-5
35. Cho NJ, Frank CW, Kasemo B et al (2010) Quartz crystal microbalance with dissipation monitoring of supported lipid bilayers on various substrates. Nat Protoc 5(6):1096–1106. https://doi.org/10.1038/nprot.2010.65
36. Beseničar M, Maček P, Lakey JH et al (2006) Surface plasmon resonance in protein–membrane interactions. Chem Phys Lipids 141(1):169–178. https://doi.org/10.1016/j.chemphyslip.2006.02.010

Chapter 30

Interactions of IDPs with Membranes Using Dark-State Exchange NMR Spectroscopy

Tapojyoti Das, Diana Acosta, and David Eliezer

Abstract

Membrane interactions of proteins play a role in essential cellular processes in both physiological and disease states. The structural flexibility of intrinsically disordered proteins (IDPs) allows for interactions with multiple partners, including membranes. However, determining conformational states of IDPs when interacting with membranes can be challenging. Here we describe the use of nuclear magnetic resonance (NMR), including dark-state exchange saturation transfer (DEST), to probe IDP-membrane interactions in order to determine whether there is an interaction, which residues participate, and the extent/nature of the interaction between the protein and the membrane. Using α-synuclein and tau as typical examples, we provide protocols for how the membrane interactions of IDPs can be probed, including details of how the samples should be prepared and guidelines on how to interpret the results.

Key words IDP, Membrane interactions, NMR, Solution state, Dark-state exchange spectroscopy, α-synuclein, Tau

1

Introduction

1.1 Application of NMR for Probing IDP-Membrane Interactions

NMR spectroscopy has provided insights into the effects of membrane interactions on structural characteristics of IDPs such as secondary structure formation. Solution-state NMR is highly dependent on the tumbling rate of biomolecules. Slower tumbling caused by the presence of large structures like lipid small unilamellar vesicles (SUVs, 30–40 nm in diameter) can result in broad, undetectable NMR signals. For this reason, detergent micelles (4–5 nm in diameter) [1] are often used to analyze membrane-bound conformation of IDPs using solution-state NMR [2]. However, solution NMR can still be useful in studying the interactions of IDPs with large objects like lipid vesicles, because sites that are not associated with the vesicle surface remain highly flexible, and give rise to detectable NMR signals [3]. These signals can be used to determine the free fraction of protein at a residue-resolved level by

Birthe B. Kragelund and Karen Skriver (eds.), Intrinsically Disordered Proteins: Methods and Protocols, Methods in Molecular Biology, vol. 2141, https://doi.org/10.1007/978-1-0716-0524-0_30, © Springer Science+Business Media, LLC, part of Springer Nature 2020


Tapojyoti Das et al.

Fig. 1 Principle of dark-state exchange saturation transfer (DEST). (a) Dark-state excitation: Selective saturation of the vesicle-bound state of a protein can be achieved by a saturation pulse applied away from the resonance frequency. Saturation in NMR terminology means that the net magnetization is zero in all three dimensions. Hence, a saturated nucleus does not produce any detectable NMR signal. The red and cyan lines represent the free-state and vesicle-bound-state peaks, with linewidths of 100 Hz and 5000 Hz, respectively. The pink-shaded region represents the saturation pulse bandwidth (400 Hz). The plot on the right is scaled up from the region indicated by a dashed cyan box in the left plot to show the relative area under the curves within the saturation bandwidth, which represents the saturation efficiency at that particular offset. (b) Saturation transfer by a slow exchange process: The exchange process is governed by an on-rate (kon) that causes an increase in the transverse relaxation rate in the vesicle-bound sample (ΔR2). The vesicle-bound state is saturated by the saturation pulse. When the saturated molecules come off the vesicles (saturation transfer) with an off-rate koff, they mix with the free-state molecules and cause a proportionate decrease in the NMR signal. By saturating the system at different offsets and using saturation pulses of different bandwidths, a DEST profile is determined experimentally. To extract kinetic rate constants, the parameters of an appropriate kinetic model are fit computationally so that the theoretical DEST profile matches the experimental DEST profile
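The area argument in panel (a) of Fig. 1 can be made concrete with a small Python sketch. It assumes ideal Lorentzian lineshapes and a rectangular saturation window, and it uses the fraction of the line area inside the window as a simplified proxy for saturation efficiency, as the caption does; real saturation also depends on the radiofrequency field strength and on relaxation rates. The 2000 Hz offset and the function name are hypothetical choices for illustration.

```python
import math


def lorentzian_area_fraction(linewidth_hz, offset_hz, bandwidth_hz):
    """Fraction of a Lorentzian line (full width at half height = linewidth_hz)
    that falls inside a window of width bandwidth_hz centered offset_hz away
    from the resonance frequency."""
    half_width = linewidth_hz / 2.0
    lower = offset_hz - bandwidth_hz / 2.0
    upper = offset_hz + bandwidth_hz / 2.0
    return (math.atan(upper / half_width) - math.atan(lower / half_width)) / math.pi


BANDWIDTH = 400.0  # Hz, saturation bandwidth from Fig. 1
OFFSET = 2000.0    # Hz, hypothetical saturation offset

free_fraction = lorentzian_area_fraction(100.0, OFFSET, BANDWIDTH)    # free state
bound_fraction = lorentzian_area_fraction(5000.0, OFFSET, BANDWIDTH)  # vesicle-bound state

print(f"free state:  {free_fraction:.4f} of the line area is inside the window")
print(f"bound state: {bound_fraction:.4f} of the line area is inside the window")
```

At this offset the broad bound-state line places a much larger share of its area inside the saturation window than the narrow free-state line, which is why an off-resonance pulse can saturate the dark state while leaving the free state largely untouched.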

IDP-Membrane by NMR


comparing signal intensities with those from a matched sample without vesicles [4]. In addition, while resonances of vesicle-bound residues are broadened beyond direct detection, the broad signal of the bound-state fraction can be selectively saturated using a narrow-bandwidth saturation pulse applied far from the free-state resonance frequency (Fig. 1). When this saturated bound state exchanges with the free state, the signal from the free state decreases as a consequence. This technique is called dark-state exchange saturation transfer (DEST) and, in combination with T2 relaxation experiments, is a powerful method to probe the kinetics of exchange processes occurring at timescales ranging from approximately 10 ms to 1 s [5]. With these types of experiments, it is possible to uncover which residues are interacting with the membrane, to determine the appropriate kinetic model for the system, and to extract the kinetic rates of such interactions.

As an aside, we note that in contrast with NMR, electron spin resonance (ESR) can be used to directly observe proteins bound to either lipid vesicles or SDS micelles without concerns regarding signal loss. ESR has been used extensively for structural studies of membrane-bound IDPs such as α-synuclein [6–8]. ESR requires introduction of spin labels containing unpaired electrons into protein samples, which is typically accomplished using cysteine chemistry. It can be used both to probe the environment of individual spin labels placed at different positions within a protein and to measure inter-residue distances in proteins containing two (or more) spin labels [9]. While we have used ESR to study protein-membrane interactions [6, 7, 10], the methods and protocols involved in this complementary approach are beyond the scope of this chapter.

1.2 Membrane Interactions of Intrinsically Disordered Proteins

Membrane interactions of IDPs contribute to physiological processes that maintain essential cellular functions. For example, axonal localization of the microtubule-associated protein tau is thought to be mediated by membrane binding and essential for neurite polarity [11, 12]. Tau-lipid interactions can perturb its association with microtubules [13] and alter tau’s conformation [10, 13] in a way that could contribute to tau’s pathological aggregation into paired helical filaments (PHFs) [2, 14, 15]. Additionally, α-synuclein has been implicated in modulation of membrane-related processes, including neurotransmitter release [16]. Interactions with membranes lead to conformational changes of α-synuclein [17], as observed with NMR [3], electron spin resonance [6, 7], and fluorescence spectroscopy [18]. Although we use tau and α-synuclein as examples for the following procedures, NMR techniques can be applied to other IDPs for the study of protein-membrane interactions [4].

2

Materials

Here we specifically describe materials and methods related to NMR spectroscopy. For detailed descriptions of how to produce recombinant protein and how to generate small unilamellar vesicles (SUVs), both of which are used here as examples, the reader is referred to a previous protocol [19] and to Chapter 28 of this volume.

2.1

NMR

1. NMR buffer: 10 mM Na2HPO4, 100 mM NaCl, 10% v/v D2O, pH 6.8.
2. 100 kDa cutoff centrifugal spin column.
3. 5 mm glass NMR tubes.
4. High-field NMR spectrometer (600 MHz and above).
5. Software: NMRPipe, NMRViewJ, CcpNmr Analysis, MATLAB or MATLAB Runtime (free), R language for statistical computing, and T2 and DEST processing and fitting scripts (see Note 1).

3

Methods

3.1

NMR

3.1.1 ¹H-¹⁵N HSQC Sample Preparation

1. Gently resuspend the lyophilized protein in NMR buffer to prepare a stock solution of 200 μM (see Note 2).
2. Filter the protein sample through a 100 kDa cutoff spin filter to remove any large aggregates.
3. For lipid vesicle binding studies, make a sample of 50–100 μM protein with an appropriate concentration of lipid small unilamellar vesicles (SUVs) (we generally use 1 mM; see Subheading 3.1.4 for a discussion of how to select the lipid concentration). It is desirable to keep the protein concentrations equal for different protein variants, as the free- and bound-state populations will vary with total protein concentration. Since SUV properties from different batches can vary, it is also advisable to use the same batch of lipid vesicles for different conditions when possible.
4. Make a matched free-state (lipid-free) sample with the same concentration of protein by mixing an equal volume of protein stock solution with NMR buffer not containing SUVs.
5. Measure the final pH of both samples. Adjust to match if needed with small volumes of HCl or NaOH.
6. Load samples into 5 mm NMR tubes with glass Pasteur pipettes.

3.1.2 NMR Spectrometer Setup

For comparison, all NMR spectra (control and experimental) should be collected on the same instrument and under identical conditions and parameters. For our experiments, we use a Bruker AVANCE 600 MHz spectrometer equipped with a cryogenic triple-resonance probe. The initial steps are the same for every NMR experiment and consist of setting the temperature, locking the magnetic field, tuning and matching, shimming the magnet, and determining 90° high-power pulse widths for every measured nucleus. In general, the ¹H pulse width is determined experimentally, while those of the ¹⁵N and ¹³C nuclei can be determined by looking up a calibration table that is dependent on the probehead and sample solvent (the so-called prosol table). When further accuracy is required, e.g., in ¹⁵N relaxation experiments, 90° pulse widths for ¹⁵N and ¹³C can also be determined experimentally (see Note 3). The following are standard steps for setting up a spectrometer. For specific commands we refer the reader to standard NMR manuals.

1. Adjust the temperature by 5 °C at a time to reach the target temperature before inserting the sample into the instrument (see Note 4).
2. Load your sample in the shuttle, and adjust the sample height using the depth gauge to an appropriate depth.
3. Adjust the lock field and phase to center the lock signal. Once centered, turn the lock on and check that the lock signal is stable.
4. Open a new dataset for your current sample.
5. Tune and match the ¹H and ¹⁵N channels for the sample. Typically, commands that provide automatic adjustments are sufficient for proper tuning and matching; however, manual adjustments are recommended for greater accuracy.
6. Shim the magnet to optimize field homogeneity throughout your sample.
7. Manually calibrate the ¹H 90° hard pulse length (see Note 5).
8. (Optional) Calibrate the ¹⁵N 90° hard pulse length (see Note 6).
9. Calibrate the shaped pulses for each experiment (see Note 7).
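The pulse calibrations in steps 7 to 9 rest on two standard relations for rectangular pulses, sketched below in Python. The radiofrequency field strength follows from the calibrated 90° pulse width, and changing the pulse width at constant flip angle corresponds to a change in attenuation of 20·log10 of the width ratio. The function names and the numerical values are hypothetical, and shaped pulses require additional shape-dependent factors that are not covered here; the spectrometer's own calibration tables remain the reference.

```python
import math


def b1_field_hz(p90_us):
    """RF field strength (gamma*B1/2pi, in Hz) for a rectangular pulse whose
    90° width is p90_us microseconds."""
    return 1.0 / (4.0 * p90_us * 1e-6)


def attenuation_change_db(p90_old_us, p90_new_us):
    """Additional attenuation (dB) that lengthens a rectangular 90° pulse
    from p90_old_us to p90_new_us at constant flip angle."""
    return 20.0 * math.log10(p90_new_us / p90_old_us)


# Hypothetical example: a calibrated 10 µs hard pulse and a desired 40 µs pulse
hard_pulse_us = 10.0
soft_pulse_us = 40.0
field_hz = b1_field_hz(hard_pulse_us)
extra_db = attenuation_change_db(hard_pulse_us, soft_pulse_us)

print(f"RF field strength of the hard pulse: {field_hz / 1000.0:.1f} kHz")
print(f"Extra attenuation for the longer pulse: {extra_db:.2f} dB")
```

A carefully calibrated hard pulse therefore propagates to every power level derived from it, which is why the ¹H 90° pulse is calibrated manually on each sample.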

3.1.3 1H-15N HSQC Experimental Run and Setup

For our experiments, 1H-15N HSQC NMR spectra are collected at 10 °C (α-synuclein) or 30 °C (tau) with 512 complex points in the 1H dimension, 128 complex points in the 15N dimension, a spectral


Tapojyoti Das et al.

width of 13 ppm in the 1H dimension and 26 ppm in the 15N dimension, and an offset of 4.7 ppm in the 1H dimension and 119 ppm in the 15N dimension. However, adjustments of spectral window, number of data points, and number of scans are often needed and are described below.
1. Prepare the sample, insert it in the magnet, and set up the spectrometer as described above.
2. Open your dataset and set the experiment. We use a modified version of an experiment available in the Bruker TopSpin catalogue (hsqcetfpf3gpsi). For convenience, an experiment with its parameters can be copied from an existing dataset to the working folder using the command “edc”.
3. Set up an appropriate spectral window to avoid folded or aliased resonances. Start with a slightly wider spectral width, collect a test spectrum, and then narrow the window as far as possible without producing folded or aliased peaks (see Note 8).
4. Set up an appropriate number of data points to obtain the desired resolution. Typical total data points are 1024 or 2048 for the proton dimension and 160 to 320 for the nitrogen dimension.
5. Set up an appropriate number of scans to achieve an optimal signal-to-noise ratio (S/N). Typically, 4–32 scans are sufficient for our experiments (see Note 9).
6. Look up the prosol table (execute “getprosol”) to calculate the shaped pulses from the calibrated hard pulses (see Note 7).
7. Determine the optimal receiver gain (rg) by executing “rga”. Setting a lower receiver gain than the autocalibrated value is generally recommended to avoid receiver overload. A calibrated receiver gain value that is too low (1000), we generally set it to 512.
8. Check the estimated duration of the experiment by typing “expt”.
9. Once all parameters have been set, run the experiment by entering “zg”.

3.1.4 15N-T2 Relaxation Experimental Setup

For DEST and T2 relaxation experiments, a low concentration of SUV is needed so that around 5% of the sample is in the bound state. This concentration can be determined from vesicle titration experiments (see Note 10). We have achieved good results with 100 μM protein and 1 mM total lipids, but this depends on the affinity between the IDP and the membrane. To prevent lipid oxidation causing time-dependent changes in the spectra during acquisition, we have achieved good results by topping the sample

IDP-Membrane by NMR


with an inert gas such as nitrogen or argon before capping the NMR tube (see Notes 11–13). T2 relaxation measurements are typically performed for both the 5% bound-state and the matched free-state samples. For T2 relaxation experiments, also known as transverse or spin-spin relaxation experiments, we use an interleaved version of the 15N Carr-Purcell-Meiboom-Gill (CPMG) experiment (hsqct2etf3gpsi3d, Bruker TopSpin). This is essentially a pseudo-3D experiment composed of a set of interleaved 2D 1H-15N experiments with a variable relaxation delay. The relaxation delay is changed by varying the number of CPMG units in the pulse program before acquisition. Of note, the 15N 90° pulse width has to be determined precisely for this experiment (see Note 6). Once the initial steps of setting up any NMR experiment have been completed, the general guidelines are as follows:
1. General parameters: We generally use 1024 total points (512 complex points) in the direct dimension and 160–320 total points (80–160 complex points) in the indirect dimension, which is enough to resolve most peaks for tau fragments and for α-synuclein. The number of scans is chosen so that the weakest peaks can still be distinguished at the maximum relaxation delay; 16–32 scans are ideal for our experiments.
2. Shaped pulse calibration: There is a proton 90° shaped pulse in the pulse program for water suppression, and the pulse power (attenuation in dB) is calibrated by calculating the area under the curve in the shaped pulse tool or simply by executing “getprosol” (see Note 7).
3. Relaxation delays: The relaxation delays are multiples of a block of 16 180° pulses that constitutes a single CPMG unit (Fig. 2), the duration of which can be found in the “d31” parameter during the fitting. The multiplier array is read from a separate “vclist” file in the experiment folder. The number of elements in the “vclist” array and the maximum relaxation delay are chosen so that an exponential decay model can be fitted to the intensity series; both can be calibrated from a test experiment that measures the decay of a 1D spectrum with increasing relaxation delay (see Note 14). One delay point is collected in triplicate to enable estimation of the experimental error. Moreover, the multiplier array for the relaxation delay is scrambled, and the delay between pulses (D1) is optimally increased to prevent sample temperature changes during the acquisition. A typical CPMG unit duration on the 700 MHz spectrometer is 16.32 ms, and the multiplier array is {0, 18, 12, 2, 3, 10, 6, 8, 1, 14, 6, 4, 6}, implying that the maximum relaxation duration is 16.32 ms × 18 = 293.76 ms, and the data point at 16.32 ms × 6 = 97.92 ms is collected in triplicate. Finally, it

Fig. 2 T2 relaxation experiment. [Pulse diagram; inter-pulse delays labeled τ, 2τ (×7), τ.] In a 15N-CPMG-T2 relaxation experiment, multiple CPMG units are used in tandem to model the exponential relaxation decay, as indicated by the loss of signal with increasing delay. The narrow and wide bars represent 90° and 180° pulses, respectively. τ represents a unit delay time in between pulses. In a typical 15N-CPMG-T2 experiment, each CPMG block consists of 16 such 180° pulses spaced by a constant delay. The number of such blocks is varied using the “vclist” array to achieve a series of delay times

is essential to set both the number of points in the third dimension and the parameter NBL (size of the buffer that stores FIDs before writing to disk) to the number of relaxation delays used in the experiment.

3.1.5 15N-DEST Experimental Setup

15N-DEST experiments are recorded for the 5% bound-state sample only. We use the pulse sequence developed by Fawzi et al. [20], which is essentially a series of interleaved 2D experiments using saturation at various offsets from the center of the 15N dimension. The pulse sequence for Bruker AVANCE spectrometers (Fig. 3) is freely available from the Marius Clore lab website (see Note 1). Some parameters in the DEST pulse sequence, particularly those related to the saturation pulse, are hard-coded in the pulse program, and a general familiarity with the Bruker pulse programming language is recommended. Here we enumerate the general steps of setting up a 15N-DEST experiment.
1. General parameters: Accurate calibration of the 15N 90° pulse length at high power is essential for the 15N-DEST experiment. The number of points for each 2D plane and the number of scans are guided by similar principles as in 15N-T2 (see Subheading 3.1.4). The number of points in the second dimension is the product of that of an individual experiment and the number of 2D spectra needed (Fig. 3).


Fig. 3 15N-DEST pulse sequence showing the nested loops in the pulse program (adapted with permission from Fawzi et al. [20]). The innermost loop is the mixing time for saturation, during which a continuous-wave saturation pulse is applied on the 15N channel (default 900 ms). The next loop is for different offsets and is set using the parameter l8. Then comes the loop for phase cycling. The outermost loop (l3) is the number of complex points in the indirect dimension for each 2D experiment. Since the whole experiment is an interleaved 2D experiment, the total number of points in the second dimension equals l8 × 2 × l3. For every experiment, the power levels are calibrated for the following pulses: ϕ7 (shaped 180° pulse, IBURP2), ϕ8 (shaped 90° SINC pulse), and CW-ϕ4 (continuous-wave saturation pulse on the 15N channel for the desired bandwidth). The power levels for CW-ϕ4 at the different offsets are hard-coded in the pulse program, while every other parameter can be entered in the acquisition parameter tab in Bruker TopSpin software

2. Saturation pulses: Selective saturation of 15N nuclei at a narrow bandwidth is achieved by a long and weak RF pulse. At least two different saturation bandwidths are recommended for accuracy of model fitting, as detailed later. For a 700 MHz spectrometer, typical bandwidths can be 400 and 175 Hz. The pulse power of the saturation pulse is calculated as follows: first, the time constant ($\tau$) corresponding to a bandwidth ($\nu_{bw}$) is calculated as $\tau = \frac{1}{4\nu_{bw}}$. Next, the pulse power difference in dB ($\Delta\mathrm{dB}$) is determined as $\Delta\mathrm{dB} = 20\log_{10}\left(\frac{\tau}{\tau_{ref}}\right)$, where $\tau_{ref}$ refers to the 15N reference 90° pulse width. Finally, the saturation pulse power in dB ($\mathrm{dB}_{sat}$) corresponding to each saturation bandwidth is calculated as $\mathrm{dB}_{sat} = \mathrm{dB}_{ref} + \Delta\mathrm{dB}$, where $\mathrm{dB}_{ref}$ refers to the absolute power level (in dB) of the reference 90° pulse.
3. Saturation offsets: The 15N saturation pulse is applied at various offsets from the 15N carrier frequency. Typically, we use 23 offsets at each power level with a range of 30 kHz on either side of the carrier frequency and with more points toward the center to capture the dip in signal as the saturation approaches the carrier frequency, plus one reference offset at zero power.
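The saturation power arithmetic described for the saturation pulses can be sketched in a few lines of Python. This is only an illustration of the three formulas in the text; the function name and the reference calibration values (a 25.5 μs 15N 90° pulse at -24.56 dB) are placeholders chosen for the example and must be replaced by the values calibrated on the user's own probe.

```python
import math


def saturation_power_db(bandwidth_hz, ref_pulse_us, ref_power_db):
    """Return (tau_us, delta_db, db_sat) for one saturation bandwidth.

    tau      = 1 / (4 * nu_bw)
    delta_dB = 20 * log10(tau / tau_ref)
    dB_sat   = dB_ref + delta_dB
    """
    tau_us = 1.0e6 / (4.0 * bandwidth_hz)  # time constant in microseconds
    delta_db = 20.0 * math.log10(tau_us / ref_pulse_us)
    return tau_us, delta_db, ref_power_db + delta_db


# Placeholder reference calibration (assumed values, not a recommendation).
REF_PULSE_US = 25.5    # 15N 90-degree hard pulse width, microseconds
REF_POWER_DB = -24.56  # power level of that reference pulse, dB

for bandwidth in (400.0, 175.0):  # bandwidths quoted in the text
    tau, delta, db_sat = saturation_power_db(bandwidth, REF_PULSE_US, REF_POWER_DB)
    print(f"{bandwidth:5.0f} Hz: tau = {tau:7.1f} us, "
          f"delta = {delta:5.2f} dB, dB_sat = {db_sat:5.2f} dB")
```

Because the time constant grows as the bandwidth shrinks, the narrower saturation field always comes out at a larger dB value than the wider one for the same reference pulse.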


The saturation power level-saturation offset tuples are hardcoded into the pulse program and edited for every experiment.
4. Shaped pulses: The pulse program contains two shaped pulses on the 1H channel, ϕ7 and ϕ8 (Fig. 3), the pulse powers of which need to be calculated by integrating the shaped pulse over the duration and comparing it to a reference as in the case of the saturation pulse. This can be done easily using the shaped pulse tool in Bruker TopSpin.

3.2 Data Analysis

3.2.1 NMR Signal Processing

NMR data are acquired in the time domain and must be Fourier transformed into the frequency domain. As an example, we discuss using NMRPipe [21] for the conversion of raw NMR data in a Unix C shell. All of the following commands should be executed within the directory in which the raw NMR dataset resides.

1. bruker ↲
This opens a graphical user interface (GUI) in which one should read parameters from the raw Bruker NMR dataset, adjust temperature and conversion type as needed, save the processing script (fid.com), and execute the script. Consequently, a .fid file will be created in the dataset folder, indicating that the raw NMR data is now in NMRPipe format.

2. nmrDraw ↲
This opens the NMRDraw GUI, in which one can visualize the time-domain data and transform it into the frequency domain. Afterward, the frequency-domain peaks are phased by turning phasing on and adjusting the zeroth-order (p0) and first-order (p1) phasing values until all peaks are positive and “upright.”

3. Macro edit (shortcut m)
A basic 2D processing script with the adjusted p0 and p1 phasing parameters, determined as described above, should be executed. An example of a 2D processing script is included below, with the file names and phase corrections to be filled in by the user.
------------------------------------------------------------
#!/bin/csh
#
# Basic 2D Phase-Sensitive Processing:
#   Cosine-Bells are used in both dimensions.
#   Use of "ZF -auto" doubles size, then rounds to power of 2.
#   Use of "FT -auto" chooses correct Transform mode.
#   Imaginaries are deleted with "-di" in each dimension.
#   Phase corrections should be inserted by hand.

nmrPipe -in test.fid \
| nmrPipe -fn SP -off 0.5 -end 1.00 -pow 1 -c 1.0 \
| nmrPipe -fn ZF -auto \
| nmrPipe -fn FT -auto \
| nmrPipe -fn PS -p0 0.00 -p1 0.00 -di -verb \
| nmrPipe -fn TP \
| nmrPipe -fn SP -off 0.5 -end 1.00 -pow 1 -c 0.5 \
| nmrPipe -fn ZF -auto \
| nmrPipe -fn FT -auto \
| nmrPipe -fn PS -p0 -90.00 -p1 0.00 -di -verb \
   -ov -out test.ft2
------------------------------------------------------------
** If desired, the line shown below can be added to the end of the processing script to create an .nv file for data analysis or visualization with NMRViewJ.
------------------------------------------------------------
| pipe2xyz -nv -out nmrviewj_format_filename.nv -verb -ov
------------------------------------------------------------
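As a rough guide to the processed matrix size, the comment in the script states that "ZF -auto" doubles the size and then rounds to a power of 2. The following Python sketch mimics that rule; it assumes the rounding is upward (zero filling can only add points) and is not taken from the NMRPipe source, and the function name is hypothetical.

```python
def zf_auto_size(n_points):
    """Double the point count, then round up to the next power of two.

    Assumption: this mirrors the script comment on "ZF -auto"; the
    rounding direction (up) is inferred, not read from NMRPipe itself.
    """
    doubled = 2 * n_points
    size = 1
    while size < doubled:
        size *= 2
    return size


# Complex point counts mentioned in the acquisition sections.
for n in (512, 128, 80, 160):
    print(f"{n} complex points -> {zf_auto_size(n)} points after zero filling")
```

Under this assumption, acquiring 160 instead of 128 complex points in the indirect dimension doubles the processed size in that dimension (512 versus 256 points).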

3.2.2 NMR Spectral Peak Picking

NMRViewJ is used to provide an example of how to pick peaks (resonances), assign peaks, and measure peak intensity. As an alternate, the user can use CcpNmr Analysis, Sparky, or any other NMR data visualization program, which have their own user instructions. 1. Peak resonances can be picked for assignment manually by setting the cursor to “PeakAdd” and manually clicking peaks, or automatically in the “PeakPick” tab within Attributes. With NMRViewJ, the picked peaks can be edited throughout the picking and assignment process. 2. Peak assignments should be previously available. If there is no previous assignment or peak list, triple-resonance experiments are typically required to provide the information required for obtaining resonance assignments. For our experiments, picking peaks or labeling residues were based on our published assignments of tau fragments or full-length human tau [2, 22, 23]. 3. Peak intensity can be quantified by NMRViewJ with “Get Intensities” within Peak > Integrate after peaks have been picked (properties like peak volume can be obtained in a similar way).

3.2.3 NMR Data Analysis and Interpretation: 1H-15N HSQC Intensity Ratios

1. Intensity ratios (I/I0) of every selected peak are calculated (Fig. 4). I0 represents the peak intensity of resonances when all of the protein is in a free state, and I represents the peak intensity of free-state resonances when the protein is in the presence of lipid vesicles (free protein fraction).
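The ratio calculation is simple enough to script once the two peak lists have been exported. The Python sketch below assumes the intensities have already been read into dictionaries keyed by residue label; the function name, labels, and intensity values are hypothetical and serve only to illustrate the I/I0 calculation.

```python
def intensity_ratios(free_state, with_vesicles):
    """Return {residue: I/I0} for residues present in both peak lists.

    free_state:    intensities with all protein free (I0)
    with_vesicles: intensities of the same resonances with vesicles (I)
    """
    ratios = {}
    for residue, i0 in free_state.items():
        if residue in with_vesicles and i0 > 0:
            ratios[residue] = with_vesicles[residue] / i0
    return ratios


# Hypothetical peak intensities in arbitrary units (not measured data).
free_state = {"A": 1000.0, "B": 800.0, "C": 1200.0, "D": 900.0}
with_vesicles = {"A": 500.0, "B": 800.0, "C": 1176.0, "D": 360.0}

ratios = intensity_ratios(free_state, with_vesicles)
for residue in sorted(ratios):
    print(f"{residue}: I/I0 = {ratios[residue]:.2f}")
```

Residues missing from either list are skipped rather than reported as zero, so that an unassigned or overlapped peak is not mistaken for complete signal loss.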

Fig. 4 (a) Depiction of resonances A–D from HSQC NMR spectra (axes: 1H, 15N), where green represents the free-state sample and red represents a bound-state sample. The overlay of both sample spectra shows differences in intensity for resonances A and D, but not for resonances B or C. The intensity of each resonance for the free state (green) and bound state (red) can be extracted with NMR data analysis software such as NMRViewJ. These differences in intensity can then be plotted as intensity ratios (I/I0) as shown in (b), where ratios ~1.0 show little to no binding and smaller ratios (here shown as

[...] Format Converter > Export > NMRDraw). Save the peak list as “spec.Asstab”.
5. Next, clusters of peaks that need to be fitted together to a simulated lineshape are defined in the file exported from CcpNmr Analysis. This is achieved by a custom script written in R (see Note 16).
6. The acquired spectrum is fitted to simulated lineshapes using the autoFit.tcl routine in NMRPipe, and the intensities are extracted across the 2D datasets.
------------------------------------------------------------
cat spec.Asstab | grep -v None > spec_noNone.Asstab
autoFit.tcl -series -dX 0.0 -dY 0.0 -inTab spec_noNone.Asstab -specName ft/test%03d.ft2
------------------------------------------------------------

7. A simulated spectrum (sim) and difference spectrum (dif) will be generated (Fig. 5b). Observe the difference spectrum carefully, and edit the cluster IDs iteratively as needed using the NMRDraw peak editing tool (show peaks as CLUSTID, edit) to achieve a good fit between the simulated and the experimental spectra.
8. Upon peak fitting, a peak list file “axt.tab” is generated that contains the relative intensity values for each peak as compared to the first spectrum in the series, which is usually the reference spectrum. This file is cleaned with the following script:
------------------------------------------------------------
cat axt.tab | grep -v None > nlin.filtered.tab
------------------------------------------------------------

9. To extract relaxation rate constants, the relative intensities (I/I0, with I0 representing the corresponding peak intensities at zero relaxation delay) are fit to an exponential decay function, from which R2 values are extracted, along with error estimates at each residue. NMRPipe has a model fitting routine that can fit experimental data to an exponential model.
modelExp.tcl nlin.filtered.tab nlin.spec.list 0.0
The model fitting data are plotted with the following command:
/usr/bin/gnuplot gnu/*
The R2 values with the uncertainties for each residue are formatted and exported with the following routine:
./summary.tcl -in nlin.filtered.tab | ./formatR2out.pl | sort -n > R2.txt
10. Once the R2 values for the free state and bound state have been calculated with identical steps, the difference in R2 values (ΔR2) between the free state and bound state is calculated for each residue.
./calc_deltaR2.pl R2bound.txt R2free.txt > deltaR2.txt
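For readers who want to check the logic of steps 9 and 10 outside NMRPipe, the Python sketch below converts the CPMG multipliers quoted in Subheading 3.1.4 into delays, estimates R2 for one residue, and takes the bound-minus-free difference. It is not the modelExp.tcl algorithm: it swaps in a simple log-linear least-squares fit with no error estimate, and the decay curves are synthetic, noise-free series generated from made-up rates.

```python
import math

# Values quoted in the text for the relaxation-delay array.
CPMG_UNIT_MS = 16.32
VCLIST = [0, 18, 12, 2, 3, 10, 6, 8, 1, 14, 6, 4, 6]


def relaxation_delays_s(vclist, unit_ms):
    """Convert CPMG-unit multipliers into relaxation delays in seconds."""
    return [n * unit_ms / 1000.0 for n in vclist]


def fit_r2(delays_s, rel_intensities):
    """Estimate R2 (1/s) as minus the least-squares slope of ln(I/I0) vs. delay."""
    logs = [math.log(i) for i in rel_intensities]
    n = len(delays_s)
    mean_t = sum(delays_s) / n
    mean_y = sum(logs) / n
    sxx = sum((t - mean_t) ** 2 for t in delays_s)
    sxy = sum((t - mean_t) * (y - mean_y) for t, y in zip(delays_s, logs))
    return -sxy / sxx


delays = relaxation_delays_s(VCLIST, CPMG_UNIT_MS)

# Synthetic, noise-free decays with made-up rates (not measured data).
free = [math.exp(-5.0 * t) for t in delays]
bound = [math.exp(-8.0 * t) for t in delays]

r2_free = fit_r2(delays, free)
r2_bound = fit_r2(delays, bound)
delta_r2 = r2_bound - r2_free

print(f"maximum delay = {max(delays) * 1000:.2f} ms")
print(f"R2 free = {r2_free:.2f} 1/s, R2 bound = {r2_bound:.2f} 1/s, "
      f"delta R2 = {delta_r2:.2f} 1/s")
```

A log-linear fit weights weak, noisy points at long delays too heavily on real data, which is one reason a nonlinear fit with error estimates, as in the NMRPipe routine, is preferable for the actual analysis.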

3.2.5 NMR Data Analysis and Interpretation: 15N-DEST

The interleaved 2D datasets are split into individual 2D datasets, each assigned to a specific 15N saturation pulse power and frequency. Similar to the relaxation analysis above, each experiment is processed identically with NMRPipe using a custom script (steps 1–8 are identical to those in Subheading 3.2.4), with the major exception that the residue-specific DEST profile (I/I0, where I0 refers to the peak intensity at zero saturation power) is extracted as a text file for subsequent model fitting using the custom MATLAB script DESTfit, also available on the Marius Clore lab website (see Note 1) [5, 20]. We developed a custom routine to extract the DEST profile data from the NMRDraw peak list file and format it for the subsequent DESTfit routine (see Note 17). The DEST profile refers to the attenuation due to DEST as a function of saturation offset for each selected residue position. Subsequent model fitting by the DESTfit program involves choosing a kinetic model (Fig. 6) and observing the fit of the data to their theoretical estimates based on the solution of the McConnell equations (Fig. 7) [5, 24]. The DESTfit program takes in some per-residue parameters (R1, deltaR2, R2_app in the bound-state sample, DEST profile) and some global parameters depending on the fitted model (k_on^app, fraction in light state) and optimizes some local parameters (residue-specific dark-state R2) and some global parameters (koff) by fitting the theoretical ΔR2 and simulated DEST profile to the experimental data. For a detailed discussion of the kinetic model fitting and the scripts necessary for analysis of DEST data, we refer readers to the published work from the developer of this technique [5, 20].
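The normalization behind a DEST profile can be illustrated with a short Python sketch. It is not the authors' extraction routine (Note 17) and does not reproduce the DESTfit input format; the function name, the tuple layout, and all intensities are hypothetical. It only shows how the intensities of one residue, taken from the individual 2D planes, are grouped by saturation bandwidth and divided by the zero-power reference intensity I0.

```python
from collections import defaultdict


def dest_profiles(planes, reference_intensity):
    """Group one residue's intensities by saturation bandwidth and normalize to I0.

    planes: iterable of (bandwidth_hz, offset_khz, intensity) tuples,
            one per 2D plane of the interleaved experiment.
    reference_intensity: intensity in the zero-power reference plane (I0).
    Returns {bandwidth_hz: [(offset_khz, I/I0), ...]} sorted by offset.
    """
    profiles = defaultdict(list)
    for bandwidth_hz, offset_khz, intensity in planes:
        profiles[bandwidth_hz].append((offset_khz, intensity / reference_intensity))
    return {bw: sorted(points) for bw, points in profiles.items()}


# Hypothetical intensities for a single residue (arbitrary units).
I0 = 1000.0
planes = [
    (400.0, 30.0, 980.0), (400.0, -30.0, 990.0), (400.0, 0.0, 200.0),
    (400.0, 5.0, 610.0), (400.0, -5.0, 600.0),
    (175.0, -30.0, 995.0), (175.0, 0.0, 350.0), (175.0, 30.0, 990.0),
]

profiles = dest_profiles(planes, I0)
for bandwidth in sorted(profiles, reverse=True):
    print(f"{bandwidth:.0f} Hz:", profiles[bandwidth])
```

Sorting by offset makes it easy to plot each bandwidth as a separate curve and to see the dip in I/I0 as the saturation approaches the carrier frequency.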


Fig. 6 Kinetic models considered in the DESTfit analysis. A two-state model (a) involves a free state and a vesicle-bound state, whereas a three-state model (b) considers an additional tethered state where the protein is not completely bound to the vesicle. The corresponding kinetic rate constants are indicated in the diagram. For the purpose of fitting DEST data, the simplest two-state model is tried first followed by three-state and more complicated models


Fig. 7 (a) DEST profile of selected individual residues of N-terminally acetylated α-synuclein in the lipid SUV-bound state (1 mM total lipids, DOPC:DOPE:DOPS = 60:25:15) at two different saturation bandwidths (blue, 400 Hz; red, 175 Hz). Residues 4, 21, and 76 are in the lipid-binding domain of α-synuclein, whereas residue 113 is located in the C-terminal domain, which does not bind to lipid membranes. Due to the increased lipid-bound dark-state fraction of the N-terminal residues, they are progressively saturated by a narrow-bandwidth pulse further away from the resonance frequency, resulting in progressive attenuation in the signal

4 Notes

1. Processing software for NMR data is freely available at:
NMRPipe: https://www.ibbr.umd.edu/nmrpipe/install.html
NMRViewJ: https://nmrfx.org/nmrfx/nmrviewj
CcpNmr Analysis: https://www.ccpn.ac.uk/v2-software/downloads/stable
R language for statistical computing: https://www.r-project.org/
15N-CPMG T2 and 15N-DEST pulse sequences, processing scripts, and DESTfit are available from the Marius Clore lab website: http://spin.niddk.nih.gov/clore/Software/software.html
2. We find that the NMR buffer works well for both α-synuclein and tau, but other proteins may require optimization of the buffer for both protein solubility and NMR compatibility.
3. The approximate 15N 90° pulse width can be doubled in a 1D version of the 1H-15N HSQC pulse program and calibrated to extinguish the amide signal, from which the true 15N 90° pulse width can be calculated precisely.
4. The temperature of the sample may differ from the temperature displayed in the data acquisition software (e.g., TopSpin). To accurately determine the temperature of the sample, calibration curves are kept on file that are generated from the temperature-dependent chemical shifts of deuterated methanol. We recommend setting the experimental temperature according to such a calibration curve for the actual sample temperature, if one is available.
5. In order to calibrate the 1H 90° hard pulse length, we use a proton 1D experiment. Initially we use a short pulse (typically 1 μs) to tip the magnetization a little. We set the receiver gain “rg” to 1, run the experiment with a single scan (ns = 1), Fourier transform (“ft”), and adjust the phase of the resulting

Fig. 7 (continued) as one moves closer to the resonance frequency. Fits to kinetic models using the DESTfit program are also plotted (dashed line, two-state model; solid line, three-state model, cf. Fig. 6). (b) Difference in R2 values (ΔR2) between lipid SUV-bound and free-state samples of 15N-labeled N-terminally acetylated α-synuclein (black circles connected by lines). Error bars represent mean ± 1 SD. The red crosses and green triangles represent calculated ΔR2 values based on DEST data fitting to two-state and three-state models, respectively


spectrum (“apk”) to make it upright and symmetric. Then we set the pulse length to four times the expected 90° pulse length (to make a full 360° rotation of the magnetization vector; the expected value can be estimated from similar samples or with the automated routine “pulsecal” in Bruker TopSpin), acquire the spectrum, Fourier transform and phase correct (“fp”), and observe the result. Then we iteratively adjust the pulse length to extinguish the signal (equal amounts of positive and negative intensity). This is the calibrated 360° pulse length, a quarter of which is the calibrated 90° pulse length for proton.
6. As a rough estimate, the prosol table can be read with just the 1H-calibrated 90° hard pulse length to estimate the 15N 90° hard pulse length in any experiment. In order to manually calibrate the 15N hard pulse length (recommended for relaxation experiments), we set up a 1D version of the 1H-15N HSQC, double the 15N hard pulse length, and calibrate it to extinguish the signal. This value is the calibrated 15N 180° hard pulse length, half of which is the calibrated 15N 90° hard pulse length.
7. For every experiment, to accurately calibrate the shaped pulses, we look up the “prosol” table using the following command:
getprosol (optional) etc.
e.g., getprosol 1H 9.25 -8.4 15N 25.5 -24.56

8. To set up an optimal spectral window in the indirect dimension, we generally start with a wider spectral width (e.g., 40 ppm) to encompass the entire spectrum. Then we collect a test spectrum with a few scans (e.g., ns = 4 or 8) and narrow the window by setting the center position and spectral width so that no peaks are folded or aliased.
9. For each FID, a designated number of scans (repeats of the experiment at each point) are averaged. It is important to note that the S/N scales as the square root of the number of scans (i.e., a twofold increase in S/N requires four times the number of scans) but linearly with the sample concentration. Therefore, acquisition time increases dramatically for small improvements in S/N. Hence, to achieve a good spectrum within a reasonable experimental time, it is advisable to use a higher sample concentration, as long as the protein does not aggregate and the concentration remains suitable for the experimental question.


10. Vesicle titration experiments can help the user gauge the appropriate lipid/protein molar ratio to use for DEST/T2 relaxation samples. Briefly, a series of samples containing lipids and 15N-labeled protein at molar ratios of 10:1, 50:1, 100:1, and 200:1 should be prepared. A brief 1H-15N HSQC NMR spectrum should be collected for each sample (e.g., 512 complex points in the 1H dimension, 128 complex points in the 15N dimension). Intensity ratios should be calculated for each sample, and the sample in which the majority of intensity ratios fall around 0.75–1.0 should be selected as the appropriate molar ratio for DEST/T2 relaxation experiments.
11. It is best to use freshly prepared vesicles, but vesicle preparations can be stored for several days up to a week if covered with an inert gas and sealed to prevent oxidation.
12. Phospholipids oxidize over time, so it is important to use fresh stocks of lipids to prepare vesicles and to store the stocks properly sealed at −20 °C.
13. Lipid-containing NMR samples oxidize as well. This oxidation can cause time-dependent changes in the NMR spectra. Typically, the changes appear as the oxidation is transferred to methionines [25]. This can be partially prevented by flowing an inert gas (N2 or Ar) through the NMR tube before capping and sealing with paraffin film, allowing NMR experiments to run over a longer duration.
14. To set up a test experiment to measure a 1D relaxation series, we set the number of points in the indirect dimension to one and acquire the spectrum with a limited number of scans (usually 8 to 16). The resultant spectrum can be processed with the command “ft2” to visualize the relaxation decay of the 1D spectrum. Note, however, that this relaxation series serves only as a general estimate and may not fully capture the relaxation timescale of individual residues, which can vary significantly.
The maximum relaxation delay in the experiment is limited by the probehead specifications, since the simultaneous decoupling pulse on 1H channel puts stress on the probehead and also tends to increase the temperature of the sample. So it is generally recommended to keep it below 250–300 ms. 15. We deviate from the suggested protocol by Fawzi et al., where NMRDraw was used for peak picking and assignment since using NMRDraw for peak picking and assignment is less intuitive and it is difficult to transfer peak assignments from an existing peak list. 16. Custom script in R for clustering of peaks: peakFile