publications
publications by categories in reversed chronological order. generated by jekyll-scholar.
Rxiv
- Interpolation-Based Conditioning of Flow Matching Models for Bioisosteric Ligand Design. Yael Ziv, Martin Buttenschoen, Lukas Schieblerger, and 2 more authors. bioRxiv, Rxiv
Fast, unconditional 3D generative models can now produce high-quality molecules, but adapting them for specific design tasks often requires costly retraining. To address this, we introduce two training-free, inference-time conditioning strategies, Interpolate–Integrate and Replacement Guidance, that provide control over E(3)-equivariant flow-matching models. Our methods generate bioisosteric 3D molecules by conditioning on seed ligands or fragment sets to preserve key determinants like shape and pharmacophore patterns, without requiring the original fragment atoms to be present. We demonstrate their effectiveness on three drug-relevant tasks: natural product ligand hopping, bioisosteric fragment merging, and pharmacophore merging.
- Syndirella: Synthesis-directed fragment elaboration enables extensive binding site exploration beyond catalog compounds. Kate K. Fieseler, Max Winokan, Joseph A. Morrone, and 3 more authors. ChemRxiv, Rxiv
Fragment screens provide an information-rich starting point for designing derivative compounds that recapitulate key protein-ligand interactions in structure-based drug discovery. Maximizing the compound diversity and interaction sampling of the derivatives is critical to avoid premature convergence on unproductive chemical series, yet traditional catalog procurement typically limits campaigns to <100 compounds due to budget constraints. We developed Syndirella (Synthesis Directed Elaborations), which enables the sampling of 100s–1000s of compounds by generating congeneric series through digitized multi-step synthesis routes for reactant-based purchasing, in-house robotic synthesis, and direct-to-biology testing. Applied to three viral protein targets, Syndirella demonstrated comprehensive pharmacophore diversity, recapitulating known fragment interactions while identifying novel catalytic ones and accessing distinct chemical regions compared to traditional superstructure searches. Experimental validation on a calcium-binding protein fragment screen yielded 13 X-ray crystal structures with binding kinetics from 252 synthesized compounds, achieving 84 percent cost reduction versus catalog purchasing. By shifting from product to reactant procurement, Syndirella enables order-of-magnitude increases in accessible compounds for structure-activity relationship exploration within typical budgets, fundamentally expanding the breadth-to-depth ratio of fragment-based series.
- Pox-AbDab: the Orthopoxvirus Antibody Database. Henriette L. Capel, Eric Ji Da Wang, Benjamin H. Williams, and 2 more authors. bioRxiv, Rxiv
In August 2024, the World Health Organisation declared the mpox orthopoxvirus to be a Public Health Emergency of International Concern for the second time in three years, emphasising the need for continued studies into its microbiology and potential therapeutic interventions. Here, we present the Orthopoxvirus Antibody Database (Pox-AbDab), a repository of data on antibodies known to bind or neutralise viruses from the same genus as mpox (https://opig.stats.ox.ac.uk/webapps/poxabdab). Beyond standardising and centralising the data, we highlight challenges in translating knowledge across orthopoxviruses, such as the absence of a function-based nomenclature for virion surface antigens. We also performed an exploratory analysis of the known orthopoxvirus-binding antibody landscape, highlighting their aggregate molecular properties, cross-binding/cross-neutralisation profiles, evidence for immunodominance or immune escape from their epitopes, and gaps in coverage to help orient future research.
- The Therapeutic Nanobody Profiler: characterising and predicting nanobody developability to improve therapeutic design. Gemma L. Gordon, Joao Gervasio, Colby Souders, and 1 more author. bioRxiv, Rxiv
Developability optimisation is an important step for successful biotherapeutic design. For monoclonal antibodies, developability is relatively well characterised. However, progress for novel biotherapeutics such as nanobodies is more limited. Differences in structural features between antibodies and nanobodies render current antibody computational methods unsuitable for direct application to nanobodies. Following the principles of the Therapeutic Antibody Profiler (TAP), we have built the Therapeutic Nanobody Profiler (TNP), an open-source computational tool for predicting nanobody developability. Tailored specifically for nanobodies, it accounts for their unique properties compared to conventional antibodies for more efficient development of this novel therapeutic format. We calibrate TNP metrics using the 36 currently available clinical-stage nanobody sequences. We also collected experimental developability data for 108 nanobodies and examine how these results are related to the TNP guidelines. TNP is available as a web application at opig.stats.ox.ac.uk/webapps/tnp.
- LICHEN: Light-chain Immunoglobulin sequence generation Conditioned on the Heavy chain and Experimental Needs. Henriette L. Capel, Isaac Ellmen, Chris J. Murray, and 9 more authors. bioRxiv, Rxiv
In developing therapeutic antibodies, the heavy chain is often prioritised due to its higher variability and its central role in antigen binding. An appropriate pairing of the light sequence is, however, important for antibody function. Here we present LICHEN, a heavy chain conditioned light sequence generation tool that enables collaborative light sequence design by leveraging computational capabilities alongside experimental expertise. LICHEN generates light sequences which are valid (antibody-like), diverse in sequence and structure, and conditioned on a specific heavy chain. LICHEN can also condition on germline and CDRs and automatically filter generated sequences for required properties. This allows LICHEN to be used across multiple antibody development use cases. We carry out experimental validation of the method conditioning only on the heavy sequence and on the heavy sequence and binding information. Our in vitro results show that sequences created by LICHEN have effective expression yields and can retain antigen-binding.
- AntiDIF: Accurate and Diverse Antibody Specific Inverse Folding with Discrete Diffusion. Nikhil Branson and Charlotte M. Deane. bioRxiv, Rxiv
Inverse folding is an important step in current computational antibody design. Recently, deep learning methods have made impressive progress in improving the sequence recovery of antibodies given their 3D backbone structure. However, inverse folding is often a one-to-many problem, i.e. there are multiple sequences that fold into the same structure. Previous methods have not taken into account the diversity between the predicted sequences for a given structure. Here we create AntiDIF, an Antibody-specific discrete Diffusion model for Inverse Folding. Compared with state-of-the-art methods, we show that AntiDIF improves diversity between predictions while keeping high sequence recovery rates. Furthermore, forward folding of the generated sequences shows good agreement with the target 3D structure.
- AF2χ: Predicting protein side-chain rotamer distributions with AlphaFold2. Matteo Cagiada, F. Emil Thomasen, Sergey Ovchinnikov, and 2 more authors. bioRxiv, Rxiv
The flexibility of protein side chains is an essential contributor to conformational entropy and affects processes such as folding, stability and molecular interactions. Structure determination experiments and prediction tools such as AlphaFold generally fail to capture or represent the conformational heterogeneity of proteins in solution. Experiments can be used to study side-chain flexibility, but cannot be applied at scale, and most prediction methods focus on reconstructing the minimum free energy state rather than an ensemble representing side-chain configurations. Here, we use AlphaFold2 and its internal side-chain representations to develop AF2χ, which predicts side-chain χ-angle distributions and generates structural ensembles. We extensively benchmark AF2χ predictions using experimental NMR 3J-couplings and S2 order parameters, as well as dihedral angle distributions derived from collections of experimental structures, demonstrating that AF2χ generates accurate side-chain ensembles. We also compare the accuracy of AF2χ with molecular dynamics simulations and recent machine learning models that aim to generate conformational ensembles, and show that AF2χ provides state-of-the-art accuracy orders of magnitude faster than molecular simulations. With its speed and accuracy, AF2χ offers a strong complementary option to simulations and rotamer library approaches, making it particularly valuable for applications such as protein design, ligand docking and interpretation of biophysical experiments.
- Structure-Activity-Relationships can be directly extracted from high-throughput crystallographic evaluation of fragment elaborations in crude reaction mixtures. Harold Grosjean, Kate Fieseler, Ruben Sanchez-Garcia, and 4 more authors. ChemRxiv, Rxiv
Fragment hits serve as starting points for lead-like compound development. A common approach to advancing from fragments is to build structure-activity relationships (SARs) from close analogues. One strategy involves performing automated chemistry around fragment hits and evaluating resulting crude reaction mixtures (CRMs) of analogues in assays, bypassing costly purification. However, these purification-agnostic workflows are often perceived as noisy, and therefore typically involve additional hit resynthesis for confirmation, potentially discarding false negatives and reducing SAR dataset size. High-throughput (HT) X-ray crystallography has the potential to address these issues by unambiguously resolving hits directly from 100s–1000s of CRMs. However, no systematic analytics exist for extracting SAR models from HT crystallographic evaluation of CRMs. We demonstrate that crystallographic SAR (xSAR) can be extracted from CRMs evaluated via HT X-ray crystallography. We present here a simple rule-based ligand scoring scheme that identifies conserved chemical features linked to binding and non-binding observations in crystallography. Applied to a large-scale crystallographic dataset of 957 fragment elaborations in CRMs targeting PHIP(2), a therapeutically relevant bromodomain, our xSAR model demonstrated effectiveness in two proof-of-concept experiments. First, it recovered 26 missed binders in the initial dataset, doubling the hit rate and denoising the dataset. Second, it enabled a prospective virtual screen, also leveraging previously resolved cocrystal structures, that identified novel hits with informative chemistries, achieving up to a 10-fold binding affinity improvement over the repurified hit from the initial CRM evaluation. 
This work establishes a proof-of-concept that xSAR models can be directly extracted from large-scale crystallographic readouts of CRMs, offering a valuable methodology to build SAR models and accelerate design-make-test iterations without requiring CRM hit resynthesis and confirmation. This invites future work to utilise advanced analytics and modelling techniques to further strengthen purification-agnostic workflows.
- Assessing the Chemical Intelligence of Large Language Models. Nicholas T. Runcie, Charlotte M. Deane, and Fergus Imrie. Rxiv, Rxiv
Large Language Models are versatile, general-purpose tools with a wide range of applications. Recently, the advent of "reasoning models" has led to substantial improvements in their abilities in advanced problem-solving domains such as mathematics and software engineering. In this work, we assessed the ability of reasoning models to directly perform chemistry tasks, without any assistance from external tools. We created a novel benchmark, called ChemIQ, which consists of 796 questions assessing core concepts in organic chemistry, focused on molecular comprehension and chemical reasoning. Unlike previous benchmarks, which primarily use multiple choice formats, our approach requires models to construct short-answer responses, more closely reflecting real-world applications. The reasoning models, exemplified by OpenAI’s o3-mini, correctly answered 28%-59% of questions depending on the reasoning level used, with higher reasoning levels significantly increasing performance on all tasks. These models substantially outperformed the non-reasoning model, GPT-4o, which achieved only 7% accuracy. We found that Large Language Models can now convert SMILES strings to IUPAC names, a task earlier models were unable to perform. Additionally, we show that the latest reasoning models can elucidate structures from 1H and 13C NMR data, correctly generating SMILES strings for 74% of molecules containing up to 10 heavy atoms, and in one case solving a structure comprising 21 heavy atoms. For each task, we found evidence that the reasoning process mirrors that of a human chemist. Our results demonstrate that the latest reasoning models have the ability to perform advanced chemical reasoning.
- ANARCII: A Generalised Language Model for Antigen Receptor Numbering. Alexander Greenshields-Watson, Parth Agarwal, Sarah A. Robinson, and 7 more authors. bioRxiv, Rxiv
Antigen receptor numbering allows the rapid delineation of the antigen-binding regions of antibody and T cell receptor (TCR) sequences, from sequence alone. It also allows the comparison of the vast diversity of antigen receptors in a consistent frame of reference. Numbering of antigen receptors is currently achieved by aligning sequences to a reference set. This approach may result in different numbering, depending on the reference set used, or may fail to number query sequences derived from new species or rare sequence types. To address this problem, we have built a new numbering method (ANARCII) which requires no alignment step and is based on a Seq2Seq language model. Our results show that ANARCII can deal with the complexity that arises in experimentally collected sequencing data and generalise to sequences which are highly dissimilar to those in training. In test sets designed to contain challenging and ambiguous sequence patterns, ANARCII numbering was identical to existing methods for over 99.99% of conserved residues and over 99.94% for complete CDR regions. The lightweight architecture allows numbering of over 90,000 sequences per minute on a single A100 GPU. Furthermore, the ANARCII package can be conditioned to fit rare sequence types and provide new training data for fine-tuning. We demonstrate that fine-tuned versions of ANARCII can correctly number other immunoglobulin domains such as TCRs and VNARs. Our model is freely available as a web tool (https://opig.stats.ox.ac.uk/webapps/sabdab-sabpred/sabpred/anarcii/), as well as a package for high throughput numbering of next generation sequencing data (https://github.com/oxpig/ANARCII).
- Predicting the conformational flexibility of antibody and T-cell receptor CDRs. Fabian C. Spoendlin, Monica L. Fernández-Quintero, Sai S. R. Raghavan, and 8 more authors. bioRxiv, Rxiv
- Supervised Deep Learning for Efficient Cryo-EM Image Alignment in Drug Discovery with cryoPARES. Ruben Sanchez-Garcia, Alex Berndt, Amir Apelbaum, and 5 more authors. bioRxiv, Rxiv
- Expanding the Synthome: Generating and integrating novel reactions into retrosynthesis tools. Maranga Mokaya, Charlotte M. Deane, and Anthony R. Bradley. bioRxiv, Rxiv
- Expanding the scope of a catalogue search to bioisosteric fragment merges using a graph database approach. Stephanie Wills, Ruben Sanchez-Garcia, Stephen D. Roughley, and 4 more authors. bioRxiv, Rxiv
- Inferring residue level hydrogen deuterium exchange with ReX. Oliver M. Crook, Nathan Gittens, Chun-wa Chung, and 1 more author. bioRxiv, Rxiv
Hydrogen-Deuterium Exchange Mass-Spectrometry (HDX-MS) has emerged as a powerful technique to explore the conformational dynamics of proteins and protein complexes in solution. The bottom-up approach to MS uses peptides to represent an average of residues, which reduces the resolution of deuterium exchange and complicates the interpretation of the data. Here, we introduce ReX, a method to infer residue-level uptake patterns leveraging the overlap in peptides, the temporal component of the data and the correlation along the sequence dimension. This approach infers statistical significance for individual residues by treating HDX-MS as a multiple change-point problem. By fitting our model in a Bayesian non-parametric framework, we perform parameter number inference, differential HDX confidence assessments, and uncertainty estimation for temporal kinetics. Benchmarking against existing methods using a three-way proteolytic digestion experiment shows our method’s superior performance at predicting unseen HDX data. Moreover, it aligns HDX-MS with the reporting standards of other structural methods by providing global and local resolution metrics. Using ReX, we analyze the differential flexibility of BRD4’s two Bromodomains in the presence of I-BET151 and quantify the conformational variations induced by a panel of seventeen small molecules on LXRalpha. Our analysis reveals distinct residue-level HDX signatures for ligands with varied functional outcomes, highlighting the potential of this characterisation to inform mode of action analysis. Competing Interest Statement: Chun-wa Chung and Nathan Gittens are employees of GSK.
- Baselining the Buzz Trastuzumab-HER2 Affinity, and Beyond. Lewis Chinery, Alissa M. Hummer, Brij Bhushan Mehta, and 8 more authors. bioRxiv, Rxiv
There is currently considerable interest in the field of de novo antibody design, and deep learning techniques are now regularly applied to optimise antibody properties such as binding affinity. However, robust baselines within this field have not kept up with recent developments. In this study, we generate a dataset of over 524,000 Trastuzumab variants and use this to show that standard computational methods such as BLOSUM, AbLang, ESM, and Protein-MPNN can be used to design diverse antibody libraries from just a single starting sequence. These novel libraries are predicted to be enriched in binding variants and experimental validation of 700 of these designs is ongoing. We also demonstrate that, even with only a very small number of experimental data points, simple machine learning classifiers can be trained in seconds to accurately pre-screen future designs. This pre-screening maintains library diversity and saves experimental time and money. Competing Interest Statement: V.G. declares advisory board positions in aiNET GmbH, Enpicom B.V, Absci, Omniscope, and Diagonal Therapeutics. V.G. is a consultant for Adaptive Biosystems, Specifica Inc, Roche/Genentech, immunai, LabGenius, and FairJourney Biologics. J.R.J. is employed by GlaxoSmithKline plc. The remaining authors declare no competing interests.
- Interpreting Graph Neural Networks with Myerson Values for Cheminformatics Approaches. Samuel Homberg, Menke Janosch, Garrett M. Morris, and 1 more author. ChemRxiv, Rxiv
- Protein-Ligand Interaction Graphs: Learning from Ligand-Shaped 3D Interaction Graphs to Improve Binding Affinity Prediction. Marc A Moesser, Dominik Klein, Fergus Boyles, and 3 more authors. bioRxiv, Rxiv
- SuCOS is Better than RMSD for Evaluating Fragment Elaboration and Docking Poses. Susan Leung, Michael Bodkin, Frank Delft, and 2 more authors. ChemRxiv, Rxiv
One of the fundamental assumptions of fragment-based drug discovery is that the fragment’s binding mode will be conserved upon elaboration into larger compounds. The most common way of quantifying binding mode similarity is Root Mean Square Deviation (RMSD), but Protein Ligand Interaction Fingerprint (PLIF) similarity and shape-based metrics are sometimes used. We introduce SuCOS, an open-source shape and chemical feature overlap metric. We explore the strengths and weaknesses of RMSD, PLIF similarity, and SuCOS on a dataset of X-ray crystal structures of paired elaborated larger and smaller molecules bound to the same protein. Our redocking and cross-docking studies show that SuCOS is superior to RMSD and PLIF similarity. When redocking, SuCOS produces fewer false positives and false negatives than RMSD and PLIF similarity; and in cross-docking, SuCOS is better at differentiating experimentally-observed binding modes of an elaborated molecule given the pose of its non-elaborated counterpart. Finally we show that SuCOS performs better than AutoDock Vina at differentiating actives from decoy ligands using the DUD-E dataset. SuCOS is available at https://github.com/susanhleung/SuCOS.
2025
- Orthogonal IMiD-Degron Pairs Induce Selective Protein Degradation in Cells. Patrick J. Brennan, Rebecca E. Saunders, Mary Spanou, and 21 more authors. ACS Chemical Biology, 2025
Immunomodulatory imide drugs (IMiDs), including thalidomide, lenalidomide, and pomalidomide, can be used to induce degradation of a protein of interest that is fused to a short degron motif, which often comprises a zinc finger (ZF). These IMiDs, however, also induce the degradation of endogenous ZF-containing neosubstrates, including IKZF1, IKZF3, and SALL4. To improve degradation selectivity, we took a bump-and-hole approach to design and screen bumped IMiD analogues against 8380 ZF mutants. This yielded a bumped IMiD analogue that induces efficient degradation of a mutant ZF degron, while not affecting other cellular proteins, including IKZF1, IKZF3, and SALL4. In proof-of-concept studies, this system was applied to induce degradation of the optimum degron fused to CDK9, HPRT1, NanoLuc, or TRIM28. We anticipate that this system will be a valuable addition to the current arsenal of degron systems for use in target validation.
- HeavyBuilder: Analysis of High-Throughput of Antibody Heavy Chain Repertoires in the Structural Space. Joao D. Gervasio, Alexander Greenshields-Watson, Nele Quast, and 3 more authors. Journal of Molecular Biology, 2025
The vast majority of publicly available immunoglobulin sequence data covers only the antibody heavy chain, as it is easier and cheaper to sequence than paired heavy and light chains. However, structural characterization and analysis of these sequences at scale has been limited, either by insufficient resolution in the prediction or by high resource demand. Here, we introduce HeavyBuilder, a deep learning-based tool for rapid and accurate structure prediction of antibody heavy chains, available as a web server (https://opig.stats.ox.ac.uk/webapps/HeavyBuilder) and a Python API (https://github.com/oxpig/HeavyBuilder2). Based on the ImmuneBuilder architecture, it predicts up to 1 million structures in 3.13 days using a single GPU, outperforming AlphaFold2 and IgFold in speed while maintaining comparable accuracy. We applied HeavyBuilder to over 11 million sequences from 73 immune repertoires, enabling high-throughput structural analysis. Our study reveals widespread convergent structures, that is, structures from genetically distinct clones, and divergent clonotypes, that is, similar sequences adopting multiple structures. Furthermore, we demonstrate that structure-based similarity search recovers more known antibodies than sequence-based methods. HeavyBuilder offers a scalable solution for structural interrogation of large-scale immune repertoires, opening new avenues for antibody discovery and immune repertoire profiling.
- Incorporating targeted protein structure in deep learning methods for molecule generation in computational drug design. Lucy Vost, Yael Ziv, and Charlotte M. Deane. Chemical Science, 2025
- STCRpy: a software suite for T cell receptor structure parsing, interaction profiling and machine learning dataset preparation. Nele P. Quast, Charlotte M. Deane, and Matthew I. J. Raybould. Bioinformatics, 2025
Summary: Computational methods to guide early-stage TCR drug discovery and TCR repertoire informatics currently under-utilise solved and predicted structure data. Here, we streamline use of these data through an open-source python package for high-throughput TCR structure handling and analysis (STCRpy), facilitating analyses such as TCR:peptide-MHC complex orientation calculation/scoring, root-mean-square-distance evaluation, interaction profiling, and machine learning dataset curation. Availability and implementation: Freely available as a Python package at https://github.com/npqst/STCRpy. Contact: deane@stats.ox.ac.uk, matthew.raybould@stats.ox.ac.uk
- MolSnapper: Conditioning Diffusion for Structure Based Drug Design. Yael Ziv, Fergus Imrie, Brian Marsden, and 1 more author. Journal of Chemical Information and Modeling, 2025
Generative models have emerged as potentially powerful methods for molecular design, yet challenges persist in generating molecules that effectively bind to the intended target. The ability to control the design process and incorporate prior knowledge would be highly beneficial for better tailoring molecules to fit specific binding sites. In this paper, we introduce MolSnapper, a novel tool that is able to condition diffusion models for structure-based drug design by seamlessly integrating expert knowledge in the form of 3D pharmacophores. We demonstrate through comprehensive testing on both CrossDocked and Binding MOAD datasets that our method generates molecules better tailored to fit a given binding site, achieving high structural and chemical similarity to the original molecules. It also, when compared to alternative methods, yields approximately twice as many valid molecules. Competing Interest Statement: The authors have declared no competing interest.
- Improving Structural Plausibility in 3D Molecule Generation via Property-Conditioned Training with Distorted Molecules. Lucy Vost, Vijil Chenthamarakshan, Payel Das, and 1 more author. Digital Discovery, 2025
- Narrowing the gap between machine learning scoring functions and free energy perturbation using augmented data. Isak Valsson, Matthew Warren, Charlotte Deane, and 3 more authors. Communications Chemistry, 2025
- AntiFold: Improved antibody structure-based design using inverse folding. Magnus Haraldson Høie, Alissa M. Hummer, Tobias H. Olsen, and 3 more authors. Bioinformatics Advances, 2025
- Robustly interrogating machine learning-based scoring functions: what are they learning? Guy Durant, Fergus Boyles, Kristian Birchall, and 2 more authors. Bioinformatics, 2025
- Transformers trained on proteins can learn to attend to Euclidean distance. Isaac Ellmen, Constantin Schneider, Matthew I. J. Raybould, and 1 more author. Transactions on Machine Learning Research, 2025
- Fragmenstein: predicting protein-ligand structures of compounds derived from known crystallographic fragment hits using a strict conserved-binding–based methodology. Matteo P. Ferla, Ruben Sanchez-Garcia, Rachael E. Skyner, and 5 more authors. Journal of Cheminformatics, 2025
- Challenges and compromises: Predicting unbound antibody structures with deep learning. Alexander Greenshields-Watson, Odysseas Vavourakis, Fabian C. Spoendlin, and 2 more authors. Current Opinion in Structural Biology, 2025
- The molecular reach of antibodies determines their SARS-CoV-2 neutralisation potency. Anna Huhn, Daniel A. Nissley, Daniel B. Wilson, and 16 more authors. Nature Communications, 2025
- The protein universe in 3D. Isaac Ellmen, Matthew I. J. Raybould, and Charlotte M. Deane. Nature Chemical Biology, 2025
- T-cell receptor structures and predictive models reveal comparable alpha and beta chain structural diversity despite differing genetic complexity. Nele P. Quast, Brennan Abanades, Bora Guloglu, and 4 more authors. Communications Biology, 2025
- Investigating the Volume and Diversity of Data Needed for Generalizable Antibody-Antigen ∆∆G Prediction. Alissa M. Hummer, Constantin Schneider, Lewis Chinery, and 1 more author. Nature Computational Science, 2025
Antibody–antigen binding affinity lies at the heart of therapeutic antibody development: efficacy is guided by specific binding and control of affinity. Here we present Graphinity, an equivariant graph neural network architecture built directly from antibody–antigen structures that achieves test Pearson’s correlations of up to 0.87 on experimental change in binding affinity (ΔΔG) prediction. However, our model, like previous methods, appears to be overtraining on the few hundred experimental data points available and performance is not robust to train–test cut-offs. To investigate the amount and type of data required to generalizably predict ΔΔG, we built synthetic datasets of nearly 1 million FoldX-generated and >20,000 Rosetta Flex ddG-generated ΔΔG values. Our results indicate that there are currently insufficient experimental data to accurately and robustly predict ΔΔG, with orders of magnitude more likely needed. Dataset size is not the only consideration; diversity is also an important factor for model predictiveness. These findings provide a lower bound on data requirements to inform future method development and data collection efforts.
2024
- AIntibody: An experimentally-validated in silico antibody discovery design challenge. M. Frank Erasmus, Laura Spector, Katheryn Perea-Schmittle, and 52 more authors. Nature Biotechnology, 2024
- The future of machine learning for small-molecule drug discovery will be driven by data. Guy Durant, Fergus Boyles, Kristian Birchall, and 1 more author. Nature Computational Science, 2024
- A comparative study of the developability of full-length antibodies, fragments, and bispecific formats reveals higher stability risks for engineered constructs. Itzel Condado-Morales, Fabian Dingfelder, Isabel Waibel, and 12 more authors. mAbs, 2024
- Assessing AF2’s ability to predict structural ensembles of proteins. Jakob R. Riccabona, Fabian C. Spoendlin, Anna-Lena M. Fischer, and 11 more authors. Structure, 2024
- Humatch - fast, gene-specific joint humanisation of antibody heavy and light chains. Lewis Chinery, Jeliazko R. Jeliazkov, and Charlotte M. Deane. mAbs, 2024
Antibodies are a popular and powerful class of therapeutic due to their ability to exhibit high affinity and specificity to target proteins. However, the majority of antibody therapeutics are not genetically human, with initial therapeutic designs typically obtained from animal models. Humanisation of these precursors is essential to reduce immunogenic risks when administered to humans. Here, we present Humatch, a computational tool designed to offer experimental-like joint humanisation of heavy and light chains in seconds. Humatch consists of three lightweight Convolutional Neural Networks (CNNs) trained to identify human heavy V-genes, light V-genes, and well-paired antibody sequences with near-perfect accuracy. We show that these CNNs, alongside germline similarity, can be used for fast humanisation that aligns well with known experimental data. Throughout the humanisation process, a sequence is guided towards a specific target gene and away from others via multiclass CNN outputs and gene-specific germline data. This guidance ensures final humanised designs do not sit ‘between’ genes, a trait that is not naturally observed. Humatch’s optimisation towards specific genes and good VH/VL pairing increases the chances that final designs will be stable and express well and reduces the chances of immunogenic epitopes forming between the two chains. Humatch’s training data and source code are provided open-source.
- Quantifying conformational changes in the TCR:pMHC-I binding interface. Benjamin McMaster, Christopher Thorpe, Jamie Rossjohn, and 2 more authors. Frontiers in Immunology, 2024
- p-IgGen: A Paired Antibody Generative Language Model. Oliver M. Turnbull, Dino Oglic, Rebecca Croasdale-Wood, and 1 more author. Bioinformatics, 2024
- Context-Guided Diffusion for Out-of-Distribution Molecular and Protein Design. Leo Klarner, Tim G.J. Rudner, Garrett M. Morris, and 2 more authors. Proceedings of the 41st International Conference on Machine Learning (ICML 2024), 2024
- PLAbDab-nano: a database of camelid and shark nanobodies from patents and literature. Gemma L. Gordon, Alexander Greenshields-Watson, Parth Agarwal, and 5 more authors. Nucleic Acids Research, 2024
- Computational mining of B cell receptor repertoires reveals antigen-specific and convergent responses to Ebola vaccination. Eve Richardson, Sagida Bibi, Florence McLean, and 13 more authors. Frontiers in Immunology, 2024
- The Observed T cell receptor Space database enables paired-chain repertoire mining, coherence analysis and language modelling. Matthew I. J. Raybould, Alexander Greenshields-Watson, Parth Agarwal, and 5 more authors. Cell Reports, 2024
- Prospects for the computational humanization of antibodies and nanobodies. Gemma L. Gordon, Matthew I. J. Raybould, Ashley Wong, and 1 more author. Frontiers in Immunology, 2024
- Can AlphaFold’s breakthrough in protein structure help decode the fundamental principles of adaptive cellular immunity? Benjamin McMaster, Christopher Thorpe, Graham Ogg, and 2 more authors. Nature Methods, 2024
- Prediction of polyspecificity from antibody sequence data by machine learning. Szabolcs Éliás, Clemens Wrzodek, Charlotte M. Deane, and 3 more authors. Frontiers in Bioinformatics, 2024
- It is theoretically possible to avoid misfolding into non-covalent lasso entanglements using small molecule drugs. Yang Jiang, Charlotte M. Deane, Garrett M. Morris, and 1 more author. PLOS Computational Biology, Mar 2024
A novel class of protein misfolding characterized by either the formation of non-native noncovalent lasso entanglements in the misfolded structure or loss of native entanglements has been predicted to exist and found circumstantial support through biochemical assays and limited-proteolysis mass spectrometry data. Here, we examine whether it is possible to design small molecule compounds that can bind to specific folding intermediates and thereby avoid these misfolded states in computer simulations under idealized conditions (perfect drug-binding specificity, zero promiscuity, and a smooth energy landscape). Studying two proteins, type III chloramphenicol acetyltransferase (CAT-III) and D-alanyl-D-alanine ligase B (DDLB), that were previously suggested to form soluble misfolded states through a mechanism involving a failure-to-form of native entanglements, we explore two different drug design strategies using coarse-grained structure-based models. The first strategy, in which the native entanglement is stabilized by drug binding, failed to decrease misfolding because it formed an alternative entanglement at a nearby region. The second strategy, in which a small molecule was designed to bind to a non-native tertiary structure and thereby destabilize the native entanglement, succeeded in decreasing misfolding and increasing the native state population. This strategy worked because destabilizing the entanglement loop provided more time for the threading segment to position itself correctly to be wrapped by the loop to form the native entanglement. Further, we computationally identified several FDA-approved drugs with the potential to bind these intermediate states and rescue misfolding in these proteins. This study suggests it is possible for small molecule drugs to prevent protein misfolding of this type.
- Learnt representations of proteins can be used for accurate prediction of small molecule binding sites on experimentally determined and predicted protein structuresAnna Carbery, Martin Buttenschoen, Rachael Skyner, and 2 more authorsJournal of Cheminformatics, Mar 2024
- Addressing the antibody germline bias and its effect on language models for improved antibody designTobias H. Olsen, Iain H. Moal, and Charlotte M. DeaneBioinformatics, Mar 2024
The versatile binding properties of antibodies have made them an extremely important class of biotherapeutics. However, therapeutic antibody development is a complex, expensive and time-consuming task, with the final antibody needing to not only have strong and specific binding, but also be minimally impacted by any developability issues. The success of transformer-based language models in protein sequence space and the availability of vast amounts of antibody sequences have led to the development of many antibody-specific language models to help guide antibody discovery and design. Antibody diversity primarily arises from V(D)J recombination, mutations within the CDRs, and/or from a small number of mutations away from the germline outside the CDRs. Consequently, a significant portion of the variable domain of all natural antibody sequences remains germline. This affects the pre-training of antibody-specific language models, where this facet of the sequence data introduces a prevailing bias towards germline residues. This poses a challenge, as mutations away from the germline are often vital for generating specific and potent binding to a target, meaning that language models need to be able to suggest key mutations away from germline. In this study, we explore the implications of the germline bias, examining its impact on both general-protein and antibody-specific language models. We develop and train a series of new antibody-specific language models optimised for predicting non-germline residues. We then compare our final model, AbLang-2, with current models and show how it suggests a diverse set of valid mutations with high cumulative probability. AbLang-2 is trained on both unpaired and paired data, and is freely available (https://github.com/oxpig/AbLang2.git). Competing Interest Statement: Author IM is employed by GlaxoSmithKline plc. All authors declare no other competing interests.
- Ultrahigh frequencies of peripherally matured LGI1 & CASPR2-reactive B cells characterise encephalitis patient cerebrospinal fluidJakob Theorell, Ruby Harrison, Robyn Williams, and 16 more authorsProceedings of the National Academy of Sciences USA, Mar 2024
Intrathecal synthesis of central nervous system (CNS) reactive autoantibodies is observed across patients with autoimmune encephalitis (AE), who show multiple residual neurobehavioural deficits and relapses despite immunotherapies. We leveraged two common forms of AE, mediated by leucine-rich glioma inactivated-1 (LGI1) and contactin-associated protein like 2 (CASPR2) antibodies, as human models to comprehensively reconstruct and profile cerebrospinal fluid (CSF) B cell receptor (BCR) characteristics. We hypothesised that the resultant observations would both inform the observed therapeutic gap and determine the contribution of intrathecal maturation to pathogenic B cell lineages. From the CSF of three patients, 381 cognate-paired IgG BCRs were isolated by cell sorting and scRNA-seq, and 166 expressed as monoclonal antibodies (mAbs). 62% of mAbs from singleton BCRs reacted with either LGI1 or CASPR2 and, strikingly, this rose to 100% of cells in clonal groups with ≥4 members. These autoantigen-reactivities were more concentrated within antibody-secreting cells (ASCs) versus B cells (p<0.0001), and both these cell types were more differentiated than LGI1 and CASPR2-unreactive counterparts. Despite greater differentiation, autoantigen-reactive cells had acquired few mutations intrathecally and showed minimal variation in autoantigen affinities within clonal expansions. Also, limited CSF T cell receptor clonality was observed. In contrast, a comparison of germline-encoded BCRs versus the founder intrathecal clone revealed marked gains in both affinity and mutational distances (p=0.004 and p<0.0001, respectively). Taken together, in patients with LGI1- and CASPR2-antibody encephalitis, our results identify CSF as a compartment with a remarkably high frequency of clonally-expanded autoantigen-reactive ASCs whose BCR maturity appears dominantly acquired outside the CNS.
- Contextualising the developability risk of antibodies with lambda light chains using enhanced therapeutic antibody profilingMatthew I. J. Raybould, Oliver M. Turnbull, Annabel Suter, and 2 more authorsCommunications Biology, Mar 2024
Antibodies with lambda light chains (lambda-antibodies) are generally considered to be less developable than those with kappa light chains (kappa-antibodies). Though this hypothesis has not been formally established, it has led to substantial systematic biases in drug discovery pipelines and thus contributed to kappa dominance amongst clinical-stage therapeutics. However, the identification of increasing numbers of epitopes preferentially engaged by lambda-antibodies shows there is a functional cost to neglecting to consider them as potential lead candidates. Here, we update our Therapeutic Antibody Profiler (TAP) tool to use the latest data and machine learning-based structure prediction, and apply it to evaluate developability risk profiles for kappa-antibodies and lambda-antibodies based on their surface physicochemical properties. We find that while human lambda-antibodies on average have a higher risk of developability issues than kappa-antibodies, a sizeable proportion are assigned lower-risk profiles by TAP and should represent more tractable candidates for therapeutic development. Through a comparative analysis of the low- and high-risk populations, we highlight opportunities for strategic design that TAP suggests would enrich for more developable lambda-antibodies. Overall, we provide context to the differing developability of kappa- and lambda-antibodies, enabling a rational approach to incorporate more diversity into the initial pool of immunotherapeutic candidates.
- Codon language embeddings provide strong signals for protein engineeringCarlos Outeiral and Charlotte DeaneNature Machine Intelligence, Mar 2024
Protein representations from deep language models have yielded state-of-the-art performance across many tasks in computational protein engineering. In recent years, progress has primarily focused on parameter count, with recent models' capacities surpassing the size of the very datasets they were trained on. Here, we propose an alternative direction. We show that large language models trained on codons, instead of amino acid sequences, provide high-quality representations that outperform comparable state-of-the-art models across a variety of tasks. In some tasks, like species recognition, prediction of protein and transcript abundance, or melting point estimation, we show that a language model trained on codons outperforms every other published protein language model, including some that contain over 50 times more parameters. These results suggest that, in addition to commonly studied scale and model complexity, the information content of biological data provides an orthogonal direction to improve the power of machine learning in biology.
- Investigating the ability of deep learning-based structure prediction to extrapolate and/or enrich the set of antibody CDR canonical formsAlexander Greenshields-Watson, Brennan Abanades, and Charlotte M. DeaneFrontiers in Immunology, Mar 2024
- Designing stable humanized antibodiesAlissa M. Hummer and Charlotte M. DeaneNature Biomedical Engineering, Mar 2024
- CESPED: a new benchmark for supervised particle pose estimation in Cryo-EMRuben Sanchez-Garcia, Michael Saur, Javier Vargas, and 2 more authorsPhysical Review Research, Mar 2024
- Comprehensive Overview of Bottom-up Proteomics using Mass SpectrometryYuming Jiang, Devasahayam Arokia Balaya Rex, Dina Schuster, and 16 more authorsACS Measurement Science Au, Mar 2024
- The Patent and Literature Antibody Database (PLAbDab): an evolving reference set of functionally diverse, literature-annotated antibody sequences and structuresBrennan Abanades, Tobias H. Olsen, Matthew I. J. Raybould, and 5 more authorsNucleic Acids Research, Mar 2024
Antibodies are key proteins of the adaptive immune system, and there exists a large body of academic literature and patents dedicated to their study and concomitant conversion into therapeutics, diagnostics, or reagents. These documents often contain extensive functional characterisations of the sets of antibodies they describe. However, leveraging these heterogeneous reports, for example to offer insights into the properties of query antibodies of interest, is currently challenging as there is no central repository through which this wide corpus can be mined by sequence or structure. Here, we present PLAbDab (the Patent and Literature Antibody Database), a self-updating repository containing over 150,000 paired antibody sequences and 3D structural models, of which over 65,000 are unique. Each entry in the database also contains the title and authors of its literature source. Here we describe the methods used to extract, filter, pair, and model the antibodies in PLAbDab, and showcase how PLAbDab can be searched by sequence, structure, or keyword. PLAbDab uses include annotating query antibodies with potential antigen information from similar entries, analysing structural models of existing antibodies to identify modifications that could improve their properties, and compiling bespoke datasets of antibody sequences/structures known to bind to a specific antigen. PLAbDab is freely available via GitHub (https://github.com/oxpig/PLAbDab) and as a searchable webserver (https://opig.stats.ox.ac.uk/webapps/plabdab).
- Microfluidics-enabled fluorescence-activated cell sorting of single pathogen-specific antibody secreting cells for the rapid discovery of monoclonal antibodiesKatrin Fischer, Aleksei Lulla, Tsz So, and 13 more authorsNature Biotechnology, Mar 2024
2023
- Challenges in antibody structure predictionMonica L. Fernández-Quintero, Janik Kokot, Franz Waibl, and 4 more authorsmAbs, Mar 2023
- System-wide analysis of RNA and protein subcellular localization dynamicsEneko Villanueva, Tom Smith, Mariavittoria Pizzinga, and 9 more authorsNature Methods, Mar 2023
- PEP-Patch: Electrostatics in Protein–Protein Recognition, Specificity, and Antibody DevelopabilityValentin J. Hoerschinger, Franz Waibl, Nancy D. Pomarici, and 6 more authorsJournal of Chemical Information and Modeling, Mar 2023
- Open science discovery of potent noncovalent SARS-CoV-2 main protease inhibitorsMelissa L. Boby, Daren Fearon, Matteo Ferla, and 9 more authorsScience, Mar 2023
- A linear transportation L^p distance for pattern recognitionOliver M. Crook, Mihai Cucuringu, Tim Hurst, and 3 more authorsPattern Recognition, Mar 2023
- Exploring the ability of machine learning-based virtual screening models to identify the functional groups responsible for bindingThomas E. Hadfield, Jack Scantlebury, and Charlotte M. DeaneJournal of Cheminformatics, Mar 2023
- Improved computational epitope profiling using structural models identifies a broader diversity of antibodies that bind to the same epitopeFabian C. Spoendlin, Brennan Abanades, Matthew I. J. Raybould, and 3 more authorsFrontiers in Molecular Biosciences, Mar 2023
- Discovery and pharmacophoric characterization of chemokine network inhibitors using phage-display, saturation mutagenesis and computational modellingSerena Vales, Jhanna Kryukova, Soumyanetra Chandra, and 11 more authorsNature Communications, Mar 2023
- PoseBusters: AI-based docking methods fail to generate physically valid poses or generalise to novel sequencesMartin Buttenschoen, Garrett M Morris, and Charlotte M DeaneChemical Science, Mar 2023
The last few years have seen the development of numerous deep learning-based protein–ligand docking methods. They offer huge promise in terms of speed and accuracy. However, despite claims of state-of-the-art performance in terms of crystallographic root-mean-square deviation (RMSD), upon closer inspection, it has become apparent that they often produce physically implausible molecular structures. It is therefore not sufficient to evaluate these methods solely by RMSD to a native binding mode. It is vital, particularly for deep learning-based methods, that they are also evaluated on steric and energetic criteria. We present PoseBusters, a Python package that performs a series of standard quality checks using the well-established cheminformatics toolkit RDKit. The PoseBusters test suite validates chemical and geometric consistency of a ligand including its stereochemistry, and the physical plausibility of intra- and intermolecular measurements such as the planarity of aromatic rings, standard bond lengths, and protein–ligand clashes. Only methods that both pass these checks and predict native-like binding modes should be classed as having “state-of-the-art” performance. We use PoseBusters to compare five deep learning-based docking methods (DeepDock, DiffDock, EquiBind, TankBind, and Uni-Mol) and two well-established standard docking methods (AutoDock Vina and CCDC Gold) with and without an additional post-prediction energy minimisation step using a molecular mechanics force field. We show that both in terms of physical plausibility and the ability to generalise to examples that are distinct from the training data, no deep learning-based method yet outperforms classical docking tools. In addition, we find that molecular mechanics force fields contain docking-relevant physics missing from deep-learning methods. 
PoseBusters allows practitioners to assess docking and molecular generation methods and may inspire new inductive biases still required to improve deep learning-based methods, which will help drive the development of more accurate and more realistic predictions.
- KA-Search, a method for rapid and exhaustive sequence identity search of known antibodiesTobias Hegelund Olsen, Brennan Abanades, Iain H Moal, and 1 more authorScientific Reports, Mar 2023
Antibodies with similar amino acid sequences, especially across their complementarity-determining regions, often share properties. Finding that an antibody of interest has a similar sequence to naturally expressed antibodies in healthy or diseased repertoires is a powerful approach for the prediction of antibody properties, such as immunogenicity or antigen specificity. However, as the number of available antibody sequences is now in the billions and continuing to grow, repertoire mining for similar sequences has become increasingly computationally expensive. Existing approaches are limited by either being low-throughput, non-exhaustive, not antibody specific, or only searching against entire chain sequences. Therefore, there is a need for a specialized tool, optimized for a rapid and exhaustive search of any antibody region against all known antibodies, to better utilize the full breadth of available repertoire sequences. We introduce Known Antibody Search (KA-Search), a tool that allows for the rapid search of billions of antibody variable domains by amino acid sequence identity across either the variable domain, the complementarity-determining regions, or a user-defined antibody region. We show KA-Search in operation on the ~2.4 billion antibody sequences available in the OAS database. KA-Search can be used to find the most similar sequences from OAS within 30 minutes and a representative subset of 10 million sequences in less than 9 seconds. We give examples of how KA-Search can be used to obtain new insights about an antibody of interest. KA-Search is freely available at https://github.com/oxpig/kasearch.
- Drug Discovery under Covariate Shift with Domain-Informed Prior Distributions over FunctionsLeo Klarner, Tim G. J. Rudner, Michael Reutlinger, and 4 more authorsProceedings of the 40th International Conference on Machine Learning, PMLR, Mar 2023
Accelerating the discovery of novel and more effective therapeutics is an important pharmaceutical problem in which deep learning is playing an increasingly significant role. However, real-world drug discovery tasks are often characterized by a scarcity of labeled data and significant covariate shift—a setting that poses a challenge to standard deep learning methods. In this paper, we present Q-SAVI, a probabilistic model able to address these challenges by encoding explicit prior knowledge of the data-generating process into a prior distribution over functions, presenting researchers with a transparent and probabilistically principled way to encode data-driven modeling preferences. Building on a novel, gold-standard bioactivity dataset that facilitates a meaningful comparison of models in an extrapolative regime, we explore different approaches to induce data shift and construct a challenging evaluation setup. We then demonstrate that using Q-SAVI to integrate contextualized prior knowledge of drug-like chemical space into the modeling process affords substantial gains in predictive accuracy and calibration, outperforming a broad range of state-of-the-art self-supervised pre-training and domain adaptation techniques.
- A comparison of the binding sites of antibodies and single-domain antibodiesGemma L. Gordon, Henriette L. Capel, Bora Guloglu, and 3 more authorsFrontiers in Immunology, Mar 2023
Antibodies are the largest class of biotherapeutics. However, in recent years, single-domain antibodies have gained traction due to their smaller size and comparable binding affinity. Antibodies (Abs) and single-domain antibodies (sdAbs) differ in the structures of their binding sites: most significantly, single-domain antibodies lack a light chain and so have just three CDR loops. Given this inherent structural difference, it is important to understand whether Abs and sdAbs are distinguishable in how they engage a binding partner and thus, whether they are suited to different types of epitopes. In this study, we use non-redundant sequence and structural datasets to compare the paratopes, epitopes and antigen interactions of Abs and sdAbs. We demonstrate that even though sdAbs have smaller paratopes, they target epitopes of equal size to those targeted by Abs. To achieve this, the paratopes of sdAbs contribute more interactions per residue than the paratopes of Abs. Additionally, we find that conserved framework residues are of increased importance in the paratopes of sdAbs, suggesting that they include non-specific interactions to achieve comparable affinity. Furthermore, the epitopes of sdAbs and Abs cannot be distinguished by their shape. For our datasets, sdAbs do not target more concave epitopes than Abs: we posit that this may be explained by differences in the orientation and compaction of sdAb and Ab CDR-H3 loops. Overall, our results have important implications for the engineering and humanization of sdAbs, as well as the selection of the best modality for targeting a particular epitope. Competing Interest Statement: The authors have declared no competing interest.
- ImmuneBuilder: Deep-Learning models for predicting the structures of immune proteinsBrennan Abanades, Wing Ki Wong, Fergus Boyles, and 3 more authorsCommunications Biology, Mar 2023
- Fragment Merging Using a Graph Database Samples Different Catalogue Space than Similarity SearchStephanie Wills, Ruben Sanchez-Garcia, Tim Dudgeon, and 6 more authorsJournal of Chemical Information and Modeling, Mar 2023
- Specific attributes of the VL domain influence both the structure and structural variability of CDR-H3 through steric effectsBora Guloglu and Charlotte M DeaneFrontiers in Immunology, Mar 2023
Antibodies, through their ability to target virtually any epitope, play a key role in driving the adaptive immune response in jawed vertebrates. The binding domains of standard antibodies are their variable light (VL) and heavy (VH) domains, both of which present analogous complementarity-determining region (CDR) loops. It has long been known that the VH CDRs contribute more heavily to the antigen-binding surface (paratope), with the CDR-H3 loop providing a major modality for the generation of diverse paratopes. Here, we provide evidence for an additional role of the VL domain as a modulator of CDR-H3 structure, using a diverse set of antibody crystal structures and a large set of molecular dynamics simulations. We show that specific attributes of the VL domain such as CDR canonical forms and genes can influence the structural diversity of the CDR-H3 loop, and provide a physical model for how this effect occurs through inter-loop contacts and packing of CDRs against each other. Our study provides insights into the interdependent nature of CDR conformations, an understanding of which is important for the rational antibody design process. Competing Interest Statement: The authors have declared no competing interest.
- A Small Step Toward Generalizability: Training a Machine Learning Scoring Function for Structure-Based Virtual ScreeningJack Scantlebury, Lucy Vost, Anna Carbery, and 8 more authorsJournal of Chemical Information and Modeling, Mar 2023
- Exploring QSAR Models for Activity-Cliff PredictionMarkus Dablander, Thierry Hanser, Renaud Lambiotte, and 1 more authorJournal of Cheminformatics, Mar 2023
- Computationally profiling peptide:MHC recognition by T-cell receptors and T-cell receptor-mimetic antibodiesMatthew I. J. Raybould, Daniel A. Nissley, Sandeep Kumar, and 1 more authorFrontiers in Immunology, Mar 2023
- A functional Bayesian model for hydrogen-deuterium exchange mass-spectrometryOliver M. Crook, Chun-wa Chung, and Charlotte M. DeaneJournal of Proteome Research, Mar 2023
- Testing the Limits of SMILES-based De Novo Molecular Generation with Curriculum and Deep Reinforcement LearningMaranga Mokaya, Fergus Imrie, Willem P. Hoorn, and 3 more authorsNature Machine Intelligence, Mar 2023
- Characterisation of the immune repertoire of a humanised transgenic mouse through immunophenotyping and high-throughput sequencingEve Richardson, Spela Binter, Miha Kosmac, and 9 more authorseLife, Mar 2023
Immunoglobulin loci-transgenic animals are widely used in antibody discovery and increasingly in vaccine response modelling. In this study, we phenotypically characterised B-cell populations from the Intelliselect Transgenic mouse (Kymouse) demonstrating full B-cell development competence. Comparison of the naive B-cell receptor (BCR) repertoires of Kymice with naive human and murine BCR repertoires revealed key differences in germline gene usage and junctional diversification. These differences result in Kymice having CDRH3 length and diversity intermediate between mice and humans. To compare the structural space explored by CDRH3s in each species repertoire, we used computational structure prediction to show that Kymouse naive BCR repertoires are more human-like than mouse-like in their predicted distribution of CDRH3 shape. Our combined sequence and structural analysis indicates that the naive Kymouse BCR repertoire is diverse with key similarities to human repertoires, while immunophenotyping confirms that selected naive B-cells are able to go through complete development. Competing Interest Statement: Spela Binter, Paul Kellam and Simon Watson are employees of Kymab Ltd, a Sanofi company. Miha Kosmac and Jake Galson were employees of Kymab Ltd within the past 36 months. Eve Richardson is partially funded by Kymab Ltd.
2022
- Generating weighted and thresholded gene coexpression networks using signed distance correlationJavier Pardo-Diaz, Philip S. Poole, Mariano Beguerisse-Díaz, and 2 more authorsNetwork Science, Mar 2022
- Characterization of the SARS-CoV-2 ExoN (nsp14ExoN–nsp10) complex: implications for its role in viral genome stability and inhibitor identificationHannah T Baddock, Sanja Brolih, Yuliana Yosaatmadja, and 13 more authorsNucleic Acids Research, Jan 2022
- Structure of the malaria vaccine candidate Pfs48/45 and its recognition by transmission blocking antibodiesKuang-Ting Ko, Frank Lennartz, David Mekhaiel, and 8 more authorsNature Communications, Sep 2022
An effective malaria vaccine remains a global health priority and vaccine immunogens which prevent transmission of the parasite will have important roles in multi-component vaccines. One of the most promising candidates for inclusion in a transmission-blocking malaria vaccine is the gamete surface protein Pfs48/45, which is essential for development of the parasite in the mosquito midgut. Indeed, antibodies which bind Pfs48/45 can prevent transmission if ingested with the parasite as part of the mosquito bloodmeal. Here we present the structure of full-length Pfs48/45, showing its three domains to form a dynamic, planar, triangular arrangement. We reveal where transmission-blocking and non-blocking antibodies bind on Pfs48/45. Finally, we demonstrate that antibodies which bind across this molecule can be transmission-blocking. These studies will guide the development of future Pfs48/45-based vaccine immunogens.
- Paragraph - Antibody paratope prediction using Graph Neural Networks with minimal feature vectorsLewis Chinery, Newton Wahome, Iain Moal, and 1 more authorBioinformatics, Nov 2022
The development of new vaccines and antibody therapeutics typically takes several years and requires over $1bn in investment. Accurate knowledge of the paratope (antibody binding site) can speed up and reduce the cost of this process by improving our understanding of antibody-antigen binding. We present Paragraph, a structure-based paratope prediction tool that outperforms current state-of-the-art tools using simpler feature vectors and no antigen information. Source code is freely available at www.github.com/oxpig/Paragraph. Supplementary data are available at Bioinformatics online.
- Fragment Libraries Designed to Be Functionally Diverse Recover Protein Binding Information More Efficiently Than Standard Structurally Diverse LibrariesAnna Carbery, Rachael Skyner, Frank von Delft, and 1 more authorJournal of Medicinal Chemistry, Nov 2022
- Understanding the genetics of viral drug resistance by integrating clinical data and mining of the scientific literatureAn Goto, Raul Rodriguez-Esteban, Sebastian H. Scharf, and 1 more authorScientific Reports, Nov 2022
Drug resistance caused by mutations is a public health threat for existing and emerging viral diseases. A wealth of evidence about these mutations and their clinically associated phenotypes is scattered across the literature, but a comprehensive perspective is usually lacking. This work aimed to produce a clinically relevant view for the case of Hepatitis B virus (HBV) mutations by combining a chronic HBV clinical study with a compendium of genetic mutations systematically gathered from the scientific literature. We enriched clinical mutation data by systematically mining 2,472,725 scientific articles from PubMed Central in order to gather information about the HBV mutational landscape. By performing this analysis, we were able to identify mutational hotspots for each HBV genotype (A-E) and gene (C, X, P, S), as well as the location of disulfide bonds associated with these mutations. Through a modelling study, we also identified a mutation position common in both the clinical data and the literature that is located at the binding pocket for a known anti-HBV drug, namely entecavir. The results of this novel approach show the potential of integrated analyses to assist in the development of new drugs for viral diseases that are more robust to resistance. Such analyses should be of particular interest due to the increasing importance of viral resistance in established and emerging viruses, such as for newly developed drugs against SARS-CoV-2.
- Peptide Centric Vβ Specific Germline Contacts Shape a Specialist T Cell ResponseYang Wang, Alexandra Tsitsiklis, Stephanie Devoe, and 11 more authorsFrontiers in Immunology, Nov 2022
Certain CD8 T cell responses are particularly effective at controlling infection, as exemplified by elite control of HIV in individuals harboring HLA-B57. To understand the structural features that contribute to CD8 T cell elite control, we focused on a strongly protective CD8 T cell response directed against a parasite-derived peptide (HF10) presented by an atypical MHC-I molecule, H-2Ld. This response exhibits a focused TCR repertoire dominated by Vβ2, and a representative TCR (TG6) in complex with Ld-HF10 reveals an unusual structure in which both MHC and TCR contribute extensively to peptide specificity, along with a parallel footprint of TCR on its pMHC ligand. The parallel footprint is a common feature of Vβ2-containing TCRs and correlates with an unusual Vα-Vβ interface, CDR loop conformations, and Vβ2-specific germline contacts with peptide. Vβ2 and Ld may represent “specialist” components for antigen recognition that allow for particularly strong and focused T cell responses.
- The Therapeutic Antibody Profiler for Computational Developability Assessment. Matthew I. J. Raybould and Charlotte M. Deane. Methods in Molecular Biology, Nov 2022
- Extracting Information from Gene Coexpression Networks of Rhizobium leguminosarum. Javier Pardo-Diaz, Mariano Beguerisse-Diaz, Philip S. Poole, and 2 more authors. Journal of Computational Biology, Nov 2022
- Scoring Functions for Protein-Ligand Binding Affinity Prediction Using Structure-based Deep Learning: A Review. Rocco Meli, Garrett M. Morris, and Philip C. Biggin. Frontiers in Bioinformatics, Nov 2022
- Empirical Bayes functional models for hydrogen deuterium exchange mass spectrometry. Oliver M. Crook, Chun-wa Chung, and Charlotte M. Deane. Communications Biology, Nov 2022
- Challenges and Opportunities for Bayesian Statistics in Proteomics. Oliver M. Crook, Chun-wa Chung, and Charlotte M. Deane. Journal of Proteome Research, Nov 2022
- Incorporating Target-Specific Pharmacophoric Information into Deep Generative Models for Fragment Elaboration. Thomas E Hadfield, Fergus Imrie, Andy Merritt, and 2 more authors. Journal of Chemical Information and Modeling, Nov 2022
- Membranome 3.0: Database of single-pass membrane proteins with AlphaFold models. Andrei L Lomize, Kevin A Schnitzer, Spencer C Todd, and 4 more authors. Protein Science, Nov 2022
- Advances in computational structure-based antibody design. Alissa M. Hummer, Brennan Abanades, and Charlotte M. Deane. Current Opinion in Structural Biology, Nov 2022
- CoPriNet: Deep learning compound price prediction for use in de novo molecule generation and prioritization. Ruben Sanchez-Garcia, Dávid Havasi, Gergely Takács, and 4 more authors. Digital Discovery, Nov 2022
- Current structure predictors are not learning the physics of protein folding. Carlos Outeiral, Daniel A Nissley, and Charlotte M Deane. Bioinformatics, Nov 2022
- ABlooper: Fast accurate antibody CDR loop structure prediction with accuracy estimation. Brennan Abanades, Guy Georges, Alexander Bujotzek, and 1 more author. Bioinformatics, Nov 2022
- AI in 3D compound design. Thomas E. Hadfield and Charlotte M. Deane. Current Opinion in Structural Biology, Nov 2022
- AbLang: An antibody language model for completing antibody sequences. Tobias Hegelund Olsen, Iain H Moal, and Charlotte M Deane. Bioinformatics Advances, Nov 2022
Motivation: General protein language models have been shown to summarise the semantics of protein sequences into representations that are useful for state-of-the-art predictive methods. However, for antibody-specific problems, such as restoring residues lost due to sequencing errors, a model trained solely on antibodies may be more powerful. Antibodies are one of the few protein types where the volume of sequence data needed for such language models is available, for example in the Observed Antibody Space (OAS) database. Results: Here, we introduce AbLang, a language model trained on the antibody sequences in the OAS database. We demonstrate the power of AbLang by using it to restore missing residues in antibody sequence data, a key issue with B-cell receptor repertoire sequencing: over 40% of OAS sequences are missing the first 15 amino acids. AbLang restores the missing residues of antibody sequences better than using IMGT germlines or the general protein language model ESM-1b. Further, AbLang does not require knowledge of the germline of the antibody and is seven times faster than ESM-1b.
- Current advances in biopharmaceutical informatics: guidelines, impact and challenges in the computational developability assessment of antibody therapeutics. Rahul Khetan, Robin Curtis, Charlotte M Deane, and 9 more authors. mAbs, Nov 2022
- SAbDab in the age of biotherapeutics: updates including SAbDab-nano, the nanobody structure tracker. Constantin Schneider, Matthew I J Raybould, and Charlotte M Deane. Nucleic Acids Research, Nov 2022
In 2013, we released the Structural Antibody Database (SAbDab), a publicly available repository of experimentally determined antibody structures. In the interim, the rapid increase in the number of antibody structure depositions to the Protein Data Bank, driven primarily by increased interest in antibodies as biotherapeutics, has led us to implement several improvements to the original database infrastructure. These include the development of SAbDab-nano, a sub-database that tracks nanobodies (heavy chain-only antibodies) which have seen a particular growth in attention from both the academic and pharmaceutical research communities over the past few years. Both SAbDab and SAbDab-nano are updated weekly, comprehensively annotated with the latest features described here, and are freely accessible at opig.stats.ox.ac.uk/webapps/newsabdab/.
2021
- A virtual drug-screening approach to conquer huge chemical libraries. Maranga Mokaya and Charlotte M. Deane. Nature, Nov 2021
- Current strategies for detecting functional convergence across B-cell receptor repertoires. Matthew I J Raybould, Anthony R Rees, and Charlotte M Deane. mAbs, Nov 2021
- Investigating the potential for a limited quantum speedup on protein lattice problems. Carlos Outeiral, Garrett M Morris, Jiye Shi, and 3 more authors. New Journal of Physics, Nov 2021
Protein folding is a central challenge in computational biology, with important applications in molecular biology, drug discovery and catalyst design. As a hard combinatorial optimisation problem, it has been studied as a potential target problem for quantum annealing. Although several experimental implementations have been discussed in the literature, the computational scaling of these approaches has not been elucidated. In this article, we present a numerical study of quantum annealing applied to a large number of small peptide folding problems, aiming to infer useful insights for near-term applications. We present two conclusions: that even naïve quantum annealing, when applied to protein lattice folding, has the potential to outperform classical approaches, and that careful engineering of the Hamiltonians and schedules involved can deliver notable relative improvements for this problem. Overall, our results suggest that quantum algorithms may well offer improvements for problems in the protein folding and structure prediction realm.
- Deep generative design with 3D pharmacophoric constraints. Fergus Imrie, Thomas E. Hadfield, Anthony R. Bradley, and 1 more author. Chemical Science, Nov 2021
Generative models have increasingly been proposed as a solution to the molecular design problem. However, it has proved challenging to control the design process or incorporate prior knowledge, limiting their practical use in drug discovery. In particular, generative methods have made limited use of three-dimensional (3D) structural information even though this is critical to binding. This work describes a method to incorporate such information and demonstrates the benefit of doing so. We combine an existing graph-based deep generative model, DeLinker, with a convolutional neural network to utilise physically-meaningful 3D representations of molecules and target pharmacophores. We apply our model, DEVELOP, to both linker and R-group design, demonstrating its suitability for both hit-to-lead and lead optimisation. The 3D pharmacophoric information results in improved generation and allows greater control of the design process. In multiple large-scale evaluations, we show that including 3D pharmacophoric constraints results in substantial improvements in the quality of generated molecules. On a challenging test set derived from PDBbind, our model improves the proportion of generated molecules with high 3D similarity to the original molecule by over 300%. In addition, DEVELOP recovers 10× more of the original molecules compared to the baseline DeLinker method. Our approach is general-purpose, readily modifiable to alternate 3D representations, and can be incorporated into other generative frameworks. Code is available at https://github.com/oxpig/DEVELOP.
- OAS: A diverse database of cleaned, annotated and translated unpaired and paired antibody sequences. Tobias H Olsen, Fergus Boyles, and Charlotte M Deane. Protein Science, Nov 2021
The antibody repertoires of individuals and groups have been used to explore disease states, understand vaccine responses and drive therapeutic development. The arrival of B-cell receptor repertoire sequencing has enabled researchers to get a snapshot of these antibody repertoires, and as more data is generated, increasingly in-depth studies are possible. However, most publicly available data only exists as raw FASTQ files, making the data hard to access, process and compare. The Observed Antibody Space (OAS) database was created in 2018 to offer clean, annotated and translated repertoire data. In this paper we describe an update to OAS that has been driven by the increasing volume of data and the appearance of paired (VH/VL) sequence data. OAS is now accessible via a new web server, with standardised search parameters and a new sequence-based search option. The new database provides both nucleotides and amino acids for every sequence, with additional sequence annotations to make the data MiAIRR-compliant, and comments on potential problems with the sequence. OAS now contains 25 new studies, including SARS-CoV-2 data and paired sequencing data. The new database is accessible at http://opig.stats.ox.ac.uk/webapps/oas/ and all data is freely available for download.
- DLAB—Deep learning methods for structure-based virtual screening of antibodies. Constantin Schneider, Andrew Buchanan, Bruck Taddese, and 1 more author. Bioinformatics, Nov 2021
Antibodies are one of the most important classes of pharmaceuticals, with over 80 approved molecules currently in use against a wide variety of diseases. The drug discovery process for antibody therapeutic candidates, however, is time- and cost-intensive and heavily reliant on in vivo and in vitro high-throughput screens. Here, we introduce a framework for structure-based deep learning for antibodies (DLAB) which can virtually screen putative binding antibodies against antigen targets of interest. DLAB is built to be able to predict antibody-antigen binding for antigens with no known antibody binders. We demonstrate that DLAB can be used both to improve antibody-antigen docking and structure-based virtual screening of antibody drug candidates. DLAB enables improved pose ranking for antibody docking experiments as well as selection of antibody-antigen pairings for which accurate poses are generated and correctly ranked. We also show that DLAB can identify binding antibodies against specific antigens in a case study. Our results demonstrate the promise of deep learning methods for structure-based virtual screening of antibodies. The DLAB source code and pre-trained models are available at https://github.com/oxpig/dlab-public. Supplementary data are available at Bioinformatics online.
- Discovery of SARS-CoV-2 Mpro Peptide Inhibitors from Modelling Substrate and Ligand Binding. H. T. Henry Chan, Marc Alexander Moesser, Rebecca K. Walters, and 25 more authors. Chemical Science, Nov 2021
The main protease (Mpro) of SARS-CoV-2 is central to viral maturation and is a promising drug target, but little is known about structural aspects of how it binds to its 11 natural cleavage sites. We used biophysical and crystallographic data and an array of biomolecular simulation techniques, including automated docking, molecular dynamics (MD) and interactive MD in virtual reality, QM/MM, and linear-scaling DFT, to investigate the molecular features underlying recognition of the natural Mpro substrates. We extensively analysed the subsite interactions of modelled 11-residue cleavage site peptides, crystallographic ligands, and docked COVID Moonshot-designed covalent inhibitors. Our modelling studies reveal remarkable consistency in the hydrogen bonding patterns of the natural Mpro substrates, particularly on the N-terminal side of the scissile bond. They highlight the critical role of interactions beyond the immediate active site in recognition and catalysis, in particular plasticity at the S2 site. Building on our initial Mpro-substrate models, we used predictive saturation variation scanning (PreSaVS) to design peptides with improved affinity. Non-denaturing mass spectrometry and other biophysical analyses confirm these new and effective ‘peptibitors’ inhibit Mpro competitively. Our combined results provide new insights and highlight opportunities for the development of Mpro inhibitors as anti-COVID-19 drugs.
- Learning from Docked Ligands: Ligand-Based Features Rescue Structure-Based Scoring Functions When Trained on Docked Poses. Fergus Boyles, Charlotte M. Deane, and Garrett M. Morris. Journal of Chemical Information and Modeling, Nov 2021
Machine learning scoring functions for protein–ligand binding affinity have been found to consistently outperform classical scoring functions when trained and tested on crystal structures of bound protein–ligand complexes. However, it is less clear how these methods perform when applied to docked poses of complexes. We explore how the use of docked rather than crystallographic poses for both training and testing affects the performance of machine learning scoring functions. Using the PDBbind Core Sets as benchmarks, we show that the performance of a structure-based machine learning scoring function trained and tested on docked poses is lower than that of the same scoring function trained and tested on crystallographic poses. We construct a hybrid scoring function by combining both structure-based and ligand-based features, and show that its ability to predict binding affinity using docked poses is comparable to that of purely structure-based scoring functions trained and tested on crystal poses. We also present a new, freely available validation set—the Updated DUD-E Diverse Subset—for binding affinity prediction using data from DUD-E and ChEMBL. Despite strong performance on docked poses of the PDBbind Core Sets, we find that our hybrid scoring function sometimes generalizes poorly to a protein target not represented in the training set, demonstrating the need for improved scoring functions and additional validation benchmarks.
- Co-evolutionary distance predictions contain flexibility information. Dominik Schwarz, Guy Georges, Sebastian Kelm, and 3 more authors. Bioinformatics, Nov 2021
Co-evolution analysis can be used to accurately predict residue–residue contacts from multiple sequence alignments. The introduction of machine-learning techniques has enabled substantial improvements in precision and a shift from predicting binary contacts to predicting distances between pairs of residues. These developments have significantly improved the accuracy of de novo prediction of static protein structures. With AlphaFold2 lifting the accuracy of some predicted protein models close to experimental levels, structure prediction research will move on to other challenges. One of those areas is the prediction of more than one conformation of a protein. Here, we examine the potential of residue–residue distance predictions to be informative of protein flexibility rather than simply static structure. We used DMPfold to predict distance distributions for every residue pair in a set of proteins that showed both rigid and flexible behaviour. Residue pairs that were in contact in at least one reference structure were classified as rigid, flexible or neither. The predicted distance distribution of each residue pair was analysed for local maxima of probability indicating the most likely distance or distances between a pair of residues. We found that rigid residue pairs tended to have only a single local maximum in their predicted distance distributions while flexible residue pairs more often had multiple local maxima. These results suggest that the shape of predicted distance distributions contains information on the rigidity or flexibility of a protein and its constituent residues. Supplementary data are available at Bioinformatics online.
- Learning protein-ligand binding affinity with atomic environment vectors. Rocco Meli, Andrew Anighoro, Mike J. Bodkin, and 2 more authors. Journal of Cheminformatics, Nov 2021
Scoring functions for the prediction of protein-ligand binding affinity have seen renewed interest in recent years when novel machine learning and deep learning methods started to consistently outperform classical scoring functions. Here we explore the use of atomic environment vectors (AEVs) and feed-forward neural networks, the building blocks of several neural network potentials, for the prediction of protein-ligand binding affinity. The AEV-based scoring function, which we term AEScore, is shown to perform as well as or better than other state-of-the-art scoring functions on binding affinity prediction, with an RMSE of 1.22 pK units and a Pearson’s correlation coefficient of 0.83 for the CASF-2016 benchmark. However, AEScore does not perform as well in docking and virtual screening tasks, for which it has not been explicitly trained. Therefore, we show that the model can be combined with the classical scoring function AutoDock Vina in the context of Δ-learning, where corrections to the AutoDock Vina scoring function are learned instead of the protein-ligand binding affinity itself. Combined with AutoDock Vina, Δ-AEScore has an RMSE of 1.32 pK units and a Pearson’s correlation coefficient of 0.80 on the CASF-2016 benchmark, while retaining the docking and screening power of the underlying classical scoring function.
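The Δ-learning idea in the abstract above can be illustrated with a minimal sketch. It is not the AEScore implementation: the neural network on atomic environment vectors is replaced here by a one-feature least-squares fit, and the function names (`fit_line`, `fit_delta_model`, `predict`) and the toy numbers are assumptions made up for illustration only.

```python
from statistics import mean


def fit_line(xs, ys):
    """Ordinary least squares for y ~ a*x + b with a single feature."""
    mx, my = mean(xs), mean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    a = sxy / sxx
    return a, my - a * mx


def fit_delta_model(features, baseline_scores, targets):
    """Fit the correction (target minus baseline score), not the target itself."""
    residuals = [t - s for t, s in zip(targets, baseline_scores)]
    return fit_line(features, residuals)


def predict(model, feature, baseline_score):
    """Final prediction = classical baseline score + learned correction."""
    a, b = model
    return baseline_score + (a * feature + b)


# Toy, made-up numbers (not from the paper): the baseline is systematically
# off by 0.5 * feature + 0.5, which the correction model should recover.
features = [0.0, 1.0, 2.0, 3.0]
baseline_scores = [5.0, 5.5, 6.0, 6.5]
targets = [5.5, 6.5, 7.5, 8.5]

model = fit_delta_model(features, baseline_scores, targets)
corrected = predict(model, 4.0, 7.0)
```

Because the baseline score is kept as an additive term, the combined score still contains whatever signal the classical function provides, which is the property the abstract relies on for docking and screening.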
- Different B cell subpopulations show distinct patterns in their IgH repertoire metrics. Marie Ghraichy, Valentin Niederhäusern, Aleksandr Kovaltsuk, and 3 more authors. eLife, Nov 2021
- Humanization of antibodies using a machine learning approach on large-scale repertoire data. Claire Marks, Alissa M. Hummer, Mark Chin, and 1 more author. Bioinformatics, Nov 2021
Motivation. Monoclonal antibody therapeutics are often produced from non-human sources (typically murine), and can therefore generate immunogenic responses in humans. Humanization procedures aim to produce antibody therapeutics that do not elicit an immune response and are safe for human use, without impacting efficacy. Humanization is normally carried out in a largely trial-and-error experimental process. We have built machine learning classifiers that can discriminate between human and non-human antibody variable domain sequences using the large amount of repertoire data now available. Results. Our classifiers consistently outperform the current best-in-class model for distinguishing human from murine sequences, and our output scores exhibit a negative relationship with the experimental immunogenicity of existing antibody therapeutics. We used our classifiers to develop a novel, computational humanization tool, Hu-mAb, that suggests mutations to an input sequence to reduce its immunogenicity. For a set of therapeutic antibodies with known precursor sequences, the mutations suggested by Hu-mAb show significant overlap with those deduced experimentally. Hu-mAb is therefore an effective replacement for trial-and-error humanization experiments, producing similar results in a fraction of the time. Availability. Hu-mAb (humanness scoring and humanization) is freely available to use at opig.stats.ox.ac.uk/webapps/humab.
- Epitope profiling using computational structural modelling demonstrated on coronavirus-binding antibodies. Sarah A. Robinson, Matthew I. J. Raybould, Constantin Schneider, and 3 more authors. PLoS Computational Biology, Nov 2021
Identifying the epitope of an antibody is a key step in understanding its function and its potential as a therapeutic. Sequence-based clonal clustering can identify antibodies with similar epitope complementarity; however, antibodies from markedly different lineages but with similar structures can engage the same epitope. We describe a novel computational method for epitope profiling based on structural modelling and clustering. Using the method, we demonstrate that sequence-dissimilar but functionally similar antibodies can be found across the Coronavirus Antibody Database, with high accuracy (92 percent of antibodies in multiple-occupancy structural clusters bind to consistent domains). Our approach functionally links antibodies with distinct genetic lineages, species origins, and coronavirus specificities. This indicates greater convergence exists in the immune responses to coronaviruses than is suggested by sequence-based approaches. Our results show that applying structural analytics to large class-specific antibody databases will enable high confidence structure-function relationships to be drawn, yielding new opportunities to identify functional convergence hitherto missed by sequence-only analysis.
- Understanding Conformational Entropy in Small Molecules. Lucian Chan, Garrett M. Morris, and Geoffrey R. Hutchison. Journal of Chemical Theory and Computation, Nov 2021
The calculation of the entropy of flexible molecules can be challenging, since the number of possible conformers grows exponentially with molecule size and many low-energy conformers may be thermally accessible. Different methods have been proposed to approximate the contribution of conformational entropy to the molecular standard entropy, including performing thermochemistry calculations with all possible stable conformations, and developing empirical corrections from experimental data. We have performed conformer sampling on over 120,000 small molecules generating some 12 million conformers, to develop models to predict conformational entropy across a wide range of molecules. Using insight into the nature of conformational disorder, our cross-validated physically-motivated statistical model can outperform common machine learning and deep learning methods, with a mean absolute error ≈4.8 J/mol·K, or under 0.4 kcal/mol at 300 K. Beyond predicting molecular entropies and free energies, the model implies a high degree of correlation between torsions in most molecules, often assumed to be independent. While individual dihedral rotations may have low energetic barriers, the shape and chemical functionality of most molecules necessarily correlate their torsional degrees of freedom, and hence restrict the number of low-energy conformations immensely. Our simple models capture these correlations, and advance our understanding of small molecule conformational entropy.
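The two error figures quoted in the abstract above are related by multiplying the entropy error by the temperature and converting joules to kilocalories. A minimal sketch of that arithmetic follows; the function name is hypothetical, and the conversion assumes the thermochemical calorie (1 kcal = 4184 J).

```python
J_PER_KCAL = 4184.0  # thermochemical calorie; an assumption about the convention used


def entropy_error_to_free_energy_kcal(mae_j_per_mol_k, temperature_k):
    """Convert an entropy error in J/(mol K) to a T*S error in kcal/mol."""
    return mae_j_per_mol_k * temperature_k / J_PER_KCAL


# 4.8 J/(mol K) at 300 K is 1440 J/mol, i.e. roughly 0.34 kcal/mol.
value = entropy_error_to_free_energy_kcal(4.8, 300.0)
```

The result is consistent with the abstract's statement that the error is under 0.4 kcal/mol at 300 K.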
- The allosteric modulation of Complement C5 by knob domain peptides. Alex Macpherson, Maisem Laabei, Zainab Ahdash, and 15 more authors. eLife, Nov 2021
Bovines have evolved a subset of antibodies with ultra-long CDRH3 regions that harbour cysteine-rich knob domains. To produce high affinity peptides, we previously isolated autonomous 3-6 kDa knob domains from bovine antibodies. Here, we show that binding of four knob domain peptides elicits a range of effects on the clinically validated drug target complement C5. Allosteric mechanisms predominated, with one peptide selectively inhibiting C5 cleavage by the alternative pathway C5 convertase, revealing a targetable mechanistic difference between the classical and alternative pathway C5 convertases. Taking a hybrid biophysical approach, we present C5-knob domain co-crystal structures and, by solution methods, observed allosteric effects propagating >50 Å from the binding sites. This study expands the therapeutic scope of C5, presents new inhibitors and introduces knob domains as new, low molecular weight antibody fragments, with therapeutic potential.
- Generating Property-Matched Decoy Molecules Using Deep Learning. Fergus Imrie, Anthony R. Bradley, and Charlotte M. Deane. Bioinformatics, Nov 2021
An essential step in the development of virtual screening methods is the use of established sets of actives and decoys for benchmarking and training. However, the decoy molecules in commonly used sets are biased, meaning that methods often exploit these biases to separate actives and decoys, rather than learning how to perform molecular recognition. This fundamental issue prevents generalisation and hinders virtual screening method development. We have developed a deep learning method (DeepCoy) that generates decoys to a user’s preferred specification in order to remove such biases or construct sets with a defined bias. We validated DeepCoy using two established benchmarks, DUD-E and DEKOIS 2.0. For all DUD-E targets and 80 of the 81 DEKOIS 2.0 targets, our generated decoy molecules more closely matched the active molecules’ physicochemical properties while introducing no discernible additional risk of false negatives. The DeepCoy decoys improved the Deviation from Optimal Embedding (DOE) score by an average of 81% and 66%, respectively, decreasing from 0.163 to 0.032 for DUD-E and from 0.109 to 0.038 for DEKOIS 2.0. Further, the generated decoys are harder to distinguish than the original decoy molecules via docking with AutoDock Vina, with virtual screening performance falling from an AUC ROC of 0.71 to 0.63. The code is available at https://github.com/oxpig/DeepCoy. Generated molecules can be downloaded from http://opig.stats.ox.ac.uk/resources.
- Ribosome occupancy profiles are conserved between structurally and evolutionarily related yeast domains. Daniel A. Nissley, Anna Carbery, Mark Chonofsky, and 1 more author. Bioinformatics, Nov 2021
Motivation. Protein synthesis is a non-equilibrium process, meaning that the speed of translation can influence the ability of proteins to fold and function. Assuming that structurally similar proteins fold by similar pathways, the profile of translation speed along an mRNA should be evolutionarily conserved between related proteins to direct correct folding and downstream function. The only evidence to date for such conservation of translation speed between homologous proteins has used codon rarity as a proxy for translation speed. There are, however, many other factors including mRNA structure and the chemistry of the amino acids in the A- and P-sites of the ribosome that influence the speed of amino acid addition. Results. Ribosome profiling experiments provide a signal directly proportional to the underlying translation times at the level of individual codons. We compared ribosome occupancy profiles (extracted from five different large-scale yeast ribosome profiling studies) between related protein domains to more directly test if their translation schedule was conserved. Our analysis reveals that the ribosome occupancy profiles of paralogous domains tend to be significantly more similar to one another than to profiles of non-paralogous domains. This trend does not depend on domain length, structural classes, amino acid composition, or sequence similarity. Our results indicate that entire ribosome occupancy profiles and not just rare codon locations are conserved between even distantly related domains in yeast, providing support for the hypothesis that translation schedule is conserved between structurally related domains to retain folding pathways and facilitate efficient folding. Availability. Python3 code is available on GitHub at https://github.com/DanNissley/Compare-ribosome-occupancy-profiles. Supplementary information. Supplementary data are available at Bioinformatics online.
- Understanding Ring Puckering in Small Molecules and Cyclic Peptides. Lucian Chan, Geoffrey Hutchison, and Garrett M. Morris. Journal of Chemical Information and Modeling, Nov 2021
The geometry of a molecule plays a significant role in determining its physical and chemical properties. Despite its importance, there are relatively few studies on ring puckering and conformations, often focused on small cycloalkanes, five- and six-membered carbohydrate rings and specific macrocycle families. We lack a general understanding of the puckering preferences of medium-sized rings and macrocycles. To address this, we provide an extensive conformational analysis of a diverse set of rings. We used Cremer-Pople puckering coordinates to study the trends of the ring conformation across a set of 140,000 diverse small molecules, including small rings, macrocycles and cyclic peptides. By standardizing using key atoms, we show that the ring conformations can be classified into relatively few conformational clusters, based on their canonical forms. The number of such canonical clusters increases slowly with ring size. Ring puckering motions, especially pseudo-rotations, are generally restricted, and differ between clusters. More importantly, we propose models to map puckering preferences to torsion space, which allows us to understand the interrelated changes in torsion angles during pseudo-rotation and other puckering motions. Beyond ring puckers, our models also explain the change in substituent orientation upon puckering. In summary, this work provides an improved understanding of general ring puckering preferences, which will in turn accelerate the identification of low energy ring conformations for applications from polymeric materials to drug binding.
- Robust gene coexpression networks using signed distance correlation. Javier Pardo-Diaz, Lyuba V. Bozhilova, Mariano Beguerisse-Diaz, and 3 more authors. Bioinformatics, Nov 2021
Motivation. Even within well studied organisms, many genes lack useful functional annotations. One way to generate such functional information is to infer biological relationships between genes/proteins, using a network of gene coexpression data that includes functional annotations. However, the lack of trustworthy functional annotations can impede the validation of such networks. Hence, there is a need for a principled method to construct gene coexpression networks that capture biological information and are structurally stable even in the absence of functional information. Results. We introduce the concept of signed distance correlation as a measure of dependency between two variables, and apply it to generate gene coexpression networks. Distance correlation offers a more intuitive approach to network construction than commonly used methods such as Pearson correlation and mutual information. We propose a framework to generate self-consistent networks using signed distance correlation purely from gene expression data, with no additional information. We analyse data from three different organisms to illustrate how networks generated with our method are more stable and capture more biological information compared to networks obtained from Pearson correlation or mutual information. Supplementary information. Supplementary Information and code are available at Bioinformatics and https://github.com/javier-pardodiaz/sdcorGCN online.
- A computational method for immune repertoire mining that identifies novel binders from different clonotypes, demonstrated by identifying anti-Pertussis toxoid antibodies. Eve Richardson, Jacob D. Galson, Paul Kellam, and 5 more authors. mAbs, Nov 2021
Due to their shared genetic history, antibodies from the same clonotype often bind to the same epitope. This knowledge is used in immune repertoire mining, where known binders are used to search bulk sequencing repertoires to identify new binders. However, current computational methods cannot identify epitope convergence between antibodies from different clonotypes, limiting the sequence diversity of antigen-specific antibodies which can be identified. We describe how the antibody binding site, the paratope, can be used to cluster antibodies with common antigen reactivity from different clonotypes. Our method, paratyping, uses the predicted paratope to identify these novel cross-clonotype matches. We experimentally validated our predictions on a Pertussis toxoid dataset. Our results show that even the simplest abstraction of the antibody binding site, using only the length of the loops involved and predicted binding residues, is sufficient to group antigen-specific antibodies and provide additional information to conventional clonotype analysis.
- CoV-AbDab: the Coronavirus Antibody Database. Matthew I. J. Raybould, Aleksandr Kovaltsuk, Claire Marks, and 1 more author. Bioinformatics, Nov 2021
Motivation: The emergence of a novel strain of betacoronavirus, SARS-CoV-2, has led to a pandemic that has been associated with over 700,000 deaths as of 5th August 2020. Research is ongoing around the world to create vaccines and therapies to minimise rates of disease spread and mortality. Crucial to these efforts are molecular characterisations of neutralising antibodies to SARS-CoV-2. Such antibodies would be valuable for measuring vaccine efficacy, diagnosing exposure, and developing effective biotherapeutics. Here, we describe our new database, CoV-AbDab, which already contains data on over 1400 published/patented antibodies and nanobodies known to bind to at least one betacoronavirus. This database is the first consolidation of antibodies known to bind SARS-CoV-2 as well as other betacoronaviruses such as SARS-CoV-1 and MERS-CoV. It contains relevant metadata including evidence of cross-neutralisation, antibody/nanobody origin, full variable domain sequence (where available) and germline assignments, epitope region, links to relevant PDB entries, homology models, and source literature. Results: On 5th August 2020, CoV-AbDab referenced sequence information on 1402 anti-coronavirus antibodies and nanobodies, spanning 66 papers and 21 patents. Of these, 1131 bind to SARS-CoV-2. Availability: CoV-AbDab is free to access and download without registration at http://opig.stats.ox.ac.uk/webapps/coronavirus. Community submissions are encouraged.
- Hypergraphs for predicting essential genes using multiprotein complex data. Florian Klimm, Charlotte M Deane, and Gesine Reinert. Journal of Complex Networks, Nov 2021
Protein-protein interactions are crucial in many biological pathways and facilitate cellular function. Investigating these interactions as a graph of pairwise interactions can help to gain a systemic understanding of cellular processes. It is known, however, that proteins interact with each other not exclusively in pairs but also in polyadic interactions, and that they can form multiprotein complexes, which are stable interactions between multiple proteins. In this manuscript, we use hypergraphs to investigate multiprotein complex data. We investigate two random null models to test which hypergraph properties occur as a consequence of constraints, such as the size and the number of multiprotein complexes. We find that assortativity, the number of connected components, and clustering differ between the data and these null models. Our main finding is that projecting a hypergraph of polyadic interactions onto a graph of pairwise interactions leads to the identification of different proteins as hubs than in the hypergraph. We find in our data set that the hypergraph degree is a more accurate predictor of gene essentiality than the degree in the pairwise graph. We find that analysing a hypergraph as a pairwise graph drastically changes the distribution of the local clustering coefficient. Furthermore, using a pairwise interaction graph to represent multiprotein complex data may lead to a spurious hierarchical structure, which is not observed in the hypergraph. Hence, we illustrate that hypergraphs can be more suitable than pairwise graphs for the analysis of multiprotein complex data.
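The hub effect described in this abstract can be seen on a toy example. The sketch below is only an illustration under simple assumptions: each multiprotein complex is treated as one hyperedge, hypergraph degree is the number of complexes a protein belongs to, and the projection links every pair of proteins that share a complex. The protein names and helper functions are hypothetical.

```python
from itertools import combinations


def hypergraph_degree(complexes):
    """Number of complexes (hyperedges) each protein belongs to."""
    degree = {}
    for members in complexes:
        for protein in set(members):
            degree[protein] = degree.get(protein, 0) + 1
    return degree


def projected_degree(complexes):
    """Number of distinct partners of each protein after pairwise projection."""
    neighbours = {}
    for members in complexes:
        for a, b in combinations(sorted(set(members)), 2):
            neighbours.setdefault(a, set()).add(b)
            neighbours.setdefault(b, set()).add(a)
    return {protein: len(partners) for protein, partners in neighbours.items()}


# Protein A sits in three small complexes; E to I form one large complex.
toy_complexes = [
    ["A", "B"],
    ["A", "C"],
    ["A", "D"],
    ["E", "F", "G", "H", "I"],
]

print(hypergraph_degree(toy_complexes))
print(projected_degree(toy_complexes))
```

In this toy data, A is the only protein in more than one complex, yet after projection every member of the single five-protein complex has more pairwise partners than A, so the apparent hubs change.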
- Ab-Ligity: Identifying sequence-dissimilar antibodies that bind to the same epitope. Wing Ki Wong, Sarah A Robinson, Alexander Bujotzek, and 6 more authors. MAbs, Nov 2021
Motivation: Solving the structure of an antibody-antigen complex gives atomic-level information on the interactions between an antibody and its antigen, but such structures are expensive and hard to obtain. Alternative experimental sources include epitope mapping and binning experiments, which can be used as a surrogate to identify key interacting residues. However, their resolution is usually not sufficient to identify whether two antibodies have identical interactions. Computational approaches to this problem have so far been based on the premise that antibodies with similar sequences behave similarly. Such approaches will fail to identify sequence-distant antibodies that target the same epitope. Results: We present Ab-Ligity, a structure-based similarity measure tailored to antibody-antigen interfaces. Using predicted paratopes on model antibody structures, we assessed our ability to identify those antibodies that target highly similar epitopes. Most antibodies adopting similar binding modes can be identified from sequence similarity alone, using methods such as clonotyping. In the challenging subset where the antibody sequences differ significantly, Ab-Ligity is still able to predict antibodies that would bind to highly similar epitopes (area under the precision-recall curve of 0.86). We compared Ab-Ligity’s performance to an existing tool, InterComp, and showed improved performance alongside a significant speed-up. These results suggest that Ab-Ligity will allow the identification of diverse (sequence-dissimilar) antibodies that bind to the same epitopes from large datasets such as immune repertoires.
- Public Baseline and Shared Response Structures Support the Theory of Antibody Repertoire Functional Commonality. Matthew I J Raybould, Claire Marks, Aleksandr Kovaltsuk, and 3 more authors. PLoS Computational Biology, Nov 2021
The naïve antibody/B-cell receptor (BCR) repertoires of different individuals ought to exhibit significant functional commonality, given that most pathogens trigger an effective antibody response to immunodominant epitopes. Sequence-based repertoire analysis has so far offered little evidence for this phenomenon. For example, a recent study estimated the number of shared (‘public’) antibody clonotypes in circulating baseline repertoires to be around 0.02% across ten unrelated individuals. However, to engage the same epitope, antibodies only require a similar binding site structure and the presence of key paratope interactions, which can occur even when their sequences are dissimilar. Here, we search for evidence of geometric similarity/convergence across human antibody repertoires. We first structurally profile naïve (‘baseline’) antibody diversity using snapshots from 41 unrelated individuals, predicting all modellable distinct structures within each repertoire. This analysis uncovers a high (much greater than random) degree of structural commonality. For instance, around 3% of distinct structures are common to the ten most diverse individual samples (‘Public Baseline’ structures). Our approach is the first computational method to find levels of BCR commonality commensurate with epitope immunodominance and could therefore be harnessed to find more genetically distant antibodies with same-epitope complementarity. We then apply the same structural profiling approach to repertoire snapshots from three individuals before and after flu vaccination, detecting a convergent structural drift indicative of recognising similar epitopes (‘Public Response’ structures). We show that Antibody Model Libraries derived from Public Baseline and Public Response structures represent a powerful geometric basis set of low-immunogenicity candidates exploitable for general or target-focused therapeutic antibody screening.
2020
- COGENT: evaluating the consistency of gene co-expression networks. Lyuba V. Bozhilova, Javier Pardo-Diaz, Gesine Reinert, and 1 more author. Bioinformatics, Nov 2020
Gene co-expression networks can be constructed in multiple different ways, both in the use of different measures of co-expression, and in the thresholds applied to the calculated co-expression values, from any given dataset. It is often not clear which co-expression network construction method should be preferred. COGENT provides a set of tools designed to aid the choice of network construction method without the need for any external validation data. Availability and implementation. https://github.com/lbozhilova/COGENT
- Deep sequencing of B cell receptor repertoires from COVID-19 patients reveals strong convergent immune signatures. Jacob D. Galson, Sebastian Schaetzle, Rachael J. M. Bashford-Rogers, and 18 more authors. Frontiers in Immunology, Nov 2020
Deep sequencing of B cell receptor (BCR) heavy chains from a cohort of 19 COVID-19 patients from the UK reveals a stereotypical naive immune response to SARS-CoV-2 which is consistent across patients and may be a positive indicator of disease outcome. Clonal expansion of the B cell memory response is also observed and may be the result of memory bystander effects. There was a strong convergent sequence signature across patients, and we identified 777 clonotypes convergent between at least four of the COVID-19 patients, but not present in healthy controls. A subset of the convergent clonotypes were homologous to known SARS and SARS-CoV-2 spike protein neutralising antibodies. Convergence was also demonstrated across wide geographies by comparison of data sets between patients from UK, USA and China, further validating the disease association and consistency of the stereotypical immune response even at the sequence level. These convergent clonotypes provide a resource to identify potential therapeutic and prophylactic antibodies and demonstrate the potential of BCR profiling as a tool to help understand and predict positive patient responses.
- The prospects of quantum computing in computational molecular biology. Carlos Outeiral, Martin Strahm, Jiye Shi, and 3 more authors. WIRES, Nov 2020
Quantum computers can in principle solve certain problems exponentially more quickly than their classical counterparts. We have not yet reached the advent of useful quantum computation, but when we do, it will affect nearly all scientific disciplines. In this review, we examine how current quantum algorithms could revolutionize computational biology and bioinformatics. There are potential benefits across the entire field, from the ability to process vast amounts of information and run machine learning algorithms far more efficiently, to algorithms for quantum simulation that are poised to improve computational calculations in drug discovery, to quantum algorithms for optimization that may advance fields from protein structure prediction to network analysis. However, these exciting prospects are susceptible to “hype,” and it is also important to recognize the caveats and challenges in this new technology. Our aim is to introduce the promise and limitations of emerging quantum computing technologies in the areas of computational molecular biology and bioinformatics.
- How repertoire data is changing antibody science. Claire Marks and Charlotte M. Deane. Journal of Biological Chemistry, Nov 2020
Antibodies are vital proteins of the immune system that recognize potentially harmful molecules and initiate their removal. Mammals can efficiently create vast numbers of antibodies with different sequences capable of binding to any antigen with high affinity and specificity. Since they can be developed to bind to many disease agents, antibodies can be used as therapeutics. After antigen exposure, antibodies specific to that antigen are enriched through clonal selection, expansion and somatic hypermutation. The antibodies present in an organism therefore report on its immune status, describe its innate ability to deal with harmful substances, and reveal how it has previously responded. Next-generation sequencing technologies are being increasingly used to query the antibody, or B cell receptor (BCR), sequence repertoire, and the amount of BCR data in public repositories is growing. The Observed Antibody Space database, for example, currently contains over a billion sequences from 68 different studies. Repertoires are available that represent both the naive state (i.e. antigen-inexperienced) and that after immunization. This wealth of data has created opportunities to learn more about our immune system. In this review, we discuss the many ways in which BCR repertoire data has been or could be exploited. We highlight its utility for providing insights into how the naive immune repertoire is generated and how it responds to antigens. We also consider how structural information can be used to enhance these data and may lead to more accurate depictions of the sequence space, and to applications in the discovery of new therapeutics.
- TCRBuilder: Multi-state T-cell receptor structure prediction. Wing Ki Wong, Claire Marks, Jinwoo Leem, and 3 more authors. Bioinformatics, Nov 2020
MOTIVATION. T-cell receptors (TCRs) are immune proteins that primarily target peptide antigens presented by the major histocompatibility complex. They tend to have lower specificity and affinity than their antibody counterparts, and their binding sites have been shown to adopt multiple conformations, which is potentially an important factor for their polyspecificity. None of the current TCR modelling tools predict this variability, which limits our ability to accurately predict TCR binding. RESULTS. We present TCRBuilder, a multi-state TCR structure prediction tool. Given a paired αβ TCR sequence, TCRBuilder returns a model or an ensemble of models covering the potential conformations of the binding site. This enables the analysis of structurally-driven polyspecificity in TCRs, which is not possible with existing tools.
- Dataset Augmentation Allows Deep Learning-Based Virtual Screening To Better Generalise To Unseen Target Classes, And Highlight Important Binding Interactions. Jack Scantlebury, Nathan Brown, Frank von Delft, and 1 more author. Journal of Chemical Information and Modeling, Nov 2020
Current deep learning methods for structure-based virtual screening take the structures of both the protein and the ligand as input but make little or no use of the protein structure when predicting ligand binding. Here we show how a relatively simple method of dataset augmentation forces such deep learning methods to take into account information from the protein. Models trained in this way are more generalisable (make better predictions on protein-ligand complexes from a different distribution to the training data). They also assign more meaningful importance to the protein and ligand atoms involved in binding. Overall, our results show that dataset augmentation can help deep learning based virtual screening to learn physical interactions rather than dataset biases.
- BOKEI: Bayesian optimization using knowledge of correlated torsions and expected improvement for conformer generation. Lucian Chan, Geoffrey R. Hutchison, and Garrett M. Morris. Physical Chemistry Chemical Physics, Nov 2020
A key challenge in conformer sampling is finding low-energy conformations with a small number of energy evaluations. We recently demonstrated the Bayesian Optimization Algorithm (BOA) is an effective method for finding the lowest energy conformation of a small molecule. Our approach balances between exploitation and exploration, and is more efficient than exhaustive or random search methods. Here, we extend strategies used on proteins and oligopeptides (e.g. Ramachandran plots of secondary structure) and study correlated torsions in small molecules. We use bivariate von Mises distributions to capture correlations, and use them to constrain the search space. We validate the performance of our new method, Bayesian Optimization with Knowledge-based Expected Improvement (BOKEI), on a dataset consisting of 533 diverse small molecules, using (i) a force field (MMFF94); and (ii) a semi-empirical method (GFN2), as the objective function. We compare the search performance of BOKEI, BOA with Expected Improvement (BOA-EI), and a genetic algorithm (GA), using a fixed number of energy evaluations. In more than 60% of the cases examined, BOKEI finds lower energy conformations than global optimization with BOA-EI or GA. More importantly, we find correlated torsions in up to 15% of small molecules in larger data sets, up to 8 times more often than previously reported. The BOKEI patterns not only describe steric clashes, but also reflect favorable intramolecular interactions such as hydrogen bonds and π–π stacking. Increasing our understanding of the conformational preferences of molecules will help improve our ability to find low energy conformers efficiently, which will have impact in a wide range of computational modeling applications.
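The Expected Improvement criterion named in this abstract has a standard closed form when the surrogate model gives a Gaussian prediction. The sketch below shows that textbook formula for a minimisation problem such as conformer energy; it is not the BOKEI implementation, the knowledge-based torsion term is not modelled, and the function and parameter names are hypothetical.

```python
import math


def expected_improvement(mu, sigma, best_energy):
    """Expected improvement over best_energy for a Gaussian prediction.

    mu and sigma are the surrogate model's predicted mean and standard
    deviation of the energy at a candidate conformation; lower is better.
    """
    if sigma <= 0:
        # No predictive uncertainty: the improvement is deterministic.
        return max(best_energy - mu, 0.0)
    z = (best_energy - mu) / sigma
    pdf = math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)
    cdf = 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))
    return (best_energy - mu) * cdf + sigma * pdf


# A candidate predicted slightly worse than the best energy found so far,
# evaluated with low and high predictive uncertainty.
print(expected_improvement(mu=1.0, sigma=0.1, best_energy=0.5))
print(expected_improvement(mu=1.0, sigma=2.0, best_energy=0.5))
```

The first term rewards candidates whose predicted energy is already low (exploitation) and the second rewards candidates with high uncertainty (exploration), which is the balance the abstract refers to.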
- Deep Generative Models for 3D Linker Design. Fergus Imrie, Anthony R. Bradley, Mihaela van der Schaar, and 1 more author. Journal of Chemical Information and Modeling, Nov 2020
Rational compound design remains a challenging problem for both computational methods and medicinal chemists. Computational generative methods have begun to show promising results for the design problem. However, they have not yet used the power of 3D structural information. We have developed a novel graph-based deep generative model that combines state-of-the-art machine learning techniques with structural knowledge. Our method (“DeLinker”) takes two fragments or partial structures and designs a molecule incorporating both. The generation process is protein context dependent, utilising the relative distance and orientation between the partial structures. This 3D information is vital to successful compound design, and we demonstrate its impact on the generation process and the limitations of omitting such information. In a large scale evaluation, DeLinker designed 60% more molecules with high 3D similarity to the original molecule than a database baseline. When considering the more relevant problem of longer linkers with at least five atoms, the outperformance increased to 200%. We demonstrate the effectiveness and applicability of this approach on a diverse range of design problems: fragment linking, scaffold hopping, and proteolysis targeting chimera (PROTAC) design. As far as we are aware, this is the first molecular generative model to incorporate 3D structural information directly in the design process. Code is available at https://github.com/oxpig/DeLinker.
- Structural Diversity of B-cell Receptor Repertoires along the B-cell Differentiation Axis in Humans and Mice. Aleksandr Kovaltsuk, Matthew I. J. Raybould, Wing Ki Wong, and 5 more authors. PLoS Computational Biology, Nov 2020
Most current analysis tools for antibody next-generation sequencing data work with primary sequence descriptors, leaving accompanying structural information unharnessed. We have used novel rapid methods to structurally characterize the paratopes of more than 180 million human and mouse B-cell receptor (BCR) repertoire sequences. These structurally annotated paratopes provide unprecedented insights into both the structural predetermination and dynamics of the adaptive immune response. We show that B-cell types can be distinguished based solely on these structural properties. Antigen-unexperienced BCR repertoires use the highest number and diversity of paratope structures and these patterns of naive repertoire paratope usage are highly conserved across subjects. In contrast, more differentiated B-cells are more personalized in terms of paratope structure usage. Our results establish the paratope structure differences in BCR repertoires and have applications for many fields including immunodiagnostics, phage display library generation, and humanness assessment of BCR repertoires from transgenic animals.
- Thera-SAbDab: the Therapeutic Structural Antibody Database. Matthew I. J. Raybould, Claire Marks, Alan P. Lewis, and 4 more authors. Nucleic Acids Research, Nov 2020
The Therapeutic Structural Antibody Database (Thera-SAbDab; http://opig.stats.ox.ac.uk/webapps/therasabdab) tracks all antibody- and nanobody-related therapeutics recognised by the World Health Organisation (WHO), and identifies any corresponding structures in the Structural Antibody Database (SAbDab) with near-exact or exact variable domain sequence matches. Thera-SAbDab is synchronised with SAbDab to update weekly, reflecting new Protein Data Bank entries and the availability of new sequence data published by the WHO. Each therapeutic summary page lists structural coverage (with links to the appropriate SAbDab entries), alignments showing where any near-matches deviate in sequence, and accompanying metadata, such as intended target and investigated conditions. Thera-SAbDab can be queried by therapeutic name, by a combination of metadata, or by variable domain sequence, returning all therapeutics that are within a specified sequence identity over a specified region of the query. The sequences of all therapeutics listed in Thera-SAbDab (461 unique molecules, as of 18th July 2019) are downloadable as a single file with accompanying metadata.
- Functional module detection through integration of single-cell RNA sequencing data with protein-protein interaction networks. Florian Klimm, Enrique M. Toledo, T. Monfeuga, and 3 more authors. BMC Bioinformatics, Nov 2020
Recent advances in single-cell RNA sequencing (scRNA-seq) have allowed researchers to explore transcriptional function at a cellular level. In this study, we present scPPIN, a method for integrating single-cell RNA sequencing data with protein-protein interaction networks (PPINs) to detect active modules in cells of different transcriptional states. We achieve this by clustering RNA-sequencing data, identifying differentially expressed genes, constructing node-weighted protein-protein interaction networks, and finding the maximum-weight connected subgraphs with an exact Steiner-tree approach. As a case study, we investigate RNA-sequencing data from human liver spheroids, but the techniques described here are applicable to other organisms and tissues. scPPIN allows us to expand the output of differentially expressed gene analysis with information from protein interactions. We find that different transcriptional states have different significantly enriched subnetworks of the PPIN, which represent biological pathways. In these pathways, scPPIN also identifies proteins that are not differentially expressed but have a crucial biological function (e.g., as receptors) and therefore reveals biology beyond a standard differentially expressed gene analysis.
- Maturation of Naïve and Antigen-experienced B-cell Receptor Repertoires with Age. Marie Ghraichy, Jacob D. Galson, Aleksandr Kovaltsuk, and 6 more authors. Frontiers in Immunology, Nov 2020
B cells play a central role in adaptive immune processes, mainly through the production of antibodies. The maturation of the B-cell system through continuous antigen exposure with age is poorly studied. We extensively investigated naïve and antigen-experienced B-cell receptor (BCR) repertoires in individuals aged 6 months to 50 years. Most dynamics were observed in the first 10 years of life characterized by an increase in frequencies of mutated transcripts through positive selection, increased usage of downstream constant region genes and a decrease in the frequency of transcripts with self-reactive properties. Structural analysis revealed that the frequency of antibodies different from germline in shape increased with age. Our results suggest large and broad changes of BCR repertoires through childhood and stress the importance of using well-selected, age-appropriate controls in BCR studies.
2019
- MHC binding affects the dynamics of different T-cell receptors in different ways. Bernhard Knapp, P. Anton van der Merwe, Omer Dushek, and 1 more author. PLoS Computational Biology, Nov 2019
T cells use their T-cell receptors (TCRs) to scan other cells for antigenic peptides presented by MHC molecules (pMHC). If a TCR encounters a pMHC, it can trigger a signalling pathway that could lead to the activation of the T cell and the initiation of an immune response. It is currently not clear how the binding of pMHC to the TCR initiates signalling within the T cell. One hypothesis is that conformational changes in the TCR lead to further downstream signalling. Here we investigate four different TCRs in their free state as well as in their pMHC bound state using large scale molecular simulations totalling 26 000 ns. We find that the dynamical features within TCRs differ significantly between unbound TCR and TCR/pMHC simulations. However, apart from expected results such as reduced solvent accessibility and flexibility of the interface residues, these features are not conserved among different TCR types. The presence of a pMHC alone is not sufficient to cause cross-TCR-conserved dynamical features within a TCR. Our results argue against models of TCR triggering involving conserved allosteric conformational changes.
- Comparative analysis of the CDR loops of antigen receptors. Wing Ki Wong, Jinwoo Leem, and Charlotte M. Deane. Frontiers in Immunology, Nov 2019
The adaptive immune system uses two main types of antigen receptors: T-cell receptors (TCRs) and antibodies. While both proteins share a globally similar β-sandwich architecture, TCRs are specialised to recognise peptide antigens in the binding groove of the major histocompatibility complex, while antibodies can bind an almost infinite range of molecules. For both proteins, the main determinants of target recognition are the complementarity-determining region (CDR) loops. Five of the six CDRs adopt a limited number of backbone conformations, known as the 'canonical classes'; the remaining CDR (β3 in TCRs and H3 in antibodies) is more structurally diverse. In this paper, we first update the definition of canonical forms in TCRs, build an auto-updating sequence-based prediction tool (available at http://opig.stats.ox.ac.uk/resources) and demonstrate its application on large scale sequencing studies. Given the global similarity of TCRs and antibodies, we then examine the structural similarity of their CDRs. We find that TCR and antibody CDRs tend to have different length distributions, and where they have similar lengths, they mostly occupy distinct structural spaces. In the rare cases where we found structural similarity, the underlying sequence patterns for the TCR and antibody version are different. Finally, where multiple structures have been solved for the same CDR sequence, the structural variability in TCR loops is higher than that in antibodies, suggesting TCR CDRs are more flexible. These structural differences between TCR and antibody CDRs may be important to their different biological functions.
- The evolution of contact prediction: Evidence that contact selection in statistical contact prediction is changing. Mark Chonofsky, Saulo H. P. Oliveira, Konrad Krawczyk, and 1 more author. Bioinformatics, Nov 2019
Over the last few years, the field of protein structure prediction has been transformed by increasingly accurate contact prediction software. These methods are based on the detection of coevolutionary relationships between residues from multiple sequence alignments. However, despite speculation, there is little evidence of a link between contact prediction and the physico-chemical interactions which drive amino-acid coevolution. Furthermore, existing protocols predict only a fraction of all protein contacts and it is not clear why some contacts are favoured over others. Using a dataset of 863 protein domains, we assessed the physico-chemical interactions of contacts predicted by CCMpred, MetaPSICOV, and DNCON2, as examples of direct coupling analysis, meta-prediction, and deep learning, respectively. To further investigate what sets these predicted contacts apart, we considered correctly predicted contacts and compared their properties against the protein contacts that were not predicted. We found that predicted contacts tend to form more bonds than non-predicted contacts, which suggests these contacts may be more important. Comparing the contacts predicted by each method, we found that MetaPSICOV and DNCON2 favour accuracy whereas CCMpred detects contacts with more bonds. This suggests that the push for higher accuracy may lead to a loss of physico-chemically important contacts. These results underscore the connection between protein physico-chemistry and the coevolutionary couplings that can be derived from multiple sequence alignments. This relationship is likely to be relevant to protein structure prediction and functional analysis of protein structure and may be key to understanding their utility for different problems in structural biology.
- RFQAmodel: Random Forest Quality Assessment to identify a predicted protein structure in the correct fold. Claire E. West, Saulo H. P. Oliveira, and Charlotte M. Deane. PLoS One, Nov 2019
While template-free protein structure prediction protocols now produce good quality models for many targets, modelling failure remains common. For these methods to be useful it is important that users can both choose the best model from the hundreds to thousands of models that are commonly generated for a target, and determine whether this model is likely to be correct. We have developed Random Forest Quality Assessment (RFQAmodel), which assesses whether models produced by a protein structure prediction pipeline have the correct fold. RFQAmodel uses a combination of existing quality assessment scores with two predicted contact map alignment scores. These alignment scores are able to identify correct models for targets that are not otherwise captured. Our classifier was trained on a large set of protein domains that are structurally diverse and evenly balanced in terms of protein features known to have an effect on modelling success, and then tested on a second set of 244 protein domains with a similar spread of properties. When models for each target in this second set were ranked according to the RFQAmodel score, the highest-ranking model had a high-confidence RFQAmodel score for 67 modelling targets, of which 52 had the correct fold. At the other end of the scale RFQAmodel correctly predicted that for 59 targets the highest-ranked model was incorrect. In comparisons to other methods we found that RFQAmodel is better able to identify correct models for targets where only a few of the models are correct. We found that RFQAmodel achieved a similar performance on the model sets for CASP12 and CASP13 free-modelling targets. Finally, by iteratively generating models and running RFQAmodel until a model is produced that is predicted to be correct with high confidence, we demonstrate how such a protocol can be used to focus computational efforts on difficult modelling targets.
- Measuring rank robustness in scored protein interaction networks. Lyuba V Bozhilova, Alan V Whitmore, Jonny Wray, and 2 more authors. BMC Bioinformatics, Nov 2019
Background: Protein interaction databases often provide confidence scores for each recorded interaction based on the available experimental evidence. Protein interaction networks (PINs) are then built by thresholding on these scores, so that only interactions of sufficiently high quality are included. These networks are used to identify biologically relevant motifs or nodes using metrics such as degree or betweenness centrality. This type of analysis can be sensitive to the choice of threshold. If a node metric is to be useful for extracting biological signal, it should induce similar node rankings across PINs obtained at different reasonable confidence score thresholds. Results: We propose three measures—rank continuity, identifiability, and instability—to evaluate how robust a node metric is to changes in the score threshold. We apply our measures to twenty-five metrics and identify four as the most robust: the number of edges in the step-1 ego network, as well as the leave-one-out differences in average redundancy, average number of edges in the step-1 ego network, and natural connectivity. Our measures show good agreement across PINs from different species and data sources. Analysis of synthetically generated scored networks shows that robustness results are context-specific, and depend both on network topology and on how scores are placed across network edges. Conclusion: Due to the uncertainty associated with protein interaction detection, and therefore network structure, for PIN analysis to be reproducible, it should yield similar results across different confidence score thresholds. We demonstrate that while certain node metrics are robust with respect to threshold choice, this is not always the case. Promisingly, our results suggest that there are some metrics that are robust across networks constructed from different databases, and different scoring procedures.
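The threshold sensitivity described in this abstract can be illustrated with a toy scored network. The sketch below thresholds the edges at two confidence scores, ranks nodes by degree, and reports how much the top of the ranking overlaps. This overlap is a deliberately simple stand-in and is not one of the paper's three measures (rank continuity, identifiability, instability); the edge list and function names are hypothetical.

```python
def degree_ranking(scored_edges, threshold):
    """Nodes ranked by degree in the network kept at a score threshold."""
    degree = {}
    for u, v, score in scored_edges:
        if score >= threshold:
            degree[u] = degree.get(u, 0) + 1
            degree[v] = degree.get(v, 0) + 1
    # Highest degree first; ties are broken alphabetically for determinism.
    return sorted(degree, key=lambda node: (-degree[node], node))


def top_k_overlap(scored_edges, threshold_a, threshold_b, k):
    """Fraction of the top-k nodes shared between two thresholds."""
    top_a = set(degree_ranking(scored_edges, threshold_a)[:k])
    top_b = set(degree_ranking(scored_edges, threshold_b)[:k])
    return len(top_a & top_b) / k


# Toy interactions as (protein, protein, confidence score).
toy_edges = [
    ("A", "B", 0.90),
    ("A", "C", 0.80),
    ("A", "D", 0.70),
    ("B", "C", 0.50),
    ("C", "D", 0.40),
    ("D", "E", 0.95),
]

print(degree_ranking(toy_edges, 0.4))
print(degree_ranking(toy_edges, 0.7))
print(top_k_overlap(toy_edges, 0.4, 0.7, k=2))
```

A metric that is robust in the abstract's sense would keep this kind of agreement high across all reasonable thresholds, whereas here only half of the top two nodes survive the change.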
- Learning From The Ligand: Using Ligand-Based Features To Improve Binding Affinity PredictionFergus Boyles, Charlotte M. Deane, and Garrett M. MorrisBioinformatics, Nov 2019
Machine learning scoring functions for protein-ligand binding affinity prediction have been found to consistently outperform classical scoring functions. Structure-based scoring functions for universal affinity prediction typically use features describing interactions derived from the protein-ligand complex, with limited information about the chemical or topological properties of the ligand itself. We demonstrate that the performance of machine learning scoring functions is consistently improved by the inclusion of diverse ligand-based features. For example, a Random Forest combining the features of RF-Score v3 with RDKit molecular descriptors achieved Pearson correlation coefficients of up to 0.831, 0.785, and 0.821 on the PDBbind 2007, 2013, and 2016 core sets respectively, compared to 0.790, 0.737, and 0.797 when using the features of RF-Score v3 alone. Excluding proteins and/or ligands that are similar to those in the test sets from the training set has a significant effect on scoring function performance, but does not remove the predictive power of ligand-based features. Furthermore, a Random Forest using only ligand-based features is predictive at a level similar to classical scoring functions, and it appears to be predicting the mean binding affinity of a ligand for its protein targets.
- Modeling conformational flexibility of kinases in inactive statesDominik Schwarz, Benjamin Merget, Charlotte M. Deane, and 1 more authorProteins, Nov 2019
Kinase structures in the inactive “DFG‐out” state provide a wealth of druggable binding site variants. The conformational plasticity of this state can be mainly described by different conformations of binding site‐forming elements such as DFG motif, A‐loop, P‐loop, and αC‐helix. Compared to DFG‐in structures, DFG‐out structures are largely underrepresented in the Protein Data Bank (PDB). Thus, structure‐based drug design efforts for DFG‐out inhibitors may benefit from an efficient approach to generate an ensemble of DFG‐out structures. Accordingly, the presented modeling pipeline systematically generates homology models of kinases in several DFG‐out conformations based on a sophisticated creation of template structures that represent the major states of the flexible structural elements. Eighteen template classes were initially selected from all available kinase structures in the PDB and subsequently employed for modeling the entire kinome in different DFG‐out variants by fusing individual structural elements to multiple chimeric template structures. Molecular dynamics simulations revealed that conformational transitions between the different DFG‐out states generally do not occur within trajectories of a few hundred nanoseconds length. This underlines the benefits of the presented homology modeling pipeline to generate relevant conformations of “DFG‐out” kinase structures for subsequent in silico screening or binding site analysis studies.
- Looking for Therapeutic Antibodies in Next Generation Sequencing RepositoriesKonrad Krawczyk, Matthew I. J. Raybould, Aleksandr Kovaltsuk, and 1 more authorMAbs, Nov 2019
Recently it has become possible to query the great diversity of natural antibody repertoires using Next Generation Sequencing (NGS). These methods are capable of producing millions of sequences in a single experiment. Here we compare Clinical Stage Therapeutic antibodies to the ~1 billion sequences from 60 independent sequencing studies in the Observed Antibody Space Database. Of the 242 post Phase I antibodies, we find 16 with sequence identity matches of 95% or better for both heavy and light chains. There are also 54 perfect matches to therapeutic CDR-H3 regions in the NGS outputs, suggesting a nontrivial amount of convergence between naturally observed sequences and those developed artificially. This has potential implications for both the discovery of antibody therapeutics and the legal protection of commercial antibodies.
- Antibody-antigen Complex Modelling in the Era of Immunoglobulin Repertoire SequencingMatthew I. J. Raybould, Wing Ki Wong, and Charlotte M. DeaneMolecular Systems Design & Engineering, Nov 2019
The natural immune repertoire can be a useful guide to antibody discovery against any given target. However, the large volume of immunoglobulin gene sequencing data necessitates the rational prioritisation of possible binders for experimental validation. Where other known binders exist, sequence similarity is used to infer binding, but this neglects alternative binding modes to the same epitope, and cannot identify antibodies that bind to different epitopes. In this review, we summarise the state-of-the-art of high-throughput antibody-antigen complex modelling. Given the millions of natural antibody sequences now available, this pipeline attempts to predict whether, and if so how, each antibody binds to a particular antigen’s surface. We cover the current paradigm (antibody and antigen structural modelling, followed by binding site prediction, followed by molecular docking), discussing how existing algorithms can deal with this magnitude of data by balancing accuracy with computational efficiency, and identifying areas where further developments are required to improve performance.
- HLA-DM stabilises the empty MHCII binding groove: A model using customised Natural Move Monte CarloSam Demharter, Bernhard Knapp, Charlotte M. Deane, and 1 more authorJournal of Chemical Information and Modeling, Nov 2019
MHC class II molecules bind peptides derived from extracellular proteins that have been ingested by antigen-presenting cells and display them to the immune system. Peptide loading occurs within the antigen-presenting cell and is facilitated by HLA-DM. HLA-DM stabilizes the open conformation of the MHCII binding groove when no peptide is bound. While a structure of the MHCII/HLA-DM complex exists, the mechanism of stabilization is still largely unknown. Here, we applied customized Natural Move Monte Carlo to investigate this interaction. We found a possible long-range mechanism that implicates the configuration of the membrane-proximal globular domains in stabilizing the open state of the empty MHCII binding groove.
- Ligity: A Non-Superpositional, Knowledge-Based Approach to Virtual ScreeningJean-Paul Ebejer, Paul W. Finn, Wing Ki Wong, and 2 more authorsJournal of Chemical Information and Modeling, Nov 2019
We present Ligity: a hybrid ligand-structure-based, non-superpositional method for virtual screening of large databases of small molecules. Ligity uses the relative spatial distribution of Pharmacophoric Interaction Points (PIPs) derived from the conformations of small molecules. These are compared with the PIPs derived from key interaction features found in protein-ligand complexes, and are used to prioritize likely binders. We investigated the effect of: generating PIPs using a single lowest energy conformer versus an ensemble of conformers for each screened ligand; different bin sizes for the distance between two features; the use of triangular sets of pharmacophoric features (3-PIPs) versus chiral tetrahedral sets (4-PIPs); data fusion for targets with multiple protein-ligand complex structures; and different similarity measures. Ligity was benchmarked using the Directory of Useful Decoys Enhanced (DUD-E). Optimal results were obtained using the tetrahedral PIPs derived from an ensemble of bound ligand conformers and a bin size of 1.5 Å, which are used as the default settings for Ligity. The high-throughput screening (HTS) mode of Ligity, using only the lowest-energy conformer of each ligand, was used for benchmarking against the whole of DUD-E, and a more resource-intensive, ‘information-rich’ mode of Ligity, using a conformational ensemble of each ligand, was used for a representative subset of ten targets. Against the full DUD-E database, mean area under the receiver operating characteristic curve (AUC) values ranged from 0.44 to 0.99, while for the representative subset, they ranged from 0.61 to 0.86. Data fusion further improved Ligity’s performance, with mean AUC values ranging from 0.64 to 0.95. Ligity is very efficient compared to a protein-ligand docking method such as AutoDock Vina: if the time taken for the pre-calculation of Ligity descriptors is included in the comparison then Ligity is about 20 times faster than docking. A direct comparison of the virtual screening steps shows Ligity to be over 5,000 times faster. Ligity ranks highly the lowest-energy conformers of DUD-E actives, in a statistically significant manner, behaviour that is not observed for DUD-E decoys. Thus, our results suggest that active compounds tend to bind in relatively low-energy conformations compared to decoys. This may be because actives (and thus their lowest-energy conformations) have been optimized for conformational complementarity with their cognate binding sites.
- Bayesian Optimization for Conformer GenerationLucian Chan, Geoffrey R. Hutchison, and Garrett M. MorrisJournal of Cheminformatics, Nov 2019
Generating low-energy molecular conformers is a key task for many areas of computational chemistry, molecular modeling and cheminformatics. Most current conformer generation methods primarily focus on generating geometrically diverse conformers rather than finding the most probable or energetically lowest minima. Here, we present a new stochastic search method using Bayesian Optimization Algorithm (BOA) for finding the lowest energy conformation of a given molecule. We compare BOA with uniform random search, and systematic search as implemented in Confab, to determine which method finds the lowest energy. Energetic difference, root-mean-square deviation (RMSD), and torsion fingerprint deviation (TFD) are used to quantify differences between the conformer search algorithms. In general, we find BOA requires far fewer evaluations than systematic or uniform random search to find low-energy minima. For molecules with four or more rotatable bonds, Confab typically evaluates 10⁴ (median) conformers in its search, while BOA only requires 10² energy evaluations to find top candidates. Despite evaluating fewer conformers, for many molecules, BOA finds lower-energy conformations than an exhaustive systematic Confab search.
- Five Computational Developability Guidelines for Therapeutic Antibody ProfilingMatthew I. J. Raybould, Claire Marks, Konrad Krawczyk, and 6 more authorsProceedings of the National Academy of Sciences USA, Nov 2019
Therapeutic mAbs must not only bind to their target but must also be free from “developability issues” such as poor stability or high levels of aggregation. While small-molecule drug discovery benefits from Lipinski’s rule of five to guide the selection of molecules with appropriate biophysical properties, there is currently no in silico analog for antibody design. Here, we model the variable domain structures of a large set of post-phase-I clinical-stage antibody therapeutics (CSTs) and calculate in silico metrics to estimate their typical properties. In each case, we contextualize the CST distribution against a snapshot of the human antibody gene repertoire. We describe guideline values for five metrics thought to be implicated in poor developability: the total length of the complementarity-determining regions (CDRs), the extent and magnitude of surface hydrophobicity, positive charge and negative charge in the CDRs, and asymmetry in the net heavy- and light-chain surface charges. The guideline cutoffs for each property were derived from the values seen in CSTs, and a flagging system is proposed to identify nonconforming candidates. On two mAb drug discovery sets, we were able to selectively highlight sequences with developability issues. We make available the Therapeutic Antibody Profiler (TAP), a computational tool that builds downloadable homology models of variable domain sequences, tests them against our five developability guidelines, and reports potential sequence liabilities and canonical forms. TAP is freely available at opig.stats.ox.ac.uk/webapps/sabdab-sabpred/TAP.php.
- Increasing the accuracy of protein loop structure prediction with evolutionary constraintsClaire Marks and Charlotte M. DeaneBioinformatics, Nov 2019
Motivation. Accurate prediction of loop structures remains challenging. This is especially true for long loops, where the large conformational space and limited coverage of experimentally-determined structures often lead to low accuracy. Co-evolutionary contact predictors, which provide information about the proximity of pairs of residues, have been used to improve whole-protein models generated through de novo techniques. Here we investigate whether these evolutionary constraints can enhance the prediction of long loop structures. Results. As a first stage, we assess the accuracy of predicted contacts that involve loop regions. We find that these are less accurate than contacts in general. We also observe that some incorrectly-predicted contacts can be identified as they are never satisfied in any of our generated loop conformations. We examined two different strategies for incorporating contacts, and on a test set of long loops (ten residues or more), both approaches improve the accuracy of prediction. For a set of 135 loops, contacts were predicted and hence our methods were applicable in 97 cases. Both strategies result in an increase in the proportion of near-native decoys in the ensemble, leading to more accurate predictions and in some cases improving the RMSD of the final model by more than 3 Å.
- Assessment of model fit via network comparison methods based on subgraph countsLuis Ospina-Forero, Charlotte M. Deane, and Gesine ReinertJournal of Complex Networks, Nov 2019
While the number of network comparison methods is increasing, benchmarking of these methods is still in its infancy. The lack of understanding of complex dependencies among network characteristics makes it difficult to fully understand the meaning of the different network comparison methodologies and the relations between them. In this article, we use a Monte Carlo framework as a way to address three general questions about the network comparison methods based on subgraph counts: (1) Can the methods differentiate between networks generated from different network generation mechanisms? (2) Are the number of nodes or average degree confounding factors for the comparison of networks? (3) Do all methods reach the same conclusions? We further use the Monte Carlo framework to test the fit of ER, Chung-Lu and a duplication–divergence model to the protein–protein interaction (PPI) networks of Yeast, Fly, Worm, Human, Escherichia coli, five herpes virus networks and five social networks. In contrast to previous claims in the literature, we show that the large PPI networks are not well modelled by the Chung-Lu model according to any of our tested methods. We find that network comparison statistics are not completely invariant to changes in the number of nodes and edges. Some methods, such as graphlet correlation distance, focus on fine-grained similarities, while other methods, such as Netdis, can capture the similarities of networks despite them having different numbers of nodes and edges.
2018
- Filtering Next-Generation Sequencing of the Ig Gene Repertoire Data Using Antibody Structural InformationAleksandr Kovaltsuk, Konrad Krawczyk, Sebastian Kelm, and 2 more authorsJournal of Immunology, Nov 2018
Next-generation sequencing of the Ig gene repertoire (Ig-seq) produces large volumes of information at the nucleotide sequence level. Such data have improved our understanding of immune systems across numerous species and have already been successfully applied in vaccine development and drug discovery. However, the high-throughput nature of Ig-seq means that it is afflicted by high error rates. This has led to the development of error-correction approaches. Computational error-correction methods use sequence information alone, primarily designating sequences as likely to be correct if they are observed frequently. In this work, we describe an orthogonal method for filtering Ig-seq data, which considers the structural viability of each sequence. A typical natural Ab structure requires the presence of a disulfide bridge within each of its variable chains to maintain the fold. Our Ab Sequence Selector (ABOSS) uses the presence/absence of this bridge as a way of both identifying structurally viable sequences and estimating the sequencing error rate. On simulated Ig-seq datasets, ABOSS is able to identify more than 99% of structurally viable sequences. Applying our method to six independent Ig-seq datasets (one mouse and five human), we show that our error calculations are in line with previous experimental and computational error estimates. We also show how ABOSS is able to identify structurally impossible sequences missed by other error-correction methods.
- Avoiding false positive conclusions in molecular simulation: the importance of replicasBernhard Knapp, Luis Ospina, and Charlotte M. DeaneJournal of Chemical Theory and Computation, Nov 2018
Molecular simulations are a computational technique used to investigate the dynamics of proteins and other molecules. The solution landscape of these simulations is often rugged and minor differences in the initial seed, floating point precision, or underlying hardware can cause identical simulations (replicas) to take different paths in the landscape. In this study we investigated the magnitude of these effects based on 310 000 ns of simulation time. We performed 100 identically parameterised replicas of 3000 ns each for a small 10 amino acid system as well as 100 identically parameterised replicas of 100 ns each for an 827 residue T-cell receptor / MHC system. Comparing randomly chosen subgroups within these replica sets we estimate the reproducibility and reliability that can be achieved by a given number of replicas at a given simulation time. These results demonstrate that conclusions drawn from single simulations are often not reproducible and that conclusions drawn from multiple shorter replicas are more reliable than those from a single longer simulation. The actual number of replicas needed will always depend on the question asked and the level of reliability sought. Based on our data it appears a good rule of thumb is to perform a minimum of five to ten replicas.
- SCALOP: sequence-based antibody canonical loop structure annotationWing K. Wong, Guy Georges, Francesca Ros, and 5 more authorsBioinformatics, Nov 2018
- Protein Family-specific Models using Deep Neural Networks and Transfer Learning Improve Virtual Screening and Highlight the Need for More DataFergus Imrie, Anthony R. Bradley, Mihaela Schaar, and 1 more authorJournal of Chemical Information and Modeling, Nov 2018
- Observed Antibody Space: a resource for data mining next generation sequencing of antibody repertoiresAleksandr Kovaltsuk, Jinwoo Leem, Sebastian Kelm, and 3 more authorsJournal of Immunology, Nov 2018
Antibodies are immune system proteins that recognize noxious molecules for elimination. Their sequence diversity and binding versatility have made antibodies the primary class of biopharmaceuticals. Recently it has become possible to query their immense natural diversity using next-generation sequencing of immunoglobulin gene repertoires (Ig-seq). However, Ig-seq outputs are currently fragmented across repositories and tend to be presented as raw nucleotide reads, which means nontrivial effort is required to reuse the data for analysis. To address this issue, we have collected Ig-seq outputs from 53 studies, covering more than half a billion antibody sequences across diverse immune states, organisms and individuals. We have sorted, cleaned, annotated, translated and numbered these sequences and make the data available via our Observed Antibody Space (OAS) resource at antibodymap.org. The data within OAS will be regularly updated with newly released Ig-seq datasets. We believe OAS will facilitate data mining of immune repertoires for improved understanding of the immune system and development of better biotherapeutics.
- Structurally Mapping Antibody RepertoiresKonrad Krawczyk, Sebastian Kelm, Aleksandr Kovaltsuk, and 13 more authorsFrontiers in Immunology, Nov 2018
Every human possesses millions of distinct antibodies. It is now possible to analyze this diversity via Next Generation Sequencing of Immunoglobulin Genes (Ig-seq). This technique produces large-volume sequence snapshots of B-cell receptors that are indicative of the antibody repertoire. In this paper we enrich these large-scale sequence datasets with structural information. Enriching a sequence with its structural data allows better approximation of many vital features, such as its binding site and specificity. Here, we describe the Structural Annotation of Antibodies (SAAB) pipeline that maps the outputs of large Ig-seq experiments to known antibody structures. We demonstrate the viability of our protocol on five separate Ig-seq datasets covering ca. 35m unique amino acid sequences from ca. 600 individuals. Despite the great theoretical diversity of antibodies, we find that the majority of sequences coming from such studies can be reliably mapped to an existing structure.
- Combining co-evolution and secondary structure prediction to improve fragment library generationSaulo Henrique Pires Oliveira and Charlotte M. DeaneBioinformatics, Nov 2018
Motivation: Recent advances in co-evolution techniques have made possible the accurate prediction of protein structures in the absence of a template. Here, we provide a general approach that further utilizes co-evolution constraints to generate better fragment libraries for fragment-based protein structure prediction. Results: We have compared five different fragment library generation programmes on three different data sets encompassing over 400 unique protein folds. We show that considering the secondary structure of the fragments when assembling these libraries provides a critical way to assess their usefulness to structure prediction. We then use co-evolution constraints to improve the fragment libraries by enriching them with fragments that satisfy constraints and discarding those that do not. These improved libraries have better precision and lead to consistently better modelling results. Availability: Data is available for download from: http://opig.stats.ox.ac.uk/resources. Flib-Coevo is available for download from: https://github.com/sauloho/Flib-Coevo Contact: saulo.deoliveira@dtc.ox.ac.uk Supplementary information: Supplementary data are available at Bioinformatics online.
- A statistical model for helices with applicationsKanti V. Mardia, Karthik Sriram, and Charlotte M. DeaneBiometrics, Nov 2018
Motivated by a cutting edge problem related to the shape of alpha-helices in proteins, we formulate a parametric statistical model, which incorporates the cylindrical nature of the helix. Our focus is to detect a “kink,” which is a drastic change in the axial direction of the helix. We propose a statistical model for the straight alpha-helix and derive the maximum likelihood estimation procedure. The cylinder is an accepted geometric model for alpha-helices, but our statistical formulation, for the first time, quantifies the uncertainty in atom positions around the cylinder. We propose a change point technique “Kink-Detector” to detect a kink location along the helix. Unlike classical change point problems, the change in direction of a helix depends on a simultaneous shift of multiple data points rather than a single data point, and is less straightforward. Our biological building block is crowdsourced data on straight and kinked helices; which has set a gold standard. We use this data to identify salient features to construct Kink-Detector, test its performance and gain some insights. We find the performance of Kink-Detector comparable to its computational competitor called “Kink-Finder.” We highlight that identification of kinks by visual assessment can have limitations and Kink-Detector may help in such cases. Further, an analysis of crowdsourced curved alpha-helices finds that Kink-Detector is also effective in detecting moderate changes in axial directions.
- Antibody side chain conformations are position-dependentJinwoo Leem, Guy Georges, Jiye Shi, and 1 more authorProteins: Structure, Function, and Bioinformatics, Nov 2018
Side chain prediction is an integral component of computational antibody design and structure prediction. Current antibody modelling tools use backbone-dependent rotamer libraries with conformations taken from general proteins. Here we present our antibody-specific rotamer library, where rotamers are binned according to their IMGT position, rather than their local backbone geometry. We find that for some amino acid types at certain positions, only a restricted number of side chain conformations are ever observed. Using this information, we are able to reduce the breadth of the rotamer sampling space. Based on our rotamer library, we built a side chain predictor, PEARS. On a blind test set of 95 antibody model structures, PEARS had the highest average χ1 and χ1+2 accuracy (78.7% and 64.8%) compared to three leading backbone-dependent side chain predictors. Our use of IMGT position, rather than backbone ϕ/ψ, meant that PEARS was more robust to errors in the backbone of the model structure. PEARS also achieved the lowest number of side chain–side chain clashes. PEARS is freely available as a web application at http://opig.stats.ox.ac.uk/webapps/pears.
- pyHVis3D: Visualising Molecular Simulation deduced H-bond networks in 3D: Application to T-cell receptor interactionsBernhard Knapp, Marta Alcala, Hao Zhang, and 3 more authorsBioinformatics, Nov 2018
- Identifying networks with common organizational principlesAnatol E. Wegner, Luis Ospina-Forero, Robert E. Gaunt, and 2 more authorsJournal of Complex Networks, Nov 2018
Many complex systems can be represented as networks, and the problem of network comparison is becoming increasingly relevant. There are many techniques for network comparison, from simply comparing network summary statistics to sophisticated but computationally costly alignment-based approaches. Yet it remains challenging to accurately cluster networks that are of a different size and density, but hypothesized to be structurally similar. In this article, we address this problem by introducing a new network comparison methodology that is aimed at identifying common organizational principles in networks. The methodology is simple, intuitive and applicable in a wide variety of settings ranging from the functional classification of proteins to tracking the evolution of a world trade network.
- Predicting loop conformational ensemblesClaire Marks, Jiye Shi, and Charlotte M DeaneBioinformatics, Nov 2018
Motivation. Protein function is often facilitated by the existence of multiple stable conformations. Structure prediction algorithms need to be able to model these different conformations accurately and produce an ensemble of structures that represent a target’s conformational diversity rather than just a single state. Here, we investigate whether current loop prediction algorithms are capable of this. We use the algorithms to predict the structures of loops with multiple experimentally determined conformations, and the structures of loops with only one conformation, and assess their ability to generate and select decoys that are close to any, or all, of the observed structures. Results. We find that while loops with only one known conformation are predicted well, conformationally diverse loops are modelled poorly, and in most cases the predictions returned by the methods do not resemble any of the known conformers. Our results contradict the often-held assumption that multiple native conformations will be present in the decoy set, making the production of accurate conformational ensembles impossible, and hence indicating that current methodologies are not well suited to prediction of conformationally diverse, often functionally important protein regions. Supplementary data are available at Bioinformatics online.
- Sequential search leads to faster, more efficient fragment-based de novo protein structure predictionSaulo H P de Oliveira, Eleanor C Law, Jiye Shi, and 1 more authorBioinformatics, Nov 2018
Motivation: Most current de novo structure prediction methods randomly sample protein conformations and thus require large amounts of computational resource. Here, we consider a sequential sampling strategy, building on ideas from recent experimental work which shows that many proteins fold cotranslationally. Results: We have investigated whether a pseudo-greedy search approach, which begins sequentially from one of the termini, can improve the performance and accuracy of de novo protein structure prediction. We observed that our sequential approach converges when fewer than 20,000 decoys have been produced, fewer than commonly expected. Using our software, SAINT2, we also compared the run time and quality of models produced in a sequential fashion against a standard, non-sequential approach. Sequential prediction produces an individual decoy 1.5 to 2.5 times faster than non-sequential prediction. When considering the quality of the best model, sequential prediction led to a better model being produced for 31 out of 41 soluble protein validation cases and for 18 out of 24 transmembrane protein cases. Correct models (TM-Score > 0.5) were produced for 29 of these cases by the sequential mode and for only 22 by the non-sequential mode. Our comparison reveals that a sequential search strategy can be used to drastically reduce computational time of de novo protein structure prediction and improve accuracy. Availability: Data is available for download from:
- In silico structural modeling of multiple epigenetic marks on DNAKonrad Krawczyk, Samuel Demharter, Bernhard Knapp, and 2 more authorsBioinformatics, Nov 2018
2017
- How B-Cell Receptor Repertoire Sequencing Can Be Enriched with Structural Antibody DataAleksandr Kovaltsuk, Konrad Krawczyk, Jacob D. Galson, and 3 more authorsFrontiers in Immunology, Nov 2017
Next-generation sequencing of immunoglobulin gene repertoires (Ig-seq) allows the investigation of large-scale antibody dynamics at a sequence level. However, structural information, a crucial descriptor of antibody binding capability, is not collected in Ig-seq protocols. Developing systematic relationships between the antibody sequence information gathered from Ig-seq and low-throughput techniques such as X-ray crystallography could radically improve our understanding of antibodies. The mapping of Ig-seq datasets to known antibody structures can indicate structurally, and perhaps functionally, uncharted areas. Furthermore, contrasting naïve and antigenically challenged datasets using structural antibody descriptors should provide insights into antibody maturation. As the number of antibody structures steadily increases and more and more Ig-seq datasets become available, the opportunities that arise from combining the two types of information increase as well. Here we review how these data types enrich one another and show potential for advancing our knowledge of the immune system and improving antibody engineering.
- CommWalker: Correctly Evaluating Modules in Molecular Networks in Light of Annotation BiasM D Luecken, M J T Page, A J Crosby, and 3 more authorsBioinformatics, Nov 2017
- STCRDab: the structural T-cell receptor databaseJinwoo Leem, Saulo H.P. de Oliveira, Konrad Krawczyk, and 1 more authorNucleic Acids Research, Nov 2017
- A multi-crystal method for extracting obscured crystallographic states from conventionally uninterpretable electron densityNicholas M. Pearce, Tobias Krojer, Anthony R. Bradley, and 8 more authorsNature Communications, Apr 2017
Building a ligand into a weak region of an electron density map of a protein is a subjective process. Here, the authors present a new method to obtain a clear electron density for a bound ligand based on multi-crystal experiments and 3D background correction.
- Association between a common immunoglobulin heavy chain allele and rheumatic heart disease risk in OceaniaTom Parks, Mariana M. Mirabel, Joseph Kado, and 15 more authorsNature Communications, Apr 2017
The indigenous populations of the South Pacific experience a high burden of rheumatic heart disease (RHD). Here we report a genome-wide association study (GWAS) of RHD susceptibility in 2,852 individuals recruited in eight Oceanian countries. Stratifying by ancestry, we analysed genotyped and imputed variants in Melanesians (607 cases and 1,229 controls) before follow-up of suggestive loci in three further ancestral groups: Polynesians, South Asians and Mixed or other populations (totalling 399 cases and 617 controls). We identify a novel susceptibility signal in the immunoglobulin heavy chain (IGH) locus centring on a haplotype of nonsynonymous variants in the IGHV4-61 gene segment corresponding to the IGHV4-61*02 allele. We show each copy of IGHV4-61*02 is associated with a 1.4-fold increase in the risk of RHD (odds ratio 1.43, 95% confidence interval 1.27–1.61, P = 4.1 × 10⁻⁹). These findings provide new insight into the role of germline variation in the IGH locus in disease susceptibility.
- Computational Tools for Aiding Rational Antibody DesignKonrad Krawczyk, James Dunbar, and Charlotte M. DeaneMethods in Molecular Biology, Apr 2017
Antibodies are a group of proteins responsible for mediating immune reactions in vertebrates. They are able to bind a variety of structural motifs on noxious molecules, tagging them for elimination from the organism. As a result of their versatile binding properties, antibodies are currently one of the most important classes of biopharmaceuticals. In this chapter, we discuss how knowledge-based computational methods can aid experimentalists in the development of potent antibodies. When using common experimental methods for antibody development, we often know the sequence of an antibody that binds to our molecule, antigen, of interest. We may also have a structure or model of the antigen. In these cases, computational methods can help by both modeling the antibody and identifying the antibody–antigen contact residues. This information can then play a key role in the rational design of more potent antibodies.
- Comparing co-evolution methods and their application to template-free protein structure predictionSaulo Henrique Pires Oliveira, Jiye Shi, and Charlotte M. DeaneBioinformatics, Apr 2017
Motivation. Co-evolution methods have been used as contact predictors to identify pairs of residues that share spatial proximity. Such contact predictors have been compared in terms of the precision of their predictions, but there is no study that compares their usefulness to model generation. Results. We compared eight different co-evolution methods for a set of ∼3500 proteins and found that metaPSICOV stage 2 produces, on average, the most precise predictions. Precision of all the methods is dependent on SCOP class, with most methods predicting contacts in all α and membrane proteins poorly. The contact predictions were then used to assist in de novo model generation. We found that it was not the method with the highest average precision, but rather metaPSICOV stage 1 predictions that consistently led to the best models being produced. Our modelling results show a correlation between the proportion of predicted long range contacts that are satisfied on a model and its quality. We used this proportion to effectively classify models as correct/incorrect; discarding decoys classified as incorrect led to an enrichment in the proportion of good decoys in our final ensemble by a factor of seven. For 17 out of the 18 cases where correct answers were generated, the best models were not discarded by this approach. We were also able to identify eight cases where no correct decoy had been generated.
- The H3 loop of antibodies shows unique structural characteristicsCristian Regep, Guy Georges, Jiye Shi, and 2 more authorsProteins: Structure, Function, and Bioinformatics, Jul 2017
- Variable Regions of Antibodies and T-Cell Receptors May Not Be Sufficient in Molecular Simulations Investigating BindingBernhard Knapp, James Dunbar, Marta Alcala, and 1 more authorJournal of Chemical Theory and Computation, Jul 2017
Antibodies and T-cell receptors are important proteins of the immune system that share similar structures. Both contain variable and constant regions. Insight into the dynamics of their binding can be provided by computational simulations. For these simulations the constant regions are often removed to save runtime, as binding occurs in the variable regions. Here we present the first study to investigate the effect of removing the constant regions from antibodies and T-cell receptors on such simulations. We performed simulations of an antibody/antigen and T-cell receptor/MHC system with and without constant regions using 10 replicas of 100 ns of each of the four setups. We found that simulations without constant regions show significantly different behavior compared to simulations with constant regions. If the constant regions are not included in the simulations, alterations in the binding interface hydrogen bonds and even partial unbinding can occur. These results indicate that constant regions should be i...
- Partial-occupancy binders identified by the Pan-Dataset Density Analysis method offer new chemical opportunities and reveal cryptic binding sitesNicholas M Pearce, Anthony R Bradley, Tobias Krojer, and 3 more authorsStructural Dynamics, May 2017
Crystallographic fragment screening uses low molecular weight compounds to probe the protein surface and although individual protein-fragment interactions are high quality, fragments commonly bind at low occupancy, historically making identification difficult. However, our new Pan-Dataset Density Analysis method readily identifies binders missed by conventional analysis: for fragment screening data of lysine-specific demethylase 4D (KDM4D), the hit rate increased from 0.9% to 10.6%. Previously unidentified fragments reveal multiple binding sites and demonstrate: the versatility of crystallographic fragment screening; that surprisingly large conformational changes are possible in crystals; and that low crystallographic occupancy does not by itself reflect a protein-ligand complex’s significance.
- Co-evolution techniques are reshaping the way we do structural bioinformaticsSaulo Oliveira and Charlotte DeaneF1000Research, Jul 2017
- Sphinx: merging knowledge-based and ab initio approaches to improve protein loop predictionClaire Marks, Jaroslaw Nowak, Stefan Klostermann, and 5 more authorsBioinformatics, Jul 2017
Motivation. Loops are often vital for protein function; however, their irregular structures make them difficult to model accurately. Current loop modelling algorithms can mostly be divided into two categories: knowledge-based, where databases of fragments are searched to find suitable conformations, and ab initio, where conformations are generated computationally. Existing knowledge-based methods only use fragments that are the same length as the target, even though loops of slightly different lengths may adopt similar conformations. Here, we present a novel method, Sphinx, which combines ab initio techniques with the potential extra structural information contained within loops of a different length to improve structure prediction. Results. We show that Sphinx is able to generate high-accuracy predictions and decoy sets enriched with near-native loop conformations, performing better than the ab initio algorithm on which it is based. In addition, it is able to provide predictions for every target, unlike some knowledge-based methods. Sphinx can be used successfully for the difficult problem of antibody H3 prediction, outperforming RosettaAntibody, one of the leading H3-specific ab initio methods, both in accuracy and speed. Sphinx is available at http://opig.stats.ox.ac.uk/webapps/sphinx. Supplementary data are available at Bioinformatics online.
- Antibody H3 Structure PredictionClaire Marks and Charlotte M DeaneComputational and Structural Biotechnology Journal, Jul 2017
Antibodies are proteins of the immune system that are able to bind to a huge variety of different substances, making them attractive candidates for therapeutic applications. Antibody structures have the potential to be useful during drug development, allowing the implementation of rational design procedures. The most challenging part of the antibody structure to experimentally determine or model is the H3 loop, which in addition is often the most important region in an antibody’s binding site. This review summarises the approaches used so far in the pursuit of accurate computational H3 structure prediction.
- Cross-linking mass spectrometry identifies new interfaces of Augmin required to localise the gamma-tubulin ring complex to the mitotic spindle.Jack W C Chen, Zhuo A Chen, Kacper B Rogala, and 4 more authorsBiology Open, May 2017
The hetero-octameric protein complex, Augmin, recruits γ-Tubulin ring complex (γ-TuRC) to pre-existing microtubules (MTs) to generate branched MTs during mitosis, facilitating robust spindle assembly. However, despite a recent partial reconstitution of the human Augmin complex in vitro, the molecular basis of this recruitment remains unclear. Here, we used immuno-affinity purification of in vivo Augmin from Drosophila and cross-linking/mass spectrometry to identify distance restraints between residues within the eight Augmin subunits in the absence of any other structural information. The results allowed us to predict potential interfaces between Augmin and γ-TuRC. We tested these predictions biochemically and in the Drosophila embryo, demonstrating that specific regions of the Augmin subunits, Dgt3, Dgt5 and Dgt6 all directly bind the γ-TuRC protein, Dgp71WD, and are required for the accumulation of γ-TuRC, but not Augmin, to the mitotic spindle. This study therefore substantially increases our understanding of the molecular mechanisms underpinning MT-dependent MT nucleation.
- Developability of Biotherapeutics: Computational Approaches. Edited by Sandeep Kumar and Satish K. SinghCharlotte M Deane and Maximiliano VásquezMAbs, Jan 2017
- Ten simple rules for surviving an interdisciplinary PhDSamuel Demharter, Nicholas Pearce, Kylie Beattie, and 11 more authorsPLOS Computational Biology, May 2017
2016
- SAbPred: a structure-based antibody prediction server.James Dunbar, Konrad Krawczyk, Jinwoo Leem, and 7 more authorsNucleic Acids Research, Apr 2016
SAbPred is a server that makes predictions of the properties of antibodies focusing on their structures. Antibody informatics tools can help improve our understanding of immune responses to disease and aid in the design and engineering of therapeutic molecules. SAbPred is a single platform containing multiple applications which can: number and align sequences; automatically generate antibody variable fragment homology models; annotate such models with estimated accuracy alongside sequence and structural properties including potential developability issues; predict paratope residues; and predict epitope patches on protein antigens. The server is available at http://opig.stats.ox.ac.uk/webapps/sabpred.
- Modeling Functional Motions of Biological Systems by Customized Natural MovesSamuel Demharter, Bernhard Knapp, Charlotte M Deane, and 1 more authorBiophysical Journal, Apr 2016
Simulating the functional motions of biomolecular systems requires large computational resources. We introduce a computationally inexpensive protocol for the systematic testing of hypotheses regarding the dynamic behavior of proteins and nucleic acids. The protocol is based on natural move Monte Carlo, a highly efficient conformational sampling method with built-in customization capabilities that allows researchers to design and perform a large number of simulations to investigate functional motions in biological systems. We demonstrate the use of this protocol on both a protein and a DNA case study. Firstly, we investigate the plasticity of a class II major histocompatibility complex in the absence of a bound peptide. Secondly, we study the effects of the epigenetic mark 5-hydroxymethyl on cytosine on the structure of the Dickerson-Drew dodecamer. We show how our customized natural moves protocol can be used to investigate causal relationships of functional motions in biological systems.
- Progress and challenges in predicting protein interfacesR Esmaielbeiki, K Krawczyk, B Knapp, and 2 more authorsBriefings in Bioinformatics, Apr 2016
The majority of biological processes are mediated via protein-protein interactions. Determination of residues participating in such interactions improves our understanding of molecular mechanisms and facilitates the development of therapeutics. Experimental approaches to identifying interacting residues, such as mutagenesis, are costly and time-consuming and thus, computational methods for this purpose could streamline conventional pipelines. Here we review the field of computational protein interface prediction. We make a distinction between methods which address proteins in general and those targeted at antibodies, owing to the radically different binding mechanism of antibodies. We organize the multitude of currently available methods hierarchically based on required input and prediction principles to provide an overview of the field.
- Length-independent structural similarities enrich the antibody CDR canonical class model.Jaroslaw Nowak, Terry Baker, Guy Georges, and 5 more authorsMAbs, Jan 2016
Complementarity-determining regions (CDRs) are antibody loops that make up the antigen binding site. Here, we show that all CDR types have structurally similar loops of different lengths. Based on these findings, we created length-independent canonical classes for the non-H3 CDRs. Our length variable structural clusters show strong sequence patterns suggesting either that they evolved from the same original structure or result from some form of convergence. We find that our length-independent method not only clusters a larger number of CDRs, but also predicts canonical class from sequence better than the standard length-dependent approach. To demonstrate the usefulness of our findings, we predicted cluster membership of CDR-L3 sequences from 3 next-generation sequencing datasets of the antibody repertoire (over 1,000,000 sequences). Using the length-independent clusters, we can structurally classify an additional 135,000 sequences, which represents a ∼20% improvement over the standard approach. This suggests that our length-independent canonical classes might be a highly prevalent feature of antibody space, and could substantially improve our ability to accurately predict the structure of novel CDRs identified by next-generation sequencing.
- Comparison of large networks with sub-sampling strategiesWaqar Ali, Anatol E Wegner, Robert E Gaunt, and 2 more authorsScientific Reports, Jul 2016
- ABodyBuilder: automated antibody structure prediction with data-driven accuracy estimationJinwoo Leem, James Dunbar, Guy Georges, and 2 more authorsMAbs, Jul 2016
Computational modelling of antibody structures plays a critical role in therapeutic antibody design. Several antibody modelling pipelines exist, but no freely available methods currently model nanobodies, provide estimates of expected model accuracy, or highlight potential issues with the antibody’s experimental development. Here, we describe our automated antibody modelling pipeline, ABodyBuilder, designed to overcome these issues. The algorithm itself follows the standard four steps of template selection, orientation prediction, complementarity-determining region (CDR) loop modelling, and side chain prediction. ABodyBuilder then annotates the ‘confidence’ of the model as a probability that a component of the antibody (e.g., CDRL3 loop) will be modelled within a root-mean square deviation threshold. It also flags structural motifs on the model that are known to cause issues during in vitro development. ABodyBuilder was tested on four separate datasets, including the 11 antibodies from the Antibod...
- Tertiary Element Interaction in HIV-1 TARKonrad Krawczyk, Adelene Y L Sim, Bernhard Knapp, and 2 more authorsJ. Chem. Inf. Model., Sep 2016
HIV-1 replication requires binding to occur between Trans-activation Response Element (TAR) RNA and the TAT protein. This TAR-TAT binding depends on the conformation of TAR, and therapeutic development has attempted to exploit this dynamic behavior. Here we simulate TAR dynamics in the context of mutations inhibiting TAR binding. We find that two tertiary elements, the apical loop and the bulge, can interact directly, and this interaction may be linked to the affinity of TAR for TAT.
- Examining the Conservation of Kinks in Alpha HelicesEleanor C Law, Henry R Wilman, Sebastian Kelm, and 2 more authorsPLoS One, Jun 2016
- Exploring peptide/MHC detachment processes using Hierarchical Natural Move Monte CarloBernhard Knapp, Samuel Demharter, Charlotte M Deane, and 1 more authorBioinformatics, Jun 2016
MOTIVATION: The binding between a peptide and a major histocompatibility complex (MHC) is one of the most important processes for the induction of an adaptive immune response. Many algorithms have been developed to predict peptide/MHC (pMHC) binding. However, no approach has yet been able to give structural insight into how peptides detach from the MHC. RESULTS: In this study, we used a combination of coarse graining, hierarchical natural move Monte Carlo and stochastic conformational optimization to explore the detachment processes of 32 different peptides from HLA-A*02:01. We performed 100 independent repeats of each stochastic simulation and found that the presence of experimentally known anchor amino acids affects the detachment trajectories of our peptides. Comparison with experimental binding affinity data indicates the reliability of our approach (area under the receiver operating characteristic curve 0.85). We also compared to a 1000 ns molecular dynamics simulation of a non-binding peptide (AAAKTPVIV) and HLA-A*02:01. Even in this simulation, the longest published for pMHC, the peptide does not fully detach. Our approach is orders of magnitude faster and as such allows us to explore pMHC detachment processes in a way not possible with all-atom molecular dynamics simulations. AVAILABILITY AND IMPLEMENTATION: The source code is freely available for download at http://www.cs.ox.ac.uk/mosaics/. CONTACT: bernhard.knapp@stats.ox.ac.uk. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
- ANARCI: Antigen receptor numbering and receptor classificationJames Dunbar and Charlotte M DeaneBioinformatics, Jun 2016
MOTIVATION: Antibody amino-acid sequences can be numbered to identify equivalent positions. Such annotations are valuable for antibody sequence comparison, protein structure modelling and engineering. Multiple different numbering schemes exist; they vary in the nomenclature they use to annotate residue positions, their definitions of position equivalence and their popularity within different scientific disciplines. However, currently no publicly available software exists that can apply all the most widely used schemes or for which an executable can be obtained under an open license. RESULTS: ANARCI is a tool to classify and number antibody and T-cell receptor amino-acid variable domain sequences. It can annotate sequences with the five most popular numbering schemes: Kabat, Chothia, Enhanced Chothia, IMGT and AHo. AVAILABILITY AND IMPLEMENTATION: ANARCI is available for download under GPLv3 license at opig.stats.ox.ac.uk/webapps/anarci. A web-interface to the program is available at the same address. CONTACT: deane@stats.ox.ac.uk.
- The contribution of major histocompatibility complex contacts to the affinity and kinetics of T cell receptor bindingHao Zhang, Hong-Sheng Lim, Bernhard Knapp, and 4 more authorsScientific Reports, Oct 2016
2015
- Rapid, Precise, and Reproducible Prediction of Peptide–MHC Binding Affinities from Molecular Dynamics That Correlate Well with ExperimentShunzhou Wan, Bernhard Knapp, David W. Wright, and 2 more authorsJournal of Chemical Theory and Computation, Oct 2015
The presentation of potentially pathogenic peptides by major histocompatibility complex (MHC) molecules is one of the most important processes in adaptive immune defense. Prediction of peptide–MHC (pMHC) binding affinities is therefore a principal objective of theoretical immunology. Machine learning techniques achieve good results if substantial experimental training data are available. Approaches based on structural information become necessary if sufficiently similar training data are unavailable for a specific MHC allele, although they have often been deemed to lack accuracy. In this study, we use a free energy method to rank the binding affinities of 12 diverse peptides bound by a class I MHC molecule HLA-A*02:01. The method is based on enhanced sampling of molecular dynamics calculations in combination with a continuum solvent approximation and includes estimates of the configurational entropy based on either a one or a three trajectory protocol. It produces precise and reproducible free energy estimates which correlate well with experimental measurements. If the results are combined with an amino acid hydrophobicity scale, then an extremely good ranking of peptide binding affinities emerges. Our approach is rapid, robust, and applicable to a wide range of ligand–receptor interactions without further adjustment.
- Building a Better Fragment Library for De Novo Protein Structure PredictionSaulo Henrique Pires Oliveira, Jiye Shi, and Charlotte M. DeanePLoS One, Oct 2015
Fragment-based approaches are the current standard for de novo protein structure prediction. These approaches rely on accurate and reliable fragment libraries to generate good structural models. In this work, we describe a novel method for structure fragment library generation and its application in fragment-based de novo protein structure prediction. The importance of correct testing procedures in assessing the quality of fragment libraries is demonstrated. In particular, homologs of the target must be excluded from the libraries to correctly simulate a de novo protein structure prediction scenario, something which surprisingly is not always done. We demonstrate that fragments presenting different predominant predicted secondary structures should be treated differently during the fragment library generation step and that exhaustive and random search strategies should both be used. This information was used to develop a novel method, Flib. On a validation set of 41 structurally diverse proteins, Flib libraries present both a higher precision and coverage than two of the state-of-the-art methods, NNMake and HHFrag. Flib also achieves better precision and coverage on the set of 275 protein domains used in the two previous experiments of the Critical Assessment of Structure Prediction (CASP9 and CASP10). We compared Flib libraries against NNMake libraries in a structure prediction context. Of the 13 cases in which a correct answer was generated, Flib models were more accurate than NNMake models for 10. Flib is available for download at: http://www.stats.ox.ac.uk/research/proteins/resources.
- Structural Bridges through Fold Space.Hannah Edwards and Charlotte M DeanePLoS Computational Biology, Sep 2015
Several protein structure classification schemes exist that partition the protein universe into structural units called folds. Yet these schemes do not discuss how these units sit relative to each other in a global structure space. In this paper we construct networks that describe such global relationships between folds in the form of structural bridges. We generate these networks using four different structural alignment methods across multiple score thresholds. The networks constructed using the different methods remain a similar distance apart regardless of the probability threshold defining a structural bridge. This suggests that at least some structural bridges are method specific and that any attempt to build a picture of structural space should not be reliant on a single structural superposition method. Despite these differences all representations agree on an organisation of fold space into five principal community structures: all-α, all-β sandwiches, all-β barrels, α/β and α+β. We project estimated fold ages onto the networks and find that not only are the pairings of unconnected folds associated with higher age differences than bridged folds, but this difference increases with the number of networks displaying an edge. We also examine different centrality measures for folds within the networks and how these relate to fold age. While these measures interpret the central core of fold space in varied ways they all identify the disposition of ancestral folds to fall within this core and that of the more recently evolved structures to provide the peripheral landscape. These findings suggest that evolutionary information is encoded along these structural bridges. Finally, we identify four highly central pivotal folds representing dominant topological features which act as key attractors within our landscapes.
- Prediction of VH-VL domain orientation for antibody variable domain modeling.Alexander Bujotzek, James Dunbar, Florian Lipsmeier, and 4 more authorsProteins, Apr 2015
The antigen-binding site of antibodies forms at the interface of their two variable domains, VH and VL, making VH-VL domain orientation a factor that codetermines antibody specificity and affinity. Preserving VH-VL domain orientation in the process of antibody engineering is important in order to retain the original antibody properties, and predicting the correct VH-VL orientation has also been recognized as an important factor in antibody homology modeling. In this article, we present a fast sequence-based predictor that predicts VH-VL domain orientation with Q² values ranging from 0.54 to 0.73 on the evaluation set. We describe VH-VL orientation in terms of the six absolute ABangle parameters that have recently been proposed as a means to separate the different degrees of freedom of VH-VL domain orientation. In order to assess the impact of adjusting VH-VL orientation according to our predictions, we use the set of antibody structures of the recently published Antibody Modeling Assessment (AMA) II study. In comparison to the original AMAII homology models, we find an improvement in the accuracy of VH-VL orientation modeling, which also translates into an improvement in the average root-mean-square deviation with regard to the crystal structures.
- Current status and future challenges in T-cell receptor/peptide/MHC molecular dynamics simulationsBernhard Knapp, Samuel Demharter, Reyhaneh Esmaielbeiki, and 1 more authorBriefings in Bioinformatics, Apr 2015
The interaction between T-cell receptors (TCRs) and major histocompatibility complex (MHC)-bound epitopes is one of the most important processes in the adaptive human immune response. Several hypotheses on TCR triggering have been proposed. Many of them involve structural and dynamical adjustments in the TCR/peptide/MHC interface. Molecular Dynamics (MD) simulations are a computational technique that is used to investigate structural dynamics at atomic resolution. Such simulations are used to improve understanding of signalling on a structural level. Here we review how MD simulations of the TCR/peptide/MHC complex have given insight into immune system reactions not achievable with current experimental methods. Firstly, we summarize methods of TCR/peptide/MHC complex modelling and TCR/peptide/MHC MD trajectory analysis methods. Then we classify recently published simulations into categories and give an overview of approaches and results. We show that current studies do not come to the same conclusions about TCR/peptide/MHC interactions. This discrepancy might be caused by too small sample sizes or intrinsic differences between each interaction process. As computational power increases, future studies will be able to, and should, have larger sample sizes, longer runtimes and additional parts of the immunological synapse included.
- Ten Simple Rules for a Successful Cross-Disciplinary CollaborationBernhard Knapp, Rémi Bardenet, Miguel O Bernabeu, and 17 more authorsPLoS Computational Biology, Apr 2015
- Misato Controls Mitotic Microtubule Generation by Stabilizing the Tubulin Chaperone Protein-1 Complex.Valeria Palumbo, Claudia Pellacani, Kate J Heesom, and 6 more authorsCurrent Biology, Apr 2015
Mitotic spindles are primarily composed of microtubules (MTs), generated by polymerization of α- and β-Tubulin hetero-dimers [1, 2]. Tubulins undergo a series of protein folding and post-translational modifications in order to fulfill their functions [3, 4]. Defects in Tubulin polymerization dramatically affect spindle formation and disrupt chromosome segregation. We recently described a role for the product of the conserved misato (mst) gene in regulating mitotic MT generation in flies [5], but the molecular function of Mst remains unknown. Here, we use affinity purification mass spectrometry (AP-MS) to identify interacting partners of Mst in the Drosophila embryo. We demonstrate that Mst associates stoichiometrically with the hetero-octameric Tubulin Chaperone Protein-1 (TCP-1) complex, with the hetero-hexameric Tubulin Prefoldin complex, and with proteins having conserved roles in generating MT-competent Tubulin. We show that RNAi-mediated in vivo depletion of any TCP-1 subunit phenocopies the effects of mutations in mst or the Prefoldin-encoding gene merry-go-round (mgr), leading to monopolar and disorganized mitotic spindles containing few MTs. Crucially, we demonstrate that Mst, but not Mgr, is required for TCP-1 complex stability and that both the efficiency of Tubulin polymerization and Tubulin stability are drastically compromised in mst mutants. Moreover, our structural bioinformatic analyses indicate that Mst resembles the three-dimensional structure of Tubulin monomers and might therefore occupy the TCP-1 complex central cavity. Collectively, our results suggest that Mst acts as a co-factor of the TCP-1 complex, playing an essential role in the Tubulin-folding processes required for proper assembly of spindle MTs.
- WONKA: objective novel complex analysis for ensembles of protein-ligand structures.A R Bradley, I D Wall, F Delft, and 3 more authorsJournal of Computer-Aided Molecular Design, Apr 2015
WONKA is a tool for the systematic analysis of an ensemble of protein-ligand structures. It makes the identification of conserved and unusual features within such an ensemble straightforward. WONKA uses an intuitive workflow to process structural co-ordinates. Ligand and protein features are summarised and then presented within an interactive web application. WONKA’s power in consolidating and summarising large amounts of data is described through the analysis of three bromodomain datasets. Furthermore, and in contrast to many current methods, WONKA relates analysis to individual ligands, from which we find unusual and erroneous binding modes. Finally the use of WONKA as an annotation tool to share observations about structures is demonstrated. WONKA is freely available to download and install locally or can be used online at http://wonka.sgc.ox.ac.uk.
- The Caenorhabditis elegans protein SAS-5 forms large oligomeric assemblies critical for centriole formationKacper B Rogala, Nicola J Dynes, Georgios N Hatzopoulos, and 6 more authorseLife, May 2015
Centrioles are microtubule-based organelles crucial for cell division, sensing and motility. In C. elegans, the onset of centriole formation requires notably the proteins SAS-5 and SAS-6, which have functional homologs across eukaryotic evolution. Whereas the molecular architecture of SAS-6 and its role in initiating centriole formation are well understood, the mechanisms by which SAS-5 and its relatives function are unclear. Here, we combine biophysical and structural analysis to uncover the architecture of SAS-5 and examine its functional implications in vivo. Our work reveals that two distinct self-associating domains are necessary to form higher-order oligomers of SAS-5: a trimeric coiled coil and a novel globular dimeric Implico domain. Disruption of either domain leads to centriole duplication failure in worm embryos, indicating that large SAS-5 assemblies are necessary for function in vivo.
- Type II Inhibitors Targeting CDK2.Leila T Alexander, Henrik Möbitz, Peter Drueckes, and 6 more authorsACS Chemical Biology, Sep 2015
Kinases can switch between active and inactive conformations of the ATP/Mg²⁺ binding motif DFG, which has been explored for the development of type I or type II inhibitors. However, factors modulating DFG conformations remain poorly understood. We chose CDK2 as a model system to study the DFG in-out transition on a target that was thought to have an inaccessible DFG-out conformation. We used site-directed mutagenesis of key residues identified in structural comparisons in conjunction with biochemical and biophysical characterization of the generated mutants. As a result, we identified key residues that facilitate the DFG-out movement and thereby the binding of type II inhibitors. Surprisingly, however, we also found that wild type CDK2 is able to bind type II inhibitors. Protein crystallographic analysis of the CDK2 complex with an aminopyrimidine-phenyl urea inhibitor (K03861) revealed a canonical type II binding mode and the first available type II inhibitor CDK2 cocrystal structure. We found that the identified type II inhibitors compete with binding of activating cyclins. In addition, analysis of the binding kinetics of the identified inhibitors revealed slow off-rates. The study highlights the importance of residues that may be distant to the ATP binding pocket in modulating the energetics of the DFG-out transition and hence inhibitor binding. The presented data also provide the foundation for a new class of slow off-rate cyclin-competitive CDK2 inhibitors targeting the inactive DFG-out state of this important kinase target.
- T-cell Receptor Binding Affects the Dynamics of the Peptide/MHC-I Complex.Bernhard Knapp and Charlotte M DeaneJournal of Chemical Information and Modeling, Dec 2015
The recognition of peptide-MHC by T-cell receptors is one of the most important interactions in the adaptive immune system. A large number of computational studies have investigated the structural dynamics of this interaction. However, so far only limited attention has been paid to differences between the dynamics of peptide/MHC with the T-cell receptor bound and unbound. Here we present the first large scale Molecular Dynamics simulation study of this type investigating HLA-B*08:01 in complex with the Epstein-Barr virus peptide FLRGRAYGL and all possible single point mutations (n=172). All simulations were performed with and without the LC 13 T-cell receptor for a simulation time of 100 ns yielding 344 simulations and a total simulation time of 34 400 ns. Our study is two orders of magnitude larger than the average T-cell receptor / peptide / MHC Molecular Dynamics simulation study. This dataset provides reliable insights into alteration in peptide/MHC-I dynamics caused by T-cell receptor presence. We found that simulations in the presence of T-cell receptors have more H-bonds between peptide/MHC; altered flexibility patterns in the MHC helices and the peptide; a lower MHC groove width range; and altered solvent accessible surface areas. This indicates that without a T-cell receptor the MHC binding groove can open and close while T-cell receptor presence inhibits these breathing-like motions.
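The simulation totals quoted in this abstract can be checked with a few lines of arithmetic. The sketch below assumes that the 172 peptide variants are the wild-type 9-mer plus its 9 × 19 single-residue substitutions; the abstract gives only the total, so that breakdown is an assumption rather than a statement from the paper.

```python
# Sanity check of the simulation counts quoted in the abstract.
PEPTIDE = "FLRGRAYGL"   # Epstein-Barr virus peptide named in the abstract
N_AMINO_ACIDS = 20      # standard amino acids
NS_PER_SIMULATION = 100 # simulation time per run, from the abstract

# Assumption: 172 = wild type + every single-residue substitution.
n_substitutions = len(PEPTIDE) * (N_AMINO_ACIDS - 1)
n_peptides = n_substitutions + 1

# Each peptide is simulated twice: with and without the T-cell receptor.
n_simulations = n_peptides * 2
total_ns = n_simulations * NS_PER_SIMULATION

print(n_peptides, n_simulations, total_ns)
```

Under that assumption the counts reproduce the 172 peptides, 344 simulations and 34 400 ns stated above.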
2014
- Examining Variable Domain Orientations in Antigen Receptors Gives Insight into TCR-Like Antibody DesignJames Dunbar, Bernhard Knapp, Angelika Fuchs, and 2 more authorsPLoS Computational Biology, Dec 2014
- Gro2mat: A package to efficiently read gromacs output in MATLABHung Dien, Charlotte M Deane, and Bernhard KnappJournal of Computational Chemistry, Dec 2014
- Alignment-free protein interaction network comparisonWaqar Ali, Tiago Rito, Gesine Reinert, and 2 more authorsBioinformatics, Dec 2014
- SAbDab: the structural antibody databaseJames Dunbar, Konrad Krawczyk, Jinwoo Leem, and 5 more authorsNucleic Acids Research, Dec 2014
Structural antibody database (SAbDab; http://opig.stats.ox.ac.uk/webapps/sabdab) is an online resource containing all the publicly available antibody structures annotated and presented in a consistent fashion. The data are annotated with several properties including experimental information, gene details, correct heavy and light chain pairings, antigen details and, where available, antibody-antigen binding affinity. The user can select structures, according to these attributes as well as structural properties such as complementarity determining region loop conformation and variable domain orientation. Individual structures, datasets and the complete database can be downloaded.
- Crowdsourcing Yields a New Standard for Kinks in Protein Helices.Henry R Wilman, Jean-Paul Ebejer, Jiye Shi, and 2 more authorsJournal of Chemical Information and Modeling, Dec 2014
Kinks are functionally important structural features found in the α-helices of proteins. Structurally, they are points at which a helix abruptly changes direction. Current kink definition and identification methods often disagree with one another. Here we describe a crowdsourcing approach to obtain a reliable gold standard set of kinks. Using an online interface, we collected more than 10 000 classifications of 300 helices into straight, curved, or kinked categories. We found that participants were better at discriminating between straight and not-straight helices than between kinked and curved helices. Surprisingly, more obvious kinks were not necessarily identified as more localized within the helix. We present a set of 252 helices where more than 50% of the participants agree on a classification. This set can be used as a reliable gold standard to develop, train, and compare computational methods. An interactive visualization of the results is available online at http://opig.stats.ox.ac.uk/webapps/ahah/php/experiment_results.php .
- Improving B-cell epitope prediction and its application to global antibody-antigen dockingKonrad Krawczyk, Xiaofeng Liu, Terry Baker, and 2 more authorsBioinformatics, Aug 2014
Motivation: Antibodies are currently the most important class of biopharmaceuticals. Development of such antibody-based drugs depends on costly and time-consuming screening campaigns. Computational techniques such as antibody-antigen docking hold the potential to facilitate the screening process by rapidly providing a list of initial poses that approximate the native complex. Results: We have developed a new method, EpiPred, to identify the epitope region on the antigen, given the structures of the antibody and the antigen. The method combines conformational matching of the antibody-antigen structures and a specific antibody-antigen score. We have tested the method on both a large non-redundant set of antibody-antigen complexes and on homology models of the antibodies and/or the unbound antigen structure. On a non-redundant test set, our epitope prediction method achieves 44% recall at 14% precision against 23% recall at 14% precision for a background random distribution. We use our epitope predictions to rescore the global docking results of two rigid-body docking algorithms: ZDOCK and ClusPro. In both cases, including our epitope prediction increases the number of near-native poses found among the top decoys. Availability and implementation: Our software is available from http://www.stats.ox.ac.uk/research/proteins/resources. Contact: deane@stats.ox.ac.uk Supplementary information: Supplementary data are available at Bioinformatics online.
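For readers less familiar with the recall and precision figures quoted above, the following is a minimal sketch of how the two measures are computed for a predicted epitope. It is not the EpiPred implementation; the function name and the residue identifiers are hypothetical and chosen only for illustration.

```python
def precision_recall(predicted, true):
    """Return (precision, recall) for two collections of residue identifiers."""
    predicted, true = set(predicted), set(true)
    true_positives = len(predicted & true)
    # Precision: fraction of predicted residues that are really in the epitope.
    precision = true_positives / len(predicted) if predicted else 0.0
    # Recall: fraction of real epitope residues that were predicted.
    recall = true_positives / len(true) if true else 0.0
    return precision, recall


# Hypothetical residues labelled as (chain, residue number).
true_epitope = {("A", i) for i in range(10, 20)}   # 10 residues
predicted = {("A", i) for i in range(15, 35)}      # 20 residues, 5 shared
p, r = precision_recall(predicted, true_epitope)
print(f"precision={p:.2f} recall={r:.2f}")
```

In this toy case 5 of the 20 predicted residues are correct (precision 0.25) and 5 of the 10 true residues are found (recall 0.50).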
- Helix kinks are equally prevalent in soluble and membrane proteins.Henry R Wilman, Jiye Shi, and Charlotte M DeaneProteins, Sep 2014
Helix kinks are a common feature of α-helical membrane proteins, but are thought to be rare in soluble proteins. In this study we find that kinks are a feature of long α-helices in both soluble and membrane proteins, rather than just transmembrane α-helices. The apparent rarity of kinks in soluble proteins is due to the relative infrequency of long helices (≥20 residues) in these proteins. We compare length-matched sets of soluble and membrane helices, and find that the frequency of kinks, the role of Proline, the patterns of other amino acids around kinks (allowing for the expected differences in amino acid distributions between the two types of protein), and the effects of hydrogen bonds are the same for the two types of helices. In both types of protein, helices that contain Proline in the second and subsequent turns are very frequently kinked. However, there is a sizeable proportion of kinked helices that do not contain a Proline in either their sequence or sequence homolog. Moreover, we observe that in soluble proteins, kinked helices have a structural preference in that they typically point into the solvent. Proteins 2014; 82:1960-1970. © 2014 The Authors. Proteins: Structure, Function, and Bioinformatics Published by Wiley Periodicals, Inc.
- Fragment-based modeling of membrane protein loops: successes, failures, and prospects for the futureSebastian Kelm, Anna Vangone, Yoonjoo Choi, and 3 more authorsProteins, Feb 2014
Membrane proteins (MPs) have become a major focus in structure prediction, due to their medical importance. There is, however, a lack of fast and reliable methods that specialize in the modeling of MP loops. Often methods designed for soluble proteins (SPs) are applied directly to MPs. In this article, we investigate the validity of such an approach in the realm of fragment-based methods. We also examine the differences in membrane and soluble protein loops that might affect accuracy. We test our ability to predict soluble and MP loops with the previously published method FREAD. We show that it is possible to predict accurately the structure of MP loops using a database of MP fragments (0.5-1 Å median root-mean-square deviation). The presence of homologous proteins in the database helps prediction accuracy. However, even when homologues are removed, better results are still achieved using fragments of MPs (0.8-1.6 Å) rather than SPs (1-4 Å) to model MP loops. We find that many fragments of SPs have shapes similar to their MP counterparts but have very different sequences; however, they do not appear to differ in their substitution patterns. Our findings may allow further improvements to fragment-based loop modeling algorithms for MPs. The current version of our proof-of-concept loop modeling protocol produces high-accuracy loop models for MPs and is available as a web server at http://medeller.info/fread.
- Ten Simple Rules for Effective Computational ResearchJames M Osborne, Miguel O Bernabeu, Maria Bruna, and 17 more authorsPLoS Computational Biology, Mar 2014
- OOMMPPAA: A Tool To Aid Directed Synthesis by the Combined Analysis of Activity and Structural DataAnthony R Bradley, Ian D Wall, Darren V S Green, and 2 more authorsJournal of Chemical Information and Modeling, Mar 2014
- Large Scale Characterization of the LC13 TCR and HLA-B8 Structural Landscape in Reaction to 172 Altered Peptide Ligands: A Molecular Dynamics Simulation StudyBernhard Knapp, James Dunbar, and Charlotte M DeanePLoS Computational Biology, Mar 2014
2013
- Arginine methylation-dependent reader-writer interplay governs growth control by E2F-1Shunsheng Zheng, Jutta Moehlenbrink, Yi-Chien Lu, and 11 more authorsMolecular Cell, Mar 2013
The mechanisms that underlie and dictate the different biological outcomes of E2F-1 activity have yet to be elucidated. We describe the residue-specific methylation of E2F-1 by the asymmetric dimethylating protein arginine methyltransferase (PRMT) 1 and symmetric dimethylating PRMT5, and relate the marks to different functional consequences of E2F-1 activity. Methylation by PRMT1 hinders methylation by PRMT5, which augments E2F-1-dependent apoptosis, whereas PRMT5-dependent methylation favours proliferation by antagonising methylation by PRMT1. The ability of E2F-1 to prompt apoptosis in DNA damaged cells coincides with enhanced PRMT1 methylation. In contrast, cyclin A binding to E2F-1 impedes PRMT1 methylation and augments PRMT5 methylation, thus ensuring that E2F-1 is locked into its cell cycle progression mode. The Tudor domain protein p100-TSN reads the symmetric methylation mark, and binding of p100-TSN down-regulates E2F-1 apoptotic activity. Our results define an exquisite level of precision in the reader-writer interplay that governs the biological outcome of E2F-1 activity.
- Differential Geometric Analysis of Alterations in MH α-HelicesBirgit Hischenhuber, Hans Havlicek, Jelena Todoric, and 3 more authorsJournal of Computational Chemistry, Aug 2013
Antigen presenting cells present processed peptides via their major histocompatibility (MH) complex to the T cell receptors (TRs) of T cells. If a peptide is immunogenic, a signaling cascade can be triggered within the T cell. However, the binding of different peptides and/or different TRs to MH is also known to influence the spatial arrangement of the MH α-helices which could itself be an additional level of T cell regulation. In this study, we introduce a new methodology based on differential geometric parameters to describe MH deformations in a detailed and comparable way. For this purpose, we represent MH α-helices by curves. On the basis of these curves, we calculate in a first step the curvature and torsion to describe each α-helix independently. In a second step, we calculate the distribution parameter and the conical curvature of the ruled surface to describe the relative orientation of the two α-helices. On the basis of four different test sets, we show how these differential geometric parameters can be used to describe changes in the spatial arrangement of the MH α-helices for different biological challenges. In the first test set, we illustrate on the basis of all available crystal structures for (TR)/pMH complexes how the binding of TRs influences the MH helices. In the second test set, we show a cross evaluation of different MH alleles with the same peptide and the same MH allele with different peptides. In the third test set, we present the spatial effects of different TRs on the same peptide/MH complex. In the fourth test set, we illustrate how a severe conformational change in an α-helix can be described quantitatively. Taken together, we provide a novel structural methodology to numerically describe subtle and severe alterations in MH α-helices for a broad range of applications. © 2013 Wiley Periodicals, Inc.
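As an illustration of the first step described above, the sketch below evaluates the standard curvature and torsion formulas for a space curve, κ = |r′ × r″| / |r′|³ and τ = (r′ × r″) · r‴ / |r′ × r″|². It is not the authors' curve-fitting procedure: the curve here is an ideal circular helix, and the radius and pitch (roughly those of an α-helix Cα trace) are illustrative assumptions.

```python
import math


def cross(u, v):
    """Cross product of two 3-vectors."""
    return (u[1] * v[2] - u[2] * v[1],
            u[2] * v[0] - u[0] * v[2],
            u[0] * v[1] - u[1] * v[0])


def dot(u, v):
    """Dot product of two 3-vectors."""
    return sum(a * b for a, b in zip(u, v))


def curvature_torsion(d1, d2, d3):
    """Curvature and torsion from the first three derivatives of a curve."""
    c = cross(d1, d2)
    c_norm_sq = dot(c, c)
    if c_norm_sq == 0.0:
        raise ValueError("torsion is undefined where r' and r'' are parallel")
    speed = math.sqrt(dot(d1, d1))
    kappa = math.sqrt(c_norm_sq) / speed ** 3
    tau = dot(c, d3) / c_norm_sq
    return kappa, tau


def helix_derivatives(a, b, t):
    """Derivatives of r(t) = (a cos t, a sin t, b t) at parameter t."""
    d1 = (-a * math.sin(t), a * math.cos(t), b)
    d2 = (-a * math.cos(t), -a * math.sin(t), 0.0)
    d3 = (a * math.sin(t), -a * math.cos(t), 0.0)
    return d1, d2, d3


# Illustrative values only: radius 2.3 Å and pitch 5.4 Å.
radius = 2.3
rise_per_radian = 5.4 / (2.0 * math.pi)
kappa, tau = curvature_torsion(*helix_derivatives(radius, rise_per_radian, 0.7))
print(f"curvature={kappa:.4f} per Å, torsion={tau:.4f} per Å")
```

For a circular helix both quantities are constant along the curve, κ = a / (a² + b²) and τ = b / (a² + b²), so deviations from constant values are what signal a deformation.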
- MP-T: improving membrane protein alignment for structure predictionJ R Hill and C M DeaneBioinformatics, Jan 2013
Membrane proteins are clinically relevant, yet their crystal structures are rare. Models of membrane proteins are typically built from template structures with low sequence identity to the target sequence, using a sequence-structure alignment as a blueprint. This alignment is usually made with programs designed for use on soluble proteins. Biological membranes have layers of varying hydrophobicity, and membrane proteins have different amino-acid substitution preferences from their soluble counterparts. Here we incorporate these factors into an alignment method to improve alignments and consequently improve membrane protein models. We developed Membrane Protein Threader (MP-T), a sequence-structure alignment tool for membrane proteins based on multiple sequence alignment. Alignment accuracy is tested against seven other alignment methods over 165 non-redundant alignments of membrane proteins. MP-T produces more accurate alignments than all other methods tested (ΔF(M) from +0.9 to +5.5%). Alignments generated by MP-T also lead to significantly better models than those of the best alternative alignment tool (one-fourth of models see an increase in GDT_TS of ≥4%). All source code, alignments and models are available at http://www.stats.ox.ac.uk/proteins/resources
- ABangle: characterising the VH–VL orientation in antibodiesJames Dunbar, Angelika Fuchs, Jiye Shi, and 1 more authorProtein Engineering Design and Selection, Jan 2013
The binding site of an antibody is formed between the two variable domains, VH and VL, of its antigen binding fragment (Fab). Understanding how VH and VL orientate with respect to one another is important both for studying the mechanisms of antigen specificity and affinity and improving antibody modelling, docking and engineering. Different VH-VL orientations are commonly described using relative measures such as root-mean-square deviation. Recently, the orientation has also been characterised using the absolute measure of a VH-VL packing angle. However, a single angle cannot fully describe all modes of orientation. Here, we present a method which fully characterises VH-VL orientation in a consistent and absolute sense using five angles (HL, HC1, LC1, HC2 and LC2) and a distance (dc). Additionally, we provide a computational tool, ABangle, to allow the VH–VL orientation for any antibody to be automatically calculated and compared with all other known structures. We compare previous studies and show how the modes of orientation being identified relate to movements of different angles. Thus, we are able to explain why different studies identify different structural clusters and different residues as important. Given this result, we then identify those positions and their residue identities which influence each of the angular measures of orientation. Finally, by analysing VH–VL orientation in bound and unbound forms, we find that antibodies specific for protein antigens are significantly more flexible in their unbound form than antibodies specific for hapten antigens. ABangle is freely available at http://opig.stats.ox.ac.uk/webapps/abangle.
- Memoir: template-based structure prediction for membrane proteinsJean-Paul Ebejer, Jamie R Hill, Sebastian Kelm, and 2 more authorsNucleic Acids Research, Jan 2013
Membrane proteins are estimated to be the targets of 50% of drugs that are currently in development, yet we have few membrane protein crystal structures. As a result, for a membrane protein of interest, the much-needed structural information usually comes from a homology model. Current homology modelling software is optimized for globular proteins, and ignores the constraints that the membrane is known to place on protein structure. Our Memoir server produces homology models using alignment and coordinate generation software that has been designed specifically for transmembrane proteins. Memoir is easy to use, with the only inputs being a structural template and the sequence that is to be modelled. We provide a video tutorial and a guide to assessing model quality. Supporting data aid manual refinement of the models. These data include a set of alternative conformations for each modelled loop, and a multiple sequence alignment that incorporates the query and template. Memoir works with both α-helical and β-barrel types of membrane proteins and is freely available at http://opig.stats.ox.ac.uk/webapps/memoir.
- Exploring Fold Space Preferences of New-born and Ancient Protein SuperfamiliesHannah Edwards, Sanne Abeln, and Charlotte M DeanePLoS Computational Biology, Nov 2013
Author Summary: Proteins are the molecular workers of the cell. They are formed from a string of amino acids which folds into an elaborate three-dimensional structure. While there is a relationship between a protein’s sequence and its structure, this relationship is highly complex and not fully understood. Protein structures tend to evolve differently to their sequences. They are far more conserved and so tend to change more slowly. The aim of this paper was to identify trends in the way that protein structures evolve, rather than adapting models of sequence evolution. To do this we have provided a database of ages for structural superfamilies. These ages are robust to drastic differences in the evolutionary assumptions underlying their estimation and can be used to study differences between populations of proteins. For example, we have compared newly evolved structures against those with a long evolutionary history and found that, overall, a shorter evolutionary history corresponds to a less elaborate structure. We have also demonstrated here how these ages can be used to compare particular structural motifs present in a large number of protein structures and have shown that the jelly roll motif is significantly younger than the Greek key.
- Stochastic detection of Pim protein kinases reveals electrostatically enhanced association of a peptide substrateLeon Harrington, Stephen Cheley, Leila T Alexander, and 2 more authorsProceedings of the National Academy of Sciences USA, Nov 2013
In stochastic sensing, the association and dissociation of analyte molecules is observed as the modulation of an ionic current flowing through a single engineered protein pore, enabling the label-free determination of rate and equilibrium constants with respect to a specific binding site. We engineered sensors based on the staphylococcal α-hemolysin pore to allow the single-molecule detection and characterization of protein kinase-peptide interactions. We enhanced this approach by using site-specific proteolysis to generate pores bearing a single peptide sensor element attached by an N-terminal peptide bond to the trans mouth of the pore. Kinetics and affinities for the Pim protein kinases (Pim-1, Pim-2, and Pim-3) and cAMP-dependent protein kinase were measured and found to be independent of membrane potential and in good agreement with previously reported data. Kinase binding exhibited a distinct current noise behavior that forms a basis for analyte discrimination. Finally, we observed unusually high association rate constants for the interaction of Pim kinases with their consensus substrate Pimtide (~10⁷ to 10⁸ M⁻¹·s⁻¹), the result of electrostatic enhancement, and propose a cellular role for this phenomenon.
- Antibody i-Patch prediction of the antibody binding site improves rigid local antibody-antigen dockingKonrad Krawczyk, Terry Baker, Jiye Shi, and 1 more authorProtein Engineering Design and Selection, Jan 2013
Antibodies are a class of proteins indispensable for the vertebrate immune system. The general architecture of all antibodies is very similar, but they contain a hypervariable region which allows millions of antibody variants to exist, each of which can bind to different molecules. This binding malleability means that antibodies are an increasingly important category of biopharmaceuticals and biomarkers. We present Antibody i-Patch, a method that annotates the most likely antibody residues to be in contact with the antigen. We show that our predictions correlate with energetic importance and thus we argue that they may be useful in guiding mutations in the artificial affinity maturation process. Using our predictions as constraints for a rigid-body docking algorithm, we are able to obtain high-quality results in minutes. Our annotation method and re-scoring system for docking achieve their predictive power by using antibody-specific statistics. Antibody i-Patch is available from http://www.stats.ox.ac.uk/research/proteins/resources.
- Local Network Patterns in Protein-Protein InterfacesQiang Luo, Rebecca Hamer, Gesine Reinert, and 1 more authorPLoS One, Mar 2013
Protein-protein interfaces hold the key to understanding protein-protein interactions. In this paper we investigated local interaction network patterns beyond pair-wise contact sites by considering interfaces as contact networks among residues. A contact site was defined as any residue on the surface of one protein which was in contact with a residue on the surface of another protein. We labeled the sub-graphs of these contact networks by their amino acid types. The observed distributions of these labeled sub-graphs were compared with the corresponding background distributions and the results suggested that there were preferred chemical patterns of closely packed residues at the interface. These preferred patterns point to biological constraints on physical proximity between those residues on one protein which were involved in binding to residues which were close on the interacting partner. Interaction interfaces were far from random and contain information beyond pairs and triangles. To illustrate the possible application of the local network patterns observed, we introduced a signature method, called iScore, based on these local patterns to assess interface predictions. On our data sets iScore achieved 83.6% specificity with 82% sensitivity.
- Early Relaxation Dynamics in the LC 13 T Cell Receptor in Reaction to 172 Altered Peptide Ligands: A Molecular Dynamics Simulation StudyBernhard Knapp, Georg Dorffner, and Wolfgang SchreinerPLoS One, Jun 2013
The interaction between the T cell receptor and the major histocompatibility complex is one of the most important events in adaptive immunology. Although several different models for the activation process of the T cell via the T cell receptor have been proposed, it could not be shown that a structural mechanism, which discriminates between peptides of different immunogenicity levels, exists within the T cell receptor. In this study, we performed systematic molecular dynamics simulations of 172 closely related altered peptide ligands in the same T cell receptor/major histocompatibility complex system. Statistical evaluations yielded significant differences in the initial relaxation process between sets of peptides at four different immunogenicity levels.
- How long is a piece of loop?Yoonjoo Choi, Sumeet Agarwal, and Charlotte M DeanePeerJ, Feb 2013
Loops are irregular structures which connect two secondary structure elements in proteins. They often play important roles in function, including enzyme reactions and ligand binding. Despite their importance, their structure remains difficult to predict. Most protein loop structure prediction methods sample local loop segments and score them. In particular, protein loop classifications and database search methods depend heavily on local properties of loops. Here we examine the distance between a loop’s end points (span). We find that the distribution of loop span appears to be independent of the number of residues in the loop, in other words the separation between the anchors of a loop does not increase with an increase in the number of loop residues. Loop span is also unaffected by the secondary structures at the end points, unless the two anchors are part of an anti-parallel beta sheet. As loop span appears to be independent of global properties of the protein we suggest that its distribution can be described by a random fluctuation model based on the Maxwell-Boltzmann distribution. It is believed that the primary difficulty in protein loop structure prediction comes from the number of residues in the loop. Following the idea that loop span is an independent local property, we investigate its effect on protein loop structure prediction and show how normalised span (loop stretch) is related to the structural complexity of loops. Highly contracted loops are more difficult to predict than stretched loops.
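To make the distributional claim concrete, the sketch below evaluates the Maxwell-Boltzmann density f(x) = √(2/π) · x² · exp(−x² / 2a²) / a³ and checks numerically that it integrates to one. The scale parameter used here is an arbitrary illustrative value, not a fit reported in the paper, and the paper's own random fluctuation model may differ in detail.

```python
import math


def maxwell_boltzmann_pdf(x, a):
    """Maxwell-Boltzmann density with scale parameter a > 0 (zero for x < 0)."""
    if x < 0:
        return 0.0
    return math.sqrt(2.0 / math.pi) * x ** 2 * math.exp(-x ** 2 / (2.0 * a ** 2)) / a ** 3


def mb_mode(a):
    """Most probable value of the distribution."""
    return math.sqrt(2.0) * a


def mb_mean(a):
    """Expected value of the distribution."""
    return 2.0 * a * math.sqrt(2.0 / math.pi)


def trapezoid(f, lo, hi, n=10000):
    """Composite trapezoid rule on [lo, hi] with n intervals."""
    h = (hi - lo) / n
    total = 0.5 * (f(lo) + f(hi)) + sum(f(lo + i * h) for i in range(1, n))
    return total * h


# Hypothetical scale parameter in Å, chosen only for illustration.
scale = 8.0
area = trapezoid(lambda x: maxwell_boltzmann_pdf(x, scale), 0.0, 12.0 * scale)
print(f"mode={mb_mode(scale):.2f} Å, mean={mb_mean(scale):.2f} Å, area={area:.4f}")
```

A single scale parameter fixes both the mode and the mean, which is consistent with a span distribution that does not shift as the number of loop residues grows.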
- The emerging role of cloud computing in molecular modellingJean-Paul Ebejer, Simone Fulle, Garrett M Morris, and 1 more authorJournal of Molecular Graphics and Modelling, Jul 2013
There is a growing recognition of the importance of cloud computing for large-scale and data-intensive applications. The distinguishing features of cloud computing and their relationship to other distributed computing paradigms are described, as are the strengths and weaknesses of the approach. We review the use made to date of cloud computing for molecular modelling projects and the availability of front ends for molecular modelling applications. Although the use of cloud computing technologies for molecular modelling is still in its infancy, we demonstrate its potential by presenting several case studies. Rapid growth can be expected as more applications become available and costs continue to fall; cloud computing can make a major contribution not just in terms of the availability of on-demand computing power, but could also spur innovation in the development of novel approaches that utilize that capacity in more effective ways.
- Effect of Single Amino Acid Substitution Observed in Cancer on Pim-1 Kinase Thermodynamic Stability and StructureClorinda Lori, Antonella Lantella, Alessandra Pasquo, and 4 more authorsPLoS One, Jun 2013
Pim-1 kinase, a serine/threonine protein kinase encoded by the pim proto-oncogene, is involved in several signalling pathways such as the regulation of cell cycle progression and apoptosis. Many cancer types show high expression levels of Pim kinases and particularly Pim-1 has been linked to the initiation and progression of the malignant phenotype. In several cancer tissues somatic Pim-1 mutants have been identified. These natural variants are nonsynonymous single nucleotide polymorphisms, variations of a single nucleotide occurring in the coding region and leading to amino acid substitutions. In this study we investigated the effect of amino acid substitution on the structural stability and on the activity of Pim-1 kinase. We expressed and purified some of the mutants of Pim-1 kinase that are expressed in cancer tissues and reported in the single nucleotide polymorphisms database. The point mutations in the variants significantly affect the conformation of the native state of Pim-1. All the mutants, expressed as soluble recombinant proteins, show decreased thermal and thermodynamic stability and lower activation energy values for kinase activity. The decreased stability accompanied by an increased flexibility suggests that Pim-1 variants may be involved in a wider network of protein interactions. All mutants bound ATP and ATP mimetic inhibitors with comparable IC50 values suggesting that the studied Pim-1 kinase mutants can be efficiently targeted with inhibitors developed for the wild type protein.
- Protein Modelling and Structural PredictionSebastian Kelm, Yoonjoo Choi, and Charlotte M DeaneJun 2013
2012
- A maternally inherited autosomal point mutation in human phospholipase C zeta (PLCζ) leads to male infertility.Junaid Kashir, Michalis Konstantinidis, Celine Jones, and 10 more authorsHuman Reproduction, Jun 2012
BACKGROUND: Male factor and idiopathic infertility contribute significantly to global infertility, with abnormal testicular gene expression considered to be a major cause. Certain types of male infertility are caused by failure of the sperm to activate the oocyte, a process normally regulated by calcium oscillations, thought to be induced by a sperm-specific phospholipase C, PLCzeta (PLCζ). Previously, we identified a point mutation in an infertile male resulting in the substitution of histidine for proline at position 398 of the protein sequence (PLCζ(H398P)), leading to abnormal PLCζ function and infertility. METHODS AND RESULTS: Here, using a combination of direct-sequencing and mini-sequencing of the PLCζ gene from the patient and his family, we report the identification of a second PLCζ mutation in the same patient resulting in a histidine to leucine substitution at position 233 (PLCζ(H233L)), which is predicted to disrupt local protein interactions in a manner similar to PLCζ(H398P) and was shown to exhibit abnormal calcium oscillatory ability following predictive 3D modelling and cRNA injection in mouse oocytes respectively. We show that PLCζ(H233L) and PLCζ(H398P) exist on distinct parental chromosomes, the former inherited from the patient’s mother and the latter from his father. Neither mutation was detected utilizing custom-made single-nucleotide polymorphism assays in 100 fertile males and females, or 8 infertile males with characterized oocyte activation deficiency. CONCLUSIONS: Collectively, our findings provide further evidence regarding the importance of PLCζ at oocyte activation and forms of male infertility where this is deficient. Additionally, we show that the inheritance patterns underlying male infertility are more complex than previously thought and may involve maternal mechanisms.
- Producing High-Accuracy Lattice Models from Protein Atomic Coordinates Including Side ChainsMartin Mann, Rhodri Saunders, Cameron Smith, and 2 more authorsAdvances in Bioinformatics, Aug 2012
Lattice models are a common abstraction used in the study of protein structure, folding, and refinement. They are advantageous because the discretisation of space can make extensive protein evaluations computationally feasible. Various approaches to the protein chain lattice fitting problem have been suggested but only a single backbone-only tool is available currently. We introduce LatFit, a new tool to produce high-accuracy lattice protein models. It generates both backbone-only and backbone-side-chain models in any user defined lattice. LatFit implements a new distance RMSD-optimisation fitting procedure in addition to the known coordinate RMSD method. We tested LatFit's accuracy and speed using a large nonredundant set of high resolution proteins (SCOP database) on three commonly used lattices: 3D cubic, face-centred cubic, and knight's walk. Fitting speed compared favourably to other methods and both backbone-only and backbone-side-chain models show low deviation from the original data (~1.5 Å RMSD in the FCC lattice). To our knowledge this represents the first comprehensive study of lattice quality for on-lattice protein models including side chains while LatFit is the only available tool for such models.
- Plasmodium subtilisin-like protease 1 (SUB1): Insights into the active-site structure, specificity and function of a pan-malaria drug targetChrislaine Withers-Martinez, Catherine Suarez, Simone Fulle, and 8 more authorsInternational Journal of Parasitology, May 2012
Release of the malaria merozoite from its host erythrocyte (egress) and invasion of a fresh cell are crucial steps in the life cycle of the malaria pathogen. Subtilisin-like protease 1 (SUB1) is a parasite serine protease implicated in both processes. In the most dangerous human malarial species, Plasmodium falciparum, SUB1 has previously been shown to have several parasite-derived substrates, proteolytic cleavage of which is important both for egress and maturation of the merozoite surface to enable invasion. Here we have used molecular modelling, existing knowledge of SUB1 substrates, and recombinant expression and characterisation of additional Plasmodium SUB1 orthologues, to examine the active site architecture and substrate specificity of P. falciparum SUB1 and its orthologues from the two other major human malaria pathogens Plasmodium vivax and Plasmodium knowlesi, as well as from the rodent malaria species, Plasmodium berghei. Our results reveal a number of unusual features of the SUB1 substrate binding cleft, including a requirement to interact with both prime and non-prime side residues of the substrate recognition motif. Cleavage of conserved parasite substrates is mediated by SUB1 in all parasite species examined, and the importance of this is supported by evidence for species-specific co-evolution of protease and substrates. Two peptidyl alpha-ketoamides based on an authentic PfSUB1 substrate inhibit all SUB1 orthologues examined, with inhibitory potency enhanced by the presence of a carboxyl moiety designed to introduce prime side interactions with the protease. Our findings demonstrate that it should be possible to develop ’pan-reactive’ drug-like compounds that inhibit SUB1 in all three major human malaria pathogens, enabling production of broad-spectrum antimalarial drugs targeting SUB1.
- Freely Available Conformer Generation Methods: How Good Are They?Jean-Paul Ebejer, Garrett M Morris, and Charlotte M DeaneJournal of Chemical Information and Modeling, May 2012
- What Evidence Is There for the Homology of Protein-Protein Interactions?Anna C F Lewis, Nick S Jones, Mason A Porter, and 1 more authorPLoS Computational Biology, Sep 2012
Author Summary: It is widely assumed that knowledge gained in one species can be transferred to another species, even among species that are widely separated on the tree of life. This transfer is often done at the level of proteins under the assumption that if two proteins have similar sequences, they will share similar properties. In this paper, we investigate the validity of this assumption for the case of protein-protein interactions. The transfer of protein interactions across species is a common procedure and it is known to have shortcomings but these are generally ascribed to the incompleteness of protein interaction data. We introduce a framework to take such incomplete information into account, and under its assumptions show that the procedure is unreliable when using sequence-similarity thresholds typically thought to allow the transfer of functional information. Our results imply that, unless using strict definitions of homology, interactions rewire at a rate too fast to allow reliable transfer across species. We urge caution in interpreting the results of such transfers.
- The importance of age and high degree, in protein-protein interaction networksTiago Rito, Charlotte M Deane, and Gesine ReinertJournal of Computational Biology, Jun 2012
Here we present an in-depth analysis of the protein age patterns found in the edge and triangle subgraphs of the yeast protein-protein interaction network (PIN). We assess their statistical significance both according to what would be expected by chance given the node frequencies found in the yeast PIN, and also, for the case of triangles, given the age frequencies observed in the currently available pairwise data. We find that pairwise interactions between Old proteins are over-represented even when controlling for high degree, and triangle interactions between Old proteins are over-represented even when controlling for pairwise interaction frequencies. There is evidence for negative selection of interactions between Middle-aged and Old proteins within triangles, despite pairwise Middle-Old interactions being common. Most triangles consist solely of vertices with high degree. Our findings point towards an architecture of the yeast PIN that is highly heterogeneous, having connected clumps which contain a large number of interacting Old proteins along with selective age-dependent interaction patterns. Supplementary Material is available online (www.liebertonline.com/cmb).
- Predicting Inter-Species Cross-Talk in Two-Component Signalling SystemsSonja Pawelczyk, Kathryn A Scott, Rebecca Hamer, and 3 more authorsPLoS One, May 2012
Phosphosignalling pathways are an attractive option for the synthetic biologist looking for a wide repertoire of modular components from which to build. We demonstrate that two-component systems can be used in synthetic biology. However, their potential is limited by the fact that host cells contain many of their own phosphosignalling pathways and these may interact with, and cross-talk to, the introduced synthetic components. In this paper we also demonstrate a simple bioinformatic tool that can help predict whether interspecies cross-talk between introduced and native two-component signalling pathways will occur and show both in vitro and in vivo that the predicted interactions do take place. The ability to predict potential cross-talk prior to designing and constructing novel pathways or choosing a host organism is essential for the promise that phosphosignalling components hold for synthetic biology to be realised.
2011
- Environment specific substitution tables improve membrane protein alignment.Jamie R Hill, Sebastian Kelm, Jiye Shi, and 1 more authorBioinformatics, Jul 2011
MOTIVATION: Membrane proteins are both abundant and important in cells, but the small number of solved structures restricts our understanding of them. Here we consider whether membrane proteins undergo different substitutions from their soluble counterparts and whether these can be used to improve membrane protein alignments, and therefore improve prediction of their structure. RESULTS: We construct substitution tables for different environments within membrane proteins. As data is scarce, we develop a general metric to assess the quality of these asymmetric tables. Membrane proteins show markedly different substitution preferences from soluble proteins. For example, substitution preferences in lipid tail-contacting parts of membrane proteins are found to be distinct from all environments in soluble proteins, including buried residues. A principal component analysis of the tables identifies the greatest variation in substitution preferences to be due to changes in hydrophobicity; the second largest variation relates to secondary structure. We demonstrate the use of our tables in pairwise sequence-to-structure alignments (also known as ’threading’) of membrane proteins using the FUGUE alignment program. On average, in the 10-25% sequence identity range, alignments are improved by 28 correctly aligned residues compared with alignments made using FUGUE’s default substitution tables. Our alignments also lead to improved structural models. AVAILABILITY: Substitution tables are available at: http://www.stats.ox.ac.uk/proteins/resources.
- Predicting antibody complementarity determining region structures without classificationYoonjoo Choi and Charlotte M DeaneMolecular Biosystems, Dec 2011
Antibodies are used extensively in medical and biological research. Their complementarity determining regions (CDRs) define the majority of their antigen binding functionality. CDR structures have been intensively studied and classified (canonical structures). Here we show that CDR structure prediction is no different from the standard loop structure prediction problem and predict them without classification. FREAD, a successful database loop prediction technique, is able to produce accurate predictions for all CDR loops (0.81, 0.42, 0.96, 0.98, 0.88 and 2.25 Å RMSD for CDR-L1 to CDR-H3). In order to overcome the relatively poor predictions of CDR-H3, we developed two variants of FREAD, one focused on sequence similarity (FREAD-S) and another which includes contact information (ConFREAD). Both of the methods improve accuracy for CDR-H3 to 1.34 Å and 1.23 Å respectively. The FREAD variants are also tested on homology models and compared to RosettaAntibody (CDR-H3 prediction on models: 1.98 and 2.62 Å for ConFREAD and RosettaAntibody respectively). CDRs are known to change their structural conformations upon binding the antigen. Traditional CDR classifications are based on sequence similarity and do not account for such environment changes. Using a set of antigen-free and antigen-bound structures, we compared our FREAD variants. ConFREAD which includes contact information successfully discriminates the bound and unbound CDR structures and achieves an accuracy of 1.35 Å for bound structures of CDR-H3.
- Signatures of co-translational foldingRhodri Saunders, Martin Mann, and Charlotte M DeaneBiotechnology Journal, Jun 2011
Global and co-translational protein folding may both occur in vivo, and understanding the relationship between these folding mechanisms is pivotal to our understanding of protein-structure formation. Within this study, over 1.5 million hydrophobic-polar sequences were classified based on their ability to attain a unique, but not necessarily minimal energy conformation through co-translational folding. The sequence and structure properties of the sets were then compared to elucidate signatures of co-translational folding. The strongest signature of co-translational folding is a reduced number of possible favorable contacts in the amino terminus. There is no evidence of fewer contacts, more local contacts, or less-compact structures. Co-translational folding produces a more compact amino- than carboxy-terminal region and an amino-terminal-biased set of core residues. In real proteins these signatures are also observed and found most strongly in proteins of the alpha/beta structural class of proteins (SCOP) where 71 % have an amino-terminal set of core residues. The prominence of co-translational features in experimentally determined protein structures suggests that the importance of co-translational folding is currently underestimated.
- Protein interaction networks and their statistical analysisWaqar Ali, Charlotte M Deane, and Gesine ReinertSep 2011
Systems Biology is now entering a mature phase in which the key issues are characterising uncertainty and stochastic effects in mathematical models of biological systems. The area is moving towards a full statistical analysis and probabilistic reasoning over the inferences that can be made from mathematical models. This handbook presents a comprehensive guide to the discipline for practitioners and educators, in providing a full and detailed treatment of these important and emerging subjects. Leading experts in systems biology and statistics have come together to provide insight into the major ideas in the field, and in particular methods of specifying and fitting models, and estimating the unknown parameters. This book: Provides a comprehensive account of inference techniques in systems biology. Introduces classical and Bayesian statistical methods for complex systems. Explores networks and graphical modeling as well as a wide range of statistical models for dynamical systems. Discusses various applications for statistical systems biology, such as gene regulation and signal transduction. Features statistical data analysis on numerous technologies, including metabolic and transcriptomic technologies. Presents an in-depth presentation of reverse engineering approaches. Provides colour illustrations to explain key concepts. This handbook will be a key resource for researchers practising systems biology, and those requiring a comprehensive overview of this important field.
- The imprint of codons on protein structureCharlotte M Deane and Rhodri SaundersBiotechnology Journal, Jun 2011
2010
- How threshold behaviour affects the use of subgraphs for network comparisonTiago Rito, Zi Wang, Charlotte M Deane, and 1 more authorBioinformatics, Sep 2010
Motivation: A wealth of protein-protein interaction (PPI) data has recently become available. These data are organized as PPI networks and an efficient and biologically meaningful method to compare such PPI networks is needed. As a first step, we would like to compare observed networks to established network models, under the aspect of small subgraph counts, as these are conjectured to relate to functional modules in the PPI network. We employ the software tool GraphCrunch with the Graphlet Degree Distribution Agreement (GDDA) score to examine the use of such counts for network comparison. Results: Our results show that the GDDA score has a pronounced dependency on the number of edges and vertices of the networks being considered. This should be taken into account when testing the fit of models. We provide a method for assessing the statistical significance of the fit between random graph models and biological networks based on non-parametric tests. Using this method we examine the fit of Erdös-Rényi (ER), ER with fixed degree distribution and geometric (3D) models to PPI networks. Under these rigorous tests none of these models fit to the PPI networks. The GDDA score is not stable in the region of graph density relevant to current PPI networks. We hypothesize that this score instability is due to the networks under consideration having a graph density in the threshold region for the appearance of small subgraphs. This is true for both geometric (3D) and ER random graph models. Such threshold behaviour may be linked to the robustness and efficiency properties of the PPI networks. Contact: tiago@stats.ox.ac.uk Supplementary information: Supplementary data are available at Bioinformatics online.
- Exploring the potential of template-based modellingBraddon K Lance, Charlotte M Deane, and Graham R WoodBioinformatics, Jan 2010
Motivation: Template-based modelling can approximate the unknown structure of a target protein using an homologous template structure. The core of the resulting prediction then comprises the structural regions conserved between template and target. Target prediction could be improved by rigidly repositioning such single template, structurally conserved fragment regions. The purpose of this article is to quantify the extent to which such improvements are possible and to relate this extent to properties of the target, the template and their alignment. Results: The improvement in accuracy achievable when rigid fragments from a single template are optimally positioned was calculated using structure pairs from the HOMSTRAD database, as well as CASP7 and CASP8 target/best template pairs. Over the union of the structurally conserved regions, improvements of 0.7 Å in root mean squared deviation (RMSD) and 6% in GDT_HA were commonly observed. A generalized linear model revealed that the extent to which a template can be improved can be predicted using four variables. Templates with the greatest scope for improvement tend to have relatively more fragments, shorter fragments, higher percentage of helical secondary structure and lower sequence identity. Optimal positioning of the template fragments offers the potential for improving loop modelling. These results demonstrate that substantial improvement could be made on many templates if the conserved fragments were to be optimally positioned. They also provide a basis for identifying templates for which modification of fragment positions may yield such improvements. Contact: braddon.lance@mq.edu.au Supplementary information: Supplementary data are available at Bioinformatics online.
- Revisiting date and party hubs: novel approaches to role assignment in protein interaction networksSumeet Agarwal, Charlotte M Deane, Mason A Porter, and 1 more authorPLoS Computational Biology, Jun 2010
- MEDELLER: homology-based coordinate generation for membrane proteinsS Kelm, J Shi, and C M DeaneBioinformatics, Nov 2010
Membrane proteins (MPs) are important drug targets but knowledge of their exact structure is limited to relatively few examples. Existing homology-based structure prediction methods are designed for globular, water-soluble proteins. However, we are now beginning to have enough MP structures to justify the development of a homology-based approach specifically for them. We present an MP-specific homology-based coordinate generation method, MEDELLER, which is optimized to build highly reliable core models. The method outperforms the popular structure prediction programme Modeller on MPs. The comparison of the two methods was performed on 616 target-template pairs of MPs, which were classified into four test sets by their sequence identity. Across all targets, MEDELLER gave an average backbone root mean square deviation (RMSD) of 2.62 Å versus 3.16 Å for Modeller. On our ’easy’ test set, MEDELLER achieves an average accuracy of 0.93 Å backbone RMSD versus 1.56 Å for Modeller. http://medeller.info; Implemented in Python, Bash and Perl CGI for use on Linux systems; Supplementary data are available at http://www.stats.ox.ac.uk/proteins/resources.
- Directionality in protein fold prediction.Jonathan J Ellis, Fabien P E Huard, Charlotte M Deane, and 2 more authorsBMC Bioinformatics, Jan 2010
BACKGROUND: Ever since the ground-breaking work of Anfinsen et al. in which a denatured protein was found to refold to its native state, it has been frequently stated by the protein fold prediction community that all the information required for protein folding lies in the amino acid sequence. Recent in vitro experiments and in silico computational studies, however, have shown that cotranslation may affect the folding pathway of some proteins, especially those of ancient folds. In this paper aspects of cotranslational folding have been incorporated into a protein structure prediction algorithm by adapting the Rosetta program to fold proteins as the nascent chain elongates. This makes it possible to conduct a pairwise comparison of folding accuracy, by comparing folds created sequentially from each end of the protein. RESULTS: A single main result emerged: in 94% of proteins analyzed, following the sense of translation, from N-terminus to C-terminus, produced better predictions than following the reverse sense of translation, from the C-terminus to N-terminus. Two secondary results emerged. First, this superiority of N-terminus to C-terminus folding was more marked for proteins showing stronger evidence of cotranslation and second, an algorithm following the sense of translation produced predictions comparable to, and occasionally better than, Rosetta. CONCLUSIONS: There is a directionality effect in protein fold prediction. At present, prediction methods appear to be too noisy to take advantage of this effect; as techniques refine, it may be possible to draw benefit from a sequential approach to protein fold prediction.
- i-Patch: interprotein contact prediction using local network informationRebecca Hamer, Qiang Luo, Judith P Armitage, and 2 more authorsProteins, Oct 2010
Biological processes are commonly controlled by precise protein-protein interactions. These connections rely on specific amino acids at the binding interfaces. Here we predict the binding residues of such interprotein complexes. We have developed a suite of methods, i-Patch, which predict the interprotein contact sites by considering the two proteins as a network, with residues as nodes and contacts as edges. i-Patch starts with two proteins, A and B, which are assumed to interact, but for which the structure of the complex is not available. However, we assume that for each protein, we have a reference structure and a multiple sequence alignment of homologues. i-Patch then uses the propensities of patches of residues to interact, to predict interprotein contact sites. i-Patch outperforms several other tested algorithms for prediction of interprotein contact sites. It gives 59% precision with 20% recall on a blind test set of 31 protein pairs. Combining the i-Patch scores with an existing correlated mutation algorithm, McBASC, using a logistic model gave little improvement. Results from a case study, on bacterial chemotaxis protein complexes, demonstrate that our predictions can identify contact residues, as well as suggesting unknown interfaces in multiprotein complexes.
- Protein structure prediction begins well but ends badlyRhodri Saunders and Charlotte M DeaneProteins, Oct 2010
The accurate prediction of protein structure, both secondary and tertiary, is an ongoing problem. Over the years, many approaches have been implemented and assessed. Most prediction algorithms start with the entire amino acid sequence and treat all residues in an identical fashion independent of sequence position. Here, we analyze blind prediction data to investigate whether predictive capability varies along the chain. Free modeling results from recent critical assessment of techniques for protein structure prediction (CASP) experiments are evaluated; as is the most up-to-date data from EVA, a fully automated blind test of secondary structure prediction servers. The results demonstrate that structure prediction accuracy is dependent on sequence position. Both secondary structure and tertiary structure predictions are more accurate in regions near the amino(N)-terminus when compared with analogous regions near the carboxy(C)-terminus. Eight of 10 secondary structure prediction algorithms assessed by EVA perform significantly better in regions at the N-terminus. CASP data shows a similar bias, with N-terminal fragments being predicted more accurately than fragments from the C-terminus. Two analogous fragments are taken from each model, the N-terminal fragment begins at the start of the most N-terminal secondary structure element (SSE), whereas the C-terminal fragment finishes at the end of the most C-terminal SSE. Each fragment is locally superimposed onto its respective native fragment. The relative terminal prediction accuracy (RMSD) is calculated on an intramodel basis. At a fragment length of 20 residues, the N-terminal fragment is predicted with greater accuracy in 79% of cases.
- Deciphering chemotaxis pathways using cross species comparisonsRebecca Hamer, Pao-Yang Chen, Judith P Armitage, and 2 more authorsBMC Systems Biology, Jul 2010
BACKGROUND: Chemotaxis is the process by which motile bacteria sense their chemical environment and move towards more favourable conditions. Escherichia coli utilises a single sensory pathway, but little is known about signalling pathways in species with more complex systems. RESULTS: To investigate whether chemotaxis pathways in other bacteria follow the E. coli paradigm, we analysed 206 species encoding at least 1 homologue of each of the 5 core chemotaxis proteins (CheA, CheB, CheR, CheW and CheY). 61 species encode more than one of all of these 5 proteins, suggesting they have multiple chemotaxis pathways. Operon information is not available for most bacteria, so we developed a novel statistical approach to cluster che genes into putative operons. Using operon-based models, we reconstructed putative chemotaxis pathways for all 206 species. We show that cheA-cheW and cheR-cheB have strong preferences to occur in the same operon as two-gene blocks, which may reflect a functional requirement for co-transcription. However, other che genes, most notably cheY, are more dispersed on the genome. Comparison of our operons with shuffled equivalents demonstrates that specific patterns of genomic location may be a determining factor for the observed in vivo chemotaxis pathways. We then examined the chemotaxis pathways of Rhodobacter sphaeroides. Here, the PpfA protein is known to be critical for correct partitioning of proteins in the cytoplasmically-localised pathway. We found ppfA in che operons of many species, suggesting that partitioning of cytoplasmic Che protein clusters is common. We also examined the apparently non-typical chemotaxis components, CheA3, CheA4 and CheY6. We found that though variants of CheA proteins are rare, the CheY6 variant may be a common type of CheY, with a significantly disordered C-terminal region which may be functionally significant. 
CONCLUSIONS: We find that many bacterial species potentially have multiple chemotaxis pathways, with grouping of che genes into operons likely to be a major factor in keeping signalling pathways distinct. Gene order is highly conserved with cheA-cheW and cheR-cheB blocks, perhaps reflecting functional linkage. CheY behaves differently to other Che proteins, both in its genomic location and its putative protein interactions, which should be considered when modelling chemotaxis pathways.
- Evolutionary analysis reveals low coverage as the major challenge for protein interaction network alignmentWaqar Ali and Charlotte M DeaneMolecular Biosystems, Nov 2010
Local alignments of protein interaction networks have found little conservation among several species. While this could be a consequence of the incompleteness of interaction data-sets and presence of error, an intriguing prospect is that the process of network evolution is sufficient to erase any evidence of conservation. Here, we aim to test this hypothesis using models of network evolution and also investigate the role of error in the results of network alignment. We devised a distance metric based on summary statistics to assess the fit between experimental and simulated network alignments. Our results indicate that network evolution alone is unlikely to account for the poor quality alignments given by real data. Alignments of simulated networks undergoing evolution are considerably (4 to 5 times) larger than real alignments. We compare several error models in their ability to explain this discrepancy. Our estimates of false negative rates vary from 20 to 60% dependent on whether incomplete proteome sampling is taken into account or not. We also find that false positives appear to affect network alignments little compared to false negatives indicating that incompleteness, not spurious links, is the major challenge for interactome-level comparisons.
- FREAD revisited: Accurate loop structure prediction using a database search algorithmYoonjoo Choi and Charlotte M DeaneProteins, May 2010
Loops are the most variable regions of protein structure and are, in general, the least accurately predicted. Their prediction has been approached in two ways, ab initio and database search. In recent years, it has been thought that ab initio methods are more powerful. In light of the continued rapid expansion in the number of known protein structures, we have re-evaluated FREAD, a database search method, and demonstrate that the power of database search methods may have been underestimated. We found that sequence similarity as quantified by environment specific substitution scores can be used to significantly improve prediction. In fact, FREAD performs appreciably better for an identifiable subset of loops (two thirds of shorter loops and half of the longer loops tested) than the ab initio methods of MODELLER, PLOP, and RAPPER. Within this subset, FREAD’s predictive ability is length independent, in general, producing results within 2 Å RMSD, compared to an average of over 10 Å for loop length 20 for any of the other tested methods. We also benchmarked the prediction protocols on a set of 212 loops from the model structures in CASP 7 and 8. An extended version of FREAD is able to make predictions for 127 of these; it gives the best prediction of the methods tested in 61 of these cases. In examining FREAD’s ability to predict in the model environment, we found that whole structure quality did not affect the quality of loop predictions.
- Synonymous codon usage influences the local protein structure observedRhodri Saunders and Charlotte M DeaneNucleic Acids Research, Jan 2010
Translation of mRNA into protein is a unidirectional information flow process. Analysing the input (mRNA) and output (protein) of translation, we find that local protein structure information is encoded in the mRNA nucleotide sequence. The Coding Sequence and Structure (CSandS) database developed in this work provides a detailed mapping between over 4000 solved protein structures and their mRNA. CSandS facilitates a comprehensive analysis of codon usage over many organisms. In assigning translation speed, we find that relative codon usage is less informative than tRNA concentration. For all speed measures, no evidence was found that domain boundaries are enriched with slow codons. In fact, genes seemingly avoid slow codons around structurally defined domain boundaries. Translation speed, however, does decrease at the transition into secondary structure. Codons are identified that have structural preferences significantly different from the amino acid they encode. However, each organism has its own set of ‘significant codons’. Our results support the premise that codons encode more information than merely amino acids and give insight into the role of translation in protein folding.
- Predicting protein-protein interactions in the context of protein evolutionAnna C F Lewis, Ramazan Saeed, and Charlotte M DeaneMolecular Biosystems, Jan 2010
Here we review the methods for the prediction of protein interactions and the ideas in protein evolution that relate to them. The evolutionary assumptions implicit in many of the protein interaction prediction methods are elucidated. We draw attention to the caution needed in deploying certain evolutionary assumptions, in particular cross-organism transfer of interactions by sequence homology, and discuss the known issues in deriving interaction predictions from evidence of co-evolution. We also conjecture that there is evolutionary knowledge yet to be exploited in the prediction of interactions, in particular the heterogeneity of interactions, the increasing availability of interaction data from multiple species, and the models of protein interaction network growth.
2009
- Functionally guided alignment of protein interaction networks for module detection. Waqar Ali and Charlotte M Deane. Bioinformatics, Jan 2009
Motivation: Functional module detection within protein interaction networks is a challenging problem due to the sparsity of data and presence of errors. Computational techniques for this task range from purely graph theoretical approaches involving single networks to alignment of multiple networks from several species. Current network alignment methods all rely on protein sequence similarity to map proteins across species. Results: Here we carry out network alignment using a protein functional similarity measure. We show that using functional similarity to map proteins across species improves network alignment in terms of functional coherence and overlap with experimentally verified protein complexes. Moreover, the results from functional similarity-based network alignment display little overlap (15%) with sequence similarity-based alignment. Our combined approach integrating sequence and function-based network alignment alongside graph clustering properties offers a 200% increase in coverage of experimental datasets and comparable accuracy to current network alignment methods. Availability: Program binaries and source code are freely available at http://www.stats.ox.ac.uk/research/bioinfo/resources Contact: ali@stats.ox.ac.uk Supplementary Information: Supplementary data are available at Bioinformatics online.
- iMembrane: homology-based membrane-insertion of proteins. Sebastian Kelm, Jiye Shi, and Charlotte M Deane. Bioinformatics, Apr 2009
iMembrane is a homology-based method, which predicts a membrane protein’s position within a lipid bilayer. It projects the results of coarse-grained molecular dynamics simulations onto any membrane protein structure or sequence provided by the user. iMembrane is simple to use and is currently the only computational method allowing the rapid prediction of a membrane protein’s lipid bilayer insertion. Bilayer insertion data are essential in the accurate structural modelling of membrane proteins or the design of drugs that target them.
- Proteomic Analysis of Microtubule-associated Proteins during Macrophage Activation. Prerna C Patel, Katherine H Fisher, Eric C C Yang, and 2 more authors. Molecular and Cellular Proteomics, Nov 2009
Classical activation of macrophages induces a wide range of signaling and vesicle trafficking events to produce a more aggressive cellular phenotype. The microtubule (MT) cytoskeleton is crucial for the regulation of immune responses. In the current study, we used a large scale proteomics approach to analyze the change in protein composition of the MT-associated protein (MAP) network by macrophage stimulation with the inflammatory cytokine interferon-γ and the endotoxin lipopolysaccharide. Overall the analysis identified 409 proteins that bound directly or indirectly to MTs. Of these, 52 were up-regulated 2-fold or greater and 42 were down-regulated 2-fold or greater after interferon-γ/lipopolysaccharide stimulation. Bioinformatics analysis based on publicly available binary protein interaction data produced a putative interaction network of MAPs in activated macrophages. We confirmed the up-regulation of several MAPs by immunoblotting and immunofluorescence analysis. More detailed analysis of one up-regulated protein revealed a role for HSP90β in stabilization of the MT cytoskeleton during macrophage activation.
- Reduced amounts and abnormal forms of phospholipase C zeta (PLCzeta) in spermatozoa from infertile men. E Heytens, J Parrington, K Coward, and 15 more authors. Human Reproduction, Oct 2009
BACKGROUND: In mammals, oocyte activation at fertilization is thought to be induced by the sperm-specific phospholipase C zeta (PLCzeta). However, it still remains to be conclusively shown that PLCzeta is the endogenous agent of oocyte activation. Some types of human infertility appear to be caused by failure of the sperm to activate and this may be due to specific defects in PLCzeta. METHODS AND RESULTS: Immunofluorescence studies showed PLCzeta to be localized in the equatorial region of sperm from fertile men, but sperm deficient in oocyte activation exhibited no specific signal in this same region. Immunoblot analysis revealed reduced amounts of PLCzeta in sperm from infertile men, and in some cases, the presence of an abnormally low molecular weight form of PLCzeta. In one non-globozoospermic case, DNA analysis identified a point mutation in the PLCzeta gene that leads to a significant amino acid change in the catalytic region of the protein. Structural modelling suggested that this defect may have important effects upon the structure and function of the PLCzeta protein. cRNA corresponding to mutant PLCzeta failed to induce calcium oscillations when microinjected into mouse oocytes. Injection of infertile human sperm into mouse oocytes failed to activate the oocyte or trigger calcium oscillations. Injection of such infertile sperm followed by two calcium pulses, induced by assisted oocyte activation, activated the oocytes without inducing the typical pattern of calcium oscillations. CONCLUSIONS: Our findings illustrate the importance of PLCzeta during fertilization and suggest that mutant forms of PLCzeta may underlie certain types of human male infertility.
2008
- Predicting and validating protein interactions using network structure. Pao-Yang Chen, Charlotte M Deane, and Gesine Reinert. PLoS Computational Biology, Oct 2008
Protein interactions play a vital part in the function of a cell. As experimental techniques for detection and validation of protein interactions are time consuming, there is a need for computational methods for this task. Protein interactions appear to form a network with a relatively high degree of local clustering. In this paper we exploit this clustering by suggesting a score based on triplets of observed protein interactions. The score utilises both protein characteristics and network properties. Our score based on triplets is shown to complement existing techniques for predicting protein interactions, outperforming them on data sets which display a high degree of clustering. The predicted interactions score highly against test measures for accuracy. Compared to a similar score derived from pairwise interactions only, the triplet score displays higher sensitivity and specificity. By looking at specific examples, we show how an experimental set of interactions can be enriched and validated. As part of this work we also examine the effect of different prior databases upon the accuracy of prediction and find that the interactions from the same kingdom give better results than from across kingdoms, suggesting that there may be fundamental differences between the networks. These results all emphasize that network structure is important and helps in the accurate prediction of protein interactions. The protein interaction data set and the program used in our analysis, and a list of predictions and validations, are available at http://www.stats.ox.ac.uk/bioinfo/resources/PredictingInteractions.
- Classifying proteinlike sequences in arbitrary lattice protein models using LatPack. Martin Mann, Daniel Maticzka, Rhodri Saunders, and 1 more author. HFSP Journal, Dec 2008
Knowledge of a protein’s three-dimensional native structure is vital in determining its chemical properties and functionality. However, experimental methods to determine structure are very costly and time-consuming. Computational approaches such as folding simulations and structure prediction algorithms are quicker and cheaper but lack consistent accuracy. This currently restricts extensive computational studies to abstract protein models. It is thus essential that simplifications induced by the models do not negate scientific value. Key to this is the use of thoroughly defined proteinlike sequences. In such cases abstract models can allow for the investigation of important biological questions. Here, we present a procedure to generate and classify proteinlike sequence data sets. Our LatPack tools and the approach in general are applicable to arbitrary lattice protein models. Identification is based on thermodynamic and kinetic features and incorporates the sequential assembly of proteins by addressing cotranslational folding. We demonstrate the approach in the widely used unrestricted 3D-cubic HP-model. The resulting sequence set is the first large data set for this model exhibiting the proteinlike properties required. Our data and tools are freely available and can be used to investigate protein-related problems.
- A Microtubule Interactome: Complexes with Roles in Cell Cycle and Mitosis. Julian R Hughes, Ana M Meireles, Katherine H Fisher, and 7 more authors. PLoS Biology, Apr 2008
The microtubule (MT) cytoskeleton is required for many aspects of cell function, including the transport of intracellular materials, the maintenance of cell polarity, and the regulation of mitosis. These functions are coordinated by MT-associated proteins (MAPs), which work in concert with each other, binding MTs and altering their properties. We have used a MT cosedimentation assay, combined with 1D and 2D PAGE and mass spectrometry, to identify over 250 MAPs from early Drosophila embryos. We have taken two complementary approaches to analyse the cellular function of novel MAPs isolated using this approach. First, we have carried out an RNA interference (RNAi) screen, identifying 21 previously uncharacterised genes involved in MT organisation. Second, we have undertaken a bioinformatics analysis based on binary protein interaction data to produce putative interaction networks of MAPs. By combining both approaches, we have identified and validated MAP complexes with potentially important roles in cell cycle regulation and mitosis. This study therefore demonstrates that biologically relevant data can be harvested using such a multidisciplinary approach, and identifies new MAPs, many of which appear to be important in cell division. The microtubule cytoskeleton is crucial for many aspects of cell function. A new multidisciplinary study identifies microtubule-associated proteins that are important in cell division.
- An assessment of the uses of homologous interactions. Ramazan Saeed and Charlotte Deane. Bioinformatics, Jan 2008
Motivation: Protein-protein interactions have proved to be a valuable starting point for understanding the inner workings of the cell. Computational methodologies have been built which both predict interactions and use interaction datasets in order to predict other protein features. Such methods require gold standard positive (GSP) and negative (GSN) interaction sets. Here we examine and demonstrate the usefulness of homologous interactions in predicting good quality positive and negative interaction datasets. Results: We generate GSP interaction sets as subsets from experimental data using only interaction and sequence information. We can therefore produce sets for several species (many of which at present have no identified GSPs). Comprehensive error rate testing demonstrates the power of the method. We also show how the use of our datasets significantly improves the predictive power of algorithms for interaction prediction and function prediction. Furthermore, we generate GSN interaction sets for yeast and examine the use of homology along with other protein properties such as localization, expression and function. Using a novel method to assess the accuracy of a negative interaction set, we find that the best single selector for negative interactions is a lack of co-function. However, an integrated method using all the characteristics shows significant improvement over any current method for identifying GSN interactions. The nature of homologous interactions is also examined and we demonstrate that interologs are found more commonly within species than across species. Conclusion: GSP sets built using our homologous verification method are demonstrably better than standard sets in terms of predictive ability. We can build such GSP sets for several species. When generating GSNs we show a combination of protein features and lack of homologous interactions gives the highest quality interaction sets. 
Availability: GSP and GSN datasets for all the studied species can be downloaded from http://www.stats.ox.ac.uk/deane/HPIV Contact: saeed@stats.ox.ac.uk Supplementary information: Supplementary data are available at Bioinformatics online.
- Electrostatic and functional analysis of the seven-bladed WD beta-propellers. Najl V Valeyev, A Kristina Downing, John Sondek, and 1 more author. Evolutionary Bioinformatics, Jan 2008
beta-propeller domains composed of WD repeats are highly ubiquitous and typically used as multi-site docking platforms to coordinate and integrate the activities of groups of proteins. Here, we have used extensive homology modelling of the WD40-repeat family of seven-bladed beta-propellers coupled with subsequent structural classification and clustering of these models to define subfamilies of beta-propellers with common structural, and probable, functional characteristics. We show that it is possible to assign seven-bladed WD beta-propeller proteins into functionally different groups based on the information gained from homology modelling. We examine general structural diversity within the WD40-repeat family of seven-bladed beta-propellers and demonstrate that seven-bladed beta-propellers composed of WD-repeats are structurally distinct from other seven-bladed beta-propellers. We further provide some insights into the multifunctional diversity of the seven-bladed WD beta-propeller surfaces. This report once again reinforces the importance of structural data and the usefulness of homology models in functional classification.
- The functional domain grouping of microtubule associated proteins. K H Fisher, C M Deane, and J G Wakefield. Communicative and Integrative Biology, Jan 2008
2007
- Cotranslational protein folding–fact or fiction? Charlotte M Deane, Mingqiang Dong, Fabien P E Huard, and 2 more authors. Bioinformatics, Jul 2007
MOTIVATION: Experimentalists have amassed extensive evidence over the past four decades that proteins appear to fold during production by the ribosome. Protein structure prediction methods, however, do not incorporate this property of folding. A thorough study to find the fingerprint of such sequential folding is the first step towards using it in folding algorithms, so assisting structure prediction. RESULTS: We explore computationally the existence of evidence for cotranslational folding, based on large sets of experimentally determined structures in the PDB. Our perspective is that cotranslational folding is the norm, but that the effect is masked in most classes. We show that it is most evident in alpha/beta proteins, confirming recent findings. We also find mild evidence that older proteins may fold cotranslationally. A tool is provided for determining, within a protein, where cotranslation is most evident.
- Linking evolution of protein structures through fragments. Sanne Abeln and Charlotte Deane. BMC Systems Biology, Jul 2007
- A statistical approach using network structure in the prediction of protein characteristics. Pao-Yang Chen, Charlotte M Deane, and Gesine Reinert. Bioinformatics, Jan 2007
Motivation: The Majority Vote approach has demonstrated that protein–protein interactions can be used to predict the structure or function of a protein. In this article we propose a novel method for the prediction of such protein characteristics based on frequencies of pairwise interactions. In addition, we study a second new approach using the pattern frequencies of triplets of proteins, thus for the first time taking network structure explicitly into account. Both these methods are extended to jointly consider multiple organisms and multiple characteristics. Results: Compared to the standard non-network-based method, namely the Majority Vote method, in large networks our predictions tend to be more accurate. For structure prediction, the Frequency-based method reaches up to 71% accuracy, and the Triplet-based method reaches up to 72% accuracy, whereas for function prediction, both the Triplet-based method and the Frequency-based method reach up to 90% accuracy. Function prediction on proteins without homologues showed slightly less but comparable accuracies. Including partially annotated proteins substantially increases the number of proteins for which our methods predict their characteristics with reasonable accuracy. We find that the enhanced Triplet-based method does not currently yield significantly better results than the enhanced Frequency-based method, suggesting that triplets of interactions do not contain substantially more information about protein characteristics than interaction pairs. Our methods offer two main improvements over current approaches. First, multiple protein characteristics are considered simultaneously, and second, data is integrated from multiple species. In addition, the Triplet-based method includes network structure more explicitly than the Majority Vote and the Frequency-based method. Availability: The program is available upon request. 
Contact: pchen@stats.ox.ac.uk Supplementary information: Supplementary data are available at Bioinformatics online.
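As an illustration of the baseline that the abstract above compares against, the following is a minimal sketch of a majority-vote predictor: a protein is assigned the characteristic that is most common among its annotated interaction partners. This is not the authors' program; the function name `majority_vote` and the `neighbours` and `annotation` dictionaries are hypothetical names chosen for the example, and ties are resolved arbitrarily.

```python
from collections import Counter

def majority_vote(protein, neighbours, annotation):
    """Predict a characteristic for `protein` from its interaction partners.

    neighbours: dict mapping a protein to an iterable of its interaction partners
                (hypothetical input format, assumed for this sketch).
    annotation: dict mapping annotated proteins to a single label.
    Returns the most frequent label among annotated partners, or None if
    no partner carries an annotation.
    """
    labels = [annotation[n] for n in neighbours.get(protein, ()) if n in annotation]
    if not labels:
        return None
    # most_common(1) returns a one-element list of (label, count) pairs.
    return Counter(labels).most_common(1)[0][0]

# Toy data, invented purely for illustration.
neighbours = {"P": ["Q", "R", "S"], "T": ["U"]}
annotation = {"Q": "kinase", "R": "kinase", "S": "transport"}

prediction_p = majority_vote("P", neighbours, annotation)
prediction_t = majority_vote("T", neighbours, annotation)
```

The frequency-based and triplet-based methods described in the abstract go beyond this by using how often pairs (or triplets) of characteristics co-occur across the network, which this sketch does not attempt.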
- Using Phylogeny to Improve Genome-Wide Distant Homology Recognition. Sanne Abeln, Carlo Teubner, and Charlotte M Deane. PLoS Computational Biology, Jan 2007
The gap between the number of known protein sequences and structures continues to widen, particularly as a result of sequencing projects for entire genomes. Recently there have been many attempts to generate structural assignments to all genes on sets of completed genomes using fold-recognition methods. We developed a method that detects false positives made by these genome-wide structural assignment experiments by identifying isolated occurrences. The method was tested using two sets of assignments, generated by SUPERFAMILY and PSI-BLAST, on 150 completed genomes. A phylogeny of these genomes was built and a parsimony algorithm was used to identify isolated occurrences by detecting occurrences that cause a gain at leaf level. Isolated occurrences tend to have high e-values, and in both sets of assignments, a sudden increase in isolated occurrences is observed for e-values 10⁻⁸ for SUPERFAMILY and 10⁻⁴ for PSI-BLAST. Conditions to predict false positives are based on these results. Independent tests confirm that the predicted false positives are indeed more likely to be incorrectly assigned. Evaluation of the predicted false positives also showed that the accuracy of profile-based fold-recognition methods might depend on secondary structure content and sequence length. We show that false positives generated by fold-recognition methods can be identified by considering structural occurrence patterns on completed genomes; occurrences that are isolated within the phylogeny tend to be less reliable. The method provides a new independent way to examine the quality of fold assignments and may be used to improve the output of any genome-wide fold assignment method. When predicting the structure for a protein sequence, a major step is achieved when an evolutionarily related protein with a known structure is identified. This process is called fold recognition, and has been a major force behind improvement in structure prediction.
Moreover, fold-recognition techniques have become increasingly important in recent years because of the huge numbers of protein sequences with unknown structures available through sequencing projects on completed genomes. However, all fold-recognition methods tend to produce either a large number of false negatives (at high confidence scores) or a large number of false positives (at low confidence scores). Here we show that the reliability of a fold-recognition technique can be explored by analysing its predictions across a set of completed genomes. We have developed a method that can indicate false positives in these genome-wide assignment sets. The basic idea behind the method is that a fold assignment on a genome is less reliable if the prediction is not observed on evolutionarily related genomes. The ability of the method to discriminate false positives is confirmed by independent tests. The method can be used on the output of any genome-wide fold assignment method.
2006
- Modelling sequential protein folding under kinetic control. F P E Huard, Charlotte M Deane, and Graham R Wood. Bioinformatics, Mar 2006
MOTIVATION: This study presents a novel investigation of the effect of kinetic control on cotranslational protein folding. We demonstrate the effect using simple HP lattice models and show that the cotranslational folding of proteins under kinetic control has a significant impact on the final conformation. Differences arise if nature is not capable of pushing a partially folded protein back over a large energy barrier. For this reason we argue that such constraints should be incorporated into structure prediction techniques. We introduce a finite surmountable energy barrier which allows partially formed chains to partly unfold, and permits us to enumerate exhaustively all energy pathways. RESULTS: We compare the ground states obtained sequentially with the global ground states of designing sequences (those with a unique global ground state). We find that the sequential ground states become less numerous and more compact as the surmountable energy barrier increases. We also introduce a probabilistic model to describe the distribution of final folds and allow partial settling to the Boltzmann distribution of states at each stage. As a result, conformations with the highest probability of final occurrence are not necessarily the ones of lowest energy. AVAILABILITY: Software available on request.
2005
- How old is your fold? Henry F Winstanley, Sanne Abeln, and Charlotte M Deane. Bioinformatics, Jun 2005
MOTIVATION: At present there exists no age estimate for the different protein structures found in nature. It has become clear from occurrence studies that different folds arose at different points in evolutionary time. An estimation of the age of different folds would be a starting point for many investigations into protein structure evolution: how we arrived at the set of folds we see today. It would also be a powerful tool in protein structure classification allowing us to reassess the available hierarchical methods and perhaps suggest improvements. RESULTS: We have created the first relative age estimation technique for protein folds. Our method is based on constructing parsimonious scenarios, which can describe occurrence patterns in a phylogeny of species. The ages presented are shown to be robust to the different trees or data types used for their generation. They show correlations with other previously used protein age estimators, but appear to be far more discriminating than any previously suggested technique. The age estimates given are not absolutes but they already offer intriguing insights, like the very different age patterns of alpha/beta folds compared with small folds. The alpha/beta folds appear on average to be far older than their small fold counterparts. AVAILABILITY: Example trees and additional material are available at http://www.stats.ox.ac.uk/abeln/foldage SUPPLEMENTARY INFORMATION: http://www.stats.ox.ac.uk/abeln/foldage.
- Fold usage on genomes and protein fold evolution. Sanne Abeln and Charlotte M Deane. Proteins, Sep 2005
We review fold usage on completed genomes to explore protein structure evolution. The patterns of presence or absence of folds on genomes give us insights into the relationships between folds, the age of different folds and how we have arrived at the set of folds we see today. We examine the relationships between different measures which describe protein fold usage, such as the number of copies of a fold per genome, the number of families per fold, and the number of genomes a fold occurs on. We obtained these measures of fold usage by searching for the structural domains on 157 completed genome sequences from all three kingdoms of life. In our comparisons of these measures we found that bacteria have relatively more distinct folds on their genomes than archaea. Eukaryotes were found to have many more copies of a fold on their genomes. If we separate out the different fold classes, the alpha/beta class has relatively fewer distinct folds on large genomes, more copies of a fold on bacteria and more folds occurring in all three kingdoms simultaneously. These results possibly indicate that most alpha/beta folds originated earlier than other folds. The expected power law distribution is observed for copies of a fold per genome and we found a similar distribution for the number of families per fold. However, a more complicated distribution appears for fold occurrence across genomes, which strongly depends on fold class and kingdom. We also show that there is not a clear relationship between the three measures of fold usage. A fold which occurs on many genomes does not necessarily have many copies on each genome. Similarly, folds with many copies do not necessarily have many families or vice versa.
2004
- Solution structure and dynamics of a prototypical chordin-like cysteine-rich repeat (von Willebrand Factor type C module) from collagen IIA. Joanne M O’Leary, John M Hamilton, Charlotte M Deane, and 3 more authors. Journal of Biological Chemistry, Dec 2004
Chordin-like cysteine-rich (CR) repeats (also referred to as von Willebrand factor type C (VWC) modules) have been identified in approximately 200 extracellular matrix proteins. These repeats, named on the basis of amino acid conservation of 10 cysteine residues, have been shown to bind members of the transforming growth factor-beta (TGF-beta) superfamily and are proposed to regulate growth factor signaling. Here we describe the intramolecular disulfide bonding, solution structure, and dynamics of a prototypical chordin-like CR repeat from procollagen IIA (CR(ColIIA)), which has been previously shown to bind TGF-beta1 and bone morphogenetic protein-2. The CR(ColIIA) structure manifests a two sub-domain architecture tethered by a flexible linkage. Initial structures were calculated using RosettaNMR, a de novo prediction method, and final structure calculations were performed using CANDID within CYANA. The N-terminal region contains mainly beta-sheet and the C-terminal region is more irregular with the fold constrained by disulfide bonds. Mobility between the N- and C-terminal sub-domains on a fast timescale was confirmed using NMR relaxation measurements. We speculate that the mobility between the two sub-domains may decrease upon ligand binding. Structure and sequence comparisons have revealed an evolutionary relationship between the N-terminal sub-domain of the CR module and the fibronectin type 1 domain, suggesting that these domains share a common ancestry. Based on the previously reported mapping of fibronectin binding sites for vascular endothelial growth factor to regions containing fibronectin type 1 domains, we discuss the possibility that this structural homology might also have functional relevance.
2002
- Protein interactions: two methods for assessment of the reliability of high throughput observations. Charlotte M Deane, Łukasz Salwiński, Ioannis Xenarios, and 1 more author. Mol. Cell. proteomics MCP, May 2002
High throughput methods for detecting protein interactions require assessment of their accuracy. We present two forms of computational assessment. The first method is the expression profile reliability (EPR) index. The EPR index estimates the biologically relevant fraction of protein interactions detected in a high throughput screen. It does so by comparing the RNA expression profiles for the proteins whose interactions are found in the screen with expression profiles for known interacting and non-interacting pairs of proteins. The second form of assessment is the paralogous verification method (PVM). This method judges an interaction likely if the putatively interacting pair has paralogs that also interact. In contrast to the EPR index, which evaluates datasets of interactions, PVM scores individual interactions. On a test set, PVM identifies correctly 40% of true interactions with a false positive rate of approximately 1%. EPR and PVM were applied to the Database of Interacting Proteins (DIP), a large and diverse collection of protein-protein interactions that contains over 8000 Saccharomyces cerevisiae pairwise protein interactions. Using these two methods, we estimate that approximately 50% of them are reliable, and with the aid of PVM we identify confidently 3003 of them. Web servers for both the PVM and EPR methods are available on the DIP website (dip.doe-mbi.ucla.edu/Services.cgi).
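The idea behind the paralogous verification method described above (an interaction is judged likely if the pair has paralogs that also interact) can be sketched in a few lines. This is a simplified illustration and not the published PVM implementation: the function name, the `paralogs` mapping and the toy data are hypothetical, and the sketch assumes that any interaction between the two paralog families other than the query pair itself counts as support.

```python
from itertools import product

def has_interacting_paralogs(pair, interactions, paralogs):
    """Return True if another pair drawn from the two paralog families interacts.

    pair:         tuple (a, b) of protein identifiers.
    interactions: set of frozensets, each holding the two partners of an
                  observed interaction (hypothetical input format).
    paralogs:     dict mapping a protein to the set of its paralogs.
    """
    a, b = pair
    # Each family is the protein itself plus its paralogs.
    family_a = {a} | paralogs.get(a, set())
    family_b = {b} | paralogs.get(b, set())
    for x, y in product(family_a, family_b):
        if {x, y} == {a, b}:
            continue  # the query pair cannot support itself
        if frozenset((x, y)) in interactions:
            return True
    return False

# Toy data, invented purely for illustration.
interactions = {frozenset(p) for p in [("A1", "B1"), ("A2", "B2"), ("C1", "D1")]}
paralogs = {"A1": {"A2"}, "B1": {"B2"}}

supported = has_interacting_paralogs(("A1", "B1"), interactions, paralogs)
unsupported = has_interacting_paralogs(("C1", "D1"), interactions, paralogs)
```

Because the check is made per pair, it scores individual interactions, which is the contrast the abstract draws with the EPR index that evaluates whole datasets.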
2001
- SCORE: predicting the core of protein models. C M Deane, Q Kaas, and T L Blundell. Bioinformatics, Jun 2001
MOTIVATION: The prediction of the regions of homology models that can be ‘restrained by’ or ‘copied from’ the basis structures is a vital step in correct model generation, because these regions are the model’s most accurate part. However, there is no ideal method for the identification of their limits. In most algorithms their length depends on the number of family members and definitions of secondary structure. RESULTS: The algorithm SCORE steps away from the conventional definitions of the core to identify from large numbers of basis structures those regions that can be considered structurally related to a target sequence. The use of phi, psi constraints to accurately pinpoint the regions that are conserved across a family and environmentally constrained substitution tables to extend these regions allows SCORE to rapidly (generally in under 1 s, an order of magnitude faster than methods such as MODELLER) identify and build the core of homology models from the alignments of the target sequence to the basis structures. The SCORE algorithm was used to build 114 model cores. In only two cases was the core size less than 50% of the structure and all the cores built had an RMSD of 3.7 Å or less to the target structure.
- Improved protein loop prediction from sequence alone. D F Burke and C M Deane. Protein Eng., Jul 2001
The SLoop database of supersecondary fragments, first described by Donate et al. (Protein Sci., 1996, 5, 2600-2616), contains protein loops, classified according to structural similarity. The database has recently been updated and currently contains over 10 000 loops up to 20 residues in length, which cluster into over 560 well populated classes. The database can be found at http://www-cryst.bioc.cam.ac.uk/sloop. In this paper, we identify conserved structural features such as main chain conformation and hydrogen bonding. Using the original approach of Rufino and co-workers (1997), the correct structural class is predicted with the highest SLoop score for 35% of loops. This rises to 65% by considering the three highest scoring class predictions and to 75% in the top five scoring class predictions. Inclusion of residues from the neighbouring secondary structures and use of substitution tables derived using a reduced definition of secondary structure increase these prediction accuracies to 58, 78 and 85%, respectively. This suggests that capping residues can stabilize the loop conformation as well as that of the secondary structure. Further increases are achieved if only well-populated classes are considered in the prediction. These results correspond to an average loop root mean square deviation of between 0.4 and 2.6 Å for loops up to five residues in length.
- Sequence-structure homology recognition by iterative alignment refinement and comparative modeling. M G Williams, H Shirai, J Shi, and 12 more authors. Proteins, Jul 2001
Our approach to fold recognition for the fourth critical assessment of techniques for protein structure prediction (CASP4) experiment involved the use of the FUGUE sequence-structure homology recognition program (http://www-cryst.bioc.cam.ac.uk/fugue), followed by model building. We treat models as hypotheses and examine these to determine whether they explain the available data. Our method depends heavily on environment-specific substitution tables derived from our database of structural alignments of homologous proteins (HOMSTRAD, http://www-cryst.bioc.cam.ac.uk/homstrad/). FUGUE uses these tables to incorporate structural information into profiles created from HOMSTRAD alignments that are matched against a profile created for the target from multiple sequence alignment. In addition, environment-specific substitution tables are used throughout the modeling procedure and as part of the model evaluation. Annotation of sequence alignments with JOY, to reflect local structural features, proved valuable, both for modifying hypotheses, and for rejecting predictions when the expected pattern of conservation is not observed. Our stringency in rejecting incorrect predictions led us to submit a relatively small number of models, including only a low number of false positives, resulting in a high average score.
- CODA: a combined algorithm for predicting the structurally variable regions of protein models. C M Deane and T L Blundell. Protein Sci. A Publ. Protein Soc., Mar 2001
CODA, an algorithm for predicting the variable regions in proteins, combines FREAD, a knowledge-based approach, and PETRA, which constructs the region ab initio. FREAD selects from a database of protein structure fragments with environmentally constrained substitution tables and other rule-based filters. FREAD was parameterized and tested on over 3000 loops. The average root mean square deviation ranged from 0.78 Å for three residue loops to 3.5 Å for eight residue loops on a nonhomologous test set. CODA clusters the predictions from the two independent programs and makes a consensus prediction that must pass a set of rule-based filters. CODA was parameterized and tested on two separate sets of structures that were nonhomologous to one another and to those found in the FREAD database. The average root mean square deviation in the test set ranged from 0.76 Å for three residue loops to 3.09 Å for eight residue loops. CODA shows a general improvement in loop prediction over PETRA and FREAD individually. The improvement is far more marked for lengths six and upward, probably as the predictive power of PETRA becomes more important. CODA was further tested on several model structures to determine its applicability to the modeling situation. A web server of CODA is available at http://www-cryst.bioc.cam.ac.uk/charlotte/Coda/search_coda.html.
- The Role and Predicted Propensity of Conserved Proline Residues in the 5-HT3 Receptor. Charlotte M Deane and Sarah C R Lummis. J. Biol. Chem., Dec 2001
5-HT3 receptors possess a number of highly conserved proline residues. We changed each of these to alanine, expressed the mutants as homomeric 5-HT3A receptors in HEK293 cells, and analyzed them with radioligand binding, electrophysiology, and immunocytochemistry. Mutation of Pro56, Pro104, Pro123, and Pro170 resulted in ablation of radioligand binding, whereas mutation of Pro257 and Pro301 did not. Only the latter were expressed at the plasma membrane but were non-functional. Thus the former, which are in the N-terminal domain, may be involved in forming correct receptor structure, while those in the transmembrane region (Pro257 and Pro301) are necessary for the function of the protein. To explore the conformational preference (propensity) of these residues we examined the proportion of cis-prolines and the influence of adjacent residues in known protein structures. 4.7% of prolines in the protein data base were in the cis conformation, and the distribution of amino acids adjacent to cis-prolines was not random. Comparison of the proportion of each amino acid residue adjacent to a cis-proline revealed that aromatic and bend-facilitating residues were favored while those with β-branched chains were not. Thus five residues (Gly, Pro, Tyr, Trp, Phe) and three residues (Pro, Tyr, Phe) were found more frequently than expected before and after cis-prolines respectively, whereas five residues (Val, Ile, Leu, Asp, Thr) and two residues (Asp, Glu) were found less frequently. Of the 20 proline residues in the 5-HT3A receptor subunit only Pro170 has adjacent residues that are favorable. Mutating these to non-favorable residues resulted in ablation of ligand binding, whereas replacement with alternative favorable residues did not. We therefore propose that Pro170, which is part of the characteristic cys-loop found in this family of proteins, may be in the cis conformation.
2000
- Browsing the SLoop database of structurally classified loops connecting elements of protein secondary structure. D F Burke, C M Deane, and T L Blundell. Bioinformatics, Jun 2000
We describe a web server that provides easy access to the SLoop database of loop conformations connecting elements of protein secondary structure. The loops are classified according to their length, the type of bounding secondary structures and the conformation of the main chain. The current release of the database consists of over 8000 loops of up to 20 residues in length. A loop prediction method, which selects conformers on the basis of the sequence and the positions of the elements of secondary structure, is also implemented. These web pages are freely accessible over the internet at http://www-cryst.bioc.cam.ac.uk/~sloop.
- A novel exhaustive search algorithm for predicting the conformation of polypeptide segments in proteins. C M Deane and T L Blundell. Proteins, Jul 2000
We present a fast ab initio method for the prediction of local conformations in proteins. The program, PETRA, selects polypeptide fragments from a computer-generated database (APD) encoding all possible peptide fragments up to twelve amino acids long. Each fragment is defined by a representative set of eight straight phi/psi pairs, obtained iteratively from a trial set by calculating how fragments generated from them represent the protein databank (PDB). Ninety-six percent (96%) of length five fragments in crystal structures, with a resolution better than 1.5 Å and less than 25% identity, have a conformer in the database with less than 1 Å root-mean-square deviation (rmsd). In order to select segments from APD, PETRA uses a set of simple rule-based filters, thus reducing the number of potential conformations to a manageable total. This reduced set is scored and sorted using rmsd fit to the anchor regions and a knowledge-based energy function dependent on the sequence to be modelled. The best scoring fragments can then be optimized by minimization of contact potentials and rmsd fit to the core model. The quality of the prediction made by PETRA is evaluated by calculating both the differences in rmsd and backbone torsion angles between the final model and the native fragment. The average rmsd ranges from 1.4 Å for three residue loops to 3.9 Å for eight residue loops.
1999
- An iterative structure-assisted approach to sequence alignment and comparative modeling. D F Burke, C M Deane, H A Nagarajaram, and 12 more authors. Proteins, Jul 1999
Correct alignment of the sequence of a target protein with those of homologues of known three-dimensional structure is a key step in comparative modeling. Usually an iterative approach that takes account of the local and overall structural features is required. We describe such an approach that exploits databases of structural alignments of homologous proteins (HOMSTRAD, http://www-cryst.bioc.cam.ac.uk/~homstrad) and protein superfamilies (CAMPASS, http://www-cryst.bioc.cam.ac.uk/~campass), in which structure-based alignments are analyzed and formatted with the program JOY (http://www-cryst.bioc.cam.ac.uk/~joy) to reveal conserved local structural features. The databases facilitate the recognition of a family or superfamily, they assist in the selection of useful parent structures, they are helpful in alignment of the target sequences with the parent set, and are useful for deriving relationships that can be used in validating models. In the iterative approach, a model is constructed on the basis of the proposed sequence alignment and this is then reexpressed in the JOY format and realigned with the parent set. This is repeated until the model and sequence alignment are optimized. We examine the case for comparison and use of multiple structures of family members, rather than a single parent structure. We use the targets attempted by our group in CASP3 to assess the value of such procedures.
- Carbonyl-carbonyl interactions stabilize the partially allowed Ramachandran conformations of asparagine and aspartic acid. C M Deane, F H Allen, R Taylor, and 1 more author. Protein Eng., Dec 1999
Asparagine and aspartate are known to adopt conformations in the left-handed alpha-helical region and other partially allowed regions of the Ramachandran plot more readily than any other non-glycyl amino acids. The reason for this preference has not been established. An examination of the local environments of asparagine and aspartic acid in protein structures with a resolution better than 1.5 Å revealed that their side-chain carbonyls are frequently within 4 Å of their own backbone carbonyl or the backbone carbonyl of the previous residue. Calculations using protein structures with a resolution better than 1.8 Å reveal that this close contact occurs in more than 80% of cases. This carbonyl-carbonyl interaction offers an energetic stabilization for the partially allowed conformations of asparagine and aspartic acid with respect to all other non-glycyl amino acids. The non-covalent attractive interaction between the dipoles of two carbonyls has recently been calculated to have an energy comparable to that of a hydrogen bond. The preponderance of asparagine in the left-handed alpha-helical region, and in general of aspartic acid and asparagine in the partially allowed regions of the Ramachandran plot, may be a consequence of this carbonyl-carbonyl stacking interaction.
1998
- HOMSTRAD: a database of protein structure alignments for homologous families. K Mizuguchi, C M Deane, T L Blundell, and 1 more author. Protein Sci., Nov 1998
We describe a database of protein structure alignments for homologous families. The database HOMSTRAD presently contains 130 protein families and 590 aligned structures, which have been selected on the basis of quality of the X-ray analysis and accuracy of the structure. For each family, the database provides a structure-based alignment derived using COMPARER and annotated with JOY in a special format that represents the local structural environment of each amino acid residue. HOMSTRAD also provides a set of superposed atomic coordinates obtained using MNYFIT, which can be viewed with a graphical user interface or used for comparative modeling studies. The database is freely available on the World Wide Web at http://www-cryst.bioc.cam.ac.uk/~homstrad/, with search facilities and links to other databases.
- JOY: protein sequence-structure representation and analysis. K Mizuguchi, C M Deane, T L Blundell, and 2 more authors. Bioinformatics, Jan 1998
MOTIVATION: JOY is a program to annotate protein sequence alignments with three-dimensional (3D) structural features. It was developed to display 3D structural information in a sequence alignment and to help understand the conservation of amino acids in their specific local environments. RESULTS: The JOY representation now constitutes an essential part of the two databases of protein structure alignments: HOMSTRAD (http://www-cryst.bioc.cam.ac.uk/homstrad) and CAMPASS (http://www-cryst.bioc.cam.ac.uk/campass). It has also been successfully used for identifying distant evolutionary relationships. AVAILABILITY: The program can be obtained via anonymous ftp from torsa.bioc.cam.ac.uk from the directory /pub/joy/. The address for the JOY server is http://www-cryst.bioc.cam.ac.uk/cgi-bin/joy.cgi. CONTACT: kenji@cryst.bioc.cam.ac.uk
- Protein three-dimensional structural databases: domains, structurally aligned homologues and superfamilies. R Sowdhamini, D F Burke, C Deane, and 7 more authors. Acta Crystallogr. D. Biol. Crystallogr., Nov 1998
This paper reports the availability of a database of protein structural domains (DDBASE), an alignment database of homologous proteins (HOMSTRAD) and a database of structurally aligned superfamilies (CAMPASS) on the World Wide Web (WWW). DDBASE contains information on the organization of structural domains and their boundaries; it includes only one representative domain from each of the homologous families. This database has been derived by identifying the presence of structural domains in proteins on the basis of inter-secondary structural distances using the program DIAL [Sowdhamini & Blundell (1995), Protein Sci. 4, 506-520]. The alignment of proteins in superfamilies has been performed on the basis of the structural features and relationships of individual residues using the program COMPARER [Sali & Blundell (1990), J. Mol. Biol. 212, 403-428]. The alignment databases contain information on the conserved structural features in homologous proteins and those belonging to superfamilies. Available data include the sequence alignments in structure-annotated formats and the provision for viewing superposed structures of proteins using a graphical interface. Such information, which is freely accessible on the WWW, should be of value to crystallographers in the comparison of newly determined protein structures with previously identified protein domains or existing families.
1997
- A structural model of the human thrombopoietin receptor complex. C M Deane, R T Kroemer, and W G Richards. J. Mol. Graph. Model., Jun 1997
Thrombopoietin (TPO) is a glycoprotein hormone that regulates red blood cell production. Presented here is a modeling study of the extracellular region of the human thrombopoietin receptor complex, in particular the TPO-receptor interface. The models were developed from structural homology to other cytokines and their receptors. Experimental evidence suggests that the receptor is homodimeric and it was modeled accordingly. Key interactions are shown that correlate with previous cytokine receptor complexes, and the pattern of cysteine bonding (Cys7-Cys151 and Cys29-Cys85) agrees with that experimentally determined for thrombopoietin. These models pave the way for possible mutagenesis experimentation and the design of (ant)agonists.