Pattern Recognition in Bioinformatics: International Workshop, PRIB 2006, Hong Kong, China, August 20, 2006, Proceedings (Lecture Notes in Computer Science, 4146) 3540374469, 9783540374466


English Pages 198 [197] Year 2006


Table of contents :
Frontmatter
Pattern Recognition in Bioinformatics: An Introduction
Part 1: Signal and Motif Detection; Gene Selection
Machine Learning Prediction of Amino Acid Patterns in Protein N-myristoylation
A Profile HMM for Recognition of Hormone Response Elements
Graphical Approach to Weak Motif Recognition in Noisy Data Sets
Comparative Gene Prediction Based on Gene Structure Conservation
Computational Identification of Short Initial Exons
Pareto-Gamma Statistic Reveals Global Rescaling in Transcriptomes of Low and High Aggressive Breast Cancer Phenotypes
Investigating the Class-Specific Relevance of Predictor Sets Obtained from DDP-Based Feature Selection Technique
A New Maximum-Relevance Criterion for Significant Gene Selection
Part 2: Models of DNA, RNA, and Protein Structures
Spectral Graph Partitioning Analysis of In Vitro Synthesized RNA Structural Folding
Predicting Secondary Structure of All-Helical Proteins Using Hidden Markov Support Vector Machines
Prediction of Protein Subcellular Localizations Using Moment Descriptors and Support Vector Machine
Using Permutation Patterns for Content-Based Phylogeny
Part 3: Biological Databases and Imaging
The Immune Epitope Database and Analysis Resource
Intelligent Extraction Versus Advanced Query: Recognize Transcription Factors from Databases
Incremental Maintenance of Biological Databases Using Association Rule Mining
Blind Separation of Multichannel Biomedical Image Patterns by Non-negative Least-Correlated Component Analysis
Image and Fractal Information Processing for Large-Scale Chemoinformatics, Genomics Analyses and Pattern Discovery
Hybridization of Independent Component Analysis, Rough Sets, and Multi-Objective Evolutionary Algorithms for Classificatory Decomposition of Cortical Evoked Potentials
Backmatter

Lecture Notes in Bioinformatics

4146

Edited by S. Istrail, P. Pevzner, and M. Waterman Editorial Board: A. Apostolico S. Brunak M. Gelfand T. Lengauer S. Miyano G. Myers M.-F. Sagot D. Sankoff R. Shamir T. Speed M. Vingron W. Wong

Subseries of Lecture Notes in Computer Science

Jagath C. Rajapakse Limsoon Wong Raj Acharya (Eds.)

Pattern Recognition in Bioinformatics International Workshop, PRIB 2006 Hong Kong, China, August 20, 2006 Proceedings


Series Editors
Sorin Istrail, Brown University, Providence, RI, USA
Pavel Pevzner, University of California, San Diego, CA, USA
Michael Waterman, University of Southern California, Los Angeles, CA, USA

Volume Editors
Jagath C. Rajapakse, Nanyang Technological University, BioInformatics Research Centre, Singapore
Limsoon Wong, National University of Singapore, School of Computing and Graduate School for Integrated Sciences and Engineering, 3 Science Drive 2, 117543, Singapore
Raj Acharya, Penn. State University, Computer Science and Engineering, 220 Pond Lab., University Park, Pennsylvania 16802-6106, USA

Library of Congress Control Number: 2006930615

CR Subject Classification (1998): H.2.8, I.5, I.4, J.3, I.2, H.3, F.1-2
LNCS Sublibrary: SL 8 – Bioinformatics
ISSN 0302-9743
ISBN-10 3-540-37446-9 Springer Berlin Heidelberg New York
ISBN-13 978-3-540-37446-6 Springer Berlin Heidelberg New York

This work is subject to copyright. All rights are reserved, whether the whole or part of the material is concerned, specifically the rights of translation, reprinting, re-use of illustrations, recitation, broadcasting, reproduction on microfilms or in any other way, and storage in data banks. Duplication of this publication or parts thereof is permitted only under the provisions of the German Copyright Law of September 9, 1965, in its current version, and permission for use must always be obtained from Springer. Violations are liable to prosecution under the German Copyright Law.

Springer is a part of Springer Science+Business Media
springer.com
© Springer-Verlag Berlin Heidelberg 2006
Printed in Germany
Typesetting: Camera-ready by author, data conversion by Scientific Publishing Services, Chennai, India
Printed on acid-free paper SPIN: 11818564 06/3142 543210

Preface

The field of bioinformatics has two main objectives: the creation and maintenance of biological databases, and the discovery of knowledge from life sciences data in order to unravel the mysteries of biological function, leading to new drugs and therapies for human disease. Life sciences data come in the form of biological sequences, structures, pathways, or literature. One major aspect of discovering biological knowledge is to search for, predict, or model specific patterns in a given dataset that have some relevance to an important biological phenomenon or to another dataset. To date, many pattern recognition algorithms have been applied or adapted to address a wide range of bioinformatics problems. The 2006 Workshop on Pattern Recognition in Bioinformatics (PRIB 2006) marks the beginning of a series of workshops aimed at gathering researchers who apply pattern recognition algorithms to problems in computational biology and bioinformatics.

This volume presents the proceedings of PRIB 2006, held in Hong Kong, China, on August 20, 2006. It includes 19 technical contributions that were selected by the Program Committee from 43 submissions. We give a brief introduction to pattern recognition in bioinformatics in the first paper. The rest of the volume consists of three parts. Part 1: signal and motif detection, and gene selection. Part 2: models of DNA, RNA, and protein structures. Part 3: biological databases and imaging.

Part 1 of the proceedings contains eight chapters that deal with the detection of signals, motifs, and gene structure in genomic sequences, and with gene selection from microarray data. Okada et al. suggest an approach that uses decision trees to derive rules for alphabet indexing in order to predict the position of the N-myristoylation signal. Stepanova, Lin, and Lin present an approach to recognize steroid hormone regulation elements within promoters of vertebrate genomes, based on a hidden Markov model (HMM). Ho and Rajapakse present a novel graphical approach for weak motif detection in noisy datasets. They examine the robustness of the approach on synthetic datasets and illustrate its applicability to finding motifs in eukaryotes. Hsieh et al. propose a program, GeneAlign, that predicts genes on one genome by incorporating annotated genes on another genome. This approach achieves higher gene prediction accuracy by exploiting the conservation of gene structures and the sequence homologies between protein coding regions of genomes. Logeswaran, Ambikairajah, and Epps propose a method for predicting short initial exons, based on weight arrays and CpG islands. Chua, Ivshina, and Kuznetsov propose a mixture probability model for microarray signals. The noise term due to non-specific mRNA hybridization is modeled by a lognormal distribution, and the true signal is described by the generalized Pareto-gamma function. The model, applied to expression data of 251 human breast cancer tumors on the Affymetrix microarray platform, yields accurate fits for all tumor samples. Using the degree of differential prioritization between relevance and antiredundancy on microarray data, Ooi, Chetty, and Teng propose a feature selection technique for tumor classification. Kim and Gao propose an enhanced MaxRelevance criterion for gene selection, which combines the collective impact of the most expressive features in emerging patterns (EPs) with independent criteria such as the t-test or symmetrical uncertainty. By capturing the joint effect of features with the EPs algorithm, the method finds the most discriminative features in a broader scope.

Part 2 of the proceedings focuses on models of DNA, RNA, and amino acid sequences that are used to predict protein secondary structure, protein subcellular localization, RNA structure, phylogeny, and nucleosome formation. Ng and Mishra investigate the topological properties of synthetic RNAs by applying a spectral graph partitioning technique. Their analysis shows that the majority of synthetic RNAs possess two to six vertices, in contrast to natural RNA structures that mostly have nine or ten vertices, and are less compact with the second eigenvalue below unity. Gassend et al. propose a biophysically motivated energy model that uses hidden Markov support vector machines (HM-SVMs) for protein secondary structure prediction from amino acid sequences. Shi et al. construct three types of moment descriptors to capture sequence order information in a protein sequence and predict the subcellular localization of proteins, without needing information on the physicochemical properties of amino acids. Karim, Parida, and Lakhotia explore the use of permutation patterns from genome rearrangement data as a content similarity measure to infer phylogenies in polynomial time.

Part 3 of the proceedings deals with biological databases and images. Sette et al. announce the availability of the Immune Epitope Database and Analysis Resource (IEDB) to facilitate the exploration of immunity to infectious diseases, allergies, autoimmune diseases, and cancer. The utility of the IEDB was recently demonstrated through a comprehensive analysis of all current information regarding antibody and T cell epitopes derived from influenza A, which determined possible cross-reactivity among H5N1 avian flu and human flu viruses. Zhang, Ng, and Bajic combine information on protein functional domains and gene ontology descriptions for highly accurate identification of transcription factor entries in the Swiss-Prot and Entrez Gene databases. Lam et al. propose a novel method to support automatic incremental updating of specialist biological databases by using association rule mining. Wang et al. report a blind source separation method, based on non-negative least-correlated component analysis (nLCA), for quantitative dissection of mixed yet correlated biomarker patterns in cellular images. Two approaches for handling large-scale biological data are proposed by Havukkala et al. and illustrated in the contexts of molecular image processing for chemoinformatics and fractal visualization methods for genome analyses. Smolinski et al. investigate the hybridization of multi-objective evolutionary algorithms (MOEA) and rough sets (RS) for the classificatory decomposition of signals recorded from the surface of the cerebral cortex. By using independent component analysis (ICA) to initialize the MOEA, reconstruction errors are significantly reduced.


We would like to sincerely thank all authors who have spent time and effort to make important contributions to this book. Our gratitude also goes to the LNBI editors, Sorin Istrail, Pavel Pevzner, and Michael Waterman, for their most kind support and help in editing this book.

Jagath C. Rajapakse
Limsoon Wong
Raj Acharya

Acknowledgement

We would like to thank all individuals and institutions who contributed to the success of the workshop, especially the authors for submitting papers and the sponsors for generously providing financial support. We are very grateful to the IAPR Technical Committee (TC-20) on Pattern Recognition for BioInformatics for their invaluable guidance and advice. In addition, we would like to express our gratitude to all PRIB 2006 Program Committee members for their thoughtful and rigorous reviews of the submitted papers. We fully appreciate the Organizing Committee for their enormous and excellent work. We are also grateful to the ICPR 2006 General Chairs, Yuan Yan Tang, Patrick Wang, G. Lorette, and Daniel So Yeung, for their willingness to coordinate with PRIB 2006, and, especially to ICPR 2006 Workshop Chairs, James Kwok and Nanning Zheng, for their effort in the local arrangements. Many thanks go to PRIB 2006 secretary, Norhana Ahmad, for coordinating all the logistics of the workshop. Last but not least, we wish to convey our sincere thanks to Springer for providing excellent support in preparing this volume.

Raj Acharya, PRIB 2006 General Chair
Jagath C. Rajapakse and Limsoon Wong, PRIB 2006 Program Co-chairs

Organization

IAPR Technical Committee on Pattern Recognition on Bioinformatics

Raj Acharya (Chair), Pennsylvania State Univ., USA
Francisco Azuaje, Univ. of Ulster, UK
Vladimir Brusic, Univ. of Queensland, Australia
Phoebe Chen, Deakin University, Australia
David Corne, Heriot-Watt Univ., UK
Elena Marchiori, Vrije Univ. of Amsterdam, The Netherlands
Mariofanna Milanova, Univ. of Arkansas at Little Rock, USA
Gary B. Fogel, Natural Selection, Inc., USA
Saman K. Halgamuge, Univ. of Melbourne, Australia
Visakan Kadirkamanathan, Univ. of Sheffield, UK
Nik Kasabov, Auckland Univ. of Technology, New Zealand
Irwin King, Chinese Univ. of Hong Kong, Hong Kong
Alex V. Kochetov, Russian Academy of Sciences, Russia
Graham Leedham, Nanyang Tech. Univ., Singapore
Ajit Narayanan, Univ. of Exeter, UK
Nikhil R. Pal, Indian Statistical Inst., India
Marimuthu Palaniswami, Univ. of Melbourne, Australia
Jagath C. Rajapakse (Vice-chair), Nanyang Tech. Univ., Singapore
Gwenn Volkert, Kent State Univ., USA
Roy E. Welsch, Massachusetts Inst. of Technology, USA
Kay C. Wiese, Simon Fraser Univ., Canada
Limsoon Wong, National Univ. of Singapore, Singapore
Jiahua (Jerry) Wu, Wellcome Trust Sanger Inst., UK
Yanqing Zhang, Georgia State Univ., USA
Qiang Yang, Hong Kong Univ. of Science and Technology, Hong Kong

PRIB 2006 Organization

General Chair
Raj Acharya, Pennsylvania State Univ., USA

Program Co-chairs
Jagath C. Rajapakse (Co-chair), Nanyang Tech. Univ., Singapore
Limsoon Wong (Co-chair), National Univ. of Singapore, Singapore

Publicity
Phoebe Chen, Deakin University, Australia
Elena Marchiori, Vrije Univ. of Amsterdam, The Netherlands
Mariofanna Milanova, Univ. of Arkansas at Little Rock, USA

Publication
Loi Sy Ho, Nanyang Tech. Univ., Singapore

Local Arrangement Chair
Irwin King, Chinese Univ. of Hong Kong, Hong Kong

Secretariat
Norhana Binte Ahmad, Nanyang Tech. Univ., Singapore

System Administration
Linda Ang Ah Giat, Nanyang Tech. Univ., Singapore

Program Committee
Shandar Ahmad, Kyushu Inst. of Technology, Japan
Tatsuya Akutsu, Kyoto Univ., Japan
Ron Appel, Swiss Inst. of Bioinformatics, Switzerland
Vladimir Brusic, Univ. of Queensland, Australia
Madhu Chetty, Monash Univ., Australia
Francis Y.L. Chin, Univ. of Hong Kong, Hong Kong
Koon Kau Byron Choi, Nanyang Tech. Univ., Singapore
Ching Ming Maxey Chung, National Univ. of Singapore, Singapore
Carlos Cotta, Univ. of Malaga, Spain
David Corne, Heriot-Watt Univ., UK
Alexandru Floares, Inst. of Oncology, Romania
Gary B. Fogel, Natural Selection, Inc., USA
Vivekanand Gopalkrishnan, Nanyang Tech. Univ., Singapore
Saman K. Halgamuge, Univ. of Melbourne, Australia
Dongsoo Han, Information and Communications Univ., Korea
Yulan He, Nanyang Tech. Univ., Singapore
Hsuan-Cheng Huang, National Yang-Ming Univ., Taiwan
Ming-Jing Hwang, Academia Sinica, Taiwan
Visakan Kadirkamanathan, Univ. of Sheffield, UK
Nik Kasabov, Auckland Univ. of Technology, New Zealand
Alex V. Kochetov, Russian Academy of Sciences, Russia
Natalio Krasnogor, Univ. of Nottingham, UK
Chee Keong Kwoh, Nanyang Tech. Univ., Singapore
Tak-Wah Lam, Univ. of Hong Kong, Hong Kong
Jinyan Li, Inst. of Infocomm Research, Singapore
Alan Wee-Chung Liew, Chinese Univ. of Hong Kong, Hong Kong
Feng Lin, Nanyang Tech. Univ., Singapore
Gary F. Marcus, New York Univ., USA
Hiroshi Matsuno, Yamaguchi Univ., Japan
Satoru Miyano, Univ. of Tokyo, Japan
Jason H. Moore, Dartmouth Medical School, USA
Kenta Nakai, Univ. of Tokyo, Japan
Ajit Narayanan, Univ. of Exeter, UK
Zoran Obradovic, Temple Univ., USA
Marimuthu Palaniswami, Univ. of Melbourne, Australia
Laxmi Parida, IBM T.J. Watson Research Center, USA
Mihail Popescu, Univ. of Missouri, USA
Predrag Radivojac, Indiana Univ., USA
Jem Rowland, Univ. of Wales Aberystwyth, UK
Alexander Schliep, Max Planck Inst. for Mol. Genetics, Germany
Bertil Schmidt, Nanyang Tech. Univ., Singapore
Alessandro Sette, La Jolla Inst. for Allergy & Immunology, USA
Roberto Tagliaferri, Universita di Salerno, Italy
Gwenn Volkert, Kent State Univ., USA
Michael Wagner, Cincinnati Children's Hospital Research Foundation, USA
Haiying Wang, Univ. of Ulster at Jordanstown, N. Ireland
Lusheng Wang, City Univ. of Hong Kong, Hong Kong
Wei Wang, Fudan Univ., China
Wolfgang Banzhaf, Memorial Univ. of Newfoundland, Canada
Jiahua (Jerry) Wu, Wellcome Trust Sanger Inst., UK
Ying Xu, Univ. of Georgia, USA
Hong Yan, City Univ. of Hong Kong, Hong Kong
Yanqing Zhang, Georgia State Univ., USA
Jun Zhang, Nanyang Tech. Univ., Singapore

Table of Contents

Pattern Recognition in Bioinformatics: An Introduction
  Jagath C. Rajapakse, Limsoon Wong, Raj Acharya

Part 1: Signal and Motif Detection; Gene Selection

Machine Learning Prediction of Amino Acid Patterns in Protein N-myristoylation
  Ryo Okada, Manabu Sugii, Hiroshi Matsuno, Satoru Miyano

A Profile HMM for Recognition of Hormone Response Elements
  Maria Stepanova, Feng Lin, Valerie C.-L. Lin

Graphical Approach to Weak Motif Recognition in Noisy Data Sets
  Loi Sy Ho, Jagath C. Rajapakse

Comparative Gene Prediction Based on Gene Structure Conservation
  Shu Ju Hsieh, Chun Yuan Lin, Ning Han Liu, Chuan Yi Tang

Computational Identification of Short Initial Exons
  Sayanthan Logeswaran, Eliathamby Ambikairajah, Julien Epps

Pareto-Gamma Statistic Reveals Global Rescaling in Transcriptomes of Low and High Aggressive Breast Cancer Phenotypes
  Alvin L.-S. Chua, Anna V. Ivshina, Vladimir A. Kuznetsov

Investigating the Class-Specific Relevance of Predictor Sets Obtained from DDP-Based Feature Selection Technique
  Chia Huey Ooi, Madhu Chetty, Shyh Wei Teng

A New Maximum-Relevance Criterion for Significant Gene Selection
  Young Bun Kim, Jean Gao, Pawel Michalak

Part 2: Models of DNA, RNA, and Protein Structures

Spectral Graph Partitioning Analysis of In Vitro Synthesized RNA Structural Folding
  Stanley Kwang Loong Ng, Santosh K. Mishra

Predicting Secondary Structure of All-Helical Proteins Using Hidden Markov Support Vector Machines
  Blaise Gassend, Charles W. O’Donnell, William Thies, Andrew Lee, Marten van Dijk, Srinivas Devadas

Prediction of Protein Subcellular Localizations Using Moment Descriptors and Support Vector Machine
  Jianyu Shi, Shaowu Zhang, Yan Liang, Quan Pan

Using Permutation Patterns for Content-Based Phylogeny
  Md Enamul Karim, Laxmi Parida, Arun Lakhotia

Part 3: Biological Databases and Imaging

The Immune Epitope Database and Analysis Resource
  Alessandro Sette, Huynh Bui, John Sidney, Phi Bourne, Soren Buus, Ward Fleri, R. Kubo, O. Lund, D. Nemazee, J.V. Ponomarenko, M. Sathiamurthy, S. Stewart, S. Way, S.S. Wilson, B. Peters

Intelligent Extraction Versus Advanced Query: Recognize Transcription Factors from Databases
  Zhuo Zhang, Merlin Veronika, See-Kiong Ng, Vladimir B. Bajic

Incremental Maintenance of Biological Databases Using Association Rule Mining
  Kai-Tak Lam, Judice L.Y. Koh, Bharadwaj Veeravalli, Vladimir Brusic

Blind Separation of Multichannel Biomedical Image Patterns by Non-negative Least-Correlated Component Analysis
  Fa-Yu Wang, Yue Wang, Tsung-Han Chan, Chong-Yung Chi

Image and Fractal Information Processing for Large-Scale Chemoinformatics, Genomics Analyses and Pattern Discovery
  Ilkka Havukkala, Lubica Benuskova, Shaoning Pang, Vishal Jain, Rene Kroon, Nikola Kasabov

Hybridization of Independent Component Analysis, Rough Sets, and Multi-Objective Evolutionary Algorithms for Classificatory Decomposition of Cortical Evoked Potentials
  Tomasz G. Smolinski, Grzegorz M. Boratyn, Mariofanna Milanova, Roger Buchanan, Astrid A. Prinz

Author Index

Pattern Recognition in Bioinformatics: An Introduction

J.C. Rajapakse (1,4,5), L. Wong (2), and R. Acharya (3)

(1) BioInformatics Research Center, Nanyang Technological University, Singapore
(2) National University of Singapore, Singapore
(3) Computer Science and Engineering, The Penn State University, USA
(4) Singapore-MIT Alliance, N2 50 Nanyang Avenue, Singapore
(5) Biological Engineering Division, Massachusetts Institute of Technology, USA

The information stored in DNA, a chain of four nucleotides (A, T, G, and C), is first converted to mRNA through the process of transcription and then converted to the functional form of life, proteins, through the process of translation. Only about 5% of the genome contains useful patterns of nucleotides, or genes, that code for proteins. The initiation of the transcription or translation process is determined by the presence of specific patterns of DNA or RNA, or motifs. Research on detecting specific patterns in DNA sequences, such as genes, protein coding regions, and promoters, helps uncover functional aspects of cells. Comparative genomics focuses on comparisons across genomes to find patterns conserved over evolution, which possess some functional significance. Construction of evolutionary trees is useful for understanding how genomes and proteomes have evolved across species, by way of a complete library of motifs and genes.

A protein’s functionality or its interaction with another protein is mainly determined by its 3-D structure and surface pattern. Prediction of a protein’s 3-D structure from its 1-D amino acid sequence remains an open problem in structural genomics; protein-protein interactions determine all essential functions in living cells. Computational modeling and visualization tools for 3-D protein structures help biologists infer cellular activities. The challenge in functional genomics is to analyze gene expression data accumulated by microarray techniques to discover clusters of co-regulated genes and thereby gene regulatory networks, leading to an understanding of the regulatory mechanisms of genes and pathways. Molecular imaging provides techniques for in vivo sensing and imaging of molecular events, which measure biological processes in living organisms at the molecular and cellular level. Techniques to fuse and integrate different kinds of information derived from different life science data are yet to be explored.

The knowledge in databases of biomedicine and phenotypes, combined with genotypes, is increasingly unmanageable by traditional text-based methods. Advanced data mining techniques, which use ontologies to construct precise descriptors of medical concepts and procedures, are required in the field of medical informatics. The increasing amount of biological literature is posing new challenges in the field of text mining, whose techniques could find pathways and interaction networks purely by mining the literature.

J.C. Rajapakse, L. Wong, and R. Acharya (Eds.): PRIB 2006, LNBI 4146, pp. 1–3, 2006. © Springer-Verlag Berlin Heidelberg 2006


Finding a particular structure of a sequence or surface pattern of a protein that has a specific biological function or is involved in interactions with other molecules is a fundamental question that could be addressed by pattern recognition algorithms. Further, pattern recognition has already shown promise in the following areas of bioinformatics:

• Computational genomics and comparative genomics
• Gene expression analysis and functional genomics
• Alignment of sequences: DNA, protein, structures, etc.
• Phylogenetic analysis of species, sequences, structures
• Structural genomics and proteomics
• Functional and molecular imaging
• Data mining, data integration, and visualization
• Information fusion such as combining sequences, expressions, texts, etc.
• Pathway analysis, gene regulatory networks, etc.
• Disease modeling
• Medical informatics

Statistical, fuzzy, and neural network clustering techniques have been successfully applied to gene expression data analysis. Graph-based pattern recognition techniques have found applications in the recognition of motifs, gene regulatory networks, and protein-protein interactions [1, 2, 3]. Support vector machines and information theory based approaches are increasingly used in feature selection or gene selection [4, 5]. Markov models and hidden Markov models are becoming popular in sequence alignment and in gene or RNA structure finding [6, 7]. Statistical and neural network based predictors have been used to find signals in genomic sequences and protein structures [2, 4, 8, 9]. As the underpinnings of life sciences data become clearer, pattern recognition algorithms will become more and more useful and relevant in solving computational biology and bioinformatics problems.

References

[1] E. Eskin, P.A. Pevzner (2002), "Finding composite regulatory patterns in DNA sequences", Bioinformatics, 18:S354-S363.
[2] M.N. Nguyen, J.C. Rajapakse (2005), "Two-stage support vector regression approach for predicting accessible surface areas of amino acids", PROTEINS: Structure, Function, and Bioinformatics, 63:542-550.
[3] Min Zou, Suzanne D. Conzen (2005), "A new dynamic Bayesian network (DBN) approach for identifying gene regulatory networks from time course microarray data", Bioinformatics, 21:71-79.
[4] Haifeng Li, Tao Jiang (2005), "A class of edit kernels for SVMs to predict translation initiation sites in eukaryotic mRNAs", Journal of Computational Biology, 12(6):702-718.
[5] Guo-Liang Li, Tze-Yun Leong (2005), "Feature selection for the prediction of translation initiation sites", Genomics Proteomics Bioinformatics, 3(2):73-83.
[6] W.H. Majoros, M. Pertea, S.L. Salzberg (2005), "Efficient implementation of a generalized pair hidden Markov model for comparative gene finding", Bioinformatics, 21(9):1782-1788.
[7] Dustin E. Schones, Pavel Sumazin, Michael Q. Zhang (2005), "Similarity of position frequency matrices for transcription factor binding sites", Bioinformatics, 21:307-313.
[8] Te-Ming Chen, Chung-Chin Lu, Wen-Hsiung Li (2005), "Prediction of splice sites with dependency graphs and their expanded Bayesian networks", Bioinformatics, 21:471-482.
[9] Gideon Dror, Rotem Sorek, Ron Shamir (2005), "Accurate identification of alternatively spliced exons using support vector machine", Bioinformatics, 21:897-901.

Machine Learning Prediction of Amino Acid Patterns in Protein N-myristoylation

Ryo Okada (1), Manabu Sugii (2), Hiroshi Matsuno (1), and Satoru Miyano (3)

(1) Graduate School of Science and Engineering,
(2) Media and Information Technology Center, Yamaguchi University, Yamaguchi 753-8511, Japan
(3) Human Genome Center, University of Tokyo, Tokyo 108-8639, Japan

Abstract. Protein N-myristoylation is a lipid modification in which a 14-carbon saturated fatty acid binds covalently to the N-terminus of viral and eukaryotic proteins. In this study, we suggest an approach to predict the pattern of the N-myristoylation signal using the machine learning system BONSAI. BONSAI finds rules as a combination of an alphabet indexing and a decision tree. Computational experiments with BONSAI classified amino acid residues according to their effect on N-myristoylation and found rules of the alphabet indexing. In addition, BONSAI suggested new requirements for the position of an amino acid in the N-myristoylation signal.

1 Introduction

Protein N-myristoylation is a lipid modification, and many N-myristoylated proteins play key roles in regulating cellular structure and function, such as the BID protein involved in apoptosis and the alpha subunit of the G-protein localized on the cell membrane. N-myristoylated proteins have a specific sequence at the N-terminus called the N-myristoylation signal sequence, which is probably composed of 6 to 9 amino acids (up to 17) [1]. In order to determine the amino-terminal sequence requirements for protein N-myristoylation, these sequences have been examined [2,3]. Most methods used by researchers predict patterns for N-myristoylation through biological experiments guided by the researchers' knowledge. However, the information in the sequence is very rich, involving not only a simple rule but also many specific rules. Hence, computational techniques are essential for predicting rules from the large amount of data involved in sequence prediction for N-myristoylation.

The machine learning system BONSAI is a system for knowledge acquisition from primary structural data [4]. BONSAI has discovered a rule that classifies amino acid sequences into transmembrane domains and other domains with over 90% accuracy [4]. BONSAI finds rules as a combination of an alphabet indexing and a decision tree from positive and negative example sequences.

J.C. Rajapakse, L. Wong, and R. Acharya (Eds.): PRIB 2006, LNBI 4146, pp. 4–14, 2006. © Springer-Verlag Berlin Heidelberg 2006


The alphabet indexing groups the letters in positive and negative examples by mapping them to a smaller number of letters. We have tried to predict the N-myristoylation signal sequence from amino acid sequences using BONSAI. Section 2 describes features of protein N-myristoylation with emphasis on the sequence requirements. Section 3 gives a brief description of BONSAI, which is used to find rules for N-myristoylation. Section 4 describes our computational experiments using BONSAI to find rules in amino acid sequences for N-myristoylation. Results suggested by the computational experiments are presented in Section 5, which includes two interesting rules in the sequence requirements for N-myristoylation, discusses the validity of the suggested results, and gives biological interpretations of them.
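The idea of an alphabet indexing can be illustrated with a minimal Python sketch. The three groups below are hypothetical and chosen only for illustration; they are not the indexing found by BONSAI, and the names GROUPS, apply_indexing, and keeps_examples_apart are assumptions introduced here. The second function reflects one plausible requirement, assumed rather than taken from the text: after the mapping, no positive example should become identical to a negative one.

```python
# Hypothetical grouping of the 20 amino acids into 3 index symbols.
# Illustrative only; this is NOT the indexing reported for BONSAI.
GROUPS = {
    "0": "GASCT",
    "1": "PVDNLIQEH",
    "2": "MFKYWR",
}

# Flatten the groups into a letter -> index symbol lookup table.
INDEXING = {aa: symbol for symbol, letters in GROUPS.items() for aa in letters}


def apply_indexing(sequence, indexing=INDEXING):
    """Rewrite a one-letter amino acid sequence over the smaller index alphabet."""
    return "".join(indexing[aa] for aa in sequence)


def keeps_examples_apart(positives, negatives, indexing=INDEXING):
    """Return True if no positive and negative example map to the same string."""
    pos = {apply_indexing(s, indexing) for s in positives}
    neg = {apply_indexing(s, indexing) for s in negatives}
    return pos.isdisjoint(neg)


indexed = apply_indexing("MGARNSVLSG")  # "2002101100"
```

With fewer symbols, a decision tree has fewer distinct letters to branch on, at the cost of possibly merging residues that behave differently; the check in keeps_examples_apart is one simple way to notice when a grouping is too coarse for a given set of examples.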

2 Protein N-myristoylation

Protein N-myristoylation is a lipid modification in which a 14-carbon saturated fatty acid binds covalently to the N-terminus of viral and eukaryotic proteins. About 0.5% of human proteins are estimated to be N-myristoylated [1]. Protein N-myristoylation is a cotranslational protein modification catalyzed by two enzymes, methionine aminopeptidase and N-myristoyltransferase (NMT). A protein estimated to be N-myristoylated has at least the sequence Met-Gly at its N-terminus. The initial Met is removed cotranslationally by the methionine

Fig. 1. Protein N-myristoylation

6

R. Okada et al. Table 1. Example of myristoylated sequence Protein Amino Acid Sequence GAG SIVM1 MGARNSVLSGKKADE KCRF STRPU MGCAASSQQTTATGG Q26368 MGCNTSQELKTKDGA GBAZ HUMAN MGCRQSSEEKEAARR COA2 POVM3 MGAALTILVDLIEGL RASH RRASV MGQSLTTPLSLTLDH

Three Letter Code

Gly Ala Ser Cys Thr Pro Val Asp Asn Leu Ile Gln Glu His Met Phe Lys Tyr Trp Arg

Single Letter Code

G

A

S

C

T

P

V

D

N

L

I

Q

E

H

M

F

K

Y

W

R

Fig. 2. Correspondence between amino acids in one letter and three letters

aminopeptidase, and then the myristic acid is linked to the next Gly via an amide bond by NMT. NMT catalyzes the transfer of myristic acid from myristoyl-CoA to the N-terminus Gly residue of the substrate protein (Fig. 1). Most of myristoylated proteins have a physiological activity such as cell signaling protein, expressing specific functions through binding organelle membrane. It is known that membrane binding reaction mediated by myristoylation is controlled variedly, and play a crucial role in functional regulation mechanisms of proteins in cell signaling pathway and process of virus growth [5,6]. For example, HIV-1 Gag protein transfer to the plasma membrane by using N-myristoyl group, and is involved in the formation of virus particle and emission. Additionally, it is known that the apoptosis-inducing factor Bid is digested by protease, and the new N-terminus of digested peptide is also myristoylated [7]. N-myristoylated proteins have a specific sequence at the N-terminus called a N-myristoylation signal sequence. This sequence is probably composed typically of 6 to 9, but can be as many as 17 amino acids [1]. The effect of the amino acid sequence on N-myristoylation depends on the distance from N-terminus; with the increase of the distance, this effect is decrease. Table 1 shows examples of N-terminus sequence of myristoylated protein. Amino acids are usually written in one letter or three letters. Fig. 2 shows the correspondence of them. Researchers in biology have revealed that the combination of amino acid residues at positions 3 and 6 constitute a major determinant for the susceptibility to protein N-myristoylation. As shown in Fig. 3, when Ser is located at position 6, 11 amino acid residues (Gly, Ala, Ser, Cys, Thr, Val, Asn, Leu, Ile, Gln, His) are permitted locating at position 3 to direct efficient protein N-myristoylation [2,3]. Most of these 11 amino acids have a rule that the radius of gyration of residue is smaller than 1.80˚ A. 
Actually other amino acids that have radius of gyration is larger than 1.80˚ A, being not allowed at position 3. In addition to the restriction by the radius of gyration of the amino acid residues, it has been also revealed that the presence of negatively charged residues (Asp and Glu) and Pro residue at this position completely inhibited the N-myristoylation reaction.


Fig. 3. Protein N-myristoylation rule

On the other hand, when Ala is located at position 6, five kinds of amino acid residues are permitted at position 3 for N-myristoylation. When Thr or Phe is located at position 6, only two or three kinds of amino acid residues are permitted at position 3. In addition to the amino acid at position 6, the amino acid residue at position 7 can in some cases affect the amino acid requirement at position 3. For example, although Ser at position 6 does not normally allow Lys at position 3, Lys at position 7 changes the requirement at position 3 so that Lys can be located there [2].

3 Machine Learning System BONSAI

BONSAI is a machine learning system for knowledge acquisition from positive and negative examples of strings (Fig. 4) [4]. A hypothesis generated by the system is given as a pair consisting of a classification of symbols, called an alphabet indexing, and a decision tree that classifies given examples as either positive or negative (Fig. 5). An alphabet indexing (indexing for short) is a transformation of symbols that reduces the size of the alphabet of the positive and negative examples without losing important information in the original data. In the case of amino acid residues, the alphabet indexing can be regarded as a classification of the 20 kinds of amino acid residues into a few categories. Indexing not only speeds up the computation of rules but also simplifies the patterns assigned to the nodes of decision trees.

Fig. 4. Behavior of BONSAI

Fig. 5. Indexing

It has been reported that BONSAI discovered knowledge that discriminates amino acid sequences of transmembrane domains from randomly chosen amino acid sequences with over 90% accuracy [4]. In that experiment, the system found an indexing nearly the same as the hydropathy index of Kyte and Doolittle [8], without any prior knowledge of the hydropathy index.
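As a minimal sketch of how an alphabet indexing acts on a sequence, the following Python fragment maps each residue to an index symbol. The two-symbol table is the indexing reported for Rule 1 in Section 5 (Fig. 8) and serves only as an example here; the function name apply_indexing is hypothetical and is not part of BONSAI.

```python
# Example two-symbol indexing (assumption: the 0/1 table shown for Rule 1 in Fig. 8).
AMINO_ACIDS = "GASCTPVDNLIQEHMFKYWR"
INDEX_SYMBOLS = "00000111000110101111"
INDEXING = dict(zip(AMINO_ACIDS, INDEX_SYMBOLS))


def apply_indexing(sequence, indexing=INDEXING):
    """Replace every residue of `sequence` by its index symbol."""
    return "".join(indexing[residue] for residue in sequence)


# Positions 2-10 of the first Table 1 entry (initial Met removed).
print(apply_indexing("GARNSVLSG"))  # 001001000
```

The 20-letter sequence is reduced to a string over {0, 1}, which is the form in which patterns are attached to decision tree nodes.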

4 Discovery of Amino Acid Patterns with Locations

We used the following two sets of sequences as the positive and negative examples for BONSAI.

positive examples: 78 sequences identified as N-myristoylation sequences by biological experiments [1], together with sequences verified as N-myristoylation sequences presented in [6], and
negative examples: sequences randomly selected from all amino acid sequences of human proteins in the NCBI database [11].

The random selection of amino acid sequences as negative examples is justified by the fact that only 0.5% of all human proteins are estimated to satisfy the requirements for N-myristoylation [1]. Computational experiments with BONSAI were performed while varying the length of the amino acid sequences and the size of the indexing alphabet in order to identify proper values for them. The result does not seem to be affected by the roughly 0.5% of non-negative examples among the negative examples, because BONSAI can find the pattern that best classifies the given examples into positives and negatives even if the examples contain a few exceptions. The symbol M (Met) at the N-terminus was removed from all sequences, since all positive and negative examples have the symbol M at the N-terminus.

Fig. 6. Pattern search by original BONSAI

We modified the BONSAI program so that the patterns at the nodes of a decision tree have the same length as the input amino acid sequences. Although the original BONSAI finds a decision tree with an indexing that decides whether specific patterns exist in given sequences, it does not provide any information about the locations of these patterns. Hence, as shown in Fig. 6, the original BONSAI works well for finding transmembrane domains in amino acid sequences [4], but it cannot be used to find patterns together with their locations in given sequences, as is needed for N-myristoylation patterns. For example, even if the original BONSAI found a rule for the existence of the successive amino acid residues Met and Gly, which are located at the first and second positions of the N-myristoylation sequence, respectively, we could not learn the locations of these two amino acids from the original BONSAI. Hence, with the modified BONSAI, we employed the following strategy to find patterns for N-myristoylation classification together with amino acid locations.

1. Fix the length of the sequences given to BONSAI.
2. Produce decision trees in which the pattern length at every node is the same as the length fixed in step 1. We modified the BONSAI program for this purpose.

With this strategy, we can find rules that classify sequence patterns for N-myristoylation together with the positions of all amino acids in the patterns. Fig. 7 shows a case in which the length of the sequences given to BONSAI is fixed to 6 and the lengths of the patterns produced by BONSAI are restricted to the same number, 6.
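The difference between the two search modes can be illustrated with a short Python sketch. It is not the BONSAI implementation: the function names are hypothetical, the decision tree is assumed to be a simple chain of node patterns (as drawn in Fig. 8), and the indexed sequences are made up.

```python
def contains_pattern(indexed, pattern):
    """Original-style test: the pattern may occur anywhere in the string."""
    return pattern in indexed


def matches_at_positions(indexed, pattern):
    """Modified-style test: pattern and sequence share one fixed length,
    so character i of the pattern always refers to sequence position i."""
    if len(indexed) != len(pattern):
        raise ValueError("pattern and sequence must have the same fixed length")
    return indexed == pattern


def classify(indexed, node_patterns):
    """Chain-shaped decision tree: 'positive' at the first matching node,
    'negative' if no node pattern matches."""
    for pattern in node_patterns:
        if matches_at_positions(indexed, pattern):
            return "positive"
    return "negative"


# Two hypothetical indexed sequences of fixed length 6.
a, b = "001000", "000100"
print(contains_pattern(a, "1"), contains_pattern(b, "1"))   # True True
print(classify(a, ["001000"]), classify(b, ["001000"]))     # positive negative
```

The substring test cannot tell the two sequences apart, whereas the fixed-length test distinguishes a 1 at the third position from a 1 at the fourth.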


Fig. 7. Pattern search by modified BONSAI

5 Obtained Two Rules for Amino Acid Patterns in N-Myristoylation

BONSAI produced two rules in the form of a decision tree with an indexing, as shown in Figs. 8 and 10. One rule is a known fact confirmed by biological experiments [2], while the other suggests new amino acid sequence patterns for N-myristoylation.

5.1 Rule 1: Identification of Amino Acid Residue at Position 3 (Existing Rule)

Sequences whose N-myristoylation was experimentally verified in the recent report [1] and sequences presented in the literature [6] were provided to BONSAI as positive examples. As negative examples, we used 800 human protein sequences randomly selected from the NCBI database [11]. The number of 800 negative examples was chosen as a trade-off between the precision of the rules produced by BONSAI and its processing time: more examples produce more precise rules but require more processing time. The first symbol M was removed from the sequences of both the positive and negative examples, so that all sequences had length 9.

Pattern of N-myristoylation

Decision Tree
  00000111   Yes -> Positive, No -> next node
  00000110   Yes -> Positive, No -> next node
  ...
  01000001   Yes -> Positive, No -> Negative

b-patterns (position 2 3 4 5 6 7 8 9 / 3 4 5 6 7 8 9 10)
  00000111  00000110  01000111  01001000  11001000
  00101011  00101000  10000100  00101011  10001011
  01110101  01101000  00100000  00001000  00101111
  01001111  11101100  11001011  00000011  01101101
  11101101  01100110  01100111  00001100  00001110
  00110101  11001110  01101001  01000001

Indexing
  G A S C T P V D N L I Q E H M F K Y W R
  0 0 0 0 0 1 1 1 0 0 0 1 1 0 1 0 1 1 1 1

Fig. 8. Decision tree and indexing at Result1

Amino Acid                              G A S C T P V D N L I Q E H M F K Y W R
Indexing                                0 0 0 0 0 1 1 1 0 0 0 1 1 0 1 0 1 1 1 1
Identified as N-myristoylation signal   ● ● ● ● ● - ● - ● ● ● ● - ● - - - - - -
in position 3

Fig. 9. Indexing of Rule1

Fig. 8 shows a rule produced by BONSAI. The decision tree of the rule has a simple structure, as shown in the figure, in which a binary pattern (b-pattern for short) of length 8, such as 00000111, is assigned to each node. These b-patterns were obtained by replacing the amino acid residue symbols with the symbol 0 or 1 according to the indexing table in the figure. All 29 such b-patterns are listed in the table in Fig. 8. Among these 29 b-patterns, two positions show characteristic values: 23 b-patterns have 0 at position 3 (79%) and 27 b-patterns have 0 at position 6 (93%). Noting that most of the positive examples given to BONSAI have Ser at position 6 and that the indexing assigned the symbol 0 to Ser, we can see why 93% of the b-patterns have the symbol 0 at position 6.

Fig. 9 summarizes the relationship between the dependency of the amino acid at position 3 on Ser at position 6 and the indexing obtained by BONSAI. The eleven amino acid residues that have been biologically determined to be permitted at position 3 when Ser is present at position 6 [2] are marked with black circles in the figure. Comparing the black circles with the indexing, we can see that 9 of these 11 amino acid residues (all except Val and Gln) have been assigned the symbol 0.
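Two of the counts quoted above can be rechecked with a few lines of Python. This is only a bookkeeping sketch: the indexing and the 29 b-patterns are read from Fig. 8, and the alignment of the eight pattern characters to positions 3 to 10 is an assumption taken from the second position header of that figure.

```python
AMINO_ACIDS = "GASCTPVDNLIQEHMFKYWR"
INDEX_SYMBOLS = "00000111000110101111"      # indexing of Rule 1 (Fig. 8)
INDEXING = dict(zip(AMINO_ACIDS, INDEX_SYMBOLS))

# Residues permitted at position 3 when Ser is at position 6 [2].
PERMITTED_AT_3 = "GASCTVNLIQH"

# The 29 b-patterns of Fig. 8; character i is assumed to be position i + 3.
B_PATTERNS = """
00000111 00000110 01000111 01001000 11001000
00101011 00101000 10000100 00101011 10001011
01110101 01101000 00100000 00001000 00101111
01001111 11101100 11001011 00000011 01101101
11101101 01100110 01100111 00001100 00001110
00110101 11001110 01101001 01000001
""".split()

zero_class = [aa for aa in PERMITTED_AT_3 if INDEXING[aa] == "0"]
exceptions = [aa for aa in PERMITTED_AT_3 if INDEXING[aa] == "1"]
print(len(zero_class), exceptions)          # 9 ['V', 'Q']

zeros_at_6 = sum(pattern[6 - 3] == "0" for pattern in B_PATTERNS)
print(zeros_at_6, len(B_PATTERNS), round(100 * zeros_at_6 / len(B_PATTERNS)))  # 27 29 93
```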
This means that BONSAI has worked well in finding requirements for N-myristoylation in the given amino acid sequences. Fig. 8 also shows a relationship between positions 3 and 7: if the symbol at position 3 is '1', the symbol at position 7 is also '1'. This probably reflects the fact that Lys can be located at position 3 when Lys is present at position 7, but not otherwise [2].

5.2 Rule 2: New Rules of Amino Acid Requirements Predicted by BONSAI

The 78 confirmed N-myristoylation sequences were provided to BONSAI as positive examples. As negative examples, we used 100 sequences randomly selected from the NCBI database in order to avoid a long processing time by BONSAI. We extracted sequences of length 20 from these positive and negative examples and removed the first symbol M from them. With the resulting sequences of length 19, BONSAI suggested the rule shown in Fig. 10. The decision tree is not shown in the figure since it has the same structure as the one in Fig. 8. In addition, in accordance with the biological observation that amino acids up to position 10 affect N-myristoylation, only the parts of the b-patterns from positions 2 to 10 are presented in the table.

Pattern of N-myristoylation

Indexing
  G A S C T P V D N L I Q E H M F K Y W R
  1 1 1 1 1 0 1 1 1 1 1 1 1 1 1 0 1 1 0 1

b-patterns (position 2 3 4 5 6 7 8 9 10)
  111111110...
  111111111...
  111111101...
  111111011...
  111011111...

Fig. 10. Binary pattern of nodes in decision tree and indexing of Rule2

Fig. 11. Biological Interpretation of Rule2

We extracted the following rule from the result of BONSAI.

– If a protein is N-myristoylated, then the sequence of the protein satisfies one of the following conditions:
  • only one of the three amino acid residues Pro, Phe, and Trp appears, at one of the four positions 5, 8, 9, and 10 in the sequence, or
  • none of these three residues appears at any position from 2 to 10 in the sequence.

Taking the contrapositive of the above rule, we obtain the following (Fig. 11).

(Proposition from BONSAI)
– If the sequence of a protein satisfies one of the following conditions:
  • one of these three residues appears at any of positions 2, 3, 4, 6, and 7 in the sequence, or
  • the sequence has more than one residue of Pro, Phe, and Trp at positions 5, 8, 9, and 10,
  then the protein is not N-myristoylated.
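The proposition can be written as a small predicate. The following Python sketch is an illustration only: the function name and the second to fourth example sequences are hypothetical, and positions are counted from the initial Met as position 1.

```python
PFW = set("PFW")                      # Pro, Phe, Trp
STRICT_POSITIONS = (2, 3, 4, 6, 7)    # none of P/F/W is tolerated here
LOOSE_POSITIONS = (5, 8, 9, 10)       # at most one of P/F/W is tolerated here


def excluded_by_proposition(sequence):
    """Return True if the proposition predicts 'not N-myristoylated'.

    `sequence` includes the initial Met, so sequence[0] is position 1.
    """
    if len(sequence) < 10:
        raise ValueError("positions 1 to 10 are required")

    def residue_at(position):
        return sequence[position - 1]

    if any(residue_at(p) in PFW for p in STRICT_POSITIONS):
        return True
    return sum(residue_at(p) in PFW for p in LOOSE_POSITIONS) > 1


print(excluded_by_proposition("MGARNSVLSGKKADE"))  # False (first entry of Table 1)
print(excluded_by_proposition("MGPAAAAAAA"))       # True  (Pro at position 3)
print(excluded_by_proposition("MGAAPAAPAA"))       # True  (Pro at positions 5 and 8)
print(excluded_by_proposition("MGAAPAAAAA"))       # False (single Pro at position 5)
```

A result of False does not mean that the protein is N-myristoylated; the proposition only states a condition under which N-myristoylation is ruled out.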


In the following, we consider the biological meaning of (Proposition from BONSAI). First, there has been no biological examination of the amino acid requirements at positions 8, 9, and 10, and it has been biologically confirmed that the amino acid residue at position 5 does not affect N-myristoylation [9,10]. However, the part "if the sequence of a protein has more than one amino acid residue of Pro, Phe, and Trp at positions 5, 8, 9, and 10, then the protein is not N-myristoylated" of (Proposition from BONSAI) suggests the possibility that a protein with more than one Pro at positions 5, 8, 9, and 10 will not be N-myristoylated. That is, Pro at position 5 of a protein may affect N-myristoylation of the protein, which has not been stated in the literature. Second, the part "if the sequence of a protein has Pro, Phe, or Trp at any of positions 2, 3, 4, 6, and 7, then the protein is not N-myristoylated" includes the biologically confirmed fact that Pro is not allowed at positions 2, 3, 6, and 7 [9,10]. For position 4, furthermore, (Proposition from BONSAI) suggests the new possibility that Pro, Phe, and Trp cannot be located at position 4, whereas it has been considered that any of these amino acid residues can be located at position 4.

6 Conclusion

As the number of sequences produced by biological experiments, such as amino acid sequences and base sequences, increases, computational techniques for identifying patterns in these sequences will become more important. Using the machine learning system BONSAI, this paper examined the requirements on amino acid patterns for protein N-myristoylation. The amino acid positions suggested for N-myristoylation include not only the known positions but also positions that have not been biologically confirmed. As the next stage, we will verify the new suggestions with the help of researchers in biology.

Acknowledgments. The authors thank Prof. Toshihiko Utsumi at Yamaguchi University for insightful comments on this study. The work was partially supported by a Grant-in-Aid for Scientific Research on Priority Areas "Systems Genomics" from the Ministry of Education, Culture, Sports, Science, and Technology, Japan.

References

1. Maurer-Stroh, S., Eisenhaber, B., Eisenhaber, F.: N-terminal N-myristoylation of proteins: refinement of the sequence motif and its taxon-specific differences. J. Mol. Biol. 317 (2002) 523–540
2. Utsumi, T., Nakano, K., Funakoshi, T., Kayano, Y., Nakao, S., Sakurai, N., Iwata, H., Ishisaka, R.: Vertical-scanning mutagenesis of amino acid in a model N-myristoylation motif reveals the major amino-terminal sequence requirements for protein N-myristoylation. Eur. J. Mol. Biochem. 271 (2004) 863–874
3. Utsumi, T., Sato, M., Nakano, K., Takemura, D., Iwata, H., Ishisaka, R.: Amino Acid Residue Penultimate to Amino-terminal Gly Residue Strongly Affects Two Cotranslational Protein Modifications, N-Myristoylation and N-Acetylation. J. Biol. Chem. 276 (2001) 10505–10513
4. Shimozono, S., Shinohara, A., Shinohara, T., Miyano, S., Kuhara, S., Arikawa, S.: Knowledge Acquisition from Amino Acid Sequences by Machine Learning System BONSAI. Trans. Inform. Process. Soc. Japan 35 (1994) 2009–2018
5. Farazi, T.A., Waksman, G., Gordon, J.I.: The biology and enzymology of protein N-myristoylation. J. Biol. Chem. 276 (2001) 39501–39504
6. Resh, M.D.: Fatty acylation of proteins: new insights into membrane targeting of myristoylated and palmitoylated proteins. Biochim. Biophys. Acta 1451 (1999) 1–16
7. Zha, J., Weiler, S., Oh, K.J., Wei, M.C., Korsmeyer, S.J.: Posttranslational N-myristoylation of BID as a molecular switch for targeting mitochondria and apoptosis. Science 290 (2000) 1761–1765
8. Kyte, J., Doolittle, R.F.: A simple method for displaying the hydropathic character of a protein. J. Mol. Biol. 157 (1982) 105–132
9. Towler, D.A., Adams, S.P., Eubanks, S.R., Towery, D.S., Jackson-Machelski, E., Glaser, L., Gordon, J.I.: Purification and characterization of yeast myristoyl CoA:protein N-myristoyltransferase. Proc. Natl. Acad. Sci. USA 84 (1987) 2708–2712
10. Rocque, W.J., McWherter, C.A., Wood, D.C., Gordon, J.I.: A comparative analysis of the kinetic mechanism and peptide substrate specificity of human and Saccharomyces cerevisiae myristoyl-CoA:protein N-myristoyltransferase. J. Biol. Chem. 268 (1993) 9964–9971
11. NCBI: ftp://ftp.ncbi.nih.gov/

A Profile HMM for Recognition of Hormone Response Elements

Maria Stepanova1, Feng Lin2, and Valerie C.-L. Lin3

1 Bioinformatics Research Centre, Nanyang Technological University, 50 Nanyang Drive, Singapore 637553
2 School of Computer Engineering, Nanyang Technological University, Block N4, Nanyang Avenue, Singapore 639798
3 School of Biological Sciences, Nanyang Technological University, 60 Nanyang Drive, Singapore 637551
{mari0004, asflin, cllin}@ntu.edu.sg

Abstract. Steroid hormones are necessary for most vital functions of vertebrate organisms, and act within cells via interaction with their receptor molecules. Steroid hormone receptors are transcription factors. Identification of hormone response elements (HREs) on DNA is essential for understanding the mechanism of gene regulation by steroid hormones. In this work we present a systematic approach for the recognition of steroid HREs within promoters of vertebrate genomes, based on an extensive experimental dataset and a specifically constructed Profile Hidden Markov Model of putative HREs. The model can be trained for further prediction of HREs in promoters of hormone responsive genes, and therefore for the investigation of direct targets of androgen, progesterone and glucocorticoid hormones. Additional documentation and supplementary data, as well as the web-based program developed for steroid HRE prediction, are available at http://birc.ntu.edu.sg/~pmaria.

1 Introduction

A large number of ontogenetic and physiological processes in different organisms, from fungi to humans, are regulated by a small group of steroid hormones. It is hard to overestimate the significance of steroid hormones throughout the development of an individual. Steroid hormones play a central role in the regulation of all aspects of female reproductive activity leading to the establishment and maintenance of pregnancy [1]. Steroid hormones are also essential for male fertility [2], and some of them have been implicated in the cardiovascular [3], immune [4], and central nervous systems [5], as well as in bone function [6]. The steroid hormone family includes estrogen, progesterone, androgens, glucocorticoids, and mineralocorticoids, which are synthesized from cholesterol and secreted by endocrine cells [7].

J.C. Rajapakse, L. Wong, and R. Acharya (Eds.): PRIB 2006, LNBI 4146, pp. 15–22, 2006. © Springer-Verlag Berlin Heidelberg 2006

The steroid hormone receptors (HRs) are intracellular transcription factors that exist in inactive apoprotein forms in either the cytoplasm or the nucleus [8]. Binding of a hormone results in an allosteric change in the conformation of the receptor (a process known as "activation of a receptor") that raises the affinity of the receptor for DNA; this allows the receptor to bind to specific parts of the DNA molecule inside the nucleus (hormone response elements, or HREs) and to regulate transcription of cis-linked genes. In addition to regulating transcription, steroid hormones occasionally regulate gene expression by affecting mRNA stability and translational efficiency [7].

Consensus steroid hormone response elements contain symmetric imperfect repeats, namely direct, palindromic, and inverted palindromic repeats, of the hexameric half-site sequence 5'-AGAACA-3'. These half-sites are usually separated by a 3bp-long spacer [9] (except for the Estrogen Response Element (ERE), which has some other distinctive features and is not included in this work [10]). In natural promoters, HREs display great diversity in nucleotide sequence; some substitutions may contribute to a degree of receptor specificity, whereas others may be incidental. Mutational analysis allows the relative significance of every position within the response element to be estimated. It is worth mentioning the works by Dahlman-Wright et al. [9], Barbulescu et al. [11], and Truss et al. [12] and the review by Evans [13], where the specific structure of HREs is described on the basis of a series of experiments.

Activated HRs are usually considered classic vertebrate transcription factors, so classic methods for predicting transcription factor binding sites (TFBSs) can also be used for the prediction of steroid HREs. A review of possible approaches to the recognition of binding sites in general has recently been published by Wasserman and Sandelin [14]. Unfortunately, these methods have very low specificity due to the great diversity of TFBSs. A possible way to improve the accuracy of prediction is to take into account the specific structure of a particular TFBS and to construct the model with consideration of its specific features. Specific HRE-like patterns have lately become an object of interest for several research groups: the works by Favorov et al. [15], Sandelin and Wasserman [16], and Bono [17] mainly focus on specific HRE-like structures, and the work by Bajic et al. [10] describes a method and a tool for the steroid hormone estrogen. However, the performance of the proposed NHRE methods is limited by insufficient training sets, as well as by the high level of false positives inherent in models based on single-nucleotide position frequencies.

In this work we present a systematic approach for the recognition of HREs within promoters of vertebrate genomes, based on extensive experimental data collected from the literature and a classic method commonly used for profile modeling, the Profile Hidden Markov Model [18]. The model can be used for prediction of HREs for further investigation of androgen, progesterone and glucocorticoid primary target genes.

2 Methods

2.1 Data Collection

Seven hundred experimentally verified binding sites for androgen, glucocorticoid and progesterone nuclear receptors were collected from the biomedical literature. For a binding site to be accepted into the collection, convincing experimental evidence was required: the site had to be at least validated for binding in vitro and demonstrated to mediate a response in plasmid transfection assays. Further requirements were a positive identification of the interacting steroid hormone receptor and an experimentally based identification of the binding site position. A binding site was not included in the collection if the corresponding literature source contained ambiguous or insufficient information, in particular if the experimental data showed only the location of a protected region while the position of the binding site was predicted by sequence comparison with the known ARE/PRE/GRE consensus, or if the binding site was predicted only by a transfection assay (or another indirect method) without demonstration of a direct receptor-DNA interaction. To avoid over-fitting of the model, we included each HRE in the database only once, even if the binding site was reported two or more times as verified by different experimental methods and the corresponding primer had been retrieved from one source. Each reported bound sequence was included with three flanking nucleotides in both directions. The positions of the two half-sites of the response element were recorded if this information was given; if not, the internal structure of the response element was determined by pairwise alignment of the sequence with the known consensus binding site.

All retrieved binding sites were combined into the Tiger HRE database. Every entry of this database is characterized by

i. the response element nucleotide sequence (if known, the positions of the two half-sites to which a receptor binds as a dimer are indicated);
ii. the steroid hormone for whose receptor binding was detected (if the same binding site was reported to bind two or three steroid hormone receptors in the same literature source, it corresponds to several entries);
iii. the corresponding hormone-regulated gene (if one exists and is mentioned);
iv. the species whose genomic DNA (used in the experiment) contained the response element;
v. the position relative to the transcription start site (if the response element was retrieved from the promoter or enhancer region or the first exon of a hormone-regulated gene);
vi. the experimental method of binding detection;
vii. the reference.

After implementation of the proposed algorithm for HRE recognition, each entry of the database was supplemented with the corresponding probability value for its HRE sequence. The final version of the database was implemented as a table in the MySQL database system.

2.2 Hidden Markov Model Algorithm for HRE Recognition

The proposed profile HMM is depicted in Fig. 1. It is essentially a composition of five independent HMMs, one for each constituent part of the HRE pattern: two flanking regions, two half-sites for dimer binding, and a spacer separating them. Each of these constituent domains is expected to have its own properties (i.e., internal transition probabilities) and therefore has to be examined and trained separately; the transition probabilities between consecutive domains must also be evaluated.

[Figure: state diagram of the profile HMM. A begin state B and an end state E enclose numbered states that emit A, C, G or T, grouped into five blocks: 3bp-long left flanking region (4x4 possible transitions), 6bp-long left (low conservative) half-site, 3bp-long spacer, 6bp-long right (highly conserved) half-site with the 100% conservative GpT dinucleotide and the rejecting "NO" state 19, and 3bp-long right flanking region.]

Fig. 1. Hidden Markov Model for HRE recognition

As the right half-site is found to be conserved at a rate close to 100%, a more specific topology of state transitions is defined for it. In addition, since the dinucleotide GpT was shown to be a characteristic feature of almost all functional HREs (as shown in all works mentioned in the Introduction), the profile HMM requires it as a necessary component of an input sequence: if a path leads to state 19, the model emits "NO" and the probability of the sequence is set to 0. The training sequences differ somewhat in length, because not all of them are reported with flanking regions in the corresponding literature. Hence, a normalization procedure is applied to the probability value: the logarithm of the probability is divided by the sequence length. A prior distribution derived from position frequency matrices is also used. Alignments of the experimentally verified HREs from the Tiger HRE database were used for Maximum Likelihood (ML) estimation of the transition probabilities of the profile HMM. The probability value obtained with this method is hereafter denoted HMMS (Hidden Markov Model Similarity) and is calculated as the product of the transition probabilities encountered when aligning the sequence to the constructed HMM. The resulting values of the HMM parameters are given in the Supplemental Info section. Recognition in longer stretches of DNA is then performed by moving a 21bp-long window along the sequence being scanned for HREs.

2.3 Accuracy Estimation

To assess the accuracy of the predictions of the profile HMM, we used a cross-validation approach for sensitivity assessment, with 70% of the collected dataset used for training and 30% for testing. To estimate how often signals occur by chance (random estimation, or re-value), we generated 10 random 'DNA' sequences, each 50Mbp long, with all 'nucleotides' equally frequent and all positions independent, and measured the prediction level on them.

2.4 Web-Based Tool for HMM Prediction of HREs

The publicly available version of the program allows users to input a sequence in FASTA, GenBank, EMBL, IG, GCG or plain format, either by pasting it into an input box or by reading it from a text file. The user can also select the accuracy level using the provided table of corresponding sensitivity and specificity values. The submitted sequence may be up to 5kb long and must not be shorter than the pattern length of 21 bp = 3 + 6 + 3 + 6 + 3 bp (two half-sites separated by a 3bp spacer in the consensus, together with two 3bp-long flanking regions). The resulting output includes the relative position of each match within the submitted sequence; the direct or complementary DNA strand (if inclusion of the complementary strand was selected before the search started); the actual nucleotide composition of the HRE found; whether the HRE is novel or known (known means present in the training set); and the HMM-based probability. For further investigation, the user can submit the sequence to other web-based recognition tools (reviewed above) to assess the presence of other binding sites in the surrounding area and to predict the functionality of a potential regulatory complex. To analyze the promoter region of a gene of interest, the user needs to extract the promoter region from any public database of promoters (for example, BEARR [19]) and submit the sequence to the form.

3 Results

3.1 Database of Hormone Response Elements

The benchmark dataset for training and testing of the model was collected from 174 different biomedical literature sources (the final version at the date of paper submission contains 712 hormone response elements). Such a collection has no analogs among the current public and commercial databases of TFBS profiles with respect to hormone response elements. While a few of the regulatory elements are derived from genes of insects and birds, most of the sites are mammalian, with 89% of all sites from human or rodent genes. It is also worth mentioning that most collections do not separate experimentally confirmed binding sites from merely recognized ones: when a DNA region has been found to exhibit promoter activity, regions similar to the HRE consensus are sought in the long promoter sequence by computational methods. Our aim was to collect sites with binding affinity, whatever their structure, so only experimentally confirmed binding sites were included in the current dataset.

3.2 Accuracy of Prediction

The Hidden Markov Model provides a versatile method for recognizing sequence transition patterns. A specifically designed HMM, with its states, emission letters and transition probabilities, can best characterize the transition patterns in the nucleotide sequence of interest. We designed and implemented a profile HMM that takes into account the specific structure of the HRE sequences being recognized. In the current work the HMM approach achieved a sensitivity of 88% with a re-value of 1:1217 bp (threshold of normalized probability 0.33), and a prediction level of 1:6.4 kb with 63% of true positives (threshold 0.36). Sensitivity and re-values were evaluated as described in the previous sections. Considering the trade-off between sensitivity and specificity, we selected a threshold of 0.343, with a sensitivity of 79% and a specificity of 1 prediction per 3.9 kb, for future analysis of hormone responsive genes.
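The three operating points quoted above can be tabulated to make the trade-off concrete. The sketch below is illustrative only: `pick_threshold` and `expected_predictions` are hypothetical helpers, and treating "1 prediction per N bp" as a uniform background rate over a region is an assumption made here, not a claim from the paper.

```python
# (threshold, sensitivity, bp per one background prediction), as reported above
OPERATING_POINTS = [
    (0.33, 0.88, 1217),
    (0.343, 0.79, 3900),
    (0.36, 0.63, 6400),
]


def expected_predictions(region_length_bp, bp_per_prediction):
    """Expected number of background predictions in a region, assuming a
    uniform rate of one prediction per bp_per_prediction bases."""
    return region_length_bp / bp_per_prediction


def pick_threshold(min_sensitivity):
    """Most specific reported threshold whose sensitivity is still at least
    min_sensitivity, or None if no reported point qualifies."""
    eligible = [pt for pt in OPERATING_POINTS if pt[1] >= min_sensitivity]
    if not eligible:
        return None
    return max(eligible, key=lambda pt: pt[2])[0]


if __name__ == "__main__":
    for threshold, sensitivity, bp in OPERATING_POINTS:
        rate = expected_predictions(3500, bp)  # hypothetical 3,500-bp region
        print(f"threshold={threshold} sensitivity={sensitivity:.0%} "
              f"expected background hits={rate:.2f}")
```

Requiring at least 75% sensitivity, for example, selects the 0.343 threshold chosen in the text.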


M. Stepanova, F. Lin, and V.C.-L. Lin

In the web-based version of the model, the accuracy level is a user-defined parameter. If the system reports no HRE patterns in the query sequence, the user may increase the sensitivity (by decreasing the threshold) and repeat the analysis. Conversely, the user can reduce the sensitivity level if the detected HRE patterns seem to be false positive predictions; reducing the sensitivity should decrease the number of potential false positives.

3.3 Analysis of Steroid Hormone Primary Target Genes

In this study, we evaluated our model using the reported progesterone responsive genes [20]. Although a particular gene might be hormone-regulated through any of several indirect pathways, primary target genes are expected to contain an HRE in their regulatory area. For a list of 380 human PR-regulated genes we selected their promoters (areas [-3000; +500] relative to the annotated transcription start sites) from the NCBI GenBank database (build 35.1), and scanned them using the strategy described above with the optimal recognition thresholds. The set of all human genes was used as a control for the 'noise' level. The average number of PREs found in the promoter area of the 380 PR-responsive genes is 1.06, while for the total set of human genes this value is 0.62 HREs per promoter. A second negative control is ERE recognition within the promoters of PR-responsive genes, because progesterone primary target genes are considered not to be estrogen-regulated. We used the database of EREs [10] with exactly the same PWM training and testing procedure, and selected recognition thresholds that keep the same sensitivity as for PRE prediction, 79%. The average number of EREs is 0.66 per promoter of PR-responsive genes.

The highest frequencies of PREs were found in the promoter areas of the human CMAH gene (encoding cytidine monophosphate-N-acetylneuraminic acid hydroxylase) and of AOX1 (aldehyde oxidase 1), with 6 and 5 per promoter region respectively. There were also 7 genes with 4 predicted PREs (1.8% of the 380), 34 with 3 matches (8.9%), 62 with 2 (16.3%) and 118 with only one promoter-located PRE predicted (31.1% of the 380 reported PR-responsive human genes). The highest probability of being a steroid hormone primary target gene was found for the human MMP1 gene, which encodes matrix metalloproteinase 1 (interstitial collagenase). Its promoter contains three predicted HREs, two of which are adjacent (adjacent sites have previously been reported to have a very high chance of being functional [21]). The steroid hormone progesterone was previously reported to reduce the level of human MMP1 gene expression significantly [22]. The second most significant PR-responsive gene, NGRF, was also reported to be progesterone-regulated [23].

3.4 Proposal for Modeling of Secondary Response

It is well known that transcription regulatory mechanisms, complicated in themselves, become even more intricate when considered from the point of view of secondary response. However, with more experimental information becoming available, it is tempting to look further and investigate the induced effects of the first level of regulation. In the current list of PR-regulated genes there are at least 8 genes whose product proteins are involved in transcriptional regulation. Among them there is one gene, FOSL1, which has been proved to be a primary target. Even this single case is informative. The Fos gene family consists of 4 members: FOS, FOSB, FOSL1, and FOSL2. These genes encode leucine zipper proteins that can dimerize with proteins of the JUN family, thereby forming the transcription factor complex AP-1. As such, the FOS proteins have been implicated as regulators of cell proliferation, differentiation, and transformation (i.e. the processes in which progesterone regulation is extremely important). For example, the IL-8 gene is also known to be progesterone regulated, and the FOS transcription factor has recently been reported to be involved in the regulation of the IL-8 gene [24]. Therefore, it is at least reasonable to look at the putative pathway of regulation: progesterone → human FOSL1 gene → Fos transcription factor → regulation of IL-8.

In conclusion, we present a novel program for identification of a class of steroid hormone response elements (HREs) in genomic DNA, including HREs for androgen, glucocorticoid and progesterone. The detection algorithm uses a profile Hidden Markov Model representation of the sequence of interest and takes into account its specific structure. After a series of independent tests on several large datasets, we estimated an appropriate operating point as a sensitivity of 79% with a specificity of 1 prediction per 3.9 kb. Users can further investigate selected regions around the identified HRE patterns for transcription factor binding sites based on publicly available TFBS databases, assess whether promoter sequences are hormonally regulated, and thereby predict steroid hormone primary target genes.

References

1. Conneely OM (2001) Perspective: Female Steroid Hormone Action. Endocrinology. 142(6):2194-2199
2. Eddy EM, Washburn TF, Bunch DO, Goulding EH, Gladen BC, Lubahn DB, and Korach KS (1996) Targeted disruption of the estrogen receptor gene in male mice causes alteration of spermatogenesis and infertility. Endocrinology. 137(11):4796-4805
3. Pelzer T, Shamim A, Wolfges S, Schumann M, and Neyses L (1997) Modulation of cardiac hypertrophy by estrogens. Adv Exp Med Biol. 432:83-89
4. Cutolo M, Sulli A, Capellino S, Villaggio B, Montagna P, Seriolo B, and Straub RH (2004) Sex hormones influence on the immune system: basic and clinical aspects in autoimmunity. Lupus. 13(9):635-638
5. Maggi A, Ciana P, Belcredito S, and Vegeto E (2004) Estrogens in the nervous system: mechanisms and nonreproductive functions. Annu Rev Physiol. 66:291-313
6. Kearns AE and Khosla S (2004) Potential anabolic effects of androgens on bone. Mayo Clin Proc. 79(4S):14-18
7. Tsai MJ and O'Malley BW (1994) Molecular mechanisms of action of steroid/thyroid receptor superfamily members. Annu Rev Biochem. 63:451-486
8. Alberts B, Bray D, Lewis J, Raff M, Roberts K, and Watson J (1994) Intercellular signalling. Molecular Biology of the Cell. Garland Publishing, New York
9. Dahlman-Wright K, Siltala-Roos H, Carlstedt-Duke J, and Gustafsson JA (1990) Protein-protein interactions facilitate DNA binding by the glucocorticoid receptor DNA-binding domain. J Biol Chem. 265(23):14030-14035
10. Bajic VB, Tan SL, Chong A, Tang S, Strom A, Gustafsson JA, Lin CY, and Liu ET (2003) Dragon ERE Finder version 2: A tool for accurate detection and analysis of estrogen response elements in vertebrate genomes. Nucleic Acids Res. 31(13):3605-3607
11. Barbulescu K, Geserick C, Schuttke I, Schleuning WD, and Haendler B (2001) New androgen response elements in the murine pem promoter mediate selective transactivation. Mol Endocrinol. 15(10):1803-1816
12. Truss M, Chalepakis G, and Beato M (1990) Contacts between steroid hormone receptors and thymines in DNA: an interference method. Proc Natl Acad Sci USA. 87(18):7180-7184
13. Evans RM (1988) The steroid and thyroid hormone receptor superfamily. Science. 240(4854):889-895
14. Wasserman WW and Sandelin A (2004) Applied bioinformatics for the identification of regulatory elements. Nat Rev Genet. 5(4):276-287
15. Favorov AV, Gelfand MS, Gerasimova AV, Ravcheev DA, Mironov AA, and Makeev VJ (2005) A Gibbs sampler for identification of symmetrically structured, spaced DNA motifs with improved estimation of the signal length. Bioinformatics. 21(10):2240-2245
16. Sandelin A and Wasserman WW (2005) Prediction of nuclear hormone receptor response elements. Mol Endocrinol. 19(3):595-606
17. Bono HU (2005) SayaMatcher: Genome scale organization and systematic analysis of nuclear receptor response elements. Gene. 364:74-78
18. Eddy SR (1998) Profile hidden Markov models. Bioinformatics. 14(9):755-763
19. Vega VB, Bangarusamy DK, Miller LD, Liu ET, and Lin CY (2004) BEARR: Batch Extraction and Analysis of cis-Regulatory Regions. Nucleic Acids Res. 32(Web Server Issue):257-260
20. Leo JC, Wang SM, Guo CH, Aw SE, Zhao Y, Li JM, Hui KM, and Lin VC (2005) Gene regulation profile reveals consistent anticancer properties of progesterone in hormone-independent breast cancer cells transfected with progesterone receptor. Int J Cancer. 117(4):561-568
21. Tsai SY, Tsai MJ, and O'Malley BW (1989) Cooperative binding of steroid hormone receptors contributes to transcriptional synergism at target enhancer elements. Cell. 57(3):443-448
22. Lapp CA, Lohse JE, Lewis JB, Dickinson DP, Billman M, Hanes PJ, and Lapp DF (2003) The effects of progesterone on matrix metalloproteinases in cultured human gingival fibroblasts. J Periodontol. 74(3):277-288
23. Bjorling DE, Beckman M, Clayton MK, and Wang ZY (2002) Modulation of nerve growth factor in peripheral organs by estrogen and progesterone. Neuroscience. 110(1):155-167
24. Hoffmann E, Thiefes A, Buhrow D, Dittrich-Breiholz O, Schneider H, Resch K, and Kracht M (2005) MEK1-dependent delayed expression of Fos-related antigen-1 counteracts c-Fos and p65 NF-kappaB-mediated interleukin-8 transcription in response to cytokines or growth factors. J Biol Chem. 280(10):9706-9718

Graphical Approach to Weak Motif Recognition in Noisy Data Sets

Loi Sy Ho1 and Jagath C. Rajapakse1,2,3

1 BioInformatics Research Center, School of Computer Engineering, Nanyang Technological University, Singapore 639798
{slho, asjagath}@ntu.edu.sg
2 Biological Engineering Division, Massachusetts Institute of Technology, Cambridge, MA 02139, USA
3 Singapore-MIT Alliance, N2-B2C-15, 50 Nanyang Avenue, Singapore 639798

Abstract. Accurate recognition of motifs in biological sequences has become a central problem in computational biology. Though previous approaches have shown reasonable performances in detecting motifs having clear consensus, they are inapplicable to the recognition of weak motifs in noisy datasets, where only a fraction of the sequences may contain motif instances. This paper presents a graphical approach to deal with the real biological sequences, which are noisy in nature, and find potential weak motifs in the higher eukaryotic datasets. We examine our approach on synthetic datasets embedded with the degenerate motifs and show that it outperforms the earlier techniques. Moreover, the present approach is able to find the wet-lab proven motifs and other unreported significant consensus in real biological datasets.

1 Introduction

The central dogma of molecular biology is that DNA produces RNA, which in turn produces protein. To regulate transcription, a set of proteins called transcription factors (TFs) bind to short subsequences in the promoter region and activate the transcription machinery. Such subsequences are called transcription factor binding sites (TFBSs). Since a TF can bind to several sites in the promoter regions of different genes, these sites should share common patterns, or motifs. A motif is defined as a representation of a set of subsequences that are prevalent in a class of biological sequences and share a similar composition of symbols. For instance, the TATA box is a motif at the site of transcription initiation. Motifs such as Shine-Dalgarno sequences (also called ribosome binding sites (RBSs)) are involved in translational initiation and are conserved in most promoter regions of prokaryotic genes. Identification of motifs in DNA sequences provides important clues for understanding proteins, DNA-protein interactions and gene regulatory networks.

Since little is known about most TFs and the variability of their binding sites, wet-lab experiments to locate related motifs in DNA sequences, such as the DNase I footprinting assay and the methylation interference assay [10], are both cumbersome and time consuming. Therefore, computational techniques and algorithms, providing efficient and low-cost solutions, have been rapidly developed for motif recognition. Based on the assumptions they use, these techniques are classified as either probabilistic or deterministic. Probabilistic approaches use a weight matrix to represent a motif and maximize the information content of the alignment of motif instances [1,2,6,11,13]. Deterministic approaches, on the other hand, exhaustively enumerate or search for motif consensus sequences [4,5,14,17]. Each approach has its own strengths and weaknesses, and depending on the task at hand one type of motif recognition approach may be more useful than others [7,8,18].

It has been observed that, for some TFs, the number of sequences containing TFBSs with a very similar pattern is insufficient for existing approaches to find the motif successfully [3]. Some motif consensus sequences may be present exactly in the datasets, while others appear with a small or significant number of degenerations. In practice, noise is inevitable in datasets because of experimental errors, failure to retrieve regions of suitable length containing the motifs, and so on. The problem of weak motif recognition (WMR), which is to discover a motif that has a significant number of degenerations randomly distributed over its relatively short length, has recently been addressed. The graphical approaches, such as WINNOWER [14], cWINNOWER [12], and MITRA [4], convert the subsequences in the dataset into vertices and use edges to indicate the relationships among possible instances. The random projection methods, such as PROJECTION [2], Multiprofiler [9], and Planted Motif Search [16], attempt to reduce the sample space by decreasing the motif length or the number of effective degenerate positions. Other approaches, such as SampleBranching [15] and SP-STAR [14], optimize a target function such as a pairwise scoring function.

Despite these attempts, it has been hard to develop an efficient algorithm for the WMR problem. The difficulty is mainly due to two reasons: (1) the large pairwise distance between the motif instances of two sequences hampers their detection, since an instance can be more similar to a random subsequence than to another motif instance, and (2) the time complexity of the detection increases and the accuracy decreases when corrupted sequences that contain no motif instance are present in the dataset. As a result, the previous WMR approaches are quite time consuming and vulnerable to noise. Earlier, in [19], Yang and Rajapakse proposed a graphical algorithm (hereinafter GWM) with superior running time and performance that can find weak motifs in datasets where each sequence contains at least one motif instance. However, a robust motif finding algorithm that tolerates a certain amount of noise in the datasets is of practical significance. In this paper, we propose the GWM2 approach, which extends the previous algorithm to find weak motifs in noisy datasets containing corrupted sequences. Our algorithm shows better robustness to noise and higher accuracy than the earlier methods. Moreover, GWM2 is able to find the wet-lab proven motifs and other unreported significant consensus sequences in real biological datasets. Although the illustration of our method in this paper is limited to DNA sequences, the method is generalizable to other biological sequences such as protein sequences.

J.C. Rajapakse, L. Wong, and R. Acharya (Eds.): PRIB 2006, LNBI 4146, pp. 23–31, 2006. © Springer-Verlag Berlin Heidelberg 2006

2 Method

Suppose that we are interested in finding motifs in $m$ DNA sequences given by the set $D = \{x_i : i = 1, 2, \ldots, m\}$, where the $i$th sequence $x_i = (x_{i1}, x_{i2}, \ldots, x_{in_i})$ has length $n_i$. The elements $x_{ij} \in \Omega$ of every sequence $x_i$ are drawn from the alphabet of nucleotides $\Omega = \{A, T, G, C\}$. We use $\Psi$ to represent the consensus of the motif that is derived from the alignment of a set of motif instances. Suppose that $K$ is the number of sequences that contain motif instances. If $K = m$, the dataset is called an exact dataset; otherwise ($K < m$) it is a noisy dataset. Here, we present an approach to the latter case, where each sequence $x_i$ contains either one or zero motif instances.

Let the motif be denoted as a pair $(l, d)$, where $l$ is the length of the motif and $d$ is the maximum number of degenerate positions by which a motif instance may differ from the consensus. We look for instances $\psi_k$, $k = 1, \ldots, K$, that satisfy $\mathrm{dis}(\Psi, \psi_k) \le d$, where $\mathrm{dis}(\cdot, \cdot)$ is a distance measure, say the Hamming distance, between two subsequences. $d$ can be set to a large value, but no more than a threshold $d'$, beyond which random motifs could be found in the same dataset. The threshold $d'$ is restricted by the inequality [2]:

$$4^l \left(1 - (1 - p)^{n-l+1}\right)^m < 1 \qquad (1)$$

where the left-hand side gives the expected number of random $(l, d')$ motif occurrences, $n = \max_{i=1}^{m} n_i$, and $p = \sum_{i=0}^{d'} \binom{l}{i} (3/4)^i (1/4)^{l-i}$ is the probability that two random subsequences of length $l$ differ in at most $d'$ positions.

In the graphical representation of the dataset, each subsequence is represented by a vertex [14]. Let vertex $v_{ij}$ represent the subsequence of length $l$ starting at position $j$ of the $i$th sequence, $s_{i,j} = (x_{ij}, x_{ij+1}, \ldots, x_{ij+l-1})$. The $K$ motif instances in the dataset are therefore assigned to certain vertices and must be determined from a total of $\sum_{i=1}^{m} (n_i - l + 1)$ vertices. For a given $(l, d)$ motif $\Psi$ in the dataset, any two instances of $\Psi$ differ in at most $2d$ positions. If the graph is constructed so that any two vertices $v_{ij}$ and $v_{pq}$, for $1 \le i \ne p \le m$, $1 \le j \le n_i$, and $1 \le q \le n_p$, are linked whenever $\mathrm{dis}(s_{i,j}, s_{p,q}) \le 2d$, then the motif instances represented by vertices in the graph are connected to each other and form a clique of size $K$. The motif recognition problem is then equivalent to finding $K$-cliques in a given graph. Although clique finding in graphs is known to be an NP-complete problem, in the present context its complexity is significantly lower because of the small ratio of the number of edges to the number of vertices in graphs for datasets of nucleotide or amino acid sequences [8]. Our algorithm consists of three steps: graph construction, clique finding, and rescanning.

2.1 Graph Construction

Let a selected sequence $x_r$, for $r = 1, \ldots, m - K + 1$, be referred to as the reference sequence, and suppose that the potential motif instance in the reference sequence is represented by the vertex $v_{r\rho}$, where $\rho$ indicates its starting position. As we are looking for motifs of length $l$, for each position $\rho = 1, \ldots, n_r - l + 1$ in the reference sequence we build a graph $G_\rho = (V_\rho, E_\rho)$ as follows:

1. Set $V_\rho = \{v_{r\rho}\}$ and $E_\rho = \emptyset$.
2. For $i = r + 1, \ldots, m$, find each subsequence $s_{i,j}$ represented by vertex $v_{ij}$, where $j = 1, 2, \ldots, n_i - l + 1$, and if $\mathrm{dis}(s_{r,\rho}, s_{i,j}) \le 2d$: $V_\rho = V_\rho \cup \{v_{ij}\}$.
3. For two different vertices $v_{ij}$ and $v_{pq} \in V_\rho$, if $\mathrm{dis}(s_{i,j}, s_{p,q}) \le 2d$: $E_\rho = E_\rho \cup \{e_{v_{ij}, v_{pq}}\}$. As sequence $x_i$ is assumed to contain at most one motif instance, no edge $e_{v_{ij}, v_{ij'}}$, where $j' = 1, 2, \ldots, n_i - l + 1$, is added to $E_\rho$.
4. For each $v_{ij} \in V_\rho$, define a triangle neighbor set $T_{ij}$, which consists of the elements $p$, $r + 1 \le p \le m$, satisfying $v_{pq} \in V_\rho$ and $e_{v_{ij}, v_{pq}} \in E_\rho$ for at least one index $q$, $1 \le q \le n_p$. Remove vertex $v_{ij}$ from $V_\rho$ and its corresponding edges from $E_\rho$ if $|T_{ij}| < K - 2$. This triangle criterion is what Pevzner and Sze called the $k = 2$ case [14].

After constructing the graph $G_\rho$, if $v_{r\rho}$ represents a real motif instance in the reference sequence $x_r$, the motif instances in the other sequences should be represented by vertices in the same graph $G_\rho$. The tenet of our approach is therefore to convert the given dataset into a set of graphs $G_\rho$, where $\rho = 1, \ldots, n_r - l + 1$, and to look for cliques of size $K$ such that each vertex in these cliques represents an actual motif instance.

2.2 Clique Finding

If the potential motif instance is represented by the vertex $v_{r\rho}$, the motif instances will be represented by a clique of $K$ vertices in the graph $G_\rho$. In what follows, we present an iterative approach to search for $K$-cliques in the graph $G_\rho$.

1. Define the set $C_k(i, j)$, corresponding to $v_{ij} \in V_\rho$, as the set of all possible $k$-cliques containing $k$ vertices that start from the vertex $v_{r\rho}$ and end at the vertex $v_{ij}$. Set $C_1(r, \rho) = \{v_{r\rho}\}$.
2. The iterative computation of $C_k(i, j)$ is then:
   (a) Set $C_k(i, j) = \emptyset$.
   (b) For each $v_{pq} \in V_\rho$, where $r \le p < i$ and $e_{v_{ij}, v_{pq}} \in E_\rho$, do
         For each $(k-1)$-clique $c \in C_{k-1}(p, q)$ do
           If $\{c \cup v_{ij}\}$ is a valid $k$-clique then
             $C_k(i, j) = C_k(i, j) \cup \{c \cup v_{ij}\}$
           End If
         Repeat
       Repeat
3. By increasing $k$ from 2 to $K$, if a clique of size $K$ exists in the graph $G_\rho$, there must exist a non-empty set $C_K(i, j)$ for a vertex $v_{ij} \in V_\rho$ that contains vertices forming a $K$-clique.

2.3 Rescanning

After obtaining the cliques of size $K$, a motif consensus $\Psi$ can be formed by aligning the instances corresponding to the vertices of each clique. As the sequences in the dataset become longer, spurious cliques can appear. Therefore, an extra step is necessary to rescan the dataset with the motif consensus derived from the earlier steps and save those instances $\psi_i$ that satisfy the inequality $\mathrm{dis}(\Psi, \psi_i) \le d$. This guarantees that all possible motif instances in each sequence are found, including spurious instances that are as well conserved as the real ones.

2.4 Algorithmic Complexity

For exact datasets where $K = m$, the motif recognition problem is solved efficiently by Yang and Rajapakse [19] in $O(nmA^2)$, where $A = n \sum_{i=0}^{2d} \binom{l}{i} (3/4)^i (1/4)^{l-i}$ is the number of random instances of an $(l, d)$ motif expected in a sequence of length $n$. The present approach GWM2 is a direct extension of our previous algorithm GWM to noisy datasets, where $K \le m$, hence requiring on the order of $\binom{m}{K} nkA^2$ computations. If most vertices in the graph $G_\rho$ are spurious or unrelated and are included in the sets $C_k(i, j)$ repeatedly, maintaining such sets of cliques can cost memory and time. However, as indicated in [14], as the size of the cliques grows, fewer spurious vertices are included; most $C_k(i, j)$ become empty as $k$ increases to $K$. Therefore, as will be shown in the next section, the running time of our approach in most of the experiments is reasonably small.
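The three steps of this section (graph construction, clique finding, rescanning) can be sketched compactly as below. This is a simplified illustration and not the authors' implementation: the function names are hypothetical, indices are 0-based, the triangle-neighbor pruning of step 4 is omitted for brevity, the consensus is taken as a per-column majority vote (an assumption), and no attempt is made to reach the running times reported later.

```python
from collections import Counter
from itertools import combinations


def hamming(a, b):
    """Number of mismatching positions between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))


def build_graph(seqs, r, rho, l, d):
    """Graph G_rho for the reference l-mer seqs[r][rho:rho+l] (no pruning)."""
    ref = seqs[r][rho:rho + l]
    vertices = [(r, rho)]
    for i in range(r + 1, len(seqs)):
        for j in range(len(seqs[i]) - l + 1):
            if hamming(ref, seqs[i][j:j + l]) <= 2 * d:
                vertices.append((i, j))
    edges = set()
    for (i, j), (p, q) in combinations(vertices, 2):
        # No edge inside one sequence: at most one instance per sequence.
        if i != p and hamming(seqs[i][j:j + l], seqs[p][q:q + l]) <= 2 * d:
            edges.add(((i, j), (p, q)))
            edges.add(((p, q), (i, j)))
    return vertices, edges


def find_cliques(vertices, edges, K):
    """Grow k-cliques from the reference vertex, one sequence at a time."""
    ref = vertices[0]
    cliques = {ref: [(ref,)]}  # C_1: cliques keyed by their last vertex
    for _ in range(2, K + 1):
        grown = {}
        for v in vertices[1:]:
            found = [c + (v,)
                     for u, partial in cliques.items()
                     if u[0] < v[0] and (u, v) in edges
                     for c in partial
                     if all((w, v) in edges for w in c)]
            if found:
                grown[v] = found
        cliques = grown
    return [c for group in cliques.values() for c in group]


def consensus(instances):
    """Majority nucleotide in each aligned column."""
    return "".join(Counter(col).most_common(1)[0][0] for col in zip(*instances))


def rescan(seqs, psi, d):
    """All (sequence index, start) pairs whose l-mer is within d of psi."""
    l = len(psi)
    return [(i, j) for i, seq in enumerate(seqs)
            for j in range(len(seq) - l + 1)
            if hamming(psi, seq[j:j + l]) <= d]


def gwm2_sketch(seqs, l, d, K):
    """Map each consensus derived from a K-clique to its rescanned instances."""
    results = {}
    for r in range(len(seqs) - K + 1):
        for rho in range(len(seqs[r]) - l + 1):
            vertices, edges = build_graph(seqs, r, rho, l, d)
            for clique in find_cliques(vertices, edges, K):
                psi = consensus([seqs[i][j:j + l] for i, j in clique])
                results[psi] = rescan(seqs, psi, d)
    return results
```

On a toy dataset of four short sequences, three of which carry an instance of ACGTACGT with at most one substitution, calling `gwm2_sketch(seqs, l=8, d=1, K=3)` recovers that consensus together with the positions of its instances.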

3 Experiments and Results

This section presents our experiments to evaluate the GWM2 approach on synthetic datasets and on real biological datasets for TFBS recognition, and compares its performance with that of the earlier methods. The real biological datasets are extracted from both prokaryotic and eukaryotic organisms; some of them are exact while the others are noisy.

3.1 Synthetic Data

The motif recognition techniques in our experiments were evaluated with two standard performance measures. The performance coefficient (PC) is defined as $PC = |\psi \cap \hat{\psi}| / |\psi \cup \hat{\psi}|$, where $\psi$ is the set of known motif instances and $\hat{\psi}$ is the set of predicted motif instances [14]. The success rate (SR) [15] is the ratio of the number of successes to the total number of trials. Because we use the consensus representation for the motifs found, SR is used to evaluate our algorithm.

Exact Data. The exact datasets are those used in [14]: there are 10 datasets, each of which consists of 20 DNA sequences of length 600 bp generated with independent and identically distributed nucleotides. The results of the former approaches are taken from [15,16]. Table 1 shows the performance measures and running times. It can be seen that the probabilistic approaches might run faster than the GWM2 approach, but they cannot guarantee finding the embedded motifs precisely. Compared with GWM [19], while both achieve a 100% success rate, GWM2 has a slower running time. Since GWM2 was designed to address the motif recognition problem in noisy datasets that contain corrupted sequences, it has to handle more complex characteristics of the given problem and was not optimized to recognize motifs in exact datasets in the fastest possible way. However, when we allowed only one motif to be recognized in the dataset, the running time of GWM2 decreased to an average of 26 seconds at SR = 100%. All performances and running times reported were averaged over the datasets.

Table 1. Comparison of the performance and running time of different approaches on the datasets used in the Challenge Problem [14] for finding (l = 15, d = 4) motifs

Algorithm          SR      PC     Running time
GibbsDNA           -       0.32   40 s
ProfileBranching   -       0.57   80 s
CONSENSUS          -       0.20   40 s
MEME               -       0.14   5 s
MITRA              100%    -      5 m
PROJECTION         100%    -      2 m
MULTIPROFILER      99.7%   -      1 m
PMS                100%    -      217 s
PatternBranching   99.7%   -      3 s
GWM [19]           100%    -      21 s
GWM2               100%    -      64 s

Fig. 1. Results of GWM2 on noisy datasets. Each dataset has 20 sequences that contain (15, 4) motif instances and m′ corrupted sequences that contain no motif instances.

Noisy Data. To show the tolerance to noise, we further evaluated the GWM2 approach on noisy datasets created by artificially introducing noisy sequences. Each noisy dataset consists of m = 20 sequences containing motif instances and m′ corrupted sequences. The sequences were chosen from the previous exact datasets and mixed randomly.
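For readers who want to reproduce a comparable setting, the sketch below generates a noisy planted-motif dataset and computes the performance coefficient over sets of instances. It rests on assumptions not stated in the text: sequences are generated from scratch rather than drawn from the challenge datasets of [14], each planted instance carries exactly d substitutions, and all names (`make_noisy_dataset`, `m_corrupt`, and so on) are hypothetical.

```python
import random

ALPHABET = "ACGT"


def random_sequence(n, rng):
    """Sequence of n independent, uniformly distributed nucleotides."""
    return "".join(rng.choice(ALPHABET) for _ in range(n))


def mutate(motif, d, rng):
    """Copy of motif with exactly d positions substituted (assumption)."""
    chars = list(motif)
    for pos in rng.sample(range(len(motif)), d):
        chars[pos] = rng.choice([c for c in ALPHABET if c != chars[pos]])
    return "".join(chars)


def make_noisy_dataset(m=20, m_corrupt=10, n=600, l=15, d=4, seed=0):
    """Return (sequences, planted, motif).

    sequences: m sequences with one planted (l, d) instance each, shuffled
               together with m_corrupt sequences that carry no instance.
    planted:   dict mapping sequence index -> start of the planted instance.
    """
    rng = random.Random(seed)
    motif = random_sequence(l, rng)
    items = []
    for _ in range(m):
        background = random_sequence(n, rng)
        start = rng.randrange(n - l + 1)
        instance = mutate(motif, d, rng)
        items.append((background[:start] + instance + background[start + l:], start))
    items += [(random_sequence(n, rng), None) for _ in range(m_corrupt)]
    rng.shuffle(items)
    sequences = [seq for seq, _ in items]
    planted = {i: start for i, (_, start) in enumerate(items) if start is not None}
    return sequences, planted, motif


def performance_coefficient(known, predicted):
    """PC = |known ∩ predicted| / |known ∪ predicted| over sets of instances."""
    known, predicted = set(known), set(predicted)
    union = known | predicted
    return len(known & predicted) / len(union) if union else 1.0
```

Varying `m_corrupt` while keeping `m = 20` gives the kind of series summarized in Fig. 1; the random background may of course contain chance matches in addition to the planted instances.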


In accordance with [18], in this experiment we restricted the search to the best motif in each run. Running times for GWM2 were averaged over five random datasets. As seen in Figure 1, while our approach still achieved a 100% success rate, its running times were strongly affected by the number of corrupted sequences in the dataset. This is because the probability of the motif could reach a threshold that causes many pairwise similarities to occur by chance [2,8]. A preprocessing step that handles the variability of the data and filters out corrupted sequences may therefore be required. Nevertheless, our approach is sufficiently fast for common applications.

3.2 Real Biological Data

We tested our approach on the following biological datasets: DHFR, preproinsulin, and c-fos, which consist of upstream regions of eukaryotic genes [9]. These biological datasets were also analyzed in [2,9,15]. For all experiments, we set l = 20 and d = 4. The number K of sequences assumed to contain motif instances was initially set to the number of sequences in the dataset (K = m) and was then decreased until motifs were found or K < m/2. Once a motif was found in the dataset, it was likely that other conserved motifs would also be found when the location of the motif was shifted several positions to the left or right. Hence, among multiple shifted versions of a motif, only the one with the lowest total distance score was selected. Table 2 lists the motifs that match the referenced known motifs, with underlined letters corresponding to the matching areas. As seen, GWM2 successfully recognized the reference motifs. Moreover, in many circumstances (results not shown), even when the motifs found by GWM2 do not agree with the motifs identified in wet-lab experiments, they match those reported in [4]. This indicates that our approach is able to find potentially significant motifs.

Table 2. Performance of GWM2 on eukaryotic promoter sequences, using parameters l = 20 and d = 4. The motifs that match the motifs found by wet-lab experiments [2,9] are listed with underlined letters indicating the matching areas.

Dataset (seqs/bases)     K   Best motifs by GWM2    Experimentally defined motifs
preproinsulin (4/7689)   4   GCAGACCCAGCACCAGGGAA   AGACCCAGCA
                             GAAATTGCAGCCTCAGCCCC   CCTCAGCCCC
                             AGGCCCTAATGGGCCAGGCG   CCCTAATGGGCCA
DHFR (4/800)             3   TGCAATTTCGCGCCAAACTT   ATTTCnnGCCAAACT
c-fos (6/4710)           5   CCATATTAGGACATCTGCGT   CCATATTAGGACATCTG

4 Discussion

As more high-throughput sequencing techniques become available, recognizing meaningful but weak signals or sites in biological sequences becomes more pressing. However, solving the WMR problem usually involves two difficulties: (1) the large pairwise distance between motif instances makes false pairwise distances likely to occur at random elsewhere in the dataset, which can obscure the true motifs, and (2) the running time increases with the motif length and with the noise (the presence of corrupted sequences in the dataset). Therefore, despite various attempts, the existing computational techniques are far from achieving satisfactory results [18,7]. This paper has proposed a graphical approach, named GWM2, to recognize weak motifs in datasets that contain noise. In our experiments GWM2 tolerated noise well, where a fraction of the sequences may not contain any motif instance, while its running time was comparable to, if not faster than, that of the former methods. GWM2 was applied to real biological datasets that share common TFBSs and showed good performance. Moreover, as the three steps of the present method were designed independently of the sequence alphabet, GWM2 is generalizable to other biological sequences such as protein sequences.

One limitation of our approach may be how to determine the motif length l and the number of degenerate positions d. Fortunately, for most real biological datasets, prior information about the potential motif length is usually provided. We can therefore fix the motif length beforehand while varying the value of d. Even if no prior information is available, the motif can be recognized by trial and error over a range of values of l. Our approach could be further adapted to find (l, d) motifs with large l and d values. Recently proposed techniques [2,16] that find long motifs with acceptable performance try to find motifs (l′, d′) with l′ < l and d′ < d (d′ < l′) by using probabilistic sampling techniques. In effect, they reduce the recognition of longer motifs to that of shorter ones and then recover the original motifs. However, we believe that a better way to improve the present approach for recognizing weak motifs in large datasets is to reveal the potential motifs using only a small number of sequences and subsequently validate these motifs with the remaining sequences. For instance, instead of having to find K-cliques, where K is large, we can find k-cliques with k < K and recover the potential motifs. Each potential motif is then evaluated against the dataset, and if we find no fewer than K subsequences in the dataset whose Hamming distance from this potential motif is within d positions, it is recognized as a valid motif. We plan to explore this possibility further.
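The validation step proposed above can be illustrated with a few lines of code. This is a sketch under assumptions: the function names are hypothetical, and support is counted per sequence (at most one instance per sequence, as assumed in Section 2) rather than per subsequence.

```python
def hamming(a, b):
    """Number of mismatching positions between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))


def best_distance(seq, motif):
    """Smallest Hamming distance between motif and any window of seq.

    Returns infinity when seq is shorter than the motif.
    """
    l = len(motif)
    return min(
        (hamming(motif, seq[j:j + l]) for j in range(len(seq) - l + 1)),
        default=float("inf"),
    )


def support(seqs, candidate, d):
    """Number of sequences holding a window within d of the candidate."""
    return sum(1 for seq in seqs if best_distance(seq, candidate) <= d)


def is_valid_motif(seqs, candidate, d, K):
    """Accept the candidate if at least K sequences support it."""
    return support(seqs, candidate, d) >= K
```

A candidate recovered from a small k-clique would be passed to `is_valid_motif` with the full dataset; only candidates supported by at least K sequences are kept.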

References
1. Bailey T. and C. Elkan, “Fitting a mixture model by expectation maximization to discover motifs in biopolymers”, 2nd ISMB, 1994, 33-54.
2. Buhler J. and M. Tompa, “Finding motifs using random projections”, J Comput Biol, 2002, 9(2), 225-242.
3. Chin F., H. Leung, S. Yiu, T. Lam, R. Rosenfeld, W. Tsang, D. Smith and Y. Jiang, “Finding Motifs for Insufficient Number of Sequences with Strong Binding to Transcription Factor”, RECOMB 2004, San Diego, USA, 125-132.
4. Eskin E. and P. Pevzner, “Finding composite regulatory patterns in DNA sequences”, Bioinformatics, 2002, 18 Suppl 1, S354-S363.

Graphical Approach to Weak Motif Recognition in Noisy Data Sets

5. Helden J., B. Andre, and J. Collado-Vides, “Extracting regulatory sites from the upstream region of yeast genes by computational analysis of oligonucleotide frequencies”, J Mol Biol., 1998.
6. Hertz G. and G. Stormo, “Identifying DNA and protein patterns with statistically significant alignments of multiple sequences”, Bioinformatics, 1999, 15(7-8), 563-77.
7. Hu J., B. Li, and D. Kihara, “Limitations and Potentials of Current Motif Discovery Algorithms”, Nucleic Acids Res., 2005, 33(15), 4899-4913.
8. Jensen K., M. Styczynski, I. Rigoutsos, and G. Stephanopoulos, “A generic motif discovery algorithm for sequential data”, Bioinformatics, 2005, in press.
9. Keich U. and P.A. Pevzner, “Finding motifs in the twilight zone”, Bioinformatics, 2002, 18(10), 1374-81.
10. Latchman S., Eukaryotic Transcription Factors, Academic Press, 2003.
11. Lawrence C., S. Altschul, M. Boguski, J. Liu, A. Neuwald, and J. Wootton, “Detecting subtle sequence signals: a Gibbs sampling strategy for multiple alignment”, Science, 1993, 262, 208-214.
12. Liang S., M. Samanta and B. A. Biegel, “cWINNOWER Algorithm for Finding Fuzzy DNA Motifs”, Journal of Bioinformatics and Computational Biology, 2004, 2(1), 47-60.
13. Liu S., A. Neuwald, and C. Lawrence, “Bayesian models for multiple local sequence alignment and Gibbs sampling strategies”, J. Amer. Statist. Assoc., 1995, 90, 1157-1170.
14. Pevzner P. and S. Sze, “Combinatorial approaches to finding subtle signals in DNA sequences”, Intelligent Systems for Molecular Biology, 2000, 269-278.
15. Price A., S. Ramabhadran, and P. Pevzner, “Finding subtle motifs by branching from sample strings”, Bioinformatics, 2003, 19 Suppl 2, II149-II155.
16. Rajasekaran S., S. Balla, and C. Huang, “Exact Algorithm for Planted Motif Challenge Problems”, 3rd Asia-Pacific Bioinformatics Conference, 2003, 249-259.
17. Sinha S. and M. Tompa, “A statistical method for finding transcription factor binding sites”, Proc Int Conf Intell Syst Mol Biol, 2000, 8, 344-54.
18. Tompa M., N. Li, T. Bailey, G. Church, B. De Moor, E. Eskin, A. Favorov, M. Frith, Y. Fu, W. Kent, V. Makeev, A. Mironov, W. Noble, G. Pavesi, G. Pesole, M. Regnier, N. Simonis, S. Sinha, G. Thijs, J. van Helden, M. Vandenbogaert, Z. Weng, C. Workman, C. Ye, and Z. Zhu, “Assessing Computational Tools for the Discovery of Transcription Factor Binding Sites”, Nature Biotechnology, 2005, 23(1), 137-144.
19. Yang X. and J. Rajapakse, “Graphical approach to weak motif recognition”, Genome Informatics, 2004, 15(2), 52-62.

Comparative Gene Prediction Based on Gene Structure Conservation

Shu Ju Hsieh1, Chun Yuan Lin2, Ning Han Liu1, and Chuan Yi Tang1

1 Department of Computer Science
2 Institute of Molecular and Cellular Biology
National Tsing-Hua University, Hsinchu, Taiwan, ROC

Abstract. Identifying protein coding genes is one of the most important tasks in newly sequenced genomes. With increasing numbers of gene annotations verified by experiments, it is feasible to identify genes in newly sequenced genomes by comparison with genes annotated on phylogenetically close organisms. Here, we propose a program, GeneAlign, which predicts the genes on one sequence by measuring the similarity between the predicted sequence and related genes annotated on another genome. The program applies CORAL, a heuristic linear-time alignment tool, to determine whether the regions flanked by candidate signals are similar to the annotated exons. The approach, which exploits the conservation of gene structures and the sequence homology between protein coding regions, increases the prediction accuracy. GeneAlign was tested on the Projector data set of 449 human-mouse homologous sequence pairs. The sensitivity and specificity of GeneAlign are 80% at the gene level and above 96% at the exon level.

J.C. Rajapakse, L. Wong, and R. Acharya (Eds.): PRIB 2006, LNBI 4146, pp. 32–41, 2006. © Springer-Verlag Berlin Heidelberg 2006

1 Introduction

Accurate prediction of gene structures, that is, of the exact exon-intron boundaries, is an important task in genomic sequence analysis, yet the problem remains far from solved [9]. Numerous computational gene prediction programs have aided the identification of protein coding genes; however, no program is accurate enough to predict all protein coding genes perfectly. The best accuracy is achieved with spliced alignment of full-length cDNAs or comprehensive expressed sequence tags (ESTs) [8]. Sim4 [14], Spidey [28], BLAT [18], and GMAP [29] belong to this class. Nevertheless, generating complete and accurate predictions of all genes is still an ongoing challenge because numerous genes lack full-length cDNAs.

Single-genome predictors, which predict gene structures from one genomic sequence, e.g., GENSCAN [10], have been used successfully to annotate newly sequenced genomes. With more and more organisms being sequenced, comparative approaches provide more accuracy than single-genome predictors. In addition to comparative analysis between genomes (e.g., ROSETTA [4], Pro-Gen [24], DOUBLESCAN [21], TWINSCAN [19], SGP2 [25], SLAM [1] and EXONALIGN [16]), evidence from related organisms, such as cDNAs and ESTs of related organisms (e.g., GeneSeqer [8]), known proteins of related organisms (e.g., GeneWise [6], PROCRUSTES [15]) and known annotations of related organisms (e.g., Projector [20]), has been employed in the comparative approaches. Recently, several programs, Combiner [2], ExonHunter [7], and JIGSAW [3], have integrated multiple sources of information (e.g., multiple genomic sequences, cDNAs/ESTs and protein databases of related organisms, and the output of various gene predictors) to further increase the accuracy of gene prediction.

Currently, the gene structures for complete genome sequences are generated by incorporating multiple computational approaches depending on the evidence available. The Ensembl gene prediction pipeline uses two streams of evidence: the direct placement of cDNAs and ESTs on the genome of the same organism, and a related gene in another organism that is used as a template for the homologous gene [13]. Although cDNA and EST collections are far from comprehensive for most organisms, the abundance of valuable data provided by more than 1700 complete and ongoing genome projects [5] (Genomes Online Database, http://www.genomesonline.org, January 2006) could help to locate the exon-intron boundaries for organisms for which full-length cDNA sequences have not been generated. Moreover, previous studies indicate that known gene annotations from homologous genes are more powerful in aiding gene prediction than the evidence of homologous protein sequences [20].

In this study, we present a gene prediction tool, GeneAlign. Like Projector, GeneAlign employs gene annotations of one organism to predict the homologous genes of a related organism. GeneAlign integrates signal detectors with CORAL [16] to efficiently align annotated exons with predicted sequences. CORAL, a heuristic alignment program, aligns coding regions between two phylogenetically close organisms in linear time.
The approach relies on two distinctive features: the high degree of conservation between protein coding sequences and the conservation of gene structure between phylogenetically close organisms. GeneAlign assumes that exon-intron structures are conserved, but it can also align exons that differ by exon-splitting events. GeneAlign can therefore support gene structure prediction even when the annotated genome is fairly diverged, provided that it still shares a common gene structure. Here, we show that GeneAlign performs well in identifying coding exons; specifically, the rates of missing exons and wrong exons are both low.

2 Methods

GeneAlign accepts as inputs two nucleotide sequences of homologous genes and the known gene annotation of one of them, and it predicts the exon positions in the other sequence according to that annotation. The major components of GeneAlign for annotation-genome mapping and alignment are: (1) signal filtration, and (2) application of CORAL to measure sequence similarity following candidate signals in order to generate approximate gene structures.

S.J. Hsieh et al.

2.1 Signal Filtrations

To model the conserved gene structures of homologous genes, GeneAlign measures the similarity between the annotated exons of one sequence and the regions downstream of potential splice acceptors, or upstream of potential splice donors, in the other sequence. For the predicted sequence, GeneAlign first obtains a set of candidate signals, namely translation initiation sites (TISs) and splice acceptors/donors, according to signal scores calculated by the signal prediction tools NetStart [26] and DGsplicer [12], respectively. NetStart, the most popular and accessible program for TIS prediction [23], produces neural network predictions of translation start sites in nucleotide sequences. DGsplicer employs a dependency graph model to score potential splice signals. NetStart and DGsplicer efficiently filter out many false TISs and splice signals, but they fail to remove the false signals that result from the highly degenerate and unspecific nature of these signals. Integrating CORAL [16] helps by measuring the similarity between annotated exons and the potential regions marked by candidate TISs and splice signals.

2.2 COding Region ALignment – CORAL

CORAL is developed on the basis of the conservation of coding regions. Most coding regions are conserved among organisms at the amino acid level, which suggests that the Hamming distance between two segments under an optimal alignment is low. Following a random model, codon mutations are assumed to occur randomly within a sequence. A probabilistic filtration method is built to efficiently find ill-positioned pairs, i.e., less-than-optimal alignments that are assumed to result from a shifting mutation and can be resolved by inserting a gap whose length is a multiple of three. When an ill-positioned pair is detected, a local optimal solution is used to obtain a significant alignment and to determine the possible position and length of the inserted gap.
In addition, because the nucleotide sequences of translated regions are well conserved in the first and second positions of a codon and may be less conserved in the third, we use three nucleotides arranged in the pattern XXO (where X indicates “absolute matching” and O means “don’t care”) as the basis of alignment. CORAL employs probabilistic analysis and local optimal solutions to align sequences efficiently with sliding windows and thus obtains a near-optimal alignment in linear time. For details of CORAL, see Hsieh et al. [16]. Additionally, a second version of CORAL compares amino acids directly instead of codons. An amino acid identity score is calculated by translating the codons according to the genetic code and comparing the corresponding amino acids in the two compared regions.

2.3 Gene Structure Alignments – GeneAlign

After signal filtration by NetStart and DGsplicer, the predicted sequences and annotated exons are aligned from 5’ to 3’. GeneAlign is designed for detecting multi-exon genes. The coding exons are divided into three categories according to their location in the coding region: initial exon (ATG-GT, first coding exon of a gene), internal exon (AG-GT), and terminal exon (AG-stop codon, last coding exon of a gene). Splice sites are the most powerful signals for gene prediction, and accurate modeling of splice sites can improve gene prediction accuracy [9]. Thus, the alignments start from the splice acceptors: CORAL aligns the first annotated internal exon with the regions following the candidate splice acceptors. CORAL stops aligning when the alignment score drops significantly. If the alignment score and the aligned sequence length both reach their thresholds, the aligned subsequence is predicted as a candidate exon. In general, the thresholds are an alignment score ≥ 50% and an exon length ≥ 30 bp, which were determined empirically. Candidate splice acceptors and the following annotated exons are then examined in turn to search for meaningful alignments. For each aligned segment, the downstream boundary is delimited by an admissible candidate splice donor. A series of aligned segments ends at the annotated terminal exon and is delimited by a stop codon (TAG, TGA or TAA). The process is then repeated from 3’ to 5’, starting with the alignment of the last internal exon with the regions upstream of the candidate splice donors and ending with the initial exon. TISs are selected according to the scores evaluated by NetStart. This procedure retrieves possible missing exons resulting from underestimation of splice acceptors by DGsplicer, from a single intron insertion/deletion in one exon of an exon pair, and from a frameshift at the 5’ end of exon pairs. If an annotated exon cannot be mapped to the predicted sequence, the alignment score threshold is lowered, e.g., to 30%, and the corresponding region is searched again.

2.4 Performance Evaluation

The standard measures of prediction accuracy defined by Burset and Guigó [11] are applied to compare the accuracy of gene prediction.
The measures of sensitivity (Sn) and specificity (Sp) are Sn = TP/(TP+FN) and Sp = TP/(TP+FP), respectively, where TP (true positives) is the number of correctly predicted genes, FN (false negatives) is the number of true genes missed in the prediction, FP (false positives) is the number of pseudo genes wrongly predicted, and TN (true negatives) is the number of correctly predicted pseudo genes. At the exon level, TP, FP, FN, and TN are defined in the same way except that exons are compared. An exon is considered correctly predicted only when both of its boundaries are correct. ME (missing exons) is the proportion of annotated exons not overlapped by any predicted exon, whereas WE (wrong exons) is the proportion of predicted exons not overlapped by any annotated exon.
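The exon-level measures defined above can be computed as in the following minimal sketch. It assumes that each exon is given as a (start, end) pair with inclusive coordinates and that both lists are non-empty; the function name and the example coordinates are hypothetical.

```python
def exon_level_metrics(annotated, predicted):
    """Return Sn, Sp, ME and WE for two collections of (start, end) exons."""
    annotated, predicted = set(annotated), set(predicted)

    # An exon counts as correct only if both boundaries match exactly.
    tp = len(annotated & predicted)
    fn = len(annotated - predicted)
    fp = len(predicted - annotated)

    def overlaps(a, b):
        # Inclusive intervals overlap when each starts before the other ends.
        return a[0] <= b[1] and b[0] <= a[1]

    missing = sum(not any(overlaps(a, p) for p in predicted) for a in annotated)
    wrong = sum(not any(overlaps(p, a) for a in annotated) for p in predicted)

    return {
        "Sn": tp / (tp + fn),
        "Sp": tp / (tp + fp),
        "ME": missing / len(annotated),
        "WE": wrong / len(predicted),
    }


annotated = [(100, 200), (300, 400), (500, 600), (700, 800)]
predicted = [(100, 200), (300, 410), (900, 950)]
print(exon_level_metrics(annotated, predicted))
```

In this example the second predicted exon overlaps an annotated exon but has a wrong boundary, so it lowers Sn and Sp without counting as a missing or wrong exon.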

3 Results

3.1 Data Sets

GeneAlign applies CORAL based on codon identity to efficiently find the partner exons of those in related known genes. The other version, GeneAlign*, which applies CORAL based on amino acid identity, is compared with GeneAlign. To optimize the parameters, GeneAlign was trained on the IMOG data set [25], which contains 15 homologous gene pairs. The test set is the Projector data set, which contains 491 homologous gene pairs. As we aim to test the capability of the spliced alignment, intronless genes were discarded. The remaining 449 homologous gene pairs in the test set have an average of 9.3 exons per gene. Of these gene pairs, 45% (204 out of 449) have an identical number of coding exons and an identical coding sequence length, 50% (224 out of 449) have an identical number of exons but differ in coding sequence length, and 5% (21 out of 449) have different numbers of exons.

3.2 Performance

The gene prediction accuracy of GeneAlign was compared with that of Projector [20] and GeneWise [6]. Projector predicts gene structures by using the annotated genes of a related organism, as GeneAlign does. GeneWise predicts gene structures by using the known proteins of a related organism. The sets of genes predicted by Projector and GeneWise were retrieved from the Projector web server (http://www.sanger.ac.uk/Software/analysis/projector). We measure the performance in terms of sensitivity and specificity at both the exon level and the gene level. The results are summarized in Table 1. These results show that the predictions obtained by GeneAlign are accurate at both the gene level and the exon level. GeneAlign also performs better when evaluated by ME and WE. In addition, GeneAlign* has lower ME and WE ratios than GeneAlign.
In order to study the effects of sequence similarity on prediction accuracy, the 449 homologous pairs were stratified into five classes with amino acid identities between the two encoded proteins ranging from …

… 0) is replaced by a non-strict inequality (≥ 1), and (ii) slack variables $\xi_i$ are introduced to allow a best-fit solution in the event of unsatisfiable constraints. The objective function minimizes the length of the weight vector (to normalize the constraints across the various dimensions of $w$) and the size of the slack variables. The constant parameter $C$ indicates how much a solution is penalized for violating a constraint. In practice, SVMs solve the dual of this minimization problem.

We can therefore use SVMs to determine our function $G_w$; however, this only solves half of our problem. Given a candidate $G_w$, we must then determine whether equation (3) has been violated and add more constraints if necessary. To accomplish this task, we build on work by Tsochantaridis et al. [17], which tightly couples this constraint verification problem with the SVM $w$ minimization problem.

First, a loss function $\Delta(y_i, y)$ is defined that weighs the goodness of a structure $y$ relative to the true structure $y_i$. Smaller values of $\Delta(y_i, y)$ indicate that structures $y_i$ and $y$ are more similar; see Section 3.1 for examples. Adding this to the SVM constraints in equation (10b) gives

\[ \forall i,\ \forall y \in S_i:\quad \xi_i \ge \Delta(y_i, y) - \langle w, \Delta\Psi_i(y) \rangle. \tag{11} \]

Using this, we can decide when to add constraints to our reduced problem and which constraints to add. Since at every iteration of the algorithm we determine some $w$ for the current $S_i$, we can then find the value $\hat{\xi}_i$ assigned to variable $\xi_i$ as a result of the optimization. $\hat{\xi}_i$ corresponds to the “worst” prediction by $w$ across the structures $y \in S_i$:

\[ \hat{\xi}_i = \max\Bigl(0,\ \max_{y \in S_i} \Delta(y_i, y) - \langle w, \Delta\Psi_i(y) \rangle\Bigr). \tag{12} \]

This resulting $\hat{\xi}_i$, which was determined using $S_i$, can be compared to a similar value $\hat{\xi}'_i$ that is obtained by instead maximizing over $Y \setminus \{y_i\}$ in equation (12). This tells us how much the constraints we are ignoring from $Y \setminus \{y_i\}$ would change the solution. The constraint most likely to change the solution is the one that would have caused the greatest change to the slack variables. Therefore, we add to $S_i$ the constraint that corresponds to

\[ \hat{y} = \operatorname*{argmax}_{y \in Y \setminus \{y_i\}} \Delta(y_i, y) - \langle w, \Delta\Psi_i(y) \rangle. \tag{13} \]

Tsochantaridis et al. [17] show that by only adding constraints when $\hat{y}$ could change $\hat{\xi}_i$ by more than $\varepsilon$, one can attain a provable termination condition for the problem. The overall process is summarized in Algorithm 1.

Predicting Secondary Structure of All-Helical Proteins

Algorithm 1. Algorithm for iterative constraint-based optimization

Input: (x_1, y_1), ..., (x_n, y_n), C, ε
S_i ← ∅ for all 1 ≤ i ≤ n
w ← any arbitrary value
repeat
    for i = 1, ..., n do
        Set up the cost function: H(y) = Δ(y_i, y) − ⟨w, ΔΨ_i(y)⟩
        Compute ŷ = argmax_{y ∈ Y\{y_i}} H(y)
        Compute ξ̂_i = max{0, max_{y ∈ S_i} H(y)}
        if H(ŷ) > ξ̂_i + ε then
            S_i ← S_i ∪ {ŷ}
            w ← optimize over S = ∪_i S_i
until no S_i changes during iteration
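The inner test of Algorithm 1 for a single training example can be illustrated as follows. This is a sketch under simplifying assumptions: the candidate structures are enumerated explicitly (in practice the argmax is found by dynamic programming), H is passed in as an ordinary function, and the re-optimization of w is left out. The names `check_constraint`, `candidates` and `working_set` are hypothetical.

```python
def check_constraint(H, candidates, working_set, eps):
    """Add the most violated structure to the working set if it matters.

    H           -- cost function H(y) = loss(y_i, y) - <w, dPsi_i(y)>
    candidates  -- explicit list standing in for all structures except y_i
    working_set -- list S_i of structures already used as constraints
    eps         -- tolerance on the change of the slack variable
    Returns True when a constraint was added (w should then be re-optimized).
    """
    y_hat = max(candidates, key=H)                      # most violated structure
    xi_hat = max([0.0] + [H(y) for y in working_set])   # current slack value
    if H(y_hat) > xi_hat + eps:
        working_set.append(y_hat)
        return True
    return False


# Toy values of H for three candidate structures (made up for illustration).
scores = {"hhcc": 0.5, "hccc": 1.2, "cccc": 2.0}
S_i = []
print(check_constraint(scores.get, list(scores), S_i, eps=0.1), S_i)
print(check_constraint(scores.get, list(scores), S_i, eps=0.1), S_i)
```

With a fixed w, the second call adds nothing because the most violated structure is already in the working set; in the full algorithm w changes between calls, so new violations can appear.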

2.5 Defining the Set of Valid Structures

One final issue remains to be solved to complete our algorithm. We need to specify what Y and Ψ(x, y) are, and how to optimize G(x, y) over Y. In general, Y can be exponentially large with respect to the sequence length, making brute-force optimization impractical. Our general approach is to structure Y and Ψ(x, y) in a way that allows optimization of G(x, y) through dynamic programming.

Most secondary-structure prediction tools use local features to predict which regions of a protein will be helical [14]. Individual residues can have propensities for being in a helix, they can act as helix nucleation sites, or they can interact with other nearby residues. This type of information can be well captured by Hidden Markov Models (HMMs). Equivalently, we choose to capture it using Finite State Machines (FSMs). The only difference between the FSMs we use and a non-stationary HMM is that the HMM deals with probabilities, which are multiplicative, while our FSMs deal with pseudo-energies, which are additive. Up to a logarithm, they are the same.

We define Y to be the language that is recognized by some FSM. Thus a structure y ∈ Y will be a string over the input alphabet of the FSM. For example, that alphabet could be {h, c}, where h indicates that the residue at that position in the string is in a helix, and c indicates that it is in a coil region. A string y is read by an FSM one character at a time, inducing a specific sequence of transitions between internal states. Note that the FSMs we are considering do not need to be deterministic. However, they do need to satisfy the property that, for a given input string, there is at most one sequence of transitions leading from the initial state to a final state. We denote this sequence of transitions by σ(y) and note that σ(y) need not be defined for all y.

To define Ψ(x, y), we create a helper function ψ(x, t, i) which assigns a vector of feature values whenever transition t is taken at position i in the sequence x.
For example, if a transition is taken to start a helix at position i, then ψ(x, t, i) might return features indicating that residues at position i − 3 to i + 3 are associated with an N-terminal helix cap. See Section 3.1 for our particular choice of ψ.

B. Gassend et al.

The overall feature vector is the sum of these features across all positions in the sequence: $\Psi(x, y) = \sum_i \psi(x, \sigma(y)_i, i)$. The total cost G(x, y) follows the form of equation (7). We also specify an infinite cost for structures that are the wrong length or are rejected by the FSM:

\[ G(x, y) = \begin{cases} +\infty & \text{if } |x| \ne |y| \text{ or } \sigma(y) \text{ is undefined} \\ \langle w, \Psi(x, y) \rangle & \text{otherwise} \end{cases} \tag{14} \]

This cost is easy to optimize over Y by using the Viterbi algorithm. The algorithm proceeds in |x| rounds. In round i, the best path of length i starting from an initial state is calculated for each FSM state. These paths are computed by extending the best paths from the previous round by one transition and picking the best resulting path for each state. The algorithmic complexity is O(|FSM| · |x|), where |FSM| is the number of states and transitions in the FSM.
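The Viterbi recursion described above can be sketched on a toy example. The FSM below is a hypothetical three-state machine (helices must be at least two residues long), not the machine of Fig. 2, and the pseudo-energies are made up; the sketch only illustrates how additive transition costs are minimized round by round.

```python
import math

# Hypothetical toy FSM: each transition is (source, target, emitted symbol).
TRANSITIONS = [
    ("C", "C", "c"),    # stay in coil
    ("C", "H1", "h"),   # start a helix
    ("H1", "H2", "h"),  # second helix residue
    ("H2", "H2", "h"),  # extend the helix
    ("H2", "C", "c"),   # end the helix
]
INITIAL, ACCEPT = "C", {"C", "H2"}

# Made-up pseudo-energies: negative values favour a helix.
HELIX_ENERGY = {"A": -1.0, "L": -0.8, "E": -0.6, "G": 1.0, "P": 2.0}


def transition_cost(symbol, residue):
    """Cost of one transition in this toy model: coil is 0, helix is H_R."""
    return HELIX_ENERGY.get(residue, 0.0) if symbol == "h" else 0.0


def viterbi(x):
    """Return (cost, y) with minimal additive cost among accepted strings."""
    best = {INITIAL: (0.0, "")}  # best (cost, path) ending in each state
    for residue in x:
        nxt = {}
        for src, dst, symbol in TRANSITIONS:
            if src not in best:
                continue
            cost, path = best[src]
            cand = (cost + transition_cost(symbol, residue), path + symbol)
            if dst not in nxt or cand[0] < nxt[dst][0]:
                nxt[dst] = cand
        best = nxt
    finals = [best[s] for s in ACCEPT if s in best]
    return min(finals) if finals else (math.inf, None)


cost, y = viterbi("GALEP")
print(y, cost)
```

For the sequence GALEP, the sketch labels A, L and E as helix and the flanking residues as coil. Storing whole path strings keeps the code short; an implementation aiming at the stated O(|FSM| · |x|) bound would store back-pointers instead.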

3 Results

We now present results from our implementation of our algorithm. It was written in Objective Caml and uses SVMstruct/SVMlight [7] by Thorsten Joachims.

3.1 Finite State Machine Definition

In our experiments, we have used an extremely simple finite state machine, which is presented in Figure 2. Each state corresponds to being in a helix or coil region and indicates how far into the region we are. States H4 and C3 correspond to helices and coils more than 4 and 3 residues long, respectively. Short coils are permitted, but helices shorter than 4 residues are not allowed, as even 3₁₀ helices need at least 4 residues to complete one turn and form the first hydrogen bond.

Table 1 lists the basic features that were used in our experiments. These features can also be considered to be the parameters of our system, as our learning algorithm assigns an appropriate weight to each one. Our choice of features is motivated by observations that amino acids have varying propensities for appearing within an alpha helix as well as for appearing at the ends of a helix, an area termed the helix cap [2]. We introduce a single feature per residue to account for helix propensity, for a total of 20 parameters. For helix capping, we use a separate feature for each residue that appears at a given offset (−3 to +3) from a given end of the helix (N-terminal or C-terminal). This accounts for 20 × 7 × 2 = 280 parameters. Finally, we also introduce a feature for very short (2-residue) coils and one for short (3-residue) coils. Thus, there are a total of 302 parameters.

Table 2 illustrates how features are associated with the transitions of the FSM. This table corresponds to the ψ function described in Section 2.5; given an FSM transition and a position in the input sequence, it outputs a set of representative features. Most of this mapping is straightforward. In the case of helix caps (labels #1 and #2), features are emitted across a 7-residue window that is centered at position n − 1 (the previously processed residue). None of the features we have used involve more than one residue in the sequence.

Fig. 2. The finite state machine we used. Double circles represent accept states. The arrow leading into state C3 indicates that it is an initial state. Each transition is labeled with the type of structure it corresponds to: helix (H) or coil (C), and a label (#i) indicating which features correspond to this transition in Table 2.

Table 1. Summary of basic features that are considered. Each of these features corresponds to a parameter that is learned by our algorithm.

Name     Number of features   Description
A        1                    Penalty for very short coil
B        1                    Penalty for short coil
H_R      20                   Energy of residue R in a helix
C^i_R    140                  Energy of residue R at position i relative to C-cap
N^i_R    140                  Energy of residue R at position i relative to N-cap
Total    302

Table 2. Sets of features that are emitted by transitions in the FSM. R_i denotes the residue at position i in the protein, and n is the current position of the FSM.

Label   Features                                              Description
#0      0                                                     Coil defined as zero-energy
#1      $\sum_{i=-3}^{+3} C^{i-1}_{R_{n+i-1}}$                End of helix processing (C-cap)
#2      $H_{R_n} + \sum_{i=-3}^{+3} N^{i-1}_{R_{n+i-1}}$      Start of helix processing (N-cap)
#3      $H_{R_n}$                                             Normal helix residue
#4      $H_{R_n} + A$                                         Helix after very short coil
#5      $H_{R_n} + B$                                         Helix after short coil

We have experimented with more complicated cost functions that model pairwise interactions between nearby residues in a helix, namely between n and n + 3 or n and n + 4. So far we have not managed to improve our prediction accuracy using these interactions, possibly because each pairwise interaction adds 400 features to the cost function, leaving much room for over-learning. Indeed, with the expanded cost functions we observed improved predictions on the training proteins, but decreased performance on the test proteins.

We have also experimented with various loss functions Δ (see Section 2.4). We have tried a 0-1 loss function (0 unless the two structures differ), the Hamming distance (number of incorrectly predicted residues), and a modified Hamming distance (residues are given more weight when they are farther from the helix-coil transitions). Each one gives results slightly better than the previous one.
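The three loss functions can be sketched as follows for structures written as strings over {h, c}. The 0-1 loss and the Hamming distance follow directly from the text. The exact weighting of the modified Hamming distance is not given here, so `weighted_hamming_loss` uses an assumed scheme (weight 1 next to a helix-coil transition of the true structure, growing by 1 per residue up to a cap) purely for illustration.

```python
def zero_one_loss(y_true, y_pred):
    """0 if the two structures are identical, 1 otherwise."""
    return 0 if y_true == y_pred else 1


def hamming_loss(y_true, y_pred):
    """Number of incorrectly predicted residues."""
    if len(y_true) != len(y_pred):
        raise ValueError("structures must have the same length")
    return sum(a != b for a, b in zip(y_true, y_pred))


def weighted_hamming_loss(y_true, y_pred, cap=2):
    """Hamming distance with larger weights far from helix-coil transitions.

    The weighting scheme is an assumption made for this sketch.
    """
    if len(y_true) != len(y_pred):
        raise ValueError("structures must have the same length")
    # A boundary at j means that residues j-1 and j are in different states.
    boundaries = [j for j in range(1, len(y_true)) if y_true[j] != y_true[j - 1]]
    total = 0
    for i, (a, b) in enumerate(zip(y_true, y_pred)):
        if a == b:
            continue
        if boundaries:
            dist = min(j - 1 - i if i < j else i - j for j in boundaries)
        else:
            dist = cap
        total += 1 + min(dist, cap)
    return total


y_true = "cchhhhcc"
print(hamming_loss(y_true, "cchhhccc"), weighted_hamming_loss(y_true, "cchhhccc"))
print(hamming_loss(y_true, "cccccccc"), weighted_hamming_loss(y_true, "cccccccc"))
```

Under this assumed weighting, shortening a helix by one residue costs the same as under the plain Hamming distance, whereas missing the helix entirely is penalized more heavily.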

3.2 Results

We have been working with a set of 300 non-homologous all-alpha proteins taken from EVA’s largest sequence-unique subset [6] of the PDB [4] at the end of July 2005. The sequences and structures have been extracted from PDB data processed by DSSP [9]. Only alpha helices have been considered (H residues in DSSP files); everything else has been lumped together as coil regions. In our experiments, we split our 300 proteins into two subsets of 150 proteins each. The first set is used to train the cost function; the second set is used to evaluate the cost function once it has been learned. Since the results vary somewhat depending on how the proteins are split into two sets, we train the cost function on 20 random partitions into training and test sets and report the average performance.

Table 3 presents our results using both the Qα and SOVα metrics. The Qα metric is simply the number of correctly predicted residues divided by the sequence length. SOVα is a more elaborate metric that has been designed to ignore small errors in the position of helix-coil transitions but to heavily penalize more fundamental errors such as gaps appearing in a helix [19].

Table 3. Results of our predictor across 20 configurations of training/test set

Description                     SOVα (%)   SOVα (%)   Qα (%)    Qα (%)   Training
                                (train)    (test)     (train)   (test)   time (s)
Best run for SOVα               76.4       75.1       79.6      78.6     123
Average of 20 runs              75.1       73.4       79.1      77.6     162
Standard deviation of 20 runs   1.0        1.4        0.6       0.9      30
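For concreteness, the Qα metric amounts to the following per-residue accuracy. This is a minimal sketch that assumes structures are strings over {h, c} of equal, non-zero length; the function name is hypothetical, and SOVα is not covered because its definition is more involved [19].

```python
def q_alpha(y_true, y_pred):
    """Percentage of residues whose helix/coil state is predicted correctly."""
    if len(y_true) != len(y_pred):
        raise ValueError("structures must have the same length")
    correct = sum(a == b for a, b in zip(y_true, y_pred))
    return 100.0 * correct / len(y_true)


# One helix residue out of eight is mispredicted as coil.
print(q_alpha("cchhhhcc", "cchhhccc"))
```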

The weights obtained for the features in Table 1 are available in our technical report [11] (although their sign is reversed relative to this paper). Initial examination has shown some correlation with propensities found in the literature [2]. Our experiments used a slack variable weighting factor C = 0.08 in equation (10a) and a termination threshold ε = 0.1. Both of these parameters have a large impact on prediction accuracy and training time. Our choice of these values was driven by informal experiments in which we tried to maximize the test accuracy while maintaining a practical training time.

4 Related Work

Tsochantaridis et al. apply an integrated HMM and SVM framework to secondary structure prediction [16]. The technique may be similar to ours, as we are reusing their SVM code; unfortunately, few details have been published. Though state-of-the-art neural network predictors such as PSIPred [8] currently outperform our method by about 5%, they incorporate multiple sequence alignments and are often impervious to analysis and understanding. For example, the PHD predictor contains more than 10,000 parameters [15], and SSPro contains between 1,400 and 2,900 parameters [3]. A notable exception is the network of Riis and Krogh [13], which is structured by hand to reduce the parameter count to as low as 311 (prediction accuracy is reported at Q3 = 71.3%). In comparison, our technique uses 302 parameters and offers Qα = 77.6%. Also, we do not incorporate alignment information, which is often responsible for a 5-7% improvement in accuracy [13,15]. Please see our technical report for a complete discussion of related work [11].

5 Conclusion

In this paper, we present a method to predict alpha helices in all-alpha proteins. The HMM is trained using a Support Vector Machine method which iteratively picks a cost function based on a set of constraints and uses the predictions resulting from this cost function to generate new constraints for the next iteration. On average, our method is able to predict alpha helices in all-alpha proteins with an accuracy of 73.4% (SOVα) or 77.6% (Qα). Unfortunately, these results are difficult to compare with existing prediction methods, which usually predict both alpha helices and beta strands. Rost and Sander caution that restricting the test set to all-alpha proteins can result in up to a 3% gain in accuracy [15]. In addition, recent techniques such as PSIPred [8] consider 3₁₀ helices (the DSSP state ‘G’) to be part of a helix rather than a loop, and report gains of about 2% in overall Q3 if helices are restricted to 4-helices (as in most HMM techniques, including ours).

The real power of the machine learning method we use is its applicability beyond HMM models. Instead of describing a protein structure as a sequence of HMM states, we could equally describe it as a parse tree of a context-free grammar or multi-tape grammar. With these enriched descriptions, we should be able to include in the cost function interactions between adjacent strands of a beta sheet. This should allow us to incorporate beta sheet prediction into our algorithm. Unlike most secondary structure methods, we would then be able to predict not only which residues participate in a beta sheet, but also which residues form hydrogen bonds between adjacent strands.

Acknowledgements We thank Chris Batten, Edward Suh and Rodric Rabbah for their early contributions to this work, and the anonymous reviewers for their helpful comments. W.T. also thanks Saman Amarasinghe for supporting his part in this research.

References
1. Y. Altun, I. Tsochantaridis, and T. Hofmann. Hidden Markov Support Vector Machines. In ICML, 2003.
2. R. Aurora and G. Rose. Helix capping. Protein Science, 7, 1998.
3. P. Baldi, S. Brunak, P. Frasconi, G. Soda, and G. Pollastri. Exploiting the past and the future in protein secondary structure prediction. Bioinformatics, 15, 1999.
4. H. Berman, J. Westbrook, Z. Feng, G. Gilliland, T. Bhat, H. Weissig, I. Shindyalov, and P. Bourne. The protein data bank. Nucleic Acids Research, 28, 2000.
5. C. Bystroff, V. Thorsson, and D. Baker. HMMSTR: a Hidden Markov Model for Local Sequence-Structure Correlations in Proteins. J. of Mol. Bio., 301, 2000.
6. EVA Largest sequence of unique subset of PDB. http://salilab.org/~eva/res/weeks.html#unique.
7. T. Joachims. Making large-scale SVM learning practical. In Advances in Kernel Methods – Support Vector Learning, pages 169–185. MIT Press, 1998.
8. D. T. Jones. Protein Secondary Structure Prediction Based on Position-specific Scoring Matrices. Journal of Molecular Biology, 292:195–202, 1999.
9. W. Kabsch and C. Sander. Dictionary of protein secondary structure. Biopolymers, 22, 1983.
10. V. Eyrich et al. EVA: Continuous automatic evaluation of protein structure prediction servers. Bioinformatics, 17(12):1242–1243, 2001.
11. B. Gassend et al. Secondary Structure Prediction of All-Helical Proteins Using Hidden Markov Support Vector Machines. Technical Report MIT-CSAIL-TR-2005-060, MIT, December 2005. http://hdl.handle.net/1721.1/30571.
12. M. N. Nguyen and J. C. Rajapakse. Prediction of protein secondary structure using bayesian method and support vector machines. In ICONIP, 2002.
13. S. Riis and A. Krogh. Improving prediction of protein secondary structure using structured neural networks and multiple sequence alignments. Journal of Computational Biology, 3:163–183, 1996.
14. B. Rost. Review: Protein Secondary Structure Prediction Continues to Rise. Journal of Structural Biology, 134(2):204–218, 2001.
15. B. Rost and C. Sander. Prediction of protein secondary structure at better than 70% accuracy. Journal of Molecular Biology, 232:584–599, 1993.
16. I. Tsochantaridis, Y. Altun, and T. Hofmann. A crossover between SVMs and HMMs for protein structure prediction. In NIPS Workshop on Machine Learning Techniques for Bioinformatics, 2002.
17. I. Tsochantaridis, T. Hofmann, T. Joachims, and Y. Altun. Support Vector Machine Learning for Interdependent and Structured Output Spaces. In ICML, 2004.
18. K. Won, T. Hamelryck, A. Prügel-Bennett, and A. Krogh. Evolving Hidden Markov Models for Protein Secondary Structure Prediction. In Proceedings of IEEE Congress on Evolutionary Computation, pages 33–40, 2005.
19. A. Zemla, Česlovas Venclovas, K. Fidelis, and B. Rost. A Modified Definition of Sov, a Segment-Based Measure for Protein Secondary Structure Prediction Assessment. Proteins, 34(2):220–223, 1999.

Prediction of Protein Subcellular Localizations Using Moment Descriptors and Support Vector Machine

Jianyu Shi, Shaowu Zhang, Yan Liang, and Quan Pan

College of Automation, Northwestern Polytechnical University, Xi'an, 710072, China
[email protected], {Zhangsw, Liangyan, Quanpan}@nwpu.edu.cn

Abstract. As more and more genomes have been discovered in recent years, there is an urgent need to develop a reliable method to predict protein subcellular localization for further functional exploration. However, many well-known prediction methods based on amino acid composition cannot utilize sequence-order information. Here we propose a novel method, named moment descriptor (MD), which captures sequence-order information in a protein sequence without requiring the physicochemical properties of amino acids. The presented method first constructs three types of moment descriptors and then applies multi-class SVMs to Chou’s dataset. Resubstitution, jackknife and independent tests show that MD performs better than other methods based on various extensions of amino acid composition. Moreover, three multi-class SVMs show similar performance except for the training time.

1 Introduction

One of the big challenges in biology is the structural and functional classification and further characterization of protein sequences, as more and more genomes and protein sequences become available. It is widely accepted that the subcellular localization of proteins plays a crucial role in predicting protein functions [1]. Hence a large number of computational methods have been developed over the last few years. However, most of them are based on amino acid composition. Originally, Nakashima and Nishikawa [2] indicated that intracellular and extracellular proteins differ significantly in amino acid composition (AAC). Subsequent studies showed that AAC is closely related to protein subcellular localization. However, sequence-order information is ignored in AAC. Hence two sequences that differ in function and localization but are similar in AAC may be predicted to have the same localization.

To utilize sequence-order information, some novel feature extraction methods have been proposed; they may be divided into the following two categories. The first category focuses on combining AAC with physicochemical properties of amino acids. Feng and Zhang [3,4] considered hydrophobic information and Zp parameters respectively. Chou first presented an effective method, named Pseudo Amino Acid Composition, to predict protein subcellular localization [5]. Zhou, Cai and Chou then further developed this method [6,7,8,9]. Pan et al. also presented a stochastic signal processing approach [10] to predict protein subcellular location based on pseudo amino acid composition. The second category extends AAC directly. Bhasin and Raghava developed a web server, ESLpred, for subcellular localization of eukaryotic proteins using dipeptide composition and PSI-BLAST [11]. Park and Kanehisa applied compositions of amino acids and amino acid pairs to predict 12-class protein subcellular localizations [12]. Cui et al. proposed two-segment amino acid composition and developed a tool, named Esub8, to predict protein subcellular localizations in eukaryotic organisms [13].

This paper proposes a novel feature extraction method, named moment descriptor (MD), which takes into account sequence-order information in a protein sequence without incorporating physicochemical properties of amino acids. MD and multi-class SVMs are then used to predict the subcellular localizations of proteins.

J.C. Rajapakse, L. Wong, and R. Acharya (Eds.): PRIB 2006, LNBI 4146, pp. 105–114, 2006. © Springer-Verlag Berlin Heidelberg 2006

2 Method

2.1 Feature Extraction

Without loss of generality, we assume that there are $N$ protein sequences in the dataset. Let $L_k$ be the length of the $k$th sequence $p_k$, and let $\alpha_i$ be the $i$th of the 20 natural amino acids, represented by the English letters A, C, D, E, F, G, H, I, K, L, M, N, P, Q, R, S, T, V, W and Y respectively.

Amino Acid Composition. According to amino acid composition, the protein sequence $p_k$ can be characterized as a 20-D feature vector:

$$\mathrm{AAC}_k = \left[ c_1^k, \ldots, c_i^k, \ldots, c_{20}^k \right], \quad k = 1, \ldots, N \tag{1}$$

where $c_i^k = n_i / L_k$ is the normalized occurrence frequency of amino acid $\alpha_i$ and $n_i$ is the count of $\alpha_i$ in sequence $p_k$. However, $\mathrm{AAC}_k$ alone is not sufficient to characterize a specific protein sequence, because the position of $\alpha_i$ in the sequence is not considered. Suppose that we have two protein sequences $p_1$ and $p_2$ with lengths 10 and 20, respectively. Amino acid $\alpha_i$ occurs at positions 2 and 3 in $p_1$, and at positions 1, 6, 8 and 15 in $p_2$. In this case sequence-order information is needed, because $c_i^1$ equals $c_i^2$ exactly.

Moment Descriptor. To take the sequence order into account, we propose a new feature extraction method, called moment descriptor (MD). Firstly, instead of using the direct definition above, we calculate $c_i^k$ by introducing a position indicator $x_{i,j}^k$ as follows:

$$c_i^k = \frac{1}{L_k} \sum_{j=1}^{L_k} x_{i,j}^k \tag{2}$$

$$x_{i,j}^k = \begin{cases} 1 & \text{if } \alpha_i \text{ is present at position } j \text{ in } p_k \\ 0 & \text{if } \alpha_i \text{ is NOT present at position } j \text{ in } p_k \end{cases} \tag{3}$$

Obviously, $\mathrm{AAC}_k$ in formulation (1) is the sampled statistical mean (raw moment) of the position indicator. Hence, we choose formulation (2) as the first MD of a protein sequence. Secondly, considering the position of amino acid $\alpha_i$ in sequence $p_k$, we define a new feature for amino acid $\alpha_i$:

$$m_i^k = \frac{1}{L_k} \sum_{j=1}^{L_k} x_{i,j}^k \, j \tag{4}$$

Thus, sequence $p_k$ can be characterized as a 20-D feature vector

$$\mathrm{AAM}_k = \left[ m_1^k, \ldots, m_i^k, \ldots, m_{20}^k \right], \quad k = 1, \ldots, N \tag{5}$$

where $m_i^k$ represents the mean of the position of $\alpha_i$. $\mathrm{AAM}_k$ represents the sampled statistical mean of the position of amino acids (AAM) in sequence $p_k$. We choose it as the second MD. Here $\mathrm{AAM}_1$ is not equal to $\mathrm{AAM}_2$ in general. However, AAM alone is not sufficient to characterize a protein sequence. For example, there may exist two protein sequences $p_3$ and $p_4$ with the same length of 10. Amino acid $\alpha_i$ occurs at positions 8 and 10 in $p_3$, and at positions 3, 6 and 9 in $p_4$. In this case, $m_i^3$ equals $m_i^4$ exactly although the positions of amino acid $\alpha_i$ in the two sequences are different.

Even AAC and AAM together may not be sufficient to characterize a protein sequence. Suppose there are two protein sequences $p_5$ and $p_6$ with the same length of 10. Amino acid $\alpha_i$ occurs at positions 4 and 6 in $p_5$, and at positions 3 and 7 in $p_6$. Unfortunately, $c_i^5$ equals $c_i^6$, and $m_i^5$ equals $m_i^6$. Hence further features need to be extracted from the protein sequence. Thirdly, the sampled variance $v_i^k$ of the position of amino acid $\alpha_i$ in sequence $p_k$ is considered:

$$v_i^k = \frac{1}{L_k} \sum_{j=1}^{L_k} \left( x_{i,j}^k \, j - m_i^k \right)^2 \tag{6}$$

Then, we can obtain a 20-D feature vector

$$\mathrm{AAV}_k = \left[ v_1^k, \ldots, v_i^k, \ldots, v_{20}^k \right], \quad k = 1, \ldots, N \tag{7}$$


where $v_i^k$ represents the second-order central moment of the position of amino acid $\alpha_i$ in sequence $p_k$. $\mathrm{AAV}_k$ represents the sampled statistical variance of the position of amino acids (AAV) in the sequence. We choose AAV as the third MD of a protein sequence. Finally, we construct a combined 60-D feature vector for sequence $p_k$ by combining the above three moment descriptors:

$$X_k = \left[ \mathrm{AAC}_k, \mathrm{AAM}_k, \mathrm{AAV}_k \right]^T, \quad k = 1, \ldots, N \tag{8}$$
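The three descriptors of formulations (2), (4) and (6) can be sketched in a few lines of Python. This is only an illustrative sketch, not the authors' implementation: the function name `moment_descriptors` and the two toy sequences (built from G with A placed at the positions used for the p5 and p6 example above) are assumptions made here, and sequences are assumed to be non-empty.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 letters listed in Section 2.1


def moment_descriptors(seq):
    """Return the 60-D vector [AAC, AAM, AAV] of formulation (8) for one sequence."""
    length = len(seq)
    aac, aam, aav = [], [], []
    for aa in AMINO_ACIDS:
        # Position indicator of formulation (3); positions j are counted from 1.
        x = [1 if residue == aa else 0 for residue in seq]
        c = sum(x) / length                                          # formulation (2)
        m = sum(xj * j for j, xj in enumerate(x, start=1)) / length  # formulation (4)
        v = sum((xj * j - m) ** 2
                for j, xj in enumerate(x, start=1)) / length         # formulation (6)
        aac.append(c)
        aam.append(m)
        aav.append(v)
    return aac + aam + aav


# Hypothetical toy sequences: 'A' at positions 4 and 6, and at positions 3 and 7.
p5 = "GGGAGAGGGG"
p6 = "GGAGGGAGGG"
d5, d6 = moment_descriptors(p5), moment_descriptors(p6)
i = AMINO_ACIDS.index("A")
print("AAC:", d5[i], d6[i])
print("AAM:", d5[20 + i], d6[20 + i])
print("AAV:", d5[40 + i], d6[40 + i])
```

For these two toy sequences the AAC and AAM entries for A coincide while the AAV entries differ, which is the situation the third descriptor is meant to resolve.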

2.2 Multi-class SVM

Several classification algorithms have already been applied to protein subcellular localization, such as least Mahalanobis distance [14], neural network [15], covariant discriminant algorithm [16], Markov chain [17], fuzzy k-NN [18] and support vector machine [9,12,19,20]. The support vector machine (SVM) [21] has proved to be a fruitful learning machine, especially for classification. Since it was originally designed for binary classification, extending the binary SVM to multi-class problems is not straightforward. Constructing Ω-class SVMs (Ω > 2) is an ongoing research issue [22]. Basically, there are two kinds of approaches for multi-class SVM. One directly processes all data in one optimization formulation [23]. The other decomposes the multi-class problem into a series of binary SVMs, including “One-Versus-Rest” (OVR) [21], “One-Versus-One” (OVO) [24], and DAGSVM [25]. Although there are also several more sophisticated approaches for multi-class SVM, extensive experiments have shown that OVR, OVO and DAGSVM are practical [26,27].

OVR is probably the earliest approach for multi-class SVM. For an Ω-class problem, it constructs Ω binary SVMs. The ith SVM is trained with all the samples from the ith class as positive samples and all samples from the other classes as negative samples. Given a testing sample, all Ω SVMs are evaluated, and the sample is assigned to the class with the largest decision function value.

For an Ω-class problem, OVO constructs Ω(Ω − 1)/2 binary SVMs. During the evaluation, each of the Ω(Ω − 1)/2 SVMs casts one vote for its favored class, and finally the class with the most votes wins [24].

Compared with OVO, DAGSVM has the same training process but a different evaluation. During the evaluation, DAGSVM uses a directed acyclic graph (DAG) [25] architecture to make a decision. The idea of the DAG is easily implemented. Let T = {1, 2, …, Ω} be a list of class labels. When a testing sample is given, the DAG first evaluates this sample with the binary SVM that corresponds to the first and the last elements in list T. Whichever of the two classes the classifier does not prefer is eliminated from the list. Each evaluation therefore excludes one class label, so after Ω − 1 binary SVM evaluations the last label remaining in the list is the answer.

The SVM software we used is LIBSVM [26], which can be freely downloaded from http://www.csie.ntu.edu.tw/~cjlin/libsvm/ for academic research. All three methods can be implemented by modifying LIBSVM.
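The OVO voting rule and the DAG elimination rule described above can be sketched as follows. This is a minimal sketch under stated assumptions: the trained binary SVMs are replaced by a hypothetical stand-in, `nearest_centre`, which simply prefers the class whose one-dimensional centre is closer to the sample; the function names and the toy centres are not from the paper or from LIBSVM.

```python
from collections import Counter


def ovo_predict(x, classes, pairwise):
    """One-Versus-One: every pair of classes casts one vote; most votes wins."""
    votes = Counter()
    for idx, a in enumerate(classes):
        for b in classes[idx + 1:]:
            votes[pairwise(a, b, x)] += 1
    return votes.most_common(1)[0][0]


def dag_predict(x, classes, pairwise):
    """DAG: compare the first and last labels, drop the loser, repeat."""
    remaining = list(classes)
    evaluations = 0
    while len(remaining) > 1:
        first, last = remaining[0], remaining[-1]
        evaluations += 1
        if pairwise(first, last, x) == first:
            remaining.pop()       # the last label is eliminated
        else:
            remaining.pop(0)      # the first label is eliminated
    return remaining[0], evaluations


# Hypothetical stand-in for the trained binary SVMs (assumption for illustration).
CENTRES = {1: 0.0, 2: 1.0, 3: 2.0, 4: 3.0}


def nearest_centre(a, b, x):
    return a if abs(x - CENTRES[a]) <= abs(x - CENTRES[b]) else b


labels = [1, 2, 3, 4]
ovo_label = ovo_predict(2.2, labels, nearest_centre)
dag_label, n_evals = dag_predict(2.2, labels, nearest_centre)
print(ovo_label, dag_label, n_evals)
```

With four classes, OVO consults all six pairwise classifiers, whereas the DAG reaches a decision after three evaluations, in line with the Ω − 1 count given above.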


2.3 Test and Assessment

As in most papers, the prediction quality is assessed by the resubstitution, jackknife, and independent dataset tests [5]. The resubstitution test evaluates the self-consistency of the prediction system: the subcellular location of each protein in a dataset is predicted using parameters derived from the same training dataset. The jackknife test is generally regarded as the most objective and effective one: each protein in the training dataset is singled out in turn as a testing sample, and the remaining proteins are used as training samples to predict its class. The independent test indicates how well the predictive system generalizes in practical application: proteins in the training dataset are used as training samples and proteins in the independent testing dataset are used as testing samples. To assess the quality of the three tests, the total prediction accuracy and the prediction accuracy of each location are defined as [18,19,20]:

$$\text{Total accuracy} = \frac{1}{N} \sum_{\omega=1}^{\Omega} p(\omega) \tag{9}$$

$$\text{accuracy}(\omega) = \frac{p(\omega)}{\mathrm{obs}(\omega)} \tag{10}$$

where N is the total number of sequences, Ω is the class number, obs(ω) is the number of sequences observed in location ω and p(ω) is the number of correctly predicted sequences in location ω.
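The jackknife procedure and the two accuracy measures of formulations (9) and (10) can be sketched as below. The classifier is an assumption made purely for illustration: a 1-nearest-neighbour rule stands in for the SVM, and the function names and the six-sample toy dataset are hypothetical.

```python
def jackknife_predictions(X, y, fit_predict):
    """Leave each sample out in turn and predict it from the remaining ones."""
    predictions = []
    for k in range(len(X)):
        train_X = X[:k] + X[k + 1:]
        train_y = y[:k] + y[k + 1:]
        predictions.append(fit_predict(train_X, train_y, X[k]))
    return predictions


def accuracies(y_true, y_pred):
    """Total accuracy, formulation (9), and per-location accuracy, formulation (10)."""
    obs, correct = {}, {}
    for truth, guess in zip(y_true, y_pred):
        obs[truth] = obs.get(truth, 0) + 1
        correct[truth] = correct.get(truth, 0) + (1 if truth == guess else 0)
    total = sum(correct.values()) / len(y_true)
    per_location = {w: correct[w] / obs[w] for w in obs}
    return total, per_location


def nearest_neighbour(train_X, train_y, x):
    """Stand-in classifier (assumption): label of the closest training sample."""
    def sq_dist(a, b):
        return sum((ai - bi) ** 2 for ai, bi in zip(a, b))
    best = min(range(len(train_X)), key=lambda idx: sq_dist(train_X[idx], x))
    return train_y[best]


# Hypothetical toy data: one feature, two locations.
X = [[0], [1], [2], [10], [11], [4]]
y = ["a", "a", "a", "b", "b", "b"]
preds = jackknife_predictions(X, y, nearest_neighbour)
total, per_location = accuracies(y, preds)
print(preds, total, per_location)
```

In this toy run only the last sample is misclassified, so the total accuracy is 5/6 and the accuracy for location "b" is 2/3.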

3 Experiments and Discussion

In all the following experiments we train the SVMs with the RBF kernel only. In addition, to prevent features with larger numeric ranges from dominating those with smaller ranges, and to avoid numerical difficulties during the calculation, we scale all training data to [0,1] and apply the same transformation to all testing data.

3.1 Dataset

The training dataset and independent dataset [5] are used to validate the current method. The training dataset consists of 2191 protein sequences, of which 145 are chloroplast, 571 cytoplasm, 34 cytoskeleton, 49 endoplasmic reticulum, 224 extracellular, 25 Golgi apparatus, 37 lysosome, 84 mitochondrial, 272 nuclear, 27 peroxisomal, 699 plasma membrane and 24 vacuoles. The independent dataset consists of 2494 protein sequences, of which 112 are chloroplast, 761 cytoplasm, 19 cytoskeleton, 106 endoplasmic reticulum, 95 extracellular, 4 Golgi apparatus, 31 lysosome, 163 mitochondrial, 418 nuclear, 23 peroxisomal, and 762 plasma membrane.
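The [0,1] scaling described at the start of Section 3 can be sketched as a per-feature min-max transformation whose parameters are fitted on the training data and then reused for the testing data. The function names and the toy numbers are assumptions for illustration; the paper does not state which tool performed the scaling.

```python
def fit_min_max(train_rows):
    """Per-feature minimum and maximum, taken from the training data only."""
    columns = list(zip(*train_rows))
    return [min(col) for col in columns], [max(col) for col in columns]


def apply_min_max(rows, lows, highs):
    """Map each feature to (value - low) / (high - low); constant features map to 0."""
    return [
        [(value - low) / (high - low) if high > low else 0.0
         for value, low, high in zip(row, lows, highs)]
        for row in rows
    ]


# Hypothetical toy data with two features on very different scales.
train = [[1.0, 10.0], [3.0, 30.0], [2.0, 20.0]]
test = [[4.0, 15.0]]
lows, highs = fit_min_max(train)
train_scaled = apply_min_max(train, lows, highs)
test_scaled = apply_min_max(test, lows, highs)
print(train_scaled, test_scaled)
```

Because the transformation is fixed by the training data, a testing value can fall outside [0,1], as the first feature of the toy test sample does here.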


3.2 Results of Prediction

In this section, the approach for multi-class SVM is OVO, which is implemented directly by LIBSVM without any changes. Firstly, to show that the improvement obtained from amino acid position information is statistically significant, we measure the performance of AAC, AAM and MD in 10-fold cross validation and present the mean and standard deviation of the prediction accuracies in Table 1.

Table 1. Mean and standard deviation of prediction accuracies (%) obtained with OVO in 10-fold cross validation

Method   Mean    Standard deviation
AAC      79.78   3.341
AAM      80.14   2.235
MD       83.47   2.698

As shown in Table 1, AAM and MD improve the classification results, with a higher mean and a lower standard deviation of prediction accuracy than AAC. The positional bias of amino acids may contain more classification information than their compositional bias in protein subcellular localization; this remains open for further exploration. Then, we apply MD to Chou's dataset and list the prediction accuracies of subcellular localization in Table 2. As shown in Table 2, the total accuracy in the resubstitution, jackknife and independent tests reaches 99.2%, 79.9% and 85.8% respectively. It might seem that the classifier suffers from overfitting, since the accuracies of the resubstitution test are much higher than those on the independent dataset. However, by varying the SVM trade-off parameter from $2^{-3}$ to $2^{10}$, we find that the classifier does not in fact overfit in our experiments.

Table 2. Prediction accuracies (%) obtained with OVO in resubstitution, jackknife and independent tests, respectively

Location                Resubstitution   Jackknife   Independent
Chloroplast             98.6             75.9        84.8
Cytoplasm               99.7             87.9        88.8
Cytoskeleton            100.0            44.1        94.7
Endoplasmic reticulum   98.0             40.8        84.9
Extracellular           98.2             67.0        80.0
Golgi apparatus         100.0            32.0        50.0
Lysosome                100.0            62.2        96.8
Mitochondrial           96.4             36.9        22.1
Nuclear                 99.3             82.0        86.4
Peroxisomal             96.3             25.9        65.2
Plasma membrane         99.7             93.7        97.4
Vacuoles                100.0            29.2        -
Total accuracy          99.2             79.9        85.8


It is also worth noting that several small groups, namely cytoskeleton, endoplasmic reticulum, Golgi apparatus, peroxisomal and vacuolar, which have 34, 49, 25, 27 and 24 training samples, obtain poor prediction accuracies of 44.1%, 40.8%, 32.0%, 25.9% and 29.2% in the jackknife test, respectively. Better jackknife prediction may be achieved by increasing the number of training samples from updated databases. Moreover, mitochondrial proteins are predicted poorly in both the jackknife and independent tests (36.9% and 22.1%), even though there are 84 training samples. Better prediction may be obtained by subdividing mitochondrial proteins into inner membrane, outer membrane and matrix proteins.

3.3 Comparison of Feature Extraction

Here, in order to show the efficiency of MD, we compare it with other methods that extend AAC directly and extract features from the sequence alone, without incorporating physicochemical properties. These methods include the traditional amino acid composition (AAC) [2], amino acid pair/dipeptide composition (AAP) [11,12], and two-segment amino acid composition (2SAAC) [13]. We apply each of these methods to the same Chou's dataset and then compare MD with them. The approach for multi-class SVM is again OVO. The comparison results are presented in Table 3.

Table 3. Total accuracies (%) obtained with other methods using OVO in resubstitution, jackknife and independent tests, respectively

Method   Dim   Resubstitution   Jackknife   Independent
AAC      20    92.6             77.2        81.7
AAP      400   98.7             77.8        81.6
2SAAC    40    92.4             79.6        83.8
MD       60    99.2             79.9        85.8
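The gain of MD over each of the other methods can be recomputed directly from the entries of Table 3. The short sketch below does only that arithmetic; the dictionary layout and the helper name `gains_over` are assumptions for illustration.

```python
# Total accuracies (%) copied from Table 3:
# (resubstitution, jackknife, independent).
TABLE3 = {
    "AAC": (92.6, 77.2, 81.7),
    "AAP": (98.7, 77.8, 81.6),
    "2SAAC": (92.4, 79.6, 83.8),
    "MD": (99.2, 79.9, 85.8),
}


def gains_over(baseline, method="MD"):
    """Difference in percentage points between `method` and `baseline`, per test."""
    return tuple(round(m - b, 1)
                 for m, b in zip(TABLE3[method], TABLE3[baseline]))


for name in ("AAC", "AAP", "2SAAC"):
    print(name, gains_over(name))
```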

Compared with AAC, AAP and 2SAAC, MD improves the total accuracy by about 6.6, 0.5 and 6.8 percentage points in the resubstitution test, about 2.7, 2.1 and 0.3 percentage points in the jackknife test, and about 4.1, 4.2 and 2.0 percentage points in the independent test, respectively. These results show that MD is effective and helpful for the prediction of protein subcellular localization because it can extract more sequence-order information. In the future, further improvement may be achieved by incorporating physicochemical properties of amino acids.

3.4 Comparison of Multi-class SVMs

In order to compare the three multi-class SVMs mentioned in Section 2.2, we also train DAGSVM and OVR based on LIBSVM, with some modification of its source code, and present the results in Table 4.

Table 4. Total accuracies (%) obtained with DAGSVM, OVR, and OVO in resubstitution, jackknife and independent tests, respectively

Multi-class SVM   Resubstitution   Jackknife   Independent
MD(DAG)           99.2             80.1        85.6
MD(OVR)           99.2             79.8        85.4
MD(OVO)           99.2             79.9        85.8

We find that OVO, OVR and DAG have very similar classification accuracy and that they differ mainly in the number of support vectors, the training time and the testing time. To examine these differences further, we ran the training, resubstitution and independent tests 10 times each, and list the number of support vectors (SV) and the maximum (Max) and minimum (Min) times in Table 5.

Table 5. The number of support vectors and the time consumed (seconds) by DAGSVM, OVR and OVO for training, resubstitution, and independent tests, respectively

Method    SV     Training Max   Training Min   Resubstitution Max   Resubstitution Min   Independent Max   Independent Min
MD(DAG)   1603   2.766          2.765          2.000                2.000                2.219             2.203
MD(OVR)   1686   6.812          6.578          2.328                2.312                2.547             2.422
MD(OVO)   1603   2.765          2.657          2.110                2.094                2.438             2.312

Each binary SVM of OVR is optimized on all N training samples, although OVR requires only Ω binary SVMs. OVO and DAG have Ω(Ω − 1)/2 binary SVMs to train; however, their total training time is still lower because each binary SVM is trained only on the samples from two classes. In our experiments, OVR carries a heavy training burden, with almost 2.5 times the training time of OVO or DAG. Because the testing time is dominated by kernel evaluations, it is almost proportional to the number of support vectors. In addition, DAG is slightly faster than OVO in testing but needs an extra data structure to index the binary SVMs, so it occupies slightly more memory than OVO. As described above, except for the training time, the performance of DAG, OVO and OVR is very similar. Hence, we suggest that DAGSVM and OVO may be more suitable in practical use.
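The classifier counts behind this comparison are easy to tabulate. The sketch below assumes Ω = 12, the number of locations in the training dataset, and takes the two maximum training times from Table 5; the function name `binary_svm_counts` is hypothetical.

```python
def binary_svm_counts(n_classes):
    """Number of binary SVMs each decomposition has to train."""
    pairwise = n_classes * (n_classes - 1) // 2
    return {"OVR": n_classes, "OVO": pairwise, "DAG": pairwise}


n_locations = 12                      # locations in the training dataset
counts = binary_svm_counts(n_locations)
dag_evaluations = n_locations - 1     # binary evaluations per DAG prediction

# Maximum training times (seconds) of OVR and DAG, copied from Table 5.
ovr_to_dag_ratio = 6.812 / 2.766

print(counts, dag_evaluations, round(ovr_to_dag_ratio, 2))
```

With 12 locations, OVO and DAG train 66 small pairwise classifiers while OVR trains 12 classifiers on the full training set, and a DAG prediction needs 11 binary evaluations.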

4 Conclusion

In this paper, we have developed a novel feature extraction method, called moment descriptor, which extracts features from the sequence alone without incorporating physicochemical properties, and have applied multi-class SVMs to protein subcellular localization for Chou’s protein dataset.

Compared with other methods based on various extensions of amino acid composition, the moment descriptor is shown to represent protein sequence-order information more effectively. Moreover, except for the training time, the three types of multi-class SVMs show similar performance. The results show that the moment descriptor may be an effective feature extraction method for protein localization prediction.

Acknowledgments. The authors would like to thank Prof. Kuo-Chen Chou (Gordon Life Science Institute, San Diego, CA 92130, USA) for providing the database. This paper was supported, in part, by National Natural Science Foundation of China (No.60372085) and Technological Innovation Foundation of Northwestern Polytechnical University (No. KC02).

References
1. Feng, Z.P.: An Overview on Predicting Subcellular Location of a Protein. In Silico Biol. 2(2002), 0027
2. Nakashima, H., Nishikawa, K.: Discrimination of Intracellular and Extracellular Proteins Using Amino Acid Composition and Residue-Pair Frequencies. J. Mol. Biol. 238(1994), 54–61
3. Feng, Z.P., Zhang, C.T.: Prediction of the Subcellular Localization of Prokaryotic Proteins Based on the Hydrophobicity Index of Amino Acids, Int. J. Biol. Macromol. 28(2001), 255–261
4. Feng, Z.P., Zhang, C.T.: A Graphic Representation of Protein Sequence and Predicting the Subcellular Localizations of Prokaryotic Proteins, Int. J. Biochem. Cell Biol. 34(2002), 298–307
5. Chou, K.C.: Prediction of Protein Cellular Attributes Using Pseudo-Amino-Acid-Composition, Proteins 43(2001), 246–255
6. Zhou, G.P., Doctor K.: Subcellular Location Prediction of Apoptosis Proteins, Proteins 50(2003), 44–48
7. Cai, Y.D., Chou, K.C.: Nearest Neighbour Algorithm for Predicting Protein Subcellular by Combining Functional Domain Composition and Pseudo Amino Acid Composition, Biochem. Biophys. Res. Commun. 305(2003), 407–411
8. Chou, K.C., Cai, Y.D.: A New Hybrid Approach to Predict Subcellular Localization of Proteins by Incorporating Gene Ontology, Biochem. Biophys. Res. Commun. 311(2003), 743–747
9. Chou, K.C., Cai, Y.D.: Using Functional Domain Composition and Support Vector Machines for Prediction of Protein Subcellular Location, J. Biol. Chem. 277(2002), 45765–45769
10. Pan, Y.X., Zhang, Z.Z., Guo, Z.M., Feng, G.Y., Huang, Z.D., He, L.: Application of Pseudo Amino Acid Composition for Predicting Protein Subcellular Location: Stochastic Signal Processing Approach, J. Protein Chem. 22(2003), 395–402
11. Bhasin, M., Raghava, G.P.S.: ESLpred: SVM-Based Method for Subcellular Localization of Eukaryotic Proteins Using Dipeptide Composition and PSI-BLAST, Nucleic Acids Res. 32(2004), W414–W419
12. Park, K.J., Kanehisa, M.: Prediction of Protein Subcellular Locations by Support Vector Machines Using Compositions of Amino Acids and Amino Acid Pairs, Bioinformatics 19(2003), 1656–1663
13. Cui, Q., Jiang, T., Liu, B., Ma, S.: Esub8: A Novel Tool to Predict Protein Subcellular Localizations in Eukaryotic Organisms, BMC Bioinformatics 5(2004), 66–72
14. Chou, K.C.: A Novel Approach to Predicting Protein Structural Classes in a (20-1)-D Amino Acid Composition Space, Proteins 21(1995), 319–344
15. Reinhardt, A., Hubbard, T.: Using Neural Networks for Prediction of the Subcellular Localization of Proteins, Nucleic Acids Res. 26(1998), 2230–2236
16. Chou, K.C., Elrod, D.: Protein Subcellular Localization Prediction, Protein Eng. 12(1999), 107–118
17. Yuan, Z.: Prediction of protein subcellular localizations using Markov chain models, FEBS Lett. 451(1999), 23–26
18. Huang, Y., Li, Y.D.: Prediction of protein subcellular locations using fuzzy k-NN method, Bioinformatics 20(2001), 21–28
19. Hua, S.J., Sun, Z.R.: Support Vector Machine Approach for Protein Subcellular Localization Prediction, Bioinformatics 17(2001), 721–728
20. Zhang, S.W., Pan, Q., Zhang, H.C., Shao, Z.C., Shi, J.Y.: Prediction Protein Homooligomer Types by Pseudo Amino Acid Composition: Approached with an Improved Feature Extraction and Naive Bayes Feature Fusion, Amino Acid (2006), in press
21. Vapnik, V.: Statistical Learning Theory. Wiley, New York (1998)
22. Bredensteiner, E., Bennett, K.: Multicategory Classification by Support Vector Machines. Comput. Optim. Appl. 12(1999), 53–79
23. Crammer, K., Singer, Y.: On the Algorithmic Implementation of Multiclass Kernel-Based Vector Machines, J. Mach. Learn. Res. 2(2001), 265–292
24. Kreßel, U.: Pairwise Classification and Support Vector Machines, In Schölkopf, B., Burges, C.J., Smola, A.J. (eds): Advances in Kernel Methods: Support Vector Learning, Cambridge, MA, MIT Press (1999), 255–268
25. Platt, J., Cristianini, N., Shawe-Taylor, J.: Large Margin DAGs for Multiclass Classification, In Solla, S.A., Leen, T.K., Muller, K.-R. (eds): Advances in Neural Information Processing Systems, Vol. 12. MIT Press (2000), 547–553
26. Hsu, C., Lin, C.J.: A Comparison of Methods for Multi-Class Support Vector Machines, IEEE T. Neural Networks 13(2002), 415–425
27. Rifkin, R., Klautau, A.: In defense of one-vs-all classification, J. Mach. Learn. Res. 5(2004), 101–141

Using Permutation Patterns for Content-Based Phylogeny

Md Enamul Karim¹, Laxmi Parida², and Arun Lakhotia¹

¹ Center for Advanced Computer Studies, University of Louisiana at Lafayette, USA
{mek, arun}@cacs.louisiana.edu
² Computational Biology Center, IBM T J Watson Research Center, Yorktown Heights, USA
[email protected]

Abstract. When the same set of genes appear in different orders on the chromosomes, they form a permutation pattern. Permutation patterns have been used to identify potential haplogroups in mammalian data [8]. They have also been used successfully to detect phylogenetic relationships between computer viruses [9]. In this paper we explore the use of these patterns as a content similarity measure and use this in inferring phylogenies from genome rearrangement data in polynomial time. The method uses a function of the cardinality of the set of common maximal permutation patterns as a proxy for evolutionary “proximity” between genomes. We introduce Pi-logen, a phylogeny tool based on this method. We summarize the results of a feasibility study for this scheme on synthetic data by (1) content verification and (2) ancestor prediction. We also successfully infer phylogenies on a series of synthetic data sets and on chloroplast gene order data of Campanulaceae.

1 Introduction

Genome rearrangements may occur due to events such as inversions, transpositions, fusions, fissions, insertions or deletions. A major challenge in building a phylogeny from genome rearrangement data is estimating the common ancestor, either by reversing the effect of evolutionary events or by some other means. Early approaches used breakpoint distance [2], [10], [11] to estimate the effect of evolution. Breakpoints are adjacencies of genes that are present in one genome but not in the other, and the breakpoint distance is the total number of such breakpoints. Consider two genomes, each with five genes, as shown below. The two breakpoints are shown by the arrows, and the breakpoint distance between G1 and G2 is two.

G1 = g1 g2 g3 g4 g5
G2 = g1 ↑ g3 g2 ↑ g4 g5

Thus breakpoints in the genome indicate the operations transposition and inversion. However, one or zero (absent) breakpoints may correspond to multiple such operations. Moreover, computing a breakpoint phylogeny is an NP-hard problem [12]. Also, it is unclear how to suitably adapt it for multiple genomes.

J.C. Rajapakse, L. Wong, and R. Acharya (Eds.): PRIB 2006, LNBI 4146, pp. 115–125, 2006. © Springer-Verlag Berlin Heidelberg 2006
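The breakpoint count for the five-gene example can be checked with a short sketch. It assumes unsigned genes and treats an adjacency as an unordered pair of neighbouring genes, so that g2 g3 and g3 g2 count as the same adjacency; the function names are hypothetical.

```python
def adjacencies(genome):
    """Unordered pairs of neighbouring genes in a linear, unsigned genome."""
    return {frozenset(pair) for pair in zip(genome, genome[1:])}


def breakpoint_distance(g1, g2):
    """Number of adjacencies of g1 that are absent from g2."""
    return len(adjacencies(g1) - adjacencies(g2))


G1 = ["g1", "g2", "g3", "g4", "g5"]
G2 = ["g1", "g3", "g2", "g4", "g5"]
print(breakpoint_distance(G1, G2))
```

For this example the adjacencies g1 g2 and g3 g4 of G1 are missing from G2, giving a breakpoint distance of two.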


Yet another scheme is the use of the reversal distance between two genomes as an estimate of the evolutionary distance. This has been extensively studied in the literature [1], [14], [15], [16], and the use of signed genes actually renders the problem polynomial-time solvable for a pair of genomes [13]. A reversal in a signed permutation is an operation that takes an interval in a permutation, reverses the order of the numbers, and changes their signs. In the following example, G3 is transformed into G3′ by one reversal of the boxed segment (shown here in brackets).

G3 = g5 g1 [g3 g2 −g9 g7] −g4 g6 g8
G3′ = g5 g1 [−g7 g9 −g2 −g3] −g4 g6 g8

The reversal distance between two genomes is the minimum number of reversals required to get from one genome to the other. One reversal can eliminate at most two breakpoints. Though the reversal distance on signed permutations can be computed in polynomial time, its generalized version is still an NP-hard problem [4]. Reversal distance, like breakpoint distance, can underestimate the actual number of steps that occurred biologically, and it prefers all the genomes under study to have the same set of genes. Various other hybrid and heuristic-based approaches have also been studied in the literature, and the reader is directed to [5] for an excellent summary.

In this paper, we propose a content-similarity based measure to handle gene order data based on permutation patterns [6]. The content similarity is based on the nature and location of the permutation patterns that co-occur in the genomes. Through simulations, we observe that ancestral information is substantially preserved through the common permutations of various lengths. Based on this observation we develop a similarity matrix and use this matrix to estimate ancestor information and build the phylogeny. Pi-logen is an implementation of this scheme: it can be used for multi-chromosomal genomes and has a polynomial run time. The tool is available at www.cacs.louisiana.edu/~mek8058/Pi-logen.
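The signed reversal described earlier in this section can be sketched as a single list operation. In this illustrative sketch each gene gi is encoded as the integer i with its sign, and `reversal` is a hypothetical helper name; the interval is given as a Python slice (start index inclusive, end index exclusive).

```python
def reversal(perm, start, end):
    """Reverse perm[start:end] and flip the sign of every gene in that interval."""
    flipped = [-gene for gene in reversed(perm[start:end])]
    return perm[:start] + flipped + perm[end:]


# G3 from the example, with gene gi encoded as the signed integer i.
G3 = [5, 1, 3, 2, -9, 7, -4, 6, 8]
G3_prime = reversal(G3, 2, 6)   # the segment g3 g2 -g9 g7
print(G3_prime)
```

Applying the same reversal a second time restores the original genome, since a signed reversal is its own inverse.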

2 Permutation Patterns

Consider genomes G1 and G2 defined as gene orders:

G1 = g11 g1 g2 g3 g4 g5 g6 g7 g8 g12
G2 = g2 g4 g1 g3 g5 g9 g7 g6 g10

Is there anything common between the two genomes? The following clusters, or groups of genes, appear in the two genomes: p1 = {g1, g2, g3, g4} and p2 = {g6, g7}. p1 and p2 are called permutation patterns.

Let G1 and G2, defined on Σ, be of length n each. The number of common permutation patterns in G1 and G2 can be no more than O(n^2), since each pattern starts at some location i and ends at some location j > i, with 1 ≤ i < j ≤ n. The following example shows that O(n^2) can actually be attained. Consider G1 and G2 of size 4n each:

G1 = g1a g1b g1c g1d g2a g2b g2c g2d ... gna gnb gnc gnd
G2 = g1c g1a g1d g1b g2c g2a g2d g2b ... gnc gna gnd gnb

These two sequences have O(n^2) common permutation patterns, given by

$$p_{ij} = \bigcup_{k=i}^{j} \{g_{ka}, g_{kb}, g_{kc}, g_{kd}\}, \qquad 1 \le i < j \le n.$$

The following lemma is easy to see.
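As a rough illustration of what a common permutation pattern is, the brute-force sketch below enumerates every gene set that occupies a contiguous window in both gene orders. It is not the algorithm of [6]; it assumes each gene occurs at most once per genome, encodes g_k as the integer k, and the helper names `window_sets` and `common_patterns` are hypothetical.

```python
def window_sets(genome, min_size=2):
    """Return the gene sets of all contiguous windows of at least min_size genes."""
    found = set()
    n = len(genome)
    for start in range(n):
        for end in range(start + min_size, n + 1):
            found.add(frozenset(genome[start:end]))
    return found


def common_patterns(genome_a, genome_b, min_size=2):
    """Gene sets that appear as a contiguous window in both genomes."""
    return window_sets(genome_a, min_size) & window_sets(genome_b, min_size)


# The two gene orders from the text.
G1 = [11, 1, 2, 3, 4, 5, 6, 7, 8, 12]
G2 = [2, 4, 1, 3, 5, 9, 7, 6, 10]

patterns = common_patterns(G1, G2)
for pattern in sorted(patterns, key=lambda s: (len(s), sorted(s))):
    print(sorted(pattern))
```

On this input the enumeration reports p1 and p2 from the text and, in addition, the larger set {g1, ..., g5}, which also occupies a window in both genomes. The quadratic number of windows per genome is the source of the O(n^2) bound discussed above.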

Using Permutation Patterns for Content-Based Phylogeny


Lemma 1. Given m sequences of length n each, the number of permutation patterns that appear in at least K (≥ 2) sequences is ≤ mn^2.

Given K, 2 ≤ K ≤ m, and a collection of m sequences (genomes) G_i, 1 ≤ i ≤ m, let P be the collection of all permutation patterns that appear at least K times. For each permutation pattern p ∈ P we define an m-dimensional array F_p as follows, where f(p) is some appropriate function of p. The entries F_p[i], taken over all p ∈ P, can be viewed as the feature vector of G_i:

$$F_p[i] = \begin{cases} f(p) & \text{if } p \text{ occurs in } G_i \\ 0 & \text{otherwise} \end{cases}$$

2.1 Dimension Reduction Via Maximality

Choosing the right length of a permutation pattern to extract content information is tricky. One option is to use all possible lengths; however, this gives an O(n^2)-dimensional feature space (n is the length of the genome). We tackle this problem using maximal permutation patterns, which reduce the dimension to O(n). Maximal permutation patterns cover all the permutation patterns of different granularity. We recall the definition of a maximal permutation pattern [6].

Definition 1. (maximal) Let P be the collection of all permutation patterns on a given data. Let p1, p2 ∈ P be such that each occurrence of p2 in the data is covered by an occurrence of p1 and each occurrence of p1 covers an occurrence of p2; then p2 is not maximal with respect to p1. A pattern p ∈ P is maximal if there exists no q ∈ P such that p is not maximal with respect to q.

Consider the following example: G1 = g1 g2 g3 g9 g0 g4 g5 g6, G2 = g3 g0 g2 g9 g4 g5 g7 g8 and G3 = g4 g9 g2 g0 g3 g8 g7 g5. Let P′ be the collection of all maximal permutation patterns; then P′ = {p1, p2, p3}, where p1 = {g0, g2, g3, g4, g5, g7, g8, g9} occurs in G2 and G3; p2 = {g0, g2, g3, g4, g5, g9} occurs in G1 and G2; and p3 = {g0, g2, g3, g4, g9} occurs in G1, G2 and G3. Other permutation patterns are not in P′ because they are not maximal w.r.t. at least one of p1, p2, p3. For example, {g4, g5} is a permutation pattern appearing in G1 and G2; however, both of its occurrences are covered by the two occurrences of p2, making it non-maximal. For the above example, by the definition, F_p1 = [0, 1, 1], F_p2 = [1, 1, 0], F_p3 = [1, 1, 1]. Now, the following lemma is straightforward to verify.

Lemma 2. (p is not maximal w.r.t. q) ⇒ (F_p = F_q).

The converse of the lemma is not true, since there may be distinct permutation patterns that occur in the same input sequences.

Each maximal permutation pattern may have two kinds of components: (1) sequence-preserving and (2) true permutations. For example, in p2, {g0, g2, g3, g9} is a true permutation and {g4, g5} is a sequence-preserving component. In fact a


maximal permutation pattern has a clean hierarchical structure that is explored in [8]. Let us define cnt1(p) to be the ratio of sequence-preserving components and cnt2(p) to be the ratio of true permutation components in p.

2.2 Similarity Measure

We use the common permutations as an estimate of the similarity between genomes. Let P′ be the set of maximal permutation patterns and let P_i ⊆ P′ be the collection that occurs in genome i. We define the similarity between two genomes i and j, S(i, j), as

$$S(i,j) = \sum_{(p \in P_i \cap P_j)\ \mathrm{OR}\ (p \notin P_i \cup P_j)} \bigl( (1-\alpha)\,\mathrm{cnt}_1(p) + \alpha\,\mathrm{cnt}_2(p) \bigr) \qquad (1)$$

where the sum ranges over p ∈ P′ and 0 ≤ α ≤ 1 is a fixed constant (weighting factor) to take care of the effect of the internal structure of p, if there exists any. This similarity measure rewards a pattern for being present in both genomes i and j or absent from both, and penalizes it for being present in one and absent in the other.
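Equation (1) can be sketched directly on the example of Section 2.1. The occurrence sets below follow the feature vectors given there (p1 in G2 and G3, p2 in G1 and G2, p3 in all three). The cnt1 and cnt2 values are made-up placeholders chosen only for illustration, not values computed by Pi-logen, and the function name `similarity` is hypothetical.

```python
def similarity(patterns_i, patterns_j, all_patterns, cnt1, cnt2, alpha):
    """Evaluate S(i, j) of Equation (1).

    A pattern contributes its weight when it occurs in both genomes
    or in neither of them; otherwise it contributes nothing.
    """
    total = 0.0
    for p in all_patterns:
        if (p in patterns_i) == (p in patterns_j):
            total += (1 - alpha) * cnt1[p] + alpha * cnt2[p]
    return total


all_patterns = ["p1", "p2", "p3"]
occurs_in = {
    1: {"p2", "p3"},        # G1
    2: {"p1", "p2", "p3"},  # G2
    3: {"p1", "p3"},        # G3
}

# Hypothetical component ratios, for illustration only.
cnt1 = {"p1": 0.25, "p2": 0.5, "p3": 0.2}
cnt2 = {"p1": 0.75, "p2": 0.5, "p3": 0.8}
alpha = 0.4

for i, j in [(1, 2), (1, 3), (2, 3)]:
    s = similarity(occurs_in[i], occurs_in[j], all_patterns, cnt1, cnt2, alpha)
    print(f"S({i}, {j}) = {s:.2f}")
```

With these placeholder weights, G1 and G3 come out least similar because they agree only on p3, which matches the intuition behind the measure.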

3 Method

In this section we describe the method implemented in Pi-logen: it uses an agglomerative hierarchical clustering method [7]. Given the m genomes, in this scheme every genome is initially considered a cluster. Then the two genomes with the highest similarity are combined into a cluster. This iterative procedure continues until a stopping criterion is fulfilled (a single cluster, say).

Computing pairwise similarity. The pairwise similarity measure for genomes G_i, 1 ≤ i ≤ m, for a given quorum K (≤ m) and a weighting factor 0 ≤ α ≤ 1, is computed in four steps as follows:

(Step 1) Compute P′, the collection of all the maximal permutation patterns that occur in at least K genomes. Let |P′| be denoted by n. For each p_i ∈ P′, also compute cnt1(p_i) and cnt2(p_i) (see Section 2.1).

(Step 2) Create m feature vectors of n dimensions each as the (m × n) feature matrix F:

$$F[i,j] = \begin{cases} 1 & \text{if } p_j \text{ occurs in } G_i \\ 0 & \text{otherwise} \end{cases}$$

This matrix is required for easy updates during the clustering.

(Step 3) Build a temporary (m × m) × n matrix T:

$$T[i,j,k] = \begin{cases} 1 & \text{if } F[i,k] = F[j,k] \\ 0 & \text{otherwise} \end{cases}$$

This matrix is not explicitly built but is given here for ease of exposition. Note that in the next step, this matrix can be temporarily built as and when required.


(Step 4) Build an m × m similarity matrix S as follows:

$$S[i,j] = \sum_{k=1}^{n} T[i,j,k]\,\bigl( (1-\alpha)\,\mathrm{cnt}_1(p_k) + \alpha\,\mathrm{cnt}_2(p_k) \bigr)$$

This completes the computation of the similarity matrix S.

Hierarchical clustering. The iterative process is applied as follows:

m′ ← m
Compute the (m × m) similarity matrix S
Repeat
  (a) Let S[p, q] have the largest value in S; link p, q
  (b) Replace row q by (q ∨ p) in F and recompute row and column q in S accordingly
  (c) Remove row and column p from F and S
  (d) m′ ← (m′ − 1)
Until (m′ = 1)

The hierarchy that is constructed by this process corresponds to the inferred phylogeny tree. In the actual implementation we use either the upper or the lower triangle of S, because S is symmetric, and simply ignore row and column p instead of actually removing them.

Time complexity. Assume that all m genomes are of length N each. Step 1 takes O(N^2 m log G log n) time, where G is the number of distinct genes in the data [6], [8]. Step 2 takes O(mn) time and Step 4 takes O(m^2 n) time. The clustering loop is iterated O(m) times, each iteration taking O(mn) time. Thus the algorithm takes O(m^2 n + N^2 m log G log n) time.
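The clustering loop above can be sketched as follows, under stated assumptions: the feature matrix F and the per-pattern weights w_k = (1 − α) cnt1(p_k) + α cnt2(p_k) are taken as given, the toy input is invented, and the function names are hypothetical. For brevity the sketch recomputes all pairwise similarities in every round rather than updating only row and column q of S, so it does not achieve the O(mn) per-iteration cost quoted above.

```python
def pair_similarity(row_a, row_b, weights):
    """Steps 3 and 4: sum the weights of the patterns on which two rows agree."""
    return sum(w for a, b, w in zip(row_a, row_b, weights) if a == b)


def cluster(feature_matrix, weights):
    """Agglomerative clustering following steps (a)-(d); returns a nested tuple."""
    rows = {i: list(row) for i, row in enumerate(feature_matrix)}  # active clusters
    trees = {i: i for i in rows}                                   # subtree per cluster
    while len(rows) > 1:
        # (a) pick the most similar pair of active clusters
        p, q = max(
            ((a, b) for a in rows for b in rows if a < b),
            key=lambda pair: pair_similarity(rows[pair[0]], rows[pair[1]], weights),
        )
        # (b) ancestor content is the union (logical OR) of the two rows
        rows[q] = [x | y for x, y in zip(rows[p], rows[q])]
        trees[q] = (trees[p], trees[q])
        # (c), (d) drop cluster p; one fewer cluster remains
        del rows[p]
        del trees[p]
    return next(iter(trees.values()))


# Invented toy input: 4 genomes, 4 patterns, unit weights.
F = [
    [1, 1, 0, 0],
    [1, 1, 0, 1],
    [0, 0, 1, 1],
    [0, 0, 1, 0],
]
tree = cluster(F, [1.0, 1.0, 1.0, 1.0])
print(tree)
```

Ties in step (a) are broken by enumeration order here; the text does not say how Pi-logen breaks ties, so this is an arbitrary choice.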

4 Feasibility Experiments

The key contributions of our approach are the ancestor content prediction (step (b) in the repeat loop), based on maximal permutation patterns, and the similarity measure (Steps 3 and 4). The effectiveness of this approach in discovering a good phylogeny depends on the feasibility of these two methods. Hence, we set up two experiments using synthetic data to verify their feasibility empirically. Synthetic data, in all the experiments, is produced through simulation of evolution. For the first experiment we carry out a content verification test. For the second one we check how well similar species are grouped together under each internal node (ancestor). We call this measure "ancestor prediction". The second experiment also involves estimating a good weighting factor, α.

(1) Content verification: The experiment involves taking a genome of length n and applying d "evolution" edit operations on it to obtain a set of m evolved


Fig. 1. Content verification: Number of maximal permutation patterns recovered is plotted against the actual number of permutations in the data. See text for details. (Plot: y-axis, number of maximal permutation patterns, recovered and actual, for n = 30 and n = 100; x-axis, #(reversal+transposition)/marker, from 0 to 1.)

genomes. These operations are reversals and transpositions. The nature and location of each operation is picked at random. Then we look for the presence of P′, the union of permutation patterns in the m evolved genomes, in the ancestor genome. The ancestral genome is the identity permutation 0 1 2 3 ... (n−2) (n−1). We describe the results for n = 30, 100 and m = 3. Three genomes G1, G2, G3 are obtained by d edit operations (i.e., reversals and transpositions) each on the ancestral genome. Let the change ratio r_d be defined as r_d = md/n. The simulations are repeated a number of times to obtain an average trend and the result is shown in Figure 1. It plots the number of recovered maximal permutation patterns and the actual number of them in the ancestor sequence against r_d.

Because we use the UNION operation to compute ancestral content from descendants, it is likely that we overestimate ancestral content. We use an evaluation function ω_n that calculates the amount of overestimation in the ancestral content prediction, where the ancestral genome is of length n. The experiment is performed for l distinct change ratios. If, for a specific change ratio, the predicted number of permutations is o_n and a_n of them are actually present, then

$$\omega_n = l^{-1} \sum_{i=1}^{l} \frac{a_n - o_n}{a_n}$$

where a_n and o_n are the values obtained for the i-th change ratio. Since a_n ≤ o_n, each term is non-positive; the values reported below are magnitudes.

In our experiments, we obtained ω_30 = 0.083 and ω_100 = 0.052. In other words, on average, there is an overestimation of only about 8% on genomes of length 30 and of only about 5% on genomes of length 100.

(2) Ancestor prediction: We created a set of m genomes with n genes each. These m genomes correspond to the m leaves of a reference phylogeny tree T_r. In this tree, each descendant is obtained by at most d edit operations (inversion or transposition). The length of a segment affected by these edit operations is randomly chosen between 1 and 10.
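The simulated evolution used in both experiments can be sketched as below. This is one plausible reading of the setup, not the authors' generator: it assumes signed genes numbered 1..n (so that every gene can carry a sign), an equal chance of picking a reversal or a transposition, and a uniformly chosen segment length of at most 10. All function names are hypothetical.

```python
import random


def random_reversal(genome, rng, max_len=10):
    """Reverse a random segment and flip the signs of its genes."""
    n = len(genome)
    length = rng.randint(1, min(max_len, n))
    start = rng.randint(0, n - length)
    flipped = [-g for g in reversed(genome[start:start + length])]
    return genome[:start] + flipped + genome[start + length:]


def random_transposition(genome, rng, max_len=10):
    """Cut a random segment and reinsert it at a random position."""
    n = len(genome)
    length = rng.randint(1, min(max_len, n - 1))
    start = rng.randint(0, n - length)
    segment = genome[start:start + length]
    rest = genome[:start] + genome[start + length:]
    target = rng.randint(0, len(rest))  # may coincide with the old position
    return rest[:target] + segment + rest[target:]


def evolve(genome, d, rng, max_len=10):
    """Apply d randomly chosen edit operations to a copy of the genome."""
    for _ in range(d):
        operation = rng.choice([random_reversal, random_transposition])
        genome = operation(genome, rng, max_len)
    return genome


rng = random.Random(0)              # fixed seed for repeatability
ancestor = list(range(1, 121))      # n = 120 genes
children = [evolve(ancestor, 4, rng) for _ in range(2)]
print(children[0][:12])
```

Calling `evolve` recursively on each child, level by level, would produce the 16 leaves of a complete binary reference tree such as the one used in the second experiment.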


Our suite of experiments uses m = 16 and n = 120. For each d we produce five such data sets and run the experiments over a series of α. For comparison purposes, every time we generated 16 species, we maintained a fixed reference tree, T_r = (T_r1, T_r2), where

T_r1 = (((0, 1), (2, 3)), ((4, 5), (6, 7)))
T_r2 = (((8, 9), (10, 11)), ((12, 13), (14, 15)))

Note that this reference tree is a complete binary tree that has fifteen internal nodes, including the root node. In each experiment, the topology of the reference tree is the same, but the edit (or evolution) operations on the branches of the tree, as well as the genomes corresponding to the leaves, are different.

Measuring matches of trees. We used the following measure to match different trees that have the same set of leaf node labels. Let an internal node (ancestor) numbered i of tree T_X be denoted as T_X^i, and let D(T_X^i) denote the set of leaves reachable from this node. For example, in the reference tree, D(T_r1) = {0, 1, 2, 3, 4, 5, 6, 7}. If two sets D1 and D2 are equal then we have an ancestor match; formally,

$$\delta(D_1, D_2) = \begin{cases} 1 & \text{if } D_1 = D_2 \\ 0 & \text{otherwise} \end{cases}$$

We use a simple measure Match(T_I, T_r) to compare the inferred tree T_I with the reference tree T_r. Its values range from 0 to (m − 1), with (m − 1) denoting a perfect match and 0 denoting a complete mismatch. Formally,

$$\mathrm{Match}(T_I, T_r) = \sum_{i=1}^{m-1} \sum_{j=1}^{m-1} \delta\bigl( D(T_I^j),\, D(T_r^i) \bigr)$$
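The Match measure can be sketched on trees written as nested tuples, the same notation used for T_r1 and T_r2 above. The helper names `leaf_sets` and `match` are hypothetical, and the perturbed tree is an invented example.

```python
def leaf_sets(tree):
    """Return (leaves below this node, leaf sets of all internal nodes below it)."""
    if not isinstance(tree, tuple):          # a leaf label
        return frozenset([tree]), []
    leaves = frozenset()
    collected = []
    for child in tree:
        child_leaves, child_sets = leaf_sets(child)
        leaves |= child_leaves
        collected.extend(child_sets)
    collected.append(leaves)                 # D(.) for this internal node
    return leaves, collected


def match(inferred, reference):
    """Count pairs of internal nodes whose reachable leaf sets are equal."""
    _, inferred_sets = leaf_sets(inferred)
    _, reference_sets = leaf_sets(reference)
    return sum(1 for di in inferred_sets for dr in reference_sets if di == dr)


Tr1 = (((0, 1), (2, 3)), ((4, 5), (6, 7)))
Tr2 = (((8, 9), (10, 11)), ((12, 13), (14, 15)))
Tr = (Tr1, Tr2)

# Invented inferred tree: leaves 1 and 2 are swapped relative to Tr.
Ti = ((((0, 2), (1, 3)), ((4, 5), (6, 7))), Tr2)

print(match(Tr, Tr), match(Ti, Tr))
```

Swapping two leaves across sibling cherries destroys exactly the two ancestors (0, 1) and (2, 3), while every ancestor above them still covers the same leaves.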

Estimating α, the weighting factor: A control parameter in this scheme is the weighting factor α in the similarity measure S(i, j) of genomes i and j, as shown in Equation (1). We carry out a series of experiments and the results are summarized in Table 1. We obtain the best values of Match for (1 − α) = 0.5 and 0.6. We verify this with further simulation experiments, whose results are summarized in Table 2.

Table 1. Match(T_I, T_r), for d edit operations and weighting factor α

                                       (1 − α)
d          0.0   0.1   0.2   0.3   0.4   0.5   0.6   0.7   0.8   0.9   1.0  Average
1          7.3   8.9   9.1  10.8    12  12.7  12.1  10.9  10.8  10.9  10.8    10.57
2          9.2   9.3   9.7  10.7  11.7  12.2  12.4  11.5  11.3  10.9  11.1    10.91
3          8.4   8.5   9.5  10.3  11.5  11.2  11.5  10.2   9.7   9.4   9.4     9.96
4         10.1  12.2  12.9  13.5  13.8  14.2  14.2  14.1  13.7  13.4  12.7    13.16
5          9.7  10.6  11.5  12.9    13  13.7  13.5  13.4  12.9  12.2  11.3    12.25
6         12.6  12.8  12.8  13.3  13.8  13.9  14.2  13.7  13.7  13.1    13    13.35
7         13.2  12.9  12.9  13.7  13.8  14.3  14.1    13  13.1  12.9  12.7    13.33
8         13.1  13.3  13.4  14.2  14.2  14.3  14.3  13.4  13.1  14.1    14    13.76
Average  10.45 11.06 11.48 12.43 12.98 13.31 13.29 12.53 12.19 12.11 11.88    12.16

Table 2. Match(T_I, T_r) for 10 trees with d = 4, (1 − α) = 0.5

Match:    14  13  15  14  14  13  12  13  15  13
Average:  13.6

Effect of d on reconstruction: Further, we observed that the accuracy of the tree reconstruction, as measured by Match(T_r, T_I), usually improves with an increase in the number of edit operations d during each "evolution" process. The results are shown in Table 1. This is not a surprising observation and is in fact reassuring about our proposed scheme. The conclusion of the exercise performed in this section is that it is worthwhile to explore the reconstruction of an underlying phylogeny tree using the set of maximal permutation patterns.

5 Experimental Results

Here we discuss our results of using Pi-logen on synthetic data and then on chloroplast DNA (cpDNA) of the Campanulaceae family.

5.1 Synthetic Data

We now describe our simulation experiments for inferring phylogeny trees. We fixed the topology of the reference tree to T_r of the last section. We generated 100 cases: in each we generated 16 genomes (corresponding to the leaves of T_r) using randomly chosen values of the number of edit operations, d = 1, 2, 3, ..., 10, for each evolution step. Given this set of 16 genomes, we then inferred the underlying phylogeny tree with Pi-logen using the estimated values of α = 0.4 and 0.5 from the previous section and K = 2. Figure 2 shows three of the trees inferred by the algorithm for d = 3. In one of them (leftmost) the reference tree is predicted exactly.

Fig. 2. Three trees inferred by Pi-logen for d = 3


Fig. 3. Number of ancestors correctly predicted plotted against the number of edit operations. (Plot for n = 120: y-axis, number of ancestors; x-axis, maximum number of evolutionary operations at each level; series: actual number of ancestors, correctly predicted, average of correctly predicted.)

Figure 3 shows Match(T_I, T_r) for the inferred trees for different amounts of evolutionary change d. The average Match(T_I, T_r) value for this setup was found to be 13.85. Recall that for this setup the best value of Match(., .) is 15 (and the worst is 0). The average number of maximal permutation patterns for this setup was 471.4, and the average tree computation required 16.13 seconds on a 2.3 GHz Pentium 4 processor.

5.2 Campanulaceae Data

We next use our algorithm on the cpDNA Campanulaceae data set that has also been used by [3], [5]. This data set has about 105 genes in 13 extant species. We found 167 maximal permutation patterns, and it took approximately 11 seconds to generate the phylogeny tree.

Fig. 4. The phylogeny tree inferred using (a) maximal permutation pattern (b) reversal and (c) breakpoint based methods on the cpDNA of Campanulaceae data set. (Leaf labels: Tra, Cam, Sym, Ade, Wah, Mer, Asy, Leg, Tri, Cod, Cya, Pla, Tob.)


Figure 4 shows three inferred trees: (a) using maximal permutation patterns (using Pi-logen), (b) using a reversal-based algorithm [3], and (c) using a breakpoint-based algorithm [5]. The sub-tree (((Tra, Sym), (Cam, Ade)), (Wah, Mer)) in (a) is identical to the one in (c). The sub-tree (((Cod, Cya), Pla), Tob) in (a) is identical to the one in (b). The sub-tree ((Leg, Tri), Asy) in (a) is different from the one in (b) and (c). The aligned genomic sequences are shown below:

Leg : 76-56  s1  90-84  s2  91-96  5-8  55-53
Tri : 76-56  s1  89-84  s2  90-96  X    55-53
Asy : 76-57  s1  89-84  s2  90-96  X    X

The numbers refer to the gene encodings and s1 and s2 correspond to common segments (of genes) in the three. One can see that from this alignment, the correct choice of a subtree on these three genomes is not apparent.

6 Conclusion

We present Pi-logen, a content-similarity based method for inferring phylogeny in genome arrangement data. This similarity is based on a well-studied regularity measure, the permutation pattern, that co-occurs in multiple genomes. We summarize the results of an extensive feasibility study of this scheme through content verification and ancestor prediction. We also successfully test the scheme on synthetic data and on cpDNA data of Campanulaceae.

References

1. Kececioglu, J. and Sankoff, D. (1994) Efficient bounds for oriented chromosome inversion distance. 5th Annual Symposium on Combinatorial Pattern Matching CPM, 807 307-325
2. Blanchette, M., Bourque, G. and Sankoff, D. (1997) Breakpoint phylogenies. In Genome Informatics Workshop (GIW 1997), (eds. S. Miyano and T. Takagi), pp. 25-34. University Academy Press, Tokyo.
3. Bourque, G. and Pevzner, P. A. (2002) Genome-Scale Evolution: Reconstructing Gene Orders in the Ancestral Species. Genome Research, 12(1): 26-36, Cold Spring Harbor Laboratory Press
4. Caprara, A. (1999) Formulations and complexity of multiple sorting by reversals. Proceedings of the Third Annual International Conference on Computational Molecular Biology RECOMB, (eds. S. Istrail et al.), pp. 84-93. ACM Press, Lyon, France.
5. Cosner, M. E. et al. (2000) An Empirical Comparison of Phylogenetic Methods on Chloroplast Gene Order Data in Campanulaceae. Comparative Genomics: Empirical and Analytical Approaches to Gene Order Dynamics, Map Alignment, and the Evolution of Gene Families, D. Sankoff and J. Nadeau, eds., 99-121, Kluwer Academic Publishers
6. Eres, R., Landau, G. M. and Parida, L. (2003) Combinatorial approach to automatic discovery of cluster patterns. Algorithms in Bioinformatics: Third International Workshop, WABI, Budapest, Hungary, pp. 139-150, Springer-Verlag
7. Kaufman, L. and Rousseeuw, P. (1990) Finding Groups in Data: An Introduction to Cluster Analysis
8. Landau, G. M., Parida, L. and Weimann, O. (2005) Using PQ Trees for Comparative Genomics. Combinatorial Pattern Matching, CPM, Jeju Island, South Korea, pp. 128-143, Springer-Verlag
9. Karim, M. E., Walenstein, A., Lakhotia, A. and Parida, L. (2005) Malware phylogeny generation using permutations of code. European Journal of Computer Virology 1 1-11
10. Nadeau, J. and Taylor, B. (1984) Lengths of chromosomal segments conserved since divergence of man and mouse. Proc. Natl. Acad. Sci. PNAS 81 814-818
11. Watterson, G., Ewens, W., Hall, T. and Morgan, A. (1982) The chromosome inversion problem. J. Theor. Biol. 99 1-7
12. Peer, I. and Shamir, R. (1998) The median problems for breakpoints are NP-complete. Electronic Colloquium on Computational Complexity Technical Report 98-071, http://www.eccc.uni-trier.de/eccc.
13. Hannenhalli, S. and Pevzner, P. (1995) Transforming cabbage into turnip (polynomial algorithm for sorting signed permutations by reversals). Proc. of the 27th Annual Symposium on Theory of Computing STOC, 178-189
14. Sankoff, D. (1992) Edit distance for genome comparison based on non-local operations. Proc. of the 3rd Annual Symposium on Combinatorial Pattern Matching CPM, 121-135
15. Berman, P. and Hannenhalli, S. (1996) Fast sorting by reversal. 7th Annual Symposium on Combinatorial Pattern Matching CPM, 1075 168-185
16. Kaplan, H., Shamir, R. and Tarjan, R. (1997) Faster and simpler algorithm for sorting signed permutations by reversals. Proc. of the 8th Annual ACM-SIAM Symposium on Discrete Algorithms SODA, 344-351

The Immune Epitope Database and Analysis Resource

Sette A1, Bui HH1, Sidney J1, Bourne P2, Buus S3, Fleri W1, Kubo R1,4, Lund O5, Nemazee D6, Ponomarenko JV2, Sathiamurthy M1, Stewart S1, Way S1, Wilson SS1, and Peters B1

1 La Jolla Institute of Allergy and Immunology, 9420 Athena Circle, La Jolla, CA 92037, USA
2 San Diego Supercomputer Center, University of California, San Diego, 9500 Gilman Drive, La Jolla, CA 92093, USA
3 University of Copenhagen, Copenhagen, DK-2200, Denmark
4 Gemini Science, 9420 Athena Circle, La Jolla, CA 92037, USA
5 Center for Biological Sequence Analysis, BioCentrum-DTU, Building 208, Technical University of Denmark, DK-2800 Kgs. Lyngby, Denmark
6 The Scripps Research Institute, 10555 North Torrey Pines Road, La Jolla, CA 92037, USA
7 Science Applications International Corporation, San Diego, CA 92121

Abstract. Epitopes are defined as the molecular structures interacting with specific receptors of the immune system such as antibodies, MHC, and T cell receptor molecules. The Immune Epitope Database and Analysis Resource (IEDB, http://www.immuneepitope.org) is a database specifically devoted to immune epitope data. The database is populated with intrinsic and context-dependent epitope data curated from the scientific literature by immunologists, biochemists, and microbiologists. An analysis resource is linked to the database which hosts various bioinformatics tools to analyze epitope data as well as to predict de novo epitopes. The availability of the IEDB will facilitate the exploration of immunity to infectious diseases, allergies, autoimmune diseases, and cancer. The utility of the IEDB was recently demonstrated through a comprehensive analysis of all current information regarding antibody and T cell epitopes derived from influenza A and determining possible cross-reactivity among H5N1 avian flu and human flu viruses.

1 Introduction

Epitopes are defined as the molecular structures interacting with specific receptors of the immune system such as antibodies, MHC, and T cell receptor molecules. Knowledge of the epitopes involved in the immune response is critical to detect, monitor, and design therapies to fight infectious diseases as well as allergies, autoimmunity and cancer. A vast amount of epitope-related information is available, ranging from epitope binding affinities for their receptors, to cellular and humoral responses, to data analyzing correlates of protection or immune pathology. We have developed a central resource that captures this information, allowing users to connect realms of knowledge currently separated and difficult to access. This new initiative, "The Immune Epitope Database and Analysis Resource", became available to the public in a beta version on 15 February 2006 (http://www.immuneepitope.org) [1, 2].

J.C. Rajapakse, L. Wong, and R. Acharya (Eds.): PRIB 2006, LNBI 4146, pp. 126-132, 2006. © Springer-Verlag Berlin Heidelberg 2006


The priorities for inclusion in the database are epitopes from category A-C pathogens such as influenza, SARS and poxviruses and emerging/re-emerging infectious diseases. However, we anticipate that all immune epitope data will eventually be curated. B and T cell epitopes recognized in humans, non-human primates and laboratory animals are all considered within the scope of the project. Accordingly, we estimate that about 100,000 different literature records are relevant. In addition, we expect to host a large volume of direct data submissions from various NIH-sponsored research contracts. Curation and query strategies have been developed to enable effective handling of this large amount of highly context-dependent information. An analysis resource is linked to the database that hosts various bioinformatic tools to analyze epitope data (including, for example, population coverage and sequence conservation), as well as tools to predict epitope cellular processing, binding to MHC, and recognition by T cell receptors and antibody molecules. In this context we observed that a large number of predictive tools related to epitopes exist in the literature, and new ones are continuously being developed. Evaluating the performance of existing and newly developed tools will be an ongoing effort for the IEDB team.

2 Database Structure and Design

The IEDB has been developed as a web-accessible database using an industry standard software design. The IEDB application is a Model View Controller (MVC) (http://java.sun.com/blueprints/guidelines/designing_enterprise_applications_2e/webtier/web-tier5.html) style Enterprise Java (J2EE) application with a relational database management system (Oracle 10g) data repository. The system architecture is divided into two websites, three application tiers, and two physical servers (see Figure 1). The application was constructed using existing Java frameworks and commercial products to create the infrastructure, allowing the development team to concentrate on the novel functionality that was required. The enterprise architecture is flexible, extensible, scalable, and proven.

Each epitope is associated with intrinsic and extrinsic features. Intrinsic features are those associated with its sequence and structure, while extrinsic features are context-dependent attributes dependent upon the specific experimental or natural environment context. Contextual information includes the species, health, and genetic makeup of the host, the immunization route and dose, and the presence of adjuvants. In order to describe an immune response associated with a specific epitope, both the intrinsic and extrinsic (context-dependent) features need to be taken into account. This immunological perspective has been a guiding principle in organizing the data in the IEDB [1-3].

The hierarchical nature of the data has been captured in an epitope-centric, ontology-like data structure [4], developed in a top-down process where each class in a domain and its properties were defined before building the hierarchy. The primary classes of the IEDB consist of Reference, Epitope Structure, Epitope Source, MHC Binding, MHC Ligand Elution, T Cell Response, and B Cell Response (see Figure 2). The Epitope class encapsulates all the individual concepts identified. In turn, the


A. Sette et al.

Fig. 1. Logical to Physical Tier Map

Fig. 2. Detailed classification of Epitope class showing its properties


individual concepts are related to other classes. The primary relationships have a subclass relationship or use a property (denoted in the figure by the arcs labeled “Has a”) that has a value restriction.

3 Populating the Database

To identify and extract relevant data from the scientific literature in an efficient and accurate manner, novel formalized curation strategies were developed, enabling the processing of a large volume of context-dependent data. This process is multi-step, involving an automated PubMed query, a manual abstract scan to select potentially relevant references, followed by methodical analysis of the selected references, and finally the manual curation of papers deemed relevant to the scope of the database by a team of dedicated curators with expertise in the areas of biochemistry, microbiology and immunology. Once the manual curation of a reference is complete, the curation is reviewed by an independent group of immunologists and structural biologists, thus integrating experts and data curators to optimize quality, consistency and uniformity.

To facilitate accurate translation of the information contained in the literature into the structured format of the database, we developed a Curation Manual and Data Ontology. These documents are designed to provide a consistent set of rules, definitions, and guidelines regarding the strategies and procedures for capturing, annotating and introducing data from the literature into the IEDB. Additionally, feedback from external experts in the fields of immunology and infectious diseases has been sought on an ongoing basis in order to improve both the database structure and curation practices. In this way, complex experimental data are captured in a consistent and accurate manner.

Management of the curation of a large number of references by a team of curators and reviewers required the development of a formal tracking system. All transactions and comments pertaining to each reference are tracked to provide details of the progress of each curated paper from selection of the manuscript to final incorporation of the data into the IEDB. As of May 2006, over 1900 references have been manually curated.

4 Analysis Resource

An analysis resource is linked to the database allowing users to analyze epitope data as well as to predict de novo epitopes. For example, one tool predicts the population coverage for a user-prescribed set of T-cell epitopes [5], i.e. the fraction of an ethnic population likely to respond to one or more epitopes in the set. This is done by relating the known MHC restrictions of the epitopes to frequencies of the MHC alleles in different populations, and calculating the total population covered assuming linkage equilibrium between MHC loci. An illustration of the utility of this tool is that it allows a user to detect if a set of epitopes, which may be intended for use as a diagnostic tool or vaccine, has ethnically skewed or balanced population coverage. Another tool calculates the protein sequence conservation of epitopes. For a given starting sequence and threshold of sequence identity, the tool calculates the fraction of proteins containing the epitope. Focusing on epitopes that are conserved at a high


level of sequence identity is especially important for RNA viruses, which show a large degree of sequence variability between different isolates or strains. To predict the presence of antibody epitopes in protein sequences, a number of previously existing amino acid scale-based tools have been implemented from the literature. Although these particular tools have been shown to underperform [6], they represent the current state of the art in antibody epitope prediction and do provide a benchmark for future tool development. Our intent is to devote significant efforts towards the development of improved tools for the prediction of antibody epitopes. The most extensively tested predictions at present are those describing peptide binding to MHC class I molecules. The ability of a peptide to bind an MHC molecule is a necessary requirement for it to be recognized by T-cells. As MHC molecules are very specific in their binding preference, these predictions provide a powerful means to scan entire pathogens for T-cell epitope candidates. Three separate prediction methods were implemented, two of them based on scoring matrices (ARB [7] and SMM [8]) and one based on an artificial neural network [9, 10]. The three methods were compared using five-fold cross validation on a large dataset comprising nearly 50,000 data points, in which the neural network based predictions outperformed the other two [11]. The complete benchmark dataset used in this evaluation is available at http://mhcbindingpredictions.immuneepitope.org/, and we encourage developers to use these data in training and testing their own tools. In addition to the tools described above, tools for predicting MHC class II epitopes [7], proteasomal cleavage [12] and TAP transport [13] have also been implemented and will be evaluated in a similar manner to the MHC class I binding predictions.
Also, for epitopes with a 3D structure available in the PDB, an epitope viewer has been developed displaying the epitope structure and its immune receptor interactions.
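The population coverage calculation described at the start of this section can be sketched as follows. This is a minimal illustration rather than the tool's actual implementation: it assumes Hardy-Weinberg proportions within each locus (an assumption not stated in the text), combines loci under linkage equilibrium as described above, and uses made-up allele frequencies. The function names are hypothetical.

```python
from math import prod


def locus_coverage(allele_freqs, restricting_alleles):
    """Fraction of individuals carrying at least one restricting allele at one locus.

    Assumes Hardy-Weinberg proportions: each individual carries two alleles
    drawn independently from the population allele frequencies.
    """
    covered = sum(f for allele, f in allele_freqs.items() if allele in restricting_alleles)
    return 1.0 - (1.0 - covered) ** 2


def population_coverage(freqs_by_locus, restricting_alleles):
    """Combine loci assuming linkage equilibrium (independence between loci)."""
    uncovered = prod(
        1.0 - locus_coverage(freqs, restricting_alleles)
        for freqs in freqs_by_locus.values()
    )
    return 1.0 - uncovered


# Hypothetical allele frequencies for one population (illustrative values only).
example_freqs = {
    "HLA-A": {"A*0201": 0.25, "A*0101": 0.15, "A*0301": 0.10},
    "HLA-B": {"B*0702": 0.10, "B*0801": 0.10},
}
# Alleles restricting at least one epitope in the hypothetical epitope set.
example_restrictions = {"A*0201", "B*0702"}
coverage = population_coverage(example_freqs, example_restrictions)
```

Running the same calculation with allele frequencies from several populations would show whether a given epitope set has skewed or balanced coverage.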

5 Curating and Analyzing Influenza A Epitope Data

As pointed out in a recent Nature editorial, the fight against flu is undermined by “the lack of an accessible store of information” [14]. Besides outbreaks and sequencing data, information is also lacking regarding influenza epitopes. This knowledge is crucial to predict potential cross-reactive immunity and coverage of new strains by vaccines and diagnostic candidates. To demonstrate the features of IEDB and in response to the global spread of highly virulent H5N1 influenza viruses, we have performed an analysis of influenza A epitope information to: 1) compile all current information regarding antibody and T cell epitopes derived from influenza A and 2) determine possible cross-reactivity between H5N1 avian flu and human flu viruses. To compile all information available in the literature relating to influenza epitopes, we inspected over 2000 references, and more than 400 were added to the IEDB after detailed curation. An assessment of these curated records revealed that approximately 600 different epitopes, derived from 58 strains, recognized in 8 different hosts and derived from all flu proteins, have been identified and reported in the literature, including several conserved epitopes and a small number of protective ones. The latter are of particular interest as they may confer cross-reactive protection against influenza strains of the avian H5N1 subtype. Significantly, however, this analysis made apparent that: 1) few protective antibody and T cell epitopes are reported in the literature; 2) there is a paucity of antibody epitopes in comparison to T cell epitopes; 3) the number of animal hosts from which the epitopes were defined is limited; 4) the number of epitopes reported for avian influenza strains/subtypes is limited; and 5) the number of epitopes reported from proteins other than hemagglutinin (HA) and nucleoprotein (NP) is limited. In summary, this analysis provides a unique resource to evaluate existing data and to aid efforts in guarding against seasonal and pandemic flu outbreaks.

6 Conclusion

The IEDB is an initiative focused on creating large volumes of complex context-dependent immunological data paired with relevant analytical tools. The project should facilitate basic research, as well as the development of new vaccines and diagnostics. The experience gained in the process of developing and operating the IEDB will be of value in the development and integration of other biological databases capturing clinical, immunological, genomic and cellular biology knowledge.

Acknowledgments. This work was supported by the National Institutes of Health Contract HHSN26620040006C.

References

1. Peters B, Sidney J, Bourne P, Bui HH, Buus S, Doh G, Fleri W, Kronenberg M, Kubo R, Lund O et al. The design and implementation of the immune epitope database and analysis resource. Immunogenetics 2005.
2. Peters B, Sidney J, Bourne P, Bui HH, Buus S, Doh G, Fleri W, Kronenberg M, Kubo R, Lund O et al. The immune epitope database and analysis resource: from vision to blueprint. PLoS Biol 2005, 3(3):e91.
3. Sette A, Fleri W, Peters B, Sathiamurthy M, Bui HH, Wilson S: A roadmap for the immunomics of category A-C pathogens. Immunity 2005, 22(2):155-161.
4. Sathiamurthy M, Peters B, Bui HH, Sidney J, Mokili J, Wilson SS, Fleri W, McGuinness DL, Bourne PE, Sette A. An ontology for immune epitopes: application to the design of a broad scope database of immune reactivities. Immunome Res. 2005 Sep 20;1(1):2.
5. Bui HH, Sidney J, Dinh K, Southwood S, Newman MJ, et al. (2006) Predicting population coverage of T-cell epitope-based diagnostics and vaccines. BMC Bioinformatics 7: 153.
6. Blythe MJ, Flower DR (2005) Benchmarking B cell epitope prediction: underperformance of existing methods. Protein Sci 14: 246-248.
7. Bui HH, Sidney J, Peters B, Sathiamurthy M, Sinichi A, et al. (2005) Automated generation and evaluation of specific MHC binding predictive tools: ARB matrix applications. Immunogenetics 57: 304-314.
8. Peters B, Sette A (2005) Generating quantitative models describing the sequence specificity of biological processes with the stabilized matrix method. BMC Bioinformatics 6: 132.
9. Buus S, Lauemoller SL, Worning P, Kesmir C, Frimurer T, et al. (2003) Sensitive quantitative predictions of peptide-MHC binding by a 'Query by Committee' artificial neural network approach. Tissue Antigens 62: 378-384.
10. Nielsen M, Lundegaard C, Worning P, Lauemoller SL, Lamberth K, et al. (2003) Reliable prediction of T-cell epitopes using neural networks with novel sequence representations. Protein Sci 12: 1007-1017.
11. Peters B, Bui HH, Frankild S, Nielsen M, Lundegaard C, Kostem E, Basch D, Lamberth K, Harndahl M, Fleri W, Wilson S, Sidney J, Lund O, Buus S and Sette A. (2006) A Community Resource Benchmarking Predictions of Peptide Binding to MHC-I Molecules. PLoS Comp Bio, in press.
12. Tenzer S, Peters B, Bulik S, Schoor O, Lemmel C, et al. (2005) Modeling the MHC class I pathway by combining predictions of proteasomal cleavage, TAP transport and MHC class I binding. Cell Mol Life Sci 62: 1025-1037.
13. Peters B, Bulik S, Tampe R, Van Endert PM, Holzhutter HG (2003) Identifying MHC class I epitopes by predicting the TAP transport efficiency of epitope precursors. J Immunol 171: 1741-1749.
14. Dreams of flu data. Nature 440, 255-6 (2006)

Intelligent Extraction Versus Advanced Query: Recognize Transcription Factors from Databases

Zhuo Zhang1, Merlin Veronika1, See-Kiong Ng1, and Vladimir B Bajic2

1 Institute for Infocomm Research, 21 Heng Mui Keng Terrace, Singapore 119613
2 South African National Bioinformatics Institute, Bellville 7535, South Africa
{zzhang, skng}@i2r.a-star.edu.sg, [email protected]

Abstract. Many entries in major biological databases have incomplete functional annotation and thus, frequently, it is difficult to identify entries for a specific functional category. We combined information of protein functional domains and gene ontology descriptions for highly accurate identification of transcription factor (TF) entries in Swiss-Prot and Entrez Gene databases. Our method utilizes support vector machines and it efficiently separates TF entries from non-TF entries. The 10-fold cross validation of predictions produced on average a positive predictive value of 97.5% and sensitivity of 93.4%. Using this method we have scanned the whole Swiss-Prot and Entrez Gene databases and extracted 13826 unique TF entries. Based on a separate manual test of 500 randomly chosen extracted TF entries, we found that the non-TF (erroneous) entries were present in 2% of the cases.

1 Introduction

Recent advances in genome research have yielded hundreds of thousands of protein and gene sequences, accumulated in genomic databases such as Swiss-Prot [1] and Entrez Gene [2], which, being carefully annotated, provide a valuable knowledge base for further research. Meanwhile, effort has been put into protein and gene classification. Researchers have attempted to categorize proteins and genes into functionally or structurally associated groups. To name a few, Pfam [3] clusters proteins into families based on functional domains. The Gene Ontology [4] project provides three sets of controlled vocabulary (ontologies) to describe genes and gene products in any organism: molecular function, biological process and cellular component. These classification mechanisms provide convenient ways for biologists to look for particular groups of proteins or genes, e.g., proteins that contain the Homeo-Box domain (PF00046), or genes expressed in the nucleus (GO:0005634). However, when the search criteria are less distinct, a query may not return satisfactory results. For example, when looking for "transcription factor", the search engine of the queried database usually applies pattern matching to text fields; as a result, real transcription factors without explicit notes will be overlooked. Moreover, when a non-TF entry is annotated with a phrase such as "regulated by transcription factor", it will be retrieved by the query.

J.C. Rajapakse, L. Wong, and R. Acharya (Eds.): PRIB 2006, LNBI 4146, pp. 133–139, 2006. © Springer-Verlag Berlin Heidelberg 2006


In this study, we use transcription factors as an example to illustrate our intelligent extraction method, which finds a particular group of proteins or genes in public databases based on prior knowledge and available annotation, in contrast to the common query method.

1.1 Background

Transcription factors (TFs) form a key regulatory family of proteins that control transcriptional activation of genes. Knowledge of TFs and their activities relative to different genes and gene products is necessary for deciphering transcriptional gene regulatory networks. The definition of a TF varies; in this study we adopted the broader definition, which allows a TF to be a protein that regulates transcription (after nuclear translocation) either a) by sequence-specific interaction with DNA or b) by stoichiometric interaction with a protein that can be assembled into a sequence-specific DNA-protein complex.

It is a challenge to identify which entries are TFs, even in curated databases. Under Gene Ontology categories, TF genes are categorized into various classes, e.g., transcription factor activity (GO:0003700), regulation of transcription, DNA-dependent (GO:0006355), regulation of transcription (GO:0045449) and many more. A quick inspection shows that transcription factor genes are scattered across tens of GO terms in the Molecular Function category alone; within the ontology tree, the relationship between these terms may be parent-child, sibling, or even unrelated. In the protein family classification of the Pfam [3] schema, transcription factors are present in hundreds of families, such as Homeobox (PF00046), zinc finger (PF00096), etc. Furthermore, database annotation lacks a standard way to label transcription factors.

Researchers have attempted to identify putative TFs based on their Pfam domain information. For example, Zupicich et al. [5] adopted a method that predicts a group of putative TFs purely based on whether they contain DNA-binding domains. In a recent work, Stegmaier et al. [6] developed a library of specific hidden Markov models to represent TF DNA-binding domains based on the annotated TRANSFAC [7] entries.

In this study, we propose an intelligent extraction method to identify a large portion of transcription factors based on annotation data, incorporating Gene Ontology terms, Pfam domains and keywords. The method makes use of current knowledge on protein classification to build a classifier that recognizes transcription factors in genomic databases.

2 Method

2.1 Dataset Preparation

The study of model genomes such as Saccharomyces cerevisiae provides genome-wide identification of protein functional categories, which can serve as a blueprint for predicting protein families in other genomes. In this study, we used protein classifications from yeast [9], human [8] and TRANSFAC [7] to compose our training data. Protein families such as GTP-binding proteins, molecular chaperones, protein kinases, etc. were used to compose the non-TF dataset. The detailed classification of proteins from yeast and human is listed in the supplementary document (http://research.i2r.a-star.edu.sg/svm tf). Table 1 shows the datasets we compiled to train the classifier. Each TF entry is assigned to the positive class and each non-TF entry to the negative class. For each entry, a list of GO terms, Pfam domains and keywords was extracted from the databases and used to form the feature vector for that entry. GO terms were extracted from the GOA database [10], and Pfam domains were acquired by combining the cross-references from Swiss-Prot and Entrez Gene to the Pfam database. Keywords were taken from the Swiss-Prot annotation.

Table 1. Datasets used to build TF classifier

Datasets for SVMs training          TF set   Non-TF set
Human Protein Reference Database      1614         7144
MIPS Yeast Genome Database             180          764
TRANSFAC                              3489            0
Total                                 5283         7908

2.2 Build TF Classifier Using SVMs

We used the SVMlight software [11] to implement our method. SVMlight is an implementation of Vapnik's Support Vector Machine [12] for pattern recognition problems. Our method forms the feature space from three groups of features: Fdomain, the Pfam domain features, which provide knowledge about the functional units (conserved motifs) of the protein; FGO, the Gene Ontology features, which represent the ontology assignment of the encoded gene; and Fkeyword, the keyword features, which capture current knowledge about the gene. The feature values are defined as follows:

\[ F_{domain} = \begin{cases} N_{domain} & \text{if the domain occurs in the entry} \\ 0 & \text{otherwise} \end{cases} \tag{1} \]

\[ F_{GO} = \begin{cases} D_{max} & \text{if the GO term is assigned to the entry} \\ 0 & \text{otherwise} \end{cases} \tag{2} \]

\[ F_{keyword} = \begin{cases} 1 & \text{if the keyword is present in the entry} \\ 0 & \text{otherwise} \end{cases} \tag{3} \]

For Pfam domain features, the feature value N_domain is the number of occurrences of the domain (how many times the domain motif is present in the sequence). For GO terms, the feature value is the depth of the GO node, that is, how far the node lies from the ontology root, since depth reflects the specificity of the ontology annotation. For a GO term with multiple parents, different paths may yield different depths; we choose the greatest one, D_max.
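The feature definitions above can be illustrated with a small sketch. It is not the authors' code: the toy ontology, the term identifiers and the function names are hypothetical, and features that are absent from an entry are simply left out of the sparse dictionary, which corresponds to a value of 0.

```python
def max_depth(term, parents, root):
    """Longest path length (in edges) from `term` up to `root` in a GO-like DAG."""
    if term == root:
        return 0
    depths = [max_depth(p, parents, root) for p in parents.get(term, [])]
    if not depths:
        raise ValueError(f"{term} is not connected to {root}")
    # A term with several parents takes the deepest of its paths.
    return 1 + max(depths)


def feature_vector(domain_counts, go_terms, keywords, parents, root):
    """Sparse feature vector: domain occurrence counts, GO depths, keyword flags."""
    features = {}
    for domain, count in domain_counts.items():
        features["domain:" + domain] = count
    for term in go_terms:
        features["go:" + term] = max_depth(term, parents, root)
    for keyword in keywords:
        features["kw:" + keyword] = 1
    return features


# Toy ontology (hypothetical identifiers): term C has two parents, R and B.
toy_parents = {"A": ["R"], "B": ["A"], "C": ["R", "B"]}
vec = feature_vector({"PF00046": 2}, ["C"], ["Transcription"], toy_parents, "R")
```

Here the term C is reachable from the root either directly (depth 1) or through A and B (depth 3), and the larger value is kept.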


We used a radial-basis function kernel for the SVM, with inductive training. A 10-fold cross-validation experiment yielded an average sensitivity of 93.4% and an average positive predictive value of 97.5% in the recognition of TFs. We then retrained our SVM on the whole set.
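For readers unfamiliar with the radial-basis function kernel, a minimal sketch of its standard form, K(x, y) = exp(-gamma * ||x - y||^2), is given below for sparse feature vectors stored as dictionaries. The value of gamma is a placeholder; the kernel parameters actually used are not reported here, and SVMlight computes the kernel internally.

```python
from math import exp


def rbf_kernel(x, y, gamma=1.0):
    """Standard RBF kernel on sparse vectors; missing features count as 0."""
    keys = set(x) | set(y)
    squared_distance = sum((x.get(k, 0.0) - y.get(k, 0.0)) ** 2 for k in keys)
    return exp(-gamma * squared_distance)


# Two hypothetical entries sharing one domain feature but differing in a GO feature.
entry_a = {"domain:PF00046": 1, "go:GO:0003700": 5}
entry_b = {"domain:PF00046": 1}
similarity = rbf_kernel(entry_a, entry_b, gamma=0.01)
```

Identical vectors give a kernel value of 1, and the value decays towards 0 as the feature vectors move apart.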

3 Result

The trained SVM TF classifier was applied to Swiss-Prot and Entrez Gene entries to extract TFs. The results are summarized in Table 2. The overlap between the TF entries extracted from Swiss-Prot and those extracted from Entrez Gene is 2097 entries, meaning that we identified a total of 13826 unique TF entries from these two databases.

3.1 Compare Our Method with Database Queries

To evaluate our method, we compared our extraction results with comprehensive queries. To do this, we made several queries to Swiss-Prot and Entrez Gene to identify groups of TFs. Table 2 lists the queries made in this study (queries made on 20/12/2005).

Table 2. Summary of various queries and extraction results

Query       Term                                        Result   Error rate
Sws text    transcription factor (full text search)       6448   21.2%
Sws GO      GO:0003700                                      824   -
Sws trans   crosslink to TRANSFAC                          2522   -
Sws all     merge Sws text, Sws GO and Sws trans           7344   -
EG text     transcription factor (full text search)       14093   28.6%
EG GO       GO:0003700                                     6895   -

Entries recognized by our method                         Result   Error rate
Sws svm     Swiss-Prot TFs predicted by our SVMs          10975   3.2%
EG svm      Entrez TFs predicted by our SVMs               4948   2.2%
All svm     combined total unique TF entries              13826   -

We calculated the overlaps between the sets. Figure 1 illustrates the overlaps for TFs predicted from Swiss-Prot. The graph shows that our extraction method retrieves the largest number of TFs.

3.2 Accuracy Evaluation

Fig. 1. Overlapped, missed and additional predicted entries based on different queries and our methods applied in Swiss-Prot

We manually inspected the accuracy of the queries and of our method. For this evaluation, we randomly picked 500 entries from each group for a manual check. The check was done by a biologist, based on the available description of the entry as well as PubMed literature. The error rate was calculated as:

\[ E = \frac{N_{\text{non-TF}}}{N_{\text{non-TF}} + N_{TF}} \tag{4} \]

Table 2 lists the error rates for the different queries and for our method. The error rate of text search is quite high (21.2% for Swiss-Prot and 28.6% for Entrez Gene), because the search engine simply performs pattern matching to detect the phrase "transcription factor" without semantic interpretation. Our method achieved error rates of 3.2% and 2.2% on Swiss-Prot and Entrez Gene, respectively, so its predictions are highly reliable. A visual representation of the number of TFs identified by the various methods can be found in the supplementary file (as above).
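The weakness of plain pattern matching, and the error rate of Eq. (4), can be illustrated with a toy example. The three entries below are invented for illustration and the function names are hypothetical; the actual search engines of Swiss-Prot and Entrez Gene are more elaborate than a substring test.

```python
def text_query(entries, phrase="transcription factor"):
    """Return IDs whose free-text annotation contains the phrase (case-insensitive)."""
    return [eid for eid, e in entries.items() if phrase in e["annotation"].lower()]


def error_rate(retrieved, entries):
    """E = N_nonTF / (N_nonTF + N_TF), computed over the retrieved entries."""
    n_non_tf = sum(1 for eid in retrieved if not entries[eid]["is_tf"])
    return n_non_tf / len(retrieved)


# Invented entries: P2 is a false hit, P3 is a true TF without an explicit note.
toy_entries = {
    "P1": {"annotation": "Homeobox transcription factor", "is_tf": True},
    "P2": {"annotation": "Kinase regulated by transcription factor X", "is_tf": False},
    "P3": {"annotation": "Sequence-specific DNA-binding protein", "is_tf": True},
}
hits = text_query(toy_entries)
toy_error = error_rate(hits, toy_entries)
```

The query returns P1 and P2, so half of the hits are erroneous, while the genuine TF P3 is missed altogether.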

4 Discussion

In this study we addressed the question on how to identify protein-specific information from genomic databases with incomplete functional annotation. We have done this in a context of recognition of TF entries in Swiss-Prot and Entrez Gene. In the process of annotation of proteins by GO categories through GOA project, protein domain information is utilized through InterProScan [13] engine in the annotation process. However, while that system collects domain information from various databases that provide them, it makes no sophisticated assessment of whether the listed domains should or should not confer the implied protein functionality. The annotation uses only information about the presence of particular domains in the protein. Moreover, GO classification may classify TF-producing genes into different categories, making it impossible to use any


particular GO term or a combination of GO terms to identify all TFs. For example, only around 50% of TF proteins in TRANSFAC were categorized under "transcription factor activity". Also, one should be aware that the functions assigned to proteins or genes are those that are best known at the moment of annotation. Thus, although Swiss-Prot, Entrez Gene and GO are manually curated, this does not imply that every aspect of protein and gene functionality is captured in the entry information. For example, Q01525, a protein identified as a TF by TRANSFAC, was annotated as protein domain specific binding (GO:0019904), from which one cannot directly infer that it is a TF. Also, from Table 2 and Figure 1 we can observe that specific queries based on GO terms miss many existing entries that are known TFs. On the other hand, the description of protein functionality through GO categories, although it does not necessarily suggest TF activity explicitly, may describe aspects that are related to TF activity. For example, 35% of the genes categorized under DNA binding (GO:0003677) are co-annotated with regulation of transcription, DNA-dependent (GO:0006355). Also, since TFs are a group of proteins that contain different combinations of domains and perform various functions (many of which characterize a protein as a TF and its activity as TF activity), simply constructing a query (even a well-constructed one) to search Swiss-Prot and Entrez Gene will not yield a satisfactory TF list. For all these reasons, we have considered a combination of the GO category descriptions associated with an entry in Swiss-Prot or Entrez Gene, together with Pfam domains, as a valuable basis that can reveal the essential knowledge on protein and gene product activity. Since the positive predictive value of our method is greater than 97%, we can expect approximately one wrongly identified non-TF entry among 40 entries identified as TFs by our system.
However, one should be careful with such generalizations and note that this estimate is optimistic, since not all non-TF families have been used in the training of our system. On the other hand, the sensitivity obtained is rather high, 93.4%. Again, one should be careful in interpreting this score, since it is given for entries that had either a GO category ascribed, or protein domains, or both. In general, the expected (absolute) sensitivity should be lower. In spite of these considerations, we did show that our method allows efficient and accurate extraction of TF entries from the two public resources considered. With a total of 13826 unique entries extracted, we were able to extract considerably more entries than are contained in TRANSFAC Professional v.8.4, which contains 5919 TF entries. Although it is not possible to directly compare the number of entries in TRANSFAC (since they were manually curated) with our extracted entries, we observe that our method potentially allows extraction of very high quality (putative) TF entries, with a 2% error rate. Also, one should note that the predictions made in this study are biased toward eukaryotic species, since the training data were gathered from eukaryotic organisms. However, the same method should work for prokaryotic transcription factors if the data are available, although the features should be regenerated and the system should be retrained. Finally, we can also apply our method to predicted genes and proteins, as long as they contain the relevant TF domains. This may help in the provisional association of TF function in some cases.

References

1. Bairoch,A., Apweiler,R., Wu,C.H., Barker,W.C., Boeckmann,B., Ferro,S., Gasteiger,E., Huang,H., Lopez,R., Magrane,M., Martin,M.J., Natale,D.A., O’Donovan,C., Redaschi,N., Yeh,L.S. The Universal Protein Resource (UniProt). Nucleic Acids Res. 33 (2005) D154-159.
2. Maglott,D., Ostell,J., Pruitt,K.D., Tatusova,T. Entrez Gene: gene-centered information at NCBI. Nucleic Acids Res. 33 (2005) D54-8.
3. Bateman,A., Birney,E., Cerruti,L., Durbin,R., Etwiller,L. et al. The Pfam protein families database. Nucleic Acids Res. (2002) 30 276-280.
4. Harris,M.A., Clark,J., et al. The Gene Ontology (GO) database and informatics resource. Nucleic Acids Res. (2004) 32 D258-61.
5. Zupicich,J., Brenner,S.E., Skarnes,W.C. Computational prediction of membrane-tethered transcription factors. Genome Biol. (2001) 2:0050.
6. Stegmaier,P., Kel,A.E., Wingender,E. Systematic DNA-Binding Domain Classification of Transcription Factors. Genome Inform Ser Workshop (2004) 15(2):276-86.
7. Matys,V., Wingender,E., et al. TRANSFAC: transcriptional regulation, from patterns to profiles. Nucleic Acids Res. (2003) 31 374-8.
8. Peri,S., Navarro,J.D., Pandey,A. Development of human protein reference database as an initial platform for approaching systems biology in humans. Genome Res. (2003) 13 2363-71.
9. Mewes,H.W., Amid,C., Arnold,R., Frishman,D., Güldener,U., Mannhaupt,G., Münsterkötter,M., Pagel,P., Strack,N., Stümpflen,V., Warfsmann,J. and Ruepp,A. MIPS: analysis and annotation of proteins from whole genomes. Nucleic Acids Res. (2004) 32 D41-4.
10. Camon,E., Magrane,M., Barrell,D., Lee,V., Dimmer,E., Maslen,J., Binns,D., Harte,N., Lopez,R., Apweiler,R. The Gene Ontology Annotation (GOA) Database: sharing knowledge in Uniprot with Gene Ontology. Nucleic Acids Res. (2004) 32 D262-D266.
11. Schölkopf,B., Burges,C., Smola,A. Advances in Kernel Methods - Support Vector Learning, MIT-Press. (1990)
12. Vapnik,V.N. The Nature of Statistical Learning Theory. Springer. (1995)
13. Zdobnov,E.M. and Apweiler,R. InterProScan - an integration platform for the signature-recognition methods in InterPro. Bioinformatics, (2001) 17 847-8.

Incremental Maintenance of Biological Databases Using Association Rule Mining

Kai-Tak Lam1,a, Judice L.Y. Koh2,3,b, Bharadwaj Veeravalli1, and Vladimir Brusic4

1 Department of Electrical & Computer Engineering, National University of Singapore, 4 Engineering Drive 3, Singapore 117576
2 Institute for Infocomm Research, 21 Heng Mui Keng Terrace, Singapore 119613
3 School of Computing, National University of Singapore, 3 Science Drive 2, Singapore 119260
4 Australian Centre for Plant Functional Genomics, School of Land and Food Sciences, and the Institute for Molecular Bioscience, University of Queensland, Brisbane QLD 4072, Australia
[email protected], [email protected]

Abstract. Biological research frequently requires specialist databases to support in-depth analysis of specific subjects. With the rapid growth of biological sequences in public domain data sources, it is difficult to keep these databases current with the sources. Simple queries formulated to retrieve relevant sequences typically return a large number of false matches and thus demand manual filtration. In this paper, we propose a novel methodology that can support automatic incremental updating of specialist databases. Complex queries for incremental updating of relevant sequences are learned using Association Rule Mining (ARM), resulting in a significant reduction in false positive matches. This is the first time ARM has been used to formulate descriptive queries for the purpose of incremental maintenance of specialised biological databases. We have implemented and tested our methodology on two real-world databases. Our experiments show that the methodology achieves an F-score of up to 80% in detecting new sequences for these two databases.

1 Introduction

In-depth analysis of a specific subject in molecular biology, specifically one associated with the structural and functional properties of a particular group of sequences, typically requires access to an extensive knowledge base, which may take the form of a specialist database. By integrating subject-specific molecular information from public data sources such as GenBank and Swiss-Prot with data analysis tools, a specialist database facilitates the extraction of new knowledge on the topic under study for its users. Some examples of specialist databases include svNTX – a database of functionally classified snake neurotoxins [1], APD – an antimicrobial peptides database with their functional classification [2], Aminoacyl-tRNA synthetases database (AARS) – a database of AARS enzymes that carry out specific esterification of tRNAs [3], svPLA2 – a database of snake PLA2 venoms [4], and a food allergen sequence database for assessing potential allergenicity in transgenic food [5].

J.C. Rajapakse, L. Wong, and R. Acharya (Eds.): PRIB 2006, LNBI 4146, pp. 140–150, 2006. © Springer-Verlag Berlin Heidelberg 2006


With biotechnological advancement and high-throughput sequencing, new sequences are rapidly accumulating in the public data sources. Specialist databases created by researchers, on the other hand, easily become outdated given the exponential growth of new data in the data sources. Frequent updating of the specialist databases using simple queries (keyword and sequence searches) is handicapped by the high number of chance matches and the need to filter them manually. In this paper, we apply text and data mining techniques together with motif identification techniques to formulate complex queries that, in turn, can be used to search the public data sources for the purpose of updating these specialist databases. The use of complex queries that are “machine-learned” from a given specialist database, as opposed to user-defined simple queries, reduces the number of chance matches in the database updating process. With accurate retrieval of new records of high relevance, the method is a crucial step towards enabling automatic incremental updating of any specialist database. Particularly in a biological data warehousing system such as BioWare, which comprises a number of specialist databases organised around different topics [6], a general method for incremental updating of any specialist database brings great benefits to its users. We first present a brief review of Association Rule Mining (ARM), which is predominantly used for mining frequent patterns in a database. In section 2, we present our proposed automated query formulation method. In section 3, we evaluate the performance and present the results for two specialised biological databases. We explore other applications of this method in the concluding section 4.
1.1 Association Rule Mining

Since its introduction in 1993, ARM [7] has been widely used for market basket analysis in the finance sector, and in other applications such as the identification of gene expression in the bioinformatics domain [8]. In ARM, the support of each itemset, i.e. the percentage of transactions in the database in which it occurs, is computed. Itemsets with supports of at least the user-specified minimum support (minsup) are identified as frequent itemsets. Many association rule mining algorithms are variants of Apriori [9], which employs a bottom-up, breadth-first search that enumerates every frequent itemset. In this paper, we are interested in identifying a special type of frequent itemset, known as the maximal frequent itemset (MFS), which contains the maximum number of items that still achieve the minsup. These frequent itemsets have the property that adding any further item to the MFS causes the support of the set to fall below the minsup.
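The notions of support, frequent itemset and maximal frequent itemset can be made concrete with a small sketch. This is a naive level-wise search written for illustration only, not the Apriori program used in this work; the transactions are invented and the function names are hypothetical.

```python
from itertools import combinations


def frequent_itemsets(transactions, minsup):
    """Level-wise (Apriori-style) search: map each frequent itemset to its support."""
    n = len(transactions)
    items = sorted({item for t in transactions for item in t})
    frequent = {}
    current = [frozenset([i]) for i in items]
    while current:
        level = {}
        for candidate in current:
            support = sum(1 for t in transactions if candidate <= t) / n
            if support >= minsup:
                level[candidate] = support
        frequent.update(level)
        # Join step: merge frequent k-itemsets that differ in exactly one item.
        current = list({a | b for a, b in combinations(level, 2) if len(a | b) == len(a) + 1})
    return frequent


def maximal_frequent_itemsets(frequent):
    """Keep only the frequent itemsets that have no frequent proper superset."""
    return {s: sup for s, sup in frequent.items() if not any(s < other for other in frequent)}


# Invented transactions: each record is the set of features it contains.
toy_transactions = [
    {"phospholipase", "serpentes", "squamata"},
    {"phospholipase", "serpentes", "squamata"},
    {"phospholipase", "serpentes"},
    {"toxin", "serpentes"},
]
toy_frequent = frequent_itemsets(toy_transactions, minsup=0.5)
toy_mfs = maximal_frequent_itemsets(toy_frequent)
```

With a minsup of 0.5, the only maximal frequent itemset in this toy data is {phospholipase, serpentes, squamata}, which occurs in two of the four records; every other frequent itemset is one of its subsets.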

2 Proposed Methodology

A specialist sequence database can be characterized by the textual features as well as the sequence features of the sequence records it contains. Textual features refer to the key terms present in the textual attributes of the database records. For example, the term “phospholipase A2” occurs in a significant number of records in svPLA2, the snake PLA2 venom database. Sequence features, more commonly referred to as motifs, are sequence patterns that characterise certain biochemical functions. One example of a sequence feature is the so-called zinc finger motif, CXX(XX)CXXXXXXXXXXXXHXXXH, which is found in widely varying families of DNA-binding proteins. In this paper, we explore the extraction of both textual and sequence features from a given database to formulate complex queries which, in turn, are used for incremental retrieval of new sequences of high relevance to the database.

[Figure 1: flow chart with numbered steps 1 to 5. Boxes: Database; Entities identification; Motifs identification; Association rule mining to obtain MFS according to minsup; Complex textual query; Complex sequence query; GenPept; Swiss-Prot; Update.]

Fig. 1. Flow chart of the proposed methodology. The numbers in circles denote the order of the steps required for the formulation of complex queries and how these queries are used in updating the specialist database.

Figure 1 shows the primary steps in our proposed methodology. First, textual (entities) and sequence (motifs) features, and their supports in the sequence records, are extracted from the input database or dataset through the identification engines. Each feature corresponds to an item in the frequent itemset mining process, and each record is a transaction. Using the Apriori program [10] and given a user-defined minsup, the MFS of the database are generated. Each MFS is formulated, using Boolean operators, into a complex query, which is used to search for new sequences in the data sources. The method is evaluated based on the relevance of the new sequences retrieved.

2.1 Entities Identification

For entities identification, a biological named-entities recogniser (BNER) [11] is used to identify biologically significant words and phrases, along with the complementary elimination of stop-words. Selected textual fields of the database, including reference title, species, and taxonomy, are parsed by the entities identification engine.
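To show how a sequence feature such as the zinc finger motif quoted in the methodology overview can be matched against protein sequences, the sketch below converts the motif notation into a regular expression. It assumes that X stands for any residue and that a parenthesised run such as (XX) marks optional positions; this reading of the notation, the function name and the test sequences are assumptions made for illustration.

```python
import re

ZINC_FINGER_MOTIF = "CXX(XX)CXXXXXXXXXXXXHXXXH"


def motif_to_regex(motif):
    """Translate 'X' to any residue and '(XX...)' to an optional stretch of that length."""
    def optional(match):
        return ".{0,%d}" % len(match.group(1))

    pattern = re.sub(r"\((X+)\)", optional, motif)
    return re.compile(pattern.replace("X", "."))


zinc_finger = motif_to_regex(ZINC_FINGER_MOTIF)

# Length of the spacer between the second cysteine and the first histidine.
spacer = len(ZINC_FINGER_MOTIF.split("C")[2].split("H")[0])

# Synthetic sequences: two or four residues between the cysteines match, five do not.
seq_two = "C" + "AA" + "C" + "A" * spacer + "H" + "AAA" + "H"
seq_four = "C" + "AAAA" + "C" + "A" * spacer + "H" + "AAA" + "H"
seq_five = "C" + "AAAAA" + "C" + "A" * spacer + "H" + "AAA" + "H"
```

A compiled pattern of this kind can then be applied with `zinc_finger.search(sequence)` to each candidate sequence.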


We conducted a comparative study of two publicly available BNER programs: PowerBioNE (PBNE), published by Zhou et al. [12], and ABNER, developed by Settles [13]. The efficacy of a BNER program depends on its training corpus. PBNE uses the GENIA Corpus V3.0 [14], which contains 2000 MEDLINE abstracts with 360K words. ABNER, on the other hand, uses the NLPBA corpus [15], a variant of the GENIA corpus, and the BioCreative corpus [16]. In our experiments with the two specialist databases, ABNER performed consistently better than PBNE (results not shown, but available on request). The extracted entities are sorted according to their number of occurrences in the databases.

2.2 Motifs Identification

For motifs identification, we utilize the MEME system for the detection of motifs [17]. MEME is an unsupervised learning algorithm based on a combination of Hidden Markov models, expectation maximization (EM), an EM-based heuristic for choosing the starting point for EM, a maximum likelihood ratio-based heuristic for determining the best number of model free parameters, multistart for searching over possible motif widths, and greedy search for finding multiple motifs. Since motifs are highly specific biological patterns, their identification, and the generation of queries from these motifs, are typically unique to each database or dataset.

2.3 Query Formulation

Maximal frequent feature sets in the database are mined using Apriori with the following parameters: the minimum number of items per set is 2 and the maximum number of items per set is 25. The features within each MFS are combined using “AND”, and different MFS are consolidated using “OR”. Figure 2 shows an example of the complex queries extracted from a snake venom database.
phospholipase AND chordata AND colubroidea AND craniata AND euteleostomi AND lepidosauria AND metazoa AND scleroglossa AND serpentes AND squamata AND vertebrate (96.2%)
"phospholipase a" AND phospholipase AND vertebrata AND squamata AND serpentes AND scleroglossa AND metazoa AND lepidosauria AND euteleostomi AND craniata AND colubroidea AND chordate (96%)
CNPKLDTYSYSCxNG AND RPWWSYADYGCYCGWGGSGTPVDALDRCCFVHDCCYGKAEK (86%)
RFVCECDRAAAICFADNPYTYN AND RPWWSYADYGCYCGWGGSGTPVDALDRCCFVHDCCYGKAEK (85.3%)

Fig. 2. Example combinatory query formulated from svPLA2 database
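The mining and query-formulation steps of Section 2.3 can be illustrated with a short Python sketch. It is not the Apriori program [10] used in the experiments: it enumerates item combinations by brute force, which is practical only for toy inputs. The function names `maximal_frequent_sets` and `formulate_query` and the sample records are hypothetical, chosen for illustration only.

```python
from itertools import combinations

def maximal_frequent_sets(records, minsup, min_items=2, max_items=25):
    """Return {itemset: support} for the maximal frequent itemsets.

    Brute-force illustration (an assumption, not the Apriori program):
    every combination of min_items..max_items features is tested.
    """
    items = sorted(set().union(*records))
    frequent = {}
    for size in range(min_items, min(max_items, len(items)) + 1):
        found = False
        for combo in combinations(items, size):
            candidate = frozenset(combo)
            support = sum(1 for r in records if candidate <= r) / len(records)
            if support >= minsup:
                frequent[candidate] = support
                found = True
        if not found:
            break  # no frequent set of this size, so none of a larger size
    # A frequent set is maximal if no frequent proper superset exists.
    return {s: sup for s, sup in frequent.items()
            if not any(s < other for other in frequent)}

def formulate_query(mfs):
    """Combine features with AND inside each MFS and join the MFS with OR."""
    def quote(term):
        return '"' + term + '"' if " " in term else term
    clauses = sorted(" AND ".join(quote(t) for t in sorted(s)) for s in mfs)
    return " OR ".join("(" + clause + ")" for clause in clauses)

# Hypothetical records: each record is the set of features found in it.
records = [
    {"phospholipase", "serpentes", "squamata"},
    {"phospholipase", "serpentes", "squamata"},
    {"phospholipase", "serpentes", "metazoa"},
    {"phospholipase", "squamata", "metazoa"},
]
query = formulate_query(maximal_frequent_sets(records, minsup=0.75))
print(query)
```

With these four records and a minsup of 75%, only the pairs {phospholipase, serpentes} and {phospholipase, squamata} are frequent, and neither has a frequent superset, so both become AND clauses joined by OR.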


K.-T. Lam et al.

2.4 Updating of Database

As shown in Figure 1, the incremental maintenance of a specialist database is ideally an iterative process of formulating queries, searching for new sequences in the data sources, and adding them to the database. Users have the option of using queries from entities identification, motif identification, or a combination of both. In our experiment, we concluded that any of these three approaches reduces the number of false positive (irrelevant) records retrieved and thus the number of records that need to be filtered manually. One of the unique strengths of this methodology lies in the combined use of motifs and textual entities to characterize a sequence database or dataset in a simple and efficient manner.

2.5 Performance Metrics

The performance of our proposed methodology is quantified by the Precision, Recall, and F-score metrics:

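\[ \text{Precision} = \frac{TP}{TP + FP} \tag{1} \]

\[ \text{Recall} = \frac{TP}{TP + FN} \tag{2} \]

\[ \text{F-score} = \frac{2 \times \text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}} \tag{3} \]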

where TP, FP and FN are the numbers of true positives, false positives and false negatives, respectively. In our experiments, the TPs of a formulated query are the records retrieved from the data sources using the query which are also found in the original database. FNs are database records that are not retrieved, and FPs are non-database records retrieved using the query. Precision measures the fraction of records retrieved by the complex queries which are relevant to the input database. Recall measures the fraction of relevant records which are retrieved. The F-score combines them into a single value for the purpose of comparison.
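As a minimal sketch of how (1) to (3) apply to a retrieval run, the following Python function takes the set of retrieved record identifiers and the set of identifiers already in the specialist database. The function name `retrieval_metrics` and the sample identifiers are hypothetical, and returning 0 when a denominator is zero is an assumption of this sketch rather than something specified in the text.

```python
def retrieval_metrics(retrieved, database):
    """Return (precision, recall, f_score) following equations (1)-(3)."""
    retrieved, database = set(retrieved), set(database)
    tp = len(retrieved & database)   # retrieved and present in the database
    fp = len(retrieved - database)   # retrieved but not in the database
    fn = len(database - retrieved)   # in the database but not retrieved
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f_score = 2 * precision * recall / (precision + recall)
    return precision, recall, f_score

# Hypothetical identifiers: 3 true positives, 2 false positives, 1 false negative.
p, r, f = retrieval_metrics({"A", "B", "C", "X", "Y"}, {"A", "B", "C", "D"})
print(p, r, f)
```

In this toy case the precision is 3/5, the recall is 3/4, and the F-score is their harmonic mean, 2/3.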

3 Performance Evaluation and Discussions

The databases used for performance evaluation are Snake Venom PLA2 (svPLA2) [4] and Food Allergen [5]. The svPLA2 database contains 289 functionally annotated, non-redundant svPLA2 toxins used for studying the pharmacological effects of these toxins and for supporting detailed structure-function analysis. Sequences in the svPLA2 database were retrieved from GenPept and Swiss-Prot using the simple query “serpentes AND phospholipase OR pla2”, followed by manual filtering to remove chance matches and fragmented sequences.


The Food Allergen database contains 633 unique protein sequences that are used in the analysis of allergenicity in transgenic food. This database is used in monitoring possible allergic reactions to genetically modified food.

In our experiments, the original entries in the specialist database are used as the reference set of relevant records. The retrieved records are first filtered by the date on which the database was most recently updated. These filtered records are then compared with the entries in the specialist database. The matched records are the TPs, while the unmatched records are the FPs. Entries in the database that do not have a match are the FNs.

3.1 Snake Venom PLA2 Database

We compare the performance of the complex queries to the simple query used by the biologists when constructing the database. Generally, a higher F-score indicates that the query is more efficient in identifying sequences in the data sources relevant to snake PLA2 venoms. In addition, we are interested in finding out whether a combinatory approach using both textual and motif features results in a more accurate query than using only textual or only sequence-based queries.

Complex Textual Query. A total of 1268 key terms are identified using the ABNER program. As the optimal minsup is unknown, the precision and recall are computed at varying minsup values. As shown in Figure 3a, the complex textual query at a minsup of 96% has an F-score of 74%, a substantial improvement over the 50% of the original query. From Table 1, it can be seen that the number of records retrieved using the complex textual query is about 13% lower than that of the original simple query. This shows that it is more efficient to update this database by applying the proposed methodology than by using the original simple query.

Complex Sequence Query. A total of 50 motifs are identified using the MEME program. MFS at different minsup values are generated and submitted to BLAST for the retrieval of records from the protein database at NCBI, and the retrieval results are listed in Table 1. The relationship between F-score and minsup for these retrievals is plotted in Figure 3b. The F-scores for these retrievals are lower than that of the original query, with the highest being 39%. However, as the snake venom PLA2 database was not initially retrieved using motif queries, a comparison with the original query may not be fair. Also, the efficacy of the BLAST search depends on the choice of other parameters, such as the substitution matrix and the gap cost. As the lengths of the motifs identified are mostly less than 35, we use PAM30 as the substitution matrix and select a gap cost of 10 so as to discourage the introduction of gaps within the motifs. Other combinations of matrix and gap cost may further optimise the F-score; this requires further investigation.

Combinatory Query. We investigate the effect on the retrieval result of combining both entity and motif queries. The queries that give the best F-score are chosen and submitted to NCBI to retrieve the relevant records. The results are listed in Table 1. An F-score of 85% is achieved using a combination of entity and motif queries, an improvement of 35 percentage points over the original query. This indicates that retrieval gives a much better result when textual and sequence queries are used together than when either is used alone.


Table 1. Results of varying minsup using different queries for the svPLA2 database. The simple query is the same query that is used during the creation of the database [4]. The bolded values denote the best F-Score achieved in the respective queries. These are then used to form the combinatory query.

[Figure 3: two line plots of F-score (%) versus minsup (%). Panel (a): ABNER query and original query, minsup from 90% to 100%. Panel (b): motif query, minsup from 45% to 85%.]

Fig. 3a and 3b. F-score of textual and sequence queries at varying minsup for svPLA2

3.2 Food Allergen Database

Complex Textual Query. A similar experiment is carried out on the Food Allergen database. A total of 734 key terms are identified using the ABNER program. The results are listed in Table 2, and the corresponding Figure 4 shows the trend of F-score versus minsup. From Table 2, we can see that, in general, the recall of the retrieval is above 50%, except when minsup is 12%. This shows that more than half of the positive records are retrieved using these textual queries. However, we are only able to achieve a maximum F-score of 4.5%, at a minsup of 7%. The low F-score is mainly due to the diversity of textual information in the database. As no textual entity occurs in abundance (in more than 20% of the records) in the database, a relatively small minsup has to be used. The query using entities alone retrieves a large number of records, which leads to a higher number of chance retrievals and hence a low F-score.

This experiment on the Food Allergen database exhibits the shortcoming of queries based on entities alone. A specialist database, even though it is restricted to a certain domain, may contain very general textual information. If only textual information is used, one may be faced with a very large number of chance matches, which amounts to 18K records in the Food Allergen database that need to be reviewed manually. This may be improved by using queries based on motifs alone, as shown in our investigation below.

Complex Sequence Query. Although a query based on motifs alone obtains an F-score of 41%, the recall of the retrieval suffers: less than half of the positive records are retrieved. However, using motifs alone, we are able to minimise the number of chance matches, i.e., decrease the FPs in our retrieval. Since queries based on entities alone and on motifs alone both have shortcomings, we further investigate whether a combination of both yields a better result.

Combinatory Query. The textual query at a minsup of 7% and the sequence query at a minsup of 7% are used together for combinatory retrieval. An F-score of 80% is achieved, and both the precision and the recall exceed 50%. This demonstrates that the combination of entity-based and motif-based queries gives a much better result than using either one alone.

Table 2. Results of varying minsup using different queries for the Food Allergen database. A comparison with the simple query is not carried out as the simple query used for the creation of the original database is not available. The bolded values denote the best F-Score achieved in the respective queries. These are then used to form the combinatory query.


[Figure 4: two line plots of F-score (%) versus minsup (%). Panel (a): ABNER query, minsup from 5% to 10%. Panel (b): motif query, minsup from 0% to 15%.]

Fig. 4a and 4b. F-score of textual and sequence queries at varying minsup for the Food Allergen database

4 Conclusions

The task of database maintenance is a time-consuming process due to the ever-increasing size of public data sources and the large number of chance matches returned by simple queries. In this paper we have proposed a methodology which can be used to formulate complex queries from a database based on the textual and sequence features that characterise the database. We have shown that these queries are able to reduce the number of FP records in the retrieval and hence reduce the time and effort required to manually filter the retrieval results during database maintenance.

This is the first time ARM has been used in formulating complex queries for the purpose of incremental maintenance of specialised biological databases. Tested on two real-world databases, our methodology achieves an F-score of up to 80%. At the current stage, many of the parameters, such as minsup and the minimum and maximum number of items per MFS, are determined empirically. Further work can be done on finding these values by machine learning techniques. For example, the optimal minsup could be found automatically by internal cross-validation. This would relieve the user of the burden of trying a few arbitrary minsup values and observing from the results which one is closest to the optimal value.

This methodology can be integrated into a biological data warehousing system such as BioWare for the incremental maintenance of existing databases. Furthermore, information retrieval of PubMed records based on a sample of PubMed articles can be carried out more efficiently using the entity query formulation portion of our methodology. This is useful for initial research work where information retrieval from the public domain is essential.

References

1. Siew, J.P., Khan, A.M., Tan, P.T., Koh, J.L., Seah, S.H., Koo, C.Y., Chai, S.C., Armugam, A., Brusic, V., Jeyaseelan, K., “Systematic analysis of snake neurotoxins functional classification using a data warehousing approach”, Bioinformatics, 20(18), 2004, pp. 3466-3480.
2. Wang, Z. and Wang, G., “APD: the Antimicrobial Peptide Database”, Nucleic Acids Res. 32, 2004, pp. 590-592.
3. Szymanski, M. and Barciszewski, J., “Aminoacyl-tRNA synthetases database Y2K”, Nucleic Acids Res. 28, 2000, pp. 326-328.
4. Tan, P.T.J., Khan, A.M. and Brusic, V., “Bioinformatics for venom and toxin sciences”, Brief Bioinform. 1, 2003, pp. 53-62.
5. Gendel, S.M., “Sequence Databases for Assessing the Potential Allergenicity of Proteins Used in Transgenic Foods”, Advances in Food and Nutrition Research, v42 1998, pp. 63-92.
6. Koh, J.L.Y., Krishnan, S.P.T., Seah, S.H., Tan, P.T.J., Khan, A.M., Lee, M.L., Brusic, V., “BioWare: A framework for bioinformatics data retrieval, annotation and publishing”, SIGIR’04 workshop on Search and Discovery in Bioinformatics, July 29, 2004, Sheffield, UK.
7. Agrawal, R., Imielinski, T., Swami, A., “Mining association rules between sets of items in large databases”, Proceedings of the 1993 ACM SIGMOD international conference on Management of data, Washington, D.C., United States, 1993, pp. 207-216.
8. Creighton, C. and Hanash, S., “Mining gene expression databases for association rules”, Bioinformatics, 19(1), 2003, pp. 79-86.
9. Agrawal, R., Srikant, R., “Fast algorithms for mining association rules”, The International Conference on Very Large Databases, 1994, pp. 487-499.
10. Borgelt, C., Kruse, R., “Induction of Association Rules: Apriori Implementation”, 15th Conference on Computational Statistics, Physica Verlag, Heidelberg, Germany, 2002.
11. Ananiadou, S., Friedman, C., Tsujii, J., “Introduction: named entity recognition in biomedicine”, Journal of Biomedical Informatics, 37(2004), pp. 393-395.
12. Zhou, G.D., Zhang, J., Su, J., Shen, D., Tan, C.L., “Recognizing Names in Biomedical Texts: a Machine Learning Approach”, Bioinformatics, v20(7) 2004, pp. 1178-1190.
13. Settles, B., “ABNER: an open source tool for automatically tagging genes, proteins and other entity names in text”, Bioinformatics, v21(14) 2005, pp. 3191-3192.
14. Ohta, T., Tateisi, Y., Kim, J., Mima, H., Tsujii, J., “The GENIA corpus: an annotated research abstract corpus in molecular biology domain”, Proceedings of Human Language Technology (HLT’ 2002), San Diego, pp. 489-493.
15. Kim, J., Ohta, T., Tsuruoka, Y., Tateisi, Y., Collier, N., “Introduction to the bio-entity recognition task at JNLPBA”, Proceedings of the International Joint Workshop on Natural Language Processing in Biomedicine and its Applications (NLPBA), Geneva, Switzerland, 2004, pp. 70-75.
16. Yeh, A., Hirschman, L., Morgan, A., Colosimo, M., “BioCreAtIve Task 1A: gene mention finding evaluation”, BMC Bioinformatics, 6(Suppl 1):S2, 2005.
17. Bailey, T.L., Elkan, C., “The Value of Prior Knowledge in Discovering Motifs with MEME”, ISMB, v3 1995, pp. 21-29.

Blind Separation of Multichannel Biomedical Image Patterns by Non-negative Least-Correlated Component Analysis

Fa-Yu Wang (1,2), Yue Wang (2), Tsung-Han Chan (1), and Chong-Yung Chi (1)

(1) National Tsing Hua University, Hsinchu, Taiwan 30013 ROC
(2) Virginia Polytechnic Institute and State University, Arlington, VA 22203 USA
http://www.ee.nthu.edu.tw/cychi

Abstract. Cellular and molecular imaging promises powerful tools for the visualization and elucidation of important disease-causing biological processes. Recent research aims to simultaneously assess the spatial-spectral/temporal distributions of multiple biomarkers, where the signals often represent a composite of more than one distinct source independent of spatial resolution. We report here a blind source separation method for quantitative dissection of mixed yet correlated biomarker patterns. The computational solution is based on a latent variable model, whose parameters are estimated using the non-negative least-correlated component analysis (nLCA) proposed in this paper. We demonstrate the efficacy of the nLCA with real bio-imaging data. With its accurate and robust performance, the method has features that make it widely applicable.

J.C. Rajapakse, L. Wong, and R. Acharya (Eds.): PRIB 2006, LNBI 4146, pp. 151-162, 2006. © Springer-Verlag Berlin Heidelberg 2006

1 Introduction

Multichannel biomedical imaging promises simultaneous imaging of multiple biomarkers, where the pixel values often represent a composite of multiple sources independent of spatial resolution. For example, in vivo multispectral imaging exploits emissions from multiple fluorescent probes, aiming at discriminating often overlapped spatial-spectral distributions and reducing background autofluorescence [1, 2]. Dynamic contrast-enhanced magnetic resonance imaging (DCE-MRI) utilizes contrast agents of various molecular weights to investigate tumor microvascular status and then obtain information about the therapeutic effect of anti-angiogenic drugs. However, due to the heterogeneous nature of tumor microvessels associated with different perfusion rates, the signals measured by DCE-MRI are a mixture of the permeability images corresponding to fast perfusion and slow perfusion. Further examples include dynamic positron emission tomography and dynamic optical molecular imaging [1].

The major efforts for computational separation of composite biomarker distributions are supervised spectrum unmixing [2], a priori weighted subtraction [3], parametric compartment modeling, and independent component analysis (ICA) [4, 5]. The major limitations associated with the existing methods are the inability to acquire in vivo spectra of the probes (e.g., individual physiological conditions and microenvironment, pH, temperature, oxygen, blood flow, etc.) [2] and the unrealistic assumptions about the characteristics of the unknown sources and mixing processes (e.g., source independence, model identifiability, etc.) [5, 6]. Our goal, therefore, is to develop a novel blind source separation (BSS) method that is able to separate correlated or dependent sources under non-negativity constraints [7]. This new BSS method is called the non-negative least-correlated component analysis (nLCA), whose principle and applications are reported in detail here.

In the next section, we present the nLCA, including model assumptions, theory and computational methods for blind separation of non-negative sources from a given set of observations of a non-negative mixing system. In Section 3, we demonstrate the efficacy of the nLCA and its superior performance over some existing algorithms in two experiments (human face images and DCE-MRI analysis), followed by a discussion of future research.

[Figure 1: block diagram in which the sources s1 and s2 pass through the mixing matrix A to give the observations x1 and x2, which pass through the demixing matrix W to give the outputs y1 ≈ s1 and y2 ≈ s2.]

Fig. 1. Block diagram of 2 × 2 mixing and demixing systems

2 Non-negative Least-Correlated Component Analysis

As shown in Fig. 1, consider a $2 \times 2$ non-negative mixing system with the input signal vector $\mathbf{s}[n] = (s_1[n], s_2[n])^T$ (e.g., images of two different types of cells) and the output vector

\[ \mathbf{x}[n] = (x_1[n], x_2[n])^T = \mathbf{A}\mathbf{s}[n] \tag{1} \]

where the superscript '$T$' denotes the transpose of a matrix or vector and $\mathbf{A} = \{a_{ij}\}_{2 \times 2}$ is the unknown non-negative mixing matrix. The blind source separation problem is to find a demixing matrix $\mathbf{W}$ from the given measurements $\mathbf{x}[n]$, $n = 1, 2, \ldots, L$, such that

\[ \mathbf{y}[n] = (y_1[n], y_2[n])^T = \mathbf{W}\mathbf{x}[n] = \mathbf{W}\mathbf{A}\mathbf{s}[n] \approx \mathbf{P}\mathbf{s}[n] \tag{2} \]

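i.e., $\mathbf{W}\mathbf{A} \approx \mathbf{P}$ (a permutation matrix). Alternatively, let

\[ \mathbf{x}_i = (x_i[1], x_i[2], \ldots, x_i[L])^T, \quad i = 1, 2 \quad (i\text{th observation}) \tag{3} \]

\[ \mathbf{s}_i = (s_i[1], s_i[2], \ldots, s_i[L])^T, \quad i = 1, 2 \quad (i\text{th unknown source}) \tag{4} \]

\[ \mathbf{y}_i = (y_i[1], y_i[2], \ldots, y_i[L])^T, \quad i = 1, 2 \quad (i\text{th extracted source}) \tag{5} \]

Then the observations $\mathbf{x}_1$ and $\mathbf{x}_2$, and the extracted sources $\mathbf{y}_1$ and $\mathbf{y}_2$ can be expressed as

\[ \begin{bmatrix} \mathbf{x}_1^T \\ \mathbf{x}_2^T \end{bmatrix} = \mathbf{A} \begin{bmatrix} \mathbf{s}_1^T \\ \mathbf{s}_2^T \end{bmatrix}, \tag{6} \]

\[ \begin{bmatrix} \mathbf{y}_1^T \\ \mathbf{y}_2^T \end{bmatrix} = \mathbf{W} \begin{bmatrix} \mathbf{x}_1^T \\ \mathbf{x}_2^T \end{bmatrix}, \tag{7} \]

respectively. For ease of later use, the correlation coefficient and the angle between $\mathbf{s}_1$ and $\mathbf{s}_2$ are defined as

\[ \rho(\mathbf{s}_1, \mathbf{s}_2) = \frac{\mathbf{s}_1^T \mathbf{s}_2}{\|\mathbf{s}_1\| \cdot \|\mathbf{s}_2\|}, \tag{8} \]

\[ \theta(\mathbf{s}_1, \mathbf{s}_2) = \cos^{-1}(\rho(\mathbf{s}_1, \mathbf{s}_2)), \tag{9} \]

respectively, where $\|\mathbf{s}_i\| = (\mathbf{s}_i^T \mathbf{s}_i)^{1/2}$ is the norm of $\mathbf{s}_i$. Next, let us present the assumptions and the associated theory and methods of the nLCA, respectively.

2.1 Model Assumptions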

Let us make some assumptions about the sources $\mathbf{s}_1$ and $\mathbf{s}_2$, and the mixing matrix $\mathbf{A}$ as follows:

(A1) $\mathbf{s}_1 \succeq \mathbf{0}$ and $\mathbf{s}_2 \succeq \mathbf{0}$ (i.e., $s_1[n] \ge 0$, $s_2[n] \ge 0$ for all $n$), and the two distinct sources $\mathbf{s}_1$ and $\mathbf{s}_2$ are linearly independent (i.e., $\mathbf{s}_2 \ne \alpha \mathbf{s}_1$ where $\alpha \ne 0$).
(A2) $\mathbf{A} \succeq \mathbf{0}$ (i.e., all the entries of $\mathbf{A}$ are non-negative).
(A3) $\mathbf{A}$ is full rank (i.e., nonsingular).
(A4) $\mathbf{A} \cdot \mathbf{1} = \mathbf{1}$, where $\mathbf{1} = (1, 1)^T$ (i.e., the sum of the entries of each row of $\mathbf{A}$ is equal to unity).

Assumptions (A1) and (A2) are valid in biomedical imaging applications [6], where all the sources and all the entries of the mixing matrix are non-negative, and meanwhile $\mathbf{s}_1$ and $\mathbf{s}_2$ are allowed to be correlated (i.e., $\tilde{\mathbf{s}}_1^T \tilde{\mathbf{s}}_2 / L \ne 0$, where $\tilde{\mathbf{s}}_i = (s_i[1] - \mu_i, s_i[2] - \mu_i, \ldots, s_i[L] - \mu_i)^T$ in which $\mu_i = \sum_{n=1}^{L} s_i[n] / L$). Assumptions (A1) and (A3) imply that the two observations $\mathbf{x}_1$ and $\mathbf{x}_2$ are linearly independent vectors (i.e., $\mathbf{x}_2 \ne \alpha \mathbf{x}_1$ where $\alpha \ne 0$). Note that the assumption that the sources are mutually statistically independent is a fundamental assumption made by most ICA algorithms, and it requires $\tilde{\mathbf{s}}_1^T \tilde{\mathbf{s}}_2 / L = 0$ (uncorrelated).

2.2 Theory and Methods

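The proposed nLCA is supported by the following theorem.

Theorem 1 (Correlation Increase Theorem). Under Assumptions (A1) and (A2), $\rho(\mathbf{x}_1, \mathbf{x}_2) \ge \rho(\mathbf{s}_1, \mathbf{s}_2)$ (or $\theta(\mathbf{x}_1, \mathbf{x}_2) \le \theta(\mathbf{s}_1, \mathbf{s}_2)$), as shown in Fig. 2.

[Figure 2: vectors s1 and s2 drawn from the origin o and separated by the angle θ(s1, s2); the vectors x1 and x2 lie inside the shaded region R between them and are separated by the smaller angle θ(x1, x2).]

Fig. 2. Observation vectors x1 and x2 which are inside the shaded region R formed by source vectors s1 and s2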

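The proof of Theorem 1 is given in Section 4.1. Theorem 1 implies that linear non-negative mixing of non-negative sources leads to an increase of the correlation coefficient. Based on Theorem 1, a straightforward method, referred to as Method 1, is to design the demixing matrix $\mathbf{W}$ by reducing the correlation coefficient $\rho(\mathbf{y}_1, \mathbf{y}_2)$ (i.e., maximizing the angle $\theta(\mathbf{y}_1, \mathbf{y}_2)$) of the extracted sources, subject to the two constraints $\mathbf{y}[n] \succeq \mathbf{0}$ (i.e., $\mathbf{y}[n] \approx \mathbf{P}\mathbf{s}[n]$) and $\mathbf{W} \cdot \mathbf{1} = \mathbf{1}$ (due to $\mathbf{W}(\mathbf{A} \cdot \mathbf{1}) = \mathbf{P} \cdot \mathbf{1} = \mathbf{1}$).

Method 1 (A direct method). The demixing matrix $\mathbf{W}$ is obtained as

\[ \mathbf{W} = \arg\min_{\mathbf{W}} \rho(\mathbf{y}_1, \mathbf{y}_2) \]

subject to $\mathbf{y}[n] \succeq \mathbf{0}$, $\forall n$ (i.e., $\mathbf{y}_1 \succeq \mathbf{0}$ and $\mathbf{y}_2 \succeq \mathbf{0}$) and $\mathbf{W} \cdot \mathbf{1} = \mathbf{1}$.

Finding the optimum demixing matrix $\mathbf{W}$ is apparently a nonlinear and nonconvex optimization problem. Fortunately, a closed-form solution for $\mathbf{W}$ can be shown to be

\[ \mathbf{W} = \begin{bmatrix} \dfrac{-\tan\phi(\mathbf{x}[k_1])}{1 - \tan\phi(\mathbf{x}[k_1])} & \dfrac{1}{1 - \tan\phi(\mathbf{x}[k_1])} \\[2ex] \dfrac{-\tan\phi(\mathbf{x}[k_2])}{1 - \tan\phi(\mathbf{x}[k_2])} & \dfrac{1}{1 - \tan\phi(\mathbf{x}[k_2])} \end{bmatrix} \tag{10} \]

where

\[ \tan\phi(\mathbf{x}[k_1]) = x_2[k_1]/x_1[k_1] = \max_n \{x_2[n]/x_1[n]\}, \tag{11} \]

\[ \tan\phi(\mathbf{x}[k_2]) = x_2[k_2]/x_1[k_2] = \min_n \{x_2[n]/x_1[n]\}. \tag{12} \]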
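Theorem 1 and the closed-form solution (10) to (12) can be checked numerically on a toy example. The Python sketch below is only an illustration under stated assumptions: the sources, the mixing matrix, and the function names `rho`, `apply_2x2` and `method1_demixing` are hypothetical; samples with $x_1[n] = 0$ are skipped when forming the ratios; and the degenerate case in which a ratio equals exactly 1 is not handled.

```python
import math

def rho(u, v):
    """Correlation coefficient as defined in (8): cosine of the angle between u and v."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def apply_2x2(M, v1, v2):
    """Apply a 2 x 2 matrix M to every sample pair (v1[n], v2[n])."""
    out1 = [M[0][0] * a + M[0][1] * b for a, b in zip(v1, v2)]
    out2 = [M[1][0] * a + M[1][1] * b for a, b in zip(v1, v2)]
    return out1, out2

def method1_demixing(x1, x2):
    """Closed-form demixing matrix (10) from the extreme ratios (11) and (12)."""
    ratios = [b / a for a, b in zip(x1, x2) if a > 0]  # assumption: skip x1[n] = 0
    t1, t2 = max(ratios), min(ratios)
    return [[-t1 / (1 - t1), 1 / (1 - t1)],
            [-t2 / (1 - t2), 1 / (1 - t2)]]

# Hypothetical non-negative, correlated sources; each is zero where the other is not.
s1 = [4.0, 0.0, 2.0, 1.0]
s2 = [0.0, 3.0, 2.0, 5.0]
A = [[0.8, 0.2], [0.3, 0.7]]           # non-negative, rows sum to one (A2, A4)

x1, x2 = apply_2x2(A, s1, s2)          # observations, as in (1)
W = method1_demixing(x1, x2)
y1, y2 = apply_2x2(W, x1, x2)          # extracted sources, as in (2)

print(rho(s1, s2) <= rho(x1, x2))      # mixing does not decrease the correlation
print(y1, y2)
```

In this example the extracted sources coincide with the original ones up to rounding; in general they are recovered only up to a permutation, as stated in (2).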


The proof of (10) is given in Section 4.2. Next let us present an indirect method. The second method is to estimate the mixing matrix $\mathbf{A}$ from the given observations $\mathbf{x}_1$ and $\mathbf{x}_2$, based on the following theorem.

Theorem 2. Suppose that Assumptions (A1) and (A2) hold true, and that there exist $\mathbf{s}[l_1] = (s_1[l_1] \ne 0, 0)^T$ and $\mathbf{s}[l_2] = (0, s_2[l_2] \ne 0)^T$ for some $l_1$ and $l_2$. Let $\phi(\mathbf{x}[k_1]) = \max\{\phi(\mathbf{x}[l_1] = \mathbf{A}\mathbf{s}[l_1]), \phi(\mathbf{x}[l_2] = \mathbf{A}\mathbf{s}[l_2])\}$ and $\phi(\mathbf{x}[k_2]) = \min\{\phi(\mathbf{x}[l_1]), \phi(\mathbf{x}[l_2])\}$. Then $0 \le \phi(\mathbf{x}[k_2]) \le \phi(\mathbf{x}[n]) \le \phi(\mathbf{x}[k_1]) \le \pi/2$, $\forall n$.

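[Figure 3: two scatter plots. Before mixing, the shaded region R1 is bounded by the points s[l1] and s[l2]; after mixing, the shaded region R2 is bounded by the points x[l1] and x[l2], which may appear in either order.]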

Fig. 3. Scatter plot coverage (R1 ) of two sources and that (R2 ) of the two associated observations

The proof of Theorem 2 is given in Section 4.3. By Theorem 2, if $\mathbf{s}[l_1] = (s_1[l_1] \ne 0, 0)^T$ and $\mathbf{s}[l_2] = (0, s_2[l_2] \ne 0)^T$ for some $l_1$ and $l_2$ are two points, one on each edge of the scatter plot of $\mathbf{s}_1$ and $\mathbf{s}_2$ (a two-dimensional plot of $\mathbf{s}[n]$, $n = 1, \ldots, L$; i.e., the shaded region $R_1$ in Fig. 3), then the associated $\mathbf{x}[l_1]$ and $\mathbf{x}[l_2]$ will also lie on the two edges of the scatter plot of $\mathbf{x}_1$ and $\mathbf{x}_2$ (i.e., the shaded region $R_2 \subseteq R_1$ in Fig. 3), respectively. In view of this observation, the unknown mixing matrix $\mathbf{A}$ can be easily solved from

\[ \tan\phi(\mathbf{x}[l_1]) = x_2[l_1]/x_1[l_1] = a_{21}/a_{11}, \tag{13} \]

\[ \tan\phi(\mathbf{x}[l_2]) = x_2[l_2]/x_1[l_2] = a_{22}/a_{12}, \tag{14} \]

\[ \mathbf{A} \cdot \mathbf{1} = \mathbf{1}, \tag{15} \]

and the solutions for $a_{11}$ and $a_{21}$ are given by

\[ a_{11} = \frac{1 - \tan\phi(\mathbf{x}[l_2])}{\tan\phi(\mathbf{x}[l_1]) - \tan\phi(\mathbf{x}[l_2])}, \tag{16} \]

\[ a_{21} = \frac{(1 - \tan\phi(\mathbf{x}[l_2])) \tan\phi(\mathbf{x}[l_1])}{\tan\phi(\mathbf{x}[l_1]) - \tan\phi(\mathbf{x}[l_2])}, \tag{17} \]

which together with (15) lead to the solution for $\mathbf{A}$. The above procedure for estimating $\mathbf{A}$, referred to as Method 2, is summarized as follows:


Method 2 (An indirect method). Find $\tan\phi(\mathbf{x}[k_1])$ and $\tan\phi(\mathbf{x}[k_2])$ using (11) and (12), and set $l_1 = k_1$ and $l_2 = k_2$. Obtain $a_{11}$ and $a_{21}$ using (16) and (17), respectively, and then obtain $a_{12} = 1 - a_{11}$ and $a_{22} = 1 - a_{21}$. Finally, obtain $\mathbf{W} = \mathbf{A}^{-1}$.

Let us conclude this section with the following two remarks.

Remark 1. The condition stated in Theorem 2, that $\mathbf{s}[l_1] = (s_1[l_1] \ne 0, 0)^T$ and $\mathbf{s}[l_2] = (0, s_2[l_2] \ne 0)^T$ exist for some $l_1$ and $l_2$, guarantees that the estimate of $\mathbf{A}$ obtained by Method 2 exists and is unique up to a column permutation of $\mathbf{A}$. Under the same condition, one can easily prove that $\mathbf{W}\mathbf{A} = \mathbf{P}$ (see (10)), implying the existence and uniqueness of the estimate of $\mathbf{A}$ for Method 1 up to a column permutation of $\mathbf{A}$. In practical applications, however, this condition may be satisfied only approximately, i.e., $\mathbf{s}[l_1] = (s_1[l_1] \ne 0, s_2[l_1] \approx 0)^T$ and $\mathbf{s}[l_2] = (s_1[l_2] \approx 0, s_2[l_2] \ne 0)^T$ for some $l_1$ and $l_2$. For example, the non-overlapping region (for which $\mathbf{s}[n] = (s_1[n] \ne 0, s_2[n] \approx 0)^T$ or $\mathbf{s}[n] = (s_1[n] \approx 0, s_2[n] \ne 0)^T$) in the spatial distribution of fast perfusion and slow perfusion source images in brain MRI [8] is usually higher than 95%. The estimated sources $\mathbf{y}_1$ and $\mathbf{y}_2$ then turn out to be approximations of the original sources $\mathbf{s}_1$ and $\mathbf{s}_2$.

Remark 2. The proposed nLCA is not limited by Assumption (A4). When $\mathbf{A} \cdot \mathbf{1} \ne \mathbf{1}$, the mixing model given by (1) can be converted into the following model:

\[ \tilde{\mathbf{x}}[n] = \mathbf{D}_1 \mathbf{x}[n] = (\mathbf{D}_1 \mathbf{A} \mathbf{D}_2) \mathbf{D}_2^{-1} \mathbf{s}[n] = \tilde{\mathbf{A}} \tilde{\mathbf{s}}[n] \tag{18} \]

where $\mathbf{D}_1 = \mathrm{diag}\{1/\sum_n x_1[n], 1/\sum_n x_2[n]\}$ and $\mathbf{D}_2 = \mathrm{diag}\{\sum_n s_1[n], \sum_n s_2[n]\}$ ($2 \times 2$ diagonal matrices), $\tilde{\mathbf{A}} = \mathbf{D}_1 \mathbf{A} \mathbf{D}_2$, for which Assumptions (A2), (A3) and (A4) ($\tilde{\mathbf{A}} \cdot \mathbf{1} = \mathbf{1}$) are satisfied, and the sources are $\tilde{\mathbf{s}}[n] = \mathbf{D}_2^{-1} \mathbf{s}[n]$ (instead of $\mathbf{s}[n]$), for which Assumption (A1) is also satisfied.
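Method 2, combined with the normalization of Remark 2, can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not a reference implementation: the sources, the mixing matrix (whose rows deliberately do not sum to one) and all function names are hypothetical, samples with a zero first observation are skipped, and the toy sources contain exact "pure" samples as required by Theorem 2.

```python
def normalize_unit_sum(x1, x2):
    """Remark 2: scale each observation to unit sum, i.e. apply D1 of (18)."""
    return [v / sum(x1) for v in x1], [v / sum(x2) for v in x2]

def estimate_mixing_matrix(x1, x2):
    """Method 2: estimate A from the extreme ratios, using (11), (12), (16), (17)."""
    ratios = [b / a for a, b in zip(x1, x2) if a > 0]  # assumption: skip x1[n] = 0
    t1, t2 = max(ratios), min(ratios)                  # l1 = k1, l2 = k2
    a11 = (1 - t2) / (t1 - t2)
    a21 = (1 - t2) * t1 / (t1 - t2)
    return [[a11, 1 - a11], [a21, 1 - a21]]

def invert_2x2(M):
    """Inverse of a nonsingular 2 x 2 matrix."""
    det = M[0][0] * M[1][1] - M[0][1] * M[1][0]
    return [[M[1][1] / det, -M[0][1] / det],
            [-M[1][0] / det, M[0][0] / det]]

def apply_2x2(M, v1, v2):
    """Apply a 2 x 2 matrix M to every sample pair (v1[n], v2[n])."""
    out1 = [M[0][0] * a + M[0][1] * b for a, b in zip(v1, v2)]
    out2 = [M[1][0] * a + M[1][1] * b for a, b in zip(v1, v2)]
    return out1, out2

# Hypothetical sources and a mixing matrix that violates (A4).
s1 = [4.0, 0.0, 2.0, 1.0]
s2 = [0.0, 3.0, 2.0, 5.0]
A = [[2.0, 1.0], [0.6, 1.4]]

x1, x2 = apply_2x2(A, s1, s2)
xn1, xn2 = normalize_unit_sum(x1, x2)
A_est = estimate_mixing_matrix(xn1, xn2)   # estimate of D1 A D2, up to column order
W = invert_2x2(A_est)
y1, y2 = apply_2x2(W, xn1, xn2)
print(y1, y2)
```

Because of the normalization, the outputs are the unit-sum versions of the sources, here in swapped order: `y1` matches `s2` divided by its sum and `y2` matches `s1` divided by its sum, which is the permutation ambiguity discussed in Remark 1.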

3 Experiments and Discussion

So far, we have described the theory behind nLCA, and have presented two nLCA methods to separate composite biomarker distributions. We shall now illustrate the efficacy of the proposed nLCA using mixtures of real multichannel images. The first experiment reports the effectiveness of the nLCA for mixtures of two correlated human face images taken from the benchmarks in [11]. For the second experiment, the proposed nLCA is applied to DCE-MRI of breast cancer, where the two source images correspond to the permeability distributions of a fast perfusion image and a slow perfusion image in the region of interest [5]. In each of the two experiments, 50 independent random mixtures were generated and then processed using Method 2 and three existing algorithms, FastICA [9], non-negative ICA (nICA) [6], and non-negative matrix factorization (NMF) [10], for performance comparison. The average of the error index E1 [8] over the 50 independent runs was calculated as the performance index, where

\[ E_1 = \sum_{i=1}^{2} \left[ \left( \sum_{j=1}^{2} \frac{|\hat{p}_{ij}|}{\max_k \{|\hat{p}_{ik}|\}} \right) - 1 \right] + \sum_{j=1}^{2} \left[ \left( \sum_{i=1}^{2} \frac{|\hat{p}_{ij}|}{\max_k \{|\hat{p}_{kj}|\}} \right) - 1 \right] \tag{19} \]

where $\hat{p}_{ij}$ denotes the $(i, j)$-element of $\hat{\mathbf{P}} = \mathbf{W}\mathbf{A}$. Note that the value of $E_1$ is smaller for $\hat{\mathbf{P}}$ closer to a permutation matrix.

The averaged E1 associated with the proposed nLCA, FastICA, nICA and NMF for the human face experiment are displayed in Table 1. One can see from this table that the proposed nLCA performs best, the nICA second, the FastICA third, and the NMF fourth. To further illustrate the performance of the four source separation algorithms under test, a typical set of results of the human face experiment is displayed in Fig. 4, including the two original source images s1, s2 and the associated scatter plot, the observations x1, x2 and the associated scatter plot, and the extracted source images y1, y2 and the associated scatter plots obtained by the four algorithms. Some observations from Fig. 4 are as follows. The scatter plots shown in Figs. 4(a) and 4(b) are similar to those shown in Fig. 3 and thus consistent with Theorem 2. The scatter plot associated with the proposed nLCA shown in Fig. 4(c) resembles that shown in Fig. 4(a) much better than the other scatter plots do.

As previously mentioned, DCE-MRI provides temporal mixtures of heterogeneous permeability distributions corresponding to slow and fast perfusion rates. The second experiment is to separate the two perfusion distributions from their mixtures. The averaged E1 associated with the proposed nLCA, FastICA, nICA and NMF are also displayed in Table 1. Moreover, a typical set of results of the DCE-MRI experiment corresponding to those shown in Fig. 4 is displayed in Fig. 5. Again, the same conclusions obtained from the human face experiment apply to the DCE-MRI experiment, i.e., the proposed nLCA outperforms the other three algorithms. These experimental results demonstrate the efficacy of the proposed nLCA.

Next let us discuss why the proposed nLCA performs better than FastICA, nICA, NMF.
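As a brief aside before that discussion, the error index (19) can be written as a short Python function. This is only a sketch: the function name `error_index` and the example matrices are hypothetical, and the code assumes that every row and every column of the input contains at least one non-zero entry, since otherwise a division by zero occurs.

```python
def error_index(P):
    """Error index E1 of (19) for a square matrix P = WA, given as a list of rows."""
    n = len(P)
    absP = [[abs(v) for v in row] for row in P]
    # Row term: each row is normalized by its largest magnitude.
    row_term = sum(sum(row) / max(row) - 1 for row in absP)
    # Column term: each column is normalized by its largest magnitude.
    columns = [[absP[i][j] for i in range(n)] for j in range(n)]
    col_term = sum(sum(col) / max(col) - 1 for col in columns)
    return row_term + col_term

print(error_index([[0.0, 1.0], [1.0, 0.0]]))   # exact permutation matrix
print(error_index([[1.0, 0.5], [0.0, 1.0]]))   # residual mixing in the first row
```

An exact permutation matrix gives an index of zero, while the second matrix, in which half of the second source leaks into the first output, gives an index of one.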
FastICA and nICA are statistical BSS algorithms, the former under the assumption of non-Gaussian independent sources and the latter under the assumption of non-negative, uncorrelated and well-grounded sources (i.e., probability $\Pr\{s_i[n] < \delta\} > 0$ for any $\delta > 0$). However, the sources in the above experiments are correlated, as in many other biomedical imaging applications, implying that the source independence assumption made by FastICA and nICA is not satisfied. On the other hand, both the proposed nLCA and NMF are algebraic approaches based on essentially the same realistic assumptions (i.e., Assumptions (A1), (A2) and (A3)). However, the proposed nLCA has optimum and closed-form solutions (either for the demixing matrix W or for the mixing matrix A) for Methods 1 and 2, whereas NMF is an iterative algorithm which may provide only a locally optimal solution. Therefore, the proposed nLCA is computationally efficient and outperforms the other three algorithms in the above experiments.

Table 1. The performance (averaged E1) of the proposed nLCA and FastICA, nICA and NMF for the human face and DCE-MRI experiments

Averaged E1              nLCA    FastICA   nICA    NMF
Human face experiment    0.082   0.453     0.369   0.640
DCE-MRI experiment       0       0.162     0.114   0.432

[Figure 4: six panels, (a) to (f).]

Fig. 4. Human face images (top and middle rows) and associated scatter plots (bottom row) for (a) the sources, (b) the observations and the extracted sources obtained by (c) nLCA, (d) FastICA, (e) nICA, and (f) NMF

[Figure 5: six panels, (a) to (f).]

Fig. 5. DCE-MRI images (top and middle rows) and associated scatter plots (bottom row) for (a) the sources (permeability) corresponding to the slow (top plot) and fast (middle plot) perfusion, (b) the observations and the extracted sources obtained by (c) nLCA, (d) FastICA, (e) nICA, and (f) NMF

We believe that the proposed nLCA is a promising method for blind separation of multiple biomarker patterns. We would expect it to be an effective image formation tool applicable to many other multichannel biomedical imaging modalities [1]. Extension of the proposed nLCA to the case of more than two sources and its performance in the presence of measurement noise are currently under investigation.

Acknowledgments. This work was supported partly by the National Science Council (R.O.C.) under Grant NSC 94-2213-E-007-026 and partly by the National Institutes of Health under Grant EB000830.

References

1. H. R. Herschman, “Molecular imaging: looking at problems, seeing solutions,” Science, vol. 302, pp. 605-608, 2003.
2. M. Zhao, M. Yang, X.-M. Li, P. Jiang, E. Baranov, S. Li, M. Xu, S. Penman, and R. M. Hoffman, “Tumor-targeting bacterial therapy with amino acid auxotrophs of GFP-expressing salmonella typhimurium,” Proc. Natl. Acad. Sci., vol. 102, pp. 755-760, 2005.
3. S. G. Armato, “Enhanced visualization and quantification of lung cancers and other diseases of the chest,” Experimental Lung Res., vol. 30, pp. 72-77, 2004.
4. Y. Wang, J. Zhang, K. Huang, J. Khan, and Z. Szabo, “Independent component imaging of disease signatures,” Proc. IEEE Intl. Symp. Biomed. Imaging, Washington DC, July 7-10, 2002, pp. 457-460.
5. Y. Wang, R. Srikanchana, P. Choyke, J. Xuan, and Z. Szabo, “Computed simultaneous imaging of multiple functional biomarkers,” Proc. IEEE Intl. Symp. Biomed. Imaging, Arlington, VA, April 15-18, 2004, pp. 225-228.
6. E. Oja and M. Plumbley, “Blind separation of positive sources by globally convergent gradient search,” Neural Computation, vol. 16, pp. 1811-1825, 2004.
7. Y. Wang, J. Xuan, R. Srikanchana, and P. L. Choyke, “Modeling and reconstruction of mixed functional and molecular patterns,” Intl. J. Biomed. Imaging, 2005 in press.
8. Y. Zhou, S. C. Huang, T. Cloughesy, C. K. Hoh, K. Black, and M. E. Phelps, “A modeling-based factor extraction method for determining spatial heterogeneity of Ga-68 EDTA kinetics in brain tumors,” IEEE Trans. Nucl. Sci., vol. 44, no. 6, pp. 2522-2527, Dec. 1997.
9. A. Hyvarinen, J. Karhunen, and E. Oja, Independent Component Analysis. New York: John Wiley, 2001.
10. D. Lee and H. S. Seung, “Learning the parts of objects by non-negative matrix factorization,” Nature, vol. 401, pp. 788-791, Oct. 1999.
11. A. Cichocki and S. Amari, Adaptive Blind Signal and Image Processing. John Wiley and Sons, Inc., 2002.

4 Appendix

4.1 Proof of Theorem 1

Let V be a 2-dimensional vector space spanned by the linearly independent vectors $\mathbf{s}_1$ and $\mathbf{s}_2$ for which $0 \le \rho(\mathbf{s}_1, \mathbf{s}_2) < 1$. Let $\mathbf{u}_1 = \mathbf{s}_1 / \|\mathbf{s}_1\|$ and $\mathbf{u}_2$ (which can be obtained via Gram-Schmidt orthogonalization) be a set of orthonormal basis vectors of V. Then any vector in V can be represented in terms of $\mathbf{u}_1$ and $\mathbf{u}_2$, as shown in Fig. 6.

Fig. 6. Source vectors and observation vectors after non-negative mixing

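Let $\theta(\mathbf{v})$ denote the angle between $\mathbf{u}_1$ and $\mathbf{v} \in V$. Then

\[
\begin{aligned}
\mathbf{x}_1 &= a_{11}\mathbf{s}_1 + a_{12}\mathbf{s}_2 \\
&= a_{11}\|\mathbf{s}_1\|\,\mathbf{u}_1 + a_{12}\|\mathbf{s}_2\|\,[\cos(\theta(\mathbf{s}_2))\,\mathbf{u}_1 + \sin(\theta(\mathbf{s}_2))\,\mathbf{u}_2] \\
&= [a_{11}\|\mathbf{s}_1\| + a_{12}\|\mathbf{s}_2\|\cos(\theta(\mathbf{s}_2))]\,\mathbf{u}_1 + [a_{12}\|\mathbf{s}_2\|\sin(\theta(\mathbf{s}_2))]\,\mathbf{u}_2,
\end{aligned}
\]

which implies that

\[
0 \le \tan\theta(\mathbf{x}_1) = \frac{\sin(\theta(\mathbf{s}_2))}{\dfrac{a_{11}\|\mathbf{s}_1\|}{a_{12}\|\mathbf{s}_2\|} + \cos(\theta(\mathbf{s}_2))} \le \tan\theta(\mathbf{s}_2),
\]

i.e., $0 \le \theta(\mathbf{x}_1) \le \theta(\mathbf{s}_2)$. Similarly, one can prove $0 \le \theta(\mathbf{x}_2) \le \theta(\mathbf{s}_2)$. Therefore, $|\theta(\mathbf{x}_2) - \theta(\mathbf{x}_1)| \le \theta(\mathbf{s}_2)$ and $\rho(\mathbf{x}_1, \mathbf{x}_2) = \cos(\theta(\mathbf{x}_2) - \theta(\mathbf{x}_1)) \ge \rho(\mathbf{s}_1, \mathbf{s}_2) = \cos(\theta(\mathbf{s}_2))$.

4.2 Proof of (10)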

Consider a 2-dimensional plane of (ω1 , ω2 ). The constraint of ω1 +ω2 = 1 includes the following two cases: Case 1: w11 + w12 = 1 for (ω1 = w11 , ω2 = w12 ). Case 2: w21 + w22 = 1 for (ω1 = w21 , ω2 = w22 ). As shown in Fig. 7, all the points on the line segment AB satisfy ω1 + ω2 = 1 and ω1 x1 [n] + ω2 x2 [n] ≥ 0, where the coordinates of the points A and B are given by

\[
\omega_A = \left( \frac{-x_2[k_1]}{x_1[k_1] - x_2[k_1]},\ \frac{x_1[k_1]}{x_1[k_1] - x_2[k_1]} \right), \qquad
\omega_B = \left( \frac{-x_2[k_2]}{x_1[k_2] - x_2[k_2]},\ \frac{x_1[k_2]}{x_1[k_2] - x_2[k_2]} \right).
\]

Fig. 7. Feasible region of ω1 and ω2 (same for both Cases 1 and 2) satisfying ω1 + ω2 = 1 and ω1 x1[n] + ω2 x2[n] ≥ 0 for all n

Note that $\omega_1 x_1[n] + \omega_2 x_2[n] = y_1[n]$ for Case 1 and $\omega_1 x_1[n] + \omega_2 x_2[n] = y_2[n]$ for Case 2. Consider the vector space V as presented in the proof of Theorem 1 in Section 4.1. Both $\mathbf{x}_1$ and $\mathbf{x}_2$ must be on the line passing through $\mathbf{s}_1$ and $\mathbf{s}_2$ due to the constraints $a_{11} + a_{12} = 1$ and $a_{21} + a_{22} = 1$, respectively. Moreover, both $\mathbf{y}_1$ and $\mathbf{y}_2$ must be on the line passing through $\mathbf{x}_1$ and $\mathbf{x}_2$ due to the constraints $w_{11} + w_{12} = 1$ and $w_{21} + w_{22} = 1$, respectively. Decreasing $\rho(\mathbf{y}_1, \mathbf{y}_2)$ is equivalent to increasing the angle between the vectors $\mathbf{y}_1$ and $\mathbf{y}_2$ as shown in Fig. 8, i.e., both $w_{11}$ and $w_{22}$ must be positive while $w_{12}$ and $w_{21}$ must be negative. Minimum $\rho(\mathbf{y}_1, \mathbf{y}_2)$ corresponds to the values of $w_{ij}$ with maximum $|w_{ij}|$. In other words, the optimum solution for $(w_{11}, w_{12})$ corresponds to either the point A or the point B in Fig. 7, and so does the optimum solution for $(w_{21}, w_{22})$. Therefore, the optimum demixing matrix is given by $W = [\omega_A^T, \omega_B^T]^T$, which can be easily proven to be the one given by (10).

4.3 Proof of Theorem 2

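Because $\phi(\mathbf{s}[l_1]) = 0$ and $\phi(\mathbf{s}[l_2]) = \pi/2$, one can easily see that

\[
\tan\phi(\mathbf{x}[n]) = \frac{x_2[n]}{x_1[n]} = \frac{a_{21}s_1[n] + a_{22}s_2[n]}{a_{11}s_1[n] + a_{12}s_2[n]} = \frac{a_{21} + a_{22}\tan\phi(\mathbf{s}[n])}{a_{11} + a_{12}\tan\phi(\mathbf{s}[n])}. \tag{20}
\]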

Fig. 8. Vector diagram of source signals (s_i), observations (x_i) and extracted signals (y_i)

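By (20), (13) and (14), one can easily obtain

\[
\tan\phi(\mathbf{x}[n]) - \tan\phi(\mathbf{x}[l_1]) = \frac{\det(A)\tan\phi(\mathbf{s}[n])/a_{11}}{a_{11} + a_{12}\tan\phi(\mathbf{s}[n])}, \tag{21}
\]

\[
\tan\phi(\mathbf{x}[n]) - \tan\phi(\mathbf{x}[l_2]) = \frac{-\det(A)/a_{12}}{a_{11} + a_{12}\tan\phi(\mathbf{s}[n])}, \tag{22}
\]

where $\det(A) = a_{11}a_{22} - a_{21}a_{12}$. Since the entries of A are non-negative, the sign of both differences is determined by the sign of $\det(A)$. One can therefore infer, from (21) and (22), that $\phi(\mathbf{x}[l_1]) \le \phi(\mathbf{x}[n]) \le \phi(\mathbf{x}[l_2])$ if $\det(A) \ge 0$, and $\phi(\mathbf{x}[l_2]) \le \phi(\mathbf{x}[n]) \le \phi(\mathbf{x}[l_1])$ if $\det(A) \le 0$, implying $\phi(\mathbf{x}[k_2]) \le \phi(\mathbf{x}[n]) \le \phi(\mathbf{x}[k_1])$ for all n. Thus, we have completed the proof.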
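The appendix results lend themselves to a quick numerical sanity check. The following is a minimal sketch, not the authors' implementation: the synthetic sources, the mixing matrix and the helper names (`cosine`, `zero_sum_weights`, `k_lo`, `k_hi`) are all assumptions made for illustration. It checks that non-negative mixing does not decrease the correlation coefficient (Theorem 1), that the extreme observation angles occur at the pixels where only one source is present (Theorem 2), and that weights of the closed form given for the points A and B in Section 4.2 undo the mixing when the rows of A sum to one.

```python
import numpy as np

def cosine(u, v):
    """Cosine of the angle between two vectors (rho in the appendix)."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

def zero_sum_weights(x1k, x2k):
    """Weights (w1, w2) with w1 + w2 = 1 and w1*x1k + w2*x2k = 0."""
    d = x1k - x2k
    return np.array([-x2k / d, x1k / d])

rng = np.random.default_rng(0)

# Hypothetical correlated, non-negative sources (illustration only).
s = rng.random((2, 500))
s[1] = 0.5 * s[0] + 0.5 * s[1]
s[:, 0] = [1.0, 0.0]   # pixel l1: only source 1 present, phi(s) = 0
s[:, 1] = [0.0, 1.0]   # pixel l2: only source 2 present, phi(s) = pi/2

# Assumed non-negative mixing matrix with unit row sums and det(A) > 0.
A = np.array([[0.7, 0.3],
              [0.2, 0.8]])
x = A @ s

# Theorem 1: mixing cannot decrease the correlation coefficient.
rho_s = cosine(s[0], s[1])
rho_x = cosine(x[0], x[1])

# Theorem 2: extreme angles of x[n] are attained at the pure-source pixels.
phi = np.arctan2(x[1], x[0])
k_lo, k_hi = int(np.argmin(phi)), int(np.argmax(phi))

# Each row of W has unit sum and cancels one extreme observation.
W = np.vstack([zero_sum_weights(*x[:, k_hi]),
               zero_sum_weights(*x[:, k_lo])])
y = W @ x

print("rho(s1, s2) =", round(rho_s, 3), " rho(x1, x2) =", round(rho_x, 3))
print("extreme-angle pixels:", k_lo, k_hi)
print("max |y - s| =", float(np.abs(y - s).max()))
```

In this noise-free toy setting the diagonal entries of W come out positive and the off-diagonal entries negative, in line with the sign pattern argued in Section 4.2. How the indices k1 and k2 of the main text map onto `k_lo` and `k_hi` is left open here, since their definitions are not repeated in the appendix.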

Image and Fractal Information Processing for Large-Scale Chemoinformatics, Genomics Analyses and Pattern Discovery

Ilkka Havukkala, Lubica Benuskova, Shaoning Pang, Vishal Jain, Rene Kroon, and Nikola Kasabov

Knowledge Engineering and Discovery Research Institute, Auckland University of Technology, Auckland, New Zealand
www.kedri.info

Abstract. Two promising approaches for handling large-scale biodata are presented and illustrated in several new contexts: molecular structure bitmap image processing for chemoinformatics, and fractal visualization methods for genome analyses. It is suggested that two-dimensional structure databases of bioactive molecules (e.g. proteins, drugs, folded RNAs), transformed to bitmap image databases, can be analysed by a variety of image processing methods, with an example of human microRNA folded 2D structures processed by Gabor filter. Another compact and efficient visualization method is comparison of huge amounts of genomic and proteomic data through fractal representation, with an example of analyzing oligomer frequencies in a bacterial phytoplasma genome. Bitmap visualization of bioinformatics data seems promising for complex parallel pattern discovery and large-scale genome comparisons, as powerful modern image processing methods can be applied to the 2D images.

1 Introduction

Massive amounts of information keep accumulating in many complex chemical structure databases, including protein and RNA structures, drug molecules, drug-ligand databases, and so on. Surprisingly, there is no commonly accepted standard for recording and managing chemical structure data, e.g. drug molecules, suitable for automated data mining [1]. Genomic data are also accumulating at increasing speed, with almost 2,000 microbial and eukaryotic genomes listed in the Genomes OnLine Database (GOLD), either completed or being sequenced [2]. Increasing interest is now being focused on characterizing various genomes, especially their repetitive DNA and repeated DNA motifs in the non-coding regions, which are important for chromatin condensation and gene regulation [3]. We present and illustrate two promising approaches to handle large-scale chemoinformatics and genomics data, based on visualization as bitmaps and applicable to standardized pattern analysis and knowledge discovery.

J.C. Rajapakse, L. Wong, and R. Acharya (Eds.): PRIB 2006, LNBI 4146, pp. 163–173, 2006. © Springer-Verlag Berlin Heidelberg 2006

2 Protein, RNA and Other Chemoinformatics Databases

2.1 Current Analysis Methods

There are currently some 35,000 protein structures (X-ray and NMR) deposited in the Protein Data Bank PDB [4], and many more structures have been estimated by computational comparison of amino acid sequences to secondary and tertiary structures, either by ab initio folding programs or by supervised methods involving sequence threading to a known protein structure. A large number of web servers are available on the internet to compare protein structures with each other, see e.g. [5]. The underlying structural alignment algorithms are crucial for drug design, e.g. ligand to protein binding simulation. However, these algorithms currently cannot handle simultaneous comparison and classification of large numbers of structures, except by brute force, using very large distributed computing infrastructures, like FightAIDS@Home on the World Community Grid, which performs AutoDock analysis of drug and HIV virus target matching on thousands of PCs around the world [6]. There is thus currently no efficient solution for matching, clustering and classifying large numbers of molecular structures.

Amino acid sequence similarity has been used as a proxy to compare similar protein structures, but a minimum of 30% sequence identity and a known structure is needed for modelling protein structures. For accurate drug design, up to 60% sequence identity is needed to ensure proper ligand binding models. In this respect, the current set of protein structures does not yet sufficiently cover the natural protein structure space [7]. In addition, protein structure is known to be clearly more conserved than sequence. Similarly to proteins, the folding of RNA molecules is also known to be often more conserved than their sequence, and the most recent estimates suggest that the number of non-coding genes with stable 2D RNA structures of transcripts is in the thousands [8], and may match the total number of protein coding genes in eukaryotic genomes.
There is thus a need for new efficient methods for comparing and clustering large numbers of macromolecule structures that could avoid the use of complicated and detailed data structures pertaining to the 3D atomic coordinates of proteins, RNAs, and organic molecules. The alternative approach advocated in this paper is to generate 2D projections of molecular structures, transform the data into bitmap images and then analyze the bitmap images using a variety of advanced methods developed in the artificial intelligence community for face recognition, fingerprint classification and so on. An example of using this approach for RNA structures is described below.

2.2 Bitmap Image Processing Approach to Clustering and Classification of Folded RNAs

RNA molecules commonly self-assemble, resulting in more or less stable specific conformations in which nucleotide pairs A–U and C–G are formed for a reduced free energy level. The conformations are characteristic of the different RNAs, e.g. eukaryotic ribosomal RNAs, microbial riboswitches, human microRNAs and so on. With the latest algorithms, secondary 2D structures can be computed quite fast and reliably from RNA sequence [9]. Normally only the most stable structure with the lowest thermodynamic energy (ΔG) is considered, but there can also be several other more or less likely conformations, collectively known as the Boltzmann ensemble, which can nowadays also be computed with reasonable accuracy [10]. Ideally, these alternative conformations should be taken into account in comparative analysis of different RNAs.

Consensus structure comparisons for a set of RNA sequences have previously been made in three basic ways: 1) multiple alignment of sequences, followed by structure folding of the consensus, 2) the Sankoff method of simultaneously aligning sequences and folds, and 3) folding sequences to structures, followed by structural alignment, as reviewed in [11]. The first method may not cluster together all related sequences, as RNA structure is more conserved than its sequence. With the Sankoff method it is not easy to cluster large numbers of sequences/structures, and the method is also computationally very demanding for large-scale use. The third method is a novel field, and demands a very good method to align structures to start with. Several approaches have been introduced, including representing RNA as topological graphs or trees. Representative algorithms in this field are RNAFORESTER and MARNA, reviewed in [9], and TREEMINER [12]. Their performance in analysing and clustering very large RNA sets is not yet known.

A new generic approach proposed by us [13] for large-scale analysis of RNA structures consists of first computing the 2D structures for the set of RNA sequences, followed by transformation of the structures into bitmap images and analysis of the image set with a suitable image processing algorithm (Fig. 1).

2.3 Example of Human MicroRNAs Analysed by Gabor Filter Method

MicroRNAs are short RNAs, 80-150 basepairs long, that do not code for protein, but fold into hairpin structures and exert their effect on gene regulation by binding to matching sequences of messenger RNAs of protein-coding genes, reviewed in [14]. They are now known from plants, mammals and many lower eukaryotes as well. In a first case study of the general bitmap image analysis approach [13], the set of 222 known human microRNAs was folded by the RNAFold algorithm of the Vienna package [15] and transformed into bitmap images, which were then used to extract classificatory information using the Gabor filter method. The Gabor filter produces rotation-invariant features, which are used to calculate measures of similarity to compare images. Greyscale bitmaps of 512x512 pixels were used, with low-resolution spatial frequencies and four angular directions. Fig. 2 (top middle and right) shows two examples of Gabor filter transformed bitmap images of folded RNA (top left) at low angular resolution. From the transformed images, feature vectors were obtained, and Manhattan distances between the vectors of all pairs of microRNAs were calculated.

The heat map of all-versus-all comparisons of the 222 microRNAs (Fig. 2, bottom) clearly shows the diagonal of similar items, i.e. microRNAs sharing a family or structural motifs (the microRNAs were ordered by known microRNA families). In the heatmap colour scaling, blue pixels show the most similar microRNA pairs, and red pixels the least similar ones. In addition, putative similarities are indicated for a large number of other microRNA pairs that do not share sequence similarity. For more details, see [13]. These additional similarities are worth exploring further, because they may correlate with specific structures in the folded RNAs. Thus the bitmap image similarity could help in sequence pattern discovery by providing additional information for clustering RNAs with weakly similar sequences.

Fig. 1. Bitmap image analysis approach to RNA structure classification (flowchart: RNA candidate sequence set, fold sequences, transform to bitmap images, cluster by image similarity, conserved structures)

2.4 Further Improvement of the Approach

To improve the bitmap utilization method, other ways of visualizing the 2D structure could be used, e.g. different colours or shapes for different bases or basepairs. Subsequently, various other image feature extraction methods could be used to derive informative colour/shape/contour/curvature data for clustering and classification of the microRNA structure images. The approach is a general one, applicable to all kinds of macromolecules for which an informative 2D structure representation is easily computed. This method could reveal relevant features not previously considered by chemists or biologists, or it could be used as a prefiltering step in very large databases of molecular structures. The challenge then is to develop the image clustering methods to handle large numbers of bitmap images efficiently. Automating the procedure requires suitable cutoffs on the similarity measures, so that similar structures are clustered at the desired level of statistical significance.

Fig. 2. Gabor filter analysis of microRNA structures. Top: left, a sample folded RNA structure; middle, Gabor filtered image at θ = 0 rotation angle; right, image at θ = π/2. Bottom: Heatmap matrix of Gabor filter feature vector Manhattan distance similarities of 222 human microRNAs. x- and y-axes: microRNA identification number, heatmap colourscale: Blue (dark): most similar, Red (light): least similar.
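The pipeline of Sections 2.2 and 2.3 (filter each bitmap at several orientations, summarise the responses as a feature vector, compare vectors by Manhattan distance) can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation used in [13]: the kernel parameters, the choice of mean and standard deviation of the response magnitude as features, and the striped test images standing in for folded-RNA bitmaps are all hypothetical.

```python
import numpy as np

def gabor_kernel(size, wavelength, theta, sigma):
    """Complex Gabor kernel: Gaussian envelope times an oriented plane wave."""
    half = size // 2
    y, x = np.mgrid[-half:half + 1, -half:half + 1]
    xr = x * np.cos(theta) + y * np.sin(theta)
    envelope = np.exp(-(x ** 2 + y ** 2) / (2.0 * sigma ** 2))
    return envelope * np.exp(2j * np.pi * xr / wavelength)

def gabor_features(image, thetas, wavelength=8.0, sigma=4.0, size=21):
    """Mean and standard deviation of the response magnitude per orientation."""
    spectrum = np.fft.fft2(image)
    feats = []
    for theta in thetas:
        kernel = gabor_kernel(size, wavelength, theta, sigma)
        # Circular convolution via the FFT; the kernel is zero-padded.
        response = np.fft.ifft2(spectrum * np.fft.fft2(kernel, s=image.shape))
        magnitude = np.abs(response)
        feats.extend([magnitude.mean(), magnitude.std()])
    return np.array(feats)

def manhattan_matrix(features):
    """All-versus-all Manhattan (L1) distances between feature vectors."""
    diff = features[:, None, :] - features[None, :, :]
    return np.abs(diff).sum(axis=-1)

# Hypothetical stand-ins for structure bitmaps: vertical stripes,
# horizontal stripes, and a copy of the first image.
cols = np.arange(64)
img_a = np.tile((np.sin(2 * np.pi * cols / 8) > 0).astype(float), (64, 1))
img_b = img_a.T
img_c = img_a.copy()

thetas = [0.0, np.pi / 4, np.pi / 2, 3 * np.pi / 4]   # four angular directions
feats = np.array([gabor_features(im, thetas) for im in (img_a, img_b, img_c)])
D = manhattan_matrix(feats)
print(np.round(D, 2))
```

The distance matrix `D` plays the role of the heatmap in Fig. 2: identical images give a zero entry, while images with differently oriented structure are far apart. A clustering step with a similarity cutoff, as discussed above, would operate on `D`.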

3 Genomics Databases

3.1 Current Analysis Methods

Similarly to the expansion of chemoinformatics related databases, genomic and proteomic data are pushing bioinformaticians to develop efficient large-scale methods for pattern identification and knowledge discovery, and easily accessible and queryable databases. Multiple alignments of many genomes (utilizing BLAST or other fast string comparisons) are already used for interspecies comparisons [16],[17], but more compact data summarization methods are needed. Analyzing whole genomes to quickly reveal their salient features and to extract new knowledge is an essential goal for biological sciences. We advocate the solution of compressing information about oligomer frequencies in long sequences into small, coloured fractal representations in 2D or 3D space. This can achieve compression of genome data by a million times or more.

3.2 Fractal Representation Approach for DNA Sequences

Fractals in the form of iterated function systems (IFS) and the Chaos Game Representation have been used to visualize short DNA [18] or protein [19] sequences of genes, even complete genomes [20],[21], and in principle any symbolic sequences [22]. The iterated function system transforms DNA sequences to unique points in 2-dimensional space. The principle here is to map all oligomers of a fixed size of N bases contained in the genome to a 2D space with $2^N \times 2^N$ elements. An important characteristic of the representation space is that there are so-called attractor points in the space, e.g. in the corners, representing subsequences AAAA, CCCC, and so on. Similar oligomers are situated spatially close to each other in the representation space. Fig. 3 illustrates the IFS principle. Equation (1) shows the four transformations in the rectangular coordinate space that are applied for successive bases of the DNA, with x and y axes ranging from 0 to 1.

\[
\begin{aligned}
\omega_T(x, y) &= (0.5x + 0.5,\ 0.5y), & \omega_A(x, y) &= (0.5x,\ 0.5y + 0.5), \\
\omega_G(x, y) &= (0.5x,\ 0.5y), & \omega_C(x, y) &= (0.5x + 0.5,\ 0.5y + 0.5).
\end{aligned}
\tag{1}
\]

Every transformation contracts coordinates to its quarter of the unit square. The limit set of points emerging from an infinite application of the IFS is called the IFS attractor. End positions of all the oligomers are marked on the grid, their frequency in each cell is counted, and the frequencies are displayed by greyscale or colour scale. We show an example with a microbial phytoplasma genome.

3.3 Example of Phytoplasma Genome Octamers Visualized in Fractal Space

Phytoplasmas are wall-less prokaryotic microbes and obligate parasites of plants, with genome sizes below one million basepairs. They belong to Mollicutes, known to have AT-rich genomes. The Aster Yellows Witches’ Broom genome has recently been sequenced, and is used here as an example of a new unexplored genome [23]. All octamer oligonucleotides of the whole genome (ca. 700 kilobases) were plotted in a fractal space of 256x256 = 65,536 pixels, and their frequencies are shown as a colour heatmap (Fig. 4). As expected, the AT-diagonals have high frequencies of octamers, and the abundance of A-rich and T-rich sequences at opposite corners is immediately evident. This was verified by using the RepeatScout algorithm [24] to calculate the most abundant non-overlapping octamers (including tandem ones) in the genome. Fig. 5 illustrates that the most abundant octamers indeed are AT-rich. What is not easy to find out from this octamer frequency listing is that there are two approximately equally abundant types of these oligomers, A-rich and T-rich, as shown by the red-orange clusters in the corners A and T, respectively.
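The mapping of Equation (1) and the exhaustive counting of overlapping oligomers can be sketched in a few lines. This is an illustrative sketch rather than the authors' code. The function names are hypothetical, and one assumption is made explicit: the first base of each oligomer selects the coarsest quadrant, so that oligomers sharing a prefix land close together, as described for this representation. Since each transformation in (1) halves the coordinates and adds 0 or 0.5, every base contributes one binary digit to the column index and one to the row index.

```python
import numpy as np

# Binary digits contributed by each base, read off Equation (1):
# G -> (0, 0), T -> (1, 0), A -> (0, 1), C -> (1, 1) as (x, y) corners.
X_BIT = {"G": 0, "A": 0, "T": 1, "C": 1}
Y_BIT = {"G": 0, "T": 0, "A": 1, "C": 1}

def kmer_cell(kmer):
    """Return (row, col) of a k-mer in a 2^k x 2^k grid; row 0 is y = 0.

    Assumption: the first base is the most significant digit, so k-mers
    with a common prefix share a sub-square of the grid.
    """
    row = col = 0
    for base in kmer:
        col = 2 * col + X_BIT[base]
        row = 2 * row + Y_BIT[base]
    return row, col

def cgr_counts(seq, k):
    """Count all overlapping k-mers of seq on the 2^k x 2^k grid."""
    grid = np.zeros((2 ** k, 2 ** k), dtype=np.int64)
    for i in range(len(seq) - k + 1):
        kmer = seq[i:i + k]
        if all(base in X_BIT for base in kmer):   # skip N and other symbols
            grid[kmer_cell(kmer)] += 1
    return grid

# Toy sequence (not a real genome): an A-run followed by a GT repeat.
toy = "A" * 10 + "GT" * 6
grid = cgr_counts(toy, 8)
print(grid.shape, int(grid.sum()), int(grid[255, 0]))
```

For octamers the grid has 256 x 256 cells, matching the 65,536 pixels mentioned above. With row 0 at the bottom of the plot, the all-A octamer falls in the top-left corner and the all-T octamer in the bottom-right corner.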

Fig. 3. Principle of mapping N-mers for a fractal space. Here all tetramer polynucleotides are mapped to unique positions in a 16 x 16 coordinate grid. Three end positions for three tetramers are shown.

The basic difference between the fractal method and the counting and comparison of frequencies of tandem and interspersed repeats is that overlapping oligomers are enumerated exhaustively. This is important for the RNAi and transcription factor mechanisms regulating gene expression and chromatin remodelling, which rely on the presence of suitable binding site oligomers in any relevant genome location. Another finding easily seen in the fractal representation is the cluster in the middle of the bottom border between the G and T corners, which suggests an abundance of GT-rich octamers. Such repetitive motifs might have a special function in the phytoplasma for its host relationship. Indeed, it has been suggested that repetitive DNA is important in prokaryotes for genome plasticity, especially in host-parasite interactions [25]. For example, in the Neisseria bacterium octamer repeats are specifically enriched, suggesting a special mechanism for their generation and instability [26]. Short direct tandem repeats (microsatellites) seem to be rare in the closely related onion yellows phytoplasma genomes, based on searching the Microorganisms Tandem Repeats Database [27],[28]. Thus the common GT-rich octamers mentioned above are most likely interspersed multicopy sequences of unknown function.

In summary, the fractal histogram plot seems very useful to show simultaneously over-represented and under-represented oligomers that may be under special evolutionary selection pressures. A specific feature of the fractal representation is that the oligomers cluster based on their similarity starting from the beginning of the sequence, so that sequences with the same beginning but different suffixes are near each other.

Fig. 4. Aster Yellows Witches’ Broom phytoplasma genome octamers visualized in a 256 x 256 ($2^8 \times 2^8$) grid as a heatmap; red colour means higher frequency. The abundance of A and T rich octamers is obvious on the red diagonal and in the top left and bottom right corners. An arrow points to a cluster of GT-repeats in the middle of the bottom border between the G and T corners.

3.4 Further Improvement of the Approach

For a more detailed analysis of any genome, one would draw fractal histograms with different oligomer lengths to identify specific repeated interspersed motifs in the genome. Overlaying or subtracting a plot of a random sequence of similar length with the same ratios of A/T/C/G could show statistically significant differences according to a specific cutoff. Successive sections of the genome could be analyzed separately, so that one could find repeat-rich regions, coding and non-coding regions and so on in the genome. Keeping track of the oligomer coordinates as well would enable one to map specific oligomer groups to specific locations in the genome. Such a tool could thus be a very versatile method of visual exploration and comparison of genomes. Similarly, comparing two or more genomes by overlaying could be easily accomplished, to pinpoint the relevant changes in abundant or under-represented oligomers in the genome. This would be effective for immediate and informative genome scale visual comparisons.
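The background subtraction suggested above can be illustrated without drawing an actual random sequence, by using the count expected for a random sequence with the same base composition. The sketch below is an assumption-laden illustration, not part of the original method: it replaces the random sequence by its expectation under independent bases, and the function names and the toy sequence are hypothetical.

```python
from collections import Counter

def kmer_counts(seq, k):
    """Counts of all overlapping k-mers in seq."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))

def expected_count(kmer, base_freq, n_windows):
    """Expected count of kmer in a random sequence with independent bases."""
    p = 1.0
    for base in kmer:
        p *= base_freq[base]
    return n_windows * p

def enrichment(seq, k):
    """Observed minus expected count for every k-mer that occurs in seq."""
    n_windows = len(seq) - k + 1
    base_freq = {b: c / len(seq) for b, c in Counter(seq).items()}
    observed = kmer_counts(seq, k)
    return {kmer: count - expected_count(kmer, base_freq, n_windows)
            for kmer, count in observed.items()}

# Toy sequence (not a real genome): a pure GT repeat.
toy = "GT" * 200
scores = enrichment(toy, 8)
for kmer, score in sorted(scores.items(), key=lambda kv: -kv[1]):
    print(kmer, round(score, 2))
```

Oligomers with a large positive score are over-represented relative to composition alone; a cutoff on this score (or on a ratio or z-score derived from it) would give the kind of thresholded difference plot described above. Oligomers that never occur are not listed by this sketch, so detecting under-representation would additionally require enumerating absent k-mers.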

Fig. 5. The most abundant Aster Yellows Witches’ Broom phytoplasma genome octamers obtained by RepeatScout algorithm (bar chart; x-axis: Aster Yellows genome octamer, y-axis: number of occurrences, 0 to 2000). The abundance of A and T rich oligomers is clear.

The oligomer lengths could be variable, depending on the scale of interest, up to oligomer size 20 or so, which would map each unique single-copy sequence to a separate grid cell. The fractal spaces of the different length oligomers could be viewed successively as a moving colour video track for quick visualization of the relevant features, with several genomes shown side by side in synchrony. When analysing longer sequences, a small trivial difference may appear in the beginning of the string, leading to a quite different location in the fractal space. This could be mitigated by mapping strings in the reverse direction also. To achieve a sequence similarity based clustering like in BLAST, one would need a different ordering of the similar strings in the fractal space. A specific application for short oligomer based microarray technology is visualization of the set of oligos (the "oligome") on an array, and comparison between arrays and the target transcriptomes/genomes for completeness of coverage of possible hybridization sites. Further extension to larger alphabets, to encompass complete proteomes rather than short single protein sequences, is also an interesting possibility. Finally, automation of the method could be accomplished by image processing of the overlaid/subtracted images to highlight/extract the oligomer clusters of interest in the fractal space, down to the specific most common oligomers differing in frequency between the genomes.

4 Discussion

We have presented two promising visualization and classification methods, both based on transforming a bioinformatics problem to the image analysis domain, to deal with large sets of molecular structures and oligomer motifs in large genomes and proteomes. An example of transforming folded RNA molecules to 2D structure bitmaps was given, but the approach applies to several domains, including complex organic molecule databases and even protein secondary structure diagrams. For fractal coding of genome oligomer distribution, the example of a phytoplasma genome showed that specific types of repeats can be visualized effectively. Various extensions of the fractal method seem worth pursuing for novel types of DNA sequence pattern clustering and classification. Finally, moving symbolic data from the bioinformatics domain into the bitmap representation domain makes it possible to use the wide variety of bitmap image analysis methods developed in fields outside biology. This interdisciplinary approach should be both interesting and fruitful for informative visualization, data mining and knowledge discovery in bioinformatics and chemoinformatics datasets.

Acknowledgments. Supported by the Knowledge Engineering and Discovery Research Institute, Auckland University of Technology and the FRST NERF Fund (AUTX0201), New Zealand.

References

1. Banville, D.L.: Mining the chemical structural information from the drug literature. Drug Discovery Today 11(1/2) (2006) 35–42
2. http://www.genomesonline.org/
3. Vinogradov, A.E.: Noncoding DNA, isochores and gene expression: nucleosome formation potential. Nucleic Acids Res. 33(2) (2005) 559–63
4. http://www.rcsb.org/pdb/holdings.do
5. Vlahovicek, K. et al.: CX, DPX and PRIDE: WWW servers for the analysis and comparison of protein 3D structures. Nucleic Acids Res. 1(33) (2005) W252–254
6. http://fightaidsathome.scripps.edu/index.html
7. Vitkup, D. et al.: Completeness in structural genomics. Nature Struct. Biol. 8 (2001) 559–566
8. Washietl, S. et al.: Mapping of conserved RNA secondary structures predicts thousands of functional noncoding RNAs in the human genome. Nature Biotechnol. 23(11) (2005) 1383–1390
9. Washietl, S., Hofacker, I.L., Stadler, P.F.P., Tino, P.: Fast and reliable prediction of noncoding RNAs. PNAS USA 102(7) (2005) 2454–2459
10. Ding, Y., Lawrence, C.E.: A statistical sampling algorithm for RNA secondary structure prediction. Nucleic Acids Res. 31 (2003) 7280–7301
11. Gardner, P.P., Giegerich, R.: A comprehensive comparison of comparative RNA structure prediction approaches. BMC Bioinformatics 5(140) (2004) 1–18
12. Zaki, M.J.: Efficiently Mining Frequent Trees in a Forest: Algorithms and Applications. IEEE Trans. Knowl. Data Eng. 17(8) (2005) 1021–1035
13. Havukkala, I., Pang, S.N., Jain, V., Kasabov, N.: Classifying microRNAs by Gabor filter features from 2D structure bitmap images on a case study of human microRNAs. J. Comput. Theor. Nanosci. 2(4) (2005) 506–513
14. Mattick, J.S., Makunin, I.V.: Small regulatory RNAs in mammals. Hum. Mol. Genet. 14(1) (2005) R121–132
15. Hofacker, I.: Vienna RNA secondary structure server. Nucleic Acids Res. 31 (2003) 3429–3431
16. Frazer, K.A. et al.: VISTA: computational tools for comparative genomics. Nucleic Acids Res. 32 (2004) W273–279
17. Brudno, M. et al.: Automated whole-genome multiple alignment of rat, mouse, and human. Genome Res. 14(4) (2004) 685–692
18. Jeffrey, H.J.: Chaos game representation of gene structure. Nucleic Acids Res. 18(8) (1990) 2163–2170
19. Fiser, A., Tusnády, G.E., Simon, I.: Chaos game representation of protein structures. J. Mol. Graphics 12 (1994) 302–304
20. Hao, B., Lee, H., Zhang, S.: Fractals related to long DNA sequences and complete genomes. Chaos, Solitons and Fractals 11(6) (2000) 825–836
21. Almeida, J.S. et al.: Analysis of genomic sequences by Chaos Game Representation. Bioinformatics 17(5) (2001) 429–37
22. Tino, P.: Spatial representation of symbolic sequences through iterative function systems. IEEE Trans. Syst. Man Cybernet. 29 (1999) 386–393
23. Bai, X. et al.: Living with genome instability: the adaptation of phytoplasmas to diverse environments of their insect and plant hosts. J. Bacteriol. 188 (1999) 3682–3696
24. Price, A.L., Jones, N.C., Pevzner, P.A.: De novo identification of repeat families in large genomes. Bioinformatics 21(Suppl. 1) (2005) i351–i358
25. Aras, R.A. et al.: Extensive repetitive DNA facilitates prokaryotic genome plasticity. Proc. Natl. Acad. Sci. USA 100(23) (1999) 13579–135784
26. Saunders, N.J. et al.: Repeat-associated phase variable genes in the complete genome sequence of Neisseria meningitidis strain MC58. Molecular Microbiology 37(1) (2000) 207–215
27. Denoeud, F., Vergnaud, G.: Identification of polymorphic tandem repeats by direct comparison of genome sequence from different bacterial strains: a web-based resource. BMC Bioinformatics 5 (2004) 4
28. http://minisatellites.u-psud.fr/

Hybridization of Independent Component Analysis, Rough Sets, and Multi-Objective Evolutionary Algorithms for Classificatory Decomposition of Cortical Evoked Potentials

Tomasz G. Smolinski (1), Grzegorz M. Boratyn (2), Mariofanna Milanova (3), Roger Buchanan (4), and Astrid A. Prinz (1)

(1) Department of Biology, Emory University, Atlanta, GA 30322, {tomasz.smolinski, astrid.prinz}@emory.edu
(2) Kidney Disease Program, University of Louisville, Louisville, KY 40292
(3) Department of Computer Science, University of Arkansas, Little Rock, AR 72204
(4) Department of Biology, Arkansas State University, Jonesboro, AR 72467

Abstract. This article presents a continuation of our research aiming at improving the effectiveness of signal decomposition algorithms by providing them with “classification-awareness.” We investigate hybridization of multi-objective evolutionary algorithms (MOEA) and rough sets (RS) to perform the task of decomposition in the light of the underlying classification problem itself. In this part of the study, we also investigate the idea of utilizing the Independent Component Analysis (ICA) to initialize the population in the MOEA.

1 Introduction

The signals recorded from the surface of the cerebral cortex are composites of the electrical activity of a large number of individual cells and often constitute a mixture of a group of signals produced by many different sources (e.g., specific neuronal structures). In order to separate those superimposed signal patterns and analyze them independently, we propose to utilize an experimental technique based on measuring neural activity in a controlled setting (normal) as well as under exposure to some external stimulus (nicotine, in this case) [1]. Application of stimuli that affect the observed signals often has an effect only on a subset of the sources. The information about which sources are affected by the stimuli can provide interesting insight into the problem of neural activity analysis, but cannot be measured directly.

Based on the assumption that each of the sources produces a signal that is statistically independent of the others, the observed signals can be decomposed into constituents that model the sources. Those modeled sources are referred to as basis functions. Each of the observed signals is a linear combination of the basis functions. Because some sources can have a stronger influence in some locations than in others, each source can be present in each observed signal with a different magnitude. The source magnitudes are modeled as coefficients in the aforementioned linear combination. The change in the coefficients, as a result of applied stimuli, corresponds to the change in the contribution of a source to the generation of a given signal.

Independent Component Analysis (ICA) can be useful in this kind of analysis, as it allows for determination of the impact of the external stimuli on some specific neuronal structures, supposedly represented by the discovered components. The link between the stimulus and a given source can be verified by designing a classifier that is able to “predict” under which condition a given signal was registered, solely based on the discovered independent components. However, the statistical criteria used in ICA often turn out to be insufficient to build an accurate coefficients-based classifier.

Classificatory decomposition is a general term that describes our research study that attempts to improve the effectiveness of signal decomposition techniques by providing them with “classification-awareness.” The description of previous stages of the study and some examples of applications can be found in [2,3,4]. Currently, we are investigating a hybridization of multi-objective evolutionary algorithms (MOEA) and rough sets (RS) to perform decomposition in the light of the classification problem itself. The idea is to look for basis functions whose coefficients allow for an accurate classification while preserving the reconstruction.

In this article, we propose a simple extension of the well-known multi-objective evolutionary algorithm VEGA, which we call end-VEGA (elitist-non-dominated-VEGA). The extension supplies the algorithm with considerations of elitism and non-dominance, the lack of which is known to be VEGA’s main drawback. We also investigate the idea of utilizing ICA to initialize the population in the MOEA. The details of the modifications as well as a short theoretical background are given below.

J.C. Rajapakse, L. Wong, and R. Acharya (Eds.): PRIB 2006, LNBI 4146, pp. 174–183, 2006. © Springer-Verlag Berlin Heidelberg 2006

2 Theoretical Background

2.1 Independent Component Analysis

Independent Component Analysis (ICA) is a signal processing technique originally developed to deal with the cocktail-party problem [5]. ICA is perhaps the most widely used method in Blind Source Separation (BSS) in various implementations and practical applications [6]. The basic idea in ICA is to represent a set of random variables using basis functions (or sources) that are as statistically independent as possible. The Central Limit Theorem states that the distribution of a sum of independent random variables, under certain conditions, tends toward a Gaussian distribution. Thus a sum of two independent random variables usually has a distribution that is closer to Gaussian than either of the two original random variables. Therefore, the key concept in ICA is based on maximization of the non-Gaussianity of the sources. There are various quantitative measures of non-Gaussianity, one of the most popular of which is kurtosis (i.e., the fourth-order cumulant). One of the most popular ICA algorithms based on finding the local maximum of the absolute value of kurtosis is FastICA [7].
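The non-Gaussianity argument above can be illustrated with a small numerical sketch. It estimates the excess kurtosis of two independent uniform variables and of their sum. The helper name `excess_kurtosis`, the choice of uniform sources, and the sample size are assumptions made for illustration only and are not taken from the study.

```python
import numpy as np

def excess_kurtosis(x):
    """Sample excess kurtosis: E[z^4] - 3 for standardized z (0 for a Gaussian)."""
    x = np.asarray(x, dtype=float)
    z = (x - x.mean()) / x.std()
    return float(np.mean(z ** 4) - 3.0)

# Two hypothetical independent, clearly non-Gaussian sources.
rng = np.random.default_rng(0)
s1 = rng.uniform(-1.0, 1.0, size=200_000)
s2 = rng.uniform(-1.0, 1.0, size=200_000)

# Their sum should be closer to Gaussian, i.e. have kurtosis closer to 0.
mix = s1 + s2

k_source = excess_kurtosis(s1)
k_mix = excess_kurtosis(mix)
print(k_source, k_mix)
```

The theoretical values are -1.2 for a uniform variable and -0.6 for the sum of two independent uniform variables, so the mixture is measurably closer to Gaussian than either source. Maximizing the absolute kurtosis therefore pushes a candidate component away from a mixture and toward a single source.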

2.2 Multi-Objective Evolutionary Algorithms

Many decision making or design problems involve optimization of multiple, rather than single, objectives simultaneously. In the case of a single objective, the goal is to obtain the best global minimum or maximum (depending on the nature of the given optimization problem), while with multi-objective optimization, there usually does not exist a single solution that is optimal with respect to all objectives. Therefore, the goal of multi-objective optimization is to find a set of solutions such that no other solution in the search space is superior to them when all objectives are considered. This set is known as the Pareto-optimal or non-dominated set [8]. Since evolutionary algorithms (EA) work with a population of individuals, a number of Pareto-optimal solutions can be found in a single run. Therefore, an application of EAs to multi-objective optimization seems natural.

The first practical MOEA implementation was the Vector Evaluated Genetic Algorithm (VEGA) proposed in [9]. Although it opened a new avenue in multi-objective optimization research, the algorithm seemed to have some serious limitations, at least partially due to the lack of considerations of dominance and elitism [8]. To deal with the first of the above considerations, a non-dominated sorting procedure was suggested in [10] and various implementations based on that idea of rewarding non-dominated solutions followed [11]. Elitism, in other words the notion that “elite” individuals cannot be expelled from the active gene-pool by worse individuals, has recently been indicated to be a very important factor in MOEAs that can significantly improve their performance [12]. Both these aspects, while preserving the simplicity of implementation of the original VEGA, were taken into consideration in the design of the end-VEGA algorithm proposed here.
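As a minimal sketch of the non-dominated set described above, the following code filters a handful of candidate objective vectors. It assumes that both objectives are minimized; the function names and the example values are hypothetical.

```python
def dominates(a, b):
    """True if a is no worse than b in every objective and strictly better in one."""
    no_worse = all(x <= y for x, y in zip(a, b))
    strictly_better = any(x < y for x, y in zip(a, b))
    return no_worse and strictly_better

def non_dominated(points):
    """Return the points that no other point dominates (the Pareto-optimal set)."""
    return [p for p in points if not any(dominates(q, p) for q in points)]

# Hypothetical (objective 1, objective 2) pairs, both to be minimized.
candidates = [(0.10, 0.40), (0.20, 0.20), (0.40, 0.10), (0.30, 0.30), (0.50, 0.50)]
front = non_dominated(candidates)
print(front)
```

Here (0.30, 0.30) and (0.50, 0.50) are both dominated by (0.20, 0.20), while the remaining three points trade one objective against the other and therefore all belong to the front.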

2.3 Rough Sets

The theory of rough sets (RS) deals with the classificatory analysis of data tables [13]. The main idea behind it is the so-called indiscernibility relation that describes objects indistinguishable from one another. The indiscernibility relation induces a split of the universe, by dividing it into disjoint equivalence classes, denoted as $[x]_B$ (for some object x described by a set of attributes B). These classes can be used to build new partitions of the universe. Partitions that are most often of interest are those that contain objects that belong to the same decision class. It may happen, however, that a concept cannot be defined in a crisp manner. The main goal of rough set analysis is to synthesize approximations of concepts from acquired data. The concepts are represented by lower and upper approximations. Although it may be impossible to precisely define some concept X, we can approximate it using the information contained in B by constructing the B-lower and B-upper approximations of X, denoted by $\underline{B}X$ and $\overline{B}X$ respectively, where $\underline{B}X = \{x \mid [x]_B \subseteq X\}$ and $\overline{B}X = \{x \mid [x]_B \cap X \neq \emptyset\}$. Only the objects in $\underline{B}X$ can be with certainty classified as members of X, based on the knowledge in B. A rough set can be characterized numerically by the so-called quality of classification:

$$\gamma_B(X) = \frac{\mathrm{card}(\underline{B}X \cup \underline{B}\neg X)}{\mathrm{card}(U)}, \qquad (1)$$

where $\underline{B}X$ is the lower approximation of X, $\underline{B}\neg X$ is the lower approximation of the set of objects that do not belong to X, and U is the set of all objects.

Another very important aspect of rough analysis is data reduction by means of keeping only those attributes that preserve the indiscernibility relation and, consequently, the set approximation. The rejected attributes are redundant since their removal cannot worsen the classification. There are usually several such subsets of attributes and those that are minimal are called reducts. Finding a global minimal reduct (i.e., a reduct with a minimal cardinality among all reducts) is an NP-hard problem. However, there are many heuristics (including utilization of genetic algorithms [14]) designed to deal with this problem.
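The approximations and the quality of classification in (1) can be made concrete with a toy decision table. The sketch below is a plain-Python illustration and does not use the Rough Set Library mentioned later in the paper. The table, the attribute names `c1` and `c2`, and the concept `X` are all hypothetical.

```python
from collections import defaultdict

def equivalence_classes(table, attrs):
    """Group object ids that are indiscernible on the attribute subset attrs."""
    classes = defaultdict(set)
    for obj, row in table.items():
        classes[tuple(row[a] for a in attrs)].add(obj)
    return list(classes.values())

def lower_approx(table, attrs, concept):
    """Union of the equivalence classes fully contained in the concept."""
    return set().union(*[c for c in equivalence_classes(table, attrs) if c <= concept])

def upper_approx(table, attrs, concept):
    """Union of the equivalence classes that intersect the concept."""
    return set().union(*[c for c in equivalence_classes(table, attrs) if c & concept])

def quality_of_classification(table, attrs, concept):
    """Eq. (1): fraction of objects classified with certainty as X or as not X."""
    universe = set(table)
    certain = lower_approx(table, attrs, concept) | lower_approx(table, attrs, universe - concept)
    return len(certain) / len(universe)

# Hypothetical table: s1 and s2 are indiscernible on (c1, c2).
table = {
    "s1": {"c1": 0, "c2": 1},
    "s2": {"c1": 0, "c2": 1},
    "s3": {"c1": 1, "c2": 0},
    "s4": {"c1": 1, "c2": 1},
}
X = {"s1", "s3"}
attrs = ["c1", "c2"]

low = lower_approx(table, attrs, X)
up = upper_approx(table, attrs, X)
gamma = quality_of_classification(table, attrs, X)
print(low, up, gamma)
```

Because s1 and s2 cannot be told apart but only s1 belongs to X, neither can be classified with certainty, which leaves two of the four objects in the lower approximations and gives a quality of classification of 0.5.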

3 ICA, RS, and MOEA-Based Classificatory Decomposition

The main concept of classificatory decomposition was motivated by the hybridization of EAs with sparse coding with overcomplete bases (SCOB) introduced in [15]. Using this approach, the basis functions as well as the coefficients are evolved by optimization of a fitness function that minimizes the reconstruction error and at the same time maximizes the sparseness of the basis function coding. This methodology produces a set of basis functions and a set of sparse (i.e., “as few as possible”) coefficients. This may significantly reduce the dimensionality of a given problem but, like any other traditional decomposition technique, does not assure the classificatory usefulness of the resultant model.

In the approach proposed here, the sparseness term is replaced by a rough sets-derived data reduction-driven classification accuracy measure. This should assure that the result will be both “valid” (i.e., via the reconstruction constraint) and useful for the classification task. Furthermore, since the classification-related constituent also searches for a reduct, the classification is done with as few basis functions as possible. Finally, the single-objective EA utilized in the aforementioned technique is replaced by a multi-objective approach, in which the EA deals with the reconstruction error and classification accuracy, both at the same time [4].

Since the approach proposed here is based upon finding a solution satisfying two potentially conflicting goals (i.e., component-based reconstruction accuracy vs. classification accuracy), an application of MOEAs seems natural. In the experiments described here, we investigate a simple extension of VEGA, which supplies it with elitism and non-dominance, the lack of which is known to be its main drawback. We call this extended algorithm end-VEGA (elitist-non-dominated-VEGA).

3.1 End-VEGA

The main idea in VEGA is to randomly divide the population, in each generation, into equal subpopulations. Each subpopulation is assigned fitness based on a different objective function. Then, the crossover between the subpopulations is performed as with traditional EAs, with an introduction of random mutations. As indicated earlier, VEGA has several quite significant limitations related to the lack of dominance and elitism. To address the former, we propose a simple approach based on multiplying the fitness of a given individual by the number of solutions that this individual is dominated by (plus 1 to ensure that the fitness function of a non-dominated solution is not multiplied by 0). Since the fitness function is being minimized in this project, the dominated solutions will be adequately penalized. To deal with the latter, we utilize the idea of an external sequential archive [12] to keep track of the best-so-far (i.e., non-dominated) solutions and to make sure that their genetic material is in the active gene-pool.
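The dominance penalty and the external archive described above might be sketched as follows. This is a simplified illustration under stated assumptions, not the authors' implementation. It applies the penalty to every objective value, whereas in VEGA only the objective assigned to an individual's subpopulation would be used. The function names and example values are hypothetical.

```python
def dominates(a, b):
    """True if a is no worse than b everywhere and strictly better somewhere (minimization)."""
    return all(x <= y for x, y in zip(a, b)) and any(x < y for x, y in zip(a, b))

def penalized_fitness(objectives):
    """Multiply each individual's fitness by (number of dominating solutions + 1)."""
    penalized = []
    for p in objectives:
        n_dominating = sum(dominates(q, p) for q in objectives)
        penalized.append(tuple(f * (n_dominating + 1) for f in p))
    return penalized

def update_archive(archive, objectives):
    """Merge new solutions into the archive and keep only the non-dominated ones."""
    merged = list(dict.fromkeys(archive + objectives))  # drop exact duplicates, keep order
    return [p for p in merged if not any(dominates(q, p) for q in merged)]

# Hypothetical (reconstruction error, classification error) pairs for one generation.
generation = [(0.2, 0.5), (0.4, 0.6), (0.5, 0.1)]
fitness = penalized_fitness(generation)
archive = update_archive([], generation)
archive = update_archive(archive, [(0.1, 0.4)])
print(fitness, archive)
```

In this example (0.4, 0.6) is dominated by one solution, so its fitness values are doubled, while the two non-dominated solutions are multiplied by 1 and left unchanged. The later arrival of (0.1, 0.4) expels (0.2, 0.5) from the archive.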

3.2 Chromosome Coding

Each chromosome is a complete solution to a given classificatory decomposition task and provides a description of both the set of basis functions and the coefficients for all the signals in the training data set. For example, for N signals with n samples each, and the task of finding M basis functions, the chromosome will be coded in the following way:

Fig. 1. Chromosome coding

Each of the M basis functions has the length of the original input signal (i.e., n), and there are N vectors of coefficients (i.e., each vector corresponds to one signal in the training set) of dimensionality equal to the number of basis functions (i.e., each coefficient corresponds to one basis function).
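One possible flat encoding consistent with this description is sketched below with NumPy. The exact gene order is defined by Fig. 1, which is not reproduced here, so placing the basis functions first and the coefficient vectors second is an assumption, as are the function names `encode` and `decode`.

```python
import numpy as np

def encode(basis, coeffs):
    """Flatten M basis functions (M x n) followed by N coefficient vectors (N x M)."""
    return np.concatenate([basis.ravel(), coeffs.ravel()])

def decode(chromosome, n_samples, n_basis, n_signals):
    """Recover the (M x n) basis matrix and the (N x M) coefficient matrix."""
    split = n_basis * n_samples
    basis = chromosome[:split].reshape(n_basis, n_samples)
    coeffs = chromosome[split:].reshape(n_signals, n_basis)
    return basis, coeffs

# Hypothetical sizes: M = 2 basis functions, n = 3 samples, N = 4 signals.
basis = np.arange(6.0).reshape(2, 3)
coeffs = np.arange(8.0).reshape(4, 2)
chromosome = encode(basis, coeffs)
basis_back, coeffs_back = decode(chromosome, n_samples=3, n_basis=2, n_signals=4)
print(chromosome.size)
```

Under this layout a chromosome has M·n + N·M genes, so its length is dominated by the basis functions whenever the signals are long and few.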

3.3 ICA-Based Population Initialization

The idea behind using ICA to initialize the starting population in end-VEGA is very simple: rather than beginning the search at a random location in the search space, the chromosomes are supplied with a starting point that is already quite satisfactory in one of the objectives (i.e., reconstruction). Since, based on previous results described in [4], the reconstruction accuracy objective seems to be much more difficult to optimize (especially for a small number of high-dimensional signals), this approach is plausible. Depending on the parameters of both ICA and end-VEGA, various variants of the initialization can be pursued.

If the number of independent components (IC) is higher than the number of the maximum allowed classificatory components (CC), then the subset of the ICs used to initialize the population is chosen randomly. In the opposite case, the “missing” initial CCs can assume random values, just as in the case without ICA.
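The two initialization cases can be sketched for the basis-function part of a single chromosome as follows. The function name `init_basis`, the use of standard-normal values for the “missing” components, and the example array are assumptions. The matching coefficients would have to be selected or padded in the same way.

```python
import numpy as np

def init_basis(ics, max_cc, rng):
    """Build the initial (max_cc x n) basis from an (n_ic x n) array of ICs."""
    n_ic, n_samples = ics.shape
    if n_ic >= max_cc:
        # More ICs than allowed components: pick a random subset without repetition.
        # When n_ic == max_cc this yields a random permutation of all ICs.
        chosen = rng.choice(n_ic, size=max_cc, replace=False)
        return ics[chosen].copy()
    # Fewer ICs than allowed components: fill the remainder with random values.
    padding = rng.standard_normal((max_cc - n_ic, n_samples))
    return np.vstack([ics, padding])

rng = np.random.default_rng(1)
ics = np.arange(12.0).reshape(4, 3)  # hypothetical: 4 ICs with 3 samples each

subset = init_basis(ics, max_cc=2, rng=rng)
padded = init_basis(ics, max_cc=6, rng=rng)
print(subset.shape, padded.shape)
```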

3.4 Fitness Evaluation

Reconstruction error. The problem of minimization of the reconstruction error is intuitively simple. Once a particular distance measure has been decided upon, virtually any optimization algorithm can be used to minimize the distance between the original signal and the reconstructed one. The measure employed in this project is the well known 2-norm, referred to in signal processing as the signal energy-based measure [16]. In order to deal with raw signals which can be large (thus causing the energy-based distance measure to be large as well), a simple normalization of the energy-based measure by the energy of the original signal is proposed [3]:

$$D_{NORM} = \frac{\sum_{t=1}^{n} (x_t - (Ma)_t)^2}{\sum_{t=1}^{n} (x_t)^2}, \qquad (2)$$

where x represents the original signal, M is the matrix of basis functions, a is a set of coefficients, and t = 1..n where n is the number of samples in the signal. Subsequently, the reconstruction error fitness function $f_{REC}$ for a representative p takes the following form:

$$f_{REC}(p) = \frac{\sum_{i=1}^{N} D^{i}_{NORM}}{N}, \qquad (3)$$

where $D^{i}_{NORM}$ is the normalized reconstruction error for the i-th signal and N is the total number of the input signals.
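Equations (2) and (3) translate almost directly into NumPy. In the sketch below the basis functions are stored as the rows of an (M x n) array, so the matrix M of Eq. (2) corresponds to `basis.T`. That storage choice, the function names, and the toy data are assumptions.

```python
import numpy as np

def d_norm(x, basis, a):
    """Eq. (2): squared reconstruction error normalized by the energy of x."""
    residual = x - basis.T @ a  # basis.T plays the role of M in Eq. (2)
    return float(np.sum(residual ** 2) / np.sum(x ** 2))

def f_rec(signals, basis, coeffs):
    """Eq. (3): D_NORM averaged over all N input signals."""
    return float(np.mean([d_norm(x, basis, a) for x, a in zip(signals, coeffs)]))

# Hypothetical data: M = 2 basis functions, n = 3 samples, N = 2 signals.
basis = np.array([[1.0, 0.0, 0.0],
                  [0.0, 1.0, 0.0]])
signals = np.array([[1.0, 2.0, 0.0],   # lies in the span of the basis
                    [0.0, 0.0, 2.0]])  # orthogonal to the basis
coeffs = np.array([[1.0, 2.0],
                   [0.0, 0.0]])

errors = [d_norm(x, basis, a) for x, a in zip(signals, coeffs)]
fitness = f_rec(signals, basis, coeffs)
print(errors, fitness)
```

The first signal is reconstructed exactly (error 0), the second not at all (error 1), so the fitness of this representative is 0.5. The normalization keeps each per-signal error independent of the raw amplitude of the recording.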

Classification accuracy and reduction in the number of coefficients and basis functions. The problem of maximizing the classificatory competence of the decomposition scheme, and at the same time reducing the number of computed basis functions, can be dealt with by the application of rough sets. In this project, the rough sets-based quality of classification, as introduced in (1), is used for the purpose of estimating the classificatory aptitude. The quality of classification is estimated directly on the candidate reduct, which can be computed by any of the existing algorithms/heuristics (in this project, algorithms from the Rough Set Library were utilized [17]). Note that the main objective that deals with the classificatory capability of decomposition can actually be considered a bi-objective optimization problem itself. On one hand, we are looking for the best possible classification accuracy, but on the other, we want to use as few basis functions as possible. However, based on previous applications of EAs in the search for reducts, as described in [14], we decided to deal with it by minimizing a single-objective fitness function that is simply a summation of the classification error and the relative length of the reduct, as shown in (4).

$$f_{CLASS}(p) = (1 - \gamma_R) + \frac{L(R)}{M}, \qquad (4)$$

where p is a given representative (i.e., chromosome), L(R) is the length of the potential reduct R (i.e., the number of attributes used in the representative), normalized by the total number of conditional attributes M, and $\gamma_R$ is the quality of classification coefficient for the reduct R.

An interesting question here is what to do with the coefficients (and the corresponding basis functions) that are not a part of the reduct. Since we are looking for the best possible classification accuracy, while using as few basis functions as possible, some mechanism capable of emphasizing the “important” coefficients/basis functions would be advisable. A solution to this problem is possible due to the application of the “hard” fitness computation idea, which allows the fitness function itself to introduce changes directly to the genetic material of the evaluated chromosome [3]. In this paper we propose to utilize a coefficients/basis functions annihilation approach, which simply zeroes out the “not important” genetic material. The idea here is that if we remove the basis functions that are not vital in the classification process, the EA will improve the remaining basis functions in order to compensate for an increase in the reconstruction error.
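The classification fitness in (4) and the annihilation step can be sketched as follows. The reduct is represented here as a list of basis-function indices, and the quality of classification is passed in as a precomputed number. Both conventions, along with the function names and example values, are assumptions rather than the authors' implementation.

```python
import numpy as np

def f_class(gamma_r, reduct, n_basis):
    """Eq. (4): classification error plus the relative length of the reduct."""
    return (1.0 - gamma_r) + len(reduct) / n_basis

def annihilate(basis, coeffs, reduct):
    """Zero out the basis functions and coefficients that are not in the reduct."""
    keep = np.zeros(basis.shape[0], dtype=bool)
    keep[np.asarray(list(reduct), dtype=int)] = True
    new_basis = basis.copy()
    new_coeffs = coeffs.copy()
    new_basis[~keep, :] = 0.0   # rows are basis functions
    new_coeffs[:, ~keep] = 0.0  # columns are per-basis coefficients
    return new_basis, new_coeffs

# Hypothetical chromosome content: M = 5 basis functions, n = 4 samples, N = 3 signals.
basis = np.ones((5, 4))
coeffs = np.ones((3, 5))
reduct = [0, 2]

score = f_class(gamma_r=0.9, reduct=reduct, n_basis=5)
new_basis, new_coeffs = annihilate(basis, coeffs, reduct)
print(score, new_basis.sum(), new_coeffs.sum())
```

With a quality of classification of 0.9 and a reduct of 2 out of 5 attributes, the fitness is 0.1 + 0.4 = 0.5. Annihilation returns modified copies, so in an EA the caller would write them back into the evaluated chromosome to obtain the “hard” fitness behavior.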

4 Experimental Data

The dataset used in this study was derived from neurophysiological experiments performed at Arkansas State University [1]. In the experiments, recordings in the form of evoked potentials (EP) of a duration of 1 second triggered by an auditory stimulus were collected from the cerebral cortex of two rats. One of the animals had been exposed to cigarette smoke in utero (i.e., the mother of the animal was exposed to cigarette smoke during pregnancy), while the other had not. The research problem here is to investigate how treatments (like nicotine) could alter responses to discrete stimuli. Ten signals were registered for the unexposed animal and nine for the exposed one. The EPs were sampled at the rate of 7 kHz. The original signals for the unexposed and exposed rats are shown in Fig. 2.

5 Analysis

In the first step of the analysis described in this paper, the FastICA algorithm [18] was utilized to compute the ICs to be used in the initial population in the MOEA. The algorithm yielded 19 ICs along with the corresponding coefficients. As is typical with ICA, the reconstruction was nearly perfect, but the coefficients were hardly useful for differentiation between the two EP classes (unexposed vs. exposed).


Fig. 2. Input EPs for the unexposed (a) and exposed (b) animal

In order to investigate the feasibility of the proposed approach, a number of MOEAs were launched simultaneously. The maximum number of generations was set to 200 (no significant improvement in convergence was observed with a larger number of generations), while the size of the population was set to 30, 50, and 100. Mutation probability was initialized with a small random value and was adapted during the evolution process (i.e., increased by a fixed step in each generation if no progress in the fitness functions was observed and reset when there was an improvement). Crossover probability was randomly determined in each generation (between 0% and 100%). Single-point crossover was utilized.

Several variants of the ICA-based initialization of the population in end-VEGA were tried. Initialization of the full population as well as of only a part of it was simulated. In the first case, changes in the basis functions can only be introduced by mutation, while in the second, some randomness is present from the beginning. The maximum allowable number of basis functions was set to 5, 10, or 19. In the first two cases, a random subset of 5 or 10 ICs (out of all 19) was chosen for each chromosome, and in the third, a permutation of all 19 ICs was used.

In most cases, the classification accuracy was reasonably high (over 80%) and the problems appeared to be mostly related to 2 unexposed EPs being classified as exposed. The determined number of basis functions required to preserve that precision (driven by the search for a reduct) oscillated around 4, 7, and 12 for the maximum allowable number of 5, 10, and 19 basis functions, respectively.
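The mutation-probability adaptation mentioned in the experimental setup above could look roughly like the sketch below. The paper gives neither the step size nor an upper bound, so `step`, `p_max`, the initial value, and the function name are all hypothetical.

```python
def adapt_mutation(p_mut, improved, p_init, step=0.01, p_max=1.0):
    """Reset to the initial value on improvement, otherwise grow by a fixed step."""
    if improved:
        return p_init
    return min(p_mut + step, p_max)

# Hypothetical run: two stagnant generations followed by an improvement.
p_init = 0.02
history = [p_init]
for improved in (False, False, True):
    history.append(adapt_mutation(history[-1], improved, p_init))
print(history)
```

Raising the mutation rate during stagnation increases exploration, while the reset restores fine-grained search once the fitness starts improving again.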
The average reconstruction error was significantly improved compared to the previous study [4], especially in the case of the full set of the ICs being used to initialize the MOEA (note, however, that this set was reduced to about 12 ICs, thus indicating the ICs important for classification and at the same time “improving” them to account for the increase in the reconstruction error caused by removing the other 7 components). The modifications in end-VEGA, although they improved the reconstruction slightly and sped up the overall convergence of the algorithm, worked much better in tandem with ICA.

An example set of components, averaged for the unexposed and exposed animal separately, for 5 arbitrarily selected basis functions, is shown in Fig. 3. The figure represents the average contribution of the basis functions to the generation of the EPs in the unexposed and the exposed animal respectively. Even a quick visual analysis of the figure reveals significant differences in how the sources are represented in the unexposed and the exposed rat. The dissimilarities can be simply expressed by amplitude variations (M3, M5, M9), or can be as major as a sign reversal (M2, M10). Further analysis of such phenomena can provide interesting insight into the mechanisms behind the influence of nicotine on the cortical neural activity.

Fig. 3. Selected averaged components for the unexposed (a) and exposed (b) animal

6 Conclusions

This article presented a general framework for the methodology of classificatory decomposition of signals based on hybridization of independent component analysis, multi-objective evolutionary algorithms, and rough sets. The preliminary results described here are very promising and further investigation of other MOEAs and/or RS-based classification accuracy measures should be pursued. The incorporation of ICA-derived basis functions and coefficients as the starting point in the MOEA significantly improved the reconstruction error and more closely related the concept of classificatory decomposition to the traditional signal decomposition techniques.

Acknowledgments. RSL - The Rough Set Library [17] and the open-source FastICA MATLAB package [18] were used in parts of this project. Research partially sponsored by Burroughs-Wellcome Fund CASI Award, National Institutes of Health (NIH) NCRR grant P20 RR-16460, and grant no. 301-435-0888 from the National Center for Research Resources (NCRR), a component of the NIH.


References

1. Mamiya, N., Buchanan, R., Wallace, T., Skinner, D., Garcia, E.: Nicotine suppresses the P13 auditory evoked potential by acting on the pedunculopontine nucleus in the rat. Exp Brain Res 164 (2005) 109–119
2. Smolinski, T.G., Boratyn, G.M., Milanova, M., Zurada, J.M., Wrobel, A.: Evolutionary algorithms and rough sets-based hybrid approach to classificatory decomposition of cortical evoked potentials. Lecture Notes in Artificial Intelligence 2475 (2002) 621–628
3. Smolinski, T.G.: Classificatory Decomposition for Time Series Classification and Clustering. PhD thesis, Univ. of Louisville, Louisville (2004)
4. Smolinski, T.G., Milanova, M., Boratyn, G.M., Buchanan, R., Prinz, A.: Multi-objective evolutionary algorithms and rough sets for decomposition and analysis of cortical evoked potentials. In: Proc. IEEE International Conference on Granular Computing, Atlanta, GA (2006) 635–638
5. Amari, S., Cichocki, A., Yang, H.H.: A new learning algorithm for blind signal separation. In Touretzky, D.S., Mozer, M.C., Hasselmo, M.E., eds.: Advances in Neural Information Processing Systems. Volume 8. The MIT Press (1996) 757–763
6. Hyvärinen, A., Oja, E.: Independent component analysis: Algorithms and applications. Neural Networks 13 (2000) 411–430
7. Hyvärinen, A.: Fast and robust fixed-point algorithms for independent component analysis. IEEE Trans. on Neural Networks 10 (1999) 626–634
8. Deb, K.: Multi-objective optimization using evolutionary algorithms. Wiley (2001)
9. Schaffer, J.D.: Some experiments in machine learning using vector evaluated genetic algorithms. PhD thesis, Vanderbilt University (1984)
10. Goldberg, D.E.: Genetic algorithms in search, optimization, and machine learning. Addison-Wesley (1989)
11. Srinivas, N., Deb, K.: Multiobjective optimization using nondominated sorting in genetic algorithms. Evolutionary Computation 2 (1994) 221–248
12. Laumanns, M., Zitzler, E., Thiele, L.: A unified model for multi-objective evolutionary algorithms with elitism. In: Proceedings of the 2000 Congress on Evolutionary Computation CEC00, IEEE Press (2000) 46–53
13. Pawlak, Z.: Rough sets - Theoretical aspects of reasoning about data. Kluwer (1991)
14. Wróblewski, J.: Finding minimal reducts using genetic algorithms. In: Proc. 2nd Annual Joint Conference on Information Sciences, Wrightsville Beach, NC (1995) 186–189
15. Milanova, M., Smolinski, T.G., Boratyn, G.M., Zurada, J.M., Wrobel, A.: Correlation kernel analysis and evolutionary algorithm-based modeling of the sensory activity within the rat’s barrel cortex. Lecture Notes in Computer Science 2388 (2002) 198–212
16. Kreyszig, E.: Introductory functional analysis with applications. Wiley, New York, NY (1978)
17. Gawryś, M., Sienkiewicz, J.: RSL - The Rough Set Library version 2.0. Technical report, Warsaw University of Technology, Poland (1994)
18. Hurri, J.: The FastICA package for MATLAB. (http://www.cis.hut.fi/projects/ica/fastica/)
