Signals and Communication Technology
Changho Suh
Communication Principles for Data Science
Signals and Communication Technology Series Editors Emre Celebi, Department of Computer Science, University of Central Arkansas, Conway, AR, USA Jingdong Chen, Northwestern Polytechnical University, Xi’an, China E. S. Gopi, Department of Electronics and Communication Engineering, National Institute of Technology, Tiruchirappalli, Tamil Nadu, India Amy Neustein, Linguistic Technology Systems, Fort Lee, NJ, USA H. Vincent Poor, Department of Electrical Engineering, Princeton University, Princeton, NJ, USA Antonio Liotta, University of Bolzano, Bolzano, Italy Mario Di Mauro, University of Salerno, Salerno, Italy
This series is devoted to fundamentals and applications of modern methods of signal processing and cutting-edge communication technologies. The main topics are information and signal theory, acoustical signal processing, image processing and multimedia systems, mobile and wireless communications, and computer and communication networks. Volumes in the series address researchers in academia and industrial R&D departments. The series is application-oriented. The level of presentation of each individual volume, however, depends on the subject and can range from practical to scientific. Indexing: All books in “Signals and Communication Technology” are indexed by Scopus and zbMATH. For general information about this book series, comments or suggestions, please contact Mary James at [email protected] or Ramesh Nath Premnath at [email protected].
Changho Suh
Communication Principles for Data Science
Changho Suh KAIST Daejeon, Korea (Republic of)
ISSN 1860-4862 ISSN 1860-4870 (electronic) Signals and Communication Technology ISBN 978-981-19-8007-7 ISBN 978-981-19-8008-4 (eBook) https://doi.org/10.1007/978-981-19-8008-4 © The Editor(s) (if applicable) and The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2023 This work is subject to copyright. All rights are solely and exclusively licensed by the Publisher, whether the whole or part of the material is concerned, specifically the rights of translation, reprinting, reuse of illustrations, recitation, broadcasting, reproduction on microfilms or in any other physical way, and transmission or information storage and retrieval, electronic adaptation, computer software, or by similar or dissimilar methodology now known or hereafter developed. The use of general descriptive names, registered names, trademarks, service marks, etc. in this publication does not imply, even in the absence of a specific statement, that such names are exempt from the relevant protective laws and regulations and therefore free for general use. The publisher, the authors, and the editors are safe to assume that the advice and information in this book are believed to be true and accurate at the date of publication. Neither the publisher nor the authors or the editors give a warranty, expressed or implied, with respect to the material contained herein or for any errors or omissions that may have been made. The publisher remains neutral with regard to jurisdictional claims in published maps and institutional affiliations. This Springer imprint is published by the registered company Springer Nature Singapore Pte Ltd. The registered company address is: 152 Beach Road, #21-01/04 Gateway East, Singapore 189721, Singapore
To my wife, Yuni Kim
Preface
Features of the book  This book was written in response to the increased research activities in data science and the recent recognition of the importance of communication principles in the field. The book has three key features.

The first feature is a focus on digital communication principles tailored for modern applications in data science. This includes a range of domains such as social networks, biological networks, machine learning, and deep learning. Over the past century, the field of communication has undergone significant development, with a widening array of systems ranging from the telegraph, telephone, and television to 5G cellular and the Internet of Things, all the way up to the metaverse. There has been a proliferation of efforts to establish fundamental principles, concepts, and theories that guide optimal system design with low complexity, such as information and coding theory and detection and estimation theory. Communication has also contributed to numerous disciplines, including probability, random processes, signals & systems, and linear algebra. While numerous books on communication have been published over the past decades, this book focuses on communication principles specifically in the field of data science.

The second key feature of this book is its lecture-style organization. Unlike many communication books that present mathematical concepts and theories in a dictionary-style format, where topics are simply enumerated and organized sequentially, this book aims to provide a coherent storyline that motivates readers who are interested in data science and its connections to other disciplines. To achieve this goal, each section of the book is designed to serve as a note for an 80-minute lecture, with an emphasis on establishing an intimate connection between coherent themes and concepts that span across sections. In order to facilitate smooth transitions between sections, the book features two special paragraphs per section: a “recap” paragraph that summarizes what has been covered so far, thereby motivating the contents of the current section, and a “look ahead” paragraph that introduces upcoming contents by connecting them with past materials.

The final feature of this book is its collection of programming exercises utilizing two programming tools: (i) Python; and (ii) TensorFlow. While C++ and MATLAB are widely used in the communication field, Python is a primary software tool in the data science domain. Since this book covers various data science applications, Python
is used as the primary programming language. To implement machine learning and deep learning algorithms, TensorFlow is used, which is one of the most prevalent deep learning frameworks. TensorFlow offers numerous powerful built-in functions, which simplify many critical procedures in deep learning. One of the primary advantages of TensorFlow is its full integration with Keras, a high-level library focused on enabling rapid user experimentation. Keras enables users to transform their ideas into implementations with minimal steps.

Structure of the book  This book comprises course materials that were originally developed for two courses taught at KAIST: (i) EE321 Communication Engineering (taught in Spring 2013, 2014, 2015, and 2020); and (ii) EE326 Introduction to Information Theory and Coding (taught in Spring 2016 and 2017). The book is divided into three parts, each of which is further divided into several sections. Each section covers material that was presented in a single 80-minute lecture. Every three sections are followed by a problem set, which served as homework in the original courses. Below is a summary of the contents of each part.
I. Communication over the AWGN channel (9 sections and 3 problem sets): A brief history of communication; a block diagram of a digital communication system; the Additive Gaussian Noise (AGN) channel and a Python exercise; the optimal Maximum A Posteriori probability (MAP) detection, Maximum Likelihood (ML) detection, and Nearest Neighbor (NN) detection; analysis of the error probability performance and Python verification via Monte Carlo simulation; sequential coding; repetition coding; channel capacity and phase transition.

II. Communication over ISI channels (9 sections and 3 problem sets): Signal conversion from discrete to continuous time; bandlimited channels; the Nyquist criterion; Linear Time Invariant (LTI) systems; Inter-Symbol Interference (ISI); maximum likelihood sequence detection; the Viterbi algorithm and Python implementation; Orthogonal Frequency Division Multiplexing (OFDM) and Python implementation.

III. Data science applications (12 sections and 4 problem sets): Community detection as a communication problem, and phase transition; the optimal ML detection; the spectral algorithm, the power method, and Python implementation; Haplotype phasing as a communication problem; a two-step algorithm with local refinement via ML estimation, and Python implementation; speech recognition as a communication problem; the Markov process and graphical model; the Viterbi algorithm for speech recognition; machine learning and its connection with communication; logistic regression and its optimality; deep learning and TensorFlow implementation.

Regarding the application of machine learning, a couple of sections in this book are based on the author's previous publication (Suh, 2022), but have been modified to suit the theme of this book, which is the role of communication principles. Two appendices are included at the end of the book to provide brief tutorials on the programming languages used (Python and TensorFlow). These tutorials have also been adapted from (Suh, 2022), but with necessary adjustments based on the specific topics
covered in this book. Additionally, a list of references is provided that are relevant to the contents discussed in the book, but no detailed explanations are given, as the aim is not to cover the vast amount of research literature available.

How to use this book  The aim of this book is to serve as a textbook for undergraduate junior- and senior-level courses. It is recommended that readers have a solid understanding of probability and random processes and signals & systems, as well as a basic knowledge of Python. To assist students and other readers, we have included some guidelines.

1. Read one section per day and two sections per week: As each section is intended for a single lecture, and most courses have two lectures per week, this approach is recommended.
2. Cover all the contents in Parts I and II: Two of the most important principles in communication are Maximum A Posteriori probability (MAP) detection and Maximum Likelihood (ML) detection. If you are already familiar with these principles, you can skim through Part I and move on to Part II. However, if you are not familiar with them, we recommend that you read all the contents in Parts I and II in a sequential manner. The sections are arranged in a motivating storyline, and relevant exercise problems are included at appropriate intervals. We believe that this approach will help you maintain your motivation, interest, and understanding of the material.
3. Explore Part III selectively, depending on your interest: In Part III, you can choose to read individual applications. However, we have structured the content assuming that each section is read in sequence. Part III includes a significant amount of Python and TensorFlow implementation. With the help of the main body and the skeleton codes provided in the problem sets and appendices, you can successfully implement all the algorithms covered.
4. Solve four to five basic problems from each problem set: There are over 90 problems in total, consisting of more than 210 subproblems. These problems help to elaborate on the concepts discussed in the main text. They vary in difficulty, ranging from basic probability and signals & systems, to relatively simple derivations of results in the main text, to in-depth exploration of non-trivial concepts not fully explained in the main text, to programming implementation through Python or TensorFlow. All of the problems are closely connected to the established storyline. It is crucial to work on at least some of the problems to grasp the materials.

In the course offerings at KAIST, we were able to cover most of the topics in Parts I and II, but only a few applications in Part III. Depending on students' backgrounds, interests, and time availability, there are various ways to structure a course using this book. Some examples include:

1. A semester-long course (24–26 lectures): Cover all sections in Parts I and II, as well as two to three applications in Part III, such as community detection and speech recognition, or speech recognition and machine learning.
2. A quarter-long course (18–20 lectures): Cover almost all the materials in Parts I and II, except for some topics like the Python verification of the central limit theorem, Monte Carlo simulation, OFDM, and its Python implementation. Investigate two applications selected from Part III.
3. A senior-level course for students with the basics of MAP/ML: Spend about four to six lectures briefly reviewing the contents of Part I. Cover all the materials in Parts II and III. To save time, programming exercises could be assigned as homework.

Daejeon, Korea (Republic of)
August 2022
Changho Suh

Reference
Suh, C. 2022. Convex optimization for machine learning. Now Publishers.
Acknowledgement
We would like to extend our heartfelt thanks to Hyun Seung Suh for generously providing us with a plethora of constructive feedback and comments on the book’s structure, writing style, and overall readability, particularly from the perspective of a beginner.
Contents
1 Communication over the AWGN Channel   1
  1.1 Overview of the Book   1
  1.2 A Statistical Model for Additive Noise Channels   8
  1.3 Additive Gaussian Noise Channel and Python Exercise   12
  Problem Set 1   22
  1.4 Optimal Receiver Principle   25
  1.5 Error Probability Analysis and Python Simulation   30
  1.6 Multiple Bits Transmission using PAM   39
  Problem Set 2   45
  1.7 Multi-shot Communication: Sequential Coding   49
  1.8 Multi-shot Communication: Repetition Coding   53
  1.9 Capacity of the AWGN Channel   60
  Problem Set 3   68
  References   74
2 Communication over ISI Channels   77
  2.1 Waveform Shaping (1/2)   77
  2.2 Waveform Shaping (2/2)   83
  2.3 Optimal Receiver Architecture   90
  Problem Set 4   97
  2.4 Optimal Receiver in ISI Channels   99
  2.5 The Viterbi Algorithm   106
  2.6 The Viterbi Algorithm: Python Implementation   112
  Problem Set 5   121
  2.7 OFDM: Principle   126
  2.8 OFDM: Extension to General L-tap ISI Channels   132
  2.9 OFDM: Transmission and Python Implementation   138
  Problem Set 6   147
  References   150
3 Data Science Applications   151
  3.1 Community Detection as a Communication Problem   151
  3.2 Community Detection: The ML Principle   159
  3.3 An Efficient Algorithm and Python Implementation   168
  Problem Set 7   177
  3.4 Haplotype Phasing as a Communication Problem   181
  3.5 Haplotype Phasing: The ML Principle   188
  3.6 An Efficient Algorithm and Python Implementation   194
  Problem Set 8   202
  3.7 Speech Recognition as a Communication Problem   206
  3.8 Speech Recognition: Statistical Modeling   210
  3.9 Speech Recognition: The Viterbi Algorithm   215
  Problem Set 9   221
  3.10 Machine Learning: Connection with Communication   223
  3.11 Logistic Regression and the ML Principle   231
  3.12 Machine Learning: TensorFlow Implementation   239
  Problem Set 10   248
  References   254
Appendix A: Python Basics   257
Appendix B: TensorFlow and Keras Basics   275
Chapter 1
Communication over the AWGN Channel
Communication is the transfer of information from one end to another, and there exists a fundamental limit on the amount of information that can be transmitted reliably.
1.1 Overview of the Book Outline In this section, we will cover two main topics. Firstly, we will discuss the logistics of the book, including its organization and structure. Secondly, we will provide a brief overview of the book, including the story of how communication theory was developed and what topics will be covered in the book.
Prerequisite To fully benefit from this book, it is important to have a good understanding of two key topics: (i) signals & systems, and (ii) probability and random processes. This knowledge is typically obtained through introductory-level courses offered in the Department of Electrical Engineering, although equivalent courses from other departments are also acceptable. While taking an advanced graduate-level course on random processes (such as EE528 at KAIST or EE226 at UC Berkeley) is optional, it can be beneficial. If you feel uncomfortable with these topics even after taking the relevant course(s), we have included exercise problems throughout the book to help you practice and solidify your understanding.
The reason why familiarity with signals and systems is necessary for this book is that communication involves the transfer of information through signals, which are transmitted and received through physical media. Also, there is a system in place that acts as a black box between the transmitter and receiver. As such, this book deals with both signals and systems. We assume that many readers have taken a course on signals and systems. If not, you are strongly recommended to take a relevant course. It's okay to take the course while reading this book. The topics concerning signals and systems will not be covered at the beginning of this book.

The concept of probability is also essential to this book because it is related to the channel, which is the physical medium between the transmitter and receiver. The channel is a system that takes a transmitted signal and yields a received signal. However, the channel is not deterministic because there is a random entity (noise) added to it. In communication systems, the noise is typically additive. As a result, a deep understanding of probability is required for this book. While some readers may not be familiar with random processes, which are sequences of random variables, detailed explanations and exercise problems will be provided to aid in understanding.

Problem sets  There will be a total of 10 problem sets in this book, with one offered after every three sections. We highly encourage collaboration with your peers to solve these problem sets, as they are an effective way to enhance your learning. You can discuss, teach, and learn from one another to optimize your learning experience. Please note that solutions will only be accessible to instructors upon request. Some problems may require programming tools such as Python and TensorFlow, and we will be using Jupyter notebook. You can refer to the installation guide in Appendix A.1 or seek assistance from: https://jupyter.readthedocs.io/en/latest/install.html
We also provide tutorials for the programming tools in the appendices: (i) Appendix A for Python; and (ii) Appendix B for TensorFlow.

History of communication  We will explore the development of communication theory and provide an overview of the topics covered in this book. The history of communication can be traced back to dialogues between people, but the invention of the telegraph marked a significant breakthrough in the field of electronic communication. Morse code,¹ a simple transmission scheme, was the first example of an electronic communication system based on the observation that electrical signals can be transmitted over wires. This led to the development of the telephone by Alexander Graham Bell in the 1870s (Coe, 1995). Further advancements were made by utilizing electromagnetic waves, which led to the development of wireless telegraphy by Italian engineer Guglielmo Marconi. Later, this technology was further developed, leading to the invention of radio and TV.

¹ Do not mistake the term “code” used in the communication literature for computer programming languages like Python or C++. In this context, a code refers to a transmission scheme.

State of affairs in the early 20th century & Claude E. Shannon  In the early 20th century, several communication systems were developed, including the telegraph, telephone, wireless telegraphy, radio, and TV. During this time, Claude E. Shannon, widely recognized as the father of information theory, made an intriguing observation about these systems. Shannon noticed that engineering designs for communication systems were ad hoc and tailored for specific applications, with design principles differing depending on the signals of interest. Shannon was dissatisfied with this ad hoc approach and sought a simple and universal framework to unify all of these systems. He posed the following two questions in his effort to achieve this goal.

Shannon's questions  The first is the most fundamental question, regarding the possibility of unification.
Question 1: Is there a general unified methodology for designing communication systems?

Shannon's second question was a logical continuation of the first, seeking to address the issue of unification. He posited that if communication systems could be unified, then there might exist a universal currency for information, analogous to the dollar in economics. Since communication systems utilize diverse information sources such as text, voice, video, and images, the second question centers on the plausibility of a universal currency for information.

Question 2: Is there a common currency of information that can represent different information sources?

In a single stroke, Shannon was able to address both questions and develop a unified communication architecture (Shannon, 2001).

A communication architecture  Before delving into the details of the unified architecture, another term needs to be introduced in addition to the three mentioned earlier (transmitter, receiver, and channel). This new term is “information source,” which refers to the information one wishes to transmit, such as text, voice, and image pixels. Shannon believed that an information source needed to be processed before transmission, and he abstracted this process with a black box called an “encoder.” At the receiving end, there must be something that attempts to recover the information source from the received signals. Shannon abstracted this process with another black box called a “decoder.” Figure 1.1 illustrates these processes and represents the very first basic block diagram that Shannon introduced for a communication architecture.
Fig. 1.1 A basic communication architecture (information source → encoder → channel → decoder)
From Shannon's perspective, a communication system is simply a collection of an encoder and a decoder, so designing a communication system involves developing an encoder and decoder pair.

Representation of an information source  With the basic architecture (Fig. 1.1) in his mind, Shannon wished to unify all the different communication systems. Most engineers at that time attempted to transmit information sources directly without any significant modification, although the sources differ depending on the applications of interest. Shannon thought that this is the very reason why there were different systems. Shannon believed that in order to achieve unification, there must be a common currency that can represent different information sources, leading to a universal framework. His work on his master's thesis at MIT (Shannon, 1938) played a significant role in finding a universal way of representing information sources. His thesis focused on Boolean algebra, which showed that any logical relationship in circuit systems can be represented using 0/1 logic, or a binary string. This inspired Shannon to imagine that the same concept could be applied to communication, meaning that any type of information source can be represented with a binary string. He demonstrated that this is indeed the case and that the binary string, which he called “bits,” can represent the meaning of information. For instance, he showed that English letters can be represented by 5 bits each, since there are only 26 letters in the English alphabet (⌈log₂ 26⌉ = 5).

A two-stage architecture  Shannon's observation led him to propose bits as a common currency for information. He devised a two-stage architecture for the transmitter. The first block, called the “source encoder,” converts the information source into bits, and the second block, called the “channel encoder,” converts the bits into a signal that can be transmitted over a channel. The source encoder is related to what the information source looks like, and the channel encoder concerns the channel. The upper part of Fig. 1.2 illustrates this two-stage architecture. Similarly, the receiver consists of two blocks, but the way the blocks are structured is the opposite. We first convert the received signals into the bits that we sent at the transmitter (via the channel decoder). Next, we reconstruct the information source from the bits (via the source decoder).
Fig. 1.2 A two-stage communication architecture

Fig. 1.3 MODEM: MOdulator & DEModulator
The source decoder is nothing but the inverse function of the source encoder. Obviously, the source encoder should be a one-to-one mapping; otherwise, there is no way to reconstruct the information source. See the lower part of Fig. 1.2. The term “digital interface” is used to describe the portion of the communication system from the channel encoder through the channel to the channel decoder. This is universal because it is not tied to a specific type of information source. Regardless of the source, the input to the digital interface is always in the form of bits, which makes it a unified architecture. One primary objective of this book is to explore the process of designing this digital interface.

Modem  Another term that you may be familiar with is “modem”. Its purpose is to convert digital information into analog signals that can be transmitted through the physical world, which is analog. As illustrated in Fig. 1.3, the encoder must modulate the digital information (bits) into an analog signal, such as an electrical voltage signal, before it is transmitted through the channel. The decoder must also demodulate the received analog signal back into digital information (bits) to recover the transmitted information. The term “modem” is a combination of “modulator” and “demodulator”.
Fig. 1.4 A statistical model for an additive noise channel
Statistical models of the channel  The contents regarding the modem that we will cover in this book depend on the characteristics of the channel. The strategies for transmission and reception are impacted by the channel, and the more complex the channel's features, the more advanced the transmission-reception strategy required. Therefore, it is crucial to have a deep comprehension of the channel's behavior during modem design. One crucial aspect of the channel is its uncertainty. The received voltage signal is not a deterministic function of the transmitted signal. Although the channel's behavior in one experiment cannot be predicted, its average behavior, as seen over many experiments, is well-behaved in many practical situations. Additionally, the channel in most practical scenarios is characterized by addition: the received signal is the sum of the transmitted signal and additive noise, as shown in Fig. 1.4. The additive noise, represented by w, is typically described by a continuous random variable having a probability density function (pdf). If one is not familiar with the concept of random variables and their probability functions, they can refer to the many exercise problems on probability basics provided in Problem Sets 1 and 2. The statistical behavior of w, i.e., its pdf, is essential in modem design. Therefore, to explore the modem design process, we first require reasonable statistical models for the channel.

Book outline  This book has a dual objective. Firstly, it aims to illustrate the process of designing a modem using two prevalent statistical channel models. Part I focuses on the additive Gaussian channel, which is the simplest and most practically relevant. The noise added to the channel, denoted as w, follows the well-known Gaussian distribution with parameters μ and σ² representing the mean and variance, respectively. The calligraphic letter N is used to represent the Normal distribution, which is so named because it is commonly found in numerous contexts. This distribution was first described by the renowned German mathematician Carl Friedrich Gauss, and subsequently it was named the Gaussian distribution as a tribute to him. The corresponding probability density function is as follows (Bertsekas & Tsitsiklis, 2008):

$$ f_w(x) = \frac{1}{\sqrt{2\pi\sigma^2}}\, e^{-\frac{(x-\mu)^2}{2\sigma^2}}. $$
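As a quick, optional sanity check (a sketch that is not part of the original text; the numerical values are arbitrary), one can evaluate this formula directly and compare it with the built-in normal pdf from scipy.stats:

import numpy as np
from scipy.stats import norm

mu, sigma, x = 0.5, 2.0, 1.7   # arbitrary illustrative values
manual = np.exp(-(x - mu)**2 / (2 * sigma**2)) / np.sqrt(2 * np.pi * sigma**2)
print(manual, norm.pdf(x, loc=mu, scale=sigma))   # the two values coincide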
Part II of this book is devoted to the wireline channel, which involves establishing a channel through copper lines, such as in the popular Internet service DSL. This channel can be modeled as a concatenation of two blocks, as shown in Fig. 1.5.
Fig. 1.5 A statistical model for wireline channels
The first block is an LTI (linear time invariant) system, which is commonly encountered while studying Signals & Systems (Oppenheim et al., 1997). The second block is the same as before, capturing the additive Gaussian noise. Part II explores the rationale behind this modeling and explains how to design the modem accordingly. While the wireless channel, which underpins the prevalent 5G cellular system, is undoubtedly an interesting topic, Part III of the book will shift gears to explore something different. The book establishes fundamental principles that form the foundation for modem design in Parts I and II, and these principles are also essential in addressing many interesting problems in the rapidly growing field of data science. This constitutes the second goal of the book: to demonstrate the powerful role of the fundamental principles through four prominent data science applications. The first application, called community detection (Girvan & Newman, 2002; Fortunato, 2010; Abbe, 2017), is one of the most popular data science problems, with significant implications in domains such as social networks (Facebook, LinkedIn, Twitter, etc.) and biological networks. The second application concerns DNA sequencing and focuses on a crucial problem named Haplotype phasing, which has a close connection with community detection (Browning & Browning, 2011; Das & Vikalo, 2015; Chen, Kamath, Suh, & Tse, 2016; Si, Vikalo, & Vishwanath, 2014). The third application is speech recognition, a widely popular killer application nowadays (Vaseghi, 2008). Finally, the fourth application is machine learning classification (LeCun, Bottou, Bengio, & Haffner, 1998), which has received significant attention in the past decade. While these applications may not initially appear to be related to communication, they are closely connected through the fundamental principles to be explained in Parts I and II. Readers will figure this out as they progress through the book.
1.2 A Statistical Model for Additive Noise Channels Recap In the preceding section, we explored Shannon’s basic block diagram for a digital communication system, which is commonly referred to as the digital interface. This interface is used to send information in the form of digitized binary strings, or bits, as shown in Fig. 1.6. The process of designing a communication system involves creating a channel encoder and decoder, known as a modem. One primary objective of this book is to illustrate how to design a modem using statistical channel models that are relevant in practical scenarios. In Part I, we will concentrate on the simplest and most common practical channel, known as the additive Gaussian channel, where the additive noise follows a Gaussian distribution with mean μ and variance σ 2 . In the conclusion of the previous section, we stressed that comprehending channel properties is essential for modem design.
Outline In this section, we will explore how the characteristics of a channel can offer valuable insights into the design of transmission and reception strategies. This section consists of four parts. First, we will examine a simple channel example that will provide us with practical guidance on how to design the encoder and decoder. Next, we will develop a natural communication scheme that facilitates successful transmission. We will then calculate the number of bits that can be reliably transmitted using this scheme. Finally, we will identify a critical communication resource that affects the total number of transmittable bits and underscore the relationship between the communication resource and transmission capability.
Fig. 1.6 The basic block diagram of a digital communication system

A simple example  The additive Gaussian channel that we will deal with mainly in Part I is of the following additive structure:

y = x + w    (1.1)
where x indicates a transmitted signal, y denotes a received signal, and w is an additive noise. In order to get concrete feelings about the modem design, prior to the targeted Gaussian case, we will first consider a much simpler setting where w is assumed to be strictly within a certain range, say ±σ.

A simple transmission scheme  Suppose we want to send a single bit across the above channel. One natural communication scheme in this case is to transmit one of two possible candidates depending on the information content of the bit. For instance, we can send a voltage (say v₀) to transmit an information content of the bit being “zero”, and another voltage (say v₁) when transmitting an information content of the bit being “one”. Recall that w is assumed to satisfy −σ ≤ w ≤ σ. Hence, if we set (v₀, v₁) such that the distance between the two voltage levels exceeds the entire interval that the noise spans, i.e.,

|v₀ − v₁| > 2σ,    (1.2)
we can be certain that our communication of the one bit of information is reliable over this channel.

The total number of bits transmittable under the simple scheme  Physically, the voltage transmitted corresponds to some energy being spent. For simplicity, assume that the energy spent in transmitting a voltage of v Volts is v² Joules. In this context, one key question arises.
Question 1: How many bits can we reliably communicate with an energy constraint E?

Suppose we intend to transmit k bits. A natural extension of the previous scheme is to come up with 2^k candidates for voltage levels, since we now have 2^k possible patterns depending on the contents of the k bits. For reliable transmission, we can also set the distance between any two consecutive voltages as the maximum interval of the noise. Hence, we choose to transmit one of the following discrete voltage levels:

$$ -\sqrt{E},\ -\sqrt{E}+2\sigma,\ \ldots,\ -\sqrt{E}+2i\sigma,\ \ldots,\ +\sqrt{E} \qquad (1.3) $$

where i denotes a certain integer. In order to ensure successful transmission, we should have at least 2^k candidates for voltage levels, leading to a total of 2^k − 1 intervals. Hence, we should satisfy:

$$ (2^k - 1)\cdot(2\sigma) \le 2\sqrt{E}. \qquad (1.4) $$
Fig. 1.7 A mapping rule for sending two bits: Four candidates for voltage levels
For simplicity, let us assume that √E is divisible by σ. This then quantifies the exact number of transmittable bits as below:

$$ k = \log_2\left(1 + \frac{\sqrt{E}}{\sigma}\right). \qquad (1.5) $$

Figure 1.7 illustrates one possible mapping when we send two bits, so we have four discrete voltage levels. Here one can set √E = 3σ due to (1.4), and the voltage levels read:

$$ v_i = -\sqrt{E} + 2i\sigma, \quad i \in \{0, 1, 2, 3\}. \qquad (1.6) $$
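To make the mapping concrete, here is a small Python sketch (not from the original text; the function names and parameter values are illustrative) that constructs the PAM levels of (1.3) for a given energy budget and applies the nearest-neighbor rule under the bounded noise assumption:

import numpy as np

def pam_levels(E, sigma):
    # number of bits supported by the budget, following (1.5)
    k = int(np.floor(np.log2(1 + np.sqrt(E) / sigma)))
    # 2^k levels spaced by 2*sigma, spanning [-sqrt(E), +sqrt(E)] as in (1.3)
    return k, -np.sqrt(E) + 2 * sigma * np.arange(2 ** k)

def nearest_neighbor(y, levels):
    # declare the candidate voltage level closest to the received signal y
    return levels[np.argmin(np.abs(levels - y))]

E, sigma = 9.0, 1.0                          # sqrt(E) = 3*sigma, the setting of Fig. 1.7
k, levels = pam_levels(E, sigma)
print(k, levels)                             # 2 [-3. -1.  1.  3.]

x = levels[2]                                # transmit the level encoding one 2-bit pattern
w = np.random.uniform(-sigma, sigma)         # bounded noise, |w| <= sigma
print(nearest_neighbor(x + w, levels) == x)  # recovered correctly (boundary ties occur with probability 0)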
Relationship between energy and reliable information transmitted  Eq. (1.5) suggests that for a given energy constraint E, the number of bits we can communicate reliably is:

$$ \log_2\left(1 + \frac{\sqrt{E}}{\sigma}\right). \qquad (1.7) $$
Another natural question arises concerning the relationship between the energy budget and its additional transmission capability.
Question 2: To transmit an extra bit reliably, how much additional energy must we expend?

We can use the above expression (1.7) to answer this question. The new energy Ẽ required to send an extra bit of information reliably has to satisfy:

$$ 1 + \log_2\left(1 + \frac{\sqrt{E}}{\sigma}\right) = \log_2\left(1 + \frac{\sqrt{\tilde{E}}}{\sigma}\right) $$
$$ \Longrightarrow\quad 1 + \frac{\sqrt{\tilde{E}}}{\sigma} = 2\left(1 + \frac{\sqrt{E}}{\sigma}\right) $$
$$ \Longrightarrow\quad \sqrt{\tilde{E}} = \sigma + 2\sqrt{E}. $$
In other words, we need more than quadruple the energy to send just one extra bit of information reliably. Another interesting thing to notice is that the amount of reliable communication depends on the ratio between the transmit energy budget E and σ², which is related to the energy of the noise. This ratio E/σ² is called the signal-to-noise ratio (SNR for short) and will feature prominently in the other additive noise models we will see.

Look ahead  The simple channel model example provided us with insight into straightforward transmission and reception strategies. Additionally, it shed light on the relationship between a physical resource, such as energy, and the amount of information that we can communicate reliably. However, the deterministic channel model utilized in this example is rather simplistic. Specifically, the selection of σ may be overly cautious if we desire to guarantee that the additive noise falls within the range of ±σ with absolute certainty. This dilemma motivates us to understand the statistical behavior of w more precisely. We will delve into this topic in detail in the next section.
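Before moving on to the next section, here is an optional numeric sanity check (a sketch, not part of the original text) of the energy-bits relationship just derived: starting from a budget E that supports k bits, the new budget Ẽ = (σ + 2√E)² supports exactly k + 1 bits and exceeds 4E.

import numpy as np

sigma, E = 1.0, 9.0                          # E supports k = 2 bits
k = np.log2(1 + np.sqrt(E) / sigma)
E_new = (sigma + 2 * np.sqrt(E)) ** 2        # budget needed for one extra bit
k_new = np.log2(1 + np.sqrt(E_new) / sigma)

print(k, k_new)        # 2.0 3.0  -> exactly one extra bit
print(E_new, 4 * E)    # 49.0 36.0 -> more than quadruple the original budget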
1.3 Additive Gaussian Noise Channel and Python Exercise
Recap  Before focusing on the additive Gaussian channel, the previous section examined a much simpler scenario in which the additive noise w was bounded by upper and lower limits: −σ ≤ w ≤ σ. In this channel, we investigated a straightforward transmission scheme called pulse amplitude modulation (PAM), where a transmitted signal's voltage level is determined based on the pattern of multiple bits, say k bits (Proakis, Salehi, Zhou, & Li, 1994). To ensure the receiver's flawless recovery of the bits, each voltage level is separated by at least 2σ. The reception strategy employed in this scheme is called the nearest neighbor (NN) rule, which declares the voltage level closest to the received signal as the transmitted voltage signal. We subsequently established the tradeoff relationship between the energy budget E and the maximum number of transmittable bits:

$$ k = \log_2\left(1 + \frac{\sqrt{E}}{\sigma}\right). $$
Towards the end, we noted that the assumption about w made in the simple setting is not realistic. In practice, w adheres to a well-established distribution, namely the Gaussian (or Normal) distribution.
Outline This section aims to establish this fact, and we will do so in four parts. Firstly, we will employ the laws of physics to explore the origin of additive noise in practical communication systems. Based on this, we will formulate a set of mathematical assumptions that the additive noise satisfies. Subsequently, we will leverage these assumptions to demonstrate that the probability distribution of w conforms to the Gaussian distribution. Lastly, we will validate this analytical proof through an experimental demonstration in Python.
Physics To begin, let us examine the origin of additive noise in electronic communication systems. In such systems, signal levels are determined by voltage, which in turn is dictated by the movement of electrons. It was discovered by physicist John B. Johnson in the early 1900s that the behavior of electrons contributes to the induction of uncontrollable noise (Johnson, 1928). Johnson observed that the random agitation of electrons due to heat can cause voltage levels to fluctuate, making it difficult
to control the precise movement of electrons in reality. This random fluctuation in voltage levels was interpreted as a major source of noise. To understand the statistical behavior of this noise, Johnson collaborated with Harry Nyquist, a colleague at Bell Labs who was skilled in mathematics and known for his work on the Nyquist sampling theorem (Nyquist, 1928). Together, they developed a mathematical theory for the noise and established the Gaussian noise model that we will explore in this section. At the time, this noise was referred to as thermal noise, as it was dependent on temperature. Although other factors such as device imperfections and measurement inaccuracies can also contribute to noise, the primary source of randomness is the thermal noise, which is the focus of the mathematical theory developed by Johnson and Nyquist.

Assumptions made on the thermal noise  The mathematical theory starts with making concrete assumptions on the physical properties that the thermal noise respects. These are fourfold.

1. As mentioned earlier, the noise is due to the random movement of electrons. In practice, an electrical signal contains tons of electrons. So one natural assumption that one can make is: the additive noise is the overall consequence of many additive “sub-noises”. A natural follow-up assumption is that the number of sub-noises is infinity, which reflects the fact that there are tons of electrons in an electrical signal.
2. These electrons are known to be typically uncorrelated with each other. So the second assumption is: the sub-noises are statistically independent.
3. It is known that there is no particularly dominating electron that affects the additive noise most significantly. So the third assumption that one can make, as a simplified version of this finding, is: each sub-noise contributes the same amount of energy to the total energy in the additive noise.
4. Finally, the noise energy is not so big relative to the energy of an interested voltage signal. Obviously it does not blow up. So the last reasonable assumption is: the noise energy is finite.

Mathematical expression of the four assumptions  We translate the above four assumptions into mathematical language. To this end, we first introduce some notation to write the total additive noise w as the sum of m sub-noises n₁, ..., n_m:

w = n₁ + n₂ + ··· + n_m.    (1.8)

The mathematical expressions of the first and second assumptions are: m → ∞ and (n₁, ..., n_m) are statistically independent of each other. Here, independence means that the joint pdf of (n₁, ..., n_m) can be expressed as the product of the individual pdfs:

$$ f_{n_1, n_2, \ldots, n_m}(a_1, a_2, \ldots, a_m) = f_{n_1}(a_1)\, f_{n_2}(a_2) \cdots f_{n_m}(a_m) \qquad (1.9) $$
where aᵢ ∈ ℝ denotes a realization of the sub-noise nᵢ. If you are not familiar with the definition of statistical independence, please refer to some exercise problems in Problem Set 1, e.g., Problems 1.1 and 1.8 (b)(c). A mathematical assumption that we will take, which also concerns the third assumption, is: (n₁, ..., n_m) are identically distributed. The property of being independent and identically distributed is simply called i.i.d. Without loss of generality,² one can assume that w has zero mean. Here is why. Suppose we have a bias on w: E[w] = μ ≠ 0. We can then always subtract the bias from the received signal so that the mean of the effective noise is zero. More precisely, we subtract the bias μ from the received signal y = x + w to obtain:

y − μ = x + (w − μ).    (1.10)

Then, w − μ can be interpreted as the effective noise, and the mean of this is indeed zero. Due to the i.i.d. assumption, each sub-noise has the same amount of energy:

E[nᵢ²] = E[nⱼ²], ∀i, j ∈ {1, ..., m}.    (1.11)
The energy in the total additive noise w is

$$\begin{aligned}
E[w^2] &= E\Big[\Big(\sum_{i=1}^{m} n_i\Big)^2\Big] \\
&\overset{(a)}{=} \sum_{i=1}^{m} E[n_i^2] + \sum_{j \neq i} E[n_j n_i] \\
&\overset{(b)}{=} m\, E[n_1^2] + \sum_{j \neq i} E[n_j]\, E[n_i] \\
&\overset{(c)}{=} m\, E[n_1^2]
\end{aligned} \qquad (1.12)$$

where (a) is due to the linearity of the expectation operator; (b) comes from the i.i.d. assumption; and (c) is because of the zero-mean assumption E[w] = mE[n₁] = 0 (leading to E[nᵢ] = 0 as well). The i.i.d. assumption implies the (pairwise) independence between nᵢ and nⱼ (for j ≠ i), thereby leading to E[nⱼnᵢ] = E[nⱼ]E[nᵢ]. If you are not familiar with this, see Problem 1.2. Lastly, consider the finite energy assumption. Denoting this finite energy by σ², we see from (1.12) that the energy of the sub-noises must shrink as more and more sub-noises are added:

$$ E[n_i^2] = \frac{\sigma^2}{m}, \quad i \in \{1, \ldots, m\}. \qquad (1.13) $$
² The meaning of “without loss of generality” is that the general case can readily be boiled down to a simple case with some proper modification.
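As a quick numerical illustration of (1.12) and (1.13) (a sketch, not from the original text; the uniform sub-noise distribution is an arbitrary choice), the sum of m independent zero-mean sub-noises, each carrying energy σ²/m, keeps a total energy close to σ² no matter how large m is:

import numpy as np

rng = np.random.default_rng(0)
sigma2 = 2.0                                   # target total noise energy sigma^2

for m in [10, 100, 1000]:
    # m i.i.d. zero-mean sub-noises, each with variance sigma^2 / m
    sub = rng.uniform(-1, 1, size=(10_000, m)) * np.sqrt(3 * sigma2 / m)
    w = sub.sum(axis=1)                        # 10,000 realizations of w = n_1 + ... + n_m
    print(m, w.var())                          # stays close to sigma^2 = 2.0 for every m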
The statistical behavior of the thermal noise in the limit of m  We are interested in the statistical property of the total additive noise w. In the language of probability, we are asking for the probability distribution, or pdf, of the random variable w. It turns out that under the assumptions made on w as above, the probability distribution of w is Gaussian, as formally stated below.

Theorem 1.1  Let w = n₁ + n₂ + ··· + n_m. Assume that (n₁, ..., n_m) are i.i.d. such that E[nᵢ] = 0, E[nᵢ²] = σ²/m, and σ² < ∞ (finite). Then, as m → ∞,

w ∼ N(0, σ²).    (1.14)
One can prove this theorem by relying on a very well-known theorem in probability and statistics, called the central limit theorem (Bertsekas & Tsitsiklis, 2008). Proof of Theorem 1.1 The pdf of w depends obviously on the pdfs of the sub-noises. We exploit one key statistical property that is very well-known when the sub-noises are independent.
> A key property on the sum of independent random variables: The pdf of the sum of independent random variables can be represented as the convolution of the pdfs of the random variables involved in the sum.

For those who are not comfortable with this, we provide an exercise problem in Problem 1.2. Using the above property, we can then represent f_w(a) as the convolution of all the pdfs of the sub-noises:

$$ f_w(a) = f_{n_1}(a) * f_{n_2}(a) * \cdots * f_{n_m}(a) \qquad (1.15) $$

where * indicates the convolution operator, defined as $f_{n_1}(a) * f_{n_2}(a) := \int_{-\infty}^{+\infty} f_{n_1}(t)\, f_{n_2}(a-t)\, dt$. But we face a challenge: the convolution formula is complicated, as the meaning of the word “convoluted” suggests. Even worse, we have an infinite number m of such convolutions (more precisely, m − 1). Another property on the sum of independent random variables comes to the rescue. The convolutions can be greatly simplified in the Fourier or Laplace transform domain. Since the pdf is a continuous signal, the transform of interest is the Laplace transform. The simplification is based on the following key property:
> Another property on the sum of independent random variables: The Laplace transform of the convolution of multiple pdfs is the product of the Laplace transforms of all the pdfs involved in the convolution.

To state what it means, recall the definition of the (double-sided) Laplace transform F_w(·) w.r.t. the pdf f_w(·) of w:

$$ F_w(s) := \int_{-\infty}^{+\infty} f_w(a)\, e^{-sa}\, da. \qquad (1.16) $$
Now using the above property that the convolution is translated to multiplication in the Laplace transform domain, we can re-write (1.15) as:

$$ F_w(s) = [F_n(s)]^m \qquad (1.17) $$
where F_n(·) indicates the Laplace transform of the pdf f_n(·) of the generic random variable n that represents a sub-noise nᵢ. Next, how do we proceed with (1.17)? More precisely, how do we compute F_w(s) with what we know: E[n] = 0 and E[n²] = σ²/m? It turns out that this moment information plays a crucial role in the computation. A key observation is that the moments appear as coefficients in the Taylor series expansion of F_n(s). And this expansion leads to an explicit expression for F_w(s), thereby hinting at the pdf of w. To see this, let us try to obtain the Taylor series expansion of F_n(s) (Stewart, 2015). To this end, we first need to compute the kth derivative of F_n(s):

$$ F_n^{(k)}(s) := \frac{d^k F_n(s)}{ds^k} = \int_{-\infty}^{\infty} (-1)^k a^k f_n(a)\, e^{-sa}\, da \qquad (1.18) $$

where the second equality is due to the definition of the Laplace transform: $F_n(s) := \int_{-\infty}^{+\infty} f_n(a)\, e^{-sa}\, da$. Applying the Taylor expansion at s = 0, we get:

$$ F_n(s) = \sum_{k=0}^{\infty} \frac{F_n^{(k)}(0)}{k!}\, s^k. \qquad (1.19) $$

Plugging s = 0 into (1.18), we obtain:

$$ F_n^{(k)}(0) = \int_{-\infty}^{\infty} (-1)^k a^k f_n(a)\, da = (-1)^k E[n^k]. $$
This together with (1.19) yields:

$$ F_n(s) = \sum_{k=0}^{\infty} \frac{(-1)^k E[n^k]}{k!}\, s^k. \qquad (1.20) $$

Applying E[n] = 0 and E[n²] = σ²/m to the above, we get:
$$ F_n(s) = 1 + \frac{\sigma^2}{2m}\, s^2 + \sum_{k=3}^{\infty} \frac{(-1)^k E[n^k]}{k!}\, s^k. \qquad (1.21) $$
Now what can we say about E[n^k] for k ≥ 3 in the above? Notice that E[n^k] = E[(n²)^{k/2}] and E[n²] = σ²/m decays like 1/m. It turns out this leads to the fact that E[(n²)^{k/2}] decays like 1/m^{k/2}, which exhibits a faster decaying rate for k ≥ 3, as compared to 1/m. This scaling yields:
$$\begin{aligned}
F_w(s) &= \lim_{m \to \infty} \left(1 + \frac{\sigma^2}{2m}\, s^2 + \sum_{k=3}^{\infty} \frac{(-1)^k E[n^k]}{k!}\, s^k\right)^m \\
&= \lim_{m \to \infty} \left(1 + \frac{\sigma^2}{2m}\, s^2\right)^m \\
&= e^{\frac{\sigma^2 s^2}{2}}
\end{aligned} \qquad (1.22)$$

where the last equality is due to the fact that

$$ e^x = \lim_{m \to \infty} \left(1 + \frac{x}{m}\right)^m. \qquad (1.23) $$
From (1.22), what can we say about f_w(a)? To figure this out, we need to apply the inverse Laplace transform: $f_w(a) = \text{InverseLaplace}(F_w(s))(a) := \frac{1}{2\pi i} \lim_{T \to \infty} \int_{\text{Re}(s) - iT}^{\text{Re}(s) + iT} F_w(s)\, e^{sa}\, ds$. You may feel a headache coming on because it looks very complicated. So instead we will do reverse engineering (guess-&-check) to figure out f_w(a). A good thing about the Laplace transform is that it is a one-to-one mapping. Also, one can easily derive:

$$ \text{LaplaceTransform}\left(\frac{1}{\sqrt{2\pi\sigma^2}}\, e^{-\frac{a^2}{2\sigma^2}}\right) = e^{\frac{\sigma^2 s^2}{2}}. $$

See Problem 1.3 for the derivation. This together with (1.22) gives:

$$ f_w(a) = \frac{1}{\sqrt{2\pi\sigma^2}}\, e^{-\frac{a^2}{2\sigma^2}}, \quad a \in \mathbb{R}. \qquad (1.24) $$
This is indeed the Gaussian distribution with mean 0 and variance σ². This completes the proof.
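As a numerical cross-check of the transform pair used in the last step (a sketch, not part of the original text; the values of σ and s are arbitrary), one can integrate e^{−sa} f_w(a) directly and compare the result with e^{σ²s²/2}:

import numpy as np
from scipy.integrate import quad

sigma, s = 1.3, 0.7                            # arbitrary test values
f_w = lambda a: np.exp(-a**2 / (2 * sigma**2)) / np.sqrt(2 * np.pi * sigma**2)

# double-sided Laplace transform of the Gaussian pdf, evaluated at s
F_w, _ = quad(lambda a: np.exp(-s * a) * f_w(a), -np.inf, np.inf)
print(F_w, np.exp(sigma**2 * s**2 / 2))        # the two values agree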
Python exercise: Empirical verification of Theorem 1.1  Remember that Theorem 1.1 holds for an arbitrary distribution of the sub-noises nᵢ. Here we verify this for a simple binary case where

$$ P\left(n_i = +\tfrac{1}{\sqrt{m}}\right) = P\left(n_i = -\tfrac{1}{\sqrt{m}}\right) = \tfrac{1}{2}. $$

This way, we respect E[nᵢ] = 0 and E[nᵢ²] = 1/m for all i ∈ {1, ..., m}. To create these random variables, we utilize a built-in function that is included in one of the helpful statistical packages, scipy.stats.

from scipy.stats import bernoulli

X = bernoulli(0.5)        # Bernoulli(1/2) random variable
X_samples = X.rvs(100)    # draw 100 i.i.d. samples
print(X_samples)

[0 1 1 0 ... 0 0 1 1]   (a length-100 array of 0s and 1s)
The term bernoulli refers to the distribution of a binary (or Bernoulli) random variable. This type of random variable is named after Jacob Bernoulli, a mathematician from Switzerland in the 1600s. Jacob Bernoulli utilized this binary random variable to discover the Law of Large Numbers (LLN), one of the fundamental laws in mathematics. As a result, the binary random variable is also known as the Bernoulli random variable. For further information on the LLN, please see Problem 8.3. To generate numerous i.i.d. samples (in this example, 100 samples), we use the function X.rvs(100). Since the bernoulli random variable takes 1 or 0, we manipulate it to take +1/√m or −1/√m instead. See below a code for the translation:

import numpy as np

m = 100
n_samples = (2*X_samples - 1)/np.sqrt(m)   # map {0, 1} to {-1/sqrt(m), +1/sqrt(m)}
print(n_samples)
print(n_samples.mean())
print(n_samples.std())

[-0.1 -0.1  0.1 -0.1 -0.1  0.1  0.1 -0.1 ...  0.1  0.1]
0.016000000000000004
0.09871170143402452

Notice that the mean and standard deviation of the 100 i.i.d. samples are close to the ground-truth counterparts 0 and 1/√m = 0.1, respectively. Next we construct w = n₁ + n₂ + ··· + n_m. Since we are interested in figuring out an empirical density of w, we generate multiple independent copies of w.
num_copies = 1000
n_samples = []
for i in range(m):
    # each append adds one sub-noise n_i, drawn for all 1000 copies at once
    n_samples.append( (2*X.rvs(num_copies) - 1)/np.sqrt(m) )
w_copies = sum(n_samples)   # 1000 independent copies of w = n_1 + ... + n_m
print(w_copies.shape)
print(w_copies.mean())
print(w_copies.std())

(1000,)
0.057000000000000016
0.9983942107203948
Note that the independent copies of w exhibit a mean close to zero and a standard deviation close to 1. To verify whether the statistics of the independent copies adhere to a Gaussian distribution, we plot the density histogram of the independent copies using a statistical visualization package called seaborn.

import matplotlib.pyplot as plt
import seaborn as sns
from scipy.stats import norm

w = np.arange(-4., 4., 0.01)
plt.figure(figsize=(4,4), dpi=150)
plt.plot(w, norm.pdf(w), color='red', label='Gaussian pdf')
sns.histplot(w_copies, stat='density', bins=10, label='empirical')
plt.xlabel('w')
plt.title('m=100')
plt.legend()
plt.show()
Here norm.pdf(w) implements the standard Gaussian distribution (with mean 0 and variance 1), and sns.histplot draws the empirical density of w_copies. Notice in Fig. 1.8 that the empirical density respects the ground-truth Gaussian distribution to some extent.
Fig. 1.8 Empirical density of the sum of i.i.d. random variables: m = 100
As we increase the value of m for the summation, we observe that the empirical density becomes closer to the actual distribution, as shown in Fig. 1.9. As a summary, we provide the complete code for the case where m is set to 10,000.

from scipy.stats import bernoulli
from scipy.stats import norm
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

m = 10000
num_copies = 1000
X = bernoulli(0.5)
n_samples = []
for i in range(m):
    n_samples.append( (2*X.rvs(num_copies) - 1)/np.sqrt(m) )
w_copies = sum(n_samples)

w = np.arange(-4., 4., 0.01)
plt.figure(figsize=(5,5), dpi=200)
plt.plot(w, norm.pdf(w), color='red', label='Gaussian pdf')
sns.histplot(w_copies, stat="density", bins=20, label='empirical')
plt.xlabel('w')
plt.legend()
plt.title('m=10000')
plt.show()
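As an optional, more quantitative check (not part of the original text), one can also run a Kolmogorov-Smirnov test of the generated copies against the standard normal; a large p-value is consistent with Theorem 1.1:

from scipy.stats import kstest

# compare the 1000 copies of w against N(0, 1)
print(kstest(w_copies, 'norm'))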
Fig. 1.9 Empirical density of the sum of i.i.d. random variables: m = 10,000
Look ahead By utilizing certain assumptions regarding the additive thermal noise based on principles from physics, we established that the additive noise can be accurately represented by a Gaussian distribution. We also confirmed this through an empirical verification using Python. In the following section, we will shift our focus back to our primary topic, which is examining the design of the modem, but now under the additive Gaussian channel. We will explore the relationship between the number of transmission bits and the energy budget, as we did in the earlier stage with the simpler channel setting.
Problem Set 1
Problem 1.1 (Rationale behind the definition of mutual independence) Consider a sample space of 8 equiprobable outcomes: Ω = {ω_1, ω_2, . . . , ω_8}; P(ω_i) = 1/8, ∀ω_i ∈ Ω. Come up with three events (A_1, A_2, A_3) such that:

P(A_1) = P(A_2) = P(A_3) = 0.5;   (1.25)
P(A_1 ∩ A_2) = P(A_1 ∩ A_3) = 1/4, P(A_2 ∩ A_3) = 1/8;   (1.26)
P(A_1 ∩ A_2 ∩ A_3) = P(A_1)P(A_2)P(A_3).   (1.27)
Note: The answer is not unique.
Remark: Observe in your example that A_2 and A_3 are not independent, as P(A_2 ∩ A_3) = 1/8 ≠ 1/4 = P(A_2)P(A_3). The definition of independence would be very strange if it allowed (A_1, A_2, A_3) to be independent while A_2 and A_3 are dependent. This illustrates why the definition of mutual independence requires the independence for every subset of (A_1, A_2, A_3):

P(∩_{i∈I} A_i) = ∏_{i∈I} P(A_i),  ∀I ⊆ {1, 2, 3},   (1.28)

rather than just the triplet independence:

P(∩_{i=1}^{3} A_i) = ∏_{i=1}^{3} P(A_i).   (1.29)
Problem 1.2 (Basics on the sum of independent random variables) Suppose that continuous random variables (r.v.'s) X and Y are independent. Let Z = X + Y.
(a) Show that E[XY] = E[X]E[Y] and var[Z] = var[X] + var[Y]. Here var[X] denotes the variance of X: var[X] := E[(X − E[X])²].
(b) Show that var[aX] = a² var[X], ∀a ∈ R.
(c) Show that the probability density function (pdf) f_Z(·) is given by

f_Z(a) = f_X(a) ∗ f_Y(a) := ∫_{−∞}^{∞} f_X(t) f_Y(a − t) dt

for a ∈ R.
(d) Consider the (double-sided) Laplace transform of f_Z(·):

F_Z(s) = ∫_{−∞}^{∞} e^{−sa} f_Z(a) da.

Show that F_Z(s) = F_X(s)F_Y(s) where F_X(s) and F_Y(s) indicate the Laplace transforms of f_X(·) and f_Y(·) respectively.

Problem 1.3 (Laplace transform of the Gaussian distribution) Let X be a Gaussian r.v. with mean 0 and variance σ². Let f_X(·) be the pdf of X.
(a) Compute the Laplace transform of f_X(·).
(b) Using the fact that the Laplace transform is a one-to-one mapping (that we mentioned in Sect. 1.3), argue that the inverse Laplace transform of e^{σ²s²/2} is the Gaussian density with mean 0 and variance σ².

Problem 1.4 (Easy yet crucial bounds on probability) Consider two events A_1 and A_2. Prove the following inequalities:
(a) P(A_1 ∩ A_2) ≤ P(A_1).
(b) P(A_1) ≤ P(A_1 ∪ A_2).
(c) P(A_1 ∪ A_2) ≤ P(A_1) + P(A_2).
Remark: The bound in part (c) is called the union bound.

Problem 1.5 (Bits) In Sect. 1.1, we learned that bits are a common currency of information that can represent any type of information source. Consider an image signal that consists of many pixels. Each pixel is represented by three real values which indicate the intensity of the red, green and blue color components respectively. Here we assume that each value is quantized, taking one of 256 equally spaced values in [0, 1), i.e., 0/256, 1/256, 2/256, . . . , 254/256, 255/256. Explain how bits can represent such an image source.

Problem 1.6 (Shannon's two-stage architecture) Consider Shannon's unified communication architecture that we explored in Sect. 1.1.
(a) Draw Shannon's two-stage architecture and point out the digital interface.
(b) In Sect. 1.1, we learned that the source decoder is an inverse function of the source encoder. A student claims that, as in the source coding case, the channel decoder is an inverse function of the channel encoder. Prove or disprove the student's claim.

Problem 1.7 (The simple noise example) Consider the simple noise setting that we studied in Sect. 1.2, wherein the noise w is strictly within ±σ. Suppose we accept a 90% reliability rate (i.e., accept a 10% error) and know some statistics on the noise: P(−σ/2 ≤ w ≤ σ/2) = 0.9. Suppose we use only one time slot and have the same energy constraint as in the 100% reliability case. How many more bits (at least) can we transmit? For simplicity, we assume that the number of bits can be any real value, not limited to an integer.

Problem 1.8 (True or False?)
(a) There is a system which is linear but not time invariant. However, there exists no system such that it is time invariant while being nonlinear.
(b) Suppose two events A and B are independent. Then, their complementary events A^c and B^c are independent as well.
(c) Let X, Y, Z be three r.v.'s defined on the same probability space. Suppose X and Y are pairwise independent, Y and Z are pairwise independent, and X and Z are pairwise independent. Then, X, Y, Z are mutually independent.
1.4 Optimal Receiver Principle
Recap In Sect. 1.2, we examined a simple channel setting where the additive noise falls within a range of ±σ voltages. We developed a basic transmission scheme, along with a corresponding reception strategy: PAM and the NN rule. After implementing these tx/rx schemes, we determined the relationship between the energy budget and the number of bits that can be transmitted. Since this channel is not representative of real-world scenarios, we delved into practical communication systems and revealed that the additive noise in such systems can be modeled as a random variable with a Gaussian distribution.
Outline In this section, we will delve deeper into the realistic additive channel, which is known as the additive Gaussian channel. The section consists of three parts. Firstly, we will examine a simple transmission scheme, using the same scheme as before, which is PAM. Secondly, we will explore a reception strategy. In the previous simple channel setting, the NN rule allowed for perfect transmission as long as the minimum distance between any two voltage levels was greater than 2σ . However, in the Gaussian channel, we must consider the possibility of transmission failures, as the Gaussian noise level can exceed +σ (or fall below −σ ), leading to decision errors under the NN rule. Given that transmission failures are unavoidable, our natural goal is to develop the best reception strategy possible, while acknowledging the potential for errors in decision making. This can be mathematically formulated as finding the decision rule that maximizes the success rate for transmission, or equivalently, minimizes the probability of error. In the second part, we will derive the optimal decision rule. Finally, we will demonstrate that under a reasonable assumption (to be explained later), the optimal decision rule is the same as the intuitive NN rule that we studied earlier. As the concept of reliability (or success rate) is introduced in the additive Gaussian channel, we are now interested in characterizing the tradeoff between three quantities: (i) the energy budget; (ii) the number of reliably transmitted bits; and (iii) reliability. This will be the focus of the next section. In this section, we will only consider two of these quantities: energy and reliability, specifically in the case of sending only one bit. A simple transmission scheme Consider the simple case of sending one bit: b = 1 or 0. Given the energy budget E, the transmission scheme that we will take is the
same as before. That is, PAM: transmitting a +√E-voltage signal for b = 1 while sending a −√E-voltage signal for b = 0, as illustrated in Fig. 1.10.
Factors that affect the decision rule Before we proceed with deriving the optimal decision rule, it is important to first identify and describe the two primary factors that impact the rule. These factors play a crucial role in determining the reliability of the transmission.
• A priori knowledge: If we have some prior knowledge about the statistical behavior of the information bit b, it may impact the decision rule. This is because, in extreme cases where we have complete knowledge that the information bit is 1, we would not need to refer to the received voltage y, and can simply declare that the information bit is 1. However, in most cases, we do not have access to such prior information. Therefore, we assume that the information bit is equally likely to be 1 or 0 in such cases.
• Noise statistics: Having knowledge of the noise's statistical behavior would assist the receiver in making a decision. For instance, if the noise is likely to be centered around zero, an effective receiver would be one that selects the transmit voltage that is closer to the received voltage among the two possible options. This is precisely the Nearest Neighbor (NN) rule that was previously discussed.
In the remainder of this section, we will demonstrate that under the assumption that the information bit is equally likely, the NN decision rule optimizes reliability.
The optimal decision rule To begin with, let us list all the information that the receiver has. Also, see Fig. 1.11 that illustrates the action taken at the receiver.
1. The a priori probabilities of the two values the information bit can take. As mentioned earlier, we will assume that these are equal (0.5 each).
2. The received signal y.
3. The encoding rule: We assume to know how the information bit is mapped into a voltage at the transmitter. This means that the mapping rule illustrated in Fig. 1.10 should be known to the receiver. This is a reasonable assumption because in practice, we can fix the transmission rule as a communication protocol that both transmitter and receiver should agree upon.
The reliability of communication conditioned on a specific received voltage level (say y = a) is simply the probability of a correct decision event, denoted by C (the information bit b being equal to the estimate b̂(a)):

P(C | y = a) = P(b = b̂(a) | y = a).   (1.30)

Fig. 1.10 Pulse amplitude modulation (PAM) in the case of sending one bit
Fig. 1.11 The basic block diagram of a receiver
We want to maximize P(C | y = a), so we should choose b̂(a) as per:

b̂(a) = arg max_{i∈{0,1}} P(b = i | y = a)   (1.31)
where "arg max" means "argument of the maximum". The quantity P(b = i | y = a) in the above is known as the a posteriori probability. Why "posteriori"? This is because the unconditional probability P(b = i) is called the a priori probability. So P(b = i | y = a) can be interpreted as the altered probability after we make an observation. The word "posteriori" is a Latin word that means "after". Hence, this optimum rule is called the Maximum A Posteriori probability (MAP) rule.
Derivation of the MAP rule Let us try to prove what we claimed earlier: the MAP rule is the same as the NN rule assuming that P(b = i) = 1/2 for all i ∈ {0, 1}. To prove this, we need to compute the quantity of interest P(b = i | y = a), possibly using the three pieces of information that the receiver has access to (enumerated at the beginning of the previous subsection). We face a challenge in the computation though. The challenge comes from the fact that the Gaussian statistics of our interest are described for analog voltage-level signals. This incurs the following technical problem in calculating the a posteriori probability:

P(y = a | b = i) and P(y = a)   (1.32)

are both zero. The chance that an analog noise level is exactly a value we want is simply zero. So we cannot use the Bayes rule which is crucial in computing the a posteriori probability. Recall that the Bayes rule (Bertsekas & Tsitsiklis, 2008) is:

P(A | B) = P(A) P(B | A) / P(B)   (1.33)

for any two events A and B where P(B) ≠ 0. To resolve this issue, the a posteriori probability is defined as

P(b = i | y = a) := lim_{ε→0} P(b = i | y ∈ (a, a + ε)).   (1.34)

For ε > 0, we can now use the Bayes rule to obtain:
P(b = i | y = a) (a)= lim_{ε→0} P(b = i) P(y ∈ (a, a + ε) | b = i) / P(y ∈ (a, a + ε))
(b)= lim_{ε→0} P(b = i) f_y(a | b = i) / f_y(a)
= P(b = i) f_y(a | b = i) / f_y(a)   (1.35)

where (a) comes from the definition (1.34) and the Bayes rule; and (b) follows from the definition of the pdfs f_y(·) and f_y(·| b = i) w.r.t. the analog voltage signals. Here the conditional pdfs, called the likelihoods in the literature,

f_y(a | b = 1) and f_y(a | b = 0)   (1.36)

are to be calculated based on the statistical knowledge of the channel noise. So, the MAP rule is as follows.
The MAP rule
Given y = a, we decide b̂ = 1 if

P(b = 1) f_y(a | b = 1) ≥ P(b = 0) f_y(a | b = 0)   (1.37)

and b̂ = 0 otherwise.
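To make the rule concrete, here is a minimal Python sketch of (1.37) for the one-bit PAM encoding; the prior p1, the energy E, and the noise standard deviation sigma are illustrative values, not quantities prescribed by the text:

import numpy as np
from scipy.stats import norm

def map_decide(a, E=1.0, sigma=1.0, p1=0.5):
    # Likelihoods f_y(a | b = 1) and f_y(a | b = 0) for PAM over the
    # additive Gaussian channel: y = x + w with x = +sqrt(E) or -sqrt(E).
    like1 = norm.pdf(a, loc=+np.sqrt(E), scale=sigma)
    like0 = norm.pdf(a, loc=-np.sqrt(E), scale=sigma)
    # The MAP rule (1.37): decide 1 if P(b=1) f_y(a|b=1) >= P(b=0) f_y(a|b=0).
    return 1 if p1 * like1 >= (1 - p1) * like0 else 0

print(map_decide(0.2))           # equal priors: reduces to the NN rule (a >= 0 gives 1)
print(map_decide(0.2, p1=0.1))   # a strong prior on b = 0 can flip the decision

Note that with equal priors the prior factors cancel and only the likelihood comparison remains; this is exactly the simplification pursued next.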
The ML decision rule Under the assumption that the information bit is equally likely, the MAP rule simplifies even further. It now suffices to just compare the two likelihoods (the two quantities in (1.36)). The decision rule is then to decide that b̂ is 1 if

f_y(a | b = 1) ≥ f_y(a | b = 0),   (1.38)

and 0 otherwise. This rule is called the maximum likelihood (ML) rule. In the Gaussian channel, it is easy to calculate the conditional pdf:

f_y(a | b = 1) (a)= f_y(a | x = +√E)
(b)= f_w(a − √E | x = +√E)
(c)= f_w(a − √E)   (1.39)

where (a) comes from the fact that the event b = 1 is equivalent to the event x = +√E due to our encoding rule; (b) is due to the fact that the event y = a is equivalent to the event w = a − √E (under x = +√E); and (c) is because of the independence
Fig. 1.12 The ML rule superposed on Fig. 1.10
between w and x. Here f_w(v) = (1/√(2πσ²)) e^{−v²/(2σ²)}. So, the ML rule for the Gaussian channel is the following. We decide b̂ = 1 if
f_w(a − √E) ≥ f_w(a + √E)   (1.40)

and b̂ = 0 otherwise. Using the Gaussian pdf, we can further simplify the ML rule. We decide b̂ = 1 if

f_w(a − √E) ≥ f_w(a + √E)   (1.41)
⟺ (a − √E)² ≤ (a + √E)²   (1.42)
⟺ 4√E a ≥ 0   (1.43)
⟺ a ≥ 0.   (1.44)
In other words, the ML decision rule takes the received signal y = a and does the following: a ≥ 0 ⟹ 1 was sent; else, 0 was sent. Figure 1.12 illustrates the ML decision rule. The decision rule picks the transmitted voltage level that is closer to the received signal (closer in the usual sense of the Euclidean distance). Hence, the maximum likelihood (ML) rule is exactly the nearest neighbor (NN) rule.
Look ahead The upcoming section will involve calculating the reception reliability using the ML decision rule. Our main objective will be to explore the relationship between the energy constraint of transmission, the variance of the noise, and a certain level of reliability.
1.5 Error Probability Analysis and Python Simulation
Recap In the previous section, we delved into the transmission and reception schemes for the additive Gaussian channel. Unlike the simple channel setting studied in Sect. 1.2, the additive Gaussian channel is plagued by inevitable communication errors. The concept of errors led us to introduce the optimal decision rule, which is defined as the one that maximizes the success rate of communication, also known as reliability, or conversely minimizes the probability of error. Based on this definition, we demonstrated that the optimal decision rule is the maximum a posteriori probability (MAP) rule:

b̂_MAP(a) = arg max_{i∈{0,1}} P(b = i | y = a).   (1.45)
Following that, we demonstrated that if we assume that the a priori probabilities are equal, then the MAP rule simplifies to the maximum likelihood (ML) rule:

b̂_ML(a) = arg max_{i∈{0,1}} f_y(a | b = i).   (1.46)
We also discovered that the ML rule aligns with the intuitive NN rule:

b̂_NN(a) = arg min_{i∈{0,1}} |a − (2i − 1)√E|.   (1.47)
One of our main goals in Part I is to characterize the tradeoff relationship between the following three: (i) energy budget E; (ii) the number of transmission bits; and (iii) reliability of communication. Obviously this requires us to know how to compute reliability when using the NN rule.
Outline In this section, we aim to achieve the goal. It consists of four parts. Firstly, we will revisit the simple transmission and reception schemes when sending one bit. Then, we will calculate the reliability of the communication. After that, we will analyze the tradeoff between reliability and energy budget. Finally, we will validate the tradeoff by conducting a simulation using Python.
Recall the transmission/reception schemes Let us start by reviewing the transmission/reception schemes under the simple setting where we transmit only one bit. We
employed a simple transmission scheme: PAM. We used the optimal decision rule, which was shown to be the NN rule. See Fig. 1.13.
Reliability of communication Conditioned on y = a, the reliability of communication is the probability of the success event that the information bit b is equal to its estimate b̂(a):

P(C | y = a) = P(b = b̂(a) | y = a).   (1.48)
Using this quantity poses a problem since it depends on a specific realization of the received signal y, resulting in variation. Consequently, it cannot be a representative quantity. To address this issue, we need to identify a representative quantity that is not dependent on a specific realization. One possible solution is to consider the expectation of this quantity, which is a weighted sum of reliabilities over all possible realizations of a. The weight assigned to each realization is determined by the probability density function f y (a).
∫_{−∞}^{∞} f_y(a) P(C | y = a) da
(a)= ∫_{−∞}^{∞} P(correct decision, y = a) da
(b)= P(C)   (1.49)
where (a) is due to the definition of conditional probability; and (b) is because of the total probability law. If you are not familiar with the total probability law, refer to Problem 2.1. This is one of the most important laws in probability.
Probability of error Alternatively, we may consider the average error probability:

P(E) = 1 − P(C)   (1.50)

Fig. 1.13 The transmission/reception schemes: the PAM and the NN rule
where E denotes an error event. We will focus on evaluating P(E) instead. The receiver makes an error if it decides that 1 was sent when 0 was sent, or vice versa. The average error is a weighted sum of the probabilities of these two types of error events, with the weights being equal to the a priori probabilities of the information bit: P(E) = P(E, b = 0) + P(E, b = 1) = P(b = 0)P(E|b = 0) + P(b = 1)P(E|b = 1).
(1.51)
Why? Again this is due to the total probability law. More precisely, the law dictates the first equality in the above, and the second equality comes from the definition of conditional probability. As mentioned earlier, we assume that the a priori probabilities are equal (to 0.5 each). Let us focus on one of the error events by assuming that the information bit was actually 0. Then with the NN rule, we get:

P(E | b = 0) = P(b̂(y) = 1 | b = 0)
(a)= P(y > 0 | b = 0)
= P(−√E + w > 0 | b = 0)
(b)= P(w/σ > √E/σ)
(c)= ∫_{√E/σ}^{∞} (1/√(2π)) e^{−z²/2} dz
(d)= Q(√E/σ)   (1.52)

where (a) is due to the NN rule³; (b) is because of the independence between w and b; (c) follows from the fact that w/σ ∼ N(0, 1) (why? please see Problem 1.2); and (d) comes from the definition of Q(a) := ∫_a^∞ (1/√(2π)) e^{−z²/2} dz. The Gaussian random variable with mean 0 and variance 1 is called the standard Gaussian. It is an important and useful random variable, and the Q-function is tabulated in probability textbooks and on Wikipedia. This integration can also be computed numerically by using the command erfc in Python (defined in the scipy.special module):
erfc(x) := ∫_x^∞ (2/√π) e^{−t²} dt.

The relation between Q(a) and erfc(x) is:

³ For simplicity of analysis, we assume that for the event y = 0, b̂ is decided to be 0, although we should flip a coin in this case as per the NN rule. Since the event has measure zero (the probability of the event occurring is 0), the error analysis remains the same.
Q(a) := ∫_a^∞ (1/√(2π)) e^{−z²/2} dz
(a)= ∫_{a/√2}^∞ (1/√π) e^{−t²} dt
= (1/2) · erfc(a/√2)

where (a) comes from the change of variable t := z/√2 (dz = √2 dt). Let us now consider P(E | b = 1). The error does not depend on which information bit is transmitted. The complete symmetry of the mapping from the bit values to the voltage levels and the NN decision rule (see Fig. 1.13) would suggest that the two error probabilities are identical. For completeness, we go through the calculation for P(E | b = 1) and verify that it is indeed the same:

P(E | b = 1) = P(b̂(y) = 0 | b = 1)
= P(y ≤ 0 | b = 1)
= P(√E + w ≤ 0 | b = 1)
= P(w/σ ≤ −√E/σ)
= Q(√E/σ).   (1.53)
This together with (1.52) gives:

P(E) = Q(√E/σ).   (1.54)
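Before moving on, here is a quick numerical sanity check of the Q-function and the erfc relation derived above (a minimal sketch; the test point a = 1.5 is arbitrary):

import numpy as np
from scipy.special import erfc
from scipy.stats import norm
from scipy.integrate import quad

a = 1.5
# Q(a) by direct numerical integration of the standard Gaussian tail
q_int, _ = quad(lambda z: np.exp(-z**2/2)/np.sqrt(2*np.pi), a, np.inf)
print(q_int)                     # ~0.0668
print(0.5*erfc(a/np.sqrt(2)))    # the erfc-based expression derived above
print(norm.sf(a))                # scipy's built-in survival function

All three printed values agree (≈ 0.0668).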
SNR and error probability The first observation that one can make from the expression (1.54) on error probability is that it depends only on the ratio of E and σ²: the error probability is

P(E) = Q(√SNR).   (1.55)
We have already seen this phenomenon earlier. This ratio is called the signal to noise ratio, or simply SNR. Basically, the communication engineer can design for a certain reliability level by choosing an appropriate SNR setting. While the Q(·) function can be found in standard statistical tables, it is useful for the communication engineer to have a rule of thumb for how sensitive this SNR “knob” is in terms of the reliability each setting offers. For instance, it would be useful to know by how much
the reliability increases if we double SNR. To do this, it helps to use the following approximation:

Q(a) ≈ (1/2) e^{−a²/2}.   (1.56)

We use a rough approximation that we denote by "≈". This approximation means that the exponent of Q(a) is very close to that of (1/2)e^{−a²/2}, i.e., log Q(a) ≈ log((1/2)e^{−a²/2}). The reason that we consider this rough approximation is that the exponent serves as a proper measure when we deal with the probability of error. For example, the following probabilities of error (10⁻⁷ and 2·10⁻⁷) are considered to provide almost the same performance although they differ by a factor of two, since they have roughly the same exponent. Actually (1/2)e^{−a²/2} is an upper bound on the Q-function. See Fig. 1.14. Also see a Python code for this plotting at the end of this subsection. You will have a chance to check this rigorously in Problem 2.11. In this book, we will frequently approximate Q(a) by the upper bound. This approximation implies that the unreliability level

Q(√SNR) ≈ (1/2) e^{−SNR/2}.   (1.57)
Equation (1.57) says something interesting. It says that SNR has an exponential effect on the probability of error. What happens if we double the energy, i.e., Ẽ = 2E?

Q(√(2·SNR)) ≈ (1/2) e^{−SNR} ≈ Q(√SNR)².   (1.58)

Fig. 1.14 Comparison of Q(√SNR) and its upper bound (1/2)e^{−SNR/2}
Doubling the energy has thus squared the probability of error and squaring a number less than 1 means that we have significantly reduced the probability of error. Actually a typical range of SNR is 100 ∼ 10000, and for this range, the probability of error is quite a small value. Here is a Python code used for drawing the two curves in Fig. 1.14.

import numpy as np
from scipy.special import erfc
import matplotlib.pyplot as plt

SNRdB = np.arange(0,21,1)   # interval of SNR in dB scale
SNR = 10**(SNRdB/10)        # SNRdB = 10*log10(SNR)
# Q-function
Qfunc = 1/2*erfc(np.sqrt(SNR/2))
# An upper bound of the Q-function
Approx = 1/2*np.exp(-SNR/2)

plt.figure(figsize=(5,5), dpi=150)
plt.plot(SNRdB, Approx, 'r*:', label='0.5*exp(-SNR/2)')
plt.plot(SNRdB, Qfunc, label='Q(sqrt(SNR))')
plt.yscale('log')
plt.xlabel('SNR (dB)')
plt.grid()
plt.title('Q function vs. its approximation')
plt.legend()
plt.show()
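As a quick numerical illustration of (1.58) at the exponent level (a sketch; the SNR value 100 is arbitrary), one can compare the exponent of Q(√(2·SNR)) with that of Q(√SNR)² using scipy's log survival function:

import numpy as np
from scipy.stats import norm

SNR = 100.0
print(norm.logsf(np.sqrt(2*SNR)))   # log Q(sqrt(2*SNR)): about -104
print(2*norm.logsf(np.sqrt(SNR)))   # log of Q(sqrt(SNR))^2: about -106

The two exponents nearly coincide, which is the precise sense in which doubling the energy squares the probability of error.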
Python simulation: Empirical verification of the tradeoff (1.55) In order to verify
our findings, we will conduct multiple independent trials, denoted by n, where we send one bit using the optimal MAP rule with PAM transmission. By counting the number of error events out of the n independent transmissions, we will calculate the empirical error rate for different SNR values. As n increases, the empirical error probability approaches the tradeoff characterized in (1.55), thus demonstrating convergence. This method of convergence verification is called Monte Carlo simulation (Kroese, Brereton, Taimre, & Botev, 2014). To begin, we will generate n independent bits using the built-in function bernoulli, as we previously did in Sect. 1.3.

from scipy.stats import bernoulli
import numpy as np

n = 10  # number of independent bits
X = bernoulli(0.5)
# Generate n independent bits
b_samples = X.rvs(n)
print(b_samples)

[0 1 1 0 1 0 0 0 1 0]
Without loss of generality, assume the unit noise variance (σ² = 1). Then, the energy budget E is the same as SNR. Using the energy level, we can generate n transmitted signals. These signals are then sent over the additive Gaussian channel with an additive noise ∼ N(0, 1).

SNRdB = 0
SNR = 10**(SNRdB/10)
# Use PAM to construct n transmitted signals
x_samples = np.sqrt(SNR)*(2*b_samples - 1)
# Passes through the Gaussian channel ~ N(0,1)
y_samples = x_samples + np.random.randn(n)
print(x_samples)
print(y_samples)

[-1.  1.  1. -1.  1. -1. -1. -1.  1. -1.]
[-0.85448476  2.24874648 -0.06785438 -1.40327708 -1.76768827 -0.20014252
 -1.64453905  0.67536843  1.5978796   0.83061596]
Next we apply the MAP rule to estimate b̂(y). Since we consider the equal a priori probabilities and the considered Gaussian noise is symmetric around mean 0, the MAP rule reduces to the ML which coincides with the NN rule.

# MAP rule (=ML=NN in this case)
bhat = (y_samples > 0)*1
print(bhat)
print(b_samples)

[1 0 0 0 0 0 0 0 1 0]
[1 0 0 0 0 0 0 1 1 0]
By comparing b̂(y) to b, we finally compute an empirical error rate.

# Compute an empirical error rate
num_errors = sum(b_samples != bhat)
error_rate = num_errors/n
print(error_rate)

0.1
By repeating the aforementioned steps for varying SNR values, we can generate a plot of the empirical error rate, as shown in Fig. 1.15. The complete code for this process is provided below. To ensure accuracy, we set the number of independent trials to a sufficiently large number, such as n = 200,000.
from scipy.stats import bernoulli
from scipy.special import erfc
import numpy as np
import matplotlib.pyplot as plt

n = 200000
X = bernoulli(0.5)
# SNR range
SNRdB = np.arange(0,13,1)
SNR = 10**(SNRdB/10)
error_rate = np.zeros(len(SNR))

for i, snr in enumerate(SNR):
    # Generate n independent bits
    b_samples = X.rvs(n)
    # Use the PAM to construct n transmitted signals
    x_samples = np.sqrt(snr)*(2*b_samples - 1)
    # Passes through the Gaussian channel ~ N(0,1)
    y_samples = x_samples + np.random.randn(n)
    # MAP rule (=ML=NN in this case)
    bhat = (y_samples > 0)*1
    # Compute an empirical error rate
    num_errors = sum(b_samples != bhat)
    error_rate[i] = num_errors/n

# Q-function
Qfunc = 1/2*erfc(np.sqrt(SNR/2))

plt.figure(figsize=(5,5), dpi=150)
plt.plot(SNRdB, error_rate, 'r*:', label='empirical error rate')
plt.plot(SNRdB, Qfunc, label='Q(sqrt(SNR))')
plt.yscale('log')
plt.xlabel('SNR (dB)')
plt.grid()
plt.title('empirical error rate')
plt.legend()
plt.show()

Fig. 1.15 Empirical error probability when the number of independent trials n = 200,000
Look ahead Up until now, we have focused on the scenario where we transmit only a single bit. Remember that our goal is to characterize the tradeoff relationship between three quantities: (i) the amount of energy allocated, (ii) the quantity of transmitted bits, and (iii) the probability of errors. In the following section, we will determine the extent to which transmitting multiple bits with a limited energy budget impacts reliability.
1.6 Multiple Bits Transmission using PAM
Recap In the preceding section, we computed the probability of error for the transmission of a single bit, utilizing the optimal MAP decision rule. Subsequently, we were able to describe the fundamental tradeoff between error probability and SNR. Our findings demonstrated that SNR has a considerable impact on the probability of error, whereby doubling SNR results in an exponential decrease in error probability. As previously mentioned, one of our objectives in Part I is to determine the relationship between the energy budget, number of transmitted bits, and error probability.
Outline In this section, we will investigate the relationship between the energy budget, number of transmitted bits, and error probability. This section consists of four parts. Initially, we will concentrate on the simplest scenario for transmitting multiple bits, which involves sending two bits utilizing the PAM transmission scheme. Subsequently, we will derive the optimal MAP rule, which simplifies to the NN decision rule similar to the single bit transmission case. We will then analyze the probability of error. Finally, we will expand our analysis to the general case of transmitting an arbitrary number k of bits to characterize the tradeoff relationship between the three factors.
A transmission scheme for sending two bits To begin, we will examine the simplest scenario for transmitting multiple bits, which involves sending two bits. We will assume that we have the same energy budget E as in the single bit transmission case. Similar to before, we will utilize the PAM transmission scheme, which involves mapping the bits to voltage levels such that the levels are maximally separated from each other:

00 ⟶ −√E;
01 ⟶ −√E/3;
10 ⟶ +√E/3;
11 ⟶ +√E.

Refer to Fig. 1.16 for a more explicit depiction. We wish to highlight a crucial factor, which is the minimum distance, denoted as d. This parameter is defined as the smallest distance between any two voltage levels. In the case of transmitting two bits
Fig. 1.16 PAM for sending two bits over the additive Gaussian channel
with the PAM scheme and the same energy budget E as the single bit transmission, we have d = 2√E/3. It is noteworthy that this factor plays a pivotal role in computing the probability of error, and this will become more apparent soon.
The optimal decision rule Recall that the optimal decision rule is the one that maximizes the reliability of communication. As we proved during the past few sections, it is the MAP rule: choosing a transmission candidate that enables the maximization of the a posteriori probability:

b̂_MAP = arg max_{i,j∈{0,1}} P(b = ij | y = a).   (1.59)
Similar to the prior setting, consider a realistic scenario where we have no idea about the statistics of the two information bits. So we assume that the transmission bits are equally likely: P(b = ij) = 1/4 for all possible pairs of (i, j). Applying the above MAP rule under this assumption, we can show that the MAP is the same as the ML rule:

b̂_MAP = arg max_{i,j∈{0,1}} P(b = ij | y = a)
(a)= arg max_{i,j∈{0,1}} f_{b,y}(ij, a) / f_y(a)
(b)= arg max_{i,j∈{0,1}} P(b = ij) f_y(a | b = ij) / f_y(a)   (1.60)
(c)= arg max_{i,j∈{0,1}} P(b = ij) f_y(a | b = ij)
(d)= arg max_{i,j∈{0,1}} f_y(a | b = ij) =: b̂_ML

where (a) follows from the definition of conditional probability (here f_{b,y}(ij, a) denotes the joint pdf of b and y); (b) is due to the definition of conditional probability (this together with (a) is essentially the Bayes rule); (c) is because f_y(a) is irrelevant of the information bits b; and (d) comes from the assumption that the bits are equally likely. Using the properties of the Gaussian likelihood function f_y(a | b = ij) that it is symmetric around the mean and more likely near the mean, we can readily verify
Fig. 1.17 The ML rule is equivalent to the NN rule in the additive Gaussian channel
that the ML rule is the same as the NN rule: picking a transmitted voltage level that is closest to the received signal. See Fig. 1.17 for pictorial justification. Error probability Let us analyze the probability of error when we use the optimal NN decision rule. Using the total probability law, we get: P(E) = P(b = 00)P(E|b = 00) + P(b = 11)P(E|b = 11) + P(b = 01)P(E|b = 01) + P(b = 10)P(E|b = 10).
(1.61)
From the encoding rule as illustrated in Fig. 1.16, we see that there are two types of voltage levels: (i) inner voltage levels (−√E/3 and +√E/3); (ii) outer voltage levels (−√E and +√E). By utilizing the symmetry argument presented in Sect. 1.5, we can conclude that the probability of error for the inner voltage levels is identical for all such levels. Similarly, the same is true for the outer voltage levels. Therefore, it is adequate to examine only one of the voltage levels within each category.
Let us first calculate the error probability w.r.t. an outer level, say −√E. Using similar procedures that we did in Sect. 1.5, we get:

P(E | b = 00) = P(b̂(y) ≠ 00 | b = 00)
(a)= P(y > −√E + d/2 | b = 00)
(b)= P(w > d/2)
(c)= Q(d/(2σ))
(d)= Q(√E/(3σ))   (1.62)
where (a) is because the event b̂(y) ≠ 00 is equivalent to the event y > −√E + d/2 under the NN rule; (b) is due to the independence between w and b and the fact that w = y − x = y + √E under b = 00; (c) comes from w/σ ∼ N(0, 1); and (d) is due to d = 2√E/3.
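A quick Monte Carlo check of (1.62) (a minimal sketch; the SNR value, sample size, and random seed are illustrative choices):

import numpy as np
from scipy.stats import norm

SNR = 25.0                                          # E/sigma^2 with sigma = 1
E = SNR
levels = np.sqrt(E)*np.array([-1, -1/3, 1/3, 1])    # 4-PAM levels for 00, 01, 10, 11

rng = np.random.default_rng(0)
num_trials = 1_000_000
y = levels[0] + rng.standard_normal(num_trials)     # transmit the outer level (b = 00)
decoded = np.argmin(np.abs(y[:, None] - levels[None, :]), axis=1)   # NN rule
print((decoded != 0).mean())                        # empirical P(E | b = 00)
print(norm.sf(np.sqrt(SNR)/3))                      # Q(sqrt(SNR)/3) from (1.62)

The empirical rate and Q(√SNR/3) agree to within the Monte Carlo fluctuation (both ≈ 0.048 for this SNR).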
Next we derive the error probability w.r.t. an inner voltage level, say −√E/3. For an inner level, an error occurs whenever the noise exceeds d/2 in either direction:

P(E | b = 01) = P({y > −√E/3 + d/2} ∪ {y < −√E/3 − d/2} | b = 01)
= P(|w| > d/2)
= 2Q(d/(2σ))
= 2Q(√E/(3σ)),

i.e., twice the error probability of an outer level, since both tails of the noise now lead to an incorrect decision. Plugging these into (1.61) with equal a priori probabilities, the total error probability for k = 2 reads P(E) = (3/2)·Q(√E/(3σ)).
Generalization to an arbitrary number k of bits Now consider sending k bits with the same energy budget E. PAM places 2^k equally spaced voltage levels on [−√E, +√E], so the minimum distance shrinks to d = 2√E/(2^k − 1). By the total probability law and the symmetry argument, only two representative levels matter:

P(E) = (2/2^k)·P(E | b = 00···0) + ((2^k − 2)/2^k)·P(E | b = 00···1)   (1.66)

where b = 00···0 represents an outer voltage level and b = 00···1 an inner voltage level. For an outer level,

P(E | b = 00···0) (a)= P(w > d/2)
(b)= Q(d/(2σ))
(c)= Q(√E/((2^k − 1)σ))   (1.67)

where (a) is because the only thing that matters in an error event is whether the noise w is beyond the minimum distance divided by 2; (b) is due to w/σ ∼ N(0, 1); and (c) comes from the minimum distance formula d = 2√E/(2^k − 1). Similar to the k = 2 scenario, the error probability concerning an inner voltage level is just twice that of an outer voltage level:

P(E | b = 00···1) = 2Q(√E/((2^k − 1)σ)).   (1.68)

This is because an error event w.r.t. an inner voltage level includes another tail probability of error. Applying (1.67) and (1.68) into (1.66), we obtain:
P(E) = (2 − 1/2^{k−1}) Q(√E/((2^k − 1)σ))
= (2 − 1/2^{k−1}) Q(√SNR/(2^k − 1)).   (1.69)
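A quick numerical evaluation of (1.69) (a minimal sketch; the SNR value 400 is an illustrative choice) already hints at the issue discussed at the end of this section: at a fixed SNR, the error probability grows rapidly with k.

import numpy as np
from scipy.stats import norm

SNR = 400.0
for k in range(1, 9):
    Pe = (2 - 1/2**(k-1)) * norm.sf(np.sqrt(SNR)/(2**k - 1))   # equation (1.69), with Q = norm.sf
    print(k, Pe)

For this SNR the error probability is astronomically small for k = 1, yet already around 0.17 for k = 4, and it approaches 1 as k grows further.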
An engineering conclusion At the end of Sect. 1.2, in the simple channel setting where −σ ≤ w ≤ σ, we noted the relationship between k and SNR:

k = log₂(1 + √SNR).   (1.70)

Observe that k has a logarithmic relationship with √SNR: k increases very slowly (logarithmically) with an increase in √SNR. Interestingly, even in the additive Gaussian channel where having errors is unavoidable and hence the concept of reliability comes naturally, k has exactly the same (logarithmic) relationship with √SNR. To see this clearly, recall (1.69). Consider a certain reliability level. In order to guarantee the same reliability, the input argument in the Q-function, say t, should be similar among different pairs of (k, √SNR):

t ≈ √SNR/(2^k − 1).   (1.71)

This implies that

k ≈ log₂(1 + √SNR/t).   (1.72)
Hence, we can see that even in the additive Gaussian channel, the essential relationship between SNR and the number of information bits you can reliably transmit (at any reliability level) is unchanged. Look ahead A key observation that can be inferred from (1.69) is that the probability of error tends towards 1 as the number of information bits k increases. This result is discouraging, particularly if we consider modern communication systems that enable reliable communication even with a large number of bits. One might question the validity of the tradeoff relationship (1.69), or whether our mathematical derivation is incorrect. However, it is not due to any fault in our derivation, but rather it is a consequence of the specific communication scheme (PAM) that we have employed in our analysis. In reality, there are numerous methods to enhance reliability substantially, and we will investigate one such method in the following section.
Problem Set 2
Problem 2.1 (Total probability law) Let A_1, . . . , A_n be disjoint events and assume that P(A_i) > 0 for all i ∈ {1, . . . , n}. A student claims that for any event B,

P(B) = P(A_1)P(B|A_1) + ··· + P(A_n)P(B|A_n).

Either prove or disprove the student's claim.

Problem 2.2 (Conditional probability: Exercise 1) There are two coins. The first coin is fair, i.e., P(H) = P(T) = 1/2 where H and T denote "Head" and "Tail" events respectively. The second coin is a bit biased: P(H) = 0.6 = 1 − P(T). You are given one of the two coins, with equal probabilities among the two coins. You flip the coin four times and three of the four outcomes are "Head"s. What is the probability that your coin is the fair one?

Problem 2.3 (Conditional probability: Exercise 2) A drug discovery company is marketing a new test for COVID. According to clinical trials, the test demonstrates:
(a) When applied to an affected person, the test comes up positive in 90% of cases, and negative in 10%. Such negative events are called "false negative" or "misdetection" events.
(b) When applied to a non-affected healthy person, the test comes up negative in 80% of cases, and positive in 20%. Such positive events are called "false positive" or "false alarm" events.
Suppose that the incidence of COVID in the population is 1%. A person is chosen at random and tested, coming up positive. What is the probability that the person is actually affected?

Problem 2.4 (Conditional probability: Exercise 3) A man has 3 coins in his pocket. One is double-headed, another is double-tailed, and the other is normal. The coins can be figured out only after he looks at them.
(a) The man shuts his eyes, chooses a coin at random, and tosses it. What is the probability that the lower face of the coin is "Head"?
(b) After taking the action in part (a) (i.e., choosing a coin at random and then tossing it), he opens his eyes and sees that the upper face of the coin is "Head". What is the probability that the lower face is "Head"?
(c) Given the situation in part (b) (i.e., choosing a coin at random, tossing it and then seeing that the upper face of the coin is "Head"), he shuts his eyes again, picks up the same coin, and tosses it again. What is the probability that the lower face is "Head"?
(d) Given the situation in part (c) (i.e., choosing a coin at random, tossing it, seeing that the upper face of the coin is "Head", picking up the same coin, then tossing it again while shutting his eyes), he opens his eyes and sees that the upper face is "Head". What is the probability that the lower face is "Head"?
Problem 2.5 (Monty Hall Problem) In a game show, there are three doors, behind one of which is the prize of a car, while the other two conceal goats. However, you are unaware of which door contains the car, but the host (Monty Hall) is aware. Initially, you must select one door out of the three, but the door you choose is not immediately opened. Instead, the host opens one of the two remaining doors to reveal a goat. At this point, you are given a choice: to stick with your original selection or to switch to the other unopened door. What is the optimal choice to increase your chances of selecting the door containing the car? In other words, should you stay with your initial selection or switch to the unopened and unchosen door?

Problem 2.6 (The MAP rule in PAM) Recall the simple encoding rule that we studied in Sect. 1.4. Information bit 0 is mapped to x = −√E and 1 to x = +√E. This transmission scheme is called pulse amplitude modulation (PAM). Suppose P(b = 0) = p and P(b = 1) = 1 − p. Consider a transmission of x over an additive Gaussian channel: y = x + w where w ∼ N(0, σ²). Given y = a, find the MAP rule b̂(a).

Problem 2.7 (The MAP rule in optical communication) Given b ∈ {0, 1}, y is distributed according to N(0.1 + b, 0.1 + b). This model is inspired by optical communication links. Assume that P(b = 0) = 0.4 and P(b = 1) = 0.6. Given y = a, find the MAP rule b̂(a).

Problem 2.8 (The ML and MAP rules under multiple observations) Given X = x, the random variables {Y_n, n ≥ 1} are i.i.d. according to an exponential distribution with parameter x, i.e., f_{Y_i}(y) = x e^{−xy}, ∀i. Assume that P(X = 1) = 1 − P(X = 2) = p ∈ (0, 1).
(a) Find the ML rule for decoding X given Yⁿ := {Y_1, . . . , Y_n}.
(b) Find the MAP rule for decoding X given Yⁿ.

Problem 2.9 (Application of the MAP rule) Let X and W be independent random variables where P(X = −1) = P(X = +1) = 1/2 (simply expressed as X ∼ Bern(1/2)) and W is N(0, σ²). Find the function g : R → {−1, 1} that minimizes P(X ≠ g(X + W)). Hint: Think about the MAP rule.

Problem 2.10 (Principle of the MAP rule) Consider a binary random variable X ∼ Bern(p), i.e., P(X = 1) = p and P(X = 0) = 1 − p. Suppose that another binary random variable Y satisfies P(Y = 0|X = 1) = P(Y = 1|X = 0) = ε. Compute P(X = 1|Y = 1). Given Y = 1, find the MAP rule for detecting X.

Problem 2.11 (An upper bound of the Q-function)
(a) Show that for z ≥ a and a ≥ 0,
z² ≥ (z − a)² + a².   (1.73)
(b) Using part (a), derive

Q(a) := ∫_a^∞ (1/√(2π)) e^{−z²/2} dz ≤ (1/2) e^{−a²/2},   a > 0.   (1.74)
(c) Using a skeleton Python code provided in Sect. 1.5, plot Q(a) and the upper bound (1/2)e^{−a²/2}. Use the built-in function "yscale" (placed in matplotlib.pyplot) to exhibit the y-axis in a logarithmic (base 10) scale.

Problem 2.12 (The MAP vs ML rules) Consider an additive Gaussian noise channel with w ∼ N(0, σ²). Consider a simple transmission scheme in which information bit 0 is mapped to −√E and 1 to +√E. Suppose that the a priori probabilities are P(b = 0) = 1/4 and P(b = 1) = 3/4.
(a) Given the received signal y = a, derive the MAP rule.
(b) Evaluate the average error probability when using the MAP rule: P(b̂_MAP ≠ b). Use the Q-function notation.
(c) Given the received signal y = a, derive the ML rule.
(d) Evaluate the average error probability when using the ML rule: P(b̂_ML ≠ b). Use the Q-function notation.
(e) Compare the two average error probabilities derived in parts (b) and (d) respectively. Which is smaller? Explain why.

Problem 2.13 (Binary minimum cost detection) Consider a binary hypothesis testing problem with the a priori probabilities p_0, p_1 and likelihoods f_{Y|X}(y|i), i ∈ {0, 1}. Let C_{ij} be the cost of deciding on hypothesis j when i is correct.
(a) Given an observation Y = y, find the expected cost (over X ∈ {0, 1}) of making the decision X̂ = j for j ∈ {0, 1}. Show that the decision for the minimum expected cost is given by

X̂_mincost = arg min_{j∈{0,1}} [C_{0j} P_{X|Y}(0|y) + C_{1j} P_{X|Y}(1|y)].
(b) Show that the min cost decision above can be expressed as the following threshold test:

r(y) := f_{Y|X}(y|0)/f_{Y|X}(y|1)  ≷_{X̂=1}^{X̂=0}  p_1(C_{10} − C_{11})/(p_0(C_{01} − C_{00})) = η.
(c) Interpret the result above as saying that the only difference between the MAP rule and the minimum cost rule is an adjustment of the threshold to take account of the costs; i.e., a large cost of an error of one type is equivalent to having a large a priori probability for that hypothesis.
Problem 2.14 (Colored Gaussian noise) Consider a binary communication system where we send b = 1 by transmitting x_1 = [1, 1]ᵀ, and we send b = 0 by transmitting x_0 = [−1, −1]ᵀ. Here (·)ᵀ denotes the transpose of a matrix. We observe y = x_i + w where w is a colored Gaussian noise:

w ∼ N(0, [1 ρ; ρ 1]),   −1 < ρ < 1,

where [1 ρ; ρ 1] indicates the covariance matrix of w, i.e., E[w[0]²] = E[w[1]²] = 1 and E[w[0]w[1]] = ρ. Given y = a, find the MAP rule for decoding b.
Problem 2.15 (True or False?) (a) Suppose that E i ’s form a partition of a sample space. Then, P(A|B) =
P(A|E i )P(E i |B).
i
(b) Let X and Y be independent random variables where P(X = −1) = P(X = +1) = P(X = 0) = 13 and Y ∼ N (0, 0.1). Let Z = X + Y . Then the function g : R → {−1, 0, 1} that minimizes P(X = g(Z )) is the same as the function g : R → {−1, 0, 1} that maximizes f Z |X (z|g (z)). Here f Z |X (z|x) indicates the conditional pdf of Z given X = x. (c) Consider detecting whether voltages A1 or A2 was sent over a channel with an independent Gaussian noise. The Gaussian noise has mean A3 ( = 0) volts and variance σ 2 . The voltages A1 and A2 are equally likely (i.e., they occur with probability 0.5 each). Suppose you have a constraint that the average transmit energy cannot be more than E and you employ an optimal decision rule at the that meet the receiver. Then the voltage levels A1 and A2 √ √ average energy constraint and minimize the error probability are E and − E respectively or vice versa. (d) A transmitter wishes to send an information bit b ∈ {0, 1} to a receiver. The transmitted signal x is simply b. Given x, the channel output y follows a Gaussian distribution: N (b, 1 + b). The ML decision rule is the same as the Nearest Neighbor (NN) rule. √ √ (e) Suppose we send x = E or − E, depending on an information bit, over an additive Gaussian noise channel in which the noise has zero mean and is independent of x. In Sect. 1.4, we learned that the MAP decision rule is optimal in a sense of minimizing √the probability of√error. Suppose the a priori probabilities are equal, i.e., P(x = E) = P(x = − E) = 21 . Then the MAP rule reduces to the NN rule.
1.7 Multi-shot Communication: Sequential Coding
Recap We investigated the tradeoff relationship between three quantities: (i) SNR; (ii) reliability; and (iii) the number of transmitted bits (say B bits). This analysis was conducted with the use of a specific transmission scheme called PAM, which is illustrated in Fig. 1.19. Under the assumption that the bits are equally likely, we demonstrated that the optimal MAP decision rule is equivalent to the intuitive NN rule. By employing basic probability tools, we were able to derive the probability of error:

P(E) = (2/2^B)·Q(d/(2σ)) + ((2^B − 2)/2^B)·2Q(d/(2σ))

where the factor 2 in the first term indicates the number of outer voltage levels, associated with the one-sided tail error event, and the factor 2^B − 2 in the second term denotes the number of inner voltage levels, associated with the two-sided tail error event. Plugging in the minimum distance d = 2√E/(2^B − 1), we get:

P(E) = (2 − 1/2^{B−1}) Q(√SNR/(2^B − 1)) ≈ (1 − 1/2^B) e^{−SNR/(2(2^B − 1)²)}.   (1.75)
One crucial observation that we made was that the probability of error increases significantly with an increase in B. This is disappointing because in practical communication scenarios, a large number of bits need to be transmitted. For instance, typical file transmissions in our daily life amount to several mega or giga bytes, which consist of numerous bits, and according to the tradeoff relationship (1.75) under reasonable SNR levels (usually ranging from 100 to 10,000), such huge-sized file transmissions result in almost-certain errors. However, current communication systems can handle huge-sized file transmissions of several gigabytes and enable reliable communication. So, what is the issue with the above tradeoff relationship (1.75)?
Fig. 1.19 PAM for sending B bits
Outline In this section, we aim to answer the question by taking three steps. Firstly, we will explain that the tradeoff relationship (1.75) is a result of choosing the PAM communication scheme. Secondly, we will explore an alternative that utilizes an additional communication resource. Lastly, we will derive the probability of error for this new scheme to see if it can achieve a reasonably low error probability even for a large B.
A natural alternative One significant observation about the previous transmission scheme depicted in Fig. 1.19 is that it only utilizes one time slot, despite the availability of numerous time slots for communication. Consequently, we failed to exploit the communication resource of time. Therefore, an apparent alternative is to utilize multiple time slots. As an illustration, we can consider the transmission scheme shown in Fig. 1.20, where we transmit k bits at a time. These k bits are treated as a single entity, referred to as a chunk, which is sent in one time slot. We can then send these chunks in a sequential manner, where in each time slot, a new chunk is transmitted until all the chunks are sent. This scheme is known as “sequential coding.” In the example in Fig. 1.20, the chunk size k is 4. If we need to send B bits, then the number of chunks (denoted by n) is B/k. We will investigate this scheme to see whether we can achieve a reasonably low error probability, even for a large value of B. The optimal decision rule The question now is whether sequential coding can significantly increase reliability with a moderate level of SNR. To answer this, we must evaluate the reliability, or probability of error, when using sequential coding. To do this, we need to determine the optimal decision rule. The MAP rule is always optimal by definition. Assuming a typical setting where the bits are equally likely:
P((b_1, . . . , b_B) = (i_1, . . . , i_B)) = 1/2^B,
the MAP rule is the same as the ML rule. If you still are not familiar with the equivalence, please review Sect. 1.4. In order to figure out what the ML rule is, we need to know about the pdf of the received signal:
Fig. 1.20 Illustration of sequential coding
y[m] = x[m] + w[m],   m ∈ {1, . . . , n}   (1.76)
where (x[m], w[m]) indicate the transmitted signal and the additive noise at the mth chunk, respectively. Notice that the received signal contains a sequence of additive noises w[m]'s. So we need to figure out the statistics of the noise sequence.
A statistical model for w[m]'s Based on our previous discussion in Sect. 1.3, we can argue that the statistics of the additive noise at any given time is Gaussian. In practice, these statistics do not change drastically over time, and it is reasonable to assume that the mean and variance remain constant over time. Additionally, there is little correlation between the noises across different time slots, so it is also reasonable to assume that the noise samples w[m]'s are statistically independent. This type of noise model is called white, and we will refer to it as additive white Gaussian noise (AWGN). You may be wondering why it's called white. The reason is that AWGN contains frequency components that span the entire spectrum, much like white light contains every color component.
The ML decision rule Under the AWGN channel, what is the ML decision rule? First consider the first chunk of k bits, say b[1]. The ML estimate b̂[1] is obviously a function of the sequence of the received signals. Let us denote it by b̂[1] = ML(b[1] | y[1], . . . , y[n]). Recall the typical assumption that we made earlier: the bits are equally likely.

P((b_1, . . . , b_B) = (i_1, . . . , i_B)) = 1/2^B.
From this, one can also verify that the bits b_i's are i.i.d. Check this in Problem 3.1. This then implies that the transmitted signals x[m]'s at different time slots are statistically independent of each other. Since w[m]'s are i.i.d. under the AWGN, the received signals y[m]'s are also statistically independent. This suggests that the first chunk of bits b[1] is independent of the received signals w.r.t. the remaining chunks (y[2], . . . , y[n]). So (y[2], . . . , y[n]) does not help decoding b[1]. This yields:

b̂[1] = ML(b[1] | y[1], y[2], . . . , y[n]) = ML(b[1] | y[1]).

Applying the same argument at different time slots, we obtain:

b̂[m] = ML(b[m] | y[m]),   m ∈ {1, 2, . . . , n}.
In other words, sequential transmission over the AWGN channel naturally leads to sequential reception as well. At each time m, we make an ML decision on the local k
bits transmitted over that time instant. As we verified in the past sections, this local ML rule is the same as the NN rule.
Reliability of sequential coding In Sect. 1.6, we have derived the reliability of communication w.r.t. a particular time slot. Denoting by p the error probability for one time slot, we get:

p = (2 − 1/2^{k−1}) Q(√SNR/(2^k − 1)).   (1.77)
Due to the statistical independence of error events across distinct time slots, the reliability w.r.t. the entire n time slots would be:

P(C_1 ∩ C_2 ∩ ··· ∩ C_n) = ∏_{i=1}^{n} P(C_i) = (1 − p)^n   (1.78)
where C_i denotes the correct decision event at time i. Now we get a hint of what might go wrong with sequential coding. Even though we have ensured that the reliability is pretty high at any given time instant, the overall reliability of communicating correctly at each and every time instant is quite slim. To get a concrete feel, let us consider the following example: k = 4 and n = 20000. Suppose we had set the error probability p at any time instant to be 10⁻³ (based on (1.77), how much should SNR be?). Then the reliability of the entire sequential communication process is, from (1.78),

(1 − 10⁻³)^{20000} ≈ 2 × 10⁻⁹.   (1.79)
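Here is a minimal sketch for this numerical example (k, n, and the per-slot target p are taken from the example above; numerically inverting (1.77) is simply one way to answer the parenthetical SNR question):

import numpy as np
from scipy.stats import norm

k, n, p = 4, 20000, 1e-3
# Invert (1.77): p = (2 - 1/2^(k-1)) * Q(sqrt(SNR)/(2^k - 1))
q_target = p/(2 - 1/2**(k-1))
SNR = ((2**k - 1)*norm.isf(q_target))**2
print(SNR)           # required SNR for per-slot error probability p (roughly 2400)
print((1 - p)**n)    # overall reliability (1.78): about 2e-9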
The reliability level is extremely low, almost negligible. This implies that the only solution is to increase the SNR to a very high value to achieve an acceptable reliability level. Normally, the target unreliability (error probability) level is on the order of 10⁻¹⁰ or so. However, even for small files, achieving this level of reliability requires extremely high SNR values. Further details will be discussed in Problem 3.6.
Look ahead You may question whether the aforementioned phenomenon is inherent, i.e., an inevitable cost for ensuring reliable communication. However, the existence of successful communication technologies around us offers a strong indication of the answer. It appears that the issue lies with the sequential approach to transmission and reception. In the following section, we will investigate an alternative communication scheme that can considerably enhance reliability.
1.8 Multi-shot Communication: Repetition Coding
Recap In the previous section, we explored a chunk-based sequential coding scheme in an attempt to enhance reliability. The scheme involves reading information bits as a chunk unit consisting of k bits, applying PAM to each chunk, and transmitting the resulting PAM signal using a single time slot, as shown in Fig. 1.21. We demonstrated that, assuming a uniform distribution on the information bits, the optimal decision rule is the local nearest neighbor (NN) rule, whereby the NN rule is applied locally to each received signal y[m]. Our findings showed that although sequential coding does enhance reliability compared to the naive single-shot PAM, the improvement is not significant enough, as the error probability cannot be reduced to a small enough level. Recall that current communication systems enable reliable transmission even with a large number of bits. Hence, sequential coding is not the scheme employed in practice, indicating the existence of an alternative approach.
Outline In this section, we will explore a novel communication scheme that can significantly enhance reliability. This section consists of three parts. Firstly, we will introduce a new idea for transmission, which has not been discussed previously but is an intuitive approach. Secondly, we will examine a straightforward technique (inspired by this idea) in the simplest non-trivial setting, where a single bit is transmitted (i.e., B = 1) using two time slots (i.e., n = 2). Lastly, we will extend this method to the general case, considering any values of (B, n) and showing that the extended approach can reduce the probability of error to an arbitrarily small level.
Fig. 1.21 A chunk-based sequential coding scheme
The idea of repetition Recall sequential coding illustrated in Fig. 1.21. Each chunk signal contains only the corresponding bit string, being irrelevant of those w.r.t. other chunks. For instance, x[2] is concerned only about b[2], having nothing to do with b[1]. In fact, the communication method we use in our daily lives is not the same as the one discussed earlier. When communication fails, we simply repeat the message until the listener understands. This approach, known as repetition, is a very natural idea that anyone can come up with, but it has not been implemented in sequential coding. Therefore, an alternative solution is to employ the repetition idea, which has been shown to greatly improve reliability. In the sequel, we will demonstrate that a simple repetition coding scheme can indeed significantly enhance reliability.
Repetition coding for the simplest case (B, n) = (1, 2) Let us first consider the simplest yet non-trivial setting. That must be the case where we send only one bit (B = 1) but now through two time slots (n = 2). Assume that the energy budget per time slot is E. A naive trial based on the repetition idea could be: sending a voltage level ±√E at time 1 and sending the same voltage again at time 2. See Fig. 1.22. A fancy term for this kind of a scheme is repetition coding. For this coding scheme, the received signals read:

y[1] = x[1] + w[1] = (2b − 1)√E + w[1],
y[2] = x[2] + w[2] = (2b − 1)√E + w[2].   (1.80)
Guess for the optimal decision rule  What is the optimal decision rule? In other words, what does the MAP decision rule look like? Remember the case of n = 1. Under the uniform distribution on b, the optimal decision rule (MAP) is the same as the ML rule, and it further reduces to the NN decision rule. Now, any guess for the optimal decision rule when n = 2? The following observation gives an insight. Notice in (1.80) that x[1] and x[2] are identical. So one natural idea is to add the received signals and then make a decision based on the sum:
Fig. 1.22 Repetition coding for (B, n) = (1, 2)
y := y[1] + y[2] = (x[1] + x[2]) + (w[1] + w[2]) =: x + w.    (1.81)
This way, one can align x[1] and x[2] to boost the power of the signal of interest. On the other hand, w[1] and w[2] point in random directions, so such a boosting effect is not expected for the summed noise w. As you may expect, the NN rule w.r.t. the summed received signal y is indeed the optimal decision rule. We will prove this in the sequel.

The optimal decision rule  Assuming the uniform distribution on b, the MAP rule is equivalent to the ML rule:

b̂_ML = arg max_{i∈{0,1}} f_y(a|b = i)    (1.82)
where a = (a[1], a[2])^T denotes a realization of the received signal vector y = (y[1], y[2])^T. Here (·)^T indicates a transpose. Consider the likelihood function of interest:

f_y(a|b = i) = f_y(a[1], a[2]|b = i)
= f_w(a[1] − x[1], a[2] − x[2]|b = i)
(a) = f_w(a[1] − (2i − 1)√E, a[2] − (2i − 1)√E | b = i)
(b) = f_w(a[1] − (2i − 1)√E, a[2] − (2i − 1)√E)
(c) = f_{w[1]}(a[1] − (2i − 1)√E) · f_{w[2]}(a[2] − (2i − 1)√E)
(d) = (1/(2πσ²)) exp( −[(a[1] − (2i − 1)√E)² + (a[2] − (2i − 1)√E)²] / (2σ²) )
= (1/(2πσ²)) exp( −(a[1]² + E + a[2]² + E) / (2σ²) ) × exp( (a[1] + a[2])(2i − 1)√E / σ² )

where (a) comes from our encoding rule (x[m] = (2b − 1)√E for m ∈ {1, 2}); (b) is due to the independence between w and b; (c) is due to the independence of the additive noises at different time slots; and (d) comes from the pdf of the Gaussian noise. Here one key observation is that the first factor in the last equality does not depend on i. Hence, we get:
b̂_ML = arg max_{i∈{0,1}} f_y(a|b = i)
= arg max_{i∈{0,1}} exp( (a[1] + a[2])(2i − 1)√E / σ² )
= arg max_{i∈{0,1}} (a[1] + a[2])(2i − 1).    (1.83)
Notice that the sum a[1] + a[2] plays a significant role in the decision: a[1] + a[2] ≥ 0 ⟹ b̂_ML = 1; a[1] + a[2] < 0 ⟹ b̂_ML = 0. What sense can we make of this rule? We are collecting the two received signals and taking a sort of "joint opinion". If the sum is positive, we decide that the bit must correspond to a positive signal (and vice versa). This indeed coincides with our initial guess: the NN decision rule w.r.t. y := y[1] + y[2].

A sufficient statistic  We observed that the sum of the received signals

a[1] + a[2]    (1.84)
is sufficient to evaluate the ML rule for repetition coding. In other words, even though we received two different voltage levels, only their sum is relevant to making a decision. Such a quantity is called a sufficient statistic. We say that the sum of the received signals is a sufficient statistic for deriving the ML rule in repetition coding (Fig. 1.23).

Fig. 1.23 The sum is a sufficient statistic for deriving the ML rule

Error probability  Focusing on the sufficient statistic, the derivation of the error probability becomes straightforward. Consider the sum:

y := y[1] + y[2] = (x[1] + x[2]) + (w[1] + w[2]) =: x + w.    (1.85)
One key claim is that the summed noise w is also Gaussian. The mean is 0 and the variance is 2σ²:

E[w²] = E[(w[1] + w[2])²] = E[w[1]²] + E[w[2]²] + 2E[w[1]w[2]]
(a) = E[w[1]²] + E[w[2]²]
(b) = 2σ²
where (a) follows due to the independence and zero-mean properties of the AWGN; and (b) is because E[w[1]²] = E[w[2]²] = σ². You may wonder why w is also Gaussian. You will have a chance to prove this in Problem 3.2. Now what is P(E)? If we apply what we learned during the past sections, this calculation is trivial. In the case of sending one bit, P(E) takes the following formula:

P(E) = Q( d / (2√(var[w])) )    (1.86)
where d denotes the minimum distance w.r.t. x := x[1] + x[2]:

d = 2√E − (−2√E) = 4√E.    (1.87)
Plugging d = 4√E into (1.86), we get:

P(E) = Q( 4√E / (2√(2σ²)) ) = Q( √(2 · SNR) ).    (1.88)
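The error probability (1.88) is easy to check numerically. Below is a minimal Monte Carlo sketch in Python (the values of E and σ² are illustrative assumptions, not from the text): it transmits a random bit twice over an AWGN channel, decodes with the sum-based NN rule derived above, and compares the simulated error rate against Q(√(2·SNR)).

```python
import numpy as np
from scipy.stats import norm

# Illustrative parameters (not from the book): energy per slot and noise variance.
E, sigma2 = 1.0, 0.25
SNR = E / sigma2
num_bits = 200_000

rng = np.random.default_rng(0)
b = rng.integers(0, 2, num_bits)                   # information bits
x = (2 * b - 1) * np.sqrt(E)                       # 2-PAM symbol, repeated over two slots
y1 = x + rng.normal(0, np.sqrt(sigma2), num_bits)  # received signal at time 1
y2 = x + rng.normal(0, np.sqrt(sigma2), num_bits)  # received signal at time 2

b_hat = (y1 + y2 >= 0).astype(int)                 # sum-based NN rule (the sufficient statistic)
p_err_sim = np.mean(b_hat != b)

Q = lambda t: norm.sf(t)                           # Q(t) = P(N(0,1) > t)
p_err_theory = Q(np.sqrt(2 * SNR))                 # Eq. (1.88)
print(f"simulated {p_err_sim:.4f} vs theory {p_err_theory:.4f}")
```

The simulated rate should agree with (1.88) up to Monte Carlo noise.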
Extension to (B, n) = (1, n)  We transmit the same voltage level at each time instant, for n consecutive time slots. So the received signals read:

y[m] = x[m] + w[m] = (2b − 1)√E + w[m],   m ∈ {1, . . . , n}.    (1.89)
Applying the same arguments that we made earlier, one can verify that the optimal decision rule is again the NN rule w.r.t. the sum:

y := y[1] + · · · + y[n] = (x[1] + · · · + x[n]) + (w[1] + · · · + w[n]) =: x + w.

Also, the error probability is given by:
Fig. 1.24 A 2^B-PAM signal w.r.t. x = x[1] + x[2] + · · · + x[n]
P(E) = Q( d / (2√(var[w])) )
(a) = Q( 2n√E / (2√(nσ²)) )
= Q( √(n · SNR) )
where (a) is because the minimum distance is d = n√E − (−n√E) = 2n√E and var[w] = n·var[w[1]] = nσ².

Extension to arbitrary (B, n)  The idea is to apply PAM to the entire bit string and then do repetition: x[1] = x[2] = · · · = x[n]. Again one can prove that the sum y[1] + y[2] + · · · + y[n] is a sufficient statistic for deriving the ML rule:

y := y[1] + · · · + y[n] = (x[1] + · · · + x[n]) + (w[1] + · · · + w[n]) =: x + w.
Now x is a 2^B-PAM signal. See Fig. 1.24. So the minimum distance d reads:

d = 2n√E / (2^B − 1).    (1.90)
The summed noise w ∼ N(0, nσ²). Again the optimal decision rule is the NN rule w.r.t. the sum, and the corresponding error probability is:

P(E) = (2 − 1/2^{B−1}) Q( d / (2√(var[w])) )
= (2 − 1/2^{B−1}) Q( 2n√E / (2(2^B − 1)√(nσ²)) )
= (2 − 1/2^{B−1}) Q( √(n · SNR) / (2^B − 1) ).    (1.91)
Can we make P(E) → 0?  Let us check if we can make the error probability small enough with repetition coding. From (1.91), we see that it is possible. For example, let us set:

n = B · 2^{2B}.    (1.92)
Plugging this into (1.91), we get:

P(E) ≈ 2Q( √(B · SNR) ).
By increasing B together with n = B · 2^{2B}, we can make P(E) arbitrarily close to zero.

Any other issue?  But an issue arises in repetition coding. The issue is w.r.t. transmission efficiency, often quantified as the data rate:

R := (total # of transmission bits) / (total # of time slots) = B/n   (data rate).
In the above example n = B · 2^{2B}, the data rate tends to zero as B increases:

R = B/n = 1/2^{2B} −→ 0   as B → ∞.
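To make the tradeoff concrete, here is a small Python sketch (the SNR value and the range of B are chosen only for illustration) that evaluates the repetition-coding error probability (1.91) and the data rate R = B/n under the choice n = B · 2^{2B}: P(E) tends to 0 as B grows, but R collapses much faster.

```python
import numpy as np
from scipy.stats import norm

Q = lambda t: norm.sf(t)

def pe_repetition(B, n, snr):
    """Error probability (1.91) of repetition coding with 2^B-PAM over n slots."""
    return (2 - 1 / 2**(B - 1)) * Q(np.sqrt(n * snr) / (2**B - 1))

snr = 1.0  # 0 dB, an illustrative value
for B in [1, 2, 4, 6, 8, 10]:
    n = B * 4**B                 # the choice n = B * 2^(2B) from (1.92)
    R = B / n                    # data rate (bits/time)
    # P(E) decreases with B (the trend may wobble for the first couple of values),
    # while R shrinks like 1/2^(2B).
    print(f"B={B:>2}: n={n:>9}, R={R:.2e}, P(E)={pe_repetition(B, n, snr):.2e}")
```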
Obviously this is not what we want. Look ahead A natural question arises. Is there any smarter communication scheme that yields R > 0 while ensuring P(E) → 0? In the next section, we will answer this question.
1.9 Capacity of the AWGN Channel
Recap  In the previous section, we explored the repetition coding scheme to see if it can provide reliable communication with a typical range of SNR. To do so, we calculated the probability of error:

P(E) = (2 − 1/2^{B−1}) Q( √(n · SNR) / (2^B − 1) )    (1.93)
where B indicates the total number of transmission bits, and n denotes the total number of time slots used. As we have shown, repetition coding can indeed facilitate reliable communication. For instance, recall the case where we set the number of time slots to be n = B · 2^{2B}. In this case,

P(E) ≈ 2Q( √(B · SNR) ) −→ 0   as B → ∞.

While repetition coding yields a good performance in reliability, we found it comes at a cost in transmission efficiency, which can be quantified as the data rate R := B/n (bits/time). In the above example, the increase in B that yields an arbitrarily small P(E) makes R approach 0 as well. So one natural question is:
Question: Is there a coding scheme that yields both data efficiency (R > 0) and reliable communication (P(E) → 0)?
Outline In this section, we will provide detailed explanations to affirmatively answer the question. In fact, the question was raised by many communication engineers in the early 20th century. There was also a follow-up conjecture that whenever P(E) → 0, R → 0. First, we will explain why people believed in this conjecture. Then, we will tell the story of Claude Shannon who provided a positive answer to the question without developing explicit coding schemes. Interestingly, explicit coding schemes were later developed that yield both R > 0 and reliable communication. Finally, we will discuss two major efforts in this regard.
A belief in the early 20th century During the early 20th century, the prevailing belief was that the only method of achieving reliable communication over the noisy
AWGN channel was to add an infinite amount of redundancy, as seen in repetition coding. However, this approach would result in an extremely low data rate, and so it was commonly thought that it was impossible to attain a negligible error probability (P(E) → 0) while maintaining a positive data rate (R > 0).

Claude Shannon's finding  Despite the prevailing belief, Shannon's work demonstrated that this assumption was mistaken. He was able to prove this while developing his theory of communication, known as information theory. By applying this theory to the AWGN channel, Shannon showed that sophisticated coding schemes, beyond simple repetition coding, could achieve both a positive data rate and reliable communication:

P(E) → 0 for R > 0.    (1.94)
Shannon’s efforts We will explain how Shannon arrived at the result expressed in (1.94). He started by creating a sophisticated transmission scheme that differed from repetition coding. We will not delve into the specifics of the scheme for the sake of simplicity. Then, he devised a receiver strategy that was not the optimal decision rule but was quite close to it. One may wonder why he opted for a suboptimal decision rule that was almost optimal. The reason was complex, and we will not go into further detail. After attempting to derive P(E) to determine if reliable communication is achievable even when R > 0, Shannon was unable to do so. However, he was still able to prove (1.94) without explicitly calculating P(E). This is a testament to Shannon’s brilliance. A smart upper-bound technique Instead Shannon was able to derive an upper bound on the error probability, say P¯e . This inspired him to come up with the following smart trick. If an upper bound tends to 0, then the exact error probability also goes to 0. Why? Because the probability cannot go below 0. The trick is based on the sandwich argument: 0 ≤ P(E) ≤ P¯e → 0 implies P(E) → 0. Shannon was able to derive a tight-enough upper bound so that he can apply the smart trick. Here is a rough upperbound formula that Shannon derived: P(E) ≤ exp (n(R − C)) =: P¯e
(1.95)
where n denotes the total number of time slots used and C indicates some strictly positive value that we will detail soon. This suggests that as long as R < C, we can make P̄e arbitrarily close to 0 as n → ∞, therefore making the exact error probability P(E) arbitrarily small as well. Figure 1.25 shows how the upper bound varies depending on the choice of n. As n increases, the upper bound drops to 0 more and more sharply for rates just below C. This is how he proved the main claim (1.94).

The law governing nature  Shannon's pursuit of understanding the laws governing nature didn't end there. He believed that there must be a fundamental limit
Fig. 1.25 Upper bounds P¯e on error probability for various n
on the data rate beyond which reliable communication cannot be guaranteed, meaning that no matter what coding scheme is used, the probability of error cannot be made arbitrarily close to zero. And he was right. Shannon proved that if R is greater than the channel capacity C, then it is fundamentally impossible to enable reliable communication, as P(E) cannot be made arbitrarily close to zero. To prove this, Shannon used another smart trick. He developed a lower bound formula for the error probability and showed that if the lower bound does not go to zero, then the exact error probability cannot be zero either. This is because P(E) ≥ P_e > 0 implies that P(E) > 0. Shannon was able to come up with a tight enough lower bound that allowed him to apply this smart trick:

P(E) ≥ 1 − C/R − 1/(nR)   ∀n.    (1.96)

The bound holds for all n. So we must have:

P(E) ≥ max_{n∈N} ( 1 − C/R − 1/(nR) ) = 1 − C/R =: P_e    (1.97)
where the equality is approached as n → ∞. This suggests that if R > C, then P_e > 0 and therefore P(E) > 0, meaning it is impossible to enable reliable communication. Figure 1.26 illustrates such a lower bound: the lower bound is strictly above 0 for any R > C.

Capacity of the AWGN channel  The upper (1.95) and lower (1.97) bounds imply that the quantity C is a fundamental quantity that delineates a sharp threshold on the data rate: below it, reliable communication is possible; above it, reliable communication is impossible no matter what. See Fig. 1.27. In other words, in the limit of large n, a phase transition occurs at this threshold. There is a terminology for the maximum data rate that enables reliable communication. It
Fig. 1.26 A lower bound Pe on error probability
Fig. 1.27 Phase transition: There exists a sharp threshold C on the data rate below which reliable communication is possible and above which reliable communication is impossible no matter what
is called the channel capacity or Shannon capacity. You may wonder where the letter C comes from. Here C stands for capacity. What is the explicit formula for the channel capacity C? Given a power constraint P, the capacity of the AWGN channel is shown to be

C = (1/2) log2( 1 + P/σ² )   bits/time    (1.98)
where σ² is the noise variance (Shannon, 2001). The rigorous proof is out of the scope of this book. Instead, we will provide an intuition as to why this must be the capacity at the end of this section.

The engineering conclusion  Note that the capacity (1.98) depends only on P/σ², which is the average SNR. What does this remind you of? Yes, you saw a similar relationship in Sect. 1.2 in the context of the simple channel setting. Hence, we can arrive at the conclusion: the SNR is the sole factor that determines the number of transmission bits. Moreover, we can see a fundamental relationship between the data rate R and SNR. Earlier we noted a relationship between the two: the required energy essentially quadrupled when we looked to transmit one extra bit. Here you can see that the essential relationship is unchanged. To see this clearly, let us do the following calculation. Suppose that SNR~ is the new SNR required to transmit one extra bit on top of the C bits/time. We then have

(1/2) log2( 1 + SNR~ ) = 1 + (1/2) log2( 1 + SNR )
⟺ 1 + SNR~ = 4(1 + SNR)
⟹ SNR~ ≈ 4·SNR.

Note that SNR~ is roughly a quadruple of SNR.

Capacity-achieving codes?  One might question how to attain the capacity since Shannon did not provide a specific coding scheme that achieves it. Instead, he could only prove the existence of a scheme that achieves the capacity. To fully comprehend this, one must understand the proof of the upper bound (1.95), which we have omitted. This limitation of Shannon's work is one of the reasons why it could not be directly applied to real-world communication systems at that time. You may wonder if this is still the case today. Surprisingly, capacity-achieving explicit codes for the AWGN channel still remain an open question. However, from an engineering perspective, it may be acceptable to achieve a performance close to the capacity. Thus, numerous studies have been conducted to develop capacity-approaching explicit coding schemes. Here, we will discuss two of these efforts.

Fig. 1.28 (Left): Robert G. Gallager, the inventor of the LDPC code; (Right): Erdal Arikan, the inventor of the Polar code

Two major efforts towards capacity-approaching codes  Robert G. Gallager made an attempt to develop an explicit coding scheme called the "Low Density Parity Check code" (LDPC code) in 1960 (Gallager, 1962) to approach the capacity. See the left picture in Fig. 1.28. The LDPC code provides a detailed guideline on how
to design a code and has been shown to have remarkable performance. However, it does not exactly match the capacity, although it approaches it as the number n of time slots used tends to infinity. To demonstrate the degree of closeness, consider the following example. Suppose we want the data rate R = 0.5 bits/time. Then, according to Shannon's capacity result (1.98), we must admit:

0.5 ≤ (1/2) log2( 1 + SNR ).    (1.99)

This is then equivalent to saying that:

SNR ≥ 1 (0 dB).    (1.100)
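The capacity formula (1.98) is easy to explore numerically. The Python sketch below (illustrative only; the helper names are my own, not the book's) computes the minimum SNR needed to support a target rate R by inverting (1.98), recovering the 0 dB limit in (1.100), and also checks the "one extra bit costs roughly 4x the SNR" rule discussed above.

```python
import numpy as np

def capacity(snr):
    """AWGN capacity C = 0.5 * log2(1 + SNR) in bits per time slot, Eq. (1.98)."""
    return 0.5 * np.log2(1 + snr)

def min_snr_for_rate(R):
    """Smallest SNR with C(SNR) >= R, obtained by inverting Eq. (1.98)."""
    return 2**(2 * R) - 1

# Rate 0.5 bits/time needs SNR >= 1, i.e., 0 dB, matching (1.100).
snr_min = min_snr_for_rate(0.5)
print(snr_min, 10 * np.log10(snr_min))         # -> 1.0, 0.0 dB

# One extra bit on top of C(SNR) requires roughly 4x the SNR.
snr = 100.0                                    # 20 dB, an illustrative value
snr_new = min_snr_for_rate(capacity(snr) + 1)  # SNR needed for C + 1 bits/time
print(snr_new / snr)                           # -> about 4
```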
What it means for an LDPC code to approach the capacity is the following: the LDPC code yields a very small error probability P(E) ≈ 0 with SNR = 1.0233 (0.1 dB), which has only a 0.1 dB gap to the optimal SNR limit of 0 dB. You may wonder how small the error probability is. With n = 8000, P(E) ≈ 10^{-4}, and we can achieve a smaller P(E) with a larger n. The LDPC code did not receive sufficient recognition at the time due to its high implementation complexity relative to the digital signal processing (DSP) technology then available. Roughly 30 years later, however, the code resurfaced and garnered significant attention, as advances in DSP technology had made it practical; it is now widely employed in a variety of systems such as LTE, WiFi, DMB (Digital Multimedia Broadcasting, a digital broadcast technology), and storage systems. Despite developing the LDPC code, Gallager was not completely satisfied with the outcome at the time because the code did not guarantee exact capacity achievement even as n approached infinity. This spurred one of his PhD students, Erdal Arikan (depicted in the right image in Fig. 1.28), to develop a capacity-achieving code called the Polar code (Arikan, 2009). Notably, it took Arikan over 30 years from the time he was first inspired during his PhD days to create this code, which was accomplished in 2007. Due to its great performance and low-complexity nature, it is now being seriously considered for implementation in a variety of systems. Learning about LDPC and Polar codes is an arduous task that requires a solid foundation in random processes, which may be unfamiliar to many. Consequently, this book will not cover these codes.

Further reading: Packing spheres  From a geometric perspective, repetition coding confines all the constellation points to a single dimension, as depicted in Fig. 1.29. This approach places all the points on a single line, despite the signal space having numerous dimensions (n), rendering it highly inefficient in terms of packing
Fig. 1.29 Repetition coding packs constellation points inefficiently in the n-dimensional signal space
the constellation points. A more effective means of communication would involve distributing the points across all n dimensions. We can get an estimate on the maximum number of constellation points that can be packed in for the given power constraint P, by appealing to the classic sphere-packing picture. See Fig. 1.30. By the law of large numbers,⁵ the n-dimensional received vector y = x + w will, with high probability, lie within a received-signal-sphere of radius √(n(P + σ²)). So without loss of generality, we need to focus only on what happens inside this sphere. On the other hand,

(1/n) Σ_{m=1}^{n} w[m]² → σ²    (1.101)

as n → ∞, by the law of large numbers. So, for large n, the received vector y lies, with high probability, near the surface of a noise sphere of radius √(nσ²) around the transmitted signal point. Reliable communication occurs as long as the noise spheres around the constellation points do not overlap. The maximum number of constellation points that can be packed with non-overlapping noise spheres is the ratio of the volume of the received-signal-sphere to the volume of a noise sphere:⁶

( √(n(P + σ²)) )^n / ( √(nσ²) )^n.    (1.102)

⁵ A rough definition of the law of large numbers is that the empirical mean of i.i.d. random variables approaches the true mean as the number of involved random variables increases. We will learn about the formal definition in Part III. If you want to know about it now, see Problems 8.1, 8.2 and 8.3.
⁶ The volume of an n-dimensional sphere of radius r is proportional to r^n.
Fig. 1.30 The number of noise spheres that can be packed into the received-signal-sphere yields the maximum number of constellation points that can be reliably distinguished
This implies that the maximum number of bits per time slot that can be reliably communicated is

(1/n) log2[ ( √(n(P + σ²)) )^n / ( √(nσ²) )^n ] = (1/2) log2( 1 + P/σ² ).    (1.103)

This is indeed the capacity of the AWGN channel. The argument might sound very heuristic. If you are interested in a rigorous proof, you may want to take a course on information theory, e.g., EE326 (or EE623) at KAIST, or EE290A at UC Berkeley.

Summary of Part I  Our focus has been on the AWGN channel, where we have delved into designing transmission and reception techniques, while also examining the interdependent relationship between data rate R, SNR, and P(E). Additionally, we have familiarized ourselves with the channel capacity and phase transition concepts and explored various endeavors geared towards attaining channel capacity.

Outline of Part II  While the AWGN channel is a fundamental component of various communication systems, it is not an accurate representation of practical channels. Consequently, Part II of this book will focus on a more practical channel, namely the wireline channel. Initially, we will demonstrate how to model the channel as a concatenation of an LTI system and the AWGN block before discussing the design of transmission and reception strategies for this channel.
Problem Set 3

Problem 3.1 (Basics on probability)  Suppose that the a priori probabilities of information bits are the same for all possible sequence patterns: P(b1 b2 · · · bB = i1 i2 · · · iB) = 1/2^B, ∀ij ∈ {0, 1}, j ∈ {1, 2, . . . , B}. Show that the bi's are i.i.d. ∼ Bern(1/2).

Problem 3.2 (Sum of independent Gaussian random variables)  Let w[1] and w[2] be independent Gaussian random variables with mean 0 and variance σ². Show that w := w[1] + w[2] is also Gaussian. Also find the mean and the variance of w.

Problem 3.3 (The MAP vs. ML vs. NN rules)  A transmitter wishes to send an information bit b ∈ {0, 1} to a receiver. The encoding strategy is the following simple mapping: x = b. Given x, the channel output y respects a Gaussian distribution: N(0.1 + b, 0.1 + b). Assume that b ∼ Bern(0.6).
(a) Given y = a, derive the MAP rule b̂_MAP. You should describe how b̂_MAP is decided as a function of a.
(b) What is the error probability when using the MAP rule? Use the Q-function notation.
(c) Given y = a, derive the ML rule b̂_ML. Again you should describe how b̂_ML is decided as a function of a.
(d) What is the error probability when using the ML rule?
(e) Given y = a, derive the NN rule b̂_NN. Again you should describe how b̂_NN is decided as a function of a.
(f) What is the error probability when using the NN rule?

Problem 3.4 (Sufficient statistic)  A statistic T(X) is said to be sufficient for an underlying parameter θ if the conditional distribution of the data X, given the statistic T(X), does not depend on the parameter θ, i.e., P(X = x|T(X) = t, θ) = P(X = x|T(X) = t). Consider the repetition coding example (B, n) = (1, 2). The two received signals are given by

y[1] = x + w[1],   y[2] = x + w[2]

where the transmitted voltage x can take on two values (+√E and −√E), and the w[m]'s are i.i.d. according to N(0, σ²). In Sect. 1.8, we found that only the sum y[1] + y[2] of the two received signals is relevant to making a decision in the ML decision rule. We then claimed that y[1] + y[2] is a sufficient statistic to estimate the transmitted signal x. Verify that it satisfies the general definition as above.
Problem 3.5 (Sequential coding)  In this problem, we evaluate the error probability of sequential coding for different choices of the chunk size k. Suppose we want to communicate B = 4096 bits over the AWGN channel using n time slots. We consider only values of n that are powers of 2: n = 2, 4, 8, . . . , 4096. So we will consider k = 2048, 1024, 512, . . . , 1. Suppose we have an energy budget E per time slot and apply PAM for each chunk. Assume we choose one typical value of SNR: SNR = 40 dB.
(a) Derive the error probability when using this sequential communication scheme. Use the Q-function notation.
(b) Using Python, plot the error probability as a function of n. Note: Since we consider the values of n that are powers of 2, you may need to interpolate between these points to generate the graph.
(c) Check if there is an appropriate choice of the chunk size that satisfies an error probability level (around 10^{-5}).

Problem 3.6 (Tradeoff relationship in sequential coding)  In this problem, we wish to see the tradeoff between the data rate R (see below for the definition) and SNR while communicating reliably at a certain desired level when using the sequential coding that we learned in Sect. 1.7. Suppose we want to communicate B = 8096 bits (roughly 1 KB) over the AWGN channel using n time slots. The data rate is then defined as:

R := 8096/n   bits/time.

Suppose n divides 8096. Then a sequential coding scheme transmits R bits every time slot using the energy E. This is done by transmitting one of 2^R equally-spaced voltage levels between ±√E. Suppose we want to maintain an overall error probability level of 10^{-5} (this is the chance that at least one of the 8096 bits is wrongly communicated). We would like to understand how much SNR we need, as a function of n, to maintain the desired level. Using Python, plot the data rate R as a function of 10 log10 SNR. Here SNR is chosen such that the desired error probability level is met. You can plot only for values of n that are powers of 2 and interpolate between these points to generate the graph.

Problem 3.7 (Tradeoff relationship in repetition coding)  Repeat the exercise above with sequential coding replaced by repetition coding.

Problem 3.8 (Sufficient statistics in repetition coding)  Suppose we send a single bit (equi-probable) with 2-PAM and "repetition coding" over the AWGN channel using n time slots:

y[m] = x + w[m],   m ∈ {1, . . . , n}.    (1.104)
There is an energy budget E per time slot. The Gaussian noise w[m] is i.i.d. ∼ N(0, σ²). Show that the sum y[1] + · · · + y[n] is a sufficient statistic for the ML decision rule.

Problem 3.9 (The ML principle)  In the additive Gaussian channel, the input x and output y are continuous-valued. For some other channels, they are discrete-valued. For example, consider the "bit-flipping" channel where both input and output are binary-valued (0 or 1). This is a common link-layer model for corruption in a network. The channel is given by the conditional probability: P_y(1|x = 0) = α and P_y(0|x = 1) = β. Suppose we want to transmit one bit using an all-0-vector (of length n) or an all-1-vector (of length n), i.e., we map 0 to (0, 0, . . . , 0), and 1 to (1, 1, . . . , 1), respectively. We assume that (y[1], . . . , y[n]) are conditionally independent given b = i:

P_{y[1],...,y[n]}(a[1], . . . , a[n]|b = i) = ∏_{m=1}^{n} P_{y[m]}(a[m]|b = i).

(a) Given (y[1], . . . , y[n]) = (a[1], . . . , a[n]), derive the ML rule:

b̂_ML := arg max_{i∈{0,1}} P_{y[1],...,y[n]}(a[1], . . . , a[n]|b = i).
(b) Evaluate the error probability when using the ML rule.

Problem 3.10 (Repetition coding over a colored AGN channel)  Consider sending a single bit with 2-PAM and "repetition coding" over the AGN channel using n = 2 time slots. There is an energy budget E per time slot. The Gaussian noise is independent from time to time but the variance is different: say σ₁² at time 1 and σ₂² at time 2. Assume that σ₁ ≥ σ₂.
(a) Given the two received signals (y[1], y[2]), derive the ML decision rule.
(b) Draw a two-dimensional plot (with x[1]-axis and x[2]-axis) that exhibits the two possible transmission signal points. The plot is called the "signal constellation diagram".
(c) Identify the sufficient statistic for the ML decision rule and draw the decision boundary (an explicit threshold curve which separates a decision on b̂) in the signal constellation diagram.
(d) Consider a case where σ₁ > σ₂ = 0. Identify the decision boundary in this case. Does the decision boundary make intuitive sense? Explain your answer.
(e) Compute the error probability when using the ML rule. Consider arbitrary values of σ₁ and σ₂, not the special case in part (d).

Problem 3.11 (The optimal receiver principle)  Consider the optimal decision rule for x given (y[1], y[2]):

y[1] = x + w[1];   y[2] = w[1] + w[2].
Here x is equally likely to be +√E or −√E, and (w[1], w[2]) are i.i.d. ∼ N(0, σ²), being independent of x.
(a) Is y[1] a sufficient statistic for the optimal decision rule? Give an intuitive explanation for your answer.
(b) Find the optimal detector for x given y[1] and y[2], and provide a rigorous explanation for your answer in part (a).
(c) Compute the error probability when using the optimal detector derived in part (b).
(d) Consider now the best detector for x using y[1] only. Find the additional energy, if any, needed to achieve the same error probability as that of the optimal detector in part (b).

Problem 3.12 (A degraded channel)  Consider the problem of detecting x given (y[1], y[2]):

y[1] = x + w[1];   y[2] = x + w[1] + w[2].

Here x is equally likely to be +√E or −√E, and (w[1], w[2]) are i.i.d. ∼ N(0, 1), being independent of x.
(a) Is y[1] a sufficient statistic for the detection problem? Give an intuitive explanation for your answer.
(b) Given (y[1], y[2]), derive the optimal detector for x.
(c) Compute the error probability when using the optimal detector.

Problem 3.13 (Sufficient statistics)  Consider the problem of decoding x given (y[1], y[2], y[3]):

y[1] = x + w[1];   y[2] = w[1] + w[2];   y[3] = w[2] + w[3].

Here x is equally likely to be +√E or −√E, and (w[1], w[2], w[3]) are i.i.d. ∼ N(0, σ²), being independent of x. A student claims that y[1] − 2y[2] is a sufficient statistic. Either prove or disprove the claim.

Problem 3.14 (Sequential coding + repetition coding)  Consider a communication scheme that combines sequential coding and repetition coding. We split information bits of length B (a power of 2) into B/k chunks, each consisting of k (a power of 2) bits. We apply PAM to each chunk and then transmit the PAM signal using r-times repetition coding over the AWGN channel. We perform this procedure separately over chunks.
(a) What is the data rate R of this scheme?
(b) Derive the ML decision rule.
(c) Derive the error probability when using the ML rule. Use the Q-function notation.

Problem 3.15 (Phase shift keying)  Consider transmitting B = 3 bits over n = 2 time instants over an AWGN channel. The 8 messages (3 bits) correspond to points in a two-dimensional plane with x[1]-axis and x[2]-axis. These points are on a circle centered around the origin with radius √E and angles θ1, . . . , θ8 with

θi = (i − 1)π/4   radians,   i ∈ {1, . . . , 8}.
P 1 bits/time. = log2 1 + 2 2 σ
(1.105)
Suppose we wish to transmit B bits using n time slots. Define E b as the energy required per bit:
1.9
Capacity of the AWGN Channel
E b :=
73
nP total energy = . total # of bits B
(1.106)
Show that in order to ensure reliable communication (P(E) → 0), we must admit: Eb ≥ 2 ln 2. σ2
(1.107)
Note: The right-hand-side quantity in (1.107) is called the Shannon limit. Problem 3.18 (True or False?) (a) Suppose X is a discrete random variable which takes −1 or +1 with the equal probabilities. Let Y be a Gaussian random variable independent of X . Then Z = X + Y is Gaussian. (b) Suppose X are Y are independent and both have a uniform distribution with an interval [0, 1]. Then Z = X + Y is also Uniform. (c) Consider the sequential coding scheme that we learned in Sect. 1.7. Suppose that transmitted signals pass through an additive Gaussian channel. The ML decision rule reduces to the local ML rule. (d) Consider the problem of detecting x given y[1] and y[2]: y[1] = x + w[1]; y[2] = x + w[1] + w[2].
(1.108)
√ √ Here x is equally likely to be E or − E and w[m]’s are i.i.d. ∼ N (0, σ 2 ). Then the ML detector for x using y[1] only is the same as the ML detector based on (y[1], y[2]). (e) We wish to communicate one bit b of information using two time instants over an AWGN channel. There is a total energy of E allocated to send this bit. Recall the repetition coding scheme that we studied in Sect. 1.8. We use both time instants to send the bit and split out energy E into two parts: α E to be used at time 1 and (1 − α)E to be used at time 2, where α denotes a fraction between 0 and 1: y[1] =
√ α · x + w[1],
y[2] =
√ 1 − α · x + w[2].
√ √ We transmit x = E if b = 1, and x = − E otherwise. Suppose we use the ML decision rule. Then the optimal α ∗ that minimizes the probability of error is always either 0 or 1.
74
1 Communication over the AWGN Channel
(f) In the 1960s, Robert G. Gallager developed a remarkable code, called the LDPC code, that has great performance; however, it does not achieve the AWGN channel capacity promised by Shannon. More than 40 years later, Erdal Arikan finally developed an explicit code, called the Polar code, that exactly achieves the capacity of the AWGN channel.
(g) Consider an AWGN channel with noise w ∼ N(0, σ²). Suppose that SNR = P/σ² dB and the transmitted signals are subject to (1/n) Σ_{m=1}^{n} |x[m]|² ≤ P for any n. Then there exists a communication scheme that achieves a data rate of 1 bit per time instant while ensuring reliable communication.
References

Abbe, E. (2017). Community detection and stochastic block models: Recent developments. The Journal of Machine Learning Research, 18(1), 6446–6531.
Arikan, E. (2009). Channel polarization: A method for constructing capacity-achieving codes for symmetric binary-input memoryless channels. IEEE Transactions on Information Theory, 55(7), 3051–3073.
Beauchamp, K. (2001). History of telegraphy (No. 26). IET.
Bertsekas, D., & Tsitsiklis, J. N. (2008). Introduction to probability (Vol. 1). Athena Scientific.
Bondyopadhyay, P. K. (1995). Guglielmo Marconi - the father of long distance radio communication - an engineer's tribute. In 1995 25th European Microwave Conference (Vol. 2, pp. 879–885).
Browning, S. R., & Browning, B. L. (2011). Haplotype phasing: Existing methods and new developments. Nature Reviews Genetics, 12(10), 703–714.
Chen, Y., Kamath, G., Suh, C., & Tse, D. (2016). Community recovery in graphs with locality. In International Conference on Machine Learning (pp. 689–698).
Coe, L. (1995). The telephone and its several inventors: A history. McFarland.
Das, S., & Vikalo, H. (2015). SDhaP: Haplotype assembly for diploids and polyploids via semi-definite programming. BMC Genomics, 16(1), 1–16.
Fortunato, S. (2010). Community detection in graphs. Physics Reports, 486(3–5), 75–174.
Gallager, R. (1962). Low-density parity-check codes. IRE Transactions on Information Theory, 8(1), 21–28.
Girvan, M., & Newman, M. E. (2002). Community structure in social and biological networks. Proceedings of the National Academy of Sciences, 99(12), 7821–7826.
Johnson, J. B. (1928). Thermal agitation of electricity in conductors. Physical Review, 32(1), 97.
Kroese, D. P., Brereton, T., Taimre, T., & Botev, Z. I. (2014). Why the Monte Carlo method is so important today. Wiley Interdisciplinary Reviews: Computational Statistics, 6(6), 386–392.
LeCun, Y., Bottou, L., Bengio, Y., & Haffner, P. (1998). Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11), 2278–2324.
Nyquist, H. (1928). Certain topics in telegraph transmission theory. Transactions of the American Institute of Electrical Engineers, 47(2), 617–644.
Oppenheim, A. V., Willsky, A. S., Nawab, S. H., Hernández, G. M., et al. (1997). Signals & systems. Pearson Educación.
Proakis, J. G., Salehi, M., Zhou, N., & Li, X. (1994). Communication systems engineering (Vol. 2). Prentice Hall.
Shannon, C. E. (1938). A symbolic analysis of relay and switching circuits. Electrical Engineering, 57(12), 713–723.
Shannon, C. E. (2001). A mathematical theory of communication. ACM SIGMOBILE Mobile Computing and Communications Review, 5(1), 3–55.
Si, H., Vikalo, H., & Vishwanath, S. (2014). Haplotype assembly: An information theoretic view. In 2014 IEEE Information Theory Workshop (ITW 2014) (pp. 182–186).
Stewart, J. (2015). Calculus. Cengage Learning.
Tse, D., & Viswanath, P. (2005). Fundamentals of wireless communication. Cambridge University Press.
Vaseghi, S. V. (2008). Advanced digital signal processing and noise reduction. Wiley.
Chapter 2
Communication over ISI Channels
The Viterbi algorithm and OFDM come to the rescue of inter-symbol interference channels.
2.1 Waveform Shaping (1/2)
Recap  Up to this point, we have focused on the AWGN channel and investigated various transmission/reception techniques, while also characterizing the associated tradeoffs between SNR, data rate, and error probability. However, as previously noted, the AWGN channel is not entirely practical. In fact, the impracticality is related to the utilization of time slots for discrete-time signals, as depicted by the lollipop-shaped signals in Fig. 2.1 (e.g., x[1], x[2], x[3], etc., in the bottom left corner). In reality, electromagnetic waveforms are what we transmit over practical channels to convey information about these discrete-time signals, as exemplified by the waveform x(t) in Fig. 2.1. Of course, such a waveform operates within the continuous-time domain, necessitating the conversion of discrete-time signals into continuous-time waveforms.
Fig. 2.1 The two-stage operation in the encoder: (i) bits-to-voltages mapping; (ii) voltages-to-waveforms mapping
Outline This section will focus on the conversion of discrete-time signals to continuous-time waveforms. The discussion will be divided into three parts. Firstly, we will outline the design criteria that the converter must satisfy. One of these criteria will be derived from a fundamental characteristic of practical channels. In the second part, we will examine this characteristic and determine the corresponding design criterion. Finally, we will translate this criterion into mathematical terms. In the subsequent section, we will explore how to construct a converter that satisfies these design criteria.
Waveform shaping  The encoder architecture shown in Fig. 2.1 is crucial for practical communication systems. It comprises two main operations. The first operation involves mapping bits to voltages, which was discussed in Part I. The second operation, which is the main focus, involves mapping the voltages to waveforms that can be transmitted over a practical channel. This second mapping is commonly referred to as waveform shaping, as it is responsible for shaping the output signal into a waveform format.

Design criteria for the waveform shaper  The waveform shaper must adhere to two design criteria. The first criterion is straightforward: the input x[m] must be in a one-to-one relationship with the output x(t). In other words, x(t) should contain the same information as x[m]. As mentioned earlier, the second criterion is related to the behavior of practical channels. We will explore this property in detail.

Properties of practical channels  One of the practical channels that we will deal with in Part II is the wireline channel. It is also called the telephone channel, as it has served as a representative wireline channel in communication history. It turns out the wireline channel induces a restriction on the design of the waveform x(t). The restriction arises even when there is no additive noise, say w(t), in the system. For simplicity, let us assume for the time being that there is no additive noise.
Fig. 2.2 The wireline channel can be interpreted as a collection of three electrical circuit components: (i) resistance R; (ii) inductance L; and (iii) capacitance C
The wireline channel is constructed using copper wire, as it is the most cost-effective conductor that facilitates the flow of electrical signals. In terms of electrical circuit terminology, the copper wire can be viewed as consisting of three primary components: resistance R, inductance L, and capacitance C, as shown in Fig. 2.2. By understanding these components and studying Signals & Systems, we can identify a crucial property of the wireline channel: linearity. Remember one important property that you may learn from the course: any system that consists solely of (R, L, C) elements is linear (Oppenheim et al. 1997). A system is deemed linear if a linear combination of two inputs results in the same linear combination of their corresponding outputs. Specifically, given input-output pairs (x1(t), y1(t)) and (x2(t), y2(t)), a linear system satisfies:

c1·x1(t) + c2·x2(t) −→ c1·y1(t) + c2·y2(t),   ∀c1, c2 ∈ R.    (2.1)
There is another property that the wireline channel is subject to. The property is related to the fact that the (R, L, C) parameters depend on environmental conditions, e.g., the distance between the transmitter and the receiver, and the temperature of the wire. The conditions can change over time, so the parameters may vary. But the time scale of such a change is quite large relative to the communication time scale. Hence, it is fairly reasonable to assume that the system does not vary over time; in other words, it is time invariant. This forms the second property of the channel: time invariance. The formal definition of a time-invariant system is the following. A system is said to be time invariant if a time-shifted input x(t − t₀) by t₀ yields the output y(t − t₀) with exactly the same shift (Oppenheim et al. 1997). The above two properties imply that the wireline channel is a linear time invariant (LTI) system (Fig. 2.3).

Fig. 2.3 Key properties of an LTI system: (i) linearity; and (ii) time invariance
Impulse response h(t) of an LTI system  An LTI system can be described by an entity that you may be familiar with: the impulse response. So the channel can be fully expressed by an impulse response, say h(t). The impulse response is defined as the output fed by the Dirac delta input δ(t), which is defined as a function satisfying the following two conditions:

δ(t) = ∞ for t = 0 and δ(t) = 0 for t ≠ 0;   and   ∫_{−∞}^{+∞} f(t) δ(t) dt = f(0)    (2.2)
for a function f(t) (Oppenheim et al. 1997). The definition allows us to express the relationship between the input x(t) and the output y(t) via h(t). As you may know, the output is a convolution between the input and the impulse response:

y(t) = x(t) ∗ h(t) := ∫_{−∞}^{∞} x(τ) h(t − τ) dτ.    (2.3)
For those who are not comfortable with this, let us prove it. The proof is simple. It is based on the LTI property and the definition of an impulse response. First notice that an input δ(t − τ) yields an output h(t − τ) due to the time invariance property. By the multiplicative property of linearity, an input x(τ)δ(t − τ) gives x(τ)h(t − τ). Finally, the superposition property of linearity yields:

∫_{−∞}^{∞} x(τ) δ(t − τ) dτ −→ ∫_{−∞}^{∞} x(τ) h(t − τ) dτ.    (2.4)
Observe that

∫_{−∞}^{∞} x(τ) δ(t − τ) dτ = x(t)    (2.5)
due to the definition of the Dirac delta function; see (2.2). Hence, this proves the convolution relation (2.3) between x(t) and y(t). Also see Fig. 2.4.
Fig. 2.4 Impulse response of an LTI system and the input-output relationship
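To make the convolution relation (2.3) concrete, here is a small numpy sketch that approximates continuous-time convolution by discretizing time with a small step dt. The pulse shape and the decaying impulse response are made-up illustrative choices, not the book's channel.

```python
import numpy as np

dt = 1e-3                                  # time step for the discretization (seconds)
t = np.arange(0, 2, dt)

# A made-up input waveform and a decaying impulse response (illustrative only).
x = np.where(t < 0.5, 1.0, 0.0)            # rectangular pulse of duration 0.5 s
h = np.exp(-t / 0.1)                       # impulse response with roughly 0.1 s dispersion

# y(t) = integral of x(tau) h(t - tau) dtau, approximated by dt * (discrete convolution)
y = np.convolve(x, h)[: len(t)] * dt

print(y.max())   # the channel smooths and spreads the pulse in time
```

The same discretization can be used to check linearity and time invariance numerically, by feeding scaled, summed, or delayed inputs and comparing the outputs.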
Fig. 2.5 A typical shape of the impulse response of a wireline channel
A typical shape of h(t) and its spectrum  You may wonder what h(t) looks like in practice. One typical example is shown in Fig. 2.5. We define a notion which indicates how much the impulse response is spread over time. We adopt a very rough definition. Let Td be the dispersion time of the channel, which denotes the time duration during which the impulse response is not negligible. We may have a different value for Td depending on the definition of negligibility. For simplicity, we will take Td as a given parameter of the channel without going into more discussion. A typical shape of the corresponding spectrum is shown in Fig. 2.6. One distinctive feature of the shape is that it is symmetric. This is because the magnitude of the spectrum of any real-valued signal is symmetric (Oppenheim et al. 1997). The second feature is that the more the impulse response is spread in the time domain, the more concentrated the spectrum, and vice versa. Similar to the dispersion time Td, there is another notion in the frequency domain: W_H, which indicates the range of frequencies in which the magnitude |H(f)| of the spectrum is not negligible. What the second feature means in terms of Td and W_H is that the smaller Td, the larger W_H, and vice versa.

Restriction in the design of x(t)  The shape of the spectrum of the wireline channel as illustrated in Fig. 2.6 induces some restriction on x(t). To see this, consider:
Fig. 2.6 A typical spectrum shape of the impulse response of the wireline channel. Here W H indicates a range of frequencies in which |H ( f )| is not negligible
Fig. 2.7 The spectrum of a continuous-time signal x(t) with bandwidth W
Y(f) = H(f) X(f)    (2.6)
where X(f) and Y(f) denote the Fourier transforms of x(t) and y(t), respectively. Since |H(f)| is negligible at high frequencies, Eq. (2.6) suggests that the high frequency components in x(t) are significantly attenuated, e.g., where |f| is beyond W_H/2. To avoid this distortion, we need a restriction on x(t): x(t) should contain no component at high frequencies,

|X(f)| = 0   for |f| ≥ W_H/2.    (2.7)
We can rephrase this restriction with an important concept that often arises in the communication literature: bandwidth. The bandwidth, denoted by W, is defined as the range of non-zero frequencies. A formal definition of the bandwidth hinges upon some notation. Let the upper frequency f_U be the largest frequency at which |X(f)| ≠ 0, and let the lower frequency f_L be the smallest frequency at which |X(f)| ≠ 0. Then, the bandwidth W is defined as the difference between the upper and lower frequencies: W := f_U − f_L. For example, in Fig. 2.7, f_U = W/2, f_L = −W/2 and thus the bandwidth is W. In terms of W and W_H, the restriction (2.7) can then be rephrased as:

W ≤ W_H.    (2.8)
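As a rough illustration of where W_H comes from, the sketch below models the wireline channel as a simple first-order RC low-pass circuit; this particular model and its parameter values are assumptions made only for illustration, not the book's channel. It evaluates |H(f)| = 1/|1 + j2πfRC| and reads off the frequency range over which the spectrum is not negligible; any transmitted x(t) would then be designed with bandwidth W below this W_H, per (2.8).

```python
import numpy as np

R, C = 100.0, 1e-7          # made-up resistance (ohms) and capacitance (farads)
f = np.linspace(0, 1e6, 100_001)

# Frequency response of a first-order RC low-pass: H(f) = 1 / (1 + j*2*pi*f*R*C)
H = 1.0 / (1.0 + 1j * 2 * np.pi * f * R * C)

threshold = 0.1             # call |H(f)| "negligible" below 10% of its peak value
W_H = 2 * f[np.abs(H) >= threshold].max()   # two-sided width, since |H(f)| is symmetric

print(f"|H(f)| drops below {threshold} beyond about {W_H/2:.0f} Hz, so W_H is roughly {W_H:.0f} Hz")
```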
Look ahead Two key design criteria have been established for a waveform shaper: (i) it must be a one-to-one mapping, and (ii) x(t) must be constrained by a limited bandwidth: W ≤ W H . In the upcoming section, we will design a waveform shaper that adheres to these criteria.
2.2 Waveform Shaping (2/2)
Recap  In the preceding section, our objective was to design a block that could convert discrete signals x[m] (represented as lollipops) into waveforms x(t) that could be transmitted over a practical channel, as illustrated in Fig. 2.8. This block was named the "waveform shaper" in reference to the waveform output x(t). As an initial effort, we established two design criteria that the waveform shaper must meet. The first criterion is that it must be a one-to-one mapping function to ensure that no information is lost. Secondly, we considered the crucial property of practical wireline channels, whereby the spectrum |H(f)| of the channel impulse response h(t) is not negligible only within a certain frequency range: |H(f)| ≈ 0 (or considered negligible) for |f| ≥ W_H/2. Therefore, high-frequency components in the signal x(t) can be significantly distorted. To mitigate this issue, we imposed the second constraint that the bandwidth W of x(t) must be limited, i.e., W ≤ W_H.
Outline This section will outline the design process of a waveform shaper that satisfies the two aforementioned criteria. The process can be divided into three stages. Firstly, we will introduce a suitable structure for the waveform shaper that simplifies the construction process. Secondly, we will incorporate the two design criteria into this structure and formulate them mathematically. Lastly, we will employ Fourier transform and series techniques from Signals & Systems to derive a precise mathematical relationship between x[m] and x(t), ultimately completing the design.
Fig. 2.8 The two-stage operation in the encoder: (i) bits-to-voltages mapping; and (ii) voltages-to-waveforms mapping
Fig. 2.9 The meaning of the one-to-one mapping criterion
One-to-one mapping  Let us start by investigating the one-to-one mapping criterion. Here, what the one-to-one mapping means is that we have to ensure reconstruction of x(t) from x[m] and vice versa. See Fig. 2.9. In fact, the other way around (obtaining x[m] from x(t)) is called "sampling", and the sampling process is straightforward. As long as we specify the sampling period, say T (indicating the physical time duration between two consecutive samples), this process is obvious, as the name suggests. The one-to-one mapping criterion means that we have to ensure:

x[m] = x(mT),   ∀m ∈ {0, 1, . . . , n − 1}    (2.9)
where n denotes the total number of time slots used. We start from time 0. This is a convention that many people adopt in the field, although some people prefer to use time 1 as a starting point. As for the forward direction of our interest, there are many ways to connect the discrete signals (lollipops), reflected in the many green curves in Fig. 2.9. So you may wonder which way to take. It turns out that the way to connect (often called interpolation in the literature) is intimately related to a careful choice of the sampling period T and the bandwidth constraint W ≤ W_H. Let us investigate the relationship from the next subsection onwards.

A proper structure for a waveform shaper  In an effort to ease construction, we introduce a proper structure for a waveform shaper. Notice that the waveform shaper is sort of an annoying system that we are not familiar with. Why annoying? Because the signal types are distinct across the input x[m] and the output x(t): x[m] is discrete while x(t) is continuous. In order to close this gap, we introduce a two-stage architecture as illustrated in Fig. 2.10. First we have a so-called time-domain converter from discrete to continuous. The output of the converter is defined in continuous time, yet still preserves discrete-valued signals. What does it mean to have discrete-valued signals? It means that the signals contain spiked values (reflected in the Dirac delta function, x[m]δ(t − mT); also see the middle plot in Fig. 2.10) only at the sampled time instants, while having nothing at other continuous times. So it is an intermediate imaginary signal introduced only for conceptual simplicity.
Fig. 2.10 A two-stage architecture for a waveform shaper, consisting of (i) a time-domain converter (from discrete to continuous); and (ii) an interpolator g(t)
The role of the second block is to generate continuous-valued signals from such intermediate signals. See the right plot in Fig. 2.10. The second block is a standard homogeneous LTI system where the signal types are the same: the input and output are both continuous-time signals. Hence, we can describe it simply with an impulse response, say g(t). The conventional name of the block is an interpolator, as its role suggests: interpolating discrete values to generate smoothly changing analog signals. How to design g(t)? As hinted earlier, the way to design g(t) is related to the choice of the sampling period T and the bandwidth constraint W ≤ W_H. Let us now dive into the details.

Representation of a continuous-time yet discrete-valued signal xd(t)  There is a useful and common way to represent the intermediate signal xd(t): use the Dirac delta function δ(t), as we briefly mentioned while referring to the second plot in Fig. 2.10. Remember that xd(t) contains discrete values only at the sampled time instants, say 0, T, 2T, and so on. The signal at time 0 corresponds to x[0]. But in the continuous-time domain, we represent the signal in a slightly different manner: we employ the Dirac delta function to indicate the signal as x[0]δ(t). For the sample at time T, we express it as x[1]δ(t − T). So the compact formula for xd(t) reads:

xd(t) = Σ_{m=0}^{n−1} x[m] δ(t − mT).    (2.10)
Using the fact that x[m] = 0 for m < 0 and m ≥ n, we can alternatively represent xd(t) as:

xd(t) = Σ_{m=−∞}^{∞} x[m] δ(t − mT).    (2.11)
For convenience, we will use the simpler notation Σ_m to indicate Σ_{m=−∞}^{∞}. You may wonder why we multiply by the Dirac delta function. This will become clearer when explaining the design of g(t). Applying one of the criteria, x[m] = x(mT), to (2.11), xd(t) must read:

xd(t) = Σ_m x(mT) δ(t − mT)
(a) = Σ_m x(t) δ(t − mT)
= x(t) Σ_m δ(t − mT)    (2.12)
where (a) comes from δ(t − mT) = 0 for t ≠ mT.

How to design the interpolator g(t)?  How can we design g(t) such that x(t) = xd(t) ∗ g(t)? The formula (2.12) exhibits the relationship between xd(t) and x(t), but it is still difficult to identify g(t) directly from it. Taking the Fourier transform, we can gain insight. To see this, let us take the Fourier transform on both sides of (2.12) to obtain:

Xd(f) = X(f) ∗ F( Σ_m δ(t − mT) )    (2.13)
where we used the duality property of the Fourier transform: x(t)h(t) ←→ X(f) ∗ H(f) (Oppenheim et al. 1997). Here F(·) indicates the Fourier transform of (·). One can make a key observation which allows us to represent (2.13) in a more insightful form. The key observation is that the argument of F(·) in (2.13), Σ_m δ(t − mT), is a periodic signal with period T. So we can use the Fourier series to represent the periodic signal as:
Σ_m δ(t − mT) = Σ_k a_k e^{j2πkt/T}    (2.14)

where a_k indicates the kth Fourier coefficient, expressed as:

a_k = (1/T) ∫_{−T/2}^{T/2} Σ_m δ(t − mT) e^{−j2πkt/T} dt
= (1/T) ∫_{−T/2}^{T/2} δ(t) e^{−j2πkt/T} dt
= 1/T.    (2.15)

Fig. 2.11 The spectrum of xd(t) when 1/T ≥ W: Xd(f) = (1/T) Σ_k X(f − k/T)

Here the second equality comes from Σ_m δ(t − mT) = δ(t) for −T/2 ≤ t ≤ T/2; and the last equality is because of the definition of the Dirac delta function. Plugging (2.14) (together with (2.15)) into (2.13), we then get:
1 j 2πkt e T Xd ( f ) = X ( f ) ∗ F T k
1 k (a) = X( f ) ∗ δ f − T k T
k (b) 1 = X f − T k T
(2.16)
where (a) is due to F(e j2π f0 t ) = δ( f − f 0 ); and (b) is because X ( f ) ∗ δ( f − Tk ) = X ( f − Tk ). What is G( f ) that yields (2.16)? Notice in (2.16) that X d ( f ) is a periodic signal with the period T1 . Here the shape of the periodic signal X d ( f ) depends on the relationship between the bandwidth W of x(t) and the period T1 . Suppose we choose T such that the period T1 exceeds the bandwidth W . Then, we can get a spectrum which looks like the one in Fig. 2.11. Observe in the low frequency regime | f | ≤ W2 that we have T1 X ( f ). There is no overlap across many shifted copies because the period T1 is greater than the bandwidth W . Hence, the interested signal X ( f ) is sort of intact, although it is scaled by T1 . On the other hand, if T is chosen such that T1 < W , the interested signal X ( f ) is now compromised by the neighboring shifted copies. See Fig. 2.12. This advises us to choose T such that T1 ≥ W . With the choice, there is hope to extract the interested signal X ( f ) from X d ( f ). Indeed, by applying G( f ) with the shape as illustrated in Fig. 2.13, we can obtain the desired X ( f ). The impulse response g(t) that corresponds to G( f ) in Fig. 2.13 In order to figure out g(t) that yields G( f ) in Fig. 2.13, let us introduce some notation employed in the Signals-&-Systems literature:
88
2 Communication over ISI Channels
Fig. 2.12 The spectrum of xd (t) when
1 T
< W : Xd ( f ) =
1 T
k
X f −
k T
Fig. 2.13 An example of the spectrum G( f ) of an interpolator that satisfies the relationship (2.12) between xd (t) and x(t)
rect(x)
=
1, − 21 ≤ x ≤ 21 ; 0, other wise.
Using this rect function, we can represent G( f ) in Fig. 2.13 as: G( f ) = T rect( f T ).
(2.17)
It is well known that the rect function has the Fourier transform relationship with the function:
t (2.18) T rect( f T ) ↔ sinc T
sinc
where sinc(x) :=
sin(πx) . πx
Hence, we get:
t g(t) = sinc . T
(2.19)
See Problem 4.2 for derivation of (2.18). Relationship between x[m] and x(t) Recall in the introduced structure of a waveform shaper that: x(t) = xd (t) ∗ g(t).
(2.20)
2.2 Waveform Shaping (2/2)
89
Plugging (2.19) into the above with the choice of T such that
1 T
t x(t) = xd (t) ∗ sinc T
t (a) = x[m]δ(t − mT ) ∗ sinc T m
t − mT (b) = x[m]sinc T m
t −m = x[m]sinc T m
≥ W , we get:
(2.21)
where (a) is due to (2.10); and (b) comes from y(t) ∗ δ(t − t0 ) = y(t − t0 ) for any y(t) and t0 (see Problem 4.2). Finally, recall the bandwidth constraint: W ≤ W H . This constraint can be satisfied with the following choice of T : W ≤
1 ≤ WH . T
(2.22)
Look ahead Our attention has been directed towards the transmitter side thus far. As a result, one might naturally wonder about the receiver side. In the following section, we will construct a corresponding receiver architecture.
90
2 Communication over ISI Channels
2.3 Optimal Receiver Architecture
Recap In the previous sections, we focused on designing a waveform shaper, which converts a discrete-time signal (represented by a sequence of lollypops) into a continuoustime analog signal (represented by a waveform). This conversion is necessary because practical channels require analog signals for transmission. The design of the waveform shaper was based on two criteria. Firstly, it must be a one-to-one mapping to ensure that the output waveform x(t) contains the same information as the input signal x[m], without any loss of information. Secondly, x(t) must be bandwidthlimited to avoid distortion caused by the non-flat spectrum of the channel’s impulse response. We ensured that the bandwidth W of the transmitted signal does not exceed the range W H of significant frequencies in the channel’s spectrum. To meet these design criteria, we designed a waveform shaper that utilized a proper sampling period T to avoid signal overlap (called aliasing) in the frequency domain. We set T such that the sampling rate T1 is higher than the signal bandwidth W , and based on the frequency domain picture, we developed a way to interpolate the discrete-time signal to produce a continuous-time signal: x(t) =
m
x[m]sinc
t −m T
(2.23)
where sinc(x) := sinπxπx . This design is focused solely on the transmitter side. As a result, one may inquire about the corresponding receiver architecture.
Outline
bits
bits-to-voltage
waveform shaper
practical channel
bandwidth: W
Fig. 2.14 An encoder architecture consisting of: (i) bits-to-voltages mapper; and (ii) waveform shaper
2.3 Optimal Receiver Architecture
91
In this section, we will delve into designing a corresponding receiver architecture. This section consists of four parts. Firstly, we will propose a receiver architecture in a simple context where the channel is perfect (i.e., h(t) = δ(t)) and there is no additive noise (i.e., w(t) = 0). It turns out this receiver architecture is simply the reverse of the waveform shaper, involving a sampling operation that converts y(t) into y[m]. Secondly, we will justify the optimality of the sampler-based receiver architecture under the AWGN channel, as well as practical channels such as the wireline channel. Thirdly, we will utilize the receiver architecture to derive an equivalent discrete-time channel model that directly links x[m] to y[m]. Finally, we will discuss the design of the optimal receiver within the framework of the discrete-time channel model.
A corresponding receiver architecture in the perfect channel Recall that the wireline channel can be modeled as a concatenation of an LTI system and the AWGN channel. Hence, one can write down the channel output y(t) as: y(t) = h(t) ∗ x(t) + w(t)
(2.24)
where h(t) denotes an impulse response of the LTI system and w(t) indicates an AWGN. The received waveform y(t) is then fed into the optimal receiver which intends to decode information bits b. See Fig. 2.15. The question of interest is: What is the optimal receiver structure corresponding to the transmitter structure developed in the previous section? Observe two complications in the received waveform (2.24). One is that these are continuous-time signals. We never learned about the optimal decision rule concerning continuous-time signals. The second complication comes from the convolution operation. To gain insights, let us first consider a simple perfect channel setting in which h(t) = δ(t) and w(t) = 0: y(t) = x(t).
(2.25)
Remember in the transmitter side that we took the discrete-time signals x[m]’s and interpolated them to obtain a waveform x(t) that contains all the information in x[m]’s. The ultimate goal at the receiver is to decode the transmitted bits, so we
waveform shaper
bits-to-voltage
optimal receiver?
Fig. 2.15 A block diagram of a two-stage communication system
92
2 Communication over ISI Channels
bits-to-voltage
ML detector
waveform shaper
sampling @
Fig. 2.16 The sampler-based receiver architecture
channel
optimal receiver = waveform-level ML detector
Fig. 2.17 The waveform-level optimal ML receiver
can readily see that the receiver strategy would be to do exactly the reverse of what the transmitter is doing—sample the received waveform y(t) at t = mT , and obtain discrete-time signals y[m]’s. Under this perfect channel, we then get: y[m] = y(mT ) = x(mT ) = x[m],
∀m.
(2.26)
Optimality of the sampler-based receiver What about under the AWGN channel? Assuming to still employ the sampler, for the AWGN channel, y[m] would be the sum of x[m] and an additive noise: y[m] = x[m] + w[m]
(2.27)
where w[m] denotes a sample version of w(t). Since w(t) is an AWGN, w[m] := w(mT ) is also i.i.d. ∼ N (0, σ 2 ). In the presence of a noise, all we need to do for decoding the information bits from y[m] is to apply the ML decision rule. This is something that we have been doing in Part I. Given y[m]’s, make the best possible ˆ This leads to the initial guess of the optimal receiver architecture as estimate b. illustrated in Fig. 2.16. At this moment, this is nothing but a guess for the optimal receiver. Why is it still a guess? Remember in the perfect channel that we had only the sampler, and we saw that the sampler-based structure leads naturally to the optimal receiver. On the other hand, in the presence of a noise, it is not quite clear if we need the sampler in the first place. In fact, the optimal receiver should have the following structure in which we have the continuous-time waveform-level signal input. See Fig. 2.17.
2.3 Optimal Receiver Architecture
93
Here bˆML,wave denotes the output of the waveform-level ML detector: bˆML,wave := arg max f y(t) (a(t)|b = i) i
(2.28)
where f y(t) (a(t)|b = i) indicates the likelihood function w.r.t. the received waveform y(t) given b = i. Now a key question arises:
• ?
Question
Is the sampler-based ML solution, say bˆML , the same as the waveform-level (raw signal level) ML counterpart bˆML,wave ? It has been discovered that the maximum likelihood estimate of the transmitted bits bˆML , obtained using a sampler followed by the ML detector, is equivalent to the estimate obtained using the waveform shaper followed by the ML detector, bˆML,wave . This holds true for typical channels such as the AWGN channel and the wireline channel. This implies that no information is lost during the sampling process. This result is known as the irrelevance theorem (Gallager 2008), which is a profound result. However, providing a formal proof of this theorem is challenging and hence will not be discussed in this book. Instead, we will simply use the optimal receiver structure comprising the sampler followed by the ML detector. An equivalent discrete-time channel model Next we design the sampler-based ML detector in the context of the wireline channel. To this end, we first need to figure out the relationship between the input and the output in the discrete-time domain: x[m] and y[m]. Starting with the property of LTI systems, we get: y(t) = h(t) ∗ x(t) + w(t) ∞ h(τ )x(t − τ )dτ + w(t) = −∞ ∞ (a) = h(τ )x(t − τ )dτ + w(t) 0
∞ t −τ (b) −k = h(τ ) x[k]sinc dτ + w(t) T 0 k
∞ t −τ − k dτ + w(t) x[k] h(τ )sinc = T 0 k
(2.29)
where (a) is because h(τ ) = 0 when τ < 0 for a realistic wireline channel (see Fig. 2.18; you may wonder why h(τ ) = 0 for τ < 0; this is because otherwise the output is a function of future inputs, which never occurs in reality); and (b) follows from (2.23). By sampling y(t) at t = mT , we obtain:
94
2 Communication over ISI Channels
Fig. 2.18 A typical shape of the wireline channel’s impulse response
y[m] := y(mT ) ∞
τ = dτ +w[m] x[k] h(τ )sinc m − k − T k 0 =
(2.30)
=:h[m−k]
x[k]h[m − k] + w[m].
k
We recognize (2.30) as a discrete-time convolution and thus we have for the final received signal: y[m] = h[m] ∗ x[m] + w[m]
(2.31)
where h[m] is a discrete-time impulse response that represents a sampled version of h(t):
∞
h[m] = 0
τ dτ . h(τ )sinc m − T
(2.32)
This gives us an equivalent discrete-time channel model as summarized in Fig. 2.19. Remarks on h[m] How to design the optimal ML detector, given y[m]’s? It turns out the discrete-time convolution in (2.31) incurs a serious issue. Prior to discussing what that issue is, let us say a few words about h[m]. First of all, there is a terminology for this. To figure this out, notice that
bits-to-voltage
Fig. 2.19 The equivalent discrete-time model for the wireline channel
ML detector
2.3 Optimal Receiver Architecture
95
y[m] = x[m] ∗ h[m] + w[m] (a)
= h[m] ∗ x[m] + w[m] h[k]x[m − k] + w[m] =
(2.33)
k
where (a) is due to the commutative property of convolution. Here the kth channel impulse response h[k] is called a channel tap or simply a tap. Why do we call it a tap? This is because the channel impulse response looks like a sequence of faucets (taps). With the notation h[k], the tap index k may be confused with the time index m. Hence, in order to differentiate the tap index from the time index, we denote h[k] by h k . The second remark is about a range of k for non-zero h k ’s. As mentioned earlier, the output cannot depend on future inputs, which led to h(τ ) = 0 for τ < 0 (again see Fig. 2.18). In the discrete-time domain, this is translated to: h k = 0,
k < 0.
(2.34)
To see this clearly, observe: y[m] =
h k x[m − k] + w[m]
k
= h 0 x[m] + h 1 x[m − 1] + h 2 x[m − 2] + · · · + h −1 x[m + 1] + h −2 x[m + 2] + · · · + w[m].
(current) (past)
(2.35)
(future)
In light of the current symbol, say x[m], (h −1 , h −2 , . . . , ) are w.r.t. the future inputs (x[m + 1], x[m + 2], . . . ). Hence, (h −1 , h −2 , . . . , ) must be all zero. Inter-symbol interference (ISI) Now let us go back to the issue. To figure out the issue, let us again observe (2.35): y[m] = h 0 x[m] + h 1 x[m − 1] + h 2 x[m − 2] + · · · + w[m].
(current) (past)
(2.36)
Consider a sequential coding scheme in which the mth information bit, say bm , is mapped to x[m]. Here in light of the current symbol, say x[m], the past symbols (x[m − 1], x[m − 2], . . . ) act as interference. This phenomenon is called the intersymbol interference, ISI for short. This is the new phenomenon that we did not observe in Part I. It turns out the ISI incurs some complication in the ML decision rule.
96
2 Communication over ISI Channels
Look ahead In the next section, we will study how to deal with ISI and then to develop the optimal ML decision rule.
2.3 Optimal Receiver Architecture
97
Problem Set 4 Problem 4.1 (A choice of the sampling period T ) We desire to send 10M bits/sec over an AWGN channel under the condition that a transmitted waveform x(t) has a bandwidth constraint of W = 100 MHz. (a) Find the smallest sampling period T that respects the waveform shaper design criteria. Also explain why. You may want to rely on the two-stage architecture that we introduced in Sect. 2.2. (b) When using the sampling period derived in part (a), what is the minimum SNR required to achieve the target data rate 10M bits/sec. Hint: Use Shannon’s capacity formula that we learned in Sect. 1.9. Problem 4.2 (Basics on Signals & Systems) Let T be the sampling period. The sinc function is defined as sinc(x) := sinπxπx and rect(x)
=
1, − 21 ≤ x ≤ 21 ; 0, other wise.
(a) Show that ∞
δ(t − mT ) =
m=−∞
∞ 1 j 2πkt e T . T k=−∞
(2.37)
(b) Show that for any t0 , x(t) ∗ δ(t − t0 ) = x(t − t0 ).
(2.38)
t = T rect( f T ) F sinc T
(2.39)
(c) Show that
where F(·) indicates the Fourier transform of (·). Problem 4.3 (An equivalent discrete-time channel model) Let x(t) be the output of the waveform shaper fed by discrete-time signals x[m]’s where t ∈ R and m ∈ N. Assume that the waveform shaper is designed such that x(t) =
∞ m=−∞
x[m]sinc
t −m T
where T denotes the sampling period and sinc(x) := a wireline channel fed by x(t):
sin πx . πx
y(t) = h(t) ∗ x(t) + w(t)
(2.40)
Let y(t) be the output of
(2.41)
98
2 Communication over ISI Channels
where h(t) indicates the impulse response of the wireline channel (modeled as an LTI system) and w(t) is an AWGN with N (0, σ 2 ). Suppose we define: y[m] := y(mT ); w[m] := w(mT );
(2.42)
h[m] := h(mT ). A student claims that under the above definition, y[m] = h[m] ∗ x[m] + w[m].
(2.43)
Either prove or disprove the claim.
Problem 4.4 (True or False?) (a) Consider x(t) = sin(2π f t). Let x[m] = x(mT ) where T indicates the sampling period. The smallest possible sampling period for which x(t) can be perfectly reconstructed from its samples x[m]’s is 21f s. (b) Suppose that the spectrum of a continuous-time signal takes non-zero values only at f 1 ≤ f ≤ f 2 where f 2 > f 1 > 0. Then, the sampling period T = f2 −1 f1 is enough to recover the original signal. (c) Consider a communication problem in which we wish to send information bits b over a channel. Let y(t) be a continuous-time received signal of the channel. Let y[m] := y(mT ) where m ∈ {0, 1, . . . , n − 1} and T denotes the sampling period. Then, arg max f y[0],...,y[n−1] (a[0], . . . , a[n − 1]|b = i) = arg max f y(t) (a(t)|b = i) i∈I
i∈I
where f y[0],...,y[n−1] (·|b = i) indicates the conditional pdf of (y[0], . . . , y[n − 1]) given b = i; f y(t) (·|b = i) denotes the conditional pdf of y(t) given b = i; and I = {0, 1}k := {(i 1 , . . . , i k ) : i j ∈ {0, 1}, j = 1, . . . , k}. (d) In Sect. 2.3, we modeled the wireline channel as a concatenation of an LTI system and an AWGN channel. For this model, we considered a receiver architecture in which the received waveform y(t) is sampled at mT and the sampled voltages y[m]’s are put into the voltage-level ML detector as inputs. Here we set the sampling period T such that T ≤ W1 where W indicates the bandwidth of a transmitted waveform x(t). We may consider a different receiver architecture which consists of only the waveform-level ML detector that directly gets y(t) as inputs. It is then possible to design the second receiver architecture that outperforms the first one in minimizing the probability of error.
2.4 Optimal Receiver in ISI Channels
99
2.4 Optimal Receiver in ISI Channels
Recap In the preceding section, we derived an optimal receiver architecture that corresponds to the sampler-based transmitter which we developed in Sects. 2.1 and 2.2. We discovered that the optimal receiver architecture involves a sampler at the front end, which produces the sequence y[m] := y(mT ) from the received waveform y(t). With the proper and smart definition of h[m] as below:
∞
h[m] := 0
τ dτ , h(τ )sinc m − T
(2.44)
we could then establish an equivalent discrete-time channel model: y[m] = h[m] ∗ x[m] + w[m]
(2.45)
where w[m]’s are i.i.d. ∼ N (0, σ 2 ) and h(t) indicates an impulse response of the wireline channel. We introduced a new term to refer to the impulse response of the discrete-time channel. That was, “channel tap” and we denoted it by h k to distinguish the tap index k from the time index m. Therefore, we can rewrite (2.45) as follows: y[m] =
∞
h k x[m − k] + w[m].
k=0
We set h k = 0 for k < 0, since the output cannot depend on future inputs. Next, we examined an issue that arises in the above channel. To see the issue clearly, we considered a sequential coding approach where each information bit bm is assigned to a transmitted signal x[m]. Specifically, we focused on an equation of the following form: y[m] = h 0 x[m] + h 1 x[m − 1] + h 2 x[m − 2] + · · ·
(2.46)
+ w[m]. The issue was that in view of the current symbol, say x[m], the past symbols cause interference. The phenomenon is called inter-symbol interference, ISI for short.
100
2 Communication over ISI Channels
Outline In this section, we will explore the design of the optimal receiver that considers the ISI issue. This section consists of three parts. First, we will derive the optimal ML receiver. Then, we will discuss the challenge in complexity that comes with implementing the optimal receiver in a naive way. Lastly, we will introduce a solution to the issue through an efficient algorithm, namely the Viterbi algorithm (Forney, 1973).
Signal representation For illustrative simplicity, let us consider a simple two-tap ISI channel: y[m] = h 0 x[m] + h 1 x[m − 1] + w[m].
(2.47)
Consider sequential coding in which we send one bit at a time: x[m] =
√ +√ E, bm = 1; − E, bm = 0
(2.48)
where m ∈ {0, . . . , n − 1} and n denotes the total number of time slots used. Suppose we want to decode x[m]. At time m, the interference h 1 x[m − 1] explicitly contains information about the bit that was sent in time m − 1. Since x[m − 1] is related to x[m − 2] due to the ISI and so on, it implicitly contains information about all the bits that were sent before time m. Hence we can easily guess that all of the received signals y[m]’s affect decoding x[m]. Similarly decoding the other transmitted signals is affected by all of the received signals. This motivates us to consider a block decoding scheme in which we decode all the bits looking at all the received signals. Specifically, given (y[0], . . . , y[n − 1]), we wish to decode (x[0], . . . , x[n − 1]) altogether. In light of this block decoding, the signals x[m] and x[m − 1] in (2.47) are all desired to be decoded. This motivates us to represent (2.47) as: y[m] = [h 1 h 0 ] =:hT
x[m − 1] +w[m]. x[m]
(2.49)
=:s[m]
This way, (2.49) looks like an AWGN channel where we are familiar with the ML derivation. Here we call s[m] a state. Later you will figure out the rationale behind the naming. Since decoding (x[0], x[1], . . . , x[n − 1]) is equivalent to decoding (s[0], s[1], . . . , s[n − 1]), it suffices to decode the states. We will focus on decoding the states instead.
2.4 Optimal Receiver in ISI Channels
101
The optimal ML decision rule The optimal ML decision rule is to find the sequence of states that maximizes the likelihood function w.r.t. the received signals y := (y[0], y[1], . . . , y[n − 1]). So let us massage the likelihood function: (a) f y (y[0], . . . , y[n − 1]|s) = f w y[0] − hT s[0], . . . , y[n − 1] − hT s[n − 1]|s (b)
=
n−1
f w[m] y[m] − hT s[m]
m=0
(c)
=
2 1 1 exp − 2 y[m] − hT s[m] √ 2σ 2πσ 2 m=0 n−1
where (a) is because w[m] = y[m] − hT s[m]; (b) is due to the independence of w[m]’s; and (c) comes from w[m] ∼ N (0, σ 2 ). This then yields:
2 1 exp − 2 y[m] − hT s[m] s 2σ 2πσ 2 m=0 n−1 2 1 T = arg max exp − 2 y[m] − h s[m] s 2σ m=0
s∗ = arg max
= arg min s
n−1
√
n−1
1
(2.50)
2 y[m] − hT s[m] .
m=0
A challenge in computation How to compute s∗ in (2.50)? One naive way is to do an exhaustive search. For all possible sequence patterns of states, we compute the following interested quantity: n−1
2 y[m] − hT s[m] .
(2.51)
m=0
We then find the sequence pattern that yields the minimum quantity. Under this naive way, however, we are faced with a challenge in complexity. Why? Think about the total number of possible sequence patterns. How many possible sequences? That is, 2n . This is because decoding (s[0], s[1], . . . , s[n − 1]) is equivalent to decoding (x[0], x[1], . . . , x[n − 1]) and the sequence of (x[0], x[1], . . . , x[n − 1]) takes one of the 2n possibilities. So the challenge is that the complexity grows exponentially with n, which is definitely prohibitive for a large value of n. To figure out exactly what the complexity is, let us compute the numbers of multiplication, addition and comparison required to figure out s∗ from (2.50).
102
2 Communication over ISI Channels
Fig. 2.20 Andrew Viterbi, a prominent figure in the fields of communication and information theory, is credited with developing a highly efficient algorithm that can realize the optimal ML solution in ISI channels
Andrew Viterbi ‘67
2 For each time slot, say m, the computation of y[m] − hT s[m] requires 3 multiplications and 2 additions. Since we have n of such computations, we get the following complexity: 3n · 2n ;
#multiplication:
(2n + (n − 1)) · 2n ; 2n − 1
#addition: #comparison:
where the number n − 1 marked in blue comes from the complexity w.r.t.
n−1
m=0 .
A low complexity solution Is there a more efficient way to implement the ML receiver with lower complexity? Andrew Viterbi, a prominent figure in communication and information theory, sought to answer this question by developing an algorithm that grows linearly in complexity with n. This algorithm, named after Viterbi, is called the Viterbi algorithm (Forney 1973), and we can see his portrait in Fig. 2.20. In the remainder of this section, we will examine some fundamental concepts that underlie the Viterbi algorithm. Cost & finite state machine The first concept is concerned about an interested quantity in the minimization problem: s∗ = arg min s
n−1
2 y[m] − hT s[m] m=0
(2.52)
=: cm (s[m])
2 The quantity y[m] − hT s[m] in the above can be viewed as something negative, because the smaller the quantity, the better the situation is. Hence, people often call it cost w.r.t. the state s[m].
2.4 Optimal Receiver in ISI Channels
103
Fig. 2.21 A finite state machine with four states. A transition occurs depending on the value of x[m]
s3
s0
+
-
+
-
s2
s1
+
The second is related to the state (or state vector) s[m] that can take one of the following six candidates:
√ √ √ √ 0 0 +√ E −√ E −√ E +√ E √ √ , , , , , . − E + E + E + E − E − E s0
s1
s2
s3
The first two occur only at the beginning (m = 0), assuming that x[−1] = 0. So the two states are negligible relative to the others for a large value of n. For simplification, √ let us not worry about the two states. To this end, we intentionally set x[−1] = + E. The value of x[m] determines the transition between states s[m]. This process can be visualized as a Finite State Machine (FSM) shown in Fig. 2.21. Here the transition arrows are labeled with + or − indicating √ the sign of x[m + 1] for the current state E when in state s0, we follow the red arrow s[m]. For example, if x[m√+ 1] = − √ to reach the state s3 of [+ E; − E]. While the state transition diagram effectively represents the movement between states, it does not account for time evolution. This is where the trellis diagram comes in as another concept related to the state machine. Trellis diagram The trellis diagram exhibits both state transitions and time evolution. To clearly understand how it works, let us walk you through details with Figs. 2.22 and 2.23.
104
2 Communication over ISI Channels
state
time
??
If + occurs, stay at s0.
s0 s1
s2 If – occurs, move to s3.
s3 Fig. 2.22 A trellis diagram at m = 0 state
time
s0 s1 s2
s3 Fig. 2.23 A trellis diagram at m = 1
Consider the state at time 0: ⎧ √ √ +√ E ⎪ √ ⎪ = s0, if x[0] = + E; ⎨ + E x[−1] + E = √ s[0] = = √ x[0] x[0] ⎪ + ⎪ √ E = s3, if x[0] = − E ⎩ − E
√ where the second equality is due to our assumption x[−1] = + E. Notice that s[0] takes one of the two possible states s0 and s3, reflected in the two dots at time 0 in Fig. 2.22. In time −1, we have two options for s[−1] depending on the value of x[−2]: s0√and s1. To remove the ambiguity in time −1, let us further assume x[−2] = + E to fix the initial state as s0. This way, √ we can ensure that we start always from s0. As mentioned earlier, if x[0] = + E, s[0] = s0; otherwise s[0] = s3. To be more familiar with how it works, let us consider √ one more time slot as illustrated in Fig. 2.23. Suppose s[0] = s0. If x[1] = + E, then it stays at the same state s0 (reflected in the blue transition arrow; otherwise, it moves to s3
2.4 Optimal Receiver in ISI Channels
state
105
time
s0 s1 s2
s3 √ √ √ Fig. √ 2.24 A trellis diagram for the sequence (x[0], x[1], x[2], x[3]) = (+ E, − E, + E, + E)
√ (reflected in the red arrow). On the other hand, given s[0] = s3, if x[1] = + E, then it moves to s1; otherwise, it goes to s2. Cost calculation How to calculate the cost in (2.52) from the trellis diagram? To see how it works, consider a concrete example for√n = 4,√as illustrated in Fig. 2.24. √ √ In this example, (x[0], x[1], x[2], x[3]) = (+ E, − E, + E, + E), so the trellis path takes the blue-red-blue-blue transition arrows, yielding the state change as: s0-s0-s3-s1-s0. To ease cost calculation, we leave associated cost at the cor√ an √ responding state node. For instance, we put c0 ([+ E; + E]) (marked √ in green √ in ([+ E; − E]), Fig. 2.24) nearby the black dot s0 in time 0. Similarly we leave c 1 √ √ √ √ c2 ([− E; + E]), c3 ([+ E; + E]) for the corresponding black dots. By aggregating all the costs associated with the black dots, we can compute the cost for one such sequence. Considering all possible sequence patterns, we compute the optimal path as: s∗ = arg min {c0 (s[0]) + c1 (s[1]) + c2 (s[2]) + c3 (s[3])} . s
As mentioned earlier, a naive exhaustive search requires 24 possibilities. So the complexity is very expensive especially for a large value of n. Look ahead In the next section, we will study the Viterbi algorithm that well exploits the structure of the trellis diagram to find s∗ efficiently.
106
2 Communication over ISI Channels
2.5 The Viterbi Algorithm
Recap In the previous section, we investigated the optimal receiver for a simple two-tap ISI channel: y[m] = h 0 x[m] + h 1 x[m − 1] + w[m]
(2.53)
where w[m]’s are i.i.d. ∼ N (0, σ 2 ). As for a transmission scheme, we considered sequential coding in which an information bit bm is mapped to each transmitted signal x[m]. Since each symbol propagates upto the last received signal through the ISI term h 1 x[m − 1], we introduced block decoding in which we decode the entire symbols (x[0], x[1], . . . , x[n − 1]) based on all the received signals (y[0], y[1], . . . , y[n − 1]). Here n denotes the total number of time slots employed. Specifically we represented (2.53) as: y[m] = [h 1 h 0 ] =:hT
x[m − 1] +w[m]. x[m]
(2.54)
=:s[m]
We intended to encode the entire state vectors (s[0], s[1], . . . , s[n − 1]) which is equivalent to decoding (x[0], x[1], . . . , x[n − 1]). Manipulating the likelihood function w.r.t. the received signals, we obtained the ML solution: s∗ = arg min s
n−1
2 y[m] − hT s[m] .
(2.55)
m=0
As computing all 2n possible sequence patterns using a brute force approach is computationally intensive, we introduced a more efficient algorithm, the Viterbi algorithm, which significantly reduces complexity. We also examined some fundamental concepts that underpin this algorithm. The first is the cost, which represents the desired quantity in optimization (2.55). The second concept is the Finite State Machine (FSM), which pertains to the state vector s[m]. The third concept is the trellis diagram, which provides a graphical representation of how each state evolves over time. Figure 2.25 illustrates√an example of the √ trellis diagram for √ √ E, − E, + E, + E), assuming that the sequence (x[0], x[1], x[2], x[3]) = (+ √ x[−1] = x[−2] = + E.
2.5 The Viterbi Algorithm
state
107
time
s0 s1 s2
s3 Fig. 2.25 A trellis diagram how states are evolved for the sequence √ which √ exhibits √ √ (x[0], x[1], x[2], x[3]) = (+ E, − E, + E, + E)
Outline In this section, we will explore how the Viterbi algorithm works and understand why it offers a much lower complexity. The section comprises four parts. Firstly, we will highlight a crucial observation that provides valuable insight into the development of the Viterbi algorithm. We will then introduce the central idea of the algorithm, along with its detailed procedures. Subsequently, we will show that the algorithm’s complexity increases linearly with n, offering significant benefits over the exhaustive approach. Finally, we will discuss a challenge that arises with the Viterbi algorithm, particularly when dealing with a large number of channel taps.
Key observation in the trellis diagram The Viterbi algorithm is inspired by an interesting observation. To see this clearly, let us consider two possible sequence patterns presented in Fig. 2.26: √ √ √ √ (i) (x[0], x[1], x[2], x[3]) = (− E, + E, − E, + E) (marked in purple); √ √ √ √ (ii) (x[0], x[1], x[2], x[3]) = (− E, + E, − E, − E) (makred in blue). Here the key observation is that the two trellis paths are significantly overlapped; hence, the two corresponding costs are identical except the one w.r.t. the last state vector. This motivated Viterbi to come up with the following idea. The idea of the Viterbi algorithm The idea is to successively store only an aggregated cost up to time t and then use this to compute a follow-up aggregated cost w.r.t. the next time slot. In order to understand what this means in detail, let us consider a simple example in which (h 0 , h 1 ) = (1, 1), E = 1 and we receive: (y[0], y[1], y[2], y[3]) = (2.1, 1.8, 0.5, −2.2).
108
2 Communication over ISI Channels overlap
-+-+ -+-state
time
s0 s1 s2
s3 Fig. 2.26 A key observation in the trellis diagram: The two trellis paths (marked in purple and blue) share most of the costs, except the one w.r.t. the last state vector Fig. 2.27 Cost computation and store strategy: For each black dot, we store the aggregated cost summed up so far
state
time
s0 s1 s2 s3
See Figs. 2.27 and 2.28. First compute the cost at time 0 when x[0] = +1:
c0
+ +
=
y[0] − [1 1]
+1 +1
2 = 0.01.
(2.56)
The cost value of 0.01 is then placed along the blue transition arrow, as illustrated in Fig. 2.27. We then store the cost at the state s0 node at time 0 (a black dot). We indicate the store by marking a purple-colored number nearby the black dot. We do the same thing w.r.t. the cost yet now when x[0] = −1:
2.5 The Viterbi Algorithm
state
109
time
s0 s1 s2
s3 Fig. 2.28 The core idea of the Viterbi algorithm: In time 2, we take only one among two candidate cost values (2.3 and 9.9), coming from two possible prior states (s0 and s1) at time 1. This is because the larger cost 9.9 will generate loser paths that would be eliminated in the competition in the end
c0
+ −
=
+1 y[0] − [1 1] −1
2 = 4.41.
(2.57)
We place the value of 4.41 along the associated red transition arrow and then store 4.41 nearby the black dot w.r.t. the state s3. √ Next we compute the cost w.r.t. time 1. The cost for x[1] = + E is computed as 0.04 (check!). So the aggregated cost up to time 1 w.r.t. the state s0 would be 0.01+0.04, as illustrated in Fig. 2.27. Similarly the aggregated costs for the other states (s1, s2, s3) would be 4.41+3.24, 4.41+14.44, 0.01+3.24, respectively. Also check! Now one can see the core idea of the Viterbi algorithm from time 2. See Fig. 2.28. Consider the state s0 at time 2. This state occurs when x[2] = +1 and comes from two possible prior states: s0 and s1. The cost for x[2] = +1 is first computed as 2.25. So the aggregated cost assuming that it comes from the prior s0 state (storing 0.05 for the aggregated cost up to time 1) would be: 0.05 + 2.25 = 2.3. On the other hand, the aggregated cost w.r.t. the prior s1 state would be: 7.65 + 2.25 = 9.9. Now how to deal with the two cost values? Remember that we are interested in finding the path that yields the minimum aggregated cost in the end. So the path w.r.t. the larger cost 9.9 would be eliminated in the competition. So we don’t need to worry about any upcoming paths w.r.t. the larger cost. This naturally motivates us to store only the minimum between the two cost values at the state s0 in time 2, while ignoring the other loser path. So we store 2.3 at the node, as illustrated in Fig. 2.29. We do the same thing for the other states. For the state s1, the lower path turns out to be the winner, so we store the corresponding aggregated cost 3.5 at the state s1 node, while deleting the upper loser path. Similarly for the states s2 and s3. We repeat this procedure until the last time slot. See all the associated computations in Fig. 2.29.
110
2 Communication over ISI Channels
state
time
s0 s1 s2
s3 Fig. 2.29 Choose s∗ that minimizes the aggregated cost Fig. 2.30 Complexity per state in the Viterbi algorithm
state
time
s0 s1 # multiplication: # addition:
# comparison:
Now how to find the path that yields the minimum aggregated cost from the picture? It is very simple. Take a look at the four aggregated costs at the last time slot: 19.94, 5.14, 0.34, 7.14. We then pick up the minimum cost 0.34. How to find the corresponding sequence pattern? Since we leave only the survivor paths in the picture while deleting loser paths, we can readily find the path via backtracking. The survivor paths form a thick purple trajectory in√Fig. 2.29, √ which √ corresponds to the √ sequence (x[0], x[1], x[2], x[3]) = (+ E, + E, − E, − E). Complexity of the Viterbi algorithm We will now confirm our earlier claim that the complexity of the Viterbi algorithm increases linearly with n. Let us consider the basic operation that takes place at each node. Specifically, let us focus on the operation concerning the state s0 node at time 2, as depicted in Fig. 2.30. The current cost calculation involves three multiplications and two additions. Subsequently, we require one comparison to select the minimum and one addition to combine the costs. Therefore, the complexity per state comprises three multiplications, three additions, and one comparison.
2.5 The Viterbi Algorithm Fig. 2.31 Complexity comparison between the Viterbi algorithm and the exhaustive search
111 Exhaustive
Viterbi
# multiplication: # addition: # comparison:
This process is repeated for the other states across all time slots, as indicated by the black dots in the illustration. Given that there are 4n black dots in total, the complexity of the Viterbi algorithm increases linearly with n. You can find the precise numbers in Fig. 2.31. Look ahead In the upcoming section, we will develop a Python implementation of the Viterbi algorithm. We will begin by establishing basic coding components using the same example related to the two-tap ISI channel. Afterward, we will extend these components to construct a function that provides the optimal ML sequence based on the received signals and ISI channel configuration.
112
2 Communication over ISI Channels
2.6 The Viterbi Algorithm: Python Implementation
Recap Over the past two sections, we delved into the Viterbi algorithm that presents an efficient and optimal ML approach with a computational complexity that increases linearly with the total number of time slots, n. The algorithm is founded on a finite state machine and leverages the structure of the corresponding trellis diagram to facilitate a streamlined process of identifying the optimal sequence that minimizes the cost among all feasible sequence patterns: s∗ = arg min s
n−1
2 y[m] − hT s[m] .
(2.58)
m=0
Outline This section focuses on implementing the Viterbi algorithm using Python. The section comprises three parts. Initially, we will create basic coding components for the Viterbi algorithm using the same example as in the previous section. In the example, we examined a two-tap ISI channel with received signals of interest: y[m] = h 0 x[m] + h 1 x[m − 1] + w[m] = x[m] + x[m − 1] + w[m], m ∈ {0, 1, 2, 3}, (y[0], y[1], y[2], y[3]) = (2.1, 1.8, 0.5, −2.2).
(2.59)
After creating the coding components, we will utilize them to develop a function that produces the optimal sequence, given the received signals and the ISI channel taps (h 0 , h 1 ). Lastly, we will perform a simple experiment to show that the Viterbi algorithm effectively reconstructs the transmitted signals for a suitable SNR value.
State matrix, the received signal vector, and the channel vector We define three quantities fed into a to-be-developed function for the Viterbi algorithm. The first is the state matrix S that aggregates all of the available states. Remember that depending on the binary values of x[m − 1] and x[m], we have four states: s0 = [+1, +1], s1 = [−1, +1], s2 = [−1, −1], and s3 = [+1, −1]. The state matrix S is then defined as a row-wise stacking of the four states:
2.6 The Viterbi Algorithm: Python Implementation
⎡
+1 ⎢ −1 S=⎢ ⎣ −1 +1
⎤ +1 +1 ⎥ ⎥. −1 ⎦ −1
113
(2.60)
The second and third quantities are the received signal vector y = [y[0], . . . , y[3]]T , and the channel vector h = [h 0 , h 1 ]T . In the considered example, ⎡
⎤ 2.1 ⎢ 1.8 ⎥ ⎥, h = 1 . y=⎢ ⎣ 0.5 ⎦ 1 −2.2
(2.61)
See below a code for constructing the three quantities. import numpy as np S = np.array([[+1,+1],[-1,+1],[-1,-1],[+1,-1]]) y = np.array([2.1, 1.8, 0.5, -2.2]) h = np.array([1,1]) print(S.shape) print(y.shape) print(h.shape) (4, 2) (4,) (2,)
The state_to_idx function One of the basic coding components that we will often use is the state_to_idx function, which returns a row index in the state matrix S, corresponding to a given state vector (taking one among s0, s1, s2, and s3). For instance, given s0 = [+1, +1], it should output 0 in light of (2.60). Here is a code implementation for this function. def state_to_idx(s_current, S): # Check if each row in S (axis=1) matches s_current # For s_current=[+1,+1], it reads [True,False,False,False] idx_check = np.all(S == s_current, axis=1) # Return the index of the matched row # For s_current=[+1,+1], it returns 0 return np.where(idx_check)[0][0] s_current = np.array([[1,1]]) print(s_current) print(state_to_idx(s_current, S)) [[1 1]] 0
The prev_state function: Return the candidates of the previous state Another basic coding component is the prev_state function that returns all the
114
2 Communication over ISI Channels
candidates of the previous state w.r.t. a given current state. One key operation in the Viterbi algorithm is to store an aggregated cost at each node in the trellis diagram. To this end, we first need to retrieve the aggregated cost stored in the previous state. Hence, this requires identification of the previous state. To gain insights as to how to implement the prev_state function, consider an instance where the current state is [x[m − 1], x[m]] = [+1, +1] = s0. In this case, the previous state takes [x[m − 2], x[m − 1]] = [+1, +1] = s0 if x[m − 2] = +1, while taking [x[m − 2], x[m − 1]] = [−1, +1] = s1 when x[m − 2] = −1. See below its code implementation. def prev_state(s_current,S): # s_current = [x[m-1],x[m]] # Output: Possible candidates for the previous state # [+1,x[m-1]] and [-1,x[m-1]] # Find the row index matching s_current current_idx = state_to_idx(s_current,S) # Retrieve x[m-1] using the current_idx x_m1 = S[current_idx,0].squeeze() # Return the two condidates for [x[m-2],x[m-1]] # depending on x[m-2]=+1 or x[m-2]=-1 cand_plus = np.array([+1,x_m1]) cand_minus = np.array([-1,x_m1]) return [cand_plus, cand_minus] s_current = np.array([[+1,+1]]) print(prev_state(s_current, S)) [array([1, 1]), array([-1,
1])]
Compute the current cost at every node in the trellis Next, we compute the current cost (y[m] − hT s[m])2 for every m ∈ {0, 1, 2, 3} and for all possible patterns of s[m]. Specifically we construct a matrix that contains all of them as elements: ⎡
(y[0] − s0h)2 ⎢ (y[0] − s1h)2 trellis = ⎢ ⎣ (y[0] − s2h)2 (y[0] − s3h)2
(y[1] − s0h)2 (y[1] − s1h)2 (y[1] − s2h)2 (y[1] − s3h)2
(y[2] − s0h)2 (y[2] − s1h)2 (y[2] − s2h)2 (y[2] − s3h)2
One succinct way to construct this matrix is as below. y - S.dot(h).reshape((-1,1)) trellis = (y - S.dot(h).reshape((-1,1)))**2 print(y) print(S.dot(h).reshape((-1,1))) print(trellis.shape) [ 2.1 [[ 2] [ 0]
1.8
0.5 -2.2]
⎤ (y[3] − s0h)2 (y[3] − s1h)2 ⎥ ⎥ (y[3] − s2h)2 ⎦ (y[3] − s3h)2
2.6 The Viterbi Algorithm: Python Implementation
state
115
time
s0 s1 s2
s3 Fig. 2.32 Illustration of how the Viterbi algorithm works. The values marked in purple indicate the costs that are survived from competition among the previous state cost candidates [-2] [ 0]] (4, 4)
Initialize survived_costs at time 0 We are now ready to compute the costs that are survived from competition among the previous state cost candidates. These are the values that we specified in purple in Fig. 2.32. We first compute the costs at time 0. # Initialize "costs" to be stored survived_costs = np.zeros_like(trellis) # Initialize "paths" (of a dictionary type) that would store # the index of the previous winner state # key: (i,t) --> state i at time t # value: the index of the previous winner state paths = {} num_states = S.shape[0] num_times = y.shape[0] # Initialize costs at time 0 for i in range(num_states): survived_costs[i,0] = trellis[i,0] paths[(i,0)] = None # Specify the end of path as ‘‘None’’ print(survived_costs) [[1.000e-02 [4.410e+00 [1.681e+01 [4.410e+00
0.000e+00 0.000e+00 0.000e+00 0.000e+00
0.000e+00 0.000e+00 0.000e+00 0.000e+00
0.000e+00] 0.000e+00] 0.000e+00] 0.000e+00]]
116
2 Communication over ISI Channels
Notice that the values at (1, 1) and (4, 1) entries in the survived_costs coincide with 0.01 and 4.41 that we already computed by hand in Fig. 2.32. The only distinction relative to Fig. 2.32 is that we do have costs at (2, 1) and (3, 1) entries. This is because we assume x[−1] = 0 here, while we set x[−1] = 1 and the state at t = −1 is s0 in Fig. 2.32. Compute survived_costs at other time slots costs at other time instants m ∈ {1, . . . , n − 1}.
Next, we compute the survived
# Iterate over other time slots m=1,...,n-1 for t in range(1, num_times): # Iterate over states: i=0,1,2,3 for i in range(num_states): # Find two candidates for the previous state prev_state_cand = prev_state(S[i,:],S) # Find the minimum cost among the two candidates min_cost, prev = min( (survived_costs[state_to_idx(s,S),t-1] + trellis[i,t], state_to_idx(s,S)) for s in prev_state_cand) # Store the minimum cost as a survival survived_costs[i,t] = min_cost # Store the index of the previous winner state paths[(i,t)] = prev print(survived_costs) print(paths) [[1.000e-02 5.000e-02 2.300e+00 1.994e+01] [4.410e+00 7.650e+00 3.500e+00 5.140e+00] [1.681e+01 1.885e+01 9.500e+00 3.400e-01] [4.410e+00 3.250e+00 3.000e-01 7.140e+00]] {(0, 0): None, (1, 0): None, (2, 0): None, (3, 0): None, (0, 1): 0, (1, 1): 3, (2, 1): 3, (3, 1): 0, (0, 2): 0, (1, 2): 3, (2, 2): 3, (3, 2): 0, (0, 3): 0, (1, 3): 3, (2, 3): 3, (3, 3): 0}
We see that the components in survived_costs are exactly the same as the purplecolored values in Fig. 2.32. Find the minimum cost and the corresponding sequence Once paths are obtained, we can find the optimal sequence s∗ by backtracking the winner states across all time instants. optimal_path = [] t = num_times -1 # The index of the winner state in the last time slot state_idx = np.argmin(survived_costs[:,t]) # Backtrack the winner states across all time instants
2.6 The Viterbi Algorithm: Python Implementation
117
while state_idx is not None: # Print the winner state at each time print(f"winner state: {state_idx} at time {t}") # Insert the winner state at the 0th position # This way, we enable ‘‘last input first ouput’’ optimal_path.insert(0,S[state_idx,:]) # Move to the previous winner state state_idx = paths[(state_idx,t)] t -= 1 print(optimal_path) winner state: 2 winner state: 3 winner state: 0 winner state: 0 [array([1, 1]),
at time 3 at time 2 at time 1 at time 0 array([1, 1]), array([ 1, -1]), array([-1, -1])]
Here the optimal_path indicates a list of (x[−1], x[0]), (x[0], x[1]), (x[1], x[2]), (x[2], x[3]), as per the last-input-first-output rule. Let us convert this into the sequence of transmitted signals: (x[0], x[1], x[2], x[3]). seq_estimate = np.array([pair[1] for pair in optimal_path]) print(seq_estimate) [ 1
1 -1 -1]
A final function for the Viterbi algorithm Employing all of the above coding components, we build up a function that runs the Viterbi algorithm for an arbitrary length of the received signals. def state_to_idx(s_current, S): # Check if each row in S (axis=1) matches with s_current # For s_current=[+1,+1], it reads [True,False,False,False] idx_check = np.all(S == s_current, axis=1) # Return the index of the matched row # For s_current=[+1,+1], it returns 0 return np.where(idx_check)[0][0] def prev_state(s_current,S): # s_current = [x[m-1],x[m]] # Output: Possible candidates for the previous state # [+1, x[m-1]] and [-1,x[m-1]] # Find the row index matching s_current current_idx = state_to_idx(s_current,S) # Retrieve x[m-1] using the current_idx x_m1 = S[current_idx,0].squeeze() # Return the two condidates for [x[m-2],x[m-1]] # depending on x[m-2]=+1 or x[m-2]=-1 cand_plus = np.array([+1,x_m1]) cand_minus = np.array([-1,x_m1]) return [cand_plus, cand_minus]
118
2 Communication over ISI Channels
def viterbi_algorithm(S,h,y): # Compute the current cost at every node y - S.dot(h).reshape((-1,1)) trellis = (y - S.dot(h).reshape((-1,1)))**2 # Initialize "costs" to be stored survived_costs = np.zeros_like(trellis) # Initialize "paths" (of a dictionary type) that would # store the index of the previous winner state # key: (i,t) --> state i at time t # value: the index of the previous winner state paths = {} num_states = S.shape[0] num_times = y.shape[0] # Initialize costs at time 0 for i in range(num_states): survived_costs[i,0] = trellis[i,0] paths[(i,0)] = None # Specify the end of path # Iterate over other time slots m=1,...,n-1 for t in range(1, num_times): # Iterate over states: i=0,1,2,3 for i in range(num_states): # Find two candidates for the previous state prev_state_cand = prev_state(S[i,:],S) # Find the minimum cost among the two candiates min_cost, prev = min( (survived_costs[state_to_idx(s,S),t-1]+trellis[i,t], state_to_idx(s,S)) for s in prev_state_cand) # Store the minimum cost as a survival survived_costs[i,t] = min_cost # Store the index of the winner state paths[(i,t)] = prev optimal_path = [] t = num_times -1 # The index of the winner state in the last time slot state_idx = np.argmin(survived_costs[:,t]) # Backtrack the winner states across all time instants while state_idx is not None: # Insert the winner state at the 0th position # This way, we enable ‘‘last input first ouput’’ optimal_path.insert(0,S[state_idx,:]) # Move to the previous winner state state_idx = paths[(state_idx,t)] t -= 1 # Convert ‘optimal_path’ into the sequence
2.6 The Viterbi Algorithm: Python Implementation
119
seq_estimate=np.array([pair[1] for pair in optimal_path]) return seq_estimate
A simple experiment for testing the Viterbi algorithm We conduct an experiment for checking whether the Viterbi algorithm works properly. Consider a sequential coding scheme where we send n = 10 bits via PAM with SNR = 10 dB. We consider the same two-tap ISI channel as above: h 0 = 1 and h 1 = 1. Here is a code for simulating this setup. from scipy.stats import bernoulli import numpy as np n = 10 # number of independent bits X = bernoulli(0.5) SNRdB = 10 SNR = 10**(SNRdB/10) # Generate transmitted PAM signals x_orgin = 2*X.rvs(n) - 1 x = np.sqrt(SNR)*x_orgin # Pass through a two-tap ISI channel with h0=1 and h1=1 y = np.zeros(n) y[0] = x[0] + np.random.randn(1) for t in range(1,n): y[t] = x[t] + x[t-1] + np.random.randn(1) print(x_orgin) print(y) [ 1 1 -1 -1 [ 2.71943023 6.65510309
1 1 1 1 -1 -1] 6.14157942 0.36701172 -7.78017509 0.64485587 8.69890061 6.41750414 0.90916301 -4.30718765]
We now plug these into the developed viterbi_algorithm function to estimate the transmitted signals. S = np.array([[+1,+1],[-1,+1],[-1,-1],[+1,-1]]) h = np.array([1,1]) print(x_orgin) print(viterbi_algorithm(S,h,y)) [ 1 [ 1
1 -1 -1 1 -1 -1
1 1
1 1
1 1
1 -1 -1] 1 -1 -1]
We see that the estimates are the same as the original signals.
120
2 Communication over ISI Channels
Look ahead Although the Viterbi algorithm offers significantly lower complexity compared to the exhaustive search approach, it still faces a challenge regarding the number of states. This number can become considerably large, leading to exponential growth in complexity, especially when dealing with an L-tap ISI channel where the number of states is 2 L . To address this issue, we will study a scheme in the upcoming sections that can further reduce the complexity, when L is large.
2.6 The Viterbi Algorithm: Python Implementation
121
Problem Set 5 Problem 5.1 (Successive cancelation) Consider a simple two-tap ISI channel: y[m] = x[m] + x[m − 1] + w[m],
m≥0
(2.62)
where w[m] is an AWGN with N (0, σ 2 ). Assume that x[−1] √ = 0. We consider a sequential coding scheme. At each time slot, we send one of ± E signals depending on an information bit which takes 1 or 0 equally likely. The receiver employs a naive successive cancelation method. At time m, we use the estimate x[m ˆ − 1] (derived in the previous time instant, i.e., at time m − 1) to cancel the ISI term, and then intend to decode x[m] based on: y˜ [m] = y[m] − x[m ˆ − 1] = x[m] + (x[m − 1] − x[m ˆ − 1]) +w[m].
(2.63)
=:I [m]
At time 0, we know that there is no ISI and hence we can estimate x[0] ˆ using the ML receiver from y[0]. (a) Derive pe := P(x[0] = x[0]) ˆ in terms of E and σ 2 . Show that the joint distribution of the pair of random variables (x[0], x[0]) ˆ respects: √ ⎧ √ (+ E, + ⎪ ⎪ √ E), w.p. ⎨ √ (−√ E, −√ E), w.p. ⎪ (+ E, −√ E), w.p. ⎪ ⎩ √ (− E, + E), w.p.
1− pe ; 2 1− pe ; 2 pe ; 2 pe . 2
(2.64)
(b) The successive cancelation method at time 1 yields: y˜ [1] := y[1] − x[0] ˆ = x[1] + I [1] + w[1].
(2.65)
Show that the interference I [1] is a discrete random variable taking on three possible values. Indentify the three values and the corresponding probabilities with which I [1] can take them. (c) Derive the NN decision rule for decoding x[1] based on y˜ [1]. Derive an expression for the probability of error P(E1 ) when using the NN rule. How is this related to pe ? Hint: Think about the total probability law: P(E1 ) = P(E0 )P(E1 |E0 ) + P(C0 )P(E1 |C0 ) where C0 indicates a correct decision event w.r.t. x[0].
(2.66)
122
2 Communication over ISI Channels
(d) Using a recursion argument, find an expression for the error probability P(Em ) of the successive cancelation receiver on the bit at time m. What happens to this expression as m becomes large? Hint: P(Em ) = P(Em−1 )P(Em |Em−1 ) + P(Cm−1 )P(Em |Cm−1 ).
Problem 5.2 (Zero forcing receiver) Consider a simple two-tap ISI channel, and focus on the first two received signals: y[0] = h 0 x[0] + w[0],
(2.67)
y[1] = h 1 x[0] + h 0 x[1] + w[1]
(2.68)
where the additive noise w[n] is white Gaussian with√mean 0 and variance σ 2 . The transmission is sequential coding. We send one of ± E voltages depending on the information bit which takes 1 or 0 equally likely. Given (y[0], y[1]), we intend to decode x[1]. (a) Argue that a vector (−h 1 , h 0 ) is perpendicular to the vector (h 0 , h 1 ) associated with x[0]. (b) Consider the following signal projected onto the perpendicular vector: yˆ := (−h 1 , h 0 )(y[0], y[1])T = h 20 x[1] − h 1 w[0] + h 0 w[1]. Find the distribution of the random variable −h 1 w[0] + h 0 w[1]. (c) Given yˆ , find the ML rule for detecting x[1]. (d) Compute the error probability w.r.t. x[1] when using the ML rule.
Problem 5.3 (Matched filter) Consider the same setting as in Problem 5.2. But we wish to decode x[0] instead from (y[0], y[1]). (a) Consider the following signal projected onto (h 0 , h 1 ): y := (h 0 , h 1 )(y[0], y[1])T = (h 20 + h 21 )x[0] + h 0 h 1 x[1] + h 0 w[0] + h 1 w[1]. Given y , find the Nearest Neighbor (NN) rule for detecting x[0]. (b) Compute the error probability w.r.t. x[0] when using the NN rule.
Problem 5.4 (The Viterbi algorithm: Exercise 1) Consider a two-tap ISI channel: y[m] = x[m] + x[m − 1] + w[m],
m ≥ 0.
(2.69)
2.6 The Viterbi Algorithm: Python Implementation
123
Transmission begins at time 0. We consider a sequential coding scheme. At each time slot, we send one of ±1 depending on an information bit. The noise w[m] is an AWGN, being independent of x[m]’s. Suppose that the total number of time slots n = 5 and the received signals are: (y[0], y[1], y[2], y[3], y[4]) = (1.2, 0.4, −1.9, −0.5, 0.3). Define the state s[m] as: s[m] :=
x[m − 1] . x[m]
(2.70)
Say that [+1; +1] is s0; [−1; +1] is s1; [−1; −1] is s2; and [+1; −1] is s3. Assume that x[−1] = x[−2] = 1, so the initial state s[−1] is s0. (a) Show that given (y[0], . . . , y[4]), the ML receiver is: s∗ = arg min s
4
(y[m] − [1 1]s[m])2 .
(2.71)
m=0
(b) Compute cm (s[m]) := (y[m] − [1 1]s[m])2 for s[m] ∈ {[+1; +1], [−1; +1]; [−1; −1]; [+1; −1]} and m ∈ {0, . . . , 4}. (c) Run the Viterbi algorithm by hand to draw the trellis diagram and the shortest paths at each stage. (d) Using part (c), derive the ML solution s∗ and the corresponding sequence of transmitted signals x[m]’s. (e) Using the Python skeleton code in Sect. 2.6, do sanity check for your answer in part (d). Problem 5.5 (The Viterbi algorithm: Exercise 2) Consider a two-tap ISI channel: y[m] = x[m] + x[m − 1] + w[m]
(2.72)
where w[m]’s are i.i.d. ∼ N (0, σ 2 ), being independent of x[m]’s. Time slots begin at 0. We consider 4-PAM. At each time slot, we send one of four possible voltages (±1, ±3) depending on two bits that are needed to be communicated. Suppose that a block length n = 3 and the received signals are (y[0], y[1], y[2]) = (1.2, 0.4, −1.9).
124
2 Communication over ISI Channels
Define the state s[m] as: s[m] :=
x[m − 1] . x[m]
(2.73)
Assume that x[−1] = x[−2] = 3. (a) Compute cm (s[m]) := (y[m] − [1 1]s[m])2 for all possible s[m] and for m ∈ {0, 1, 2}. (b) Identify a finite state machine and draw the state transition diagram for the machine. (c) Run the Viterbi algorithm by hand, showing the trellis diagram and the shortest paths at each stage. (d) Using part (c), derive the ML solution s∗ and the corresponding sequence of transmitted signals x[m]’s. (e) Using the Python skeleton code in Sect. 2.6, construct a new function for the Viterbi algorithm, tailored for this 4-PAM setting. ( f ) Using the developed function in part (e), do sanity check for your answer in part (d).
Problem 5.6 (The Viterbi algorithm: Exercise 3) Consider a two-tap ISI channel: y[m] = x[m] + x[m − 1] + w[m], m ∈ {0, 1, . . . , n − 1}
(2.74)
where w[m]’s are i.i.d. ∼ N (0, σ 2 ). We employ a concatenation of 2-PAM and 2x-repetition coding. The 2-PAM takes the following mapping rule: when a bit is 1, it yields +1; otherwise outputs −1. Define an appropriate state so that it is a finite state machine and it enables the use of the Viterbi algorithm. Also draw the state transition diagram. Further discuss the computational complexity of the Viterbi algorithm applied to this setting, indicating how it scales with n and other factors.
Problem 5.7 (True or False?) (a) Consider a simple two-tap ISI channel, and focus on the first two received signals: y[0] = h 0 x[0] + w[0], y[1] = h 1 x[0] + h 0 x[1] + w[1] 2 where the additive noise w[n] is white Gaussian with mean √ 0 and variance σ . The transmission is sequential coding. We send one of ± E voltages depending on the information bit which takes 1 or 0 equally likely. Consider
2.6 The Viterbi Algorithm: Python Implementation
125
y := (h 0 , h 1 )(y[0], y[1])T = (h 20 + h 21 )x[0] + h 0 h 1 x[1] + h 0 w[0] + h 1 w[1]. Given y , the ML rule for detecting [0] is the same as the Nearest Neighbor (NN) rule. (b) Consider a three-tap ISI channel: y[m] = x[m] + x[m − 1] + x[m − 2] + w[m], m ∈ {0, 1, . . . , n − 1} where x[m] ∈ {±1, ±3} and w[m]’s are i.i.d. ∼ N (0, σ 2 ). Let s[m] :=
x[m − 1] . x[m]
Then, s[m] is a finite state machine. (c) The trellis diagram of a finite state machine fully describes the machine as well as shows how states evolve over time.
126
2 Communication over ISI Channels
2.7 OFDM: Principle
Recap Over the past few sections, we investigated how to optimally decode transmitted symbols in the context of a two-tap ISI channel:
x[m − 1] y[m] = [h 1 h 0 ] +w[m] x[m] =:hT
(2.75)
=:s[m]
where w[m]’s are i.i.d. ∼ N (0, σ 2 ). Inspired by the fact that x[m]’s are closely coupled in the received signals via the ISI term, we introduced block decoding which intends to decode (x[0], x[1], . . . , x[n − 1]) altogether, based on the entire received signals (y[0], y[1], . . . , y[n − 1]). Here n indicates the total number of time slots used. We then derived the optimal ML receiver w.r.t. the sequence of state vectors s[m]’s (one-to-one mapping with (x[0], x[1], . . . , x[n − 1])): s∗ = arg min s
n−1
2 y[m] − hT s[m] .
(2.76)
m=0
We highlighted that the exhaustive search method for finding the ML solution is computationally prohibitive, with a complexity of 2n that grows exponentially with n. Subsequently, we introduced a more efficient method for implementing the ML solution, the Viterbi algorithm, whose complexity grows linearly with n. However, the complexity of the algorithm with respect to the number of states in its associated finite state machine (FSM) is still a concern. The number of states is 2 L , which results in an exponential complexity with L. Since the number of channel taps L can be large in practical scenarios, this could pose a challenge.
Outline In this section, we will examine an alternative approach that can significantly reduce complexity, especially when dealing with a large value of L. This approach is called Orthogonal Frequency Division Multiplexing (OFDM) (Chang 1966). The section is divided into three parts. Firstly, we will present a transmitter-side technique, known as cyclic signaling, which serves as the foundation of OFDM. Secondly, we will discuss how to implement cyclic signaling. Finally, we will demonstrate that this technique, along with a crucial property (which will be explained in detail), enables an ISI channel effectively ISI-free. We will demonstrate this concept in the context of a simple two-tap ISI channel and then extend it to an arbitrary L-tap ISI channel in the next section.
2.7 OFDM: Principle
127
A fixed transmission scheme Up until now, we have focused on a receiver-centered approach that assumes a fixed transmission scheme, such as sequential coding where each bit bm√ is transmitted independently over different time slots by transmitting x[m] = ± E. However, there is a transmitter-side technique that can be employed to effectively eliminate ISI from the channel, rendering the complexity irrelevant to L. Cyclic signaling Suppose we send N transmitted symbols: x := [x[0], x[1], . . . , x[N − 1]]T . A time slot begins at zero. We have not yet specified how to map information bits into x[m]’s. Later we will provide details on the encoding rule. Consider the corresponding received signals: y[0] = h 0 x[0] + w[0] y[1] = h 1 x[0] + h 0 x[1] + w[1] y[2] = h 1 x[1] + h 0 x[2] + w[2] .. .
(2.77)
y[N − 1] = h 1 x[N − 2] + h 0 x[N − 1] + w[N − 1]. Using a matrix notation, one can write the received signals as: y = Hx + w
(2.78)
where y := [y[0], . . . , y[N − 1]]T , w := [w[0], . . . , w[N − 1]]T , and ⎡
⎤
h0 ⎢ h1 h0 ⎢ ⎢ H = ⎢ h1 h0 ⎢ .. ⎣ .
⎥ ⎥ ⎥ ⎥ ∈ R N ×N . ⎥ ⎦
(2.79)
h1 h0 Notice that H imposes a systematic structure. Each row is a shifted version of (h 1 , h 0 ), except the first row. The reason that we have the exception in the first row is that there is no signal transmitted prior to t = 0, i.e., x[−1] = 0. This motivates us to consider a beautiful yet imaginary structure as the one illustrated in Fig. 2.33. Here we have h 1 in front of h 0 in the first row. But it is placed in an invalid position in the matrix. Another beautiful yet valid matrix might be the following:
128
2 Communication over ISI Channels
⎡
h0 ⎢ h1 h0 ⎢ ⎢ Hc = ⎢ h 1 h 0 ⎢ .. ⎣ .
h1
⎤ ⎥ ⎥ ⎥ ⎥ ∈ R N ×N ⎥ ⎦
(2.80)
h1 h0 where the position for h 1 in the first row (marked in red) is now valid while yielding a shifted copy of (h 1 , h 0 ). Here the shift means any type of shift that allows for a cyclic shift. Every row is rotated one element to the right relative to the preceding row. This is a well-known property in the linear algebra literature, called the circulant property. So Hc is a circulant matrix. The circulant matrix has a nice property (to be explored in depth later) that helps converting the original ISI channel into multiple ISI-free subchannels. A way to enable cyclic signaling We can implement this cyclic signaling. Prior to digging into details on the nice property, let us first investigate the way that enables the cyclic signaling. Notice in Fig. 2.33 that h 1 in the first row (marked in red) is placed in a position corresponding to time slot −1 in view of the real channel H. In light of Hc , however, the corresponding time slot w.r.t. h 1 is N − 1. Hence, the idea is to send a dummy symbol x[−1] at time −1 so that it acts as the symbol at time N − 1. In other words, we set x[−1] = x[N − 1] and then send x that augments x[−1] = x[N − 1] in front of x: x = [x[N − 1] | x[0], . . . , x[N − 1]]T .
(2.81)
See Fig. 2.34 for illustration of this implementation. This way yields the received signals as: ¯ + w y = Hx
Fig. 2.33 A nicely looking yet imaginary structure for the ISI-channel matrix H
(2.82)
2.7 OFDM: Principle
129
[N – 1]
[N – 1]
cyclic operation
decyclic operation
channel [N – 1]
[N – 1]
[N – 1]
[N – 1]
cyclic channel Fig. 2.34 How to generate cyclic signaling
where y := [y[−1] | y[0], . . . , y[N − 1]]T , w := [w[−1] | w[0], . . . , w[N − 1]]T , ¯ indicates an (N + 1)-by-(N + 1) real channel matrix having the same strucand H ture as H (2.79): ⎤
⎡
h0 ⎢ h1 h0 ⎢ ¯ =⎢ H ⎢ h1 h0 ⎢ .. ⎣ .
⎥ ⎥ ⎥ ⎥ ∈ R(N +1)×(N +1) . ⎥ ⎦
(2.83)
h1 h0 By removing the first component in y (that we call “decyclic operation” in Fig. 2.34), we obtain: y = Hc x + w.
(2.84)
A nice property of Hc As mentioned earlier, there is a nice property that enables us to remove ISI efficiently with the help of the cyclic structure reflected in (2.84). The nice property is that an eigenvalue decomposition of the circulant matrix Hc has the following structure (Gray et al. 2006; Strang 1993): Hc = F−1 F
(2.85)
where F denotes the Discrete Fourier Transform (DFT) matrix: ⎡ 1 F := √ N
⎢ ⎢ ⎢ ⎣
ω 0·0 ω 0·1 .. .
ω 1·0 ω 1·1 .. .
··· ··· .. .
ω (N −1)·0 ω (N −1)·1 .. .
ω 0·(N −1) ω 1·(N −1) · · · ω (N −1)·(N −1) Here ω := e− j N . 2π
⎤ ⎥ ⎥ ⎥ ∈ R N ×N . ⎦
(2.86)
130
2 Communication over ISI Channels
Frequency-domain signals How does the nice property (2.85) serve to remove ISI? To see this, let us apply (2.85) into (2.84) and then multiply the DFT matrix F to both sides. This gives: y˜ := Fy = (Fx) + (Fw) =:˜x
(2.87)
˜ =:w
˜ = ˜x + w. Here the key observation is that is a diagonal matrix: ⎡
⎤ 0 ··· 0 λ1 · · · 0 ⎥ ⎥ N ×N . .. . . . ⎥∈R . . .. ⎦ 0 0 · · · λ N −1
λ0 ⎢0 ⎢ =⎢ . ⎣ ..
(2.88)
This means that y˜ does not contain any ISI terms w.r.t. frequency-domain signal components in x˜ . So by mapping information bits into the frequency-domain signal vector x˜ , we obtain multiple ISI-free parallel subchannels. Denoting x˜ = [X [0], . . . , X [N − 1]]T and y˜ = [Y [0], . . . , Y [N − 1]]T , the parallel subchannels are described as: Y [0] = λ0 X [0] + W [0]; Y [1] = λ1 X [1] + W [1]; .. .
(2.89)
Y [N − 1] = λ N −1 X [N − 1] + W [N − 1]. Frequency-domain signals X [k]’s do not interfere with each other. Hence, the above transmitter-receiver technique is called Orthogonal Frequency Division Multiplexing (OFDM). See Fig. 2.35 for the overall OFDM architecture.
IDFT
cyclic operation
channel
decyclic operation
ISI-free channel Fig. 2.35 The entire transmitter-receiver architecture of OFDM
DFT
2.7 OFDM: Principle
encoding
131
ISI-free channel
decoding
Fig. 2.36 How to map information bits into frequency-domain signals in OFDM
At the transmitter, we first generate a frequency-domain signal vector x˜ with information bits. We will study how to map information bits into x˜ shortly. We then apply the IDFT matrix F−1 , in order to convert the information-bearing frequencydomain signal vector x˜ into the time-domain counterpart x. Finally we send over the channel x that augments x[N − 1] in front of x, as per (2.81). At the receiver, we perform the opposite operations. We remove the first component from the received signal vector y to obtain y. We then multiply the DFT matrix F to y, thus obtaining y˜ that will be used for decoding the information bits. How to map information bits The final topic to discuss is the mapping of information bits into X [k]’s. The process is straightforward and involves encoding the bits into x˜ using an encoding rule, as shown in Fig. 2.36. Various coding schemes from Part I, such as sequential coding, repetition coding, LDPC code, and Polar code, can be used for encoding. For decoding, the frequency-domain received signals in (2.87) resemble those in the AWGN channel. However, it is unclear whether the W [k]’s are i.i.d. ∼ N (0, σ 2 ). Fortunately, it turns out this is also the case. So any decoding schemes used in the AWGN channel can be applied. Look ahead We have introduced the OFDM technique, which utilizes a transmitter-receiver approach to achieve multiple ISI-free subchannels in the context of L = 2. We relied on property (2.85) and the assumption that W [k]’s are i.i.d. ∼ N (0, σ 2 ) without proof. In the following section, we will extend the technique to general L-tap ISI channels and provide proofs for the properties in the general context. Additionally, we will derive the explicit expression for in terms of the channel taps h k ’s. Finally, we will demonstrate that the complexity of the scheme remains reasonable even for a large value of L.
132
2 Communication over ISI Channels
2.8 OFDM: Extension to General L-tap ISI Channels
Recap In the previous section, we introduced a new transmitter-receiver technique, named OFDM, in the context of a two-tap ISI channel: y[m] = h 0 x[m] + h 1 x[m − 1] + w[m],
m ∈ {0, 1, . . . , N − 1}
(2.90)
where w[m]’s are i.i.d. ∼ N (0, σ 2 ) and N denotes the number of transmitted symbols. Using matrix notations x := [x[0], x[1], . . . , x[N − 1]]T , y := [y[0], y[1], . . . , y[N − 1]]T and w := [w[0], w[1], . . . , w[N − 1]]T , we obtained its simple representation: y = Hx + w
(2.91)
where ⎤
⎡
h0 ⎢ h1 h0 ⎢ ⎢ H = ⎢ h1 h0 ⎢ .. ⎣ .
⎥ ⎥ ⎥ ⎥ ∈ R N ×N . ⎥ ⎦
(2.92)
h1 h0 Motivated by the systematic (yet not perfect) structure of H, we invoked the idea of sending an augmented signal vector x := [x[−1] | x[0], x[1], . . . , x[N − 1]]T while setting x[−1] = x[N − 1]. This then led to: y = Hc x + w
(2.93)
where Hc indicates a circulant matrix: ⎡
h0 ⎢ h1 h0 ⎢ ⎢ Hc = ⎢ h 1 h 0 ⎢ .. ⎣ .
h1
⎤ ⎥ ⎥ ⎥ ⎥ ∈ R N ×N . ⎥ ⎦
(2.94)
h1 h0 Next, relying on the nice property of the circulant matrix: Hc = F−1 F, we could make the ISI channel effectively ISI-free in the frequency domain:
(2.95)
2.8 OFDM: Extension to General L-tap ISI Channels
remove CP
add CP
IDFT
133
channel
DFT
ISI-free channel Fig. 2.37 The entire transmitter-receiver architecture of OFDM
˜ y˜ = ˜x + w.
(2.96)
˜ = Fw and F denotes the Discrete Fourier Transform (DFT) Here y˜ = Fy, x˜ = Fx, w matrix: ⎡ ⎤ ω 0·0 ω 1·0 · · · ω (N −1)·0 0·1 ω 1·1 · · · ω (N −1)·1 ⎥ 1 ⎢ ⎢ ω ⎥ F := √ ⎢ (2.97) ⎥ ∈ R N ×N .. .. .. .. ⎦ . N ⎣ . . . ω 0·(N −1) ω 1·(N −1) · · · ω (N −1)·(N −1)
where ω := e− j N . The overall procedure is illustrated in Fig. 2.37. 2π
Outline In this section, we will generalize the idea of using cyclic signaling to L-tap ISI channels. The section consists of three parts. First, we will explain how to implement the cyclic signaling for the general L-tap channel. Next, we will demonstrate that the computational complexity is reasonable even for a large value of L. Lastly, we will provide a proof for the nice property (2.95) that we previously deferred, and establish an explicit expression for the effective channel matrix shown in (2.96).
General L-tap ISI channels In an L-tap ISI channel, the received signals read: y[m] = h 0 x[m] + h 1 x[m − 1] + · · · + h L−1 x[m − L + 1] + w[m]
(2.98)
where m ∈ {0, 1, . . . , N − 1}. In light of the matrix notations (2.91), the channel matrix reads:
134
2 Communication over ISI Channels
⎤
⎡
h0 ⎢ h1 h0 ⎢ ⎢ H = ⎢ h2 h1 h0 ⎢ .. ⎣ .
⎥ ⎥ ⎥ ⎥. ⎥ ⎦
(2.99)
⎤ · · · h1 · · · h2 ⎥ ⎥ · · · h3 ⎥ ⎥. ⎥ ⎦ h1 h0
(2.100)
h L−1 · · · h 1 h 0 The corresponding circulant matrix is: ⎡
h0 h L−1 ⎢ h1 h0 h L−1 ⎢ ⎢ h2 h1 h0 h L−1 Hc = ⎢ ⎢ . .. ⎣ h L−1 · · ·
Now imagine the desired received signal w.r.t. the first row in (2.100): y[0] = h 0 x[0] + h 1 x[N − 1] + h 2 x[N − 2] + · · · + h L−1 x[N − L + 1] + w[0]. This motivates us to send the following dummy symbols in front of x: (x[−L + 1], x[−L + 2], . . . , x[−1]) =(x[N − L + 1], x[N − L + 2], . . . , x[N − 1]). In other words, we send an augmented signal vector x as follows: x = [x[N − L + 1], . . . , x[N − 1] | x[0], x[1], . . . , x[N − L], x[N − L + 1], . . . , x[N − 1]]T . Notice that the first (L − 1) elements are identical to the last (L − 1) elements. The augmented vector x ∈ R N +L−1 passes through the channel matrix H ∈ R(N +L−1)×(N +L−1) , yielding: y = Hx + w
(2.101)
where y := [y[−L + 1], . . . , y[−1] | y[0], . . . , y[N − 1]]T . By removing the first L − 1 components from y (that we call “remove CP” in Fig. 2.38), we obtain: y = Hc x + w. The whole procedure is illustrated in Fig. 2.38. Frequency-domain signals Recall the nice property of Hc :
(2.102)
2.8 OFDM: Extension to General L-tap ISI Channels
135
Cyclic Prefix (CP)
remove CP
add CP channel [N – 1]
[N – 1]
[N – 1]
[N – 1]
Fig. 2.38 How to generate cyclic signaling in a general L-tap ISI channel
Hc = F−1 F.
(2.103)
Applying this to (2.102), we obtain an effective channel: ˜ y˜ = ˜x + w
(2.104)
˜ = Fw are frequency-domain signals. Note that the where y˜ = Fy, x˜ = Fx and w effective channel is ISI-free, as is a diagonal matrix. Complexity We analyze the computational complexity of OFDM and show that it is not dependent on L. Although using the normal IDFT/DFT matrix operations that demand N 2 multiplications might raise concerns about computational complexity, there is a widely used transform that implements the DFT operation more efficiently - the Fast Fourier Transform (FFT). The FFT operation necessitates approximately N log2 N multiplications, which is significantly smaller than N 2 , particularly for a large N value. In comparison to the Viterbi algorithm’s complexity (about 2 L N multiplications), this represents a lower complexity, particularly when L is large. Penalty Regrettably, there is a drawback to this approach. It should be noted that when L is large, a considerable number of dummy signals are required. Furthermore, the cost of introducing cyclic signaling, which involves adding (L − 1) superfluous signals, increases as L grows. Proof of the circulant property (2.95) We prove the nice property (2.95) that we have relied on. By multiplying F−1 to the right in both sides in (2.95), we get: Hc F−1 = F−1 . The IDFT matrix F−1 reads:
(2.105)
136
2 Communication over ISI Channels
⎡
2π
⎢ ⎢ ⎢ ⎣
1 F−1 = √ N
2π
e j N 0·0 2π e j N 1·0 .. .
··· ··· .. .
e j N 0·1 2π e j N 1·1 .. .
e j N 0·(N −1) 2π e j N 1·(N −1) .. . 2π
⎤ ⎥ ⎥ ⎥. ⎦
e j N (N −1)·0 e j N (N −1)·1 · · · e j N (N −1)·(N −1) 2π
2π
2π
Here what we are going to prove is that the following vector vk (the kth column vector in F−1 ) is an eigenvector of Hc . ⎡
⎤
2π
e j N 0·k 2π e j N 1·k .. .
⎢ ⎢ vk := ⎢ ⎣
⎥ ⎥ ⎥. ⎦
e j N (N −1)·k 2π
To this end, we compute: ⎡
h0 h L−1 ⎢ h1 h0 h L−1 ⎢ ⎢ h2 h1 h0 h L−1 Hc vk = ⎢ ⎢ . .. ⎣ h L−1 · · ·
⎤ ⎤ 2π · · · h1 ⎡ e j N 0·k · · · h2 ⎥ 2π ⎥ ⎢ e j N 1·k ⎥ ⎢ ⎥ · · · h3 ⎥ ⎥⎢ ⎥. .. ⎥⎣ ⎦ . ⎦ j 2π (N −1)·k e N h1 h0
Consider the 0th component: h 0 e j N 0·k + h 1 e j N (N −1)·k + h 2 e j N (N −2)·k + · · · + h L−1 e j N (N −(L−1))·k 2π
2π
= h 0 e− j =
L−1
2π·0k N
h e− j
+ h 1 e− j 2πk N
2π
2π·k N
+ h 2 e− j
2π
2π·2k N
+ · · · + h L−1 e− j
2π·(L−1)k N
=: λk
=0
where the first equality is due to e j2π = 1. Similarly we can compute the 1st component as: h 0 e j N 1·k + h 1 e j N (1−1)·k + h 2 e j N (1+N −2)·k + · · · + h L−1 e j N (1+N −(L−1))·k ! 2π·(L−1)k 2π·1k 2π·0k 2π·k 2π·2k = e j N h 0 e− j N + h 1 e− j N + h 2 e− j N + · · · + h L−1 e− j N 2π
= ej
2π·1k N
2π
L−1 =0
=e
j 2π·1k N
λk .
h e− j
2π
2πk N
2π
2.8 OFDM: Extension to General L-tap ISI Channels
137
The key observation is that this is a simple multiplication version of the 0th component 2π·1k λk (by a factor of e j N ). The similar computation yields the 2nd component as: 2π·2k e j N λk . Hence, ⎡
⎤
2π
e j N 0·k 2π e j N 1·k .. .
⎢ ⎢ Hc vk = λk ⎢ ⎣
⎥ ⎥ ⎥ = λk vk . ⎦
e j N (N −1)·k 2π
This proves that vk is indeed an eigenvector of Hc with an eigenvalue λk . Frequency-domain channel response Let us figure out the explicit expression for λk that relates the kth frequency-domain transmitted signal X [k] to the corresponding received signal counterpart Y [k]: Y [k] = λk X [k] + W [k].
(2.106)
Actually we already figured this out while proving (2.95). To see this, remember that λk was the eigenvalue w.r.t. vk : λk =
L−1
h e− j N k . 2π
(2.107)
=0
Manipulating this in terms of the DFT matrix F, we get: λk = =
√ √
N·
N −1 1 2π h e− j N k √ N =0
" (2.108)
N · [Fh]k
where h := [h 0 , . . . , h L−1 , 0, . . . , 0]T . We can then also interpret this as a frequencydomain channel response. Look ahead Previously we claimed that mapping information bits into the frequency-domain transmitted signals x˜ is a simple process. However, this is not entirely straightforward. The reason is that the time-domain transmitted signal vector x = F−1 x˜ may contain complex signals that we may not know how to transmit over a practical channel. In the following section, we will delve into the specifics of this issue and examine how we can design an optimal receiver to address it.
138
2 Communication over ISI Channels
2.9 OFDM: Transmission and Python Implementation
Recap In the previous two sections, we explored the OFDM method that transforms an ISI channel into an ISI-free channel in the frequency domain. The main idea was to create “cyclic signaling” to establish the following relationship: y = Hc x + w
(2.109)
where w := [w[0], . . . , w[N − 1]]T and Hc indicates a circulant matrix: ⎡
h0 h L−1 ⎢ h1 h0 h L−1 ⎢ ⎢ h2 h1 h0 h L−1 Hc = ⎢ ⎢ . . ⎣ . h L−1 · · ·
⎤ · · · h1 · · · h2 ⎥ ⎥ · · · h3 ⎥ ⎥. ⎥ ⎦ h1 h0
(2.110)
Here L is the number of channel taps and N denotes the number of transmitted signals. The way to generate cyclic signaling was to send an augmented signal vector x ∈ R N +L−1 over a channel, as illustrated in Fig. 2.39. We then applied the nice property of the circulant matrix Hc = F−1 F into (2.109), thus obtaining the effective ISI-free channel that relates the frequency-domain transmitted signal vector x˜ = Fx to the received signal counterpart y˜ = Fy as follows: ˜ y˜ = ˜x + w
(2.111)
˜ = Fw and := diag(λ0 , λ1 , . . . , λ N −1 ). Here the frequency-domain chanwhere w nel response λk reads: λk =
L−1
h e− j N k . 2π
=0
We mentioned that an issue arises in mapping information bits into frequencydomain signal vector x˜ . The issue is that the time-domain signal x = F−1 x˜ may
Cyclic Prefix Fig. 2.39 An augmented signal vector x with cyclic prefix
last (L-1) symbols
2.9 OFDM: Transmission and Python Implementation
139
contain complex values which we have no idea about how to send over a physical channel.
Outline This section will tackle the issue of ensuring real values for all components in x in the context of OFDM signal transmission. It is divided into four parts. Firstly, we will examine a mapping rule that guarantees real values for all the components in x. Secondly, we will explore encoding and decoding schemes. Then, we will delve into the optimal receiver under a simplified scenario. Lastly, we will learn how to implement OFDM transmission using Python. How to map information bits With the notations of x˜ := [X [0], . . . , X [N − 1]]T , ˜ := [W [0], . . . , W [N − 1]]T , one can write y˜ := [Y [0], . . . , Y [N − 1]]T and W down all the components in (2.111) as: Y [0] = λ0 X [0] + W [0]; Y [1] = λ1 X [1] + W [1]; .. .
(2.112)
Y [N − 1] = λ N −1 X [N − 1] + W [N − 1]. Since Y [k]’s are complex values in general, one can separate Y R [k] and Y I [k] from Y [k] := Y R [k] + jY I [k] to view the above as 2N real-valued subchannels. Denoting W [k] := W R [k] + j W I [k], one can verify that W R [k] and W I [k] are Gaussian. So an individual subchannel is an additive Gaussian channel that we are familiar with. This naturally leads to the following encoding rule: Mapping information bits into frequency-domain transmitted signals X [k]’s. This way, we can apply every transmission/reception technique that we learned in Part I. See Fig. 2.40. However, an issue arises in mapping information bits into X [k]’s. The issue is that the actually-transmitted signals x[m]’s may contain complex-valued components. √ For example, in sequential coding where we send X [k] = ± E depending on the value of bk , x[m]’s can often be complex numbers. So we need to worry about the
encoding
ISI-free channel
decoding
Fig. 2.40 How to map information bits into frequency-domain signals in OFDM
140
2 Communication over ISI Channels
condition on X [k]’s that ensures real-valued signals for x[m]’s. To figure this out, let us consider a necessary and sufficient condition for x[m]’s being real-valued. A condition for ensuring real-valued x[m]’s The condition is: x[m] = x ∗ [m],
∀m.
(2.113)
where (·)∗ denotes a complex conjugate. Then, what is an equivalent condition expressed in terms of X [k]’s? To figure this out, one can apply the IDFT on both sides in the above to obtain: N −1 N −1 1 1 ∗ 2π j 2π mk N x[m] = √ X [k]e =√ X [k]e− j N mk = x ∗ [m]. N k=0 N k=0
(2.114)
Substituting k = N − k in the right-hand-side of the above, we get: N −1
2π
X [k]e j N mk =
N
X ∗ [N − k ]e j N mk . 2π
(2.115)
k =1
k=0
Here we used the fact that
e− j N m(N −k ) = e− j2πm · e j N mk = 1 · e j N mk , 2π
2π
2π
which then yields: x ∗ [m] =
N −1
X ∗ [k]e− j N mk 2π
k=0
=
N
X ∗ [N − k ]e j N mk . 2π
k =1
A condition w.r.t. X [k]’s Now we can obtain an equivalent condition by rewriting (2.115) as: X [0] +
N −1 k=1
X [k]e j N mk = X ∗ [0] + 2π
N −1
X ∗ [N − k ]e j N mk . 2π
(2.116)
k =1
Hence, one sufficient condition that guarantees the above is: X [0] = X ∗ [0]; X [k] = X ∗ [N − k], 1 ≤ k ≤ N − 1.
(2.117)
2.9 OFDM: Transmission and Python Implementation
141
Fig. 2.41 How to map information bits into frequency-domain signals in OFDM when employing a sequential coding scheme
Is this also necessary? Check in Problem 6.5. The condition (2.117) implies that there are actually N real symbols that can be freely chosen out of 2N real symbols. Why? To see this, consider a simple setting in which N is an even number. First express X [k] as: X [k] = X R [k] + j X I [k],
k ∈ {0, 1, . . . , N − 1}.
(2.118)
Then, as per the condition (2.117), we must respect: X [0] = X R [0] X [k] = X R [0] + j X I [k], N N = XR ; X 2 2
# N k ∈ 1, . . . , − 1 , 2
(2.119)
and the other signals should be fixed as: X [N − k] = X R [k] − j X I [k],
# N k ∈ 1, . . . , − 1 . 2
(2.120)
We see that there are only N real-valued symbols that can be freely chosen. For instance, given the sequential coding scheme, the mapping rule between information bits and X [k]’s can be the one illustrated in Fig. 2.41. Note that only N real symbols are employed in mapping N information bits. Of course, we can apply any other coding schemes that we learned in Part I (such as repetition code, LDPC code, and Polar code), as long as we satisfy the real-valued signal constraint, reflected in (2.117). The optimal decision rule Next we will develop the optimal receiver under a simple scenario in which N = 2 and L = 2 and we use sequential coding. In this case, the number of effective symbols (that can be freely chosen) is N = 2. So we map (b0 , b1 )
142
2 Communication over ISI Channels
to (X R [0], X R [1]), which in turn yields: X [0] = X R [0] and X [1] = X R [1]. Consider the received signals in the frequency domain: Y [0] = λ0 X R [0] + W [0]; Y [1] = λ1 X R [1] + W [1]
(2.121)
where λ0 =
1
h e− j
2π 2 ·0
= h0 + h1;
λ1 =
=0
1
h e− j
2π 2 ·1
= h0 − h1,
(2.122)
=0
and the frequency-domain noise signals are: 1 1 2π W [0] = √ w[m]e− j 2 m·0 = √ (w[0] + w[1]); 2 =0 2 1
(2.123)
1 1 2π w[m]e− j 2 m·1 = √ (w[0] − w[1]). W [1] = √ 2 =0 2 1
Here we have a good news about W [0] and W [1]: (W [0], W [1]) are statistically independent.
(2.124)
Why is good? The reason is that the independence yields the independence between Y [0] and Y [0], thus making the optimal decoder be simply the local ML rule: bˆk = ML(bk |Y [k]), k ∈ {0, 1}.
Proof of (2.124) Let us prove the independence relationship (2.124). We can easily prove this by using the definition of independence: the joint distribution f W [0],W [1] (W [0], W [1]) is the product of individual counterparts:
W [0] + W [1] W [0] − W [1] f W [0],W [1] (W [0], W [1]) = f w[0],w[1] , √ √ 2 2
2 2 (W [0] + W [1]) (W [0] − W [1]) 1 1 (b) + = exp − 2 2πσ 2 2σ 2 2
1 1 2 = exp − 2 W [0] + W 2 [1] 2πσ 2 2σ = f W [0] (W [0]) f W [1] (W [1]) (a)
(2.125)
2.9 OFDM: Transmission and Python Implementation
143
where (a) comes from (2.123); and (b) is due to the independence between w[0] and w[1]. Python implementation: OFDM signal transmission & verification We will use Python to implement OFDM signal transmission and verify two things: (i) the time-domain transmitted signals x[m]’s are real-valued under the conditions (2.119) and (2.120), and (ii) the frequency-domain received signals Y [k]’s satisfy the ISIfree channel condition (2.112). Our implementation will focus on the case of N = 8 and L = 3, although the provided codes can handle any values of N and L. We will simplify our implementation by assuming there is no additive noise, meaning w[m] = 0. Python: Construction of transmitted signals
We first generate N = 8 independent
bits. from scipy.stats import bernoulli import numpy as np N = 8 # number of independent bits Bern = bernoulli(0.5) # Generate N independent bits b_samples = Bern.rvs(N) print(b_samples) [0 0 0 1 0 0 0 1]
As a coding scheme, we consider sequential coding. So we take the mapping rule in Fig. 2.41: Mapping the first five bits (b0 , b1 , b2 , b3 , b4 ) into (X R [0], X R [1], X R [2], X R [3], X R [4]); and the last three bits (b5 , b6 , b7 ) into (X I [1], X I [2], X I [3]). We are then advised by the derived condition (2.119) and (2.120) to construct X [k]’s as follows: X [0] = X R [0] X [1] = X R [1] + j X I [1] X [2] = X R [2] + j X I [2] X [3] = X R [3] + j X I [3] X [4] = X R [4] X [5] = X R [3] − j X I [3] X [6] = X R [2] − j X I [2] X [7] = X R [1] − j X I [1]. # Mapping bits into frequency-domain signals X_R = np.zeros(N) X_I = np.zeros(N) X_R[0] = 2*b_samples[0] -1
(2.126)
144
2 Communication over ISI Channels
X_R[N//2] = 2*b_samples[N//2] -1 X_R[1:N//2] = 2*b_samples[1:N//2] -1 X_I[1:N//2] = 2*b_samples[N//2+1:N] -1 # Construct the complex-valued transmitted signals X[k]’s X = np.zeros(N,dtype=complex) X[0] = X_R[0] # X[k] = XR[k] + jXI[k], k=1,..., N/2-1 X[1:N//2] = X_R[1:N//2] + X_I[1:N//2]*1j X[N//2] = X_R[N//2] # X[N-k] = XR[k] - jXI[k], k=1,..., N/2-1 X[N-1:N//2:-1] = X_R[1:N//2] - X_I[1:N//2]*1j print(X_R) print(X_I) print(X) [-1. -1. -1. 1. -1. 0. 0. 0.] [ 0. -1. -1. 1. 0. 0. 0. 0.] [-1.+0.j -1.-1.j -1.-1.j 1.+1.j -1.+0.j 1.-1.j -1.+1.j -1.+1.j]
Notice that the above construction satisfies the condition (2.126) indeed. Next, we perform IDFT to generate time-domain transmitted signals. To this end, we employ a built-in function defined in the numpy.fft module: ifft. The ifft operation is√slightly different from the one that we defined. It takes division by N (instead of N ): ifft(X ) = So we should multiply it by
√
N −1 1 2π X [k]e j N mk . N k=0
N to implement the defined operation.
from numpy.fft import ifft x_time = np.sqrt(N)*ifft(X) print(x_time.real) print(x_time.imag) [-1.41421356 -0.29289322 1.41421356 1.70710678 -1.41421356 -1.70710678] [0. 0. 0. 0. 0. 0. 0. 0.]
0.29289322 -1.41421356
We verify that the time-domain transmitted signals are all real-valued. We then add the cyclic prefix of size L − 1 prior to x[m]’s. L = 3 # three-tap ISI channel cp = x_time[N-L+1:N] x_cp = np.concatenate([cp, x_time]) print(x_cp[0:L-1]) print(x_cp[N:N+L-1]) [-1.41421356+0.j -1.70710678+0.j] [-1.41421356+0.j -1.70710678+0.j]
(2.127)
2.9 OFDM: Transmission and Python Implementation
145
Python:
Construction of received signals The transmitted signals are passed through a three-tap ISI-channel: (h 0 , h 1 , h 2 ) = (1, 21 , 41 ). To implement this, we employ another useful built-in function convolve supported in the numpy package. from numpy import convolve # channel’s impulse response h = np.array([1,1/2,1/4]) # Compute y[m]= h[m]*x[m] y_cp = convolve(x_cp,h) # Remove the last L-1 signals y_cp = y_cp[0:N+L-1] print(y_cp.shape)
(10,)
Since the convolve function returns the output of size (N + L − 1) + L − 1, we removed the last L − 1 signals of no interest. We then remove the CP to construct y[m]. y_time = y_cp[L-1:N+L] print(y_time.shape) (8,)
Next, we perform DFT to generate frequency-domain signals Y [k]’s. To this end, . Since the fft operation we employ another built-in function fft placed in numpy.fft √ N −1 2π y[m]e− j N mk , we divide it by N to match the operation that is fft(y) = k=0 we defined earlier. from numpy.fft import fft Y = 1/np.sqrt(N)*fft(y_time)
We check if Y [k]’s as above coincide with the theoretical derivation: Y [0] = λ0 X [0]; Y [1] = λ1 X [1]; .. . Y [N − 1] = λ N −1 X [N − 1]. # Compute frequency-domain channel respose lambda_ = fft(h,N) Y_check = lambda_*X print(Y) print(Y_check) [-1.75+0.j 0.75+0.54289322j -1.25+0.25j [-1.75+0.j 0.75+0.54289322j -1.25+0.25j
-1.95710678-0.75j -0.75+0.j -1.95710678+0.75j] -1.95710678-0.75j -0.75+0.j -1.95710678+0.75j]
-1.25-0.25j 0.75-0.54289322j -1.25-0.25j 0.75-0.54289322j
(2.128)
146
2 Communication over ISI Channels
Observe that the frequency-domain received signals indeed satisfy the ISI-free channel condition. Look ahead Parts I and II have focused on channel modeling and transmission/reception strategies for two significant channels, namely the AWGN channel and the wireline channel. We have learned several concepts, but one key principle stands out and forms the basis of optimal receiver architecture - the Maximum A Posteriori probability (MAP) decision rule. In practical scenarios where the candidates to decode are equip-probable, the MAP rule reduces to the Maximum Likelihood (ML) rule. Additionally, we have studied the Viterbi algorithm, which is a powerful and popular algorithm that helps find the optimal MAP (or ML) solution efficiently, without requiring exhaustive search. These principles have immense value and are relevant in several fields, including data science, which is one of the fundamental and trending fields. Part III aims to demonstrate the power of these principles through various data science applications. In the upcoming section, we will focus on one such application that concerns the MAP/ML principles.
2.9 OFDM: Transmission and Python Implementation
147
Problem Set 6 Problem 6.1 (OFDM: Operations) Consider a three-tap ISI channel: 1 1 y[m] = x[m] + x[m − 1] + x[m − 2] + w[m] 2 3
(2.129)
where w[m]’s are i.i.d. ∼ N (0, σ 2 ), being independent of x[m]’s. We would like to communicate over it using an OFDM system with N = 4. (a) What is the smallest length of the cyclic prefix needed to successfully implement the OFDM approach? (b) Write down the exact details of the transmitted signals when using the OFDM approach. In particular, highlight: (i) the total length of the transmitted signal vector; and (ii) the relationship between the transmitted signals. (c) Write down the exact details of the OFDM receiver operation. (d) Compute the frequency-domain channel response: λk =
L−1
h e− j N k . 2π
(2.130)
=0
Problem 6.2 (OFDM: An equivalent ISI-free channel) Consider an L-tap ISI channel: y[m] =
L−1
h x[m − ] + w[m]
(2.131)
=0
where w[m]’s are i.i.d. ∼ N (0, σ 2 ), being independent of x[m]’s. We would like to communicate over it using an OFDM system with N -point IDFT/DFT. (a) Describe a transmission-and-reception strategy that enables: y = Hc x + w
(2.132)
where y = [y[0], . . . , y[N − 1]]T , x = [x[0], . . . , x[N − 1]]T , and w = [w[0], . . . , w[N − 1]]T . Here Hc denotes a circulant matrix having the first row [h 0 , 0, . . . , 0, h L−1 , . . . , h 2 , h 1 ]. (b) In Sect. 2.7, we learned that the circulant matrix Hc can be decomposed into: Hc = F−1 F
(2.133)
where F denotes the DFT matrix. Find . Show all of your work in details. (c) Applying (2.133) into (2.132), we obtain:
148
2 Communication over ISI Channels
˜ y˜ = ˜x + w
(2.134)
˜ = Fw. Draw a block diagram of a system with the where y˜ = Fy, x˜ = Fx and w input x˜ and the output y˜ . Also explain why this system is called OFDM. Problem 6.3 (OFDM: Energy allocation) Consider a two-tap ISI channel: 1 y[m] = x[m] + x[m − 1] + w[m] 2
(2.135)
where w[m]’s are i.i.d. ∼ N (0, σ 2 ), being independent of information bits. We wish to communicate 1 bit of information over the two-tap channel using the OFDM system with N = 2. There is a total energy of E allocated to send this bit (without worrying about the overhead in energy entailed by the cyclic prefix). So the total energy X 2 [0] + X 2 [1] in the data symbols X [0] and X [1] can be no more than E: X 2 [0] + X 2 [1] ≤ E. Consider the following communication schemes: (a) We use only subchannel 0 to communicate the 1 bit of information. That is, we √ set X [0] to be ± E depending on the information bit and just set X [1] to be zero. Derive the ML receiver for decoding the information. Also compute the associated ML error probability. (b) We use only subchannel 1 to communicate the 1 bit of information. That is, we √ set X [1] to be ± E depending on the information bit and just set X [0] to be zero. Derive the ML receiver. Also compute the associated ML error probability. (c) Let us try to be a bit cleverer and use both subchannels to send the bit. We have seen this kind of a strategy when we tried to send one bit over multiple time slots via repetition coding. We will try the same strategy, except we repeat the information over the two subchannels. We split out energy E into two parts: αE to be used for subchannel 0 and (1 − α)E to be used for subchannel 1, where α denotes √ a fraction between 0√and 1. The repetition coding scheme is to set X [0] = ± αE and X [1] = ± (1 − α)E, each of the same sign, based on the information bit. (i) Derive the ML receiver for decoding the information bit. (ii) Compute the associated ML error probability. (iii) What fraction of α minimizes the expression for the ML error probability from the previous part? How does your answer depend on the ratio σE2 ? Problem 6.4 (OFDM: System capacity) Consider a two-tap ISI channel with (h 0 , h 1 ) = (1, 21 ), and an OFDM system with N = L = 2: Y [0] = λ0 X [0] + W [0]; Y [1] = λ1 X [1] + W [1].
2.9 OFDM: Transmission and Python Implementation
149
There is a power constraint (an average energy constraint) of P. Don’t worry about the overhead in power entailed by the cyclic prefix. Suppose we split out the power P into two parts: P0 to be used for subchannel 0 and P1 to be used for subchannel 1: % $ E X 2 [0] ≤ P0 ; $ % E X 2 [1] ≤ P1 ; P0 + P1 = P. We would like to find P1 and P2 such that the system capacity is maximized. (a) Formulate an optimization problem that guides us to find the optimal (P1 , P2 ). Hint: Let Ck be the capacity of subchannel k given the power constraint. Using Shannon’s capacity formula, find Ck for k ∈ {0, 1}. Then, relate Ck ’s to the system capacity. (b) Derive an explicit solution to the optimization problem formulated in part (a).
Problem 6.5 (OFDM: Condition for ensuring real-valued time-domain signals) In Sect. 2.9, we proved that the following condition on the frequency-domain signals X [k]’s is sufficient for ensuring real-valued time-domain signals x[m]’s: X [0] = X ∗ [0] X [k] = X ∗ [N − k]
k ∈ {1, . . . , N − 1}
(2.136)
where N denotes the size of IDFT/DFT in an OFDM system. A student claims that the condition (2.136) is also necessary for real-valued time-domain signals. Prove or disprove the claim.
Problem 6.6 (OFDM: Statistics of frequency-domain noise signals) Suppose that w[m]’s are complex-valued i.i.d. Gaussian random processes: w[m] = w R [m] + jw I [m]
m ∈ {0, 1, . . . , N − 1}
(2.137)
where (w R [m], w I [m])’s are i.i.d. ∼ N (0, σ 2 ). Consider the frequency-domain counterparts: N −1 1 2π W [k] = √ w[m]e− j N mk N m=0
k ∈ {0, 1, . . . , N − 1}.
(2.138)
Represent W [k] as the sum of the real and imaginary components: W [k] := W R [k] + j W I [k]. Show that (W R [k], W I [k])’s are also i.i.d. ∼ N (0, σ 2 ).
150
2 Communication over ISI Channels
Problem 6.7 (True or False?) (a) Consider an L-tap ISI channel: y[m] =
L−1
h x[m − ] + w[m]
(2.139)
=0
where w[m]’s are i.i.d. ∼ N (0, σ 2 ). The computational complexity of an OFDM system with IFFT/FFT is lower than that of the Viterbi algorithm.
References Chang, R. W. (1966). Synthesis of band-limited orthogonal signals for multichannel data transmission. Bell System Technical Journal4510, 1775–1796. Forney, G. D. (1973). The viterbi algorithm. Proceedings of the IEEE,613, 268–278. Gallager, R. G. (2008). Principles of digital communication (Vol. 1). Cambridge, UK: Cambridge University Press. Gray, R. M. et al. (2006). Toeplitz and circulant matrices: A review. Foundations and Trends ®in Communications and Information Theory,23, 155–239. Oppenheim, A. V. , Willsky, A. S. , Nawab, S. H., Hernández, G. M. et al. (1997).Signals & systems. Pearson Educación. Strang, G. (1993). Introduction to linear algebra (Vol. 3).Wellesley, MA: Wellesley-Cambridge Press.
Chapter 3
Data Science Applications
Communication principles are prevalent in data science.
3.1 Community Detection as a Communication Problem
Principles covered in Parts I and II Parts I and II have focused on a communication problem that involves transmitting a sequence of information bits from a transmitter to a receiver over a channel. Specifically, we considered two prominent channels: (i) the additive white Gaussian channel (in Part I); and (ii) the wireline ISI channel (in Part II). Our approach throughout these parts was to start with an encoding strategy (e.g., sequential coding, repetition coding) and then: 1. Derive the optimal receiver for decoding the bit string; 2. Analyze the decoding error probability when using the optimal receiver (Fig. 3.1). Since there is a one-to-one mapping between the information bits and the transmitted signals x := [x[0], x[1], . . . , x[n − 1]]T , decoding the bit string is equivalent to decoding x. We can view this communication problem as an inference problem, as the goal is to infer the input x from the output y. Any problem that aims to infer an input from an output is categorized as an inference problem, as long as the input and output are related probabilistically. In the communication problem setting, the randomness from the channel determines the probabilistic relationship between the input and output.
© The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2023 C. Suh, Communication Principles for Data Science, Signals and Communication Technology, https://doi.org/10.1007/978-981-19-8008-4_3
151
152
3 Data Science Applications
recovery
system fixed encoder
input
decoder
channel
output
Fig. 3.1 Viewing the communication problem as an inference problem
Therefore, an inference problem encompasses the communication problem as a specific case. In the context of inference problems, the following fundamental principles covered in Parts I and II play a crucial role: 1. Maximum A Posteriori probability (MAP) decision rule; 2. Maximum Likelihood (ML) decision rule; 3. The Viterbi algorithm. In addition to the MAP/ML principles, we also learned about another useful concept while studying Shannon’s capacity in Sect. 1.9 - the phase transition. This concept governs that there is a sharp threshold on the amount of information below which reliable communication is possible and above which reliable communication is impossible, regardless of any circumstances. It is worth noting that phase transition occurs in various other inference problems beyond communication.
Goal of Part III The goal of Part III is to demonstrate the role of the above fundamentals via other inference problems that arise in one recent field: Data science. We will do this in the context of three inference problems and prominent machine learning models: (1) Inference problem #1: Community detection; (2) Inference problem #2: Haplotype phasing; (3) Inference problem #3: Speech recognition; (4) Machine learning models: Logistic regression and deep learning. Our investigation will focus on the following three. Firstly, we will closely examine the role of the MAP/ML principles and the phase transition concept in the first two related problems. Secondly, we will explore the role of the Viterbi algorithm in speech recognition design. Finally, we will examine the role of the ML principle in the design of logistic regression and deep learning models.
3.1 Community Detection as a Communication Problem
“node” indicates “user”.
153
figure out which user belongs to which community.
Fig. 3.2 Community detection in picture
Outline In this section, we will address the first problem, which is community detection. The section consists of four parts. Firstly, we will explain what community detection is and why it is significant in data science. Secondly, we will formulate a corresponding mathematical problem and raise a fundamental question related to the concept of phase transition. Thirdly, we will establish a concrete connection with a communication problem. Finally, we will explore how phase transition occurs in the context of the problem by building on this connection.
Community detection Community detection refers to the process of identifying groups of individuals with similar interests or residing in the same locality. As shown in Fig. 3.2, given a graph where users are represented as nodes, the goal of community detection is to group the nodes into clusters, such as blue and red communities. This problem is also known as clustering in the literature (Bansal, Blum & Chawla, 2004; Jalali, Chen, Sanghavi, & Xu, 2011). The problem arises in various fields of data science, such as social networks (e.g., Facebook, LinkedIn, Twitter) and biological networks (Chen & Yuan, 2006). For instance, in social networks, identifying communities can help target specific groups for advertising and recommending products. In biological networks, community detection is crucial for DNA sequencing, which can aid in cancer detection and personalized medicine. The problem has many other applications in different areas. Problem formulation In order to tackle the problem of community detection, we first need to identify a type of information that we can easily access. In many cases, the most readily available information is relationship information, such as friend-
154
3 Data Science Applications
Fig. 3.3 An example of community detection
ship connections on Facebook, connection information on LinkedIn, and follower information on Twitter. On the contrary, information about a user’s community membership is typically not publicly available. For example, on Facebook, only friendship connections are visible, while the user’s community membership is unknown. Additionally, in some contexts, such community-type information is prohibited from being made public by law, even if the platform has access to it. To give an example of what relationship information might look like, let us consider a function that relates the values assigned to two nodes, such as xi and x j . For instance, we can define a function yi j as follows: yi j = 0 if xi = x j and yi j = 1 otherwise. Given a collection of such parities, the goal of the problem is to decode the community membership of each user. However, this is an impossible task since the parity information alone does not provide enough information to uniquely determine each user’s membership. For example, if we are given the parities (y12 , y13 , y14 ) = (0, 1, 1), then we obtain two possible solutions: x = [0, 1, 1, 0] (when x1 = 0) and its “flipped” counterpart x = [1, 0, 0, 1] (when x1 = 1). To resolve this ambiguity, we should relax the goal to decoding either x or x ⊕ 1 from the given parity information, where ⊕ denotes bit-wise modulo sum. Two challenges Regarding the example in Fig. 3.3, it may appear that solving the problem is not particularly challenging. However, in the era of big data, we face two significant obstacles. The first challenge is due to the fact that the number of nodes, such as social network users, can be enormous in many applications. For instance, Meta reported 2.936 billion Facebook users as of April 2022, which is roughly one-third of the global population. Therefore, we may only have access to a portion of the similarity parities since the number of possible pairs is immense, around n2 ≈ 4.309 × 1018 in the aforementioned example. In certain cases, where we can choose any two nodes at will, this may not be a major issue. In such instances, community detection is a straightforward task. For example, selecting a sequence of consecutive pairs such as (y12 , y23 , y34 , . . . , y(n−1)n ) allows us to decode x up to a global shift easily. Note that the number of pairwise measurements, n − 1, scales linearly with n, which is significantly less than n2 . However, this can still be problematic due to another challenge.
3.1 Community Detection as a Communication Problem
155
Fig. 3.4 The number of facebook users is around 2.936 billion as of April 2022
In numerous applications, such as Facebook, similarity parities are given passively according to the context. In this case, “passively given” implies that the information is not obtained by our own choice, but rather given by the context. For example, friendship information in Facebook is simply given by the context, and we cannot choose two arbitrary users and ask if they are friends. Consequently, similarity information may be provided randomly, with the information for any pair of two users given independently across other pairs with probability p. A big question Although community detection can be difficult due to the partial and random nature of the pairwise measurements, there is good news. The example in Fig. 3.3 suggests that decoding x does not necessarily require complete observation of every measurement pair. The pairs are interdependent, which suggests that partial pairs may be sufficient to decode x. This leads to a natural question:
• ?
Question
Is there a fundamental limit on the number of measurement pairs required to make community detection possible? It is noteworthy that similar to the channel capacity concept in communication, there is a phase transition in community detection. This implies that there is a precise threshold value, denoted as p ∗ , above which reliable community detection is feasible, and below which it is unfeasible, regardless of the approach. In the remainder of this section, we will delve into the value of p ∗ . Translation to a communication problem Under the partial and random observation setting, yi j is statistically related to xi and x j . So community detection is indeed an inference problem, suggesting an intimate connection with a communication problem. We find that translating community detection into a communication problem, we can come up with a precise mathematical statement on the limit. So we will first translate the problem as below. See Fig. 3.5. First of all, one can view x as an information source that we wish to send and xˆ as an estimate. What we are given are pairwise measurements. So one can think of a block diagram at the transmitter which converts x into pairwise measurements,
156
3 Data Science Applications (decoder)
(encoder) pairwise info
algorithm block
channel
Fig. 3.5 Translation to a communication problem
say xi j ’s. Here xi j := xi ⊕ x j . Actually one can view this as a kind of encoder. Since we assume that only part of the pairwise measurements are observed in a random manner, we have another block which implements the partial & random measurements to extract a subset of xi j ’s. One can view this processing as the one that behaves like a channel where the output yi j admits: yi j =
xi j ,
erasure,
w.p. p; w.p. 1 − p
where p denotes the observation probability and erasure indicates empty information. In the communication & information theory literature, this process can be modeled as an erasure channel with erasure probability 1 − p. Hence, we employ the terminology erasure. These yi j ’s are then fed into an algorithm block, thus yielding xˆ . The algorithm block can be interpreted as a decoder. Performance metrics & an optimization problem Similar to the communication setting, we can consider two performance metrics. The first metric we want to analyze is the limit on the number of observed pairwise measurements, which is also known as sample complexity. In this problem context, the sample complexity would be centered around: n sample complexity −→ p as n → ∞. 2 Why? Think about the Law of Large Numbers. The second performance metric is the one that is conventionally employed in the context of inference problems. That is, the probability of error defined as: / {x, x ⊕ 1} . Pe := P xˆ ∈ An error occurs when xˆ = x and xˆ = x ⊕ 1. There is a tradeoff relationship between the sample complexity and Pe , where the larger the sample complexity, the smaller Pe , and vice versa. To characterize this tradeoff relationship, we can formulate the following optimization problem. Given p and n,
3.1 Community Detection as a Communication Problem Fig. 3.6 Phase transition in community detection. If S is above the minimal sample n complexity S ∗ = n log n , then we can make Pe arbitrarily close to 0 as n tends to infinity. If S < S ∗ , then Pe cannot be made arbitrarily close to 0 no matter what and whatsoever
157
lower
upper
Pe∗ ( p, n) := min Pe . algorithm
(3.1)
Shannon formulated a similar optimization problem in order to characterize the channel capacity in the communication problem context. This problem has remained notoriously difficult and unsolved to this day. Similarly, the problem presented in (3.1) is also very challenging, and the exact error probability Pe∗ ( p, n) remains an open question. Phase transition We have some good news: similar to the communication setting, a phase transition occurs with respect to the sample complexity in the limit of n. If the sample complexity is above a certain threshold, then Pe can be made arbitrarily close to 0 as n approaches infinity. Otherwise, if it is below the threshold, Pe 0 no matter what we do. In other words, there exists a sharp boundary between possible and impossible detection determined by a threshold on the sample complexity, called the minimal sample complexity S ∗ . Please refer to Fig. 3.6. The minimal sample complexity can be characterized as follows: S∗ =
n log n 2
where log(·) indicates a natural logarithm. Notice that S ∗ is much smaller than the 2 total number of possible pairs: n2 ≈ n2 . This result implies that the limit on the observation probability is S∗ log n p ∗ = n = n 2 = 1. Notice that p ∗ = logn n where the second equality is because limn→∞ n(n−1) n2 vanishes as n → ∞, meaning that community detection requires only a negligible fraction of pairwise measurements for successful community recovery.
158
3 Data Science Applications
Look ahead The ML principle plays a significant role to prove the achievability of the limit p ∗ : p>
log n ⇐⇒ Pe → 0 as n → ∞. n
So in the next section, we will invoke the ML principle to prove the above.
(3.2)
3.2 Community Detection: The ML Principle
159
3.2 Community Detection: The ML Principle
Recap In the previous section, we embarked on Part III, which focuses on applying communication principles to data science. We discussed one such application: community detection. The objective of community detection is to identify community membership. Suppose there are two communities and xi ∈ {0, 1} denotes community membership of user i ∈ {1, 2, . . . , n}. Then, the problem’s goal is to decode x := [x1 , . . . , xn ] to figure out which user belongs to which group. We utilized similarity parities as measurement information, which indicates whether two users belong to the same community. Since there is no way to differentiate x from x ⊕ 1 only from the pairwise measurements yi j = xi ⊕ x j , we broadened the success event to include “its flipped version x ⊕ 1” as the correct one. With the increasing number of users in big data applications like Meta’s social networks, gathering all the relationship information becomes challenging. As a result, we examined a scenario where only some of the yi j ’s are available, and the selection of pairs is not our design. This led us to ask whether there is a fundamental limit on the number of randomly selected pairwise measurements required to enable community detection. We asserted that there is a limit, and we explained what this implies by translating it into a communication problem, as shown in Fig. 3.7. Specifically, we complexity (con introduced two performance metrics: (i) sample / {x, x ⊕ 1} . We centrated around n2 p); and (ii) the probability of error Pe := P xˆ ∈ then claimed the minimal sample complexity (above which one can make Pe → 0, under which one cannot make Pe → 0 no matter what and whatsoever) is: S∗ =
n log n log n , i.e., p ∗ = . 2 n
Outline In this section, we will prove that p ∗ is achievable:
(decoder)
(encoder) pairwise info
channel
algorithm block
Fig. 3.7 Translation of community detection into a communication problem
160
3 Data Science Applications
p>
log n =⇒ Pe → 0 as n → ∞. n
(3.3)
This section can be divided into three parts. Initially, we will utilize the ML principle to obtain the optimal decoder. Later, we will evaluate the decoding error probability based on the ML decoding rule. Rather than directly determining the exact probability, we will use a few bounding techniques to obtain an upper bound of the error probability. Finally, we will demonstrate that if the observation probability p is greater than log n , the upper bound becomes 0 as n approaches infinity, proving the achievability. n Encoder The encoder, shown in Fig. 3.7, converts the input x (which can be interpreted as an information source in the context of the communication problem) into xi j := xi ⊕ x j . Unlike in the communication setting, the encoder is not subject to our design choice. Let X be the n-by-n output matrix containing xi j ’s as its entries. The output of the encoder is commonly referred to as a codeword in the communication and information theory literature (Cover & Joy, 2006). A notable property of the codeword X is its symmetry, since xi j = x ji . For example, when x = (1000) and n = 4, the corresponding codeword is: ⎤ 111 ⎢ 0 0⎥ ⎥. X(x) = ⎢ ⎣ 0⎦ ⎡
(3.4)
We omit diagonal components and symmetric counterparts as they can be inferred trivially. Let us call, a collection of codewords X(x)’s, a codebook. The codebook is also a commonly used terminology in the information theory literature (Cover & Joy, 2006). The optimal decoder Let Y = [yi j ]. The decoder assumes that the codebook is already known, which is a reasonable assumption since the structure of pairwise measurements is disclosed. We use the MAP rule, which is the optimal decision rule for decoding x. Since we have no information about the statistics of x, we consider the worst-case scenario where x is uniformly distributed. In this scenario, there is no preference for any particular pattern of x; every candidate is equally likely. As we have seen in the communication context, under the equi-probable condition, the MAP decoder is equivalent to the ML decoder: xˆ ML = arg max P (Y|X(x)) . x
The calculation of the likelihood function P(Y|X(x)) is straightforward, being the same as the one in the communication setting. For instance, suppose n = 4, x = (0000), and yi j ’s are all zeros except that (y13 , y14 ) = (e, e):
3.2 Community Detection: The ML Principle
161
⎡
0e ⎢ 0 Y(x) = ⎢ ⎣
e
⎤
0⎥ ⎥ 0⎦
(3.5)
where e indicates the erasure (empty information). Then, P(Y|X(0000)) = (1 − p)2 p 4 . Here the number 2 marked in red denotes the number of erasures. The input pattern x = (0000) is a possible candidate for the ground truth, since the corresponding likelihood is not zero. We call such a pattern compatible. On the other hand, for x = (1000), P(Y|X(1000)) = 0. Note that x12 = 1 (underscored in (3.4)) is different from y12 = 0 (underscored in (3.5)), thus forcing the likelihood to be 0. This implies that the pattern x = (1000) can never be a solution, meaning that it is incompatible with Y. Then, what is the ML decoding rule? Here is how it works. 1. Eliminate all the input patterns which are incompatible with Y. 2. If there is only one pattern that survives, declare it as the correct pattern. However, this procedure is not sufficient to describe the ML decoding rule. The reason is that we may have a different erasure pattern that confuses the rule. To see this clearly, consider the following concrete example. Suppose that yi j ’s are all zeros except that (y12 , y13 , y14 ) = (e, e, e): ⎡ ⎢ Y(x) = ⎢ ⎣
e e e
⎤
0 0⎥ ⎥. 0⎦
(3.6)
Then, we see that P(Y|X(0000)) = (1 − p)3 p 4 ; P(Y|X(0111)) = (1 − p)3 p 4 . When considering the patterns 0000 and 0111 with equal likelihood functions, we are faced with an issue. In this scenario, the best course of action is to randomly select one of the two patterns by flipping a coin. This is the final step in the ML decoding process: 3. If there are multiple survivals, choose one randomly.
162
3 Data Science Applications
A setup for analysis of the error probability For the achievability proof (3.3), let us analyze the probability of error when using the ML decoder. Starting with the definition of Pe and using the total probability law, we get: / {x, x ⊕ 1} Pe := P xˆ ML ∈
P(x = a)P xˆ ML ∈ / {a, a ⊕ 1}|x = a . = a
For a fixed a, P xˆ ∈ / {a, a ⊕ 1}|x = a is a sole function of the likelihood function which depends only on erasure patterns for the n2 independent channels. Also, the erasure patterns are independent of the channel input a. Therefore, the probability is irrelevant of what the value of a is. Applying this to the above, we get: Pe = P xˆ ML ∈ / {0, 1}|x = 0 ⎞ ⎛ {ˆxML = a}|x = 0⎠ = P⎝
(3.7)
a∈{0,1} /
/ {0, 1} implies that xˆ ML = a for where the second equality is due to the fact that xˆ ML ∈ some a ∈ / {0, 1}. The union bound Calculating the probability of the union of multiple events can be a complex task, especially when dealing with a large number of events. Even for the case of three events (A, B, C), the probability formula can be quite intricate: P(A ∪ B ∪ C) = P(A) + P(B) + P(C) − P(A ∩ B) − P(A ∩ C) − P(B ∩ C) + P(A ∩ B ∩ C). Even worse, the number of associated multiple events in (3.7) is 2n − 2 and this will make the probability formula very complicated. To simplify the problem, we can take inspiration from Shannon’s approach in characterizing channel capacity. Instead of computing the error probability exactly, Shannon derived an upper bound for it, because ultimately what we want to prove is that the error probability approaches 0. We can adopt the same approach here and use a well-known upper bound for the union of multiple events, called the union bound. The union bound states that for two events A and B, P(A ∪ B) ≤ P(A) + P(B). The proof of the union bound is immediate because P(A ∩ B) ≥ 0. In fact, this already appeared in Problem 1.4. Applying the union bound to (3.7), we get:
3.2 Community Detection: The ML Principle
⎛
⎞
Pe = P ⎝
163
{ˆxML = a}|x = 0⎠
a∈{0,1} /
≤
(3.8)
P xˆ ML = a|x = 0 .
a∈{0,1} /
Further upper-bounding Let us focus on P(ˆxML = a|x = 0). To gain insights into this term, consider an example where n = 4 and a = (1000). In this case, the error event implies that X(1000) must be compatible. A necessary condition for X(1000) being compatible under x = (0000) is: (y12 , y13 , y14 ) = (e, e, e). Notice that for all (i, j) entries whose values are different between the two codewords X(0000) and X(1000) (that we call distinguishable positions), erasures must occur; otherwise, (1000) cannot be compatible as its corresponding likelihood function would be 0. Using the fact that for two events A and B such that A implies B, P(A) ≤ P(B) (check in Problem 1.4), we get: P xˆ ML = (1000)|x = 0 ≤ P (X(1000) compatible|x = 0) ≤ P ((y12 , y13 , y14 ) = (e, e, e)|x = 0)
(3.9)
= (1 − p)3 . Here the number 3 marked in red indicates the number of erasures that must occur in the distinguishable positions. A key to determine the above upper bound (3.9) is the number of distinguishable positions between X(a) and X(0). Actually one can easily verify that the number of distinguishable positions (w.r.t. X(0)) depends on the number of 1’s in a. To see this, · · · 0). In this case, consider an example of a = (11 · · · 1 00 k
n−k
⎤ 00011 ⎢ 0 0 1 1⎥ ⎥ ⎢ ⎢ 0 1 1⎥ ⎥. X(a) = ⎢ ⎢ 0 0⎥ ⎥ ⎢ ⎣ 0⎦ ⎡
Each of the first k rows contains (n − k) ones; hence, the total number of distinguishable positions w.r.t. X(0): # of distinguishable positions = k(n − k).
164
3 Data Science Applications
For a succinct description of the summation term in (3.8), let us classify the instance a depending on the number of 1’s in a. To this end, we introduce: Ak := {a|a1 = k}
(3.10)
where a1 := |a1 | + |a2 | + · · · + |an |. Using this notation, we can then express (3.8) as: Pe ≤
n−1
(1 − p)k(n−k)
k=1 a∈Ak
=
n−1
|Ak | (1 − p)k(n−k)
k=1
=
n−1
k=1
n (1 − p)k(n−k) k
2
n n
=2
k=1
k
(1 − p)k(n−k) −
(3.11) n (1 − p)n·0 n
n ≤2 (1 − p)k(n−k) k k=1 n 2
where the second last step follows from the fact that around k = n/2.
n (1 − p)k(n−k) is symmetric k
The final step of the achievability proof Since we intend to prove the achievability when p > logn n , let’s focus on the regime: λ > 1 where λ is defined such that p := λ logn n . Then, it suffices to show that in the regime of λ > 1, 2
n (1 − p)k(n−k) −→ 0 as n → ∞. k k=1 n
(3.12)
In this setting of p = λ logn n , p is arbitrarily close to 0 in the limit of n. This motivates us to employ the following upper bound on 1 − p (which is very tight in the regime): 1 − p ≤ e− p .
(3.13)
Check the proof in Problem 7.1. Also when n is large, the following bound is good enough to prove the achievability:
3.2 Community Detection: The ML Principle
165
n n(n − 1)(n − 2) · · · (n − k + 1) nk = ≤ ≤ nk k k! k!
(3.14)
Applying the bounds (3.13) and (3.14) into (3.11), we get: 2
n (1 − p)k(n−k) Pe ≤ 2 k k=1 n
n
≤2
2
n k e− pk(n−k)
k=1
(3.15)
n
(a)
=2
2
ek log n−λk (1− n ) log n k
k=1 n
=2
2
e−k(λ(1− n )−1) log n k
k=1
where (a) follows from n k = ek log n and p := λ logn n . Notice that for the range of 1 ≤ k ≤ n2 , 1 − nk (marked in blue in (3.15)) is minimized at 21 . Applying this to the above, we can then get: n
Pe ≤ 2
2
λ
e−k( 2 −1) log n
k=1
Case I: λ > 2: Here a key observation is that if λ > 2, then we can apply the wellknown summation formula w.r.t. the geometric series (where the common ratio is λ e−( 2 −1) log n ), thus obtaining: 2 λ k
e−( 2 −1) log n ≤ 2 · Pe ≤ 2 n
k=1
λ
e−( 2 −1) log n λ
1 − e−( 2 −1) log n
−→ 0 as n → ∞.
Hence, we can make Pe → 0 for the case of λ > 2. Case I: 1 < λ ≤ 2: Observe in the last step in (3.15) that k −1=0 λ 1− n
⇐⇒
1 k = 1− n, λ
and λ(1 − nk ) − 1 is a decreasing function in k. See Fig. 3.8. This together with a choice of α < 1 − λ1 gives:
166
3 Data Science Applications
Fig. 3.8 Behavior of λ 1 − nk − 1 in k
k − 1 > 0 when k ≤ αn. λ 1− n Applying this to (3.15), we get: 2
n (1 − p)k(n−k) Pe ≤ 2 k k=1 n
≤2
αn
e
−k(λ(1− nk )−1) log n
k=1
n 2
n (1 − p)k(n−k) . + k k=αn+1
Again applying the summation formula of the geometric series to the first term in the last step of the above, the first term vanishes as n → ∞. Hence, we obtain: n 2
n (1 − p)k(n−k) Pe k k=αn+1
(a)
≤ (1 − p)
(b)
α(1−α)n 2
n 2
n k k=αn+1
≤ (1 − p)α(1−α)n · 2n 2
(c)
≤ e−λα(1−α)n log n · en log 2 .
where (a) is due to the fact that k(n − k) ≥ α(1 − α)n 2 (where equality holds the when k = αn + 1); (b) comes from the binomial theorem ( nk=1 nk = 2n ); and (c) comes from the fact that 1 − p ≤ e− p and p := λ logn n . The last term in the above tends to 0 as n → ∞. This is because λα(1 − α)n log n grows much faster than n log 2. This implies Pe → 0, which completes the achievability proof (3.3).
3.2 Community Detection: The ML Principle
167
Look ahead We have established the achievability of the community detection limit (3.3) by utilizing the ML principle. Conversely, if p < logn n , it is impossible to achieve a probability of error Pe arbitrarily close to 0, regardless of any schemes implemented. In other words, the condition of p > logn n is a fundamental requirement for reliable detection. However, due to the focus of this book, we will not delve into this proof, which involves advanced techniques beyond the scope of this book. Instead, we will discuss a specific yet effective algorithm that guarantees satisfactory performance when p > logn n . Recall the optimal ML decoding rule we used as an achievable scheme: xˆ ML = arg max P (Y|X(x)) . x
Similar to the communication setting, the complexity of the ML rule is not feasible. It necessitates computing the likelihoods of 2n possible events, which exponentially increases with n. Since n is often very large in practice, the computational complexity becomes overwhelming. Nevertheless, an alternative algorithm that achieves almost the same performance as the ML rule exists and is more efficient. In the following section, we will explore this efficient algorithm.
168
3 Data Science Applications
3.3 An Efficient Algorithm and Python Implementation
Recap In the previous section, we invoked the ML principle (one of the communication principles) to prove the achievability of the limit p ∗ = logn n for community detection. Recall the optimal ML decoding rule: xˆ ML = arg max P (Y|X(x)) x
where x = [x1 , . . . , xn ]T indicates the community membership vector; X(x) denotes a codeword matrix taking xi j = xi ⊕ x j as the (i, j) entry; and Y is an observation matrix with yi j ’s (yi j = xi j w.p. p and erasure otherwise). One critical issue in the ML rule is that its complexity is significant. Since x takes one of the 2n possible patterns, the ML rule requires the number 2n of likelihood computations that grows exponentially with n. Hence, it is crucial to develop a computationally efficient algorithm that possibly yields the same performance as the ML rule. Indeed, efficient algorithms have been developed that achieve the optimal ML performance.
Outline In this section, we will examine one such efficient algorithm. The efficient algorithm with optimality is quite complex. For the purpose of explanation, we will focus on a simplified version that contains the primary components of the algorithm while achieving a sub-optimal performance. This section consists of four parts. Firstly, we will introduce the adjacency matrix, which is a fundamental component of the algorithm. Then, we will explain the spectral algorithm and its function of identifying the principal eigenvector of the adjacency matrix. We will then investigate the power method, which is an efficient technique for calculating the principal eigenvector. Finally, we will implement the spectral algorithm in Python.
Adjacency matrix Pairwise measurements constitute big data due to the typically large number n of users, although only a portion of these measurements are actually observed. In the field of data science, the adjacency matrix (Chartrand, 1977) provides a concise representation of such big data. The adjacency matrix, denoted as A, serves as an equivalent representation of the pairwise data, with each row and column representing a user. The entries in the matrix, denoted as ai j , indicate whether users i and j belong to the same community. Specifically, ai j = +1 signifies that the two users are part of the same community, ai j = −1 denotes the opposite, and ai j = 0 indicates a lack of measurement.
3.3 An Efficient Algorithm and Python Implementation
ai j =
169
1 − 2yi j , w.p. p; 0, otherwise (yi j = erasure).
(3.16)
For instance, when x = [1, 0, 0, 1]T , we might have: ⎡
+1 ⎢ −1 A=⎢ ⎣ 0 0
−1 +1 +1 −1
⎤ 0 −1 ⎥ ⎥ −1 ⎦ +1
0 +1 +1 −1
(3.17)
where we have erasures for (y13 , y14 ). In the full measurement setting ( p = 1), one can make a key observation that gives a significant insight into algorithms. When p = 1, we would obtain: ⎡
+1 ⎢ −1 ⎢ A=⎣ −1 +1
−1 +1 +1 −1
−1 +1 +1 −1
⎤ +1 −1 ⎥ ⎥. −1 ⎦ +1
(3.18)
The crucial insight is that A has a rank of 1, meaning that each row can be expressed as a linear combination of the other rows. Is it possible to derive the community pattern from this rank-1 matrix? The response is affirmative, and this serves as the foundation for the spectral algorithm that we will explore in the sequel. Spectral algorithm (Shen, Tang, & Wang, 2011) Using the fact that the rank of A (3.18) is 1, we can easily come up with its eigenvector. ⎡
⎤ +1 ⎢ −1 ⎥ ⎥ v=⎢ ⎣ −1 ⎦ . +1
(3.19)
Verify that Av = 4v. In this equation, the entries of v, which can be either +1 or −1, represent the community memberships of the users. Consequently, when every pair is sampled, the principal eigenvector can correctly recover the communities. This technique, which involves calculating the principal eigenvector of the adjacency matrix, is known as the spectral algorithm. However, what if the measurement is partial, with p < 1? In such a scenario, it is unclear whether the principal eigenvector can accurately identify the community memberships. Additionally, the entries in the eigenvector may not necessarily be +1 or −1. To address the latter issue, we can apply a threshold to the eigenvector, denoted as vth , where each entry takes on the sign of vi :
170
3 Data Science Applications
vth,i =
+1, vi > 0; −1, otherwise.
(3.20)
It has been discovered that vth can accurately recover the ground truth of communities for sufficiently large values of p, as n approaches infinity. However, we will not delve into analyzing the precise value of p that is necessary for the spectral algorithm to successfully recover the communities, as it is beyond the scope of this book. Instead, we will later utilize Python to conduct empirical simulations that demonstrate how vth approaches the ground truth as p increases, particularly for large n values. Power method (Golub & Van Loan, 2013) An additional technical issue arises when attempting to calculate the principal eigenvector of a large adjacency matrix. In Meta’s social networks, for instance, the order of n is approximately 109 . A straightforward approach to computing the eigenvector using eigenvalue decomposition would require a complexity of approximately n 3 , making it impractical. However, a highly efficient and beneficial technique for determining the principal eigenvector is the power method, which is a well-known and commonly used method in data science literature. Prior to describing how it works, let us make important observations that naturally lead to the method. Suppose that the adjacency matrix A ∈ Rn×n has m eigenvalues λi ’s and eigenvectors vi ’s: A := λ1 v1 v1T + λ2 v2 v2T + · · · + λm vm vmT where λ1 > λ2 ≥ λ3 ≥ · · · ≥ λm and vi ’s are orthonormal: viT v j = 1 only if i = j; 0 otherwise. By definition, (λ1 , v1 ) are the principal eigenvalue and eigenvector respectively. Let v ∈ Rn be an arbitrary non-zero vector such that v1T v = 0. Then, Av can be expressed as: Av = =
m
i=1 m
λi vi viT
v (3.21)
λi (viT v)vi
i=1
where the second equality is due to the fact that viT v is a scalar. One key observation is that the first term λ1 (v1T v)v1 in the above summation forms a major contribution, since λ1 is the largest. This major effect becomes more dominant when we multiply the adjacency matrix to the resulting vector Av. To see this clearly, consider:
3.3 An Efficient Algorithm and Python Implementation
171
A2 v = A(Av) ⎞ m ⎛ m
= λi vi viT ⎝ λ j (v Tj v)v j ⎠ i=1 (a)
=
m
j=1
λi2 (viT v)(viT vi )vi +
λi λ j (v Tj v)(viT v j )v j
(3.22)
i= j
i=1 m (b) 2 T = λi (vi v)vi i=1
where (a) comes from the fact that viT v j is a scalar; and (b) vi ’s are orthonormal vectors, i.e., viT vi = 1 and viT v j = 0 for i = j. The distinction in (3.22) relative to Av (3.21) is that we have λi2 instead of λi . The contribution from the principal component is more significant relative to the other components. If we iterate this process (multiply A to the resulting vector iteratively), then we get: Ak v =
m
λik (viT v)vi .
i=1
Observe that in the limit of k, m
λi k viT v Ak v vi −→ v1 as k → ∞. = λ1 v1T v λk1 (v1T v) i=1 This implies that as we iterate the following process (multiplying A and then normalizing the resulting vector), the normalized vector converges to the principal eigenvector. Precisely, as k → ∞,
Ak v Ak v2
−→ v1 as k → ∞.
This observation leads to the power method: 1. Choose a random vector v and set v(0) = v and t = 0. Av(t) 2. Compute v(t+1) = and increase t by 1. Av(t) 2 3. Iterate Step 2 until converged, e.g., v(t+1) − v(t) 2 < = 10−5 . The power method requires multiple (say k) matrix-vector multiplications, each having the complexity of n 2 multiplications. Hence, the complexity of the power method is still on the order of n 2 , as long as the number k of iterations is not so large relative to n (this is often the case in practice). This is much smaller than the
172
3 Data Science Applications
complexity n 3 of the eigenvalue decomposition, especially when n is very large. Due to this computational benefit, the power method is widely employed as an efficient algorithm for finding the principal eigenvector in many applications. implementation of the spectral algorithm We now implement the spectral algorithm via Python. We first generate the community memberships of n users.
Python
from scipy.stats import bernoulli import numpy as np n = 8 # number of users Bern = bernoulli(0.5) # Generate n community memberships x = Bern.rvs(n) print(x) [0 0 0 0 1 0 1 0]
We then construct random pairwise measurements. # Construct the codebook X = np.zeros((n,n)) for i in range(len(x)): for j in range(i,len(x)): # Compute xij = xi + xj (modulo 2) X[i,j] = (x[i]+x[j]) % 2 # Symmetric component X[j,i] = X[i,j] print(X) [[0. [0. [0. [0. [1. [0. [1. [0.
0. 0. 0. 0. 1. 0. 1. 0.
0. 0. 0. 0. 1. 0. 1. 0.
0. 0. 0. 0. 1. 0. 1. 0.
1. 1. 1. 1. 0. 1. 0. 1.
0. 0. 0. 0. 1. 0. 1. 0.
1. 1. 1. 1. 0. 1. 0. 1.
0.] 0.] 0.] 0.] 1.] 0.] 1.] 0.]]
Next we compute the adjacency matrix. # observation probability p = 0.8 obs_bern = bernoulli(p) # Construct an n-by-n mask matrix: # entry = 1 (observed); 0 (otherwise) mask_matrix = obs_bern.rvs((n,n)) # Construct the adjacency matrix A = (1-2*X)*mask_matrix print(1-2*X)
3.3 An Efficient Algorithm and Python Implementation print(mask_matrix) print(A) [[ 1. [ 1. [ 1. [ 1. [-1. [ 1. [-1. [ 1. [[1 1 [1 1 [1 1 [1 1 [1 1 [1 1 [1 1 [1 0 [[ 1. [ 1. [ 1. [ 1. [-1. [ 1. [-1. [ 1.
1. 1. 1. 1. -1. 1. -1. 1. 0 0 1 0 0 1 1 1 1 1 1 1 1 1 1 1 1. 1. 1. 1. -1. 1. -1. 0.
1. 1. 1. 1. -1. 1. -1. 1. 1 0 0 1 1 1 0 1 1 1 1 0 1 1 1 1 0. 1. 0. 1. -1. 1. -1. 1.
1. -1. 1. -1. 1. -1. 1. -1. -1. 1. 1. -1. -1. 1. 1. -1. 1 1] 1 1] 1 1] 1 1] 1 1] 1 1] 1 1] 1 1]] 0. -1. 0. -0. 1. -1. 1. -0. -1. 1. 1. -1. -1. 1. 1. -1.
1. 1. 1. 1. -1. 1. -1. 1.
-1. 1.] -1. 1.] -1. 1.] -1. 1.] 1. -1.] -1. 1.] 1. -1.] -1. 1.]]
0. 1. 1. 1. -1. 0. -1. 1.
-1. 1.] -1. 1.] -1. 1.] -1. 1.] 1. -1.] -1. 1.] 1. -1.] -1. 1.]]
We run the power method to compute the principal eigenvector of A. def power_method(A, eps=1e-5): # A computationally efficient algorithm # for finding the principal eigenvector # Choose a random vector v = np.random.randn(n) # normalization v = v/np.linalg.norm(v) prev_v = np.zeros(len(v)) t = 0 while np.linalg.norm(prev_v-v) > eps: prev_v = v v = np.array(np.dot(A,v)).reshape(-1) v = v/np.linalg.norm(v) t += 1 print("Power method is terminated after %s iterations"%t) return v v1 = power_method(A) print(v1) print(np.sign(v1)) print(1-2*x)
173
174
3 Data Science Applications
Power method is terminated after 8 iterations [ 0.25389707 0.29847768 0.35720613 0.34957476 -0.40941936 0.35720737 -0.40941936 0.36579104] [ 1. 1. 1. 1. -1. 1. -1. 1.] [ 1 1 1 1 -1 1 -1 1]
Notice in the above experiment that the thresholded principal eigenvector np.sign(v1) coincides with the ground truth community vector 1-2*x. Python: Performance
of the spectral algorithm As previously stated, we will conduct Python experiments to show that the principal eigenvector approaches the ground truth of community memberships as p increases. Specifically, we will examine a realistic scenario in which n is large, such as n = 4,000. In order to evaluate the degree of similarity between the principal eigenvector and the community vector, we will utilize a widely used correlation metric known as the Pearson correlation (Freedman, Pisani, & Purves, 2007): ρ X,Y :=
σ X,Y σ X σY
(3.23)
where σ = E[X Y ] − E[X ]E[Y ], σ = E[X 2 ] − (E[X ])2 , and σY = X,Y X 2 2 E[Y ] − (E[Y ]) . To compute the Person correlation, we use a built-in function pearsonr defined in the scipy.stats module. from scipy.stats import bernoulli from scipy.stats import pearsonr import numpy as np import matplotlib.pyplot as plt n = 4000 # number of users Bern = bernoulli(0.5) # Generate n community memberships x = Bern.rvs(n) # Construct the codebook X = np.zeros((n,n)) for i in range(len(x)): for j in range(i,len(x)): # Compute xij = xi + xj (modulo 2) X[i,j] = (x[i]+x[j]) % 2 # Symmetric component X[j,i] = X[i,j] p = np.linspace(0.0003,0.0025,30) limit = np.log(n)/n p_norm = p/limit def power_method(A, eps=1e-5): # A computionally efficient algorithm # for finding the principal eigenvector
3.3 An Efficient Algorithm and Python Implementation # v # v
175
Choose a random vector = np.random.randn(n) normalization = v/np.linalg.norm(v)
prev_v = np.zeros(len(v)) t = 0 while np.linalg.norm(prev_v-v) > eps: prev_v = v v = np.array(np.dot(A,v)).reshape(-1) v = v/np.linalg.norm(v) t += 1 print("Power method is terminated after %s iterations"%t) return v corr = np.zeros_like(p) for i,val in enumerate(p): obs_bern = bernoulli(val) # Construct an n-by-n mask matrix: # entry = 1 (observed); 0 (otherwise) mask_matrix = obs_bern.rvs((n,n)) # Construct the adjacency matrix A = (1-2*X)*mask_matrix # Power method v1 = power_method(A) # Threshold the principal eigenvector v1 = np.sign(v1) # Compute the ground truth ground_truth = 1-2*x # Compute Pearson correlation btw estimate & ground truth corr[i] = np.abs(pearsonr(ground_truth,v1)[0]) print(p_norm[i], corr[i]) plt.figure(figsize=(5,5), dpi=200) plt.plot(p_norm, corr) plt.title(’Pearson correlation btw estimate and ground truth’) plt.grid(linestyle=’:’, linewidth=0.5) plt.show()
Observe in Fig. 3.9 that the thresholded principal eigenvector becomes increasingly similar to the ground-truth community vector (or its inverse) as p increases, as indicated by the high Pearson correlation for large values of p. When p is close to the limiting value of p ∗ = logn n , the Pearson correlation approaches 1, suggesting that the spectral algorithm attains near-optimal performance, as promised by the ML decoding rule. However, this is a somewhat heuristic argument. To provide a more rigorous argument, we would need to rely on the empirical error rate, which can be calculated by computing the Pearson correlation over numerous random realizations of community vectors. For the sake of computational simplicity, we instead employ the Pearson correlation, which can be accurately computed with just one random trial per each p.
176
3 Data Science Applications
Fig. 3.9 Pearson correlation between the thresholded principal eigenvector and the ground-truth community vector as a function of λ := pp∗ = log pn/n
Look ahead Thus far, we have examined one application of communication principles in data science: community detection. We highlighted the importance of the maximum likelihood principle in demonstrating the achievability of the community detection limit, as well as the notion of phase transition, which was observed using Python simulations. In the following section, we will turn our attention to the second application, which is closely linked to community detection: a computational biology problem called Haplotype phasing. We will explore the nature of this problem and its connection to community detection.
3.3 An Efficient Algorithm and Python Implementation
177
Problem Set 7 Problem 7.1 (Basics on bounds and combinatorics) (a) Let p ≥ 0. Prove that 1 − p ≤ e− p . Also specify the condition under which the equality holds. (b) Show that for non-negative integers n and k (k ≤ n): n ≤ ek log n . k (c) Show that for integers n ≥ 0: n
n k=0
k
= 2n .
Problem 7.2 (The concept of reliable community detection) Suppose there are n users clustered into two communities. Let xi ∈ {0, 1} indicate a membership of community with regard to user i ∈ {1, 2, . . . , n}. Assume that we are given part of the pairwise measurements: yi j =
xi ⊕ x j , w.p. p; w.p.1 − p,
erasure,
for every pair (i, j) ∈ {(1, 2), (1, 3), . . . , (1, n), (2, 3), . . . , (n − 1, n)} and p ∈ [0, 1]. We also assume that yi j ’s are independent over (i, j). Given yi j ’s, one wishes to decode the community membership vector x := [x1 , x2 , . . . , xn ] or its flipped counterpart x ⊕ 1 := [x1 ⊕ 1, x2 ⊕ 1, . . . , xn ⊕ 1]. Let xˆ be an estimate. Define the probability of error as: / {x, x ⊕ 1} . Pe = P xˆ ∈ (a) Let sample complexity be the number of pairwise measurements which are not erased. Show that sample complexity n −→ p as n → ∞. 2
Hint: You may want to use the statement in Problem 8.3. (b) Consider the following optimization problem. Given p and n,
178
3 Data Science Applications
Fig. 3.10 Illustration of graph connectivity and disconnectivity
1
1 2
5
4
2
5
4
3
connected
3
disconnected
Pe∗ ( p, n) := min Pe . algorithm
State the definition of reliable detection. Also state the definition of minimal sample complexity using the concept of reliable detection. (c) Consider a slightly different optimization problem. Given p, Pe∗ ( p) :=
min
algorithm,n
Pe .
The distinction here is that n is a design parameter. For > 0, what are Pe∗ ( logn n + ) and Pe∗ ( logn n − )? Also explain why.
Problem 7.3 (An upper bound) Let p = λ logn n where λ > 1. Show that 2
n n
k=1
k
(1 − p)k(n−k) −→ 0
as n → ∞.
Hint: Use the bounds in Problem 7.1.
Problem 7.4 (Erd˝os-Rényi random graph) Consider a random graph G that two Hungarian mathematicians (named Paul Erd˝os and Alfréd Rényi) introduced in (Erd˝os, & Rényi, et al., 1960). Hence, the graph is called the Erd˝os-Rényi graph. The graph includes n nodes and assumes that an edge appears w.p. p ∈ [0, 1] for any pair of two nodes in an independent manner. We say that a graph is connected if for any two nodes, there exists a path (i.e., a sequence of edges) that connects one node to the other; otherwise, it is said to be disconnected. See Fig. 3.10 for examples.
3.3 An Efficient Algorithm and Python Implementation
179
(a) Show that P (G is disconnected) ≤
n−1
n (1 − p)k(n−k) . k k=1
Hint: Think about a necessary event for disconnectivity. (b) Show that if p > logn n , then P(G is disconnected) → 0 as n → ∞. Hint: Recall the achievability proof for community detection that we did in Sect. 3.2. (c) It has been shown that if p < logn n , one can verify that P(G is disconnected) → 1 as n → ∞. This together with the result in part (a) implies that the sharp threshold on p for graph connectivity is exactly the same as the one for community detection: p∗ =
log n . n
Relate this to the fundamental limit on the observation probability in the community detection problem, i.e., explain why the limits are same.
Problem 7.5 (The coupon collector problem) There are n different coupons. Suppose that whenever one buys a cracker, the probability that the cracker contains a particular coupon among the n coupons is n1 , i.e., the n kinds of couples are uniformly distributed over the entire crackers that are being sold. (a) Suppose that Alice has k(< n) distinct coupons. When Alice buys a new cracker, what is the probability that the new cracker contains a new coupon (i.e., being different from the k coupons that Alice possesses)? Let X k be the number of crackers that Alice needs to buy to acquire a new coupon. Show that m−1 n−k k , P(X k = m) = n n n . E[X k ] = n−k (b) Suppose that Bob has no coupon. Let K := n−1 k=0 X k indicate the number of crackers that Bob needs to buy to collect all the coupons. Show that 1 1 1 . E[K ] = n 1 + + + · · · + 2 3 n
(c) Using the fact that nk=1 k1 ≈ log n in the limit of n (check Euler-Maclaurin formula in wikepidia for details), argue that in the limit of n,
180
3 Data Science Applications
E [K ] ≈ n log n.
(3.24)
n Note: This is order-wise the same as the minimal sample complexity n log 2 required for reliable community detection. (d) Using Python, plot E[K ] and n log n in the same figure for a proper range of n, say 1 ≤ n ≤ 10, 000.
Problem 7.6 (True or False?) (a) Consider the two-community detection problem. Let x := [x1 , x2 , . . . , xn ] be the community membership vector in which xi ∈ {0, 1} and n denotes the total number of users. Suppose we are given part of the comparison pairs with observation probability p. In Sect. 3.1, we formulated an optimization problem which aims / {x, x ⊕ 1} . Given to minimize the probability of error defined as Pe := P xˆ ∈ p and n, denote by Pe∗ ( p, n) the minimum probability of error. In Sect. 3.2, we did not intend to derive the exact Pe∗ ( p, n); instead we developed a lower bound of Pe∗ ( p, n) to demonstrate that for any p > logn n , the probability of error can be made arbitrarily close to 0 as n tends to infinity. (b) Consider an inference problem in which we wish to decode X ∈ X from Y ∈ Y where X and Y indicate the ranges of X and Y , respectively. Given Y = y, the optimal decoder is: Xˆ = arg max P (Y = y|X = x) . x∈X
3.4 Haplotype Phasing as a Communication Problem
181
3.4 Haplotype Phasing as a Communication Problem
Recap In the preceding sections, we utilized the ML principle to establish the achievability of the fundamental limit on the observation probability p necessary for community detection: p>
log n ⇐⇒ Pe → 0 as n → ∞. n
We also observed the phase transition phenomenon at the limit of logn n through Python simulation, where the estimated vector obtained by the spectral algorithm converges to the ground-truth community vector when p is close to logn n . Now, we will delve into another data science application where the ML principle and the phase transition phenomenon play crucial roles. We will focus on a specific problem in computational biology known as Haplotype phasing (Browning & Browning, 2011; Das & Vikalo, 2015; Chen, Kamath, Suh, & Tse, 2016; Si, Vikalo, & Vishwanath, 2014), which is a significant DNA sequencing problem. This problem has a close connection with community detection and can be formulated as a communication problem.
Outline This section delves into the problem of Haplotype phasing and its connection to community detection. It consists of four parts. Firstly, we will explain two important keywords: DNA sequence and Haplotype. Secondly, we will explain what the Haplotype phasing problem is. Thirdly, the connection to community detection is explored using domain knowledge in computational biology. Finally, the fundamental limit that arises in the problem will be discussed.
Our 23 pairs of chromosome To begin with, we will define two important terms: (1) DNA sequence and (2) Haplotype. Our bodies are composed of numerous cells, each of which contains a vital component - the nucleus - that contains 23 pairs of chromosomes. One chromosome in each pair is inherited from the mother (known as the maternal chromosome), while the other is inherited from the father (known as the paternal chromosome). Please refer to Fig. 3.11 for a visual representation of these 23 pairs of chromosomes. Each of a pair comprises a sequence of elements, called “bases”, and each base takes one of the four letters {A, C, T, G}. The DNA sequence refers to such a
182
3 Data Science Applications
maternal
paternal
source: epigenetics tutorial of Umich BioSocial Collaborative
Fig. 3.11 23 pairs of chromosome
maternal sequence
paternal sequence
almost identical! ~ 99.9%
Fig. 3.12 The maternal sequence is almost (99.9%) identical to the paternal sequence
sequence. The typical length of the sequence is on the order of 3 billion (roughly one quarter of total population). When examining a pair of chromosomes, we observe an intriguing pattern of sequence. Specifically, the maternal and paternal sequences are almost identical, with a similarity of approximately 99.9%. Please refer to Fig. 3.12 for a visual representation. Any differences between the two sequences occur only at certain positions and are referred to as “Single Nucleotide Polymorphisms (SNPs)” - pronounced as “snips”. Research has shown that SNP patterns play a crucial role in personalized medicine. They can aid in predicting the likelihood of certain cancers occurring, identifying somatic mutations such as HIV, and understanding phylogenetic trees that illustrate relationships between different species. The term “Haplotype” refers to a pair of these two SNP sequences.
3.4 Haplotype Phasing as a Communication Problem
183
Haplotype phasing The term “Haplotype phasing” refers to the process of determining the Haplotype, which is the pair of two SNP sequences. This involves two distinct tasks: (1) identifying the positions of the SNPs, and (2) decoding the sequence pattern. The locations of the SNPs are typically well-documented using a technique called “SNP calling” (Nielsen, Paul, Albrechtsen, & Song, 2011). As a result, Haplotype phasing usually involves identifying the pattern of the SNPs. It is worth noting that each element of the SNP sequence takes a value in {A, C, T, G}. This may lead one to question the connection between Haplotype phasing and the community detection problem, where the components of the community membership vector take binary values. Binary representation based on major vs minor allele There is a significant characteristic of the SNP base values that enables a direct link to the community detection problem. The bases can be classified into two types: (1) major alleles, which are present in the majority of humans, and (2) minor alleles, which are present in a smaller proportion of humans. Please refer to Fig. 3.13 for an illustration. For instance, the first SNP in the sequence reads “A” for the majority of humans, so it represents the major allele and is denoted by 0. However, human 3’s SNP in that position reads “T ”, which is a rare occurrence and is classified as a minor allele, denoted by 1. Any letter other than the one associated with the major allele is considered a minor allele, so in this example, “T ”, “C”, and “G” are all considered minor alleles. The fact that each SNP is categorized into only two types - major or minor allele - allows us to represent it as a binary value. This, in turn, establishes a connection to the community detection problem. Two types of SNP positions Have we established the complete connection between the two problems? Not yet. There is still a difference between the two problems.
major Ch1 (M)
A
major C
minor C
minor T
1
0
Human 1
A
C
G
A
Human 2
A
C
G
A
Human 3
T
C
G
A
Human 4
A
C
G
A
A
G
G
A
: Human i :
Fig. 3.13 Major versus minor allele
184
3 Data Science Applications
Haplotype phasing involves determining two sequences (vectors), while community detection only requires decoding a single vector. Nevertheless, Haplotype phasing can be transformed into a problem of decoding only one vector (up to a global shift). To establish this connection, we must examine another property of SNPs. Specifically, each SNP position is either heterozygous or homozygous. Heterozygous positions indicate that the maternal and paternal bases are complementary, while homozygous positions indicate that they are identical. A standard method exists for determining the type of all SNP positions. Assuming that all SNP positions are heterozygous simplifies the problem. Under this assumption, we can observe that the two problems are identical. Why? Let x be the sequence of SNPs from the mother. Then, the father’s sequence would be its flipped version x ⊕ 1. The aim of Haplotype phasing is to decode x or x ⊕ 1. Mate-pair read (Browning & Browning, 2011; Das & Vikalo, 2015) The final aspect to consider in connecting the two problems is the type of information we are given. In the community detection problem, we are provided with pairwise comparison information in the form of xi ⊕ x j . Interestingly, the information we can obtain in the context of Haplotype phasing is similar in nature. This information is related to the current sequencing technology, which uses a widely used technique known as “shotgun sequencing” to generate short fragments, or “reads,” of the entire sequence. The current sequencing technology typically produces read lengths of 100 to 500 bases, while SNPs consist of around 3 million bases and the entire DNA sequence is about 3 billion bases long. As a result, the average inter-SNP distance is approximately 1000 bases, but the read length is much shorter than the inter-SNP distance. Each read (spanning 100 to 200 bases) usually contains only one SNP. In fact, this poses a challenge because we do not know which chromosome each read comes from between the mother and father. Mathematically, this can be described as follows: let yi denote the ith SNP contained in a read, and this is the information we obtain: yi =
w.p. 21 ; xi (mother’s), xi ⊕ 1 (father’s), w.p. 21 .
Since the probabilities of getting xi and xi ⊕ 1 are all equal to 21 , there is no way to figure out xi from yi . An advanced sequencing technique has been developed to address this challenge. The advanced method allows for two reads simultaneously, which are known as “mate-pair reads”. As shown in Fig. 3.14, the mate-pair reads come to the rescue. One benefit of this sequencing technology is that both reads come from the same individual, which means that we obtain: (yi , y j ) =
w.p. 21 ; (xi , x j ) (mother’s), (xi ⊕ 1, x j ⊕ 1) (father’s), w.p. 21 .
3.4 Haplotype Phasing as a Communication Problem
185
1000 bases (on average) Ch1 (M)
0
0
1
Ch1’ (F)
1
1
0
“mate-pair read” Fig. 3.14 Mate-pair read
Connection to community detection What we know for sure from (yi , y j ) is their parity: yi ⊕ y j = xi ⊕ x j . This is exactly the pairwise measurement given in the context of community detection, now making the perfect connection to community detection. In practice, however, what we get is a noisy version of (yi , y j ): (yi , y j ) =
w.p. 21 ; (xi ⊕ z i , x j ⊕ z j ) (mother’s), (xi ⊕ 1 ⊕ z i , x j ⊕ 1 ⊕ z j ) (father’s), w.p. 21
where z i indicates an additive noise w.r.t. the ith SNP. As per extensive experiments, it is found that z i ’s can be modeled as i.i.d., each being according to say Bern(q) (meaning that the noise statistics are identical and independent across all SNPs). Here the parity is yi j := yi ⊕ y j = xi ⊕ x j ⊕ z i ⊕ z j . Let z i j := z i ⊕ z j . Then, its statistics would be z i j ∼ Bern(2q(1 − q)). Why? Using the total probability law, we get: P(z i j = 1) = P(z i = 1)P(z j = 0|z i = 1) + P(z i = 0)P(z j = 1|z i = 0) = q(1 − q) + (1 − q)q = 2q(1 − q) where the second equality is due to the independence of z i and z j . Denoting θ = 2q(1 − q), we see that the measurement yi j is a noisy version of xi j . One may wonder if looking at the parity only (instead of individual measurements (yi , y j )) suffices to decode x. In other words, one may wonder if yi j is a sufficient statistic. It turns out it is indeed the case. Check in Problem 8.8(c). This suggests that we do not lose any information although we consider only parities. Translation into a communication problem The connection mentioned above was established in (Chen et al., 2016). The authors applied this connection to a partial and random measurement model, where the parity is observed independently from others with probability p:
186
3 Data Science Applications
noisy partial observation
pairwise info
decoder
Fig. 3.15 Translation of Haplotype phasing into a communication problem under a noisy channel with partial observations
yi j =
xi ⊕ x j ⊕ z i j , w.p. p; w.p. 1 − p
erasure,
where z i j ∼ Bern(θ ) and θ ∈ (0, 21 ). Without loss of generality, one can assume that 0 ≤ θ < 21 ; otherwise, one can flip all 0’s into 1’s and 1’s into 0’s. Since we wish to infer x from yi j ’s (an inference problem), this problem can be interpreted as a communication problem illustrated in Fig. 3.15. The fundamental limit (Chen et al., 2016) The above model includes the noiseless scenario θ = 0 as a special case. What one can easily expect in this model is that the larger θ , the larger the limit p ∗ . It turns out that the fundamental tradeoff behaves like: p∗ =
1 log n · −D(0.5θ) n 1−e
(3.25)
where D(··) denotes the Kullback-Leibler (KL) divergence defined w.r.t. a natural logarithm: 0.5 0.5 + 0.5 log θ 1−θ 1 . = 0.5 log 4θ (1 − θ )
D(0.5θ ) = 0.5 log
Plugging this into (3.25), we get: p∗ =
1 log n · . √ n 1 − 4θ (1 − θ )
(3.26)
We see that the limit indeed increases. More concretely, it grows exponentially with θ . See Fig. 3.16. The only distinction in the noisy setting relative to the noiseless community detec1 tion problem is that we have a factor 1−e−D(0.5θ ) in (3.25), reflecting the noise effect. In the noiseless setting θ = 0, D(0.5θ ) = ∞, so it reduces to the limit logn n .
3.4 Haplotype Phasing as a Communication Problem
187
80 70 60 50 40 30 20 10 0 0
0.1
0.2
0.3
0.4
0.5
Fig. 3.16 The effect of noise upon the limit
Look ahead In this section, we related community detection to one of applications in computational biology: Haplotype phasing. We demonstrated that Haplotype phasing is essentially a noisy version of the community detection problem and posited that phase transition occurs on the observation probability. This sharp threshold reads: 1 p ∗ = logn n · 1−e−D(0.5θ ) . In the next section, we will prove the achievability of the claimed limit, again exploiting the ML principle.
188
3 Data Science Applications
3.5 Haplotype Phasing: The ML Principle
Recap We established an intriguing relationship between Haplotype phasing and community detection in the preceding section. It was demonstrated that Haplotype phasing is a noisy version of community detection, with the aim of decoding x = [x1 , . . . , xn ] from yi j ’s. yi j =
xi ⊕ x j ⊕ z i j , w.p. p; w.p. 1 − p
erasure,
where z i j ∼ Bern(θ ) and θ ∈ (0, 21 ). We then claimed that as in community detection, phase transition occurs on observation probability: p∗ =
1 log n · n 1 − e−D(0.5θ)
where D(0.5θ ) denotes the KL divergence between Bern(0.5) and Bern(θ ) defined w.r.t. the natural logarithm: D(0.5θ ) := 0.5 log
0.5 1 0.5 + 0.5 log = 0.5 log . θ 1−θ 4θ (1 − θ )
Outline This section is dedicated to proving the achievability of the limit p ∗ = logn n · 1 . We will undertake three steps to accomplish this. Firstly, we will derive 1−e−D(0.5θ ) the optimal ML decoder. Next, we will set up a framework to analyze the error probability. While the overall proof procedure is similar to the noiseless case, there are several key differences that require the use of important bounding techniques. We will provide details on these techniques. Finally, by applying the bounding techniques to the error probability, we will prove the achievability.
The optimal ML decoder The proof follows a similar approach as the noiseless case, utilizing the same deterministic encoder that constructs the codeword matrix X(x) with xi j = xi ⊕ x j as the (i, j) entry. However, a difference arises in the decoder side. While the optimal decoder in the noiseless case is based solely on the concept of compatibility vs. incompatibility, the ML decoder is not. To illustrate this point, let’s consider an example where n = 4, the ground-truth vector x = 0, and the observation matrix Y is given as follows:
3.5 Haplotype Phasing: The ML Principle
189
⎡
1e ⎢ 0 Y=⎢ ⎣
e
⎤
0⎥ ⎥. 1⎦
(3.27)
An important observation is that not all observed components match their corresponding xi j ’s. For instance, in the example provided, noise is added to the (1, 2) and (3, 4) entries resulting in (y12 , y34 ) = (1, 1). This ambiguity complicates the calculation of P(Y|X(x)), relative to the noiseless case. The likelihood is not only a function of erasure patterns but also depends on flipping error patterns, which makes a distinction. In the noiseless case, the likelihood takes either 0 or one particular non-zero value. However, in the noisy case, we have multiple non-zero values that the likelihood can take on, depending on flipping error patterns. For example, given Y in (3.27), the likelihoods are: P (Y|X(0000)) = (1 − p)2 p 4 · θ 2 (1 − θ )2 ; P (Y|X(1000)) = (1 − p)2 p 4 · θ 1 (1 − θ )3 ; P (Y|X(0100)) = (1 − p)2 p 4 · θ 3 (1 − θ )1 where ⎡
11 ⎢ 0 X(1000) = ⎢ ⎣
⎡ ⎤ ⎤ 100 1 ⎢ ⎥ 0⎥ ⎥ , X(0100) = ⎢ 1 1 ⎥ . ⎣ ⎦ 0⎦ 0
We see in all likelihoods that the first product term (1 − p)2 p 4 is common. So the second term associated with a flipping error pattern will decide the ML solution. Here the numbers marked in red indicate the numbers of flips. Since the flipping probability θ is assumed to be less than 21 , the smallest number of flips would maximize the likehlihood, thus yielding: xˆ ML = arg min d(X(x), Y) x
(3.28)
where d(·, ·) denotes the number of distinct bits between the two arguments. In the communication and information theory literature, the number is called the Hamming distance (Hamming, 1950). A setup for the analysis of the error probability For the achievability proof, let us analyze the probability of error. Taking the same procedures as in the noiseless case, we get:
190
3 Data Science Applications
0
0
0
1
1
1
1
0
0
1
1
1
1
0
1
1
1
1
0
0
0
0
0
0
0
0
0 0
*
*
*
e
0
e
1
*
*
1
0
e
0
*
e
e
1
e
*
*
*
*
*
*
*
*
* *
Fig. 3.17 Illustration of the distinguishable positions colored in purple. Unlike the noiseless case, all that matters in a decision is the Hamming distance. So the error event implies that the number of 1’s is greater than or equal to the number of 0’s in the k(n − k) distinguishable positions
Pe := P xˆ ML ∈ / {x, x ⊕ 1} = P xˆ ML ∈ / {0, 1}|x = 0
P xˆ ML = a|x = 0 ≤ a∈{0,1} /
=
n−1
P xˆ ML = ak |x = 0
(3.29)
k=1 ak ∈Ak
=
n−1
k=1
n P xˆ ML = ak |x = 0 k
where the first equality is by symmetry (each error probability does not depend on the input vector pattern); and the inequality comes from the union bound. In the second last equality, we take an equivalent yet more insightful expression, by introducing a set Ak := {a : a1 = k}, as in the noiseless case. A bound on P(ˆxML = ak |x = 0) Let us focus on P(ˆxML = ak |x = 0). Unlike the noiseless case, deriving its upper bound is not that straightforward. To see this, consider a case in which ak = (1 · · · 10 · · · 0) ∈ Ak . Notice that X(ak ) takes 1’s only in the last n − k positions of the first k rows (that we called distinguishable positions relative to X(0)). See the middle in Fig. 3.17. In the noiseless case, the error event {ˆxML = ak |x = 0} must imply that such positions are all erased, since otherwise X(ak ) is incompatible. In the noisy case, on the other hand, the error event does not necessarily imply that the distinguishable positions are all erased, since X(ak ) could still be a candidate for the solution even if a few observations are made in such positions. As indicated in (3.28), what matters in a decision is the Hamming distance. As long as d(X(ak ), Y) ≤ d(X(0), Y), the
3.5 Haplotype Phasing: The ML Principle
191
codeword X(ak ) can be chosen as a solution, no matter what the number of erasures is in those positions. Hence what we can say is that the error event only suggests that the number of 1’s is greater than or equal to the number of 0’s in the k(n − k) distinguishable positions. Let M be the number of observations made in the distinguishable positions. Then, we get: for a ∈ Ak , P xˆ ML = ak |x = 0 ≤ P (d(X(ak ), Y) ≤ d(X(0), Y)|x = 0) = P (# 1’s ≥ # 0’s in the distinguishable positions|x = 0) (a)
=
k(n−k)
(3.30) P(M = |x = 0)
=0
× P (# 1’s ≥ # 0’s in distinguishable positions|x = 0, M = ) p (1 − where (a) is due to the total probability law. Here P(M = |x = 0) = k(n−k) p)k(n−k)− . Now consider P (# 1 ’s ≥ # 0’s in distinguishable positions |x = 0, M = ). Let Z i be the measured value at the ith observed entry in the k(n − k) distinguishable positions. Then, Z i ’s are i.i.d. ∼ Bern(θ ) where i ∈ {1, 2, . . . , }. Using this, we get: P # 1 s ≥ # 0 s in distinguishable positions|x = 0, M = Z1 + Z2 + · · · + Z ≥ 0.5 . =P
(3.31)
Since Z i ’s are i.i.d. ∼ Bern(θ ), one may expect that the empirical mean of Z i ’s would be concentrated around the true mean θ as increases. In fact, it is indeed the case and it can rigorously be proved via the Weak Law of Large Numbers (WLLN). Please check the formal statement of WLLN in Problem 8.3. Also by our assumption, θ < 0.5. Hence, the probability P( Z 1 +Z 2 +···+Z ≥ 0.5) would converge to zero, as tends to infinity. But what we are interested in here is how fast the probability converges to zero. There is a very well-known concentration bound which characterizes a precise convergence behavior of the probability. That is, the Chernoff bound (Bertsekas & Tsitsiklis, 2008; Gallager, 2013), formally stated below: P
Z1 + Z2 + · · · + Z ≥ 0.5 ≤ e−D(0.5θ) .
(3.32)
You may want to prove it in Problem 8.4, possibly consulting with Problem 8.1. Applying this into (3.30), we get:
192
3 Data Science Applications
P xˆ ML = ak |x = 0 ≤
k(n−k)
P(M = |x = 0)
=0
× P # 1 s ≥ # 0 s in distinguishable positions|x = 0, M = k(n−k) (a) k(n − k) p (1 − p)k(n−k)− e−D(0.5θ) ≤ =0 k(n−k)
k(n − k) pe−D(0.5θ) k(n−k) = (1 − p) 1− p =0 k(n−k) pe−D(0.5θ) (b) = (1 − p)k(n−k) 1 + 1− p −D(0.5θ) k(n−k) = 1 − p(1 − e ) p (1 − p)k(n−k)− and the Cherwhere (a) comes from P(M = |x = 0) = k(n−k) noff bound (3.32); and (b) is due to the binomial theorem: nk=0 nk x n−k y k = (x + y)n . The final step of the achievability proof Putting the above to (3.29), we get: n−1
n P xˆ ML = ak |x = 0 k k=1 n−1
k(n−k) n 1 − p(1 − e−D(0.5θ) ) . ≤ k k=1
Pe ≤
Remember what we proved in the noiseless case (check the precise statement in Problem 7.3): n−1
n k=1
k
(1 − q)k(n−k) −→ 0 if q >
log n . n
Hence, by replacing q with p(1 − e−D(0.5θ) ) in the above, one can make Pe arbitrarily close to 0, provided that p(1 − e−D(0.5θ) ) >
log n 1 log n ⇐⇒ p > · . −D(0.5θ) n n 1−e
This completes the achievability proof.
3.5 Haplotype Phasing: The ML Principle
193
Look ahead Using the ML principle, we proved the achievability of the limit in the noisy obser1 vation model p ∗ = logn n · 1−e−D(0.5θ ) . As in the noiseless case, we will omit the proof 1 of the other way around: if p < logn n · 1−e−D(0.5θ ) , then Pe cannot be made arbitrarily close to 0 no matter what and wahtsoever. The next section will delve into efficient algorithms that could potentially achieve the optimal performance promised by the ML rule.
194
3 Data Science Applications
3.6 An Efficient Algorithm and Python Implementation
Recap In the previous section, we invoked the ML principle to prove the achievability of the limit in the noisy observation model (inspired by Haplotype phasing): p∗ =
1 1 log n log n · · = √ n 1 − e−D(0.5θ) n 1 − 4θ (1 − θ )
where θ is the noise flipping error rate. Since the optimal ML decoding rule comes with a challenge in computational complexity (as in the noiseless case), it is important to develop computationally efficient algorithms that achieve the optimal ML performance yet with much lower complexities. Even in the noisy setting, such efficient algorithms are already developed.
Outline In this section, we will examine two efficient algorithms that potentially achieve the optimal ML performance. The first algorithm is the spectral algorithm, which we used in the noiseless scenario. The second algorithm is a slightly more complex version that starts with an initial estimate using the spectral algorithm and then improves the estimate for better performance (Chen et al., 2016). Although the second algorithm is optimal, we will not provide a proof of optimality in this book but instead focus on how the algorithms work and their implementation in Python. This section consists of four parts. Firstly, we will revisit the spectral algorithm. Secondly, we will apply the algorithm in the noisy observation setting and evaluate its performance using Python simulations. Thirdly, we will explain the second algorithm and its implementation using Python. Finally, we will demonstrate that the second algorithm with the additional operation outperforms the spectral algorithm.
Review of the spectral algorithm The spectral algorithm is based on the adjacency matrix A ∈ Rn×n where each entry ai j indicates whether users i and j are in the same community: ai j =
1 − 2yi j , w.p. p; 0, w.p. 1 − p (yi j = erasure)
(3.33)
where yi j = xi ⊕ x j ⊕ z i j and z i j ’s are i.i.d. ∼ Bern(θ ). Here ai j = 0 denotes no measurement. In the noiseless case, ai j = +1 means that the two users are in the same community; and ai j = −1 indicates the opposite.
3.6 An Efficient Algorithm and Python Implementation
195
The spectral algorithm is based on the observation that the rank of A is one in the noiseless full measurement setting, and that the principal eigenvector of A matches the ground-truth community vector up to a global shift. The algorithm uses the power method to compute the principal eigenvector efficiently. It then declares a thresholded version of this eigenvector as the estimate. Here is a summary of the steps involved in the spectral algorithm. 1. Construct the adjacency matrix A as per (3.33). 2. Choose a random vector v and set v(0) = v and t = 0. Av(t) 3. v(t+1) = and increase, t by 1. Av(t) 2 4. Iterate Step 3 until converged, e.g., v(t+1) − v(t) 2 < = 10−5 . Python implementation of the spectral algorithm
We implement the spectral algorithm via Python. The code below is almost the same as in the noiseless setting. The only distinction is that yi j is a noisy version of xi ⊕ x j . We consider a setting where the flipping error rate θ = 0.1 and n = 4000. from scipy.stats import bernoulli from scipy.stats import pearsonr import numpy as np import matplotlib.pyplot as plt n = 4000 # number of users Bern = bernoulli(0.5) # Generate n community memberships x = Bern.rvs(n) # Construct the codebook X = np.zeros((n,n)) for i in range(len(x)): for j in range(i,len(x)): # Compute xij = xi + xj (modulo 2) X[i,j] = (x[i]+x[j]) % 2 # Symmetric component X[j,i] = X[i,j] # Construct an observation matrix theta = 0.1 # noise flipping error rate noise_Bern = bernoulli(theta) noise_matrix = noise_Bern.rvs((n,n)) Y = (X + noise_matrix)%2 p = np.linspace(0.001,0.0065,30) limit = 1/(1-np.sqrt(4*theta*(1-theta)))*np.log(n)/n p_norm = p/limit
196
3 Data Science Applications
def power_method(A, eps=1e-5): # A computionally efficient algorithm # for finding the principal eigenvector # Choose a random vector v = np.random.randn(n) # normalization v = v/np.linalg.norm(v) prev_v = np.zeros(len(v)) t = 0 while np.linalg.norm(prev_v-v) > eps: prev_v = v v = np.array(np.dot(A,v)).reshape(-1) v = v/np.linalg.norm(v) t += 1 print("Power method is terminated after %s iterations"%t) return v corr = np.zeros_like(p) for i,val in enumerate(p): obs_bern = bernoulli(val) # Construct an n-by-n mask matrix: # entry = 1 (observed); 0 (otherwise) mask_matrix = obs_bern.rvs((n,n)) # Construct the adjacency matrix A = (1-2*Y)*mask_matrix # Power method v1 = power_method(A) # Threshold the principal eigenvector v1 = np.sign(v1) # Compute the ground truth ground_truth = 1-2*x # Compute Pearson correlation btw estimate & ground truth corr[i] = np.abs(pearsonr(ground_truth,v1)[0]) print(p_norm[i], corr[i]) plt.figure(figsize=(5,5), dpi=200) plt.plot(p_norm, corr) plt.title(’Pearson correlation btw estimate and ground truth’) plt.grid(linestyle=’:’, linewidth=0.5) plt.show()
As shown in Fig. 3.18 the Pearson correlation approaches 1 as p approaches the limit p ∗ . As previously mentioned, there is another algorithm that outperforms the spectral algorithm. We will now examine this algorithm. Additional step: Local refinement The other algorithm takes two steps: (i) running the spectral algorithm; and then (2) performing an additional step called local refinement. The role of the second step is to detect any errors assuming that a majority of the components in the estimate vector are correct. The idea for local refinement is
to use a coordinate-wise maximum likelihood estimator (MLE). Here is how it works. Suppose we pick a user, say user i. We then compute the coordinate-wise likelihood w.r.t. the membership value of user i. But a challenge arises in computing the coordinate-wise likelihood. The challenge is that it requires knowledge of the ground-truth memberships of other users, although they can never be revealed. As a surrogate, we employ an initial estimate, say x^(0) = [x_1^(0), . . . , x_n^(0)], obtained in the earlier step. Specifically, we compute:

x̂_i^MLE = arg max_{a ∈ {x_i^(0), x_i^(0) ⊕ 1}} P(Y | x_1^(0), . . . , x_{i−1}^(0), a, x_{i+1}^(0), . . . , x_n^(0)).
To find x̂_i^MLE, we only need to compare the two likelihood functions w.r.t. a = x_i^(0) and a = x_i^(0) ⊕ 1. So it boils down to comparing the following two:

Σ_{j:(i,j)∈Ω} y_ij ⊕ x_i^(0) ⊕ x_j^(0)   vs   Σ_{j:(i,j)∈Ω} y_ij ⊕ x_i^(0) ⊕ 1 ⊕ x_j^(0)   (3.34)

where Ω indicates the set of (i, j) pairs such that y_ij ≠ erasure. In order to gain insight into the above two terms, consider an idealistic setting where x^(0) is perfect, matching the ground truth. In this case, the two terms become:
Fig. 3.18 Pearson correlation between the ground-truth community vector and the estimate obtained from the spectral algorithm: n = 4000 and θ = 0.1
Σ_{j:(i,j)∈Ω} z_ij   vs   Σ_{j:(i,j)∈Ω} (z_ij ⊕ 1).   (3.35)
Since the flipping error rate θ is assumed to be less than 0.5, the first term is likely to be smaller than the second. So it would be reasonable to take the candidate that yields a smaller value among the two. This is exactly what the coordinate-wise MLE does:

x̂_i^MLE = x_i^(0) if Σ_{(i,j)∈Ω} y_ij ⊕ x_i^(0) ⊕ x_j^(0) < Σ_{(i,j)∈Ω} y_ij ⊕ x_i^(0) ⊕ 1 ⊕ x_j^(0); and x̂_i^MLE = x_i^(0) ⊕ 1 otherwise.

We repeat the same procedure for other users, one-by-one and step-by-step. This forms one iteration to yield x^(1) = [x̂_1^MLE, . . . , x̂_n^MLE]. It has been shown in (Chen et al., 2016) that with multiple iterations (around the order of log n iterations), the coordinate-wise MLE converges to the ground-truth community vector. We will not prove this here. Instead we will provide simulation results to demonstrate that the local refinement offers a better performance. See below the code for the local refinement step:

# initial estimate obtained from the spectral algorithm
x0 = (1 - v1)//2
xt = x0
ITER = 3
for t in range(1,ITER):
    xt1 = xt
    for i in range(len(xt)):
        # Likelihood w.r.t. x_i^(t)
        L1 = (Y[i,:] + xt[i] + xt)%2
        L1 = L1*mask_matrix[i,:]
        # Likelihood w.r.t. x_i^(t)+1
        L2 = (Y[i,:] + xt[i] + 1 + xt)%2
        L2 = L2*mask_matrix[i,:]
        # Keep x_i^(t) if the first sum is smaller; otherwise flip it
        xt1[i] = xt[i] if sum(L1) < sum(L2) else (xt[i] + 1) % 2
    xt = xt1
Performances of the spectral algorithm vs local refinement We compare the Pearson correlations of the spectral algorithm and the two-step approach with local refinement, for a setting where n = 4000 and θ = 0.1. We set the number of iterations in the local refinement step to 9, since it is close to the suggested number (log 4000 ≈ 8.294). Here is the code for the simulation.

from scipy.stats import bernoulli
from scipy.stats import pearsonr
import numpy as np
import matplotlib.pyplot as plt
n = 4000  # number of users
Bern = bernoulli(0.5)
# Generate n community memberships
x = Bern.rvs(n)
# Construct the codebook
X = np.zeros((n,n))
for i in range(len(x)):
    for j in range(i,len(x)):
        # Compute xij = xi + xj (modulo 2)
        X[i,j] = (x[i]+x[j]) % 2
        # Symmetric component
        X[j,i] = X[i,j]
# Construct an observation matrix
theta = 0.1  # noise flipping error rate
noise_Bern = bernoulli(theta)
noise_matrix = noise_Bern.rvs((n,n))
Y = (X + noise_matrix)%2
p = np.linspace(0.001,0.0065,30)
limit = 1/(1-np.sqrt(4*theta*(1-theta)))*np.log(n)/n
p_norm = p/limit
corr1 = np.zeros_like(p)
corr2 = np.zeros_like(p)
for i, val in enumerate(p):
    obs_bern = bernoulli(val)
    # Construct an n-by-n mask matrix:
    # entry = 1 (observed); 0 (otherwise)
    mask_matrix = obs_bern.rvs((n,n))
    ##################################
    ######## Spectral algorithm ######
    ##################################
    # Construct the adjacency matrix
    A = (1-2*Y)*mask_matrix
    # Power method
    v1 = power_method(A)
    # Threshold the principal eigenvector
    v1 = np.sign(v1)
    ##################################
    ######## Local refinement ########
    ##################################
    # initial estimate (from the spectral algorithm)
    x0 = (1 - v1)//2
    xt = x0
    # number of iterations
    ITER = 9
    for t in range(1,ITER):
        xt1 = xt
        # coordinate-wise MLE
        for k in range(len(xt)):
            # Likelihood w.r.t. x_k^(t)
            L1 = (Y[k,:] + xt[k] + xt)%2
            L1 = L1*mask_matrix[k,:]
            # Likelihood w.r.t. x_k^(t)+1
            L2 = (Y[k,:] + xt[k] + 1 + xt)%2
            L2 = L2*mask_matrix[k,:]
            # Keep x_k^(t) if the first sum is smaller; otherwise flip it
            xt1[k] = xt[k] if sum(L1) < sum(L2) else (xt[k] + 1) % 2
        xt = xt1
    # Compute the ground truth
    ground_truth = 1-2*x
    # Compute Pearson correlation btw estimate & ground truth
    corr1[i] = np.abs(pearsonr(ground_truth,v1)[0])
    corr2[i] = np.abs(pearsonr(ground_truth,1-2*xt)[0])
    print(p_norm[i], corr1[i], corr2[i])
plt.figure(figsize=(5,5), dpi=200)
plt.plot(p_norm, corr1, label='spectral algorithm')
plt.plot(p_norm, corr2, label='local refinement')
plt.title('Pearson correlation btw estimate and ground truth')
plt.legend()
plt.grid(linestyle=':', linewidth=0.5)
plt.show()
We see in Fig. 3.19 that the additional step of local refinement helps improve the performance.
Fig. 3.19 Pearson correlation performances of the spectral algorithm and local refinement: n = 4000 and θ = 0.1
Look ahead So far, we have investigated two applications in data science where the ML principle is vital and phase transition arises, akin to the communication problem. At the beginning of Part III, we also emphasized the importance of the Viterbi algorithm, another communication principle, in the domain of data science. In the following section, we will examine a specific use case of this algorithm: speech recognition.
Problem Set 8

Problem 8.1 (Markov's inequality) Consider a non-negative random variable X and a positive value a > 0.
(a) Show that a · 1(X ≥ a) ≤ X. Here 1(·) denotes the indicator function which returns 1 when the event (·) is true and 0 otherwise.
(b) Using part (a), prove Markov's inequality:

P(X ≥ a) ≤ E[X] / a.
Problem 8.2 (Chebyshev's inequality) Let X be a discrete random variable with mean μ and finite variance σ² < ∞. Show that for any t > 0,

P(|X − μ| ≥ t) ≤ σ² / t².
Problem 8.3 (Weak Law of Large Numbers) Let Y_1, Y_2, . . . be an i.i.d. discrete random process with mean μ and variance σ² < ∞. Let

S_n := (Y_1 + Y_2 + · · · + Y_n) / n.   (3.36)

For any ε > 0, show that

P(|S_n − E[Y]| ≤ ε) → 1   as n → ∞,   (3.37)

i.e., the sample mean S_n is within E[Y] ± ε with high probability as n tends to infinity.
Note 1: There is a terminology which indicates the convergence as above. It is called convergence in probability. Hence, the statement is simply rephrased as "S_n converges to E[Y] in probability."
Note 2: The above convergence w.r.t. the empirical sample mean S_n is named the Weak Law of Large Numbers, WLLN for short.
Problem 8.4 (Chernoff bound) (a) Show that for a random variable X and some constant a,

P(X > a) ≤ min_{λ>0} E[e^{λX}] / e^{λa}.
(b) Suppose X_1, . . . , X_n are i.i.d., each being distributed according to Bern(m). Here Bern(m) indicates the probability distribution of a binary random variable, say X, with P(X = 1) = m. Fix δ > 0. Show that

P(X_1 + · · · + X_n ≥ n(m + δ)) ≤ e^{−n D(m+δ‖m)}

where D(m+δ‖m) denotes the Kullback-Leibler (KL) divergence between Bern(m + δ) and Bern(m), defined w.r.t. the natural logarithm (with base e):

D(m+δ‖m) := (m + δ) log((m + δ)/m) + (1 − m − δ) log((1 − m − δ)/(1 − m)).
Problem 8.5 (Noisy measurements) Consider the same community detection problem setup as in Problem 7.2, except that the measurements are now noisy versions:

y_ij = x_i ⊕ x_j ⊕ z_ij   w.p. p;
y_ij = erasure            w.p. 1 − p

where the z_ij are i.i.d. over (i, j), each being distributed according to Bern(θ), and θ is a constant in (0, 1/2). Show that if p > (log n / n) · 1/(1 − e^{−D(0.5‖θ)}), then P_e can be made arbitrarily close to 0 as n tends to infinity. Here D(·‖·) indicates the Kullback-Leibler divergence defined w.r.t. the natural logarithm:

D(0.5‖θ) := 0.5 log(0.5/θ) + 0.5 log(0.5/(1 − θ)).
Hint: Use the ML decoding rule and the Chernoff bound.
Problem 8.6 (Generalized Chernoff bound) Suppose we observe n i.i.d. discrete random variables Y^n := (Y_1, . . . , Y_n). Consider two hypotheses H_0: Y_i ∼ P_0(y); and H_1: Y_i ∼ P_1(y) for y ∈ Y. Define the Chernoff information as:

D* := − min_{0≤λ≤1} log { Σ_{y∈Y} P_0(y)^λ P_1(y)^{1−λ} }.

Let L_k be the likelihood function w.r.t. H_k: L_k = P(Y^n | H_k).
(a) Show that for a, b > 0 and λ ∈ [0, 1], min{a, b} ≤ a^λ b^{1−λ}.
(b) Let A := {y^n : P(y^n | H_1) ≥ P(y^n | H_0)}. Show that
P(L_1 ≥ L_0 | H_0) = Σ_{y^n ∈ A} Π_{i=1}^{n} P_0(y_i).
(c) Using parts (a) and (b), show that

P(L_1 ≥ L_0 | H_0) ≤ e^{−n D*}.   (3.38)
Problem 8.7 (A generalized model for community detection) Suppose there are n users clustered into two communities. Let x_i ∈ {0, 1} indicate a community membership with regard to user i ∈ {1, 2, . . . , n}. Assume that we are given part of the comparison pairs:

y_ij ∼ P(y_ij | x_ij)   w.p. p;
y_ij = erasure          w.p. 1 − p

for every pair (i, j) ∈ {(1, 2), (1, 3), . . . , (n − 1, n)} and p ∈ [0, 1]. Whenever an observation is made, y_ij is generated as per:

P(y_ij | x_ij) = P_0(y) if x_ij = 0;   P_1(y) if x_ij = 1.
We also assume that the y_ij's are independent over (i, j). Given the y_ij's, one wishes to decode the community membership vector x := [x_1, x_2, . . . , x_n] or x ⊕ 1 := [x_1 ⊕ 1, x_2 ⊕ 1, . . . , x_n ⊕ 1]. Let x̂ be a decoded vector. Define the probability of error as P_e = P(x̂ ∉ {x, x ⊕ 1}). Suppose we employ the optimal ML decoding rule:

x̂_ML = arg max_x L(x)
where L(x) := P(Y|X(x)) indicates the likelihood function w.r.t. x.
(a) Show that

P_e ≤ Σ_{k=1}^{n−1} (n choose k) P(x̂_ML = a_k | x = 0)

where a_k = [1, 1, . . . , 1, 0, 0, . . . , 0] contains k ones followed by n − k zeros.
(b) Let M be the number of observations made in the distinguishable positions between X(a_k) and X(0). Show that

P(x̂_ML = a_k | x = 0) ≤ Σ_{ℓ=0}^{k(n−k)} (k(n−k) choose ℓ) p^ℓ (1 − p)^{k(n−k)−ℓ} × P(L(a_k) ≥ L(0) | x = 0, M = ℓ).
(c) Using the above and part (c) in Problem 8.6, show that

P_e ≤ Σ_{k=1}^{n−1} (n choose k) (1 − p(1 − e^{−D*}))^{k(n−k)}

where D* denotes the Chernoff information:

D* := − min_{0≤λ≤1} log { Σ_{y∈Y} P_0(y)^λ P_1(y)^{1−λ} }.
(d) Using the above and Problem 7.3, show that if p > (log n / n) · 1/(1 − e^{−D*}), then P_e can be made arbitrarily close to 0 as n → ∞.
Problem 8.8 (True or False?)
(a) Suppose that X_1 ∼ Bern(p), X_2 ∼ Bern(1/2) and S = X_1 ⊕ X_2. Then, S follows Bern(1/2) no matter what p is.
(b) Let X = [X_1, X_2, . . . , X_n]^T be an i.i.d. random vector, each component being distributed according to Bern(1/2). Let A be an n-by-n full-rank matrix with A_ij ∈ {0, 1} entries. Let Y = AX, i.e., Y_i = Σ_{j=1}^{n} A_ij X_j. Here the summation is modulo-2 addition. Then, the Y_i's are i.i.d.
(c) Let X_1 and X_2 be independent random variables, each being distributed according to Bern(1/2). Suppose we observe

(Y_1, Y_2) = (X_1 ⊕ Z_1, X_2 ⊕ Z_2)           w.p. 1 − α;
(Y_1, Y_2) = (X_1 ⊕ 1 ⊕ Z_1, X_2 ⊕ 1 ⊕ Z_2)   w.p. α

where Z_1 and Z_2 are independent random variables ∼ Bern(q), being also independent of (X_1, X_2). Then, Y_1 ⊕ Y_2 is a sufficient statistic w.r.t. (X_1, X_2).
3.7 Speech Recognition as a Communication Problem
Recap In Part III, we have investigated two data science applications thus far: community detection and Haplotype phasing. We have placed significant emphasis on two communication principles: (i) the ML principle and (ii) the concept of phase transition. Additionally, we discussed the complexity issue of the ML decoding rule concerning the two problems. We then studied efficient algorithms that achieve nearly optimal ML performance while having much lower complexities. In Part II, we highlighted that the ML decoding rule in ISI channels also faces computational challenges. However, the Viterbi algorithm resolved this issue, implementing the ML solution exactly. For the third application, we will explore another data science problem that resembles the communication problem in ISI channels: speech recognition.
Outline In the next three sections, we will first demonstrate that speech recognition is an inference problem, much like a communication problem. We will then derive the optimal solution based on this premise and show that the Viterbi algorithm is well applicable to implementing it. This section will concentrate on the first: demonstrating that speech recognition is indeed an inference problem. This section will comprise four parts. Initially, we will figure out what a speech recognition system is. Next, we will identify the system’s input and output. Then, we will explore the probabilistic relationship between the input and the output, establishing that speech recognition is indeed an inference problem. Finally, we will discuss the optimal solution and draw an analogy to a communication problem.
Speech recognition system Speech recognition is a popular application that we encounter in our daily lives. Siri on iPhones is a prime example of this, as it aims to recognize spoken words. We speak into the iPhone's microphone, which samples the analog sound waveform. The iPhone then attempts to comprehend what we said. The objective of speech recognition is to transform the analog waveform, containing spoken words, into a written command that can be represented as text without losing the words' meaning. Therefore, the goal of speech recognition is to decode text from the output, which bears on the spoken words. In line with our earlier discussion, we will demonstrate that speech recognition can be viewed as an inference problem, where we decode the input x (text) from the output y. For a detailed illustration,
refer to Fig. 3.20. In this figure, the input x refers to the text we aim to recognize in the speech recognition block. The text’s components and appearance depend on the spoken language, and for our purposes, we will examine the English language. The structure of an English text A text is a collection of words that can be broken down into smaller units. However, there is an issue with using English alphabets as the basic unit, as the mapping between sound and alphabet is not one-to-one. For example, the phoneme /i/ can correspond to both “e” in “nike” and “i” in “bit.” Therefore, it is more appropriate to decompose words into phonemes, which are the smallest phonetic units in a language. English has 44 phonemes, consisting of 24 consonants and 20 vowels, and speech recognition involves determining the sequence of phonemes from the input sound waveform. Thus, the sequence of phonemes can be considered as uncertain quantities or random variables, denoted as x[m] ∈ X , where X is the set of phonemes with an alphabet size of |X | = 44. Output in speech recognition systems Let us determine what the output represents in the speech recognition system. The output y is the result that is fed into the speech recognition block, and it should reflect something related to actual sound. To understand this, we need to relate x to the actual sound signal that will be captured by the microphone in the system. A user who wants to say the text x speaks the corresponding words into the microphone, creating the analog waveform, denoted
Fig. 3.20 Viewing speech recognition as an inference problem
Fig. 3.21 44 phonemes in English: (i) 24 consonants; and (ii) 20 vowels
Fig. 3.22 The entire system diagram of speech recognition
as y(t). This analog waveform is then transformed into a sequence y of discrete-time signals using the following procedure. Each phoneme lasts approximately 10 ms. We segment the analog waveform into 10 ms intervals, and then extract some critical features from the signal in each interval. It is commonly known that the relevant information contained in speech is most evident in the frequency domain. Therefore, typically, a short window Fourier analysis is applied to the sampled signal from each 10 ms time interval and the corresponding Fourier coefficients are extracted. The coefficients obtained from the Fourier analysis provide spectral information that describes the phoneme in the 10 ms interval. Usually, there are multiple Fourier coefficients for the signal in each interval. However, for the sake of simplicity, let us assume that there is only one spectral component, which we refer to as a feature. The mth feature corresponds to the mth phoneme and is denoted by y[m]. For a complete illustration of this procedure, refer to Fig. 3.22.

Relation between x and y In what way are x and y connected? Given the numerous sources of randomness in the system, two significant contributors of the randomness are: (1) voice characteristics (such as accent) and (2) noise (such as thermal noise caused by random movements of electrons in the electrical circuit). There is a significant degree of uncertainty associated with y, linking the input and output in a probabilistic manner. This uncertainty induces the notion that speech recognition can be considered an inference problem.

The optimal decoder An important question that arises is what algorithm for speech recognition can decode x from y in the best way possible. One way to define the optimality of an algorithm is by measuring the probability of error P_e := P(x̂ ≠ x). In that case, the optimal algorithm can be defined as the one that maximizes the a posteriori probability:

x̂_MAP = arg max_x P(x|y) = arg max_x P(x, y)/f(y) = arg max_x P(x) f(y|x)
where the second equality follows from the definition of conditional probability. One can easily see that to compute the MAP solution, we need to figure out two quantities: (1) the a priori probability on the input, P(x); and (2) the conditional pdf (likelihood function) f(y|x), which captures the relation between the input and the output. The speech recognition system can be likened to a channel, drawing an analogy with a communication problem.

Look ahead The upcoming section will cover obtaining the necessary information to compute the MAP solution. Interestingly, there exists a similarity between speech recognition and communication problems, enabling the use of the Viterbi algorithm for efficiently computing the MAP solution.
3.8 Speech Recognition: Statistical Modeling
Recap In the preceding section, it was demonstrated that speech recognition can be cast into an inference problem that involves decoding a sequence x of phonemes (to recognize what is being said) from a sequence y of features (related to actual spoken words). The system's randomness due to speaker voice characteristics and system noise such as thermal noise introduces uncertainty in the input-output relationship, establishing the probabilistic association between them. This highlights the fact that speech recognition is an inference problem. The optimal MAP solution was then derived as:

x̂_MAP = arg max_x P(x) f(y|x).   (3.39)
Here the optimality is w.r.t. maximizing reliability, quantified as P(ˆx = x).
Outline In this section, our focus is on obtaining two crucial quantities that are needed to compute the MAP solution: (1) a priori knowledge represented by P(x); and (2) the likelihood function f(y|x). This section consists of four parts. Firstly, we will examine a simple random process model that provides a concise representation of the a priori probability. Secondly, using this model, we will express P(x) in straightforward terms. Thirdly, we will study another random process model that serves as the foundation for modeling the likelihood function. Finally, we will utilize the model to represent the likelihood function. With these two quantities, we will be able to compute the MAP solution in the next section.
Fig. 3.23 A block diagram of the speech recognition system and recovery block
A sequence of x[m]'s (random process) A naive approach to obtain P(x) is investigating all the possible patterns of x and then figuring out the statistical characteristics, e.g., histogram. However, this simple way comes with a challenge in computation. The reason is that the number of possible patterns of the sequence x grows exponentially with n:

|X|^n = 44^n   (3.40)
where X denotes a set of all the phonemes that each x[m] can take on, usually called the alphabet. The equality is derived from the fact that English language has a total of 44 phonemes, including 24 consonants and 20 vowels. However, for a large value of n, obtaining the a priori knowledge requires a huge number of probability quantities, which makes it challenging. Fortunately, there is a statistical structure of the random process that governs the sequence of phonemes, allowing us to model it as a simple random process with a smaller number of parameters. A naive approach is to assume that the random variables are independent, as seen in the case of additive white Gaussian noise. However, this assumption is not appropriate for speech recognition since certain phonemes are more likely to follow others. For example, the phoneme /th/ is more likely to be followed by /e/ than by /s/ due to the frequency of occurrence of the word “the” in text. A simple dependency model: A Markov process To capture the dependency between phonemes, we can use a simple dependency model that accounts for the dependence of x[m] on its preceding random variable x[m − 1] This means that P(x[m]|x[m − 1], x[m − 2], . . . , x[0]) = P(x[m]|x[m − 1]).
(3.41)
The interpretation is that if we know the current state, say x[m − 1], then the future state x[m] and the past states (x[m − 2], . . . , x[0]) are conditionally independent. This type of random process was extensively studied by Andrey Markov, a renowned Russian mathematician. Hence, it is known as the Markov property, and a process that possesses this property is called a Markov process (Gallager, 2013). However, the dependency among phonemes in reality is more intricate than what is imposed by the simple Markov property. To properly capture the dependency of a realistic sequence of phonemes, we need to use a generalized Markov process that is characterized by P(x[m]|x[m − 1], . . . , x[m − ], x[m − − 1], . . . , x[0]) = P(x[m]|x[m − 1], . . . , x[m − ]). It can be observed that the dependence of x[m] on the past is now through a possibly larger number of past states, which are referred to as recent states, denoted as x[m − 1], x[m − 2]..., x[m − ]. This is a generalized Markov process and includes the
Fig. 3.24 Portrait of Andrey Markov. Andrey Markov is a Russian mathematician who made important achievements w.r.t. the random process (later named a Markov process) that satisfies a simple dependency property reflected in (3.41)
Markov process as a special case for = 1. To simplicity, we will assume that the sequence of phonemes is a single-memory Markov process with = 1. A graphical model (Koller & Friedman, 2009) The Markov property is often represented by a graphical model in statistics and machine learning, which provides a visual representation of the relationship between random variables. This graphical model is not specific to the Markov process, but can be applied to any random process (X 1 , X 2 , . . . , X n ). The graphical model includes nodes, representing random variables, and edges, representing the dependencies between pairs of random variables. The graph can be interpreted as follows: if a node X i is removed and the remaining nodes can be split into two subgraphs G1 and G2 , then the random variables in G1 and G2 are independent conditional on X i . For example, for the process (X 1 , X 2 , X 3 ) with joint probability P(X 1 , X 2 , X 3 ) = P(X 1 )P(X 2 |X 1 )P(X 3 |X 2 ), the graph shows that P(X 3 |X 2 , X 1 ) = P(X 3 |X 2 ), meaning that X 1 and X 3 are independent given X 2 . So the graph is illustrated as: X 1 − X 2 − X 3.
(3.42)
Note that the removal of X 2 disconnects X 1 and X 3 . A priori probability P(x) Going back to the sequence of phonemes, recall we assume that x[m] is independent of all the past given x[m − 1]. In this case, one can obtain the graphical model as: x[0] − x[1] − x[2] − · · · − x[n − 2] − x[n − 1].
(3.43)
This model is called a Markov model. Or it is called a Markov chain as it looks like a chain. Using this statistical structure, one can write down the joint distribution as:
P(x) = P(x[0], x[1], x[2], . . . , x[n − 1])
     (a)= P(x[0]) P(x[1]|x[0]) P(x[2]|x[1], x[0]) · · · P(x[n − 1]|x[n − 2], . . . , x[0])
     (b)= P(x[0]) P(x[1]|x[0]) P(x[2]|x[1]) · · · P(x[n − 1]|x[n − 2])
        = P(x[0]) Π_{m=1}^{n−1} P(x[m]|x[m − 1])   (3.44)

where (a) is due to the definition of conditional probability; and (b) comes from the Markov property. In order to compute P(x), it suffices to know only P(x[0]) and P(x[m]|x[m − 1]). So we need to specify only 44 values for P(x[0]) and 44² values for P(x[m]|x[m − 1]). The sum 44 + 44² is much smaller than the huge number 44^n required to specify the joint distribution when the statistical structure is not exploited. In practice, one can estimate the individual pmf P(x[0]) and the transition probability P(x[m]|x[m − 1]) from any large text by computing the following averages:
P(s) = P(x[0] = s) ≈ (# of occurrences of "s") / (# of phonemes in the text of interest);   (3.45)
P(t|s) = P(x[m] = t | x[m − 1] = s) ≈ (# of occurrences of "t" that follow "s") / (# of occurrences of "s").   (3.46)
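As a quick illustration of the estimates (3.45) and (3.46), below is a minimal Python sketch (not from the book's code). A tiny character-level toy string stands in for a phoneme-labeled text; the counting logic is the same for any finite alphabet, and all names and the toy corpus are made up for illustration.

from collections import Counter, defaultdict

text = "the theory of the markov chain"   # hypothetical toy corpus
symbols = list(text)

# P(s): empirical frequency of each symbol, as in (3.45)
counts = Counter(symbols)
total = len(symbols)
P_s = {s: c / total for s, c in counts.items()}

# P(t|s): empirical transition probability, as in (3.46)
pair_counts = defaultdict(Counter)
for s, t in zip(symbols[:-1], symbols[1:]):
    pair_counts[s][t] += 1
P_ts = {s: {t: c / sum(cnt.values()) for t, c in cnt.items()}
        for s, cnt in pair_counts.items()}

print(P_s['t'])        # estimate of P(x[0] = 't')
print(P_ts['t']['h'])  # estimate of P(x[m] = 'h' | x[m-1] = 't')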
By the law of large numbers (refer to Problem 8.3), the estimate for P(s) becomes concentrated around the ground-truth distribution as the number of phonemes in the text increases. It turns out that the law of large numbers concerning the i.i.d. random process can be extended to the Markov process as well. Hence, the estimate for P(t|s) also converges to the ground truth distribution in the limit. Likelihood function f (y|x) Let us figure out how to obtain the knowledge of the likelihood function. We find that a key observation on the system enables us to identify the statistical structure of the random process y, thus providing a concrete way of computing f (y|x). Here the key observation is that y[m] can be viewed as a noisy version of x[m] and the noise has nothing to do with any other random variables involved in the system. The mathematical representation of the observation is that given x[m], y[m] is statistically independent of all the other random variables: e.g., y[1]⊥(x[0], x[1], . . . , x[n − 1], y[0])|x[1] This property leads to the graphical model for y as follows. See Fig. 3.25. Notice that if we remove the node x[m], the node y[m] will be disconnected from the rest of the graph. This reflects the fact that y[m] depends on other random variables only through x[m]. One interesting question about the model: Is the obser-
Fig. 3.25 A Hidden Markov Model (HMM) for the output y of the speech recognition system
vation sequence y[0], . . . , y[n − 1] a Markov model? No. Why? But the underlying sequence that we want to figure out is a Markov model. That is why this is called the Hidden Markov Model, HMM for short (Rabiner & Juang, 1986). Using this statistical property, one can write down the conditional distribution as:

f(y|x) = f(y[0], y[1], . . . , y[n − 1] | x[0], x[1], . . . , x[n − 1])
       = f(y[0] | x[0], x[1], . . . , x[n − 1]) × f(y[1], . . . , y[n − 1] | x[0], x[1], . . . , x[n − 1], y[0])
    (a)= f(y[0]|x[0]) f(y[1], . . . , y[n − 1] | x[0], x[1], . . . , x[n − 1], y[0])
       ...
    (b)= f(y[0]|x[0]) f(y[1]|x[1]) · · · f(y[n − 1]|x[n − 1])
       = Π_{m=0}^{n−1} f(y[m]|x[m])   (3.47)
where (a) and (b) are because given x[m], y[m] is independent of everything else for any m. Note that in order to compute f (y|x), it suffices to know only f (y[m]|x[m]). How can we obtain the knowledge of the individual likelihood function f (y[m]|x[m])? If the mth feature y[m] is a discrete value, then we need to specify only the number |X | × |Y| of possible values for f (y[m]|x[m]). But the value of the feature is in general continuous although it can be quantized so that it can be represented by a discrete random variable. So we need to figure out the functional relationship between y[m] and x[m] to specify the likelihood function. Look ahead There exists a method for estimating f (y[m]|x[m]). The following section will introduce us to this technique, which we will subsequently utilize to calculate the MAP solution.
3.9 Speech Recognition: The Viterbi Algorithm
Recap Over the previous two sections, we have modeled the speech recognition problem, as illustrated in Fig. 3.26. Let x[m] be the mth phoneme, and let y[m] represent the corresponding characteristic extracted from the recorded sound. In general, each y[m] is complicated and multi-dimensional. For instance, y[m] could be a signal’s Fourier coefficient vector. However, for the sake of simplicity, we assumed that y[m] is a single continuous random variable. Subsequently, we derived the optimal MAP solution associated with the inference problem: xˆ MAP = arg max P(x) f (y|x).
(3.48)
x
In the previous section, we have shown that the relation between the input and the output can be represented by the graphical model as illustrated in Fig. 3.27. Using this property, we could then identify the statistical structure of x and y:

P(x) = P(x[0]) Π_{m=1}^{n−1} P(x[m]|x[m − 1]);
f(y|x) = Π_{m=0}^{n−1} f(y[m]|x[m]).   (3.49)
m=0
We also learned how to estimate P(x[0]) and P(x[m]|x[m − 1]) from a sample text containing sufficiently many words. But for f (y[m]|x[m]), we just claimed that there is a way to estimate.
Fig. 3.26 A block diagram of the speech recognition system and recovery block
Fig. 3.27 A graphical model for the speech recognition system: a hidden Markov model (HMM)
Outline This section will introduce us to the technique for estimating the likelihood f (y[m]|x[m]). Prior to that, we will finish the computation of the MAP solution (3.48), assuming knowledge of f (y[m]|x[m]) together with P(x[0]) and P(x[m]|x[m − 1]). For this purpose, we will first demonstrate the equivalence to the communication problem for the ISI channel. Subsequently, we will establish that the Viterbi algorithm can act as a proficient approach to determining the MAP solution. The Viterbi algorithm Using (3.49) and defining P(x[0]|x[−1] = ∅) := P(x[0]), we can write down the objective function in (3.48) as: P(x) f (y|x) = P(x[0])
n−1 & m=1
=
n−1 &
P(x[m]|x[m − 1])
m=0
=
n−1 &
n−1 &
P(x[m]|x[m − 1])
f (y[m]|x[m])
m=0 n−1 &
f (y[m]|x[m])
(3.50)
m=0
P(x[m]|x[m − 1]) f (y[m]|x[m]).
m=0
Taking a logarithm function at both sides, we obtain: log (P(x) f (y|x)) =
n−1
log {P(x[m]|x[m − 1]) f (y[m]|x[m])} .
(3.51)
m=0
Using the fact that log(·) is a non-decreasing function, we can obtain another optimization problem equivalent to (3.48): xˆ MAP
' 1 = arg min log . x P(x[m]|x[m − 1]) f (y[m]|x[m]) m=0 n−1
(3.52)
Letting s[m] = [x[m − 1], x[m]]T , one can view the term inside the log function in (3.52) as a function depending solely on m and s[m]. Note that the feature y[m] is given, and the transition probability P(·|·) and the likelihood function f (y[m]|x[m]) are assumed to be known. In fact, we already learned how to estimate the transition probability, and we are planning to study how to obtain the knowledge of the likelihood function later. This interpretation motivates us to define the cost function as: ' 1 . (3.53) cm (s[m]) := log P(x[m]|x[m − 1]) f (y[m]|x[m])
3.9 Speech Recognition: The Viterbi Algorithm
217
Noting the one-to-one mapping relation between x and s := (s[0], s[1], . . . , s[n − 1]), we can translate the problem (3.52) into the one that finds the sequence s of states: sˆMAP = arg min s
n−1
cm (s[m]).
(3.54)
m=0
The key observation is that this problem is exactly the same as that we encountered in the two-tap ISI communication problem. The only difference is that in the communication problem, the cost function is defined in a different manner:

c_m(s[m]) := (y[m] − h^T s[m])²   (the cost in the communication problem).
Recall that there exists a fascinating, low-complexity algorithm capable of efficiently resolving the optimization problem of the form (3.54). That is, the Viterbi algorithm. As such, we can simply apply the Viterbi algorithm to determine sˆMAP . It is worth noting that the complexity of the Viterbi algorithm is approximately proportional to the number of states multiplied by n. In the context of speech recognition, each x[m] can assume 44 possible phonemes, thus yielding 442 possible states for s[m]. Accordingly, the complexity of the Viterbi algorithm concerning the speech recognition problem is: ∼ (# of states) × n = 442 n.
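To make the discussion concrete, here is a minimal Python sketch (not provided in the book) of the Viterbi recursion applied to the cost (3.53), i.e., c_m(s[m]) = −log(P(x[m]|x[m−1]) f(y[m]|x[m])). It assumes that the initial probabilities, the transition probabilities, and the per-frame likelihoods have already been estimated; the alphabet is shrunk to three symbols and all numbers are made-up values so the example stays small.

import numpy as np

def viterbi(init, trans, lik):
    # Minimize the additive cost c_m(s[m]) = -log(P(x[m]|x[m-1]) f(y[m]|x[m])).
    # init : (K,)   a priori probabilities P(x[0])
    # trans: (K, K) transition probabilities P(x[m]=j | x[m-1]=i)
    # lik  : (n, K) per-frame likelihoods f(y[m] | x[m]=k)
    n, K = lik.shape
    cost = -np.log(init) - np.log(lik[0])   # best cost ending in each state at m = 0
    back = np.zeros((n, K), dtype=int)      # backpointers
    for m in range(1, n):
        # candidate[i, j]: cost of being in state i at m-1 and moving to j at m
        candidate = cost[:, None] - np.log(trans) - np.log(lik[m])[None, :]
        back[m] = np.argmin(candidate, axis=0)
        cost = np.min(candidate, axis=0)
    # trace back the minimum-cost (i.e., MAP) state sequence
    path = [int(np.argmin(cost))]
    for m in range(n - 1, 0, -1):
        path.append(int(back[m, path[-1]]))
    return path[::-1]

# toy example with K = 3 "phonemes" and n = 4 frames (all numbers hypothetical)
init = np.array([0.5, 0.3, 0.2])
trans = np.array([[0.6, 0.3, 0.1],
                  [0.2, 0.6, 0.2],
                  [0.3, 0.3, 0.4]])
lik = np.array([[0.7, 0.2, 0.1],
                [0.1, 0.8, 0.1],
                [0.1, 0.7, 0.2],
                [0.2, 0.2, 0.6]])
print(viterbi(init, trans, lik))   # prints the MAP phoneme-index sequence

Note that the per-step work grows with the square of the alphabet size, consistent with the ∼ 44² n complexity noted above when the alphabet is the 44 phonemes.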
Statistical modeling for y[m]’s Thus far, we have presumed that we possess knowledge regarding the likelihood function f (y[m]|x[m]). We will now illustrate how to estimate it. It’s crucial to bear in mind that the likelihood function relies on randomness that occurs in the system due to distinct voice characteristics and noise, as depicted in Fig. 3.28. If the system contains no noise, and a speaker is given, we can perceive y[m] as a deterministic function of x[m]. For instance, given x[m] = /a/, y[m] would be a deterministic function of /a/, say μa . Nonetheless, due to noise within the system, the feature y[m] would be distributed randomly around the authentic feature μa that
x[m]
Fig. 3.28 Inside of the speech recognition system
feature extraction
y[m]
corresponds to the phoneme /a/. It’s noteworthy that the primary source of this noise is the arbitrary movement of electrons caused by heat, known as thermal noise. In Part I, we learned that the thermal noise can be represented as an additive white Gaussian noise (AWGN). Hence, given x[m] = /a/, we can model y[m] as: y[m] = μa + w[m]
(3.55)
where w[m]’s are i.i.d. ∼ N (0, σ 2 ). Similarly given x[m] = /b/, y[m] can be modeled as the true feature concerning /b/, say μb , plus an additive Gaussian noise. For simplicity, we will assume that we know the variance of the additive noise σ 2 . Actually there are a variety of ways to estimate σ 2 . One simple way is to infer the variance from multiple measurements y[m]’s fed by null signals. More concretely, suppose x[m] = ∅ for m ∈ {0, 1, . . . , n − 1}. Then, under the statistical modeling on y[m] as above, we get: y[m] = w[m], m ∈ {0, 1, . . . , n − 1}.
(3.56)
The empirical mean of the y²[m]'s can then be used as a good surrogate of σ²:

(y²[0] + y²[1] + · · · + y²[n − 1]) / n ≈ σ².   (3.57)
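A small numerical check of (3.56)-(3.57), assuming the additive noise is Gaussian as modeled above; the true σ below is an arbitrary value chosen only for illustration.

import numpy as np

# Feed null signals so that y[m] = w[m], then estimate the noise variance
# by the empirical mean of y^2[m], as in (3.57).
rng = np.random.default_rng(0)
sigma = 0.5
n = 100000
y = sigma * rng.standard_normal(n)   # y[m] = w[m], w[m] ~ N(0, sigma^2)

sigma2_hat = np.mean(y**2)           # empirical surrogate of sigma^2
print(sigma2_hat, sigma**2)          # the two numbers should be close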
For a large value of n, the empirical mean is expected to concentrate around the variance, and especially as n tends to infinity, it does converge to the variance, by the law of large numbers (check the statement in Problem 8.3). Knowing the noise variance is insufficient to determine the likelihood function f (y[m]|x[m]). This is due to the fact that true features such as μa and μb are userspecific parameters, making them akin to random variables. As such, they must be estimated to fully specify the likelihood function. One prevalent technique facilitates this estimation, which we shall explore throughout the remainder of this section. Machine learning for estimating (μa , μb ) Suppose that we possess knowledge of the sequence of phonemes. Although this may seem ridiculous as inferring the phoneme sequence is our ultimate objective, some speech recognition software takes this approach by requesting the speaker to vocalize specific phonemes initially. We can then utilize this sequence to estimate μa and μb by referring to the known phonemes. Note that we acquire the parameters by eliciting input-output example pairs from the speaker initially. This method is, in fact, the basis of machine learning, which we will explore in greater detail in the next section. Here is an elaborate description of how machine learning works. Suppose we instruct the user to vocalize “aaaaaaaa” for eight time slots. This produces: y[m] = μa + w[m], m ∈ {0, 1, . . . , 7}.
(3.58)
Here μ_a is the mean of the samples y[m] since w[m] has mean zero. So one reasonable approach to estimate the parameter μ_a is taking the empirical mean:

μ̂_a = (y[0] + y[1] + · · · + y[7]) / 8.   (3.59)
You may be wondering if this estimate is the best possible. The answer is yes. To understand why, let's consider what it means for an estimate to be optimal. As we learned in Parts I and II, an estimate is considered optimal if it maximizes reliability. However, the conventional definition of reliability P(μ̂_a = μ_a) is not appropriate in this case. The reason is that the definition always yields a reliability of zero, regardless of the estimate. This is because the value of μ_a that we want to estimate is continuous. To resolve this issue, we use what is called an ε-reliability:

ε-reliability := P(μ_a − ε ≤ μ̂_a ≤ μ_a + ε).   (3.60)
Note that this quantity can be strictly positive for ε > 0. One can easily verify that the estimate that maximizes the ε-reliability in the limit of ε → 0 is an MAP estimate:

μ̂_a^MAP = arg max_{μ_a} f_a(μ_a|y) = arg max_{μ_a} f_a(μ_a) f(y|μ_a).   (3.61)
Check in Problem 9.6. Here we encounter a challenge. The challenge is that we know nothing about f_a(μ_a). One reasonable way to proceed in this case is assuming a uniformly distributed density for f_a(μ_a). Making this assumption, we can then simplify the MAP estimate as the ML estimate:

μ̂_a^MAP = arg max_{μ_a} f(y|μ_a) =: μ̂_a^ML.   (3.62)
Now using the fact that y[m] ∼ N(μ_a, σ²), we can compute the ML estimate as follows:

μ̂_a^ML = arg max_{μ_a} f(y|μ_a)
        = arg max_{μ_a} exp( −(1/(2σ²)) Σ_{m=0}^{7} (y[m] − μ_a)² )
        = arg min_{μ_a} Σ_{m=0}^{7} (y[m] − μ_a)².   (3.63)
Taking the derivative of the objective function w.r.t. μa and setting it to 0, we finally get:
μ̂_a^ML = (y[0] + y[1] + · · · + y[7]) / 8
which coincides with the good estimate that we expected earlier. Look ahead So far, we have examined three inference problems in data science, namely community detection, Haplotype phasing, and speech recognition, which demonstrate the effectiveness of communication principles such as the MAP/ML principles, the concept of phase transition, and the Viterbi algorithm. There is another important area within the field of data science that highlights the significance of the MAP/ML principles. That is, machine learning and deep learning. In the upcoming section, we will delve into these fields and uncover how the MAP/ML principles are relevant.
Problem Set 9 Problem 9.1 (An inference problem) Consider a speech recognition system in which the input x = [x[0], x[1], . . . , x[n − 1]]T is a sequence of phonemes and the output y = [y[0], y[1], . . . , y[n − 1]]T is a sequence of features. (a) State the definition of an inference problem. (b) In Sect. 3.7, we learned that speech recognition is an inference problem. Explain why.
Problem 9.2 (A Markov process) (a) State the definition of a Markov process. (b) In Sect. 3.8, we learned that the sequence x := [x[0], x[1], . . . , x[n − 1]]T of phonemes in a text can be modeled by generalizing the concept of a Markov process. Explain why.
Problem 9.3 (Graphical models: Exercise 1) Describe the statistical structure of the joint distribution associated with a graphical model in Fig. 3.29.
Problem 9.4 (Graphical models: Exercise 2) Consider (X 1 , . . . , X 7 ) with the following joint distribution: P(X 1 , . . . , , X 7 ) = P(X 1 )P(X 2 |X 1 )P(X 3 |X 1 )P(X 4 |X 2 )P(X 5 |X 2 )P(X 6 |X 3 )P(X 7 |X 3 ). Draw the associated graphical model.
Problem 9.5 (A Hidden Markov Model) (a) State the definition of a Hidden Markov Model.
Fig. 3.29 A graphical model
X1
X2
X3
X4
X5
X6
(b) In Sect. 3.8, we learned that the sequence y := [y[0], y[1], . . . , y[n − 1]]T of features in the speech recognition system can be modeled as a Hidden Markov Model. Explain why.
Problem 9.6 (The MAP/ML principles) Consider Y_i = X + W_i where X is a continuous random variable with the pdf f(x) and the W_i's are i.i.d. ∼ N(0, σ²), i ∈ {1, . . . , n}. Given (Y_1, Y_2, . . . , Y_n), we wish to estimate X̂. For ε > 0, define the ε-reliability as:

P(X − ε ≤ X̂ ≤ X + ε).   (3.64)

(a) In the limit of ε → 0, derive an estimate X̂ that maximizes the ε-reliability.
(b) Suppose that X ∼ N(0, σ_x²). Derive the MAP estimate X̂_MAP.
(c) Derive the ML estimate X̂_ML.
3.10 Machine Learning: Connection with Communication
Recap So far, we have examined three inference problems: community detection, Haplotype phasing, and speech recognition. In doing so, we highlighted the importance of the MAP/ML principles, phase transition concept, and Viterbi algorithm. Now we will shift our focus to explore the broader context yet within the umbrella of data science. Over the past few decades, there has been a huge surge in research activities in the fields of machine learning and deep learning. Interestingly, it turns out that the MAP/ML principles are crucial in optimizing the design of machine learning models and architectures. The remaining sections of this book will delve deeper into the role of these principles in machine learning.1
Outline This section will focus on exploring how the MAP/ML principles are utilized in machine learning. The section consists of three parts. Firstly, we will define machine learning and its goal. Secondly, we will formulate an optimization problem that allows us to achieve the goal. Finally, we will show how the MAP/ML principles are critical in designing an optimization problem that ensures optimality in a certain sense.
Machine learning Let’s begin by understanding what machine learning is. It involves the use of algorithms, which can be defined as a set of instructions that a computer system can execute. Essentially, machine learning is the study of such algorithms, with the goal of training a computer system to perform a specific task. Figuratively, this can be visualized as shown in Fig. 3.30. In this scenario, the machine we are interested in building is a computer system that functions as a system, with an input (usually denoted as x) and an output (usually denoted as y). The input comprises information that is used to perform the task of interest, while the output reflects the result of the task. For example, if the task at hand is filtering legitimate emails from spam emails, x may include multi-dimensional quantities such as the frequency of specific keywords like dollar signs ($$$) and the frequency of other keywords like “winner”.2 And y could be an email entity, e.g., y = +1 indicates a legitimate email while y = −1 denotes a spam email. In machine learning, such y is called a label. Or if an interested task is cat-vs-dog classification, 1
The upcoming sections of this book are updated from the author’s previous publication (Suh, 2022) but have been customized to suit the specific focus and logical flow of this book. 2 In machine learning, such a quantity is called the feature. This refers to a key component that well describes characteristics of data.
x
computer system (machine)
y
training algorithm (together w/ data) Fig. 3.30 Machine learning is the study of algorithms which provide a set of explicit instructions to a computer system (machine) so that it can perform a specific task of interest. Let x be the input which indicates information employed to perform a task. Let y be the output that denotes a task result Fig. 3.31 Arthur Lee Samuel is an American pioneer in the field of artificial intelligence. One of his well-known works is to develop computer checkers which later formed the basis of AlphaGo Arthur Samuel ’59
checkers
then x could be image-pixel values and y is a binary value indicating whether the fed image is a cat (say y = 1) or a dog (y = 0). The purpose of machine learning is to develop algorithms that train a computer system to perform a task effectively. In this process, data plays a crucial role. Why do we call “Machine Learning”? The name “machine learning” makes sense when viewed from the machine’s perspective, as it suggests that the machine is learning a task from data. This term was introduced by Arthur Lee Samuel in 1959 (Samuel, 1967). See Fig. 3.31. Arthur Samuel was a pioneer in the field of Artificial Intelligence (AI), which encompasses machine learning as a sub-field. AI is concerned with creating intelligence in machines, distinct from the natural intelligence demonstrated by humans and animals. Machine learning is a sub-field of AI that aims to build machines that exhibit human-like behavior. One of Samuel’s early achievements was the development of a computer player for the board game of checkers, which demonstrated human-like playing strategies. During the development of this program, Samuel proposed various algorithms and concepts. It turns out those algorithms could form the basis of AlphaGo (Silver et al., 2016), a computer program for the board game Go
3.10 Machine Learning: Connection with Communication Fig. 3.32 Supervised Learning: Design a computer system (machine) f (·) with the help of a supervisor which offers input-output pair samples, called a training dataset m {(x (i) , y (i) )}i=1
225
machine
which defeated one of the 9-dan professional players, Lee Sedol with 4 wins out of 5 games in 2016 (News, 2016). Mission of machine learning Since machine learning is a sub-field of AI, its endmission is achieving artificial intelligence. So in view of the block diagram in Fig. 3.30, the goal of machine learning is to design an algorithm so that the trained machine behaves like intelligence beings. Supervised learning There exist various methodologies that assist in achieving the goal of machine learning, one of which is particularly prominent and widely used, known as: Super vised Lear ning. Supervised learning refers to the process of learning a function f (·) of a machine with the assistance of a supervisor, who provides input-output samples. The inputoutput samples constitute the data used to train the machine, typically represented by: m {(x (i) , y (i) )}i=1
(3.65)
where (x (i) , y (i) ) indicates the ith input-output sample (or called a training sample or an example) and m denotes the number of samples. Using this notation (3.65), one can say that supervised learning is to: m . Estimate f (·) using the training samples {(x (i) , y (i) )}i=1
(3.66)
Optimization Estimating f (·) is often achieved through optimization. Before we delve into how optimization is used, let us explain what optimization means. Optimization involves selecting a variable that minimizes or maximizes a specific quantity of interest, subject to certain constraints (Boyd & Vandenberghe, 2004; Calafiore & El Ghaoui, 2014). Two key components to this definition are the following. One is
the optimization variable which affects the interested quantity and is subject to our design. This is usually a multi-dimensional quantity. The second is the certain quantity of interest that we wish to minimize (or maximize). This is called the objective function in the field, and an one-dimensional scalar. Objective function To determine the objective function in the supervised learning framework, it is necessary to understand the desired outcome of supervised learning. Considering the goal (3.66), what is desired is: y (i) ≈ f (x (i) ), ∀i ∈ {1, . . . , m}. One question that naturally arises is how to measure the similarity (indicated by “ ≈ ) between y (i) and f (x (i) ). A widely used approach in the field is to use a function, known as a loss function, which is typically represented by: (y (i) , f (x (i) )).
(3.67)
One obvious property that the loss function (·, ·) should have is that it should be small when the two arguments are close, while being zero when the two are identical. Using such a loss function (3.67), one can then formulate an optimization problem as: min f (·)
m
(y (i) , f (x (i) )).
(3.68)
i=1
How to introduce optimization variable? In fact, there is no variable that can be optimized directly. Instead, a different quantity can be optimized: the function f (·) in (3.68). The question is then: how can we introduce an optimization variable? One common approach in the field is to represent the function f (·) with parameters, denoted by w, and then consider these weights as the optimization variable. By doing so, the problem (3.68) can be reformulated as follows: min w
m
(y (i) , f w (x (i) ))
(3.69)
i=1
where f w (x (i) ) denotes the function f (x (i) ) parameterized by w. The above optimization problem depends on how we define the two functions: (i) f w (x (i) ) w.r.t. w; and (ii) the loss function (·, ·). In machine learning, lots of works have been done for the choice of such functions. A choice for f w (·) One architecture was proposed around the same time as the founding of the machine learning field, specifically for the function f w (·) in the context
Fig. 3.33 Portrait of Frank Rosenblatt
Frank Rosenblatt ‘57
of simple binary classifiers where y has only two options, such as y ∈ {−1, +1} or y ∈ {0, 1}. This architecture is called: Per ceptr on, and was invented in 1957 by one of the pioneers in the AI field, named Frank Rosenblatt (Rosenblatt, 1958). Frank Rosenblatt (see Fig. 3.33 for his portrait) was a psychologist who was fascinated by the workings of intelligent beings’ brains. It was his interest in the brain that led him to develop the Perceptron, an architecture that provided significant insights into neural networks, which many of you are familiar with. How brains work The Perceptron architecture was inspired by the structure of the brain. Neurons, which are electrically excitable cells, are a major component of the brain, as shown in Fig. 3.34. The architecture of the Perceptron is based on three key properties of neurons. Firstly, a neuron has an electric charge, or voltage. Secondly, neurons are connected to each other through synapses, which transmit electrical signals between them. The strength of the signal depends on the strength of the synapse. Finally, a neuron generates an all-or-nothing pulse signal called an activation. The magnitude of the signal depends on the voltage level of the neuron. If the voltage level exceeds a certain threshold, the neuron generates an impulse signal with a certain magnitude (say 1), otherwise it produces nothing. Perceptron The three properties mentioned above were the basis for Frank Rosenblatt’s proposal of the Perceptron architecture, as depicted in Fig. 3.35. Let x represent an n-dimensional real-valued signal: x := [x1 , x2 , . . . , xn ]T . Each component xi is distributed to a neuron, and xi represents the voltage level of the ith neuron. The voltage signal xi is transmitted through a synapse to another neuron (located on the right in the figure, denoted by a large circle). The voltage level can either increase or decrease depending on the strength of the synapse’s connectivity. To account for this, a weight wi is multiplied to xi , making wi xi the delivered voltage signal at
the terminal neuron. Based on an empirical observation that the voltage level at the terminal neuron increases with more connected neurons, Rosenblatt introduced an adder that simply combines all the voltage signals originating from many neurons. Consequently, he modeled the voltage signal at the terminal neuron as: w1 x1 + w2 x2 + · · · + wn xn = w T x.
(3.70)
Lastly, in an effort to mimic the activation, he modeled the output signal as

f_w(x) = 1 if w^T x > th, and 0 otherwise.   (3.71)
where "th" indicates a certain threshold level. It can also be simply denoted as

f_w(x) = 1{w^T x > th}   (3.72)
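For concreteness, here is a minimal Python sketch of the Perceptron output (3.70)-(3.72); the weights, input, and threshold are arbitrary illustrative numbers, not values from the book.

import numpy as np

# Weighted sum w^T x followed by a hard threshold, as in (3.70)-(3.72).
def perceptron(w, x, th=0.0):
    v = np.dot(w, x)            # combined voltage at the terminal neuron, (3.70)
    return 1 if v > th else 0   # all-or-nothing activation, (3.71)-(3.72)

w = np.array([0.4, -0.2, 0.7])
x = np.array([1.0, 2.0, 0.5])
print(perceptron(w, x))         # prints 1 since w^T x = 0.35 > 0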
Fig. 3.34 Neurons are electrically excitable cells and are connected through synapses
Fig. 3.35 The architecture of Perceptron
Fig. 3.36 The shape of the logistic function
where 1{·} denotes an indicator function that returns 1 when the event (·) is true; 0 otherwise.

Activation functions Taking the Perceptron architecture in Fig. 3.35, one can formulate the optimization problem (3.69) as:

min_w Σ_{i=1}^{m} ℓ(y^(i), 1{w^T x^(i) > th}).   (3.73)
This is an initial optimization problem proposed in the field, but it was soon discovered that there is a problem with solving it. The issue arises from the fact that the objective function includes an indicator function, making it non-differentiable. This non-differentiability makes the problem challenging to solve, as we will see in a later section. To overcome this problem, one common approach taken in the field is to approximate the activation function. There are various ways to do this, and we will explore one of them below.

Approximate the activation One popular way is to use the following function that makes a smooth transition from 0 to 1:

f_w(x) = 1 / (1 + e^(−w^T x)).   (3.74)
Notice that f w (x) ≈ 0 when w T x is very small; it then grows exponentially with an increase in w T x; later grows logarithmically; and finally saturates as 1 when w T x is very large. See Fig. 3.36 for its shape. Actually the function (3.74) is a very popular one prevalent in statistics, called the logistic3 function (Garnier & Quetelet, 1838). There is another name for the function, which is the sigmoid 4 function. The logistic function has two advantageous properties. Firstly, it is differentiable. Secondly, it can be used as the probability of the output in a binary classifier, for example, P(y = 1), where y represents the true label in the binary classifier. Thus, it is interpretable.
3. The word logistic comes from a Greek word which means a slow growth, like a logarithmic growth.
4. Sigmoid means resembling the lower-case Greek letter sigma, S-shaped.
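With arbitrary illustrative weights, the logistic activation (3.74) can be sketched as follows; unlike the hard threshold, it is differentiable everywhere and its output can be read as a probability. This is a minimal sketch, not the book's code.

import numpy as np

# The logistic (sigmoid) activation (3.74) as a smooth replacement for the
# hard threshold 1{w^T x > th}.
def logistic(w, x):
    return 1.0 / (1.0 + np.exp(-np.dot(w, x)))

w = np.array([0.4, -0.2, 0.7])
x = np.array([1.0, 2.0, 0.5])
p = logistic(w, x)
print(p)                  # about 0.587: interpreted as P(y = 1 | x)
# the gradient of the output w.r.t. w exists in closed form: p * (1 - p) * x
print(p * (1 - p) * x)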
Look ahead If we choose the logistic activation, what is the suitable loss function? The design of an optimal loss function is influenced by the MAP/ML principles. In the subsequent section, we will examine the sense in which it is optimal and how these principles are employed in the design of the optimal loss function.
3.11 Logistic Regression and the ML Principle
Recap In the previous section, we formulated an optimization problem for supervised learning based on the Perceptron architecture: min w
m
(y (i) , yˆ (i) )
(3.75)
i=1
m where {(x (i) , y (i) )}i=1 indicate data-label paired examples; (·, ·) denotes a loss func(i) tion; and yˆ := f w (x (i) ) is the prediction parameterized by the weights w. As an activation function, we considered the logistic function that is widely used in the field:
f w (x) =
1 . 1 + e−w T x
(3.76)
We claimed that the ML principle plays a crucial role in the design of the optimal loss function.
Outline This section will demonstrate the proof of the claim. The section consists of three parts. Firstly, we will examine the concept of the optimal loss function. Secondly, we will explore how the ML principle comes up in designing the optimal loss function. Lastly, we will discover how to solve the optimization problem formulated.
Optimality in a sense of maximizing likelihood A binary classifier with the logistic function (3.76) is called logistic regression. See Fig. 3.37 for illustration. Notice that the output yˆ lies in between 0 and 1: 0 ≤ yˆ ≤ 1. Hence, one can interpret the output as a probability quantity. The optimality of a classifier can be defined under the following assumption inspired by the interpretation: Assumption : yˆ = P(y = 1|x).
(3.77)
Fig. 3.37 Logistic regression
logistic regression
To understand what it means in detail, consider the likelihood of the ground-truth predictor: m m | {x (i) }i=1 P {y (i) }i=1 .
(3.78)
Notice that the classifier output ŷ is a function of weights w. Hence, assuming (3.77), the likelihood (3.78) is also a function of w. We are now ready to define the optimality of w. The optimal weight, say w*, is defined as the one that maximizes the likelihood (3.78):

w* := arg max_w P({y^(i)}_{i=1}^m | {x^(i)}_{i=1}^m).    (3.79)
Similarly the optimal loss function, say ℓ*(·, ·), is defined as the one that satisfies:

arg min_w ∑_{i=1}^m ℓ*(y^(i), ŷ^(i)) = arg max_w P({y^(i)}_{i=1}^m | {x^(i)}_{i=1}^m).    (3.80)
The optimal loss function can be obtained using condition (3.80). This leads to a widely used machine learning classifier called logistic regression, in which the loss function is cross entropy loss:

ℓ*(y, ŷ) = ℓ_CE(y, ŷ) = −y log ŷ − (1 − y) log(1 − ŷ)    (3.81)
where "CE" stands for cross entropy. Below, we will prove this.
Derivation of the optimal loss function ℓ*(·, ·) It is reasonable to assume that samples are independent of each other because they are typically obtained from different data:

{(x^(i), y^(i))}_{i=1}^m are independent over i.    (3.82)

Under this assumption, we can then rewrite the likelihood (3.78) as:
P({y^(i)}_{i=1}^m | {x^(i)}_{i=1}^m) (a)= P({(x^(i), y^(i))}_{i=1}^m) / P({x^(i)}_{i=1}^m)
(b)= ∏_{i=1}^m P(x^(i), y^(i)) / ∏_{i=1}^m P(x^(i))
(c)= ∏_{i=1}^m P(y^(i) | x^(i))    (3.83)
where (a) and (c) are due to the definition of conditional probability; and (b) comes from the independence assumption (3.82). Here P(x^(i), y^(i)) denotes the probability distribution of the input-output pair of the system:

P(x^(i), y^(i)) := P(X = x^(i), Y = y^(i))    (3.84)
where X and Y indicate random variables of the input and the output, respectively. Recall the probability-interpretation-related assumption (3.77) made with regard to ŷ: ŷ = P(y = 1 | x). This implies that:

y = 1: P(y | x) = ŷ;
y = 0: P(y | x) = 1 − ŷ.

Hence, one can represent P(y | x) as:

P(y | x) = ŷ^y (1 − ŷ)^{1−y}.

Now using the notations of (x^(i), y^(i)) and ŷ^(i), we then get:

P(y^(i) | x^(i)) = (ŷ^(i))^{y^(i)} (1 − ŷ^(i))^{1−y^(i)}.

Plugging this into (3.83), we get:

P({y^(i)}_{i=1}^m | {x^(i)}_{i=1}^m) = ∏_{i=1}^m (ŷ^(i))^{y^(i)} (1 − ŷ^(i))^{1−y^(i)}.    (3.85)

This together with (3.79) yields:
w* = arg max_w ∏_{i=1}^m (ŷ^(i))^{y^(i)} (1 − ŷ^(i))^{1−y^(i)}
(a)= arg max_w ∑_{i=1}^m y^(i) log ŷ^(i) + (1 − y^(i)) log(1 − ŷ^(i))
(b)= arg min_w ∑_{i=1}^m −y^(i) log ŷ^(i) − (1 − y^(i)) log(1 − ŷ^(i))    (3.86)
where (a) comes from the fact that log(·) is a non-decreasing function and ∏_{i=1}^m (ŷ^(i))^{y^(i)} (1 − ŷ^(i))^{1−y^(i)} is positive; and (b) is due to changing the sign of the objective function.
In fact, the term inside the summation in the last equality in (3.86) respects the formula of an important notion that arises in the field of information theory: cross entropy. In particular, in the context of a loss function, it is named the cross entropy loss:

ℓ_CE(y, ŷ) := −y log ŷ − (1 − y) log(1 − ŷ).    (3.87)
Hence, the optimal loss function that yields the maximum likelihood is the cross entropy loss: ℓ*(·, ·) = ℓ_CE(·, ·).
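As a quick numerical illustration (not from the original text; the function name cross_entropy_loss is our own), the sketch below evaluates (3.87) and shows that the loss is small when ŷ agrees with the label y and large when it does not:

import numpy as np

def cross_entropy_loss(y, y_hat, eps=1e-12):
    # binary cross entropy of (3.87); eps guards against log(0)
    y_hat = np.clip(y_hat, eps, 1.0 - eps)
    return -y * np.log(y_hat) - (1 - y) * np.log(1 - y_hat)

print(cross_entropy_loss(1, 0.99))  # small loss: confident and correct
print(cross_entropy_loss(1, 0.01))  # large loss: confident but wrong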
Remarks on cross entropy loss (3.87) Before moving onto the next topic about how to solve the optimization problem, let us say a few words about why the loss function (3.87) is called the cross entropy loss. This naming comes from the definition of cross entropy, which is a measure used in information theory. The cross entropy is defined w.r.t. two random variables. For simplicity, let us consider two binary random variables, say X ∼ Bern(p) and Y ∼ Bern(q), where X ∼ Bern(p) indicates a binary random variable with p = P(X = 1). For such two random variables, the cross entropy is defined as (Cover & Joy, 2006):

H(p, q) := −p log q − (1 − p) log(1 − q).    (3.88)
Notice that the formula of (3.88) is exactly the same as that of the loss (3.87), except for having different notations. Hence, (3.87) is called the cross entropy loss. Here you may wonder why H(p, q) in (3.88) is called cross entropy. The rationale comes from the following fact:

H(p, q) ≥ H(p) := −p log p − (1 − p) log(1 − p)    (3.89)
where H(p) is a very well-known quantity in information theory, named entropy or Shannon entropy (Shannon, 2001). One can actually prove the inequality in (3.89) using one very famous inequality, Jensen's inequality. Also one can verify that the equality holds when p = q. We will not prove this here. But you will have a chance to check this in Problems 10.4 and 10.5. So from this, one can interpret H(p, q) as an entropic measure of discrepancy across distributions. Hence, it is called cross entropy.
How to solve logistic regression (3.86)? From (3.86), we can write the optimization problem as:

min_w ∑_{i=1}^m −y^(i) log (1 / (1 + e^{−w^T x^(i)})) − (1 − y^(i)) log (e^{−w^T x^(i)} / (1 + e^{−w^T x^(i)})).    (3.90)
Let J(w) be the normalized version of the objective function:

J(w) := (1/m) ∑_{i=1}^m −y^(i) log ŷ^(i) − (1 − y^(i)) log(1 − ŷ^(i)).    (3.91)
The optimization described above is found to belong to the convex optimization class, which is a crucial category of optimization problems that can be efficiently solved using computers (Boyd & Vandenberghe, 2004). In simple terms, convex optimization refers to a class of problems where the objective function is convex, which means it has a bowl-like shape as depicted in Fig. 3.38. Please refer to the end of this section for the exact definition. Given the definition of convex optimization, J(w) is convex in the optimization variable w. In the remainder of this section, we will establish the convexity of the objective function and then discuss the methods for solving convex optimization.
Proof of convexity One can readily show that convexity is preserved under addition. Why? Think about the definition of convex functions. So it suffices to prove the following two:

(i) −log (1 / (1 + e^{−w^T x})) is convex in w;
(ii) −log (e^{−w^T x} / (1 + e^{−w^T x})) is convex in w.
Since the second function above can be represented as the sum of a linear function and the first function:

−log (e^{−w^T x} / (1 + e^{−w^T x})) = w^T x − log (1 / (1 + e^{−w^T x})),
it suffices to prove the convexity of the first function. Notice that the first function can be rewritten as:

−log (1 / (1 + e^{−w^T x})) = log(1 + e^{−w^T x}).    (3.92)
In fact, proving the convexity of (3.92) is a bit involved if one relies directly on the definition of convex functions. It turns out there is another way to prove it, based on the computation of the second derivative of a function, called the Hessian (Marsden & Tromba, 2003). How to compute the Hessian? What is the dimension of the Hessian? For a function f: R^d → R, the gradient ∇f(x) ∈ R^d and the Hessian ∇²f(x) ∈ R^{d×d}. If you are not familiar, check a vector calculus book (Marsden & Tromba, 2003). A well-known fact says that if the Hessian of a function is positive semi-definite (PSD),⁵ then the function is convex. We will not prove this here. Don't worry about the proof, but do remember this fact. The statement itself is very instrumental. Here we will use this fact to prove the convexity of the function (3.92). Taking a derivative of the RHS formula in (3.92) w.r.t. w, we get:

∇_w log(1 + e^{−w^T x}) = −x e^{−w^T x} / (1 + e^{−w^T x}).
This is due to a chain rule of derivatives and the facts that d/dz log z = 1/z, d/dz e^z = e^z, and d/dw (w^T x) = x. Taking another derivative of the above, we obtain the Hessian as follows:
∇²_w log(1 + e^{−w^T x}) = ∇_w ( −x e^{−w^T x} / (1 + e^{−w^T x}) )
(a)= [x x^T e^{−w^T x} (1 + e^{−w^T x}) − x x^T e^{−w^T x} e^{−w^T x}] / (1 + e^{−w^T x})²
= x x^T e^{−w^T x} / (1 + e^{−w^T x})²  ⪰ 0    (3.93)
where (a) is due to the derivative rule of a quotient of two functions: d/dz [f(z)/g(z)] = [f′(z)g(z) − f(z)g′(z)] / g²(z). Here you may wonder why d/dw (−x e^{−w^T x}) = x x^T e^{−w^T x}. Why not x^T x, x x, or x^T x^T in front of e^{−w^T x}? One rule-of-thumb that we recommend is to simply try all the candidates and choose the one which does not have a syntax error (matrix dimension mismatch). For instance, x x (or x^T x^T) is an invalid operation. x^T x is not the right one because the Hessian must be a d-by-d matrix. The only candidate left without any syntax error is x x^T. We see that x x^T has the single nonzero eigenvalue ‖x‖², with all other eigenvalues equal to 0 (why?). Since all the eigenvalues are non-negative, the Hessian is PSD, and therefore we prove the convexity.

⁵ We say that a symmetric matrix, say Q = Q^T ∈ R^{d×d}, is positive semi-definite if v^T Q v ≥ 0, ∀v ∈ R^d, i.e., all the eigenvalues of Q are non-negative. It is simply denoted by Q ⪰ 0.
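As a quick numerical sanity check (not from the original text; the variable names are our own), one can form the Hessian in (3.93) for a random x and w and confirm that its eigenvalues are non-negative:

import numpy as np

np.random.seed(0)
d = 5
x = np.random.randn(d)
w = np.random.randn(d)

t = np.exp(-w @ x)
H = np.outer(x, x) * t / (1.0 + t)**2     # the Hessian of (3.93)

eigvals = np.linalg.eigvalsh(H)
print(eigvals)                    # one eigenvalue equals ||x||^2 * t/(1+t)^2, the rest are ~0
print(np.all(eigvals >= -1e-12))  # PSD up to numerical precision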
Fig. 3.38 How gradient descent works
Gradient descent (Lemaréchal, 2012) How to solve the convex optimization problem (3.86)? Since there is no constraint in the optimization, w* must be the stationary point, i.e., the one such that

∇J(w*) = 0.    (3.94)
However, there is a challenge in determining the optimal point w* from the above. The challenge arises because it is not feasible to find the point analytically due to the absence of a closed-form solution (please verify this). Nevertheless, we have some good news to share. The good news is that there exist several algorithms that allow us to efficiently find the point without requiring knowledge of the closed-form solution. One widely used algorithm in this field is known as gradient descent. Gradient descent is an iterative algorithm. It operates as follows. Suppose we have an estimate of w* at the t-th iteration, which we denote as w^(t). We then calculate the gradient of the function evaluated at this estimate, which is denoted as ∇J(w^(t)). Next, we update the estimate by moving it in a direction opposite to the gradient direction:

w^(t+1) ← w^(t) − α^(t) ∇J(w^(t))    (3.95)

where α^(t) > 0 indicates the learning rate (also called a step size), which usually decays like α^(t) = 1/2^t. If you think about it, this update rule makes sense. Suppose w^(t) is placed to the right of the optimal point w*, as illustrated in Fig. 3.38. Then, we should move w^(t) to the left so that it gets closer to w*. The update rule actually does this, as we subtract α^(t) ∇J(w^(t)). Notice that ∇J(w^(t)) is positive (the slope points upward to the right) when w^(t) is placed to the right of w*. We repeat this procedure until it converges.
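To make the procedure concrete, here is a minimal sketch (not from the original text) of gradient descent applied to J(w) in (3.91) on a small synthetic dataset; all variable names are our own, and for simplicity we use a small constant learning rate instead of the decaying one above:

import numpy as np

np.random.seed(1)
m, d = 200, 3
X = np.random.randn(m, d)                        # examples x^(i) as rows
w_true = np.array([1.5, -2.0, 0.5])
y = (X @ w_true + 0.1*np.random.randn(m) > 0).astype(float)  # synthetic labels

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

def grad_J(w):
    # gradient of the normalized cross entropy objective (3.91)
    y_hat = sigmoid(X @ w)
    return X.T @ (y_hat - y) / m

w = np.zeros(d)
alpha = 0.5
for t in range(500):
    w = w - alpha * grad_J(w)                    # the update rule (3.95)

print(w)   # roughly aligned with the direction of w_true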
It turns out that, as t → ∞, it converges:

w^(t) → w*,    (3.96)

as long as the learning rate is chosen properly, like the one decaying exponentially w.r.t. t. We will not touch upon the proof of this convergence. In fact, the proof is not that simple. There is even a big field in statistics which intends to prove the convergence of a variety of algorithms (if the convergence holds).

Fig. 3.39 A geometric intuition behind a convex function

Definition of a convex function An informal yet intuitive definition of a convex function is the following. We say that a function is convex if it is bowl-shaped, as illustrated in Fig. 3.39. What is the formal definition? The following observation can help us to easily come up with the definition. Take two points, say x and y, as in Fig. 3.39. Consider a point that lies in between the two points, say λx + (1 − λ)y for λ ∈ [0, 1]. Then, the bowl-shaped function suggests that the function evaluated at the λ-weighted linear combination of x and y is less than or equal to the same λ-weighted linear combination of the two functions evaluated at x and y:

f(λx + (1 − λ)y) ≤ λ f(x) + (1 − λ) f(y).    (3.97)
This motivates the following definition. We say that a function f is convex if (3.97) holds for all λ ∈ [0, 1]. Look ahead In summary, we have formulated an optimization problem for supervised learning and identified the ML principle as a fundamental element in designing the optimal loss function. Additionally, we have examined the widely used gradient descent algorithm for solving this problem. Moving ahead, we will explore the practical application of this algorithm through programming. Specifically, we will focus on a simple classifier implemented in TensorFlow in the upcoming section.
3.12 Machine Learning: TensorFlow Implementation
Recap In the previous sections, we have formulated an optimization problem for machine learning:

min_w ∑_{i=1}^m ℓ_CE(y^(i), ŷ^(i))    (3.98)
where ŷ := 1 / (1 + e^{−w^T x}) indicates the prediction output with logistic activation and ℓ_CE denotes cross entropy loss:
ℓ_CE(y, ŷ) = −y log ŷ − (1 − y) log(1 − ŷ).    (3.99)
We proved that cross entropy loss ℓ_CE(·, ·) is the optimal loss function in a sense of maximizing the likelihood. We also showed that the normalized version J(w) of the above objective function is convex in w, and therefore, it can be solved via one prominent algorithm that enables us to efficiently find the unique stationary point: gradient descent.

J(w) := (1/m) ∑_{i=1}^m −y^(i) log ŷ^(i) − (1 − y^(i)) log(1 − ŷ^(i)).    (3.100)
Outline This section will cover the implementation of the algorithm using a software tool for a simple classifier. It consists of three parts. First, we will introduce the simple classifier setting. Second, we will examine four implementation details related to the classifier. The first detail is about the dataset used for training and testing, where testing refers to evaluating the performance of a trained model on an unseen dataset. The second detail is about building a deep neural network model with ReLU activation, which extends the Perceptron introduced in Sect. 3.10. The third detail is about using softmax activation for multiple classes at the output layer. The fourth detail is about using the Adam optimizer, an advanced version of gradient descent widely used in practice. In the last part, we will learn how to implement the classifier using TensorFlow, a prominent deep learning framework. Specifically, we will use Keras, a higher-level programming language fully integrated with TensorFlow.
Handwritten digit classification We will be focusing on a simple classifier that can recognize handwritten digits from images, as shown in Fig. 3.40, where an image of the digit 2 is correctly identified. To train our model, we will use the widely-used MNIST dataset (LeCun, Bottou, Bengio, & Haffner, 1998). It was created by remixing examples from the NIST's original dataset and prepared by Yann LeCun, a deep learning pioneer. The MNIST dataset includes 60,000 training images (m) and 10,000 testing images (m_test), each of which contains 28 × 28 pixels representing gray-scale levels between 0 (white) and 1 (black). Additionally, each image is labeled with a corresponding class y^(i) which can take one of the 10 values {0, 1, . . . , 9}. For reference, see Fig. 3.41.
A deep neural network model We utilize an advanced version of logistic regression as our model. This is because logistic regression is a linear classifier, implying that its prediction function f_w(·) is confined to a linear function category. As a result, its restricted architecture may not perform well in many applications. This led to the development of deep neural networks (DNNs), which are more advanced Perceptron-like neural networks consisting of multiple layers. Although DNNs were invented in the 1960s (Ivakhnenko, 1971), their performance benefits were only fully recognized in the past decade after the groundbreaking event in 2012 when Geoffrey Hinton, a renowned figure in the AI field, and his two PhD students won the ImageNet recognition competition using a DNN that achieved human-level recognition performance, which had never been seen before (Krizhevsky, Sutskever, & Hinton, 2012). Therefore, this event marked the beginning of the deep learning revolution.
Fig. 3.40 Handwritten digit classification
Fig. 3.41 MNIST dataset: An input image is of 28-by-28 pixels, each indicating an intensity from 0 (white) to 1 (black); each label with size 1 takes one of the 10 classes from 0 to 9
Fig. 3.42 A two-layer fully-connected neural network where the input size is 28 × 28 = 784, the number of hidden neurons is 500, and the number of classes is 10. We employ ReLU activation at the hidden layer, and softmax activation at the output layer; see Fig. 3.43 for details
In digit classification, a linear classifier is inadequate. So we will employ a simple version of a DNN with only two layers: a hidden layer and an output layer, as depicted in Fig. 3.42. Since the input layer is not considered a layer by convention, this is referred to as a two-layer neural network rather than a three-layer neural network. Each neuron in the hidden layer respects exactly the same procedure as that in the Perceptron: a linear operation followed by activation. For activation, the logistic function or its shifted version, called the tanh function (spanning −1 to +1), was frequently employed in the early days. In recent times, numerous practitioners have discovered that the Rectified Linear Unit (ReLU) function is more effective in achieving faster training and producing better or equivalent performance. This was demonstrated in (Glorot et al., 2011). The ReLU function is defined as ReLU(x) = max(0, x). Consequently, it has become a widely accepted rule-of-thumb in the deep learning domain to apply ReLU activation in all hidden layers. Therefore, we have adopted this rule-of-thumb, as depicted in Fig. 3.42.
Softmax activation at the output layer We have another reason for using an advanced
version of logistic regression, which relates to the number of classes in our classifier of interest. Logistic regression is designed for binary classification. However, in our digit classifier, there are a total of 10 classes, making the logistic function unsuitable. To address this, a natural extension of the logistic function, known as softmax, is used for general classifiers with more than two classes. The operation of softmax is shown in Fig. 3.43. Let z represent the output of the last layer in a neural network before activation:
Fig. 3.43 Softmax activation employed at the output layer. This is a natural extension of logistic activation intended for the two-class case
z := [z_1, z_2, . . . , z_c]^T ∈ R^c    (3.101)
where c denotes the number of classes. The softmax function is then defined as:

ŷ_j := [softmax(z)]_j = e^{z_j} / ∑_{k=1}^c e^{z_k},   j ∈ {1, 2, . . . , c}.    (3.102)
Note that this is a natural extension of the logistic function: for c = 2,

ŷ_1 := [softmax(z)]_1 = e^{z_1} / (e^{z_1} + e^{z_2}) = 1 / (1 + e^{−(z_1 − z_2)}) = σ(z_1 − z_2)    (3.103)
where σ(·) is the logistic function. Viewing z_1 − z_2 as the binary classifier output ŷ, this coincides with the logistic function. Here ŷ_i can be interpreted as the probability that the example belongs to class i. Hence, like the binary classifier, one may want to assume:

ŷ_i = P(y = [0, . . . , 1, . . . , 0]^T | x),   i ∈ {1, . . . , c},    (3.104)

where the 1 appears at the ith position.
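As a small illustration (not from the original text; the helper name softmax and the max-subtraction trick for numerical stability are our own additions), the sketch below evaluates (3.102), checks that the outputs sum to one, and verifies that the c = 2 case reduces to the logistic function as in (3.103):

import numpy as np

def softmax(z):
    # softmax of (3.102); subtracting max(z) avoids overflow and does not change the result
    e = np.exp(z - np.max(z))
    return e / e.sum()

z = np.array([2.0, -1.0, 0.5])
print(softmax(z), softmax(z).sum())          # probabilities summing to 1

z2 = np.array([1.2, -0.3])
sigma = 1.0 / (1.0 + np.exp(-(z2[0] - z2[1])))
print(softmax(z2)[0], sigma)                 # identical, as in (3.103)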
As you may expect, under this assumption, one can verify that the optimal loss function (in a sense of maximizing likelihood) is again cross entropy loss:

ℓ*(y, ŷ) = ℓ_CE(y, ŷ) = ∑_{j=1}^c −y_j log ŷ_j
where y indicates a label of one-hot encoding vector. For instance, in the case of label = 2 with c = 10, y takes:

y = [0, 0, 1, 0, 0, 0, 0, 0, 0, 0]^T,

where the ten positions correspond to the classes 0, 1, 2, . . . , 9.
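One convenient way to build such one-hot vectors in code (a small sketch, not from the original text) is to index into an identity matrix:

import numpy as np

c = 10
label = 2
y = np.eye(c)[label]   # one-hot vector with a 1 at the position of class 2
print(y)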
The proof of this is almost the same as that in the binary classifier. So we will omit the proof here. Instead you will have a chance to prove it in Problem 10.2. Due to the above rationales, the softmax activation has been widely used for many classifiers in the field. Hence, we will also use the conventional activation in our digit classifier. Adam optimizer (Kingma & Ba, 2014) Now let’s delve into a specific algorithm that
we will utilize in our setup. As previously stated, we will employ a more advanced version of gradient descent known as the Adam optimizer. Before we explain how this optimizer works, let's briefly review the vanilla gradient descent algorithm:

w^(t+1) ← w^(t) − α ∇J(w^(t))

where w^(t) indicates the estimated weight in the t-th iteration, and α denotes the learning rate. It is worth noting that the weight update in the vanilla gradient descent algorithm depends solely on the current gradient, which is denoted as ∇J(w^(t)). Consequently, if ∇J(w^(t)) fluctuates excessively over iterations, the weight update oscillates significantly, leading to unstable training. To mitigate this issue, practitioners often turn to a variant algorithm that leverages past gradients for stabilization purposes. That is, the Adam optimizer. Here is how Adam works. The weight update takes the following formula instead:

w^(t+1) = w^(t) + α m^(t) / (√(s^(t)) + ε)    (3.105)
where m^(t) indicates a weighted average of the current and past gradients:

m^(t) = (1 / (1 − β_1^t)) [β_1 m^(t−1) − (1 − β_1) ∇J(w^(t))].    (3.106)
Here β_1 ∈ [0, 1] is a hyperparameter that captures the weight of past gradients, and hence it is called the momentum (Polyak, 1964). So the notation m stands for momentum. The factor 1/(1 − β_1^t) is applied in front, in an effort to stabilize training in initial iterations (small t). Check the detailed rationale behind this in Problem 10.7. s^(t) is a normalization factor that makes the effect of ∇J(w^(t)) almost constant over t. In case ∇J(w^(t)) is too big or too small, we may have significantly different scalings in magnitude. Similar to m^(t), s^(t) is defined as a weighted average of the current and past values (Hinton, Srivastava, & Swersky, 2012):

s^(t) = (1 / (1 − β_2^t)) [β_2 s^(t−1) + (1 − β_2)(∇J(w^(t)))²]    (3.107)

where β_2 ∈ [0, 1] denotes another hyperparameter that captures the weight of past values, and s stands for square. Notice that the dimensions of w^(t), m^(t) and s^(t) are the same. So all the operations that appear in the above (including the division in (3.105) and the square in (3.107)) are component-wise. In (3.105), ε is a tiny value introduced to avoid division by 0 in practice (usually 10⁻⁸).
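Before turning to TensorFlow, here is a minimal sketch of one Adam update written in the book's sign convention (an illustration of (3.105)–(3.107), using the raw-recursion-plus-bias-correction form that appears in Problem 10.7; all variable names are our own):

import numpy as np

def adam_step(w, grad, m, s, t, alpha=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    # raw recursions; note the minus sign on the gradient,
    # which is why the weight update below uses "+"
    m = beta1 * m - (1 - beta1) * grad
    s = beta2 * s + (1 - beta2) * grad**2
    # bias correction for small t
    m_hat = m / (1 - beta1**t)
    s_hat = s / (1 - beta2**t)
    # component-wise update of (3.105)
    w = w + alpha * m_hat / (np.sqrt(s_hat) + eps)
    return w, m, s

w = np.zeros(3)
m = np.zeros(3)
s = np.zeros(3)
grad = np.array([0.1, -0.2, 0.05])   # a hypothetical gradient ∇J(w)
w, m, s = adam_step(w, grad, m, s, t=1)
print(w)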
TensorFlow: Loading MNIST data Now, let's explore how to program in TensorFlow to implement a simple digit classifier. First, let's focus on loading the MNIST dataset. Since MNIST is a popular dataset, it is included in the following package: tensorflow.keras.datasets. Moreover, the package also contains the train and test datasets, which are already properly split. Therefore, we do not need to be concerned about how to divide them. The only script we need to write to import MNIST is:

from tensorflow.keras.datasets import mnist
(X_train, y_train), (X_test, y_test) = mnist.load_data()
X_train = X_train/255.
X_test = X_test/255.
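As an optional quick check (not part of the original script), one can print the array shapes to confirm what was loaded; given the 60,000 training and 10,000 test images mentioned above, the images should come out as 28 × 28 arrays:

print(X_train.shape, y_train.shape)   # expected: (60000, 28, 28) (60000,)
print(X_test.shape, y_test.shape)     # expected: (10000, 28, 28) (10000,)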
We normalize the input (either X_train or X_test) by dividing it by its maximum value 255. This procedure is typically performed as part of data preprocessing.
TensorFlow: A two-layer DNN In order to implement the simple DNN, illustrated in Fig. 3.42, we rely upon two major packages: (i) tensorflow.keras.models; and (ii) tensorflow.keras.layers.
The models package provides various functionalities related to neural networks themselves. One of the essential modules is Sequential, which is a neural network entity and can be described as a linear stack of layers. The layers package includes many elements that constitute a neural network, such as fully-connected dense layers and
activation functions. With these two components, we can easily construct the model shown in Fig. 3.42.

from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Flatten
model = Sequential()
model.add(Flatten(input_shape=(28,28)))
model.add(Dense(500, activation='relu'))
model.add(Dense(10, activation='softmax'))

The Flatten entity refers to a vector created from a higher-dimensional object, such as a 2D matrix. In this case, a 28-by-28 digit image is flattened into a 784-dimensional vector (784 = 28 × 28). The add() method is used to add a layer to the end of the sequential model. The Dense layer represents a fully-connected layer, and the input size is automatically determined based on the previous layer in the sequential model. Only the number of output neurons needs to be specified, which in this case is 500 hidden neurons. An activation function can also be set with an additional argument, such as activation='relu'. The output layer contains 10 neurons (matching the number of classes) and uses the softmax activation function.
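As a side note (not in the original text), Keras can print a per-layer overview of this model, which is a handy way to double-check the 784 → 500 → 10 structure and the parameter counts:

model.summary()   # Dense(500): 784*500 + 500 = 392,500 parameters; Dense(10): 500*10 + 10 = 5,010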
TensorFlow: Training a model For training, we need to first set up an algorithm (optimizer) to be employed. Here we use the Adam optimizer. As mentioned earlier, Adam has three key hyperparameters: (i) the learning rate α; (ii) β_1 (capturing the weight of past gradients); and (iii) β_2 (indicating the weight of the square of past gradients). The default choice reads: (α, β_1, β_2) = (0.001, 0.9, 0.999). So these values would be set if nothing is specified. We also need to specify a loss function. We employ the optimal loss function: cross entropy loss. A performance metric that we will look at during training and testing can also be specified. One metric frequently employed is accuracy. One can set all of these via another method compile.

model.compile(optimizer='adam',
              loss='sparse_categorical_crossentropy',
              metrics=['acc'])
Here the option optimizer='adam' sets the default choice of the learning rate and betas. For a manual choice, we first define:

opt = tensorflow.keras.optimizers.Adam(
    learning_rate=0.01, beta_1=0.92, beta_2=0.992)

We then replace the above option with optimizer=opt. As for the loss option in compile, we employ 'sparse_categorical_crossentropy', which indicates cross entropy loss beyond the binary case.
Now we can bring this to train the model on the MNIST data. During training, we often employ a part of the entire examples to compute a gradient of a loss function. The part is called a batch. Two more terminologies. One is the step, which refers to a loss computation procedure spanning the examples only in a single batch. The other is the epoch, which refers to the entire procedure associated with all the examples. In our experiment, we use a batch size of 64 and 20 epochs.

model.fit(X_train, y_train, batch_size=64, epochs=20)

TensorFlow: Testing the trained model For testing, we first need to make a prediction from the model output. To this end, we use the predict() function as follows:

model.predict(X_test).argmax(1)

Here argmax(1) returns the class w.r.t. the highest softmax output among the 10 classes. In order to evaluate the test accuracy, we use the evaluate() function:

model.evaluate(X_test, y_test)
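Equivalently (a small sketch of our own, not from the original text), the test accuracy can be computed by hand by comparing the predicted classes against the true labels:

import numpy as np

y_pred = model.predict(X_test).argmax(1)   # predicted class per test image
acc = np.mean(y_pred == y_test)            # fraction of correct predictions
print(acc)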
Closing Finally we would like to offer some advice that may be helpful for students (perhaps the majority of readers) and their future careers, especially those interested in modern AI technologies. From Part III, we learned that the communication principles such as the MAP/ML principles, phase transition, and the Viterbi algorithm play important roles in this field. Therefore, we strongly recommend that you strive to strengthen your fundamentals. In particular, mathematics is a major background that we consider crucial. There are four fundamental branches of mathematics that we believe are especially important: optimization, linear algebra, probability, and information theory. Being proficient in these areas is essential for becoming an expert in the AI field. The first fundamental we recommend is optimization, as achieving the goal of machine learning often involves optimization. The second crucial branch is linear algebra, which provides essential tools to translate the objective function and constraints into simpler, more manageable formulas. Many complex mathematical formulas can be expressed in terms of matrix multiplications and additions, which makes linear algebra an important tool. The third essential branch is probability, which plays a role in dealing with uncertainty. Lastly, information theory provides optimal architectural insights into machine learning models. One example of this is the role of cross entropy, an information-theoretic notion, in the design of the optimal loss function.
We consider these fundamentals as essential for the progress of the 4th industrial revolution powered by AI technologies, and thus, we strongly advise you to develop a solid understanding of them. It is worth noting that mastering such fundamentals is more likely to happen during your time in school. While this may sound slightly exaggerated, it is the experience of many, including the author. After graduation, time constraints and waning energy levels may make it difficult to fully comprehend some concepts.
Problem Set 10

Problem 10.1 (Basic concepts on machine learning)
(a) State the definition of an algorithm.
(b) State the definition of machine learning.
(c) State the definition of artificial intelligence.
(d) State the definition of examples (the terminology used in the machine learning field).
(e) State the definition of supervised learning.
Problem 10.2 (Softmax activation for multi-class classifiers) This problem explores a general classifier setting in which the number of classes is arbitrary, say c ∈ N. Let z := [z_1, z_2, . . . , z_c]^T ∈ R^c be the output of a so-called multi-perceptron prior to activation:

z_j = w_j^T x    (3.108)

where x ∈ R^n indicates the input and w_j := [w_{j1}, . . . , w_{jn}]^T ∈ R^n denotes the weight vector associated with the jth neuron in the output. In an attempt to make those real values z_j's being interpreted as probability quantities that lie in between 0 and 1, people usually employ the following activation function, called softmax:

ŷ_j := [softmax(z)]_j = e^{z_j} / ∑_{k=1}^c e^{z_k},   j ∈ {1, 2, . . . , c}.    (3.109)
Note that this is a natural extension of the logistic function: for c = 2,

ŷ_1 := [softmax(z)]_1 = e^{z_1} / (e^{z_1} + e^{z_2}) = 1 / (1 + e^{−(z_1 − z_2)}) = σ(z_1 − z_2)    (3.110)
where σ(·) is the logistic function. Viewing z_1 − z_2 as the binary classifier output ŷ, this coincides exactly with logistic regression. Let y ∈ {[1, 0, . . . , 0]^T, [0, 1, 0, . . . , 0]^T, . . . , [0, . . . , 0, 1]^T} be a label of one-hot-encoded-vector type. Here ŷ_i can be interpreted as the probability that the example is categorized into class i. Hence, we assume that

ŷ_i = P(y = [0, . . . , 1, . . . , 0]^T | x),   i ∈ {1, . . . , c},    (3.111)

where the 1 appears at the ith position.
We also assume that the examples {(x^(i), y^(i))}_{i=1}^m are independent over i.
(a) Derive the likelihood of training examples:

P(y^(1), . . . , y^(m) | x^(1), . . . , x^(m)).    (3.112)

This should be expressed in terms of y^(i)'s and ŷ^(i)'s.
(b) Derive the optimal loss function that maximizes the likelihood (3.112).
(c) What is the name of the optimal loss function derived in part (b)?
Problem 10.3 (Gradient descent) Consider a function J(w) = w² + 2w where w ∈ R. Consider the gradient descent algorithm with the learning rate α^(t) = 1/2^t and w^(0) = 2.
(a) Describe how gradient descent works.
(b) Using Python, run gradient descent to plot w^(t) as a function of t.
Problem 10.4 (Jensen's inequality) Suppose that a function f is convex and X is a discrete random variable. Show that

E[f(X)] ≥ f(E[X]).

Also identify conditions under which the equality holds. Hint: First think about a simple binary random variable case and then generalize with the by-induction proof technique.
Problem 10.5 (Cross entropy) Let p and q be two distributions. In information theory, there is an important notion, called cross entropy (Cover & Joy, 2006):

H(p, q) := −∑_{x∈𝒳} p(x) log₂ q(x) = E_p[log₂ (1/q(X))]    (3.113)
where X ∈ 𝒳 is a discrete random variable. Show that

H(p, q) ≥ H(p)    (3.114)

where H(p) denotes the Shannon entropy of a random variable with p:

H(p) := −∑_{x∈𝒳} p(x) log₂ p(x) = E_p[log₂ (1/p(X))].    (3.115)
Also identify conditions under which the equality in (3.114) holds. Hint: Think about Jensen's inequality in Problem 10.4.

Problem 10.6 (Kullback-Leibler divergence) In statistics, information theory and machine learning, there is a very well-known divergence measure, called the Kullback-Leibler divergence. It is defined for two distributions p and q:

D(p‖q) := ∑_{x∈𝒳} p(x) log (p(x)/q(x)) = E_p[log (p(X)/q(X))]    (3.116)
where X ∈ 𝒳 is a discrete random variable (Cover & Joy, 2006). Show that

D(p‖q) ≥ 0.    (3.117)
Also identify conditions under which the equality in (3.117) holds.
Problem 10.7 (Optimizers) Consider gradient descent:

w^(t+1) = w^(t) − α ∇J(w^(t))

where w^(t) indicates the weights of the model of interest at the t-th iteration; J(w^(t)) denotes the cost function evaluated at w^(t); and α is the learning rate. Note that only the current gradient, reflected in ∇J(w^(t)), affects the weight update.
(a) (Momentum optimizer (Polyak, 1964)) In the literature, there is a prominent variant of gradient descent that takes into account past gradients as well. Using such past information, one can damp an oscillating effect in the weight update that may incur instability in training. To capture past gradients and therefore address the oscillation problem, another quantity, often denoted by m^(t), is usually introduced:

m^(t) = β m^(t−1) − (1 − β) ∇J(w^(t))    (3.118)
where β denotes another hyperparameter that captures the weight of the past gradients, simply called the momentum. Here m stands for the momentum vector. The variant of the algorithm (often called the momentum optimizer) takes the following update for w^(t+1):

w^(t+1) = w^(t) + α m^(t).

Show that

w^(t+1) = w^(t) − α(1 − β) ∑_{k=0}^{t−1} β^k ∇J(w^(t−k)) + α β^t m^(0).    (3.119)
(b) (Bias correction) Assuming that ∇J(w^(t)) is the same for all t and m^(0) = 0, show that

w^(t+1) = w^(t) − α(1 − β^t) ∇J(w^(t)).

Note: For a large value of t, 1 − β^t ≈ 1, so it has almost the same scaling as that in the regular gradient descent. On the other hand, for a small value of t, 1 − β^t can be small, being far from 1. For instance, when β = 0.9 and t = 2, 1 − β^t = 0.19. This motivates people to rescale the moment m^(t) in (3.118) through division by 1 − β^t. So in practice, we use:

m̂^(t) = m^(t) / (1 − β^t);    (3.120)
w^(t+1) = w^(t) + α m̂^(t).    (3.121)
This technique is so called the bias correction.
(c) (Adam optimizer (Kingma & Ba, 2014)) Notice in (3.118) that a very large or very small value of ∇J(w^(t)) affects the weight update in quite a different scaling. In an effort to avoid such a different scaling problem, people in practice often make normalization in the weight update (3.121) via a normalization factor, often denoted by ŝ^(t) (Hinton et al., 2012):

w^(t+1) = w^(t) + α m̂^(t) / (√(ŝ^(t)) + ε)    (3.122)

where the division is component-wise, and

m̂^(t) = m^(t) / (1 − β_1^t),    (3.123)
m^(t) = β_1 m^(t−1) − (1 − β_1) ∇J(w^(t)),    (3.124)
ŝ^(t) = s^(t) / (1 − β_2^t),    (3.125)
s^(t) = β_2 s^(t−1) + (1 − β_2)(∇J(w^(t)))².    (3.126)
Here (·)² indicates a component-wise square; ε is a tiny value introduced to avoid division by 0 in practice (usually 10⁻⁸); and s stands for square. This optimizer (3.122) is called the Adam optimizer. Explain the rationale behind the division by 1 − β_2^t in (3.125).
Problem 10.8 (TensorFlow implementation of a digit classifier) Consider a handwritten digit classifier that we learned in Sect. 3.12. In this problem, you are asked
to build a classifier using a two-layer (one hidden layer) neural network with ReLU activation in the hidden layer and softmax activation in the output layer.
(a) (MNIST dataset) Using the following script (or otherwise), load the MNIST dataset:

from tensorflow.keras.datasets import mnist
(X_train,y_train),(X_test,y_test)=mnist.load_data()
X_train = X_train/255.
X_test = X_test/255.
What are m (the number of training examples) and m_test? What are the shapes of X_train and y_train?
(b) (Data visualization) Upon the code in part (a) being executed, report an output for the following:

import matplotlib.pyplot as plt
num_of_images = 60
for index in range(1,num_of_images+1):
    plt.subplot(6,10, index)
    plt.axis('off')
    plt.imshow(X_train[index], cmap = 'gray_r')
(c) (Model) Using a skeleton code provided in Sect. 3.12, write a script for a two-layer neural network model with 500 hidden units fed by MNIST data.
(d) (Training) Using a skeleton code in Sect. 3.12, write a script for training the model generated in part (c) with cross entropy loss. Use the Adam optimizer with:

learning rate = 0.001;   (β_1, β_2) = (0.9, 0.999)

and the number of epochs is 10. Also plot the training loss as a function of epochs.
(e) (Testing) Using a skeleton code in Sect. 3.12, write a script for testing the model (trained in part (d)). What is the test accuracy?
Problem 10.9 (True or False?)
(a) Consider an optimization problem for machine learning:

min_w ∑_{i=1}^m ℓ(y^(i), f_w(x^(i)))    (3.127)

where {(x^(i), y^(i))}_{i=1}^m indicate input-output example pairs, and
f_w(x) = 1 / (1 + e^{−w^T x}).    (3.128)
The optimal loss function (in a sense of maximizing the likelihood) is:

ℓ*(y, ŷ) = −y log ŷ − (1 − y) log(1 − ŷ).    (3.129)
(b) For two arbitrary distributions, say p and q, consider cross entropy H(p, q). Then,

H(p, q) ≥ H(q)    (3.130)

where H(q) is the Shannon entropy w.r.t. q.
) p(x) log q(x) = E p log
x∈X
1 q(X )
*
where X ∈ X is a discrete random variable. Then,
H ( p, q) = H ( p) := − p(x) log p(x)
(3.131)
(3.132)
x∈X
only when q = p.
(d) Consider a binary classifier in the machine learning setup where we are given input-output example pairs {(x^(i), y^(i))}_{i=1}^m. Let 0 ≤ ŷ^(i) ≤ 1 be the classifier output for the ith example. Let w be parameters of the classifier. Define:

w*_CE := arg min_w (1/m) ∑_{i=1}^m ℓ_CE(y^(i), ŷ^(i)),
w*_KL := arg min_w (1/m) ∑_{i=1}^m D(y^(i) ‖ ŷ^(i)),

where ℓ_CE(·, ·) denotes cross entropy loss and D(y^(i) ‖ ŷ^(i)) indicates the KL divergence between two binary random variables with parameters y^(i) and ŷ^(i), respectively. Then,

w*_CE = w*_KL.
(e) Suppose we execute the following code:
import numpy as np
a = np.random.randn(4,3,3)
b = np.ones_like(a)
print(b[0].shape)
print(b.shape[0])
Then, the two prints yield the same results.
(f) Suppose that image is an MNIST image of numpy array type. Then, one can use the following commands to plot the image:

import matplotlib.pyplot as plt
plt.imshow(image.squeeze(), cmap='gray_r')
References Abbe, E. (2017). Community detection and stochastic block models: Recent developments. The Journal of Machine Learning Research, 18(1), 6446–6531. Bansal, N., Blum, A., & Chawla, S. (2004). Correlation clustering. Machine Learning, 56(1), 89– 113. Bertsekas, D., & Tsitsiklis, J. N. (2008). Introduction to probability (Vol. 1). Athena Scientific. Boyd, S. P., & Vandenberghe, L. (2004). Convex optimization. Cambridge: Cambridge University Press. Browning, S. R., & Browning, B. L. (2011). Haplotype phasing: Existing methods and new developments. Nature Reviews Genetics, 12(10), 703–714. Calafiore, G. C., & El Ghaoui, L. (2014). Optimization models. Cambridge: Cambridge University Press. Chartrand, G. (1977). Introductory graph theory. Courier Corporation. Chen, Y., Kamath, G., Suh, C., & Tse, D. (2016). Community recovery in graphs with locality. International Conference on Machine Learning (pp. 689–698). Chen, J., & Yuan, B. (2006). Detecting functional modules in the yeast protein-protein interaction network. Bioinformatics, 22(18), 2283–2290. Cover, T., & Joy, A. T. (2006). Elements of information theory. Wiley-Interscience. Das, S., & Vikalo, H. (2015). Sdhap: Haplotype assembly for diploids and polyploids via semidefinite programming. BMC Genomics, 16(1), 1–16. Erd˝os, P., Rényi, A., et al. (1960). On the evolution of random graphs. Publication of the Mathematical Institute of the Hungarian Academy of Sciences, 5(1), 17–60. Fortunato, S. (2010). Community detection in graphs. Physics Reports, 486(3–5), 75–174. Freedman, D., Pisani, R., & Purves, R. (2007). Statistics. W.W. Norton & Co. Gallager, R. G. (2013). Stochastic processes: Theory for applications. Cambridge: Cambridge University Press. Garnier, J.- G., & Quetelet, A. (1838). Correspondance mathématique et physique (Vol. 10). Impr. d’H. Vandekerckhove. Girvan, M., & Newman, M. E. (2002). Community structure in social and biological networks. Proceedings of the National Academy of Sciences, 99(12), 7821–7826. Glorot, X., Bordes, A., & Bengio, Y. (2011). Deep sparse rectifier neural networks. In Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics (pp. 315–323). Golub, G. H., & Van Loan, C. F. (2013). Matrix computations. JHU Press.
Hamming, R. W. (1950). Error detecting and error correcting codes. The Bell System Technical Journal, 29(2), 147–160. Hinton, G., Srivastava, N., & Swersky, K. (2012). Neural networks for machine learning lecture 6a overview of mini-batch gradient descent. Cited on, 14(8), 2. Ivakhnenko, A. G. (1971). Polynomial theory of complex systems. IEEE Transactions on Systems, Man, and Cybernetics(4), 364–378. Jalali, A., Chen, Y., Sanghavi, S., & Xu, H. (2011). Clustering partially observed graphs via convex optimization. In ICML. Kingma, D. P. & Ba, J. (2014). Adam: A method for stochastic optimization. arXiv:1412.6980. Koller, D., & Friedman, N. (2009). Probabilistic graphical models: Principles and techniques. MIT Press. Krizhevsky, A., Sutskever, I., & Hinton, G. E. (2012). Imagenet classification with deep convolutional neural networks. Advances in Neural Information Processing Systems, 25. LeCun, Y., Bottou, L., Bengio, Y., & Haffner, P. (1998). Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11), 2278–2324. Lemaréchal, C. (2012). Cauchy and the gradient method. Documenta Mathematica Extra, 251(254), 10. Marsden, J. E., & Tromba, A. (2003). Vector calculus. Macmillan. Meta. (2022). Investor earnings report for 1q 2022. News, B. (2016). Artificial intelligence: Google’s alpha go beats go master lee se-dol. Nielsen, R., Paul, J. S., Albrechtsen, A., & Song, Y. S. (2011). Genotype and SNP calling from next-generation sequencing data. Nature Reviews Genetics, 12(6), 443–451. Polyak, B. T. (1964). Some methods of speeding up the convergence of iteration methods. Ussr Computational Mathematics and Mathematical Physics, 4(5), 1–17. Rabiner, L. & Juang, B. (1986). An introduction to hidden Markov models. IEEE ASSP Magazine, 3(1), 4–16. Rosenblatt, F. (1958). The perceptron: A probabilistic model for information storage and organization in the brain. Psychological Review, 65(6), 386. Samuel, A. L. (1967). Some studies in machine learning using the game of checkers. ii–recent progress. IBM Journal of Research and Development, 11(6), 601–617. Shannon, C. E. (2001). A mathematical theory of communication. ACM SIGMOBILE Mobile Computing and Communications Review, 5(1), 3–55. Shen, J., Tang, T., & Wang, L.-L. (2011). Spectral methods: Algorithms, analysis and applications (Vol. 41). Springer Science & Business Media. Si, H., Vikalo, H., & Vishwanath, S. (2014). Haplotype assembly: An information theoretic view. 2014 ieee information theory workshop (itw 2014) (pp. 182–186). Silver, D., Huang, A., Maddison, C. J., Guez, A., Sifre, L., Van Den Driessche, G., et al. (2016). Mastering the game of go with deep neural networks and tree search. Nature, 529(7587), 484–489. Suh, C. (2022). Convex optimization for machine learning. Now Publishers.
Appendix A Python Basics
Python is friendly and easy to learn.
A.1 Jupyter Notebook
Outline Python requires the use of another software platform to support its functionality, known as Jupyter notebook. In this section, we will provide an introduction to Jupyter notebook, covering several basic topics. Firstly, we will explore the role of Jupyter notebook in conjunction with Python. Next, we will provide guidance on how to install the software and launch a new file for scripting code. Subsequently, we will examine various useful interfaces which facilitate easy Python code scripting. Lastly, we will introduce several frequently used shortcuts for writing and executing code.1
What is Jupyter notebook? Jupyter notebook is a powerful tool that enables us to write and run Python code. One of its key advantages is that it allows us to execute each line of code separately, instead of running the entire code at once. This feature makes debugging much easier, especially when dealing with long code.

a=1
b=2
a+b
3

1 This tutorial is adapted from Appendix A in Suh (2022).
Fig. A.1 Three versions of Anaconda installers
Two common ways to use Jupyter notebook include running the code on a server or cloud machine, and using a local machine. In this section, we will focus on the latter option. Installation & launch To use Jupyter notebook on a local machine, you need to install a widely-used software tool called Anaconda. The latest version can be downloaded and installed from the following website: https://www.anaconda.com/products/individual You may select one of the Anaconda installers depending on your machine’s operating system. The three available versions are displayed in Fig. A.1. While installing, you may encounter certain errors. One common error is related to “non-ascii characters.” To resolve this, ensure that the destination folder path for Anaconda does not contain any non-ascii characters such as Korean. Another error message you may encounter is related to access permissions for the designated path. To avoid this, run the Anaconda installer under “run as administrator” mode, which can be accessed by right-clicking on the downloaded executable file. In order to launch Jupyter notebook, you can use anaconda prompt (for Windows) or terminal (for mac and linux). The way it works is simple. You can just type Jupyter notebook in the prompt and then press Enter. Then, a Jupyter notebook window will pop up accordingly. If the window does not appear automatically, you can instead copy and paste the URL (indicated by the arrow in Fig. A.2) on your web browser,
Fig. A.2 How to launch Jupyter notebook in the Anaconda prompt
Fig. A.3 Web browser of a successfully launched Jupyter notebook
so as to manually open it. If it works properly, you should be able to see the window like the one in Fig. A.3. Generating a new notebook file is a straightforward process. Initially, you need to browse a folder where you want to save the notebook file. Then, click the New tab situated at the top right corner (highlighted in a blue box) and select the Python 3 tab (marked in a red box) from the options. You can refer to Fig. A.4 to locate the tabs. Interface Jupyter notebook comprises two essential components necessary for running a code: the first is a computational engine, known as the Kernel, which executes the code. The Kernel can be managed using various functions available in the Kernel tab. Refer to Fig. A.5 for further information. The second component is an entity, called cell, in which you can write a script. The cell operates in two modes. The first is so called the edit mode which allows you to type either: (i) a code script for running a program; or any text like a normal text
Fig. A.4 How to create a Jupyter notebook file on the web browser
Fig. A.5 Kernel is a computational engine which serves to run the code. There are several relevant functions under the Kernel tab
Fig. A.6 How to choose the Code or Markdown option in the edit mode
editor. The code script can be written under the Code tab (marked in a red box) as illustrated in Fig. A.6. Text editing can be done under the Markdown tab, marked in a blue box. The other is the command mode. Under this mode, we can edit a notebook as a whole. For instance, we can copy or delete some cells, and move around cells under the command mode.
Shortcuts There are numerous shortcuts that can be useful for editing and navigating a notebook. Three types of frequently used shortcuts are particularly emphasized. The first set of shortcuts is used for changing between the edit and command modes. To change from the edit to command mode, press Esc. To change from command mode back to edit mode, press Enter. The second set of shortcuts is used for inserting or deleting a cell. Pressing the letter “a” inserts a new cell above the current cell, while pressing “b” inserts a new cell below the current cell. To delete a cell, press “d” twice. These commands should be typed while in command mode. The last set of shortcuts is used for executing a cell. To move between different cells, use the arrow keys. To run the current cell and move to the next one, press Shift + Enter. To run the current cell without moving to the next one, press Ctrl + Enter.
A.2 Basic Syntaxes of Python
Outline This section will cover the basic Python syntaxes necessary to script communication problems, inference problems, and machine learning optimization. In particular, we will focus on three key elements: (i) class; (ii) package; and (iii) function. Additionally, we will present a variety of Python packages that are relevant and useful to the content of this book.
A.2.1 Data Structure
There are two prominent data-structure components in Python: (i) list; and (ii) set.
(i) List List is a data type provided by Python that enables us to store multiple elements in a single variable while maintaining their order. It also permits the duplication of elements. Some common applications of lists are demonstrated below.

x = [1, 2, 3, 4]   # construct a simple list
print(x)
[1, 2, 3, 4]
x.append(5)   # add an item at the end
print(x)
[1, 2, 3, 4, 5]

x.pop()   # delete the item located at the end
print(x)
[1, 2, 3, 4]
# checking if a particular element exists in the list
if 3 in x:
    print(True)
if 5 in x:
    print(True)
else:
    print(False)
True
False
# A single-line construction of a list
y = [x for x in range(1,10)]
print(y)
[1, 2, 3, 4, 5, 6, 7, 8, 9]
# Retrieving all the elements through a "for" loop
for i in x:
    print(i)
1
2
3
4
(ii) Set Set is a built-in data type that serves a similar purpose to List, with two main differences: (i) it is unordered; and (ii) it does not permit duplicates. Below are some examples of how to use it.

x = set({1, 2, 3})   # construct a set
print(f"x: {x}, type of x: {type(x)}")
x: {1, 2, 3}, type of x: <class 'set'>
Here the f in front of strings in the print command tells Python to look at the values inside {·}. x.add(1) print(x)
# add an existing item
{1, 2, 3}
x.add(4) print(x)
# add a new item
{1, 2, 3, 4}
# checking if a particular element exists in the set
if 1 in x:
    print(True)
if 5 in x:
    print(True)
else:
    print(False)
True
False
# Retrieving all the elements through a "for" loop
for i in x:
    print(i)
1
2
3
4
A.2.2 Package
We will explore five widely-used Python packages that are particularly useful in writing scripts for the problems covered in this book. These packages are: (i) math; (ii) random; (iii) itertools; (iv) numpy; and (v) scipy.
(i) math The math module offers a range of valuable mathematical expressions, including exponentials, logarithms, square roots, and powers. Some examples of their usage are given below.

import math

math.exp(1)   # exp(x)
2.718281828459045
print(math.log(1, 10))         # log(x, base)
print(math.log(math.exp(20)))  # natural logarithm
print(math.log2(4))            # base-2 logarithm
print(math.log10(1000))        # base-10 logarithm
0.0
20.0
2.0
3.0
print(math.sqrt(16))   # square root
print(math.pow(2,4))   # x raised to y (same as x**y)
print(2**4)
4.0
16.0
16
print(math.cos(math.pi))       # cosine of x radians
print(math.dist([1,2],[3,4]))  # Euclidean distance
-1.0
2.8284271247461903

# The erf() function can be used to compute traditional
# statistical functions such as the CDF of
# the standard Gaussian distribution
def phi(x):
    # CDF of the standard Gaussian distribution
    return (1.0 + math.erf(x/math.sqrt(2.0)))/2.0
phi(1)
0.8413447460685428
(ii) random This module yields random number generation. See below for some examples.

import random

random.randrange(start=1, stop=10, step=1)
# a random number in range(start, stop, step)
random.randrange(10)   # integer from 0 to 9 inclusive
5
# returns random integer n such that a