Restless Multi-Armed Bandit in Opportunistic Scheduling


Table of contents:
Preface
Acronyms
Contents
Chapter 1: RMAB in Opportunistic Scheduling
1.1 Introduction
1.1.1 Multiarmed Bandit
1.1.2 Restless Multiarmed Bandit
1.2 Technical Challenge
1.3 Book Organization
References
Chapter 2: Myopic Policy for Opportunistic Scheduling: Homogeneous Two-State Channels
2.1 Introduction
2.1.1 Background
2.1.2 Related Work
2.1.3 Main Results and Contributions
2.2 Problem Formulation
2.2.1 System Model
2.2.2 Restless Multiarmed Bandit Formulation
2.2.3 Myopic Sensing Policy
2.3 Axioms
2.4 Optimality of Myopic Sensing Policy under Imperfect Sensing
2.4.1 Definition and Properties of Auxiliary Value Function
2.4.2 Optimality of Myopic Sensing: Positively Correlated Channels
2.4.3 Discussion
2.5 Optimality Extension to Negatively Correlated Channels
2.6 Summary
Appendix
Proof of Lemma 2.4
Proof of Lemma 2.5
Proof of Lemmas 2.6 and 2.7
References
Chapter 3: Whittle Index Policy for Opportunistic Scheduling: Heterogeneous Two-State Channels
3.1 Introduction
3.1.1 Background
3.1.2 Main Results and Contributions
3.2 Related Work
3.3 System Model
3.4 Problem Formulation
3.5 Technical Preliminary: Indexability and Whittle Index
3.6 Whittle Index and Scheduling Rule
3.6.1 Whittle Index
3.6.2 Scheduling Rule
3.6.3 Technical Challenges
3.7 Fixed Point Analysis
3.8 Threshold Policy and Adjoint Dynamic System
3.8.1 Threshold Policy
3.8.2 Adjoint Dynamic System
3.9 Linearization of Value Function for Negatively Correlated Channels
3.9.1 Region 1-2
3.9.2 Region 3
3.9.3 Region 4
3.10 Linearization of Value Function for Positively Correlated Channels
3.10.1 Region n - 1
3.10.2 Region n - 2
3.10.3 Region n - 4
3.10.4 Region n - 3
3.10.5 Region 5
3.11 Index Computation for Negatively Correlated Channels
3.11.1 Region 1
3.11.2 Region 2
3.11.3 Region 3
3.11.4 Region 4
3.12 Index Computation for Positively Correlated Channels
3.12.1 Region 1
3.12.2 Region 2
3.12.3 Region 3
3.12.4 Region 4
3.12.5 Region 5
3.12.6 Region 6
3.13 Numerical Study
3.13.1 Whittle Index versus Optimal Policy
3.13.2 Whittle Index versus Myopic Policy
3.14 Summary
References
Chapter 4: Myopic Policy for Opportunistic Scheduling: Homogeneous Multistate Channels
4.1 Introduction
4.1.1 Background
4.1.2 Existing Work
4.1.3 Main Results and Contributions
4.2 Problem Formulation
4.2.1 System Model
4.2.2 Information State
4.2.3 Myopic Policy
4.3 Optimality Analysis of Myopic Policy
4.3.1 Value Function and Its Properties
4.3.2 Assumptions
4.3.3 Properties
4.3.4 Analysis of Optimality
4.3.5 Discussion
4.3.5.1 Comparison
4.3.5.2 Bounds
4.3.5.3 Case Study
4.4 Optimality Extension
4.4.1 Assumptions
4.4.2 Optimality
4.5 Summary
Appendix
Proof of Lemma 4.1
Proof of Lemma 4.2
References
Chapter 5: Whittle Index Policy for Opportunistic Scheduling: Heterogeneous Multistate Channels
5.1 Introduction
5.1.1 Background
5.1.2 State of the Art
5.1.3 Main Results and Contributions
5.2 System Model
5.2.1 Job, Channel, and User Models
5.2.2 Server Model
5.2.3 Opportunistic Scheduling Problem
5.3 Restless Bandit Formulation and Analysis
5.3.1 Job-Channel-User Bandit
5.3.2 Restless Bandit Formulation and Opportunistic Scheduling
5.4 Indexability Analysis and Index Computation
5.4.1 Transition Matrices and Threshold Structure
5.4.2 Indexability Analysis
5.4.3 Computing Index
5.5 Indexability Extension and Scheduling Policy
5.5.1 Indexability Extension
5.5.2 Transition Matrix Approximation
5.5.3 Scheduling Policy
5.6 Numerical Simulation
5.6.1 Scenario 1
5.6.2 Scenario 2
5.6.3 Scenario 3
5.7 Summary
Appendix
Proof of Lemma 5.1
Proof of Lemma 5.2
Proof of Theorem 5.2
References
Chapter 6: Conclusion and Perspective
6.1 Summary
6.2 Open Issues and Directions for Future Research
6.2.1 RMAB with Multiple Schedulers
6.2.2 RMAB with Switching Cost
6.2.3 RMAB with Correlated Arms
Index


Kehao Wang • Lin Chen

Restless Multi-Armed Bandit in Opportunistic Scheduling

Kehao Wang, Wuhan University of Technology, Wuhan, China

Lin Chen, Sun Yat-sen University, Guangzhou, Guangdong, China

ISBN 978-3-030-69958-1    ISBN 978-3-030-69959-8 (eBook)
https://doi.org/10.1007/978-3-030-69959-8

© The Editor(s) (if applicable) and The Author(s), under exclusive license to Springer Nature Switzerland AG 2021 This work is subject to copyright. All rights are solely and exclusively licensed by the Publisher, whether the whole or part of the material is concerned, specifically the rights of translation, reprinting, reuse of illustrations, recitation, broadcasting, reproduction on microfilms or in any other physical way, and transmission or information storage and retrieval, electronic adaptation, computer software, or by similar or dissimilar methodology now known or hereafter developed. The use of general descriptive names, registered names, trademarks, service marks, etc. in this publication does not imply, even in the absence of a specific statement, that such names are exempt from the relevant protective laws and regulations and therefore free for general use. The publisher, the authors, and the editors are safe to assume that the advice and information in this book are believed to be true and accurate at the date of publication. Neither the publisher nor the authors or the editors give a warranty, expressed or implied, with respect to the material contained herein or for any errors or omissions that may have been made. The publisher remains neutral with regard to jurisdictional claims in published maps and institutional affiliations. This Springer imprint is published by the registered company Springer Nature Switzerland AG The registered company address is: Gewerbestrasse 11, 6330 Cham, Switzerland

Preface

A colleague of high repute asked an equally well-known colleague:
– What would you say if you were told that the multi-armed bandit problem had been solved?
– Sir, the multi-armed bandit problem is not of such a nature that it can be solved.

Story told by Peter Whittle

Restless Multiarmed Bandit (RMAB) is a classical problem in stochastic optimization and reinforcement learning with a wide range of engineering applications, including but not limited to multi-agent systems, web search, Internet advertising, social networks, and queueing systems. In this book, we present systematic research on a number of fundamental problems related to the performance and computational complexity of both the myopic policy and the index policy in the context of the imperfect sensing or observation that arises in practical scenarios. These problems are mathematically well characterized and intuitively understandable, are of both fundamental and practical importance, and require nontrivial effort to solve. In particular, we address the following problems, ranging from theoretical analysis to practical policy implementation and optimization:

• Sufficient conditions under which the myopic policy is optimal for homogeneous two-state Markov channels
• Feasibility and computation of the index policy for heterogeneous two-state Markov channels
• Sufficient conditions under which the myopic policy is optimal for homogeneous multistate Markov channels
• Feasibility and computation of the index policy for heterogeneous multistate Markov channels

Actually, seeking the optimal policy of a generic RMAB involves not only the tradeoff between exploration and exploitation, but also the balance between aggression and conservation, both of which are important concepts in machine learning.


In this book, we adopt a research and exposition line from theoretical analysis to practical policy implementation and optimization. To lay the theoretical foundations for the design and optimization of policies with linear complexity, we start by investigating the basic concepts and obstacles of the RMAB. Technically, a unified framework is constructed for the myopic policy in terms of the regular reward function, characterized by three basic axioms: symmetry, monotonicity, and decomposability. For homogeneous channels, we establish the optimality of the myopic policy when the reward function can be expressed as a regular function and when the discount factor is bounded by a closed-form threshold determined by the reward function. To further obtain asymptotically optimal performance of the RMAB in a larger parameter space, we analyze the feasibility and the computation scheme of the Whittle index for the two-state Markov case via a fixed-point approach. We first derive a threshold structure of the single-arm policy. Based on this structure, the closed-form Whittle index is obtained for negatively correlated channels, while the Whittle index for positively correlated channels is considerably more involved, particularly in certain regions below the stationary distribution of the Markov chain. This region is divided into deterministic regions and indeterministic regions with an interleaving structure. In the deterministic regions, the evolution of the dynamic system is periodic, and there exists an eigenmatrix depicting this evolving structure, through which the closed-form Whittle index can be derived. In the indeterministic regions, no such eigenmatrix exists to depict the aperiodic structure; in practical scenarios, given a target computing precision, the Whittle index in those regions can be computed by simple linear interpolation, since the deterministic and indeterministic regions appear in interleaving form. We further consider an opportunistic scheduling system consisting of multiple homogeneous channels evolving as a multistate Markov process. We carry out a theoretical analysis of the performance of the myopic policy with imperfect observation, introduce the monotone likelihood ratio to characterize the evolving structure of belief information, and establish a set of closed-form conditions guaranteeing the optimality of the myopic scheduling policy in the opportunistic scheduling system. For the heterogeneous case, we cast the problem as a restless bandit. The pivot for solving a restless bandit by an index policy is to establish the feasibility of the index policy, i.e., indexability. Despite the theoretical and practical importance of the index policy, indexability had remained open for opportunistic scheduling in the heterogeneous multistate channel case. To fill this gap, we propose a set of sufficient conditions on the channel state transition matrices under which indexability is guaranteed and, consequently, the index policy is feasible. We further develop a simplified procedure that reduces the complexity of computing the index from more than quadratic to linear. Our work constitutes a small step toward solving the opportunistic scheduling problem in its generic form involving multistate Markov channels.

Wuhan, China
Guangzhou, China
January 9, 2021

Kehao Wang
Lin Chen

Acronyms

eFOSD   Extended first-order stochastic dominance
FOSD    First-order stochastic dominance
i.i.d.  Independent and identically distributed
MAB     Multiarmed bandit
MLR     Monotone likelihood ratio
MPI     Marginal productivity index
PB      Proportionally best
POMDP   Partially observable Markov decision process
RB      Relatively best
RMAB    Restless multiarmed bandit
RMABs   Restless multiarmed bandits
SB      Score based
TP2     Totally positive of order 2



Chapter 1

RMAB in Opportunistic Scheduling

1.1 Introduction

1.1.1 Multiarmed Bandit

Multiarmed bandit, first posed in 1933 for clinical trials, has become a classical problem in stochastic optimization and reinforcement learning with a wide range of engineering applications, including but not limited to multi-agent systems, web search, Internet advertising, social networks, and queueing systems. Consider a dynamic system consisting of a player and N independent arms. In each time slot t (t = 1, 2, ...), the state of arm k is denoted by s_k(t) and is completely observable to the player. At slot t, the player selects one arm, say arm k, to activate based on the system state S(t) = [s_1(t), s_2(t), ⋯, s_N(t)], and accrues a reward R(s_k(t)) determined by the state s_k(t) of arm k. Meanwhile, the state of arm k transits to another state in the next slot according to certain transition probabilities, i.e., p^{(k)}_{i,j} = P(s_k(t+1) = j | s_k(t) = i), i, j ∈ Ω_k, where Ω_k denotes the state space of arm k. The states of the other arms, which are not activated, remain frozen, i.e., s_n(t+1) = s_n(t), ∀ n ≠ k. The player's selection policy π = {π(1), π(2), ⋯} is a series of mappings from the system state S(t) to the action a(t) indicating which arm is activated, i.e., π(t): S(t) ↦ a(t). The objective is to obtain the optimal policy π* maximizing the expected total discounted reward over an infinite horizon:

\[
\pi^* = \arg\max_{\pi}\ \mathbb{E}\Big[\lim_{T\to\infty}\sum_{t=1}^{T}\beta^{t-1} R\big(s_{a(t)}(t)\big)\Big] \tag{1.1}
\]

where the discount factor satisfies 0 ≤ β ≤ 1.

1.1.2 Restless Multiarmed Bandit

In the restless multiarmed bandit (RMAB), multiple arms, denoted as 𝒦(t), can be activated simultaneously and change their states in each slot; meanwhile, the passive arms are also allowed to offer reward and change state, which makes the RMAB different from the classic MAB.
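To make the "restless" distinction concrete, the following Python sketch simulates one slot of each model; the arms and their transition matrices are illustrative assumptions, not taken from the book. In the classic MAB only the activated arm transits while the others stay frozen; in the RMAB every arm transits, with passive arms following their own rule.

```python
import numpy as np

rng = np.random.default_rng(0)

# Three illustrative 2-state arms: row-stochastic transition matrices
# under activation (P_active) and passivity (P_passive).
P_active = [np.array([[0.3, 0.7], [0.2, 0.8]]) for _ in range(3)]
P_passive = [np.array([[0.9, 0.1], [0.5, 0.5]]) for _ in range(3)]

def step_mab(states, k):
    """Classic MAB: only the activated arm k transits; others stay frozen."""
    nxt = states.copy()
    nxt[k] = rng.choice(2, p=P_active[k][states[k]])
    return nxt

def step_rmab(states, active_set):
    """RMAB: active arms transit by P_active, passive arms by P_passive."""
    nxt = states.copy()
    for k in range(len(states)):
        P = P_active[k] if k in active_set else P_passive[k]
        nxt[k] = rng.choice(2, p=P[states[k]])
    return nxt

s = np.array([0, 1, 0])
print("MAB step:", step_mab(s, 0), " RMAB step:", step_rmab(s, {0}))
```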


If arm k is activated, its state transits according to the transition rule P_{k1} and yields the immediate reward g_{k1}(s_k(t)); otherwise, it transits according to another rule P_{k2} and yields the immediate reward g_{k2}(s_k(t)). A policy π = {π(t)}_{t=1}^{∞} is a series of mappings, where π(t) maps the system state S(t) to the set 𝒦(t) of K arms to be activated in slot t. In [3], P. Whittle considered the above problem of maximizing the average reward over an infinite horizon (the discounted reward can be discussed similarly), which can be formulated as follows:

\[
\pi^* = \arg\max_{\pi}\ \mathbb{E}\Big[\lim_{T\to\infty}\frac{1}{T}\sum_{t=1}^{T}\Big(\sum_{i\in\mathcal{K}(t)} g_{i1}(s_i(t)) + \sum_{j=1,\ j\notin\mathcal{K}(t)}^{N} g_{j2}(s_j(t))\Big)\Big] \tag{1.3}
\]

Let γ_k denote the maximum expected average reward obtained by playing arm k without constraint:

\[
\gamma_k = \max_{\pi}\ \mathbb{E}\Big[\lim_{T\to\infty}\frac{1}{T}\sum_{t=1}^{T} g_{k\,a_k(t)}(s_k(t))\Big], \quad \text{where } a_k(t) \in \{1, 2\} \tag{1.4}
\]

Let f_k(s_k(1)) denote the differential reward caused by the transient effect of starting from state s_k(1) rather than from an equilibrium situation:

\[
f_k(s_k(1)) = \lim_{T\to\infty} \mathbb{E}_{\pi}\Big[\sum_{t=1}^{T} \big(g_{k\,a_k(t)}(s_k(t)) - \gamma_k\big)\Big] \tag{1.5}
\]

We have the following optimality equation for the maximum expected average reward γ_k:

\[
\gamma_k + f_k(s_k(t)) = \max_{a \in \{1,2\}} \big[g_{ka}(s_k(t)) + \mathbb{E}[f_k(s_k(t+1)) \mid s_k(t)]\big] \tag{1.6}
\]

We can rewrite the above formulation more compactly as

\[
\gamma_k + f_k(s_k(t)) = \max\big[L_{k1} f_k,\ L_{k2} f_k\big] \tag{1.7}
\]

We consider the following relaxed condition: K out of N arms are activated on average, rather than exactly, in all time slots, i.e.,

\[
\mathbb{E}\big[\,|\mathcal{K}(t)|\,\big] = K \quad \text{instead of} \quad |\mathcal{K}(t)| = K,\ \forall t \tag{1.8}
\]

Then the objective under the relaxed condition is the following:

\[
\max_{\pi}\ \mathbb{E}\Big[\sum_{n=1}^{N} r_n\Big], \quad \text{s.t.}\quad \mathbb{E}\Big[\sum_{n=1}^{N} I_n\Big] = K \tag{1.9}
\]

where r_n is the average reward obtained from arm n under the relaxed constraint, and I_n = 1 or 0 according to whether arm n is activated or not. Applying the classic Lagrangian multiplier approach, we obtain the objective

\[
\max_{\pi}\ \mathbb{E}\Big[\sum_{n=1}^{N} r_n + v \sum_{n=1}^{N} I_n\Big] = \max_{\pi}\ \mathbb{E}\Big[\sum_{n=1}^{N} (r_n + v I_n)\Big] \tag{1.10}
\]

We thus have the v-subsidy problem

\[
\gamma_k(v) + f_k = \max\big[L_{k1} f_k,\ v + L_{k2} f_k\big] \tag{1.11}
\]

where v is referred to as the subsidy for passivity. We define the index W_k(i) of arm k in state i ∈ Ω_k as the value of v that makes the active and passive phases equally attractive:

\[
L_{k1} f_k = v + L_{k2} f_k \tag{1.12}
\]

Let 𝒫_k(v) be the set of states for which arm k would be passive under a v-subsidy policy. The arm is indexable if 𝒫_k(v) increases monotonically from ∅ to Ω_k as v increases from −∞ to +∞. Thus, if all arms are indexable, arm k is activated in slot t if W_k(s_k(t)) > v. We therefore obtain the following Whittle index policy.

Definition 1.2 (Whittle Index Policy) If all the bandits are indexable, activate the K arms with the greatest indices in each slot.

Conjecture 1.1 (Whittle Conjecture) Suppose all arms are indexable; then the index policy is optimal in terms of average yield per arm in the limit.
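As a numerical illustration of the v-subsidy construction, the sketch below estimates the index of each state of a single two-state arm by bisecting on the subsidy v until the active and passive actions are equally attractive in that state. It uses value iteration for a discounted variant of the v-subsidy problem (the discounted reward can be discussed similarly, as noted above); the transition matrices, rewards, and discount factor are illustrative assumptions, and the arm is assumed indexable so that bisection is valid.

```python
import numpy as np

# Hypothetical single restless arm with 2 states.
P1 = np.array([[0.4, 0.6], [0.1, 0.9]])   # transition rule when active
P2 = np.array([[0.8, 0.2], [0.6, 0.4]])   # transition rule when passive
g1 = np.array([0.0, 1.0])                 # reward g_k1 when active
g2 = np.array([0.0, 0.0])                 # reward g_k2 when passive
beta = 0.9                                # discount factor

def q_values(v, iters=500):
    """Value iteration for the discounted v-subsidy problem."""
    V = np.zeros(2)
    for _ in range(iters):
        Qa = g1 + beta * P1 @ V           # value of being active
        Qp = v + g2 + beta * P2 @ V       # value of passivity plus subsidy v
        V = np.maximum(Qa, Qp)
    return Qa, Qp

def whittle_index(state, lo=-2.0, hi=2.0, tol=1e-6):
    """Bisect on v until both actions are equally attractive in `state`
    (valid when the arm is indexable, so Qa - Qp is decreasing in v)."""
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        Qa, Qp = q_values(mid)
        if Qa[state] > Qp[state]:
            lo = mid                      # subsidy too small: still active
        else:
            hi = mid
    return 0.5 * (lo + hi)

print([round(whittle_index(s), 4) for s in (0, 1)])
```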

1.2 Technical Challenge

For MAB or RMAB, one research thrust in the existing literature seeks sufficient conditions under which the myopic (greedy) policy, which maximizes only the current-slot reward, is optimal [4–7]. A second thrust studies the asymptotically optimal Whittle index policy [8–18]. A third derives application-oriented approximate heuristic policies. However, these works do not consider the following three main challenges.


• Partial Information: In an opportunistic scheduling system, the decision-maker or scheduler has to consume a certain resource (e.g., time, energy, frequency) to observe (or sense, detect, sample) the system state. As a result, the decision-maker would not or cannot observe the complete state of the system due to the practical cost of consuming resources, and can only obtain partial information about the system state. Based on this partial information, the scheduler has to make decisions by learning from its decision history and observation history.
• Imperfect Information: In practical environments, the decision-maker must rely on some device to obtain the system state and consequently cannot observe perfect information, since any device introduces errors, i.e., false alarms and miss detections. Hence, imperfect observation is unavoidable in the process of obtaining information in a scheduling system. This kind of imperfect information leads to complicated nonlinear dynamics of the system, which require special techniques to handle.
• Multi-State Information: To make better decisions in an opportunistic scheduling system, the scheduler needs more precise system information. It is thus required to characterize the system state at a fine-grained level rather than a coarse one, such as two states (good vs. bad, or 1 vs. 0) defined by a single threshold. Multiple thresholds, i.e., multiple states, are adopted to describe the system state at a fine-grained level. However, multistate models provably require multivariate techniques.

1.3 Book Organization

In this book, we adopt a research and exposition line from theoretical modeling and analysis to practical algorithm design and optimization. Figure 1.1 illustrates the structure of the book.

[Fig. 1.1 Book organization]

In the remainder of this section, we provide a high-level overview of the technical contributions of this book, which are presented sequentially in Chaps. 2–5. To facilitate readers, we adopt a modularized structure such that the chapters are arranged as independent modules, each devoted to a specific topic outlined above. In particular, each chapter has its own introduction and conclusion sections, elaborating the related work and the importance of the results within the specific context of that chapter. For this reason, we are not providing a detailed background or a survey of prior work here.

References

1. W.R. Thompson, On the likelihood that one unknown probability exceeds another in view of the evidence of two samples. Biometrika 25(3/4), 285–294 (1933)
2. J.C. Gittins, D.M. Jones, A dynamic allocation index for the sequential design of experiments. Prog. Statist., 241–266 (1974)
3. P. Whittle, Restless bandits: Activity allocation in a changing world. J. Appl. Prob. 25A, 287–298 (1988)
4. F.E. Lapiccirella, K.Q. Liu, Z. Ding, Multi-channel opportunistic access based on primary ARQ messages overhearing. Proc. IEEE ICC, 1–5 (2011)
5. Q. Zhao, B. Krishnamachari, K.Q. Liu, On myopic sensing for multi-channel opportunistic access: Structure, optimality, and performance. IEEE Trans. Wirel. Commun. 7(12), 5431–5440 (2008)
6. S. Ahmad, M.Y. Liu, T. Javidi, Q. Zhao, B. Krishnamachari, Optimality of myopic sensing in multichannel opportunistic access. IEEE Trans. Inf. Theory 55(9), 4040–4050 (2009)
7. S. Murugesan, P. Schniter, N.B. Shroff, Multi-user scheduling in Markov-modeled downlink using randomly delayed ARQ feedback. IEEE Trans. Inf. Theory 58(2), 1025–1042 (2012)
8. H. Ji, C.V. Leung, C.Q. Luo, F.R. Yu, Optimal channel access for TCP performance improvement in cognitive radio networks. Wirel. Netw. 17, 479–492 (2010)
9. D. Chen, H. Ji, X. Li, Distributed best-relay node selection in underlay cognitive radio networks: A restless bandits approach. Proc. IEEE WCNC, 1208–1212 (2011)
10. M.Y. Liu, N. Ehsan, On the optimality of an index policy for bandwidth allocation with delayed state observation and differentiated services. Proc. IEEE INFOCOM 3, 1974–1983 (2004)
11. P. Jacko, Value of information in optimal flow-level scheduling of users with Markovian time-varying channels. Perform. Eval. 68(11), 1022–1036 (2011)
12. K.Q. Liu, Q. Zhao, Indexability of restless bandit problems and optimality of Whittle index for dynamic multichannel access. IEEE Trans. Inf. Theory 56(11), 5547–5567 (2010)
13. J.L. Ny, M. Dahleh, E. Feron, Multi-UAV dynamic routing with partial observations using restless bandit allocation indices. Proc. ACC, 4220–4225 (2008)
14. O. Jonathan, A continuous-time Markov decision process for infrastructure surveillance. Oper. Res. Proc., 327–332 (2010)


15. B. Sanso, P. Jacko, Optimal anticipative congestion control of flows with time-varying input stream. Perform. Eval. 69(2), 86–101 (2012)
16. V. Raghunathan, V. Borkar, M. Cao, P.R. Kumar, Index policies for real-time multicast scheduling for wireless broadcast systems. Proc. IEEE INFOCOM, 2243–2251 (2008)
17. T. He, A. Anandkumar, D. Agrawal, Index-based sampling policies for tracking dynamic networks under sampling constraints. Proc. IEEE INFOCOM, 1233–1241 (2011)
18. U. Ayesta, E. Martin, P. Jacko, A modeling framework for optimizing the flow-level scheduling with time-varying channels. Perform. Eval. 67(11), 1024–1029 (2010)

Chapter 2

Myopic Policy for Opportunistic Scheduling: Homogeneous Two-State Channels

2.1 Introduction

2.1.1 Background

We consider an opportunistic multichannel communication system where a user can access multiple Gilbert-Elliot channels but is limited to sensing and transmitting on a subset of the channels each time. The fundamental problem that we are interested in is how the user can exploit its past observation history, past decision history, and knowledge of the stochastic properties of those channels to maximize its utility (e.g., expected throughput) by opportunistically switching channels in each decision period. Formally, there are N i.i.d. channels, each evolving as a two-state Markov process where the state of a channel indicates the desirability of accessing it. At each time slot, the user chooses k (1 ≤ k ≤ N) of the N channels to sense and access, and obtains a certain amount of reward depending on the states of the chosen channels. Given the initial state of the system, i.e., the initial states of the N channels, the goal of the user is to find the optimal policy for scheduling channels at each time slot so as to maximize the accumulated discounted reward. This channel access problem can be cast into the RMAB problem [1] or a partially observable Markov decision process (POMDP) [2].
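As a warm-up before the formal development, the Python sketch below simulates N such Gilbert-Elliot channels and a user sensing k of them per slot; the parameter values and the placeholder random selection policy are illustrative assumptions, with one unit of reward per chosen channel that is actually good.

```python
import numpy as np

rng = np.random.default_rng(1)
N, k, T = 5, 2, 10
p01, p11 = 0.2, 0.8             # P(bad -> good), P(good -> good)

def transit(states):
    """One Markov transition of all N Gilbert-Elliot channels."""
    p_good = np.where(states == 1, p11, p01)
    return (rng.random(N) < p_good).astype(int)

states = rng.integers(0, 2, N)  # unknown to the user in the real problem
total = 0
for t in range(T):
    states = transit(states)
    chosen = rng.choice(N, size=k, replace=False)   # placeholder policy
    total += states[chosen].sum()                   # 1 unit per good channel
print("accrued reward:", total)
```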

2.1.2 Related Work

Due to its application in numerous engineering problems, the RMAB problem is of fundamental importance in stochastic decision theory. However, finding the optimal policy for the generic RMAB problem was shown to be PSPACE-hard by Papadimitriou et al. [3]. P. Whittle [1] proposed a heuristic index policy, called the

Whittle index policy, which is shown to be asymptotically optimal in a certain limiting regime under specific constraints [4]. In this regard, Liu et al. studied in [5] the indexability of a class of RMAB problems relevant to dynamic multichannel access applications. However, the optimality of the index policy based on the Whittle approach is not guaranteed in general, especially when the channels follow heterogeneous Markov chains. Moreover, not every RMAB problem has a well-defined Whittle index. A natural alternative, given the intractability of the RMAB problem, is to seek a simple myopic policy maximizing the short-term reward, i.e., the slot reward. In this line of research, significant effort has been devoted to studying the performance of the myopic policy, especially in the context of opportunistic spectrum access. Some key contributions from recent works on this subject can be summarized as follows. Zhao et al. [6] established the structure of the myopic sensing policy, analyzed its performance, and partly obtained its optimality for the case of homogeneous i.i.d. channels. Ahmad et al. [7] derived the optimality of the myopic policy for positively correlated homogeneous i.i.d. channels when the user is limited to accessing one channel at each time slot. Ahmad and Liu [8] further extended the optimality result to the case of sensing multiple homogeneous channels (k > 1) for a particular form of utility function modeling the fact that the user gets one unit of reward for each channel sensed to be good. Our works studied the case of non-i.i.d. channels and provided generic conditions on the reward function under which the myopic policy is optimal [9], and also illustrated that when these conditions are not satisfied, the myopic policy may not be optimal [10].

2.1.3 Main Results and Contributions

The vast majority of previous works (i.e., [2, 6–9]) in this area assume that the user achieves perfect observation of the channel state. However, sensing or observation errors are inevitable in practical scenarios due to noise and hardware limitations, especially in the dynamic environment of wireless communication. More specifically, a good (respectively, bad) channel may be sensed as bad (respectively, good). In such an imperfect context, it is crucial to study the structure and the optimality of the myopic sensing policy with imperfect observation. We would like to emphasize that the presence of sensing or observation errors brings two obstacles when studying the myopic sensing policy in this new context.

• The belief value of a channel evolves as a nonlinear mapping in the imperfect case, instead of a linear one in the perfect case.
• The update of the belief value of a channel depends not only on the channel's Markov evolution rule but also on the observation outcome, implying that the transition is not deterministic.

Due to the above particularities, the problem considered in this chapter requires an original study of the optimality of the myopic sensing policy, which cannot draw


on existing results from the perfect sensing case. We would like to point out that, despite its practical importance and particularities, very few works address the impact of sensing errors on the performance of the myopic sensing policy, or more generically, on the RMAB problem under imperfect observation. To the best of our knowledge, references [11, 12] are the only works in this area. Chen and Zhao et al. [11] decoupled the design of the sensing strategy from that of the spectrum sensor and the access strategy, and reduced the constrained POMDP to an unconstrained one. Liu, Zhao, and Krishnamachari [12] established the optimality of the myopic policy for the case of two channels with a particular utility function under certain conditions, and conjectured the optimality for arbitrary N under the same conditions. In this chapter, we derive closed-form conditions guaranteeing the optimality of the myopic sensing policy for arbitrary N and for a class of utility functions. As shown in Sect. 2.4.3, the result obtained in this chapter covers the result of [12]. Moreover, this chapter also significantly extends our previous work [9] on the perfect sensing scenario, whose analysis cannot be applied in the imperfect sensing scenario due to the nontrivial particularities introduced by sensing errors, as mentioned previously. In this regard, our work in this chapter contributes to the existing literature by developing an adapted analysis of the RMAB problem under imperfect sensing.

2.2 Problem Formulation

2.2.1 System Model

Table 2.1 summarizes the main notation used in this chapter. We consider a multichannel opportunistic communication system in which a user is able to access a set 𝒩 of N homogeneous channels, each characterized by a two-state Markov chain with states good (1) and bad (0). The state transition matrix P of these channels is given as follows:

\[
P = \begin{bmatrix} 1-p_{01} & p_{01} \\ 1-p_{11} & p_{11} \end{bmatrix} \tag{2.1}
\]

Assume that the system operates in a synchronized time-slotted fashion with the time slots indexed by t (t = 0, 1, ..., T), where T is the time horizon of interest. We assume that the channels go through state transitions at the beginning of each slot t. Due to hardware constraints and energy cost, the user (precisely, the spectrum detector) is practically allowed to sense only k (1 ≤ k ≤ N) of the N channels at each time slot t. We assume that the user makes its channel selection decision at the beginning of each time slot, after the state transition of the channels. Once a channel is chosen, the user detects the channel state S_i(t), which can be considered as the following binary hypothesis test:

Table 2.1 Main notation

  Symbol           Description
  𝒩                The set of N channels, i.e., 1, 2, ..., N
  𝒩(m)             The first m channels in 𝒩, i.e., 1, 2, ..., m
  P                Channel state transition matrix
  T                The total number of time slots
  t                Time slot index
  ε                False alarm probability
  ζ                Miss detection probability
  𝒜(t)             The set of channels chosen in slot t
  ω_i(t)           The conditional probability that channel i is "good"
  Ω(t)             Channel state belief vector at slot t
  Ω(0)             The initial channel state belief vector
  O_i(t)           The observation state of channel i
  π_t              The mapping from Ω(t) to 𝒜(t)
  β                Discount factor
  R(π_t(Ω(t)))     The reward collected in slot t
  V_t(Ω(t))        Value function in slot t

\[
\mathcal{H}_0:\ S_i(t) = 1\ (\text{good}) \quad \text{vs.} \quad \mathcal{H}_1:\ S_i(t) = 0\ (\text{bad}) \tag{2.2}
\]

We assume that the performance of channel state detection is characterized by the probability of false alarm ε and the probability of miss detection ζ:

ε ≜ Pr{decide ℋ_1 | ℋ_0 is true},  ζ ≜ Pr{decide ℋ_0 | ℋ_1 is true}.

We denote the set of channels chosen by the user at time slot t by 𝒜(t), where 𝒜(t) ⊆ 𝒩 and |𝒜(t)| = k. Based on the imperfect sensing observations {O_i(t) ∈ {0,1} : i ∈ 𝒜(t)} in slot t, the user decides whether to access channel i for transmission.

2.2.2 Restless Multiarmed Bandit Formulation

Obviously, by imperfectly sensing only k out of N channels, the user cannot observe the state information of the whole system. Hence, the user has to infer the channel states from its past decision and observation history so as to make future decisions. To this end, we define the channel state belief vector (hereinafter referred to as the belief vector for brevity) Ω(t) ≜ {ω_i(t), i ∈ 𝒩}, where 0 ≤ ω_i(t) ≤ 1 is the conditional probability that channel i is in the good state (i.e., S_i(t) = 1). Given the sensing action 𝒜(t) and the observations {O_i(t) ∈ {0,1} : i ∈ 𝒜(t)}, the belief vector in slot t+1 can be updated recursively using Bayes' rule, as shown in (2.3).


\[
\omega_i(t+1) = \begin{cases}
p_{11}, & \text{if } i \in \mathcal{A}(t),\ O_i(t) = 1 \\
\Gamma(\varphi(\omega_i(t))), & \text{if } i \in \mathcal{A}(t),\ O_i(t) = 0 \\
\Gamma(\omega_i(t)), & \text{if } i \notin \mathcal{A}(t)
\end{cases} \tag{2.3}
\]

where

\[
\Gamma(\omega_i(t)) \triangleq \omega_i(t)\,p_{11} + (1-\omega_i(t))\,p_{01} \tag{2.4}
\]

\[
\varphi(\omega_i(t)) \triangleq \frac{\varepsilon\,\omega_i(t)}{1-(1-\varepsilon)\,\omega_i(t)} \tag{2.5}
\]

Remark 2.1 We would like to emphasize that sensing error introduces further complications into the system dynamics (i.e., φ(ω) is nonlinear in ω) compared with the perfect sensing case. Therefore, the results [7, 9] obtained without sensing error cannot be trivially extended to the scenario with sensing error.

A sensing policy specifies a sequence of functions π := [π_0, π_1, ..., π_T], where π_t (0 ≤ t ≤ T) maps the belief vector Ω(t) to the action 𝒜(t) at each time slot t:

\[
\pi_t:\ \Omega(t) \mapsto \mathcal{A}(t), \quad |\mathcal{A}(t)| = k \tag{2.6}
\]

Given the imperfect sensing context, we are interested in the user's optimization problem of finding the optimal sensing policy π* that maximizes the expected total discounted reward over a finite horizon:

\[
\pi^* = \arg\max_{\pi}\ \mathbb{E}\Big[\sum_{t=0}^{T} \beta^{t}\, R_{\pi_t}(\Omega(t)) \,\Big|\, \Omega(0)\Big] \tag{2.7}
\]

where R_{π_t}(Ω(t)) is the reward collected in slot t under the sensing policy π_t with the initial belief vector Ω(0) (if no information on the initial system state is available, each entry of Ω(0) can be set to the stationary distribution ω_0 = p_01 / (1 + p_01 − p_11)), and 0 ≤ β ≤ 1 is the discount factor characterizing the feature that future rewards are less valuable than the immediate reward. By treating the belief value of each channel as the state of an arm of a bandit, the user's optimization problem can be cast into a restless multiarmed bandit problem.

2.2.3 Myopic Sensing Policy

To gain more insight into the structure of the optimization problem (2.7) and the complexity of solving it, we derive the dynamic programming formulation of (2.7) as follows:



\[
\begin{aligned}
V_T(\Omega(T)) &= \max_{\mathcal{A}(T)} \mathbb{E}\big[R_{\pi_T}(\Omega(T))\big] \\
V_t(\Omega(t)) &= \max_{\mathcal{A}(t)} \Big[\mathbb{E}\big[R_{\pi_t}(\Omega(t))\big] + \beta \sum_{E \subseteq \mathcal{A}(t)} \Pr(\mathcal{A}(t), E)\, V_{t+1}(\Omega_E(t+1))\Big]
\end{aligned} \tag{2.8}
\]

where

\[
\Pr(\mathcal{A}(t), E) \triangleq \prod_{i \in E} (1-\varepsilon)\,\omega_i(t) \prod_{j \in \mathcal{A}(t) \setminus E} \big[1-(1-\varepsilon)\,\omega_j(t)\big] \tag{2.9}
\]

In the above Bellman equation (2.8), V_t(Ω(t)) is the value function corresponding to the maximal expected reward from time slot t to T (0 ≤ t ≤ T), with the belief vector Ω(t+1) following the evolution (2.3), given that the channels in the subset E are sensed to be good and the channels in 𝒜(t)∖E are sensed to be bad. Solving (2.7) through the above recursive iteration is computationally heavy because the belief vector {Ω(t), t = 0, 1, ⋯, T} is a Markov chain with an uncountable state space as T → ∞, making it difficult to trace the optimal sensing policy π*. Hence, a natural alternative is to seek a simple myopic sensing policy, easy to compute and implement, that maximizes the immediate reward, formally defined as follows:

Definition 2.1 (Myopic Policy) Let U(𝒜(t), Ω(t)) := 𝔼{R_{π_t}(Ω(t))} denote the expected immediate reward obtained in time slot t under the sensing policy π_t: Ω(t) ↦ 𝒜(t). The myopic sensing policy 𝒜̂(t) consists of sensing the k channels that maximize U(𝒜(t), Ω(t)), i.e.,

\[
\hat{\mathcal{A}}(t) = \arg\max_{\mathcal{A}(t)} U(\mathcal{A}(t), \Omega(t)) \tag{2.10}
\]
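For a toy instance, the recursion (2.8) can be evaluated exactly by enumerating sensing sets and observation outcomes, which also allows the myopic policy of Definition 2.1 to be priced side by side with the optimum. The sketch below does this for the regular reward function U(𝒜(t), Ω(t)) = Σ_{i∈𝒜(t)}(1−ε)ω_i(t) (example (i) of Sect. 2.3 below); all parameter values are illustrative assumptions.

```python
from itertools import combinations
from functools import lru_cache

p01, p11, eps, beta = 0.2, 0.8, 0.1, 0.9
N, k, T = 3, 1, 4

def gamma(w): return w * p11 + (1 - w) * p01
def phi(w):   return eps * w / (1 - (1 - eps) * w)

def successors(omega, A):
    """All (probability, next belief) pairs for sensing set A, per (2.3)/(2.9)."""
    out = []
    for m in range(2 ** len(A)):
        pr, nxt = 1.0, list(omega)
        for j, i in enumerate(A):
            if m >> j & 1:                        # channel i observed good
                pr *= (1 - eps) * omega[i]
                nxt[i] = p11
            else:                                 # observed bad
                pr *= 1 - (1 - eps) * omega[i]
                nxt[i] = gamma(phi(omega[i]))
        for i in range(N):
            if i not in A:
                nxt[i] = gamma(omega[i])          # unsensed channels
        out.append((pr, tuple(round(w, 12) for w in nxt)))
    return out

@lru_cache(maxsize=None)
def V(t, omega, myopic=False):
    """Value (2.8); with myopic=True, always sense the top-k beliefs."""
    actions = combinations(range(N), k)
    if myopic:
        actions = [tuple(sorted(range(N), key=lambda i: -omega[i])[:k])]
    best = float("-inf")
    for A in actions:
        val = sum((1 - eps) * omega[i] for i in A)   # immediate reward
        if t < T:
            val += beta * sum(pr * V(t + 1, nxt, myopic)
                              for pr, nxt in successors(omega, A))
        best = max(best, val)
    return best

w0 = (0.5, 0.4, 0.3)
print("optimal:", V(0, w0), " myopic:", V(0, w0, True))
```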

Despite the simple and robust structure of the myopic policy, the optimality of this kind of greedy policy is not guaranteed. More specifically, when the channels are homogeneous (i.e., all channels follow the same Markovian dynamics P) and positively correlated (i.e., p11 ≥ p01), the myopic sensing policy is shown to be optimal when the user is limited to sensing one channel each slot (k = 1) and obtains one unit of reward when the sensed channel is good [6]. The analysis in [7] and our previous work [10] further extend the study to the generic case with k ≥ 1. However, the authors of [7] showed that the myopic sensing policy is optimal if the user gets one unit of reward for each channel sensed to be good (there, the expected slot reward function is U(𝒜(t), Ω(t)) = Σ_{i∈𝒜(t)} ω_i(t)), while our work shows that the myopic sensing policy is not guaranteed to be optimal when the user's objective is to find at least one good channel (there, the expected slot reward function is U(𝒜(t), Ω(t)) = 1 − Π_{i∈𝒜(t)} (1 − ω_i(t))).


Given that such nuance on the reward function leads to totally contrary results, a natural while fundamentally important question arises: How does the expected slot reward function UðAðtÞ, ΩðtÞÞ impact the optimality of the myopic sensing policy? Or more specifically, under what conditions on UðAðtÞ, ΩðtÞÞ is the myopic sensing policy guaranteed to be optimal?

2.3

Axioms

This section introduces a set of three axioms characterizing a family of generic and practically important functions, to which we refer as regular functions. The axioms developed in this section and the implied fundamental properties serve as a basis for the further analysis on the structure and the optimality of the myopic sensing policy in Sect. 2.4. First, we state some structural properties of Γ(ω) and φ(ω) that are useful in the subsequent proofs. Lemma 2.1 For positively correlated channel, i.e., p01 < p11, we have (i) Γ(ω) is monotonically increasing in ω (ii) p01  Γ(ω)  p11, 8 0  ω  1 Proof It follows from Γ(ω) ¼ ( p11  p01)ω + p01 straightforwardly.



11 Þp01 Lemma 2.2 If 0bE b ðp1p ð1p Þ and p01 < p11 , then 11

01

(i) φ(ω) increases monotonically in ω with φ(0) ¼ 0 and φ(1) ¼ 1; ðiiÞ φðωÞbp01 , 8p01 bωbp11 : εω Proof Noticing that φðωÞ ¼ εωþ1ω , the lemma follows straightforwardly. Throughout this section, for the convenience of presentation, we sort the elements of the believe vector Ω(t) ¼ j ω1(t), ⋯, ωN(t)j for each slot t such that A ¼ f1, ⋯, kg (i.e., the user senses channel 1 to channel k) and let ΩA ≜fωi :i 2 Ag ¼ fω1 , ⋯, ωk g4 .4 The three axioms derived in the following characterize a generic function f defined on ΩA. □
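Lemma 2.2(ii) is easy to sanity-check numerically; the sketch below sweeps ω over [p01, p11] for an ε just below the stated bound (the parameter values are illustrative assumptions).

```python
import numpy as np

p01, p11 = 0.2, 0.8
eps_bound = (1 - p11) * p01 / ((1 - p01) * p11)    # bound of Lemma 2.2
eps = 0.9 * eps_bound                              # any eps below the bound

w = np.linspace(p01, p11, 1001)
phi = eps * w / (eps * w + 1 - w)                  # phi as in the proof
assert phi.max() <= p01 + 1e-12                    # Lemma 2.2(ii)
print(f"eps bound = {eps_bound:.4f}, max phi = {phi.max():.4f} <= p01 = {p01}")
```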

Throughout this section, for convenience of presentation, we sort the elements of the belief vector Ω(t) = [ω_1(t), ⋯, ω_N(t)] in each slot t such that 𝒜 = {1, ⋯, k} (i.e., the user senses channel 1 to channel k) and let Ω_𝒜 ≜ {ω_i : i ∈ 𝒜} = {ω_1, ⋯, ω_k}; for presentation simplicity, slightly abusing notation without introducing ambiguity, we drop the time slot index t. The three axioms derived in the following characterize a generic function f defined on Ω_𝒜.

Axiom 1 (Symmetry) A function f(Ω_𝒜): [0,1]^k → ℝ is symmetric if ∀ i, j ∈ 𝒜 it holds that



\[
f(\omega_1, \cdots, \omega_i, \cdots, \omega_j, \cdots, \omega_k) = f(\omega_1, \cdots, \omega_j, \cdots, \omega_i, \cdots, \omega_k) \tag{2.11}
\]

Axiom 2 (Monotonicity) A function f(Ω_𝒜): [0,1]^k → ℝ is monotonically increasing if it is monotonically increasing in each variable ω_i, i.e., ∀ i ∈ 𝒜,

\[
\omega_i' > \omega_i \Rightarrow f(\omega_1, \cdots, \omega_i', \cdots, \omega_k) > f(\omega_1, \cdots, \omega_i, \cdots, \omega_k) \tag{2.12}
\]

Axiom 3 (Decomposability) A function f(Ω_𝒜): [0,1]^k → ℝ is decomposable if ∀ i ∈ 𝒜 it holds that

\[
f(\omega_1, \cdots, \omega_i, \cdots, \omega_k) = \omega_i\, f(\omega_1, \cdots, 1, \cdots, \omega_k) + (1-\omega_i)\, f(\omega_1, \cdots, 0, \cdots, \omega_k) \tag{2.13}
\]

Axioms 1 and 2 are intuitive. Axiom 3, on decomposability, states that f(Ω_𝒜) can always be decomposed into two terms in which ω_i is replaced by 1 and 0, respectively. The three axioms introduced in this section are consistent and non-redundant. Moreover, they characterize a family of generic functions, referred to as regular functions, defined as follows:

Definition 2.2 (Regular Function) A function is called regular if it satisfies all three axioms.

The following proposition establishes the structure of the myopic sensing policy when the expected reward function is regular.

Proposition 2.1 (Structure of Myopic Sensing Policy) Sort the elements of the belief vector in descending order such that ω_1 ≥ ⋯ ≥ ω_N. If the expected slot reward function U(·) is regular, then the myopic sensing policy, where the user is allowed to sense k channels, consists of sensing the best k channels, i.e., channels 1, ..., k.

Remark 2.2 In case of a tie, we sort the tied channels in descending order of ω_i(t+1) calculated by (2.3); the argument is that a larger ω_i(t+1) leads to a larger expected payoff in the next slot t+1. If the tie persists, the channels are sorted by index.

The three axioms characterize a set of generic functions widely used in practical applications. To see this, we give two examples for more insight:
(i) The user gets one unit of reward for each channel that is sensed good and is indeed good. In this example, the expected slot reward function is U(𝒜(t), Ω(t)) = Σ_{i=1}^{k} (1−ε) ω_i(t).
(ii) The user gets one unit of reward if at least one channel is sensed good. In this example, the expected slot reward function is U(𝒜(t), Ω(t)) = 1 − Π_{i=1}^{k} [1 − (1−ε) ω_i(t)].

i¼1

2.4 Optimality of Myopic Sensing Policy under Imperfect Sensing

17

It is easy to verify that U(•) in both examples is regular since three axioms are satisfied.

2.4 Optimality of Myopic Sensing Policy under Imperfect Sensing

In this section we establish closed-form conditions under which the myopic sensing policy, despite its simple structure, achieves the system optimum under imperfect sensing. To this end, we set out by defining an auxiliary value function and studying its structural properties, which serve as a basis for the study of the optimality of the myopic sensing policy. We then establish the main optimality result, followed by an illustration of how the obtained result can be applied via two concrete application examples. For convenience of discussion, we first state some notation before presenting the analysis:

• The belief vector Ω(t) is sorted as [ω_1(t), ⋯, ω_N(t)] in each slot t such that 𝒜(t) = {1, 2, ⋯, k};
• 𝒩(m) ≜ {1, ⋯, m} (m ≤ N) denotes the first m channels in 𝒩;
• Given E ⊆ M ⊆ 𝒩,

\[
\Pr(M, E) := \prod_{i \in E} (1-\varepsilon)\,\omega_i(t) \prod_{j \in M \setminus E} \big[1-(1-\varepsilon)\,\omega_j(t)\big] \tag{2.14}
\]

denotes the expected probability that the channels in E are sensed to be in the good state while the channels in M∖E are sensed to be in the bad state, given that the channels in M are sensed;
• P_{11}^{E} denotes the vector of length |E| with each element being p_{11};
• Φ(l, m) ≜ [Γ(ω_i(t)), l ≤ i ≤ m], where the components are sorted by channel index; Φ(l, m) characterizes the updated belief values of the channels between l and m if they are not sensed;
• Given E ⊆ M ⊆ 𝒩,

\[
Q_{M,E} := [\Gamma(\varphi(\omega_i(t))) : i \in M \setminus E] \tag{2.15}
\]

where the components are sorted by channel index; Q_{M,E} characterizes the updated belief values of the channels in M∖E if they are sensed in the bad state;

\[
\bar{Q}_{M,E,l} := [\Gamma(\varphi(\omega_i(t))) : i \in M \setminus E \text{ and } i < l] \tag{2.16}
\]

characterizes the updated belief values of the channels in M∖E if they are sensed in the bad state, with channel index smaller than l;

\[
Q_{M,E,l} := [\Gamma(\varphi(\omega_i(t))) : i \in M \setminus E \text{ and } i > l] \tag{2.17}
\]

characterizes the updated belief values of the channels in M∖E if they are sensed in the bad state, with channel index larger than l;
• Let ω_{−i} := {ω_j : j ∈ 𝒜, j ≠ i} and

\[
\Delta_{\max} := \max_{i \in \mathcal{N},\ \omega_{-i} \in [0,1]^{k-1}} \{U(1, \omega_{-i}) - U(0, \omega_{-i})\}, \qquad
\Delta_{\min} := \min_{i \in \mathcal{N},\ \omega_{-i} \in [0,1]^{k-1}} \{U(1, \omega_{-i}) - U(0, \omega_{-i})\} \tag{2.18}
\]

2.4.1 Definition and Properties of Auxiliary Value Function

In this subsection, inspired by the form of the value function V_t(Ω(t)) and the analysis in [8], we first define an auxiliary value function under imperfect sensing and then derive several fundamental properties of it, which are crucial for the study of the optimality of the myopic sensing policy.

Definition 2.3 (Auxiliary Value Function) The auxiliary value function, denoted as W_t(Ω(t)) (t = 0, 1, ⋯, T), is recursively defined as follows:

\[
\begin{aligned}
W^{\hat{\mathcal{A}}}_T(\Omega(T)) &= U(\hat{\mathcal{A}}(T), \Omega(T)) \\
W^{\hat{\mathcal{A}}}_r(\Omega(r)) &= U(\hat{\mathcal{A}}(r), \Omega(r)) + \beta \underbrace{\sum_{E \subseteq \hat{\mathcal{A}}(r)} \Pr(\hat{\mathcal{A}}(r), E)\, W^{\hat{\mathcal{A}}}_{r+1}(\Omega_E(r+1))}_{F(\hat{\mathcal{A}}(r),\ \Omega(r))}, \quad t < r < T \\
W^{\mathcal{A}}_t(\Omega(t)) &= U(\mathcal{A}(t), \Omega(t)) + \beta \underbrace{\sum_{E \subseteq \mathcal{A}(t)} \Pr(\mathcal{A}(t), E)\, W^{\hat{\mathcal{A}}}_{t+1}(\Omega_E(t+1))}_{F(\mathcal{A}(t),\ \Omega(t))}
\end{aligned} \tag{2.19}
\]

where t < r < T and Ω_E(t+1) := (P_{11}^{E}, Φ(k+1, N), Q_{𝒜(t),E}) denotes the belief vector generated from Ω(t) by (2.3). The above recursively defined auxiliary value function gives the expected accumulated reward of the following sensing policy: in slot t, sense the first k channels; if channel i is correctly sensed good, put it at the top of the list to be sensed in the next slot, otherwise drop it to the bottom of the list. Recalling Lemmas 2.1 and 2.2, under the condition 0 ≤ ε ≤ (1−p11)p01 / ((1−p01)p11), if the belief vector Ω(t) is ordered decreasingly in slot t, the above sensing policy is the myopic sensing policy, with W^𝒜_t(Ω(t)) being the total reward from slot t to T. In the subsequent analysis, we prove some structural properties of the auxiliary value function.

8i, j  k ð2:20Þ

Proof The lemma follows easily by backward induction, noticing that at slot t, (⋯, ω_i, ⋯, ω_j, ⋯) and (⋯, ω_j, ⋯, ω_i, ⋯) generate the same belief vector Ω_E(t+1) for any E. □

Lemma 2.4 (Decomposability) If the expected reward function U(·) is regular, then the corresponding auxiliary value function W_t^A(Ω(t)) is decomposable for all t = 0, 1, ⋯, T, i.e., for all i ∈ N,

  W_t^A(ω_1, ⋯, ω_i, ⋯, ω_N) = ω_i W_t^A(ω_1, ⋯, 1, ⋯, ω_N) + (1 − ω_i) W_t^A(ω_1, ⋯, 0, ⋯, ω_N).    (2.21)

Proof The proof is given in the Appendix. □

Lemma 2.4 can be applied one step further to prove the following corollary.

Corollary 2.1 If the expected reward function U(·) is regular, then for any l, m ∈ N it holds that for t = 0, 1, ⋯, T

  W_t^A(ω_1, ⋯, ω_l, ⋯, ω_m, ⋯, ω_N) − W_t^A(ω_1, ⋯, ω_m, ⋯, ω_l, ⋯, ω_N)
    = (ω_l − ω_m)[W_t^A(ω_1, ⋯, 1, ⋯, 0, ⋯, ω_N) − W_t^A(ω_1, ⋯, 0, ⋯, 1, ⋯, ω_N)].    (2.22)

Lemma 2.5 (Monotonicity) If the expected reward function U(·) is regular, the corresponding auxiliary value function W_t^A(Ω) is monotonically non-decreasing in ω_l for all l ∈ N, i.e.,

  ω′_l ≥ ω_l ⟹ W_t^A(ω_1, ⋯, ω′_l, ⋯, ω_N) ≥ W_t^A(ω_1, ⋯, ω_l, ⋯, ω_N).    (2.23)


Proof The proof is given in the Appendix. □

2.4.2 Optimality of Myopic Sensing: Positively Correlated Channels

In this subsection, we study the optimality of the myopic sensing policy under imperfect sensing. We start by showing two important auxiliary lemmas (Lemmas 2.6 and 2.7) and then establish a sufficient condition under which the optimality of the myopic sensing policy is guaranteed.

Lemma 2.6 Given
(i) ε < p01(1−p11) / (p11(1−p01)),
(ii) β ≤ (Δmin/Δmax) / [(1−ε)(1−p01) + ε(p11−p01)/(1 − (1−ε)(p11−p01))],
(iii) U(·) is regular,
if p11 ≥ ω_l ≥ ω_m ≥ p01 where l < m, then it holds that for t = 0, 1, ⋯, T

  W_t^A(ω_1, ⋯, ω_l, ⋯, ω_m, ⋯, ω_N) ≥ W_t^A(ω_1, ⋯, ω_m, ⋯, ω_l, ⋯, ω_N).    (2.24)

Lemma 2.7 Given
(i) ε < p01(1−p11) / (p11(1−p01)),
(ii) β ≤ (Δmin/Δmax) / [(1−ε)(1−p01) + ε(p11−p01)/(1 − (1−ε)(p11−p01))],
(iii) U(·) is regular,
if p11 ≥ ω_1 ≥ ⋯ ≥ ω_N ≥ p01, then for any 0 ≤ t ≤ T it holds that

  W_t^A(ω_1, ω_2, ⋯, ω_{N−1}, ω_N) − W_t^A(ω_N, ω_1, ⋯, ω_{N−1}) ≤ (1 − ω_N)Δmax,
  W_t^A(ω_1, ω_2, ⋯, ω_{N−1}, ω_N) − W_t^A(ω_N, ω_2, ⋯, ω_{N−1}, ω_1)
    ≤ (p11 − p01)Δmax · (1 − [β(1−ε)(p11−p01)]^{T−t+1}) / (1 − β(1−ε)(p11−p01)).

Lemma 2.6 states that swapping two elements of the belief vector, where the former is larger than the latter, cannot increase the total expected reward. Lemma 2.7, on the other hand, upper-bounds the reward difference of two swapping operations: swapping ω_N with ω_k (k = N−1, ⋯, 1), and swapping ω_1 with ω_N, respectively. For clarity of presentation, the detailed proofs of the two lemmas are deferred to the Appendix.

Theorem 2.1 If p01 ≤ ω_i(0) ≤ p11 for every i (1 ≤ i ≤ N), the myopic sensing policy is optimal if the following conditions hold:
(i) ε < p01(1−p11) / (p11(1−p01)),
(ii) β ≤ (Δmin/Δmax) / [(1−ε)(1−p01) + ε(p11−p01)/(1 − (1−ε)(p11−p01))],

(iii) U(·) is regular.

Proof It suffices to show that for t = 0, 1, ⋯, T, sorting Ω(t) in decreasing order such that ω_1 ≥ ⋯ ≥ ω_N yields W_t^A(ω_1, ⋯, ω_N) ≥ W_t^A(ω_{i_1}, ⋯, ω_{i_N}), where (i_1, ⋯, i_N) is any permutation of (1, ⋯, N).

We prove the above inequality by contradiction. Assume that the maximum of W_t^A(·) is achieved at (ω_{i_1}, ⋯, ω_{i_N}) ≠ (ω_1, ⋯, ω_N), i.e.,

  W_t^A(ω_{i_1}, ⋯, ω_{i_N}) > W_t^A(ω_1, ⋯, ω_N).    (2.25)

We then run a bubble-sort algorithm on (ω_{i_1}, ⋯, ω_{i_N}): repeatedly step through the vector, compare each pair of adjacent elements ω_{i_l} and ω_{i_{l+1}}, and swap them if ω_{i_l} < ω_{i_{l+1}}. When the algorithm terminates, the elements of the belief vector are sorted decreasingly, i.e., the vector finally becomes (ω_1, ⋯, ω_N). Applying Lemma 2.6 at each swap, we obtain W_t^A(ω_{i_1}, ⋯, ω_{i_N}) ≤ W_t^A(ω_1, ⋯, ω_N), which contradicts (2.25). Theorem 2.1 is thus proven. □

As noted in [12], when the initial belief ω_i is set to p01/(p01 + 1 − p11), as is popular in practical systems, it can be checked that p01 ≤ ω_i(0) ≤ p11 holds. Moreover, even if the initial belief does not fall in [p01, p11], all belief values are bounded in that interval from the second slot on, following Lemma 2.1. Hence our results can be extended by treating the first slot separately from the future slots.
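To make the conditions of Theorem 2.1 easy to apply, the following sketch (our illustration, not part of the original analysis) evaluates conditions (i) and (ii) for given channel parameters; the ratio Δmin/Δmax must be supplied by the caller, since it depends on the chosen regular utility function U(·).

```python
def myopic_optimality_conditions(p11, p01, eps, beta, delta_ratio):
    """Check conditions (i)-(ii) of Theorem 2.1 (positively correlated
    channels, p11 >= p01). delta_ratio = Delta_min / Delta_max, which
    depends on the utility function U and must be supplied by the caller."""
    assert p11 >= p01, "Theorem 2.1 assumes positively correlated channels"
    # Condition (i): the sensing error must be small enough.
    cond_i = eps < p01 * (1 - p11) / (p11 * (1 - p01))
    # Condition (ii): the discount factor must be below the closed-form bound.
    denom = (1 - eps) * (1 - p01) + \
        eps * (p11 - p01) / (1 - (1 - eps) * (p11 - p01))
    cond_ii = beta <= delta_ratio / denom
    return cond_i, cond_ii

# Illustrative numbers only:
print(myopic_optimality_conditions(p11=0.8, p01=0.2, eps=0.05,
                                   beta=0.9, delta_ratio=0.7))
```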

2.4.3 Discussion

In this subsection, we illustrate the application of the obtained result in two concrete scenarios and compare our work with existing results.

Problem 2.1 Consider the channel access problem in which a user is limited to sensing k channels and gets one unit of reward for each sensed channel in the good state, i.e., the utility function can be formulated as U(A(t), Ω_A) = (1−ε) ∑_{i∈A(t)} ω_i.

The optimality of the myopic sensing policy for this model was studied in [12] for a subset of scenarios with k = 1 and N = 2. We now study the generic case with k, N ≥ 2. To that end, we have Δmin = Δmax = 1 − ε, and can then verify that when ε < p01(1−p11)/(p11(1−p01)), it holds that

  (Δmin/Δmax) / [(1−ε)(1−p01) + ε(p11−p01)/(1 − (1−ε)(p11−p01))] > 1.

Therefore, when conditions (i) and (ii) hold, the myopic sensing policy is optimal for any 0 ≤ β ≤ 1 by Theorem 2.1. This result for the generic case significantly extends [12], in which the optimality of the myopic policy is proved for the case of two channels and only conjectured for general cases.

Problem 2.2 Consider the channel access scenario where a user can sense k channels but can only choose one of them to transmit its packets. Under this model, the user wants to maximize its expected throughput. More specifically, the slot utility function is U(A(t), Ω(t)) = 1 − ∏_{i∈A(t)} [1 − (1−ε)ω_i(t)], which is regular. In this context, Δmax = (1−ε)^{k−1} p11^{k−1} and Δmin = (1−ε)^{k−1} p01^{k−1}, so the third condition becomes

  β ≤ (p01^{k−1}/p11^{k−1}) / [(1−ε)(1−p01) + ε(p11−p01)/(1 − (1−ε)(p11−p01))].

In particular, when ε = 0, the condition reduces to β ≤ p01^{k−1} / (p11^{k−1}(1−p01)).

It can thus be noted that even when there is no sensing error, the myopic policy is not guaranteed to be optimal. Since the conditions in Theorem 2.1 are sufficient, it is insightful to see how tight they are, especially the third condition. To this end, we provide an example in which the third condition is only slightly violated and the myopic sensing policy is not optimal.

Example 2.1 T = 2, Ω(0) = [0.5814, 0.41, 0.40, 0.33, 0.32, 0.31], p11 = 0.5815, p01 = 0.3, k = 3, β = 0.99992, ε = 0.25 < p01(1−p11)/(p11(1−p01)) = 0.40, and U(·) = 1 − ∏_{i∈A}(1 − ω_i), which is regular.

In this example, it can be checked that sensing channels {1, 2, 4} yields more payoff than the myopic sensing policy, since

  (Δmin/Δmax) / [(1−ε)(1−p01) + ε(p11−p01)/(1 − (1−ε)(p11−p01))] = 0.9917 < 0.99992 = β.

This example evidences that the condition in Theorem 2.1 is quite tight, in that a slight violation can lead to non-optimality of the myopic sensing policy.
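The comparison in Example 2.1 can also be reproduced by brute force. The sketch below is our illustration (not from the book): it evaluates every size-3 sensing set over the two-slot horizon, assuming the update rule of Sect. 2.2 (sensed-good channels move to belief p11, sensed-bad channels to Γ(φ(ω)), unsensed channels to Γ(ω)), that T = 2 comprises two decision slots, and that the second-slot action is taken myopically.

```python
from itertools import combinations

p11, p01, eps, beta = 0.5815, 0.3, 0.25, 0.99992
omega0 = [0.5814, 0.41, 0.40, 0.33, 0.32, 0.31]
k = 3

def prod(xs):
    r = 1.0
    for x in xs:
        r *= x
    return r

gamma = lambda w: w * p11 + (1 - w) * p01        # unobserved belief update
phi = lambda w: eps * w / (1 - (1 - eps) * w)    # belief after sensed "bad"
U = lambda ws: 1 - prod(1 - w for w in ws)       # slot utility of Example 2.1

def two_slot_value(sense, omega):
    """Immediate reward plus discounted expected myopic reward in slot 2,
    summed over the 2^k outcomes of which sensed channels look good."""
    value = U(omega[i] for i in sense)
    for r in range(len(sense) + 1):
        for good in combinations(sense, r):
            p = prod((1 - eps) * omega[i] for i in good) * \
                prod(1 - (1 - eps) * omega[i] for i in sense if i not in good)
            nxt = [p11 if i in good else
                   gamma(phi(omega[i])) if i in sense else
                   gamma(omega[i]) for i in range(len(omega))]
            value += beta * p * U(sorted(nxt, reverse=True)[:k])
    return value

myopic = (0, 1, 2)                               # the three largest beliefs
best = max(combinations(range(6), k),
           key=lambda s: two_slot_value(s, omega0))
print(two_slot_value(myopic, omega0), best, two_slot_value(best, omega0))
```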

2.5 Optimality Extension to Negatively Correlated Channels

In this section, we study the optimality of the myopic policy for negatively correlated homogeneous channels, i.e., p11 < p01. In [9], the authors showed by counterexample that the myopic policy is not optimal in general for negatively correlated homogeneous channels. Here, however, we prove that the myopic policy is optimal under only a weak condition on the initial belief vector Ω(0), namely p11 ≤ ω_i(0) ≤ p01 for all i. In fact, this condition is automatically satisfied from the second slot on, since the belief values enter [p11, p01] under the operator Γ(·).

Lemma 2.8 For negatively correlated channels, i.e., p01 > p11, we have
(i) Γ(ω) is monotonically decreasing in ω;
(ii) p11 ≤ Γ(ω) ≤ p01, ∀ 0 ≤ ω ≤ 1.

Lemma 2.9 If 0 ≤ ε ≤ (1−p01)p11 / ((1−p11)p01) and p11 < p01, then
(i) φ(ω) increases monotonically in ω, with φ(0) = 0 and φ(1) = 1;
(ii) φ(ω) ≤ p11, ∀ p11 ≤ ω ≤ p01.

Through analyzing the queue structure of the myopic policy [5], we find that the proof for the positively correlated homogeneous model can be slightly modified to fit the negatively correlated homogeneous model. Hence, we first give the following structural result on the belief vector, and then point out the nuances in the proof of the optimality of the myopic policy.

Theorem 2.2 (Structure of Myopic Policy) If ε ≤ p11(1−p01) / (p01(1−p11)), the channel ordering at the end of each slot obeys the following rules:
(i) The initial channel ordering Q(0) is determined by the initial belief vector: ω_{σ1}(0) ≥ ⋯ ≥ ω_{σN}(0) ⟹ Q(0) = (σ1, σ2, ⋯, σN);
(ii) A channel sensed to be in the bad state stays at the head of the queue, a channel sensed to be in the good state moves to the end of the queue, and the order of the remaining channels is reversed.

Proof Assume Q(t) = (σ1, ⋯, σN) at slot t; we thus have p01 ≥ ω_{σ1}(t) ≥ ⋯ ≥ ω_{σN}(t) ≥ p11. If channel σ1 is sensed to be in the good state, then ω_{σ1}(t+1) = Γ(1) ≤ Γ(ω_{σ2}(t)) ≤ ⋯ ≤ Γ(ω_{σN}(t)) by Lemma 2.8, and thus Q(t+1) = (σN, ⋯, σ1) according to the descending order of ω. If channel σ1 is sensed to be in the bad state, then ω_{σ1}(t+1) = Γ(φ(ω_{σ1}(t))) ≥ Γ(p11) ≥ Γ(ω_{σN}(t)) ≥ ⋯ ≥ Γ(ω_{σ2}(t)), and further Q(t+1) = (σ1, σN, ⋯, σ2). □


Table 2.2 Structure of the myopic policy with Q(t) = (σ1, ⋯, σN)

  Sensed state of σ1 | Positively correlated      | Negatively correlated
  Good               | Q(t+1) = (σ1, σ2, ⋯, σN)   | Q(t+1) = (σN, ⋯, σ2, σ1)
  Bad                | Q(t+1) = (σ2, ⋯, σN, σ1)   | Q(t+1) = (σ1, σN, ⋯, σ2)
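A direct implementation of these update rules may help fix ideas. The sketch below is our illustration (not from the book), assuming the queue head is the channel sensed in slot t, as in Table 2.2.

```python
def update_queue(queue, first_sensed_good, positively_correlated):
    """Queue update rules of Table 2.2; `queue` lists channel indices in
    descending belief order, and its head is the channel sensed in slot t."""
    head, rest = queue[0], queue[1:]
    if positively_correlated:
        # good: keep the order; bad: head drops to the end of the queue
        return queue[:] if first_sensed_good else rest + [head]
    # negatively correlated: the remaining beliefs are reversed by Γ(·)
    if first_sensed_good:
        return rest[::-1] + [head]    # head moves to the end
    return [head] + rest[::-1]        # head stays in front

# Example with Q(t) = [1, 2, 3, 4], channel 1 sensed bad:
print(update_queue([1, 2, 3, 4], first_sensed_good=False,
                   positively_correlated=False))   # -> [1, 4, 3, 2]
```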

Remark 2.3 Assume Q(t) = (σ1, ⋯, σN) at slot t, where ω_{σ1}(t) ≥ ⋯ ≥ ω_{σN}(t). When channel σ1 is sensed to be in the good state or the bad state, respectively, the structure of Q(t+1) is as stated in Table 2.2, which also lists Q(t+1) for positively correlated homogeneous channels for comparison. As shown in Table 2.2, Q(t+1) appears in reverse order in the two cases. It is this reversal that preserves the two kinds of exchange operations in Lemmas 2.6 and 2.7; thus, Lemmas 2.6 and 2.7 still hold after exchanging p11 and p01.

Following an induction similar to that for positively correlated channels, we have the following lemmas and theorem, stated without proof.

Lemma 2.10 Given
(i) ε < p11(1−p01) / (p01(1−p11)),
(ii) β ≤ (Δmin/Δmax) / [(1−ε)(1−p11) + ε(p01−p11)/(1 − (1−ε)(p01−p11))],

(iii) U(·) is regular,
if p01 ≥ ω_l ≥ ω_m ≥ p11 where l < m, then it holds that for t = 0, 1, ⋯, T

  W_t^A(ω_1, ⋯, ω_l, ⋯, ω_m, ⋯, ω_N) ≥ W_t^A(ω_1, ⋯, ω_m, ⋯, ω_l, ⋯, ω_N).

Lemma 2.11 Given
(i) ε < p11(1−p01) / (p01(1−p11)),
(ii) β ≤ (Δmin/Δmax) / [(1−ε)(1−p11) + ε(p01−p11)/(1 − (1−ε)(p01−p11))],
(iii) U(·) is regular,
if p01 ≥ ω_1 ≥ ⋯ ≥ ω_N ≥ p11, then for any 0 ≤ t ≤ T it holds that

  W_t^A(ω_1, ω_2, ⋯, ω_{N−1}, ω_N) − W_t^A(ω_N, ω_1, ⋯, ω_{N−1}) ≤ (1 − ω_N)Δmax,
  W_t^A(ω_1, ω_2, ⋯, ω_{N−1}, ω_N) − W_t^A(ω_N, ω_2, ⋯, ω_{N−1}, ω_1)
    ≤ (p01 − p11)Δmax · (1 − [β(1−ε)(p01−p11)]^{T−t+1}) / (1 − β(1−ε)(p01−p11)).    (2.26)


Theorem 2.3 If p11 ≤ ω_i(0) ≤ p01 for 1 ≤ i ≤ N, the myopic policy is optimal if
(i) ε < p11(1−p01) / (p01(1−p11)),
(ii) β ≤ (Δmin/Δmax) / [(1−ε)(1−p11) + ε(p01−p11)/(1 − (1−ε)(p01−p11))],
(iii) U(·) is regular.

Remark 2.4 Theorem 2.3 gives sufficient conditions for the optimality of the myopic policy, i.e., probing the best channels, for negatively correlated homogeneous channels. More importantly, this theorem runs against the intuition that the myopic policy cannot be optimal in the negatively correlated case.

2.6 Summary

In this chapter, we have investigated the optimality of the myopic policy in the context of opportunistic access with imperfect sensing of two-state Markov channels, and derived closed-form conditions under which the myopic sensing policy is guaranteed to be optimal for homogeneous channels. Due to the generic RMAB formulation of the problem, the obtained results and the analysis methodology are applicable in a wide range of engineering domains.

Appendix

Proof of Lemma 2.4 We proceed by backward induction. First, it is easy to verify that the lemma holds for slot T. Assuming the lemma holds for slots t+1, ⋯, T−1, we prove that it holds for slot t by distinguishing two cases.

Case 1: channel l is not sensed in slot t, i.e., l ≥ k+1. Let A(t) = M ≜ N(k) = {1, ⋯, k}, and set ω_l = 0 and 1, respectively; we have

  W_t^A(ω_1, ⋯, ω_l, ⋯, ω_N) = U(ω_1, ⋯, ω_k) + β ∑_{E⊆M} Pr(M, E) Ŵ_{t+1}(Ω_E^l(t+1)),
  W_t^A(ω_1, ⋯, 0, ⋯, ω_N) = U(ω_1, ⋯, ω_k) + β ∑_{E⊆M} Pr(M, E) Ŵ_{t+1}(Ω_E^{l,0}(t+1)),
  W_t^A(ω_1, ⋯, 1, ⋯, ω_N) = U(ω_1, ⋯, ω_k) + β ∑_{E⊆M} Pr(M, E) Ŵ_{t+1}(Ω_E^{l,1}(t+1)),

where

  Ω_E^l(t+1) = (P^E_{11}, Φ(k+1, l−1), Γ(ω_l), Φ(l+1, N), Q_{M,E}),
  Ω_E^{l,0}(t+1) = (P^E_{11}, Φ(k+1, l−1), p01, Φ(l+1, N), Q_{M,E}),
  Ω_E^{l,1}(t+1) = (P^E_{11}, Φ(k+1, l−1), p11, Φ(l+1, N), Q_{M,E}).

To prove the lemma in this case, it suffices to show

  Ŵ_{t+1}(Ω_E^l(t+1)) = (1−ω_l) Ŵ_{t+1}(Ω_E^{l,0}(t+1)) + ω_l Ŵ_{t+1}(Ω_E^{l,1}(t+1)).    (2.27)

By the induction hypothesis, we have

  Ŵ_{t+1}(Ω_E^l(t+1)) = Γ(ω_l) Ŵ_{t+1}(P^E_{11}, Φ(k+1, l−1), 1, Φ(l+1, N), Q_{M,E})
    + (1 − Γ(ω_l)) Ŵ_{t+1}(P^E_{11}, Φ(k+1, l−1), 0, Φ(l+1, N), Q_{M,E}),    (2.28)
  Ŵ_{t+1}(Ω_E^{l,0}(t+1)) = p01 Ŵ_{t+1}(P^E_{11}, Φ(k+1, l−1), 1, Φ(l+1, N), Q_{M,E})
    + (1 − p01) Ŵ_{t+1}(P^E_{11}, Φ(k+1, l−1), 0, Φ(l+1, N), Q_{M,E}),    (2.29)
  Ŵ_{t+1}(Ω_E^{l,1}(t+1)) = p11 Ŵ_{t+1}(P^E_{11}, Φ(k+1, l−1), 1, Φ(l+1, N), Q_{M,E})
    + (1 − p11) Ŵ_{t+1}(P^E_{11}, Φ(k+1, l−1), 0, Φ(l+1, N), Q_{M,E}).    (2.30)

Combining (2.28) and (2.29) with (2.30), and noticing that Γ(ω_l) = (1−ω_l)p01 + ω_l p11, we obtain (2.27).

Case 2: channel l is sensed in slot t, i.e., l ≤ k. Let M ≜ N(k)∖{l} = {1, ⋯, l−1, l+1, ⋯, k}; according to (2.19) we have

  W_t^A(Ω(t)) = U(ω_1, ⋯, ω_l, ⋯, ω_k)
    + β(1−ε)ω_l ∑_{E⊆M} Pr(M, E) Ŵ_{t+1}(P^E_{11}, p11, Φ(k+1, N), Q̄_{M,E,l}, Q_{M,E,l})
    + β[1 − (1−ε)ω_l] ∑_{E⊆M} Pr(M, E) Ŵ_{t+1}(P^E_{11}, Φ(k+1, N), Q̄_{M,E,l}, Γ(φ(ω_l)), Q_{M,E,l}).    (2.31)

Letting ω_l = 0 in (2.31), we have

  W_t^A(ω_1, ⋯, 0, ⋯, ω_N) = U(ω_1, ⋯, 0, ⋯, ω_k)
    + β ∑_{E⊆M} Pr(M, E) Ŵ_{t+1}(P^E_{11}, Φ(k+1, N), Q̄_{M,E,l}, p01, Q_{M,E,l}).    (2.32)

Letting ω_l = 1 in (2.31), we have

  W_t^A(ω_1, ⋯, 1, ⋯, ω_N) = U(ω_1, ⋯, 1, ⋯, ω_k)
    + β(1−ε) ∑_{E⊆M} Pr(M, E) Ŵ_{t+1}(P^E_{11}, p11, Φ(k+1, N), Q̄_{M,E,l}, Q_{M,E,l})
    + βε ∑_{E⊆M} Pr(M, E) Ŵ_{t+1}(P^E_{11}, Φ(k+1, N), Q̄_{M,E,l}, p11, Q_{M,E,l}).    (2.33)

To prove the lemma in this case, based on (2.31), (2.32), and (2.33), we only need to show

  [1 − (1−ε)ω_l] Ŵ_{t+1}(P^E_{11}, Φ(k+1, N), Q̄_{M,E,l}, Γ(φ(ω_l)), Q_{M,E,l})
    = (1−ω_l) Ŵ_{t+1}(P^E_{11}, Φ(k+1, N), Q̄_{M,E,l}, p01, Q_{M,E,l})
    + εω_l Ŵ_{t+1}(P^E_{11}, Φ(k+1, N), Q̄_{M,E,l}, p11, Q_{M,E,l}).    (2.34)

By the induction hypothesis, we have

  Ŵ_{t+1}(P^E_{11}, Φ(k+1, N), Q̄_{M,E,l}, Γ(φ(ω_l)), Q_{M,E,l})
    = Γ(φ(ω_l)) Ŵ_{t+1}(P^E_{11}, Φ(k+1, N), Q̄_{M,E,l}, 1, Q_{M,E,l})
    + (1 − Γ(φ(ω_l))) Ŵ_{t+1}(P^E_{11}, Φ(k+1, N), Q̄_{M,E,l}, 0, Q_{M,E,l}),    (2.35)
  Ŵ_{t+1}(P^E_{11}, Φ(k+1, N), Q̄_{M,E,l}, p01, Q_{M,E,l})
    = p01 Ŵ_{t+1}(⋯, 1, ⋯) + (1 − p01) Ŵ_{t+1}(⋯, 0, ⋯),    (2.36)
  Ŵ_{t+1}(P^E_{11}, Φ(k+1, N), Q̄_{M,E,l}, p11, Q_{M,E,l})
    = p11 Ŵ_{t+1}(⋯, 1, ⋯) + (1 − p11) Ŵ_{t+1}(⋯, 0, ⋯),    (2.37)

where the omitted arguments in (2.36) and (2.37) are the same as in (2.35). Combining (2.35), (2.36), and (2.37), and noticing that [1 − (1−ε)ω_l]Γ(φ(ω_l)) = (1−ω_l)p01 + εω_l p11, we obtain (2.34).

Combining the analysis of the two cases proves Lemma 2.4. □

Proof of Lemma 2.5 We proceed by backward induction. First, we can easily show that the lemma holds for slot T. Assuming the lemma holds for slots t+1, ⋯, T−1, we prove that it also holds for slot t by distinguishing two cases.

Case 1: channel l is not sensed in slot t, i.e., l ≥ k+1. In this case, the immediate reward does not depend on ω_l or ω′_l. Moreover, let Ω(t+1) and Ω′(t+1) denote the belief vectors generated by Ω(t) = (ω_1, ⋯, ω_l, ⋯, ω_N) and Ω′(t) = (ω_1, ⋯, ω′_l, ⋯, ω_N), respectively; Ω(t+1) and Ω′(t+1) then differ in only one element, with ω′_l(t+1) ≥ ω_l(t+1). By induction, Ŵ_{t+1}(Ω′(t+1)) ≥ Ŵ_{t+1}(Ω(t+1)). Noticing (2.19), it follows that W_t^A(Ω′(t)) ≥ W_t^A(Ω(t)).

Case 2: channel l is sensed in slot t, i.e., l ≤ k. Following Lemma 2.4 and after some straightforward algebra, we have

  W_t^A(ω_1, ⋯, ω′_l, ⋯, ω_N) − W_t^A(ω_1, ⋯, ω_l, ⋯, ω_N)
    = (ω′_l − ω_l)[W_t^A(ω_1, ⋯, 1, ⋯, ω_N) − W_t^A(ω_1, ⋯, 0, ⋯, ω_N)].

Let M ≜ N(k)∖{l} = {1, ⋯, l−1, l+1, ⋯, k}. Developing W_t^A(Ω(t)) as a function of ω_l, we have

  W_t^A(Ω(t)) = U(ω_1(t), ⋯, ω_k(t))
    + β(1−ε)ω_l ∑_{E⊆M} Pr(M, E) Ŵ_{t+1}(Ω_E(t+1))
    + β[1 − (1−ε)ω_l] ∑_{E⊆M} Pr(M, E) Ŵ_{t+1}(Ω′_E(t+1)).

Setting ω_l = 0 and 1, respectively, we have

  W_t^A(ω_1, ⋯, 0, ⋯, ω_N) = U(ω_1, ⋯, 0, ⋯, ω_N) + β ∑_{E⊆M} Pr(M, E) Ŵ_{t+1}(Ω_0^E(t+1)),
  W_t^A(ω_1, ⋯, 1, ⋯, ω_N) = U(ω_1, ⋯, 1, ⋯, ω_N)
    + β(1−ε) ∑_{E⊆M} Pr(M, E) Ŵ_{t+1}(Ω_{1−ε}^E(t+1)) + βε ∑_{E⊆M} Pr(M, E) Ŵ_{t+1}(Ω_ε^E(t+1)),

where

  Ω_0^E(t+1) = (P^E_{11}, Φ(k+1, N), Q̄_{M,E,l}, p01, Q_{M,E,l}),
  Ω_{1−ε}^E(t+1) = (P^E_{11}, p11, Φ(k+1, N), Q̄_{M,E,l}, Q_{M,E,l}),
  Ω_ε^E(t+1) = (P^E_{11}, Φ(k+1, N), Q̄_{M,E,l}, p11, Q_{M,E,l}).

It can be checked that Ω_{1−ε}^E(t+1) ≥ Ω_0^E(t+1) and Ω_ε^E(t+1) ≥ Ω_0^E(t+1). It then follows by induction that, for any given E, Ŵ_{t+1}(Ω_{1−ε}^E(t+1)) ≥ Ŵ_{t+1}(Ω_0^E(t+1)) and Ŵ_{t+1}(Ω_ε^E(t+1)) ≥ Ŵ_{t+1}(Ω_0^E(t+1)). Noticing that U(·) is increasing in each element, we then have

  W_t^A(ω_1, ⋯, 1, ⋯, ω_N) − W_t^A(ω_1, ⋯, 0, ⋯, ω_N)
    = U(ω_1, ⋯, 1, ⋯, ω_N) − U(ω_1, ⋯, 0, ⋯, ω_N)
    + β(1−ε) ∑_{E⊆M} Pr(M, E)[Ŵ_{t+1}(Ω_{1−ε}^E(t+1)) − Ŵ_{t+1}(Ω_0^E(t+1))]
    + βε ∑_{E⊆M} Pr(M, E)[Ŵ_{t+1}(Ω_ε^E(t+1)) − Ŵ_{t+1}(Ω_0^E(t+1))]
    ≥ 0.

Combining the analysis of the two cases completes the proof. □


Proof of Lemmas 2.6 and 2.7 Due to the dependency between the two lemmas, we prove them together by backward induction.

We first show that Lemmas 2.6 and 2.7 hold for slot T. Lemma 2.6 is easily verified. For Lemma 2.7, noticing that p01 ≤ ω_N ≤ ω_k ≤ p11 ≤ 1, we have

  W_T^A(ω_1, ⋯, ω_N) − W_T^A(ω_N, ω_1, ⋯, ω_{N−1})
    = U(ω_1, ⋯, ω_k) − U(ω_N, ω_1, ⋯, ω_{k−1})
    = (ω_k − ω_N)[U(ω_1, ⋯, ω_{k−1}, 1) − U(ω_1, ⋯, ω_{k−1}, 0)] ≤ (1 − ω_N)Δmax,

  W_T^A(ω_1, ⋯, ω_N) − W_T^A(ω_N, ω_2, ⋯, ω_{N−1}, ω_1)
    = U(ω_1, ⋯, ω_k) − U(ω_N, ω_2, ⋯, ω_k)
    = (ω_1 − ω_N)[U(1, ω_2, ⋯, ω_k) − U(0, ω_2, ⋯, ω_k)] ≤ (p11 − p01)Δmax.

Lemma 2.7 thus holds for slot T. Assume now that Lemmas 2.6 and 2.7 hold for slots T, ⋯, t+1; we prove that they hold for slot t.

We first prove Lemma 2.6. Considering l < m, we distinguish three cases.

Case 1: l ≥ k+1, A(t) = {1, 2, ⋯, k}. In this case, we have

  W_t^A(ω_1, ⋯, ω_l, ⋯, ω_m, ⋯, ω_N) − W_t^A(ω_1, ⋯, ω_m, ⋯, ω_l, ⋯, ω_N)
    = (ω_l − ω_m)[W_t^A(ω_1, ⋯, 1, ⋯, 0, ⋯, ω_N) − W_t^A(ω_1, ⋯, 0, ⋯, 1, ⋯, ω_N)]
    = (ω_l − ω_m) β ∑_{E⊆A(t)} Pr(A(t), E)[Ŵ_{t+1}(Ω_E(t+1)) − Ŵ_{t+1}(Ω′_E(t+1))],

where

  Ω_E(t+1) = (P^E_{11}, Γ(ω_{k+1}), ⋯, p11, ⋯, p01, ⋯, Γ(ω_N), Q_{A(t),E}),
  Ω′_E(t+1) = (P^E_{11}, Γ(ω_{k+1}), ⋯, p01, ⋯, p11, ⋯, Γ(ω_N), Q_{A(t),E}).

It follows from the induction result that Ŵ_{t+1}(Ω_E(t+1)) ≥ Ŵ_{t+1}(Ω′_E(t+1)); hence

  W_t^A(ω_1, ⋯, ω_l, ⋯, ω_m, ⋯, ω_N) ≥ W_t^A(ω_1, ⋯, ω_m, ⋯, ω_l, ⋯, ω_N).

Case 2: l ≤ k and m ≥ k+1. Denote M ≜ N(k)∖{l}, and note that Q_{M,E} is the concatenation of Q̄_{M,E,l} and Q_{M,E,l}. In this case, we have

  W_t^A(ω_1, ⋯, ω_l, ⋯, ω_m, ⋯, ω_N) − W_t^A(ω_1, ⋯, ω_m, ⋯, ω_l, ⋯, ω_N)
    = (ω_l − ω_m)[W_t^A(ω_1, ⋯, 1, ⋯, 0, ⋯, ω_N) − W_t^A(ω_1, ⋯, 0, ⋯, 1, ⋯, ω_N)]
    = (ω_l − ω_m){U(ω_1, ⋯, 1, ⋯, ω_k) − U(ω_1, ⋯, 0, ⋯, ω_k)
      + β ∑_{E⊆M} Pr(M, E)[(1−ε) Ŵ_{t+1}(P^E_{11}, p11, Γ(ω_{k+1}), ⋯, p01, ⋯, Γ(ω_N), Q_{M,E})
      + ε Ŵ_{t+1}(P^E_{11}, Γ(ω_{k+1}), ⋯, p01, ⋯, Γ(ω_N), Q̄_{M,E,l}, p11, Q_{M,E,l})
      − Ŵ_{t+1}(P^E_{11}, Γ(ω_{k+1}), ⋯, p11, ⋯, Γ(ω_N), Q̄_{M,E,l}, p01, Q_{M,E,l})]}
    ≥ (ω_l − ω_m){Δmin + β ∑_{E⊆M} Pr(M, E)[(1−ε) Ŵ_{t+1}(p01, P^E_{11}, p11, Γ(ω_{k+1}), ⋯, Γ(ω_N), Q_{M,E})
      + ε Ŵ_{t+1}(p01, P^E_{11}, Γ(ω_{k+1}), ⋯, Γ(ω_N), Q_{M,E}, p11)
      − Ŵ_{t+1}(P^E_{11}, p11, Γ(ω_{k+1}), ⋯, Γ(ω_N), Q_{M,E}, p01)]}
    ≥ (ω_l − ω_m){Δmin − β[(1−ε)(1−p01)Δmax
      + ε(p11−p01)Δmax · (1 − [β(1−ε)(p11−p01)]^{T−t}) / (1 − β(1−ε)(p11−p01))]}
    ≥ (ω_l − ω_m){Δmin − β[(1−ε)(1−p01)Δmax + ε(p11−p01)Δmax / (1 − (1−ε)(p11−p01))]}
    ≥ 0,

where the first inequality follows from the induction result of Lemma 2.6, the second from the induction result of Lemma 2.7, and the third from condition (ii) of the lemma.

Case 3: l, m ≤ k. This case follows from Lemma 2.3 (symmetry).

Combining the three cases, Lemma 2.6 is proven for slot t.

We then proceed to prove Lemma 2.7, starting with the first inequality. We develop W_t^A with respect to ω_k and ω_N according to Lemma 2.4 as follows:



  W_t^A(ω_1, ⋯, ω_{k−1}, ω_k, ⋯, ω_{N−1}, ω_N) − W_t^A(ω_N, ω_1, ⋯, ω_{k−1}, ω_k, ⋯, ω_{N−1})
    = ω_k ω_N [W_t^A(ω_1, ⋯, ω_{k−1}, 1, ω_{k+1}, ⋯, ω_{N−1}, 1) − W_t^A(1, ω_1, ⋯, ω_{k−1}, 1, ω_{k+1}, ⋯, ω_{N−1})]  (term A)
    + ω_k(1 − ω_N)[W_t^A(ω_1, ⋯, ω_{k−1}, 1, ω_{k+1}, ⋯, ω_{N−1}, 0) − W_t^A(0, ω_1, ⋯, ω_{k−1}, 1, ω_{k+1}, ⋯, ω_{N−1})]  (term B)
    + (1 − ω_k)ω_N [W_t^A(ω_1, ⋯, ω_{k−1}, 0, ω_{k+1}, ⋯, ω_{N−1}, 1) − W_t^A(1, ω_1, ⋯, ω_{k−1}, 0, ω_{k+1}, ⋯, ω_{N−1})]  (term C)
    + (1 − ω_k)(1 − ω_N)[W_t^A(ω_1, ⋯, ω_{k−1}, 0, ω_{k+1}, ⋯, ω_{N−1}, 0) − W_t^A(0, ω_1, ⋯, ω_{k−1}, 0, ω_{k+1}, ⋯, ω_{N−1})]  (term D)    (2.39)

We proceed by upper-bounding the four terms in (2.39).

For term A, we have

  W_t^A(ω_1, ⋯, ω_{k−1}, 1, ω_{k+1}, ⋯, ω_{N−1}, 1) − W_t^A(1, ω_1, ⋯, ω_{k−1}, 1, ω_{k+1}, ⋯, ω_{N−1})
    = β ∑_{E⊆N(k−1)} Pr(N(k−1), E)[(1−ε) Ŵ_{t+1}(P^E_{11}, p11, Φ(k+1, N−1), p11, Q_{N(k−1),E})
      + ε Ŵ_{t+1}(P^E_{11}, Φ(k+1, N−1), p11, Q_{N(k−1),E}, p11)
      − (1−ε) Ŵ_{t+1}(p11, P^E_{11}, p11, Φ(k+1, N−1), Q_{N(k−1),E})
      − ε Ŵ_{t+1}(P^E_{11}, p11, Φ(k+1, N−1), p11, Q_{N(k−1),E})]
    ≤ 0,

where the inequality follows from the induction result of Lemma 2.6.

For term B, we have

  W_t^A(ω_1, ⋯, ω_{k−1}, 1, ω_{k+1}, ⋯, ω_{N−1}, 0) − W_t^A(0, ω_1, ⋯, ω_{k−1}, 1, ω_{k+1}, ⋯, ω_{N−1})
    = U(ω_1, ⋯, ω_{k−1}, 1) − U(0, ω_1, ⋯, ω_{k−1}) + β ∑_{E⊆N(k−1)} Pr(N(k−1), E)
      × [(1−ε) Ŵ_{t+1}(P^E_{11}, p11, Φ(k+1, N−1), p01, Q_{N(k−1),E})
      + ε Ŵ_{t+1}(P^E_{11}, Φ(k+1, N−1), p01, Q_{N(k−1),E}, p11)
      − Ŵ_{t+1}(P^E_{11}, p11, Φ(k+1, N−1), p01, Q_{N(k−1),E})]
    = U(ω_1, ⋯, ω_{k−1}, 1) − U(0, ω_1, ⋯, ω_{k−1}) + βε ∑_{E⊆N(k−1)} Pr(N(k−1), E)
      × [Ŵ_{t+1}(P^E_{11}, Φ(k+1, N−1), p01, Q_{N(k−1),E}, p11) − Ŵ_{t+1}(P^E_{11}, p11, Φ(k+1, N−1), p01, Q_{N(k−1),E})]
    ≤ Δmax,

where the inequality follows from the induction result of Lemma 2.6.

For term C, we have

  W_t^A(ω_1, ⋯, ω_{k−1}, 0, ω_{k+1}, ⋯, ω_{N−1}, 1) − W_t^A(1, ω_1, ⋯, ω_{k−1}, 0, ω_{k+1}, ⋯, ω_{N−1})
    = U(ω_1, ⋯, ω_{k−1}, 0) − U(1, ω_1, ⋯, ω_{k−1}) + β ∑_{E⊆N(k−1)} Pr(N(k−1), E)
      × [Ŵ_{t+1}(P^E_{11}, Φ(k+1, N−1), p11, Q_{N(k−1),E}, p01)
      − (1−ε) Ŵ_{t+1}(p11, P^E_{11}, p01, Φ(k+1, N−1), Q_{N(k−1),E})
      − ε Ŵ_{t+1}(P^E_{11}, p01, Φ(k+1, N−1), p11, Q_{N(k−1),E})]
    ≤ −Δmin + β ∑_{E⊆N(k−1)} Pr(N(k−1), E)[Ŵ_{t+1}(P^E_{11}, p11, Φ(k+1, N−1), Q_{N(k−1),E}, p01)
      − (1−ε) Ŵ_{t+1}(p01, p11, P^E_{11}, Φ(k+1, N−1), Q_{N(k−1),E})
      − ε Ŵ_{t+1}(p01, P^E_{11}, Φ(k+1, N−1), Q_{N(k−1),E}, p11)]
    ≤ −Δmin + β[(1−ε)(1−p01)Δmax + ε(p11−p01)Δmax · (1 − [β(1−ε)(p11−p01)]^{T−t}) / (1 − β(1−ε)(p11−p01))]
    ≤ −Δmin + β[(1−ε)(1−p01)Δmax + ε(p11−p01)Δmax / (1 − (1−ε)(p11−p01))]
    ≤ 0,

where the first inequality follows from the induction result of Lemma 2.6, the second from the induction result of Lemma 2.7, and the last from condition (ii) of Lemma 2.7.




For term D, we have

  W_t^A(ω_1, ⋯, ω_{k−1}, 0, ω_{k+1}, ⋯, ω_{N−1}, 0) − W_t^A(0, ω_1, ⋯, ω_{k−1}, 0, ω_{k+1}, ⋯, ω_{N−1})
    = β ∑_{E⊆N(k−1)} Pr(N(k−1), E)[Ŵ_{t+1}(P^E_{11}, Φ(k+1, N−1), p01, Q_{N(k−1),E}, p01)
      − Ŵ_{t+1}(P^E_{11}, p01, Φ(k+1, N−1), Q_{N(k−1),E}, p01)]
    = β ∑_{E⊆N(k−1)} Pr(N(k−1), E)[Ŵ_{t+1}(P^E_{11}, Φ(k+1, N−1), p01, Q_{N(k−1),E}, p01)
      − Ŵ_{t+1}(p01, P^E_{11}, Φ(k+1, N−1), Q_{N(k−1),E}, p01)]
    ≤ β ∑_{E⊆N(k−1)} Pr(N(k−1), E)[Ŵ_{t+1}(P^E_{11}, Φ(k+1, N−1), Q_{N(k−1),E}, p01, p01)
      − Ŵ_{t+1}(p01, P^E_{11}, Φ(k+1, N−1), Q_{N(k−1),E}, p01)]
    ≤ (1 − p01)βΔmax,

where the second equality follows from Lemma 2.3, the first inequality from the induction result of Lemma 2.6, and the second inequality from the induction result of Lemma 2.7.

Combining the bounds on the four terms, we have

  W_t^A(ω_1, ⋯, ω_N) − W_t^A(ω_N, ω_1, ⋯, ω_{N−1})
    ≤ ω_k(1 − ω_N)Δmax + (1 − ω_k)(1 − ω_N)(1 − p01)βΔmax
    ≤ ω_k(1 − ω_N)Δmax + (1 − ω_k)(1 − ω_N)Δmax
    = (1 − ω_N)Δmax,

which completes the proof of the first part of Lemma 2.7.

Finally, we prove the second part of Lemma 2.7. To this end, denote M ≜ {2, ⋯, k}; we have

  W_t^A(ω_1, ω_2, ⋯, ω_{N−1}, ω_N) − W_t^A(ω_N, ω_2, ⋯, ω_{N−1}, ω_1)
    = (ω_1 − ω_N)[W_t^A(1, ω_2, ⋯, ω_{N−1}, 0) − W_t^A(0, ω_2, ⋯, ω_{N−1}, 1)]
    = (ω_1 − ω_N){F(1, ω_2, ⋯, ω_k) − F(0, ω_2, ⋯, ω_k)
      + β ∑_{E⊆M} Pr(M, E)[(1−ε) Ŵ_{t+1}(P^E_{11}, p11, Φ(k+1, N−1), p01, Q_{M,E})


      + ε Ŵ_{t+1}(P^E_{11}, Φ(k+1, N−1), p01, p11, Q_{M,E})
      − Ŵ_{t+1}(P^E_{11}, Φ(k+1, N−1), p11, p01, Q_{M,E})]}
    ≤ (ω_1 − ω_N){Δmax + β ∑_{E⊆M} Pr(M, E)[(1−ε) Ŵ_{t+1}(P^E_{11}, p11, Φ(k+1, N−1), p01, Q_{M,E})
      + ε Ŵ_{t+1}(P^E_{11}, Φ(k+1, N−1), p01, p11, Q_{M,E})
      − Ŵ_{t+1}(P^E_{11}, Φ(k+1, N−1), p01, p11, Q_{M,E})]}
    = (ω_1 − ω_N){Δmax + β ∑_{E⊆M} Pr(M, E)(1−ε)[Ŵ_{t+1}(P^E_{11}, p11, Φ(k+1, N−1), p01, Q_{M,E})
      − Ŵ_{t+1}(P^E_{11}, Φ(k+1, N−1), p01, p11, Q_{M,E})]}
    ≤ (ω_1 − ω_N){Δmax + β ∑_{E⊆M} Pr(M, E)(1−ε)[Ŵ_{t+1}(P^E_{11}, p11, Φ(k+1, N−1), Q_{M,E}, p01)
      − Ŵ_{t+1}(p01, P^E_{11}, Φ(k+1, N−1), Q_{M,E}, p11)]}
    ≤ (ω_1 − ω_N){Δmax + β(1−ε)(p11 − p01)Δmax · (1 − [β(1−ε)(p11−p01)]^{T−t}) / (1 − β(1−ε)(p11−p01))}
    ≤ (p11 − p01)Δmax · (1 − [β(1−ε)(p11−p01)]^{T−t+1}) / (1 − β(1−ε)(p11−p01)),

where the first two inequalities follow from recursive application of the induction result of Lemma 2.6, and the third inequality from the induction result of Lemma 2.7. This completes the proof of Lemmas 2.6 and 2.7. □

References

1. P. Whittle, Restless bandits: Activity allocation in a changing world. J. Appl. Prob. 25A, 287–298 (1988)
2. Q. Zhao, L. Tong, A. Swami, Y. Chen, Decentralized cognitive MAC for opportunistic spectrum access in ad hoc networks: A POMDP framework. IEEE J. Sel. Areas Commun. 25(3), 589–600 (2007)
3. C.H. Papadimitriou, J.N. Tsitsiklis, The complexity of optimal queueing network control. Math. Oper. Res. 24(2), 293–305 (1999)
4. R.R. Weber, G. Weiss, On an index policy for restless bandits. J. Appl. Prob. 27(1), 637–648 (1990)
5. K. Liu, Q. Zhao, Indexability of restless bandit problems and optimality of Whittle index for dynamic multichannel access. IEEE Trans. Inf. Theory 56(11), 5547–5567 (2010)
6. Q. Zhao, B. Krishnamachari, K. Liu, On myopic sensing for multi-channel opportunistic access: Structure, optimality, and performance. IEEE Trans. Wirel. Commun. 7(3), 5431–5440 (2008)
7. S. Ahmad, M. Liu, T. Javidi, Q. Zhao, B. Krishnamachari, Optimality of myopic sensing in multi-channel opportunistic access. IEEE Trans. Inf. Theory 55(9), 4040–4050 (2009)
8. S. Ahmad, M. Liu, Multi-channel opportunistic access: A case of restless bandits with multiple plays, in Allerton Conference, Monticello, IL, Sep–Oct 2009
9. K. Wang, L. Chen, On optimality of myopic policy for restless multi-armed bandit problem: An axiomatic approach. IEEE Trans. Signal Process. 60(1), 300–309 (2012)
10. K. Wang, L. Chen, On the optimality of myopic sensing in multi-channel opportunistic access: The case of sensing multiple channels. IEEE Wireless Commun. Lett. 1(5), 452–455 (2012)
11. Y. Chen, Q. Zhao, A. Swami, Joint design and separation principle for opportunistic spectrum access in the presence of sensing errors. IEEE Trans. Inf. Theory 54(5), 2053–2071 (2008)
12. K. Liu, Q. Zhao, B. Krishnamachari, Dynamic multichannel access with imperfect channel state detection. IEEE Trans. Signal Process. 58(5), 2795–2807 (2010)

Chapter 3

Whittle Index Policy for Opportunistic Scheduling: Heterogeneous Two-State Channels

3.1 Introduction

3.1.1 Background

We consider an opportunistic communication system with heterogeneous¹ Gilbert-Elliot channels [1], in which a user is limited to sensing and transmitting on only one channel at a time due to limited sensing capability². Given that channel sensing in practice is not perfect, the fundamental optimization problem addressed in this chapter is how the user exploits the imperfect sensing results and the stochastic properties of the channels to maximize its utility (e.g., expected throughput) by switching among channels opportunistically.

3.1.2 Main Results and Contributions

The central pivot of Whittle index policy analysis is to establish the indexability of the problem and to compute the corresponding index. In our problem, for a subset of specific scenarios characterized by the corresponding parameter spaces (e.g., [2, 3]), the Whittle index policy degenerates to the myopic policy. Beyond those scenarios, however, the structure of the index-based policy is still open, which is the focus of this chapter (cf. Table 3.1).

¹ In Chap. 2, homogeneous Gilbert-Elliot channels are studied.
² The technical analysis in this chapter can be extended to the case where a user is allowed to sense a fixed number of channels.


Table 3.1 Summary of related work and this chapter

  Parameter domain                                    | Policy        | Optimality
  p11 ≥ p01, ε ≤ p01(1−p11)/(p11(1−p01))              | Myopic policy | Globally optimal [2]
  p11 ≤ p01, ε ≤ p11(1−p01)/(p01(1−p11))              | Myopic policy | Globally optimal [3]
  εi ≤ (1 − max{p11^(i), p01^(i)}) min{p11^(i), p01^(i)}
       / ((1 − min{p11^(i), p01^(i)}) max{p11^(i), p01^(i)}) | Index policy | Locally optimal [this chapter]
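The ε-thresholds in Table 3.1 are simple closed forms. The helper below is our illustration (not from the book) and evaluates the per-channel bound of the last row; the first two rows are its specializations.

```python
def eps_bound(p11, p01):
    """Per-channel false-alarm bound of Table 3.1 (last row):
    (1 - max{p11,p01}) min{p11,p01} / ((1 - min{p11,p01}) max{p11,p01})."""
    hi, lo = max(p11, p01), min(p11, p01)
    return (1 - hi) * lo / ((1 - lo) * hi)

print(eps_bound(0.8, 0.2))   # positively correlated example: 0.0625
print(eps_bound(0.2, 0.8))   # negatively correlated example: 0.0625
```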

The major technical challenge in establishing indexability in our problem comes from imperfect sensing: the false alarm rate enters the propagation of belief information and makes the value function nonlinear, unlike in existing studies. As a result, the traditional approach to computing the Whittle index cannot be used in this context. To the best of our knowledge, no closed-form Whittle index exists for the nonlinear case; only numerical simulation has been conducted, under a strong assumption on indexability [4]. To address the challenge caused by nonlinearity, we investigate the fixed points of the (nonlinear) belief evolution function, based on which we establish a set of periodic structures of the resulting dynamic system. We then use the derived properties to linearize the value function piecewise, prove Whittle indexability, and derive the closed-form Whittle index. Our results thus solve the multichannel opportunistic scheduling problem under imperfect channel sensing by establishing its indexability and constructing the corresponding index policy. Due to the generality of the problem, our results can be applied in a wide range of engineering applications where the underlying optimization problems can be cast as restless bandits with imperfect sensing of bandit states. The terminology and analysis in this chapter should therefore be understood generically.

3.2 Related Work

Opportunistic channel access can be cast as an RMAB problem, which is proved to be PSPACE-hard [5]. To the best of our knowledge, very few results have been reported on the structure of the optimal policy of a generic RMAB due to its high complexity. The myopic strategy, owing to its simple and tractable structure, has recently attracted extensive research attention. It consists of sensing the channels that maximize the expected immediate reward while ignoring the impact of the current decision on future reward. Along this line, the optimality of the myopic policy was partially established for the homogeneous Gilbert-Elliot channel case under perfect sensing [6]. In [7], the authors studied the case of heterogeneous channels and derived a set of closed-form sufficient conditions to guarantee the optimality of the myopic policy. In [8], the authors proposed a sufficient-condition framework for the optimality of the myopic policy. In [9], the authors gave sufficient conditions for multistate channels. For imperfect sensing of Gilbert-Elliot channels, Liu et al. [10] proved the optimality of the myopic policy for the special case of two channels. In [2, 3, 11], the authors derived closed-form conditions guaranteeing the optimality of the myopic policy for an arbitrary number of channels. Generally speaking, the structure of the optimal access policy is characterized only for a subset of the parameter space, in which the myopic policy is proved optimal. Beyond this parameter space, we need to turn to a more generic policy, the Whittle index policy, introduced by Whittle [27]. The Whittle index policy has been a very popular heuristic for restless bandits; while suboptimal in general, it is provably optimal in an asymptotic sense [12, 13] and has good empirical performance. The Whittle index policy and its variants have been studied extensively in engineering applications, e.g., sensor scheduling [4, 14], multi-UAV coordination [15], crawling of web content [16], channel allocation in wireless networks [17, 18], and job scheduling [19–21]. More comprehensive treatments of indexable restless bandits can be found in [22–24].


Table 3.2 Main notations

  Symbol  | Description
  N       | The set of N channels, i.e., 1, 2, ..., N
  P(i)    | State transition matrix of channel i
  T       | The total number of time slots
  t       | Time slot index
  εi      | False alarm probability of channel i
  δi      | Miss detection probability of channel i
  Si(t)   | The state of channel i in slot t
  ωi(t)   | The conditional probability that channel i is "good"
  Ω(t)    | Channel state belief vector at slot t
  Ω(0)    | The initial channel state belief vector
  Oi(t)   | The observation state of channel i
  πt      | The mapping from Ω(t) to A(t)
  an(t)   | The action on channel n in slot t
  β       | Discount factor

3.3 System Model

Table 3.2 summarizes the main notation used in this chapter. We consider a time-slotted multichannel opportunistic communication system in which a user is able to access a set N of N independent channels, each characterized by a two-state Markov chain with states good (1) and bad (0). The state transition matrix P(i) of channel i (i ∈ N) is given as follows:


" P

ðiÞ

¼

ðiÞ

1  p01

ðiÞ

1  p11

ðiÞ

p01

#

ðiÞ

p11
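As a quick sanity check on this parameterization, the sketch below (ours, for illustration) iterates the unobserved belief update Γi; its fixed point ω0(i) = p01(i)/(1 − p11(i) + p01(i)) is used repeatedly in the later fixed-point analysis (cf. Lemmas 3.1 and 3.2).

```python
def gamma(w, p11, p01):
    # one-step belief update of an unobserved channel: ω ↦ ω p11 + (1−ω) p01
    return w * p11 + (1 - w) * p01

def stationary_belief(p11, p01):
    # closed-form fixed point of gamma: ω0 = p01 / (1 − p11 + p01)
    return p01 / (1 - p11 + p01)

p11, p01, w = 0.7, 0.4, 0.1
for _ in range(30):                      # iterated Γ converges to ω0
    w = gamma(w, p11, p01)
print(w, stationary_belief(p11, p01))    # both ≈ 0.5714
```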

We assume that the channels go through state transitions at the beginning of each slot t. The system operates in a synchronously time-slotted fashion, with slots indexed by t (t = 0, 1, ⋯). Due to hardware constraints and energy cost, the user is allowed to sense only one of the N channels in each slot. We assume that the user makes the channel selection decision at the beginning of each slot, after the channel state transition. Once a channel is selected, the user detects the channel state Si(t), which can be considered a binary hypothesis test:

  ℋ0: Si(t) = 1 (good)  vs.  ℋ1: Si(t) = 0 (bad).

The performance of the state detection of channel i is characterized by the probability of false alarm εi and the probability of miss detection δi:

  εi := Pr{decide ℋ1 | ℋ0 is true},
  δi := Pr{decide ℋ0 | ℋ1 is true}.

Based on the imperfect detection outcome in slot t, the user determines whether to access channel i for transmission. We denote the action on channel n made by the user in slot t by an(t), i.e.,

  an(t) = 1 if channel n is chosen in slot t, and an(t) = 0 otherwise.

Thus ∑_{n=1}^{N} an(t) = 1 for all t, indicating that exactly one channel is chosen in each slot.

Since failed transmissions may occur, acknowledgments (ACKs) are necessary to ensure guaranteed delivery. Specifically, when the receiver successfully receives a packet over a channel, it sends an acknowledgment to the transmitter over the same channel at the end of the slot; otherwise, the receiver does nothing. A NAK is thus defined as the absence of an ACK, which occurs when the transmitter did not transmit over the channel, or transmitted while the channel was busy in this slot. We assume that acknowledgments are received without error, since they are always transmitted over idle channels.

Obviously, by imperfectly sensing only one of the N channels, the user cannot observe the state of the whole system. Hence, the user has to infer the channel states from its decision and observation history in order to make future decisions. To this end, we define the channel state belief vector (hereinafter referred to as the belief vector for brevity) Ω(t) := {ωi(t), i ∈ N}, where 0 ≤ ωi(t) ≤ 1 is the conditional probability that channel i is in the good state (i.e., Si(t) = 1) given the decision and observation history. To ensure that the user and its intended receiver tune to the same channel in each slot, channel selection should be based on the common observation K(t) ∈ {0 (NAK), 1 (ACK)} rather than on the detection outcome at the transmitter. Given the sensing action {ai(t)}_{i∈N} and the observation K(t), the belief vector in slot t+1 can be updated recursively using Bayes' rule, as shown in (3.1):

  ωi(t+1) = p11(i),       if ai(t) = 1, K(t) = 1,
            Ψi(ωi(t)),    if ai(t) = 1, K(t) = 0,
            Γi(ωi(t)),    if ai(t) = 0,    (3.1)

where

  Γi(ωi(t)) := ωi(t) p11(i) + (1 − ωi(t)) p01(i),    (3.2)
  φi(ωi(t)) := εi ωi(t) / (1 − (1 − εi) ωi(t)),      (3.3)
  Ψi(ωi(t)) := Γi(φi(ωi(t))).                        (3.4)

We would like to emphasize that the sensing error introduces technical complications in the system dynamics (i.e., φi(ωi(t))) due to its nonlinearity. Therefore, the analysis methods and results [7, 25, 26] in the perfect sensing case where the belief evolution is linear cannot be applied to the scenario with sensing error.
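A compact implementation of the update rule (3.1)–(3.4) may help fix ideas; the sketch below is our illustration (not from the book), with the channel index dropped for brevity.

```python
def belief_update(w, sensed, ack, p11, p01, eps):
    """One-step belief update (3.1)-(3.4) for a single channel.
    w: current belief that the channel is good; eps: false alarm prob."""
    gamma = lambda x: x * p11 + (1 - x) * p01        # (3.2): unobserved
    if not sensed:
        return gamma(w)                               # a_i(t) = 0
    if ack:                                           # a_i(t) = 1, K(t) = 1
        return p11
    phi = eps * w / (1 - (1 - eps) * w)               # (3.3): posterior, no ACK
    return gamma(phi)                                 # (3.4): Ψ(ω) = Γ(φ(ω))

# example: sensed channel, no ACK observed
print(belief_update(0.6, sensed=True, ack=False, p11=0.8, p01=0.2, eps=0.1))
```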

3.4 Problem Formulation

In this section, we formulate the optimization problem of opportunistic multichannel access faced by the user. Mathematically, let π = {π(t)}_{t≥0} denote the sensing policy, with π(t) defined as a mapping from the belief vector Ω(t) to the action of sensing one channel in slot t:

  π(t): Ω(t) ↦ {1, 2, ⋯, N},  t = 0, 1, 2, ⋯    (3.5)

Let

  an^π(t) = 1 if channel n is chosen under π(t), and an^π(t) = 0 otherwise.    (3.6)

Let Πn := {an^π(t) : t ≥ 0} be the policy space of channel n under the sensing policy π; then Π = ∪_{n=1}^{N} Πn is the joint policy space.

3 Whittle Index Policy for Opportunistic Scheduling: Heterogeneous Two-State. . .

We are interested in the user’s optimization problem to find the optimal sensing policy π  that maximizes the expected total discounted reward over an infinite horizon. The following gives the formal definition of the optimal sensing problem:

f

OrigP : max  π2Π

1 X

N X

t¼0

N X

s:t:

βt

n¼1

    aπn ðtÞ 1  εn ωn ðtÞ

g

ð3:7Þ

aπn ðt Þ ¼ 1, t ¼ 0, 1, ⋯, 1

ð3:8Þ

n¼1

where the constraint (3.8) shows that only one channel can be chosen each time. In the following, we decompose OrigP into N similar subproblems. By relaxing the constraint (3.8), we have 1 P



t¼0

βt

N P

n¼1 1 P

N P 1 P

aπn ðt Þ ¼

βt

n¼1 t¼0 1 P

t¼0

Divided by

1 P

βt aπn ðt Þ ¼ βt

N X n¼1

t¼0

1 P t¼0

βt aπn ðt Þ

1 P

ð3:9Þ βt

t¼0

βt , we transform OrigP to the following relaxed problem RelxP.

t¼0

RelxP : max  π2Π

N X n¼1

s:t:

f

1 P t¼0

N X n¼1

βt ðaπn ðtÞð1  εn Þωn ðtÞÞ 1 P

βt

t¼0 1 P t¼0

g

ð3:10Þ

βt aπn ðt Þ

1 P

¼1 β

ð3:11Þ

t

t¼0

By introducing Lagrange multiplier v, we can rewrite (3.10) as follows.

max  π2Π

N X n¼1

f

1 P t¼0

βt ðaπn ðtÞð1  εn Þωn ðtÞ þ νð1  aπn ðtÞÞÞ 1 P t¼0

βt

g

ð3:12Þ

3.5 Technical Preliminary: Indexability and Whittle Index

43

We further decompose (3.12) into N subproblems:

f

max 

π n 2Πn

1 P t¼0

βt ðaπn n ðtÞð1  εn Þωn ðtÞ þ νð1  aπnn ðtÞÞÞ 1 P

β

t

t¼0

g

ð3:13Þ

Considering the constant β(0 b β < 1), we have the following subproblem subP  n: 1 X      βt π n   an t 1  εn ωn t þ ν 1  aπnn t subP  n : max  t¼0 ð3:14Þ π n 2Πn

f

g

To solve the original optimization problem OrigP, we first seek the optimal policy π n for subproblem subP  nðn 2 N Þ, and then construct a feasibly approximation  policy π ¼ π 1 , π 2 , ⋯, π N for the original problem OrigP.

3.5

Technical Preliminary: Indexability and Whittle Index

Let Vβ, v(ω) be the value function corresponding to the subproblem (3.14), which denotes the maximum discounted reward accrued from a single-armed bandit process with subsidy v when the initial belief state is ω(0). Considering two possible actions (choose or not) in each slot, we have n      o V β,v ω ¼ max V β,v ω; a ¼ 0 V β,v ω; a ¼ 1

ð3:15Þ

where, V β,v ðω; a ¼ 0Þ ¼ v þ βV β,v ðΓðωÞÞ h i V β,v ðω; a ¼ 1Þ ¼ ð1  εÞω þ β ð1  εÞωV β,v ðp11 Þ þ ð1  ð1  εÞωÞV β,v ðΨðωÞÞ Vβ, v(ω; a ¼ 1) denotes the reward obtained by taking action a in the first slot following by the optimal policy in future slots, and Vβ, v(ω; a ¼ 0) denotes the sum of the subsidy v obtained in the current slot under the passive action (a ¼ 0) and the total discounted future reward βVβ, v(Γ(ω)). Remark 3.1 In an infinite time horizon, a decision should be made at each slot, and the different decision leads to different evolution of belief information ω. Thus, in the following, we call (3.15) a dynamic system without introducing ambiguity.

3 Whittle Index Policy for Opportunistic Scheduling: Heterogeneous Two-State. . .

44

Remark 3.2 We would like to point out that Vβ, v(Ψ(ω)) (specifically φ(ω)) brings about the nonlinear belief update of the dynamic system (3.15), and leads to the complicated characteristics of the Whittle index. The optimal action a* for the belief state ω under subsidy v is given by ( 

a ¼

1,

if V β,v ðω; a ¼ 1Þ > V β,v ðω; a ¼ 0Þ

0,

otherwise:

ð3:16Þ

We define the passive set P ðvÞ under subsidy v as n  o P ðvÞ≔ ω : V β,v ðω; a ¼ 1ÞbV β,v ω; a ¼ 0

ð3:17Þ

We next introduce some definitions related to the indexability of our problem. Definition 3.1 (Indexability) Problem (3.14) is indexable if the passive set P ðvÞ of the corresponding single-armed bandit process with subsidy v monotonically increases from ϕ to the whole state space [0, 1] as v increases from  1 to + 1. Under the indexability condition, Whittle index is defined as follows: Definition 3.2 (Whittle index [27]) If Problem (3.14) is indexable, its Whittle index W(ω) of the state ω is the infimum subsidy v such that it is optimal to make the arm passive at ω. Equivalently, Whittle index is the infimum subsidy v that makes the passive and active actions equally rewarding n  o W ðωÞ ¼ inf v : V β,v ðω; a ¼ 1ÞbV β,v ω; a ¼ 0

ð3:18Þ

Definition 3.3 (Threshold Policy) Given a certain v, there exists ω(0 b ω b 1) such that Vβ, v(ω; 1) ¼ Vβ, v(ω; 0). The threshold policy is defined as follows:     a ¼ 1 for any ω ω < ω b 1 while a ¼ 0 for any ω 0 b ω < ω , or     a ¼ 0 for any ω ω < ω b 1 while a ¼ 1 for any ω 0 b ω < ω Definition 3.4 Problem (3.14) is CMI-indexable if the subsidy v computed by the threshold policy is a continuous and monotonically increasing (CMI) function of ω.

3.6

Whittle Index and Scheduling Rule

In this section, we summarize the main results of our paper. The detailed analysis and proofs of the results will be presented in later sections.

3.6 Whittle Index and Scheduling Rule

3.6.1

45

Whittle Index

Our central result is the establishment of the CMI indexability of the opportunistic multichannel access problem, as stated in the following theorem. Theorem 3.1 Given εi b

ðiÞ

ðiÞ

ðiÞ

ðiÞ

ðiÞ

ðiÞ

ðiÞ

ðiÞ

ð1 max fp11 , p01 gÞ min fp11 , p01 g ð1 min fp11 , p01 gÞ max fp11 , p01 g

ð8i 2 N Þ, Problem (3.14) is

CMI-indexable. To prove the indexability, we need to prove the continuity and increasing monotonicity of v in ω. Thus, we first derive the closed form v, and then prove that v is continuous and monotonically increasing in ω. Based on the definition of threshold policy, we obtain the Whittle index in the following theorem. Theorem 3.2 The Whittle index Wβ(ω) for channel i is given as follows. (i) The case of positively correlated channels, i.e., p11 bp01 : See (3.19). ðiÞ

(ii) The case of negatively correlated channels, i.e.,

ðiÞ p11

ðiÞ

ðiÞ

⩾ p01 : See (3.20).

For the case of optimizing average reward, i.e., β ¼ 1, we derive the Whittle index W ðωÞ ¼ lim W β ðωÞ as follows. β!1

Theorem 3.3 The Whittle index W(ω) for channel i is given as follows. (i) The case of positively correlated channels, i.e., p11 bp01 : See (3.21). ðiÞ

ðiÞ

ðiÞ

ðiÞ

(ii) The case of negatively correlated channels, i.e., p11 ⩾ p01 : See (3.22). The following corollary bridges our results with existing body of works on myopic policy by showing that in a particular case with stochastically identical channels, the Whittle index-based policy we derive degenerates to the myopic policy. Corollary 3.1 Wβ(ω) is a monotonically non-decreasing function of ω. As a consequence, the Whittle index policy is equivalent to the myopic (or greedy) policy for the considered RMAB with stochastically identical channels.

3.6.2

Scheduling Rule

Based on the Whittle index, we can construct the index-based access policy for OrigP: in each slot, choose the channel i* = argmax_{i∈N} Wβ(ωi) in the discounted case, and i* = argmax_{i∈N} W(ωi) when optimizing the average reward.
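The resulting scheduler is one line of logic per slot. The sketch below is our illustration, with `whittle_index` a hypothetical placeholder standing in for the closed-form expressions (3.19)–(3.22).

```python
def schedule(beliefs, whittle_index):
    """Index-based access rule: sense the channel with the largest index.
    `whittle_index(i, w)` should implement (3.19)-(3.22) per channel."""
    return max(range(len(beliefs)), key=lambda i: whittle_index(i, beliefs[i]))

# Toy usage with a placeholder index function:
idx = lambda i, w: w                   # reduces to the myopic rule
print(schedule([0.2, 0.7, 0.5], idx))  # -> 1
```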

3 Whittle Index Policy for Opportunistic Scheduling: Heterogeneous Two-State. . .

46

3.6.3

Technical Challenges

The main challenges in obtaining the indexability result in our problem comes from the nonlinear operator Ψi(), summarized as follows: 1. The nonlinear operator Ψi() brings about nonlinear propagation of belief information in the evolution of the dynamic system. 2. The value function Vβ, v(ω) is also nonlinear and intractable to compute due to the nonlinearity of Ψi(). To address the above challenges, we analyze the fixed points of the operators Γi, Ψi, as well as their combinations, and divide the belief information space into a series of regions using the fixed points. We then establish a set of periodic structures of the underlying nonlinear dynamic evolving system, based on which we further devise the linearization scheme for each region.

W β ðωÞ ¼

f

f

ðiÞ

ð1  εi Þðβp01 þ ðω  βΓi ðωÞÞÞ

ðiÞ ðiÞ , if p11 ω < ω0 ðiÞ ðiÞ ðiÞ 2 1 þ βðp01  εi p11 Þ  β ð1  εi ÞΓi ðp11 Þ  βð1  εi Þðω  Γi ðωÞÞ ðiÞ ð1  εi Þðβp01 þ ð1  βÞωÞ ðiÞ ðiÞ , if ω0 < ω < Γi ðp11 Þ ðiÞ ðiÞ ðiÞ 2 1 þ βðp01  εi p11 Þ  β ð1  εi ÞΓi ðp11 Þ  βð1  βÞð1  εi Þω iÞ ð1  εi Þðβp01 þ ð1  βÞωÞ ðiÞ ðiÞ 0 , if Γi ðp11 Þ ω < ω ðiÞ ðiÞ 1 þ βðp01  εi p11 Þ  βð1  εi Þω

b

b

ðiÞ

0 ð1  εi Þω, if ω

bωbpðiÞ 01 ð3:19Þ

ð1  εi Þω, if p01 bωbω0 ðiÞ

1,ðiÞ

    ðiÞ    W β ωn0ðiÞ  W β ωn, 0 n,ðiÞ n,ðiÞ n,ðiÞ þ ω  ω0 , if ω0 < ω < ω0 and n ¼ 1,2,⋯ n,ðiÞ n,ðiÞ ω0  ω0      ð1  εi Þ 1  βnþ1 ðω  βΓi ðωÞÞ þ C 6 n,ðiÞ ðiÞ , if ω0 bω < Γni φ p11 ,n ¼ 1,2,⋯ W β ðωÞ ¼ C 0 ðω  βΓi ðωÞÞ þ C 7      ð1  εi Þ 1  βnþ1 ðω  βΓi ðωÞÞ þ C 6 ðiÞ nþ1,ðiÞ   , if Γni φ p11 bω < ω0 ,n ¼ 1,2,⋯ nþ1 ð1  ε i Þ β  β ðω  βΓi ðωÞÞ þ C 9 ð1  εi Þω   , if ωð0iÞ bωbpð11iÞ ðiÞ 1  βð1  εi Þ p11  ω 

n,ðiÞ W β ω0

ð3:20Þ

3.6 Whittle Index and Scheduling Rule

47

8   ðiÞ > > ð1  εi Þ p01 þ ω  Γi ðωÞ > ðiÞ ðiÞ > >   , if p11 bω < ω0 > > ðiÞ ðiÞ ðiÞ > > 1 þ p01  εi p11  ð1  εi ÞΓi p11  ð1  εi Þðω  Γi ðωÞÞ > > > > > ðiÞ   > > ð1  εi Þp01 <   , if ωð0iÞ bω < Γi pð11iÞ W ðωÞ ¼ 1 þ pðiÞ  ε pðiÞ  ð1  ε ÞΓ pðiÞ i 11 i i > 01 11 > > > > ðiÞ   > > ð1  εi Þp01 ðiÞ ðiÞ > > , if Γ p11 bω < ω0 > i > ðiÞ ðiÞ > 1 þ p  ε p  ð 1  ε Þω > i 11 i 01 > > > ðiÞ ðiÞ : ð1  εi Þω, if ω0 bωbp01 ð3:21Þ 8 ð iÞ 1,ðiÞ > > > ð1  εi Þω, if p01 bωbω0     > > ðiÞ n,ðiÞ > >     W ωn,  W ω0 > 0 > n, ð i Þ n, ð i Þ n,ðiÞ n,ðiÞ > > , if ω0 < ω < ω0 and n ¼ 1, 2,⋯ þ ω  ω0 > W ω0 n,ðiÞ n,ðiÞ > > ω  ω > 0 0 >   > > ð iÞ > >    ð1  εi Þðn þ 1ÞðΓi ðωÞ  ωÞ  ð1  εÞΓni p01 > > n,ðiÞ ð iÞ >   < , if ω0 bω < Γni φ p11 ,n ¼ 1,2,⋯ ð iÞ W ðωÞ ¼ ð1  εi Þ n þ 1  ð1  εÞp11 ðΓi ðωÞ  ωÞ þ C 07 > >   > > ð iÞ >    > ð1  εÞðn þ 1ÞðΓi ðωÞ  ωÞ  ð1  εÞΓni p01 > ðiÞ > > h i     , if Γni φ pð11iÞ bω < ωnþ1, , n ¼ 1, 2,⋯ > 0 > ð i Þ ð i Þ ð i Þ n n > > ð 1  ε Þ n Γ ð ð ω Þ  ω Þ þ p p p  1  Γ þ εΓ i > i i 11 01 11 > > > > > ð1  εi Þω > >   , if ωð0iÞ bωbpð11iÞ > > ð iÞ : 1  ð1  εi Þ p11  ω

ð3:22Þ where, (i)

(i)

C0 =(1 − εi )β [1 − β n p11 (1 − εi ) − β n+1 (1 − (1 − εi )p11 )],

  ðiÞ C 6 ¼ ð1  εi Þð1  βÞβnþ1 Γni p01     ðiÞ ðiÞ C 7 ¼ εi ð1  βÞβnþ1 1  βð1  εi Þp11 Γni p11 h  i   ðiÞ ðiÞ þ ð1  βÞβnþ1 1 þ βð1  εi Þ 1  p11 Γni p01       ðiÞ ðiÞ ðiÞ ðiÞ p11  ð1  εi Þð1  βÞβnþ1 1  p11 Γn1 p01 εi ð1  εi Þð1  βÞβnþ1 p11 Γn1 i i h i ðiÞ þð1  βÞ 1  βð1  εi Þp11

3 Whittle Index Policy for Opportunistic Scheduling: Heterogeneous Two-State. . .

48

h i   h  i   ðiÞ ðiÞ ðiÞ ðiÞ C07 ¼ εi 1  ð1  εi Þp11 Γni p11  1 þ ð1  εi Þ 1  p11 Γni p01       ðiÞ ðiÞ ðiÞ ðiÞ ðiÞ p11 þ ð1  εi Þ 1  p11 Γn1 p01 þ ð1  εi Þp11  1 þεi ð1  εi Þp11 Γn1 i i h i h    i ðiÞ ðiÞ ðiÞ C 9 ¼ ð1  βÞ 1  βð1  εi Þp11 þ ð1  βÞβnþ1 Γni p01  εi Γni p11

3.7

Fixed Point Analysis

In this section, we derive the fixed points of the mappings ri() and Wi() and their structural properties. To make our analysis concise, we omit the channel index i. Lemma 3.1 (Fixed point of Γ()) Consider the case p01 structural properties of Γ(ω(t)) hold (see Fig. 3.1):

b

p11, the following

(i) Γ(ω(t)) is monotonically increasing in ω(t); p01  Γðωðt ÞÞ  p11 , 80  ωðt Þ  1 (ii) Γk(ω(t)) ¼ Γ(Γk  1(ω(t))) monotonically converges to ω0 ≔ 1ðpp01p Þas k!1. 11

01

Proof Noticing that Γ(ω(t)) can be written as Γ(ω(t)) ¼ ( p11  p01)ω(t) + p01, Lemma 3.1 holds straightforwardly. □ Lemma 3.2 (Fixed point of Γ() : p01 > p11) Denote Γ0(ω) ¼ ω and Γk(ω) ¼ Γ(Γk  1(ω)), then Γ2k(ω) and Γ2k + 1(ω)(ω 2 [p11, p01]) converge, from opposite directions, to ω0 ≔ 1ðpp01p Þ as k ! 1 (see Fig. 3.2). In particular, we 11 01 have

[Fig. 3.1 Γ^k(ω) evolution in k (p11 ≥ p01). Red line: ω < ω0; green line: ω > ω0. Axes: Γ^k(ω) ∈ [p01, p11] versus k = 0, …, 16.]

3.7 Fixed Point Analysis

49

[Fig. 3.2 Γ^k(ω) evolution in k (p11 < p01). Upper panel: p11 ≤ ω ≤ ω0; lower panel: ω0 ≤ ω ≤ p01. Red line: envelope; green line: evolution in k. Axes: Γ^k(ω) ∈ [p11, p01] versus k = 0, …, 16.]

k 1 p 01

0

p 11 0

0

2

4

6

8

k

Γk ðωÞ > ω if p11 bω < ω0 ; Γ k ð ω0 Þ ¼ ω0 ; Γk ðωÞbω if ω0 bω < p01 : Proof It is easy to obtain the lemma, noticing Γ(ω) ¼ ( p11  p01)ω + p01 and 1 < p11  p01 < 0. Lemma 3.3 When εb ð1 min fp

ð1 max fp11 , p01 gÞ min fp, p01 g , 11 , p01 gÞ max fp11 , p01 g

then

(i) φ(ω(t)) monotonically increases with ω(t); φðωðt ÞÞb min fp11 , p01 g, min fp11 , p01 gbωðt Þb max fp11 , p01 g φð0Þ ¼ 0, φð1Þ ¼ 1 Proof According to (3.3) and (3.4), it is easy to obtain the results. ■



50

3 Whittle Index Policy for Opportunistic Scheduling: Heterogeneous Two-State. . .

[Fig. 3.3 Functions Ψ²(ω), Ψ(ω), and ω versus ω (p11 < p01).]

Lemma 3.4 Given p01 > p11, there exists ω0 2 [Γ( p11), p01] (see Fig. 3.3) such that ΨðωÞ > ω, if Γðp11 Þbω < ω0 Ψ ð ω0 Þ ¼ ω0 ΨðωÞ < ω, if ω0 bω < p01 Proof Since φ(ω) monotonically increases with ω while Γ(ω) monotonically decreases with ω when p11 < p01, we obtain that Ψ(ω) ¼ Γ(φ(ω)) decreases monotonically with ω, and moreover, φ(ω) is concave in ω since the following condition:

f

∂½ΨðωÞ εðp01  p11 Þ ¼ Γðp11 Þ Ψðp01 Þ ¼ Γðφðp01 ÞÞ < Γðφð0ÞÞ ¼ Γð0Þ ¼ p01

we know that Ψ(ω) and ω must have a unique intersection at some point ω0 as shown in Fig. 3.6, taking into account the concavity of Ψ(ω) and the linearity of to, which indicates that Ψ(ω) > ω0 > ω when Γ( p11) b ω < ω0 while Ψ(ω) b ω0 b ω when ω0 bω < p01 □

3.7 Fixed Point Analysis

51

[Fig. 3.4 Ψ^k(ω) evolution in k (p11 < p01). Upper panel: ω0 ≤ ω ≤ p01; lower panel: Γ(p11) ≤ ω ≤ ω0. Red line: envelope; green line: evolution in k. Axes: Ψ^k(ω) versus k = 0, …, 8.]

Lemma 3.5 (Fixed point of Ψ(ω): p01 > p11) Let Ψ0(ω) ¼ ω and Ψk(ω) ¼ Ψ(Ψk  1(ω)), Ψ2k(ω) and Ψ2k + 1(ω)(ω 2 [Γ( p11), p01]) converge, from opposite directions, to ω0 as k!1 (see Fig. 3.4). In particular, we have Ψk ðωÞbω if ω0 bω < p01 ; Ψ k ð ω0 Þ ¼ ω 0 ; Ψk ðωÞ > ω if Γðp11 Þbω < ω0 : Proof We prove the lemma in two different cases. Case 1. When ω0 b ω < p01, to show Ψi(ω) b ω, it is sufficient to prove (

ω ⩾ Ψ0 ðωÞ > ⋯ > Ψ2k ðωÞ > Ψ2kþ2 ðωÞ > ⋯ > ω0

Ψ1 ðωÞ < ⋯ < Ψ2kþ1 ðωÞ < Ψ2kþ3 ðωÞ < ⋯ < ω0 bω: To prove (3.23), it is sufficient to show for k ¼ 0, 1, 2, ⋯ (i) when Ψ2k ðωÞ > ω0 , Ψ2kþ2 ðωÞ > ω0 ; (ii) when Ψ2k + 1(ω) < ω0, Ψ2k + 3(ω) < ω0.

ð3:23Þ

52

3 Whittle Index Policy for Opportunistic Scheduling: Heterogeneous Two-State. . .

(iii) when Ψ2k(ω) > ω0, Ψ2k(ω) > Ψ2k + 2(ω); (iv) when Ψ2kþ1 ðωÞ < ω0 , Ψ2kþ1 ðωÞ < Ψ2kþ3 ðωÞ First, we prove (i). By Lemma 3.4, Ψ0(ω) ⩾ ω0 ⩾ Ψ1(ω) for k ¼ 0. When Ψ (ω) > ω0, we have Ψ2k + 1(ω) ¼ Ψ(Ψ2k(ω)) < ω0 by Lemma 3.4, and furthermore,   Ψ2kþ2 ðωÞ ¼ Ψ Ψ2kþ1 ðωÞ > ω0 . Second, we prove (ii). By Lemma 3.4, Ψ0 ðωÞ ⩾ ω0 ⩾ Ψ1 ðωÞ for k ¼ 0. When 2k + 1 Ψ (ω) < ω0, we have Ψ2k + 2(ω) ¼ Ψ(Ψ2k + 1(ω)) > ω0 by Lemma 3.4, and   furthermore, Ψ2kþ3 ðωÞ ¼ Ψ Ψ2kþ2 ðωÞ < ω0 . Third, to prove (iii) and (vi), we only need to show that Ψ2(ω) < ω for any ω 2 ðω0 , p01  while Ψ2(ω) > ω for any ω 2 ½Γðp11 Þ, ω0 Þ. On one hand, from the following: 2k

∂½Ψ2 ðωÞ ∂½ΨðxÞ ¼ ∂½ω ∂½x ¼ ¼

2 ∂ Ψ 2 ð ωÞ 2

∂ ½ ω

j

 x¼ΨðωÞ

∂½ΨðωÞ ∂½ω

εðp01  p11 Þ εðp01  p11 Þ  ½1  ð1  εÞΨðωÞ2 ½1  ð1  εÞω2

ε2 ðp01  p11 Þ2 >0 ½1  ð1  εÞðð1  p01 þ εp11 Þω þ p01 Þ2 ¼

2ð1  εÞε2 ðp01  p11 Þ2 ð1  p01 þ εp11 Þ > 0, ½1  ð1  εÞðð1  p01 þ εp11 Þω þ p01 Þ3

ð3:24Þ ð3:25Þ

we have Ψ2(ω) is the convex increasing function of ®. On the other hand, we have the following inequalities regarding three end points: 8 2 > < Ψ ðΓðp11 ÞÞ ¼ ΨðΨðΓðp11 ÞÞÞ > ΨðΓðp11 ÞÞ > Γðp11 Þ Ψ 2 ð ω 0 Þ ¼ Ψ ð Ψ ð ω 0 Þ Þ ¼ Ψ ð ω0 Þ ¼ ω0 > : 2 Ψ ðp01 Þ ¼ ΨðΨðp01 ÞÞ < Ψð0Þ ¼ p01

ð3:26Þ

Therefore, combining (3.24), (3.25), and (3.26), we know that Ψ2(ω) and ω have a unique intersection at ω0 as shown in Fig. 3.3, and furthermore, Ψ2(ω) < ω for any ω 2 (ω0, p01] while Ψ2(ω) > ω for any ω 2 [Γ( p11), ω0). Case 2. When Γðp11 Þbω < ω0 , we need to prove (

ω ¼ Ψ0 ðωÞ < ⋯ < Ψ2k ðωÞ < Ψ2kþ2 ðωÞ < ⋯ < ω0 ΨðωÞ > ⋯ > Ψ2kþ1 ðωÞ > Ψ2kþ3 ðωÞ > ⋯ > ω0 > ω

which can be verified by the similar induction in the aforementioned case. Therefore, combining the above two cases, we conclude the lemma. □

3.8 Threshold Policy and Adjoint Dynamic System

3.8

53

Threshold Policy and Adjoint Dynamic System

In this section, we first express the value function by threshold policy, and then introduce an adjoint dynamic system to facilitate the analysis on nonlinear dynamics.

3.8.1

Threshold Policy

Let L(ω, ω0) be the minimum amount of time required for a passive arm to transit across ω0 starting from ω, i.e.,   Lðω, ω0 Þ≜ min k : Γk ðωÞ > ω0

ð3:27Þ

According to Lemma 3.1, we have for the case of p11 ⩾ p01

f

0

0

if ω > ω0

ω ω c þ 1 , if ωbω0 < ω0 blogp11 p01 0 ω0  ω Lðω, ω Þ ¼ 1 if ωbω0 , ω0 ⩾ ω0 0

ð3:28Þ

and, for the case of p11 < p01

Lðω, ω0 Þ ¼

f

0,

if ω > ω0

1, if ωbω0 and ΓðωÞ > ω0 1, if ωbω0 and ΓðωÞbω0

ð3:29Þ

Under the threshold policy, the arm will be activated if its belief state crosses a certain threshold ω0. In other words, starting from an arbitrary belief state «, the first active action on the arm is taken after L(ω, ω0) slots. Based  on the structure  of threshold policy, Vβ, v(ω) can be characterized in terms of V β,v Γt0 1 ðωÞ; a ¼ 1 for some t0 2 {1, 2, ⋯, 1} where t0 ¼ L(ω, ω) + 1 is the slot when the belief ω reaches the threshold ω for the first time. Specially, in the first L(ω, ω) slots, the subsidy v is obtained in each slot. In slot t0 ¼ L(ω, ω) + 1, the belief state reaches  the threshold or  and the arm is activated. The total reward thereafter is V β,v ΓLðω,ω Þ ðωÞ; a ¼ 1 . Taking into account β, we thus have V β,v ðωÞ ¼

     1  βLðω,ω Þ v þ βLðω,ω Þ V β,v ΓLðω,ω Þ ðωÞ; a ¼ 1 1β

ð3:30Þ

54

3.8.2

3 Whittle Index Policy for Opportunistic Scheduling: Heterogeneous Two-State. . .

Adjoint Dynamic System

In the dynamic system (3.15), the belief information ω represents two kinds of information: • policy information, i.e., action a depends on ω; • value information, i.e., the reward value of the dynamic system (or value function) depends on ω. To better characterize the dynamic evolution of (3.15), we separate two roles of ω by mathematically letting ω only represent the value while introducing bωe to indicate the policy information used to make a decision (corresponding to the policy). Specifically, we introduce the following adjoint dynamic system:   V β,v ðω; ω0 Þ ¼ max V β,v ðω; ω0 , 0Þ, V β,v ðω; ω0 , 1Þ

ð3:31Þ

where, V β,v ðω; ω0 , 0Þ ¼ v þ βV β,v ðΓðωÞ; Γðω0 ÞÞ, V β,v ðω; ω0 , 1Þ ¼ ð1  εÞω þ β ð1  εÞωV β,v ðp11 ; p11 Þ þð1  ð1  εÞωÞV β,v ðΨðωÞ; Ψðω0 ÞÞ where ω0, a represents making action a(a ¼ 0,1) given the policy information ω0 . Proposition 3.1 Given v, Vβ,v(ω; a ¼ 1) and Vβ,v(ω; a ¼ 0) are piecewise linear and convex in ω. Proof We prove the proposition by induction. In slot T, we have V Tβ,v ðω; a ¼ 0Þ ¼ v and V Tβ,v ðω; a ¼ 1Þ ¼ ð1  εÞω, which follows V Tβ,v ðωÞ ¼ max fV Tβ,v ðω; a ¼0Þ, V Tβ,v ðω, a ¼ 1Þg is piecewise linear and convex in ω. tþ1 Assume V tþ1 β,v ðω; a ¼ 1Þ and V β,v ðω; a ¼ 0Þ are piecewise linear and convex in t ω, it is easy to show that both V β,v ðω; a ¼ 1Þ and V tβ,v ðω; a ¼ 0Þ are piecewise linear and convex in ω according to Eq. (3.15). Letting T % 1, we prove the proposition. □ Corollary 3.2 V tβ,v ðω; ω0 Þ is piecewise linear in ω. Proof We know that V tβ,v ðω; ωÞ is piece linear in ω by Proposition 3.1. In V tβ,v ðω; ω0 Þ, ω represents the value information while ω0 does the policy information. Thus, V tβ,v ðω; ω0 Þ is piece linear in the value information ω. □

3.9 Linearization of Value Function for Negatively Correlated Channels

55

Lemma 3.6 V β,v ðω; ω0 , 1Þ is decomposable in ω, i.e., V β,v ðω; ω0 , 1Þ ¼ ð1  εÞω þ β ð1  εÞωV β,v ðp11 ; p11 Þ

þεωV β,v ðp11 ; Ψðω0 ÞÞ þ ð1  ωÞV β,v ðp01 ; Ψðω0 ÞÞ : Proof

V β,v ðω; ω0 , 1Þ ¼ ð1  εÞω þ β ð1  εÞωV β,v ðp11 ; p11 Þ

þð1  ð1  εÞωÞV β,v ðΨðωÞ; Ψðω0 ÞÞ ¼ ð1  εÞω þ β ð1  εÞωV β,v ðp11 ; p11 Þ þð1  ωÞð1  ð1  εÞ0ÞV β,v ðΨð0Þ; Ψðω0 ÞÞ

þωð1  ð1  εÞ1ÞV β,v ðΨð1Þ; Ψðω0 ÞÞ ¼ ð1  εÞω þ β ð1  εÞωV β,v ðp11 ; p11 Þ ðaÞ

þð1  ωÞV β,v ðp01 ; Ψðω0 ÞÞ þ εωV β,v ðp11 ; Ψðω0 ÞÞ where (a) is due to Corollary 3.2.





Remark 3.3 In (3.32), for V β,v ðp11 ; p11 Þ and V β,v ðp11 ; Ψðω0 ÞÞ , we can see that though they have the same value information p11, they have different policy information, i.e., p11 and Ψðω0 Þ Hence, V β,v ðp11 ; p11 Þ 6¼ V β,v ðp11 ; Ψðω0 ÞÞ except that both p11 and Ψðω0 Þ can lead a same action policy for the dynamic system.

3.9

Linearization of Value Function for Negatively Correlated Channels

In this section, we focus on the linearization of V β,v ðω; ω, 1Þ for the case of ðiÞ ðiÞ negatively correlated channels, i.e., p11 < p01 , which serves as the basis to compute the Whittle index. Again, we consider one channel by dropping channel index i. In many practical systems, the initial belief ω0 is set to ω0[9]. It can then be checked that min{p01, p11} b ω0 b max {p01, p11}. Moreover, even the initial belief does not fall in [min{p01, p11}, max{p01, p11}], all the belief values are bounded in the interval from the second slot following Lemma 3.1. Hence the following results can be extended by treating the first slot separately from the future slots. Therefore, we assume min{p01, p11} b ω b max {p01, p11} in the first slot in our analysis. We divide the region [p11, p01] into four subregions by Γ( p11) and two fixed points ω0 , ω0 : ½p11 , p01  ¼ ½p11 , ω0 Þ [ ½ω0 , Γðp11 ÞÞ [ ½Γðp11 Þ, ω0 Þ [ ½ω0 , p01 

ð3:33Þ

56

3 Whittle Index Policy for Opportunistic Scheduling: Heterogeneous Two-State. . .

In the following, we derive the linearized Vβ, v(ω; ω, 1) in these subregions, respectively, which will be used to compute the Whittle index in Sect. 3.12.

3.9.1

Region 1–2

Proposition 3.2 If p11 ω 2 [p11, p01]

b ω

< Γ( p11), it holds that L(Γ(φ(ω)), ω) ¼ 0 for any

Proof In the case of p11 < p01, φ(ω monotonically increases with ω while Γ(ω) monotonically decreases with ω. Thus, Γ(φ(ω)) ⩾ Γ( p11) > ω for ω 2 [p11, p01]  01 Þ when 0bεb pp11 ðð1p 1p Þ. Therefore, L(Γ(φ(ω)), ω ) ¼ 0. □ 01

11

Lemma 3.7 When p11 b ω < Γ( p11) for any ω 2 [p11, p01], the linearity of V β,v ðω; ω, 1Þ is as follows: V β,v ðω; ω, 1Þ ¼ ð1  εÞω þ β ð1  εÞωV β,v ðp11 ; p11 Þ

þεωV β,v ðp11 ; ΨðωÞÞ þ ð1  ωÞV β,v ðp01 ; ΨðωÞÞ

ð3:34Þ

where ðe1Þ

V β,v ðp11 ; p11 Þ ¼ V β,v ðp11 ; p11 , 0Þ ðe2Þ

¼ v þ βV β,v ðΓðp11 Þ; Γðp11 ÞÞ

¼ v þ βð1  εÞΓðp11 Þ þ β2 ð1  εÞΓðp11 ÞV β,v ðp11 ; p11 Þ

þð1  ð1  εÞΓðp11 ÞÞV β,v ðΨðΓðp11 ÞÞ; ΨðΓðp11 ÞÞÞ

ðe3Þ

¼ v þ βð1  εÞΓðp11 Þ þ β2 ð1  εÞΓðp11 ÞV β,v ðp11 ; p11 Þ

ðe4Þ

þð1  ð1  εÞΓðp11 ÞÞV B,v ðΨðΓðp11 ÞÞ; ΨðωÞÞ ðe5Þ

¼ v þ βð1  εÞΓðp11 Þ

þ β ½ð1  εÞΓðp11 ÞV β,v ðp11 ; p11 Þ 2

þ εΓðp11 ÞV β,v ðp11 ; ΨðωÞÞ þð1  Γðp11 ÞÞV β,v ðp01 ; ΨðωÞÞ

ð3:35Þ

V β,v ðp11 ; ΨðωÞÞ ¼ V β,v ðp11 ; ΨðωÞ, 1Þ ¼ ð1  εÞp11 þ β ð1  εÞp11 V β,v ðp11 ; p11 Þ

þεp11 V β,v ðp11 ; ΨðωÞÞ þ ð1  p11 ÞV β,v ðp01 ; ΨðωÞÞ

ðe6Þ

ð3:36Þ

3.9 Linearization of Value Function for Negatively Correlated Channels

57

V β,v ðp01 ; ΨðωÞÞ ¼ V β,v ðp01 ; ΨðωÞ, 1Þ ¼ ð1  εÞp01 þ β ð1  εÞp01 V β,v ðp11 ; p11 Þ

ðe7Þ

þεp01 V β,v ðp11 ; ΨðωÞÞ þ ð1  p01 ÞV β,v ðp01 ; ΨðωÞÞ :

ð3:37Þ

Proof (e1) is due to p11  ω ) a ¼ 0 (e2) is due to a ¼ 0. (e3) is due to Γ( p11) > ω ) a ¼ 1. (e4) is due to L(Ψ(Γ( p11)), ω) ¼ 0 and L(Ψ(ω), ω) ¼ 0 from Proposition 3.2. (e5), (e6), and (e7) follow Lemma 3.6. □ Remark 3.4 Based on (3.35), (3.36), and (3.37), we can compute V β,v ðp11 ; p11 Þ ,V β,v ðp11 ; ΨðωÞÞ , and V β,v ðp01 ; ΨðωÞÞ , and further V β,v ðω; ω, 1Þ is linearized by (3.34).

3.9.2

Region 3

Based on Lemma 3.5, we have the following important corollary. Corollary 3.3 When Γ( p11) b ω < p01, we have (i) When Γðp11 Þbω < ω0 the first crossing time of the nonlinear belief part Ψi(ω) (i ¼ 1, 2, ⋯) will be 0 in the evolving process; that is, L(Ψi(ω), ω) ¼ 0; (ii) When ω0 bω < p01 first crossing time of the nonlinear belief part Γi(Ψ(ω)) (i ¼ 0, 1, 2, ⋯) will be 1; that is, L(Γi(Ψ(ω)), ω) ¼ 1. Proof (1) By Lemma 3.5, we have that Ψi(ω) > ω when Γ( p11) b ω < ω0 , and furthermore, L(Ψi(ω), ω) ¼ 0 . (2) By Lemma 3.5, ω0 < Ψ(ω) b ω when ω0 bω < p01 . Furthermore, by Lemma 3.2, we have Γi(Ψ0(ω)) b ω, which means L(Γi(Ψ(ω)), ω) ¼ 1 □ Corollary 3.4 When Γðp11 Þbω < ω0 , we have that V β,v ðω ; ω , 1Þ can be linearized by the following: ðe1Þ V β,v ðω ; ω , 1Þ ¼ ð1  εÞω þ β ð1  εÞω V β,v ðp11 , p11 Þ

þð1  ð1  εÞω ÞV β,v ðΨðω Þ; Ψðω ÞÞ ¼ ð1  εÞω þ β ð1  εÞω V β,v ðp11 ; p11 Þ

þεω V β,v ðp11 ; Ψðω ÞÞ þ ð1  ω ÞV β,v ðp01 ; Ψðω ÞÞ

ðe2Þ

where

58

3 Whittle Index Policy for Opportunistic Scheduling: Heterogeneous Two-State. . . ðe3Þ

V β,v ðp11 ; p11 Þ ¼

v , 1β

ðe4Þ

V β,v ðp11 ; Ψðω ÞÞ ¼ V β,v ðp11 ; Ψðω Þ, 1Þ ¼ ð1  εÞp11 þ β ð1  εÞp11 V β,v ðp11 ; p11 Þ  

þð1  ð1  εÞp11 ÞV β,v Ψðp11 Þ; Ψ2 ðω Þ

ðe5Þ

¼ ð1  εÞp11 þ β ð1  εÞp11 V β,v ðp11 ; p11 Þ

þð1  ð1  εÞp11 ÞV β,v ðΨðp11 Þ; Ψðω ÞÞ

ðe6Þ

¼ ð1  εÞp11 þ β ð1  εÞp11 V β,v ðp11 ; p11 Þ

ðe7Þ

þεp11 V β,v ðp11 ; Ψðω ÞÞ

þð1  p11 ÞV β,v ðp01 ; Ψðω ÞÞ ðe8Þ V β,v ðp01 ; Ψðω ÞÞ ¼ ð1  εÞp01 þ β ð1  εÞp01 V β,v ðp11 ; p11 Þ þεp01 V β,v ðp11 ; Ψðω ÞÞ

þð1  p01 ÞV β,v ðp01 ; Ψðω ÞÞ : Proof (e1) is due to a ¼ 1. (e2) is due to Lemma 3.6. (e3) is due to L ( p11, ω) ¼ 1 ) a ¼ 0 for Γðp11 Þbω < ω0 and V β,v ðp11 ; p11 Þ ¼ v þ βv þ v β2 v2 þ ⋯ ¼ 1B is due to L(Ψ(ω), ω) ¼ 0 ) a ¼ 1 from Corollary 3.3. (e6) is 2  due to L(Ψ (ω ), ω) ¼ L(Ψ0(ω), ω).|(e7) is due to Corollary 3.2.(e8) is due to a ¼ 1 ( L(Ψ(ω), ω) ¼ 0 by Corollary 3.3. □

3.9.3

Region 4

When ω0 b ω < p01, we have L(Γi(Ψ(ω)), ω) ¼ 1 by Corollary 3.3, and further v V β,v ðΨðω Þ; Ψðω ÞÞ ¼ 1β . Hence, V β,v ðω ; ω , 1Þ ¼ ð1  εÞω þ β ð1  εÞω V β,v ðp11 ; p11 Þ

þð1  ð1  εÞω ÞV β,v ðΨðω Þ; Ψðω ÞÞ v    ¼ ð1  εÞω þ β ð1  εÞω V β,v ðp11 ; p11 Þ þ ð1  ð1  εÞω Þ 1β which clearly shows that V β,v ðω ; ω , 1Þ has been linearized.

3.10

3.10

Linearization of Value Function for Positively Correlated Channels

59

Linearization of Value Function for Positively Correlated Channels

In this section, we linearize V β,v ðω; ω, 1Þ for the case of p11 ⩾ p01. According to the fixed point ω0, we first divide [p01, p11] into [p01, ω0) and [ω0, p11]. According to Lemmas 3.1 and 3.3, [p01, ω0) can be further divided into [p01, Γ(φ( p01))) and [Γn(φ( p01)), Γn+1(φ( p01))), herein, n ¼ 1, 2, ⋯, 1. Thus, in the following, we take [Γn(φ( p01)), Γn+1(φ( p01))) as an example to analyze the evolution of belief value. Lemma 3.8 (Fixed points) Ifp11 ⩾ p01, for any n(n ¼ 1, 2, ⋯, 1)there exists ωn0 , and ωn0 satisfying Γn ðφðp01 ÞÞ < ωn0 < ωn0 < Γn ðφðp11 ÞÞ such that φðΓðωÞÞ > Γn ðωÞ for p01 bω < ωn0      φ Γ ωn0 ¼ Γn ωn0 φðΓðωÞÞ < Γn ðωÞ for ωn0 < ωbω0 and φðωÞ > Γn ðωÞ for p01 bω < ωn0     φ ωn0 ¼ Γn ωn0 φðωÞ < Γn ðωÞ for ωn0 < ωbω0 Proof Since ∂½φðΓðωÞÞ εðp11  p01 Þ ¼ >0 ∂½ω ½1  ð1  εÞω2 2

∂ ½φðΓðωÞÞ 2

∂ ½ ω

¼

2εð1  εÞðp11  p01 Þ > 0, ½1  ð1  εÞω3

we know that φ(Γ(ω)) is convex in ω. Moreover, by the increasing monotonicity of φ(ω), we have the following inequalities regarding end points Γn(φ( p01)) and Γn(φ( p11)) 

φðΓðΓn ðφðp01 ÞÞÞÞ > φðp01 Þ ¼ Γn ðΓn ðφðp01 ÞÞÞ φðΓðΓn ðφðp11 ÞÞÞÞ < φðp11 Þ ¼ Γn ðΓn ðφðp11 ÞÞÞ

p01 Þω0 and the conCombining the linearity of Γn ðωÞ ¼ ðp11  p01 Þω0 þ ωððpp11p n 11 01 Þ vexity of φ(Γ(ω)) in ω, we know that there must exist a unique point

60

3 Whittle Index Policy for Opportunistic Scheduling: Heterogeneous Two-State. . .

 ωn0 ðΓn ðφðp01 ÞÞ Γn(ω) when p01 bω < ωn0 .   Likewise, there exists a unique point ωnl Γn ðφðp01 ÞÞ < ωn0 < Γn ðφðp11 ÞÞ such that φ(ω) b Γn(ω) when ωn0 bωbω0 while φ(ω) > Γn(ω) when p01 bω < ωn0 Next, we prove ωn0 < ωn0 by contradiction. Assume ωn0 ⩾ ωn0, we have, considering the increasing monotonicity of Γn(ω) in ω          φ Γ ωn0 ¼ Γn ωn0 bΓn ωn0 ¼ φ ωn0

ð3:38Þ

     Since Γ ωn0 > ΓðΓn ðφðp01 ÞÞÞ > Γn ðφðp11 ÞÞ > ωn0 , we have φ Γ ωn0 >  φ ωn0 according to the monotonicity of φ(ω), which contradicts (3.38). Hence, we have ωn0 < ωn0 .■ Based on the two fixed points ωn0 and ωn0 in Lemma 3.8, we further divide the region [Γn(φ( p01)), Γn + 1(φ( p01))) into the following four subregions: 



  Γn ðφðp01 ÞÞ, Γnþ1 ðφðp01 ÞÞ ¼ Γn ðφðp01 ÞÞ, ωn0  n0 Þ [½ωn0 , ω [ ½ ωn0 , Γn ðφðp11 ÞÞÞ : [ ½Γn ðφðp11 ÞÞ, Γnþ1 ðφðp01 ÞÞÞ

3.10.1 Region n 2 1 The following proposition quantifies how many time slots are required for a belief value to recover to the given threshold value ω*. Proposition 3.3 When Γn(φ( p11)) b ω < Γn + L(Γ(φ(ω)), ω) ¼ L(φ( p11), ω)  1 ¼ n for ω 2 [p01, p11]

1

(φ( p01)), we have

Proof Since Γn(φ( p11)) b ω, we have Lðφðp11 Þ, ω Þ ⩾ Lðφðp11 Þ, Γn ðφðp11 ÞÞÞ ¼ n þ 1

ð3:39Þ

On the other hand, considering ω < Γn + 1(φ( p11)), we have   Lðφðp11 Þ, ω Þ < L φðp11 Þ, Γnþ1 ðφðp11 ÞÞ ¼ n þ 2 Combining (3.39) and (3.40), we have L(φ( p11), ω) ¼ n + 1.

ð3:40Þ

3.10

Linearization of Value Function for Positively Correlated Channels

61

Since Γ(φ(ω)) ⩾ Γ(φ( p01)), then LðΓðφðωÞÞ, ω ÞbLðΓðφðp01 ÞÞ, ω Þ ¼ n Further, we have L(Γ(φ(ω)), ω) ¼ n. Hence, the lemma holds.



By Proposition 3.3, we have the following lemma to linearize V β,v ðω; ω, 1Þ. Lemma 3.9 If Γn(φ( p11)) b ω < Γn + 1(φ( p01)) , then we have for ω 2 [p01, p11] V β,v ðω; ω, 1Þ ¼ ð1  εÞω þ β ð1  εÞωV β,v ðp11 ; p11 Þ

þεωV β,v ðp11 ; ΨðωÞÞ þ ð1  ωÞV β,v ðp01 ; ΨðωÞÞ ,

ð3:41Þ

where L(Ψ(ω), ω) ¼ n ðe1Þ

V β,v ðp11 ; p11 Þ ¼ V β,v ðp11 ; ω, 1Þ

ð3:42Þ

ðe2Þ

1  βn v þ βn V β,v ðΓn ðp01 Þ; ω, 1Þ 1β

ð3:43Þ

ðe3Þ

1  βn v þ βn V β,v ðΓn ðp11 Þ; ω, 1Þ 1β

ð3:44Þ

V β,v ðp01 ; ΨðωÞÞ ¼ V β,v ðp11 ; ΨðωÞÞ ¼ V β,v ðp11 ; ω, 1Þ ðe4Þ

¼ ð1  εÞp11 þ βð1  εÞp11 V β,v ðp11 ; p11 Þ

ð3:45Þ

þβð1  p11 ÞV β,v ðp01 ; ΨðωÞÞ þ εβp11 V β,v ðp11 ; ΨðωÞÞ V β,v ðΓn ðp01 Þ; ω, 1Þ ðe5Þ

¼ ð1  εÞΓn ðp01 Þ þ βð1  εÞΓn ðp01 ÞV β,v ðp11 ; p11 Þ

þβð1  Γn ðp01 ÞÞV β,v ðp01 ; ΨðωÞÞ þ βεΓn ðp01 ÞV β,v ðp11 ; ΨðωÞÞ

ð3:46Þ

V β,v ðΓn ðp11 Þ; ω, 1Þ ðe6Þ

¼ ð1  εÞΓn ðp11 Þ þ βð1  εÞΓn ðp11 ÞV β,v ðp11 ; p11 Þ

þβð1  Γn ðp11 ÞÞV β,v ðp01 ; ΨðωÞÞ þ βεΓn ðp11 ÞV β,v ðp11 ; ΨðωÞÞ

ð3:47Þ

Proof (e1) is due to L(Ψ( p11), ω) ¼ L(Ψ(ω), ω) by Proposition 3.3. (e2) and (e3) are due to L(Ψ(Γn(Ψ(ω))), ω) ¼ L(Ψ(ω), ω)by Proposition 3.3. (e4)–(e6) are due to Lemma 3.6. Remark 3.5 By (3.42)–(3.47), we can obtain V β,v ðp11 ; p11 Þ, V β,v ðp01 ; ΨðωÞÞ, and Vβ, v( p11; Ψ(ω)) which are substituted in (3.41), leading to the linearization of V β,v ðω; ω, 1Þ. □

3 Whittle Index Policy for Opportunistic Scheduling: Heterogeneous Two-State. . .

62

3.10.2 Region n 2 2 Proposition 3.4 If ωn0 bω < Γn ðφðp11 ÞÞ, then ( LðΓðφðωÞÞ, ω Þ ¼

Lðφðp11 Þ, ω Þ ¼ n,

if ω 2 ½p01 , Γðω Þ

Lðφðp11 Þ, ω Þ  1 ¼ n  1,

if ω ¼ p11

Proof According to Lemma 3.8, we have φ(Γ(ω)) b Γn(ω). Now we prove the lemma by two cases. (i) when ω 2 [p01, Γ(ω)], we have φðωÞbφðΓðω ÞÞbΓn ðω Þ and furthermore,   LðΓðφðωÞÞ, ω Þ ⩾ L Γnþ1 ðω Þ, ω ¼ n Considering L(φ( p11), ω) ¼ n, we have LðΓðφðωÞÞ, ω Þ ¼ Lðφðp11 Þ, ω Þ ¼ n (ii)

when ω ¼ p11, we have LðΓðφðp11 ÞÞ, ω Þ ¼ Lðφðp11 Þ, ω Þ  1 ¼ n  1

Combining two cases, we obtain the lemma.



Lemma 3.10 If ωn0 bω < Γn ðφðp11 ÞÞ, then we have for ω 2 [p01, Γ(ω)] V β,v ðp11 ; p11 Þ ¼ ð1  εÞp11 þ β ð1  εÞp11 V β,v ðp11 ; p11 Þ

þεp11 V β,v ðp11 ; Ψðp11 ÞÞ þ ð1  p11 ÞV β,v ðp01 ; Ψðp11 ÞÞ V β,v ðω; ω, 1Þ ¼ ð1  εÞω þ β ð1  εÞωV β,v ðp11 ; p11 Þ

þεωV β,v ðp11 ; ΨðωÞÞ þ ð1  ωÞV β,v ðp01 ; ΨðωÞÞ

ð3:48Þ ð3:49Þ

where L(φ( p11), ω) ¼ n ðe1Þ

1  βn v þ βn V β,v ðΓn ðp01 Þ; ω , 1Þ 1β

ð3:50Þ

ðe2Þ

1  βn v þ βn V β,v ðΓn ðp11 Þ; ω , 1Þ 1β

ð3:51Þ

V β,v ðp01 ; ΨðωÞÞ ¼ V β,v ðp11 ; ΨðωÞÞ ¼

3.10

Linearization of Value Function for Positively Correlated Channels ðe3Þ

V β,v ðp01 ; Ψðp11 ÞÞ ¼

ðe4Þ

V β,v ðp11 ; Ψðp11 ÞÞ ¼

63

  1  βn1 v þ βn1 V β,v Γn1 ðp01 Þ; ω , 1 1β

ð3:52Þ

  1  βn1 v þ βn1 V β,v Γn1 ðp11 Þ; ω , 1 1β

ð3:53Þ

V β,v ðΓn ðp01 Þ; ω , 1Þ ðe5Þ

¼ ð1  εÞΓn ðp01 Þ þ βð1  εÞΓn ðp01 ÞV β,v ðp11 ; p11 Þ þβð1  Γn ðp01 ÞÞV β,v ðp01 ; Ψðω ÞÞ þ βεΓn ðp01 ÞV β,v ðp11 ; Ψðω ÞÞ

ð3:54Þ

V β,v ðΓn ðp11 Þ; ω , 1Þ ðe6Þ

¼ ð1  εÞΓn ðp11 Þ þ βð1  εÞΓn ðp11 ÞV β,v ðp11 ; p11 Þ þβð1  Γn ðp11 ÞÞV β,v ðp01 ; Ψðω ÞÞ þ βεΓn ðp11 ÞV β,v ðp11 ; Ψðω ÞÞ   V β,v Γn1 ðp01 Þ; ω , 1

ð3:55Þ

ðe7Þ

¼ ð1  εÞΓn1 ðp01 Þ þ βð1  εÞΓn1 ðp01 ÞV β,v ðp11 ; p11 Þ   þβ 1  Γn1 ðp01 Þ V β,v ðp01 ; Ψðω ÞÞ þ βεΓn1 ðp01 ÞV β,v ðp11 ; Ψðω ÞÞ   V β,v Γn1 ðp11 Þ; ω , 1

ð3:56Þ

ðe8Þ

¼ ð1  εÞΓn1 ðp11 Þ þ βð1  εÞΓn1 ðp11 ÞV β,v ðp11 ; p11 Þ   þβ 1  Γn1 ðp11 Þ V β,v ðp01 ; Ψðω ÞÞ þ βεΓn1 ðp11 ÞV β,v ðp11 ; Ψðω ÞÞ

ð3:57Þ

Proof (e1) and (e2) are due to L(Ψ(ω), ω) ¼ L(Ψ0(ω), ω) for any ω 2 [p01, Γ (ω)] ¼ n by Proposition 3.4. (e3) and (e4) are due to L(Ψ( p11), ω) ¼ L (Ψ(ω), ω)  1 ¼ n  1 by Proposition 3.4. (e5)–(e8) are due to Lemma 3.6. □ Remark 3.6 By (3.48), (3.52), (3.53), (3.56), and (3.57), we can obtain V β,v ðp11 ; p11 Þ ,V β,v ðp01 ; Ψðω ÞÞ and V β,v ðp11 ; Ψðω ÞÞ . Then plugging them into (3.54), (3.55), (3.50), and (3.51), we have V B,v ðp01 ; ΨðωÞÞ and V β,v ðp11 ; ΨðωÞÞ . Further, the linearization version of V β,v ðω; ω, 1Þ is obtained.

3.10.3 Region n 2 4 Proposition 3.5 If Γn ðφðp01 ÞÞbω < ωn0 we have LðΓðφðωÞÞ, ω Þ ¼ Lðφðp11 Þ, ω Þ  1 ¼ n  1 for ω 2 [ω, p11]

3 Whittle Index Policy for Opportunistic Scheduling: Heterogeneous Two-State. . .

64

Proof According to the monotonicity of φ(ω), we have φ(ω) ⩾ φ(ω) for ω 2 [ω, p11] and φ(ω) > Γn(ω) by Lemma 3.8. Thus, φ(ω) > Γn(ω) and L(Γ(φ(ω)), ω) < L(Γ1  n(ω), ω) ¼ n. Considering L(φ( p11), ω) ¼ n, we have L(Γ(φ(ω)), ω) ¼ n  1. Therefore, Proposition 3.5 holds. □ Lemma 3.11 If Γn ðφðp01 ÞÞbω < ωn0 we have for ω 2 [ω, p11] V β,v ðω; ω, 1Þ ¼ ð1  εÞω þ β ð1  εÞωV β,v ðp11 ; p11 Þ

þεωV β,v ðp11 ; ΨðωÞÞ þ ð1  ωÞV β,v ðp01 ; ΨðωÞÞ

ð3:58Þ

where ðe1Þ

V β,v ðp11 ; p11 Þ ¼ V β,v ðp11 ; ω, 1Þ ðe2Þ

V β,v ðp01 ; ΨðωÞÞ ¼

ðe3Þ

V β,v ðp11 ; ΨðωÞÞ ¼

ð3:59Þ

  1  βn1 v þ βn1 V β,v Γn1 ðp01 Þ; ω, 1 1β

ð3:60Þ

  1  βn1 v þ βn1 V β,v Γn1 ðp11 Þ; ω, 1 1β

ð3:61Þ

V β,v ðp11 ; ω, 1Þ ðe4Þ

¼ ð1  εÞp11 þ βð1  εÞp11 V β,v ðp11 ; p11 Þ þβð1  p11 ÞV β,v ðp01 ; ΨðωÞÞ þ εβp11 V β,v ðp11 ; ΨðωÞÞ   V β,v Γn1 ðp01 Þ; ω, 1

ð3:62Þ

ðe5Þ

¼ ð1  εÞΓn1 ðp01 Þ þ βð1  εÞΓn1 ðp01 ÞV β,v ðp11 ; p11 Þ   þβ 1  Γn1 ðp01 Þ V β,v ðp01 ; ΨðωÞÞ þ βεΓn1 ðp01 ÞV β,v ðp11 ; ΨðωÞÞ   V β,v Γn1 ðp11 Þ; ω, 1

ð3:63Þ

ðe6Þ

¼ ð1  εÞΓn1 ðp11 Þ þ βð1  εÞΓn1 ðp11 ÞV β,v ðp11 ; p11 Þ   þβ 1  Γn1 ðp11 Þ V β,v ðp01 ; ΨðωÞÞ þ βεΓn1 ðp11 ÞV β,v ðp11 ; ΨðωÞÞ

ð3:64Þ

Proof (e1)–(e3) are due to L(Ψ(ω), ω) ¼ n  1 for any ω 2 [ω, p11] by Proposition 3.5. (e4)–(e6) are due to Lemma 3.6. □ Remark 3.7 By (3.59)–(3.64), we can obtain V β,v ðp11 ; p11 Þ, V β,v ðp01 ; ΨðωÞÞ and V β,v ðp11 ; ΨðωÞÞ . Plugging them into (3.58), we get the linearized version V β,v ðω; ω, 1Þ.

3.11

Index Computation for Negatively Correlated Channels

65

3.10.4 Region n-3  For the region ωn0 , ωn0 , by theoretic analysis, we find out that there exist an infinite number of fixed points so that this region can be further divided into an infinite number of subregions. For each subregion, we can compute the subsidy v by the above similar way. Considering computation complexity, we simply use linear interpolation to compute the index in this region for two practical reasons: • the Whittle approach is an approximate one in essence, • the nonlinearity shows the tradeoff between precise computation and computation cost.

3.10.5 Region 5 For ω0 < ω < p11, it holds that L( p11, ω) ¼ 0 and L(Γ(φ(ω)), ω) ¼ 1 for ω 2 [p01, v p11]. Therefore, we have V β,v ðΓðφðωÞÞ; ΓðφðωÞÞÞ ¼ 1β .

3.11

Index Computation for Negatively Correlated Channels

Since the nonlinear part V β,v ðω ; ω , 1Þ for different regions has been linearized in previous sections, we now begin to compute the Whittle index based on the balance equation as follows: V β,v ðω ; ω , 0Þ ¼ V β,v ðω ; ω , 1Þ

ð3:65Þ

3.11.1 Region 1 When p11 b ω < ω0, we know that L(Γ(φ(ω)), ω) ¼ 0 for any ω 2 [p11, p01] according to Proposition 3.2 Thus, we have V β,v ðω ; ω , 0Þ ¼ v þ βV β,v ðΓðω Þ; Γðω ÞÞ

3 Whittle Index Policy for Opportunistic Scheduling: Heterogeneous Two-State. . .

66

¼ v þ βð1  εÞΓðω Þ þ β2 ð1  εÞΓðω ÞV β,v ðp11 ; p11 Þ

þð1  ð1  εÞΓðω ÞÞV β,v ðΨðΓðω ÞÞ; ΨðΓðω ÞÞÞ ¼ v þ βð1  εÞΓðω Þ þ β2 ð1  εÞΓðω ÞV β,v ðp11 ; p11 Þ

þεΓðω ÞV β,v ðp11 ; ΨðΓðω ÞÞÞ þ ð1  Γðω ÞÞV β,v ðp01 ; ΨðΓðω ÞÞÞ ¼ v þ βð1  εÞΓðω Þ þ β2 ð1  εÞΓðω ÞV β,v ðp11 ; p11 Þ

ð3:66Þ þεΓðω ÞV β,v ðp11 ; Ψðω ÞÞ þ ð1  Γðω ÞÞV β,v ðp01 ; Ψðω ÞÞ According to Lemma 3.7, we have V β,v ðω ; ω , 1Þ ¼ ð1  εÞω þ β ð1  εÞω V β,v ðp11 ; p11 Þ

þεω V β,v ðp11 ; Ψðω ÞÞ þ ð1  ω ÞV β,v ðp01 ; Ψðω ÞÞ

ð3:67Þ

According to (3.65), combined with (3.35), (3.36), and (3.37) by letting ω ¼ ω, we have, in the matrix form, the following linear functions: 0

V β,v ðp11 ; p11 Þ

1

0

βðε  1ÞΓðp11 Þ

B V ðp ; Ψðω ÞÞ C B ðε  1Þp β,v 11 C B 11 e1 B M B C¼B @ V β,v ðp01 ; Ψðω ÞÞ A @ ðε  1Þp01

1 C C C A

ð3:68Þ

ðε  1Þðω  βΓðω ÞÞ

v e 1 is defined in (3.69). where M 0

β2 ð1  EÞΓðp11 Þ  1 B βð1  EÞp 11 e 1 ¼B M B @ βð1  EÞp01

β2 EΓðp11 Þ βEp11  1

β2 ð1  Γðp11 ÞÞ βð1  p11 Þ

1 1 0 C C C 0 A

βEp01 βð1  p01 Þ  1   βð1  EÞðω  βΓðω ÞÞ βEðω  βΓðω ÞÞ β½ð1  βÞ  ðω  βΓðω ÞÞ 1 



ð3:69Þ Thus, by some mathematical operations, we obtain the Whittle index v for the region p11 b ω < ω0 v¼

βð1  εÞp01 þ ð1  εÞðω  βΓðω ÞÞ 1 þ βðp01  εp11 Þ  βð1  εÞðβΓðp11 Þ þ ω  βΓðω ÞÞ

ð3:70Þ

3.11.2 Region 2 When ω0 < ω < Γ( p11), we know that L(Γ(φ(ω)), ω) ¼ 0 for any ω 2 [p11, p01]. Thus, we have

3.11

Index Computation for Negatively Correlated Channels

67

V β,v ðω ; ω , 0Þ ¼ v þ βV β,v ðΓðω Þ; Γðω ÞÞ ¼ v þ βv þ β2 V β,v ðΓ2 ðω Þ; Γ2 ðω ÞÞ v ¼ 1β Meanwhile, by Corollary 3.4, we have V β,v ðω ; ω , 1Þ. Further, combined with the balance equation at co', we have analogously the following matrix function: 0

1 0 1 βð1  εÞΓðp11 Þ V β,v ðp11 ; p11 Þ B V ðp ; Ψðω ÞÞ C B  ð1  εÞp C β,v 11 C B C 11 e2 B ¼ M B C B C @ V β,v ðp01 ; Ψðω ÞÞ A @  ð1  εÞp01 A

ð3:71Þ

 ð1  εÞω

v e 2 is defined in (3.72). where M 0

1

β2 ð1  EÞΓðp11 Þ  1 β2 EΓðp11 Þ β2 ð1  Γðp11 ÞÞ 1

B B βð1  EÞp11 e1 ¼ B M B βð1  EÞp01 B @ βð1  EÞω

βEp11  1

βð1  p11 Þ

0

βEp01

βð1  p01 Þ  1

0

βEω

βð1  ω Þ



1 1β

C C C C ð3:72Þ C A

Thus, by solving the matrix function, we obtain the following Whittle index for the region ω0 < ω < Γ( p11): v¼

βð1  εÞp01 þ ð1  εÞð1  βÞω 1 þ βðp01  εp11 Þ  βð1  εÞðβΓðp11 Þ þ ð1  βÞω Þ

ð3:73Þ

3.11.3 Region 3 For Γðp11 Þbω < ω0 , we compute v in the following. Proposition 3.6 When Γ( p11) b ω < p01, L(Γ(ω), ω) ¼ 1 for any ω 2 [p11, p01]. Proof Based on the decreasing monotonicity of Γ(ω), then Γ(ω) b Γ( p11) b ω for any ω 2 [p11, p01]. Considering Γ(ω) 2 [p11, p01], then Γ2(ω) b Γ( p11) b ω, and so on; that is, L(Γ(ω), ω) ¼ 1 When Γðp11 Þbω < ω0 , by some mathematical operations, we can obtain the following matrix function:

68

3 Whittle Index Policy for Opportunistic Scheduling: Heterogeneous Two-State. . .

0

1 0 1 0 V β,v ðp11 ; p11 Þ B V ðp ; Ψðω ÞÞ C B ð1  εÞp C β,v 11 C B 11 C e3 B M B C¼B C @ V β,v ðp01 ; Ψðω ÞÞ A @ ð1  εÞp01 A

ð3:74Þ

ð1  εÞω

v e 3 is defined in (3.75). where M 0

1

B B B βð1  EÞp11 e1 ¼ B M B βð1  EÞp B 01 @ βð1  EÞω

0

0

βEp11  1 βEp01

βð1  p11 Þ βð1  p01 Þ  1

βEω

βð1  ω Þ

1 1 1β C C C 0 C C 0 C A 1  1β

ð3:75Þ

b

ω < ω0 as

Therefore, we have the Whittle index v for the region Γ( p11) follows: v¼

βð1  εÞp01 þ ð1  βÞð1  εÞω 1 þ βðp01  εp11 Þ  βð1  εÞω

ð3:76Þ

3.11.4 Region 4 When ω0 bω < p01 , we have L(Ψ(ω), ω) ¼ 1 and L( p11, ω) ¼ 1 by Corollary v v 3.3. Thus, V β,v ðp11 ; p11 Þ ¼ 1β and V β,v ðΨðω Þ; Ψðω ÞÞ ¼ 1β V β,v ðω ; ω , 0Þ ¼ v þ βV β,v ðΓðω Þ; Γðω ÞÞ ¼ v þ βm þ β2 V β,v ðΓ2 ðω Þ; Γ2 ðω ÞÞ v ¼ 1β V β,v ðω ; ω , 1Þ ¼ ð1  εÞω þ β½ð1  εÞω V β,v ðp11 ; p11 Þ þð1  ð1  εÞω ÞV β,v ðΨðω Þ; Ψðω ÞÞ v v    þ ð1  ð1  εÞω Þ ¼ ð1  εÞω þ β ð1  εÞω 1β 1β v ¼ ð1  εÞω þ β 1β

3.12

Index Computation for Positively Correlated Channels

69

Therefore, based on the balance equation V β,v ðω ; ω , 0Þ ¼ V β,v ðω ; ω , 1Þ, we have v ¼ ð1  εÞω

ð3:77Þ

Combing v in (3.70), (3.73), (3.76) with (3.77), we finally obtain the Whittle index shown in (3.19) for the negatively correlated channels.

3.12

Index Computation for Positively Correlated Channels

3.12.1 Region 1 When Γn ðφðp11 ÞÞbωβ ðmÞ < Γnþ1 ðφðp01 ÞÞ, according to Lemma 3.9, we let ω ¼ ω for (3.41)–(3.47), combine the balance function V β,v ðω ; ω , 0Þ ¼ V β,v ðω ; ω , 1Þ at ω and the following equation: V β,v ðω ; ω , 0Þ ¼ v þ βV β,v ðΓðω Þ; Γðω ÞÞ ¼ v þ βV β,v ðΓðω Þ; ω , 1Þ ¼ v þ βð1  εÞΓðω Þ þ β ð1  εÞΓðω ÞV β,v ðp11 ; p11 Þ þð1  Γðω ÞÞV β,v ðp01 ; Ψðω ÞÞ þ εΓðω ÞV β,v ðp11 ; Ψðω ÞÞ



ð3:78Þ

Finally, we can obtain the following linear functions: 0

V β,v ðp11 ; p11 Þ

1

0

1

ð1  εÞp11

B V ðp ; Ψðω ÞÞ C B βn ð1  εÞΓn ðp Þ B β,v 11 C B 11 M1  B C¼B @ V β,v ðp01 ; Ψðω ÞÞ A @ βn ð1  εÞΓn ðp01 Þ

C C C A

ð3:79Þ

ð1  εÞðω  βΓðω ÞÞ

v where M1 is defined in (3.80) 0

1

0

0

1 1β 0

1

C B C B C B βð1  EÞp11 βEp11  1 βð1  p11 Þ C B M1 ¼ B C βð1  p01 Þ  1 0 C B βð1  EÞp01 βEp01 A @ 1    βð1  EÞω βEω βð1  ω Þ  1β Thus, by simply solving (3.79), we have

ð3:80Þ

3 Whittle Index Policy for Opportunistic Scheduling: Heterogeneous Two-State. . .

70

  ð1  εÞ 1  βnþ1 ðω  βΓðω ÞÞ þ C 6   v¼ ð1  εÞ β  βnþ1 ðω  βΓðω ÞÞ þ C 9

ð3:81Þ

3.12.2 Region 2 When ωn0 bωβ ðmÞ < Γn ðφðp11 ÞÞ, letting ω ¼ ω for (3.48)  (3.57), combined with (3.78) and the balance function at ω we have 0

1

V β,v ðp11 ; p11 Þ

0

1

ð1  εÞp11

B V ðp ; Ψðω ÞÞ C B βn1 ð1  εÞΓn1 ðp Þ C C B 11 C B β,v 11 C B C B B V β,v ðp01 ; Ψðω ÞÞ C B βn1 ð1  εÞΓn1 ðp01 Þ C C C B B M2  B C¼B n C n C B V β,v ðp11 ; Ψðp11 ÞÞ C B β ð1  εÞΓ ðp11 Þ C B n C B n A @ V β,v ðp01 ; Ψðp11 ÞÞ A @ β ð1  εÞΓ ðp01 Þ

ð3:82Þ

ð1  εÞðω  βΓðω ÞÞ

v where M2 is defined in (3.83). 0

βð1  εÞp11  1

B B B βn ð1  εÞΓn1 ðp Þ B 11 B B B B βn ð1  εÞΓn1 ðp Þ B 01 B M2 ¼ B B B nþ1 ð1  εÞΓn ðp11 Þ Bβ B B B B βnþ1 ð1  εÞΓn ðp Þ B 01 @  βð1  εÞðω  βΓðω ÞÞ

βεp11 βð1  p11 Þ 0

0

1

0

C 1  βn1 C C 1β C C C 1  βn1 C C 1β C C C 1  βh C C C 1β C C 1  βn C C C 1β A

1

0

βn εΓn1 ðp11 Þ

βn ð1  Γn1 ðp11 ÞÞ

0

1

βn εΓn1 ðp01 Þ

βn ð1  Γn1 ðp01 ÞÞ

0

0

βnþ1 εΓn ðp11 Þ  1 βnþ1 ð1  Γn ðp11 ÞÞ

0

0

βnþ1 εΓn ðp01 Þ

βnþ1 ð1  Γn ðp01 ÞÞ  1

0

0

βεðω  βΓðω ÞÞ

βð1  ω Þ  β2 ð1  Γðω Þ 1

ð3:83Þ

Thus   ð1  εÞ 1  βnþ1 ðω  βΓðω ÞÞ þ C 6 v¼ C 0 ðω  βΓðω ÞÞ þ C 7

3.12.3 Region 3 When Γnþ1 ðφðp01 ÞÞbωβ ðmÞ < ωnþ1 0 , we have

ð3:84Þ

3.12

Index Computation for Positively Correlated Channels

1 0 1 ð1  εÞp11 V β,v ðp11 ; p11 Þ B V ðp ; ω Þ C B βL1 ð1  εÞΓL1 ðp Þ C B β,v 11 C B 11 C M1  B C¼B C @ V β,v ðp01 ; ω Þ A @ βL1 ð1  εÞΓL1 ðp01 Þ A

71

0

ð3:85Þ

ð1  εÞðω  βΓðω ÞÞ

v where, M1 is defined in (3.80). Thus

  ð1  εÞ 1  βnþ1 ðω  βΓðω ÞÞ þ C 6   v¼ ð1  εÞ β  βnþ1 ðω  βΓðω ÞÞ þ C 9

ð3:86Þ

3.12.4 Region 4 When p01  ω < ω10 we have L(Ψ(ω), ω) ¼ 0 since Ψ(ω) > ω according to Lemma 3.8 and L( p11, ω) ¼ L(Γ(ω), ω) ¼ 0 since p11 > Γ(ω) > ω. Thus, V β,v ðω ; ω , 0Þ ¼ v þ βV β,v ðΓðω Þ; Γðω ÞÞ

¼ v þ β ω V β,v ðΓð1Þ; p11 Þ þ ð1  ω ÞV β,v ðΓð0Þ; p11 Þ

¼ v þ β ω V β,v ðp11 ; p11 Þ þ ð1  ω ÞV β,v ðp01 ; p11 Þ V β,v ðω ; ω , 1Þ ¼ ð1  εÞω þ β ð1  εÞω V β,v ðp11 ; p11 Þ þð1  ð1  εÞω ÞV β,v ðΨðω Þ; Ψðω ÞÞ ¼ ð1  εÞω þ β ð1  εÞω V β,v ðp11 ; p11 Þ

þεω V β,v ðp11 ; p11 Þ þ ð1  ω ÞV β,v ðp01 ; p11 Þ

¼ ð1  εÞω þ β ω V β,v ðp11 ; p11 Þ þ ð1  ω ÞV β,v ðp01 ; p11 Þ

Further, based on V β,v ðω ; ω , 0Þ ¼ V β,v ðω ; ω , 1Þ, we have v ¼ ð1  εÞω

ð3:87Þ

3.12.5 Region 5   When ωn0 ω < ωn0 , we obtain v by linear interpolation on two endpoints v ωn0 and v ωn0

72

3 Whittle Index Policy for Opportunistic Scheduling: Heterogeneous Two-State. . .

v¼v



ωn0





þ ω

ωn0

 vωn   vωn  0

ωn0  ωn0

0

ð3:88Þ

3.12.6 Region 6 When ω0  ω  p11, we have L(Γi(ω), ω) ¼ 1 for i  1, L(Ψ( p11), ω) ¼ 1, and L(Ψ(ω), ω) ¼ 1. Thus, h V β,v ðp11 ; p11 Þ ¼ ð1  εÞp11 þ β ð1  εÞp11 V β,v ðp11 ; p11 Þ i þð1  ð1  εÞp11 ÞV β,v ðΨðp11 Þ; Ψðp11 ÞÞ h ¼ ð1  εÞp11 þ β ð1  εÞp11 V β,v ðp11 ; p11 Þ i þð1  ð1  εÞp11 ÞV β,v ðΨðp11 Þ; ω , 0Þ h ¼ ð1  εÞp11 þ β ð1  εÞp11 V β,v ðp11 ; p11 Þ i v þð1  ð1  εÞp11 Þ 1β

ð3:89Þ

V β,v ðω ; ω , 0Þ ¼ v þ βV β,v ðΓðω Þ; Γðω ÞÞ ¼ v þ βV B,v ðΓðω Þ; ω , 0Þ  

¼ v þ β v þ βV β,v Γ2 ðω Þ; ω , 0 ¼

v 1β

h V β,v ðω ; ω , 1Þ ¼ ð1  εÞω þ β ð1  εÞω V β,v ðp11 ; p11 Þ i þð1  ð1  εÞω ÞV β,v ðΨðω Þ; Ψðω ÞÞ h ¼ ð1  εÞω þ β ð1  εÞω V β,v ðp11 ; p11 Þ i þð1  ð1  εÞω ÞV β,v ðΨðω Þ; ω , 0Þ

ð3:90Þ

3.13

Numerical Study

73

h ¼ ð1  εÞω þ β ð1  εÞω V β,v ðp11 ; p11 Þ i v þð1  ð1  εÞω Þ 1β

ð3:91Þ

Finally, based on V β,v ðω ; ω , 0Þ ¼ V β,v ðω ; ω , 1Þ we obtain the following: v¼

ð1  εÞω 1  βð1  εÞðp11  ω Þ

ð3:92Þ

Combing v in (3.81), (3.84), (3.86), (3.87), (3.88) with (3.92), we finally obtain the Whittle index shown in (3.20) for the positively correlated channels.

3.13

Numerical Study

In this section, we evaluate the performance of the Whittle index policy by comparing with optimal policy and myopic policy.

3.13.1 Whittle Index versus Optimal Policy In the first scenario, the parameters are set as N ¼ 3, εi ¼ 0.01, and n o3 ðiÞ ðiÞ ¼ fð0:3, 0:7Þ, ð0:4, 0:8Þ, ð0:5, 0:7Þg: From Fig. 3.5, we β ¼ 1, p01 , p11 i¼1

observe that the Whittle index policy has almost the same performance with the optimal policy. In the second scenario, the parameters are set as N ¼ 3,

n

ðiÞ

ðiÞ

p01 , p11

o3 i¼1

¼

fð0:3, 0:7Þ, ð0:8, 0:4Þ, ð0:3, 0:6Þg, εi ¼ 0.01, and β ¼ 1. From Fig. 3.6, we observe that the Whittle index policy has about 1% performance loss compared with the optimal policy. Combining Figs. 3.5 and 3.6, we have the following intuition result: the Whittle index policy performs worse with the increasing heterogeneity among channels.

3.13.2 Whittle Index verse Myopic Policy In this scenario N ¼ 10,

n

ðiÞ

ðiÞ

p01 , p11

o10 i¼1

, ¼ fð0:3, 0:9Þ, ð0:8, 0:1Þ, ð0:30:8Þ,

(0.1, 0.9), (0.9, 0.1), (0.4, 0.8), (0.5, 0.3), (0.3, 0.3), (0.3, 0.6), (0.8, 0.1)}, and

3 Whittle Index Policy for Opportunistic Scheduling: Heterogeneous Two-State. . .

74

0.72

Throughput (bits per slot)

0.71 0.7 0.69

Myopic Whittle

0.68 0.67 0.66 0.65

0

50

100

150

200

250

300

T

Fig. 3.5 N ¼ 3, β ¼ 1, εi ¼ 0:01,

n

ðiÞ

ðiÞ

p01 , p11

o3 i¼1

n o ¼ ð0:3, 0:7Þ, ð0:4, 0:8Þ, ð0:5, 0:7Þ

0.62

Throughput (bits per slot)

0.61 0.6 Optimal Whittle

0.59 0.58 0.57 0.56

0

50

100

150

200

250

T

Fig. 3.6 N ¼ 3, β ¼ 1, εi ¼ 0:01,

n

ðiÞ

ðiÞ

p01 , p11

o3 i¼1

n o ¼ ð0:3, 0:7Þ, ð0:8, 0:4Þ, ð0:3, 0:6Þ

300

3.14

Summary

75

0.84

Throughput (bits per slot)

0.82 0.8 Myopic Whittle

0.78 0.76 0.74 0.72

0

50

100

150

200

250

300

T

n o10 n ðiÞ ðiÞ Fig. 3.7 N ¼ 10, β ¼ 1, εi ¼ 0:01, ðp01 , p11 Þ ¼ ð0:3, 0:9Þ, ð0:8, 0:1Þ, ð0:3, 0:8Þ, ð0:1, 0:9Þ , i¼1 o (0.9, 0.1), (0.4, 0.8), (0.5, 0.3), (0.3, 0.3), (0.3, 0.6), (0.8, 0.1)

εi ¼ 0.01, and β ¼ 1. From Fig. 3.7, we can see that the Whittle index policy performs a little worse than the myopic policy when T b 18, while after that threshold time, performs better. This can be easily explained as follows: the myopic policy performs better in the initial period since it only exploits information to maximize utility but ignores exploring information for future decision. However, the Whittle index considers the balance between exploitation and exploration, so it performs better after the initial period.

3.14

Summary

In this chapter, we address the multichannel opportunistic access problem, in which a user decides, at each time slot, which channel to access among multiple GilbertElliot channels in order to maximize his aggregated utility, given that the observation of channel state is error-prone. The problem can be cast into a restless multiarmed bandit problem which is proved to be PSPACE-Hard. We thus study the alternative approach, Whittle index policy, a very popular heuristic for restless bandits, which is provably optimal asymptotically and has good empirical performance. However, in the case of imperfect observation, the traditional approach of computing the Whittle index policy cannot be applied because the channel state belief evolution is no more linear, thus rendering the indexability of our problem open. To bridge the gap, we mathematically establish the indexability and establish the closed-form Whittle index, based on which index policy can be constructed. The major technique is

76

3 Whittle Index Policy for Opportunistic Scheduling: Heterogeneous Two-State. . .

our analysis is a fixed point-based approach which enable us to divide the belief information space into a series of regions and then establish a set of periodic structures of the underlying nonlinear dynamic evolving system, based on which we devise the linearization scheme for each region to establish indexability and compute the Whittle index for each region.

References 1. E.N. Gilbert, Capacity of a burst-noise channel. Bell Syst. Tech. J. 39(5), 1253–1265 (1960) 2. K. Wang, L. Chen, Q. Liu, K.A. Agha, On optimality of myopic sensing policy with imperfect sensing in multi-channel opportunistic access. IEEE Trans. Commun. 61(9), 3854–3862 (2013) 3. K. Wang, Q. Liu, F. Li, L. Chen, X. Ma, Myopic policy for opportunistic access in cognitive radio networks by exploiting primary user feedbacks. IET Commun. 9(7), 1017–1025 (2015) 4. J. Nino-Mora, S.S. Villar. Sensor scheduling for hunting elusive hiding targets via whittle’s restless bandit index policy. in Proc. of NetGCoop, Paris, France, Oct 2011, pp. 1–8. 5. C.H. Papadimitriou, J.N. Tsitsiklis, The complexity of optimal queueing network control. Math. Oper. Res. 24(2), 293–305 (1999) 6. S. Ahmad, M. Liu, T. Javidi, Q. Zhao, B. Krishnamachari, Optimality of myopic sensing in multichannel opportunistic access. IEEE Trans. Inf. Theory 55(9), 4040–4050 (2009) 7. K. Wang, L. Chen, On optimality of myopic policy for restless multi-armed bandit problem: an axiomatic approach. IEEE Trans. Signal Process. 60(1), 300–309 (2012) 8. Y. Liu, M. Liu, S.H.A. Ahmad, Sufficient conditions on the optimality of myopic sensing in opportunistic channel access: a unifying framework. IEEE Trans. Inf. Theory 60(8), 4922–4940 (2014) 9. Y. Ouyang, D. Teneketzis, On the optimality of myopic sensing in multi-state resources. IEEE Trans. Inf. Theory 60(1), 681–696 (2014) 10. K. Liu, Q. Zhao, B. Krishnamachari, Dynamic multichannel access with imperfect channel state detection. IEEE Trans. Signal Process. 58(5), 2795–2807 (2010) 11. K. Wang, L. Chen, Q. Liu, Opportunistic spectrum access by exploiting primary user feedbacks in underlay cognitive radio systems: an optimality analysis. IEEE J. Sel. Topics Signal Process. 7(5), 869–882 (2013) 12. I.M. Verloop, Asymptotically optimal priority policies for indexable and non-indexable restless bandits. Ann. Appl. Probab. 26(4), 1947–1995 (2016) 13. R.R. Weber, G. Weiss, On an index policy for restless bandits. J. Appl. Probab. 27(3), 637–648 (1990) 14. P.R. Singh, X. Guo. Index policies for optimal mean-variance trade-off of inter-delivery times in real-time sensor networks. In Proc. of 2015 IEEE Conference on Computer Communications (INFOCOM), 2015. 15. J. L. Ny, M. Dahleh, E. Feron. Multi-UAV dynamic routing with partial observations using restless bandit allocation indices. in Proc. of American Control Conf., pages 42204225, Seattle, WA, June 2008. 16. V.S.B. Konstantin E. Avrachenkov. Whittle index policy for crawling ephemeral content. in Proc. of 2015 IEEE 54th Annual Conference on Decision and Control (CDC), 2015. 17. K. Liu and Q. Zhao. Indexability of restless bandit problems and optimality of whittle index for dynamic multichannel access. IEEE Trans. Inf. Theory, 56(11):5547-5567, Nov. 2010. 18. W. Ouyang, S. Murugesan, A. Eryilmaz, N.B. Shroff, Exploiting channel memory for joint estimation and scheduling in downlink networks—a Whittle’s indexability analysis. IEEE Trans. Inf. Theory 4(61), 1702–1719 (2015) 19. F. Cecchi, P. Jacko. Scheduling of users with markovian time-variyng transmission rates. in Proc. ACM Sigmetrics, Pittsburgh, PA, June 2013.

References

77

20. S. Aalto, P. Lassila, P. Osti. Whittle index approach to size-aware scheduling with timevarying channels. in Proc. ACM Sigmetrics, Portland, OR, June 2015. 21. S. Aalto, P. Lassila, P. Osti, Whittle index approach to size-aware scheduling for timevarying channels with multiple states. Queueing Syst. 83, 195–225 (2016) 22. J. Gittins, K. Glazebrook, R. Weber, Multi-armed Bandit Allocation Indices, 2nd edn. (John Wiley & Sons Ltd., New York, NY, 2011) 23. D.R. Hernandez, Indexable Restless Bandits (VDM Verlag, 2008) 24. P. Jacko, Dynamic Priority Allocation in Restless Bandit Models (LAP Lambert Academic Publishing, 2010) 25. T.J.S.H. Ahmad, M. Liu, Q. Zhao, B. Krishnamachari, Optimality of myopic sensing in multichannel opportunistic access. IEEE Trans. Inf. Theory 55(9), 4040–4050 (2009) 26. S. Ahmad and M. Liu. Multi-channel opportunistic access: a case of restless bandits with multiple players. in Proc. Allerton Conf. Commun. Control Comput, Oct. 2009, pp. 1361–1368.

Chapter 4

Myopic Policy for Opportunistic Scheduling: Homogeneous Multistate Channels

4.1 4.1.1

Introduction Background

Consider a downlink scheduling wireless communication system composed of one user and N homogeneously independent channels, each of which is modeled as an Xstate Markov chain with known matrix of transition probabilities. At each time period one channel is scheduled to work and some reward depending on the state of the worked channel is obtained. The objective is to design a scheduling policy maximizing the expected accumulated discounted reward over an infinite time horizon. Mathematically, the scheduling problem can be formulated into a RMAB problem in decision theory [1]. Actually, RMAB has a wide range of applications, i.e., wired/wireless communication systems, manufacturing systems, transportation systems, economic systems, statistics, biomedical engineering, and information systems [1, 2]. However, the RMAB problem is proved to be PSPACE-Hard [3]. Specifically, the challenges of multistate RMAB are threefold: First, the probability vector is not completely ordered in probability space, making structural analysis substantially more difficult; Second, multistate RMAB tends to encounter the “curse of dimensionality,” which is further complicated by the uncountably infinite probability space; Third, the imperfect observation brings out nonlinear propagation of belief information about system state. Although the first two factors are usually taken into account, there are few literature in RMAB involving nonlinear belief information propagation. Hence, numerical approach is used to evaluate policy performance generally. In fact, the numerical approach has huge computational complexity and cannot provide any meaningful insight into optimal policy.

© The Author(s), under exclusive license to Springer Nature Switzerland AG 2021 K. Wang, L. Chen, Restless Multi-Armed Bandit in Opportunistic Scheduling, https://doi.org/10.1007/978-3-030-69959-8_4

79

80

4 Myopic Policy for Opportunistic Scheduling: Homogeneous Multistate Channels

For the three reasons, in this paper, we study two instances of the generic RMAB in which the optimal policy is the so-called myopic policy or greedy policy with linear computation complexity O ðN Þ. Specially, we develop two sets of sufficient conditions to guarantee the optimality of the myopic policy for the two instances. In a nutsell, the optimal policy is to schedule the best channels each time in the sense of MLR order [4].

4.1.2

Existing Work

In the generic description of RMABs, there are N arms evolving as Markov chains. A player activates multiple arms out of N arms each time, and receives a reward depending on the states of the activated arms. The objective of the play is to maximize the expected reward sum over an infinite horizon by deciding which arms should be activated each time. As a simple instance, if the unactivated arms hold their states, then the RMAB problem is degenerated into the MAB one which was solved to reveal that the optimal policy has an index structure [5, 6]. As far as RMAB problem is concerned, there are two major research lines in this field. The first research thrust is to analyze performance gap between optimal policy and approximation policy [7–9]. Specifically, a simple myopic policy was shown to achieve a factor 2 approximation of optimal policy for a subclass of Monotonic MAB [8, 9]. The second direction is to seek sufficient condition to guarantee the optimality of the myopic policy for some instances of RMAB, particularly in the scheduling field of opportunistic communications [10–17]. To characterize the “myopic” (or “greedy”) feature, we need a certain kind of order concerning the information states of all channels, which makes the multistate channel model different from two-state one to a large extent. Specifically, the optimality of the myopic policy requires total order for two-state model while requires partial order for multistate one. For two-state Markov channel model, the structure and partial optimality of the myopic policy were obtained for homogeneous Markov channels in [17]. Then the optimality of the myopic sensing policy was obtained for the positively correlated homogeneous Markov channels for accessing one channel in [18], and further was extended to access multiple homogeneous channels in [10]. From the viewpoint of exploitation dominating exploration for myopic policy, in [15], we extended homogeneous channels to heterogeneous ones, and derived a set of closed-form sufficient conditions to guarantee the optimality of myopic sensing policy. The authors [12] studied the myopic channel probing policy for the scenario with imperfect observation, but only established its optimality in the particular case of probing one channel each time. However, we established the optimality of myopic policy for the case of probing N  1 of N channels each time, analyzed the performance of the myopic probing policy by domination theory in [16]. Further, we studied accessing arbitrary number of channels each time and derived more generic conditions on the optimality in [14].

4.1 Introduction

81

For multistate Markov channel model [19–21], the authors in [19] established sufficient conditions for optimality of myopic sensing policy (in sense of first-order stochastic dominance (FOSD)) in multistate homogeneous channels with perfect observation. For the same channel model, we study the optimality of myopic policy when transition matrix has a non-trivial eigenvalue with N -1 times [21]. Actually, due to noise, detecting methods, and etc., it is hard to implement the perfect state observation in practical applications.

4.1.3

Main Results and Contributions

Hence, in this paper, we consider the problem of imperfect (or indirect) observation of channel states, and analyze its impact on scheduling channels, which makes our scheduling problem different from [19, 21] to a large extent. Under indirect or imperfect observation, an observation matrix is introduced, from the viewpoint of mathematics, to replace the identity matrix for the direct observation case considered in [19, 21]. Hence, the FOSD order adopted in [19, 21] is not sufficient to characterize the order of information states for imperfect observation. Thus, the MLR order, a kind of stronger stochastic order than FOSD, is utilized to describe the order structure of information states. Moreover, the basic approach used in this paper is totally different from [19] in deriving the optimality of myopic policy. Specifically, our argument is more generic, and can be extended to the case of heterogeneous multistate model. As a result, the sufficient conditions obtained in this paper cannot degenerate into those of [19] for the perfect observation, which demonstrates the tradeoff between more generic method and stricter condition. In particular, the contributions of this paper include the following: • The structure of the myopic policy is shown to be a simple queue determined by the information states of channels provided that certain conditions are satisfied for the observation matrix and the probability transition matrix of homogeneous multistate channels. • We establish two sets of conditions to ensure the optimality of myopic policy for two different scenarios—one is the case of “positive” order on the row vectors of probability transition matrix, and another is the “inverse” order on the row vectors of that matrix. • Our derivation demonstrates the advantage of branch-and-bound and the directed comparison-based optimization approach. Notation: Boldface lower and upper case letters represent column vectors and matrices, respectively.

82

4 Myopic Policy for Opportunistic Scheduling: Homogeneous Multistate Channels

Table 4.1 Main notations Symbols N X A B ai, j bim ðnÞ

Descriptions The set of N channels, i.e., 1, 2, ⋯, N The set of X states, i.e., 1,2,. . .X Channel state transition matrix Observation probability matrix The transition probability of from state i to j The probability of being observed as state m when state i The state of channel n in slot t

x0

ðnÞ

Initial probability distribution of channel n

ðnÞ xt

The information state at time t

r ut T t ei E [A]k, [A]:, l [A]k, l () tr(), ()⊤ diag(a1, . . ., aK)

The X-dimensional reward column vector The allocation policy in time t The total number of time slots Time slot index An X-dimensional column vector with 1 in the i-th element and 0 in others The X  X identity matrix The kth row, the lth column, respectively The element in the kth row and the lth column of matrix A The complex conjugate The trace, transpose, of a matrix, respectively A diagonal and a block-diagonal matrix with a1 . . . aK

st

4.2 4.2.1

Problem Formulation System Model

Table 4.1 summaries main notations used in this chapter. Consider a wireless downlink slotted system consisting of one user and N independent channels n ¼ 1, ⋯, N. Assume each channel n has a finite number, ðnÞ X, of states, denoted as X ¼ f1, 2, ⋯, X g. Let st denote the state of channel n at ðnÞ time slot t ¼ 1, 2, ⋯. The state st evolves according  to an X-state homogeneous Markov chain with transition probability matrix A ¼ aij i,j2X , where,   ðnÞ ð nÞ aij ¼ ℙ stþ1 ¼ j j st ¼ i ðnÞ

ðnÞ

ð nÞ

All channels are initialized with s0  x0 , where x0 are specified initial distributions for n ¼ 1, ⋯, N. At each time instant t, only one of these channels can be allocated to the user. If  ðnÞ t channel n is allocated at time t, an instantaneous reward β r st is accrued, herein, 0 b β b 1 denotes the discount factor.

4.2 Problem Formulation

83

After transmitting information over the chosen channel n, the state of channel n is ðnÞ indirectly observed via noisy measurement (i.e., feedback information) to be ytþ1 of ðnÞ ðnÞ the channel state stþ1 . Assume that these observations ytþ1 belong to a finite set Y indexed by m ¼ 1, ⋯, Y. Let B ¼ ðbim Þi2X ,m2Y denote  the homogeneous observation  ðnÞ

ðnÞ

probability matrix, where each element is bim ≜ℙ ytþ1 ¼ m j st

¼ i, ut ¼ n . That

is, B is homogeneous. ðu Þ Let ut 2 {1, ⋯, N} denote which channel is allocated at time t. Consequently, stþ1t denotes the stateof the allocatedchannel at time t + 1. Denote the observation history ðu Þ ðu Þ at time t as Y t ≔ y1 0 , ⋯, yt t1 and the decision history as Ut ≔ (u0, ⋯, ut). Then the channel at time t + 1 is chosen according to ut + 1 ¼ μ(Yt + 1, Ut), where the policy denoted as μ belongs to the class of stationary policies U . The total expected discounted reward over an infinite time horizon is given by " Jμ ¼ 

1 X

# β

t

ðu Þ rðst t Þ

,

ut ¼ μðY t , U t1 Þ

ð4:1Þ

t¼0

where  denotes mathematical expectation. The aim is to determine the optimal stationary policy μ ¼ argmaxJ μ μ2U

ð4:2Þ

which yields the maximum rewards J  ¼ J μ in (4.1).

4.2.2

Information State

The above partially observed multiarmed bandit problem can be re-expressed as a fully observed multiarmed bandit in terms of the information state. For each channel ðnÞ n, denoted by xt the information state at time t (Bayesian posterior distribution of ðnÞ st ) as ðnÞ

xt

  ðnÞ ðnÞ ðnÞ ¼ xt ð1Þ, xt ð2Þ, ⋯, xt ðX Þ

  ðnÞ ðnÞ where xt ðiÞ≜ℙ st ¼ i j Y t , U t1 . The hidden markovian model (HMM) multiarmed bandit problem can be viewed as the following scheduling problem: Consider N parallel HMM state estimation ðnÞ filters, one for each channel. The channel n is allocated, an observation ytþ1 is ðnÞ obtained and the information state xtþ1 is computed recursively by the HMM state filter according to

84

4 Myopic Policy for Opportunistic Scheduling: Homogeneous Multistate Channels

  ð nÞ ðnÞ ðnÞ xtþ1 ¼ Γ xt , ytþ1 , if channel n is allocated at time t

ð4:3Þ

where   ByðnÞ A⊤ xðnÞ ðnÞ ðnÞ ≜ , Γ x ,y dðxðnÞ , yðnÞ Þ     B yðnÞ A⊤ xðnÞ : d xðnÞ , yðnÞ ≜1⊤ X

ð4:4Þ

In (4.4), if y(n) ¼ m, then B(m) ¼ diag [b1m, ⋯, bXm] is the diagonal matrix formed by the mth column of the observation matrix B, Ai is the ith row of the matrix A, 1X is an X-dimensional column vector of ones, and E is an identity matrix. The state estimation of the other N  1 channels is according to xtþ1 ¼ A⊤ xt ðlÞ

ðlÞ

ð4:5Þ

if channel l is not allocated at time t, l 2 {1, ⋯, N}, l 6¼ n. Let Π(X) denote the state space of information states x(n), n 2 {1, 2, ⋯, N}, which is a X  1-dimensional simplex: ΠðXÞ ¼ fx 2 X : 1⊤ X x ¼ 1, 0bxðiÞb1 for all i 2 X g

ð4:6Þ

ðnÞ

The process xt , n ¼ 1, ⋯, N qualifies as an information state   since choosing ut+1 ð1Þ ðN Þ ¼ μ(Yt+1, Ut) is equivalent to choosing utþ1 ¼ μ xtþ1 , ⋯, xtþ1 . Using the smoothing property of conditional expectations, the reward function (4.1) can be rewritten in terms of the information state as " Jμ ¼ 

1 X

# ðu Þ β t r ⊤ xt t

,

  ð1Þ ðNÞ ut ¼ μ xt , ⋯, xt

ð4:7Þ

t¼0

where r denotes the X-dimensional reward column vector with r(1) b r(2) b ⋯ b r (X). The aim is to compute the optimal policy argmaxJ μ . μ2U

For convenient analysis of the optimization problem in (4.8), we derive its dynamic programming formulation for a finite time T as follows:

4.2 Problem Formulation

85

h i 8  ð1:NÞ  ðu Þ > ¼ max  r⊤ xT T < V T xT uT h    i P  ðut Þ  ð1:NÞ ðu Þ ð1:u 1Þ ðut Þ ðu þ1:NÞ > d xt , m V tþ1 xtþ1 t , xtþ1,m , xtþ1t ¼ max  r⊤ xt t þ β : V t xt ut

m2Y

ð4:8Þ   ði:jÞ ðiÞ ðiþ1Þ ð jÞ where, xt ≜ xt , xt , ⋯, xt , and 8   < xðut Þ ¼ Γ xðt ut Þ , m tþ1,m :

xtþ1 ¼ A⊤ xt , n 6¼ ut ðnÞ

ðnÞ

ð4:9Þ

When T! 1 , J ¼ V0(X(1 : N )).

4.2.3

Myopic Policy

Theoretically, we can solve the above dynamic programming by backward deduction to obtain optimal policy. However, obtaining the optimal solution directly from the above recursive equations is computationally prohibitive. Hence, we turn to seek a simple myopic policy only maximizing the immediate reward, which is easy to compute and implement, formally defined as follows: b ut ¼ argmax r⊤ xt

ðut Þ

ut

ð4:10Þ

As we know, the myopic policy or greedy policy is not optimal generally, but its computation complexity is very low in O ðN Þ. Thus, our aim in the following parts is to seek sufficient conditions to guarantee that the myopic policy is optimal. Considering the multivariate information state, we introduce some partial orders to characterize the “order” of information states in the following sections. Definition 4.1 (MLR ordering [4]) Let x1, x2 2 Π(X) be any two belief vectors. Then X1 is greater than X2 with respect to the MLR ordering—denoted as x1⩾rx2, if x1 ðiÞx2 ð jÞbx2 ðiÞx1 ð jÞ, i < j, i, j 2 f1, 2, . . ., X g: Definition 4.2 (FOSD ordering [4]) Let x1, x2 2 Π(X) then x1 first order stochastically dominates x2—denoted as x1⩾sx2, if the following exists for j ¼ 1, 2, ⋯, X,

86

4 Myopic Policy for Opportunistic Scheduling: Homogeneous Multistate Channels X X i¼j

x1 ð i Þ ⩾

X X

x2 ðiÞ

i¼j

Some useful results for MLR and FOSD order [4] are stated here: Proposition 4.1 ([4]) Let x1, x2 2 Π(X), the following holds (i) x1⩾rx2 implies x1⩾sx2. (ii) Let V denote the set of all X-dimensional vectors v with nondecreasing components, i.e., v1 b v2 b ⋯ b vX Then X1⩾sX2 i.f.f. for all v 2 V , v⊤ x1 ⩾ v⊤ x2 . Based on partial order structure, we now describe the structure of the myopic policy. ðσ Þ

ðσ Þ

ðσ Þ

Proposition 4.2 (Structure of Myopic Policy) If, xt 1 ⩾ r xt 2 ⩾ r ⋯ ⩾ r xt N , herein, {σ 1, σ 2, ⋯, σ N} is a permutation of {1, ⋯, N}, then the myopic policy at t is   ð1Þ ðN Þ b ut ¼ μt xt , ⋯, xt ¼ σ1

4.3

ð4:11Þ

Optimality Analysis of Myopic Policy

To analyze the performance of the myopic policy, we first introduce an auxiliary value function and then prove a critical feature of the auxiliary value function. Next, we give some assumptions about transition matrix, and show its special stochastic order. Finally, by deriving the bounds of different policies, we get some important bounds, which serve as the basis to prove the optimality of the myopic policy.

4.3.1

Value Function and Its Properties

First, we define the auxiliary value function (AVF) as follows:

4.3 Optimality Analysis of Myopic Policy

87

  8 ðb uT Þ ð1:NÞ ðb u Þ > x ¼ r ⊤ xT T , W > T T > > >      > X  > ðb uτþ1 Þ ð1:b uτ 1Þ ðb uτ Þ ðb uτ þ1:NÞ > ðb uτ Þ ⊤ ðb uτ Þ > W τðbuτ Þ xð1:NÞ x þ β d x , m W , x , X ¼ r x , > τ τ τþ1 τþ1 τþ1,m τþ1 τ > > > m2Y < |fflfflfflfflfflfflfflfflfflfflfflfflfflfflfflfflfflfflfflfflfflfflfflfflfflfflfflfflfflfflfflfflfflfflfflfflfflfflfflfflfflfflfflfflffl{zfflfflfflfflfflfflfflfflfflfflfflfflfflfflfflfflfflfflfflfflfflfflfflfflfflfflfflfflfflfflfflfflfflfflfflfflfflfflfflfflfflfflfflfflffl} ð1:NÞ FðxT ,b uτ Þ > > >       X > > ðut Þ ð1:NÞ ðut Þ ðb utþ1 Þ ð1:ut 1Þ ðut Þ ðut þ1:NÞ ⊤ ðut Þ > W x þ β d x , m W , x , x x ¼ r x > t t t t tþ1 tþ1 tþ1,m tþ1 > > > m2Y > > |fflfflfflfflfflfflfflfflfflfflfflfflfflfflfflfflfflfflfflfflfflfflfflfflfflfflfflfflfflfflfflfflfflfflfflfflfflfflfflfflfflffl ffl {zfflfflfflfflfflfflfflfflfflfflfflfflfflfflfflfflfflfflfflfflfflfflfflfflfflfflfflfflfflfflfflfflfflfflfflfflfflfflfflfflfflffl ffl} > : ð1:NÞ Fðxt

, ut Þ

ð4:12Þ where, t + 1 b τ b T. Remark 4.1 AVF is the reward under the policy: at slot t, ut is adopted, while after t, myopic policy b uτ ðt þ 1bτbT Þ is adopted.   ðu Þ ð1:N Þ Lemma 4.1 W t t xt is decomposable for all t ¼ 0, 1, ⋯, T, i.e., ðut Þ



Wt

ð1:n1Þ

xt

ðnÞ

ðnþ1:N Þ

 ¼

, xt , xt

X X

  ðnÞ ð1:n1Þ ðnþ1:N Þ xt ðiÞW ut xt , e i , xt

i¼1

¼

X X

e⊤ i xt W t ðnÞ

ðut Þ



ð1:n1Þ

xt

ðnþ1:N Þ

, e i , xt

i¼1

Proof Please refer to Appendix 4.6.1.

4.3.2



Assumptions

We make the following assumptions/conditions. Assumption 1 Assume that (i) A1 r A2 r . . . r AX. (ii) B:,1 r B:,2 r . . . r B:,Y. (iii) There exists some K(2 b K b Y ) such that ΓðA⊤ e1 , K Þ ⩾ r ðA⊤ Þ eX 2

ΓðA⊤ eX , K  1Þbr ðA⊤ Þ e1 2

(iv) A1 br x0

ð1Þ

br xð02Þ br ⋯br xð0N Þ br AX



88

4 Myopic Policy for Opportunistic Scheduling: Homogeneous Multistate Channels

(v) r⊤(ei + 1  ei) ⩾ r⊤Q⊤(ei Q ¼ VΥV1 0

... ...

1 B 0 B Λ¼B @⋮

0 λ2 ⋮

0

0

+ 1

1

 ei)(1 0

b i b X  1), where 1

0 B B 0 B 0 C C C, Υ ¼ B B⋮ A ⋱ ⋮ B @ . . . λX 0

0 βλ2 1  βλ2 ⋮ 0

A ¼ VΛV1,

...

0

...

0



⋮ βλX ... 1  βλX

1 C C C C C C A

Remark 4.2 Assumption 1(i) ensures that the higher the quality of the channel’s current state, the higher is the likelihood that the next channel state will be of high quality. Assumption 1(ii) ensures that the higher the quality of the channel’s current state, the higher is the observation likelihood that the next channel state will be of high quality. Assumption 1(iii) along with 1.1–1.2 ensure that the information states of all channels can be ordered at all times in the sense of MLR order (see the proof of Proposition 4.6). Assumption 1(iv) states that initially the channels can be ordered in terms of their quality. Basically, Assumption 1(i)–(iv) ensure the order of information states, while Assumption 1(v) is required for reward comparison. In particular, Assumption 1(v) states that the instantaneous reward obtained at different states is sufficiently separated. Example 4.1 When X ¼ 2, Y ¼ 2, we have λ2 ¼ a22  a12. Assumption 1 degenerates into the following [22]: (i) a22 ⩾ a12, that is, λ2 ⩾ 0 (ii–iii) b22 ⩾ b12, ð1Þ ð2Þ ðN Þ (iv) a12 bx0 ð2Þbx0 ð2Þb⋯bx0 ð2Þba22 , βλ ðr ð2Þ  r ð1ÞÞ, that is, βλ2 b 12 (v) r ð2Þ  r ð1Þ ⩾ 1βλ

4.3.3

Properties

Under Assumption 1(i)–(v), we have some important propositions concerning the structure of information state in the following, which are proved in Appendix A. Proposition 4.3 Let x1, x2 2 Π(X) and x1bx2, then ðA1 ÞT  r AT x1  r AT x2  r ðAX ÞT : Proof Suppose i > j, we have

4.3 Optimality Analysis of Myopic Policy

89

⊤ ⊤ ⊤ ⊤ ⊤ ⊤ ⊤ ðe⊤ i A x2 Þ  ðe j A x1 Þ  ðe j A x2 Þ  ðei A x1 Þ

¼

X X

Ak,i x2 ðk Þ

k¼1

¼

Al,j x1 ðlÞ 

l¼1

X X X  X l¼1

X X

X X

Ak,j x2 ðk Þ

k¼1

X X

Al,i x1 ðlÞ

l¼1

  Ak,i Al,j  Al,i Ak,j x2 ðkÞx1 ðlÞ  x2 ðlÞx1 ðkÞ ⩾ 0

k¼l

where the last inequality is due to Ak⩾rAl(k ⩾ l) and x2⩾rx1. Then, considering e1brx1brx2breX, we have



ðA1 Þ⊤ ¼ A⊤ e1 br A⊤ x1 br A⊤ x2 br A⊤ eX ¼ ðAX Þ⊤ Proposition 4.3 states that if at any time t the information states of two channels are stochastically ordered and none of these channels is chosen at t, then the same stochastic order between the information states at time t + 1 is maintained. Proposition 4.4 Let x1, x2 2 Π(X) and (A1)⊤brX1brX2br(AX)⊤, then for 1bkbY Γ ð x 1 , k Þ br Γ ð x 2 , k Þ Proof According to Proposition 4.3, we have z1 ¼ A⊤x1brA⊤x2 ¼ z2. Suppose i > j, we have ðΓðx2 , k ÞÞi  ðΓðx1 , k ÞÞ j  ðΓðx2 , kÞÞ j  ðΓðx1 , kÞÞi ¼

bjk z1 ð jÞ bjk z2 ð jÞ bik z2 ðiÞ b z ði Þ  X  X  X ik 1 X P P P P bxk z2 ðxÞ bxk z1 ðxÞ bxk z2 ðxÞ bxk z1 ðxÞ x¼1

x¼1

¼

x¼1

x¼1

bik b jk ðz2 ðiÞz1 ðjÞ  z2 ðjÞz1 ðiÞÞ ⩾0 X X P P bxk z2 ðxÞ bxk z1 ðxÞ x¼1

x¼1

where, z2(i)z1( j)  z2( j)z1(i) ⩾ 0 is from z1brz2.



Proposition 4.4 states the increasing monotonicity of updating rule with information state for scheduled channel. Proposition 4.5 Let x 2 Π(X) and (A1)⊤brXbr(AX)⊤, then Γ(x, k)brΓ(x, m) for any 1 b k b m b Y. Proof Let z ¼ A⊤x. Suppose i > j, we have

90

4 Myopic Policy for Opportunistic Scheduling: Homogeneous Multistate Channels

ðΓðx, mÞÞi  ðΓðx, k ÞÞ j  ðΓðx, mÞÞ j  ðΓðx, k ÞÞi ¼

bjk zðjÞ bjm zðjÞ bim zðiÞ b zðiÞ  X  X  X ik X P P P P blm zðlÞ blk zðlÞ blm zðlÞ blk zðlÞ l¼1

 ¼

l¼1

l¼1

l¼1



bim bjk  bjm bik zðiÞzð jÞ ⩾0 X X P P blm zðlÞ blk zðlÞ l¼1

l¼1



where, bimbjk  bjmbik  0 is from B(m) r B(k).

Proposition 4.5 states the increasing monotonicity of updating rule with the increasing number of observation state for scheduled channel. ðlÞ

Proposition 4.6 Under Assumption 1, we have either xt l, n 2 {1, 2, ⋯, N} for all t.

br xðt nÞ or xðt nÞ br xðt lÞ for all

Proof Based on Proposition 4.3, A⊤x is monotonically increasing in x ((A1)⊤brxbr(AX)⊤). As a result, we obtain ðA⊤ Þ e1 ¼ A⊤ ðA1 Þ⊤ bA⊤ xbA⊤ ðAX Þ⊤ ¼ ðA⊤ Þ eX 2

2

ð4:13Þ

Based on Propositions 4.4 and 4.5, Γ(x, k) is monotonically increasing in both x ((A1)⊤brXbr(AX)⊤) and k(1 b k b Y ). Consequently, we have     Γ ðA1 Þ⊤ , 1 bΓðx, k ÞbΓ ðAX Þ⊤ , K  1 , for ðA1 Þ⊤ br xbr ðAX Þ⊤ , 1bk j+1 i1 i1 X X     r⊤ ei  e j  r⊤ Q ⊤ ei  e j ¼ r⊤ ðekþ1  ek Þ  r⊤ Q⊤ ðekþ1  ek Þ k¼j

¼

i1 X

k¼j

½r⊤ ðekþ1  ek Þ  r⊤ Q⊤ ðekþ1  ek Þ ⩾ 0:

k¼j

where, the last inequality is from Assumption 1(v).



4.3 Optimality Analysis of Myopic Policy

4.3.4

93

Analysis of Optimality

We first give some bounds of performance difference on reward pairs of policies, and then derive the main theorem on the optimality of myopic policy.     ðlÞ ðI Þ ðlÞ ðlÞ ðlÞ ðlÞ Lemma 4.2 Under Assumption 1, xlt ¼ xt , xt , ˘xlt ¼ xt , xt , xt br x ðlÞ t

we have for 1  t  T ðC1Þ if u0t ¼ ut ¼ l ðu0 Þ

r⊤ ðxt  xt Þ  W t t ðxlt Þ  W t t ðxlt Þ  ðlÞ

ðlÞ

ðu Þ

Tt X

βi r⊤ ðA⊤ Þ ðxt  xt Þ i

ðlÞ

ðlÞ

i¼0

(C2) if u0t 6¼ l, ut 6¼ l, and u0t ¼ ut ðu0 Þ

ðu Þ

0  W t t ðxlt Þ  W t t ðxlt Þ 

Tt X

βi r⊤ ðA⊤ Þ ðxt  xt Þ i

ðlÞ

ðlÞ

i¼1

(C3) if u0t ¼ l and ut 6¼ l, ðu0 Þ

ðu Þ

0  W t t ðxlt Þ  W t t ðxlt Þ 

Tt X

βi r⊤ ðA⊤ Þ ðxt  xt Þ: i

ðlÞ

ðlÞ

i¼0

Proof Please refer to Appendix A.



Remark 4.3 We would like to emphasize on what conditions the bounds of Lemma 4.2 are achieved. For (C1), the lower bound is achieved when channel l is scheduled at slot t but never scheduled after t; the upper bound is achieved when l is scheduled from t to T. For (C2), the lower bound is achieved when channel l is never scheduled from t; the upper bound is achieved when l is scheduled from t + 1 to T. For (C3), the lower bound is achieved when channel l is never scheduled from t; the upper bound is achieved when l is scheduled from t to T.   ðlÞ ðnÞ ðlÞ ð1:N Þ Lemma 4.3 Under Assumption 1, if xt >r xt , we have W t xt >   ðnÞ ð1:N Þ W t xt : Proof By Lemma 4.2, we have

94

4 Myopic Policy for Opportunistic Scheduling: Homogeneous Multistate Channels ðlÞ

Wt ðlÞ

ðlÞ

ðlÞ

ðlÞ



ð1:N Þ



ðnÞ

ðlÞ



 Wt

xt

ðnÞ

ð1:N Þ



xt

ðlÞ

ðlÞ

ðnÞ

ðnÞ

ðnÞ

ðnÞ

¼ ½W t ðxt , xt Þ  W t ðxt , xt Þ  ½W t ðxt , xt Þ  W t ðxt , xt Þ h    i h    i ð aÞ ðlÞ ðlÞ ðlÞ ðlÞ ðlÞ ðnÞ ðnÞ ðlÞ ðnÞ ðnÞ ðnÞ ðnÞ ¼ W t xt , xt  W t xt , xt  W t xt , xt  W t xt , xt Tt   X   i ðlÞ ðnÞ ðlÞ ð nÞ β i r ⊤ ð A ⊤ Þ xt  xt  ⩾ r ⊤ x t  xt i¼1

¼r



E

Tt X

⊤ i

ðβA Þ

i¼1

⩾r



E

1 X

⊤ i

ðβA Þ

! 

ðlÞ

ðnÞ

ðlÞ

ðnÞ



xt  xt ! 



xt  xt

i¼1 ðbÞ

¼ r⊤ ðE  VΥV1 Þðxt  xt Þ ðlÞ

ðnÞ

" X X   X X  ðlÞ ðnÞ xt ð i Þ  xt ð i Þ r⊤ ð E  Q ⊤ Þ e j  e ¼

¼

j¼2

j1

i¼j

j¼2

" X X  X X

# 

ðlÞ xt ðiÞ



ðnÞ xt ð i Þ



r





ej  e

 j1

  r Q ej  e ⊤



#  j1

i¼j ðcÞ

⩾0 ðnÞ

where, (a) is from W t

    ðlÞ ðnÞ ðlÞ ðlÞ ðnÞ xt , xt ¼ W t xt , xt since the information states ðnÞ

of both channel n and channel l are the same value xt which implies choosing n or l leads to the same reward, (b) is from Proposition 4.8, and the inequality (c) is from  X  P ðlÞ ðnÞ ðlÞ ðnÞ Proposition 4.9, and xt ðiÞ  xt ðiÞ ⩾ 0 is due to xt ⩾ s xt by Proposition i¼j

4.1.



Remark 4.4 Lemma 4.3 states that scheduling the channel with better information state would bring more reward. Based on Lemma 4.3, we have the following theorem which states the optimal condition of the myopic policy. Theorem 4.1 Under Assumption 1, the myopic policy is optimal. Proof When T!1, we prove the theorem by backward induction. The theorem holds trivially for T. Assume that it holds for T  1, ⋯, t + 1, i.e., the optimal accessing policy is to access the best channels from time slot t + 1 to T. We now ð1Þ ðN Þ show that it holds for t. Suppose, by contradiction, that given xt >r ⋯>r xt , the

4.3 Optimality Analysis of Myopic Policy

95

optimal policy is to choose the best from slot t + 1 to T, and choose μt ¼ i1 6¼ 1 ¼ b μt at slot t where b μt is to choose the best in the sense of MLR at slot t according to ðin Þ ði1 Þ (4.11). Thus, there must  exist  in at slot t such that xt >r xt It then follows from ði Þ

ð1:N Þ

Lemma 4.3 that W t n xt

ði Þ

ð1:N Þ

> W t 1 xt

, which contradicts with the assump-

tion that the optimal policy is to choose i1 at slot t. This contradiction completes our proof for a finite T. Obviously, letting T!1, we finish the proof. ■

4.3.5

Discussion

4.3.5.1

Comparison

In [19], the authors considered the problem of scheduling channel with direct or perfect observation, and then their method is based on the information states of all channels in the sense of FOSD order, that is, the critical property is to keep the information states completely ordered or separated in the sense of FOSD order. However, in the case of indirect or imperfect observation, an observation matrix is introduced to replace the identity matrix E for the direct observation considered in [19]. Hence, the FOSD is not sufficient to characterize the order of information states, and then the MLR order, a kind of stronger stochastic order, is used to describe the order structure of information states. On the other hand, the approach adopted in this paper is totally different from [19] in deriving the optimality of the myopic policy, and consequently, the proposed sufficient conditions cannot degenerate into those of [19] since this approach is more generic and can be extended to the case of heterogeneous multistate channels.

4.3.5.2

Bounds

The bounds in (C1)–(C3) are not enough tight to drop the non-trivial Assumption 1 (v). Actually, we conjecture the optimality of myopic policy is kept even without the Assumption 1(v). However, due to the constraint of the method adopted in this paper, we cannot obtain better bounds to drop the non-trivial Assumption 1(v). Therefore, one of further directions is to obtain the optimality of myopic policy without Assumption 1(v) by some new methods.

4.3.5.3

Case Study

Consider a downlink scheduling system with N ¼ 4 channels, one user, and one base station. Each channel has X ¼ 3 states, evolving according A. The reward vector is r ¼ (0.050.200.70)⊤, e.g., 0.05 unit reward would be accrued if the chosen channel is

96

4 Myopic Policy for Opportunistic Scheduling: Homogeneous Multistate Channels

in state 1. The discount factor is ß ¼ 1, and the observation matrix B and the initial information states are set as follows: 0

0:400:200:40

1

0

0:980:010:01

1

B C B C A ¼ @ 0:200:240:56 A, B ¼ @ 0:100:400:50 A 0:150:250:60 0:010:400:59 1 0 1 0 1 0:40 0:20 0:15 B C ð2Þ B C ð4Þ B C ð3Þ ¼ @ 0:20 A, x0 ¼ x0 ¼ @ 0:24 A, x0 ¼ @ 0:25 A 0

ð1Þ

x0

0:40

0:56

0:60

In order to maximize throughput, the base station needs to decide which channel should be used to transmit information to the user each time. Under this setting, it is easy to check that Conditions .1.1–1.5 are satisfied by noticing the following: 1 1 0 0:2000 0:0087   B C C B ⊤ ⊤ ϕ A⊤ 1 , 2 ¼ @ 0:3054 A ⩾ r @ 0:2400 A ¼ A A3 0

0:5600 1 0:2600 0:8688 C B C B ⊤ ⊤ ϕðA⊤ 3 , 1Þ ¼ @ 0:1064 Abr @ 0:2280 A ¼ A A1 0:5120 0:0248 0

0:6859

1

0

r⊤ ðe2  e1 Þ  r⊤ Q⊤ ðe2  e1 Þ ¼ 0:0053 ⩾ 0 r⊤ ðe3  e2 Þ  r⊤ Q⊤ ðe3  e2 Þ ¼ 0:4638 ⩾ 0 Thus, Theorem 4.1 shows that the myopic policy is optimal. That is, in time slot 0, choosing channel 4 is optimal for the base station to communicate with the user ð1Þ ð2Þ ð3Þ ð4Þ since x0 br x0 br x0 br x0 For the other time slots, the optimal policy is to choose the channel with the largest information state, i.e., according to the order of ð1Þ ð2Þ ð3Þ ð4Þ xt , xt , x t , xt .

4.4

Optimality Extension

In this section, we extend the obtained optimality results to the case in which the transition matrix A is totally negative ordered in the sense of MLR, as a complementary to the totally positive order discussed in the previous section, which means that those relative propositions are stated here by replacing increasing monotonicity with deceasing monotonicity.

4.4 Optimality Extension

4.4.1

97

Assumptions

Some important assumptions are stated in the following. Assumption 2 Assume that (i) A1⩾rA2⩾r⋯⩾rAX (ii) B:, 1brB : , 2 b r⋯brB:, Y (iii) There exists some K(2 b K b Y ) such that ΓðA⊤ eX , K Þbr ðA⊤ Þ e1 2

ΓðA⊤ e1 , K  1Þ ⩾ r ðA⊤ Þ eX 2

ð1Þ

ð2Þ

ðNÞ

(iv) A1 ⩾ r x0 ⩾ r x0 ⩾ r ⋯ ⩾ r x0 ⩾ r AX (v) r⊤(ei + 1  ei) ⩾ r⊤Q⊤(ei + 1  ei)(1 Q ¼ VΥV1

b i b X  1), where

A ¼ VΛV1,

Remark 4.5 Assumption 2 differs from Assumption 1 in three aspects, i.e., 2.1, 2.3, 2.4, which reflects the inverse TP2 order [4] in matrix A.

4.4.2

Optimality

Under Assumption 2, we have the following propositions similar to Propositions 4.3–4.6. Proposition 4.10 Let x1, x2 2 Π(X) and X1brX2, then ðA1 Þ⊤ ⩾ r A⊤ x1 ⩾ r A⊤ x2 ⩾ r ðAX Þ⊤ Proposition 4.11 Let x1, x2 2 Π(X) and (A1)⊤⩾rx1⩾rx2⩾r(AX)⊤, then for Γ(x1, k)brΓ(x2, k) for 1 b k b Y . Proposition 4.12 Let X 2 Π(X) and (A1)⊤⩾rx⩾r(AX)⊤, then Γ(x, k)⩾rΓ(x, m) for any 1 b k b m b Y. Proposition 4.13 Under Assumption 2, we have either xt br xt or xt br xt for all l, n 2 {1, 2, ⋯, N}. Following the similar derivation of Lemma 4.2, we have the following important bounds. ðlÞ

ðlÞ

Lemma 4.4 Under Assumption 2, xlt ¼ ðxt have for 1  t  T (D1) if u0t ¼ ut ¼ l

ðlÞ

ðnÞ

ðlÞ

, xt Þ, xlt ¼ ðxt

ðlÞ

ðnÞ

ðlÞ

, xt Þ, xt

ðlÞ

br xðlÞ t

we

98

4 Myopic Policy for Opportunistic Scheduling: Homogeneous Multistate Channels

!

dTt 2 e

r

T

E

X

⊤ 2i1

ðβA Þ

ðlÞ

ðu0 Þ

ðlÞ

ðu Þ

ðxt  xt Þ  W t t ðxlt Þ  W t t ðxlt Þ

i¼1

r

!

bTt 2 c





X

⊤ 2i

ðβA Þ

ðlÞ

ðlÞ

ð xt  xt Þ

i¼1

(D2) if u0t 6¼ l, ut 6¼ l, and u0t ¼ ut r



Tt dX 2 e

ðβA⊤ Þ

2i1

ðlÞ

ðu0 Þ

ðlÞ

ðu Þ

ðxt  xt Þ  W t t ðxlt Þ  W t t ðxlt Þ

j¼1

r



X

bTt 2 c

ðβA⊤ Þ ðxt  xt Þ 2i

ðlÞ

ðlÞ

i¼1

(D3) if u0t ¼ l and ut 6¼ l r



dTt 2 e

X

ðβA⊤ Þ

2i1

ðlÞ

ðlÞ

ðu0 Þ

ðu Þ

ðxt  xt Þ  W t t ðxlt Þ  W t t ðxlt Þ

i¼1

 r⊤ E þ

Tt bX 2 c

! ⊤ 2i

ðβA Þ

ðlÞ

ðlÞ

ð xt  xt Þ

i¼1

Remark 4.6 (D1) achieves its lower bound when l is chosen at slot t, t + 1, t + 3, ⋯, and achieves the upper bound when l is chosen from t, t + 2, t + 4, ⋯. (D2) achieves its lower bound when l is chosen at slot t + 1, t + 3, ⋯, and upper bounds when l is chosen at t + 2, t + 4, ⋯. (D3) achieves its lower bound when l is chosen at slot t + 1, t + 3, ⋯, and upper bounds when l is chosen from t, t + 2, t + 4, ⋯. Based on Lemmas 4.3 and 4.4, we have the following theorem. Theorem 4.2 Under Assumption 2, the myopic policy is optimal.

4.5

Summary

In this chapter, we have investigated the problem of scheduling multistate channels under imperfect state observation. In general, the problem can be formulated as a partially observable Markov decision process or restless multiarmed bandit, which is proved to be PSPACE-hard. In this paper, we have derived a set of closed-form conditions to guarantee the optimality of the myopic policy (scheduling the best channel) in the sense of MLR order. Due to the generic RMAB formulation of the

Appendix

99

problem, the derived results and the analysis methodology proposed in this paper can be applicable in a wide range of domains.

Appendix Proof of Lemma 4.1 For slot T, it trivially holds. Suppose it holds for T  1, ⋯, t + 2, t + 1, we prove it holds for slot t. At slot t, we prove it by two cases in the following. Case 1: ut ¼ n, ðut Þ



Wt ¼ r ⊤ xt þ β ð nÞ

ð1:n1Þ

xt

ðnÞ

ðnþ1:NÞ



, xt , xt

 X  ðnÞ  ðbutþ1 Þ  ð1:n1Þ ðnÞ ðnþ1:N Þ d xt , m W tþ1 xtþ1 , xtþ1,m , xtþ1 m2Y

ðaÞ

¼ r ⊤ xt þ β ðnÞ

X

ðnÞ

dðxt , mÞ

m2Y

X X

  ðnÞ ðb u Þ ð1:n1Þ ðnþ1:NÞ e⊤j xtþ1,m W tþ1tþ1 xtþ1 , e j , xtþ1

ð4:19Þ

j¼1

where the equality (a) is due to the induction hypothesis. X X

¼

X X

ðnÞ r ⊤ xt

þβ



X

ð1:n1Þ

xt

ðnþ1:N Þ



, e i , xt

#   ðbutþ1 Þ ð1:n1Þ ðnþ1:N Þ d ðei , mÞW tþ1 xtþ1 , Γðei , mÞ, xtþ1

m2Y

i¼1 ðbÞ

ðut Þ

i¼1

" ðnÞ xt ðiÞ

ðnÞ

xi ðiÞW t

¼ r ⊤ xt þ β ðnÞ

X X

ðnÞ

xt ðiÞ

  ðb u Þ ð1:n1Þ ðnþ1:NÞ dðei , mÞW tþ1tþ1 xtþ1 , Γðei , mÞ, xtþ1

m2Y

i¼1 ðcÞ

X

¼ r ⊤ xt þ β ðnÞ

X X

ðnÞ

xt ð i Þ

X X

d ð ei , m Þ

m2Y

i¼1



X

  ðbutþ1 Þ ð1:n1Þ ðnþ1:N Þ e⊤j Γðei , mÞW tþ1 xtþ1 , e j , xtþ1

ð4:20Þ

j¼1

where, the equality (b) is from

X P

ðnÞ

xt ðiÞ ¼ 1, and equality (c) is due to induction

i¼1

hypothesis. To prove the lemma, it is sufficient to prove the following equation:

100

4 Myopic Policy for Opportunistic Scheduling: Homogeneous Multistate Channels X X X X  ðnÞ  X X X X ðnÞ ðnÞ d xt , m e⊤j xtþ1,m ¼ xt ðiÞ dðei , mÞ e⊤j Γðei , mÞ m2Y

j¼1

i¼1

ð4:21Þ

j¼1

m2F

Now, we have RHS and LHS of (4.21) as follows: ðnÞ X X X  ðnÞ  X X  ðnÞ  X BðmÞA⊤ xt ðnÞ  d xt , m e⊤j xtþ1,m ¼ d xt , m e⊤j  ðnÞ d xt , m m2Y m2Y j¼1 j¼1

¼

X XX

e⊤j BðmÞA⊤ xt

ðnÞ

ð4:22Þ

m2Y j¼1

¼

X X i¼1

ðnÞ

xt ð i Þ

X

d ð ei , m Þ

m2Y

X X

e⊤j Γðei , mÞ ¼

j¼1

¼

X X

ðnÞ

xt ð i Þ

X XX

X

dðei , mÞ

X X

m2Y

j¼1

e⊤j

BðmÞA⊤ ei d ð ei , m Þ

e⊤j BðmÞA⊤ ei

m2Y j¼1

X XX

e⊤j BðmÞA⊤

m2Y j¼1

¼

ðnÞ

xt ð i Þ

i¼1

i¼1

¼

X X

X XX

X X

ðnÞ

xt ðiÞei

i¼1

e⊤j BðmÞA⊤ xt

ðnÞ

ð4:23Þ

m2Y j¼1

Combining (4.22) and (4.23), we have (4.21), and further, prove the lemma. Case 2: ut 6¼ n, without loss of generality, assuming ut ⩾ n + 1, ðut Þ

Wt ¼ r ⊤ xt

ðut Þ

þβ



ð1:n1Þ

xt

ðnÞ

ðnþ1:N Þ



, xt , xt

 X  ðu Þ  ðbu Þ  ð1:u 1Þ ðu Þ ðu þ1:NÞ t d xt t , m W tþ1tþ1 xtþ1 t , xtþ1,m , xtþ1t m2Y

ðaÞ

¼ r ⊤ xt

ðut Þ

þβ

X X  ðu Þ  X ðnÞ d xt t , m xtþ1 ðiÞ m2Y

i¼1



 ðb u Þ ð1:n1Þ ðnþ1:u 1Þ ðut Þ ðu þ1:NÞ W tþ1tþ1 xtþ1 , ei , xtþ1 t , xtþ1,m , xtþ1t , where, the equality (a) is due to the induction hypothesis.

ð4:24Þ

Appendix

101 X X

ðnÞ

ðut Þ

xt ðiÞW t



ð1:n1Þ

xt

ðnþ1:N Þ



, e i , xt

i¼1

¼

X X

 i h X  ðu Þ  ðnÞ ðu Þ ðb utþ1 Þ ð1:n1Þ ðnþ1:u 1Þ ðut Þ ðut þ1:NÞ xtþ1 ,ei ,xtþ1 t ,xtþ1,m xt ðiÞ r⊤ xt t þ β d xt t , m W tþ1 , xtþ1 m2Y

i¼1 ðbÞ

¼ r ⊤ xt

ðut Þ

þβ

X X



ðnÞ

xt ðiÞ

X  ðu Þ  d xt t , m m2Y

i¼1

ðb u Þ ð1:n1Þ ðnþ1:u 1Þ ðut Þ ðu þ1:NÞ  W tþ1tþ1 xtþ1 , ei , xtþ1 t , xtþ1,m , xtþ1t

where, the equality (b) is from

X P

 ð4:25Þ

ðnÞ

xt ðiÞ ¼ 1.

i¼1

Combining (4.24) and (4.25), we prove the lemma.

Proof of Lemma 4.2 We prove the lemma by backward induction. For slot T, we have ðu0 Þ

ðu Þ

ðlÞ

ðlÞ

1. For it holds that W T T ðxlT Þ  W T T ðxlT Þ ¼ r⊤ ðxT  xT Þ; ðu0T Þ

ðu Þ

2. For u0T 6¼ l, uT 6¼ l and u0T ¼ uT , it holds that W T ðxlT Þ  W T T ðxlT Þ ¼ 0; 3. For u0T ¼ l and uT 6¼ l it exists at least one channel n such that u0T ¼ n and xT ⩾ r xT ⩾ r xT It then holds that 0bW T T ðxlT Þ  W T T ðxlT Þbr⊤ ðxT xT Þ. ðlÞ

ðnÞ

ðu0 Þ

ðlÞ

ðu Þ

ðlÞ

ðnÞ

Therefore, Lemma 4.2 holds for slot T. Assume that Lemma 4.2 holds for T  1, ⋯, t + 1, then we prove the lemma for slot t. ðlÞ ðlÞ We first prove the first case: u0t ¼ l, ut ¼ l. By developing xt and xt according to Lemma 4.1, we have the following: Fðxlt , u0t Þ ¼

0     X  ðlÞ X ðb u Þ ðlÞ ðlÞ d xt , m e⊤j Γ xt , m W tþ1tþ1 xtþ1 , e j

m2Y

¼

j2X

0   XX u Þ ðlÞ ðb ðlÞ e⊤j BðmÞA⊤ xt W tþ1tþ1 xtþ1 , e j

m2Y j2X

Fðxlt , ut Þ ¼

    X  ðlÞ X ðlÞ ðb u Þ ðlÞ d xt , m e⊤j Γ xt , m W tþ1tþ1 xtþ1 , e j m2Y

j2X

ð4:26Þ

102

4 Myopic Policy for Opportunistic Scheduling: Homogeneous Multistate Channels

¼

XX

  utþ1 Þ ðlÞ ðb ðlÞ e⊤j BðmÞA⊤ xt W tþ1 xtþ1 , e j

ð4:27Þ

m2Y j2X

Furthermore, we have Fðxlt , u0t Þ  Fðxlt , ut Þ 0    i X Xh u Þ ðlÞ ðb ðlÞ ðlÞ ðb u Þ ðlÞ ¼ eTj BðmÞA⊤ xt W tþ1tþ1 xtþ1 , e j  e⊤j BðmÞA⊤ xt W tþ1tþ1 xtþ1 , e j m2Y j2X ðaÞ

¼

0      i X X h ðb u Þ ðlÞ ðlÞ ðlÞ ðb u Þ ðlÞ e⊤j BðmÞA⊤ xt  xt ðW tþ1tþ1 xtþ1 , e j  W tþ1tþ1 xtþ1 , e1

m2Y j2X f1g

ð4:28Þ ðlÞ

where, the equality (a) is due to xt ð1Þ ¼ 1 

P

ðlÞ

xt ð jÞ.

 0     bu ðbutþ1 Þ ðlÞ ðlÞ xtþ1 , e1 Next, we analyze the term in the bracket, W tþ1tþ1 xtþ1 , e j  W tþ1 j2X ðlÞ f1g

of RHS of (4.28) through three cases: Case 1: if b u0tþ1 ¼ l and b utþ1 ¼ l, according to the induction hypothesis, we have  0  butþ1

0bW tþ1

ðlÞ xtþ1 , e j



   Tt1 X butþ1 i ðlÞ r⊤ ðβA⊤ Þ ðe j  e1 Þ  W tþ1 xtþ1 , e1 b i¼0

utþ1 6¼ l, and b u0tþ1 ¼ b utþ1, according to the induction hypothCase 2: if b u0tþ1 6¼ l, b esis, we have  0  butþ1

0bW tþ1

ðlÞ xtþ1 , e j



   Tt1 X butþ1 i ðlÞ r⊤ ðβA⊤ Þ ðe j  e1 Þ  W tþ1 xtþ1 , e1 b i¼1

utþ1 6¼ l, according to the induction hypothesis, we have Case 3: if b u0tþ1 ¼ l and b  0  butþ1

0bW tþ1

T  X i ðlÞ ðb u Þ ðlÞ r⊤ ðβA⊤ Þ ðe j  e1 Þ xtþ1 , e j  W tþ1tþ1 ðxtþ1 , e1 Þb i¼0

 0   bu ðlÞ Combining Cases 1–3, we obtain the bounds of W tþ1tþ1 xtþ1 , e j     butþ1 ðlÞ W tþ1 xtþ1 , e1 as follows:  0  butþ1

0bW tþ1

   Tt1 X  ðbutþ1 Þ ðlÞ i ðlÞ xtþ1 , e j  W tþ1 xtþ1 , e1 b r⊤ ðβA⊤ Þ e j  e1 i¼0

Appendix

103

Therefore, we have  ðlÞ ðu Þ   xt  W t I xlt     ðlÞ ðlÞ ¼ r⊤ xt  xt þ β Fðxlt , u0t Þ  Fðxlt , ut Þ   X X ðlÞ ðlÞ ¼ r ⊤ xt  xt þ β ðut Þ



Wt

m2Y j2X f1g 0 h      i ðb u Þ ðlÞ ðlÞ ðlÞ ðb u Þ ðlÞ  e⊤j BðmÞA⊤ xt  xt W tþ1tþ1 xtþ1 , e j  W tþ1tþ1 xtþ1 , e1   X X ðlÞ br⊤ xðlÞ  x þβ t t

m2Y j2X f1g

h   Tt1  i X i ðlÞ ðlÞ  e⊤j BðmÞA⊤ xt  xt r⊤ ðβA⊤ Þ e j  e1 i¼0

h

  Tt1  i X i ðlÞ ðlÞ  e⊤j BðmÞA⊤ xt  xt r⊤ ðβA⊤ Þ e j  e1 i¼0

¼

Tt X



r⊤ ðβA⊤ Þ xt  xt i

ðlÞ

ðlÞ



i¼0

To the end, we complete the proof of the first part, u0t ¼ l and ut ¼ l, of Lemma 4.2. Secondly, we prove the second case u0t 6¼ l, ut 6¼ l, and u0t ¼ ut , which implies that in this case, u0t ¼ ut . Assuming u0t ¼ ut ¼ k, we have: Fðxlt , u0t Þ

¼

  X  ðkÞ X ðu Þ ðk,lÞ ðkÞ ðlÞ d xt , m e⊤j Γðxt , mÞW tþ1tþ1 xtþ1 , e j , A⊤ xt m2Y

j2X

0   XX u Þ ðkÞ ðb ðk,lÞ ðlÞ ¼ e⊤j BðmÞA⊤ xt W tþ1tþ1 xtþ1 , e j , A⊤ xt

ð4:29Þ

m2Y j2X

Fðxlt , ut Þ

¼

  X  ðkÞ X ðb u Þ ðkÞ ðk,lÞ ðlÞ d xt , m e⊤j Γðxt , mÞW tþ1tþ1 xtþ1 , e j , A⊤ xt m2Y

¼

j2X

  XX utþ1 Þ ðkÞ ðb ðk,lÞ ðlÞ e⊤j BðmÞA⊤ xt W tþ1 xtþ1 , e j , A⊤ xt m2Y j2X

ð4:30Þ

104

4 Myopic Policy for Opportunistic Scheduling: Homogeneous Multistate Channels

Thus, Fðxlt , u0t Þ  Fðxlt , ut Þ XX ðk Þ e⊤j BðmÞA⊤ xt ¼ m2Y j2X 0 h    i ðb u Þ ðk,lÞ ðlÞ ðb u Þ ðk,lÞ ðlÞ  W tþ1tþ1 xtþ1 , e j , A⊤ xt  W tþ1tþ1 xtþ1 , e j , A⊤ xt

ð4:31Þ

of RHS of (4.31), if l is never chosen for For  the term in the bracket 0    butþ1  ðk,lÞ ðbutþ1 Þ ðk,lÞ ð l Þ ðlÞ xtþ1 , e j , A⊤ ˘xt xtþ1 , e j , A⊤ xt and W tþ1 from the slot t + 1 W tþ1 uτ 6¼ l for to the end of time horizon of interest T. That is to say, b u0τ 6¼ l and b ðu0

Þ

t + 1  τ  T, and further, we have W tþ1tþ1 ðxtþ1 , e j , A⊤ xt Þ  ðb u Þ ðk,lÞ ðlÞ W tþ1tþ1 ðxtþ1 , e j , A⊤ xt Þ ¼ 0 ; otherwise, it exists to(t + 1 b to b T ) such that one of the following three cases holds. Case 1: u0τ 6¼ l and uτ 6¼ l for t  τ  t0  1 while u0t0 ¼ l and ut0 ¼ l; Case 2: u0τ 6¼ l and uτ 6¼ l for t  τ  t0  1 while u0t0 6¼ l and ut0 ¼ l (Note that t 0 t ðlÞ

ðk,lÞ

ðlÞ

t 0 t ðlÞ

this case does not exist since r⊤ ½A⊤ xt ⩾ r⊤ ½A⊤ xt according to the transition matrix A); Case 3: u0τ 6¼ l and uτ 6¼ l for t  τ  t0  1 while u0t0 ¼ l and  ut0 6¼ l. For Case 1, according to the hypothesis u0t0 ¼ l and ut0 ¼ l , we have   0 ðb u Þ ðb u Þ W t0 t0 ðxlt0 Þ  W t0 t0 ðxlt0 Þ   Tt Po 0 i ðlÞ ðlÞ  βt t ðβλÞ r⊤ xt  xt0

βt

0

t

i¼0

Tt P

o

¼ βt

0

t

i¼0 ðbÞ



Tt1 P

r⊤ ðβA⊤ Þ ½A⊤ i

t 0 t

  ðlÞ ðlÞ xt  X t

  i ðlÞ ðlÞ r⊤ ðβA⊤ Þ A⊤ xt  xt

i¼0

where, the inequality (b) is from t0 ⩾ t + 1. For Case 3, by the induction hypothesis, we have the similar results with Case 1. Combining the results of the three cases, we obtain the following: 0     ðb u Þ ðk,lÞ ðlÞ ðb u Þ ðk,lÞ ðlÞ W tþ1tþ1 xtþ1 , e j , A⊤ xt  W tþ1tþ1 xtþ1 , e j , A⊤ xt

b

Tt1 X i¼0

  i ðlÞ ðlÞ r⊤ ðβA⊤ Þ A⊤ xt  xt

ð4:32Þ

Appendix

105

Combining (4.32) and (4.31), we have ðu0 Þ

ðu Þ

W t t ðxlt Þ  W t t ðxlt Þ ¼ βðFðxlt , u0t Þ  Fðxlt , ut ÞÞ XX ðk Þ e⊤j BðmÞA⊤ xt ¼β m2Y j2X 0 h    i ðb u Þ ðk,lÞ ðlÞ ðb u Þ ðk,lÞ ðlÞ  W tþ1tþ1 xtþ1 , e j , A⊤ xt  W tþ1tþ1 xtþ1 , e j , A⊤ xt



XX

e⊤j BðmÞA⊤ xt

ðkÞ

Tt1 X

m2Y j2X

¼

X e⊤j j2X

¼

  i ðlÞ ðlÞ r⊤ ðβA⊤ Þ A⊤ xt  xt

i¼0

X

Tt1   X iþ1 ðkÞ ðlÞ ðlÞ BðmÞ A⊤ xt r⊤ ðβA⊤ Þ xt  xt

m2Y

i¼0

Tt1   X X iþ1 ðkÞ ðlÞ ðlÞ e⊤j EA⊤ xt r⊤ ðβA⊤ Þ xt  xt j2X

i¼0

⊤ ¼ 1⊤ X A xt

ðkÞ

Tt1 X

r⊤ ðβA⊤ Þ

iþ1

  ðlÞ ðlÞ xt  xt

i¼0

¼ 1⊤ X xt

ðkÞ

Tt1 X

r⊤ ðβA⊤ Þ

iþ1



ðlÞ

ðlÞ



xt  xt

i¼0

¼

Tt1 X

r⊤ ðβA⊤ Þ

iþ1

  ðlÞ ðlÞ xt  xt

i¼0

¼

Tt X

  i ðlÞ ðlÞ r⊤ ðβA⊤ Þ xt  xt

i¼1

which completes the proof of Lemma 4.2 when u0t 6¼ l and ut 6¼ l. Last, we prove the third case u0t ¼ l and ut 6¼ l, then it exists at least one process ðnÞ ðlÞ ðnÞ ðlÞ ut ¼ n, and its belief vector denoted as xt , such that xt ⩾ r xt ⩾ r xt we have ðu0 Þ

ðu Þ

W t t ðxlt Þ  W t t ðxlt Þ     ðlÞ ð1Þ ðl1Þ ðlÞ ðlþ1Þ ðNÞ ðnÞ ð1Þ ðl1Þ ðlÞ ðlþ1Þ ðNÞ ¼ W t xt ,⋯,xt , xt ,xt ,⋯,xt  W t xt ,⋯,xt ,xt ,xt ,⋯,xt h    i ðlÞ ð1Þ ðl1Þ ðlÞ ðlþ1Þ ðN Þ ðnÞ ð1Þ ðl1Þ ðnÞ ðlþ1Þ ðN Þ ¼ W t xt ,⋯,xt ,˘xt ,xt ,⋯,xt  W t xt ,⋯,xt ,xt ,xt ,⋯,xt

106

4 Myopic Policy for Opportunistic Scheduling: Homogeneous Multistate Channels

h    i ðlÞ ð1Þ ðl1Þ ðlÞ ðlþ1Þ ðNÞ ðlÞ ð1Þ ðl1Þ ðnÞ ðlþ1Þ ðNÞ ¼ W t xt ,⋯,xt , xt ,xt , ⋯,xt  W t xt , ⋯, xt , xt , xt ,⋯,xt h    i ðlÞ ð1Þ ðl1Þ ðlÞ ðlþ1Þ ðN Þ ðlÞ ð1Þ ðl1Þ ðnÞ ðlþ1Þ ðN Þ ¼ W t xt ,⋯,xt , ˘xt ,xt , ⋯,xt  W t xt , ⋯,xt , xt , xt ,⋯, xt h    i ðnÞ ð 1Þ ðl1Þ ðnÞ ðlþ1Þ ðN Þ ðnÞ ð1Þ ðl1Þ ðlÞ ðlþ1Þ ðN Þ þ W t xt ,⋯,xt , xt , xt , ⋯, xt  W t xt ,⋯, xt ,xt ,xt , ⋯,xt ð4:33Þ According to the induction hypothesis ðl 2 A 0 and l 2 A Þ, the first term of the RHS of (4.33) can be bounded as follows:   ð1Þ ðl1Þ ðlÞ ðlþ1Þ ðNÞ xt , ⋯, xt , xt , xt , ⋯, xt   ðlÞ ð1Þ ðl1Þ ðnÞ ðlþ1Þ ðNÞ W t xt , ⋯, xt , xt , xt , ⋯, xt ðlÞ

Wt

b

Tt X

  i ðlÞ ðnÞ r⊤ ðβA⊤ Þ xt  xt

ð4:34Þ

i¼0

Meanwhile, the second term of the RHS of (4.33) is inducted by hypothesis ðl= 2A Þ: 2A 0 and l=  ð1Þ ðl1Þ ðnÞ ðlþ1Þ ðNÞ xt , ⋯, xt , xt , xt , ⋯, xt   ðnÞ ð1Þ ðl1Þ ðlÞ ðlþ1Þ ðNÞ  W t xt , ⋯, xt , xt , xt , ⋯, xt ðnÞ



Wt

b

Tt X

  i ðnÞ ðlÞ r⊤ ðβA⊤ Þ xt  xt

ð4:35Þ

i¼1

Therefore, we have, combining (4.33), (4.34), and (4.35), W t t ðxlt Þ  W t t ðxlt Þb ðu0 Þ

ðu Þ

Tt X

r⊤ ðβA⊤ Þ ðxt  xt Þ i

ðlÞ

ðlÞ

i¼0

Thus, we complete the proof of the third part, u0t ¼ l and ut 6¼ l, of Lemma 4.2. To the end, Lemma 4.2 is concluded.

References 1. P. Whittle, Restless bandits: Activity allocation in a changing world. J. Appl. Probab. 24, 287–298 (1988)

References

107

2. J.C. Gittins, K. Glazebrook, R.R. Webber, Multi-Armed Bandit Allocation Indices (Blackwell, Oxford, U.K., 2011) 3. C.H. Papadimitriou, J.N. Tsitsiklis, The complexity of optimal queueing network control. Math. Oper. Res. 24(2), 293–305 (1999) 4. A. Muller, D. Stoyan, Comparison Methods for Stochastic Models and Risk (Wiley, New York, 2002) 5. J.C. Gittins, Bandit processes and dynamic allocation indices. J. R. Stat. Soc. Ser. B 41(2), 148–177 (1979) 6. J.C. Gittins, D.M. Jones, A dynamic allocation index for the sequential design of experiments. Prog. Statist., 241–266 (1974) 7. D. Bertsimas, J. Nino-Mora, Restless bandits, linear programming relaxations, and a primaldual heuristic. Oper. Res. 48(1), 80–90 (2000) 8. S. Guha, K. Munagala. Approximation algorithms for partial-information based stochastic control with markovian rewards. in Proc. IEEE Symposium on Foundations of Computer Science (FOCS), Providence, RI, Oct 2007 9. S. Guha, K. Munagala. Approximation algorithms for restless bandit problems. in Proc. ACM-SIAM Symposium on Discrete Algorithms (SODA), New York, Jan 2009 10. S. Ahmad, M. Liu. Multi-Channel Opportunistic Access: A Case of Restless Bandits with Multiple Plays. in Allerton Conference, Monticello, Il, Sep–Oct, 2009 11. K. Liu, Q. Zhao, B. Krishnamachari, Dynamic multichannel access with imperfect channel state detection. IEEE Trans. Signal Process. 58(5), 2795–2807 (2010) 12. F. E. Lapiccirella, K. Liu, Z. Ding: Multi-Channel Opportunistic Access Based on Primary ARQ Messages Overhearing. in Proc. of ICC, Kyoto, Jun. 2011 13. S. Ahmand, M. Liu, T. Javidi, Q. Zhao, B. Krishnamachari, Optimality of myopic sensing in multichannel opportunistic access. IEEE Trans. Inf. Theory 55(9), 4040–4050 (2009) 14. K. Wang, L. Chen, Q. Liu, Opportunistic Spectrum access by exploiting primary user feedbacks in underlay cognitive radio systems: An optimality analysis. IEEE J. Sel. Topics Signal Process. 7(5), 869–882 (2013) 15. K. Wang, Q. Liu, L. Chen, On optimality of greedy policy for a class of standard reward function of restless multi-armed bandit problem. IET Signal Process. 6(6), 584–593 (2012) 16. K. Wang, Q. Liu, F.C. Lau, Multichannel opportunistic access by overhearing primary ARQ messages. IEEE Trans. Veh. Technol. 62(7), 3486–3492 (2013) 17. Q. Zhao, B. Krishnamachari, K. Liu, On myopic sensing for Multi-Channel opportunistic access: Structure, optimality, and performance. IEEE Trans. Wirel. Commun. 7(3), 54135440 (2008) 18. S. Ahmad, M. Liu, T. Javidi, Q. Zhao, B. Krishnamachari, Optimality of myopic sensing in multi-channel opportunistic access. IEEE Trans. Inf. Theory 55(9), 4040–4050 (2009) 19. Y. Ouyang, D. Teneketzis, On the optimality of myopic sensing in multi-state channels. IEEE Trans. Inf. Theory 60, 681–696 (2014) 20. K. Wang: Optimality of Myopic Policy for Restless Multiarmed Bandit with Imperfect Observation. in Proc. of GlobeCom, pages 1–6, Washington D. C., USA, Dec 2016 21. K. Wang, L. Chen, J. Yu: On optimality of myopic policy in multi-channel opportunistic access. in Proc. of ICC, pages 1–6, Kuala Lumpur, Malaysia, May 2016 22. K. Wang, Q. Liu, F. Li, L. Chen, X. Ma, Myopic policy for opportunistic access in cognitive radio networks by exploiting primary user feedbacks. IET Commun. 9(7), 1017–1025 (2015)

Chapter 5

Whittle Index Policy for Opportunistic Scheduling: Heterogeneous Multistate Channels

5.1 5.1.1

Introduction Background

We revisit the following opportunistic scheduling system involving a base station, also referred to as a server, different classes of users with heterogeneous demands, and time-varying multistate Markovian channels. Each channel with different states and classes has a different transmission rate, i.e., the evolution of channels is Markovian and class-dependent. For those users connected to (or entering) the system but not served immediately, their waiting costs increase with time. In such an opportunistic scheduling scenario, a central problem is how to exploit the server’s capacity to serve the users. This problem can be formalized to the problem of designing an optimum opportunistic scheduling policy to minimize the average waiting cost. The above opportunistic scheduling problem is fundamental in many classical and emerging wireless communication systems such as mobile cellular systems including 4G LTE and the emerging 5G, heterogeneous networks (HetNet).

5.1.2

State of The Art

Due to its fundamental importance, the opportunistic scheduling problem has attracted a large body of research on channel-aware schedulers addressing one or more system performance metrics in terms of throughput, fairness, and stability [1–26].

© The Author(s), under exclusive license to Springer Nature Switzerland AG 2021 K. Wang, L. Chen, Restless Multi-Armed Bandit in Opportunistic Scheduling, https://doi.org/10.1007/978-3-030-69959-8_5

109

110

5 Whittle Index Policy for Opportunistic Scheduling. . .

The seminal work in [2] showed that the system capacity can be improved by opportunistically serving users with maximal transmission rate. Such a scheduler is called cp-rule or MaxRate scheduler. In fact, the MaxRate scheduler was myopically throughput-optimal, i.e., maximizing the current slot transmission rate but ignoring the impact of the current scheduling on the future throughput, and consequently, was shown to perform badly in system stability from the long-term viewpoint. For instance, the number of waiting users in the system explodes with the increase of system load. Meanwhile, the MaxRate scheduler does not fairly schedule those users with lower transmission rates. To balance system throughput and fairness, the Proportionally Fair (PF) scheduler was proposed and implemented in CDMA 1xEV-DO system of 3G cellular networks [3]. Technically, the PF scheduler maximizes the logarithmic throughput of the system rather than traditional throughput, and as a result provides better fairness [4]. In [5], the authors approximated the PF by the relatively best (RB) scheduler, and analyzed the flow-level stability of the PF scheduler. Actually, the RB scheduler gives priority to users according to their ratio of the current transmission rate to the mean transmission rate. Accordingly, it is fair to users by taking future evolution into account. Consequently, it can provide a minimal throughput to the users with low accessible transmission rates, at the price of being not maximally stable at flow level [6]. Other schedulers, e.g., scored based (SB) [7], proportionally best (PB), and potential improvement (PI), belong to the family of the best-condition schedulers. These schedulers give priority to users according to their respective evaluated channel condition and, accordingly, do not have a direct association with transmission rate. They are not myopically throughput-optimal, but rather have a good performance in the long term. They are maximally stable [8, 9], but they do not consider fairness. The above papers all assume independent and identically distributed (i.i.d.) channels. For the more challenging scenarios, there exist some work on homogeneous channels [10, 11, 18], i.e., i.i.d. in slots, and heterogeneous channels [12–16], i.e., discrete-time Markov process in slots. Under the Markovian channel model, the opportunistic scheduling problem can be mathematically recast to a RMAB [17]. The RMAB is of fundamental importance in stochastic decision theory due to its generic nature, and has application in numerous engineering problems. The central problem in investigating and solving an instance of RMAB is to establish its indexability. Once the indexability is established, an index policy can be constructed by assigning an index for each state of each arm to measure the reward of activating the arm at that state. The policy thus consists of simply activating those arms with the largest indices. In the context of opportunistic scheduling, the authors [18] considered a flowlevel scheduling problem with time-homogeneous channel state transition where the probability of being in a state is fixed for any time slot, regardless of the evolution of the system. For the same channel model, the authors [10] considered the

5.1 Introduction

111

opportunistic scheduling problem under the assumption of traffic size following the Pascal distribution. In [10, 11, 18], the indexability was first proved and then the similar closed-form Whittle index was obtained [17]. For heterogeneous channels, the authors of [12–14] considered a generic flow-level scheduling problem with heterogeneous channel state transition, but carried out their work based on a conjecture that the problem is indexable. As a result, they can only verify the indexability of the proposed policy for some specific scenarios by numerical test before computing the policy index. The indexability of the opportunistic scheduling for the heterogeneous multistate Markovian channels, despite its theoretical and practical importance, is still open today.

5.1.3

Main Results and Contributions

To bridge the above theoretical gap, we first carry out a deep investigation into the indexability of the heterogeneous channel case formulated in [12–14] and mathematically, identify a set of sufficient conditions on the channel state transition matrix under which the indexability is guaranteed and consequently the Whittle index policy is feasible. Second, by exploiting the structural property of the channel state transition matrix, we obtain the closed-form Whittle index. Third, for a generic channel state transition matrix not satisfying the sufficient conditions, we propose an eigenvalue-arithmetic-mean scheme to approximate this matrix such that the approximate matrix satisfies the sufficient conditions and further the approximate Whittle index is easily obtained. Finally, we present a scheduling algorithm based on the Whittle index, and conduct extensive numerical experiments which demonstrate that the proposed scheduling algorithm can efficiently balance waiting cost and stability. Our work thus constitutes a small step toward solving the opportunistic scheduling problem in its generic form involving multistate Markovian channels. As a desirable feature, the indexability conditions established in this work only depend on channel state transition matrix without imposing constraints on those userdependent parameters such as service rate and waiting cost. Notation ei denotes an N-dimensional column vector with 1 in the i-th element and 0 in others. I denotes the N  N identity matrix. 1N denotes an N-dimensional column vector with 1 in all elements. 0N denotes an N-dimensional column vector with 0 in all elements. 1kN denotes the N-dimensional column vector with 1 in the first k elements and 0 in the remaining N  k elements. Diag (a1, . . ., aK) denotes a diagonal and a block-diagonal matrix with a1, . . ., aK. trace(•) denotes the sum of all elements in a diagonal of a matrix. ()⊤ represents the transpose of a matrix or a vector. ()1 represents the inverse of a matrix.

5 Whittle Index Policy for Opportunistic Scheduling. . .

112 Table 5.1 Main notations Symbols k K t T Nk N 0k Zk(t) qk,n,m sk,n μk,n ck Ak Nk ak,n ak,n

Descriptions User class User class set Slot index Slot set {0, 1, . . .} The number of channel states of class k user Set {1, 2, . . ., Nk} Channel state of a class-k user at t Probability from n to m of class-k user Transmission rate of class-k user in state n Departure probability of class-k user in state n Waiting cost of class-k user Decision set {0, 1} Set {0, 1, . . ., Nk} The expected one-slot capacity consumption for class k user in channel state n with action a The expected one-slot reward for class k user in channel state n with action a

δNn

Set {n + 1, . . ., N}

5.2

System Model

As mentioned in the introduction, we consider a wireless communication system where a server schedules jobs of heterogeneous users. The system operates in a timeslotted fashion where T denotes the slot duration and t 2 T ≔f0, 1, ⋯g denotes the slot index. Table 5.1 summaries main notations used in this chapter.

5.2.1

Job, Channel, and User Models

Suppose that there are K classes of users, k 2 K ≔f1, 2, ⋯, K g. Each user of class k is uniquely associated with a job of class k which is requested from the server and with a dedicated wireless channel of class k through which the job would be transmitted. Job sizes. The job (or flow) size bk of users of class k in bits is geometrically distributed with mean fbk g < 1 for class k 2 K . Channel condition. For each user, the channel condition varies from slot to slot, independently of all other users. For each class k user, the set of discretized channel conditions is denoted by the finite set N 0k ≔f1, 2, ⋯, N k g.

5.2 System Model

113

Channel condition evolution. We assume that at each slot, the channel condition of each user in the system evolves according to a class-dependent Markov chain. Thus, for each user of class k 2 K , we can define a Markov chain with state space N 0k. We further define qk, n, m ≔ ℙ(Zk(t + 1) ¼ m j Zk(t) ¼ n), where Zk(t), denotes the channel condition of a class k user at time t. The class k channel condition transition probability matrix is thus defined as follows: 2

qk,1,1

6q 6 k,2,1 QðkÞ ≔6 4 ⋮ qk,N k ,1 where

P m2N

0 i

qk,1,2



qk,1,N k

qk,2,2 ⋮

⋯ ⋱

qk,2,N k ⋮

qk,N k ,2

⋯ qk,N k ,N k

3 7 7 7, 5

qk,n,m ¼ 1 for every n 2 N 0k .

Transmission rates. When a user of class k is in channel condition n 2 N 0k, he can receive data at transmission rate sk,n, i.e., his job is served at rate sk,n. We assume that the higher the label of the channel condition, the higher the transmission rate, i.e., 0bsk,1 < sk,2 < ⋯ < sk,N k . Waiting costs. For every user of class k, the system operator accrues waiting cost ck(ck > 0) at the end of every slot if its job is uncompleted.

5.2.2

Server Model

The server is assumed to have full knowledge of the system parameters. We investigate the case where the server can serve one user each slot. However, our analysis can be straightforwardly generalized to the case where multiple users can be served each slot. At the beginning of every slot, the server observes the actual channel conditions of all users in the system and decides which user to serve during the slot. We assume that the server is preemptive, i.e., at any time it can interrupt the service of a user whose job is not yet completed. For those jobs not completed, they will be saved and served in the future. The server is also allowed to stay idle, and note that it is not work-conserving because of the time-varying transmission rate. We denote by μk,n : τsk,n =0 fbk g [12] the departure probability that the job is completed within the current time slot when the server serves a user of class k in channel condition n 2 N 0k . Note that the departure probabilities are increasing in the channel condition, i.e., 0bμk,1 < ⋯ < μk,N k b1, because the transmission rates satisfy 0bsk,1 < sk,2 < ⋯ < sk,N k .

5 Whittle Index Policy for Opportunistic Scheduling. . .

114

5.2.3

Opportunistic Scheduling Problem

In the above opportunistic scheduling model, a central problem is how to maximally exploit the server’s capacity to serve users. This problem can be formalized to the problem of designing an optimum opportunistic scheduling policy to minimize the average waiting cost.

5.3

Restless Bandit Formulation and Analysis

In this section we analyze the scheduling problem by the approach of RMAB. For the ease of analysis, we investigate the discounted waiting costs by introducing a discount factor 0 b β < 1. Basically, the time-average case is a special case where β!1.

5.3.1

Job-Channel-User Bandit

We denote by A k ≔f0, 1g the action space of user of class k where action 1 means serving the user and 0 not serving him. Every job-channel-user couple of class k is characterized by the tuple 

       N k , wak a2A k , rak a2A k , Pak a2A k

where 1. N k ≔f0g [ N 0k ‘is the user state space, where state 0 indicates that the job is completed, and state n 2 N 0k indicates that the current channel condition is n and the job  is uncompleted; 2. wak ≔ wak,n n2N , where wak,n is the expected one-slot capacity consumption, or k work required by a user at state n if action a is chosen. Specifically, for every state n 2 N k , w1k,n ¼ 1 and w0k,n ¼ 0;   3. rak ≔ r ak,n n2N , where r ak,n is the expected one-slot reward earned by a user at state k

n if action a is selected. Specifically, for every state n 2 N 0k , it is the negative of the expected waiting cost, r ak,0 ¼ 0, r 1k,n ¼ μk,n ck where μk,n ¼ 1  μk,n and r 0k,n ¼ ck .   4. Pak ≔ pak,n,m n,m2N where pak,n,m is the probability for a user evolving from state k n to state m if action a is selected. The one-slot transition probability matrices for action 0 and 1 are as follows:

5.3 Restless Bandit Formulation and Analysis

2

1 0 6 0 q 6 k,1,1 6 0 q P0k ¼ 6 k,2,1 6 6 ⋮ 4⋮ 0 qk,N k ,1 2 1 0 6μ μ k,1 qk,1,1 6 k,1 6 1 6 μ k,2 qk,2,1 Pk ¼ 6 μk,2 6 ⋮ ⋮ 4 μk,N k μ k,N k qk,N k ,1

115

⋯ ⋯ ⋯ ⋱ ⋯ ⋯ ⋯ ⋯ ⋱ ⋯

0

3

qk,1,N k 7 7 7 qk,2,N k 7 7 7 ⋮ 5 qk,N k ,N k

3 0 μ k,1 qk,1,N k 7 7 7 μ k,2 qk,2,N k 7 7 7 ⋮ 5 μ k,N k qk,N k ,N k

The dynamics of user j of class k is captured by the state process xk() and the action process aj(), which correspond to state x j ðt Þ 2 N k and action a j ðt Þ 2 A k at any slot t.

5.3.2

Restless Bandit Formulation and Opportunistic Scheduling

Let Πtx,a denote the set of all the policies composed of actions a(0), a(1)⋯, a(t), where a(t) is determined by the state history x(0), x(1), ⋯, x(t) and the action history a(0), a(1)⋯, a(t  1), i.e.,     Πtx,a ≔ aðiÞ j aðiÞ ¼ ϕ x0:i , a0:i1 , i ¼ 0, 1⋯, t ðeÞ

¼ faðiÞ j aðiÞ ¼ ϕðxðiÞÞ, i ¼ 0, 1⋯, t g

where ϕ is a mapping ϕ:(x0:i, a0:i1) ° a(i), x0:i ≔ (x(0), ⋯, x(i)) and a0:i1 ≔ (a(0), ⋯, a(i1)), and (e) is due to the Markovian feature. Let Πtx,a denote the space of randomized and non-anticipative policies depending on the joint state process x≔ðxk ðÞÞk2N and the joint action process Πtxk ,ak is the joint policy space. a≔ðak ðÞÞk2X , i:e:, Πtx,a ¼ k2X

Let πτ denote the expectation over the future states x(•) and the action process; a(•), conditioned on past states x(0), x(1), ⋯, x(τ) and the policy π 2 Πτx,a∘ . aðt Þ Consider any expected one-slot quantity GxðtÞ that depends on state x(t), an action a(t) at any time slot t. For any policy π 2 Π1 x,a and any discount factor 0 b β < 1, we define the infinite horizon β-average quantity as follows:

5 Whittle Index Policy for Opportunistic Scheduling. . .

116

T1 P aðÞ π0 fGxðÞ , β, 1g

t¼0

:¼ lim

βt πt fGxðtÞ g aðtÞ

T1 P

T!1

β

ð5:1Þ

t

t¼0

In the following we consider the discount factor β to be fixed and the horizon to aðÞ aðÞ be infinite; therefore, we omit them in π0 fGxðÞ , β, 1g and write briefly π0 fGxðÞ g. The reason for introducing π0 fg is that this form can smoothly transit to the average case β ¼ 1. Henceforth, we always suppose 0 b β < 1 except when explicitly emphasizing β ¼ 1. We are now ready to formulate the opportunistic scheduling problem faced by the server as follows. Problem 5.1 (Optimum Opportunistic Scheduling) For any discount factor β, the optimum opportunistic scheduling problem is to find a joint policy π ¼ ðπ 1 , ⋯, π K Þ 2 Π1 x,a a maximizing the total discounted reward (i.e., minimizing total discounted cost), mathematically defined as follows. ( ðPÞ : max π0 π2Πx,a

P k2X

) ak ðÞ r k,x k ðÞ

ð5:2Þ

X ak ðt Þ ¼ 1, t ¼ 0, 1, ⋯

s:t:

ð5:3Þ

k2X

The constraints (5.3) of problem (P) can be relaxed to the following: ( πt

) X ak ðtÞ ¼ 1, t ¼ 0, 1, ⋯ k2X T1 P

) lim

t¼0

( βt πt

k2X

T1 P

T!1

( ⟺π0

P

) ak ðtÞ wk,xðtÞ

¼1

βt

t¼0

X a ðÞ wk,xk k

) ¼1

ð5:4Þ

k2X

Using Lagrangian method, we obtain the following by combining (5.2) and (5.4),

5.4 Indexability Analysis and Index Computation

( max π0 π2Πx,a ¼

X k2X

X

117

) ak ðÞ r k,x k ðÞ

( 

νπ0

k2X

X a ðÞ k wk,x k ðÞ

)

k2X

n o a ðÞ a ðÞ max π0k r k,xk k ðÞ  νwk,xk k ðÞ

π k 2Πk , 2k

ð5:5Þ

Thus, we have the subproblem for class k 2 K :: n o a ðÞ a ðÞ ðSPÞ : max π0k r k,xk k ðÞ  νwk,xk k ðÞ : π k 2I xk ,ak

ð5:6Þ

Hence, our goal is to find an optimal policy π k for the subproblem kðk 2 K Þ and   then construct a feasible joint policy π ¼ π 1 , ⋯, π K for the problem (P). In the following, we focus on the subproblem (SP) and drop the subscript k.

5.4

Indexability Analysis and Index Computation

In this section, we first give a set of conditions on the channel state transition matrix, and, based on which, we obtain the threshold structure of the optimal scheduling strategy for the subproblem. We then establish the indexability under the proposed conditions.

5.4.1

Transition Matrices and Threshold Structure

Condition 1 Transition matrix Q can be written as Q ¼ O0 þ ε1 O1 þ ε2 O2 þ ⋯ þ ε2N2 O2N2 where h ≔ [h1, h2, ⋯, hN]⊤, Oj is defined in (5.8) and εj and λ are real numbers satisfying λ j ≔λ  εNj  εN1þj b0, 1bj < N 8 1N ðhÞ⊤ þ λIN > >

> Nj > > 0N , ⋯, 0N , 1Nj > N , 1N if j ¼ 0 > < |fflfflfflfflfflfflfflfflfflfflfflfflfflfflfflfflfflfflffl{zfflfflfflfflfflfflfflfflfflfflfflfflfflfflfflfflfflfflffl} Nj1 , if 1bjbN  1 O j≔ h > > > 0N , ⋯, 0N , 1N  1 jNþ1 , 1 jNþ1  1N if N bjb2N  2 > > N N > > : |fflfflfflfflfflfflfflfflfflfflfflfflfflfflfflfflfflfflfflfflfflfflfflfflfflfflfflfflfflfflfflfflfflffl{zfflfflfflfflfflfflfflfflfflfflfflfflfflfflfflfflfflfflfflfflfflfflfflfflfflfflfflfflfflfflfflfflfflffl} jN

Remark 5.1 Basically, Condition 1 implies that

ð5:7Þ

ð5:8Þ

5 Whittle Index Policy for Opportunistic Scheduling. . .

118

(i) Any two adjacent rows (i.e., Qi; Qi + 1) of matrix Q differ in only two adjacent positions (i.e., i; i + 1). For example, if N ¼ 3, Q is written as 2

h1  ε 2 þ λ h2  ε 1 þ ε 2

6 Q ¼ 4 h1 þ ε 3

h 3 þ ε1

h2  ε 1  ε 3 þ λ h 3 þ ε 1

h1 þ ε 3

h2  ε 3 þ ε 4

3 7 5

h 3  ε4 þ λ

(ii) When λj ¼ 0 for all j(1 b j < N ) the Q degenerates into the case of [18]. Now, we give the following lemma on the threshold structure of the optimum scheduling policy for the subproblem. Lemma 5.1 (Threshold Structure) Under Condition 1, for every real-valued v, there exists n 2 N [ f1g such that the optimum scheduling policy only schedules transmission in channel states δNn ≔fm 2 N : m > ng. Proof Please see Appendix A.1.

5.4.2



Indexability Analysis

For π k 2 Πxk ,ak , , we introduce the concept of serving set, δðδ ⊆ N k Þ, such that the user is served if n 2 δ and not served if n 2 = δ. By slightly introducing ambiguity, δ can also be regarded as a policy of serving the set δ. Thus, the subproblem (5.6) can be transformed to n o a ðÞ a ðÞ max δ0 r k,xk k ðÞ  νwk,xk k ðÞ δ2N k

ð5:9Þ

For further analysis, we define

δn :¼ δn :¼

n o n,a ðÞ δ0 r k,xkkðÞ 1β n o n,a ðÞ δ0 wk,xkkðÞ 1β

ð5:10Þ

ð5:11Þ

where n refers to the initial state of user of class k. By Lemma 5.1, if there exists price vn for n 2 N 0 such that both transmitting and not transmitting are optimal for v ¼ vn, then there exists a set, δ, such that both including state n in δ and not including state n lead to the same reward, i.e.,

5.4 Indexability Analysis and Index Computation 



119 



δn [fng  νn δn [fng ¼ δn ∖fng  νn δn ∖fng

ð5:12Þ

A straightforward consequence is that changing the action only in the initial period must also lead to the same reward, i.e., 





i i nh0,δ i  νn h0,δ ¼ h1,δ  νn h1,δ n n n



i

ð5:13Þ

where ha, δi is the policy that employs action a in the initial period and then proceeds according to δ.   i 6¼ 0, we have Then, if nh1,δ i  h0,δ n 

νn ¼



i i h1,δ  h0,δ n n   i i h1,δ  h0,δ : n n

ð5:14Þ

h1,δi  h0,δi n n h1,δi n  nh0,δi

ð5:15Þ

We further define νδn :¼

To circumvent the long proof of Whittle indexability, we establish the indexability result by checking the LP-indexability condition [27]. If a problem is LP-indexable, then it is Whittle-indexable. In the following analysis, we show that our problem is LP-indexable; that is, the problem is Whittle-indexable. Definition 5.1 ([27]) Problem (5.6) is LP-indexable with price νn ¼ νδnNn ¼

Nn i Nn i h1,δ  h0,δ n n , Nn i Nn i h1,δ  h0,δ n n

ð5:16Þ

if the following conditions hold: i i (i) n 2 N , h1,0i  h0,0i > 0, h1,N  h0,N > 0; n n n n h1,δ

i

h0,δ

i

Nn i Nn i  h0,δ > 0 and nþ1Nn  nþ1Nn > 0; (ii) n 2 N ∖fNg, h1,δ n n (iii) For each real value v, there exists n 2 N [ f1g such that the serving set δN_n is optimal.

To check the LP-indexability, we first characterize the four critical quantities in (5.16) under δN  n for any n. Based on balance equations, when n is not chosen in the initial slot, we have (5.17) in the matrix language (see the top of the next page) and, further, the following simplified form:

5 Whittle Index Policy for Opportunistic Scheduling. . .

120

2

h0,δNn i

1

3

2

q1,1

7 6 6 7 6⋮ ⋮ 6 7 6 6 6 h0,δNn i 7 6 qn,1 7 6 n 6 h1,δ i 7 ¼ β6 6 Nn 7 6 nþ1 qnþ1,1 6μ 6 nþ1 7 6 7 6 4 ⋮ 5 4⋮ μ N qN,1

h1,δNn i

N



q1,n1



q1,N









⋯ ⋯

qn,n μ nþ1 qnþ1,n

⋯ ⋯

qn,N μ nþ1 qnþ1,N

⋱ ⋯

⋮ ⋱ μ N qN,n1 ⋯ 3 c ⋮ 7 7 7 c 7 7  c μnþ1 7 7 7 ⋮ 5

2

6 6 6 6 þ6 6 6 6 4 2

h0,δ i 1 Nn

3

h1,δNn i

q1,1



q1,n1

q1,n



q1,N



⋱ ⋯







qn1,n1

qn1,n

⋱ ⋯

qn1,N

μ n qn,1



μ n qn,n1

μ n qn,n



μ n qn,N

⋮ μ N qN,1

⋱ ⋯

⋮ μ N qN,n1

⋮ μ N qN,n

⋱ ⋯

μ N qN,N

qn1,1

2

c

h1,δNn i

N

ð5:17Þ

 c μN 2

7 6 6 7 6⋮ 6 7 6 6 6 h0,δNn i 7 6 6 n1 7 6 h1,δ i 7 ¼ 6 6 Nn 7 6 6 7 6 n 6 7 6 4 5 4⋮ N

⋮ μ N qN,N

32 h0,δNn i 3 1 7 76 7 6 76 ⋮ 76 h0,δNn i 7 7 76 n 7 76 76 h1,δNn i 7 76 nþ1 7 7 76 7 54 ⋮ 5

32 h0,δNn i 3 1 7 76 7 6 76 ⋮ 76 h0,δNn i 7 76 n1 7 7 76 76 h1,δNn i 7 7 76 n 7 76 7 54 ⋮ 5 h1,δNn i

N

3

6 ⋮ 7 7 6 7 6 6 c 7 7 þ6 6  c μn 7 7 6 7 6 4 ⋮ 5  c μN ðIN  βM 0 Þ  r0 ¼ c0 where,

⊤ ⊤ ⊤ M 0 ¼ Q⊤ Q⊤ 1 , ⋯, Qn , nþ1 μnþ1 , ⋯, QN μN

⊤ c0 ¼ c, ⋯, c, c, cμnþ1 , ⋯, cμN h i⊤ h0,δ i h1,δ i h1,δ i r0 ¼ ℝ1 Nn , ⋯, ℝhn0,δNn i , ℝnþ1Nn , ⋯, ℝN Nn :

ð5:18Þ

ð5:19Þ

5.4 Indexability Analysis and Index Computation

121

Similarly, when n is chosen in the initial slot, we have (5.18) and, further, the following: ðIN  βM 1 Þ  r1 ¼ c1

ð5:20Þ

where,

⊤ ⊤ ⊤ ⊤ M 1 ¼ Q⊤ 1 , ⋯, Qn1 , Qn μn , ⋯, QN μN

⊤ c1 ¼ c, ⋯, c, cμn , cμnþ1 , ⋯, cμN h i⊤ h0,δ i h0,δ i h1,δ i r1 ¼ ℝ1 Nn , ⋯, ℝn1Nn , ℝhn1,δNn i , ⋯, ℝN Nn Thus, from (5.19) and (5.20), we can obtain 1 ℝhn0,δNn i ¼ e⊤ n ðI N  βM 0 Þ c0

ð5:21Þ

1 ℝhn1,δNn i ¼ e⊤ n ðI N  βM 1 Þ c1

ð5:22Þ

Similarly, replacing c0 , c1 by 1N  1nN , 1N  1n1 N from (5.19) and (5.20), respectively, we have ðIN  βM 0 Þ  w0 ¼ 1N  1nN

ð5:23Þ

ðIN  βM 1 Þ  w1 ¼ 1N  1n1 N

ð5:24Þ

where, h i⊤ h0,δ i h1,δ i h1,δ i Nn i w0 ¼ 1 Nn , ⋯, h0,δ , nþ1Nn , ⋯, N Nn n h i⊤ h0,δ i h0,δ i h1,δ i Nn i w1 ¼ 1 Nn , ⋯, n1Nn , h1,δ , ⋯, N Nn n Further, 1 n Nn i ¼ e⊤ h0,δ n n ðIN  βM 0 Þ ð1N  1N Þ

ð5:25Þ

1 n1 Nn i h1,δ ¼ e⊤ n n ðIN  βM 1 Þ ð1N  1N Þ

ð5:26Þ

After obtaining the four critical quantities, we now check the LP-indexability condition. Lemma 5.2 Under Condition 1, for any n 2 N ∖fN g, we have Nn i Nn i (i) h1,δ > h0,δ n n

h1,δ

i

h0,δ

(ii) nþ1Nn > nþ1Nn

i

5 Whittle Index Policy for Opportunistic Scheduling. . .

122

Proof Please see Appendix A.2.



Lemma 5.3 Under Condition 1, Problem (5.6) is LP-indexable with price vn in (5.16). Proof According to Definition 5.1, we prove the indexability by checking three conditions. i 1 1 (i) Obviously, h0,∅i ¼ 0, h1,∅i ⩾ 1 , and h1,N ¼ 1β . For any, δ, δn b 1β n n n i 1 and further h0,N < 1β : n (ii) The second condition is proved in Lemma 5.2. (iii) The third condition is proved in Lemma 5.1.

Therefore, the LP-indexability is proved. Following Lemma 5.3, the following theorem states our main result on the indexability of Problem (5.6). □ Theorem 5.1 (Indexability) Under Condition 1, we have (i) if v b vn, it is optimal to serve the user in state n; (ii) if v > vn, it is optimal not to serve the user in state n.

5.4.3

Computing Index

In this part, we exploit the structural property of the transition matrix Q to simplify the index computation and further obtain the closed-form Whittle index. Proposition 5.1 Under Condition 1, we have h0,δ i Nn i Nn i (i) 1 Nn ¼ ⋯ ¼ h0,δ ¼ h0,δ . w n h0,δ

i

h0,δ

i

(ii) ℝ1 Nn ¼ ⋯ ¼ ℝn1Nn ¼ ℝhn0,δNn i (iii) The Whittle index is

νn ¼

Nn i μn h1,δ n Nn i 1  μn h1,δ n

Proof Following the proof of Lemma 5.2, we have h0,δNn i

1 and

h0,δ

i

Nn i ¼ ⋯ ¼ n1Nn ¼ h0,δ n

ð5:27Þ

5.4 Indexability Analysis and Index Computation

Nn i Nn i h1,δ  h0,δ ¼ n n

123

i 1  μn nh1,δNn i h ⊤ en ðIN  βM 0 Þ1 en 1  μn

ð5:28Þ

Similarly, h0,δNn i

ℝ1

h0,δ

i

¼ ⋯ ¼ ℝn1Nn ¼ ℝhn0,δNn i

and ℝhn1,δNn i  ℝhn0,δNn i ¼

i μn ℝhn1,δNn i h ⊤ en ðIN  βM 0 Þ1 en 1  μn

ð5:29Þ

Therefore, νn ¼

Nn i Nn i Nn i h1,δ  h0,δ μn h1,δ n n n ¼ Nn i h1,δ  nh0,δNn i 1  μn nh1,δNn i n

Nn i Based on (5.27), in order to obtain vn, we only need to compute h1,δ and n h1,δNn i . Further, by some complex operations, we can obtain the closed-form ℝn Whittle index as follows:

νn ¼

cμn , 1 bn b N 1  β þ f ðnÞ

ð5:30Þ

where f ðnÞ ¼

N X

βqn,i 1 

i¼nþ1

i Y

μ

dn1,i ¼ β qn,i þ

qn,k

k¼iþ1

Ki ¼ for i(n + 1  i  N ).



j1

1 μj

j¼nþ1 N X

1

1 1μi 1 1μi

 βλ

 βλ k Y j¼iþ1

 1μ1

i1

 βλi1

! j1

! þ μn dn1,i K i

j1 μ

1 j1

1 μj

 βλ

 βλ

! j1

j1

5 Whittle Index Policy for Opportunistic Scheduling. . .

124

5.5

Indexability Extension and Scheduling Policy

In this section, we first extend the proposed Condition 1 and obtain the indexability as well as the Whittle index. Next, we propose an eigenvalue-arithmetic-mean scheme to approximate any transition matrix, and further obtain the corresponding approximate Whittle index. Finally, based on the closed-form Whittle index, we construct an efficient scheduling policy.

5.5.1

Indexability Extension

In Sect. 5.4.3, the computing process of Vn shows that the Vn only depends on the structure of Q rather than the sign of λj, i.e., (5.7). Thus, we release Condition 1 based on the monotonicity of vn and obtain the following theorem on the indexability. Theorem 5.2 If Q can be written as Q ¼ O0 þ ε1 O1 þ ε2 O2 þ ⋯ þ ε2N2 O2N2 ,

ð5:31Þ

then Problem (5.6) is indexable and the Whittle index for state n(n ¼ 1, ⋯, N) is vn ¼

8
> < > > :1  β þ β

cμn

qn,i ðμi  μn Þ , i¼nþ1 1  βλð1  μi Þ N P

if β ¼ 1, n ¼ N, otherwise:

ð5:33Þ

5.5 Indexability Extension and Scheduling Policy

125

Remark 5.3 This corollary shows that the Whittle index degenerates into that of [18] if λ1 ¼ ⋯ ¼ λN  1 ¼ λ ¼ 0.

5.5.2

Transition Matrix Approximation

Given a generic Q, where Q 6¼ O0 þ ε1 O1 þ ε2 O2 þ ⋯ þ ε2N2 O2N2 , thus the result of Theorem 5.2 cannot be used. For this case, we approximate Q by the following eigenvalue-arithmetic-mean scheme, Q ¼ VΛV 1

ð5:34Þ

b 1 b ¼ V ΛV Q   b ¼ diag 1, b Λ λ, ⋯, bλ

ð5:35Þ

traceðQÞ  1 b λ¼ N1

ð5:36Þ ð5:37Þ

where b λ is the arithmetic mean of the N  1 eigenvalues of Q (excluding the trivial eigenvalue 1). Thus, the approximate matrix Q satisfies the condition of Corollary 5.1, and furthermore, the Whittle index can be approximated by

Vn ¼

5.5.3

8 1, > >
otherwise: > :1β þ β λ ð1  μ i Þ i¼nþ1 1  βb

ð5:38Þ

Scheduling Policy

In the previous sections, we have obtained the closed-form Whittle index for each subproblem. Now, we construct the joint scheduling policy for the original problem. In particular, the scheduling policy is to serve the user in k*(t) with the highest actual price, i.e.,

5 Whittle Index Policy for Opportunistic Scheduling. . .

126

k  ðtÞ ¼ argmaxk2K ½νk,xk ðtÞ , if νk,xk ðtÞ < 1:

ð5:39Þ

Actually, νk,xk ðtÞ < 1 always holds if 0 b β < 1. It happens νk,xk ðtÞ ! 1 only when β ¼ 1 and xk(t) ¼ Nk, corresponding to the average case. Therefore, the second item, ck μk,xk ðtÞ , of Laurent expansion of νk,xk ðtÞ would be taken as the secondary index in the case of β ¼ 1 and xk(t) ¼ Nk since lim ð1  βÞνk,N k ¼

β!1

ð1  βÞck μk,N k ¼ ck μk,N k : 1β

ð5:40Þ

Now, we give the marginal productivity index (MPI) scheduler in Algorithm 1. The MPI scheduler always serves the user currently with the best condition, i.e., v1 b ⋯ b vN, and is one of the best-condition schedulers, which has the stability property in a Markovian setting [9]. Theorem 5.3 ([9]) The MPI scheduler with one server is maximally stable under arbitrary arrivals. Algorithm 1 MPI scheduler (β ¼ 1) 1: for t 2 T 2: C number of system users in N k ðk 2 K Þ 3: if C 1 then   4: Serve one user in Nk with max ck μk,N k ðk 2 K Þ 5: (breaking ties randomly) 6: else 7: if condition (5.31) is satisfied 8: Serve the user k(t) with highest index value by (5.33) 9: else 10: Serve the user k(t) with highest index value by (5.38) 11: end if 12: (breaking ties randomly) 13: end if 14: end for

5.6 Numerical Simulation

In this section, we compare the proposed MPI scheduler with the following policies:

(i) the $c\mu$ rule, $\nu^{c\mu}_{k,n} = c_k \mu_{k,n}$,

(ii) the RB rule, $\nu^{RB}_{k,n} = \dfrac{c_k \mu_{k,n}}{\sum_{m=1}^{N_k} q^{SS}_{k,m}\mu_{k,m}}$,


(iii) the PB rule, $\nu^{PB}_{k,n} = \dfrac{c_k \mu_{k,n}}{\mu_{k,N_k}}$,

(iv) the SB rule, $\nu^{SB}_{k,n} = c_k \sum_{m=1}^{n} q^{SS}_{k,m}$,

(v) the PISS rule [14], $\nu^{SS}_{k,n} = \dfrac{c_k \mu_{k,n}}{\sum_{m>n} q^{SS}_{k,m}(\mu_{k,m} - \mu_{k,n})}$,

where $q^{SS}_{k,m}$ is the stationary probability of state $m$ for a user of class $k$. Specifically, we only consider the case with at most one user served at each time slot. If more than one user has the highest index value, we choose one of them uniformly at random. In addition, we only consider two classes of users for a clearer performance comparison. Moreover, before evaluating the performance of the different schedulers, we first test their similarity for a given scenario by computing the corresponding indices, and then choose one scheduler as a representative among multiple identical schedulers. In this way, we decrease the simulation time and meanwhile obtain compact figures for the performance comparison. Let $\tau = 1.67$ ms for each slot, following practical applications [28]. The arrival probability for a new user of class $k$ is characterized by $\xi_k = \rho_k \mu_{k,N_k}$.

Table 5.2 Parameters adopted in simulation

No. 1: $(c_1, c_2) = (1, 1)$; $s_{k,n}$ (Mb/s): class 1: 8.4, 50.4, 53.76; class 2: 26.88, 44.688, 80.64; channel transition matrices (identical for both classes): [0.00 0.80 0.20; 0.30 0.50 0.20; 0.30 0.60 0.10]; job sizes (Mb): 0.5, 0.5.

No. 2: $(c_1, c_2) = (10, 1)$; $s_{k,n}$ (Mb/s): both classes: 8.4, 16.8, 33.6; channel transition matrices: class 1: [0.00 0.50 0.50; 0.10 0.40 0.50; 0.10 0.70 0.20], class 2: [0.25 0.60 0.15; 0.35 0.50 0.15; 0.35 0.55 0.10]; job sizes (Mb): 5, 0.5.

No. 3: $(c_1, c_2) = (2, 3)$; $s_{k,n}$ (Mb/s): class 1: 8.4, 16.8, 50.4, 67.2; class 2: 26.88, 33.6, 44.688, 80.64; channel transition matrices: class 1: [0.50 0.10 0.20 0.20; 0.15 0.45 0.20 0.20; 0.15 0.15 0.50 0.20; 0.15 0.15 0.10 0.60], class 2: [0.10 0.35 0.25 0.30; 0.20 0.25 0.25 0.30; 0.20 0.30 0.10 0.30; 0.20 0.30 0.40 0.10]; job sizes (Mb): 0.5, 0.5.
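The five benchmark indices (i)–(v) can be computed directly from $c_k$, $\mu_{k,n}$, and the stationary distribution. A Python sketch for one class (helper names are ours; inputs are NumPy arrays with $\mu_{k,n}$ strictly increasing):

import numpy as np

def stationary(Q):
    """Stationary distribution q^SS of an irreducible chain Q."""
    N = Q.shape[0]
    A = np.vstack([Q.T - np.eye(N), np.ones(N)])   # pi Q = pi, sum(pi) = 1
    b = np.concatenate([np.zeros(N), [1.0]])
    return np.linalg.lstsq(A, b, rcond=None)[0]

def benchmark_indices(c_k, mu_k, Q_k):
    """Per-state indices of the cmu, RB, PB, SB, and PISS rules."""
    q_ss = stationary(Q_k)
    N = len(mu_k)
    cmu = c_k * mu_k                                  # rule (i)
    rb = c_k * mu_k / np.dot(q_ss, mu_k)              # rule (ii)
    pb = c_k * mu_k / mu_k[-1]                        # rule (iii)
    sb = c_k * np.cumsum(q_ss)                        # rule (iv)
    piss = np.array([c_k * mu_k[n] / sum(q_ss[m] * (mu_k[m] - mu_k[n])
                                         for m in range(n + 1, N))
                     if n < N - 1 else np.inf for n in range(N)])  # rule (v)
    return cmu, rb, pb, sb, piss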

[Fig. 5.1 Scenario 1. Upper: time-average waiting cost as a function of ρ; lower: number of users in the system as a function of time (ρ = 0.98). Curves: pb/ss, sb, cμ/rb/mpi.]

For comparison, we adopt the transmission rates $s_{k,n}$ from [28] and job sizes $\mathbb{E}\{b_k\} = 0.5$ Mb for HTML, $\mathbb{E}\{b_k\} = 5$ Mb for PDF, and $\mathbb{E}\{b_k\} = 50$ Mb for MP3. In this case, the departure probability is determined by $\mu_{k,n} = \tau s_{k,n} / \mathbb{E}\{b_k\}$. We assume $\rho_1 = \rho_2$, and the system load $\rho = \rho_1 + \rho_2$ varies from 0.3 to 1 for a better presentation. The initial channel condition of a new user at the moment of entering the system is drawn from the stationary distribution, i.e., a new user of class $k$ is in state $m$ with probability $q^{SS}_{k,m}$. The parameter settings for the following scenarios are stated in Table 5.2.
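As a quick arithmetic check of the resulting departure probabilities (our computation, using the scenario-1 numbers from Table 5.2 and interpreting job sizes in Mb as in the prose):

tau = 1.67e-3                            # slot length in seconds
rates_class1 = [8.4, 50.4, 53.76]        # s_{1,n} in Mb/s, Table 5.2, scenario 1
job_size = 0.5                           # E{b_1} = 0.5 Mb
mu_class1 = [tau * s / job_size for s in rates_class1]
# -> roughly [0.028, 0.168, 0.180], the per-slot service completion probabilities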

5.6.1 Scenario 1

In this case, the setting is given in Table 5.2. In particular, the users are divided into two classes. Each user requires a job of expected size 0.5 Mb and has the same waiting cost $c_1 = c_2 = 1$. The channel state transition matrix is identical for both classes, but the second class has a better transmission rate than the first. Our goal is to minimize the number of users waiting for service in the system. Under this setting, three policies ($c\mu$, RB, and MPI) can be shown to induce the same scheduling rule, i.e., the (class, state) order (2, 3) > (1, 3) > (1, 2) > (2, 2) > (2, 1) > (1, 1). Also, PB and PISS yield the same order, (1, 3) = (2, 3) > (1, 2) > (2, 2) > (2, 1) > (1, 1), while the SB policy generates the order (2, 3) = (1, 3) > (1, 2) = (2, 2) > (2, 1) = (1, 1). Thus, Fig. 5.1 shows how the time-average waiting cost varies with the system load ρ under each policy, and how the number of users in the system evolves over the time slots.

[Fig. 5.2 Scenario 2. Upper: time-average waiting cost as a function of ρ; lower: number of users in the system as a function of time (ρ = 0.90). Curves: cμ, rb/pb, sb, mpi/ss.]

Obviously, we observe that the behavior of all policies is quite similar. In addition, Fig. 5.1 clearly shows that $c\mu$, RB, and MPI perform better than PB, SB, and PISS; this is because $c\mu$, RB, and MPI keep the scheduling balanced between class 1 and class 2. All policies perform well for ρ < 0.9 but clearly have stability problems as ρ approaches 1, at which point the time-average waiting cost begins to rise very steeply.

[Fig. 5.3 Scenario 3. Upper: time-average waiting cost as a function of ρ; lower: number of users in the system as a function of time (ρ = 0.94). Curves: cμ, rb/pb, sb, ss, mpi.]

5.6.2 Scenario 2

In this case, we consider two classes of users with different job sizes: the first class requires a job of expected size 5 Mb, while the second requires 0.5 Mb. The waiting costs for the two classes are $c_1 = 10$ and $c_2 = 1$, respectively. The two classes have the same transmission rates but different channel state transition matrices. We can easily check that PISS and MPI generate the same scheduling rule, (1, 3) > (2, 3) > (1, 2) > (2, 2) > (1, 1) > (2, 1), while PB and RB share the rule (1, 3) > (1, 2) > (1, 1) > (2, 3) > (2, 2) > (2, 1). Figure 5.2 shows that MPI has performance comparable with $c\mu$ and better than the other policies in both the time-average waiting cost and the average number of users in the system. From the scheduling order, we observe that PB (or RB) has the worst performance because of the extreme imbalance across user classes, i.e., serving class 1 with complete priority over class 2; SB also performs poorly because of a partial imbalance across classes, as seen from its scheduling order (1, 3) > (1, 2) > (2, 3) > (1, 1) > (2, 2) > (2, 1).

5.6.3 Scenario 3

In this case, we assume that each class of users has four states, with different waiting costs, different transmission rates, and different channel transition matrices. We can check that PB and RB are the same, i.e., (1, 4) > (1, 3) > (2, 4) > (2, 3) > (1, 2) > (2, 2) > (1, 1) > (2, 1). Figure 5.3 shows that the MPI policy, (2, 4) > (1, 4) > (2, 3) > (1, 3) > (1, 2) > (2, 2) > (1, 1) > (2, 1), has performance comparable with the PISS policy, (2, 4) = (1, 4) > (2, 3) > (1, 3) > (1, 2) > (2, 2) > (1, 1) > (2, 1), and better than the others in both the average cost and the number of waiting users.

5.7 Summary

In this chapter, we have investigated the opportunistic scheduling problem involving multi-class multistate time-varying Markovian channels. Generally, the problem can be formulated as a restless multiarmed bandit problem. To the best of our knowledge, previous work only established an index policy for a two-state channel process and derived some limited results on multistate time-varying channels under an assumption of indexability as a prerequisite. To fill this gap, for the class of state transition matrices characterized by our proposed sufficient condition, we prove the indexability of the Whittle index policy and obtain the closed-form Whittle index. Simulation results show that the proposed index scheduler is effective in scheduling multi-class multistate channels. One future objective is to seek more generic conditions to guarantee the indexability.


Appendix

Proof of Lemma 5.1 Let $v_n$ denote the optimal value function, and define

$$v_n^a := r_n^a - \nu w_n^a + \beta \sum_{m \in \mathcal{N}} p_{n,m}^a v_m,$$

$$g_n\bigl(v_n, v_{(n+1)}\bigr) := \sum_{i=1}^{n-1} \varepsilon_{N-1+i} v_i - \sum_{i=1}^{n-2} \varepsilon_{N-1+i} v_{i+1} + \sum_{i=n+1}^{N-1} \varepsilon_{N-i} v_{i+1} - \sum_{i=n+2}^{N-1} \varepsilon_{N-i} v_i,$$

$$\alpha_n^0 := \begin{cases} -\varepsilon_{N-n}, & \text{if } n = 1,\\ \varepsilon_{N-2+n} - \varepsilon_{N-n}, & \text{if } 2 \le n \le N-1,\\ \varepsilon_{2N-2}, & \text{if } n = N, \end{cases} \qquad \alpha_{n+1}^1 := \begin{cases} \varepsilon_{N-n} - \varepsilon_{N-n-1}, & \text{if } 1 \le n \le N-2,\\ \varepsilon_{N-n}, & \text{if } n = N-1,\\ 0, & \text{if } n = N. \end{cases}$$

For state $n \in \mathcal{N}$, the Bellman equation is

$$\begin{aligned} v_n &= \max\bigl\{v_n^0,\ v_n^1\bigr\}\\ &= \max\Bigl\{ r_n^0 - \nu w_n^0 + \beta \sum_{m \in \mathcal{N}_0} h_m v_m + \beta g_n\bigl(v_n, v_{(n+1)}\bigr) + \beta\bigl[(\lambda + \alpha_n^0)v_n + \alpha_{n+1}^1 v_{n+1}\bigr],\\ &\qquad\quad r_n^1 - \nu w_n^1 + \beta \sum_{m \in \mathcal{N}_0} (1-\mu_n) h_m v_m + \beta \mu_n v_0 + \beta(1-\mu_n) g_n\bigl(v_n, v_{(n+1)}\bigr) + \beta(1-\mu_n)\bigl[(\lambda + \alpha_n^0)v_n + \alpha_{n+1}^1 v_{n+1}\bigr] \Bigr\}\\ &= -c + \beta \sum_{m \in \mathcal{N}_0} h_m v_m + \beta g_n\bigl(v_n, v_{(n+1)}\bigr) + \beta\bigl[(\lambda + \alpha_n^0)v_n + \alpha_{n+1}^1 v_{n+1}\bigr]\\ &\quad + \max\Bigl\{0,\ -\nu + \mu_n\Bigl(c + \beta v_0 - \beta \sum_{m \in \mathcal{N}_0} h_m v_m - \beta g_n\bigl(v_n, v_{(n+1)}\bigr) - \beta\bigl[(\lambda + \alpha_n^0)v_n + \alpha_{n+1}^1 v_{n+1}\bigr]\Bigr) \Bigr\}, \end{aligned}$$

where the first term in the curly brackets corresponds to action 0 and the second to action 1. Obviously, transmitting (i.e., action 1) is optimal in state $n \in \mathcal{N}\setminus\{0\}$ if the first term is less than the second one. For ease of presentation, let

$$Z := c + \beta v_0 - \beta g_n\bigl(v_n, v_{(n+1)}\bigr) - \beta \sum_{m \in \mathcal{N}_0,\, m \neq n, n+1} h_m v_m,$$
$$Z_n := Z - \beta\bigl(\lambda + \alpha_n^0 + h_n\bigr) v_n - \beta\bigl(\alpha_{n+1}^1 + h_{n+1}\bigr) v_{n+1}.$$

Now we analyze the Bellman equation in two cases.

Case 1. If $\nu > 0$, we have $v_n \le 0$ for any $n \in \mathcal{N}\setminus\{0\}$. If transmitting is optimal in state $n \in \mathcal{N}\setminus\{0, N\}$, we obtain $-\nu + \mu_n Z_n \ge 0$, indicating $Z_n > 0$, and further $-\nu + \mu_{n+1} Z_n > -\nu + \mu_n Z_n$ since $\mu_{n+1} > \mu_n$. Thus

$$\begin{aligned} v_n &= -c + \mu_n Z_n - \nu + \beta \sum_{m \in \mathcal{N}_0} h_m v_m + \beta\Bigl[(\lambda + \alpha_n^0)v_n + \alpha_{n+1}^1 v_{n+1} + g_n\bigl(v_n, v_{(n+1)}\bigr)\Bigr]\\ &< -c + \mu_{n+1} Z_n - \nu + \beta \sum_{m \in \mathcal{N}_0} h_m v_m + \beta\Bigl[(\lambda + \alpha_n^0)v_n + \alpha_{n+1}^1 v_{n+1} + g_n\bigl(v_n, v_{(n+1)}\bigr)\Bigr], \end{aligned}$$

equivalently,

$$\begin{aligned} v_n\Bigl(1 - \beta(1-\mu_{n+1})(\lambda - \varepsilon_{N-n} - \varepsilon_{N+n-1})\Bigr) &< -c + \mu_{n+1} Z - \nu + \beta \sum_{m \in \mathcal{N}_0,\, m \neq n, n+1} h_m v_m + \beta g_n\bigl(v_n, v_{(n+1)}\bigr)\\ &\quad + \beta(1-\mu_{n+1})\bigl(h_n + \alpha_n^0 + \varepsilon_{N-n} + \varepsilon_{N+n-1}\bigr)v_n + \beta(1-\mu_{n+1})\bigl(h_{n+1} + \alpha_{n+1}^1\bigr)v_{n+1}\\ &= -c + \mu_{n+1} Z - \nu + \beta \sum_{m \in \mathcal{N}_0,\, m \neq n, n+1} h_m v_m + \beta g_n\bigl(v_n, v_{(n+1)}\bigr)\\ &\quad + \beta(1-\mu_{n+1})(h_n + \gamma_n)v_n + \beta(1-\mu_{n+1})\bigl(h_{n+1} + \alpha_{n+1}^1\bigr)v_{n+1}, \qquad (5.41) \end{aligned}$$

where $\gamma_n := \alpha_n^0 + \varepsilon_{N+n-1} + \varepsilon_{N-n}$.

For state $n+1$, if action 1 is adopted, then we have, according to the Bellman equation,

$$\begin{aligned} v_{n+1}^1 &= -c - \nu + \beta \sum_{m \in \mathcal{N}_0} h_m v_m + \beta g_{n+1}\bigl(v_{(n+1)}, v_{(n+2)}\bigr) + \beta\bigl[(\lambda + \alpha_{n+1}^0)v_{n+1} + \alpha_{n+2}^1 v_{n+2}\bigr]\\ &\quad + \mu_{n+1}\Bigl(c + \beta v_0 - \beta \sum_{m \in \mathcal{N}_0} h_m v_m - \beta g_{n+1}\bigl(v_{(n+1)}, v_{(n+2)}\bigr) - \beta\bigl[(\lambda + \alpha_{n+1}^0)v_{n+1} + \alpha_{n+2}^1 v_{n+2}\bigr]\Bigr)\\ &\overset{(a)}{=} -c + \mu_{n+1} Z - \nu + \beta \sum_{m \in \mathcal{N}_0,\, m \neq n, n+1} h_m v_m + \beta g_n\bigl(v_n, v_{(n+1)}\bigr)\\ &\quad + \beta(1-\mu_{n+1})\bigl(h_n + \alpha_n^0 + \varepsilon_{N+n-1} + \varepsilon_{N-n}\bigr)v_n + \beta(1-\mu_{n+1})\bigl(h_{n+1} + \lambda + \alpha_{n+1}^1 - \varepsilon_{N-n} - \varepsilon_{N+n-1}\bigr)v_{n+1}, \end{aligned}$$

i.e.,

$$\begin{aligned} v_{n+1}^1 - \beta(1-\mu_{n+1})(\lambda - \varepsilon_{N-n} - \varepsilon_{N+n-1})v_{n+1} &= -c + \mu_{n+1} Z - \nu + \beta \sum_{m \in \mathcal{N}_0,\, m \neq n, n+1} h_m v_m + \beta g_n\bigl(v_n, v_{(n+1)}\bigr)\\ &\quad + \beta(1-\mu_{n+1})(h_n + \gamma_n)v_n + \beta(1-\mu_{n+1})\bigl(h_{n+1} + \alpha_{n+1}^1\bigr)v_{n+1}, \qquad (5.42) \end{aligned}$$

where (a) is due to $g_n(v_n, v_{(n+1)}) = g_{n+1}(v_{(n+1)}, v_{(n+2)}) + (\varepsilon_{N-2+n} - \varepsilon_{N-1+n})v_n + (\varepsilon_{N-n-1} - \varepsilon_{N-n-2})v_{n+2}$, $\alpha_{n+1}^0 = \alpha_{n+1}^1 - \varepsilon_{N-n} - \varepsilon_{N+n-1}$, and $\alpha_{n+2}^1 = \varepsilon_{N-n-1} - \varepsilon_{N-n-2}$.

Thus, combining (5.41) and (5.42), we have

$$\begin{aligned} v_n\Bigl(1 - \beta(1-\mu_{n+1})(\lambda - \varepsilon_{N-n} - \varepsilon_{N+n-1})\Bigr) &< v_{n+1}^1 - \beta(1-\mu_{n+1})(\lambda - \varepsilon_{N-n} - \varepsilon_{N+n-1})v_{n+1}\\ &\le v_{n+1} - \beta(1-\mu_{n+1})(\lambda - \varepsilon_{N-n} - \varepsilon_{N+n-1})v_{n+1}\\ &= v_{n+1}\Bigl(1 - \beta(1-\mu_{n+1})(\lambda - \varepsilon_{N-n} - \varepsilon_{N+n-1})\Bigr), \end{aligned}$$

which indicates $v_n < v_{n+1}$. Meanwhile,

$$\begin{aligned} v_n \ge v_n^0 &= -c + \beta \sum_{m \in \mathcal{N}_0} h_m v_m + \beta g_n\bigl(v_n, v_{(n+1)}\bigr) + \beta\bigl[(\lambda + \alpha_n^0)v_n + \alpha_{n+1}^1 v_{n+1}\bigr]\\ &= -c + \beta \sum_{m \in \mathcal{N}_0,\, m \neq n, n+1} h_m v_m + \beta g_n\bigl(v_n, v_{(n+1)}\bigr) + \beta\bigl[(h_n + \lambda + \alpha_n^0)v_n + (h_{n+1} + \alpha_{n+1}^1)v_{n+1}\bigr], \end{aligned}$$

i.e.,

$$v_n\bigl(1 - \beta(\lambda - \varepsilon_{N-n} - \varepsilon_{N+n-1})\bigr) \ge -c + \beta \sum_{m \in \mathcal{N}_0,\, m \neq n, n+1} h_m v_m + \beta g_n\bigl(v_n, v_{(n+1)}\bigr) + \beta\bigl[(h_n + \gamma_n)v_n + (h_{n+1} + \alpha_{n+1}^1)v_{n+1}\bigr]. \qquad (5.43)$$

On the other hand, we have, according to the Bellman equation,

$$\begin{aligned} v_{n+1}^0 &= -c + \beta \sum_{m \in \mathcal{N}_0} h_m v_m + \beta g_{n+1}\bigl(v_{(n+1)}, v_{(n+2)}\bigr) + \beta\bigl[(\lambda + \alpha_{n+1}^0)v_{n+1} + \alpha_{n+2}^1 v_{n+2}\bigr]\\ &\overset{(b)}{=} -c + \beta \sum_{m \in \mathcal{N}_0,\, m \neq n, n+1} h_m v_m + \beta g_n\bigl(v_n, v_{(n+1)}\bigr) + \beta\bigl[(h_n + \gamma_n)v_n + (h_{n+1} + \alpha_{n+1}^1)v_{n+1}\bigr] + \beta(\lambda - \varepsilon_{N-n} - \varepsilon_{N+n-1})v_{n+1}, \end{aligned}$$

i.e.,

$$v_{n+1}^0\bigl(1 - \beta(\lambda - \varepsilon_{N-n} - \varepsilon_{N+n-1})\bigr) \le -c + \beta \sum_{m \in \mathcal{N}_0,\, m \neq n, n+1} h_m v_m + \beta g_n\bigl(v_n, v_{(n+1)}\bigr) + \beta\bigl[(h_n + \gamma_n)v_n + (h_{n+1} + \alpha_{n+1}^1)v_{n+1}\bigr], \qquad (5.44)$$

where the inequality is due to $\lambda \le \varepsilon_{N-n} + \varepsilon_{N+n-1}$ and $v_{n+1}^0 \le v_{n+1}$, and (b) is due to $g_n(v_n, v_{(n+1)}) = g_{n+1}(v_{(n+1)}, v_{(n+2)}) + \gamma_n v_n + (\varepsilon_{N-n-1} - \varepsilon_{N-n-2})v_{n+2}$.

Thus, combining (5.43) and (5.44), we have $v_{n+1}^0 \le v_n$. Since $v_n < v_{n+1}$, we conclude

$$v_{n+1}^0 \le v_n < v_{n+1} = v_{n+1}^1,$$

that is, transmitting is optimal in state $n+1$.

Case 2. If $\nu < 0$, then we proceed as follows. First, using the Bellman equation, it is easy to obtain that $v_0 = \frac{-\nu}{1-\beta}$, because action 1 is optimal in state 0 and thus $-\nu$ is obtained in every period forever. Notice that the one-period net reward $r_n^a - \nu w_n^a$ is, for any state $n \in \mathcal{N}$ and any action $a \in \mathcal{A}$, upper-bounded by $-\nu$, i.e., $|r_n^a - \nu w_n^a| \le -\nu$. Hence $v_m \le \frac{-\nu}{1-\beta} = v_0$ for any $m \in \mathcal{N}_0$, and therefore (using $c > 0$ and $\lambda + \sum_{m \in \mathcal{N}_0} h_m = 1$) $Z_n > 0$; finally, for any state $n \in \mathcal{N}\setminus\{0\}$, $-\nu + \mu_n Z_n > 0$. That is, transmitting is optimal in any state $n \in \mathcal{N}$.

Combining the two cases, we complete the proof.

Proof of Lemma 5.2 We first show $W_n^{\langle 1,\delta_{N_n}\rangle} - W_n^{\langle 0,\delta_{N_n}\rangle} > 0$ by the following four steps.

Step 1. According to the definition of the $\beta$-average work, we have $W_n^{\langle 1,\delta_{N_n}\rangle} \ge 1,\ W_{n+1}^{\langle 1,\delta_{N_n}\rangle} \ge 1,\ \cdots,\ W_N^{\langle 1,\delta_{N_n}\rangle} \ge 1$. To show $W_n^{\langle 1,\delta_{N_n}\rangle} \ge W_{n+1}^{\langle 1,\delta_{N_n}\rangle} \ge \cdots \ge W_N^{\langle 1,\delta_{N_n}\rangle}$, we only need to show $W_i^{\langle 1,\delta_{N_n}\rangle} \ge W_{i+1}^{\langle 1,\delta_{N_n}\rangle}$ for any $i$ $(n \le i \le N-1)$.

For (5.24), we perform the following operations sequentially:

(i) dividing the $i$-th equation by $1-\mu_i$,
(ii) dividing the $(i+1)$-th equation by $1-\mu_{i+1}$,
(iii) subtracting the $i$-th equation from the $(i+1)$-th one;

we then obtain the $i$-th equation

$$\Bigl[-\frac{1}{1-\mu_i} + \beta\lambda_{i-1}\Bigr] W_i^{\langle 1,\delta_{N_n}\rangle} + \Bigl[\frac{1}{1-\mu_{i+1}} - \beta\lambda_{i-1}\Bigr] W_{i+1}^{\langle 1,\delta_{N_n}\rangle} = \frac{1}{1-\mu_{i+1}} - \frac{1}{1-\mu_i},$$

equivalently,

$$\Bigl[\frac{1}{1-\mu_i} - \beta\lambda_{i-1}\Bigr]\Bigl(W_{i+1}^{\langle 1,\delta_{N_n}\rangle} - W_i^{\langle 1,\delta_{N_n}\rangle}\Bigr) = \Bigl[\frac{1}{1-\mu_{i+1}} - \frac{1}{1-\mu_i}\Bigr]\Bigl(1 - W_{i+1}^{\langle 1,\delta_{N_n}\rangle}\Bigr) \le 0,$$

which implies $W_{i+1}^{\langle 1,\delta_{N_n}\rangle} \le W_i^{\langle 1,\delta_{N_n}\rangle}$.

Step 2. To show $W_1^{\langle 0,\delta_{N_n}\rangle} = \cdots = W_{n-1}^{\langle 0,\delta_{N_n}\rangle}$, we only need to show $W_i^{\langle 0,\delta_{N_n}\rangle} = W_{i+1}^{\langle 0,\delta_{N_n}\rangle}$ for any $i$ $(1 \le i \le n-2)$. For (5.24), we subtract the $i$-th equation from the $(i+1)$-th one and arrive at

$$-[1 - \beta\lambda_i]\, W_i^{\langle 0,\delta_{N_n}\rangle} + [1 - \beta\lambda_i]\, W_{i+1}^{\langle 0,\delta_{N_n}\rangle} = 0, \qquad (5.46)$$

which indicates $W_i^{\langle 0,\delta_{N_n}\rangle} = W_{i+1}^{\langle 0,\delta_{N_n}\rangle}$.

To show $W_1^{\langle 0,\delta_{N_n}\rangle} = \cdots = W_{n-1}^{\langle 0,\delta_{N_n}\rangle} \le W_n^{\langle 1,\delta_{N_n}\rangle}$, we have, by the $(n-1)$-th equation,

$$\Bigl[1 - \beta\sum_{i=1}^{n-1} q_{n-1,i}\Bigr] W_{n-1}^{\langle 0,\delta_{N_n}\rangle} - \beta\sum_{i=n}^{N} q_{n-1,i}\, W_i^{\langle 1,\delta_{N_n}\rangle} = 0, \qquad (5.47)$$

equivalently,

$$W_{n-1}^{\langle 0,\delta_{N_n}\rangle} = \frac{\beta\sum_{i=n}^{N} q_{n-1,i}\, W_i^{\langle 1,\delta_{N_n}\rangle}}{1 - \beta\sum_{i=1}^{n-1} q_{n-1,i}} \overset{(a)}{\le} \frac{\beta\sum_{i=n}^{N} q_{n-1,i}\, W_n^{\langle 1,\delta_{N_n}\rangle}}{1 - \beta\sum_{i=1}^{n-1} q_{n-1,i}} \overset{(b)}{\le} \frac{\sum_{i=n}^{N} q_{n-1,i}\, W_n^{\langle 1,\delta_{N_n}\rangle}}{1 - \sum_{i=1}^{n-1} q_{n-1,i}} = W_n^{\langle 1,\delta_{N_n}\rangle}, \qquad (5.48)$$

where (a) is due to $W_n^{\langle 1,\delta_{N_n}\rangle} \ge W_{n+1}^{\langle 1,\delta_{N_n}\rangle} \ge \cdots \ge W_N^{\langle 1,\delta_{N_n}\rangle}$, and (b) is because $\frac{\beta\sum_{i=n}^{N} q_{n-1,i}}{1 - \beta\sum_{i=1}^{n-1} q_{n-1,i}}$ is increasing in $\beta$ $(0 < \beta < 1)$.

Step 3. Considering the $n$-th equation of (5.24), we have

$$-\beta(1-\mu_n)\sum_{i=1}^{n-1} q_{n,i}\, W_{n-1}^{\langle 0,\delta_{N_n}\rangle} + \bigl(1 - \beta(1-\mu_n)q_{n,n}\bigr) W_n^{\langle 1,\delta_{N_n}\rangle} - \beta(1-\mu_n)\sum_{i=n+1}^{N} q_{n,i}\, W_i^{\langle 1,\delta_{N_n}\rangle} = 1, \qquad (5.49)$$

equivalently,

$$\begin{aligned} \bigl(1 - \beta(1-\mu_n)q_{n,n}\bigr) W_n^{\langle 1,\delta_{N_n}\rangle} &= 1 + \beta(1-\mu_n)\sum_{i=1}^{n-1} q_{n,i}\, W_{n-1}^{\langle 0,\delta_{N_n}\rangle} + \beta(1-\mu_n)\sum_{i=n+1}^{N} q_{n,i}\, W_i^{\langle 1,\delta_{N_n}\rangle}\\ &\overset{(a)}{\le} 1 + \beta(1-\mu_n)\sum_{i=1}^{n-1} q_{n,i}\, W_n^{\langle 1,\delta_{N_n}\rangle} + \beta(1-\mu_n)\sum_{i=n+1}^{N} q_{n,i}\, W_n^{\langle 1,\delta_{N_n}\rangle}, \qquad (5.50) \end{aligned}$$

and further,

$$W_n^{\langle 1,\delta_{N_n}\rangle} \le \frac{1}{1 - \beta(1-\mu_n)} < \frac{1}{\mu_n}, \qquad (5.51)$$

where (a) is due to $W_n^{\langle 1,\delta_{N_n}\rangle} \ge W_i^{\langle 0,\delta_{N_n}\rangle}$ for any $i$ $(1 \le i \le n-1)$ and $W_n^{\langle 1,\delta_{N_n}\rangle} \ge W_i^{\langle 1,\delta_{N_n}\rangle}$ for any $i$ $(n+1 \le i \le N)$.

Step 4. The $n$-th equation of (5.24) can be stated as follows:

$$\bigl(e_n^{\top} - \beta(1-\mu_n)\, e_n^{\top} Q\bigr)\, w_1 = 1. \qquad (5.52)$$

We first subtract $\mu_n W_n^{\langle 1,\delta_{N_n}\rangle}$ from both the LHS and the RHS of (5.52), and then divide (5.52) by $1-\mu_n$. As a consequence, (5.52) can be written as

$$\bigl(e_n^{\top} - \beta\, e_n^{\top} Q\bigr)\, w_1 = \frac{1 - \mu_n W_n^{\langle 1,\delta_{N_n}\rangle}}{1-\mu_n}. \qquad (5.53)$$

Combined with the other $N-1$ equations of (5.24), the system (5.24) can be transformed into

$$\bigl(I_N - \beta M_1\bigr)\, w_1 = \mathbf{1}_N - \mathbf{1}_N^{n-1} \iff \bigl(I_N - \beta M_0\bigr)\, w_1 = \mathbf{1}_N - \mathbf{1}_N^{n} + \frac{1 - \mu_n W_n^{\langle 1,\delta_{N_n}\rangle}}{1-\mu_n}\, e_n. \qquad (5.54)$$

Thus,

$$W_n^{\langle 1,\delta_{N_n}\rangle} = e_n^{\top}\bigl(I_N - \beta M_0\bigr)^{-1}\Bigl(\mathbf{1}_N - \mathbf{1}_N^{n} + \frac{1 - \mu_n W_n^{\langle 1,\delta_{N_n}\rangle}}{1-\mu_n}\, e_n\Bigr), \qquad (5.55)$$

which, combined with

$$W_n^{\langle 0,\delta_{N_n}\rangle} = e_n^{\top}\bigl(I_N - \beta M_0\bigr)^{-1}\bigl(\mathbf{1}_N - \mathbf{1}_N^{n}\bigr),$$

yields

$$W_n^{\langle 1,\delta_{N_n}\rangle} - W_n^{\langle 0,\delta_{N_n}\rangle} = \frac{1 - \mu_n W_n^{\langle 1,\delta_{N_n}\rangle}}{1-\mu_n}\; e_n^{\top}\bigl(I_N - \beta M_0\bigr)^{-1} e_n \overset{(a)}{>} 0, \qquad (5.56)$$

where (a) is due to $W_n^{\langle 1,\delta_{N_n}\rangle} < \frac{1}{\mu_n}$ by (5.51), i.e., $1 - \mu_n W_n^{\langle 1,\delta_{N_n}\rangle} > 0$. Note that $e_n^{\top}(I_N - \beta M_0)^{-1} e_n > 0$ because $I_N - \beta M_0$ is a diagonally dominant matrix and every element on its diagonal is larger than 0 when $0 < \beta < 1$.

To this end, we have proved $W_n^{\langle 1,\delta_{N_n}\rangle} - W_n^{\langle 0,\delta_{N_n}\rangle} > 0$. Following a similar deduction, we can easily prove $W_{n+1}^{\langle 1,\delta_{N_n}\rangle} - W_{n+1}^{\langle 0,\delta_{N_n}\rangle} > 0$. Therefore, we complete the proof.


Proof of Theorem 5.2 According to the definition of the Whittle index, we prove indexability by checking $\nu_1 < \nu_2 < \cdots < \nu_N$. When $\beta = 1$, we have $\nu_N = \frac{c\mu_N}{1-\beta} \to \infty$. First, we have

$$\begin{aligned} f(n) &= \beta q_{n,n+1}\Bigl(1 - \frac{1-\mu_n-\beta\lambda_n}{1-\mu_{n+1}-\beta\lambda_n}\Bigr) + \mu_n d_{n-1,n+1}K_{n+1}\\ &\quad + \beta\sum_{i=n+2}^{N} q_{n,i}\Bigl(1 - \prod_{j=n+1}^{i}\frac{1-\mu_{j-1}-\beta\lambda_{j-1}}{1-\mu_j-\beta\lambda_{j-1}}\Bigr) + \mu_n\sum_{i=n+2}^{N} d_{n-1,i}K_i, \end{aligned}$$

and

$$f(n+1) = \beta\sum_{i=n+2}^{N} q_{n+1,i}\Bigl(1 - \prod_{j=n+2}^{i}\frac{1-\mu_{j-1}-\beta\lambda_{j-1}}{1-\mu_j-\beta\lambda_{j-1}}\Bigr) + \mu_{n+1}\sum_{i=n+2}^{N} d_{n,i}K_i.$$

Further, expanding $f(n) - f(n+1)$, combining the terms $\beta q_{n,n+1}K_{n+1}$, $(\mu_n - \mu_{n+1})\sum_{i=n+2}^{N} d_{n-1,i}K_i$, and $\mu_n d_{n-1,n+1}K_{n+1}$ with the product terms above, and regrouping, one obtains

$$f(n) - f(n+1) \ge 0,$$

since $d_{n-1,i} \le 0$, $K_i \ge 0$, and $\mu_n < \mu_{n+1}$. Hence,

$$\nu_n = \frac{c\mu_n}{1-\beta+f(n)} \le \frac{c\mu_n}{1-\beta+f(n+1)} < \frac{c\mu_{n+1}}{1-\beta+f(n+1)} = \nu_{n+1},$$

which completes the proof.
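As a numerical sanity check of the monotonicity $\nu_1 < \cdots < \nu_N$, one convenient family with a common non-trivial eigenvalue is $Q = \lambda I + (1-\lambda)\mathbf{1}\pi^{\top}$ (our construction, not from the text): its eigenvalues are $1$ and $\lambda$ with multiplicity $N-1$, so the sketch of (5.33) given earlier applies exactly:

import numpy as np

# Q = lam*I + (1-lam)*1*pi^T is stochastic with eigenvalues {1, lam, ..., lam}.
N, lam, beta, c = 4, 0.3, 0.95, 1.0
pi = np.array([0.4, 0.3, 0.2, 0.1])            # any probability vector
Q = lam * np.eye(N) + (1 - lam) * np.outer(np.ones(N), pi)
mu = np.array([0.1, 0.2, 0.4, 0.8])            # increasing departure probabilities

nu = whittle_index(c, mu, Q, lam, beta)         # sketch defined after (5.33)
assert np.all(np.diff(nu) > 0)                  # nu_1 < nu_2 < ... < nu_N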

References

1. K. Wang, L. Chen, J. Yu, M.Z. Win, Opportunistic scheduling revisited using restless bandits: indexability and index policy, in Proc. IEEE GLOBECOM, Singapore, Dec 2017, pp. 1–6
2. R. Knopp, P. Humblet, Information capacity and power control in single-cell multiuser communications, in Proc. IEEE Int. Conf. Commun., Seattle, WA, 1995, pp. 331–335
3. P. Bender, P. Black, M. Grob, R. Padovani, N. Sindhushayana, A. Viterbi, CDMA/HDR: a bandwidth-efficient high-speed wireless data service for nomadic users. IEEE Commun. Mag. 38(7), 70–77 (2000)
4. H.J. Kushner, P.A. Whiting, Convergence of proportional-fair sharing algorithms under general conditions. IEEE Trans. Wireless Commun. 3(4), 1250–1259 (2004)
5. S. Borst, User-level performance of channel-aware scheduling algorithms in wireless data networks. IEEE/ACM Trans. Netw. 13(3), 636–647 (2005)
6. S. Aalto, P. Lassila, Flow-level stability and performance of channel-aware priority-based schedulers, in Proc. 6th EURO-NGI Conference on Next Generation Internet, Paris, France, 2010, pp. 1–8
7. T. Bonald, A score-based opportunistic scheduler for fading radio channels, in Proc. European Wireless, 2004, pp. 283–292
8. U. Ayesta, M. Erausquin, M. Jonckheere, I.M. Verloop, Scheduling in a random environment: stability and asymptotic optimality. IEEE/ACM Trans. Netw. 21(1), 258–271 (2013)
9. J. Kim, B. Kim, J. Kim, Y.H. Bae, Stability of flow-level scheduling with Markovian time-varying channels. Perform. Eval. 70(2), 148–159 (2013)
10. S. Aalto, P. Lassila, P. Osti, Whittle index approach to size-aware scheduling with time-varying channels, in Proc. ACM SIGMETRICS, Portland, OR, June 2015
11. S. Aalto, P. Lassila, P. Osti, Whittle index approach to size-aware scheduling for time-varying channels with multiple states. Queueing Syst. 83, 195–225 (2016)
12. F. Cecchi, P. Jacko, Scheduling of users with Markovian time-varying transmission rate, in Proc. ACM SIGMETRICS, Pittsburgh, PA, 2013
13. P. Jacko, Value of information in optimal flow-level scheduling of users with Markovian time-varying channels. Perform. Eval. 68(11), 1022–1036 (2011)
14. F. Cecchi, P. Jacko, Nearly-optimal scheduling of users with Markovian time-varying transmission rates. Perform. Eval. 99–100, 16–36 (2016)
15. K. Wang, Q. Liu, Q. Fan, Q. Ai, Optimally probing channel in opportunistic communication access. IEEE Commun. Lett. 22(7), 1426–1429 (2018)
16. K. Wang, L. Chen, J. Yu, Q. Fan, Y. Zhang, W. Chen, P. Zhou, Y. Zhong, On optimality of second-highest policy for opportunistic multichannel access. IEEE Trans. Veh. Technol. (2018)
17. P. Whittle, Restless bandits: activity allocation in a changing world. J. Appl. Probab. 25A, 287–298 (1988)
18. U. Ayesta, M. Erausquin, P. Jacko, A modeling framework for optimizing the flow-level scheduling with time-varying channels. Perform. Eval. 67, 1014–1029 (2010)
19. P. Jacko, E. Morozov, L. Potakhina, I.M. Verloop, Maximal flow-level stability of best-rate schedulers in heterogeneous wireless systems. Trans. Emerg. Telecommun. Technol. 28, e2930 (2015)
20. I. Taboada, F. Liberal, P. Jacko, An opportunistic and non-anticipating size-aware scheduling proposal for mean holding cost minimization in time-varying channels. Perform. Eval. 79, 90–103 (2014)
21. I. Taboada, P. Jacko, U. Ayesta, F. Liberal, Opportunistic scheduling of flows with general size distribution in wireless time-varying channels, in Proc. ITC-26, 2014
22. S. Aalto, P. Lassila, P. Osti, On the optimal trade-off between SRPT and opportunistic scheduling, in Proc. ACM SIGMETRICS, San Jose, CA, June 2011
23. T. Bonald, S. Borst, N. Hegde, M. Jonckheere, A. Proutiere, Flow-level performance and capacity of wireless networks with user mobility. Queueing Syst. 63, 131–164 (2009)
24. S. Borst, M. Jonckheere, Flow-level stability of channel-aware scheduling algorithms, in Proc. WiOpt, 2006
25. P. van de Ven, S. Borst, S. Shneer, Instability of maxweight scheduling algorithms, in Proc. IEEE Conf. on Computer Commun., 2009, pp. 1701–1709
26. C. Buyukkoc, P. Varaiya, J. Walrand, The cμ rule revisited. Adv. Appl. Probab. 17(1), 237–238 (1985)
27. J. Niño-Mora, Characterization and computation of restless bandit marginal productivity indices, in Proc. ValueTools, 2007
28. S. Sesia, I. Toufik, M. Baker, LTE – The UMTS Long Term Evolution: From Theory to Practice (Wiley, New York, NY, 2011)

Chapter 6

Conclusion and Perspective

6.1 Summary

This book addresses a special kind of restless multiarmed bandit problem arising in opportunistic scheduling under imperfect sensing or observation conditions, where each channel evolves as a discrete-time two-state Markovian chain in Chaps. 2 and 3 and as a multistate Markovian chain in Chaps. 4 and 5. Application examples demonstrate that nuances of the reward function lead to completely different optimality properties of the myopic policy. Therefore, effort was devoted to discovering the relation between the form of the reward function and the optimality of the myopic policy. A unified framework was constructed for the myopic policy in terms of the regular reward function, characterized by three basic axioms: symmetry, monotonicity, and decomposability. A wide variety of RMAB problems can be treated within this framework, including satellite networks with optical crosslinks and RF downlinks, wireless ad hoc networks, computer networks, and hybrid networks with both wireless and wireline components, provided they conform to the decomposability rule. For the homogeneous channels in Chap. 2, we established the optimality of the myopic policy when the reward function can be expressed as a regular function and when the discount factor is bounded by a closed-form threshold determined by the reward function. Furthermore, the regular function was extended to a generic function covering a much larger range of utility functions, particularly the logarithmic and power functions widely used in engineering problems. By distinguishing arms as the sensed set and the non-sensed set at each time slot, we quantified the trade-off of exploration vs. exploitation in the decision process, and then derived a sufficient condition for the optimality of the myopic policy.


In order to further characterize the asymptotically optimal performance of the RMAB over the whole parameter space, we analyzed its indexability and Whittle index by the fixed-point approach for the two-state Markovian case in Chap. 3 and for the multistate Markovian case in Chap. 5. In Chap. 3, we first derived the threshold structure of the single-arm policy. Based on this structure, the closed-form Whittle index was obtained for negatively correlated channels, while the Whittle index for positively correlated channels is much more complicated owing to its uncertainty, particularly in certain regions below the stationary distribution of the Markovian chain. This region was therefore divided into deterministic and indeterministic regions with an interleaving structure. In the deterministic regions, the evolution of the dynamic system is periodic, and there exists an eigen-matrix depicting this evolving structure, through which the closed-form Whittle index can be derived. In the indeterministic regions, no such eigen-matrix exists to depict the aperiodic structure. In practical scenarios, given a computing precision, the Whittle index in those regions can be computed by simple linear interpolation, since the deterministic and indeterministic regions appear in an interleaving form.

In Chap. 4, we considered the downlink scheduling problem of a wireless communication system consisting of N homogeneous channels and one user, in which each channel is assumed to evolve as a multistate Markov process. At each time instant, one channel is allocated to the user according to imperfect state observation information, and some reward is accrued depending on the state of the chosen channel. The objective is to design a scheduling policy that maximizes the expected accumulated discounted reward over an infinite horizon. Mathematically, the problem can be formulated as an RMAB problem, which is PSPACE-hard with exponential memory and computation complexity. One feasible approach is the myopic policy (or greedy policy), which only focuses on maximizing the immediate reward and has linear complexity. Specifically, in this chapter, we carried out a theoretical analysis of the performance of the myopic policy with imperfect observation, introduced the MLR order to characterize the evolving structure of the belief information, and established a set of closed-form conditions guaranteeing the optimality of the myopic scheduling policy in downlink channel allocation.

In Chap. 5, we revisited the opportunistic scheduling problem where a server opportunistically serves multiple classes of users under time-varying multistate Markovian channels. The aim of the server is to find an optimal policy minimizing the average waiting cost of users. Mathematically, the problem can be cast as a restless bandit problem, and a pivotal step in solving a restless bandit by an index policy is to establish indexability. Despite the theoretical and practical importance of the index policy, indexability had remained open for opportunistic scheduling in the heterogeneous multistate channel case. To fill this gap, we proposed a set of sufficient conditions on the channel state transition matrix under which indexability is guaranteed and, consequently, the index policy is feasible. We further developed a simplified procedure that reduces the complexity of computing the index from more than quadratic to linear. Our work constitutes a small step toward solving the opportunistic scheduling problem in its generic form involving multistate Markovian channels and multi-class users.


From the system perspective, the analysis presented in this book provides insight into the following design trade-offs in opportunistic scheduling, artificial intelligence, and optimization:

• Exploitation versus exploration: Due to hardware limitations and cost constraints on resource consumption, a scheduler may not be able to observe all channels in the system simultaneously. A good strategy is thus required for intelligent channel selection to track the rapidly varying system state. The purpose of such a strategy is twofold: to find good channels for immediate reward, and to gain statistical information on the system state for better opportunity tracking in the future. The optimal strategy should strike a balance between these two often conflicting objectives.

• Aggression versus conservation: Based on the imperfect observation outcomes, a scheduler needs to decide whether to act or not. An aggressive strategy may lead to excessive collisions, while a conservative one may result in performance degradation due to overlooked opportunities. Therefore, the optimal strategy should achieve a trade-off between aggression and conservation.

6.2 Open Issues and Directions for Future Research

6.2.1 RMAB with Multiple Schedulers

In this book, we mainly focus on the decision-making process and the associated trade-offs with only one decision-maker or scheduler. A natural research direction is to take the results obtained in this book as building blocks to further explore the opportunistic scheduling scenario with multiple schedulers. The key research challenge with multiple schedulers is how to coordinate them to utilize channels in a distributed fashion with little or no explicit network-level cooperation. A natural way to tackle this problem is to model the situation as a non-cooperative game among schedulers and to examine whether the results obtained in this book can be tailored to the new context.

6.2.2 RMAB with Switching Cost

Another aspect that may limit the performance of an opportunistic scheduling mechanism is the channel switching cost. In current wireless devices and networks, channel switching introduces a cost in terms of delay, packet loss, and protocol overhead. Hence, an efficient channel scheduling policy should avoid frequent channel switching unless necessary. In the context of RMAB, this problem can be mapped into the generic RMAB problem with switching costs between arms. More systematic work is called for to provide in-depth insight into this problem.

It is important to note that the generic MAB with switching cost is NP-hard, and there does not exist an optimal index policy. More specifically, the introduction of switching costs not only renders the Gittins index policy suboptimal but also makes the optimal policy computationally prohibitive. Given such difficulties, we envision tackling the problem from the following aspects:

• Looking for suboptimal policies with bounded efficiency loss compared to the optimal policy;
• Developing heuristic policies achieving a trade-off between optimality and complexity;
• Deriving optimal policies in a subset of scenarios, or designing asymptotically optimal policies.

6.2.3 RMAB with Correlated Arms

Another practical extension is to consider correlated channels, i.e., the Markov chains of different channels may be correlated. This problem can be cast into the RMAB problem with correlated arms. The introduction of correlation among arms makes the trade-off between exploration and exploitation more sophisticated, as sensing a channel not only reveals the state of the sensed channel but also provides information on the other channels, since they are no longer independent. How to characterize this trade-off in the new context and how to design efficient channel access policies are pressing research topics in this direction.

Index

A Acknowledgments (ACKs), 40 Adjoint dynamic system, 54, 55 Aggression vs. conservation, 145 Approximation policy, 80 Auxiliary value function (AVF), 86 decomposability, 19, 25–28 definition, 18 imperfect sensing, 18 monotonicity, 19, 28, 29 symmetry, 19

B Balance equations, 119 Bayesian posterior distribution, 83 Bayes Rule, 41 Belief vector, 28 β-average work, 136 Binary hypothesis test, 40 Biomedical engineering, 79

C Channel-aware schedulers, 109 Channel condition, 112 Channel condition evolution, 113 Channel state belief vector, 12, 40 Channel state transition matrix, 39, 111, 117, 129 Closed-form conditions, 11, 17, 25, 98, 111, 122–125, 131 CMI indexability, 45 Computing index, 122, 123

Continuous and monotonically increasing (CMI) function, 44 Correlated channels, 146 cμ-rule, 110, 129, 130 Crawling web content, 39 Curse of dimensionality, 79 D Decision-maker, 5 Decomposability, 16, 19 Discrete-time Markov process, 110 E Economic systems, 79 Eigenvalue-arithmetic-mean scheme, 125 Eigenvectors, 90 Exploitation vs. exploration, 145 Exploration vs. exploitation, 143 F Feedback information, 83 First-order stochastic dominance (FOSD), 81, 85, 95 5G, 109 Fixed points, 48–52 Flow-level scheduling, 110 4G LTE, 109

G Generic flow-level scheduling, 111 Generic function, 143



Gilbert-Elliot channel heterogeneous channels, 38 homogeneous, 38 imperfect sensing, 39 Gittins index, 2 Greedy policy, 80, 85, 144

Linearization scheme, 46 value function negatively correlated channels, 55–58 positively correlated channels, 59–65 LP-indexability condition, 119, 121

H Heterogeneous multistate model, 81 Heterogeneous networks (HetNet), 109 Hidden markovian model (HMM), 83

M Manufacturing systems, 79 Marginal productivity index (MPI) scheduler, 126 channel state transition matrix, 129 cμ rule, 126 different channel transition matrices, 131 different transmission rates, 131 different waiting costs, 131 parameters, 127, 129 PB rule, 127 performance of different schedulers, 127 PISS rule, 127 policies, 126, 129 RB rule, 126 SB policy, 129 SB rule, 127 stationary probability vector, 128 time-average waiting cost, 128–131 Markov chains, 80 Markov channel model accessing one channel, 80 homogeneous, 80, 82 imperfect (or indirect) observation, 81 multistate, 81 two-state, 80 Markovian channels and class-dependent, 109 discrete-time Markov process, 110 scheduling problem, 111 time-varying multistate, 109 Matrix function, 67 Matrix language, 119 Maximal transmission rate, 110 MaxRate scheduler, 110 Mobile cellular systems, 109 Monotone likelihood ratio (MLR) order, 80, 81, 85, 86, 88, 95, 96, 98 Monotonicity, 60, 64, 89 Monotonic MAB, 80 Multiarmed bandit (MAB), 80, 83, 146 clinical trial, 1 engineering applications, 1 Gittins index, 2 player and N independent arms, 1, 2 technical challenge, 4, 5

I Imperfect observation, 76 Imperfect sensing, 37–40 Independent and identically distributed (i.i.d.) channels, 110 Indexability, 37, 38, 43, 44, 76 analysis, 118–122, 136–138 challenges, 46–48 channel state transition matrix, 111 closed-form, 111 extension, 124, 125, 139, 140 heterogeneous channel, 111 policy, 111 RMAB, 110 scheduling policy, 125, 126 threshold structure, 117–118, 132–136 transition matrices, 117–118, 132–136 transition matrix approximation, 125 user-dependent parameters, 111 Whittle index, 124 Whittle index policy, 131 Index computation negatively correlated channels, 65–69 positively correlated channels, 69–73 Index policy computing, 111 scheduling algorithm, 111 Whittle (see Whittle index policy) Infinite time horizon, 43 Information states, 83, 84, 90 Information systems, 79 Inverse order, 81 J Job-channel-user bandit, 114, 115 Job sizes, 112 L Lagrange multiplier, 42 Lagrangian method, 116


149 optimization problem, 13 perfect sensing scenario, 11 POMDP, 11 regular functions, 16 reward function, 15 sensing strategy, 11 short-term reward, 10 structure, 16 symmetry, 15 wireless communication, 10 Myopic strategy, 38

N NAK, definition, 40 Negatively correlated channels index computation, 65–69 linearization of value function, 55–58 Nonlinear evolution, 38 Nonlinear operator, 46

O Open issue, 145–146 Opportunistic access optimization problem, 41–43 Opportunistic scheduling, 114 average waiting cost, 114 channel-aware schedulers, 109 flow-level scheduling, 110 indexability, 111 multi-class multistate time-varying Markovian channels, 131 multistate Markovian channels, 111 myopic policy (see Myopic policy) Pascal distribution, 111 and restless bandit formulation, 115–117 RMAB, 110 server, 109 wireless communication systems, 109 Opportunistic scheduling system imperfect information, 5 multi-state information, 5 partial information, 5 Optimality, 14, 143, 144, 146 analysis, 93–95 assumptions/conditions, 87–92 AVF, 86 downlink scheduling system, 95 extension, 96–98 myopic policy, 80 non-trivial eigenvalue, 81 properties, 88–92

Optimality (cont.) value function and properties, 86, 87, 99–106 Optimality of myopic sensing policy, imperfect sensing auxiliary function, 17 auxiliary value function, 18, 19, 25–29 belief values, 17, 18 belief vector, 17 channel access scenario, 21, 22 closed-form conditions, 17 positively correlated channels, 20, 21, 30–35 Optimal policy, 80 vs. Whittle index policy, 73 Optimum opportunistic scheduling, 116, 117

P Partially observable Markov decision process (POMDP), 9 Pascal distribution, 111 Policy information, 54 Positively correlated channels index computation, 69–73 linearization of value function, 59–65 Potential improvement (PI) scheduler, 110, 129–131 Proportionally best (PB) scheduler, 110, 129–131 Proportionally Fair (PF) scheduler, 110 PSPACE-hard, 79, 98, 144

R Randomized and non-anticipative policies, 115 Regular functions, 16 Relatively best (RB) scheduler, 110, 129–131 Research direction, 145–146 Restless bandit formulation and opportunistic scheduling, 115–117 Restless multiarmed bandit (RMAB), 38, 45, 79, 80, 98, 110, 114 arm k, 3 correlated channels, 146 Lagrangian multiplier, 4 multiple schedulers, 145 with switching cost, 145, 146 technical challenge, 4, 5 v-subsidy problem, 4 Whittle index, 4 RMAB problem imperfect observation, 11 multichannel access applications, 10 myopic policy, 10

Index optimal policy, 9 POMDP, 9 sensing error, 11 stochastic decision theory, 9 RMAB problem formulation, imperfect sensing belief vector, 13 channel state belief vector, 12 myopic sensing policy, 13–15 sensing error, 13 system model, 11, 12 user’s optimization problem, 13

S Satellite networks, 143 SB policy, 129 Scheduled channel, 90 Scheduling policy, 79, 125, 126 Scheduling rule, 45 Scored based (SB) scheduler, 110, 130, 131 Sensing error, 41 Sensor scheduling, 39 Server model, 113 Slot index, 112 Statistics, 79 Stochastic decision theory, 9 Stochastic order, 81, 86, 89, 95 Subproblem (SP), 117 Switching cost, 145, 146

T 3G cellular networks, 110 Threshold policy, 44, 45, 53 Threshold structure, 117–118, 132–136 Time-average case, 114 Transition matrices, 117–118, 132–136 Transition matrix, 86, 90, 96 Transition matrix approximation, 125 Transmission rates, 113 Transportation systems, 79 Trivial eigenvalue, 91 Trivial eigenvector, 91 Two-state Markov channel, 25 Two-state Markov process, 9

V Value information, 54

W Waiting costs, 113 Whittle index, 4, 10 definition, 139

Index Whittle index policy, 4, 9 ACKs, 40 adjoint dynamic system, 54, 55 Bayes Rule, 41 binary hypothesis test, 40 channel state belief vector, 40 channel state transition matrix, 39 CMI indexability, 45 definition, 44 in engineering applications, 39 fixed points, 48–52 Gilbert-Elliot channels (see Gilbert-Elliot channels) imperfect channel sensing, 38 and indexability, 37, 38, 43, 44 index computation (see Index computation) linearization (see Linearization) vs. myopic policy, 37, 75 myopic strategy, 38

NAK, 40 negatively correlated channels, 45 non-decreasing function, 45 nonlinear evolution, 38 notations, 39 vs. optimal policy, 73 positively correlated channels, 45 RMAB (see Restless multiarmed bandit (RMAB)) and scheduling rule, 45 sensing error, 41 stochastic properties, 37 structure, 39 threshold policy, 45, 53 work and paper, 37, 38 Wired/wireless communication systems, 79 Wireless channel, 112 Wireless communication systems, 79, 109, 112, 144