Quantum Computing: Circuits, Systems, Automation and Applications 3031379659, 9783031379659



English Pages 191 [183] Year 2023


Table of contents :
Preface
Contents
Lagrange Interpolation Approach for General Parameter-Shift Rule
1 Introduction
2 Preliminary
3 General Parameter-Shift Rule with the Lagrange Interpolation Approach
3.1 Lagrange Interpolation Approach
3.2 Two-Term Parameter-Shift Rule
3.3 Four-Term Parameter-Shift Rule
3.4 Multiple-Term Parameter-Shift Rule
4 Higher-Order Derivative
4.1 Second-Order Derivatives (Hessian)
4.2 Fubini-Study Metric Tensor
5 Numerical Simulations
5.1 Finite-Difference Approximation of Derivatives
5.2 Mean-Square Error
5.3 Numerical Simulation of the MSE
5.4 Variational Quantum Eigensolver
6 Conclusion
Appendix
Generalized Parameter-Shift Rule for VQE
References
Multi-Programming Mechanism on Near-Term Quantum Computing
1 Introduction
2 Background
2.1 Multi-Programming Mechanism
2.2 State of the Art
3 Crosstalk Analysis on IBM Quantum Computer
3.1 Crosstalk Characterization Using SRB
3.2 Crosstalk Mitigation
3.3 Evaluation of Crosstalk Injection and Crosstalk Mitigation
4 Quantum Multi-Programming Compiler (QuMC)
4.1 Overview of QuMC Framework
4.2 Parallel Manager
4.3 Hardware-Aware Multi-Programming Compiler
4.3.1 Motivational Example
4.3.2 Greedy Sub-Graph Partition Algorithm
4.3.3 Qubit Fidelity Degree-Based Heuristic Sub-Graph Partition Algorithm
4.3.4 Runtime Analysis
4.3.5 Post Qubit Partition
4.4 Scheduler: Mapping Transition Algorithm
4.5 Evaluation
4.5.1 Metrics
4.5.2 Comparison
4.5.3 Benchmarks
4.5.4 Algorithm Configurations
4.5.5 Experimental Results
4.5.6 Result Analysis
5 QuCP
5.1 QuCP Compiler
5.2 Evaluation
5.2.1 Methodology
5.2.2 Experiments Results
6 Applications
6.1 Multi-Programming and VQE
6.2 Multi-Programming and ZNE
7 Discussion
8 Conclusion
References
Side-Channel Leakage in Suzuki Stack Circuits
1 Introduction
1.1 Superconducting Electronics and Quantum Computing
1.2 Interface Circuits
1.3 Hardware Security and Side-Channel Attacks
2 Suzuki Stack Circuits
3 Threat Model
4 Side-Channel Leakage Simulation
4.1 Modeling Coaxial Cable
4.2 Simulation Results
4.3 Effect of Inductive Noise Coupling
4.4 Margin Analysis
5 Exploiting Side-Channel Leakage
6 Conclusion
References
AQuCiDe: Architecture Aware Decomposition of Quantum Circuits
1 Introduction
2 Background
2.1 Quantum Circuits
2.2 Quantum Architectures
2.3 Decomposition of Reversible Gates
3 Proposed Decomposition-Mapping Approach
3.1 Qubit Interaction Graph of Toffoli Gate
3.2 Qubit Interaction Graph of MCT Gate
3.3 Architecture-Aware Decomposition of MCT Netlist
4 Experimental Results
4.1 Effectiveness of Architecture-Aware Decomposition
4.2 Improvements in Nearest Neighbor Mapping
5 Conclusion
References
Structure-Aware Minor-Embedding for Machine Learning in Quantum Annealing Processors
1 Introduction
2 Motivation
3 Related Work
4 Contributions
5 Methodology
5.1 Training Algorithm
5.2 Positive-Phase Temperature Scaling
6 Results
6.1 Heuristic vs Systematic
6.2 Adaption Methods
7 Discussion
References
Software for Massively Parallel Quantum Computing
1 Introduction
2 Background
2.1 Quantum Brilliance Hardware
2.2 Quantum Utility
2.3 Hybrid Quantum Computing
2.3.1 Variational Quantum Eigensolver
2.3.2 Quantum Approximate Optimization Algorithm
2.3.3 Quantum Machine Learning Algorithms
2.4 Parallelism in Quantum Computing
3 The Quantum Brilliance Software Development Kit (SDK)
4 Parallelism in the QB SDK
4.1 Asynchronous QPU Offloading
4.1.1 Example
4.2 Message Passing Interface
4.2.1 Example
5 Summary
References
Machine Learning Reliability Assessment from Application to Pulse Level
1 Introduction
2 Background
2.1 Quantum Algorithms
2.2 Gate-Level Quantum Circuits
2.3 Pulse-Level Quantum Circuits
2.4 Quantum Hardware Errors
2.5 Quantum Circuit Compilation
3 ML for Quantum Circuit Reliability Assessment
3.1 Logical Gate-Level Circuit (Application) Performance Assessment
3.2 Physical Gate-Level Circuit Performance Assessment
3.2.1 Traditional ML with Black Box Quantum Hardware
3.2.2 Traditional ML with White Box Quantum Hardware
3.2.3 Traditional ML with Runtime Features
3.3 Graph Neural Network
3.4 Pulse-Level Circuit Performance Assessment
4 Qualitative Comparison
5 Quantitative Comparison
5.1 Quantum Circuits for Training
5.2 Quantum Circuits for Testing
5.3 Experimental Setup
5.4 Results
6 Summary
References
Queuing Theory Models for (Fault-Tolerant) Quantum Circuits: Analysis and Optimization
1 Introduction: Surface Code Assemblies
1.1 Footprint Analysis
1.2 Problem Statement: Quantum Circuit and Assembly Optimisation
2 Background: Queuing Systems, Networks and Models
2.1 Queuing System Models
2.1.1 Single Server with Finite and Infinite Capacity
2.1.2 Multi Server
2.2 Queuing Networks
2.2.1 Description and Model Parameters
2.2.2 Modelling Blocking Queues
3 Results: Optimisation of Topological Assemblies
3.1 Footprint Optimization of Addition Circuits
3.2 Circuit Depth Optimization of Multiplication Circuits
4 Conclusion
References
Quantum Annealing for Real-World Machine Learning Applications
1 Introduction
2 Background on Quantum Annealing
2.1 Quantum Annealing in D-Wave Systems
3 Quantum Processing Unit of D-Wave Systems
3.1 Classical Computation
3.1.1 Problem Instance
3.1.2 Programming Into QPU
3.1.3 Resample
3.2 Quantum Computation
4 Quantum Annealing for Machine Learning Classification
4.1 ML Applications Using D-Wave's Quantum Annealing
4.1.1 Image Recognition
4.1.2 Remote Sensing Imagery
4.1.3 Computational Biology/Biomedical Sciences
4.1.4 Physics
4.1.5 Security
5 Advantages of Using QA for Training ML Models
5.1 Limited Training Data
5.2 Dimension Reduction
5.3 Generalization to Test Data
5.4 Multiple Solutions
5.5 Reduced Training Time
5.6 Data Imbalance
6 Conclusions and Discussions
References
Index

Himanshu Thapliyal Travis Humble   Editors

Quantum Computing Circuits, Systems, Automation and Applications


Editors Himanshu Thapliyal University of Tennessee Knoxville, TN, USA

Travis Humble Oak Ridge National Laboratory Oak Ridge, TN, USA

ISBN 978-3-031-37965-9
ISBN 978-3-031-37966-6 (eBook)
https://doi.org/10.1007/978-3-031-37966-6

© The Editor(s) (if applicable) and The Author(s), under exclusive license to Springer Nature Switzerland AG 2024 This work is subject to copyright. All rights are solely and exclusively licensed by the Publisher, whether the whole or part of the material is concerned, specifically the rights of translation, reprinting, reuse of illustrations, recitation, broadcasting, reproduction on microfilms or in any other physical way, and transmission or information storage and retrieval, electronic adaptation, computer software, or by similar or dissimilar methodology now known or hereafter developed. The use of general descriptive names, registered names, trademarks, service marks, etc. in this publication does not imply, even in the absence of a specific statement, that such names are exempt from the relevant protective laws and regulations and therefore free for general use. The publisher, the authors, and the editors are safe to assume that the advice and information in this book are believed to be true and accurate at the date of publication. Neither the publisher nor the authors or the editors give a warranty, expressed or implied, with respect to the material contained herein or for any errors or omissions that may have been made. The publisher remains neutral with regard to jurisdictional claims in published maps and institutional affiliations. This Springer imprint is published by the registered company Springer Nature Switzerland AG The registered company address is: Gewerbestrasse 11, 6330 Cham, Switzerland Paper in this product is recyclable.

To my wife, Apeksha, for her understanding and support. —Himanshu And to Holly, for her perpetual patience and love. —Travis

Preface

This book offers readers an overview of the latest research and technological advancements in the field of quantum computing. The field has grown tremendously since the initial ideas were formulated in the 1980s, and the increasing sophistication brought by years of research and development has propelled quantum computing into a new regime. We organized an international symposium/workshop on “Quantum Computing: Circuits Systems Automation and Applications (QCCSAA)” that focused on advancements in quantum computing hardware, software, and algorithms at a time when progress is rapidly increasing. The contributions aimed at facilitating discussion on practical and novel quantum computing operations and applications through transformative research. Researchers from the symposium/workshop were then invited to contribute their work to this book.

This book delves into various design paradigms associated with quantum computing. The topics covered in this book encompass:

• The Lagrange interpolation approach applied to the general parameter-shift rule.
• Multi-programming mechanisms for near-term quantum computing.
• Architecture-aware decomposition techniques for quantum circuits.
• Software solutions designed for massively parallel quantum computing.
• The integration of machine learning into quantum annealing processors.
• Real-world applications of quantum annealing in machine learning.
• Queuing theory models tailored for (fault-tolerant) quantum circuits.
• The use of machine learning for assessing the reliability of quantum circuits.
• Examination of side-channel leakage in Suzuki stack circuits.

In summary, this book provides an in-depth exploration of a wide range of topics at the forefront of quantum computing research and technology.

Knoxville, TN, USA    Himanshu Thapliyal
Oak Ridge, TN, USA    Travis Humble

Contents

Lagrange Interpolation Approach for General Parameter-Shift Rule
Vu Tuan Hai and Le Bin Ho

Multi-Programming Mechanism on Near-Term Quantum Computing
Siyuan Niu and Aida Todri-Sanial

Side-Channel Leakage in Suzuki Stack Circuits
Yerzhan Mustafa and Selçuk Köse

AQuCiDe: Architecture Aware Decomposition of Quantum Circuits
Soumya Sengupta, Abhoy Kole, Kamalika Datta, Indranil Sengupta, and Rolf Drechsler

Structure-Aware Minor-Embedding for Machine Learning in Quantum Annealing Processors
Jose P. Pinilla and Steven J. E. Wilton

Software for Massively Parallel Quantum Computing
Thien Nguyen, Daanish Arya, Marcus Doherty, Nils Herrmann, Johannes Kuhlmann, Florian Preis, Pat Scott, and Simon Yin

Machine Learning Reliability Assessment from Application to Pulse Level
Vedika Saravanan and Samah Mohamed Saeed

Queuing Theory Models for (Fault-Tolerant) Quantum Circuits: Analysis and Optimization
Robert Basmadjian and Alexandru Paler

Quantum Annealing for Real-World Machine Learning Applications
Rajdeep Kumar Nath, Himanshu Thapliyal, and Travis S. Humble

Index

Lagrange Interpolation Approach for General Parameter-Shift Rule

Vu Tuan Hai and Le Bin Ho

V. T. Hai
University of Information Technology, Ho Chi Minh City, Vietnam
Vietnam National University, Ho Chi Minh City, Vietnam

L. B. Ho (corresponding author)
Frontier Research Institute for Interdisciplinary Sciences, Tohoku University, Sendai, Japan
Department of Applied Physics, Graduate School of Engineering, Tohoku University, Sendai, Japan

© The Author(s), under exclusive license to Springer Nature Switzerland AG 2024
H. Thapliyal, T. Humble (eds.), Quantum Computing, https://doi.org/10.1007/978-3-031-37966-6_1

1 Introduction

Quantum computing is an elegant fusion of computer science and the principles of quantum physics to facilitate calculations [1]. It promises computational capacity beyond the reach of classical computers for challenging problems in materials science, information science, computer science, mathematical science, and other fields. However, fault-tolerant quantum computers are challenging to realize due to: (i) the difficulty of accessing complete information from entangled systems because of the state collapse upon measurement, and (ii) the difficulty of building, controlling, and measuring quantum states with arbitrarily high accuracy [2]. In this regard, despite the tremendous rate of development, current state-of-the-art quantum computers contain only a small number of quantum bits (qubits) and operate at a noise level that prevents them from being practical; they are known as noisy intermediate-scale quantum (NISQ) devices [3].

Variational quantum algorithms (VQAs) are promising for increasing the computing capacity of NISQ computers [4]. Applications of VQAs have been reported extensively, ranging from dynamic simulation to condensed matter physics, machine learning, mathematical applications, and new frontiers in quantum foundations [4]. The main task of VQAs is to optimize a trainable parameterized circuit by using a hybrid quantum-classical scheme, as shown in Fig. 1. Here, one measures the cost function of interest $C(\boldsymbol{\theta})$ in the quantum part and iteratively optimizes it in the classical part until it converges. The optimization can use either gradient-free or gradient-based methods.

Fig. 1 A variational quantum algorithm consists of a quantum and a classical part. In the quantum part, the initial quantum state $|\psi\rangle$ evolves under a parameterized ansatz $U(\boldsymbol{\theta})$ before being subject to measurement. The cost function $C(\boldsymbol{\theta})$ is defined and optimized in the classical part. The new parameters $\boldsymbol{\theta}$ are derived and passed back to the quantum part at each iteration. The scheme is repeated until it converges

A critical technique for VQAs is the “parameter-shift rule” [5, 6], which computes analytic derivatives of the cost function and is often required for gradient-based optimization. The method was first introduced by Mitarai et al. [5] and then extended to a so-called two-term parameter-shift rule [6]. This approach gives the exact gradient (first-order derivative) of the cost function by subtracting two cost functions with different shifts. It is, however, only applicable to single-qubit quantum gates whose generators have two distinct eigenvalues, such as the rotation gates [7, 8]. Subsequently, Anselmetti et al. introduced a four-term parameter-shift rule that applies to generators with three distinct eigenvalues $\{-1, 0, 1\}$, such as the controlled-rotation gates [9]. In this case, the derivative is given as a linear combination of four cost functions with different shifts.

Recently, various attempts have been devoted to generalizing the parameter-shift rule to arbitrary quantum gates [10–12]. Notably, generalization strategies based on polynomial expansion were proposed [11, 12]. Apart from that, Wierichs et al. introduced a general parameter-shift rule based on the finite Fourier series of the cost function [10]. However, this method is computationally expensive because it evaluates all Fourier coefficients. Higher-order parameter-shift rules have also been derived [13]; for example, the second-order derivative reduces to the Hessian formula of the finite-difference method.

In this work, we introduce the Lagrange interpolation approach [14, 15] to derive the general parameter-shift rule. We expand a quantum gate having a generator $G$ into a polynomial $P(G)$ of degree $n - 1$, where $n$ is the number of distinct eigenvalues of $G$. Our approach is similar to the polynomial expansion of Refs. [11, 12]. However, here we provide a general procedure to derive the parameter-shift rule for arbitrary distinct eigenvalues automatically. It requires fewer evaluations to compute the derivative. We illustrate the approach for the well-known two-term and four-term parameter-shift rules, and generalize to multiple-term

Table 1 Major symbols used in this chapter

Symbol                           Type              Description
$U$                              Unitary operator  A quantum gate or a set of quantum gates
$G$                              Hermitian matrix  Hermitian generator of a quantum gate
$\boldsymbol{\theta}$            Tuple             Trainable parameters
$B$                              Matrix            Measured observable
$\mathcal{U}$                    Hermitian matrix  Super-operator $\mathcal{U}[B] = U^\dagger B U$
$|\psi\rangle$                   Ket vector        Initial quantum state
$C$                              Real              Cost function
$N$                              Integer           Number of qubits
$n$                              Integer           Number of distinct eigenvalues of $G$
$\{\lambda_k\}_{k=0}^{n-1}$      Real              Eigenvalues of $G$
$\{\Lambda_k\}$                  Complex           Coefficients of the Lagrange polynomial
$M$                              Matrix            $(n \times n)$-matrix
$F$                              Matrix            $(n \times n)$-matrix
$m$                              Integer           Number of parameter-shift coefficients
$\{\alpha_k\}_{k=1}^{m}$         Real              Arbitrary shifted values
$\{d_k\}_{k=1}^{m}$              Complex           Coefficients
$D$                              Column vector     A column vector containing all $\{d_k\}$
$\tau$                           Integer           The order of the derivative
$e_j$                            Column vector     Unit vector along the $\theta_j$ axis

parameter-shift rule. Higher-order derivatives of the cost function are also discussed. Finally, we numerically evaluate the accuracy of the method via the mean-square error and demonstrate the variational quantum eigensolver for a many-body system. Our approach can also be applied to trapped-ion quantum gates, such as collective quantum gates, whose generators have a number of distinct eigenvalues that grows linearly with the number of particles [16, 17]. For convenience, Table 1 lists the major symbols used in this work.

2 Preliminary

Let us consider a parameterized quantum gate with a general form $U(x) = e^{-i\frac{x}{2}G}$, where $G$ is the generator, and assume the cost function of interest is the expectation value of a measured observable $B$ as

$$C(x) = \langle\psi|U^\dagger(x)\, B\, U(x)|\psi\rangle, \tag{1}$$

where $|\psi\rangle$ is the initial state of the circuit. The derivative with respect to the parameter $x$ yields

$$\frac{\partial C(x)}{\partial x} = -\frac{i}{2}\,\langle\psi|U^\dagger(x)\,[B, G]\,U(x)|\psi\rangle. \tag{2}$$

For generators that obey $G^2 = I$, such as the standard rotation gates based on Pauli matrices $G = \{\sigma_x, \sigma_y, \sigma_z\}$, or $G^2 = G$, such as projective operators, we have

$$[B, G] = \frac{i}{\sin(\alpha)}\Big[U^\dagger(\alpha) B U(\alpha) - U^\dagger(-\alpha) B U(-\alpha)\Big], \tag{3}$$

where $\alpha$ is an arbitrary shift. Substituting (3) into (2) results in a two-term parameter-shift rule as [6]

$$\frac{\partial C(x)}{\partial x} = \frac{1}{2\sin(\alpha)}\Big[C(x + \alpha) - C(x - \alpha)\Big], \tag{4}$$

where the gradient is proportional to a linear combination of the cost functions with different shifts $+\alpha$ and $-\alpha$.

[Scheme: two-term parameter-shift rule]

This is the well-known parameter-shift rule used in various VQA approaches [4]. In the following, we derive the parameter-shift rule for general quantum gates by using the Lagrange interpolation.
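The two-term rule in Eq. (4) can be checked numerically on a small case. The sketch below is only an illustration under stated assumptions: a single qubit prepared in $|0\rangle$, the Pauli-X matrix as the generator $G$, and Pauli-Z as the observable $B$, so that $C(x) = \cos x$. The function names (`gate`, `cost`, `two_term_shift`) are hypothetical and not taken from the chapter or its code repository.

```python
import numpy as np

# Pauli X acts as the generator G (G @ G = I); Pauli Z is the observable B.
X = np.array([[0, 1], [1, 0]], dtype=complex)
Z = np.array([[1, 0], [0, -1]], dtype=complex)
KET0 = np.array([1, 0], dtype=complex)

def gate(x):
    """U(x) = exp(-i x G / 2); the closed form below holds because G @ G = I."""
    return np.cos(x / 2) * np.eye(2) - 1j * np.sin(x / 2) * X

def cost(x):
    """C(x) = <psi| U(x)^dagger B U(x) |psi>, Eq. (1), with |psi> = |0>."""
    phi = gate(x) @ KET0
    return float(np.real(np.vdot(phi, Z @ phi)))

def two_term_shift(x, alpha=np.pi / 2):
    """Eq. (4): [C(x + alpha) - C(x - alpha)] / (2 sin(alpha))."""
    return (cost(x + alpha) - cost(x - alpha)) / (2 * np.sin(alpha))

x0 = 0.37
exact = -np.sin(x0)                    # because C(x) = cos(x) for this circuit
grad_half_pi = two_term_shift(x0)      # common choice alpha = pi/2
grad_small = two_term_shift(x0, 0.2)   # any alpha with sin(alpha) != 0 works
print(exact, grad_half_pi, grad_small)
```

Both shifts reproduce the analytic derivative up to floating-point error, which is the sense in which the rule is exact rather than a finite-difference approximation.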

3 General Parameter-Shift Rule with the Lagrange Interpolation Approach

Lagrange interpolation expands a quantum gate as a polynomial in its generator. It thus supports deriving the parameter-shift rule for any generic generator. In this section, we derive the general parameter-shift rule using the Lagrange interpolation and apply it to the first-order derivative (gradient) of the cost function.

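3.1 Lagrange Interpolation Approach

We begin with a general quantum gate represented by $U(x) = e^{-i\frac{x}{2}G}$, where the Hermitian generator $G$ has $n$ distinct eigenvalues $\{\lambda_k\}$, i.e., $G = \sum_{k=0}^{n-1} \lambda_k |\phi_k\rangle\langle\phi_k|$ [...]

[...] $n > 3$, we have

$$e^{-i\frac{x}{2}G} = \Lambda_0 I + \Lambda_1 G + \cdots + \Lambda_{n-1} G^{n-1}, \tag{15}$$

where all $\Lambda$ terms are solvable. Explicitly, the super-operator $\mathcal{U}(\alpha)[B]$ in Eq. (7) yields

$$\mathcal{U}(\alpha)[B] = \sum_{k,l=0}^{n-1} \Lambda_k^*(\alpha)\Lambda_l(\alpha)\, G^k B G^l = \mathrm{Tr}\big[M(\alpha)\cdot F^T\big], \tag{16}$$

where the superscript $T$ denotes the transpose, and $M(\alpha)$ and $F$ are $(n \times n)$-matrices

$$M(\alpha) = \begin{pmatrix}
\Lambda_0^*(\alpha)\Lambda_0(\alpha) & \Lambda_0^*(\alpha)\Lambda_1(\alpha) & \cdots & \Lambda_0^*(\alpha)\Lambda_{n-1}(\alpha) \\
\Lambda_1^*(\alpha)\Lambda_0(\alpha) & \Lambda_1^*(\alpha)\Lambda_1(\alpha) & \cdots & \Lambda_1^*(\alpha)\Lambda_{n-1}(\alpha) \\
\vdots & \vdots & \ddots & \vdots \\
\Lambda_{n-1}^*(\alpha)\Lambda_0(\alpha) & \Lambda_{n-1}^*(\alpha)\Lambda_1(\alpha) & \cdots & \Lambda_{n-1}^*(\alpha)\Lambda_{n-1}(\alpha)
\end{pmatrix}, \tag{17}$$

$$F = \begin{pmatrix}
G^0 B G^0 & G^0 B G^1 & \cdots & G^0 B G^{n-1} \\
G^1 B G^0 & G^1 B G^1 & \cdots & G^1 B G^{n-1} \\
\vdots & \vdots & \ddots & \vdots \\
G^{n-1} B G^0 & G^{n-1} B G^1 & \cdots & G^{n-1} B G^{n-1}
\end{pmatrix}. \tag{18}$$

We next compute

$$\mathcal{U}(\alpha)[B] - \mathcal{U}(-\alpha)[B] = \mathrm{Tr}\Big[\big(M(\alpha) - M(-\alpha)\big)\cdot F^T\Big] = \mathrm{Tr}\big[\Delta M(\alpha)\cdot F^T\big]. \tag{19}$$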
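The expansion in Eq. (15) and the double sum in Eq. (16) can be illustrated numerically. The sketch below assumes that the coefficients $\Lambda_l$ are obtained by requiring the polynomial to reproduce $e^{-i x \lambda_k/2}$ on every eigenvalue, which amounts to a Vandermonde linear system; this is one standard way to realize a Lagrange interpolation and is not necessarily the derivation used in the chapter. The test generator (a random basis with eigenvalues $\{-1, 0, 1\}$) and all function names are hypothetical.

```python
import numpy as np

def lagrange_coefficients(eigvals, x):
    """Solve sum_l Lambda_l * lam_k**l = exp(-i x lam_k / 2) for every k."""
    lam = np.asarray(eigvals, dtype=float)
    V = np.vander(lam, increasing=True).astype(complex)  # V[k, l] = lam_k ** l
    return np.linalg.solve(V, np.exp(-0.5j * x * lam))

def gate_exact(G, x):
    """exp(-i x G / 2) through the eigendecomposition of the Hermitian G."""
    w, v = np.linalg.eigh(G)
    return (v * np.exp(-0.5j * x * w)) @ v.conj().T

def gate_from_polynomial(G, eigvals, x):
    """Right-hand side of Eq. (15): sum_l Lambda_l(x) G^l."""
    coeffs = lagrange_coefficients(eigvals, x)
    return sum(c * np.linalg.matrix_power(G, l) for l, c in enumerate(coeffs))

def superoperator_via_M(G, B, eigvals, alpha):
    """Eq. (16): sum_{k,l} M_kl G^k B G^l with M_kl = conj(Lambda_k) Lambda_l."""
    lam = lagrange_coefficients(eigvals, alpha)
    M = np.outer(lam.conj(), lam)
    P = [np.linalg.matrix_power(G, k) for k in range(len(lam))]
    n = len(lam)
    return sum(M[k, l] * (P[k] @ B @ P[l]) for k in range(n) for l in range(n))

# Hypothetical test data: random basis, eigenvalues {-1, 0, 1}, random Hermitian B.
rng = np.random.default_rng(0)
q, _ = np.linalg.qr(rng.normal(size=(3, 3)) + 1j * rng.normal(size=(3, 3)))
eigvals = [-1.0, 0.0, 1.0]
G = (q * np.array(eigvals)) @ q.conj().T
A = rng.normal(size=(3, 3)) + 1j * rng.normal(size=(3, 3))
B = A + A.conj().T

x = 0.8
coeffs = lagrange_coefficients(eigvals, x)
U = gate_exact(G, x)
err_gate = np.max(np.abs(gate_from_polynomial(G, eigvals, x) - U))
err_super = np.max(np.abs(superoperator_via_M(G, B, eigvals, x) - U.conj().T @ B @ U))
print(coeffs, err_gate, err_super)
```

For this spectrum the solved coefficients are $\Lambda_0 = 1$, $\Lambda_1 = -i\sin(x/2)$, and $\Lambda_2 = \cos(x/2) - 1$, and both reconstruction errors are at the level of floating-point rounding.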

Note that $\Delta M(\alpha)$ is a complex matrix with $\mathrm{diag}[\Delta M(\alpha)] = 0$ and $\big([\Delta M(\alpha)]_{ij}\big)^* = [\Delta M(\alpha)]_{ji}$. We set a column matrix $D = (d_1, d_2, \cdots, d_m)^T$ that obeys

$$\sum_{k=1}^{m} d_k\Big(\mathcal{U}(\alpha_k)[B] - \mathcal{U}(-\alpha_k)[B]\Big) = \wp\,[B, G]. \tag{20}$$

Here, $m$ depends on the number of non-vanishing elements $[\Delta M(\alpha)]_{ij}$, $\forall j > i$, and $\wp$ is a coefficient similar to that in Eq. (8):

$$\wp = \sum_{k=1}^{m} d_k \big[\Delta M(\alpha_k)\big]_{01}. \tag{21}$$

We normalize $d_k$ by $d_k/\wp$. Finally, by substituting Eq. (20) into Eq. (2), we get the multiple-term parameter-shift rule as

$$\frac{\partial C(x)}{\partial x} = -\frac{i}{2}\sum_{k=1}^{m} d_k\Big[C(x + \alpha_k) - C(x - \alpha_k)\Big]. \tag{22}$$

Here, we need to compute $m$ coefficients $d_1, \cdots, d_m$ for the shifts $\alpha_1, \cdots, \alpha_m$. See Algorithm 1 for the pseudo-code deriving all $\{d_k\}$ terms. For a symmetric set of $\{\lambda_k\}$, we obtain $m = \lfloor n^2/4 \rfloor - 1$ for $n \geq 4$. We summarize all cases of $m$ in Table 2.
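The coefficients $\{d_k\}$ in Eq. (22) can be obtained numerically for a small case. The sketch below is one possible reading of the procedure summarized in Algorithm 1, not the authors' implementation: it requires $\sum_k d_k [\Delta M(\alpha_k)]_{ij}$ to equal 1 for $(i, j) = (0, 1)$ and 0 for every other $j > i$ (so the normalization by $\wp$ is folded into the linear system), and it solves that system by least squares. The generator with eigenvalues $\{-1, 0, 1\}$, the shifts $\pi/2$ and $\pi$, and the random observable and state are illustrative assumptions.

```python
import numpy as np

def lagrange_coefficients(eigvals, x):
    """Lambda_l(x) from the Vandermonde system sum_l Lambda_l lam_k**l = exp(-i x lam_k / 2)."""
    lam = np.asarray(eigvals, dtype=float)
    V = np.vander(lam, increasing=True).astype(complex)
    return np.linalg.solve(V, np.exp(-0.5j * x * lam))

def delta_M(eigvals, alpha):
    """Delta M(alpha) = M(alpha) - M(-alpha), with M_kl = conj(Lambda_k) Lambda_l."""
    a = lagrange_coefficients(eigvals, alpha)
    b = lagrange_coefficients(eigvals, -alpha)
    return np.outer(a.conj(), a) - np.outer(b.conj(), b)

def shift_coefficients(eigvals, shifts):
    """Least-squares solve: keep only the (0, 1) element, which multiplies [B, G]."""
    n = len(eigvals)
    pairs = [(i, j) for i in range(n) for j in range(i + 1, n)]
    T = np.array([[delta_M(eigvals, a)[i, j] for a in shifts] for (i, j) in pairs])
    target = np.array([1.0 if p == (0, 1) else 0.0 for p in pairs], dtype=complex)
    d, *_ = np.linalg.lstsq(T, target, rcond=None)
    return d

eigvals = [-1.0, 0.0, 1.0]
shifts = [np.pi / 2, np.pi]
d = shift_coefficients(eigvals, shifts)

# Hypothetical test circuit: random basis for G, random Hermitian B, random state.
rng = np.random.default_rng(1)
q, _ = np.linalg.qr(rng.normal(size=(3, 3)) + 1j * rng.normal(size=(3, 3)))
G = (q * np.array(eigvals)) @ q.conj().T
A = rng.normal(size=(3, 3)) + 1j * rng.normal(size=(3, 3))
B = A + A.conj().T
psi = rng.normal(size=3) + 1j * rng.normal(size=3)
psi /= np.linalg.norm(psi)
W, V = np.linalg.eigh(G)

def evolve(x):
    """U(x)|psi> with U(x) = exp(-i x G / 2)."""
    return (V * np.exp(-0.5j * x * W)) @ V.conj().T @ psi

def cost(x):
    phi = evolve(x)
    return float(np.real(np.vdot(phi, B @ phi)))

x0 = 0.3
shift_grad = float(np.real(
    -0.5j * sum(dk * (cost(x0 + a) - cost(x0 - a)) for dk, a in zip(d, shifts))))
phi0 = evolve(x0)
exact_grad = float(np.real(-0.5j * np.vdot(phi0, (B @ G - G @ B) @ phi0)))  # Eq. (2)
print(d, shift_grad, exact_grad)
```

With these shifts the solver returns $d_1 = i$ and $d_2 = i(1 - \sqrt{2})/2$, and the four-term estimate agrees with the analytic derivative from Eq. (2) to rounding error.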

Algorithm 1 Lagrange interpolation for general parameter-shift rule
Input: Generator $G$
Output: $\{d_k\}_{k\in\{1,\cdots,m\}}$
1: Calculate eigenvalues $\{\lambda_l\}_{l\in\{0,\cdots,n-1\}}$ of the generator $G$
2: Apply the Lagrange interpolation to compute $\{\Lambda_l\}_{l\in\{0,\cdots,n-1\}}$
3: Compute $M(\alpha)$ and $\Delta M(\alpha)$
4: Count $m$ non-zero elements in $[\Delta M(\alpha)]_{ij}$, $\forall j > i$
5: $T_k \leftarrow$ non-zero elements of $\Delta M(\alpha_k)$, $k\in\{1,\cdots,m\}$
6: $T \leftarrow (T_1 \cdots T_m)$: $(m \times m)$-matrix
7: $D \leftarrow (d_1, \cdots, d_m)^T$
8: Solve equation $[T \cdot D = 0] \rightarrow \{d_k\}$
9: $\wp \leftarrow \sum_{k=1}^{m} d_k [\Delta M(\alpha_k)]_{01}$
10: Return all $\{d_k \leftarrow d_k/\wp\}$ with given $\{\alpha_k\}$

Table 2 Relationship between $n$ and $m$ for a symmetric set of $\{\lambda_k\}$

n        m                              Description
2        1                              Two-term parameter-shift rule
3        2                              Four-term parameter-shift rule
≥ 4      $\lfloor n^2/4 \rfloor - 1$    2m-term parameter-shift rule

4 Higher-Order Derivative

In this section, we compute higher-order derivatives of the cost function using the parameter-shift rule. Let a variational circuit be governed by a set of quantum gates

$$U(\boldsymbol{\theta}) = V_M U_M(\theta_M) \cdots V_1 U_1(\theta_1), \tag{23}$$

where $\boldsymbol{\theta} = (\theta_1, \cdots, \theta_M)$ is an $M$-tuple of classical parameters, and $\{V_k\}$ are arbitrary constant gates. The derivative of an arbitrary order $\tau$ is defined by

$$\partial^\tau_{\theta_1,\theta_2,\cdots,\theta_\tau} C(\boldsymbol{\theta}) = \frac{\partial^\tau C(\boldsymbol{\theta})}{\partial\theta_1 \partial\theta_2 \cdots \partial\theta_\tau}. \tag{24}$$

The first-order derivative of the cost function is given in terms of the parameter-shift rule as in Eq. (22)

$$\frac{\partial C(\boldsymbol{\theta})}{\partial\theta_j} = -\frac{i}{2}\sum_{k=1}^{m_j} d_k\Big[C(\boldsymbol{\theta} + \alpha_k e_j) - C(\boldsymbol{\theta} - \alpha_k e_j)\Big], \tag{25}$$

where $e_j$ is the unit vector along the $\theta_j$ axis. We next derive higher-order derivatives.

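4.1 Second-Order Derivatives (Hessian)

The second-order derivative gives

$$\frac{\partial^2 C(\boldsymbol{\theta})}{\partial\theta_j \partial\theta_l} = \Big(\frac{-i}{2}\Big)^2 \sum_{k=1}^{m_j} d_k \sum_{r=1}^{m_l}\Big\{ d_r^{(\alpha_k^+)}\Big[C(\alpha_k^+ + \beta_r e_l) - C(\alpha_k^+ - \beta_r e_l)\Big] - d_r^{(\alpha_k^-)}\Big[C(\alpha_k^- + \beta_r e_l) - C(\alpha_k^- - \beta_r e_l)\Big]\Big\}, \tag{26}$$

where $\alpha_k^\pm = \boldsymbol{\theta} \pm \alpha_k e_j$.

For the two-term parameter-shift rule, we have $m_j = m_l = 1$. From Eq. (9), if $\lambda = -2$ (such as Pauli generators), then $d_1 = i/\sin(\alpha_1)$ and $d_1^{(\alpha_1^\pm)} = i/\sin(\beta_1)$. The second-order derivative explicitly gives

$$\frac{\partial^2 C(\boldsymbol{\theta})}{\partial\theta_j \partial\theta_l} = \frac{1}{4\sin^2(\alpha)}\Big[C\big(\boldsymbol{\theta} + \alpha(e_j + e_l)\big) - C\big(\boldsymbol{\theta} + \alpha(e_j - e_l)\big) - C\big(\boldsymbol{\theta} - \alpha(e_j - e_l)\big) + C\big(\boldsymbol{\theta} - \alpha(e_j + e_l)\big)\Big], \tag{27}$$

where we used $\alpha_1 = \beta_1 = \alpha$.

For the four-term parameter-shift rule, we have $m_j = m_l = 2$. Choosing $\alpha_1 = \pi/2$ and $\alpha_2 = \pi$, then $d_1 = i$ and $d_2 = i(1 - \sqrt{2})/2$. The second-order derivative explicitly gives

$$\begin{aligned}
\frac{\partial^2 C(\boldsymbol{\theta})}{\partial\theta_j \partial\theta_l} = \Big(\frac{-i}{2}\Big)^2 \Big\{
& -\Big[C(\boldsymbol{\theta} + \alpha_1 e_j + \alpha_1 e_l) - C(\boldsymbol{\theta} + \alpha_1 e_j - \alpha_1 e_l)\Big] \\
& - \frac{1 - \sqrt{2}}{2}\Big[C(\boldsymbol{\theta} + \alpha_1 e_j + \alpha_2 e_l) - C(\boldsymbol{\theta} + \alpha_1 e_j - \alpha_2 e_l)\Big] \\
& + \Big[C(\boldsymbol{\theta} - \alpha_1 e_j + \alpha_1 e_l) - C(\boldsymbol{\theta} - \alpha_1 e_j - \alpha_1 e_l)\Big] \\
& + \frac{1 - \sqrt{2}}{2}\Big[C(\boldsymbol{\theta} - \alpha_1 e_j + \alpha_2 e_l) - C(\boldsymbol{\theta} - \alpha_1 e_j - \alpha_2 e_l)\Big] \\
& - \frac{1 - \sqrt{2}}{2}\Big[C(\boldsymbol{\theta} + \alpha_2 e_j + \alpha_1 e_l) - C(\boldsymbol{\theta} + \alpha_2 e_j - \alpha_1 e_l)\Big] \\
& - \frac{(1 - \sqrt{2})^2}{4}\Big[C(\boldsymbol{\theta} + \alpha_2 e_j + \alpha_2 e_l) - C(\boldsymbol{\theta} + \alpha_2 e_j - \alpha_2 e_l)\Big] \\
& + \frac{1 - \sqrt{2}}{2}\Big[C(\boldsymbol{\theta} - \alpha_2 e_j + \alpha_1 e_l) - C(\boldsymbol{\theta} - \alpha_2 e_j - \alpha_1 e_l)\Big] \\
& + \frac{(1 - \sqrt{2})^2}{4}\Big[C(\boldsymbol{\theta} - \alpha_2 e_j + \alpha_2 e_l) - C(\boldsymbol{\theta} - \alpha_2 e_j - \alpha_2 e_l)\Big]\Big\}.
\end{aligned} \tag{28}$$

Similarly, one can extend the second-order derivatives to the general multiple-term parameter-shift rule.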
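The two-term Hessian formula in Eq. (27) can be checked on a toy circuit. The sketch below assumes a single qubit prepared in $|0\rangle$, the circuit $R_y(\theta_2) R_x(\theta_1)$, and the observable $Z$, for which $C(\boldsymbol{\theta}) = \cos\theta_1 \cos\theta_2$; this circuit and the function names are illustrative choices, not taken from the chapter.

```python
import numpy as np

I2 = np.eye(2, dtype=complex)
X = np.array([[0, 1], [1, 0]], dtype=complex)
Y = np.array([[0, -1j], [1j, 0]], dtype=complex)
Z = np.array([[1, 0], [0, -1]], dtype=complex)

def rot(P, t):
    """exp(-i t P / 2) for a Pauli matrix P."""
    return np.cos(t / 2) * I2 - 1j * np.sin(t / 2) * P

def cost(theta):
    """<Z> after R_y(theta[1]) R_x(theta[0]) acts on |0>."""
    psi = rot(Y, theta[1]) @ rot(X, theta[0]) @ np.array([1, 0], dtype=complex)
    return float(np.real(np.vdot(psi, Z @ psi)))

def hessian_two_term(cost_fn, theta, alpha=np.pi / 2):
    """Eq. (27) applied to every pair (j, l) of parameters."""
    theta = np.asarray(theta, dtype=float)
    n = len(theta)
    eye = np.eye(n)
    H = np.zeros((n, n))
    for j in range(n):
        for l in range(n):
            ej, el = eye[j], eye[l]
            H[j, l] = (cost_fn(theta + alpha * (ej + el))
                       - cost_fn(theta + alpha * (ej - el))
                       - cost_fn(theta - alpha * (ej - el))
                       + cost_fn(theta - alpha * (ej + el))) / (4 * np.sin(alpha) ** 2)
    return H

theta0 = np.array([0.4, 1.1])
H_shift = hessian_two_term(cost, theta0)
c1, c2 = np.cos(theta0)
s1, s2 = np.sin(theta0)
H_exact = np.array([[-c1 * c2, s1 * s2], [s1 * s2, -c1 * c2]])
print(H_shift)
print(H_exact)
```

The diagonal entries use the same formula with $e_j = e_l$, which reduces to a shift of $\pm 2\alpha$ around the unshifted point.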

4.2 Fubini-Study Metric Tensor

We apply the above results to compute the Fubini-Study metric tensor. It is a Riemannian metric that measures the “quantum distance” (in parameter space) between two quantum states. For pure quantum states, the metric tensor is associated with the Fisher information matrix or the Bures metric tensor [18]. Mathematically, for a pure state $|\psi(\boldsymbol{\theta})\rangle = U(\boldsymbol{\theta})|\psi\rangle$, the metric is defined in terms of the second-order derivative as

$$g_{ij}(\boldsymbol{\theta}) = -\frac{1}{2}\frac{\partial^2}{\partial\theta_i \partial\theta_j}\Big|\langle\psi(\boldsymbol{\theta}')|\psi(\boldsymbol{\theta})\rangle\Big|^2\,\Bigg|_{\boldsymbol{\theta}'=\boldsymbol{\theta}} \tag{29}$$

[...] under the action of $U^\dagger(\boldsymbol{\theta}')U(\boldsymbol{\theta})$ on the initial state $|\psi\rangle$. In other words, $C(\boldsymbol{\theta})$ is the probability of finding the outcome state $|\psi\rangle$ when measuring the final state $|\psi(\boldsymbol{\theta}', \boldsymbol{\theta})\rangle = U^\dagger(\boldsymbol{\theta}')U(\boldsymbol{\theta})|\psi\rangle$:

$$C(\boldsymbol{\theta}) \equiv p = \big|\langle\psi|\psi(\boldsymbol{\theta}', \boldsymbol{\theta})\rangle\big|^2, \tag{30}$$

where $p$ is the probability of obtaining the outcome state $|\psi\rangle$. Note that in Eq. (29), we set $\boldsymbol{\theta}' = \boldsymbol{\theta}$ after taking the partial derivatives. However, in the quantum circuit using the parameter-shift rule, we first apply $U(\boldsymbol{\theta} \pm \text{shift})$ to $|\psi\rangle$ according to Eqs. (27) and (28), then apply $U^\dagger(\boldsymbol{\theta})$, and measure the final state of the circuit.

Concretely, let us consider the example shown in Fig. 2. The circuit consists of two qubits, initially prepared in the state $|\psi\rangle = |00\rangle$. The evolution operator $U(\boldsymbol{\theta})$ is parameterized by two single-qubit rotation gates $R_x(\theta_x)$ and $R_z(\theta_z)$, and a controlled-rotation gate $CR_y(\theta_y)$. Here, the parameters are given in the order $\boldsymbol{\theta} = (\theta_x, \theta_z, \theta_y)$. The state evolves to $|\psi(\boldsymbol{\theta})\rangle = U(\boldsymbol{\theta})|00\rangle$. Conventionally, we can divide $U(\boldsymbol{\theta})$ into two layers: one with the two single-qubit rotation gates and one with the controlled-rotation gate. Then, the Fubini-Study metric tensor consists of a $(2 \times 2)$-matrix and a $(1 \times 1)$-matrix, i.e., it becomes a $(3 \times 3)$-matrix [19]. Nevertheless, here we can directly apply the above method and get the same result

$$g = \begin{pmatrix} g_{xx} & g_{xz} & g_{xy} \\ g_{zx} & g_{zz} & g_{zy} \\ g_{yx} & g_{yz} & g_{yy} \end{pmatrix} = \begin{pmatrix} \frac{1}{4} & 0 & 0 \\ 0 & 0 & 0 \\ 0 & 0 & \frac{1}{4}\sin^2\big(\frac{\theta_x}{2}\big) \end{pmatrix}, \tag{31}$$

where $g_{jl}$, $\forall j, l \in \{x, z, y\}$, are given in Eq. (29) and computed from the parameter-shift rule as described in Eq. (30).

Fig. 2 An example quantum circuit for evaluating the Fubini-Study metric tensor. The initial state is $|\psi\rangle = |00\rangle$. It evolves to $|\psi(\boldsymbol{\theta})\rangle = U(\boldsymbol{\theta})|00\rangle$, where $U(\boldsymbol{\theta}) = CR_y(\theta_y) \cdot \big(R_x(\theta_x) \otimes R_z(\theta_z)\big)$, and $\boldsymbol{\theta} = (\theta_x, \theta_z, \theta_y)$
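The metric in Eq. (31) can be reproduced numerically from the fidelity-based definition. The sketch below is a cross-check under stated assumptions rather than the parameter-shift evaluation described in the text: it differentiates the fidelity $|\langle\psi(\boldsymbol{\theta})|\psi(\boldsymbol{\theta} + \boldsymbol{\delta})\rangle|^2$ with central finite differences, and it assumes that the qubit carrying $R_x$ is the control of $CR_y$ (the figure is not available to confirm this). All function names are hypothetical.

```python
import numpy as np

I2 = np.eye(2, dtype=complex)
X = np.array([[0, 1], [1, 0]], dtype=complex)
Y = np.array([[0, -1j], [1j, 0]], dtype=complex)
Z = np.array([[1, 0], [0, -1]], dtype=complex)
P0 = np.diag([1, 0]).astype(complex)   # |0><0| on the control qubit
P1 = np.diag([0, 1]).astype(complex)   # |1><1| on the control qubit

def rot(P, t):
    """exp(-i t P / 2) for a Pauli matrix P."""
    return np.cos(t / 2) * I2 - 1j * np.sin(t / 2) * P

def state(theta):
    """|psi(theta)> = CR_y(theta_y) (R_x(theta_x) x R_z(theta_z)) |00>."""
    tx, tz, ty = theta                 # parameter order (theta_x, theta_z, theta_y)
    cry = np.kron(P0, I2) + np.kron(P1, rot(Y, ty))
    psi0 = np.zeros(4, dtype=complex)
    psi0[0] = 1.0
    return cry @ np.kron(rot(X, tx), rot(Z, tz)) @ psi0

def fubini_study(theta, h=1e-3):
    """g_ij = -(1/2) d^2/(d delta_i d delta_j) |<psi(theta)|psi(theta + delta)>|^2 at delta = 0."""
    theta = np.asarray(theta, dtype=float)
    ref = state(theta)
    fid = lambda delta: abs(np.vdot(ref, state(theta + delta))) ** 2
    n = len(theta)
    eye = np.eye(n)
    g = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            ei, ej = h * eye[i], h * eye[j]
            second = (fid(ei + ej) - fid(ei - ej) - fid(-ei + ej) + fid(-ei - ej)) / (4 * h * h)
            g[i, j] = -0.5 * second
    return g

theta0 = [0.9, 0.5, 1.3]
g_num = fubini_study(theta0)
g_expected = np.diag([0.25, 0.0, 0.25 * np.sin(theta0[0] / 2) ** 2])
print(np.round(g_num, 6))
```

The finite-difference step `h` is an arbitrary choice; the result matches the closed form of Eq. (31) up to an error of order $h^2$.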

5 Numerical Simulations

In this section, we first revisit the finite-difference gradient and quantify the mean-square error (MSE) as the figure of merit for evaluating different estimators (finite-difference and parameter-shift rule estimators). We then demonstrate the numerical evaluation of the MSE. Afterward, we also examine a variational quantum circuit to find the ground state of a many-body system.

5.1 Finite-Difference Approximation of Derivatives

For a given step size $h > 0$, the finite-difference gradient of a function $f(\boldsymbol{\theta})$ gives

$$\frac{\partial f(\boldsymbol{\theta})}{\partial\theta_j} = \frac{f(\boldsymbol{\theta} + h e_j) - f(\boldsymbol{\theta} - h e_j)}{2h}. \tag{32}$$

The second-order derivative is given via the Hessian formula [20]

$$\frac{\partial^2 f(\boldsymbol{\theta})}{\partial\theta_j \partial\theta_l} = \frac{1}{4h^2}\Big[f\big(\boldsymbol{\theta} + h(e_j + e_l)\big) - f\big(\boldsymbol{\theta} + h(e_j - e_l)\big) - f\big(\boldsymbol{\theta} - h(e_j - e_l)\big) + f\big(\boldsymbol{\theta} - h(e_j + e_l)\big)\Big]. \tag{33}$$

These are approximate methods that give high accuracy when $h \to 0$.

5.2 Mean-Square Error

Following Ref. [13], we consider the mean-square error (MSE) as the figure of merit to evaluate the accuracy of different estimators. The MSE of an estimator $\hat{\partial}^\tau_{\theta_1,\theta_2,\cdots,\theta_\tau}$ is given as

$$\Delta\big(\hat{\partial}^\tau_{\theta_1,\theta_2,\cdots,\theta_\tau}\big) = E\Big[\big(\hat{\partial}^\tau_{\theta_1,\theta_2,\cdots,\theta_\tau} - \partial^\tau_{\theta_1,\theta_2,\cdots,\theta_\tau}\big)^2\Big], \tag{34}$$

where $\partial^\tau_{\theta_1,\theta_2,\cdots,\theta_\tau}$ is the analytical derivative in Eq. (24), and the estimator $\hat{\partial}^\tau_{\theta_1,\theta_2,\cdots,\theta_\tau}$ is given by either the parameter-shift rule or the finite-difference approximation.

5.3 Numerical Simulation of the MSE

We investigate the precision of the two estimators based on the parameter-shift rule and the finite-difference approximation for the first-order derivatives. Let us consider the example circuit shown in Fig. 2 and measure the expectation value

$$f(\boldsymbol{\theta}) = \frac{1}{2}\Big[1 + \cos(\theta_x) + \big(\cos(\theta_x) - 1\big)\cos(\theta_y)\Big]. \tag{35}$$

Analytically, we have

$$\nabla f(\boldsymbol{\theta}) = \begin{pmatrix} \partial_{\theta_x} f \\ \partial_{\theta_y} f \\ \partial_{\theta_z} f \end{pmatrix} = \begin{pmatrix} -\cos^2\big(\frac{\theta_y}{2}\big)\sin(\theta_x) \\ \sin^2\big(\frac{\theta_x}{2}\big)\sin(\theta_y) \\ 0 \end{pmatrix}. \tag{36}$$

For the finite-difference estimator, we compute the partial derivatives $\{\hat{\partial}_{\theta_j} f(\boldsymbol{\theta})\}$ by using Eq. (32) and investigate the MSE (34) as a function of the step size $h$. For the parameter-shift estimator, $\hat{\partial}_{\theta_x} f(\boldsymbol{\theta})$ and $\hat{\partial}_{\theta_z} f(\boldsymbol{\theta})$ are given by the two-term rule (9), while $\hat{\partial}_{\theta_y} f(\boldsymbol{\theta})$ is given by the four-term rule (12). We choose the shift $\alpha$ of the two-term rule to be the same as the step size $h$ and fix the four-term coefficients as above. The simulation runs in Qiskit's Aer simulator. Each expectation value is estimated from $10^3$ shots, and $10^3$ repetitions are used to determine the average MSE.

The results are shown in the main panel of Fig. 3. The MSE for the parameter-shift estimator gradually decreases and saturates at the optimal value as the step size increases. Similarly, the finite-difference curve decreases, then matches the parameter-shift curve, and finally deviates from the expected behavior because the Taylor expansion is no longer valid for large step sizes [13]. For small step sizes, the two estimators do not coincide because their $\hat{\partial}_{\theta_y} f(\boldsymbol{\theta})$ are different, i.e., one varies with $h$ and the other is fixed. Details are shown in the inset of Fig. 3.
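The qualitative behavior of the two estimators can be illustrated with a lightweight shot-noise model. The sketch below is not the Qiskit Aer simulation described above: it evaluates Eq. (35) in closed form, assumes a measured observable with outcomes $\pm 1$ whose mean is $f(\boldsymbol{\theta})$, samples each expectation value with a binomial draw, and compares only the $\theta_x$ component against Eq. (36). The parameter values, the seed, and the function names are illustrative assumptions.

```python
import numpy as np

def f(tx, ty):
    """Expectation value of Eq. (35)."""
    return 0.5 * (1 + np.cos(tx) + (np.cos(tx) - 1) * np.cos(ty))

def dfdx(tx, ty):
    """Analytic derivative with respect to theta_x, first row of Eq. (36)."""
    return -np.cos(ty / 2) ** 2 * np.sin(tx)

def sampled(tx, ty, shots, reps, rng):
    """Shot-noise model: each shot returns +1 or -1 with mean f (an assumption)."""
    p_plus = 0.5 * (1 + f(tx, ty))
    counts = rng.binomial(shots, p_plus, size=reps)
    return 2 * counts / shots - 1

def mse(tx, ty, h, kind, shots=1000, reps=1000, seed=0):
    """MSE of Eq. (34) for the theta_x derivative, averaged over `reps` repetitions."""
    rng = np.random.default_rng(seed)
    diff = sampled(tx + h, ty, shots, reps, rng) - sampled(tx - h, ty, shots, reps, rng)
    # Finite difference, Eq. (32), divides by 2h; the two-term rule, Eq. (4), by 2 sin(h).
    denom = 2 * h if kind == "finite-difference" else 2 * np.sin(h)
    return float(np.mean((diff / denom - dfdx(tx, ty)) ** 2))

tx, ty = 0.7, 0.4
for h in (0.1, 0.5, np.pi / 2):
    print(f"h = {h:.3f}",
          "parameter-shift:", mse(tx, ty, h, "parameter-shift"),
          "finite-difference:", mse(tx, ty, h, "finite-difference"))
```

In this model the two estimators behave alike for small $h$, where shot noise is amplified by the small denominator, while at $h = \pi/2$ the parameter-shift estimate stays unbiased and the finite-difference estimate acquires a systematic error.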

5.4 Variational Quantum Eigensolver In this subsection, we demonstrate our approach in the variational quantum eigensolver (VQE) to find the ground state of a given system. We consider the Lipkin-Meshkov-Glick (LMG) model [21] consisting of N spin-1/2 particles with infinite-range interaction and exposing under a magnetic field along the z axis, as shown in Fig. 4a. The interaction Hamiltonian is given by

14

V. T. Hai and L. B. Ho

Fig. 3 Log-log plot of the mean-square error (MSE) versus the step size h for two estimators parameter-shift and finite-difference. The shaded areas show the standard deviation, and the solid lines represent the average MSE over .103 repetitions. For each MSE, we perform .103 shots to compute the expectation value. Insets: the plot of MSE for partial differences < .∂θx f (θ) and < .∂θy f (θ)

Fig. 4 (a) The LMG model with N spin-1/2 particles that obeys the infinite-range interaction and is placed under the magnetic field along the z axis. (b) The learning circuit used in VQE consists of .(RX − RZ − RX)×L gates. The expectation value . is measured and be the cost function. (c) The cost function versus iterations for .γ = 0, −0.05, −0.1 and the theoretical bounds are .{−0.10583, −0.269744, −0.509973}, respectively. (d) The minimum energy for .γ from .−0.1 to .0.1. The solid curve is the exact result from theoretical analysis by diagonalizing the Hamiltonian (37), and the dotted curve is obtained from the generalize parameter-shift rule

Lagrange Interpolation Approach

15

H = −2γ Jz −

.

) 2λ ( 2 Jx − Jy2 , N

(37)

where $\gamma$ is an effective magnetic field, $\lambda$ is the spin-spin exchange coupling, and $J_k$ $(k = x, y, z)$ is a collective angular momentum

$$J_k = \frac{1}{2}\sum_{l=1}^{N} I \otimes \cdots \otimes \sigma_k^{(l)} \otimes \cdots \otimes I, \tag{38}$$

where $\sigma_k^{(l)}$ is a Pauli matrix at the site $l$. The purpose is to find the ground state of the Hamiltonian (37) by using the VQE and compare it with the theoretical result. The learning circuit is shown in Fig. 4b, with the initial state $\rho = q_1 \otimes q_2 \otimes \cdots \otimes q_N$. The training ansatz $U(\theta)$ reads

$$U(\theta) = \big(RX(\theta_3)\,RZ(\theta_2)\,RX(\theta_1)\big)^{\times L}, \tag{39}$$

with several layers $L$. Here, RX and RZ are collective rotation gates [17] that are sufficient to generate any quantum state in the Bloch sphere, and $\theta = (\theta_1, \theta_2, \theta_3, \cdots, \theta_{3L})$ are training parameters. The quantum state evolves to $\rho(\theta) = U(\theta)\rho U^\dagger(\theta)$ under the action of $U(\theta)$. Finally, we measure the expectation value of the Hamiltonian and define the cost function as

$$C(\theta) = \langle H \rangle = \mathrm{Tr}\big[H \cdot \rho(\theta)\big], \tag{40}$$

whose minimum value is the lowest energy, and the corresponding state $\rho(\theta)$ becomes the ground state. The simulation is executed in the tqix code [17]. Here we fix $N = 5$, $L = 5$ layers, and $\lambda = 0.05$. We also add random noise to every quantum gate. We train the model for 30 iterations with the standard gradient descent (SGD) optimizer

$$\theta^{(t+1)} = \theta^{(t)} - \eta \nabla_\theta C(\theta), \tag{41}$$

where the learning rate is $\eta = 0.4$, and $\nabla_\theta C(\theta)$ is calculated via the generalized parameter-shift rule (see details in the Appendix). Figure 4c displays the cost function versus the number of iterations, where it moves toward the theoretical bound after a certain number of iterations. The minimum energy given by the VQE and a comparison with the theoretical result are shown in Fig. 4d. The plots are given for different $\gamma$ from $-0.1$ to $+0.1$, showing an excellent match between the two approaches.
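The theoretical bounds quoted in Fig. 4c can be cross-checked by diagonalizing the Hamiltonian (37) directly. The following is a minimal sketch and not the tqix implementation: it assumes that the ground state lies in the symmetric sector $J = N/2$ (Dicke basis, dimension $N + 1$), uses $J_x^2 - J_y^2 = (J_+^2 + J_-^2)/2$, and finds the lowest eigenvalue with a shifted power iteration written with the standard library only. The function names are hypothetical.

```python
import math


def lmg_hamiltonian(num_spins, gamma, lam):
    # H = -2*gamma*Jz - (2*lam/N)*(Jx^2 - Jy^2), restricted to the J = N/2 sector (assumption).
    j = num_spins / 2.0
    dim = num_spins + 1
    h = [[0.0] * dim for _ in range(dim)]
    for k in range(dim):
        m = -j + k
        h[k][k] = -2.0 * gamma * m
        if k + 2 < dim:
            # <m+2| J_+^2 |m> from two applications of the raising operator.
            ladder = math.sqrt((j - m) * (j + m + 1)) * math.sqrt((j - m - 1) * (j + m + 2))
            # Jx^2 - Jy^2 = (J_+^2 + J_-^2) / 2 couples m and m + 2 symmetrically.
            coupling = -(2.0 * lam / num_spins) * 0.5 * ladder
            h[k][k + 2] = coupling
            h[k + 2][k] = coupling
    return h


def ground_energy(h, iterations=5000):
    # Power iteration on (shift*I - H); the Gershgorin shift makes the ground state dominant.
    dim = len(h)
    shift = max(sum(abs(x) for x in row) for row in h)
    vec = [1.0 / math.sqrt(dim)] * dim
    for _ in range(iterations):
        new = [shift * vec[i] - sum(h[i][k] * vec[k] for k in range(dim)) for i in range(dim)]
        norm = math.sqrt(sum(x * x for x in new))
        vec = [x / norm for x in new]
    h_vec = [sum(h[i][k] * vec[k] for k in range(dim)) for i in range(dim)]
    # Rayleigh quotient of the converged vector.
    return sum(vec[i] * h_vec[i] for i in range(dim))


if __name__ == "__main__":
    for gamma in (0.0, -0.05, -0.1):
        energy = ground_energy(lmg_hamiltonian(num_spins=5, gamma=gamma, lam=0.05))
        print(f"gamma={gamma:+.2f}  ground energy={energy:.6f}")
```

For $\gamma = 0$ the symmetric-sector matrix splits into two three-level chains with lowest eigenvalue $-0.01\sqrt{112} \approx -0.10583$, which agrees with the first bound listed in the caption of Fig. 4.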


6 Conclusion

We introduced the Lagrange interpolation approach for the general parameter-shift rule. This method is based on the interpolation of any given quantum gate into a polynomial form of its generator. We thus derived the general multiple-term parameter-shift rule and the higher-order derivatives. We provided numerical benchmarking via the mean-square error and further applied the method to the variational quantum eigensolver. Our approach can be applied to various collective rotation gates in spin ensemble-based quantum computers. The code is available at https://github.com/vutuanhai237/LagrangeGPSR.

Appendix

Generalized Parameter-Shift Rule for VQE

We compute the generalized parameter-shift rule for the two quantum gates RX and RZ in the circuit of Fig. 4b. They read

$$RX(x) = e^{-ixJ_x}, \quad \text{and} \quad RZ(x) = e^{-ixJ_z}. \tag{42}$$

Since both $J_x$ and $J_z$ have the same eigenvalues, hereafter we derive the rule for RZ, while RX can be treated in the same way. The eigenvalues of $J_z$ for $N = 5$ are $\lambda = \{-5/2, -3/2, -1/2, 1/2, 3/2, 5/2\}$. Hence, $n = 6$ and $m = \lfloor 2^n/4 \rfloor - 1 = 15$. Applying Algorithm 1 with random $\{\alpha_k\}_{k=1}^{15} \in [0, 2\pi]$, we find the corresponding $\{d_k\}$. The derivative associated with RZ is given from Eq. (22) as

$$\frac{\partial}{\partial x} RZ(x) = -i \sum_{k=1}^{15} d_k \big[ RZ(x + \alpha_k) - RZ(x - \alpha_k) \big]. \tag{43}$$

Note that here we replace $-i/2$ from Eq. (22) by $-i$ because RZ contains the parameter $x$ instead of $x/2$. Proceeding similarly for RX, we finally obtain $\partial_\theta \langle H \rangle$.

References

1. J. D. Hidary, Quantum Computing: An Applied Approach, Springer Cham, 2019.
2. Y. Alexeev, D. Bacon, K. R. Brown, R. Calderbank, L. D. Carr, F. T. Chong, B. DeMarco, D. Englund, E. Farhi, B. Fefferman, A. V. Gorshkov, A. Houck, J. Kim, S. Kimmel, M. Lange, S. Lloyd, M. D. Lukin, D. Maslov, P. Maunz, C. Monroe, J. Preskill, M. Roetteler, M. J. Savage, J. Thompson, Quantum computer systems for scientific discovery, PRX Quantum 2 (2021) 017001. doi:10.1103/PRXQuantum.2.017001. URL https://link.aps.org/doi/10.1103/PRXQuantum.2.017001
3. J. Preskill, Quantum Computing in the NISQ era and beyond, Quantum 2 (2018) 79. doi:10.22331/q-2018-08-06-79. URL https://doi.org/10.22331/q-2018-08-06-79
4. M. Cerezo, A. Arrasmith, R. Babbush, S. C. Benjamin, S. Endo, K. Fujii, J. R. McClean, K. Mitarai, X. Yuan, L. Cincio, P. J. Coles, Variational quantum algorithms, Nature Reviews Physics 3 (9) (2021) 625–644. doi:10.1038/s42254-021-00348-9. URL https://doi.org/10.1038/s42254-021-00348-9
5. K. Mitarai, M. Negoro, M. Kitagawa, K. Fujii, Quantum circuit learning, Phys. Rev. A 98 (2018) 032309. doi:10.1103/PhysRevA.98.032309. URL https://link.aps.org/doi/10.1103/PhysRevA.98.032309
6. M. Schuld, V. Bergholm, C. Gogolin, J. Izaac, N. Killoran, Evaluating analytic gradients on quantum hardware, Phys. Rev. A 99 (2019) 032331. doi:10.1103/PhysRevA.99.032331. URL https://link.aps.org/doi/10.1103/PhysRevA.99.032331
7. C. P. Williams, Explorations in Quantum Computing, Springer London, 2011.
8. M. A. Nielsen, I. L. Chuang, Quantum Computation and Quantum Information, Cambridge, 2010.
9. G.-L. R. Anselmetti, D. Wierichs, C. Gogolin, R. M. Parrish, Local, expressive, quantum-number-preserving VQE ansätze for fermionic systems, New Journal of Physics 23 (11) (2021) 113010. doi:10.1088/1367-2630/ac2cb3. URL https://doi.org/10.1088/1367-2630/ac2cb3
10. D. Wierichs, J. Izaac, C. Wang, C. Y.-Y. Lin, General parameter-shift rules for quantum gradients, Quantum 6 (2022) 677. doi:10.22331/q-2022-03-30-677. URL https://doi.org/10.22331/q-2022-03-30-677
11. O. Kyriienko, V. E. Elfving, Generalized quantum circuit differentiation rules, Phys. Rev. A 104 (2021) 052417. doi:10.1103/PhysRevA.104.052417. URL https://link.aps.org/doi/10.1103/PhysRevA.104.052417
12. A. F. Izmaylov, R. A. Lang, T.-C. Yen, Analytic gradients in variational quantum algorithms: Algebraic extensions of the parameter-shift rule to general unitary transformations, Phys. Rev. A 104 (2021) 062443. doi:10.1103/PhysRevA.104.062443. URL https://link.aps.org/doi/10.1103/PhysRevA.104.062443
13. A. Mari, T. R. Bromley, N. Killoran, Estimating the gradient and higher-order derivatives on quantum hardware, Phys. Rev. A 103 (2021) 012405. doi:10.1103/PhysRevA.103.012405. URL https://link.aps.org/doi/10.1103/PhysRevA.103.012405
14. C. Moler, C. Van Loan, Nineteen dubious ways to compute the exponential of a matrix, SIAM Review 20 (4) (1978) 801–836. doi:10.1137/1020098. URL https://doi.org/10.1137/1020098
15. L. B. Ho, N. Imoto, Full characterization of modular values for finite-dimensional systems, Physics Letters A 380 (25) (2016) 2129–2135. doi:10.1016/j.physleta.2016.05.005. URL https://www.sciencedirect.com/science/article/pii/S0375960116301773
16. S. Debnath, N. M. Linke, C. Figgatt, K. A. Landsman, K. Wright, C. Monroe, Demonstration of a small programmable quantum computer with atomic qubits, Nature 536 (7614) (2016) 63–66. doi:10.1038/nature18648. URL https://doi.org/10.1038/nature18648
17. N. T. Viet, N. T. Chuong, V. T. N. Huyen, L. B. Ho, tqix.pis: A toolbox for quantum dynamics simulation of spin ensembles in Dicke basis, Computer Physics Communications 286 (2023) 108686. doi:10.1016/j.cpc.2023.108686. URL https://www.sciencedirect.com/science/article/pii/S0010465523000310
18. J. Liu, H. Yuan, X.-M. Lu, X. Wang, Quantum Fisher information matrix and multiparameter estimation, Journal of Physics A: Mathematical and Theoretical 53 (2) (2019) 023001. doi:10.1088/1751-8121/ab5d4d. URL https://doi.org/10.1088/1751-8121/ab5d4d
19. V. T. Hai, L. B. Ho, Universal compilation for quantum state tomography, Scientific Reports 13 (1) (2023) 3750. doi:10.1038/s41598-023-30983-4. URL https://doi.org/10.1038/s41598-023-30983-4
20. M. Abramowitz, I. A. Stegun, Handbook of Mathematical Functions: with Formulas, Graphs, and Mathematical Tables, Courier Corporation, 1965.
21. H. Lipkin, N. Meshkov, A. Glick, Validity of many-body approximation methods for a solvable model: (I). Exact solutions and perturbation theory, Nuclear Physics 62 (2) (1965) 188–198. doi:10.1016/0029-5582(65)90862-X. URL https://www.sciencedirect.com/science/article/pii/002955826590862X

Multi-Programming Mechanism on Near-Term Quantum Computing

Siyuan Niu and Aida Todri-Sanial

S. Niu (✉), LIRMM, University of Montpellier, Montpellier, France
A. Todri-Sanial, Eindhoven University of Technology, Eindhoven, Netherlands
© The Author(s), under exclusive license to Springer Nature Switzerland AG 2024. H. Thapliyal, T. Humble (eds.), Quantum Computing, https://doi.org/10.1007/978-3-031-37966-6_2

1 Introduction

In recent years, quantum technologies have been continuously improving, and IBM has released the largest quantum chip with 433 qubits. However, current quantum devices still qualify as Noisy Intermediate-Scale Quantum (NISQ) hardware [1], with several physical constraints. For example, for superconducting devices, which we target in this chapter, connections are only allowed between two neighbouring qubits. Besides, the gate operations of NISQ devices are noisy and have unavoidable error rates. As we do not have enough qubits to realize Quantum Error Correction [2], only small circuits with limited depths can obtain reliable results when executed on quantum hardware, which leads to a waste of hardware resources.

With the growing demand to access quantum hardware, several companies such as IBM, Rigetti, and IonQ provide cloud quantum computing systems enabling users to execute their jobs on a quantum machine remotely. However, cloud quantum computing systems have a limitation: latency exists between job submission and execution. Since a large number of jobs are generally pending on the quantum device, users need to spend a long time waiting in the queue to execute their jobs. For example, it takes several days to get the result if we submit a circuit to IBM public quantum chips.

The low hardware usage and long waiting time lead to a timely issue: how do we use quantum hardware more efficiently while maintaining the circuit output fidelity? With the growing number of hardware qubits and the improvement of qubit error rates, the multi-programming mechanism, also referred to as parallel circuit execution, was introduced by Das et al. [3] and Liu and Dou [4] to address this issue. The utilization (usage/throughput) of NISQ hardware can be enhanced by executing several circuits at the same time. However, results show that the activity of one circuit can negatively impact the fidelity of others when executing several circuits simultaneously, due to the difficulty of allocating reliable regions to each circuit, the higher chance of crosstalk error, etc. Previous works [3, 4] have left these issues largely unexplored and have not addressed the problem holistically, such that the circuit fidelity reduction cannot be ignored for the multi-programming mechanism.

Crosstalk is a non-negligible error source in NISQ hardware. It can corrupt the qubit state when multiple quantum operations are executed in parallel. A detrimental crosstalk impact on parallel instructions has been reported in [5–7] by using Simultaneous Randomized Benchmarking (SRB) [8]. In the presence of crosstalk, the gate error can be increased by an order of magnitude. Ash-Saki et al. [6] proposed a fault-attack model using crosstalk in a multi-programming environment. Moreover, the impact of crosstalk is considered in the multi-programming framework [9].

Multi-programming, if done in an ad hoc way, would be detrimental to fidelity, but if done carefully, it can be a very powerful technique to enable parallel execution for important quantum algorithms such as Variational Quantum Algorithms (VQAs) [10]. For example, the multi-programming mechanism can enable the reliable execution of several ansatz states in parallel on one quantum processor, such as in the Variational Quantum Eigensolver (VQE) [11, 12], the Variational Quantum Linear Solver (VQLS) [13], or the Variational Quantum Classifier (VQC) [14]. It is also general enough to be applied to other quantum circuits regardless of applications or algorithms. More importantly, it can build the bridge between NISQ devices and large-scale fault-tolerant devices.

In this work, we evaluate the influence of crosstalk on NISQ devices and then present two quantum multi-programming compilers, taking the impact of hardware topology, calibration data, and crosstalk into consideration. We also investigate how the multi-programming mechanism can be useful for NISQ computing. Our major contributions can be listed as follows:

• We perform a crosstalk injection experiment on IBM quantum devices to analyze its impact on circuit fidelity and evaluate crosstalk mitigation methods to demonstrate the fidelity improvement.
• We propose a Quantum Multi-programming Compiler (QuMC), including two different qubit partition algorithms with consideration of crosstalk, and an improved mapping transition algorithm.
• We improve our QuMC algorithm and introduce a Quantum Crosstalk-aware Parallel Circuit execution method (QuCP) without the overhead of crosstalk characterization.
• We apply the multi-programming mechanism to the variational quantum eigensolver (VQE) and zero noise extrapolation (ZNE) to demonstrate its applications to a NISQ algorithm and an error mitigation technique.


The rest of the chapter is organized as follows. First, we introduce the background of the multi-programming mechanism and its state of the art in Sect. 2. Second, we analyze the crosstalk error in NISQ computing, including characterization and error mitigation, in Sect. 3. Third, we present our proposed QuMC algorithm in Sect. 4. Fourth, we explain our QuCP method in Sect. 5. Fifth, we demonstrate the applications of multi-programming to NISQ computing in Sect. 6. Finally, we discuss future work and conclude the chapter in Sects. 7 and 8.

2 Background

2.1 Multi-Programming Mechanism

The idea of the multi-programming mechanism is quite simple: executing several quantum circuits in parallel on the same quantum hardware. An example is shown in Fig. 1. Note that the simultaneous circuits can always be scheduled using the As Late As Possible (ALAP) method, allowing qubits to remain in the ground state as long as possible to avoid additional decoherence error caused by circuits with different depths. Since the waiting time is usually much longer than the circuit execution time, the difference between the execution times of circuits with different depths can be ignored. By executing two circuits at the same time, the hardware throughput doubles and the total runtime (waiting time + execution time) is halved.

Fig. 1 An example of the multi-programming mechanism. (a) A four-qubit circuit is executed on a 10-qubit device. The hardware throughput is 40%. (b) Two four-qubit circuits (QC1 and QC2) are executed on the same device in parallel. The hardware throughput becomes 80%.

It is not trivial to design a multi-programming compiler. Even though it is possible to simply combine several programs into one large circuit and compile it directly, it has been shown in [4] that the circuit fidelity decreases significantly due to the unfair allocation of partitions, unawareness of increased crosstalk, inflexibility of reverting to independent executions in the case of a serious fidelity drop, etc. The main concern is how to trade off between the circuit output fidelity and the hardware throughput (which also indicates the reduction of total runtime). A new compilation technique for the multi-programming mechanism is required. Several problems need to be addressed to enable the multi-programming mechanism: (1) Find an appropriate number of circuits to be executed simultaneously such that the hardware throughput is improved without losing fidelity. (2) Allocate reliable partitions of the hardware to all the simultaneous circuits so that they execute with high fidelity. (3) Transform multiple circuits to make them executable on the hardware. (4) Reduce the interference between simultaneous circuit executions to lower the impact of crosstalk.
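The throughput and runtime figures used in this subsection follow from simple arithmetic, sketched below. This is only an illustration under stated assumptions: throughput is taken as the fraction of device qubits occupied by the circuits, every batch of jobs is assumed to pay the same queue waiting time once, and the function names are hypothetical.

```python
import math


def hardware_throughput(circuit_sizes, device_qubits):
    # Fraction of the device's qubits occupied by the simultaneously executed circuits.
    used = sum(circuit_sizes)
    if used > device_qubits:
        raise ValueError("circuits do not fit on the device")
    return used / device_qubits


def total_runtime(num_jobs, batch_size, waiting_time, execution_time):
    # Simplified model (assumption): each batch waits in the queue once, then executes.
    batches = math.ceil(num_jobs / batch_size)
    return batches * (waiting_time + execution_time)


if __name__ == "__main__":
    print(hardware_throughput([4], device_qubits=10))
    print(hardware_throughput([4, 4], device_qubits=10))
    print(total_runtime(num_jobs=2, batch_size=1, waiting_time=100.0, execution_time=1.0))
    print(total_runtime(num_jobs=2, batch_size=2, waiting_time=100.0, execution_time=1.0))
```

With the values of Fig. 1, one four-qubit circuit on a 10-qubit device gives 40% and two give 80%; running two jobs in one batch halves the modeled total runtime.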

2.2 State of the Art

The multi-programming mechanism was first proposed in [3] by developing a Fair and Reliable Partitioning (FRP) method. Liu et al. improved this mechanism and introduced QuCloud [4]. These two works have some limitations: (1) Hardware topology and calibration data are not fully analyzed, such that allocation is sometimes done on unreliable or sparsely connected partitions, ignoring the robust qubits and links. (2) These works use only SWAP gates for the mapping transition process, and the modified circuits always have a large number of additional gates. (3) Crosstalk is not considered when allocating partitions to the circuits. For example, the X-SWAP scheme [4] can only be performed when circuits are allocated to neighbouring partitions, which is the case with more crosstalk. Ohkura et al. designed palloq [9], a crosstalk detection protocol that reveals the crosstalk impact on multi-programming. The concept of multi-programming was also explored in quantum annealers on D-Wave systems to solve several QUBO instances in parallel [15].

3 Crosstalk Analysis on IBM Quantum Computer

Crosstalk is one of the major noise sources in near-term quantum devices. There are two types of crosstalk. The first one is quantum crosstalk, which is caused by the always-on ZZ interaction [16, 17]. The second one is classical crosstalk, caused by the incorrect control of the qubits. The calibration data provided by IBM do not include the crosstalk error. Different protocols were proposed in [8, 18, 19] to detect and characterize crosstalk in quantum devices. After assessing crosstalk, there are two types of methods to mitigate it. The first one is based on hardware strategies, such as tunable coupling [16] or frequency allocation [20]. The second one takes the software perspective, where crosstalk can be mitigated by making simultaneous CNOT operations with high crosstalk execute sequentially while trading off the increase of the decoherence error [5].

In this section, we first introduce how to characterize the crosstalk properties of a quantum device using the Simultaneous Randomized Benchmarking (SRB) protocol [8]. Second, we present various state-of-the-art crosstalk mitigation methods. Third, we inject crosstalk on IBM quantum devices to show its impact on output fidelity. Finally, we evaluate the crosstalk mitigation method using several benchmarks to demonstrate the fidelity improvement.

3.1 Crosstalk Characterization Using SRB

To characterize the crosstalk effect of a quantum device, we choose the most commonly used protocol, Simultaneous Randomized Benchmarking (SRB) [8], based on its ability to quantify the impact of parallel instructions, i.e., the crosstalk effect of one quantum gate on another when they are executed at the same time. Suppose that we need to characterize the crosstalk effect between gates $g_i$ and $g_j$. We first perform Randomized Benchmarking (RB) on both gates separately and obtain their independent error rates $E(g_i)$ and $E(g_j)$. Then, applying SRB on both gates yields the correlated error rates $E(g_i|g_j)$ and $E(g_j|g_i)$ for simultaneous executions. If crosstalk exists between the two gates, the relation between independent and correlated errors should satisfy $E(g_i|g_j) > E(g_i)$ or $E(g_j|g_i) > E(g_j)$. Previous work [5] has demonstrated that crosstalk is significant for simultaneous CNOT executions; hence, we only focus on characterizing the crosstalk effect between CNOT pairs. We use the ratio of correlated error to independent error, $r(g_i|g_j) = E(g_i|g_j)/E(g_i)$, as the indicator of the crosstalk effect on the CNOT pairs.

We choose IBM Q 7 Casablanca as an example to show its crosstalk properties. As CNOT errors vary across calibrations, we characterize the crosstalk effect twice on each possible simultaneous CNOT pair of IBM Q Casablanca. The average crosstalk data are presented in Fig. 2. The crosstalk effect remains stable regardless of the variation of gate errors across days. For each experiment, there are 150 Cliffords in the sequence, and each sequence is repeated five times due to the expensive runtime cost. The crosstalk effect is not as severe in this device as in other devices such as IBM Q 20 Poughkeepsie, where the gate error grows by up to 11 times due to crosstalk [5]. However, the error rate is still amplified by up to 3 times in IBM Q Casablanca. Sometimes the ratios are less than one, which is probably due to the limits of the crosstalk metric in the SRB protocol [21].

Fig. 2 The SRB result of IBM Q 7 Casablanca. The number represents the ratio of correlated error to independent error. Note that the SRB experiment is performed on CNOTs that can be executed in parallel, which means they do not share the same qubits. There are no SRB experiments on blank spots due to shared qubits
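The crosstalk indicator $r(g_i|g_j) = E(g_i|g_j)/E(g_i)$ can be computed from RB and SRB error rates as sketched below. The error values, the dictionary layout, and the function names are hypothetical and only illustrate the bookkeeping; they are not measurements from IBM Q Casablanca.

```python
def crosstalk_ratio(independent_error, correlated_error):
    # r(g_i | g_j) = E(g_i | g_j) / E(g_i)
    return correlated_error / independent_error


def high_crosstalk_pairs(independent, correlated, threshold):
    # independent: {gate: E(gate)} from RB; correlated: {(g_i, g_j): E(g_i | g_j)} from SRB.
    flagged = {}
    for (gate_i, gate_j), error in correlated.items():
        ratio = crosstalk_ratio(independent[gate_i], error)
        if ratio > threshold:
            flagged[(gate_i, gate_j)] = ratio
    return flagged


if __name__ == "__main__":
    # Hypothetical CNOT error rates, keyed by (control, target) qubit pairs.
    independent = {(0, 1): 0.010, (2, 3): 0.012, (4, 5): 0.008}
    correlated = {
        ((0, 1), (2, 3)): 0.031,
        ((2, 3), (0, 1)): 0.014,
        ((4, 5), (0, 1)): 0.007,
    }
    print(high_crosstalk_pairs(independent, correlated, threshold=2.0))
```

A ratio above one indicates that the gate error grows when the neighbouring gate runs at the same time; the threshold above which a pair is treated as high-crosstalk is a user choice.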

3.2 Crosstalk Mitigation

After characterizing the crosstalk effect, the next challenge is how to mitigate it. The hardware-oriented crosstalk mitigation strategies have several physical constraints. For example, tunable coupling [16] is only applicable to tunable-coupling superconducting devices, whereas frequency allocation [20] is mainly designed for fixed-frequency transmon qubit devices. Besides hardware-oriented approaches to reduce crosstalk, there are also software methods to address it. Here, we report on hardware-agnostic software crosstalk mitigation approaches. An intuitive approach for software crosstalk mitigation is to make simultaneous CNOTs execute serially to avoid crosstalk occurring on parallel operations. However, serial instructions can increase the circuit depth and introduce decoherence errors. Murali et al. [5] proposed a crosstalk-adaptive scheduler (labeled as XtalkSched) that inserts barriers between simultaneous CNOTs with a significant crosstalk effect to avoid parallel instructions while considering the decoherence error.
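The intuitive serialization idea can be sketched as a greedy pass over one layer of simultaneous CNOTs. This is explicitly not XtalkSched, which formulates scheduling as an optimization problem and weighs crosstalk against decoherence; the sketch below ignores decoherence entirely, and the data layout and names are hypothetical.

```python
def serialize_layer(layer, conflicts):
    # layer: CNOTs (as qubit pairs) that would otherwise run simultaneously.
    # conflicts: set of frozensets {cnot_a, cnot_b} flagged as high-crosstalk pairs.
    sublayers = []
    for gate in layer:
        for sublayer in sublayers:
            # Place the gate in the first sublayer where it conflicts with nothing.
            if all(frozenset((gate, other)) not in conflicts for other in sublayer):
                sublayer.append(gate)
                break
        else:
            # No compatible sublayer exists, so the gate runs in a new time step.
            sublayers.append([gate])
    return sublayers


if __name__ == "__main__":
    layer = [(0, 1), (2, 3), (5, 6)]
    conflicts = {frozenset({(0, 1), (2, 3)})}
    print(serialize_layer(layer, conflicts))
```

Each extra sublayer corresponds to a barrier and therefore to additional circuit depth, which is the cost that a crosstalk-adaptive scheduler has to trade against the crosstalk reduction.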

3.3 Evaluation of Crosstalk Injection and Crosstalk Mitigation

Benchmarks. First, we use a 3-qubit CSWAP gate circuit and inject a different number of simultaneous CNOTs to show the impact of crosstalk on the output fidelity of a quantum circuit. An example demonstrating crosstalk injection by enabling two simultaneous CNOTs is shown in Fig. 3. Then, we examine benchmarks collected from the previous works [5, 22], including SWAP circuits, Bell state circuit, etc., to show the performance of XtalkSched. For each benchmark, we set its mapping carefully to ensure it includes at least one pair of CNOTs with a high crosstalk effect.

Fig. 3 Crosstalk injection of the CSWAP circuit. We introduce two other qubits ($q_3$, $q_4$) and apply CNOT operations on them. Barriers are inserted to make sure the CNOTs are executed simultaneously. Note that crosstalk injection is performed on CNOT pairs with strong crosstalk error

Fig. 4 The output state probability distribution of the CSWAP gate circuit. The right output state should be "101". We inject a different number of simultaneous CNOTs into this circuit. The baseline circuit has zero simultaneous CNOTs

Algorithm Configuration. For crosstalk characterization of IBM Q Casablanca, we set the crosstalk threshold to 2, and the weight w used in XtalkSched is set to 0.5 to trade off circuit depth and crosstalk mitigation. The Qiskit version is 0.23.6.

Comparison. We compare XtalkSched [5] with ParSched, which is the default scheduler used in Qiskit [23] to make instructions execute in parallel without considering crosstalk.

The experimental results of crosstalk injection are shown in Fig. 4. The probability of obtaining the right output state is reduced with the increase of simultaneous CNOTs due to the injected crosstalk error. In the worst case, the fidelity is decreased by 33.3% compared to the non-crosstalk case. The comparison between XtalkSched and ParSched in terms of output fidelity and circuit depth is shown in Fig. 5. The fidelity is improved by 9.4%, whereas the circuit depth is increased by 39.1%. Although XtalkSched enhances the circuit output fidelity, the circuit depth increase is not negligible. Moreover, since XtalkSched is based on an SMT solver, it is not scalable to large quantum circuits. This raises the need for more effective and practical crosstalk mitigation methods.

Fig. 5 Comparison of XtalkSched and ParSched on the benchmarks. (a) Fidelity. (b) Circuit depth. Note that (1,4) represents the SWAP circuit which aims to realize a connection between $q_0$ and $q_4$ through SWAP operations

4 Quantum Multi-Programming Compiler (QuMC)

In this section, we present our proposed QuMC to realize a multi-programming mechanism that takes the hardware topology, calibration data, and crosstalk into account. Due to the limitations of the gate-level software crosstalk mitigation method (XtalkSched) as explained in Sect. 3.2, we mitigate crosstalk at the partition level.

4.1 Overview of QuMC Framework

Our proposed QuMC workflow is schematically shown in Fig. 6, which includes the following steps:

• Input layer. It contains a list of small quantum circuits written in the OpenQASM language [24], and the quantum hardware information, including the hardware topology, calibration data, and crosstalk effect.
• Parallelism manager. It determines whether to execute circuits concurrently or separately. If simultaneous execution is allowed, it further decides the number of circuits to be executed on the hardware at the same time without losing fidelity, based on the fidelity metric included in the hardware-aware multi-programming compiler.
• Hardware-aware multi-programming compiler. Qubits are partitioned into several reliable regions and allocated to different quantum circuits using qubit partition algorithms. Then, the partition fidelity is evaluated by the post qubit partition process. We introduce a fidelity metric here, which helps to decide whether this number of circuits can be executed simultaneously or the number needs to be reduced.
• Scheduler. The mapping transition algorithm is applied and circuits are transpiled to be executable on quantum hardware.
• Output layer. Output circuits are executed on the quantum hardware simultaneously or independently according to the previous steps, and the experimental results are obtained.

Fig. 6 Overview of our proposed QuMC framework. The input layer includes the quantum hardware information and multiple quantum circuit workloads. The parallelism manager decides whether to execute circuits simultaneously or independently. For simultaneous executions, it works with the hardware-aware multi-programming compiler to select an optimal number of shared workloads to be executed in parallel. These circuits are allocated to reliable partitions and then passed to the scheduler. It makes all the circuits executable on the quantum hardware and we can obtain the results of the output circuits

In this chapter, we only focus on the IBM quantum architecture. Our QuMC method can be generally adapted to quantum hardware with nearest-neighbor connectivity that also allows parallel operations when applied to different qubits. The source code of our QuMC method is publicly available in the GitHub repository https://github.com/peachnuts/Multiprogramming.

4.2 Parallel Manager

In order to determine the optimal number of circuits that can be executed on the hardware in parallel without losing fidelity, we introduce the parallelism manager, shown in Fig. 7a. Suppose we have a list of $n$ circuit workloads, the $i$-th of which has $n_i$ qubits, that are expected to be executed on $N$-qubit hardware. We define the circuit density metric as the number of CNOTs divided by the qubit number of the circuit, $\#\mathrm{CNOTs}/n_i$, and a circuit with higher density is considered to be more subject to errors. Firstly, the circuits are ordered by their density metric. Note that users can also customize the order of circuits if certain circuits are preferred to have higher fidelities. Then, we pick $K$ circuits as the maximum number of circuits that can be executed on the hardware at the same time, $\sum_{i=1}^{K} n_i \le N$. If $K$ is equal to one, then all the circuits should be executed independently. Otherwise, these circuits are passed to the hardware-aware multi-programming compiler. It works together with the parallelism manager to decide an optimal number of simultaneous circuits to be executed.

Fig. 7 Process flow of each block that constitutes our QuMC approach. (a) The parallelism manager selects K circuits according to their densities and passes them to the hardware-aware multi-programming compiler. (b) The qubit partition algorithms allocate reliable regions to multiple circuits. $\Delta S$ is the difference between partition scores when partitioning independently and simultaneously, which is the fidelity metric. $\delta$ is the threshold set by the user. The fidelity metric helps to select the optimal number of simultaneous circuits to be executed. (c) The scheduler performs the mapping transition algorithm and makes quantum circuits executable on real quantum hardware
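The selection step of the parallelism manager can be sketched as follows. This is an illustration rather than the QuMC source code: it assumes that circuits are sorted by decreasing density and that the first K circuits in that order are taken while their total qubit count fits on the device. The `Workload` class, its fields, and the example workloads are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Workload:
    name: str
    num_qubits: int
    num_cnots: int

    @property
    def density(self):
        # Circuit density metric: number of CNOTs divided by the number of qubits.
        return self.num_cnots / self.num_qubits


def select_workloads(workloads, device_qubits):
    # Assumption: denser circuits come first; keep the longest prefix with sum(n_i) <= N.
    ordered = sorted(workloads, key=lambda w: w.density, reverse=True)
    selected = []
    used = 0
    for workload in ordered:
        if used + workload.num_qubits > device_qubits:
            break
        selected.append(workload)
        used += workload.num_qubits
    return selected


if __name__ == "__main__":
    jobs = [
        Workload("c", num_qubits=5, num_cnots=5),
        Workload("a", num_qubits=4, num_cnots=12),
        Workload("b", num_qubits=3, num_cnots=6),
    ]
    chosen = select_workloads(jobs, device_qubits=10)
    print([w.name for w in chosen], "K =", len(chosen))
```

If the returned list contains a single circuit (K = 1), the workloads are executed independently; otherwise the selected circuits would be handed to the partitioning stage, which may still reduce K based on the fidelity metric.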

4.3 Hardware-Aware Multi-Programming Compiler

The hardware-aware multi-programming compiler contains two steps: (1) Perform the qubit partitioning algorithm to allocate reliable partitions to multiple circuits. (2) Compute the fidelity metric during the post qubit partition process and work with the parallelism manager to determine the number of simultaneous circuits. We develop two qubit partition algorithms by accounting for the crosstalk, hardware topology, and calibration data. In this section, we first introduce a motivational example for qubit partition. Then, we present two qubit partition algorithms, one greedy and one heuristic. Finally, we explain the post qubit partition process.

4.3.1 Motivational Example

We consider two constraints when executing multiple circuits concurrently. First, each circuit should be allocated to a partition containing reliable physical qubits. Allocated physical qubits (qubits used in hardware) can not be shared among

Multi-Programming Mechanism on Near-Term Quantum Computing P1

P3 6

17

1.9

0

0.7

29

1

1.6

4

1.4

1.2

7

1.0

10

0.8

0.8

12

1.1

15

1.9

18

1.0

21

2.4

2

24

1.1 1.4

5

0.7

8

0.8

11

0.9

14

23

2.3

13

1.2

3

1.5

0.7 0.7

16

1.0

0.7

19

0.9

22

1.1

25

1.0

26

1.1

9

20

(a) P2

P3 6

17

1.9

0

0.7

1

1.6

4

1.4

1.2

7

1.0

10

0.8

0.8

12

1.1

15

1.9

18

1.0

21

2.4

2

24

1.1 1.4

5

0.7

8

0.8

11

0.9

14

1.0

23

2.3

13

1.2

3

1.5

0.7 0.7

16

0.7

19

0.9

22

1.1

25

1.0

26

1.1

9

20

(b) Fig. 8 A motivational example of qubit partition problem. (a) No crosstalk between partition P1 and partition P2. (b) Crosstalk exists between partition P2 and partition P3

quantum circuits. Second, qubits can be moved only inside their own circuit partition during the routing process; in other words, qubits can be swapped only within the same partition. Note that, in this work, we perform routing inside the reliable partition, but other approaches can be applied as well, such as routing through neighboring qubits that lie outside the reliable partition.

Finding reliable partitions for multiple circuits is an important step in the multi-programming problem. We choose IBM Q 27 Toronto to demonstrate an example. The CNOT error rate of each link is shown in Fig. 8, where the unreliable links with high CNOT error rates and the qubits with high readout error rates are highlighted in red. To illustrate the impact of partitions with different operational errors (CNOT error and readout error) on the output fidelity, we first execute a small circuit, alu-v0_27 (see Table 2 for details), independently on three different partitions: (1) partition P1 with reliable qubits and links; (2) partition P2 with unreliable links; (3) partition P3 with unreliable links and qubits with a high readout error rate. Second, we execute two copies of the same circuit simultaneously to show the crosstalk effect: (1) on P1 and P3 without crosstalk (Fig. 8a); (2) on P2 and P3 with crosstalk (Fig. 8b). For the sake of fairness, all partitions have the same topology. It is important to note that with different topologies the circuit output fidelity would also differ, since the number of additional gates is strongly related to the hardware topology.

The result of the motivational example is shown in Fig. 9. The fidelity is calculated using the PST metric explained in Sect. 4.5.1 (higher is better). For independent execution, we have P1 > P2 > P3 in terms of fidelity, which shows the influence of operational error on output fidelity. For simultaneous execution without crosstalk, the circuit fidelities on the two partitions P1 and P3 are approximately the same as for independent execution. With crosstalk, in contrast, the fidelities decrease by 36.8% and 23.1% for P2 and P3, respectively, when the two circuits are executed simultaneously.

Fig. 9 Results of the motivational example. (a) No crosstalk corresponds to Fig. 8a where no crosstalk exists between P1 and P3. (b) Crosstalk corresponds to Fig. 8b where crosstalk exists between P2 and P3. Note that “only P1” means the fidelity of the circuit when it is executed independently on P1, whereas “P1|P3” means the fidelity of the circuit on P1 when two circuits are executed on P1 and P3 simultaneously

We validate the crosstalk impact on the motivational example by performing SRB on the target quantum device. However, performing SRB requires a large overhead of circuit executions. Hence, we characterize the crosstalk properties of IBM Q 27 Toronto following the optimization methods presented in [5]. On IBM quantum devices, the crosstalk effect is significant only at one-hop distance between CNOT pairs [5], such as $(CX_{0,1}|CX_{2,3})$ shown in Fig. 10a, when the control pulse of one qubit propagates an unwanted drive to nearby qubits that have similar resonant frequencies. Therefore, we perform SRB only on CNOT pairs that are separated by a one-hop distance. For pairs whose distance is greater than one hop, the crosstalk effects are very weak and we ignore them. This allows us to parallelize the SRB experiments of multiple CNOT pairs when they are separated by two or more hops. For example, in IBM Q 27 Toronto, the pairs $(CX_{0,1}|CX_{4,7})$, $(CX_{12,15}|CX_{17,18})$, and $(CX_{5,8}|CX_{11,14})$ can be characterized in parallel. As demonstrated in previous works [5, 6, 25], although the absolute gate errors vary every day, the pairs that have a strong crosstalk effect remain the same across days.
We confirm that observation by performing the crosstalk characterization on IBM Q 27 Toronto twice and obtaining similar behavior. Crosstalk characterization is expensive and time-consuming. Some pairs show no crosstalk effect, whereas others are strongly influenced by crosstalk. Therefore, we extract the pairs with a significant crosstalk effect, i.e., $E(g_i|g_j) > 3 \times E(g_i)$, and characterize only these pairs when crosstalk properties are needed. We choose the same factor of 3 as [5] to identify the pairs with strong crosstalk error. The result of the crosstalk characterization on IBM Q 27 Toronto is shown in Fig. 10b. Since the pairs $CX_{7,10}$ (included in P2) and $CX_{12,15}$ (included in P3) have a significant crosstalk impact on each other, we confirm that the fidelity loss of the simultaneous execution in the motivational example is caused by additional crosstalk.

Fig. 10 Characterization of crosstalk effect. (a) Crosstalk pairs separated by one-hop distance. The crosstalk pairs must be executable at the same time, so they cannot share a qubit; one hop is therefore the minimum distance between crosstalk pairs. (b) Crosstalk effect results of IBM Q 27 Toronto using SRB. The arrow of the red dashed line points to the CNOT pair that is affected significantly by the crosstalk effect, e.g., $CX_{7,10}$ and $CX_{12,15}$ affect each other when they are executed simultaneously. In our experiments, $E(CX_{10,12}|CX_{4,7}) > 3 \times E(CX_{10,12})$, whereas $E(CX_{4,7}|CX_{10,12}) \approx 1.5 \times E(CX_{4,7})$. As we choose 3 as the factor to pick up pairs with a strong crosstalk effect, there is no arrow at pair $CX_{4,7}$
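The selection rule $E(g_i|g_j) > 3 \times E(g_i)$ can be sketched in a few lines of Python. This is only an illustration: the function name and the error rates below are hypothetical and are not measured SRB data, although the two ratios are chosen to mirror the ones quoted in the caption of Fig. 10 (more than 3× in one direction, about 1.5× in the other).

```python
def strong_crosstalk_pairs(independent_error, conditional_error, factor=3.0):
    """Return ordered pairs (gi, gj) for which E(gi|gj) > factor * E(gi)."""
    strong = []
    for (gi, gj), e_cond in conditional_error.items():
        if e_cond > factor * independent_error[gi]:
            strong.append((gi, gj))
    return strong


# Hypothetical error rates (fractions, not percent).
independent_error = {"CX10,12": 0.010, "CX4,7": 0.012}
conditional_error = {
    ("CX10,12", "CX4,7"): 0.035,  # 3.5x the independent error: kept
    ("CX4,7", "CX10,12"): 0.018,  # 1.5x the independent error: dropped
}

pairs = strong_crosstalk_pairs(independent_error, conditional_error)
print(pairs)
```

Because the criterion is directional, a pair can be flagged in one direction only, which is why Fig. 10b shows no arrow at $CX_{4,7}$.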

4.3.2 Greedy Sub-Graph Partition Algorithm

We develop a Greedy Sub-graph Partition (GSP) algorithm for the qubit partition process, which is able to provide the optimal partitions for different quantum circuits. The first step of the GSP algorithm is to traverse the overall hardware to find all the possible partitions for a given circuit. For example, for a five-qubit circuit, we find all the subgraphs of the hardware topology (also called the coupling graph) containing five qubits as the partition candidates. Each candidate has a score that represents its fidelity and depends on the topology and calibration data. The partition with the best fidelity is selected, and all the qubits inside the partition are marked as used so that they cannot be assigned to other circuits. For the next circuit, a subgraph with the required number of qubits is assigned and we check whether it overlaps with the partitions of previous circuits. If not, the subgraph is a partition candidate for the given circuit, and the same process is applied to each subsequent circuit. To account for crosstalk, we check whether any pairs in a subgraph have a strong crosstalk effect caused by the allocated partitions of other circuits. If so, the score of the subgraph is adjusted to take the crosstalk error into account.

Algorithm 1: GSP algorithm
input : Quantum circuit QC, Coupling graph G, Calibration data C, Crosstalk properties crosstalk_props, Used_qubits q_used
output: A list of candidate partitions sub_graph_list
begin
    qubit_num ← QC.qubit_num;
    Set sub_graph_list to empty list;
    for sub_graph ∈ combinations(G, qubit_num) do
        if sub_graph is connected then
            if q_used is empty then
                sub_graph.Set_Partition_Score(G, C, QC);
                sub_graph_list.append(sub_graph);
            else if no qubit in sub_graph is in q_used then
                crosstalk_pairs ← Find_Crosstalk_pairs(sub_graph, crosstalk_props, q_used);
                sub_graph.Set_Partition_Score(G, C, QC, crosstalk_pairs);
                sub_graph_list.append(sub_graph);
            end
        end
    end
    return sub_graph_list;
end

To evaluate the reliability of a partition, three factors need to be considered: the partition topology, the error rates of the two-qubit links, and the readout error of each qubit. One-qubit gates are ignored for simplicity and because of their relatively low error rates compared to the other quantum operations. If a qubit pair in a partition is strongly affected by crosstalk from other partitions, the CNOT error of this pair is replaced by the correlated CNOT error, which takes crosstalk into account. Note that the most recent calibration data should be retrieved through the IBM Quantum Experience before each usage to ensure that the algorithm has access to the most accurate and up-to-date information. To evaluate the partition topology, we determine the longest shortest path (also called the graph diameter) of the partition, denoted L. The smaller the longest shortest path, the better the partition is connected. Consequently, fewer additional gates are needed to connect two qubits in a well-connected partition. We devise a fidelity score metric for a partition that is the sum of the graph diameter L, the average CNOT error rate of the links times the number of CNOTs of the circuit, and the sum of the readout error rates of the qubits in the partition (shown in (1)). Note that the CNOT error rate includes the crosstalk effect if it exists.

\[
\mathrm{Score}_g = L + \mathrm{Avg}_{\mathrm{CNOT}} \times \#\mathrm{CNOTs} + \sum_{Q_i \in P} R_{Q_i} \tag{1}
\]

The graph diameter L is always prioritized in this equation, since it is more than one order of magnitude larger than the other two factors. The partition with the smallest fidelity score is selected; it is expected to have the best connectivity and the lowest error rate. Moreover, the partition algorithm prioritizes quantum circuits with a large density, because the input circuits are ordered by their densities during the parallelism manager process. The partition algorithm is then called for each circuit in order. However, the GSP algorithm is expensive and time-consuming. For small circuits, it gives the best choice of partition, and it also serves as a useful baseline for comparison with other partition algorithms. Beyond the NISQ era, a better approach should be explored to overcome the complexity overhead.
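As a rough illustration of the GSP idea, the following Python sketch enumerates every connected k-qubit subgraph of a small coupling graph and ranks the candidates by the score in (1). It is a minimal sketch under stated assumptions, not the authors' implementation: the function names, the data layout and the five-qubit line device with its error rates are all hypothetical, error rates are given as fractions, and the crosstalk adjustment is omitted.

```python
from collections import deque
from itertools import combinations


def diameter_if_connected(nodes, adjacency):
    """Return the diameter of the induced subgraph, or None if it is disconnected."""
    nodes = set(nodes)
    diameter = 0
    for source in nodes:
        dist = {source: 0}
        queue = deque([source])
        while queue:  # breadth-first search restricted to `nodes`
            u = queue.popleft()
            for v in adjacency[u]:
                if v in nodes and v not in dist:
                    dist[v] = dist[u] + 1
                    queue.append(v)
        if len(dist) != len(nodes):
            return None
        diameter = max(diameter, max(dist.values()))
    return diameter


def gsp_candidates(adjacency, cnot_error, readout_error, k, n_cnots, used=frozenset()):
    """Score every connected k-qubit subgraph that avoids `used`; lowest score first."""
    free = sorted(q for q in adjacency if q not in used)
    candidates = []
    for subset in combinations(free, k):
        diameter = diameter_if_connected(subset, adjacency)
        if diameter is None:
            continue
        links = [pair for pair in cnot_error if set(pair) <= set(subset)]
        avg_cnot = sum(cnot_error[p] for p in links) / len(links) if links else 0.0
        score = diameter + avg_cnot * n_cnots + sum(readout_error[q] for q in subset)
        candidates.append((score, subset))
    return sorted(candidates)


# Hypothetical five-qubit line device 0-1-2-3-4.
adjacency = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2, 4], 4: [3]}
cnot_error = {(0, 1): 0.01, (1, 2): 0.02, (2, 3): 0.01, (3, 4): 0.03}
readout_error = {0: 0.02, 1: 0.03, 2: 0.02, 3: 0.01, 4: 0.05}

candidates = gsp_candidates(adjacency, cnot_error, readout_error, k=3, n_cnots=10)
best_score, best_partition = candidates[0]
print(best_partition, round(best_score, 2))
```

Passing the qubits of an already allocated partition as `used` restricts the search to the remaining qubits, which corresponds to the overlap check described above. The enumeration over all k-subsets is what makes this approach exponential in k.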

4.3.3 Qubit Fidelity Degree-Based Heuristic Sub-Graph Partition Algorithm

In order to reduce the overhead of GSP, we propose a Qubit fidelity degree-based Heuristic Sub-graph Partition algorithm (QHSP). It performs as well as GSP but without the large runtime overhead. In QHSP, when allocating partitions, we favor qubits with high fidelity. We define the fidelity degree of a qubit based on the CNOT and readout fidelities of this qubit as in (2).

\[
F\_Degree_{Q_i} = \sum_{Q_j \in N(Q_i)} \lambda \times (1 - E(Q_i, Q_j)) + (1 - R_{Q_i}) \tag{2}
\]

Here, $Q_j \in N(Q_i)$ are the neighbour qubits connected to $Q_i$, E is the CNOT error matrix, which is constructed by applying the Floyd-Warshall algorithm to the hardware coupling graph with the CNOT error rates as edge weights, and R is the readout error rate. $\lambda$ is a user-defined parameter that weights the CNOT error rate against the readout error rate. Such a parameter is useful for two reasons: (1) Typically, in a quantum circuit, the number of CNOT operations is different from the number of measurement operations. Hence, the user can choose $\lambda$ based on the relative number of operations. (2) For some qubits, the readout error rate is one or more orders of magnitude larger than the CNOT error rate. Thus, it is reasonable to add a weight parameter. The fidelity degree metric reveals two aspects of a qubit. The first is the connectivity of the qubit: the more neighbours a qubit has, the larger its fidelity degree. The second is the reliability of the qubit, accounting for the CNOT and readout error rates. Thus, the metric allows us to select a reliable qubit with good connectivity. Instead of trying all the possible subgraph combinations (as in the GSP algorithm), QHSP builds partitions that contain qubits with a high fidelity degree while significantly reducing runtime.
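The fidelity degree in (2) can be evaluated directly from calibration data, as in the sketch below. Several points are assumptions rather than facts from the text: the function name and data layout are hypothetical, the direct link error is used for $E(Q_i, Q_j)$ instead of a Floyd-Warshall matrix, and the assignment of the four CNOT error rates to the links of IBM Q 5 Valencia is read off the figure of the example discussed later in this section. Under this reading, $\lambda = 1$ gives values that round to the entries listed in Table 1; other values of $\lambda$ rescale the CNOT term.

```python
def fidelity_degree(qubit, adjacency, cnot_error, readout_error, lam=1.0):
    """Evaluate (2): weighted CNOT fidelities of all links plus the readout fidelity."""
    cnot_term = sum(
        lam * (1.0 - cnot_error[frozenset((qubit, neighbour))])
        for neighbour in adjacency[qubit]
    )
    return cnot_term + (1.0 - readout_error[qubit])


# IBM Q 5 Valencia topology; error rates as fractions.
# The link-to-error assignment is an assumption.
adjacency = {0: [1], 1: [0, 2, 3], 2: [1], 3: [1, 4], 4: [3]}
cnot_error = {
    frozenset((0, 1)): 0.0085,
    frozenset((1, 2)): 0.0125,
    frozenset((1, 3)): 0.0159,
    frozenset((3, 4)): 0.0154,
}
readout_error = {0: 0.035, 1: 0.034, 2: 0.033, 3: 0.033, 4: 0.015}

degrees = {
    q: round(fidelity_degree(q, adjacency, cnot_error, readout_error), 2)
    for q in adjacency
}
print(degrees)
```

The readout term is added once per qubit, outside the sum over neighbours, so a qubit with three links scores roughly one unit more per extra link than a qubit with a single link.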


To further improve the algorithm, we construct a list of qubits with good connectivity as starting points. We sort all physical qubits by their physical node degree, which is defined as the number of links attached to a physical qubit. Note that the physical node degree is different from the fidelity degree. Similarly, we obtain the largest logical node degree of the logical qubits (the qubits used in the quantum circuit) by counting the number of different qubits that are connected to a qubit through CNOT operations. Next, we compare these two metrics. If the largest physical node degree is less than the largest logical node degree, we cannot find a physical qubit that can host the logical qubit with the largest logical node degree while satisfying all its connections. In this case, we collect only the physical qubits with the largest physical node degree. Otherwise, the physical qubits whose physical node degree is greater than or equal to the largest logical node degree are collected as starting points. By limiting the starting points, this heuristic partition algorithm becomes even faster.

For each qubit in the starting points list, the algorithm explores its neighbours, finds the neighbour qubit with the highest fidelity degree calculated in (2), and merges it into the sub-partition. Then, the qubit inside the sub-partition with the highest fidelity degree explores its neighbour qubits and merges the best one. The process is repeated until the number of qubits inside the sub-partition is equal to the number of qubits needed. This sub-partition is considered a subgraph and is added to the partition candidates. After obtaining all the partition candidates, we compute the fidelity score for each of them.
As we start from a qubit with a high physical node degree and merge neighbour qubits with a high fidelity degree, the constructed partition is expected to be well connected. Hence, we do not need to check the connectivity of the partition using the longest shortest path L as in (1) for the GSP algorithm; we only need to compare the error rates. The fidelity score metric is simplified to the CNOT and readout error rates as in (3) (crosstalk is included if it exists). It is calculated for each partition candidate and the best one is selected.

\[
\mathrm{Score}_h = \mathrm{Avg}_{\mathrm{CNOT}} \times \#\mathrm{CNOTs} + \sum_{Q_i \in P} R_{Q_i} \tag{3}
\]

Figure 11 shows an example of applying QHSP on IBM Q 5 Valencia (ibmq_valencia) for a four-qubit circuit. The calibration data of IBM Q 5 Valencia, including the readout error rates and CNOT error rates, are shown in Fig. 11a. We set $\lambda$ to two, and the physical node degree and the fidelity degree of each qubit calculated by (2) are shown in Table 1. Suppose the largest logical node degree is three. Then $Q_1$ is selected as the starting point, since it is the only physical qubit whose physical node degree equals the largest logical node degree. It has three neighbour qubits: $Q_0$, $Q_2$, and $Q_3$. $Q_3$ is merged into the sub-partition because it has the highest fidelity degree among the neighbour qubits. The sub-partition becomes $\{Q_1, Q_3\}$. As the fidelity degree of $Q_1$ is larger than that of $Q_3$, the algorithm again selects the remaining neighbour of $Q_1$ with the largest fidelity degree, which is $Q_0$. The sub-partition becomes $\{Q_1, Q_3, Q_0\}$. $Q_1$ is still the qubit with the largest fidelity degree in the current sub-partition, so its neighbour qubit $Q_2$ is merged. The final sub-partition is $\{Q_1, Q_3, Q_0, Q_2\}$ and it can be considered a partition candidate. The merging process is shown in Fig. 11b.

Algorithm 2: QHSP algorithm
input : Quantum circuit QC, Coupling graph G, Calibration data C, Crosstalk properties crosstalk_props, Used_qubits q_used, Starting points starting_points
output: A list of candidate partitions sub_graph_list
begin
    circ_qubit_num ← QC.qubit_num;
    Set sub_graph_list to empty list;
    for i ∈ starting_points do
        Set sub_graph to empty list;
        qubit_num ← 0;
        while qubit_num < circ_qubit_num do
            if sub_graph is empty then
                sub_graph.append(i);
                qubit_num ← qubit_num + 1;
                continue;
            end
            best_qubit ← find_best_qubit(sub_graph, G, C);
            if best_qubit ≠ None then
                sub_graph.append(best_qubit);
                qubit_num ← qubit_num + 1;
                continue;
            else
                break;
            end
        end
        if len(sub_graph) = circ_qubit_num then
            if q_used is empty then
                sub_graph.Set_Partition_Score(G, C, QC);
                sub_graph_list.append(sub_graph);
            else if no qubit in sub_graph is in q_used then
                crosstalk_pairs ← Find_Crosstalk_pairs(sub_graph, crosstalk_props, q_used);
                sub_graph.Set_Partition_Score(G, C, QC, crosstalk_pairs);
                sub_graph_list.append(sub_graph);
            end
        end
    end
    return sub_graph_list;
end
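The walk-through on IBM Q 5 Valencia can be reproduced with a short Python sketch of the starting-point selection and the greedy merging step. The function names are hypothetical, the fidelity degrees are taken from Table 1, and the rule applied when the best qubit of the sub-partition has no free neighbour (fall back to the next-best member, give up if none has one) is an assumption that the text does not spell out.

```python
def starting_points(adjacency, largest_logical_degree):
    """Physical qubits whose node degree can host the most connected logical qubit."""
    degree = {q: len(neighbours) for q, neighbours in adjacency.items()}
    max_degree = max(degree.values())
    if max_degree < largest_logical_degree:
        return [q for q, d in degree.items() if d == max_degree]
    return [q for q, d in degree.items() if d >= largest_logical_degree]


def grow_partition(start, adjacency, fidelity_degree, size, used=frozenset()):
    """Greedily merge neighbours with the highest fidelity degree until `size` qubits."""
    partition = [start]
    while len(partition) < size:
        best = None
        # Visit members from the highest to the lowest fidelity degree.
        for member in sorted(partition, key=fidelity_degree.get, reverse=True):
            free = [q for q in adjacency[member]
                    if q not in partition and q not in used]
            if free:
                best = max(free, key=fidelity_degree.get)
                break
        if best is None:
            return None  # the sub-partition cannot be completed
        partition.append(best)
    return partition


adjacency = {0: [1], 1: [0, 2, 3], 2: [1], 3: [1, 4], 4: [3]}
fidelity_degree = {0: 1.96, 1: 3.93, 2: 1.95, 3: 2.94, 4: 1.97}  # Table 1

starts = starting_points(adjacency, largest_logical_degree=3)
partition = grow_partition(starts[0], adjacency, fidelity_degree, size=4)
print(starts, partition)
```

Each starting point yields at most one candidate, so the number of candidates to score is bounded by the number of starting points rather than by the number of k-qubit subsets.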

Fig. 11 Example of qubit partition on IBM Q 5 Valencia for a four-qubit circuit using QHSP. Suppose the largest logical node degree of the target circuit is three. (a) The topology and calibration data of IBM Q 5 Valencia. The value inside each node is the readout error rate (in %): Q0 3.5, Q1 3.4, Q2 3.3, Q3 3.3, Q4 1.5. The value above each link is the CNOT error rate (in %): 0.85, 1.25, 1.59, 1.54. (b) Process of constructing a partition candidate using QHSP: {Q1} → {Q1, Q3} → {Q1, Q3, Q0} → {Q1, Q3, Q0, Q2}

Table 1 The physical node degree and the fidelity degree of each qubit on IBM Q 5 Valencia

Qubit                  Q0    Q1    Q2    Q3    Q4
Fidelity degree        1.96  3.93  1.95  2.94  1.97
Physical node degree   1     3     1     2     1

4.3.4 Runtime Analysis

Let n be the number of hardware qubits (physical qubits) and k the number of circuit qubits (logical qubits) to be allocated a partition. The GSP algorithm enumerates all combinations of k qubits from the n-qubit hardware, which takes $O(C(n, k))$ time, where $C(n, k)$ is the binomial coefficient "n choose k". For each subgraph, it computes the fidelity score, including the longest shortest path, which scales as $O(k^3)$. The total is equivalent to $O(k^3 \min(n^k, n^{n-k}))$. In most cases, the number of circuit qubits is less than the number of hardware qubits, so the time complexity becomes $O(k^3 n^k)$, which increases exponentially with the number of circuit qubits.

The QHSP algorithm starts by collecting a list of m starting points, where $m \le n$. To get the starting points, we sort the n physical qubits by their physical node degree, which takes $O(n \log n)$. Then, we iterate over all the gates of the circuit (say the circuit has g gates) and sort the k logical qubits according to their logical node degree, which takes $O(g + k \log k)$. Next, for each starting point, the algorithm iteratively merges the best neighbour qubit until the sub-partition contains k qubits. To find the best neighbour qubit, the algorithm finds the best qubit in the sub-partition and traverses all its neighbours to select the one with the highest fidelity degree. Finding the best qubit in the sub-partition is $O(p)$, where p is the number of qubits in the sub-partition. The average value of p is $k/2$, so this step takes $O(k)$ time on average. Finding the best neighbour qubit is $O(1)$ because of the nearest-neighbor connectivity of superconducting devices. Overall, QHSP takes $O(mk^2 + n \log n + g + k \log k)$ time, which can be truncated to $O(mk^2 + n \log n + g)$ and is polynomial.
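To give a feeling for the gap between the two bounds, the sketch below counts the k-qubit subsets that GSP has to inspect and compares the result with the $mk^2$ term of QHSP. It is only a back-of-the-envelope illustration: the helper names are hypothetical, constants and the $O(k^3)$ scoring factor are ignored, and m is set to its upper bound n. The device sizes 27 and 65 are those of the two chips used in the evaluation.

```python
import math


def gsp_subsets(n, k):
    """Number of k-qubit subsets of an n-qubit device that GSP enumerates."""
    return math.comb(n, k)


def qhsp_merge_work(m, k):
    """Dominant QHSP term: m starting points, k merges each, O(k) scan per merge."""
    return m * k * k


for n, k in [(27, 5), (65, 5), (65, 10)]:
    print(n, k, gsp_subsets(n, k), qhsp_merge_work(n, k))
```

For a five-qubit circuit on the 27-qubit device, GSP already inspects 80730 subsets, while the QHSP term stays in the hundreds.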

4.3.5 Post Qubit Partition

By default, the multi-programming mechanism reduces circuit fidelity compared to the standalone circuit execution mode. If the fidelity reduction is significant, circuits should be executed independently or the number of simultaneous circuits should be reduced, even though the hardware throughput decreases as well. Therefore, we consistently check the circuit fidelity difference between independent and concurrent execution. We start with the qubit partition process for each circuit independently and obtain the fidelity score of the partition. Next, the qubit partition process is applied to these circuits together to compute the fidelity score when executing them simultaneously. The difference between the fidelity scores is denoted $\Delta S$, which is the fidelity metric. If $\Delta S$ is less than a specific threshold $\delta$, simultaneous circuit execution does not significantly degrade the fidelity score and the circuits can be executed concurrently. Otherwise, they are executed independently or the number of simultaneous circuits is reduced. The fidelity metric and the parallelism manager help determine the optimal number of simultaneous circuits to be executed.
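The post-partition check amounts to a threshold test on $\Delta S$, sketched below. The text does not state how the per-circuit scores are aggregated, so taking $\Delta S$ as the difference between the summed simultaneous and independent partition scores is an assumption, as are the function names and the example scores.

```python
def delta_s(independent_scores, simultaneous_scores):
    """Assumed aggregation: difference between the summed partition scores."""
    return sum(simultaneous_scores) - sum(independent_scores)


def execution_mode(delta_s_value, delta):
    """Run circuits together only if the score difference stays below the threshold."""
    return "concurrent" if delta_s_value < delta else "independent"


# Hypothetical partition scores for two circuits (lower is better).
difference = delta_s([0.30, 0.35], [0.34, 0.40])
print(round(difference, 2), execution_mode(difference, delta=0.1))
```

A difference of 0.18, for instance, is rejected with $\delta = 0.1$ but accepted with $\delta = 0.2$, so the threshold directly controls how many circuits are allowed to share the device.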

4.4 Scheduler: Mapping Transition Algorithm

The scheduler includes the mapping algorithm that makes circuits executable on real quantum hardware. Two steps are needed to make circuits hardware-compliant: initial mapping and mapping transition. The initial mapping of each circuit is created while taking into account the swap error rate and swap distance, and the initial mapping of the simultaneous mapping transition process is obtained by merging the initial mappings of the circuits according to their partitions. We improve the mapping transition algorithm proposed in [26] by modifying the heuristic cost function to better select the inserted gate. We also introduce the Bridge gate to the simultaneous mapping transition process for multi-programming.

First, each quantum circuit is transformed into a more convenient format, a Directed Acyclic Graph (DAG) circuit, which represents the operation dependencies of the circuit without considering the connectivity constraints. Then, the compiler traverses the DAG circuit and goes through each quantum gate sequentially. Gates that do not depend on any unexecuted gate (i.e., all the gates before them have been executed) are allocated to the first layer, denoted F. The compiler checks whether the gates in the first layer are hardware-compliant. The hardware-compliant gates can be executed on the hardware directly without modification. They are added to the scheduler, removed from the first layer, and marked as executed. If the first layer is not empty afterwards, some gates are not executable on the hardware and a SWAP or Bridge gate is needed. We collect all the possible SWAPs and Bridges, and use the cost function H (see (5)) to find the best candidate. The process is repeated until all the gates are marked as executed.
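The notions of first layer and hardware compliance can be made concrete with a small Python sketch. It does not build an explicit DAG; instead it scans the gate list in program order and treats a gate as ready when no earlier unexecuted gate touches any of its qubits, which gives the same first layer for a circuit stored as a gate sequence. The function names, the gate encoding and the example circuit are hypothetical.

```python
def first_layer(gates, executed=frozenset()):
    """Indices of unexecuted gates whose predecessors on every qubit are executed."""
    blocked = set()  # qubits already claimed by an earlier unexecuted gate
    layer = []
    for index, (_, qubits) in enumerate(gates):
        if index in executed:
            continue
        if not blocked.intersection(qubits):
            layer.append(index)
        blocked.update(qubits)
    return layer


def is_hardware_compliant(qubits, mapping, coupling):
    """A two-qubit gate is compliant if its physical qubits share a link."""
    if len(qubits) == 1:
        return True
    a, b = (mapping[q] for q in qubits)
    return (a, b) in coupling or (b, a) in coupling


# Hypothetical circuit on a four-qubit line 0-1-2-3 with the identity mapping.
gates = [("cx", (0, 1)), ("cx", (2, 3)), ("cx", (1, 2)), ("h", (0,))]
coupling = {(0, 1), (1, 2), (2, 3)}
mapping = {0: 0, 1: 1, 2: 2, 3: 3}

front = first_layer(gates)
print(front, [is_hardware_compliant(gates[i][1], mapping, coupling) for i in front])
print(first_layer(gates, executed={0, 1}))
```

Once the compliant gates of the first layer are marked as executed, calling `first_layer` again exposes the next gates, which mirrors the loop described above.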


A SWAP gate requires three CNOTs, and inserting a SWAP gate changes the current mapping. A Bridge gate, in contrast, requires four CNOTs and does not change the current mapping. It can only be used to execute a CNOT when the distance between the control and the target qubits is exactly two. Both gates need three supplementary CNOTs. A SWAP gate is preferred when it has a positive impact on the following gates, allocated in the extended layer E, i.e., it makes these gates executable or reduces the distance between their control and target qubits. Otherwise, a Bridge gate is preferred.

A cost function H is introduced to evaluate the cost of inserting a SWAP or Bridge. We use the following distance matrix (see (4)), as in [26], to quantify the impact of the SWAP or Bridge gate,

\[
D = \alpha_1 \times S + \alpha_2 \times \mathcal{E} \tag{4}
\]

where S is the swap distance matrix and $\mathcal{E}$ is the swap error matrix. We set $\alpha_1$ and $\alpha_2$ to 0.5 to consider the swap distance and the swap error rate equally. In [26], only the impact of a SWAP or Bridge on other gates (first and extended layer) was considered, without considering the cost of the inserted gate itself. As each of them is composed of either three or four CNOTs, this cost cannot be ignored. Hence, in our simultaneous mapping transition algorithm, we take the self impact into account and create a list of both SWAP and Bridge candidates, labeled as “tentative gates”. The heuristic cost function is:

\[
H = \frac{1}{|F + N_{Tent}|} \left( \sum_{g \in F} D[\pi(g.q_1)][\pi(g.q_2)] + \sum_{g \in Tent} D[\pi(g.q_1)][\pi(g.q_2)] \right) + W \times \frac{1}{|E|} \sum_{g \in E} D[\pi(g.q_1)][\pi(g.q_2)] \tag{5}
\]

where W is the parameter that weights the impact of the extended layer, $N_{Tent}$ is the number of CNOT gates in the tentative gate, Tent represents a SWAP or Bridge gate, and $\pi$ represents the mapping. A SWAP gate has three CNOTs, so $N_{Tent}$ is three and we consider the impact of three CNOTs on the first layer; the mapping is the new mapping after inserting the SWAP. For a Bridge gate, $N_{Tent}$ is four and we consider four CNOTs on the first layer; the mapping is the current mapping, as a Bridge gate does not change it. We weight the impact on the extended layer to prioritize the first layer. This cost function helps the compiler decide whether to insert a SWAP or a Bridge gate.

Our simultaneous mapping transition algorithm outperforms HA [26] thanks to the modifications of the cost function, while its asymptotic complexity is unchanged. Let n be the number of hardware qubits and g the number of CNOT gates in the circuit.
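The cost function in (5) can be evaluated as in the following sketch. It rests on several assumptions that go beyond the text: $|F + N_{Tent}|$ is read as $|F| + N_{Tent}$, every gate (including the CNOTs of the tentative gate) is given as a pair of logical qubits, the caller supplies the mapping that already reflects the tentative gate, and the distance matrix and the small example are hypothetical values chosen only to exercise the formula.

```python
def heuristic_cost(first_layer, tentative, extended_layer, D, mapping, weight=0.5):
    """Evaluate H for one tentative gate.

    `mapping` sends logical qubits to physical indices and must already reflect
    the tentative gate (updated for a SWAP, unchanged for a Bridge).
    """
    def total(gates):
        return sum(D[mapping[q1]][mapping[q2]] for q1, q2 in gates)

    # Assumption: |F + N_Tent| means |F| + N_Tent.
    front = (total(first_layer) + total(tentative)) / (len(first_layer) + len(tentative))
    look_ahead = total(extended_layer) / len(extended_layer) if extended_layer else 0.0
    return front + weight * look_ahead


# Hypothetical distance matrix for a three-qubit line 0-1-2.
D = [[0, 1, 2],
     [1, 0, 1],
     [2, 1, 0]]
first = [("a", "c")]      # the CNOT that is not yet executable
extended = [("b", "c")]   # one look-ahead gate

# SWAP on (a, b): three CNOTs, evaluated with the mapping after the swap.
h_swap = heuristic_cost(first, [("a", "b")] * 3, extended, D,
                        mapping={"a": 1, "b": 0, "c": 2})

# Bridge through b: four CNOTs, evaluated with the unchanged mapping.
bridge_cnots = [("b", "c"), ("a", "b"), ("b", "c"), ("a", "b")]
h_bridge = heuristic_cost(first, bridge_cnots, extended, D,
                          mapping={"a": 0, "b": 1, "c": 2})
print(h_swap, h_bridge)
```

Keeping the mapping as an explicit argument makes the difference between the two candidates visible: the same function is called with the updated mapping for a SWAP and with the current one for a Bridge.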

Algorithm 3: Simultaneous mapping transition algorithm
input : Circuits DAGs, Coupling graph G, Distance matrices Ds, Initial mapping π_i, First layers Fs
output: Final schedule schedule
begin
    π_c ← π_i;
    while not all gates are executed do
        Set swap_bridge_lists to empty list;
        for F_i in Fs do
            for gate in F_i do
                if gate is hardware-compliant then
                    schedule.append(gate);
                    Remove gate from F_i;
                end
            end
            if F_i is not empty then
                swap_bridge_candidate_list ← FindSwapBridgePairs(F_i, G);
                swap_bridge_lists.append(swap_bridge_candidate_list);
            end
        end
        for swap_bridge_candidate_list ∈ swap_bridge_lists do
            for g_tmp ∈ swap_bridge_candidate_list do
                π_tmp ← Map_Update(g_tmp, π_c);
                H_basic ← 0;
                for gate ∈ F_i do
                    H_basic ← H_basic + D_i(gate, π_tmp);
                end
                H_tentative ← g_tmp.cost(G, D_i, π_tmp);
                Update the extended layer E;
                H_extend ← 0;
                for gate ∈ E do
                    H_extend ← H_extend + D_i(gate, π_tmp);
                end
                H ← 1/|F + N_Tent| × (H_basic + H_tentative) + W/|E| × H_extend;
            end
            Choose the best gate g_n;
            π_c ← Map_Update(g_n, π_c);
        end
        Update the First layers;
    end
    return schedule;
end

The simultaneous mapping transition algorithm takes $O(gn^{2.5})$ time, assuming nearest-neighbor chip connectivity and an extended layer E with at most $O(n)$ CNOT gates. A detailed explanation of the complexity can be found in [26].
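The CNOT counts quoted for the SWAP and Bridge gates can be checked with a small classical simulation. Since a CNOT only permutes computational basis states, two CNOT circuits implement the same unitary exactly when they agree on every basis state, so testing all bit strings is sufficient. The function names and the particular ordering of the four Bridge CNOTs are illustrative choices, not taken from the text.

```python
from itertools import product


def cnot(state, control, target):
    """Apply a CNOT to a tuple of classical bits."""
    bits = list(state)
    bits[target] ^= bits[control]
    return tuple(bits)


def bridge(state, control, middle, target):
    """CNOT(control, target) at distance two, built from four nearest-neighbour CNOTs."""
    for c, t in [(middle, target), (control, middle), (middle, target), (control, middle)]:
        state = cnot(state, c, t)
    return state


def swap_via_cnots(state, a, b):
    """SWAP of qubits a and b built from three CNOTs."""
    for c, t in [(a, b), (b, a), (a, b)]:
        state = cnot(state, c, t)
    return state


basis = list(product((0, 1), repeat=3))
bridge_ok = all(bridge(s, 0, 1, 2) == cnot(s, 0, 2) for s in basis)
swap_ok = all(swap_via_cnots(s, 0, 1) == (s[1], s[0], s[2]) for s in basis)
print(bridge_ok, swap_ok)
```

The Bridge leaves the middle qubit unchanged, which is why it does not modify the mapping, whereas the SWAP exchanges the two qubit states and therefore does.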


4.5 Evaluation

4.5.1 Metrics

Here are the metrics that we use to evaluate the algorithms.

1. Probability of a Successful Trial (PST) [27]. This metric is used to represent the circuit output fidelity and is defined as the number of trials that give the expected result divided by the total number of trials. The expected result is obtained by executing the quantum circuit on the simulator. To precisely estimate the PST, we execute each quantum circuit on the quantum hardware for a large number of trials (8192).
2. Number of additional CNOT gates. This metric is related to the number of SWAP or Bridge gates inserted. It shows the ability of the algorithm to reduce the number of additional gates.
3. Trial Reduction Factor (TRF). This metric is introduced in [3] to evaluate the improvement of the throughput thanks to the multi-programming mechanism. It is defined as the ratio of the trials needed when quantum circuits are executed independently to the trials needed when they are executed simultaneously.
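The PST and TRF definitions above translate directly into code. In the sketch below, the function names and the measurement counts are hypothetical; only the total of 8192 trials per circuit is taken from the text.

```python
def pst(counts, expected):
    """Probability of a Successful Trial: expected outcomes over all trials."""
    total = sum(counts.values())
    return counts.get(expected, 0) / total


def trf(trials_independent, trials_simultaneous):
    """Trial Reduction Factor: independent trials over simultaneous trials."""
    return trials_independent / trials_simultaneous


# Hypothetical measurement counts for one circuit run with 8192 trials.
counts = {"101": 6144, "001": 1024, "111": 1024}
print(pst(counts, "101"))

# Two circuits, 8192 trials each when run independently, 8192 when run together.
print(trf(2 * 8192, 8192))
```

Running two circuits together with the same number of trials halves the total number of trials, which gives a TRF of two.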

4.5.2 Comparison

Several published qubit mapping algorithms [26, 28–32] and multi-programming mapping algorithms [3, 4] are available. We choose HA [26] as the baseline for independent execution. It is a qubit mapping algorithm that takes hardware topology and calibration data into consideration to achieve high circuit fidelity with a reduced number of additional gates. Because of the different hardware access and the unavailability of the code of the state-of-the-art multi-programming algorithms, we only compare our QuMC with independent executions to show the impact of the multi-programming mechanism. Moreover, our qubit partition algorithms can also be applied to the qubit mapping algorithm for independent executions when running a program on a relatively large quantum device. To summarize, the following comparisons are performed:

• For independent executions, we compare the partition + improved mapping transition algorithm based on HA (labeled as PHA) versus HA to show the impact of partition on large quantum hardware for a small circuit.
• For simultaneous executions, we compare our QuMC framework, (1) GSP + improved mapping transition (labeled as GSP) and (2) QHSP + improved mapping transition (labeled as QHSP), with independent executions, HA and PHA, to report the fidelity loss due to simultaneous executions of multiple circuits.

Note that PHA allows each quantum circuit to be executed on the best partition selected according to the partition fidelity score metric.

Table 2 Information of benchmarks

Type   ID  Name            Qubits  Gates  CNOTs
Small  1   3_17_13         3       36     17
Small  2   4mod5-v1_22     5       21     11
Small  3   mod5mils_65     5       35     16
Small  4   alu-v0_27       5       36     17
Small  5   decod24-v2_43   4       52     22
Large  6   qaoa_6          6       49     24
Large  7   qaoa_8          8       80     42
Large  8   qaoa_10         10      102    54
Large  9   qft_6           6       81     39
Large  10  qft_8           8       147    68
Large  11  qft_10          10      233    105
Large  12  ising_5         5       91     40
Large  13  ising_10        10      481    90

4.5.3 Benchmarks

We evaluate our QuMC framework by executing a list of benchmarks of different sizes at the same time on two quantum devices, IBM Q 27 Toronto (ibmq_toronto) and IBM Q 65 Manhattan (ibmq_manhattan). All the benchmarks are collected from the previous work [33], including several functions taken from RevLib [34] as well as some quantum algorithms written in Quipper [35] or Scaffold [36]. These benchmarks are widely used in the quantum community and their details are shown in Table 2. We execute only the small, shallow-depth quantum circuits on the two selected quantum devices, since only these circuits yield reliable results. For the large quantum circuits, we compile them on the two chips without execution.

4.5.4 Algorithm Configurations

Here, we consider the algorithm configurations of the different multi-programming and standalone mapping approaches. We select the best initial mapping out of ten attempts for HA, PHA, GSP, and QHSP. The weight parameter W in the cost function (see (5)) is set to 0.5 and the size of the extended layer is set to 20. Parameters $\alpha_1$ and $\alpha_2$ are both set to 0.5 to consider the swap distance and the swap error rate equally. The weight parameter $\lambda$ of QHSP (see (2)) is set to 2 because of the relatively large number of CNOT gates in the benchmarks. The threshold $\delta$ for the post qubit partition is set to 0.1 to ensure multi-programming reliability. Because of the expensive cost of SRB, we perform SRB only on IBM Q 27 Toronto and collect the pairs with a significant crosstalk effect. Only the collected pairs are characterized, and their crosstalk properties are provided to the partition process. The experimental results on IBM Q 65 Manhattan do not consider the crosstalk effect. For each algorithm, we only evaluate the mapping transition process, which means that no optimisation methods such as gate commutation or cancellation are applied. The algorithm is implemented in Python and evaluated on a PC with one Intel i5-5300U CPU and 8 GB of memory, running Ubuntu 18.04. All the experiments were performed using Qiskit version 0.21.0.

Fig. 12 Comparison of fidelity and number of additional gates on IBM Q 27 Toronto when executing two circuits simultaneously. (a) Fidelity. (b) Number of additional gates

4.5.5

Experimental Results

We first run two quantum circuits on IBM Q 27 Toronto independently and simultaneously. Results on average output state fidelity and the total number of additional gates are shown in Fig. 12. Note that, all the circuit output fidelities are calculated by PST metric explained in Sect. 4.5.1. For independent executions, the fidelity is improved by 46.8% and the number of additional gates is reduced by 8.7% comparing PHA to HA. For simultaneous executions, QHSP and GSP allocate the same partitions except for the first experiment—(ID1, ID1). In this experiment, GSP improves the fidelity by 6% compared to QHSP. Note that partition results might be different due to the various calibration data and the choice of .λ, but the difference of the partition fidelity score between the two algorithms is small. The results show that QHSP is able to allocate nearly optimal partitions while reducing runtime significantly (from exponential to polynomial complexity). Therefore, for the rest experiments, we only evaluate QHSP algorithm. Comparing QHSP (simultaneous executions) versus HA (independent executions), the fidelity is even improved by 31.8% and the number of additional gates is reduced by 9.2%. Whereas comparing QHSP with PHA, the fidelity is decreased by 5.4% and the gate number is almost the same, with only 0.3% increase. During the post-partition process, .AS does not pass the threshold for all the combinations of benchmarks so that TRF is two. Next, we execute on IBM Q 65 Manhattan three and four simultaneous quantum circuits and compare the results with the independent executions. Figures 13 and 14 show the comparison of fidelity and the number of additional gates. PHA

Fig. 13 Comparison of fidelity and number of additional gates on IBM Q 65 Manhattan when executing three circuits simultaneously. (a) Fidelity. (b) Number of additional gates. Both panels compare HA, PHA, and QHSP over the benchmark combinations (1,2,3), (1,2,4), (1,2,5), (2,3,4), and (2,3,5)

outperforms HA for independent executions in most of the cases. Comparing QHSP with HA, the fidelity is improved by 5.3% and 13.3% for three and four simultaneous executions, respectively, and the inserted gate number is always reduced. Comparing QHSP with PHA, the fidelity decreases by 1.5% and 6.4% for the two cases, respectively, and the additional gate number is almost the same. The threshold is still not passed in any experiment, and TRF becomes three and four.

Then, to evaluate the hardware limitations of executing multiple circuits in parallel, we set the threshold δ to 0.2. All five small benchmarks can then be executed simultaneously on IBM Q 65 Manhattan. The partition fidelity difference is 0.18. Results show that the fidelity of simultaneous executions (QHSP) is decreased by 9.5% compared to independent executions (PHA).

Finally, to illustrate the performance of our QHSP algorithm on large benchmarks, we compile two and three simultaneous large circuits on IBM Q 27 Toronto and IBM Q 65 Manhattan, respectively, and compare the results with HA and PHA. Since the large benchmarks cannot produce meaningful results because of noise, we do not execute them on the real hardware and only use the number of additional gates as the comparison metric. The results are shown in Fig. 15. Comparing QHSP with HA, the additional gate number is reduced by 23.2% and 15.6%, respectively. Compared with PHA, the additional gate number is increased by 0.9% and 6.4%.

4.5.6 Result Analysis

PHA is always better than HA for independent executions for two reasons: (1) The initial mapping of the two algorithms is based on a random process. During the experiment, we perform the initial mapping generation process ten times and select

Fig. 14 Comparison of fidelity and number of additional gates on IBM Q 65 Manhattan when executing four circuits simultaneously. (a) Fidelity. (b) Number of additional gates. Both panels compare HA, PHA, and QHSP over the benchmark combinations (1,2,3,4), (1,2,3,5), (1,3,4,5), and (2,3,4,5)

Fig. 15 Comparison of number of additional gates for large benchmarks when (a) compiling two benchmarks on IBM Q 27 Toronto, (b) compiling three benchmarks on IBM Q 65 Manhattan. Both panels compare HA, PHA, and QHSP

the best one. However, for PHA, we first restrict the random process to a reliable and well-connected small partition space rather than the overall hardware space used by HA. Therefore, with only ten trials, PHA finds a better initial mapping. (2) We improve the mapping transition process of PHA, which can make a better selection between SWAP and Bridge gates. HA is shown to be sufficient for hardware with a small number of qubits, for example a 5-qubit quantum chip. If we want to map a circuit onto large hardware, it is better to first restrict the search space to a reliable small partition and then find the initial mapping. This qubit partition approach can be applied to general qubit mapping problems to limit the search space when the target hardware is large. Comparing the simultaneous process QHSP with the independent process HA, QHSP is able to outperform HA with higher fidelity and a reduced number of additional gates. The improvement is also due to the partition allocation and the enhancement of the


mapping transition process as explained before. When comparing QHSP with PHA (where each independent circuit is executed on the best partition), QHSP uses almost the same number of additional gates, whereas the fidelity decreases by less than 10% if the threshold is set to 0.1. However, the hardware throughput increases by two and four times, respectively, for the two devices. Note that this also corresponds to a large reduction in the total runtime of these circuits (waiting time + circuit execution time).

5 QuCP

Since multi-programming is more favorable for large-scale quantum computers, alternative approaches are needed to address a drawback of QuMC: the large overhead of performing SRB for crosstalk characterization. In this section, we present a Quantum Crosstalk-aware Parallel workload execution method (QuCP), which eliminates the crosstalk impact without the overhead of characterizing it.

5.1 QuCP Compiler

Simultaneous Randomized Benchmarking (SRB) is one of the most popular approaches to quantify the crosstalk properties of a quantum device. However, it introduces a significant overhead when applied to large devices. Crosstalk is shown to be significant between neighboring CNOT pairs, and several optimization methods have been proposed in [5] to lower the SRB overhead by grouping CNOT pairs separated by more than one-hop distance and performing SRB on them simultaneously. However, SRB remains expensive even with these optimization methods. The overhead of performing SRB on two quantum chips, IBM Q 27 Toronto and IBM Q 65 Manhattan, is shown in Table 3. The one-hop pairs (shown in Fig. 10a) are allocated to a minimum number of groups. We choose 5 seeds to ensure precise SRB results, and the number of jobs needed to perform SRB is 135 and 165, respectively, which takes a significant amount of time. The cost grows further as the size of the quantum chip increases. Beyond this cost, SRB also requires users to master the technique in order to characterize crosstalk, which is not trivial.

Table 3 Overhead of SRB on different IBM quantum chips

Chip                 Qubits  1-hop pairs  Groups  Seeds  Jobs
IBM Q 27 Toronto     27      28           9       5      135
IBM Q 65 Manhattan   65      72           11      5      165

Inspired by QuMC, which mitigates crosstalk error at the partition level, we introduce a crosstalk parameter σ to represent the crosstalk impact on CNOT pairs without the need to learn and perform SRB. Given a list of circuits to execute simultaneously, we first use the heuristic qubit partitioning method from QuMC to allocate the partition for the first circuit and add these qubits to a list of allocated qubits q_allocate. For each of the remaining circuits, whenever we construct the possible partition candidates, we check whether any pairs inside the partition candidate are at a one-hop distance from the pairs inside q_allocate according to the hardware topology. If such pairs exist, we collect them in a list of potential crosstalk pairs q_crosstalk. To select the best partition, we re-use the partition score metric Score_h introduced in (3). Note that if q_crosstalk is not empty, we multiply the CNOT errors of the pairs inside q_crosstalk by the crosstalk parameter σ to indicate the crosstalk effect, so that the crosstalk impact can be emulated and avoided at the partition level without performing SRB to characterize the crosstalk properties of a quantum device.
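The candidate check described above can be sketched as follows. This is only an illustration under stated assumptions, not the QuCP implementation: the topology is a plain edge list, two CNOT pairs are treated as "one hop apart" when they share no qubit but are joined by a single coupling edge, and the plain average CNOT error is used as a stand-in for Score_h from (3), whose exact form is not reproduced here. All function names and the toy error rates are hypothetical.

```python
from itertools import product

def one_hop(pair_a, pair_b, edges):
    """True if two disjoint CNOT pairs are joined by a single coupling edge."""
    if set(pair_a) & set(pair_b):
        return False
    links = {frozenset(e) for e in edges}
    return any(frozenset((u, v)) in links for u, v in product(pair_a, pair_b))

def inner_pairs(qubits, edges):
    """Coupling edges (CNOT pairs) whose two qubits both lie inside `qubits`."""
    return [e for e in edges if e[0] in qubits and e[1] in qubits]

def crosstalk_pairs(candidate, allocated, edges):
    """Pairs inside `candidate` that sit one hop from a pair inside `allocated`."""
    taken = inner_pairs(allocated, edges)
    return [p for p in inner_pairs(candidate, edges)
            if any(one_hop(p, q, edges) for q in taken)]

def mean_cnot_error(candidate, allocated, edges, cnot_error, sigma):
    """Average CNOT error of a candidate; crosstalk-prone pairs are scaled by sigma."""
    flagged = set(crosstalk_pairs(candidate, allocated, edges))
    pairs = inner_pairs(candidate, edges)
    if not pairs:
        return 0.0
    total = sum(cnot_error[p] * (sigma if p in flagged else 1.0) for p in pairs)
    return total / len(pairs)

# Toy 6-qubit line 0-1-2-3-4-5 with a uniform 1% CNOT error (made-up numbers).
edges = [(0, 1), (1, 2), (2, 3), (3, 4), (4, 5)]
cnot_error = {e: 0.01 for e in edges}
allocated = {0, 1}   # qubits already given to the first circuit
near = {2, 3}        # pair (2, 3) is one hop from pair (0, 1)
far = {4, 5}         # pair (4, 5) is further away
near_score = mean_cnot_error(near, allocated, edges, cnot_error, sigma=4)
far_score = mean_cnot_error(far, allocated, edges, cnot_error, sigma=4)
print(near_score, far_score)
```

With equal calibration data, the candidate next to the allocated partition is penalized and the distant one is preferred, which is the behavior the parameter σ is meant to emulate.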

5.2 Evaluation

5.2.1 Methodology

We compare our QuCP with two crosstalk-aware parallel workload execution approaches, QuMC and CNA [37]. The crosstalk impact is mitigated at the partition level for QuMC and at the gate level for CNA. Both methods need to perform SRB for crosstalk characterization. Table 4 shows the benchmarks that we use to compare these algorithms. They are collected from [22, 34] and include several functions for logical operations, error correction, and quantum simulation. We calculate the output fidelity of the simultaneous circuits to evaluate the performance of these algorithms. Some of the benchmarks have a single deterministic output, and for these we use the Probability of a Successful Trial (PST) metric introduced in Sect. 4.5.1. For the other benchmarks, the result is supposed to be a distribution. For these, we choose the Jensen-Shannon Divergence (JSD) to measure the distance between two probability distributions, shown in (6), where P and Q are the two distributions to compare and M = (P + Q)/2. It is based on the Kullback-Leibler divergence, shown in (7), with the benefit of always having a finite value and being symmetric.

\[ JSD(P \| Q) = \frac{1}{2} D_{KL}(P \| M) + \frac{1}{2} D_{KL}(Q \| M) \tag{6} \]

\[ D_{KL}(P \| Q) = \sum_{x \in X} P(x) \log\left( \frac{P(x)}{Q(x)} \right) \tag{7} \]
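As a concrete illustration of (6) and (7), the following minimal sketch computes both metrics from measurement counts. It is not taken from the authors' code. The logarithm base is left as a parameter because the text does not state it (base 2 bounds the JSD by 1), and PST is computed as the fraction of shots returning the expected bitstring, which is assumed to match the definition in Sect. 4.5.1. The counts are invented for the example.

```python
import math

def normalize(counts):
    """Turn a {bitstring: shots} dictionary into a probability distribution."""
    total = sum(counts.values())
    return {k: v / total for k, v in counts.items()}

def kl_divergence(p, q, base=math.e):
    """D_KL(P || Q) as in (7); terms with P(x) = 0 contribute nothing."""
    return sum(px * math.log(px / q[x], base) for x, px in p.items() if px > 0)

def jsd(p, q, base=math.e):
    """Jensen-Shannon divergence as in (6), with M = (P + Q) / 2."""
    keys = set(p) | set(q)
    m = {x: 0.5 * (p.get(x, 0.0) + q.get(x, 0.0)) for x in keys}
    return 0.5 * kl_divergence(p, m, base) + 0.5 * kl_divergence(q, m, base)

def pst(counts, expected):
    """Fraction of shots that returned the expected bitstring (assumed PST definition)."""
    return counts.get(expected, 0) / sum(counts.values())

# Invented example data: an ideal Bell-like distribution versus noisy counts.
ideal = {"00": 0.5, "11": 0.5}
measured = normalize({"00": 430, "01": 40, "10": 50, "11": 480})
print("JSD:", jsd(ideal, measured, base=2))
print("PST:", pst({"101": 75, "000": 25}, "101"))
```

Because M is nonzero wherever P or Q is nonzero, every term in (7) stays finite, which is the property that makes JSD usable when a noisy device produces outcomes the ideal distribution never does.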

Table 4 Information of benchmarks

Benchmark      Qubits  Gates  CNOTs  Result
Adder          4       23     10     1
Linearsolver   3       19     4      Dist
4mod5-v1_22    5       21     11     1
Fredkin        3       19     8      1
qec_en         5       25     10     Dist
alu-v0_27      5       36     17     1
Bell           4       33     7      Dist
Variation      4       54     16     Dist

We execute three benchmarks on IBM Q 27 Toronto in parallel. The optimization_level in Qiskit compiler is set to 3, which is the highest level for circuit optimizations.

5.2.2 Experimental Results

First, we tune the crosstalk parameter σ used in QuCP to verify its ability to mitigate crosstalk at the partition level without SRB, by comparing its partitioning results with those of QuMC. When σ ≥ 4, QuCP provides the same results as QuMC. This value is reasonable because the score uses the average CNOT error rate inside the partition, which dilutes the impact of crosstalk on an individual CNOT pair. Based on this experiment, we set σ to 4 and compare QuCP with CNA to show the influence of crosstalk mitigation at the partition level versus the gate level for parallel circuit execution. The results in terms of JSD and PST are shown in Fig. 16. Note that a lower JSD or a higher PST is desirable. The experiments include both repetitions of a single benchmark and combinations of different benchmarks. Comparing QuCP with CNA, the fidelity characterized by JSD and PST is improved by 10.5% and 89.9%, respectively. The fidelity improvement is also due to the different partitioning and mapping methods, which are two other important factors to consider for parallel circuit execution. QuCP has better results and achieves crosstalk mitigation with low overhead.

6 Applications

6.1 Multi-Programming and VQE

To demonstrate the potential benefit of applying the multi-programming mechanism to existing quantum algorithms, we investigate it on the VQE algorithm. To do this, we perform the same experiment as [38, 39] on IBM Q 65 Manhattan, estimating the ground state energy of the deuteron, which is the nucleus of a deuterium atom, an isotope of hydrogen.

Fig. 16 The fidelity result of executing three benchmarks simultaneously on IBM Q 27 in terms of JSD and PST metrics. The experiments include combining three different benchmarks and repeating the same benchmark three times. (a) JSD result (lower is better). (b) PST result (higher is better). Both panels compare QuCP and CNA

The deuteron can be modeled using a 2-qubit Hamiltonian spanning four Pauli strings: ZI, IZ, XX, and YY [38, 39]. If we use the naive measurement to calculate the state energy, one ansatz corresponds to four different measurements. Pauli operator grouping (labeled as PG) has been proposed to reduce this overhead by utilizing simultaneous measurement [12, 38, 40]. For example, the Pauli strings can be partitioned into two commuting families, {ZI, IZ} and {XX, YY}, using the approach proposed in [38]. This allows one parameterized ansatz to be measured twice instead of four times as in the naive method. We use a simplified Unitary Coupled Cluster ansatz with a single parameter and three gates, as described in [38, 39].

We apply our QuMC method on top of the Pauli operator grouping approach (labeled as QuMCPG) to estimate the ground state energy of the deuteron and compare the results with PG. In our QuMC method, the parallelism manager works with the hardware-aware multi-programming compiler to determine the number of circuits for simultaneous execution. We set the threshold δ to 0.1 to ensure the multi-programming reliability, and the weight parameter λ is set to 1 because of the small number of CNOTs in the parameterized circuit. Eight circuits are selected so that the fidelity threshold is not passed. They correspond to four parameterized circuits with four different parameters, since one parameterized circuit requires two measurement circuits using PG. This is equivalent to performing four optimizations at once. These circuits can be executed simultaneously using QuMCPG, which reduces the total circuit runtime by a factor of eight compared with PG for independent execution. We perform this experiment five times across days with different calibration data. Note that, if the naive measurement were taken as the reference instead (four parameters with four measurement circuits each), the number of measurement circuit executions would be reduced by a factor of 16.

The results of the five experiments using PG (independent process) and QuMCPG (simultaneous process) are shown in Fig. 17. We use a simulator to perform the same experiment and take its result as the baseline. Compared to the baseline, the average error rates are 9% and 13.3% for PG and QuMCPG, respectively. The fidelity loss of the simultaneous process is within the threshold compared to the independent one, and the hardware throughput is improved by eight times. More information about the experimental results can be found in Table 5.

Fig. 17 The estimation of the ground state energy of deuteron under PG and QuMCPG with four optimisations. (a) PG result (independent process) with eight measurements. (b) QuMCPG result (simultaneous process) with one measurement

Table 5 The information of the five experiments

Experiments  n_c (a)  Error rate (%)  Hardware throughput
PG           1        9               0.03
QuMCPG       8        13.3            0.25

(a) The number of simultaneous circuits
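The circuit counts quoted in this section can be reproduced with a short sketch. It groups the four deuteron Pauli strings with a greedy pass based on general commutation, which is an assumption standing in for the grouping procedure of [38], and then counts measurement circuits for four ansatz parameters. Function and variable names are illustrative only.

```python
def commute(a, b):
    """Two Pauli strings commute iff they clash (differ, both non-identity)
    on an even number of qubit positions."""
    clashes = sum(1 for x, y in zip(a, b) if x != "I" and y != "I" and x != y)
    return clashes % 2 == 0

def group_commuting(paulis):
    """Greedy grouping: place each string in the first family it fully commutes with."""
    families = []
    for p in paulis:
        for family in families:
            if all(commute(p, q) for q in family):
                family.append(p)
                break
        else:
            families.append([p])
    return families

deuteron_terms = ["ZI", "IZ", "XX", "YY"]
families = group_commuting(deuteron_terms)

n_parameters = 4                                     # ansatz parameters evaluated together
naive_circuits = n_parameters * len(deuteron_terms)  # one circuit per Pauli string
pg_circuits = n_parameters * len(families)           # one circuit per commuting family
print(families, naive_circuits, pg_circuits)
```

The greedy pass yields the two families {ZI, IZ} and {XX, YY}, so four parameters need eight circuits under PG and sixteen under naive measurement. Note that XX and YY commute as whole operators but not qubit by qubit, so measuring them together requires a joint basis change rather than independent single-qubit rotations.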

6.2 Multi-Programming and ZNE

As quantum error correction (QEC) requires a huge qubit overhead to implement, an alternative scheme named quantum error mitigation (QEM) was proposed for error suppression on NISQ devices. The zero-noise extrapolation (ZNE) method, introduced in [41], is one of the simplest QEM techniques based on error extrapolation. The basic idea is to first execute the circuit at different noise levels and then extrapolate an estimated error-free value. It can be implemented in two steps: (1) noise scaling and (2) extrapolation.

A digital ZNE approach was proposed in [42] to scale noise by increasing the number of gates or the circuit depth. A list of folded circuits with different circuit depths is generated, and we calculate their expectation values. This method only requires the programmer's gate-level access to the processor. There are several methods for error extrapolation, such as polynomial extrapolation, linear extrapolation, and Richardson extrapolation. However, the ZNE approach introduces the overhead of executing one circuit multiple times with various depths to extrapolate the noise-free expectation value.

Here, we demonstrate how to reduce this overhead by applying parallel circuit execution to the digital ZNE approach. In our experiment, we first use the fold_gates_at_random method from the Mitiq package [43]. It selects gates randomly and folds them to modify the circuit depth, which represents different noise levels. A list of folded circuits can be generated based on scale factors. Then, we execute these circuits simultaneously on IBM Q 65 Manhattan using the QuCP approach and obtain the expectation values corresponding to different noise levels. Finally, we apply various extrapolation methods integrated in Mitiq, including LinearFactory, PolyFactory, and RichardsonFactory, to calculate the estimated error-free result. One of the limitations of ZNE is that the extrapolation methods are sensitive to noise, so the extrapolated values depend strongly on the selected extrapolation method. Therefore, we only show the best estimated result among these methods, which is the result closest to the ideal result calculated by the simulator.

We generate four folded circuits with scale factors from 1 to 2.5 with a step of 0.5. Three processes are included for comparison: (1) Execute the independent circuit on the best partition selected by the QuCP method without the ZNE method (labeled as Baseline). (2) Execute the folded circuits simultaneously using QuCP to perform the ZNE method (labeled as QuCP+ZNE). (3) Execute the folded circuits independently to perform the ZNE method (labeled as ZNE).
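The extrapolation step can be illustrated with a small, dependency-free sketch. It is not Mitiq's implementation of LinearFactory or RichardsonFactory. It only shows the underlying arithmetic: a least-squares line, or the full interpolating polynomial, is fitted to the expectation values measured at the four scale factors and evaluated at scale 0. The expectation values below are synthetic and chosen so that the zero-noise value is known.

```python
def linear_extrapolate(scales, values):
    """Least-squares line through (scale, value) points, evaluated at scale 0."""
    n = len(scales)
    mean_x = sum(scales) / n
    mean_y = sum(values) / n
    sxx = sum((x - mean_x) ** 2 for x in scales)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(scales, values))
    slope = sxy / sxx
    return mean_y - slope * mean_x

def richardson_extrapolate(scales, values):
    """Value at scale 0 of the polynomial interpolating all points (Lagrange form)."""
    estimate = 0.0
    for i, (xi, yi) in enumerate(zip(scales, values)):
        weight = 1.0
        for j, xj in enumerate(scales):
            if j != i:
                weight *= xj / (xj - xi)   # Lagrange basis polynomial at x = 0
        estimate += weight * yi
    return estimate

scale_factors = [1.0, 1.5, 2.0, 2.5]   # as in the text: 1 to 2.5 with step 0.5

# Synthetic data: the noiseless value is 1.0 and decays with the noise scale.
linear_data = [1.0 - 0.2 * s for s in scale_factors]
curved_data = [1.0 - 0.2 * s + 0.05 * s ** 2 for s in scale_factors]

print(linear_extrapolate(scale_factors, linear_data))
print(richardson_extrapolate(scale_factors, curved_data))
print(linear_extrapolate(scale_factors, curved_data))
```

On the curved data the linear fit misses the true zero-noise value while the interpolating polynomial recovers it, which mirrors the remark above that the extrapolated value depends strongly on the chosen method. With real, shot-noisy data the higher-order fit can equally well be the worse one.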
The experimental results are shown in Fig. 18. The absolute error is the difference between the ideal expectation value calculated by the simulator and the obtained expectation value. According to the results, the baseline always has the largest error because no mitigation technique is applied. In most cases, ZNE gives the lowest error but requires multiple circuit executions. With the parallel circuit execution technique (QuCP+ZNE), the error can be decreased significantly compared to the baseline with the same number of circuit executions. In addition, the hardware throughput is improved and the overall runtime is reduced by a factor of three. On average, the error is reduced by 2x, and in the best case (benchmark alu-v0_27) it is reduced by 11x. Even though the ZNE method was designed to scale the noise levels on the same occupied qubits, the errors can still be mitigated significantly by enlarging the circuit depth on different qubit partitions. This reveals some underlying similarities between the errors of different qubits, which would be interesting to explore in future work.

Fig. 18 Comparison of error rate without mitigation and with mitigation by applying QuCP multi-programming to ZNE method. The plot shows the absolute error of Baseline, QuCP+ZNE, and ZNE for each benchmark

7 Discussion

Based on the experimental results, we found that the main concern with the multi-programming mechanism is the trade-off between output fidelity and hardware throughput, that is, how much fidelity we can sacrifice to gain the benefits of applying multi-programming to quantum applications. Here, we list several guidelines to help users benefit from the multi-programming mechanism.

• Check the target hardware topology and calibration data. The multi-programming mechanism is more suitable for a quantum chip that is large relative to the quantum circuit and has a low error rate.
• Choose an appropriate fidelity threshold for the post qubit partition process. A high threshold can improve the hardware throughput but reduces the output fidelity. It should be set carefully depending on the size of the benchmark. For the small benchmarks that we used in the experiments, it is reasonable to set the threshold to 0.1.
• The number of circuits that can be executed simultaneously mainly depends on the fidelity threshold and the calibration data of the hardware.
• It is recommended to apply multi-programming to variational algorithms to speed up the optimization process, and to ZNE for error mitigation.


8 Conclusion

As the size of quantum chips grows and the demand for access to them increases, how to use the hardware resources efficiently is becoming a concern. In this chapter, we presented two multi-programming compilers that enable parallel circuit execution on NISQ hardware. We analyzed the crosstalk impact on IBM quantum devices and mitigated it in our proposed multi-programming methods. We also investigated multi-programming on NISQ applications to demonstrate how it can be useful for near-term quantum computing.

Acknowledgments This work is funded by the QuantUM Initiative of the Region Occitanie, University of Montpellier and IBM Montpellier.

References

1. John Preskill. Quantum Computing in the NISQ era and beyond. Quantum, 2:79, August 2018.
2. A Robert Calderbank and Peter W Shor. Good quantum error-correcting codes exist. Physical Review A, 54(2):1098, 1996.
3. Poulami Das, Swamit S Tannu, Prashant J Nair, and Moinuddin Qureshi. A case for multi-programming quantum computers. In Proceedings of the 52nd Annual IEEE/ACM International Symposium on Microarchitecture, pages 291–303, 2019.
4. Lei Liu and Xinglei Dou. Qucloud: A new qubit mapping mechanism for multi-programming quantum computing in cloud environment. In 2021 IEEE International Symposium on High-Performance Computer Architecture (HPCA), pages 167–178. IEEE, 2021.
5. Prakash Murali, David C McKay, Margaret Martonosi, and Ali Javadi-Abhari. Software mitigation of crosstalk on noisy intermediate-scale quantum computers. In Proceedings of the Twenty-Fifth International Conference on Architectural Support for Programming Languages and Operating Systems, pages 1001–1016, 2020.
6. Abdullah Ash-Saki, Mahabubul Alam, and Swaroop Ghosh. Analysis of crosstalk in nisq devices and security implications in multi-programming regime. In Proceedings of the ACM/IEEE International Symposium on Low Power Electronics and Design, pages 25–30, 2020.
7. Abdullah Ash-Saki, Mahabubul Alam, and Swaroop Ghosh. Experimental characterization, modeling, and analysis of crosstalk in a quantum computer. IEEE Transactions on Quantum Engineering, 2020.
8. Jay M Gambetta, AD Córcoles, Seth T Merkel, Blake R Johnson, John A Smolin, Jerry M Chow, Colm A Ryan, Chad Rigetti, S Poletto, Thomas A Ohki, et al. Characterization of addressability by simultaneous randomized benchmarking. Physical Review Letters, 109(24):240504, 2012.
9. Yasuhiro Ohkura, Takahiko Satoh, and Rodney Van Meter. Simultaneous quantum circuits execution on current and near-future nisq systems. arXiv preprint arXiv:2112.07091, 2021.
10. Marco Cerezo, Andrew Arrasmith, Ryan Babbush, Simon C Benjamin, Suguru Endo, Keisuke Fujii, Jarrod R McClean, Kosuke Mitarai, Xiao Yuan, Lukasz Cincio, et al. Variational quantum algorithms. Nature Reviews Physics, 3(9):625–644, 2021.
11. Alberto Peruzzo, Jarrod McClean, Peter Shadbolt, Man-Hong Yung, Xiao-Qi Zhou, Peter J Love, Alán Aspuru-Guzik, and Jeremy L O'Brien. A variational eigenvalue solver on a photonic quantum processor. Nature Communications, 5:4213, 2014.
12. Abhinav Kandala, Antonio Mezzacapo, Kristan Temme, Maika Takita, Markus Brink, Jerry M Chow, and Jay M Gambetta. Hardware-efficient variational quantum eigensolver for small molecules and quantum magnets. Nature, 549(7671):242–246, 2017.
13. Carlos Bravo-Prieto, Ryan LaRose, Marco Cerezo, Yigit Subasi, Lukasz Cincio, and Patrick Coles. Variational quantum linear solver: A hybrid algorithm for linear systems. Bulletin of the American Physical Society, 65, 2020.
14. Vojtěch Havlíček, Antonio D Córcoles, Kristan Temme, Aram W Harrow, Abhinav Kandala, Jerry M Chow, and Jay M Gambetta. Supervised learning with quantum-enhanced feature spaces. Nature, 567(7747):209–212, 2019.
15. Elijah Pelofske, Georg Hahn, and Hristo N Djidjev. Parallel quantum annealing. arXiv preprint arXiv:2111.05995, 2021.
16. Pranav Mundada, Gengyan Zhang, Thomas Hazard, and Andrew Houck. Suppression of qubit crosstalk in a tunable coupling superconducting circuit. Physical Review Applied, 12(5):054023, 2019.
17. Peng Zhao, Peng Xu, Dong Lan, Ji Chu, Xinsheng Tan, Haifeng Yu, and Yang Yu. High-contrast ZZ interaction using superconducting qubits with opposite-sign anharmonicity. Physical Review Letters, 125(20):200503, 2020.
18. Alexander Erhard, Joel J Wallman, Lukas Postler, Michael Meth, Roman Stricker, Esteban A Martinez, Philipp Schindler, Thomas Monz, Joseph Emerson, and Rainer Blatt. Characterizing large-scale quantum computers via cycle benchmarking. Nature Communications, 10(1):1–7, 2019.
19. Radoslaw C Bialczak, Markus Ansmann, Max Hofheinz, Erik Lucero, Matthew Neeley, AD O'Connell, Daniel Sank, Haohua Wang, James Wenner, Matthias Steffen, et al. Quantum process tomography of a universal entangling gate implemented with josephson phase qubits. Nature Physics, 6(6):409–413, 2010.
20. Gushu Li, Yufei Ding, and Yuan Xie. Towards efficient superconducting quantum processor architecture design. In Proceedings of the Twenty-Fifth International Conference on Architectural Support for Programming Languages and Operating Systems, pages 1031–1045, 2020.
21. David C McKay, Andrew W Cross, Christopher J Wood, and Jay M Gambetta. Correlated randomized benchmarking. arXiv preprint arXiv:2003.02354, 2020.
22. Ang Li, Samuel Stein, Sriram Krishnamoorthy, and James Ang. Qasmbench: A low-level qasm benchmark suite for nisq evaluation and simulation. arXiv preprint arXiv:2005.13018, 2020.
23. Héctor Abraham et al. Qiskit: An open-source framework for quantum computing. https://qiskit.org/, 2019.
24. Andrew W Cross, Lev S Bishop, John A Smolin, and Jay M Gambetta. Open quantum assembly language. arXiv preprint arXiv:1707.03429, 2017.
25. Siyuan Niu and Aida Todri-Sanial. Analyzing crosstalk error in the nisq era. In 2021 IEEE Computer Society Annual Symposium on VLSI (ISVLSI), pages 428–430, 2021.
26. Siyuan Niu, Adrien Suau, Gabriel Staffelbach, and Aida Todri-Sanial. A hardware-aware heuristic for the qubit mapping problem in the nisq era. IEEE Transactions on Quantum Engineering, 1:1–14, 2020.
27. Swamit S Tannu and Moinuddin K Qureshi. Not all qubits are created equal: a case for variability-aware policies for nisq-era quantum computers. In Proceedings of the Twenty-Fourth International Conference on Architectural Support for Programming Languages and Operating Systems, pages 987–999, 2019.
28. Gushu Li, Yufei Ding, and Yuan Xie. Tackling the qubit mapping problem for nisq-era quantum devices. In Proceedings of the Twenty-Fourth International Conference on Architectural Support for Programming Languages and Operating Systems, pages 1001–1014, 2019.
29. Robert Wille, Lukas Burgholzer, and Alwin Zulehner. Mapping quantum circuits to ibm qx architectures using the minimal number of swap and h operations. In 2019 56th ACM/IEEE Design Automation Conference (DAC), pages 1–6. IEEE, 2019.
30. Prakash Murali, Jonathan M Baker, Ali Javadi-Abhari, Frederic T Chong, and Margaret Martonosi. Noise-adaptive compiler mappings for noisy intermediate-scale quantum computers. In Proceedings of the Twenty-Fourth International Conference on Architectural Support for Programming Languages and Operating Systems, pages 1015–1029, 2019.
31. Gian Giacomo Guerreschi and Jongsoo Park. Two-step approach to scheduling quantum circuits. Quantum Science and Technology, 3(4):045003, 2018.
32. Toshinari Itoko, Rudy Raymond, Takashi Imamichi, and Atsushi Matsuo. Optimization of quantum circuit mapping using gate transformation and commutation. Integration, 70:43–50, 2020.
33. Alwin Zulehner, Alexandru Paler, and Robert Wille. An efficient methodology for mapping quantum circuits to the ibm qx architectures. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, 38(7):1226–1236, 2018.
34. R. Wille, D. Große, L. Teuber, G. W. Dueck, and R. Drechsler. RevLib: An online resource for reversible functions and reversible circuits. In Int'l Symp. on Multi-Valued Logic, pages 220–225, 2008.
35. Alexander S Green, Peter LeFanu Lumsdaine, Neil J Ross, Peter Selinger, and Benoît Valiron. Quipper: a scalable quantum programming language. In Proceedings of the 34th ACM SIGPLAN Conference on Programming Language Design and Implementation, pages 333–342, 2013.
36. Ali J Abhari, Arvin Faruque, Mohammad J Dousti, Lukas Svec, Oana Catu, Amlan Chakrabati, Chen-Fu Chiang, Seth Vanderwilt, John Black, and Fred Chong. Scaffold: Quantum programming language. Technical report, Princeton Univ NJ Dept of Computer Science, 2012.
37. Yasuhiro Ohkura. Crosstalk-aware nisq multi-programming. Faculty Policy Manage., Keio Univ., Tokyo, Japan, 2021.
38. Pranav Gokhale, Olivia Angiuli, Yongshan Ding, Kaiwen Gui, Teague Tomesh, Martin Suchara, Margaret Martonosi, and Frederic T Chong. Optimization of simultaneous measurement for variational quantum eigensolver applications. In 2020 IEEE International Conference on Quantum Computing and Engineering (QCE), pages 379–390. IEEE, 2020.
39. Eugene F Dumitrescu, Alex J McCaskey, Gaute Hagen, Gustav R Jansen, Titus D Morris, T Papenbrock, Raphael C Pooser, David Jarvis Dean, and Pavel Lougovski. Cloud quantum computing of an atomic nucleus. Physical Review Letters, 120(21):210501, 2018.
40. Ophelia Crawford, Barnaby van Straaten, Daochen Wang, Thomas Parks, Earl Campbell, and Stephen Brierley. Efficient quantum measurement of pauli operators in the presence of finite sampling error. Quantum, 5:385, 2021.
41. Ying Li and Simon C Benjamin. Efficient variational quantum simulator incorporating active error minimization. Physical Review X, 7(2):021050, 2017.
42. Tudor Giurgica-Tiron, Yousef Hindy, Ryan LaRose, Andrea Mari, and William J Zeng. Digital zero noise extrapolation for quantum error mitigation. In 2020 IEEE International Conference on Quantum Computing and Engineering (QCE), pages 306–316. IEEE, 2020.
43. Ryan LaRose, Andrea Mari, Sarah Kaiser, Peter J Karalekas, Andre A Alves, Piotr Czarnik, Mohamed El Mandouh, Max H Gordon, Yousef Hindy, Aaron Robertson, et al. Mitiq: A software package for error mitigation on noisy quantum computers. Quantum, 6:774, 2022.

Side-Channel Leakage in Suzuki Stack Circuits

Yerzhan Mustafa and Selçuk Köse

1 Introduction

1.1 Superconducting Electronics and Quantum Computing

A Josephson junction (JJ) is a two-terminal device that consists of two electrodes made from superconductor material, which are separated by a weak link. This link weakens the superconductivity between the two layers and can be made of an insulating material, a non-superconducting metal, or a physical constriction [1]. The most widely used type is the superconductor-insulator-superconductor junction (e.g., Nb/AlOx/Nb) [2]. The switching behavior of a JJ can be classified as underdamped (βc ≫ 1) or overdamped (βc ≪ 1), where βc is the Stewart-McCumber parameter [2]. This parameter is often controlled with an external resistor connected in parallel to the JJ.

Single-flux quantum (SFQ) logic is the most common type of superconducting digital electronics [2–5]. SFQ logic consists of overdamped JJs and inductances. When the current flowing through a JJ exceeds a certain critical value (Ic), a single magnetic flux quantum is produced in the form of a voltage pulse with an area of Φ0 = 2.07 × 10^−15 Wb. In SFQ logic, the presence of a pulse represents a logical '1' and the absence of a pulse represents a logical '0'. Due to the short duration of this voltage pulse, which is on the order of a few ps, SFQ logic can operate at frequencies of tens to hundreds of GHz. Furthermore, the switching of SFQ logic consumes little energy, on the order of zJ (zeptojoules).
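To make these orders of magnitude concrete, the following back-of-the-envelope sketch derives the flux quantum from the defining SI constants and estimates the average pulse voltage and the switching energy. The pulse duration of 2 ps and the critical current of 100 µA are assumed illustrative values, not figures from this chapter, and E ≈ Ic·Φ0 is the usual order-of-magnitude estimate rather than a precise circuit result.

```python
PLANCK_H = 6.62607015e-34            # Planck constant, J*s (exact SI value)
ELEMENTARY_CHARGE = 1.602176634e-19  # elementary charge, C (exact SI value)

# Magnetic flux quantum, Phi_0 = h / (2e), in Wb (equivalently V*s).
PHI_0 = PLANCK_H / (2 * ELEMENTARY_CHARGE)

def mean_pulse_voltage(duration_s):
    """Average voltage of a pulse whose time integral equals one flux quantum."""
    return PHI_0 / duration_s

def switching_energy(critical_current_a):
    """Order-of-magnitude energy of one junction switching event, E ~ Ic * Phi_0."""
    return critical_current_a * PHI_0

voltage = mean_pulse_voltage(2e-12)   # assumed 2 ps pulse
energy = switching_energy(100e-6)     # assumed Ic = 100 uA

print(f"Phi_0  = {PHI_0:.4e} Wb")
print(f"<V>    = {voltage * 1e3:.2f} mV")
print(f"energy = {energy / 1e-21:.0f} zJ")
```

For these assumed inputs the pulse averages about a millivolt and each switching event costs a few hundred zeptojoules. Smaller critical currents reduce the energy proportionally.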

Y. Mustafa · S. Köse
Department of Electrical and Computer Engineering, University of Rochester, Rochester, NY, USA
e-mail: [email protected]; [email protected]

© The Author(s), under exclusive license to Springer Nature Switzerland AG 2024
H. Thapliyal, T. Humble (eds.), Quantum Computing, https://doi.org/10.1007/978-3-031-37966-6_3


To achieve superconductivity in Nb-based JJs, the circuit needs to be cooled below 9.3 K. A common setup uses liquid helium, which boils at 4.2 K. Although the refrigeration cost of SFQ circuits is about a thousand times larger than the energy dissipated at 4.2 K [6], the overall power consumption of superconducting electronics can still be significantly lower than that of complementary metal–oxide–semiconductor (CMOS) technology when computing is performed at a large scale. Therefore, superconducting electronics is a mature and natural candidate for beyond-CMOS technology [7], with applications in data centers, cloud computing, and space electronics.

SFQ circuits can also serve as in-fridge controllers for superconducting quantum computers. Compared to classical in-fridge controllers based on cryogenic CMOS (cryo-CMOS) technology [8], SFQ-based controllers can significantly increase scalability (i.e., qubit count) because of their higher speed and lower power dissipation. For example, an SFQ-based digital controller for quantum computers called DigiQ has been proposed in [9]. DigiQ can potentially support a quantum computer housing more than 42,000 qubits.

1.2 Interface Circuits

Compared to conventional CMOS technology, superconducting electronics currently has a less mature integrated circuit (IC) fabrication technology. For example, the integration level is reported to be 13 million JJs per cm$^2$ in the 150 nm MIT Lincoln Lab process [10]. As a result, implementing large-scale SFQ-based memory is difficult due to the low integration density [11, 12]. Therefore, there is a need to interface superconducting electronics with a more mature semiconductor technology (e.g., CMOS).

A superconductor-semiconductor interface circuit connects SFQ logic with CMOS technology [13]. Such interface circuits make the greater integration density of CMOS memory available; the memory can be located at 4.2 K or at higher temperatures such as 77 K (liquid nitrogen) or room temperature. For instance, a Josephson-CMOS memory has been presented in [14]. Various types of superconductor-semiconductor interface circuits are available in the literature, such as the superconducting quantum interference device (SQUID) stack [15, 16], the SFQ-to-DC converter [3, 17], the nanocryotron (nTron) [18, 19], and the Suzuki stack [13, 20–22]. These circuits differ in output drive capability, switching speed, area, power consumption, and fabrication technology. This chapter focuses on certain security vulnerabilities of Suzuki stack circuits, which are discussed in detail in Sect. 2.


1.3 Hardware Security and Side-Channel Attacks

The research field of hardware security aims to protect a physical device or circuit from unauthorized malicious activities including intellectual property (IP) piracy, overproduction, hardware Trojans, reverse engineering, covert communication, and side-channel attacks. A security vulnerability that stems from hardware is typically much more difficult to mitigate than a software vulnerability, especially when it is uncovered after the device is on the market or operational in the field. This difference makes hardware security an important design consideration during the IC design process along with power, performance, area, and cost constraints.

A side-channel attack is a type of physical attack that exploits different types of hardware security vulnerabilities. The key to a successful side-channel attack is the collection and processing of physical leakage that is highly correlated with the sensitive information stored or processed in the circuit under attack. Depending on the type of leakage, these attacks can be classified into power, electromagnetic, optical, acoustic, and timing side-channel attacks (Fig. 1).

Let us consider a circuit that runs a cryptographic algorithm such as the Advanced Encryption Standard (AES) or Rivest–Shamir–Adleman (RSA). From a mathematical perspective, these algorithms are considered secure because no polynomial-time attack on them is known for classical (non-quantum) computers. For example, to break the RSA algorithm (i.e., reveal the secret key), an adversary should be able to factor the product of two large prime numbers, which requires super-polynomial time with the best known classical algorithms [23]. However, since a cryptographic algorithm is realized in a physical device as shown in Fig. 1, there could be a dependence of, e.g., power consumption on the secret key. By measuring the power traces, it is possible to reveal the secret key either by visual inspection (simple power analysis) or with statistical tools (differential power analysis and correlation power analysis) [24].

Fig. 1 Types of side-channel attacks


Various countermeasures against power and electromagnetic side-channel attacks on CMOS circuits have been proposed in the literature. Many of them redesign CMOS voltage regulators such as the buck converter [25], the low-dropout regulator [26, 27], and the switched-capacitor converter [28–31]. Additionally, several works propose injecting noise to hide the side-channel leakage signals [32, 33].

Hardware security in superconducting digital electronics is an emerging research area. Recently, a small number of techniques such as camouflaging and logic locking have been proposed [34–37] to protect SFQ circuits against supply chain vulnerabilities such as IP piracy, reverse engineering, and overproduction. This chapter uncovers a potential side-channel leakage vulnerability in superconductor-semiconductor interface circuits. In particular, the feasibility and security implications of power side-channel leakage in the Suzuki stack circuit are demonstrated both analytically and with simulations in the following sections.

2 Suzuki Stack Circuits

A Suzuki stack (also known as a Josephson latching driver) is a superconductor-semiconductor interface circuit, first proposed in 1988 by H. Suzuki et al. [20] and still widely used in Josephson-CMOS hybrid memory [14, 38, 39]. A Suzuki stack circuit converts SFQ pulses into a DC signal of tens of mV that can then be fed into a CMOS amplifier. A 64-kb hybrid Josephson-CMOS random access memory (RAM) operating at 4 K has been demonstrated in [14]. In this system, a Suzuki stack circuit is designed to produce an output voltage of 60 mV with an average power dissipation of 165 μW.

The underdamped JJ is the basic building block of a Suzuki stack circuit. Underdamped JJs were widely used in latching logic, which was actively developed in the 1980s and 1990s [40–45]. In an underdamped JJ, once the current flowing through the device exceeds the critical current $I_c$, the JJ switches from the superconducting to the resistive state. The voltage across the JJ in the resistive state is referred to as the gap voltage $V_g$. The I–V (current-voltage) characteristic of an underdamped JJ is shown in Fig. 2. As shown in Fig. 2, underdamped JJs exhibit hysteretic behavior: once the JJ is in the resistive state, it remains in this state even if the current becomes smaller than the critical current. The superconducting state is restored by reducing the current to a near-zero value (in Fig. 2, around 10% of the critical current).

The comparatively high output voltage is the primary advantage of a Suzuki stack circuit over the SQUID stack [15, 16] and the SFQ-to-DC converter [3, 17]. Additionally, a Suzuki stack occupies less area (a few JJs) and operates faster than a SQUID stack. The major drawback of a Suzuki stack circuit is the AC bias needed to switch the underdamped JJs back into the superconducting state. The schematic of a Suzuki stack cell is depicted in Fig. 3 (within a dashed rectangle), which consists of two series arrays of underdamped JJs connected in

Fig. 2 I–V characteristic of an underdamped JJ with a critical current of 400 μA (axes: $V$ in mV versus $I/I_c$; the gap voltage is marked as $V_g = 2.8$ mV). The JJ is simulated in the MIT-LL SFQ5ee 10 kA/cm$^2$ process. Underdamped behavior is achieved by connecting in parallel (shunt) resistance [13]

Fig. 3 Schematic of $N$ Suzuki stack cells (diagram labels: $V_s$, $R_s$, $I_s$, and $Z_0$ at room temperature; $V_b$, $R_{match}$, and per-cell $I_{b1}$ … $I_{bN}$, $R_b$, In1 … InN, Out1 … OutN at 4.2 K; one Suzuki stack cell is outlined). The parasitic inductances are not shown. Parameters: $R_s = 1$ Ω, $R_b = 750$ Ω, 400 μA critical current of JJs, 4 Ω branch resistance, 200 pH bias inductance. For $N = 3$, $R_{match} = 62.5$ Ω. MIT-LL SFQ5ee 10 kA/cm$^2$ process. JJs are shunted with a 39 Ω resistor to achieve underdamped behavior [13]

Fig. 4 Operation of a Suzuki stack cell with $m = 16$ (panels: $V_b$, $V_{in}$, and $V_{out}$ in mV and $I_b$ in μA versus time from 0 to 0.8 ns; the $I_b$ trace is annotated with 664.9 μA and 607.1 μA). The input signal is generated with $V_{in}$ applied to a 50 Ω input resistance [13]. The output is modeled as a lumped RLC load (50 Ω, 100 pH, 180 fF) to represent the next CMOS amplifier stage

parallel. The operation of a Suzuki stack cell is depicted in Fig. 4. Before the input signal arrives, all JJs are biased at around 80% of the critical current (i.e., in the superconducting state). When the input signal arrives, the current through the lowermost JJ in the left branch exceeds the critical current, forcing it to switch into the resistive state. Since the left and right branches are no longer symmetric after this JJ becomes resistive, most of the bias current $I_b$ starts to flow through the right branch. Hence, the entire right branch is switched into the resistive state. After that, a portion of the bias current flows back to the left branch, where the remaining JJs are finally switched into the resistive state. As a result, an output voltage of $V_{out} \approx m V_g$ is produced, where $m$ is the number of JJs connected in series and $V_g$ is the gap voltage of a JJ. For $m = 16$ and $V_g = 2.8$ mV, the output voltage is approximately 44.8 mV.

The output voltage remains high even after the input signal becomes low (Fig. 4). This behavior can be explained by examining Fig. 2. When the input signal is low, all JJs are biased at approximately 80% of the critical current, so the voltage across each JJ drops only slightly below the gap voltage. Finally, the output voltage resets to low when the bias signal is low (i.e., the bias current is approximately zero and all JJs are reset to the superconducting state).

In a practical setup, multiple Suzuki stack cells are connected in parallel and biased with room temperature electronics via a coaxial cable [14], as shown in Fig. 3. The matching resistor $R_{match}$ is added to minimize reflections in the coaxial cable.


3 Threat Model

By analyzing the bias current $I_b$ waveform of a single Suzuki stack cell in Fig. 4, one can notice that the bias current changes with the applied input signal: $I_b \approx 664.9$ μA while the cell is not switched and $I_b \approx 607.1$ μA after it switches. When a Suzuki stack cell receives a logical ‘0’, the bias current can be approximately expressed as

$$I_{b1} = \frac{V_b}{R_b}, \tag{1}$$

where the branch resistances can be neglected because they are much smaller than the bias resistance $R_b$ (see Fig. 3 for notation and parameter values). Similarly, when a Suzuki stack cell receives a logical ‘1’, the bias current is given by

$$I_{b2} = \frac{V_b - V_{out}}{R_b}. \tag{2}$$

Accordingly, the variation in the bias current can be expressed as

$$\Delta I_b = I_{b1} - I_{b2} = \frac{V_{out}}{R_b}. \tag{3}$$

The above analytical expressions are in good agreement with the simulation results shown in Fig. 4: with $V_b = 500$ mV, $R_b = 750$ Ω, and $V_{out} \approx 44.8$ mV, (1) and (2) give approximately 666.7 and 606.9 μA, compared with the simulated 664.9 and 607.1 μA. Therefore, it can be concluded that the variation in the bias current is caused by the high output voltage ($V_{out}$) of the Suzuki stack cell. Following a similar analysis for $N$ Suzuki stack cells (Fig. 3), the variation in the total supply current can be written as

$$\Delta I_s = I_{s,0} - I_{s,k} = \frac{k R_{match} V_{out}}{R_b R_{match} + R_b R_s + N R_{match} R_s}, \tag{4}$$

where $k$ is the number of switched Suzuki stack cells, $R_s$ is the output impedance of the supply voltage source, and $I_{s,i}$ is the supply current when $i$ cells are switched.

In our threat model, let us assume that an attacker has access to the room temperature electronics. By using a high-sensitivity current probe (e.g., Keysight N2820A with 50 μA resolution), the variation in the supply current $\Delta I_s$ can be measured. As can be observed from (4), $\Delta I_s$ has a linear dependence on $k$. Therefore, the Hamming weight of the input bits to the Suzuki stack cells can be leaked (i.e., determined by the attacker). The Hamming weight of a string of bits is the total number of logical 1’s. For example, the Hamming weight of the bit string ‘01100001’ is equal to 3. By obtaining the Hamming weight, an attacker can implement various malicious attacks that are discussed in Sect. 5.
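As a rough numerical illustration of (3) and (4), the following Python sketch evaluates both expressions with the parameter values quoted in the caption of Fig. 3 and $V_{out} \approx m V_g = 44.8$ mV. The function names and the rounding-based Hamming weight estimator are assumptions made for illustration only; they are not part of the chapter's simulation flow.

```python
# Sketch of Eqs. (3) and (4) using the parameter values from the Fig. 3 caption.
# Function names and the estimator below are illustrative assumptions.

R_S = 1.0             # output impedance of the supply source (ohm)
R_B = 750.0           # bias resistance per cell (ohm)
R_MATCH = 62.5        # matching resistor for N = 3 (ohm)
N_CELLS = 3           # number of Suzuki stack cells
V_OUT = 16 * 2.8e-3   # V_out ~ m * V_g for m = 16 and V_g = 2.8 mV (V)


def delta_i_bias(v_out=V_OUT, r_b=R_B):
    """Eq. (3): change in the bias current of one cell when it switches."""
    return v_out / r_b


def delta_i_supply(k, n=N_CELLS, v_out=V_OUT, r_b=R_B, r_match=R_MATCH, r_s=R_S):
    """Eq. (4): change in the supply current when k of n cells are switched."""
    return k * r_match * v_out / (r_b * r_match + r_b * r_s + n * r_match * r_s)


def estimate_hamming_weight(measured_delta_i):
    """Hypothetical estimator: nearest integer multiple of the one-cell step."""
    return round(measured_delta_i / delta_i_supply(1))


if __name__ == "__main__":
    print(f"Delta I_b per cell: {delta_i_bias() * 1e6:.1f} uA")
    for k in range(N_CELLS + 1):
        print(f"k = {k}: Delta I_s = {delta_i_supply(k) * 1e6:.1f} uA")
    # Simulated supply current variations reported in the chapter for k = 1, 2, 3
    for measured in (57e-6, 113e-6, 170e-6):
        print(f"{measured * 1e6:.0f} uA -> k = {estimate_hamming_weight(measured)}")
```

With these values, the step per switched cell evaluates to roughly 58.6 μA, so the three simulated current variations map back to Hamming weights 1, 2, and 3. A real attack would additionally have to cope with probe noise and calibration, which this sketch ignores.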


Because an AC supply voltage is used (Fig. 4), the input bits from different clock cycles can be easily distinguished. Therefore, there is no need to align the measured power traces, which makes the attack simpler to implement. Since Suzuki stack circuits operate at frequencies on the order of a few GHz, the bandwidth of the current probe could be a limiting factor. In this case, the attacker can potentially decrease the frequency by entering a low-frequency test mode.

4 Side-Channel Leakage Simulation

4.1 Modeling Coaxial Cable

To demonstrate the side-channel leakage in a practical setup, as shown in Fig. 3, a commercial coaxial cable RG58BU/CU is considered. This cable has a characteristic impedance $Z_0 = 50$ Ω and a capacitance per unit length $C = 100$ pF/m. Since Suzuki stack cells operate at high frequencies and the cable length is on the order of a few meters, the coaxial cable can be modeled as a lossless transmission line with the characteristic impedance

$$Z_0 = \sqrt{\frac{L}{C}}. \tag{5}$$

From that, the corresponding inductance per unit length can be calculated as

$$L = Z_0^2 C = 250 \text{ nH/m}. \tag{6}$$

For the cable length $l = 3.4$ m, which has been used in [46], the propagation delay can be determined as

$$t_p = l \sqrt{LC} = 17 \text{ ns}. \tag{7}$$
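The arithmetic behind (6) and (7) can be checked with a few lines of Python. This is only a sketch of the lossless-line formulas using the cable values quoted above; the constant and function names are illustrative.

```python
import math

Z0 = 50.0           # characteristic impedance (ohm)
C_PER_M = 100e-12   # capacitance per unit length (F/m)
LENGTH = 3.4        # cable length (m)


def inductance_per_meter(z0, c_per_m):
    """Eq. (6): L = Z0^2 * C, obtained by solving Eq. (5) for L."""
    return z0 ** 2 * c_per_m


def propagation_delay(length, l_per_m, c_per_m):
    """Eq. (7): t_p = l * sqrt(L * C) for a lossless transmission line."""
    return length * math.sqrt(l_per_m * c_per_m)


L_PER_M = inductance_per_meter(Z0, C_PER_M)
T_P = propagation_delay(LENGTH, L_PER_M, C_PER_M)

if __name__ == "__main__":
    print(f"L   = {L_PER_M * 1e9:.0f} nH/m")
    print(f"t_p = {T_P * 1e9:.1f} ns")
```

The delay per unit length is $\sqrt{LC} = 5$ ns/m, which corresponds to a propagation velocity of $2 \times 10^8$ m/s.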

4.2 Simulation Results

The operation of three Suzuki stack cells connected to the coaxial cable model is shown in Fig. 5. It takes 17 ns for $V_s$ to propagate through the cable. Three consecutive inputs are applied, one to each Suzuki stack cell. When 1, 2, or 3 Suzuki stack cells are switched by the corresponding input signals, $\Delta I_s$ becomes 57, 113, and 170 μA, respectively. From Table 1, it can be observed that the simulation results are in good agreement with the analytical model (4), demonstrating a linear dependence on $k$ (i.e., the Hamming weight of the input bits). The error is around 1–5 μA. Detecting these current variations with a 50 μA resolution current probe can therefore be assumed feasible.

Table 1 Analytical and simulated variation of the supply current $\Delta I_s$ for different Hamming weight values

Hamming weight of input bits, k    Analytical ΔI_s (μA)    Simulated ΔI_s (μA)
1                                  58                      57
2                                  116                     113
3                                  175                     170

Fig. 5 Operation of three Suzuki stack cells connected to $V_s$ via a coaxial cable (panels: $V_s$, $V_b$, $V_{in1}$–$V_{in3}$, and $V_{out1}$–$V_{out3}$ in mV and $I_s$ in mA versus time in ns; the $I_s$ trace is annotated with 0, 1, 2, and 3 switched cells). The lowermost plot shows the total number of cells that are switched

4.3 Effect of Inductive Noise Coupling

Suzuki stack circuits usually interface SFQ logic and CMOS memory [14]. While the CMOS memory is fabricated on a separate die, the Suzuki stack and SFQ circuits are typically integrated on the same die. As a result, the inductive noise coupling between these two circuits should be considered. Let us assume that the bias line of the Suzuki stack cells is coupled to the SFQ clock signal with a mutual inductance of 3 pH [47]. Based on the simulation results, each SFQ pulse on the clock line causes a transient peak-to-peak variation in the supply current of around 19–20 μA. However, by measuring the average supply current (i.e., $I_{s,0}$ and $I_{s,k}$, see (4)), the effect of noise on the side-channel leakage becomes negligible. This is because the noise is added to both $I_{s,0}$ and $I_{s,k}$ and is canceled out by taking their difference (i.e., $\Delta I_s$).

4.4 Margin Analysis

In our setup, $R_s = 1$ Ω, which can be controlled by the attacker. From that, (4) can be simplified as

$$\Delta I_s = \frac{k R_{match} V_{out}}{R_b R_{match} + R_b R_s + N R_{match} R_s} \approx \frac{k R_{match} V_{out}}{R_b R_{match}} = \frac{k V_{out}}{R_b}. \tag{8}$$

Therefore, when $R_s$ is small, the variation in the supply current $\Delta I_s$ is nearly proportional to $V_{out}$ and inversely proportional to $R_b$. Additionally, the effect of the matching resistor $R_{match}$ on the side-channel leakage can be neglected. The simulated bias margins of a single Suzuki stack cell with the parameters shown in Fig. 3 are found to be $(-21\%, +20\%)$ for $V_b = 500$ mV. Across these margins, both $V_{out}$ and $\Delta I_s$ vary by less than $\pm 3\%$. This observation makes the side-channel leakage more robust against process variations.
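To see how good the small-$R_s$ approximation in (8) is, the sketch below compares it with the full expression (4). Dividing (8) by (4) shows that the relative overestimate is $(R_b R_s + N R_{match} R_s)/(R_b R_{match})$, independent of $k$ and $V_{out}$. The function names and the swept $R_s$ values are illustrative assumptions.

```python
def delta_i_exact(k, n, v_out, r_b, r_match, r_s):
    """Eq. (4): supply current variation for k switched cells out of n."""
    return k * r_match * v_out / (r_b * r_match + r_b * r_s + n * r_match * r_s)


def delta_i_approx(k, v_out, r_b):
    """Eq. (8): small-R_s approximation of Eq. (4)."""
    return k * v_out / r_b


def relative_overestimate(n, r_b, r_match, r_s):
    """(approx - exact) / exact, obtained by dividing Eq. (8) by Eq. (4)."""
    return (r_b * r_s + n * r_match * r_s) / (r_b * r_match)


if __name__ == "__main__":
    # R_b, R_match, N from the Fig. 3 caption; the R_s sweep is hypothetical.
    for r_s in (0.1, 1.0, 10.0):
        err = relative_overestimate(3, 750.0, 62.5, r_s)
        print(f"R_s = {r_s:5.1f} ohm: Eq. (8) overestimates Eq. (4) by {err * 100:.1f}%")
```

For the chapter's setup ($R_s = 1$ Ω, $R_b = 750$ Ω, $R_{match} = 62.5$ Ω, $N = 3$), the overestimate is 2%, which supports treating $R_{match}$ and $R_s$ as second-order effects.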

5 Exploiting Side-Channel Leakage

Once an attacker knows the Hamming weight of the input bits of the Suzuki stack cells, various attacks become possible. Let us consider an example where the encryption key is fed to the Suzuki stack cells. To reveal an $n$-bit encryption key, the attacker can implement a brute-force search that would require guessing up to $2^n$ possible key values. As the number of bits $n$ increases, this type of attack becomes infeasible since the number of key guesses grows exponentially. By knowing the Hamming weight of the encryption key, the number of key guesses can be significantly reduced. For example, in the Data Encryption Standard (DES) algorithm, which has a 56-bit key size, the brute-force search can be reduced from $2^{56}$ to approximately $2^{45}$ (and to $2^{38}$ if the parity bits are not used during the encryption) [48].

As mentioned in Sect. 2, Suzuki stack cells are usually used in Josephson-CMOS memory, which can serve as a random-access memory (RAM) for a superconductive processor [39]. The Suzuki stack cells are switched when the processor writes certain data into memory and/or reads a specific address cell. The latter case can be used to detect a memory access when the data is being evicted/removed from the cache. Memory access detection has been used in cache side-channel attacks, which are a part of well-known attacks such as Meltdown [49] and Spectre [50]. Additionally, more sophisticated attacks could be developed that correlate the Hamming weights of the written data bits and/or the address bits supplied during the read operation.
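The reduction of the search space can be illustrated with a simple counting argument: among all $n$-bit strings, exactly $\binom{n}{w}$ have Hamming weight $w$. The Python sketch below computes this count. It is a generic illustration under the assumption that the weight of a whole word is leaked at once, and it is not an attempt to reproduce the DES figures quoted from [48]; the function names are illustrative.

```python
import math


def keys_with_weight(n_bits, weight):
    """Number of n-bit strings that contain exactly `weight` ones."""
    return math.comb(n_bits, weight)


def search_space_bits(n_bits, weight):
    """log2 of the remaining search space once the Hamming weight is known."""
    return math.log2(keys_with_weight(n_bits, weight))


if __name__ == "__main__":
    for w in (0, 1, 4, 8):
        print(f"8-bit word, weight {w}: {keys_with_weight(8, w)} of {2 ** 8} candidates")
    # Even the worst case (weight n/2) removes some candidates.
    print(f"56-bit word, weight 28: about 2^{search_space_bits(56, 28):.1f} candidates")
```

The reduction is largest for very low or very high weights and smallest when the weight is close to $n/2$. Leaking the weights of several shorter words separately constrains the key more strongly than leaking one overall weight.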

6 Conclusion

Superconductor-semiconductor interface circuits connect superconducting electronics (SFQ logic) with CMOS circuits in order to exploit the benefits of both technologies. The emerging applications of SFQ logic in large-scale quantum computer control, data centers, and cloud computing make hardware security an important aspect along with power, performance, area, and cost constraints. This chapter uncovers a new side-channel leakage mechanism in Suzuki stack circuits. The linear dependence between the supply current variation and the Hamming weight of the input bits of Suzuki stack circuits is verified with mathematical analysis and simulation results. Additionally, the effects of inductive coupling noise and bias margins are discussed. By exploiting the proposed side-channel leakage vulnerability, an attacker could implement various malicious activities that may compromise the security of encryption algorithms and of read (write) operations from (to) the main memory. By using the proposed analytical model of the side-channel leakage in Suzuki stack circuits, a circuit designer can select appropriate parameters to minimize the variation in the supply current.

References

1. V. Lacquaniti, C. Cassiago, N. De Leo, M. Fretto, P. Durandetto, E. Zhitlukhina, M. Belogolovskii, IEEE Transactions on Applied Superconductivity 27(4) (2017). Art. no. 2400205
2. A.I. Braginski, Journal of Superconductivity and Novel Magnetism 32(1), 23 (2019)
3. K.K. Likharev, V.K. Semenov, IEEE Transactions on Applied Superconductivity 1(1), 3 (1991)
4. J.X. Przybysz, D.L. Miller, H. Toepfer, O. Mukhanov, J. Lisenfeld, M. Weides, H. Rotzinger, P. Febvre, Applied Superconductivity: Handbook on Devices and Applications pp. 1111–1206 (2015)
5. G. Krylov, E.G. Friedman, Single Flux Quantum Integrated Circuit Design (Springer, 2022)
6. D.S. Holmes, A.L. Ripple, M.A. Manheimer, IEEE Transactions on Applied Superconductivity 23(3) (2013). Art. no. 1701610
7. R.K. Cavin, P. Lugli, V.V. Zhirnov, Proceedings of the IEEE 100(Special Centennial Issue), 1720 (2012)
8. J.P.G. Van Dijk, B. Patra, S. Subramanian, X. Xue, N. Samkharadze, A. Corna, C. Jeon, F. Sheikh, E. Juarez-Hernandez, B.P. Esparza, et al., IEEE Journal of Solid-State Circuits 55(11), 2930 (2020)
9. M.R. Jokar, R. Rines, G. Pasandi, H. Cong, A. Holmes, Y. Shi, M. Pedram, F.T. Chong, in IEEE International Symposium on High-Performance Computer Architecture (HPCA) (IEEE, 2022), pp. 400–414
10. S.K. Tolpygo, V. Bolkhovsky, R. Rastogi, S. Zarr, E. Golden, T.J. Weir, L.M. Johnson, V.K. Semenov, M.A. Gouker, in Proceedings of Applied Superconductivity Conference (2020), pp. 1–29
11. S. Nagasawa, Y. Hashimoto, H. Numata, S. Tahara, IEEE Transactions on Applied Superconductivity 5(2), 2447 (1995)
12. S. Nagasawa, T. Satoh, K. Hinode, Y. Kitagawa, M. Hidaka, IEEE Transactions on Applied Superconductivity 17(2), 177 (2007)
13. T. Ortlepp, L. Zheng, S. Whiteley, T. Van Duzer, Superconductor Science and Technology 26(3) (2013). Art. no. 035007
14. T. Van Duzer, L. Zheng, S. Whiteley, H. Kim, J. Kim, X. Meng, T. Ortlepp, IEEE Transactions on Applied Superconductivity 23(3) (2013). Art. no. 1700504
15. Y. Hashimoto, S. Yorozu, T. Miyazaki, Y. Kameda, H. Suzuki, N. Yoshikawa, IEEE Transactions on Applied Superconductivity 17(2), 546 (2007)
16. Q.P. Herr, D.L. Miller, A.A. Pesetski, J.X. Przybysz, IEEE Transactions on Applied Superconductivity 17(2), 565 (2007)
17. T. Ortlepp, S. Wuensch, M. Schubert, P. Febvre, B. Ebert, J. Kunert, E. Crocoll, H.G. Meyer, M. Siegel, F.H. Uhlmann, IEEE Transactions on Applied Superconductivity 19(1), 28 (2009)
18. A.N. McCaughan, K.K. Berggren, Nano Letters 14(10), 5748 (2014)
19. Q.Y. Zhao, A.N. McCaughan, A.E. Dane, K.K. Berggren, T. Ortlepp, Superconductor Science and Technology 30(4) (2017). Art. no. 044002
20. H. Suzuki, A. Inoue, T. Imamura, S. Hasuo, in IEEE International Electron Devices Meeting (1988), pp. 290–293
21. Y. Mustafa, S. Köse, IEEE Transactions on Applied Superconductivity 32(8) (2022). Art. no. 1301407
22. Y. Mustafa, S. Köse, IEEE Transactions on Applied Superconductivity 33(2) (2023). Art. no. 1300306
23. Y. Yacobi, Z. Shmuely, in Conference on the Theory and Application of Cryptology (Springer, 1989), pp. 344–355
24. S. Mangard, E. Oswald, T. Popp, Power analysis attacks: Revealing the secrets of smart cards (Springer, 2007)
25. M. Kar, A. Singh, S. Mathew, A. Rajan, V. De, S. Mukhopadhyay, in IEEE International Solid-State Circuits Conference (ISSCC) (IEEE, 2017), pp. 142–143
26. A. Singh, M. Kar, S. Mathew, A. Rajan, V. De, S. Mukhopadhyay, in IEEE International Solid-State Circuits Conference (ISSCC) (IEEE, 2019), pp. 404–406
27. A. Singh, M. Kar, V.C.K. Chekuri, S.K. Mathew, A. Rajan, V. De, S. Mukhopadhyay, IEEE Journal of Solid-State Circuits 55(2), 478 (2020)
28. O.A. Uzun, S. Köse, IEEE Journal on Emerging and Selected Topics in Circuits and Systems 4(2), 169 (2014)
29. W. Yu, O.A. Uzun, S. Köse, in Proceedings of the 52nd Annual Design Automation Conference (DAC) (2015), pp. 1–6
30. W. Yu, S. Köse, IEEE Transactions on Circuits and Systems I: Regular Papers 63(8), 1152 (2016)
31. W. Yu, S. Köse, IEEE Transactions on Emerging Topics in Computing 6(2), 244 (2018)
32. X. Wang, W. Yueh, D.B. Roy, S. Narasimhan, Y. Zheng, S. Mukhopadhyay, D. Mukhopadhyay, S. Bhunia, in Proceedings of the 50th Annual Design Automation Conference (2013), pp. 1–9
33. W. Yu, S. Köse, in IEEE International Symposium on Circuits and Systems (ISCAS) (IEEE, 2017), pp. 1–4
34. H. Kumar, T. Jabbari, G. Krylov, K. Basu, E.G. Friedman, R. Karri, IEEE Transactions on Applied Superconductivity 30(3) (2020). Art. no. 1700213
35. T. Jabbari, G. Krylov, E.G. Friedman, IEEE Transactions on Applied Superconductivity 31(5) (2021). Art. no. 1301605
36. Y. Mustafa, T. Jabbari, S. Köse, IEEE Transactions on Applied Superconductivity 32(3) (2022). Art. no. 1300708
37. T. Jabbari, Y. Mustafa, E.G. Friedman, S. Köse, in Design Automation of Quantum Computers (Springer, 2023), pp. 135–165
38. G. Konno, Y. Yamanashi, N. Yoshikawa, IEEE Transactions on Applied Superconductivity 27(4) (2017). Art. no. 1300607
39. Y. Hironaka, Y. Yamanashi, N. Yoshikawa, IEEE Transactions on Applied Superconductivity 30(7) (2020). Art. no. 1301206
40. W. Anacker, IBM Journal of Research and Development 24(2), 107 (1980)
41. Y. Tarutani, M. Hirano, U. Kawabe, Proceedings of the IEEE 77(8), 1164 (1989)
42. Y. Hatano, S. Yano, H. Mori, H. Yamada, M. Hirano, U. Kawabe, IEEE Journal of Solid-State Circuits 24(5), 1312 (1989)
43. S. Kotani, T. Imamura, S. Hasuo, IEEE Journal of Solid-State Circuits 25(1), 117 (1990)
44. S. Kotani, A. Inoue, T. Imamura, S. Hasuo, in 37th IEEE International Conference on Solid-State Circuits (1990), pp. 148–149
45. S. Hasuo, T. Imamura, Proceedings of the IEEE 77(8), 1177 (1989)
46. J.X. Przybysz, D. Miller, S. Martinet, J. Kang, A.H. Worsham, M. Farich, IEEE Transactions on Applied Superconductivity 7(2), 2657 (1997)
47. C.J. Fourie, C. Shawawreh, I.V. Vernik, T.V. Filippov, IEEE Transactions on Applied Superconductivity 27(2) (2017). Art. no. 1300805
48. T.S. Messerges, E.A. Dabbish, R.H. Sloan, in Proceedings of the USENIX Workshop on Smartcard Technology (1999), pp. 1–11
49. M. Lipp, M. Schwarz, D. Gruss, T. Prescher, W. Haas, J. Horn, S. Mangard, P. Kocher, D. Genkin, Y. Yarom, et al., in Proceedings of the 27th USENIX Security Symposium (2018), pp. 1–18
50. P. Kocher, J. Horn, A. Fogh, D. Genkin, D. Gruss, W. Haas, M. Hamburg, M. Lipp, S. Mangard, T. Prescher, et al., in IEEE Symposium on Security and Privacy (SP) (2019), pp. 1–19

AQuCiDe: Architecture Aware Decomposition of Quantum Circuits Soumya Sengupta, Abhoy Kole, Kamalika Datta, Indranil Sengupta, and Rolf Drechsler

1 Introduction

Quantum computing has developed very rapidly over the last decade, with several demonstrable physical realizations of quantum computers being reported (viz. by IBM, Microsoft, Google, Intel, D-Wave, and many others). Various implementation technologies have been used for realizing quantum computers, each having its specific quantum gate library and architecture [7, 9, 17]. Motivated by this, researchers have worked on the design, synthesis [6], and mapping of algorithms to quantum computers [3, 5, 10–13, 23]. The qubit coupling architecture specifies the way in which the physical qubits of a quantum computer can interact with respect to the quantum gate operations. If we consider the coupling architecture of the quantum computing platforms that are built using superconducting qubits, the 2-qubit gates are constrained to operate on

S. Sengupta
Department of Computer Science and Engineering, JIS University, Kolkata, India

A. Kole
German Research Centre for Artificial Intelligence (DFKI), Bremen, Germany

K. Datta · R. Drechsler
German Research Centre for Artificial Intelligence (DFKI), Bremen, Germany
Institute of Computer Science, University of Bremen, Bremen, Germany

I. Sengupta
Department of Computer Science and Engineering, JIS University, Kolkata, India
Department of Computer Science and Engineering, Indian Institute of Technology Kharagpur, Kharagpur, India

© The Author(s), under exclusive license to Springer Nature Switzerland AG 2024
H. Thapliyal, T. Humble (eds.), Quantum Computing, https://doi.org/10.1007/978-3-031-37966-6_4


adjacent qubits only. This is known as the Nearest-Neighbor (NN) constraint. Broadly, two approaches are used to satisfy the NN constraint, viz. using Swap gates or remote-CNOT templates [16]. Both increase the number of gates, which further compromises computational reliability. Hence, quantum algorithms that require Toffoli gate operations must be re-described using 1- and 2-qubit elementary gates that satisfy the NN constraint.

To describe larger Toffoli gates, also referred to as Multiple-Control Toffoli (MCT) gates, using elementary quantum gates, two important approaches exist in the literature: (i) using dirty ancilla, (ii) using clean ancilla [2, 15]. With the dirty ancilla based method, the number of 2-qubit gates increases, but fewer additional qubits are required. This is because some of the circuit qubits available in the netlist are reused as ancilla. With clean ancilla based decomposition, the number of 2-qubit gates is smaller at the expense of additional ancilla qubits.

Modern-day quantum computers are characterized by varying numbers of qubits and architectures. As a result, architecture-aware decomposition of Toffoli gates also becomes important. None of the existing methods for MCT gate decomposition have considered architecture-specific constraints that arise from the way the qubits are interconnected [1, 21]. In this work, we propose an architecture-aware decomposition algorithm for MCT gates that takes the physical qubit architecture as input and exploits a data structure referred to as the Qubit Interaction Graph (QIG) of Toffoli gates. We have used the Clifford+T library for the description of 1- and 2-qubit elementary gates. The experiments conducted on benchmark circuits demonstrate the benefits of the architecture-aware decomposition. It is further observed that, when the decomposed netlist is mapped to a physical architecture, the number of gates required is also reduced for many benchmarks as compared to clean ancilla based decomposition. Two physical architectures, viz. Hex20 and IBM27, are used for experimentation. However, the proposed method is general and can be used for other architectures as well.

The rest of the paper is organized as follows. Section 2 presents a brief background on quantum circuits and quantum architectures. The proposed architecture-aware decomposition approach based on the QIG representation of the Toffoli gate netlist is discussed in Sect. 3. The experimental results are discussed in Sect. 4, followed by concluding remarks in Sect. 5.

2 Background

In this section, we briefly review the fundamentals of quantum circuits and some of the qubit architectures.

Fig. 1 An example quantum circuit (qubits q0, q1, and q2; gate labels H, T, and X)

2.1 Quantum Circuits

Quantum computing has developed considerably over the last decade, which has seen the emergence of demonstrable quantum computers. In classical computing, the fundamental unit of information is the bit, which can take on two values, 0 and 1. In quantum computing, all processing is carried out on quantum bits or qubits, which can be in one of the basis states $|0\rangle$ and $|1\rangle$, or in a state of superposition [18]:

$$|\psi\rangle = \alpha|0\rangle + \beta|1\rangle$$

where $|\alpha|^2 + |\beta|^2 = 1$. The states of the qubits can be manipulated by carrying out primitive quantum gate operations on them. In most practical quantum gate libraries, the gates operate on one or two qubits. One of the most popular quantum gate libraries is the Clifford+T library, which consists of the Hadamard ($H$), NOT ($X$), controlled NOT (CNOT), and phase-shift ($T$) gates. The Clifford+T library is known to be universal and fault-tolerant [25]. A quantum circuit consists of a cascade of quantum gates that operate on a set of qubits in sequence. Figure 1 shows a quantum circuit with three qubits and gates from the Clifford+T library.

2.2 Quantum Architectures

The way qubits can interact among themselves is determined by the architecture of the underlying quantum computer. A quantum computer typically consists of a set of qubits and some interconnection pattern among them. Two-qubit gate operations can be carried out directly on a pair of qubits that are connected as neighbors. The interconnection pattern among the qubits is referred to as the coupling constraint. Figure 2a shows the IBM Q27 27-qubit architecture, where each qubit is connected to a maximum of three neighboring qubits. Figure 2b shows a 20-qubit architecture where the qubits are interconnected in a hexagonal array, with each qubit connected to a maximum of 6 other qubits.
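As a minimal sketch of how a coupling constraint can be checked in software, the snippet below stores a coupling graph as an adjacency map and tests whether a two-qubit gate can be applied directly. The six-qubit ring is a hypothetical layout chosen for brevity (it is neither the IBM Q27 nor the hexagonal architecture of Fig. 2), and the names `COUPLING` and `is_directly_executable` are illustrative only.

```python
# Hypothetical 6-qubit ring: physical qubit i is coupled to its two ring neighbours.
COUPLING = {i: {(i - 1) % 6, (i + 1) % 6} for i in range(6)}


def is_directly_executable(coupling, control, target):
    """Return True if a 2-qubit gate on (control, target) respects the coupling constraint."""
    return target in coupling[control]


print(is_directly_executable(COUPLING, 0, 1))  # qubits 0 and 1 are neighbours on the ring
print(is_directly_executable(COUPLING, 0, 3))  # qubits 0 and 3 are not coupled
```

A gate on a pair that fails this check is exactly the case in which SWAP gates or remote-CNOT templates have to be introduced.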

S. Sengupta et al.

Fig. 2 (a) IBM Q27 architecture, (b) 20-qubit hexagonal architecture

2.3 Decomposition of Reversible Gates

In many quantum algorithms, some of the major steps may require a sequence of reversible operations to be executed. There exist many prior works [8] in which a given function is synthesized using a cascade of reversible gates. Some of the commonly used reversible gates are NOT, controlled-NOT (CNOT), Toffoli, and multiple-control Toffoli (MCT) [22]. Many of the synthesis approaches generate a netlist comprising a cascade of MCT gates. However, before execution on a quantum computer, these MCT gates must first be decomposed (transformed) into a netlist of native quantum gates. Some of the commonly used quantum gate libraries are NCV and Clifford+T. During the decomposition of an MCT gate, some additional qubits may be required. If these additional qubits are initialized to a basis state (say, $|0\rangle$), they are referred to as clean ancilla qubits. If, however, they are already being used in some computation and may be in arbitrary states, they are referred to as dirty ancilla qubits.

Many prior works on the decomposition of MCT gates (typically into smaller 3-input Toffoli gates) exist [2, 10, 14, 15, 19]. However, none of these methods uses information about the target quantum architecture to guide the decomposition. Since detailed information about practical quantum computers is available today, it makes sense to utilize this information for better decomposition. In this context, the present work combines the clean and dirty ancilla based decomposition approaches by utilizing the information available on the target architecture.

3 Proposed Decomposition-Mapping Approach

In this paper we show how improved decomposition of MCT gates can be achieved using architecture-related information. A graph-based data structure, referred to as the Qubit Interaction Graph (QIG), is used to capture the qubit interactions in a gate netlist. Then, considering the coupling constraints and the number of qubits in a particular architecture, the decomposition of an MCT gate netlist is carried out in two steps: (i) map the clean or dirty ancilla based QIG of the individual MCT gates, and (ii) decompose each MCT gate based on the graph mapping information.

3.1 Qubit Interaction Graph of Toffoli Gate

The decomposition of a Toffoli gate into 1- and 2-qubit elementary quantum gates does not require any ancilla qubit when gates from the Clifford+T library are used. However, depending on the order in which these 1- and 2-qubit gates appear in the netlist, the depth of the resulting netlist may vary. In this work we consider the Clifford+T realization of the Toffoli gate $C^2X(\{c_1, c_2\}; t)$ shown in Fig. 3a. During the mapping of a Clifford+T gate netlist to a target architecture, the 2-qubit CNOT operations are of prime concern, as they may violate some of the architecture-level constraints. We represent the qubit interactions in a gate netlist in the form of a QIG, in which the vertices represent qubits and the edges represent 2-qubit operations, with each edge weight giving the number of 2-qubit operations between the corresponding pair of qubits. Since the coupling constraints are imposed only on 2-qubit gates, all 1-qubit operations are ignored in the QIG representation. Figure 3b shows the QIG of the quantum gate netlist shown in Fig. 3a.

For such a construction of a Toffoli operation, there exist two possibilities: (i) using a 2-controlled relative inverse phase $-iZ$ gate, $C^2(-iZ)$, and a controlled phase $S$ gate, $CS$, or (ii) using the inverses of these gates (i.e., $C^2(iZ)$ and $CS^\dagger$) [20]. Figure 3c shows one such realization using the $C^2(-iZ)(c_1, c_2, t)$ and $CS(c_1, c_2)$ gate pair. In a netlist of Toffoli gates, describing alternate Toffoli gates that operate on the same set of qubits using inverse phase gate pairs may cancel out some gates, depending on the qubits used by the intermediate gates. The $CS$ and $CS^\dagger$ gates can be cancelled out if no intermediate gate uses any of the control qubits of the Toffoli gate pair, as illustrated in the following example.
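Before turning to the example, the QIG construction described above can be sketched in a few lines. The gate list below is one widely used Clifford+T netlist for the Toffoli gate; its gate ordering is an assumption and need not coincide with the ordering drawn in Fig. 3a. The tuple encoding of gates and the names `toffoli_clifford_t` and `build_qig` are likewise hypothetical.

```python
from collections import Counter


def toffoli_clifford_t(c1, c2, t):
    """One common Clifford+T Toffoli netlist: 15 gates, 6 of them CX.

    Gates are tuples (name, qubit) or (name, control, target).
    """
    return [
        ("H", t), ("CX", c2, t), ("Tdg", t), ("CX", c1, t), ("T", t),
        ("CX", c2, t), ("Tdg", t), ("CX", c1, t), ("T", c2), ("T", t),
        ("H", t), ("CX", c1, c2), ("T", c1), ("Tdg", c2), ("CX", c1, c2),
    ]


def build_qig(gates):
    """Count 2-qubit gates per unordered qubit pair; 1-qubit gates are ignored."""
    qig = Counter()
    for _name, *qubits in gates:
        if len(qubits) == 2:
            qig[frozenset(qubits)] += 1
    return dict(qig)


qig = build_qig(toffoli_clifford_t("c1", "c2", "t"))
for edge, weight in sorted(qig.items(), key=lambda kv: sorted(kv[0])):
    print(sorted(edge), weight)
```

Under these assumptions each of the three edges $(c_1, c_2)$, $(c_1, t)$ and $(c_2, t)$ receives weight 2, which agrees with the statement that each Toffoli gate contributes an edge weight of 2 to the QIG.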
Fig. 3 (a) Clifford+T realization of a Toffoli gate $C^2X(\{c_1, c_2\}; t)$, (b) The corresponding QIG, (c) Realization using a sequence of $C^2(-iZ)(c_1, c_2, t)$ and $CS(c_1, c_2)$ gates

Example 1 Figure 4a shows a netlist of two identical Toffoli gates, $C^2X(\{q_1, q_2\}; q_3)$, with an intermediate $CX(q_3; q_4)$ gate. Consider that the Toffoli gate pair is described as a sequence of $C^2(-iZ)(q_1, q_2, q_3)$ and $CS(q_1, q_2)$ phase gates and their inverse gates in reverse sequence, as shown in Fig. 4b. Since the intermediate $CX$ gate does not act on either of the two qubits used by the $CS$ and $CS^\dagger$ gates, the $\{CS, CS^\dagger\}$ gate pair can be cancelled out, which eliminates the edge $(q_1, q_2)$ in the QIG, as shown in Fig. 4c.

Fig. 4 (a) A netlist of two Toffoli gates $C^2X(\{q_1, q_2\}; q_3)$, with an intermediate $CX(q_3, q_4)$ gate, (b) The intermediate representation of the netlist, replacing the Toffoli operations with $C^2(-iZ)(q_1, q_2, q_3)$ and $CS(q_1, q_2)$ phase gates and their inverses, (c) The resulting QIG after removing the $CS$ and $CS^\dagger$ gate pair

The $\{CS, CS^\dagger\}$ gate pair present in a decomposed netlist cannot be cancelled if some intermediate gate operates on either of the control qubits of the Toffoli gate pair. This is due to the properties of the individual Clifford+T gate operations. The $CS$ and $CS^\dagger$ operations are realized using a network of phase $T/T^\dagger$ and $CX$ gates, as shown in Fig. 5. $CX$ and $T/T^\dagger$ gates operating on a common qubit commute when $T/T^\dagger$ operates on the control qubit of the $CX$ gate; otherwise, the operation sequence does not commute, as shown in Fig. 6. This allows partial cancellation of one $CX$ gate in realizing the pair of alternate Toffoli operations, as described in the following example.

Fig. 5 Description of $CS$ and $CS^\dagger$ gates using phase $T/T^\dagger$ and $CX$ gates. (a) $CS(q_i, q_j)$. (b) $CS^\dagger(q_i, q_j)$

Fig. 6 Commutative and non-commutative relationship between T and CX gate

Example 2 Consider the netlist shown in Fig. 7a, comprising a pair of Toffoli gates $C^2X(\{q_2, q_3\}; q_4)$ with an intermediate $CX(q_1; q_2)$ gate. The $CS(q_1; q_3)$ and $CS^\dagger(q_1; q_3)$ gate pair from the intermediate representation of the netlist in terms of $C^2(\pm iZ)$ and $CS/CS^\dagger$ phase gates (see Fig. 7b) cannot be removed completely due to the use of qubit $q_3$ in the intermediate $CX$ gate operation. This leads to partial cancellation of one $CX$ gate from each of the $CS$ and $CS^\dagger$ gate descriptions, which in turn reduces the weight of the edge $(q_2, q_3)$ to 2 in the QIG.

Considering the QIGs of Toffoli netlists with complete or partial cancellation of $CS/CS^\dagger$ as building blocks, the QIG formation for larger MCT gates is presented in the next subsection.
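The cancellation rule illustrated by Examples 1 and 2 can be summarized in a short sketch. It assumes, in line with the edge weights quoted in the text, that the $C^2(\pm iZ)$ part of each Toffoli gate contributes two CX gates to each control-target edge and that $CS$ and $CS^\dagger$ each contain two CX gates on the control-control edge. The function name `toffoli_pair_qig` and the tuple encoding of the intermediate gates are hypothetical.

```python
from collections import Counter


def toffoli_pair_qig(c1, c2, t, intermediate):
    """QIG edge weights for two identical Toffoli gates C2X({c1, c2}; t)
    separated by `intermediate`, a list of (control, target) CX gates."""
    qig = Counter()
    # Two Toffoli gates, each with 2 CX on (c1, t) and 2 CX on (c2, t).
    qig[frozenset((c1, t))] += 4
    qig[frozenset((c2, t))] += 4
    # CS and CS-dagger hold 2 CX each on (c1, c2), i.e. 4 before any cancellation.
    touches_control = any(q in (c1, c2) for gate in intermediate for q in gate)
    if touches_control:
        # Partial cancellation: one CX is removed from each of CS and CS-dagger.
        qig[frozenset((c1, c2))] += 2
    # Otherwise the pair cancels completely and no (c1, c2) edge is created.
    for control, target in intermediate:
        qig[frozenset((control, target))] += 1
    return dict(qig)


# Example 1: the intermediate CX(q3; q4) touches neither control qubit.
example1 = toffoli_pair_qig("q1", "q2", "q3", [("q3", "q4")])
# A case with an intermediate CX acting on a control qubit (partial cancellation).
partial = toffoli_pair_qig("q2", "q3", "q4", [("q1", "q2")])
print(example1)
print(partial)
```

In the first call the control-control edge disappears, as in Fig. 4c; in the second it survives with weight 2, as described for partial cancellation.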

Fig. 7 (a) A Toffoli gate netlist, (b) The intermediate representation of the netlist, replacing the Toffoli operations with $C^2(-iZ)(q_2, q_3, q_4)$ and $CS(q_1, q_3)$ phase gates and their inverses, (c) The QIG of the resulting netlist after removing a $CX$ operation from each of the $CS$ and $CS^\dagger$ gate descriptions

Fig. 8 (a) A 5-qubit MCT gate, (b) Realization using 2 dirty ancilla, (c) Realization using 2 clean ancilla

3.2 Qubit Interaction Graph of MCT Gate

The multiple-control Toffoli (MCT) gate operation can be realized with fewer Clifford+T gates when ancilla¹ qubits are used [2, 15]. The number of additional gates required depends on the type of ancilla qubits (i.e., clean or dirty). In general, the decomposition of an n-qubit MCT gate, where $n > 3$, requires $4(n-3)$ Toffoli gates when $n-3$ dirty ancilla² are used [2]. The number of Toffoli gates can be further reduced to $2n-5$ when $n-3$ clean ancilla³ are used [15]. The following example illustrates the realization of an MCT gate using clean and dirty ancilla qubits.

Example 3 Figure 8 shows two realizations of a 5-qubit MCT gate, treating the two qubits $(a_1, a_2)$ as dirty and as clean ancilla respectively. The two realizations require 8 and 5 Toffoli gates respectively.

¹ A qubit that does not act as a control or target of an MCT gate, and that is used to describe the MCT gate operation in terms of smaller gates, is called an ancilla qubit.
² An ancilla qubit with an unknown initial state is termed a dirty ancilla.
³ An ancilla qubit with initial state $|0\rangle$ is termed a clean ancilla.
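The two Toffoli-count formulas quoted above are easy to tabulate. The helper below is a small illustrative sketch (the name `toffoli_count` is not from the original work); it reproduces the counts of Example 3.

```python
def toffoli_count(n, ancilla):
    """Number of Toffoli gates for an n-qubit MCT gate with n - 3 ancilla qubits."""
    if n <= 3:
        raise ValueError("the formulas apply to n > 3")
    if ancilla == "dirty":
        return 4 * (n - 3)
    if ancilla == "clean":
        return 2 * n - 5
    raise ValueError("ancilla must be 'dirty' or 'clean'")


for n in range(4, 9):
    print(n, toffoli_count(n, "dirty"), toffoli_count(n, "clean"))
```

For $n = 5$ this gives 8 Toffoli gates with dirty ancilla and 5 with clean ancilla, and the gap between the two counts, $2n - 7$, grows linearly with $n$.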

Fig. 9 QIG representations of an n-qubit MCT gate description based on (a) $n-3$ dirty, and (b) $n-3$ clean ancilla qubits

The Toffoli gates used in the dirty ancilla based description of a size-n MCT gate, $C^{n-1}X(\{c_1, c_2, \ldots, c_{n-1}\}; t)$, in the presence of the dirty ancillae $\{a_1, a_2, \ldots, a_{n-3}\}$, can be classified as:

I. A pair of identical Toffoli gates of the form $C^2X(\{c_1, c_2\}; a_1)$.
II. $n-4$ sets of 4 identical Toffoli gates, where the $i$th such set is of the form $C^2X(\{c_i, a_{i-2}\}; a_{i-1})$ for $i = 3, 4, \ldots, n-2$.
III. A pair of identical Toffoli gates of the form $C^2X(\{c_{n-1}, a_{n-3}\}; t)$.

In a similar way, the Toffoli gates used in the clean ancilla based description of a size-n MCT gate, $C^{n-1}X(\{c_1, c_2, \ldots, c_{n-1}\}; t)$, in the presence of the clean ancillae $\{a_1, a_2, \ldots, a_{n-3}\}$, can be classified as:

I. A pair of identical Toffoli gates of the form $C^2X(\{c_1, c_2\}; a_1)$.
II. $n-4$ sets of identical Toffoli pairs, where the $i$th identical Toffoli pair is of the form $C^2X(\{c_i, a_{i-2}\}; a_{i-1})$ for $i = 3, 4, \ldots, n-2$.
III. A Toffoli gate of the form $C^2X(\{c_{n-1}, a_{n-3}\}; t)$.

Figure 9 shows the QIGs for the dirty and clean ancilla based descriptions respectively of a size-n MCT gate, $C^{n-1}X(\{c_1, c_2, \ldots, c_{n-1}\}; t)$, where the qubits $\{a_1, a_2, \ldots, a_{n-3}\}$ are used as ancilla. Since each Toffoli gate contributes an edge weight of 2 (see Fig. 3) in the QIG, the class I Toffoli pair (i.e., $C^2X(\{c_1, c_2\}; a_1)$) results in an edge weight of 4 in the respective QIG, with no edge between the vertices $c_1$ and $c_2$ due to complete cancellation of the $CS/CS^\dagger$ operations (see Fig. 4), for both the dirty and the clean ancilla based description. The class II Toffoli gates (i.e., $C^2X(\{c_i, a_{i-2}\}; a_{i-1})$ for $i = 3, 4, \ldots, n-2$) result in edge weights of 8 for the dirty ancilla based description and 4 for the clean ancilla based description. In the dirty ancilla based description, the intermediate gate configurations allow only partial cancellation of the $CS/CS^\dagger$ operations (see Fig. 7), and this reduces the weight of the edges $(c_i, a_{i-2})$ for $i = 3, 4, \ldots, n-2$ to 4 in the respective QIG. The clean ancilla based configuration does not require such an edge, due to complete cancellation of the $CS/CS^\dagger$ operations. The QIG edges for the class III Toffoli gate $C^2X(\{c_{n-1}, a_{n-3}\}; t)$ can be derived in a similar way for both descriptions. This is illustrated by the following example.

Fig. 10 QIG of a 5-qubit MCT gate based on (a) dirty, and (b) clean ancilla based description

Example 4 Consider again the dirty and clean ancilla based descriptions of an MCT gate $C^4X(\{c_1, c_2, c_3, c_4\}; t)$ using $\{a_1, a_2\}$ as ancilla qubits, as shown in Fig. 8. The three classes of Toffoli gates in both descriptions are $C^2X(\{c_1, c_2\}; a_1)$, $C^2X(\{c_3, a_1\}; a_2)$, and $C^2X(\{c_4, a_2\}; t)$. This results in QIG representations with different edge weights of 8, 4 and 2, as shown in Fig. 10.

The QIGs of individual MCT gates based on dirty and clean ancilla qubits further allow an MCT netlist to be described appropriately using Clifford+T gates when architectural information is available, as presented next.
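The classification above lends itself to a compact sketch that enumerates the Toffoli classes of a size-n MCT gate and accumulates the corresponding QIG edge weights. The control-target weights (2 per Toffoli gate) and the class I and II control-control weights follow the text. The class III control-control weights are an assumption, since the text only states that they "can be derived in a similar way": a single Toffoli gate is taken to keep its weight-2 edge, and a pair with partial cancellation is taken to keep weight 2 as well. All function and variable names are hypothetical.

```python
from collections import Counter

# Control-control edge weight per (class, ancilla type).
# Classes I and II follow the text; the class III entries are assumptions.
CONTROL_EDGE = {
    ("I", "dirty"): 0, ("I", "clean"): 0,
    ("II", "dirty"): 4, ("II", "clean"): 0,
    ("III", "dirty"): 2, ("III", "clean"): 2,
}


def mct_toffoli_classes(n, ancilla):
    """Return (class, copies, (control, control), target) for a size-n MCT gate."""
    if n < 4 or ancilla not in ("dirty", "clean"):
        raise ValueError("need n >= 4 and ancilla in {'dirty', 'clean'}")
    dirty = ancilla == "dirty"
    classes = [("I", 2, ("c1", "c2"), "a1")]
    for i in range(3, n - 1):  # i = 3, ..., n - 2
        classes.append(("II", 4 if dirty else 2, (f"c{i}", f"a{i - 2}"), f"a{i - 1}"))
    classes.append(("III", 2 if dirty else 1, (f"c{n - 1}", f"a{n - 3}"), "t"))
    return classes


def mct_qig(n, ancilla):
    """Accumulate QIG edge weights over all Toffoli classes."""
    qig = Counter()
    for cls, copies, (u, v), target in mct_toffoli_classes(n, ancilla):
        qig[frozenset((u, target))] += 2 * copies
        qig[frozenset((v, target))] += 2 * copies
        weight = CONTROL_EDGE[(cls, ancilla)]
        if weight:
            qig[frozenset((u, v))] += weight
    return dict(qig)


dirty5 = mct_qig(5, "dirty")
clean5 = mct_qig(5, "clean")
print(sum(c for _, c, _, _ in mct_toffoli_classes(5, "dirty")))  # Toffoli count, dirty
print(sum(c for _, c, _, _ in mct_toffoli_classes(5, "clean")))  # Toffoli count, clean
```

Summing the copies per class gives $2 + 4(n-4) + 2 = 4(n-3)$ Toffoli gates for dirty ancilla and $2 + 2(n-4) + 1 = 2n - 5$ for clean ancilla, so the classification is consistent with the counts stated earlier in this subsection.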

3.3 Architecture-Aware Decomposition of MCT Netlist

To describe an MCT netlist using gates from the Clifford+T library, additional ancilla qubits are required. In the dirty ancilla based description, the number of qubits in the final decomposed netlist can be minimized by selecting some of the qubits of the netlist itself as dirty ancilla. The clean ancilla based description does not allow such reuse of qubits from the netlist. Given an MCT gate netlist comprising n qubits $Q = \{q_1, q_2, \ldots, q_n\}$, a qubit $q_a$ can be used in the dirty ancilla based description of an MCT gate $C^mX(C; q_t)$ from the netlist if

$$q_a \in Q \setminus (C \cup \{q_t\}), \quad \text{where } C \cup \{q_t\} \subset Q. \tag{1}$$

Fig. 11 (a) A reversible netlist with 2 MCT gates $G_1$ and $G_2$, (b) QIGs of the MCT gates for the dirty ancilla based description, considering $q_5$ for $G_1$ and $\{q_2, q_6\}$ for $G_2$ as dirty ancilla, and (c) QIGs of the MCT gates for the clean ancilla based description, considering $q_8$ for $G_1$ and $\{q_8, q_9\}$ for $G_2$ as clean ancilla

Similarly, $q_a$ can be used for the clean ancilla based description of the MCT gate $C^mX(C; q_t)$ if

$$q_a \notin Q, \quad \text{where } C \cup \{q_t\} \subset Q. \tag{2}$$
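Conditions (1) and (2) amount to simple set operations. The sketch below applies them to the two gates of the netlist in Fig. 11a (discussed in Example 5). The function names are hypothetical, and modelling the clean candidates as "device qubits not used by the netlist", with device qubits named like netlist qubits, is a simplifying assumption.

```python
def dirty_ancilla_candidates(netlist_qubits, controls, target):
    """Condition (1): netlist qubits that the MCT gate itself does not use."""
    return set(netlist_qubits) - (set(controls) | {target})


def clean_ancilla_candidates(device_qubits, netlist_qubits):
    """Condition (2): qubits outside the netlist, here the spare device qubits."""
    return set(device_qubits) - set(netlist_qubits)


Q = {f"q{i}" for i in range(1, 8)}        # 7-qubit netlist
DEVICE = {f"q{i}" for i in range(1, 10)}  # hypothetical 9-qubit device

g1_dirty = dirty_ancilla_candidates(Q, {"q2", "q4", "q6"}, "q7")
g2_dirty = dirty_ancilla_candidates(Q, {"q1", "q3", "q4", "q5"}, "q7")
clean = clean_ancilla_candidates(DEVICE, Q)
print(sorted(g1_dirty), sorted(g2_dirty), sorted(clean))
```

The candidate sets contain $q_5$ for $G_1$ and exactly $\{q_2, q_6\}$ for $G_2$, matching the dirty ancilla chosen in Fig. 11b, while $\{q_8, q_9\}$ are the only clean candidates on a 9-qubit device.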

The selection of $q_a$ as a dirty or clean ancilla can only be made if the target architectural information is available during the elementary gate level description of the MCT netlist. The following example illustrates the description of an MCT netlist.

Example 5 Consider the 7-qubit MCT gate netlist shown in Fig. 11a, which consists of two MCT gates, $G_1 : C^3X(\{q_2, q_4, q_6\}; q_7)$ and $G_2 : C^4X(\{q_1, q_3, q_4, q_5\}; q_7)$, of size 4 and 5 qubits respectively. Considering $q_5$ for $G_1$ and $\{q_2, q_6\}$ for $G_2$ as dirty ancilla, the QIGs of the corresponding Clifford+T gate descriptions are shown in Fig. 11b. For a quantum device with 9 or more physical qubits, the netlist can be described using clean ancilla. Figure 11c shows the QIGs of $G_1$ and $G_2$ considering $q_8$ and $\{q_8, q_9\}$ as clean ancilla respectively.

The QIG of an individual MCT gate can further be considered for mapping to physical qubits when the coupling graph of the quantum device is available. An n-qubit MCT gate $C^{n-1}X(\{q_1, q_2, \ldots, q_{n-1}\}; t; \{a_1, a_2, \ldots, a_{n-3}\})$ with $n-3$ ancillae can be mapped using the intermediate QIG representation, prior to the dirty or clean ancilla based description, if the device coupling information is available. The mapping of an MCT netlist during decomposition is performed as follows:

1. The netlist is traversed from left to right and the QIGs of the individual MCT gates are constructed, based on the availability of dirty or clean ancilla qubits.
2. Each QIG thus formed is then used for mapping in the following way:
   (a) The target qubit is mapped to one of the physical qubits if all the qubits of the QIG are unmapped; otherwise, the physical qubit nearest to the previously mapped qubits is used for the current mapping.
   (b) The vertices of the QIG representing the control and ancilla qubits are then mapped to physical qubits by selecting the unmapped physical qubit nearest to the previously mapped physical qubits.
3. The process repeats until all the QIGs are mapped.

The following example illustrates the mapping of an MCT gate based on the coupling graph of a quantum device.

Fig. 12 A possible mapping of a 5-qubit MCT gate on a 20-qubit hexagonal layout

Example 6 Consider the MCT gate netlist and the corresponding QIGs shown in Fig. 11. The mapping of the clean and dirty ancilla based decomposed netlist of the MCT gate $G_2 : C^4X(\{q_1, q_3, q_4, q_5\}; q_7; \{a_1, a_2\})$ is shown in Fig. 12. The mapping starts by placing the target $q_7$ on physical qubit 12. Traversing the QIG of the dirty (or clean) ancilla based description, with the ancillae $\{a_1, a_2\}$ taken as $\{q_2, q_6\}$ (or $\{q_8, q_9\}$), the control and ancilla qubits $q_5$ and $q_6$ (or $q_9$) adjacent to the target qubit in the QIG are mapped to physical qubits 7 and 8 respectively, which are nearest to physical qubit 12. The other control and ancilla qubits are mapped in a similar way.

With the mapping thus obtained, the individual MCT gates of the mapped netlist are replaced by the dirty or clean ancilla based description, depending on the type of ancilla considered during the mapping. The proposed decomposition approach is suitable for any given physical architecture.
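The placement steps for a single QIG can be sketched as a greedy procedure. This is an illustrative reading of the steps above rather than the authors' implementation. "Nearest" is interpreted here as the smallest total hop distance to the physical qubits of the already placed QIG neighbours, ties are broken by the lowest physical index, and the QIG is traversed breadth-first from the target. The 3×3 grid coupling graph and the names `hop_distances` and `map_qig` are hypothetical, and the coupling graph is assumed to be connected.

```python
from collections import deque


def hop_distances(coupling, source):
    """BFS hop counts from `source` to every reachable physical qubit."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in coupling[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist


def map_qig(qig_edges, target, coupling, start):
    """Greedily place QIG vertices on physical qubits, target first."""
    neighbours = {}
    for a, b in qig_edges:
        neighbours.setdefault(a, set()).add(b)
        neighbours.setdefault(b, set()).add(a)

    placement = {target: start}
    used = {start}
    queue = deque([target])
    while queue:
        u = queue.popleft()
        for v in sorted(neighbours[u]):
            if v in placement:
                continue
            # Distances from the physical qubits of v's already placed neighbours.
            dists = [hop_distances(coupling, placement[w])
                     for w in neighbours[v] if w in placement]
            free = [p for p in coupling if p not in used]
            best = min(free, key=lambda p: (sum(d[p] for d in dists), p))
            placement[v] = best
            used.add(best)
            queue.append(v)
    return placement


# Hypothetical 3x3 grid: qubit r*3 + c is coupled to its horizontal/vertical neighbours.
GRID = {r * 3 + c: [rr * 3 + cc
                    for rr, cc in ((r - 1, c), (r + 1, c), (r, c - 1), (r, c + 1))
                    if 0 <= rr < 3 and 0 <= cc < 3]
        for r in range(3) for c in range(3)}

# Edges of the clean ancilla QIG of a 5-qubit MCT gate (weights omitted).
EDGES = [("c1", "a1"), ("c2", "a1"), ("c3", "a2"), ("a1", "a2"),
         ("c4", "t"), ("a2", "t"), ("c4", "a2")]

placement = map_qig(EDGES, "t", GRID, start=4)
print(placement)
```

Edge weights are ignored in this sketch; a refinement could visit heavier edges first so that the most frequently interacting qubits end up closest together.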

4 Experimental Results

The MCT gate decomposition and architecture-aware mapping approach have been implemented in Python and run on a system with an AMD Ryzen 7 PRO 5850U processor running at 1.90 GHz, 48 GB RAM, a 1 TB SSD and the Windows 10 Pro operating system. The benchmarks used for experimentation are taken from RevLib [24]. Two different architectures, the 20-qubit hexagonal (Hex20) and the IBMQ 27-qubit Falcon (IBM27), are used for experimentation. It may be noted that the proposed method is general and can be used for other architectures as well.

4.1 Effectiveness of Architecture-Aware Decomposition

We have carried out two sets of experiments. In the first, the MCT gate netlists are described using the naive dirty and clean ancilla based methods, assuming that no information about the target architecture is available. The results are compared with the netlist description obtained using the proposed architecture-aware decomposition approach, and are summarized in Table 1. The first two columns give the benchmark name and the number of qubits (n). The next six columns show the number of CNOT gates (#CX), the number of qubits required (n) and the time in seconds (t) for the dirty and clean ancilla based naive decomposition respectively. The next six columns provide the number of CNOT gates (#CX), the number of qubits required (n) and the time in seconds (t) for the architecture-aware decomposition for the Hex20 and IBM27 architectures respectively. The last three columns show the overall improvement in the number of CNOT gates of the clean ancilla based decomposition, and of the architecture-aware decomposition for Hex20 and IBM27, over the dirty ancilla based decomposition. Entries marked (-) correspond to benchmarks for which n is larger than the number of physical qubits available. Clean ancilla based decomposition requires the fewest CNOT gates of all the methods. The proposed approach provides CNOT reductions over the dirty ancilla based approach. Moreover, if the architecture has a sufficient number of qubits, it gives results close to those of clean ancilla based decomposition. For the benchmark 5xp1_194, the value of #CX is 1399 and 927 for the dirty and clean ancilla based decomposition respectively. Also, 17 qubits are required with dirty ancilla, whereas 22 qubits are required with clean ancilla. In the presence of architectural information, the proposed approach requires 1223 and 927 CNOT gates for the Hex20 and IBM27 platforms respectively.

The additional qubits required for clean ancilla based decomposition may not be available in the target architecture. It is therefore necessary to evaluate the number of qubits present in an architecture before taking the decision. Clearly, in many cases clean ancilla based decomposition cannot be used because of the restriction on the number of qubits. We observe that having the architectural information is beneficial, as we can then choose either clean or dirty ancilla based decomposition for the individual MCT gates of a netlist. In the next experiment we show how architecture-aware decomposition helps in mapping the circuits to specific architectures, where it can be even more beneficial than clean ancilla based decomposition.
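The improvement percentages in Table 1 are consistent with a relative reduction in CNOT count with respect to the dirty ancilla baseline. The helper below (an illustrative name, not from the original work) reproduces the 5xp1_194 figures quoted above under that reading.

```python
def cx_improvement(dirty_cx, other_cx):
    """Percentage reduction in CNOT count relative to the dirty ancilla baseline."""
    return 100.0 * (dirty_cx - other_cx) / dirty_cx


clean_impr = round(cx_improvement(1399, 927), 2)   # clean (and IBM27) for 5xp1_194
hex_impr = round(cx_improvement(1399, 1223), 2)    # Hex20 for 5xp1_194
print(clean_impr, hex_impr)
```

A negative value would indicate that a method needs more CNOT gates than the dirty ancilla baseline.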

Table 1 Analysis of naive (dirty and clean ancilla based decomposition) and architecture aware decomposition

| Benchmark | n | Dirty #CX | Dirty n | Dirty t(s) | Clean #CX | Clean n | Clean t(s) | Hex20 #CX | Hex20 n | Hex20 t(s) | IBM27 #CX | IBM27 n | IBM27 t(s) | Impr. Clean (%) | Impr. Hex (%) | Impr. IBM (%) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 4gt10-v1_81 | 5 | 37 | 5 | 0.26 | 33 | 6 | 0.25 | 33 | 6 | 0.24 | 33 | 7 | 0.21 | 10.81 | 10.81 | 10.81 |
| 4gt12-v0_86 | 5 | 57 | 7 | 0.26 | 47 | 7 | 0.25 | 47 | 7 | 0.24 | 47 | 7 | 0.23 | 17.54 | 17.54 | 17.54 |
| 4gt4-v0_73 | 5 | 91 | 7 | 0.27 | 77 | 7 | 0.25 | 77 | 7 | 0.24 | 77 | 7 | 0.31 | 15.38 | 15.38 | 15.38 |
| 5xp1_194 | 17 | 1399 | 17 | 0.48 | 927 | 22 | 0.38 | 1223 | 20 | 0.44 | 927 | 23 | 0.43 | 33.74 | 12.58 | 33.74 |
| 9symml_195 | 10 | 4280 | 17 | 0.69 | 2460 | 17 | 0.46 | 2460 | 17 | 0.51 | 2460 | 17 | 0.54 | 42.52 | 42.52 | 42.52 |
| C17_204 | 7 | 102 | 7 | 0.26 | 80 | 9 | 0.25 | 80 | 11 | 0.24 | 80 | 11 | 0.24 | 21.57 | 21.57 | 21.57 |
| C7552_205 | 21 | 1753 | 21 | 0.55 | 1237 | 24 | 0.43 | - | - | - | 1237 | 27 | 0.42 | 29.44 | - | 29.44 |
| add6_196 | 19 | 6445 | 19 | 1.01 | 4045 | 24 | 0.70 | 6413 | 20 | 1.04 | 4045 | 27 | 0.76 | 37.24 | 0.50 | 37.24 |
| adr4_197 | 13 | 736 | 13 | 0.36 | 540 | 16 | 0.30 | 540 | 17 | 0.35 | 540 | 17 | 0.37 | 26.63 | 26.63 | 26.63 |
| alu2_199 | 16 | 4722 | 19 | 0.77 | 2820 | 24 | 0.52 | 4184 | 20 | 0.70 | 2820 | 27 | 0.57 | 40.28 | 11.39 | 40.28 |
| alu3_200 | 18 | 2541 | 19 | 0.54 | 1575 | 26 | 0.43 | 2443 | 20 | 0.52 | 1575 | 27 | 0.45 | 38.02 | 3.86 | 38.02 |
| alu4_201 | 22 | 52928 | 23 | 5.58 | 29526 | 32 | 3.10 | - | - | - | 51118 | 27 | 5.82 | 44.21 | - | 3.42 |
| apla_203 | 22 | 3428 | 22 | 0.62 | 2034 | 29 | 0.48 | - | - | - | 2640 | 27 | 0.51 | 40.67 | - | 22.99 |
| clip_206 | 14 | 5500 | 17 | 0.87 | 3266 | 21 | 0.55 | 3570 | 20 | 0.65 | 3266 | 27 | 0.67 | 40.62 | 35.09 | 40.62 |
| cm150a_210 | 22 | 1067 | 22 | 0.42 | 649 | 26 | 0.31 | - | - | - | 649 | 26 | 0.37 | 39.18 | - | 39.18 |
| cycle10_2_110 | 12 | 812 | 19 | 0.33 | 488 | 20 | 0.29 | 488 | 20 | 0.28 | 488 | 26 | 0.28 | 39.90 | 39.90 | 39.90 |
| dc2_222 | 15 | 1879 | 15 | 0.52 | 1201 | 20 | 0.40 | 1201 | 20 | 0.40 | 1201 | 26 | 0.42 | 36.08 | 36.08 | 36.08 |
| decod_217 | 21 | 1753 | 21 | 0.56 | 1237 | 24 | 0.43 | - | - | - | 1237 | 27 | 0.43 | 29.44 | - | 29.44 |
| dist_223 | 13 | 5992 | 15 | 0.88 | 3594 | 19 | 0.57 | 3594 | 20 | 0.64 | 3594 | 24 | 0.72 | 40.02 | 40.02 | 40.02 |
| example2_231 | 16 | 4722 | 19 | 0.76 | 2820 | 24 | 0.52 | 4184 | 20 | 0.70 | 2820 | 27 | 0.57 | 40.28 | 11.39 | 40.28 |
| f51m_233 | 22 | 31110 | 27 | 3.34 | 17466 | 34 | 2.02 | - | - | - | 29416 | 27 | 3.33 | 43.86 | - | 5.45 |
| ham15_107 | 15 | 1877 | 15 | 0.53 | 1363 | 20 | 0.43 | 1363 | 20 | 0.47 | 1363 | 27 | 0.48 | 27.38 | 27.38 | 27.38 |
| hwb6_56 | 6 | 1465 | 9 | 0.50 | 1183 | 9 | 0.40 | 1183 | 14 | 0.44 | 1183 | 15 | 0.47 | 19.25 | 19.25 | 19.25 |
| hwb7_59 | 7 | 4619 | 11 | 0.96 | 3477 | 11 | 0.58 | 3477 | 15 | 0.75 | 3477 | 18 | 0.89 | 24.72 | 24.72 | 24.72 |
| hwb8_113 | 8 | 13380 | 13 | 2.31 | 9508 | 13 | 1.25 | 9508 | 15 | 1.68 | 9508 | 24 | 1.79 | 28.94 | 28.94 | 28.94 |
| hwb8_114 | 8 | 11457 | 13 | 1.91 | 7993 | 13 | 1.06 | 7993 | 15 | 1.41 | 7993 | 24 | 1.60 | 30.23 | 30.23 | 30.23 |
| hwb9_119 | 9 | 37605 | 15 | 6.03 | 25703 | 15 | 3.18 | 25703 | 20 | 3.60 | 25703 | 22 | 4.20 | 31.65 | 31.65 | 31.65 |
| hwb9_123 | 9 | 16657 | 15 | 2.53 | 11375 | 15 | 1.39 | 11375 | 17 | 3.06 | 11375 | 23 | 3.89 | 31.71 | 31.71 | 31.71 |
| inc_237 | 16 | 2126 | 16 | 0.53 | 1382 | 21 | 0.43 | 1408 | 20 | 0.43 | 1382 | 27 | 0.48 | 35.00 | 33.77 | 35.00 |
| life_238 | 10 | 3280 | 17 | 0.60 | 1950 | 17 | 0.44 | 1950 | 17 | 0.47 | 1950 | 17 | 0.50 | 40.55 | 40.55 | 40.55 |
| max46_240 | 10 | 3286 | 15 | 0.57 | 1908 | 16 | 0.43 | 1908 | 16 | 0.46 | 1908 | 16 | 0.51 | 41.94 | 41.94 | 41.94 |
| mlp4_245 | 16 | 3734 | 16 | 0.67 | 2310 | 22 | 0.49 | 2866 | 20 | 0.64 | 2310 | 26 | 0.55 | 38.14 | 23.25 | 38.14 |
| rd73_252 | 10 | 1036 | 11 | 0.46 | 740 | 14 | 0.31 | 740 | 14 | 0.38 | 740 | 15 | 0.41 | 28.57 | 28.57 | 28.57 |
| root_255 | 13 | 2715 | 15 | 0.53 | 1647 | 19 | 0.42 | 1647 | 19 | 0.45 | 1647 | 27 | 0.46 | 39.34 | 39.34 | 39.34 |
| sao2_257 | 14 | 4447 | 19 | 0.66 | 2497 | 22 | 0.46 | 3831 | 20 | 0.58 | 2497 | 27 | 0.50 | 43.85 | 13.85 | 43.85 |
| sym9_148 | 10 | 4452 | 10 | 0.96 | 3276 | 12 | 0.60 | 3276 | 12 | 0.68 | 3276 | 12 | 0.73 | 26.42 | 26.42 | 26.42 |
| sym9_193 | 10 | 4280 | 17 | 0.67 | 2460 | 17 | 0.46 | 2460 | 17 | 0.51 | 2460 | 17 | 0.54 | 42.52 | 42.52 | 42.52 |
| urf2_161 | 8 | 22648 | 11 | 1.64 | 21316 | 12 | 1.40 | 21316 | 15 | 4.63 | 21316 | 21 | 5.87 | 5.88 | 5.88 | 5.88 |
| urf3_157 | 10 | 92318 | 17 | 12.84 | 59370 | 17 | 6.77 | 59370 | 18 | 7.10 | 59370 | 24 | 7.99 | 35.69 | 35.69 | 35.69 |
| urf3_279 | 10 | 37851 | 15 | 4.23 | 33143 | 16 | 3.09 | 33143 | 17 | 16.38 | 33143 | 22 | 20.97 | 12.44 | 12.44 | 12.44 |
| urf5_159 | 9 | 17156 | 15 | 2.48 | 10992 | 15 | 1.38 | 10992 | 18 | 1.44 | 10992 | 24 | 1.68 | 35.93 | 35.93 | 35.93 |

4.2 Improvements in Nearest Neighbor Mapping In the second experiment we consider the actual mapping of the decomposed circuits to Hex20 and IBM27 architectures. We have used a heuristic based nearest-neighbor (NN) method [4] for mapping the decomposed netlist. Table 2 shows the analysis of experimental results of NN-mapping of dirty, clean and architecture-aware decomposed netlists. The first column provides the benchmark name. The next six columns show the number of CNOT gates (.#CX) and time in seconds (t) for mapping the circuit to Hex20 architecture by using dirty, clean and architecture-aware decomposed netlists respectively. The next two columns show the .% improvement in .#CX gates for clean and architecture-aware mapping over dirty ancilla based method for the Hex20 architecture. The next eight columns show similar mapping results for the IBM27 architecture. Some of the entries in the table are marked as (-), for those benchmarks for which n is larger than the number of physical qubits available. From the results we can say that the mapping of architecture-aware decomposed netlist for Hex20 and IBM27 have less number of CNOT gates compared to the dirty ancilla based decomposed netlist. For most of the benchmarks, clean ancilla based decomposed netlist require less number of CNOT gates for mapping in both the architectures. This is due to the usage of less number of additional qubits as ancilla in clean ancilla based approach. The architecture-aware decomposition tries to exploit coupling information for searching nearest physical qubit as ancilla. This often results in more physical qubits in the decomposed netlist, e.g. the architectureaware description of benchmark C17_204 requires 11 qubits whereas the cleanancilla based description requires 9 qubits (see Table 1). The considered mapping approach [4] is unable to exploit the qubit-mapping information available from the proposed architecture-aware decomposition. 
The mapping result may vary even when the clean ancilla based and architectureaware decomposed netlists contain similar number of CNOT gates and qubits. For example, for benchmark like ham15_107 comprising of 1363 CNOT gates and 20qubits (see Table 1), the mapping approach [4] provides better CNOT reductions for architecture-aware decomposed netlist when mapped to Hex20 architecture. This is also the case for many benchmarks while mapping to IBM27 architecture. Hence architecture-aware decomposition is beneficial at least for cases when the clean ancilla based decomposed netlist cannot be considered for architectural mapping due to qubit restrictions. For example, the clean ancilla based 21-qubit description of the benchmark clip_206 (see Table 1) has no mapping result for Hex20 architecture. Finally, the architecture-aware decomposition does not provide any improvement over dirty ancilla based approach for architectural mapping, e.g. like mapping of the benchmark add6_196 on Hex20 architecture. The dirty and clean ancilla based decomposition of the benchmark add6_196 requires 19 and 24 qubits (see Table 1) respectively. Thus during architecture-aware decomposition for Hex20 architecture,

Benchmark 4gt10-v1_81 4gt12-v0_86 4gt4-v0_73 5xp1_194 9symml_195 C17_204 C7552_205 add6_196 adr4_197 alu2_199 alu3_200 alu4_201 apla_203 clip_206 cm150a_210 cycle10_2_110 dc2_222 decod_217 dist_223 example2_231 f51m_233 ham15_107 hwb6_56

Hex 20 Dirty #CX t(s) 52 0.30 87 0.48 139 0.55 2305 2.30 8201 3.08 129 0.40 11257 2.78 1249 0.96 8226 2.76 4266 2.22 9106 2.57 1430 2.15 2662 1.57 10396 2.39 8187 2.81 3368 1.60 2380 0.68

Clean #CX t(s) 42 0.33 65 0.45 128 0.48 3420 1.43 110 0.51 1107 1.34 770 1.83 1909 1.88 5916 2.38 2506 2.05 1984 0.73

Arch. #CX 42 59 119 2408 4398 113 11843 1059 7610 4495 7131 1037 2347 7173 7805 2446 2212 t(s) 0.21 0.29 0.30 1.72 1.13 0.43 3.04 1.14 2.19 2.13 2.08 1.41 1.49 2.11 2.38 1.81 1.03

Impr.(%) Clean Arch. 19.23 19.23 25.29 32.18 7.91 14.39 −4.47 58.3 46.37 14.73 12.40 −5.21 11.37 15.21 7.49 −5.37 21.69 46.15 27.48 28.29 11.83 43.09 31.00 4.67 25.59 27.38 16.64 7.06

IBM 27 Dirty #CX t(s) 64 0.37 126 0.53 196 0.63 4504 1.72 14363 3.12 237 0.43 5944 1.82 21256 3.36 2557 0.69 15033 2.57 7896 1.76 188006 34.02 12008 2.80 16594 2.60 3740 1.48 2660 1.18 5614 1.09 5077 1.60 18931 2.29 15927 2.03 115314 21.32 5999 0.98 3931 0.48 Clean #CX 60 128 194 2958 6966 212 3526 13249 1749 10848 5376 9944 2068 1559 3355 3757 10620 9795 4396 3151 t(s) 0.25 0.35 0.35 1.24 1.11 0.34 1.28 2.61 0.59 2.20 1.77 1.81 1.13 0.85 0.93 1.25 1.55 2.04 1.11 0.45

Arch. #CX 57 116 203 3579 6915 203 4702 15415 1875 10938 5994 197140 10890 13529 2281 2108 5263 4450 13266 11025 118174 4777 3715

Table 2 Analysis of nearest neighbor mapped naive (dirty and clean ancilla based) and architecture aware decomposed netlist Impr.(%) t(s) Clean Arch. 0.31 6.25 10.94 0.30 −1.59 7.94 0.36 1.02 −3.57 1.43 34.33 20.54 1.14 51.50 51.86 0.34 10.55 14.35 1.63 40.68 20.9 3.09 37.67 27.48 0.62 31.60 26.67 2.56 27.84 27.24 1.85 31.91 24.09 36.62 −4.86 2.69 9.31 3.27 40.07 18.47 1.14 44.71 39.01 1.00 41.39 20.75 1.73 40.24 6.25 1.61 26.00 12.35 2.34 43.90 29.92 2.54 38.50 30.78 22.26 −2.48 1.62 26.72 20.37 0.57 19.84 5.49

84 S. Sengupta et al.

hwb7_59 hwb8_113 hwb8_114 hwb9_119 hwb9_123 inc_237 life_238 max46_240 mlp4_245 rd73_252 root_255 sao2_257 sym9_148 sym9_193 urf2_161 urf3_157 urf3_279 urf5_159

7649 22758 19338 65832 30568 3176 6502 6484 5792 1795 4617 7603 6612 8195 38626 161021 72381 30650

1.20 2.31 2.13 5.71 3.74 1.48 2.16 1.76 2.00 0.98 1.65 3.02 1.08 2.48 2.59 14.36 6.63 4.14

6174 17722 14737 48290 23942 3288 3102 1355 2691 4902 3984 38122 112041 69896 20424

1.23 2.29 2.16 5.05 3.44 1.45 1.31 1.49 2.28 1.08 1.48 2.98 11.60 7.74 3.11

6747 18358 15517 53759 23327 2560 3498 2787 5707 1397 3054 7563 5040 3609 37978 114447 70187 22836

1.56 2.46 2.24 7.12 3.33 1.46 1.17 0.90 2.29 1.02 1.51 2.18 0.72 1.00 3.25 10.5 7.34 3.54

19.28 22.13 23.79 26.65 21.68 49.43 52.16 24.51 41.72 25.86 51.38 1.3 30.42 3.43 33.36

11.79 19.33 19.76 18.34 23.69 19.40 46.20 57.02 1.47 22.17 33.85 0.53 23.77 55.96 1.68 28.92 3.03 25.49

14360 43989 38664 123672 59101 6638 11398 11458 10865 3100 8514 15190 12024 14186 54595 310544 140790 58052

0.99 2.72 2.41 8.16 4.42 1.01 1.46 1.26 1.30 0.51 1.04 2.12 0.76 1.75 2.66 24.65 10.12 4.24

10224 34339 28648 91274 43754 4106 6096 5211 7482 2372 4959 7912 10419 6681 53020 208017 120587 38073

0.79 2.27 1.88 6.26 3.48 1.07 1.03 0.87 1.56 0.57 1.00 1.72 0.85 1.13 2.86 16.67 10.26 3.03

11970 33922 28999 95267 41324 5612 7077 5565 9564 2393 6216 10174 9666 7041 53803 216813 120125 39612

1.20 2.71 2.28 6.65 3.57 1.73 1.12 0.88 2.17 0.63 1.57 2.52 0.80 1.15 3.84 19.31 10.66 3.39

28.80 21.94 25.91 26.20 25.97 38.14 46.52 54.52 31.14 23.48 41.75 47.91 13.35 52.90 2.88 33.02 14.35 34.42

16.64 22.89 25.00 22.97 30.08 15.46 37.91 51.43 11.97 22.81 26.99 33.02 19.61 50.37 1.45 30.18 14.68 31.76

Architecture Aware Decomposition of Quantum Circuits 85


most of the MCT gate operations from add6_196 are realized using dirty ancilla based descriptions.

5 Conclusion

In this paper we show how architecture-aware decomposition plays a key role in reducing the number of CNOT gates when mapping quantum circuits to various physical architectures. In particular, we examine the Qubit-Interaction Graph (QIG) of dirty- and clean-ancilla-based decompositions of n-qubit MCT gates. We then show that, given the architectural information, the decomposed netlists can be mapped efficiently to these architectures. Experimental results reveal that clean-ancilla-based decomposition always generates fewer CNOT gates. However, when the decomposed netlist is mapped onto a physical architecture, the architecture-aware method provides better results for many circuits. As all prior works have considered the decomposed netlist using either dirty or clean ancilla qubits, it is evident that a further reduction in the number of CNOT gates can be achieved if information about the physical architecture is also included.


Structure-Aware Minor-Embedding for Machine Learning in Quantum Annealing Processors

Jose P. Pinilla and Steven J. E. Wilton

1 Introduction

Quantum-assisted training for probabilistic machine learning models is receiving increasing attention. One example, and the focus of our work, is the use of quantum annealing processors (QAPs) to replace the otherwise intractable task of sampling from the joint probabilities of Boltzmann Machines (BMs), which is required during training, as shown by Benedetti et al. [3], Job and Adachi [7], Liu et al. [9], and others. The effectiveness of training Boltzmann Machines in this way depends on the quality of the samples obtained from the quantum annealer, i.e., their distance from the characteristic Boltzmann distribution of the model. Sample quality can be impacted by poor embeddings of the problem onto QAPs. Commercially available QAPs (i.e., D-Wave’s Chimera or Pegasus devices described in [4]) have low connectivity compared to the BM models, as well as qubits that are disabled due to fabrication yield. A straightforward embedding of a BM onto a QAP fabric can be done using (a) systematic methods, which either assume fully operational qubits or tolerate defective qubits in a structured way, or (b) path-search heuristics, which can avoid faulty qubits but may result in highly imbalanced qubit chains, leading to degraded sample quality. We show that it can be preferable to use systematic methods but prune the BM itself to avoid faulty qubits in the embedding, rather than attempting to use a heuristic algorithm to avoid faulty qubits for a fixed BM. Pruning edges of the BM could result in a less capable BM and hence a loss of accuracy in the trained network. However, we show that in certain circumstances this loss in accuracy can be outweighed by the benefits of increased sample quality due to a reduction in qubit chain length. This is enabled by the ability of the training algorithm to “train around” these changes to the BM. In this work we investigate several methods of pruning and quantify their impact on overall training accuracy.

J. P. Pinilla · S. J. E. Wilton
Department of Electrical and Computer Engineering, University of British Columbia, Vancouver, BC, Canada
e-mail: [email protected]; [email protected]
© The Author(s), under exclusive license to Springer Nature Switzerland AG 2024
H. Thapliyal, T. Humble (eds.), Quantum Computing, https://doi.org/10.1007/978-3-031-37966-6_5

2 Motivation

Early work mapping restricted Boltzmann machines (RBMs) to QAPs suggested systematically finding the minor-embedding of fully connected bipartite graphs. For example, the result for D-Wave architectures is an assignment of one set of nodes in the graph to rows of qubits, and the other to columns. However, if a straightforward row-and-column minor-embedding does not exist due to qubit faults, these methods fall back to path-search heuristics, e.g., [5]. As an example, Fig. 1 shows two cases of faults in the qubit fabric that would cause a mismatch of the systematic minor-embedding. The RBM bipartite graph is shown on the left side, and a toy version of the qubit fabric is on the right. This toy version can be characterized as a Chimera C(3,3,2) graph, whereas the Chimera graph in D-Wave’s DW-2000Q devices is C(16,16,4). D-Wave’s Pegasus graph can be thought of as having an extra dimension on the Chimera graph, as seen in [4]. First, Fig. 1a shows the case where edge qubits and single couplers are defective. As seen on the RBM, this only affects the existence of some edges between the visible and hidden nodes of the graph. In Fig. 1b, however, the missing qubit creates a disjoint chain, represented by the split RBM node.

3 Related Work

Two approaches have been considered in related work to embed RBMs on fabrics with non-trivial pruning:

Adaption with Disjoint Chains (Adachi) One method is to construct the row/column minor-embedding as described above, and to accommodate faulty qubits by allowing gaps in the chains that represent each BM node when required. Once samples are obtained, the broken chain values are resolved using majority vote, as if they were connected. These implementations have been shown to have small effects when only a small number of qubits are missing, suggesting that the training algorithm can train around these faults, as seen in [1, 6, 7]. Figure 2a is a representation of the adaption made to the RBM graph when the faults of Fig. 1b are present in the qubit graph.

Greedy Network Search (Koshka) In [8], the goal is to construct a maximum “embeddable” model on a fixed fabric. The row/column approach is modified such that a new RBM unit is created whenever a chain is found to be broken. This goal is different from ours: rather than finding an embedded model and then fitting a dataset to it (which is useful for characterization problems), our goal is to create, embed, and train a model suited for a pre-determined dataset, i.e., a real-world scenario.

Fig. 1 Examples of systematic minor-embeddings of a 6x6 RBM onto a “Toy” C(3,3,2) Chimera graph with missing qubits and/or couplers. (a) Systematic minor-embedding with trivial prune. (b) Systematic minor-embedding with disjoint chain
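The majority-vote step used in the disjoint-chain (Adachi) approach can be illustrated with a minimal Python sketch. It assumes spin values in {-1, +1}, a sample stored as a qubit-to-value dictionary, and chains stored as node-to-qubit-list dictionaries; the function names, the example data and the tie-breaking rule are illustrative assumptions, not details taken from the cited works.

```python
from collections import Counter


def resolve_chain(qubit_values):
    """Return the majority spin of one chain.

    Ties are broken toward +1, which is an arbitrary choice made for this sketch.
    """
    counts = Counter(qubit_values)
    return 1 if counts[1] >= counts[-1] else -1


def resolve_sample(sample, chains):
    """Collapse a per-qubit sample into one value per model node."""
    return {
        node: resolve_chain([sample[q] for q in qubits])
        for node, qubits in chains.items()
    }


# Hypothetical data: node "v0" has a disjoint chain (qubit 2 is missing),
# but its surviving qubits are still voted on together.
chains = {"v0": [0, 1, 3], "h0": [4, 5, 6]}
sample = {0: 1, 1: 1, 3: -1, 4: -1, 5: -1, 6: 1}
resolved = resolve_sample(sample, chains)
print(resolved)
```

The vote treats the two pieces of a broken chain as a single logical node, which is what "as if they were connected" amounts to in this sketch.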

4 Contributions

In this work, we consider three new pruning methods:

Adaption with Side Pruning (Adapt) In this method, we prune the model whenever we find a gap in the qubit chain, and preserve only the longest joint chain and its connections. Once pruned, it is necessary to verify the validity of the resulting model. Disjoint graphs, or instances where hidden nodes are not connected to at least one label node (for classification applications), would be considered unsuitable for training. Figure 2b represents the resulting RBM graph from the faults in Fig. 1b.

Adaption with Label Priority (Prio) The previous method can be improved by rearranging node-to-qubit assignments. In a complete-bipartite RBM all visible nodes are indistinguishable, so any mapping from dataset to visible nodes is allowed. We use this additional structural information to swap higher-priority visible nodes onto more connected qubit chains. This is particularly straightforward when an RBM is used in a classification application. These networks have visible nodes that represent a one-hot encoding of the labels in the dataset, which gives them the highest priority.

Adaption with Node Repurposing (Rep) In order to recover resources lost due to these pruning methods, a further step is to repurpose the disjoint sections of qubit chains from hidden nodes. This is done by creating new hidden nodes and connections, as long as the same validity requirement is met. Because the fracture is in a visible chain, this method mirrors the embedding to make use of the disjoint chains in the hidden nodes. Figure 2d shows the RBM graph with visible and hidden nodes flipped, and one extra, partially connected hidden node.

Fig. 2 Representation of the adaptions made to the RBM graph. (a) Method proposed by Adachi and Henderson [1]. (b) Adaptive method. (c) Priority method. (d) Repurpose method
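The two steps of side pruning (keep the longest joint chain, then check validity) can be sketched in Python as follows. The sketch assumes that a chain is a linear path of qubits listed in order and that RBM edges are stored as (visible, hidden) pairs; all names and the example data are hypothetical and only illustrate the idea.

```python
def longest_joint_segment(chain, faulty):
    """Return the longest run of consecutive working qubits in an ordered chain."""
    best, current = [], []
    for qubit in chain:
        if qubit in faulty:
            current = []  # a faulty qubit breaks the chain here
        else:
            current.append(qubit)
            if len(current) > len(best):
                best = list(current)
    return best


def hidden_nodes_valid(edges, hidden_nodes, label_nodes):
    """Check that every hidden node keeps an edge to at least one label node."""
    return all(
        any((label, hidden) in edges for label in label_nodes)
        for hidden in hidden_nodes
    )


# A length-6 chain with one faulty qubit splits into pieces of length 4 and 1;
# only the longer piece is kept.
kept = longest_joint_segment([10, 11, 12, 13, 14, 15], faulty={14})

# Hypothetical pruned model: hidden node "h1" has lost its only label edge.
edges = {("l0", "h0"), ("v1", "h0"), ("v1", "h1")}
valid = hidden_nodes_valid(edges, hidden_nodes=["h0", "h1"], label_nodes=["l0"])
print(kept, valid)
```

A model for which the check fails would be considered unsuitable for training under the validity requirement described above.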


5 Methodology

We evaluate the impact of each pruning method by creating an RBM with 64 visible nodes and 64 hidden nodes, which is embedded onto, and sampled from, a publicly available D-Wave QAP. The size of the D-Wave machine also allowed us to find a region of the device without qubit imperfections. We call that the Complete embedding and use it as a reference for performance in our experiments. We use the D-Wave “Advantage_system4.1”, which is a 5627-qubit quantum annealer that uses the Pegasus architecture. We train the network on a simplified version of the 8x8 OptDigits dataset, which is a similar setup to the one in [2], i.e., binarized data with 8 classes. In order to perform a classification task, we include the 8 one-hot bits as part of the image.

5.1 Training Algorithm

The training goal is to maximize the log-likelihood $\ln L(\Theta \mid v)$, given model parameters $\Theta$ and each dataset value fixed on the visible nodes $v$. The logarithm is used so that this value can be expressed as the sum of two terms: the positive phase, i.e., the energy contribution from the configurations where the visible nodes are fixed, and the negative phase, i.e., the energy contribution from all possible configurations of the network. These correspond to the first and second terms of the rightmost expression below.

$$\ln L(\Theta \mid v) = \ln \frac{1}{Z} \sum_{h} e^{-E(v,h)} = \ln \sum_{h} e^{-E(v,h)} - \ln \sum_{v,h} e^{-E(v,h)} \tag{1}$$

Computation of the positive phase is trivial in RBMs, because the hidden nodes are conditionally independent of each other once the visible nodes are fixed. The computation of the negative phase, on the other hand, would require knowledge, or an estimate, of the partition function $Z = \sum_{v,h} e^{-E(v,h)}$. This also translates into a positive and a negative phase of the gradient function.

$$\frac{\partial \ln L(\Theta \mid v)}{\partial \Theta} = -\sum_{h} p(h \mid v) \frac{\partial E(v,h)}{\partial \Theta} + \sum_{v,h} p(v,h) \frac{\partial E(v,h)}{\partial \Theta} \tag{2}$$
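To make the cost of the negative phase concrete, the following Python sketch evaluates Eq. (1) by brute force for a tiny RBM. It assumes binary {0, 1} units and the standard RBM energy $E(v,h) = -\sum_i b_i v_i - \sum_j c_j h_j - \sum_{i,j} w_{ij} v_i h_j$, which is not spelled out in this section; the parameter values are arbitrary. The partition function sums over all $2^{n_v + n_h}$ configurations, which is why it is only feasible for toy sizes.

```python
import itertools
import math


def energy(v, h, W, b, c):
    """Assumed standard RBM energy: -b.v - c.h - v^T W h."""
    e = -sum(bi * vi for bi, vi in zip(b, v))
    e -= sum(cj * hj for cj, hj in zip(c, h))
    e -= sum(W[i][j] * v[i] * h[j] for i in range(len(v)) for j in range(len(h)))
    return e


def log_likelihood(v, W, b, c):
    """Eq. (1): positive-phase term minus ln Z, both by exhaustive enumeration."""
    hidden_states = list(itertools.product([0, 1], repeat=len(c)))
    visible_states = list(itertools.product([0, 1], repeat=len(b)))
    positive = math.log(sum(math.exp(-energy(v, h, W, b, c)) for h in hidden_states))
    log_z = math.log(
        sum(
            math.exp(-energy(u, h, W, b, c))
            for u in visible_states
            for h in hidden_states
        )
    )
    return positive - log_z


# Arbitrary toy parameters: 3 visible nodes, 2 hidden nodes.
W = [[0.5, -0.3], [0.2, 0.1], [-0.4, 0.6]]
b = [0.1, -0.2, 0.0]
c = [0.05, -0.1]

# Sanity check: the likelihoods of all visible configurations sum to one.
total = sum(
    math.exp(log_likelihood(v, W, b, c))
    for v in itertools.product([0, 1], repeat=3)
)
print(total)
```

For the 64 x 64 RBM used in this work, the same enumeration would need $2^{128}$ terms, which motivates estimating the negative phase from samples instead.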

Practical applications of BMs calculate the gradients in Equation (2) by approximating the expectation values from a collection of samples S, shown below for the individual cases of visible and hidden nodes,

$$\langle v_i \rangle_{\text{model}} \approx \frac{1}{S} \sum_{s=1}^{S} v_i^{(s)}; \qquad \langle h_j \rangle_{\text{model}} \approx \frac{1}{S} \sum_{s=1}^{S} h_j^{(s)}; \tag{3}$$

and for the pairwise interactions in the network,

$$\langle v_i h_j \rangle_{\text{model}} \approx \frac{1}{S} \sum_{s=1}^{S} v_i^{(s)} h_j^{(s)}. \tag{4}$$

Having estimated the gradients at each node, we perform Stochastic Gradient Descent (SGD) with the following update:

$$\Delta w_{ij}(t+1) = \eta \, \Delta w_{ij}(t) + \epsilon \left( \frac{1}{S} \sum_{s=1}^{S} \left[ v_i^{(s)} h_j^{(s)} - \hat{v}_i^{(s)} \hat{h}_j^{(s)} \right] - \lambda w_{ij}(t) \right), \tag{5}$$

where $\epsilon = 0.1$ is the learning rate, and where we have also incorporated a momentum term $\eta = 0.5$ and a weight decay term $\lambda = 10^{-4}$.
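The sample-based estimate of the pairwise term and the momentum update can be sketched in plain Python as below. Two points are assumptions of this sketch rather than statements from the text: the unhatted samples are taken to be the positive-phase (data-clamped) samples and the hatted ones the negative-phase (model) samples, and the weights are advanced as $w_{ij}(t+1) = w_{ij}(t) + \Delta w_{ij}(t+1)$. The function names and toy data are hypothetical.

```python
def correlation(vs, hs, i, j):
    """Sample estimate of <v_i h_j> over paired lists of samples."""
    return sum(v[i] * h[j] for v, h in zip(vs, hs)) / len(vs)


def sgd_step(w, dw, pos_v, pos_h, neg_v, neg_h, eps=0.1, eta=0.5, lam=1e-4):
    """One momentum update with weight decay; returns new weights and new deltas."""
    new_w = [row[:] for row in w]
    new_dw = [row[:] for row in dw]
    for i in range(len(w)):
        for j in range(len(w[0])):
            grad = correlation(pos_v, pos_h, i, j) - correlation(neg_v, neg_h, i, j)
            new_dw[i][j] = eta * dw[i][j] + eps * (grad - lam * w[i][j])
            # Assumed rule for advancing the weight itself.
            new_w[i][j] = w[i][j] + new_dw[i][j]
    return new_w, new_dw


# Toy example with one visible and one hidden node and two samples per phase.
pos_v, pos_h = [[1], [1]], [[1], [0]]  # positive-phase correlation 0.5
neg_v, neg_h = [[0], [0]], [[1], [1]]  # negative-phase correlation 0.0
w, dw = [[0.0]], [[0.0]]
w, dw = sgd_step(w, dw, pos_v, pos_h, neg_v, neg_h)
print(w, dw)
```

With zero initial weight and momentum, the first step reduces to the learning rate times the correlation difference, i.e. 0.1 x 0.5.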

5.2 Positive-Phase Temperature Scaling

As part of the training algorithm, we use the method in [10], called positive-phase temperature scaling. This is a way of incorporating an estimate of the inverse temperature that characterizes the distribution of samples collected from the D-Wave machine, such that on the negative phase we sample from

$$p(v,h)_{\alpha} = \frac{e^{-\alpha \beta_{DW} E(v,h)}}{Z_{\alpha}}; \qquad Z_{\alpha} = \sum_{v,h} e^{-\alpha \beta_{DW} E(v,h)}, \tag{6}$$

where $\alpha$ is the range scaling factor, i.e., the factor that fits the model to the D-Wave device range,¹ and $\beta_{DW}$ is the real inverse temperature of the D-Wave device. On the positive phase, we sample from

$$p(h'_j = 1 \mid v) = \mathrm{sigm}\left( \alpha \beta_{\text{eff}} c_j + \alpha \beta_{\text{eff}} \sum_{i}^{m} w_{ij} v_i \right), \tag{7}$$

with an estimated value of inverse temperature $\beta_{\text{eff}} = 2.5$. The assumption is that by scaling both versions of the model, the characteristic distributions match. As a result of using the positive-phase temperature scaling approach, we are able to train the RBM indefinitely, i.e., the weights and biases of the model do not overflow.

¹ For the Advantage_system4.1 these are $h_{\text{range}} = [-4.0, 4.0]$ and $J_{\text{range}} = [-1.0, 1.0]$.
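The scaled positive-phase conditional in Eq. (7) can be sketched in Python as follows. Only $\beta_{\text{eff}} = 2.5$ comes from the text; the value of $\alpha$, the weights, the biases and the function names are placeholder assumptions for illustration, and units are assumed to take values in {0, 1}.

```python
import math
import random


def sigm(x):
    """Logistic sigmoid."""
    return 1.0 / (1.0 + math.exp(-x))


def hidden_activation_probs(v, W, c, alpha, beta_eff=2.5):
    """p(h'_j = 1 | v) for every hidden node j, following Eq. (7)."""
    scale = alpha * beta_eff
    return [
        sigm(scale * c[j] + scale * sum(W[i][j] * v[i] for i in range(len(v))))
        for j in range(len(c))
    ]


def sample_hidden(v, W, c, alpha, beta_eff=2.5, rng=random):
    """Draw one binary hidden configuration from the scaled conditional."""
    probs = hidden_activation_probs(v, W, c, alpha, beta_eff)
    return [1 if rng.random() < p else 0 for p in probs]


# Placeholder parameters: 2 visible nodes, 2 hidden nodes, alpha = 1.0.
W = [[0.2, -0.4], [0.3, 0.1]]
c = [0.0, 0.4]
probs = hidden_activation_probs([1, 0], W, c, alpha=1.0)
hidden = sample_hidden([1, 0], W, c, alpha=1.0, rng=random.Random(0))
print(probs, hidden)
```

Setting `alpha * beta_eff` to 1 recovers the unscaled RBM conditional, so the scaling only sharpens or flattens the same sigmoid.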


6 Results

For each of our experiments we initialize three random models, train them for 75 epochs on the training set, and measure the testing accuracy on the remaining images. We use a batch size of 1024 data points, which we also use as the number of samples collected from the QAP. Throughout training, at every epoch, we measure the testing accuracy by clamping only the visible nodes with the image, and sampling from the label nodes to compare them against the true labels in the test dataset.

6.1 Heuristic vs Systematic

First we compare the training results from the Complete embedding against embeddings found using path heuristics. For each random model in the latter, we use the random seed to initialize the model parameters as well as the minor-embedding algorithm. Figure 3 shows a histogram of the chain length metric for the three random embeddings. At the size of the D-Wave device chosen for the experiments, this method finds low-quality embeddings, i.e., with very long chains compared to the Complete embedding, seen in Table 1. This difference in embedding quality is reflected directly in the performance of the training algorithm, as seen in Fig. 4. We evaluate the mean absolute error (MAE) and the Kullback-Leibler divergence $D_{KL}$, which use the dataset and the samples collected during the negative phase, as well as the testing accuracy, as mentioned above.

Fig. 3 Quality of heuristic embeddings
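The chain length metric used to compare embeddings can be computed with a few lines of Python. The sketch assumes an embedding stored as a dictionary from model node to the list of physical qubits in its chain; the example embedding is made up and is not one of the embeddings from the experiments.

```python
from collections import Counter


def chain_length_histogram(embedding):
    """Map each chain length to the number of chains with that length."""
    return Counter(len(qubits) for qubits in embedding.values())


# Hypothetical embedding of four model nodes.
embedding = {"v0": [0, 4, 8], "v1": [1, 5], "h0": [2, 6, 10], "h1": [3]}
hist = chain_length_histogram(embedding)
longest = max(hist)
print(dict(hist), longest)
```

A tight histogram concentrated at a single short length corresponds to a balanced embedding, while a long tail indicates the imbalanced chains that heuristics tend to produce.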

Table 1 Quality of systematic embeddings

            Chain length
Method      6    5    4    3    2    1
Complete    128  0    0    0    0    0
Adachi      126  0    1    1    1    1
Adaptive    126  0    1    1    0    0
Priority    126  0    1    1    0    0
Repurpose   126  0    1    1    0    1

For each of these metrics, we see a slight advantage in the performance of the Complete embedding, although with more variation, as seen from the standard deviation between the 3 runs. One possibility for this behaviour is that the long chains in the heuristic-embedded models don’t contribute to the final parameters, effectively turning into a smaller model, and therefore fewer trainable solutions.

6.2 Adaption Methods

Finally, we followed the same methodology to train RBMs, but this time putting them through our adaption methods. Figure 5 presents the testing accuracy for the 8-label OptDigits dataset. We highlight that although there exists a Complete embedding of the RBM network onto the device graph, better performance can be obtained in two instances: when some edges are pruned and disjoint chains are preserved (i.e., Adachi), or with even more edges pruned but with an additional hidden node (i.e., Rep). Those two adaption methods share the idea of repurposing, either by creating higher connectivity in RBM nodes through disjoint chains (i.e., Adachi), or by creating a wider hidden layer by assigning disjoint chains to multiple RBM nodes (i.e., Repurpose). Pruning additional nodes (i.e., Adapt and Prio) resulted in similar or lower performance than the Complete embedding.

7 Discussion

Using quantum computing devices requires algorithms that map problems onto qubits. The optimality of these mapping algorithms has a measurable impact on the quality of the samples obtained. We have found an opportunity to improve the minor-embedding methods used in sampling-based training of RBMs assisted by quantum annealers. Inspired by the idea of pruning for classical ML models, we integrate pruning methods into the training algorithms for quantum-assisted RBMs (QARBMs) using D-Wave QAPs. While pruning for classical ML is motivated by a reduction in overfitting and lower use of resources, pruning for QARBMs through structure-aware embeddings can also be beneficial in the training process due to an improvement in sample quality.

Fig. 4 Comparison between systematic and heuristic. (a) $\mathrm{MAE}(v_{\text{data}}, v_{\text{samples}})$. (b) $D_{KL}(P_{\text{data}} \,\|\, P_{\text{samples}})$. (c) Testing accuracy


Fig. 5 Accuracy tests. Lines are labeled with the pruning method and number of edges pruned. Bold lines show the mean of the aggregated runs, with shaded regions as the confidence interval in terms of standard deviation

Experiments using different pruning methods show that there is a performance advantage in using fewer edges, when these translate into better embeddings (i.e. shorter qubit chains) of the model onto the qubit fabric, or a repurposing of broken chains. Further exploration into minor-embedding methods for QARBMs is in the scope of our future work. We look forward to expanding our approach to more general models, e.g. general Boltzmann machines. The higher connectivity in those topologies could benefit more from structure-awareness in the minor-embedding process.

References

1. Adachi SH, Henderson MP (2015) Application of Quantum Annealing to Training of Deep Neural Networks. arXiv preprint arXiv:151000635 p 18. https://doi.org/10.1038/nature10012, URL http://arxiv.org/abs/1510.06356
2. Benedetti M, Realpe-Gómez J, Biswas R, et al (2017) Quantum-assisted learning of hardware-embedded probabilistic graphical models. Physical Review X 7(4). https://doi.org/10.1103/PhysRevX.7.041052, URL http://arxiv.org/abs/1609.02542
3. Benedetti M, Garcia-Pintos D, Perdomo O, et al (2019) A generative modeling approach for benchmarking and training shallow quantum circuits. npj Quantum Information 5(1). https://doi.org/10.1038/s41534-019-0157-8
4. Boothby K, Bunyk P, Raymond J, et al (2020) Next-Generation Topology of D-Wave Quantum Processors. URL http://arxiv.org/abs/2003.00133
5. Cai J, Macready WG, Roy A (2014) A practical heuristic for finding graph minors. URL http://arxiv.org/abs/1406.2741
6. Dixit V, Selvarajan R, Aldwairi T, et al (2022) Training a Quantum Annealing Based Restricted Boltzmann Machine on Cybersecurity Data. IEEE Transactions on Emerging Topics in Computational Intelligence 6(3):417–428. https://doi.org/10.1109/TETCI.2021.3074916, URL http://arxiv.org/abs/2011.13996
7. Job J, Adachi S (2020) Systematic comparison of deep belief network training using quantum annealing vs. classical techniques. URL http://arxiv.org/abs/2009.00134
8. Koshka Y, Novotny MA (2021) Comparison of Use of a 2000 Qubit D-Wave Quantum Annealer and MCMC for Sampling, Image Reconstruction, and Classification. IEEE Transactions on Emerging Topics in Computational Intelligence 5(1):119–129. https://doi.org/10.1109/TETCI.2018.2871466, URL https://ieeexplore.ieee.org/document/8479371/
9. Liu J, Mohan A, Kalia RK, et al (2020) Boltzmann machine modeling of layered MoS2 synthesis on a quantum annealer. Computational Materials Science 173:109429. https://doi.org/10.1016/j.commatsci.2019.109429
10. Pinilla JP, Wilton SJE (2022) Positive-Phase Temperature Scaling for Quantum-Assisted Boltzmann Machine Training. International Conference for High Performance Computing, Networking, Storage and Analysis (SC’ 22). IEEE Computer Society. https://doi.org/10.1109/SC41404.2022.00073

Software for Massively Parallel Quantum Computing

Thien Nguyen, Daanish Arya, Marcus Doherty, Nils Herrmann, Johannes Kuhlmann, Florian Preis, Pat Scott, and Simon Yin

1 Introduction

Quantum computing has the potential to solve previously intractable problems, including the factoring of large numbers [1], efficiently discovering optima in large search spaces [2], modelling and simulation of quantum mechanical systems [3, 4], and solving large systems of equations [5–8]. Current quantum computers are, however, unable to outperform their classical counterparts. This is for a number of reasons:

• quantum algorithms are not yet as developed as classical ones;
• quantum bits, or qubits, are subject to environmental disturbance (so-called ‘decoherence’), and thus exhibit high error rates;
• the need to retain coherence throughout a calculation means that there is effectively no intermediate memory (such as processor cache or RAM) compatible with quantum algorithms at present.

To harvest computational power from these imperfect qubits, a typical design pattern for near-term quantum programs is to combine quantum and classical components into a hybrid, heterogeneous workflow. In this hybrid execution model, a high degree of parallelism emerges, whereby the computational workload can be distributed across many quantum processing units (QPUs), with only classical data exchanged between them. We refer to this as applying ‘classical parallelism’ to the quantum workload.

T. Nguyen · D. Arya · M. Doherty · N. Herrmann · J. Kuhlmann · F. Preis · P. Scott · S. Yin
Quantum Brilliance Pty Ltd, The Australian National University, Canberra, ACT, Australia
e-mail: [email protected]
© The Author(s), under exclusive license to Springer Nature Switzerland AG 2024
H. Thapliyal, T. Humble (eds.), Quantum Computing, https://doi.org/10.1007/978-3-031-37966-6_6


This model of quantum computation requires a software framework designed from the ground up to take maximal advantage of existing parallelisation schemes in high-performance computing (HPC). These include threading, message passing and offload to classical accelerators such as GPUs. The QB Software Development Kit (SDK) is specifically designed to exploit these classical parallelisation schemes, not just in its simulation of quantum hardware, but also in its execution of hybrid classical-quantum and pure quantum algorithms on real quantum hardware. In Sect. 2, we provide more background on hybrid quantum computing and the possibilities it offers for parallelisation, particularly those arising from compact form-factor accelerators such as those developed by Quantum Brilliance (QB). We then describe the QB SDK (Sect. 3), illustrate its parallelisation capabilities (Sect. 4), and conclude (Sect. 5).

2 Background

2.1 Quantum Brilliance Hardware

Quantum Brilliance builds quantum accelerators based on nitrogen-vacancy (NV) centers in diamond. The remarkable thermal and electrical properties of diamond [9, 10] allow for long coherence times of the qubits at room temperature and standard air pressure. Individual qubits are realized by two of the spin states of a single nitrogen nucleus. The electron contained in each NV center acts as a bus for initialization and readout of the nitrogen spin qubit, as well as a mediator for interactions between multiple qubits. The nuclear spin is controlled by radio frequency (RF) fields, while the electron can be manipulated by microwave (MW) fields. Readout and initialization are achieved by pumping the electronic state with green laser light with a wavelength of 532 nm. Operation at room temperature requires only off-the-shelf control electronics and optics. Such QPUs do not require any of the bulky or expensive infrastructure typically seen in other quantum technologies, such as liquid helium cooling systems, ultra-high vacuum or very precise lasers. Together these offer a significant opportunity to miniaturize the QPU and corresponding accelerator card to a form factor comparable to that of modern graphics accelerators. The current commercial realisation of this technology is a 2-qubit QPU, hosted in a 19" wide, rack-mountable 6U form-factor chassis [11]. The emergence of compact form-factor quantum accelerators paves the way to many-QPU systems similar to conventional supercomputers, and to deployment in edge computing scenarios. As we explain in the following subsection, such accelerators may also deliver a decisive performance advantage over classical systems earlier than larger systems.


2.2 Quantum Utility

Quantum devices can traverse a Hilbert space of dimension $2^N$ with only N qubits. Therefore, they promise significant breakthroughs in the computational sciences [e.g. 10, 12, 13]. Future quantum computers may enable calculations deemed impossible on the best known classical hardware. Arguably, such a future may not be on the cards for the currently emerging devices of the NISQ (noisy intermediate-scale quantum) era [14]. Fortunately, this does not imply that NISQ computing lacks industrial value. It merely requires a firmer definition of what constitutes a genuine quantum advantage. Any quantum device starts to become industrially useful once it outperforms a competitor classical device in terms of any of:

• computing time,
• accuracy, or
• power consumption.

Such quantum devices are therefore said to possess a certain degree of quantum advantage. This advantage can be sub-categorised into the realms of quantum dominance and quantum utility. Quantum dominance is obtained once a device can perform calculations that would otherwise be impossible on any existing classical computer. Quantum utility, on the other hand, is obtained once a quantum device outperforms a classical device of comparable size, weight, and cost. That performance could, however, be matched or exceeded by more capable existing classical hardware of greater size, weight or cost.

Figure 1 provides a schematic visualization of the concept of quantum utility. Here, abstract computational power is plotted against an abstract ‘device specification’ metric accounting for device size, weight and cost. The red and blue curves depict the performance of the best available classical devices, i.e. the boundaries of classical computing, and the best quantum devices, respectively. The light red area of classical advantage is separated from the light blue area of quantum advantage. The blue region splits further into the realms of quantum utility and quantum dominance, based on the availability or absence of larger, more costly, and more powerful classical devices.

While quantum dominance remains the ultimate goal of quantum computing, quantum utility may arguably be the more appropriate target for NISQ devices, especially in a hybrid classical-quantum environment. Quantum utility marks a significant milestone for suppliers and customers of quantum technology alike, as its achievement is highly likely to usher in a phase of truly ubiquitous quantum computing.

T. Nguyen et al.

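[Figure 1: vertical axis: computational power (speed, accuracy, efficiency); horizontal axis: device specification (size, weight, cost); curves: best classical device, best quantum device; regions: classical advantage, quantum utility, quantum dominance]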

Fig. 1 Schematic illustration of classical (light red) and quantum (light blue) advantage in a comparison of an abstract computational power and device specification. Here, computational power is a proxy for speed, accuracy, or efficiency, and device specification is a proxy for size, weight, and cost. Note the splitting of the region of quantum advantage into quantum dominance (where no classical solution is possible) and quantum utility (where the same results can be obtained classically, but only by a higher-spec’d device than the device able to provide the quantum solution)

2.3 Hybrid Quantum Computing Several algorithms have been proposed to solve problems using pure quantum computing, but in the current NISQ era, quantum devices are still too small or too noisy to achieve non-trivial results with these techniques. This has been the motivation for the creation of hybrid algorithms, which leverage already-available classical resources to enhance the power of quantum computation, and vice versa. These hybrid algorithms primarily make use of variational quantum circuits with parameterised gates, and a classical optimizer that updates the parameters of the gates to achieve the desired result. Some examples of algorithms following this paradigm are given below.

2.3.1 Variational Quantum Eigensolver

Finding the ground state of a physical Hamiltonian is a problem of particular interest in the field of quantum chemistry. The Hamiltonian itself may have a structure that creates quantum correlations between an arbitrary number of its constituents.

Software for Massively Parallel Quantum Computing

To solve this problem in a hybrid manner, one employs the variational quantum eigensolver (VQE) [15]. VQE makes use of the natural quantum correlations within a quantum computer, through the creation of a circuit (or multiple circuits) that reflect the Pauli decomposition of the Hamiltonian under analysis. In addition to the variational ansatz, a state preparation circuit is applied (before) to rotate the system state into the desired basis, and a basis rotation circuit (after) is used to bring the state back into the measurement basis.
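The hybrid loop described above can be illustrated with a minimal, purely classical sketch. Everything in it is an assumption chosen for illustration rather than part of any SDK: a single-qubit Hamiltonian H = 0.5 Z - 1.0 X given as a Pauli decomposition, the one-parameter ansatz Ry(θ)|0⟩, exact (noise-free) expectation values in place of measured ones, and a coarse scan followed by golden-section refinement standing in for the classical optimizer.

```python
import math

# Illustrative (assumed) Pauli decomposition of the Hamiltonian: H = 0.5*Z - 1.0*X
PAULI_TERMS = {"Z": 0.5, "X": -1.0}


def ansatz_state(theta):
    """Amplitudes of Ry(theta)|0>, which are real: (cos(theta/2), sin(theta/2))."""
    return (math.cos(theta / 2.0), math.sin(theta / 2.0))


def pauli_expectation(label, state):
    """Exact expectation value of a single-qubit Pauli operator for a real state."""
    a0, a1 = state
    if label == "Z":
        return a0 * a0 - a1 * a1
    if label == "X":
        return 2.0 * a0 * a1
    raise ValueError(f"unsupported Pauli term: {label}")


def energy(theta):
    """Objective function: sum of coefficient * <P> over all Pauli terms."""
    state = ansatz_state(theta)
    return sum(c * pauli_expectation(p, state) for p, c in PAULI_TERMS.items())


def minimise(objective, lo=0.0, hi=2.0 * math.pi, coarse=64, iters=60):
    """Stand-in classical optimizer: coarse scan, then golden-section refinement."""
    step = (hi - lo) / coarse
    best = min((lo + k * step for k in range(coarse)), key=objective)
    a, b = best - step, best + step
    ratio = (math.sqrt(5.0) - 1.0) / 2.0
    for _ in range(iters):
        c, d = b - ratio * (b - a), a + ratio * (b - a)
        if objective(c) < objective(d):
            b = d  # minimum lies in [a, d]
        else:
            a = c  # minimum lies in [c, b]
    theta = 0.5 * (a + b)
    return theta, objective(theta)


theta_opt, e_min = minimise(energy)
# Exact ground-state energy of a*Z + b*X is -sqrt(a^2 + b^2)
exact = -math.sqrt(sum(c * c for c in PAULI_TERMS.values()))
print(f"theta = {theta_opt:.4f}, E = {e_min:.6f}, exact = {exact:.6f}")
```

On hardware, each call to `pauli_expectation` would correspond to running a circuit (ansatz plus basis rotation) and averaging measurement outcomes, so the terms can be evaluated independently of one another.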

2.3.2 Quantum Approximate Optimization Algorithm

The quantum approximate optimization algorithm (QAOA) [2] is a widely-used technique to solve graph problems on a quantum computer, in particular those reducible to a MaxCut or QUBO (quadratic unconstrained binary optimization) formulation. It does so by invoking the basic principle of adiabatic quantum computation (AQC), wherein a system is allowed to evolve under the influence of a pre-defined Hamiltonian for a set amount of time in order to arrive at a desired state. The variational ansatz itself is split into two parts: the cost, and the mixer. The cost Hamiltonian evolves the initial state according to the graph weights of the optimization problem, and the mixer Hamiltonian allows for traversal, or ‘mixing’ between the allowable states of the optimizer. The classical optimizer is used to vary the rotation angles for the alternating variational ansatzes, which corresponds to varying Hamiltonian evolution time(s) within AQC.
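As a hedged illustration of the cost/mixer structure, the sketch below simulates a single QAOA layer for MaxCut with a small state-vector routine written in plain Python. The triangle graph, the grid search that stands in for the classical optimizer, and all function names are assumptions made for this example only.

```python
import cmath
import math
from itertools import product

# Assumed toy problem: MaxCut on a triangle (3 nodes, 3 unit-weight edges)
EDGES = [(0, 1), (1, 2), (0, 2)]
N_QUBITS = 3

# Computational basis states as bit tuples; list position = state-vector index
BASIS = list(product((0, 1), repeat=N_QUBITS))
# Cut value of every basis state (number of edges whose endpoints differ)
COSTS = [sum(1 for i, j in EDGES if bits[i] != bits[j]) for bits in BASIS]


def cost_layer(state, gamma):
    """Apply exp(-i * gamma * C); C is diagonal in the computational basis."""
    return [amp * cmath.exp(-1j * gamma * c) for amp, c in zip(state, COSTS)]


def mixer_layer(state, beta):
    """Apply exp(-i * beta * X) = cos(beta) I - i sin(beta) X to every qubit."""
    c, s = math.cos(beta), -1j * math.sin(beta)
    for q in range(N_QUBITS):
        mask = 1 << (N_QUBITS - 1 - q)  # qubit 0 is the most significant bit
        new_state = list(state)
        for idx in range(len(state)):
            if (idx & mask) == 0:
                a0, a1 = state[idx], state[idx | mask]
                new_state[idx] = c * a0 + s * a1
                new_state[idx | mask] = s * a0 + c * a1
        state = new_state
    return state


def expected_cut(gamma, beta):
    """Expected cut value after one cost layer and one mixer layer."""
    dim = 2 ** N_QUBITS
    state = [1.0 / math.sqrt(dim)] * dim  # uniform superposition |+>^n
    state = mixer_layer(cost_layer(state, gamma), beta)
    return sum(abs(a) ** 2 * c for a, c in zip(state, COSTS))


def grid_search(points=32):
    """Stand-in classical optimizer: exhaustive scan over both angles."""
    grid = [k * math.pi / points for k in range(points)]
    return max((expected_cut(g, b), g, b) for g in grid for b in grid)


best_cut, best_gamma, best_beta = grid_search()
print(f"best expected cut = {best_cut:.4f} "
      f"at gamma = {best_gamma:.4f}, beta = {best_beta:.4f}")
```

With both angles set to zero the circuit leaves the uniform superposition untouched, giving the random-guess value of 1.5 for this graph; the scan finds angles that do better, bounded above by the true maximum cut of 2.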

2.3.3 Quantum Machine Learning Algorithms

Recent widespread adoption of machine learning has allowed scientists and engineers to gain intuition about using the same methods within a quantum context. Many quantum machine learning (QML) algorithms have been created, with several quantum versions inspired by their classical counterparts: GANs → QGANs [16], support vector machines → QSVMs [17], and convolutional neural networks → QCNNs [18]. They each follow the hybrid paradigm of classically optimizing a variational circuit that is evaluated quantumly. This is quite similar to the standard machine learning paradigm of optimizing linear transformations in the presence of non-linear activation functions. It has been suggested that in certain cases, QML algorithms can achieve similar accuracy to their classical counterparts, despite requiring less time [19], or fewer data points [20] for the training process.

2.4 Parallelism in Quantum Computing Many quantum-accelerated applications provide opportunities for classical parallelisation across multiple QPUs. The different problem scales at which this can be realised are illustrated schematically in Fig. 2.

Fig. 2 Illustration of nested self-consistent loops in an optimization problem involving the execution of a variational quantum algorithm (VQA) at the lowest layer. The outer layer recursively divides the full problem into sub-problems that can be solved in parallel, for example a QM/MM followed by a DFT/HF embedding scheme in quantum chemistry. Running the VQA itself within each sub-problem can be further parallelised across multiple QPUs

At the highest level, applying a divide-and-conquer strategy to complex problems leads to an ensemble of quantum instructions [21] that can be trivially executed in parallel. A typical example of such a strategy is embedding methods in materials science and chemistry. Consider a large molecule or a bath of many smaller molecules. In many cases the vast majority of the problem is sufficiently described by molecular mechanics (MM), i.e. the constituents are idealized as point charges and their interaction is governed by classical electrodynamics. In regions where quantum mechanical (QM) correlations cannot be neglected, e.g. where bonds may form or break, one can employ QM/MM schemes [22]. In the so-called additive scheme, the total energy of the system is the sum of three contributions: the energy of the QM region, calculated with an appropriate QM method; the energy of the MM region, which is simply the classical energy of interacting point charges; and an interaction term between the two subsystems, for which the charge distributions of both subsystems are used as input. These steps are repeated until the residual force fields fall below a given threshold, i.e. the full system equilibrates. The QM regions can be tackled either directly on one or more QPUs, or by another nested approach whereby the computation on QPUs is embedded into an approximate QM method such as density functional theory (DFT), Hartree-Fock (HF) [23], or the coupled cluster (CC) approach. For instance, in the simplest cases one would freeze the core electrons treated with HF methods, restrict the active space in the electronic structure calculation performed on QPUs, and neglect the backreaction of the active electrons on the core.

Real-world optimization problems in logistics, energy, financial services and manufacturing often require circuit widths that exceed the number of qubits likely to be available from quantum devices for the foreseeable future. Therefore, another promising approach to top-level parallelism is to divide the full problem into smaller sub-problems that can be solved on smaller quantum computers [e.g. 24, 25].

On the level of the QM computation, circuit knitting techniques [26] such as entanglement forging [27] make it possible to partition larger quantum circuits into smaller ones. The smaller circuits require fewer qubits and have smaller circuit depths than the original, but introduce significant overhead from the classical post-processing required to reconstruct the result of the original circuit from the results of the smaller ones.

At the lowest level, VQAs are particularly well suited for parallelization. For example, as discussed in [28], the parallelization of VQEs applied to electronic structure calculations is a necessity to render these applications viable. The evaluation of the objective function often requires evaluation of many different circuits. For example, the Hamiltonian operator in electronic structure calculations consists of non-commuting terms that are accounted for by corresponding basis rotations in the quantum circuits. Depending on the Pauli grouping strategy, the number of distinct circuits scales linearly with the number of spin-orbitals, albeit with a large pre-factor.
Furthermore, in QML algorithms each data point in the training data set leads to a distinct circuit that contributes to the overall cost function. The classical optimization of the variational parameters requires further evaluations of this ensemble of circuits at different points in the parameter space within a single iteration step. Finally, in order to achieve a desired precision $\varepsilon$ (typically $\sim 10^{-3}$ for chemistry), the execution of the same circuit must be repeated $1/\varepsilon^2$ times. These repetitions, often referred to as ‘shots’, can be distributed over many QPUs.
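To make the shot-count arithmetic concrete, the short sketch below computes the number of repetitions implied by the $1/\varepsilon^2$ scaling and splits them evenly over a set of QPUs. The function names and the choice of eight QPUs are illustrative assumptions; the precision is passed as a string so that the arithmetic is exact.

```python
import math
from fractions import Fraction


def shots_required(epsilon):
    """Number of shots implied by the 1/epsilon^2 scaling (exact arithmetic).

    `epsilon` is given as a string such as "1e-3" to avoid float rounding.
    """
    return math.ceil(1 / Fraction(epsilon) ** 2)


def split_shots(total_shots, n_qpus):
    """Distribute shots over n_qpus so that the counts differ by at most one."""
    base, extra = divmod(total_shots, n_qpus)
    return [base + 1 if k < extra else base for k in range(n_qpus)]


total = shots_required("1e-3")   # chemistry-level precision from the text
per_qpu = split_shots(total, 8)  # assumed pool of 8 QPUs
print(total, per_qpu)
```

For $\varepsilon = 10^{-3}$ this gives one million shots per circuit, which is why spreading shots over many devices matters even before the number of distinct circuits is taken into account.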

3 The Quantum Brilliance Software Development Kit (SDK)

In order to realise the potential of hybrid algorithms and room-temperature quantum accelerators for quantum utility, one needs an appropriately parallelised quantum programming framework, specifically optimised for hybrid applications. The Quantum Brilliance Software Development Kit (SDK) is a full-stack software framework geared towards massive parallelism for accelerator-based hybrid quantum-classical workflows.

The QB SDK is designed for ease of integration into a range of performance-critical environments: HPC, cloud, GPU workstations and even embedded/edge systems. The SDK’s high-performance core and bundled simulators are implemented in C++. It provides hybrid and pure-quantum programming interfaces in both C++ and Python, with an extensive language binding module for Python included. This allows users to access most of the underlying C++ features from Python, whilst simultaneously having access to Python’s rich array of tools, such as NumPy for numerical programming and Matplotlib for plotting.

Fig. 3 The Quantum Brilliance Software Development Kit (SDK) provides a full-stack development environment for hybrid quantum-classical application development

Figure 3 depicts the high-level architecture of our SDK. This consists of (1) the frontend, which compiles various quantum programming inputs into an intermediate representation (IR); (2) the middleware layer, which is a collection of software modules that transform and optimise quantum programs in IR format; and (3) the backend, which converts IR into code directly executable on a specific simulator or actual quantum chips. To ensure high performance and HPC compatibility, the current version of the SDK leverages the quantum intermediate representation (IR) of XACC [29], a modular and extensible system-level software infrastructure for heterogeneous programming. Here, we detail the functionality of the different layers as shown in Fig. 3, focusing on the extensions and modifications that we have made to XACC targeting our NV quantum devices and hybrid quantum-classical workflows.

At the top of the stack, the QB SDK provides an extensible collection of frontend modules capable of parsing commonly-used quantum assembly dialects, such as IBM’s OpenQASM [30] and Rigetti’s Quil [31], based on XACC’s Compiler interface for quantum compilation. We inject additional so-called intrinsics, i.e. definitions of native quantum gates, pertinent to NV-based quantum computation, into the source assembly code directly, using e.g. the include directive of OpenQASM.

Fig. 4 An example using the CircuitBuilder utility of the QB SDK. (a) Python. (b) C++

One notable feature of the SDK is the ability it offers for the user to programmatically construct quantum circuits via a utility called the CircuitBuilder. This allows users to write a program in high-level classical languages, such as C++ or Python, that describes how to build the quantum program. From this high-level program, the SDK constructs the relevant quantum kernel(s). Examples of using the circuit builder utility to construct a simple two-qubit Bell-state circuit are shown in Fig. 4.

As a convenience, we also provide a cloud-based user interface, whereby users have access to a Jupyter notebook instance with the latest SDK version installed and configured. Detailed information about this UI can be found in Ref. [32].

Integral to the core of our SDK is the intermediate representation (IR) data structure, which encapsulates the semantics of quantum circuits. At the time of writing, the QB SDK relies on the XACC IR infrastructure. This allows the SDK to make use of various composite IR modules in XACC that represent high-level quantum circuit templates such as for QAOA and QML, as well as XACC’s middleware processing modules, including IR transformation passes for quantum circuit optimization and simplification. We have also implemented a module for interoperability with the open-source C++ library TKET [33], providing access to additional IR transformations such as noise-aware circuit mapping.

The QB SDK has a large variety of backends for both production and experimental use. These range from perfect (noise-free) quantum state simulation, to detailed emulation of quantum accelerators, to the ability to send compiled code to actual QB quantum chips. In particular, QB provides customers with an emulator module to enable accurate modeling of current and future hardware for algorithm and application development. With access to current QB hardware, users can also submit quantum circuits to the device via the SDK, with all steps from preparing the circuit for execution to sending it to the device handled by the SDK.

4 Parallelism in the QB SDK

A key design focus of the QB SDK is to support the quantum-accelerated high-performance computing vision whereby many quantum computing devices (QPUs) are integrated into conventional HPC data centers. To this end, we envision two modes of integration: (1) loosely-coupled or distributed integration and (2) integrated quantum-accelerated compute nodes. Specifically, the first refers to the case where the communication between classical HPC nodes and QPUs is mediated by conventional networks, so this mode suits workloads for which HPC-QPU communication latency is not a major concern. In the latter scenario, QPUs would need to be integrated into the HPC infrastructure, including the high-speed, low-latency communication network (e.g., InfiniBand), programming environment (e.g., MPI) and resource scheduler. We also note that scenario (1) encompasses both cloud-hosted (remote access over the Internet) and co-located (on-premises installation but connected via a LAN, not integrated into the HPC network) use cases.

To cover both of these modes in a flexible, efficient and performant manner, we have designed and implemented (1) an event-driven, shared-memory circuit execution interface to orchestrate a pool of loosely-coupled QPUs (such as those hosted on the cloud) and (2) a distributed-memory QPU virtualization scheme based on the gold-standard Message Passing Interface (MPI). The first model can take advantage of the availability of cloud-hosted QPUs by pooling them together, while the latter can take advantage of a cluster of QPUs backed by a high-bandwidth, low-latency network.

4.1 Asynchronous QPU Offloading Most currently-available quantum computers are networked devices over private (LAN/Intranet) or public (WAN/Internet) networks. This means that they are typically communicated with via HTTP/REST protocols, as depicted in Fig. 5.

Fig. 5 Asynchronous QPU offloading: a single QB SDK thread (host program) distributing quantum tasks to multiple remotely-hosted QPUs. Asynchronous methods are used to query, wait for, and retrieve the results from the quantum accelerators

Our SDK features a high-level set of utilities for submitting concurrent parallel jobs to a set of remotely-hosted QPUs, using a non-blocking asynchronous offloading scheme for quantum operations. In this model, the execution of the quantum operations proceeds independently from the classical host program. Thus, the calling classical program can proceed to its subsequent instruction without waiting for completion of the quantum operations that were just offloaded. Usage of the results of the quantum operations requires explicit detection and handling by the calling program.

Figure 6 is a UML diagram describing the high-level constructs in the QB SDK that facilitate many-QPU parallelism via asynchronous offloading. The remote_accelerator represents a QPU target to which the SDK can offload tasks asynchronously via the async_execute API. The quantum task is described by XACC’s CompositeInstruction IR node, i.e. a quantum circuit. The offloading API (async_execute) is non-blocking. It returns a handle, async_job_handle, from which the calling (host) program can check the status of the job and retrieve the results from the QPU. The asynchronous model tolerates a high level of variation in execution times between different accelerators. Lastly, a group of quantum accelerators can be bundled into an Executor, which is essentially a pool of remote quantum workers accessible in a round-robin fashion via the get_next_available_qpu API. The initialize method provides the possibility to configure an Executor’s properties, such as its number of QPUs and how to reach them.
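The offloading pattern described here (a round-robin pool, a non-blocking submit call, and a job handle that is queried later) can be sketched with Python's standard `concurrent.futures` module. This is an analogy under stated assumptions, not the SDK's implementation: `FakeRemoteQPU`, `QPUPool` and `next_qpu` are hypothetical names, the "QPU" merely sleeps to emulate network and execution latency, and a standard-library `Future` plays the role of the job handle.

```python
import itertools
import time
from concurrent.futures import ThreadPoolExecutor


class FakeRemoteQPU:
    """Hypothetical stand-in for a remotely hosted QPU (not an SDK class)."""

    def __init__(self, name, latency_s):
        self.name = name
        self.latency_s = latency_s
        # One worker thread per device: jobs on the same device run in sequence
        self._worker = ThreadPoolExecutor(max_workers=1)

    def async_execute(self, circuit):
        """Non-blocking submit; the returned Future acts as the job handle."""
        return self._worker.submit(self._run, circuit)

    def _run(self, circuit):
        time.sleep(self.latency_s)  # emulate network + execution time
        return {"qpu": self.name, "circuit": circuit}

    def shutdown(self):
        self._worker.shutdown(wait=True)


class QPUPool:
    """Hypothetical pool handing out devices in round-robin order."""

    def __init__(self, qpus):
        self._qpus = list(qpus)
        self._cycle = itertools.cycle(self._qpus)

    def next_qpu(self):
        return next(self._cycle)

    def shutdown(self):
        for qpu in self._qpus:
            qpu.shutdown()


pool = QPUPool([FakeRemoteQPU("qpu-0", 0.01), FakeRemoteQPU("qpu-1", 0.03)])

# Post all jobs without waiting for any of them to finish
handles = [pool.next_qpu().async_execute(f"circuit-{k}") for k in range(6)]

# The host thread is free to do unrelated classical work here
host_work = sum(range(1000))

# Retrieve results; result() blocks only until that particular job is done
results = [handle.result() for handle in handles]
pool.shutdown()
print([r["qpu"] for r in results])
```

Because each device has its own queue, the slower device does not hold up jobs assigned to the faster one, which is the tolerance to uneven execution times noted in the text.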

4.1.1 Example

The snippet in Listing 1 demonstrates how to use asynchronous offloading to submit many quantum tasks to a remote backend, such as available from AWS Braket.

    import numpy as np
    import qb.core
    my_session = qb.core.session()

    # Helper to set up an executor consisting of up to
    # 32 AWS Braket SV1 backends
    my_session.aws32sv1()

    # Input parameterized quantum circuit in OpenQASM
    my_session.instring = '''
    __qpu__ void QBCIRCUIT(qreg q) {
    OPENQASM 2.0;
    include "qelib1.inc";
    creg c[2];
    x q[1];
    ry(QBTHETA_0) q[0];
    cx q[1], q[0];
    measure q[0] -> c[0];
    measure q[1] -> c[1];
    }'''

    # Set up a parameter scan (100 data points)
    scan_params = qb.MapND()
    for theta in np.linspace(0, np.pi, 100):
        qb_theta = qb.ND()
        qb_theta[0] = theta
        scan_params.append(qb_theta)
    # Set up the scan in the SDK job table
    my_session.theta[0] = scan_params
    # List of all asynchronous job handles
    job_handles = []
    for idx in range(len(scan_params)):
        # Post all the jobs asynchronously
        job_handle = my_session.run_async(0, idx)
        job_handles.append(job_handle)

    # At this point, all the jobs have been offloaded
    # to remote AWS Braket QPU(s)
    # i.e., we can do other useful work
    # in this thread ...

    # The job handle can be used to check the status of a job
    # e.g. check all the jobs are completed
    all_done = all(handle.complete() for handle in job_handles)

Listing 1: Example of asynchronous offload to remotely-hosted QPUs (AWS Braket)

[Figure 6: UML class diagram]
Executor: List qpu_pools; void initialize(config); remote_accelerator get_next_available_qpu(); void release(remote_accelerator). One Executor contains many remote_accelerator objects.
remote_accelerator: async_job_handle async_execute(CompositeInstruction). Returns an async_job_handle.
async_job_handle: void cancel(); bool done(); void wait_for_completion(); void load_result(AcceleratorBuffer); void add_done_callback(Callable).

Fig. 6 Asynchronous QPU offloading architecture

Here we set up the Executor via an SDK utility function (aws32sv1) in line 7, which configures the executor to have up to 32 Braket SV1 simulator instances. The input circuit is provided as an OpenQASM source string (lines 10–20). To demonstrate the utility of asynchronous parallelism, we set up a parameterized quantum circuit pertinent to variational algorithms such as VQE or QAOA. A large number of quantum tasks are generated by scanning the parameter (lines 23–29). A 2D array represents the table of quantum jobs to run. The first index (row) refers to different quantum circuits, whereas the second index (column) refers to different run configurations, such as shot counts or circuit parameters as in this particular example.

With a large number of quantum jobs, we can use the Python run_async API to enqueue a job from the table, ready to be sent to the executor. Here run_async is just a high-level wrapper of the get_next_available_qpu and async_execute API functions shown in Fig. 6, designed to post a quantum job to the backend in a non-blocking manner. As a result, the main thread is free to do other useful processing tasks after run_async is invoked (as denoted in lines 37–40). The handle returned by each run_async function call, of type async_job_handle, can be used to check the status of the job or retrieve the result when the job completes.

Fig. 7 MPI-based client-server architecture. The QB SDK UI (Jupyter notebook) is hosted in a conventional single-process context, connecting to a cluster of quantum-classical compute nodes. The cluster is backed by a high-speed network and MPI runtime

4.2 Message Passing Interface

MPI is the most common parallelisation protocol used in high-performance computing (HPC). It allows for high-bandwidth, low-latency communication between compute nodes in a cluster. The QB SDK offers a client-server MPI architecture, where the user interface runs as a standalone client process that sends circuits to an MPI-enabled backend server for compute. This design is depicted in Fig. 7. As was the case for its predecessor [32], the SDK offers a user-friendly Jupyter notebook interface. The client-server model allows the UI to be executed as a conventional single-process program, running on e.g. a cluster login node or the cloud. The server code running on the quantum cluster uses MPI to distribute quantum tasks across all available nodes and gather the results. The rank 0 process of the server code handles the communication with the SDK UI client.

The server’s core functionality is currently provided by an MPI-parallel wrapper around standard XACC accelerator backends (HPCVirtDecorator, derived from AcceleratorDecorator). The wrapper introduces pre- and post-processing steps. This provides simple parallelisation of serial and threaded accelerator backends. An example handling many small circuits utilizing the HPCVirtDecorator is given in Listing 2.

    // Here VQE is called with a decorated accelerator. The decorator adds pre-
    // and post-processing around the actual accelerator execution. This is used
    // to introduce MPI parallelism, i.e. partitioning and distribution of the
    // vector of instructions (base circuit + Pauli terms) and return of
    // the results.
    // Number of MPI processes and threads can be chosen as needed.
    void xaccOptimizeDecorated(vqe::Params& params, int nWorker, int nThreads) {
      // 1 of 4: ansatz from XACC qasm string
      std::shared_ptr<xacc::CompositeInstruction> ansatz;
      if (params.ansatz) {
        // ansatz provided
        ansatz = params.ansatz;
      } else {
        // circuit string provided
        xacc::qasm(params.circuitString);
        ansatz = xacc::getCompiled("ansatz");
      }

      // 2 of 4: observable from string
      std::shared_ptr<xacc::Observable> observable =
          std::make_shared<xacc::quantum::PauliOperator>();
      observable->fromString(params.pauliString);

      // 3 of 4: accelerator - qpp: "vqe-mode"=true is non-stochastic
      xacc::HeterogeneousMap accParams{{"n-virtual-qpus", nWorker},
                                       {"vqe-mode", params.isDeterministic},
                                       {"shots", params.nShots},
                                       {"threads", nThreads}};

      // get and wrap accelerator with hpc-decorator to introduce MPI parallelism
      auto accelerator = xacc::getAccelerator("qpp", accParams);
      accelerator = xacc::getAcceleratorDecorator("hpc-virtualization",
                                                  accelerator, accParams);

      // 4 of 4: optimiser
      auto optimizer = xacc::getOptimizer("nlopt");
      optimizer->setOptions({{"initial-parameters", params.theta},
                             {"nlopt-optimizer", "cobyla"},
                             {"nlopt-maxeval", params.maxIters},
                             {"nlopt-ftol", params.tolerance}});

      // instantiate XACC VQE
      auto vqe = xacc::getAlgorithm("vqe");
      vqe->initialize({{"ansatz", ansatz},
                       {"accelerator", accelerator},
                       {"observable", observable},
                       {"optimizer", optimizer}});

      // Allocate some qubits and execute
      auto buffer = xacc::qalloc(params.nQubits);
      vqe->execute(buffer);

      // read out buffer
      params.energies = (*buffer)["params-energy"].as<std::vector<double>>();
      params.theta = (*buffer)["opt-params"].as<std::vector<double>>();
      params.optimalValue = (*buffer)["opt-val"].as<double>();
    }

Listing 2: Example of MPI-based VQE, parallelised using an HPC decorator

The input parameter params contains descriptions for the ansatz and Pauli terms, and thus describes the quantum problem. nWorker and nThreads set the number of MPI processes and the number of threads per MPI process, respectively. Each MPI process handles exactly one accelerator backend, which may be further parallelized by threading. A simulator backend that makes native use of MPI can then also be orchestrated by the HPCVirtDecorator, using a multi-level MPI hierarchy. This makes it possible to spread multiple quantum circuits, each requiring more than one compute node, over a large compute cluster. The HPCVirtDecorator can also be used to orchestrate the execution of quantum problems on analog QPUs. Here a circuit must fit into a single QPU, but multiple QPUs can work on the large list of Pauli terms or gradient computation points in parallel.
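The pre- and post-processing steps of such a decorator amount to a scatter of the circuit list over workers followed by a gather of the results. The sketch below emulates that pattern serially in plain Python so that it runs without an MPI installation. The function names, the contiguous block partitioning, and the dummy per-circuit "result" are assumptions for illustration; the actual partitioning used by the HPCVirtDecorator may differ. The term count of 3052 is borrowed from the H8 example in Sect. 4.2.1.

```python
def partition(items, n_workers):
    """Split items into n_workers contiguous chunks whose sizes differ by at most one."""
    base, extra = divmod(len(items), n_workers)
    chunks, start = [], 0
    for rank in range(n_workers):
        size = base + (1 if rank < extra else 0)
        chunks.append(items[start:start + size])
        start += size
    return chunks


def evaluate_on_worker(chunk):
    """Placeholder for executing a chunk of circuits on one (virtual) QPU."""
    return [len(term) for term in chunk]  # dummy result per circuit


def scatter_evaluate_gather(items, n_workers):
    """Serial emulation of the scatter -> execute -> gather pattern."""
    chunks = partition(items, n_workers)                # pre-processing ("scatter")
    partial = [evaluate_on_worker(c) for c in chunks]   # one chunk per MPI process
    return [r for part in partial for r in part]        # post-processing ("gather")


# One circuit per Pauli term, distributed over an assumed 32 virtual QPUs
terms = [f"term-{k}" for k in range(3052)]
sizes = [len(chunk) for chunk in partition(terms, 32)]
gathered = scatter_evaluate_gather(terms, 32)
print(min(sizes), max(sizes), len(gathered))
```

With a balanced partition no worker receives more than one circuit above the average, so the wall time is set by roughly the total circuit count divided by the number of workers, plus the scatter and gather overhead.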

4.2.1 Example

Here, we demonstrate the utility and performance of the QB SDK using a simulated quantum HPC cluster, where QPUs are replaced by classical numerical simulators running on conventional compute nodes. As quantum hardware continues to advance in terms of its quantum utility, the same software can be used to orchestrate a comparable large-scale deployment of quantum accelerators. In this example, we look at the problem of computing the expectation value of a complex Hamiltonian operator for a quantum state. This procedure is pertinent in many near-term quantum applications, such as the VQE algorithm. Specifically, given a Hamiltonian expressed in terms of Pauli operators,

$$H = \sum_i \left( \bigotimes_j^{n_{\mathrm{qubits}}} \sigma_j^{(i)} \right), \qquad \sigma_j^{(i)} \in \{I, X, Y, Z\}, \tag{1}$$

the expectation value (energy) of a quantum state ($|\psi\rangle$) can be computed by evaluating each term in the sum independently.

We look at the example of an $\mathrm{H}_8$ molecule, consisting of four loosely-bonded $\mathrm{H}_2$ molecules. The PySCF [34] second-quantized Hamiltonian, which contains 1-particle and 2-particle integrals, can be converted into the above Pauli form using the Jordan-Wigner transformation. Figure 8 shows the time taken to evaluate the ground-state energy of the $\mathrm{H}_8$ molecule on different numbers of QPUs (each simulated using a single node of a supercomputing cluster). The Hamiltonian consists of 3052 terms, requiring many quantum circuit evaluations, each of which includes evaluation of both the ansatz and the change of basis.¹ It is worth noting that there are optimisation strategies that involve grouping of Pauli terms [e.g., 35, 36], which can help reduce the number of evaluations needed, but for the sake of simplicity we do not employ any grouping strategies for this demonstration.

¹ Post-rotations for each term in the Hamiltonian that is not a Pauli Z gate.

Fig. 8 Performance scaling with number of (virtual) QPUs for the computation of the ground-state energy of an $\mathrm{H}_8$ chain. These calculations make use of a UCCSD ansatz at a fixed set of parameters, with a 16-qubit, 3052-term STO-3G Hamiltonian basis set obtained using the Jordan–Wigner transformation. The SDK reports both the quantum circuit execution time (blue bars) and the time taken for pre- and post-processing (orange bars). We also plot the number of circuits sent to each QPU

In this example, the quantum state is prepared by a UCCSD ansatz [37], i.e.,

$$|\psi\rangle = U_{\mathrm{CCSD}} |0\rangle, \tag{2}$$

at a fixed set of variational parameters (i.e. no outer classical optimization loop is included in this performance scaling evaluation). We evaluated each of the circuits resulting from the terms in the Hamiltonian using a state-vector-based simulator without statistical noise.

Data for Fig. 8 result from runs on the Topaz cluster at the Pawsey Supercomputing Centre, with each virtual QPU running on a single Topaz node. Nodes contain 256 GB memory and 2 × Intel Xeon E5-2680 v4 2.4 GHz CPUs, and communicate via a Mellanox InfiniBand interconnect running at 100 Gb/s. MPI is implemented with OpenMPI. Wall time results in Fig. 8 demonstrate close to ideal scaling when we increase the number of processing nodes (simulated virtual QPUs). More importantly, we observe minimal overhead for pre- and post-processing procedures, such as those associated with MPI scatter and gather steps in the MPI-based virtual QPU implementation of the SDK.
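The term-by-term evaluation that makes this workload parallel can be illustrated on a much smaller problem. The sketch below computes the expectation value of each Pauli string separately for a two-qubit Bell state using plain Python, then sums the results as in Eq. (1). The choice of state, the four Pauli strings, and the function names are illustrative assumptions, and exact state-vector arithmetic replaces circuit execution and measurement.

```python
import math


def pauli_on_bit(pauli, bit):
    """Action of a single-qubit Pauli on basis bit: returns (new_bit, phase)."""
    if pauli == "I":
        return bit, 1
    if pauli == "X":
        return 1 - bit, 1
    if pauli == "Y":
        return 1 - bit, (1j if bit == 0 else -1j)
    if pauli == "Z":
        return bit, (1 if bit == 0 else -1)
    raise ValueError(f"unknown Pauli label: {pauli}")


def term_expectation(pauli_string, state):
    """<psi|P|psi> for one Pauli string P; qubit 0 is the most significant bit."""
    n = len(pauli_string)
    total = 0j
    for idx, amp in enumerate(state):
        new_idx, phase = 0, 1
        for q, pauli in enumerate(pauli_string):
            bit = (idx >> (n - 1 - q)) & 1
            new_bit, ph = pauli_on_bit(pauli, bit)
            new_idx |= new_bit << (n - 1 - q)
            phase *= ph
        # P|idx> = phase * |new_idx>, so this adds conj(psi[new_idx]) * phase * psi[idx]
        total += state[new_idx].conjugate() * phase * amp
    return total.real


# Assumed test state: Bell state (|00> + |11>) / sqrt(2)
bell = [1.0 / math.sqrt(2.0), 0.0, 0.0, 1.0 / math.sqrt(2.0)]

# Assumed Hamiltonian: a plain sum of Pauli strings, as in Eq. (1)
hamiltonian_terms = ["ZZ", "XX", "YY", "ZI"]

# Each entry is independent of the others and could be computed on a different QPU
per_term = {p: term_expectation(p, bell) for p in hamiltonian_terms}
energy = sum(per_term.values())
print(per_term, energy)
```

Since no term depends on the result of another, the dictionary comprehension could be replaced by any parallel map without changing the final sum.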

5 Summary

Thanks to advances in quantum computing hardware, we are fast approaching the regime of quantum utility, where hybrid quantum-classical computers can outperform conventional computers of comparable size, weight and power. A performant and capable software framework that allows for flexible and efficient multi-QPU parallelisation is critical to achieving the scalability and performance required for real-world workloads. Quantum Brilliance, a full-stack quantum computing company, has been developing an SDK incorporating parallelisation strategies targeting different modes of quantum-classical interactions. We have developed a thread-based asynchronous execution model for remotely-hosted QPUs, and a high-performance MPI-based QPU virtualisation system for quantum-accelerated data centers. The SDK enables users to explore hybrid quantum-classical workloads in order to test and benchmark computational utility in their own application domains.

Acknowledgments This research used resources of the Pawsey Supercomputing Centre, which is supported by the Australian Government under the National Collaborative Research Infrastructure Strategy and related programs through the Department of Education.


Machine Learning Reliability Assessment from Application to Pulse Level

Vedika Saravanan and Samah Mohamed Saeed

1 Introduction

Many computational problems of great interest are difficult to solve using classical digital computers. Rapid technological breakthroughs since Feynman’s suggestion of quantum simulation have led to the development of quantum computers based on different technologies as a step toward exploiting the quantum advantage [7, 57]. Current and near-term quantum computers are far more susceptible to noise than classical computers. While quantum error correction is the key to enabling these devices, it incurs significant overhead in terms of the number of qubits [19]: a fault-tolerant quantum computer capable of executing general quantum computations requires a very large number of physical qubits. Nevertheless, significant quantum hardware advances have been made in the past few years, resulting in different noisy intermediate-scale quantum (NISQ) computers that have tens to hundreds of qubits [17]. The pursuit of quantum advantage on NISQ devices has continued along several paths, including quantum simulation [25, 42, 56], optimization problems [43, 53], and sampling [7, 57]. The presence of errors severely limits the size of the quantum circuits that can be reliably executed on these devices. Because the quantum hardware has several sources of noise, quantum circuits should be compiled very efficiently, considering the details of the specific quantum hardware platform, such as the available native gates, restrictions on qubit connectivity, and qubit error rates.

V. Saravanan · S. M. Saeed
City College of New York, City University of New York, New York, NY, USA
© The Author(s), under exclusive license to Springer Nature Switzerland AG 2024. H. Thapliyal, T. Humble (eds.), Quantum Computing, https://doi.org/10.1007/978-3-031-37966-6_7


To this end, there is a demand for predictive analysis at different levels of abstraction of quantum circuit design, which helps in making well-informed decisions about quantum circuit design and the allocation of hardware resources. These decisions directly impact the fidelity of the quantum circuits. In this work, we present a suite of Machine Learning (ML) predictive analysis techniques for projecting quantum circuit fidelity. While the existing estimated success probability metric can be used to estimate the quantum circuit fidelity, such a simple and coarse estimation has several shortcomings. The ML predictive techniques address three main concerns: (1) the necessity to estimate the quantum circuit fidelity at different abstraction levels, (2) the necessity to estimate the fidelity with minimum interaction with the quantum hardware, and (3) the representativeness of the information utilized by the predictive analysis. These shortcomings/needs can be elaborated on as follows:

• Depending on the quantum circuit design/compilation stage at which the predictive analysis is needed, detailed information about the physical implementation of the quantum circuit (e.g., the allocated physical qubits and the scheduled gates) may be unavailable, rendering it difficult to compute the fidelity. A predictive analysis capable of delivering projections with reasonable accuracy, despite the lack of detailed physical implementation, is required.
• Due to the complex noise model of the quantum hardware, existing methods [14, 40, 47] that reschedule quantum gates to minimize errors may require runtime data, in the form of tuning-circuit executions, to tune the position of each quantum gate. A predictive analysis capable of projecting the impact of different optimization/error mitigation techniques is required.
• Pulse-level quantum circuit designs, for example, have different features than gate-level circuit designs. A predictive analysis that reflects the characteristics of the given quantum circuit design level is necessary.

The ML predictive techniques vary in accuracy and in the amount of information required for the analysis, thus serving different levels of quantum circuit optimization and hardware resource allocation. Specifically, ML can be utilized to predict the quantum circuit fidelity at the logical gate level, physical gate level, and pulse level circuit implementations. While the ML models at the physical gate and pulse levels enable different optimization techniques to improve fidelity, ML models at the logical level can be exploited for quantum hardware benchmarking, to select the quantum computer that performs best for a given quantum application/algorithm.

Our chapter is organized as follows: Sect. 2 provides the background on quantum circuit design at different levels of abstraction, errors in quantum hardware, and quantum circuit compilation. Section 3 provides a detailed description of the ML models for reliability assessment at different levels of the quantum circuit design. Sections 4 and 5 present qualitative and quantitative comparisons between the different ML models. Finally, we summarize the chapter in Sect. 6.


2 Background

2.1 Quantum Algorithms

Many classes of quantum algorithms exist, for example, Shor’s algorithm for factorization [46], which uses the quantum Fourier transform; Grover’s algorithm for database search [20], which uses amplitude amplification; quantum Hamiltonian simulation [3]; linear system solvers [22]; and hybrid quantum-classical algorithms such as the quantum approximate optimization algorithm (QAOA) [16] and the variational quantum eigensolver (VQE) [38]. Hybrid quantum-classical algorithms are promising applications for NISQ devices. They aim at building a state $|\psi(\theta)\rangle$ that encapsulates a near-optimal solution to a given problem, where $|\psi(\theta)\rangle$ depends on certain parameters $\theta = (\theta_1, \ldots, \theta_n)$. QAOA is used for combinatorial optimization problems such as Max-Cut, which partitions the vertices of a given graph into a subset and its complement using a cut that maximizes the total weight of the edges between them [16]. VQE can be used to find the ground-state energy of a Hamiltonian $H$, i.e., its lowest eigenvalue [38]. This approach has significantly fueled recent advances in the application of near-term quantum technology to chemistry. The VQE algorithm takes as inputs a molecular Hamiltonian $H$ and a parameterized circuit that prepares the molecule’s quantum state. The expectation value of the Hamiltonian in the trial state is used as the cost function. The goal is to identify the ground state of $H$, which minimizes the cost function. These variational quantum algorithms iteratively optimize $\theta$, with each iteration involving the application of a small $\theta$-dependent quantum circuit to a given initial state, followed by readout operations and the use of their results in a classical optimization strategy to update the value of $\theta$.
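The iterative structure described above can be sketched with a toy example. This is a minimal illustration, not the VQE setup used in this chapter: the single-qubit Hamiltonian, the one-parameter ansatz, the gradient-descent optimizer, and all function names are assumptions chosen for brevity, and the expectation value is computed exactly with NumPy instead of being estimated from hardware readouts.

```python
import numpy as np

# Toy single-qubit Hamiltonian H = Z + 0.5 X (an assumption, for illustration only).
X = np.array([[0, 1], [1, 0]], dtype=complex)
Z = np.array([[1, 0], [0, -1]], dtype=complex)
H = Z + 0.5 * X


def trial_state(theta):
    """Ansatz |psi(theta)> = Ry(theta)|0> (hypothetical one-parameter circuit)."""
    return np.array([np.cos(theta / 2), np.sin(theta / 2)], dtype=complex)


def energy(theta):
    """Cost function: the expectation value <psi(theta)|H|psi(theta)>."""
    psi = trial_state(theta)
    return float(np.real(np.vdot(psi, H @ psi)))


def minimize_energy(theta=0.1, learning_rate=0.4, steps=200):
    """Classical outer loop: plain gradient descent on theta."""
    for _ in range(steps):
        # Parameter-shift gradient, which is exact for a single Ry rotation.
        grad = 0.5 * (energy(theta + np.pi / 2) - energy(theta - np.pi / 2))
        theta -= learning_rate * grad
    return theta, energy(theta)


theta_opt, e_min = minimize_energy()
# Reference value from exact diagonalization (lowest eigenvalue of H).
exact_ground_energy = float(np.linalg.eigvalsh(H)[0])
```

On hardware, `energy` would be replaced by repeated circuit executions and readout, which is where the noise discussed in the following sections enters the loop.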

2.2 Gate-Level Quantum Circuits

The main unit in quantum systems is the quantum bit, also known as a qubit, which can be in the state $|0\rangle$, $|1\rangle$, or a superposition of both [36]. The qubit state is expressed as $|\psi\rangle = \alpha_0|0\rangle + \alpha_1|1\rangle$, where $|0\rangle$ and $|1\rangle$ are the basis states. The coefficients $\alpha_0$ and $\alpha_1$ are complex numbers that determine the probabilities of measuring $|0\rangle$ and $|1\rangle$ and satisfy the condition $|\alpha_0|^2 + |\alpha_1|^2 = 1$. Quantum algorithms are realized as quantum circuits composed of quantum gates, which transform one quantum state to another through unitary transformations denoted by $|\psi\rangle \to U|\psi\rangle$, where $U$ is a unitary matrix. Gates that operate on one and two qubits can be represented by a $2 \times 2$ and a $4 \times 4$ unitary matrix, respectively. For instance, the X gate flips the qubit’s state, the Z gate applies a $\pi$ rotation around the z-axis, the H gate applies the Hadamard transform, and the two-qubit controlled-NOT (CNOT) gate can entangle two qubits. The resulting quantum circuits are referred to as logical quantum circuits. The native gate implementation differs from one architecture to another. To execute quantum circuits on a quantum computer, they are transformed into physical quantum circuits, which consist of physical qubits and gates that are supported by the underlying architecture.

Example 1 Figure 1a and b show an example of 3-qubit QAOA logical and physical quantum circuits, respectively, that solve the Max-Cut problem for a fully connected graph.

Fig. 1 3-qubit QAOA circuit at different abstraction levels: (a) logical gate-level, (b) physical gate-level mapped to IBM Q Jakarta, and (c) a subset of the pulse-level circuit
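The state and gate definitions above can be checked numerically. The following is a small sketch using NumPy; the helper names `is_unitary` and `probabilities` are introduced here for illustration, and the two-qubit basis ordering is an assumed convention.

```python
import numpy as np

# Computational basis states |0> and |1>.
ket0 = np.array([1, 0], dtype=complex)
ket1 = np.array([0, 1], dtype=complex)

# Single-qubit gates as 2x2 unitary matrices.
X = np.array([[0, 1], [1, 0]], dtype=complex)
Z = np.array([[1, 0], [0, -1]], dtype=complex)
H = np.array([[1, 1], [1, -1]], dtype=complex) / np.sqrt(2)

# CNOT as a 4x4 unitary; first qubit is the control,
# basis order |00>, |01>, |10>, |11> (assumed convention).
CNOT = np.array([[1, 0, 0, 0],
                 [0, 1, 0, 0],
                 [0, 0, 0, 1],
                 [0, 0, 1, 0]], dtype=complex)


def is_unitary(u):
    """Check U^dagger U = I."""
    return np.allclose(u.conj().T @ u, np.eye(u.shape[0]))


def probabilities(state):
    """Measurement probabilities |alpha_i|^2 of each basis state."""
    return np.abs(state) ** 2


# H on the first qubit followed by CNOT entangles the two qubits.
plus = H @ ket0
bell = CNOT @ np.kron(plus, ket0)
```

Here `probabilities(bell)` assigns weight only to $|00\rangle$ and $|11\rangle$, which is the entangling behavior of CNOT mentioned above.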

2.3 Pulse-Level Quantum Circuits

Recent breakthroughs in quantum computing are enabled by greater control over quantum systems. The unitary gates that manipulate the qubit state and the measurement operations are applied to the quantum hardware using control pulses that drive the quantum system. Quantum circuits are executed on superconducting quantum computers in the form of microwave pulses generated by electronic devices, such as arbitrary waveform generators, that are connected to the qubits via specialized channels [8, 31, 33]. Each qubit has multiple dedicated channels, each playing a specific role. The three main pulse channels are the DriveChannel, ControlChannel, and MeasureChannel. DriveChannel is used to drive the qubits in both single- and two-qubit gates with the input pulse. Similarly, ControlChannel transmits the pulse for the control qubit in two-qubit gate operations. MeasureChannel is used for readout operations. Each pulse of a quantum circuit has the following parameters: pulse duration, pulse amplitude, sigma, and beta. Pulse duration is the pulse length in units of the sampling period. The pulse amplitude is a complex value that holds the information on the frequency and the phase of the signal based on the backend properties of the quantum computer. Sigma is the standard deviation that sets the width of the pulse. Beta is the weighting factor for the Gaussian derivative of the pulse waveform. For example, the CNOT gate in IBM superconducting quantum computers is implemented at the pulse level using a combination of an $R_{ZX}(\pi/2)$ gate and a few single-qubit gates [11]. The $R_{ZX}$ gate is achieved via the echoed cross-resonance gate [45], which has less overhead and hence increases the fidelity of the circuit.

Example 2 Figure 1c is the pulse-level representation of a 3-qubit QAOA sub-circuit highlighted in grey in Fig. 1b. D0 and D1 are the drive channels of qubits $q_0$ and $q_1$, respectively. U0 is the control channel of qubit $q_0$.
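To make the four pulse parameters concrete, the sketch below samples a Gaussian envelope plus a beta-weighted Gaussian derivative on the quadrature component. It is a simplified illustration under stated assumptions: the function name `drag_waveform` and the numeric values are hypothetical, and vendor pulse libraries may differ in details such as sample alignment and end-point normalization.

```python
import numpy as np


def drag_waveform(duration, amp, sigma, beta):
    """Sample a Gaussian pulse with a derivative correction.

    duration: number of samples (pulse length in sampling periods)
    amp:      (possibly complex) peak amplitude
    sigma:    standard deviation of the Gaussian envelope, in samples
    beta:     weight of the Gaussian-derivative term
    """
    t = np.arange(duration)
    center = duration / 2
    gauss = np.exp(-0.5 * ((t - center) / sigma) ** 2)
    d_gauss = -(t - center) / sigma**2 * gauss  # time derivative of the envelope
    return amp * (gauss + 1j * beta * d_gauss)


# Hypothetical parameter values, for illustration only.
samples = drag_waveform(duration=160, amp=0.2, sigma=40, beta=1.5)
```

Setting `beta=0` recovers a plain Gaussian pulse, so the derivative term can be read as a correction added on top of the basic envelope.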

2.4 Quantum Hardware Errors

Noise is any unwanted disturbance superimposed on a signal, caused by operations applied to the qubits or by the environment, that results in errors. Noise is characterized during the calibration process using randomized benchmarking [28, 29, 39], which estimates the gate error rates using Clifford-only circuits, or tomography [12, 32], which fully characterizes the qubit state preparation and its operations. Qubits may experience different types of errors during the execution of a quantum circuit, defined as follows:

• Gate errors occur when gates are applied to the qubits. They are classified into single- and two-qubit gate errors. As the depth of the circuit increases, the gate errors accumulate, making the circuit unviable.
• Measurement errors occur when the readout operations are applied to the qubits.
• Idle errors occur due to decoherence and crosstalk. When a qubit interacts with the environment, decoherence changes the state of the qubit, leading to information loss in the quantum system. Idle qubits are also affected by crosstalk errors due to gates applied to neighboring qubits.

Several metrics have been defined to quantify the performance of NISQ computers in terms of the output fidelity of the executed quantum circuits. Probability of Successful Trial (PST) and Hellinger Fidelity (HF) are two metrics that have been used to determine the output fidelity of single- and multi-output quantum circuits, respectively. They are defined as follows:

$$PST = \frac{\text{Number of Successful Trials}}{\text{Total Number of Trials}}, \qquad HF = \left(1 - \frac{1}{2}\sum_{i=1}^{m}\left(\sqrt{s_i} - \sqrt{e_i}\right)^2\right)^2,$$

where $m$ denotes the number of different measured output states of a quantum circuit, while $s_i$ and $e_i$ are the probabilities of the output state $i$ based on ideal simulation and actual execution, respectively.
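Both metrics can be computed directly from measurement counts. The sketch below assumes counts are given as a dictionary from output bitstring to number of observations and that the ideal distribution is known; the function names and example data are hypothetical.

```python
import math


def pst(counts, correct_output):
    """Probability of Successful Trial for a single-output circuit."""
    total = sum(counts.values())
    return counts.get(correct_output, 0) / total


def hellinger_fidelity(ideal_probs, counts):
    """HF = (1 - 1/2 * sum_i (sqrt(s_i) - sqrt(e_i))^2)^2.

    ideal_probs: output state -> ideal probability s_i
    counts:      output state -> observed count (normalized to e_i here)
    """
    total = sum(counts.values())
    outcomes = set(ideal_probs) | set(counts)
    half_sq_dist = 0.5 * sum(
        (math.sqrt(ideal_probs.get(o, 0.0)) - math.sqrt(counts.get(o, 0) / total)) ** 2
        for o in outcomes
    )
    return (1.0 - half_sq_dist) ** 2


# Illustrative (made-up) measurement data.
single_output_counts = {"000": 900, "001": 100}
bell_ideal = {"00": 0.5, "11": 0.5}
bell_counts = {"00": 450, "11": 450, "01": 100}

pst_value = pst(single_output_counts, "000")
hf_value = hellinger_fidelity(bell_ideal, bell_counts)
```

Identical distributions give HF = 1 and distributions with disjoint support give HF = 0, which matches the role of HF as a fidelity for multi-output circuits.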


2.5 Quantum Circuit Compilation

Quantum compilers play a critical role in bridging the gap between quantum algorithms and quantum computers. They are used to decompose, optimize, and map the quantum circuit into the native hardware operations. The input to the compiler is a high-level description of a quantum algorithm, and the output is the physical quantum circuit implemented using native gates. Since quantum computers support few native single- and two-qubit gates, the high-level description of gates should be decomposed into the native gates supported by quantum computers. A SWAP gate, for instance, which is used to apply a two-qubit gate to non-adjacent qubits in the quantum hardware, is decomposed into three CNOT gates. To minimize the number of SWAP gates, topology-aware complex gate decomposition approaches have been proposed, which leverage the permissible operations on the quantum architecture [15, 55]. To reduce the circuit cost, several optimization techniques can be applied, such as template-based optimization, which relies on predefined optimized sub-circuit implementations [30], or propagation-based optimization, which reduces the gate count [23]. Next, each logical qubit is mapped to a physical qubit. Furthermore, the quantum gates are scheduled to satisfy the coupling constraint of the quantum architecture, which is described using a coupling graph. This process is referred to as quantum circuit mapping. Many of these compilation phases have been developed to be noise-aware, such as quantum circuit mapping, which typically depends on an Estimated Success Probability (ESP) to allocate qubits and schedule gates [9, 26, 34, 48]. The ESP metric is defined as

$$ESP = \prod_{i=1}^{G} (1 - e_{g_i}) \times \prod_{i=1}^{Q} (1 - e_{m_i}),$$

where $G$ and $Q$ are the numbers of gates and measured qubits, while $e_{g_i}$ and $e_{m_i}$ are the gate and measurement error rates, respectively [37]. Other approaches rely on noise-aware quantum circuit synthesis [52].

As many quantum computing vendors have recently started to support pulse-level access to the quantum processor to enable low-level circuit optimization, several compilation approaches have been proposed to further mitigate quantum circuit errors at the pulse level. Some approaches take advantage of the detailed timing information of each gate to mitigate idling errors by adaptively inserting dynamical decoupling sequences into the physical quantum circuit [14, 44]. Other techniques optimize the pulse by manipulating the gate construction [24, 54] and the pulse parameters [18] to mitigate gate and crosstalk errors.
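As a small worked illustration of the ESP metric, the sketch below multiplies the success probabilities of every gate and every measurement. The function name and the error rates are hypothetical and chosen only to make the arithmetic easy to follow.

```python
import math


def estimated_success_probability(gate_errors, measurement_errors):
    """ESP: product of (1 - error) over all gates and all measured qubits."""
    gate_term = math.prod(1.0 - e for e in gate_errors)
    meas_term = math.prod(1.0 - e for e in measurement_errors)
    return gate_term * meas_term


# A SWAP decomposed into three CNOTs, each with an assumed 1% error rate,
# followed by measuring two qubits with an assumed 2% readout error each.
swap_esp = estimated_success_probability([0.01, 0.01, 0.01], [0.02, 0.02])
```

The example also shows why SWAP insertion is costly: each additional SWAP contributes three two-qubit error factors to the product.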

3 ML for Quantum Circuit Reliability Assessment

In this section, we present various ML models that project the output fidelity of a given quantum algorithm at different design abstraction levels. Figure 2 illustrates the hierarchical representation of the quantum algorithm and the corresponding ML models to predict the quantum circuit output fidelity, and therefore, assess its reliability.


Fig. 2 Overview of our proposed ML-based framework

3.1 Logical Gate-Level Circuit (Application) Performance Assessment

Standardized benchmark suites are typically used to evaluate the performance across a wide range of hardware systems and architectures. Quantum benchmark suites such as SupermarQ [49] can be used to evaluate system-level performance across different quantum hardware for different applications. However, unlike classical computing devices, quantum hardware exhibits frequent variations in the error rates, which requires very frequent executions of the quantum benchmark suites; each execution is preceded by several compilation phases, which adds compilation time as well. By modeling the variation in the physical properties of the quantum hardware across different calibration periods and their impact on different quantum applications, we can eliminate the frequent executions of the quantum benchmark suites for performance assessment.

We leveraged ML for performance assessment at the application level [51]. We built an offline framework that selects, for each quantum application, the quantum hardware that yields the highest output fidelity. The inputs to the ML model are the logical quantum circuit and the backend properties of the quantum hardware. We extract (1) basic logical quantum circuit features, (2) structural features, proposed by the SupermarQ benchmarking suite for coverage assessment [49], that affect different physical properties of the hardware, and (3) backend features of the quantum hardware that are publicly available. While we utilize basic quantum circuit features such as the number of qubits, the number of two-qubit gates, and the circuit depth to estimate the circuit size, the advanced structural features in [49] enable studying how sensitive the quantum circuit is to different physical properties of the hardware, including the limited qubit connectivity (coupling constraint), coherence time, and crosstalk, idle, and measurement errors. We refer readers to the work in [49] for more details about how to compute these features. The hardware features include the number of qubits in the hardware, the average coherence time ($T_1$, $T_2$), the average gate time, and the error rates of single- and two-qubit gates and measurement operations. By including the backend data of the quantum hardware, we correlate the hardware-agnostic features (advanced structural features) of the quantum application with the physical properties of the quantum hardware. Thus, the model is trained once and used across different calibration periods. Traditional ML models, such as Decision Tree (DT), Random Forest (RF), and Neural Network (NN), can be utilized to construct the reliability model at this level.
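A minimal sketch of how the basic circuit features and the averaged backend features could be combined into one model input is given below. It covers only the basic features (number of qubits, two-qubit gate count, depth), not the SupermarQ structural features of [49]; the circuit encoding, function names, and backend values are all hypothetical.

```python
def logical_circuit_features(num_qubits, gates):
    """Basic size features of a logical circuit.

    gates: list of (name, qubits) tuples in program order.
    """
    depth_on_qubit = [0] * num_qubits
    two_qubit_gates = 0
    for _name, qubits in gates:
        # A gate is placed one layer after the latest gate on any of its qubits.
        layer = 1 + max(depth_on_qubit[q] for q in qubits)
        for q in qubits:
            depth_on_qubit[q] = layer
        if len(qubits) == 2:
            two_qubit_gates += 1
    return {
        "num_qubits": num_qubits,
        "two_qubit_gates": two_qubit_gates,
        "depth": max(depth_on_qubit, default=0),
    }


def build_feature_vector(circuit_features, backend_features):
    """Concatenate circuit and backend features in a fixed (sorted) order."""
    merged = dict(circuit_features)
    merged.update({f"backend_{k}": v for k, v in backend_features.items()})
    names = sorted(merged)
    return names, [float(merged[n]) for n in names]


# Hypothetical 3-qubit circuit on a fully connected interaction graph.
circuit = [("h", (0,)), ("h", (1,)), ("h", (2,)),
           ("cx", (0, 1)), ("cx", (1, 2)), ("cx", (0, 2))]
# Made-up averaged calibration data for one device.
backend = {"num_qubits": 7, "avg_t1_us": 120.0, "avg_t2_us": 40.0,
           "avg_cx_error": 0.009, "avg_readout_error": 0.025}

features = logical_circuit_features(3, circuit)
names, vector = build_feature_vector(features, backend)
```

A vector of this kind, labeled with the measured fidelity, would be one training sample for a DT, RF, or NN regressor; because the backend part changes with each calibration while the circuit part does not, the same trained model can be queried again after recalibration.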

3.2 Physical Gate-Level Circuit Performance Assessment

The physical quantum circuit implementation impacts its output fidelity. Given the different noise sources of NISQ hardware, minimizing the gate count and the circuit depth does not guarantee an improvement in the output fidelity. The non-stable physical properties of NISQ architectures demonstrated by the frequent calibrations make the problem even harder. While the existing metric, i.e., ESP, takes into account the up-to-date physical properties of the hardware, it fails to provide a comprehensive reliability model for quantum hardware that considers all the potential errors to be encountered in the quantum circuit (e.g., crosstalk and decoherence errors). In this subsection, we discuss several ML reliability models, which all take as an input the physical quantum circuit but vary in the amount of knowledge provided by the quantum hardware backend.

3.2.1 Traditional ML with Black Box Quantum Hardware [27]

The first attempt at building an ML reliability model that predicts the PST of physical quantum circuits was proposed in [27]. The ML model was built as an alternative to state-of-the-art noise characterization methods for quantum hardware, such as randomized benchmarking [28, 29, 39] and tomography [12, 32], to assess the error rates in the quantum hardware. It relies only on the physical quantum circuit implementation and the coupling constraint of the quantum hardware. Thus, it omits the physical properties of the backend computed during the calibration process, treating the quantum hardware as a black box. The circuit’s structural features include the number of single- and two-qubit gates, measurement operations, connectivity map, number of qubits, depth, and single-qubit gate count for each qubit. The connectivity map describes the number of CNOT gates in the physical quantum circuit for each edge in the coupling graph of the quantum hardware. Similar to the randomized benchmarking protocols, the proposed ML model is trained using random circuits that consist of Clifford-based gates only and are equivalent to the identity gate. While the proposed ML model, which uses polynomial regression and a neural network, outperforms the quantum circuit noisy simulator and the ESP-based reliability model, it still requires training per calibration process to adapt to the variations in the circuit error rates.
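The connectivity map and the per-qubit single-qubit gate counts can be extracted with a few lines of code. The sketch below is an illustration, not the feature extractor of [27]: the circuit encoding, the gate names, and the coupling graph (a 7-qubit edge list assumed here for illustration) are hypothetical.

```python
# Assumed undirected coupling graph of a 7-qubit device (illustrative only).
COUPLING_EDGES = [(0, 1), (1, 2), (1, 3), (3, 5), (4, 5), (5, 6)]


def connectivity_map(gates, coupling_edges):
    """Number of CNOT gates on each edge of the coupling graph."""
    edge_index = {frozenset(edge): i for i, edge in enumerate(coupling_edges)}
    counts = [0] * len(coupling_edges)
    for name, qubits in gates:
        if name != "cx":
            continue
        key = frozenset(qubits)
        if key not in edge_index:
            raise ValueError(f"CNOT on {qubits} violates the coupling constraint")
        counts[edge_index[key]] += 1
    return counts


def single_qubit_gate_counts(gates, num_qubits):
    """Number of single-qubit gates applied to each physical qubit."""
    counts = [0] * num_qubits
    for _name, qubits in gates:
        if len(qubits) == 1:
            counts[qubits[0]] += 1
    return counts


# Hypothetical physical circuit: (gate name, physical qubits).
physical_circuit = [("sx", (0,)), ("cx", (0, 1)), ("rz", (1,)),
                    ("cx", (1, 0)), ("cx", (1, 3)), ("sx", (3,))]

cmap = connectivity_map(physical_circuit, COUPLING_EDGES)
sq_counts = single_qubit_gate_counts(physical_circuit, num_qubits=7)
```

Note that none of these features uses calibration data, which is what makes this model a black-box view of the hardware and also why it must be retrained when the error rates drift.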


Table 1 The definition of backend and test features for traditional ML reliability models at the physical gate-level [50]

(a) Backend features
(1) $Var_{CN} = \dfrac{\max_{g_i \in G} e_{g_i} - \min_{g_i \in G} e_{g_i}}{\max_{g_i \in G} e_{g_i}}$
(2) $W_{CC} = n_{cnot}\,\xi_{cnot}$
(3) $W_{SC} = n_{sx}\,\xi_{sx} + n_{id}\,\xi_{id}$
(4) $T_{Cir} = \sum_{k=1}^{L} \max_{g_i \in G_k} t_{g_i}$

(b) Test features
(5) $SDT = \sqrt{\dfrac{\sum_{i=1}^{N} (PST_i - PST_{Avg})^2}{N}}$
(6) $KL = \sum_{i=1}^{N} ESP_i \log \dfrac{ESP_i}{PST_i}$
(7) $Cov_l = \dfrac{l}{L}$
(8) $Cov_g = \dfrac{g}{G}$

3.2.2 Traditional ML with White Box Quantum Hardware

To eliminate the need for frequent training of the ML reliability model, we built an ML reliability model that considers the quantum hardware as a white box [50]. We leveraged not only the circuit structural features but also the backend features of the quantum hardware. The structural features include the number of qubits, the circuit depth, the number of parallel gates in the same layer of the circuit, and Weighted Graph (WG)-based features. WG is computed based on the coupling graph of the quantum hardware, where the qubits are the nodes and each edge is weighted by the number of two-qubit gates applied to the pair of qubits it connects in the circuit. The backend features include the variation in the CNOT error rates of the circuit ($Var_{CN}$), the Weighted Single-qubit Count ($W_{SC}$), the Weighted CNOT Count ($W_{CC}$), and the circuit time ($T_{Cir}$), as shown in Table 1a.
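The backend features of Table 1a can be sketched in code as follows. This is an illustration under explicit assumptions about the table's notation, not the implementation of [50]: $Var_{CN}$ is read as the spread of the CNOT error rates normalized by the largest one, $\xi$ is read as the average error rate of the corresponding gate type in the circuit, and $T_{Cir}$ as the sum over layers of the longest gate time in each layer. The circuit encoding and the calibration values are made up.

```python
def backend_features(layers, error_rate, gate_time):
    """Backend features of a physical circuit (assumed reading of Table 1a).

    layers:     list of layers, each a list of (gate_name, qubits) tuples
    error_rate: dict mapping (gate_name, qubits) -> calibrated error rate
    gate_time:  dict mapping (gate_name, qubits) -> gate duration
    """
    gates = [gate for layer in layers for gate in layer]

    def errors_of(name):
        return [error_rate[g] for g in gates if g[0] == name]

    def mean(values):
        return sum(values) / len(values) if values else 0.0

    cnot, sx, idle = errors_of("cx"), errors_of("sx"), errors_of("id")

    # Assumption: variation = (max - min) / max over the CNOTs in the circuit.
    var_cn = (max(cnot) - min(cnot)) / max(cnot) if cnot and max(cnot) > 0 else 0.0
    # Assumption: weighted counts = gate count times average error rate.
    w_cc = len(cnot) * mean(cnot)
    w_sc = len(sx) * mean(sx) + len(idle) * mean(idle)
    # Circuit time: the slowest gate of each layer determines the layer time.
    t_cir = sum(max(gate_time[g] for g in layer) for layer in layers if layer)
    return {"var_cn": var_cn, "w_cc": w_cc, "w_sc": w_sc, "t_cir": t_cir}


# Hypothetical layered circuit and calibration data (times in ns).
layers = [
    [("sx", (0,)), ("sx", (1,))],
    [("cx", (0, 1))],
    [("cx", (1, 2)), ("id", (0,))],
]
error_rate = {("sx", (0,)): 0.001, ("sx", (1,)): 0.003,
              ("cx", (0, 1)): 0.01, ("cx", (1, 2)): 0.02, ("id", (0,)): 0.002}
gate_time = {("sx", (0,)): 35, ("sx", (1,)): 35,
             ("cx", (0, 1)): 300, ("cx", (1, 2)): 400, ("id", (0,)): 35}

feats = backend_features(layers, error_rate, gate_time)
```

Because these features are recomputed from the latest calibration data for the specific qubits the circuit uses, a model trained on them does not need to be retrained after every calibration.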

3.2.3 Traditional ML with Runtime Features

The features of the white box ML reliability model have been extended to include runtime features for complex noise modeling [50]. The runtime features are collected by executing application-specific test circuits to capture the variations and the drift in the error rates of the quantum circuits [4, 5]. The test circuits are generated by partitioning the quantum circuit into sub-circuits. For each sub-circuit, we uncompute the quantum gates by adding their inverses in reverse order to return to the initial state. In other words, for a sub-circuit with unitary matrix $U_1$, its corresponding test circuit implements the unitary $U_1 \cdot U_1^{\dagger}$, which is equivalent to the identity gate.

Example 3 Figure 3b shows an example of a test circuit consisting of a sub-circuit highlighted in Fig. 3a followed by the uncompute operations.

We extract test features from the output of the test circuits as shown in Table 1b. SDT measures the standard deviation of the PST of all the generated test circuits. KL measures the divergence between the ESP and PST of all the generated test circuits. $Cov_g$ and $Cov_l$ provide the ratio of the number of gates ($g$) and layers ($l$) that are covered by the test circuit for a given physical quantum circuit.
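The uncompute construction and the test features can be illustrated numerically. In the sketch below, gates are represented as small unitary matrices, and SDT and KL follow the definitions in Table 1b as we read them (a population standard deviation and a sum of $ESP_i \log(ESP_i / PST_i)$ terms); the function names and the PST/ESP values are hypothetical.

```python
import math
import numpy as np

H = np.array([[1, 1], [1, -1]], dtype=complex) / np.sqrt(2)
S = np.array([[1, 0], [0, 1j]], dtype=complex)


def uncompute_unitary(sub_circuit):
    """Overall unitary of a sub-circuit followed by its uncompute operations.

    sub_circuit: list of unitary matrices in order of application.
    """
    dim = sub_circuit[0].shape[0]
    total = np.eye(dim, dtype=complex)
    for gate in sub_circuit:            # apply the sub-circuit gate by gate
        total = gate @ total
    for gate in reversed(sub_circuit):  # then the inverses in reverse order
        total = gate.conj().T @ total
    return total


def sdt(pst_values):
    """Standard deviation of the PST over all test circuits."""
    avg = sum(pst_values) / len(pst_values)
    return math.sqrt(sum((p - avg) ** 2 for p in pst_values) / len(pst_values))


def kl(esp_values, pst_values):
    """Divergence between estimated (ESP) and measured (PST) success rates."""
    return sum(e * math.log(e / p) for e, p in zip(esp_values, pst_values))


def coverage(covered, total):
    """Fraction of gates (or layers) covered by the test circuits."""
    return covered / total


identity_check = uncompute_unitary([H, S, H])
# Made-up runtime data from three test circuits.
pst_runs = [0.9, 0.8, 0.7]
esp_runs = [0.95, 0.85, 0.80]
runtime_features = {
    "sdt": sdt(pst_runs),
    "kl": kl(esp_runs, pst_runs),
    "cov_g": coverage(12, 20),
    "cov_l": coverage(6, 8),
}
```

Since the ideal output of every test circuit is the initial state, its PST can be measured without simulating the circuit, and a large KL value signals that the calibration-based ESP no longer describes the device well.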


Fig. 3 3-qubit QAOA circuit with (a) sub-circuit highlighted in red and (b) uncompute test circuit for the highlighted sub-circuit

3.3 Graph Neural Network

Quantum circuits have a topology-based structure. Therefore, it is necessary to employ an ML algorithm that includes the circuit details at the basic gate level to enhance the prediction accuracy of the output fidelity. Since quantum circuits can be represented by a gate dependency graph in the form of a Directed Acyclic Graph (DAG), Graph Neural Networks (GNN) can incorporate new and various circuit structural information into the learning process. Thus, GNN has also been exploited for quantum circuit reliability assessment [41]. In the DAG, the gates are represented by nodes, and the direct dependencies between the gates are the edges. For each node of the DAG, several features based on the structure and the error rates of the quantum hardware can be added. The features include the gate layer number, the qubit ID of the gate, the gate type, the number of parallel gates acting in the same layer, the coherence time ($T_1$, $T_2$) of the qubits the gate acts on, the gate error, and the gate time.

Example 4 The 3-qubit QAOA physical quantum circuit in Fig. 4a is represented using the DAG provided in Fig. 4b.
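The gate dependency DAG can be built with a single pass over the gate list. The sketch below attaches only the structural node features (layer, qubits, gate type, number of parallel gates); hardware features such as gate error or coherence times would be looked up from calibration data and are omitted. The circuit encoding and function name are hypothetical, and this is not the graph construction of [41].

```python
def circuit_to_dag(gates):
    """Build a gate dependency DAG.

    gates: list of (name, qubits) tuples in program order.
    Returns (nodes, edges): nodes are feature dicts, edges are (src, dst)
    pairs linking each gate to the previous gate on each of its qubits.
    """
    last_on_qubit = {}
    nodes, edges = [], set()
    for idx, (name, qubits) in enumerate(gates):
        preds = {last_on_qubit[q] for q in qubits if q in last_on_qubit}
        layer = 1 + max((nodes[p]["layer"] for p in preds), default=0)
        nodes.append({"id": idx, "gate": name, "qubits": tuple(qubits), "layer": layer})
        edges.update((p, idx) for p in preds)
        for q in qubits:
            last_on_qubit[q] = idx

    # Node feature: how many gates share the node's layer.
    gates_per_layer = {}
    for node in nodes:
        gates_per_layer[node["layer"]] = gates_per_layer.get(node["layer"], 0) + 1
    for node in nodes:
        node["parallel"] = gates_per_layer[node["layer"]]
    return nodes, sorted(edges)


# Hypothetical physical circuit fragment.
example = [("h", (0,)), ("h", (1,)), ("cx", (0, 1)), ("rz", (1,)), ("cx", (1, 2))]
dag_nodes, dag_edges = circuit_to_dag(example)
```

The node list and edge list map directly onto the node-feature matrix and adjacency structure that GNN libraries typically expect as input.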

Fig. 4 3-qubit QAOA circuit (a) represented using basic gates of IBM Q Jakarta and (b) the corresponding DAG graph

3.4 Pulse-Level Circuit Performance Assessment

Quantum computing vendors (e.g., IBM and Rigetti) are moving toward enabling pulse-level quantum circuit design to optimize pulse control, reduce circuit latency, and therefore, improve the output fidelity of the quantum circuit. The low-level access to the quantum circuit can support several optimization opportunities that minimize errors. The impact of these techniques (e.g., Dynamical Decoupling sequences [14, 44]) on the circuit output fidelity is hardware-dependent. An ML model that predicts the output fidelity of quantum circuits at the pulse level is vital to ensure the utilization of the best optimization and design practices at the pulse level [18, 24, 54]. We propose to couple the pulse-level circuit features with the backend features for better fidelity prediction accuracy. The features consist of pulse, channel, backend, and circuit features. The pulse features are the min, avg, and max of the amplitude, the beta, the sigma, and the duration of the pulse signal. The channel features are the number of drive channels, control channels, and measurement channels. The backend features are the qubit frequency and the coherence time ($T_1$, $T_2$). The circuit features include the latency of the pulse-level circuit, the total number of pulses in the circuit, and the number of qubits, single-qubit gate pulses, and cross-resonance (CR) or $R_{ZX}$ gate pulses. Since pulse-level access provides accurate timing information for each gate in the quantum circuit, we can identify the number of simultaneous pulses at each time step, which is essential to determine the sensitivity of the circuit to crosstalk errors. Thus, we also add to the circuit-related features the average number of parallel pulses in any time step.
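A sketch of how some of these features could be extracted from a pulse schedule is shown below. The schedule format (a list of dictionaries with channel name, start time, and pulse parameters), the channel naming convention (`d`, `u`, `m` prefixes), and the numeric values are assumptions made for illustration; the average number of parallel pulses is taken here over all time steps of the schedule.

```python
def pulse_level_features(pulses):
    """Pulse, channel, and circuit features of a pulse schedule.

    pulses: list of dicts with keys
            channel, start, duration, amp, sigma, beta (times in samples).
    """
    def stats(values):
        return {"min": min(values), "avg": sum(values) / len(values), "max": max(values)}

    latency = max(p["start"] + p["duration"] for p in pulses)

    # Count how many pulses are active at each time step.
    active = [0] * latency
    for p in pulses:
        for t in range(p["start"], p["start"] + p["duration"]):
            active[t] += 1

    channels = {p["channel"] for p in pulses}
    return {
        "amp": stats([abs(p["amp"]) for p in pulses]),
        "sigma": stats([p["sigma"] for p in pulses]),
        "beta": stats([p["beta"] for p in pulses]),
        "duration": stats([p["duration"] for p in pulses]),
        "num_drive_channels": sum(c.startswith("d") for c in channels),
        "num_control_channels": sum(c.startswith("u") for c in channels),
        "num_measure_channels": sum(c.startswith("m") for c in channels),
        "latency": latency,
        "num_pulses": len(pulses),
        "avg_parallel_pulses": sum(active) / latency,
    }


# Hypothetical schedule with two drive pulses and one control-channel pulse.
schedule = [
    {"channel": "d0", "start": 0, "duration": 4, "amp": 0.2, "sigma": 1.0, "beta": 0.5},
    {"channel": "d1", "start": 2, "duration": 4, "amp": 0.1, "sigma": 1.0, "beta": 0.4},
    {"channel": "u0", "start": 6, "duration": 2, "amp": 0.4, "sigma": 0.5, "beta": 0.0},
]
pulse_feats = pulse_level_features(schedule)
```

In this example the two drive pulses overlap during two time steps, which raises `avg_parallel_pulses` above 1 and indicates exposure to crosstalk.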

4 Qualitative Comparison

In this section, we compare the proposed ML reliability models for different quantum circuit design abstractions in terms of three critical metrics: the information required to build the models, the usability, and the prediction accuracy of the output fidelity. We consider the traditional ML model for a fair comparison. Table 2 summarizes our qualitative comparison.


Table 2 Comparison of ML models at different quantum circuit abstractions

Circuit abstraction    Info. required    Usability                         Accuracy
Logical gate-level     Low               Application-level benchmarking    Medium
Physical gate-level    Medium            Optimization and scheduling       High to very high
Pulse-level            High              Optimization                      Very high

The ML model at the logical quantum circuit level is based on the structural information of the quantum application and the average backend properties, making this model useful for benchmarking quantum hardware through rough estimation of the application fidelity. While it suffers in accuracy, it can be quite useful for eliminating unnecessary computation. The ML model at the physical quantum circuit level relies on the physical implementation of the quantum circuit, which satisfies the coupling constraint of the hardware. The knowledge of the assigned physical qubits enables extracting the backend features of the corresponding qubits and their gates, which improves the fidelity prediction. This model is very useful for predicting the impact of different gate-level optimization and scheduling on output fidelity, especially if access to the pulse-level implementation is not permitted. Finally, the ML model at the pulse level is provided with a rich set of information about the pulse-level design, which is very important with the current advancements in pulse-level-based circuit optimizations.

For a given quantum algorithm and a specific ML algorithm, the pulse-level-based ML model is expected to provide the highest prediction accuracy, as it takes the lowest level representation of the quantum circuit, which includes not only the exact set of gates applied but also how these gates are implemented in the form of pulses and detailed timing information of when each pulse is applied. Since the ML model used at the application/logical level relies on the logical circuit structure and the average backend features, it provides the lowest fidelity prediction accuracy. The ML model at the physical quantum circuit level is expected to provide a high accuracy that is lower than that of the pulse level, as it fails to capture the circuit optimization at the lower level. Finally, utilizing deep learning models (i.e., graph neural networks) at the physical gate level can provide very high accuracy. Similarly, applying deep learning at the pulse level is also expected to boost the accuracy.

5 Quantitative Comparison

In this section, we experimentally compare the accuracy of the proposed ML reliability models based on the PST and HF output fidelity metrics for single- and multi-output quantum circuits.
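As a reminder of what the two target metrics measure, the following minimal sketch computes them from measurement counts. It assumes the commonly used forms: PST as the fraction of shots that return the single correct output, and HF as the Hellinger fidelity, i.e., the squared sum of $\sqrt{p_i q_i}$ over all outcomes of the ideal and measured distributions. The function names and the example counts are illustrative and are not taken from the chapter.

```python
import math

def pst(counts, correct_output):
    """Probability of successful trial: share of shots yielding the correct output.

    Assumed form for a single-output circuit; `counts` maps bitstrings to shot counts.
    """
    shots = sum(counts.values())
    return counts.get(correct_output, 0) / shots

def hellinger_fidelity(ideal_counts, measured_counts):
    """Hellinger fidelity between two count dictionaries (assumed form).

    Both inputs are normalized to probabilities first; the result is
    (sum_i sqrt(p_i * q_i)) ** 2, which is 1 for identical distributions.
    """
    p_total = sum(ideal_counts.values())
    q_total = sum(measured_counts.values())
    outcomes = set(ideal_counts) | set(measured_counts)
    overlap = sum(
        math.sqrt((ideal_counts.get(o, 0) / p_total) * (measured_counts.get(o, 0) / q_total))
        for o in outcomes
    )
    return overlap ** 2

# Illustrative (made-up) counts for a 3-qubit single-output circuit
example_pst = pst({"000": 900, "001": 100}, "000")
# Illustrative multi-output case: an ideal GHZ-like distribution vs. a noisy one
example_hf = hellinger_fidelity({"00": 1, "11": 1}, {"00": 45, "11": 45, "01": 10})
```

PST only applies when a single correct bitstring exists, which is why a distribution-level metric such as HF is needed for the multi-output benchmarks.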

Lecture Notes in Computer Science: Authors’ Instructions


Table 3 Properties of physical quantum circuits for training and testing

Device          Dataset    Fidelity (min / max)   #CNOT (min / max)   Depth (min / max)
IBM Q Jakarta   Training   0.10 / 0.96            13 / 67             27 / 76
IBM Q Jakarta   Testing    0.11 / 0.99            6 / 89              16 / 178
Aspen-M-2       Training   0.06 / 0.79            16 / 58             18 / 69
Aspen-M-2       Testing    0.09 / 0.76            6 / 30              12 / 45

5.1 Quantum Circuits for Training

We train our ML models using random quantum circuits at different design abstractions. We employ several types of single-output random circuits to train our PST-based ML reliability model:

• Random circuits consisting of universal gates followed by uncompute operations that return the qubits to their initial state.
• Topology-aware random circuits consisting of universal gates that satisfy the coupling constraint of the quantum hardware, followed by their uncompute operations.
• Clifford-based random circuits that are equivalent to the identity gate, generated using randomized benchmarking.

For the HF-based ML reliability models, 25% of the training dataset consists of the random circuits generated to train the PST-based ML reliability models. The remaining 75% of the training dataset are random circuits consisting of Clifford gates only, which can be simulated efficiently on a classical computer [1, 6]. Table 3 provides the main properties of the physical implementation of the training circuits that satisfy the coupling constraint of the IBM Q Jakarta and Rigetti Aspen-M-2 quantum computers, including the minimum and maximum fidelity, gate count, and circuit depth. Table 3 shows that we train our models with a diverse dataset covering different quantum circuit sizes and output fidelities.
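The first circuit family above can be illustrated with a small statevector sketch. It is not the authors' generator: the gate set (H, T, CNOT), the gate-selection probabilities, and all function names are assumptions chosen for illustration. The point it shows is that appending the inverse gates in reverse order returns the register to the all-zero state, so the ideal output is known without simulation and any deviation observed on hardware is due to noise.

```python
import numpy as np

# Assumed universal gate set for the sketch: H, T and CNOT
H = np.array([[1, 1], [1, -1]], dtype=complex) / np.sqrt(2)
T = np.diag([1, np.exp(1j * np.pi / 4)])
CX = np.array([[1, 0, 0, 0],
               [0, 1, 0, 0],
               [0, 0, 0, 1],
               [0, 0, 1, 0]], dtype=complex)

def apply(state, gate, qubits, n):
    """Apply a k-qubit gate matrix to the listed qubits of an n-qubit statevector."""
    k = len(qubits)
    psi = state.reshape([2] * n)
    psi = np.moveaxis(psi, qubits, list(range(k)))   # bring the target axes to the front
    shape = psi.shape
    psi = gate @ psi.reshape(2 ** k, -1)             # act on the target subspace
    psi = np.moveaxis(psi.reshape(shape), list(range(k)), qubits)
    return psi.reshape(-1)

def random_compute_uncompute(n, depth, rng):
    """Random gate list followed by its uncompute part (inverse gates, reversed)."""
    ops = []
    for _ in range(depth):
        if n > 1 and rng.random() < 0.5:
            control, target = rng.choice(n, size=2, replace=False)
            ops.append((CX, [int(control), int(target)]))
        else:
            gate = H if rng.random() < 0.5 else T
            ops.append((gate, [int(rng.integers(n))]))
    uncompute = [(gate.conj().T, qubits) for gate, qubits in reversed(ops)]
    return ops + uncompute

def run(ops, n):
    """Simulate the gate list noiselessly, starting from the all-zero state."""
    state = np.zeros(2 ** n, dtype=complex)
    state[0] = 1.0
    for gate, qubits in ops:
        state = apply(state, gate, qubits, n)
    return state

circuit = random_compute_uncompute(n=3, depth=10, rng=np.random.default_rng(0))
ideal_pst = abs(run(circuit, 3)[0]) ** 2  # probability of measuring |000>
```

In the noiseless case `ideal_pst` equals one up to floating-point error, which is what makes the measured PST of such circuits directly usable as a training label.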

5.2 Quantum Circuits for Testing

We use several quantum benchmark circuits at different design abstractions to test our PST- and HF-based ML reliability models. We use the following single-output circuits for the PST-based models: Bernstein-Vazirani (BV) algorithm [10], Quantum Fourier Transform (QFT) [35], Hidden Shift (HS) algorithm [13], Grover Search algorithm [21], some reversible circuits (Adder, Toffoli, and Decoder), and other quantum benchmarks from SupermarQ, including Bit-Code (BC) and Phase-Code (PC) for error correction [49]. For the HF-based ML reliability models, we test using multi-output quantum circuits from SupermarQ, which include Greenberger–Horne–Zeilinger (GHZ), ZZ_SWAP_QAOA (FQAOA), Vanilla QAOA (VQAOA),


V. Saravanan and S. M. Saeed

VQE, and Hamiltonian Simulation (HSim) [49]. We report the minimum and maximum fidelity, gate count, and circuit depth of the physical implementation of the quantum benchmark circuits (test dataset) that satisfy the coupling constraint of IBM Q Jakarta and Rigetti Aspen-M-2 quantum computers in Table 3.

5.3 Experimental Setup

We accumulated 3000 and 293 data samples for the IBM Q Jakarta (7-qubit device) and Rigetti Aspen-M-2 (80-qubit device) quantum computers, respectively. We use NN, DT, and RF ML algorithms to build the logical gate-level ML model (LC), the physical gate-level ML model with test features (PCWT) and without test features (PCWOT), considering a white-box quantum hardware, and the pulse-level ML model (Pulse). We use K-fold cross-validation with $k = 10$: the dataset is split into 10 folds, where 9 folds are used for training the models and the 10th fold is used for validation. We use the coefficient of determination ($R^2$) and the Root Mean Squared Error (RMSE) to evaluate the traditional ML models at different quantum circuit abstractions. Both $R^2$ and RMSE measure the ability of the model to predict the fidelity of unseen data. They are computed as:

\[
R^2 = 1 - \frac{\sum_{i=1}^{n} (y_i - x_i)^2}{\sum_{i=1}^{n} (y_i - \bar{y})^2},
\qquad
\mathrm{RMSE} = \sqrt{\frac{1}{n} \sum_{i=1}^{n} (y_i - x_i)^2},
\]

where $n$ represents the number of data samples, $\bar{y}$ denotes the mean value of the output fidelity, and $y_i$ and $x_i$ denote the quantum circuit's original and predicted output fidelity, respectively.

Given the limited pulse-level access to the quantum hardware, we leverage the Qiskit noisy simulator [2] at the gate and the pulse levels for IBM Q Jakarta to generate a simulated dataset that ensures a fair comparison between the different ML models. For the Rigetti Aspen-M-2 quantum computer, we execute the quantum circuits on the quantum computer itself to collect our dataset. Due to the limited access to the Aspen-M-2 quantum computer, we focus on the gate-level traditional ML models (LC, PCWOT) and utilize Generative Adversarial Networks (GANs) to populate more data samples for training our traditional ML models. A GAN model has two main components: a generator (G) and a discriminator (D). The generator generates new data by adding noise to the original dataset. The discriminator decides whether a given sample is original or fake (i.e., produced by the generator). The generator's objective is to produce samples that look real and are as close to the original dataset as possible, while the discriminator discards newly generated samples that are not similar to the original dataset. Once we have populated more data samples using the GAN, we train our ML models.
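The two metrics and the 10-fold split can be sketched in a few lines of NumPy. This is a minimal illustration of the formulas above, not the authors' evaluation code; the function names, the shuffling seed, and the small example vectors are assumptions.

```python
import numpy as np

def r2_score(y, x):
    """Coefficient of determination; y = original fidelities, x = predicted fidelities."""
    y, x = np.asarray(y, dtype=float), np.asarray(x, dtype=float)
    return 1.0 - np.sum((y - x) ** 2) / np.sum((y - y.mean()) ** 2)

def rmse(y, x):
    """Root mean squared error between original (y) and predicted (x) fidelities."""
    y, x = np.asarray(y, dtype=float), np.asarray(x, dtype=float)
    return np.sqrt(np.mean((y - x) ** 2))

def kfold_indices(n, k=10, seed=0):
    """Yield (train, validation) index arrays for k-fold cross-validation."""
    shuffled = np.random.default_rng(seed).permutation(n)
    folds = np.array_split(shuffled, k)
    for i in range(k):
        train = np.concatenate([folds[j] for j in range(k) if j != i])
        yield train, folds[i]

# Made-up fidelities, only to exercise the formulas
y_true = [0.9, 0.8, 0.7, 0.6]
y_pred = [0.85, 0.8, 0.75, 0.6]
example_r2 = r2_score(y_true, y_pred)
example_rmse = rmse(y_true, y_pred)
```

With 3000 samples, each of the 10 iterations trains on 2700 samples and validates on the remaining 300; the reported score is typically the average over the 10 validation folds.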


Fig. 5 Average (a) $R^2$ and (b) RMSE of ML models for different quantum circuit design abstractions simulated for IBM Q Jakarta

5.4 Results

The average $R^2$ and RMSE of the different traditional ML models for the various quantum circuit design abstractions that target IBM Q Jakarta are shown in Fig. 5a and b, respectively. We observe that the ML models for physical quantum circuits with test features (PCWT) increase the fidelity prediction accuracy compared to the corresponding models without test features (PCWOT). However, the PCWOT ML models still outperform the corresponding models at the logical gate level (LC). Furthermore, the Pulse ML models outperform the PCWT models, although the Pulse ML models lack the runtime features. We also report the $R^2$ and RMSE of the LC, PCWT, and Pulse ML models for different benchmark quantum circuits in Tables 4 and 5, respectively. The results show that, regardless of the applied ML algorithm, the Pulse ML model provides the highest prediction accuracy and the LC ML model provides the lowest prediction accuracy in most cases. Furthermore, the NN-based ML reliability models for different quantum circuit design abstractions perform better on average than the other proposed traditional ML reliability models.

Figure 6 compares the $R^2$ and RMSE of the LC and PCWOT models for different benchmarks executed on the Aspen-M-2 quantum computer. While neither model achieves high accuracy due to the limited dataset size, the PCWOT model always provides higher fidelity prediction accuracy than LC. We anticipate that as the size of the dataset increases, the accuracy of the models will increase significantly as well. From our analysis, we conclude that as the knowledge of the quantum circuit deepens, the accuracy of the ML models also increases. While we compare the different ML models using traditional ML algorithms, we expect the same relationship between the ML models at different design abstractions to hold when advanced deep learning algorithms (e.g., graph neural networks) are applied to achieve higher accuracy.


Table 4 Comparison of $R^2$ of ML models for different quantum circuit design abstractions simulated for IBM Q Jakarta

Circuits  NN-LC    NN-PCWT  NN-Pulse  DT-LC    DT-PCWT  DT-Pulse  RF-LC    RF-PCWT  RF-Pulse
BV        7.90E-1  9.28E-1  9.48E-1   7.77E-1  9.69E-1  9.73E-1   6.82E-1  9.62E-1  9.75E-1
Adder     7.92E-1  9.33E-1  9.52E-1   7.80E-1  9.73E-1  9.73E-1   6.81E-1  9.73E-1  9.80E-1
Decoder   7.83E-1  9.26E-1  9.47E-1   7.80E-1  9.61E-1  9.72E-1   6.87E-1  9.61E-1  9.76E-1
Grover    7.90E-1  9.30E-1  9.46E-1   7.80E-1  9.65E-1  9.79E-1   6.93E-1  9.64E-1  9.75E-1
Toff      7.88E-1  9.28E-1  9.49E-1   7.78E-1  9.72E-1  9.76E-1   6.86E-1  9.67E-1  9.76E-1
QPEA      7.87E-1  9.27E-1  9.50E-1   7.73E-1  9.66E-1  9.76E-1   6.82E-1  9.64E-1  9.73E-1
QFT       7.78E-1  2.30E-1  9.45E-1   7.75E-1  7.87E-1  9.75E-1   6.92E-1  8.03E-1  9.75E-1
HS        7.88E-1  9.22E-1  9.46E-1   7.79E-1  9.67E-1  9.75E-1   6.84E-1  9.62E-1  9.73E-1
PC        7.84E-1  9.23E-1  9.45E-1   7.79E-1  9.65E-1  9.71E-1   6.84E-1  9.65E-1  9.77E-1
BC        7.87E-1  9.26E-1  9.47E-1   7.72E-1  9.64E-1  9.76E-1   6.89E-1  9.62E-1  9.75E-1
GHZ       7.90E-1  9.28E-1  9.51E-1   7.79E-1  9.72E-1  9.80E-1   6.82E-1  9.64E-1  9.72E-1
FQAOA     7.85E-1  9.29E-1  9.52E-1   7.72E-1  9.72E-1  9.82E-1   6.81E-1  9.61E-1  9.76E-1
VQAOA     7.90E-1  9.23E-1  9.47E-1   7.75E-1  9.62E-1  9.73E-1   6.85E-1  9.64E-1  9.75E-1
VQE       7.83E-1  9.30E-1  9.48E-1   7.80E-1  9.61E-1  9.77E-1   6.83E-1  9.68E-1  9.71E-1
Hsim      7.86E-1  9.23E-1  9.49E-1   7.80E-1  9.68E-1  9.77E-1   6.88E-1  9.62E-1  9.75E-1

Table 5 Comparison of RMSE of ML models for different quantum circuit design abstractions simulated for IBM Q Jakarta

Circuits  NN-LC    NN-PCWT  NN-Pulse  DT-LC    DT-PCWT  DT-Pulse  RF-LC    RF-PCWT  RF-Pulse
BV        2.10E-1  7.22E-2  5.24E-2   2.23E-1  3.14E-2  2.65E-2   3.18E-1  3.78E-2  2.48E-2
Adder     2.08E-1  6.74E-2  4.84E-2   2.20E-1  2.74E-2  2.67E-2   3.19E-1  2.74E-2  1.99E-2
Decoder   2.17E-1  7.38E-2  5.34E-2   2.20E-1  3.87E-2  2.78E-2   3.13E-1  3.88E-2  2.44E-2
Grover    2.10E-1  6.98E-2  5.37E-2   2.20E-1  3.47E-2  2.12E-2   3.07E-1  3.56E-2  2.48E-2
Toff      2.12E-1  7.22E-2  5.12E-2   2.22E-1  2.84E-2  2.44E-2   3.14E-1  3.35E-2  2.44E-2
QPEA      2.13E-1  7.32E-2  4.98E-2   2.27E-1  3.43E-2  2.43E-2   3.18E-1  3.64E-2  2.67E-2
QFT       2.22E-1  7.70E-1  5.48E-2   2.25E-1  2.13E-1  2.47E-2   3.08E-1  1.97E-1  2.48E-2
HS        2.12E-1  7.81E-2  5.37E-2   2.21E-1  3.34E-2  2.48E-2   3.16E-1  3.85E-2  2.74E-2
PC        2.16E-1  7.66E-2  5.47E-2   2.21E-1  3.48E-2  2.86E-2   3.16E-1  3.52E-2  2.33E-2
BC        2.13E-1  7.37E-2  5.28E-2   2.28E-1  3.57E-2  2.44E-2   3.11E-1  3.82E-2  2.48E-2
GHZ       2.10E-1  7.17E-2  4.94E-2   2.21E-1  2.85E-2  1.98E-2   3.18E-1  3.56E-2  2.84E-2
FQAOA     2.15E-1  7.13E-2  4.83E-2   2.28E-1  2.77E-2  1.84E-2   3.19E-1  3.88E-2  2.41E-2
VQAOA     2.10E-1  7.66E-2  5.32E-2   2.25E-1  3.76E-2  2.73E-2   3.15E-1  3.57E-2  2.46E-2
VQE       2.17E-1  6.99E-2  5.17E-2   2.20E-1  3.88E-2  2.34E-2   3.17E-1  3.18E-2  2.87E-2
Hsim      2.14E-1  7.67E-2  5.12E-2   2.20E-1  3.24E-2  2.28E-2   3.12E-1  3.83E-2  2.47E-2


Fig. 6 Average (a) $R^2$ and (b) RMSE of ML models for different quantum circuit design abstractions executed on Rigetti Aspen-M-2

6 Summary

In this chapter, we present and discuss several ML reliability assessment models at different quantum circuit design abstractions. We provide qualitative and quantitative comparisons between the proposed ML models. We show the effectiveness of the ML reliability models using the IBM Q Jakarta and Rigetti Aspen-M-2 quantum computers.

References

1. Aaronson, S., Gottesman, D.: Improved simulation of stabilizer circuits. Phys. Rev. A 70, 052328 (Nov 2004). https://doi.org/10.1103/PhysRevA.70.052328
2. Abraham, H., et al.: Qiskit: An open-source framework for quantum computing (2019). https://doi.org/10.5281/zenodo.2562110
3. Abrams, D.S., Lloyd, S.: Simulation of many-body fermi systems on a universal quantum computer. Phys. Rev. Lett. 79, 2586–2589 (Sep 1997). https://doi.org/10.1103/PhysRevLett.79.2586
4. Acharya, N., Saeed, S.M.: A lightweight approach to detect malicious/unexpected changes in the error rates of NISQ computers. In: IEEE/ACM International Conference On Computer Aided Design, ICCAD. pp. 1–9 (2020)
5. Acharya, N., Urbánek, M., de Jong, W.A., Saeed, S.M.: Test points for online monitoring of quantum circuits. ACM J. Emerg. Technol. Comput. Syst. 18(1), 14:1–14:19 (2022). https://doi.org/10.1145/3477928
6. Anders, S., Briegel, H.J.: Fast simulation of stabilizer circuits using a graph-state representation. Phys. Rev. A 73, 022334 (Feb 2006). https://doi.org/10.1103/PhysRevA.73.022334
7. Arute, F., et al.: Quantum supremacy using a programmable superconducting processor. Nature 574(7779), 505–510 (Oct 2019). https://doi.org/10.1038/s41586-019-1666-5
8. Arute, F., Arya, K., Babbush, R., Bacon, D., Bardin, J.C., Barends, R., Biswas, R., Boixo, S., Brandao, F.G., Buell, D.A., et al.: Quantum supremacy using a programmable superconducting processor. Nature 574(7779), 505–510 (2019)
9. Ash-Saki, A., Alam, M., Ghosh, S.: QURE: Qubit re-allocation in noisy intermediate-scale quantum computers. In: Proceedings of ACM/IEEE Design Automation Conference. pp. 1–6. ACM (2019)
10. Bernstein, E., Vazirani, U.: Quantum complexity theory. SIAM J. Comput. 26(5), 1411–1473 (Oct 1997). https://doi.org/10.1137/S0097539796300921
11. Chow, J.M., Gambetta, J.M., Magesan, E., Abraham, D.W., Cross, A.W., Johnson, B.R., Masluk, N.A., Ryan, C.A., Smolin, J.A., Srinivasan, S.J., et al.: Implementing a strand of a scalable fault-tolerant quantum computing fabric. Nature communications 5(1), 1–9 (2014)
12. Chuang, I.L., Nielsen, M.A.: Prescription for experimental determination of the dynamics of a quantum black box. Journal of Modern Optics 44(11-12), 2455–2467 (1997). https://doi.org/10.1080/09500349708231894
13. van Dam, W., Hallgren, S., Ip, L.: Quantum algorithms for some hidden shift problems. In: Proceedings of the Fourteenth Annual ACM-SIAM Symposium on Discrete Algorithms. p. 489–498. SODA '03, Society for Industrial and Applied Mathematics, USA (2003)
14. Das, P., Tannu, S., Dangwal, S., Qureshi, M.: Adapt: Mitigating idling errors in qubits via adaptive dynamical decoupling. In: MICRO-54: 54th Annual IEEE/ACM International Symposium on Microarchitecture. p. 950–962. MICRO '21, Association for Computing Machinery, New York, NY, USA (2021). https://doi.org/10.1145/3466752.3480059
15. Davis, M.G., Smith, E., Tudor, A., Sen, K., Siddiqi, I., Iancu, C.: Towards optimal topology aware quantum circuit synthesis. 2020 IEEE International Conference on Quantum Computing and Engineering (QCE) pp. 223–234 (2020)
16. Farhi, E., Goldstone, J., Gutmann, S.: A quantum approximate optimization algorithm (2014). https://doi.org/10.48550/ARXIV.1411.4028, https://arxiv.org/abs/1411.4028
17. Feynman, R.P.: Simulating physics with computers. International Journal of Theoretical Physics 21(6), 467–488 (Jun 1982). https://doi.org/10.1007/BF02650179
18. Gokhale, P., Javadi-Abhari, A., Earnest, N., Shi, Y., Chong, F.T.: Optimized quantum compilation for near-term algorithms with openpulse. In: 2020 53rd Annual IEEE/ACM International Symposium on Microarchitecture (MICRO). pp. 186–200 (2020). https://doi.org/10.1109/MICRO50266.2020.00027
19. Gottesman, D.: An introduction to quantum error correction and fault-tolerant quantum computation (05 2009). https://doi.org/10.1090/psapm/068/2762145
20. Grover, L.K.: A fast quantum mechanical algorithm for database search (1996). https://doi.org/10.48550/ARXIV.QUANT-PH/9605043, https://arxiv.org/abs/quant-ph/9605043
21. Grover, L.K.: A fast quantum mechanical algorithm for database search. In: Theory of computing. pp. 212–219 (1996)
22. Harrow, A.W., Hassidim, A., Lloyd, S.: Quantum algorithm for linear systems of equations. Phys. Rev. Lett. 103, 150502 (Oct 2009). https://doi.org/10.1103/PhysRevLett.103.150502
23. Hietala, K., Rand, R., Hung, S.H., Wu, X., Hicks, M.: A verified optimizer for quantum circuits. Proc. ACM Program. Lang. 5(POPL) (Jan 2021). https://doi.org/10.1145/3434318
24. Ibrahim, M., Mohammadbagherpoor, H., Rios, C., Bronn, N.T., Byrd, G.T.: Pulse-level optimization of parameterized quantum circuits for variational quantum algorithms (2022). https://doi.org/10.48550/ARXIV.2211.00350, https://arxiv.org/abs/2211.00350
25. Kirmani, A., Bull, K., Hou, C.Y., Saravanan, V., Saeed, S.M., Papić, Z., Rahmani, A., Ghaemi, P.: Probing geometric excitations of fractional quantum hall states on quantum computers. Phys. Rev. Lett. 129, 056801 (Jul 2022). https://doi.org/10.1103/PhysRevLett.129.056801
26. Kusyk, J., Saeed, S.M., Uyar, M.U.: Survey on quantum circuit compilation for noisy intermediate-scale quantum computers: Artificial intelligence to heuristics. IEEE Transactions on Quantum Engineering 2, 1–16 (2021)
27. Liu, J., Zhou, H.: Reliability modeling of nisq-era quantum computers. In: IEEE International Symposium on Workload Characterization (IISWC). pp. 94–105 (2020). https://doi.org/10.1109/IISWC50251.2020.00018
28. Magesan, E., Gambetta, J.M., Emerson, J.: Scalable and robust randomized benchmarking of quantum processes. Physical review letters 106(18), 180504 (2011)
29. Magesan, E., Gambetta, J.M., Emerson, J.: Characterizing quantum gates via randomized benchmarking. Phys. Rev. A 85, 042311 (Apr 2012). https://doi.org/10.1103/PhysRevA.85.042311
30. Maslov, D., Dueck, G., Miller, D.: Toffoli network synthesis with templates. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems 24(6), 807–817 (2005). https://doi.org/10.1109/TCAD.2005.847911
31. McKay, D.C., Wood, C.J., Sheldon, S., Chow, J.M., Gambetta, J.M.: Efficient z gates for quantum computing. Physical Review A 96(2), 022330 (2017)
32. Merkel, S.T., Gambetta, J.M., Smolin, J.A., Poletto, S., Córcoles, A.D., Johnson, B.R., Ryan, C.A., Steffen, M.: Self-consistent quantum process tomography. Physical Review A 87(6), 062119 (2013)
33. Motzoi, F., Gambetta, J.M., Rebentrost, P., Wilhelm, F.K.: Simple pulses for elimination of leakage in weakly nonlinear qubits. Physical review letters 103(11), 110501 (2009)
34. Murali, P., Baker, J.M., Javadi-Abhari, A., Chong, F.T., Martonosi, M.: Noise-adaptive compiler mappings for noisy intermediate-scale quantum computers. In: Proceedings of ASPLOS. pp. 1015–1029. ACM (2019)
35. Nielsen, M.A., Chuang, I.L.: Quantum computation and quantum information. Cambridge University Press (2019)
36. Nielsen, M., Chuang, I.: Quantum Computation and Quantum Information. Cambridge Univ. Press (2000)
37. Nishio, S., Pan, Y., Satoh, T., Amano, H., Meter, R.V.: Extracting success from ibm's 20-qubit machines using error-aware compilation. J. Emerg. Technol. Comput. Syst. 16(3) (May 2020). https://doi.org/10.1145/3386162
38. Peruzzo, A., McClean, J., Shadbolt, P., Yung, M.H., Zhou, X.Q., Love, P.J., Aspuru-Guzik, A., O'Brien, J.L.: A variational eigenvalue solver on a photonic quantum processor. Nature Communications 5(1), 4213 (Jul 2014). https://doi.org/10.1038/ncomms5213
39. Proctor, T.J., Carignan-Dugas, A., Rudinger, K., Nielsen, E., Blume-Kohout, R., Young, K.: Direct randomized benchmarking for multiqubit devices. Phys. Rev. Lett. 123, 030503 (Jul 2019). https://doi.org/10.1103/PhysRevLett.123.030503
40. Ravi, G.S., Smith, K.N., Gokhale, P., Mari, A., Earnest, N., Javadi-Abhari, A., Chong, F.T.: Vaqem: A variational approach to quantum error mitigation. arXiv preprint arXiv:2112.05821 (2021)
41. Saravanan, V., Saeed, S.M.: Data-driven reliability models of quantum circuit: From traditional ml to graph neural network. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems pp. 1–1 (2022). https://doi.org/10.1109/TCAD.2022.3202430
42. Scholl, P., Schuler, M., Williams, H.J., Eberharter, A.A., Barredo, D., Schymik, K.N., Lienhard, V., Henry, L.P., Lang, T.C., Lahaye, T., et al.: Quantum simulation of 2d antiferromagnets with hundreds of rydberg atoms. Nature 595(7866), 233–238 (2021)
43. Self, C.N., Khosla, K.E., Smith, A.W., Sauvage, F., Haynes, P.D., Knolle, J., Mintert, F., Kim, M.: Variational quantum algorithm with information sharing. npj Quantum Information 7(1), 1–7 (2021)
44. Servanan, V., Saeed, S.M.: Graph neural networks for idling error mitigation. In: Proceedings of the 41st IEEE/ACM International Conference on Computer-Aided Design. ICCAD '22, Association for Computing Machinery, New York, NY, USA (2022). https://doi.org/10.1145/3508352.3549444
45. Sheldon, S., Magesan, E., Chow, J.M., Gambetta, J.M.: Procedure for systematically tuning up cross-talk in the cross-resonance gate. Physical Review A 93(6), 060302 (2016)
46. Shor, P.W.: Polynomial-time algorithms for prime factorization and discrete logarithms on a quantum computer. SIAM Journal on Computing 26(5), 1484–1509 (1997). https://doi.org/10.1137/S0097539795293172
47. Smith, K.N., Ravi, G.S., Murali, P., Baker, J.M., Earnest, N., Javadi-Abhari, A., Chong, F.T.: Error mitigation in quantum computers through instruction scheduling (2021)
48. Tannu, S.S., Qureshi, M.K.: Not all qubits are created equal: A case for variability-aware policies for nisq-era quantum computers. In: Proceedings of ASPLOS. pp. 987–999. ACM (2019)
49. Tomesh, T., Gokhale, P., Omole, V., Ravi, G.S., Smith, K.N., Viszlai, J., Wu, X.C., Hardavellas, N., Martonosi, M.R., Chong, F.T.: Supermarq: A scalable quantum benchmark suite. In: IEEE International Symposium on High-Performance Computer Architecture (HPCA). pp. 587–603 (2022)
50. Saravanan, V., Saeed, S.M.: Test data-driven machine learning models for reliable quantum circuit output. In: IEEE European Test Symposium (ETS). pp. 1–6 (2021)
51. Saravanan, V., Saeed, S.M.: Machine learning for quantum hardware performance assessment. In: 2022 IEEE 40th International Conference on Computer Design (ICCD). pp. 1–7 (2022). https://doi.org/10.1109/ICCD56317.2022.00030
52. Wang, H., Ding, Y., Gu, J., Li, Z., Lin, Y., Pan, D.Z., Chong, F.T., Han, S.: Quantumnas: Noise-adaptive search for robust quantum circuits (2021). https://doi.org/10.48550/ARXIV.2107.10845, https://arxiv.org/abs/2107.10845
53. Wang, S., Fontana, E., Cerezo, M., Sharma, K., Sone, A., Cincio, L., Coles, P.J.: Noise-induced barren plateaus in variational quantum algorithms. Nature communications 12(1), 1–11 (2021)
54. Xie, L., Zhai, J., Zhang, Z., Allcock, J., Zhang, S., Zheng, Y.C.: Suppressing zz crosstalk of quantum computers through pulse and scheduling co-optimization. In: Proceedings of the 27th ACM International Conference on Architectural Support for Programming Languages and Operating Systems. p. 499–513. ASPLOS '22, Association for Computing Machinery, New York, NY, USA (2022). https://doi.org/10.1145/3503222.3507761
55. Younis, E., Sen, K., Yelick, K., Iancu, C.: Qfast: Conflating search and numerical optimization for scalable quantum circuit synthesis. In: 2021 IEEE International Conference on Quantum Computing and Engineering (QCE). pp. 232–243. IEEE (2021)
56. Zhang, J., Pagano, G., Hess, P.W., Kyprianidis, A., Becker, P., Kaplan, H., Gorshkov, A.V., Gong, Z.X., Monroe, C.: Observation of a many-body dynamical phase transition with a 53-qubit quantum simulator. Nature 551(7682), 601–604 (2017)
57. Zhong, H.S., Wang, H., Deng, Y.H., Chen, M.C., Peng, L.C., Luo, Y.H., Qin, J., Wu, D., Ding, X., Hu, Y., et al.: Quantum computational advantage using photons. Science 370(6523), 1460–1463 (2020)

Queuing Theory Models for (Fault-Tolerant) Quantum Circuits: Analysis and Optimization

Robert Basmadjian and Alexandru Paler

1 Introduction: Surface Code Assemblies

The efficient quantum compilation of surface code logical circuits is an open problem. Quantum compilers, such as [1, 2], are used to protect a quantum circuit with the surface code. For a complete introduction to surface codes, we recommend [3]. For the purpose of this chapter, we consider that all circuits are error-corrected. The compilation workflow is a sequence of steps in which the original circuit is gradually transformed into a set of lattice surgery instructions, and finally the instructions are compiled into a structure called a topological assembly. There are multiple formalisms for compiling quantum circuits to lattice surgery, such as Clifford+T [4] and rotated Pauli strings [2], but all of them require distilled magic states to implement the non-Clifford part of the computation. A distillery is the region inside the assembly where distillation procedures are executed. Distillations are probabilistic and may need to be repeated to output a distilled state. We start by formulating the optimization problem statement (circuit footprint and depth optimizations). Afterward, the Background section introduces the theory of queuing systems and networks necessary for tackling the optimization problem. The Results section discusses the results obtained by using the networks to optimize quantum circuits. Parts of this chapter were presented in a different form in [5, 6].

R. Basmadjian, Clausthal University of Technology, Clausthal-Zellerfeld, Germany
A. Paler, Aalto University, Espoo, Finland
© The Author(s), under exclusive license to Springer Nature Switzerland AG 2024. H. Thapliyal, T. Humble (eds.), Quantum Computing, https://doi.org/10.1007/978-3-031-37966-6_8


Fig. 1 Lattice surgery: (a) Colored patches are logical qubits placed on a rectangular grid (e.g., $3 \times 6$). Unused patches are white and can be used at later stages of the computation to implement gates or to store distilled states. The magenta patch marked by $|m\rangle$ is, for example, a distilled state that is stored in a queue. (b) A gate is implemented by routing (green) along unused patches and consuming the magenta patch; in this example, this effectively reduces the occupation of the queue by one

1.1 Footprint Analysis

The assembly is a space-time volume representation of the original circuit. The space of the assembly is the footprint of the grid layout on which the error-corrected qubits are placed in the form of patches (see Fig. 1). The size of the footprint is the number of patches that are necessary to implement the computation. The time refers to the number of times the layout is updated according to the sequence of lattice surgery instructions. For example, in Fig. 1 there are two time steps and 18 patches, so the space-time volume of the corresponding assembly is $2 \times 18 = 36$. At most nine patches of the footprint are used, and the six patches on the bottom row can be removed from the layout only if they are never used later. In Fig. 1 the magenta patch stores a distilled state. The state was output by a distillation sub-circuit (distillery) that occupies multiple patches of the layout. The distillery is not illustrated in Fig. 1. We call the region where distilled states are stored the queue. For the example of Fig. 1, the queue is the entire region of white patches. A state is stored in a queue by moving it along the layout with an operation that looks similar to the green route in Fig. 1.

Footprint analysis refers to the procedures necessary for optimizing the footprint of an assembly. The footprint is a direct measure of how many physical (hardware) qubits are necessary for implementing the assembly. For a single error-corrected qubit, the surface code is implemented by a two-dimensional lattice of physical qubits, and the code distance is reflected in the effective dimensions of the lattice. For a lattice surgery computation, each patch represents a logical qubit, such that the number of patches in the assembly's layout indicates how many physical qubits are required for executing the assembly. Consequently, the footprint dimensions of the space-time volume are important because of this direct relation to the number of physical qubits, and physical qubits are a very restricted computational resource.
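The arithmetic behind the Fig. 1 example, and the link between footprint and hardware cost, can be written down as a short sketch. The space-time volume follows the text directly. The physical-qubit estimate, in contrast, is an assumption that is not stated in the chapter: it uses the common rotated surface code count of $2d^2 - 1$ physical qubits per patch of distance $d$ and ignores any routing or spacing overhead. Both function names are illustrative.

```python
def spacetime_volume(rows, cols, time_steps):
    """Space-time volume of an assembly: number of patches times number of time steps."""
    return rows * cols * time_steps

def physical_qubits(num_patches, distance):
    """Rough hardware cost of a footprint.

    Assumption (not from the chapter): each patch is a rotated surface code
    patch with distance**2 data qubits and distance**2 - 1 measurement qubits.
    """
    return num_patches * (2 * distance ** 2 - 1)

# Fig. 1 example: a 3 x 6 layout updated over two time steps
fig1_volume = spacetime_volume(rows=3, cols=6, time_steps=2)

# Effect of dropping the six bottom-row patches, for an illustrative distance of 15
full_layout = physical_qubits(num_patches=18, distance=15)
reduced_layout = physical_qubits(num_patches=12, distance=15)
```

Under these assumptions, removing one row of six patches saves a third of the physical qubits, which is why the footprint is treated as an optimization target in its own right.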


1.2 Problem Statement: Quantum Circuit and Assembly Optimisation

The quantum circuit compilation problem, from the perspective of lattice surgery optimization, is to determine the minimum space-time volume necessary for implementing a given surface code computation. The volume of the assembly is important for computing the number of physical qubits (a function of the layout's footprint) and the execution time (a function of depth) of the error-corrected computation. Efficient compilation is concerned with reducing the number of patches necessary and the number of gates/time steps. We consider two variants of the optimization problem:

• depth: reduce the number of SWAP gates necessary for implementing a quantum circuit that is not compatible with the topology of the underlying hardware;
• footprint: optimize the footprint of the distilleries used in the surface code computations (i.e., the hardware storing the distilled states necessary for T gates).

2 Background: Queuing Systems, Networks and Models

Queuing theory is one of the mathematical foundations for analyzing the performance of classical computing systems. Its basic concepts go back to the work of Agner Erlang in 1909. The theory advanced between the 1950s and the late 1970s, and its application to computer science problems was realized in the 1990s [7]. In the last decade, it was used to study the trade-off between performance and energy of the servers in a data center [8–14]. Recently, it has also been proposed within the context of quantum computing to analyze and study quantum circuit optimization [6, 15].

2.1 Queuing System Models

In its basic form, a queuing system can be described by the following four parameters: A, S, m, and discipline. The first two parameters describe the arrival and servicing processes, where $\lambda$ and $\mu$ denote the arrival and service rates (i.e., the number of arrival or servicing events per unit time), respectively. The third parameter is the number of servers (i.e., the processors) of the queuing system. The last parameter gives the queuing discipline, whose default is First-Come-First-Serve (FCFS). In most modeling cases, it is reasonable to consider that the arrival and servicing processes are Poisson and hence satisfy the Markovian property. Furthermore, one can specify the buffer size, which can be either infinite or of limited capacity. In case the queue is full, a newly arriving job is either lost or the sender is blocked. In our work, we consider the latter case, and the details of the modeling are given in Sect. 2.2.

2.1.1 Single Server with Finite and Infinite Capacity

In Kendall's notation, such a basic queuing system with one server and a finite capacity of 3 is written as M/M/1/3-FCFS. Figure 2 gives a graphical representation of the corresponding queuing system. Whenever the server (i.e., the processor) is busy processing a job (in our case, the job annotated by 1), any arriving job cannot be served and consequently needs to be stored inside the queue (jobs annotated by 2 and 3). Once the current job has been processed by the server, it leaves the system, and the job that has been waiting the longest (i.e., the first to arrive, Job 2) enters the server for servicing.

Fig. 2 Graphical presentation of an M/M/1/3-FCFS queuing system of three jobs, one being processed by the server and the other two waiting inside the queue of finite size

Fig. 3 Markov Chain (MC) of the M/M/1/3-FCFS queuing system with arrival and service rates of $\lambda$ and $\mu$, respectively. The numbers inside the circles denote the number of jobs in the system. In general, a system starts with 0 jobs and, as more jobs arrive, it jumps from one state to the next in increments of one. For the M/M/1/3-FCFS queuing system, the maximum number of jobs that it can hold is 3, and thus the MC stops at state 3


The high-level description model presented in Fig. 2 is then transformed into a low-level computational model by considering Markov Chains. Figure 3 illustrates the continuous-time Markov Chain (CTMC) of the above-mentioned queuing system. After obtaining the underlying CTMC, the main objective is to derive the steady-state probability vector $\pi = \{\pi_0, \pi_1, \pi_2, \ldots, \pi_i\}$ (in our case $i = 3$). This is obtained by performing a Markovian analysis (i.e., a closed-form solution) in which we solve $\pi Q = 0$ subject to the normalization $\sum_i \pi_i = 1$, where $Q$ is the generator matrix obtained from the derived CTMC. As soon as the steady-state probabilities are computed, all the relevant performance metrics can be derived from them, such as the utilization $\rho = 1 - \pi_0$ and the mean number of jobs in the system $\bar{K} = \sum_{i=0}^{k} \pi_i \cdot i$, where $k$ denotes the total number of jobs in the queuing system (for the above case $k = 3$). Furthermore, Little’s law can be used to derive the mean queue length $\bar{Q}$, the mean waiting time $\bar{W}$, etc.

It was shown in the literature for a single-server queuing system of finite capacity $K$, M/M/1/K (e.g., M/M/1/3-FCFS), that when $a = \frac{\lambda}{\mu} \neq 1$:

$$\pi_i = \frac{(1 - a)\,a^i}{1 - a^{K+1}} \ \text{for } 0 \le i \le K, \quad \text{and } \pi_i = 0 \text{ otherwise} \tag{1}$$

and when $a = \frac{\lambda}{\mu} = 1$:

$$\pi_0 = \frac{1}{K + 1} = \pi_i \ \text{for } i = 1, \ldots, K. \tag{2}$$

Furthermore, in the case of an infinite-capacity single-server queuing system, M/M/1-FCFS, it was shown in the literature that:

$$\pi_i = (1 - a)\,a^i \ \text{for } 0 \le i < \infty \tag{3}$$

where a has the same definition as in Eqs. (1) and (2).
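As an illustration, the closed forms of Eqs. (1) and (2) and the two metrics named above (utilization and mean number of jobs) can be evaluated in a few lines. This is a minimal sketch: the function names and the example rates are assumptions chosen for illustration, not values from the chapter.

```python
def mm1k_steady_state(lam, mu, K):
    """Steady-state probabilities pi_0..pi_K of an M/M/1/K queue (Eqs. (1)-(2))."""
    a = lam / mu
    if abs(a - 1.0) < 1e-12:
        # Eq. (2): all K + 1 states are equally likely when a = 1
        return [1.0 / (K + 1)] * (K + 1)
    # Eq. (1): truncated geometric distribution
    norm = (1.0 - a) / (1.0 - a ** (K + 1))
    return [norm * a ** i for i in range(K + 1)]


def utilization(pi):
    """rho = 1 - pi_0: fraction of time the server is busy."""
    return 1.0 - pi[0]


def mean_jobs(pi):
    """K_bar = sum_i i * pi_i: mean number of jobs in the system."""
    return sum(i * p for i, p in enumerate(pi))


# Hypothetical example: M/M/1/3 with lambda = 1 and mu = 2
pi = mm1k_steady_state(lam=1.0, mu=2.0, K=3)
```

For the M/M/1/3 example, `pi` has four entries that sum to one, and `utilization(pi)` and `mean_jobs(pi)` give the two metrics directly.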

2.1.2 Multi Server

The above single-server model can be extended to multiple servers for both finite (M/M/m/K) and infinite (M/M/m) capacities. The servers can all have the same service rate $\mu$, in which case the system is called homogeneous, or each server can have a different service rate, in which case the queuing system is called heterogeneous. Figure 4 shows a graphical presentation of a homogeneous M/M/2/3-FCFS queuing system, whose underlying low-level CTMC is given in Fig. 5. By carrying out a Markovian analysis on a homogeneous M/M/m/K-FCFS system, it was shown in the literature that, with $\rho = \frac{\lambda}{m\mu}$:


R. Basmadjian and A. Paler

Fig. 4 Graphical presentation of a homogeneous M/M/2/3-FCFS queuing system of three jobs, two being processed in parallel by the two servers and the third waiting inside the queue of finite size

Fig. 5 Markov Chain (MC) of the M/M/2/3-FCFS queuing system with arrival and service rates of $\lambda$ and $\mu$ respectively. The numbers inside the circles denote the number of jobs in the system. In general, a system starts with 0 jobs and, as more jobs arrive, it moves from one state to the next in increments of one. For the M/M/2/3-FCFS queuing system, the maximum number of jobs is 3, and thus the MC stops at state 3

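$$\pi_i = \pi_0 \prod_{j=0}^{i-1} \frac{\lambda_j}{\mu_{j+1}} \tag{4}$$

where $\pi_0 = \dfrac{1}{1 + \sum_{i=1}^{\infty} \prod_{j=0}^{i-1} \frac{\lambda_j}{\mu_{j+1}}}$. This leads to:

$$\pi_i = \frac{(m\rho)^i}{i!}\,\pi_0 \ \text{for } 0 < i < m \tag{5}$$

and

$$\pi_i = \frac{m^m \rho^i}{m!}\,\pi_0 \ \text{for } m \le i \le K. \tag{6}$$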

Note that the steady-state probabilities of Eqs. (5) and (6) also hold for multi-server queuing systems with infinite capacity. However, the range of $i$ in Eq. (6) then does not stop at $K$ but goes to infinity (i.e., $i \ge m$).
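The multi-server formulas can be sketched in the same way. The snippet below evaluates the unnormalized weights of Eqs. (5) and (6) for a finite-capacity M/M/m/K system and obtains $\pi_0$ by normalizing over the states $0, \ldots, K$. The finite normalization, the function name, and the example parameters are assumptions made for illustration.

```python
from math import factorial


def mmmk_steady_state(lam, mu, m, K):
    """Steady-state probabilities pi_0..pi_K of a homogeneous M/M/m/K queue."""
    rho = lam / (m * mu)
    weights = []
    for i in range(K + 1):
        if i < m:
            # Eq. (5): fewer jobs than servers, every job is in service
            weights.append((m * rho) ** i / factorial(i))
        else:
            # Eq. (6): all m servers busy, remaining jobs wait in the queue
            weights.append(m ** m * rho ** i / factorial(m))
    pi0 = 1.0 / sum(weights)  # normalization over the finite state space
    return [pi0 * w for w in weights]


# Hypothetical example: M/M/2/3 with lambda = mu = 1
pi = mmmk_steady_state(lam=1.0, mu=1.0, m=2, K=3)
```

With $m = 1$ the weights reduce to $a^i$, so the result coincides with the single-server case of Eq. (1).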

2.2 Queuing Networks

A queuing network model is a stochastic approach that analyzes the performance of a network consisting of more than one queuing system. In its basic form, in addition to the four model parameters of queuing systems given in Sect. 2.1 (i.e., A, S, m, and the queuing discipline), two further parameters must be specified: the number of nodes N (i.e., queuing systems) and the routing probabilities $p_{ij}$. The latter give the probability that a job, after leaving a node i, enters the next node j, such that the probabilities over all next nodes sum to one. For this reason, the queuing network model cannot be used to analyze systems where a job is processed in parallel by two or more queuing systems.

2.2.1 Description and Model Parameters

There are three types of networks: open, closed, and hybrid. In open networks, jobs arrive from outside the network at any node i with an arrival rate of $\lambda_{0,i}$ and probability $p_{0,i}$, and leave the network at any node i with a probability of $p_{i,0} = 1 - \sum_{j=1}^{N} p_{ij}$ (e.g., in Fig. 6, $p_{5,0} = 1 - \sum_{j=1}^{N} p_{5,j} = 1$). In closed networks, jobs neither arrive from outside nor leave the network but rather circulate inside it, and the total number of jobs K is assumed to be constant. A hybrid network is a mixture of open and closed networks. Figure 6 gives an example of an open network composed of 5 nodes with finite capacity. The routing probabilities $p_{ij}$ are also given. For instance, a job leaving Node 1 goes to Node 2, Node 3, or Node 4 with probabilities of 65%, 5%, and 30% respectively. Note that the routing probabilities need to sum up to 1. In the beginning, only the arrival rate $\lambda_{0,i}$ at which jobs enter the queuing network at Node i is given. However, this information is enough to derive the arrival rates of all other nodes in the network using the following traffic equation:

$$\lambda_i = \lambda_{0,i} + \sum_{j=1}^{N} p_{j,i}\,\lambda_j \tag{7}$$

where the above equation leads to a set of N linear equations with N unknowns. Analyzing the performance of such a queuing network can be realized by performing a Markovian analysis in the same way as explained above. To this end, the underlying Markov Chain needs to be derived; however, due to the state-space explosion problem (i.e., the number of states increases exponentially with the number of nodes and jobs), such an analysis quickly becomes intractable.

Fig. 6 An example of an open queuing network consisting of 5 nodes, each of which is a finite capacity queuing system. The routing probabilities $p_{ij}$ are given on the links between the nodes i and j such that they sum up to one. The jobs enter and leave the network at Node 1 and Node 5 respectively

To remedy this issue, the product form queuing network (PFQN) approach can be used. Its main advantage is that a closed-form solution can be obtained without deriving the Markov Chain of the whole network. The major challenge in PFQNs is thus to find fast and efficient algorithms that solve the traffic equations for all nodes of the network. However, to be able to use the PFQN approach, all the nodes of the network must be of one of the following types: M/M/m-FCFS, M/G/1-PS, M/G/1-IS, and M/G/1-LCFS PRS. In our case, all nodes are of the type M/M/m-FCFS. Consequently, we used the PFQN approach to derive the steady-state probabilities of the whole network. This approach states that such probabilities can be obtained from the marginal probabilities of each node as:

$$\pi(K_1, K_2, \ldots, K_N) = \pi_1(K_1) \cdot \pi_2(K_2) \cdot \ldots \cdot \pi_N(K_N) \tag{8}$$

More precisely, the above equation states that the steady-state probability of the network, given that Nodes 1, 2, ..., N have $K_1, K_2, \ldots, K_N$ jobs respectively, is obtained by multiplying the steady-state probability of Node 1 having $K_1$ jobs by that of Node 2 having $K_2$ jobs, and so on up to that of Node N having $K_N$ jobs. Note that those marginal probabilities can be obtained by Eqs. (1)–(5), or (6) depending on the specifics of the corresponding queuing system.
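The two steps described in this section, solving the traffic equations and multiplying the per-node marginals, can be sketched as follows. The sketch reads Eq. (7) in its usual form, where each node's arrival rate is its external arrival rate plus the flow routed to it from the other nodes; this reading is an assumption. The equations are solved by a simple fixed-point iteration rather than a dedicated linear solver, and the three-node network is hypothetical.

```python
def solve_traffic(external, routing, tol=1e-12, max_iter=10_000):
    """Fixed-point iteration for lambda_i = lambda_{0,i} + sum_j p_{j,i} * lambda_j.

    external[i]   : external arrival rate at node i
    routing[j][i] : probability that a job leaving node j goes to node i
    """
    n = len(external)
    lam = list(external)
    for _ in range(max_iter):
        new = [external[i] + sum(routing[j][i] * lam[j] for j in range(n))
               for i in range(n)]
        if max(abs(a - b) for a, b in zip(new, lam)) < tol:
            return new
        lam = new
    raise RuntimeError("traffic equations did not converge")


def product_form(marginals, state):
    """Eq. (8): joint probability as the product of the per-node marginals."""
    prob = 1.0
    for pi_node, k in zip(marginals, state):
        prob *= pi_node[k]
    return prob


# Hypothetical open network: jobs enter at node 0, split 60/40 between
# nodes 1 and 2, node 1 forwards everything to node 2, node 2 is the exit.
external = [1.0, 0.0, 0.0]
routing = [[0.0, 0.6, 0.4],
           [0.0, 0.0, 1.0],
           [0.0, 0.0, 0.0]]
rates = solve_traffic(external, routing)
```

The iteration converges for open networks in which every job eventually leaves; for larger networks a direct linear solve would be the more common choice.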

2.2.2 Modelling Blocking Queues

For cases where the network consists of nodes with limited capacity, such as the one given in Fig. 6, there are two modeling choices: either a job arriving at the recipient node finds no empty place in the queue and gets lost, or the sender of this job gets blocked until the recipient node frees some space (i.e., processes some of its jobs). In our work, we followed the second option since we cannot afford any loss of jobs (i.e., qubits in our case) while analyzing the whole network. Handling this becomes even more important when dealing with nodes with a capacity of 1, such as M/M/1/1-FCFS nodes. Such nodes arose while modeling and analyzing swap operations of multiplier circuits (see Sect. 3.2). The blocking aspect of the network needs to be modeled appropriately. To this end, for the above-mentioned M/M/1/1-FCFS queuing system, we consider that it can be in one of the following three states:

1. State (0,0): the corresponding queuing system is empty,
2. State (1,0): the corresponding system is busy with one job being processed by the single server,
3. State (0,1): the corresponding system is busy with one job that has been processed by the single server but, after completion, cannot move to the next node(s) in the network because its (their) capacity is full. Thus this node remains blocked.

Figure 7 gives the continuous-time Markov Chain (CTMC) for modeling the blocking aspect of the nodes. Each M/M/1/1-FCFS node starts empty (i.e., in state (0,0)). Upon arrival of a job with rate $\lambda_i$, the queuing system transits from state (0,0) to (1,0), indicating that the system now has 1 job to process. If, after completion, the job can leave this node and get to the next one(s) without being blocked, the system transits from state (1,0) to (0,0) (i.e., back to being empty) with the service rate of $\mu_i (1 - P_b)$.

Fig. 7 CTMC modeling the blocking aspect of each node i, having three different states (0,0), (1,0), and (0,1). Also shown are the arrival rate $\lambda_i$ and the service rates under non-blocking and blocking conditions, $\mu_i$ and $\mu_{ib}$ respectively. $P_b$ denotes the blocking probability

If, after completion, the job cannot leave the system because the next node is full, then the system transits from state (1,0) to (0,1) with the service rate of $\mu_i P_b$, where $P_b$ is the blocking probability, given by:

$$P_b = \sum_{j=1}^{d_i} p_{ij}\,\pi_j(1), \tag{9}$$

where $d_i$ is the degree of node i (i.e., the number of neighbors), $p_{ij}$ is the routing probability, and $\pi_j(1)$ is the steady-state probability that node j is full, which is given by Eqs. (1) or (2). Once adequate space is freed up in one of the next nodes, the job can finally leave, and the system transits from state (0,1) to (0,0) with the blocking service rate of $\mu_{ib}$. Note that each node has two different service rates, $\mu_i$ and $\mu_{ib}$. The former is an intrinsic characteristic of node i’s hardware, whereas the latter models the servicing delay caused by blocking.
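The three-state chain described above is small enough to solve by hand: balancing the flow out of (1,0) and (0,1) gives $\pi(1,0) = \pi(0,0)\,\lambda_i/\mu_i$ and $\pi(0,1) = \pi(0,0)\,\lambda_i P_b/\mu_{ib}$, and normalization fixes $\pi(0,0)$. The sketch below implements this derivation together with Eq. (9). It is our own reading of the transition rates stated in the text, not the authors' tool chain, and the function names are hypothetical. The example uses the parameters listed for Node 1 in Table 1.

```python
def blocking_probability(routing_probs, full_probs):
    """Eq. (9): P_b = sum_j p_ij * pi_j(full) over the neighbors j of node i."""
    return sum(p * f for p, f in zip(routing_probs, full_probs))


def blocking_node_steady_state(lam, mu, p_block, mu_block):
    """Steady state (pi(0,0), pi(1,0), pi(0,1)) of the three-state blocking CTMC.

    Rates: (0,0)->(1,0) at lam, (1,0)->(0,0) at mu*(1 - p_block),
           (1,0)->(0,1) at mu*p_block, (0,1)->(0,0) at mu_block.
    """
    w_busy = lam / mu                       # pi(1,0) / pi(0,0)
    w_blocked = lam * p_block / mu_block    # pi(0,1) / pi(0,0)
    pi_empty = 1.0 / (1.0 + w_busy + w_blocked)
    return pi_empty, pi_empty * w_busy, pi_empty * w_blocked


# Node 1 of Table 1: lambda = 0.94, mu = 1, P_b = 0.524, mu_b = 0.136
empty, busy, blocked = blocking_node_steady_state(0.94, 1.0, 0.524, 0.136)
```

For these inputs the three probabilities round to 0.18, 0.17, and 0.65, which matches the Node 1 row of Table 2.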

3 Results: Optimisation of Topological Assemblies

We present experimental results by applying queuing theory concepts for compiling and optimizing quantum circuits for addition and multiplication. We chose to model the quantum circuit compilation problem as a single-server queuing system. The parameters that govern this model influence the size of the layout (assembly footprint) as well as the scheduling of the gates. The latter influences the achievable parallelism and thus the depth of the assembly. In the following sections, we describe the model mapping and the model parameters. For example, the size of the queue influences the footprint (the queue occupies patches) as well as the depth (empty or full queues incur delays in the processing of gates). Optimum model parameter values are computed through numerical analysis. Figure 2 illustrates that when the single server is busy processing the different jobs, the further incoming jobs generated by the distilleries need to be stored in the queue. We consider that the single-server queuing system has a limited total capacity. In the previous example, this has a value of 3 jobs. We assume that when the system is full, any incoming jobs will be lost until some of the existing jobs are processed.

3.1 Footprint Optimization of Addition Circuits

Figure 8 shows the modeling of distillation of the adder circuit in terms of finite capacity queuing systems.

Fig. 8 Modelling the adder circuit as a queuing system of finite capacity. The upper figure is a 3D view of a topological assembly with two layers (compiled by Paler and Fowler [1]). The magenta, red, and yellow regions present distillations, data patches (in columns), and ancilla (which are between the data patches) respectively. The lower figure is a queuing system presentation of finite capacity (i.e., 3 in this case)

We assume that each job processed by the considered queuing system is a distilled T state. Also, the jobs are processed in a First-Come-First-Served fashion. We consider that jobs (i.e., distilled T states) are produced and consumed according to a Poisson process (i.e., the inter-event times are exponentially distributed and i.i.d.). Since the considered queuing system is of finite capacity, whenever the buffer is full, we assume that the production of new jobs by the distillery is stopped until additional space in the buffer becomes available. The topological assemblies are by design discrete with respect to time. Consequently, both time and state spaces are discrete, which leads to the consideration of discrete-time Markov chains (DTMCs) for closed-form analysis. The resulting DTMC (see Fig. 9) is ergodic because (1) it has a finite number of states N, (2) its states are pairwise reachable, and (3) the chain is aperiodic (i.e., no state is periodic). Ergodicity ensures that a unique solution for the steady-state probabilities always exists.
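To make the role of ergodicity concrete, the sketch below builds a small discrete-time birth-death chain for a buffer and finds its stationary distribution by repeatedly applying the transition matrix. The chain structure and the per-step production and consumption probabilities `p` and `q` are hypothetical; the actual DTMC of Fig. 9 is not reproduced here.

```python
def buffer_dtmc(capacity, p, q):
    """Transition matrix of a hypothetical buffer DTMC with states 0..capacity.

    Per time step: one T state is produced with probability p (unless full),
    one is consumed with probability q (unless empty), otherwise nothing changes.
    """
    if p < 0 or q < 0 or p + q > 1:
        raise ValueError("need p, q >= 0 and p + q <= 1")
    n = capacity + 1
    P = [[0.0] * n for _ in range(n)]
    for k in range(n):
        up = p if k < capacity else 0.0
        down = q if k > 0 else 0.0
        if k < capacity:
            P[k][k + 1] = up
        if k > 0:
            P[k][k - 1] = down
        P[k][k] = 1.0 - up - down  # self-loop keeps the chain aperiodic
    return P


def stationary(P, tol=1e-12, max_iter=100_000):
    """Power iteration pi <- pi P; converges to the unique pi for an ergodic chain."""
    n = len(P)
    pi = [1.0 / n] * n
    for _ in range(max_iter):
        new = [sum(pi[i] * P[i][j] for i in range(n)) for j in range(n)]
        if max(abs(a - b) for a, b in zip(new, pi)) < tol:
            return new
        pi = new
    raise RuntimeError("power iteration did not converge")


pi = stationary(buffer_dtmc(capacity=3, p=0.3, q=0.5))
```

Because the chain is finite, irreducible, and aperiodic, the iteration reaches the same distribution from any starting vector; for this birth-death structure the result is proportional to $(p/q)^k$.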


Fig. 9 The derived DTMC for the analysis of the buffer capacity for the use case of distilleries

Fig. 10 Modelling the multiplier circuit as an open queuing network consisting of 15 nodes. The left figure is the 3D qubit layout of the quantum circuit, having four queues (two grays, one yellow, and a red one) connected to the cuboid-like 3D layout; The right figure is the queuing network presentation of 15 nodes, out of which 4 have a finite capacity, whereas the others (only circles with numbers 1 till 11) have a buffer size of 0 (i.e., the capacity of 1). For some of the transitions, their corresponding routing probabilities pij are given. Jobs from outside arrive only at Nodes 12 and 13, and depart from Nodes 14 and 15

3.2 Circuit Depth Optimization of Multiplication Circuits

Figure 10 shows the mapping of a multiplication circuit into an open queuing network of 15 nodes. Four of those 15 nodes (i.e., Nodes 12–15) have a finite capacity with a fixed buffer size, whereas the other 11 have no buffer (i.e., a capacity of 1). The reason for this modeling decision is how the multiplier circuit operates, especially when swapping is involved. Since the physical wires can only process one qubit at a time, the inner nodes (i.e., Nodes 1 to 11) should not be equipped with any buffer and have a service rate of $\mu_i = 1$. The outer nodes are the ones where qubits either arrive or depart and no swap operations take place. Those nodes are therefore equipped with buffers of finite capacity.


Table 1 The parameter values to compute the steady-state probabilities

Node #   $\lambda_i$   $\mu_i$   $P_b$   $\mu_{ib}$
1        0.94          1         0.524   0.136
2        0.94          1         0.408   0.144
3        0.936         1         0.535   0.17
4        0.88          1         0.576   0.142
5        1.644         1         0.527   0.136
6        1.596         1         0.434   0.13
7        1.02          1         0.567   0.124
8        1.6           1         0.515   0.173
9        1.18          1         0.60    0.175
10       1.42          1         0.538   0.195
11       0.86          1         0.60    0.143

Table 2 The steady-state probabilities and the calculated performance metrics

Node #   $\pi_i(0,0)$   $\pi_i(1,0)$   $\pi_i(0,1)$   $\rho_i$   $\bar{K}_i$   $\bar{T}_i$
1        0.18           0.17           0.65           0.82       0.82          0.87
2        0.18           0.17           0.65           0.82       0.82          0.87
3        0.2            0.18           0.62           0.8        0.8           0.85
4        0.23           0.2            0.57           0.72       0.77          0.88
5        0.13           0.21           0.66           0.87       0.87          0.53
6        0.11           0.18           0.71           0.89       0.89          0.56
7        0.15           0.15           0.70           0.85       0.85          0.83
8        0.14           0.21           0.65           0.86       0.86          0.54
9        0.16           0.19           0.65           0.84       0.84          0.71
10       0.16           0.22           0.62           0.84       0.84          0.59
11       0.18           0.16           0.66           0.82       0.82          0.95

The first value that needs to be computed before carrying out the closed-form analysis is the arrival rate $\lambda$ of the open queuing network. This is derived by adding the arrival rates of Nodes 12 and 13. After systematic analysis of the multiplier circuit’s depth, we concluded that $\lambda_{12} = 0.15$ and $\lambda_{13} = 0.1$, which leads to an overall arrival rate of $\lambda = 0.25$. Once this is determined, Eq. (7) gives the arrival rate of each node i, as listed in Table 1. Note that we assume the routing probabilities of each node to be equally distributed among its neighbors (i.e., $d_i$) such that $p_{ij} = \frac{1}{d_i}$. For instance, for Node 1, which has 3 neighbors, we assume $p_{12} = p_{14} = p_{15} = \frac{1}{3}$. Afterward, Eq. (1) gives the probability that a given queuing system of finite capacity is full. Building on those probabilities, it is then possible to compute the blocking probability at each node i using Eq. (9) (i.e., the $P_b$ values in Table 1). The product form queuing network (PFQN) approach of Sect. 2.2 is used to calculate the steady-state probability of the intermediate network using the marginal probabilities of each node in that network (see Eq. (8)). We used the SHARPE modeling tool [16] to calculate the steady-state probabilities of each node i, using the parameters configured in Table 1. Table 2 summarizes the obtained closed-form performance metrics of each node i of the intermediate network (i.e., Nodes 1 to 11). The utilization is $\rho_i = 1 - \pi_i(0, 0)$, and the mean number of jobs $\bar{K}_i$ is the sum of the steady-state probabilities of (1,0) and (0,1). It can be noticed that almost all the nodes are busy, with an average utilization of 80%, whereas the jobs were blocked between 57% and 71% of the time. Little’s law [17] is used to derive the mean response time $\bar{T}_i$, this being the ratio of the mean number of jobs $\bar{K}_i$ to the arrival rate. The mean number of jobs in the network is $\bar{K} = \frac{\sum_{i=1}^{11} \bar{K}_i}{11} = 0.83$. Thus, the mean response time of the whole network is computed again using Little’s law as $\bar{T} = \frac{\bar{K}}{\lambda} = 3.33$. Consequently, this theoretical lower bound of 3 confirms the practically obtained depth of 5 SWAP circuits for the case of multipliers.
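The final step of this calculation is plain arithmetic and can be checked directly. The sketch below averages the per-node mean number of jobs $\bar{K}_i$ reported in Table 2 and applies Little's law with the overall arrival rate $\lambda = \lambda_{12} + \lambda_{13}$; the function name is hypothetical.

```python
# Mean number of jobs per inner node (Nodes 1..11), as reported in Table 2
mean_jobs_per_node = [0.82, 0.82, 0.80, 0.77, 0.87, 0.89,
                      0.85, 0.86, 0.84, 0.84, 0.82]

# Overall arrival rate of the open network: lambda_12 + lambda_13
arrival_rate = 0.15 + 0.10


def network_response_time(mean_jobs, lam):
    """Average K_bar over the nodes, then Little's law: T_bar = K_bar / lambda."""
    k_bar = sum(mean_jobs) / len(mean_jobs)
    return k_bar, k_bar / lam


k_bar, t_bar = network_response_time(mean_jobs_per_node, arrival_rate)
```

The average rounds to the reported 0.83, and the resulting response time is slightly above 3.3, in line with the value quoted in the text up to rounding.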

4 Conclusion

The execution of large-scale quantum computations requires error-correcting codes, and the compilation of such codes has to be as efficient as possible. This is because quantum hardware is a scarce computational resource. In this chapter, we discussed how queuing systems and networks can be applied to the optimization of quantum circuits protected by the surface code. We focused on the lattice surgery implementation of this code and optimized the footprint and depth for arithmetic circuits. Our results are promising with respect to the achieved optimization goals and future work will focus on increasing the scale at which these methods are applied.

References

1. A. Paler, A.G. Fowler, in 2020 IEEE Globecom Workshops (GC Wkshps) (IEEE, 2020), pp. 1–4
2. G. Watkins, H.M. Nguyen, V. Seshadri, K. Watkins, S. Pearce, H.K. Lau, A. Paler, arXiv preprint arXiv:2302.02459 (2023)
3. A.G. Fowler, M. Mariantoni, J.M. Martinis, A.N. Cleland, Physical Review A 86(3), 032324 (2012)
4. A. Paler, I. Polian, K. Nemoto, S.J. Devitt, Quantum Science and Technology 2(2), 025003 (2017)
5. A. Paler, R. Basmadjian, in 2019 IEEE/ACM International Symposium on Nanoscale Architectures (NANOARCH) (IEEE, 2019), pp. 1–5
6. E.E. Dobbs, R. Basmadjian, A. Paler, J.S. Friedman, in Reversible Computation, ed. by S. Yamashita, T. Yokoyama (Springer International Publishing, Cham, 2021), pp. 256–265
7. K.S. Trivedi, Probability & statistics with reliability, queuing and computer science applications (John Wiley & Sons, 2008)
8. A. Rumyantsev, R. Basmadjian, S. Astafiev, A. Golovin, in Proceedings of the Thirteenth ACM International Conference on Future Energy Systems (Association for Computing Machinery, New York, NY, USA, 2022), e-Energy ’22, p. 581–586. DOI https://doi.org/10.1145/3538637.3539655
9. A. Rumyantsev, R. Basmadjian, S. Astafiev, A. Golovin, Annals of Operations Research (2022). DOI https://doi.org/10.1007/s10479-022-04830-0
10. R. Basmadjian, H. de Meer, in Proceedings of the Ninth International Conference on Future Energy Systems (Association for Computing Machinery, New York, NY, USA, 2018), e-Energy ’18, p. 519–525. DOI https://doi.org/10.1145/3208903.3213778
11. R. Basmadjian, F. Niedermeier, H. de Meer, (Association for Computing Machinery, New York, NY, USA, 2016), e-Energy ’16. DOI https://doi.org/10.1145/2934328.2934342
12. P.J. Kuehn, M. Mashaly, in 2019 15th Annual Conference on Wireless On-demand Network Systems and Services (WONS) (2019), pp. 91–98. DOI https://doi.org/10.23919/WONS.2019.8795470
13. F. Mantovani, M. Garcia-Gasulla, J. Gracia, E. Stafford, F. Banchelli, M. Josep-Fabrego, J. Criado-Ledesma, M. Nachtmann, Future Generation Computer Systems 112, 800 (2020). DOI https://doi.org/10.1016/j.future.2020.06.033. URL https://www.sciencedirect.com/science/article/pii/S0167739X19309781
14. J. Criado, M. Garcia-Gasulla, P. Kumbhar, O. Awile, I. Magkanaris, F. Mantovani, in 2020 IEEE International Conference on Cluster Computing (CLUSTER) (2020), pp. 540–548. DOI https://doi.org/10.1109/CLUSTER49012.2020.00077
15. A. Paler, R. Basmadjian, in 2019 IEEE/ACM International Symposium on Nanoscale Architectures (NANOARCH) (2019), pp. 1–5. DOI https://doi.org/10.1109/NANOARCH47378.2019.181305
16. K.S. Trivedi, R. Sahner, SIGMETRICS Perform. Eval. Rev. 36(4), 52–57 (2009). DOI https://doi.org/10.1145/1530873.1530884. URL https://doi.org/10.1145/1530873.1530884
17. R. Basmadjian, F. Niedermeier, H. de Meer, in Proceedings of the Seventh International Conference on Future Energy Systems (Association for Computing Machinery, New York, NY, USA, 2016), e-Energy ’16. DOI https://doi.org/10.1145/2934328.2934342. URL https://doi.org/10.1145/2934328.2934342

Quantum Annealing for Real-World Machine Learning Applications

Rajdeep Kumar Nath, Himanshu Thapliyal, and Travis S. Humble

1 Introduction

Machine learning techniques to explore and harness the power of data have found application in health care, finance, autonomous driving, security, etc., and continue to deepen their impact every day [9, 19, 20]. However, this technological boom is accompanied by unprecedented challenges. These challenges arise from several factors such as the scale of data being generated, hardware limitations, computational complexity, and cost [45]. Although recent advances in hardware technology have increased computational capability significantly, this is no match for the amount of data generated globally, which is rapidly increasing, as well as the amount of stored data, which is increasing at the rate of about 20% per year [24]. The current trend of technological advancement in terms of computational power will saturate when dealing with the massive scale of data, which will result in increased cost and, more importantly, will make the maximum utilization of data impossible in the future [45]. To deal with this challenge, quantum computing is seen as a promising alternative that can boost the computational prowess in utilizing the power of data to the full extent [3]. Most of the real-world problems that machine learning seeks to solve are nonconvex in nature. For such problems, classical algorithms struggle to converge to an optimal solution because several local minima and saddle points may be present [28]. This computational challenge limits

R. K. Nath () VTT Technical Research Centre of Finland, Kuopio, Finland H. Thapliyal University of Tennessee, Knoxville, TN, USA T. S. Humble Oak Ridge National Laboratory, Oak Ridge, TN, USA © The Author(s), under exclusive license to Springer Nature Switzerland AG 2024 H. Thapliyal, T. Humble (eds.), Quantum Computing, https://doi.org/10.1007/978-3-031-37966-6_9


R. K. Nath et al.

the ability of machine learning models to train efficiently, especially in the presence of a large number of variables. To deal with this problem, researchers have been exploring alternative computing techniques such as quantum computing to augment the power of machine learning algorithms [1, 6, 34]. Quantum computing is inherently suited to carrying out complex computations that are not feasible using classical computation [25]. Quantum computing processes information in the form of qubits (quantum bits). A classical bit can take the value of either 0 or 1, whereas a qubit can be in a combination of both values at the same time. This principle is called superposition, and it can allow quantum computation to achieve exponential speedup over its classical counterpart for certain problems [21]. Another property of quantum computation that offers an advantage over classical computation is the phenomenon of quantum tunneling. Because of quantum tunneling, a quantum system can navigate the solution space more efficiently by passing through tall and thin energy barriers [46]. These characteristics of quantum computing have led to the hypothesis that a QPU (Quantum Processing Unit) can be used to discover more interesting and counterintuitive patterns for machine learning classification tasks than those obtained using classical processing units such as a CPU (Central Processing Unit) or GPU (Graphics Processing Unit) [4] (Fig. 1). Hence, integrating quantum processing units with classical computation techniques can increase performance and computational speed for complex problems. The concept of a hybrid classical-quantum system is visualized in Fig. 2. For example, real-world sensor data collected at the edge can be stored in the cloud for data analytics and for building machine learning based prediction models.
Similarly, big data such as images, security data, and financial data are also processed and stored in the cloud. To support quantum integration, the cloud infrastructure acts as an interface between classical computing systems and a physical quantum computer. The idea is to outsource the solution of complex problems to a quantum processing unit for faster and more accurate computation. Optimization is an important part of machine learning applications on real-world data. Optimizing the number of variables and the parameters of a learning model is challenging, especially under data uncertainty conditions. Some of these conditions are limited availability of training data, modeling under high dimensional feature variables, better generalization to unknown data, and reduced training time and cost. The objective of this chapter is to identify the effectiveness and usefulness of D-Wave’s physical quantum annealer as an optimization subroutine in machine learning pipelines for classification tasks. Through this work, we present a comprehensive survey on the application of quantum annealing for real-world machine learning classification tasks based on recent research works. This chapter is an extension of our previously published work [41]. The main contributions of this chapter are as follows:

– The integration of D-Wave’s quantum annealer as an optimization subroutine has been analyzed and discussed in the context of machine learning classification tasks.

Quantum Annealing

Fig. 1 Processing in classical (top) and quantum systems (bottom). QPU can generate many solutions and counter-intuitive solutions faster than classical processing units such as CPU/GPU (panel labels: Bit (0/1), Classical Processing (CPU/GPU), Qubit (0/1), Quantum Processing (QPU), ML Algorithms, Patterns, Time)

Fig. 2 Overview of the integration of quantum processing units with classical computing systems on real-world data (panel labels: Sensor Data; Big Data; Edge Computing: processing, light computation, light storage; Cloud Resource: big data processing, storage, computation, quantum translation; Quantum Processing Unit: general purpose, special purpose (optimization); dimension reduction, model expectations, quantum speed up, multiple solutions)

– Application domains where quantum annealing is applied for real-world classification tasks have been identified and analyzed from existing literature.
– Finally, the possible advantages of using a quantum annealer as an optimization subroutine for machine learning classification have been analyzed based on the findings from recent research.


In this chapter, we discuss the theoretical background of quantum annealing and its realization in D-Wave’s quantum computer (Sect. 2), followed by the hybrid classical-quantum computing system using D-Wave’s quantum computer (Sect. 3). Subsequently, we explore the existing literature that has used quantum annealing for machine learning classification tasks (Sect. 4) and discuss the possible advantages of using quantum annealing over classical techniques (Sect. 5).

2 Background on Quantum Annealing

Quantum annealing is a heuristic search algorithm for finding the lowest energy state by traversing the solution landscape [23]. The lowest energy state is reached through the evolution of a time-dependent Hamiltonian. Quantum annealing uses quantum tunneling to traverse the solution space more efficiently than classical annealing to reach the optimal state [37, 40]. As visualized in Fig. 3, the time-dependent Hamiltonian can be represented as a landscape of the cost of the solution. The time-dependent Hamiltonian has two components: the final Hamiltonian ($H_F$), whose ground state is the optimal solution to the problem, and a transverse field Hamiltonian ($H_D$), which is scaled by a time-dependent coefficient, also called the transverse field, that is initialized at a high value and then decreased to zero. Large values of the transverse field allow the algorithm to avoid local minima and tunnel through large energy barriers to move towards the ground state. Classical annealing achieves the same effect through thermal jumps. It has been theoretically proven that quantum annealing is guaranteed to converge to the optimal solution and that its convergence rate is faster than that of classical annealing [38].

Fig. 3 Solution landscape of a cost function represented by a time-dependent Hamiltonian, $H(t) = H_F + \Gamma(t) H_D$. High values of the transverse field coefficient drive the tunneling phenomenon in order to escape local minima and move towards the optimal solution (diagram labels: Cost, Cost Function, Thermal Jump, Tunneling, Local Minima, Global Optimum)

Even though the quantum annealing optimization algorithm has proven to be theoretically superior to classical annealing, a classical implementation of such an algorithm is both costly and inefficient [37]. Whereas some operations are more efficient and faster using quantum computing, there are also several operations for which classical systems are inherently better than quantum systems. Hybrid systems that use both classical and quantum computing have gained considerable popularity in recent years. In the context of machine learning, training becomes extremely expensive and inefficient as the dimension of the feature vector increases, which in turn makes it difficult to find meaningful patterns. These types of computationally challenging tasks could be offloaded to quantum processors for optimization [2]. Recent advances in the design of physical quantum annealers have made it possible to explore the practical application of quantum algorithms. D-Wave Systems currently leads the market of commercial quantum computers that use quantum annealing for computation. We will now present a brief overview of the implementation of quantum annealing on D-Wave systems.
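The interplay between the final Hamiltonian and the transverse field can be illustrated on the smallest possible case, a single qubit. The sketch below assumes $H_F = h\,\sigma_z$ and $H_D = \sigma_x$, and writes the time-dependent coefficient as `gamma`; these choices and the symbol are ours, not taken from the chapter. It computes the instantaneous ground state of $H = h\,\sigma_z + \Gamma\,\sigma_x$ in closed form and does not simulate the dynamics of an actual annealer.

```python
from math import sqrt


def ground_state(h, gamma):
    """Ground state of the 2x2 Hamiltonian H = h*sigma_z + gamma*sigma_x.

    Returns (energy, p_min), where p_min is the probability that a measurement
    in the computational basis yields the classical minimum of h*sigma_z.
    """
    r = sqrt(h * h + gamma * gamma)
    if r == 0.0:
        raise ValueError("h and gamma must not both be zero")
    energy = -r  # eigenvalues of H are +r and -r
    # The ground state has <sigma_z> = -h / r, so the weight on the classical
    # minimum grows from 1/2 to 1 as gamma is lowered towards zero.
    p_min = 0.5 * (1.0 + abs(h) / r)
    return energy, p_min


# Lowering the transverse field: equal superposition -> classical solution
schedule = [ground_state(h=1.0, gamma=g)[1] for g in (10.0, 1.0, 0.1, 0.0)]
```

For a large transverse field the qubit is close to an equal superposition (probability near 1/2), and as the field is reduced to zero the probability of reading out the lowest-energy classical state rises to 1.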

2.1 Quantum Annealing in D-Wave Systems

In D-Wave’s implementation of quantum annealing, a qubit initially remains in a superposition state. At the end of annealing, the qubit goes from the superposition state to either the 0 state or the 1 state. Figure 4 shows the energy diagram depicting the physics associated with the process of quantum annealing and how the qubits attain the lowest energy state. Figure 4 shows three configurations of the energy diagram. The initial configuration (a) consists of only one valley, with the qubit in the superposition state.

Fig. 4 Energy diagram of the quantum annealing process of a single qubit, shown in three configurations (a), (b), and (c); the vertical axis runs from low energy to high energy


When quantum annealing is run, a barrier is raised, which gives rise to a double-well potential (configuration (b)). In this configuration, the qubit has an equal probability of ending up in the 0 state (the low point of the left valley) or in the 1 state (the low point of the right valley). An interesting feature of the quantum annealing processor is that the probability with which a particular qubit falls into the 0 state or the 1 state can be controlled by applying an external magnetic field to the qubit, called the bias. In other words, the qubit minimizes its energy under the influence of the bias. Coupling is another important feature of the quantum annealing processor. Through coupling, two qubits can be made to end up either in the same state (both 0 or both 1) or in different states (one 0 and the other 1). In quantum computing, this phenomenon is known as entanglement. When two qubits are entangled, they are considered as one object with four possible states. Hence, a two-qubit system has a potential with four states, each representing a different combination (00, 01, 10, 11). The relative energies of these states, which depend on the bias of each qubit and the coupling between them, define the energy landscape of the two qubits. The programmer chooses the biases and couplings to encode a specific problem instance.
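The effect of biases and a coupling on the two-qubit energy landscape can be illustrated with a small classical sketch. It assumes a quadratic energy function E(q1, q2) = a1*q1 + a2*q2 + b12*q1*q2 over binary values q in {0, 1}, a common way to write such objectives but not a form given in the text above. The function names and the numeric bias and coupling values are hypothetical, and the code only enumerates energies; it does not model the annealing dynamics.

```python
from itertools import product


def two_qubit_energies(a1, a2, b12):
    """Energy of each of the four two-qubit states.

    Assumed form (illustrative only): E = a1*q1 + a2*q2 + b12*q1*q2,
    where a1, a2 play the role of biases and b12 the role of the coupling.
    """
    return {
        (q1, q2): a1 * q1 + a2 * q2 + b12 * q1 * q2
        for q1, q2 in product((0, 1), repeat=2)
    }


def ground_states(energies):
    """Return the state(s) with the lowest energy."""
    lowest = min(energies.values())
    return [state for state, energy in energies.items() if energy == lowest]


# Hypothetical values: this coupling makes "same state" the lowest energy.
same = two_qubit_energies(a1=1, a2=1, b12=-2)
print(same, ground_states(same))

# Hypothetical values: this choice favours the two qubits being different.
different = two_qubit_energies(a1=-1, a2=-1, b12=2)
print(different, ground_states(different))

# Bias only, no coupling: the first qubit is pushed towards state 1.
biased = two_qubit_energies(a1=-1, a2=0, b12=0)
print(biased, ground_states(biased))
```

In this toy model, changing the sign and size of the coupling term switches the lowest-energy states between (0, 0) and (1, 1) on one hand and (0, 1) and (1, 0) on the other, while a bias alone tilts a single qubit towards one state.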

3 Quantum Processing Unit of D-Wave Systems

In the previous section, we discussed the basics of computation using quantum annealing and introduced key concepts such as qubits, biases, and couplers. These concepts are necessary for understanding the architecture of D-Wave’s Quantum Processing Unit (QPU) and its application to real-world problem instances. Figure 5 shows an overview of the steps involved in performing computation using D-Wave’s QPU. The overall process can be viewed as a hybrid system consisting of both classical computation and quantum computation.

3.1 Classical Computation

The classical computation consists of three main parts: (1) problem initialization, (2) programming the problem instance into the quantum annealing hardware through a software interface, and (3) resampling. The first two steps take place before annealing, and the last step takes place after annealing.


Classical Computation

Problem Instance Ising Objective: min ( Σ



,