Segment Routing Part II: Traffic Engineering



Table of Contents
Copyright
Preface
1 Introduction
1.1 Subjective Introduction of Part I
1.2 A Few Terms
1.3 Design Objectives
1.3.1 An IP-Optimized Solution
1.3.2 A Simple Solution
1.3.3 A Scalable Solution
1.3.4 A Modular Solution
1.3.5 An Innovative Solution
1.4 Service-Level Assurance (SLA)
1.5 Traffic Matrix
1.6 Capacity Planning
1.7 Centralized Dependency
1.7.1 Disjoint Paths
1.7.2 Inter-Domain
1.7.3 Bandwidth Broker
1.7.4 Multi-Layer Optimization
1.8 A TE Intent as a SID or a SID list
1.9 SR Policy
1.10 Binding SID
1.11 How Many SIDs and Will It Work?
1.12 Automation Based on Colored Service Routes
1.13 The SR-TE Process
1.13.1 One Process, Multiple Roles
1.13.2 Components
1.13.3 SR-TE Database
1.13.4 SR Native Algorithms
1.13.5 Interaction With Other Processes and External APIs
1.13.6 New Command Line Interface (CLI)
1.14 Service Programming
1.15 Lead Operator Team
1.16 SR-TE Cisco Team
1.17 Standardization
1.18 Flow of This Book
1.19 References
Section I – Foundation
2 SR Policy
2.1 Introduction
2.1.1 An Explicit Candidate Path of an SR Policy
2.1.2 Path Validation and Selection
2.1.3 A Low-Delay Dynamic Candidate Path
2.1.4 A Dynamic Candidate Path Avoiding Specific Links
2.1.5 Encoding a Path in a Segment List
2.2 SR Policy Model
2.2.1 Segment List
2.2.2 Candidate Paths
2.3 Binding Segment
2.4 SR Policy Configuration
2.5 Summary
2.6 References
3 Explicit Candidate Path
3.1 Introduction
3.2 SR-MPLS Labels
3.3 Segment Descriptors
3.4 Path Validation
3.5 Practical Considerations
3.6 Controller-Initiated Candidate Path
3.7 TDM Migration
3.8 Dual-Plane Disjoint Paths Using Anycast-SID
3.9 Summary
3.10 References
4 Dynamic Candidate Path
4.1 Introduction
4.1.1 Expressing Dynamic Path Objective and Constraints
4.1.2 Compute Path = Solve Optimization Problem
4.1.3 SR-Native Versus Circuit-Based Algorithms
4.2 Distributed Computation
4.2.1 Headend Computes Low-Delay Path
4.2.2 Headend Computes Constrained Paths
4.2.2.1 Affinity Link Colors
4.2.2.2 Affinity Constraint
4.2.3 Other Use-Cases and Limitations
4.2.3.1 Disjoint Paths Limited to Single Head-End
4.2.3.2 Inter-Domain Path Requires Multi-Domain Information
4.3 Centralized Computation
4.3.1 SR PCE
4.3.1.1 SR PCE Redundancy
4.3.2 SR PCE Computes Disjoint Paths
4.3.2.1 Disjoint Group
4.3.2.2 Path Request, Reply, and Report
4.3.2.3 Path Delegation
4.3.3 SR PCE Computes End-To-End Inter-Domain Paths
4.3.3.1 SR PCE’s Multi-Domain Capability
4.3.3.2 SR PCE Computes Inter-Domain Path
4.3.3.3 SR PCE Updates Inter-Domain Path
4.4 Summary
4.5 References
5 Automated Steering
5.1 Introduction
5.2 Coloring a BGP Route
5.2.1 BGP Color Extended Community
5.2.2 Coloring BGP Routes at the Egress PE
5.2.3 Conflict With Other Color Usage
5.3 Automated Steering of a VPN Prefix
5.4 Steering Multiple Prefixes With Different SLAs
5.5 Automated Steering for EVPN
5.6 Other Service Routes
5.7 Disabling AS
5.8 Applicability
5.9 Summary
5.10 References
6 On-demand Next-hop
6.1 Coloring
6.2 On-Demand Candidate Path Instantiation
6.3 Seamless Integration in SR-TE Solution
6.4 Tearing Down an ODN Candidate Path
6.5 Illustration: Intra-Area ODN
6.6 Illustration: Inter-domain ODN
6.7 ODN Only for Authorized Colors
6.8 Summary
6.9 References
7 Flexible Algorithm
7.1 Prefix-SID Algorithms
7.2 Algorithm Definition
7.2.1 Consistency
7.2.2 Definition Advertisement
7.3 Path Computation
7.4 TI-LFA Backup Path
7.5 Integration With SR-TE
7.5.1 ODN/AS
7.5.2 Inter-Domain Paths
7.6 Dual-Plane Disjoint Paths Use-Case
7.7 Flex-Algo Anycast-SID Use-Case
7.8 Summary
7.9 References
8 Network Resiliency
8.1 Local Failure Detection
8.2 Intra-Domain IGP Flooding
8.3 Inter-Domain BGP-LS Flooding
8.4 Validation of an Explicit Path
8.4.1 Segments Expressed as Segment Descriptors
8.4.2 Segments Expressed as SID Values
8.5 Recomputation of a Dynamic Path by a Headend
8.6 Recomputation of a Dynamic Path by an SR PCE
8.7 IGP Convergence Along a Constituent Prefix-SID
8.7.1 IGP Reminder
8.7.2 Explicit Candidate Path
8.7.3 Dynamic Candidate Paths
8.8 Anycast-SIDs
8.9 TI-LFA protection
8.9.1 Constituent Prefix-SID
8.9.2 Constituent Adj-SID
8.9.3 TI-LFA Applied to Flex-Algo SID
8.10 Unprotected SR Policy
8.11 Other Mechanisms
8.11.1 SR IGP Microloop Avoidance
8.11.2 SR Policy Liveness Detection
8.11.3 TI-LFA Protection for an Intermediate SID of an SR Policy
8.12 Concurrency
8.13 Summary
8.14 References
Section II – Further details
9 Binding-SID and SRLB
9.1 Definition
9.2 Explicit Allocation
9.3 Simplification and Scaling
9.4 Network Opacity and Service Independence
9.5 Steering Into a Remote RSVP-TE Tunnel
9.6 Summary
9.7 References
10 Further Details on Automated Steering
10.1 Service Routes With Multiple Colors
10.2 Coloring Service Routes on Ingress PE
10.3 Automated Steering and BGP Multi-Path
10.4 Color-Only Steering
10.5 Summary
10.6 References
11 Autoroute and Policy-Based Steering
11.1 Autoroute
11.2 Pseudowire Preferred Path
11.3 Static Route
11.4 Summary
11.5 References
12 SR-TE Database
12.1 Overview
12.2 Headend
12.3 SR PCE
12.3.1 BGP-LS
12.3.2 PCEP
12.4 Consolidating a Multi-Domain Topology
12.4.1 Domain Boundary on a Node
12.4.2 Domain Boundary on a Link
12.5 Summary
12.6 References
13 SR PCE
13.1 SR-TE Process
13.2 Deployment
13.2.1 SR PCE Configuration
13.2.2 Headend Configuration
13.2.3 Recommendations
13.3 Centralized Path Computation
13.3.1 Headend-Initiated Path
13.3.2 PCE-Initiated Path
13.4 Application-Driven Path
13.5 High-Availability
13.5.1 Headend Reports to All PCEs
13.5.2 Failure Detection
13.5.3 Headend Re-Delegates Paths to Alternate PCE Upon Failure
13.5.3.1 Headend-Initiated Paths
13.5.3.2 Application-Driven Paths
13.5.4 Inter-PCE State-Sync PCEP Session
13.5.4.1 State-Sync Illustration
13.5.4.2 Split-Brain
13.6 BGP SR-TE
13.7 Summary
13.8 References
14 SR BGP Egress Peer Engineering
14.1 Introduction
14.2 SR BGP Egress Peer Engineering (EPE)
14.2.1 SR EPE Properties
14.3 Segment Types
14.4 Configuration
14.5 Distribution of EPE Information in BGP-LS
14.5.1 BGP Peering SID TLV
14.5.2 Single-hop BGP Session
14.5.3 Multi-hop BGP Session
14.6 Use-Cases
14.6.1 SR Policy Using Peering-SID
14.6.2 SR EPE for Inter-Domain SR Policy Paths
14.7 Summary
14.8 References
15 Performance Monitoring – Link Delay
15.1 Performance Measurement Framework
15.2 The Components of Link Delay
15.3 Measuring Link Delay
15.3.1 Probe Format
15.3.2 Methodology
15.3.3 Configuration
15.3.4 Verification
15.4 Delay Advertisement
15.4.1 Delay Metric in IGP and BGP-LS
15.4.2 Configuration
15.4.3 Detailed Delay Reports in Telemetry
15.5 Usage of Link Delay in SR-TE
15.6 Summary
15.7 References
16 SR-TE Operations
16.1 Weighted Load-Sharing Within SR Policy Path
16.2 Drop on Invalid SR Policy
16.3 SR-MPLS Operations
16.3.1 First Segment
16.3.2 PHP and Explicit-Null
16.3.3 MPLS TTL and Traffic-Class
16.4 Non-Homogenous SRGB
16.5 Candidate-Paths With Same Preference
16.6 Summary
16.7 References
Section III – Tutorials
17 BGP-LS
17.1 BGP-LS Deployment Scenario
17.2 BGP-LS Topology Model
17.3 BGP-LS Advertisement
17.3.1 BGP-LS NLRI
17.3.1.1 Protocol-ID Field
17.3.1.2 Identifier Field
17.3.2 Node NLRI
17.3.3 Link NLRI
17.3.4 Prefix NLRI
17.3.5 TE Policy NLRI
17.3.6 Link-State Attribute
17.4 SR BGP Egress Peer Engineering
17.4.1 PeerNode SID BGP-LS Advertisement
17.4.2 PeerAdj SID BGP-LS Advertisement
17.4.3 PeerSet SID BGP-LS Advertisement
17.5 Configuration
17.6 ISIS Topology
17.6.1 Node NLRI
17.6.2 Link NLRI
17.6.3 Prefix NLRI
17.7 OSPF Topology
17.7.1 Node NLRI
17.7.2 Link NLRI
17.7.3 Prefix NLRI
17.8 References
18 PCEP
18.1 Introduction
18.1.1 Short PCEP History
18.2 PCEP Session Setup and Maintenance
18.2.1 SR Policy State Synchronization
18.3 SR Policy Path Setup and Maintenance
18.3.1 PCC-Initiated SR Policy Path
18.3.2 PCE-Initiated SR Policy Path
18.3.3 PCE Updates SR Policy Path
18.4 PCEP Messages
18.4.1 PCEP Open Message
18.4.2 PCEP Close Message
18.4.3 PCEP Keepalive Message
18.4.4 PCEP Request message
18.4.5 PCEP Reply Message
18.4.6 PCEP Report Message
18.4.7 PCEP Update Message
18.4.8 PCEP Initiate Message
18.4.9 Disjointness Association Object
18.5 References
19 BGP SR-TE
19.1 SR Policy Address-Family
19.1.1 SR Policy NLRI
19.1.2 Tunnel Encapsulation Attribute
19.2 SR Policy BGP Operations
19.2.1 BGP Best-Path Selection
19.2.2 Use of Distinguisher NLRI Field
19.2.3 Target Headend Node
19.3 Illustrations
19.3.1 Illustration NLRI Distinguisher
19.3.1.1 Same Distinguisher, Same NLRI
19.3.1.2 Different Distinguishers, Different NLRIs
19.4 References
20 Telemetry
20.1 Telemetry Configuration
20.1.1 What Data to Stream
20.1.2 Where to Send It and How
20.1.3 When to Send It
20.2 Collectors and Analytics
20.3 References
Section IV – Appendices
A. Introduction of SR Book Part I
A.1 Objectives of the Book
A.2 Why Did We Start SR?
A.3 The SDN and OpenFlow Influences
A.4 100% Coverage for IPFRR and Optimum Repair Path
A.5 Other Benefits
A.6 Team
A.7 Keeping Things Simple
A.8 Standardization and Multi-Vendor Consensus
A.9 Global Label
A.10 SR MPLS
A.11 SRv6
A.12 Industry Benefits
A.13 References
B. Confirming the Intuition of SR Book Part I
B.1 Raincoat and Boots on a Sunny Day
B.2 ECMP-Awareness and Diversity

Copyright Segment Routing Part II – Traffic Engineering by Clarence Filsfils, Kris Michielsen, François Clad, Daniel Voyer Copyright © 2019 Cisco Systems, Inc. and/or its affiliates. All rights reserved. No part of this book may be reproduced, distributed, or transmitted in any form or by any means, electronic or mechanical, including photocopying, recording, or by any information storage and retrieval system, without written permission from the publisher or author, except for the inclusion of brief quotations in a review. Kindle Edition v1.1, May 2019

Warning and Disclaimer Every effort has been made to make this book as complete and as accurate as possible, but no warranty or fitness is implied. THE INFORMATION HEREIN IS PROVIDED ON AN “AS IS” BASIS, WITHOUT ANY WARRANTIES OR REPRESENTATIONS, EXPRESS, IMPLIED OR STATUTORY, INCLUDING WITHOUT LIMITATION, WARRANTIES OF NON-INFRINGEMENT, MERCHANTABILITY OR FITNESS FOR A PARTICULAR PURPOSE. The publisher and authors shall have neither liability nor responsibility to any person or entity with respect to any loss or damages arising from the information contained in this book. The views and opinions expressed in this book belong to the authors or to the person who is quoted.

Trademark Acknowledgments Cisco and the Cisco logo are trademarks or registered trademarks of Cisco and/or its affiliates in the U.S. and other countries. To view a list of Cisco trademarks, go to this URL: www.cisco.com/go/trademarks. Third-party trademarks mentioned are the property of their respective owners. The use of the word partner does not imply a partnership relationship between Cisco and any other company. (1110R) All terms mentioned in this book that are known to be trademarks or service marks have been appropriately capitalized. The authors cannot attest to the accuracy of this information. Use of a term in this book should not be regarded as affecting the validity of any trademark or service mark.

Preface This book is the second part of the series about Segment Routing (SR). As more and more Service Providers and Enterprises operate a single network infrastructure to support an ever-increasing number of services, the ability to custom-fit transport to application needs is of paramount importance. In that respect, network operators have been exploring Traffic Engineering techniques for some years now but have obviously run into many scaling issues preventing them from having an end-to-end, fine-grained control over the myriad services they offer. Segment Routing Traffic Engineering (SR-TE) has changed that game and has become the undisputed solution to deliver Traffic Engineering capabilities at scale.

Audience We have tried to make this book accessible to a wide audience and to address beginner, intermediate and advanced topics. We hope it will be of value to anyone trying to design, support or just understand SR-TE from a practical perspective. This includes network designers, engineers, administrators, and operators, both in service provider and enterprise environments, as well as other professionals or students looking to gain an understanding of SR-TE. We have assumed that readers are familiar with data networking in general, and with the concepts of IP, IP routing, and MPLS, in particular. There are many good books available on these subjects. We have also assumed that readers know the basics of Segment Routing. Part I of this SR book series (available on amazon.com) is a great resource to learn the SR fundamentals.

Disclaimer This book only reflects the opinion of the authors and not the company they work for. Every statement made in this book is a conclusion drawn from the authors' personal research and lab tests. Also consider the following: Some examples have been built with prototype images. It is possible that at the time of publication, some functions and commands are not yet generally available. Cisco Systems does not commit to release any of the features described in this book. For some functionalities this is indicated in the text, but not for all. On the bright side: this book provides a sneak preview of the state-of-the-art and you have the opportunity to get a taste of the things that may be coming. It is possible that some of the commands used in this book will be changed or become obsolete in the future. Syntax accuracy is not guaranteed. For illustrative purposes, the authors took the liberty to edit some of the command output examples, e.g., by removing parts of the text. For this reason, the examples in this book have no guaranteed accuracy.

Reviewers Many people have contributed to this book and they are acknowledged in the Introduction chapter. To deliver a book that is accurate, clear, and enjoyable to read, it needs many eyes, other than the eyes of the authors. Here we would like to specifically thank the people who have reviewed this book in its various stages towards completion. Without them this book would not have been at the level it is now. A sincere "Thank you!" to all. In alphabetical order: Mike DiVincenzo, Alberto Donzelli, Darren Dukes, Muhammad Durrani, Rakesh Gandhi, Arkadiy Gulko, Al Kiramoto, Przemyslaw Krol, Sakthi Malli, Paul Mattes, Hardik Patel, Rob Piasecki, Carlos Pignataro, Robert Raszuk, Joel E. Roberts, JC Rode, Aisha Sanes, Grant Socal, Simon Spraggs, YuanChao Su, Ketan Talaulikar, Mike Valentine, Bhupendra Yadav, Frank Zhao, and Qing Zhong.

Textual Conventions This book has multiple flows: General flow: This is the regular flow of content that a reader wanting to learn SR would follow. It contains facts, no opinions and is objective in nature. Highlights: Highlight boxes emphasize important elements and topics for the reader. These are presented in a “highlight” box.

HIGHLIGHT This is an example of a highlight.

Opinions: This content expresses opinions, choices and tradeoffs. This content is not necessary to understand SR, but gives some more background to the interested reader and is presented as quotes. We have also invited colleagues in the industry who have been very active on the SR project to share their opinions on SR in general or some specific aspects. The name of the person providing that opinion is indicated in each quote box.

“This is an example opinion. ” — John Doe

Reminders: The reminders briefly explain technological aspects (mostly outside of SR) that may help understanding the general flow. They are presented in a “reminder” box. REMINDER This is an example reminder

Illustrations and Examples Conventions The illustrations and examples in this book follow the following conventions:
- Router-id of NodeX is 1.1.1.X. Other loopbacks have address 1.1.n.X, with n an index.
- Interface IPv4 address of an interface on NodeX connected to NodeY is 99.X.Y.X/24, with X<Y.

The use of a BSID (and the transit SR Policy) decreases the number of segments imposed by the source. A BSID acts as a stable anchor point which isolates one domain from the churn of another domain. Upon topology changes within the core of the network, the low-delay path from DCI1 to DCI3 may change. While the path of an intermediate policy changes, its BSID does not change. Hence the policy used by the source does not change and the source is shielded from the churn in another domain.

A BSID provides opacity and independence between domains. The administrative authority of the core domain may want to exercise a total control over the paths through this domain so that it can perform capacity planning and introduce TE for the SLAs it provides to the leaf domains. The use of a BSID allows keeping the service opaque. S is not aware of the details of how the low-delay service is provided by the core domain. S is not aware of the need of the core authority to temporarily change the intermediate path.

1.11 How Many SIDs and Will It Work? Let’s assume that a router S can only push 5 labels and a TE intent requires a SID list of 7 labels where these labels are IGP Prefix SIDs to respective nodes 1, 2, 3, 4, 5, 6 and 7. Are we stuck? No, for two reasons: Binding SID and Flex-Algo SID. The first solution is clear as it leverages the Binding SID properties (Figure 1‑6 (a)).

Figure 1-6: Solutions to handle label imposition limit

The SID list at the headend node becomes <S1, S2, S3, S4, B>, where Si denotes the Prefix-SID of node i and B is the Binding SID of a policy <S5, S6, S7> at node 4. The SID list at the headend node meets the 5-label constraint of that node. While a binding SID and policy at node 4 does add state to the core for a policy on the edge, the state is not per edge policy. Therefore, a single binding SID B for policy <S5, S6, S7> may be reused by many edge node policies.

The second solution is equally straightforward: instantiate the intent on the nodes in the network as an extra Flex-Algo IGP algorithm (say AlgoK) and allocate a second SID S7’ to node 7 where S7’ is associated with AlgoK (Figure 1‑6 (b)). Clearly the SID list now becomes <S7’> and only requires one label to push ☺. These two concepts guarantee that an intent can be expressed as an SR Policy that meets the capabilities of the headend (e.g., max push ≤ 5 labels). As we detail the SR-TE solution, we explain how the characteristics of the nodes are discovered (how many labels can they push) and how the optimization algorithms take this constraint into consideration. Last but not least, it is important to remember that most modern forwarding ASICs can push at least 5 labels and that most use-cases require fewer than 5 labels (based on extensive analysis of various TE-related use-cases on ISP topologies). Hence, this problem is rarely a real constraint in practice.
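To make the Binding SID option more concrete, the following IOS XR-style sketch shows roughly how it could be wired up (all names, label values and addresses are hypothetical and the exact syntax may vary by release): node 4 hosts an SR Policy for the tail of the path and exposes it via an explicit Binding SID, and the headend uses that Binding SID as the last label of its own, shorter segment list.

! On node 4: policy covering the remaining hops (nodes 5, 6, 7)
segment-routing
 traffic-eng
  segment-list SL-TAIL
   index 10 mpls label 16005
   index 20 mpls label 16006
   index 30 mpls label 16007
  policy TAIL
   binding-sid mpls 15999
   color 100 end-point ipv4 1.1.1.7
   candidate-paths
    preference 100
     explicit segment-list SL-TAIL
!
! On headend S: 5 labels instead of 7; the last label is node 4's Binding SID
segment-routing
 traffic-eng
  segment-list SL-HEAD
   index 10 mpls label 16001
   index 20 mpls label 16002
   index 30 mpls label 16003
   index 40 mpls label 16004
   index 50 mpls label 15999
  policy EDGE
   color 100 end-point ipv4 1.1.1.7
   candidate-paths
    preference 100
     explicit segment-list SL-HEAD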

1.12 Automation Based on Colored Service Routes A key intuition at the base of the simplicity and automation of our SR-TE solution has been to place the BGP routes at the center of our solution. That idea came during a taxi ride in Rome with Alex Preusche and Alberto Donzelli. BGP routes provide reachability to services: Internet, L3VPN, PW, L2VPN. We allow the operator to mark the BGP routes with colors. Hence, any BGP route has a next-hop and a color. The next-hop indicates where we need to go. The color is a BGP color extended community attribute, expressed as a 32-bit value. The color indicates how we need to go to the next-hop. It defines a TE SLA intent. An operator allocates a TE SLA intent to each color as he wishes. A likely example is:
- No color: "best-effort"
- Red: "low-delay"

Let us assume the operator marks a BGP route 9/8 via 1.1.1.5 with color red while BGP route 8/8 via 1.1.1.5 is left uncolored. Upon receiving the route 8/8, Node 1 installs 8/8 via 16005 (prefix-SID of 1.1.1.5) in FIB. This is the classic behavior. The traffic to 8/8 takes the IGP path to 1.1.1.5, which is the best-effort (lowest cost) path. Upon receiving the route 9/8, Node 1 detects that the route has a color red that matches a local TE SLA template red. As such, the BGP process asks the TE process for the local SR Policy with (color = red; endpoint = 1.1.1.5). If this SR Policy does not yet exist, then the TE process instantiates it on-demand. Whether pre-existing or instantiated on-demand, the TE process eventually returns an SR Policy to the BGP process. The BGP process then installs 9/8 on the returned SR Policy. The traffic to 9/8 takes the red SR Policy to 1.1.1.5, which provides the low-delay path to node 5.

The installation of a BGP route onto an SR Policy is called Automated Steering (AS). This is a fundamental simplification as one no longer needs to resort to complex policy-based routing constructions. The dynamic instantiation of an SR Policy based on a color template and an endpoint is called On-Demand Next-hop (ODN). This is a fundamental simplification as one no longer needs to preconfigure any SR Policy. Instead, all the edge nodes are configured with the same few templates.
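As a hedged illustration of how little configuration these two mechanisms require, the sketch below uses IOS XR-style CLI with hypothetical names and a hypothetical color value (100 for "red"/low-delay); the exact syntax may vary by release. The egress PE attaches the color to the service routes that need the SLA, and every ingress PE carries only one small ODN template per color.

! Egress PE (e.g., node 5): attach color 100 ("red") to the routes that need low delay
extcommunity-set opaque RED
  100
end-set
!
route-policy SET-RED
  set extcommunity color RED
  pass
end-policy
!
! Ingress PEs (e.g., node 1): a single ODN template per color, shared by all endpoints
segment-routing
 traffic-eng
  on-demand color 100
   dynamic
    metric
     type delay

The route-policy would then be attached where the service routes are advertised (e.g., as a VRF export or BGP neighbor policy); chapter 5, "Automated Steering" discusses the coloring options in detail.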

1.13 The SR-TE Process The SR-TE solution is implemented in the Cisco router operating systems (Cisco IOS XR, Cisco IOS XE, Cisco NX-OS) as a completely new process (i.e., different and independent from MPLS RSVP-TE). In this book, we explain the IOS XR SR-TE implementation in detail. We ensured consistency between the three OS implementations and hence the concepts are easily leveraged for IOS XE and NX-OS platforms.

1.13.1 One Process, Multiple Roles The SR-TE process supports multiple roles.

Within the headend:
- SR-TE brain of the router (as a local decision maker)
- Topology database limited to its local domain
- Possibly contains a multi-domain topology database, but not yet seen in practice

Outside the headend:
- SR PCE (helping routers for path disjointness or inter-domain policies)

This multi-role ability is very similar to the BGP process in the router OS:
- BGP brain of the router (receives paths and installs the best paths in RIB)
- BGP Route Reflector (helping routers to collect all the paths)

In the BGP case, a border router peers with a Route Reflector (RR) to get all its BGP paths. The border router and the route reflector run the same OS and use the same BGP process. The border router uses the BGP process as a local headend, for BGP path selection and forwarding entry installation. The route reflector uses the BGP process only to aggregate paths and reflect the best ones. This is very similar for SR-TE. In the SR-TE case, a border router (headend in SR-TE language) runs the SR-TE process to manage its policies, computes the SID list when it can, requests help from an SR PCE when it cannot, and eventually installs the forwarding entries for the active paths of its policies.

The SR PCE runs the same SR-TE process but in PCE mode. In this case, the SR-TE process typically collects more topological information from multiple domains (e.g., for inter-domain path calculation) and provides centralized path computation services for multiple headends. The ability to use a single consistent architecture and implementation for multiple roles or use-cases is a great benefit. However, it may sound confusing initially. For example, let us ask the following question: does the router have the multi-domain topology? Three answers are possible:
- No. The router only has local domain information and asks the SR PCE for help for inter-domain policies.
- Yes. The router is an SR PCE and to fulfill this role it aggregates the information from multiple domains.
- Yes. The router is a headend, but the operator has decided to provide multi-domain information directly to this headend such that it computes inter-domain policies by itself (without SR PCE help).

The SR-TE process architecture and implementation allows these three use-cases. In this book, we walk you through the various use-cases and we provide subjective hints and experience to identify which use-cases are most common. For example, while the last answer is architecturally possible and it will likely eventually be deployed, for now such a design has not come up in deployment discussions. Another likely question may be: "Is the SR PCE a virtualized instance on a server?" Two answers are possible:
- Yes. An SR PCE does not need to install any policy in its dataplane. It has a control-plane only role. Hence, it may be more scalable/cost-effective to run it as a virtual instance.
- No. An SR PCE role may be enabled on a physical router in the network. The router is already present, it already has the SR-TE process, and the operator may as well leverage it as an additional SR PCE role.

Again, this is very similar to the BGP RR role: it can be virtualized on a server or enabled on a physical router.
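To give a feel for how thin this role split is in configuration terms, the following IOS XR-style sketch (addresses hypothetical, syntax may vary by release) enables the SR PCE role on one node and points a headend, as a PCC, to that PCE:

! On the node taking the SR PCE role (physical router or virtual XR instance)
pce
 address ipv4 1.1.1.100
!
! On the headend: the PCC side of the same SR-TE process
segment-routing
 traffic-eng
  pcc
   pce address ipv4 1.1.1.100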

1.13.2 Components At a high level, the SR-TE process comprises the following components:
- SR-TE Database
- Local Policy Database:
  - Multiple candidate paths per policy, received via multiple channels (BGP, PCEP, NETCONF, CLI)
  - Validation process
  - Selection process
  - Binding SID (BSID) association
- Dynamic Path Computation – the SR-native algorithms
- On-Demand Policy instantiation (ODN)
- Policy installation in FIB
- Automated Steering (AS)
- Policy Reporting:
  - BGP-LS
  - Telemetry
  - NETCONF/YANG

1.13.3 SR-TE Database A key component of the SR-TE process is the SR-TE database (SR-TE DB).

The SR-TE process collects topology information from ISIS/OSPF and BGP-LS and stores this information in the SR-TE database (SR-TE DB). We designed the SR-TE DB to be multi-domain aware. This means that the SR-TE process must be able to consolidate in the SR-TE DB the topology information of multiple domains, while discarding the redundant information. The SR-TE DB is also designed to hold more than the topological information; it includes the following:
- Segments (Prefix-SIDs, Adj-SIDs, Peering SIDs)
- TE Link Attributes (such as TE metric, delay metric, SRLG, affinity link color)
- Remote Policies (in order to leverage a policy in a central domain as a transit policy for an end-to-end inter-domain policy between two different access domains, an SR PCE MUST know about all the policies installed in the network that are eligible for use as transit policy)
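As a rough sketch of how an SR PCE's SR-TE DB can be fed with a domain's topology, a node inside that domain can export its IGP link-state database into BGP-LS towards the PCE; the IOS XR-style configuration below is indicative only (instance name, AS number and addresses are hypothetical):

! Border node: export the ISIS link-state database into BGP-LS
router isis CORE
 distribute link-state
!
router bgp 65000
 address-family link-state link-state
 !
 neighbor 1.1.1.100
  remote-as 65000
  update-source Loopback0
  address-family link-state link-state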

1.13.4 SR Native Algorithms This is a very important topic. The algorithms to translate an intent (e.g., low-delay) into a SID list have nothing to do with the circuit algorithms used by ATM/FR PNNI or MPLS RSVP-TE. The SR native algorithms take into consideration the properties of each available segment in the network, such as the segment algorithm or ECMP diversity, and compute paths as sequences of segments rather than forwarding links. The SR native algorithms are implemented in the SR-TE process, and hence are leveraged in either headend or SR PCE mode.

1.13.5 Interaction With Other Processes and External APIs The SR-TE process has numerous interactions with other router OS processes and with external entities via APIs.

Those interactions are classified as a router headend role (“headend”) or an SR PCE role (“SR PCE”). These hints are indicative/subjective as many other use-cases are possible.

Figure 1-7: SR-TE interaction with other processes and external APIs

- ISIS/OSPF:
  - Headend discovers the local-domain link-state topology
- BGP-LS:
  - SR PCE discovers local/remote-domain link-state topology information
  - SR PCE discovers SR Policies in local/remote domains
  - Headend reports its locally programmed SR Policies and their status
  - Headend discovers the local-domain link-state topology (e.g., SR BGP-only DC)
- PCEP:
  - Headend requests a path computation from the SR PCE; conversely, SR PCE receives a path computation request
  - Headend learns a path from the SR PCE; conversely, SR PCE signals a path to the headend
  - Headend reports its local SR Policies to the SR PCE
- BGP-TE:
  - Headend learns a path from the SR PCE; conversely, SR PCE signals a path to the headend
- NETCONF:
  - Headend learns a path from the SR PCE; conversely, SR PCE signals a path to the headend
  - SR PCE discovers SR Policies of the headend; conversely, headend signals SR Policies to the SR PCE
- FIB:
  - Headend installs a policy in the dataplane
- BGP Service Route and equivalent (Layer-2 Pseudowire), for automated steering purposes:
  - Headend's BGP process asks the SR-TE process for an SR Policy (endpoint, color)
  - Headend's SR-TE process communicates the local valid SR-TE policies to the BGP process
- Northbound API for high-level orchestration and visibility (SR PCE):
  - Cisco WAE, Packet Design LLC, etc. applications

We will come back to these different interactions and APIs throughout this book and illustrate them with examples.

1.13.6 New Command Line Interface (CLI) The SR-TE CLI in IOS XR 6.2.1 and above has been designed to be intuitive and minimalistic. It fundamentally differs from the classic MPLS RSVP-TE CLI in IOS XR.

1.14 Service Programming While we will detail the notion of "network programming" and hence "service programming" in Part III of this series of SR books, it is important to understand that the Service Function Chaining (SFC) solution is entirely integrated within the SR-TE solution. The solution to enforce a stateless topological traffic-engineering policy through a network is the same as the solution to enforce a stateless service program. We call a "Network Function (NF)" an application/function provided by a hardware appliance, a Virtual Machine (VM) or a container. An NF can reside anywhere within the SR domain. It can be close to the access or centralized within a DC. We use the term "service program" instead of the classical term "service chain" because the integrated SR-TE solution for TE and SFC delivers more than a simple sequential chain. The SR solution allows for the same expression as a modern programming language: it supports flexible branching based on rich conditions. An example of an application that leverages the unique capabilities of SR is the Linux iptables-based SERA firewall [SERA]. This open-source firewall can filter packets based on their attached SR information and perform SR-specific actions, such as skipping one or several of the remaining segments. This has also been demonstrated with the open-source SNORT application. Aside from the richness of the SR-TE "service program" solution, we also have the scale and simplicity benefit of SR: the state is only at the edge of the network and no other protocol is needed to support Network Functions Virtualization (NFV) based solutions (the SR-TE solution is simply reused). This is very different from other solutions like Network Service Header (NSH), which installs state all over the network (less scalable) and creates a new encapsulation, resulting in more protocols, more overhead, and more complexity.

1.15 Lead Operator Team The SR-TE solution has been largely influenced by lead operators around the globe.

Martin Horneffer from Deutsche Telekom is an industry TE veteran. In 2005, he highlighted the scaling and complexity drawbacks of MPLS RSVP-TE and proposed an alternative solution based on capacity planning and IGP-metric tuning [IGP Tuning]. He likes to remind the community that the absolute pre-requisite for a TE process is the reliable and scalable collection of the input information: the optical SRLGs, the per-link performance metrics (delay and loss) and the demand matrix. Martin has significantly influenced the SR-TE design by keeping the focus on the automation and simplification of the input collections and the relationship between TE and Capacity Planning.

Paul Mattes from Microsoft is an IP and SDN industry veteran. His long-time experience with developing BGP code gives him a clear insight into network complexity and scale. His early participation in one of the major SDN use-cases [SWAN] gave him an insight into how SR could preserve the network programming benefit introduced by OpenFlow while significantly scaling its operation by leveraging the distributed routing protocols and their related prefix segments. Paul has significantly influenced the SR-TE design, notably any use-case involving a centralized SR-TE controller with higher-layer applications.

Alex Bogdanov, Steven Lin, Rob Shakir and later Przemyslaw Krol from Google have joined the work initiated with Paul, have validated the BGP SR-TE benefits and have helped to refine many details as we were working on their use-cases. They significantly influenced the RSVP-TE/SR-TE Bandwidth Broker interworking solution.

Niels Hanke from Vodafone Germany has been a key lead operator with one of the first ever SR-MPLS deployments. Together with Anton Karneliuk from Vodafone Germany, we then engineered the first worldwide delivery of a three-tier latency service thanks to SR-TE, ODN/AS and IGP SR Flex-Algo. Anton joined me at the Cisco Live!™ 2019 event to share some information on this new deployment. The video is available on our site segment-routing.net.

Mike Valentine from a leading financial service customer was the first to realize the applicability of SR in his sector, both to drastically simplify the operation and to provide automated end-to-end SLA SR Policies. His blunt feedback helped me a lot to convince Siva to "bite the bullet", forget the original prototype code and re-engineer the SR-TE process (and its CLI) from scratch.

Stéphane Litkowski from Orange Group was our first SR PCE user. His passion and the co-development efforts with Siva helped to refine our design and speed up our production plans.

Dan Voyer of Bell Canada is an IP/MPLS and SDN industry veteran, and his team has deployed an impressive SR network across several domains. His use-case, combining service chaining and SR-TE and applying it to external services such as SD-WAN, was a key influence in the SR design.

Gaurav Dawra, first at Cisco and then at LinkedIn, extended the BGP control plane and BGP-LS. Gaurav also provided key insight into the DC and inter-DC use-case for SR-TE.

Dennis Cai from Alibaba helped the community to understand SR-TE deployment and SR Controller architecture within a WEB context. His presentation is available on our site segment-routing.net.

Arkadiy Gulko of Thomson Reuters has worked closely with us to refine the IGP SR Flex-Algo solution and show its great applicability for low-latency or diversity requirements.

As in any public presentation on SR, we thank the lead operator team for all their help in defining and deploying this novel SR solution.

1.16 SR-TE Cisco Team Siva Sivabalan, Tarek Saad and Joseph Chin worked closely to design and implement the SR-TE solution on IOS XR from scratch. Siva Sivabalan continues to lead all the SR-TE development and his energy has been key to deliver the SR-TE solution to our lead operators. As the project and the deployment expanded, the larger SR-TE team was formed with Mike Koldychev, Johnson Thomas, Arash Khabbazibasmenj, Abdul Rehman, Alex Tokar, Guennoun Mouhcine, David Toscano, Bhupendra Yadav, Bo Wu, Peter Pieda, Prajeet G.C. and Jeff Williams. The SR-TE solution has been tested by Zhihao Hong, Vibov Bhan, Vijay Iyengar, Braven Hong, Wanmathy Dolaasthan, Manan Patel, Kalai Sankaralingam, Suguna Ganti, Paul Yu, Sudheer Kalyanashetty, Murthy Haresamudra, Matthew Starky, Avinash Tadimalla, Yatin Gandhi and Yong Wang. François Clad played a key role defining the SR native algorithms. Junaid Israr, Apoorva Karan, Bertrand Duvivier and Ianik Semco have been the project and product managers for SR-TE and have been instrumental in making things happen from a development process viewpoint. They orchestrated the work across all the components and API’s such as to have a single modular and easy to operate solution. Jose Liste, Kris Michielsen, and Alberto Donzelli supported the first major SR-TE designs and deployments. They are excellent sources of reality-check to keep the team focused on deployment requirements. Their professional handling of demos and proof-of-concepts has been key to our communication of these novel ideas. Frederic Trate has been leading our external communication and has been a great source of brainstorming for our long-term strategy. Tim LaBerge influenced the SR-TE design during his tenure at Microsoft (partnering with Paul Mattes) and as part of the Cisco WAE development team. Peter Psenak led the SR IGP Flex-Algo implementation.

Ketan Talaulikar led the BGP-LS architecture and implementation and the related BGP-only design together with Krishna Swamy. Zafar Ali led the OAM for SR-TE and our overall SR activity at the IETF. Rakesh Gandhi led the Performance Monitoring for SR-TE. Sagar Soni and Patrick Khordoc provided key help on Performance Monitoring implementation. Dhanendra Jain and Krishna Swamy led the BGP work for SR-TE and were key players for the ODN/AS implementation. David Ward, SVP and Chief Architect for Cisco Engineering, was essential to realize the opportunity with SR and fund the project in September 2012. Ravi Chandra, SVP Core Software Group, proved essential to execute our project beyond its first phase. Ravi has been leading the IOS XR, IOS XE and NX-OS software at Cisco. He very quickly understood the SR opportunity and funded it as a portfolio-wide program. We could then really tackle all the markets interested in SR (hyper-scale WEB operators, SP and Enterprise) and all the network segments (DC, metro/aggregation, edge, backbone). Sumeet Arora, SVP SP Routing, provided strong support to execute our project across the SP routing portfolio: access, metro, core, merchant and Cisco silicon. Venu Venugopal, VP, acted as our executive sponsor during most of the SR-TE engineering phase. He played a key role orchestrating the commitment and effective delivery of all the components of the solution. We had a great trip together to visit Paul and Tim at Microsoft a few years ago. This is where the BGP-TE component got started.

1.17 Standardization As we explained in Part I of this SR book series, we are committed to standardization and have published at IETF all the details that are required to ensure SR-TE inter-operability (e.g., protocol extensions). Furthermore, we have documented a fairly detailed description of our local-node behavior to allow operators to request similar behaviors from other vendors. [draft-ietf-spring-segment-routing-policy] is the main document to consider. It describes the SR-TE architecture and its key concepts. It introduces the various protocol extensions:
- draft-filsfils-spring-sr-traffic-counters
- draft-filsfils-spring-sr-policy-considerations
- draft-ietf-idr-bgp-ls-segment-routing-ext
- draft-ietf-idr-te-lsp-distribution
- draft-ietf-idr-bgpls-segment-routing-epe
- draft-ietf-lsr-flex-algo
- draft-ietf-pce-segment-routing
- draft-sivabalan-pce-binding-label-sid
- draft-ietf-pce-association-diversity
- draft-ietf-idr-segment-routing-te-policy
- RFC8491
- RFC8476
- draft-ietf-idr-bgp-ls-segment-routing-msd

See the complete list on www.segment-routing.net/ietf.

1.18 Flow of This Book This book focuses on the SLA intents that can be handled as a routing problem: low-delay, disjoint planes, resource inclusion/exclusion, intra- and inter-domain. We relate these SLA intents to use-cases and show how they can be met by various SR-TE solutions: SR-TE policy with explicit path, SR-TE Policy with dynamic path, SR IGP Flex-Algo. For example, the dual-plane disjointness service can be supported by an explicit path leveraging a per-plane anycast SID, by a dynamic path excluding an affinity, or by an IGP Flex-Algo enabled on the chosen plane. The first chapters introduce the notions of SR Policy and its candidate paths (static and dynamic). We then delve into the heart of the SR-TE solution: On-Demand Policy (ODN) and Automated Steering (AS). We then cover SR IGP Flexible Algorithm, Network Resiliency and Binding SID. At that point, the key concepts will have been covered. The remainder of the book is then a series of chapters that detail topics that were introduced earlier. For example, the notion of SR-TE Database is introduced in the first chapters as it is key to understand how native SR algorithms compute a solution SID list and how an explicit candidate path is validated. Chapter 12, "SR-TE Database" revisits that concept and covers it in depth. We chose this two-step approach to ease the learning curve. We believe that the most important thing is to understand what the different components of the SR-TE solution are and how they interact with each other. Once this is well understood, the reader can then zoom in on a specific component and study it in more depth.

1.19 References
[SR-book-Part-I] "Segment Routing Part I", Clarence Filsfils, Kris Michielsen, Ketan Talaulikar, October 2016, ASIN: B01I58LSUO (Kindle), ISBN-10: 1542369126, ISBN-13: 978-1542369121
[draft-ietf-spring-segment-routing-policy] "Segment Routing Policy Architecture", Clarence Filsfils, Siva Sivabalan, Daniel Voyer, Alex Bogdanov, Paul Mattes, draft-ietf-spring-segment-routing-policy-02 (Work in Progress), October 2018
[draft-ietf-pce-segment-routing] "PCEP Extensions for Segment Routing", Siva Sivabalan, Clarence Filsfils, Jeff Tantsura, Wim Henderickx, Jonathan Hardwick, draft-ietf-pce-segment-routing-16 (Work in Progress), March 2019
[draft-ietf-idr-segment-routing-te-policy] "Advertising Segment Routing Policies in BGP", Stefano Previdi, Clarence Filsfils, Dhanendra Jain, Paul Mattes, Eric C. Rosen, Steven Lin, draft-ietf-idr-segment-routing-te-policy-05 (Work in Progress), November 2018
[draft-ietf-idr-bgp-ls-segment-routing-ext] "BGP Link-State extensions for Segment Routing", Stefano Previdi, Ketan Talaulikar, Clarence Filsfils, Hannes Gredler, Mach Chen, draft-ietf-idr-bgp-ls-segment-routing-ext-12 (Work in Progress), March 2019
[draft-sivabalan-pce-binding-label-sid] "Carrying Binding Label/Segment-ID in PCE-based Networks.", Siva Sivabalan, Clarence Filsfils, Jeff Tantsura, Jonathan Hardwick, Stefano Previdi, Cheng Li, draft-sivabalan-pce-binding-label-sid-06 (Work in Progress), February 2019
[draft-ietf-lsr-flex-algo] "IGP Flexible Algorithm", Peter Psenak, Shraddha Hegde, Clarence Filsfils, Ketan Talaulikar, Arkadiy Gulko, draft-ietf-lsr-flex-algo-01 (Work in Progress), November 2018
[RFC8491] "Signaling Maximum SID Depth (MSD) Using IS-IS", Jeff Tantsura, Uma Chunduri, Sam Aldrin, Les Ginsberg, RFC8491, November 2018
[RFC8476] "Signaling Maximum SID Depth (MSD) Using OSPF", Jeff Tantsura, Uma Chunduri, Sam Aldrin, Peter Psenak, RFC8476, December 2018
[draft-ietf-idr-bgp-ls-segment-routing-msd] "Signaling MSD (Maximum SID Depth) using Border Gateway Protocol Link-State", Jeff Tantsura, Uma Chunduri, Gregory Mirsky, Siva Sivabalan, Nikos Triantafillis, draft-ietf-idr-bgp-ls-segment-routing-msd-04 (Work in Progress), February 2019
[draft-ietf-idr-te-lsp-distribution] "Distribution of Traffic Engineering (TE) Policies and State using BGP-LS", Stefano Previdi, Ketan Talaulikar, Jie Dong, Mach Chen, Hannes Gredler, Jeff Tantsura, draft-ietf-idr-te-lsp-distribution-10 (Work in Progress), February 2019
[draft-ietf-pce-association-diversity] "Path Computation Element communication Protocol (PCEP) extension for signaling LSP diversity constraint", Stephane Litkowski, Siva Sivabalan, Colby Barth, Mahendra Singh Negi, draft-ietf-pce-association-diversity-06 (Work in Progress), February 2019
[draft-ietf-idr-tunnel-encaps] "The BGP Tunnel Encapsulation Attribute", Eric C. Rosen, Keyur Patel, Gunter Van de Velde, draft-ietf-idr-tunnel-encaps-11 (Work in Progress), February 2019
[SR.net]
[GMPLS-UNI]
[Cisco WAE]
[Google Espresso] "Taking the Edge off with Espresso: Scale, Reliability and Programmability for Global Internet Peering.", Kok-Kiong Yap, Murtaza Motiwala, Jeremy Rahe, Steve Padgett, Matthew Holliman, Gary Baldus, Marcus Hines, Taeeun Kim, Ashok Narayanan, Ankur Jain, Victor Lin, Colin Rice, Brian Rogan, Arjun Singh, Bert Tanaka, Manish Verma, Puneet Sood, Mukarram Tariq, Matt Tierney, Dzevad Trumic, Vytautas Valancius, Calvin Ying, Mahesh Kallahalla, Bikash Koley, and Amin Vahdat, Proceedings of the Conference of the ACM Special Interest Group on Data Communication (SIGCOMM '17), 2017
[Facebook EdgeFabric] "Engineering Egress with Edge Fabric: Steering Oceans of Content to the World.", Brandon Schlinker, Hyojeong Kim, Timothy Cui, Ethan Katz-Bassett, Harsha V. Madhyastha, Italo Cunha, James Quinn, Saif Hasan, Petr Lapukhov, and Hongyi Zeng, Proceedings of the Conference of the ACM Special Interest Group on Data Communication (SIGCOMM '17), 2017
[Alibaba NetO] "NetO: Alibaba's WAN Orchestrator", Xin Wu, Chao Huang, Ming Tang, Yihong Sang, Wei Zhou, Tao Wang, Yuan He, Dennis Cai, Haiyong Wang, and Ming Zhang, SIGCOMM 2017 Industrial Demos, 2017
[SERA] "SERA: SEgment Routing Aware Firewall for Service Function Chaining scenarios", Ahmed Abdelsalam, Stefano Salsano, Francois Clad, Pablo Camarillo and Clarence Filsfils, IFIP Networking, Zurich, Switzerland, May 2018
[IGP Tuning] "IGP Tuning in an MPLS Network", Martin Horneffer, NANOG 33, February 2005, Las Vegas
[SWAN] "Achieving high utilization with software-driven WAN.", Chi-Yao Hong, Srikanth Kandula, Ratul Mahajan, Ming Zhang, Vijay Gill, Mohan Nanduri, and Roger Wattenhofer, Proceedings of the ACM SIGCOMM 2013 conference on SIGCOMM (SIGCOMM '13), 2013

1. While a typical deployment uses RRs to scale BGP, any BGP distribution model can be used: full-mesh, RRs, Confederations.
2. Historically, serialization delay (packet size divided by the Gbps rate of the link) was also a component. It has now become negligible.
3. Not to be confused with link colors or affinity colors, which are used to mark links in order to include or exclude them from a TE path.

Section I – Foundation This section describes the foundational elements of SR Traffic Engineering.

2 SR Policy What we will learn in this chapter:
- An SR Policy at a headend is identified by the endpoint and a color
- An SR Policy has one or more candidate paths, which are often simply called "paths"
- A candidate path is in essence a SID list (a list of segments), or a set of SID lists
- A candidate path can be dynamic or explicit
- A candidate path can be instantiated via CLI/NETCONF or signaled via PCEP or BGP-TE
- The valid path with the highest preference is selected as active path
- The active path's SID lists are programmed in the forwarding table
- The SID lists of an SR Policy are the SID lists of its active path
- An SR Policy is bound to a Binding-SID
- A valid SR Policy has its Binding-SID installed in the MPLS forwarding table as a local incoming label with action "Pop and push the SID list of the SR Policy"

We first introduce the concept of an SR Policy through an example, then we provide a formal definition and we revisit the key concept of a Binding Segment. The last section introduces the configuration model.

2.1 Introduction Segment Routing allows a headend node to steer a packet flow along any path in the network by imposing an ordered list of segments on the packets of the flow. No per-flow state is created on the intermediate nodes; the per-flow state is carried within the packet's segment list. Each segment in the segment list identifies an instruction that the node must execute on the packet, the most common being a forwarding instruction.

Segments and Segment Identifiers (SIDs) A Segment is an instruction that a node executes on the incoming packet that carries the instruction in its header. Examples of instructions are: forward the packet along the shortest path to its destination, forward the packet through a specific interface, deliver the packet to a given application/service instance, etc. A Segment Identifier (SID) identifies a Segment. The format of a SID depends on the implementation. The SR MPLS implementation uses MPLS labels as SIDs. SRv6 uses SIDs in the format of IPv6 addresses, but they are not actually IPv6 addresses as their semantic is different. While there is a semantic difference between segments and SIDs, the latter being the identifier of the former, both terms are often used interchangeably and the meaning can be derived from the context. This also applies to their derivatives such as "segment list" and "SID list".

In the MPLS instantiation of Segment Routing, a SID is an MPLS label and a SID list is an MPLS stack. In the context of topological traffic-engineering, the SID list (the MPLS label stack) is composed of prefix-SIDs (shortest path to the related node) and adjacency SIDs (specific use of a link). A key benefit of the Segment Routing solution is that it integrates topological traffic-engineering and service chaining in the same solution. The SID list can be a combination of Prefix-SIDs, AdjacencySIDs and service SIDs. The first two help to guide the packets topologically through the network while the latter represents services distributed through the network (hardware appliance, VM or container). In this book (Part II), we will focus on the topological traffic-engineering (more briefly, traffic engineering or SR-TE). Part III will explain how the same solution is leveraged for SR-based service chaining.

The central concept in our SR-TE solution is "SR Policy". SR Policy governs the two fundamental actions of a traffic engineering solution:
- Expressing a specific path through the network, typically different from the shortest path computed by the IGP
- Steering traffic onto the specific path

In the remainder of this section, we use four illustrations to introduce some key characteristics of an SR Policy. The first example illustrates the concept of explicit path. The second example introduces the concept of validation and selection of an explicit path. The third example introduces the concept of a dynamic path with the minimization of a cumulative metric. The fourth example extends this latter concept by adding a constraint to the dynamic path. The ability to engineer a path through a network cannot be dissociated from the steering of traffic onto that path. Automated Steering (AS) is the SR-TE functionality that automatically steers colored service routes onto the appropriate SR Policy. We briefly present the AS functionality in the first example, while chapter 5, "Automated Steering" and chapter 10, "Further Details on Automated Steering" cover the steering concept in detail.

SR Policy is not a tunnel! “The term “tunnel” historically involved several issues that we absolutely did not want Segment Routing to inherit: 1/ preconfiguration at a headend towards a specific endpoint with a specific SLA or path; 2/ performance degradation upon steering; 3/ scale limitation due to the handling of a tunnel as an interface; 4/ autoroute steering as sole steering mechanism. The think-out-of-the-box and simplification/automation mindset pushed us to reconsider the key concepts at the base of an SR solution. This allowed us to identify the notions of SR Policy, color of an SR Policy, On-Demand Policy (ODN) and Automated Steering (AS). These concepts were not present in the RSVP-TE tunnel model and allowed us to simplify the TE operation.” — Clarence Filsfils

The illustrations assume a single IGP area network (e.g., Figure 2‑1).

Figure 2-1: Network Topology with IGP Shortest path and SR Policy with explicit path

The default IGP metric for the links in this network is 10. The link between Node3 and Node4 is an expensive, low capacity link and therefore the operator has assigned it a higher IGP link metric of 100. Also, the link between Node8 and Node5 has a higher IGP link metric of 100.

2.1.1 An Explicit Candidate Path of an SR Policy We illustrate the concept of an explicit candidate path of an SR Policy. We briefly introduce the Automated Steering (AS) of a flow onto an SR Policy. Note that, while the exact term is "candidate path", we may use the shorter term "path" instead. We assume that the operator wants some traffic from Node1 to Node4 to go via the path 1→7→6→8→5→4. The operator could express this path as <16008, 24085, 16004>: shortest-path to Node8 (Prefix-SID of Node8), adjacency from Node8 to Node5 (Adjacency-SID), shortest-path to Node4 (Prefix-SID of Node4). Per the illustration conventions used in this book, 24085 is the Adj-SID label of Node8 for its link to Node5.

Such a SID list expressed by the operator is called "explicit". The term "explicit" means that the operator (potentially via an external controller) computes the source-routed path and programs it on the headend router. The headend router is explicitly told what source-routed path to use. The headend router simply receives a SID list and uses it as such. No intent is associated with the SID list: that is, the headend does not know why the operator selected that SID list. The headend just instantiates it as per the explicit configuration. This explicit SID list (the sequence of instructions to go from Node1 to Node4) is configured in an SR Policy at headend Node1:

SR Policy (headend Node1, color blue, endpoint Node4)
  candidate-paths:
    1. explicit: SID list <16008, 24085, 16004>

We see that the SR Policy is identified by a tuple of three entries: the headend at which the SR Policy is instantiated, the endpoint and a color. As explained in the introduction, the color is a way to express an intent. At this point in the illustration, think of the color as a way to distinguish multiple SR Policies, each with its own intent, from the same headend Node1 to the same endpoint Node4. For example, Node1 could be configured with the following two SR Policies:

SR Policy (headend Node1, color blue, endpoint Node4)
  candidate-paths:
    1. explicit: SID list <16008, 24085, 16004>

SR Policy (headend Node1, color orange, endpoint Node4)
  candidate-paths:
    1. explicit: SID list <16002, 16003, 16004>
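As a preview of the configuration model covered in section 2.4, the blue SR Policy above might be configured on Node1 along the following lines. This is an IOS XR-style sketch: the policy and segment-list names are hypothetical, the color value 10 for blue follows the example given later in this section, and the exact syntax may vary by release.

segment-routing
 traffic-eng
  segment-list SL-BLUE
   ! Prefix-SID of Node8, Adj-SID of Node8 for the link to Node5, Prefix-SID of Node4
   index 10 mpls label 16008
   index 20 mpls label 24085
   index 30 mpls label 16004
  policy BLUE
   color 10 end-point ipv4 1.1.1.4
   candidate-paths
    preference 100
     explicit segment-list SL-BLUE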

These two SR Policies need to be distinguished because Node1 may want to concurrently steer a flow into the blue SR Policy while another flow is steered into the orange SR Policy. In fact, you can already intuit that the color is not only for SR Policy distinction. It plays a key role in steering traffic, and more specifically in automatically steering traffic in the right SR Policy by similarly coloring the service route (i.e., intuitively, the orange-colored traffic destined to Node4 will go via the orange SR Policy to Node4 and the blue-colored traffic destined to Node4 will go via the blue SR Policy to Node4). A fundamental component of the SR-TE solution is “Automated Steering” (AS). While this is explained in detail later in the book, we provide a brief introduction here.

Figure 2-2: Two SR Policies from Node1 to Node4

Automated Steering allows the headend Node1 to automate the steering of its traffic into the appropriate SR Policy. In this example, we see in Figure 2‑2 that Node4 advertises two BGP routes to Node1: 1.1.1.0/24 with color orange and 2.2.2.0/24 with color blue. Based on these colors,

headend Node1 automatically steers any traffic to orange prefix 1.1.1.0/24 into the SR Policy (orange, Node4) and any traffic to blue prefix 2.2.2.0/24 into the SR Policy (blue, Node4). Briefly, applying Automated Steering at a headend allows the headend to automatically steer a service route (e.g., BGP, PW, etc.) into the SR Policy that matches the service route color and the service route endpoint (i.e., BGP nexthop). Remember that while in this text we refer to colors as names for ease of understanding, a color is actually a number. For example, the color blue could be the number 10. To advertise a color with a prefix, BGP adds a color extended community attribute to the advertisement (more details in chapter 5, "Automated Steering"). Let us also reiterate a key benefit of Segment Routing. Once the traffic is steered into the SR Policy configured at Node1, the traffic follows the traffic-engineered path without any further state in the network. For example, if the blue traffic destined for Node4 is steered into the SR Policy (Node1, blue, Node4), then the blue traffic will go via 1→7→6→8→5→4. No state is present at 7, 6, 8, 5 or 4 for this flow. The only state is present at the SR Policy headend. This is a key advantage compared to RSVP-TE, which creates per-flow state throughout the network. In these first examples, the SR Policies had a single explicit candidate path per policy, specified as a single explicit SID list. In the next illustrations, we expand the SR Policy concept by introducing multiple candidate paths per policy and then the notion of dynamic candidate paths.
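For illustration, the coloring on the egress PE Node4 could look roughly like the following IOS XR-style route-policy (the value 10 for blue follows the text above; the value 30 for orange and all names are hypothetical, and the syntax may vary by release):

extcommunity-set opaque BLUE
  10
end-set
!
extcommunity-set opaque ORANGE
  30
end-set
!
route-policy COLOR-ROUTES
  if destination in (2.2.2.0/24) then
    set extcommunity color BLUE
  elseif destination in (1.1.1.0/24) then
    set extcommunity color ORANGE
  endif
  pass
end-policy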

2.1.2 Path Validation and Selection In the second example, we assume that the operator wants some traffic from Node1 to Node4 to go via one of two possible candidate paths, 1→7→6→8→5→4 and 1→2→3→4. These paths can be expressed with the SID lists <16008, 24085, 16004> and <16002, 16003, 16004>, respectively. The operator prefers using the first candidate path and only uses the second candidate path if the first one is unusable. Therefore, he assigns a preference value to each candidate path, with a higher preference value indicating a more preferred path. The first path has a preference 100 while the second path has a preference 50. Both paths are shown in Figure 2‑3.

Figure 2-3: SR Policy (blue, Node4) with two candidate paths

A candidate path is usable when it is valid. A common path validity criterion is the reachability of its constituent SIDs. The validation rules are specified in chapter 3, "Explicit Candidate Path".

When both candidate paths are valid (i.e., both paths are usable), headend Node1 selects the highest preference path and installs the SID list of this path (<16008, 24085, 16004>) in its forwarding table. At any point in time, the blue traffic that is steered into this SR Policy is only sent on the selected path; any other candidate paths are inactive. A candidate path is selected when it has the highest preference value among all the valid candidate paths of the SR Policy. The selected path is also referred to as the “active path” of the SR Policy. In case multiple valid candidate paths have the same preference, the tie-breaking rules described in chapter 16, "SR-TE Operations" are evaluated to select a path.

A headend re-executes the active path selection procedure whenever it learns about a new candidate path of an SR Policy, the active path is deleted, or an existing candidate path is modified or its validity changes.
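As an illustration of this selection rule (highest preference among the valid candidate paths), the following Python sketch picks the active path. The candidate-path records are hypothetical and simplified; this is not an actual router data structure.

# Sketch of active candidate path selection.
candidate_paths = [
    {"preference": 100, "sid_list": ["16008", "24085", "16004"], "valid": True},
    {"preference": 50,  "sid_list": ["16003", "24034"],          "valid": True},
]

def select_active_path(paths):
    """Return the valid candidate path with the highest preference."""
    valid = [p for p in paths if p["valid"]]
    if not valid:
        return None   # no valid path: the SR Policy goes operationally down
    return max(valid, key=lambda p: p["preference"])

print(select_active_path(candidate_paths)["preference"])   # 100
# If the preference-100 path is invalidated, the preference-50 path takes over.
candidate_paths[0]["valid"] = False
print(select_active_path(candidate_paths)["preference"])   # 50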

At some point, a failure occurs in the network. The link between Node8 and Node5 fails, as shown in Figure 2‑4. At first, Topology Independent Loop-Free Alternate (TI-LFA) protection ensures that the traffic flows that were traversing this link are quickly (in less than 50 milliseconds) restored. Chapter 8, "Network Resiliency" describes TI-LFA protection of SR Policies. Refer to SR book Part I for more details of TI-LFA.

Eventually, headend Node1 learns via IGP flooding that the Adj-SID 24085 of the failed link has become invalid. Node1 evaluates the validity of the path's SID list [1] and invalidates it due to the presence of the invalid Adj-SID. Node1 invalidates the SID list and the candidate path and re-executes the path selection process. Node1 selects the next highest preference valid candidate path, the path with preference 50. Node1 installs the SID list of this path – <16003, 24034> – in the forwarding table. From then on, the blue traffic that is steered into this SR Policy is sent on the newly selected path.

Figure 2-4: SR Policy (blue, Node4) with two candidate paths – highest preference path is invalid

After restoring the failed link, the candidate path with preference 100 becomes valid again. Headend Node1 performs the SR Policy path selection procedure again, selects the valid candidate path with the highest preference and updates its forwarding table with this path's SID list <16008, 24085, 16004>. The blue traffic that is steered into this SR Policy is again sent on the path 1→7→6→8→5→4, as shown in Figure 2‑3.

2.1.3 A Low-Delay Dynamic Candidate Path

Often, the operator prefers to simply express an intent (e.g., low-delay [2] to a specific endpoint, or avoid links with TE affinity color purple) and let the headend translate this intent into a SID list. This is called a “dynamic” candidate path. The headend dynamically translates the intent into a SID list, and most importantly, continuously responds to any network change by updating the SID list as required to meet the intent. The intent is formally defined as a minimization of an additive metric (e.g., IGP metric, TE metric or link-delay) and a set of constraints (e.g., avoid/include IP address, SRLG, TE affinity).

Leveraging the link-state routing protocol (ISIS, OSPF or BGP-LS), each node distributes its own local information, its own piece of the network jigsaw puzzle. Besides the well-known topology information (nodes, links, prefixes, and their attributes), the link-state advertisement may include SR elements (SRGB, SIDs, …) and other link attributes (delay, loss, SRLGs, affinity, …). The headend node receives all this information and stores it in its local SR-TE database (SR-TE DB). This SR-TE DB contains a complete topological view of the local IGP area, including SR-TE information. The SR-TE DB contains everything that the headend node needs to compute the paths through the network that meet the intent.

The headend uses “native SR” algorithms to translate the “intent” of a dynamic path into a SID list. The term “native SR” highlights that the algorithm has been optimized for SR. It maximizes ECMP and minimizes the SID list length.

Let us now consider an SR Policy from Node1 to Node4 that expresses an intent to provide a low-delay path. To compute the low-delay path, the headend node needs to know the delay of the links in the network. Figure 2‑5 shows the measured delay values for each link. Headend Node1 receives these link-delay metrics via the IGP and adds them to its SR-TE DB. Node1 can now compute the low-delay path to Node4, which is simply the shortest path computation using the link-delay as metric. The resulting path is 1→2→3→4, with a cumulative delay of 12+11+7 = 30.

Figure 2-5: Network topology with measured link-delay values

Node1 encodes the path in the SID list <16003, 24034>; 16003 follows the shortest path to Node3 and hence correctly traverses 1→2→3, and then 3→4 is enforced with the Adj-SID from Node3 to Node4 (24034).

To recap, an SR Policy may be instantiated with a “dynamic” candidate path. A “dynamic” candidate path expresses an intent. The headend uses its SR-TE database and its SR-native algorithms (explained later in the book) to translate the intent into a SID list. Any time the network changes, the headend updates the SID list accordingly. Assuming that the operator uses the “green” color for “low-delay”, the SR Policy that we just analyzed can be summarized as follows:

SR Policy (Node1, green, Node4)
  candidate-paths:
    1. dynamic: delay optimized → SID list <16003, 24034>
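The low-delay computation is essentially a shortest-path run with the link-delay metric. The Python sketch below runs Dijkstra on a partial model of this topology. Only the delays 12, 11 and 7 on links 1–2, 2–3 and 3–4 come from the figure; the other delay values are assumed placeholders for illustration and are not taken from the book.

import heapq

# Link delays (ms). The 1-2, 2-3 and 3-4 values are from Figure 2-5;
# the remaining values are illustrative placeholders only.
delay = {
    (1, 2): 12, (2, 3): 11, (3, 4): 7,
    (1, 7): 20, (7, 6): 20, (6, 8): 10, (8, 5): 6, (5, 4): 20,
    (3, 6): 15, (6, 5): 15,
}

graph = {}
for (a, b), d in delay.items():            # build a bidirectional adjacency map
    graph.setdefault(a, []).append((b, d))
    graph.setdefault(b, []).append((a, d))

def lowest_delay_path(src, dst):
    """Dijkstra shortest path using link-delay as the metric."""
    heap, seen = [(0, src, [src])], set()
    while heap:
        cost, node, path = heapq.heappop(heap)
        if node == dst:
            return cost, path
        if node in seen:
            continue
        seen.add(node)
        for nbr, d in graph.get(node, []):
            if nbr not in seen:
                heapq.heappush(heap, (cost + d, nbr, path + [nbr]))
    return None

print(lowest_delay_path(1, 4))   # (30, [1, 2, 3, 4]) with these delay values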

2.1.4 A Dynamic Candidate Path Avoiding Specific Links

Let us now assume that the operator needs an SR Policy “purple” from Node1 to Node4 which minimizes the accumulated IGP metric from Node1 to Node4 while avoiding the links with affinity color red.

Link affinity colors are a set of properties that can be assigned to a link. Each link in the network can have zero, one or more affinity colors. These colors are advertised in the IGP (and BGP-LS) with the link (the IGP adjacency, really). While we refer here to affinity colors as color names, each affinity color is actually a bit in a bitmap. If a link has a given color, then the bit in the bitmap that corresponds to that color is set.

Colors

The term “color” may refer to an attribute of two distinct elements of a network that must not be confused.

Link affinity color is a commonly used name to indicate an IETF Administrative Group (RFC 3630 and RFC 5305) or Resource Class (RFC 2702). Link affinity colors are link attributes that are used to express some notion of “class”. The link affinity colors can then be used in path computation to include or exclude links with some combination of colors.

SR Policy colors, on the other hand, are used to identify an intent or SLA. Such a color is used to match a service route that requires a given SLA to an SR Policy that provides the path that satisfies this SLA. The SLA color is attached to a service route advertisement as a community.

As shown in Figure 2‑6, only the link between Node7 and Node6 has affinity color red. Node6 and Node7 distribute this affinity color information within the network in the IGP. Node1 receives this information and inserts it into its SR-TE DB.

Figure 2-6: Network topology with link affinity

To compute the path, Node1 prunes the links that do not meet the constraint “avoiding red links” from the topology model in its SR-TE DB and computes the IGP metric shortest path to Node4 on the pruned topology. The resulting path is 1→2→3→6→5→4. Node1 encodes this path in the optimal SID list <16003, 16004>. 16003 is the Prefix-SID of Node3, transporting the packet via the IGP shortest path to Node3. 16004 is the Prefix-SID of Node4, transporting the packet via the IGP shortest path to Node4. This SR Policy is summarized as follows:

SR Policy (Node1, purple, Node4)
  candidate-paths:
    1. dynamic: minimize IGP-metric AND exclude red links → SID list <16003, 16004>
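The following Python sketch illustrates this prune-then-compute approach: links carrying the excluded affinity are removed from the graph before a standard shortest-path run on the IGP metric. The metrics are illustrative placeholders (only the red affinity on link 6–7 comes from the example); the 3–4 metric is set high so that, as in the book topology, the IGP shortest path does not use the direct 3–4 link.

import heapq

# (node, node): (igp_metric, set_of_affinity_colors) -- illustrative values.
links = {
    (1, 2): (10, set()), (2, 3): (10, set()), (3, 4): (100, set()),
    (1, 7): (10, set()), (6, 7): (10, {"red"}), (6, 8): (10, set()),
    (5, 8): (10, set()), (4, 5): (10, set()), (3, 6): (10, set()),
    (5, 6): (10, set()),
}

def spf(src, dst, exclude_any=frozenset()):
    """IGP shortest path on the topology pruned of links that carry any
    of the excluded affinity colors."""
    graph = {}
    for (a, b), (metric, colors) in links.items():
        if colors & set(exclude_any):
            continue                         # prune links with excluded affinity
        graph.setdefault(a, []).append((b, metric))
        graph.setdefault(b, []).append((a, metric))
    heap, seen = [(0, src, [src])], set()
    while heap:
        cost, node, path = heapq.heappop(heap)
        if node == dst:
            return cost, path
        if node in seen:
            continue
        seen.add(node)
        for nbr, metric in graph.get(node, []):
            if nbr not in seen:
                heapq.heappush(heap, (cost + metric, nbr, path + [nbr]))
    return None

print(spf(1, 4, exclude_any={"red"}))   # (50, [1, 2, 3, 6, 5, 4]) with these metrics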

In section 2.2 we formalize the SR Policy concepts illustrated in this section.

2.1.5 Encoding a Path in a Segment List

A segment list expresses a path as a sequence of segments. Each segment specifies a part of the path and, when taken in isolation, is independent from the intent of the end-to-end path. SR-TE derives an optimal segment list to encode the end-to-end path by using the segments at its disposal. By default, the available segments are the regular IGP prefix segments (following the IGP shortest path) and IGP adjacency segments [3]. In the low-delay example in section 2.1.3, SR-TE encoded the low-delay path in a sequence of a Prefix-SID and an Adj-SID. The illustration of that example is repeated here in Figure 2‑7 for ease of reference.

Figure 2-7: Encoding the low-delay path in a segment list

The low-delay path from Node1 to Node4 is 1→2→3→4. Each node advertises a Prefix-SID and an Adj-SID for each of its adjacencies. These are the segments that SR-TE has available in its toolbox to encode the path.

Since this path has no ECMP, SR-TE could encode the path using the sequence of Adj-SIDs of all links that the path traverses. This is not optimal as it requires a SID for each hop. SR-TE sees that the portion of the path 1→2→3 can be expressed by the Prefix-SID of Node3. Indeed, the IGP shortest path from Node1 to Node3 is 1→2→3. The IGP shortest path from Node3 to Node4 is 3→6→5→4, which does not match the desired path 3→4. The path 3→4 can be expressed by the Adj-SID 24034 of Node3 for the link to Node4.
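A simple way to picture this encoding step is a greedy walk along the desired path: keep extending a prefix segment as long as the IGP shortest path still matches the desired path, otherwise fall back to an adjacency segment. The sketch below is a naive illustration of that idea under assumed IGP shortest paths and simplified SID names; it is not the actual SR-native algorithm, which among other things also exploits ECMP.

# Naive sketch of encoding an explicit path into a SID list.
igp_paths = {                      # IGP shortest paths assumed for illustration
    (1, 2): [1, 2],
    (1, 3): [1, 2, 3],
    (1, 4): [1, 7, 6, 5, 4],       # the plain IGP path 1->4 avoids the 3-4 link
    (3, 4): [3, 6, 5, 4],
}

def encode(path):
    sids, start = [], 0
    while start < len(path) - 1:
        # extend a prefix segment as far as the IGP shortest path matches
        best = None
        for i in range(start + 1, len(path)):
            if igp_paths.get((path[start], path[i])) == path[start:i + 1]:
                best = i
        if best is not None:
            sids.append(f"Prefix-SID({path[best]})")
            start = best
        else:                      # no match: use the adjacency segment of this hop
            sids.append(f"Adj-SID({path[start]}->{path[start + 1]})")
            start += 1
    return sids

print(encode([1, 2, 3, 4]))        # ['Prefix-SID(3)', 'Adj-SID(3->4)']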

2.2 SR Policy Model

An SR Policy is uniquely identified by a tuple consisting of the following three elements:

Headend
Endpoint
Color

The headend is where the policy is instantiated/implemented. The endpoint is typically the destination of the SR Policy, specified as an IPv4 or IPv6 address. The color is an arbitrary 32-bit numerical value used to differentiate multiple SR Policies with the same headend and endpoint. The color is key for the “Automated Steering” functionality. The color typically represents an intent, a specific way to reach the endpoint (e.g., low-delay, low-cost with SRLG exclusion, etc.).

At a given headend node, an SR Policy is fully identified by the (color, endpoint) tuple. In this book, when we assume that the headend is well-known, we will often refer to an SR Policy as (color, endpoint). Only one SR Policy with a given color C can exist between a given pair (headend node, endpoint). In other words: each SR Policy tuple (headend node, color, endpoint) is unique.

As illustrated in Figure 2‑8, an SR Policy has at least one candidate path and a single active candidate path. The active candidate path is the valid path of best preference. The SID list of a policy is the SID list of its active path.

Figure 2-8: SR Policy model

For simplicity we start with the assumption that each candidate path has only one SID list. In reality, each candidate path can have multiple SID lists, each with an associated load-balancing weight. The traffic on that candidate path is then load-shared over all valid SID lists of that path, in accordance with their weight ratio. This is explained in more detail in chapter 16, "SR-TE Operations".

2.2.1 Segment List

A Segment List (SID list) is a sequence of SIDs that encodes the path of a packet through the network. In the SR MPLS implementation, a SID is a label value and a SID list is a stack of labels. For a packet steered into the SR Policy, this stack of labels (SID list) is imposed on the packet's header. A SID list is represented as an ordered list <S1, S2, …, Sn>, where S1 is the first SID, which is the top label of the SR MPLS label stack, and Sn is the last SID, the bottom label for SR MPLS.

In the SR-MPLS implementation, there are two ways to configure a SID as part of a SID list: by directly specifying its MPLS label value (e.g., 16004) or by providing a segment descriptor (e.g., IP address 1.1.1.4) that the headend node translates into the corresponding label value. There is an important difference between the two options.

A SID expressed as an MPLS label value is checked only if it is in the first position of the SID list. It is checked (validated) to find the outgoing interface and next-hop. A SID expressed as a segment descriptor is always checked for validity. The headend must indeed translate that segment descriptor into an MPLS label (i.e., what is imposed on the packets are label stacks); this act of translating the segment descriptor to the MPLS label is the validity check. If it works, then the SID is valid. If the translation fails (the segment descriptor IP address is not seen in the SR-TE DB, or the IP address is seen but without a SID), then the SID is invalid. A SID bound to a prefix of a failed node or a failed adjacency is invalid. That SID is not in the SR-TE DB and the headend cannot translate the segment descriptor into an MPLS label value.

Most often an explicit path will be configured with all SIDs expressed as MPLS label values. This would be the case when an external controller has done all the computations and is actively (statefully) monitoring the policy. In such a case, the controller is in charge and it does not want the headend to second-guess its operation.

There is a second reason to express an explicit SID as an MPLS label value: when the SID belongs to a remote domain. In that case, the headend has no way to validate the SID (it does not have the link-state topology of the remote domain), hence an MPLS label value is used to avoid the validity check.

“I can't emphasize enough how much having complete SR-based MPLS connectivity set up "for free" by the IGP allows you to concentrate on the real work of traffic engineering. For example, you can craft an explicit path to navigate traffic through a critically-congested region, then use a node SID at the bottom of the stack to get the rest of the way to egress via IGP shortest path. And you never need to worry about LSP signaling. ” — Paul Mattes

2.2.2 Candidate Paths

A candidate path of an SR Policy represents a specific way to transport traffic from the headend to the endpoint of the corresponding SR Policy.

A dynamic candidate path is expressed as an optimization objective and a set of constraints. Using the native SR algorithms and the information in its local SR-TE DB, the headend computes a solution to the optimization problem [4], and provides the solution (the optimal path) as a SID list. If the headend does not have the required information in its SR-TE DB to compute the path, the headend may delegate the computation to a controller or Path Computation Element (PCE). Whenever the network state changes, the path is automatically recomputed. Read chapter 4, "Dynamic Candidate Path" for more information about dynamic paths.

An explicit candidate path is expressed as a SID list. The SID list can be provided to the headend in various ways, most commonly by configuration or signaled by a controller. Despite its name, an explicit path is likely the result of a dynamic computation. The difference with the dynamic path is that the headend is not involved in the computation of the explicit path's SID list; the headend only receives the result as an explicit SID list. Read chapter 3, "Explicit Candidate Path" for more information about explicit paths.

Figure 2-9: Dynamic and explicit candidate paths

Instead of specifying a single candidate path for an SR Policy, an operator may want to specify multiple possible candidate paths with an order of preference. Refer to Figure 2‑3 of the example in section 2.1.2, where the operator specified two candidate paths for the SR Policy from Node1 to Node4. At any point in time, one candidate path is selected and installed in the forwarding table. This is the valid candidate path with the highest preference. Upon invalidation of the selected path, the next highest preference valid path becomes the selected path.

Each candidate path has a preference. The default value is 100. The higher the preference, the more preferred the path.

The active candidate path of an SR Policy is the valid path with the highest preference. A candidate path is usable if it is valid. The validation process of explicit paths is detailed in chapter 3, "Explicit Candidate Path". Briefly, the first SID must be resolved on a valid outgoing interface and next-hop, and the SR-TE DB must be able to resolve all the other SIDs that are expressed as segment descriptors into MPLS labels. To validate a dynamic path, the path is recomputed.

A headend re-executes the active path selection procedure whenever it learns about a new candidate path of an SR Policy, the active path is deleted, or an existing candidate path is modified or its validity changes. In other words, the active path of an SR Policy, at any time, is the valid candidate path with the highest preference value.

A headend may be informed about candidate paths for a given SR Policy by various means, including local configuration, NETCONF, PCEP or BGP (we will discuss these different mechanisms throughout the book). The source of the candidate path does not influence the selection of the SR Policy's active path; the SR Policy's active path is selected based on its validity and its preference. A later section will explain how a single active path is selected when an SR Policy has multiple valid candidate paths with the same preference.

Each candidate path can be learned via a different way

“A headend can learn different candidate paths of an SR Policy via different means: some via local configuration, others via PCEP or BGP SR-TE. The SR-TE solution is designed to be modular and hence the SR Policy model allows for mixing candidate paths from various sources.

In practice, the reader can assume that one single source is used in a given deployment model. In a distributed control-plane model, the candidate path (or more likely the ODN dynamic template it derives from) is likely learned by the headend via local configuration (itself automated by a solution such as Cisco NSO). In a centralized control-plane model, the (explicit) candidate path is likely learned by the headend from the controller via BGP SR-TE or PCEP.

In some specific and less frequent deployments, the operator may mix different sources: some base candidate path is learned from the local configuration while some specific candidate paths are learned via BGP SR-TE or PCEP. Likely these latter are more preferred when present. As such, we defined the SR Policy concept such that each of its candidate paths can be learned in a different way: configuration, NETCONF, PCEP and BGP SR-TE.

In practice, we advise you to focus on the most likely use-case: all of the candidate paths of a policy are learned via the same mechanism (e.g., configuration). ”
— Clarence Filsfils

2.3 Binding Segment

The Binding Segment (BSID) is fundamental to SR-TE. A Binding-SID is bound to an SR Policy. It provides the key into the bound SR Policy. On a given headend, a BSID is bound to a single SR Policy at any point in time. The function of a BSID (i.e., the instruction it represents) is to steer labeled packets into its associated SR Policy.

In the SR MPLS implementation, a BSID B bound to an SR Policy P at headend H is a local label of H. Only the headend H has state for this SR Policy and hence has a forwarding entry for B. When a remote node R wants to steer packets into the SR Policy P of headend node H, the remote node R pushes a label stack with the Prefix-SID of H followed by the BSID of P. The Prefix-SID of H may be replaced by any SID list that eventually reaches H. If R is attached to H, then R can simply push the BSID label to steer the packets into P.

The BSID bound to an SR Policy P can be either explicitly provided by the operator or controller, or dynamically allocated by the headend. We will detail the allocation process in a later chapter. The BSID is an attribute of a candidate path and the BSID of an SR Policy is the BSID of the active candidate path.

The BSID of an SR Policy could change upon change of active path “This is a consequence of the ability to learn any candidate path independently. As a result, each candidate path may be learned with a BSID. In practice, we advise you to focus on the most likely use-case: all of the candidate paths of a policy have the same BSID. Hence, upon active path change, the BSID of the policy does not change. ” — Clarence Filsfils

Any packet that arrives at the headend with a BSID on top of its label stack is steered into the SR Policy associated with the BSID. The headend pops the BSID label, pushes the label stack (SID list) associated with the SR Policy on the packet’s header, and forwards the packet according to the first label of the SID list.

For an SR Policy that has a valid path, with SID list <S1, S2, …, Sn> and BSID B1, the headend node installs the following entry in its forwarding table:

Incoming label: B1
Label operation: Pop, Push <S1, S2, …, Sn>
Egress: egress information of S1
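The Python sketch below mimics that forwarding entry with symbolic labels (B1, S1, S2, S3): the headend pops the incoming BSID and pushes the SID list of the bound SR Policy. It is an illustration of the behavior only, not a model of the actual forwarding plane.

# Sketch of the headend BSID forwarding behavior described above.
bsid_table = {
    "B1": ["S1", "S2", "S3"],     # BSID -> SID list of the bound SR Policy
}

def forward(label_stack):
    """Process the top label of a packet arriving at the headend."""
    top, rest = label_stack[0], label_stack[1:]
    if top in bsid_table:
        sid_list = bsid_table[top]
        new_stack = sid_list + rest           # pop the BSID, push the SID list
        return new_stack, f"egress of {sid_list[0]}"
    return label_stack, f"egress of {top}"    # plain label switching (simplified)

print(forward(["B1", "service-label"]))
# (['S1', 'S2', 'S3', 'service-label'], 'egress of S1')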

Is the label value of a BSID at time T a reliable identifier of an SR Policy at any time?

“Theoretically, no. As we highlighted previously, in a corner case, each different candidate path may have a different BSID. Hence, upon valid path change, the BSID of an SR Policy may change and hence its label value. Theoretically, the unique time-independent identification of an SR Policy is the tuple (headend, color, endpoint).

However, in a normal deployment, all the candidate paths of the same policy are defined with the same BSID and hence there is a single and stable BSID per policy (whatever the selected candidate path) and hence the BSID is in practice a good identifier for a policy. ”
— Clarence Filsfils

In the topology of Figure 2‑10, Node1 is the headend of an SR Policy with SID list <16003, 24034>. Node1 allocated BSID 40104 for this SR Policy.

Figure 2-10: Binding-SID

When Node1 instantiated the SR Policy, it installed the forwarding entry:

Incoming label (Binding-SID): 40104
Outgoing label operation: pop, push <16003, 24034>
Egress interface: egress interface of label 16003: to Node2

A remote node, such as Node10 in Figure 2‑10, can steer a packet into Node1's SR Policy by including the BSID of the SR Policy in the label stack of the packet. Node1's SR Policy (green, 1.1.1.4) is then used as a Transit Policy in the end-to-end path from Node10 to Node4. The label stack of a packet going from Node10 to Node4 is shown in Figure 2‑10. In the example, Node10 imposes the label stack <16001, 40104> on the packet and sends it towards Node9. Label 16001, the Prefix-SID of Node1, brings the packet to Node1. BSID 40104 steers the traffic into the SR Policy: label 40104 is popped and labels <16003, 24034> are pushed.

The headend of a Transit Policy steers the packet into this Transit Policy without re-classifying the packet. The packet classification was done by a node located remotely from the headend node, Node10 in this example. Node10 decided to steer a packet into this specific end-to-end SR Policy to Node4. Node10 can classify this packet based on its destination address only, or it could also take other elements into account (source address, DSCP, …). Node10 encoded the result of the classification as a BSID in the segment list imposed on the packet. Node1 simply forwards the packet according to this packet's label stack.

The BSID is further discussed in chapter 9, "Binding-SID and SRLB", and the role of the BSID in traffic steering is also covered in chapter 10, "Further Details on Automated Steering".

2.4 SR Policy Configuration

This section shows the configurations of the SR Policies that were presented in the examples of the introduction section of this chapter.

An Explicit Candidate Path of an SR Policy

For ease of reference the network topology is repeated here in Figure 2‑11. This illustration shows two SR Policies on headend Node1. Both SR Policies have the same endpoint Node4 and each has an explicit candidate path.

Figure 2-11: Two SR Policies from Node1 to Node4

The configuration of headend Node1 in Example 2‑1 specifies two SR Policies with the names POLICY1 and POLICY2. SR Policies are configured under the segment-routing traffic-eng configuration section. The name of an SR Policy is user-defined and is unique on the headend node.

Example 2-1: SR Policy configuration Node1

segment-routing
 traffic-eng
  policy POLICY1   !! (blue, Node4)
   color 20 end-point ipv4 1.1.1.4
   candidate-paths
    preference 100
     explicit segment-list SIDLIST1
  !
  policy POLICY2   !! (orange, Node4)
   color 40 end-point ipv4 1.1.1.4
   candidate-paths
    preference 100
     explicit segment-list SIDLIST2
  !
  segment-list name SIDLIST1
   index 10 mpls label 16008   !! Prefix-SID Node8
   index 20 mpls label 24085   !! Adj-SID link 8->5
   index 30 mpls label 16004   !! Prefix-SID Node4
  !
  segment-list name SIDLIST2
   index 10 mpls label 16003   !! Prefix-SID Node3
   index 20 mpls label 24034   !! Adj-SID link 3->4

SR Policy POLICY1 is configured with a color value 20 and endpoint address 1.1.1.4. The color value 20 is chosen by the operator and represents a specific service level or specific intent for the policy. The endpoint 1.1.1.4 is the loopback prefix advertised by Node4, as shown in Figure 2‑11. One candidate path is specified with preference 100: an explicit path with a segment list named SIDLIST1. SIDLIST1 is a locally significant user-defined name that identifies the segment list.

The segment list SIDLIST1 specifies the SIDs in increasing index order, which maps to the top-to-bottom order of the SR MPLS label stack. The first SID (index 10) is the Prefix-SID 16008 of Node8's loopback prefix 1.1.1.8/32. The second SID (index 20) is the Adjacency-SID 24085 of Node8 for the adjacency to Node5. The third and last SID (index 30) is the Prefix-SID 16004 of Node4's loopback prefix 1.1.1.4/32. The resulting candidate path (1→7→6→8→5→4) is shown in Figure 2‑11.

A second SR Policy POLICY2 is configured with a color 40 and endpoint address 1.1.1.4. One candidate path is specified with preference 100: an explicit path with a segment list named SIDLIST2.

The first SID (index 10) is the Prefix-SID 16003 of Node3's loopback prefix 1.1.1.3/32. The second and last SID (index 20) is the Adjacency-SID 24034 of Node3 for the adjacency to Node4. The resulting candidate path (1→2→3→4) is shown in Figure 2‑11.

Path Validation and Selection

In an SR Policy with multiple candidate paths, each one is configured with a different preference value. The higher the preference value, the more preferred the candidate path. SR Policy (blue, Node4) of Node1 in Figure 2‑12 has two candidate paths, one with preference 50 and one with preference 100. The preference 100 candidate path is the current active path since it has the highest preference.

Figure 2-12: SR Policy (blue, Node4) with two candidate paths

Example 2‑2 shows the SR Policy configuration on headend Node1. This SR Policy has two candidate paths. The preference 100 path is an explicit path with segment list SIDLIST1. The second candidate path has preference 50 and is an explicit path with segment list SIDLIST2. Both candidate paths are illustrated in Figure 2‑12.

Example 2-2: SR Policy configuration Node1 – explicit path

segment-routing
 traffic-eng
  policy POLICY1   !! (blue, Node4)
   color 20 end-point ipv4 1.1.1.4
   candidate-paths
    preference 100
     explicit segment-list SIDLIST1
    !
    preference 50
     explicit segment-list SIDLIST2
  !
  segment-list name SIDLIST1
   index 10 mpls label 16008   !! Prefix-SID Node8
   index 20 mpls label 24085   !! Adj-SID link 8->5
   index 30 mpls label 16004   !! Prefix-SID Node4
  !
  segment-list name SIDLIST2
   index 10 mpls label 16003   !! Prefix-SID Node3
   index 20 mpls label 24034   !! Adj-SID link 3->4

A Low-Delay Dynamic Candidate Path

Another SR Policy, named POLICY3, is now configured on Node1 in Figure 2‑13. This SR Policy has a dynamic path, where Node1 dynamically computes the low-delay path to Node4. In the example, all nodes in the topology measure the delay of their links and distribute these delay metrics using the IGP. These measured link-delay metrics are displayed next to the links in the illustration. How these delay metrics are measured and distributed is discussed in chapter 15, "Performance Monitoring – Link Delay". The IGP on Node1 receives all these link-delay metrics and stores them in its database. From these delay metrics, you can deduce that the low-delay path from Node1 to Node4 is 1→2→3→4, as shown in Figure 2‑13. The cumulative delay of this path is 12+11+7 = 30.

Figure 2-13: Network topology with measured link-delay values

The configuration of this SR Policy on Node1 is shown in Example 2‑3. A color value 30 is assigned to this SR Policy with IPv4 endpoint address 1.1.1.4. This color value is chosen by the operator to indicate the “low-delay” SLA.

Example 2-3: SR Policy configuration Node1 – dynamic path

segment-routing
 traffic-eng
  policy POLICY3   !! (green, Node4)
   color 30 end-point ipv4 1.1.1.4
   candidate-paths
    preference 100
     dynamic
      metric
       type delay

One candidate path is specified for this SR Policy: a dynamic path minimizing the accumulated link-delay to the endpoint. The link-delay metric is the optimization objective of this dynamic path. The preference of the candidate path is 100.

A Dynamic Candidate Path Avoiding Specific Links

The next SR Policy on headend Node1 also has a dynamic candidate path. This SR Policy path provides the IGP shortest path avoiding the red links and is illustrated in Figure 2‑14. To compute the path, Node1 prunes the links that do not meet the constraint “avoiding red links” from the network graph in its SR-TE DB and computes the IGP metric shortest path to Node4 on the pruned topology. The resulting path is 1→2→3→6→5→4. Node1 encodes this path in the optimal SID list <16003, 16004>. The first entry 16003 is the Prefix-SID of Node3, transporting the packet via the IGP shortest path to Node3; and the second entry 16004 is the Prefix-SID of Node4, transporting the packet via the IGP shortest path to Node4.

Figure 2-14: Network topology with link affinity

The configuration of this SR Policy on Node1 is shown in Example 2‑4.

A color value 50 is assigned to this SR Policy with IPv4 endpoint address 1.1.1.4. This color value is chosen by the operator to indicate the SLA of this path.

Example 2-4: SR Policy configuration Node1 – dynamic path

segment-routing
 traffic-eng
  policy POLICY4   !! (purple, Node4)
   color 50 end-point ipv4 1.1.1.4
   candidate-paths
    preference 100
     dynamic
      metric
       type igp
      !
     constraints
      affinity
       exclude-any
        name red
  !
  affinity-map
   name red bit-position 3

One candidate path is specified for this SR Policy. The preference of the candidate path is 100. The optimization objective of this path is to minimize the accumulated IGP metric to the endpoint. The path must not traverse any “red” links, as specified by the constraints of this dynamic path: exclude-any name red means to avoid links that have affinity link color “red”.
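Since each affinity name maps to a bit position (here “red” is mapped to bit 3, as in the affinity-map above), the exclude-any check reduces to a bitwise AND between the link's affinity bitmap and the constraint's bitmap. The Python sketch below illustrates this; the link bitmaps are illustrative, with only the red color on link 7–6 taken from the example topology.

# Sketch of the exclude-any affinity check using bitmaps.
AFFINITY_BITS = {"red": 3}          # affinity-map: name red bit-position 3

def bitmap(colors):
    """Build an affinity bitmap from a set of color names."""
    mask = 0
    for color in colors:
        mask |= 1 << AFFINITY_BITS[color]
    return mask

exclude_any = bitmap({"red"})

link_7_6 = bitmap({"red"})          # the only red link in the example topology
link_2_3 = bitmap(set())            # a link without affinity colors

def excluded(link_bitmap):
    return bool(link_bitmap & exclude_any)

print(excluded(link_7_6))           # True  -> pruned from the path computation
print(excluded(link_2_3))           # False -> kept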

2.5 Summary

An SR Policy essentially consists of an ordered list of SIDs. In the SR-MPLS instantiation, this SID list is represented as a stack of MPLS labels that is imposed on packets that are steered into the SR Policy.

An SR Policy is uniquely identified by the tuple (headend node, color, endpoint). When the headend is known, an SR Policy is identified by (color, endpoint). The color is used to distinguish multiple SR Policies between the same headend node and endpoint. It is typically an identifier of the SLA that the SR Policy provides, e.g., color “green” = low-delay. The color is key to automate the traffic steering (Automated Steering (AS)) and the on-demand instantiation of SR Policies (ODN).

An SR Policy has one or multiple candidate paths. A candidate path can be dynamically computed (dynamic path) or explicitly specified (explicit path). For a dynamic path, the SID list is computed by the headend based on an optimization objective and a set of constraints. For an explicit path, the SID list is explicitly specified by the operator or by an application. The headend is not involved in the specification nor the computation of the explicit path's SID list. The headend only needs to validate the explicit SID list.

A candidate path can be instantiated via different sources: configured via CLI/NETCONF or signaled via PCEP or BGP SR-TE. An SR Policy can contain candidate paths from different sources. The active candidate path of an SR Policy is the valid path with the highest preference. Packets steered into an SR Policy follow the SID list associated with the active path of the SR Policy.

An SR Policy is bound to a Binding-SID. The BSID is a key into the SR Policy. In the SR-MPLS instantiation, a BSID is a local label at the headend of the policy. Traffic received by the headend with the BSID as top label is steered into the policy. Specifically, the BSID is popped and the active SID list is pushed onto the packet.

2.6 References

[SR-book-Part-I] "Segment Routing Part I", Clarence Filsfils, Kris Michielsen, Ketan Talaulikar, October 2016, ASIN: B01I58LSUO (Kindle), ISBN-10: 1542369126, ISBN-13: 978-1542369121

[draft-ietf-spring-segment-routing-policy] "Segment Routing Policy Architecture", Clarence Filsfils, Siva Sivabalan, Daniel Voyer, Alex Bogdanov, Paul Mattes, draft-ietf-spring-segment-routing-policy-02 (Work in Progress), October 2018

[RFC7471] "OSPF Traffic Engineering (TE) Metric Extensions", Spencer Giacalone, David Ward, John Drake, Alia Atlas, Stefano Previdi, RFC7471, March 2015

[RFC7810] "IS-IS Traffic Engineering (TE) Metric Extensions", Stefano Previdi, Spencer Giacalone, David Ward, John Drake, Qin Wu, RFC7810, May 2016

1. By default, Node1 does not invalidate the segment list if it is expressed using label values. More information about controlling validation of a segment list can be found in chapter 3, "Explicit Candidate Path".↩

2. In this book, we use the term “delay” even in cases where the term “latency” is commonly used. The reason is that the metric that expresses the link propagation delay is named “link delay” (RFC 7810 and RFC 7471). A low-delay path is then a path with minimal cumulative link-delay metric.↩

3. Chapter 7, "Flexible Algorithm" describes how operators can extend their segment “toolbox” by defining their own Prefix-SIDs.↩

4. Computing a path is really solving an optimization problem. Given the information in the SR-TE DB, find the optimal path (minimizing the specified metric) while adhering to the specified constraints.↩

3 Explicit Candidate Path

What we will learn in this chapter:

An explicit candidate path is formally defined as a weighted set of SID lists
An explicit candidate path is typically a single SID list
An explicit candidate path can be locally configured on the headend node
An explicit candidate path can be instantiated on the headend node by a controller through a signaling protocol
A SID can be expressed as a segment descriptor or an MPLS label value
Validation of an explicit SID list and hence validation of an explicit candidate path
Pros and cons of using a segment descriptor or an MPLS label value
Two use-cases: dual-plane disjoint paths and TDM migrations

An explicit candidate path of an SR Policy is directly associated with a list of SIDs. These SIDs can be expressed either with their MPLS label values or using abstract segment descriptors that are deterministically resolved into label values by the headend. The latter provides a level of resiliency against changes in the MPLS label allocations and allows the headend to validate the SID list before programming its forwarding table. On the other hand, the resolution procedure requires that the headend knows the label value for each segment descriptor in the SID list. Depending on the reach of the SR Policy and the knowledge of the entity initiating it, one type of expression may thus be preferred over the other.

The instantiation of explicit candidate paths using both types of SID expression is described in this chapter, with examples of scenarios in which each one is used. Two practical use-cases are also detailed to further illustrate the role of explicit candidate paths in real-world deployments.

3.1 Introduction

Formally, an explicit candidate path is defined as a weighted set of SID lists. For example, an explicit candidate path could be defined as one SID list with weight W1 and another SID list with weight W2. If that candidate path is selected as the active path of the policy, then the two SID lists are installed in the dataplane. The traffic flows steered into the SR Policy are load-balanced between the two SID lists with a ratio of W1/(W1+W2) on the first list and W2/(W1+W2) on the second.

In practice, most of the use-cases define an explicit candidate path as a single SID list. Therefore, this chapter focuses on the single-list case, while more details on the multiple SID list generalization are provided in chapter 16, "SR-TE Operations".

The key property of an explicit candidate path is that the SID list is provided to the headend. The headend does not need to compute it and is oblivious of the operator's intent for this candidate path. A segment in an explicit SID list can be expressed either as an SR-MPLS label or as a segment descriptor. Intuitively, the segment descriptor of a Prefix-SID is the IP address of the related prefix and the descriptor of an Adj-SID is the IP address of the related adjacency.

An explicit candidate path can be initiated on a headend node by configuration, using the Command Line Interface (CLI), the classic XML API, or the NETCONF protocol; or by a signaling protocol, such as PCEP or BGP. Although it is possible for an operator to manually configure this type of path on the headend node, explicit paths are typically initiated by a controller that then also takes the responsibility to monitor and maintain them. These are referred to as controller-initiated paths.

The responsibility of the headend for an explicit candidate path is usually limited to:

1. Translating any SID expressed as a segment descriptor into an SR-MPLS label
2. Validating the outgoing interface and next-hop
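As a concrete illustration of the W1/(W1+W2) ratio described at the start of this section, the Python sketch below hashes each flow onto one of the SID lists of a candidate path in proportion to the configured weights. The flow keys, weights and SID lists are hypothetical; this is not how a router implements hashing, only an illustration of the weight semantics.

# Sketch of weighted load-balancing over the SID lists of a candidate path.
import zlib

sid_lists = [
    {"sids": ["16008", "24085", "16004"], "weight": 3},   # W1 = 3
    {"sids": ["16003", "24034"],          "weight": 1},   # W2 = 1
]

def pick_sid_list(flow_key):
    """Map a flow onto a SID list with probability Wi / sum(W)."""
    total = sum(entry["weight"] for entry in sid_lists)
    bucket = zlib.crc32(flow_key.encode()) % total        # stable per-flow hash
    for entry in sid_lists:
        if bucket < entry["weight"]:
            return entry["sids"]
        bucket -= entry["weight"]

counts = {}
for i in range(1000):                                     # 1000 sample flows
    key = tuple(pick_sid_list(f"10.0.{i}.1->20.0.0.1"))
    counts[key] = counts.get(key, 0) + 1
print(counts)   # roughly a 3:1 split between the two SID lists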

3.2 SR-MPLS Labels

The SIDs in the SID list can be specified as MPLS labels, with any value in the MPLS label range. Each SID in an explicit path configuration is associated with an index and the SID list is ordered by increasing SID indexes. In this book the indexes are numbered in increments of 10, but this is not a requirement.

Example 3‑1 shows the required configuration to specify a SID list, named SIDLIST1, with MPLS label values. The first entry (index 10) is the label 16008 that is the Prefix-SID associated with prefix 1.1.1.8/32 on Node8; the second entry (index 20) is the label 15085 that is the Adj-SID of Node8 for its adjacency to Node5; and the third entry (index 30) is the label 16004 that is the Prefix-SID associated with Node4.

Example 3-1: SR Policy with explicit path on Node1 – SID list using MPLS labels

segment-routing
 traffic-eng
  policy POLICY1
   color 10 end-point ipv4 1.1.1.4
   candidate-paths
    preference 100
     explicit segment-list SIDLIST1
  !
  segment-list name SIDLIST1
   index 10 mpls label 16008   !! Prefix-SID Node8
   index 20 mpls label 15085   !! Adj-SID link 8->5
   index 30 mpls label 16004   !! Prefix-SID Node4

The Adj-SID 15085 is a manual Adj-SID configured on Node8 for its adjacency to Node5. Label 15085 is allocated from Node8's SRLB. Since this Adj-SID is configured, it is persistent across reloads.

The SID list SIDLIST1 is used as part of an explicit candidate path for the SR Policy POLICY1, illustrated in Figure 3‑1. The headend node Node1 can directly use the MPLS labels in SIDLIST1 to program the forwarding table entry for POLICY1. Node1 only needs to resolve the first SID 16008 to obtain the outgoing interface to be associated with the SR Policy forwarding entry.

Figure 3-1: Explicit path example

The output in Example 3‑2 shows the SR Policy POLICY1 instantiated on Node1 with the configuration in Example 3‑1. Node1 mapped the first SID label value 16008 to the Prefix-SID (line 16).

Example 3-2: SR Policy using explicit segment list with label values

 1 RP/0/0/CPU0:xrvr-1#show segment-routing traffic-eng policy
 2
 3 SR-TE policy database
 4 ---------------------
 5
 6 Color: 10, End-point: 1.1.1.4
 7   Name: srte_c_10_ep_1.1.1.4
 8   Status:
 9     Admin: up  Operational: up for 00:00:15 (since Jan 24 10:51:03.477)
10   Candidate-paths:
11     Preference: 100 (configuration) (active)
12       Name: POLICY1
13       Requested BSID: dynamic
14       Explicit: segment-list SIDLIST1 (valid)
15         Weight: 1
16           16008 [Prefix-SID, 1.1.1.8]
17           15085
18           16004
19   Attributes:
20     Binding SID: 40014
21     Forward Class: 0
22     Steering BGP disabled: no
23     IPv6 caps enable: yes

As you can see in Example 3‑2, IOS XR uses the auto-generated name srte_c_10_ep_1.1.1.4 to identify the SR Policy. This name is composed of the SR Policy's color 10 and endpoint 1.1.1.4. The configured name POLICY1 is used as the name of the candidate path.

The headend Node1 dynamically allocates the BSID 40014 for this SR Policy, as shown in Example 3‑2 (line 20), and installs the BSID forwarding entry for this SR Policy, as shown in Example 3‑3. The BSID forwarding entry instructs Node1, for any incoming packet that has the BSID 40014 as top label, to pop the BSID label and steer the packet into the SR Policy POLICY1, which causes the SR Policy's SID list to be imposed on the packet.

Example 3-3: SR Policy BSID forwarding entry on headend Node1

RP/0/0/CPU0:xrvr-1#show mpls forwarding labels 40014
Local  Outgoing    Prefix         Outgoing             Next Hop    Bytes
Label  Label       or ID          Interface                        Switched
------ ----------- -------------- -------------------- ----------- --------
40014  Pop         No ID          srte_c_10_ep_1.1.1.4 point2point 0

The forwarding entry of the SR Policy is shown in Example 3‑4. The imposed label stack is (top→bottom): (16008, 15085, 16004). The outgoing interface (Gi0/0/0/1) and next hop (99.1.7.7) are derived from the first SID in the list (16008) and point towards Node7.

Example 3-4: SR Policy imposed segment list on headend Node1

RP/0/0/CPU0:xrvr-1#show segment-routing traffic-eng forwarding policy detail
Color Endpoint   Segment      Outgoing Outgoing     Next Hop     Bytes
      List       Label        Interface             Switched
----- ---------- ------------ -------- ------------ ------------ --------
10    1.1.1.4    SIDLIST1     16008    Gi0/0/0/1    99.1.7.7     0
  Label Stack (Top -> Bottom): { 16008, 15085, 16004 }
  Path-id: 1, Weight: 0
  Packets Switched: 0
  Packets/Bytes Switched: 0/0
(!): FRR pure backup

3.3 Segment Descriptors

A segment descriptor identifies a SID in such a way that the headend node can retrieve the corresponding MPLS label value through a deterministic SID resolution procedure. While multiple types of descriptors exist (see draft-ietf-spring-segment-routing-policy), this chapter focuses on the two descriptor types most commonly used:

IPv4 address identifying a prefix and its corresponding Prefix-SID
IPv4 address identifying a numbered point-to-point link [1] and its corresponding Adj-SID

The segment descriptor of a Prefix-SID is its associated prefix. Since a Prefix-SID is always associated with a prefix, the resolution of a descriptor prefix to the corresponding SID label value is straightforward. The prefix-to-SID-label mapping is advertised via IGP or BGP-LS and stored in each node's SR-TE DB. For example, line 11 of the SR-TE DB snapshot in Example 3‑5 indicates that the prefix 1.1.1.8(/32) is associated with the label 16008.

When several Prefix-SID algorithms are available in a domain, the same prefix can be associated with multiple Prefix-SIDs, each bound to a different algorithm. In that case, the prefix alone is not sufficient to uniquely identify a Prefix-SID and should be completed by an algorithm identifier. If the algorithm is not specified, then the headend selects by default the strict-SPF (algorithm 1) Prefix-SID if available, or the regular SPF (algorithm 0) Prefix-SID otherwise. See chapter 7, "Flexible Algorithm" for details of the different algorithms.

Similarly, an IP address configured on the interface to a point-to-point adjacency can serve as the segment descriptor of an Adj-SID. Such an IP address identifies a specific L3 link in the network. For example, the address 99.5.8.8 is configured on Node8 for its link with Node5 and thus identifies the link between Node5 and Node8. The headend can then derive the direction in which that link should be traversed from the preceding SID in the SID list. In this example, if the preceding SID in the list ends at Node5, then the link should be traversed from Node5 to Node8. Conversely, if the preceding SID were to end at Node8, then that same link should have been traversed from Node8 to Node5. This headend intelligence allows the segment descriptor to be the IP address of either end of the link; it does not specifically need to be a local, or remote, interface address.

The link and direction together identify a specific adjacency that the headend can look up in its SR-TE DB to determine the Adj-SID label value.

A node may advertise several Adj-SIDs for a given adjacency, typically one protected [2] Adj-SID and one unprotected Adj-SID, as illustrated on lines 23 and 35 of Example 3‑5. In this situation, the headend selects a protected Adj-SID by default.

Example 3-5: Translation of segment descriptor into MPLS label

 1 RP/0/0/CPU0:xrvr-1#show segment-routing traffic-eng ipv4 topology 1.1.1.8
 2
 3 SR-TE topology database
 4 -----------------------
 5
 6 Node 8
 7   TE router ID: 1.1.1.8
 8   Host name: xrvr-8
 9   ISIS system ID: 0000.0000.0008 level-2
10   Prefix SID:
11     Prefix 1.1.1.8, label 16008 (regular)
12
13   Link[0]: local address 99.5.8.8, remote address 99.5.8.5
14     Local node:
15       ISIS system ID: 0000.0000.0008 level-2
16     Remote node:
17       TE router ID: 1.1.1.5
18       Host name: xrvr-5
19       ISIS system ID: 0000.0000.0005 level-2
20     Metric: IGP 10, TE 10, Delay 6
21     Admin-groups: 0x00000000
22     Adj SID: 24085 (protected)
23              24185 (unprotected)
24
25   Link[1]: local address 99.6.8.8, remote address 99.6.8.6
26     Local node:
27       ISIS system ID: 0000.0000.0008 level-2
28     Remote node:
29       TE router ID: 1.1.1.6
30       Host name: xrvr-6
31       ISIS system ID: 0000.0000.0006 level-2
32     Metric: IGP 10, TE 10, Delay 10
33     Admin-groups: 0x00000000
34     Adj SID: 24086 (protected)
35              24186 (unprotected)

Example 3‑6 and Figure 3‑2 show how these segment descriptors are used as part of an explicit SID list configuration. The first entry (index 10) uses the IPv4 address 1.1.1.3 as the segment descriptor for the Prefix-SID associated with prefix 1.1.1.3/32 of Node3. The second entry (index 20) uses address 99.3.4.3 as the segment descriptor for the Adj-SID of Node3 for its adjacency to Node4. The address 99.3.4.3 is configured on Node3's interface for the point-to-point link between Node3 and Node4, and the preceding segment ending at Node3 indicates that the link should be traversed from Node3 to Node4.

Example 3-6: SIDs specified as segment descriptors

segment-routing
 traffic-eng
  policy POLICY2
   color 20 end-point ipv4 1.1.1.4
   candidate-paths
    preference 100
     explicit segment-list SIDLIST2
  !
  segment-list name SIDLIST2
   index 10 address ipv4 1.1.1.3    !! Prefix-SID Node3
   index 20 address ipv4 99.3.4.3   !! Adj-SID link 3->4

Figure 3-2: SIDs specified as segment descriptors example

The SID list SIDLIST2 is configured as an explicit candidate path of SR Policy POLICY2. The output in Example 3‑7 shows that the headend Node1 resolved the segment descriptors in SIDLIST2 into the corresponding label values and used them to program the forwarding table entry for POLICY2. The address 1.1.1.3 is resolved to the Prefix-SID label 16003 and 99.3.4.3 to the protected Adj-SID label 24034.

Example 3-7: SR Policy using explicit segment list with IP addresses

RP/0/0/CPU0:xrvr-1#show segment-routing traffic-eng policy

SR-TE policy database
---------------------

Color: 20, End-point: 1.1.1.4
  Name: srte_c_20_ep_1.1.1.4
  Status:
    Admin: up  Operational: up for 00:00:20 (since May  4 09:25:54.847)
  Candidate-paths:
    Preference: 100 (configuration) (active)
      Name: POLICY2
      Requested BSID: dynamic
      Explicit: segment-list SIDLIST2 (valid)
        Weight: 1
          16003 [Prefix-SID, 1.1.1.3]
          24034 [Adjacency-SID, 99.3.4.3 - 99.3.4.4]
  Attributes:
    Binding SID: 40114
    Forward Class: 0
    Steering BGP disabled: no
    IPv6 caps enable: yes

If the Adj-SID label on Node3 for the adjacency to Node4 changes value, for example following a reload of Node3, headend Node1 automatically updates the SID list with the new value of the Adj-SID label. No configuration change is required on Node1. The BSID and SR Policy forwarding entries are equivalent to those of the previous example.

In the two examples above, the explicit SID list was either configured with only MPLS label values or with only segment descriptors. Segment descriptors and MPLS label values can also be combined in an explicit SID list, but with one limitation: once an entry in the SID list is specified as an MPLS label value, all subsequent entries must also be specified as MPLS label values. In other words, a SID specified as a segment descriptor cannot follow a SID specified as an MPLS label value.
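A minimal Python sketch of the resolution step is shown below, assuming a toy SR-TE DB: prefix descriptors resolve to Prefix-SID labels, and link-address descriptors resolve to the (protected, by default) Adj-SID of the identified adjacency. The direction-from-preceding-SID logic is omitted for brevity, and the data structures are illustrative, not an actual router database.

# Sketch of resolving segment descriptors into MPLS labels using a toy SR-TE DB
# whose content mirrors Examples 3-5 and 3-7.
srte_db = {
    "prefix_sids": {"1.1.1.3": 16003, "1.1.1.4": 16004, "1.1.1.8": 16008},
    # link identified by either interface address -> protected Adj-SID label
    "adj_sids": {"99.3.4.3": 24034, "99.3.4.4": 24034,
                 "99.5.8.8": 24085, "99.5.8.5": 24085},
}

def resolve(descriptor):
    """Return the MPLS label for an IPv4 segment descriptor, or None if the
    descriptor is not in the SR-TE DB (the SID list then becomes invalid)."""
    if descriptor in srte_db["prefix_sids"]:
        return srte_db["prefix_sids"][descriptor]
    return srte_db["adj_sids"].get(descriptor)

sid_list = ["1.1.1.3", "99.3.4.3"]             # SIDLIST2 of Example 3-6
labels = [resolve(d) for d in sid_list]
print(labels)                                   # [16003, 24034]
print(None in labels)                           # False -> all SIDs resolved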

3.4 Path Validation

An explicit candidate path is valid if its SID list is valid. More generally, when an explicit candidate path is expressed as a weighted set of SID lists, the explicit candidate path is valid if it has at least one valid SID list.

A SID list is valid if all the following conditions are satisfied:

The SID list contains at least one SID
The SID list has a weight larger than 0
The headend can resolve all SIDs expressed as segment descriptors into MPLS labels
The headend can resolve the first SID into one or more outgoing interfaces and next-hops

The first condition is obvious: an empty SID list is invalid.

As mentioned in chapter 2, "SR Policy", each SID list has an associated weight value that controls the relative amount of traffic steered over this particular SID list. The default weight is 1 for SID lists defined as part of an explicit candidate path. If the weight is zero, then the SID list is considered invalid.

The headend node uses the information in its local SR-TE DB to resolve each SID specified as a segment descriptor to its MPLS label value. If a segment descriptor is not present in the headend node's SR-TE DB, then the headend node cannot retrieve the corresponding label value. Therefore, a SID list containing a segment descriptor that cannot be resolved is considered invalid.

Finally, the headend node must be able to find out where to send the packet after imposing the SID list. The headend node uses its local information to determine a set of outgoing interfaces and next-hops from the first SID in the list. If it is unable to resolve the first SID of a SID list into at least one outgoing interface and next-hop, then that SID list is invalid.

To illustrate the validation procedure for explicit paths, we compare two configuration variants for the SR Policy illustrated in Figure 3‑3: the SIDs specified as label values and the SIDs specified as segment descriptors.
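The four conditions map naturally onto a small validation routine, sketched below in Python. The resolver and next-hop lookup are hypothetical placeholders (the toy SR-TE DB already has the descriptor of the Node8–Node5 link removed, anticipating the failure scenario analyzed next); this illustrates the rules, it is not the actual implementation.

# Sketch of the four SID-list validation conditions described above.
SRTE_DB = {"1.1.1.8": 16008, "1.1.1.4": 16004}     # 99.5.8.8 absent: link down
REACHABLE_FIRST_SIDS = {16008, 16003, 16004}        # first SIDs with a next-hop

def resolve_descriptor(sid):
    if isinstance(sid, int):
        return sid                                  # already an MPLS label value
    return SRTE_DB.get(sid)                         # None if unknown descriptor

def first_sid_has_nexthop(label):
    return label in REACHABLE_FIRST_SIDS

def sid_list_valid(sids, weight=1):
    if not sids:                    # 1. the SID list contains at least one SID
        return False
    if weight <= 0:                 # 2. the SID list has a weight larger than 0
        return False
    labels = [resolve_descriptor(s) for s in sids]
    if None in labels:              # 3. every segment descriptor resolves
        return False
    return first_sid_has_nexthop(labels[0])   # 4. first SID gives a next-hop

print(sid_list_valid([16008, 15085, 16004]))               # True  (label values)
print(sid_list_valid(["1.1.1.8", "99.5.8.8", "1.1.1.4"]))  # False (descriptor gone)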

A failure occurred in the network: the link between Node8 and Node5 went down. As a result, the IGP withdraws the Adj-SID of this link. The headend removes the Adj-SID from its SR-TE DB. If this Adj-SID is protected by TI-LFA, then traffic will temporarily (~ 15 minutes) be forwarded on the TI-LFA backup path.

Figure 3-3: Explicit path example

First, consider the case where the SIDs in the SID list are specified as label values. This is the configuration in Example 3‑8. The SID list is not empty and has the default weight value of 1, so the first two validation conditions are satisfied. The next condition does not apply since the SIDs are specified as label values. The first SID in the list is 16008, which is the Prefix-SID of Node8. Node8 is still reachable from Node1, via Node7, and hence Node1 can resolve the outgoing interface and next-hop for 16008 accordingly. This satisfies the last validation condition. Consequently, Node1 still considers SIDLIST1 as valid and keeps steering traffic into POLICY1.

If the Adj-SID 15085 is protected by TI-LFA, then the traffic is forwarded along the backup path and continues to reach Node4 for a time, but starts being dropped by Node8 as soon as the IGP removes the Adj-SID backup path. If the Adj-SID is not protected, then the traffic is immediately dropped by Node8 when the link goes down.

Example 3-8: SR Policy with explicit path on Node1 – SID list using MPLS labels

segment-routing
 traffic-eng
  policy POLICY1
   color 10 end-point ipv4 1.1.1.4
   candidate-paths
    preference 100
     explicit segment-list SIDLIST1
  !
  segment-list name SIDLIST1
   index 10 mpls label 16008   !! Prefix-SID Node8
   index 20 mpls label 15085   !! Adj-SID link 8->5
   index 30 mpls label 16004   !! Prefix-SID Node4

Now, consider the case where the SIDs in the SID list are specified with segment descriptors. This is the configuration in Example 3‑9. The first two validation conditions, SID list not empty and weight not 0, are satisfied. The fourth condition is satisfied as well: the Prefix-SID 16008 of Node8's prefix 1.1.1.8/32 is still reachable from the headend Node1 via Node7. However, when Node1 attempts to resolve the second segment descriptor 99.5.8.8 into an MPLS label value, it is unable to find the IP address in its SR-TE DB. This entry was removed from the SR-TE DB following the withdrawal of the link between Node8 and Node5 in the IGP. The failure to resolve this second segment descriptor violates the third validation condition and causes Node1 to invalidate SIDLIST2. The candidate path and SR Policy are also invalidated since no other option is available.

Example 3-9: SIDs specified as segment descriptors

segment-routing
 traffic-eng
  policy POLICY2
   color 20 end-point ipv4 1.1.1.4
   candidate-paths
    preference 100
     explicit segment-list SIDLIST2
  !
  segment-list name SIDLIST2
   index 10 address ipv4 1.1.1.8    !! Prefix-SID Node8
   index 20 address ipv4 99.5.8.8   !! Adj-SID link 8->5
   index 30 address ipv4 1.1.1.4    !! Prefix-SID Node4

Example 3‑10 shows the status of the SR Policy with the configuration in Example 3‑9 when the link Node8-Node5 is down. Notice that this SR Policy is Operational down (line 9) since its only Candidate path is down.

Example 3-10: SR Policy using explicit segment list with IP addresses – status down

 1 RP/0/0/CPU0:xrvr-1#show segment-routing traffic-eng policy
 2
 3 SR-TE policy database
 4 ---------------------
 5
 6 Color: 20, End-point: 1.1.1.4
 7   Name: srte_c_20_ep_1.1.1.4
 8   Status:
 9     Admin: up  Operational: down for 00:00:06 (since May  4 10:47:10.263)
10   Candidate-paths:
11     Preference: 100 (configuration)
12       Name: POLICY2
13       Requested BSID: dynamic
14       Explicit: segment-list SIDLIST2 (inactive)
15         Inactive Reason: Address 99.5.8.5 can not be resolved to a SID
16         Weight: 1
17   Attributes:
18     Forward Class: 0
19     Steering BGP disabled: no
20     IPv6 caps enable: yes

Constraints

Besides the four validation conditions explained previously, the operator may express further constraints on an explicit candidate path. For example, the operator may request the headend to ensure that the path followed by the explicit SID list never uses a link or node with a given SRLG or TE-affinity, or may require that the accumulated link delay be smaller than a bound. Chapter 4, "Dynamic Candidate Path" contains more information about constraints.

Resiliency

While explicit paths are static by nature, they benefit from different resiliency mechanisms. Let us consider the Adj-SID S1 associated with link L, and let us assume that S1 is the first SID of the list or is expressed as a segment descriptor. Should L fail, the IGP convergence ensures that the headend SR-TE DB is updated within a few hundred milliseconds. The SR-TE DB update triggers the invalidation of S1, the invalidation of its associated SID list and hence the invalidation of its associated explicit candidate path (assuming that it is the only SID list for this candidate path). This invalidation may allow another candidate path of the SR Policy to become active, or it may cause the SR Policy to become invalid and hence to be removed from the forwarding table. In such a case, the traffic would then follow the default IGP shortest path (or be dropped, as described in chapter 16, "SR-TE Operations").

3.5 Practical Considerations

In practice, the choice of expressing a SID list with label values or segment descriptors depends on various factors, such as how reliably the entity initiating the path can maintain it, or whether the headend has enough visibility to resolve all the segment descriptors that would be used in the SID list. Some hints are provided in this section on the most appropriate mode of expression through several practical examples.

A Controller May Not Want the Headend to Second-Guess

An explicit candidate path is typically computed by an external controller. The intent for that path is known by the controller. This controller typically monitors the network in real time and updates any explicit SID list when needed. This can happen upon the loss of a link or node (an alternate path is then selected) or upon the addition of a link or node in the topology (a better path is now possible).

As the controller knows the intent and it monitors the network in real time, most likely the controller does not want the headend to second-guess its decisions. In such a case, the controller will express the SIDs as MPLS labels. The headend does not check the validity of such SIDs and the controller stays in full control.

An Operator May Want Headend Monitoring

The operator may instantiate an explicit SR Policy as two (or more) explicit candidate paths (CPs): CP1 with preference 200 and CP2 with preference 100. The operator does not want to continuously check the state of the network and promptly react to a change. Instead, the operator wants the headend to automatically switch from CP1 to CP2 when CP1 becomes unavailable and from CP2 to CP1 when CP1 becomes available. In such a case, the operator will express the SIDs as segment descriptors. This will force the headend to validate each SID individually and hence the SID list in its entirety. When a SID of CP1’s SID list becomes invalid, the headend will invalidate the entire SID list and CP1, and activate CP2. When the SID becomes valid again, the headend will re-enable the SID list and activate CP1.
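The headend behavior described here boils down to a simple selection rule: among the valid candidate paths, activate the one with the highest preference, and re-evaluate whenever validity changes. The following Python fragment is a minimal sketch of that rule; the data model is an assumption made for illustration, not the actual implementation.

# Illustrative sketch: the headend activates the valid candidate path with the
# highest preference and re-evaluates when a path becomes (in)valid.

def select_active_path(candidate_paths):
    """candidate_paths: list of dicts with 'name', 'preference', 'valid'."""
    valid = [cp for cp in candidate_paths if cp['valid']]
    if not valid:
        return None                          # SR Policy goes operationally down
    return max(valid, key=lambda cp: cp['preference'])

paths = [{'name': 'CP1', 'preference': 200, 'valid': False},   # a SID became invalid
         {'name': 'CP2', 'preference': 100, 'valid': True}]
print(select_active_path(paths)['name'])     # -> CP2; CP1 again once it revalidates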

SID Translation Is Limited to the Headend Domain

When it is known that a segment descriptor is not present in the headend's SR-TE DB, the SID must be expressed as a label value. The translation from segment descriptor to SID label value would not work otherwise and hence there would be no point in expressing the related SID using a segment descriptor. This is typically the case when a SID is outside the local domain of the headend.

Dynamically Allocated Labels Are Hard to Predict

One must know the label value of a SID to be able to use that mode of expression. This is not a problem for Prefix-SIDs, as they have fixed, network-wide label values³. It may be a problem for Adj-SIDs or Peering-SIDs, which often use dynamically allocated label values that are hard – even impossible – to predict. These dynamic labels are subject to change; for example, a router could allocate different label values for its Adj-SIDs after it has been reloaded. If the Adj-SID label value changes, the configured SID list with the old Adj-SID label value no longer provides the correct path and must be reconfigured. Assume that in the configuration of Example 3-1 the operator had configured the dynamic Adj-SID label 24085 instead of the manual Adj-SID. Then assume that Node8 was reloaded and has acquired a new label value 24000 for the Adj-SID of the link Node8→Node5. Since the old label value 24085 for this Adj-SID is no longer correct, the Adj-SID entry (index 20) in the SR Policy's SID list on Node1 must then be reconfigured, using the new label value, as index 20 mpls label 24000. A solution to this problem is to configure explicit Adj-SIDs and Peering-SIDs. Since they are configured, they are persistent, even across reloads.
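The guidance of this section can be condensed into a small decision sketch. The helper below is hypothetical and only summarizes the trade-offs discussed above; it is not a feature of the router.

# Decision sketch summarizing this section's guidance (illustrative, not exhaustive).

def preferred_sid_expression(in_headend_domain, label_is_static, headend_should_validate):
    """Return 'descriptor' or 'label' for a SID in an explicit SID list."""
    if not in_headend_domain:
        return 'label'        # headend cannot resolve descriptors outside its domain
    if headend_should_validate:
        return 'descriptor'   # headend tracks validity via its SR-TE DB
    if not label_is_static:
        return 'descriptor'   # dynamic Adj-SID/Peering-SID labels may change on reload
    return 'label'            # controller stays in full control, no second-guessing

print(preferred_sid_expression(True, False, False))   # -> 'descriptor'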

3.6 Controller-Initiated Candidate Path

A controller, or an application above a controller, can translate an SLA from a headend H to an endpoint E into an explicit candidate path. Once the path is computed, the controller signals it to H via PCEP or BGP SR-TE. We leave the protocol details to the chapters that are dedicated to these protocols. For now, it is enough to know that these protocols provide the means to convey the explicit candidate path from the controller to the headend. After the protocol exchange, the headend node instantiates the path, or updates the path if it already existed.

South-bound and north-bound interfaces

The south-bound and north-bound indications refer to the typical architectural drawing, where the north-bound interface is drawn on top of the illustrated component and the south-bound interface is drawn below it. For a controller, examples of north-bound interfaces are REST (Representational State Transfer) and NETCONF (Network Configuration Protocol). Controller south-bound interface examples are PCEP, BGP, classic XML, and NETCONF.

In the topology of Figure 3-4, a controller is programmed to deliver a low-delay path from Node1 to 1.1.1.4 (Node4). The controller is monitoring the topology and learns that the link between Node7 and Node6 experiences a higher than normal delay, for example due to an optical circuit change. Therefore, the controller decides to avoid this link by initiating on headend Node1 a new explicit path with a SID list that encodes the path 1→2→3→6→5→4. The controller signals this path for SR Policy (blue, 1.1.1.4) to headend Node1 via its south-bound interface. This path is a candidate path for the SR Policy (blue, 1.1.1.4). If the SR Policy (blue, 1.1.1.4) already exists on Node1 (initiated by any protocol: CLI, PCEP, BGP, …), this new candidate path is added to the list of candidate paths of this SR Policy. The path selection procedure then decides which candidate path becomes the active path. In this case, we assume that SR Policy (blue, 1.1.1.4) does not yet exist on Node1. Therefore, Node1 instantiates the SR Policy with a single

candidate path and programs the forwarding entry for this SR Policy. The path is then ready to be used, traffic can be steered onto it.

Figure 3-4: Controller-initiated path

Using this mechanism, a controller can steer any traffic flow on any desired path through the network by simply programming the SR Policy path on the headend node. The controller is responsible for maintaining the path. For example, following a change in the network, it must re-compute the path and update it on headend Node1 if required.
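Conceptually, the controller runs a simple maintenance loop: learn the topology, re-apply the intent, and re-signal the SID list when it changes. The sketch below illustrates this loop; topology_feed, compute_path, and signal_policy are hypothetical placeholders for the controller's internal logic and its south-bound interface (PCEP or BGP SR-TE), not real APIs.

# Conceptual controller maintenance loop (all callables are illustrative placeholders).
import time

def controller_loop(intents, topology_feed, compute_path, signal_policy, poll_interval=1.0):
    """intents: dict keyed by (headend, color, endpoint) with the desired objective."""
    programmed = {}                                   # last SID list signaled per intent
    while True:
        topology = topology_feed()                    # e.g., BGP-LS topology + link delays
        for key, intent in intents.items():
            sid_list = compute_path(topology, intent) # controller-side computation
            if programmed.get(key) != sid_list:
                signal_policy(key, sid_list)          # south-bound: PCEP or BGP SR-TE
                programmed[key] = sid_list
        time.sleep(poll_interval)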

3.7 TDM Migration

In this use-case, we explain how an operator can pin a pseudowire (PW) onto a specific SR Policy and ensure that the PW traffic will be dropped as soon as the SR Policy's selected path becomes invalid. The steering of the PW into the SR Policy is not the focus of this section. Briefly, we use the L2VPN preferred path mechanism, which allows us to steer a PW into an SR Policy. We will detail this in chapter 10, "Further Details on Automated Steering". The focus of this section is on two specific mechanisms:

- The ability to avoid using a TI-LFA FRR backup path
- The ability to force the traffic to be dropped upon SR Policy invalidation (instead of letting the PW follow the IGP shortest path)

Explicit paths allow pinning down services on a pre-defined path through the network. This may be desired when, for example, migrating a Time Division Multiplexing (TDM) service to an IP/MPLS infrastructure. In the example topology of Figure 3-5, two disjoint PWs are configured between Node1 and Node4. These PWs carry high traffic volumes and the operator does not want both PWs to traverse the same core link (2-3 or 6-5), since their combined load exceeds the capacity of these links. The operator configures two SR Policies on Node1 and steers one PW into each of them using the L2VPN preferred-path functionality.

Figure 3-5: TDM migration

Each of these two SR Policies has an explicit SID list, where the SIDs are the unprotected Adjacency-SIDs of the links on the path. This ensures that the SR Policies are pinned to their path.

Unprotected Adj-SID

Unprotected Adj-SIDs are not protected by a local protection mechanism such as TI-LFA. By using this type of Adj-SID for this path, local protection can be enabled on all links in the network so that other traffic flows benefit from it, while traffic flows using the unprotected Adj-SIDs do not use the local protection functionality.

The relevant configuration of Node1 is shown in Example 3‑11. The Adj-SIDs in the SID lists are specified as interface IP addresses. For example, 99.1.2.2 is the IP address of the interface on Node2 for its link to Node1. The headend Node1 maps these interface IP addresses to the Adj-SID of the link. However, as mentioned in section 3.3, IP addresses are mapped to protected Adj-SID labels by default. Hence, the configuration line constraints segments unprotected (see lines 10-12 of Example 3‑11) is added to instruct Node1 to map the IP addresses to the unprotected Adj-SID labels instead.

By default, an SR Policy with no valid candidate path is invalidated and the traffic that was steered into it falls back to its default forwarding path, which is usually the IGP shortest path to its destination. However, in this use-case the operator specifically wants to force the traffic to be dropped upon SR Policy invalidation. This behavior is achieved by configuring steering invalidation drop

under the SR Policy (see lines 5-6 of Example 3‑11).

“The invalidation-drop behavior is a great example of lead operator partnership. The lead operator brought up their requirement very early in the SR-TE design process (~2014) and this helped us define and implement the behavior. This has now been shipping since 2015 and is used in deployments where the operator prefers to drop some traffic rather than letting it flow over paths that do not meet its requirements (e.g., from a BW/capacity viewpoint).”
— Clarence Filsfils

Example 3-11: TDM migration – configuration of Node1

 1 segment-routing
 2  traffic-eng
 3   policy POLICY1
 4    color 10 end-point ipv4 1.1.1.4
 5    steering
 6     invalidation drop
 7    candidate-paths
 8     preference 100
 9      explicit segment-list SIDLIST1
10      constraints
11       segments
12        unprotected
13   !
14   policy POLICY2
15    color 20 end-point ipv4 1.1.1.4
16    steering
17     invalidation drop
18    candidate-paths
19     preference 100
20      explicit segment-list SIDLIST2
21      constraints
22       segments
23        unprotected
24   !
25   segment-list name SIDLIST1
26    index 10 address ipv4 99.1.2.2   !! link 1->2
27    index 20 address ipv4 99.2.3.3   !! link 2->3
28    index 30 address ipv4 99.3.4.4   !! link 3->4
29   !
30   segment-list name SIDLIST2
31    index 10 address ipv4 99.1.6.6   !! link 1->6
32    index 20 address ipv4 99.5.6.5   !! link 6->5
33    index 30 address ipv4 99.4.5.4   !! link 5->4
34  !
35 l2vpn
36  pw-class PREF-PATH1
37   encapsulation mpls
38    preferred-path sr-te policy POLICY1
39  !
40  pw-class PREF-PATH2
41   encapsulation mpls
42    preferred-path sr-te policy POLICY2
43  !
44  xconnect group XG1
45   p2p PW1
46    interface GigabitEthernet0/0/0/0
47    neighbor ipv4 1.1.1.3 pw-id 1
48     pw-class PREF-PATH1
49   !
50   p2p PW2
51    interface GigabitEthernet0/0/0/1
52    neighbor ipv4 1.1.1.3 pw-id 2
53     pw-class PREF-PATH2

At the time of writing, steering invalidation drop and constraints segments unprotected were only available in the initial (and now deprecated) SR-TE CLI.

3.8 Dual-Plane Disjoint Paths Using Anycast-SID

“There are at least three SR solutions to the disjoint paths use-case: SR Policy with a dynamic path, SR IGP Flex-Algo, and explicit candidate path. In this section, we describe the last option. We will describe the other options later in this book. The key point I would like to highlight is that this explicit candidate path option has been selected for deployment. In theory, this explicit path solution does not work when an access node loses all its links to the chosen blue plane (2-11 and 2-13 both fail in the following illustration) or the blue plane partitions (11-12 and 13-14 both fail). However, in practice, some operators estimate that these events are highly unlikely and hence they select this simple dual-plane solution for its pragmatic simplicity. Other operators prefer a solution that dynamically ensures that the disjoint objective is always met whatever the state of the network, even if unlikely. The two other design options meet those requirements. We will cover them later in the book.”
— Clarence Filsfils

The topology in Figure 3-6 shows a dual-plane network topology, a design that is used in many networks. The blue plane consists of nodes 11 to 14. The green plane consists of nodes 21 to 24. The common practice is to configure the inter-plane connections, also known as shunt links (e.g., the link between Node11 and Node21), with a high (bad) IGP metric. These links are represented with thinner lines to illustrate this. When the shunt links have such a high metric, the traffic that enters one plane remains on the same plane until its destination. The only scenario that would make the traffic cross to the other plane is a partitioning of its initial plane, i.e., a failure that causes one part of the plane to become isolated from the rest. In this case, reachability between the partitions may only be possible via the other plane. This is very rare in practice as it would require at least two independent failures. Edge nodes connect redundantly to each plane, via direct links, or indirectly via another edge node.

Figure 3-6: Dual-plane disjoint paths using anycast-SID

Let us assume that all the blue nodes are configured with Anycast-SID 16111 and all the green nodes are configured with Anycast-SID 16222.

Anycast-SID

As we explained in Part I of this book, Anycast-SIDs not only provide more load-balancing and node resiliency, they are also useful to express macro-engineering policies that steer traffic via groups of nodes (“anycast sets”) instead of via individual nodes. Each member of an anycast set advertises the same Anycast-SID. All blue plane nodes advertise the Anycast-SID 16111. For this, the configuration of Example 3-12 is applied on all the blue plane nodes. In essence, an Anycast-SID is a Prefix-SID that is advertised by multiple nodes. Therefore, the regular Prefix-SID configuration is used to configure it. All blue plane nodes advertise the same Loopback1 prefix 1.1.1.111/32 with Prefix-SID 16111.

Example 3-12: Dual-plane disjoint paths – anycast-SID configuration

interface Loopback1
 description blue plane anycast address
 ipv4 address 1.1.1.111/32
!
router isis 1
 interface Loopback1
  address-family ipv4 unicast
   prefix-sid absolute 16111 n-flag-clear

By default, when configuring a Prefix-SID, its N-flag is set, indicating that the Prefix-SID identifies a single node. However, an Anycast-SID does not identify a single node, but it identifies a group of nodes: an anycast set. Therefore, an Anycast-SID must be advertised with the Prefix-SID N-flag unset, requiring the n-flag-clear keyword in the Prefix-SID configuration. Note that for ISIS, this configuration also clears the N-flag in the prefix-attribute.

Anycast segments: versatile and powerful

“Anycast segments are, in my eyes, a very versatile and powerful tool for any network designer who does not want to or cannot completely rely on a central controller to initiate all needed paths. Whenever TE policies are designed and maybe even configured by a human, anycast segments can solve several problems:

They provide resiliency. More than one node can carry the same SID and possibly several nodes can serve the same purpose and step in for each other. See chapter 8, "Network Resiliency".

Using the same SID or SID descriptor on all routers significantly simplifies router configuration. IT effort will often be reduced as fewer parameters are needed to generate the configuration for each router.

Last but not least, anycast segments can be seen as a method of abstraction: an Anycast-SID no longer just stands for a particular node, or group of nodes, but rather for a function that needs to be applied to a packet, or for a certain property of forwarding a packet through a network.

Thus, for my specific needs, anycast segments are an essential element of how Segment Routing brings traffic engineering to a whole new level.”
— Martin Horneffer

In such a dual-plane topology, a very simple solution to steer traffic via the blue plane to Node3 consists in imposing the SID list <16111, 16003>, where 16003 is the Prefix-SID of Node3. Indeed, 16111 steers the traffic into the blue plane and then the dual-plane design ensures that it remains in the same plane until its destination. The traffic would not use the other plane as the shunt links have a very bad metric and multiple independent failures would be required to partition the blue plane. Similarly, traffic can be steered in the green plane with the SID list <16222, 16003>. To steer traffic via the blue plane, the operator configures an SR Policy “BLUE” with color blue (value 10), endpoint 1.1.1.3 (Node3) and the explicit segment list SIDLIST1, as shown in Example 3-13. This SID list is expressed with two segment descriptors: 1.1.1.111 is resolved into the blue plane Anycast-SID 16111 and 1.1.1.3 maps to the Prefix-SID of Node3. A second SR Policy, named “GREEN”, with color green (value 20), endpoint 1.1.1.3 (Node3) and the explicit SID list SIDLIST2 is used to steer traffic via the green plane. SIDLIST2 also contains

two entries: 1.1.1.222 maps to the green plane Anycast-SID 16222 and 1.1.1.3 maps to the Prefix-SID of Node3.

Example 3-13: Dual-plane disjoint paths using anycast-SID – configuration

segment-routing
 traffic-eng
  policy BLUE
   color 10 end-point ipv4 1.1.1.3
   candidate-paths
    preference 100
     explicit segment-list SIDLIST1
  !
  policy GREEN
   color 20 end-point ipv4 1.1.1.3
   candidate-paths
    preference 100
     explicit segment-list SIDLIST2
  !
  segment-list name SIDLIST1
   index 10 address ipv4 1.1.1.111   !! blue plane anycast
   index 20 address ipv4 1.1.1.3
  !
  segment-list name SIDLIST2
   index 10 address ipv4 1.1.1.222   !! green plane anycast
   index 20 address ipv4 1.1.1.3

Now, assume that an L3VPN service is required between Node1 and Node3, with two VRFs: blue and green. These two VRFs should be carried over disjoint paths wherever possible. This can be achieved by steering the VRF blue packets into SR Policy BLUE, and the VRF green packets into SR Policy GREEN. Traffic steering details are covered in chapter 5, "Automated Steering". For now, it is enough to know that blue colored prefixes are steered into the SR Policy with color blue and green colored prefixes are steered into the SR Policy with color green. The traceroute output in Example 3‑14 shows that the traffic of the VRF blue traverses the blue plane (nodes 11 to 14) while the traffic of VRF green traverses the green plane (nodes 21 to 24). Both VRF prefixes 10.10.10.10 and 20.20.20.20 have Node3 as BGP next hop. The MPLS labels on the packet received by Node2 are the Anycast-SID of the plane, 16111 or 16222, the Prefix-SID of the BGP next hop, 16003, and the VPN label 9000x for the prefix.

Example 3-14: Dual-plane disjoint paths using anycast-SID – traceroute

RP/0/0/CPU0:xrvr-1#traceroute vrf blue 10.10.10.10

Type escape sequence to abort.
Tracing the route to 10.10.10.10

 1  99.1.2.2 [MPLS: Labels 16111/16003/90009 Exp 0/0/0] 19 msec 19 msec 19 msec
 2  99.2.11.11 [MPLS: Labels 16003/90009 Exp 0/0] 29 msec 19 msec 19 msec
 3  99.11.12.12 [MPLS: Labels 16003/90009 Exp 0/0] 19 msec 19 msec 19 msec
 4  99.3.12.3 19 msec 19 msec 19 msec

RP/0/0/CPU0:xrvr-1#traceroute vrf green 20.20.20.20

Type escape sequence to abort.
Tracing the route to 20.20.20.20

 1  99.1.2.2 [MPLS: Labels 16222/16003/90007 Exp 0/0/0] 19 msec 19 msec 19 msec
 2  99.2.23.23 [MPLS: Labels 16003/90007 Exp 0/0] 29 msec 19 msec 19 msec
 3  99.23.24.24 [MPLS: Labels 16003/90007 Exp 0/0] 19 msec 19 msec 19 msec
 4  99.3.24.3 19 msec 19 msec 19 msec
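The steering behavior shown in this example can be summarized as a simple lookup: a service route colored C with BGP next hop N is steered into the SR Policy (C, N) if it exists, and otherwise follows the IGP shortest path to N. The sketch below is only an illustration of that rule; Automated Steering is described in chapter 5.

# Illustrative steering lookup only; the actual mechanism is covered in chapter 5.

def select_forwarding(route_color, bgp_nexthop, sr_policies):
    """sr_policies: dict keyed by (color, endpoint) -> policy name (assumed model)."""
    policy = sr_policies.get((route_color, bgp_nexthop))
    if policy is not None:
        return f"steer into SR Policy {policy}"
    return f"default: IGP shortest path (Prefix-SID) to {bgp_nexthop}"

policies = {(10, '1.1.1.3'): 'BLUE', (20, '1.1.1.3'): 'GREEN'}
print(select_forwarding(10, '1.1.1.3', policies))    # VRF blue traffic -> blue plane
print(select_forwarding(None, '1.1.1.3', policies))  # uncolored -> IGP shortest path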

3.9 Summary

- An explicit path can be provided to the headend by configuration or signaled by a controller (NETCONF, PCEP, BGP).
- An explicit candidate path is formally defined as a weighted set of SID lists. An explicit candidate path is typically a single SID list.
- A headend node is not involved in the path computation or the SID list encoding of an explicit path; it instantiates an explicit path verbatim, as is provided.
- A SID can be expressed as an MPLS label or a segment descriptor. A segment descriptor is preferred when one wants the headend to validate the SID. A segment descriptor cannot be used for a SID that is unknown to the headend (e.g., inter-domain).
- An explicit candidate path is valid if at least one of its SID lists is valid. An explicit SID list is valid if all the following conditions are met:
  - The SID list contains at least one SID
  - The SID list has a weight value larger than 0
  - The headend can resolve all SIDs expressed as segment descriptors
  - The headend can resolve the first SID into one or more outgoing interfaces and next-hops
- An explicit candidate path has many applications. We illustrated two of them:
  - TDM migration to an SR-based infrastructure
  - Disjoint paths in a dual-plane network

3.10 References

[SR-book-Part-I] "Segment Routing Part I", Clarence Filsfils, Kris Michielsen, Ketan Talaulikar, October 2016, ASIN: B01I58LSUO (Kindle), ISBN-10: 1542369126, ISBN-13: 978-1542369121

[RFC8402] "Segment Routing Architecture", Clarence Filsfils, Stefano Previdi, Les Ginsberg, Bruno Decraene, Stephane Litkowski, Rob Shakir, RFC8402, July 2018 [draft-ietf-spring-segment-routing-policy] "Segment Routing Policy Architecture", Clarence Filsfils, Siva Sivabalan, Daniel Voyer, Alex Bogdanov, Paul Mattes, draft-ietf-spring-segmentrouting-policy-02 (Work in Progress), October 2018

1. Restricting this segment descriptor to point-to-point links allows the SID to be determined with a single interface address, as opposed to a pair of (local, remote) interface addresses in the general case.
2. As described in SR book Part I and in chapter 8, "Network Resiliency", a protected Adj-SID is FRR-enabled while its unprotected counterpart is not.
3. Unless different SRGBs are used on different nodes, which is strongly discouraged.

4 Dynamic Candidate Path

What we will learn in this chapter:

- Dynamically computing an SR-TE path is solving an optimization problem with an optimization objective and constraints.
- The information in the SR-TE DB is used to compute a path.
- SR-optimized algorithms have been developed to compute paths and encode these paths in SID lists to make optimal use of the benefits of SR, leveraging the available ECMP in the network.
- The headend node or a Path Computation Element (PCE) can compute SR-TE paths. The base IOS XR software offers SR PCE server functionality.
- In many cases, the headend node itself can compute SLA paths in a single IGP area, for example delay-optimized paths, or paths that avoid specific resources.
- For specific path computations, where the headend does not have the necessary information in its SR-TE DB, the headend node uses an SR PCE to compute paths. For example, computing disjoint paths from different headend nodes or computing end-to-end inter-domain paths.
- A headend not only requests an SR PCE to compute paths but can also delegate control of the paths to the SR PCE, which then autonomously maintains these paths.
- The SR-TE DB is natively multi-domain capable. The SR PCE learns the topologies of all domains and the Peering-SIDs of the BGP peering links via BGP-LS.
- The SR PCE functionality can be distributed among several SR PCE servers throughout the network. If needed, these servers may synchronize with each other.

4.1 Introduction

A dynamic candidate path is a candidate path of an SR Policy that is automatically computed by a headend (router) or by a Path Computation Element (PCE) upon request of the headend. Such a path is automatically re-computed when required to adapt to a changing network.

4.1.1 Expressing Dynamic Path Objective and Constraints

While an explicit path is expressed as a list of segments, a dynamic path is expressed as an optimization objective and a set of constraints. These two elements specify the SR-TE intent. The optimization objective is the characteristic of the path that must be optimized. Typically, this is the minimization of a specific metric such as the IGP link metric or the minimum link-delay metric. Constraints are limitations that the resulting path must honor. For example, one may want the path to avoid specific links or groups of links or have a cumulative path metric that is bound by a maximum value.
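As an illustration of what such an intent could look like as a data structure, the sketch below defines a possible (assumed) representation with an optimization metric and a few of the constraints used later in this chapter; the field names are not taken from any implementation.

# Minimal, assumed representation of an SR-TE intent (objective + constraints).
from dataclasses import dataclass, field
from typing import Optional, List

@dataclass
class Intent:
    metric: str = 'igp'                      # 'igp', 'te', or 'delay'
    margin_absolute: Optional[int] = None    # e.g., microseconds for the delay metric
    affinity_exclude_any: List[str] = field(default_factory=list)
    max_accumulated_metric: Optional[int] = None
    disjoint_group: Optional[int] = None     # paths sharing this id must be disjoint

low_delay = Intent(metric='delay', margin_absolute=2000)
green_only = Intent(metric='igp', affinity_exclude_any=['BLUE'])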

4.1.2 Compute Path = Solve Optimization Problem

Computing a path through a network that satisfies a set of given requirements is solving an optimization problem, basically a constrained shortest path problem. “Find the shortest path to each node in the network” is an example of an optimization problem that is solved by using Dijkstra's well-known Shortest Path First (SPF) algorithm. To solve this problem, Dijkstra's algorithm uses the network graph (consisting of nodes and links, also known as vertices and edges in graph theory) and a cost associated with each edge (link). In networking, the link-state IGPs use Dijkstra's algorithm to compute the Shortest Path Tree (SPT); the edge cost is then the IGP link metric. Prefixes are leaves hanging from the nodes (vertices). The information that is needed to run Dijkstra's SPF and compute prefix reachability – nodes, links, and prefixes – is distributed by the link-state IGPs. Each node keeps this information in its Link-state Database (LS-DB). This introduces two necessary elements that are needed to compute paths in a network: a database that contains all necessary information about the network and a computation engine that applies algorithms to this information to solve the optimization problem, i.e., compute dynamic paths.
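As a concrete, minimal illustration of such a computation engine, the following Python sketch runs Dijkstra's SPF on a toy graph with a uniform IGP metric of 10 (the topology and metrics are assumed for illustration only).

# Minimal Dijkstra SPF sketch (illustrative; not an IGP implementation).
import heapq

def dijkstra(graph, source):
    """graph: {node: [(neighbor, cost), ...]}. Returns shortest distance per node."""
    dist = {source: 0}
    heap = [(0, source)]
    while heap:
        d, node = heapq.heappop(heap)
        if d > dist.get(node, float('inf')):
            continue                                   # stale heap entry
        for neighbor, cost in graph.get(node, []):
            nd = d + cost
            if nd < dist.get(neighbor, float('inf')):
                dist[neighbor] = nd
                heapq.heappush(heap, (nd, neighbor))
    return dist

# Toy topology, all IGP metrics 10 (assumed):
g = {1: [(2, 10)], 2: [(1, 10), (3, 10)], 3: [(2, 10), (4, 10), (6, 10)],
     4: [(3, 10), (5, 10)], 5: [(4, 10), (6, 10)], 6: [(3, 10), (5, 10)]}
print(dijkstra(g, 1))   # e.g., node 5 is reached at cost 40, via ECMP over 4 and 6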

Database

The database used for the computation is the SR-TE DB. Besides the network graph (nodes and links), the SR-TE DB contains various other information elements that the computation engine uses to solve different optimization problems. A few examples illustrate this. To compute delay-optimized paths, link delay information is required. To provide disjoint paths, information about the programmed paths is required. Chapter 12, "SR-TE Database" goes into much more detail on the SR-TE DB.

Computation Engine

The computation engine translates an SR-TE intent into a SID list. The path computation algorithms are at the core of the computation engine. Efficient SR-native optimization algorithms have been developed based on extensive scientific research; see the SIGCOMM 2015 paper [SIGCOMM2015]. This chapter focuses on four use-cases: delay-optimized paths, resource-avoidance paths, disjoint paths, and inter-domain paths.

4.1.3 SR-Native Versus Circuit-Based Algorithms

Classic MPLS-TE (RSVP-TE) was designed 20 years ago as an ATM/FR sibling. Its fundamental building block is a point-to-point non-ECMP circuit with per-circuit state at each hop. Even though ECMP is omnipresent in IP networks, classic RSVP-TE circuit-based paths do not leverage this ECMP by definition. As a result, the RSVP-TE solution needs many tunnels between the same set of headends and endpoints in order to use multiple paths through the network (one tunnel over each possible path). This drastically increases operational complexity and decreases scalability due to the number of tunnels required for a proper traffic load-balancing. As indicated before, considerable research went into the development of efficient SR-optimized path computation algorithms for SR-TE. These algorithms natively use the characteristics of the IP network and inherently leverage all available ECMP paths. The outcome of a native, SR-optimized

computation is a reduced SID list that leverages the SR capabilities, such as multi-hop segments and ECMP-awareness.

Few SIDs, ECMP

“As explained in the introduction of Part I, the intuition for Segment Routing came up while driving to Rome and realizing that the path to Rome avoiding the Gottardo pass (the shortest path to Rome from Brussels is via the Gottardo pass) could be expressed as simply “from Brussels, go to Chamonix and then from Chamonix go to Rome”. Only two segments are required! All the simulations we did later confirmed this basic but key intuition: few SIDs were required. My experience designing and deploying networks had told me that ECMP is a basic property of modern IP networks. Hence the intuition while driving to Rome was also that segment-routed paths would naturally favor ECMP. This was clear because each prefix segment expresses a shortest path and network topologies are designed such that shortest paths involve as much ECMP as possible (to load-share and better use resources, and for robustness). This ECMP property was proved later by all the simulations we did.”
— Clarence Filsfils

To illustrate the benefits of the SR-native algorithms over the circuit-based classic RSVP-TE solution, Figure 4‑1 compares RSVP-TE circuit-based optimization with SR-TE optimization. In the topology a path is computed using both methods from Node1 to Node3, avoiding the link between Node2 and Node3.

Figure 4-1: Circuit Optimization versus SR Optimization

The RSVP-TE solution first prunes the red link. Second, it computes the shortest path from 1 to 3 in the pruned graph. Third, it selects one single non-ECMP path out of the (potentially ECMP) shortest path. In this example, there are three possible shortest paths; let us assume RSVP-TE picks the path 1→4→5→7→3. Applying this old circuit-based solution to SR would specify each link on the path, leading to a SID list made of the Adj-SID of every link on the path, where 240XY represents the Adj-SID of NodeX to NodeY. Using a trivial path-to-SID-list algorithm, this SID list can be shortened to a list of Prefix-SIDs, where 1600X is the Prefix-SID of NodeX. However, this is still not as good as the SR-native solution described in this chapter. It still does not leverage ECMP and this classic path computation cannot adapt the path to SR-specific requirements such as segment list size or more ECMP paths. The SR-TE solution uses a completely different algorithm that seeks to leverage as much ECMP as possible while using as few SIDs as possible. For this reason, we call it the “SR-native” algorithm. In this example, the SR-native algorithm finds a SID list of only two SIDs, where 1600X is the Prefix-SID of NodeX. This SID list load-balances traffic over three paths.
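The "trivial path-to-SID-list algorithm" mentioned above can be sketched as a greedy compression: walk along the explicit node path and extend each segment for as long as the sub-path is still a shortest path from the segment's start node. The sketch below assumes uniform link costs and is only meant to illustrate that conversion; it is not the SR-native algorithm, which optimizes SID count and ECMP jointly.

# Greedy path-to-Prefix-SID compression sketch (assumes uniform link costs).
import heapq

def spf(graph, src):
    dist, heap = {src: 0}, [(0, src)]
    while heap:
        d, n = heapq.heappop(heap)
        if d > dist.get(n, float('inf')):
            continue
        for nbr, cost in graph.get(n, []):
            if d + cost < dist.get(nbr, float('inf')):
                dist[nbr] = d + cost
                heapq.heappush(heap, (d + cost, nbr))
    return dist

def path_to_prefix_sids(graph, path, link_cost=10):
    """Encode an explicit node path into a short list of nodes whose Prefix-SIDs
    to push. A segment is extended while the sub-path is still a shortest path
    from the segment's start node (uniform link cost assumed)."""
    sids, start, i = [], 0, 1
    while i < len(path):
        dist = spf(graph, path[start])
        if dist.get(path[i], float('inf')) < (i - start) * link_cost:
            sids.append(path[i - 1])     # the previous node terminates the segment
            start = i - 1
        else:
            i += 1
    sids.append(path[-1])
    return sids

# Toy example: explicit path 1-2-3-4 in a ring where 1-5-4 is a shorter way to 4.
g = {1: [(2, 10), (5, 10)], 2: [(1, 10), (3, 10)], 3: [(2, 10), (4, 10)],
     4: [(3, 10), (5, 10)], 5: [(1, 10), (4, 10)]}
print(path_to_prefix_sids(g, [1, 2, 3, 4]))   # -> [3, 4]: push Prefix-SID(3), Prefix-SID(4)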

Clearly the SR-native algorithm is preferred for SR applications.

4.2 Distributed Computation

The SR-TE computation engine can run on a headend (router) or on an SR Path Computation Element (SR PCE). The former is a distributed solution while the latter is a centralized solution. Both headend and SR PCE leverage the SR-native algorithms. The SR-TE implementation in IOS XR can be used both as an SR-TE headend and an SR PCE. When possible, one should leverage the SR-TE path computation of the headend (distributed design). This provides a very scalable solution. When needed, path computation is delegated to an SR PCE (centralized). Router and SR PCE use the same path computation algorithms. The difference in their functionality is not the computation engine but the content of the SR-TE DB. A headend's SR-TE DB is most often limited to the local domain and the local SR Policies. An SR PCE's SR-TE DB may contain (much) more information, such as the state of other SR Policies and additional performance information. Knowing other SR Policies allows disjoint path computation, and knowing other domains allows inter-domain path computation. Computing an SR-TE path from a headend to an endpoint within the same IGP area requires only information of the local IGP area. The headend node learns the information of its local IGP area and therefore it can compute such paths itself. Sections 4.2.1 and 4.2.2 detail two such examples: low-delay and resource exclusion. In a distributed model, the headend itself computes paths locally. When possible, this model should be used as it provides a very scalable solution.

4.2.1 Headend Computes Low-Delay Path

The network operator wants to provide a low-delay path for a delay-sensitive application. The source and destination of the path are in the same IGP area. For SR-TE to compute such a path, the link-delay information must be available in the SR-TE DB. Each node in the area measures the delays of its links and floods the measured link delay metrics in the IGP.

Link delay metric

The operator can use the measured link delay metric, which is dynamically measured per link by the router and distributed in the IGP. This methodology has the benefit of always ensuring a correct link delay, even if the optical topology changes due to optical circuit restoration. In addition, using the measured link delay removes the operational complexity of manually configuring link delays. See chapter 15, "Performance Monitoring – Link Delay" for more details of the measured delay metric functionality. In case the direct measurement and distribution of the link delay metric is not available, the operator can use the TE link metric to represent link delay. The TE metric is an additional link metric, distributed in the IGP, that the operator can leverage to represent the needs of a particular application. If the delays of the links in the network are constant and known (e.g., based on information coming from the optical network and fiber lengths), the operator can configure the TE link metrics to represent the (static) link delay.

Given that each node distributes the link delay metrics in the IGP, each headend node in the area receives this information and stores it in its SR-TE DB. This way, a headend node has the necessary information to compute delay-optimized paths in the network. For a headend to feed the IGP-learned information into the SR-TE DB, the distribute link-state command must be configured under the IGP, as illustrated in Example 4-1 for both ISIS and OSPF. This command has an optional parameter instance-id, which is only relevant in multi-domain topologies. See chapter 17, "BGP-LS" for details.

Example 4-1: Feed SR-TE DB with IGP information

router isis SR
 distribute link-state
!
router ospf SR
 distribute link-state

As an example, the operator has enabled the nodes in the network of Figure 4‑2 to measure the delay of their links. The current link delays are displayed in the illustration as the unidirectional link-delay in milliseconds. The IGP link metrics are 10.

Figure 4-2: Headend computes low-delay path

The operator needs a delay-optimized path from Node1 to Node5 and configures an SR Policy on Node1 with a dynamic path, optimizing the delay metric. Example 4-2 shows the SR Policy configuration on Node1. The SR Policy “LOW-DELAY” has a single candidate path that is dynamically computed by optimizing the delay metric. No path constraints are applied.

Example 4-2: SR Policy configuration – headend computed dynamic low-delay path

segment-routing
 traffic-eng
  policy LOW-DELAY
   color 20 end-point ipv4 1.1.1.5
   candidate-paths
    preference 100
     dynamic
      metric
       type delay

The resulting path, 1→2→3→4→5 with a cumulative delay of 10 + 10 + 10 + 9 = 39 ms, is illustrated in Figure 4-2. This path can be expressed with the SID list <16003, 16004, 16005>, where 1600X is the Prefix-SID of NodeX. Each of these Prefix-SIDs expresses the IGP shortest path to its target node. This SID list is shown in the output of Example 4-3.

Example 4-3: Headend computed low-delay SR Policy path

RP/0/0/CPU0:xrvr-1#show segment-routing traffic-eng policy

SR-TE policy database
---------------------

Color: 20, End-point: 1.1.1.5
  Name: srte_c_20_ep_1.1.1.5
  Status:
    Admin: up  Operational: up for 06:57:40 (since Sep 14 07:55:11.176)
  Candidate-paths:
    Preference: 100 (configuration) (active)
      Name: LOW-DELAY
      Requested BSID: dynamic
      Dynamic (valid)
        Metric Type: delay, Path Accumulated Metric: 39
          16003 [Prefix-SID, 1.1.1.3]
          16004 [Prefix-SID, 1.1.1.4]
          16005 [Prefix-SID, 1.1.1.5]
  Attributes:
    Binding SID: 40026
    Forward Class: 0
    Steering BGP disabled: no
    IPv6 caps enable: yes

The operator notices that this delay-optimized path only uses the path via Node4, while two equal-cost (IGP metric) paths are available between Node3 and Node5. This is due to a small difference in link delay between these two paths from Node3 to Node5: the path via Node4 has a delay of 10 + 9 = 19 ms, while the path via Node6 has a delay of 10 + 10 = 20 ms, 1 ms more. The operator considers the difference in delay between these two paths insignificant and would prefer to leverage the available ECMP between Node3 and Node5. The operator can achieve this by specifying a margin, tolerating a solution that is not optimal but within the specified margin of the optimal solution. The margin can be specified as an absolute value or as a relative value (percentage), both in comparison to the optimal solution. In Example 4-4 an absolute margin of 2 ms (2000 µs) is specified for the dynamic path of the SR Policy to Node5 on Node1. The cumulative delay of the solution path can be up to 2 ms larger than that of the minimum-delay path. With this configuration, the solution SID list leverages the ECMP between Node3 and Node5. Use the keyword relative to specify a relative margin.

Example 4-4: Dynamic path with delay margin

segment-routing
 traffic-eng
  policy LOW-DELAY
   color 20 end-point ipv4 1.1.1.5
   candidate-paths
    preference 100
     dynamic
      metric
       type delay
       margin absolute 2000

Delay margin

“Back in 2014 when designing the SR-TE native algorithms, we realized that the intuition to use as much ECMP as possible would not work for delay optimization without the notion of margin.

First, let us explain the reason. This is again very intuitive. Human beings optimize network topologies and assign IGP metrics to enhance the ECMP nature of shortest paths. For example, two fibers between Brussels and Paris are assigned the same IGP cost of 10 while one fiber takes 300 km and the other one takes 500 km. From a capacity viewpoint, this distance difference does not matter. From a delay viewpoint it does matter. The dynamic router-based performance monitoring of the two fibers will detect a difference of 200 km / c ≈ 1 msec, where c is the speed of light in the fiber. The IGP will flood one link with 1.5 msec of delay and the other link with 2.5 msec of delay. Remote routers computing a low-delay path (e.g., from Stockholm to Madrid) would have to insert a prefix-SID in Brussels followed by the adjacency SID of the first fiber to make sure that the second fiber is avoided. Two more SIDs and no ECMP, just for gaining 1 msec between Stockholm and Madrid.

We could easily guess that some operators would prefer fewer SIDs and more ECMP in exchange for some margin around the low-delay path. Hence, we introduced the notion of margin for low-delay SR-native optimization. We then designed the related algorithm, which finds the segment-routed path with the least amount of SIDs within the margin above the lowest-delay path.”
— Clarence Filsfils
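The effect of the margin can be illustrated with a toy selection rule: among candidate encodings, keep those whose accumulated delay is within the margin of the lowest-delay path and pick the one with the fewest SIDs. The candidate SID lists and their delays below are assumptions made for illustration; the real algorithm constructs the candidates itself.

# Toy illustration of the delay-margin trade-off (not the actual SR-native algorithm).

def pick_with_margin(candidates, margin):
    """candidates: list of (sid_list, accumulated_delay). Prefer the fewest SIDs
    among those within `margin` of the minimum delay; break ties on lower delay."""
    best_delay = min(delay for _, delay in candidates)
    within = [(sids, delay) for sids, delay in candidates
              if delay <= best_delay + margin]
    return min(within, key=lambda c: (len(c[0]), c[1]))

# Delays in microseconds, candidate encodings assumed for illustration:
candidates = [([16003, 16004, 16005], 39000),   # pinned via Node4, 3 SIDs
              ([16003, 16005], 40000)]          # uses ECMP between Node3 and Node5, 2 SIDs
print(pick_with_margin(candidates, 0))       # -> 3-SID list, strictly lowest delay
print(pick_with_margin(candidates, 2000))    # -> 2-SID list, within the 2 ms margin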

Whenever the topology changes, or the link delay measurements change significantly (see chapter 15, "Performance Monitoring – Link Delay", discussing the link delay measurement), headend Node1 recomputes the paths in the new topology and updates the SID list of the SR Policy accordingly.

4.2.2 Headend Computes Constrained Paths

The operator needs to provide a path that avoids certain resources. The example network is a dual-plane network. A characteristic of such a network design is that, when a packet lands on a plane, it stays on that plane until its destination, provided the plane is not partitioned. By default, traffic flows are load-balanced over both planes. The operator can steer traffic flows onto one of the planes. This can be achieved in multiple ways. One way is by using an Anycast-SID assigned to all devices in a plane, as described in the previous chapter (chapter 3, "Explicit Candidate Path"). Another possibility is to compute dynamic paths, restricting the path to a given plane. This method is described in this section. Yet another way is to use the Flex-Algo functionality described in chapter 7, "Flexible Algorithm".

Resource exclusion

“The richness of the SR-TE solution allows a given problem to be solved in different ways, each with different trade-offs. Let us illustrate this with the dual-plane disjoint paths use-case (e.g., enforcing a flow through the blue plane). In the explicit path chapter 3, "Explicit Candidate Path", we showed a first solution using the anycast SID of the blue plane. In this dynamic path chapter, we show a second solution using a dynamic path with an “exclude affinity green”. Later on, in the SR IGP Flex-Algo chapter 7, "Flexible Algorithm", we will describe a third solution. The first solution does not require any intelligence on the headend but may fail to adapt to rare topology issues. The second solution requires SR-TE headend intelligence (SR-TE DB and the SR-native algorithm) without network-wide IGP change or dependency. The third solution leverages the IGP itself but requires a network-wide feature capability. In my opinion, each is useful, and the selection is done case by case based on the specific operator situation. This is confirmed by my experience, as I have been involved with the deployment of the first two and at the time of writing I am involved in the deployment of the third solution type. Each operator had its specific reasons to prefer one solution over another. The SR solution is built as modules. Each operator can use the modules he wants based on his specific analysis and preference.”
— Clarence Filsfils

4.2.2.1 Affinity Link Colors

To steer traffic flows on a single plane, the operator uses an SR Policy with a dynamic path that is constrained to the desired plane. One way to apply this restriction is by using the link affinity functionality. The links in each plane are colored; the links in one plane are colored blue and the links in the other plane green. The path via the green plane can then be computed as “optimize IGP metric, avoiding blue links”, and the path via the blue plane as “optimize IGP metric, avoiding green links”.

Link colors and affinity An operator can mark links with so-called affinity link colors, known as administrative groups in IETF terminology. Historically, there were 32 different colors (groups) available, later extended to allow more (256 colors in Cisco IOS XR). These colors allow links to be mapped into groups or classes. For example, links in a given region are colored red and links in another region are colored green. The colors of a link are advertised in IGP as a bitmap, where each color represents a bit in this bitmap. The IGP advertises the affinity bitmaps with the adjacencies. An operator can freely choose which bit represents which color. For example, color red is represented by the first bit in the bitmap and color green by the third bit. If a link has a given color, the affinity bitmap of that link is then advertised with the bit of the corresponding color set. Each link can have multiple colors, in which case multiple bits in the bitmap will be set. These link colors are inserted in the SR-TE DB. A headend node can then use these link colors in its path computation to include or exclude links with a given color or set of colors in the path. For example, the operator can specify to compute a dynamic path, optimizing IGP metric and avoiding links with color red. Refer to RFC 3630, RFC 5305, and Section 6.2 of RFC 2702 for IETF information about affinity and link colors.
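A sketch of how the affinity names of this section could translate into bitmaps, and how an exclude-any constraint prunes links, is shown below. The bit positions follow the example configuration of this section; the code itself is illustrative only.

# Affinity bitmap sketch: names map to bit positions; exclude-any prunes links.

AFFINITY_MAP = {'GREEN': 0, 'BLUE': 2}          # name -> bit position (locally defined)

def affinity_bits(names):
    """Build the advertised affinity bitmap for a link from its color names."""
    bits = 0
    for name in names:
        bits |= 1 << AFFINITY_MAP[name]
    return bits

def exclude_any(link_bits, excluded_names):
    """True if the link must be pruned: it carries any of the excluded colors."""
    mask = affinity_bits(excluded_names)
    return (link_bits & mask) != 0

blue_link = affinity_bits(['BLUE'])             # advertised as 0b100
print(exclude_any(blue_link, ['BLUE']))         # True  -> pruned for VIA-PLANE-GREEN
print(exclude_any(blue_link, ['GREEN']))        # False -> usable for VIA-PLANE-BLUE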

The topology of Figure 4‑3 is a dual-plane network. One plane, named “Green”, consists of the nodes 11 to 14, while the other plane, named “Blue”, consists of the nodes 21 to 24. Node2 and Node3 are connected to both planes. All the links in plane Green are colored with the same affinity color (green), the links in plane Blue with another affinity color (blue).

Figure 4-3: dual-plane affinity resource avoidance example

The affinity color configuration is illustrated with Node2's configuration in Example 4-5. This node has a link in each plane, the link to Node11 in the Green plane and the link to Node21 in the Blue plane. The configuration starts by defining human-friendly affinity color names. These names can be any user-defined strings, not just names of colors as usually shown in examples. Each name identifies a specific bit in the affinity bitmap. The position of the bit in the bitmap is zero-based; the first bit has position 0. Name GREEN in the example corresponds to the bit at position 0 in the bitmap (GREEN bit-position 0), name BLUE corresponds to the bit at position 2.

The naming scheme for the affinity-maps is locally significant; the names and their mapping to the bit positions are not distributed. Consistency is key; it is strongly recommended to have a consistent name-to-bit-position mapping configuration across all nodes, e.g., by using an orchestration system. After defining the names, they can be assigned to links. Interface Gi0/0/0/0 on Node2 is in the plane Green and is marked with affinity GREEN (affinity name GREEN), while interface Gi0/0/0/1, which is in the plane Blue, is marked with the name BLUE. Each link can be marked with multiple names by configuring multiple affinity names under the interface. The node then advertises for each interface the affinity bitmap with the bits that correspond to the configured names set to 1.

Example 4-5: Assigning Affinity link colors – Node2

segment-routing
 traffic-eng
  affinity-map
   name GREEN bit-position 0
   name BLUE bit-position 2
  !
  interface Gi0/0/0/0   !! link to Node11, in plane Green
   affinity
    name GREEN
  !
  interface Gi0/0/0/1   !! link to Node21, in plane Blue
   affinity
    name BLUE

Since each node advertises the affinity-map of its links in the IGP, all nodes in the IGP area receive that information and insert it in their SR-TE DB. A headend node can then use this information in its path computations.

4.2.2.2 Affinity Constraint

The operator configures two SR Policies to Node3 on Node1. Each SR Policy uses one of the two planes. The operator configures one SR Policy to compute a dynamic path avoiding BLUE links, hence restricted to the Green plane, and another SR Policy with a path that avoids GREEN links, hence restricted to the Blue plane. The SR Policy configuration of Node1 is shown in Example 4-6.

Example 4-6: Affinity link color resource avoidance example – Node1

segment-routing
 traffic-eng
  policy VIA-PLANE-GREEN
   color 20 end-point ipv4 1.1.1.3
   candidate-paths
    preference 100
     dynamic
      metric
       type igp
      !
     constraints
      affinity
       exclude-any
        name BLUE
  !
  policy VIA-PLANE-BLUE
   color 30 end-point ipv4 1.1.1.3
   candidate-paths
    preference 100
     dynamic
      metric
       type igp
      !
     constraints
      affinity
       exclude-any
        name GREEN

The first SR Policy, named VIA-PLANE-GREEN, has color 20 and endpoint 1.1.1.3 (Node3). A single candidate path is configured, dynamically computing an IGP metric optimized path, excluding links with color BLUE. Headend Node1 locally computes the path. The resulting path is shown in Figure 4‑3. Notice that the path follows plane Green, leveraging the available ECMP within this plane. Example 4‑7 shows the status of this SR Policy on Node1.

Example 4-7: SR Policy status on Node1

RP/0/0/CPU0:xrvr-1#show segment-routing traffic-eng policy

SR-TE policy database
---------------------

Color: 20, End-point: 1.1.1.3
  Name: srte_c_20_ep_1.1.1.3
  Status:
    Admin: up  Operational: up for 00:10:54 (since Mar  2 17:27:13.549)
  Candidate-paths:
    Preference: 100 (configuration) (active)
      Name: VIA-PLANE-GREEN
      Requested BSID: dynamic
      Constraints:
        Affinity:
          exclude-any:
            BLUE
      Dynamic (valid)
        Metric Type: IGP, Path Accumulated Metric: 50
          16014 [Prefix-SID, 1.1.1.14]
          16003 [Prefix-SID, 1.1.1.3]
  Attributes:
    Binding SID: 40048
    Forward Class: 0
    Steering BGP disabled: no
    IPv6 caps enable: yes

The second SR Policy in Node1’s configuration, named VIA-PLANE-BLUE, steers the traffic along plane Blue. Node1 computes this path by optimizing the IGP metric and avoiding links with color GREEN. Whenever the topology changes, headend Node1 re-computes the SR Policy paths in the new topology and updates the SID lists of the SR Policies accordingly. With the configuration in Example 4‑6, three paths are available to Node3, two via SR policies and one via the IGP shortest path. The SR Policies restrict the traffic to one of the planes, while the IGP shortest path uses both planes. By default, service traffic is steered via the IGP shortest path to its nexthop. If the nexthop has an associated Prefix-SID then that will be imposed. This is the default Prefix-SID forwarding behavior. For example, service traffic with nexthop Node3 is steered via Node3’s Prefix-SID 16003. The IGP Prefix-SID 16003 follows the unconstrained IGP shortest path, leveraging all available ECMP paths. Hence, by default, the traffic flows from Node1 to Node3 are not limited to a single plane, but they are distributed over all available ECMP.

Steering service traffic into the SR Policies towards its nexthop can be done using Automated Steering (AS) by attaching the color of the required SR Policy to the destination prefix. Attach color 20 to steer service traffic into SR Policy VIA-PLANE-GREEN and color 30 to steer it into SR Policy VIA-PLANE-BLUE. Chapter 5, "Automated Steering" describes AS in more detail. This gives the operator three steering possibilities: via both planes for uncolored destinations, via plane Green for destinations with color 20, and via plane Blue for destinations with color 30.

4.2.3 Other Use-Cases and Limitations

4.2.3.1 Disjoint Paths Limited to Single Head-End

Disjoint (diverse) paths are paths that do not traverse common network elements, such as links, nodes, or SRLGs. To compute disjoint SR-TE paths, information about all paths is required. If the disjoint paths have a common headend and the endpoint(s) of these paths are within the same IGP area, the headend can compute these paths itself. In that case the headend has knowledge of the local IGP area and it has knowledge of both paths since it is the headend of both. If the disjoint paths have distinct headends, then these headends cannot compute the paths since they only know about their own SR Policy paths and are unaware of the other headend's SR Policy paths. If one does not know the other path, then computing a disjoint path is not possible. This use-case requires a centralized solution where a centralized computation entity is aware of both paths in order to provide disjoint paths. Section 4.3 of this chapter details the disjoint paths use-case between two sets of headends and endpoints.

4.2.3.2 Inter-Domain Path Requires Multi-Domain Information

Computing paths that traverse different IGP areas or domains requires knowledge about all these IGP areas and domains. Typically, a headend node only has knowledge about its local IGP area. The headend's SR-TE DB could contain a multi-domain topology database, but that is not yet seen in practice, for scalability reasons.

If the headend node does not have multi-domain information, it cannot compute inter-area and inter-domain paths. This requires using a centralized solution. Section 4.3.3 of this chapter details the inter-domain path computation use-case.

4.3 Centralized Computation

When possible, one should leverage the SR-TE path computation of the headend (distributed design). When needed, path computation is delegated to an SR PCE (centralized). Router and SR PCE use the same path computation algorithms. The difference of their functionality is not the computation engine but the content of the SR-TE DB. A headend's SR-TE DB is most often limited to its local domain and its own SR Policies, as was pointed out before. An SR PCE's SR-TE DB may contain more information, such as the topology of other domains and the state of other SR Policies. Knowing other domains allows inter-domain path computation. Knowing other SR Policies allows disjoint path computation.

4.3.1 SR PCE

A Path Computation Element (PCE) is an element in the network that provides a path computation service to Path Computation Clients (PCCs). Typically, a PCC communicates with a PCE using the PCE communication Protocol (PCEP). Each headend node can act as a PCC and request the PCE to compute a path using a client/server request/reply model. The PCE computes the path and replies to the PCC, providing the path details. To compute paths, an SR PCE has an SR-TE DB and a computation engine, like an SR-TE headend node. For its local IGP area, the SR PCE learns the topology information from the IGP and stores that information in its SR-TE DB. The PCE uses this information to compute paths, hereby using the same path computation algorithms as used by a headend node. A headend node acts as a PCC and requests an SR PCE to compute a path to an endpoint. In its request the headend provides the path's optimization objective and constraints. The SR PCE then computes the path and returns the resulting path to the PCC (headend) as a SID list. The headend then instantiates this path. But an SR PCE can do more than compute paths; as a stateful PCE it can control paths as well. A PCC hands over control of a path to a PCE by delegating this path to the PCE.
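The request/reply interaction can be sketched conceptually as follows. The classes and method names are invented for illustration and do not correspond to a real PCEP library; they only show that the PCE applies the same computation, but against its own (possibly richer) SR-TE DB.

# Conceptual PCC/PCE interaction; classes and methods are illustrative placeholders.

class SrPce:
    def __init__(self, srte_db, compute_fn):
        self.srte_db = srte_db                  # possibly multi-domain, with SR Policies
        self.compute = compute_fn               # same SR-native algorithms as a headend

    def path_computation_request(self, headend, endpoint, intent):
        sid_list = self.compute(self.srte_db, headend, endpoint, intent)
        return sid_list                         # returned to the PCC in a PCRep

class Headend:
    def __init__(self, name, pce):
        self.name, self.pce = name, pce

    def instantiate_dynamic_path(self, endpoint, intent):
        sid_list = self.pce.path_computation_request(self.name, endpoint, intent)
        return {'endpoint': endpoint, 'sid_list': sid_list}   # programmed locally

pce = SrPce(srte_db={}, compute_fn=lambda db, h, e, i: [16008])   # dummy computation
node5 = Headend('Node5', pce)
print(node5.instantiate_dynamic_path('1.1.1.8', {'metric': 'igp'}))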

“It's a very tedious task to organize the efficiency of a network in regards to RSVP-TE. We have to manually tweak the timers and the cost, and add lots of route-policies to make sure that the traffic will use the pattern that we want. SR-TE dynamic path drastically reduces this complexity and eliminates the need for manual intervention while maintaining a highly optimized and robust network infrastructure. One easy use-case is the disjoint paths for video streaming; when sending duplicate streams, SR-TE dynamic path allows us to make sure the two copies will never use the same links all along the path from source to destination. This is done by automatically gathering information from the LS-DB to the SR PCE. The PCE can then compute SR Policy paths that match the desired objective, such as delay, and constraints.”
— Daniel Voyer

IOS XR SR PCE Server

SR PCE server functionality is available in the base IOS XR software image and it can be used on all physical and virtual IOS XR platforms. SR PCE functionality can be enabled on any IOS XR node that is already in the network. However, for scalability and to prevent a mix of functionalities on a given node, it may be desirable to deploy separate nodes for the SR PCE server functionality, using either a hardware or a virtual platform. The SR PCE server functionality is enabled in IOS XR using a configuration as in Example 4‑8. The configured IP address is used for the PCEP session between PCC and PCE. It must be a globally reachable (not in a VPN) local IP address. SR PCE receives its topology information from the different protocols (ISIS, OSPF, and BGP) via an internal API. To enable the IGP (ISIS or OSPF) of the PCE node to feed its IGP link-state database (LS-DB) to the SR-TE process, configure distribute link-state under the IGP. This way, the SR PCE server learns the local IGP area’s topology information and inserts it in its SR-TE DB. BGP automatically feeds its BGP-LS information to SR PCE, without additional configuration, as described in section 4.3.3. The instance-id 101 in the distribute link-state command is the domain identifier that is used to distinguish between domains in the SR-TE DB. This is explained further in chapter 12, "SR-TE Database" and chapter 13, "SR PCE".

Example 4-8: SR PCE server configuration

pce
 address ipv4 1.1.1.10
!
router isis SR   !! or "router ospf SR"
 distribute link-state instance-id 101

IOS XR PCE Client (PCC)

To enable a headend as PCE Client (PCC), configure the address of the SR PCE, as shown in Example 4-9 for a PCE with address 1.1.1.10. With this configuration, the PCC establishes a PCEP session to the SR PCE. Multiple PCEs can be configured with an order of preference, as explained in chapter 13, "SR PCE".

Example 4-9: PCE address on PCC

segment-routing
 traffic-eng
  pcc
   pce address ipv4 1.1.1.10

4.3.1.1 SR PCE Redundancy

SR PCE uses the existing PCEP high availability functionality to provide redundancy. The three main components of this functionality are topology learning, SR Policy reporting, and re-delegation behavior. A headend has a PCEP connection to a pair (or even a larger group) of PCEs. For redundancy, it is important that these PCEs have a common knowledge of the topology and the SR Policies in the network. That way, the headend can use any of these PCEs in case its primary PCE fails. All connected PCEs have the same knowledge of the topology since they all receive the same topology feed from the IGP and/or BGP-LS. When an SR Policy is instantiated, updated, or deleted, the headend sends an SR Policy state report to all its connected PCEs. This keeps the SR Policy database of all these PCEs in sync, and thus one PCE can act as a standby of another PCE. A headend delegates control of an SR Policy path to a single SR PCE, the primary SR PCE that also computed the path.

Failure of this primary PCE does not impact the SR Policies that are delegated to it, nor the traffic that is steered into them. Upon failure of this PCE, the headend maintains the SR Policies and re-delegates them to another connected PCE. This new delegate PCE takes over control of the path, verifies it, and updates it if required. Given that the information available to all PCEs in this set is the same and the PCEs use the same path computation algorithms, the path will not be updated if the topology did not change in the meantime. More details are available in chapter 13, "SR PCE".
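The re-delegation behavior can be sketched as a small state transition: keep the path and its traffic untouched, and only move the delegation to the next available PCE. The data model below is an assumption for illustration.

# Sketch of PCC re-delegation on PCE failure (illustrative state handling only).

def redelegate(policy, pce_sessions):
    """policy: dict with 'delegated_to' and 'sid_list'.
    pce_sessions: ordered list of (pce_address, is_up). The policy and its traffic
    are kept as-is; only the delegation moves to the next available PCE."""
    current = policy['delegated_to']
    up = [addr for addr, is_up in pce_sessions if is_up]
    if current in up:
        return policy                                  # primary PCE still reachable
    if not up:
        policy['delegated_to'] = None                  # keep the path, wait for a PCE
        return policy
    policy['delegated_to'] = up[0]                     # re-delegate to the next PCE;
    return policy                                      # it re-verifies/updates the path

p = {'delegated_to': '1.1.1.10', 'sid_list': [16007, 16008]}
print(redelegate(p, [('1.1.1.10', False), ('1.1.1.11', True)]))  # moves to 1.1.1.11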

4.3.2 SR PCE Computes Disjoint Paths

In this first centralized computation use-case, a centralized SR PCE computes disjoint paths from two different headends. The SR PCE not only computes the paths but controls them as well. These are key requirements to provide optimal disjoint paths from different headends. The network in Figure 4-4 is a single-area network, interconnecting two customer sites, Site 1 and Site 2. The operator of this network wants to provide disjoint paths between the customer sites. The disjoint paths start from two different headend nodes: one path from Node1 to Node4 and another path from Node5 to Node8.

Figure 4-4: Network topology for disjoint paths

An SR PCE is added to the network. For illustration purposes the SR PCE is drawn as a single separate entity. Chapter 13, "SR PCE" discusses the different SR PCE connectivity options. Headends Node1 and Node5 cannot simply compute the usual IGP shortest paths to Node4 and Node8 respectively, since these paths are not disjoint, as illustrated in Figure 4‑5. The two IGP shortest paths share Node6, Node7, and the link between Node6 and Node7.

Figure 4-5: IGP shortest paths are not disjoint

The headends cannot independently compute disjoint paths since neither of the headends knows about the path computed by the other headend. Providing disjoint paths in an IP/MPLS network has always been cumbersome, especially when the disjoint paths must be achieved between distinct pairs of nodes. The operator could use path constraints to achieve disjoint paths. The operator could, for example, use affinity link colors to mark the links traversed by the path from Node5 to Node8 and then exclude these links from the path between Node1 and Node4. However, this solution provides no guarantee of finding disjoint paths. Moreover, it would be an operational burden, as the link colors would have to be updated whenever the topology changes. A dynamic solution to provide a disjoint path service is highly desired. The SR PCE provides this solution.

4.3.2.1 Disjoint Group

Disjoint paths are expressed by assigning them to an entity called a disjoint association group, or disjoint group for short. A group-id identifies a disjoint group. Paths with the same disjoint group-id (i.e., members of the same disjoint group) are disjoint from each other. Disjoint groups can be applied to paths originating at the same headend or at different headends. The operator indicates which paths must be disjoint from each other by assigning both paths the same disjoint group-id. The PCE understands and enforces this constraint. A disjoint group also specifies the diversity parameters, such as the desired type of disjointness: link, node, SRLG, or node+SRLG. The disjointness type indicates which resources are not shared between the two disjoint paths, e.g., link-disjoint paths do not share any link, node-disjoint paths do not share any node, etc.

4.3.2.2 Path Request, Reply, and Report

The operator first configures Node5 as a PCC by specifying the SR PCE's address 1.1.1.10. Then the operator configures an SR Policy to Node8 on Node5, as shown in Example 4‑10. The SR Policy, named POLICY1, has a single dynamic candidate path with preference 100. The keyword pcep under dynamic indicates that Node5 uses an SR PCE to compute the path. The optimization objective is to minimize the IGP metric. As a constraint, the path is a member of the disjoint group with identifier 1 and the paths in this group must be node-disjoint (type disjoint node).

Example 4-10: SR Policy disjoint path configuration on Node5
segment-routing
 traffic-eng
  policy POLICY1
   color 20 end-point 1.1.1.8
   candidate-paths
    preference 100
     dynamic
      pcep
      metric type igp
     constraints
      association-group
       type disjoint node identifier 1
  !
  pcc
   pce address ipv4 1.1.1.10

Figure 4-6: First path of disjoint group

After configuring the SR Policy on Node5, Node5 sends a Path Computation Request (PCReq) to the SR PCE 1.1.1.10 (see ➊ in Figure 4‑6). In that PCReq, Node5 provides all the necessary information to enable the SR PCE to compute the path: the headend and endpoint, the optimization objective, and the constraints. In this example, Node5 requests the SR PCE to compute a path from Node5 to Node8, optimizing the IGP metric, with the constraint that it must be node-disjoint from other paths with disjoint group-id 1. Since the SR PCE does not know about any other existing paths with disjoint group-id 1, it computes the unconstrained IGP metric optimized path to Node8. This is indicated as ➋ in Figure 4‑6. The SR PCE then sends the computation result to Node5 in a Path Computation Reply (PCRep) message (marked ➌). If the path computation was successful, then the PCRep contains the solution SID list. In the example, the resulting SID list is <16008>, only containing the Prefix-SID of Node8. Node5 instantiates the path for the SR Policy to Node8 (see ➍ in Figure 4‑6). Node5 reports the information of this path to the SR PCE, using a Path Computation Report (PCRpt) message (marked ➎). In this PCRpt, Node5 provides all details about the state of the path to the SR PCE.

Figure 4-7: Second path of disjoint group

Sometime later, the operator configures the SR Policy to Node4 on headend Node1, as shown in Example 4‑11. This configuration is almost identical to the configuration on Node5 in Example 4‑10; we chose to use a different policy name, although this is not required.

Example 4-11: SR Policy disjoint path configuration on Node1
segment-routing
 traffic-eng
  policy POLICY2
   color 20 end-point 1.1.1.4
   candidate-paths
    preference 100
     dynamic
      pcep
      metric type igp
     constraints
      association-group
       type disjoint node identifier 1
  !
  pcc
   pce address ipv4 1.1.1.10

After configuring the SR Policy on Node1, Node1 sends a PCReq to the SR PCE (see ➊ in Figure 4‑7). Node1 indicates in this PCReq that it needs a path to Node4, with optimized IGP metric and node-disjoint from other paths with disjoint group-id 1. The SR PCE finds another path with disjoint group-id 1 in its SR-TE DB: the path from Node5 to Node8 that was instantiated before. The SR PCE computes the path¹ (marked ➋ in Figure 4‑7), disjoint from the existing one, and finds that the path from Node1 to Node4 is 1→2→3→4, with SID list <16002, 24023, 16004>. It sends a PCRep to Node1 (➌), which instantiates the path (➍) and sends a PCRpt to the SR PCE with the path's state (➎).

4.3.2.3 Path Delegation

The sequence described in the previous section works well when first configuring the SR Policy on Node5 and then the SR Policy on Node1. To describe another important SR PCE mechanism, path delegation, assume that the SR Policy on Node1 is configured first. Figure 4‑8 illustrates this. After configuring the SR Policy on Node1, Node1 sends a Path Computation Request (PCReq) to the SR PCE, see ➊ in Figure 4‑8.

Figure 4-8: SR PCE computes disjoint paths – step 1

Since the SR PCE does not know about any other existing paths with disjoint group-id 1, it computes the unconstrained IGP metric optimized path to Node4. This is indicated with ➋ in Figure 4‑8. The resulting SID list is <16004>, only containing the Prefix-SID of Node4. The SR PCE sends the result to Node1 in a Path Computation Reply (PCRep) message, marked ➌. Node1 instantiates the path for the SR Policy to Node4, see ➍ in Figure 4‑8. Node1 reports the information of this path to the SR PCE, using a Path Computation Report (PCRpt) message, marked ➎. If Node5 now requests a path to Node8 that is node-disjoint from the path between Node1 and Node4, no such path exists, since all paths would have to traverse Node6 and Node7, violating the diversity requirement. The only solution is to change the path between Node1 and Node4 to 1→2→3→4.

To make this possible, Node1 has delegated control of the SR PCE computed path to the SR PCE. When a PCC delegates path control to the SR PCE, the SR PCE can autonomously indicate to the PCC that it must update the path. This way, the SR PCE can maintain the intent of the path, for example following a topology change or after adding or changing a path of a disjoint path pair. To delegate a path to the SR PCE, the PCC sets the delegate flag in the PCRpt message for that path. An IOS XR PCC always automatically delegates control to the SR PCE when the SR PCE has computed the path.

The next step in the example is the configuration of the SR Policy to Node8 on headend Node5. After configuring this SR Policy on Node5, Node5 sends a PCReq to the SR PCE (see ➊ in Figure 4‑9). Node5 indicates in this PCReq that it needs a path to Node8, with optimized IGP metric and node-disjoint from other paths with disjoint group-id 1. The SR PCE finds another path with disjoint group-id 1 in its SR-TE DB: the path from Node1 to Node4 that was instantiated before. Instead of simply computing the new path disjoint from the existing path, the SR PCE concurrently computes both disjoint paths, as this yields the optimal disjoint paths. This is marked ➋ in Figure 4‑9. As a result of this computation, the SR PCE finds that the path from Node1 to Node4 must be updated to the new path 1→2→3→4, otherwise no disjoint path is possible from Node5 to Node8.

Since Node1 has delegated control of the path to the SR PCE, the SR PCE can autonomously update this path when required. The SR PCE updates the path by sending a Path Computation Update (PCUpd) to Node1 (marked ➌ in Figure 4‑9) with the new SID list <16002, 24023, 16004>. 16002 is the Prefix-SID of Node2, 24023 is the Adj-SID of the link from Node2 to Node3, and 16004 is the Prefix-SID of Node4.

Figure 4-9: SR PCE computes disjoint paths – step 2

Node1 updates the path (indicated by ➍ in Figure 4‑9) and reports the new path to the SR PCE using a PCRpt (marked ➎). The SR PCE had also computed the path from Node5 to Node8 (5→6→7→8), with the solution SID list <16008>, where 16008 is the Prefix-SID of Node8. The SR PCE replies to Node5 with this solution SID list in the PCRep message. This is ➏ in Figure 4‑10. Node5 instantiates the path (indicated with ➐) and sends a PCRpt with the path information to the SR PCE (marked ➑). With this PCRpt, Node5 also delegates control of this path to the SR PCE such that the SR PCE can autonomously instruct Node5 to update the path if required.

Figure 4-10: SR PCE computes disjoint paths – step 3

Whenever the topology changes, the SR PCE autonomously re-computes the paths and updates them if required to maintain disjointness. The two headends of the disjoint paths need to use the same SR PCE to compute the paths to ensure mutual diversity. However, any such pair of headends can use any SR PCE for the computation. To expand on the example described above, another SR PCE, or pair of SR PCEs, could be introduced in the network. This SR PCE could, for example, compute disjoint paths from Node4 to Node1 and from Node8 to Node5. The SR PCEs that compute paths for different disjoint groups do not need to synchronize with each other. These computations are completely independent.
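At this point, both disjoint paths are delegated to the SR PCE. As a rough sketch, they can typically be inspected on the SR PCE from its LSP database; the command below is assumed to be available in the deployed IOS XR release and the hostname is illustrative:

RP/0/0/CPU0:pce#show pce lsp
!! Lists the SR Policy paths reported by the PCCs, including the paths delegated to this PCE;
!! both the Node1->Node4 and the Node5->Node8 paths of disjoint group 1 should appear here.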

PCEs are typically deployed in pairs for redundancy reasons, and intra-pair synchronization (state-sync PCEP sessions) may be leveraged for high availability. See chapter 13, "SR PCE" for more details.

4.3.3 SR PCE Computes End-To-End Inter-Domain Paths

In the previous sections of this chapter, the dynamic path computations, either performed by the headend node or by a PCE, were restricted to a single IGP area. In this section, we go one step further by adding multiple domains to the mix. Figure 4‑11 shows a topology consisting of three domains. All three domains are SR-enabled and, per the SR design recommendation, all nodes use the same SRGB. Domain1 and Domain2 are different Autonomous Systems (ASs), interconnected by two eBGP peering links: between Node14 and Node24, and between Node15 and Node25. Domain2 and Domain3 are interconnected by two border nodes: Node22 and Node23. If Domain2 and Domain3 are different IGP domains, these border nodes run two IGP instances, one for each connected domain. If Domain2 and Domain3 are two areas in a single IGP domain, then Node22 and Node23 are Area Border Routers (ABRs).

Significant Inter-Domain Improvements

“In the ~2005/2009 timeframe, I worked with Martin Horneffer and Jim Uttaro to define the “Seamless MPLS design”. In essence, we used RFC3107 to provide a best-effort MPLS LSP across multiple domains. SR-TE drastically improves this solution as it allows to remove RFC3107 (less protocol) and it supports SLA-enabled inter-domain policies (which the Seamless MPLS design could not support). SR Policies can indeed be used to instantiate a best-effort path across domains. An operator could leverage this to remove RFC3107 and hence simplify its operation. Furthermore, SR policies natively provide SLA policies (low-delay, resource avoidance, disjoint paths) over the inter-domain path while Seamless MPLS based on RFC3107 only delivers best-effort.”
— Clarence Filsfils

Figure 4-11: Multi-domain network topology

An SR Policy can provide a seamless end-to-end inter-domain path. To illustrate this, we start by looking at an SR Policy with an explicit inter-domain path, before jumping into dynamic path computation for multi-domain networks. To provide an end-to-end inter-domain path from Node11 to Node31 in the network of Figure 4‑11, the operator configures an SR Policy with an explicit SID list on Node11. The SID list is <16014, 51424, 16022, 16031>, where 160XX is the Prefix-SID of NodeXX. 51424 is the SID that steers the traffic over the BGP peering link between Node14 and Node24: the Peering-SID. This Peering-SID is described in chapter 14, "SR BGP Egress Peer Engineering". For now, you can view the Peering-SID as the BGP equivalent of the IGP Adj-SID: packets received with a Peering-SID as top label are steered towards the associated BGP neighbor.

Figure 4‑12 shows the path of the SR Policy. This illustration also shows a packet with its label stack as it travels from Node11 to Node31 using the above SID list. Prefix-SID 16014 brings the packet from Node11 to Node14 via the ECMP-aware IGP shortest path. The Peering-SID 51424 brings the packet from Node14 to Node24, traversing the peering link. Prefix-SID 16022 then brings the packet from Node24 to Node22 via the IGP shortest path, and finally, Prefix-SID 16031 brings the packet to Node31 via the ECMP-aware IGP shortest path.
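As a rough sketch, such an explicit path could be configured on Node11 using the explicit segment-list syntax that appears in Example 5‑11 later in this book. The endpoint 1.1.1.31 is Node31's loopback address as used later in this chapter; the policy name, segment-list name, and color value are illustrative assumptions, not taken from a tested configuration:

segment-routing
 traffic-eng
  segment-list name INTERDOMAIN_PATH
   index 10 mpls label 16014
   index 20 mpls label 51424
   index 30 mpls label 16022
   index 40 mpls label 16031
  !
  policy INTERDOMAIN
   color 10 end-point ipv4 1.1.1.31
   candidate-paths
    preference 100
     explicit segment-list INTERDOMAIN_PATH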

Figure 4-12: End-to-end inter-domain explicit path

Thus, an SR Policy can provide a seamless end-to-end inter-domain path. But the operator wants a dynamic solution that adjusts the end-to-end SLA paths to the changing network. Instantiating an SR Policy on Node11 with a dynamic path that is locally computed by the headend node does not work. In a multi-domain network, a headend node can only see the topology of its local area; it cannot see the network topology beyond the border nodes. The IGP floods the topology information in the local area only. Therefore, the headend node does not have enough information in its SR-TE DB to compute inter-domain paths. To solve this problem, the headend node can use an SR PCE to compute the path. This SR PCE will have to know the topologies of all domains.

4.3.3.1 SR PCE's Multi-Domain Capability

Information about inter-AS BGP peering links is provided via the SR BGP EPE functionality. Knowledge about all domain topologies and about the BGP inter-AS peering links is distributed by BGP-LS.

BGP-LS

In previous sections of this chapter, we have seen that an SR PCE can get the topology information of its connected IGP area. To get the topology information of remote domains, the SR PCE server can use BGP and the BGP link-state address-family, commonly known as "BGP-LS". BGP-LS transports an IGP LS-DB using BGP signaling. Basically, BGP-LS can pack the content of the LS-DB and transport it in BGP to a remote location, typically a PCE. BGP-LS benefits from all BGP route propagation functionality, such as the use of Route Reflectors, as we will see further. Since its introduction, BGP-LS has outgrown this initial functionality and, thanks to various extensions, it has become the mechanism of choice to feed any information to a controller. BGP-LS is described in more detail in chapter 17, "BGP-LS".

The SR PCE server learns the topology information of the different domains in the network via BGP-LS. In the example topology in Figure 4‑13, one node in each domain feeds its local LS-DB via a BGP-LS session to a single local Route Reflector (RR). This minimal setup is for illustration purposes only; in practice, redundancy would be provided.

Example 4‑12 shows the BGP-LS and IGP configuration of Node21. The BGP session to neighbor 1.1.1.29, which is the Domain2 RR, has the link-state address-family (BGP-LS) enabled. Note that both AFI and SAFI are named "link-state", hence the double link-state in the address-family configuration. With the configuration distribute link-state under router isis, Node21 distributes its ISIS LS-DB to BGP. The instance-id 102 in this command is a "routing universe" identifier that makes it possible to differentiate multiple, possibly overlapping, topologies. This identifier is carried in all BGP-LS advertisements and, for each advertisement, it identifies the IGP instance that fed the information to BGP-LS. In this example, the topology information of the ISIS instance "SR" on Node21 is identified by instance-id 102. All information that ISIS instance "SR" feeds to BGP-LS carries the identifier 102. More information is available in chapter 17, "BGP-LS". Remember from sections 4.2.1 and 4.3.1 in this chapter that this same command also enables distribution of the IGP LS-DB to the SR-TE process on the headend node and the SR PCE.

Example 4-12: BGP-LS configuration on Node21
router isis SR
!! or "router ospf SR"
 distribute link-state instance-id 102
!
router bgp 2
 bgp router-id 1.1.1.21
 address-family link-state link-state
 !
 neighbor 1.1.1.29
 !! Domain2 RR
  remote-as 2
  update-source Loopback0
  address-family link-state link-state

In practice, multiple nodes will have a BGP-LS session to multiple local RRs for redundancy. The link-state address-family (BGP-LS) can be carried in Internal BGP (iBGP) and External BGP (eBGP) sessions, and it is subject to the standard BGP propagation rules. In the example, a full-mesh of BGP-LS sessions interconnects the domain RRs, but other design options are possible, such as hierarchical RRs. Eventually, each RR in the network has the LS-DB information of all IGP domains in its BGP-LS database.
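For completeness, a minimal sketch of the corresponding RR-side BGP-LS configuration, modeled on Example 4‑12. The router-id 1.1.1.29 is the Domain2 RR from the example; the route-reflector-client statement under the link-state address-family is an assumption for illustration, and the sessions towards the other domains' RRs are omitted:

router bgp 2
 bgp router-id 1.1.1.29
 address-family link-state link-state
 !
 neighbor 1.1.1.21
 !! Domain2 BGP-LS feeder (Node21)
  remote-as 2
  update-source Loopback0
  address-family link-state link-state
   route-reflector-client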

Figure 4-13: Multi-domain distribution of LS-DB information in BGP-LS

BGP Peering-SIDs

At this point, we have not yet discussed how BGP-LS gets the information about the BGP peering links between Domain1 and Domain2. These links are not in the IGP LS-DB since there is no IGP adjacency formed across them. Earlier in this section, when discussing the inter-domain explicit path, we introduced the BGP Peering-SID. This Peering-SID fulfills a function for BGP peers that is similar to the function an Adj-SID has for IGP adjacencies. When applied to the inter-domain use-case, BGP Peering-SIDs provide the functionality to cross the border between two ASs and encode a path that spans different domains. BGP Peering-SIDs are SIDs that are allocated for a BGP peer or BGP peering link when enabling the Egress Peer Engineering (EPE) functionality. Chapter 14, "SR BGP Egress Peer Engineering" is dedicated to EPE. The EPE Peering-SIDs are advertised in BGP-LS by each EPE-enabled peering node. For this purpose, the EPE-enabled border nodes Node14, Node15, Node24, and Node25 have a BGP-LS session to their local RR, as shown in Figure 4‑14. The BGP-LS sessions between the RRs also distribute the EPE information.

Figure 4-14: Multi-domain distribution of EPE information in BGP-LS

Note that the information distributed by BGP-LS is not a periodic snapshot of the topology; it is a real-time reactive feed. It is updated as new information becomes available.

SR PCE Learns Multi-Domain Information

Now that we have seen how all the information required to compute inter-domain paths is available in BGP-LS, we can feed this information into the SR PCE's SR-TE DB. To that end, the SR PCE server connects via BGP-LS to its local RR to receive consolidated information about the whole network. The SR PCE server deployment model is similar to the BGP RR deployment model. Multiple SR PCE servers in the network can perform the inter-domain path computation functionality, provided they tap into the BGP-LS information feed. As multiple SR PCE servers can be deployed in the network, the operator can specify which headend uses which SR PCE server, based on geographical proximity or service type, for example. This provides horizontal scalability: more SR PCEs can be added when scale requires it, dividing the work between them.

SR PCE scales as BGP RR

“A key scaling benefit of the SR solution is that its SR PCEs scale like BGP Route Reflectors. A pair of SR PCEs can be deployed in each PoP or region and can be dedicated to serve the headends of that PoP/region. As their load increases, further pairs can be added. There is no need to synch all the SR PCEs of an SR domain. There is only a need to synch within a pair. This can be done either in an indirect manner (via BGP-LS/PCEP communications with the headends which reflect the same information to the two SR PCEs of their pair) or in a direct manner (PCEP synch between the two SR PCEs).”
— Clarence Filsfils

In the example in Figure 4‑15, two SR PCE servers have been added to the network, one in Domain1 and another in Domain3. Both connect to their local RR to tap into the real-time reactive BGP-LS feed to get up-to-date topology information. Headend Node11 uses the SR PCE in Domain1 to compute inter-domain paths, while Node31 can use the SR PCE in Domain3 to compute these paths.

Note that the headend nodes only need to use an SR PCE to compute paths if they do not have the necessary information available in their own SR-TE DB. For example, Node11 can compute by itself a delay-optimized path to Node14, which is located in its own domain. Node11 does not need to involve the SR PCE, as all required information is in its own SR-TE DB.
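A rough sketch of such a locally computed policy on Node11, using the same dynamic/metric syntax as Example 5‑3 later in this book. The policy name and color are illustrative assumptions, and 1.1.1.14 assumes that Node14's loopback follows the book's addressing convention:

segment-routing
 traffic-eng
  policy LOCAL_DELAY
   color 30 end-point ipv4 1.1.1.14
   candidate-paths
    preference 100
     dynamic
      metric type delay
!! No "pcep" keyword under "dynamic": the headend computes this path itself
!! from its local SR-TE DB.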

Figure 4-15: PCE servers using BGP-LS to receive multi-domain topology information

The SR PCE servers in the example are IOS XR routers with the SR PCE server functionality enabled. The SR PCE and BGP-LS configuration of Domain1's SR PCE node in this example is displayed in Example 4‑13. The SR PCE server functionality is enabled by configuring the local IP address used for the PCEP sessions, 1.1.1.10 in this example. The SR PCE gets its SR-TE DB information from BGP-LS; therefore, it has a BGP-LS session to the Domain1 RR 1.1.1.19. The SR PCE can combine the information received via BGP-LS with information from its local IGP instance.

Example 4-13: PCE server and BGP-LS configuration on Domain1's SR PCE node
pce
 address ipv4 1.1.1.10
!
router bgp 1
 bgp router-id 1.1.1.10
 address-family link-state link-state
 !
 neighbor 1.1.1.19
 !! Domain1 RR
  remote-as 1
  update-source Loopback0
  address-family link-state link-state

4.3.3.2 SR PCE Computes Inter-Domain Path

To provide an end-to-end delay-optimized path from Node11 to Node31, the operator configures an SR Policy on Node11. The operator specifies that the SR PCE in Domain1 computes this SR Policy's dynamic path. Example 4‑14 shows the SR Policy configuration on Node11. The SR Policy to endpoint 1.1.1.31 – Node31's loopback address – has a dynamic path computed by the SR PCE server. The SR PCE must optimize the delay of the path (metric type delay). Node11 connects to the SR PCE server with address 1.1.1.10 to compute the path, as configured under the pcc section on Node11.

Example 4-14: SR Policy with delay-optimized inter-domain path on Node11
segment-routing
 traffic-eng
  policy POLICY1
   color 20 end-point 1.1.1.31
   candidate-paths
    preference 100
     dynamic
      pcep
      metric type delay
  !
  pcc
   pce address ipv4 1.1.1.10

All nodes in the network have enabled link delay measurement and distribute the link delay metrics in the IGP. How this works and how to use it is explained in chapter 15, "Performance Monitoring – Link Delay". Together with the other topology information, the link delay metrics are distributed in BGP-LS. For simplicity, the link-delay metric of each link in the illustration is 10. The PCEP protocol exchange that occurs after configuring the SR Policy, as illustrated in Figure 4‑16, is equivalent to the sequence that we have seen before in the single-domain disjoint paths example (section 4.3.2).

Figure 4-16: Inter-domain path PCEP sequence

After configuring the SR Policy on Node11, Node11 sends a PCEP Request to its SR PCE, requesting a delay-optimized path to endpoint Node31 (marked ➊ in Figure 4‑16). The SR PCE uses the multi-domain information in its SR-TE DB to compute the end-to-end delay-optimized path (indicated with ➋). The SR PCE returns the solution SID list to Node11 (➌ in the illustration). Node11 instantiates this path (marked ➍). This path follows the IGP shortest path to Node15 using Prefix-SID 16015 of Node15. It then traverses the BGP peering link using the Peering-SID 51525. To traverse the low-delay, high-IGP-metric link between Node25 and Node23, Node25's Adj-SID for this link is used: label 24523. Finally, the path follows the IGP shortest path to Node31, using Node31's Prefix-SID 16031. Node11 reports the status of the path to the SR PCE in a PCEP Report message (marked ➎). In this Report, Node11 also delegates the control of this path to the SR PCE. The SR PCE can autonomously request Node11 to update the path if required.

4.3.3.3 SR PCE Updates Inter-Domain Path

To illustrate the reactivity of the SR PCE computed path, assume a failure occurred in the network. The link between Node25 and Node23 went down (indicated with ➊ in Figure 4‑17). As a first reaction to the failure, Node25 activates the TI-LFA backup path for the Adj-SID 24523 of the failed link (➋). This quickly restores connectivity of the traffic carried in the SR Policy.

Figure 4-17: Stateful PCE updates inter-domain path after failure

The IGP floods the topology change and Node21 advertises it in BGP-LS (marked ➌). BGP propagates the topology update to the SR PCE (➍ in the illustration). The SR PCE re-computes the path (➎) and sends a PCEP Update message to headend Node11 (marked ➏). Node11 updates the path (➐) and sends a PCEP Report message to the SR PCE with the status of the path (indicated with ➑). The sequence of events in this example is event-triggered: each event triggers the next one. However, it is not instantaneous and is subject to delays, such as the time BGP needs to propagate the topology change to the SR PCE.

4.4 Summary

This chapter explains how traffic engineering paths are dynamically computed. Computing a TE path means solving an optimization problem that has an optimization objective and constraints. The required information to compute SR-TE paths is stored in the SR-TE DB. SR-optimized algorithms have been developed to compute paths and encode these paths in SID lists while minimizing the number of SIDs and maximizing ECMP.

In many cases, the headend node itself can compute SLA paths in a single IGP area. Low-delay and resource exclusions are typical examples. All information required for such computations is available in the headend's SR-TE DB.

When the headend does not have the necessary information in its SR-TE DB, the headend node uses an SR PCE to compute paths. Computing disjoint paths from different headend nodes is a frequent example. The two headends delegate the computation of their SR Policy to the same SR PCE to ensure mutual path diversity. The SR PCE is responsible for keeping a delegated SR path optimal. It re-computes delegated paths in case of topology changes and instructs the headends to update the paths if required.

Another case requiring an SR PCE to compute paths is the multi-domain case. A headend node cannot compute optimal end-to-end paths in a multi-domain network since the IGP only provides topology information about the local area, not about the remote areas and domains. An SR PCE learns the topology information about all domains in a network using BGP-LS and stores this information in its SR-TE DB. The SR-TE DB is natively multi-domain capable. BGP-LS not only carries IGP topology information in BGP, but also other information, such as EPE information. BGP-LS provides a real-time reactive feed for this information that the PCE can tap into. The SR PCE uses the multi-domain SR-TE DB to compute optimal end-to-end inter-domain paths for the headend nodes.

The SR PCE functionality is not concentrated in a single centralized server. SR PCE servers can be distributed throughout the network (a typical analogy is BGP Route Reflector distribution).

The SR PCE servers do not need to synchronize with each other. They only need to get the necessary information (e.g., topology) via their BGP-LS feed, and from PCEP reports of their connected PCCs.

4.5 References

[SIGCOMM2015] "A Declarative and Expressive Approach to Control Forwarding Paths in Carrier-Grade Networks", Renaud Hartert, Stefano Vissicchio, Pierre Schaus, Olivier Bonaventure, Clarence Filsfils, Thomas Telkamp, Pierre François, SIGCOMM 2015, October 2015
[RFC2702] "Requirements for Traffic Engineering Over MPLS", Michael D. O'Dell, Joseph Malcolm, Jim McManus, Daniel O. Awduche, Johnson Agogbua, RFC2702, September 1999
[RFC3107] "Carrying Label Information in BGP-4", Eric C. Rosen, Yakov Rekhter, RFC3107, May 2001
[RFC3630] "Traffic Engineering (TE) Extensions to OSPF Version 2", Derek M. Yeung, Dave Katz, Kireeti Kompella, RFC3630, October 2003
[RFC5305] "IS-IS Extensions for Traffic Engineering", Tony Li, Henk Smit, RFC5305, October 2008
[RFC5440] "Path Computation Element (PCE) Communication Protocol (PCEP)", JP Vasseur, Jean-Louis Le Roux, RFC5440, March 2009
[RFC7752] "North-Bound Distribution of Link-State and Traffic Engineering (TE) Information Using BGP", Hannes Gredler, Jan Medved, Stefano Previdi, Adrian Farrel, Saikat Ray, RFC7752, March 2016
[RFC8231] "Path Computation Element Communication Protocol (PCEP) Extensions for Stateful PCE", Edward Crabbe, Ina Minei, Jan Medved, Robert Varga, RFC8231, September 2017
[draft-ietf-idr-bgp-ls-segment-routing-ext] "BGP Link-State extensions for Segment Routing", Stefano Previdi, Ketan Talaulikar, Clarence Filsfils, Hannes Gredler, Mach Chen, draft-ietf-idr-bgp-ls-segment-routing-ext-12 (Work in Progress), March 2019

1. Instead of simply computing the path disjoint from the existing one, the SR PCE concurrently computes both disjoint paths, as this has the highest probability of finding a solution and it yields the optimal solution. See the next section for details.↩

5 Automated Steering

What we will learn in this chapter:

- A Provider Edge (PE) advertising a service route (BGP, Pseudowire (PW), or Locator/ID Separation Protocol (LISP)) may tag it with a color, which indicates a specific intent (e.g., color 30 = "low-delay", color 20 = "only via network plane blue").
- In BGP, the color is supported by the well-known color extended community attribute. The same extension has been defined for LISP.
- Per-Destination Automated Steering (AS) automatically steers a service route onto a valid SR Policy based on its next-hop and color. If no such valid SR Policy exists, the service route is instead installed "classically" in the forwarding plane with a recursion on the route to the next-hop, i.e., the Prefix-SID to the next-hop.
- AS is enabled by default but can be disabled on a per-protocol (e.g., BGP) and per-AFI/SAFI basis.
- A per-flow variant of AS will be detailed in a future revision of this book. Per-flow steering enables steering traffic flows by considering various packet header fields such as source address or DSCP.

In the previous chapters, we have explored the SR Policy, its paths, and SID lists. We have seen how such paths can be computed and instantiated on a headend node. But how can we make use of these SR Policies? How can we steer traffic into them? In this chapter we first describe how an operator can indicate the service requirements of a route by tagging it with a color, a BGP color community for BGP routes. This color is an identifier of an intent. The ingress node can then use the route's color and nexthop to automatically steer it into the matching SR Policy. This is called Automated Steering (AS). Next, we explain how the granular per-destination AS applies to various types of BGP service routes. Finally, we show how AS can be disabled if necessary.

5.1 Introduction

The Automated Steering (AS) functionality automatically steers service routes into the SR Policy that provides the desired intent or SLA¹. AS is a key component of the SR-TE solution as it automates the steering of the service traffic onto the SR Policies that deliver the required SLA. AS can operate on a per-flow basis². In this case, multiple flows to the same service destination (e.g., BGP route 1.0.0.0/8) may be automatically steered onto different SR Policies. While any per-flow classification technique could be used, a commonly used one is DSCP/EXP classification. For example, traffic to 1.0.0.0/8 with DSCP 1 would go in SR Policy 1 while traffic to 1.0.0.0/8 with DSCP 2 would go in SR Policy 2. In this revision of the book, we will focus on the per-destination AS solution. For example, two BGP destinations 1.0.0.0/8 and 2.0.0.0/8 with the same next-hop 9.9.9.9 can be automatically steered onto two different SR Policies to 9.9.9.9. While the concept applies to any type of service route, we will use BGP as an illustration.

Omnes viae Romam ducunt (all the ways lead to Rome)

“The ODN/AS design came up in a taxi in Rome ☺. We were on our way to an excellent trattoria in Trastevere. We were caught in the traffic on the right bank of the river Tevere somewhere between Castel San Angelo and piazza Trilussa. My dear friends Alexander Preusche and Alberto Donzelli were giving me a hard time on the need for an “SDN” solution that would be “easy to operate”. I knew that SR was perfectly adapted to a centralized solution (e.g., see Paul Mattes’ talk [SR-for-DCI]). But I resisted the push towards this sole solution as I thought that it required a level of operational investment that many SPs would want to avoid.

I wanted a solution that would:
- be much more automated.
- not rely on a single mighty server.
- be integrated with the BGP/VPN route distribution as this is the heart of the SP/enterprise services.

The traffic jam was a blessing as was the friendly pressure from Alex and Alberto. Finally, the idea came and looked very simple to me:
- when it receives the VPN routes from an attached CPE, the egress PE marks the received routes with a color that encodes the SLA required by the related customer;
- the route-reflector reflects this color attribute transparently;
- the ingress PE automatically requests the local SR-TE process to instantiate an SR Policy using a template associated with the color (this would become ODN);
- when instantiated, the SR-TE process calls back BGP and provides the BSID of the instantiated policy;
- BGP automatically installs the colored service route on the BSID (this would become AS).

The idea naturally applied to multi-domain as the ODN template could request SR PCE delegation for SLA objectives that exceed the information available at the ingress PE. After a very good meal with Alberto and Alex, I called Siva and explained the idea. As usual, with his boundless energy, he said, “let’s do it” and a few weeks later we had the first proof of concept. Bertrand immediately understood the huge business appeal for the solution and was a key contributor to get this committed in the product roadmap. Thanks to Siva, Bertrand, Alberto and Alex … and Rome ☺”
— Clarence Filsfils

Throughout this chapter we will use the network in Figure 5‑1 to illustrate the different concepts. This network consists of a single IGP area. The default IGP link metric is 10; the links between Node3 and Node4 and between Node5 and Node8 have a higher IGP link metric of 100, as indicated in the drawing. An affinity color red is assigned to the link between Node6 and Node7.
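As an aside, a minimal sketch of how such an affinity could be declared and assigned to a link interface in IOS XR. The affinity-map line matches Example 5‑11 later in this chapter; the interface name and the interface-level affinity syntax are assumptions for illustration and may differ per release:

segment-routing
 traffic-eng
  affinity-map
   name red bit-position 0
  !
  interface GigabitEthernet0/0/0/0
   affinity
    name red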

Figure 5-1: Automated Steering reference topology

The operator of the network in Figure 5‑1 has enabled delay measurement on all links in the network (see chapter 15, "Performance Monitoring – Link Delay"). For simplicity, we assume that the measured link-delays of all links are the same: 10 milliseconds. The operator of the network provides various services that will be described in the remainder of this chapter. Figure 5‑1 illustrates a VPN service for a customer NewCo, using CEs Node13 and Node43, attached to PEs Node1 and Node4 respectively. The illustration also shows an internet service represented with a cloud attached to PE Node4. CE Node43, which is configured in VRF NewCo on Node4, advertises prefixes 3.3.3.0/24 and 6.6.6.0/24 via BGP to Node4. PE Node4 also learns a global prefix 8.8.8.0/24. Node4 propagates these prefixes via BGP to Node1 and sets itself (router-id 1.1.1.4) as the BGP nexthop. PE Node1 receives these routes from Node4 and propagates the routes in VRF NewCo to CE Node13. Node1 forwards traffic to these routes (VPN and global) via the IGP shortest path to Node4 (1→7→6→5→4).

We will build further on this topology to explain how to provide the appropriate SLA for the different service destinations.

5.2 Coloring a BGP Route

At the heart of the AS solution is the tagging concept: when the egress Provider Edge (PE) advertises a BGP (service) route, the egress PE colors the route. The color is chosen to indicate the SLA required by the BGP route. There is no fixed mapping of color to SLA semantics. Any operator is free to design the allocation. For example, in this chapter, we will use color 30 to represent a low-delay SLA.

5.2.1 BGP Color Extended Community

The SR-TE solution leverages the BGP Color Extended Community.

BGP communities

BGP communities (RFC 1997) are additional tags that can be added to routes to provide additional information for processing that route in a specific way. The BGP communities of a route are bundled in a BGP communities attribute attached to that route. BGP communities are typically used to influence routing decisions. BGP communities can be selectively added, modified, or removed as the route travels from router to router. A BGP community is a 32-bit number, typically consisting of the 16-bit AS number of the originator followed by a 16-bit number with an operator-defined meaning.

RFC 4360 introduced a larger (64-bit) format of community, the extended community. It has a type field to indicate the type and dictate the structure of the remaining bytes in the community. Examples of extended community types are the Route-Target (RT) community for VPN routes and the Route Origin community, both specified in RFC 4360. The Opaque Extended Community is another type of extended community defined in RFC 4360. It is really a class of extended communities. The Color extended community, specified in RFC 5512, is a sub-type of this class.

The BGP extended communities of a route are bundled in a BGP extended communities attribute attached to that route.

RFC 5512 specifies the format of the Color Extended Community shown in Figure 5‑2. The first two octets indicate the type of the extended community. The first octet indicates it is a transitive opaque extended community. “Transitive” means that a BGP node should pass it to its neighbors, even if it does not recognize the attribute. The second octet indicates the type of Opaque Extended Community, the type 11 (or 0x0b) is Color.

Figure 5-2: Color Extended Community format

The Color Value is a flat, 32-bit number. The value is defined by the user and is opaque to BGP. To be complete, we note that draft-ietf-idr-segment-routing-te-policy specifies that when the Color Extended Community is used to steer the traffic into an SR policy, two bits of the Reserved field are used to carry the Color-Only bits (CO bits), as shown in Figure 5‑3.

Figure 5-3: Color Extended Community CO-flags

The use of these CO bits with settings other than the default value "00" is not very common. The use-cases are explained in chapter 10, "Further Details on Automated Steering".

5.2.2 Coloring BGP Routes at the Egress PE

Automated Steering uses the color extended community to tag service routes with a color. This color extended community is then matched to the color of an SR Policy. The functionality and the configuration that is used to color BGP routes – i.e., attach a color extended community to a route – is not specific to SR-TE, but we cover this aspect for completeness. There are multiple ways to attach (extended) communities to BGP routes, all involving a route-policy, but applied at different so-called "attach points". A route-policy is a construct that implements a routing policy. This routing policy instructs the router to inspect routes, filter them, and potentially modify their attributes. A route-policy is written in Routing Policy Language (RPL), which resembles a simplified programming language. In this chapter we illustrate two different route-policy attach points, an ingress BGP route-policy and an egress BGP route-policy. Other attach points are possible but not covered in this book.

In Figure 5‑1, Node4 receives the service routes from CE Node43. By applying an ingress route-policy on Node4 for the BGP session to Node43, the received routes and their attributes can be manipulated, such as adding a color extended community. Node4 advertises the global prefix 8.8.8.0/24 to Node1. By applying an egress route-policy on Node4 for the BGP session to Node1, the transmitted routes and their attributes can be manipulated, such as adding a color extended community.

Before looking at the route-policies on Node4, let's take a look at the color extended communities. The color extended communities that Node4 uses in its route-policies are defined in the extcommunity-sets Blue, Green, and Purple, as shown in Example 5‑1.

Example 5-1: color extended community sets on Node4
extcommunity-set opaque Blue
  20
end-set
!
extcommunity-set opaque Green
  30
end-set
!
extcommunity-set opaque Purple
  40
end-set

Using the configuration in Example 5‑2, Node4 applies route-policy VRF-COLOR as an ingress route-policy under the address-family ipv4 unicast for VRF NewCo neighbor 99.4.43.43 (Node43) (see line 39). Since it is applied under the VRF NewCo BGP session, it is only applied to routes of that VRF.

Route-policy VRF-COLOR is defined in lines 1 to 9. This route-policy attaches color extended community Blue to 3.3.3.0/24 and Green to 6.6.6.0/24. Route-policy GLOBAL-COLOR, defined in lines 11 to 16, attaches color extended community Purple to 8.8.8.0/24. This route-policy is applied as an egress route-policy under the address-family ipv4 unicast for global neighbor 1.1.1.1 (Node1) (see line 27).

Example 5-2: Applying route-policies on Node4 to color routes
 1 route-policy VRF-COLOR
 2   if destination in (3.3.3.0/24) then
 3     set extcommunity color Blue
 4   endif
 5   if destination in (6.6.6.0/24) then
 6     set extcommunity color Green
 7   endif
 8   pass
 9 end-policy
10 !
11 route-policy GLOBAL-COLOR
12   if destination in (8.8.8.0/24) then
13     set extcommunity color Purple
14   endif
15   pass
16 end-policy
17 !
18 router bgp 1
19  bgp router-id 1.1.1.4
20  address-family ipv4 unicast
21  address-family vpnv4 unicast
22  !
23  neighbor 1.1.1.1
24   remote-as 1
25   update-source Loopback0
26   address-family ipv4 unicast
27    route-policy GLOBAL-COLOR out
28   !
29   address-family vpnv4 unicast
30  !
31  vrf NewCo
32   rd auto
33   address-family ipv4 unicast
34   !
35   neighbor 99.4.43.43
36    remote-as 2
37    description to CE43
38    address-family ipv4 unicast
39     route-policy VRF-COLOR in

5.2.3 Conflict With Other Color Usage

The color extended community was defined many years ago and may safely be used concurrently for many distinct purposes.

The design rule for the operator is to allocate color ranges on a per-application basis. For example, an operator may have allocated the range 10-99 to indicate SLA requirements, while using 1000-1999 to track the PoP-of-origin of Internet routes. As soon as a specific range is allocated for SLA/SR-TE purposes, the operator will only use these colors for the SR Policy configurations (color, endpoint). Conversely, in this example, an SR Policy will never be configured with a color 1000, and hence an Internet route coming from PoP 1000 will never risk being automatically steered onto an SR Policy of color 1000. Multiple extended and regular communities can be concurrently attached to a given route, e.g., one to indicate SLA requirements and another to track the PoP-of-origin.
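A minimal route-policy sketch of this concurrent usage at the egress PE, reusing the extcommunity-set and RPL style of this chapter. The set names, the color value 30 (low-delay, as used elsewhere in this chapter), and the regular community 1:1000 for the PoP-of-origin are illustrative assumptions:

extcommunity-set opaque LOW-DELAY
  # low-delay SLA color, as used elsewhere in this chapter
  30
end-set
!
route-policy SLA-AND-POP
  set extcommunity color LOW-DELAY
  set community (1:1000) additive
  pass
end-policy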

5.3 Automated Steering of a VPN Prefix

Another customer, Acme, uses the L3VPN service of the network shown in Figure 5‑4. This customer's CE Node12 connects to PE Node1 in VRF Acme and similarly CE Node42 connects to PE Node4. Customer Acme requires a low-delay path from CE12 to CE42. To satisfy this requirement, the operator configures on Node1 an SR Policy with color "green" (value 30, associated with the low-delay SLA) and endpoint Node4's loopback prefix 1.1.1.4/32. The name of an SR Policy is user-defined. Here we have chosen the name "GREEN", matching the name we used for the color of this SR Policy, but we could have named it "delay_to_node4" or any other meaningful name. The SR Policy GREEN has a single candidate path with preference 100. This candidate path is a dynamic path with an optimized delay metric and no constraints. The configuration of this SR Policy is shown in Example 5‑3.

Example 5-3: SR Policy GREEN configuration on Node1
segment-routing
 traffic-eng
  policy GREEN
   color 30 end-point ipv4 1.1.1.4
   candidate-paths
    preference 100
     dynamic
      metric type delay

The path of SR Policy GREEN is shown in Figure 5‑4.

Figure 5-4: Automated Steering of a green BGP route

CE42 advertises prefix 2.2.2.0/24 via BGP to PE Node4. This is indicated by ➊ in Figure 5‑4. Node4 allocates a VPN label 92220 for prefix 2.2.2.0/24 in VRF Acme and advertises this prefix via BGP to Node1 with the BGP nexthop set to its loopback address 1.1.1.4. To indicate the low-delay requirement for traffic destined for this prefix, Node4 adds a color extended community with color “green” to the BGP update for VRF Acme prefix 2.2.2.0/24 (➋ in Figure 5‑4). As described earlier, color “green” refers to color value 30 which was chosen to identify the low-delay SLA. Node1 receives the BGP route and tries to match the route’s attached color (30) and its nexthop (1.1.1.4) to a valid SR Policy’s color and endpoint. BGP finds the matching SR Policy GREEN, with color 30 and endpoint 1.1.1.4, and automatically installs the prefix 2.2.2.0/24 in the forwarding table pointing to the SR Policy GREEN (➌), instead of pointing to the BGP nexthop. All traffic destined for VRF Acme route 2.2.2.0/24 is then forwarded via SR Policy GREEN. By directly mapping the color of the BGP prefix to the color of the SR Policy, no complex and operationally intensive traffic steering configurations are required. By installing the BGP prefix in the forwarding table, pointing to the SR Policy, the traffic steering has no impact on the forwarding performance.

Simplicity of Automated Steering

“In our first SR customer deployment in 2017, we had to use the tunnel interface mode SR-TE since SR Policy mode was not available at that time (see the “new CLI” section in chapter 1, "Introduction"). Consequently, legacy techniques like SPP (Service Path Preference) were used to steer traffic into corresponding tunnels. This customer has been struggling with the granularity and complexity of traffic steering. Therefore, when we introduced auto-steering capability of SR Policy mode, this customer accepted it at once: per-flow granularity, only need to care about “colors”. Now the customer is planning to migrate the existing tunnel interface mode SR-TE to SR Policy mode to enjoy the benefits of auto-steering.”
— YuanChao Su

More Details

We will now explore the Automated Steering functionality in more detail by examining the BGP routes and forwarding entries. Example 5‑4 shows the BGP configuration of Node4. It is a regular base VPNv4 configuration. Node4 has two BGP sessions: one VPNv4 iBGP session to Node1 (1.1.1.1) and an eBGP session in VRF Acme to CE42 (99.4.42.42). An extcommunity-set Green is defined for color value 30 (lines 14-17). Note that the name Green is only a locally significant, user-defined identifier of this set. To attach color 30 to the prefixes received from CE Node42, the operator has chosen to use an ingress route-policy COLOR_GREEN for address-family ipv4 unicast of the BGP session to Node42 in VRF Acme. To keep the example simple, route-policy COLOR_GREEN unconditionally attaches the extcommunity-set Green to all incoming routes on that BGP session.

Example 5-4: BGP configuration on Node4
 1 vrf Acme
 2  address-family ipv4 unicast
 3   import route-target
 4    1:1
 5   !
 6   export route-target
 7    1:1
 8 !
 9 interface GigabitEthernet0/0/0/1.100
10  vrf Acme
11  ipv4 address 99.4.41.4 255.255.255.0
12  encapsulation dot1q 100
13 !
14 extcommunity-set opaque Green
15   # low-delay SLA
16   30
17 end-set
18 !
19 route-policy COLOR_GREEN
20   set extcommunity color Green
21 end-policy
22 !
23 route-policy PASS
24   pass
25 end-policy
26 !
27 router bgp 1
28  bgp router-id 1.1.1.4
29  address-family vpnv4 unicast
30  !
31  neighbor 1.1.1.1
32   remote-as 1
33   update-source Loopback0
34   address-family vpnv4 unicast
35  !
36  vrf Acme
37   rd auto
38   address-family ipv4 unicast
39   !
40   neighbor 99.4.42.42
41    remote-as 2
42    description to CE42
43    address-family ipv4 unicast
44     route-policy COLOR_GREEN in

BGP on Node4 receives the route 2.2.2.0/24 from CE Node42. This BGP route is displayed in Example 5‑5. Notice that the route has an attached color extended community Color:30, as a result of the ingress BGP route-policy COLOR_GREEN, as described above. The other extended community of that route is a Route-Target (RT) community with value 1:1, which is used for L3VPN purposes.

Example 5-5: Node4 received BGP route 2.2.2.0/24 from CE42
RP/0/0/CPU0:xrvr-4#show bgp vrf Acme 2.2.2.0/24
BGP routing table entry for 2.2.2.0/24, Route Distinguisher: 1.1.1.4:0
Versions:
  Process           bRIB/RIB  SendTblVer
  Speaker                 10          10
    Local Label: 92220
Last Modified: Jun  5 09:30:24.894 for 01:58:01
Paths: (1 available, best #1)
  Advertised to PE peers (in unique update groups):
    1.1.1.1
  Path #1: Received by speaker 0
  Advertised to PE peers (in unique update groups):
    1.1.1.1
  2
    99.4.42.42 from 99.4.42.42 (1.1.1.42)
      Origin IGP, metric 0, localpref 100, valid, external, best, group-best, import-candidate
      Received Path ID 0, Local Path ID 1, version 10
      Extended community: Color:30 RT:1:1
      Origin-AS validity: (disabled)

According to the regular L3VPN functionality, Node4 dynamically allocated a label 92220 for this VPN prefix and advertised it to Node1, with the color extended community attached. Node4 has set its router-id 1.1.1.4 as BGP nexthop for VPN prefixes. The BGP route as received by Node1 is shown in Example 5‑6. The output shows that VRF Acme route 2.2.2.0/24 has BGP nexthop 1.1.1.4 and color 30.

Example 5-6: Node1 received BGP VPN route 2.2.2.0/24 from PE Node4
RP/0/0/CPU0:xrvr-1#show bgp vrf Acme 2.2.2.0/24
BGP routing table entry for 2.2.2.0/24, Route Distinguisher: 1.1.1.1:0
Versions:
  Process           bRIB/RIB  SendTblVer
  Speaker                  7           7
Last Modified: Jun  5 09:31:12.880 for 02:04:00
Paths: (1 available, best #1)
  Not advertised to any peer
  Path #1: Received by speaker 0
  Not advertised to any peer
  2
    1.1.1.4 C:30 (bsid:40001) (metric 40) from 1.1.1.4 (1.1.1.4)
      Received Label 92220
      Origin IGP, metric 0, localpref 100, valid, internal, best, group-best, import-candidate, imported
      Received Path ID 0, Local Path ID 1, version 7
      Extended community: Color:30 RT:1:1
      SR policy color 30, up, registered, bsid 40001, if-handle 0x00000410
      Source AFI: VPNv4 Unicast, Source VRF: default, Source Route Distinguisher: 1.1.1.4:0

SR Policy GREEN on Node1, with color 30 and endpoint 1.1.1.4, matches the color 30 and BGP nexthop 1.1.1.4 of VRF Acme route 2.2.2.0/24. Using the Automated Steering functionality, BGP installs the route in the forwarding table, pointing to SR Policy GREEN. BGP uses the BSID of the SR Policy as a key. Since no explicit BSID was provided by the operator as part of the policy configuration (Example 5‑3), the headend node dynamically allocated one; this is the default behavior. To find out the actual BSID of an SR Policy, we can look at its status. Example 5‑7 shows the status of SR Policy GREEN. The output shows the dynamically allocated Binding SID for this SR Policy: label 40001. The SID list of this SR Policy is <16003, 24034>.

Example 5-7: Status of SR Policy GREEN on Node1
RP/0/0/CPU0:xrvr-1#show segment-routing traffic-eng policy

SR-TE policy database
---------------------

Color: 30, End-point: 1.1.1.4
  Name: srte_c_30_ep_1.1.1.4
  Status:
    Admin: up  Operational: up for 07:45:16 (since May 28 09:48:34.563)
  Candidate-paths:
    Preference: 100 (configuration) (active)
      Name: GREEN
      Requested BSID: dynamic
      Dynamic (valid)
        Metric Type: delay,   Path Accumulated Metric: 30
          16003 [Prefix-SID, 1.1.1.3]
          24034 [Adjacency-SID, 99.3.4.3 - 99.3.4.4]
  Attributes:
    Binding SID: 40001
    Forward Class: 0
    Steering BGP disabled: no
    IPv6 caps enable: yes

Without Automated Steering, BGP would install the VRF Acme route 2.2.2.0/24 recursing on its BGP nexthop 1.1.1.4. Recursion basically means that the route refers to another route for its forwarding information. When recursing on the BGP nexthop, all traffic destined for 2.2.2.0/24 would follow the Prefix-SID of 1.1.1.4 to Node4. With Automated Steering, BGP does not recurse the route on its BGP nexthop, but on the BSID of the matching SR Policy. In this case, BGP installs the route 2.2.2.0/24 recursing on the BSID 40001 of SR Policy GREEN, as shown in Example 5‑8. The RIB entry (the first output in Example 5‑8) shows the BSID (Binding Label: 0x9c41 (40001)). The CEF entry (the second output) shows that the path goes via local-label 40001, which resolves to SR Policy GREEN (with color 30 and endpoint 1.1.1.4: next hop srte_c_30_ep_1.1.1.4). Hence, all traffic destined for 2.2.2.0/24 will be steered into SR Policy GREEN.

As we had indicated above, BGP advertises the VPN route 2.2.2.0/24 with a VPN service label, 92220. Since the service label must be at the bottom of the label stack, Node1 imposes the VPN label 92220 on the packets before steering them into the SR Policy GREEN where the SR Policy's SID list is imposed. This is shown in the last line of the output, labels imposed {ImplNull 92220}. This label stack is ordered top→bottom. The bottom label on the right is the VPN label. The top label is the label needed to reach the BGP nexthop. In this case no such label is required since the SR Policy transports the packets to the nexthop. Therefore, the top label is indicated as ImplNull, which is the implicit-null label. Normally the implicit-null label signals the MPLS pop operation; in this case it represents a no-operation.

Example 5-8: RIB and CEF forwarding entries for VRF Acme prefix 2.2.2.0/24 on Node1
RP/0/0/CPU0:xrvr-1#show route vrf Acme 2.2.2.0/24 detail

Routing entry for 2.2.2.0/24
  Known via "bgp 1", distance 200, metric 0
  Tag 2, type internal
  Installed Jun  5 09:31:13.214 for 02:16:45
  Routing Descriptor Blocks
    1.1.1.4, from 1.1.1.4
      Nexthop in Vrf: "default", Table: "default", IPv4 Unicast, Table Id: 0xe0000000
      Route metric is 0
      Label: 0x1683c (92220)
      Tunnel ID: None
      Binding Label: 0x9c41 (40001)
      Extended communities count: 0
      Source RD attributes: 0x0001:257:17039360
      NHID:0x0(Ref:0)
  Route version is 0x1 (1)
  No local label
  IP Precedence: Not Set
  QoS Group ID: Not Set
  Flow-tag: Not Set
  Fwd-class: Not Set
  Route Priority: RIB_PRIORITY_RECURSIVE (12) SVD Type RIB_SVD_TYPE_REMOTE
  Download Priority 3, Download Version 1
  No advertising protos.

RP/0/0/CPU0:xrvr-1#show cef vrf Acme 2.2.2.0/24
2.2.2.0/24, version 1, internal 0x5000001 0x0 (ptr 0xa14f440c) [1], 0x0 (0x0), 0x208 (0xa16ac190)
 Updated Jun  5 09:31:13.233
 Prefix Len 24, traffic index 0, precedence n/a, priority 3
   via local-label 40001, 3 dependencies, recursive [flags 0x6000]
    path-idx 0 NHID 0x0 [0xa171651c 0x0]
    recursion-via-label
    next hop VRF - 'default', table - 0xe0000000
    next hop via 40001/0/21
     next hop srte_c_30_ep_1.1.1.4
      labels imposed {ImplNull 92220}

Let's use traceroute on Node1 to verify the path of the packets sent to 2.2.2.0/24. A traceroute from Node1 to the VRF Acme prefix 2.2.2.0/24 reveals the actual label stack that is imposed on the packets. The output of this traceroute is shown in Example 5‑9. By using the interface addressing convention as specified in the front matter of this book, the last digit in each address of the traceroute output indicates the number of the responding node. The traceroute probes follow the path 1→2→3→4→41. The label stack shown for the first hop, 16003/24034/92220, consists of the SR Policy's SID list as transport labels and the VPN label 92220 as service label.

Example 5-9: Traceroute on Node1 to VRF Acme prefix 2.2.2.0/24
RP/0/0/CPU0:xrvr-1#traceroute vrf Acme 2.2.2.2

Type escape sequence to abort.
Tracing the route to 2.2.2.2

 1  99.1.2.2 [MPLS: Labels 16003/24034/92220 Exp 0] 19 msec  9 msec  9 msec
 2  99.2.3.3 [MPLS: Labels 24034/92220 Exp 0] 9 msec  9 msec  9 msec
 3  99.3.4.4 [MPLS: Label 92220 Exp 0] 9 msec  9 msec  9 msec
 4  99.4.41.41 9 msec  9 msec  9 msec

Note that traffic to Node4's loopback prefix 1.1.1.4/32 itself is not steered into an SR Policy but follows the usual IGP shortest path 1→7→6→5→4, as illustrated in the output of the traceroute on Node1 to 1.1.1.4 displayed in Example 5-10.

Example 5-10: Traceroute on Node1 to Node4's loopback address 1.1.1.4

RP/0/0/CPU0:xrvr-1#traceroute 1.1.1.4

Type escape sequence to abort.
Tracing the route to 1.1.1.4

 1  99.1.7.7 [MPLS: Label 16004 Exp 0] 9 msec 0 msec 0 msec
 2  99.6.7.6 [MPLS: Label 16004 Exp 0] 0 msec 0 msec 0 msec
 3  99.5.6.5 [MPLS: Label 16004 Exp 0] 0 msec 0 msec 0 msec
 4  99.4.5.4 0 msec 0 msec 0 msec

“More than 15 years of software development experience helped me to identify solutions that are less complex to implement yet powerful enough to simplify network operations. When Clarence introduced the concept of ODN/AS, I immediately realized that the proposed mechanism falls into such category. After developing a proof-of-concept ODN/AS model for SR-TE disjoint path use-case, I generalized the model for other SR-TE use-cases. Currently, ODN/AS has become a salient feature of SR-TE portfolio. The operational simplicity of the ODN/AS solution did not come for free as it required a rather sophisticated implementation. One of the key challenges of this functionality was the internal communication between the different software processes (e.g., BGP and SR-TE processes became new collaborators). Since these software components are developed by different teams, the realization of this functionality was an example of true teamwork. Thanks to the excellent collaboration with the other component leads (particularly BGP, RIB, and FIB), the final implementation in the shipping product has become robust, scalable, performance optimized while at the same time simple to operate. Even though I have been part of SR-TE development from day one, because of its great customer appeal, realizing the concept of ODN/AS in Cisco products has provided me with a great sense of satisfaction. ” — Siva Sivabalan

5.4 Steering Multiple Prefixes With Different SLAs

Let us now assume that the operator has configured Node1 with two additional SR Policies to Node4. For convenience, each SR Policy is named after the color of its SLA. The following SR Policies exist on Node1:

SR Policy BLUE, color 20 (blue), endpoint 1.1.1.4
SR Policy GREEN, color 30 (green), endpoint 1.1.1.4
SR Policy PURPLE, color 40 (purple), endpoint 1.1.1.4

The paths of these three SR Policies are shown in Figure 5-5.

Figure 5-5: Multiple SR Policies with endpoint Node4 on Node1

The SR Policy configuration of Node1 is shown in Example 5‑11.

SR Policy BLUE has an explicit candidate-path using segment list BLUE_PATH. This segment list is defined at the top of the configuration. SR Policy GREEN has a dynamic delay-optimized path without constraints. SR Policy PURPLE has a dynamic path that optimizes the IGP metric and avoids links with Affinity red. There is only one link with Affinity red in the network: the link between Node6 and Node7.

Example 5-11: SR Policy configuration on Node1

segment-routing
 traffic-eng
  affinity-map
   name red bit-position 0
  !
  segment-list name BLUE_PATH
   index 10 mpls label 16008
   index 20 mpls label 24085
   index 30 mpls label 16004
  !
  policy BLUE
   color 20 end-point ipv4 1.1.1.4
   candidate-paths
    preference 100
     explicit segment-list BLUE_PATH
  !
  policy GREEN
   color 30 end-point ipv4 1.1.1.4
   candidate-paths
    preference 100
     dynamic
      metric
       type delay
  !
  policy PURPLE
   color 40 end-point ipv4 1.1.1.4
   candidate-paths
    preference 100
     dynamic
      metric
       type igp
     constraints
      affinity
       exclude-any
        name red

Different CEs of different customers are connected to PE Node4: Node42 of VRF Acme, Node43 of VRF NewCo, and Node44 of VRF Widget. Each of these customers has different SLA needs. The required SLA for each prefix is indicated by attaching the appropriate SLA-identifying color to the prefix advertisement. PE Node4 colors the prefixes as follows (also shown in Figure 5‑6), when advertising them to Node1:

VRF Acme prefix 2.2.2.0/24: color 30 (green) VRF Acme prefix 5.5.5.0/24: color 30 (green) VRF NewCo prefix 3.3.3.0/24: color 20 (blue) VRF NewCo prefix 6.6.6.0/24: color 30 (green) VRF Widget prefix 4.4.4.0/24: color 40 (purple) VRF Widget prefix 7.7.7.0/24: no color
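How the egress PE attaches these colors is covered earlier in this chapter (section 5.2); as a reminder, here is a minimal sketch of one possible way to color all VRF Acme routes green at Node4 with a VRF export route-policy. The policy and set names are illustrative and not taken from the book's lab configuration:

extcommunity-set opaque GREEN
  # color 30 identifies the green SLA
  30
end-set
!
route-policy COLOR_ACME_GREEN
  set extcommunity color GREEN
end-policy
!
vrf Acme
 address-family ipv4 unicast
  export route-policy COLOR_ACME_GREEN

Per-prefix coloring, as used for VRF NewCo and VRF Widget above, can be achieved the same way with prefix matching (e.g., if destination in (...)) inside the route-policy.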

Figure 5-6: Advertising prefixes with their appropriate SLA color

All these prefixes have a BGP nexthop 1.1.1.4.

In the previous section we have seen that Node1 steers VRF Acme prefix 2.2.2.0/24, with color green attached, into SR Policy GREEN since this SR Policy's color and endpoint match the color and BGP nexthop of the prefix. Node4 advertises the other prefix in VRF Acme, 5.5.5.0/24, with the same color green (30), therefore Node1 also steers this prefix into SR Policy GREEN. Multiple service routes that have the same color and same BGP nexthop share the same SR Policy.

Node1 receives VRF NewCo prefix 3.3.3.0/24 with nexthop 1.1.1.4 and color blue (20). BGP installs this route recursing on the BSID of SR Policy BLUE, as this SR Policy matches the color and nexthop of this route. The other prefix in VRF NewCo, 6.6.6.0/24, has color green (30). This prefix is steered into SR Policy GREEN, since this SR Policy matches the color and nexthop of this prefix.

BGP installs VRF Widget prefix 4.4.4.0/24 with nexthop 1.1.1.4 and color purple (40) recursing on the BSID of SR Policy PURPLE, the SR Policy that matches the color and nexthop of this route. Node4 advertises the other prefix of VRF Widget, 7.7.7.0/24, without color, indicating that this prefix requires no specific SLA. This BGP route as received by Node1 is presented in Example 5-12. Since it has no attached color, BGP on Node1 installs this route recursing on its BGP nexthop 1.1.1.4. Consequently, the traffic destined for 7.7.7.0/24 will follow the IGP shortest path to Node4.

Example 5-12: BGP route without color on Node1

RP/0/0/CPU0:xrvr-1#show bgp vrf Widget 7.7.7.0/24
BGP routing table entry for 7.7.7.0/24, Route Distinguisher: 1.1.1.1:0
Versions:
  Process           bRIB/RIB  SendTblVer
  Speaker                 10          10
Last Modified: Jun  5 14:02:44.880 for 00:03:39
Paths: (1 available, best #1)
  Not advertised to any peer
  Path #1: Received by speaker 0
  Not advertised to any peer
  2
    1.1.1.4 (metric 40) from 1.1.1.4 (1.1.1.4)
      Received Label 97770
      Origin IGP, metric 0, localpref 100, valid, internal, best, group-best, import-candidate, imported
      Received Path ID 0, Local Path ID 1, version 10
      Extended community: RT:1:1
      Source AFI: VPNv4 Unicast, Source VRF: default, Source Route Distinguisher: 1.1.1.4:0

The RIB and CEF forwarding entries for this colorless prefix are shown in Example 5‑13. The first output in Example 5‑13 shows that BGP installed the route in RIB without reference to a BSID (Binding Label: None).

The second output shows that the CEF entry for VRF Widget prefix 7.7.7.0/24 recurses on its BGP nexthop 1.1.1.4 (via 1.1.1.4/32). The last line in the output shows the resolved route, pointing to the outgoing interface and nexthop (Gi0/0/0/1, 99.1.7.7/32). The imposed labels are (labels imposed {16004 97770}), where 16004 (Node4's Prefix-SID) is the label to reach the BGP nexthop and 97770 is the VPN label that Node4 advertised for prefix 7.7.7.0/24 in VRF Widget.

Example 5-13: RIB and CEF forwarding entries for BGP route without color on Node1

RP/0/0/CPU0:xrvr-1#show route vrf Widget 7.7.7.0/24 detail

Routing entry for 7.7.7.0/24
  Known via "bgp 1", distance 200, metric 0
  Tag 2, type internal
  Installed Jun  5 14:02:44.898 for 00:00:13
  Routing Descriptor Blocks
    1.1.1.4, from 1.1.1.4
      Nexthop in Vrf: "default", Table: "default", IPv4 Unicast, Table Id: 0xe0000000
      Route metric is 0
      Label: 0x17dea (97770)
      Tunnel ID: None
      Binding Label: None
      Extended communities count: 0
      Source RD attributes: 0x0001:257:17039360
      NHID:0x0(Ref:0)
  Route version is 0x3 (3)
  No local label
  IP Precedence: Not Set
  QoS Group ID: Not Set
  Flow-tag: Not Set
  Fwd-class: Not Set
  Route Priority: RIB_PRIORITY_RECURSIVE (12) SVD Type RIB_SVD_TYPE_REMOTE
  Download Priority 3, Download Version 5
  No advertising protos.

RP/0/0/CPU0:xrvr-1#show cef vrf Widget 7.7.7.0/24
7.7.7.0/24, version 5, internal 0x5000001 0x0 (ptr 0xa14f440c) [1], 0x0 (0x0), 0x208 (0xa16ac5c8)
 Updated Jun  5 14:02:44.917
 Prefix Len 24, traffic index 0, precedence n/a, priority 3
   via 1.1.1.4/32, 3 dependencies, recursive [flags 0x6000]
    path-idx 0 NHID 0x0 [0xa1715c14 0x0]
    recursion-via-/32
    next hop VRF - 'default', table - 0xe0000000
    next hop 1.1.1.4/32 via 16004/0/21
     next hop 99.1.7.7/32 Gi0/0/0/1    labels imposed {16004 97770}

The last example showed that a BGP route without a color recurses on its nexthop. This can be generalized. If BGP on an ingress node receives a service route with color C and BGP nexthop E, and no valid SR Policy (C, E) with color C and endpoint E exists on the ingress node, then BGP installs the service route as usual, recursing on its BGP nexthop. In this case, any packet destined for this service route will follow the default path to E, typically the IGP shortest path. The behavior is the same as for service routes without a color.

5.5 Automated Steering for EVPN

While the previous section showed an L3VPN example, this section illustrates the Automated Steering functionality for a single-homed EVPN Virtual Private Wire Service (VPWS) service. EVPN is the next-generation solution for Ethernet services. It relies on BGP for auto-discovery and signaling, using the same principles as L3VPN. EVPN VPWS is specified in RFC 8214.

Figure 5-7 shows the topology with Node1 and Node4 as the PEs providing the EVPN VPWS service. Node12 and Node42 are the CEs of a customer, connected with their Access Circuit (AC) to their respective PEs. A BGP session is established between Node1 and Node4. For simplicity, this illustration shows a direct BGP session, but in general a BGP RR would be used. A VPWS service should be established between CE12 and CE42. This VPWS service requires low-delay transport, therefore the BGP advertisement of this service route is colored with color 30, identifying the low-delay requirement.

Figure 5-7: EVPN VPWS Automated Steering reference topology

The VPWS requires the configuration of the following elements on the PEs:

EVPN Instance (EVI) that represents a VPN on a PE router. It serves the same role as an L3VPN VRF.
Local AC identifier to identify the local end of the point-to-point VPWS
Remote AC identifier to identify the remote end of the VPWS

The L2VPN configuration of the VPWS service on Node4 is shown in Example 5-14. The circuit named "EVI9" on interface Gi0/0/0/0.100 has EVI 9; its remote and local AC identifiers are 1 and 4, respectively.

Example 5-14: EVPN VPWS configuration on PE Node4

interface GigabitEthernet0/0/0/0.100 l2transport
 encapsulation dot1q 100
!
l2vpn
 xconnect group evpn-vpws
  p2p EVI9
   interface GigabitEthernet0/0/0/0.100
   neighbor evpn evi 9 target 1 source 4

BGP signaling for EVPN uses the address-family l2vpn evpn. The BGP configuration of Node4 is displayed in Example 5-15. To signal the low-delay requirement of the circuit of EVI 9 to Node1, the color extended community 30 is attached to the service route. For this purpose, an outgoing route-policy evpn_vpws_policy is applied under the EVPN address-family of neighbor 1.1.1.1 (Node1). The route-policy evpn_vpws_policy matches on the route-distinguisher (RD) 1.1.1.4:9 of the service route, which is automatically allocated and consists of the router-id 1.1.1.4 and the EVI (9) of the service. The route-policy sets the color extended community Green for matching routes. "Green" refers to an extcommunity-set that specifies the color numerical value 30.

To transport the EVPN VPWS service between CE12 and CE42 on a low-delay path in both directions, the equivalent BGP configuration is used on Node1. Node1 then also colors the service route with color 30 (low-delay).

Example 5-15: BGP configuration on PE Node4

extcommunity-set opaque Green
  # color green identifies low-delay
  30
end-set
!
route-policy evpn_vpws_policy
  if rd in (1.1.1.4:9) then
    set extcommunity color Green
  endif
end-policy
!
router bgp 1
 bgp router-id 1.1.1.4
 address-family l2vpn evpn
 !
 neighbor 1.1.1.1
  remote-as 1
  update-source Loopback0
  address-family l2vpn evpn
   route-policy evpn_vpws_policy out
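For completeness, a minimal sketch of what the mirrored configuration on Node1 could look like, assuming the same color value and an auto-allocated RD of 1.1.1.1:9 for Node1's local EVI 9 route (as seen in Example 5-16); this is an illustrative sketch, not taken from the book's lab output:

extcommunity-set opaque Green
  # color green identifies low-delay
  30
end-set
!
route-policy evpn_vpws_policy
  # match Node1's locally originated EVI 9 route (assumed RD 1.1.1.1:9)
  if rd in (1.1.1.1:9) then
    set extcommunity color Green
  endif
end-policy
!
router bgp 1
 bgp router-id 1.1.1.1
 address-family l2vpn evpn
 !
 neighbor 1.1.1.4
  remote-as 1
  update-source Loopback0
  address-family l2vpn evpn
   route-policy evpn_vpws_policy out

With this in place, Node4 receives the EVI 9 route from Node1 with color 30 and can steer the reverse-direction VPWS traffic into its own low-delay SR Policy towards Node1.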

BGP on Node1 receives the service routes as shown in Example 5-16. The EVI 9 routes are highlighted. The first entry shows the local route, the second shows the imported remote route and the third shows the received route.

Example 5-16: BGP EVPN routes on Node1

RP/0/0/CPU0:xrvr-1#show bgp l2vpn evpn
BGP router identifier 1.1.1.1, local AS number 1
BGP generic scan interval 60 secs
Non-stop routing is enabled
BGP table state: Active
Table ID: 0x0   RD version: 0
BGP main routing table version 63
BGP NSR Initial initsync version 6 (Reached)
BGP NSR/ISSU Sync-Group versions 0/0
BGP scan interval 60 secs

Status codes: s suppressed, d damped, h history, * valid, > best
              i - internal, r RIB-failure, S stale, N Nexthop-discard
Origin codes: i - IGP, e - EGP, ? - incomplete
   Network            Next Hop            Metric LocPrf Weight Path
Route Distinguisher: 1.1.1.1:9 (default for vrf VPWS:9)
*> [1][0000.0000.0000.0000.0000][1]/120
                      0.0.0.0                                0 i
*>i[1][0000.0000.0000.0000.0000][4]/120
                      1.1.1.4 C:30                  100      0 i
Route Distinguisher: 1.1.1.4:9
*>i[1][0000.0000.0000.0000.0000][4]/120
                      1.1.1.4 C:30                  100      0 i

Processed 3 prefixes, 3 paths

To display the details of the route, add the NLRI to the show command as in Example 5‑17. The NLRI in this case is [1][0000.0000.0000.0000.0000][4]/120. The next hop of this route is 1.1.1.4 (Node4) and a color extended community 30 is attached to this route (Color:30 in the output). Node4 has allocated service label 90010 for this service route (Received Label 90010 in the output). The output 1.1.1.4 C:30 (bsid:40001) and the last line in the output are a result of the automated steering that we will explain next.

Example 5-17: EVI 9 BGP EVPN routes on Node1

RP/0/0/CPU0:xrvr-1#show bgp l2vpn evpn rd 1.1.1.4:9 [1][0000.0000.0000.0000.0000][4]/120
BGP routing table entry for [1][0000.0000.0000.0000.0000][4]/120, Route Distinguisher: 1.1.1.4:9
Versions:
  Process           bRIB/RIB  SendTblVer
  Speaker                 53          53
Last Modified: Jul  4 08:59:57.989 for 00:11:18
Paths: (1 available, best #1)
  Not advertised to any peer
  Path #1: Received by speaker 0
  Not advertised to any peer
  Local
    1.1.1.4 C:30 (bsid:40001) (metric 50) from 1.1.1.4 (1.1.1.4)
      Received Label 90010
      Origin IGP, localpref 100, valid, internal, best, group-best, import-candidate, not-in-vrf
      Received Path ID 0, Local Path ID 1, version 48
      Extended community: Color:30 RT:1:9
      SR policy color 30, up, registered, bsid 40001, if-handle 0x00000470

Node1 has an SR Policy configured with color 30 that provides a low-delay path to Node4. The configuration of this SR Policy is displayed in Example 5-18.

Example 5-18: SR Policy GREEN configuration on Node1

segment-routing
 traffic-eng
  policy GREEN
   color 30 end-point ipv4 1.1.1.4
   candidate-paths
    preference 100
     dynamic mpls
      metric
       type delay

When receiving and importing the route, BGP finds that the route with nexthop 1.1.1.4 is tagged with color extended community 30. BGP has learned from SR-TE that a valid SR Policy GREEN exists that matches the nexthop and color of the service route. Therefore, BGP installs the service route recursing on the Binding-SID 40001 of that matching SR Policy. This is indicated in the BGP route output in Example 5‑17 with 1.1.1.4 C:30 (bsid:40001) and also in the last line of the same output. As a result, the VPWS traffic for EVI 9 from Node1 to Node4 follows the low-delay path instead of the default IGP shortest path. The equivalent mechanism can be used to steer the reverse VPWS traffic via the low-delay path.

5.6 Other Service Routes

While the examples in the previous sections were restricted to VPNv4 L3VPN routes and EVPN VPWS routes, the Automated Steering functionality equally applies to other types of service routes, such as global labeled and unlabeled IPv4 and IPv6 unicast routes, 6PE routes, and 6vPE routes. Also, different types of EVPN routes (type 1, type 2, and type 5) will benefit from AS. At the time of writing, AS support for EVPN routes was limited.

Automated Steering is a generic steering architecture that can be equally applied to other signaling protocols such as LISP (Locator/ID Separation Protocol).

5.7 Disabling AS

Automated Steering is enabled by default. If a headend H with a valid SR Policy P (C, E) receives a BGP route B/b (prefix B, prefix length b) with color C and next-hop E, then H automatically installs B/b via P in the forwarding plane.

In some deployments, the operator may decide to only use AS for VPN services. In this case, the operator may want to add a layer of configuration robustness by disabling AS for the Internet service. This can be done as illustrated in Example 5-19.

Example 5-19: Disable Automated Steering for IPv4 Unicast BGP routes

segment-routing
 traffic-eng
  policy GREEN
   color 30 end-point ipv4 1.1.1.4
   steering
    bgp disable ipv4 unicast
   candidate-paths
    preference 100
     dynamic mpls
      metric
       type delay

More generically, Example 5-20 shows the options to disable Automated Steering; the angle brackets indicate placeholders for the policy name, color, endpoint, and BGP address-family values. When no AFI/SAFI is specified, Automated Steering is disabled for all BGP address-families. The per-AFI/SAFI functionality is not available at the time of writing this book.

Example 5-20: Options to disable Automated Steering

segment-routing
 traffic-eng
  policy <name>
   color <color> end-point (ipv4|ipv6) <endpoint>
   steering
    bgp disable [<afi> <safi>] [<afi> <safi> ...]

5.8 Applicability

Automated Steering functionality applies to any SR Policy, regardless of how it is instantiated. In this chapter, for simplicity of illustration, we used pre-configured SR Policies. AS applies in exactly the same way to any other type of SR Policy: instantiated via PCEP, BGP SR-TE, or ODN, as we will see in the next chapter.

ODN and AS are independent

“In this chapter, we explained the AS solution. In the next chapter, we will explain its close companion “ODN”. AS automates the steering of traffic into an SR Policy while ODN automates the instantiation of an SR Policy. Historically, I came up with the two solutions together. Later on, I insisted to break the original idea into two independent modules: ODN for the on-demand SR Policy instantiation and AS for the Automated Steering of service routes in related SR policies. It was clear that some operators would ensure that the required SR Policies would be present at the ingress PE (e.g., via a centralized controller) and these operators would need automated steering on existing SR policies (instead of on-demand policies). This is consistent with the overall modular architecture of the SR solution. Each independent behavior is defined as an independent module. Different operators combine different modules based on their specific use-cases. ”
— Siva Sivabalan

Automated Steering is not restricted to single IGP domain networks; it equally applies to multi-domain networks. Note that Automated Steering relies on the SR Policy color and endpoint matching the color and BGP nexthop of the service route, also for end-to-end inter-domain SR Policies. Consequently, service routes that must be steered into an end-to-end inter-domain SR Policy must have the egress PE (the SR Policy's endpoint in the remote domain) as BGP nexthop. Therefore, the service routes must be propagated to the ingress PE with the BGP nexthop kept intact. If eBGP sessions are involved, as is typically the case, they must be configured with next-hop-unchanged under the BGP address-family to ensure the BGP nexthop of the service route is not updated when propagating it.
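A minimal sketch of how this could look on an eBGP border node propagating VPNv4 service routes; the peer address 203.0.113.2, the AS numbers, and the PASS policy are hypothetical and only illustrate the placement of next-hop-unchanged:

route-policy PASS
  pass
end-policy
!
router bgp 1
 neighbor 203.0.113.2
  remote-as 2
  address-family vpnv4 unicast
   route-policy PASS in
   route-policy PASS out
   next-hop-unchanged

Without next-hop-unchanged, the eBGP speaker would rewrite the nexthop to its own address and the ingress PE could no longer match the service route against the SR Policy whose endpoint is the original egress PE.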

5.9 Summary

A service route is a BGP route (VPN or not), an EVPN route, a PW or a LISP route.

A Provider Edge (PE) advertises a service route and tags it with a color. In BGP, the color is supported by the well-known color extended community attribute. The same extension has been defined for LISP.

The operator allocates a range of colors to indicate the SLA requirement. For example, color = 30 means “low-delay” while color = 50 means “only via network plane blue”. For convenience, the operator allocates a name to the color. For example, color = 30 could be named “green” or “low-delay”.

AS is enabled by default.

Per-Destination Automated Steering (also called “AS”) automatically steers a service route with color C and next-hop N onto the SR Policy (N, C), if valid. If this SR Policy does not exist or is invalid, the service route is installed “classically” in the forwarding plane, i.e., with a recursion on the route to the next-hop N (typically the IGP path to N).

AS also supports per-flow steering. Per-flow steering allows steering different flows matching the same destination service route onto different SR Policies based on per-flow classification (e.g., DSCP value). This book revision only details the per-destination AS.

The operator may concurrently use the BGP color extended community for different purposes.

AS can be disabled on a per-protocol basis (BGP) and per AFI/SAFI basis.

AS is one of the key components of the SR-TE solution. It drastically simplifies the traffic-engineering operation. The operator only needs to color the service routes and AS automatically steers them into the correct SR Policies. Furthermore, by directly installing the service route recursing on the SR Policy, the forwarding performance degradation of mechanisms such as policy-based routing is avoided.

5.10 References

[SR-for-DCI] “Segment Routing for Data Center Interconnect at Scale”, Paul Mattes (Microsoft), Mohan Nanduri (Microsoft), MPLS + SDN WC2017, Upperside Conferences, March 2017

[RFC4364] "BGP/MPLS IP Virtual Private Networks (VPNs)", Yakov Rekhter, Eric C. Rosen, RFC4364, February 2006

[RFC4659] "BGP-MPLS IP Virtual Private Network (VPN) Extension for IPv6 VPN", Francois Le Faucheur, Jeremy De Clercq, Dirk Ooms, Marco Carugi, RFC4659, September 2006

[RFC4798] "Connecting IPv6 Islands over IPv4 MPLS Using IPv6 Provider Edge Routers (6PE)", Jeremy De Clercq, Dirk Ooms, Francois Le Faucheur, Stuart Prevost, RFC4798, February 2007

[RFC5512] "The BGP Encapsulation Subsequent Address Family Identifier (SAFI) and the BGP Tunnel Encapsulation Attribute", Pradosh Mohapatra, Eric C. Rosen, RFC5512, April 2009

[RFC6830] "The Locator/ID Separation Protocol (LISP)", Dino Farinacci, Vince Fuller, David Meyer, Darrel Lewis, RFC6830, January 2013

[RFC7432] "BGP MPLS-Based Ethernet VPN", John Drake, Wim Henderickx, Ali Sajassi, Rahul Aggarwal, Dr. Nabil N. Bitar, Aldrin Isaac, Jim Uttaro, RFC7432, February 2015

[RFC8214] "Virtual Private Wire Service Support in Ethernet VPN", Sami Boutros, Ali Sajassi, Samer Salam, John Drake, Jorge Rabadan, RFC8214, August 2017

[RFC4360] "BGP Extended Communities Attribute", Dan Tappan, Srihari S. Ramachandra, Yakov Rekhter, RFC4360, February 2006

[draft-ietf-spring-segment-routing-policy] "Segment Routing Policy Architecture", Clarence Filsfils, Siva Sivabalan, Daniel Voyer, Alex Bogdanov, Paul Mattes, draft-ietf-spring-segment-routing-policy-02 (Work in Progress), October 2018

[draft-ietf-idr-segment-routing-te-policy] "Advertising Segment Routing Policies in BGP", Stefano Previdi, Clarence Filsfils, Dhanendra Jain, Paul Mattes, Eric C. Rosen, Steven Lin, draft-ietf-idr-segment-routing-te-policy-05 (Work in Progress), November 2018

[draft-dukes-lisp-colored-engineered-underlays] "LISP Colored Engineered Underlays", Darren Dukes, Jesus Arango, draft-dukes-lisp-colored-engineered-underlays-01 (Work in Progress), December 2017

1. Service-Level Agreement – this may not be the most correct term to use in this context as it really indicates a written agreement documenting the required levels of service. When using the term SLA in this book, we mean the requirements of a given service.
2. Will be described in a later revision of this book.
3. Prefix B, prefix length b.

6 On-demand Next-hop

What we will learn in this chapter:

A candidate path of an SR Policy is either instantiated explicitly (configuration, PCEP, BGP-TE) or automatically by the “On-Demand Next-Hop” (ODN) solution.
If a head-end H has at least one service route with color C and nexthop E and has an SR Policy path template for color C (an “ODN template”), then H automatically instantiates a candidate path for the SR Policy (C, E) based on this template.
The candidate path instantiation occurs even if an SR Policy (C, E) already exists on H with a non-ODN-instantiated candidate path.
ODN applies to single-domain and multi-domain networks.

We first explain how a colored service route triggers the dynamic on-demand instantiation of an SR Policy candidate path, based on an ODN template. This path is deleted again if it is no longer used. We then explain the integration of ODN in the SR-TE solution and illustrate it with examples. Finally, we explain how to apply ODN restrictions.

In the previous chapters, we have explored the SR Policy, its paths, and SID lists. We have seen how such paths can be computed and instantiated on a headend node. We have seen how to automatically steer traffic onto the SR Policy. Up to now, the SR Policies were initiated by configuration or by a central controller. The configuration option can be cumbersome, especially if many SR Policies are to be configured. The controller option may not fit those operators who want to maximize distributed intelligence.

The On-demand Next-hop (ODN) functionality automates the instantiation of SR Policy paths by the headend node without any need for an SDN controller. The SR Policy paths are based on templates, each template specifying the requirements of the path. The Automated Steering functionality (AS) applies to any type of SR candidate path: ODN or not.

SR Design Principle – Simplicity

“The ODN solution is a key achievement of the SR-TE innovation. The heart of the business of a network operator is to support services and hence we needed to focus the solution on the service and not on the transport. The RSVP-TE solution built complex meshes of pre-configured tunnels and then had lots of difficulty steering traffic into the RSVP tunnels. With ODN (and AS), the solution is natural: the service routes are tagged and the related SR Policies are automatically instantiated and used for steering. The operation is focused on the service, not on the underlying transport. ”
— Clarence Filsfils

6.1 Coloring

As for the AS solution, the coloring of service routes is at the heart of the ODN solution. The ingress PE is pre-configured with a set of path templates, one per expected color indicating an SLA/SR-TE requirement. The egress PE colors the service routes it advertises with the color that indicates the desired SLA for each service route.

6.2 On-Demand Candidate Path Instantiation

As soon as an ingress PE that has an ODN template for color C receives at least one service route with color C and nexthop E, this PE automatically instantiates an ODN candidate path for SR Policy (C, E) according to the ODN template of color C. This candidate path is called ODN-instantiated (as opposed to locally configured or signaled from a controller via PCEP or BGP-TE).

If SR Policy (C, E) already exists, this ODN-instantiated candidate path is added to the list of candidate paths of the SR Policy. Otherwise, the SR Policy is dynamically instantiated. The ODN instantiation occurs even if non-ODN candidate paths already exist for the related SR Policy (C, E).

The ODN template specifies the characteristics of the instantiated candidate path, such as its preference, whether it is dynamic, and if so, which metric to minimize and which resources to exclude.

ODCPI is more accurate, ODN is nicer

“When the idea came in the taxi, Alberto and Alex’s problem was the instantiation of best-effort SR connectivity across domains without RFC3107. Without RFC3107, PE1 in the left-most domain does not know about the loopback of PE2 in the right-most domain and hence PE1 cannot install a VPN route learned with next-hop = PE2. Alex and Alberto needed an automated scheme to give PE1 a path to PE2’s loopback. They needed an on-demand SR Policy to the next-hop (PE2). Hence, in the taxi, I called that idea: “ODN” for “On-Demand Next-hop”: an automated SR path to the next-hop. Quickly the idea became well known within Cisco and the lead operators group and we kept using the “ODN” name. Technically, a better name could have been “ODCPI” for “On-Demand Candidate Path Instantiation”. This is much longer and cumbersome and hence we kept using the ODN name. ”
— Clarence Filsfils

6.3 Seamless Integration in SR-TE Solution

Once the ODN candidate path is instantiated for the SR Policy (C, E), all the behaviors for this SR Policy happen as usual:

The best candidate path is selected
The related SID list and Binding-SID (BSID) are inserted in the forwarding plane
If AS is applicable, service routes of color C and next-hop E are steered on the active candidate path of the SR Policy (C, E)

All these points apply regardless of the candidate path instantiation mechanism (configuration, signaling, or ODN).

SR Design Principle – Modularity

“Each solution is designed as an individual component that fits within the overall SR architecture. The ODN solution is solely related to the dynamic instantiation of a candidate path. All the rest of the SR-TE solution is unchanged. Any other behavior of the solution applies to a candidate path whether it is explicitly configured, explicitly signaled by an SDN controller or automatically instantiated by ODN ”
— Clarence Filsfils

6.4 Tearing Down an ODN Candidate Path

Assuming an ODN template for color C at headend H, when H receives the last BGP withdraw for a service route with color C and endpoint E, then H tears down the ODN-instantiated candidate path for the SR Policy (C, E). At that time, as for the previous section, the SR Policy behavior is as usual:

The best candidate path is selected
The related SID list and BSID are inserted in the forwarding plane
If AS is applicable, service routes of color C and next-hop E are steered on the selected candidate path of the SR Policy (C, E)

If the ODN candidate path was the last candidate path of the SR Policy, then the headend also tears down the SR Policy.

6.5 Illustration: Intra-Area ODN

Node1 in Figure 6-1 has two on-demand templates configured: one for color green (color value 30), and another for color purple (color value 40). The configuration of these templates is presented in Example 6-1.

Example 6-1: On-demand color templates on Node1

segment-routing
 traffic-eng
  affinity-map
   name RED bit-position 1
  !
  !! green ODN template
  on-demand color 30
   dynamic
    metric
     type delay
  !
  !! purple ODN template
  on-demand color 40
   dynamic
    metric
     type igp
    affinity exclude-any
     !! RED is defined in affinity-map above
     name RED

Figure 6-1: Intra-area On-Demand Next-hop (ODN)

The green template (on-demand color 30) specifies to compute a dynamic candidate path, optimizing the delay metric of the path. The purple template (on-demand color 40) specifies to compute a dynamic candidate path, optimizing the IGP metric and avoiding links with affinity color RED.

Colors

Note that we are using the color terminology for different purposes, in accordance with the common terminology used in the networking industry. The meaning or purpose of a color in this book can be derived from its context. Often it is indicated in the text, such as “affinity color” that refers to the link affinity functionality. The link affinity color RED has nothing to do with the SLA color green (30).

Node4 advertises service route 2.2.2.0/24 in BGP with color green and BGP nexthop 1.1.1.4 (Node4) to Node1, as illustrated in Figure 6-1. How to attach a color to a prefix is explained in chapter 5, "Automated Steering".

When this route arrives on Node1, since an on-demand template exists for color green on Node1, BGP on Node1 requests the local SR-TE process to instantiate a candidate path for the SR Policy with endpoint 1.1.1.4 and color green, based on the green on-demand template. This is the ODN functionality. As pointed out earlier in this chapter, the ODN candidate path is instantiated whether the corresponding SR Policy already exists or not, and even if other non-ODN candidate paths already exist for the related SR Policy. If the related SR Policy does not yet exist, then it is created when the candidate path is instantiated.

SR-TE computes a dynamic path to endpoint 1.1.1.4 with optimized delay metric, as the green on-demand template specifies. Since no SR Policy (green, 1.1.1.4) existed yet on Node1, Node1 creates it with the dynamic path as candidate path. At this point, an SR Policy (green, 1.1.1.4) exists on Node1 with the ODN-instantiated candidate path as selected path. The status of this SR Policy is shown in Example 6-2.

Note that with the configuration in Example 6-1, two candidate paths are automatically instantiated on-demand: one with preference 100 and another with preference 200. Both paths are marked (BGP ODN) in the output.

The candidate path with preference 200 is computed by the headend node. The candidate path with preference 100 is computed by a PCE. Since no PCE is configured in this example, this path is invalid. The PCE-computed path will be used if the headend cannot compute the path, e.g., because the endpoint is located in a remote domain. In this case, the PCE address must be configured, as illustrated in the next section. Note that the pcep keyword under dynamic is only required if the PCE alone must compute the path. This is explained further in this chapter.

Note the BSID label 40001 that was allocated for this SR Policy, Binding SID: 40001.

Example 6-2: On-demand instantiated candidate path on Node1

RP/0/0/CPU0:xrvr-1#show segment-routing traffic-eng policy

SR-TE policy database
---------------------

Color: 30, End-point: 1.1.1.4
  Name: srte_c_30_ep_1.1.1.4
  Status:
    Admin: up  Operational: up for 00:15:42 (since Jul  5 07:57:23.382)
  Candidate-paths:
    Preference: 200 (BGP ODN) (active)
      Requested BSID: dynamic
      Dynamic (active)
        Metric Type: delay,   Path Accumulated Metric: 30
          16003 [Prefix-SID, 1.1.1.3]
          24034 [Adjacency-SID, 99.3.4.3 - 99.3.4.4]
    Preference: 100 (BGP ODN)
      Requested BSID: dynamic
      PCC info:
        Symbolic name: bgp_c_30_ep_1.1.1.4_discr_100
        PLSP-ID: 30
      Dynamic (pce) (invalid)
        Metric Type: NONE,   Path Accumulated Metric: 0
  Attributes:
    Binding SID: 40001
    Forward Class: 0
    Steering BGP disabled: no
    IPv6 caps enable: yes

BGP uses Automated Steering as usual (see chapter 5, "Automated Steering") to steer the service route 2.2.2.0/24 into this SR Policy (green, 1.1.1.4), because this service route's color green and nexthop 1.1.1.4 match the SR Policy's color and endpoint. The Automated Steering can be verified in the BGP table and in the forwarding table.

The BGP table entry for VRF ACME prefix 2.2.2.0/24 is displayed in Example 6-3. The output shows that this prefix was received with a nexthop 1.1.1.4 (Node4) and a color extended community 30 (that we named “green” in this text), 1.1.1.4 C:30. It also shows the BSID of the SR Policy, (bsid:40001).

Example 6-3: BGP table entry on Node1 for VRF ACME prefix 2.2.2.0/24

RP/0/0/CPU0:xrvr-1#show bgp vrf ACME 2.2.2.0/24
BGP routing table entry for 2.2.2.0/24, Route Distinguisher: 1.1.1.1:0
Versions:
  Process           bRIB/RIB  SendTblVer
  Speaker                572         572
Last Modified: Jul  5 07:59:18.989 for 00:29:11
Paths: (1 available, best #1)
  Not advertised to any peer
  Path #1: Received by speaker 0
  Not advertised to any peer
  2
    1.1.1.4 C:30 (bsid:40001) (metric 50) from 1.1.1.4 (1.1.1.4)
      Received Label 90000
      Origin IGP, metric 0, localpref 100, valid, internal, best, group-best, import-candidate, imported
      Received Path ID 0, Local Path ID 1, version 556
      Extended community: Color:30 RT:1:1
      SR policy color 30, up, registered, bsid 40001, if-handle 0x00000490
      Source AFI: VPNv4 Unicast, Source VRF: default, Source Route Distinguisher: 1.1.1.4:0

The forwarding entry of VRF ACME prefix 2.2.2.0/24 in Example 6-4 shows that this prefix recurses on the BSID 40001 of the SR Policy (green (30), 1.1.1.4). BSID 40001 is associated with SR Policy srte_c_30_ep_1.1.1.4, as indicated in the last line of the output. Note that Node1 imposes the VPN service label 90000 for this prefix, as advertised by Node4, before steering it into the SR Policy.

Example 6-4: CEF table entry on Node1 for VRF ACME prefix 2.2.2.0/24

RP/0/0/CPU0:xrvr-1#show cef vrf ACME 2.2.2.0/24
2.2.2.0/24, version 218, internal 0x5000001 0x0 (ptr 0xa13a0d78) [1], 0x0 (0x0), 0x208 (0xa175f44c)
 Updated Jul  5 07:59:18.595
 Prefix Len 24, traffic index 0, precedence n/a, priority 3
   via local-label 40001, 3 dependencies, recursive [flags 0x6000]
    path-idx 0 NHID 0x0 [0xa17d4288 0x0]
    recursion-via-label
    next hop VRF - 'default', table - 0xe0000000
    next hop via 40001/0/21
     next hop srte_c_30_ep_1.1.1.4
     labels imposed {ImplNull 90000}

Sometime later, Node4 advertises service route 5.5.5.0/24 in BGP with color green and BGP nexthop 1.1.1.4 (Node4). When Node1 receives this route, it finds that the ODN candidate path for SR Policy (green, 1.1.1.4) already exists. BGP uses Automated Steering to steer the service route 5.5.5.0/24 into this SR Policy. The output in Example 6-5 shows that 5.5.5.0/24 indeed recurses on BSID label 40001, the BSID of SR Policy srte_c_30_ep_1.1.1.4. The VPN service label for this prefix is 90003.

Example 6-5: CEF table entry on Node1 for VRF ACME prefix 5.5.5.0/24

RP/0/0/CPU0:xrvr-1#show cef vrf ACME 5.5.5.0/24
5.5.5.0/24, version 220, internal 0x5000001 0x0 (ptr 0xa13a1868) [1], 0x0 (0x0), 0x208 (0xa175f1e4)
 Updated Jul  5 08:47:16.587
 Prefix Len 24, traffic index 0, precedence n/a, priority 3
   via local-label 40001, 3 dependencies, recursive [flags 0x6000]
    path-idx 0 NHID 0x0 [0xa17d4288 0x0]
    recursion-via-label
    next hop VRF - 'default', table - 0xe0000000
    next hop via 40001/0/21
     next hop srte_c_30_ep_1.1.1.4
     labels imposed {ImplNull 90003}

The service route 5.5.5.0/24 belongs to the same VRF ACME as the previous one, but that is not a requirement. The above behavior also applies if the service routes belong to different VRFs or even different address-families. The two decisive elements are the color and nexthop of the service route.

Node4 also advertises a service route 4.4.4.0/24 of VRF Widget with nexthop 1.1.1.4 (Node4) and color purple (color value 40). Since Node1 has an on-demand template for color 40, the same procedure is used as before to instantiate an on-demand candidate path based on the template.

Figure 6-2: Intra-area On-Demand Next-hop (ODN) – using multiple ODN templates

BGP uses Automated Steering to steer the traffic into the matching SR Policy (purple, 1.1.1.4).

6.6 Illustration: Inter-domain ODN

The ODN functionality is not limited to single-area networks; it is equally applicable to multi-domain networks. Consider the network diagram in Figure 6-3. This network consists of three independent domains. With “independent”, we mean that no reachability information is exchanged between the domains. We assume that, by default, Node1, for example, does not have reachability to Node10.

At the time of writing, BGP requires an IP route to the nexthop (it could be a less specific prefix) to consider the service route. This can be achieved by propagating an aggregate route to the nexthop in BGP or by configuring a less specific static route to the nexthop (it can point to Null0) on the BGP speakers1.
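As a minimal sketch of the second option, the ingress node could be configured with a static route covering the remote-domain loopbacks, pointing to Null0, so that BGP considers the service routes with nexthop Node10; the 1.1.1.0/24 aggregate is an assumption based on the loopback numbering used in this book:

router static
 address-family ipv4 unicast
  !! covers the remote PE loopbacks, e.g., 1.1.1.10 (Node10)
  1.1.1.0/24 Null0

The route only has to make the nexthop resolvable; the traffic for the colored service routes is steered into the ODN-instantiated SR Policy, not towards Null0.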

Figure 6-3: ODN Multi-domain topology

“ODN has the best properties to make a multi-domain network infrastructure to scale and very flexible, while keeping everything simple at the same time. Still from an end-to-end perspective, it allows us to implement solutions for cases such as a PE that belongs in a network domain that needs to establish an L3VPN with another PE that belongs to another domain, while going through a middle "opaque" domain. An easy use case could be a virtual PE (vPE) hosted in a datacenter attached to a core network connecting to a brownfield PE that belongs to a different area from the Core and the DC. The implementation is done with a simple configuration enhanced with an SR PCE, enabling automated service instantiation with specific criteria such as latency, optimized delay-metric, bandwidth, disjointness, etc. ” — Daniel Voyer

In the single-area network example that we discussed in the previous section, the headend node itself can dynamically compute the on-demand SR Policy path based on the optimization objective and constraints specified in the ODN template for the SLA color. However, this node cannot locally compute end-to-end inter-domain paths for the multi-domain case since it has no visibility into the network topology beyond its local area. Instead, the headend node should request a PCE with network-wide visibility to compute the inter-domain paths.

Similar to the single-area ODN case, the egress service node Node10 advertises its service routes to the ingress node Node1, tagged with the colors that identify the required SLAs. This is marked with ➊ in Figure 6-3. Typically, the service routes are advertised via a Route-Reflector infrastructure, but BGP must ensure that the service routes are propagated keeping the initial BGP nexthop intact, i.e., Node1 must receive the service route with BGP nexthop Node10 (➋). Also, the color attached to the service route should be propagated all the way to Node1. Different from the single-area case, the on-demand template on Node1 specifies that a PCE must be used to compute the inter-domain path of the on-demand SR Policy (➌ in Figure 6-3). The ODN template of Node1 is shown in Example 6-6. The SR PCE has IP address 1.1.1.99.

Example 6-6: On-demand color templates on Node1 – PCE computes path

segment-routing
 traffic-eng
  !! green ODN template
  on-demand color 30
   dynamic
    pcep
    metric
     type delay
  !
  pcc
   pce address ipv4 1.1.1.99

By default, the headend first tries to compute the path locally. If that fails, it requests the SR PCE to compute the path. The keyword pcep in the color 30 ODN template specifies to only use the SR PCE to compute the path. Example 6-7 shows the status of the ODN-instantiated SR Policy path. Since the configuration specifies to use the PCE to compute the path, the locally computed candidate path (preference 200) is shut down.

Example 6-7: On-demand instantiated candidate path for inter-domain SR Policy on Node1

RP/0/0/CPU0:xrvr-1#show segment-routing traffic-eng policy

SR-TE policy database
---------------------

Color: 30, End-point: 1.1.1.10
  Name: srte_c_30_ep_1.1.1.10
  Status:
    Admin: up  Operational: up for 00:15:42 (since Jul  5 07:57:23.382)
  Candidate-paths:
    Preference: 200 (BGP ODN) (shutdown)
      Requested BSID: dynamic
      Dynamic (invalid)
        Last error: No path found
    Preference: 100 (BGP ODN) (active)
      Requested BSID: dynamic
      PCC info:
        Symbolic name: bgp_c_30_ep_1.1.1.10_discr_100
        PLSP-ID: 32
      Dynamic (pce 1.1.1.99) (valid)
        Metric Type: delay,   Path Accumulated Metric: 600
          16121 [Prefix-SID, 1.1.1.121]
          16101 [Prefix-SID, 1.1.1.101]
          16231 [Prefix-SID, 1.1.1.231]
          16010 [Prefix-SID, 1.1.1.10]
  Attributes:
    Binding SID: 40060
    Forward Class: 0
    Steering BGP disabled: no
    IPv6 caps enable: yes

ODN is a scalable mechanism to automatically provide end-to-end inter-domain reachability, with SLA if required. It offers a pull model, where the ingress node installs forwarding entries on-demand, only for service routes that are actually used. This is opposed to the traditional push model, where the network pushes routes to all endpoints which may or may not steer traffic onto them.

“Most of the customers I work with as a solution architect, are struggling in designing and operate network architectures that are either easy to scale and operate but flexible enough to implement different transport SLAs. They are always very impressed by the simplicity of ODN solution and the benefits are immediately evident: no manual RSVP-TE tunnel setup, no complicated Policy-Based Routing (PBR) configuration for selectively steering traffic, very granular traffic steering and transparent applicability to inter area and inter domain scenarios that are in many cases fundamental for end-to-end access-core-dc architecture scalability. ” — Alberto Donzelli

The ODN model can also be used together with a default reachability model, for example, when routes are redistributed between domains or in combination with unified MPLS, also known as seamless MPLS2. ODN can provide SLA paths for a subset of routes, keeping the other routes on their default forwarding path. When transitioning from the traditional route push model to the ODN route pull model, ODN can be introduced next to the existing default forwarding. Reachability to service routes can then be gradually moved from the classic model to ODN.

6.7 ODN Only for Authorized Colors

The ODN functionality is only triggered when receiving a service route with an authorized color. A color is authorized when an on-demand template is configured for that color. Appropriate ingress filtering in BGP can also be used to restrict the received extended communities.

If required, the ODN functionality can be further restricted to a subset of BGP nexthops by configuring a prefix-list ACL on the on-demand template. Only the nexthops passing the filter will trigger on-demand instantiation for this template. At the time of writing, this functionality is not available. Example 6-8 illustrates a possible configuration to restrict ODN functionality for this template to the BGP nexthops matching the prefix 1.1.1.0/24.

Example 6-8: Restrict ODN to a subset of BGP nexthops

ipv4 prefix-list ODN_PL
 10 permit 1.1.1.0/24
!
segment-routing
 traffic-eng
  !! green ODN template
  on-demand color 30
   restrict ODN_PL
   dynamic
    metric
     type delay

SR Design Principle – Less protocols and Seamless Deployment

“With SR we want to eliminate any unnecessary protocol and maximize the use of the base IP protocols: IGP and BGP. Hence, ODN/AS has been designed to re-use the already existing BGP coloring mechanism. Simple color range allocation rules ensure seamless ODN/AS deployment into networks that were already using the BGP coloring mechanism for other purposes. For example, the operator dedicates range 1000-1999 to mark the PoP-of-origin of a BGP path while range 100-199 is used for ODN/AS. ”
— Clarence Filsfils

6.8 Summary

An SR candidate path is either instantiated explicitly (configuration, PCEP, BGP SR-TE) or automatically by the “On-Demand Next-Hop” (ODN) solution.

If a head-end H has at least one service route with color C and nexthop E, and has an ODN template for color C, then H automatically instantiates an ODN candidate path for the SR Policy (C, E) based on the ODN template of color C.

The ODN instantiation occurs even if there is a pre-existing non-ODN candidate path for the SR Policy (C, E).

The ODN solution is a key benefit of the SR-TE architecture. It drastically simplifies the operation of the network by removing the need to maintain a complex configuration. It leverages the distributed intelligence and robustness of the network. It does not require an SDN controller.

The ODN solution applies to intra- and inter-domain use-cases.

Most likely, the ODN solution is combined with the AS solution. However, in theory, these two components are independent and could be used independently in some use-cases.

6.9 References

[RFC5512] "The BGP Encapsulation Subsequent Address Family Identifier (SAFI) and the BGP Tunnel Encapsulation Attribute", Pradosh Mohapatra, Eric C. Rosen, RFC5512, April 2009

[draft-ietf-spring-segment-routing-policy] "Segment Routing Policy Architecture", Clarence Filsfils, Siva Sivabalan, Daniel Voyer, Alex Bogdanov, Paul Mattes, draft-ietf-spring-segment-routing-policy-02 (Work in Progress), October 2018

[draft-ietf-mpls-seamless-mpls] "Seamless MPLS Architecture", Nicolai Leymann, Bruno Decraene, Clarence Filsfils, Maciek Konstantynowicz, Dirk Steinberg, draft-ietf-mpls-seamless-mpls-07 (Work in Progress), June 2014

1. Note that the configuration may allow reachability to the nexthop via the default route (nexthop resolution allow-default) or may require nexthop reachability via a host route (nexthop resolution prefix-length minimum 32).
2. A scalable architecture that integrates core, aggregation, and access domains into a single (“seamless”) MPLS domain to provide an end-to-end service delivery [draft-ietf-mpls-seamless-mpls-07].

7 Flexible Algorithm

What we will learn in this chapter:

An IGP Prefix-SID is associated with a prefix and an algorithm.
The default algorithm (0) is the IGP shortest path according to the IGP metric.
Algorithms 0 to 127 are standardized by the IETF. Algorithms 128 to 255 are customized by the operator. They are called SR IGP Flexible Algorithms (Flex-Algo for short).
A Flex-Algo is defined with an optimization objective and a set of constraints.
Any node participating in a Flex-Algo advertises its support for this Flex-Algo. Only the nodes participating in a Flex-Algo compute the paths to the Prefix-SIDs of that Flex-Algo. This is an important benefit for scale.
Multiple Prefix-SIDs of different algorithms can share the same loopback address. This is an important benefit for operational simplicity and scale.
Flex-Algo is an intrinsic component of the SR architecture: the algorithm has been included in the SR proposal since day one.
Flex-Algo is an intrinsic component of the SR-TE architecture: it leverages the Automated Steering component to automatically steer service traffic onto Flex-Algo Prefix-SIDs. Upon ODN automated instantiation of an inter-domain SR-TE policy, the SR PCE leverages any per-IGP-domain Flex-Algo Prefix-SID that provides the required path.
Frequent use-cases involve dual-plane disjoint paths and low-delay routing.

IGP Prefix-SIDs steer the traffic along the shortest path to the corresponding prefix, as computed by the IGP. Although this shortest path is usually assumed to be the shortest IGP path, minimizing the accumulated IGP metric along the traversed links, there is in reality more than one definition of shortest path. In order to ensure that all nodes in the IGP follow the same definition for a given Prefix-SID, each Prefix-SID is associated with an algorithm. The Prefix-SID algorithm defines how the shortest paths should be computed for that particular Prefix-SID.

Flexible Algorithms are a type of Prefix-SID algorithm defined within the scope of an IGP domain and fully customizable by the operator. An operator can define one or several Flexible Algorithms with optimization objectives and constraints tailored to its specific needs.

In this chapter, we start by detailing the concept of Prefix-SID algorithm. We then explain how Flexible Algorithms are defined in a consistent manner within an IGP domain and how this capability is integrated within the SR architecture. We conclude by illustrating the use and the benefits of Flexible Algorithms within the context of real operator use-cases.

7.1 Prefix-SID Algorithms

Before delving into the Flexible Algorithm, we first go back to the basics of the Prefix-SID. The IGP Prefix-SID is associated with an IGP prefix. It is a global SID that steers packets along the ECMP-aware IGP shortest path to its associated prefix.

Figure 7-1: Example Prefix-SID of default algorithm (SPF)

On the example topology in Figure 7‑1, Node3 advertises a Prefix-SID 16003 associated with its loopback prefix 1.1.1.3/32. All nodes in the network program a forwarding entry for this SID with the instruction “go via the IGP shortest path to 1.1.1.3/32”. For example, the shortest path from Node6 to 1.1.1.3/32 is towards Node5, so Node6 installs the forwarding entry: incoming label 16003, outgoing label 16003, via Node5. Eventually, a packet with top label 16003, injected anywhere in the network, is forwarded via the IGP shortest path to 1.1.1.3/32. This Prefix-SID description is correct, but incomplete. It only covers the default Prefix-SID behavior, which is associated with algorithm 0.

The SR Algorithm information has been part of the SR architecture since the beginning. An IGP Prefix-SID is associated with an algorithm K, where K ranges from 0 to 255. The IGP advertises this Algorithm information in its Prefix-SID and Router Capability advertisements. Each Prefix-SID is advertised in ISIS with an associated Algorithm, as shown in the Prefix-SID TLV format in Figure 7-2. OSPF includes an Algorithm field in the Prefix-SID sub-TLV of the OSPF Extended Prefix TLV (RFC7684), as shown in Figure 7-3. An IGP prefix can be associated with different Prefix-SIDs, each of a different algorithm.

Figure 7-2: Algorithm field in ISIS Prefix-SID TLV format

Figure 7-3: Algorithm field in OSPF Prefix-SID sub-TLV format

Algorithm identifiers between 0 and 127 are reserved for standardized (IETF/IANA) algorithms. At the time of writing, the following standardized Algorithm identifiers are defined in RFC 8402 (Segment Routing Architecture).

0: Shortest Path First (SPF) algorithm based on IGP link metric. This is the well-known shortest path algorithm based on IGP link metric, as computed by ISIS and OSPF. The traffic steered over an SPF Prefix-SID follows the shortest IGP path to the associated prefix, unless the Prefix-SID's forwarding entry is overridden by a local policy on an intermediate node. In this case, the traffic is processed as per the local policy and may deviate from the path expressed in the SID list. A common example of such local policy is the autoroute announce mechanism: the traffic steered over an SPF Prefix-SID can be autorouted into an SR Policy or RSVP-TE tunnel.

1: Strict Shortest Path First (Strict-SPF) algorithm based on IGP link metric. This algorithm is identical to algorithm 0, except that it has an additional semantic instructing the intermediate nodes to ignore any local path deviation policy. The traffic steered over a Strict-SPF Prefix-SID strictly follows the unaltered shortest IGP path to the associated prefix1. In particular, this traffic cannot be autorouted.

User-defined algorithms are identified by a number between 128 and 255. These can be customized by each operator to their liking and are called the SR IGP Flexible Algorithms, or Flex-Algos for short.
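As an aside, a node can advertise a Prefix-SID for both standardized algorithms on the same loopback. A minimal ISIS sketch, where the Strict-SPF label value 17003 is an assumed value within the default SRGB and is not taken from this book's lab:

router isis 1
 interface Loopback0
  address-family ipv4 unicast
   !! Algo(0) Prefix-SID
   prefix-sid absolute 16003
   !! Algo(1) Strict-SPF Prefix-SID (assumed label value)
   prefix-sid strict-spf absolute 17003

Traffic sent with label 17003 would then follow the unaltered IGP shortest path to the associated loopback prefix, ignoring, for example, autoroute announce on intermediate nodes.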

Component by component we build a powerful integrated solution for 5G

“In my first public talk on SR (October 2012), I described the use of Prefix Segments associated with different algorithms. We kept that component for later execution because we first needed to develop the more general SR-TE solution: ODN, AS and the inter-domain solution thanks to SR PCE. Once these components are in place, the full power of Flex-Algo Prefix-SIDs can be leveraged. For example, Automated Steering is key to automatically steer BGP/Service destinations on the Flex-Algo Prefix-SID that delivers the required SLA to the BGP next-hop. Most deployments are inter-domain and hence ODN and SR PCE are key to deliver automated inter-domain SR Policies that leverage per-IGP-domain Flex-Algo Prefix-SID. Flex-Algo and the related per-link delay measurement is shipping since December 2018. As I write these lines, Anton Karneliuk and I gave the first public presentation of a flag-ship deployment at Vodafone (Cisco Live!™ January 2019, video available on segment-routing.net). The details of this low-delay slice design so important for 5G are described in this chapter, both as a use-case and with a viewpoint offered by Anton. ”
— Clarence Filsfils

In this book, we use the notation Flex-Algo(K) to indicate the Flex-Algo with identifier K, and Prefix-SID(K) to indicate the Prefix-SID of algorithm K. The notation Algo(0) is used to indicate algorithm 0.

A typical use-case is for an operator to define a Flex-Algo, for example 128, that minimizes the delay metric. This is illustrated in Figure 7-4. First, the operator enables the dynamic link-delay measurement, such that the per-link delay metric is advertised by the IGP. This allows the paths to adapt to any link-delay changes, for example caused by rerouting of the underlying optical network (see chapter 15, "Performance Monitoring – Link Delay"). Alternatively, if link delays are assumed to be constant, the operator could use the TE metric to reflect the link delay by statically configuring the TE metric on each link with the link's delay value.

Then, every IGP node is configured with one additional Prefix-SID, associated with the Flex-Algo(128). As a result, the IGP continuously delivers two different ways to reach every node: a shortest path according to the IGP metric (Prefix-SID with Algo(0) of the destination node) and a shortest path according to the delay metric (Prefix-SID with Flex-Algo(128) of the destination node).
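A minimal sketch of what the ISIS definition of such a delay-optimizing Flex-Algo(128) could look like on the participating nodes; the full definition syntax and advertisement options are covered in section 7.2, and the link-delay measurement itself must be enabled separately (chapter 15):

router isis 1
 flex-algo 128
  !! optimize the advertised link delay instead of the IGP metric
  metric-type delay
  !! have (at least some) nodes advertise the Flex-Algo definition
  advertise-definition

Only the nodes configured with flex-algo 128 participate in this Flex-Algo and compute paths for its Prefix-SIDs.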

Figure 7-4: Example Prefix-SID of low-delay algorithm

To ease the illustrations, we use the following numbering scheme:

Prefix-SID(0) of Algo(0) of Node J is 16000 + J
Prefix-SID(K) of Flex-Algo(K) of Node J is 16000 + (K − 120) × 100 + J

Said otherwise, the third digit of the label indicates the algorithm: 0 means Algo(0) and 8 means Flex-Algo(128). Obviously, this is just for our illustration and should not be used as a design rule.

In this illustration, Node3 advertises the prefix 1.1.1.3/32 with two Prefix-SIDs:

16003: algorithm 0: shortest path according to IGP metric
16803: algorithm 128: shortest path according to delay metric

As a result, when Node1 sends the green packet with top label 16803, the packet follows the delay-metric shortest path 1→2→3 with an accumulated delay metric of 10. Similarly, when Node1 sends the blue packet with top label 16003, it follows the IGP-metric shortest path 1→6→5→3 with an accumulated IGP metric of 30.

The low-delay path from Node1 to Node3 (1→2→3) could also be encoded in a SID list <16002, 24023>, where 16002 is the Algo(0) Prefix-SID of Node2, providing the IGP shortest path to Node2 (1→2), and 24023 the Adj-SID of Node2 to Node3, steering the traffic via the direct link to Node3 (2→3). However, the Flex-Algo functionality reduces the required SID list to the single segment 16803, as indicated above.

Low-Delay

“Working closely with the operators, I knew that huge business requirements were left unsupported in terms of concurrent support of low-cost and delay optimization. A key component of this solution is the real-time per-link delay measurement and advertisement in the IGP. As soon as this metric is available, then it is clear that the operator can use it to differentiate services on its infrastructure. Either via the SR Policy solution (dynamic path based on delay-metric) or via the Flex-Algo solution.”
— Clarence Filsfils

Another very important benefit of the Flex-Algo solution is that no additional addresses need to be configured. The same prefix can be associated with multiple Prefix-SIDs of multiple algorithms. This is key for operational simplicity and scale. The related configuration of Node3 is shown in Example 7‑1.

Example 7-1: ISIS Prefix-SID configuration on Node3

interface Loopback0
 ipv4 address 1.1.1.3/32
!
router isis 1
 flex-algo 128
 !
 interface Loopback0
  address-family ipv4 unicast
   prefix-sid absolute 16003
   prefix-sid algorithm 128 absolute 16803

Aside from the algorithm value between 128 and 255, the configuration of Flex-Algo Prefix-SIDs is similar to the well-known IGP Prefix-SID. This implies that Flex-Algo Prefix-SID labels are selected from the SR Global Block (SRGB) label range, which is shared by all Prefix-SIDs, and that Prefix-SID properties (e.g., Node-SID flag, PHP and explicit-null behaviors) are also configurable for Flex-Algo Prefix-SIDs. All the IGP nodes in the illustration support algorithms 0 and 128 and hence all the nodes install the related Prefix-SIDs. The Prefix-SIDs of Algo(0) are installed along the IGP-metric shortest path while the Prefix-SIDs of Flex-Algo(128) are installed along the delay-metric shortest path.

The ISIS advertisement of Node3’s prefix 1.1.1.3/32 with the above configuration is shown in Example 7‑2. The example illustrates the ISIS advertisement of prefix 1.1.1.3/32 with Prefix-SID index 3 for Algo(0) and index 803 for Flex-Algo(128). With the default SRGB [16000-23999], these respectively map to the Prefix-SID label values 16003 (Algo(0)) and 16803 (Flex-Algo(128)). The equivalent Prefix-SID algorithm capability is available for OSPF. Similarly, the SR extensions of BGP-LS in ietf-idr-bgp-ls-segment-routing-ext define the Algorithm Identifier advertisement in BGP-LS.

Example 7-2: ISIS Prefix-SID advertisement example

RP/0/0/CPU0:xrvr-3#show isis database verbose xrvr-3

IS-IS 1 (Level-2) Link State Database
LSPID                 LSP Seq Num  LSP Checksum  LSP Holdtime  ATT/P/OL
xrvr-3.00-00        * 0x0000013e   0x20ee        1198          0/0/0
  Metric: 0          IP-Extended 1.1.1.3/32
    Prefix-SID Index: 3, Algorithm:0, R:0 N:1 P:0 E:0 V:0 L:0
    Prefix-SID Index: 803, Algorithm:128, R:0 N:1 P:0 E:0 V:0 L:0
    Prefix Attribute Flags: X:0 R:0 N:1
  Source Router ID: 1.1.1.3

Each ISIS node advertises its support for a given algorithm in the Algorithm sub-TLV of the Router Capability TLV, as shown in Figure 7‑5 and Example 7‑3. This sub-TLV lists all identifiers of the algorithms that the node supports. Again, the equivalent functionality is available for OSPF, where OSPF advertises the list of supported algorithms in an SR-Algorithm TLV in the Router Information LSA, as shown in Figure 7‑6.

Figure 7-5: Algorithms in ISIS Algorithm sub-TLV of Router Capability TLV

Figure 7-6: Algorithms in OSPF Algorithm TLV of Router Information LSA

Example 7‑3 shows the ISIS Router Capability advertisement of a node that supports Algorithms 0 (SPF), 1 (strict-SPF) and 128 (Flex-Algo(128)). An SR-enabled IOS XR node always participates in Algo(0) and Algo(1).

Example 7-3: ISIS Router Capability advertisement example

RP/0/0/CPU0:xrvr-3#show isis database verbose xrvr-3

IS-IS 1 (Level-2) Link State Database
LSPID                 LSP Seq Num  LSP Checksum  LSP Holdtime  ATT/P/OL
xrvr-3.00-00        * 0x0000013e   0x20ee        1198          0/0/0
  Hostname: xrvr-3
  Router Cap: 1.1.1.3, D:0, S:0
    Segment Routing: I:1 V:0, SRGB Base: 16000 Range: 8000
    SR Algorithm:
      Algorithm: 0
      Algorithm: 1
      Algorithm: 128
    Node Maximum SID Depth:
      Label Imposition: 10
  Router ID: 1.1.1.3

The Router Capability TLV can also contain the Flex-Algo Definition sub-TLVs, which will be covered in the next section.

7.2 Algorithm Definition

An operator can define its own custom Flexible Algorithms and assign an algorithm identifier to each of them. The user-defined algorithms have an identifier value between 128 and 255.

The Flex-Algo definition consists of three elements:

A calculation type
An optimization objective
Optional constraints

For example: use SPF to minimize the IGP metric and avoid red colored links, or use SPF to minimize the delay metric.

The calculation type indicates the method used to compute the paths. At the time of writing, two calculation types have been defined: SPF (0) and strict-SPF (1), the same as the IETF-defined SR algorithms. Both types use the same algorithm (Dijkstra’s SPF), but the strict-SPF type does not allow a local policy to override the SPF-computed path with a different path. IOS XR uses only type 0 (SPF) for Flex-Algo computations.

The optimization objective is defined as the minimization of a specific metric type. On IOS XR, the metric type can be IGP, TE or link-delay. The path to each Prefix-SID of the Flex-Algo is computed such that the accumulated metric of the specified type is minimal, considering the Flex-Algo constraints. These constraints consist of zero, one or several sets of resources, typically identified by their affinity color, to be included in or excluded from the path to each Prefix-SID of the Flex-Algo.

A node that has a local definition of a particular Flex-Algo can advertise the Flex-Algo definition in its Router Capability TLV (ISIS) or Router Information LSA (OSPF). These definition advertisements are used to select a consistent Flex-Algo definition on all participating nodes. How this consistency is ensured is discussed further in the next section. By default, a local Flex-Algo definition is not advertised.
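The three elements of a Flex-Algo definition can be pictured as a simple record. The following Python sketch is purely illustrative; the class and field names are ours and do not correspond to any router data structure. It captures the two example definitions used later in this section (Flex-Algo(128): minimize delay; Flex-Algo(129): minimize IGP metric, exclude RED links; both with priority 100).

from dataclasses import dataclass
from typing import FrozenSet

@dataclass(frozen=True)
class FlexAlgoDefinition:
    algo: int                  # operator-assigned identifier, 128..255
    calc_type: int = 0         # 0 = SPF, 1 = strict-SPF
    metric_type: str = "igp"   # "igp", "te" or "delay"
    priority: int = 128        # default priority, used for definition selection
    exclude_any: FrozenSet[str] = frozenset()   # affinity colors to avoid
    include_any: FrozenSet[str] = frozenset()
    include_all: FrozenSet[str] = frozenset()

# The two example definitions used in this section:
FLEX_ALGO_128 = FlexAlgoDefinition(algo=128, metric_type="delay", priority=100)
FLEX_ALGO_129 = FlexAlgoDefinition(algo=129, metric_type="igp", priority=100,
                                   exclude_any=frozenset({"RED"}))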

Figure 7-7: All three Prefix-SID paths from Node1 to Node3

Example 7‑4 illustrates the Flex-Algo definition configuration of Node3 in Figure 7‑7. Two Flex-Algo definitions are configured. Flex-Algo(128) provides the low-delay path without constraints, as explained in section 7.1, while Flex-Algo(129) provides a constrained IGP shortest path that avoids links with affinity color RED. The priority of both Flex-Algo definitions is 100. The default priority is 128. This priority value is used in the Flex-Algo Definition consistency mechanism as described in the next section. Node3 advertises both Flex-Algo definitions.

Example 7-4: Flex-Algo definition and affinity configuration on Node3

 1 router isis 1
 2  affinity-map RED bit-position 2
 3  !
 4  flex-algo 128
 5   priority 100
 6   metric-type delay
 7   advertise-definition
 8  !
 9  flex-algo 129
10   priority 100
11   !! default: metric-type igp
12   advertise-definition
13   affinity exclude-any RED
14  !
15  interface GigabitEthernet0/0/0/3
16   !! link to Node5
17   affinity flex-algo RED

For this illustration, we have colored the link between Node3 and Node5 RED by adding the affinity RED to interface Gi0/0/0/3 (lines 15-17). RED is a locally significant user-defined name that indicates the bit in position 2 of the affinity bitmap (line 2 in the example). Therefore, Flex-Algo(129) paths will not traverse this link. Notice that the link affinity is configured under the IGP. Flex-Algo link affinities are advertised as application-specific TE link attributes, as further described in section 7.2.2.

Node3 advertises a third Prefix-SID for prefix 1.1.1.3/32: 16903 with Flex-Algo(129). Since all nodes participate in Flex-Algo(129), each node computes the path to 1.1.1.3/32, according to the algorithm minimize IGP, avoid red links, and installs the forwarding entry for Prefix-SID 16903 accordingly, as illustrated in Figure 7‑8.

Figure 7-8: Example Prefix-SID of algorithm with constraints

As an example, the Flex-Algo(129) path from Node1 to Node3 is 1→6→5→4→3. Without using the Flex-Algo SIDs, this path can be encoded in the SID list <16004, 16003>, where 16004 is the Algo(0) Prefix-SID of Node4 and 16003 the Algo(0) Prefix-SID of Node3. The other Prefix-SIDs of Node3 still provide the path according to their associated algorithm: Prefix-SID 16003 provides the IGP shortest path to Node3, and Prefix-SID 16803 provides the low-delay path to Node3. All three paths from Node1 to Node3 are shown in Figure 7‑7.

7.2.1 Consistency

While IETF-defined SR Algorithms are standardized and hence globally significant, a Flex-Algo is defined within the scope of an IGP domain. Different definitions for a given Flex-Algo identifier may thus co-exist at the same time over the global Internet.

The paths towards Flex-Algo Prefix-SIDs are computed on each participating node as per the definition of this Flex-Algo. To avoid routing loops, it is crucial that all nodes participating in a Flex-Algo have a consistent definition of this algorithm. Therefore, the nodes must agree on a unique Flex-Algo definition within the advertisement scope of the Flex-Algo identifier.

A node can participate in a Flex-Algo without a local definition for this Flex-Algo. At least one node in the flooding domain must advertise the Flex-Algo definition. Multiple advertisers are recommended for redundancy.

In order to ensure Flex-Algo definition consistency, every node that participates in a particular Flex-Algo selects the Flex-Algo definition based on the following rules:

1. From the Flex-Algo definition advertisements in the area (including both locally generated and received advertisements), select the one(s) with the highest priority.
2. If there are multiple Flex-Algo definition advertisements with the same highest priority, select the one that is originated from the node with the highest System-ID in case of ISIS or highest Router-ID in case of OSPF.

The locally configured definition of the Flex-Algo is only considered if it is advertised. It is then treated in the selection process equally with the received definitions advertised by remote nodes. Flex-Algo definition advertisement is disabled by default.

If a node neither sends nor receives any definition advertisement for a given Flex-Algo, or if it does not support the selected Flex-Algo definition (e.g., an unsupported metric-type), the node locally disables this Flex-Algo. It stops advertising support for the Flex-Algo and removes the forwarding entries for all Prefix-SIDs of this Flex-Algo. Thus, if the definition of a Flex-Algo is not advertised by any node, this Flex-Algo is non-functional. Lacking a definition, all participating nodes disable this Flex-Algo.

The format of the Flex-Algo definition advertisement is described in the next section.
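The selection rules above are deterministic and can be sketched in a few lines of Python. The data model (a list of (priority, originator, definition) tuples) is ours, for illustration only.

def select_definition(advertisements):
    """Select the Flex-Algo definition from the received (and locally
    advertised) definition advertisements: highest priority first, then
    highest originator identifier (System-ID for ISIS, Router-ID for OSPF).
    Returns None if nothing is advertised: the Flex-Algo is then disabled."""
    if not advertisements:
        return None
    best = max(advertisements, key=lambda adv: (adv[0], adv[1]))
    return best[2]

# Two advertisers with the same priority 100: the higher originator wins.
advs = [(100, "0000.0000.0003", "minimize delay, no constraints"),
        (100, "0000.0000.0006", "minimize delay, no constraints")]
assert select_definition(advs) == "minimize delay, no constraints"
assert select_definition([]) is None    # no advertisement: Flex-Algo disabled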

Consistent Flex-Algo definitions

“Common understanding of the Flex-Algo definition between all participating routers is absolutely needed for correct operation of the Flex-Algo technology. Very often, when consistency between multiple devices is requested, people think of it as a potential danger. What happens if there is a conflict? How do I make sure the consistency is achieved? What if things are misconfigured?

When we designed Flex-Algo technology we tried to avoid these problems. The selection of the Flex-Algo definition is done based on strict rules and with deterministic outcome. There can never be an inconsistency or a conflict. Every participating node will use the same selection algorithm that is guaranteed to produce a single definition for the particular Flex-Algo.”
— Peter Psenak

7.2.2 Definition Advertisement

The Flex-Algo definition is advertised in ISIS using the Flex-Algo Definition sub-TLV shown in Figure 7‑9. This sub-TLV is advertised as a sub-TLV of the Router Capability TLV.

Figure 7-9: ISIS Flex-Algo Definition sub-TLV format

The Flex-Algo definition is advertised in OSPF using the Flex-Algo Definition TLV shown in Figure 7‑10. This TLV is advertised as a top-level TLV of the Router Information LSA.

Figure 7-10: OSPF Flex-Algo Definition TLV format

The fields in the Flex-Algo Definition sub-TLV (ISIS) and TLV (OSPFv2) (Figure 7‑9 and Figure 7‑10) are:

Flex-Algorithm: Flex-Algo identifier. Value between 128 and 255.
Metric-Type: type of metric to be used during the calculation.
  0: IGP metric – default in IOS XR
  1: Minimum Unidirectional Link Delay (RFC7810)
  2: TE metric (RFC5305)
Calc-Type: the calculation type used to compute paths for the Flex-Algorithm². 0: SPF algorithm, 1: strict-SPF algorithm – in IOS XR: SPF (0)
Priority: the priority of the advertisement; a higher value is more preferred – default in IOS XR: 128
Sub-TLVs: optional sub-TLVs. At the time of writing, the available optional sub-TLVs specify the inclusion and exclusion of link affinity colors (also known as Administrative Groups) in the path computation as follows.
  Exclude-any (type 1): exclude links that have any of the specified link colors
  Include-any (type 2): only include links that have any of the specified link colors
  Include-all (type 3): only include links that have all of the specified link colors

The format of the ISIS Admin Group sub-TLVs is common for all types. The format is shown in Figure 7‑11.

Figure 7-11: ISIS Flex-Algo include/exclude admin-group sub-TLV

The format of the OSPF Admin Group Sub-TLVs is shown in Figure 7‑12.

Figure 7-12: OSPF Flex-Algo include/exclude admin-group sub-TLV

These sub-TLVs specify which extended administrative groups (link affinity colors) must be excluded from or included in the path, according to the type of the sub-TLV. The format of the Extended Administrative Group field is defined in RFC7308.

IETF draft-ketant-idr-bgp-ls-flex-algo adds support for Flex-Algo definitions to BGP-LS.

When a Flex-Algo is enabled on a node, ISIS advertises support for it by adding the algorithm(s) in the Router Capability TLV. If advertisement of the Flex-Algo definition(s) is enabled, ISIS includes these in the Router Capability TLV as well. Example 7‑5 shows the ISIS Router Capability TLV as advertised by Node3 with the configuration in Example 7‑4.

Example 7-5: Algorithm support in ISIS Router Capability TLV of Node3

RP/0/0/CPU0:xrvr-3#show isis database verbose xrvr-3

  Hostname: xrvr-3
  Router Cap: 1.1.1.3, D:0, S:0
    Segment Routing: I:1 V:0, SRGB Base: 16000 Range: 8000
    SR Local Block: Base: 15000 Range: 1000
    SR Algorithm:
      Algorithm: 0
      Algorithm: 1
      Algorithm: 128
      Algorithm: 129
    Node Maximum SID Depth:
      Label Imposition: 10
    Flex-Algo Definition: Algorith: 128 Metric-Type: 1 Alg-type: 0 Priority: 100
    Flex-Algo Definition: Algorith: 129 Metric-Type: 1 Alg-type: 0 Priority: 100
      Flex-Algo Exclude Ext Admin Group: 0x00000004
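To relate the fields listed above to their on-the-wire form, the following Python sketch packs a Flex-Algo Definition sub-TLV together with an exclude-any admin-group sub-TLV. It assumes the single-octet Flex-Algorithm, Metric-Type, Calc-Type and Priority fields of draft-ietf-lsr-flex-algo and a 4-octet Extended Admin Group word; the code point used for the Flex-Algo Definition sub-TLV itself is an assumption and only a placeholder here.

import struct

# Assumption: placeholder for the code point of the ISIS Flex-Algo Definition
# sub-TLV (carried in the Router Capability TLV); see the IANA registry.
FAD_SUBTLV_TYPE = 26

def encode_admin_group_subtlv(subtlv_type: int, bitmask: int) -> bytes:
    """Encode an include/exclude admin-group sub-TLV (types 1/2/3 above)
    carrying one 4-octet Extended Administrative Group word (RFC7308)."""
    value = struct.pack("!I", bitmask)
    return struct.pack("!BB", subtlv_type, len(value)) + value

def encode_fad_subtlv(algo: int, metric_type: int, calc_type: int,
                      priority: int, subtlvs: bytes = b"") -> bytes:
    """Encode the Flex-Algo Definition sub-TLV: four 1-octet fields
    (Flex-Algorithm, Metric-Type, Calc-Type, Priority) followed by the
    optional sub-TLVs."""
    body = struct.pack("!BBBB", algo, metric_type, calc_type, priority) + subtlvs
    return struct.pack("!BB", FAD_SUBTLV_TYPE, len(body)) + body

# Flex-Algo(129) of Example 7-4: IGP metric (0), SPF (0), priority 100,
# exclude-any admin-group with bit 2 set (RED -> 0x00000004).
fad_129 = encode_fad_subtlv(129, 0, 0, 100,
                            encode_admin_group_subtlv(1, 0x00000004))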

Note that the router isis configuration in Example 7‑4 also contains the affinity-map definition that defines the Flex-Algo application-specific affinity mappings. The advertisement of application-specific TE link attributes is specified for ISIS in draft-ietf-isis-te-app. In Example 7‑4, color RED is defined as the bit in position 2 of the affinity bitmap. The color is attached to interface Gi0/0/0/3, which is the link to Node5. Node3 now advertises its adjacency to Node5 with an affinity bitmap where bit 2 is set (binary 100 = hexadecimal 0x4). Example 7‑6 shows the ISIS advertisement of Node3 for its adjacency to Node5.

Example 7-6: Flex-algo affinity bitmap advertisement of Node3 in ISIS

  Metric: 10         IS-Extended xrvr-5.00
    Interface IP Address: 99.3.5.3
    Neighbor IP Address: 99.3.5.5
    Link Average Delay: 12 us
    Link Min/Max Delay: 12/12 us
    Link Delay Variation: 0 us
    Application Specific Link Attributes:
      L flag: 0, SA-Length 1, UDA-Length 0
      Standard Applications: FLEX-ALGO
        Ext Admin Group: 0x00000004
    Link Maximum SID Depth:
      Label Imposition: 10
    ADJ-SID: F:0 B:0 V:1 L:1 S:0 P:0 weight:0 Adjacency-sid:24035
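The Ext Admin Group value 0x00000004 in the output above follows directly from the affinity-map configuration: RED is bit position 2. A minimal sketch of this mapping (the helper is ours, for illustration):

def admin_group_mask(bit_positions):
    """Build the (extended) admin-group bitmask from configured bit positions."""
    mask = 0
    for bit in bit_positions:
        mask |= 1 << bit
    return mask

# affinity-map RED bit-position 2  ->  binary 100  ->  0x00000004
assert admin_group_mask([2]) == 0x00000004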

A node that participates in a given algorithm does not necessarily need to advertise a Prefix-SID for that algorithm. However, in practice a Prefix-SID is likely not only intended to steer the traffic on the desired path to that node, but also to use that node as part of a TI-LFA backup path. For that reason, we will assume in this chapter that when a node participates in Flex-Algo(K), it also advertises a Prefix-SID for Flex-Algo(K).

7.3 Path Computation

A node performs a computation per algorithm that it participates in. When computing the path for a given Flex-Algo(K), the computing node first removes all the nodes that do not advertise their participation in this Flex-Algo(K), as well as the resources that must be avoided according to the constraints of Flex-Algo(K). Any link that does not have a metric of the type used in Flex-Algo(K) is also pruned from the topology. The resulting topology is called Topo(K).

In Figure 7‑13, Flex-Algo(130) is defined as optimize IGP metric, exclude RED links and is enabled on all nodes except Node6. The Flex-Algo(130) topology, Topo(130), is thus derived from the original network topology in Figure 7‑13 (a) by pruning Node6, which does not participate in Flex-Algo(130), and the RED link between Node3 and Node5, which must be avoided as per the definition of Flex-Algo(130). The topology Topo(130), represented in Figure 7‑13 (b), is used for the Flex-Algo(130) path computations on all nodes participating in this algorithm.

Figure 7-13: Physical network topology and Flex-Algo(130) topology Topo(130)

The computing nodes leverage Dijkstra’s algorithm to compute the shortest path graph on Topo(K), optimizing the type of metric as defined for Flex-Algo(K). ECMP is obviously supported in each Flex-Algo, i.e., traffic-flows are load-balanced over multiple paths that have an equal cost for that Flex-Algo.
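The per-algorithm computation described above can be sketched as follows. The graph representation and function names are ours; this is an illustrative model of the pruning and SPF steps, not router code.

import heapq

def build_topo_k(links, participants, excluded_colors, metric_type):
    """links: iterable of (a, b, metrics_dict, colors_set), one per direction.
    Keep only links between participating nodes, without excluded colors,
    and that carry the required metric type."""
    return [(a, b, m[metric_type]) for a, b, m, colors in links
            if a in participants and b in participants
            and not (colors & excluded_colors) and metric_type in m]

def spf_with_ecmp(topo, source):
    """Dijkstra on Topo(K): returns per-node cost and predecessor sets,
    so that all equal-cost shortest paths are retained for ECMP."""
    adj = {}
    for a, b, cost in topo:
        adj.setdefault(a, []).append((b, cost))
    dist, preds, pq = {source: 0}, {source: set()}, [(0, source)]
    while pq:
        d, node = heapq.heappop(pq)
        if d > dist[node]:
            continue
        for nbr, cost in adj.get(node, []):
            nd = d + cost
            if nd < dist.get(nbr, float("inf")):
                dist[nbr], preds[nbr] = nd, {node}
                heapq.heappush(pq, (nd, nbr))
            elif nd == dist[nbr]:
                preds[nbr].add(node)   # keep the equal-cost alternative
    return dist, preds

For Flex-Algo(130) above, build_topo_k would be called with a participant set that excludes Node6 and with excluded_colors={"RED"}, yielding Topo(130); every participating node then runs the shortest-path computation on that pruned topology.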

Finally, the node installs the MPLS-to-MPLS forwarding entries for the Prefix-SIDs of Flex-Algo(K). The node does not install any IP-to-MPLS or IP-to-IP forwarding entries for these Flex-Algo Prefix-SIDs. This means that, by default, unlabeled IP packets are not steered onto a Flex-Algo Prefix-SID. Instead, the operator can rely on SR-TE steering mechanisms to steer traffic flows on Flex-Algo Prefix-SIDs.

For example, the Flex-Algo(130) path from Node1 to Node5 is 1→2→3→4→5. Without using the Flex-Algo SIDs, this path can be encoded in a SID list combining Algo(0) Prefix-SIDs (1600X for NodeX) with 24023, an Adj-SID of the link from Node2 to Node3. Also note that, although the network topology and the Flex-Algo definition are the same as for Flex-Algo(129) in Example 7‑4, the fact that Node6 does not participate in Flex-Algo(130) drastically changes the resulting path compared to Figure 7‑8 (1→6→5→4→3). This property of the Flex-Algo solution is particularly useful for use-cases like dual-plane disjoint paths, as described in section 7.6.

The path computation is performed by any node that participates in Flex-Algo(K). If a node participates in more than one Flex-Algo, then it performs the described computation independently for each of these Flex-Algos. In this example, all nodes support Algo(0) and thus perform an independent path computation for this algorithm, based on the regular IGP topology.

ECMP Load Balancing

ECMP-awareness is an intrinsic property of Prefix-SIDs: traffic flows steered over a Prefix-SID are load-balanced over all the equal-cost shortest paths towards the associated prefix. This concept is fairly simple to understand with the regular IGP Prefix-SIDs (algorithms 0 and 1), since the traffic flows merely follow the IGP shortest paths.

With Flex-Algo, it is important to understand that the term “shortest paths” is meant in a generic way, considering the optimization metric and constraints of the algorithm associated with the Prefix-SID. If the Prefix-SID algorithm is a Flex-Algo, then the shortest paths are computed on the Flex-Algo topology, as defined in the previous section, using the optimization metric specified in the Flex-Algo definition as the link metric. These are the only shortest paths being considered for ECMP load-balancing. In particular, the IGP metric has no impact at all on the ECMP decisions for a Flex-Algo defined as optimizing the TE or delay metric.

This particular property allows SR paths that would have required multiple SID lists of Algo(0) or Algo(1) Prefix-SIDs to be expressed with a single Flex-Algo SID or a single SID list. This is an important benefit for operational efficiency and robustness.

In the network of Figure 7‑14, Flex-Algo(128) is enabled on all nodes alongside Algo(0). The definition of Flex-Algo(128) is minimize delay metric. All links in the network have an IGP metric 10, except for the links between Node2 and Node3 and between Node4 and Node5 that have IGP metric 100. All links in the network have a link-delay metric 10, except for the link between Node6 and Node7 that has link-delay metric 92. Node8 advertises a Prefix-SID 16008 for Algo(0) and a Prefix-SID 16808 for Flex-Algo(128).

All nodes compute the IGP shortest path to Node8 and install the forwarding entry for Node8’s Prefix-SID 16008 along that path. This Prefix-SID’s path from Node1 to Node8 (1→6→7→8) is shown in Figure 7‑14.

All nodes compute the low-delay path to Node8 and install the forwarding entry for Node8’s Flex-Algo(128) Prefix-SID 16808 along that path. Node1 has two equal-delay paths to Node8, one via Node2 and another via Node4. Therefore, Node1 installs these two paths for Node8’s Flex-Algo(128) Prefix-SID 16808. This Prefix-SID’s equal-cost paths (1→2→3→8 and 1→4→5→8) are shown in Figure 7‑14. Node1 load-balances traffic-flows to this Prefix-SID over these two paths.
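Using the link metrics stated above (and only the links along the three paths of interest), the accumulated metrics can be checked with a few lines of Python; the path notation is ours.

# Link metrics along the three paths discussed above (IGP / delay).
IGP   = {("1","6"): 10, ("6","7"): 10, ("7","8"): 10,
         ("1","2"): 10, ("2","3"): 100, ("3","8"): 10,
         ("1","4"): 10, ("4","5"): 100, ("5","8"): 10}
DELAY = {("1","6"): 10, ("6","7"): 92, ("7","8"): 10,
         ("1","2"): 10, ("2","3"): 10, ("3","8"): 10,
         ("1","4"): 10, ("4","5"): 10, ("5","8"): 10}

def cost(path, metric):
    """Accumulate the metric along a path given as a string of node names."""
    return sum(metric[(a, b)] for a, b in zip(path, path[1:]))

assert cost("1678", IGP) == 30      # IGP shortest path, followed by 16008
assert cost("1238", DELAY) == 30    # low-delay path via Node2
assert cost("1458", DELAY) == 30    # equal-delay path via Node4: ECMP for 16808
assert cost("1678", DELAY) == 112   # the IGP shortest path is not low-delay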

Figure 7-14: Example Flex-Algo ECMP

Without using the Flex-Algo SIDs, this low-delay ECMP path to Node8 requires two SID lists <16002, 24023, 16008> and <16004, 24045, 16008>, where 16002, 16004, and 16008 are the Algo(0) Prefix-SIDs of Node2, Node4, and Node8 respectively, and 24023 and 24045 the Adj-SIDs of the links Node2-Node3 and Node4-Node5 respectively. The Adj-SIDs are required to cross the high IGP metric links.

7.4 TI-LFA Backup Path

If TI-LFA is enabled on a node, then this node provisions the TI-LFA backup paths for the Prefix-SIDs of each Flex-Algo(K) that it supports. The backup path for a Prefix-SID(K) is computed and optimized as per the definition of Flex-Algo(K). Only Flex-Algo(K) Prefix-SIDs and unprotected Adj-SIDs are used in the SID list that encodes the Flex-Algo(K) backup path. This ensures that TI-LFA backup paths meet the same constraints and minimize the same link metric as their primary paths.

The Flex-Algo TI-LFA functionality is illustrated with the topology in Figure 7‑15. It is similar to those used earlier in the chapter, except that all links have the same IGP metric (10).

Figure 7-15: Flex-Algo TI-LFA topology

All nodes in the network participate in Algo(0) and in Flex-Algo(129). Flex-Algo(129) is defined to optimize the IGP metric while avoiding RED links, as shown in the configuration in Example 7‑7.

Example 7-7: Flex-Algo definition configuration on all nodes

router isis 1
 flex-algo 129
  !! default: metric-type igp
  advertise-definition
  affinity exclude-any RED

Node1 advertises a Prefix-SID 16001 for Algo(0) and a Flex-Algo(129) Prefix-SID 16901. Node5 advertises Algo(0) Prefix-SID 16005 and Flex-Algo(129) Prefix-SID 16905.

TI-LFA protection is enabled on Node2, as shown in Example 7‑8. After enabling TI-LFA, Node2 computes the backup paths for all algorithms it participates in. We focus on the TI-LFA backup paths for the Prefix-SIDs advertised by Node1.

Example 7-8: TI-LFA enabled on Node2

router isis 1
 interface Gi0/0/0/0
  !! link to Node1
  point-to-point
  address-family ipv4 unicast
   fast-reroute per-prefix
   fast-reroute per-prefix ti-lfa

The primary path on Node2 for both of Node1’s Prefix-SIDs (16001 and 16901) is via the direct link to Node1, as shown in Figure 7‑15. TI-LFA independently computes backup paths for the Prefix-SIDs of each algorithm, using the topology as seen by that algorithm.

Algo(0) Backup Path

Let us start with the TI-LFA backup paths for the IGP SPF algorithm (Algo(0)) Prefix-SIDs. TI-LFA uses the Algo(0) topology, Topo(0), to compute the backup paths for the Algo(0) Prefix-SIDs. Topo(0) contains all nodes and all links in the network, since all nodes participate in Algo(0) and Algo(0) has no constraints.

The Algo(0) TI-LFA link-protecting backup path on Node2 for prefix 1.1.1.1/32 is the post-convergence path³ 2→3→5→6→1, as illustrated in Figure 7‑16. This backup path can be encoded by the SID list <16005, 16001>, only using Algo(0) Prefix-SIDs. This SID list first brings the packet via the Algo(0) path (this is the IGP shortest path) to Node5 (2→3→5) using the Prefix-SID(0) 16005 of Node5, and then via the Algo(0) path to Node1 (5→6→1) using the Prefix-SID(0) 16001 of Node1.

Figure 7-16: TI-LFA backup paths for each algorithm

Algo(129) Backup Path

To compute the Flex-Algo(129) TI-LFA backup path for the Flex-Algo(129) Prefix-SIDs, IGP on Node2 first derives the topology Topo(129) to be used for Flex-Algo(129) computations. For this, IGP removes all nodes that do not participate in Flex-Algo(129) and all links that are excluded by the Flex-Algo(129) constraints.

In this example, all nodes participate in Flex-Algo(129), therefore all nodes stay in Topo(129). Flex-Algo(129)’s definition specifies to avoid RED colored links, therefore the RED colored link between Node3 and Node5 is removed from the topology.

The Flex-Algo(129) TI-LFA link-protecting backup path on Node2 for Prefix-SID 16901 is the post-convergence path 2→3→4→5→6→1, as illustrated in Figure 7‑16. This backup path can be encoded by the SID list <16905, 16901>, using Flex-Algo(129) Prefix-SIDs. This SID list first brings the packet via the Flex-Algo(129) shortest path to Node5 (2→3→4→5) using the Prefix-SID(129) 16905 of Node5 and then via the Flex-Algo(129) shortest path to Node1 (5→6→1) using the Prefix-SID(129) 16901 of Node1.

Notice that the Flex-Algo(129) TI-LFA backup path avoids the RED link, as specified in the Flex-Algo(129) definition.

TI-LFA/Flex-Algo gain

“One of the major benefits of Flex-Algo is its ability to provision dynamic constrained paths based on a single SR label with local repair (TI-LFA) respecting the same constraints as the primary path. We looked into different options such as Multi-Topology (MT)-SIDs and various SR-TE policy-based routing solutions; however, none of the currently available techniques except the Flex-Algo SR-TE approach could efficiently support ~50ms recovery from a network node or link failure and at the same time prevent the traffic traversal via an undesirable path. Flex-Algo TI-LFA optimum fast-reroute capabilities could be well suited for unicast as well as multicast flows to significantly reduce operational overhead and greatly simplify our overall end-to-end service model.”
— Arkadiy Gulko

The properties of TI-LFA still hold with Flex-Algo: the backup path is tailored along the post-convergence path, at most two SIDs are required to express the backup path in symmetric metric networks, etc.

7.5 Integration With SR-TE

Flex-Algo is inherently part of SR. It leverages the algorithm functionality that has been part of the Prefix-SID definition since day 1. Flex-Algo is also part of the SR-TE Architecture. It enriches the set of SIDs that are available for SR-TE to encode SLA paths and is fully integrated with the other SR-TE mechanisms such as ODN and AS.

Any path in the network can be encoded as a list of Adj-SIDs. Prefix-SIDs of any algorithm allow SR-TE paths to be expressed in a better way by leveraging the IGP distributed path calculation and robustness. Not only do they drastically reduce the number of SIDs required in the SID list, but they also bring ECMP capabilities and all the IGP resiliency mechanisms to SR-TE.

Flex-Algo generalizes the concept of the Prefix-SID, previously limited to the unconstrained shortest IGP path, to any operator-defined intent. The Flex-Algo Prefix-SIDs thus allow SR-TE to meet the operator’s intent for an SR path more accurately, with even fewer SIDs in the SID list, more ECMP load-balancing, IGP-based re-optimization after a topology change and properly tailored TI-LFA backup paths.

Assume Flex-Algo(128) is enabled on all nodes in the network of Figure 7‑17 with algorithm definition minimize delay metric without constraints. By default, Algo(0) is also enabled on all nodes, providing the unconstrained IGP shortest path.

Figure 7-17: Flex-Algo integration with SR-TE

When configuring on Node1 an SR Policy to Node3 with a dynamic path optimizing the IGP metric, the computed path is 1→6→5→3. SR-TE would encode this path as the SID list <16003>, only containing the Algo(0) Prefix-SID 16003 of Node3. SR-TE could have encoded the computed path using various other SID lists, such as a list of Algo(0) Prefix-SIDs (1600X for NodeX) or a list of Adj-SIDs (240XY for the adjacency of NodeX to NodeY). In this scenario, SR-TE decides to encode the path with the Prefix-SID 16003 because this SID is the most appropriate one considering the dynamic path definition. It not only provides the shortest SID list, but it also provides the best resiliency and scaling. If SR-TE encoded the path with a longer SID list of Prefix-SIDs and Adj-SIDs along 1→6→5→3, SR-TE would have to update the SID list when, e.g., the link between Node1 and Node6 fails, since the existing SID list would no longer encode the new IGP shortest path 1→2→3. When using the SID list <16003>, SR-TE does not have to update the SID list after this failure since the IGP updates the path of Prefix-SID 16003 to the new IGP shortest path.
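The encoding preference discussed here can be illustrated with a simplified sketch: if the computed path coincides with the path of a single Prefix-SID (of whatever algorithm) toward the endpoint, that SID alone is the preferred encoding. The actual SR-TE SID-list computation is more elaborate; the function and data model below are ours, for illustration only.

def encode_with_single_sid(computed_path, sid_paths):
    """sid_paths: {label: set of ECMP paths (tuples) that this Prefix-SID
    follows to the SR Policy endpoint}. Return the label of a single SID
    that covers the computed path, or None if no single SID matches."""
    for label, paths in sid_paths.items():
        if tuple(computed_path) in paths:
            return label
    return None

# Figure 7-17: IGP-optimized path 1-6-5-3 and delay-optimized path 1-2-3.
sid_paths = {16003: {(1, 6, 5, 3)},    # Algo(0) Prefix-SID of Node3
             16803: {(1, 2, 3)}}       # Flex-Algo(128) Prefix-SID of Node3
assert encode_with_single_sid([1, 6, 5, 3], sid_paths) == 16003
assert encode_with_single_sid([1, 2, 3], sid_paths) == 16803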

Similarly, when an SR Policy to Node3 is configured on Node1 with a dynamic path optimizing delay, SR-TE computes the low-delay path as 1→2→3. SR-TE can encode this path with the SID list <16002, 24023>, with 16002 the Prefix-SID of Node2 and 24023 the Adj-SID of Node2 to Node3. However, SR-TE has another, more appropriate SID available: the Flex-Algo(128) Prefix-SID 16803 of Node3. This Flex-Algo SID allows the path to be expressed as the single SID <16803>, and it also provides better resiliency and scaling, for the same reasons as in the previous example. Upon failure of the link between Node2 and Node3, SR-TE does not have to update the SID list since the IGP updates the path of Prefix-SID 16803. In addition, the TI-LFA backup of 16803 is tailored along the low-delay post-convergence path since the IGP uses the Flex-Algo definition when computing the TI-LFA backup path.

SR-TE can also combine SIDs of different types in a SID list, as illustrated in the next example. The network in Figure 7‑18 illustrates the combination of Algo(0) Prefix-SIDs with a Flex-Algo Prefix-SID. The network consists of three domains, two edge domains (Edge1 and Edge2) and a core domain (Core).

Figure 7-18: Combining different types of SIDs in an SR Policy SID list

Since there is no significant delay difference in the edge domains, only Algo(0) is enabled in these domains.

To provide low-delay paths in the core domain, a Flex-Algo(128) is defined as minimize delay metric and all core nodes participate in this Flex-Algo(128). The low-delay path from Node11 to Node31 combines Algo(0) SIDs in the edge domains with a Flex-Algo(128) SID in the core domain. Node11 imposes the SID list <16001, 16803, 16031> on the low-delay traffic. This SID list first steers the packet on the IGP shortest path to Node1 (Algo(0) Prefix-SID 16001), then on the low-delay path to Node3, using the Flex-Algo(128) Prefix-SID 16803, and finally on the IGP shortest path to Node31, using Algo(0) Prefix-SID 16031.

7.5.1 ODN/AS

As we have seen in chapter 5, "Automated Steering" and chapter 6, "On-Demand Nexthop", the Automated Steering functionality steers a service route (e.g., BGP route) into the SR Policy that is identified by the nexthop and color of the service route. The On-demand Nexthop functionality enables automatic instantiation of an SR Policy path when receiving a service route. There is no change in the ODN and AS functionalities when using Flex-Algo SIDs to encode an SR Policy path. ODN and AS work irrespective of the composition of the SID list.

One way to specifically enforce the use of Flex-Algo SIDs is to specify the SID algorithm identifier as a constraint of the dynamic path. This way the Flex-Algo identifier is mapped to an SLA color. SR-TE will express the path of that SLA color using SIDs of the mapped Flex-Algo identifier.

For example, assume that the operator identifies low-delay by SLA color green (value 30). Flex-Algo(128) is defined to provide the low-delay path. Therefore, SLA color green (value 30) can be mapped to Flex-Algo(128). For this purpose, an on-demand color template is configured on the headend node. Example 7‑9 shows the on-demand color template for color 30, indicating Flex-Algo(128) SIDs must be used to encode the path.

Example 7-9: ODN configuration

segment-routing
 traffic-eng
  on-demand color 30
   dynamic
    sid-algorithm 128

Assume that the on-demand template in Example 7‑9 is applied on Node1 in Figure 7‑17. When a service route arrives on headend Node1, with BGP nexthop 1.1.1.3 and color green (30), the ODN functionality instantiates an SR Policy to 1.1.1.3 (Node3) based on the ODN template for color green. The template restricts the SIDs of the SR Policy’s SID list to Flex-Algo(128) SIDs. The solution SID list consists of a single SID: the Flex-Algo(128) Prefix-SID 16803 of Node3. Example 7‑10 shows the status of this ODN SR Policy.

Example 7-10: ODN SR Policy using Flex-Algo SID

RP/0/0/CPU0:xrvr-1#show segment-routing traffic-eng policy

SR-TE policy database
---------------------

Color: 30, End-point: 1.1.1.3
  Name: srte_c_30_ep_1.1.1.3
  Status:
    Admin: up  Operational: up for 00:02:03 (since Apr 9 17:11:52.137)
  Candidate-paths:
    Preference: 200 (BGP ODN) (active)
      Requested BSID: dynamic
      Constraints:
        Prefix-SID Algorithm: 128
      Dynamic (valid)
        16803 [Prefix-SID: 1.1.1.3, Algorithm: 128]
    Preference: 100 (BGP ODN)
      Requested BSID: dynamic
      Dynamic (pce) (invalid)
        Metric Type: NONE, Path Accumulated Metric: 0
  Attributes:
    Binding SID: 40001
    Forward Class: 0
    Steering BGP disabled: no
    IPv6 caps enable: yes

BGP on Node1 then uses Automated Steering to steer the service route into this ODN SR Policy.

7.5.2 Inter-Domain Paths

Consider the multi-domain network as shown in Figure 7‑19. All nodes in both domains participate in Flex-Algo(128), which is defined as optimize delay metric without constraints. Dynamic link-delay measurement is enabled on all links in both domains. The measured link-delays are displayed next to the links.

Figure 7-19: Flex-Algo ODN and AS Inter-Domain Delay

The operator wants to use ODN to automatically instantiate low-delay paths for service routes, leveraging the low-delay Flex-Algo SIDs in both domains. Since these are inter-domain paths, an SR PCE is required to compute these paths. The operator uses SLA color 30 to identify low-delay service paths.

Node9 advertises a service route 128.9.9.0/24 that requires a low-delay path. Therefore, Node9 attaches color extended community 30 to this route. Upon receiving this route, BGP on Node0 requests SR-TE to provide a path to Node9, with the characteristics that are specified in the on-demand template for color 30.

The operator wants to leverage the Flex-Algo SIDs, as indicated before. In this example there are two options to achieve this, using two different ODN templates. In the first option, the ODN template on headend Node0 only specifies the low-delay optimization objective and lets the PCE select the most appropriate SIDs to meet this objective. In a second option, the ODN template on headend Node0 explicitly specifies to use SIDs of Flex-Algo(128).

Both options lead to the same result in this example. The SR PCE computes the path and replies to Node0 with the SID list <16805, 16809>. Both SIDs are Flex-Algo(128) Prefix-SIDs. The first Prefix-SID in the SID list, 16805, brings the packet to Node5 via the low-delay path. The second Prefix-SID, 16809, brings the packet from Node5 to Node9 via the low-delay path. SR-TE on Node0 instantiates an SR Policy to Node9 with the received SID list and steers the service route 128.9.9.0/24 into this SR Policy.

However, the second option requires that the operator associates the Flex-Algo definition optimize delay metric without constraints with the same algorithm identifier, in this case 128, in both domains. If this definition is associated with Flex-Algo(128) in Domain1 and Flex-Algo(129) in Domain2, then the SID algorithm constraint would prevent SR-TE from returning the appropriate path. It is thus recommended to configure dynamic paths and ODN templates with the actual intent of the SR path, and to use the SID algorithm constraint only when the intent cannot be expressed otherwise. An example of an intent that would require the SID algorithm constraint is described in the next section.

The second option also requires support in PCEP to signal the required SID algorithm to the SR PCE. At the time of writing, this PCEP extension is not available.

7.6 Dual-Plane Disjoint Paths Use-Case

Figure 7‑20 shows a dual-plane network topology. Plane1 consists of nodes 1 to 4 while Plane2 consists of nodes 5 to 8. Node0 and Node9 are part of both planes.

Figure 7-20: Flex-Algo dual-plane network design

In this use-case, the operator needs to provide strict disjoint paths for certain traffic streams on the two planes, even during failure conditions. For other traffic, the operator wants to benefit from all available ECMP and resiliency provided by the dual-plane design model. This other traffic should therefore not be restricted to a single plane.

The operator achieves these requirements by using the Flex-Algo functionality. This functionality can indeed provide strict disjoint paths, even during failure conditions when the traffic is directed on the TI-LFA backup path. Furthermore, the Flex-Algo solution does not require any affinity link colors or additional loopback prefixes on the network nodes.

By default, all SR nodes automatically participate in the Algo(0) (IGP SPF) topology. The operator defines two Flex-Algos and identifies them with numbers 128 and 129. Both algorithms have the same definition: optimize IGP metric, no constraints. The operator enables Flex-Algo(128) on all the Plane1 nodes (0, 1, 2, 3, 4, 9) and Flex-Algo(129) on all the Plane2 nodes (0, 5, 6, 7, 8, 9).

If a node advertises its participation in a Flex-Algo, it typically also advertises a Prefix-SID for that Flex-Algo. As an example, Figure 7‑21 shows the different Prefix-SIDs that Node2, Node7, and Node9 advertise for their loopback prefix. The Prefix-SIDs advertised by the other nodes are not shown. Node9, for example, advertises the following Prefix-SIDs for its loopback prefix 1.1.1.9/32:

Algo(0) (SPF): 16009
Flex-Algo(128): 16809
Flex-Algo(129): 16909

Figure 7-21: Flex-Algo Prefix-SID assignment in dual-plane network

Example 7‑11 shows the relevant ISIS configuration of Node9. The flex-algo definitions are empty, since the IGP metric is optimized by default and no constraint is specified. The Prefix-SID configurations for the different algorithms are shown under interface Loopback0.

Example 7-11: Flex-Algo use-case – Dual-plane, ISIS configuration of Node9

interface Loopback0
 ipv4 address 1.1.1.9/32
!
router isis 1
 flex-algo 128
  !! default: metric-type igp
  advertise-definition
 !
 flex-algo 129
  !! default: metric-type igp
  advertise-definition
 !
 address-family ipv4 unicast
  segment-routing mpls
 !
 interface Loopback0
  address-family ipv4 unicast
   prefix-sid absolute 16009
   prefix-sid algorithm 128 absolute 16809
   prefix-sid algorithm 129 absolute 16909

On each node, ISIS computes the Shortest Path Trees (SPTs) for the different algorithms that the node participates in. ISIS starts by deriving the topology for each algorithm by pruning the non-participating nodes and excluded links from the topology graph. This results in the topologies shown in Figure 7‑22: Topo(0), Topo(128), and Topo(129).

Figure 7-22: Flex-Algo Topologies – Topo(0), Topo(128), and Topo(129)

On each node, after computing the SPT for a given algorithm topology, ISIS installs the forwarding entries for the Prefix-SIDs of the related algorithm. As an example, Figure 7‑23 shows the paths from Node0 to Node9 for the three Prefix-SIDs that are advertised by Node9. Notice that the Algo(0) Prefix-SID 16009 leverages all available ECMP, of both planes. Flex-Algo(128) and Flex-Algo(129) Prefix-SIDs leverage the available ECMP, constrained to their respective plane.

Figure 7-23: Flex-Algo Prefix-SID path examples

With the exception of the Algo(0) Prefix-SID, only MPLS-to-MPLS (label swap) and MPLS-to-IP (label pop) entries are installed for the Prefix-SIDs. IP-to-MPLS (label push) forwarding entries are typically installed for Algo(0) Prefix-SIDs⁴. As an example, the following MPLS forwarding entries are installed on Node0:

Algo(0):
  In 16009, out 16009 via Node1 or Node5
  In 16002, out 16002 via Node1
  In 16007, out 16007 via Node5
Flex-Algo(128):
  In 16809, out 16809 via Node1
  In 16802, out 16802 via Node1
Flex-Algo(129):
  In 16909, out 16909 via Node5
  In 16907, out 16907 via Node5

Node1 installs the following forwarding entries:

Algo(0):
  In 16009, out 16009 via Node2 or Node4
  In 16002, out 16002 via Node2
  In 16007, out 16007 via Node2, Node4, or Node5
Flex-Algo(128):
  In 16809, out 16809 via Node2 or Node4
  In 16802, out 16802 via Node2
Flex-Algo(129):
  None, Node1 does not participate in Flex-Algo(129)

When enabling TI-LFA, the traffic on a given Flex-Algo Prefix-SID will be protected by a backup path that is constrained to the topology of the Flex-Algo. This implies that even in failure cases the protected traffic will stay in the Flex-Algo’s topology plane. Since traffic carried on Algo(0) Prefix-SIDs is not constrained to a single plane, this does not apply to these Prefix-SIDs. Traffic to Algo(0) Prefix-SIDs can be deviated to another plane in failure cases.

The operator can steer service traffic on either of the planes by attaching a specific SLA color to the service route. If a service route has no color, then its traffic flows are steered on the Algo(0) Prefix-SID and shared over both planes. The operator chooses the SLA color value 1128 to indicate Flex-Algo(128), and SLA color value 1129 to indicate Flex-Algo(129).

Assume that Node9 advertises three service routes, for an L3VPN service in this example, to Node0. L3VPN prefix 128.9.9.0/24 must be steered on a Flex-Algo(128) path, L3VPN prefix 129.9.9.0/24 must be steered on a Flex-Algo(129) path, and L3VPN prefix 9.9.9.0/24 must follow the regular IGP shortest path to its nexthop. Therefore, Node9 advertises 128.9.9.0/24 with color extended community 1128 and 129.9.9.0/24 with color extended community 1129. 9.9.9.0/24 is advertised without color.

Node0 is configured as shown in Example 7‑12, to enable the ODN functionality for both Flex-Algos (lines 3 to 9). Color 1128 is mapped to Flex-Algo(128), and color 1129 is mapped to Flex-Algo(129). Part of the BGP configuration on this node is included to illustrate that no BGP route-policy is required for the ODN/AS functionality.

Example 7-12: Flex-Algo use-case – Dual-plane, Automated Steering on Node0

 1 segment-routing
 2  traffic-eng
 3   on-demand color 1128
 4    dynamic
 5     sid-algorithm 128
 6   !
 7   on-demand color 1129
 8    dynamic
 9     sid-algorithm 129
10 !
11 router bgp 1
12  neighbor 1.1.1.9
13   remote-as 1
14   update-source Loopback0
15   address-family vpnv4 unicast
16  !
17  vrf Acme
18   rd auto
19   address-family ipv4 unicast

Figure 7‑24 illustrates the steering of the uncolored L3VPN prefix 9.9.9.0/24. The BGP advertisement for this prefix arrives on Node0 without any color extended community; therefore, BGP installs the prefix as usual, recursing on its BGP next-hop 1.1.1.9. By default, Node0 imposes the (Algo(0)) Prefix-SID 16009 for traffic destined for 1.1.1.9/32, and for traffic to prefixes that recurse on 1.1.1.9/32, such as 9.9.9.0/24. Since traffic flows to Prefix-SID 16009 are load-balanced over all available ECMP of both planes, this is also the case for traffic flows to 9.9.9.0/24.

Figure 7-24: Algo(0) Prefix-SID uses both planes

The BGP advertisement for L3VPN prefix 128.9.9.0/24 arrives on Node0 with a color extended community 1128, as shown in Figure 7‑25. Since Node0 has an SR-TE on-demand template for color 1128 configured, mapping color 1128 to Flex-Algo(128), BGP installs L3VPN prefix 128.9.9.0/24, recursing on the Flex-Algo(128) Prefix-SID 16809 of the BGP next-hop 1.1.1.9/32. Traffic-flows towards 128.9.9.0/24 are distributed over the available ECMP of the top plane (Flex-Algo(128)) only.

Figure 7-25: Flex-Algo(128) Prefix-SID uses one plane

Similarly, traffic flows to L3VPN prefix 129.9.9.0/24 will be distributed over the available ECMP of the bottom plane (Flex-Algo(129)) only.

Flex-Algo Use Cases – Disjointness and many others

“It is very important to categorize operators’ use cases for unconventional flows that require traffic to be steered over a specific path/topology to meet explicit SLA requirements. The use cases for these flows could be categorized as follows:

1. Flex-Algo Only – Use cases that could be framed into predefined path/topology options based on commonly used constraints, such as:

To contain the traffic within specific scoped boundaries (local/metro; sub- or multi-regional; tiered (parent traffic should never traverse child));
Dual plane disjoint path based on strict/loose routing within a plane to support two unicast/multicast flows that MUST traverse completely diverse, logically and physically, end-to-end paths under any conditions;
Path meeting specific SLA (e.g., delay, loss)

2. Controller – Use cases that do not fit into predefined constraint categories, such as:

Dual plane disjoint path where any path at a time can fall back to a common plane based on SLA and still preserve the disjointness;
Guaranteed BW path;
TE Multi-Layer path/visibility;

Flex-Algo could be applicable to both; in the first set of use cases Flex-Algo could provide a full solution, in the second set of use cases Flex-Algo could effectively complement the controller approach to enhance the computation outcome.

Flex-Algo allows us to design any topology/path for unicast and multicast as per our intent to achieve specific business objectives. For instance, we can define constraints and a computation algorithm and apply them in such a way to support any diverse requirements for a dual plane model per regional or global scope based on best SLA performance in a seamless fashion with automated provisioning, minimum operational overhead and no additional protocol support. In another example where business drives cost reduction, by reducing the number of redundant circuits to 3 per location (by decreasing availability/reliability of the dual plane model) but still mandating the required diversity based on the best performing paths for two streams, we could use Flex-Algo in combination with a Controller to provide disjointness and best paths, using a third circuit as a common path that either plane might use at any given time. In addition to the business gain above, the combination of Flex-Algo with a Controller produces a set of other benefits such as fast recovery and use of a minimum label stack, which converts in business terms to risk mitigation and cost avoidance.

As you noted, multicast has been mentioned alongside unicast. And this is a major advantage of Flex-Algo since unicast and multicast traffic steering leverage the common Flex-Algo infrastructure. Even though the result of the Flex-Algo computation is programmed in the MPLS FIB, Flex-Algo could be decoupled from the MPLS FIB and be used by other applications such as mLDP and BIER.

Many other use cases could be demonstrated to reveal how impactful Flex-Algo technology might be in resolving various business requirements. Operator creativity to define its own custom algorithm spreads far beyond the presented use cases.”
— Arkadiy Gulko

7.7 Flex-Algo Anycast-SID Use-Case

In Figure 7‑26, the network operator wants to enforce for some Internet prefixes a low-delay path from the Internet border router Node1 to a specific Broadband Network Gateway (BNG) Node8. This could be easily achieved with SR IGP Flex-Algo if Flex-Algo were available from ingress to egress, but unfortunately that is not the case. Part of the network consists of older LDP-only nodes (Node4, Node5, and Node8). On those LDP-only nodes the delay metrics are not available and Flex-Algo cannot be used. All nodes are in the same single-level ISIS area. Node3 and Node6 are on the border between the SR domain and the LDP domain; these are the SR domain border nodes. Both SR and LDP are enabled on these nodes. Link-delay measurement is enabled on all links in the SR domain.

Figure 7-26: Low-delay path in SR/LDP network

In order to provide a low-delay path where possible and an IGP shortest path where no delay information is available, the operator combines a Flex-Algo SID (providing the low-delay path in the SR domain) with a regular Prefix-SID (providing the IGP shortest path in the LDP domain). The operator enables Flex-Algo(128) on Node1, Node2, Node3, Node6, and Node7 with definition optimize delay metric.

Given that the link-delays in the LDP portion of the network are unknown, the low-delay path from Node1 to the (delay-wise) closest SR domain border node is combined with the IGP shortest path from that border node to BNG Node8. The selection of the optimal SR domain border node is automated by leveraging the Anycast-SID functionality. This is simply done by letting these border nodes advertise the same (anycast) prefix with the same Flex-Algo Anycast-SID. These SR domain border nodes are then grouped in an anycast set, with the Anycast-SID leading to the closest node in this anycast set.

The Flex-Algo(128) configurations on the SR domain border nodes Node3 and Node6 are presented in Example 7‑13 and Example 7‑14. Both nodes advertise their own Flex-Algo(128) Prefix-SID with their Loopback0 address, and they both advertise the Flex-Algo(128) Anycast-SID 16836 with their Loopback136 prefix 136.1.1.136/32. An Anycast-SID is a sub-type of Prefix-SID that is advertised by several nodes, as opposed to the Node-SID, which is another sub-type of Prefix-SID advertised by a single node. Node-SIDs are advertised with the Node flag, and a configured Prefix-SID is by default a Node-SID. Anycast-SIDs are advertised without the Node flag (N-flag), as specified with the n-flag-clear configuration.

Example 7-13: Flex-Algo configuration on Node3

interface Loopback0
 ipv4 address 1.1.1.3 255.255.255.255
!
interface Loopback136
 description Anycast of Node3 and Node6
 ipv4 address 136.1.1.136 255.255.255.255
!
router isis 1
 flex-algo 128
  metric-type delay
  advertise-definition
 !
 interface Loopback0
  address-family ipv4 unicast
   prefix-sid absolute 16003
   prefix-sid algorithm 128 absolute 16803
 !
 interface Loopback136
  address-family ipv4 unicast
   prefix-sid algorithm 128 absolute 16836 n-flag-clear

Example 7-14: Flex-Algo configuration on Node6

interface Loopback0
 ipv4 address 1.1.1.6 255.255.255.255
!
interface Loopback136
 description Anycast of Node3 and Node6
 ipv4 address 136.1.1.136 255.255.255.255
!
router isis 1
 flex-algo 128
  metric-type delay
  advertise-definition
 !
 interface Loopback0
  address-family ipv4 unicast
   prefix-sid absolute 16006
   prefix-sid algorithm 128 absolute 16806
 !
 interface Loopback136
  address-family ipv4 unicast
   prefix-sid algorithm 128 absolute 16836 n-flag-clear

This Flex-Algo(128) Anycast-SID 16836 transports packets to the closest (delay-wise) SR domain border node. If two or more SR domain border nodes are equally close in terms of delay, then the traffic is per-flow load-balanced among them. Given the delay metrics in Figure 7‑26, Node6 is the SR domain border node that is closest to Node1. Therefore, the path of Flex-Algo(128) Anycast-SID 16836 from Node1 is via Node7 (1→7→6).

To traverse the LDP portion of the path, the IGP shortest path is used. Node8 is not SR-enabled, hence it does not advertise a Prefix-SID. Therefore, one or more SR Mapping-Servers (SRMSs) advertise a (regular Algo(0)) Prefix-SID 16008 for the BNG address 1.1.1.8/32. Refer to Part I of the Segment Routing book series for more details on the SRMS. Node2 is used as SRMS and its configuration is shown in Example 7‑15.

Example 7-15: SRMS configuration on Node2

segment-routing
 mapping-server
  prefix-sid-map
   address-family ipv4
    1.1.1.8/32 8 range 1  !! Prefix-SID 16008 (index 8)
!
router isis 1
 address-family ipv4 unicast
  segment-routing prefix-sid-map advertise-local

This Prefix-SID 16008 carries the traffic from the SR border node to the BNG via the IGP shortest path. The SR domain border nodes automatically (no configuration required) take care of the interworking with LDP by stitching the BNG Prefix-SID to the LDP LSP.

As shown in Example 7‑16, an SR Policy POLICY2 to the BNG Node8 is configured on the Internet border router Node1. This SR Policy has an explicit path PATH2 consisting of two SIDs: the Flex-Algo Anycast-SID 16836 of the SR domain border nodes and the Prefix-SID 16008 of BNG Node8.

Example 7-16: SR Policy configuration on Node1

segment-routing
 traffic-eng
  segment-list PATH2
   index 20 mpls label 16836  !! Flex-Algo Anycast-SID of Node(3,6)
   index 40 mpls label 16008  !! Prefix-SID of Node8
  !
  policy POLICY2
   color 30 end-point ipv4 1.1.1.8
   candidate-paths
    preference 100
     explicit segment-list PATH2

The status of the SR Policy is shown in Example 7‑17.

Example 7-17: SR Policy status on Node1

RP/0/0/CPU0:xrvr-1#show segment-routing traffic-eng policy

SR-TE policy database
---------------------

Color: 30, End-point: 1.1.1.8
  Name: srte_c_30_ep_1.1.1.8
  Status:
    Admin: up  Operational: up for 00:13:05 (since Mar 30 14:57:53.285)
  Candidate-paths:
    Preference: 100 (configuration) (active)
      Name: POLICY2
      Requested BSID: dynamic
      Explicit: segment-list PATH2 (valid)
        Weight: 1
        16836 [Prefix-SID, 136.1.1.136]
        16008
  Attributes:
    Binding SID: 40001
    Forward Class: 0
    Steering BGP disabled: no
    IPv6 caps enable: yes

The path of the SR Policy is illustrated in Figure 7‑27. The Flex-Algo(128) Anycast-SID 16836 provides the low-delay path to the closest of (Node3, Node6) and the packets arrive on Node3 or Node6 with Prefix-SID label 16008 as top label. Given the link delays in this example, the node closest to Node1 is Node6.

Figure 7-27: SR Policy path combining Flex-Algo and SRMS Prefix-SID

Node3 and Node6 automatically stitch Node8’s Prefix-SID 16008 to the LDP LSPs towards Node8. Node6 has two equal cost paths to BNG Node8, as illustrated in Figure 7‑27. This is the case for Node3 as well. The MPLS forwarding table on Node6, displayed in Example 7‑18, shows that Node6 installs Prefix-SID 16008 with outgoing LDP labels 90048 and 90058. These are the LDP labels allocated for 1.1.1.8/32 by Node4 and Node5 respectively (see output in Example 7‑18). Node6 load-balances traffic-flows over these two paths.

Example 7-18: Node8’s Prefix-SID in MPLS forwarding table on Node6

RP/0/0/CPU0:xrvr-6#show mpls ldp bindings 1.1.1.8/32
1.1.1.8/32, rev 39
        Local binding: label: 90068
        Remote bindings: (3 peers)
            Peer                Label
            -----------------   ---------
            1.1.1.3:0           90038
            1.1.1.4:0           90048
            1.1.1.5:0           90058

RP/0/0/CPU0:xrvr-6#show mpls forwarding labels 16008
Local  Outgoing    Prefix             Outgoing     Next Hop        Bytes
Label  Label       or ID              Interface                    Switched
------ ----------- ------------------ ------------ --------------- ---------
16008  90058       SR Pfx (idx 8)     Gi0/0/0/0    99.5.6.5        4880
       90048       SR Pfx (idx 8)     Gi0/0/0/4    99.4.6.4        1600

This solution is fully automated and dynamic once configured. The IGP dynamically adjusts the Flex-Algo(128) Anycast-SID to go via the closest (lowest delay) SR domain border node. The Prefix-SID 16008 then steers the traffic via the IGP shortest path to BNG Node8. Anycast-SIDs also provide node resiliency. In case one of the SR domain border nodes fails, the other seamlessly takes over. The operator uses Automated Steering to steer the destination prefixes that require a low-delay path into the SR Policy.

In this use-case the operator uses different SR building blocks (Flex-Algo, Anycast-SID, SRMS, SR/LDP interworking) and combines them with the SR-TE infrastructure to provide a solution for the problem. It is an example of the versatility of the solution.

7.8 Summary
Since day 1 of Segment Routing, a Prefix-SID is associated with a prefix and an algorithm. Most often, the Prefix-SID algorithm is the default one (zero). With algorithm zero, the IGP Prefix-SID is steered via the shortest IGP path to its associated prefix (i.e., the shortest path as computed by ISIS and OSPF).
Algorithms 0 to 127 are standardized by the IETF. Algorithms 128 to 255 are customized by the operator. They are called SR IGP Flexible Algorithms (Flex-Algo for short). For example, one operator may define Flex-Algo(128) to minimize the delay, while another operator defines Flex-Algo(128) to minimize the IGP metric and avoid the TE-affinity RED. Yet another operator could define Flex-Algo(128) as a minimization of the TE metric while avoiding SRLG 20.
Any node participating in a Flex-Algo advertises its support for this Flex-Algo. Typically, a node that participates in a Flex-Algo has a Prefix-SID for that Flex-Algo, although this is not an absolute requirement.
Adding a Prefix-SID for a Flex-Algo to a node does not require configuring an additional loopback address. Multiple Prefix-SIDs of different algorithms can share the same address. This is an important benefit for operational simplicity and scale.
Any node participating in a Flex-Algo computes the paths to the Prefix-SIDs of that Flex-Algo. Any node not participating in a Flex-Algo does not compute the paths to the Prefix-SIDs of that Flex-Algo. This is an important benefit for scale.
The definition of a Flex-Algo must be consistent across the IGP domain. The Flex-Algo solution includes a mechanism to enforce that consistency.
Flex-Algo is an intrinsic component of the SR-TE architecture; Flex-Algo SIDs with their specificities are automatically considered for SR-TE path computation and can be included in any SR Policy. As such, the ODN and AS components of SR-TE natively leverage Flex-Algo. Typical use-cases involve plane-disjoint paths and low-latency routing.
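The two ingredients of a Flex-Algo definition summarized above, a constraint that prunes links and a metric that is minimized, can be sketched in a few lines. The following Python sketch assumes a small hypothetical topology and is not the IGP implementation; it only illustrates pruning links carrying an excluded affinity and then running a shortest-path computation on the delay metric.

# Sketch of a Flex-Algo style computation: prune excluded links, then run
# Dijkstra on the Flex-Algo metric (here: delay). Topology is hypothetical.
import heapq

# (node_a, node_b, delay_metric, affinities)
links = [
    ("A", "B", 10, set()),
    ("B", "D", 10, {"RED"}),   # excluded by the Flex-Algo definition
    ("B", "C", 15, set()),
    ("C", "D", 15, set()),
]

def flex_algo_spf(src, exclude_affinity):
    adj = {}
    for a, b, delay, aff in links:
        if aff & exclude_affinity:
            continue                      # constraint: avoid these links
        adj.setdefault(a, []).append((b, delay))
        adj.setdefault(b, []).append((a, delay))
    dist, heap = {src: 0}, [(0, src)]
    while heap:
        d, n = heapq.heappop(heap)
        if d > dist.get(n, float("inf")):
            continue
        for nbr, cost in adj.get(n, []):
            if d + cost < dist.get(nbr, float("inf")):
                dist[nbr] = d + cost
                heapq.heappush(heap, (d + cost, nbr))
    return dist

print(flex_algo_spf("A", exclude_affinity={"RED"}))   # D reached via B-C-D: 40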

7.9 References
[SR-book-Part-I] "Segment Routing Part I", Clarence Filsfils, Kris Michielsen, Ketan Talaulikar, October 2016, ASIN: B01I58LSUO (Kindle), ISBN-10: 1542369126, ISBN-13: 978-1542369121
[RFC5305] "IS-IS Extensions for Traffic Engineering", Tony Li, Henk Smit, RFC5305, October 2008
[RFC7308] "Extended Administrative Groups in MPLS Traffic Engineering (MPLS-TE)", Eric Osborne, RFC7308, July 2014
[RFC7684] "OSPFv2 Prefix/Link Attribute Advertisement", Peter Psenak, Hannes Gredler, Rob Shakir, Wim Henderickx, Jeff Tantsura, Acee Lindem, RFC7684, November 2015
[RFC7810] "IS-IS Traffic Engineering (TE) Metric Extensions", Stefano Previdi, Spencer Giacalone, David Ward, John Drake, Qin Wu, RFC7810, May 2016
[RFC8402] "Segment Routing Architecture", Clarence Filsfils, Stefano Previdi, Les Ginsberg, Bruno Decraene, Stephane Litkowski, Rob Shakir, RFC8402, July 2018
[draft-ietf-lsr-flex-algo] "IGP Flexible Algorithm", Peter Psenak, Shraddha Hegde, Clarence Filsfils, Ketan Talaulikar, Arkadiy Gulko, draft-ietf-lsr-flex-algo-01 (Work in Progress), November 2018
[draft-ietf-isis-te-app] "IS-IS TE Attributes per application", Les Ginsberg, Peter Psenak, Stefano Previdi, Wim Henderickx, John Drake, draft-ietf-isis-te-app-06 (Work in Progress), April 2019
[draft-ietf-ospf-segment-routing-extensions] "OSPF Extensions for Segment Routing", Peter Psenak, Stefano Previdi, Clarence Filsfils, Hannes Gredler, Rob Shakir, Wim Henderickx, Jeff Tantsura, draft-ietf-ospf-segment-routing-extensions-27 (Work in Progress), December 2018
[draft-ietf-isis-segment-routing-extensions] "IS-IS Extensions for Segment Routing", Stefano Previdi, Les Ginsberg, Clarence Filsfils, Ahmed Bashandy, Hannes Gredler, Bruno Decraene, draft-ietf-isis-segment-routing-extensions-23 (Work in Progress), March 2019
[draft-ketant-idr-bgp-ls-flex-algo] "Flexible Algorithm Definition Advertisement with BGP Link-State", Ketan Talaulikar, Peter Psenak, Shawn Zandi, Gaurav Dawra, draft-ketant-idr-bgp-ls-flex-algo-01 (Work in Progress), February 2019

1. “Strict” must not be confused with “unprotected”. FRR mechanisms such as TI-LFA normally apply to Strict-SPF Prefix-SIDs.↩
2. IGP Algorithm Types in IANA registry: https://www.iana.org/assignments/igp-parameters/igp-parameters.xhtml#igp-algorithm-types↩
3. The TI-LFA backup path is tailored along the post-convergence path; this is the path that the traffic will follow after the IGP converges following a failure. Read Part I of the SR book series for more details.↩
4. Unless LDP is enabled and the default label imposition preference is applied. See Part I of the SR book series.↩

8 Network Resiliency
The resiliency of an SR Policy may benefit from several detection mechanisms and several convergence or protection solutions:
- local trigger from a local link failure
- remote intra-domain trigger through IGP flooding
- remote inter-domain trigger through BGP-LS flooding
- validation of explicit candidate paths
- re-computation of dynamic paths
- selection of the next best candidate path
- IGP convergence of constituent IGP Prefix-SIDs
- Anycast Prefix-SID leverage
- TI-LFA local protection of constituent SIDs, including Flex-Algo SIDs
- TI-LFA node protection for an intermediate segment of an SR Policy
- invalidation of a candidate path based on end-to-end liveness detection

We will describe each of these mechanisms in isolation and highlight how they can cooperate to the benefit of the resiliency of an SR Policy. We will also explain how an SR Policy can be designed to avoid TI-LFA protection when it is not desirable.

The Fast Convergence (FC) Project
“Around 2001, while working on the design and deployment of the first worldwide DiffServ network, I realized that the loss of connectivity (packet drops) following a link/node down/up transition was much more important than the impact of congestion (QoS/DiffServ). I then started the Fast Convergence project at Cisco Systems where we step-by-step optimized all the components of the routing resiliency solution.
We first focused on the IGP convergence as this is the base and most general solution. We drastically improved the IGP convergence on large networks from 10s of seconds down to 200msec. In parallel, we pioneered the IP Fast Reroute technology and invented Loop Free Alternate (LFA), Remote LFA, TI-LFA for link/node and SRLG protection, the SR IGP microloop avoidance solution and, last but not least, the BGP Prefix-Independent (BGP PIC) fast-reroute solution.
A key aspect of this research, engineering and deployment is that this is automated by IGP and BGP (simple), optimum (the backup paths are computed on a per-destination basis) and complete (lots of focus has been placed on the hardware data structures to provide sub-10msec failure detection and enable sub-25msec backup paths in a prefix-independent manner).
All of these efforts led to over 30 patents and in many aspects are the source of the Segment Routing solution. Indeed, when I studied the first SDN experiments with OpenFlow, it was very clear to me that it would not scale. The unit of state (a per-flow entry at every hop through an inter-domain fabric) was way too low and would exhaust the controller. I had spent years optimizing the installation of these states in the dataplane. Instead, I thought that the SDN controller should work with IGP constructs (IGP prefix and IGP adjacency segments). The SDN controller should focus on combining these segments and delegate their state maintenance to the distributed intelligence of the network.
We should have written a book on this FC journey. Maybe one day... ”
— Clarence Filsfils

8.1 Local Failure Detection
The sooner a local link failure is detected, the sooner a solution can be found. Some transmission media provide fast hardware-based notifications of connectivity loss. By default in IOS XR, link down events are immediately notified to the control plane (i.e., the default carrier-delay for link down events is zero).
For transmission media that do not provide such fast fault notifications, or if the link is not a direct connection but there is an underlying infrastructure providing the connectivity, a higher layer mechanism must perform the connectivity verification. Link liveness is traditionally assessed by IGPs using periodic hello messages. However, the complexity of handling IGP hellos in the Route Processor imposes a rather large lower bound on the hello interval, thus making this mechanism unsuitable for fast failure detection.
Bidirectional Forwarding Detection (BFD) is a generic light-weight hello mechanism that can be used to detect connectivity failures quickly and efficiently. Processing the fixed-format BFD packets is a simple operation that can be offloaded to the linecards. BFD sends a continuous fast-paced stream of BFD Control packets from both ends of the link. A connectivity failure is reported when a series of consecutive packets is not received on one end. A detected connectivity failure is indicated in the transmitted BFD packets. This way, unidirectional failures can be reported on both sides of the link.
In addition to the BFD Control packets, BFD can transmit Echo packets. This is the so-called Echo mode. The Echo packets are transmitted on the link with the local IP address as destination IP address. The remote end of the link forwards these Echo packets as regular traffic back to the local node. Since BFD on the remote node is not involved in processing Echo packets and less delay variation is expected, BFD Echo packets can be sent at a faster pace, resulting in faster failure detection. The transmission interval of the BFD Control packets can then be slowed down.

In Example 8‑1, the IGP (ISIS or OSPF) bootstraps the BFD session to its neighbor on the interface. Echo mode is enabled by default and the echo transmission interval is set to 50 milliseconds (bfd minimum-interval 50). BFD Control packets (named Async in the output) are sent at a default two-second interval. The default BFD multiplier is 3. The status of the BFD session is verified in Example 8‑2. With this configuration the connectivity detection time is < 150 ms (= 50 ms × 3), which is the echo transmission interval times the multiplier.

Example 8-1: ISIS and OSPF BFD configurations
router isis 1
 interface TenGigE0/1/0/0
  bfd minimum-interval 50
  bfd fast-detect ipv4
!
router ospf 1
 area 0
  interface TenGigE0/1/0/1
   bfd minimum-interval 50
   bfd fast-detect

Example 8-2: BFD session status
RP/0/RSP0/CPU0:R2#show bfd session
Interface  Dest Addr   Local det time(int*mult)            State  H/W  NPU
                       Echo              Async
---------- ----------- ----------------- ----------------- ------ ---- ----
Te0/1/0/0  99.2.3.3    150ms(50ms*3)     6s(2s*3)          UP     No   n/a
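The detection-time arithmetic can be captured in a couple of lines. The sketch below simply multiplies the configured values of Example 8‑1 and Example 8‑2; it is an illustration, not a BFD implementation.

# Detection time = transmit interval x multiplier (values from Examples 8-1/8-2).
echo_interval_ms = 50      # bfd minimum-interval 50
async_interval_ms = 2000   # default Async (Control packet) interval
multiplier = 3             # default BFD multiplier

print(f"Echo detection time:  {echo_interval_ms * multiplier} ms")    # 150 ms
print(f"Async detection time: {async_interval_ms * multiplier} ms")   # 6000 ms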

“It should be pointed out that in a number of production networks the term “convergence” is often used to describe the duration of the outage when, upon node or link failure, packets are dropped until the routing protocols (IGP and even BGP) synchronize their databases or tables network wide. The duration of the outage usually takes seconds or even minutes for BGP. That is also one of the reasons many people consider BGP a slow protocol.
Network convergence as described above is only one aspect of a sound network design. In practice, network designs should introduce forms of connectivity restoration hierarchy into any network. Below are the summarized best practices to reduce the duration of packet drops upon any network failure:
- BGP at the ingress and egress of the network should have both internal and external path redundancy ahead of failure. Possible options: “Distribution of Diverse BGP Paths” [RFC6774] or “Advertisement of Multiple Paths in BGP” [RFC7911].
- Routing in the core should be based on the BGP next hop, and in the cases where distribution of redundant paths to all network elements would be problematic, some form of encapsulation should be applied.
- Protection should be computed and inserted into the FIB ahead of failure such that connectivity restoration is immediate (closely related to the failure detection time) while the protocols take their time to synchronize the routing layer. That is why applying fast-reroute techniques (TI-LFA or PIC) is an extremely important element every operator should strongly consider using in his network, regardless of whether the network is pure IP, MPLS or SR enabled.
The fundamental disruptive change which needs to be clearly articulated here is that until Segment Routing, the only practical fine-grained traffic control, and therefore its protection, could be accomplished with RSVP-TE. The problem, and in my opinion the main significant difference between those two technologies, is the fact that the soft-state RSVP-TE mechanism required end-to-end path signaling with an RSVP PATH message followed by an RSVP RESV message. SR requires no event-based signaling as the network topology with SIDs (MPLS or IPv6) is known ahead of any network event. ”
— Robert Raszuk

A local interface failure or BFD session failure triggers two concurrent reactions: the IGP convergence and local Fast Reroute. The local Fast Reroute mechanism (TI-LFA) ensures a sub-50msec restoration of the traffic flows. TI-LFA is detailed in section 8.9. In parallel, the detection of the local link failure triggers IGP convergence.
The failure of a local link brings down the IGP adjacency over that link, which then triggers the link-state IGP to generate a new Link-State PDU (LSP) that reflects the new situation (after the failure). We use the ISIS term LSP in this section, but it equally applies to OSPF Link-State Advertisements (LSAs).
A delay is introduced before actually generating the new LSP. This is not a static delay, but a delay that dynamically varies with the stability of the network. The delay is minimal in a stable network. In periods of network instability, the delay increases exponentially to throttle the triggers to the IGP. This adaptive delay is handled by an exponential backoff timer, the LSP generation (LSP-gen) timer in this case. When the network settles, the delay steadily decreases to its minimum again. By default IOS XR sets the minimum (initial) LSP generation timer to 50 ms. 50 ms is a good tradeoff between quick reaction to failure and some delay to possibly consolidate multiple changes in a single LSP.
After the IGP has generated the new LSP, a number of events take place. First, the IGP floods this new LSP to its IGP neighbors. Second, it triggers an SPF computation that is scheduled to start after the SPF backoff timer expires. Third, it feeds this new LSP to SR-TE. SR-TE updates its SR-TE DB and acts upon this topology change notification as described further in this chapter.
Similar to the LSP generation mechanism, a delay is introduced between receiving the SPF computation trigger and starting the SPF computation. This delay varies dynamically with the stability of the network. It exponentially increases in periods of network instability to throttle the number of SPF computations and it decreases in stable periods. This adaptive delay is handled by the SPF backoff timer.
As the IGP feeds the new topology information to SR-TE, this information is also received by BGP if BGP-LS is enabled on this node. BGP then sends the updated topology information in BGP-LS to its BGP neighbors.
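The exponential backoff behavior of the LSP-gen and SPF timers described above can be pictured with a tiny generator. The wait values below are illustrative only (except the 50 ms initial LSP-gen wait quoted in the text); they are not IOS XR defaults.

# Sketch of an exponential backoff timer (LSP generation or SPF backoff).
def backoff_waits(initial_ms=50, secondary_ms=200, maximum_ms=5000, events=6):
    """Yield the wait applied before each successive trigger in an unstable period."""
    wait = initial_ms
    for i in range(events):
        yield wait
        # the first re-trigger uses the secondary wait, then the wait doubles
        # up to the configured maximum; it decays again once the network settles
        wait = secondary_ms if i == 0 else min(wait * 2, maximum_ms)

print(list(backoff_waits()))   # [50, 200, 400, 800, 1600, 3200]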

8.2 Intra-Domain IGP Flooding
When a link in a network fails, the nodes adjacent to that link quickly detect the failure, which triggers the generation of a new LSP as described in the previous section. Nodes that are remote to the failure learn about it through the IGP flooding of this new LSP.
It takes some time to flood the LSP from the node connected to the failure to the SR-TE headend node. This time is the sum of the bufferisation, serialization, propagation, and IGP processing time at each hop flooding the LSP (see [subsec]). The paper [subsec] indicates that bufferisation and serialization delays are negligible components of the flooding delay.
The propagation delay is a function of the cumulative fiber distance between the originating node and the headend node and can be roughly assessed as 5 ms per 1000 km of fiber. For a US backbone the rule of thumb indicates that the worst-case distance is ~6000 km (30 ms) and that a very conservative average is 3000 km (15 ms).
The IGP processing delay is minimized by a fast-flooding mechanism that ensures that a number of incoming LSPs is quickly flooded on the other interfaces, without introducing delay or pacing. A lab characterization described in [subsec] showed that in 90% of 1000 single-hop flooding delay measurements, the delay was smaller than 2 ms; 95% of the measurements were smaller than 28 ms. Note that multiple flooding paths exist in a network, hence it is unlikely that the worst-case per-hop delay will be seen on all flooding paths.
Now that we have a per-hop flooding delay, we need to know the number of flooding hops. The LSP is flooded in the local IGP area. The rule of thumb indicates that the number of flooding hops should be small to very small (< 5). In some topologies, inter-site connections or large rings might lead to 7 flooding hops.
If we add up the flooding delay contributors, we come to a delay of 50 ms (LSP-gen initial-wait) + 5 (number of hops) × 2 ms (IGP processing delay) + 15 ms (propagation delay) = 75 ms between failure detection and the headend receiving the LSP.
After receiving the new LSP of a remote node, the IGP handles it like a locally generated LSP. First, the IGP floods this LSP to its IGP neighbors. Second, it triggers an SPF computation that is scheduled

to start after the SPF backoff timer expires. Third, it feeds this new LSP to SR-TE. SR-TE updates its SR-TE DB and acts upon this topology change notification as described further in this chapter.
As the IGP feeds the new topology information to SR-TE, this information is also received by BGP if BGP-LS is enabled on this node. BGP then sends the updated topology information in BGP-LS to its BGP neighbors.
Figure 8‑1 shows a single-area IGP topology. An SR Policy is configured on Node1 with endpoint Node4 and a single dynamic candidate path with optimized delay, as shown in Example 8‑3. This low-delay path is illustrated in Figure 8‑1.

Example 8-3: Minimum delay SR Policy on Node1
segment-routing
 traffic-eng
  policy GREEN
   color 30 end-point ipv4 1.1.1.4
   candidate-paths
    preference 100
     dynamic
      metric
       type delay

Figure 8-1: Remote trigger through IGP flooding – before failure

The link between Node3 and Node4 fails. Both nodes detect the failure, bring down the adjacency and flood the new LSP. When the LSP reaches Node1, the IGP notifies SR-TE of the change. SR-TE on Node1 re-computes the path and installs the new SID list for the SR Policy, as illustrated in Figure 8‑2.

Figure 8-2: Remote trigger through IGP flooding – after failure

8.3 Inter-Domain BGP-LS Flooding
Topology change notifications are also propagated by BGP-LS. Figure 8‑3 shows a multi-domain network with two independent domains, Domain1 and Domain2, interconnected by eBGP peering links. An SR PCE is available in Domain1 to compute inter-domain paths. This SR PCE receives the topology information of Domain2 via BGP-LS. While it can receive the information of its local domain, Domain1, via IGP or BGP-LS, in this section we assume it gets it via BGP-LS. Each domain has a BGP Route Reflector (RR) and these RRs are inter-connected.

Figure 8-3: Remote trigger through BGP-LS update – before failure

Node5 in Domain2 feeds any changes of its IGP LS-DB into BGP-LS and sends this information to its local RR. This RR propagates the BGP-LS information to the RR in Domain1. SR PCE taps into this Domain1 RR to receive the topology information of both domains.

An SR Policy is configured on Node1 with endpoint Node10 and a single dynamic candidate path with optimized delay metric that is computed by an SR PCE. Node1 has a PCEP session with the SR PCE in Domain1 with address 1.1.1.100. The configuration of this SR Policy is displayed in Example 8‑4. Node1 automatically delegates control of the path to the SR PCE. This SR PCE is then responsible for updating the path when required.

Example 8-4: SR Policy with PCE computed dynamic path
segment-routing
 traffic-eng
  policy GREEN
   color 30 end-point ipv4 1.1.1.10
   candidate-paths
    preference 100
     dynamic
      pcep
      !
      metric
       type delay
  !
  pcc
   pce address ipv4 1.1.1.100

The link between Node6 and Node7 fails, as illustrated in Figure 8‑4. Both nodes detect the failure, bring down the IGP adjacency and flood the new LSPs in the Domain2 area. When the LSPs reach Node5, the IGP of this node feeds the information to BGP-LS. BGP on Node5 sends the BGP-LS update to its RR which propagates the update to the RR in Domain1. This RR then propagates the update to the SR PCE. This topology update triggers the recomputation of the delegated path on the SR PCE. After recomputing the path, the SR PCE instructs Node1 to update the path with the new SID list. Node1 then installs the new SID list for the SR Policy, as illustrated in Figure 8‑4. Note that within 50 ms after detecting the failure, TI-LFA protection on Node7 restores connectivity via the TI-LFA backup path. This allows more time for the sequence of events as described above to execute.

Figure 8-4: Remote trigger through BGP-LS update – after failure

8.4 Validation of an Explicit Path
When a headend receives a notification that the topology has changed, it re-validates the explicit paths of its local SR Policies. If the topology change causes the active candidate path of an SR Policy to become invalid, then the next-preferred, valid candidate path of the SR Policy is selected as the active path.
An explicit path is invalidated when the headend determines that at least one of the SIDs in its SID list is invalid. The validation procedure is described in detail in chapter 3, "Explicit Candidate Path". Briefly, an operator can specify a SID in an explicit path as a label value or as a segment descriptor. The headend validates all the SIDs expressed with segment descriptors by attempting to resolve them into the corresponding label values, as well as the first SID in the list, regardless of how it is expressed, by resolving it into an outgoing interface and next hop. Non-first SIDs expressed as MPLS labels cannot be validated by the headend and hence are considered valid. An explicit path also becomes invalid if it violates one of its associated constraints in the new topology.
The next sections illustrate how an active candidate path expressed with segment descriptors or MPLS label values can be invalidated following a topology change and how SR-TE falls back to the next valid candidate path.
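The validation and selection rules of this section can be summarized in a short sketch. The Python below is a simplification, not the SR-TE implementation: it resolves segment descriptors, validates the first SID, treats non-first label SIDs as valid, and selects the highest-preference valid candidate path. The resolvable sets stand in for the headend's SR-TE DB after the Node3-Node4 link failure.

# Simplified sketch of explicit candidate-path validation and selection.
from typing import List, Optional, Union

# SR-TE DB stand-in: what the headend can currently resolve (99.3.4.3 gone after failure).
resolvable_descriptors = {"1.1.1.3": 16003, "1.1.1.4": 16004}
resolvable_first_sids = {16003, 16004}

def validate(sid_list: List[Union[str, int]]) -> Optional[List[int]]:
    labels = []
    for i, sid in enumerate(sid_list):
        if isinstance(sid, str):                      # segment descriptor
            if sid not in resolvable_descriptors:
                return None                           # descriptor cannot be resolved
            labels.append(resolvable_descriptors[sid])
        else:                                         # raw MPLS label value
            if i == 0 and sid not in resolvable_first_sids:
                return None                           # first SID must resolve to a next hop
            labels.append(sid)                        # non-first labels assumed valid
    return labels

def select_active(candidate_paths):
    for pref, sid_list in sorted(candidate_paths, key=lambda c: -c[0]):
        labels = validate(sid_list)
        if labels is not None:
            return pref, labels
    return None

paths = [(200, ["1.1.1.3", "99.3.4.3"]), (100, ["1.1.1.3", "1.1.1.4"])]
print(select_active(paths))   # falls back to (100, [16003, 16004])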

8.4.1 Segments Expressed as Segment Descriptors
Headend Node1 in Figure 8‑5 has an SR Policy to endpoint Node4. The SR Policy has two explicit candidate paths, both illustrated in Figure 8‑5. The configuration of the SR Policy is shown in Example 8‑5.

Figure 8-5: Validation of explicit path

The segment lists are defined at the beginning of the configuration. The segments in the segment lists are expressed as segment descriptors: address ipv4 1.1.1.3 identifies the Prefix-SID of Node3 (1.1.1.3); address ipv4 99.3.4.3 identifies the Adj-SID of the point-to-point link of Node3 to Node4; address ipv4 1.1.1.4 identifies the Prefix-SID of Node4 (1.1.1.4).
The SR Policy's candidate path with preference 200 refers to the segment list SIDLIST2; this is the path 1→2→3→4 in Figure 8‑5. The candidate path with preference 100 refers to the segment list SIDLIST1; this is the path 1→2→3→6→5→4 in Figure 8‑5.

Example 8-5: SR Policy with explicit paths – segments expressed as segment descriptors
segment-routing
 traffic-eng
  segment-list SIDLIST2
   index 10 address ipv4 1.1.1.3    !! Prefix-SID Node3
   index 20 address ipv4 99.3.4.3   !! Adj-SID Node3->Node4
  !
  segment-list SIDLIST1
   index 10 address ipv4 1.1.1.3    !! Prefix-SID Node3
   index 20 address ipv4 1.1.1.4    !! Prefix-SID Node4
  !
  policy EXP
   color 10 end-point ipv4 1.1.1.4
   candidate-paths
    preference 200
     explicit segment-list SIDLIST2
    !
    preference 100
     explicit segment-list SIDLIST1

Both candidate paths are valid. The candidate path with the highest preference value (200) is selected as the active path, as shown in the SR Policy status in Example 8‑6.

Example 8-6: SR Policy with explicit paths
RP/0/0/CPU0:xrvr-1#show segment-routing traffic-eng policy

SR-TE policy database
---------------------
Color: 10, End-point: 1.1.1.4
  Name: srte_c_10_ep_1.1.1.4
  Status:
    Admin: up  Operational: up for 00:00:18 (since Jul 12 11:46:01.673)
  Candidate-paths:
    Preference: 200 (configuration) (active)
      Name: EXP
      Requested BSID: dynamic
      Explicit: segment-list SIDLIST2 (valid)
        Weight: 1
          16003 [Prefix-SID, 1.1.1.3]
          24034 [Adjacency-SID, 99.3.4.3 - 99.3.4.4]
    Preference: 100 (configuration)
      Name: EXP
      Requested BSID: dynamic
      Explicit: segment-list SIDLIST1 (invalid)
        Weight: 1
  Attributes:
    Binding SID: 40013
    Forward Class: 0
    Steering BGP disabled: no
    IPv6 caps enable: yes

As displayed in the output, the active path (with the highest preference 200) has a SID list <16003, 24034>, with 16003 the Prefix-SID of Node3 and 24034 the Adj-SID of Node3's link to Node4.
The less preferred candidate path has a SID list <16003, 16004>, with 16003 and 16004 the Prefix-SIDs of Node3 and Node4 respectively. Since this path is not active, its SID list is not shown in the output of Example 8‑6.
The link between Node3 and Node4 fails, as illustrated in Figure 8‑6. Node1 gets the topology change notification via IGP flooding. In the new topology the Adj-SID 24034 of the failed link is no longer valid. Because of this, Node1 invalidates the current active path since it contains the now invalid Adj-SID 24034. Node1 selects the next preferred candidate path and installs its SID list in the forwarding table. The new status of the SR Policy is displayed in Example 8‑7. The preference 100 candidate path is now the active path.

Example 8-7: SR Policy with explicit paths – highest preference path invalidated after failure
RP/0/0/CPU0:xrvr-1#show segment-routing traffic-eng policy

SR-TE policy database
---------------------
Color: 10, End-point: 1.1.1.4
  Name: srte_c_10_ep_1.1.1.4
  Status:
    Admin: up  Operational: up for 00:05:47 (since Apr 9 18:03:56.673)
  Candidate-paths:
    Preference: 200 (configuration)
      Name: EXP
      Requested BSID: dynamic
      Explicit: segment-list SIDLIST2 (invalid)
        Weight: 1
      Last error: Address 99.3.4.3 can not be resolved to a SID
    Preference: 100 (configuration) (active)
      Name: EXP
      Requested BSID: dynamic
      Explicit: segment-list SIDLIST1 (valid)
        Weight: 1
          16003 [Prefix-SID, 1.1.1.3]
          16004 [Prefix-SID, 1.1.1.4]
  Attributes:
    Binding SID: 40013
    Forward Class: 0
    Steering BGP disabled: no
    IPv6 caps enable: yes

All traffic steered into the SR Policy now follows the new explicit path.

Figure 8-6: Validation of explicit path – after failure

“Creation of explicit paths often raises concerns that static paths make the network more fragile and less robust to dynamic repairs. While the text explicitly explains the dynamic selection between a number of possible candidate SR paths based on live link-state updates, it should also be pointed out that those are only optimizations. When no explicit paths are available, fallback to the native IGP path is the default behavior. Moreover, some implementations may also allow configuring load balancing ahead of failure across N parallel explicit paths. ”
— Robert Raszuk

8.4.2 Segments Expressed as SID Values
The configuration of Node1 is now modified by expressing the segments in the segment lists using SID label values instead of segment descriptors. The resulting SR Policy configuration of Node1 is shown in Example 8‑8 and the candidate paths are illustrated in Figure 8‑5.
Remember that a SID expressed as a label value is validated by the headend only if it is in first position in the SID list. The headend must be able to resolve this first entry to find the outgoing interface(s) and next hop(s) of the SR Policy.

Example 8-8: SR Policy with explicit paths – SIDs expressed as label values
segment-routing
 traffic-eng
  segment-list SIDLIST12
   index 10 mpls label 16003   !! Prefix-SID Node3
   index 20 mpls label 24034   !! Adj-SID Node3-Node4
  !
  segment-list SIDLIST11
   index 10 mpls label 16003   !! Prefix-SID Node3
   index 20 mpls label 16004   !! Prefix-SID Node4
  !
  policy EXP
   color 10 end-point ipv4 1.1.1.4
   candidate-paths
    preference 200
     explicit segment-list SIDLIST12
    !
    preference 100
     explicit segment-list SIDLIST11

After the failure of the link between Node3 and Node4, as illustrated in Figure 8‑6, the selected path (the candidate path with preference 200) is not invalidated. The first SID in the SID list (16003) is still valid since the link failure does not impact this SID. The failure does impact the second SID in the list. However, this second SID (24034) is specified as a SID label value and therefore its validity cannot be assessed by the headend, as discussed above. The status of the SR Policy after the failure is displayed in Example 8‑9. The preference 200 candidate path is still the selected path. Traffic steered into this SR Policy keeps following the path 1→2→3→4. Since the link between Node3 and Node4 is down, traffic is now dropped at Node3.

Example 8-9: SR Policy with explicit paths – highest preference path is not invalidated after failure
RP/0/0/CPU0:xrvr-1#show segment-routing traffic-eng policy

SR-TE policy database
---------------------
Color: 10, End-point: 1.1.1.4
  Name: srte_c_10_ep_1.1.1.4
  Status:
    Admin: up  Operational: up for 00:59:29 (since Jul 12 11:46:01.081)
  Candidate-paths:
    Preference: 200 (configuration) (active)
      Name: EXP
      Requested BSID: dynamic
      Explicit: segment-list SIDLIST2 (valid)
        Weight: 1
          16003 [Prefix-SID, 1.1.1.3]
          24034
    Preference: 100 (configuration)
      Name: EXP
      Requested BSID: dynamic
      Explicit: segment-list SIDLIST1 (invalid)
        Weight: 1
  Attributes:
    Binding SID: 40013
    Forward Class: 0
    Steering BGP disabled: no
    IPv6 caps enable: yes

It is important to note that an explicit SID list is normally controlled by an SR PCE. This SR PCE would be made aware of the topology change and would dynamically update the SID list as described in section 8.6 below.

8.5 Recomputation of a Dynamic Path by a Headend
When a headend receives a notification that the topology has changed, it re-computes the non-delegated dynamic paths of its SR Policies.
Headend Node1 in Figure 8‑7 has an SR Policy to endpoint Node4 with a delay-optimized dynamic path. The link-delay metrics are indicated next to the links, the default IGP metric is 10. The configuration of the SR Policy is shown in Example 8‑10. The computed low-delay path is illustrated in the drawing.

Figure 8-7: Recomputation of a dynamic path by headend – before failure

Example 8-10: SR Policy with dynamic low-delay path
segment-routing
 traffic-eng
  policy DYN
   color 10 end-point ipv4 1.1.1.4
   candidate-paths
    preference 100
     dynamic
      metric
       type delay

The status of this SR Policy is presented in Example 8‑11. The computed SID list is <16003, 24034>, with 16003 the Prefix-SID of Node3 and 24034 the Adj-SID of Node3 for the link to Node4. The accumulated delay metric of this path is 30 (= 12 + 11 + 7).

Example 8-11: SR Policy with dynamic low-delay path
RP/0/0/CPU0:xrvr-1#show segment-routing traffic-eng policy

SR-TE policy database
---------------------
Color: 10, End-point: 1.1.1.4
  Name: srte_c_10_ep_1.1.1.4
  Status:
    Admin: up  Operational: up for 00:02:46 (since Jul 12 13:28:18.800)
  Candidate-paths:
    Preference: 100 (configuration) (active)
      Name: DYN
      Requested BSID: dynamic
      Dynamic (valid)
        Metric Type: delay, Path Accumulated Metric: 30
          16003 [Prefix-SID, 1.1.1.3]
          24034 [Adjacency-SID, 99.3.4.3 - 99.3.4.4]
  Attributes:
    Binding SID: 40033
    Forward Class: 0
    Steering BGP disabled: no
    IPv6 caps enable: yes

The link between Node3 and Node4 fails. SR-TE on Node1 is notified of the topology change and SR-TE re-computes the dynamic path of SR Policy DYN. The new low-delay path from Node1 to Node4 is illustrated in Figure 8‑8.

Figure 8-8: Recomputation of a dynamic path by headend – after failure

Example 8‑12 shows the new status of the SR Policy. The cumulative delay of this path is 64 (= 12 + 11 + 10 + 15 + 16). The path is encoded in the SID list <16003, 16004> with the Prefix-SIDs of Node3 and Node4 respectively.

Example 8-12: SR Policy with dynamic low-delay path
RP/0/0/CPU0:xrvr-1#show segment-routing traffic-eng policy

SR-TE policy database
---------------------
Color: 10, End-point: 1.1.1.4
  Name: srte_c_10_ep_1.1.1.4
  Status:
    Admin: up  Operational: up for 00:05:46 (since Jul 12 13:28:18.800)
  Candidate-paths:
    Preference: 100 (configuration) (active)
      Name: DYN
      Requested BSID: dynamic
      Dynamic (valid)
        Metric Type: delay, Path Accumulated Metric: 64
          16003 [Prefix-SID, 1.1.1.3]
          16004 [Prefix-SID, 1.1.1.4]
  Attributes:
    Binding SID: 40033
    Forward Class: 0
    Steering BGP disabled: no
    IPv6 caps enable: yes

All traffic steered into the SR Policy now follows the new low-delay path.
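The recomputation itself is an ordinary shortest-path calculation on the delay metric over the updated topology. The sketch below illustrates this with a toy graph; the individual link delays and the post-failure path are inferred from the accumulated metrics quoted in Examples 8‑11 and 8‑12 and are therefore only illustrative.

# Delay-optimized path recomputation after the Node3-Node4 link failure (sketch).
import heapq

def dijkstra(links, src, dst):
    adj = {}
    for (a, b), delay in links.items():
        adj.setdefault(a, []).append((b, delay))
        adj.setdefault(b, []).append((a, delay))
    heap, seen = [(0, src, [src])], set()
    while heap:
        d, n, path = heapq.heappop(heap)
        if n == dst:
            return d, path
        if n in seen:
            continue
        seen.add(n)
        for nbr, cost in adj.get(n, []):
            heapq.heappush(heap, (d + cost, nbr, path + [nbr]))
    return None

delays = {(1, 2): 12, (2, 3): 11, (3, 4): 7, (3, 6): 10, (6, 5): 15, (5, 4): 16}
print(dijkstra(delays, 1, 4))       # (30, [1, 2, 3, 4])

del delays[(3, 4)]                  # link Node3-Node4 fails
print(dijkstra(delays, 1, 4))       # (64, [1, 2, 3, 6, 5, 4])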

8.6 Recomputation of a Dynamic Path by an SR PCE
The headend node delegates control of all SR Policy paths computed by an SR PCE to that same SR PCE. This occurs when the headend requested the path computation services of an SR PCE or when the SR PCE itself initiated the path on the headend node. In either case, the SR PCE is responsible for maintaining its delegated paths. Following a topology change, the SR PCE thus re-computes these paths and updates, if necessary, the SID lists provided to their respective headend nodes.
The sequence of events following a topology change is equivalent to that of the previous section, but with the potential additional delay of the failure notification to the SR PCE and the signaling of the updated path from the SR PCE to the headend. If the SR PCE were connected via BGP-LS it would receive the topology notification via BGP-LS (as in section 8.3 above), possibly incurring additional delay.
To illustrate, we add an SR PCE in the network of the previous section, see Figure 8‑9, and let the headend Node1 delegate the low-delay path to the SR PCE. When the link between Node3 and Node4 fails, the IGP floods the change throughout the network where it also reaches the SR PCE. The SR PCE re-computes the path and signals the updated path to the headend Node1, which updates the path.

Figure 8-9: Recomputation of a dynamic path by SR PCE – after failure

Furthermore, PCE-initiated paths are signaled to the headend node as explicit candidate paths, which implies that the explicit path validation procedure described in section 8.4 is also applied by the headend. The resiliency of PCE-initiated paths is thus really the combination of PCE-based recomputation and headend-based validation, although the latter is fairly limited since these SID lists are often expressed with MPLS label values.

8.7 IGP Convergence Along a Constituent Prefix-SID
A Prefix-SID may be a constituent of a candidate path of an SR Policy (explicit or dynamic). A failure may impact the shortest path of that Prefix-SID. For example, in this illustration, we see that Prefix-SID 16004 of Node4 is the second SID of the SID list of the SR Policy in Figure 8‑10. If the link between Node3 and Node6 fails, this SID is impacted and hence the SR Policy is impacted.
In this section, we describe how the IGP convergence of a constituent Prefix-SID helps the resiliency of the related SR Policy. First we provide a brief reminder on the IGP convergence process and then we show how the IGP convergence is beneficial both for an explicit and a dynamic candidate path.

8.7.1 IGP Reminder
A Prefix-SID follows the shortest path to its associated prefix. When the network topology changes, the path of the Prefix-SID may change if the topology change affects the shortest path to its associated prefix. The IGP takes care of maintaining the Prefix-SID paths.
If the Prefix-SID is used in an SR Policy's SID list and the IGP updates the path of the Prefix-SID, then the path of the SR Policy follows this change. Let us illustrate this in the next sections.

8.7.2 Explicit Candidate Path
Node1 in Figure 8‑10 has an SR Policy to Node4 with an explicit SID list. The configuration of this SR Policy is displayed in Example 8‑13. The SIDs in the SID list are specified as segment descriptors. The first entry in the SID list is the Prefix-SID of Node3, the second entry the Prefix-SID of Node4. The path of this SR Policy is illustrated in the drawing.

Example 8-13: SR Policy with an explicit path
segment-routing
 traffic-eng
  segment-list SIDLIST1
   index 10 address ipv4 1.1.1.3   !! Prefix-SID Node3
   index 20 address ipv4 1.1.1.4   !! Prefix-SID Node4
  !
  policy EXP
   color 10 end-point ipv4 1.1.1.4
   candidate-paths
    preference 100
     explicit segment-list SIDLIST1

Figure 8-10: IGP convergence along a constituent Prefix SID – before failure

The link between Node3 and Node6 fails, as illustrated in Figure 8‑11. This link is along the shortest path of the Prefix-SID of Node4 from Node3 in the topology before the failure. IGP converges and updates the path of this Prefix-SID along the new shortest path from Node3 to Node4: the direct high-cost link from Node3 to Node4.

Figure 8-11: IGP convergence along a constituent Prefix SID – after failure

In parallel to the IGP convergence, the SR-TE process of the headend Node1 receives a remote intra-domain topology change notification (section 8.2) and triggers the re-validation of the explicit path (section 8.4). The headend finds that the SIDs in the SID list are still valid (thanks to the IGP convergence) and leaves the SR Policy as is.
Even though the topology has changed by a failure that impacted the SR Policy path, the explicit SID list of the SR Policy stayed valid while the path encoded in this SID list has changed thanks to IGP convergence.
This is a straightforward illustration of the scaling benefits of SR over OpenFlow. In an OpenFlow deployment, the controller cannot rely on the network to maintain IGP Prefix-SIDs. The OpenFlow controller must itself reprogram all the per-flow states at every hop. This does not scale.
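A tiny sketch of this division of labor, using hypothetical FIB contents and interface names: the headend's SID list is untouched, while Node3's forwarding entry for the constituent Prefix-SID 16004 is rewritten by the IGP.

# The headend keeps its SID list; the IGP updates the per-node forwarding entry
# of the constituent Prefix-SID. FIB contents and interfaces are hypothetical.
headend_sid_list = [16003, 16004]              # unchanged before and after the failure

node3_fib = {16004: ("Gi0/0/0/1", "via Node6")}     # shortest path before failure

def on_igp_convergence(fib):
    # after the Node3-Node6 link failure, the IGP reroutes 16004 over the
    # direct high-cost link to Node4
    fib[16004] = ("Gi0/0/0/0", "via Node4, direct link")

on_igp_convergence(node3_fib)
print(headend_sid_list)      # [16003, 16004]  -> no change on the headend
print(node3_fib[16004])      # new outgoing interface installed by the IGP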

8.7.3 Dynamic Candidate Paths
Node1 in Figure 8‑12 has an SR Policy with a delay-optimized dynamic path. The configuration of this SR Policy is displayed in Example 8‑14. The low-delay path as computed by Node1 is illustrated in the drawing. The SID list that encodes this path is <16003, 24034>, as shown in the SR Policy status output in Example 8‑15.

Figure 8-12: IGP convergence along a constituent Prefix SID – dynamic path before failure

Example 8-14: SR Policy configuration with dynamic path
segment-routing
 traffic-eng
  policy DYN
   color 10 end-point ipv4 1.1.1.4
   candidate-paths
    preference 100
     dynamic
      metric
       type delay

Example 8-15: SR Policy status
RP/0/0/CPU0:xrvr-1#show segment-routing traffic-eng policy

SR-TE policy database
---------------------
Color: 10, End-point: 1.1.1.4
  Name: srte_c_10_ep_1.1.1.4
  Status:
    Admin: up  Operational: up for 18:16:20 (since Jul 12 13:28:18.800)
  Candidate-paths:
    Preference: 100 (configuration) (active)
      Name: DYN
      Requested BSID: dynamic
      Dynamic (valid)
        Metric Type: delay, Path Accumulated Metric: 30
          16003 [Prefix-SID, 1.1.1.3]
          24034 [Adjacency-SID, 99.3.4.3 - 99.3.4.4]
  Attributes:
    Binding SID: 40033
    Forward Class: 0
    Steering BGP disabled: no
    IPv6 caps enable: yes

The link between Node2 and Node3 fails, as illustrated in Figure 8‑13. Triggered by the topology change, Node1 re-computes the low-delay path. This new low-delay path is illustrated in Figure 8‑13. It turns out that the SID list that encodes this new path is the same SID list as before the failure. The output in Example 8‑16 confirms this. While the SID list is unchanged, the path of the first SID in the list has been updated by the IGP. Notice in the output that the cumulative delay metric of the path has increased to 58.

Example 8-16: SR Policy status after failure – SID list unchanged
RP/0/0/CPU0:xrvr-1#show segment-routing traffic-eng policy

SR-TE policy database
---------------------
Color: 10, End-point: 1.1.1.4
  Name: srte_c_10_ep_1.1.1.4
  Status:
    Admin: up  Operational: up for 18:23:47 (since Jul 12 13:28:18.800)
  Candidate-paths:
    Preference: 100 (configuration) (active)
      Name: DYN
      Requested BSID: dynamic
      Dynamic (valid)
        Metric Type: delay, Path Accumulated Metric: 58
          16003 [Prefix-SID, 1.1.1.3]
          24034 [Adjacency-SID, 99.3.4.3 - 99.3.4.4]
  Attributes:
    Binding SID: 40033
    Forward Class: 0
    Steering BGP disabled: no
    IPv6 caps enable: yes

Figure 8-13: IGP convergence along a constituent Prefix SID – dynamic path after failure

This is another example where the SR Policy headend benefits from the concurrent IGP convergence. The SID list before and after the network failure is the same (although the actual network path is different). The SR Policy headend does not need to change anything in its dataplane. It simply leverages the network wide IGP convergence process.

8.8 Anycast-SIDs
An Anycast-SID is a Prefix-SID that is advertised by more than one node. The set of nodes advertising the same anycast prefix with an Anycast-SID forms a group known as an anycast set. Using an Anycast-SID in the SID list of an SR Policy path provides improved traffic load-sharing and node resiliency.
An Anycast-SID provides the IGP shortest path to the nearest node of the anycast set. If two or more nodes of the anycast set are located at the same distance (metric), then traffic flows are load-balanced among them. If a node in an anycast set fails, then the other nodes in the set seamlessly take over. After the IGP has converged, the Anycast-SID follows the shortest path to the node of the anycast set that is now the nearest in the new topology.
An Anycast-SID can be used as a SID in the SR Policy's SID list, thereby leveraging the load-balancing capabilities and node resiliency properties of the Anycast-SID.
Note that all this applies to Flex-Algo Prefix-SIDs as well. For example, if two nodes with a Flex-Algo(K) Anycast-SID are at the same distance with respect to the Flex-Algo(K) metric, then traffic flows are load-balanced among them.
In a typical use-case of the Anycast-SID, an Anycast-SID is assigned to a pair (or more) of border nodes between two domains. To traverse the border between the domains, the Anycast-SID of the border nodes can be included in the SR Policy SID list. Traffic is then load-balanced between the border nodes if possible, and following a failure of a border node, the other border node(s) seamlessly take(s) over. Finer-grained traffic steering over a particular border node is still possible, using this border node's own Prefix-SID instead of the Anycast-SID.
Figure 8‑14 shows a network topology with three domains, connected by pairs of border nodes. An Anycast-SID is assigned to each pair of border nodes: Anycast-SID 20012 for the border nodes between Domain1 and Domain2, Node1 and Node2; Anycast-SID 20034 for the border nodes between Domain2 and Domain3, Node3 and Node4. The domains are independent, no redistribution is done between the domains.

Figure 8-14: Anycast-SID provides load-balancing

An SR Policy with endpoint Node31 is instantiated on headend Node11, using an explicit path. The SID list is <20012, 20034, 16031>: Anycast-SID 20012 load-balances the traffic from Node11 to Node1 and Node2; Anycast-SID 20034 load-balances the traffic to Node3 and Node4; Prefix-SID 16031 of Node31 then steers the traffic to Node31.
By using the Anycast-SIDs of the border nodes in the SID list instead of their individual Prefix-SIDs, the available ECMP in the network is better used. See, for example, how Node11 load-balances the traffic flows between Node1 and Node2, and Node1 load-balances the flows between Node3 and Node4.
The use of Anycast-SIDs also provides node resiliency. Upon failure of Node3, traffic to Anycast-SID 20034 is then carried by the remaining Node4, as illustrated in Figure 8‑15. The SID list of the SR Policy on Node11 stays unchanged.
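The Anycast-SID resolution described above can be sketched in a few lines: forward towards the nearest reachable member(s) of the anycast set, load-balance when there is a tie, and let the remaining members absorb the traffic when one fails. The distances below are hypothetical.

# Anycast-SID resolution sketch; metrics are illustrative only.
def anycast_next_hops(distances, failed=frozenset()):
    alive = {n: d for n, d in distances.items() if n not in failed}
    best = min(alive.values())
    return sorted(n for n, d in alive.items() if d == best)

dist_to_20034 = {"Node3": 20, "Node4": 20}     # both border nodes equally distant
print(anycast_next_hops(dist_to_20034))                      # ['Node3', 'Node4'] -> ECMP
print(anycast_next_hops(dist_to_20034, failed={"Node3"}))    # ['Node4'] seamlessly takes over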

Figure 8-15: Anycast-SID provides node resiliency

8.9 TI-LFA Protection
Local protection techniques ensure minimal packet loss (< 50 msec) following a failure in the network. Protection is typically provided by pre-installing a backup path before a failure occurs. Upon a failure of the locally connected resource, traffic flows can then quickly be restored by sending them on the pre-installed backup path. While protection ensures the initial restoration of the traffic flows, the control plane (IGP, SR-TE, …) computes the paths in the new topology and installs the new paths in the forwarding tables. The traffic then follows the newly installed paths and new backup paths are derived and installed according to the new topology.
Part I of the SR book explains Topology Independent LFA (TI-LFA), which uses SR functionality to provide optimal and automatic sub-50msec Fast Reroute protection for any link, node, or local SRLG failure in the network.
TI-LFA is an IGP functionality. The IGP computes and installs TI-LFA backup paths for each IGP destination. The TI-LFA configuration specifies which resource TI-LFA should protect: the link, the neighbor node, or the local SRLG. Remote SRLGs are protected using Weighted SRLG TI-LFA Protection. The IGP then computes a per-prefix optimal TI-LFA backup path that steers the traffic around the protected resource (link, node, SRLG). The TI-LFA backup path is tailored along the post-convergence path, the path that the traffic will follow after the IGP converges following the failure of the protected resource. The IGP then installs the computed TI-LFA backup paths in the forwarding table.
When a failure is detected on the local interface, the backup paths are activated in the data plane, deviating the traffic away from the failure. The activation of the TI-LFA backup paths happens for all affected destinations at the same time, in a prefix-independent manner. The backup paths stay active until the IGP converges and updates the forwarding entries to their new paths.
TI-LFA provides protection for Prefix-SIDs, Adj-SIDs, LDP labeled traffic and unlabeled IP traffic. Since TI-LFA is a local protection mechanism, it should be enabled on all interfaces of all nodes to provide automatic local protection against any failure.
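The essence of the mechanism, a precomputed backup entry per destination that is activated in a prefix-independent way upon local failure, can be sketched as follows. The entries are illustrative (loosely modeled on Figure 8‑16 and Example 8‑18); this is not the forwarding implementation.

# TI-LFA sketch: primary and precomputed backup entry per destination;
# a local link failure flips all affected entries to backup at once.
from dataclasses import dataclass

@dataclass
class Entry:
    primary: tuple        # (out_interface, label_stack)
    backup: tuple         # precomputed TI-LFA repair (out_interface, label_stack)
    protected: bool = False

fib = {
    16004: Entry(primary=("to_Node5", [16004]), backup=("to_Node3", [24134])),
}

def on_link_down(fib, failed_interface):
    # prefix-independent switchover: a single flag flip per affected destination
    for entry in fib.values():
        if entry.primary[0] == failed_interface:
            entry.protected = True

def lookup(entry):
    return entry.backup if entry.protected else entry.primary

on_link_down(fib, "to_Node5")
print(lookup(fib[16004]))     # ('to_Node3', [24134]) -> traffic takes the repair path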

“I have been working on network resiliency for 15 years, contributing to IGP fast convergence, LFA, LFA applicability, LFA manageability and RLFA. TI-LFA and IGP microloop avoidance are key improvements in terms of network resiliency, both powered by Segment Routing. They are easy to use as they have general applicability: any topology and any traffic, both shortest path and TE. They also follow the usual IGP shortest path which has many good properties: the best from a routing standpoint, wellknown to people (e.g., compared to RLFA paths), natively compliant with service providers' existing routing policies (as those are already encoded as part of the IGP metrics), and typically provisioned with enough capacity (as the fast reroute path is typically shared with IGP post convergence). ” — Bruno Decraene

Example 8‑17 shows the configuration that is required to enable TI-LFA for ISIS and OSPF.

Example 8-17: TI-LFA configuration
router isis 1
 interface Gi0/0/0/0
  address-family ipv4 unicast
   fast-reroute per-prefix
   fast-reroute per-prefix ti-lfa
!
router ospf 1
 area 0
  interface GigabitEthernet0/0/0/0
   fast-reroute per-prefix
   fast-reroute per-prefix ti-lfa enable

Refer to Part I of the SR book for more details of the TI-LFA functionality. TI-LFA provides local protection for Prefix-SIDs and Adj-SIDs. This protection also applies to SIDs used in an SR Policy’s SID list, the constituent SIDs of the SID list. The following sections illustrate the TI-LFA protection of such constituent Prefix-SIDs and Adj-SIDs.

8.9.1 Constituent Prefix-SID
Node1 in Figure 8‑16 has an SR Policy to endpoint Node4 with a SID list <16006, 16004>. The SIDs in the list are the Prefix-SIDs of Node6 and Node4 respectively. This SID list could have been explicitly specified or dynamically derived. The path of this SR Policy is illustrated in the drawing.

Figure 8-16: TI-LFA of a constituent Prefix-SID

In this example we focus on the protection by Node6. Packets that Node1 steers into the SR Policy of Figure 8‑16 arrive on Node6 with the Prefix-SID label 16004 as top label. In steady state Node6 forwards these packets towards Node5, via the IGP shortest path to Node4.
The operator wants to protect the traffic on Node6 against link failures and enables TI-LFA on Node6. The IGP on Node6 computes the TI-LFA backup paths for all destinations, including Node4. Figure 8‑16 illustrates the TI-LFA backup path for Prefix-SID 16004.
Upon failure of the link to Node5, Node6 promptly steers the traffic destined for 16004 on this Prefix-SID's TI-LFA backup path (via Node3) to quickly restore connectivity. While this local protection is active, the mechanisms described earlier in this chapter are triggered in order to update the SR Policy path if required. IGP convergence also happens concurrently.
The backup path is 6→3→4. Let us verify the TI-LFA backup path for 16004 on Node6 using the output of Example 8‑18, as collected on Node6.

Node6 imposes Adj-SID 24134 on the protected packets and sends them to Node3. The link from Node3 to Node4 has a high metric, therefore the Adj-SID 24134 of Node3 for the link to Node4 is imposed on the protected packets to steer them over that link.

Example 8-18: Prefix-SID 16004 TI-LFA backup path on Node6
RP/0/0/CPU0:xrvr-6#show isis fast-reroute 1.1.1.4/32

L2 1.1.1.4/32 [30/115]
     via 99.5.6.5, GigabitEthernet0/0/0/0, xrvr-5, SRGB Base: 16000, Weight: 0
       Backup path: TI-LFA (link), via 99.3.6.3, GigabitEthernet0/0/0/1 xrvr-3, SRGB Base: 16000, Weight: 0
         P node: xrvr-3.00 [1.1.1.3], Label: ImpNull
         Q node: xrvr-4.00 [1.1.1.4], Label: 24134
         Prefix label: ImpNull
       Backup-src: xrvr-4.00

RP/0/0/CPU0:xrvr-6#show mpls forwarding labels 16004 detail
Local  Outgoing    Prefix             Outgoing     Next Hop        Bytes
Label  Label       or ID              Interface                    Switched
------ ----------- ------------------ ------------ --------------- ---------
16004  16004       SR Pfx (idx 4)     Gi0/0/0/0    99.5.6.5        59117
     Updated: Jul 13 10:38:49.812
     Path Flags: 0x400 [ BKUP-IDX:1 (0xa1a6e3c0) ]
     Version: 245, Priority: 1
     Label Stack (Top -> Bottom): { 16004 }
     NHID: 0x0, Encap-ID: N/A, Path idx: 0, Backup path idx: 1, Weight: 0
     MAC/Encaps: 14/18, MTU: 1500
     Outgoing Interface: GigabitEthernet0/0/0/0 (ifhandle 0x00000020)
     Packets Switched: 1126

       Pop         SR Pfx (idx 4)     Gi0/0/0/1    99.3.6.3        0          (!)
     Updated: Jul 13 10:38:49.812
     Path Flags: 0xb00 [ IDX:1 BKUP, NoFwd ]
     Version: 245, Priority: 1
     Label Stack (Top -> Bottom): { Imp-Null 24134 }
     NHID: 0x0, Encap-ID: N/A, Path idx: 1, Backup path idx: 0, Weight: 0
     MAC/Encaps: 14/18, MTU: 1500
     Outgoing Interface: GigabitEthernet0/0/0/1 (ifhandle 0x00000040)
     Packets Switched: 0
     (!): FRR pure backup

Traffic-Matrix Packets/Bytes Switched: 0/0

If TI-LFA is not enabled, then there is no protection and no backup path is computed and installed. Following a failure, traffic loss occurs until the IGP and SR-TE have computed and installed the paths for the new topology.

8.9.2 Constituent Adj-SID
As described in SR book Part I, two types of Adj-SIDs exist: those that are protected by TI-LFA and those that are not. These are respectively called “protected” and “unprotected” Adj-SIDs.

By default in IOS XR, the IGP allocates an Adj-SID of each type for each of its adjacencies. Additional protected and unprotected Adj-SIDs can be configured if needed.
When TI-LFA is enabled on a node, the IGP computes and installs a TI-LFA backup path for each of the protected Adj-SIDs. The unprotected Adj-SIDs are left without a TI-LFA backup path.
Example 8‑19 shows the two Adj-SIDs that Node3 has allocated for its link to Node4. 24034 is the protected one (Adjacency SID: 24034 (protected)) and 24134 the unprotected one (Non-FRR Adjacency SID: 24134).

Example 8-19: Adj-SIDs on Node3 for the adjacency to Node4
RP/0/0/CPU0:xrvr-3#show isis adjacency systemid xrvr-4 detail

IS-IS 1 Level-2 adjacencies:
System Id      Interface        SNPA           State Hold Changed  NSF  IPv4 IPv6
                                                                        BFD  BFD
xrvr-4         Gi0/0/0/0        *PtoP*         Up    27   1w2d     Yes  None None
  Area Address:          49.0001
  Neighbor IPv4 Address: 99.3.4.4*
  Adjacency SID:         24034 (protected)
   Backup label stack:   [16004]
   Backup stack size:    1
   Backup interface:     Gi0/0/0/1
   Backup nexthop:       99.3.6.6
   Backup node address:  1.1.1.4
  Non-FRR Adjacency SID: 24134
  Topology:              IPv4 Unicast

Total adjacency count: 1

Node1 in Figure 8‑17 has an SR Policy to endpoint Node4 with a SID list <16003, 24034>. 16003 is the Prefix-SID of Node3, 24034 is the protected Adj-SID of Node3 for the link to Node4. It is unimportant to the discussion here whether this SID list is explicitly specified or dynamically computed.

Figure 8-17: TI-LFA of a constituent Adj-SID

Since TI-LFA is enabled on Node3, the IGP on Node3 derives and installs a TI-LFA backup path for the protected Adj-SID 24034. This TI-LFA backup path, which steers the packets to the remote end of the link (Node4) while avoiding the link itself, is illustrated in the drawing.
The output in Example 8‑20, as collected on Node3, shows the TI-LFA backup path for the Adj-SID 24034. The backup path is marked in the output with (!) on the far right. It imposes Node4's Prefix-SID 16004 on the protected packets and steers the traffic via Node6 in case the link to Node4 fails.

Example 8-20: Adj-SID TI-LFA backup on Node3
RP/0/0/CPU0:xrvr-3#show mpls forwarding labels 24034 detail
Local  Outgoing    Prefix             Outgoing     Next Hop        Bytes
Label  Label       or ID              Interface                    Switched
------ ----------- ------------------ ------------ --------------- ---------
24034  Pop         SR Adj (idx 0)     Gi0/0/0/0    99.3.4.4        0
     Updated: Jul 13 07:44:17.297
     Path Flags: 0x400 [ BKUP-IDX:1 (0xa19c0210) ]
     Version: 121, Priority: 1
     Label Stack (Top -> Bottom): { Imp-Null }
     NHID: 0x0, Encap-ID: N/A, Path idx: 0, Backup path idx: 1, Weight: 100
     MAC/Encaps: 14/14, MTU: 1500
     Outgoing Interface: GigabitEthernet0/0/0/0 (ifhandle 0x00000040)
     Packets Switched: 0

       16004       SR Adj (idx 0)     Gi0/0/0/1    99.3.6.6        0          (!)
     Updated: Jul 13 07:44:17.297
     Path Flags: 0x100 [ BKUP, NoFwd ]
     Version: 121, Priority: 1
     Label Stack (Top -> Bottom): { 16004 }
     NHID: 0x0, Encap-ID: N/A, Path idx: 1, Backup path idx: 0, Weight: 40
     MAC/Encaps: 14/18, MTU: 1500
     Outgoing Interface: GigabitEthernet0/0/0/1 (ifhandle 0x00000060)
     Packets Switched: 0
     (!): FRR pure backup

After a link goes down, the IGP preserves the forwarding entries of its protected Adj-SIDs for some time, so that traffic that still arrives with this Adj-SID is not dropped but forwarded via the backup path. This gives time for a headend or SR PCE to update any SR Policy path that uses this Adj-SID to a new path avoiding the failed link.
Unprotected Adj-SIDs are not protected by TI-LFA, even if TI-LFA is enabled on the node. The IGP immediately removes the forwarding entry of an unprotected Adj-SID from the forwarding table when the associated link goes down. If an unprotected Adj-SID is included in the SID list of an SR Policy and the link of that Adj-SID fails, then packets are dropped until the SID list is updated.
Node1 in Figure 8‑18 has an SR Policy to endpoint Node4 with a SID list <16003, 24134>. 16003 is the Prefix-SID of Node3, 24134 is the unprotected Adj-SID of Node3 for the link to Node4.

Figure 8-18: Including an unprotected Adj-SID in the SID list

Even though TI-LFA is enabled on Node3, IGP on Node3 does not install a TI-LFA backup path for the unprotected Adj-SID 24134, as shown in Example 8‑19. As a consequence, traffic to this Adj-SID is dropped when the link fails.
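The contrasting behavior of the two Adj-SID flavors on link failure can be sketched as follows: the protected Adj-SID keeps forwarding over its precomputed repair, the unprotected one is simply withdrawn. The entries are illustrative, loosely based on Example 8‑20.

# Protected vs unprotected Adj-SID behavior on link failure (sketch).
adj_sid_fib = {
    24034: {"out": "link_to_Node4", "backup": ("via_Node6", [16004]), "protected": True},
    24134: {"out": "link_to_Node4", "backup": None, "protected": False},
}

def on_adj_down(fib, link):
    for label in list(fib):
        entry = fib[label]
        if entry["out"] != link:
            continue
        if entry["protected"]:
            entry["out"] = entry["backup"]     # forward over the TI-LFA repair path
        else:
            del fib[label]                     # withdrawn: traffic to this SID is dropped

on_adj_down(adj_sid_fib, "link_to_Node4")
print(adj_sid_fib.get(24034))   # rerouted via Node6, imposing Prefix-SID 16004
print(adj_sid_fib.get(24134))   # None -> dropped until the SID list is updated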

8.9.3 TI-LFA Applied to Flex-Algo SID
As we have seen in chapter 7, "Flexible Algorithm", TI-LFA also applies to Flex-Algo Prefix-SIDs and the post-convergence backup path is computed with the same optimization objective and constraints as the primary path. An SR Policy SID list built with Flex-Algo Prefix-SIDs benefits from the TI-LFA protection of its constituent SIDs.

8.10 Unprotected SR Policy
Some use-cases demand that no local protection is applied for specific traffic streams. This can, for example, be the case for two live-live streams (1+1 redundancy) carried in two disjoint SR Policies. Rather than locally restoring the SR Policy traffic after a failure and possibly creating congestion, the operator may want that traffic to be dropped. Yet this decision must not prevent other traffic traversing the same nodes and links from being protected by TI-LFA.
Figure 8‑19 illustrates such a use-case. A Market Data Feed source on the left sends two Market Data Feed multicast traffic streams, carried in Pseudowires, over two disjoint paths in the WAN (live-live or 1+1 redundancy). The resiliency for this Market Data Feed is provided by concurrently streaming the same data over both disjoint paths. The operator requires that if a failure occurs in the network the Market Data Feed traffic is not locally protected. Given the capacity of the links and the bandwidth utilization of the streams, local protection would lead to congestion and hence packet loss.

Figure 8-19: Unprotected Adj-SIDs use-case

The use-case in Figure 8‑19 is solved by configuring SR Policies from Node1 to Node2 and from Node3 to Node4. These two SR Policies have explicit paths providing disjointness. The explicit SID lists of these SR Policies consist of unprotected Adj-SIDs only. The Pseudowire traffic is steered into these SR Policies. The SR Policy and L2VPN configuration of Node1 is shown in Example 8‑21. The SR Policy to Node2 is POLICY1, using explicit segment list SIDLIST1. This segment list contains one entry: the unprotected Adj-SID 24012 of the adjacency to Node2. The L2VPN pseudowire uses statically configured PW labels, although LDP signaling can be used as well. The pseudowire traffic is steered into POLICY1 (10, 1.1.1.2) by specifying this SR Policy as preferred-path.

Example 8-21: Configuration SR Policy using unprotected Adj-SIDs and invalidation drop
segment-routing
 traffic-eng
  segment-list name SIDLIST1
   index 10 mpls label 24012   !! Adj-SID Node1->Node2
  !
  policy POLICY1
   color 10 end-point ipv4 1.1.1.2
   candidate-paths
    preference 100
     explicit segment-list SIDLIST1
!
l2vpn
 pw-class PWCLASS1
  encapsulation mpls
   preferred-path sr-te policy srte_c_10_ep_1.1.1.2 fallback disable
 !
 xconnect group XG1
  p2p PW1
   interface Gi0/0/0/0
   neighbor ipv4 1.1.1.2 pw-id 1234
    mpls static label local 3333 remote 2222
    pw-class PWCLASS1

Upon a link failure, traffic is not protected by TI-LFA since only unprotected Adj-SIDs are used in the SID list. Nor does IGP convergence deviate the paths of the SR Policies, since they are pinned to the links using Adj-SIDs. A failure of a link will make the SR Policy invalid. Since fallback is disabled in the preferred-path configuration, the PW will go down as well.
Another mechanism to prevent fallback to the default forwarding path, one that applies to all service traffic and not only L2VPN using preferred-path, is the “drop upon invalidation” behavior. This mechanism is described in chapter 16, "SR-TE Operations".

8.11 Other Mechanisms

The resiliency of an SR Policy may also benefit from the following mechanisms:

SR IGP microloop avoidance
SR Policy liveness detection
TI-LFA Protection for an Intermediate SID of an SR Policy

8.11.1 SR IGP Microloop Avoidance

Due to the distributed nature of routing protocols, the nodes in a network converge asynchronously following a topology change, i.e., the nodes do not start and end their convergence at the same time. During this period of convergence, transient forwarding inconsistencies may occur, leading to transient routing loops, the so-called microloops. Microloops cause delay, out-of-order delivery, and packet loss. Depending on topology, hardware, and coincidence, the occurrence and duration of microloops may vary. They are less likely to occur and shorter in networks with devices that offer fast and deterministic convergence characteristics.

The SR IGP microloop avoidance solution eliminates transient loops during IGP convergence. This is a local functionality that temporarily steers traffic on a loop-free path using a list of segments until the network has settled and microloops may no longer occur. This mechanism is beneficial for the resiliency of SR Policies as it eliminates delays and losses caused by those transient loops along its constituent IGP SIDs.

An SR Policy is configured on Node1 in Figure 8‑20. The SR Policy's SID list is <16003, 16004>, with 16003 and 16004 the Prefix-SIDs of Node3 and Node4. The link between Node3 and Node4 has a higher IGP metric 50, but since it provides the only connection to Node4, it is the shortest IGP path from Node3 to Node4.

Figure 8-20: IGP microloop avoidance

The operator brings the link between Node4 and Node5 up, as in Figure 8‑21. In this new topology the shortest IGP path from Node3 to Node4 is no longer via the direct link, but via the path 3→6→5→4. Assume in Figure 8‑21 that Node3 is the first to update the forwarding entry for Node4 for the new topology. Node3 starts forwarding traffic destined for Node4 towards Node6, but the forwarding entry for Node4 on Node6 still directs the traffic towards Node3: a microloop between Node3 and Node6 exists. If now Node6 updates its forwarding entry before Node5 then a microloop between Node6 and Node5 occurs until Node5 updates its forwarding entry.

Figure 8-21: IGP microloop avoidance – after bringing up link

In order to prevent these microloops from happening, Node3 uses SR IGP microloop avoidance and temporarily steers traffic for 16004 on the loop-free path by imposing the SID list <16005, 24054> on the incoming packets with top label 16004. 16005 is the Prefix-SID of Node5, 24054 is the Adj-SID of Node5 for the link to Node4. After the network has settled, Node3 installs the regular forwarding entry for 16004.

By eliminating the possible impact of microloops, SR IGP microloop avoidance provides a resiliency gain that benefits SR Policies. Part I of the SR book provides more information about IGP microloop avoidance.
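SR IGP microloop avoidance is enabled in the IGP configuration of the nodes. The following is a minimal sketch of how this could look on Node3, assuming an ISIS instance named "1"; the instance name and the optional RIB-update delay value are assumptions, not taken from the figures above.

router isis 1
 address-family ipv4 unicast
  !! temporarily impose a loop-free SID list during convergence
  !! instead of the regular (possibly looping) forwarding entry
  microloop avoidance segment-routing
  !! optional: how long the loop-free SID list is kept (milliseconds)
  microloop avoidance rib-update-delay 5000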

8.11.2 SR Policy Liveness Detection

The IOS XR Performance Measurement (PM)1 infrastructure is leveraged to provide SR Policy liveness detection, independently or in combination with end-to-end delay measurement. By combining liveness detection with end-to-end delay measurement, the number of probes is reduced. PM not only verifies liveness of the SR Policy endpoint, it has the capability to verify liveness of all ECMP paths of an SR Policy by exercising each individual atomic path.

For liveness detection, the headend imposes the forward SID list and optionally return SID list on the PM probes. The remote endpoint simply switches the probes as regular data packets. This makes it a scalable solution. By default the PM probes are encoded using the Two-Way Active Measurement Protocol (TWAMP) encoding [RFC5357] which is a well-known standard deployed by many software and hardware vendors. See draft-gandhi-spring-twamp-srpm. Other probe formats may be available – RFC6374 using MPLS GAL/G-Ach or IP/UDP (draft-gandhi-spring-rfc6374-srpm-udp). When PM detects a failure, it notifies SR-TE such that SR-TE can invalidate the SID list or the active path or trigger path protection switchover to a standby path. Leveraging PM for liveness verification removes the need to run (S)BFD over the SR Policy.
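As a sketch only – the exact keywords depend on the IOS XR release, and chapter 15, "Performance Monitoring – Link Delay" covers the PM configuration in detail – per-policy delay/liveness probing could be enabled along the following lines, assuming an existing policy named POLICY1:

segment-routing
 traffic-eng
  policy POLICY1
   !! enable PM probing on this SR Policy (illustrative, release-dependent syntax)
   performance-measurement
    delay-measurement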

8.11.3 TI-LFA Protection for an Intermediate SID of an SR Policy

An SR Policy whose SID list contains the Prefix-SID 16003 of Node3 followed by the Prefix-SID 16004 of Node4 is installed on headend Node1 in Figure 8‑22. Consider the failure of Node3, the target node of Prefix-SID 16003. TI-LFA Node Protection does not apply for the failure of the destination node, or in this case the target of a Prefix-SID, since no alternate path would allow the traffic to reach a failed node. Although there is indeed no way to "go around" a failure of the final destination of a traffic flow, the problem is slightly different when the failure affects an intermediate SID in an SR Policy. Intuitively one can see that the solution is to skip the SID that failed and continue with the remainder of the SID list. In the example, Node2 would pop 16003 and forward to the next SID 16004.

Figure 8-22: TI-LFA Protection for an intermediate SID of an SR Policy

draft-ietf-rtgwg-segment-routing-ti-lfa specifies TI-LFA Extended Node Protection. At the time of writing it is not implemented in IOS XR.

8.12 Concurrency

The network resiliency technology is often hard to grasp because it involves many interactions.

First, many nodes may be involved, each providing a piece of the overall solution: one node may detect a problem, another may participate in the transmission of that information, another node may compute the solution and yet another may update the actual forwarding entry.

Second, several mechanisms may operate concurrently. The fastest is TI-LFA, which, within 50 msec, applies a pre-computed backup path for a constituent SID of the SR Policy. Then it is not easy to know whether the IGP convergence or the SR Policy convergence will happen first. We advise to consider them happening concurrently within an order of a second following the failure.

The IGP convergence involves the distributed system of the IGP domain. The nodes re-compute the shortest path to impacted IGP destinations and the related IGP Prefix-SIDs are updated. If an SR Policy is using one of these IGP Prefix-SIDs, the IGP convergence benefits the resiliency of the SR Policy.

The SR Policy convergence involves the headend of the policy (and potentially its SR PCE). It consists of the re-validation of the explicit candidate paths and the re-computation of the dynamic candidate paths. The result of the SR Policy convergence process may be the selection of a new candidate path or the computation of a different SID list for the current dynamic path.

Worth studying

“The SR-TE resiliency solution may first seem a difficult concept to apprehend. I would encourage you to study it incrementally, a few concepts at a time and more as experience is collected. It is foundational to Segment Routing because, at the heart of Segment Routing, we find the concept of a Prefix-SID which is a shortest path to a node maintained by a large set of nodes acting as a distributed system. Hence at the heart of Segment Routing, we have IP routing and hence IP routing resiliency. Obviously at times, a few may regret the easier to comprehend ATM/FR/RSVP-TE circuits without any ECMP and without any distributed behavior. But quickly, one will remember that this apparent simpler comprehension is only an illusion hiding much more serious problems of scale and complexity.”

— Clarence Filsfils

8.13 Summary

The resiliency of an SR Policy may benefit from several detection mechanisms and several convergence or protection solutions:

local trigger from a local link failure
remote intra-domain trigger through IGP flooding
remote inter-domain trigger through BGP-LS flooding
validation of explicit candidate paths
re-computation of dynamic paths
selection of the next best candidate path
IGP convergence of constituent IGP Prefix-SIDs
Anycast Prefix-SID leverage
TI-LFA Local Protection of constituent SIDs, including Flex-Algo SIDs
TI-LFA Node Protection for an intermediate segment part of an SR Policy
Invalidation of a candidate path based on end-to-end liveness detection

We described each of these mechanisms in isolation and highlighted how they can co-operate to the benefit of the resiliency of an SR Policy. We have explained how an SR Policy can be designed to avoid TI-LFA protection when it is not desirable.

A local link failure and a remote IGP-flooded failure trigger IGP convergence and provide the intra-domain trigger to the headend SR-TE process. IGP convergence maintains the constituent SIDs of an SR Policy.

A remote failure flooded by IGP or BGP-LS triggers the headend and SR PCE SR-TE processes to validate the explicit paths, recompute the dynamic paths, and reselect the active path. End-to-end liveness detection triggers the invalidation of a candidate path. A local link failure triggers TI-LFA, which provides sub-50ms restoration for the constituent SIDs of an SR Policy.

8.14 References

[SR-book-Part-I] "Segment Routing Part I", Clarence Filsfils, Kris Michielsen, Ketan Talaulikar, October 2016, ASIN: B01I58LSUO (Kindle), ISBN-10: 1542369126, ISBN-13: 978-1542369121

[subsec] "Achieving sub-second IGP convergence in large IP networks", Pierre Francois, Clarence Filsfils, John Evans, Olivier Bonaventure, ACM SIGCOMM Computer Communication Review, 35(3):33-44, July 2005

[RFC6774] "Distribution of Diverse BGP Paths", Robert Raszuk, Rex Fernando, Keyur Patel, Danny R. McPherson, Kenji Kumaki, RFC6774, November 2012

[RFC7880] "Seamless Bidirectional Forwarding Detection (S-BFD)", Carlos Pignataro, David Ward, Nobo Akiya, Manav Bhatia, Santosh Pallagatti, RFC7880, July 2016

[RFC7911] "Advertisement of Multiple Paths in BGP", Daniel Walton, Alvaro Retana, Enke Chen, John Scudder, RFC7911, July 2016

[RFC5357] "A Two-Way Active Measurement Protocol (TWAMP)", Jozef Babiarz, Roman M. Krzanowski, Kaynam Hedayat, Kiho Yum, Al Morton, RFC5357, October 2008

[RFC6374] "Packet Loss and Delay Measurement for MPLS Networks", Dan Frost, Stewart Bryant, RFC6374, September 2011

[draft-ietf-rtgwg-segment-routing-ti-lfa] "Topology Independent Fast Reroute using Segment Routing", Stephane Litkowski, Ahmed Bashandy, Clarence Filsfils, Bruno Decraene, Pierre Francois, Daniel Voyer, Francois Clad, Pablo Camarillo Garvia, draft-ietf-rtgwg-segment-routing-ti-lfa-01 (Work in Progress), March 2019

[draft-gandhi-spring-rfc6374-srpm-udp] "In-band Performance Measurement Using UDP Path for Segment Routing Networks", Rakesh Gandhi, Clarence Filsfils, Daniel Voyer, Stefano Salsano, Pier Luigi Ventre, Mach Chen, draft-gandhi-spring-rfc6374-srpm-udp-00 (Work in Progress), February 2019

[draft-gandhi-spring-twamp-srpm] "In-band Performance Measurement Using TWAMP for Segment Routing Networks", Rakesh Gandhi, Clarence Filsfils, Daniel Voyer, draft-gandhi-spring-twamp-srpm-00 (Work in Progress), February 2019

1. Also see chapter 15, "Performance Monitoring – Link Delay".↩

Section II – Further details

This section deepens the concepts of the Foundation section and extends them with more details about SR-TE topics.

9 Binding-SID and SRLB

What we will learn in this chapter:

In SR-MPLS, a BSID is a local label bound to an SR Policy. A BSID is associated with a single SR Policy.
The purpose of a Binding-SID (BSID) is to steer packets into its associated SR Policy. The steering can be local or remote.
Any packet received with the BSID as top label is steered into the SR Policy associated with this BSID: i.e., the BSID label is popped and the label stack of the associated SR Policy is pushed.
Automated Steering (AS) locally steers a BGP/service route onto the BSID of the SR Policy delivering the required SLA/color to the BGP next-hop.
The BSID of an SR Policy is the BSID of the active candidate path. We recommend that all the candidate paths of an SR Policy use the same BSID.
By default, a BSID is dynamically allocated and is inherited by all candidate paths without BSID. A BSID can be explicitly specified, in which case we recommend doing so within the SRLB.
BSIDs provide the benefits of simplification and scaling, network opacity, and service independence.
Equivalent SR Policies can be configured with the same explicit BSID on all the nodes of an anycast set, such that traffic can be remotely steered into the best suited SR Policy using the Anycast-SID followed by the explicit BSID.
A BSID can be associated with any interface or tunnel. E.g., a BSID assigned to an RSVP-TE tunnel enables SR-TE to steer traffic into that tunnel.

We start by defining the Binding-SID as the key to an SR Policy. The BSID is used to steer local and remote traffic into the SR Policy. The BSID is dynamically allocated by default. Explicitly specified BSIDs should be allocated from the SRLB to simplify explicit BSID allocation by a controller or application. We recommend allocating the same BSID to all candidate paths to provide a stable BSID. This simplifies operations. After this, the benefits of using BSIDs are highlighted: simplicity, scale, network opacity, and service independence. Finally, we show how a BSID allocated to an RSVP-TE tunnel can be used to traverse an RSVP-TE domain.

9.1 Definition

As introduced in chapter 2, "SR Policy", a Binding Segment is a local1 segment used to steer packets into an SR Policy. For SR-MPLS, the Binding Segment identifier, called Binding-SID or BSID, is an MPLS label.

The BSID is an attribute of the candidate path and each SR Policy inherits the BSID of its active candidate path. If the SR Policy is reported to a controller, then its associated BSID is included in the SR Policy's status report. In practice, when an SR Policy is initiated, the headend dynamically allocates a BSID label at the SR Policy level, that is used for all candidate paths of the SR Policy. This ensures that each SR Policy is associated with a unique BSID and, on a given headend node, a particular BSID is bound to a single SR Policy at a time. The BSID label is allocated from the dynamic label range, by default [24000-1048575] in IOS XR, which is the pool of labels available for automatic allocation. This default behavior can be overridden by specifying an explicit BSID for the SR Policy or its constituent candidate paths, but it is recommended that all candidate paths of an SR Policy always have the same BSID. Examples of scenarios where explicit BSIDs are required and recommendations on their usage are provided in the next section.

In the topology of Figure 9‑1, Node1 is the headend of an SR Policy "POLICY1" with color 30 and endpoint Node4. Its SID list is <16003, 24034>. SR-TE has dynamically allocated BSID label 40104 for this SR Policy. When this SR Policy becomes active, Node1 installs the BSID MPLS forwarding entry as indicated in the illustration:

incoming label: 40104
outgoing label operation: POP, PUSH (16003, 24034)
outgoing interface and nexthop: via Node2 (along Prefix-SID 16003)

Figure 9-1: Binding-SID

Example 9‑1 shows Node1’s SR Policy configuration.

Example 9-1: SR Policy on Node1 with dynamically allocated BSID

segment-routing
 traffic-eng
  segment-list name SIDLIST1
   index 10 mpls label 16003
   index 20 mpls label 24034
  !
  policy POLICY1
   !! binding-sid is dynamically allocated (default)
   color 30 end-point ipv4 1.1.1.4
   candidate-paths
    preference 200
     explicit segment-list SIDLIST1
    !
    preference 100
     dynamic
      metric
       type te

Example 9-2: Status of SR Policy with dynamically allocated BSID on Node1

RP/0/0/CPU0:xrvr-1#show segment-routing traffic-eng policy

SR-TE policy database
---------------------

Color: 30, End-point: 1.1.1.4
  Name: srte_c_30_ep_1.1.1.4
  Status:
    Admin: up  Operational: up for 00:00:17 (since Sep 28 11:17:04.764)
  Candidate-paths:
    Preference: 200 (configuration) (active)
      Name: POLICY1
      Requested BSID: dynamic
      Explicit: segment-list SIDLIST1 (valid)
        Weight: 1, Metric Type: TE
          16003 [Prefix-SID, 1.1.1.3]
          24034
    Preference: 100 (configuration)
      Name: POLICY1
      Requested BSID: dynamic
      Dynamic (invalid)
  Attributes:
    Binding SID: 40104
    Forward Class: 0
    Steering BGP disabled: no
    IPv6 caps enable: yes

Remote Steering

The MPLS forwarding entry of the BSID label 40104 on Node1 is shown in Example 9‑3. Traffic with top label 40104 is forwarded into the SR Policy (30, 1.1.1.4), named srte_c_30_ep_1.1.1.4, after popping the top label. The forwarding entry of SR Policy (30, 1.1.1.4) itself is displayed in Example 9‑4. It shows that labels (16003, 24034) are imposed on packets steered into this SR Policy.

Example 9-3: BSID forwarding entry

RP/0/0/CPU0:xrvr-1#show mpls forwarding labels 40104
Local  Outgoing    Prefix       Outgoing              Next Hop     Bytes
Label  Label       or ID        Interface                          Switched
------ ----------- ------------ --------------------- ------------ --------
40104  Pop         No ID        srte_c_30_ep_1.1.1.4  point2point  0

Example 9-4: SR Policy forwarding entry

RP/0/0/CPU0:xrvr-1#show segment-routing traffic-eng forwarding policy detail
Color Endpoint     Segment     Outgoing Outgoing     Next Hop     Bytes
                   List        Label    Interface                 Switched
----- ------------ ----------- -------- ------------ ------------ --------
30    1.1.1.4      SIDLIST1    16003    Gi0/0/0/0    99.1.2.2     0
        Label Stack (Top -> Bottom): { 16003, 24034 }
        Path-id: 1, Weight: 64
        Packets Switched: 0
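To make the "remote steering" aspect concrete: a remote node elsewhere in the network could send traffic through POLICY1 by terminating its own segment list with Node1's Prefix-SID followed by this BSID. A minimal sketch of such a segment list on a remote headend, assuming 16001 is the Prefix-SID of Node1 (an assumed value; 40104 is the dynamically allocated BSID from the example above, so in practice a persistent, explicit BSID would be preferred for this):

segment-routing
 traffic-eng
  segment-list name VIA-NODE1-POLICY1
   index 10 mpls label 16001   !! Prefix-SID Node1 (assumed value)
   index 20 mpls label 40104   !! BSID of POLICY1 on Node1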

Local Steering

The BSID not only steers BSID-labeled traffic into the SR Policy, it also serves as an internal key to its bound SR Policy. To locally steer service traffic into an SR Policy on a service headend, the service signaling protocol (e.g., BGP) installs the service route recursing on this SR Policy's BSID, instead of classically installing a service route recursing on its nexthop prefix. This makes the BSID a fundamental element of the SR architecture.

Node4 in Figure 9‑1 advertises BGP route 2.2.2.0/24 to Node1 with BGP nexthop Node4 and color 30. Example 9‑5 shows this BGP route received by Node1.

Example 9-5: BGP IPv4 unicast route 2.2.2.0/24 received by Node1

RP/0/0/CPU0:xrvr-1#show bgp ipv4 unicast 2.2.2.0/24
BGP routing table entry for 2.2.2.0/24
Versions:
  Process           bRIB/RIB  SendTblVer
  Speaker                  6           6
Last Modified: Sep 20 08:21:32.506 for 00:24:00
Paths: (1 available, best #1)
  Not advertised to any peer
  Path #1: Received by speaker 0
  Not advertised to any peer
  Local
    1.1.1.4 C:30 (bsid:40104) (metric 40) from 1.1.1.4 (1.1.1.4)
      Origin IGP, metric 0, localpref 100, valid, internal, best, group-best
      Received Path ID 0, Local Path ID 1, version 6
      Extended community: Color:30
      SR policy color 30, up, registered, bsid 40104, if-handle 0x000000d0

Node1 uses Automated Steering to steer this route into SR Policy (30, Node4) since this SR Policy matches the nexthop and color of the route. To forward the traffic destined for 2.2.2.0/24 into the SR Policy, BGP installs the forwarding entry of this route recursing on BSID label 40104.

Example 9‑6 shows the forwarding entry of service route 2.2.2.0/24. The output shows that Node1 forwards packets to 2.2.2.0/24 via local-label 40104 (recursion-via-label). This BSID 40104 is bound to the SR Policy (30, 1.1.1.4).

Example 9-6: Service route recurses on BSID

RP/0/0/CPU0:xrvr-1#show cef 2.2.2.0/24
2.2.2.0/24, version 45, internal 0x5000001 0x0 (ptr 0xa140fe6c) [1], 0x0 (0x0), 0x0 (0x0)
 Updated Sep 20 08:21:32.506
 Prefix Len 24, traffic index 0, precedence n/a, priority 4
   via local-label 40104, 3 dependencies, recursive [flags 0x6000]
    path-idx 0 NHID 0x0 [0xa17cbd10 0x0]
    recursion-via-label
    next hop via 40104/1/21
     next hop srte_c_30_ep_1.1.1.4
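For completeness, here is a minimal sketch of how egress PE Node4 could have attached color 30 to 2.2.2.0/24 when advertising it to Node1. The set and policy names are assumptions; chapter 5, "Automated Steering" describes coloring at the egress PE in detail.

extcommunity-set opaque COLOR30
  30
end-set
!
route-policy SET-C30
  if destination in (2.2.2.0/24) then
    set extcommunity color COLOR30
  endif
end-policy
!
router bgp 1
 neighbor 1.1.1.1
  address-family ipv4 unicast
   route-policy SET-C30 out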

9.2 Explicit Allocation

By default, the headend dynamically allocates the BSID for an SR Policy. This is appropriate in most cases as the operator, or the controller initiating the SR Policy, does not care which particular label is used as BSID. However, some particular use-cases require that a specific label value is used as BSID, either for stability reasons or to ensure that similar SR Policies on different headend nodes have the same BSID value. In such cases, the BSID of the SR Policy should be explicitly allocated.

In a simple example illustrated in Figure 9‑2, the operator wants to statically steer a traffic flow from a host H into an SR Policy POLICY1 on Node1. Therefore, the host imposes this SR Policy's BSID on the packets in the flow. Since the configuration on the host is static, this BSID label value must be persistent, even across reloads. If the BSID were not persistent and the value would change after a reload, then the host configuration would have to be updated to impose the new BSID value. This is not desirable, therefore the operator specifies an explicit BSID 15000 for the SR Policy.

Figure 9-2: Simple explicit BSID use-case

Example 9‑7 shows the explicit BSID label 15000 specified for SR Policy POLICY1 on Node1.

Example 9-7: An explicit BSID for a configured SR Policy

segment-routing
 traffic-eng
  segment-list name SIDLIST1
   index 10 mpls label 16003
   index 20 mpls label 24034
  !
  policy POLICY1
   binding-sid mpls 15000
   color 100 end-point ipv4 1.1.1.4
   candidate-paths
    preference 200
     explicit segment-list SIDLIST1
    !
    preference 100
     dynamic
      metric
       type te

For a configured SR Policy, the explicit BSID is defined under the SR Policy and inherited by all configured candidate paths and all signaled candidate paths that do not have an explicit BSID. Although controller-initiated candidate paths can be signaled with their own explicit BSIDs, we recommend that all candidate paths of a given SR Policy have the same BSID. This greatly simplifies network operations by ensuring a stable SR Policy BSID that is independent of the selected candidate path. All SR-TE use-cases and deployments known to the authors follow this recommendation.

When not following this recommendation, the BSID of the SR Policy may change over time as the active path of the SR Policy changes. If a newly selected candidate path has a different BSID than the former active path, the SR Policy BSID is modified to the new active path's BSID. As a consequence, all remote nodes leveraging the BSID of this SR Policy must be updated to the new value.

The SR Policy status in Example 9‑8 shows that the BSID 15000 is explicitly specified and that its forwarding entry has been programmed.

Example 9-8: Status of SR Policy with explicit BSID

RP/0/0/CPU0:xrvr-1#show segment-routing traffic-eng policy

SR-TE policy database
---------------------

Color: 100, End-point: 1.1.1.4
  Name: srte_c_100_ep_1.1.1.4
  Status:
    Admin: up  Operational: up for 1d15h (since Feb  5 18:12:38.507)
  Candidate-paths:
    Preference: 100 (configuration) (active)
      Name: POLICY1
      Requested BSID: 15000
      Explicit: segment-list SIDLIST1 (valid)
        Weight: 1
          16003 [Prefix-SID, 1.1.1.3]
          24034
  Attributes:
    Binding SID: 15000 (SRLB)
    Forward Class: 0
    Steering BGP disabled: no
    IPv6 caps enable: yes

The headend programmed the BSID label 15000 in the MPLS forwarding table, as shown in Example 9‑9. The packets with top label 15000 are steered into SR Policy (100, 1.1.1.4).

Example 9-9: Verify MPLS forwarding entry for Binding-SID of SR Policy

RP/0/0/CPU0:xrvr-1#show mpls forwarding labels 15000
Local  Outgoing  Prefix        Outgoing               Next Hop     Bytes
Label  Label     or ID         Interface                           Switched
------ --------- ------------- ---------------------- ------------ --------
15000  Pop       SRLB (idx 0)  srte_c_100_ep_1.1.1.4  point2point  0

Label 15000 is the first (index 0) label in the SR Local Block (SRLB) label range, as indicated in the output (SRLB (idx 0)). The SRLB is a label range reserved for allocating local SIDs for SR applications and it is recommended that all explicit BSIDs are allocated from this range, as further explained below.

The operator can enforce using explicit BSIDs for all SR Policies on a headend by disabling dynamic BSID label allocation as illustrated in Example 9‑10. With this configuration, a candidate path without a specified BSID is considered invalid.

Example 9-10: Disable dynamic label allocation for BSID

segment-routing
 traffic-eng
  binding-sid dynamic disable

If the requested explicit BSID label for a candidate path is not available on the headend, then no BSID is allocated by default to that candidate path. Without a BSID, the candidate path is considered invalid and cannot be selected as active path for the SR Policy. This behavior can be changed by instructing the headend to fall back to a dynamically allocated label if the explicit BSID label is not available, as illustrated in Example 9‑11.

Example 9-11: Enable fallback to dynamic label allocation for BSID

segment-routing
 traffic-eng
  binding-sid explicit fallback-dynamic

This fallback behavior applies to any SR Policy candidate path with an explicit BSID, regardless of how it is communicated to the headend: via CLI, PCEP, BGP SR-TE, etc.

Note that an explicit BSID cannot be configured under an ODN color template. Since multiple SR Policies with different endpoints can be instantiated from such a template, they would all be bound to the same explicit BSID. This is not allowed as it would violate the BSID uniqueness requirement.

SR Local Block (SRLB)

In the simplistic scenario illustrated in Figure 9‑2, the operator can easily determine that the label 15000 is available and configure it as an explicit BSID for the SR Policy. It is much more complex for a controller managing thousands of SR Policy paths to find an available label to be used as explicit BSID of an SR Policy. To simplify the discovery of available labels on a headend node, explicit BSIDs should be allocated from the headend’s SRLB. The SR Local Block (SRLB) is a range of local labels that is reserved by the node for explicitly specified local segments. The default SRLB in IOS XR is [15000-15999]. Same as for the SRGB, the SRLB is advertised in the IGP (ISIS and OSPF) and BGP-LS such that other devices in the network, controllers in particular, can learn the SRLB of each node in the network. Example 9‑12 and Example 9‑13 show the ISIS Router Capability TLV and OSPF Router Information LSA specifying the default SRLB.

Example 9-12: ISIS Router Capability TLV with SRLB advertised by Node1

RP/0/0/CPU0:xrvr-1#show isis database verbose xrvr-1

IS-IS 1 (Level-2) Link State Database
LSPID                 LSP Seq Num  LSP Checksum  LSP Holdtime/Rcvd  ATT/P/OL
xrvr-1.00-00        * 0x00000178   0x5d23        621 /*             0/0/0
  Router Cap: 1.1.1.1 D:0 S:0
    Segment Routing: I:1 V:0, SRGB Base: 16000 Range: 8000
    SR Local Block: Base: 15000 Range: 1000
    SR Algorithm:
      Algorithm: 0
      Algorithm: 1
    Node Maximum SID Depth:
      Label Imposition: 10

Example 9-13: OSPF Router Information LSA with SRLB

RP/0/0/CPU0:xrvr-1#show ospf database opaque-area 4.0.0.0 self-originate

            OSPF Router with ID (1.1.1.1) (Process ID 1)

                Type-10 Opaque Link Area Link States (Area 0)

  LS age: 47
  Options: (No TOS-capability, DC)
  LS Type: Opaque Area Link
  Link State ID: 4.0.0.0
  Opaque Type: 4
  Opaque ID: 0
  Advertising Router: 1.1.1.1
  LS Seq Number: 80000002
  Checksum: 0x47ce
  Length: 88

    Router Information TLV: Length: 4
      Capabilities:
        Graceful Restart Helper Capable
        Stub Router Capable
        All capability bits: 0x60000000

    Segment Routing Algorithm TLV: Length: 2
      Algorithm: 0
      Algorithm: 1

    Segment Routing Range TLV: Length: 12
      Range Size: 8000
        SID sub-TLV: Length 3
          Label: 16000

    Node MSD TLV: Length: 2
      Type: 1, Value 10

    Segment Routing Local Block TLV: Length: 12
      Range Size: 1000
        SID sub-TLV: Length 3
          Label: 15000

    Dynamic Hostname TLV: Length: 6
      Hostname: xrvr-1

A non-default SRLB can be configured as illustrated in Example 9‑14, where label range [100000-100999] is allocated as SRLB.

Example 9-14: Non-default SRLB configuration

segment-routing
 !! local-block
 local-block 100000 100999

For operational simplicity, it is recommended that the operator configures the same SRLB position (between 15000 and ~1M) and size (≤ 256k) on all nodes. A controller can learn via IGP, BGP-LS and PCEP the SRLB of each node in the network, as well as the list of local SIDs explicitly allocated from the node’s SRLB (BSIDs, Adj-SIDs, EPE SIDs). The controller can mark the allocated local SIDs in the SRLB as unavailable in order to deduce the set of available labels in the SRLB of any node. As an example, the application learns that Node1’s SRLB is [15000-15999], it has allocated labels 15000 and 15001 for Adj-SIDs, 15002 for EPE and 15003 for BSID. Therefore, the controller can pick any label from the SRLB sub-range [15004-15999]. Randomly picking one of the available SRLB labels makes the likelihood of a collision with another controller or application allocating the same label at the same time unlikely. To further reduce the chance of allocation collisions, the operator could allocate sub-ranges of the SRLB to different applications, e.g., [15000-15499] to application A and [15500-15999] to application B. Each application independently manages its SRLB sub-range. If despite these techniques a BSID collision would still take place, e.g., due to a race condition, the application would get notified via system alerts or learn about it via BGP-LS or PCEP. By default, IOS XR accepts allocating any free label (except the label values 4 ! segment-list name SIDLIST2 index 10 mpls label 16008 !! Prefix-SID Node8 index 20 mpls label 24085 !1 Adj-SID link 8->5 index 30 mpls label 16004 !! Prefix-SID Node4

Attaching multiple colors to a BGP route can be used as a mechanism to let the egress PE signal primary and secondary SR Policy steering selection using BGP. If SR Policy GREEN becomes invalid, due to a failure as illustrated in Figure 10‑2, BGP re-resolves1 route 4.4.4.0/24 on the valid SR Policy with the next lower numerical color value matching a color advertised with the prefix, SR Policy BLUE in this example. Since this functionality relies on BGP to re-resolve the route, it is not a protection mechanism.

Figure 10-2: service route with multiple colors for primary/backup

When SR Policy GREEN becomes valid again some time later, BGP re-resolves the prefix 4.4.4.0/24 onto SR Policy GREEN, since it is then again the valid SR Policy with the highest color that matches a color of the route.
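A minimal sketch of how the egress PE could attach both colors to 4.4.4.0/24 to signal this primary/secondary preference. The set and policy names are assumptions; colors 30 and 20 correspond to the GREEN and BLUE SR Policies used in this chapter, with the higher value 30 acting as the primary.

extcommunity-set opaque COLORS-GREEN-BLUE
  30,
  20
end-set
!
route-policy SET-TWO-COLORS
  if destination in (4.4.4.0/24) then
    set extcommunity color COLORS-GREEN-BLUE
  endif
end-policy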

10.2 Coloring Service Routes on Ingress PE

Chapter 5, "Automated Steering" describes the typical situation where the egress PE specifies the service level agreement (SLA; the "intent") of a service route by coloring the route with the color that identifies the desired service level. The egress PE advertises the service route with this information attached as color extended community to the ingress PE, possibly via one or more Route-Reflectors (RR). The ingress PE can then directly use these attached colors to steer the routes into the desired SR Policy without invoking a BGP route-policy to classify the routes.

However, the ingress PE can override the SLA of any received service route by manipulating the color attached to this route. Such manipulation is realized through an ingress route-policy that sets, adds, modifies, or deletes any color extended community of a given service route.

In the topology of Figure 10‑3, Node1 has two SR Policies with endpoint Node4:

GREEN with color 30 (green) with an explicit path via Node3
BLUE with color 20 (blue) with an explicit path via Node8 and Node5

The configuration of these SR Policies on Node1 is the same as Example 10‑1. In this example, egress Node4 advertises two prefixes with next-hop 1.1.1.4 in BGP:

4.4.4.0/24 with color 40 (purple)
5.5.5.0/24 without color

The operator wants to override the service level selection that was done by the egress node and configures ingress Node1 to:

set color of 4.4.4.0/24 to 20 (blue)
set color of 5.5.5.0/24 to 30 (green)

Figure 10-3: Setting color of BGP routes on ingress PE

The BGP route-policy configuration of ingress PE Node1 is shown in Example 10‑2.

Example 10-2: Ingress PE Node1 BGP configuration

extcommunity-set opaque COLOR-BLUE
  20
end-set
!
extcommunity-set opaque COLOR-GREEN
  30
end-set
!
route-policy SET-COLOR
  if destination in (4.4.4.0/24) then
    set extcommunity color COLOR-BLUE
  endif
  if destination in (5.5.5.0/24) then
    set extcommunity color COLOR-GREEN
  endif
end-policy
!
router bgp 1
 neighbor 1.1.1.4
  address-family ipv4 unicast
   route-policy SET-COLOR in

The configuration starts with the two extended community sets COLOR-BLUE and COLOR-GREEN. Color is a type of opaque extended community (see RFC 4360), therefore the keyword opaque. Each set contains one color value, set COLOR-BLUE contains value 20 and set COLOR-GREEN value 30.

Route-policy SET-COLOR replaces any currently attached color extended community of prefix 4.4.4.0/24 with the value 20 in the COLOR-BLUE extended community set. The color of prefix 5.5.5.0/24 is set to value 30 of the COLOR-GREEN set. Node1 applies an ingress route-policy SET-COLOR on its BGP session to PE Node4 (1.1.1.4). The ingress route-policy is exercised when the route is received and before Automated Steering is done. Hence, any color changes done in the route-policy are taken into account for AS. As a result, prefix 4.4.4.0/24 is steered into SR Policy BLUE, since its next-hop (1.1.1.4) and the updated color (20) match the endpoint and color of this SR Policy. Prefix 5.5.5.0/24 is steered into SR Policy GREEN, based on the updated color (30) of this prefix. Similarly, an ingress route-policy modifying the BGP next-hop will cause the matching traffic to be steered into an SR policy whose endpoint is the new next-hop.
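As a sketch of that last point, the ingress route-policy could rewrite the BGP next-hop so that Automated Steering matches an SR Policy whose endpoint is the new next-hop. The prefix and the next-hop value 1.1.1.5 below are assumptions for illustration only.

route-policy SET-NEXTHOP
  if destination in (4.4.4.0/24) then
    set next-hop 1.1.1.5
  endif
end-policy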

10.3 Automated Steering and BGP Multi-Path

Using Automated Steering, BGP matches each of the paths of a route with an SR Policy based on the path's nexthop and color. So, if BGP installs multiple paths for a route, AS is applied for each individual path.

In the network of Figure 10‑4, CE Node45 in vrf Acme is multi-homed to two egress PEs: Node4 and Node5. CE Node45 advertises the route 45.1.1.0/24 and both egress PEs color this route in green (30) when advertising it to the ingress PE Node1.

Figure 10-4: BGP multi-path with Automated Steering

Ingress PE Node1 receives both paths for vrf Acme route 45.1.1.0/24. When using BGP RRs in the network instead of a full mesh of BGP sessions between the PEs, the RRs need to propagate both routes. This is a generic BGP multi-path requirement. Here this is achieved by using a different Route-Distinguisher for the route on both egress PEs.

Ingress PE Node1 is configured for iBGP multi-path, as shown in Example 10‑3. To enable iBGP multi-path, use the command maximum-paths ibgp <maximum> under the address-family. <maximum> specifies the maximum number of paths that will be installed for any given route. Since the IGP costs to the egress PEs are not equal in this example, the parameter unequal-cost is added to the command to relax the default multi-path requirement that the IGP metrics to the BGP nexthops must be equal.

Example 10-3: Configuration of ingress PE Node1

router bgp 1
 bgp router-id 1.1.1.1
 address-family vpnv4 unicast
 !
 neighbor 1.1.1.4
  remote-as 1
  update-source Loopback0
  address-family vpnv4 unicast
 !
 neighbor 1.1.1.5
  remote-as 1
  update-source Loopback0
  address-family vpnv4 unicast
 !
 vrf Acme
  rd auto
  address-family ipv4 unicast
   maximum-paths ibgp 4 unequal-cost

As a result of this multi-path configuration, BGP on Node1 installs two paths to 45.1.1.0/24, one via Node4 and another via Node5, since these paths both satisfy the multi-path requirements. The SR Policy configuration of Node1 is shown in Example 10‑4. Node1’s SR Policies to Node4 and Node5 with color green (30) are named GREEN_TO4 and GREEN_TO5 respectively.

Example 10-4: SR Policy configuration Node1 – BGP multi-path

segment-routing
 traffic-eng
  policy GREEN_TO4
   !! (green, Node4)
   color 30 end-point ipv4 1.1.1.4
   candidate-paths
    preference 100
     explicit segment-list SIDLIST1
  !
  policy GREEN_TO5
   !! (green, Node5)
   color 30 end-point ipv4 1.1.1.5
   candidate-paths
    preference 100
     explicit segment-list SIDLIST2
  !
  segment-list name SIDLIST1
   index 10 mpls label 16003   !! Prefix-SID Node3
   index 20 mpls label 24034   !! Adj-SID link 3->4
  !
  segment-list name SIDLIST2
   index 10 mpls label 16008   !! Prefix-SID Node8
   index 20 mpls label 24085   !! Adj-SID link 8->5
   index 30 mpls label 16004   !! Prefix-SID Node4

The egress PEs both advertise route 45.1.1.0/24 with color green (30). Using the Automated Steering functionality, Node1 matches the path via Node4 to SR Policy (green, Node4) and the path via Node5 to SR Policy (green, Node5). Applying multi-path and AS, BGP installs both paths to 45.1.1.0/24 recursing on the Binding-SIDs (BSIDs) of the respective SR Policies matching the color and nexthop of each path.

Example 10‑5 shows the BGP route 45.1.1.0/24 with its two paths. The first path, which is the best path, goes via Node4 using the SR Policy with color 30 to 1.1.1.4 (Node4). The second path goes via Node5 using the SR Policy with color 30 to 1.1.1.5 (Node5).

Example 10-5: BGP multi-path route

RP/0/0/CPU0:xrvr-1#show bgp vrf Acme 45.1.1.0/24
BGP routing table entry for 45.1.1.0/24, Route Distinguisher: 1.1.1.1:2
Versions:
  Process           bRIB/RIB  SendTblVer
  Speaker                 82          82
Last Modified: Jul 19 18:50:03.624 for 00:19:55
Paths: (2 available, best #1)
  Not advertised to any peer
  Path #1: Received by speaker 0
  Not advertised to any peer
  Local
    1.1.1.4 C:30 (bsid:40001) (metric 120) from 1.1.1.4 (1.1.1.4)
      Received Label 90000
      Origin IGP, metric 0, localpref 100, valid, internal, best, group-best, multipath, import-candidate, imported
      Received Path ID 0, Local Path ID 0, version 81
      Extended community: Color:30 RT:1:1
      SR policy color 30, up, registered, bsid 40001
      Source AFI: VPNv4 Unicast, Source VRF: default, Source Route Distinguisher: 1.1.1.4:0
  Path #2: Received by speaker 0
  Not advertised to any peer
  Local
    1.1.1.5 C:30 (bsid:40002) (metric 130) from 1.1.1.5 (1.1.1.5)
      Received Label 90002
      Origin IGP, metric 0, localpref 100, valid, internal, multipath, import-candidate, imported
      Received Path ID 0, Local Path ID 0, version 0
      Extended community: Color:30 RT:1:1
      SR policy color 30, up, registered, bsid 40002
      Source AFI: VPNv4 Unicast, Source VRF: default, Source Route Distinguisher: 1.1.1.5:0

Example 10‑6 shows the CEF forwarding entry of vrf Acme route 45.1.1.0/24. The first path of this route is steered into SR Policy (30, 1.1.1.4) GREEN_TO4 via BSID 40001 (via local-label 40001). The second path is steered into SR Policy (30, 1.1.1.5) GREEN_TO5 via its BSID 40002 (via local-label 40002).

The last label (9000X) in the labels imposed stack of the output is the VPN label of the prefix, as advertised by the egress PE. The first label in the imposed label stack is the label to reach the BGP nexthop, but since the nexthop is reached via the SR Policy this label is implicit-null (ImplNull), which represents a no-operation in this case.

Example 10-6: CEF entry of BGP multi-path route

RP/0/0/CPU0:xrvr-1#show cef vrf Acme 45.1.1.0/24
45.1.1.0/24, version 13, internal 0x5000001 0x0 (ptr 0xa13e8b04) [1], 0x0 (0x0), 0x208 (0xa18f907c)
 Updated Jul 19 18:50:03.624
 Prefix Len 24, traffic index 0, precedence n/a, priority 3
   via local-label 40001, 3 dependencies, recursive, bgp-multipath [flags 0x6080]
    path-idx 0 NHID 0x0 [0xa163e174 0x0]
    recursion-via-label
    next hop VRF - 'default', table - 0xe0000000
    next hop via 40001/0/21
     next hop srte_c_30_ep_1.1.1.4
      labels imposed {ImplNull 90000}
   via local-label 40002, 3 dependencies, recursive, bgp-multipath [flags 0x6080]
    path-idx 1 NHID 0x0 [0xa163e74c 0x0]
    recursion-via-label
    next hop VRF - 'default', table - 0xe0000000
    next hop via 24028/0/21
     next hop srte_c_30_ep_1.1.1.5
      labels imposed {ImplNull 90002}

10.4 Color-Only Steering

By default, AS requires an exact match of the service route's nexthop with the SR Policy's endpoint. This implies that the BGP next-hop of the service route must be of the same address-family (IPv4 or IPv6) as the SR Policy endpoint. In rare cases an operator may want to steer traffic into an SR Policy based on color only, or an IPv6 service route into an IPv4 SR Policy (address-family agnostic). This would allow the operator to reduce the number of SR Policies maintained on a headend, by leveraging a single SR Policy to forward traffic regardless of its nexthop and address-family. Automated Steering based on color only is possible by using the Color-Only bits (CO-bits) of the Color Extended Community attribute.

The format of the Color Extended Community is specified in RFC 5512 (to be replaced by draft-ietf-idr-tunnel-encaps) and is augmented in draft-ietf-idr-segment-routing-te-policy for the cases where it is used to steer traffic into an SR Policy. The format, as specified in draft-ietf-idr-segment-routing-te-policy, is shown in Figure 10‑5.

Figure 10-5: Color extended community attribute format

With:

Color-Only bits (CO-bits): influence the Automated Steering selection preference, as described in this section.
Color Value: a flat 32-bit number

The CO-bits of a Color Extended Community can be specified together with the color value in the extcommunity-set that configures the color value, as illustrated in Example 10‑7. The example configures the extended community set named "INCL-NULL" with color value "100" and sets the CO-bits of this color to "01". This setting of the CO-bits indicates to the headend to also consider SR Policies with null endpoint, as discussed further. This extcommunity-set can then be used to attach the color extended community to a prefix as usual, using a route-policy.

Example 10-7: set CO-bits for Color Extended Community

extcommunity-set opaque INCL-NULL
  100 co-flag 01
end-set
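A minimal sketch of such a route-policy attaching the INCL-NULL extcommunity-set to a prefix on the egress PE; the prefix and the policy name are assumptions for illustration only.

route-policy SET-COLOR-CO01
  if destination in (4.4.4.0/24) then
    set extcommunity color INCL-NULL
  endif
end-policy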

Table 10‑1 shows the behavior associated to each setting of the CO-bits in the Color Extended Community. It shows the selection criteria and preference order (most to least preferred) that BGP uses to select an SR Policy to be associated with a received route R/r. The table assumes that the received BGP route R/r with next-hop N has a single color C.

Table 10-1: CO-bits – traffic steering preference order

CO=00:
1. SR Policy (C, N)
2. IGP to N

CO=01:
1. SR Policy (C, N)
2. SR Policy (C, null(AFN))
3. SR Policy (C, null(any))
4. IGP to N

CO=10:
1. SR Policy (C, N)
2. SR Policy (C, null(AFN))
3. SR Policy (C, null(any))
4. SR Policy (C, any(AFN))
5. SR Policy (C, any(any))
6. IGP to N

Terminology:

"IGP to N" is the IGP shortest path to N
null(AFN) is the null endpoint for the address-family of N (AFN)
null(any) is the null endpoint for any address-family
any(AFN) is any endpoint of the address-family of N (AFN)
any(any) is any endpoint of any address-family

By default, the CO-bits of a color are 00, which results in the default Automated Steering functionality described in chapter 5, "Automated Steering": steer a BGP route R/r with next-hop N and color C into the valid and authorized-to-steer SR Policy (C, N). If no such SR Policy exists, then steer the route R/r on the IGP shortest path to N. This default behavior (CO-bits=00) is listed in the first column of the table.

While discussing the other settings of the CO-bits, we will introduce some new terms that are also used in Table 10‑1. Null endpoint is the first term.

Null endpoint

An operator can specify a null endpoint for an SR Policy if only the color of a route is of importance for the steering or if a "wildcard" SR Policy for a given color is required. Depending on the address-family, the null endpoint will be 0.0.0.0 (IPv4) or ::0 (IPv6).

Remember that the endpoint attribute of an SR Policy does not directly determine the location where the packets will exit the SR Policy. This exit point derives from the SID list associated with the SR Policy, in particular the last SID. However, dynamic candidate paths are computed based on the location of the SR Policy endpoint. Since the null endpoint does not have a specific location, it is therefore not compatible with dynamic candidate paths. All the candidate paths of an SR Policy with a null endpoint must be explicit.

Table 10‑1 uses the term "null(AFN)" to indicate the null endpoint for the Address-Family of nexthop N (indicated by AFN). For example, if N is identified by an IPv6 address then null(AFN) = null(IPv6) = ::0. Note that only one SR Policy (C, null(AF)) for each AF can exist on a given head-end node since the tuple (color, endpoint) uniquely identifies an SR Policy.

For a BGP route R/r that has a single color C with the CO-bits set to 01, BGP steers the route R/r according to the preference order in the second column of Table 10‑1 (CO=01). BGP preferably steers the route R/r into a valid SR Policy (C, N). If no such SR Policy is available, then BGP steers the route R/r into a valid SR Policy (C, null(AFN)), which matches the route's color C and has a null endpoint of the same address-family as the nexthop N.

If such an SR Policy is not available, then BGP steers the route R/r into a valid SR Policy (C, null(any)) with a matching color and a null endpoint of any address-family. Finally, if none of the above SR Policies are available, then BGP steers the route R/r on the IGP shortest path to its next-hop N.

The second new term is any endpoint.

Any endpoint

Table 10‑1 uses the term any(AFN) to indicate any endpoint that matches the address-family of nexthop N (which is indicated by AFN). This allows any SR Policy with a matching color and address-family to be selected. If, for example, route R/r has BGP nexthop 1.1.1.3 (i.e., N=1.1.1.3), then the address-family of N (AFN) is IPv4. any(AFN) is then any(IPv4) and will match any IPv4 endpoint. any(any) goes one step further by allowing an SR Policy of any endpoint and any address-family to be selected.

For a BGP route R/r that has a single color C with the CO-bits set to 10, BGP steers the route R/r according to the preference order in the third column of Table 10‑1. The preference order is identical to the order in the second column (CO-bits 01), up to preference 3. The next (4th) preference is a valid SR Policy (C, any(AFN)), which matches the route's color C and has an endpoint of the same address-family as the nexthop N. If such an SR Policy is not available, then BGP steers the route into a valid SR Policy (C, any(any)). If none of the above SR Policies are available, then BGP steers the route R/r on the IGP shortest path to its next-hop N.

At the time of writing, handling of CO-bits=10 is not implemented in IOS XR.

Address-Family Agnostic Steering

As a consequence of the Color-Only steering mechanism, traffic of one address-family can be steered into an SR Policy of the other address-family.

For example, headend H receives BGP IPv6 route 4::4:4:0/112 with nexthop B:4:: and color 20 with the CO-bits set to 01. The non-default CO-bits setting 01 indicates to also consider SR Policies with color 20 and a null endpoint as a fallback. In this example, H has no SR Policy with nexthop B:4:: and color 20, nor an SR Policy with IPv6 null endpoint ::0 and color 20, but H has an SR Policy with IPv4 null endpoint 0.0.0.0 and color 20. Therefore, BGP steers 4::4:4:0/112 on this IPv4 SR Policy. In this situation, a particular transport problem may arise if the last label of the label stack is popped before the traffic reaches the endpoint node, due to the Penultimate Hop Popping (PHP) functionality. MPLS (Multiprotocol Label Switching) transport can carry multiple types of traffic, IPv4, IPv6, L2, etc… As long as the packet contains an MPLS header, it will be transported seamlessly, even by nodes that do not support the transported type of traffic. However, when a node pops the last label and removes the MPLS header, the node needs to know and support the type of packet that is behind the now removed MPLS header in order to correctly forward this packet. “Correctly forward” implies elements such as using correct Layer 2 encapsulation and updating TTL field in the header. An MPLS header does not contain a protocol field nor any other indication of the type of the underlying packet. The node that popped the last label would need to examine the newly exposed packet header to find out if it is IPv4 or IPv6 or something else. If the SR Policy uses IPv4-based SIDs, then the nodes are able to transport MPLS and unlabeled IPv4 but maybe not unlabeled IPv6. And the other way around, if the SR Policy uses IPv6-based SIDs, then the nodes are able to transport MPLS and unlabeled IPv6 but maybe not unlabeled IPv4. To avoid this penultimate node problem, the packets of the address-family that differs from the SR Policy’s address-family must be label-switched all the way to the endpoint of the SR Policy. The exposed IP header problem on the penultimate node does not arise in such case. When steering an IPv6 destination into an IPv4 SR Policy, BGP programs the route to impose an IPv6 explicit-null (label value 2) on unlabeled IPv6 packets steered into this SR Policy. With this IPv6 explicit-null label at the bottom of the label stack, the packet stays labeled all the way to the ultimate hop node. This node then pops the IPv6 explicit-null label and does a lookup for the IPv6 destination address in the IPv6 forwarding table to forward the packet.

For labeled IPv6 packets, it is the source node's responsibility to impose a bottom label that is not popped by the penultimate node, such as an IPv6 explicit-null label or an appropriate service label. Note that for destinations carried in SR Policies of their own address-family (e.g., IPv4 traffic carried by an IPv4 SR Policy) nothing changes.

Figure 10‑6 illustrates an SR Policy with IPv4 null endpoint transporting the following types of traffic. This SR Policy's SID list is <16004>, only containing the Prefix-SID of Node4. Prefixes 4.4.4.0/24 and 4::4:4:0/112 have color 10 with CO-bits set to "01".

Figure 10-6: Address-Family agnostic steering into IPv4 SR Policy

Unlabeled IPv4
Automated Steering steers unlabeled IPv4 packets into the matching (nexthop and color) IPv4 SR Policy.

Labeled IPv4
Labeled IPv4 packets with the top label matching a BSID are steered into the IPv4 SR Policy associated with the BSID.

Unlabeled IPv6
BGP imposes the IPv6-explicit-null label when steering unlabeled IPv6 packets into the matching IPv4 SR Policy.

Labeled IPv6
The source node is expected to add IPv6-explicit-null (or another appropriate label) underneath the BSID of the IPv4 SR Policy.

At the time of writing, steering IPv4 destinations into IPv6 SR Policies is not possible. Steering IPv6 destinations into IPv4 SR Policies is possible by default. It can be disabled per SR Policy, as shown in Example 10‑8.

Example 10-8: disable steering IPv6 destinations into IPv4 SR Policy with null endpoint

segment-routing
 traffic-eng
  policy POLICY1
   ipv6 disable
   color 20 end-point ipv4 0.0.0.0
   candidate-paths
    preference 100
     dynamic

Color-only steering for the BGP-LU, 6PE, 6vPE, VPNv4, and EVPN cases will follow the same semantics as for global IPv4/IPv6 traffic.

10.5 Summary

This chapter describes some specific elements of Automated Steering that are not covered in chapter 5, "Automated Steering".

A BGP route can have multiple SLA colors. Automated Steering installs such a route on the valid SR Policy that matches the color with the highest numerical value.

Use an ingress BGP route-policy on an ingress PE to add, update, or delete the SLA color(s) as advertised by the egress PE.

With BGP multi-path, Automated Steering is applied to each installed multi-path entry.

In very specific use-cases, an operator may desire to use color-only and address-family agnostic steering. This allows steering traffic based only on color and enables steering traffic of one address-family into an SR Policy of the other address-family.

10.6 References

[RFC5512] "The BGP Encapsulation Subsequent Address Family Identifier (SAFI) and the BGP Tunnel Encapsulation Attribute", Pradosh Mohapatra, Eric C. Rosen, RFC5512, April 2009

[RFC7911] "Advertisement of Multiple Paths in BGP", Daniel Walton, Alvaro Retana, Enke Chen, John Scudder, RFC7911, July 2016

[draft-ietf-idr-segment-routing-te-policy] "Advertising Segment Routing Policies in BGP", Stefano Previdi, Clarence Filsfils, Dhanendra Jain, Paul Mattes, Eric C. Rosen, Steven Lin, draft-ietf-idr-segment-routing-te-policy-05 (Work in Progress), November 2018

[draft-ietf-idr-tunnel-encaps] "The BGP Tunnel Encapsulation Attribute", Eric C. Rosen, Keyur Patel, Gunter Van de Velde, draft-ietf-idr-tunnel-encaps-11 (Work in Progress), February 2019

[draft-ietf-spring-segment-routing-policy] "Segment Routing Policy Architecture", Clarence Filsfils, Siva Sivabalan, Daniel Voyer, Alex Bogdanov, Paul Mattes, draft-ietf-spring-segment-routing-policy-02 (Work in Progress), October 2018

1. BGP re-executes nexthop resolution to find the new route to reach the nexthop for this route.↩

11 Autoroute and Policy-Based Steering

What we will learn in this chapter:

Policy-based steering methods allow an operator to configure a local routing policy on a headend that overrides any BGP/IGP path and steers specified traffic flows on an SR Policy.
Autoroute is an IGP functionality where the IGP automatically installs forwarding entries for destinations beyond the endpoint of the SR Policy into this SR Policy.
Pseudowire traffic can be steered into an SR Policy by pinning it to this SR Policy.
Traffic can be statically steered into an SR Policy by using a static route.

Automated Steering (AS) is the recommended steering functionality for SR-TE. It allows an operator to automatically steer traffic with a fine granularity. However, AS is not the only steering mechanism. Other steering mechanisms are available to the operator to solve particular use-cases. These steering methods allow the operator to configure a local routing policy on a headend that overrides any BGP or IGP shortest path and steers a specified traffic flow into an SR Policy.

In this chapter we first describe how the IGP can steer prefixes into an SR Policy by using autoroute. Autoroute has some limitations that will be highlighted. Next we explain steering pseudowire traffic into an SR Policy by pinning the pseudowire to the SR Policy. Finally, we show a static route pointing to an SR Policy.

11.1 Autoroute

Autoroute [RFC 3906], also known as IGP shortcut, is a mechanism to let the IGP steer destination prefixes into SR Policies. Enabling autoroute on an SR Policy essentially instructs the IGP to steer destination prefixes that are advertised by the SR Policy's endpoint node or its downstream nodes (downstream on the IGP shortest path graph) into this SR Policy. The IGP installs these destination prefixes in the forwarding table, pointing to the SR Policy.

Autoroute is a local behavior on the headend. No IGP adjacency is formed over the SR Policy and the IGP does not advertise any autoroute information to the other nodes in the network.

If the IGP installs the forwarding entry for the SR Policy's endpoint N into the SR Policy, all BGP routes with nexthop N that are not steered by Automated Steering are steered into this SR Policy. BGP installs these routes in the forwarding table, recursing on their nexthop N, and the IGP steers this nexthop into the SR Policy.

Automated Steering overrides the steering done by autoroute. BGP installs colored BGP routes that match an SR Policy recursing on the matching SR Policy's BSID instead of recursing on their nexthop.

Figure 11‑1 shows a two-area IGP network. It can be OSPF, where Node5 and Node6 are Area Border Routers, or ISIS, where Node5 and Node6 are Level-1-2 routers, Area 1 is level-2 and Area 2 is level-1.

Figure 11-1: Autoroute

An SR Policy GREEN (green, Node6) is programmed on Node1 with a SID list that enforces the path 1→2→3→6 (the IGP shortest path is 1→7→6) and configured with autoroute announce. As a result, the IGP installs the forwarding entries to Node4, Node5 and Node6, which are located downstream of the SR Policy's endpoint Node6, via SR Policy GREEN. These destinations are said to be autorouted in this SR Policy. Since Node1 sees the loopback address of Node8 as advertised by the area border nodes Node5 and Node6, the forwarding entry to Node8 is also installed via the SR Policy. The forwarding table of Node1 towards the other nodes in the network is shown in Figure 11‑1.

The IGP steers unlabeled and label-imposition forwarding entries for autorouted destinations into the SR Policy. The IGP installs the label swap forwarding entries for Prefix-SIDs of autorouted destinations onto the SID's algorithm shortest path, with one exception: algorithm 0 Prefix-SIDs. Read chapter 7, "Flexible Algorithm" for more information about Prefix-SID algorithms. For algorithm 0 (default SPF) Prefix-SIDs the IGP installs the label swap forwarding entry into the SR Policy if this SR Policy's SID list only contains non-algorithm-0 Prefix-SIDs.

Otherwise, i.e., if the SR Policy's SID list contains at least one algorithm 0 Prefix-SID, the IGP installs the label swap entry for algorithm 0 (default SPF) Prefix-SIDs onto the IGP shortest path. In order to steer SR labeled traffic into an SR Policy using autoroute, e.g., on an aggregation node, the SR Policy must use only strict-SPF Prefix-SIDs and Adj-SIDs. Note that autorouting SR labeled traffic into an SR Policy is only possible for algorithm 0 Prefix-SIDs.

To steer service traffic into an SR Policy using autoroute, e.g., on a PE node, the SR Policy's SID list can contain any type of Prefix-SIDs since the label imposition forwarding entry will forward the service traffic into the SR Policy.

LDP forwarding entries are installed as well for autorouted prefixes. The label swap forwarding entry for an autorouted destination with an LDP local label is always installed, regardless of the composition of the SR Policy's SID list. If a (targeted) LDP session with the SR Policy's endpoint node exists, then the installed outgoing label for the LDP forwarding entry is the learned LDP label. Otherwise the Prefix-SID of the prefix, possibly advertised by an SR Mapping Server (SRMS), is used as outgoing label, hereby leveraging the LDP-to-SR interworking functionality. More information on SR and LDP interworking is provided in Part I of the SR book series.

Limitations of Autoroute

Autoroute is widely used in classic MPLS TE deployments, but it has a number of disadvantages and limitations that make its use less desirable as compared to the recommended SR-TE Automated Steering functionality.

Limited to local IGP area SR Policies

Since autoroute is an IGP functionality and the IGP only has visibility into the topology of its local area, autoroute applicability is limited to SR Policies with an endpoint located in the local IGP area.

Limited to per-BGP nexthop steering

Autoroute steers the specific IGP prefixes into an SR Policy, as well as BGP routes recursing on these IGP routes. Therefore, for BGP prefixes the steering is per-nexthop, not per-destination. If a nexthop address N is autorouted into an SR Policy, then all BGP routes with N as nexthop are steered into the SR Policy as well.

Note that Automated Steering overrides autoroute steering. Automated Steering steers a BGP route into the SR Policy that matches the color and nexthop of that BGP route, even if this nexthop prefix is autorouted into another SR Policy.

Blanket steering method

When enabling autoroute on an SR Policy, all IGP prefixes of the endpoint node and of the nodes on the SPT behind the endpoint node are steered into the SR Policy, as well as all BGP routes recursing on these prefixes.

Steering SR labeled traffic

As explained in the previous section, incoming labeled traffic with the Prefix-SID of one of these autorouted prefixes as top label is steered into the SR Policy only if this SR Policy's SID list contains only strict-SPF Prefix-SIDs and Adj-SIDs. If the SID list contains one or more non-strict-SPF Prefix-SIDs, SR labeled traffic is steered on the IGP shortest path instead.

11.2 Pseudowire Preferred Path

The operator who wants to pin the path for an L2VPN Pseudowire (PW) to an SR Policy can use the configuration shown in Example 11-1. In this example configuration, the L2VPN PW is pinned to SR Policy GREEN using the preferred-path sr-te policy srte_c_30_ep_1.1.1.4 configuration in the applied pw-class.

With this configuration, if the SR Policy goes down then the pseudowire traffic will follow the default (labeled) forwarding path to the L2VPN PW neighbor. This is typically the IGP shortest path. This default fallback can be disabled by adding fallback disable to the preferred-path configuration. When this option is specified and the SR Policy goes down, the pseudowire goes down as well.

Example 11-1 shows the configuration for a pseudowire XCON-P2P with statically specified pseudowire labels (mpls static label local 2222 remote 3333). Another option is to use LDP to exchange pseudowire labels with the neighbor. In that case no static labels are configured, but mpls ldp must be enabled globally on the node. Enabling LDP does not imply that LDP transport labels will be used. If LDP is only used for PW label negotiation and not for transport, then no interfaces need to be configured under mpls ldp.

The other cross-connect in Example 11-1 is XCON-EVPN-P2P, an EVPN VPWS signaled pseudowire. In the pw-class EoMPLS-PWCLASS the SR Policy GREEN is configured as preferred path. This pw-class is applied to both cross-connects to make SR Policy GREEN their preferred path. Optionally, fallback to the IGP shortest path can be disabled in the preferred-path configuration.

Example 11-1: L2VPN PW preferred-path

segment-routing
 traffic-eng
  policy GREEN
   color 30 end-point ipv4 1.1.1.4
   candidate-paths
    preference 100
     dynamic
      metric
       type delay
!
l2vpn
 pw-class EoMPLS-PWCLASS
  encapsulation mpls
   preferred-path sr-te policy srte_c_30_ep_1.1.1.4
   !! or: preferred-path sr-te policy srte_c_30_ep_1.1.1.4 fallback disable
 !
 xconnect group XCONGRP
  p2p XCON-P2P
   interface TenGigE0/1/0/3
   neighbor ipv4 1.1.1.4 pw-id 1234
    !! below line only if not using LDP for PW signaling
    mpls static label local 2222 remote 3333
    pw-class EoMPLS-PWCLASS
  !
  p2p XCON-EVPN-P2P
   interface TenGigE0/1/0/4
   neighbor evpn evi 105 target 40101 source 10101
    pw-class EoMPLS-PWCLASS
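For the LDP-signaled variant mentioned above, a minimal hedged sketch of the global LDP configuration is shown below: LDP is enabled for PW label negotiation only, so no interfaces are configured under it. The router-id value is an assumed example.

mpls ldp
 !! no interfaces configured: LDP is used for PW signaling only, not for transport
 router-id 1.1.1.1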

11.3 Static Route

The network operator can use static routing to steer specific destination prefixes into an SR Policy. The configuration is illustrated in Example 11-2.

Example 11-2: Static route into SR Policy – configuration

segment-routing
 traffic-eng
  policy GREEN
   color 30 end-point ipv4 1.1.1.4
   candidate-paths
    preference 100
     dynamic
      metric
       type delay
!
router static
 address-family ipv4 unicast
  99.20.20.0/24 sr-policy srte_c_30_ep_1.1.1.4

The route points directly to the SR Policy, as shown in Example 11-3. Note that a static route has a lower administrative distance (1) and overrides any existing routing entry provided by other protocols such as the IGP or BGP.

Example 11-3: Static route into SR Policy

RP/0/0/CPU0:xrvr-1#show route 99.20.20.0/24

Routing entry for 99.20.20.0/24
  Known via "static", distance 1, metric 0 (connected)
  Installed Jun 19 11:54:54.290 for 00:08:07
  Routing Descriptor Blocks
    directly connected, via srte_c_30_ep_1.1.1.4
      Route metric is 0, Wt is 1
  No advertising protos.

11.4 Summary

A local routing policy can be used to steer traffic into an SR Policy: autoroute, pinning a pseudowire to a specific SR Policy, or a static route.

Autoroute is an IGP functionality where the IGP installs the forwarding entries for prefixes located on or beyond the endpoint of an SR Policy into that SR Policy. Automated Steering overrides autoroute steering.

A pseudowire can be pinned to an SR Policy to steer all traffic of that pseudowire into this SR Policy.

11.5 References

[SR-book-Part-I] "Segment Routing Part I", Clarence Filsfils, Kris Michielsen, Ketan Talaulikar, October 2016, ASIN: B01I58LSUO (Kindle), ISBN-10: 1542369126, ISBN-13: 978-1542369121

[RFC3906] "Calculating Interior Gateway Protocol (IGP) Routes Over Traffic Engineering Tunnels", Henk Smit, Naiming Shen, RFC3906, October 2004

12 SR-TE Database

What you will learn in this chapter:

SR-TE computes and validates SR Policy paths based on the information contained in the SR-TE DB.
The SR-TE DB is populated with local domain topology (nodes, links, metrics) and SR (SRGB, SIDs) information through the IGP.
Topology and SR information from remote domains and BGP-only domains is obtained through BGP-LS.
The SR-TE DB is intrinsically multi-domain capable: it can contain and consolidate information of an entire multi-domain network.
Multi-domain topology consolidation allows SR-TE to compute globally optimal inter-domain SR Policy paths.
Detailed information about all active SR Policies in the network can be obtained through PCEP or BGP-LS.

The SR-TE database represents the view that the SR-TE process of a node has of its network. It is the one-stop source of information for SR-TE to compute and validate SR Policy paths.

On a headend node, it contains local area information fed by the IGP process. It is used to compute intra-area dynamic paths and to translate explicit paths expressed with Segment Descriptors into a list of MPLS labels.

On an SR PCE, it contains the entire multi-domain network topology, acquired via BGP-LS, that is used to compute inter-domain SR Policy paths as well as disjoint SR Policy paths originating from separate headend nodes. The topology of BGP-only networks can also be collected via BGP-LS.

In this chapter, we explain how the SR-TE DB is populated with IGP information on a headend, with BGP-LS and PCEP on an SR PCE, and how network topology information from different domains is consolidated into a single multi-domain topology.

12.1 Overview

The SR-TE Database (SR-TE DB) contains the information for SR-TE to compute and validate SR Policy paths. The information in the SR-TE DB includes:

Base IGP topology information (nodes, links, IGP metric, …)
Egress Peer Engineering (EPE) information
Segment Routing information (SRGB, Prefix-SID, Adj-SID, …)
TE Link Attributes (TE metric, link delay metric, SRLG, affinity colors, …)
SR Policy information (headend, endpoint, color, segment list, BSID, …)

The information in the SR-TE DB is protocol independent and it may be learnt via the link-state IGPs, via BGP-LS, or via PCEP.

The SR-TE DB is intrinsically multi-domain capable. In some use-cases, the SR-TE DB may only contain the topology of the locally attached domain, while in other use-cases the SR-TE DB contains the topology of multiple domains.

Instance Identifier

Each routing domain, or routing protocol instance, in the network is identified by a network-wide unique Instance Identifier (Instance-ID), which is a 64-bit value assigned by the operator. A given routing protocol instance must be associated with the same Instance-ID on all nodes that participate in this instance. Conversely, other routing protocol instances on the same or different nodes in the network must be given a different Instance-ID.

Furthermore, a single Instance-ID is assigned to an entire multi-level or multi-area IGP instance. This means that the same Instance-ID must be associated with the IGP instance on all nodes in the domain, regardless of their level or area.

The Instance-ID is configured on a node with distribute link-state instance-id under the IGP instance. The configuration in Example 12-1 shows a node running two IGP instances. Instance-ID 100 is configured for the ISIS instance named "SR-ISIS" and Instance-ID 101 for the OSPF instance "SR-OSPF". The same Instance-ID 100 is associated with this ISIS instance on the other nodes in the "SR-ISIS" domain, and similarly for the Instance-ID 101 in the "SR-OSPF" domain.

If no instance-id is specified in the configuration, the default Instance-ID 0 is used. This default Instance-ID is reserved for networks running a single routing protocol instance.

Example 12-1: Specify the domain Instance-ID under the IGP

router isis SR-ISIS
 distribute link-state instance-id 100
!
router ospf SR-OSPF
 distribute link-state instance-id 101

The distribute link-state configuration enables the feeding of information from the IGP to the SR-TE DB and to BGP-LS. It is thus required on SR-TE headend nodes, on SR PCEs participating in an IGP, and on BGP-LS Producers¹.

Each domain-specific object (node, link, prefix) in the SR-TE DB is associated with the Instance-ID of its domain, identifying the domain where the object belongs. When distributing the object in BGP-LS, the Instance-ID is carried within the Identifier field of the object's BGP-LS NLRI. See chapter 17, "BGP-LS" for more details.

When not following the above Instance-ID assignment recommendations, that is, when using non-unique Instance-IDs across multiple domains, duplicate entries for the same node, link, or prefix objects may be present in the SR-TE DB. This may also result in an inaccurate view of the network-wide topology.

Information Feeds

A headend node typically has an SR-TE DB containing only information of the locally attached IGP area, while the SR-TE DB of a Path Computation Element (PCE) usually includes more information, such as information about remote domains, BGP peering links, and remote SR Policies.

As illustrated in Figure 12-1, different mechanisms exist to feed information into the SR-TE DB: IGP, BGP, and Path Computation Element Protocol (PCEP). The BGP-LS feed configuration does not specify an Instance-ID since BGP-LS NLRIs carry the Instance-ID that is provided by the BGP-LS Producer.

Figure 12-1: Populating the SR-TE DB

Different combinations of information feeding mechanisms are used, depending on the role and position of the node hosting the SR-TE DB. These mechanisms are detailed in the following sections.

12.2 Headend In order for the SR-TE process of a node to learn the topology of a connected network area, it is sufficient that the node participates in the IGP of that area and that the information feed from the IGP to the SR-TE process is enabled. Since a headend node typically participates in the IGP, this is a common method to populate the SR-TE DB of such a node. This local area information allows the headend node to compute and maintain intra-area SR Policy paths, as described in chapter 4, "Dynamic Candidate Path". On a node in a single-IGP-domain network, the information feed of the IGP to the SR-TE process is enabled by configuring distribute link-state under the IGP instance, as illustrated in Example 12‑2. Since no Instance-ID is specified in this command, it uses the default Instance-ID 0 for the only domain in the network. Example 12-2: Single IGP domain – distribute link-state configuration router isis SR !! or router ospf SR distribute link-state

In a network with multiple IGP domains, it is required to specify the instance-id in the distribute link-state configuration command, as was illustrated before in Figure 12-1.

The distribute link-state configuration command has an optional throttle parameter that specifies the time to wait before distributing the link-state update. The default throttle interval is 50 ms for ISIS and 5 ms for OSPF.

The two-node topology in Figure 12-2 is used to illustrate the information present in the SR-TE DB. This is an ISIS level-2 network, but the equivalent information is available for OSPF networks.

Figure 12-2: Two-node ISIS topology

Example 12-3 shows the configuration of Node1 in this topology. First a brief description of the configuration elements, from top to bottom:

The interface Loopback0 prefix is 1.1.1.1/32 (line 2).
ISIS distributes the LS-DB to SR-TE with Instance-ID 100 (line 7). The Instance-ID is specified for illustration purposes. Since this is a single-domain network, the default Instance-ID 0 could have been used.
The ISIS router-id is configured as 1.1.1.1, interface Loopback0's address (line 10); this is the TE router-id. For OSPF this TE router-id is configured as mpls traffic-eng router-id Loopback0.
Segment-routing is enabled for ISIS and a Prefix-SID 16001 is associated with prefix 1.1.1.1/32 (line 16).
The link to Node2 is a point-to-point link and TI-LFA is enabled on this interface (lines 18-22).
The TE metric for interface Gi0/0/0/0 is configured as 15 under SR-TE (line 27).
Bit 0 is set in the affinity bitmap of this interface, using the user-defined name COLOR0 (lines 28-32). With this configuration, Node1 advertises the affinity bit-map 0x00000001 for this link.
Performance-measurement is enabled to dynamically measure and advertise the link-delay metrics. In this example, the link-delay for interface Gi0/0/0/0 is configured as a static value of 12 µs (line 38). See chapter 15, "Performance Monitoring – Link Delay" for more details.
In the SRLG section an SRLG 1111 is assigned to interface Gi0/0/0/0 using the user-defined name SRLG1 (lines 40-44).

The following elements are not shown in the configuration since they are the default: the SR Global Block (SRGB) is [16000-23999] and the SR Local Block (SRLB) is [15000-15999].

Example 12-3: Node1's configuration – ISIS example

 1 interface Loopback0
 2  ipv4 address 1.1.1.1 255.255.255.255
 3 !
 4 router isis SR
 5  is-type level-2-only
 6  net 49.0001.0000.0000.0001.00
 7  distribute link-state instance-id 100
 8  address-family ipv4 unicast
 9   metric-style wide
10   router-id Loopback0
11   segment-routing mpls
12  !
13  interface Loopback0
14   passive
15   address-family ipv4 unicast
16    prefix-sid absolute 16001
17  !
18  interface GigabitEthernet0/0/0/0
19   point-to-point
20   address-family ipv4 unicast
21    fast-reroute per-prefix
22    fast-reroute per-prefix ti-lfa
23 !
24 segment-routing
25  traffic-eng
26   interface GigabitEthernet0/0/0/0
27    metric 15
28    affinity
29     name COLOR0
30   !
31   affinity-map
32    name COLOR0 bit-position 0
33 !
34 performance-measurement
35  interface GigabitEthernet0/0/0/0
36   delay-measurement
37    !! statically specified delay value
38    advertise-delay 12
39 !
40 srlg
41  interface GigabitEthernet0/0/0/0
42   name SRLG1
43  !
44  name SRLG1 value 1111

This example is a single-domain topology. All the information that is inserted in the SR-TE DB is learned from the IGP. To allow a comparison of the SR-TE DB entry for Node1 with the ISIS link-state advertisement of Node1, Node1's LS-DB entry is shown in Example 12-4.

Example 12-4: Node1's ISIS LS-DB entry

RP/0/0/CPU0:xrvr-1#show isis database verbose xrvr-1

IS-IS SR (Level-2) Link State Database
LSPID                 LSP Seq Num  LSP Checksum  LSP Holdtime/Rcvd  ATT/P/OL
xrvr-1.00-00        * 0x00000067   0xd912        1077 /*            0/0/0
  Area Address:   49.0001
  NLPID:          0xcc
  IP Address:     1.1.1.1
  Router ID:      1.1.1.1
  Hostname:       xrvr-1
  Router Cap:     1.1.1.1, D:0, S:0
    Segment Routing: I:1 V:0, SRGB Base: 16000 Range: 8000
    SR Local Block: Base: 15000 Range: 1000
    SR Algorithm:
      Algorithm: 0
      Algorithm: 1
    Node Maximum SID Depth:
      Label Imposition: 10
  Metric: 0          IP-Extended 1.1.1.1/32
    Prefix-SID Index: 1, Algorithm:0, R:0 N:1 P:0 E:0 V:0 L:0
    Prefix Attribute Flags: X:0 R:0 N:1
    Source Router ID: 1.1.1.1
  Metric: 10         IS-Extended xrvr-2.00
    Affinity: 0x00000001
    Interface IP Address: 99.1.2.1
    Neighbor IP Address: 99.1.2.2
    Admin. Weight: 15
    Link Average Delay: 12 us
    Link Min/Max Delay: 12/12 us
    Link Delay Variation: 0 us
    Link Maximum SID Depth:
      Label Imposition: 10
    ADJ-SID: F:0 B:1 V:1 L:1 S:0 P:0 weight:0 Adjacency-sid:24012
    ADJ-SID: F:0 B:0 V:1 L:1 S:0 P:0 weight:0 Adjacency-sid:24112
  Metric: 10         IP-Extended 99.1.2.0/24
    Prefix Attribute Flags: X:0 R:0 N:0
  MPLS SRLG: xrvr-2.00
    Interface IP Address: 99.1.2.1
    Neighbor IP Address: 99.1.2.2
    Flags: 0x1
    SRLGs: [0]: 1111

 Total Level-2 LSP count: 1     Local Level-2 LSP count: 1

The SR-TE process on Node1 receives the topology information from the IGP (distribute link-state is configured under router isis SR).

Example 12-5 shows the entry for Node1 in Node1's SR-TE DB and can be compared with the information in the ISIS LS-DB entry shown in Example 12-4. The output starts with the node information, followed by the link information.

Node1's hostname is xrvr-1 (line 7) and its TE router-id is 1.1.1.1 (line 6). Node1 is a level-2 ISIS node with system-id 0000.0000.0001 (line 8). The Autonomous System Number (ASN) is 0 since this information is not received via BGP.

Node1 advertises a single Prefix-SID 16001, which is an algorithm 0 (regular) Prefix-SID associated with prefix 1.1.1.1/32 (line 11). It is a Node-SID since it has the N-flag set (flags: N). The prefix 1.1.1.1/32 with its Prefix-SID is advertised in the domain with ID 100 (domain ID: 100) (line 10). This domain ID is the instance-id as configured in the distribute link-state instance-id 100 command to distribute the IGP LS-DB to SR-TE. This information is not in the ISIS LS-DB entry of Example 12-4.

Node1 has an SRGB [16000-23999] (line 14) and an SRLB [15000-15999] (line 17).

There is one link to Node2. Node2 has a TE router-id 1.1.1.2 (line 23). The link is identified by its local and remote IP addresses, 99.1.2.1 and 99.1.2.2 respectively (line 19). The link metrics are IGP metric 10, TE metric 15, and link-delay metric 12 (line 26). Note that the link-delay metric is the minimum-delay metric, as discussed in chapter 15, "Performance Monitoring – Link Delay". Two Adjacency-SIDs are advertised for this link, a protected Adj-SID 24012 and an unprotected Adj-SID 24112 (line 27). The purpose of these different Adj-SIDs is explained in Part I of the SR book series. The other link attributes are the affinity bitmap (Admin-groups) with bit 0 set (0x00000001 on line 28) and the SRLG 1111 (line 29).

Example 12-5: Entry for Node1 in SR-TE DB – ISIS example

 1 RP/0/0/CPU0:xrvr-1#show segment-routing traffic-eng ipv4 topology traffic-eng 1.1.1.1
 2
 3 SR-TE topology database
 4 ---------------------------------
 5 Node 1
 6   TE router ID: 1.1.1.1
 7   Host name: xrvr-1
 8   ISIS system ID: 0000.0000.0001 level-2 ASN: 0
 9   Prefix SID:
10     ISIS system ID: 0000.0000.0001 level-2 ASN: 0 domain ID: 100
11     Prefix 1.1.1.1, label 16001 (regular), flags: N
12   SRGB INFO:
13     ISIS system ID: 0000.0000.0001 level-2 ASN: 0
14     SRGB Start: 16000 Size: 8000
15   SRLB INFO:
16     ISIS system ID: 0000.0000.0001 level-2 ASN: 0
17     SRLB Start: 15000 Size: 1000
18
19   Link[0]: local address 99.1.2.1, remote address 99.1.2.2
20     Local node:
21       ISIS system ID: 0000.0000.0001 level-2 ASN: 0
22     Remote node:
23       TE router ID: 1.1.1.2
24       Host name: xrvr-2
25       ISIS system ID: 0000.0000.0002 level-2 ASN: 0
26     Metric: IGP 10, TE 15, Delay 12
27     Adj SID: 24012 (protected) 24112 (unprotected)
28     Admin-groups: 0x00000001
29     SRLG Values: 1111

12.3 SR PCE

Please refer to chapter 13, "SR PCE" for further details of the SR PCE.

The SR PCE can participate in the IGP to learn the topology of the connected IGP area, equivalent to the headend as described in the previous section.

However, using IGP information feeds to populate the SR-TE DB with a complete view of a multi-domain network would require establishing an IGP adjacency with each IGP area of the network. This is neither practical nor scalable.

12.3.1 BGP-LS

Instead, BGP link-state (BGP-LS) is the mechanism of choice to learn the topology of multiple IGP areas and domains. BGP-LS uses BGP to distribute the network information in a scalable manner. On top of that, BGP-LS also carries other information that is not distributed by the IGP (e.g., BGP peering links).

Each entry in the SR-TE DB has an associated Instance-ID identifying the domain where this entry belongs. This Instance-ID is provided by the BGP-LS Producer of that entry and is carried in the object's BGP-LS NLRI. This allows the BGP-LS Consumer to learn the domain where the received object belongs.

The name BGP-LS refers to the BGP link-state address-family, specified in "Link-State Info Distribution Using BGP" [RFC7752]. As you can tell from its name, BGP-LS was initially introduced to distribute the IGP LS-DB and TE-DB information in BGP. It has been extended and became the preferred mechanism to carry other types of information, such as Egress Peer Engineering (EPE) and TE Policy information. Chapter 17, "BGP-LS" discusses the BGP-LS protocol aspects in more detail.

Since BGP-LS is just another BGP address-family, all existing BGP protocol mechanisms can be used to transport and distribute the BGP-LS information in a scalable fashion. A typical BGP-LS deployment uses BGP Route Reflectors to scale the distribution, but this is not a requirement.

The SR-TE DB of a node can be populated by combining the IGP and BGP-LS information feeds. The node learns the topology of its local IGP area via the IGP, while the remote area and domain topologies are obtained via BGP-LS.

Both of these topology learning mechanisms provide real-time topology feeds. Whenever the topology changes, the change is propagated immediately via the different feeding mechanisms. While this is obvious for the IGP mechanism, it is worth noting that this is also the case for BGP-LS. However, a direct IGP feed is likely quicker to deliver the information than a BGP-LS update that may be propagated across one or more BGP RRs.

The configuration on a BGP-LS Producer – a node that feeds its IGP's LS-DB into BGP-LS – consists of a BGP session enabled for address-family link-state link-state and the distribute link-state command under the IGP specifying an Instance-ID. The configuration is illustrated in Example 12-6.

The BGP address-family is identified by an Address Family Identifier (AFI) and a Subsequent Address Family Identifier (SAFI). For the BGP-LS address-family, both AFI and SAFI are named link-state, hence the double keyword in the configuration.

Example 12-6: Configuration on BGP-LS Producer node

router isis SR
 distribute link-state instance-id 100
!
router ospf SR
 distribute link-state instance-id 101
!
router bgp 1
 bgp router-id 1.1.1.1
 address-family link-state link-state
 !
 neighbor 1.1.1.10
  remote-as 1
  address-family link-state link-state

The configuration shows one BGP neighbor 1.1.1.10 for BGP-LS. Distribute link-state is configured under both IGP instances of this node to distribute their LS-DB information to the local SR-TE process (if present) and to the BGP process. The respective Instance-IDs that the operator has assigned to these IGP domains are specified in the distribute commands. The SR PCE has a similar configuration as in Example 12‑6. The IGP configuration is needed when acquiring the local IGP area’s topology directly from the IGP. SR PCE inserts the information received via IGP and BGP-LS in its SR-TE DB.

In addition to the network topology, SR Policy information can also be exchanged via BGP-LS, as specified in draft-ietf-idr-te-lsp-distribution. Each SR Policy headend may advertise its local SR Policies in BGP-LS, thus allowing a PCE, controller, or any other node to obtain this information via its BGP-LS feed. At the time of writing, only the PCEP functionality to collect SR Policy information was available.

BGP-Only Fabric

Some networks use BGP as the only routing protocol, like the Massive-Scale Data Centers (MSDCs) described in "BGP Routing in Data Centers" [RFC7938]. The implementation of SR in such networks is described in draft-ietf-spring-segment-routing-msdc.

To populate the SR-TE DB on an SR PCE or headend for such networks, each node advertises its local information in BGP-LS as specified in draft-ketant-idr-bgp-ls-bgp-only-fabric. This is needed since BGP does not provide a detailed consolidated topology view of the network similar to the one provided by the link-state IGPs. The information is inserted in the SR-TE DB, the same as the other BGP-LS topology information, and is used by the SR-TE process.

The BGP nodes in a BGP-only SR network advertise the following information in BGP-LS:

The attributes of the BGP node and the BGP Prefix-SIDs² to reach that node
The attributes of all the links between the BGP nodes and their associated BGP PeerAdj-SIDs (and other Peering-SIDs, see chapter 14, "SR BGP Egress Peer Engineering")
The SR Policies instantiated on each of the BGP nodes with their properties and attributes

The functionality in the above bullets is not available in IOS XR at the time of writing.

12.3.2 PCEP

The SR-TE DB also contains information on the existing SR Policies in the network. In particular, the local SR Policies on a headend and the delegated ones on a PCE are maintained in the SR-TE DB. The delegated SR Policies are under control of the SR PCE, as described in chapter 13, "SR PCE". In addition, the PCE can learn about the other SR Policies instantiated in the network via BGP-LS, as mentioned in the previous section, or via PCEP.

The latter is achieved by having each Path Computation Client (PCC) send to all its connected (stateful) PCEs a PCEP state report for an SR Policy whenever it is added, updated, or deleted. A PCE receiving such a state report updates its SR-TE DB with the new SR Policy information. The PCE can receive this information directly from the PCC or via a state-sync connection with another PCE (see draft-litkowski-pce-state-sync). More PCEP information is provided in chapter 13, "SR PCE" and chapter 18, "PCEP".

Figure 12-3 illustrates a network with an SR PCE Node10. For illustration purposes, the headend Node1 in this network uses a PCE to compute a dynamic path, even though it could compute the path itself in this particular example.

Figure 12-3: ISIS network with SR PCE

An SR Policy GREEN to Node4 with a delay-optimized path is configured on Node1, as shown in Example 12‑7. PCE 1.1.1.10 (Node10) is specified.

Example 12-7: Node1's SR-TE configuration

segment-routing
 traffic-eng
  policy GREEN
   binding-sid mpls 15000
   color 30 end-point ipv4 1.1.1.4
   candidate-paths
    preference 100
     dynamic
      pcep
      !
      metric
       type delay
  !
  pcc
   pce address ipv4 1.1.1.10

The configuration to enable the SR PCE functionality on Node10 is shown in Example 12-8.

Example 12-8: SR PCE Node10's PCE configuration

pce
 address ipv4 1.1.1.10

SR PCE Node10 computes the path of Node1’s SR Policy GREEN and Node1 reports the state of the SR Policy path to this SR PCE. SR PCE Node10’s SR Policy database contains one entry, Node1’s SR Policy (PCC 1.1.1.1), shown in Example 12‑9. Node1’s SR Policy has a reported name cfg_GREEN_discr_100, where the prefix “cfg_” is added to the configured name GREEN to indicate that it is a configured SR Policy and avoid name collisions with controller-initiated SR Policies. The suffix “_discr_100” uses the preference of the candidate path (100 in this example) to differentiate between multiple configured candidate paths. The endpoint of this SR Policy is Node4 (destination 1.1.1.4) and its BSID is 15000 (Binding SID: 15000).

The flags in the PCEP information indicate the path is delegated to this SR PCE (D:1), the path is Administratively active (A:1) and operationally active (O:2). The meanings of these flags are specified in “PCEP Extensions for Stateful PCE” [RFC8231].

A reported and a computed path are shown for this SR Policy since Node10 itself computed this path (Computed path: (Local PCE)) and Node1 (PCC: 1.1.1.1) reported this path (Reported path).

Example 12-9: SR PCE Node10's SR Policy database – Node1's SR Policy

RP/0/0/CPU0:xrvr-10#show pce lsp pcc ipv4 1.1.1.1 detail

PCE's tunnel database:
----------------------
PCC 1.1.1.1:

Tunnel Name: cfg_GREEN_discr_100
 LSPs:
  LSP[0]:
   source 1.1.1.1, destination 1.1.1.4, tunnel ID 22, LSP ID 2
   State: Admin up, Operation up
   Setup type: Segment Routing
   Binding SID: 15000
   Maximum SID Depth: 10
   Bandwidth: signaled 0 kbps, applied 0 kbps
   PCEP information:
     PLSP-ID 0x80016, flags: D:1 S:0 R:0 A:1 O:2 C:0
   LSP Role: Single LSP
   State-sync PCE: None
   PCC: 1.1.1.1
   LSP is subdelegated to: None
   Reported path:
     Metric type: delay, Accumulated Metric 30
      SID[0]: Node, Label 16003, Address 1.1.1.3
      SID[1]: Adj, Label 24004, Address: local 99.3.4.3 remote 99.3.4.4
   Computed path: (Local PCE)
     Computed Time: Thu Oct 18 11:34:26 UTC 2018 (00:08:57 ago)
     Metric type: delay, Accumulated Metric 30
      SID[0]: Node, Label 16003, Address 1.1.1.3
      SID[1]: Adj, Label 24004, Address: local 99.3.4.3 remote 99.3.4.4
   Recorded path:
     None
   Disjoint Group Information: None

12.4 Consolidating a Multi-Domain Topology

The SR-TE DB is intrinsically multi-domain capable; it can contain and consolidate the topology of an entire multi-domain network. The multi-domain information in the SR-TE DB makes it possible to compute optimal end-to-end inter-domain paths.

A multi-domain network is a network topology that consists of multiple sub-topologies. An IGP area is an example of a sub-topology. A multi-area IGP network is therefore considered a multi-domain network. A network consisting of multiple interconnected Autonomous Systems is another example of a multi-domain network. The domains in the network can be isolated from each other, or prefixes can be redistributed or propagated between the domains.

The Segment Routing architecture can provide seamless unified end-to-end forwarding paths in multi-domain networks. SR turns the network into a unified stateless fabric.

In order to compute globally optimal inter-domain paths, knowledge of the end-to-end topology is needed. The term "globally optimal" indicates that the computed path is the best possible over the whole multi-domain network, as opposed to per-domain locally optimal paths that may not produce a globally optimal path when concatenated. This can be achieved by consolidating the different domain topologies in the multi-domain SR-TE DB, and computing paths on this consolidated topology.

Techniques that use per-domain path computations (e.g., "Path Comp. for Inter-Domain TE LSPs" [RFC5152]) do not provide this optimality; these techniques may lead to sub-optimal paths, making diverse or backup path computation hard, or may simply fail to find a path when one really does exist.

The importance of simple multi-domain TE

"From an end-to-end path point of view, the ability to unify multiple ASs into one logical network is critical, however, this ability was not there in the IP world before SR was introduced. With BGP EPE & BGP-LS, a controller can learn and stitch multiple AS networks together into one automatically, which lays out the foundation of massive SDN deployment. This ability is vital for 5G transport, which may consist of hundreds of ASs and thousands of devices in case of Tier-1 operators. In this case flexible end-to-end SR-TE (e.g., from cell site to packet core to content) is a must for 5G slicing. It is also vital for network and cloud synergy, especially where Cloud Native is the new norm and the requirement to use TE from network devices to containers becomes common, hereby most likely traverse multiple ASs."

— YuanChao Su

Assumptions and Requirements

Since the topology information of all domains is collected and inserted in the same SR-TE DB, a trust relationship between the domains is assumed (e.g., all domains belong to the same administrative entity). This is a common assumption for many multi-domain functionalities.

While the meaning of physical quantities, such as delay or bandwidth, could be assumed to be the same across different domains, this should be verified. For example, link-delay measurements should provide compatible and comparable metrics, not only across domains but also across devices within a domain, such as between devices of different vendors.

Other link attributes (such as IGP link metric, TE link metric, affinity color, SRLG, etc.) may have different meanings in different domains, possibly because they have been implemented by different organizations or authorities. When computing inter-domain paths, the PCE treats all the metrics and other link attributes as "global" (i.e., comparable between domains). This may not always be correct. For example, if the domains were originally administered by different organizations, then affinity link colors, metrics, etc. may have been assigned differently.

For domains that use different routing protocols, different metrics are used by default, notably ISIS versus OSPF. ISIS uses a default link metric of 10, while OSPF uses a default link metric of 1 for interfaces with 100 Mbps speed or higher.

12.4.1 Domain Boundary on a Node

The topology information (e.g., nodes, links) that a PCE receives from a domain enables it to construct the network graph of this domain. For a multi-domain network, the PCE receives multiple such individual domain topologies and inserts them in the multi-domain SR-TE DB. Figure 12-4 illustrates a three-domain network.

Figure 12-4: Example multi-domain topology

In order to compute end-to-end inter-domain paths, the different domains must be consolidated into a single network topology graph, by interconnecting the topology graphs of the individual domains. Two domain topologies can be interconnected on the nodes that are common to these two domains, the border nodes.

A node that participates in multiple domains is present in the topologies of all these domains. In order to identify this node as a single entity across the domains, such that it consolidates the domain topologies, a common network-wide node identifier is required. In a multi-area IGP instance, the ISIS system-id or OSPF router-id is used as common identifier. In the case of multiple IGP instances, the TE router-id is used instead.

The OSPF router-id and ISIS system-id are the natural node identifiers in a single multi-area network. Each area is inserted as a separate topology in the SR-TE DB. The Area Border Router ("L1L2 node" for ISIS) advertises its unique router-id ("system-id" for ISIS) in each of its areas. SR-TE finds multiple nodes with the same identifier in the SR-TE DB and identifies these node instances as the same node, thereby interconnecting the areas' topologies at this node.

In multi-domain networks running several IGP instances, the TE router-id is used as a network-wide unique identifier of a node. If two nodes are present in the SR-TE DB with the same TE router-id, they are really the same node. Figure 12-4 illustrates how the border nodes between the domains interconnect the domain topologies. Node1 in Figure 12-4 runs two IGP instances: an IGP instance for Domain1 and another IGP instance for Domain2. A PCE learns all topologies of this network. The topologies of Domain1 and Domain2 in this PCE's SR-TE DB both contain Node1 since it participates in both domains. SR-TE can identify Node1 in both domains since Node1 advertises the same TE router-id (1.1.1.1) in both domains.

While the OSPF router-id and ISIS system-id are the natural node identifiers in a single multi-area network, using the same IGP identifier (OSPF router-id or ISIS system-id) on multiple IGP instances is discouraged. It could inadvertently lead to problems due to duplicate protocol identifiers in the network.

Prevent duplicate OSPF router-ids or ISIS system-ids

It is recommended not to use identical ISIS system-ids on multiple ISIS instances of a node and not to use identical OSPF router-ids on multiple OSPF instances of a node. Using the same protocol identifier for two different instances of the same IGP, ISIS or OSPF, could be quite dangerous. The node would advertise the same identifier in two separate domains. If at a given moment, anywhere in the network, some "wires are crossed" by accident, directly connecting the domains to each other, then duplicate router-ids/system-ids appear in the network with all their consequences. It would cause misrouting and flooding storms as the LSPs/LSAs are re-originated by the two nodes that have the same identifier. The risk of this happening, however small, should be avoided.
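To illustrate the TE router-id as the common node identifier, the hedged sketch below shows a border node running two ISIS instances, one per domain. The instance names, NETs, and Instance-IDs are assumed values for this illustration. Both instances advertise the router-id of the same Loopback0 interface as TE router-id, allowing SR-TE to consolidate the two node instances, while the ISIS system-ids of the two instances are kept different, as recommended above.

interface Loopback0
 ipv4 address 1.1.1.1 255.255.255.255
!
router isis DOMAIN1
 !! assumed system-id 0000.0000.0001 in domain 1
 net 49.0001.0000.0000.0001.00
 distribute link-state instance-id 100
 address-family ipv4 unicast
  router-id Loopback0
!
router isis DOMAIN2
 !! different assumed system-id 0000.0000.1001 in domain 2
 net 49.0002.0000.0000.1001.00
 distribute link-state instance-id 200
 address-family ipv4 unicast
  router-id Loopback0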

12.4.2 Domain Boundary on a Link

A multi-AS network is a multi-domain network that consists of multiple Autonomous Systems (ASs) interconnected via BGP peering links. Assuming that Egress Peer Engineering (EPE) is enabled on the BGP peering sessions, BGP advertises these BGP peering sessions in BGP-LS as links, anchored to a local and a remote node. These anchor nodes are identified by their BGP router-id. More information about EPE is available in chapter 14, "SR BGP Egress Peer Engineering".

For the SR PCE to identify the EPE anchor node and the IGP ASBR as the same node, the BGP router-id of the EPE link anchor node and the TE router-id of the IGP ASBR must be equal.

Figure 12-5 illustrates a multi-AS network consisting of two ASs, AS1 and AS2, interconnected by two BGP peering links. EPE has been enabled on the BGP peering sessions and these are inserted in the SR-TE DB as "links" between the ASBRs. The IGP TE router-ids on the ASBRs have been configured to be the same as their BGP router-ids. For example, the TE router-id and the BGP router-id on Node1 are configured as 1.1.1.1.

Figure 12-5: Inter-AS BGP peering links

With this information in the SR-TE DB, SR-TE can interconnect the two ASs’ topologies via the EPE peering sessions. The anchor nodes of the bottom EPE “link”, Node1 and Node3, are identified by their BGP router-ids, 1.1.1.1 and 1.1.1.3 respectively. TE router-id 1.1.1.1 identifies Node1 in the AS1 topology. Because of the common identifier, SR-TE knows this is the same node. The same occurs for Node3. As a result, the two domains are consolidated in a single network topology. End-to-end inter-AS paths can be computed on this topology and the Peering-SID label is used to traverse the peering link.

Another possibility to provide inter-domain connectivity is configuring an SR Policy that steers traffic over the inter-domain link. This SR Policy can be reported in PCEP and advertised in BGP-LS. SR-TE can then provide inter-domain connectivity by including the BSID of the SR Policy in its solution SID list. At the time of writing, this solution is not available in IOS XR.

Figure 12-6 illustrates a two-node topology with a BGP peering link between the two nodes. ISIS is enabled on each node, but since ISIS is not enabled on the link they do not form an ISIS adjacency. Enabling BGP and ISIS on the nodes illustrates how SR-TE consolidates the node entries of the ISIS topology and the BGP EPE topology. Node1 is in Autonomous System (AS) number 1, Node2 is in AS2. A single-hop external BGP (eBGP) session is established between the nodes and Egress Peer Engineering (EPE) is enabled on this eBGP session.

Figure 12-6: Two-node BGP topology

The configuration of Node1 is shown in Example 12-10. The ISIS configuration is the same as in the previous section, except that the interface to Node2 is not configured under ISIS.

A single-hop eBGP session is configured to Node2's interface address 99.1.2.2 (line 26). EPE (egress-engineering) is enabled on this session (line 28). Address-family IPv4 unicast is enabled on this session, but other address-families can be configured with an equivalent configuration. Since the session is an external BGP session, ingress and egress route-policies must be configured. Here we have applied the route-policy PASS which allows all routes (lines 1-3 and 31-32).

Since we want to use this link to carry labeled traffic between the ASs, MPLS must be enabled on the interface. BGP automatically enables MPLS forwarding on the interface when enabling a labeled address-family (such as ipv4 labeled-unicast) under BGP. In this example, MPLS is explicitly enabled on Gi0/0/0/0 using the mpls static interface configuration (lines 34-35).

Example 12-10: Node1's configuration – EPE example

 1 route-policy PASS
 2   pass
 3 end-policy
 4 !
 5 interface Loopback0
 6  ipv4 address 1.1.1.1 255.255.255.255
 7 !
 8 router isis SR
 9  is-type level-2-only
10  net 49.0001.0000.0000.0001.00
11  distribute link-state instance-id 100
12  address-family ipv4 unicast
13   metric-style wide
14   router-id Loopback0
15   segment-routing mpls
16  !
17  interface Loopback0
18   passive
19   address-family ipv4 unicast
20    prefix-sid absolute 16001
21 !
22 router bgp 1
23  bgp router-id 1.1.1.1
24  address-family ipv4 unicast
25  !
26  neighbor 99.1.2.2
27   remote-as 2
28   egress-engineering
29   description # to Node2 #
30   address-family ipv4 unicast
31    route-policy PASS in
32    route-policy PASS out
33 !
34 mpls static
35  interface GigabitEthernet0/0/0/0

Example 12-11 shows the entry for Node1 in the SR-TE DB of an SR PCE that received the BGP-LS information from Node1 and Node2. Notice that SR-TE has consolidated Node1 of the ISIS topology with Node1 of the BGP topology since the ISIS TE router-id and the BGP router-id are the same (1.1.1.1), as described in section 12.4.2. The entry for Node1 in the SR-TE DB shows both ISIS and BGP properties.

The ISIS node elements in the output are the same as in the ISIS example above. The BGP ASN 1 and router-id 1.1.1.1 are shown with the node properties (line 8). The link is connected to Node2, which has BGP router-id 1.1.1.2 in AS 2 (line 24). The link is identified by its local and remote IP addresses, 99.1.2.1 and 99.1.2.2 respectively (line 20).

The link metrics (IGP, TE, and link-delay) are all 0 (line 25) and the affinity bitmap (Admin-groups) is also 0 (line 26). At the time of writing, the link metrics and attributes for an EPE link were not yet advertised in BGP-LS and get the value 0. The PeerNode-SID label 50012 (represented as Adj-SID of the EPE link) is shown, marked as (epe) in the output (line 27).

Example 12-11: Node1 as PCE – Entry for Node1 in SR-TE DB – EPE example

 1 RP/0/0/CPU0:sr-pce#show pce ipv4 topology bgp 1.1.1.1
 2
 3 PCE's topology database - detail:
 4 ---------------------------------
 5 Node 1
 6   TE router ID: 1.1.1.1
 7   Host name: xrvr-1
 8   BGP router ID: 1.1.1.1 ASN: 1
 9   ISIS system ID: 0000.0000.0001 level-2 ASN: 1
10   Prefix SID:
11     ISIS system ID: 0000.0000.0001 level-2 ASN: 1 domain ID: 100
12     Prefix 1.1.1.1, label 16001 (regular), flags: N
13   SRGB INFO:
14     ISIS system ID: 0000.0000.0001 level-2 ASN: 1
15     SRGB Start: 16000 Size: 8000
16   SRLB INFO:
17     ISIS system ID: 0000.0000.0001 level-2 ASN: 1
18     SRLB Start: 15000 Size: 1000
19
20   Link[0]: local address 99.1.2.1, remote address 99.1.2.2
21     Local node:
22       BGP router ID: 1.1.1.1 ASN: 1
23     Remote node:
24       BGP router ID: 1.1.1.2 ASN: 2
25     Metric: IGP 0, TE 0, Delay 0
26     Admin-groups: 0x00000000
27     Adj SID: 50012 (epe)

12.5 Summary

The SR-TE DB is an essential component of the SR-TE functionality that contains all the network information (nodes, links, prefixes, SR Policies) required to compute and validate paths. The information in the SR-TE DB is protocol-independent. It combines information retrieved from different sources (IGP, BGP-LS, PCEP).

The SR-TE DB on a headend node contains the local area information acquired from the IGP process. It uses this information to compute intra-area paths and to translate explicit paths expressed with Segment Descriptors into a list of MPLS labels.

The SR-TE DB on an SR PCE contains the entire multi-domain network topology, acquired via BGP-LS, possibly in combination with IGP. Information of BGP-only networks is collected via BGP-LS. The SR PCE consolidates the network topology information from different domains into a single network graph to compute optimal inter-domain paths.

The PCE's SR-TE DB also contains the active SR Policies in the network, acquired via PCEP or BGP-LS. This information allows the SR PCE to compute disjoint paths.

12.6 References

[SR-book-Part-I] "Segment Routing Part I", Clarence Filsfils, Kris Michielsen, Ketan Talaulikar, October 2016, ASIN: B01I58LSUO (Kindle), ISBN-10: 1542369126, ISBN-13: 978-1542369121

[draft-ietf-idr-bgpls-segment-routing-epe] "BGP-LS extensions for Segment Routing BGP Egress Peer Engineering", Stefano Previdi, Ketan Talaulikar, Clarence Filsfils, Keyur Patel, Saikat Ray, Jie Dong, draft-ietf-idr-bgpls-segment-routing-epe-18 (Work in Progress), March 2019

[RFC7752] "North-Bound Distribution of Link-State and Traffic Engineering (TE) Information Using BGP", Hannes Gredler, Jan Medved, Stefano Previdi, Adrian Farrel, Saikat Ray, RFC7752, March 2016

[draft-ietf-idr-te-lsp-distribution] "Distribution of Traffic Engineering (TE) Policies and State using BGP-LS", Stefano Previdi, Ketan Talaulikar, Jie Dong, Mach Chen, Hannes Gredler, Jeff Tantsura, draft-ietf-idr-te-lsp-distribution-10 (Work in Progress), February 2019

[draft-litkowski-pce-state-sync] "Inter Stateful Path Computation Element (PCE) Communication Procedures", Stephane Litkowski, Siva Sivabalan, Cheng Li, Haomian Zheng, draft-litkowski-pce-state-sync-05 (Work in Progress), March 2019

[draft-ketant-idr-bgp-ls-bgp-only-fabric] "BGP Link-State Extensions for BGP-only Fabric", Ketan Talaulikar, Clarence Filsfils, krishnaswamy ananthamurthy, Shawn Zandi, Gaurav Dawra, Muhammad Durrani, draft-ketant-idr-bgp-ls-bgp-only-fabric-02 (Work in Progress), March 2019

1. BGP-LS Producers are nodes that feed information, such as their IGP LS-DB, into BGP-LS. See chapter 17, "BGP-LS" for the different BGP-LS roles.
2. BGP Prefix-SIDs are described in Part I of the SR book series.

13 SR PCE

What we will learn in this chapter:

The SR-TE process can realize different roles, as the brain of a headend and as part of an SR Path Computation Element (SR PCE) server.
SR PCE is a network function component integrated in any IOS XR base software image¹. This functionality can be enabled on any IOS XR node, physical or virtual.
The SR PCE server is stateful, multi-domain capable (it computes and maintains inter-domain paths), and SR-optimized (using SR-native algorithms).
SR PCE is a network entity that provides computation services; it can compute and maintain paths on behalf of a headend node. As such, it extends the SR-TE capability of a headend by computing paths for cases where the headend node cannot compute, such as inter-domain paths or disjoint paths.
SR PCE provides a north-bound interface to the network for external applications.
PCEP high-availability mechanisms provide resiliency against SR PCE failures without impacting the SR Policies and traffic forwarding. Inter-PCE PCEP sessions can improve this resiliency.

In this chapter, we assume that the SR PCE communicates with its clients (SR-TE headends) via PCEP. BGP SR-TE is an alternate possibility introduced in the last section of this chapter and described in detail in chapter 19, "BGP SR-TE".

First, we explain that the SR-TE process is the brain of an SR PCE. Then we see that the SR PCE computes and statefully maintains paths on behalf of headend nodes, particularly for the use-cases that the headend cannot handle itself. We describe the PCEP protocol exchange between the SR PCE and its headend client. Next, we discuss the role of SR PCE as an interface to the network. We conclude by describing the SR PCE high availability mechanisms and showing a brief example of signaling an SR Policy path via BGP.

13.1 SR-TE Process

The SR-TE process is at the core of the SR-TE implementation. As indicated in chapter 1, "Introduction", it is a building block that can fulfill different roles:

Embedded in a headend node as the brain of the router, it provides SR-TE services to the local node.
In an SR PCE server, it provides SR-TE services to other nodes in the network.

While this chapter focuses on the latter, the SR PCE always has to interact with the SR-TE process running on the headend nodes, where the SR Policies are instantiated and the traffic steering takes place.

The capabilities of the SR-TE process in the SR PCE and in the headend are similar, but due to their different position and role in the network they use the SR-TE process components differently. These components are illustrated in Figure 13-1.

Figure 13-1: SR-TE Process

SR-TE DB: holds (multi-domain) topology information, SR information, SR Policies, and more. See chapter 12, "SR-TE Database" for more details.
Compute engine: dynamic path computations using SR-native algorithms. See chapter 4, "Dynamic Candidate Path" for more details.
Local SR Policy database: used on a headend to maintain, validate and select candidate paths from different sources for SR Policies. Also see chapter 2, "SR Policy".
On-Demand Nexthop (ODN): used on a headend to instantiate SR Policies on demand. See chapter 6, "On-Demand Nexthop" for more details.
Automated Steering (AS): used on a headend to automatically steer traffic into SR Policies. See chapter 5, "Automated Steering" for more details.

The SR-TE process interacts with various internal and external entities using different protocols and interfaces, such as the ones listed below. These interfaces are illustrated in Figure 13-1. The role of the protocol/interface depends on the role of the SR-TE process.

IGP: receive network information distributed by the IGP, distribute TE attributes via the IGP. See chapter 12, "SR-TE Database" for more details.
BGP-LS: receive topology and other network information and report SR Policy information. See chapter 17, "BGP-LS" and chapter 12, "SR-TE Database" for more details.
PCE Communication Protocol (PCEP): communication between SR PCE and SR PCC. See further in this chapter, chapter 12, "SR-TE Database", and chapter 18, "PCEP" for more details.
BGP SR-TE: BGP address-family for communication between SR PCE and SR PCC. See chapter 19, "BGP SR-TE" for more details.
NETCONF: data-model based communication between SR PCE and SR PCC, and between application and SR PCE.
REST: communication between application and SR PCE. See further in this chapter for more details.

SR-TE Process on the Headend

The SR-TE process on a headend node is in charge of managing the local policies and the associated traffic steering. Its interface with the IGP provides the SR-TE process with the local domain topology that it needs to perform its most common tasks, such as dynamically computing intra-domain SR Policy paths or resolving segment descriptors in explicit paths. This is illustrated in Figure 13-2, which only shows the IGP interface.

Figure 13-2: SR-TE Process on headend

SR-TE Process on the SR PCE

While many SR-TE use-cases can be solved by only using the SR-TE process that is embedded on the headend node, others require the delegation of SR-TE tasks to an external SR PCE. The headend then becomes a client of the SR PCE server. Figure 13‑3 illustrates the headend (PCC) and SR PCE server and their communication via PCEP.

Figure 13-3: SR-TE Processes on headend and on SR PCE

Here are some of the use-cases that require the use of an SR PCE.

Inter-domain paths: For end-to-end inter-domain path computations, the computing node needs to have knowledge about the different domains in the network. Architecturally, it is possible to feed the topology of the whole network to the headend node such that it can locally compute inter-domain paths. However, operators prefer to deploy dedicated SR PCEs for this functionality, as this provides a more scalable solution. The SR PCEs get the network-wide information via their BGP-LS feed. The headend nodes then use the services of these SR PCEs to compute and maintain these inter-domain paths.

Disjoint paths from distinct headends: Disjoint paths in a network should preferably be computed and maintained by a single entity. This method provides the highest probability that disjoint paths are found and that they are optimal. Disjoint paths from a single headend can be computed and maintained by the headend itself. For disjoint paths from distinct headends, an SR PCE is used as the single central entity that computes and maintains both disjoint paths on behalf of their respective headends.

North-bound interface to the network: A third-party application requiring access to the network to extract information, program the network devices, or steer traffic flows may leverage the SR PCE north-bound interfaces. Instead of directly accessing the individual devices in the network via various protocols and interfaces, that application can access them through the SR PCE, which provides a unified structured interface to the network. The SR PCE provides real-time topology and SR Policy status information via its REST API and it can instantiate, update, and delete SR Policy candidate paths via the same API.

13.2 Deployment

Figure 13-4 shows the network used for the illustrations in this chapter. Node1 is the headend node. It has a PCEP session with the SR PCE.

Figure 13-4: Network for PCEP protocol sequences illustrations

13.2.1 SR PCE Configuration

SR PCE is a built-in functionality in the IOS XR base software image. This functionality is available on any physical or virtual IOS XR node and it can be activated with a single configuration command.

In practice, the IOS XR based SR PCE can be deployed on a physical hardware router, but also on a virtual router such as the Cisco IOS XRv9000 router, as it involves control plane operations only.

Enabling SR PCE Mode

Example 13-1 shows the configuration to enable SR PCE functionality on an IOS XR node. The configured IP address, 1.1.1.10, indicates the local address that SR PCE uses for its PCEP sessions. The PCEP session can be authenticated by specifying a password.

Example 13-1: PCE PCEP configuration

pce
 address ipv4 1.1.1.10
 !! password encrypted 00071A150754   (optional)

Populating the SR-TE DB

The SR PCE needs to populate its SR-TE DB with the network topology. As described in chapter 12, "SR-TE Database", it can obtain and combine information from the IGP (limited to the local IGP area(s)) and from BGP-LS.

In the example single-area network, the SR PCE learns the topology via the IGP. The configuration on the SR PCE is shown for both IGPs (ISIS and OSPF) in Example 13-2. The IGP information feed to the SR-TE process is enabled with the distribute link-state configuration under the appropriate IGP. It is recommended to specify the instance-id with the distribute link-state configuration. In single-area networks, the instance-id can be left at its default value 0 (identifying the "Default Layer 3 Routing topology").

Example 13-2: Feed IGP information to PCE SR-TE

router isis SR
 distribute link-state instance-id 101
!
router ospf SR
 distribute link-state instance-id 102

To learn the topologies of remote areas and domains, BGP-LS is required. Example 13‑3 shows a basic configuration to feed BGP-LS information from BGP neighbor 1.1.1.11 into the SR-TE DB. It can be combined with the configuration in Example 13‑2. Further details are provided in chapter 12, "SR-TE Database" and chapter 17, "BGP-LS".

Example 13-3: Feed BGP-LS information to PCE SR-TE

router bgp 1
 address-family link-state link-state
 !
 neighbor 1.1.1.11
  remote-as 1
  update-source Loopback0
  address-family link-state link-state

13.2.2 Headend Configuration

Example 13-4 shows the configuration on an IOS XR headend to enable PCC PCEP functionality. In the example, the headend connects to an SR PCE at address 1.1.1.10. By default, the local PCEP session address is the address of the lowest numbered loopback interface. The source address of the PCEP session can be configured, and the PCEP sessions can be authenticated by specifying the password of the session under the PCE configuration.

Example 13-4: Headend PCEP configuration

segment-routing
 traffic-eng
  pcc
   !! source-address ipv4 1.1.1.1        (optional)
   pce address ipv4 1.1.1.10
    !! password encrypted 13061E010803   (optional)

By default, a headend only sends PCEP Report messages for SR Policies that it delegates to the SR PCE. SR Policy delegation is discussed further in this chapter. Sometimes it may be required for the SR PCEs to also learn about the paths that are not delegated, such as headend-computed paths or explicit paths. In that case, the configuration in Example 13-5 lets the headend report all its local SR Policies.

Example 13-5: Headend reports all local SR Policies

segment-routing
 traffic-eng
  pcc
   report-all

Verify PCEP Session

The command show pce ipv4 peer is used to verify the PCEP sessions on the SR PCE, as illustrated in Example 13‑6. In this example, one PCC with address 1.1.1.1 is connected (Peer address: 1.1.1.1) and its PCEP capabilities, as reported in its PCEP Open message, are shown in the output. This PCC supports the Stateful PCEP extensions (Update and Instantiation) and the Segment-Routing PCEP extensions. These extensions are specified in “PCEP Extensions for Stateful PCE” [RFC8231], “PCE-Initiated LSPs in Stateful PCE” [RFC8281], and draft-ietf-pce-segment-routing respectively. By adding the detail keyword to the show command (not illustrated here), additional information is provided, such as PCEP protocol statistics.

Example 13-6: Verify PCEP session on SR PCE

RP/0/0/CPU0:SR-PCE1#show pce ipv4 peer

PCE's peer database:
--------------------
Peer address: 1.1.1.1
  State: Up
  Capabilities: Stateful, Segment-Routing, Update, Instantiation

To verify the PCEP sessions on the headend, use the command show segment-routing traffic-eng pcc ipv4 peer, as illustrated in Example 13‑7. The PCEP capabilities, as reported in the PCEP Open message, are shown in the output. This PCE has precedence 255 on this headend, which is the default precedence value. The precedence indicates a relative preference between multiple configured PCEs, as will be discussed further in this chapter.

Example 13-7: Verify PCEP session on headend

RP/0/0/CPU0:iosxrv-1#show segment-routing traffic-eng pcc ipv4 peer

PCC's peer database:
--------------------
Peer address: 1.1.1.10, Precedence: 255, (best PCE)
  State up
  Capabilities: Stateful, Update, Segment-Routing, Instantiation

By adding the detail keyword to the show command (not illustrated here), additional information is provided, such as PCEP protocol statistics.

13.2.3 Recommendations

The SR PCE deployment model has similarities with the BGP RR deployment model. A PCE is a control plane functionality and, as such, does not need to be in the forwarding path. It can even be located remotely from the network it serves, for example as a virtual instance in a Data Center. Since SR PCE is an IOS XR functionality, it can be deployed on routers in the network where it could be managed as a regular IOS XR network node, or it can be enabled on virtual IOS XR instances on a server.

Although the SR PCE functionality provides centralized path computation services, it is not meant to be concentrated on a single box (a so-called “god box” or “all-seeing oracle in the sky”). Instead, this functionality should be distributed among multiple instances, each offering path computation services to a subset of headend nodes in the network while acting as a backup for another. We recommend that, for redundancy, every headend has a PCEP session to two SR PCE servers, one primary and one backup.

Example 13‑8 shows a configuration of a headend with two SR PCE servers, 1.1.1.10 and 1.1.1.11, configured with a different precedence. PCE 1.1.1.10 is the preferred PCE since it has the lowest precedence value. The headend uses this preferred PCE for its path computations. PCE 1.1.1.11 is the secondary PCE, used upon failure of the preferred SR PCE. More than two PCEs can be specified if desired.

Example 13-8: Headend configuration – multiple SR PCEs

segment-routing
 traffic-eng
  pcc
   pce address ipv4 1.1.1.10
    precedence 100
   !
   pce address ipv4 1.1.1.11
    precedence 200

As the network grows bigger, additional SR PCEs can be introduced to deal with the increasing load. Each new SR PCE is configured as the primary PCE on a subset of the headend nodes, and as secondary on another subset. Because every SR PCE has a complete view of the topology, it is able to serve any request from its connected headends.

For example, all the PEs within a region may share the same pair of SR PCEs. Half of these PEs use one SR PCE as primary (lowest precedence) and the other SR PCE as secondary, while the other half is configured the other way around, as sketched below. This distributes the load over both SR PCEs. Once the PCE scale limit is reached, a second pair of PCEs is introduced: half of the PEs use the first pair, half use the second. More pairs of PCEs can be added as required; the SR PCE solution is horizontally scalable.
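As a sketch of this mirrored arrangement (reusing the configuration style of Example 13‑8 and assuming the same two SR PCEs), a PE in the second half of the region would simply swap the precedence values, making SR PCE 1.1.1.11 its primary:

segment-routing
 traffic-eng
  pcc
   pce address ipv4 1.1.1.10
    precedence 200
   !
   pce address ipv4 1.1.1.11
    precedence 100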

13.3 Centralized Path Computation

When enabling the SR PCE functionality on a node, this node acts as a server providing path computation and path maintenance services to other nodes in the network.

Multi-Domain and Optimized for SR

An SR PCE is natively multi-domain capable. It maintains an SR-TE DB that can hold multiple domains and it can compute optimal end-to-end inter-domain paths. The SR PCE populates its SR-TE DB with multi-domain topology information obtained via BGP-LS, possibly combined with an IGP information feed (ISIS or OSPF) for its local IGP area. The SR-TE DB is described in more detail in chapter 12, "SR-TE Database". Using the information in its SR-TE DB, SR PCE applies its SR-optimized path computation algorithms to solve path optimization problems. The algorithms compute single-area and inter-domain paths and encode the computed path in an optimized segment list.

Stateful

SR PCE is an Active Stateful type of PCE. It not only computes paths on request of a client (headend), but also maintains them. It takes control of the SR Policy paths and updates them when required. The need for the SR PCE to control SR Policy paths is evident: if a headend node requests an SR PCE to compute a path, it is very likely that this headend cannot compute the path itself, and therefore it is also unable to validate the path and request a new one if the current path becomes invalid. An Active Stateful PCE uses the PCEP path delegation mechanism specified in “PCEP Extensions for Stateful PCE” [RFC8231] to update the parameters of a path that a client has delegated to it. This PCEP delegation mechanism is described further in this chapter.

13.3.1 Headend-Initiated Path

IETF draft-ietf-pce-segment-routing adds SR support to the PCEP protocol and “PCE Path Setup Type” [RFC8408] specifies how to signal the type of path, SR-TE or RSVP-TE. With these extensions, the existing PCEP procedures can be used to signal SR Policy paths.

PCEP has been extended further with additional SR-TE functionalities such as draft-sivabalan-pce-binding-label-sid, which specifies how to signal the BSID. More detailed PCEP protocol information is available in chapter 18, "PCEP".

Different PCEP packet sequences are used in the interaction between a headend (PCC) and an SR PCE to initiate an SR Policy path and to maintain this path during its lifetime. The SR Policy candidate path initiation can be done by configuration (CLI or NETCONF) or automatically through the ODN functionality (see chapter 6, "On-Demand Nexthop"). The candidate path is of dynamic type, with the pcep keyword instructing the headend to request the computation service of an SR PCE. This is the most likely use-case for an SP or Enterprise. Two different protocol sequences are possible for these headend-initiated, PCE-computed paths.

In the first variant, the headend (PCC) starts by requesting the SR PCE to compute a path using the stateless Request/Reply protocol exchange (as specified in “Path Computation Element (PCE) Communication Protocol (PCEP)” [RFC5440]). The headend sends a path Request message to the SR PCE with an optimization objective and a set of constraints. The SR PCE computes the path and returns the solution to the headend in a Reply message. The headend installs that path and then switches to stateful mode by sending a Report message for the path to the SR PCE with the Delegate flag (D-flag) set. This action effectively delegates control of the path to the SR PCE.

In the second variant, the PCC starts in stateful mode by immediately delegating control of the path to the SR PCE. To do so, the headend sends a Report message for the (empty) path to the SR PCE with the Delegate flag set. This variant is used by IOS XR headends and is the one illustrated in Figure 13‑5, with an SR Policy dynamic candidate path initiated from headend Node1. Node1’s configuration is shown in Example 13‑9. The configuration specifies to request the SR PCE to compute the path (using the keyword pcep). The dynamic path must optimize the TE metric (metric type te) and avoid links with affinity color name RED (affinity exclude-any name RED).

Example 13-9: Configuration on headend Node1

segment-routing
 traffic-eng
  affinity-map
   name RED bit-position 0
  !
  policy BLUE
   binding-sid mpls 15000
   color 10 end-point ipv4 1.1.1.4
   candidate-paths
    preference 100
     dynamic
      pcep
      metric
       type te
     !
     constraints
      affinity
       exclude-any
        name RED

After configuring the SR Policy, the headend Node1 sends the following PCEP Report (PCRpt) message (marked ➊ in Figure 13‑5) to the SR PCE:

PCEP Report:
- SR Policy status: Administratively Up, Operationally Down, Delegate flag set
- Path setup type: SR
- Endpoints: Headend and endpoint identifiers 1.1.1.1 and 1.1.1.4
- Symbolic name: cfg_BLUE_discr_100
- BSID: 15000
- Segment list: empty
- Optimization objective: metric type TE
- Constraints: exclude-any RED links

Figure 13-5: PCEP protocol sequence – Report/Update/Report

With this message Node1 delegates this path to the SR PCE. After receiving this message, SR PCE finds the empty segment list and computes the path according to the optimization objective and constraints that are specified in this Report message. If, for any reason, the SR PCE needs to reject the delegation, it sends an empty Update message with the Delegate flag set to 0.

After the computation (➋), SR PCE requests Node1 to update the path by sending the following PCEP Update (PCUpd) message to Node1 (➌):

PCEP Update:
- SR Policy status: Desired Admin Status Up, Delegate flag
- Path setup type: SR
- Segment list: <16003, 24034>

Headend Node1 installs the SR Policy path in the forwarding table (➍), and sends a status report to SR PCE in the following PCEP Report (PCRpt) message (➎):

PCEP Report:
- SR Policy status: Administratively Up, Operationally Up, Delegate flag set
- Path setup type: SR
- Endpoints: Headend and endpoint identifiers 1.1.1.1 and 1.1.1.4
- Symbolic name: cfg_BLUE_discr_100
- BSID: 15000
- Segment list: <16003, 24034>
- Optimization objective: metric type TE
- Constraints: exclude-any RED links

The status of the SR Policy is shown in the output of Example 13‑10.

Example 13-10: Status SR Policy on headend Node1

RP/0/0/CPU0:xrvr-1#show segment-routing traffic-eng policy

Color: 10, End-point: 1.1.1.4
  Name: srte_c_10_ep_1.1.1.4
  Status:
    Admin: up  Operational: up for 00:51:55 (since Aug  9 13:30:20.164)
  Candidate-paths:
    Preference: 100 (configuration) (active)
      Name: BLUE
      Requested BSID: 15000
      PCC info:
        Symbolic name: cfg_BLUE_discr_100
        PLSP-ID: 30
      Dynamic (pce 1.1.1.10) (valid)
        Metric Type: TE, Path Accumulated Metric: 30
          16003 [Prefix-SID, 1.1.1.3]
          24034 [Adjacency-SID, 99.3.4.3 - 99.3.4.4]
  Attributes:
    Binding SID: 15000 (SRLB)
    Forward Class: 0
    Steering BGP disabled: no
    IPv6 caps enable: yes

After completing this protocol exchange, the headend has installed the path and has delegated control of the path to the SR PCE. The SR PCE is now responsible for maintaining this path and may autonomously update it using the PCEP sequence described below. Following a topology change, the SR PCE re-computes the delegated paths and updates them if necessary. Any time that the SR PCE finds that the current path differs from the desired path, SR PCE requests the headend to update the path.

The illustration in Figure 13‑6 continues the example. Node1 has delegated the SR Policy path with endpoint Node4 to the SR PCE. The link between Node3 and Node4 fails and SR PCE is notified of this topology change via the IGP (marked ➊ in Figure 13‑6). The topology change triggers SR PCE to recompute this path (➋). The new path is encoded in a new SID list.

Figure 13-6: PCEP protocol sequence – Update/Report

After computing the new path, SR PCE requests Node1 to update the path by sending the following PCEP Update (PCUpd) message to Node1 (➌):

PCEP Update:
- SR Policy status: Desired Admin Status Up, Delegate flag
- Path setup type: SR
- Segment list:

Headend Node1 installs the SR Policy path in the forwarding table (➍), and sends a status report to SR PCE in the following PCEP Report (PCRpt) message (➎):

PCEP Report:
- SR Policy status: Administratively Up, Operationally Up, Delegate flag set
- Path setup type: SR
- Endpoints: Headend and endpoint identifiers 1.1.1.1 and 1.1.1.4
- Symbolic name: cfg_BLUE_discr_100
- BSID: 15000
- Segment list:
- Optimization objective: metric type TE
- Constraints: exclude-any RED links

13.3.2 PCE-Initiated Path

An operator can configure an SR Policy definition on an SR PCE. The SR PCE then initiates this SR Policy on the specified headend and maintains it. Example 13‑11 illustrates the SR PCE configuration of an SR Policy named BROWN to be initiated on headend Node1 with address 1.1.1.1 (peer ipv4 1.1.1.1). This SR Policy has color 50, endpoint 1.1.1.4 and BSID 15111. The dynamic candidate path has preference 100 and optimizes the TE metric. While this example SR Policy has a single dynamic candidate path, SR Policies with multiple candidate paths and explicit paths can be configured on the PCE.

Example 13-11: SR Policy configuration on SR PCE

pce
 segment-routing
  traffic-eng
   peer ipv4 1.1.1.1
    policy BROWN
     binding-sid mpls 15111
     color 50 end-point ipv4 1.1.1.4
     candidate-paths
      preference 100
       dynamic
        metric
         type te

The SR PCE in Figure 13‑7 computes the dynamic path (➊) and sends the following PCEP Initiate (PCInit) message with the solution SID list to headend Node1 (➋):

PCEP Initiate:
- SR Policy status: Desired Admin Status Up, Delegate flag, Create flag set
- Path setup type: SR
- Endpoints: Headend and endpoint identifiers 1.1.1.1 and 1.1.1.4
- Color: 50
- Symbolic name: BROWN
- BSID: 15111
- Preference: 100
- Segment list: <16003, 24004>

Figure 13-7: PCE-initiated path – Initiate/Report PCEP protocol sequence

Headend Node1 installs the SR Policy path in the forwarding table (➌), and sends a status report to SR PCE in the following PCEP Report (PCRpt) message (➍):

PCEP Report:
- SR Policy status: Administratively Up, Operationally Up, Delegate flag set
- Path setup type: SR
- Endpoints: Headend and endpoint identifiers 1.1.1.1 and 1.1.1.4
- Symbolic name: BROWN
- BSID: 15111
- Segment list: <16003, 24004>

With this message the headend confirms that the path has been installed as instructed and delegates control to the SR PCE. The resulting SR Policy on Node1 is shown in Example 13‑12.

Example 13-12: SR Policy candidate path on the headend, as initiated by the SR PCE configuration

RP/0/0/CPU0:xrvr-1#show segment-routing traffic-eng policy

SR-TE policy database
---------------------

Color: 50, End-point: 1.1.1.4
  Name: srte_c_50_ep_1.1.1.4
  Status:
    Admin: up  Operational: up for 00:00:15 (since Jul 30 38:37:12.967)
  Candidate-paths:
    Preference: 100 (PCEP) (active)
      Name: BROWN
      Requested BSID: 15111
      PCC info:
        Symbolic name: BROWN
        PLSP-ID: 3
      Dynamic (pce 1.1.1.10) (valid)
        16003 [Prefix-SID, 1.1.1.3]
        24004 [Adjacency-SID, 99.3.6.3 - 99.3.6.6]
  Attributes:
    Binding SID: 15111 (SRLB)
    Forward Class: 0
    Steering BGP disabled: no
    IPv6 caps enable: yes

The SR PCE maintains this path in the same manner as a headend-initiated path. When the SR Policy configuration on SR PCE is updated or removed, the SR Policy candidate path on the headend is updated or deleted accordingly using PCEP.

13.4 Application-Driven Path

A likely use-case in the WEB/OTT market sees a central application programming the network with SR Policies. In that case, the central application likely collects the view of the network using the SR PCE’s north-bound interface. This interface to the network is available to any application. At the time of writing, the available interfaces are REST and NETCONF.

North-bound and south-bound interfaces

A north-bound interface allows a network component to communicate with a higher-level component, while a south-bound interface allows a network component to communicate with a lower-level component. The north-bound and south-bound terminology refers to the typical architectural drawing, as in Figure 13‑8, where the north-bound interface is drawn on top (“on the North side”) of the applicable component, SR PCE in this case, and the south-bound interface is drawn below it (“on the South side”).

Figure 13-8: North- and south-bound interfaces

For a controller, examples of the north-bound interface are: REST (Representational State Transfer) and NETCONF (Network Configuration Protocol). Controller south-bound interface examples are PCEP, BGP, classic XML, NETCONF.

Using the REST API, an application can request SR PCE to provide topology and SR Policies information. An application can do a GET of the following REST URLs to retrieve the information:

Topology URL: http://:8080/topo/subscribe/json
SR Policy URL: http://:8080/lsp/subscribe/json

The returned topology information is the equivalent of the output of the command show pce ipv4 topology.

When specifying “json” in the URL, the information is returned in telemetry format, which has base64 encoded Google Protocol Buffers (GPB) data wrapped into JSON object(s). The data is encoded using the encoding path Cisco-IOS-XR-infra-xtc-oper:pce/topology-nodes/topology-node as can be found in the YANG module Cisco-IOS-XR-infra-xtc-oper.yang. Other encoding formats (txt, xml) are possible.

The returned SR Policy information is the equivalent of the output of the command show pce lsp. When specifying json format, the information is returned in telemetry format using the encoding path Cisco-IOS-XR-infra-xtc-oper:pce/tunnel-detail-infos/tunnel-detail-info as can be found in the YANG module Cisco-IOS-XR-infra-xtc-oper.yang. Other encoding formats (txt, xml) are possible.

An application can request a snapshot of the current information or subscribe to a continuous real-time feed of topology and SR Policy updates. If an application subscribes to the continuous topology feed, it will automatically receive (push model) the new topology information of each changed node. Equivalently, if subscribing to the SR Policy feed, the application automatically receives the new information of each changed SR Policy.
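As an illustration of such a subscription, the following Python sketch consumes the continuous topology feed. It assumes the SR PCE is reachable at 1.1.1.10, that the feed delivers one JSON object per line, and that the base64-encoded GPB payload sits in a field named "data"; the framing and field names are assumptions for illustration, not taken from the REST API specification.

import base64
import json

import requests  # third-party HTTP client, assumed to be installed

PCE_ADDRESS = "1.1.1.10"  # SR PCE address used in this chapter's topology

# Subscribe to the continuous topology feed; stream=True keeps the HTTP
# connection open so that updates are pushed to us as nodes change.
url = f"http://{PCE_ADDRESS}:8080/topo/subscribe/json"
with requests.get(url, stream=True) as response:
    response.raise_for_status()
    for line in response.iter_lines():
        if not line:
            continue
        update = json.loads(line)                    # assumed: one JSON object per line
        gpb_blob = base64.b64decode(update["data"])  # assumed field name for the GPB payload
        # The decoded bytes are GPB telemetry keyed by the encoding path
        # Cisco-IOS-XR-infra-xtc-oper:pce/topology-nodes/topology-node;
        # decoding them further requires the matching YANG/proto definitions.
        print(f"received topology update of {len(gpb_blob)} bytes")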

Increasing flexibility and reducing complexity

“With centralized TE controller and SR policy, now we can dramatically reduce the complexity for our traffic engineering deployment. We can push the traditional edge router function to be inside of the data center and control the path from our server gateway. This not only increases the flexibility of path selection, but also reduces the complexity of the WAN routers.”

— Dennis Cai

After the central application collects the network information, it computes the required SR Policy paths and deploys these paths using the SR PCE’s north-bound interface. The SR PCE then initiates these paths via its south-bound PCEP interface. If the SR Policy does not yet exist on the headend, then the headend instantiates it with the specified path as candidate path. If the SR Policy already exists on the headend, then the PCEP-initiated SR Policy candidate path is added to the other candidate paths of the SR Policy. The selection of the active candidate path for an SR Policy is based on the preference of these paths, as described in chapter 2, "SR Policy". If the PCEP-initiated candidate path has the highest preference, it will be selected as active path and override any lower preference candidate paths of the SR Policy.

PCEP-initiated candidate paths are ephemeral. They are not stored in the running configuration of the headend node. This does not mean that they are deleted when the connection to the SR PCE is lost, as described in the SR PCE high availability section 13.5.

In Figure 13‑9 an application instructs the SR PCE to initiate an explicit SR Policy path to Node4 on headend Node1 using the SR PCE’s North-bound API2 (marked ➊ in Figure 13‑9). Following this request, the SR PCE sends the following PCEP Initiate (PCInit) message to headend Node1 (➋):

PCEP Initiate:
- SR Policy status: Desired Admin Status Up, Delegate flag, Create flag set
- Path setup type: SR
- Endpoints: Headend and endpoint identifiers 1.1.1.1 and 1.1.1.4
- Color: 30
- Symbolic name: GREEN
- BSID: 15001
- Preference: 100
- Segment list: <16003, 24034>

Figure 13-9: PCEP protocol sequence – Initiate/Report

Headend Node1 installs the SR Policy path in the forwarding table (➌), and sends a status report to SR PCE in the following PCEP Report (PCRpt) message (➍):

PCEP Report:
- SR Policy status: Administratively Up, Operationally Up, Delegate flag set
- Path setup type: SR
- Endpoints: Headend and endpoint identifiers 1.1.1.1 and 1.1.1.4
- Symbolic name: GREEN
- BSID: 15001
- Segment list: <16003, 24034>

With this message the headend confirms that the path has been installed as instructed and delegates control to the SR PCE. SR PCE stores the status information in its database and feeds the path information to the application via its North-bound interface (➎). The status of the SR Policy on headend Node1 is shown in Example 13‑13.

Example 13-13: Status PCE-initiated SR Policy on headend Node1

RP/0/0/CPU0:xrvr-1#show segment-routing traffic-eng policy

SR-TE policy database
---------------------

Color: 30, End-point: 1.1.1.4
  Name: srte_c_30_ep_1.1.1.4
  Status:
    Admin: up  Operational: up for 00:51:55 (since Aug  9 14:29:38.540)
  Candidate-paths:
    Preference: 100 (PCEP) (active)
      Name: GREEN
      Requested BSID: 15001
      PCC info:
        Symbolic name: GREEN
        PLSP-ID: 3
      Dynamic (pce 1.1.1.10) (valid)
        16003 [Prefix-SID, 1.1.1.3]
        24034 [Adjacency-SID, 99.3.4.3 - 99.3.4.4]
  Attributes:
    Binding SID: 15001 (SRLB)
    Forward Class: 0
    Steering BGP disabled: no
    IPv6 caps enable: yes

The central application is now responsible for maintaining the path. It receives a real-time feed of topology and SR Policy information via the SR PCE north-bound interface and it can update the path as required via this north-bound interface.

The application can also delete the path via the SR PCE’s North-bound API. After receiving the path delete request from the application, the SR PCE sends a PCEP Initiate message with the Remove flag (R-flag) set and the headend acknowledges with a PCEP Report message that the SR Policy path has been removed. If the deleted path was the last candidate path of the SR Policy, then the headend also deletes the SR Policy. Otherwise, the headend follows the selection procedure to select a new active path.

13.5 High-Availability

An SR PCE populates its SR-TE DB via BGP-LS, or via IGP for its local IGP area. Both BGP-LS and IGP are distributed mechanisms that have their own well-seasoned high availability mechanisms. Since the SR PCEs individually tap into these information feeds, typically via redundant connections, the SR-TE DBs of the SR PCEs are independently and reliably synchronized to a common source of information. SR PCE uses the native PCEP high-availability capabilities to recover from PCEP failure situations. These are described in the next sections.

13.5.1 Headend Reports to All PCEs

As stated in the previous section, we recommend that a headend node connects to a pair of SR PCEs for redundancy. The headend specifies one SR PCE as primary and the other as secondary. The headend establishes a PCEP session to both SR PCEs, but it only uses the most preferred one for path computations if both are available.

Whenever the headend sends out a PCEP Report message reporting the state of an SR Policy path, it sends it to all connected SR PCEs, but only sets the Delegate flag (D-flag) in the Report message for the PCE that computed or initiated the SR Policy path. The other SR PCEs receive the Report message with the Delegate flag unset. This way, the headend keeps the SR Policy databases of all connected SR PCEs synchronized, while delegating the path to a single SR PCE. This mechanism allows the less preferred SR PCEs to operate in hot-standby mode; no state synchronization is needed at the time the preferred SR PCE fails.

The PCEP state synchronization mechanism is illustrated in Figure 13‑10. Two SR PCEs are configured on Node1, SR PCE1 (1.1.1.10) and SR PCE2 (1.1.1.11), with SR PCE1 being the most preferred (lowest precedence 100). The corresponding configuration of Node1 is displayed in Example 13‑14.

Example 13-14: Configuration of Node1 – two SR PCEs

segment-routing
 traffic-eng
  policy POL1
   color 10 end-point ipv4 1.1.1.4
   candidate-paths
    preference 100
     dynamic
      pcep
      !
      metric
       type te
  !
  pcc
   pce address ipv4 1.1.1.10
    precedence 100
   !
   pce address ipv4 1.1.1.11
    precedence 200

Example 13‑15 shows the status of the PCEP sessions to both SR PCEs on headend Node1. Both sessions are up and SR PCE1 (1.1.1.10) is selected as the primary PCE (best PCE).

Example 13-15: PCEP peers’ status on Node1

RP/0/0/CPU0:xrvr-1#show segment-routing traffic-eng pcc ipv4 peer

PCC's peer database:
--------------------
Peer address: 1.1.1.10, Precedence: 100, (best PCE)
  State up
  Capabilities: Stateful, Update, Segment-Routing, Instantiation

Peer address: 1.1.1.11, Precedence: 200
  State up
  Capabilities: Stateful, Update, Segment-Routing, Instantiation

The operator configures SR Policy POL1 on Node1 with a single dynamic candidate path, as shown in Example 13‑14. The operator indicates to use PCEP to compute this dynamic path, optimizing the TE-metric. Node1 uses its primary SR PCE1 for path computations.

After configuring this SR Policy on Node1, Node1 sends a PCEP Report message with an empty segment list to both SR PCE1 and SR PCE2, as indicated by ➊ and ➋ in Figure 13‑10. These Report messages are identical, except for the Delegate flag. Node1 delegates the path to its primary SR PCE1 by setting the Delegate flag (D-flag = 1) in this Report message (➊), while the Delegate flag in the Report message sent to SR PCE2 (➋) is unset (D-flag = 0).

Since the path is delegated to SR PCE1, it computes the path (➌) and replies to Node1 with the solution in a PCEP Update message (➍). Node1 installs the path (➎) and sends a Report message to both SR PCE1 (➏) and SR PCE2 (➐), again only setting the Delegate flag in the Report to SR PCE1. Both SR PCE1 and SR PCE2 are now aware of the status of the SR Policy path on headend Node1, but only SR PCE1 has control over it.

Figure 13-10: Headend Node1 sends PCEP Report to all its connected PCEs

Example 13‑16 shows the SR Policy status output of SR PCE1 and Example 13‑17 shows the status output of SR PCE2. Both SR PCEs show the Reported path, which is the SID list <16003, 24034>, as shown in lines 24 to 27 in both outputs. The two outputs are almost identical, except on two aspects. First, SR PCE1 also shows the Computed path (lines 28 to 32 in Example 13‑16), while this information is not displayed on SR PCE2 (Example 13‑17) since SR PCE2 did not compute this path. Second, the Delegate flag is set on SR PCE1 (flags: D:1 on line 19) but not on SR PCE2 (flags: D:0 on line 19), since the path is only delegated to SR PCE1.

Example 13-16: SR Policy status on primary SR PCE1

 1 RP/0/0/CPU0:xrvr-10#show pce lsp detail
 2
 3 PCE's tunnel database:
 4 ----------------------
 5 PCC 1.1.1.1:
 6
 7 Tunnel Name: cfg_POL1_discr_100
 8  LSPs:
 9   LSP[0]:
10    source 1.1.1.1, destination 1.1.1.4, tunnel ID 10, LSP ID 1
11    State: Admin up, Operation up
12    Setup type: Segment Routing
13    Binding SID: 40006
14    Maximum SID Depth: 10
15    Absolute Metric Margin: 0
16    Relative Metric Margin: 0%
17    Bandwidth: signaled 0 kbps, applied 0 kbps
18    PCEP information:
19      PLSP-ID 0x80001, flags: D:1 S:0 R:0 A:1 O:1 C:0
20    LSP Role: Single LSP
21    State-sync PCE: None
22    PCC: 1.1.1.1
23    LSP is subdelegated to: None
24    Reported path:
25      Metric type: TE, Accumulated Metric 30
26       SID[0]: Node, Label 16003, Address 1.1.1.3
27       SID[1]: Adj, Label 24034, Address: local 99.3.4.3 remote 99.3.4.4
28    Computed path: (Local PCE)
29      Computed Time: Mon Aug 13 19:21:31 UTC 2018 (00:13:06 ago)
30      Metric type: TE, Accumulated Metric 30
31       SID[0]: Node, Label 16003, Address 1.1.1.3
32       SID[1]: Adj, Label 24034, Address: local 99.3.4.3 remote 99.3.4.4
33    Recorded path:
34      None
35    Disjoint Group Information:
36      None

Example 13-17: SR Policy status on secondary SR PCE2

 1 RP/0/0/CPU0:xrvr-11#show pce lsp detail
 2
 3 PCE's tunnel database:
 4 ----------------------
 5 PCC 1.1.1.1:
 6
 7 Tunnel Name: cfg_POL1_discr_100
 8  LSPs:
 9   LSP[0]:
10    source 1.1.1.1, destination 1.1.1.4, tunnel ID 10, LSP ID 1
11    State: Admin up, Operation up
12    Setup type: Segment Routing
13    Binding SID: 40006
14    Maximum SID Depth: 10
15    Absolute Metric Margin: 0
16    Relative Metric Margin: 0%
17    Bandwidth: signaled 0 kbps, applied 0 kbps
18    PCEP information:
19      PLSP-ID 0x5, flags: D:0 S:0 R:0 A:1 O:1 C:0
20    LSP Role: Single LSP
21    State-sync PCE: None
22    PCC: 1.1.1.1
23    LSP is subdelegated to: None
24    Reported path:
25      Metric type: TE, Accumulated Metric 30
26       SID[0]: Node, Label 16003, Address 1.1.1.3
27       SID[1]: Adj, Label 24034, Address: local 99.3.4.3 remote 99.3.4.4
28    Computed path: (Local PCE)
29      None
30      Computed Time: Not computed yet
31    Recorded path:
32      None
33    Disjoint Group Information:
34      None

13.5.2 Failure Detection

The liveness of the SR PCE having delegation over SR Policy paths is important. This SR PCE is responsible for maintaining the delegated paths and updating them if required, such as following a topology change. Failure to do so can result in sub-optimal routing and packet loss.

A headend verifies the liveness of a PCEP session using a PCEP keepalive timer and dead timer. By default, the headend and SR PCE send a PCEP message (Keepalive or other) at least every 30 seconds. This time period is the keepalive timer interval. The keepalive timer is restarted every time a PCEP message (of any type) is sent. If no PCEP message has been sent for the keepalive time period, the keepalive timer expires and a PCEP Keepalive message is sent.

A node uses the dead timer to detect if the PCEP peer is still alive. A node restarts the session’s dead timer when receiving a PCEP message (of any type). The default dead timer interval is 120 seconds. When not receiving any PCEP message for the dead-timer time period, the dead timer expires and the PCEP session is declared down.

Keepalive/dead timers are exchanged in the PCEP Open message. The recommended (and as mentioned IOS XR default) values are 30 seconds for the keepalive timer, and four times the keepalive value (120 seconds) for the dead timer. For further details please see chapter 18, "PCEP" and RFC5440.

The keepalive and dead timers are configurable on PCC and PCE, as shown in Example 13‑18 and Example 13‑19. On the PCC the keepalive timer interval is configured as 60 seconds and the dead timer interval as 180 seconds. On the PCE the keepalive interval is configured as 60 seconds and the dead timer is left at its default value.

Example 13-18: PCC timer configuration

segment-routing
 traffic-eng
  pcc
   pce address ipv4 1.1.1.10
   !
   timers keepalive 60
   timers deadtimer 180

Example 13-19: PCE timer configuration

pce
 address ipv4 1.1.1.10
 !
 timers keepalive 60

The PCC also tracks reachability of the SR PCE’s address in the forwarding table. When the SR PCE’s address becomes unreachable, then the PCEP session to that SR PCE is brought down immediately without waiting for the expiration of the dead-timer.

13.5.3 Headend Re-Delegates Paths to Alternate PCE Upon Failure

It is important to note that the failure of the PCEP session to an SR PCE has no immediate impact on the status of the SR Policies and the traffic forwarded into these SR Policies, even for a failure of the primary SR PCE to which the headend has delegated the paths.

After the headend detects the failure of an SR PCE using the mechanisms of the previous section, it attempts to re-delegate the paths maintained by this failed SR PCE to an alternate SR PCE. Two cases are distinguished: headend-initiated paths and application-initiated paths.

13.5.3.1 Headend-Initiated Paths

The redelegation mechanism described in this section is specified in “PCEP Extensions for Stateful PCE” [RFC8231]. When a headend detects that the PCEP session to an SR PCE has failed, it starts a Redelegation timer. This is the time to wait before redelegating paths to another SR PCE. At the same time the headend starts a State-Timeout timer, which is the maximum time to keep SR Policy path states after detecting the PCEP session failure.

The Redelegation timer allows some time for the primary PCEP session to recover after the failure. If it recovers before the Redelegation timer has expired, both the path state and the delegation state remain unmodified, as if the failure had never occurred. For headend-initiated paths, the Redelegation timer in IOS XR is fixed to 0. After detecting a PCEP session failure, the headend immediately redelegates the headend-initiated paths that were delegated to the failed SR PCE to the next preferred SR PCE. When the delegation succeeds, the new delegate SR PCE verifies the path and updates it if needed. Assuming that the SR PCEs have a synchronized SR-TE DB and use similar path computation algorithms, no path changes are expected after a re-delegation.

Paths that cannot be redelegated to an alternate SR PCE are called orphaned paths. When the State-Timeout timer expires, the headend invalidates the orphaned paths. The usual SR Policy path selection procedure is then used to select the highest preference valid candidate path of the SR Policy as active path. If no other valid candidate path is available, then the SR Policy is torn down. The default state-timeout timer in IOS XR is 120 seconds and is configurable on the headend as delegation-timeout.
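As a hedged configuration sketch, lowering this state-timeout on the headend could look as follows, assuming the delegation-timeout keyword sits alongside the other PCC timers shown in Example 13‑18 (verify the exact syntax on your software release):

segment-routing
 traffic-eng
  pcc
   timers delegation-timeout 30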

If the PCEP session to a more preferred SR PCE comes up, either when a failed SR PCE is restored or when a new one with a lower precedence is configured, then the headend immediately redelegates its SR Policy paths to this more preferred SR PCE.

Primary PCE Failure Illustration

The redelegation mechanism for headend-initiated paths is illustrated in Figure 13‑11. The initial situation of this illustration is the final situation of Figure 13‑10 where Node1 has delegated its SR Policy to its primary SR PCE1. At a given time, SR PCE1 becomes unreachable and Node1 brings down the PCEP session to SR PCE1 (➊). This can be due to expiration of the dead-timer or triggered by the removal of SR PCE1’s address from the routing table. Upon detecting the PCEP session failure, Node1 starts its State-Timeout timer and immediately attempts to redelegate its SR Policy path to its other connected PCE, SR PCE2. To redelegate the path, Node1 sends a PCEP Report message for the SR Policy path to SR PCE2 with the Delegate flag set (➋). In the meantime, Node1 preserves the SR Policy path state and traffic steering as they were before the PCEP failure. SR PCE2 accepts the delegation (by not rejecting it) and computes the SR Policy path to validate its correctness (➌). If the computed path differs from the reported path, then SR PCE2 instructs Node1 to update the SR Policy with the new path. In the illustration we assume that the failure does not impact the SR Policy path. Therefore, SR PCE2 does not need to update the path.

Figure 13-11: Workflow after failure of the preferred SR PCE

When SR PCE1 recovers, Node1 redelegates the SR Policy path to SR PCE1 since it is the more preferred SR PCE. The message exchange is equivalent to that of Figure 13‑11, but towards SR PCE1 instead of SR PCE2.

13.5.3.2 Application-Driven Paths

The redelegation mechanism described in this section is specified in RFC 8281 (PCEP extensions for PCE-initiated paths). While it is the application that initiates paths and maintains them, we describe the behaviors from a PCEP point of view, where the SR PCE initiates the paths on the headend. The SR PCE that has initiated a given SR Policy path is called the “owner SR PCE”. The headend delegates the SR PCE-initiated path to the owner SR PCE and keeps track of this owner.

Upon failure of the owner SR PCE, the paths should be redelegated to another SR PCE. The redelegation behavior is PCE centric. The headend does not simply redelegate an SR PCE-initiated path to the next preferred SR PCE after the owner SR PCE fails. Indeed, in contrast with the headend-initiated path, the headend cannot know which PCE is capable of maintaining the PCE-initiated path. Instead, another SR PCE can take the initiative to adopt the orphan path. This SR PCE requests the headend to delegate an orphaned SR PCE-initiated path to it by sending a PCEP Initiate message identifying the orphan path it wants to adopt to the headend.

Same as for the headend-initiated paths, the procedure is governed by two timers on the headend, the redelegation timer and the state-timeout timer. In IOS XR these timers differ from the timers with the same name used for the headend-initiated paths. They are configured separately. The redelegation timer interval is the time the headend waits for the owner PCE to recover. The state-timeout timer interval is the time the headend waits before cleaning up the orphaned SR Policy path. These two timers are started when the PCEP session for the owner PCE goes down.

If the owner SR PCE disconnects and reconnects before the redelegation timer expires, then the headend automatically redelegates the path to the owner SR PCE. Until this timer expires, only the owner SR PCE can re-gain ownership of the path. If the redelegation timer expires before the owner SR PCE re-gains ownership, the path becomes an orphan. At that point in time any SR PCE, not only the owner SR PCE, can request ownership of the orphan path (“adopt” it) by sending a proper PCEP Initiate message with the path identifier to the headend. The path is identified by the PCEP-specific LSP identifier (PLSP-ID), as described in chapter 18, "PCEP".

The redelegation timer for SR PCE-initiated paths is named initiated orphan timer in IOS XR. The redelegation timer is configurable, and its default value is 3 minutes.

Before the state-timeout timer expires, the SR Policy path is kept intact, and traffic is forwarded into it. If this timer expires before any SR PCE claims ownership of the path, the headend brings down and removes the path. The usual SR Policy path selection procedure is then used to select the highest preference valid candidate path of the SR Policy as active path. If no other valid candidate path is available, then the SR Policy is torn down.

The state-timeout timer for SR PCE-initiated paths is named initiated state timer in IOS XR. The state-timeout timer is configurable, and its default value is 10 minutes.
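A hedged sketch of tuning these two timers on the headend follows, with keyword names inferred from the timer names above (initiated orphan and initiated state), placed alongside the other PCC timers of Example 13‑18, and values equal to the stated defaults expressed in seconds; confirm the exact syntax and units on your software release:

segment-routing
 traffic-eng
  pcc
   timers initiated orphan 180
   timers initiated state 600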

13.5.4 Inter-PCE State-Sync PCEP Session

An SR PCE can maintain PCEP sessions to other SR PCEs. These are the state-sync sessions as specified in IETF draft-litkowski-pce-state-sync. These state-sync sessions have a dual purpose: synchronize the SR Policy databases of the SR PCEs independently from the headends and prevent split-brain situations for disjoint path computations.

The standard PCEP procedures maintain SR Policy database synchronization between SR PCEs by letting the headends send their PCEP Report messages to all their connected SR PCEs. This way the headends ensure that the SR Policy databases on the different connected SR PCEs stay synchronized. If additional redundancy is desired, PCEP state-sync sessions can be established between the SR PCEs. These state-sync sessions ensure that the SR Policy databases of the SR PCEs stay synchronized, regardless of their connectivity to the headends. If a given headend loses its PCEP session to one of its SR PCEs, the state-sync connections ensure that this SR PCE still (indirectly) receives the Reports emitted by that headend.

13.5.4.1 State-Sync Illustration

In Figure 13‑12, headend Node1 has a PCEP session only to SR PCE1, e.g., as a result of a failing PCEP session between Node1 and SR PCE2. SR PCE1 has a state-sync PCEP session to SR PCE2.

Figure 13-12: State-sync session between SR PCE1 and SR PCE2

The state-sync PCEP session is configured on SR PCE1 as shown in Example 13‑20. SR PCE1 has IP address 1.1.1.10, SR PCE2 has IP address 1.1.1.11.

Example 13-20: State-sync session configuration on SR PCE1

pce
 address ipv4 1.1.1.10
 state-sync ipv4 1.1.1.11
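For the full mesh of state-sync sessions discussed below, the mirror configuration would be applied on SR PCE2 (a sketch following the same syntax as Example 13‑20):

pce
 address ipv4 1.1.1.11
 state-sync ipv4 1.1.1.10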

An SR Policy GREEN is configured on headend Node1, with a dynamic path computed by SR PCE1. After SR PCE1 has computed the path and has sent an Update message to Node1, Node1 installs the path and sends a Report message to its only connected SR PCE1 (marked ➊ in Figure 13‑12). The Delegate flag is set since Node1 delegates this path to SR PCE1.

Upon receiving this Report message, SR PCE1 clears the Delegate flag, adds a TLV specifying the identity of the headend (PCC) (➋), and then forwards the message to SR PCE2 via the PCEP state-sync session. When an SR PCE receives a Report message via a state-sync session, it does not forward this Report message again via its own state-sync sessions. This prevents message loops between PCEs but, as a consequence, a full mesh of state-sync sessions between PCEs is required for the complete information to be available on all of them.

Example 13‑21 and Example 13‑22 show the SR Policy databases on SR PCE1 and SR PCE2 respectively. Example 13‑21 shows that SR Policy cfg_GREEN_discr_100 has been reported by headend Node1 (PCC: 1.1.1.1) and is delegated to this SR PCE1 (D:1). Both a Computed path and a Reported path are shown since SR PCE1 has computed this path and has received the Report message.

Example 13-21: SR Policy database on SR PCE1

RP/0/0/CPU0:SR-PCE1#show pce lsp detail

PCE's tunnel database:
----------------------
PCC 1.1.1.1:

Tunnel Name: cfg_GREEN_discr_100
 LSPs:
  LSP[0]:
   source 1.1.1.1, destination 1.1.1.4, tunnel ID 29, LSP ID 8
   State: Admin up, Operation active
   Setup type: Segment Routing
   Binding SID: 40006
   Maximum SID Depth: 10
   Absolute Metric Margin: 0
   Relative Metric Margin: 0%
   Bandwidth: signaled 0 kbps, applied 0 kbps
   PCEP information:
     PLSP-ID 0x8001d, flags: D:1 S:0 R:0 A:1 O:1 C:0
   LSP Role: Single LSP
   State-sync PCE: None
   PCC: 1.1.1.1
   LSP is subdelegated to: None
   Reported path:
     Metric type: TE, Accumulated Metric 30
      SID[0]: Node, Label 16003, Address 1.1.1.3
      SID[1]: Adj, Label 24023, Address: local 99.3.4.3 remote 99.3.4.4
   Computed path: (Local PCE)
     Computed Time: Wed Aug 01 12:21:21 UTC 2018 (00:02:25 ago)
     Metric type: TE, Accumulated Metric 30
      SID[0]: Node, Label 16003, Address 1.1.1.3
      SID[1]: Adj, Label 24023, Address: local 99.3.4.3 remote 99.3.4.4
   Recorded path:
     None
   Disjoint Group Information:
     None

Example 13‑22 shows that SR Policy cfg_GREEN_discr_100 is not delegated to this SR PCE2 (D:0). This SR Policy has been reported by SR PCE1 via the state-sync session (State-sync PCE: 1.1.1.10). Only a Reported path is shown since SR PCE2 did not compute this path, it only received the Report message.

Example 13-22: SR Policy database on SR PCE2

RP/0/0/CPU0:SR-PCE2#show pce lsp detail

PCE's tunnel database:
----------------------
PCC 1.1.1.1:

Tunnel Name: cfg_GREEN_discr_100
 LSPs:
  LSP[0]:
   source 1.1.1.1, destination 1.1.1.4, tunnel ID 29, LSP ID 8
   State: Admin up, Operation active
   Setup type: Segment Routing
   Binding SID: 40006
   Maximum SID Depth: 10
   Absolute Metric Margin: 0
   Relative Metric Margin: 0%
   Bandwidth: signaled 0 kbps, applied 0 kbps
   PCEP information:
     PLSP-ID 0x8001d, flags: D:0 S:1 R:0 A:1 O:1 C:0
   LSP Role: Single LSP
   State-sync PCE: 1.1.1.10
   PCC: None
   LSP is subdelegated to: None
   Reported path:
     Metric type: TE, Accumulated Metric 30
      SID[0]: Node, Label 16003, Address 1.1.1.3
      SID[1]: Adj, Label 24023, Address: local 99.3.4.3 remote 99.3.4.4
   Computed path: (Local PCE)
     None
     Computed Time: Not computed yet
   Recorded path:
     None
   Disjoint Group Information:
     None

13.5.4.2 Split-Brain

The inter-PCE state-sync PCEP connection is also used to solve possible split-brain situations. In a split-brain situation, multiple brains (in this case, SR PCEs) independently compute parts of the solution to a problem and fail to find a complete or optimal solution, because none of the brains has the authority to make the others reconsider their decisions.

In the context of SR-TE, this type of situation may occur for disjoint path computation, as illustrated in Figure 13‑13. Headends Node1 and Node5 have PCEP sessions to both SR PCE1 and SR PCE2, with SR PCE1 configured as the primary SR PCE. However, the PCEP session from Node5 to SR PCE1 has failed. Therefore, SR PCE2 is now the primary (and only) SR PCE for Node5. There is a state-sync PCEP session between SR PCE1 and SR PCE2.

Strict node-disjoint paths (see chapter 4, "Dynamic Candidate Path" for more details) are required from Node1 to Node4 and from Node5 to Node8. At first the SR Policy to Node4 is configured on Node1 and SR PCE1 computes the path. The result is path 1→2→6→7→3→4, which is the IGP shortest path from Node1 to Node4. This path is encoded as the segment list indicated in the drawing. Node1 delegates the path to SR PCE1. SR PCE2 also learns about this path from Node1, since Node1 sends Report messages to both connected SR PCEs.

Then the SR Policy to Node8 is configured on Node5. Node5 requests a path that is strictly node-disjoint from the existing path of Node1 to Node4. SR PCE2 is unable to find such a path; all possible paths traverse nodes used by the existing path from Node1 to Node4. The problem is that SR PCE2 learned about the path from Node1 to Node4, but it cannot change this path in order to solve the disjoint paths problem. SR PCE2 has no control over that path since it is delegated to SR PCE1. If a single SR PCE were computing and maintaining both paths, it could update the path from Node1 to Node4 to 1→2→3→4 and disjoint paths would be possible. But due to the split-brain situation (SR PCE1 controls one path, SR PCE2 controls the other path), no solution is found.

Figure 13-13: Split-brain situation: no disjoint paths found

To solve the split-brain problem, a master/slave relationship is established between SR PCEs when an inter-PCE state-sync session is configured. At the time of writing, the master PCE is the PCE with the lowest PCEP session IP address. The SR PCE master/slave relationship solves the split-brain situation by letting the master SR PCE compute and maintain both disjoint paths. The slave SR PCE sub-delegates the disjoint path computation and maintenance to this master SR PCE. Figure 13‑14 illustrates this by showing how it solves the problem of the example above, requiring strict node-disjoint paths from Node1 to Node4 and from Node5 to Node8.

Figure 13-14: Sub-delegating SR Policy path to master SR PCE (1)

Assuming that SR PCE1 in Figure 13‑14 is the master SR PCE, it is responsible for the computation and maintenance of disjoint paths. Assuming the SR Policy from Node1 to Node4 is configured first, SR PCE1 computes the path as shown in the drawing as the lighter colored path 1→2→6→7→3→4. Node1 installs the path and reports it to both SR PCEs.

Next, the SR Policy from Node5 to Node8 is configured on Node5. Node5 sends a PCEP Report with empty segment list to SR PCE2 (➊). SR PCE2 does not compute the path but forwards the Report message to SR PCE1 via the state-sync session (➋) and thereby sub-delegates this SR Policy path to SR PCE1. SR PCE1 then becomes responsible for computing and updating the path. SR PCE2 adds a TLV to this Report message identifying the owner PCC of the SR Policy path, Node5. SR PCE1 computes both disjoint paths (➌) and sends an Update message to Node1 (➍) to update the path of Node1’s SR Policy as 1→2→3→4. Node1 updates the path (➎).

Figure 13-15: Sub-delegating SR Policy path to master SR PCE (2)

In Figure 13‑15 SR PCE1 then requests Node5 to update the path of the SR Policy to Node8 to 5→6→7→8 by sending an Update message to SR PCE2 via the state-sync session (➏). SR PCE2 forwards the Update message to Node5 (➐), which updates the path (➑). After the headend nodes have updated their paths, they both send a PCEP Report reporting the new state of the SR Policy paths (these messages are not shown in the illustration). Node1 sends the Report directly to both SR PCE1 and SR PCE2. Node5 then sends a Report of the path to SR PCE2, which forwards it to the master SR PCE1.

13.6 BGP SR-TE

This section provides a brief introduction to BGP SR-TE, the BGP way of signaling SR Policy candidate paths from a controller to a headend. It is also known as “BGP SR Policy” and the BGP extensions are specified in draft-ietf-idr-segment-routing-te-policy. A new Subsequent Address-Family Identifier (SAFI) “SR Policy” is defined to convey the SR Policy information. At the time of writing, SR PCE does not support BGP SR-TE as south-bound interface, but it is supported on the headend.

Why would anyone need that?

“Most network operators don’t have the luxury of doing a greenfield deployment of new technology, nor do they have 100% confidence from management that their new solution won't cause some unforeseen disaster. BGP SR-TE provides an elegant, safe and flexible way to introduce global traffic engineering into an existing BGP-based network, and to transition an existing RSVP-TE-based network to use SR-based traffic engineering. It leverages your existing BGP expertise, such as understanding the semantics of peering sessions, updates, path attributes and Graceful Restart. As it operates at the level of a BGP nexthop (an endpoint), it is straightforward for a team with BGP experience to comprehend. Safety comes from having BGP and your IGP as fallbacks. BGP SR-TE may come across at first as arbitrarily complex, but understanding each feature often leads to the same pattern: “Why would anyone need that?”, followed by “Oh, I need that.””

— Paul Mattes

BGP SR-TE is covered in detail in chapter 19, "BGP SR-TE". Here we present a simple example illustrated in Figure 13‑16.

Figure 13-16: BGP SR Policy illustration

The controller (IP address 1.1.1.10) signals the SR Policy candidate path using the BGP IPv4 SR Policy address-family. The SR Policy is identified in the NLRI with color 30 and endpoint 1.1.1.4. The distinguisher allows differentiating multiple candidate paths for the same SR Policy. It is important to note that this BGP Update only carries an SR Policy path, it does not carry any service route reachability information. The attributes of the candidate path are carried in the Tunnel-encaps attribute of the BGP Update message. Multiple attributes are possible; here we only show the Binding-SID 15001 and the SID list <16003, 24034>.

The IPv4 SR Policy BGP address-family is configured on headend Node1 for its session to the controller, as shown in Example 13‑23. SR-TE is enabled as well.

Example 13-23: BGP SR-TE configuration

router bgp 1
 bgp router-id 1.1.1.1
 address-family ipv4 sr-policy
 !
 neighbor 1.1.1.10
  remote-as 1
  update-source Loopback0
  address-family ipv4 sr-policy
!
segment-routing
 traffic-eng

Leveraging the BGP protocol

“BGP SR Policy leverages the scalable, extensible and persuasive BGP protocol, this differentiates it from other controller south-bound interface protocols like CLI/NETCONF, PCEP, etc. An operator/vendor can easily develop a BGP controller that could signal this new NLRI. I’m seeing more and more adoption of BGP SR Policy because of its simplicity and scalability; actually, a couple of BGP SR Policy controllers have been deployed in production networks in the past year. Moving forward, with the ability of BGP-LS to report SR Policies, BGP has a big potential to become the unified SR Policy communication protocol between controller and device.”

— YuanChao Su

After receiving the BGP Update, Node1 instantiates the SR Policy’s candidate path with the attributes signaled in BGP. As with PCEP-initiated paths, if the SR Policy does not yet exist, it is created with the candidate path. The candidate path is added to the list of all candidate paths of the SR Policy and the selection process decides which candidate path becomes the active path. Example 13‑24 shows the status of the SR Policy on Node1.

Example 13-24: Status of BGP SR-TE signaled SR Policy on Node1

RP/0/0/CPU0:xrvr-1#show segment-routing traffic-eng policy

SR-TE policy database
---------------------

Color: 30, End-point: 1.1.1.4
  Name: srte_c_30_ep_1.1.1.4
  Status:
    Admin: up  Operational: up for 00:00:16 (since Feb 26 13:14:45.488)
  Candidate-paths:
    Preference: 100 (BGP, RD: 12345) (active)
      Requested BSID: 15001
      Explicit: segment-list (valid)
        Weight: 1, Metric Type: TE
          16003 [Prefix-SID, 1.1.1.3]
          24034
  Attributes:
    Binding SID: 15001 (SRLB)
    Forward Class: 0
    Steering BGP disabled: no
    IPv6 caps enable: yes

When BGP withdraws the entry, the candidate path is removed and a new candidate path is selected. If no other candidate path exists, the SR Policy is removed as well.

13.7 Summary

The SR-TE process can fulfil different roles, as the brain of a headend and as part of an SR Path Computation Element (SR PCE) server.

SR PCE is a network function component integrated in any IOS XR base image. This functionality can be enabled on any IOS XR node, physical or virtual. The SR PCE server is stateful (it maintains paths on behalf of other nodes), multi-domain capable (it computes and maintains inter-domain paths), and SR-optimized (using SR-native algorithms).

SR PCE is a network entity that provides path computation and maintenance services to headend nodes. It extends the SR-TE capability of a headend by computing paths that the headend node cannot compute, such as inter-domain or disjoint paths.

SR PCE provides an interface to the network via its north-bound interfaces. An application uses this interface to collect a real-time view of the network and to add/update/delete SR Policies.

PCEP high-availability mechanisms provide resiliency against SR PCE failures without impacting the SR Policies and traffic forwarding. Inter-PCE state-sync PCEP sessions can further improve this resiliency.

13.8 References

[YANGModels] "YANG models", https://github.com/YangModels/yang

[RFC4655] "A Path Computation Element (PCE)-Based Architecture", JP Vasseur, Adrian Farrel, Gerald Ash, RFC4655, August 2006

[RFC5440] "Path Computation Element (PCE) Communication Protocol (PCEP)", JP Vasseur, Jean-Louis Le Roux, RFC5440, March 2009

[RFC8231] "Path Computation Element Communication Protocol (PCEP) Extensions for Stateful PCE", Edward Crabbe, Ina Minei, Jan Medved, Robert Varga, RFC8231, September 2017

[RFC8281] "Path Computation Element Communication Protocol (PCEP) Extensions for PCE-Initiated LSP Setup in a Stateful PCE Model", Edward Crabbe, Ina Minei, Siva Sivabalan, Robert Varga, RFC8281, December 2017

[RFC8408] "Conveying Path Setup Type in PCE Communication Protocol (PCEP) Messages", Siva Sivabalan, Jeff Tantsura, Ina Minei, Robert Varga, Jonathan Hardwick, RFC8408, July 2018

[draft-ietf-pce-segment-routing] "PCEP Extensions for Segment Routing", Siva Sivabalan, Clarence Filsfils, Jeff Tantsura, Wim Henderickx, Jonathan Hardwick, draft-ietf-pce-segment-routing-15 (Work in Progress), February 2019

[draft-sivabalan-pce-binding-label-sid] "Carrying Binding Label/Segment-ID in PCE-based Networks", Siva Sivabalan, Clarence Filsfils, Jeff Tantsura, Jonathan Hardwick, Stefano Previdi, Cheng Li, draft-sivabalan-pce-binding-label-sid-06 (Work in Progress), February 2019

[draft-litkowski-pce-state-sync] "Inter Stateful Path Computation Element (PCE) Communication Procedures", Stephane Litkowski, Siva Sivabalan, Cheng Li, Haomian Zheng, draft-litkowski-pce-state-sync-05 (Work in Progress), March 2019

[draft-ietf-idr-segment-routing-te-policy] "Advertising Segment Routing Policies in BGP", Stefano Previdi, Clarence Filsfils, Dhanendra Jain, Paul Mattes, Eric C. Rosen, Steven Lin, draft-ietf-idr-segment-routing-te-policy-05 (Work in Progress), November 2018

1. In some cases, an additional software license may be required. Check with your sales representative.↩

2. At the time of writing, the REST API to manage SR Policy candidate paths was in the process of being updated and the exact details were not yet available for publication.↩

14 SR BGP Egress Peer Engineering

What we will learn in this chapter:

- SR BGP Egress Peer Engineering (EPE) enables a centralized controller to instruct an ingress node to steer traffic via a specific egress node to a specific BGP peer or peering link.
- SR BGP EPE uses Peering-SIDs to steer to a specific BGP peer or peering link, regardless of the BGP best-path. Peering-SIDs can be seen as the BGP variant of IGP Adj-SIDs.
- SR BGP EPE does not change the existing BGP distribution mechanism in place nor does it make any assumptions on the existing iBGP design.
- The different types of Peering-SIDs are PeerNode-SIDs, PeerAdj-SIDs, and PeerSet-SIDs. PeerNode-SIDs steer to the associated peer, PeerAdj-SIDs steer over a specific peering link of the associated peer, and PeerSet-SIDs steer on a set of PeerNode- and/or PeerAdj-SIDs.
- The Peering-SID information is advertised in BGP-LS, such that the controller can insert it in its SR-TE DB and use it for path computations.
- The controller can instantiate an SR Policy on an ingress node to steer traffic flows via a specific egress node to a specific BGP peer, whereby also the path to the egress node can be specified, integrating intra-domain and inter-domain TE.
- The SR PCE includes SR EPE Peering-SIDs in the SR Policy’s segment list to provide end-to-end unified inter-domain SR-TE paths.

We start by explaining how SR BGP EPE solves the BGP egress engineering problem. Then the different Peering-SID types are introduced, and how the peering information with the SIDs is distributed in BGP-LS. We then illustrate how the peering information is inserted in the controller’s database. Two EPE use-cases are presented as conclusion.

14.1 Introduction

The network in Figure 14-1 consists of four Autonomous Systems (ASs), where each AS is identified by its AS Number (ASN): AS1, AS2, AS3, and AS4. AS1 has BGP peerings with AS2 (via egress Node3 and Node4) and with AS3 (via Node4). Node1 and Node2 in AS1 are shown as ingress nodes, but they can be any router, server, or Top-of-Rack (ToR) switch in AS1.

A given AS not only advertises its own prefixes to its peer ASs, but may also propagate the prefixes it has received from its own peer ASs. Therefore, a given destination prefix can be reached through multiple peering ASs and over multiple peering links.

Destination prefixes 11.1.1.0/24 and 22.2.2.0/24 of Figure 14-1 are located in AS4. AS4 advertises these prefixes to its peer ASs AS2 and AS3, which propagate these prefixes further to AS1. AS1 not only receives these prefixes from peers in different ASs, such as Node5 in AS2 and Node6 in AS3, but also from different eBGP peers of the same AS, such as Node6 and Node7 in AS3.

Figure 14-1: Multi-AS network

Given that multiple paths to reach a particular prefix are available, it is the task of the routers and the network operators to select the best path for each prefix. Proper engineering of the way that traffic exits the AS is crucial for cost efficiency and a better end-user experience. For the AS4 prefixes in the example, there are four ways to exit AS1: Node3→Node5, Node4→Node5, Node4→Node6, and Node4→Node7.

BGP uses its path selection rules to select one path, the so-called "BGP best-path", to the destination prefix at each BGP speaker. This is the path that BGP installs in the forwarding table. BGP's best-path selection typically involves routing policies and rule-sets that are specified by the network operator to influence the best-path selection. Using such routing policies and rule-sets on the different BGP speakers provides some level of flexibility and control over how traffic leaves the AS. However, this technique is constrained by the BGP best-path selection mechanism and operates at a per-destination-prefix granularity.

Assume in Figure 14-1 that AS1's egress ASBR Node4 sets a higher local preference on the path of prefix 11.1.1.0/24 received from AS3's peer Node7. As a result, Node4 selects this path via Node7 as best-path and sends out all packets destined for 11.1.1.0/24 towards Node7, regardless of where the packet comes from. In addition, the ingress nodes Node1 and Node2 may be configured with a route-policy to select a BGP best-path via egress ASBR Node3 or Node4 for a particular prefix, but they have no control over the egress peering link that these egress ASBRs will use for that prefix. The egress peering link selection is made solely by the egress ASBR based on its own configuration.

Assume that the routing policies and rule-sets are such that the BGP best-paths to prefixes 11.1.1.0/24 and 22.2.2.0/24 are as indicated in Figure 14-1: from Node1 and Node2, the best-path to these prefixes is via Node3 and its peer Node5 in AS2.

14.2 SR BGP Egress Peer Engineering (EPE)

Due to the limitations of classic BGP egress engineering based on routing policies and rule-sets, a finer-grained and more consistent control of this egress path selection is desired. The SR BGP Egress Peer Engineering (EPE) solution enables a centralized controller to instruct an ingress Provider Edge (PE) router or a content source within the domain to use a specific egress ASBR and a specific external interface or neighbor to reach a particular destination prefix. See RFC 8402 and draft-ietf-spring-segment-routing-central-epe.

To provide this functionality, an SR EPE capable node allocates one or more segments for each of its connected BGP neighbors. These segments are called BGP Peering Segments or BGP Peering SIDs. These Peering-SIDs provide to BGP a functionality similar to the IGP Adj-SID, by steering traffic to a specific BGP peer or over a specific peering interface. They can thus be included in an SR Policy's segment list to express a source-routed inter-AS path. Different types of Peering-SIDs exist and are discussed in section 14.3 of this chapter.

SR BGP EPE is used to engineer egress traffic to external peers but, by providing the means for SR-TE paths to cross AS boundaries, it also enables unified end-to-end SR-TE paths in multi-AS networks. SR BGP EPE is a main constituent of the integrated intra-domain and inter-domain SR-TE solution.

In the remainder of this chapter we will abbreviate "SR BGP EPE" to "SR EPE" or even "EPE".

Segment Routing is enabled in AS1 of Figure 14-2. AS1's ASBRs Node3 and Node4 advertise the IGP Prefix-SIDs 16003 and 16004, respectively. AS2, AS3, and AS4 are external to AS1 and no SR support is assumed for these ASs.

Figure 14-2: SR BGP EPE illustration

Single-hop and multi-hop eBGP sessions

The standard assumption for an eBGP session is that the peers are directly connected such that they can establish a single-hop session. Therefore, by default eBGP sessions are only allowed between directly connected neighbors. Single-hop BGP sessions are established between the interface addresses of the peering nodes. By default, IOS XR BGP checks if the BGP peering addresses are directly connected (the "connected-check") and it sends all BGP packets with a TTL=1 to ensure that only single-hop sessions can be established. Note that "single-hop" is somewhat of a misnomer since it excludes BGP sessions between the loopback addresses of two directly connected nodes. But one could argue that the loopback addresses are not connected in this case.

Sometimes eBGP sessions between non-directly-connected neighbors ("multi-hop") are desired, e.g., to establish eBGP sessions between route reflectors in different ASs (e.g., inter-AS option C in RFC 4364) or to enable load-balancing between multiple peering links. Multi-hop BGP sessions are typically established between loopback addresses of the peering nodes. There are two options in IOS XR to enable a multi-hop eBGP session for a peer: ebgp-multihop and ignore-connected-check.

ebgp-multihop disables the connected-check to allow non-connected neighbors and sets the TTL of the transmitted BGP packets to the configured value. This option must be used if the BGP neighbor nodes are not directly connected.

If the BGP neighbor nodes are directly connected, but the BGP session is established between their loopbacks, then it is preferred to configure ignore-connected-check to disable the connected-check, possibly in combination with ttl-security to only accept BGP packets of directly connected nodes (see RFC 5082).

A controller is present in AS1. The BGP best-path for AS4's prefixes 11.1.1.0/24 and 22.2.2.0/24 is via Node3 and its peer Node5 in AS2, as indicated in Figure 14-2.

The operator of AS1 wants to have full control of the egress traffic flows, not only on how these traffic flows exit AS1, but also on how they travel within AS1 itself. The operator wants very granular and programmable control of these flows, without dependencies on the other ASs or on BGP routing policies. This steering control is provided by SR-TE and the SR EPE functionality.

SR EPE is enabled on all BGP peering sessions of Node3 and Node4. Node4 has allocated BGP Peering SID label 50407 for its peering session to Node7 and label 50405 for its peering session to Node5. Packets that arrive on Node4 with a top label 50407 or 50405 are steered towards Node7 or Node5, respectively.

A traffic flow to destination 11.1.1.0/24, marked "F1" in Figure 14-2, enters AS1 via Node1. The operator chooses to steer this flow F1 via egress Node4 towards peer Node5. The operator instantiates an SR Policy on Node1 for this purpose, with segment list <16004, 50405>, where 16004 is the Prefix-SID of Node4 and 50405 is the BGP Peering SID for Node4's peering to Node5. This SR Policy steers traffic destined for 11.1.1.0/24 on the IGP shortest path to egress Node4 and then on the BGP peering link to Node5. The path of flow F1 is indicated in the drawing.

At the same time another flow "F2", which enters AS1 in Node1 with destination 22.2.2.0/24, should be steered on an intra-AS path via Node8 towards egress Node4 and then towards AS3's peer Node7. The operator initiates a second SR Policy on Node1 that imposes segment list <16008, 16004, 50407> on packets destined for 22.2.2.0/24. This segment list brings the packets via the IGP shortest path to Node8 (using Node8's Prefix-SID 16008), then via the IGP shortest path to Node4 (Node4's Prefix-SID 16004), and finally towards peer Node7 (using BGP Peering SID 50407). The path of flow F2 is indicated in the drawing.

Any other traffic flows still follow their default forwarding paths, IGP shortest-path and BGP best-path.
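To make the composition of these two segment lists concrete, here is a minimal Python sketch that assembles them from the SIDs used in this example. It is purely illustrative: the dictionaries and the helper function are hypothetical and do not correspond to any router or controller API.

# Conceptual sketch: composing the SR Policy segment lists of the EPE
# use-case above. SID values are the ones used in this chapter's figures.

PREFIX_SID = {"Node4": 16004, "Node8": 16008}        # IGP Prefix-SIDs
PEER_NODE_SID = {("Node4", "Node5"): 50405,          # BGP PeerNode-SIDs on Node4
                 ("Node4", "Node7"): 50407}

def epe_segment_list(via_nodes, egress, peer):
    """Build a segment list: IGP Prefix-SIDs to reach the egress ASBR
    (optionally via intermediate nodes), terminated by the Peering-SID
    that selects the external peer."""
    sids = [PREFIX_SID[n] for n in via_nodes] + [PREFIX_SID[egress]]
    sids.append(PEER_NODE_SID[(egress, peer)])
    return sids

# Flow F1: IGP shortest path to Node4, then to peer Node5
print(epe_segment_list([], "Node4", "Node5"))          # [16004, 50405]
# Flow F2: via Node8 to Node4, then to peer Node7
print(epe_segment_list(["Node8"], "Node4", "Node7"))   # [16008, 16004, 50407]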

14.2.1 SR EPE Properties

The SR EPE solution does not require changing the existing BGP distribution mechanism in place, nor does it make any assumptions on the existing iBGP design (route reflectors (RRs), confederations, or iBGP full meshes). The SR EPE functionality is only required at the EPE egress border router and the EPE controller. The solution does not require any changes on the remote peering node in the external AS. This remote peering node is not involved in the SR EPE solution. Furthermore, the solution allows egress PEs to set next-hop-self on eBGP-learned routes that they announce to their iBGP peers.

SR EPE provides an integrated intra-domain and inter-domain traffic engineering solution. The solution accommodates using an SR Policy at an ingress PE or directly at a source host within the domain to steer traffic within the local domain all the way to the external peer. SR EPE enables using BGP peering links on an end-to-end SR-TE path in a multi-AS network, without requiring the peering links to be distributed in the IGP, nor having to declare the peering links as

“passive TE interfaces”.

Although egress engineering usually refers to steering traffic on eBGP peering sessions, SR EPE could equally be applied to iBGP sessions. However, contrary to eBGP sessions, iBGP sessions are typically not established between directly connected peers along the data forwarding path. Therefore, the applicability of SR EPE for iBGP peering sessions may be limited to those deployments where the iBGP peer is along the data forwarding path. At the time of writing, IOS XR only supports SR EPE for eBGP neighbors.

14.3 Segment Types

Earlier in this chapter, we have explained that an SR EPE capable node allocates one or more segments for each of its connected BGP neighbors. These segments are called BGP Peering Segments or BGP Peering SIDs and they provide a functionality to BGP that is similar to the Adj-SID for IGP: Peering-SIDs steer traffic to a specific peer or over a specific peering interface.

A peering interface is an interface on the shortest path to the peer. In case of a single-hop BGP session (between neighboring interface addresses), there is a single peering interface. In case of a multi-hop BGP session, there can be multiple peering interfaces: the interfaces along the set of shortest paths to the BGP neighbor address.

Several types of Peering-SIDs enable the egress peering selection with various levels of granularity. The Segment Routing Architecture specification (RFC 8402) defines three types of BGP Peering SIDs:

PeerNode-SID: steer traffic to the BGP peer, load-balancing traffic flows over all available peering interfaces
  SR header operation: POP
  Action: Forward on any peering interface

PeerAdjacency-SID: steer traffic to the BGP peer via the specified peering interface
  SR header operation: POP
  Action: Forward on the specific peering interface

PeerSet-SID: steer traffic via an arbitrary set of BGP peers or peering interfaces, load-balancing traffic flows over all members of the set
  SR header operation: POP
  Action: Forward on any interface to the set of peers or peering interfaces

These three segment types can be local or global segments. Remember that a local segment only has local significance on the node originating it, while a global segment has significance in the whole SR domain.

In the SR MPLS implementation in IOS XR, BGP Peering Segments are local segments.

An SR EPE node allocates a PeerNode-SID for each EPE-enabled BGP session. For multi-hop BGP sessions, the SR EPE node also allocates a PeerAdj-SID for each peering interface of that BGP session, in addition to the PeerNode-SID. Even if there is only a single peering interface to the BGP multi-hop neighbor, a PeerAdj-SID is allocated for it. For single-hop BGP sessions, no PeerAdj-SID is allocated since for this type of BGP session the PeerNode-SID also provides the PeerAdj-SID functionality.

In addition, each peer and peering interface may be part of a set identified by a PeerSet-SID. For example, when assigning the same PeerSet-SID to two peers and to a peering interface of a third peer, traffic to the PeerSet-SID is load-balanced over both peers and the peering interface. At the time of writing, IOS XR does not support PeerSet-SIDs.

The SR EPE node can dynamically allocate local labels for its BGP Peering SIDs, or the operator can explicitly assign local labels for this purpose. At the time of writing, IOS XR only supports dynamically allocated BGP Peering SID labels.

Figure 14-3 illustrates the various types of BGP Peering SIDs on a border node. Node4 has three eBGP sessions: a single-hop session to Node5 in AS2, another single-hop session to Node6 in AS3, and a multi-hop session to Node7 in AS3. The multi-hop BGP session is established between the loopback addresses of Node4 and Node7; the session is transported over two peering interfaces, represented by the two links between these nodes.

Figure 14-3: SR EPE Peering SID types

Assume that Node4 allocated the following BGP Peering SIDs:

BGP PeerNode-SID 50405 for the session to Node5 in AS2
BGP PeerNode-SID 50406 for the session to Node6 in AS3
BGP PeerNode-SID 50407 for the session to Node7 in AS3
BGP PeerAdj-SID 50417 for the top peering link between Node4 and Node7
BGP PeerAdj-SID 50427 for the bottom peering link between Node4 and Node7
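The allocation rules described in this section (one PeerNode-SID per EPE-enabled session, plus one PeerAdj-SID per peering interface for multi-hop sessions only) can be summarized in a small Python sketch. This is a conceptual illustration under those stated rules; the label allocator and data layout are hypothetical, and the label values it produces are arbitrary.

# Sketch of the Peering-SID allocation rules described above (not IOS XR code).

from itertools import count

_label = count(50400)   # hypothetical dynamic label allocator

def allocate_peering_sids(session):
    """session = {'peer': str, 'multihop': bool, 'interfaces': [str]}"""
    sids = {"PeerNode-SID": next(_label)}
    if session["multihop"]:
        # One PeerAdj-SID per peering interface, even if there is only one.
        sids["PeerAdj-SIDs"] = {ifc: next(_label) for ifc in session["interfaces"]}
    # Single-hop sessions: the PeerNode-SID already identifies the single
    # peering interface, so no PeerAdj-SID is allocated.
    return sids

print(allocate_peering_sids(
    {"peer": "Node5", "multihop": False, "interfaces": ["link-to-Node5"]}))
print(allocate_peering_sids(
    {"peer": "Node7", "multihop": True,
     "interfaces": ["top-link-to-Node7", "bottom-link-to-Node7"]}))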

14.4 Configuration

The BGP configuration on Node4 for the topology of Figure 14-3 is shown in Example 14-1. Node4 is in AS1 and has three BGP sessions: two single-hop sessions to 99.4.5.5 (Node5) and 99.4.6.6 (Node6), and one multi-hop session (ignore-connected-check) to 1.1.1.7 (Node7). The ignore-connected-check command disables the verification that the nexthop is on a connected subnet. Node4 sends the BGP packets to Node7 with TTL=1, which is fine since the nodes are directly connected. In this example, address-family ipv4 unicast is enabled for all neighbors, but other address-families are also supported. Since the neighbors are in a different AS, an ingress and an egress route-policy must be applied.

Example 14-1: Node4 SR EPE configuration

route-policy bgp_in
  pass
end-policy
!
route-policy bgp_out
  pass
end-policy
!
router bgp 1
 bgp router-id 1.1.1.4
 address-family ipv4 unicast
 !
 neighbor 99.4.5.5
  remote-as 2
  egress-engineering
  description # single-hop eBGP peer Node5
  address-family ipv4 unicast
   route-policy bgp_in in
   route-policy bgp_out out
  !
 !
 neighbor 99.4.6.6
  remote-as 3
  egress-engineering
  description # single-hop eBGP peer Node6
  address-family ipv4 unicast
   route-policy bgp_in in
   route-policy bgp_out out
  !
 !
 neighbor 1.1.1.7
  remote-as 3
  egress-engineering
  description # multi-hop eBGP peer Node7
  update-source Loopback0
  ignore-connected-check
  address-family ipv4 unicast
   route-policy bgp_in in
   route-policy bgp_out out

SR EPE is enabled by configuring egress-engineering under the BGP neighbor. With this configuration, the node automatically allocates, programs, and advertises the Peering-SIDs to the controller in BGP-LS, as we will see in the next section.

For the example configuration of Node4 in Example 14-1, the PeerNode-SID for the session to Node5 is shown in Example 14-2. Node4 has allocated label 50405 as PeerNode-SID for this BGP session. The First Hop address 99.4.5.5 is the interface address on Node5 for the link to Node4.

Example 14-2: Node4 PeerNode-SID for peer Node5 RP/0/0/CPU0:xrvr-4#show bgp egress-engineering

Egress Engineering Peer Set: 99.4.5.5/32 (10b291a4)
  Nexthop: 99.4.5.5
  Version: 5, rn_version: 5
  Flags: 0x00000006
  Local ASN: 1
  Remote ASN: 2
  Local RID: 1.1.1.4
  Remote RID: 1.1.1.5
  First Hop: 99.4.5.5
  NHID: 1
  Label: 50405, Refcount: 3
  rpc_set: 105cfd34

The PeerNode-SID for the multi-hop eBGP session to Node7 is shown in Example 14‑3. Node4 has allocated label 50407 as PeerNode-SID for this BGP session. Node4 uses two next-hops (First Hop in the output) for this BGP PeerNode-SID: via the top link to Node7 (99.4.7.7) and via the bottom link to Node7 (77.4.7.7). Traffic is load-balanced over these two next-hops. Addresses 99.4.7.7 and 77.4.7.7 are Node7’s interface addresses for its links to Node4. Example 14-3: Node4 PeerNode-SID for peer Node7

Egress Engineering Peer Set: 1.1.1.7/32 (10b48fec)
  Nexthop: 1.1.1.7
  Version: 2, rn_version: 2
  Flags: 0x00000006
  Local ASN: 1
  Remote ASN: 3
  Local RID: 1.1.1.4
  Remote RID: 1.1.1.7
  First Hop: 99.4.7.7, 77.4.7.7
  NHID: 0, 0
  Label: 50407, Refcount: 3
  rpc_set: 10c34c24

One or more PeerAdj-SIDs are allocated for a multi-hop BGP session. The PeerAdj-SIDs for Node4's multi-hop eBGP session to Node7 are shown in Example 14-4. Node4 has two equal-cost paths to peer Node7, one via each of the two links to Node7. Node4 has allocated labels 50417 and 50427 as PeerAdj-SIDs for these two peering links. Hence, traffic that arrives on Node4 with top label 50417 will be forwarded on the top peering link, using next-hop (First Hop) 99.4.7.7.

Example 14-4: Node4 PeerAdj-SIDs for peer Node7

Egress Engineering Peer Set: 99.4.7.7/32 (10d92234)
  Nexthop: 99.4.7.7
  Version: 3, rn_version: 5
  Flags: 0x0000000a
  Local ASN: 1
  Remote ASN: 3
  Local RID: 1.1.1.4
  Remote RID: 1.1.1.7
  First Hop: 99.4.7.7
  NHID: 2
  Label: 50417, Refcount: 3
  rpc_set: 10e37684

Egress Engineering Peer Set: 77.4.7.7/32 (10c931f0)
  Nexthop: 77.4.7.7
  Version: 4, rn_version: 5
  Flags: 0x0000000a
  Local ASN: 1
  Remote ASN: 3
  Local RID: 1.1.1.4
  Remote RID: 1.1.1.7
  First Hop: 77.4.7.7
  NHID: 4
  Label: 50427, Refcount: 3
  rpc_set: 10e58fa4

BGP on Node4 allocates the Peering-SIDs and installs them in the forwarding table, as shown in the MPLS forwarding table displayed in Example 14-5. All the Peering SID forwarding entries pop the label and forward on the appropriate interface. The traffic to the PeerNode-SID 50407 of the eBGP multi-hop session to Node7 is load-balanced over the two links to Node7.

Example 14-5: SR EPE Peering-SID MPLS forwarding entries on Node4

RP/0/0/CPU0:xrvr-4#show mpls forwarding
Local  Outgoing    Prefix             Outgoing     Next Hop        Bytes
Label  Label       or ID              Interface                    Switched
------ ----------- ------------------ ------------ --------------- ------------
50405  Pop         No ID              Gi0/0/0/0    99.4.5.5        0
50406  Pop         No ID              Gi0/0/0/1    99.4.6.6        0
50407  Pop         No ID              Gi0/0/0/2    99.4.7.7        0
       Pop         No ID              Gi0/0/0/3    77.4.7.7        0
50417  Pop         No ID              Gi0/0/0/2    99.4.7.7        0
50427  Pop         No ID              Gi0/0/0/3    77.4.7.7        0

14.5 Distribution of SR EPE Information in BGP-LS An SR EPE enabled border node allocates BGP Peering-SIDs and programs them in its forwarding table. It also advertises these BGP Peering-SIDs in BGP-LS such that a controller can learn and use this EPE information. As explained in the previous chapters (chapter 12, "SR-TE Database" and chapter 13, "SR PCE"), BGP-LS is the method of choice to feed network information to a controller. In the context of SR EPE, the use of BGP-LS makes the advertisement of the BGP Peering SIDs completely independent from the advertisement of any forwarding and reachability information in BGP. Indeed, BGP does not directly use the BGP Peering SIDs, but simply conveys this information to the controller (or the operator) that can leverage it in SR Policies. In case RRs are used to scale BGP-LS distribution, the controller can then tap into one of them to learn the BGP Peering SIDs from all SR EPE enabled border nodes, in addition to the other BGP-LS information. The SR EPE-enabled BGP peerings are represented as Link objects in the BGP-LS database and advertised using Link-type BGP-LS NLRIs. The controller inserts the SR EPE-enabled BGP peerings as links in the SR-TE DB. More details of the SR-TE DB and the SR EPE peering entries in SR-TE DB are available in chapter 12, "SR-TE Database". Figure 14‑4 shows the format of a BGP-LS Link-type NLRI, as specified in BGP-LS RFC 7752.

Figure 14-4: BGP-LS Link-type NLRI format

The fields in the BGP-LS Link-type NLRI are:

Protocol-ID: identifies the source protocol of this NLRI
Identifier: identifier of the "routing universe" that this link belongs to, commonly associated with an instance-id in the IGP world
Local Node Descriptors: set of properties that uniquely identify the local node of the link
Remote Node Descriptors: set of properties that uniquely identify the remote node of the link
Link Descriptors: set of properties that uniquely identify a link between the two anchor nodes

BGP-LS is extended by IETF draft-ietf-idr-bgpls-segment-routing-epe to support advertisement of BGP Peering SIDs. The Protocol-ID field in the NLRI for the SR EPE peerings has value 7, "BGP", to indicate that BGP is the source protocol of this NLRI. The Identifier field has value 0 in IOS XR, identifying the Default Layer 3 Routing topology. The possible values of the other fields in the Link-type NLRI (Local Node, Remote Node, and Link Descriptors) as used for SR EPE peering advertisements are described in chapter 17, "BGP-LS".
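As an illustration of the information the controller extracts from such an advertisement, the following Python sketch models the descriptors of an EPE Link-type NLRI as a plain data structure. The class and field names are hypothetical; the Protocol-ID value 7 and Identifier value 0 are the ones mentioned above.

# Data-model sketch of an EPE Link-type NLRI (illustration only, not a wire encoder).

from dataclasses import dataclass, field

@dataclass
class NodeDescriptor:
    asn: int
    bgp_router_id: str          # e.g., "1.1.1.4"

@dataclass
class EpeLinkNlri:
    protocol_id: int            # 7 = BGP for EPE advertisements
    identifier: int             # 0 = Default Layer 3 Routing topology in IOS XR
    local_node: NodeDescriptor
    remote_node: NodeDescriptor
    link_descriptors: dict = field(default_factory=dict)

# The Node4-to-Node5 single-hop peering of this chapter's example,
# as the controller would represent it:
nlri = EpeLinkNlri(
    protocol_id=7, identifier=0,
    local_node=NodeDescriptor(asn=1, bgp_router_id="1.1.1.4"),
    remote_node=NodeDescriptor(asn=2, bgp_router_id="1.1.1.5"),
    link_descriptors={"local_if": "99.4.5.4", "remote_if": "99.4.5.5"})
print(nlri)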

The BGP Peering SID is included in the BGP-LS Attribute that is advertised with the EPE NLRI. It is specified in a BGP Peering SID TLV, as described in the next section.

14.5.1 BGP Peering SID TLV

The format of the BGP Peering SID TLV is displayed in Figure 14-5. This TLV contains the Peering-SID and is advertised with the associated EPE NLRI in the BGP-LS Attribute.

Figure 14-5: Peering SID TLV format

The fields in the BGP Peering SID TLV are:

Type: PeerNode SID (1101); PeerAdj SID (1102); PeerSet SID (1103)
Length: length of the TLV
Flags:
  V-Flag: Value flag – if set, then the SID carries a value; set in current IOS XR releases.
  L-Flag: Local flag – if set, then the value/index carried by the SID has local significance; set in current IOS XR releases.
  B-Flag: Backup flag – if set, then the SID refers to a path that is eligible for protection; unset in current IOS XR releases.
  P-Flag: Persistent flag – if set, then the SID is persistently allocated, i.e., the SID value remains consistent across router restart and session/interface flap; unset in current IOS XR releases.
Weight: sets the distribution ratio for Weighted ECMP load-balancing; always 0 in current IOS XR releases.
SID/Label/Index: according to the TLV length and to the Value (V) and Local (L) flag settings, it contains either:
  a 3-octet local label where the 20 rightmost bits are used for encoding the label value; the V and L flags are set, or
  a 4-octet index defining the offset in the SRGB (Segment Routing Global Block) advertised by this node; in this case, the SRGB is advertised using the extensions defined in draft-ietf-idr-bgp-ls-segment-routing-ext.
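The sketch below shows how a PeerNode-SID TLV carrying a label could be encoded, assuming the Type(2)/Length(2)/Flags(1)/Weight(1)/Reserved(2)/SID layout and the V/L/B/P flag-bit positions of draft-ietf-idr-bgpls-segment-routing-epe. The TLV type values and flag semantics are the ones listed above; the byte-level packing itself is an assumption for illustration.

# Illustrative encoder for a BGP Peering SID TLV (assumed field layout).

import struct

TLV_TYPE = {"PeerNode": 1101, "PeerAdj": 1102, "PeerSet": 1103}
FLAG_V, FLAG_L, FLAG_B, FLAG_P = 0x80, 0x40, 0x20, 0x10   # assumed bit positions

def encode_peering_sid_tlv(kind, label, flags=FLAG_V | FLAG_L, weight=0):
    # A 3-octet SID field carries an MPLS label in its 20 rightmost bits.
    sid = struct.pack("!I", label & 0xFFFFF)[1:]            # 3 octets
    value = struct.pack("!BBH", flags, weight, 0) + sid     # flags, weight, reserved, SID
    return struct.pack("!HH", TLV_TYPE[kind], len(value)) + value

tlv = encode_peering_sid_tlv("PeerNode", 50405)
print(tlv.hex())   # -> 044d0007c000000000c4e5 (type 1101, length 7, V+L set, label 50405)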

14.5.2 Single-hop BGP Session Figure 14‑6 illustrates a single-hop eBGP session between Node4 in AS1 and Node5 in AS2.

Figure 14-6: SR EPE for single-hop BGP session

SR EPE is enabled on Node4 for this BGP session, using the configuration of Node4 as shown in Example 14-6. By default, eBGP sessions in IOS XR require specifying ingress and egress route-policies that specify which prefixes to accept and advertise. These are the bgp_in and bgp_out route-policies in the example.

Example 14-6: BGP SR EPE on Single-hop BGP session – Node4's configuration

route-policy bgp_in
  pass
end-policy
!
route-policy bgp_out
  pass
end-policy
!
router bgp 1
 bgp router-id 1.1.1.4
 address-family ipv4 unicast
 !
 neighbor 99.4.5.5
  remote-as 2
  egress-engineering
  description # single-hop eBGP peer Node5 #
  address-family ipv4 unicast
   route-policy bgp_in in
   route-policy bgp_out out

Node4 advertises the BGP Peering SIDs to the controller (1.1.1.10) via a BGP-LS BGP session. The configuration of this BGP-LS session to 1.1.1.10 on Node4 is shown in Example 14-7.

Example 14-7: Node4's BGP-LS session to Controller

router bgp 1
 address-family link-state link-state
 !
 neighbor 1.1.1.10
  remote-as 1
  update-source Loopback0
  description iBGP to Controller
  address-family link-state link-state

Node4 has allocated PeerNode-SID 50405 for the session to Node5 (also see the output in Example 14‑2). Since it is a single-hop BGP session, no PeerAdj-SIDs are allocated for this session. The controller receives the BGP-LS advertisement for this PeerNode-SID, as shown in Example 14‑8. The long and cryptic command used to display the advertisement consists mainly of the BGP-LS NLRI of the BGP PeerNode-SID in string format that is specified as an argument of the show command.

The output of show bgp link-state link-state (without specifying an NLRI) shows a legend to interpret the NLRI fields, as is explained in chapter 17, "BGP-LS". The format of a BGP-LS Link-type NLRI is shown in Figure 14-4.

In the NLRI, we recognize the Local and Remote Node Descriptors, which identify the local and remote anchor nodes Node4 and Node5. These anchor nodes are identified by their ASN and BGP router-id. The Link Descriptor contains the local and remote interface addresses of this single-hop BGP session. The only entry in the Link-state Attribute, shown at the bottom of the output, is the BGP PeerNode-SID 50405.

Example 14-8: BGP PeerNode-SID BGP table entry

RP/0/0/CPU0:xrvr-10#show bgp link-state link-state [E][B][I0x0][N[c1][b0.0.0.0][q1.1.1.4]][R[c2][b0.0.0.0][q1.1.1.5]][L[i99.4.5.4][n99.4.5.5]]/664 detail
BGP routing table entry for [E][B][I0x0][N[c1][b0.0.0.0][q1.1.1.4]][R[c2][b0.0.0.0][q1.1.1.5]][L[i99.4.5.4][n99.4.5.5]]/664
NLRI Type: Link
Protocol: BGP
Identifier: 0x0
Local Node Descriptor:
  AS Number: 1
  BGP Identifier: 0.0.0.0
  BGP Router Identifier: 1.1.1.4
Remote Node Descriptor:
  AS Number: 2
  BGP Identifier: 0.0.0.0
  BGP Router Identifier: 1.1.1.5
Link Descriptor:
  Local Interface Address IPv4: 99.4.5.4
  Neighbor Interface Address IPv4: 99.4.5.5
Versions:
  Process           bRIB/RIB  SendTblVer
  Speaker                  3           3
    Flags: 0x00000001+0x00000000;
Last Modified: Feb  7 16:39:56.876 for 00:19:02
Paths: (1 available, best #1)
  Not advertised to any peer
  Path #1: Received by speaker 0
  Flags: 0x4000000001060005, import: 0x20
  Not advertised to any peer
  Local
    1.1.1.4 (metric 10) from 1.1.1.4 (1.1.1.4)
      Origin IGP, localpref 100, valid, internal, best, group-best
      Received Path ID 0, Local Path ID 0, version 3
      Link-state: Peer-SID: 50405

The SR PCE receives the BGP-LS information and inserts the BGP Peering-SID information in its SR-TE DB. This is shown in the output of Example 14‑9. Node4 is presented as Node 1 in the SR-TE DB.

The BGP session on Node4 to Node5 is represented as a (uni-directional) link Link[0], anchored to local Node4 and remote Node5. The controller created entries in the SR-TE DB for these two anchor nodes, identified by their ASN and BGP router-id: Node4 (AS1, 1.1.1.4) and Node5 (AS2, 1.1.1.5). The PeerNode-SID is shown as Adj SID (epe) in the SR-TE DB. At the time of writing, the other link attributes (metrics, affinity, etc.) are not advertised for EPE links in IOS XR, therefore the link metrics are set to zero.

The remote Node5 is also inserted in the SR-TE DB as Node 2. Since Node5 itself does not advertise any BGP Peering-SIDs in this example, no other information is available for this node besides its ASN and BGP router-id (AS2, 1.1.1.5).

Example 14-9: BGP PeerNode-SID SR-TE DB entry

RP/0/0/CPU0:xrvr-10#show pce ipv4 topology
PCE's topology database - detail:
---------------------------------
Node 1
  BGP router ID: 1.1.1.4
  ASN: 1

  Link[0]: local address 99.4.5.4, remote address 99.4.5.5
    Local node:
      BGP router ID: 1.1.1.4
      ASN: 1
    Remote node:
      BGP router ID: 1.1.1.5
      ASN: 2
    Metric: IGP 0, TE 0, delay 0
    Adj SID: 50405 (epe)

Node 2
  BGP router ID: 1.1.1.5
  ASN: 2

14.5.3 Multi-hop BGP Session

Figure 14-7 illustrates a multi-hop eBGP session between Node4 in AS1 and Node7 in AS3.

Figure 14-7: SR EPE for multi-hop BGP session

BGP EPE is enabled on the BGP session of Node4 to Node7, as shown in the configuration of Example 14-10. Node4 and Node7 are adjacent. To allow the multi-hop BGP session between the loopback interfaces of these nodes, it is sufficient to disable the connected-check (ignore-connected-check) in BGP.

Example 14-10: SR BGP EPE for multi-hop eBGP session – Node4's configuration

route-policy bgp_in
  pass
end-policy
!
route-policy bgp_out
  pass
end-policy
!
router bgp 1
 bgp router-id 1.1.1.4
 address-family ipv4 unicast
 neighbor 1.1.1.7
  remote-as 3
  ignore-connected-check
  egress-engineering
  description # multi-hop eBGP peer Node7 #
  update-source Loopback0
  address-family ipv4 unicast
   route-policy bgp_in in
   route-policy bgp_out out

The PeerNode-SID for this BGP session is 50407 and the PeerAdj-SIDs are 50417 and 50427 (also see the output in Example 14-3 and Example 14-4). These SIDs are also shown in Figure 14-7.

The BGP PeerNode-SID BGP-LS advertisement as received by the SR PCE is shown in Example 14-11. The PeerNode entry is advertised as a link using a BGP-LS Link-type NLRI. This "link" is anchored to its local node (AS1, 1.1.1.4) and remote node (AS3, 1.1.1.7), as shown in the Local and Remote Node Descriptors in Example 14-11. The Link Descriptor contains the local and remote BGP session addresses, Node4's BGP router-id 1.1.1.4 and Node7's BGP router-id 1.1.1.7 in this example. The only entry in the Link-state Attribute attached to this NLRI is the BGP PeerNode-SID 50407, as shown at the bottom of the output.

Example 14-11: BGP PeerNode-SID BGP table entry for multi-hop session

RP/0/0/CPU0:xrvr-10#show bgp link-state link-state [E][B][I0x0][N[c1][b0.0.0.0][q1.1.1.4]][R[c3][b0.0.0.0][q1.1.1.7]][L[i1.1.1.4][n1.1.1.7]]/664 detail
BGP routing table entry for [E][B][I0x0][N[c1][b0.0.0.0][q1.1.1.4]][R[c3][b0.0.0.0][q1.1.1.7]][L[i1.1.1.4][n1.1.1.7]]/664
NLRI Type: Link
Protocol: BGP
Identifier: 0x0
Local Node Descriptor:
  AS Number: 1
  BGP Identifier: 0.0.0.0
  BGP Router Identifier: 1.1.1.4
Remote Node Descriptor:
  AS Number: 3
  BGP Identifier: 0.0.0.0
  BGP Router Identifier: 1.1.1.7
Link Descriptor:
  Local Interface Address IPv4: 1.1.1.4
  Neighbor Interface Address IPv4: 1.1.1.7
Versions:
  Process           bRIB/RIB  SendTblVer
  Speaker                 16          16
    Flags: 0x00000001+0x00000200;
Last Modified: Feb  7 17:24:22.876 for 00:01:08
Paths: (1 available, best #1)
  Not advertised to any peer
  Path #1: Received by speaker 0
  Flags: 0x4000000001060005, import: 0x20
  Not advertised to any peer
  Local
    1.1.1.4 (metric 10) from 1.1.1.4 (1.1.1.4)
      Origin IGP, localpref 100, valid, internal, best, group-best
      Received Path ID 0, Local Path ID 0, version 16
      Link-state: Peer-SID: 50407

For a multi-hop BGP session, PeerAdj-SIDs are also allocated for each link used in the BGP session. In this example there are two peering links, so two PeerAdj-SIDs are allocated. Each PeerAdj-SID is advertised in a BGP-LS Link-type NLRI.

One of the two PeerAdj-SID BGP-LS advertisements as received by the controller is shown in Example 14-12. The two anchor nodes Node4 and Node7 are specified in the Local and Remote Node Descriptors. To distinguish this link from the other parallel link, the local (99.4.7.4) and remote (99.4.7.7) interface addresses are included in the Link Descriptor. The only entry in the Link-state Attribute attached to this NLRI is the BGP PeerAdj-SID 50427, shown at the bottom of the output.

Example 14-12: BGP PeerAdj-SID BGP table entry for multi-hop session

RP/0/0/CPU0:xrvr-10#show bgp link-state link-state [E][B][I0x0][N[c1][b0.0.0.0][q1.1.1.4]][R[c3][b0.0.0.0][q1.1.1.7]][L[i99.4.7.4][n99.4.7.7]]/664 detail
BGP routing table entry for [E][B][I0x0][N[c1][b0.0.0.0][q1.1.1.4]][R[c3][b0.0.0.0][q1.1.1.7]][L[i99.4.7.4][n99.4.7.7]]/664
NLRI Type: Link
Protocol: BGP
Identifier: 0x0
Local Node Descriptor:
  AS Number: 1
  BGP Identifier: 0.0.0.0
  BGP Router Identifier: 1.1.1.4
Remote Node Descriptor:
  AS Number: 3
  BGP Identifier: 0.0.0.0
  BGP Router Identifier: 1.1.1.7
Link Descriptor:
  Local Interface Address IPv4: 99.4.7.4
  Neighbor Interface Address IPv4: 99.4.7.7
Versions:
  Process           bRIB/RIB  SendTblVer
  Speaker                 18          18
    Flags: 0x00000001+0x00000200;
Last Modified: Feb  7 17:24:22.876 for 00:02:41
Paths: (1 available, best #1)
  Not advertised to any peer
  Path #1: Received by speaker 0
  Flags: 0x4000000001060005, import: 0x20
  Not advertised to any peer
  Local
    1.1.1.4 (metric 10) from 1.1.1.4 (1.1.1.4)
      Origin IGP, localpref 100, valid, internal, best, group-best
      Received Path ID 0, Local Path ID 0, version 18
      Link-state: Peer-Adj-SID: 50427

The SR PCE inserts the BGP Peering-SID information in its SR-TE DB. The entry in the SR PCE’s database is shown in Example 14‑13. Three Peering link entries are present in the SR-TE DB entry of Node4: Link[0] is the PeerNode entry, Link[1] is the PeerAdj entry for the top link to Node7, and Link[2] is the PeerAdj entry for

the bottom link. The two PeerAdj entries can be distinguished from each other by their local and remote addresses.

The controller created a node entry for Node7. Since Node7 does not advertise any BGP Peering-SIDs in this example, the controller does not have any other information about this node besides its ASN and BGP router-id (AS3, 1.1.1.7).

Example 14-13: BGP PeerNode-SID SR-TE DB entry

RP/0/0/CPU0:xrvr-10#show pce ipv4 topology
PCE's topology database - detail:
---------------------------------
Node 1
  BGP router ID: 1.1.1.4
  ASN: 1

  Link[0]: local address 1.1.1.4, remote address 1.1.1.7
    Local node:
      BGP router ID: 1.1.1.4
      ASN: 1
    Remote node:
      BGP router ID: 1.1.1.7
      ASN: 3
    Metric: IGP 0, TE 0
    Bandwidth: Total 0 Bps, Reservable 0 Bps
    Adj SID: 50407 (epe)

  Link[1]: local address 99.4.7.4, remote address 99.4.7.7
    Local node:
      BGP router ID: 1.1.1.4
      ASN: 1
    Remote node:
      BGP router ID: 1.1.1.7
      ASN: 3
    Metric: IGP 0, TE 0
    Bandwidth: Total 0 Bps, Reservable 0 Bps
    Adj SID: 50417 (epe)

  Link[2]: local address 77.4.7.4, remote address 77.4.7.7
    Local node:
      BGP router ID: 1.1.1.4
      ASN: 1
    Remote node:
      BGP router ID: 1.1.1.7
      ASN: 3
    Metric: IGP 0, TE 0
    Bandwidth: Total 0 Bps, Reservable 0 Bps
    Adj SID: 50427 (epe)

Node 2
  BGP router ID: 1.1.1.7
  ASN: 3

14.6 Use-Cases

14.6.1 SR Policy Using Peering-SID

Assume in Figure 14-8 that egress nodes Node3 and Node4 have next-hop-self configured. This means that they advertise the BGP prefixes with themselves as BGP nexthop. Both egress nodes receive the AS4 BGP routes 11.1.1.0/24 and 22.2.2.0/24 from their neighboring ASs.

Node1 receives the routes as advertised by Node3 with nexthop 1.1.1.3, and as advertised by Node4 with nexthop 1.1.1.4. Ingress Node1 selects the routes via Node3 as BGP best-path: for both routes the best-path goes via egress Node3 to its peer Node5, as indicated in the illustration.

Figure 14-8: SR Policies using Peering-SIDs

Instead of using the BGP best-path, the operator wants to steer 11.1.1.0/24 via the IGP shortest path to egress Node4 and then to its peer Node5. The prefix 22.2.2.0/24 must be steered via Node8 towards egress Node4 and then to its peer Node7.

For this purpose, the operator instantiates two SR Policies on Node1, both with endpoint Node4 (1.1.1.4), one with color 10 and the other with color 20. The segment list of the SR Policy with color 10 is <16004, 50405>, where 16004 is the Prefix-SID of Node4 and 50405 is the PeerNode-SID of Node4 to peer Node5. The segment list of the SR Policy with color 20 is <16008, 16004, 50407>, with 16008 and 16004 the Prefix-SIDs of Node8 and Node4 respectively, and 50407 the PeerNode-SID of Node4 to peer Node7.

To steer the two service routes on their desired SR Policies, Node1 changes their nexthops to 1.1.1.4 (Node4) and tags each with the color of the appropriate SR Policy: 11.1.1.0/24 gets color 10, 22.2.2.0/24 gets color 20. For this, Node1 applies an ingress route-policy on its BGP session as explained in chapter 10, "Further Details on Automated Steering". BGP on Node1 uses the Automated Steering functionality to steer each of these two service routes on the SR Policy matching its color and nexthop. As a result, Node1 imposes segment list <16004, 50405> on packets destined for prefix 11.1.1.0/24, and segment list <16008, 16004, 50407> on packets destined for prefix 22.2.2.0/24.
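A compact way to think about the Automated Steering decision made on Node1 is a lookup keyed by (color, nexthop). The following Python sketch reproduces the outcome for these two prefixes; the dictionaries are illustrative and the fallback behavior is simplified.

# Conceptual sketch of Automated Steering on Node1: a service route is steered
# into the SR Policy whose (color, endpoint) matches the route's color and nexthop.

SR_POLICIES = {
    # (color, endpoint) -> segment list
    (10, "1.1.1.4"): [16004, 50405],
    (20, "1.1.1.4"): [16008, 16004, 50407],
}

SERVICE_ROUTES = {
    # prefix -> (color, nexthop), as set by Node1's ingress route-policy
    "11.1.1.0/24": (10, "1.1.1.4"),
    "22.2.2.0/24": (20, "1.1.1.4"),
}

def imposed_segment_list(prefix):
    color, nexthop = SERVICE_ROUTES[prefix]
    # If no SR Policy matches (color, nexthop), the route simply follows the
    # IGP shortest path to its nexthop (default forwarding), returned as None here.
    return SR_POLICIES.get((color, nexthop))

for prefix in SERVICE_ROUTES:
    print(prefix, "->", imposed_segment_list(prefix))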

14.6.2 SR EPE for Inter-Domain SR Policy Paths

In a multi-domain network, with multiple ASs and BGP peering links between the ASs, SR BGP EPE can be used to transport the SR traffic across the BGP peering links. When SR EPE is enabled on the BGP peering sessions, the ASBRs advertise these BGP sessions and their BGP Peering SIDs in BGP-LS. The SR PCE receives this information, together with all the other topology information of the network. The SR PCE consolidates all the topology information to form a single network graph. The SR EPE enabled peering links are included in this network graph as links. The controller can then compute the required end-to-end SR Policy paths using that consolidated network graph. When an SR Policy path traverses a BGP peering link, this peering link's Peering SID is included in the SR Policy's segment list. More details are provided in chapter 12, "SR-TE Database".

MPLS Forwarding on Peering Link

To provide SR-MPLS inter-domain connectivity, labeled packets must be able to traverse the peering link. This means that MPLS forwarding must be enabled on this link. Otherwise, labeled packets that are switched on this interface will be dropped. By default, MPLS forwarding is not enabled on a peering link, even if egress-engineering is enabled on the BGP session.

Use the command illustrated in Example 14-14 to list the interfaces with MPLS forwarding enabled. The two interfaces in the output have MPLS forwarding enabled, as indicated in the last column. Any unlisted interface does not have MPLS forwarding enabled.

Example 14-14: Show MPLS-enabled interfaces

RP/0/0/CPU0:xrvr-4#show mpls interfaces
Interface                  LDP      Tunnel   Static   Enabled
-------------------------- -------- -------- -------- --------
GigabitEthernet0/0/0/0     No       No       No       Yes
GigabitEthernet0/0/0/1     No       No       No       Yes

MPLS forwarding is automatically enabled on the peering link when enabling a labeled address-family (such as address-family ipv4 labeled-unicast) on the BGP session. If no labeled address-family is required, enable MPLS forwarding on the interface by configuring it under mpls static, as illustrated in Example 14-15. This command enables MPLS forwarding on interface Gi0/0/0/2.

Example 14-15: Enable MPLS forwarding using mpls static configuration

mpls static
 interface GigabitEthernet0/0/0/2

14.7 Summary

Egress Peer Engineering enables a centralized controller to instruct an ingress node to steer traffic via a specific egress node to a specific BGP peer or peering link.
SR EPE Peering-SIDs steer traffic to a specific BGP peer or peering link. The different types of Peering-SIDs are PeerNode-SIDs, PeerAdj-SIDs, and PeerSet-SIDs.
The Peering-SID information is advertised in BGP-LS. This way, a controller can insert the information in its SR-TE DB and use it for path computations.
SR EPE Peering-SIDs can be used in an SR Policy's segment list to steer traffic via a specific egress node and a specific peering link to an external domain.
SR EPE Peering-SIDs can also be used in the segment list of an end-to-end inter-domain SR-TE path to cross the inter-domain BGP peerings.

14.8 References

[RFC7752] "North-Bound Distribution of Link-State and Traffic Engineering (TE) Information Using BGP", Hannes Gredler, Jan Medved, Stefano Previdi, Adrian Farrel, Saikat Ray, RFC7752, March 2016
[RFC8402] "Segment Routing Architecture", Clarence Filsfils, Stefano Previdi, Les Ginsberg, Bruno Decraene, Stephane Litkowski, Rob Shakir, RFC8402, July 2018
[RFC5082] "The Generalized TTL Security Mechanism (GTSM)", Carlos Pignataro, Pekka Savola, David Meyer, Vijay Gill, John Heasley, RFC5082, October 2007
[draft-ietf-idr-bgpls-segment-routing-epe] "BGP-LS extensions for Segment Routing BGP Egress Peer Engineering", Stefano Previdi, Ketan Talaulikar, Clarence Filsfils, Keyur Patel, Saikat Ray, Jie Dong, draft-ietf-idr-bgpls-segment-routing-epe-18 (Work in Progress), March 2019
[draft-ketant-idr-bgp-ls-bgp-only-fabric] "BGP Link-State Extensions for BGP-only Fabric", Ketan Talaulikar, Clarence Filsfils, Krishnaswamy Ananthamurthy, Shawn Zandi, Gaurav Dawra, Muhammad Durrani, draft-ketant-idr-bgp-ls-bgp-only-fabric-02 (Work in Progress), March 2019
[draft-ietf-spring-segment-routing-central-epe] "Segment Routing Centralized BGP Egress Peer Engineering", Clarence Filsfils, Stefano Previdi, Gaurav Dawra, Ebben Aries, Dmitry Afanasiev, draft-ietf-spring-segment-routing-central-epe-10 (Work in Progress), December 2017

15 Performance Monitoring – Link Delay

What you will learn in this chapter:

The Performance Measurement (PM) framework enables measurement of various characteristics (delay, loss, and consistency) for different network elements (link, SR Policy, node).
The PM functionality enables the dynamic measurement of link delays.
Link delay is measured using standard query-response messaging, in either one-way or two-way mode.
Measured link delay information is advertised in the IGP and in BGP-LS.
The minimum delay measured over a time period represents the propagation delay of the link. This is a stable metric used by SR-TE and Flex-Algo to steer traffic along a low-delay path across the network.
To reduce the flooding churn on the network, link delay metrics are only flooded when the minimum delay changes significantly.
Detailed delay measurement reports are streamed via telemetry, for history and analysis.

After introducing the generic Performance Measurement framework, we provide a brief overview of the components of link delay, highlighting why the minimum value is so important for SR-TE. We explain in detail how the link delays are measured in the network, how to enable these measurements in IOS XR, and how to configure the functionality to meet the operator's needs. We then explain how the link delay information is advertised by the IGP, BGP-LS, and telemetry, showing in particular how variations in the link delay are reflected in the routing protocol advertisements. Finally, we recall how the delay information is leveraged by SR-TE.

15.1 Performance Measurement Framework

In certain networks, network performance data such as packet loss, delay, and delay variation (jitter), as well as bandwidth utilization, is a critical measure for Traffic Engineering (TE). Such performance data gives operators insight into the characteristics of their networks, which is required to ensure that the Service Level Agreements (SLAs) are met.

The Performance Measurement (PM) functionality provides a generic framework to dynamically measure various characteristics (delay, loss, and consistency) for different network elements (link, SR Policy, node). The focus of this chapter is the link delay measurement. The description of other PM functionalities, such as measuring delay and loss of SR Policies and Data Plane Monitoring (DPM), is deferred to a later revision of this book.

The PM functionalities can be employed to measure actual performance metrics and react accordingly. For example, SR-TE leverages link delay measurements to maintain dynamic low-delay paths, updating them as needed to always provide the lowest possible delay. Before link delay measurement was available, the TE metric was often used to express a static link delay value, and SR-TE computed delay-optimized paths by minimizing the accumulated TE metric along the path. However, these low-delay paths could not adapt to varying link delays, such as following a reroute in the underlay optical network, since TE metric values were statically configured.

15.2 The Components of Link Delay

The Extended TE Link Delay Metrics are flooded in ISIS (RFC 7810), OSPF (RFC 7471), as well as BGP-LS (draft-ietf-idr-te-pm-bgp). The following Extended TE Link Delay Metrics are flooded:

Unidirectional Link Delay
Unidirectional Minimum/Maximum Link Delay
Unidirectional Delay Variation

Section 15.4 in this chapter covers the details of the link delay metrics advertisement in the IGP and BGP-LS.

An Insight Into the Physical Topology

Of all these link delay statistics, only the minimum values are leveraged by SR-TE for dynamic path computation. The minimum link delay measured over a period of time is the most representative of the transmission or propagation delay of a link. This is a stable value that only depends on the underlying optical circuit, irrespective of the traffic traversing the link. If the physical topology does not change, a constant fiber length combined with a constant speed of light provides a constant propagation delay.

A significant change of the minimum link delay thus indicates a change in the optical circuit, with an impact on the traversed fiber length. For example, a fiber cut followed by a reroute of the optical circuit is reflected in the measured minimum link delay. In practice, such optical network topology changes often occur without the IP team being aware, although they may have a significant impact on the service quality perceived by the user. The link delay measurement functionality provides the means to automatically respond to these changes by moving sensitive traffic to alternative, lower-delay paths.

QoS-Controlled Variations

The other link delay statistics — average, maximum and variance — are affected by the traffic flowing through the link and can thus be highly variable at any time scale. These values do not reflect

the forwarding path of the traffic but the QoS policy (e.g., queueing, tail-drop) configured by the operator for that class of traffic. For that reason, they are not suitable as a routing metric. Furthermore, rerouting traffic as a reaction to a change in the packet scheduling delays would cause network instability and considerably deteriorate the perceived service quality. An operator willing to reduce the maximum delay or jitter (delay variation) along an SR Policy should consider modifying the QoS policy applied to the traffic steered into that SR Policy.
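The following Python sketch illustrates this distinction: per measurement interval a node can derive minimum, average, maximum, and variance, but only a significant change of the minimum, the propagation-delay estimate, should trigger a new advertisement. The 10% threshold and the sample values are arbitrary illustrations, not the actual IOS XR defaults.

# Illustrative sketch: summarize per-interval delay samples and decide
# whether the minimum changed enough to warrant re-flooding.

from statistics import mean, pvariance

def summarize(samples_usec):
    return {"min": min(samples_usec), "avg": mean(samples_usec),
            "max": max(samples_usec), "var": pvariance(samples_usec)}

def should_readvertise(prev_min, new_min, threshold=0.10):
    # Re-flood only when the propagation-delay estimate (the minimum)
    # moves by more than the threshold; avg/max/var churn is ignored.
    return abs(new_min - prev_min) > threshold * prev_min

interval1 = [503, 510, 545, 602, 504]   # queueing inflates avg/max, not min
interval2 = [811, 815, 840, 812, 902]   # e.g., optical reroute: longer fiber, higher min

s1, s2 = summarize(interval1), summarize(interval2)
print(s1)
print(s2)
print(should_readvertise(s1["min"], s2["min"]))   # True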

15.3 Measuring Link Delay

The measurement method uses (hardware) timestamping in Query and Response packets, as defined in RFC 6374 "Packet Loss and Delay Measurement for MPLS Networks", RFC 4656 "One-way Active Measurement Protocol (OWAMP)", and RFC 5357 "Two-Way Active Measurement Protocol (TWAMP)". To measure the link delay at local Node1 over a link to remote Node2, the following steps are followed, as illustrated in Figure 15-1:

Figure 15-1: Delay Measurement method

1. Local Node1 (querier) sends a Delay Measurement (DM) Query packet to remote Node2 (responder). The egress Line Card (LC) on Node1 timestamps the packet just before sending it on the wire (T1).
2. Remote Node2's ingress LC timestamps the packet as soon as it receives it from the wire (T2).
3. Remote Node2 reflects the DM packet to the querier with the timestamps in the DM Response packet. Remote Node2 timestamps (optional for one-way measurement) the packet just before sending it over the wire (T3).

4. Local Node1 timestamps (optional for one-way measurement) the packet as soon as it receives the packet from the wire (T4).

Each node measures the link delay independently. Therefore, Node1 and Node2 both send Query messages. A querier sends the Query messages on its own initiative; a responder replies to the received Query messages with Response messages.

The link delay is computed using the timestamps in the DM Response packet. For one-way measurements, the delay is computed as T2 – T1. For two-way measurements, the one-way delay is computed as half of the two-way delay: ((T2 – T1) + (T4 – T3))/2 = ((T4 – T1) – (T3 – T2))/2.

The performance measurement functionality is disabled by default in IOS XR and is only enabled when it is configured.
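Written out as code, the two computations above look as follows. This is a small illustrative sketch; timestamps are represented as floating-point seconds and the function names are ours.

# Sketch of the one-way and two-way delay computations described above.

def one_way_delay(t1, t2):
    # Requires the clocks of querier and responder to be synchronized (e.g., PTP).
    return t2 - t1

def two_way_delay(t1, t2, t3, t4):
    # Subtracts timestamps taken by the same node, so no clock sync is needed;
    # the one-way value is approximated as half of the round-trip time.
    round_trip = (t4 - t1) - (t3 - t2)
    return round_trip / 2

# Illustrative values: T1/T4 taken by the querier, T2/T3 by the responder.
t1, t2, t3, t4 = 100.000000, 100.000450, 100.000500, 100.001400
print(one_way_delay(t1, t2))          # ~0.00045 s, valid only with synchronized clocks
print(two_way_delay(t1, t2, t3, t4))  # ((t4-t1)-(t3-t2))/2 = ~0.000675 s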

15.3.1 Probe Format

The Delay Measurement (DM) probe packets can be sent in various formats and various encapsulations. Well-known formats are RFC 6374 "Packet Loss and Delay Measurement for MPLS Networks" and RFC 4656/RFC 5357 "OWAMP"/"TWAMP". RFC 6374 probes are sent as native MPLS packets or IP/UDP encapsulated. This format is usable for measurements in SR-MPLS networks.

draft-gandhi-spring-twamp-srpm describes a more generic mechanism for Performance Measurement in MPLS and non-MPLS networks. While this draft uses TWAMP IP/UDP encapsulated probes, the measurement mechanism is the same as described in this chapter. At the time of writing, the above IETF draft was just released. Therefore, this section focuses on the RFC 6374 encapsulation and the TWAMP encapsulation is deferred to a later revision of this book.

The RFC 6374 Delay Measurement (DM) probe packets are MPLS packets that are sent over the Generic Associated Channel (G-ACh). Packets that are sent on the G-ACh are MPLS packets with the G-ACh Label (GAL) – label value 13 – at the bottom of the MPLS label stack.

G-ACh, ACH, and GAL

Generic Associated Channel (G-ACh)
The Generic Associated Channel (G-ACh) (RFC 5586) is a control channel that is associated with an MPLS Label Switched Path (LSP), a pseudowire, or a section (link). It provides a supplementary logical data channel that can be used by various protocols, mostly for maintenance functions. The G-ACh provides a link-layer-agnostic channel that can be used in MPLS networks to communicate between two adjacent devices, similar to Cisco Discovery Protocol (CDP) or Link Layer Discovery Protocol (LLDP) used on Ethernet links.

Associated Channel Header (ACH)
A packet that is sent on the G-ACh has an Associated Channel Header (ACH) that identifies the Channel Type. There is for example a channel type for Delay Measurement (DM) packets.

G-ACh Label (GAL)
To enable devices to identify packets that contain an ACH (i.e., are sent on an associated control channel), a label-based exception mechanism is used. A label from the MPLS reserved label range is used for this purpose: the G-ACh Label (GAL) with label value 13. When a packet is received with the GAL label, the receiving device knows the packet is received on the G-ACh and that an ACH follows the bottom-of-stack label.
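As a small illustration of these headers, the Python sketch below encodes an MPLS label stack entry carrying the GAL (label 13, bottom-of-stack, TTL 1) and the 4-byte ACH that follows it. The DM channel type 0x000C matches the packet captures shown later in this chapter; the helper functions are illustrative, not an implementation of RFC 5586.

# Illustrative encoders for a GAL label stack entry and an ACH.

import struct

def mpls_label_entry(label, exp=0, s=1, ttl=1):
    # 20-bit label | 3-bit EXP/TC | 1-bit bottom-of-stack | 8-bit TTL
    return struct.pack("!I", (label << 12) | (exp << 9) | (s << 8) | ttl)

def ach(channel_type, version=0):
    # First nibble 0001 identifies the ACH, then version, reserved, channel type.
    return struct.pack("!BBH", 0x10 | version, 0x00, channel_type)

GAL = 13
DM_CHANNEL_TYPE = 0x000C
print(mpls_label_entry(GAL).hex())    # 0000d101: label 13, bottom-of-stack, TTL 1
print(ach(DM_CHANNEL_TYPE).hex())     # 1000000c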

The GAL is followed by an Associated Channel Header (ACH) that identifies the message type, and the message body that contains the actual DM probe packet follows this ACH. Because the DM probe packets are carried in the G-ACh, these probe packets are MPLS packets. This means that the LC at the remote end of the link must be MPLS-capable in order to process these packets. The DM Query probe packet format is shown in Figure 15‑2. The packet consists of a Layer 2 header, an MPLS shim header, an Associated Channel Header (ACH), and the DM probe packet.

Figure 15-2: DM Query packet using RFC 6374 Packet Format

Layer 2 header
The Destination MAC address of the DM Query packets is the MPLS Multicast MAC address 01-00-5e-80-00-0d (RFC 7213). This way the user does not need to configure next-hop addresses for the links (which would be needed to resolve a unicast destination MAC address). The remote side's LC needs to support this MPLS Multicast MAC address. Since the packet is an MPLS packet, the L2 ethertype field value is 0x8847.

MPLS shim header
The G-ACh Label (GAL) (label value 13) follows the MAC header. This special label indicates to the receiving node that this packet is a G-ACh packet.

Associated Channel Header (ACH)
The Associated Channel Header (ACH) that follows the GAL specifies the Channel Type. This indicates the type of message carried in the associated control channel (G-ACh). For DM Query probe messages, the Channel Type is called "MPLS Delay Measurement (DM)".

DM packet
The actual DM Query probe message follows the ACH. The fields in the DM packet are specified as follows:

Version: currently set to 0.
Flags:
  Query/Response indicator (R-flag): set to 0 for a Query and 1 for a Response.
  Traffic-class-specific measurement indicator (T-flag): set to 1 when the measurement is done for a particular traffic class (DSCP value) specified in the DS field – 0 for IOS XR.
Control Code:
  For a Query message:
    0x0: In-band Response Requested – response is expected over G-ACh – IOS XR uses this Control Code for two-way delay measurements
    0x1: Out-of-band Response Requested – response is expected over an out-of-band channel – IOS XR uses this Control Code for one-way delay measurements
    0x2: No Response Requested – no response is expected.
  For a Response message:
    0x0: Success
    other values: Notifications and Errors
Message Length: length of this message
Querier Timestamp Format (QTF): the format of the timestamp values written by the querier – IOS XR: IEEE 1588-2008 (1588v2) format; 32-bit seconds field + 32-bit nanoseconds field
Responder Timestamp Format (RTF): the format of the timestamp values written by the responder – IOS XR: IEEE 1588-2008 (1588v2) format; 32-bit seconds field + 32-bit nanoseconds field
Responder's Preferred Timestamp Format (RPTF): the timestamp format preferred by the responder – IOS XR: not specified
Session Identifier: an arbitrary numerical value that uniquely identifies a measurement operation (also called a session) that consists of a sequence of messages. All messages in the sequence (session) have the same Session Identifier value.
DS: DSCP value of the measured traffic-class – not used in IOS XR (T-flag = 0).
Timestamp 1-4: the timestamps as collected by the local-end and remote-end nodes.
TLV Block: zero or more TLVs – IOS XR includes the UDP Return Object (URO) TLV (RFC 7876) for one-way delay measurement.

The delay measurement method allows for one-way and two-way delay measurements. By default, two-way delay measurement is enabled in IOS XR when enabling link delay measurement.

Two-Way Measurements

For two-way measurements, all four timestamps (T1 to T4) defined in the DM packet are used. Since the delay in both directions must be measured, both Query and Response messages are sent as MPLS GAL packets, as shown in Figure 15‑2. Therefore, the querier requests to receive the DM Response in-band and the UDP Return Object (URO) TLV is not included in the TLV block.

To compute the two-way link delay, the timestamps filled in by the same node are subtracted; therefore, hardware clock synchronization is not required between querier and responder nodes. On platforms that do not support Precision Time Protocol (PTP) for accurate time synchronization, two-way delay measurement is the only option to support delay measurement. From a two-way measurement, the one-way link delay is computed as the two-way delay divided by 2.

Example 15-1 shows a packet capture of a DM Query message for a two-way delay measurement. The querier node has indicated it desires an in-band response (line 16). The session identifier is 33554433 (line 21). Timestamp T1 has been filled in when the query was transmitted (line 22).

Example 15-1: Two-way delay measurement Query message example

 1 Ethernet II, Src: fa:16:3e:59:50:91, Dst: IPv4mcast_01:81 (01:00:5e:00:01:81)
 2     Destination: IPv4mcast_01:81 (01:00:5e:00:01:81)
 3     Source: fa:16:3e:59:50:91 (fa:16:3e:59:50:91)
 4     Type: MPLS label switched packet (0x8847)
 5 MultiProtocol Label Switching Header, Label: 13 (Generic Associated Channel Label (GAL)), Exp: 0, S: 1, TTL: 1
 6 Generic Associated Channel Header
 7     .... 0000 = Channel Version: 0
 8     Reserved: 0x00
 9     Channel Type: MPLS Delay Measurement (DM) (0x000c)
10 MPLS Delay Measurement (DM)
11     0000 .... = Version: 0
12     .... 0000 = Flags: 0x0
13         0... = Response indicator (R): Not set
14         .0.. = Traffic-class-specific measurement indicator (T): Not set
15         ..00 = Reserved: Not set
16     Control Code: In-band Response Requested (0x00)
17     Message Length: 44
18     0011 .... = Querier timestamp format (QTF): Truncated IEEE 1588v2 PTP Timestamp (3)
19     .... 0000 = Responder timestamp format (RTF): Null Timestamp (0)
20     0000 .... = Responder's preferred timestamp format (RPTF): Null Timestamp (0)
21     Session Identifier: 33554433
22     Timestamp 1 (T1): 1519157877.787756030 seconds
23     Timestamp 2 (T2): 0.000000000 seconds
24     Timestamp 3: 0
25     Timestamp 4: 0

The responder fills in timestamp T2 (line 23 of Example 15-1) when receiving the Query message. The responder then sends an in-band (G-ACh) Response message, presented in Example 15-2. It copies the Session Identifier and Querier Timestamp Format (QTF) fields from the Query message and it also copies the T1 and T2 fields of the Query message to the T3 and T4 fields of the Response

message. When transmitting the Response message, it fills in the timestamp T3 field (line 22). Timestamp T4 is filled in by the querier when it receives the message.

Example 15-2: Two-way delay measurement Response message example

 1 Ethernet II, Src: fa:16:3e:72:db:5e, Dst: IPv4mcast_01:81 (01:00:5e:00:01:81)
 2     Destination: IPv4mcast_01:81 (01:00:5e:00:01:81)
 3     Source: fa:16:3e:72:db:5e (fa:16:3e:72:db:5e)
 4     Type: MPLS label switched packet (0x8847)
 5 MultiProtocol Label Switching Header, Label: 13 (Generic Associated Channel Label (GAL)), Exp: 0, S: 1, TTL: 1
 6 Generic Associated Channel Header
 7     .... 0000 = Channel Version: 0
 8     Reserved: 0x00
 9     Channel Type: MPLS Delay Measurement (DM) (0x000c)
10 MPLS Delay Measurement (DM)
11     0000 .... = Version: 0
12     .... 1000 = Flags: 0x8
13         1... = Response indicator (R): Set
14         .0.. = Traffic-class-specific measurement indicator (T): Not set
15         ..00 = Reserved: Not set
16     Control Code: Success (0x01)
17     Message Length: 44
18     0011 .... = Querier timestamp format (QTF): Truncated IEEE 1588v2 PTP Timestamp (3)
19     .... 0011 = Responder timestamp format (RTF): Truncated IEEE 1588v2 PTP Timestamp (3)
20     0000 .... = Responder's preferred timestamp format (RPTF): Null Timestamp (0)
21     Session Identifier: 33554433
22     Timestamp 1 (T3): 1519157877.798967010 seconds
23     Timestamp 2 (T4): 0.000000000 seconds
24     Timestamp 3 (T1): 1519157877.787756030 seconds
25     Timestamp 4 (T2): 1519157877.798967010 seconds

One-Way Measurements

When one-way delay is enabled, the querier requests to receive the DM Response out-of-band and it adds a UDP Return Object (URO) TLV (defined in RFC 7876) to the DM Query packet, so that the DM Response message is delivered via an out-of-band UDP channel. The URO TLV contains the IP address (IPv4 or IPv6) and the destination UDP port to be used for this response packet. Only two timestamps (T1 and T2) defined in the RFC 6374 DM packets are used for one-way measurements, since the Response packet is not sent in-band and may not traverse the link. Timestamps filled in by different nodes are subtracted; therefore, the hardware clocks must be accurately synchronized between the querier and responder nodes. Clock synchronization can be achieved by using, for example, Precision Time Protocol (PTP), which provides clock accuracy in the sub-microsecond range. Figure 15‑3 shows the format of the out-of-band DM Response probe packet used for one-way delay measurements.

Figure 15-3: Out-of-band DM Response message for one-way delay measurement

The DM Response probe packet is carried in an IP/UDP packet, with the IP destination address and UDP destination port number as specified in the DM Query probe message. The source IP address and source UDP port number are chosen by the responder. The RFC 6374 DM packet immediately follows the UDP header; no ACH is included. The correlation between Query and Response messages can be achieved by the UDP port number and the Session Identifier field in the DM message.

Example 15‑3 shows a packet capture of a Query message for a one-way measurement. For a one-way measurement, the querier requests an out-of-band response (line 16) and adds a UDP Return Object (URO) TLV (lines 26 to 30), requesting the DM Response to be returned in a UDP packet.

Example 15-3: One-way delay measurement Query message example

 1 Ethernet II, Src: fa:16:3e:59:50:91, Dst: IPv4mcast_01:81 (01:00:5e:00:01:81)
 2     Destination: IPv4mcast_01:81 (01:00:5e:00:01:81)
 3     Source: fa:16:3e:59:50:91 (fa:16:3e:59:50:91)
 4     Type: MPLS label switched packet (0x8847)
 5 MultiProtocol Label Switching Header, Label: 13 (Generic Associated Channel Label (GAL)), Exp: 0, S: 1, TTL: 1
 6 Generic Associated Channel Header
 7     .... 0000 = Channel Version: 0
 8     Reserved: 0x00
 9     Channel Type: MPLS Delay Measurement (DM) (0x000c)
10 MPLS Delay Measurement (DM)
11     0000 .... = Version: 0
12     .... 0000 = Flags: 0x0
13         0... = Response indicator (R): Not set
14         .0.. = Traffic-class-specific measurement indicator (T): Not set
15         ..00 = Reserved: Not set
16     Control Code: Out-of-band Response Requested (0x01)
17     Message Length: 52
18     0011 .... = Querier timestamp format (QTF): Truncated IEEE 1588v2 PTP Timestamp (3)
19     .... 0000 = Responder timestamp format (RTF): Null Timestamp (0)
20     0000 .... = Responder's preferred timestamp format (RPTF): Null Timestamp (0)
21     Session Identifier: 33554434
22     Timestamp 1 (T1): 1519157877.787756030 seconds
23     Timestamp 2 (T2): 0.000000000 seconds
24     Timestamp 3: 0
25     Timestamp 4: 0
26     UDP Return Object (URO)
27         URO type: 131
28         Length: 6
29         UDP-Destination-Port: 6634
30         Address: 10.0.0.5

The responder sends the DM Response in a UDP packet, presented in Example 15‑4. Even though the responder also filled in timestamp T3 when transmitting the Response, this timestamp is of little use since the DM Response is not transmitted in-band and may not traverse the measured link.

Example 15-4: One-way delay measurement Response message example

Ethernet II, Src: fa:16:3e:72:db:5e, Dst: fa:16:3e:59:50:91
    Destination: fa:16:3e:59:50:91 (fa:16:3e:59:50:91)
    Source: fa:16:3e:72:db:5e (fa:16:3e:72:db:5e)
    Type: IPv4 (0x0800)
Internet Protocol Version 4, Src: 10.0.0.6 (10.0.0.6), Dst: 10.0.0.5 (10.0.0.5)
    Protocol: UDP (17)
User Datagram Protocol, Src Port: 6634, Dst Port: 6634
    Source Port: 6634
    Destination Port: 6634
MPLS Delay Measurement (DM)
    0000 .... = Version: 0
    .... 1000 = Flags: 0x8
        1... = Response indicator (R): Set
        .0.. = Traffic-class-specific measurement indicator (T): Not set
        ..00 = Reserved: Not set
    Control Code: Success (0x01)
    Message Length: 44
    0011 .... = Querier timestamp format (QTF): Truncated IEEE 1588v2 PTP Timestamp (3)
    .... 0011 = Responder timestamp format (RTF): Truncated IEEE 1588v2 PTP Timestamp (3)
    0000 .... = Responder's preferred timestamp format (RPTF): Null Timestamp (0)
    Session Identifier: 33554434
    Timestamp 1 (T1): 1519157877.787756030 seconds
    Timestamp 2 (T2): 1519157877.798967010 seconds
    Timestamp 3 (T3): 1519157877.798967010 seconds
    Timestamp 4 (T4): 0.000000000 seconds
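Using the T1 and T2 values captured in Example 15‑4, the one-way delay is simply the difference between the two timestamps; since they are taken on different nodes, this is only meaningful with synchronized clocks. A minimal Python sketch, purely illustrative:

t1 = 1519157877.787756030  # Query transmitted by the querier (T1)
t2 = 1519157877.798967010  # Query received by the responder (T2)

one_way_delay = t2 - t1    # valid only if both clocks are synchronized (e.g., via PTP)
print(round(one_way_delay * 1e6))  # ~11211 microseconds (about 11.2 ms)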

15.3.2 Methodology

To measure the delay, the operator sets up a delay measurement session by enabling delay measurement on an interface. Such a measurement session consists of a sequence of DM Query messages sent by the querier node. The responder node replies to each of the Query messages with a DM Response message. All DM Query and Response messages in a given session use the same Session Identifier field value in the DM packet. The characteristics of the measurement session (interval between queries, number of queries, …) are chosen by the querier node.

The delay measurement operation follows the sequence below. Every probe interval, the querier node starts a probe. Each probe consists of one or more queries, where each query – a DM Query/Response exchange – provides a delay measurement. The number of queries per probe is called the burst count and the time interval between two queries in a burst is called the burst interval.

These terms are illustrated in Figure 15‑4. The illustration shows the sequence of two probes, with a burst count of five.

Figure 15-4: Delay measurement – query, probe, and burst terminology

The different delay measurement parameters are configurable, as described in the next section. The default probe interval is 30 seconds, with a default burst count of 10 and a default burst interval of 3 seconds. With these defaults, the queries are evenly spread over the probe interval.

After receiving all responses to the queries in a probe (or after a timeout, in case one or more messages have been lost), the various link delay metrics are computed over all the (successful) queries in the probe: minimum, average, moving average, maximum, and variation. These statistical metrics are provided to the operator via Telemetry (periodic and event-driven) and CLI/NETCONF. At the end of a probe, the node also checks if the current link delay metrics must be advertised in IGP using the accelerated advertisement. The IGP advertisement functionality is further discussed in section 15.4 below.

In Figure 15‑4, let us assume that d1n is the measured delay of the nth query in this probe and that Probe0 is the probe before Probe1 (not illustrated). After receiving the responses to all queries of Probe1, the querier node computes the probe metrics as follows.

Average delay: Probe1.Average = AVG(d11, d12, d13, d14, d15)
Moving average³: Probe1.MovingAvg = Probe0.MovingAvg × (1 – ⍺) + Probe1.Average × ⍺ (⍺ is a weight factor, set to 0.5 at the time of writing)
Minimum delay: Probe1.Minimum = MIN(d11, d12, d13, d14, d15)
Maximum delay: Probe1.Maximum = MAX(d11, d12, d13, d14, d15)
Delay variation⁴: Probe1.Variation = Probe1.Average – Probe1.Minimum

Similarly, after receiving all responses to the queries of Probe2, the querier node computes:

Probe2.Average = AVG(d21, d22, d23, d24, d25)
Probe2.MovingAvg = Probe1.MovingAvg × (1 – ⍺) + Probe2.Average × ⍺
Probe2.Minimum = MIN(d21, d22, d23, d24, d25)
Probe2.Maximum = MAX(d21, d22, d23, d24, d25)
Probe2.Variation = Probe2.Average – Probe2.Minimum

The querier node aggregates the delay measurements of multiple probes every advertisement interval. The advertisement interval is a multiple of the probe interval, hence it consists of a whole number of probes. The advertisement interval is configurable, with a default of 120 seconds. The advertisement interval is also called the aggregation interval.

After computing the delay metrics of the last probe in the advertisement interval, the querier node computes the aggregated delay metrics over all probes in the advertisement interval: average, minimum, moving average, maximum, and variation. The querier node then provides these statistical delay metrics via Telemetry (periodic and event-driven) and CLI/NETCONF. At this point in time, the node also checks if the current link delay metrics must be advertised in IGP as a periodic advertisement.

Figure 15‑5 shows three probes, with a burst count of five and the probe interval as indicated. The aggregation interval is two times the probe interval in this illustration.

Assuming that the node has aggregated the metrics at the end of Probe1, the next aggregation interval consists of Probe2 and Probe3. At the end of Probe3, the node computes the delay metrics of Probe3 itself and it also computes the aggregate metrics, since it is the last probe in the aggregation interval.

Figure 15-5: Delay measurement – probe metrics and aggregated metrics

The aggregate metrics computed after completing Probe3 are as follows:

Average delay: Agg2.Average = AVG(Probe2.Average, Probe3.Average)
Moving average: Agg2.MovingAvg = Probe3.MovingAvg
Minimum delay: Agg2.Minimum = MIN(Probe2.Minimum, Probe3.Minimum)
Maximum delay: Agg2.Maximum = MAX(Probe2.Maximum, Probe3.Maximum)
Delay variation: Agg2.Variation = AVG(Probe2.Variation, Probe3.Variation)
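The per-probe and per-aggregation computations above can be condensed in a small Python sketch. The weight factor ⍺ = 0.5 is the value stated in the text; the function and field names are illustrative, not the router implementation.

ALPHA = 0.5  # moving-average weight factor at the time of writing

def probe_metrics(delays, prev_moving_avg):
    """Metrics of one probe; 'delays' holds the per-query delay measurements."""
    avg = sum(delays) / len(delays)
    return {
        "average": avg,
        "moving_avg": prev_moving_avg * (1 - ALPHA) + avg * ALPHA,
        "minimum": min(delays),
        "maximum": max(delays),
        "variation": avg - min(delays),  # PDV as in RFC 5481 section 4.2
    }

def aggregate_metrics(probes):
    """Aggregation over the probes of one advertisement interval (oldest first)."""
    return {
        "average": sum(p["average"] for p in probes) / len(probes),
        "moving_avg": probes[-1]["moving_avg"],  # moving average of the last probe
        "minimum": min(p["minimum"] for p in probes),
        "maximum": max(p["maximum"] for p in probes),
        "variation": sum(p["variation"] for p in probes) / len(probes),
    }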

15.3.3 Configuration

Delay measurements can be enabled per interface by configuring delay-measurement on that interface under the performance-measurement configuration section, as shown in Example 15‑5.

Example 15-5: Enable delay measurement on an interface

performance-measurement
 interface GigabitEthernet0/0/0/0
  delay-measurement
 !
 interface GigabitEthernet0/0/0/1
  delay-measurement

The remote side of the link must be similarly configured to activate the responder.

The performance measurement parameters are configured under a profile. The global profile for link delay measurements (delay-profile interfaces) is shown in Example 15‑6. The configuration value ranges and defaults are indicated. The parameters in this delay-profile are applied for all interface link delay measurements.

Example 15-6: Global link delay measurement profile

performance-measurement
 !! Global default profile for link delay measurement
 delay-profile interfaces
  probe
   interval < 30-3600 SEC > (default: 30 sec)
   burst
    count < 1-30 COUNT > (default: 10 count)
    interval < 30-15000 msec > (default: 3000 msec)
   one-way (default: two-way)
  advertisement
   periodic
    interval < 30-3600 sec > (default: 120 sec)

By default, the delay measurements are performed in two-way mode. For one-way measurements, the Precision Time Protocol (PTP) must be enabled for accurate time synchronization between the two nodes and the delay profile must be configured with probe one-way.

A configured advertisement interval (advertisement periodic interval) is internally rounded up to the next multiple of the probe interval, so that it equals a whole number of probe intervals. For example, a configured advertisement interval of 45 seconds with a probe interval of 30 seconds is rounded up to 60 seconds, or two probe intervals: 2×30 seconds. The configuration does not reflect this internal round-up.
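The internal round-up can be expressed in one line. A small sketch of the rule described above; the 45-second example from the text gives 60 seconds:

import math

def effective_advertisement_interval(configured_sec, probe_interval_sec):
    # Round up to the next whole multiple of the probe interval.
    return math.ceil(configured_sec / probe_interval_sec) * probe_interval_sec

print(effective_advertisement_interval(45, 30))  # 60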

The other advertisement configuration options are discussed in the Delay Advertisement section below.

Static Delay

If desired, a static link delay can be configured for an interface, using the advertise-delay command illustrated in Example 15‑7.

Example 15-7: Static delay configuration

performance-measurement
 interface GigabitEthernet0/0/0/0
  delay-measurement
   advertise-delay

When a static advertise-delay value is configured on an interface, the minimum, maximum, and average delay advertised by the node for this interface are all equal to the configured value, while the advertised variance is 0. Even if the link delay is statically configured, probes continue to be scheduled and the delay metrics are still aggregated, stored in the history buffers, and streamed in telemetry. However, the advertisement threshold checks are suppressed and, therefore, the actual measured link delay values are not advertised.

Platforms that do not support dynamic link delay measurements may still allow configuration of static link delay values. This allows integration of such platforms in delay-optimized path computations without using the TE metric as fallback link delay metric.

15.3.4 Verification

Various show commands are available to verify delay measurements on a router. These show commands are not covered in detail in this book; please refer to the available product documentation. We do want to mention that a node keeps a history of delay metric data. The following historical data is available:

1. Delay metrics computed from the probe: delay metrics computed at the end of the probe-interval (threshold crossed or not)
2. Delay metrics computed from the aggregation: delay metrics computed at the end of the advertisement interval
3. Delay metrics computed from the advertisement: delay metrics from the accelerated and periodic advertisements (i.e., threshold crossed)

Example 15‑8, Example 15‑9, and Example 15‑10 are a few examples. Example 15‑8 shows the current information of a delay measurement session. Example 15‑9 and Example 15‑10 respectively show historical information of the probe metrics and the aggregated metrics.

Example 15-8: Delay measurement information

RP/0/0/CPU0:iosxrv-1#show performance-measurement interfaces
--------------------------------------------------------------------------
0/0/CPU0
--------------------------------------------------------------------------
Interface Name: GigabitEthernet0/0/0/0 (ifh: 0x20)
  Delay-Measurement  : Enabled
  Local IPV4 Address : 99.1.2.1
  Local IPV6 Address : ::
  Local MAC Address  : fa16.3e59.5091
  Primary VLAN Tag   : None
  Secondary VLAN Tag : None
  State              : Up
  Delay Measurement session:
    Session ID : 33554434
    Last advertisement:
      Advertised at: 10:41:58 Thu 22 Feb 2018 (161 seconds ago)
      Advertised reason: periodic timer, min delay threshold crossed
      Advertised delays (uSec): avg: 17711, min: 14998, max: 24998, variance: 3124
    Current advertisement:
      Scheduled in 2 more probes (roughly every 120 seconds)
      Current delays (uSec): avg: 18567, min: 14998, max: 19998, variance: 3000
      Number of probes started: 2

Example 15-9: History of probe delay metric

RP/0/0/CPU0:iosxrv-1#show performance-measurement history probe interfaces
--------------------------------------------------------------------------
0/0/CPU0
--------------------------------------------------------------------------
Interface Name: GigabitEthernet0/0/0/0 (ifh: 0x20)
  Delay-Measurement history (uSec):
    Probe Start Timestamp     Pkt(TX/RX)  Average   Min       Max
    10:46:01 Thu 22 Feb 2018  10/10       17998     14998     19998
    10:45:31 Thu 22 Feb 2018  10/10       18498     14998     19998
    10:45:01 Thu 22 Feb 2018  10/10       18998     14998     19998
    10:44:31 Thu 22 Feb 2018  10/10       17998     9999      19998
    10:44:01 Thu 22 Feb 2018  10/10       17998     14998     19998
    10:43:31 Thu 22 Feb 2018  10/10       19498     14998     24998
    10:43:01 Thu 22 Feb 2018  10/10       18998     14998     19998
    10:42:31 Thu 22 Feb 2018  10/10       18998     14998     19998
    10:42:01 Thu 22 Feb 2018  10/10       18498     14998     19998
    10:41:31 Thu 22 Feb 2018  10/10       17498     14998     19998
    10:41:01 Thu 22 Feb 2018  10/10       16998     14998     24998
    10:40:31 Thu 22 Feb 2018  10/10       19498     14998     24998
    10:40:01 Thu 22 Feb 2018  10/10       18498     14998     19998
    10:39:31 Thu 22 Feb 2018  10/10       17998     14998     19998
    10:39:01 Thu 22 Feb 2018  10/10       16998     9999      19998
 --More--

Example 15-10: History of aggregated delay metric

RP/0/0/CPU0:iosxrv-1#show performance-measurement history aggregated interfaces
PER-NA    : Periodic timer, no advertisements have occured
PER-AVG   : periodic timer, avg delay threshold crossed
PER-MIN   : periodic timer, min delay threshold crossed
PER-MAX   : periodic timer, max delay threshold crossed
ACCEL-MIN : accel threshold crossed, min delay threshold crossed

--------------------------------------------------------------------------
0/0/CPU0
--------------------------------------------------------------------------
Interface Name: GigabitEthernet0/0/0/0 (ifh: 0x20)
  Delay-Measurement history (uSec):
    Aggregation Timestamp     Average   Min       Max       Action
    10:45:59 Thu 22 Feb 2018  18372     9999      19998     PER-MIN
    10:43:58 Thu 22 Feb 2018  18997     14998     24998     PER-NA
    10:41:58 Thu 22 Feb 2018  18122     14998     24998     PER-MIN
    10:39:58 Thu 22 Feb 2018  18247     9999      24998     PER-MIN
    10:37:58 Thu 22 Feb 2018  18123     14998     19998     PER-NA
    10:35:58 Thu 22 Feb 2018  17748     14998     19998     PER-NA
    10:33:58 Thu 22 Feb 2018  18997     14998     24998     PER-MIN
    10:31:58 Thu 22 Feb 2018  17747     9999      19998     PER-MIN
    10:29:58 Thu 22 Feb 2018  18122     14998     24998     PER-NA
 --More--

15.4 Delay Advertisement

As we have seen in the previous sections, the node periodically measures the link delay and computes the various delay metrics. In order for other devices to use this information, the node must share these link delay metrics with the other devices in the network and/or with a controller.

To prevent churn in the network, flooding of link delay metrics in IGP should be kept to a minimum. The node must not flood the delay metrics whenever it has a new measurement. On the other hand, a very detailed view of the evolution of the delay metrics may be desirable. Therefore, the delay measurement functionality provides reduced IGP flooding – only flooding the delay metrics when certain thresholds are exceeded – while using Event Driven Telemetry (EDT) to push the delay metrics at a finer time scale.

15.4.1 Delay Metric in IGP and BGP-LS

At the beginning of this chapter, we explained that only the minimum delay metric should be considered for delay-optimized routing, since it expresses the propagation delay of the link. Consequently, IGP only floods the link delay metrics if the minimum delay changes significantly. If the maximum, average, and/or variance change but the minimum delay remains stable, then no IGP delay metric advertisement is triggered. Whenever delay metric flooding is triggered due to a minimum delay change, all delay metrics for the link are advertised in the network.

The link delay measurements are flooded as Extended TE Link Delay Metrics in ISIS (RFC 7810) and OSPF (RFC 7471). The measurements are added as sub-TLVs to the advertisement of the link. BGP-LS supports advertisement of the Extended TE Link Delay Metrics, as specified in draft-ietf-idr-te-pm-bgp.

The following Extended TE Link Delay Metrics are flooded (in separate sub-TLVs):

Unidirectional Link Delay
Unidirectional Min/Max Link Delay
Unidirectional Delay Variation

These metrics are the delay metrics as computed in the previous section; the (moving) average delay is advertised as the Unidirectional Link Delay metric. The format of the delay metric TLVs is the same for ISIS, OSPF, and BGP-LS. The format of the Unidirectional Link Delay (Sub-)TLV is shown in Figure 15‑6, the Min/Max Unidirectional Link Delay (Sub-)TLV format is shown in Figure 15‑7, and the format of the Unidirectional Delay Variation (Sub-)TLV is shown in Figure 15‑8.

Figure 15-6: Unidirectional Link Delay Sub-TLV format

Figure 15-7: Min/Max Unidirectional Link Delay Sub-TLV format

Figure 15-8: Unidirectional Delay Variation Sub-TLV format

The actual delay measurement fields in these TLVs (Delay, Min Delay, Max Delay, and Delay Variation) carry the delay measurement in microseconds, encoded as a 24-bit integer value.

The Anomalous flag (A-flag) in the Unidirectional and Min/Max Unidirectional Link Delay (Sub‑)TLV formats is set when a measured value exceeds a configured maximum threshold. The A‑flag is cleared again when the measured value falls below the configured reuse threshold. At the time of writing, the A-flag is always unset in IOS XR.

An example of delay metrics advertised in ISIS is shown in Example 15‑11 and the matching advertisement of the link in BGP-LS is shown in Example 15‑13.

Example 15-11: ISIS delay metric advertisement

RP/0/0/CPU0:iosxrv-1#show isis database verbose iosxrv-1 | begin IS-Extended iosxrv-2.00
  Metric: 10         IS-Extended iosxrv-2.00
    Interface IP Address: 99.1.2.1
    Neighbor IP Address: 99.1.2.2
    Link Average Delay: 8127 us
    Link Min/Max Delay: 4999/14998 us
    Link Delay Variation: 3374 us
    Link Maximum SID Depth: Subtype: 1, Value: 10
    ADJ-SID: F:0 B:0 V:1 L:1 S:0 P:0 weight:0 Adjacency-sid:24004

OSPF advertises the link delay metrics in the TE (type 1) Opaque LSAs, as illustrated in Example 15‑12.

Example 15-12: OSPF delay metric advertisement

RP/0/0/CPU0:iosxrv-1#show ospf database opaque-area 1.0.0.3 self-originate

            OSPF Router with ID (1.1.1.1) (Process ID 1)

                Type-10 Opaque Link Area Link States (Area 0)

  LS age: 98
  Options: (No TOS-capability, DC)
  LS Type: Opaque Area Link
  Link State ID: 1.0.0.3
  Opaque Type: 1
  Opaque ID: 3
  Advertising Router: 1.1.1.1
  LS Seq Number: 80000006
  Checksum: 0xe45e
  Length: 108

    Link connected to Point-to-Point network
      Link ID : 1.1.1.2
      (all bandwidths in bytes/sec)
      Interface Address : 99.1.2.1
      Neighbor Address : 99.1.2.2
      Admin Metric : 1
      Maximum bandwidth : 125000000
      IGP Metric : 1
      Unidir Link Delay : 8127 micro sec, Anomalous: no
      Unidir Link Min Delay : 4999 micro sec, Anomalous: no
      Unidir Link Max Delay : 14998 micro sec
      Unidir Link Delay Variance : 3374 micro sec

BGP-LS advertises these Extended TE Link Delay Metrics as Link Attribute TLVs, which are TLVs that may be encoded in the BGP-LS attribute with a Link NLRI. The format is the same as for the ISIS and OSPF advertisements. Example 15‑13 shows a Link NLRI originated by ISIS.

Example 15-13: BGP-LS delay metric advertisement

RP/0/0/CPU0:iosxrv-1#sh bgp link-state link-state [E][L2][I0x0][N[c1][b0.0.0.0][s0000.0000.0001.00]][R[c1][b0.0.0.0][s0000.0000.0002.00]][L[i99.1.2.1][n99.1.2.2]]/696 detail
BGP routing table entry for [E][L2][I0x0][N[c1][b0.0.0.0][s0000.0000.0001.00]][R[c1][b0.0.0.0][s0000.0000.0002.00]][L[i99.1.2.1][n99.1.2.2]]/696
NLRI Type: Link
Protocol: ISIS L2
Identifier: 0x0
Local Node Descriptor:
  AS Number: 1
  BGP Identifier: 0.0.0.0
  ISO Node ID: 0000.0000.0001.00
Remote Node Descriptor:
  AS Number: 1
  BGP Identifier: 0.0.0.0
  ISO Node ID: 0000.0000.0002.00
Link Descriptor:
  Local Interface Address IPv4: 99.1.2.1
  Neighbor Interface Address IPv4: 99.1.2.2
Versions:
  Process           bRIB/RIB  SendTblVer
  Speaker                 29          29
    Flags: 0x00000001+0x00000200;
Last Modified: Feb 20 21:20:11.687 for 00:03:21
Paths: (1 available, best #1)
  Not advertised to any peer
  Path #1: Received by speaker 0
  Flags: 0x400000000104000b, import: 0x20
  Not advertised to any peer
  Local
    0.0.0.0 from 0.0.0.0 (1.1.1.1)
      Origin IGP, localpref 100, valid, redistributed, best, group-best
      Received Path ID 0, Local Path ID 0, version 29
      Link-state: metric: 10, ADJ-SID: 24004(30) , MSD: Type 1 Value 10
                  Link Delay: 8127 us Flags: 0x00, Min Delay: 4999 us Max Delay: 14998 us Flags: 0x00
                  Delay Variation: 3374 us

IGP delay metric flooding can be triggered periodically or accelerated, as described in the next sections.

Periodic Advertisements

When enabling link delay measurements, periodic IGP advertisements are enabled by default. A node periodically checks if it needs to flood the current delay metrics in IGP. The periodicity is 120 seconds by default and is configurable as the advertisement interval. IGP floods the delay metrics of a given link if both conditions in Equation 15‑1 are satisfied for that link at the end of the periodic advertisement interval. Flooded.Minimum is the minimum delay metric that was last flooded and Agg.Minimum is the aggregated minimum delay metric computed at the end

of the current aggregation/advertisement interval. The periodic threshold (%) and minimum (value) are configurable; the default values are a threshold of 10% and a minimum of 500 μs.

Equation 15-1: Periodic delay metric flooding conditions

(a) |Agg.Minimum – Flooded.Minimum| / Flooded.Minimum ≥ threshold (default 10%)
and
(b) |Agg.Minimum – Flooded.Minimum| ≥ minimum (default 500 μs)
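A compact Python sketch of this check, purely illustrative. The same predicate, evaluated on the probe's minimum delay and with its own defaults (20%, 500 μs), is reused for the accelerated advertisements discussed later in this section.

def exceeds_flooding_threshold(new_min_us, flooded_min_us,
                               threshold_pct=10.0, minimum_us=500.0):
    # Both conditions (a) and (b) of Equation 15-1 must hold.
    # Assumes a non-zero previously flooded minimum delay.
    diff = abs(new_min_us - flooded_min_us)
    return diff / flooded_min_us >= threshold_pct / 100.0 and diff >= minimum_us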

In this default configuration, the delay metrics of a link are flooded at the end of the periodic advertisement interval if the minimum delay of that link over the advertisement interval differs (+ or –) from the last flooded minimum delay for that link by at least 10% and by at least 500 μs. The default minimum value of 500 μs is roughly equivalent to 100 km of fiber⁵.

When these conditions are met, IGP floods all current aggregate delay metrics for that link: Agg.Minimum, Agg.Maximum, Agg.MovingAvg (as average delay), and Agg.Variation. The node keeps these flooded metrics as Flooded.Minimum, Flooded.Maximum, Flooded.MovingAvg, and Flooded.Variation, and it also pushes/streams these values with Event Driven Telemetry to the collector.

Figure 15‑9 illustrates the delay metric advertisement operation. The illustration shows on top the minimum delay of a link (y-axis) as a function of time (x-axis). Both the measured minimum delay and the delay as advertised in the IGP are shown. Below the minimum delay graph, the DM Queries that the node sends on the link are represented as a series of arrows, each of them representing a query. The Probes are shown below the Queries. In this example the burst count is set to 4, i.e., every probe interval the node sends a burst of 4 queries. After receiving the last response of all queries in the probe, the node computes the probe metrics. In the illustration this occurs at times p1, p2, …, p7, as indicated with arrows. The advertisements are displayed below the Probes. The advertisement interval is two times the probe interval. At the bottom of the illustration the advertised metrics are indicated. These are described in this section.

Figure 15-9: Minimum-delay metric increase with periodic advertisements

Assume that at time p1, the end of the first periodic advertisement interval in the illustration, the aggregate minimum delay (min1) is equal to the last advertised minimum delay (min1). The node suppresses the periodic advertisement.

Sometime during Probe2 the minimum delay of the link increases from min1 to min2. This can occur due to a failure in the optical network triggering an optical restoration that switches traffic to the other side of the optical ring. In this example this new optical path is significantly longer, causing a large, abrupt increase of the propagation delay.

The advertisement interval expires at time p3. The aggregate minimum delay over the interval is min1, since the first query of Probe2 measured a delay min1. Even though all the other measurements in Probe2 and Probe3 measured a minimum delay min2, the minimum delay of the whole advertisement interval is still min1. Min1 is also the last advertised minimum delay, so the periodic advertisement is suppressed.

The advertisement interval expires again at time p5. This time, the aggregate minimum delay of the interval is min2. Min2 differs enough from the previously advertised min1 to exceed the periodic threshold. The node advertises the aggregate delay metrics (min/max/avg/var of (Probe4-metrics, Probe5-metrics)). The advertised minimum delay is min2.

Relying only on periodic advertisements, it takes between one and two times the advertisement interval for a significant increase in the minimum link delay to be advertised in the IGP.

Accelerated Advertisements

As shown in the previous illustration, it can take up to two times the periodic advertisement interval before a worse (increased) minimum delay metric is advertised in the IGP through the periodic advertisements. If a faster reaction is important, the accelerated advertisement functionality, which is disabled by default, can be enabled.

When enabled, an accelerated advertisement can be sent at the end of each probe interval, in between periodic advertisements, if the minimum delay metric has changed significantly compared to the previously advertised minimum delay. When a probe finishes (the last response message has been received) and the probe's minimum delay metric crosses the accelerated threshold (see below), then the IGP floods all link delay metrics (average, min, max, and variation) for that link. The advertised delay metrics are the delay metrics of the last probe. When the accelerated advertisement is triggered, the periodic advertisement interval is reset.

The IGP floods the delay metrics of a given link if both conditions in Equation 15‑2 are satisfied for that link at the end of a probe interval. These conditions have the same form as in Equation 15‑1, but they are evaluated on the probe's minimum delay, and the threshold and minimum for accelerated advertisements are independent of those of the periodic advertisements. The accelerated threshold (%) and minimum (value) are configurable; the default values are a threshold of 20% and a minimum of 500 μs.

Equation 15-2: Accelerated delay metric flooding conditions

(a) |Probe.Minimum – Flooded.Minimum| / Flooded.Minimum ≥ threshold (default 20%)
and
(b) |Probe.Minimum – Flooded.Minimum| ≥ minimum (default 500 μs)

If accelerated advertisement is enabled, then the delay metrics of a given link are flooded at the end of the probe interval if the minimum delay of that link over the probe interval differs (+ or –) from the last flooded minimum delay of that link by at least 20% (by default) and the difference is at least 500 μs (by default).

When these conditions are met, the IGP floods all current probe delay metrics for that link: Probe.Minimum, Probe.Maximum, Probe.MovingAvg (as average delay), and Probe.Variation. The node keeps these flooded metrics as Flooded.Minimum, Flooded.Maximum, Flooded.MovingAvg, and Flooded.Variation, and it also pushes these values with Event Driven Telemetry.

Figure 15‑10 illustrates the effect of accelerated advertisements on the scenario of Figure 15‑9. The delay measurement parameters are the same as for the previous example, only accelerated advertisements are now enabled.

Figure 15-10: Minimum-delay metric increase with accelerated advertisements

Assume that at time p1, the end of the first periodic advertisement interval in the illustration, the aggregate minimum delay (min1) is equal to the last advertised minimum delay (min1). The node suppresses the periodic advertisement.

Sometime during Probe2 the minimum delay of the link increases from min1 to min2. At time p2, the node verifies whether the Probe2-metrics exceed the accelerated thresholds. The minimum delay measured over Probe2 is min1, since the first query of Probe2 measured a delay min1. Since min1 is also the last advertised minimum delay, the threshold is not exceeded and the accelerated advertisement is suppressed.

The periodic advertisement interval expires at time p3. The aggregate minimum delay over the interval is min1, since the first query of Probe2 measured a delay min1. Min1 is also the last advertised minimum delay, so the periodic advertisement is suppressed. However, the minimum delay of the Probe3 interval is min2. The difference between min2 and the previously advertised min1 exceeds the accelerated thresholds. Therefore, the node advertises the Probe3-metrics as an accelerated advertisement.

Since the minimum delay stays constant, all further advertisements in the example are suppressed. With accelerated advertisements enabled, the worst-case delay between the minimum delay increase and the advertisement of this change is reduced to two times the probe-interval. The best-case delay for that case is one probe-interval. Another example combining periodic and accelerated advertisements is shown in Figure 15‑11. The delay measurement parameters are the same as in the previous examples.

Figure 15-11: Delay metric advertisements – illustration

Before the illustrated time interval, the node advertises the minimum delay value min3 in the IGP. At time p1, the end of the first probe interval in the diagram, the node computes the delay metrics of this probe: Probe1-metrics. The node then checks whether it needs to send an accelerated advertisement of the delay metrics, but since the measured minimum delay (min3) is the same as the last advertised minimum delay (min3), the accelerated thresholds are not exceeded and the accelerated advertisement is suppressed. Time p1 is also the end of the periodic advertisement interval, and the node computes the aggregate delay metrics of this advertisement interval, based on the probes in the advertisement interval. The minimum delay is the same as the last advertised minimum delay and the periodic advertisement is suppressed.

Shortly after time p1 and before the first query of Probe2, the minimum delay of the link decreases from min3 to min1. At time p2, the node computes the Probe2-metrics and then checks whether it needs to send an accelerated advertisement. The difference between the minimum delay metric in this probe interval, min1, and the previously advertised minimum delay metric, min3, exceeds the accelerated thresholds. Therefore, the node floods the Probe2-metrics. The advertised minimum delay metric is now min1. The accelerated advertisement also resets the periodic advertisement interval.

Right after time p2 and before the first query of Probe3, the minimum delay of the link increases from min1 to min2. At time p3, the node computes the Probe3-metrics. The minimum delay metric in this probe interval is min2. The difference between min2 and the last advertised minimum delay, min1, does not exceed the accelerated thresholds, therefore the accelerated advertisement is suppressed.

At time p4, the end of the periodic advertisement interval, the node computes the aggregated delay metrics over the advertisement interval. The aggregated minimum delay metric is min2. The difference between min2 and min1 exceeds the periodic thresholds, although it did not exceed the accelerated thresholds. Therefore, the node advertises the aggregated delay metrics (min/max/avg/var of (Probe3-metrics, Probe4-metrics)). The advertised minimum delay metric is min2.

At time p5, the accelerated thresholds are not exceeded. Sometime during Probe5, the minimum delay of the link increases from min2 to min4. At time p6, the end of the advertisement interval, the aggregated minimum delay over the advertisement interval is min2. This is the lowest measured delay in Probe5 and Probe6. Since min2 is equal to the last advertised minimum delay metric, no periodic advertisement is triggered. However, the difference between the minimum delay metric of the Probe6 interval, min4, and the last advertised min2 exceeds the accelerated threshold. Therefore, the node advertises the Probe6-metrics as an accelerated advertisement and resets the periodic advertisement interval.

Finally, at time p7, Probe7-metrics do not exceed the accelerated threshold, suppressing the accelerated advertisement.

15.4.2 Configuration

The IGP advertisement parameters can be customized, as shown in Example 15‑14. Periodic advertisements are enabled by default. Accelerated advertisements are disabled by default. The IGP automatically floods the delay metrics when requested by the performance measurement functionality; no additional IGP configuration is required. The relative thresholds and absolute minimum values of both advertisement modes can be configured.

Example 15-14: Global link delay measurement profile – advertisement configuration

performance-measurement
 !! Global default profile for link delay measurement
 delay-profile interfaces
  advertisement
   periodic
    disabled (default: enabled)
    interval < 30-3600 sec > (default: 120 sec)
    threshold < 0-100 % > (default: 10%)
    minimum < 0-100000 usec > (default: 500 usec)
   accelerated (default: disabled)
    threshold < 0-100 % > (default: 20%)
    minimum < 1-100000 usec > (default: 500 usec)

Periodic IGP delay metric advertisements can be disabled altogether if desired. But even with periodic advertisements disabled, the link delay metrics are still pushed via telemetry (EDT). This way a controller can use the link delay metrics, as received via telemetry, while eliminating the flooding of these metrics in the IGP.

15.4.3 Detailed Delay Reports in Telemetry

We have seen that the node not only floods the measured link delay metrics in IGP, but it also streams the measurements via telemetry, periodic and event-driven. Chapter 20, "Telemetry" provides more information about telemetry. This section only describes the applicability of telemetry for performance measurement data.

Telemetry uses a push model as opposed to the traditional poll model to get data from the network. Telemetry streams the data, skipping the overhead of polling.

Model Driven Telemetry (MDT) structures the streamed data based on a specified model. Native, OpenConfig, or IETF YANG models are available through model-driven telemetry. MDT can periodically stream (push) telemetry data or only when the data has changed. The latter is called Event Driven Telemetry (EDT). By only streaming data when it has changed, EDT reduces the amount of unnecessary data that is streamed.

For the delay measurements, MDT is supported for the following data:

Summary, interface, session, and counter show command content
Histogram data

In addition to the periodic streaming, EDT is supported for the following delay measurement data:

Delay metrics computed in the last probe-interval (event: probe-completed)
Delay metrics computed in the last aggregation-interval, i.e., at the end of the periodic advertisement interval (event: advertisement interval expired)
Delay metrics last flooded in the network (event: flooding-triggered)

The performance-measurement telemetry data entries are in the Cisco-IOS-XR-perf-meas-oper.yang YANG module.

15.5 Usage of Link Delay in SR-TE

SR Policy dynamic path computation uses the minimum link delay metric for delay-optimized paths. As previously mentioned, this minimum link delay metric reflects the underlying optical circuit and makes it possible to react to changes in the optical topology.

Example 15‑15 shows the configuration of an SR Policy ORANGE and an on-demand color template for color 10, both using the minimum link delay metric as the optimization objective.

Example 15-15: SR Policy with delay optimized dynamic path

segment-routing
 traffic-eng
  on-demand color 10
   dynamic
    metric
     type delay
  !
  policy ORANGE
   color 100 end-point ipv4 1.1.1.4
   candidate-paths
    preference 100
     dynamic
      metric
       type delay

If the link delay metric of a given link is not advertised by the node, then SR-TE path computation falls back to the TE metric advertised for that link. The TE metric is then used as if it expressed the link delay in μsec. For example, if a node does not advertise a link delay metric for a link but it advertises a TE metric of 100 for this link, then SR-TE considers, for its delay-optimized path computations, the TE metric 100 as a minimum link delay of 100 μsec. This fallback to the TE metric helps incremental deployment of dynamic link delay metrics in networks.

The Flexible Algorithm (Flex-Algo) functionality (see chapter 7, "Flexible Algorithm") can use the minimum link delay metric as optimization objective as well. However, Flex-Algo's IGP path computation does not fall back to using the TE metric for links that do not advertise the minimum link delay metric. Instead, any link that does not have a link delay metric is excluded from the Flex-Algo topology.

Example 15‑16 shows a configuration of the Flex-Algo functionality using the measured minimum link delay metric. Traffic is automatically steered on the low-delay path using the SR-TE Automated

Steering functionality (see chapter 5, "Automated Steering").

Example 15-16: Flexible Algorithm configuration using delay metric for path computation

router isis 1
 flex-algo 128
  metric-type delay
!
segment-routing
 traffic-eng
  on-demand color 10
   dynamic
    sid-algorithm 128
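The two fallback behaviors described above can be summarized in a small Python sketch. The data structure and function names are illustrative, not an actual SR-TE API.

def srte_delay_metric(link):
    # SR Policy delay-optimized computation: fall back to the TE metric,
    # interpreted as a delay value, when no link delay metric is advertised.
    if link.get("min_delay") is not None:
        return link["min_delay"]
    return link["te_metric"]

def flexalgo_delay_metric(link):
    # Flex-Algo: no fallback; a link without a delay metric is pruned
    # from the Flex-Algo topology (represented here by returning None).
    return link.get("min_delay")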

15.6 Summary

The Performance Monitoring functionality enables dynamic measurement of link delay. SR-TE can use the measured link delay metrics to compute delay-optimized paths, also for the Flex-Algo functionality of SR-TE. The minimum delay measured over a time period is important for SR-TE, as it represents the (quasi-)static propagation delay of the link.

Link delay metrics are dynamically measured using a Query/Response packet exchange. This mechanism is standardized in RFC 6374 for link delay measurements in an MPLS network. draft-gandhi-spring-twamp-srpm provides the equivalent mechanism using the more generic TWAMP encoding.

The measured delay metrics are flooded in IGP and BGP-LS and streamed via telemetry. To reduce the flooding churn on the network, new link delay metrics are only flooded when the minimum delay metric has changed significantly.

Link delays can be measured using a one-way or a two-way measurement. One-way measurement requires accurate time synchronization between the local and remote nodes of the link.

15.7 References

[RFC6374] "Packet Loss and Delay Measurement for MPLS Networks", Dan Frost, Stewart Bryant, RFC6374, September 2011

[RFC7810] "IS-IS Traffic Engineering (TE) Metric Extensions", Stefano Previdi, Spencer Giacalone, David Ward, John Drake, Qin Wu, RFC7810, May 2016

[RFC7471] "OSPF Traffic Engineering (TE) Metric Extensions", Spencer Giacalone, David Ward, John Drake, Alia Atlas, Stefano Previdi, RFC7471, March 2015

[RFC8571] "BGP - Link State (BGP-LS) Advertisement of IGP Traffic Engineering Performance Metric Extensions", Les Ginsberg, Stefano Previdi, Qin Wu, Jeff Tantsura, Clarence Filsfils, RFC8571, March 2019

[RFC5586] "MPLS Generic Associated Channel", Martin Vigoureux, Stewart Bryant, Matthew Bocci, RFC5586, June 2009

[RFC7213] "MPLS Transport Profile (MPLS-TP) Next-Hop Ethernet Addressing", Dan Frost, Stewart Bryant, Matthew Bocci, RFC7213, June 2014

[RFC7876] "UDP Return Path for Packet Loss and Delay Measurement for MPLS Networks", Stewart Bryant, Siva Sivabalan, Sagar Soni, RFC7876, July 2016

[RFC5481] "Packet Delay Variation Applicability Statement", Benoit Claise, Al Morton, RFC5481, March 2009

[draft-gandhi-spring-twamp-srpm] "In-band Performance Measurement Using TWAMP for Segment Routing Networks", Rakesh Gandhi, Clarence Filsfils, Daniel Voyer, draft-gandhi-spring-twamp-srpm-00 (Work in Progress), February 2019

[RFC4656] "A One-way Active Measurement Protocol (OWAMP)", Matthew J. Zekauskas, Anatoly Karp, Stanislav Shalunov, Jeff W. Boote, Benjamin R. Teitelbaum, RFC4656, September 2006

[RFC5357] "A Two-Way Active Measurement Protocol (TWAMP)", Jozef Babiarz, Roman M. Krzanowski, Kaynam Hedayat, Kiho Yum, Al Morton, RFC5357, October 2008

1. DPM is a data-plane monitoring solution that provides highly scalable blackhole detection capability. DPM validates control plane and data plane consistency, leveraging SR to steer probes validating the local node's own data plane.↩

2. Fiber propagation delay can vary over time, due to external influences. For example, as the length of the fiber and its index of refraction are slightly temperature dependent, propagation delay can change due to a changing temperature.↩

3. This is the exponential moving average, as described in Wikipedia: https://en.wikipedia.org/wiki/Moving_average#Exponential_moving_average↩

4. The delay variance metric is computed as the Packet Delay Variation (PDV) specified in RFC5481, section 4.2, based on the average and minimum delay: delay variance = average delay – minimum delay.↩

5. The propagation delay of a fiber connection can be roughly assessed as 5 ms per 1000 km of fiber.↩

16 SR-TE Operations

What we will learn in this chapter:

An SR Policy candidate path can consist of multiple segment lists, each associated with a weight. Traffic flows that are steered into the SR Policy are load-balanced over the segment lists in proportion to their relative weights. This is called "Weighted ECMP".

The path-invalidation drop functionality keeps an invalid SR Policy in the forwarding table as a drop entry. All service traffic that is steered into the SR Policy is dropped instead of falling back to the IGP shortest path to the nexthop of the service.

The MPLS PHP and explicit-null behaviors apply to segments in the SR Policy's segment list. In particular, the first entry in the segment list is not always imposed on the packet steered into the SR Policy, since Penultimate Hop Popping (PHP) may apply to that segment.

TTL and TC/DSCP field values of incoming packets are propagated by default.

It is recommended to use the same SRGB on all nodes in an SR domain. Using heterogeneous SRGBs significantly complicates network operations.

This chapter provides various practical indications for deploying and operating SR-TE. We start by explaining how multiple SID lists can be associated with the same candidate path and how the traffic is load-balanced among them. We describe a method that prevents traffic steered into an invalid SR Policy from being automatically redirected to the IGP shortest path. We then explain how generic MPLS mechanisms, such as PHP, explicit-null and TTL propagation, apply to SR-TE. Finally, we provide some details on what to expect if some of the recommendations stated over the course of this book are not followed.

16.1 Weighted Load-Sharing Within SR Policy Path

An SR Policy candidate path consists of one or more segment lists, as described in chapter 2, "SR Policy". If the active candidate path consists of multiple segment lists, traffic flows that are steered into the SR Policy are load-balanced over these segment lists. The regular per-flow load-balancing mechanism is applied, using a hash function over the packet header fields.

Each segment list has an associated weight. The default weight is 1. Each segment list carries a number of flows in proportion to its relative weight. The fraction of flows carried by a given segment list with weight w is w/∑wi, where w is the weight of the segment list and ∑wi the sum of the weights of all segment lists of the candidate path. This is called "Weighted ECMP", as opposed to "ECMP" where traffic flows are uniformly distributed over all paths. The accuracy of the weighted load-balancing depends on the platform implementation.

The weighted ECMP between segment lists can be used to distribute traffic over several paths in proportion to the capacity available on each path. Figure 16‑1 illustrates a network where the interconnections between the nodes have different capacities. The connection between Node2 and Node4 consists of three GigabitEthernet links and the connection between Node3 and Node4 consists of five GigabitEthernet links. The link between Node2 and Node3 is a Ten GigabitEthernet link, as are the other links in the drawing.

The operator wants to distribute the load of an SR Policy from headend Node1 to endpoint Node5 over the available capacity. Therefore, the operator configures two segment lists for the SR Policy: SL1 via Node3 with weight 5 and SL2 directly to Node5 with weight 3. With this configuration 5/8 (62.5%) of the traffic flows follow SL1 and 3/8 (37.5%) of the flows follow SL2.

Figure 16-1: Weighted ECMP between segment lists of an SR Policy

Example 16‑1 shows the SR Policy configuration on Node1. Two segment lists are configured: SL1 and SL2. These segment lists are used in the candidate path with preference 100 of the SR Policy named "WECMP". SL1 is configured with weight 5 and SL2 with weight 3.

Example 16-1: Weighted ECMP between segment lists of an SR Policy

segment-routing
 traffic-eng
  segment-list SL1
   index 10 mpls label 16003
   index 20 mpls label 16005
  !
  segment-list SL2
   index 10 mpls label 16005
  !
  policy WECMP
   color 20 end-point ipv4 1.1.1.5
   candidate-paths
    preference 100
     explicit segment-list SL1
      weight 5
     !
     explicit segment-list SL2
      weight 3

The SR Policy forwarding entry, as displayed in Example 16‑2, shows the two segment lists' forwarding entries with their weight values to distribute the traffic flows over the two paths.

Example 16-2: Weighted ECMP – SR Policy forwarding

RP/0/0/CPU0:xrvr-1#show segment-routing traffic-eng forwarding policy detail
Color Endpoint     Segment      Outgoing Outgoing     Next Hop     Bytes
                   List         Label    Interface                 Switched
----- ------------ ------------ -------- ------------ ------------ --------
20    1.1.1.5      SL1          16003    Gi0/0/0/0    99.1.2.2     0
        Label Stack (Top -> Bottom): { 16003, 16005 }
        Path-id: 1, Weight: 320
        Packets Switched: 0
                   SL2          16005    Gi0/0/0/1    99.1.7.7     0
        Label Stack (Top -> Bottom): { 16005 }
        Path-id: 2, Weight: 192
        Packets Switched: 0

SR-TE has computed the weights in the output according to the configured weights. The computed weight ratios in the output are the same as for the configured weights: 320/(320+192) = 5/8 and 192/(320+192) = 3/8.
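A quick Python sketch of the w/∑wi proportion, applied to the configured weights and to the weights shown in the forwarding output; purely illustrative:

def load_share(weights):
    # Fraction of flows per segment list: weight / sum of all weights.
    total = sum(weights.values())
    return {name: w / total for name, w in weights.items()}

print(load_share({"SL1": 5, "SL2": 3}))      # {'SL1': 0.625, 'SL2': 0.375}
print(load_share({"SL1": 320, "SL2": 192}))  # same ratios: 5/8 and 3/8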

16.2 Drop on Invalid SR Policy

If an SR Policy becomes invalid, i.e., all its candidate paths are invalid, then the default behavior is to bring down the SR Policy and let all traffic that was steered into the SR Policy fall back on its default forwarding entry, usually the IGP shortest path. For example, prefix 2.2.2.0/24 with nexthop 1.1.1.4 and color 30 is steered into SR Policy GREEN. When all candidate paths of SR Policy GREEN become invalid, this SR Policy is invalidated and its forwarding entry is removed. The prefix 2.2.2.0/24 is then steered onto the IGP shortest path to its nexthop 1.1.1.4.

For some use-cases it is desirable for the traffic to be dropped when the SR Policy becomes invalid. To prevent the traffic that is steered into an SR Policy from falling back on the IGP shortest path upon invalidation of this SR Policy, the path-invalidation-drop functionality can be enabled. Path-invalidation-drop keeps the invalid SR Policy up, keeping its forwarding entry active but dropping all traffic that is steered into it.

Let us consider an example use-case that requires strict separation between two disjoint paths. If one of the paths becomes invalid, then the traffic that was steered into that SR Policy must not be steered on the IGP shortest path, since that path is not guaranteed to be disjoint from the other path. The path-invalidation-drop functionality drops the traffic of the invalid SR Policy such that the strict disjointness requirement is not violated.

In another example, a traffic stream is replicated into two streams on two disjoint paths (live-live redundancy). The rate of the streams is high compared to the capacity of the links carrying the traffic. Some of the links cannot carry the combination of both streams. Steering both streams on such a link would result in congestion and packet loss affecting both streams. This is highly undesirable since it defeats the purpose of live-live redundancy. Path-invalidation-drop will drop the stream's packets at the headend upon SR Policy invalidation. This prevents congestion and keeps the other stream unaffected.

At the time of writing, this functionality was not available in IOS XR.

16.3 SR-MPLS Operations

The SR implementation for MPLS (SR-MPLS) leverages the existing MPLS architecture. This means that the existing MPLS operations are also applicable to SR-MPLS and consequently also to SR-TE for SR-MPLS.

16.3.1 First Segment

The first segment in an SR Policy's segment list is sometimes not imposed on the packets steered into this SR Policy as a consequence of applying the usual MPLS operations.

For simplicity, we assume that the active candidate path of a given SR Policy consists of a single segment list <S1, S2, S3>. At first sight, one would expect that these three segments are imposed as labels on the packets that are steered into the SR Policy. However, this is not always the case. Sometimes the label of the first segment in the list is not pushed on these packets as a consequence of applying the usual MPLS operations on this first segment.

Four cases can be distinguished with respect to the first segment S1 in the list:

1. S1 is a Prefix-SID
   a. S1 is a Prefix-SID of a downstream neighbor node (i.e., the shortest path to this connected node is via the direct link) and PHP is enabled for this Prefix-SID
   b. S1 is a Prefix-SID of a downstream neighbor node and PHP is disabled for this Prefix-SID
   c. S1 is a Prefix-SID of a node that is not a downstream neighbor node (i.e., the node is not a connected node or the shortest path to this connected node is not via the direct link)
2. S1 is an Adjacency-SID of a local adjacency or an EPE peering-SID of a local EPE peering session

In cases 1.a. and 2., the headend uses the first segment in the SID list to determine the outgoing interface(s) and next-hop(s) for the SR Policy, but does not impose this first segment on the packet. In cases 1.b. and 1.c., the headend node uses the first segment to determine the outgoing interface(s) and next-hop(s), and also imposes this first segment on the packet steered into the SR Policy.
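These four cases can be condensed into a small decision sketch (illustrative Python pseudologic, not the actual forwarding code; the field names are assumptions):

def impose_first_segment(s1):
    """s1: dict describing the first segment of the segment list."""
    if s1["type"] in ("adjacency-sid", "epe-peering-sid") and s1["local"]:
        return False                      # case 2: only selects outgoing interface/next-hop
    if s1["type"] == "prefix-sid":
        if s1["downstream_neighbor"] and s1["php_enabled"]:
            return False                  # case 1.a: PHP applies, label not imposed
        return True                       # cases 1.b and 1.c: label is imposed
    return True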

16.3.2 PHP and Explicit-Null

The well-known MPLS operations on the penultimate hop node also apply to SR-MPLS packets. These operations are:

Penultimate Hop Popping (PHP), where the penultimate hop node pops the label before forwarding, and
Explicit-null, where the penultimate hop node swaps the top label with the explicit-null label before forwarding.

PHP behavior is enabled by default for each Prefix-SID in IOS XR. This means that a node advertises its Prefix-SIDs in the IGP with the PHP-off flag unset.

An operator may require QoS treatment based on the MPLS TC (formerly EXP) field value. In that case, the packet must arrive at the final node with a label carrying the TC value. If the bottom label is a service label, then the packet always arrives at the final node with a label. However, if the bottom label is a transport label, then the penultimate hop must not pop the last label, i.e., PHP must not be enabled for the bottom label.

In IOS XR, the default PHP behavior can be replaced by the explicit-null behavior. When enabling explicit-null behavior, the penultimate hop no longer pops the bottom label but swaps it with the explicit-null label.

In a first method to apply explicit-null behavior for an SR Policy, the bottom label is a Prefix-SID and explicit-null is enabled for this Prefix-SID. This can be done by adding explicit-null to the Prefix-SID configuration on the endpoint node, as shown in Example 16‑3.

Example 16-3: Explicit-null behavior for Prefix-SID

router isis 1
 interface Loopback0
  address-family ipv4 unicast
   prefix-sid absolute 16001 explicit-null
!
router ospf 1
 area 0
  interface Loopback0
   prefix-sid absolute 16001 explicit-null

The penultimate hop node of the endpoint node will receive the packets with this Prefix-SID as top label and, due to the requested explicit-null behavior, this penultimate hop node will swap the Prefix-SID label with the explicit-null label.

Another method to convey the TC value up to the endpoint of an SR Policy is to add the explicit-null label as the last segment (bottom label) in an explicit SR Policy segment list. This makes the packets arrive at the SR Policy's endpoint node with an explicit-null label as the only label.

16.3.3 MPLS TTL and Traffic-Class

SR-TE for SR-MPLS can push one or more MPLS labels on packets (unlabeled and labeled) that are steered into the SR Policy. The MPLS label imposition follows the generic procedures that are described in detail in Segment Routing Part I.

In summary, for an incoming unlabeled packet, the default behavior is to copy the TTL of the IP header to the MPLS TTL field of the MPLS label after decrementing it by one. In case multiple labels are imposed, all imposed labels get the same MPLS TTL value. This behavior can be disabled by configuring mpls ip-ttl-propagation disable. With this configuration the MPLS TTL fields of all imposed labels are set to 255, regardless of the IP TTL value in the received IP packet. For an incoming labeled packet, the TTL of the incoming top label is decremented and the swapped label as well as all imposed labels get this decremented TTL value in their MPLS TTL field.

For an incoming unlabeled packet, the first 3 bits of the DSCP field in the IP header are copied to the MPLS TC field of all imposed MPLS labels. For an incoming labeled packet, the MPLS TC field of the top label is copied to the TC field of the swapped label and all imposed MPLS labels.
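A condensed sketch of the default imposition behavior described above (illustrative logic only, not the forwarding implementation):

def imposed_mpls_ttl(incoming_ttl, labeled, ip_ttl_propagation=True):
    # Unlabeled packet with 'mpls ip-ttl-propagation disable': fixed TTL 255.
    if not labeled and not ip_ttl_propagation:
        return 255
    # Otherwise: decrement the incoming IP TTL or top-label TTL and copy
    # the result into all imposed labels (and the swapped label, if any).
    return incoming_ttl - 1

def imposed_mpls_tc(dscp=None, top_label_tc=None):
    # Labeled packet: copy the TC of the incoming top label.
    # Unlabeled packet: copy the 3 most significant bits of the DSCP field.
    return top_label_tc if top_label_tc is not None else dscp >> 3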

16.4 Non-Homogenous SRGB It is strongly recommended to use the same SRGB on all nodes in the SR domain, although the SR architecture allows to use different SRGBs on different nodes. Refer to Part I of the SR book series for more information about the SRGB. The use of heterogeneous SRGBs has implications on SR-TE, particularly when manually specifying prefix-SID label values in an explicit path’s segment list of an SR Policy.

“Using the same SRGB on all nodes within the SR Domain has undoubtedly significant advantages in simplification of administration, operation and troubleshooting. Also programming the network is simplified by using this model, and Anycast Segments can be used without added complexity. Using the same SRGB network-wide is expected to be a deployment guideline. As mentioned in the expired draft-filsfils-spring-segment-routing-use-cases “Several operators have indicated that they would deploy the SR technology in this way with a single consistent SRGB across all the nodes. They motivated their choice based on operational simplicity ...”. ” — Kris Michielsen

The Prefix-SID label value specified in the segment list is the label value for that Prefix-SID as known by the node that needs to interpret it. This is most easily explained by an example. See Figure 16‑2 (same as Figure 4.3. in Segment Routing Part I). The SRGB of the four nodes in the topology is indicated above each node. Each node advertises a Prefix-SID index equal to its node number.

Figure 16-2: Multi-label stacks and different SRGBs

Node1 applies a segment list to steer traffic towards Node4 via Node3. Node1 needs to consider the SRGB of the correct nodes to determine which label values for the Prefix-SIDs to push on the packets. To compute the required label value for the first SID in the SID list – Prefix-SID(3) – Node1 uses the SRGB of its nexthop towards Node3, which is Node2. Node2 would receive the packet with that label value as top label and must be able to identify it as Prefix-SID(3). Hence, the label value must be associated with Prefix-SID(3) in Node2’s SRGB. Node1 computes the label value for PrefixSID(3) by adding the SID index 3 to the SRGB base of Node2: 21000 + 3 = 21003. Equivalently, Node1 uses the SRGB of Node3 to compute the label value for the second SID in the SID list, which is Prefix-SID(4). Node3 receives the packets with this second SID’s label as top label. Therefore, Node1 needs to compute the label value of the second SID in the label context of Node3. Node1 adds the SID index 4 to the SRGB base of Node3: 22000 + 4 = 22004. When imposing the SID list as label values, Node1 imposes the label stack on the packets. The SR Policy configuration on Node1 is shown in Example 16‑4. However, the first label value in the configuration is 16003, which is different from 21003 that was derived above.

Example 16-4: Non-homogenous SRGB – SR Policy configuration on Node1 segment-routing traffic-eng segment-list name SIDLIST1 index 10 mpls label 16003 !! Prefix-SID(3) index 20 mpls label 22004 !! Prefix-SID(4) ! policy ORANGE color 10 end-point ipv4 1.1.1.4 candidate-paths preference 100 explicit segment-list SIDLIST1

The label value for the first segment in the segment list (16003 in the example) is the Prefix-SID(3) label value in the context of the headend node, thus using the local SRGB of the head-end node ([16000-23999] in the example). The SID index of Prefix-SID(3) is 3, therefore the required label value is 16000 + 3 = 16003. Configuring the local label value for the first segment makes this configuration independent of the downstream neighbor. Internally, the correct label value is computed matching the downstream neighbor’s SRGB. In the example, Node1 imposes the outgoing label value 21003 as top label, as shown in Example 16‑5. Example 16-5: Non-homogenous SRGB – SR Policy forwarding entry on Node1 RP/0/0/CPU0:xrvr-1#show segment-routing traffic-eng forwarding policy detail Color Endpoint Segment Outgoing Outgoing Next Hop Bytes List Label Interface Switched ----- ----------- ------------ -------- ------------ ------------ -------10 1.1.1.4 SIDLIST1 21003 Gi0/0/0/0 99.1.2.2 0 Label Stack (Top -> Bottom): { 21003, 16004 } Path-id: 1, Weight: 64 Packets Switched: 0

Using a homogenous SRGB on all nodes in the SR domain avoids all the complexity of deriving SID label values.

16.5 Candidate-Paths With Same Preference The recommended design rule is to use a different preference for all candidate paths of a given SR Policy. This makes it straightforward to select the active path of the SR Policy: the valid candidate path with the highest preference value. However, draft-ietf-spring-segment-routing-policy does not require uniqueness of preference, which would be difficult to impose given that the candidate paths can come from multiple, possibly independent sources. Instead, a set of tie-breaking rules are used to select the active path for an SR Policy among a set of candidate paths with the same preference value. A candidate path of an SR Policy is uniquely identified by the tuple in the context of a single SR Policy that is identified by the tuple . Protocol-Origin: Numerical value that identifies the component or protocol that originates or signals the candidate path. Recommended values are 10: PCEP; 20: BGP-SR-TE; 30: configuration. Originator: Tuple with ASN represented as a 4-byte number and node-address being an IPv4 or IPv6 address. Discriminator: Numerical value to distinguish between multiple candidate paths with common Protocol-Origin and Originator. This identifier tuple is used in the tie-breaking rules to select a single valid candidate path for an SR Policy out of a set of candidate paths with the same preference value. The tie-breaking rules are evaluated in the following order:

1. Prefer higher Protocol-Origin value 2. Prefer existing installed path (optional, only if this rule is activated by configuration) 3. Prefer lower Originator value 4. Prefer higher Discriminator value

16.6 Summary Traffic steered into an SR Policy can be load-balanced over different paths by specifying multiple segment lists under the candidate path. Each segment list has a weight and traffic-flows are loadbalanced over the segment lists proportional to their relative weights. We have seen how a service can be strictly steered onto a policy which turns as a bit bucket when a selected candidate path is invalid. Specifically, when the SR Policy becomes invalid, instead of reusing the IGP shortest path to the next-hop of the service, the service route is kept on the invalid policy which remains in the forwarding table as a drop entry. SR-MPLS and thus also SR-TE for SR-MPLS leverage the existing MPLS architecture. This implies that the PHP and explicit-null behaviors also apply to segments in the SR Policies segment list. We have seen that the first entry in the segment list is not always applied on the packet steered into the SR Policy since Penultimate Hop Popping (PHP) may apply to that segment. When imposing labels on a packet, the TTL field of the imposed labels can be copied from the IP header TTL field for unlabeled incoming traffic or it is copied from the top label for incoming labeled traffic. The MPLS TC field is copied from the DSCP field in the IP header for incoming unlabeled traffic or from the top label of an incoming labeled packet. All imposed labels get the same TTL and TC field values. Using different SRGBs on different nodes in the network complicates network operations. In particular, defining an explicit segment list using label values becomes more difficult.

16.7 References [SR-book-Part-I] "Segment Routing Part I", Clarence Filsfils, Kris Michielsen, Ketan Talaulikar, October 2016, ASIN: B01I58LSUO (Kindle), ISBN-10: 1542369126, ISBN-13: 978-1542369121, ,

[draft-ietf-spring-segment-routing-policy] "Segment Routing Policy Architecture", Clarence Filsfils, Siva Sivabalan, Daniel Voyer, Alex Bogdanov, Paul Mattes, draft-ietf-spring-segmentrouting-policy-02 (Work in Progress), October 2018 [draft-filsfils-spring-segment-routing-use-cases] "Segment Routing Use Cases", Clarence Filsfils, Pierre Francois, Stefano Preidi, Bruno Decraene, Stephane Litkowski, Martin Horneffer, Igor Milojevic, Rob Shakir, Saku Ytti, Wim Henderickx, Jeff Tantsura, Sriganesh Kini, Edward Crabbe, draft-filsfils-spring-segment-routing-use-cases-01 (Work in Progress), October 2014 [draft-filsfils-spring-sr-policy-considerations] "SR Policy Implementation and Deployment Considerations", Clarence Filsfils, Ketan Talaulikar, Przemyslaw Krol, Martin Horneffer, Paul Mattes, draft-filsfils-spring-sr-policy-considerations-02 (Work in Progress), October 2018 [SR-Part-I] “Segment Routing Part I”, Clarence Filsfils, Kris Michielsen, Ketan Talaulikar, https://www.amazon.com/dp/1542369126 and https://www.amazon.com/dp/B01I58LSUO, October 2016

Section III – Tutorials This section contains tutorials of the protocols used for SR-TE.

17 BGP-LS BGP Link-State (BGP-LS) provides the mechanism to advertise network topology and other network information via BGP. The initial specification of BGP-LS, as described in RFC 7752, defines how to use BGP to convey the content of the link-state protocol database (LS-DB) and the TrafficEngineering Database (TED) to external components such as a PCE. This explains the name “Linkstate” of the BGP-LS address-family. BGP transport makes it possible to convey the topology information of an IGP area to remote locations, even across domain and AS boundaries, in a scalable manner using a robust and proven protocol. External devices, that are not participating in a link-state protocol (ISIS or OSPF), can collect network-wide topology information using BGP-LS. Visibility into the entire network allows applications, such as the Path Computation Element (PCE), to extend the use of TE techniques to the whole network in an optimal way. BGP-LS can also be used as a common interface to retrieve topology information from an entire network for applications that require it. It uses its own abstract topology model that hides many of the differences between the advertisements of the different protocols that are injected into BGP-LS. BGP-LS can be used as an Application Programming Interface (API) to collect network information. Since its initial specification in RFC 7752, BGP-LS has been extended in order to carry other types of information, such as SR and performance data: Segment Routing – draft-ietf-idr-bgp-ls-segment-routing-ext IGP TE Performance Metric Extensions – draft-ietf-idr-te-pm-bgp: TE IGP metric extensions (from RFC7471 and RFC7810) Egress Peer Engineering – draft-ietf-idr-bgpls-segment-routing-epe TE Policies – draft-ietf-idr-te-lsp-distribution It is important to note that the BGP-LS address-family is not limited to carrying routing information. The address-family can be extended to carry any information, for example:

Information taken from other sources than link-state databases Information related to topology Information related to a node’s configuration and state Information related to services Examples of these BGP-LS extensions are draft-ketant-idr-bgp-ls-bgp-only-fabric that specifies how BGP-LS carried topology information of a BGP-only network and draft-dawra-idr-bgp-ls-sr-servicesegments that specifies how BGP-LS is used to advertise service segments.

17.1 BGP-LS Deployment Scenario Figure 17‑1 illustrates a typical deployment scenario for BGP-LS. Multiple BGP speakers in the network are enabled for BGP-LS and form BGP-LS sessions with one or more centralized BGP speakers, such as RRs, over which they convey their local topology information. This local information can come from IGP, BGP, or another source.

Figure 17-1: Typical BGP-LS deployment model

Using the regular BGP propagation mechanisms, any BGP speaker may obtain the consolidated BGPLS information of the entire network as provided by all other BGP speakers. An external component such as an SR PCE, can obtain this consolidated information by tapping into a centralized BGP speaker or any other BGP speaker that has the aggregated BGP-LS information. An internal component of a BGP-enabled node, such as the SR-TE process on a headend, can obtain the aggregated BGP-LS information from its local BGP process. The entities and nodes in Figure 17‑1 are assuming different roles in the dissemination of BGP-LS information. BGP-LS Producer:

The BGP speaker that advertises local information (e.g., IGP, SR, PM) into BGP-LS. The BGP speakers Node3, Node7, Node9, and Node12 originate link-state information from their IGP into BGP-LS. Node7 and Node9 are in the same IGP area, they are originating the same link-state information into BGP-LS. A node may also originate non-IGP information into BGP-LS, e.g., its local node information. BGP-LS Propagator: The BGP speaker that propagates BGP-LS information from producers to other BGP-LS speakers, and eventually consumers. The BGP speaker Node1 propagates the BGP-LS information between the BGP speakers Node3, Node7, Node9, and Node12. Node1 performs BGP best-path selection and propagates BGP-LS updates. BGP-LS Consumer: The application or process that leverages the BGP-LS information to compute paths or perform network analysis. The BGP-LS consumer is not a BGP speaker. The SR PCEs are BGP speakers that provide the BGP-LS information that they have collected to a consumer application. The BGP protocol implementation and the consumer application may be on the same or different nodes (e.g., local SR-TE process or remote application retrieving BGPLS information via a North-bound interface). These roles are not mutually exclusive. The same node may be Producer for some link-state information and Propagator for some other link-state information while also providing this information to a Consumer application.

17.2 BGP-LS Topology Model A node running a link-state IGP in essence distributes its local connectivity (neighbors and prefixes) in a link state advertisement (LSP/LSA1) to all other nodes in the IGP area. Each node combines these LSPs/LSAs as pieces of a jigsaw puzzle to form a complete map or graph of the topology. The IGP then uses this topology map to compute the shortest path tree (SPT) and derive prefix shortest path reachability. Figure 17‑2 illustrates this for a three-node topology. Each node announces its own adjacencies to the other nodes. Note that this topology graph represents a logical topology, not a physical topology. For example, nodes or links that are not enabled for this IGP, are not present in the graph.

Figure 17-2: Link-state IGP LSPs/LSAs and topology model

BGP-LS does not simply encapsulate these IGP LSPs/LSAs in BGP, instead it transcodes the information they contain into a more abstract topology model based on three classes of objects: nodes, links, and prefixes.

The IGP topology is transcoded into the BGP-LS model in order to overcome the differences between OSPF and ISIS, but also to include information from other sources, such as BGP. The use of Type/Length/Value (TLV) structures to encode the data makes BGP-LS easily extensible without requiring any change to the underlying transport protocol. For example, new classes of objects, such as TE Policies, have been added to the model (draft-ietf-idr-te-lsp-distribution).

Figure 17-3: BGP-LS topology model

The three base BGP-LS classes of objects are illustrated in Figure 17‑3. A node object represents a node, which is typically a router or a routing protocol instance of a router. A link object represents a directed link. A link object is anchored to two anchor nodes: a local node and a remote node. One half-link can have different characteristics than the corresponding half-link in the other direction. A prefix object is anchored to the node that originates the prefix. Multiple nodes can originate the same prefix, in which case multiple prefix objects exist, one for each node that originates the prefix. The network in Figure 17‑3 contains three nodes and six (half-) links. Only one prefix object is shown in the illustration, although, in reality, a loopback prefix is defined for each node and an interface

prefix for each (half-)link.

17.3 BGP-LS Advertisement BGP-LS uses the Multi-Protocol extensions of BGP (MP-BGP), as specified in RFC4760, to carry the information. The base BGP-LS specification (RFC7752) defines a new address-family named “Link-state”. The Address Family Indicator (AFI) for BGP-LS is 16388, the Subsequent Address Family Indicator (SAFI) is 71 for a non-VPN topology and 72 for a VPN topology. When defining a new address-family, the Network Layer Reachability Information (NLRI) format for that address-family must also be specified. Three BGP-LS NLRI types are defined in RFC7752: Node, Link, and Prefix. IETF draft-ietf-idr-telsp-distribution specifies a fourth type of BGP-LS NLRI: TE Policy. These different types of BGP-LS NLRIs are described in the next sections. RFC7752 also defines a Link-state Attribute that is used to carry additional parameters and characteristics for a BGP-LS NLRI. This Link-state Attribute is advertised together with the BGP-LS NLRI that it applies to. Figure 17‑4 is a high-level illustration of a BGP-LS Update message, showing the different attributes that are present in such a BGP-LS Update message. A BGP-LS Update message contains the mandatory attributes ORIGIN, AS_PATH, and, for iBGP advertisements, LOCAL_PREF. The BGP-LS NLRI is included in the MP_REACH_NLRI attribute (RFC4760). This attribute also contains a Next-hop, which is the BGP nexthop as we know it. It is the IPv4 or IPv6 BGP session address if the advertising node applies next-hop-self. The Link-state Attribute, as mentioned above, contains the properties of the NLRI included in the Update message.

Figure 17-4: BGP-LS Update message showing the different attributes

The usual BGP protocol procedures, specifically best-path selection, also applies to BGP-LS paths. Among all the paths received for a given BGP-LS NLRI, only the best-path is selected and propagated to the other BGP neighbors. Furthermore, only the best-path is provided to the external consumers such as SR-TE. BGP applies the regular selection rules to select the best-path for each BGP-LS NLRI. One of the conditions for BGP to consider a path is the reachability of the BGP nexthop. If the BGP nexthop is unreachable then the path is not considered. The best-path selection can be influenced by modifying the attributes of the paths using BGP routepolicies. In the next section, we will look at the two main parts of the BGP-LS advertisement: BGP-LS NLRI and Link-state Attribute.

17.3.1 BGP-LS NLRI Earlier in this chapter we have introduced the BGP-LS topology abstraction that models a network using three classes of objects: nodes, links, prefixes. Each BGP-LS object is identified by its NLRI. The NLRI is the key to the corresponding object entry in the BGP-LS database, while the Link-state Attribute that is advertised with the NLRI contains the properties and characteristics of this object. Since the BGP-LS NLRI is the key to an object, it must contain sufficient data to uniquely identify an object. The remaining data – that is not required to identify the object – is carried in the associated Link-state Attribute. The general format of a BGP-LS NLRI is shown in Figure 17‑5. The NLRI-type defines the object class (node, link, prefix, …). The data of the NLRI is a set of TLVs.

Figure 17-5: BGP-LS NLRI format

RFC 7752 defines four types of BGP-LS NLRIs: Node NLRI-Type (type 1): describes a node Link NLRI-Type (type 2): describes a (directed) link IPv4 Prefix NLRI-Type (type 3): describes an IPv4 prefix IPv6 Prefix NLRI-Type (type 4): describes an IPv6 prefix IETF draft-ietf-idr-te-lsp-distribution defines a fifth type: TE Policy NLRI-Type (type 5): Describe a TE Policy (SR TE Policy, MPLS Tunnel, IP Tunnel, MPLS state) The general format of these BGP-LS NLRI-types is shown in Figure 17‑6. The first two fields in the BGP-LS NLRI are the same for all NLRI types: Protocol-ID and Identifier. Depending on the NLRI type, it contains various other Descriptor fields as described further.

Ide ntifie r and Instance -ID The Identifier field in the BGP-LS NLRI is the 64-bit number that identifies a “routing universe”. RFC7752 also uses the name “Instance-ID” for the Identifier field. The NLRI “Identifier” field must not be confused with the “BGP-LS Identifier” that is a 32-bit Node Descriptor Sub-TLV. The use of the BGP-LS Identifier is discouraged in IOS XR. The NLRI Identifier field containing the Instance-ID is the recommended way to distinguish between different domains. The BGP-LS Identifier field has value 0 in IOS XR.

Figure 17-6: General format of the BGP-LS NLRI

Remember that a BGP-LS NLRI forms a unique key for an object in the BGP-LS database. The BGPLS database can consist of multiple (logical) topologies that can partially or completely overlap. Even for overlapping topologies, each of the objects must be uniquely identifiable. For example, when migrating a network from OSPF to ISIS, both IGPs can be enabled at the same time on all nodes and links of the physical topology. In that case, the OSPF and ISIS topologies completely overlap. BGP-LS must be able to distinguish the two independent topologies. For example, a given node must have a key (NLRI) for its node object in the OSPF topology and a different key (NLRI) for its node object in the ISIS topology. The two fields that are present in all BGP-LS NLRIs, Protocol-ID and Identifier, allow to maintain separation between multiple overlapping (partially or completely) topologies. We will start by looking at these two fields.

17.3.1.1 Protocol-ID Field The Protocol-ID field of a BGP-LS NLRI identifies the protocol that is the source of the topology information. It identifies the protocol that led to the creation of this BGP-LS NLRI. Table 17‑1 lists the possible values of the Protocol-ID field. For example, a BGP-LS NLRI that is generated based on the information in an OSPFv2 LS-DB, will have a Protocol-ID value 3 (OSPFv2). A BGP-LS NLRI that is generated based on Level-2 ISIS LS-DB entry will have a Protocol-ID value 2 (ISIS L2). Table 17-1: BGP-LS Protocol-IDs Protocol-ID NLRI information source protocol Specification 1

IS-IS Level 1

RFC7752

2

IS-IS Level 2

RFC7752

3

OSPFv2

RFC7752

4

Direct

RFC7752

5

Static configuration

RFC7752

6

OSPFv3

RFC7752

7

BGP

draft-ietf-idr-bgpls-segment-routing-epe

8

RSVP-TE

draft-ietf-idr-te-lsp-distribution

9

Segment Routing

draft-ietf-idr-te-lsp-distribution

17.3.1.2 Identifier Field Multiple instances of a routing protocol may run over the same set of nodes and links, for example using RFC6549 for multi-instance OSPFv2 or using RFC8202 for multi-instance ISIS. Each routing protocol instance defines an independent topology, an independent context, also known as a “routing universe”. The 64-bit Identifier field in each BGP-LS NLRI allows to discriminate which NLRI belongs to which routing protocol instance or routing universe. The Identifier field is used to “stamp” each NLRI with a value that identifies the routing universe the NLRI belongs to. All NLRIs that identify objects (nodes, links, prefixes) from a given routing

universe have the same Identifier value. NLRIs with different Identifier values are part of different routing universes. The Identifier is defined as a flat 64-bit value. RFC 7752 reserves Identifier values 0-31, with 0 indicating the “Default Layer 3 Routing topology”. Values in the range 32 to 264-1 are for "Private Use” and can be freely used. Figure 17‑7 illustrates the use of the Identifier field to discriminate between IGP topologies advertised through the same protocol2. Both nodes in the topology run two identically configured ISIS instances, resulting in two logical topologies that entirely overlap. To enable BGP-LS to distinguish these topologies, a different Identifier field value is used in the BGP-LS NLRIs for each ISIS instance. This makes the NLRIs of each logical topology unique, despite that all other NLRI fields are equal for both topologies.

Figure 17-7: Example of using Identifier field as discriminator between IGP topologies

In IOS XR, the Identifier value can be configured using the instance-id keyword in the distribute link-state

command, as illustrated in Example 17‑1. The default Identifier value is 0, identifying

the “Default Layer 3 Routing topology”. ISIS instance SR-ISIS-1 in Example 17‑1 has Identifier 1000, ISIS instance SR-ISIS-2 has Identifier 2000, and OSPF instance SR-OSPF has Identifier 32.

Example 17-1: Configuration of instance-id router isis distribute ! router isis distribute ! router ospf distribute

SR-ISIS-1 link-state instance-id 1000 SR-ISIS-2 link-state instance-id 2000 SR-OSPF link-state instance-id 32

Two instances of the same protocol on a node cannot have the same instance-id. This is enforced during configuration. The IGP instance of all nodes in the same IGP domain must have the same instance-id since they belong to the same routing universe.

17.3.2 Node NLRI The Node NLRI is the key to identify a node object in the BGP-LS database. This key must be globally unique for each node within the entire network (BGP-LS database). A physical node that is part of multiple routing universes is represented by multiple Node NLRIs, one for each routing universe the physical node participates in. The format of an BGP-LS Node NLRI is shown in Figure 17‑8.

Figure 17-8: Node NLRI format

Besides the Protocol-ID and the Identifier fields that were discussed in the previous sections, the Node NLRI also contains a Local Node Descriptors field. This field consists of a set of one or more TLVs that uniquely identifies the node. The possible Node Descriptor TLVs are:

Autonomous System: Opaque 32-bit number, by default the AS number of the BGP-LS originator3 BGP-LS Identifier (BGP-LS ID): Opaque 32-bit number, by default 045 OSPF Area-ID: 32-bit number identifying the OSPF area of the NLRI IGP Router-ID: Opaque value of variable size, depending on the type of node: ISIS non-pseudonode: 6-octet ISO system-id of the node ISIS pseudonode: 6-octet ISO system-id of the Designated Intermediate System (DIS) followed by the 1-octet, nonzero Pseudonode identifier (PSN ID) OSPFv2 and OSPFv3 non-pseudonode: 4-octet Router-ID OSPFv2 pseudonode: 4-octet Router-ID of the Designated Router (DR) followed by the 4-octet IPv4 address of the DR's interface to the LAN OSPFv3 pseudonode: 4-octet Router-ID of the DR followed by the 4-octet interface identifier of the DR's interface to the LAN BGP Router Identifier (BGP Router-ID): 4-octet BGP Router-ID, as defined in RFC4271 and RFC6286. Confederation Member ASN (Member-ASN): 4-octet number representing the member ASN inside the Confederation. See the sections 17.6 and 17.7 below for illustrations of Node NLRIs.

17.3.3 Link NLRI A Link NLRI uniquely identifies a link object in the BGP-LS database. The format of a Link NLRI is shown in Figure 17‑9.

Figure 17-9: Link NLRI format

A link object, as identified by a BGP-LS Link-type NLRI, represents a unidirectional link (“halflink”). A link object is identified by the two end nodes of the link, the local node and the remote node, also known as the “anchor nodes”, and further information to distinguish different links between the same pair of nodes. The BGP-LS Link-type NLRI contains a Local Node Descriptors field and a Remote Node Descriptors field that define the anchor nodes. These two Node Descriptors fields can contain the same Node Descriptor TLVs as discussed in the previous section about the Node NLRI. The Link NLRI also contains a Link Descriptors field to further distinguish different links between the same pair of nodes. These are the possible TLVs that can be added to the Link Descriptors field: Link Local/Remote Identifier: local and remote interface identifiers, used when the link has no IP address configured (unnumbered). In IOS XR, the SNMP ifIndex of the interface is used as interface identifier. IPv4 interface address: IPv4 address of the local node’s interface IPv4 neighbor address: IPv4 address of the remote node’s interface IPv6 interface address: IPv6 address of the local node’s interface IPv6 neighbor address: IPv6 address of the remote node’s interface

Multi-Topology Identifier: MT-ID TLV containing the (single) MT-ID of the topology where the link is reachable See the sections 17.6 and 17.7 below for illustrations of Link NLRIs.

17.3.4 Prefix NLRI A Prefix NLRI uniquely identifies a prefix object in the BGP-LS database. The Prefix NLRI format, as displayed in Figure 17‑10, has a Local Node Descriptors field and a Prefix Descriptors field.

Figure 17-10: Prefix NLRI format

The Local Node Descriptors field uniquely identifies the node that originates the prefix, which is the prefix’s anchor node. The same TLVs can be used for this Node Descriptors field as described in the above section for the Node NLRI. The Prefix Descriptors field is then used to uniquely identify the prefix advertised by the node. The Possible TLVs that can be included in the Prefix Descriptors fields are: Multi-topology Identifier: a TLV containing the (single) MT-ID of the topology where the prefix is reachable OSPF Route Type: The type of OSPF route as defined in the OSPF protocol. – 1: Intra-area; 2: inter-area; 3: External type 1; 4: External type 2; 5: NSSA type 1; 6: NSSA type 2 IP Reachability Information: a TLV containing a single IP address prefix (IPv4 or IPv6) and its prefix length; this is the prefix itself

See the sections 17.6 and 17.7 below for illustrations of Prefix NLRIs.

17.3.5 TE Policy NLRI A TE Policy NLRI uniquely identifies a TE Policy object in the BGP-LS database. The format of a BGP-LS TE Policy NLRI is shown in Figure 17‑11. This NLRI is specified in draft-ietf-idr-te-lspdistribution. A TE Policy object represents a unidirectional TE Policy. A TE Policy object is anchored to the TE Policy’s Headend node. The Headend Node Descriptors field in the NLRI identifies this anchor node. This field can use the same Node Descriptors TLVs as described in the above section for the Node NLRI. The TE Policy Descriptors field in the NLRI is used to uniquely identify the TE Policy on the headend anchor node.

Figure 17-11: TE Policy NLRI format

The Possible TLVs that can be included in the TE Policy Descriptors fields are: Tunnel ID: 16-bits Tunnel Identifier, as specified in RFC3209 (RSVP-TE: Extensions to RSVP for LSP Tunnels) LSP ID: 16-bits LSP Identifier, as specified in RFC3209 IPv4/6 Tunnel Head-end address: IPv4/IPv6 Tunnel Head-End Address, as specified in RFC3209

IPv4/6 Tunnel Tail-end address: IPv4/IPv6 Tunnel Tail-End Address, as specified in RFC3209 SR Policy Candidate Path: the tuple (Endpoint, Policy Color, Protocol-Origin, Originator ASN, Originator Address, Distinguisher) as defined in ietf-spring-segment-routing-policy Local MPLS Cross Connect: a local MPLS state in the form of an incoming label and an interface followed by an outgoing label and an interface At time of writing, IOS XR does not advertise TE Policies in BGP-LS.

17.3.6 Link-State Attribute While an BGP-LS NLRI is the unique identifier or key of a BGP-LS object, the Link-state Attribute is a container for additional characteristics of the object. Examples of object characteristics are the node’s name, the link bandwidth, the prefix route-tag, etc. Refer to the different IETF documents for the possible information that can be included in the Linkstate Attribute: BGP-LS – RFC7752 Segment Routing – draft-ietf-idr-bgp-ls-segment-routing-ext IGP TE Performance Metric Extensions – draft-ietf-idr-te-pm-bgp Egress Peer Engineering – draft-ietf-idr-bgpls-segment-routing-epe TE Policies – draft-ietf-idr-te-lsp-distribution We will look at some examples in the illustrations sections 17.6 and 17.7 below.

17.4 SR BGP Egress Peer Engineering Chapter 14, "SR BGP Egress Peer Engineering" describes the EPE functionality in detail. SR BGP EPE allocates a Peering SID for a BGP session and advertises the BGP peering session information in BGP-LS for a controller to use this information in its path computation. EPE is enabled on a BGP session as shown in Example 17‑2. Example 17-2: EPE configuration router bgp 1 neighbor 99.4.5.5 egress-engineering

Since the EPE-enabled BGP Peerings are advertised in BGP-LS as links (using link-type NLRI), they are identified by their local and remote anchor nodes plus some additional information to distinguish multiple parallel links between the two nodes. The advertisements of the different Peering Segment types are described in the following sections.

17.4.1 PeerNode SID BGP-LS Advertisement The PeerNode SID is advertised with a BGP-LS Link-type NLRI, See Figure 17‑9. The following information is provided in the BGP-LS NLRI and BGP-LS Attribute: BGP-LS Link NLRI: Protocol-ID: BGP Identifier: 0 in IOS XR Local Node Descriptors contains: Local BGP Router-ID Local ASN Remote Node Descriptors contains: Peer BGP Router-ID

Peer ASN Link Descriptors contains (depending on the address-family used for the BGP session): IPv4 Interface Address TLV contains the BGP session IPv4 local address IPv4 Neighbor Address TLV contains the BGP session IPv4 peer address IPv6 Interface Address TLV contains the BGP session IPv6 local address IPv6 Neighbor Address TLV contains the BGP session IPv6 peer address BGP-LS Attribute contains: PeerNode SID TLV Other TLV entries may be present for each of these fields, such as additional node descriptor fields as specified in BGP-LS RFC 7752. In addition, BGP-LS Node and Link Attributes, as defined in RFC 7752, may be added in order to advertise the characteristics of the link. Examples of a PeerNode-SID advertisements are included in chapter 14, "SR BGP Egress Peer Engineering", section 14.5.

17.4.2 PeerAdj SID BGP-LS Advertisement The PeerAdj SID is advertised with a BGP-LS Link-type NLRI, See Figure 17‑9. The following information is provided in the BGP-LS NLRI and BGP-LS Attribute: BGP-LS Link NLRI: Protocol-ID: BGP Identifier: 0 in IOS XR Local Node Descriptors contains TLVs: Local BGP Router-ID Local ASN

Remote Node Descriptors contains TLVs: Peer BGP Router-ID Peer ASN Link Descriptors contains TLV: Link Local/Remote Identifiers contains the 4-octet Link Local Identifier followed by the 4-octet value 0 indicating that the Link Remote Identifier is unknown Link Descriptors may contain in addition TLVs: IPv4 Interface Address TLV contains the IPv4 address of the local interface through which the BGP session is established IPv4 Neighbor Address TLV contains the BGP session IPv4 peer address IPv6 Interface Address TLV contains the IPv6 address of the local interface through which the BGP session is established IPv6 Neighbor Address TLV contains the BGP session IPv6 peer address BGP-LS Attribute: PeerAdj SID TLV Other TLV entries may be present for each of these fields, such as additional node descriptor fields as described in BGP-LS RFC 7752. In addition, BGP-LS Nodes and Link Attributes, as defined in RFC 7752, may be added in order to advertise the characteristics of the link. Examples of a PeerAdj-SID advertisements are included in chapter 14, "SR BGP Egress Peer Engineering", section 14.5.

17.4.3 PeerSet SID BGP-LS Advertisement The PeerSet SID TLV is added to the BGP-LS Attribute of a PeerNode SID or PeerAdj SID BGP-LS advertisement, in addition to the PeerNode SID or PeerAdj SID TLV.

At the time of writing, PeerSet SIDs are not supported in IOS XR.

17.5 Configuration To configure a BGP-LS session, use address-family link-state link-state. BGP-LS can be enabled for an IPv4 or IPv6 Internal BGP (iBGP) or External BGP (eBGP) sessions. BGP-LS updates follow the regular BGP propagation functionality, for example, they are subject to best-path selection and are reflected by Route Reflectors. Also the BGP nexthop attribute is updated according to the regular procedures. The node in Example 17‑3 has an iBGP BGP-LS session to neighbor 1.1.1.2 and an eBGP BGP-LS session to 2001::1:1:1:2. The eBGP session in this example is multi-hop. Example 17-3: BGP-LS session configuration route-policy bgp_in pass end-policy ! route-policy bgp_out pass end-policy ! router bgp 1 address-family link-state link-state ! neighbor 1.1.1.2 remote-as 1 update-source Loopback0 address-family link-state link-state ! neighbor 2001::1:1:1:2 remote-as 2 ebgp-multihop 2 update-source Loopback0 address-family link-state link-state route-policy bgp_in in route-policy bgp_out out

17.6 ISIS Topology

Figure 17-12: BGP-LS advertisements – network topology

The network topology in Figure 17‑12 consists of two ISIS Level-2 nodes interconnected by a link. Both nodes have SR enabled for IPv4 and advertise a Prefix-SID for their loopback prefix. Both nodes have TI-LFA enabled on their interface such that they advertise a protected and unprotected Adj-SID for the adjacency. The nodes also have a TE metric (also known as Administrative Weight) and an affinity link color (also known as affinity bits or Administrative Group) configured for the link. Link-delay measurement is enabled on the link to measure the link-delay and advertise the link-delay metrics as described in chapter 15, "Performance Monitoring – Link Delay". The relevant parts of Node1’s configuration are shown in Example 17‑4.

Example 17-4: ISIS topology – Node1 configuration hostname xrvr-1 ! interface Loopback0 ipv4 address 1.1.1.1 255.255.255.255 ! interface GigabitEthernet0/0/0/0 description to xrvr-2 ipv4 address 99.1.2.1 255.255.255.0 ! router isis SR is-type level-2-only net 49.0001.0000.0000.0001.00 distribute link-state instance-id 101 address-family ipv4 unicast metric-style wide router-id Loopback0 segment-routing mpls ! interface Loopback0 passive address-family ipv4 unicast prefix-sid absolute 16001 ! ! interface GigabitEthernet0/0/0/0 point-to-point address-family ipv4 unicast fast-reroute per-prefix fast-reroute per-prefix ti-lfa ! ! segment-routing traffic-eng interface GigabitEthernet0/0/0/0 affinity name BLUE ! metric 20 ! affinity-map name BLUE bit-position 0 ! performance-measurement interface GigabitEthernet0/0/0/0 delay-measurement

The ISIS Link-state database entry for Node1 is shown in Example 17‑5. Node1 originates a Link-State PDU stating what area it is in (49.0001) and its hostname (xrvr-1). The Network Layer Protocol IDentifier (NLPID) value indicates that the node supports IPv4 (0xcc). Node1 advertises its loopback0 IPv4 address 1.1.1.1 as a router-id in the TLVs IP address, Router ID, and Router Capabilities.

The Router Capability TLV contains more attributes of this node: SR is enabled for IPv4 (I:1), the SRGB is [16000-23999] (first label 16000, size 8000), the SRLB is [15000-15999]. Node1 supports SR algorithm 0 (SPF) and 1 (strict-SPF), and its type-1 Node Maximum SID Depth (MSD) is 10. Node1 advertises its loopback0 prefix 1.1.1.1/32 with a Prefix-SID 16001 (index 1). The adjacency to Node2 has an IGP metric 10, a TE metric (Admin. Weight) 20, and a minimumdelay link metric 5047. The adjacency carries the affinity color identified by bit 0 in the affinity bitmap. Two Adj-SIDs are advertised for this adjacency: protected (B:1) Adj-SID 24012 and unprotected (B:0) Adj-SID 24112. Example 17-5: ISIS database entry for Node1 RP/0/0/CPU0:xrvr-1#show isis database verbose xrvr-1 IS-IS SR (Level-2) Link State Database LSPID LSP Seq Num LSP Checksum LSP Holdtime/Rcvd xrvr-1.00-00 * 0x0000022a 0x2880 1180 /* Area Address: 49.0001 NLPID: 0xcc Hostname: xrvr-1 IP Address: 1.1.1.1 Router ID: 1.1.1.1 Router Cap: 1.1.1.1, D:0, S:0 Segment Routing: I:1 V:0, SRGB Base: 16000 Range: 8000 SR Local Block: Base: 15000 Range: 1000 SR Algorithm: Algorithm: 0 Algorithm: 1 Node Maximum SID Depth: Label Imposition: 10 Metric: 0 IP-Extended 1.1.1.1/32 Prefix-SID Index: 1, Algorithm:0, R:0 N:1 P:0 E:0 V:0 L:0 Prefix Attribute Flags: X:0 R:0 N:1 Source Router ID: 1.1.1.1 Metric: 10 IP-Extended 99.1.2.0/24 Prefix Attribute Flags: X:0 R:0 N:0 Metric: 10 IS-Extended xrvr-2.00 Affinity: 0x00000001 Interface IP Address: 99.1.2.1 Neighbor IP Address: 99.1.2.2 Admin. Weight: 20 Ext Admin Group: Length: 4 0x00000001 Link Average Delay: 5831 us Link Min/Max Delay: 5047/7047 us Link Delay Variation: 499 us Link Maximum SID Depth: Label Imposition: 10 ADJ-SID: F:0 B:1 V:1 L:1 S:0 P:0 weight:0 Adjacency-sid:24012 ADJ-SID: F:0 B:0 V:1 L:1 S:0 P:0 weight:0 Adjacency-sid:24112 Total Level-2 LSP count: 1

Local Level-2 LSP count: 1

ATT/P/OL 0/0/0

ISIS feeds the LS-DB information to BGP-LS and SR-TE. The content of the BGP-LS database is shown in Example 17‑6. The output shows the BGP-LS NLRIs in string format. The legend of the NLRI fields in these strings is indicated on top of the output. The first field ([V], [E], or [T]) indicates the BGP-LS NLRI type: [V] Node, [E] Link, or [T] Prefix. To verify the Link-state Attribute that is associated with the NLRI, the command show bgp linkstate link-state [detail]

must be used. In the next sections we will

show examples of each NLRI. Example 17-6: BGP-LS database content RP/0/0/CPU0:xrvr-1#show bgp link-state link-state BGP router identifier 1.1.1.1, local AS number 1

Status codes: s suppressed, d damped, h history, * valid, > best i - internal, r RIB-failure, S stale, N Nexthop-discard Origin codes: i - IGP, e - EGP, ? - incomplete Prefix codes: E link, V node, T IP reacheable route, u/U unknown I Identifier, N local node, R remote node, L link, P prefix L1/L2 ISIS level-1/level-2, O OSPF, D direct, S static/peer-node a area-ID, l link-ID, t topology-ID, s ISO-ID, c confed-ID/ASN, b bgp-identifier, r router-ID, i if-address, n nbr-address, o OSPF Route-type, p IP-prefix d designated router address Network Next Hop Metric LocPrf Weight Path *> [V][L2][I0x65][N[c1][b0.0.0.0][s0000.0000.0001.00]]/328 0.0.0.0 0 i *> [V][L2][I0x65][N[c1][b0.0.0.0][s0000.0000.0002.00]]/328 0.0.0.0 0 i *> [E][L2][I0x65][N[c1][b0.0.0.0][s0000.0000.0001.00]][R[c1][b0.0.0.0][s0000.0000.0002.00]][L[i99.1.2.1] [n99.1.2.2]]/696 0.0.0.0 0 i *> [E][L2][I0x65][N[c1][b0.0.0.0][s0000.0000.0002.00]][R[c1][b0.0.0.0][s0000.0000.0001.00]][L[i99.1.2.2] [n99.1.2.1]]/696 0.0.0.0 0 i *> [T][L2][I0x65][N[c1][b0.0.0.0][s0000.0000.0001.00]][P[p99.1.2.0/24]]/392 0.0.0.0 0 i *> [T][L2][I0x65][N[c1][b0.0.0.0][s0000.0000.0001.00]][P[p1.1.1.1/32]]/400 0.0.0.0 0 i *> [T][L2][I0x65][N[c1][b0.0.0.0][s0000.0000.0002.00]][P[p99.1.2.0/24]]/392 0.0.0.0 0 i *> [T][L2][I0x65][N[c1][b0.0.0.0][s0000.0000.0002.00]][P[p1.1.1.2/32]]/400 0.0.0.0 0 i Processed 8 prefixes, 8 paths

17.6.1 Node NLRI The Node NLRI for Node1 is shown in Example 17‑7. By using the detail keyword in the show command, you get a breakdown of the NLRI fields at the top of the output.

These are the fields in the Node NLRI string: [V]

NLRI Type: Node

[L2]

Protocol: ISIS L2

[I0x65]

Identifier: 0x65 = 101 (this is the instance-id)

[N ...]

Local Node Descriptor:

[c1]

AS Number: 1

[b0.0.0.0]

BGP Identifier: 0.0.0.0

[s0000.0000.0001.00]

ISO Node ID: 0000.0000.0001.00

The prefix-length at the end of the NLRI string (/328) is the length of the NLRI in bits. The Protocol and ISO Node ID can be derived from the ISIS IS-type and LSP-ID Example 17-7: BGP-LS Node NLRI of Node1 RP/0/0/CPU0:xrvr-1#show bgp link-state link-state [V][L2][I0x65][N[c1][b0.0.0.0][s0000.0000.0001.00]]/328 detail BGP routing table entry for [V][L2][I0x65][N[c1][b0.0.0.0][s0000.0000.0001.00]]/328 NLRI Type: Node Protocol: ISIS L2 Identifier: 0x65 Local Node Descriptor: AS Number: 1 BGP Identifier: 0.0.0.0 ISO Node ID: 0000.0000.0001.00

Link-state: Node-name: xrvr-1, ISIS area: 49.00.01, Local TE Router-ID: 1.1.1.1, SRGB: 16000:8000, SR-ALG: 0 SR-ALG: 1, SRLB: 15000:1000, MSD: Type 1 Value 10

The Link-state attribute is shown at the bottom of the output, preceded by Link-state:. These are the elements in the Link-state attribute for the Node NLRI example: Node-name: xrvr-1

ISIS dynamic hostname xrvr-1

ISIS area: 49.00.01

ISIS area address 49.0001

Local TE Router-ID: 1.1.1.1

Local node (Node1) IPv4 TE router-id 1.1.1.1

SRGB: 16000:8000

SRGB: [16000-23999]

SR-ALG: 0 SR-ALG: 1

SR Algorithms: SPF (0), strict-SPF (1)

SRLB: 15000:1000

SRLB: [15000-15999]

MSD: Type 1 Value 10

Node Maximum SID Depth (MSD): 10

These attributes correspond to the TLVs in the ISIS LSP as displayed in Example 17‑5.

17.6.2 Link NLRI The Link NLRI for the (half-)link Node1→Node2 in the IPv4 topology is shown in Example 17‑8. By using the detail keyword in the show command, you get a breakdown of the NLRI fields. These are the fields in the Link NLRI string: [E]

NLRI Type: Link

[L2]

Protocol: ISIS L2

[I0x65]

Identifier: 0x65 = 101

[N ...]

Local Node Descriptor:

[c1]

AS Number: 1

[b0.0.0.0]

BGP Identifier: 0.0.0.0

[s0000.0000.0001.00]

ISO Node ID: 0000.0000.0001.00

[R ...]

Remote Node Descriptor:

[c1]

AS Number: 1

[b0.0.0.0]

BGP Identifier: 0.0.0.0

[s0000.0000.0002.00]

ISO Node ID: 0000.0000.0002.00

[L ...]

Link Descriptor:

[i99.1.2.1]

Local Interface Address IPv4: 99.1.2.1

[n99.1.2.2]

Neighbor Interface Address IPv4: 99.1.2.2

The prefix-length at the end (/696) is the length of the NLRI in bits. Example 17-8: BGP-LS Link NLRI of link Node1→Node2 in IPv4 topology RP/0/0/CPU0:xrvr-1#show bgp link-state link-state [E][L2][I0x65][N[c1][b0.0.0.0][s0000.0000.0001.00]][R[c1] [b0.0.0.0][s0000.0000.0002.00]][L[i99.1.2.1][n99.1.2.2]]/696 detail BGP routing table entry for [E][L2][I0x65][N[c1][b0.0.0.0][s0000.0000.0001.00]][R[c1][b0.0.0.0] [s0000.0000.0002.00]][L[i99.1.2.1][n99.1.2.2]]/696 NLRI Type: Link Protocol: ISIS L2 Identifier: 0x65 Local Node Descriptor: AS Number: 1 BGP Identifier: 0.0.0.0 ISO Node ID: 0000.0000.0001.00 Remote Node Descriptor: AS Number: 1 BGP Identifier: 0.0.0.0 ISO Node ID: 0000.0000.0002.00 Link Descriptor: Local Interface Address IPv4: 99.1.2.1 Neighbor Interface Address IPv4: 99.1.2.2

Link-state: Local TE Router-ID: 1.1.1.1, Remote TE Router-ID: 1.1.1.2, admin-group: 0x00000001, TE-default-metric: 20, metric: 10, ADJ-SID: 24012(70), ADJ-SID: 24112(30), MSD: Type 1 Value 10, Link Delay: 5831 us Flags: 0x00, Min Delay: 5047 us Max Delay: 7047 us Flags: 0x00, Delay Variation: 499 us

The Link-state attribute is shown in the output, preceded by Link-state. These are the elements in the Link-state attribute for the Link NLRI example: Local TE RouterID: 1.1.1.1

Local node (Node1) IPv4 TE router-id: 1.1.1.1

Remote TE Router-ID: 1.1.1.2

Remote node (Node2) IPv4 TE router-id: 1.1.1.2

admin-group: 0x00000001

Affinity (color) bitmap: 0x00000001

TE-defaultmetric: 20

TE metric: 20

metric: 10

IGP metric: 10

ADJ-SID: 24012(70), ADJSID: 24112(30)

Adj-SIDs: protected (flags 0x70 → B-flag=1) 24012; unprotected (flags 0x30 → B-flag=0) 24112 – refer to the ISIS LS-DB output in Example 17‑5 for the flags

MSD: Type 1 Value 10

Link Maximum SID Depth: 10

Link Delay: 5831 us Flags: 0x00

Average Link Delay: 5.831 ms

Min Delay: 5047 us

Minimum Link Delay: 5.047 ms

Max Delay: 7047 us Flags: 0x00

Maximum Link Delay: 7.047 ms

Delay Variation: 499 us

Link Delay Variation: 0.499 ms

These attributes correspond to the TLVs advertised with the adjacency in the ISIS LSP of Example 17‑5.

17.6.3 Prefix NLRI The Prefix NLRI for Node1’s loopback IPv4 prefix 1.1.1.1/32 is shown in Example 17‑9. By using the detail keyword in the command, you get a breakdown of the NLRI fields. These are the fields in the Prefix NLRI string: [T]

NLRI Type: Prefix

[L2]

Protocol: ISIS L2

[I0x65]

Identifier: 0x65 = 101

[N ...]

Local Node Descriptor:

[c1]

AS Number: 1

[b0.0.0.0]

BGP Identifier: 0.0.0.0

[s0000.0000.0001.00]

ISO Node ID: 0000.0000.0001.00

[P ...] [p1.1.1.1/32]

Prefix Descriptor: Prefix: 1.1.1.1/32

The prefix-length at the end (/696) is the length of the NLRI in bits.

Example 17-9: BGP-LS Prefix NLRI of Node1’s loopback IPv4 prefix RP/0/0/CPU0:xrvr-1#show bgp link-state link-state [T][L2][I0x65][N[c1][b0.0.0.0][s0000.0000.0001.00]] [P[p1.1.1.1/32]]/400 detail BGP routing table entry for [T][L2][I0x65][N[c1][b0.0.0.0][s0000.0000.0001.00]][P[p1.1.1.1/32]]/400 NLRI Type: Prefix Protocol: ISIS L2 Identifier: 0x65 Local Node Descriptor: AS Number: 1 BGP Identifier: 0.0.0.0 ISO Node ID: 0000.0000.0001.00 Prefix Descriptor: Prefix: 1.1.1.1/32

Link-state: Metric: 0, PFX-SID: 1(40/0), Extended IGP flags: 0x20, Source Router ID: 1.1.1.1

The Link-state attribute is shown in the output, preceded by Link-state:. These are the elements in the Link-state attribute for the Prefix NLRI example: Metric: 0

IGP metric: 0

PFX-SID: 1(40/0)

Prefix-SID index: 1; SPF (algo 0); flags 0x40 → N-flag=1

Extended IGP flags: 0x20

Extended prefix attribute flags: 0x20 → N-flag=1

Source Router ID: 1.1.1.1

Originating node router-id: 1.1.1.1

These attributes correspond to the TLVs advertised with prefix 1.1.1.1/32 in the ISIS LSP of Example 17‑5.

17.7 OSPF Topology

Figure 17-13: Two-node OSPF topology

The network topology in Figure 17‑13 consists of two OSPF area 0 nodes interconnected by a link. Both nodes have SR enabled and advertise a Prefix-SID for their loopback prefix. Both nodes have TI-LFA enabled on their interface such that they advertise a protected and unprotected Adj-SID for the adjacency. The nodes also have a TE metric (also known as Administrative Weight) and an affinity link color (also known as affinity bits or Administrative Group) configured for the link. Delay measurement is enabled on the link to measure the link delay and advertise the link-delay metrics as described in chapter 15, "Performance Monitoring – Link Delay". The relevant parts of Node1’s configuration are shown in Example 17‑10.

Example 17-10: OSPF topology – Node1 configuration interface Loopback0 ipv4 address 1.1.1.1 255.255.255.255 ! interface GigabitEthernet0/0/0/0 description to xrvr-2 ipv4 address 99.1.2.1 255.255.255.0 ! router ospf SR distribute link-state instance-id 102 log adjacency changes router-id 1.1.1.1 segment-routing mpls fast-reroute per-prefix fast-reroute per-prefix ti-lfa enable area 0 interface Loopback0 passive enable prefix-sid absolute 16001 ! interface GigabitEthernet0/0/0/0 network point-to-point ! ! mpls traffic-eng router-id 1.1.1.1 ! segment-routing traffic-eng interface GigabitEthernet0/0/0/0 affinity name BLUE ! metric 20 ! affinity-map name BLUE bit-position 0 ! performance-measurement interface GigabitEthernet0/0/0/0 delay-measurement

Example 17‑11 shows the OSPF LSAs as advertised by Node1. The main one is the Router LSA (type 1), which advertises intra-area adjacencies and connected prefixes. The other LSAs are Opaque LSAs of various types6, as indicated by the first digit in the Link ID: TE (type 1), Router Information (type 4), Extended Prefix LSA (type 7), and Extended Link LSA (type 8). BGP-LS consolidates information of the various LSAs in its Node, Link, and Prefix NLRIs. In the next sections we will discuss the different BGP-LS NLRIs and how their information maps to the OSPF LSAs.

Example 17-11: OSPF LSAs advertised by xrvr-1 RP/0/0/CPU0:xrvr-1#show ospf database adv-router 1.1.1.1

OSPF Router with ID (1.1.1.1) (Process ID 1) Router Link States (Area 0) Link ID 1.1.1.1

ADV Router 1.1.1.1

Age 1252

Seq# Checksum Link count 0x800000d9 0x005aff 3

Type-10 Opaque Link Area Link States (Area 0) Link ID 1.0.0.0 1.0.0.3 4.0.0.0 7.0.0.1 8.0.0.3

ADV Router 1.1.1.1 1.1.1.1 1.1.1.1 1.1.1.1 1.1.1.1

Age 1252 743 1252 1252 113

Seq# 0x800000d8 0x800000da 0x800000d9 0x800000d8 0x800000d9

Checksum Opaque ID 0x00a8a9 0 0x003601 3 0x004135 0 0x003688 1 0x006d08 3

OSPF feeds its LS-DB content to BGP-LS and SR-TE. The content of the BGP-LS database is shown in Example 17‑12. The output shows the BGP-LS NLRIs in string format. The legend of the NLRI fields in these strings is indicated on top of the output. The first field ([V], [E], or [T]) indicates the BGP-LS NLRI type: [V] Node, [E] Link, or [T] Prefix. A detailed view of a given NLRI, including the Link-state Attribute that is associated with the NLRI, can be displayed by using the command show bgp link-state link-state [detail].

In the next sections we will show examples of each NLRI.

Example 17-12: BGP-LS database content – OSPF RP/0/0/CPU0:xrvr-1#show bgp link-state link-state BGP router identifier 1.1.1.1, local AS number 1

Status codes: s suppressed, d damped, h history, * valid, > best i - internal, r RIB-failure, S stale, N Nexthop-discard Origin codes: i - IGP, e - EGP, ? - incomplete Prefix codes: E link, V node, T IP reacheable route, u/U unknown I Identifier, N local node, R remote node, L link, P prefix L1/L2 ISIS level-1/level-2, O OSPF, D direct, S static/peer-node a area-ID, l link-ID, t topology-ID, s ISO-ID, c confed-ID/ASN, b bgp-identifier, r router-ID, i if-address, n nbr-address, o OSPF Route-type, p IP-prefix d designated router address Network Next Hop Metric LocPrf Weight Path *> [V][O][I0x66][N[c1][b0.0.0.0][a0.0.0.0][r1.1.1.1]]/376 0.0.0.0 0 i *> [V][O][I0x66][N[c1][b0.0.0.0][a0.0.0.0][r1.1.1.2]]/376 0.0.0.0 0 i *> [E][O][I0x66][N[c1][b0.0.0.0][a0.0.0.0][r1.1.1.1]][R[c1][b0.0.0.0][a0.0.0.0][r1.1.1.2]][L[i99.1.2.1] [n99.1.2.2]]/792 0.0.0.0 0 i *> [E][O][I0x66][N[c1][b0.0.0.0][a0.0.0.0][r1.1.1.2]][R[c1][b0.0.0.0][a0.0.0.0][r1.1.1.1]][L[i99.1.2.2] [n99.1.2.1]]/792 0.0.0.0 0 i *> [T][O][I0x66][N[c1][b0.0.0.0][a0.0.0.0][r1.1.1.1]][P[o0x01][p99.1.2.0/24]]/480 0.0.0.0 0 i *> [T][O][I0x66][N[c1][b0.0.0.0][a0.0.0.0][r1.1.1.1]][P[o0x01][p1.1.1.1/32]]/488 0.0.0.0 0 i *> [T][O][I0x66][N[c1][b0.0.0.0][a0.0.0.0][r1.1.1.2]][P[o0x01][p99.1.2.0/24]]/480 0.0.0.0 0 i *> [T][O][I0x66][N[c1][b0.0.0.0][a0.0.0.0][r1.1.1.2]][P[o0x01][p1.1.1.2/32]]/488 0.0.0.0 0 i Processed 8 prefixes, 8 paths

17.7.1 Node NLRI The Node NLRI for Node1 is shown in Example 17‑13. By using the detail keyword in the show command, you get a breakdown of the NLRI fields at the top of the output. These are the fields in the Node NLRI string: [V]

NLRI Type: Node

[O]

Protocol: OSPF

[I0x66]

Identifier: 0x66 = 102 (this is the instance-id)

[N ...]

Local Node Descriptor:

[c1]

AS Number: 1

[b0.0.0.0]

BGP Identifier: 0.0.0.0

[a0.0.0.0]

Area ID: 0.0.0.0

[r1.1.1.1]

Router ID IPv4: 1.1.1.1

The prefix-length at the end of the NLRI string (/376) is the length of the NLRI in bits. The area ID and OSPF router ID can be derived from the OSPF packet header. Example 17-13: BGP-LS Node NLRI of xrvr-1 RP/0/0/CPU0:xrvr-1#show bgp link-state link-state [V][O][I0x66][N[c1][b0.0.0.0][a0.0.0.0][r1.1.1.1]]/376 detail BGP routing table entry for [V][O][I0x66][N[c1][b0.0.0.0][a0.0.0.0][r1.1.1.1]]/376 NLRI Type: Node Protocol: OSPF Identifier: 0x66 Local Node Descriptor: AS Number: 1 BGP Identifier: 0.0.0.0 Area ID: 0.0.0.0 Router ID IPv4: 1.1.1.1

Link-state: Local TE Router-ID: 1.1.1.1, SRGB: 16000:8000 SR-ALG: 0 SR-ALG: 1, MSD: Type 1 Value 10

The Link-state attribute is shown at the bottom of the output, preceded by Link-state:. These are the elements in the Link-state attribute for the Node NLRI example: Local TE Router-ID: 1.1.1.1

Local node (Node1) IPv4 TE router-id 1.1.1.1

SRGB: 16000:8000

SRGB: [16000-23999]

SR-ALG: 0 SR-ALG: 1

SR Algorithms: SPF (0), strict-SPF (1)

MSD: Type 1 Value 10

Node Maximum SID Depth (MSD): 10

The TE router-id can be retrieved from one of the TE Opaque LSAs, as shown in Example 17‑14. The SR information (SRGB, Algorithms, MSD) can be retrieved from the Router Information Opaque LSA. See Example 17‑15.

Example 17-14: OSPF TE router-ID RP/0/0/CPU0:xrvr-1#show ospf database opaque-area 1.0.0.0 adv-router 1.1.1.1

OSPF Router with ID (1.1.1.1) (Process ID 1) Type-10 Opaque Link Area Link States (Area 0) LS age: 410 Options: (No TOS-capability, DC) LS Type: Opaque Area Link Link State ID: 1.0.0.0 Opaque Type: 1 Opaque ID: 0 Advertising Router: 1.1.1.1 LS Seq Number: 800000d9 Checksum: 0xa6aa Length: 28 MPLS TE router ID : 1.1.1.1 Number of Links : 0

Example 17-15: OSPF Router Information LSA RP/0/0/CPU0:xrvr-1#show ospf database opaque-area 4.0.0.0 adv-router 1.1.1.1

OSPF Router with ID (1.1.1.1) (Process ID 1) Type-10 Opaque Link Area Link States (Area 0) LS age: 521 Options: (No TOS-capability, DC) LS Type: Opaque Area Link Link State ID: 4.0.0.0 Opaque Type: 4 Opaque ID: 0 Advertising Router: 1.1.1.1 LS Seq Number: 800000da Checksum: 0x3f36 Length: 60 Router Information TLV: Length: 4 Capabilities: Graceful Restart Helper Capable Stub Router Capable All capability bits: 0x60000000 Segment Routing Algorithm TLV: Length: 2 Algorithm: 0 Algorithm: 1 Segment Routing Range TLV: Length: 12 Range Size: 8000 SID sub-TLV: Length 3 Label: 16000 Node MSD TLV: Length: 2 Type: 1, Value 10

17.7.2 Link NLRI The Link NLRI for the (half-)link Node1→Node2 in the OSPF topology is shown in Example 17‑16. By using the detail keyword in the show command, you get a breakdown of the NLRI fields. These are the fields in the Link NLRI string: [E]

NLRI Type: Link

[O]

Protocol: OSPF

[I0x66]

Identifier: 0x66 = 102

[N ...]

Local Node Descriptor:

[c1]

AS Number: 1

[b0.0.0.0]

BGP Identifier: 0.0.0.0

[a0.0.0.0]

Area ID: 0.0.0.0

[r1.1.1.1]

Router ID IPv4: 1.1.1.1

[R ...]

Remote Node Descriptor:

[c1]

AS Number: 1

[b0.0.0.0]

BGP Identifier: 0.0.0.0

[a0.0.0.0]

Area ID: 0.0.0.0

[r1.1.1.2]

Router ID IPv4: 1.1.1.2

[L ...]

Link Descriptor:

[i99.1.2.1]

Local Interface Address IPv4: 99.1.2.1

[n99.1.2.2]

Neighbor Interface Address IPv4: 99.1.2.2

The prefix-length at the end of the NLRI string (/792) is the length of the NLRI in bits.

Example 17-16: BGP-LS Link NLRI of link Node1→Node2 in IPv4 topology

RP/0/0/CPU0:xrvr-1#show bgp link-state link-state [E][O][I0x66][N[c1][b0.0.0.0][a0.0.0.0][r1.1.1.1]][R[c1]
[b0.0.0.0][a0.0.0.0][r1.1.1.2]][L[i99.1.2.1][n99.1.2.2]]/792 detail
BGP routing table entry for [E][O][I0x66][N[c1][b0.0.0.0][a0.0.0.0][r1.1.1.1]][R[c1][b0.0.0.0][a0.0.0.0]
[r1.1.1.2]][L[i99.1.2.1][n99.1.2.2]]/792
NLRI Type: Link
Protocol: OSPF
Identifier: 0x66
Local Node Descriptor:
  AS Number: 1
  BGP Identifier: 0.0.0.0
  Area ID: 0.0.0.0
  Router ID IPv4: 1.1.1.1
Remote Node Descriptor:
  AS Number: 1
  BGP Identifier: 0.0.0.0
  Area ID: 0.0.0.0
  Router ID IPv4: 1.1.1.2
Link Descriptor:
  Local Interface Address IPv4: 99.1.2.1
  Neighbor Interface Address IPv4: 99.1.2.2

Link-state: Local TE Router-ID: 1.1.1.1, Remote TE Router-ID: 1.1.1.2
            admin-group: 0x00000001, max-link-bw (kbits/sec): 1000000
            TE-default-metric: 20, metric: 1, ADJ-SID: 24012(e0)
            ADJ-SID: 24112(60), MSD: Type 1 Value 10,
            Link Delay: 5831 us Flags: 0x00
            Min Delay: 5047 us Max Delay: 7047 us Flags: 0x00,
            Delay Variation: 499 us
            Link ID: Local:3 Remote:4

The Link-state attribute is shown in the output, preceded by Link-state:. These are the elements in the Link-state attribute for the Link NLRI example:

Local TE Router-ID: 1.1.1.1

Local node (Node1) IPv4 TE router-id: 1.1.1.1

Remote TE Router-ID: 1.1.1.2

Remote node (Node2) IPv4 TE router-id: 1.1.1.2

admin-group: 0x00000001

Affinity (color) bitmap: 0x00000001

max-link-bw (kbits/sec): 1000000

Bandwidth of the link: 1Gbps

TE-default-metric: 20

TE metric: 20

metric: 1

IGP metric: 1

ADJ-SID: 24012(e0), ADJ-SID: 24112(60)

Adj-SIDs: protected (flags 0xe0 → B-flag=1) 24012; unprotected (flags 0x60 → B-flag=0) 24112 (see the flag decoding sketch after this list)

MSD: Type 1 Value 10

Link Maximum SID Depth: 10

Link Delay: 5831 us Flags: 0x00

Average Link Delay: 5.831 ms

Min Delay: 5047 us

Minimum Link Delay: 5.047 ms

Max Delay: 7047 us Flags: 0x00

Maximum Link Delay: 7.047 ms

Delay Variation: 499 us

Link Delay Variation: 0.499 ms

Link ID: Local:3 Remote:4

Local and remote interface identifiers: 3 and 4
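The B-flag interpretation of the two Adj-SIDs can be checked mechanically. The Python sketch below decodes the flag octets 0xe0 and 0x60 using the bit positions of the OSPF Adj-SID sub-TLV (B, V, L, G, P from the most significant bit); it is for illustration only.

# Bit positions of the OSPF Adj-SID sub-TLV flags octet (most significant bit first).
ADJ_SID_FLAGS = {0x80: "B", 0x40: "V", 0x20: "L", 0x10: "G", 0x08: "P"}

def decode_adj_sid_flags(flags: int):
    return [name for bit, name in ADJ_SID_FLAGS.items() if flags & bit]

for label, flags in ((24012, 0xE0), (24112, 0x60)):
    names = decode_adj_sid_flags(flags)
    protection = "protected" if "B" in names else "unprotected"
    print(f"Adj-SID {label}: flags 0x{flags:02x} -> {names} ({protection})")
# Adj-SID 24012: flags 0xe0 -> ['B', 'V', 'L'] (protected)
# Adj-SID 24112: flags 0x60 -> ['V', 'L'] (unprotected)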

The Link-state attribute contains information collected from the Router LSA (the adjacency), the TE Opaque LSA (the TE attributes), and the Extended Link LSA (the Adj-SIDs, MSD, and Link IDs). These LSAs are shown in Example 17-17, Example 17-18, and Example 17-19.

Example 17-17: OSPF Router LSA

RP/0/0/CPU0:xrvr-1#show ospf database router adv-router 1.1.1.1

            OSPF Router with ID (1.1.1.1) (Process ID 1)

                Router Link States (Area 0)

  LS age: 1786
  Options: (No TOS-capability, DC)
  LS Type: Router Links
  Link State ID: 1.1.1.1
  Advertising Router: 1.1.1.1
  LS Seq Number: 800000da
  Checksum: 0x5801
  Length: 60
   Number of Links: 3

    Link connected to: a Stub Network
     (Link ID) Network/subnet number: 1.1.1.1
     (Link Data) Network Mask: 255.255.255.255
      Number of TOS metrics: 0
       TOS 0 Metrics: 1

    Link connected to: another Router (point-to-point)
     (Link ID) Neighboring Router ID: 1.1.1.2
     (Link Data) Router Interface address: 99.1.2.1
      Number of TOS metrics: 0
       TOS 0 Metrics: 1

    Link connected to: a Stub Network
     (Link ID) Network/subnet number: 99.1.2.0
     (Link Data) Network Mask: 255.255.255.0
      Number of TOS metrics: 0
       TOS 0 Metrics: 1

Example 17-18: OSPF TE LSA – link

RP/0/0/CPU0:xrvr-1#show ospf database opaque-area 1.0.0.3 adv-router 1.1.1.1

            OSPF Router with ID (1.1.1.1) (Process ID 1)

                Type-10 Opaque Link Area Link States (Area 0)

  LS age: 1410
  Options: (No TOS-capability, DC)
  LS Type: Opaque Area Link
  Link State ID: 1.0.0.3
  Opaque Type: 1
  Opaque ID: 3
  Advertising Router: 1.1.1.1
  LS Seq Number: 800000db
  Checksum: 0x3402
  Length: 132

    Link connected to Point-to-Point network
      Link ID : 1.1.1.2
      (all bandwidths in bytes/sec)
      Interface Address : 99.1.2.1
      Neighbor Address : 99.1.2.2
      Admin Metric : 20
      Maximum bandwidth : 125000000
      Affinity Bit : 0x1
      IGP Metric : 1
      Extended Administrative Group : Length: 1
        EAG[0]: 0x1
      Unidir Link Delay : 5831 micro sec, Anomalous: no
      Unidir Link Min Delay : 5047 micro sec, Anomalous: no
      Unidir Link Max Delay : 7047 micro sec
      Unidir Link Delay Variance : 499 micro sec

    Number of Links : 1

Example 17-19: OSPF Extended Link LSA

RP/0/0/CPU0:xrvr-1#show ospf database opaque-area 8.0.0.3 adv-router 1.1.1.1

            OSPF Router with ID (1.1.1.1) (Process ID 1)

                Type-10 Opaque Link Area Link States (Area 0)

  LS age: 906
  Options: (No TOS-capability, DC)
  LS Type: Opaque Area Link
  Link State ID: 8.0.0.3
  Opaque Type: 8
  Opaque ID: 3
  Advertising Router: 1.1.1.1
  LS Seq Number: 800000da
  Checksum: 0x6b09
  Length: 112

    Extended Link TLV: Length: 88
      Link-type : 1
      Link ID   : 1.1.1.2
      Link Data : 99.1.2.1

      Adj sub-TLV: Length: 7
        Flags  : 0xe0
        MTID   : 0
        Weight : 0
        Label  : 24012

      Adj sub-TLV: Length: 7
        Flags  : 0x60
        MTID   : 0
        Weight : 0
        Label  : 24112

      Local-ID Remote-ID sub-TLV: Length: 8
        Local Interface ID: 3
        Remote Interface ID: 4

      Remote If Address sub-TLV: Length: 4
        Neighbor Address: 99.1.2.2

      Link MSD sub-TLV: Length: 2
        Type: 1, Value 10

17.7.3 Prefix NLRI

The Prefix NLRI for Node1’s loopback IPv4 prefix 1.1.1.1/32 is shown in Example 17-20. By using the detail keyword in the command, you get a breakdown of the NLRI fields. These are the fields in the Prefix NLRI string:

[T]

NLRI Type: Prefix

[O]

Protocol: OSPF

[I0x66]

Identifier: 0x66 = 102

[N ...]

Local Node Descriptor:

[c1]

AS Number: 1

[b0.0.0.0]

BGP Identifier: 0.0.0.0

[a0.0.0.0]

Area ID: 0.0.0.0

[r1.1.1.1]

Router ID IPv4: 1.1.1.1

[P ...]

Prefix Descriptor:

[o0x01]

OSPF Route Type: 0x01

[p1.1.1.1/32]

Prefix: 1.1.1.1/32

The prefix-length at the end of the NLRI string (/488) is the length of the NLRI in bits.

Example 17-20: BGP-LS Prefix NLRI of Node1’s loopback IPv4 prefix

RP/0/0/CPU0:xrvr-1#show bgp link-state link-state [T][O][I0x66][N[c1][b0.0.0.0][a0.0.0.0][r1.1.1.1]][P[o0x01]
[p1.1.1.1/32]]/488 detail
BGP routing table entry for [T][O][I0x66][N[c1][b0.0.0.0][a0.0.0.0][r1.1.1.1]][P[o0x01][p1.1.1.1/32]]/488
NLRI Type: Prefix
Protocol: OSPF
Identifier: 0x66
Local Node Descriptor:
  AS Number: 1
  BGP Identifier: 0.0.0.0
  Area ID: 0.0.0.0
  Router ID IPv4: 1.1.1.1
Prefix Descriptor:
  OSPF Route Type: 0x01
  Prefix: 1.1.1.1/32

Link-state: Metric: 1, PFX-SID: 1(0/0), Extended IGP flags: 0x40

The Link-state attribute is shown in the output, preceded by Link-state:. These are the elements in the Link-state attribute for the Prefix NLRI example:

Metric: 1

IGP metric: 1

PFX-SID: 1(0/0)

Prefix-SID index: 1; SPF (algo 0); flags 0x0 (NP:0, M:0, E:0, V:0, L:0)

Extended IGP flags: 0x40

Extended prefix attribute flags: 0x40 (A:0, N:1)

The Link-state attribute contains information collected from the Router LSA (the prefix) and from the Extended Prefix LSA (the Prefix-SID). These LSAs are shown in Example 17-17 and Example 17-21.

Example 17-21: OSPF Extended Prefix LSA

RP/0/0/CPU0:xrvr-1#show ospf database opaque-area 7.0.0.1 adv-router 1.1.1.1

            OSPF Router with ID (1.1.1.1) (Process ID 1)

                Type-10 Opaque Link Area Link States (Area 0)

  LS age: 128
  Options: (No TOS-capability, DC)
  LS Type: Opaque Area Link
  Link State ID: 7.0.0.1
  Opaque Type: 7
  Opaque ID: 1
  Advertising Router: 1.1.1.1
  LS Seq Number: 800000da
  Checksum: 0x328a
  Length: 44

    Extended Prefix TLV: Length: 20
      Route-type: 1
      AF        : 0
      Flags     : 0x40
      Prefix    : 1.1.1.1/32

      SID sub-TLV: Length: 8
        Flags : 0x0
        MTID  : 0
        Algo  : 0
        SID Index : 1
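Tying the pieces together: with the SRGB advertised in Example 17-13 (base 16000, range 8000) and the SID index 1 advertised in Example 17-21, the absolute Prefix-SID label that the routers install can be computed as in this small sketch.

# Worked example: absolute Prefix-SID label = SRGB base + SID index.
srgb_base, srgb_size = 16000, 8000     # SRGB 16000:8000 -> labels [16000-23999]
sid_index = 1                          # SID Index from the Extended Prefix LSA

assert sid_index < srgb_size, "the index must fall inside the SRGB"
prefix_sid_label = srgb_base + sid_index
print(prefix_sid_label)                # 16001, the label for 1.1.1.1/32 (algorithm 0)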

17.8 References

[RFC3209] "RSVP-TE: Extensions to RSVP for LSP Tunnels", Daniel O. Awduche, Lou Berger, Der-Hwa Gan, Tony Li, Dr. Vijay Srinivasan, George Swallow, RFC3209, December 2001

[RFC7752] "North-Bound Distribution of Link-State and Traffic Engineering (TE) Information Using BGP", Hannes Gredler, Jan Medved, Stefano Previdi, Adrian Farrel, Saikat Ray, RFC7752, March 2016

[RFC7471] "OSPF Traffic Engineering (TE) Metric Extensions", Spencer Giacalone, David Ward, John Drake, Alia Atlas, Stefano Previdi, RFC7471, March 2015

[RFC7810] "IS-IS Traffic Engineering (TE) Metric Extensions", Stefano Previdi, Spencer Giacalone, David Ward, John Drake, Qin Wu, RFC7810, May 2016

[RFC8571] "BGP - Link State (BGP-LS) Advertisement of IGP Traffic Engineering Performance Metric Extensions", Les Ginsberg, Stefano Previdi, Qin Wu, Jeff Tantsura, Clarence Filsfils, RFC8571, March 2019

[draft-ietf-idr-segment-routing-te-policy] "Advertising Segment Routing Policies in BGP", Stefano Previdi, Clarence Filsfils, Dhanendra Jain, Paul Mattes, Eric C. Rosen, Steven Lin, draft-ietf-idr-segment-routing-te-policy-05 (Work in Progress), November 2018

[draft-ietf-idr-bgp-ls-segment-routing-ext] "BGP Link-State extensions for Segment Routing", Stefano Previdi, Ketan Talaulikar, Clarence Filsfils, Hannes Gredler, Mach Chen, draft-ietf-idr-bgp-ls-segment-routing-ext-12 (Work in Progress), March 2019

[draft-ietf-idr-bgpls-segment-routing-epe] "BGP-LS extensions for Segment Routing BGP Egress Peer Engineering", Stefano Previdi, Ketan Talaulikar, Clarence Filsfils, Keyur Patel, Saikat Ray, Jie Dong, draft-ietf-idr-bgpls-segment-routing-epe-18 (Work in Progress), March 2019

[draft-ietf-idr-te-lsp-distribution] "Distribution of Traffic Engineering (TE) Policies and State using BGP-LS", Stefano Previdi, Ketan Talaulikar, Jie Dong, Mach Chen, Hannes Gredler, Jeff Tantsura, draft-ietf-idr-te-lsp-distribution-10 (Work in Progress), February 2019

[draft-ketant-idr-bgp-ls-bgp-only-fabric] "BGP Link-State Extensions for BGP-only Fabric", Ketan Talaulikar, Clarence Filsfils, krishnaswamy ananthamurthy, Shawn Zandi, Gaurav Dawra, Muhammad Durrani, draft-ketant-idr-bgp-ls-bgp-only-fabric-02 (Work in Progress), March 2019

[draft-dawra-idr-bgp-ls-sr-service-segments] "BGP-LS Advertisement of Segment Routing Service Segments", Gaurav Dawra, Clarence Filsfils, Daniel Bernier, Jim Uttaro, Bruno Decraene, Hani Elmalky, Xiaohu Xu, Francois Clad, Ketan Talaulikar, draft-dawra-idr-bgp-ls-sr-service-segments-01 (Work in Progress), January 2019

1. ISIS uses the term Link-State PDU (LSP), OSPF uses the term Link-State Advertisement (LSA). Do not confuse the ISIS LSP with the MPLS term Label Switched Path (LSP).↩
2. Note that this example (enabling multiple ISIS instances on an interface) cannot be configured in IOS XR, since a given non-loopback interface can only belong to one ISIS instance.↩
3. Although the AS number is configurable on IOS XR for backward compatibility reasons, we recommend to leave it to the default value.↩
4. Although the BGP-LS identifier is configurable on IOS XR for backward compatibility reasons, we recommend to leave it to the default value (0).↩
5. The 32-bit BGP-LS Identifier field is not related to the 64-bit BGP-LS NLRI Identifier field; the latter is also known as “instance-id”. It is just an unfortunate clash of terminology.↩
6. https://www.iana.org/assignments/ospf-opaque-types/ospf-opaque-types.xhtml↩

18 PCEP

18.1 Introduction

A Path Computation Element (PCE) is “… an entity that is capable of computing a network path or route based on a network graph, and of applying computational constraints during the computation.” (RFC 4655). A Path Computation Client (PCC) is the entity using the services of a PCE. For a PCC and a PCE, or two PCEs, to communicate, the Path Computation Element Communication Protocol (PCEP) has been introduced (RFC 5440).

At the time PCEP was specified, the only existing TE solution was the classic RSVP-TE protocol. Therefore, many aspects of PCEP are geared towards its application to RSVP-TE tunnels, such as the re-use of existing RSVP-TE objects. Since its introduction, PCEP has evolved and been extended. For example, support for SR-TE has been added (draft-ietf-pce-segment-routing).

Since this book is about SR-TE, the focus of this PCEP chapter is on the PCEP functionality that is applicable to SR-TE. This implies that various PCEP packet types, packet fields, flags, etc. are not described. Please refer to the relevant IETF specification for any element that is not covered in this chapter. This book only describes PCEP message exchanges with an Active Stateful PCE that is SR-capable.

18.1.1 Short PCEP History

The initial PCEP specification in RFC 5440 only provided stateless PCE/PCC support using a PCEP Request/Reply protocol exchange. In a client/server relation, the PCC requests the PCE to compute a path; the PCE computes the path and returns the result. The PCE computes the path using an up-to-date topology database. “Stateless” means that the PCE does not keep track of the computed path or the other paths in the network.

RFC 8231 adds stateful PCE/PCC capabilities to PCEP. This RFC specifies the protocol mechanism for a (stateful) PCE to learn the SR Policy paths from a PCC. This enables the PCE to keep track of the SR Policy paths in the network and take them into account during the computation. This RFC also specifies the PCEP mechanism for a PCC to delegate control of an SR Policy path to a PCE, an Active Stateful PCE in this case. The Active Stateful PCE not only computes and learns SR Policy paths, it can also take control of SR Policy paths and update these paths.

RFC 8281 further extends the stateful PCEP capabilities to support PCE initiation of SR Policies on a PCC.

IETF draft-ietf-pce-segment-routing extends PCEP to support SR Policies. It mainly defines how an SR Policy path is specified in PCEP.

PCEP has been further extended with SR-TE support. Some examples: RFC 8408 specifies how to indicate the type of path, since PCEP can be used for both SR-TE and RSVP-TE; draft-sivabalan-pce-binding-label-sid specifies how PCEP conveys the Binding-SID of an SR Policy path.

18.2 PCEP Session Setup and Maintenance

A PCEP session can be set up between a PCC and a PCE or between two PCEs. A PCC may have PCEP sessions with multiple PCEs, and a PCE may have PCEP sessions with multiple PCCs.

PCEP is a TCP-based protocol. A PCC establishes a TCP session to a PCE on TCP port 4189. The PCC always initiates the PCEP connection.

After the TCP session is established, PCC and PCE initialize the PCEP session by exchanging the session parameters using Open messages, as illustrated in Figure 18-1. The Open message specifies the Keepalive interval that the node will use and a recommended Dead timer interval for the peer to use. It also contains the capabilities of the node. The Open message is acknowledged by a Keepalive message.

Figure 18-1: PCEP session initialization

Two timers are used to maintain the PCEP session: the Keepalive timer and the Dead timer. These two timers and Keepalive messages are used to verify the liveness of the PCEP session. The default Keepalive timer is 30 seconds, the default Dead timer is 120 seconds.

Every time a node sends a PCEP message it restarts the Keepalive timer of the session. When the Keepalive timer expires, the node sends a Keepalive message. This mechanism ensures that a PCEP message is sent at least every Keepalive interval, without sending unnecessary Keepalive messages.

Every time a node receives a PCEP message it restarts the Dead timer. When the Dead timer expires, the node tears down the PCEP session.

PCC and PCE send their Keepalive messages independently; Keepalive messages are not responded to. The Keepalive timer can be different on the two peers.
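A minimal sketch of this timer logic is shown below. The send_keepalive() and close_session() hooks are illustrative placeholders, not an actual PCEP implementation API; the timer values are the ones negotiated in the Open exchange.

import time

class PcepTimers:
    def __init__(self, keepalive=30, dead=120):
        self.keepalive, self.dead = keepalive, dead
        now = time.monotonic()
        self.last_sent = now        # restarted on every PCEP message we send
        self.last_received = now    # restarted on every PCEP message we receive

    def on_message_sent(self):
        self.last_sent = time.monotonic()

    def on_message_received(self):
        self.last_received = time.monotonic()

    def poll(self, send_keepalive, close_session):
        now = time.monotonic()
        if now - self.last_received > self.dead:
            close_session()         # Dead timer expired: tear down the session
        elif now - self.last_sent >= self.keepalive:
            send_keepalive()        # nothing sent for a Keepalive interval
            self.on_message_sent()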

18.2.1 SR Policy State Synchronization

After the session is initialized, and if both PCE and PCC have the stateful PCEP capability, the PCC synchronizes its local SR Policies’ states to the PCE using PCEP State Report (PCRpt) messages. This exchange is illustrated in Figure 18-2. The format of the PCRpt message is described further in section 18.4.6 of this chapter.

During the state synchronization, the PCC reports its local SR Policies using PCRpt messages with the Sync-flag set. The end of the synchronization is indicated by an empty Report message with the Sync-flag unset.

Figure 18-2: SR Policy initial state synchronization

After the initial SR Policy state synchronization, the PCC sends a PCRpt message whenever the state of any of its local SR Policies changes. As illustrated in Figure 18‑3, following a state change of a local SR Policy path, the PCC sends a PCRpt message to all its connected stateful PCEs. This way the SR Policy databases on all these connected PCEs stay synchronized with the PCC’s SR Policy database.

Figure 18-3: SR Policy state synchronization

18.3 SR Policy Path Setup and Maintenance

In this book we focus on the use of an Active Stateful PCE. An Active Stateful PCE not only computes an SR Policy path and keeps track of the path, it also takes the responsibility to maintain the path and update it when required. The IOS XR SR PCE is an example of an Active Stateful PCE.

The PCC delegates control of a path to an Active Stateful PCE by sending a state Report message for that path with the Delegate flag set.

An IOS XR PCC automatically delegates control of a PCE-computed path to the PCE that computed this path. This makes sense: if the PCC relies on a PCE to compute the path, it will also rely on the PCE to maintain this path. This includes configured (CLI, NETCONF) SR Policy paths and ODN-initiated SR Policy paths. An IOS XR PCC also delegates control of an SR Policy path that a PCE initiates on the PCC. In that case the PCC delegates control to the PCE that initiated the path.

An IOS XR headend does not delegate control of a locally computed path to a PCE, nor the control of an explicit path.

The following sections describe the different PCEP protocol exchanges and message formats used to initiate an SR Policy path that is delegated to a PCE and to maintain this path using PCEP. Two cases can be distinguished: PCC-initiated paths and PCE-initiated paths.

18.3.1 PCC-Initiated SR Policy Path

The PCC initiates the SR Policy path by configuration (CLI), via NETCONF, or by automatically instantiating the path on demand following a service protocol trigger (ODN). The PCEP protocol exchange between PCC and PCE is the same in all these cases.

Two protocol exchange variants exist for the initial setup of the path.

In the first variant (Request/Reply/Report), the protocol exchange starts as a stateless Request/Reply message exchange between PCC and PCE. The PCC installs the path and sends a Report message to the PCE in which the PCC delegates control of the path to the PCE.

In the second variant (Report/Update/Report), the PCC immediately delegates the (empty) path to the PCE. The PCE accepts control of the path (by not rejecting it), computes the path and sends the computed path in an Update message to the PCC. The PCC installs the path and sends a Report message to the PCE.

While SR PCE supports both variants, an IOS XR PCC uses the second variant.

At the end of both variants of the initiation sequence, the PCC has installed the path and has delegated control of the path to the SR PCE.

Request/Reply/Report

The Request/Reply/Report variant is illustrated in Figure 18‑4.

Figure 18-4: PCC-initiated SR Policy initial message exchange variant 1

The message sequence is as follows:

1. The operator configures an SR Policy on the headend (PCC) with an SR PCE-computed dynamic path. Or the headend automatically instantiates an SR Policy path using ODN. The headend assigns a unique PLSP-ID (PCEP-specific LSP ID) to the SR Policy path. The PLSP-ID uniquely identifies the SR Policy path on a PCEP session and remains constant during its lifetime.
2. The headend sends a PCReq message to the PCE and includes the path computation parameters in an ENDPOINT object, a METRIC object, and an LSPA (LSP Attributes) object. The headend also includes an LSP object with this path’s PLSP-ID.

3. The PCE computes the path.
4. The PCE returns the path in a PCRep message with the computed path in an ERO (Explicit Route Object). It also includes an LSP object with the unique PLSP-ID and a METRIC object with the computed metric value.
5. The headend installs the path.
6. The headend sends a PCRpt message with the state of the path. It includes an ERO, a METRIC object and an LSPA object. It also includes an LSP object with the unique PLSP-ID. The headend sets the Delegate flag in the LSP object to delegate control of this path to the SR PCE.

Report/Update/Report

The Report/Update/Report variant is illustrated in Figure 18‑5.

Figure 18-5: PCC-initiated SR Policy initial message exchange variant 2

This variant’s message sequence is as follows:

1. The operator configures an SR Policy on the headend (PCC) with an SR PCE-computed dynamic path. Or the headend automatically instantiates an SR Policy path using ODN. The headend assigns a unique PLSP-ID (PCEP-specific LSP ID) to the SR Policy. The PLSP-ID uniquely identifies the SR Policy on a PCEP session and remains constant during its lifetime.

2. The headend sends a PCRpt message to the PCE with an empty ERO. It includes a METRIC object and an LSPA object. It also includes an LSP object with the unique PLSP-ID. The headend sets the Delegate flag in the LSP object to delegate control of this path to the SR PCE.
3. The SR PCE accepts the delegation (by not rejecting it), finds that the path must be updated and computes a new path.
4. The PCE returns the new path in a PCUpd message to the headend. It includes an ERO with the computed path, an LSP object with the unique PLSP-ID assigned by the headend, and an SRP (Stateful Request Parameters) object with a unique SRP-ID number to track error and state report messages received as a response to this Update message.
5. The headend installs the path.
6. The headend sends a PCRpt message with the state of the new path. It includes an ERO, a METRIC object and an LSPA object. It also includes an LSP object with the unique PLSP-ID. The headend sets the Delegate flag in the LSP object to delegate control of this path to the SR PCE.

18.3.2 PCE-Initiated SR Policy Path

We call this case the PCE-initiated path since, from a PCEP perspective, the PCE takes the initiative to set up this path. In reality it will be an application (or orchestrator, or administrator) that requests a PCE to initiate an SR Policy path on a given headend. This application uses one of the PCE’s northbound interfaces, such as REST, NETCONF, or even CLI, for this purpose.

Figure 18-6: PCE-initiated SR Policy path message exchange

The message sequence as illustrated in Figure 18-6 is as follows:

1. An application requests SR PCE to initiate an SR Policy candidate path on a headend.
2. The application can provide the path it computed itself, or it can request the SR PCE to compute the path. SR PCE sends a PCEP Initiate (PCInit) message to the headend with an ERO encoding the path. It includes an LSP object with the PLSP-ID set to 0 as well as an SRP object with a unique SRP-ID number to track error and state report messages received as a response to this Initiate message.
3. The headend initiates the path and allocates a PLSP-ID for it.
4. The headend sends a PCRpt message with the state of the path. It includes an ERO, a METRIC object and an LSPA object. It also includes an LSP object with the unique PLSP-ID. The headend sets the Delegate flag in the LSP object to delegate control of this path to the SR PCE. It also sets the Create flag in the LSP object to indicate it is a PCE-initiated path.

18.3.3 PCE Updates SR Policy Path

When the headend has delegated the control of a path to the PCE, the PCE is in full control: it maintains the path and updates it when required, for example following a topology change or upon request of an application. The same message exchange, illustrated in Figure 18-7, is used to update the path, regardless of how the delegated path was initially set up.

Figure 18-7: SR Policy path update message exchange

The message sequence is as follows:

1. Following a topology change event, the SR PCE re-computes the delegated path. Or an application requests to update the path of a given SR Policy.
2. The PCE sends the updated path in a PCUpd message to the headend. It includes an ERO encoding the path, an LSP object with the unique PLSP-ID assigned by the headend, as well as an SRP object with a unique SRP-ID number to track error and state report messages received as a response to this Update message. The Delegate flag is set in the LSP object to indicate that the path is delegated.
3. The headend updates the path.
4. The headend sends a PCRpt message with the state of the new path. It includes an ERO, a METRIC object and an LSPA object. It also includes an LSP object with the unique PLSP-ID. The headend sets the Delegate flag in the LSP object to delegate control of this path to the SR PCE.
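The SRP-ID number in step 2 is what lets the PCE match a later PCRpt (or PCErr) to the PCUpd that triggered it. The sketch below models that bookkeeping on the PCE side; the class and method names are illustrative, not the SR PCE implementation.

class PendingUpdates:
    """Correlate PCUpd messages with the PCRpt/PCErr they trigger, via SRP-ID."""

    def __init__(self):
        self._next_srp_id = 1
        self._pending = {}                      # SRP-ID -> (PLSP-ID, description)

    def record_update(self, plsp_id: int, description: str) -> int:
        srp_id = self._next_srp_id              # the SRP-ID increases with every request
        self._next_srp_id += 1
        self._pending[srp_id] = (plsp_id, description)
        return srp_id                           # value to place in the SRP object

    def on_report(self, srp_id: int, oper_state: int):
        plsp_id, description = self._pending.pop(srp_id, (None, "unknown"))
        print(f"PCRpt for SRP-ID {srp_id}: PLSP-ID {plsp_id} ({description}), "
              f"operational state {oper_state}")

pending = PendingUpdates()
srp = pending.record_update(plsp_id=524334, description="re-optimized path")
pending.on_report(srp, oper_state=2)            # 2 = ACTIVE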

18.4 PCEP Messages

Different types of PCEP messages exist. The main ones have been introduced in the previous sections of this chapter:

Open – start a PCEP session, including capability negotiation
Close – end a PCEP session
Keepalive – maintain the PCEP session active
Error (PCErr) – error message sent when a protocol error occurs or when a received message is not compliant with the specification
Request (PCReq) – sent by PCC to PCE to request path computation
Reply (PCRep) – sent by PCE to PCC in response to a Request
Report (PCRpt) – sent by PCC for various purposes:
  Initial state synchronization – after initializing the PCEP session, the PCC uses Report messages to synchronize its local SR Policy path statuses to the PCE
  Path status report – the PCC sends Report messages to all its connected PCEs whenever the state of a local SR Policy path changes
  Delegation control – the PCC delegates control of a local SR Policy path to the PCE by setting the Delegate flag in the path’s Report message
Update (PCUpd) – sent by PCE to request the PCC to update an SR Policy path
Initiate (PCInit) – sent by PCE to request the PCC to initiate an SR Policy path

PCEP Header

Each PCEP message carries information organized as “Objects”, depending on the type of the message. All PCEP messages have a common header followed by the “Objects”.

The format of the common PCEP header is shown in Figure 18‑8.

Figure 18-8: Common PCEP Header format

Where:

Version: 1
Flags: no flags are currently defined
Message-Type:
  1. Open
  2. Keepalive
  3. Path Computation Request (PCReq)
  4. Path Computation Reply (PCRep)
  5. Notification (PCNtf)
  6. Error (PCErr)
  7. Close
  8. Path Computation Monitoring Request (PCMonReq)
  9. Path Computation Monitoring Reply (PCMonRep)
  10. Path Computation State Report (PCRpt)
  11. Path Computation Update Request (PCUpd)
  12. Path Computation LSP Initiate Message (PCInit)

Object Header

The Objects in a PCEP message have various formats and can contain sub-Objects and TLVs, depending on the type of Object. The header of an Object has a fixed format that is shown in Figure 18‑9.

Figure 18-9: Common Object Header format

Where:

Object-Class: identifies the PCEP object class
Object-Type (OT): identifies the PCEP object type – the pair (Object-Class, Object-Type) uniquely identifies each PCEP object
Flags:
  Res (Reserved)
  Processing-Rule flag (P-flag): used in Request (PCReq) messages. If set, the object must be taken into account during path computation. If unset, the object is optional.
  Ignore flag (I-flag): used in Reply (PCRep) messages. Set if the optional object was ignored, unset if the optional object was processed.
Object body: contains the different elements of the object
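Both headers are fixed 4-byte structures, so they are easy to generate programmatically. The Python sketch below packs them with the struct module following the bit layouts of Figure 18-8 and Figure 18-9; it is a teaching aid, not production PCEP code.

import struct

def pack_common_header(msg_type: int, msg_length: int, version: int = 1) -> bytes:
    # Ver (3 bits) | Flags (5 bits, none defined) | Message-Type | Message-Length
    return struct.pack("!BBH", version << 5, msg_type, msg_length)

def pack_object_header(obj_class: int, obj_type: int, obj_length: int,
                       p_flag: bool = False, i_flag: bool = False) -> bytes:
    # Object-Class | OT (4 bits) Res (2 bits) P I | Object Length (incl. header)
    second_octet = (obj_type << 4) | (int(p_flag) << 1) | int(i_flag)
    return struct.pack("!BBH", obj_class, second_octet, obj_length)

# A Keepalive message is just the common header with message type 2 and length 4:
print(pack_common_header(msg_type=2, msg_length=4).hex())   # '20020004'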

18.4.1 PCEP Open Message

A PCEP Open Message contains an Open Object with the format as shown in Figure 18-10. The Object is preceded by the common PCEP header and the common Object header (not shown).

Figure 18-10: PCEP Open Object format

The PCEP message is of type “Open”. The Open Object is of Object-Class 1, Object-Type 1.

The fields in the Open Object body are:

Version: 1
Flags: none defined
Keepalive: maximum period of time (in seconds) between two consecutive PCEP messages sent by the sender of this Open message
Deadtimer: the Dead timer that the sender of this Open message recommends its peer to use for this session
Session ID (SID): identifies the current PCEP session for logging and troubleshooting purposes; can be different in each direction

The Open Object can contain TLVs:

Stateful Capability TLV
Segment Routing Capability TLV

The format of the Stateful Capability TLV is shown in Figure 18-11. If this TLV is present, then the device (PCE or PCC) is stateful PCEP capable.

Figure 18-11: Stateful Capability TLV format

The fields in this TLV are:

Flags (only showing the supported ones):
  Update (U): PCE can update TE Policies, PCC accepts updates from the PCE
  Synch (S): PCC includes the LSP-DB version number when the paths are reported to the PCE
  Initiate (I): PCE can initiate LSPs, PCC accepts PCE-initiated TE Policies
  Other flags are being defined

The format of the Segment Routing Capability TLV is shown in Figure 18-12. If this TLV is present, then the device (PCE or PCC) is Segment Routing capable.

Figure 18-12: Segment Routing Capability TLV format

The fields in this TLV are:

Flags:
  L-flag (no MSD Limit): set by a PCC that does not impose any limit on the MSD
  N-flag (NAI resolution capable): set by a PCC that can resolve a Node or Adjacency Identifier (NAI) to a SID
Maximum SID Depth (MSD): specifies the maximum size of the segment list (label stack) that this node can impose

Example

Example 18-1 shows a packet capture of a PCEP Open message. The PCC sending this message advertises a keepalive interval of 30 seconds and a dead-timer of 120 seconds. The PCC is stateful and supports the update and initiation functionality. It also supports the SR PCEP extensions and the MSD is 10 labels.

Example 18-1: Packet capture of a PCEP Open message

Path Computation Element communication Protocol
    OPEN MESSAGE (1), length: 28
    OPEN OBJECT
        Object Class: OPEN OBJECT (1)
        Object Flags: 0x0
        Object Value:
            PCEP version: 1
            Open Object Flags: 0x0
            Keepalive: 30
            Deadtime: 120
            Session ID (SID): 4
            TLVs:
                STATEFUL-PCE-CAPABILITY TLV (16), Type: 16, Length: 4
                    Stateful Capabilities Flags: 0x00000005
                    ..0. .... = Triggered Initial Sync (F): Not Set
                    ...0 .... = Delta LSP Sync Capability (D): Not Set
                    .... 0... = Triggered Resync (T): Not Set
                    .... .1.. = Instantiation (I): Set
                    .... ..0. = Include Db Version (S): Not Set
                    .... ...1 = Update (U): Set
                SR-PCE-CAPABILITY TLV (26), Type: 26, Length: 4
                    Reserved: 0x0000
                    SR Capability Flags: 0x00
                    Maximum SID Depth (MSD): 10
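The capability flags of Example 18-1 can be interpreted with a few bit tests. The sketch below uses the bit values of the Stateful PCE Capability TLV flags (U, S, I) described above; it is illustrative only.

# Interpret the capability TLV values seen in Example 18-1.
stateful_flags = 0x00000005            # STATEFUL-PCE-CAPABILITY TLV flags
U_FLAG, S_FLAG, I_FLAG = 0x1, 0x2, 0x4

print("Update   (U):", bool(stateful_flags & U_FLAG))   # True: PCE may update paths
print("Incl. DB (S):", bool(stateful_flags & S_FLAG))   # False
print("Initiate (I):", bool(stateful_flags & I_FLAG))   # True: PCE may initiate paths

msd = 10                               # MSD octet of the SR-PCE-CAPABILITY TLV
print("Maximum SID depth:", msd)       # the PCC can impose at most 10 labels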

18.4.2 PCEP Close Message

A PCEP Close Message contains a Close Object with the format as shown in Figure 18-13.

Figure 18-13: PCEP Close Object format

A Close Object has the following fields:

Flags: no flags defined yet
Reason: specifies the reason for closing the PCEP session; setting this field is optional

18.4.3 PCEP Keepalive Message

A PCEP Keepalive message consists of only the common PCEP header, with message type “Keepalive”.

18.4.4 PCEP Request Message

A PCEP Request message must contain at least the following objects:

Request Parameters (RP) Object: administrative info about the Request
End-points Object: source and destination of the SR Policy

A PCEP Request message may also contain (non-exhaustive list):

LSP Attributes (LSPA) Object: various LSP attributes (such as affinity, priority, protection desired)
Metric Object: metric type to optimize, and max-metric bound to set

RP Object

The Request Parameters Object specifies various characteristics of the path computation request.

Figure 18-14: PCEP RP Object format

Flags: various flags have been defined – Priority, Reoptimization, Bi-directional, strict/loose – please refer to the different IETF PCEP specifications
Request-ID-number: ID number to identify the Request, to match with the Reply
Optional TLVs:
  Path Setup Type TLV (draft-ietf-pce-lsp-setup-type): specifies which type of path to set up – 0 or TLV not present: RSVP-TE; 1: Segment Routing

Endpoints Object

Source and destination IP addresses of the path for which a path computation is requested. Two types exist, one for IPv4 and one for IPv6. Both have an equivalent format.

Figure 18-15: PCEP Endpoints Object format

LSPA Object

The LSP Attributes (LSPA) Object contains various path attributes to be taken into account during path computation.

Figure 18-16: PCEP LSPA Object format

The fields in the LSPA Object are:

Affinity bitmaps:
  Exclude-any: exclude links that have any of the indicated “colors”. Set to 0 to exclude no links
  Include-any: only use links that have any of the indicated “colors”. Set to 0 to accept all links
  Include-all: only use links that have all of the indicated “colors”. Set to 0 to accept all links
  Note: only “standard” 32-bit color bitmaps are supported in this object
Priorities (Setup and Holding): used in tunnel preemption
Flags:
  L-flag (Local Protection Desired): when set, the computed path must include protected segments
Optional TLVs

Metric Object

The Metric object specifies which metric must be optimized in the path computation and which metric bounds must be applied to the path computation.

Figure 18-17: PCEP Metric Object format

The fields in the Metric Object are:

Flags:
  B-flag (Bound): if unset, the Metric Type is the optimization objective; if set, the Metric Value is a bound of the Metric Type, i.e., the resulting path must have a cumulative metric ≤ Metric Value
  C-flag (Computed Metric): if set, then the PCE must return the computed path metric value
Metric Type: IGP metric, TE metric, Hop Count, path delay (RFC 8233), …
Metric Value: path metric value

Example

Example 18‑2 shows a packet capture of a PCEP Path Computation Request message. The PCC requests the computation of an SR Policy path from 1.1.1.1 (headend) to 1.1.1.4 (endpoint), optimizing TE metric without constraints. The path must be encoded using protected SIDs. This message’s Request-ID-Number is 2. See Example 18‑3 in the next section for the matching Reply message.

Example 18-2: Packet capture of a PCEP Path Computation Request message

14:54:41.769244000: 1.1.1.1:46605 --> 1.1.1.10:4189
Path Computation Element communication Protocol
    PATH COMPUTATION REQUEST MESSAGE (3), length: 92
    Request Parameters (RP) OBJECT
        Object Class: Request Parameters (RP) OBJECT (2)
        Object Type: 1, Length: 20
        Object Flags: 0x2
            00.. = Reserved Flags: Not Set
            ..1. = Processing-Rule (P): Set
            ...0 = Ignore (I): Not Set
        Object Value:
            RP Object Flags: 0x80
                .... .... .... .... .... .... ..0. .... = Strict/Loose (O): Not Set
                .... .... .... .... .... .... ...0 .... = Bi-directional (B): Not Set
                .... .... .... .... .... .... .... 0... = Reoptimization (R): Not Set
                .... .... .... .... .... .... .... .000 = Priority (Pri): 0
            Request-ID-Number: 0x00000002 (2)
            TLVs:
                PATH-SETUP-TYPE TLV (28), Type: 28, Length: 4
                    Reserved: 0x000000
                    Path Setup Type (PST): 1 (Segment Routing)
    END-POINT OBJECT
        Object Class: END-POINT OBJECT (4)
        Object Type: 1, Length: 12
        Object Flags: 0x0
        Object Value:
            Source IPv4 address: 1.1.1.1
            Destination IPv4 address: 1.1.1.4
    LSP Attributes (LSPA) OBJECT
        Object Class: LSP Attributes (LSPA) OBJECT (9)
        Object Type: 1, Length: 20
        Object Flags: 0x0
        Object Value:
            Exclude-Any: 0x00000000
            Include-Any: 0x00000000
            Include-All: 0x00000000
            Setup Priority: 7
            Holding Priority: 7
            LSPA Object Flags: 0x01
                .... ...1 = Local Protection Desired (L): Set
            Reserved: 0x00
    METRIC OBJECT
        Object Class: METRIC OBJECT (6)
        Object Type: 1, Length: 12
        Object Flags: 0x0
        Object Value:
            Reserved: 0x0000
            Metric Object Flags: 0x00
                .... ..0. = Computed Metric (C): Not Set
                .... ..0. = Bound (B): Not Set
            Metric Type: 2 (TE metric)
            Metric Value: 0
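To make the object layouts concrete, the sketch below builds the END-POINT and METRIC object bytes of a request equivalent to Example 18-2 (IPv4 endpoints 1.1.1.1 → 1.1.1.4, optimizing the TE metric). It is a simplified, illustrative encoder that follows the figures in this section, not a complete PCEP stack.

import socket
import struct

def endpoint_ipv4(src: str, dst: str) -> bytes:
    # Object-Class 4, Object-Type 1, P-flag set, total length 12 octets
    header = struct.pack("!BBH", 4, (1 << 4) | 0x2, 12)
    return header + socket.inet_aton(src) + socket.inet_aton(dst)

def metric(metric_type: int, value: float = 0.0,
           bound: bool = False, computed: bool = False) -> bytes:
    # Object-Class 6, Object-Type 1, total length 12 octets
    header = struct.pack("!BBH", 6, 1 << 4, 12)
    flags = (int(computed) << 1) | int(bound)        # C-flag, B-flag
    return header + struct.pack("!HBBf", 0, flags, metric_type, value)

print(endpoint_ipv4("1.1.1.1", "1.1.1.4").hex())
print(metric(metric_type=2).hex())     # TE metric (2) as the optimization objective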

18.4.5 PCEP Reply Message

A PCEP Reply message must contain at least the following objects:

Request Parameters (RP) Object: administrative info about the Request
If the computation is successful: Explicit Route Object (ERO): path of the LSP
If no path is found: NO-PATH Object

A PCEP Reply message may also contain (non-exhaustive list):

Metric Object: metric of the path

RP Object

See section 18.4.4.

ERO Object

An Explicit Route Object (ERO) describes the sequence of elements (segments, links, nodes, ASs, …) that the path traverses. The explicit route is encoded as a series of sub-objects contained in the ERO object. For an SR Policy path, the Sub-Objects are of type “Segment Routing” (SR-ERO). The format of the ERO and ERO Sub-Object are shown in Figure 18‑18 and Figure 18‑19 respectively.

Figure 18-18: PCEP ERO format

Figure 18-19: PCEP ERO sub-object format

The format of the SR-ERO is shown in Figure 18-20. Each SR-ERO Sub-Object represents a segment in the segment list. Each segment is specified as a SID and/or a segment-descriptor, called “NAI” (Node or Adjacency Identifier) in the SR PCEP specification.

Figure 18-20: PCEP SR-ERO Sub-Object format

The fields in the SR-ERO Sub-Object are:

L-Flag: Loose; if set, this is a loose hop in the LSP, which the PCC may expand
NT: NAI (Node or Adjacency Identifier) Type:
  0 – NAI is absent
  1 – IPv4 node ID
  2 – IPv6 node ID
  3 – IPv4 adjacency
  4 – IPv6 adjacency
  5 – unnumbered adjacency with IPv4 node IDs
Flags:
  F-flag: if set, NAI == Null
  S-flag: if set, SID == Null, PCC chooses SID
  C-flag: if set (with M-flag set), the SID field is a full MPLS label (incl. TC, S, and TTL); if unset (with M-flag set), the PCC must ignore the TC, S, and TTL fields of the MPLS label in the SID field
  M-flag: if set, the SID field is an MPLS label value; if unset, the SID field is an index into a label range
SID: Segment Identifier
NAI: Node or Adjacency Identifier (also known as SID descriptor), format depends on the NAI Type (NT) field value:
  If NAI type = 1 (IPv4 Node ID): IPv4 address of the Node ID
  If NAI type = 2 (IPv6 Node ID): IPv6 address of the Node ID
  If NAI type = 3 (IPv4 Adjacency): pair of IPv4 addresses: Local IPv4 address and Remote IPv4 address
  If NAI type = 4 (IPv6 Adjacency): pair of IPv6 addresses: Local IPv6 address and Remote IPv6 address
  If NAI type = 5 (unnumbered adjacency with IPv4 node IDs): pair of (Node ID, Interface ID) tuples

NO-PATH Object

The NO-PATH object is included in the PCEP Reply message if the path computation request cannot be satisfied. It may contain a reason for the computation failure.

Metric Object

See section 18.4.4.

Example

Example 18-3 shows a packet capture of a Path Computation Reply message. This Reply message is a response to the Request message in Example 18-2 of the previous section. The PCE computed an SR Policy path encoded in a SID list with two SIDs: <16003, 24034>. 16003 is the Prefix-SID of Node3 (1.1.1.3), 24034 is an Adj-SID of the link between Node3 and Node4. The cumulative TE metric of this path is 30.

Example 18-3: Packet capture of a PCEP Path Computation Reply message

14:53:20.691651000: 1.1.1.10:4189 --> 1.1.1.1:55512
Path Computation Element communication Protocol
    PATH COMPUTATION REPLY MESSAGE (4), length: 68
    Request Parameters (RP) OBJECT
        Object Class: Request Parameters (RP) OBJECT (2)
        Object Type: 1, Length: 20
        Object Flags: 0x2
            00.. = Reserved Flags: Not Set
            ..1. = Processing-Rule (P): Set
            ...0 = Ignore (I): Not Set
        Object Value:
            RP Object Flags: 0x80
                .... .... .... .... .... .... ..0. .... = Strict/Loose (O): Not Set
                .... .... .... .... .... .... ...0 .... = Bi-directional (B): Not Set
                .... .... .... .... .... .... .... 0... = Reoptimization (R): Not Set
                .... .... .... .... .... .... .... .000 = Priority (Pri): 0
            Request-ID-Number: 0x00000002 (2)
            TLVs:
                PATH-SETUP-TYPE TLV (28), Type: 28, Length: 4
                    Reserved: 0x000000
                    Path Setup Type (PST): 1 (Segment Routing)
    ERO OBJECT
        Object Class: ERO OBJECT (7)
        Object Type: 1, Length: 32
        Object Flags: 0x0
        Object Value:
            SR-ERO SUBOBJECT
                Loose Flag = 0x0: Not Set
                Sub Object Type: 36, Length: 12
                SID Type: 1 (IPv4 Node ID)
                SR-ERO Sub-Object Flags: 0x001
                    .... .... 0... = NAI==Null (F): Not Set
                    .... .... .0.. = SID==Null, PCC chooses SID (S): Not Set
                    .... .... ..0. = Complete MPLS label (C): Not Set
                    .... .... ...1 = MPLS label (M): Set
                SID: 0x03e83000
                SID MPLS Label: 16003
                IPv4 Node ID: 1.1.1.3
            SR-ERO SUBOBJECT
                Loose Flag = 0x0: Not Set
                Sub Object Type: 36, Length: 16
                SID Type: 3 (IPv4 Adjacency)
                SR-ERO Sub-Object Flags: 0x001
                    .... .... 0... = NAI==Null (F): Not Set
                    .... .... .0.. = SID==Null, PCC chooses SID (S): Not Set
                    .... .... ..0. = Complete MPLS label (C): Not Set
                    .... .... ...1 = MPLS label (M): Set
                SID: 0x05de2000
                SID MPLS Label: 24034
                Local IPv4 Address: 99.3.4.3
                Remote IPv4 Address: 99.3.4.4
    METRIC OBJECT
        Object Class: METRIC OBJECT (6)
        Object Type: 1, Length: 12
        Object Flags: 0x0
        Object Value:
            Reserved: 0x0000
            Metric Object Flags: 0x00
                .... ..0. = Computed Metric (C): Not Set
                .... ..0. = Bound (B): Not Set
            Metric Type: 2 (TE metric)
            Metric Value: 30
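Note how the SID field relates to the MPLS label in Example 18-3: with the M-flag set and the C-flag unset, the label occupies the 20 most significant bits of the 32-bit SID field. The small decode below shows the arithmetic; it is illustrative only.

def sr_ero_sid_to_label(sid_field: int) -> int:
    # With M=1 and C=0, the label is in the top 20 bits; TC (3), S (1), TTL (8) are ignored.
    return sid_field >> 12

for sid_field in (0x03E83000, 0x05DE2000):
    print(hex(sid_field), "->", sr_ero_sid_to_label(sid_field))
# 0x3e83000 -> 16003 (Prefix-SID of Node3)
# 0x5de2000 -> 24034 (Adj-SID of the link Node3->Node4)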

18.4.6 PCEP Report Message

A PCEP Report message contains at least the following objects:

LSP Object: to identify the LSP
Explicit Route Object (ERO): path of the LSP

A PCEP Report message may also contain (non-exhaustive list):

Stateful PCE Request Parameters (SRP) Object: to correlate messages (e.g., PCUpd and PCRpt); mandatory if the PCRpt is the result of a PCUpd
LSP Attributes (LSPA) Object: various LSP attributes (such as affinity, priority)
Metric Object: metric of the calculated path

LSP Object

The LSP object identifies the path and indicates the state of the path, its delegation status, and the state synchronisation flag. The format of the LSP Object is displayed in Figure 18‑21.

Figure 18-21: PCEP LSP Object format

The fields in the LSP Object are:

PLSP-ID: a PCEP-specific identifier for the LSP. A PCC creates a unique PLSP-ID for each path (LSP). The PLSP-ID remains constant for the lifetime of a PCEP session. The tuple (PCC, PLSP-ID) is a globally unique PCEP identifier of a path.
Flags:
  O (Operational State): 0 – Down; 1 – Up; 2 – Active; 3 – Going-Down; 4 – Going-Up
  A (Administrative State): if set, the path is administratively enabled
  R (Remove): if set, the PCE should remove all path state
  S (Sync): set during State Synchronization
  D (Delegate): if set in a Report, the PCC delegates the path; if set in an Update, the PCE accepts the delegation
TLVs: LSP Identifier TLV, Symbolic Path Name TLV, Binding-SID TLV

LSP Identifier TLV

The LSP Identifier TLV identifies the path using fields inherited from RSVP-TE. Its format is shown in Figure 18-22. The fields in this TLV refer to various RSVP-TE tunnel and LSP identifiers. Some of these fields are re-used for the SR Policy paths.

Figure 18-22: LSP Identifier TLV format

The fields in the LSP Identifier TLV are:

IPv4 Tunnel Sender Address – headend IP address
LSP ID – used to differentiate LSPs with the same Tunnel ID
Tunnel ID – path identifier that remains constant over the lifetime of a path
Extended Tunnel ID – generally the headend IP address
IPv4 Tunnel Endpoint Address – endpoint IP address

Symbolic Path Name TLV

A symbolic name for a path, unique per PCC. The symbolic name of a path stays constant during the path’s lifetime. The format of this TLV is shown in Figure 18-23.

Figure 18-23: Symbolic Path Name TLV format

Binding-SID TLV

The SR Policy candidate path’s Binding-SID. The format of this TLV is presented in Figure 18-24.

Figure 18-24: Binding-SID TLV format

The Binding-SID TLV has the following fields:

Binding Type (BT):
  BT = 0: the Binding Value field contains an MPLS label where only the label value is valid; the other fields of the MPLS label (TC, S, and TTL) are ignored
  BT = 1: the Binding Value field contains an MPLS label where all the fields of the MPLS label (TC, S, and TTL) are significant and filled in on transmission
Binding Value: as specified by the Binding Type value

ERO Object

See section 18.4.5.

SRP Object

The SRP (Stateful PCE Request Parameters) Object is used to correlate between Update requests sent by the PCE and the error reports and state reports sent by the PCC. The format of the SRP Object is shown in Figure 18‑25.

Figure 18-25: SRP Object format

The SRP Object contains the following fields:

Flags:
  R-flag (Remove) – if set, it indicates a request to remove a path; if unset, it indicates a request to create a path – used in the Initiate message
SRP-ID-number: uniquely identifies (within the current PCEP session) the operation that the PCE has requested the PCC to perform. The SRP-ID-number is incremented each time a new request is sent to the PCC.

LSPA Object

See section 18.4.4.

Metric Object

See section 18.4.4.

Example

Example 18-4 shows a packet capture of a Report message. This Report message follows the Reply message in Example 18-3 of the previous section, after the PCC instantiated the SR Policy path. The SR Policy path with SID list <16003, 24034> is Active and delegated to the PCE. The SR Policy has the name “BROWN” and has a Binding-SID label 15111.

Example 18-4: Packet capture of a PCEP Path Computation Report message

14:54:42.769244000: 1.1.1.1:46605 --> 1.1.1.10:4189
Path Computation Element communication Protocol
    Path Computation LSP State Report (PCRpt) Header
        ..1. .... = PCEP Version: 0x1
        ...0 0000 = Flags: 0x00
            ...0 0000 = Reserved Flags: Not set
        Message Type: Path Computation LSP State Report (PCRpt) (10)
        Message length: 156
    SRP object
        Object Class: SRP OBJECT (33)
        0001 .... = SRP Object-Type: SRP (1)
        .... 0000 = Object Header Flags: 0x0
            ...0 = Ignore (I): Not set
            ..0. = Processing-Rule (P): Not set
            00.. = Reserved Flags: Not set
        Object Length: 20
        Flags: 0x00000000
            .... .... .... .... .... .... .... ...0 = Remove (R): Not set
        SRP-ID-number: 64
        PATH-SETUP-TYPE
            Type: PATH-SETUP-TYPE (28)
            Length: 4
            Reserved: 0x000000
            Path Setup Type: Path is setup using Segment Routing (1)
    LSP object
        Object Class: LSP OBJECT (32)
        0001 .... = LSP Object-Type: LSP (1)
        .... 0000 = Object Header Flags: 0x0
            ...0 = Ignore (I): Not set
            ..0. = Processing-Rule (P): Not set
            00.. = Reserved Flags: Not set
        Object Length: 52
        .... .... 1000 0000 0000 0010 1110 .... = PLSP-ID: 524334
        Flags: 0x000029
            .... .... .... ...1 = Delegate (D): Set
            .... .... .... ..0. = SYNC (S): Not set
            .... .... .... .0.. = Remove (R): Not set
            .... .... .... 1... = Administrative (A): Set
            .... .... .010 .... = Operational (O): ACTIVE (2)
            .... .... 0... .... = Create (C): Not Set
            .... 0000 .... .... = Reserved: Not set
        IPV4-LSP-IDENTIFIERS
            Type: IPV4-LSP-IDENTIFIERS (18)
            Length: 16
            IPv4 Tunnel Sender Address: 1.1.1.1
            LSP ID: 2
            Tunnel ID: 46
            Extended Tunnel ID: 1.1.1.1 (16843009)
            IPv4 Tunnel Endpoint Address: 1.1.1.4
        SYMBOLIC-PATH-NAME
            Type: SYMBOLIC-PATH-NAME (17)
            Length: 5
            SYMBOLIC-PATH-NAME: BROWN
            Padding: 000000
        TE-PATH-BINDING TLV
            Type: TE-PATH-BINDING TLV (65505)
            Length: 6
            Binding Type: 0 (MPLS)
            Binding Value: 0x03b07000
                MPLS label: 15111
    EXPLICIT ROUTE object (ERO)
        Object Class: EXPLICIT ROUTE OBJECT (ERO) (7)
        0001 .... = ERO Object-Type: Explicit Route (1)
        .... 0000 = Object Header Flags: 0x0
            ...0 = Ignore (I): Not set
            ..0. = Processing-Rule (P): Not set
            00.. = Reserved Flags: Not set
        Object Length: 32
        SR
            0... .... = L: Strict Hop (0)
            .010 0100 = Type: SUBOBJECT SR (36)
            Length: 12
            0001 .... = SID Type: IPv4 Node ID (1)
            .... 0000 0000 0001 = Flags: 0x001, SID value represents an MPLS label w/o TC, S, and TTL (M)
            SID: 65548288 (Label: 16003, TC: 0, S: 0, TTL: 0)
            NAI (IPv4 Node ID): 1.1.1.3
        SR
            0... .... = L: Strict Hop (0)
            .010 0100 = Type: SUBOBJECT SR (36)
            Length: 16
            0011 .... = SID Type: IPv4 Adjacency (3)
            .... 0000 0000 0001 = Flags: 0x001, SID value represents an MPLS label w/o TC, S, and TTL (M)
            SID: 98443264 (Label: 24034, TC: 0, S: 0, TTL: 0)
            Local IPv4 address: 99.3.4.3
            Remote IPv4 address: 99.3.4.4
    LSPA object
        Object Class: LSPA OBJECT (9)
        0001 .... = LSPA Object-Type: LSP Attributes (1)
        .... 0000 = Object Header Flags: 0x0
            ...0 = Ignore (I): Not set
            ..0. = Processing-Rule (P): Not set
            00.. = Reserved Flags: Not set
        Object Length: 20
        Exclude-Any: 0x00000000
        Include-Any: 0x00000000
        Include-All: 0x00000000
        Setup Priority: 7
        Holding Priority: 7
        Flags: 0x01
            .... ...1 = Local Protection Desired (L): Set
        Reserved: 0x00
    METRIC object
        Object Class: METRIC OBJECT (6)
        0001 .... = METRIC Object-Type: Metric (1)
        .... 0000 = Object Header Flags: 0x0
            ...0 = Ignore (I): Not set
            ..0. = Processing-Rule (P): Not set
            00.. = Reserved Flags: Not set
        Object Length: 12
        Reserved: 0
        Flags: 0x00
            .... ..0. = (C) Cost: Not set
            .... ...0 = (B) Bound: Not set
        Type: TE Metric (2)
        Metric Value: 30
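Two fields of Example 18-4 are worth decoding by hand: the LSP object flags (D/S/R/A plus the 3-bit operational state, counted from the least significant bit) and the TE-PATH-BINDING value, whose label also sits in the top 20 bits when the binding type is 0. The sketch below reproduces that arithmetic; it is illustrative, not a PCEP parser.

# Decode the LSP object flags reported in Example 18-4.
lsp_flags = 0x000029
delegate   = bool(lsp_flags & 0x1)      # D: path delegated to the PCE
sync       = bool(lsp_flags & 0x2)      # S: state synchronization in progress
remove     = bool(lsp_flags & 0x4)      # R: remove path state
admin_up   = bool(lsp_flags & 0x8)      # A: administratively enabled
oper_state = (lsp_flags >> 4) & 0x7     # O: 2 = ACTIVE
print(delegate, admin_up, oper_state)   # True True 2

# Decode the Binding-SID label (binding type 0: MPLS label in the top 20 bits).
binding_value = 0x03B07000
print(binding_value >> 12)              # 15111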

18.4.7 PCEP Update Message

A PCEP Update message contains at least the following objects:

Stateful PCE Request Parameters (SRP) Object: to correlate messages (e.g., PCUpd and PCRpt)
LSP Object: to identify the LSP
Explicit Route Object (ERO): path of the LSP

A PCEP Update message may also contain (non-exhaustive list):

Metric Object: metric of the calculated path

SRP Object

See section 18.4.6.

LSP Object

See section 18.4.6.

ERO Object

See section 18.4.5.

Metric Object

See section 18.4.4.

Example

Example 18-5 shows the packet capture of a Path Computation Update message. The PCE requests the PCC to update the SR Policy path’s SID list to <16003, 16004>; both SIDs are Prefix-SIDs.

Example 18-5: Packet capture of a PCEP Path Computation Update message

14:54:47.769244000: 1.1.1.10:4189 --> 1.1.1.1:46605
Path Computation Element communication Protocol
    Path Computation LSP Update Request (PCUpd) Header
        ..1. .... = PCEP Version: 0x1
        ...0 0000 = Flags: 0x00
            ...0 0000 = Reserved Flags: Not set
        Message Type: Path Computation LSP Update Request (PCUpd) (11)
        Message length: 88
    SRP object
        Object Class: SRP OBJECT (33)
        0001 .... = SRP Object-Type: SRP (1)
        .... 0000 = Object Header Flags: 0x0
            ...0 = Ignore (I): Not set
            ..0. = Processing-Rule (P): Not set
            00.. = Reserved Flags: Not set
        Object Length: 20
        Flags: 0x00000000
            .... .... .... .... .... .... .... ...0 = Remove (R): Not set
        SRP-ID-number: 47
        PATH-SETUP-TYPE
            Type: PATH-SETUP-TYPE (28)
            Length: 4
            Reserved: 0x000000
            Path Setup Type: Path is setup using Segment Routing (1)
    LSP object
        Object Class: LSP OBJECT (32)
        0001 .... = LSP Object-Type: LSP (1)
        .... 0000 = Object Header Flags: 0x0
            ...0 = Ignore (I): Not set
            ..0. = Processing-Rule (P): Not set
            00.. = Reserved Flags: Not set
        Object Length: 24
        .... .... 1000 0000 0000 0001 1101 .... = PLSP-ID: 524334
        Flags: 0x000009
            .... .... .... ...1 = Delegate (D): Set
            .... .... .... ..0. = SYNC (S): Not set
            .... .... .... .0.. = Remove (R): Not set
            .... .... .... 1... = Administrative (A): Set
            .... .... .000 .... = Operational (O): DOWN (0)
            .... .... 0... .... = Create (C): Not Set
            .... 0000 .... .... = Reserved: Not set
        VENDOR-INFORMATION-TLV
            Type: VENDOR-INFORMATION-TLV (7)
            Length: 12
            Enterprise Number: ciscoSystems (9)
            Enterprise-Specific Information: 0003000400000001
    EXPLICIT ROUTE object (ERO)
        Object Class: EXPLICIT ROUTE OBJECT (ERO) (7)
        0001 .... = ERO Object-Type: Explicit Route (1)
        .... 0000 = Object Header Flags: 0x0
            ...0 = Ignore (I): Not set
            ..0. = Processing-Rule (P): Not set
            00.. = Reserved Flags: Not set
        Object Length: 28
        SR
            0... .... = L: Strict Hop (0)
            .010 0100 = Type: SUBOBJECT SR (36)
            Length: 12
            0001 .... = SID Type: IPv4 Node ID (1)
            .... 0000 0000 0001 = Flags: 0x001, SID value represents an MPLS label w/o TC, S, and TTL (M)
            SID: 65548288 (Label: 16003, TC: 0, S: 0, TTL: 0)
            NAI (IPv4 Node ID): 1.1.1.3
        SR
            0... .... = L: Strict Hop (0)
            .010 0100 = Type: SUBOBJECT SR (36)
            Length: 12
            0011 .... = SID Type: IPv4 Node ID (1)
            .... 0000 0000 0001 = Flags: 0x001, SID value represents an MPLS label w/o TC, S, and TTL (M)
            SID: 65552384 (Label: 16004, TC: 0, S: 0, TTL: 0)
            NAI (IPv4 Node ID): 1.1.1.4
    METRIC object
        Object Class: METRIC OBJECT (6)
        0001 .... = METRIC Object-Type: Metric (1)
        .... 0000 = Object Header Flags: 0x0
            ...0 = Ignore (I): Not set
            ..0. = Processing-Rule (P): Not set
            00.. = Reserved Flags: Not set
        Object Length: 12
        Reserved: 0
        Flags: 0x00
            .... ..0. = (C) Cost: Not set
            .... ...0 = (B) Bound: Not set
        Type: TE Metric (2)
        Metric Value: 50

18.4.8 PCEP Initiate Message

A PCEP Initiate message contains at least the following objects:

Stateful PCE Request Parameters (SRP) Object: to correlate messages (e.g., PCUpd and PCRpt)
LSP Object: to identify the LSP
Explicit Route Object (ERO): path of the LSP

A PCEP Initiate message may also contain (non-exhaustive list):

End-points Object: source and destination of the SR Policy
Metric Object: metric of the calculated path

SRP Object

See section 18.4.6.

LSP Object

See section 18.4.6.

ERO Object

See section 18.4.5.

Endpoints Object

See section 18.4.4.

Metric Object

See section 18.4.4.

Example

Example 18-6 shows the packet capture of an Initiate message. The PCE initiates an SR Policy path on the PCC 1.1.1.1 to endpoint 1.1.1.4. The path is named “GREEN” and has a BSID 15222. The SID list is <16003, 24034>, with 16003 the Prefix-SID of Node3 and 24034 the Adj-SID of the link from Node3 to Node4. The SR Policy has a color 888 and the path’s preference is 200. At the time of writing, no PCEP objects were specified for these elements.

Example 18-6: Packet capture of a PCEP Path Computation Initiate message

14:59:36.738294000: 1.1.1.10:4189 --> 1.1.1.1:46605
Path Computation Element communication Protocol
    Path Computation LSP Initiate (PCInitiate) Header
        ..1. .... = PCEP Version: 0x1
        ...0 0000 = Flags: 0x00
            ...0 0000 = Reserved Flags: Not set
        Message Type: Path Computation LSP Initiate (PCInitiate) (12)
        Message length: 144
    SRP object
        Object Class: SRP OBJECT (33)
        0001 .... = SRP Object-Type: SRP (1)
        .... 0000 = Object Header Flags: 0x0
            ...0 = Ignore (I): Not set
            ..0. = Processing-Rule (P): Not set
            00.. = Reserved Flags: Not set
        Object Length: 20
        Flags: 0x00000000
            .... .... .... .... .... .... .... ...0 = Remove (R): Not set
        SRP-ID-number: 63
        PATH-SETUP-TYPE
            Type: PATH-SETUP-TYPE (28)
            Length: 4
            Reserved: 0x000000
            Path Setup Type: Path is setup using Segment Routing (1)
    LSP object
        Object Class: LSP OBJECT (32)
        0001 .... = LSP Object-Type: LSP (1)
        .... 0000 = Object Header Flags: 0x0
            ...0 = Ignore (I): Not set
            ..0. = Processing-Rule (P): Not set
            00.. = Reserved Flags: Not set
        Object Length: 48
        .... .... 0000 0000 0000 0000 0000 .... = PLSP-ID: 0
        Flags: 0x000089
            .... .... .... ...1 = Delegate (D): Set
            .... .... .... ..0. = SYNC (S): Not set
            .... .... .... .0.. = Remove (R): Not set
            .... .... .... 1... = Administrative (A): Set
            .... .... .000 .... = Operational (O): DOWN (0)
            .... .... 1... .... = Create (C): Set
            .... 0000 .... .... = Reserved: Not set
        SYMBOLIC-PATH-NAME
            Type: SYMBOLIC-PATH-NAME (17)
            Length: 5
            SYMBOLIC-PATH-NAME: GREEN
            Padding: 000000
        TE-PATH-BINDING TLV
            Type: TE-PATH-BINDING TLV (65505)
            Length: 6
            Binding Type: 0 (MPLS)
            Binding Value: 0x03b76000
                MPLS label: 15222
        VENDOR-INFORMATION-TLV
            Type: VENDOR-INFORMATION-TLV (7)
            Length: 12
            Enterprise Number: ciscoSystems (9)
            Enterprise-Specific Information: 0003000400000001
    END-POINT object
        Object Class: END-POINT OBJECT (4)
        0001 .... = END-POINT Object-Type: IPv4 addresses (1)
        .... 0000 = Object Header Flags: 0x0
            ...0 = Ignore (I): Not set
            ..0. = Processing-Rule (P): Not set
            00.. = Reserved Flags: Not set
        Object Length: 12
        Source IPv4 Address: 1.1.1.1
        Destination IPv4 Address: 1.1.1.4
    COLOR object
        Object Class: COLOR OBJECT (36)
        0001 .... = COLOR Object Type: color
        .... 0000 = Object Header Flags: 0x0
            ...0 = Ignore (I): Not set
            ..0. = Processing-Rule (P): Not set
            00.. = Reserved Flags: Not set
        Object Length: 8
        Color: 888
    PREFERENCE object
        Object Class: PREFERENCE OBJECT (37)
        0001 .... = PREFERENCE Object Type: preference
        .... 0000 = Object Header Flags: 0x0
            ...0 = Ignore (I): Not set
            ..0. = Processing-Rule (P): Not set
            00.. = Reserved Flags: Not set
        Object Length: 8
        Preference: 200
    EXPLICIT ROUTE object (ERO)
        Object Class: EXPLICIT ROUTE OBJECT (ERO) (7)
        0001 .... = ERO Object-Type: Explicit Route (1)
        .... 0000 = Object Header Flags: 0x0
            ...0 = Ignore (I): Not set
            ..0. = Processing-Rule (P): Not set
            00.. = Reserved Flags: Not set
        Object Length: 32
        SR
            0... .... = L: Strict Hop (0)
            .010 0100 = Type: SUBOBJECT SR (36)
            Length: 12
            0001 .... = SID Type: IPv4 Node ID (1)
            .... 0000 0000 0001 = Flags: 0x001, SID value represents an MPLS label w/o TC, S, and TTL (M)
            SID: 65548288 (Label: 16003, TC: 0, S: 0, TTL: 0)
            NAI (IPv4 Node ID): 1.1.1.3
        SR
            0... .... = L: Strict Hop (0)
            .010 0100 = Type: SUBOBJECT SR (36)
            Length: 16
            0011 .... = SID Type: IPv4 Adjacency (3)
            .... 0000 0000 0001 = Flags: 0x001, SID value represents an MPLS label w/o TC, S, and TTL (M)
            SID: 98443264 (Label: 24034, TC: 0, S: 0, TTL: 0)
            Local IPv4 address: 99.3.4.3
            Remote IPv4 address: 99.3.4.4
    METRIC object
        Object Class: METRIC OBJECT (6)
        0001 .... = METRIC Object-Type: Metric (1)
        .... 0000 = Object Header Flags: 0x0
            ...0 = Ignore (I): Not set
            ..0. = Processing-Rule (P): Not set
            00.. = Reserved Flags: Not set
        Object Length: 12
        Reserved: 0
        Flags: 0x00
            .... ..0. = (C) Cost: Not set
            .... ...0 = (B) Bound: Not set
        Type: TE Metric (2)
        Metric Value: 30

18.4.9 Disjointness Association Object

Identifying paths that must be disjoint from each other is done by grouping those paths in an Association group. To indicate the membership of a path in an association group, an Association Object is added to the PCEP messages. This Association Object is specified in IETF draft-ietf-pce-association-group.

The format of the Association Object is displayed in Figure 18-26. Two types of Association Objects exist, one with an IPv4 Association Source and one with an IPv6 Association Source.

Figure 18-26: Association Object format

The different fields in the Association Object are:

Flags: R (Removal): if set, the path is removed from the association group; if unset, the path is added or kept as part of the association group – only considered in Report and Update messages
Association type: identifies the type of association, such as “disjointness”
Association ID: this identifier in combination with Association type and Association Source uniquely identifies an association group
Association Source: this IPv4 or IPv6 address in combination with Association type and Association ID uniquely identifies an association group
Optional TLVs

To associate paths that must be disjoint from each other, an Association Object with Disjointness Association Type is used. This Disjointness Association Type is defined in draft-ietf-pce-association-diversity. The disjointness Association Object contains a Disjointness Configuration TLV that specifies the disjointness configuration parameters. The format of the Disjointness Configuration TLV is shown in Figure 18‑27.

Figure 18-27: Disjointness Configuration TLV format

Where the fields are:

Flags:
L (Link diverse) – if set, the paths within the disjoint group must be link-diverse
N (Node diverse) – if set, the paths within the disjoint group must be node-diverse
S (SRLG diverse) – if set, the paths within the disjoint group must be SRLG-diverse
P (Shortest path) – if set, the path should be computed without considering the disjointness constraint
T (Strict disjointness) – if set, the path must not fall back to a lower disjointness type if no disjoint path can be found; if unset, the path can fall back to a lower disjointness type or to a non-disjoint path if no disjoint path can be found.
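To make the grouping concrete, the sketch below packs the fixed part of an IPv4 Association Object as described above (Reserved, Flags with the R bit, Association Type, Association ID, Association Source). It is a minimal illustration only; the numeric disjointness association-type value and the helper names are assumptions, not taken from a particular implementation, and the optional TLVs (such as the Disjointness Configuration TLV) are omitted.

import socket
import struct

# Assumed type code for the disjointness association; verify against the
# draft/IANA registry before relying on it.
ASSOC_TYPE_DISJOINT = 2

def pack_association_object_body(assoc_type: int, assoc_id: int,
                                 source_ipv4: str, removal: bool = False) -> bytes:
    """Pack the fixed fields of an IPv4 Association Object body
    (no PCEP common object header, no optional TLVs)."""
    flags = 0x0001 if removal else 0x0000   # R (Removal) bit in the low-order bit
    return (struct.pack("!HHHH", 0, flags, assoc_type, assoc_id)
            + socket.inet_aton(source_ipv4))

# Two LSPs reported with the same (type, ID, source) belong to the same
# disjoint group, e.g., disjoint-group 1 sourced by a hypothetical PCE 10.0.0.100:
body = pack_association_object_body(ASSOC_TYPE_DISJOINT, 1, "10.0.0.100")
print(body.hex())  # 12-byte fixed body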

18.5 References

[RFC4655] "A Path Computation Element (PCE)-Based Architecture", JP Vasseur, Adrian Farrel, Gerald Ash, RFC4655, August 2006
[RFC5440] "Path Computation Element (PCE) Communication Protocol (PCEP)", JP Vasseur, Jean-Louis Le Roux, RFC5440, March 2009
[RFC8231] "Path Computation Element Communication Protocol (PCEP) Extensions for Stateful PCE", Edward Crabbe, Ina Minei, Jan Medved, Robert Varga, RFC8231, September 2017
[RFC8281] "Path Computation Element Communication Protocol (PCEP) Extensions for PCE-Initiated LSP Setup in a Stateful PCE Model", Edward Crabbe, Ina Minei, Siva Sivabalan, Robert Varga, RFC8281, December 2017
[RFC8408] "Conveying Path Setup Type in PCE Communication Protocol (PCEP) Messages", Siva Sivabalan, Jeff Tantsura, Ina Minei, Robert Varga, Jonathan Hardwick, RFC8408, July 2018
[draft-ietf-pce-segment-routing] "PCEP Extensions for Segment Routing", Siva Sivabalan, Clarence Filsfils, Jeff Tantsura, Wim Henderickx, Jonathan Hardwick, draft-ietf-pce-segment-routing-16 (Work in Progress), March 2019
[draft-sivabalan-pce-binding-label-sid] "Carrying Binding Label/Segment-ID in PCE-based Networks", Siva Sivabalan, Clarence Filsfils, Jeff Tantsura, Jonathan Hardwick, Stefano Previdi, Cheng Li, draft-sivabalan-pce-binding-label-sid-06 (Work in Progress), February 2019
[draft-ietf-pce-association-group] "PCEP Extensions for Establishing Relationships Between Sets of LSPs", Ina Minei, Edward Crabbe, Siva Sivabalan, Hariharan Ananthakrishnan, Dhruv Dhody, Yosuke Tanaka, draft-ietf-pce-association-group-09 (Work in Progress), April 2019
[draft-ietf-pce-association-diversity] "Path Computation Element communication Protocol (PCEP) extension for signaling LSP diversity constraint", Stephane Litkowski, Siva Sivabalan, Colby Barth, Mahendra Singh Negi, draft-ietf-pce-association-diversity-06 (Work in Progress), February 2019

[draft-litkowski-pce-state-sync] "Inter Stateful Path Computation Element (PCE) Communication Procedures", Stephane Litkowski, Siva Sivabalan, Cheng Li, Haomian Zheng, draft-litkowski-pce-state-sync-05 (Work in Progress), March 2019

19 BGP SR-TE

Different mechanisms are available to instantiate SR Policy candidate paths on a head-end node, such as CLI, NETCONF, PCEP, and BGP. This chapter explains the BGP mechanism to advertise a candidate path of an SR Policy, also known as “BGP SR-TE” or “BGP SR Policy”. IETF draft-ietf-idr-segment-routing-te-policy specifies the BGP extensions that are needed for this. This draft specifies a new Multi-Protocol BGP (MP-BGP) address-family to convey the SR Policy candidate paths using a new NLRI format.

It is important to note that the SR Policy candidate path is advertised in its own right; it is a self-contained BGP advertisement. The SR Policy candidate path advertisement is not a prefix advertisement: it is not related to any prefix, nor does it express any prefix reachability. It is not a tunnel advertisement and it is not related to any tunnel. It is also not an attribute of a prefix. If a given SR Policy candidate path must be updated, only that SR Policy candidate path needs to be re-advertised. If a new SR Policy candidate path is defined, only that new SR Policy candidate path needs to be advertised.

In summary, the BGP protocol is used as a conveyor of SR Policy candidate paths, taking up the same task as, e.g., the PCEP SR Policy initiation mechanism. No prefix reachability information is sent with the SR Policy. The relation between an SR Policy and a prefix is established by matching the SR Policy’s end-point and color to the prefix’s BGP next-hop and color community; this is Automated Steering (AS), described in chapter 5, "Automated Steering". AS is not specific to BGP-initiated SR Policies; it applies to all SR Policies, regardless of their initiation mechanism.

19.1 SR Policy Address-Family

The Multi-Protocol extensions of BGP (MP-BGP, RFC4760) provide the ability to pass arbitrary data in BGP protocol messages. The MP_REACH_NLRI and MP_UNREACH_NLRI attributes introduced with MP-BGP are BGP's containers for carrying opaque information. This capability has been leveraged by, for example, BGP-LS (RFC7752) and BGP flow-spec (RFC5575).

MP-BGP can also be used to carry SR Policy information in BGP. Therefore, IETF draft-ietf-idr-segment-routing-te-policy introduces new BGP address-families with new AFI/SAFI combinations, where the AFI is IPv4 or IPv6, combined with a new SR Policy SAFI. Figure 19‑1 is a high-level illustration of a BGP SR Policy Update message showing the different attributes that are present in such a BGP Update message.

Figure 19-1: BGP SR Policy Update message showing the different attributes

A BGP SR Policy Update message contains the mandatory attributes ORIGIN, AS_PATH, and for iBGP exchanges also LOCAL_PREF.

The BGP SR Policy NLRIs are included in the MP_REACH_NLRI attribute, as specified in RFC4760. This attribute also contains a Next-hop, which is the BGP next-hop as we know it. It is the IPv4 or IPv6 BGP session address if the advertising node applies next-hop-self. Note that, although a BGP next-hop is advertised with the SR Policy NLRI, the SR Policy is not bound to this BGP next-hop. The BGP next-hop attribute is mandated by the BGP protocol specification, so it must be included, but it is not related to any aspect of the advertised SR Policy path. However, as for any BGP advertisement, BGP requires the next-hop of a route to be reachable (read “present in RIB”) before it considers installing the route (RFC4271, Section 9.1.2).

The Tunnel Encapsulation Attribute contains the properties of the candidate path associated with the SR Policy that is identified in the NLRI included in the Update message.

One or more (extended) community attributes are present. These community attributes are Route-Target (RT) extended communities and/or the NO_ADVERTISE community; their purpose is described in section 19.2.3 of this chapter.

19.1.1 SR Policy NLRI

The new NLRI for the SR Policy AFI/SAFI identifies the SR Policy of the advertised candidate path. This NLRI has the format shown in Figure 19‑2.

Figure 19-2: SR Policy SAFI NLRI format

The fields in this NLRI are:

Distinguisher: a 32-bit numerical value used to distinguish multiple advertisements of the same SR Policy. The distinguisher has no semantic and is only used to make multiple occurrences of the same SR Policy unique (from an NLRI perspective). It is comparable to the L3VPN Route Distinguisher (RD).
Policy Color: a 32-bit color value used to distinguish multiple SR Policies to the same end-point. The tuple (color, end-point) identifies an SR Policy on a given head-end. The color typically identifies the policy or SLA or intent of the SR Policy, such as “low-delay”.
Endpoint: an IPv4 (32-bit) or IPv6 (128-bit) address, according to the AFI of the NLRI, that identifies the endpoint of the SR Policy. The endpoint can represent a single node or a set of nodes (e.g., using an anycast address).
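To make the byte layout concrete, the following minimal Python sketch packs an IPv4 SR Policy NLRI from these three fields, preceded by a length octet expressed in bits; that length is what appears as the “/96” in the router outputs later in this chapter. This is an illustration of the format only, not BGP speaker code, and assumes the field sizes listed above.

import socket
import struct

def encode_sr_policy_nlri_v4(distinguisher: int, color: int, endpoint: str) -> bytes:
    """Pack an IPv4 SR Policy NLRI: length (in bits), distinguisher, color, endpoint."""
    body = struct.pack("!II", distinguisher, color) + socket.inet_aton(endpoint)
    return struct.pack("!B", len(body) * 8) + body   # 12 octets of body -> 96 bits

# The NLRI advertised in the illustration section: [12345][10][1.1.1.4]/96
nlri = encode_sr_policy_nlri_v4(12345, 10, "1.1.1.4")
print(nlri.hex())  # 60 00003039 0000000a 01010104 (grouping added for readability)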

19.1.2 Tunnel Encapsulation Attribute

While the NLRI identifies the SR Policy of the candidate path, the properties of the candidate path itself are provided in a Tunnel Encapsulation Attribute that is advertised with the NLRI. The Tunnel Encapsulation Attribute is defined in IETF draft-ietf-idr-tunnel-encaps and consists of a set of TLVs, where each TLV contains information corresponding to a particular Tunnel-type. A new Tunnel-type (15) is defined in draft-ietf-idr-segment-routing-te-policy to encode an SR Policy candidate path in the attribute. The SR Policy TLV in the attribute consists of a set of sub-TLVs and sub-sub-TLVs. The encoding of an SR Policy candidate path in the Tunnel Encapsulation Attribute is illustrated in Figure 19‑3. The structure of the TLVs in the attribute reflects the structure of the SR Policy, as discussed in chapter 2, "SR Policy".

Figure 19-3: Tunnel Encapsulation Attribute structure for SR Policy candidate path

It is important to highlight that this Tunnel Encapsulation Attribute is opaque to BGP; it has no influence on BGP best-path selection or the path propagation procedure. A Tunnel Encapsulation Attribute contains a single SR Policy TLV, hence it describes a single candidate path. The format of the SR Policy TLV is shown in Figure 19‑4.

Figure 19-4: SR Policy TLV format

The SR Policy TLV may contain multiple Segment-List sub-TLVs, and each Segment-List sub-TLV may contain multiple Segment sub-sub-TLVs, as illustrated in Figure 19‑3. The different sub- and sub-sub-TLVs are discussed in the following sections.

Preference Sub-TLV

The Preference sub-TLV is an optional sub-TLV that encodes the preference of this candidate path. The preference value of a candidate path is used by SR-TE to select the preferred path among multiple (valid) candidates, see chapter 2, "SR Policy". Note that the preferred path selection is done by SR-TE, not by BGP. The format of this sub-TLV is shown in Figure 19‑5. No Flags have been defined yet.

Figure 19-5: Preference sub-TLV format

Binding-SID Sub-TLV

A controller can specify an explicit Binding-SID value for the SR Policy candidate path, see chapter 9, "Binding-SID and SRLB". In that case, the controller includes this optional Binding-SID sub-TLV in the SR Policy TLV. The format of this sub-TLV is shown in Figure 19‑6.

Figure 19-6: Binding-SID sub-TLV format

The Binding-SID can be empty, a 4-octet MPLS label (SR MPLS), or a 16-octet IPv6 SID (SRv6). The flags are:

S-flag: if set, the “Specified-BSID-only” behavior is enabled; the candidate path is then only considered as active path if a BSID is specified and available.
I-flag: if set, the “Drop Upon Invalid” behavior is enabled; the invalid SR Policy and its BSID are kept in the forwarding table with an action to drop packets steered into it.

Priority Sub-TLV

The priority sub-TLV is an optional sub-TLV that indicates the order in which SR Policies are recomputed following a topology change. The Priority sub-TLV has the format illustrated in Figure 19‑7. No flags have been defined.

Figure 19-7: Priority sub-TLV format

Policy Name Sub-TLV

The Policy Name sub-TLV is an optional sub-TLV that specifies a symbolic name of the SR Policy candidate path. The format of this sub-TLV is illustrated in Figure 19‑8.

Figure 19-8: Policy Name sub-TLV format

Explicit-Null Label Policy (ENLP) Sub-TLV

The Explicit-null Label Policy (ENLP) sub-TLV is an optional sub-TLV that indicates whether an explicit-null label must be imposed prior to any other labels on an unlabeled packet that is steered into the SR Policy. Imposing an explicit-null label at the bottom of the label stack of an unlabeled packet enables carrying packets of one address-family in an SR Policy of another address-family, such as carrying an IPv6 packet in an IPv4 SR Policy. First imposing the explicit-null label ensures that the packet is carried with an MPLS label all the way to the endpoint of the SR Policy; no PHP will be done on this packet. The intermediate nodes on the path can then label-switch the packet and do not have to support forwarding of the type of packet carried underneath the MPLS label. Only the endpoint must be able to process the packet that arrives with only the explicit-null label. The format of this sub-TLV is illustrated in Figure 19‑9.

Figure 19-9: Explicit-null Label Policy (ENLP) Sub-TLV format

No flags have been defined. The ENLP field has one of the following values:

1. Impose an IPv4 explicit-null label (0) on an unlabeled IPv4 packet and do not impose an IPv6 explicit-null label (2) on an unlabeled IPv6 packet
2. Do not impose an IPv4 explicit-null label (0) on an unlabeled IPv4 packet and impose an IPv6 explicit-null label (2) on an unlabeled IPv6 packet
3. Impose an IPv4 explicit-null label (0) on an unlabeled IPv4 packet and impose an IPv6 explicit-null label (2) on an unlabeled IPv6 packet
4. Do not impose an explicit-null label

Segment-List Sub-TLV

The Segment List, or SID list, of an SR Policy path encodes a single explicit path to the end-point, as a list of segments. The format of the Segment-List sub-TLV is shown in Figure 19‑10.

Figure 19-10: Segment-List sub-TLV format

This Segment-List sub-TLV is optional. As discussed in chapter 2, "SR Policy", an SR Policy candidate path can contain multiple segment lists, and each segment list in the set can have a weight value for Weighted ECMP load-balancing. Therefore, the SR Policy TLV can contain multiple Segment-List sub-TLVs. Each Segment-List sub-TLV can contain an optional Weight sub-sub-TLV and zero, one, or multiple Segment sub-sub-TLVs.

Weight Sub-Sub-TLV

The Weight sub-sub-TLV is an optional TLV in the Segment-List sub-TLV. The format of this sub-sub-TLV is shown in Figure 19‑11. No Flags have been defined yet.

Figure 19-11: Weight sub-TLV format
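As a small illustration of the weight semantics (a sketch, not router code): traffic steered into the SR Policy is load-balanced over the segment lists of the active candidate path in proportion to their weights.

def load_share(weights):
    """Return the fraction of traffic carried by each segment list (Weighted ECMP)."""
    total = sum(weights)
    return [w / total for w in weights]

# The BGP-signaled candidate path in the illustration section of this chapter
# carries two segment lists with weights 1 and 2, i.e., a 1/3 vs 2/3 split.
print(load_share([1, 2]))  # [0.333..., 0.666...]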

Segment Sub-Sub-TLV

A Segment-List sub-TLV contains zero, one, or multiple Segment sub-sub-TLVs. As specified in IETF draft-ietf-spring-segment-routing-policy, segments in the segment list can be specified using different segment-descriptor types. These types are also repeated in IETF draft-ietf-idr-segment-routing-te-policy. As an example, Figure 19‑12 shows the format of a Type 1 Segment TLV. This type specifies the SID in the form of an MPLS label value.

Figure 19-12: Segment sub-sub-TLV Type 1: SID only, in the form of MPLS Label

Where the flags are defined as:

V-flag: if set, the Segment Verification behavior is enabled, where SR-TE verifies the validity of this segment.
A-flag: if set, the SR Algorithm id is present in the SR Algorithm field; this flag does not apply to all segment-descriptor types, e.g., it does not apply to type-1.

Please refer to IETF draft-ietf-idr-segment-routing-te-policy for the formats of the other segment-descriptor types.
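The 32-bit SID field of this segment type mirrors an MPLS label stack entry (Label:20, TC:3, S:1, TTL:8). The small Python sketch below unpacks it; the numeric examples correspond to the packet captures shown earlier in this book (e.g., SID 65548288 carries label 16003).

def decode_label_sid(sid: int) -> dict:
    """Split the 32-bit SID field into Label (20 bits), TC (3), S (1), TTL (8)."""
    return {
        "label": sid >> 12,
        "tc": (sid >> 9) & 0x7,
        "s": (sid >> 8) & 0x1,
        "ttl": sid & 0xFF,
    }

# With the M-flag set, only the label part is meaningful; TC, S and TTL are zero.
print(decode_label_sid(65548288))  # {'label': 16003, 'tc': 0, 's': 0, 'ttl': 0}
print(decode_label_sid(98443264))  # {'label': 24034, 'tc': 0, 's': 0, 'ttl': 0}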

19.2 SR Policy BGP Operations

Since the SR Policy candidate path is conveyed by BGP, the existing BGP operations are applicable. This section discusses these operational aspects.

As was indicated before, BGP is merely used as the transport mechanism to convey the SR Policy candidate path information from the controller to the head-end. The SR Policy path information is opaque to BGP. BGP is not involved in the processing of the information, nor in the installation of the forwarding instructions. These aspects are handled by the SR-TE functionality.

19.2.1 BGP Best-Path Selection

When BGP receives multiple paths of a given NLRI, it uses the best-path selection rules to select the single best path for that NLRI. This procedure equally applies to BGP SR Policy advertisements. BGP selects the best-path among all received BGP SR Policy advertisements with the same NLRI, using the regular selection rules. BGP sends only the best-path to SR-TE.

While the existing BGP procedures apply to the BGP SR Policy advertisements, the network operator must ensure that the role of BGP is limited to being the conveyor of the information, not interfering with SR-TE operations. Specifically, BGP must not be the one that makes a selection between different candidate paths of the same SR Policy. Therefore, the operator must use the Distinguisher field in the NLRI, which is described in the next section.

19.2.2 Use of Distinguisher NLRI Field

The Distinguisher field in the SR Policy NLRI is used to make multiple candidate path advertisements of the same SR Policy (i.e., having the same color and endpoint) unique from an NLRI perspective.

Network Layer Reachability Information (NLRI)

When BGP receives an MP-BGP update message, it receives the Network Layer Reachability Information (NLRI) and the path attributes that are the properties of the NLRI. BGP treats the NLRI as an opaque key to an entry in its database and the path attributes are associated to this BGP database entry. By default, BGP only installs a single path for a given NLRI, the best-path, and when BGP propagates an NLRI to its neighbors (e.g., a RR reflecting routes to its RR clients), BGP only propagates this best-path. When BGP receives multiple paths with the same NLRI (the key of the database entry), it selects the best path for that NLRI using the BGP best-path selection procedure and only that best path is propagated. The best-path selection is influenced by the path attributes and other elements. When BGP receives multiple paths, each with a different NLRI, then BGP propagates all of these paths.

For those familiar with L3VPN, the SR Policy Distinguisher has a similar function to the L3VPN Route Distinguisher (RD) value that is added to an L3VPN prefix.

One or more controllers can advertise multiple candidate paths for a given SR Policy to a headend node. Or these controllers can advertise candidate paths for SR Policies with the same color and endpoint to multiple headend nodes. In all these cases, the controller needs to ensure that an SR Policy path advertisement is not inadvertently suppressed due to best-path selection. Therefore, each SR Policy path should be advertised with a unique NLRI. This explains the purpose of the Distinguisher field in the SR Policy NLRI: make the NLRI unique to ensure that:

RRs propagate each of the SR Policy candidate paths
BGP at the headend hands over each SR Policy candidate path to SR-TE

It is important that BGP does not perform any selection among the different candidate paths. The candidate path selection must only be done by SR-TE, based on path validity and preference, as described in chapter 2, "SR Policy".

Let us illustrate this with Figure 19‑13. In this network a BGP Route-Reflector (RR) is used to scale BGP route distribution. Two controllers advertise a candidate path for SR Policy (4.4.4.4, green) intended for headend Node1. Both controllers send the advertisement to the RR. Each controller sends its SR Policy path with a different distinguisher field, such that the RR treats them as separate routes (NLRIs). Therefore, the RR propagates both advertisements to headend Node1. Also BGP on Node1 sees both paths as independent paths, thanks to their unique NLRI, and does not apply any BGP path selection. Hence BGP hands over both paths to the SR-TE process. SR-TE receives both paths and performs the path selection mechanism based on path validity and preference, to select the active path among all candidate paths of the SR Policy (4.4.4.4, green).

Figure 19-13: Use of the Distinguisher NLRI field

If the controllers had used the same RD for their advertisements, the BGP RR would have applied best-path selection and would have propagated only the best-path to Node1. SR-TE on Node1 would then have received only one candidate path for the SR Policy to Node4 instead of the expected two paths.
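The behavior can be mimicked with a toy model (plain Python, not BGP code): the BGP table is keyed by the (distinguisher, color, endpoint) tuple, so identical distinguishers collapse into a single entry from which only one best-path emerges, while distinct distinguishers keep both candidate paths visible to SR-TE.

# Toy model of the NLRI keying described above; it only illustrates why the
# distinguisher matters and does not implement BGP best-path selection.
bgp_table = {}

def receive_sr_policy_path(distinguisher, color, endpoint, path):
    nlri = (distinguisher, color, endpoint)          # the opaque BGP key
    bgp_table.setdefault(nlri, []).append(path)

receive_sr_policy_path(12345, 10, "1.1.1.4", {"preference": 100, "from": "1.1.1.10"})
receive_sr_policy_path(54321, 10, "1.1.1.4", {"preference": 200, "from": "1.1.1.9"})

# One best-path per NLRI is handed to SR-TE. With two distinguishers there are
# two NLRIs, so SR-TE receives both candidate paths and can pick the highest
# preference; with a single distinguisher it would see only one of them.
for nlri, paths in bgp_table.items():
    print(nlri, "->", paths[0])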

19.2.3 Target Headend Node

A controller must specify the intended target headend; this is the headend node that should instantiate the SR Policy candidate path specified in the BGP SR Policy advertisement. There are two possible mechanisms to do this, as illustrated in Figure 19‑14. Which mechanism can be used depends on whether the controller has a direct BGP session to the target headend node or not.

Figure 19-14: Direct session or via RR

For the first mechanism as shown in Figure 19‑14 (a), the controller may or may not have a direct BGP session to the target headend. To identify the target headend, the controller attaches a Route-Target (RT) Extended Community attribute to the SR Policy advertisement. This is the well-known RT, as defined in RFC4360 and also used for BGP L3VPN. The RT is in IPv4-address format, where the IPv4 address matches the BGP router-id of the intended headend node.

If the SR Policy path must be instantiated on multiple headend nodes, multiple RTs can be attached, one for each targeted headend node. Only the nodes identified by one of the RTs will hand the SR Policy candidate path information to the SR-TE process.

In the case that the controller has a direct BGP session to the intended headend node, as shown in Figure 19‑14 (b), two methods are possible: attach an RT matching the headend node to the BGP SR Policy advertisement as described before, or attach a NO_ADVERTISE community. The presence of the well-known NO_ADVERTISE community (value 0xFFFFFF02, RFC1997) prevents the receiving node from propagating this advertisement to any of its BGP neighbors. If the headend node receives the SR Policy advertisement with a NO_ADVERTISE community attached, then it knows it is the intended target node.

If a node receives a BGP SR Policy advertisement that has neither an RT nor a NO_ADVERTISE community attached, the advertisement is considered invalid.

In summary, a controller can attach an RT that identifies the intended headend node to an SR Policy advertisement. If the controller has a direct BGP session to the intended headend node, then it can also attach the NO_ADVERTISE community to the SR Policy advertisement.

19.3 Illustrations

The BGP configuration of headend Node1 in Figure 19‑15 is shown in Example 19‑1. Node1 has a BGP session to the controller (1.1.1.10) with address-family IPv4 SR Policy enabled. Node1 also has SR-TE enabled.

Figure 19-15: BGP SR Policy illustration

Example 19-1: BGP SR-TE configuration

router bgp 1
 bgp router-id 1.1.1.1
 address-family ipv4 sr-policy
 !
 neighbor 1.1.1.10
  remote-as 1
  update-source Loopback0
  address-family ipv4 sr-policy
!
segment-routing
 traffic-eng

The controller sends a BGP update of an SR Policy to headend Node1. The SR Policy has a color 10 and an IPv4 endpoint 1.1.1.4. The advertised SR Policy NLRI consists of:

Distinguisher: 12345
Color: 10
Endpoint: 1.1.1.4

The preference of the advertised candidate path is 100 and its Binding-SID is 15001. The candidate path consists of two SID lists: <16003, 24034> with weight 1 and <16003, 16004> with weight 2.

Example 19‑2 shows the BGP Update packet capture. The update message contains the MP_REACH_NLRI attribute; the mandatory attributes ORIGIN, AS_PATH, and LOCAL_PREF; the EXTENDED_COMMUNITIES attribute; and the TUNNEL_ENCAPSULATION attribute.

The MP_REACH_NLRI attribute contains the nexthop address 1.1.1.10, which is the BGP router-id of the controller. The nexthop is not used, but BGP requires that it is a reachable address. The NLRI consists of the three fields described above (Distinguisher, Color, Endpoint).

The controller added the RT extended community 1.1.1.1:0 to indicate that Node1 (with router-id 1.1.1.1) is the intended headend node.

The TUNNEL_ENCAPSULATION attribute contains all the SR Policy’s candidate path information, structured with TLVs and sub-TLVs.

Example 19-2: BGP SR-TE Update message packet capture

Border Gateway Protocol - UPDATE Message
    Marker: ffffffffffffffffffffffffffffffff
    Length: 153
    Type: UPDATE Message (2)
    Withdrawn Routes Length: 0
    Total Path Attribute Length: 130
    Path attributes
        Path Attribute - MP_REACH_NLRI
            Address family identifier (AFI): IPv4 (1)
            Subsequent address family identifier (SAFI): SR TE Policy (73)
            Next hop network address (4 bytes)
                IPv4=1.1.1.10
            Number of Subnetwork points of attachment (SNPA): 0
            Network layer reachability information (13 bytes)
                SR TE Policy
                    Distinguisher: 12345
                    Color: 10
                    Endpoint: IPv4=1.1.1.4
        Path Attribute - ORIGIN: IGP
        Path Attribute - AS_PATH: empty
        Path Attribute - LOCAL_PREF: 100
        Path Attribute - EXTENDED_COMMUNITIES
            Carried extended communities: (1 community)
                Community Transitive IPv4-Address Route Target: 1.1.1.1:0
        Path Attribute - TUNNEL_ENCAPSULATION
            TLV Encodings
                SR TE Policy
                    Sub-TLV Encodings
                        Preference: 100
                        SR TE Binding SID: 15001
                        Segment List
                            Sub-sub-TLV Encodings
                                weight: 1
                                Segment: (Type 1) label: 16003, TC: 0, S: 0, TTL: 0
                                Segment: (Type 1) label: 24034, TC: 0, S: 0, TTL: 0
                        Segment List
                            Sub-sub-TLV Encodings
                                weight: 2
                                Segment: (Type 1) label: 16003, TC: 0, S: 0, TTL: 0
                                Segment: (Type 1) label: 16004, TC: 0, S: 0, TTL: 0

Example 19‑3 shows the summary of the BGP SR Policy paths on the headend Node1. In this case, Node1 received a single SR Policy path with NLRI [12345][10][1.1.1.4]/96, as shown in the output. The format of this string is indicated in the legend of the output: [distinguisher][color][endpoint]/mask.

Example 19-3: BGP IPv4 SR Policy update on Node1

RP/0/0/CPU0:xrvr-1#show bgp ipv4 sr-policy
BGP router identifier 1.1.1.1, local AS number 1
BGP generic scan interval 60 secs
Non-stop routing is enabled
BGP table state: Active
Table ID: 0x0   RD version: 16
BGP main routing table version 16
BGP NSR Initial initsync version 2 (Reached)
BGP NSR/ISSU Sync-Group versions 0/0
BGP scan interval 60 secs
Status codes: s suppressed, d damped, h history, * valid, > best
              i - internal, r RIB-failure, S stale, N Nexthop-discard
Origin codes: i - IGP, e - EGP, ? - incomplete
Network codes: [distinguisher][color][endpoint]/mask
   Network            Next Hop            Metric LocPrf Weight Path
*>i[12345][10][1.1.1.4]/96
                      1.1.1.10                      100      0 i

Processed 1 prefixes, 1 paths

To show more details, add the NLRI string to the show command as well as the detail keyword, as shown in Example 19‑4.

Example 19-4: BGP IPv4 SR Policy update on Node1 – detail

RP/0/0/CPU0:xrvr-1#show bgp ipv4 sr-policy [12345][10][1.1.1.4]/96 detail
BGP routing table entry for [12345][10][1.1.1.4]/96
Versions:
  Process           bRIB/RIB  SendTblVer
  Speaker                 16          16
    Flags: 0x00001001+0x00000200;
Last Modified: Mar 27 10:06:04.986 for 00:05:50
Paths: (1 available, best #1)
  Not advertised to any peer
  Path #1: Received by speaker 0
  Flags: 0x4000000001060005, import: 0x20
  Not advertised to any peer
  Local
    1.1.1.10 (metric 50) from 1.1.1.10 (1.1.1.10)
      Origin IGP, localpref 100, valid, internal, best, group-best
      Received Path ID 0, Local Path ID 0, version 16
      Extended community: RT:1.1.1.1:0
      Tunnel encap attribute type: 15 (SR policy)
       bsid 15001, preference 100, num of segment-lists 2
       segment-list 1, weight 1
        segments: {16003} {24034}
       segment-list 2, weight 2
        segments: {16003} {16004}
      SR policy state is UP, Allocated bsid 15001

Headend Node1 instantiates the SR Policy, as shown in Example 19‑5. The SR Policy has the automatically assigned name srte_c_10_ep_1.1.1.4, where “c_10” indicates color 10 and “ep_1.1.1.4” indicates endpoint 1.1.1.4. The candidate path has the automatically assigned name bgp_c_10_ep_1.1.1.4_discr_12345, where “bgp” indicates it is a BGP-signaled path and “discr_12345” indicates Discriminator 12345.

Example 19-5: SR Policy instantiated on headend Node1

RP/0/0/CPU0:xrvr-1#show segment-routing traffic-eng policy

SR-TE policy database
---------------------

Color: 10, End-point: 1.1.1.4
  Name: srte_c_10_ep_1.1.1.4
  Status:
    Admin: up  Operational: up for 00:00:07 (since Mar 27 10:06:05.106)
  Candidate-paths:
    Preference: 100 (BGP, RD: 12345) (active)
      Requested BSID: 15001
      PCC info:
        Symbolic name: bgp_c_10_ep_1.1.1.4_discr_12345
        PLSP-ID: 18
      Explicit: segment-list (valid)
        Weight: 1
          16003 [Prefix-SID, 1.1.1.3]
          24034
      Explicit: segment-list (valid)
        Weight: 2
          16003 [Prefix-SID, 1.1.1.3]
          16004
  Attributes:
    Binding SID: 15001 (SRLB)
    Forward Class: 0
    Steering BGP disabled: no
    IPv6 caps enable: yes

19.3.1 Illustration NLRI Distinguisher

19.3.1.1 Same Distinguisher, Same NLRI

Consider headend Node1 where two candidate paths of the same SR Policy (1.1.1.4, 10) are signaled via BGP and whose respective NLRIs have the same route distinguishers:

NLRI 1 with distinguisher = 12345, color = 10, endpoint = 1.1.1.4
  preference 100, segment list <16003, 24034>
NLRI 2 with distinguisher = 12345, color = 10, endpoint = 1.1.1.4
  preference 200, segment list <16003, 16004>

Because the NLRIs are identical (same endpoint, same color, same distinguisher), BGP performs best-path selection as usual and passes only the best-path to the SR-TE process. Note that there are no changes to the BGP best-path selection algorithm; the candidate paths’ preference values do not have any effect on the BGP best-path selection process.

The two advertisements as received by Node1 are shown in Example 19‑6. Both advertisements, one from 1.1.1.10 and another from 1.1.1.9, are shown as paths of the same NLRI. The detailed BGP output, shown in Example 19‑7, indicates that the advertisement from 1.1.1.10 is the best and is selected as best-path.

As SR-TE receives only one path, shown in Example 19‑8, it does not perform any path selection. While the controllers intended to instantiate two candidate paths for SR Policy (1.1.1.4, 10), a preferred path with preference 200 and a less preferred path with preference 100, BGP only passed one path to SR-TE. Since both paths were advertised with the same NLRI, the intended outcome was not achieved.

Example 19-6: SR Policy BGP advertisements with the same NLRI

RP/0/0/CPU0:xrvr-1#show bgp ipv4 sr-policy
BGP router identifier 1.1.1.1, local AS number 1
BGP generic scan interval 60 secs
Non-stop routing is enabled
BGP table state: Active
Table ID: 0x0   RD version: 24
BGP main routing table version 24
BGP NSR Initial initsync version 2 (Reached)
BGP NSR/ISSU Sync-Group versions 0/0
BGP scan interval 60 secs
Status codes: s suppressed, d damped, h history, * valid, > best
              i - internal, r RIB-failure, S stale, N Nexthop-discard
Origin codes: i - IGP, e - EGP, ? - incomplete
Network codes: [distinguisher][color][endpoint]/mask
   Network            Next Hop            Metric LocPrf Weight Path
*>i[12345][10][1.1.1.4]/96
                      1.1.1.10                      100      0 i
* i                   1.1.1.9                       100      0 i

Processed 1 prefixes, 2 paths

Example 19-7: SR Policy BGP advertisements with the same NLRI – detail

RP/0/0/CPU0:xrvr-1#show bgp ipv4 sr-policy [12345][10][1.1.1.4]/96
BGP routing table entry for [12345][10][1.1.1.4]/96
Versions:
  Process           bRIB/RIB  SendTblVer
  Speaker                 24          24
Last Modified: Mar 27 10:52:10.986 for 02:24:22
Paths: (2 available, best #1)
  Not advertised to any peer
  Path #1: Received by speaker 0
  Not advertised to any peer
  Local
    1.1.1.10 (metric 50) from 1.1.1.10 (1.1.1.10)
      Origin IGP, localpref 100, valid, internal, best, group-best
      Received Path ID 0, Local Path ID 0, version 24
      Extended community: RT:1.1.1.1:0
      Tunnel encap attribute type: 15 (SR policy)
       bsid 15001, preference 100, num of segment-lists 1
       segment-list 1, weight 1
        segments: {16003} {24034}
      SR policy state is UP, Allocated bsid 15001
  Path #2: Received by speaker 0
  Not advertised to any peer
  Local
    1.1.1.9 (metric 50) from 1.1.1.9 (1.1.1.9)
      Origin IGP, localpref 100, valid, internal
      Received Path ID 0, Local Path ID 0, version 0
      Extended community: RT:1.1.1.1:0
      Tunnel encap attribute type: 15 (SR policy)
       bsid 15001, preference 200, num of segment-lists 1
       segment-list 1, weight 1
        segments: {16003} {16004}
      SR policy state is Down (Request pending), Allocated bsid none

Example 19-8: SR-TE only receives one candidate path

RP/0/0/CPU0:xrvr-1#show segment-routing traffic-eng policy

SR-TE policy database
---------------------

Color: 10, End-point: 1.1.1.4
  Name: srte_c_10_ep_1.1.1.4
  Status:
    Admin: up  Operational: up for 02:25:03 (since Mar 27 10:52:11.187)
  Candidate-paths:
    Preference 100 (BGP, RD: 12345) (active)
      Requested BSID: 15000
      PCC info:
        Symbolic name: bgp_c_10_ep_1.1.1.4_discr_12345
        PLSP-ID: 16
      Explicit: segment-list (valid)
        Weight: 1
          16003
          24034
  Attributes:
    Binding SID: 15001 (SRLB)
    Forward Class: 0
    Steering BGP disabled: no
    IPv6 caps enable: yes

19.3.1.2 Different Distinguishers, Different NLRIs

Consider headend Node1 where two candidate paths of the same SR Policy (1.1.1.4, 10) are signaled via BGP and whose respective NLRIs have different route distinguishers:

NLRI 1 with distinguisher = 12345, color = 10, endpoint = 1.1.1.4
  preference 100, segment list <16003, 24034>
NLRI 2 with distinguisher = 54321, color = 10, endpoint = 1.1.1.4
  preference 200, segment list <16003, 16004>

The two advertisements as received by Node1 are shown in Example 19‑9. Both advertisements, one from 1.1.1.10 and another from 1.1.1.9, have a different NLRI. The detailed BGP output of these paths is shown in Example 19‑10 and Example 19‑11.

Because the NLRIs are different (different distinguisher), they each have a BGP best-path. Therefore, BGP passes both paths to the SR-TE process. From these two candidate paths, SR-TE selects the path with the highest preference as active path, since both paths are valid. Thus, the path advertised by 1.1.1.9 becomes the active path for SR Policy (1.1.1.4, 10). This is presented in Example 19‑12. Since each path was advertised with a different NLRI, the resulting active path is the intended active path for the SR Policy.

The recommended approach is to use NLRIs with different distinguishers when several candidate paths for the same SR Policy (color, endpoint) are signaled via BGP to a headend.

Example 19-9: SR Policy BGP advertisements with different NLRI

RP/0/0/CPU0:xrvr-1#show bgp ipv4 sr-policy
BGP router identifier 1.1.1.1, local AS number 1
BGP generic scan interval 60 secs
Non-stop routing is enabled
BGP table state: Active
Table ID: 0x0   RD version: 25
BGP main routing table version 25
BGP NSR Initial initsync version 2 (Reached)
BGP NSR/ISSU Sync-Group versions 0/0
BGP scan interval 60 secs
Status codes: s suppressed, d damped, h history, * valid, > best
              i - internal, r RIB-failure, S stale, N Nexthop-discard
Origin codes: i - IGP, e - EGP, ? - incomplete
Network codes: [distinguisher][color][endpoint]/mask
   Network            Next Hop            Metric LocPrf Weight Path
*>i[12345][10][1.1.1.4]/96
                      1.1.1.10                      100      0 i
*>i[54321][10][1.1.1.4]/96
                      1.1.1.9                       100      0 i

Processed 2 prefixes, 2 paths

Example 19-10: SR Policy BGP advertisements with different NLRI – path 1

RP/0/0/CPU0:xrvr-1#show bgp ipv4 sr-policy [12345][10][1.1.1.4]/96
BGP routing table entry for [12345][10][1.1.1.4]/96
Versions:
  Process           bRIB/RIB  SendTblVer
  Speaker                 24          24
Last Modified: Mar 27 10:52:10.986 for 02:26:39
Paths: (1 available, best #1)
  Not advertised to any peer
  Path #1: Received by speaker 0
  Not advertised to any peer
  Local
    1.1.1.10 (metric 50) from 1.1.1.10 (1.1.1.10)
      Origin IGP, localpref 100, valid, internal, best, group-best
      Received Path ID 0, Local Path ID 0, version 24
      Extended community: RT:1.1.1.1:0
      Tunnel encap attribute type: 15 (SR policy)
       bsid 15001, preference 100, num of segment-lists 1
       segment-list 1, weight 1
        segments: {16003} {24034}
      SR policy state is UP, Allocated bsid 15001

Example 19-11: SR Policy BGP advertisements with different NLRI – path 2

RP/0/0/CPU0:xrvr-1#show bgp ipv4 sr-policy [54321][10][1.1.1.4]/96
BGP routing table entry for [54321][10][1.1.1.4]/96
Versions:
  Process           bRIB/RIB  SendTblVer
  Speaker                 25          25
Last Modified: Mar 27 13:18:27.986 for 00:00:31
Paths: (1 available, best #1)
  Not advertised to any peer
  Path #1: Received by speaker 0
  Not advertised to any peer
  Local
    1.1.1.9 (metric 50) from 1.1.1.9 (1.1.1.9)
      Origin IGP, localpref 100, valid, internal, best, group-best
      Received Path ID 0, Local Path ID 0, version 25
      Extended community: RT:1.1.1.1:0
      Tunnel encap attribute type: 15 (SR policy)
       bsid 15001, preference 200, num of segment-lists 1
       segment-list 1, weight 1
        segments: {16003} {16004}
      SR policy state is UP, Allocated bsid 15001

Example 19-12: SR-TE receives two candidate paths and selects one as active path

RP/0/0/CPU0:xrvr-1#show segment-routing traffic-eng policy

SR-TE policy database
---------------------

Color: 10, End-point: 1.1.1.4
  Name: srte_c_10_ep_1.1.1.4
  Status:
    Admin: up  Operational: up for 00:28:01 (since Aug  8 20:23:00.736)
  Candidate-paths:
    Preference: 200 (BGP, RD: 54321) (active)
      Requested BSID: 15001
      PCC info:
        Symbolic name: bgp_c_10_ep_1.1.1.4_discr_54321
        PLSP-ID: 17
      Explicit: segment-list (valid)
        Weight: 1
          16003 [Prefix-SID, 1.1.1.3]
          16004
    Preference: 100 (BGP, RD: 12345)
      Requested BSID: 15001
      PCC info:
        Symbolic name: bgp_c_10_ep_1.1.1.4_discr_12345
        PLSP-ID: 16
      Explicit: segment-list (invalid)
        Weight: 1
  Attributes:
    Binding SID: 15001 (SRLB)
    Forward Class: 0
    Steering BGP disabled: no
    IPv6 caps enable: yes

19.4 References

[RFC1997] "BGP Communities Attribute", Tony Li, Ravi Chandra, Paul S. Traina, RFC1997, August 1996
[RFC4760] "Multiprotocol Extensions for BGP-4", Ravi Chandra, Yakov Rekhter, Tony J. Bates, Dave Katz, RFC4760, January 2007
[RFC7752] "North-Bound Distribution of Link-State and Traffic Engineering (TE) Information Using BGP", Hannes Gredler, Jan Medved, Stefano Previdi, Adrian Farrel, Saikat Ray, RFC7752, March 2016
[RFC5575] "Dissemination of Flow Specification Rules", Pedro R. Marques, Jared Mauch, Nischal Sheth, Barry Greene, Robert Raszuk, Danny R. McPherson, RFC5575, August 2009
[draft-ietf-idr-tunnel-encaps] "The BGP Tunnel Encapsulation Attribute", Eric C. Rosen, Keyur Patel, Gunter Van de Velde, draft-ietf-idr-tunnel-encaps-11 (Work in Progress), February 2019
[draft-ietf-idr-segment-routing-te-policy] "Advertising Segment Routing Policies in BGP", Stefano Previdi, Clarence Filsfils, Dhanendra Jain, Paul Mattes, Eric C. Rosen, Steven Lin, draft-ietf-idr-segment-routing-te-policy-05 (Work in Progress), November 2018
[draft-ietf-spring-segment-routing-policy] "Segment Routing Policy Architecture", Clarence Filsfils, Siva Sivabalan, Daniel Voyer, Alex Bogdanov, Paul Mattes, draft-ietf-spring-segment-routing-policy-02 (Work in Progress), October 2018

20 Telemetry

Before the introduction of telemetry, a network operator used various methods to collect data from networking devices: SNMP, syslog, CLI scripts, etc. These methods had a number of problems: incomplete, unscalable, unstructured, etc.

With telemetry, the network data harvesting model changes from a pull model to a push model, from polling to streaming. Therefore, telemetry is often referred to as “streaming telemetry”.

Polling – when the network management system (NMS) wants to get some network data from a networking device, such as interface counter values, it sends a request to this device, specifying the data it wants to receive. The device collects the data and returns it to the NMS in a reply. This sequence is repeated whenever the NMS needs new data.

Streaming – data is continuously sent (“streamed”) from the networking devices to one or more collectors, either at fixed intervals or when the data changes. A collector can subscribe to the specific data streams it is interested in.

Streaming data is more efficient than polling. The collector can simply consume the received data without periodic polling, and the networking device can collect the data more efficiently as it knows beforehand what to collect, which allows the collection to be organized efficiently. It also avoids the request/response CPU hit associated with polling and, more importantly, if there are many receivers the device can simply replicate the packet instead of processing multiple requests for the same data.

The telemetry data is model-driven, which means telemetry uses data that is structured in a consistent format, a data-model. The various types of data are structured in different models, with each data model specified in a YANG module, using the YANG data modeling language (RFC 6020). A YANG module is a file that describes a data-model, how the data is structured in a hierarchical (tree) structure, using the YANG language.

There are multiple ways to structure data and therefore multiple types of YANG models exist: “Cisco IOS XR native model”, Openconfig, IETF, … A number of modules (IETF, vendor-specific, …) are available in the GitHub YangModels repository [Github-YANG]. For example, the Cisco IOS XR native YANG models can be found in [YANG-Cisco]. Openconfig YANG modules can be found in the Openconfig GitHub repository [Openconfig-YANG].

20.1 Telemetry Configuration

The model-driven telemetry configuration in Cisco IOS XR consists of three parts:

What data to stream
Where to send it and how
When to send it

20.1.1 What Data to Stream

We will use the example of a popular YANG module that describes the data-model of the Statistics Infrastructure of a router, containing the interface statistics: Cisco-IOS-XR-infra-statsd-oper.yang. Example 20‑1 shows the YANG module as presented using the pyang tool to dump the YANG module in a tree structure. Large chunks of the output have been removed for brevity, but it gives an idea of the structure of this data-model.

Example 20-1: Cisco-IOS-XR-infra-statsd-oper.yang tree structure

$ pyang -f tree Cisco-IOS-XR-infra-statsd-oper.yang --tree-depth 4
module: Cisco-IOS-XR-infra-statsd-oper
  +--ro infra-statistics
     +--ro interfaces
        +--ro interface* [interface-name]
           +--ro cache
           |  ...
           +--ro latest
           |  ...
           +--ro total
           |  ...
           +--ro interface-name             xr:Interface-name
           +--ro protocols
           |  ...
           +--ro interfaces-mib-counters
           |  ...
           +--ro data-rate
           |  ...
           +--ro generic-counters
              ...

To find the data that you want to stream using telemetry, you first need to identify the data-model that contains the required data and secondly the leafs within that data-model that contain the required data.

In this example, we want to stream the generic interface statistics (packets/bytes received, packets/bytes sent, etc.). The Cisco-IOS-XR-infra-statsd-oper YANG model that we saw earlier may contain this data. Example 20‑2 shows the content of the “latest/generic-counters” container of the Cisco-IOS-XR-infra-statsd-oper YANG model. It contains the statistics that we are interested in.

Example 20-2: Cisco-IOS-XR-infra-statsd-oper.yang

$ pyang -f tree Cisco-IOS-XR-infra-statsd-oper.yang --tree-path infra-statistics/interfaces/interface/latest/generic-counters
module: Cisco-IOS-XR-infra-statsd-oper
  +--ro infra-statistics
     +--ro interfaces
        +--ro interface* [interface-name]
           +--ro latest
              +--ro generic-counters
                 +--ro packets-received?              uint64
                 +--ro bytes-received?                uint64
                 +--ro packets-sent?                  uint64
                 +--ro bytes-sent?                    uint64
                 +--ro multicast-packets-received?    uint64
                 +--ro broadcast-packets-received?    uint64
                 ...

Now that we have found the YANG model and the subtree within that model that contains the desired data, we can configure the networking device to stream this data. A sensor-path is configured in the sensor-group section of the telemetry model-driven configuration, as shown in Example 20‑3. This sensor-path consists of two elements separated by a “:”. The first element is the name of the YANG model. This is the name of the YANG module file without the “.yang” extension, “Cisco-IOS-XR-infra-statsd-oper” in the example. The second element of the sensor-path is the subtree path, “infra-statistics/interfaces/interface/latest/generic-counters” in the example. Multiple sensor-paths can be configured under the sensor-group if desired.

Example 20-3: MDT sensor-path configuration

telemetry model-driven
 sensor-group SGROUP1
  sensor-path Cisco-IOS-XR-infra-statsd-oper:infra-statistics/interfaces/interface/latest/generic-counters

Example 20‑4 shows a sample of data that is streamed when specifying the example sensor-path. The data of all interfaces is streamed; the output only shows the data of a single interface.

Example 20-4: Sample of streamed telemetry data

...
{
  "Timestamp": 1530541658167,
  "Keys": {
    "interface-name": "TenGigE0/1/0/0"
  },
  "Content": {
    "applique": 0,
    "availability-flag": 0,
    "broadcast-packets-received": 2,
    "broadcast-packets-sent": 1,
    "bytes-received": 772890213,
    "bytes-sent": 1245490036,
    "carrier-transitions": 1,
    "crc-errors": 0,
    "framing-errors-received": 0,
    "giant-packets-received": 0,
    "input-aborts": 0,
    "input-drops": 11,
    "input-errors": 0,
    "input-ignored-packets": 0,
    "input-overruns": 0,
    "input-queue-drops": 0,
    "last-data-time": 1530541658,
    "last-discontinuity-time": 1528316495,
    "multicast-packets-received": 1768685,
    "multicast-packets-sent": 1026962,
    "output-buffer-failures": 0,
    "output-buffers-swapped-out": 0,
    "output-drops": 0,
    "output-errors": 0,
    "output-queue-drops": 0,
    "output-underruns": 0,
    "packets-received": 4671580,
    "packets-sent": 9672832,
    "parity-packets-received": 0,
    "resets": 0,
    "runt-packets-received": 0,
    "seconds-since-last-clear-counters": 0,
    "seconds-since-packet-received": 0,
    "seconds-since-packet-sent": 0,
    "throttled-packets-received": 0,
    "unknown-protocol-packets-received": 0
  }
},
...
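As an illustration of how a collector might consume records shaped like the sample above, here is a minimal Python sketch that derives a byte rate from two successive samples of the generic-counters sensor-path. It assumes JSON-decoded records with the "Timestamp" (milliseconds) and "Content" fields shown in Example 20‑4; it is not the API of any particular collector.

def byte_rate(prev: dict, curr: dict) -> float:
    """Return bytes-received per second between two telemetry records."""
    delta_bytes = curr["Content"]["bytes-received"] - prev["Content"]["bytes-received"]
    delta_secs = (curr["Timestamp"] - prev["Timestamp"]) / 1000.0
    return delta_bytes / delta_secs

# Hypothetical successive samples, 30 seconds apart (sample-interval 30000 ms):
sample_t0 = {"Timestamp": 1530541658167, "Content": {"bytes-received": 772890213}}
sample_t1 = {"Timestamp": 1530541688167, "Content": {"bytes-received": 772990213}}
print(byte_rate(sample_t0, sample_t1))  # ~3333 bytes/s over the 30 s interval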

Table 20‑1 shows some commonly used YANG models.

Table 20-1: Examples of commonly used YANG models

Feature     YANG Model
Interfaces  Cisco-IOS-XR-infra-statsd-oper:infra-statistics/interfaces/interface/latest/generic-counters
QoS         Cisco-IOS-XR-qos-ma-oper:qos/interface-table/interface
Memory      Cisco-IOS-XR-nto-misc-shmem-oper:memory-summary/nodes/node/summary
            Cisco-IOS-XR-nto-misc-shprocmem-oper:processes-memory/nodes/node
CPU         Cisco-IOS-XR-wdsysmon-fd-oper:system-monitoring/cpu-utilization
BGP         Cisco-IOS-XR-ipv4-bgp-oper:bgp/instances/instance/instance-active/default-vrf/neighbors/neighbor
IP          Cisco-IOS-XR-infra-statsd-oper:infra-statistics/interfaces/interface/latest/protocols/protocol

The YANG models can also be found on the router itself. The modules are located in the directory /pkg/yang that can be accessed after activating the shell (run), as shown in Example 20‑5. The example lists the performance measurement YANG modules. Use more or less to peek inside a module.

Example 20-5: Accessing YANG modules on the router

RP/0/RSP0/CPU0:ASR9904-B# run
# cd /pkg/yang
# ls Cisco-IOS-XR-perf-meas*
Cisco-IOS-XR-perf-meas-cfg.yang        Cisco-IOS-XR-perf-meas-oper-sub1.yang
Cisco-IOS-XR-perf-meas-oper-sub2.yang  Cisco-IOS-XR-perf-meas-oper.yang
# more Cisco-IOS-XR-perf-meas-oper.yang
module Cisco-IOS-XR-perf-meas-oper {
  /*** NAMESPACE / PREFIX DEFINITION ***/
  namespace "http://cisco.com/ns/yang/Cisco-IOS-XR-perf-meas-oper";
  prefix "perf-meas-oper";
  /*** LINKAGE (IMPORTS / INCLUDES) ***/
  import Cisco-IOS-XR-types {
    prefix "xr";
  }
  include Cisco-IOS-XR-perf-meas-oper-sub2 {
    revision-date 2017-10-17;
  }
  include Cisco-IOS-XR-perf-meas-oper-sub1 {
    revision-date 2017-10-17;
  }
  /*** META INFORMATION ***/
 --More--
# exit

20.1.2 Where to Send It and How

In the previous section we identified which data needs to be streamed. In this section, we specify where the data will be sent, using which protocol, and in which format.

The transport and encoding are configured under a destination-group section of the telemetry model-driven configuration. In Example 20‑6, the collector’s address is 172.21.174.74 and it listens at TCP port 5432.

Example 20-6: MDT destination-group configuration

telemetry model-driven
 destination-group DGROUP1
  address family ipv4 172.21.174.74 port 5432
   encoding self-describing-gpb
   protocol tcp

The other protocol options are UDP and gRPC [gRPC]. Note that the available configuration options depend on platform and software release. With gRPC there is an option to let the collector dial in to the router; in that case the collector initiates a gRPC session to the router and specifies a subscription. The router then streams the data that is specified by the sensor-group(s) in the subscription that the collector specified.

To transport the data from the device to the collector, it must be encoded (or “serialized”) in a format that can be transmitted across the network. The collector decodes (or “de-serializes”) the received data into a semantically identical copy of the original data. Common text-based encodings are JSON and XML. Another popular encoding format is Google Protocol Buffers (GPB), “protobufs” for short. From the GPB website [GPB]: “Protocol buffers are Google's language-neutral, platform-neutral, extensible mechanism for serializing structured data – think XML, but smaller, faster, and simpler.”

The GPB encoding serializes data into a binary format that is compact but not self-describing (i.e., you need an external specification to decode the data). Specifically, the field names are not included in the transmitted data since they are very verbose, especially when compared to the mostly numerical data, and do not change between sample intervals. There is a self-describing variant of GPB that is less efficient but simpler to use.

The encoding is configured in the destination-group of the telemetry configuration. In Example 20‑6 the encoding is specified as self-describing-gpb.

20.1.3 When to Send It

A subscription configuration section under telemetry model-driven specifies which sensor-group(s) must be streamed to which destination-id(s) and at what interval. Streaming telemetry can send the data periodically at a fixed interval, or only when it changes. The latter is achieved by specifying a sample-interval 0. The configuration in Example 20‑7 specifies that the subscription SUB1 streams the sensor-path(s) specified in the sensor-group SGROUP1 every 30 seconds (30000 ms) to the collector(s) specified in destination-id DGROUP1.

Example 20-7: MDT subscription configuration

telemetry model-driven
 subscription SUB1
  sensor-group-id SGROUP1 sample-interval 30000
  destination-id DGROUP1

Example 20‑8 shows the complete telemetry model-driven configuration.

Example 20-8: MDT complete configuration

telemetry model-driven
 destination-group DGROUP1
  address family ipv4 172.21.174.74 port 5432
   encoding self-describing-gpb
   protocol tcp
 !
 sensor-group SGROUP1
  sensor-path Cisco-IOS-XR-infra-statsd-oper:infra-statistics/interfaces/interface/latest/generic-counters
 !
 subscription SUB1
  sensor-group-id SGROUP1 sample-interval 30000
  destination-id DGROUP1

20.2 Collectors and Analytics

The collector and analytics platform are not in scope of this book. Worth mentioning is the open-source pipeline [pipeline], a flexible, multi-function collection service. It can collect telemetry data from the network devices, write the telemetry data to a text file, push the data to a Kafka bus, and/or format it for consumption by various analytics stacks.

20.3 References

[RFC6020] "YANG - A Data Modeling Language for the Network Configuration Protocol (NETCONF)", Martin Björklund, RFC6020, October 2010
[Github-YANG] https://github.com/YangModels
[YANG-Cisco] https://github.com/YangModels/yang/tree/master/vendor/cisco
[Openconfig-YANG] https://github.com/openconfig/public/tree/master/release/models
[gRPC] https://grpc.io/
[GPB] https://developers.google.com/protocol-buffers/
[pipeline] https://github.com/cisco/bigmuddy-network-telemetry-pipeline
[Telemetry] https://xrdocs.io/telemetry/

Section IV – Appendices

A. Introduction of SR Book Part I

This appendix provides an unmodified copy of the subjective introduction of Part I of this SR book series as written in 2016. It describes the intuitions and design objectives of the overall Segment Routing solution.

A.1 Objectives of the Book

“Segment Routing – Part I” has several objectives:

To teach the basic elements of Segment Routing (SR) with a focus on the IGP extensions, the IGP/LDP interaction, Topology-Independent Fast Reroute (TI-LFA FRR) and the MPLS data plane. “Segment Routing – Part II” will focus on the Traffic Engineering applications, both distributed and centralized. Part I describes the SR use-cases that have been first considered for deployment. Part I lays the foundation for Part II.
To provide design guidelines and illustrations based on real use-cases we have deployed
To invite operators who have participated in this project to provide their view and offer tips based on their experience defining and deploying SR
To explain why we have defined SR the way it is, what were our primary objectives, what were our intuitions, what happened during the first 3 years of this technology

The first objective is natural and we will dedicate lots of content to teach SR, let’s say in an “objective” manner.

We believe that it is important to complement the “objective” understanding of a technology with a more “subjective” viewpoint. This helps to understand what part of the technology is really important in real life, what to optimize for, what trade-offs to make or not make. The other three objectives are related to getting this subjective experience. By reviewing deployed use-cases, by incorporating the highlights of the operators, by explaining how we came up with SR and what were our initial intuitions and goals; we hope that the readers will get a much more practical understanding of the SR architecture.

The main part of the text is dedicated to the first goal, the objective explanation of the technology and the description of use-cases. Important concepts are reminded in textboxes titled “Highlight”. To clearly distinguish the subjective content from the main flow of the book (objective), the subjective viewpoints will be inserted in text boxes attributed to the person providing the subjective opinion.

This entire first chapter should be considered as a subjective entry. It is written by Clarence to describe how he got started with SR, then the influence of SDN and OpenFlow, how he managed the project and got the close involvement of operators to define the technology and drive its implementation for specific use-cases. We stress the subjective nature of this chapter and the opinion text boxes along this book. They express personal viewpoints to help the reader forge his own viewpoint. They are not meant to be “it must be this way” or “this is the only right way” type of guidelines. They are meant to clearly describe how some people think of the technology so that the readers can leverage their viewpoints to build their own. This is very different from the normal flow of the book where we try to be objective and describe the technology as it is (“then it is really like this”).

A.2 Why Did We Start SR?

In 1998, I was hired in the European consulting team at Cisco to deploy a new technology called tag-switching (later renamed MPLS). This was a fantastic experience: witnessing the entire technology definition process, working closely with the MPLS architecture team, having first-hand experience designing, deploying the first and biggest MPLS networks and collecting feedback and requirements from operators.

Over these years, while the elegance of the MPLS data plane has rarely been challenged, it became obvious that the MPLS “classic” (LDP and RSVP-TE) control-plane was too complex and lacked scalability.

In 2016, when we write this text, it should be safe to write that LDP is redundant to the IGP and that it is better to distribute labels bound to IGP signaled prefixes in the IGP itself rather than using an independent protocol (LDP) to do it. LDP adds complexity: it requires one more process to configure and manage and it creates complicated interaction problems with the IGP (LDP-IGP synchronization issue, RFC 5443, RFC 6138).

Clearly, LDP was invented 20 years ago for various reasons that were good at that time. We are not saying that mistakes were made in the 1990’s. We are saying that, in our opinion1, if an operator were to deploy a greenfield MPLS network in 2016, considering the issues described above and the experience learned with RLFA (RFC 7490), they would not think of using LDP and would prefer to distribute the labels directly in the IGP. This requires straightforward IGP extensions as we will see in chapter 5, “Segment Routing IGP Control Plane” in Part I of the SR book series.

On the RSVP-TE side, from a bandwidth admission control viewpoint, it seems safe to write that there has been weak deployment and that the few who deployed have reported complex operation models and scaling issues. In fact, most of the RSVP-TE deployments have been limited to fast reroute (FRR) use-case. Overall, we would estimate that 10% of the SP market and likely 0% of the Enterprise market have used RSVP-TE and that among these deployments, the vast majority did it for FRR reasons.

Our point is not to criticize the RSVP-TE protocol definition or minimize its merits. Like for LDP, there were good reasons 20 years ago for RSVP-TE and MPLS TE to have been defined the way they are. It is also clear that 20 years ago, RSVP-TE and MPLS-TE provided a major innovation to IP networks. At that time, there was no other bandwidth optimization solution. At that time, there was no other FRR solution. RSVP-TE and MPLS-TE introduced great benefits 20 years ago. Our point is to look at its applicability in IP networks in 2016. Does it fit the needs of modern IP networks?

In our opinion, RSVP-TE and the classic MPLS TE solution have been defined to replicate FR/ATM in IP. The objective was to create circuits whose state would be signaled hop-by-hop along the circuit path. Bandwidth would be booked hop-by-hop. Each hop's state would be updated. The available bandwidth of each link would be flooded throughout the domain using the IGP to enable distributed TE computation. We believe that these design goals are no longer consistent with the needs of modern IP networks.

First, RSVP-TE is not ECMP-friendly. This is a fundamental issue as the basic property of modern IP networks is to offer multiple paths from a source to a destination. This ECMP nature is fundamental to spread traffic along multiple paths to add capacity as required and for redundancy reasons.

Second, to accurately book the used bandwidth, RSVP-TE requires all the IP traffic to run within so-called "RSVP-TE tunnels". This leads to much complexity and lack of scale in practice.

Let's illustrate this by analyzing the most frequent status of a network: i.e., a correctly capacity-planned network. Such a network has enough capacity to accommodate, without congestion, a likely traffic volume under a set of likely independent failures. The traffic is routed according to the IGP shortest-path and enough capacity is present along these shortest-paths. This is the norm for the vast majority of SP and Enterprise networks, either all of the time or at least for the vast majority of the time (this is controlled by the "likeliness" of the traffic volume and the failure scenarios). Tools such as Cisco WAN Automation Engine (WAE) Planning2 are essential to correctly capacity plan a network.

HIGHLIGHT: Capacity Planning
If one wants to succeed in traffic engineering, one should first learn capacity planning. An analogy could be that if one wants to be healthy (SLA), one should focus on finding a lifestyle (capacity planning process) which keeps the body and mind balanced, such as to minimize when medicine (tactical traffic engineering) needs to be taken. We would advise studying WAE Planning.2

Clearly, in these conditions, traffic engineering to avoid congestion is not needed. It seems obvious to write it but, as we will see further, this is not the case for an RSVP-TE network.

In the rare cases where the traffic is larger than expected or a non-expected failure occurs, congestion occurs and a traffic engineering solution may be needed. We write "may" because, once again, it depends on the capacity planning process. Some operators might capacity plan the network via modeling such that these occurrences are so unlikely that the resulting congestion might be tolerated. This is a very frequent approach. Some other operators may not tolerate even these rare congestions and then require a tactical traffic-engineering process. A tactical traffic-engineering solution is a solution that is used only when needed.

To the contrary, the classic RSVP-TE solution is an "always-on" solution. At any time (even when no congestion is occurring), all the traffic must be steered along circuits (RSVP-TE tunnels). This is required to correctly account for the used bandwidth at any hop. This is the reason for the infamous full-mesh of RSVP-TE tunnels. Full-mesh implies that there must be a tunnel from anywhere to anywhere on the network edge and that all traffic must ride on RSVP-TE tunnels; IP forwarding spoils accurate traffic statistics. Hence, traffic can never utilize IGP-derived ECMP paths and, to hide the lack of ECMP in RSVP-TE, several tunnels have to be created between each source and destination (at least one per ECMP path). Hence, while no traffic engineering is required in the most likely situation of an IP network, the RSVP-TE solution always requires N²×K tunnels where N scales with the number of nodes in the

network and K with the number of ECMP paths. While no traffic engineering is required in the most likely situation of an IP network, the classic MPLS TE solution always requires all the IP traffic to be switched not as IP, but as MPLS TE circuits.

The consequence of this "full-mesh" is lots of operational complexity and limited scale, most of the time without any gain. Indeed, most of the time, all these tunnels follow the IGP shortest-path as the network is correctly capacity planned and no traffic engineering is required. This is largely suboptimal. An analogy would be that one needs to wear a raincoat and boots every day while it rains only a few days a year.
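To make the N²×K figure concrete, here is a small back-of-the-envelope sketch in Python. The node and ECMP counts are hypothetical examples, not measurements from any particular network.

```python
# Rough size of an always-on RSVP-TE full mesh: one LSP per ordered
# headend/tailend pair, multiplied by K parallel LSPs to approximate ECMP.
# N and K below are hypothetical example values.

def full_mesh_lsps(n_edge_nodes: int, k_lsps_per_pair: int) -> int:
    ordered_pairs = n_edge_nodes * (n_edge_nodes - 1)  # ~N^2 for large N
    return ordered_pairs * k_lsps_per_pair

for n, k in [(100, 1), (100, 4), (500, 4)]:
    print(f"N={n}, K={k} -> {full_mesh_lsps(n, k):,} LSPs")
# N=100, K=1 -> 9,900 LSPs
# N=100, K=4 -> 39,600 LSPs
# N=500, K=4 -> 998,000 LSPs
```

Even with modest edge counts, the always-on state grows quadratically, and all of it must be signaled and re-signaled hop-by-hop as the topology changes.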

RSVP operational complexity
"More than 15 years ago, when DT's IP/MPLS multi-service network design was implemented, everyone assumed that RSVP should be used as a traffic engineering technology in order to optimize the overall network efficiency. While the routers eventually proved that they can do RSVP, the operational experience was devastating: the effect of ECMP routing – even at that time – was completely underestimated, and mitigating the effects of parallel links with more and more TE tunnels made the overlay topology of explicit paths even more complicated. Eventually we found that IGP metric tuning, which cannot optimize network efficiency as perfectly as explicit paths, still does a fair job in terms of network efficiency but at a much lower cost of operational complexity. We continued to use RSVP for tactical cases of traffic engineering for quite a while. We merged with a network that used RSVP for the sake of Fast-Reroute. But finally we managed to fulfill all requirements of efficiency, Fast-Reroute and disjoint paths just with proper IGP metric tuning, LFA-based IP-FRR, and maintaining a suitable topology at the transport layer. Removing RSVP from the network design – although technically working – was considered a great advantage from all sides." — Martin Horneffer

Let's remember the two origins (in our opinion) of the classic RSVP-TE complexity and scale problem: the modeling on the ATM/FR circuit reference, and the decision to optimize bandwidth in a distributed manner instead of a centralized one.

In the early 2000s, Thomas Telkamp was managing the worldwide Global Crossing backbone from Amsterdam, the Netherlands. This was one of the first RSVP-TE deployments and likely the biggest at that time. I had the chance to work directly with him and learned the following three concepts through that experience:

1. The always-on RSVP-TE full-mesh model is way too complex because it creates continuous pain for no gain as, in most cases, the network is just fine routing along the IGP shortest-path. A tactical TE approach is more appealing. Remember the analogy of the raincoat: one should only need to wear the raincoat when it actually rains.
2. ECMP is key to IP. A traffic engineering approach for IP should natively support ECMP.
3. For real networks, routing convergence has more impact on SLA than bandwidth optimization.

Let's illustrate the third learning. Back in the early 2000s, Global Crossing was operating one of the first VoIP services. During any network failure, connectivity was lost for several tens of seconds while waiting for the network (IGP plus full-mesh of RSVP-TE tunnels) to converge. Considering that the network was correctly capacity planned for most expected failures, and that independent failures occur very often in a large network, it is easy to understand that the SLA impact of slow routing convergence is much more serious than the impact due to congestion. While everyone was focusing at that time on QoS and TE (the rare congestion problem), a very important problem was left without attention: routing convergence.

Thanks to Thomas and the Global Crossing experience, I then started the "Fast Convergence" project. In 6 years, we improved IS-IS/OSPF convergence in a reference worldwide network from 9.5 seconds to under 200 msec. This involved considerable optimization in all parts of the routing system, from the IS-IS/OSPF process to the router's linecard HW FIB organization, including the RIB, LDP, LSD (the MPLS RIB), BCDL (the bus from the route processor to the linecard) and the FIB process. It involved considerable lab characterization, either to monitor our progress towards our "sub-200 msec" target or to spot the next bottleneck.

In parallel to this fast IS-IS/OSPF convergence work, we also researched an IP-based automated FRR for sub-50 msec protection. As we noted earlier, we knew that RSVP-TE deployment was rare (10%) and that most of these deployments were not motivated by BW optimization but rather by FRR. So, if we found an IP-

optimized FRR that was simpler to operate, we knew that this would attract a lot of operator interest.

We started that research in the 2001 timeframe at the cafeteria of the Brussels Cisco office. This was the "water flow" discussion. If the course of a river through a valley gets blocked, it suffices to explicitly steer the water to the top of the correct hill and, from there, let the water flow naturally to its destination. The intuition was fundamental to our "IPFRR" research: the shorter the explicit path, the better. Contrary to RSVP-TE FRR, we never wanted to steer the packet around the failure and back to the other end of the failure. This model is ATM/FR "circuit" centric. We wanted a model that would be IP centric. We wanted to reroute around the failure as fast as possible, such as to release the packet as soon as possible to a natural IP path. Releasing the packet as soon as possible to a natural IP path was the target of our IPFRR project.

Very quickly, we found Loop-Free Alternate (LFA, RFC 6571). LFA allowed the IGP to pre-compute FRR backup paths for 75 to 90% of the IGP destinations (please refer to RFC 6571 for LFA applicability analysis and reports of solution coverage across real data sets). This solution received a lot of interest as it offered a much simpler FRR alternative than RSVP-TE. Later on, we extended the IPFRR coverage to 95/99% with the introduction of Remote LFA (RLFA, RFC 7490). This was sufficient for most operators and the deployment of RSVP-TE for sole FRR reasons stopped in favor of the much simpler LFA/RLFA alternative.

Still, two problems remained: the theoretical lack of a 100% coverage guarantee and the possibility that the LFA backup/repair path can be non-optimal (not along the post-convergence path). We had done ample research on the matter and we knew that we could not deliver these properties without explicit routing.

"In early 2000, RSVP-TE was the only solution available to provide fast-reroute. Implementing RSVP-TE for FRR in a network already running LDP for the primary path brings additional complexity in the network design and operations, as three protocols (IGP, LDP, RSVP-TE) will interact with each other: think about the sequence of protocol events when a single link fails in the network; when something goes wrong, troubleshooting is not so easy. The availability of IPFRR solutions was a great opportunity for network simplification by leveraging only one primary protocol (IGP) to provide FRR. Even if LFA/RLFA did not provide a 100% guarantee of protection, the gain in simplicity is a good reason to use them: a simple network is usually more robust." — Stéphane Litkowski

In parallel to the IGP fast convergence and the IPFRR research, we had a third related research effort: the search for a microloop avoidance solution3, i.e., a solution that would prevent packet drops due to inconsistent transient state between converging routers. We found several theoretical solutions along that research but never one with enough coverage or enough robustness for real-life deployment.

Let's stop for a moment here and summarize the key points we learned through these years of design, deployment and research:

- LDP was redundant and was creating unneeded complexity
- For the bandwidth optimization problem, the RSVP-TE full-mesh was modelled on ATM/FR "circuits" with a distributed optimization and, as a result, was too complex and not scalable. A tactical ECMP-aware, IP-optimized solution would be better.
- For the FRR problem, the RSVP-TE solution was modelled on SONET/SDH "circuits" and, as a result, was too suboptimal from a bandwidth and latency viewpoint and was too complex to operate. An ECMP-aware, IP-optimized solution which would release the packet as soon as possible to its shortest-path was much better. Our IPFRR research led to LFA and RLFA. Most operators were satisfied with the 90-99% coverage. Still, the 100% coverage guarantee was missing. We knew that a form of explicit routing would be required to get that property.
- MPLS was perceived to be too complex to deploy and manage, and most Enterprise network and some SP network operators stayed away from it

- Microloop avoidance was an unsolved problem

Around spring 2012, one of the research projects I was managing was related to OAM in ECMP networks. It was indeed difficult for operators to spot forwarding issues involving a subset of the ECMP paths to a destination. As part of that work, we came up with a proposal to allocate MPLS labels for each outgoing interface of each router and then steer OAM probes over a deterministic path by pushing the appropriate label stack on the probe. At each hop, the top label would be popped and would indicate the outgoing interface on which to forward the packet.

The proposal was not very interesting, for several reasons. First, BFD was already largely deployed on a per-link basis and hence issues impacting any traffic on a link were already detected. What we had to detect was a failure related to a specific FIB destination, and this idea was missing the target. Second, this idea would require too many labels, as one label per link hop was required and the network diameter could be very large.

However, at that time, I was planning a trip to Rome and the following intuition came to me: when I go to Rome from Brussels, I listen to the radio for traffic events. If I hear that the Gottardo tunnel is blocked, then I mentally switch to the path to Geneva, and then from there to the path to Rome.

This intuition initiated the Segment Routing project: in our daily life, we do not plan our journeys turn by turn; instead we plan them as a very small number of shortest-path hops. Applying the intuition to real networks: real traffic engineering problems would be solved by one or two intermediate shortest-paths, one or two segments in SR terminology.

The years of designing and deploying real technology in real networks had taught me that the simplest ideas prevail and that unneeded sophistication leads to a lot of cost and complexity. Obviously, I knew that I would want to prove this in a scientific manner by analyzing real-life networks and applying traffic engineering problems to them. I knew we needed to confirm it

scientifically… but I had seen enough networks that I felt the intuition was correct.

HIGHLIGHT: Few segments would be required to express an explicit path
When we explain a path to a friend, we do not describe it turn by turn but instead as a small number of shortest-path hops through intermediate cities. Applying the intuition to real networks: real traffic engineering problems would be solved by one or two intermediate shortest-paths, one or two segments in SR terminology. While the theoretical bound scales with the network diameter, few segments would be required in practice.

This was the start of the Segment Routing project. We would distribute labels in the routing protocol (i.e., the prefix segments) and then we would build an ECMP-aware, IP-optimized explicit path solution based on stacking a few of these labels (i.e., the segment list). In doing so, we would drastically simplify the MPLS control-plane by removing LDP and RSVP-TE while improving its scalability and functionality (tactical bandwidth engineering, IPFRR with 100% coverage, microloop avoidance, OAM). We will explain these concepts in detail in this book.

In fact, the intuition of the "path to Rome" also gave the first name for "SR". "SR" originally meant "Strade Romane", which is Italian for the network of roads built by the Roman Empire. By combining any of these roads, the Romans could go from anywhere to anywhere within the Empire.

Figure A-1: Clarence at the start of the "Appia" Roman road in Rome; the network of Roman roads

Later on, we turned SR into Segment Routing ☺.

A.3 The SDN and OpenFlow Influences

In 2013, two fundamental papers were published at Sigcomm: SWAN (Software-driven WAN) from Microsoft4 and B4 from Google5. While reading these excellent papers, it occurred to me that, while I agreed with the ideas, I would not have implemented them with an OpenFlow concept as, in my opinion, it would not scale. Indeed, I had spent several years improving routing convergence speed and hence I knew that the problem was not so much in the control plane and the algorithmic computation of a path, but much more in transferring the FIB updates from the routing processor down to the linecards and then writing the updates from the linecard CPU to the hardware forwarding logic. To give an order of magnitude: while the routing processor would compute and update an entry in µsec, the rest of the process (distribute to linecards, update hardware forwarding) was in msec when we started the fast convergence project. It took many years of work to get that component down to tens of µsec. Hence, by intuition, I would have bet that the OpenFlow-driven approach would run into severe convergence problems: it would take way too much time for the centralized control-plane to send updates to the switches and have them install these updates in HW. Hint to the reader: do some reverse engineering of the numbers published in the OpenFlow papers and realize how slow these systems were.

Nevertheless, reading these papers was essential to the SR project. Using my "path to Rome" intuition, I thought that a much better solution would be to combine centralized optimization with a traffic-engineering solution based on a list of prefix segments.

The need for centralized optimization is clearly described in a paper of Urs Holzle, also at Google.6 It details the drawbacks due to the distributed signaling mechanism at the heart of RSVP-TE: lack of optimality, lack of predictability and slow convergence. In a distributed signaling solution, each router behaves like a kid sitting at a table where bandwidth is represented as a pile of candies at the center of the table. Each router vies for its bandwidth as a kid

for its candy. This uncoordinated fight leads to lack of optimality (each kid is not ensured to get its preferred taste or what is required for his health condition), lack of predictability (no way to guess which kid will get what candy) and slow convergence (kids fighting for the same candy, retrying for others etc.).

HIGHLIGHT: Centralized Optimization
Read the paper of Urs Holzle6 for an analysis of the centralized optimization benefits over RSVP-TE distributed signaling: optimality, predictability and convergence.

While the need for centralized optimization was clear and I agreed with these papers, the use of an OpenFlow-like technique was, in my opinion, the issue. It would lead to far too many interactions between the centralized controller and the routers in the network. The first reason is that the OpenFlow granularity is far too small: a per-flow entry on a per-switch basis. To build a single flow through the network, all the routers along the path need to be programmed. Multiply this by tens of thousands or millions of flows and this is clearly too much state. The second reason is that every time the network changes, the controller needs to re-compute the flow placement and potentially update a large part of the flow entries throughout the network.

Instead, the intuition was to let the centralized controller use segments as bricks of the LEGO® construction toy. The network would offer basic bricks as ECMP-aware shortest paths to destinations (Prefix Segments) and the controller would combine these LEGO® bricks to build any explicit path it would want (but with as little per-flow state as possible, in fact only one, at the source).

To route a demand from San Francisco (SFO) to New York City (NYC) via Denver (DEN), rather than programming all the routers along the path with the appropriate per-flow entry, the controller would only need to program one single state at the source of the demand (SFO). The state would consist of a source-routed path expressed as an ordered list of segments {DEN, NYC}. This is what we refer to as SR Traffic Engineering (SR-TE).
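As a minimal sketch of this idea (hypothetical node names and a toy forwarding model, not any router's actual data structures), the following Python snippet shows that the only per-demand state is the segment list held at the ingress, while every other node forwards purely on its IGP shortest-path state for each prefix segment:

```python
# Toy model: per-demand state only at the ingress (the segment list);
# transit nodes only hold per-prefix-segment shortest-path state.
# Node names and next-hop tables are hypothetical illustrations.

# IGP shortest-path next hop per (node, prefix-segment destination).
NEXT_HOP = {
    ("SFO", "DEN"): "SLC", ("SLC", "DEN"): "DEN",
    ("DEN", "NYC"): "CHI", ("CHI", "NYC"): "NYC",
}

def forward(ingress, segment_list):
    """Walk a packet through the toy network following its segment list."""
    path, node, segments = [ingress], ingress, list(segment_list)
    while segments:
        active = segments[0]
        if node == active:                    # active segment completed
            segments.pop(0)                   # move to the next segment
            continue
        node = NEXT_HOP[(node, active)]       # per-prefix-segment IGP forwarding
        path.append(node)
    return path

# One single state at the source: the demand SFO->NYC via DEN is just [DEN, NYC].
print(forward("SFO", ["DEN", "NYC"]))   # ['SFO', 'SLC', 'DEN', 'CHI', 'NYC']
```

A transit node such as SLC or CHI never learns anything about the SFO-to-NYC demand; it only resolves the prefix segment currently active.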

“Segment Routing represents a major, evolutionary step forward for the design, control and operation of modern, largescale, data-center or WAN networks. It provides for an unprecedented level of control over traffic without the concomitant state required by existing MPLS control plane technologies. That it can accomplish this by leveraging MPLS data plane technologies means that it can be introduced into existing networks without major disruption. This accords it a significant ease-of-deployability advantage over other SDN technologies. It is an exciting time to be a network provider because the future of Segment Routing holds so much potential for transforming networks and the services provided by them.” — Steven Lin

SR-TE is a hybrid centralized/distributed cooperation between the controller and the network. The network maintains the multi-hop ECMP-aware segments and the centralized controller combines them to form a source-routed path through the network. State is removed from the network. State is only present at the ingress to the network and then in the packet header itself.

{DEN, NYC} is called a segment list. DEN is the first segment. NYC is the last segment. Steering a demand through {DEN, NYC} means steering the demand along the ECMP-aware shortest-path to Denver and then along the ECMP-aware shortest-path to NYC. Note the similarity with the initial intuition for SR: if the shortest-path to Rome is jammed, then the alternate path is {Geneva, Rome}. The human mind does not express a path as a turn-by-turn journey; instead it expresses it as a minimum number of subsequent ECMP-aware shortest-paths.

In the early days of SR, we used a baggage tag analogy to explain SR to a non-technical audience, see Figure A-2. Imagine one needs to send a baggage item from Seattle to Berlin (TXL), with transit in Mexico City (MEX) and Madrid (MAD). Clearly, the transportation system does not associate a single ID with the baggage (a flow ID) and then create circuit state in Mexico and Madrid to recognize the baggage ID and route accordingly. Instead, the transportation system scales better by appending a tag "Mexico then Madrid then Berlin" to the baggage at the source of the journey. This way, the transportation system is unaware of each individual baggage's specifics along the journey. It only knows about a few thousand airport codes (Prefix Segments) and routes the baggage from airport to airport according to the baggage tag.

Segment Routing does exactly the same: a Prefix Segment acts as an airport code; the Segment List in the packet header acts as the tag on the baggage; the source knows the specifics of the packet and encodes them as a Segment List in the packet header; the rest of the network is unaware of that specific flow and only knows about a few hundred or thousand Prefix Segments.

Figure A-2: Baggage tag analogy to explain SR to non-technical audience

HIGHLIGHT: Combining segments to form an explicit path. Hybrid coupling of centralized optimization with distributed intelligence
The network would offer basic LEGO® bricks as ECMP-aware shortest paths to destinations (Prefix Segments). The distributed intelligence would ensure these segments are always available (IGP convergence, IP FRR). The controller would acquire the entire topology state (with the segments supported by the network) and would use centralized optimization to express traffic-engineering policies as a combination of segments. It would consider segments as LEGO® bricks. It would combine them in the way it wants, to express the policy it wants. The controller would scale better thanks to the source-routing concept: it would only need to maintain state at the source, not throughout the infrastructure.

The network would keep the right level of distributed intelligence (IS-IS and OSPF distributing the so-called Prefix and Node Segments) and the controller would express any path it desires as a list of segments. The controller's programming job would be drastically reduced because state would only need to be programmed at the source and not all along the flow path. The controller scaling would also be much better thanks to the reliance on the IGP intelligence. Indeed, in real life, the underlying distributed intelligence can be trusted to adapt correctly to topology changes, reducing the need for

the controller to react, re-compute and update the segment lists. For example, the inherent IGP support for an FRR solution ensures that the connectivity to prefix segments is preserved and hence the connectivity along a list of segments is preserved. Upon failure, the controller can leverage this IGP FRR help to dampen its reaction and hence scale better.

This is why we say that Segment Routing is a hybrid centralized/distributed optimization architecture. We marry distributed intelligence (shortest-path, FRR, microloop avoidance) with centralized optimization for a policy objective like latency, disjoint-ness, avoidance and bandwidth.

In SR, we would tackle the bandwidth engineering problem with a centralized controller in a tactical manner: the controller would monitor the network and the requirements of applications and would push SR Policies only when required. This would give us better optimality and predictability. We would scale the controller by expressing traffic engineering policies with lists of segments and hence would only need per-demand state at the source.

Our intuition was that this solution would also converge faster. Why? Because the comparison has to be made with an N²×K full-mesh of RSVP-TE tunnels that vie for bandwidth every time the topology changes. It was well known that operators had seen convergence times as slow as 15 minutes. In comparison, our approach would only have to compute SR Policies in the rare cases when congestion occurs. Few states would need to be programmed, thanks to the source-routed nature. By combining sufficient centralized compute performance with our scaled source-routing paradigm, we would likely reduce the convergence time.

Figure A-3: John Leddy highlighting the importance of keeping the network stateless and hence the role of SR for SDN done right

On SDN and the role of Segment Routing
"SR is SDN done right!" — John Leddy

The provisioning of such tactical SR Policies can also be done directly on the edge router in the network as a first transition, without the need for a controller. However, the benefits of bringing in the centralized intelligence are obvious, as described above, and enable the transition to an SDN architecture. We will see in this book how basic Segment Routing addresses the requirements for which operators used to turn to RSVP-TE, and in the next book we will introduce real traffic engineering use-cases at a scale that was challenging if not impossible until now.

Aside from this influence, SDN proved fundamental to SR; it simply allowed it to exist: the applicability of SR to SDN provided us with the necessary tailwind to push our proposal through against the dominance of the classic MPLS control-plane.

Over the last three years, seeing the numerous designs and deployments, we can say that SR has become the de-facto network architecture for SDN. Watch, for example, the SR analysis for SWAN.7

A.4 100% Coverage for IPFRR and Optimum Repair Path

Now that we had a source-routed explicit path solution, it was straightforward to close our IPFRR research. LFA and RLFA had many advantages but only gave us 99% coverage and were not always selecting the optimum path. Thanks to SR, we could easily solve both problems.

The point of local repair (PLR) would pre-compute the post-convergence path to a destination, assuming its primary link, node or SRLG fails. The PLR would express this explicit path as a source-routed list of segments. The PLR would install this backup/repair path on a per-destination basis in the data plane. The PLR would detect the failure in sub-10 msec and would trigger the backup/repair paths in a prefix-independent manner.

This solution would give us 100% coverage in any topology. We would call it "Topology-Independent LFA" (TI-LFA). It would also ensure optimality of the backup path (the post-convergence path on a per-destination basis). We will explain all the details of TI-LFA in this book.
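The following is a minimal sketch of the principle only, using networkx and a hypothetical topology: the PLR pre-computes, per destination, the IGP shortest path that would be used after the protected resource fails, i.e., the post-convergence path. Encoding that path as a small segment list and installing it in the data plane, which is the heart of TI-LFA, is not shown here.

```python
# Principle sketch: the PLR pre-computes, per destination, the shortest path
# the IGP would use after the protected link fails (post-convergence path).
# Topology, node names and metrics are hypothetical.
import networkx as nx

g = nx.Graph()
g.add_weighted_edges_from([
    ("PLR", "B", 10), ("PLR", "C", 10), ("C", "D", 10),
    ("D", "B", 10), ("B", "DST", 10), ("D", "DST", 30),
])

protected_link = ("PLR", "B")

def post_convergence_path(graph, plr, dest, failed_link):
    """Shortest path from the PLR to dest assuming the protected link is down."""
    g2 = graph.copy()
    g2.remove_edge(*failed_link)
    return nx.shortest_path(g2, plr, dest, weight="weight")

print(post_convergence_path(g, "PLR", "DST", protected_link))
# ['PLR', 'C', 'D', 'B', 'DST']  (repair path that releases the packet at B)
```

The repair path is the one the network will converge to anyway, which is why pre-installing it at the PLR gives an optimum backup without creating state anywhere else.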

HIGHLIGHT: TI-LFA benefits
- Sub-50 msec link, node and SRLG protection
- 100% coverage
- Simple to operate and understand
- Automatically computed by the IGP, no other protocol required
- No state created outside the protecting state at the PLR
- Optimum: the backup path follows the post-convergence path
- Incremental deployment
- Also applies to IP and LDP traffic

A.5 Other Benefits

The following problems in IP/MPLS networks also motivated our SR research:

- IP had no native explicit routing capability. It was clear that, with an explicit routing capability, IPv6 would have a central role to play in future infrastructure and service network programming. SRv6 is playing a key role here. We will detail this in Part III of the book.
- RSVP-TE did not scale inter-domain for basic services such as latency and disjoint-ness. In a previous section, we analyzed the drawbacks of the RSVP-TE solution for the bandwidth optimization problem. RSVP-TE has another issue: for latency or disjoint-ness services, it did not scale across domains. The very nature of modern IP networks is to be multi-domain (e.g., data-center (DC), metro, core). SR is now recognized as a scalable end-to-end policy-aware IP architecture that spans DC, metro and backbone. This is an important concept which we will detail in this book (the scalability and base design) as well as in Part II (the TE-specific aspects).
- MPLS did not make any inroads in the Data Center (DC). We felt that while the MPLS data plane had a role to play in the DC, this could not be realized due to the inherent complexities of the classic MPLS control plane. SR MPLS is now a reality in the DC. We will document this in the BGP Prefix-SID section and the DC use-case.
- OAM was fairly limited. Operators were reporting real difficulty detecting forwarding issues in IP networks, especially when the issue is associated with a specific FIB entry (destination based), potentially for a specific combination of ingress port at the faulty node and ECMP hash. SR has delivered solutions here, which will be detailed in the OAM section of this book.
- Traffic matrices were important but were actually unavailable to most operators. Traffic matrices are the key input to the capacity planning analysis that is one of the most important processes within an operator. Most operators did not have any traffic matrices. They were basically too complex to be built with the classic IP/MPLS architecture. SR has delivered an automated traffic matrix collection solution.

A.6 Team

David Ward, SVP and Chief Architect for Cisco Engineering, was essential in realizing the opportunity of SR and approved the project in September 2012. We created the momentum by keeping focus on execution and delivering an impressive list of functionality and use-cases: IS-IS and OSPF support for SR, BGP support for SR, the SR/LDP seamless interworking, the TI-LFA 100% FRR solution, microloop avoidance, distributed SR-TE, centralized SR-TE, the egress peer traffic engineering use-case, etc.

Ahmed Bashandy was the first person I recruited for the SR project. I had worked with him during the Fast Convergence project and I knew him well. He had deep knowledge across the entire IOS XR system (he had coded in IS-IS, BGP and FIB). This was essential for bringing up our first SR implementation as we had to touch all the parts of the system. Ahmed is fun to work with. This is essential. Finally, Ahmed is not slowed down by doubts. Many engineers would have spent hours asking why we would do it, what the difference was against classic MPLS, whether we had to first have IETF consensus, whether we had any chance to get something done in the end… all sorts of existential questions that are the best way to never do anything. This was certainly not going to be an issue with Ahmed and this was essential. We had no time to lose. Ahmed joined in December 2012 and we had a first public demo by March 2013.

Bertrand Duvivier was the second person to join the team. I needed someone to manage the relationship with marketing and engineering once we became public. I had known Bertrand since 1998. He is excellent at understanding a technology and its business benefits, and at managing the engineering and marketing relationships. Bertrand joined the team in February 2013. We come from the same region (Namur, Belgium), we speak the same language (French, with the same accent and dialect) and share the same culture, hence it is very easy to understand each other.

Kris Michielsen was the third person to join. Again, I knew him from the Fast Convergence project. Kris had done all the Fast Convergence characterization. He is excellent at managing lab characterization, thoroughly analyzing the origins of bottlenecks, creating training and transferring information to the field and operators. We had known each other since 1996.

Stefano Previdi was the fourth person to join. I had known him since I started at Cisco in June 1996. We had done all the MPLS deployments together and then IPFRR. Stefano would focus on the IETF and we will talk more about this later.

Aside from the technical expertise, we were all in the same time zone, we had known each other for a long time, we could understand each other well and we had fun working together. These are essential ingredients.

Later on, Siva Sivabalan and Ketan Talaulikar would join us. Siva would lead the SR-TE project and Ketan would contribute to the IGP deployment.

Operators have played a fundamental role in the SR project. In October 2012, at the annual Cisco Network Architecture Group (NAG) conference (we invite key operator architects for 2 days and have open discussions on any relevant subject of networking, without any marketing compromise), I presented the first session on SR, describing the improved simplicity, the improved functionality (FRR, SR-TE), the improved scale (flow state only at the source) and the opportunity for SDN (hybrid centralized and distributed optimization). This session generated a lot of interest. We immediately created a lead operator group and started defining SR.

John Leddy from Comcast, Stéphane Litkowski and Bruno Decraene from Orange, Daniel Voyer from Bell Canada, Martin Horneffer from DT, Rob Shakir then at BT, Josh George then at Google, Ebben Aries then at Facebook, Dmitry Afanasiev and Daniel Ginsburg at Yandex, Tim Laberge and Steven Lin then at Microsoft, and Mohan Nanduri and Paul Mattes from Microsoft were among the most active in the group. Later on, as the project evolved, many other engineers and operators joined the effort and we will have the opportunity to introduce their viewpoints throughout this book. Again, most of us were in the same time zone; we knew each other from previous projects. It was easy to understand each other and we had the same focus. This generated a lot of excellent discussions that shaped most of the technology and use-cases illustrated in this book.

Within a few months of defining this technology, we received a formal letter of intent to deploy from

an operator. This was a fantastic proof of the business interest and it helped us fund the first phase of the project.

Ravi Chandra proved essential to carrying our project beyond its first phase. Ravi was leading the IOS XR, IOS XE and NX-OS software at Cisco. He very quickly understood the SR opportunity and funded it as a portfolio-wide program. We could then really tackle all the markets interested in SR (hyper-scale WEB operators, SP and Enterprise) and all the network segments (DC, metro/aggregation, edge, backbone).

"Segment Routing is a core innovation that I believe will change how we do networking, like some of the earlier core technologies such as MPLS. Having the ability to do Source Routing at scale will be an invaluable tool in many different use cases in the future." — Ravi Chandra, SVP, Cisco Systems Core Software Group

A.7 Keeping Things Simple

More than anything, we believe in keeping things simple. This means that we do not like technology for the sake of technology. If we can avoid technology, we avoid technology. If we can state a reasonable design guideline (one that is correct and fair for real networks) such that the problem becomes much simpler and we need less technology to solve it, then we are not shy of making the assumption that such a guideline will be met.

With SR, we try to package complex technology and algorithms in a manner that is simple to use and operate. We favor automated behaviors. We try to avoid parameters, options and tuning. TI-LFA, microloop avoidance, centralized SR-TE optimization (Sigcomm 2015 paper8) and distributed SR-TE optimization are such examples.

Simplicity “The old rule to make things "as simple as possible, but not simpler" is always good advice for network design. In other words it says that the parameter complexity should be minimized. However, whenever you look closer at a specific problem, you will usually find that complexity is a rather multi-dimensional parameter. Whenever you minimize one dimension, you will increase another. For example, in order to provide sets of disjoint paths in an IP backbone, you can either ask operations to use complicated technology and operational procedures to satisfy this requirement. Or you can ask planning to provide a suitable topology with strict requirements on the transport layer. The art of good network design is, in my eyes, based on a broad overview of possible dimensions of complexity, good knowledge of the cost related to each dimension, and wise choice of the trade-offs that come with specific design options. I am convinced that segment routing offers a lot of very interesting network design options that help to reduce overall network cost and increase its robustness.” — Martin Horneffer

The point of keeping things simple is to choose well where to put intelligence to solve a real problem, how to package the intelligence to simplify its operation and where to simplify the problem by an appropriate design approach.

A.8 Standardization and Multi-Vendor Consensus

When we created the lead-operator group, we pledged three characteristics of our joint work: (1) transparency, (2) commitment to standards, (3) commitment to multi-vendor consensus.

The transparency meant that we would define the technology together; we would update the lead operator group with progress, issues and challenges. Our commitment to standardization meant that we would release all the necessary information to the IETF and would ensure that our implementation would be entirely in line with the released information. We clearly understood that this was essential to our operators and hence it was essential to us.

Clearly, getting our ideas through the IETF was a fantastic challenge. We were challenging 20 years of classic MPLS control-plane and were promoting concepts (global labels) that were seen as a non-starter by some prominent parts of the IETF community. Our strategy was the following:

1. Be positive. We knew that we would be attacked in cynical ways. The team was instructed to never reply and never state anything negative or emotional.
2. Lead by the implementation. We had to take the risk of implementing what we were proposing. We had to demonstrate the benefits of the use-cases.
3. Get the operators to voice their requirements. The operators who wanted SR knew that they would need to state their demands very clearly. The implementation and the demonstration of the use-cases would strengthen their argumentation.
4. Get other vendors to contribute. Alcatel-Lucent (Wim Henderickx) and Ericsson (Jeff Tantsura) quickly understood the SR benefits and joined the project. More or less a year after our initial implementation, we (i.e., Cisco) could demonstrate interoperability with Alcatel and Ericsson and it was then clear that the multi-vendor consensus would be obtained. Shortly after, Huawei also joined the effort.
5. Shield the team from this standardization activity while handling the process with the highest quality.

Stefano joined the team to handle this last point. He would lead and coordinate all our IETF SR involvement on behalf of the team. The team would contribute by writing text but would rarely enter into mailing list discussions and argumentation. Stefano would handle it. This was essential to get the highest quality and consistency of the argumentation, but also to shield the team from the emotional implications of the negotiation. While we were handling the most difficult steps of the IETF process and negotiation, the rest of the engineers were focused on coding and releasing functionality to our lead operators. This was key to preserving our implementation velocity, which was a key pillar of the strategy.

Standardization and Multi-Vendor Focus for the SR team
"Operators manage multi-vendor networks. Standardization and interoperability are thus essential, especially for routing-related subjects. It would have been easier and faster for the SR team to work alone and create a proprietary technology, as big vendors sometimes do. Yet, Clarence understood the point, committed to work with all vendors and did make it through the whole industry." — Bruno Decraene

A.9 Global Label

Clearly, SR was conceived with global labels in mind. Since global labels had been discussed in the IETF in the early days of MPLS, we knew that some members of the community would be strongly against this concept.

Very early, we came up with the indexing solution in the SRGB9 range. We thought it was a great idea because, on one hand, operators could still use the technology with global labels (simply by ensuring that all the routers use the same SRGB range), while on the other hand, the technology was theoretically not defining global labels (but rather global indexes) and hence would more easily pass the IETF process.

We presented this idea to the lead operator group in February 2013 and they rejected it. Their main concern was that they wanted global labels, they wanted the ease of operation of global labels, and hence they did not want this indexing operation. We noted their preference and, when we released the proposal publicly in March 2013, our first IETF draft and our first implementation were using global labels without an index.

As expected, global labels created lots of emotions… To get our multi-vendor consensus, we invited several vendors to Rome for a two-day discussion (see Figure A-4). We always meet in Rome when an important discussion must occur. This is a habit that we took during the IPFRR project.

Figure A-4: Note the bar "Dolce Roma" ("Sweet Rome") behind the team ☺. From left to right: Stefano Previdi, Peter Psenak, Wim Henderickx, Clarence Filsfils and Hannes Gredler

During the first day of the meeting, we could resolve all issues but one: Juniper and Hannes were demanding that we re-introduce the index in the proposal. Hannes' main concern was that, in some contexts with legacy hardware-constrained systems, it may not be possible to come up with a single contiguous SRGB block across vendors. At the end of the first day, we organized a call with all the lead operators, explained the status and advised re-introducing the index. Multi-vendor consensus and standardization was essential, and operators could still get the global label behavior by ensuring that the SRGB is the same on all nodes.

Operators were not happy, as they really wanted the simplicity of global labels, but in the end they agreed. On the second day, we updated the architecture, IS-IS, OSPF and use-case drafts to reflect the changes and, from then on, we had our multi-vendor consensus.

This story is extremely important as it shows that SR (for MPLS) was really designed and desired with global labels, and that operators should always design and deploy SR with the same SRGB across all their routers. This is the simplest way to use SR and the best way to collect all its benefits. Sure, a diverse SRGB is supported by the implementation and the technology… but it is supported mostly for the rare interoperability reasons10 in case no common SRGB range can be found in deployments involving legacy hardware-constrained systems. Operators really should use SR as per the original idea of global labels.

During the call with the operators, as they were very reluctant to accept the index, I committed that the Cisco implementation would allow for both the global-label and index forms, in the configuration and in the show commands. Indeed, by preserving the look and feel of the global label in the configuration and in the show commands, and by designing with the same SRGB across all the routers, an operator would have what they want: global labels. They would never have to deal with an index.

This section is a good example of the use of "opinion" text boxes. Without this historical explanation, one could wonder why we have two ways to configure prefix segments in MPLS. This is the reason. The use of the SRGB, the index, global and local labels are described in detail in this book.
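As a small illustration of the index mechanism discussed above (the SRGB bases and SID index below are hypothetical examples, although 16000 is a commonly used default base): the label a node programs for a prefix segment is its SRGB base plus the advertised index, so a homogeneous SRGB yields the same label on every node, i.e., the global-label behavior operators asked for.

```python
# Hypothetical SRGB bases and prefix-SID index, to illustrate the arithmetic.
def prefix_sid_label(srgb_base: int, sid_index: int) -> int:
    """Label a node programs for a prefix segment: SRGB base + index."""
    return srgb_base + sid_index

sid_index = 65                       # index advertised for some prefix segment

homogeneous = [16000, 16000, 16000]  # same SRGB base on all routers
print([prefix_sid_label(b, sid_index) for b in homogeneous])  # [16065, 16065, 16065]

mixed = [16000, 20000]               # different SRGB bases (discouraged design)
print([prefix_sid_label(b, sid_index) for b in mixed])        # [16065, 20065]
```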

A.10 SR MPLS

It is straightforward to apply the SR architecture to the MPLS data plane. A segment is a label. A list of segments is a stack of labels. The active segment is the top of the stack. When a segment is completed, the related label is popped. When an SR policy is applied on a packet, the related label stack is pushed on the packet. The SR architecture reuses the MPLS data plane without any change. Existing infrastructure only requires a software upgrade to enable the SR control-plane.

Operators were complaining about the lack of scale, the lack of functionality and the inherent complexity of the classic MPLS control plane. The MPLS data plane was mature and very widely deployed. For these reasons, our first priority has been to apply SR to the MPLS data plane. This was the main focus of our first three years of the SR project and this is also the focus of this book. It is important to remember that SR MPLS applies to both IPv4 and IPv6.

As part of the initial focus on SR MPLS, it was essential to devise a solution to seamlessly deploy SR in existing MPLS networks. The SR team and the lead operator group dedicated much of the initial effort to this problem. A great solution was found to allow SR and LDP brownfield networks to seamlessly co-exist or, better, inter-work. This SR/LDP seamless solution is described in detail in this book.
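The mapping described above can be summarized with a toy sketch (hypothetical label values; this models only the stack behavior, not a real forwarding plane):

```python
# Toy illustration of the SR-MPLS mapping: a segment list is imposed as a
# label stack at the policy headend; the active segment is the top label;
# completing a segment pops it. Label values are hypothetical.

def apply_sr_policy(packet: dict, segment_list: list) -> dict:
    """Headend: push the label stack corresponding to the segment list."""
    packet["label_stack"] = list(segment_list)   # top of stack = first segment
    return packet

def complete_active_segment(packet: dict) -> dict:
    """When the active (top) segment is completed, its label is popped."""
    packet["label_stack"].pop(0)
    return packet

pkt = apply_sr_policy({"payload": "IP packet"}, [16065, 16042])
print(pkt["label_stack"])   # [16065, 16042] -> active segment is 16065
complete_active_segment(pkt)
print(pkt["label_stack"])   # [16042]
```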

A.11 SRv6

The SR architecture has been thought from day one for its application to the IPv6 data plane. This is referred to as "SRv6". All the research we did on automated TI-LFA FRR, microloop avoidance, distributed traffic engineering, centralized traffic engineering, etc. is directly applicable to SRv6.

We believe that SRv6 plays a fundamental role in the value of IPv6 and will significantly influence all future IP infrastructure deployments, whether in the DC, the large-scale aggregation or the backbone. SRv6's influence will expand beyond the infrastructure layer. An IPv6 address can identify any object, any content or any function applied to an object or a piece of content. SRv6 could offer formidable opportunities for chaining micro-services in a distributed architecture or for content networking.

We would recommend reading the paper of John Schanz from Comcast11, watching the session of John Leddy on Comcast's SRv6-based Smarter Network concept12 and watching the demonstration of the "Spray" end-to-end SRv6 use-case13 to understand the potential of SRv6.

Figure A-5: John Leddy presenting on the “Smarter Network” concept and highlighting the SRv6 role

John Leddy played a key role in highlighting the SRv6 potential for drastically enhancing the application interaction with the network. We will focus on the SRv6 technology and use-cases in the third book. This first book introduces the Segment Routing architecture by first addressing the simpler use-cases related to the MPLS data plane and the MPLS infrastructure from DC to aggregation to backbone. In this part of the book, we only provide a small introduction to SRv6. We will dedicate much more content to it in a following book.

A.12 Industry Benefits

We have seen SR being applied to the hyper-scale WEB, SP and Enterprise markets. We have seen use-cases in the DC, in the metro/aggregation and in the WAN. We have seen (many) use-cases for an end-to-end policy-aware architecture from the server in the DC through the metro and the backbone.

We believe that the SR benefits are the following:

- simplification of the control-plane (LDP and RSVP-TE removed, LDP/IGP synchronization removed)
- topology-independent IP-optimal 50 msec FRR (TI-LFA)
- microloop avoidance
- support for tactical traffic engineering (explicit path encoded as a segment list)
- centralized optimization benefits (optimality, predictability, convergence)
- scalability (per-flow state is only at the source, not throughout the infrastructure)
- seamless deployment in existing networks (applies equally to SR-MPLS and SRv6)
- de-facto architecture for SDN
- standardized
- multi-vendor consensus
- strong requirement from operators
- defined closely with operators for real use-cases
- solving unsolved problems (TI-LFA, microloop avoidance, inter-domain disjointness/latency policies…)
- cost optimization (through improved capacity planning with tactical traffic engineering)
- an IPv6 segment can identify any object, any content, or any function applied to an object; this will likely extend SR's impact well beyond the infrastructure use-cases

Most of these benefits will be described in this book. The TE and SRv6 benefits will be detailed in the next book.

A.13 References

[draft-francois-rtgwg-segment-routing-uloop] Francois, P., Filsfils, C., Bashandy, A., and S. Litkowski, "Loop avoidance using Segment Routing", draft-francois-rtgwg-segment-routing-uloop (work in progress), June 2016, https://datatracker.ietf.org/doc/draft-francois-rtgwg-segment-routing-uloop.

[RFC5443] Jork, M., Atlas, A., and L. Fang, "LDP IGP Synchronization", RFC 5443, DOI 10.17487/RFC5443, March 2009, https://datatracker.ietf.org/doc/rfc5443.

[RFC6138] Kini, S., Ed., and W. Lu, Ed., "LDP IGP Synchronization for Broadcast Networks", RFC 6138, DOI 10.17487/RFC6138, February 2011, https://datatracker.ietf.org/doc/rfc6138.

[RFC6571] Filsfils, C., Ed., Francois, P., Ed., Shand, M., Decraene, B., Uttaro, J., Leymann, N., and M. Horneffer, "Loop-Free Alternate (LFA) Applicability in Service Provider (SP) Networks", RFC 6571, DOI 10.17487/RFC6571, June 2012, https://datatracker.ietf.org/doc/rfc6571.

[RFC7490] Bryant, S., Filsfils, C., Previdi, S., Shand, M., and N. So, "Remote Loop-Free Alternate (LFA) Fast Reroute (FRR)", RFC 7490, DOI 10.17487/RFC7490, April 2015, https://datatracker.ietf.org/doc/rfc7490.

1. Note that this entire chapter should be considered subjective.↩
2. Cisco WAN Automation Engine (WAE), http://www.cisco.com/c/en/us/products/routers/wae-planning/index.html and http://www.cisco.com/c/en/us/support/routers/wae-planning/model.html.↩
3. draft-francois-rtgwg-segment-routing-uloop, https://datatracker.ietf.org/doc/draft-francois-rtgwg-segment-routing-uloop.↩
4. http://conferences.sigcomm.org/sigcomm/2013/papers/sigcomm/p15.pdf.↩

5. http://conferences.sigcomm.org/sigcomm/2013/papers/sigcomm/p3.pdf.↩
6. http://www.opennetsummit.org/archives/apr12/hoelzle-tue-openflow.pdf.↩
7. Paul Mattes, "Traffic Engineering in a Large Network with Segment Routing", Tech Field Day, https://www.youtube.com/watch?v=CDtoPGCZu3Y.↩
8. http://conferences.sigcomm.org/sigcomm/2015/pdf/papers/p15.pdf.↩
9. Segment Routing Global Block, see chapter 4, "Management of the SRGB" in Part I of the SR book series.↩
10. In extreme/rare cases, this could help to migrate some hardware-constrained legacy systems, but one should be very careful to limit this usage to a transient migration. To the best of our knowledge, all Cisco platforms support the homogeneous SRGB design guideline at the base of SR MPLS.↩
11. John D. Schanz, "How IPv6 lays the foundation for a smarter network", http://www.networkworld.com/article/3088322/internet/how-ipv6-lays-the-foundation-for-a-smarter-network.html.↩
12. John Leddy, "Comcast and The Smarter Network with John Leddy", Tech Field Day, https://www.youtube.com/watch?v=GQkVpfgjiJ0.↩
13. Jose Liste, "Cisco Segment Routing Multicast Use Case Demo with Jose Liste", Tech Field Day, https://www.youtube.com/watch?v=W-q4T-vN0Q4.↩

B. Confirming the Intuition of SR Book Part I

This short chapter leverages an excellent public presentation by Alex Bogdanov at MPLS WC 2019 in Paris, describing his traffic engineering experience operating one of the largest networks in the world: B2, the Google backbone that carries all user-facing traffic. The video of the presentation is available on our segment-routing.net website.

B.1 Raincoat and Boots on a Sunny Day

In chapter 1 of Part I we wrote:

"The consequence of this "full-mesh" is lots of operational complexity and limited scale, most of the time without any gain. Indeed, most of the time, all these tunnels follow the IGP shortest-path as the network is correctly capacity planned and no traffic engineering is required. This is largely suboptimal. An analogy would be that one needs to wear a raincoat and boots every day while it rains only a few days a year."

This is confirmed by the data analysis presented in Figure B‑1, reporting that more than 98% of RSVP-TE LSPs follow the shortest-path.

Figure B-1: >98% of HIPRI LSPs remain on the shortest path 1

This is highly complex and inefficient. Just using a segment would be much simpler (automated from the IGP), scalable (stateless), and efficient (a prefix segment is ECMP-aware, an RSVP-TE LSP is not).

B.2 ECMP-Awareness and Diversity

In chapter 1 of Part I we wrote:

"First, RSVP-TE is not ECMP-friendly. This is a fundamental issue as the basic property of modern IP networks is to offer multiple paths from a source to a destination. This ECMP nature is fundamental to spread traffic along multiple paths to add capacity as required and for redundancy reasons."

And we also wrote:

"Hence, traffic can never utilize IGP-derived ECMP paths and, to hide the lack of ECMP in RSVP-TE, several tunnels have to be created between each source and destination (at least one per ECMP path). Hence, while no traffic engineering is required in the most likely situation of an IP network, the RSVP-TE solution always requires N²×K tunnels where N scales with the number of nodes in the network and K with the number of ECMP paths."

This is confirmed by the data analysis presented in Figure B‑2, which reports the weak path diversity of the RSVP-TE LSP solution.

Figure B-2: LSP pathing does not offer enough diversity 1

The slide in Figure B‑2 reminds us that this could theoretically be improved (as we noted in 2016) by increasing K (the number of LSPs between a headend and a tailend), but this is obviously difficult in practice as it amplifies the RSVP-TE LSP scaling issue (K×N²).

1. ©2019 Google LLC. Presented at MPLS WC Paris in April 2019. Google and the Google logo are registered trademarks of Google LLC.↩