CCIE Service Provider Version 4 Written and Lab Exam Comprehensive Guide 0692747370, 9780692747377



English Pages 2930 Year 2016


Table of contents :
1. SP architecture concepts
1.1 IPv6
1.1.1 Definitions
1.1.2 Neighbor Discovery details
1.2 Broadband Aggregation (BBA)
1.2.1 PPP over Ethernet (PPPoE) technology
1.2.2 Multi-service PPPoE and LAC/LNS architecture
1.3 MEF Ethernet Services Definitions (MEF 6.2)
1.4 Platform Architecture
1.4.1 Route-Switch Processor (RSP) and Route Processor (RP)
1.4.2 Line cards (LC)
1.4.3 Switching fabric / backplane and forwarding model
1.4.4 Multicast forwarding and hierarchical replication
1.4.5 Satellite operations (remote linecards)
3.1 WAN technologies
3.1.1 Packet over SONET/SDH
3.1.2 T1/E1 and T3/E3
3.1.3 Dense Wavelength Division Multiplexing (DWDM)
3.2 IP connectivity to the customer
3.2.1 Digital Subscriber Line (DSL)
3.2.2 Cable Internet
3.2.3 Wireline
4. Virtualization concepts
4.1 SVR vs. HVR
4.2 Network Functions Virtualization (NFV)
4.3 Software Defined Networking (SDN)
5. Mobility concepts
5.1 LTE
5.2 Backhaul
6. Describe BGP path attributes
7. Describe MPLS forwarding and control plane mechanisms
7.1 Label Distribution Protocol (LDP)
7.2 Static label bindings
7.3 MPLS IP and MTU minor options
8. Describe MPLS advanced features
8.1 Segment Routing
8.2 Generalized MPLS (GMPLS)
8.3 MPLS Transport Profile (MPLS-TP)
8.4 Inter-AS MPLS
8.4.1 Option A (Back to back VRF exchange)
8.4.1.1 L3VPN
8.4.1.2 L2VPN
8.4.1.3 MVPN – GRE (Profile 0) and mLDP (Profile 1)
8.4.1.4 MPLS TE
8.4.1.5 Confederation variation
8.4.1.6 Carrier Supporting Carrier (CSC) variation
8.4.2 Option B (ASBR VPNv4/v6 eBGP)
8.4.2.1 L3VPN
8.4.2.2 L2VPN
8.4.2.3 MVPN – GRE (Profile 0)
8.4.2.4 MVPN – mLDP (Profile 17)
8.4.2.5 MPLS TE
8.4.2.6 Confederation variation
8.4.3 Option C (ASBR eBGP + Label, RR VPNv4 eBGP)
8.4.3.1 L3VPN
8.4.3.2 L2VPN
8.4.3.3 MVPN – GRE (Profile 0)
8.4.3.4 MVPN – mLDP (Profile 17)
8.4.3.5 MPLS TE
8.4.3.6 Confederation variation
8.4.4 Option AB Inter-AS hybrid (AKA Option D)
8.4.4.1 L3VPN
8.4.4.2 L2VPN
8.4.4.3 MVPN – GRE (Profile 0) and mLDP (Profile 1)
8.4.4.4 MPLS TE
8.4.5 Confederation variation
9. Describe multicast P2MP TE
10. Describe EVPN (EVPN and PBB-EVPN)
10.1 EVPN
10.2 PBB-EVPN
11. Describe IEEE 802.1ad (QinQ), IEEE 802.1ah (Mac-in-Mac), and ITU G.8032 (REP)
11.1 802.1ad QinQ
11.2 802.1ah MAC in MAC (Provider Backbone Bridges)
11.3 Ethernet Ring loop-prevention
11.3.1 Cisco Resilient Ethernet Protocol (REP)
11.3.2 ITU G.8032
12. Describe broadband forum TR-101 VLAN paradigms (N:1 and 1:1)
13. Describe QoS link fragmentation (LFI), cRTP, and RTP
14. Describe Multichassis/Clustering High Availability (HA)
14.1 High Availability (HA) Demonstration (NSF/NSR/GR)
14.1.1 IS-IS NSF and NSR
14.1.2 OSPFv2 NSF and NSR
14.1.3 OSPFv3 GR and NSR
14.1.4 BGP GR and NSR
14.1.5 LDP GR and NSR
14.1.6 RSVP-TE GR
14.1.7 EIGRP NSF
15. Describe Layer 1 failure detection
16. Describe BGPsec
17. Describe backscatter traceback
18. Describe lawful-intercept
19. Describe BGP Flowspec
20. Describe DDoS mitigation techniques
21. Describe network event and fault management
22. Describe performance management and capacity procedures
23. Describe maintenance and operational procedures
24. Describe the network inventory management process
25. Describe network change, implementation, and rollback
25.1 Processes and best practices
25.2 NETCONF and YANG
26. Describe the incident management process based on the ITILv3 framework
27. Describe, implement, and troubleshoot advanced BGP features
27.1 Additional Paths (add-path) and Prefix Independent Convergence (PIC)
27.2 BGP RT-filter unicast / IPv4 RT-filter feature
27.3 BGP RR-group and Selective RT Retention
27.4 Accumulated IGP attribute
27.4.1 Basic AIGP
27.4.2 AIGP with cost-communities and BGP confederations
27.5 Cost-Community / Point Of Insertion (POI)
27.6 DMZ Link Bandwidth
27.7 BGP Multicast VPN (MVPN) Theory
27.8 BGP Link State AF and Path Computation Element (PCE)
28. Describe, implement, and troubleshoot MVPN
28.1 Profile 0: Default MDT − GRE − PIM C−mcast Signaling (Traditional Draft-Rosen)
28.1.1 PIM-ASM in the core
28.1.2 PIM-SSM in the core
28.1.3 PIM-Bidir in the core
28.2 Profile 1: Default MDT − MLDP MP2MP − PIM C−mcast Signaling (Basic mLDP)
28.3 Profile 3: Default MDT − GRE − BGP−AD − PIM C−mcast Signaling
28.4 Profile 6: VRF MLDP − In−band Signaling
28.5 Profile 7: Global MLDP In−band Signaling
28.6 Profile 8: Global Static − P2MP−TE
28.7 Profile 9: Default MDT − MLDP − MP2MP − BGP−AD − PIM C−mcast Signaling
28.8 Profile 10: VRF Static – P2MP TE - BGP−AD
28.9 Profile 11: Default MDT − GRE − BGP−AD − BGP C−mcast Signaling
28.10 Profile 12: Default MDT − MLDP − P2MP − BGP−AD − BGP C−mcast Signaling
28.11 Profile 13: Default MDT − MLDP − MP2MP − BGP−AD − BGP C−mcast Signaling
28.12 Profile 14: Partitioned MDT – MLDP P2MP – BGP-AD – BGP C-mcast signaling
28.13 Profile 17: Default MDT – MLDP P2MP – BGP-AD – PIM C-mcast signaling
29. Describe and optimize multicast scale and performance
29.1 Inter-AS Multicast and Multicast Source Discovery Protocol (MSDP)
29.2 Multicast Only Fast Reroute (MoFRR)
29.3 Protecting mLDP LSPs with Fast Reroute (FRR)
29.4 MVPN Extranet
29.4.1 PIM/GRE
29.4.2 mLDP
30. Describe, implement, and troubleshoot MPLS QoS models and related features
30.1 Uniform
30.2 Short pipe
30.3 Pipe (AKA long pipe)
30.4 QoS Policy Propagation through BGP (QPPB)
30.5 QoS specifics on IOS XRv
30.6 Network Based Application Recognition (NBAR) summary and configurations
30.6.1 NBAR Custom Protocols
30.6.2 NBAR Attributes
30.6.3 NBAR Attributes with HTTP
30.6.4 NBAR Protocol-ID
30.6.5 NBAR Protocol Discovery
31. Describe, implement, and troubleshoot MPLS TE / QoS mechanisms
31.1 MPLS RSVP-TE (General)
31.1.1 TE Topology (TED) construction and RSVP-TE signaling
31.1.2 TE attributes
31.1.3 Directing traffic into TE tunnels and tunnel stitching
31.2 TE Fast-ReRoute (FRR) and rapid provisioning
31.2.1 Link (NHOP), Node (NNHOP), and Path protection – Manual
31.2.2 Automatic tunnels (with OSPF)
31.3 CBTS (IOS) and PBTS (XR)
31.4 DiffServ-aware Traffic Engineering (DS-TE)
31.4.1 Pre-standard Model
31.4.2 IETF Russian Dolls Model (RDM)
31.4.3 IETF Maximum Allocation Model (MAM)
31.4.4 Per-VRF TE techniques
32. Describe, implement, and troubleshoot E-LAN and E-TREE (extended to general L2VPN)
32.1 MPLS encapsulated L2VPN
32.1.1 Static configuration
32.1.1.1 E-LINE (VPWS)
32.1.1.2 Advanced PW features (CW, Status, etc)
32.1.1.3 E-LAN and E-TREE (VPLS)
32.1.1.4 Multisegment PW (MS-PW) switching
32.1.1.5 EVC rewrite operations
32.1.2 BGP auto-discovery for VPWS/VPLS
32.1.2.1 LDP signaling
32.1.2.2 BGP signaling
32.1.3 Hierarchical VPLS (H-VPLS)
32.1.3.1 MPLS in the Access Network
32.1.3.2 QinQ in the Access Network
32.2 IP encapsulated L2VPN
32.2.1 E-LINE with L2TP
32.2.2 E-LAN and E-TREE using OTV
33. Describe, implement, and troubleshoot Unified MPLS and CSC
33.1 Carrier Supporting Carrier (CSC)
33.1.1 L3VPN
33.1.2 L2VPN
33.1.3 MVPN (Profile 0 with SSM)
33.1.4 TE and TE-FRR
33.2 Unified (seamless) MPLS
33.2.1 IS-IS
33.2.1.1 L3VPN
33.2.1.2 L2VPN
33.2.1.3 MVPN (mLDP profiles 1 and 17)
33.2.1.4 Inter-area TE and TE-FRR
33.2.2 OSPF (summarized)
33.2.2.1 L3VPN
33.2.2.2 L2VPN
33.2.2.3 MVPN (mLDP profiles 1 and 17)
33.2.2.4 MPLS TE and TE-FRR
34. Describe, implement, and troubleshoot LISP
35. Describe, implement, and troubleshoot GRE and mGRE-based VPN
35.1 P2P GRE tunneling and GRE features
35.2 Dynamic Multipoint VPN (DMVPN) basics
35.2.1 Phase 1
35.2.2 Phase 2
35.2.3 Phase 3
35.3 mGRE-based L3VPN
36. Describe, implement, and troubleshoot IPv6 transition mechanisms
36.1 NAT44 and NAT444
36.2 NAT64 and NAT464
36.3 Dual stack lite (DS-lite)
36.4 IPv6 tunneling over IPv4 networks
36.4.1 GRE / Manual IPv6 tunnels
36.4.2 6to4 automatic tunnels
36.4.3 6 Rapid Deployment (6RD)
36.4.4 Intra-Site Automatic tunnel Addressing Protocol (ISATAP)
36.5 IPv4/IPv6 Internet Access over MPLS using NAT44
37. Describe, implement, and troubleshoot end-to-end fast convergence
37.1 Loop Free Alternate (LFA) for IPv4
37.1.1 OSPFv2
37.1.1.1 Direct LFA
37.1.1.2 Remote LFA
37.1.2 IS-IS
37.1.2.1 Direct LFA
37.1.2.2 Remote LFA
37.1.3 EIGRP
37.2 Loop Free Alternate (LFA) for IPv6 (XR Only)
37.2.1 OSPFv3
37.2.1.1 Direct LFA
37.2.1.2 Remote LFA
37.2.2 IS-IS
37.2.2.1 Direct LFA
37.2.2.2 Remote LFA
37.3 Convergence optimizations for BGP
37.4 Convergence optimizations for IGPs
37.4.1 IS-IS
37.4.2 OSPFv2 and OSPFv3
38. Describe, implement, and troubleshoot multi-VRF CE and advanced VRF techniques
38.1 Multi-VRF CE (VRF-Lite)
38.1.1 Basic VRF-Lite
38.1.2 OSPF and sham-links
38.1.3 EIGRP and Site-of-Origin (SoO)
38.1.4 IS-IS
38.1.5 BGP and Site-of-Origin (SoO)
38.1.6 Static routing
38.1.7 RIP
38.2 VRF label modes
38.3 VRF selection for traffic leaking
38.4 VRF route leaking
38.5 L3VPN import/export maps
38.6 Half-Duplex VRF (HDVRF)
38.7 BGP Local Convergence (VRF Local Protection)
39. Describe, implement, and troubleshoot Layer 2 failure detection
39.1 Link Aggregation Control Protocol (LACP)
39.2 Uni-Directional Link Detection (UDLD)
40. Describe, implement, and troubleshoot Layer 3 failure detection
40.1 Individual Protocol Hello packets
40.2 Bidirectional Forwarding Detection (BFD)
41. Describe, implement, and troubleshoot control plane protection techniques
41.1 Control Plane Policing (CPP) in XE and Local Packet Transport Services (LPTS) in XR
42. Describe, implement, and troubleshoot logging and SNMP security
42.1 Logging
42.2 SNMP security
43. Describe, implement, and troubleshoot timing
43.1 Network Time Protocol (NTP)
43.2 1588v2 (Precision Time Protocol (PTP))
43.3 Synchronous Ethernet (SyncE)
44. Describe, implement, and troubleshoot SNMP traps, RMON, EEM, and EPC
44.1 SNMP traps
44.2 Remote Monitor (RMON) in XE and logging correlation in XR
44.3 Embedded Event Manager (EEM)
44.4 Embedded Packet Capture (EPC)
45. Describe, implement, and troubleshoot port mirroring protocols
45.1 Switch port analyzer (SPAN)
45.2 Remote SPAN (RSPAN)
45.3 Encapsulated RSPAN (ERSPAN)
46. Describe, implement, and troubleshoot Netflow and IPFIX
46.1 Flexible Netflow (FNF)
46.2 IPFIX
47. Describe, implement, and troubleshoot IP SLA
47.1 Basic IP SLA probes, responders, features, and configurations
47.2 UDP-jitter and VOIP codec probes
47.3 Advanced ICMP probes
47.4 MPLS probes
47.5 Ethernet probes including ITU-T Y.1731 Basics and Performance Monitoring (PM)
47.6 Miscellaneous probes
47.7 Aggregated statistics, history, group scheduling, and miscellaneous features
47.8 Enhanced Object Tracking (EOT)
47.9 IPv6 SLA
47.10 IOS-XR IP SLA and EOT
48. Describe, implement, and troubleshoot MPLS OAM and Ethernet OAM
48.1 MPLS ping, MPLS traceroute, and VCCV
48.2 MPLS LSP Monitor (MPLSLM) / LSP Health Monitor
48.3 Ethernet Management Tools (CFM, OAM, and E-LMI)
48.3.1 Connectivity Fault Management (CFM) (IEEE 802.1ag)
48.3.2 Ethernet OAM (IEEE 802.3ah)
48.3.3 Ethernet Local Management Interface (E-LMI) (MEF 16)
48.3.4 Ethernet CFM, OAM, E-LMI, and Y.1731 on CSR1000v (Comprehensive)
49. Service Provider security best practices (Comprehensive)
49.1 Control plane security best practices
49.2 Management plane security best practices
49.3 Data plane security best practices
49.4 Advanced security techniques and features

CCIE™ Service Provider Version 4 Written and Lab Exam Comprehensive Guide

By: Nicholas J. Russo CCIE™ #42518 (RS/SP)

About the Author

Nicholas (Nick) Russo, CCIE™ #42518, holds active CCIE certifications in both Routing and Switching and Service Provider. Nick was among the first individuals to pass the CCIE Service Provider version 4 lab examination, and this book represents his personal journey towards that end. Nick also holds a Bachelor of Science in Computer Science, and a minor in International Relations, from the Rochester Institute of Technology (RIT). Nick lives in Maryland, USA with his wife, Carla. They are currently expecting their first child.

Dedications

This book is dedicated to my wife Carla, for without her support, I would not have even started this endeavor. Although I have spent years studying for multiple certifications, she continues to support me in every way. This is the mark of a true companion and I love her dearly for it.

Copyright 2016 Nicholas J. Russo
ISBN-10: 0-692-74737-0
ISBN-13: 978-0-692-74737-7

This material is not sponsored or endorsed by Cisco Systems, Inc. Cisco, Cisco Systems, CCIE and the CCIE Logo are trademarks of Cisco Systems, Inc. and its affiliates. The symbol ™ is included in the Logo artwork provided to you and should never be deleted from this artwork. All Cisco products, features, or technologies mentioned in this document are trademarks of Cisco. This includes, but is not limited to, Cisco IOS®, Cisco IOS-XE®, and Cisco IOS-XR®. Within the body of this document, not every instance of the aforementioned trademarks are prepended with the symbols ® or ™ as they are demonstrated above. The opinions expressed in this book belong to the author and are not necessarily those of Cisco.

THE INFORMATION HEREIN IS PROVIDED ON AN “AS IS” BASIS, WITHOUT ANY WARRANTIES OR REPRESENTATIONS, EXPRESS, IMPLIED OR STATUTORY, INCLUDING WITHOUT LIMITATION, WARRANTIES OF NONINFRINGEMENT, MERCHANTABILITY OR FITNESS FOR A PARTICULAR PURPOSE.

Purpose: This book attempts to cover every topic in the CCIE Service Provider version 4 (SPv4) blueprint. The vast majority of technical topics, even topics only present on the written examination, have corresponding practical labs. In this way, the book is an educational resource focused more on developing true technical experts than on training individuals to pass a test. By testing many advanced technologies in detail, such as Ethernet VPN and Segment Routing, the reader gains valuable insight into the future of SP technologies.

Target audience: Individuals using this book should already have a strong understanding of core routing, switching, and SP technologies. The book does not detail the basics of routing, MPLS forwarding, or other topics considered “beneath” the scope of a CCIE certification. Very few of the labs in this book are single-technology focused. This is done intentionally to constantly exercise features working in concert (or disharmony) with one another. Readers should understand this and be knowledgeable on prerequisite topics as discussed in each chapter’s introduction.

Scope: This book primarily focuses on CCIE SP version 4 topics as specified in the official blueprint published by Cisco. Other topics that do not appear on the blueprint, such as BGP customer multicast signaling and PPP over Ethernet (PPPoE), are documented briefly in this document as they are relevant for SP networking in general. Nonetheless, the focus remains on the core SP topics in the blueprint since this book is designed for CCIE SPv4 candidates and other SP networking professionals. The length and breadth of a section is often a good measure of how “important” it is. This is helpful for prioritizing one’s study time. Note that some blueprint topics may not be covered in an appropriate level of detail in this book; always consult the official blueprint to determine if a specific technology is testable or not.
How to use this book: The table of contents is hyper-linked to each chapter, so an ordinary “point-and-click” operation is an effective navigation tool. The table of contents is arrayed in a way that makes sense to the author, but this is not necessarily the best sequence to review the labs. Topologies are seldom recycled across major domains unless the topology is well-suited to a number of particular labs. Basic IP addressing and routing configurations will be briefly validated before each lab; this is done to conserve time. The core study topic at hand remains the focus of a particular section.

Reference material: An “Additional Reading” comment is included in every major technology area which identifies the suffix of the supporting document relating to a lab. This document will contain the original diagram embedded in the book, as well as all configuration files. These are included separately so that they may be viewed, printed, or modified.

Below is a mapping of topic weights by Cisco for reference.

Topic                                          Written Weight (%)   Lab Weight (%)
Service Provider Architecture and Evolution    10                   N/A
Core Routing                                   23                   27
Service Provider Based Services                23                   26
Access and Aggregation                         17                   17
High Availability and Fast Convergence         10                   13
SP Security, SP Operation and Management       17                   17


1. SP architecture concepts

1.1 IPv6

1.1.1 Definitions

Link-local address: Addressing within FE80::/10 (FE80:: through FEBF:FFFF…) to be used for communication on a single link. This addressing is not routable, and all routers must have link-local (LL) addresses on all IPv6-enabled interfaces.

Site-local address: Addressing within FEC0::/10 (FEC0:: through FEFF:FFFF…) to be used within an organization. This is similar to RFC 1918 private addressing and is routable, but its use is discouraged (site-local addressing was formally deprecated by RFC 3879). Unique-local addressing was meant to replace it.

Unique-local address (ULA): Addressing within FC00::/7 (FC00:: through FDFF:FFFF…) to be used within an organization. This replaced site-local addressing and serves the same function.

Multicast addresses: Addressing within FF00::/8 (anything starting with FF) to be used for multicast transport. Within the second byte, the first hex digit carries special flags while the second hex digit carries the scope. The flags, in binary, are “0RPT”. The most significant bit is reserved and is always 0.

1. ‘R’ indicates whether the IPv6 multicast address carries a PIM RP address. This is used for embedded RP, where the RP address is signaled inside the IPv6 multicast group address.
2. ‘P’ indicates whether a multicast address is assigned based on the network prefix. This is used for embedded RP, where the network prefix is embedded inside the IPv6 multicast address. If R is 1, P must also be 1, since the embedded RP construct implies that the network prefix is also carried in the IPv6 address. The opposite is not true, as ‘P’ could be 1 while ‘R’ is 0; a case may exist where network prefix information is carried in the multicast address but the function is unrelated to embedded RP.
3. ‘T’ indicates whether a multicast group is transient (dynamically/non-permanently assigned) or not. When T is 0, the address is a well-known multicast address assigned by IANA. If P is 1, T must also be 1. The opposite is not true, as ‘T’ could be 1 while ‘P’ is 0. This would represent a normal transient multicast group that does not carry any network prefix information.

The scopes are largely self-explanatory and are used to contain multicast traffic within administrative regions.

1 - Interface-local: Only useful for loopback transmission of multicast.
2 - Link-local: Communication on a segment, typically used for IGP, PIM, neighbor discovery (ND), etc.
4 - Admin-local: Smallest scope that can be administratively configured; that is, unlike the interface-local and link-local scopes, this traffic is routable and the administrator decides what constitutes an “admin-local” boundary. This would be useful for limiting traffic to a set of devices within a site, such as the access/distribution/core layers of a LAN-side routing architecture.
5 - Site-local: For use within a site. This would be useful for confining multicast traffic within a local branch office. Although PIM dense-mode is not supported in IPv6 on Cisco platforms, a site-local sparse-mode domain may be a good alternative for local multicast confinement.
8 - Organization-local: Spans multiple sites within an organization, such as between branch offices. The information would typically not be allowed to be exchanged over the Internet.
E - Global scope: Sometimes called “VPN scope” by Cisco, this has no scoping limit.

Anycast addresses: Though the concept of anycast exists in IPv4, IPv4 anycast does not operate on a LAN segment; IPv6 enables this capability.
Configuring an anycast address is essentially the same as configuring a unicast address with duplicate address detection disabled (DAD is discussed later). When a host tries to resolve layer 2 addresses, any node may respond, hence the name anycast.

Solicited-node address: A link-local scope multicast address computed as a function of a node’s unicast and anycast addresses. These addresses are formed by taking the low-order 24 bits of an IPv6 address and appending those bits to the prefix FF02::1:FF00:0/104 (FF02::1:FF00:0 through FF02::1:FFFF:FFFF). The network prefix length of 104 plus the low-order 24 bits of the unicast/anycast address on the interface creates the full 128-bit IPv6 solicited-node address. A node that has multiple prefixes but similar host addresses can therefore join fewer (hopefully only one) solicited-node multicast addresses. Every node must join a solicited-node multicast address for every unicast and anycast address on all interfaces, regardless of how they were configured (manual, DHCPv6, SLAAC, etc). Using solicited-node addresses also reduces interrupts on nodes other than the target because the destination is not like an IPv4 ARP broadcast, or even an IPv6 all-nodes multicast. When a node sends traffic to a solicited-node address, it is like a semi-directed broadcast message that targets a very small set of nodes (again, hopefully only one).

Neighbor Solicitation (NS): ICMP type 135. The destination is the solicited-node multicast address of a specific host on the LAN, while the source is typically the link-local IPv6 address of the source interface. This is used for on-link address resolution and is directly comparable in function to an ARP request. The NS can also have a unicast destination when not being used for discovery. In that case, it serves as a reachability probe to verify a neighbor once discovered, which is known as Neighbor Unreachability Detection (NUD). NUD confirms two-way communication in this way as well.

Neighbor Advertisement (NA): ICMP type 136. The destination is the source address of the node that sent the NS (regular unicast packet) and the source is an address of the node sending the NA, typically its LL address. The layer 2 address is contained within the packet’s payload, and on Ethernet media this is the MAC address of the node sending the NA. If a node’s layer 2 address changes, an unsolicited NA is sent to the all-nodes multicast address (FF02::1) so that neighbors update their IPv6 neighbor tables. The solicited flag is 1 (true) only when the NA is sent in response to an NS; the flag is 0 otherwise.

Router Solicitation (RS): ICMP type 133. These are sent by hosts to discover available routers on the segment. The source is the IPv6 link-local address of the sending interface (or :: if no address has been assigned yet) with a destination of the all-routers (FF02::2) multicast address. In this way, other IPv6 hosts will discard RS packets they receive since they are destined only for IPv6 routers, and because the source address can be unspecified (::), this facilitates SLAAC operation.

Router Advertisement (RA): ICMP type 134. These are periodically sent by routers with a source address of the interface LL address and destination of FF02::1. If sent in reply to an RS, it can also have a destination of the LL address of the host that sent the RS. RA messages typically include: one or more prefixes for SLAAC (prefix-length must be 64 bits), prefix lifetime (validity), hop limit (TTL), MTU, and auto-configuration details.
RA generation is enabled on Ethernet and FDDI interfaces by default and can be manually suppressed. On all other interfaces, it is disabled by default and can be manually enabled; one such use case of enabling it on a non-LAN interface would be to support ISATAP tunneling towards clients (discussed later). Two flags are of particular interest. The ‘M’ flag is the managed address configuration flag, which indicates that addresses are available via DHCPv6. The ‘O’ flag indicates that other information, such as DNS, is available via DHCPv6 but addresses are not. If the ‘M’ flag is set, the ‘O’ flag is redundant/ignored, since all information is returned from DHCPv6 in that case. With both flags clear, no information is available via DHCPv6. Regarding the router lifetime, a value of 0 indicates the router should not appear as a candidate default gateway; the lifetime only applies to the router’s usefulness as a default gateway and no other RA components (prefixes have their own lifetimes).

Neighbor Redirect (NR): ICMP type 137. Used to notify a host of a better path to reach the destination. It has the same purpose as an IPv4 ICMP redirect; however, the router sending the IPv6 NR must know the link-local address of the redirect target (i.e., the other router on the segment that is the better exit point). This LL address is contained in the payload of the NR message. An optional field that should be included, if known, is the target’s layer 2 address. This saves the host receiving the redirect from having to use an NS to resolve the next-hop, if it doesn’t already have the information.

There are several validations that occur on these ND packets as well. For example, IPv6 nodes will discard RA or RS messages that don’t have a hop limit (TTL) of 255, since any other value implies the message originated off-link and is therefore probably invalid.

Duplicate Address Detection (DAD): When a new address is configured on a link, DAD is typically run first before assigning the address to the interface. The NS message is used with an unspecified source address (::) and a destination of the solicited-node multicast address of the tentative address. The tentative address that a node is checking for uniqueness is contained within the body of this NS. Two conditions will render the address “duplicate” and therefore unusable: reception of an NA from another node saying the address is already in use on the segment, or reception of an NS from another node that is concurrently trying to determine uniqueness of the same address. All IPv6 addresses (global and link-local) are subject to DAD; however, DAD for LL addressing must happen first before progressing to additional IPv6 addresses. Cisco does not perform DAD on global or anycast addresses generated from 64-bit interface identifiers, such as EUI-64. It is assumed that these are unique, and bypassing DAD is a minor optimization.

Default Router Preference (DRP): Signaled in unused bits within the RA message to provide low, medium, and high preference options for selecting a default gateway when multiple routers advertise themselves. Failure to evaluate/understand these bits results in a value of “medium”.

The IPv6 header differs from the IPv4 header in several ways. As expected, it is much larger at 40 bytes versus 20 bytes; each IPv6 address is 16 bytes by itself. The IPv4 TTL and IP protocol fields have been renamed to “hop limit” and “next header”, but they are still both 1-byte fields with the same function. In IPv4, the TTL comes before the IP protocol, but in IPv6, the fields are reversed with “next header” coming before “hop limit”. IPv6 also adds the concept of a flow label which assigns packets to a particular flow. It is 20 bits long, like an MPLS label, but has nothing to do with MPLS.
The idea is that routers can do per-flow load sharing based on this information without having to look at higher layer protocol fields like TCP or UDP ports. Many protocols may have multiple flows but lack the concept of “ports” that TCP and UDP have. There is no way to verify the authenticity of the flow label and it could be changed in transit, but since it is generally used for load-sharing, this may not be significant. A value of 0 indicates that the packet has not been assigned to a particular flow. Layer 2 (LAG) or layer 3 (CEF) mechanisms may use this for load-sharing.

The “next header” name is more appropriate for IPv6 since it can refer to one of two things. In a normal IPv6 packet, it would refer to the upper-layer protocol, such as 6 for TCP or 17 for UDP. It can also refer to IPv6 extension headers (EH) which immediately follow the normal IPv6 header. These are like IPv4 options that allow IPv6 to carry extra information. Some of these headers include the routing header (43), mobility header (135), fragment header (44), and destination options (60). IPv6 doesn’t support fragmentation on the routers, but end hosts do, assuming they support these IPv6 EH options.

1.1.2 Neighbor Discovery details

This lab uses CSRs only since XRv does not appear to support sending RAs under any circumstance. Because XRv is modeled after an RSP, not a line-card, it cannot issue RA messages. I have included configurations for XRv1 and XRv2 that can be hot-swapped with CSR1 and CSR2 should the code be fixed later. Basic IS-IS single-topology IPv6 routing is used for reachability across the network. CSR4, CSR5, and CSR7 represent end hosts with very little configuration.

First, we will examine the ND process using CSR1 and CSR2. This is simple because there are only two routers on the segment, making the RA/RS process unnecessary. In cases like this, disabling RAs makes sense to conserve resources and increase security. Although not necessary on transit links, I configured a global unicast address range as well. The relevant configuration from CSR1 is shown below; CSR2 has an identical configuration with different host addresses.

! CSR1
interface GigabitEthernet2.512
 ipv6 address FE80::11 link-local
 ipv6 address 2020:0:11:12::11/64
 ipv6 nd ra suppress all

Debugging ICMPv6 and ND on CSR1 allows us to see many details about what happens during the process described earlier. We bounce CSR1’s link to CSR2 to see the full procedure. For clarity, the debug is broken into chunks and explained in line. First, IPv6 ND is notified that the layer 2 components of the link came up, which starts the ND process at layer 3. Before anything else, DAD must be run on the LL address of the link after a short delay. The DAD message is just an NS to the solicited-node address of CSR1; this is a way to see if anyone else has the same low-order 24 bits of the host address as CSR1. DAD sees no response after 1 second (globally adjustable, as we see later) and declares the address unique. CSR1 then issues an unsolicited NA to the all-nodes multicast group to notify them of its MAC address binding to this IPv6 LL address.

R1#debug ipv6 icmp
R1#debug ipv6 nd
22:28:32.249: ICMPv6-ND: (GigabitEthernet2.512) L2 came up
22:28:32.249: IPv6-Addrmgr-ND: DAD request for FE80::11 on GigabitEthernet2.512
22:28:32.249: ICMPv6-ND: Delay DAD for FE80::11 on GigabitEthernet2.512 by 200 msec
22:28:32.449: ICMPv6-ND: (GigabitEthernet2.512,FE80::11) Sending DAD NS [6F530]
22:28:32.450: ICMPv6: Sent N-Solicit, Src=::, Dst=FF02::1:FF00:11
22:28:33.449: IPv6-Addrmgr-ND: DAD: FE80::11 is unique.
22:28:33.449: ICMPv6-ND: (GigabitEthernet2.512,FE80::11) Sending NA to FF02::1
22:28:33.449: ICMPv6-ND: (GigabitEthernet2.512) L3 came up
22:28:33.449: ICMPv6-ND: (GigabitEthernet2.512,FE80::11) Linklocal Up
22:28:33.450: ICMPv6: Sent N-Advert, Src=FE80::11, Dst=FF02::1
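The destination FF02::1:FF00:11 in the DAD NS above follows directly from the solicited-node rule in the definitions: keep the low-order 24 bits of the address and place them under the /104 prefix. The following is a minimal Python sketch of that arithmetic using only the standard library; the function name solicited_node is my own and is not part of any Cisco tooling.

```python
import ipaddress

# Base of the solicited-node range, FF02::1:FF00:0/104
SOLICITED_NODE_BASE = int(ipaddress.IPv6Address("ff02::1:ff00:0"))


def solicited_node(address: str) -> ipaddress.IPv6Address:
    """Return the solicited-node multicast group for a unicast/anycast address."""
    low24 = int(ipaddress.IPv6Address(address)) & 0xFFFFFF  # low-order 24 bits
    return ipaddress.IPv6Address(SOLICITED_NODE_BASE | low24)


# CSR1's link-local and global addresses share the same low-order 24 bits,
# so they map to the same group.
print(solicited_node("FE80::11"))          # ff02::1:ff00:11
print(solicited_node("2020:0:11:12::11"))  # ff02::1:ff00:11
```

Note that Python prints IPv6 addresses in lowercase, whereas IOS prints them in uppercase; the values are the same.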

Next, DAD iterates through the rest of the unicast and anycast addresses on the link. The DAD process for subsequent addresses need not be delayed since there was not another link-up event. The solicited-node address happens to be the same in this case because the host addresses for the LL and global address are the same, but could be different. As expected, there are no duplicate addresses on the LAN between CSR1 and CSR2. CSR1 issues another unsolicited NA, this time sourced from the global address, to notify other nodes on the segment about its global address.

! CSR1
22:28:33.449: IPv6-Addrmgr-ND: DAD request for 2020:0:11:12::11 on GigabitEthernet2.512
22:28:33.449: ICMPv6-ND: (GigabitEthernet2.512,2020:0:11:12::11) Sending DAD NS [6F530]
22:28:33.451: ICMPv6: Sent N-Solicit, Src=::, Dst=FF02::1:FF00:11
22:28:34.449: IPv6-Addrmgr-ND: DAD: 2020:0:11:12::11 is unique.
22:28:34.449: ICMPv6-ND: (GigabitEthernet2.512,2020:0:11:12::11) Sending NA to FF02::1
22:28:34.451: ICMPv6: Sent N-Advert, Src=2020:0:11:12::11, Dst=FF02::1

A few seconds later, IS-IS converges. Because IS-IS does not rely on IP, CSR1 has no idea about CSR2’s existence at the IPv6 layer and has had no reason to resolve its layer 2 address. After convergence, IS-IS routes are learned via CSR2 and installed in the routing table, which prompts CSR1 to resolve the IPv6 next-hops, which are LL addresses. The ND state machine for FE80::12 (CSR2’s address) transitions from deleted (nonexistent) to incomplete (INCMP). At this time, CSR1 sends another NS to CSR2’s solicited-node address; it derives the solicited-node address from the low-order 24 bits of the address it is trying to resolve, which is FE80::12. About 200 ms later, CSR2 responds with an NA which carries its MAC address. The NA packet is validated for security reasons, and the IPv6 neighbor entry transitions from incomplete to reachable.

! CSR1
22:28:40.936: ICMPv6-ND: (GigabitEthernet2.512,FE80::12) ULP neighbour
22:28:40.936: ICMPv6-ND: (GigabitEthernet2.512,FE80::12) DELETE -> INCMP
22:28:40.936: ICMPv6-ND: (GigabitEthernet2.512,FE80::12) Sending NS
22:28:40.936: ICMPv6-ND: (GigabitEthernet2.512,FE80::12) Set ULP NUD
22:28:40.937: ICMPv6: Sent N-Solicit, Src=FE80::11, Dst=FF02::1:FF00:12
22:28:41.147: ICMPv6: Received N-Advert, Src=FE80::12, Dst=FE80::11
22:28:41.147: ICMPv6-ND: (GigabitEthernet2.512,FE80::12) Received NA from FE80::12
22:28:41.147: ICMPv6-ND: Validating ND packet options: valid
22:28:41.147: ICMPv6-ND: (GigabitEthernet2.512,FE80::12) LLA 0012.1212.1212
22:28:41.147: ICMPv6-ND: (GigabitEthernet2.512,FE80::12) INCMP -> REACH

We can verify this entry by checking the IPv6 neighbor table, which is like the IPv4 ARP table. We see the IPv6 LL address and MAC address of CSR2 as reachable. Notice there is no entry for CSR2’s global unicast address. The only reason CSR1 knew to look for CSR2 was because IS-IS next-hops necessitated it. CSR1 remains ignorant about anyone else on the LAN.

R1#show ipv6 neighbors gig 2.512
IPv6 Address                 Age Link-layer Addr State Interface
FE80::12                       0 0012.1212.1212  REACH Gi2.512

R1#show ipv6 route isis | begin ^I2
I2  2020:0:3:6::/64 [115/30]
     via FE80::12, GigabitEthernet2.512
I2  2020:3:4:12::/64 [115/20]
     via FE80::12, GigabitEthernet2.512
I2  2020:5:6:11::/64 [115/40]
     via FE80::12, GigabitEthernet2.512
I2  FD00:3:4:12::/64 [115/20]
     via FE80::12, GigabitEthernet2.512

We can trick the router into trying to discover a new node in a few ways. The most obvious is to ping a new LL address out of that interface, which will trigger ND. A more subtle way is to configure a static route to a bogus next-hop, which, like the IS-IS routes, will trigger ND. Below we configure a bogus default route; IPv6 ND makes three attempts to resolve the layer 2 address (about one second apart), then gives up and deletes the ND cache entry.

! CSR1
ipv6 route ::/0 GigabitEthernet2.512 FE80::BEEF

R1#debug ipv6 icmp
R1#debug ipv6 nd
22:52:28.302: ICMPv6-ND: (GigabitEthernet2.512,FE80::BEEF) DELETE -> INCMP
22:52:28.302: ICMPv6-ND: (GigabitEthernet2.512,FE80::BEEF) Sending NS
22:52:28.303: ICMPv6: Sent N-Solicit, Src=FE80::11, Dst=FF02::1:FF00:BEEF
22:52:29.393: ICMPv6-ND: (GigabitEthernet2.512,FE80::BEEF) Sending NS
22:52:29.393: ICMPv6: Sent N-Solicit, Src=FE80::11, Dst=FF02::1:FF00:BEEF
22:52:30.483: ICMPv6-ND: (GigabitEthernet2.512,FE80::BEEF) Sending NS
22:52:30.483: ICMPv6: Sent N-Solicit, Src=FE80::11, Dst=FF02::1:FF00:BEEF
22:52:31.573: ICMPv6-ND: (GigabitEthernet2.512,FE80::BEEF) INCMP -> DELETE
22:52:31.573: ICMPv6-ND: Remove ND cache entry
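The retry-then-give-up pattern in this debug can be modeled as a tiny state machine. The sketch below is only an illustration of the behavior observed above, not IOS internals: the function name, the parameter names, and the nominal 1.0 second spacing are assumptions (the debug shows roughly 1.09 seconds between attempts).

```python
def resolve_neighbor(answered_on_attempt=None, max_solicits=3, retrans_s=1.0):
    """Model address resolution for one neighbor.

    answered_on_attempt: 1-based NS attempt that receives an NA, or None
    if the neighbor never answers. Returns (final_state, log), where log
    is a list of (seconds_since_start, event) tuples.
    """
    log = [(0.0, "DELETE -> INCMP")]
    for attempt in range(1, max_solicits + 1):
        sent_at = (attempt - 1) * retrans_s
        log.append((sent_at, "Sending NS"))
        if answered_on_attempt == attempt:
            # An NA arrived for this NS; the entry becomes reachable.
            log.append((sent_at, "INCMP -> REACH"))
            return "REACH", log
    # No NA after the final retransmission interval; give up.
    log.append((max_solicits * retrans_s, "INCMP -> DELETE"))
    return "DELETE", log


state, events = resolve_neighbor()  # bogus next-hop such as FE80::BEEF
print(state)
for when, event in events:
    print(f"{when:4.1f}s {event}")
```

Calling resolve_neighbor(answered_on_attempt=1) models the earlier FE80::12 case, where the first NS is answered and the entry moves straight to REACH.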

We can force ND for CSR2’s global unicast address by pinging it. The debug also shows the ICMPv6 echo and echo-reply packets to confirm that everything worked. The first echo request is shown first, with the first echo reply shown last. Notice that CSR1 also receives an NS for its solicited-node address; this is because CSR2 also has to run ND to reach CSR1’s global unicast address to send the echo-reply. CSR1 replies with a unicast NA to CSR2 (solicited), and after that, the ICMP flow succeeds.

! CSR1
22:56:36.499: ICMPv6: Sent echo request, Src=2020:0:11:12::11, Dst=2020:0:11:12::12
22:56:36.501: ICMPv6-ND: (GigabitEthernet2.512,2020:0:11:12::12) DELETE -> INCMP
22:56:36.503: ICMPv6-ND: (GigabitEthernet2.512,2020:0:11:12::12) Sending NS
22:56:36.503: ICMPv6-ND: (GigabitEthernet2.512,2020:0:11:12::12) Queued data for resolution
22:56:36.504: ICMPv6: Sent N-Solicit, Src=FE80::11, Dst=FF02::1:FF00:12
22:56:36.507: ICMPv6: Received N-Advert, Src=2020:0:11:12::12, Dst=FE80::11
22:56:36.507: ICMPv6-ND: (GigabitEthernet2.512,2020:0:11:12::12) Received NA from 2020:0:11:12::12
22:56:36.507: ICMPv6-ND: Validating ND packet options: valid
22:56:36.507: ICMPv6-ND: (GigabitEthernet2.512,2020:0:11:12::12) LLA 0012.1212.1212
22:56:36.507: ICMPv6-ND: (GigabitEthernet2.512,2020:0:11:12::12) INCMP -> REACH
22:56:36.514: ICMPv6: Received N-Solicit, Src=2020:0:11:12::12, Dst=FF02::1:FF00:11
22:56:36.514: ICMPv6-ND: (GigabitEthernet2.512,2020:0:11:12::11) Received NS from 2020:0:11:12::12
22:56:36.514: ICMPv6-ND: Validating ND packet options: valid
22:56:36.514: ICMPv6-ND: (GigabitEthernet2.512,2020:0:11:12::11) Sending NA to 2020:0:11:12::12
22:56:36.515: ICMPv6: Sent N-Advert, Src=2020:0:11:12::11, Dst=2020:0:11:12::12
22:56:36.517: ICMPv6: Received echo reply, Src=2020:0:11:12::12, Dst=2020:0:11:12::11

We can see the obvious differences in solicited-node addresses between CSR1 and CSR2 on this segment since the low-order 24 bits of their host addresses differ. We can try to trick DAD by configuring different host addresses on CSR1 and CSR2 but with the low-order 24 bits being equal. We will add these as additional IPv6 addresses rather than replace the existing ones. We quickly verify the solicited-node addresses on both routers to ensure they are the same for this new IPv6 address.

! CSR1
interface GigabitEthernet2.512
 ipv6 address 2020:0:11:12:0:11:0:1212/64

! CSR2
interface GigabitEthernet2.512
 ipv6 address 2020:0:11:12:0:12:0:1212/64

R1#show ipv6 interface gig2.512 | section group_add
  Joined group address(es):
    FF02::1
    FF02::1:FF00:11
    FF02::1:FF00:1212

R2#show ipv6 interface gig2.512 | section group_add
  Joined group address(es):
    FF02::1
    FF02::2
    FF02::1:FF00:12
    FF02::1:FF00:1212
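Every joined group listed above begins with FF02, meaning flags 0 (well-known, no embedded prefix or RP) and scope 2 (link-local). As a rough illustration of the “0RPT” flag and scope layout from the definitions, the Python sketch below decodes the second byte of any IPv6 multicast address. The function name and the scope labels are my own choices, and the second example address is a made-up embedded-RP style group under the 2001:db8::/32 documentation prefix, not something from this lab.

```python
import ipaddress

# Scope values covered in the definitions; other values are labeled "other".
SCOPES = {
    0x1: "interface-local",
    0x2: "link-local",
    0x4: "admin-local",
    0x5: "site-local",
    0x8: "organization-local",
    0xE: "global",
}


def decode_multicast(address: str) -> dict:
    """Decode the R/P/T flags and the scope of an IPv6 multicast address."""
    addr = ipaddress.IPv6Address(address)
    if not addr.is_multicast:
        raise ValueError(f"{address} is not within FF00::/8")
    second_byte = addr.packed[1]
    flags = second_byte >> 4    # high nibble: 0RPT
    scope = second_byte & 0x0F  # low nibble: scope
    return {
        "R": bool(flags & 0x4),
        "P": bool(flags & 0x2),
        "T": bool(flags & 0x1),
        "scope": SCOPES.get(scope, "other"),
    }


print(decode_multicast("FF02::1:FF00:1212"))
print(decode_multicast("FF7E:140:2001:DB8::1"))  # hypothetical embedded-RP group
```

The first call reports all three flags clear with link-local scope; the second reports R, P, and T all set with global scope, consistent with the rule that R implies P and P implies T.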

The debug below shows that DAD is smart enough to determine if the address is unique or not. Even if CSR1 and CSR2 have the same solicited-node address, the actual IPv6 address in question is contained within the NS payload. CSR2 is joined to FF02::1:FF00:1212, as is CSR1, so CSR2 actually has to open the packet and process it. Had the low-order 24 bits been different, CSR2 would have discarded the packet at layer 3, saving it a little bit of CPU time. The solicited-node address is not what DAD uses for its final decision; it is really a CPU interrupt reduction technique. The timestamps are not perfectly synchronized between CSR1 and CSR2, but it is clear that CSR2 receives the DAD NS, does nothing, then receives the authoritative NA from CSR1 declaring the address unique. This is the correct behavior.

! CSR1
13:31:30.714: IPv6-Addrmgr-ND: DAD request for 2020:0:11:12:0:11:0:1212 on GigabitEthernet2.512
13:31:30.715: ICMPv6-ND: (GigabitEthernet2.512,2020:0:11:12:0:11:0:1212) Sending DAD NS [C0BB4]
13:31:30.716: ICMPv6: Sent N-Solicit, Src=::, Dst=FF02::1:FF00:1212
13:31:31.715: IPv6-Addrmgr-ND: DAD: 2020:0:11:12:0:11:0:1212 is unique.
13:31:31.715: ICMPv6-ND: (GigabitEthernet2.512,2020:0:11:12:0:11:0:1212) Sending NA to FF02::1
13:31:31.717: ICMPv6: Sent N-Advert, Src=2020:0:11:12:0:11:0:1212, Dst=FF02::1

! CSR2
13:31:31.303: ICMPv6: Received N-Solicit, Src=::, Dst=FF02::1:FF00:1212
13:31:31.303: ICMPv6-ND: (GigabitEthernet2.512,2020:0:11:12:0:11:0:1212) Received NS from ::
13:31:32.303: ICMPv6: Received N-Advert, Src=2020:0:11:12:0:11:0:1212, Dst=FF02::1
13:31:32.303: ICMPv6-ND: (GigabitEthernet2.512,2020:0:11:12:0:11:0:1212) Received NA from 2020:0:11:12:0:11:0:1212
13:31:32.303: ICMPv6-ND: Validating ND packet options: valid
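The two-stage decision just described (group membership filters the packet, the target address inside the NS decides the outcome) can be sketched as follows. This is a simplified model under my own naming, not how IOS implements it; it ignores the concurrent-DAD case where two nodes probe the same tentative address at once.

```python
import ipaddress


def low24(address: str) -> int:
    """Low-order 24 bits, which select the solicited-node group."""
    return int(ipaddress.IPv6Address(address)) & 0xFFFFFF


def dad_ns_outcome(target: str, local_addresses) -> str:
    """Classify how a node holding local_addresses treats a DAD NS for target."""
    # Stage 1: the node only receives the NS if it joined the target's group.
    if all(low24(addr) != low24(target) for addr in local_addresses):
        return "not joined: discarded before ND processing"
    # Stage 2: compare the full 128-bit target carried in the NS body.
    owned = {ipaddress.IPv6Address(addr) for addr in local_addresses}
    if ipaddress.IPv6Address(target) in owned:
        return "duplicate: reply with NA to FF02::1"
    return "joined group, target not owned: processed, no reply"


csr2 = ["FE80::12", "2020:0:11:12::12", "2020:0:11:12:0:12:0:1212"]

# CSR1 probes its new address; CSR2 shares the group but not the address.
print(dad_ns_outcome("2020:0:11:12:0:11:0:1212", csr2))
# CSR1 probes its original global address; CSR2 never sees it.
print(dad_ns_outcome("2020:0:11:12::11", csr2))
```

The first call corresponds to the debug above, where CSR2 receives the NS, processes it, and stays silent.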

In the event there really is a duplicate address, DAD will detect this. Obviously, the solicited-node multicast addresses will be the same, and opening the packet for processing will alert DAD to the duplicated address on the segment. To save a little bit of memory, we can configure a duplicate address (2020::1212) that uses the same solicited-node address as something each router has already joined, which can reduce the number of solicited-node addresses a router must join. The duplicate address below also joins FF02::1:FF00:1212 since the low-order 24 bits of the host address are 0x001212 in hex. Assuming CSR2 has the address configured and we add it to CSR1 later, the debugs are shown below. CSR1 sends the DAD NS and immediately receives an NA back from CSR2; when the address is unique, the DAD NS should not receive an NA in response. Both routers also display a syslog message in case debugging is not enabled. CSR1 calls this a syslog warning (level 4) since it tried to use an address already in use. CSR2 calls this a syslog informational message (level 6) since someone else is attempting to use an address already valid on CSR2.

! CSR1
13:42:37.470: IPv6-Addrmgr-ND: DAD request for 2020::1212 on GigabitEthernet2.512
13:42:37.470: ICMPv6-ND: (GigabitEthernet2.512,2020::1212) Sending DAD NS [598ED]
13:42:37.471: ICMPv6: Sent N-Solicit, Src=::, Dst=FF02::1:FF00:1212
13:42:37.475: ICMPv6: Received N-Advert, Src=2020::1212, Dst=FF02::1
13:42:37.475: ICMPv6-ND: (GigabitEthernet2.512,2020::1212) Received NA from 2020::1212
13:42:37.475: ICMPv6-ND: Validating ND packet options: valid
13:42:37.475: %IPV6_ND-4-DUPLICATE: Duplicate address 2020::1212 on GigabitEthernet2.512

! CSR2
13:42:38.057: ICMPv6: Received N-Solicit, Src=::, Dst=FF02::1:FF00:1212
13:42:38.057: ICMPv6-ND: (GigabitEthernet2.512,2020::1212) Received NS from ::
13:42:38.057: ICMPv6-ND: Packet contains no options
13:42:38.057: ICMPv6-ND: Validating ND packet options: valid
13:42:38.057: ICMPv6-ND: Packet contains no options
13:42:38.057: ICMPv6-ND: (GigabitEthernet2.512,2020::1212) Sending NA to FF02::1
13:42:38.057: %IPV6_ND-6-DUPLICATE_INFO: DAD attempt detected for 2020::1212 on GigabitEthernet2.512
13:42:38.058: ICMPv6: Sent N-Advert, Src=2020::1212, Dst=FF02::1


We can verify this by checking the interface details on each router. The IPv6 address 2020::1212 is marked as [DUP] to indicate it is a duplicate address on CSR1. CSR2 does not show this because it had the address first; DAD honors “first come, first served” in terms of address claims. The syslog message priorities above seem to support this conclusion as well. We will see how to work around DAD issues later.

R1#show ipv6 interface gig2.512 | section Global_uni
  Global unicast address(es):
    2020::1212, subnet is 2020::/64 [DUP]
    2020:0:11:12::11, subnet is 2020:0:11:12::/64
    2020:0:11:12:0:11:0:1212, subnet is 2020:0:11:12::/64

R2#show ipv6 interface gig2.512 | section Global_uni
  Global unicast address(es):
    2020::1212, subnet is 2020::/64
    2020:0:11:12::12, subnet is 2020:0:11:12::/64
    2020:0:11:12:0:12:0:1212, subnet is 2020:0:11:12::/64

The ND state machine is quick to transition ND entries out of the REACH state. By default, the reachable time is 30,000 ms (30 seconds) on all interfaces. It can be tuned globally or at the interface level. Below, CSR1 pings CSR2’s LL address. The entry was in the STALE state, and after the ping, ND transitions the entry to the REACH state. The ND resolution process doesn’t need to occur again since we still had a STALE entry, and the successful ping shows us that the entry is still valid. ND now marks it as REACH, but after 30 seconds, the entry transitions back to the STALE state since no traffic is presently flowing to this address. The STALE state helps the administrator determine how long it has been since traffic was sent to an IPv6 peer.

! CSR1
13:52:45.031: ICMPv6: Sent echo request, Src=FE80::11, Dst=FE80::12
13:52:45.034: ICMPv6: Received echo reply, Src=FE80::12, Dst=FE80::11
13:52:45.034: ICMPv6-ND: (GigabitEthernet2.512,FE80::12) ULP indication
13:52:45.034: ICMPv6-ND: (GigabitEthernet2.512,FE80::12) STALE -> REACH
13:53:15.085: ICMPv6-ND: (GigabitEthernet2.512,FE80::12) REACH -> STALE

We will adjust this timer on CSR1 facing CSR2 to have a 10 second transition so that ND cache entries are moved to the STALE state more quickly when not used. Running the same ND test again, we can see the transition happens in 10 seconds, as expected.

! CSR1
interface GigabitEthernet2.512
 ipv6 nd reachable-time 10000

R1#show ipv6 interface gig2.512 | include ND_reachable
  ND reachable time is 10000 milliseconds (using 10000)

14:02:25.842: ICMPv6: Sent echo request, Src=FE80::11, Dst=FE80::12
14:02:25.846: ICMPv6: Received echo reply, Src=FE80::12, Dst=FE80::11
14:02:25.846: ICMPv6-ND: (GigabitEthernet2.512,FE80::12) ULP indication
14:02:25.846: ICMPv6-ND: (GigabitEthernet2.512,FE80::12) STALE -> REACH
14:02:35.934: ICMPv6-ND: (GigabitEthernet2.512,FE80::12) REACH -> STALE

Entries are removed from the IPv6 ND cache after having been stale for 4 hours by default. This can also be adjusted globally or at the interface level. CSR1 will adjust this globally to delete entries that are stale for 50 seconds. Thus, on the interface to CSR2, an entry moves from REACH to STALE after 10 seconds, then STALE to DELETE after 50 seconds, meaning that there is one minute between traffic sent to a next-hop and the cache entry being totally removed.

! CSR1
ipv6 nd cache expire 50

There does not appear to be a show command to verify this, but we can check the IPv6 cache statistics to see the number of cache entries in each state. This command was run after the entries aged out (given the 50 second DELETE timer), so there are zero entries in the cache currently.

R1#show ipv6 neighbors statistics
IPv6 ND Statistics
 Entries 0, High-water 4, Gleaned 1, Scavenged 3, Static 0
 Entry States
   INCMP 0  REACH 0  STALE 0  GLEAN 0  DELAY 0  PROBE 0
 Resolutions
   Requested 11, timeouts 10, resolved 7, failed 3
   In-progress 0, High-water 2, Throttled 0, Data discards 0
 NUD
   Requested 1, timeouts 0, resolved 1, failed 0
   in-progress 0, high-water 1, throttled 0, current queue 0, queue highwater 0
 Delayed Queue 0, Delayed Queue High-water 4

Repeating the same test again, we confirm this behavior on CSR1 by verifying the timestamps.

! CSR1
14:04:53.505: ICMPv6: Sent echo request, Src=FE80::11, Dst=FE80::12
14:04:53.508: ICMPv6: Received echo reply, Src=FE80::12, Dst=FE80::11
14:04:53.508: ICMPv6-ND: (GigabitEthernet2.512,FE80::12) ULP indication
14:04:53.508: ICMPv6-ND: (GigabitEthernet2.512,FE80::12) STALE -> REACH
14:05:03.548: ICMPv6-ND: (GigabitEthernet2.512,FE80::12) REACH -> STALE
14:05:53.600: ICMPv6-ND: STALE deleted: FE80::12
14:05:53.600: ICMPv6-ND: (GigabitEthernet2.512,FE80::12) STALE -> DELETE
14:05:53.601: ICMPv6-ND: Remove ND cache entry
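The expected transition times can be worked out from the two timers. The short Python sketch below does that arithmetic; the function and parameter names are hypothetical, and the defaults mirror the 30,000 ms reachable time and 4 hour stale expiry mentioned in the text. It computes nominal times only, so the router’s actual timestamps will lag slightly, as the debug above shows (14:05:03.548 and 14:05:53.600).

```python
from datetime import datetime, timedelta

TIME_FORMAT = "%H:%M:%S.%f"


def nd_timeline(confirmed_at: str, reachable_ms: int = 30000,
                stale_expire_s: int = 14400):
    """Return the nominal (REACH -> STALE, STALE -> DELETE) times.

    confirmed_at is the time reachability was last confirmed, written as
    HH:MM:SS.mmm like the debug timestamps.
    """
    start = datetime.strptime(confirmed_at, TIME_FORMAT)
    stale_at = start + timedelta(milliseconds=reachable_ms)
    delete_at = stale_at + timedelta(seconds=stale_expire_s)
    # strftime prints microseconds; trim to milliseconds to match IOS.
    return (stale_at.strftime(TIME_FORMAT)[:-3],
            delete_at.strftime(TIME_FORMAT)[:-3])


# CSR1 settings: reachable-time 10000 ms, cache expire 50 s.
print(nd_timeline("14:04:53.508", reachable_ms=10000, stale_expire_s=50))
# ('14:05:03.508', '14:05:53.508')
```

The result is one minute from confirmation to deletion, which matches the 10 second plus 50 second reasoning above.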

From the statistics show command, we see there are other states such as GLEAN, DELAY, and PROBE.

The GLEAN state doesn’t actually show up in the ND neighbor cache, but it is a valid state. When an unsolicited NA is received on the segment, routers will ignore it by default (like ignoring a gratuitous ARP) to save memory. For example, if we bounce CSR2’s interface, it will perform DAD on all of its addresses beginning with its LL address. CSR1 doesn’t see the DAD NS because it isn’t joined to the same solicited-node address, but it does see the NA that CSR2 sends once DAD declares the address unique. CSR1 does nothing with it; no further processing is done on its IPv6 cache. CSR1 will need it later for IS-IS routing, but that is beyond the scope of this test. Notice that CSR1 has no entry for FE80::12 in its cache.

! CSR1
14:13:09.122: ICMPv6: Received N-Advert, Src=FE80::12, Dst=FF02::1
14:13:09.122: ICMPv6-ND: (GigabitEthernet2.512,FE80::12) Received NA from FE80::12
14:13:09.122: ICMPv6-ND: Validating ND packet options: valid

R1#show ipv6 neighbors gig2.512
[no output]

We can configure CSR1 to record these unsolicited NA mappings on a per-interface basis. CSR1 can “glean” that information by snooping the LAN, which may speed convergence and reduce independent ND conversations later. The cost is larger ND caches (more memory) for addresses that may not be relevant for the traffic patterns on a given LAN. Once configured, we can verify it by checking the IPv6 interface details.

! CSR1
interface GigabitEthernet2.512
 ipv6 nd na glean

R1#show ipv6 interface gig2.512 | include glean
  ND gleaning on unsolicited neighbor advertisements

This time, when CSR2 sends the NA onto the LAN, CSR1 is directed to glean the layer 2 address from this unsolicited NA. The entry is recorded as STALE in the ND cache, which makes sense since CSR1 has no idea if the address is actually reachable as it did not initiate an ND conversation with it, nor direct traffic to/through it. This entry is still subject to the ND expiration timer configured earlier.

! CSR1
14:16:45.688: ICMPv6: Received N-Advert, Src=FE80::12, Dst=FF02::1
14:16:45.688: ICMPv6-ND: (GigabitEthernet2.512,FE80::12) Received NA from FE80::12
14:16:45.688: ICMPv6-ND: Validating ND packet options: valid
14:16:45.688: ICMPv6-ND: Glean unsolicited NA
14:16:45.688: ICMPv6-ND: (GigabitEthernet2.512,FE80::12) Glean
14:16:45.688: ICMPv6-ND: (GigabitEthernet2.512,FE80::12) LLA 0012.1212.1212
14:16:45.688: ICMPv6-ND: (GigabitEthernet2.512,FE80::12) INCMP -> STALE
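The difference between the two behaviors (ignore versus glean) for a neighbor that is not yet in the cache can be summarized with a small model. This is only a sketch of what the debugs show: the cache is represented as a plain dictionary, the function name is invented, and the handling of entries that already exist is deliberately left out because the debugs here do not cover it.

```python
def handle_unsolicited_na(cache: dict, source: str, lla: str,
                          glean: bool = False) -> dict:
    """Apply an unsolicited NA from source to an ND cache (simplified model)."""
    if source in cache:
        # Existing entries are outside the scope of this sketch; leave as-is.
        return cache
    if not glean:
        # Default behavior: no cache entry is created.
        return cache
    # Gleaning enabled: record the mapping, but as STALE, because
    # reachability was never confirmed by an exchange we initiated.
    cache[source] = {"lla": lla, "state": "STALE"}
    return cache


without_glean = handle_unsolicited_na({}, "FE80::12", "0012.1212.1212")
with_glean = handle_unsolicited_na({}, "FE80::12", "0012.1212.1212", glean=True)
print(without_glean)  # {}
print(with_glean)     # {'FE80::12': {'lla': '0012.1212.1212', 'state': 'STALE'}}
```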


This process happens for all of CSR2’s addresses, and CSR1 gleans them all and records them as stale. The IPv6 cache statistics count them as GLEAN entries, despite their operational capacity being “stale” in a sense.

R1#show ipv6 neighbors gig2.512
IPv6 Address                 Age Link-layer Addr State Interface
2020::1212                     0 0012.1212.1212  STALE Gi2.512
2020:0:11:12::12               0 0012.1212.1212  STALE Gi2.512
2020:0:11:12:0:12:0:1212       0 0012.1212.1212  STALE Gi2.512
FE80::12                       0 0012.1212.1212  STALE Gi2.512

R1#show ipv6 neighbors statistics
IPv6 ND Statistics
 Entries 4, High-water 4, Gleaned 9, Scavenged 8, Static 0
 Entry States
   INCMP 0  REACH 0  STALE 0  GLEAN 4  DELAY 0  PROBE 0
 Resolutions
   Requested 12, timeouts 10, resolved 7, failed 3
   In-progress 0, High-water 2, Throttled 0, Data discards 0
 NUD
   Requested 1, timeouts 0, resolved 1, failed 0
   in-progress 0, high-water 1, throttled 0, current queue 0, queue highwater 0
 Delayed Queue 0, Delayed Queue High-water 4

IPv6 NUD also accounts for the presence of an IGP to reduce ND traffic. When routes are learned from an IGP, the next-hop will be an LL address. As seen earlier, installation of those routes in the RIB triggers ND for the next-hops whether traffic is flowing to those destinations or not. The router needing to resolve the remote next-hop sends an NS to the target’s solicited-node address. If there is an IGP neighbor with that node, NUD assumes there is reachability to it, and does not wait for the NA to return before identifying the cache entry as REACH. This is enabled by default on all interfaces; the debug below on CSR2 shows that the cache entry was moved to the REACH state before the NA was received from CSR1. CSR2 knows the MAC address for CSR1 only because it received an NS from CSR1, which was performing ND for CSR2 at the same time.
! CSR2
15:07:59.194: %CLNS-5-ADJCHANGE: ISIS: Adjacency to R1 (GigabitEthernet2.512) Up, new adjacency
15:07:59.194: ICMPv6-ND: (GigabitEthernet2.512,FE80::11) ULP neighbour
15:07:59.194: ICMPv6-ND: (GigabitEthernet2.512,FE80::11) DELETE -> INCMP
15:07:59.194: ICMPv6-ND: (GigabitEthernet2.512,FE80::11) Sending NS
15:07:59.194: ICMPv6-ND: (GigabitEthernet2.512,FE80::11) Set ULP NUD
15:07:59.195: ICMPv6: Sent N-Solicit, Src=FE80::12, Dst=FF02::1:FF00:11
15:07:59.272: ICMPv6: Received N-Solicit, Src=FE80::11, Dst=FF02::1:FF00:12
15:07:59.272: ICMPv6-ND: (GigabitEthernet2.512,FE80::12) Received NS from FE80::11
15:07:59.272: ICMPv6-ND: Validating ND packet options: valid
15:07:59.272: ICMPv6-ND: (GigabitEthernet2.512,FE80::11) LLA 0011.1111.1111
15:07:59.272: ICMPv6-ND: (GigabitEthernet2.512,FE80::11) INCMP -> STALE
15:07:59.273: ICMPv6-ND: (GigabitEthernet2.512,FE80::11) STALE -> REACH
15:07:59.279: ICMPv6: Received N-Advert, Src=FE80::11, Dst=FE80::12
15:07:59.279: ICMPv6-ND: (GigabitEthernet2.512,FE80::11) Received NA from FE80::11
15:07:59.279: ICMPv6-ND: Validating ND packet options: valid

We can disable this behavior on CSR2, so that the IGP adjacency no longer counts as proof of reachability. Notice that CSR2 still stores the MAC address of CSR1, carried in the symmetric NS message coming from CSR1. However, the entry now transitions to the DELAY state, not the REACH state, once CSR2 responds to CSR1’s NS with an NA. While in the DELAY state, the assumption is that we have told our neighbor about our MAC address using a solicited NA, and we are simply waiting for the neighbor to do the same. Until then, the entry is not marked as REACH. NUD is waiting for the solicited NA to come back from CSR1, which authoritatively identifies CSR1’s MAC address (and implies reachability without relying on the IGP).
! CSR2
interface GigabitEthernet2.512
 no ipv6 nd nud igp
! CSR2
15:06:02.357: %CLNS-5-ADJCHANGE: ISIS: Adjacency to R1 (GigabitEthernet2.512) Up, new adjacency
15:06:02.357: ICMPv6-ND: (GigabitEthernet2.512,FE80::11) ULP neighbour
15:06:02.357: ICMPv6-ND: (GigabitEthernet2.512,FE80::11) DELETE -> INCMP
15:06:02.358: ICMPv6-ND: (GigabitEthernet2.512,FE80::11) Sending NS
15:06:02.358: ICMPv6-ND: (GigabitEthernet2.512,FE80::11) Set ULP NUD
15:06:02.358: ICMPv6: Sent N-Solicit, Src=FE80::12, Dst=FF02::1:FF00:11
15:06:02.434: ICMPv6: Received N-Solicit, Src=FE80::11, Dst=FF02::1:FF00:12
15:06:02.434: ICMPv6-ND: (GigabitEthernet2.512,FE80::12) Received NS from FE80::11
15:06:02.434: ICMPv6-ND: Validating ND packet options: valid
15:06:02.434: ICMPv6-ND: (GigabitEthernet2.512,FE80::11) LLA 0011.1111.1111
15:06:02.434: ICMPv6-ND: (GigabitEthernet2.512,FE80::11) INCMP -> STALE
15:06:02.434: ICMPv6-ND: (GigabitEthernet2.512,FE80::12) Sending NA to FE80::11
15:06:02.435: ICMPv6: Sent N-Advert, Src=FE80::12, Dst=FE80::11
15:06:02.436: ICMPv6-ND: (GigabitEthernet2.512,FE80::11) STALE -> DELAY
15:06:02.443: ICMPv6: Received N-Advert, Src=FE80::11, Dst=FE80::12
15:06:02.443: ICMPv6-ND: (GigabitEthernet2.512,FE80::11) Received NA from FE80::11
15:06:02.443: ICMPv6-ND: Validating ND packet options: valid
15:06:02.443: ICMPv6-ND: (GigabitEthernet2.512,FE80::11) DELAY -> REACH
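The two debug traces differ only in the path the cache entry takes to REACH. The following is a minimal Python sketch that replays those paths against a table of transitions. The table contains only the transitions visible in the two CSR2 debugs here, not the complete NUD state machine, and the names `OBSERVED_TRANSITIONS` and `replay` are hypothetical helpers introduced for illustration.

```python
# Transitions taken from the two CSR2 debugs only; this is NOT the full
# NUD state machine, just the edges that appear in the captured output.
OBSERVED_TRANSITIONS = {
    ("DELETE", "INCMP"): "entry created and NS sent for the IGP next-hop",
    ("INCMP", "STALE"): "link-layer address learned from the neighbor's NS",
    ("STALE", "REACH"): "IGP adjacency accepted as proof of reachability",
    ("STALE", "DELAY"): "NA sent to the neighbor, waiting for confirmation",
    ("DELAY", "REACH"): "solicited NA received from the neighbor",
}


def replay(states):
    """Walk a list of states and return the final one.

    Raises ValueError if a step was not seen in the debugs.
    """
    for current, following in zip(states, states[1:]):
        if (current, following) not in OBSERVED_TRANSITIONS:
            raise ValueError(f"{current} -> {following} was not observed")
    return states[-1]


# Default behavior (IGP-aware NUD enabled).
WITH_NUD_IGP = ["DELETE", "INCMP", "STALE", "REACH"]
# After "no ipv6 nd nud igp" on the interface.
WITHOUT_NUD_IGP = ["DELETE", "INCMP", "STALE", "DELAY", "REACH"]

if __name__ == "__main__":
    for name, path in (("nud igp", WITH_NUD_IGP), ("no nud igp", WITHOUT_NUD_IGP)):
        print(name, "->", replay(path), "in", len(path) - 1, "transitions")
```

Both paths end in REACH; disabling the feature simply adds the DELAY step, during which the router waits for the solicited NA.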

Next, we will examine the RS and RA messages. Cisco routers will always send RA messages out of their IPv6 LAN interfaces unless suppressed. Suppressing them makes sense on transit links, such as CSR1-CSR2 and CSR3-CSR6, where there are no hosts. Leaving the “all” keyword off the command only suppresses unsolicited, periodic RA messages. The “all” keyword ensures that the router does not respond to RS messages with a solicited RA, either. Only CSR2 is shown, but this is configured on all transit links.
! CSR2
interface GigabitEthernet2.512
 ipv6 nd ra suppress all
R2#show ipv6 interface gig2.512 | include ND_RA
  ND RAs are suppressed (all)

RAs are allowed on the LAN segments upon which CSR4 and CSR5 are hosted. This allows hosts to discover the routers and automatically obtain IPv6 addresses from the on-link prefix(es).
! CSR5
interface GigabitEthernet2.556
 ipv6 address autoconfig default

We will examine the basic ND process between a host requiring autoconfiguration and the routers on the segment by debugging on CSR5. As expected, the very first thing all IPv6 nodes do is run DAD for their LL address. Since CSR5 has no explicit LL address, the EUI-64 process is used. This takes the 48-bit MAC address, inserts the hex string 0xFFFE into the middle of it, and sets the U/L bit in the MAC address to 1. With a MAC address of 0055.5555.5555, the EUI-64 address becomes 0255:55FF:FE55:5555. The prefix is FE80::/10 as always; CSR5 ensures its EUI-64 address is unique before doing anything else. After 1 second of not seeing an NA in response, it assumes the address is unique, and sends an unsolicited NA onto the segment to announce it. ! CSR5 15:33:46.200: ICMPv6-ND: (GigabitEthernet2.556) L2 came up 15:33:46.200: IPv6-Addrmgr-ND: DAD request for FE80::255:55FF:FE55:5555 on GigabitEthernet2.556 15:33:46.201: ICMPv6-ND: Delay DAD for FE80::255:55FF:FE55:5555 on GigabitEthernet2.556 by 200 msec 15:33:46.401: ICMPv6-ND: (GigabitEthernet2.556,FE80::255:55FF:FE55:5555) Sending DAD NS [A23BB] 15:33:46.402: ICMPv6: Sent N-Solicit, Src=::, Dst=FF02::1:FF55:5555 15:33:47.401: IPv6-Addrmgr-ND: DAD: FE80::255:55FF:FE55:5555 is unique. 15:33:47.401: ICMPv6-ND: (GigabitEthernet2.556,FE80::255:55FF:FE55:5555) Sending NA to FF02::1 15:33:47.401: ICMPv6-ND: (GigabitEthernet2.556) L3 came up 15:33:47.402: ICMPv6-ND: (GigabitEthernet2.556,FE80::255:55FF:FE55:5555) Linklocal Up 15:33:47.403: ICMPv6: Sent N-Advert, Src=FE80::255:55FF:FE55:5555, Dst=FF02::1
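The EUI-64 derivation described above can be checked with a short Python sketch. It uses only the standard `ipaddress` module; the function names `eui64_interface_id` and `link_local_from_mac` are hypothetical helpers introduced here. Note that the sketch inverts the U/L bit, which yields 1 for a universally administered MAC address such as the one in this lab.

```python
import ipaddress


def eui64_interface_id(mac: str) -> bytes:
    """Build the 8-byte modified EUI-64 interface ID from a 48-bit MAC."""
    hex_digits = mac.replace(".", "").replace(":", "").replace("-", "")
    raw = bytes.fromhex(hex_digits)
    if len(raw) != 6:
        raise ValueError("expected a 48-bit MAC address")
    first = raw[0] ^ 0x02  # invert the U/L bit in the first octet
    # Insert 0xFFFE between the OUI (3 bytes) and the NIC-specific part.
    return bytes([first]) + raw[1:3] + b"\xff\xfe" + raw[3:]


def link_local_from_mac(mac: str) -> ipaddress.IPv6Address:
    """Prepend FE80::/64 to the EUI-64 interface ID."""
    prefix = bytes.fromhex("fe80") + bytes(6)  # FE80 followed by 48 zero bits
    return ipaddress.IPv6Address(prefix + eui64_interface_id(mac))


if __name__ == "__main__":
    # CSR5's MAC address from the text.
    print(link_local_from_mac("0055.5555.5555"))
```

For CSR5’s MAC address of 0055.5555.5555, this produces the same FE80::255:55FF:FE55:5555 address seen in the DAD debug (Python prints it in lowercase).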

CSR5 also needs a globally routable address, but it has no idea what the on-link prefixes are. It needs to
check for routers on the segment by issuing an RS message to the all-routers multicast group sourced from its LL address. ! CSR5 15:33:47.756: ICMPv6-ND: (GigabitEthernet2.556) Sending RS 15:33:47.763: ICMPv6: Sent R-Solicit, Src=FE80::255:55FF:FE55:5555, Dst=FF02::2

CSR5 receives a solicited RA from CSR1 and CSR6 at the same time; we will examine CSR1’s RA first. Upon receipt, the RA is validated (hop limit = 255, no bogus flags, etc.). Because this was a solicited RA, the host gleans the MAC address based on the source MAC of the Ethernet frame. The entry is marked as STALE, not REACH, since traffic is not yet flowing through these routers. Next, there is a chatty process that is used for default router selection. Since CSR1 is the only router known, it is currently the best, and a default route is installed on CSR5. The RA also carries the on-link prefix 2020:5:6:11::/64, which will be used for autoconfiguration soon.
! CSR5
15:33:47.769: ICMPv6: Received R-Advert, Src=FE80::11, Dst=FF02::1
15:33:47.769: ICMPv6-ND: (GigabitEthernet2.556,FE80::11) Received RA
15:33:47.769: ICMPv6-ND: Validating ND packet options: valid
15:33:47.769: ICMPv6-ND: (GigabitEthernet2.556,FE80::11) Glean
15:33:47.769: ICMPv6-ND: (GigabitEthernet2.556,FE80::11) LLA 0011.1111.1111
15:33:47.769: ICMPv6-ND: (GigabitEthernet2.556,FE80::11) INCMP -> STALE
15:33:47.769: ICMPv6-ND: [default] New router interface context created/GigabitEthernet2.556
15:33:47.769: ICMPv6-ND: [default] New router interface context created/7F2323D66078
15:33:47.769: ICMPv6-ND: [default] inserted router FE80::11/GigabitEthernet2.556
15:33:47.769: ICMPv6-ND: [default] Select default router
15:33:47.769: ICMPv6-ND: [default] best rank is C11
15:33:47.769: ICMPv6-ND: [default] router FE80::11/GigabitEthernet2.556 is new best
15:33:47.769: ICMPv6-ND: [default] Selected new default router
15:33:47.769: ICMPv6-ND: [default] Install default to FE80::11/GigabitEthernet2.556
15:33:47.769: ICMPv6-ND: Prefix : 2020:5:6:11::, Length: 64, Vld Lifetime: 2592000, Prf Lifetime: 604800, PI Flags: C0
15:33:47.769: ICMPv6-ND: Created OL-prefix root for 0
15:33:47.769: ICMPv6-ND: New on-link prefix 2020:5:6:11::/64 on GigabitEthernet2.556/FE80::11, lifetime 2592000
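The “PI Flags: C0” value in the debug is the flags octet of the RA’s Prefix Information option. As a small sketch, assuming the standard RFC 4861 bit layout (on-link “L” flag in the high-order bit, autonomous “A” flag in the next bit), it can be decoded as follows. The function name `decode_prefix_info` is a hypothetical helper, and the conversion of lifetimes to days is added only for readability.

```python
SECONDS_PER_DAY = 86400


def decode_prefix_info(flags: int, valid_lifetime: int, preferred_lifetime: int) -> dict:
    """Decode the flags octet and lifetimes of a Prefix Information option."""
    return {
        "on_link": bool(flags & 0x80),      # L flag: prefix is on-link
        "autonomous": bool(flags & 0x40),   # A flag: usable for SLAAC
        "valid_days": valid_lifetime / SECONDS_PER_DAY,
        "preferred_days": preferred_lifetime / SECONDS_PER_DAY,
    }


if __name__ == "__main__":
    # Values taken from the CSR5 debug for prefix 2020:5:6:11::/64.
    print(decode_prefix_info(0xC0, 2592000, 604800))
```

With 0xC0 both flags are set, so CSR5 may treat the prefix as on-link and use it for autoconfiguration; the lifetimes work out to 30 days valid and 7 days preferred.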

CSR5 also receives an RA from CSR6. CSR6 is advertising the same on-link prefix, and since CSR5 already tracks this prefix, it simply notes that the prefix is supported by CSR6 as well. CSR6’s MAC address is gleaned just like CSR1’s. CSR5 continues to use CSR1 as its default router since there are no preferences configured and CSR1 is the older entry.

! CSR5 15:33:47.769: ICMPv6: Received R-Advert, Src=FE80::6, Dst=FF02::1 15:33:47.769: ICMPv6-ND: (GigabitEthernet2.556,FE80::6) Received RA 15:33:47.769: ICMPv6-ND: Validating ND packet options: valid 15:33:47.769: ICMPv6-ND: (GigabitEthernet2.556,FE80::6) Glean 15:33:47.769: ICMPv6-ND: (GigabitEthernet2.556,FE80::6) LLA 0066.6666.6666 15:33:47.769: ICMPv6-ND: (GigabitEthernet2.556,FE80::6) INCMP -> STALE 15:33:47.771: ICMPv6-ND: [default] New router interface context created/7F2323D66078 15:33:47.771: ICMPv6-ND: [default] inserted router FE80::6/GigabitEthernet2.556 15:33:47.771: ICMPv6-ND: [default] Select default router 15:33:47.771: ICMPv6-ND: [default] best rank is C11 15:33:47.771: ICMPv6-ND: Prefix : 2020:5:6:11::, Length: 64, Vld Lifetime: 2592000, Prf Lifetime: 604800, PI Flags: C0 15:33:47.771: ICMPv6-ND: Update on-link prefix 2020:5:6:11::/64 on GigabitEthernet2.556/FE80::6, lifetime 2592000

As a quick aside, we can verify the default route installed by CSR5 points to CSR1 as an ND route, and that CSR5 sees both routers. All of the detailed RA information is contained there as well. R5#show ipv6 route ::/0 Routing entry for ::/0 Known via "ND", distance 2, metric 0 Route count is 1/1, share count 0 Routing paths: FE80::11, GigabitEthernet2.556 Last updated 00:13:41 ago R5#show ipv6 routers detail IPV6 ND Routers (table: default) Router FE80::11 on GigabitEthernet2.556, last update 2 min Rank 0xC11 (elegible), Default Router Hops 64, Lifetime 1800 sec, AddrFlag=0, OtherFlag=0, MTU=1500 HomeAgentFlag=0, Preference=Medium, trustlevel = 0 Reachable time 0 (unspecified), Retransmit time 0 (unspecified) Prefix 2020:5:6:11::/64 onlink autoconfig Valid lifetime 2592000, preferred lifetime 604800 Router FE80::6 on GigabitEthernet2.556, last update 1 min Rank 0xC11 (elegible) Hops 64, Lifetime 1800 sec, AddrFlag=0, OtherFlag=0, MTU=1500 HomeAgentFlag=0, Preference=Medium, trustlevel = 0 Reachable time 0 (unspecified), Retransmit time 0 (unspecified) Prefix 2020:5:6:11::/64 onlink autoconfig Valid lifetime 2592000, preferred lifetime 604800

Mixed in with the debugs above is the DAD process for the global address derived from
autoconfiguration. For clarity, I grouped those debug messages below. The computed autoconfiguration address uses EUI-64 as well, which means both the LL and global addresses have the same host address. Thus, only a single solicited-node multicast group must be joined, shown below. After sending the NS, DAD waits for 1 second, as usual, then declares this global unicast address unique. The ND process finishes with an unsolicited NA for other hosts on the segment; recall that routers will ignore this by default and not use them for gleaned adjacencies unless configured. ! CSR5 15:33:47.769: IPv6-Addrmgr-ND: DAD request for 2020:5:6:11:255:55FF:FE55:5555 on GigabitEthernet2.556 15:33:47.769: ICMPv6-ND: (GigabitEthernet2.556,2020:5:6:11:255:55FF:FE55:5555) Sending DAD NS [A23BB] 15:33:47.769: ICMPv6-ND: Autoconfiguring 2020:5:6:11:255:55FF:FE55:5555 on GigabitEthernet2.556 15:33:47.771: ICMPv6-ND: %GigabitEthernet2.556: OK: IPv6 Address Autoconfig 2020:5:6:11::/64 eui-64, 2020:5:6:11:255:55FF:FE55:5555 2020:5:6:11:255:55FF:FE55:5555/64 is existing 15:33:47.773: ICMPv6: Sent N-Solicit, Src=::, Dst=FF02::1:FF55:5555 15:33:48.769: IPv6-Addrmgr-ND: DAD: 2020:5:6:11:255:55FF:FE55:5555 is unique. 15:33:48.769: ICMPv6-ND: (GigabitEthernet2.556,2020:5:6:11:255:55FF:FE55:5555) Sending NA to FF02::1 15:33:48.770: ICMPv6: Sent N-Advert, Src=2020:5:6:11:255:55FF:FE55:5555, Dst=FF02::1 R5#show ipv6 interface gig2.556 | section (group|unicast)_add Global unicast address(es): 2020:5:6:11:255:55FF:FE55:5555, subnet is 2020:5:6:11::/64 [EUI/CAL/PRE] valid lifetime 2591947 preferred lifetime 604747 Joined group address(es): FF02::1 FF02::2 FF02::1:FF55:5555
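Because the solicited-node group is built from only the low-order 24 bits of the unicast address, two addresses with the same interface ID map to the same group. A minimal Python sketch of that mapping follows; `solicited_node` is a hypothetical helper name, and the addresses are the ones used in this lab.

```python
import ipaddress

# Base of the solicited-node multicast range, FF02::1:FF00:0/104.
SOLICITED_NODE_BASE = int(ipaddress.IPv6Address("ff02::1:ff00:0"))


def solicited_node(addr: str) -> ipaddress.IPv6Address:
    """Return the solicited-node multicast group for a unicast address."""
    low24 = int(ipaddress.IPv6Address(addr)) & 0xFFFFFF  # keep the last 24 bits
    return ipaddress.IPv6Address(SOLICITED_NODE_BASE | low24)


if __name__ == "__main__":
    link_local = "FE80::255:55FF:FE55:5555"
    global_addr = "2020:5:6:11:255:55FF:FE55:5555"
    print(solicited_node(link_local))
    print(solicited_node(global_addr))
    # CSR1 (FE80::11) and CSR2 (FE80::12) land in different groups.
    print(solicited_node("FE80::11"), solicited_node("FE80::12"))
```

Both of CSR5’s addresses yield FF02::1:FF55:5555, matching the single solicited-node group in the interface output. The last line also illustrates why a router does not hear a neighbor’s DAD NS unless their addresses share the same low-order 24 bits.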

Continuing with our verification, CSR5 maintains these two routers as STALE entries until it actually sends traffic through them. Since CSR1 is the default gateway, sending traffic off-link will move CSR1 from STALE to REACH via the NUD process.
R5#show ipv6 neighbors gig2.556
IPv6 Address  Age Link-layer Addr State Interface
FE80::6        24 0066.6666.6666  STALE Gi2.556
FE80::11       24 0011.1111.1111  STALE Gi2.556

A single ping to an off-link destination demonstrates this: the entry for CSR1 moves from STALE to REACH, and its age resets to 0, since the “age” column represents the last time an ND conversation occurred with the given cache entry. Because CSR1 does not have the MAC address of CSR5 mapped to CSR5’s global address, it issues an NS message for it, to which CSR5 responds.

R5#ping 2020:0:11:12::11 repeat 1
Type escape sequence to abort.
Sending 1, 100-byte ICMP Echos to 2020:0:11:12::11, timeout is 2 seconds:
!
Success rate is 100 percent (1/1), round-trip min/avg/max = 10/10/10 ms
! CSR5
16:00:05.181: ICMPv6-ND: (GigabitEthernet2.556,2020:5:6:11:255:55FF:FE55:5555) Received NS from FE80::11
16:00:05.181: ICMPv6-ND: Validating ND packet options: valid
16:00:05.181: ICMPv6-ND: (GigabitEthernet2.556,2020:5:6:11:255:55FF:FE55:5555) Sending NA to FE80::11
16:00:05.182: ICMPv6-ND: (GigabitEthernet2.556,FE80::11) STALE -> DELAY
16:00:05.185: ICMPv6-ND: (GigabitEthernet2.556,FE80::11) ULP indication
16:00:05.185: ICMPv6-ND: (GigabitEthernet2.556,FE80::11) DELAY -> REACH
R5#show ipv6 neighbors
IPv6 Address  Age Link-layer Addr State Interface
FE80::6        27 0066.6666.6666  STALE Gi2.556
FE80::11        0 0011.1111.1111  REACH Gi2.556

If CSR5 wants to send traffic to the segment between CSR6 and CSR3, it still sends traffic to CSR1 initially. This is suboptimal and is handled with redirect messages. First, we verify that CSR1 is actually routing to CSR6 via the host LAN to which CSR5 is joined. For clarity, many of the basic NA/NS messages are stripped from the debugs since that process has been examined thoroughly. R1#show ipv6 route 2020:0:3:6::/64 Routing entry for 2020:0:3:6::/64 Known via "isis 2020", distance 115, metric 20, type level-2 Route count is 1/1, share count 0 Routing paths: FE80::6, GigabitEthernet2.556 Last updated 00:32:25 ago

When CSR5 sends packets to this destination, they first go to CSR1. CSR1 issues a redirect message to CSR5 to inform it of the better gateway. The target address is carried in the payload and identifies CSR6’s LL address as the next-hop. CSR5 does not appear to honor the redirect, but I wanted to show the mechanism.
! CSR1
ICMPv6-ND: (GigabitEthernet2.556,2020:0:3:6::6)Sending REDIRECT, target FE80::6
ICMPv6: Sent Redirect, Src=FE80::11, Dst=2020:5:6:11:255:55FF:FE55:5555
! CSR5
ICMPv6: Received Redirect, Src=FE80::11, Dst=2020:5:6:11:255:55FF:FE55:5555

Another interesting characteristic of the ND cache is the PROBE state. This is NUD in action, sending targeted (unicast) NS messages to verify reachability. CSR5 performs this towards CSR6 because the initial packet is asymmetrically routed. CSR5 sent it to CSR1, but the reply came from CSR6. CSR5 cannot guarantee that two-way reachability exists with CSR6 despite knowing its MAC address from the RA. The cache entry leaves the PROBE state for REACH once the solicited NA is received from the neighbor.
! CSR5
ICMPv6: Received echo reply, Src=2020:0:3:6::6, Dst=2020:5:6:11:255:55FF:FE55:5555
ICMPv6-ND: (GigabitEthernet2.556,FE80::6) DELAY -> PROBE
ICMPv6-ND: (GigabitEthernet2.556,FE80::6) Sending NS
ICMPv6: Sent N-Solicit, Src=FE80::255:55FF:FE55:5555, Dst=FE80::6
ICMPv6: Received N-Advert, Src=FE80::6, Dst=FE80::255:55FF:FE55:5555
ICMPv6-ND: (GigabitEthernet2.556,FE80::6) Received NA from FE80::6
ICMPv6-ND: Validating ND packet options: valid
ICMPv6-ND: Packet contains no options
ICMPv6-ND: (GigabitEthernet2.556,FE80::6) PROBE -> REACH

We will briefly examine anycast addressing on the LAN as well. Earlier, we saw how DAD can determine if there are duplicate addresses on a LAN; clearly this does not make sense for anycast gateways where the same IPv6 address may exist on the LAN. In XE, we can append the “anycast” keyword to an IPv6 address that essentially disables DAD for the address. In XE and XR, we can disable DAD for the entire interface, which affects all prefixes. We will use both methods on CSR1 and CSR6 while adding a new anycast IPv6 address to the subnet. We will ensure this new address is within the same on-link prefix so the composition of the RA message need not change. The method used on CSR1 is the only way to configure anycast addresses on XR. ! CSR1 interface GigabitEthernet2.556 ipv6 address 2020:5:6:11::611/64 ipv6 nd dad attempts 0 ! CSR6 interface GigabitEthernet2.556 ipv6 address 2020:5:6:11::611/64 anycast R1#show ipv6 interface gig2.556 | section Global_uni|DAD Global unicast address(es): 2020:5:6:11::11, subnet is 2020:5:6:11::/64 2020:5:6:11::611, subnet is 2020:5:6:11::/64 ND DAD is disabled R6#show ipv6 interface gig2.556 | section Global_uni|DAD

Global unicast address(es): 2020:5:6:11::6, subnet is 2020:5:6:11::/64 2020:5:6:11::611, subnet is 2020:5:6:11::/64 [ANY] ND DAD is enabled, number of DAD attempts: 1

Debugging ND on CSR1 shows that the DAD software process is invoked for all three addresses, but immediately returns that the addresses are unique without actually doing anything. The timestamps prove this.
! CSR1
16:40:22.787: IPv6-Addrmgr-ND: DAD request for FE80::11 on GigabitEthernet2.556
16:40:22.787: IPv6-Addrmgr-ND: DAD: FE80::11 is unique.
16:40:22.788: IPv6-Addrmgr-ND: DAD request for 2020:5:6:11::11 on GigabitEthernet2.556
16:40:22.788: IPv6-Addrmgr-ND: DAD: 2020:5:6:11::11 is unique.
16:40:22.788: IPv6-Addrmgr-ND: DAD request for 2020:5:6:11::611 on GigabitEthernet2.556
16:40:22.788: IPv6-Addrmgr-ND: DAD: 2020:5:6:11::611 is unique.

The output is similar on CSR6, but only for the anycast address. The other addresses undergo the normal DAD process. ! CSR6 16:42:55.742: IPv6-Addrmgr-ND: DAD request for FE80::6 on GigabitEthernet2.556 16:42:55.742: ICMPv6-ND: Delay DAD for FE80::6 on GigabitEthernet2.556 by 200 msec 16:42:55.941: ICMPv6-ND: (GigabitEthernet2.556,FE80::6) Sending DAD NS [D9C50] 16:42:56.942: IPv6-Addrmgr-ND: DAD: FE80::6 is unique. 16:42:56.942: ICMPv6-ND: (GigabitEthernet2.556,FE80::6) Sending NA to FF02::1 16:42:56.942: ICMPv6-ND: (GigabitEthernet2.556) L3 came up 16:42:56.942: IPv6-Addrmgr-ND: DAD request for 2020:5:6:11::6 on GigabitEthernet2.556 16:42:56.942: ICMPv6-ND: (GigabitEthernet2.556,2020:5:6:11::6) Sending DAD NS [D9C50] 16:42:56.942: IPv6-Addrmgr-ND: DAD request for 2020:5:6:11::611 on GigabitEthernet2.556 16:42:56.942: IPv6-Addrmgr-ND: DAD: 2020:5:6:11::611 is unique. 16:42:57.942: IPv6-Addrmgr-ND: DAD: 2020:5:6:11::6 is unique.

There are several other important RA options as well. We can have multiple on-link prefixes but only offer a subset of them to clients for autoconfiguration. For example, CSR2 and CSR3 are both routers serving network access to CSR4. They have a global unicast prefix as well as a unique-local prefix for intra-site routing. The client should typically not use ULA for autoconfiguration if it wants Internet reachability. Both CSR2 and CSR3 can suppress this prefix from their RA messages so that CSR4 is not aware of its existence. CSR2 and CSR3 have nearly identical configurations, not counting the host addresses, so only CSR2 is shown. We can verify the prefixes advertised by an IPv6-enabled router interface as well; the ULA prefix has the ‘N’ flag to indicate it is not advertised, while the global unicast prefix is advertised.
! CSR2
interface GigabitEthernet2.542
 ipv6 address FE80::12 link-local
 ipv6 address 2020:3:4:12::12/64
 ipv6 address FD00:3:4:12::12/64
 ipv6 nd prefix FD00:3:4:12::/64 no-advertise
R2#show ipv6 interface gig2.542 prefix
IPv6 Prefix Advertisements GigabitEthernet2.542
Codes for 1st column: A - Address, P - Prefix-Advertisement, O - Pool
                      U - Per-user prefix
Codes for 2nd column and above: D - Default
                      N - Not advertised, C - Calendar
PD  default [LA] Valid lifetime 2592000, preferred lifetime 604800
AD  2020:3:4:12::/64 [LA] Valid lifetime 2592000, preferred lifetime 604800
PAN FD00:3:4:12::/64 [LA] Valid lifetime 2592000, preferred lifetime 604800

CSR4 only sees the global prefix as a result and computes its EUI-64 address accordingly. R4#show ipv6 routers | include ^Router|Prefix Router FE80::12 on GigabitEthernet2.542, last update 1 min Prefix 2020:3:4:12::/64 onlink autoconfig Router FE80::3 on GigabitEthernet2.542, last update 1 min Prefix 2020:3:4:12::/64 onlink autoconfig R4#show ipv6 interface gig2.542 | section Global_uni Global unicast address(es): 2020:3:4:12:244:44FF:FE44:4444, subnet is 2020:3:4:12::/64 [EUI/CAL/PRE] valid lifetime 2591882 preferred lifetime 604682

We can adjust the unsolicited RA interval and the corresponding lifetime as well. The lifetime governs how long the router can be considered a valid default router, not how long the RA itself is valid. Debugging on CSR4, we can see that an RA from CSR2 (FE80::12) is received roughly every 20 seconds and an RA from CSR3 (FE80::3) roughly every 30 seconds. For debugging brevity, we look at the ICMPv6 packet exchange without examining the ND process. To prevent RA synchronization, the timers are randomized within a range; the configured values are maximums. The minimum is 75% of the maximum and the actual timer used is a random number in that range. The minimum can be adjusted as well, but 75% is a good value. Thus, CSR2’s RAs are sent every 15 – 20 seconds while CSR3’s RAs are sent every 22.5 – 30 seconds.

! CSR2
interface GigabitEthernet2.542
 ipv6 nd ra lifetime 200
 ipv6 nd ra interval 20
! CSR3
interface GigabitEthernet2.542
 ipv6 nd ra lifetime 300
 ipv6 nd ra interval 30
! CSR4
16:57:59.614: ICMPv6: Received R-Advert, Src=FE80::3, Dst=FF02::1
16:58:09.105: ICMPv6: Received R-Advert, Src=FE80::12, Dst=FF02::1
16:58:25.823: ICMPv6: Received R-Advert, Src=FE80::3, Dst=FF02::1
16:58:27.625: ICMPv6: Received R-Advert, Src=FE80::12, Dst=FF02::1
16:58:44.403: ICMPv6: Received R-Advert, Src=FE80::12, Dst=FF02::1
16:58:55.329: ICMPv6: Received R-Advert, Src=FE80::3, Dst=FF02::1
16:59:04.014: ICMPv6: Received R-Advert, Src=FE80::12, Dst=FF02::1
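The randomized RA timer can be sanity-checked against the captured timestamps. The Python sketch below assumes the behavior described in the text (a minimum of 75% of the configured maximum, with a uniformly random delay in that range). The helper names are hypothetical, and the uniform distribution is an assumption for illustration rather than a statement about how IOS draws its random values.

```python
import random


def ra_interval_range(max_interval, min_fraction=0.75):
    """Return (minimum, maximum) seconds between unsolicited RAs."""
    return (max_interval * min_fraction, max_interval)


def next_ra_delay(max_interval, rng=random):
    """Pick a delay within the range (uniform choice is an assumption)."""
    low, high = ra_interval_range(max_interval)
    return rng.uniform(low, high)


def to_seconds(stamp):
    """Convert an HH:MM:SS.mmm debug timestamp to seconds since midnight."""
    hours, minutes, seconds = stamp.split(":")
    return int(hours) * 3600 + int(minutes) * 60 + float(seconds)


def gaps(stamps):
    """Seconds elapsed between consecutive timestamps."""
    secs = [to_seconds(s) for s in stamps]
    return [round(later - earlier, 3) for earlier, later in zip(secs, secs[1:])]


def within(stamps, max_interval):
    """True if every observed gap falls inside the expected range."""
    low, high = ra_interval_range(max_interval)
    return all(low <= gap <= high for gap in gaps(stamps))


# Arrival times on CSR4, grouped by source router.
CSR2_STAMPS = ["16:58:09.105", "16:58:27.625", "16:58:44.403", "16:59:04.014"]
CSR3_STAMPS = ["16:57:59.614", "16:58:25.823", "16:58:55.329"]

if __name__ == "__main__":
    print("CSR2 gaps:", gaps(CSR2_STAMPS), within(CSR2_STAMPS, 20))
    print("CSR3 gaps:", gaps(CSR3_STAMPS), within(CSR3_STAMPS, 30))
```

The gaps between CSR2’s RAs all fall between 15 and 20 seconds, and CSR3’s fall between 22.5 and 30 seconds, consistent with the configured intervals.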

The Default Router Preference (DRP) feature provides basic “low, medium, high” priorities as a tiebreaker for selecting a default router. On the LAN with CSR4, both CSR2 and CSR3 are originating RAs. CSR2 is configured with a priority of “high” while CSR3 uses the default priority of “medium”. We can confirm this on both routers by checking the IPv6 interface details. ! CSR2 interface GigabitEthernet2.542 ipv6 nd router-preference High R2#show ipv6 interface gig2.542 | include preference ND advertised default router preference is High R3#show ipv6 interface gig2.542 | include preference ND advertised default router preference is Medium

Debugging IPv6 ND on CSR4, it receives unsolicited RAs from both CSR2 and CSR3 periodically. CSR4 can see this DRP value and always select CSR2 when it is available. Notice that the RA lifetimes are shown in this output as well, which are different than the prefix lifetimes. The RA lifetime measures how long this router is useful as a default router; prefix lifetimes are examined next. R4#show ipv6 routers detail IPV6 ND Routers (table: default) Router FE80::12 on GigabitEthernet2.542, last update 0 min Rank 0xC19 (elegible), Default Router Hops 64, Lifetime 200 sec, AddrFlag=0, OtherFlag=0, MTU=1500 HomeAgentFlag=0, Preference=High, trustlevel = 0 Reachable time 0 (unspecified), Retransmit time 0 (unspecified)

Prefix 2020:3:4:12::/64 onlink autoconfig Valid lifetime 2592000, preferred lifetime 604800 Router FE80::3 on GigabitEthernet2.542, last update 0 min Rank 0xC11 (elegible) Hops 64, Lifetime 300 sec, AddrFlag=0, OtherFlag=1, MTU=1500 HomeAgentFlag=0, Preference=Medium, trustlevel = 0 Reachable time 0 (unspecified), Retransmit time 0 (unspecified) Prefix 2020:3:4:12::/64 onlink autoconfig Valid lifetime 2592000, preferred lifetime 604800
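The selection logic seen so far (prefer the highest advertised preference, and otherwise keep the older entry) can be sketched in Python. This is a simplified model built only from the behavior observed in this section; it does not reproduce how IOS computes its internal rank values (0xC19 versus 0xC11), and the `Router` class and `learned_order` field are hypothetical constructs for illustration.

```python
from dataclasses import dataclass

# Relative ordering of the three DRP values.
PREFERENCE_VALUE = {"Low": 0, "Medium": 1, "High": 2}


@dataclass
class Router:
    address: str
    preference: str      # "Low", "Medium", or "High" as shown in the RA
    learned_order: int   # hypothetical: 0 for the first RA learned, then 1, ...


def select_default(routers):
    """Pick the highest preference; on a tie, keep the oldest entry."""
    return max(
        routers,
        key=lambda r: (PREFERENCE_VALUE[r.preference], -r.learned_order),
    )


if __name__ == "__main__":
    # CSR4's LAN: CSR2 advertises High, CSR3 advertises Medium.
    csr4_view = [Router("FE80::3", "Medium", 0), Router("FE80::12", "High", 1)]
    print(select_default(csr4_view).address)

    # CSR5's LAN: both Medium, CSR1 was learned first.
    csr5_view = [Router("FE80::11", "Medium", 0), Router("FE80::6", "Medium", 1)]
    print(select_default(csr5_view).address)
```

In this model CSR4 selects FE80::12 regardless of learning order, while CSR5 stays with FE80::11 because nothing breaks the tie in CSR6’s favor.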

The valid and preferred lifetimes are used to denote how long a prefix can be used or preferred. The preferred lifetime cannot exceed the valid lifetime, and these values can be tuned per-prefix. However, all routers on the segment should agree on the values or else the router will display an error message showing the differences. ! CSR2 interface GigabitEthernet2.542 ipv6 nd prefix 2020:3:4:12::/64 200 180 ! CSR2 %IPV6_ND-3-CONFLICT: Router FE80::3 on GigabitEthernet2.542 conflicting ND setting prefix 2020:3:4:12::/64 valid lifetime, difference 2591800 seconds ! CSR3 %IPV6_ND-3-CONFLICT: Router FE80::12 on GigabitEthernet2.542 conflicting ND setting prefix 2020:3:4:12::/64 valid lifetime, difference 2591800 seconds
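The “difference 2591800 seconds” in the conflict message is simply the gap between the two advertised valid lifetimes. A brief Python sketch of both rules follows; the function names are hypothetical, and the 2592000-second value is the default valid lifetime seen in the earlier outputs.

```python
DEFAULT_VALID_LIFETIME = 2592000  # seconds, as seen in the earlier outputs


def check_prefix_lifetimes(valid: int, preferred: int) -> None:
    """Reject a configuration where preferred exceeds valid."""
    if preferred > valid:
        raise ValueError("preferred lifetime cannot exceed valid lifetime")


def lifetime_difference(local_valid: int, remote_valid: int) -> int:
    """Absolute difference between two routers' advertised valid lifetimes."""
    return abs(local_valid - remote_valid)


if __name__ == "__main__":
    check_prefix_lifetimes(valid=200, preferred=180)  # CSR2's new values
    # CSR2 advertises 200 seconds while CSR3 still uses the default.
    print(lifetime_difference(200, DEFAULT_VALID_LIFETIME))
```

The computed difference matches the value in both routers’ log messages, since each router compares its own setting against the one received from its peer.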

For consistency, we configure these settings on CSR3 as well (not shown), then verify it on both routers and the client (CSR4). R2#show ipv6 interface gig2.542 prefix | include 2020 PA 2020:3:4:12::/64 [LA] Valid lifetime 200, preferred lifetime 180 R3#show ipv6 interface gig2.542 prefix | include 2020 PA 2020:3:4:12::/64 [LA] Valid lifetime 200, preferred lifetime 180 R4#sh ipv6 router | include ^Router|Prefix|Valid Router FE80::12 on GigabitEthernet2.542, last update 0 min Prefix 2020:3:4:12::/64 onlink autoconfig Valid lifetime 200, preferred lifetime 180 Router FE80::3 on GigabitEthernet2.542, last update 0 min Prefix 2020:3:4:12::/64 onlink autoconfig Valid lifetime 200, preferred lifetime 180

There are several other options we can enable per-prefix as well, such as whether the prefix can be used for autoconfiguration, whether it is on-link, etc. We will configure a bogus prefix on CSR2 only which is not on-link and cannot be used for autoconfiguration. Notice that we do not need to configure an IPv6 address on CSR2 for this prefix.

! CSR2 interface GigabitEthernet2.542 ipv6 nd prefix 2020:FFFF:FFFF:FFFF::/64 infinite infinite no-autoconfig noonlink R2#show ipv6 interface gig2.542 prefix | include FFFF P 2020:FFFF:FFFF:FFFF::/64 [] Valid lifetime infinite, preferred lifetime infinite

CSR4 learns the prefix but it cannot use it for much, at present. There is no ND route for it (no connected, on-link route) and it cannot be used for auto-configuration. R4#show ipv6 routers default Router FE80::12 on GigabitEthernet2.542, last update 0 min Hops 64, Lifetime 200 sec, AddrFlag=0, OtherFlag=0, MTU=1500 HomeAgentFlag=0, Preference=High, trustlevel = 0 Reachable time 0 (unspecified), Retransmit time 0 (unspecified) Prefix 2020:3:4:12::/64 onlink autoconfig Valid lifetime 200, preferred lifetime 180 Prefix 2020:FFFF:FFFF:FFFF::/64 Valid lifetime infinite, preferred lifetime infinite R4#show ipv6 route nd | begin ^ND ND ::/0 [2/0] via FE80::12, GigabitEthernet2.542 NDp 2020:3:4:12::/64 [2/0] via GigabitEthernet2.542, directly connected

Next, we will examine DHCPv6 for stateless autoconfiguration. DHCPv6’s role in this design is to issue non-address related configuration, such as DNS servers, domain names, SNTP servers, etc. This information doesn’t need to be bound to a host and can be handed out freely to SLAAC clients upon request. A router signals that it is capable of providing DHCPv6 “other” configurations by setting the ‘O’ flag in the RA, described earlier. CSR3 will be the DHCPv6 server and will notify CSR4 about some non-address configurations. CSR4 doesn’t have to use CSR3 as a default gateway to use this service, either. We verify that CSR4 can see this ‘O’ flag set in the RA from CSR3 but not CSR2.
! CSR3
ipv6 dhcp pool DHCPV6_POOL
 dns-server 2020::BEEF
 domain-name lab.local
 sntp address 2001:0:3:7::3
interface GigabitEthernet2.542
 ipv6 nd other-config-flag
 ipv6 dhcp server DHCPV6_POOL

R4#show ipv6 routers | include Router|Other Router FE80::12 on GigabitEthernet2.542, last update 0 min Hops 64, Lifetime 200 sec, AddrFlag=0, OtherFlag=0, MTU=1500 Router FE80::3 on GigabitEthernet2.542, last update 0 min Hops 64, Lifetime 300 sec, AddrFlag=0, OtherFlag=1, MTU=1500
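The AddrFlag and OtherFlag fields in this output correspond to the ‘M’ and ‘O’ flags of the RA. A minimal Python sketch of how a client might interpret them is shown below, assuming the conventional reading of the flags (M set means stateful DHCPv6, O set alone means stateless DHCPv6 for “other” information). The function name `dhcpv6_mode` and its return strings are hypothetical labels.

```python
def dhcpv6_mode(addr_flag: int, other_flag: int) -> str:
    """Map the RA's M (AddrFlag) and O (OtherFlag) bits to a client behavior."""
    if addr_flag:
        # Addresses (and other settings) are obtained from a DHCPv6 server.
        return "stateful DHCPv6"
    if other_flag:
        # Addresses come from SLAAC; DNS and similar data come from DHCPv6.
        return "SLAAC + stateless DHCPv6"
    return "SLAAC only"


if __name__ == "__main__":
    # Flag values as seen by CSR4 for each router on the LAN.
    print("FE80::12:", dhcpv6_mode(addr_flag=0, other_flag=0))
    print("FE80::3: ", dhcpv6_mode(addr_flag=0, other_flag=1))
```

Under this reading, CSR2’s RA tells CSR4 to rely on SLAAC alone, while CSR3’s RA invites an information request for the non-address settings.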

When CSR4 receives a solicited RA from CSR3 after having sent an RS (assume a link-up event on CSR4), it will notice the ‘O’ flag and invoke the DHCPv6 process to send traffic to the DHCPv6 servers and relay agents multicast group (FF02::1:2). Configuring the DHCPv6 pool on CSR3 causes it to listen to this group as a DHCPv6 server. The other group, FF05::1:3, is for DHCPv6 servers only, not relay agents. It serves the same function but is site-scoped, so it can be routed beyond the local link.
R3#show ipv6 interface gig2.542 | section group_add
  Joined group address(es):
    FF02::1
    FF02::2
    FF02::1:2
    FF02::1:FF00:3
    FF05::1:3
R4#debug ipv6 nd
R4#debug ipv6 dhcp detail
17:28:48.238: ICMPv6-ND: (GigabitEthernet2.542) Sending RS
17:28:48.241: ICMPv6-ND: (GigabitEthernet2.542,FE80::3) Received RA
17:28:48.241: ICMPv6-ND: Validating ND packet options: valid
[snip, normal RA processing]
17:28:48.242: ICMPv6-ND: O-bit set; checking DHCP
17:28:48.242: IPv6 DHCP: detailed packet contents
17:28:48.242:   src FE80::244:44FF:FE44:4444
17:28:48.242:   dst FF02::1:2 (GigabitEthernet2.542)
17:28:48.242:   type INFORMATION-REQUEST(11), xid 13468421
17:28:48.242:   option ELAPSED-TIME(8), len 2
17:28:48.242:     elapsed-time 0
17:28:48.242:   option CLIENTID(1), len 10
17:28:48.242:     00030001001E4980B400
17:28:48.242:   option ORO(6), len 4
17:28:48.242:     DNS-SERVERS,DOMAIN-LIST
17:28:48.242: IPv6 DHCP: Sending INFORMATION-REQUEST to FF02::1:2 on GigabitEthernet2.542
17:28:48.243: IPv6 DHCP: DHCPv6 changes state from IDLE to INFORMATION-REQUEST (STATELESS) on GigabitEthernet2.542
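The 10-byte CLIENTID value in the debug is a DHCP Unique Identifier (DUID). As a hedged illustration, the Python sketch below decodes it assuming the DUID-LL layout from the DHCPv6 specification (a 2-byte DUID type of 3, a 2-byte hardware type, then the link-layer address). The function name `decode_duid_ll` is a hypothetical helper, and the sketch handles only this one DUID type.

```python
import struct

DUID_LL = 3          # DUID based on link-layer address
HW_TYPE_ETHERNET = 1


def decode_duid_ll(hex_string: str) -> dict:
    """Decode a DUID-LL into its type, hardware type, and link-layer address."""
    raw = bytes.fromhex(hex_string)
    duid_type, hw_type = struct.unpack("!HH", raw[:4])  # two 16-bit fields
    if duid_type != DUID_LL:
        raise ValueError("only DUID-LL (type 3) is handled in this sketch")
    mac_hex = raw[4:].hex()
    # Render the address in Cisco's dotted notation (xxxx.xxxx.xxxx).
    dotted = ".".join(mac_hex[i:i + 4] for i in range(0, len(mac_hex), 4))
    return {"duid_type": duid_type, "hw_type": hw_type, "link_layer": dotted}


if __name__ == "__main__":
    # CLIENTID value sent by CSR4 in the INFORMATION-REQUEST.
    print(decode_duid_ll("00030001001E4980B400"))
```

Decoding the client’s value shows an Ethernet hardware type and an embedded link-layer address of 001e.4980.b400, which explains why the option length is 10 bytes (4 bytes of header plus a 6-byte MAC address).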

CSR3 replies to the DHCP information request with the DNS servers and domain-list. CSR4 did not ask for the SNTP servers, as seen above, so the DHCPv6 server did not respond with them. The response is a unicast reply to the client’s LL address; CSR4 then saves the new DNS and domain information.
! CSR4
17:28:48.245: IPv6 DHCP: Received REPLY message

17:28:48.245: IPv6 DHCP: Received REPLY from FE80::3 on GigabitEthernet2.542
17:28:48.245: IPv6 DHCP: detailed packet contents
17:28:48.245:   src FE80::3 (GigabitEthernet2.542)
17:28:48.245:   dst FE80::244:44FF:FE44:4444 (GigabitEthernet2.542)
17:28:48.245:   type REPLY(7), xid 13468421
17:28:48.245:   option SERVERID(2), len 10
17:28:48.245:     00030001001EE5A8FF00
17:28:48.245:   option CLIENTID(1), len 10
17:28:48.245:     00030001001E4980B400
17:28:48.245:   option DNS-SERVERS(23), len 16
17:28:48.245:     2020::BEEF
17:28:48.245:   option DOMAIN-LIST(24), len 11
17:28:48.245:     lab.local
17:28:48.245: IPv6 DHCP: Adding server FE80::3
17:28:48.245: IPv6 DHCP: Processing options
17:28:48.245: IPv6 DHCP: Configuring DNS server 2020::BEEF
17:28:48.245: IPv6 DHCP: Configuring domain name lab.local
17:28:48.245: IPv6 DHCP: DHCPv6 changes state from INFORMATION-REQUEST to IDLE (REPLY_RECEIVED) on GigabitEthernet2.542

R4#show hosts
Name lookup view: Global
Default domain is not set
Domain list: lab.local
[snip]

We can also test the stateful DHCPv6 behavior. This allows a DHCPv6 server to hand out IPv6 addresses from a specific prefix (pool) as in DHCPv4. We can extend our DHCPv6 pool to add a prefix, then offer it to CSR7 on a new interface. The ‘M’ and ‘O’ flags are set on this interface which allows CSR7 to get all of its information, both addressing and “other”, from the DHCPv6 server. We confirm that CSR7 sees both of these flags in the RA from CSR3.

! CSR3
ipv6 dhcp pool DHCPV6_POOL
 address prefix 2020:0:3:7::/64
interface GigabitEthernet2.537
 ipv6 dhcp server DHCPV6_POOL
 ipv6 nd managed-config-flag
 ipv6 nd other-config-flag

R7#show ipv6 routers detail
IPV6 ND Routers (table: default)
Router FE80::3 on GigabitEthernet2.537, last update 0 min
  Rank 0xA11 (elegible), Default Router
  Hops 64, Lifetime 1800 sec, AddrFlag=1, OtherFlag=1, MTU=1500
  HomeAgentFlag=0, Preference=Medium, trustlevel = 0
  Reachable time 0 (unspecified), Retransmit time 0 (unspecified)
  Prefix 2020:0:3:7::/64 onlink autoconfig
    Valid lifetime 2592000, preferred lifetime 604800
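The AddrFlag and OtherFlag values in the output correspond to the ‘M’ and ‘O’ bits of the RA. As a minimal sketch of how a client could interpret them, assuming the RFC 4861 bit positions (M = 0x80, O = 0x40 in the RA flags octet), the function names below are illustrative only and are not part of any router software:

```python
from typing import Tuple


def decode_ra_flags(flags_octet: int) -> Tuple[bool, bool]:
    """Return (managed, other) from the RA flags octet (RFC 4861 layout)."""
    managed = bool(flags_octet & 0x80)  # 'M' bit: AddrFlag in the show output
    other = bool(flags_octet & 0x40)    # 'O' bit: OtherFlag in the show output
    return managed, other


def client_behaviour(managed: bool, other: bool) -> str:
    """Describe what a host is expected to do for a given M/O combination."""
    if managed:
        # When M is set, DHCPv6 supplies addresses and other information.
        return "stateful DHCPv6"
    if other:
        return "SLAAC plus stateless DHCPv6"
    return "SLAAC only"


# CSR4's link (AddrFlag=0, OtherFlag=1) and CSR7's link (AddrFlag=1, OtherFlag=1)
print(client_behaviour(*decode_ra_flags(0x40)))
print(client_behaviour(*decode_ra_flags(0xC0)))
```

The first call models the earlier stateless case on GigabitEthernet2.542 and the second models the stateful case on GigabitEthernet2.537.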

Unfortunately, XE does not appear to support a stateful DHCPv6 client at this time. “ipv6 address dhcp” is not a supported option, but we will show the rest of CSR7’s configuration for completeness. With IPv6 enabled, a LL address can be obtained automatically. We also tell ND to automatically configure the prefix and default-route based on the address received, which would normally come from DHCPv6 in a functional design.

R7(config-subif)#ipv6 address ?
  WORD         General prefix name
  X:X:X:X::X   IPv6 link-local address
  X:X:X:X::X/  IPv6 prefix
  autoconfig   Obtain address using autoconfiguration

! CSR7
interface GigabitEthernet2.537
 ipv6 enable
 ipv6 nd autoconfig prefix
 ipv6 nd autoconfig default-route

Additional Reading – Reference configurations “ipv6-nd”

1.2 Broadband Aggregation (BBA)

BBA is a sizable topic and only the basic concepts are covered here. Below are some example BBA architectures and definitions:

1. Direct connections from DSLAM/ANs to BNGs.
2. DSLAM/AN to an aggregate Ethernet switch, then to BNG, in a hub-spoke criss-cross form; classic design.
3. DSLAM/AN to an aggregate Ethernet switch, all of which are tied into a ring where the BNGs also reside.

BNG - Broadband Network Gateway. Sits between the DSLAM, or aggregator of DSL connections, and the IP network of the network service provider (NSP). It may encompass the BRAS, but the two are not the same. Some architectures may introduce dual BNGs, for example dedicating one to video services and another to all other services. In dual-BNG scenarios, each BNG does not have to meet all requirements individually, as long as the union of the BNG capabilities does.

BRAS - Broadband remote access server. This is the aggregation point between the NSP and the access network, typically using IP. It is also an injection point for policies, such as IP QoS.

BBA - Broadband aggregation. This commonly relies on L2TP. The main components of L2TP are a reliable control channel that is responsible for session setup, negotiation, and teardown, and a forwarding plane that adds negotiated session IDs and forwards traffic. Layer 2 circuits terminate in a device called an L2TP access concentrator (LAC), and the PPP sessions terminate in an L2TP network server (LNS). The LNS authenticates the user and is the endpoint for PPP negotiation. The LAC is closer to the customer than the LNS, and is the “downstream tail end” of the L2TP tunnel, whereas the LNS is the “upstream head end”. Thus, the PPPoE frames are tunneled inside L2TP to the LNS. The LAC connects to the LNS using a LAN or a WAN connection, and L2TP rides over the top of this. The LAC directs the subscriber session into L2TP tunnels based on the domain of each session.

1.2.1 PPP over Ethernet (PPPoE) technology

PPPoE is commonly used for BBA because it offers all of the benefits of PPP (authentication, directional call control, compression, encryption, etc) but can use Ethernet at layer 2 as transport. Many callers can “dial in” to the BNG on a shared segment and gain network connectivity in this way. Unlike on plain Ethernet, clients cannot talk to one another directly, and this method works well with the N:1 VLAN paradigm discussed later, which further restricts peer-to-peer connectivity at layer 2.

XR supports PPPoE server only, but XRv does not appear to support PPPoE at all. The CSR1000v supports both roles, but some features are unsupported. Currently, I have discovered that Microsoft Point to Point Encryption (MPPE) and compression (stac, predictor) are not supported. The CSR generates a log message when you try to configure these features to indicate that its virtual-access interface is incapable of supporting them.

%FMANRP_ESS-4-FULLVAI: Session creation failed due to Full Virtual-Access Interfaces not being supported. Check that all applied Virtual-Template and RADIUS features support Virtual-Access sub-interfaces. swidb= 0x7F1E9054C508, ifnum= 19

To represent a PPPoE-based BBA architecture that is somewhat realistic, we will use a hierarchical access/aggregation network (a similar network is used for NAT444, NAT464, etc). CSR8, CSR9, and CSR10 are the PPPoE servers while CSR2 through CSR7 are the PPPoE clients. The clients are like CPE routers in residential areas while the PPPoE servers are the BNGs. XRv1 and XRv2 are Internet gateways, and XRv3 is the Internet. Because XR does not support IPv6 SLAAC, CSR1 has several VRFs to simulate a client behind each CPE router. Basic NAT44 is used to translate private CPE addressing to global addressing at the CPE; hierarchical NAT is discussed in a dedicated chapter. NAT is not the focus of this lab so very basic NAT techniques are used; without NAT, the PPPoE design would be very unrealistic, with IGP running everywhere.


First, we will configure PPPoE between CSR8 and CSR2; as the access concentrator (AC), CSR8 only has one client. The AC uses a virtual-template interface and the client uses a dialer interface. We will negotiate the client IP address using IP control protocol (IPCP) which is a function of PPP. The addresses issued to clients will be handed out from a local pool (not DHCP). We can also apply limits to the number of client sessions the server will accept; in this case, we say there can be only one per-MAC and per-VLAN. This prevents CSR2 from dialing into CSR8 multiple times. We must adjust the MTU on the PPPoE virtual interfaces to be 8 bytes less than the supported layer 3 MTU since PPPoE adds 8 bytes of encapsulation.

To support IPv6, we create two pools. The first is meant to service the transit links (the configuration below has a tricky error; will be fixed later) and the second is a way to “delegate” a downstream IPv6 prefix for the client to use. In this way, DHCPv6 can offer CSR2 a LAN-side public prefix so that CSR2 doesn’t have to manually configure it. The transit-link prefix is advertised using IPv6 ND, so we must unsuppress RAs on the BNG.

! CSR8
bba-group pppoe PPPOE_28
 virtual-template 28
 sessions per-mac limit 1
 sessions per-vlan limit 1
ipv6 address FE80::8 link-local
ip local pool PPPOE_POOL_V4 209.2.8.100 209.2.8.149
ipv6 local pool PPPOE_POOL_V6 2001:10:2:80::/60 64
ipv6 local pool PD_POOL_V6 2001:192:168:80::/60 64
interface Virtual-Template28
 mtu 1492
 ip unnumbered Loopback28
 peer default ip address pool PPPOE_POOL_V4
 peer default ipv6 pool PPPOE_POOL_V6
 ipv6 enable
 no ipv6 nd ra suppress
 ipv6 nd ra lifetime 60
 ipv6 nd ra interval 10 5
 ipv6 dhcp server DHCP_POOL_V6
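Two pieces of arithmetic sit behind this configuration: the 1492-byte MTU and the number of /64 prefixes each /60 local pool can hand out. The following is a minimal sketch using only the Python standard library; the constant and function names are assumptions chosen for illustration. It assumes a 1500-byte Ethernet MTU and splits the 8 bytes of overhead into the 6-byte PPPoE header plus the 2-byte PPP protocol field.

```python
import ipaddress

ETHERNET_MTU = 1500       # assumed layer 3 MTU of the underlying Ethernet link
PPPOE_HEADER_LEN = 6      # ver/type (1) + code (1) + session ID (2) + length (2)
PPP_PROTOCOL_LEN = 2      # PPP protocol field carried inside the PPPoE payload


def pppoe_mtu(ethernet_mtu=ETHERNET_MTU):
    """MTU to configure on the virtual-template and dialer interfaces."""
    return ethernet_mtu - PPPOE_HEADER_LEN - PPP_PROTOCOL_LEN


def pool_prefixes(pool, assigned_length):
    """List every prefix of assigned_length that a local pool can issue."""
    return list(ipaddress.ip_network(pool).subnets(new_prefix=assigned_length))


transit = pool_prefixes("2001:10:2:80::/60", 64)
print(pppoe_mtu())                          # 1492
print(len(transit))                         # 16
print(transit[0], transit[-1])
```

Sixteen /64 prefixes per pool agrees with the "show ipv6 local pool" output later in this section, where one prefix is in use and 15 are free.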

The client configuration is similar to that of the BNG. Dialers are not PPP-encapsulated by default, so we must specify this as well. NAT44 is enabled but it does not affect IPv6 traffic at all. We assign a dial-pool number which is applied at the interface level from which the session initiation occurs. We instruct the client to install a default route to the IPCP negotiated address, and for IPv6 we likewise install a default route to the BNG router discovered through IPv6 ND. The IPv6 prefix-delegation allows CSR2 to learn a prefix from the IPv6 local pool defined on CSR8 to use for its LAN segment.

! CSR2
interface Dialer28
 mtu 1492
 ip address negotiated
 ip nat outside
 encapsulation ppp
 dialer pool 28
 dialer idle-timeout 0
 dialer persistent
 ipv6 address autoconfig default
 ipv6 dhcp client pd PPPOE_ISP_PREFIX
 ppp ipcp route default
interface GigabitEthernet2.528
 pppoe-client dial-pool-number 28

First, we will enable PPP and PPPoE debugging on the client and server to watch the sequence of events. The PPP debugging is not specific to PPPoE at all and shows many low-level details. The PPPoE discovery packets seen below are described now. For clarity, the packets from the debug are shown in-line with the descriptions below.

! CSR2 and CSR8
debug pppoe events
debug pppoe packets
debug ppp negotiation


1. PPPoE Active Discovery Initiation (PADI): Sent to the Ethernet broadcast address (ffff.ffff.ffff) with a source MAC of the client. This is used to discover all ACs on the segment. Later, we will examine service-names, and only ACs with a matching service name should respond. This is similar to a DHCPDISCOVER and the PADI is sent from CSR2 to CSR8 as shown below. The destination MACs are shown in pink with source MACs in green. Notice that the PPPoE discovery ethertype is 0x8863, which is non-IP traffic (cyan, only shown once). Upon receipt, CSR8’s debug shows a nice summary of the PADI header information to include remote (R) and local (L) MAC addresses, VLAN ID, and interface. The “I” before the word PADI means incoming. The server also annotates that the client’s service tag is null, so no special treatment is being requested, and any server may respond. The code for PADI is 0x09 (codes are discussed later).

! CSR2
Sending PADI: Interface = GigabitEthernet2.528
pppoe_send_padi:
contiguous pak, size 64
    FF FF FF FF FF FF 00 50 56 A9 BE 8A 81 00 0D C8
    88 63 11 09 00 00 00 10 01 01 00 00 01 03 00 08
    D2 00 00 06 00 00 0F 20 00 00 00 00 00 00 00 00
    00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00

! CSR8
PPPoE 0: I PADI R:0050.56a9.be8a L:ffff.ffff.ffff 3528 Gi2.528
contiguous pak, size 40
    FF FF FF FF FF FF 00 50 56 A9 BE 8A 81 00 0D C8
    88 63 11 09 00 00 00 10 01 01 00 00 01 03 00 08
    D2 00 00 06 00 00 0F 20
Service tag: NULL Tag

2. PAD Offer (PADO): The response from the AC is a unicast Ethernet frame back to the source and is sent from ACs that are capable of servicing the client. This is similar to a DHCPOFFER. CSR8 originates this message and sends it to CSR2 as a unicast frame, again with a null service tag. The PADO is outbound as denoted by the “O”. CSR2 receives this PADO; we can tell because the “I” before PADO means incoming and the local MAC address is CSR2’s Ethernet interface (destination of the frame). The PADO code is 0x07.

! CSR8
PPPoE 0: O PADO, R:0050.56a9.fb1c L:0050.56a9.be8a 3528 Gi2.528
Service tag: NULL Tag
contiguous pak, size 66
    00 50 56 A9 BE 8A 00 50 56 A9 FB 1C 81 00 0D C8
    88 63 11 07 00 00 00 2A 01 01 00 00 01 03 00 08
    D2 00 00 06 00 00 0F 20 01 02 00 02 52 38 01 04
    00 10 97 58 88 C2 01 8F 8A 96 23 D2 F7 0E E3 54
    F4 D5

! CSR2
PPPoE 0: I PADO R:0050.56a9.fb1c L:0050.56a9.be8a 3528 Gi2.528
contiguous pak, size 66
    00 50 56 A9 BE 8A 00 50 56 A9 FB 1C 81 00 0D C8
    88 63 11 07 00 00 00 2A 01 01 00 00 01 03 00 08
    D2 00 00 06 00 00 0F 20 01 02 00 02 52 38 01 04
    00 10 97 58 88 C2 01 8F 8A 96 23 D2 F7 0E E3 54
    F4 D5

3. PAD Request (PADR): The client selects one of the ACs and sends a unicast Ethernet frame to it requesting to connect. This is like a DHCPREQUEST. For some reason, the output on CSR2 is inconsistent with the format seen with the PADI and PADO thus far. It simply says the PADR was sent but doesn’t parse any details for us. We can easily pick out the MAC addresses and see that this is destined as a unicast frame to CSR8. When CSR8 receives it, it prepares an encapsulation string for the session. This includes the full layer 2 encapsulation of the PPPoE data frames; notice the ethertype is 0x8864 now, also non-IP, and is used for PPPoE bearer traffic. The two bytes immediately before the PPPoE header carry this new ethertype (green). In the first byte of the PPPoE header, the upper 4 bits represent the version and the lower 4 bits represent the type (cyan, both must be 1). The next byte represents a code used for discovery and session stages, and is zero here (pink). The next 2 bytes (0x0013) represent the session ID, which is decimal 19 in this case (grey). The last 2 bytes represent the length of the PPPoE payload, which varies per packet and is shown as zero in the debug (red). The PADR code is 0x19.

! CSR2
OUT PADR from PPPoE Session
contiguous pak, size 66
    00 50 56 A9 FB 1C 00 50 56 A9 BE 8A 81 00 0D C8
    88 63 11 19 00 00 00 2A 01 03 00 08 D2 00 00 06
    00 00 0F 20 01 02 00 02 52 38 01 04 00 10 97 58
    88 C2 01 8F 8A 96 23 D2 F7 0E E3 54 F4 D5 01 01
    00 00

! CSR8
PPPoE 0: I PADR R:0050.56a9.be8a L:0050.56a9.fb1c 3528 Gi2.528
contiguous pak, size 66
    00 50 56 A9 FB 1C 00 50 56 A9 BE 8A 81 00 0D C8
    88 63 11 19 00 00 00 2A 01 03 00 08 D2 00 00 06
    00 00 0F 20 01 02 00 02 52 38 01 04 00 10 97 58
    88 C2 01 8F 8A 96 23 D2 F7 0E E3 54 F4 D5 01 01
    00 00
Service tag: NULL Tag
PPPoE : encap string prepared
contiguous pak, size 24
    00 50 56 A9 BE 8A 00 50 56 A9 FB 1C 81 00 0D C8
    88 64 11 00 00 13 00 00


4. PAD Session-confirmation (PADS): The server acknowledges and accepts the request, which completes the PPPoE discovery stage and establishes the session. This packet also contains the session ID. This is like a DHCPACK. Now that the session ID has been established, the debug messages pertinent to this session will include the number (in decimal) within the debug logs. CSR8 sends the PADS back to CSR2 with this number embedded in the PPPoE header; previously it was zero for the initial PAD exchanges. The PPPoE header is shown in yellow; we can see the PPPoE payload length is 0x2A (42 in decimal). CSR2 receives this PADS packet and prepares its own encapsulation string, which mirrors what CSR8 generated upon receipt of the PADR with the source and destination MAC addresses swapped. The PADS code is 0x65.

! CSR8
[19]PPPoE 19: O PADS R:0050.56a9.be8a L:0050.56a9.fb1c Gi2.528
contiguous pak, size 66
    00 50 56 A9 BE 8A 00 50 56 A9 FB 1C 81 00 0D C8
    88 63 11 65 00 13 00 2A 01 03 00 08 D2 00 00 06
    00 00 0F 20 01 02 00 02 52 38 01 04 00 10 97 58
    88 C2 01 8F 8A 96 23 D2 F7 0E E3 54 F4 D5 01 01
    00 00

! CSR2
PPPoE 19: I PADS R:0050.56a9.fb1c L:0050.56a9.be8a 3528 Gi2.528
contiguous pak, size 66
    00 50 56 A9 BE 8A 00 50 56 A9 FB 1C 81 00 0D C8
    88 63 11 65 00 13 00 2A 01 03 00 08 D2 00 00 06
    00 00 0F 20 01 02 00 02 52 38 01 04 00 10 97 58
    88 C2 01 8F 8A 96 23 D2 F7 0E E3 54 F4 D5 01 01
    00 00
IN PADS from PPPoE Session
PPPoE: Virtual Access interface obtained.
PPPoE : encap string prepared
contiguous pak, size 24
    00 50 56 A9 FB 1C 00 50 56 A9 BE 8A 81 00 0D C8
    88 64 11 00 00 13 00 00

Although not part of a successful PPPoE discovery process, a PAD Termination (PADT) is sent when the session should be torn down. From CSR2’s perspective, we can see the PADT is exchanged mutually between client and server depending on who terminates the session. Session ID is 31 for this session only because I added this paragraph at the end of the testing. Most of the packet is padding since the termination message is carried as a message code (0xA7) inside of the PPPoE header, which of course includes the session ID (0x1F = 31).

! CSR2
PPPoE 31: I PADT R:0050.56a9.fb1c L:0050.56a9.be8a 3528 Gi2.528
contiguous pak, size 64
    00 50 56 A9 BE 8A 00 50 56 A9 FB 1C 81 00 0D C8
    88 63 11 A7 00 1F 00 00 00 00 00 00 00 00 00 00
    00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
    00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
PPPoE : Shutting down client session
[0]PPPoE 31: O PADT R:0050.56a9.fb1c L:0050.56a9.be8a Gi2.528
contiguous pak, size 64
    00 50 56 A9 FB 1C 00 50 56 A9 BE 8A 81 00 0D C8
    88 63 11 A7 00 1F 00 00 00 00 00 00 00 00 00 00
    00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
    00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
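The header fields and codes described for the discovery packets can be checked mechanically against the hex dumps. The following is a minimal sketch, not a full PPPoE implementation: it assumes an Ethernet frame with at most one 802.1Q tag, and the function and dictionary names are hypothetical. The tag type names (Service-Name, AC-Name, Host-Uniq, AC-Cookie) come from RFC 2516 rather than from the debug output.

```python
import struct

# PPPoE codes quoted in the text; 0x00 is used for session (bearer) traffic.
PPPOE_CODES = {0x00: "SESSION", 0x07: "PADO", 0x09: "PADI",
               0x19: "PADR", 0x65: "PADS", 0xA7: "PADT"}
# RFC 2516 tag types that appear in the dumps.
PPPOE_TAGS = {0x0101: "Service-Name", 0x0102: "AC-Name",
              0x0103: "Host-Uniq", 0x0104: "AC-Cookie"}


def cisco_mac(raw):
    """Format 6 raw bytes as a Cisco-style MAC address (xxxx.xxxx.xxxx)."""
    text = raw.hex()
    return ".".join(text[i:i + 4] for i in range(0, 12, 4))


def parse_pppoe_frame(hex_dump):
    """Decode the Ethernet, optional 802.1Q, and PPPoE headers of a dump."""
    frame = bytes.fromhex(hex_dump)
    dst, src = frame[0:6], frame[6:12]
    offset, vlan = 12, None
    (ethertype,) = struct.unpack_from("!H", frame, offset)
    if ethertype == 0x8100:                      # 802.1Q tag present
        (tci,) = struct.unpack_from("!H", frame, offset + 2)
        vlan = tci & 0x0FFF                      # low 12 bits are the VLAN ID
        offset += 4
        (ethertype,) = struct.unpack_from("!H", frame, offset)
    offset += 2
    ver_type, code, session_id, length = struct.unpack_from("!BBHH", frame, offset)
    offset += 6
    tags, end = [], offset + length
    # Only discovery frames (ethertype 0x8863) carry TLV tags in the payload.
    while ethertype == 0x8863 and offset + 4 <= end:
        tag_type, tag_len = struct.unpack_from("!HH", frame, offset)
        value = frame[offset + 4:offset + 4 + tag_len]
        tags.append((PPPOE_TAGS.get(tag_type, hex(tag_type)), value.hex()))
        offset += 4 + tag_len
    return {
        "dst": cisco_mac(dst), "src": cisco_mac(src), "vlan": vlan,
        "ethertype": hex(ethertype),
        "version": ver_type >> 4, "type": ver_type & 0x0F,
        "code": PPPOE_CODES.get(code, hex(code)),
        "session_id": session_id, "length": length, "tags": tags,
    }


# The 40-byte PADI as received by CSR8.
PADI = ("FF FF FF FF FF FF 00 50 56 A9 BE 8A 81 00 0D C8 "
        "88 63 11 09 00 00 00 10 01 01 00 00 01 03 00 08 "
        "D2 00 00 06 00 00 0F 20")
print(parse_pppoe_frame(PADI))
```

Running the same function over the PADS dump should report session ID 19 and a payload length of 42, and it also exposes an AC-Name tag of 0x5238, which is the ASCII string "R8".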

Once the PPPoE discovery process is complete, the traditional PPP negotiation must occur. In our case, three protocols must be negotiated. As done earlier, the output is shown in sections.

1. Link Control Protocol (LCP): LCP negotiates basic PPP parameters such as packet size, method of transmission, authentication, etc. The detailed codes will not be examined here, but we can watch the LCP process. One of the first things the PPP/LCP process establishes, because it is running within a PPPoE session, is that the AC is being called (call-in) and the client is calling (call-out). This is shown in the debug logs just before the magic number negotiation. This negotiation is just an agreement between both routers that the number selected can be used; MRU is also negotiated as part of determining the packet sizes. The “I” and “O”, as with PPPoE discovery, represent inbound and outbound packets. Each router logs both inbound and outbound configuration requests (CONFREQ) and acknowledgements (CONFACK). Options that the peer does not recognize or will not negotiate generate configuration reject (CONFREJ) messages. Four colors are used to show the same messages displayed on CSR2 and CSR8 for mapping purposes. Once LCP is “open”, other higher-layer protocols can begin negotiating over PPP.

! CSR2
Vi1 PPP: Using dialer call direction
Vi1 PPP: Treating connection as a callout
Vi1 PPP: Session handle[CC00000D] Session id[13]
Vi1 LCP: Event[OPEN] State[Initial to Starting]
Vi1 PPP: No remote authentication for call-out
Vi1 LCP: O CONFREQ [Starting] id 1 len 14
Vi1 LCP:    MRU 1492 (0x010405D4)
Vi1 LCP:    MagicNumber 0x230EBFC3 (0x0506230EBFC3)
Vi1 LCP: Event[UP] State[Starting to REQsent]
Vi1 LCP: I CONFREQ [REQsent] id 1 len 14
Vi1 LCP:    MRU 1492 (0x010405D4)
Vi1 LCP:    MagicNumber 0x28BA1EA3 (0x050628BA1EA3)
Vi1 LCP: O CONFACK [REQsent] id 1 len 14
Vi1 LCP:    MRU 1492 (0x010405D4)
Vi1 LCP:    MagicNumber 0x28BA1EA3 (0x050628BA1EA3)
Vi1 LCP: Event[Receive ConfReq+] State[REQsent to ACKsent]
Vi1 LCP: I CONFACK [ACKsent] id 1 len 14
Vi1 LCP:    MRU 1492 (0x010405D4)
Vi1 LCP:    MagicNumber 0x230EBFC3 (0x0506230EBFC3)
Vi1 LCP: Event[Receive ConfAck] State[ACKsent to Open]
Vi1 PPP: Phase is FORWARDING, Attempting Forward
Vi1 LCP: State is Open

! CSR8
ppp19 PPP: Using vpn set call direction
ppp19 PPP: Treating connection as a callin
ppp19 PPP: Session handle[EC000013] Session id[19]
ppp19 LCP: Event[OPEN] State[Initial to Starting]
ppp19 PPP: No remote authentication for call-in
ppp19 PPP LCP: Enter passive mode, state[Stopped]
ppp19 LCP: I CONFREQ [Stopped] id 1 len 14
ppp19 LCP:    MRU 1492 (0x010405D4)
ppp19 LCP:    MagicNumber 0x230EBFC3 (0x0506230EBFC3)
ppp19 LCP: O CONFREQ [Stopped] id 1 len 14
ppp19 LCP:    MRU 1492 (0x010405D4)
ppp19 LCP:    MagicNumber 0x28BA1EA3 (0x050628BA1EA3)
ppp19 LCP: O CONFACK [Stopped] id 1 len 14
ppp19 LCP:    MRU 1492 (0x010405D4)
ppp19 LCP:    MagicNumber 0x230EBFC3 (0x0506230EBFC3)
ppp19 LCP: Event[Receive ConfReq+] State[Stopped to ACKsent]
ppp19 LCP: I CONFACK [ACKsent] id 1 len 14
ppp19 LCP:    MRU 1492 (0x010405D4)
ppp19 LCP:    MagicNumber 0x28BA1EA3 (0x050628BA1EA3)
ppp19 LCP: Event[Receive ConfAck] State[ACKsent to Open]
ppp19 PPP: Queue IPCP code[1] id[1]
ppp19 PPP: Queue IPV6CP code[1] id[1]
ppp19 PPP: Phase is FORWARDING, Attempting Forward
ppp19 LCP: State is Open

2. IP control protocol (IPCP): Since both routers want to use IPv4 on the link, those parameters must be negotiated as well. In this case, CSR2 has no IP address, and indicates this in its initial outbound CONFREQ. Interestingly, CSR8 offers the address of 209.2.8.107 using a CONFNAK (negative ACK) message, which triggers a CONFREQ from CSR2 to request that same address. CSR8 confirms it with a CONFACK. This process is shown in yellow, and the simpler exchange of CSR2 learning CSR8’s static address is shown in green. After IPCP is open, each router installs a connected host route to the remote peer via the PPP interface. This allows hosts in different subnets to communicate over PPP.

! CSR2
Vi1 IPCP: Protocol configured, start CP. state[Initial]
Vi1 IPCP: Event[OPEN] State[Initial to Starting]
Vi1 IPCP: O CONFREQ [Starting] id 1 len 10
Vi1 IPCP:    Address 0.0.0.0 (0x030600000000)
Vi1 IPCP: Event[UP] State[Starting to REQsent]
Vi1 IPCP: I CONFREQ [REQsent] id 1 len 10
Vi1 IPCP:    Address 209.2.8.8 (0x0306D1020808)
Vi1 IPCP: O CONFACK [REQsent] id 1 len 10
Vi1 IPCP:    Address 209.2.8.8 (0x0306D1020808)
Vi1 IPCP: Event[Receive ConfReq+] State[REQsent to ACKsent]
Vi1 IPCP: I CONFNAK [ACKsent] id 1 len 10
Vi1 IPCP:    Address 209.2.8.107 (0x0306D102086B)
Vi1 IPCP: O CONFREQ [ACKsent] id 2 len 10
Vi1 IPCP:    Address 209.2.8.107 (0x0306D102086B)
Vi1 IPCP: Event[Receive ConfNak/Rej] State[ACKsent to ACKsent]
Vi1 IPCP: I CONFACK [ACKsent] id 2 len 10
Vi1 IPCP:    Address 209.2.8.107 (0x0306D102086B)
Vi1 IPCP: Event[Receive ConfAck] State[ACKsent to Open]
Vi1 IPCP: State is Open
Di28 IPCP: Install default route thru 209.2.8.8
Di28 Added to neighbor route AVL tree: topoid 0, address 209.2.8.8
Di28 IPCP: Install route to 209.2.8.8

! CSR8
Vi2.1 IPCP: Protocol configured, start CP. state[Initial]
Vi2.1 IPCP: Event[OPEN] State[Initial to Starting]
Vi2.1 IPCP: O CONFREQ [Starting] id 1 len 10
Vi2.1 IPCP:    Address 209.2.8.8 (0x0306D1020808)
Vi2.1 IPCP: Event[UP] State[Starting to REQsent]
Vi2.1 PPP: Process pending ncp packets
Vi2.1 IPCP: Redirect packet to Vi2.1
Vi2.1 IPCP: I CONFREQ [REQsent] id 1 len 10
Vi2.1 IPCP:    Address 0.0.0.0 (0x030600000000)
Vi2.1 IPCP AUTHOR: Done. Her address 0.0.0.0, we want 0.0.0.0
Vi2.1 IPCP: Pool returned 209.2.8.107
Vi2.1 IPCP: O CONFNAK [REQsent] id 1 len 10
Vi2.1 IPCP:    Address 209.2.8.107 (0x0306D102086B)
Vi2.1 IPCP: Event[Receive ConfReq-] State[REQsent to REQsent]
Vi2.1 IPCP: I CONFACK [REQsent] id 1 len 10
Vi2.1 IPCP:    Address 209.2.8.8 (0x0306D1020808)
Vi2.1 IPCP: Event[Receive ConfAck] State[REQsent to ACKrcvd]
Vi2.1 IPCP: I CONFREQ [ACKrcvd] id 2 len 10
Vi2.1 IPCP:    Address 209.2.8.107 (0x0306D102086B)
Vi2.1 IPCP: O CONFACK [ACKrcvd] id 2 len 10
Vi2.1 IPCP:    Address 209.2.8.107 (0x0306D102086B)
Vi2.1 IPCP: Event[Receive ConfReq+] State[ACKrcvd to Open]
Vi2.1 IPCP: State is Open
Vi2.1 Added to neighbor route AVL tree: topoid 0, address 209.2.8.107
Vi2.1 IPCP: Install route to 209.2.8.107

3. IPV6CP: Like IPCP, IPv6 information is negotiated over the link as well. In this case, 64-bit interface-IDs are exchanged over the link; these make up the host portion of an IPv6 LL address. The addresses don’t make it into the IPv6 RIB but are tracked internally by PPP for forwarding. CSR2 informs CSR8 that it wants to use the address ending in DB00 (green) and CSR8 informs CSR2 that it wants to use the address ending in 4D00 (yellow).

! CSR2
Vi1 IPV6CP: Protocol configured, start CP. state[Initial]
Vi1 IPV6CP: Event[OPEN] State[Initial to Starting]
Vi1 IPV6CP: O CONFREQ [Starting] id 1 len 14
Vi1 IPV6CP:    Interface-Id 021E:14FF:FE15:DB00 (0x010A021E14FFFE15DB00)
Vi1 IPV6CP: Event[UP] State[Starting to REQsent]
Vi1 IPV6CP: I CONFREQ [REQsent] id 1 len 14
Vi1 IPV6CP:    Interface-Id 021E:E6FF:FE4D:4D00 (0x010A021EE6FFFE4D4D00)
Vi1 IPV6CP: O CONFACK [REQsent] id 1 len 14
Vi1 IPV6CP:    Interface-Id 021E:E6FF:FE4D:4D00 (0x010A021EE6FFFE4D4D00)
Vi1 IPV6CP: Event[Receive ConfReq+] State[REQsent to ACKsent]
Vi1 IPV6CP: I CONFACK [ACKsent] id 1 len 14
Vi1 IPV6CP:    Interface-Id 021E:14FF:FE15:DB00 (0x010A021E14FFFE15DB00)
Vi1 IPV6CP: Event[Receive ConfAck] State[ACKsent to Open]
Vi1 IPV6CP: State is Open

! CSR8
Vi2.1 IPV6CP: Protocol configured, start CP. state[Initial]
Vi2.1 IPV6CP: Event[OPEN] State[Initial to Starting]
Vi2.1 IPV6CP: O CONFREQ [Starting] id 1 len 14
Vi2.1 IPV6CP:    Interface-Id 021E:E6FF:FE4D:4D00 (0x010A021EE6FFFE4D4D00)
Vi2.1 IPV6CP: Event[UP] State[Starting to REQsent]
Vi2.1 IPV6CP: Redirect packet to Vi2.1
Vi2.1 IPV6CP: I CONFREQ [REQsent] id 1 len 14
Vi2.1 IPV6CP:    Interface-Id 021E:14FF:FE15:DB00 (0x010A021E14FFFE15DB00)
Vi2.1 IPV6CP: O CONFACK [REQsent] id 1 len 14
Vi2.1 IPV6CP:    Interface-Id 021E:14FF:FE15:DB00 (0x010A021E14FFFE15DB00)
Vi2.1 IPV6CP: Event[Receive ConfReq+] State[REQsent to ACKsent]
Vi2.1 IPV6CP: I CONFACK [ACKsent] id 1 len 14
Vi2.1 IPV6CP:    Interface-Id 021E:E6FF:FE4D:4D00 (0x010A021EE6FFFE4D4D00)
Vi2.1 IPV6CP: Event[Receive ConfAck] State[ACKsent to Open]
Vi2.1 IPV6CP: State is Open
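The hex strings in parentheses throughout the LCP, IPCP, and IPV6CP debugs are the raw configuration options in type-length-value form, where the length octet counts the type and length octets too. The following is a minimal sketch of a decoder covering only the four option types seen in these debugs; the function name and return format are assumptions made for illustration.

```python
import ipaddress


def decode_ppp_option(protocol, option_hex):
    """Decode one PPP configuration option printed by 'debug ppp negotiation'."""
    text = option_hex[2:] if option_hex.lower().startswith("0x") else option_hex
    raw = bytes.fromhex(text)
    opt_type, opt_len, value = raw[0], raw[1], raw[2:]
    if opt_len != len(raw):
        raise ValueError("length octet does not match the option size")
    if protocol == "LCP" and opt_type == 1:        # Maximum-Receive-Unit
        return "MRU", int.from_bytes(value, "big")
    if protocol == "LCP" and opt_type == 5:        # Magic-Number
        return "MagicNumber", "0x" + value.hex().upper()
    if protocol == "IPCP" and opt_type == 3:       # IP-Address
        return "Address", str(ipaddress.IPv4Address(value))
    if protocol == "IPV6CP" and opt_type == 1:     # Interface-Identifier
        digits = value.hex().upper()
        return "Interface-Id", ":".join(digits[i:i + 4] for i in range(0, 16, 4))
    # Anything else is returned undecoded rather than guessed at.
    return "type %d" % opt_type, value.hex()


print(decode_ppp_option("LCP", "0x010405D4"))
print(decode_ppp_option("LCP", "0x0506230EBFC3"))
print(decode_ppp_option("IPCP", "0x0306D102086B"))
print(decode_ppp_option("IPV6CP", "0x010A021E14FFFE15DB00"))
```

Each result should match the human-readable value that IOS prints next to the hex string, for example MRU 1492 and address 209.2.8.107.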

At this point, we will verify everything with show commands. The client and server both show the summary information for each PPPoE session in similar formats. Note: The session bounced once during the course of documenting the feature so the session ID incremented from 19 to 20. This output shows us the remote and local MAC addresses, port, VLAN, virtual interface, session ID, and state. It is the most valuable PPPoE show command. The string “PTA” means locally terminated and is present only on the AC; it stands for PPP Termination and Aggregation.

R2#show pppoe session
     1 client session

Uniq ID  PPPoE  RemMAC          Port          VT    VA     State
           SID  LocMAC                              VA-st  Type
    N/A     20  0050.56a9.fb1c  Gi2.528       Di28  Vi1    UP
                0050.56a9.be8a                      UP

R8#show pppoe session
     1 session  in LOCALLY_TERMINATED (PTA) State
     1 session  total

Uniq ID  PPPoE  RemMAC          Port          VT    VA     State
           SID  LocMAC                              VA-st  Type
     20     20  0050.56a9.be8a  Gi2.528       28    Vi2.1  PTA
                0050.56a9.fb1c  VLAN:3528           UP

Some outputs/commands reference the session ID, so it is important to understand that concept. It may be useful to look at packet counters as well, shown below. ACs also maintain a summary view of all PPPoE sessions, including those forwarded past the AC or in a transient state.

R8#show pppoe session packets
Total PPPoE sessions 1

SID   Pkts-In   Pkts-Out   Bytes-In   Bytes-Out
20    882       1448       13651      50867

R8#show pppoe summary
  PTA  : Locally terminated sessions
  FWDED: Forwarded sessions
  TRANS: All other sessions (in transient state)

                   TOTAL   PTA   FWDED   TRANS
TOTAL                  1     1       0       0
GigabitEthernet2       1     1       0       0

The PPP show commands also give additional information which is specific to PPP. This includes PPP subprotocol negotiation details. We can see a summary of all PPP sessions on both routers and their negotiated protocols. Notice the peer name is blank since there is no authentication happening presently. Both CSR2 and CSR8 show that LCP, IPCP, and IPV6CP were successfully negotiated.

R2#show ppp all
Interface/ID OPEN+ Nego* Fail-     Stage    Peer Address    Peer Name
------------ --------------------- -------- --------------- -----------------
Vi1          LCP+ IPCP+ IPV6CP+    LocalT   209.2.8.8

R8#show ppp all
Interface/ID OPEN+ Nego* Fail-     Stage    Peer Address    Peer Name
------------ --------------------- -------- --------------- -----------------
Vi2.1        LCP+ IPCP+ IPV6CP+    LocalT   209.2.8.108

Looking at the details on CSR8, we can see there is a ton of PPP information for each sub-protocol. The items of greatest significance are highlighted. Note that the IPv6 address exchanges are not visible with any other show command to my knowledge. CSR2’s output is very similar and is omitted for brevity.

R8#show ppp interface virtual-access2.1
Vi2.1 No PPP serial context
PPP Session Info
----------------
Interface        : Vi2.1
PPP ID           : 0x4C000014
Phase            : UP
Stage            : Local Termination
Peer Name        :
Peer Address     : 209.2.8.108
Control Protocols: LCP[Open] IPCP[Open] IPV6CP[Open]
Session ID       : 20
AAA Unique ID    : 31
SSS Manager ID   : 0x7C000029
SIP ID           : 0x61000028
PPP_IN_USE       : 0x11

Vi2.1 LCP: [Open]
Our Negotiated Options
Vi2.1 LCP:    MRU 1492 (0x010405D4)
Vi2.1 LCP:    MagicNumber 0x28BACAD4 (0x050628BACAD4)
Peer's Negotiated Options
Vi2.1 LCP:    MRU 1492 (0x010405D4)
Vi2.1 LCP:    MagicNumber 0x230F6BF6 (0x0506230F6BF6)

Vi2.1 IPCP: [Open]
Our Negotiated Options
Vi2.1 IPCP:    Address 209.2.8.8 (0x0306D1020808)
Peer's Negotiated Options
Vi2.1 IPCP:    Address 209.2.8.108 (0x0306D102086C)

Vi2.1 IPV6CP: [Open]
Our Negotiated Options
Vi2.1 IPV6CP:    Interface-Id 021E:E6FF:FE4D:4D00 (0x010A021EE6FFFE4D4D00)
Peer's Negotiated Options
Vi2.1 IPV6CP:    Interface-Id 021E:14FF:FE15:DB00 (0x010A021E14FFFE15DB00)

One particularly important piece of the IPV6CP debugging not shown above indicates an issue with assigning a prefix to the PPPoE client. This is poorly documented and not well known, so I highlight it. CSR8 says that it cannot allocate a prefix from the local pool since CSR2 has no remote name. The PPP show commands earlier prove this. We can rectify this by configuring a PAP username on CSR2 and telling CSR8 to use PAP authentication. PAP details are examined later.

! CSR8
Vi2.1 IPV6CP: Cannot use a pool without remote name

! CSR2
interface Dialer28
 ppp pap sent-username R2 password 0 PAP

! CSR8
username R2 password 0 PAP
interface Virtual-Template28
 ppp authentication pap callin

Although the IPV6CP debugs don’t show the local IPv6 prefix being allocated to CSR2, it did actually work. CSR8 shows that one of its local prefixes was allocated for this purpose and CSR2 shows it as a global unicast address on its dialer interface.

R8#show ipv6 local pool
Pool            Prefix                  Free  In use
PPPOE_POOL_V6   2001:10:2:80::/60         15       1
PD_POOL_V6      2001:192:168:80::/60      15       1

R2#show ipv6 interface dialer 28 | section Global
  Global unicast address(es):
    2001:10:2:80:21E:14FF:FE15:DB00, subnet is 2001:10:2:80::/64 [EUI/CAL/PRE]
      valid lifetime 2591997 preferred lifetime 604797

Interestingly, the prefix itself is not carried in IPV6CP, which only negotiates interface-IDs; ordinary IPv6 ND is used to issue it. CSR8 includes this prefix, pulled from the local pool referenced by the virtual-template, in an RA on the PPPoE virtual interface. Upon receipt of the RA from CSR8, CSR2 uses this as the “on-link” prefix for the dialer interface.

CSR8#debug ipv6 nd
ICMPv6-ND: (Virtual-Access2.1,FE80::21E:E6FF:FE4D:4D00) Sending RA (60) to FF02::1
ICMPv6-ND:   MTU = 1492
ICMPv6-ND:   prefix 2001:10:2:80::/64 [LA] 2592000/604800

CSR2#debug ipv6 nd
ICMPv6-ND: (Dialer28,FE80::21E:E6FF:FE4D:4D00) Received RA
ICMPv6-ND: Validating ND packet options: valid
ICMPv6-ND: Prefix : 2001:10:2:80::, Length: 64, Vld Lifetime: 2592000, Prf Lifetime: 604800, PI Flags: C0
ICMPv6-ND: Update on-link prefix 2001:10:2:80::/64 on Dialer28/FE80::21E:E6FF:FE4D:4D00, lifetime 2592000
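CSR2’s global address on Dialer28 is simply the /64 prefix from the RA combined with the interface-ID it negotiated in IPV6CP. As a small worked example using the Python standard library (the function name is hypothetical, and the 64-bit prefix length is assumed, as SLAAC on this link requires):

```python
import ipaddress


def slaac_address(prefix, interface_id):
    """Combine a /64 prefix with a 64-bit interface-ID written as X:X:X:X."""
    network = ipaddress.ip_network(prefix)
    if network.prefixlen != 64:
        raise ValueError("this sketch only handles /64 prefixes")
    iid = int(interface_id.replace(":", ""), 16)
    if iid >= 1 << 64:
        raise ValueError("interface-ID must fit in 64 bits")
    # The prefix fills the upper 64 bits and the interface-ID the lower 64.
    return ipaddress.IPv6Address(int(network.network_address) | iid)


# Prefix advertised by CSR8 plus the interface-ID CSR2 sent in its CONFREQ.
print(slaac_address("2001:10:2:80::/64", "021E:14FF:FE15:DB00"))
# The same interface-ID under the link-local prefix.
print(slaac_address("fe80::/64", "021E:14FF:FE15:DB00"))
```

The first result corresponds to the global unicast address shown on Dialer28, and the second to the link-local source address CSR2 uses on the PPPoE session. Python prints both in lowercase.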

This is not the same as prefix delegation, which is shown next. This process relies on DHCPv6 and not IPv6 ND to distribute those delegated prefixes.

! CSR2 and CSR8
debug ipv6 dhcp detailed

! CSR8
IPv6 DHCP: Received REBIND from FE80::21E:14FF:FE15:DB00 on Virtual-Access2.1
IPv6 DHCP: detailed packet contents
  src FE80::21E:14FF:FE15:DB00 (Virtual-Access2.1)
  dst FF02::1:2
  type REBIND(6), xid 7320700
  option ELAPSED-TIME(8), len 2
    elapsed-time 0
  option CLIENTID(1), len 10
    00030001001E1415DB00
  option ORO(6), len 6
    IA-PD,DNS-SERVERS,DOMAIN-LIST
  option IA-PD(25), len 41
    IAID 0x000C0001, T1 0, T2 0
    option IAPREFIX(26), len 25
      preferred 0, valid 0, prefix 2001:192:168:80::/64
IPv6 DHCP: Using interface pool DHCP_POOL_V6
IPv6 DHCP: REBIND: Client has moved from unassigned to Virtual-Access2.1
IPv6 DHCP: Route added: 2001:192:168:80::/64 via FE80::21E:14FF:FE15:DB00 dist 1 iaid 000C0001 vrf default

When CSR8 selects a prefix from its local pool, it also installs a static route on the AC to reach that prefix. This is very useful because it can be redistributed into IGP as needed to provide Internet connectivity. This is redistributed into IS-IS (configuration not shown), as verified below. Short of running IGP, this is an excellent, dynamic approach to issuing IPv6 prefixes to PPPoE clients.
R8#show ipv6 route 2001:192:168:80::/64
Routing entry for 2001:192:168:80::/64
  Known via "static", distance 1, metric 0
  Redistributing via isis 1112
  Route count is 1/1, share count 0
  Routing paths:
    FE80::21E:14FF:FE15:DB00, Virtual-Access2.1
      Last updated 00:09:40 ago
R8#show isis database l2 R8.00-00 detail | begin IPv6_Add
  IPv6 Address: 2001:10:8:12::8
  Metric: 0          IPv6 (MT-IPv6) 2001:192:168:80::/64

Below, we see that CSR2 receives this PD prefix from CSR8 and binds it to the string “PPPOE_ISP_PREFIX” which can be used elsewhere.
! CSR2
IPv6 DHCP: Received REPLY from FE80::21E:E6FF:FE4D:4D00 on Dialer28
IPv6 DHCP: detailed packet contents
  src FE80::21E:E6FF:FE4D:4D00 (Dialer28)
  dst FE80::21E:14FF:FE15:DB00 (Dialer28)
  type REPLY(7), xid 7320700
  option SERVERID(2), len 10
    00030001001EE64D4D00
  option CLIENTID(1), len 10
    00030001001E1415DB00
  option IA-PD(25), len 41
    IAID 0x000C0001, T1 302400, T2 483840
    option IAPREFIX(26), len 25
      preferred INFINITY, valid INFINITY, prefix 2001:192:168:80::/64
IPv6 DHCP: Processing options
IPv6 DHCP: Adding prefix 2001:192:168:80::/64 to PPPOE_ISP_PREFIX

CSR2 can use this as another IPv6 prefix on its LAN interface. The configuration is similar to the IPv6 general prefix construct, and this prefix will be included in the RA messages by default. I also add a static IPv6 prefix to this link to support non-SLAAC capable clients, such as XRv4, but remove it from the RA so that SLAAC-capable clients don’t select addresses from that prefix. The “N” flag in the show command indicates that it is not included in the RA. Also note the NAT44 inside interface, which is unrelated to IPv6 but important for IPv4 connectivity to the Internet.
! CSR2
interface GigabitEthernet2.524
 ip nat inside
 ipv6 address FE80::2 link-local
 ipv6 address 2001:192:168:2::2/64
 ipv6 address PPPOE_ISP_PREFIX ::2/64
 ipv6 nd prefix 2001:192:168:2::/64 no-advertise
 ipv6 nd ra lifetime 30
 ipv6 nd ra interval 10 5
R2#show ipv6 interface gigabitEthernet 2.524 prefix
IPv6 Prefix Advertisements GigabitEthernet2.524
Codes for 1st column: A - Address, P - Prefix-Advertisement, O - Pool
                      U - Per-user prefix
Codes for 2nd column and above: D - Default
                      N - Not advertised, C - Calendar
PD  default [LA] Valid lifetime 2592000, preferred lifetime 604800
PAN 2001:192:168:2::/64 [LA] Valid lifetime 2592000, preferred lifetime 604800
AD  2001:192:168:80::/64 [LA] Valid lifetime 2592000, preferred lifetime 604800

CSR2 has a classic CPE configuration with NAT44 (seen already) and is configured with a DHCP pool to service its hosts. Since NAT44 occurs, CSR8 only needs to reach the post-NAT public address on CSR2’s dialer interface.
! CSR2
ip dhcp excluded-address 192.168.2.0 192.168.2.20
ip dhcp pool DHCP_POOL_V4
 network 192.168.2.0 255.255.255.0
 default-router 192.168.2.2


CSR1 is configured as a DHCPv4 client and an IPv6 SLAAC client. As such, it receives an address/prefix for both protocols along with a default route. CSR1 now has full Internet connectivity for IPv4 and IPv6, and this represents a typical DSL deployment.
R1#show ip interface brief gigabitEthernet 2.524
Interface              IP-Address      OK? Method Status   Protocol
GigabitEthernet2.524   192.168.2.21    YES DHCP   up       up

R1#show ipv6 interface brief gigabitEthernet 2.524
GigabitEthernet2.524   [up/up]
    FE80::250:56FF:FEA9:1AAA
    2001:192:168:80:250:56FF:FEA9:1AAA
R1#show ip route vrf 2 0.0.0.0
Routing Table: 2
Routing entry for 0.0.0.0/0, supernet
  Known via "static", distance 254, metric 0, candidate default path
  Routing Descriptor Blocks:
  * 192.168.2.2
      Route metric is 0, traffic share count is 1
R1#show ipv6 route vrf 2 ::/0
Routing entry for ::/0
  Known via "ND", distance 2, metric 0
  Route count is 1/1, share count 0
  Routing paths:
    FE80::2, GigabitEthernet2.524
      Last updated 1d00h ago

We quickly confirm connectivity to the Internet from CSR1.
R1#ping vrf 2 13.144.2.1
Type escape sequence to abort.
Sending 5, 100-byte ICMP Echos to 13.144.2.1, timeout is 2 seconds:
!!!!!
Success rate is 100 percent (5/5), round-trip min/avg/max = 4/6/15 ms
R1#ping vrf 2 2bad:beef:13:aaaa::a
Type escape sequence to abort.
Sending 5, 100-byte ICMP Echos to 2BAD:BEEF:13:AAAA::A, timeout is 2 seconds:
!!!!!
Success rate is 100 percent (5/5), round-trip min/avg/max = 5/7/13 ms

We will briefly examine service tags. We know this information is carried back and forth in the PPPoE discovery packets when negotiating a session between client and AC. If a client specifies a service tag that no AC is explicitly configured for, an AC can still respond with a PADO for that client, since a “Null” service at the AC essentially means “any service”. The client requests “BLUE” and the server responds with “BLUE”, despite “BLUE” not being configured anywhere on CSR8.
! CSR2
interface GigabitEthernet2.528
 pppoe-client dial-pool-number 28 service-name "BLUE"
! CSR8
PPPoE 0: I PADI R:0050.56a9.be8a L:ffff.ffff.ffff 3528 Gi2.528
contiguous pak, size 44
 FF FF FF FF FF FF 00 50 56 A9 BE 8A 81 00 0D C8
 88 63 11 09 00 00 00 14 01 01 00 04 42 4C 55 45
 01 03 00 08 87 00 00 0B 00 00 23 1D
 Service tag: BLUE
PPPoE 0: O PADO, R:0050.56a9.fb1c L:0050.56a9.be8a 3528 Gi2.528
 Service tag: BLUE
contiguous pak, size 70
 00 50 56 A9 BE 8A 00 50 56 A9 FB 1C 81 00 0D C8
 88 63 11 07 00 00 00 2E 01 01 00 04 42 4C 55 45
 01 03 00 08 87 00 00 0B 00 00 23 1D 01 02 00 02
 52 38 01 04 00 10 97 58 88 C2 01 8F 8A 96 23 D2
 F7 0E E3 54 F4 D5

Configuring CSR8 with service “RED” under the BBA means that it will only service clients with a service containing the string “RED”. The “contains” keyword indicates we can use a partial match. CSR2 will keep sending PADI messages to CSR8, which never responds with a PADO.
! CSR8
bba-group pppoe PPPOE_28
 virtual-template 28
 service name contains RED

PPPoE 0: I PADI R:0050.56a9.be8a L:ffff.ffff.ffff 3528 Gi2.528
contiguous pak, size 44
 FF FF FF FF FF FF 00 50 56 A9 BE 8A 81 00 0D C8
 88 63 11 09 00 00 00 14 01 01 00 04 42 4C 55 45
 01 03 00 08 87 00 00 0B 00 00 23 1D
PPPoE 0: Requested service-name BLUE has no partial match with RED, discarding PADI R:0050.56a9.be8a L:ffff.ffff.ffff 3528 Gi2.528
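To make the hex dump easier to read, here is a minimal Python sketch that walks the 44-byte PADI shown above: Ethernet header, 802.1Q tag, PPPoE header, then the TLV-encoded tags. The code and tag values follow RFC 2516; the function and dictionary names are illustrative and are not part of any Cisco tooling.

```python
import struct

# PPPoE discovery codes and tag types as defined in RFC 2516.
CODES = {0x09: "PADI", 0x07: "PADO", 0x19: "PADR", 0x65: "PADS", 0xA7: "PADT"}
TAG_NAMES = {0x0101: "Service-Name", 0x0102: "AC-Name",
             0x0103: "Host-Uniq", 0x0104: "AC-Cookie"}


def parse_pppoe_discovery(frame: bytes) -> dict:
    """Parse an Ethernet frame (optionally 802.1Q tagged) carrying PPPoE discovery."""
    offset = 12  # skip destination and source MAC addresses
    (ethertype,) = struct.unpack_from("!H", frame, offset)
    vlan = None
    if ethertype == 0x8100:  # 802.1Q tag present
        tci, ethertype = struct.unpack_from("!HH", frame, offset + 2)
        vlan = tci & 0x0FFF
        offset += 4
    if ethertype != 0x8863:
        raise ValueError("not a PPPoE discovery frame")
    offset += 2
    _ver_type, code, session_id, length = struct.unpack_from("!BBHH", frame, offset)
    offset += 6
    end = offset + length
    tags = []
    while offset + 4 <= end:  # each tag is type (2), length (2), value
        tag_type, tag_len = struct.unpack_from("!HH", frame, offset)
        value = frame[offset + 4: offset + 4 + tag_len]
        tags.append((TAG_NAMES.get(tag_type, hex(tag_type)), value))
        offset += 4 + tag_len
    return {"vlan": vlan, "code": CODES.get(code, hex(code)),
            "session_id": session_id, "tags": tags}


PADI_HEX = (
    "FF FF FF FF FF FF 00 50 56 A9 BE 8A 81 00 0D C8 "
    "88 63 11 09 00 00 00 14 01 01 00 04 42 4C 55 45 "
    "01 03 00 08 87 00 00 0B 00 00 23 1D"
)
padi = parse_pppoe_discovery(bytes.fromhex(PADI_HEX))
print(padi["code"], padi["vlan"], padi["tags"][0])
```

The Service-Name tag decodes to the ASCII bytes 42 4C 55 45, which is the “BLUE” string the AC compares against its configured service name.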

Updating CSR8 to include the string “BLU” means that it can service CSR2, since “BLU” is contained within “BLUE”. CSR8 shows the match and sends a PADO back to CSR2.
! CSR8
bba-group pppoe PPPOE_28
 virtual-template 28
 service name contains BLU

PPPoE 0: I PADI R:0050.56a9.be8a L:ffff.ffff.ffff 3528 Gi2.528
contiguous pak, size 44
 FF FF FF FF FF FF 00 50 56 A9 BE 8A 81 00 0D C8
 88 63 11 09 00 00 00 14 01 01 00 04 42 4C 55 45
 01 03 00 08 87 00 00 0B 00 00 23 1D
PPPoE 0: Requested service-name BLUE partial match with BLU
 Service tag: BLUE
PPPoE 0: O PADO, R:0050.56a9.fb1c L:0050.56a9.be8a 3528 Gi2.528
 Service tag: BLUE

However, if the client has a null service but the BBA point specifies a string, the session cannot form. CSR8 is still expecting a partial match with the “BLU” string. The BBA has restrictive logic whereby only clients requesting that specific service can be serviced by the BBA in question. This can be used for simple load-sharing, where different strings can be used by different ACs on a LAN segment. This can be overridden with the “accept-null-service” option under the BBA configuration if needed.
! CSR2
interface GigabitEthernet2.528
 no pppoe-client dial-pool-number 28 service-name "BLUE"
 pppoe-client dial-pool-number 28
! CSR8
PPPoE 0: I PADI R:0050.56a9.be8a L:ffff.ffff.ffff 3528 Gi2.528
contiguous pak, size 40
 FF FF FF FF FF FF 00 50 56 A9 BE 8A 81 00 0D C8
 88 63 11 09 00 00 00 10 01 01 00 00 01 03 00 08
 32 00 00 0C 00 00 26 58
PPPoE 0: Discarding PADI with empty service-name R:0050.56a9.be8a L:ffff.ffff.ffff 3528 Gi2.528
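The four service-name cases tested so far can be summarized as a small decision function. This is only a Python model of the behavior observed in the debugs, not how IOS implements it; the function name and parameters are hypothetical.

```python
from typing import Optional


def ac_answers_padi(requested: str,
                    configured: Optional[str],
                    accept_null: bool = False) -> bool:
    """Model whether an AC sends a PADO for a PADI, per the lab observations.

    requested   -- service name in the client's PADI ("" for a null service)
    configured  -- string from "service name contains <string>", or None
    accept_null -- True if "accept-null-service" is configured on the BBA group
    """
    if configured is None:
        # A null service on the AC behaves like "any service".
        return True
    if requested == "":
        # A null request is discarded unless explicitly accepted.
        return accept_null
    # "contains" is a partial match: the AC string must appear in the request.
    return configured in requested


print(ac_answers_padi("BLUE", None))       # no service configured on the AC
print(ac_answers_padi("BLUE", "RED"))      # no partial match
print(ac_answers_padi("BLUE", "BLU"))      # partial match
print(ac_answers_padi("", "BLU"))          # null request, restrictive AC
print(ac_answers_padi("", "BLU", True))    # null request, accept-null-service
```

Only the second and fourth cases result in a discarded PADI, matching the two discard messages seen on CSR8.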

Before continuing, we fix the client to be back in “BLUE” service again. There are other ways to set up PPP networks, although the CSR2/CSR8 design is the most common. CSR9 is an AC that has clients CSR3, CSR4, and CSR5. CSR9 uses DHCP to hand out addresses to PPP, which is similar to the local pool but uses a centralized DHCP process. The pool is still local to CSR9, but other hosts can also use addresses from this pool, not just PPP. IPCP is still used to issue IP addresses to clients, but DHCP or static addressing could technically be used. Rather than use IPv6 DHCP prefix delegation (PD), we can run IGP over the link to exchange IPv6 prefixes. This is not common but certainly works.

We also enable PAP and CHAP authentication with a custom AAA method list. If RADIUS/TACACS were in play, the PPP sessions could be authenticated against a remote AAA server. In this case, the method list just uses the local database. CSR9 prefers to use CHAP but can fall back to PAP. Notice that CSR9 enables IS-IS on this link for IPv4; this is only to advertise the /24 connected prefix into IS-IS for routing reachability. Passive-interface cannot be used on the virtual-template since it is always down. Each client has a separate VLAN for connectivity to the BNG, which conforms to the TR-101 1:1 VLAN paradigm.
! CSR9
bba-group pppoe PPPOE_P2MP
 virtual-template 345
 sessions per-mac limit 1
 sessions per-vlan limit 1
aaa new-model
aaa authentication login default none
aaa authentication ppp PPPOE local-case
ip dhcp excluded-address 209.34.59.0 209.34.59.20
ip dhcp pool DHCP_POOL_PPPOE_NETWORK
 network 209.34.59.0 255.255.255.0
 default-router 209.34.59.9
interface Virtual-Template345
 mtu 1492
 ip address 209.34.59.9 255.255.255.0
 ip router isis 1112
 peer default ip address dhcp-pool DHCP_POOL_PPPOE_NETWORK
 ipv6 enable
 ospfv3 9 ipv6 area 0
 ospfv3 9 ipv6 network point-to-point
 ppp authentication chap pap callin PPPOE
interface GigabitEthernet2.539
 pppoe enable group PPPOE_P2MP
interface GigabitEthernet2.549
 pppoe enable group PPPOE_P2MP
interface GigabitEthernet2.559
 pppoe enable group PPPOE_P2MP

Aside from interface enumerations, CSR3 and CSR4 have identical configurations. They both use the same CHAP hostname as well, and refuse to use the insecure PAP method. They install default routes for IPv4 and IPv6 negotiated addresses.
! CSR3 and CSR4
interface Dialer3
 description CPE OUTSIDE
 mtu 1492
 ip address negotiated
 ip nat outside
 encapsulation ppp
 dialer pool 3
 dialer idle-timeout 0
 dialer persistent
 ipv6 address autoconfig default
 ipv6 enable
 ospfv3 9 ipv6 area 0
 ospfv3 9 ipv6 network point-to-point
 ppp chap hostname CHAP
 ppp chap password 0 CHAP
 ppp pap refuse
 ppp ipcp route default
interface GigabitEthernet2.539
 pppoe-client dial-pool-number 3

CSR5 is very similar except it refuses CHAP and uses PAP. CHAP refusal is necessary since it is the preferred authentication method on the AC; failing to explicitly refuse CHAP means that authentication will fail and PAP will not be used for fallback.
! CSR5
interface Dialer5
 mtu 1492
 ip address negotiated
 ip nat outside
 encapsulation ppp
 dialer pool 5
 ipv6 address autoconfig default
 ospfv3 9 ipv6 area 0
 ospfv3 9 ipv6 network point-to-point
 ppp chap refuse
 ppp pap sent-username PAP_R5 password 0 PAP_R5
 ppp ipcp route default
interface GigabitEthernet2.559
 pppoe-client dial-pool-number 5

The only new thing to research with this design is the authentication, since IPv6 prefix delegation is not in play, and OSPFv3 over a P2P link is not new. CSR3 and CSR9 negotiate CHAP authentication and it is successful. CSR3 receives the inbound challenge from R9 and sends a response using username CHAP (which has a valid local-database entry on CSR9). CSR9 then responds that authentication was successful.
! CSR3
Vi2 CHAP: Redirect packet to Vi2
Vi2 CHAP: I CHALLENGE id 1 len 23 from "R9"
Vi2 LCP: State is Open
Vi2 CHAP: Using hostname from interface CHAP
Vi2 CHAP: Using password from interface CHAP
Vi2 CHAP: O RESPONSE id 1 len 25 from "CHAP"
Vi2 CHAP: I SUCCESS id 1 len 4
Vi2 PPP: Phase is FORWARDING, Attempting Forward
Vi2 PPP: Phase is ESTABLISHING, Finish LCP
! CSR9
ppp45 PPP: Phase is AUTHENTICATING, by this end
ppp45 CHAP: O CHALLENGE id 1 len 23 from "R9"
ppp45 LCP: State is Open
ppp45 CHAP: I RESPONSE id 1 len 25 from "CHAP"
ppp45 PPP: Phase is FORWARDING, Attempting Forward
ppp45 PPP: Phase is AUTHENTICATING, Unauthenticated User
ppp45 PPP: Phase is FORWARDING, Attempting Forward
Vi2.4 PPP: Phase is AUTHENTICATING, Authenticated User
Vi2.4 CHAP: O SUCCESS id 1 len 4

CSR5 and CSR9 fail to negotiate CHAP since CSR5 refuses it, and instead authenticate with PAP. The CHAP failure is not shown in the debug since it was carried in the low-level PPP messages sent by CSR5. PAP has less chatter than CHAP but still has an explicit authentication request from the client and a response from the server. PPP authentication in general can be done in either direction or bidirectionally, but normally the AC will authenticate the client only. For additional security, the client can authenticate the server since PPP is a peer-to-peer protocol, generally speaking. PPPoE is not used in this fashion but authentication is transport-independent.
! CSR5
Vi3 PPP: Phase is AUTHENTICATING, by the peer
Vi3 PAP: Using hostname from interface PAP
Vi3 PAP: Using password from interface PAP
Vi3 PAP: O AUTH-REQ id 1 len 18 from "PAP_R5"
Vi3 LCP: State is Open
Vi3 PAP: I AUTH-ACK id 1 len 5

! CSR9
ppp46 PPP: Queue PAP code[1] id[1]
ppp46 PPP: Phase is AUTHENTICATING, by this end
ppp46 PAP: Redirect packet to ppp46
ppp46 PAP: I AUTH-REQ id 1 len 18 from "PAP_R5"
ppp46 PAP: Authenticating peer PAP_R5
ppp46 PPP: Phase is FORWARDING, Attempting Forward
ppp46 LCP: State is Open
Vi2.1 PAP: O AUTH-ACK id 1 len 5

Verifying the PPPoE sessions on CSR9 shows 3 subscribers, each on a different VLAN, but all in the PTA state. This shows that PPPoE is working properly. Since each PPPoE client has a statically-configured IPv6 LAN prefix, we use OSPFv3 to learn them at the BNG. CSR9 has OSPFv3 neighbors with all subscribers through the PPPoE session as well.
! CSR9
R9#show pppoe session
     3 sessions in LOCALLY_TERMINATED (PTA) State
     3 sessions total

Uniq ID  PPPoE  RemMAC          Port                VT   VA         State
           SID  LocMAC                                   VA-st      Type
     45     45  0050.56a9.8ccf  Gi2.539             345  Vi2.4      PTA
                0050.56a9.d672  VLAN:3539                UP
     40     40  0050.56a9.2c57  Gi2.549             345  Vi2.2      PTA
                0050.56a9.d672  VLAN:3549                UP
     46     46  0050.56a9.dc63  Gi2.559             345  Vi2.1      PTA
                0050.56a9.d672  VLAN:3559                UP

R9#show ospfv3 ipv6 neighbor
          OSPFv3 9 address-family ipv6 (router-id 209.19.85.11)

Neighbor ID     Pri   State      Dead Time   Interface ID    Interface
192.168.5.5       0   FULL/  -   00:00:37    18              Virtual-Access2.1
192.168.34.3      0   FULL/  -   00:00:38    12              Virtual-Access2.4
192.168.34.4      0   FULL/  -   00:00:34    13              Virtual-Access2.2

We can verify the DHCP-issued addresses on CSR9 as well. The PPP information on CSR9 shows which address maps to which client. Notice that PAP and CHAP are also considered PPP sub-protocols and are shown in the PPP summary. The client-ID is actually the hostname in ASCII using hex values: 4348.4150 spells CH.AP and 5041.505f.5235 spells PA.P_.R5.
R9#show ip dhcp binding
Bindings from all pools not associated with VRF:
IP address      Client-ID/          Lease expiration   Type        State      Interface
                Hardware address/
                User name
209.34.59.35    4348.4150           Infinite           On-demand   Selecting  Vi2.2
209.34.59.37    4348.4150           Infinite           On-demand   Selecting  Vi2.4
209.34.59.38    5041.505f.5235      Infinite           On-demand   Selecting  Vi2.1

R9#show ppp all
Interface/ID OPEN+ Nego* Fail-      Stage    Peer Address    Peer Name
------------ ---------------------  -------- --------------- -----------------
Vi2.1        LCP+ PAP+ IPCP+ IPV6>  LocalT   209.34.59.38    PAP_R5
Vi2.4        LCP+ CHAP+ IPCP+ IPV>  LocalT   209.34.59.37    CHAP
Vi2.2        LCP+ CHAP+ IPCP+ IPV>  LocalT   209.34.59.35    CHAP
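The client-ID decoding described above is easy to check. A minimal Python sketch, with an illustrative function name, that strips the dots from the IOS notation and interprets the bytes as ASCII:

```python
def decode_client_id(dotted_hex: str) -> str:
    """Convert an IOS dotted-hex client-ID (e.g. 4348.4150) to its ASCII text."""
    raw = bytes.fromhex(dotted_hex.replace(".", ""))
    return raw.decode("ascii")


for client_id in ("4348.4150", "5041.505f.5235"):
    print(client_id, "->", decode_client_id(client_id))
```

The two CHAP clients share one client-ID because they share one CHAP hostname, which is why the binding table lists 4348.4150 twice.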

For CSR1 in the client VRF behind CSR3 and CSR4 (tied together with HSRP, which is examined in greater detail in the NAT44/NAT444 section), we use all static addressing and static routing for IPv4 and IPv6. CSR1 has reachability within the client VRF as desired.
! CSR1
interface GigabitEthernet2.534
 vrf forwarding 34
 ip address 192.168.34.1 255.255.255.0
 ipv6 address 2001:192:168:34::1/64
ip route vrf 34 0.0.0.0 0.0.0.0 192.168.34.254
ipv6 route vrf 34 ::/0 GigabitEthernet2.534 FE80::254


R1#ping vrf 34 13.144.2.1
Type escape sequence to abort.
Sending 5, 100-byte ICMP Echos to 13.144.2.1, timeout is 2 seconds:
!!!!!
Success rate is 100 percent (5/5), round-trip min/avg/max = 4/6/15 ms
R1#ping vrf 34 2bad:beef:13:aaaa::a
Type escape sequence to abort.
Sending 5, 100-byte ICMP Echos to 2BAD:BEEF:13:AAAA::A, timeout is 2 seconds:
!!!!!
Success rate is 100 percent (5/5), round-trip min/avg/max = 5/6/10 ms

Last, we examine the DHCPv4 proxy service in conjunction with PPP. The equivalent for IPv6 would be using some AAA server to issue IPv6 prefixes for prefix delegation, but that is not tested here. In this example, PPPoE clients CSR6 and CSR7 dial in to CSR10 and request addresses via IPCP. Because CSR10 has neither local pools nor DHCP pools configured, it sends the requests to a remote DHCP server (CSR9), much like a DHCP relay. That DHCP server will issue addresses on a per-hostname basis, which implies that one of two things must happen: authentication must be used so the server sees hostnames, or the server must be configured to assign different hostnames behind the scenes to each client. Failure to do this will result in the same client-ID being presented to the DHCP server, which results in the same IPCP address being allocated to multiple clients. This ultimately breaks routing.

CSR10 uses local DHCPv6 for PD, but rather than call a local pool, it assigns specific prefixes to CSR6 and CSR7. The hexadecimal values are the DHCPv6 unique ID (DUID) of each client, which we verify later. I also add several other minor PPP IPCP and IPV6CP options to enforce uniqueness among addressing for clients, but the “username” uniqueness is what allows this design to actually work. The BBA configuration includes 2 sessions per VLAN since all PPPoE speakers are on the same segment in this broadband design, but each client can only have one session. The throttling ensures that a PPPoE client cannot try to initiate more than 5 sessions in 5 minutes, and if it does, it is blocked for 0 minutes (not blocked at all). We can set the 802.1p bits in the 802.1q VLAN tag to be CoS 7 so that the PPPoE control traffic is less likely to be dropped during times of congestion.

Two DHCP servers are defined because the addressing is somewhat asymmetric; CSR10 can reach CSR9’s loopback, but when CSR9 responds, it does so from a suppressed transit link as the source. CSR10 needs to account for this source or else the DHCPOFFER is automatically rejected, which explains the two DHCP server commands.
! CSR10
ipv6 dhcp pool DHCP_PD_R6_R7_SPECIFIC
 prefix-delegation 2001:192:168:7::/64 00030001001E49CAA400 lifetime infinite infinite
 prefix-delegation 2001:192:168:6::/64 00030001001EBD696200 lifetime infinite infinite
ip dhcp-server 10.9.9.9
ip dhcp-server 10.9.11.9
ip address-pool dhcp-proxy-client
bba-group pppoe PPPOE_VLAN
 virtual-template 67
 sessions per-mac limit 1
 sessions per-vlan limit 2
 sessions per-vlan throttle 5 5 0
 control-packets vlan cos 7
interface Virtual-Template67
 mtu 1492
 ip unnumbered Loopback67
 peer ip address forced
 peer default ip address dhcp
 ipv6 enable
 no ipv6 nd ra suppress
 ipv6 nd ra lifetime 60
 ipv6 nd ra interval 10 5
 ipv6 dhcp server DHCP_PD_R6_R7_SPECIFIC
 ppp ipcp mask reject
 ppp ipcp username unique
 ppp ipcp address required
 ppp ipcp address unique
 ppp ipv6cp address unique
interface GigabitEthernet2.556
 pppoe enable group PPPOE_VLAN

The client configuration is nothing special, and is nearly identical on CSR6 and CSR7; only CSR6 is shown. Both clients receive IPCP addresses and perform NAT44 to give access for their LAN hosts. They also learn IPv6 prefixes via DHCPv6 PD for their LANs to access the IPv6 Internet.
! CSR6
interface Dialer6
 description CPE OUTSIDE
 mtu 1492
 ip address negotiated
 ip nat outside
 encapsulation ppp
 dialer pool 6
 dialer idle-timeout 0
 dialer persistent
 ipv6 address autoconfig default
 ipv6 dhcp client pd PREFIX_FROM_ISP
 ppp ipcp route default
interface GigabitEthernet2.556
 pppoe-client dial-pool-number 6


To support this design, we also need to add a new DHCP pool to service CSR10’s PPPoE clients. The pool is configured on CSR9.
! CSR9
ip dhcp excluded-address 209.56.70.0 209.56.70.20
ip dhcp pool DHCP_PROXY_V4
 network 209.56.70.0 255.255.255.0
 default-router 209.56.70.10

As a general comment, the debug below shows what happens if the DHCP server from which the DHCPOFFER is received is not explicitly configured on the proxy router. The DHCPOFFER arriving on CSR10 is automatically rejected if 10.9.11.9 is not configured as an explicit DHCP server.
CSR10#debug dhcp
DHCP: offer received from 10.9.11.9
DHCP: offer: server 10.9.11.9 not in approved list

On CSR10, we examine the debugs to see how a client dials in. The initial PPP LCP process is unchanged, as the DHCPv4 process is invoked by IPCP, an upper-layer PPP sub-protocol. IPCP is “stalled” waiting for an address.
! CSR10
Vi2.1 IPCP: Stalled on pool request
Vi2.1 IPCP: CP stalled on event[IPCP Allocate Address]
Vi2.1 IPCP: Stalled on option [Address]

At this point, the DHCP process sends a DHCPDISCOVER to the DHCP server. The discover is sent twice, once to each server, but we know 10.9.11.9 is unroutable. The reply comes from 10.9.11.9 (CSR9 transit link) and contains the address 209.56.70.22.
! CSR10
DHCP: proxy allocate request
DHCP: new entry. add to queue
DHCP: SDiscover attempt # 1 for entry:
DHCP: SDiscover: sending 276 byte length DHCP packet
DHCP: SDiscover 276 bytes
DHCP: SDiscover 276 bytes
DHCP: Received a BOOTREP pkt
DHCP: offer received from 10.9.11.9
DHCP: SRequest attempt # 1 for entry:
DHCP: SRequest- Server ID option: 10.9.11.9
DHCP: SRequest- Requested IP addr option: 209.56.70.22
DHCP: SRequest placed lease len option: 75144
DHCP: SRequest: 294 bytes
DHCP: SRequest: 294 bytes
DHCP: SRequest: 294 bytes
DHCP: XID MATCH in dhcpc_for_us()
DHCP: Received a BOOTREP pkt
DHCP Proxy Client Pooling: ***Allocated IP address: 209.56.70.22

This address is returned to the IPCP process for allocation to the client. IPCP is now “unstalled”. An inbound CONFREQ arrives with all zeroes, essentially requesting an address. The AC uses the CONFNAK message, sent outbound to the client, as a method of offering an address. The client then formally requests the address and the AC confirms it. This is the same mechanism seen earlier for local-pool IPCP address allocation.
! CSR10
Vi2.1 IPCP: CP unstall
Vi2.1 IPCP: Continue processing stalled packet:
Vi2.1 IPCP: I CONFREQ [ACKrcvd] id 1 len 10
Vi2.1 IPCP:    Address 0.0.0.0 (0x030600000000)
Vi2.1 PPP/IPAM: ipcp_req_addr: s_data=C000056 r=0 a=0 ans=0
Vi2.1 IPCP AUTHOR: Done. Her address 0.0.0.0, we want 0.0.0.0
Vi2.1 IPCP: Pool returned 209.56.70.22
Vi2.1 IPCP: O CONFNAK [ACKrcvd] id 1 len 10
Vi2.1 IPCP:    Address 209.56.70.22 (0x0306D1384616)
Vi2.1 IPCP: Event[Receive ConfReq-] State[ACKrcvd to ACKrcvd]
Vi2.1 IPCP: I CONFREQ [ACKrcvd] id 2 len 10
Vi2.1 IPCP:    Address 209.56.70.22 (0x0306D1384616)
Vi2.1 PPP/IPAM: ipcp_req_addr: s_data=0 r=0 a=0 ans=0
Vi2.1 IPCP: O CONFACK [ACKrcvd] id 2 len 10
Vi2.1 IPCP:    Address 209.56.70.22 (0x0306D1384616)
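The hex strings in parentheses are the raw IPCP IP-Address options carried in each message. As a minimal Python sketch (the function name is illustrative), each one decodes as option type 3, length 6, followed by the four address bytes:

```python
import ipaddress


def decode_ipcp_address_option(hex_string: str) -> ipaddress.IPv4Address:
    """Decode an IPCP IP-Address option (type 3, length 6) as printed by the debug."""
    if hex_string.lower().startswith("0x"):
        hex_string = hex_string[2:]
    raw = bytes.fromhex(hex_string)
    if len(raw) != 6 or raw[0] != 3 or raw[1] != 6:
        raise ValueError("not an IPCP IP-Address option")
    return ipaddress.IPv4Address(raw[2:6])


print(decode_ipcp_address_option("0x030600000000"))  # client's initial CONFREQ
print(decode_ipcp_address_option("0x0306D1384616"))  # AC's CONFNAK offer
```

The all-zero value in the first CONFREQ is what prompts the AC to answer with a CONFNAK carrying the address it wants the client to use.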

The DHCP server now shows two addresses allocated to clients CSR6 and CSR7. The PPP details on CSR10 can show which address went to which host. Notice the single digit difference in the client-ID, which was done by CSR10 by making the usernames unique on call-in. Without this, the DHCP server thinks that the same client keeps asking for an address, so it responds with the same address over and over, which is not valid as it breaks routing on CSR10.
R9#show ip dhcp binding 209.56.70.22
IP address      Client-ID/              Lease expiration        Type       State   Interface
                Hardware address/
                User name
209.56.70.22    003d.3230.392e.3536.    MON 09 2015 08:48 PM    Automatic  Active  Gig2.591
                2e37.302e.3130.3d56.
                6932.2e31
R9#show ip dhcp binding 209.56.70.23
IP address      Client-ID/              Lease expiration        Type       State   Interface
                Hardware address/
                User name
209.56.70.23    003d.3230.392e.3536.    MON 09 2015 08:54 PM    Automatic  Active  Gig2.591
                2e37.302e.3130.3d56.
                6932.2e32

R10#show ppp all
Interface/ID OPEN+ Nego* Fail-     Stage    Peer Address    Peer Name
------------ --------------------- -------- --------------- -----------------
Vi2.2        LCP+ IPCP+ IPV6CP+    LocalT   209.56.70.23
Vi2.1        LCP+ IPCP+ IPV6CP+    LocalT   209.56.70.22

We can also verify the DUIDs on the clients, which do not appear configurable. To issue specific IPv6 prefixes to clients, we can map these DUIDs to manual prefixes inside the DHCPv6 pool on CSR10. This is less dynamic but more granular than using a local pool. AAA attributes allow for this functionality as well, but that is not tested here. One would have to check the DUID on the router first if trying to assign specific PD prefixes to a CPE device via DHCPv6.
R6#show ipv6 dhcp
This device's DHCPv6 unique identifier(DUID): 00030001001EBD696200
R7#show ipv6 dhcp
This device's DHCPv6 unique identifier(DUID): 00030001001E49CAA400
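These DUIDs follow the DUID-LL layout from the DHCPv6 specification: a 2-byte DUID type (3), a 2-byte hardware type (1 for Ethernet), then the link-layer address. A minimal Python sketch that splits them apart; the function name is illustrative, and only the DUID-LL type is handled:

```python
import struct


def parse_duid_ll(duid_hex: str) -> dict:
    """Split a DUID-LL (type 3) into hardware type and link-layer address."""
    raw = bytes.fromhex(duid_hex)
    duid_type, hw_type = struct.unpack_from("!HH", raw, 0)
    if duid_type != 3:
        raise ValueError("only DUID-LL (type 3) is handled in this sketch")
    ll_hex = raw[4:].hex()
    # Present the address in Cisco dotted notation, four hex digits per group.
    dotted = ".".join(ll_hex[i:i + 4] for i in range(0, len(ll_hex), 4))
    return {"duid_type": duid_type, "hw_type": hw_type, "link_layer": dotted}


for name, duid in (("CSR6", "00030001001EBD696200"),
                   ("CSR7", "00030001001E49CAA400")):
    print(name, parse_duid_ll(duid))
```

Because the identifier is derived from a link-layer address, it stays stable across sessions, which is what makes the static DUID-to-prefix mapping on CSR10 workable.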

To verify the CoS markings applied by CSR10, we can enable PPPoE packet debugging on CSR10 and CSR9 to compare the differences. Within CSR9’s dot1q PADO header, the 4 bits preceding the VLAN ID are 0000; the first 3 bits represent the CoS which is 000, or 0 in decimal. The PADO packet from CSR10, however, has bits 1110 (0xE), of which the first 3 bits are 111, or 7. This CoS marking applies to PADO and PADS packets for PPPoE discovery, as well as PPP’s LCP, NCP sub-protocols (IPCP, IPV6CP, etc), keepalives, and authentication. CSR10’s PADO, PADS and PADT are shown to prove this; notice that the PADT does not have this marking set.
! CSR9
PPPoE 0: O PADO, R:0050.56a9.d672 L:0050.56a9.8ccf 3539 Gi2.539
 Service tag: NULL Tag
contiguous pak, size 66
 00 50 56 A9 8C CF 00 50 56 A9 D6 72 81 00 0D D3
 88 63 11 07 00 00 00 2A 01 01 00 00 01 03 00 08
 C3 00 00 02 00 00 26 8C 01 02 00 02 52 39 01 04
 00 10 6A EC 54 7B 52 F4 6B 8F 8D AA 32 83 7D 6E
 B1 0D
! CSR10
PPPoE 0: O PADO, R:0050.56a9.f961 L:0050.56a9.ea77 3556 Gi2.556
 Service tag: NULL Tag
contiguous pak, size 67
 00 50 56 A9 EA 77 00 50 56 A9 F9 61 81 00 ED E4
 88 63 11 07 00 00 00 2B 01 01 00 00 01 03 00 08
 A9 00 00 05 00 00 1C CB 01 02 00 03 52 31 30 01
 04 00 10 D3 77 CD ED 4B 6B AF E9 12 94 4A 4D F4
 0C 92 34
[52]PPPoE 52: O PADS R:0050.56a9.ea77 L:0050.56a9.f961 Gi2.556
contiguous pak, size 67
 00 50 56 A9 EA 77 00 50 56 A9 F9 61 81 00 ED E4
 88 63 11 65 00 34 00 2B 01 03 00 08 A9 00 00 05
 00 00 1C CB 01 02 00 03 52 31 30 01 04 00 10 D3
 77 CD ED 4B 6B AF E9 12 94 4A 4D F4 0C 92 34 01
 01 00 00
[50]PPPoE 50: O PADT R:0050.56a9.ea77 L:0050.56a9.f961 Gi2.556
contiguous pak, size 64
 00 50 56 A9 EA 77 00 50 56 A9 F9 61 81 00 0D E4
 88 63 11 A7 00 32 00 00 00 00 00 00 00 00 00 00
 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
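The bit arithmetic on the 802.1Q tag can be checked with a short Python sketch. It takes the two bytes that follow the 81 00 TPID in each frame (the tag control information) and splits them into the 3-bit priority, the 1-bit DEI, and the 12-bit VLAN ID. The function name is illustrative.

```python
def parse_tci(tci: int) -> dict:
    """Split a 16-bit 802.1Q TCI into priority (CoS), DEI, and VLAN ID."""
    return {
        "cos": tci >> 13,         # top 3 bits: 802.1p priority
        "dei": (tci >> 12) & 1,   # drop eligible indicator
        "vlan": tci & 0x0FFF,     # low 12 bits: VLAN ID
    }


print("CSR9 PADO :", parse_tci(0x0DD3))
print("CSR10 PADO:", parse_tci(0xEDE4))
print("CSR10 PADT:", parse_tci(0x0DE4))
```

CSR9's PADO and CSR10's PADT both carry CoS 0, while CSR10's PADO carries CoS 7 on VLAN 3556.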

For client connectivity, we use the same method on CSR6 and CSR7 as we did on CSR2. This involves a local DHCPv4 pool and NAT44 for IPv4 connectivity, with IPv6 PD for the IPv6 hosts. The configuration on all devices, including the CSR1 client VRFs, is not shown. We can verify connectivity for IPv4 and IPv6 using both CSR6 and CSR7 below. Unfortunately, despite being in different VRFs, XE does not let us configure multiple IPv6 ND defaults on multiple interfaces. We manually configure static routes for VRF 6 and 7 as a result (not shown).
R1#ping vrf 6 13.144.2.1
Type escape sequence to abort.
Sending 5, 100-byte ICMP Echos to 13.144.2.1, timeout is 2 seconds:
!!!!!
Success rate is 100 percent (5/5), round-trip min/avg/max = 4/6/13 ms
R1#ping vrf 6 2bad:beef:13:aaaa::a
Type escape sequence to abort.
Sending 5, 100-byte ICMP Echos to 2BAD:BEEF:13:AAAA::A, timeout is 2 seconds:
!!!!!
Success rate is 100 percent (5/5), round-trip min/avg/max = 6/8/16 ms
R1#ping vrf 7 13.144.2.1
Type escape sequence to abort.
Sending 5, 100-byte ICMP Echos to 13.144.2.1, timeout is 2 seconds:
!!!!!
Success rate is 100 percent (5/5), round-trip min/avg/max = 4/6/15 ms
R1#ping vrf 7 2bad:beef:13:aaaa::a
Type escape sequence to abort.
Sending 5, 100-byte ICMP Echos to 2BAD:BEEF:13:AAAA::A, timeout is 2 seconds:
!!!!!
Success rate is 100 percent (5/5), round-trip min/avg/max = 6/7/12 ms

Additional Reading – Reference configurations “pppoe-tech”


1.2.2 Multi-service PPPoE and LAC/LNS architecture

This section tests using “smart” PPPoE server selection as well as a basic LAC/LNS architecture using PPPoE as the access technology. The technical PPPoE and L2TP details are summarized since the architecture is the focus in this section; those technologies have their own sections. The network is shown below which is very similar to the PPPoE technology architecture. This time, CSR2 and CSR3 are PPPoE clients with CSR8 and CSR9 as ACs on the same LAN. CSR10 is an LNS with CSR4 and CSR5 as LACs. CSR6 and CSR7 are PPPoE clients like CSR2 and CSR3. This allows us to test PPPoE in conjunction with L2TP VPDN technologies. The upper-level architecture is still BGP-oriented since XRv does not support PPPoE or any L2VPN features, and CSR1 is generally used for most tests because it supports IPv6 SLAAC (XR in general does not).

CSR2 wants to join the "RED" service and states this in its PADI as seen in the PPPoE technology section. Both CSR8 and CSR9 offer "RED" service, but CSR9's PADO is delayed by about 1 second. Cisco will round these backoff timers to the closest multiple of 256 ms, which is why I chose 1024 ms. The PADI is a layer 2 broadcast, and both ACs will respond, but CSR2 will use CSR8's PADO since it was received first. The PADO delay timer, combined with service-name selection, is how one can achieve granularity/load-sharing with BNG nodes. Basic features like DHCPv6 PD and local pools are not shown again on the ACs since they are the same as the earlier examples. No new complexity is being introduced with those technologies. The interface to which the VTs are unnumbered is advertised into IS-IS passively, and the netmask is big enough to encompass the entire local pool. In this way, routing to the PPPoE clients is cleanly achieved without manual configuration. IPv6 static routes generated by DHCPv6 for PD are also redistributed into IS-IS.
! CSR8
bba-group pppoe PPPOE_RED
 virtual-template 89
 service name contains RED
 sessions per-vlan limit 5
 pado delay 0
 control-packets vlan cos 7
interface GigabitEthernet2.589
 pppoe enable group PPPOE_RED
interface Virtual-Template89
 mtu 1492
 ip unnumbered Loopback208
 peer default ip address pool PPPOE_RED_IPV4
 ipv6 enable
 no ipv6 nd ra suppress
 ipv6 nd ra interval 30
 ipv6 dhcp server DHCPV6_PD
! CSR9
bba-group pppoe PPPOE_RED
 virtual-template 89
 service name contains RED
 sessions per-vlan limit 5
 pado delay 1024
 control-packets vlan cos 7
bba-group pppoe PPPOE_BLUE
 virtual-template 99
 service name contains BLUE accept-null-service
 sessions per-vlan limit 5
 pado delay 0
 control-packets vlan cos 6
interface Virtual-Template89
 mtu 1492
 ip unnumbered Loopback209
 peer default ip address pool PPPOE_RED_IPV4
 ipv6 enable
 no ipv6 nd ra suppress
 ipv6 nd ra interval 30
 ipv6 dhcp server DHCPV6_PD
interface Virtual-Template99
 mtu 1492
 ip unnumbered Loopback209
 peer default ip address pool PPPOE_BLUE_IPV4
 ipv6 enable
 no ipv6 nd ra suppress
 ipv6 nd ra interval 30
 ipv6 dhcp server DHCPV6_PD
interface GigabitEthernet2.589
 pppoe enable group PPPOE_RED
interface GigabitEthernet3.589
 pppoe enable group PPPOE_BLUE
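The selection logic can be modeled in a few lines of Python. This is a rough sketch under stated assumptions: the delay is rounded to the nearest multiple of 256 ms with ties rounding up (the tie rule is a guess, not documented behavior), network latency is ignored, and the client simply takes whichever PADO would arrive first. The function names and the dictionary of ACs are illustrative.

```python
def round_pado_delay(delay_ms: int, step_ms: int = 256) -> int:
    """Round a configured PADO delay to the nearest multiple of step_ms.

    Assumption: ties round up; the real platform behavior may differ.
    """
    return step_ms * ((delay_ms + step_ms // 2) // step_ms)


def first_pado(ac_delays_ms: dict) -> str:
    """Return the AC whose PADO would be sent first, ignoring network latency."""
    return min(ac_delays_ms, key=lambda ac: round_pado_delay(ac_delays_ms[ac]))


red_acs = {"CSR8": 0, "CSR9": 1024}  # "pado delay" values for the RED service
print(round_pado_delay(1024))  # already a multiple of 256
print(round_pado_delay(1000))  # would be rounded to a 256 ms boundary
print(first_pado(red_acs))
```

Choosing a delay that is already a multiple of 256 ms, as done here with 1024, keeps the configured value and the effective value identical.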

The client configurations are shown below. Basic features like NAT44 and DHCPv4 are not shown again since they are the same as the earlier examples. No new complexity is being introduced with those technologies. CSR3’s dialer is identical to CSR2’s with the exception of numbering, so it is not shown again. The only other difference is the service-name.

! CSR2
interface Dialer2
 mtu 1492
 ip address negotiated
 ip nat outside
 encapsulation ppp
 dialer pool 2
 dialer idle-timeout 0
 dialer persistent
 ipv6 address autoconfig default
 ipv6 dhcp client pd PPPOE_ISP_PREFIX
 ppp ipcp route default
interface GigabitEthernet2.589
 pppoe-client dial-pool-number 2 service-name "RED"

! CSR3
interface GigabitEthernet2.589
 pppoe-client dial-pool-number 3 service-name "BIG_BLUE_HOUSE"

We can watch the session establish when CSR2 initiates the discovery process. The debug on CSR2 (with timestamps) clearly shows the outgoing PADI, followed by two incoming PADOs.

R2#debug pppoe event
R2#debug pppoe packet
00:04:51.747: pppoe_send_padi
00:04:51.750: PPPoE 0: I PADO R:0050.56a9.fb1c L:0050.56a9.be8a 3589 Gi2.589
00:04:53.011: PPPoE 0: I PADO R:0050.56a9.d672 L:0050.56a9.be8a 3589 Gi2.589
00:04:53.796: PPPOE: we've got our pado and the pado timer went off
00:04:53.796: OUT PADR from PPPoE Session
00:04:53.802: PPPoE 1: I PADS R:0050.56a9.fb1c L:0050.56a9.be8a 3589 Gi2.589
00:04:53.802: IN PADS from PPPoE Session

CSR8 receives the PADI and matches it to the RED service. The PADO is immediately sent in reply, to which CSR2 then issues a PADR. The PPPoE discovery process continues normally and the session is formed between CSR2 and CSR8 (debug is trimmed).

! CSR8
00:04:51.155: PPPoE 0: I PADI R:0050.56a9.be8a L:ffff.ffff.ffff 3589 Gi2.589
00:04:51.155: PPPoE 0: Requested service-name RED partial match with RED
00:04:51.155: Service tag: RED
00:04:51.155: PPPoE 0: O PADO, R:0050.56a9.fb1c L:0050.56a9.be8a 3589 Gi2.589
00:04:51.155: Service tag: RED
00:04:53.204: PPPoE 0: I PADR R:0050.56a9.be8a L:0050.56a9.fb1c 3589 Gi2.589
00:04:53.204: Service tag: RED
00:04:53.206: [5]PPPoE 1: O PADS R:0050.56a9.be8a L:0050.56a9.fb1c Gi2.589

CSR9 also receives the PADI and matches it to the RED service. It sees the PADI twice, once for RED matching and once for BLUE matching, and fails the BLUE match as expected. The PADO is sent about 1 second later, and CSR2 never replies with a PADR back to CSR9. CSR9 creates no PPPoE state and acts as if nothing happened.

! CSR9
00:04:51.212: PPPoE 0: I PADI R:0050.56a9.be8a L:ffff.ffff.ffff 3589 Gi2.589
00:04:51.212: PPPoE 0: Requested service-name RED partial match with RED
00:04:51.212: Service tag: RED
00:04:51.212: PPPoE: PADO id 0: Starting timer for 1024 msec
00:04:51.212: PPPoE 0: I PADI R:0050.56a9.be8a L:ffff.ffff.ffff 3589 Gi3.589
00:04:51.212: PPPoE 0: Requested service-name RED has no partial match with BLUE, discarding PADI R:0050.56a9.be8a L:ffff.ffff.ffff 3589 Gi3.589
00:04:52.474: PPPoE: Sending PADO for pado id 0
00:04:52.474: PPPoE 0: O PADO, R:0050.56a9.d672 L:0050.56a9.be8a 3589 Gi2.589
00:04:52.474: Service tag: RED

We can also verify this with show commands. CSR2 is connected to the MAC address of CSR8; we can see the IP address, assuming IPCP is negotiated, by checking the PPP details.

R2#show pppoe session
     1 client session

Uniq ID  PPPoE  RemMAC          Port      VT   VA     State
           SID  LocMAC                         VA-st  Type
    N/A      1  0050.56a9.fb1c  Gi2.589   Di2  Vi2    UP
                0050.56a9.be8a                 UP

R2#show ppp all
Interface/ID OPEN+ Nego* Fail-     Stage    Peer Address    Peer Name
------------ --------------------- -------- --------------- -----------------
Vi2          LCP+ IPCP+ IPV6CP+    LocalT   209.8.8.8

CSR2 successfully received an IP address via IPCP (CSR8 local pool) and an IPv6 address via autoconfiguration, using the interface identifier exchanged with IPV6CP. We also validate that IPv6 prefix delegation worked using DHCPv6; CSR2 receives a prefix from the pool of prefixes and CSR8 creates a static route to be redistributed into the IGP.

R2#show ppp interface vi2 | begin IPCP
Vi2 IPCP: [Open]
Our Negotiated Options
Vi2 IPCP: Address 209.8.9.50 (0x0306D1080932)
Peer's Negotiated Options
Vi2 IPCP: Address 209.8.8.8 (0x0306D1080808)
Vi2 IPV6CP: [Open]
Our Negotiated Options
Vi2 IPV6CP: Interface-Id 021E:14FF:FE15:DB00 (0x010A021E14FFFE15DB00)
Peer's Negotiated Options
Vi2 IPV6CP: Interface-Id 021E:E6FF:FE4D:4D00 (0x010A021EE6FFFE4D4D00)

R2#show ipv6 general-prefix
IPv6 Prefix PPPOE_ISP_PREFIX, acquired via DHCP PD
  2001:192:168:80::/64 Valid lifetime 2591986, preferred lifetime 604786
   GigabitEthernet2.524 (Address command)

R8#show ipv6 route static | begin ^S
S   2001:192:168:80::/64 [1/0]
     via FE80::21E:14FF:FE15:DB00, Virtual-Access2.1
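The hex strings in parentheses in the PPP output are the raw negotiated options in type/length/value form. The Python sketch below decodes them by hand, assuming the standard layout (1-byte type, 1-byte total length, then the value) with IPCP option 3 being the IP address and IPV6CP option 1 being the interface identifier. The function names are hypothetical helpers written for this illustration.

```python
import ipaddress


def split_option(hex_string: str):
    """Split a debug-style hex option into (type, value bytes)."""
    text = hex_string[2:] if hex_string.lower().startswith("0x") else hex_string
    raw = bytes.fromhex(text)
    opt_type, opt_len = raw[0], raw[1]
    if opt_len != len(raw):
        raise ValueError("length field does not match option size")
    return opt_type, raw[2:]


def decode_ipcp_address(hex_string: str) -> str:
    """Decode an IPCP IP-Address option (type 3, 4-byte value)."""
    opt_type, value = split_option(hex_string)
    if opt_type != 3 or len(value) != 4:
        raise ValueError("not an IPCP IP-Address option")
    return str(ipaddress.IPv4Address(value))


def decode_ipv6cp_interface_id(hex_string: str) -> str:
    """Decode an IPV6CP Interface-Identifier option (type 1, 8-byte value)."""
    opt_type, value = split_option(hex_string)
    if opt_type != 1 or len(value) != 8:
        raise ValueError("not an IPV6CP Interface-Identifier option")
    return ":".join(value[i:i + 2].hex().upper() for i in range(0, 8, 2))


print(decode_ipcp_address("0x0306D1080932"))
print(decode_ipcp_address("0x0306D1080808"))
print(decode_ipv6cp_interface_id("0x010A021E14FFFE15DB00"))
```

The decoded values should agree with the human-readable fields CSR2 prints next to each hex string.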

Looking at CSR3 as a client, CSR8 does not offer the BLUE service but CSR9 does. Since CSR3 wants to join the BLUE service, only one PADO will be received from CSR9 since CSR8 cannot support it. CSR9 also supports clients with the "null" service to catch clients by default. For brevity we will debug only PPPoE events, not packets.

R3#debug pppoe packet
00:27:23.143: padi timer expired
00:27:23.143: Sending PADI: Interface = GigabitEthernet2.589
00:27:23.147: PPPoE 0: I PADO R:0050.56a9.4c24 L:0050.56a9.8ccf 3589 Gi2.589
00:27:25.192: PPPOE: we've got our pado and the pado timer went off
00:27:25.192: OUT PADR from PPPoE Session
00:27:25.197: PPPoE 36: I PADS R:0050.56a9.4c24 L:0050.56a9.8ccf 3589 Gi2.589
00:27:25.197: IN PADS from PPPoE Session

! CSR8
00:27:24.115: PPPoE 0: I PADI R:0050.56a9.8ccf L:ffff.ffff.ffff 3589 Gi2.589
00:27:24.115: PPPoE 0: Requested service-name BIG_BLUE_HOUSE has no partial match with RED, discarding PADI R:0050.56a9.8ccf L:ffff.ffff.ffff 3589 Gi2.589

! CSR9
00:27:24.172: PPPoE 0: I PADI R:0050.56a9.8ccf L:ffff.ffff.ffff 3589 Gi2.589
00:27:24.172: PPPoE 0: Requested service-name BIG_BLUE_HOUSE has no partial match with RED, discarding PADI R:0050.56a9.8ccf L:ffff.ffff.ffff 3589 Gi2.589
00:27:24.172: PPPoE 0: I PADI R:0050.56a9.8ccf L:ffff.ffff.ffff 3589 Gi3.589
00:27:24.172: PPPoE 0: Requested service-name BIG_BLUE_HOUSE partial match with BLUE
00:27:24.172: Service tag: BIG_BLUE_HOUSE
00:27:24.172: PPPoE 0: O PADO, R:0050.56a9.4c24 L:0050.56a9.8ccf 3589 Gi3.589
00:27:24.172: Service tag: BIG_BLUE_HOUSE
00:27:26.221: PPPoE 0: I PADR R:0050.56a9.8ccf L:0050.56a9.4c24 3589 Gi3.589
00:27:26.222: Service tag: BIG_BLUE_HOUSE
00:27:26.222: PPPoE : encap string prepared
00:27:26.222: [327]PPPoE 36: O PADS R:0050.56a9.8ccf L:0050.56a9.4c24 Gi3.589
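The two client cases can be summarized as a small selection model. The Python sketch below is a simplification for study purposes only. It assumes that "service name contains" behaves as a substring test (inferred from the "partial match" debug lines), that a null service-name is accepted only where accept-null-service is configured, and that network latency is negligible, so the lowest PADO delay is heard first. The class and function names are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class BbaGroup:
    ac_name: str               # access concentrator that owns the group
    keyword: str               # string from "service name contains <keyword>"
    pado_delay_ms: int         # configured PADO delay
    accept_null: bool = False  # models "accept-null-service"


def offers(group: BbaGroup, requested: str) -> bool:
    """Return True if this group would answer the PADI with a PADO."""
    if requested == "":
        return group.accept_null
    return group.keyword in requested


def select_ac(groups: List[BbaGroup], requested: str) -> Optional[str]:
    """Pick the AC whose PADO arrives first (lowest delay in this model)."""
    candidates = [g for g in groups if offers(g, requested)]
    if not candidates:
        return None
    return min(candidates, key=lambda g: g.pado_delay_ms).ac_name


LAB_GROUPS = [
    BbaGroup("CSR8", "RED", 0),
    BbaGroup("CSR9", "RED", 1024),
    BbaGroup("CSR9", "BLUE", 0, accept_null=True),
]

print(select_ac(LAB_GROUPS, "RED"))             # CSR2's request
print(select_ac(LAB_GROUPS, "BIG_BLUE_HOUSE"))  # CSR3's request
```

In this model CSR2's request resolves to CSR8 and CSR3's request resolves to CSR9, which matches the sessions seen in the debugs.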

A quick PPPoE/PPP verification shows that CSR3 is working properly.

R3#show pppoe session
     1 client session

Uniq ID  PPPoE  RemMAC          Port      VT   VA     State
           SID  LocMAC                         VA-st  Type
    N/A     38  0050.56a9.4c24  Gi2.589   Di3  Vi3    UP
                0050.56a9.8ccf                 UP

R3#show ppp all
Interface/ID OPEN+ Nego* Fail-     Stage    Peer Address    Peer Name
------------ --------------------- -------- --------------- -----------------
Vi3          LCP+ IPCP+ IPV6CP+    LocalT   209.9.9.9

As a final verification to ensure NAT44 is working on CSR2 and CSR3, as well as IPv6 global unicast routing, we will send traffic from CSR1 to the Internet via both clients for both protocols. The PADO delay and service-name selection shown in this section is sometimes called the “smart” PPPoE server selection mechanism.

R1#ping vrf 2 13.144.2.1
[snip]
Success rate is 100 percent (5/5), round-trip min/avg/max = 3/6/16 ms
R1#ping vrf 3 13.144.2.1
[snip]
Success rate is 100 percent (5/5), round-trip min/avg/max = 4/6/16 ms
R1#ping vrf 2 2bad:beef:13:dddd::d
[snip]
Success rate is 100 percent (5/5), round-trip min/avg/max = 5/6/9 ms
R1#ping vrf 3 2bad:beef:13:dddd::d
[snip]
Success rate is 100 percent (5/5), round-trip min/avg/max = 7/10/25 ms

Next, we progress to the LAC/LNS configuration. CSR4 and CSR5 are LACs that create L2TP tunnels to the LNS. The PPPoE sessions terminate on the LACs, but all of the intelligent PPP negotiation happens with the LNS. The LAC simply forwards the PPP connections onto the LNS inside of an L2TP tunnel. This allows the CPE routers to appear directly connected to the LNS. For brevity, only CSR4 and CSR6 will be analyzed in terms of LAC and CPE; CSR5 and CSR7 are configured almost identically. IPCP and IPV6CP work exactly as one would expect; the only caveat is that all of this logic is centralized on the LNS, not the LACs.

First, we will examine the CPE configurations, which are nearly identical to the configurations on CSR2 and CSR3. Only the differences are shown below; the PAP hostname includes the domain-name so that the Virtual Private Dial-up Network (VPDN) process can match this user to a vpdn-group. All other CPE settings, such as NAT, IPCP, DHCPv6 PD, MTU, etc. are the same as CSR2 and CSR3.

! CSR6
interface Dialer6
 ppp pap sent-username [email protected] password 0 R6
interface GigabitEthernet2.546
 pppoe-client dial-pool-number 6

! CSR7
interface Dialer6
 ppp pap sent-username [email protected] password 0 R7
interface GigabitEthernet2.557
 pppoe-client dial-pool-number 7

The LAC configuration is a little more involved. This is where the PPPoE logic terminates, as there is a BBA group configured on the interfaces towards the CPEs. The LAC will initiate L2TP tunnels to CSR10’s LAN interface IP address (any reachable address is fine) provided the dialing user is within the lab.local domain. This is why the PAP username is valuable (CHAP can be used also). Before configuring any VPDN features, we must enable the process globally, and in this case, we want to perform domain-based matching. Notice that the LAC must be configured to authenticate CPEs via PAP despite not actually doing it. When creating L2TP tunnels, both CSR4 and CSR5 will use the name “LAC45” to identify themselves.

! CSR4
vpdn enable
vpdn search-order domain
vpdn-group LAC
 request-dialin
  protocol l2tp
  domain lab.local
 initiate-to ip 10.45.10.10
 local name LAC45
 l2tp tunnel password 0 L2TP_AUTH
bba-group pppoe LAC
 virtual-template 10
 sessions per-mac limit 2
interface Virtual-Template10
 mtu 1492
 no ip address
 ppp authentication pap callin
interface GigabitEthernet2.546
 pppoe enable group LAC
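Domain-based matching on the LAC amounts to a lookup keyed on the part of the PPP username after the "@". The Python sketch below models only that lookup, assuming a single vpdn-group as configured on CSR4. The table layout, the function name, and the sample username are hypothetical and exist only for this illustration.

```python
from typing import Dict, Optional

# One entry per "domain" statement under a request-dialin vpdn-group.
VPDN_GROUPS: Dict[str, Dict[str, str]] = {
    "lab.local": {
        "group": "LAC",
        "initiate_to": "10.45.10.10",
        "local_name": "LAC45",
    },
}


def match_vpdn_group(username: str,
                     groups: Dict[str, Dict[str, str]] = VPDN_GROUPS
                     ) -> Optional[Dict[str, str]]:
    """Return the tunnel parameters for a user, or None if no domain matches."""
    _, separator, domain = username.rpartition("@")
    if not separator:
        return None  # no domain part, so nothing to match on
    return groups.get(domain.lower())


# Hypothetical subscriber name used only to exercise the lookup.
print(match_vpdn_group("subscriber@lab.local"))
print(match_vpdn_group("subscriber"))
```

A user without a domain, or with a domain that no vpdn-group references, is not forwarded in this model.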

Last, we configure the LNS. AAA must be enabled or else authentication will fail, but we can simply use the local database. Usernames are manually configured for R6 and R7, which must include the domain-name as well since that is part of the PAP hostname string. The LACs requested dial-in, and the LNS accepts dial-in. It will terminate any L2TP tunnel from devices with hostname “LAC45”, which is both CSR4 and CSR5. Like the BBA object, the VPDN-group on the LNS will reference a virtual-template. This is configured just like CSR8 and CSR9 in terms of PPP options and protocols. Here, we can configure IPCP and IPV6CP to enable IPv4/v6 reachability to the CPEs. PAP authentication is also enabled on this interface. The local pools and other unrelated objects are not shown.

! CSR10
aaa new-model
aaa authentication login default none
aaa authentication ppp default local
username [email protected] password 0 R6
username [email protected] password 0 R7
vpdn enable
vpdn-group LNS
 accept-dialin
  protocol l2tp
  virtual-template 10
 terminate-from hostname LAC45
 l2tp tunnel password 0 L2TP_AUTH
interface Virtual-Template10
 mtu 1492
 ip unnumbered Loopback209
 peer default ip address pool LNS_POOL
 ipv6 enable
 no ipv6 nd ra suppress
 ipv6 nd ra interval 30
 ipv6 dhcp server DHCPV6_PD
 ppp authentication pap callin

With CSR7 disabled for now, we will debug the PPPoE, PPP, L2TP, and VPDN activities as necessary on CSR6, CSR4, and CSR10. First, the PPPoE exchange happens between the CPE and the LAC, and is limited to only those nodes. The LNS is unaware of what flavor of PPP is used on the access link. The PPPoE basic exchange is shown below.

R6#debug pppoe event
R6#debug ppp negotiation
02:22:07.093: Sending PADI: Interface = GigabitEthernet2.546
02:22:07.096: PPPoE 0: I PADO R:0050.56a9.2c57 L:0050.56a9.de0d 3546 Gi2.546
02:22:09.141: PPPOE: we've got our pado and the pado timer went off
02:22:09.141: OUT PADR from PPPoE Session
02:22:09.144: PPPoE 24: I PADS R:0050.56a9.2c57 L:0050.56a9.de0d 3546 Gi2.546
02:22:09.144: IN PADS from PPPoE Session
02:22:09.149: PPPoE: Virtual Access interface obtained.
02:22:09.149: PPPoE : encap string prepared

R4#debug pppoe event
R4#debug ppp negotiation
R4#debug l2tp brief
R4#debug vpdn event
02:22:06.423: PPPoE 0: I PADI R:0050.56a9.de0d L:ffff.ffff.ffff 3546 Gi2.546
02:22:06.424: Service tag: NULL Tag
02:22:06.424: PPPoE 0: O PADO, R:0050.56a9.2c57 L:0050.56a9.de0d 3546 Gi2.546
02:22:06.424: Service tag: NULL Tag
02:22:08.471: PPPoE 0: I PADR R:0050.56a9.de0d L:0050.56a9.2c57 3546 Gi2.546
02:22:08.471: Service tag: NULL Tag
[snip]
02:22:08.471: [30]PPPoE 24: O PADS R:0050.56a9.de0d L:0050.56a9.2c57 Gi2.546

Next, the CPE and LAC begin the LCP negotiation within PPP. This exchange is also limited to the CPE and LAC as shown in the debugs. Two CONFREQ/CONFACK conversations run concurrently, one initiated by each side. So far, this is nothing new.

! CSR6
02:22:09.150: Vi2 PPP: Using dialer call direction
02:22:09.150: Vi2 PPP: Treating connection as a callout
02:22:09.150: Vi2 PPP: Session handle[1000001D] Session id[29]
02:22:09.150: Vi2 LCP: Event[OPEN] State[Initial to Starting]
02:22:09.150: Vi2 PPP: No remote authentication for call-out
02:22:09.150: Vi2 LCP: O CONFREQ [Starting] id 1 len 14
02:22:09.150: Vi2 LCP:    MRU 1492 (0x010405D4)
02:22:09.150: Vi2 LCP:    MagicNumber 0x2F35DBB0 (0x05062F35DBB0)
02:22:09.150: Vi2 LCP: Event[UP] State[Starting to REQsent]
02:22:09.152: Vi2 LCP: I CONFREQ [REQsent] id 1 len 18
02:22:09.153: Vi2 LCP:    MRU 1492 (0x010405D4)
02:22:09.153: Vi2 LCP:    AuthProto PAP (0x0304C023)
02:22:09.153: Vi2 LCP:    MagicNumber 0x2AB61AF0 (0x05062AB61AF0)
02:22:09.153: Vi2 LCP: O CONFACK [REQsent] id 1 len 18
02:22:09.153: Vi2 LCP:    MRU 1492 (0x010405D4)
02:22:09.153: Vi2 LCP:    AuthProto PAP (0x0304C023)
02:22:09.153: Vi2 LCP:    MagicNumber 0x2AB61AF0 (0x05062AB61AF0)
02:22:09.153: Vi2 LCP: Event[Receive ConfReq+] State[REQsent to ACKsent]
02:22:09.153: Vi2 LCP: I CONFACK [ACKsent] id 1 len 14
02:22:09.153: Vi2 LCP:    MRU 1492 (0x010405D4)
02:22:09.153: Vi2 LCP:    MagicNumber 0x2F35DBB0 (0x05062F35DBB0)

! CSR4
02:22:08.471: ppp30 PPP: Using vpn set call direction
02:22:08.471: ppp30 PPP: Treating connection as a callin
02:22:08.471: ppp30 PPP: Session handle[8000001E] Session id[30]
02:22:08.471: ppp30 LCP: Event[OPEN] State[Initial to Starting]
02:22:08.471: ppp30 PPP LCP: Enter passive mode, state[Stopped]
02:22:08.480: ppp30 LCP: I CONFREQ [Stopped] id 1 len 14
02:22:08.480: ppp30 LCP:    MRU 1492 (0x010405D4)
02:22:08.480: ppp30 LCP:    MagicNumber 0x2F35DBB0 (0x05062F35DBB0)
02:22:08.480: ppp30 LCP: O CONFREQ [Stopped] id 1 len 18
02:22:08.480: ppp30 LCP:    MRU 1492 (0x010405D4)
02:22:08.480: ppp30 LCP:    AuthProto PAP (0x0304C023)
02:22:08.480: ppp30 LCP:    MagicNumber 0x2AB61AF0 (0x05062AB61AF0)
02:22:08.481: ppp30 LCP: O CONFACK [Stopped] id 1 len 14
02:22:08.481: ppp30 LCP:    MRU 1492 (0x010405D4)
02:22:08.481: ppp30 LCP:    MagicNumber 0x2F35DBB0 (0x05062F35DBB0)
02:22:08.481: ppp30 LCP: Event[Receive ConfReq+] State[Stopped to ACKsent]
02:22:08.482: ppp30 LCP: I CONFACK [ACKsent] id 1 len 18
02:22:08.482: ppp30 LCP:    MRU 1492 (0x010405D4)
02:22:08.482: ppp30 LCP:    AuthProto PAP (0x0304C023)
02:22:08.482: ppp30 LCP:    MagicNumber 0x2AB61AF0 (0x05062AB61AF0)
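The "len" values in the LCP debugs can be reproduced from the options themselves. The Python sketch below adds the 4-byte LCP header (code, identifier, length) to the length byte of each option taken from the hex strings in the debug output. It is a hand check, assuming the usual type/length/value option layout, and the helper names are hypothetical.

```python
LCP_HEADER_LEN = 4  # code (1 byte) + identifier (1 byte) + length (2 bytes)


def option_bytes(hex_option: str) -> bytes:
    """Convert a debug-style hex string such as '0x010405D4' to raw bytes."""
    text = hex_option[2:] if hex_option.lower().startswith("0x") else hex_option
    raw = bytes.fromhex(text)
    if len(raw) < 2 or raw[1] != len(raw):
        raise ValueError(f"bad option length in {hex_option}")
    return raw


def lcp_packet_length(hex_options) -> int:
    """Total LCP packet length: header plus every option's full length."""
    return LCP_HEADER_LEN + sum(len(option_bytes(o)) for o in hex_options)


def decode_mru(hex_option: str) -> int:
    """Decode an LCP MRU option (type 1, 2-byte value)."""
    raw = option_bytes(hex_option)
    if raw[0] != 1:
        raise ValueError("not an MRU option")
    return int.from_bytes(raw[2:], "big")


# Options copied from the debugs: CSR6 sends MRU + magic number,
# while CSR4 additionally requests PAP authentication.
CSR6_CONFREQ = ["0x010405D4", "0x05062F35DBB0"]
CSR4_CONFREQ = ["0x010405D4", "0x0304C023", "0x05062AB61AF0"]

print(lcp_packet_length(CSR6_CONFREQ))
print(lcp_packet_length(CSR4_CONFREQ))
print(decode_mru("0x010405D4"))
```

The extra 4 bytes in CSR4's CONFREQ are the AuthProto option, which is the only difference between the two requests.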

Next, CSR6 tries to authenticate via PAP. At this point, LCP is open and CSR6 is waiting for a response back from the LNS, so the LAC cannot respond to this. CSR4 claims there is no method list to authenticate this user after the AUTH-REQ is received from CSR6.

! CSR6
02:22:09.153: Vi2 LCP: Event[Receive ConfAck] State[ACKsent to Open]
02:22:09.169: Vi2 PPP: Phase is AUTHENTICATING, by the peer
02:22:09.169: Vi2 PAP: Using hostname from interface PAP
02:22:09.169: Vi2 PAP: Using password from interface PAP
02:22:09.169: Vi2 PAP: O AUTH-REQ id 1 len 20 from "[email protected]"
02:22:09.169: Vi2 LCP: State is Open

! CSR4
02:22:08.482: ppp30 LCP: Event[Receive ConfAck] State[ACKsent to Open]
02:22:08.498: ppp30 PPP: Phase is AUTHENTICATING, by this end
02:22:08.498: ppp30 LCP: State is Open
02:22:08.504: ppp30 PAP: I AUTH-REQ id 1 len 20 from "[email protected]"
02:22:08.504: ppp30 PAP: Authenticating peer [email protected]
02:22:08.504: ppp30 PPP: Phase is FORWARDING, Attempting Forward
02:22:08.504: PPPoE : Method list does not exists

At this point, CSR4 knows that LCP was successful and that it must generate an L2TP tunnel to the LNS to complete the authentication process. The VPDN process matches the LAC group and initiates the tunnel to 10.45.10.10 as user LAC45. The L2TP tunnel comes up shortly thereafter. Most of the L2TP and VPDN debug isn’t very helpful, but the key parts are shown below. CSR4 “forwards” the PPP session into the tunnel after negotiating LCP (to some extent) and forwarding the other PPP protocols (PAP, etc.) onto the LNS.

! CSR4
02:22:08.505: VPDN L2X: ADD class VPDN group LAC ip addr 10.45.10.10 client LAC45 (group LAC)
[snip]
02:22:08.505: [30]PPPoE 24: State LCP_NEGOTIATION Event PPP FORWARDING
02:22:08.505: [30]PPPoE 24: Segment (SSS class): UPDATED
02:22:08.505: [30]PPPoE 24: SSS switch updated
02:22:08.511: L2TP 0001E:080E6:0000F3DE: APP

     10.1.1.1/32      10.5.6.6                               0 24 ?
 *>  10.1.2.0/24      10.5.6.6                               0 24 ?
 *>  10.1.13.0/24     10.5.6.6                               0 24 ?
 *>  10.13.13.13/32   10.5.6.6                               0 24 ?
 *>  10.13.14.0/24    10.5.6.6                               0 24 ?


We repeat the process for the OSPF VPN except I do not specify a next-hop. This technique works because CSR6 has the same IP address for all VPNs and the global table. Since all the VPNs are separate, this is a simple way to provision transit links for option A. It also means a simpler option AB VRF configuration as the next-hop need not be manually specified.

R5#show bgp vpnv4 unicast all neighbors 13.0.0.12 advertised-routes | begin Network
     Network          Next Hop            Metric LocPrf Weight Path
Route Distinguisher: 13:2 (default for vrf OSPF)
 *>  10.2.9.0/24      10.5.6.6                               0 24 ?
 *>  10.9.9.9/32      10.5.6.6                               0 24 ?
Route Distinguisher: 13:3 (default for vrf EIGRP)
 *>  10.1.1.1/32      10.5.6.6                               0 24 ?
 *>  10.1.2.0/24      10.5.6.6                               0 24 ?
 *>  10.1.13.0/24     10.5.6.6                               0 24 ?
 *>  10.13.13.13/32   10.5.6.6                               0 24 ?
 *>  10.13.14.0/24    10.5.6.6                               0 24 ?

I quickly check XRv2 to ensure the routes are learned. One would think that a next-hop-self issue would have occurred, but option AB handles that automatically. Because it knows the data-plane is based on option A, the next-hop is automatically set to the ASBR without us having to manually configure it. This is a nice feature that simplifies configuration.

RP/0/0/CPU0:XRv2#show bgp vpnv4 unicast rd 13:2 | begin Network
   Network            Next Hop            Metric LocPrf Weight Path
Route Distinguisher: 13:2
*>i10.2.9.0/24        13.0.0.5                 0    100      0 24 ?
* i                   13.0.0.11                     100      0 24 ?
*>i10.4.4.4/32        13.0.0.8                 1    100      0 ?
*>i10.4.8.0/24        13.0.0.8                 0    100      0 ?
*>i10.9.9.9/32        13.0.0.5                 0    100      0 24 ?
* i                   13.0.0.11                     100      0 24 ?

After we configured the VRF for inter-as-hybrid support, the ASBRs were able to advertise locally imported routes to iBGP peers. These routes are technically exports from the local VRF rather than passthrough advertisements from the eBGP peer. To prove it, I show the eBGP route to 10.9.9.9/32 on CSR5 and the iBGP route on CSR8. The RT advertised by CSR6/CSR7 was 24:2. CSR5 did not perform any kind of explicit rewrite but was configured to import 24:2. We did not change CSR8’s RT import/export policies at all and it is still able to import this route into VRF OSPF. This is because CSR5 re-exported the route with RT:13:2 per the local RT export policy. In other words, only the ASBRs need to import the remote RTs; the PEs within the AS do not. This also increases option AB scalability over option B in some cases since the PE/RR is unaware of an RT change.

R5#show bgp vpnv4 unicast vrf OSPF 10.9.9.9/32
BGP routing table entry for 13:2:10.9.9.9/32, version 109
Paths: (1 available, best #1, table OSPF)
  Advertised to update-groups:
     3          5
  Refresh Epoch 7
  24, imported path from 24:2:10.9.9.9/32 (global)
    10.5.6.6 (via default) from 10.5.6.6 (24.0.0.6)
      Origin incomplete, localpref 100, valid, external, best
      Extended Community: RT:24:2 OSPF ROUTER ID:10.2.9.2:0
        OSPF RT:0.0.0.0:2:0
      mpls labels in/out 5052/6002
      rx pathid: 0, tx pathid: 0x0

R8#show bgp vpnv4 unicast vrf OSPF 10.9.9.9/32
BGP routing table entry for 13:2:10.9.9.9/32, version 872
Paths: (1 available, best #1, table OSPF)
  Not advertised to any peer
  Refresh Epoch 1
  24
    13.0.0.5 (metric 2) (via default) from 13.0.0.12 (13.0.0.12)
      Origin incomplete, metric 0, localpref 100, valid, internal, best
      Extended Community: RT:13:2 OSPF ROUTER ID:10.2.9.2:0
        OSPF RT:0.0.0.0:2:0
      Originator: 13.0.0.5, Cluster list: 13.0.0.12
      Connector Attribute: count=1
       type 1 len 12 value 13:2:13.0.0.5
      mpls labels in/out nolabel/5052
      rx pathid: 0, tx pathid: 0x0
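The import and re-export behavior can be modeled as two set operations. The Python sketch below is a conceptual model rather than how IOS implements option AB. It assumes CSR5's VRF OSPF imports both 13:2 and 24:2 and exports 13:2 (only the 24:2 import and 13:2 export are stated in the text), and that CSR8 imports only 13:2. The class and function names are hypothetical.

```python
from dataclasses import dataclass
from typing import FrozenSet, Optional


@dataclass(frozen=True)
class VpnRoute:
    rd: str
    prefix: str
    route_targets: FrozenSet[str]


@dataclass(frozen=True)
class Vrf:
    name: str
    rd: str
    import_rts: FrozenSet[str]
    export_rts: FrozenSet[str]


def import_and_reexport(route: VpnRoute, vrf: Vrf) -> Optional[VpnRoute]:
    """Import a VPN route into a VRF, then re-originate it from that VRF.

    The re-exported route carries the VRF's own RD and export RTs, which is
    why downstream PEs never need to know the remote AS's RT values.
    """
    if not (route.route_targets & vrf.import_rts):
        return None  # no matching import RT, so the route is not imported
    return VpnRoute(rd=vrf.rd, prefix=route.prefix, route_targets=vrf.export_rts)


csr5_ospf = Vrf("OSPF", "13:2", frozenset({"13:2", "24:2"}), frozenset({"13:2"}))
csr8_ospf = Vrf("OSPF", "13:2", frozenset({"13:2"}), frozenset({"13:2"}))

from_as24 = VpnRoute("24:2", "10.9.9.9/32", frozenset({"24:2"}))
toward_rr = import_and_reexport(from_as24, csr5_ospf)

print(toward_rr)
print(import_and_reexport(from_as24, csr8_ospf))  # CSR8 alone could not import it
```

The second call returns nothing because CSR8 has no import for 24:2, which is the situation the ASBR's re-export avoids.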

I quickly configure the VRFs to support option AB on CSR6 and CSR7 as well without specifying the next-hop. I temporarily shut down the BGP peer to XRv1 so that we can focus on option AB testing for now (not shown).

! CSR6 and CSR7
vrf definition EIGRP
 address-family ipv4
  inter-as-hybrid
vrf definition OSPF
 address-family ipv4
  inter-as-hybrid

CSR2 learns two routes for 10.4.4.4/32 inside the OSPF VRF: one from CSR6 and one from CSR7. These originate from the eBGP VPNv4 routes learned over the option B-style BGP connection to CSR5. Notice the RT:24:2 on the routes; this is a result of routes with RT:13:2 being imported on CSR6/CSR7 and being re-exported with RT:24:2 by those same ASBRs.

R2#show bgp vpnv4 unicast vrf OSPF 10.4.4.4/32
BGP routing table entry for 24:2:10.4.4.4/32, version 1025
Paths: (2 available, best #2, table OSPF)
  Advertised to update-groups:
     13
  Refresh Epoch 1
  13, (Received from a RR-client)
    24.0.0.7 (metric 20) (via default) from 24.0.0.7 (24.0.0.7)
      Origin incomplete, metric 0, localpref 100, valid, internal
      Extended Community: RT:24:2 OSPF ROUTER ID:10.4.8.8:0
        OSPF RT:0.0.0.0:2:0
      Connector Attribute: count=1
       type 1 len 12 value 13:2:13.0.0.8
      mpls labels in/out nolabel/7023
      rx pathid: 0, tx pathid: 0
  Refresh Epoch 2
  13, (Received from a RR-client)
    24.0.0.6 (metric 20) (via default) from 24.0.0.6 (24.0.0.6)
      Origin incomplete, metric 0, localpref 100, valid, internal, best
      Extended Community: RT:24:2 OSPF ROUTER ID:10.4.8.8:0
        OSPF RT:0.0.0.0:2:0
      Connector Attribute: count=1
       type 1 len 12 value 13:2:13.0.0.8
      mpls labels in/out nolabel/6031
      rx pathid: 0, tx pathid: 0x0

To simplify the initial test, the OSPF backdoor and sham-links are disabled. Both CEs have a route to one another via the MPLS network. This shows that the control-plane is correct, at least within the OSPF VPN.

R9#show ip cef 10.4.4.4
10.4.4.4/32
  nexthop 10.4.9.4 GigabitEthernet2.549

R4#show ip cef 10.9.9.9
10.9.9.9/32
  nexthop 10.4.8.8 GigabitEthernet2.548

We will manually trace the LSP from CSR9 to CSR4. On CSR2, there is a VPNv4 route from CSR6 using label 6031. The route to the VPN next-hop is learned from IS-IS via XRv4, so XRv4’s local LDP label is used. The label stack becomes {94008 6031} and we can quickly confirm this by checking the FIB.

R2#show bgp vpnv4 unicast vrf OSPF 10.4.4.4/32 bestpath
BGP routing table entry for 24:2:10.4.4.4/32, version 1032
Paths: (2 available, best #1, table OSPF)
  Advertised to update-groups:
     13
  Refresh Epoch 2
  13, (Received from a RR-client)
    24.0.0.6 (metric 20) (via default) from 24.0.0.6 (24.0.0.6)
      Origin incomplete, metric 0, localpref 100, valid, internal, best
      Extended Community: RT:24:2 OSPF ROUTER ID:10.4.8.8:0
        OSPF RT:0.0.0.0:2:0
      Connector Attribute: count=1
       type 1 len 12 value 13:2:13.0.0.8
      mpls labels in/out nolabel/6031
      rx pathid: 0, tx pathid: 0x0

R2#show ip route 24.0.0.6
Routing entry for 24.0.0.6/32
  Known via "isis", distance 115, metric 20, type level-2
  Redistributing via isis 24
  Last update from 24.2.14.14 on GigabitEthernet2.524, 2d08h ago
  Routing Descriptor Blocks:
  * 24.2.14.14, from 24.0.0.6, 2d08h ago, via GigabitEthernet2.524
      Route metric is 20, traffic share count is 1

R2#show mpls ldp bindings 24.0.0.6 32 neighbor 24.0.0.14
  lib entry: 24.0.0.6/32, rev 12
        remote binding: lsr: 24.0.0.14:0, label: 94008

R2#show ip cef vrf OSPF 10.4.4.4/32
10.4.4.4/32
  nexthop 24.2.14.14 GigabitEthernet2.524 label 94008 6031

XRv4 is the penultimate hop so the topmost label is removed, exposing label 6031 to CSR6.

RP/0/0/CPU0:XRv4#show mpls forwarding labels 94008
Local  Outgoing    Prefix             Outgoing      Next Hop        Bytes
Label  Label       or ID              Interface                     Switched
------ ----------- ------------------ ------------- --------------- ----------
94008  Pop         24.0.0.6/32        Gi0/0/0/0.564 24.6.14.6       248618

Like an ordinary option A ASBR, CSR6 removes all labels and sends the packet inside of the OSPF VPN towards CSR5. There aren’t many fancy show commands to see option AB in action as most of the inner workings are done behind the scenes. It is similar to global/VRF route and traffic leaking discussed in another chapter.

R6#show mpls forwarding-table labels 6031 detail
Local      Outgoing   Prefix           Bytes Label   Outgoing   Next Hop
Label      Label      or Tunnel Id     Switched      interface
6031       No Label   10.4.4.4/32[V]   0             Gi2.5562   10.5.6.5
        MAC/Encaps=22/22, MRU=1504, Label Stack{}
        005056A9DC63005056A9DE0D81000DE4810000020800
        VPN route: OSPF
        No output feature configured


CSR5 imposes a single VPN label for prefix 10.4.4.4/32 inside VRF OSPF since it is both the ingress LSR and penultimate hop. The label value is 8020; when CSR8 receives packets with this label, it removes all MPLS encapsulation and sends the IP packet to CSR4, the CE.

R5#show ip cef vrf OSPF 10.4.4.4
10.4.4.4/32
  nexthop 13.5.8.8 GigabitEthernet2.558 label 8020

R8#show mpls forwarding-table labels 8020 detail
Local      Outgoing   Prefix           Bytes Label   Outgoing   Next Hop
Label      Label      or Tunnel Id     Switched      interface
8020       No Label   10.4.4.4/32[V]   10308         Gi2.548    10.4.8.4
        MAC/Encaps=18/18, MRU=1504, Label Stack{}
        005056A92C57005056A9FB1C81000DDC0800
        VPN route: OSPF
        No output feature configured

We use traceroute to confirm this path. Just like option A, the traffic over the transit links is raw IP. Option AB is transparent to customers in this way as there is no way to tell whether option A or AB is in use based on the output below. The IPv6 traceroute, for example, is using option A and exhibits identical data-plane behavior. We did not explicitly trace the VPNv6 LSPs but this demonstrates that VPNv6 option A still functions independently from VPNv4 option AB.

R9#traceroute 10.4.4.4 so 10.9.9.9
Type escape sequence to abort.
Tracing the route to 10.4.4.4
VRF info: (vrf in name/id, vrf out name/id)
  1 10.2.9.2 5 msec 4 msec 4 msec
  2 24.2.14.14 [MPLS: Labels 94008/6031 Exp 0] 6 msec 5 msec 7 msec
  3 10.5.6.6 [MPLS: Label 6031 Exp 0] 16 msec 16 msec 15 msec
  4 10.5.6.5 19 msec 11 msec 10 msec
  5 10.4.8.8 [MPLS: Label 8020 Exp 0] 11 msec 20 msec 19 msec
  6 10.4.8.4 20 msec 10 msec 11 msec

R9#traceroute ipv6
Target IPv6 address: ::10:4:4:4
Source address: ::10:9:9:9
[snip]
  1 FD00:10:2:9::2 6 msec 3 msec 5 msec
  2 2024:24:2:14::14 [MPLS: Labels 94008/6015 Exp 0] 7 msec 7 msec 12 msec
  3 FD00:10:5:6::6 [MPLS: Label 6015 Exp 0] 17 msec 17 msec 20 msec
  4 FD00:10:5:6::5 22 msec 15 msec 41 msec
  5 FD00:10:4:8::8 [MPLS: Label 8027 Exp 0] 9 msec 17 msec 16 msec
  6 FD00:10:4:8::4 20 msec 10 msec 15 msec
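The IPv4 label operations traced above can be replayed as a short walk along the path. The Python sketch below models each node as a push, pop, or full disposition, using the label values from the show commands. It is a teaching model of the data-plane, and the operation names and function are hypothetical.

```python
from typing import List, Sequence, Tuple


def apply_operation(stack: List[int], operation: str,
                    labels: Sequence[int] = ()) -> List[int]:
    """Apply one label operation; index 0 is the top of the stack."""
    if operation == "push":
        return list(labels) + stack
    if operation == "pop":
        if not stack:
            raise ValueError("cannot pop an empty label stack")
        return stack[1:]
    if operation == "dispose":
        return []  # "No Label": strip everything and forward raw IP
    raise ValueError(f"unknown operation {operation}")


# (node, operation, labels) for traffic from CSR9 to 10.4.4.4
LSP: List[Tuple[str, str, Sequence[int]]] = [
    ("CSR2", "push", (94008, 6031)),  # LDP label for 24.0.0.6 + VPN label
    ("XRv4", "pop", ()),              # penultimate hop pops the transport label
    ("CSR6", "dispose", ()),          # option A style: raw IP toward CSR5
    ("CSR5", "push", (8020,)),        # VPN label only; CSR8 is directly connected
    ("CSR8", "dispose", ()),          # raw IP toward CSR4
]


def walk(lsp) -> List[Tuple[str, List[int]]]:
    """Return the label stack as it leaves each node."""
    stack: List[int] = []
    result = []
    for node, operation, labels in lsp:
        stack = apply_operation(stack, operation, labels)
        result.append((node, list(stack)))
    return result


for node, stack in walk(LSP):
    print(f"{node}: {stack}")
```

The stack leaving each node is what the next hop reports in the traceroute, so the empty stack after CSR6 corresponds to the unlabeled hop at 10.5.6.5.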


Although there is nothing special required for option AB in particular, we bring up the backdoor link between CSR4 and CSR9, as well as the pair of sham-links from CSR2 to CSR8. Since the sham-link endpoints are ordinary VPNv6 prefixes from the perspective of BGP, they are carried via the traditional option A method between the ASBRs. I quickly verify that the sham-links and OSPFv3 backdoor neighbors are up.

R8#show ospfv3 vrf OSPF sham-links | include ^Sham
Sham Link OSPFv3_SL0 to address FD00::2 is up
Sham Link OSPFv3_SL1 to address FD00::2 is up

R4#show ospfv3 neighbor gig2.549

          OSPFv3 2 address-family ipv4 (router-id 10.4.8.4)

Neighbor ID     Pri   State           Dead Time   Interface ID    Interface
10.4.9.9          0   FULL/  -        00:00:35    26              Gig2.549

          OSPFv3 2 address-family ipv6 (router-id 10.4.8.4)

Neighbor ID     Pri   State           Dead Time   Interface ID    Interface
10.4.9.9          0   FULL/  -        00:00:37    26              Gig2.549

The sham-link signaling and construction was a result of the option A VPNv6 design, but IPv4 forwarding over the sham-link still follows the option AB mechanisms. The ASBRs still have eBGP VPNv4 routes in the global table for inter-AS IPv4 prefixes, which are advertised to the RRs within each AS. We can check this quickly by comparing the VPNv4 and VPNv6 routes to CSR9. The VPNv4 route has a next-hop in the default table (which is overridden by the inter-as-hybrid functionality) while the VPNv6 route has a next-hop in VRF OSPF.

R5#show bgp vpnv4 unicast vrf OSPF 10.9.9.9
BGP routing table entry for 13:2:10.9.9.9/32, version 43
Paths: (1 available, best #1, table OSPF)
  Advertised to update-groups:
     1          2
  Refresh Epoch 1
  24, imported path from 24:2:10.9.9.9/32 (global)
    10.5.7.7 (via default) from 10.5.7.7 (24.0.0.7)
      Origin incomplete, localpref 100, valid, external, best
      Extended Community: RT:24:2 OSPF ROUTER ID:10.2.9.2:0
        OSPF RT:0.0.0.0:2:0
      mpls labels in/out 5006/7005
      rx pathid: 0, tx pathid: 0x0

R5#show bgp vpnv6 unicast vrf OSPF ::10:9:9:9/128
BGP routing table entry for [13:2]::10:9:9:9/128, version 341
Paths: (2 available, best #2, table OSPF)
  Advertised to update-groups:
     1          3
  Refresh Epoch 1
  24
    FD00:10:5:7::7 (FE80::7) (via vrf OSPF) from FD00:10:5:7::7 (24.0.0.7)
      Origin incomplete, localpref 100, valid, external
      Extended Community: RT:13:2 OSPF ROUTER ID:10.2.9.2:0
        OSPF RT:0.0.0.0:2:0
      mpls labels in/out 5012/nolabel
      rx pathid: 0, tx pathid: 0
  Refresh Epoch 1
  24
    FD00:10:5:6::6 (FE80::6) (via vrf OSPF) from FD00:10:5:6::6 (24.0.0.6)
      Origin incomplete, localpref 100, valid, external, best
      Extended Community: RT:13:2 OSPF ROUTER ID:10.2.9.2:0
        OSPF RT:0.0.0.0:2:0
      mpls labels in/out 5012/nolabel
      rx pathid: 0, tx pathid: 0x0

Using traceroute, we can confirm the sham-links are operational. Both IPv4 and IPv6 have identical data-plane characteristics which resemble an option A architecture.

R4#traceroute 10.9.9.9 source 10.4.4.4
Type escape sequence to abort.
Tracing the route to 10.9.9.9
VRF info: (vrf in name/id, vrf out name/id)
  1 10.4.8.8 5 msec 3 msec 4 msec
  2 10.5.7.5 [MPLS: Label 5006 Exp 0] 4 msec 4 msec 4 msec
  3 10.5.7.7 6 msec 8 msec 10 msec
  4 24.7.14.14 [MPLS: Labels 94009/2017 Exp 0] 12 msec 19 msec 18 msec
  5 10.2.9.2 [MPLS: Label 2017 Exp 0] 19 msec 72 msec 14 msec
  6 10.2.9.9 16 msec 9 msec 8 msec

R4#traceroute ipv6
Target IPv6 address: ::10:9:9:9
Source address: ::10:4:4:4
[snip]
Tracing the route to ::10:9:9:9
  1 FD00:10:4:8::8 5 msec 3 msec 3 msec
  2 FD00:10:5:6::5 [MPLS: Label 5012 Exp 0] 4 msec 4 msec 4 msec
  3 FD00:10:5:6::6 23 msec 14 msec 15 msec
  4 ::FFFF:24.6.7.7 [MPLS: Labels 7004/2024 Exp 0] 17 msec 23 msec 21 msec
  5 FD00:10:2:9::2 [MPLS: Label 2024 Exp 0] 28 msec 22 msec 21 msec
  6 FD00:10:2:9::9 23 msec 17 msec 13 msec

To show the seamless interworking with existing option A architectures, I bring up the VRF-aware IPv4/v6 connections between CSR6 and XRv1; these were previously shut down on CSR6 (removal of “shutdown” not shown). Using outbound RPL on XRv1, I increase the MED for routes advertised from AS 13 to AS 24. This will “hint” to AS 24 that XRv1 is a suboptimal point of ingress, so AS 24 will prefer CSR5. This is so that the option AB links are preferred over the option A links.

! XRv1
route-policy RPL_SET_MED($MED)
  set med $MED
end-policy
router bgp 13
 vrf OSPF
  neighbor 10.6.11.6
   remote-as 24
   address-family ipv4 unicast
    route-policy RPL_SET_MED(11) out
  neighbor fd00:10:6:11::6
   address-family ipv6 unicast
    route-policy RPL_SET_MED(11) out

Checking CSR6 for the VPNv4 route to 10.4.4.4/32, we can see the MED value carried from XRv1. The eBGP route from CSR5 is preferred as a result of having a lower MED. So far, this has no impact on the traffic pattern since this was CSR6’s only option before. Also, we can see the next-hops are in different tables according to BGP; this is an indication that a combination of options A and AB is in use on this particular ASBR.

R6#show bgp vpnv4 unicast vrf OSPF 10.4.4.4/32
BGP routing table entry for 24:2:10.4.4.4/32, version 28
Paths: (2 available, best #2, table OSPF)
  Advertised to update-groups:
     3          1
  Refresh Epoch 1
  13
    10.6.11.11 (via vrf OSPF) from 10.6.11.11 (13.0.0.11)
      Origin incomplete, metric 11, localpref 100, valid, external
      Extended Community: RT:24:2 OSPF ROUTER ID:10.4.8.8:0 OSPF RT:0.0.0.0:2:0
      mpls labels in/out 6027/nolabel
      rx pathid: 0, tx pathid: 0
  Refresh Epoch 1
  13, imported path from 13:2:10.4.4.4/32 (global)
    10.5.6.5 (via default) from 10.5.6.5 (13.0.0.5)
      Origin incomplete, localpref 100, valid, external, best
      Extended Community: RT:13:2 OSPF ROUTER ID:10.4.8.8:0 OSPF RT:0.0.0.0:2:0
      Connector Attribute: count=1 type 1 len 12 value 13:2:13.0.0.8
      mpls labels in/out 6027/5019
      rx pathid: 0, tx pathid: 0x0


To test XRv1 as a backup, I shut down the VPNv4 eBGP session from CSR6 to CSR5. Now, CSR7 is preferred since the MED of the iBGP-learned route is still lower than that of the eBGP route. CSR9 will be routing to CSR5 via CSR7 now. XRv1 is still not used, which is the correct behavior. The current configuration makes XRv1 a backup for CSR5 entirely.

R6#show bgp vpnv4 unicast vrf OSPF 10.4.4.4/32
BGP routing table entry for 24:2:10.4.4.4/32, version 87
Paths: (2 available, best #1, table OSPF)
  Advertised to update-groups:
     3
  Refresh Epoch 2
  13
    24.0.0.7 (metric 10) (via default) from 24.0.0.2 (24.0.0.2)
      Origin incomplete, metric 0, localpref 100, valid, internal, best
      Extended Community: RT:24:2 OSPF ROUTER ID:10.4.8.8:0 OSPF RT:0.0.0.0:2:0
      Originator: 24.0.0.7, Cluster list: 24.0.0.2
      Connector Attribute: count=1 type 1 len 12 value 13:2:13.0.0.8
      mpls labels in/out 6027/7018
      rx pathid: 0, tx pathid: 0x0
  Refresh Epoch 1
  13
    10.6.11.11 (via vrf OSPF) from 10.6.11.11 (13.0.0.11)
      Origin incomplete, metric 11, localpref 100, valid, external
      Extended Community: RT:24:2 OSPF ROUTER ID:10.4.8.8:0 OSPF RT:0.0.0.0:2:0
      mpls labels in/out 6027/nolabel
      rx pathid: 0, tx pathid: 0
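The path selection observed on CSR6 can be sketched as a comparison of a few best-path steps. This is a simplified model under stated assumptions, not the full Cisco algorithm: it covers only local preference, AS-path length, MED, and the eBGP-over-iBGP preference; it assumes a missing MED is treated as 0; and it assumes all paths come from the same neighboring AS (AS 13), so MED is comparable. The `Path` class and its field names are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Path:
    name: str
    local_pref: int
    as_path_len: int
    med: int      # a missing MED is modeled as 0 (assumption)
    ebgp: bool


def best_path(paths):
    # The smallest tuple wins: higher local-pref, shorter AS path,
    # lower MED, and only then eBGP preferred over iBGP.
    return min(
        paths,
        key=lambda p: (-p.local_pref, p.as_path_len, p.med, not p.ebgp),
    )


via_csr5 = Path("eBGP via CSR5", 100, 1, 0, True)
via_csr7 = Path("iBGP via CSR7", 100, 1, 0, False)
via_xrv1 = Path("eBGP via XRv1", 100, 1, 11, True)

print(best_path([via_csr5, via_xrv1]).name)  # all sessions up
print(best_path([via_csr7, via_xrv1]).name)  # CSR6-to-CSR5 session down
print(best_path([via_xrv1]).name)            # only XRv1 remains
```

Because MED is evaluated before the eBGP-over-iBGP step, the iBGP path through CSR7 (MED 0) beats the directly connected eBGP path through XRv1 (MED 11), which matches the output above.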

Using traceroute, we can confirm the new traffic pattern.

R9#traceroute 10.4.4.4 source 10.9.9.9
Type escape sequence to abort.
Tracing the route to 10.4.4.4
VRF info: (vrf in name/id, vrf out name/id)
  1 10.2.9.2 6 msec 4 msec 5 msec
  2 24.2.14.14 [MPLS: Labels 94010/7018 Exp 0] 6 msec 7 msec 5 msec
  3 10.5.7.7 [MPLS: Label 7018 Exp 0] 16 msec 16 msec 15 msec
  4 10.5.7.5 19 msec 11 msec 10 msec
  5 10.4.8.8 [MPLS: Label 8020 Exp 0] 11 msec 20 msec 19 msec
  6 10.4.8.4 21 msec 11 msec 10 msec

I also shut down CSR7’s VPNv4 eBGP session to CSR5, which leaves XRv1 as the only remaining option. CSR6 reflects this accurately, and traceroute proves that the data plane flow is correct.


R6#show bgp vpnv4 unicast vrf OSPF 10.4.4.4/32
BGP routing table entry for 24:2:10.4.4.4/32, version 95
Paths: (1 available, best #1, table OSPF)
  Advertised to update-groups:
     1
  Refresh Epoch 1
  13
    10.6.11.11 (via vrf OSPF) from 10.6.11.11 (13.0.0.11)
      Origin incomplete, metric 11, localpref 100, valid, external, best
      Extended Community: RT:24:2 OSPF ROUTER ID:10.4.8.8:0 OSPF RT:0.0.0.0:2:0
      mpls labels in/out 6027/nolabel
      rx pathid: 0, tx pathid: 0x0

R9#traceroute 10.4.4.4 source 10.9.9.9
Type escape sequence to abort.
Tracing the route to 10.4.4.4
VRF info: (vrf in name/id, vrf out name/id)
  1 10.2.9.2 6 msec 4 msec 3 msec
  2 24.2.14.14 [MPLS: Labels 94008/6027 Exp 0] 5 msec 5 msec 8 msec
  3 10.6.11.6 [MPLS: Label 6027 Exp 0] 16 msec 16 msec 14 msec
  4 10.6.11.11 21 msec 13 msec 73 msec
  5 10.4.8.8 [MPLS: Label 8020 Exp 0] 11 msec 15 msec 15 msec
  6 10.4.8.4 15 msec 12 msec 11 msec

Before continuing, I restore all BGP sessions. Now that we have tested the OSPF VPN connectivity, I quickly check the EIGRP VPN. Both XRv3 and CSR1 are able to reach CSR3 without issue. For brevity, I do not trace the control plane signaling as it is identical in concept to the OSPF VPN.

R1#traceroute 10.3.3.3 source 10.1.1.1
Type escape sequence to abort.
Tracing the route to 10.3.3.3
VRF info: (vrf in name/id, vrf out name/id)
  1 10.1.2.2 5 msec 4 msec 4 msec
  2 24.2.14.14 [MPLS: Labels 94008/6026 Exp 0] 4 msec 5 msec 5 msec
  3 10.5.6.6 5 msec 8 msec 11 msec
  4 10.5.6.5 11 msec 10 msec 10 msec
  5 13.5.8.8 [MPLS: Labels 8010/92006 Exp 0] 16 msec 30 msec 19 msec
  6 13.8.12.12 [MPLS: Label 92006 Exp 0] 20 msec 19 msec 19 msec
  7 10.3.12.3 21 msec 11 msec 15 msec

RP/0/0/CPU0:XRv3#traceroute ::10:3:3:3 source ::10:13:13:13
Type escape sequence to abort.
Tracing the route to ::10:3:3:3
 1 fd00:10:13:14::14 0 msec 0 msec 0 msec
 2 fd00:10:5:6::6 [MPLS: Label 6019 Exp 0] 69 msec 0 msec 0 msec
 3 fd00:10:5:6::5 29 msec 19 msec 9 msec
 4 ::ffff:13.5.8.8 [MPLS: Labels 8010/92008 Exp 0] 9 msec 9 msec 0 msec
 5 2013:13:8:12::12 [MPLS: Label 92008 Exp 0] 9 msec 0 msec 0 msec
 6 fd00:10:3:12::3 0 msec 0 msec 0 msec

We also test central services to ensure it works over option AB. Since the VPNv4 routes advertised by the ASBRs (CSR6 and CSR7) have the local export RTs, not the eBGP-learned ones, we expect these to be accepted within an AS. CSR2 learns the VPN route from both CSR6 and CSR7. CSR6 is the best path but both VPN routes carry RT:24:3, which was exported from the ASBRs, not rewritten by BGP route-maps.

R2#show bgp vpnv4 unicast vrf EIGRP 110.0.0.3/32
BGP routing table entry for 24:3:110.0.0.3/32, version 1216
Paths: (2 available, best #1, table EIGRP)
  Advertised to update-groups:
     13
  Refresh Epoch 1
  13 100, (Received from a RR-client)
    24.0.0.6 (metric 20) (via default) from 24.0.0.6 (24.0.0.6)
      Origin incomplete, metric 0, localpref 100, valid, internal, best
      Extended Community: RT:24:3
      Connector Attribute: count=1 type 1 len 12 value 13:3:13.0.0.5
      mpls labels in/out nolabel/6043
      rx pathid: 0, tx pathid: 0x0
  Refresh Epoch 1
  13 100, (Received from a RR-client)
    24.0.0.7 (metric 20) (via default) from 24.0.0.7 (24.0.0.7)
      Origin incomplete, metric 0, localpref 100, valid, internal
      Extended Community: RT:24:3
      Connector Attribute: count=1 type 1 len 12 value 13:3:13.0.0.5
      mpls labels in/out nolabel/7038
      rx pathid: 0, tx pathid: 0

Option AB central services is fully operational as expected.

R1#traceroute 110.0.0.3 source 10.1.1.1
Type escape sequence to abort.
Tracing the route to 110.0.0.3
VRF info: (vrf in name/id, vrf out name/id)
  1 10.1.2.2 2 msec 4 msec 4 msec
  2 24.2.14.14 [MPLS: Labels 94008/6043 Exp 0] 5 msec 6 msec 7 msec
  3 10.5.6.6 [MPLS: Label 6043 Exp 0] 16 msec 15 msec 16 msec
  4 10.5.6.5 20 msec 10 msec 11 msec
  5 10.8.10.8 [MPLS: Label 8019 Exp 0] 10 msec 20 msec 19 msec
  6 10.8.10.10 20 msec 10 msec 11 msec

RP/0/0/CPU0:XRv3#traceroute ::110:0:0:3 source ::10:13:13:13
Type escape sequence to abort.
Tracing the route to ::110:0:0:3
 1 fd00:10:13:14::14 0 msec 0 msec 0 msec
 2 fd00:10:5:6::6 [MPLS: Label 6041 Exp 0] 0 msec 0 msec 0 msec
 3 fd00:10:5:6::5 0 msec 19 msec 119 msec
 4 fd00:10:8:10::8 [MPLS: Label 8026 Exp 0] 29 msec 9 msec 0 msec
 5 fd00:10:8:10::10 9 msec 0 msec 0 msec

Next, we will look at some minor variations of option AB. The design tested above is an “easy” one because the VPN next-hops and global next-hops are the same at the ASBRs. For example, CSR5 has a global eBGP VPNv4 session with 10.5.6.6, and this is also the VPN next-hop for all VRFs. This simplifies the configuration as manual option AB next-hops need not be specified. To better see the consequence of having different next-hops, I change the IPv4 address of CSR5’s transit link towards CSR6 inside both VRFs.

! CSR5
interface GigabitEthernet2.5562
 ip address 10.5.6.52 255.255.255.0
interface GigabitEthernet2.5563
 ip address 10.5.6.53 255.255.255.0

The VPN next-hop is still 10.5.6.5, and since the VRF did not specify a different next-hop, CSR6 still thinks 10.5.6.5 is accessible within the VPN. CSR6 continuously tries to ARP for this IPv4 next-hop as one would expect, but it does not exist.

R6#show ip route vrf OSPF 10.4.4.4
Routing Table: OSPF
Routing entry for 10.4.4.4/32
  Known via "bgp 24", distance 20, metric 0
  Tag 13, type external
  Last update from 10.5.6.5 00:16:03 ago
  Routing Descriptor Blocks:
  * 10.5.6.5, from 10.5.6.5, 00:16:03 ago
      Route metric is 0, traffic share count is 1
      AS Hops 1
      Route tag 13
      MPLS label: none

R6#show ip arp vrf OSPF 10.5.6.5
Protocol  Address          Age (min)  Hardware Addr   Type   Interface
Internet  10.5.6.5                0   Incomplete      ARPA

CSR2, the ingress LER, still prefers CSR6 as the VPN next-hop. This will blackhole all traffic in the OSPF VPN.

R2#show ip cef vrf OSPF 10.4.4.4 detail
10.4.4.4/32, epoch 0, flags [rib defined all labels]
  recursive via 24.0.0.6 label 6046
    nexthop 24.2.14.14 GigabitEthernet2.524 label 94008

R9#traceroute 10.4.4.4 source 10.9.9.9
Type escape sequence to abort.
Tracing the route to 10.4.4.4
VRF info: (vrf in name/id, vrf out name/id)
  1 10.2.9.2 4 msec 3 msec 4 msec
  2 * * *

The same is true for EIGRP as we changed that link address as well. CSR1 no longer has access to central service resources.

R2#show ip cef vrf EIGRP 110.0.0.3 detail
110.0.0.3/32, epoch 0, flags [rib defined all labels]
  recursive via 24.0.0.6 label 6057
    nexthop 24.2.14.14 GigabitEthernet2.524 label 94008

R1#traceroute 110.0.0.3 source 10.1.1.1
Type escape sequence to abort.
Tracing the route to 110.0.0.3
VRF info: (vrf in name/id, vrf out name/id)
  1 10.1.2.2 5 msec 3 msec 1 msec
  2 *

To fix it, we can use the same solution we configured initially on CSR5 as a demonstration. Under the VRFs on CSR6, I manually specify the VRF next-hops.

! CSR6
vrf definition EIGRP
 address-family ipv4
  inter-as-hybrid next-hop 10.5.6.53
vrf definition OSPF
 address-family ipv4
  inter-as-hybrid next-hop 10.5.6.52

Checking VPNv4, it is totally unaware of this change, at least from a VPN next-hop perspective. It still shows 10.5.6.5 as the VPN next-hop in the global table. Option AB overrides this and adjusts the next-hop in the VRF routing table to the value specified above.

R6#show bgp vpnv4 unicast vrf OSPF 10.4.4.4/32 bestpath
BGP routing table entry for 24:2:10.4.4.4/32, version 33
Paths: (2 available, best #1, table OSPF)
  Advertised to update-groups:
     7          6
  Refresh Epoch 2
  13, imported path from 13:2:10.4.4.4/32 (global)
    10.5.6.5 (via default) from 10.5.6.5 (13.0.0.5)
      Origin incomplete, localpref 100, valid, external, best
      Extended Community: RT:13:2 OSPF ROUTER ID:10.4.8.8:0 OSPF RT:0.0.0.0:2:0
      Connector Attribute: count=1 type 1 len 12 value 13:2:13.0.0.8
      mpls labels in/out 6046/5019
      rx pathid: 0, tx pathid: 0x0

R6#show ip route vrf OSPF 10.4.4.4
Routing Table: OSPF
Routing entry for 10.4.4.4/32
  Known via "bgp 24", distance 20, metric 0
  Tag 13, type external
  Last update from 10.5.6.52 00:01:41 ago
  Routing Descriptor Blocks:
  * 10.5.6.52, from 10.5.6.5, 00:01:41 ago
      Route metric is 0, traffic share count is 1
      AS Hops 1
      Route tag 13
      MPLS label: none
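The effect of the manual next-hop can be sketched as a small resolution step: the VRF RIB uses the configured hybrid next-hop when one exists and otherwise falls back to the BGP next-hop, and ARP only completes if that address actually exists on the peer inside the VRF. This is an illustrative model only; the `resolve` function and the `PEER_VRF_ADDRESSES` table are hypothetical and simply restate the addresses used in this lab.

```python
# Addresses CSR5 owns on the VRF transit links after the re-addressing
# (taken from the lab configuration; the table itself is hypothetical).
PEER_VRF_ADDRESSES = {
    "OSPF": {"10.5.6.52"},
    "EIGRP": {"10.5.6.53"},
}


def resolve(vrf, bgp_next_hop, configured_next_hop=None):
    """Return (next hop placed in the VRF RIB, whether ARP can complete)."""
    next_hop = configured_next_hop or bgp_next_hop
    return next_hop, next_hop in PEER_VRF_ADDRESSES[vrf]


# Without a per-VRF next-hop, the global eBGP peer address is reused.
print(resolve("OSPF", "10.5.6.5"))
# With the per-VRF next-hop configured, the route becomes usable.
print(resolve("OSPF", "10.5.6.5", "10.5.6.52"))
print(resolve("EIGRP", "10.5.6.5", "10.5.6.53"))
```

The BGP table keeps showing 10.5.6.5 in both cases; only the next-hop handed to the VRF RIB changes.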

Now, CSR6 can ARP for this IPv4 next-hop to bind a MAC address for encapsulation. CSR9’s VPN connectivity to CSR4 and central services has been restored. The output is identical to earlier examples except the untagged inter-AS traffic transits through 10.5.6.52 rather than 10.5.6.5.

R6#show ip arp vrf OSPF 10.5.6.52
Protocol  Address          Age (min)  Hardware Addr   Type   Interface
Internet  10.5.6.52               9   0050.56a9.dc63  ARPA   Gig2.5562

R9#traceroute 10.4.4.4 source 10.9.9.9
Type escape sequence to abort.
Tracing the route to 10.4.4.4
VRF info: (vrf in name/id, vrf out name/id)
  1 10.2.9.2 6 msec 3 msec 4 msec
  2 24.2.14.14 [MPLS: Labels 94008/6046 Exp 0] 7 msec 7 msec 5 msec
  3 10.5.6.6 [MPLS: Label 6046 Exp 0] 5 msec 13 msec 20 msec
  4 10.5.6.52 20 msec 10 msec 10 msec
  5 10.4.8.8 [MPLS: Label 8020 Exp 0] 11 msec 19 msec 20 msec
  6 10.4.8.4 19 msec 11 msec 9 msec

R9#traceroute 110.0.0.0 source 10.9.9.9
Type escape sequence to abort.
Tracing the route to 110.0.0.0
VRF info: (vrf in name/id, vrf out name/id)
  1 10.2.9.2 4 msec 5 msec 4 msec
  2 24.2.14.14 [MPLS: Labels 94008/6048 Exp 0] 8 msec 7 msec 7 msec
  3 10.5.6.6 [MPLS: Label 6048 Exp 0] 6 msec 8 msec 20 msec
  4 10.5.6.52 20 msec 11 msec 10 msec
  5 10.8.10.8 [MPLS: Label 8016 Exp 0] 11 msec 19 msec 19 msec
  6 10.8.10.10 18 msec 11 msec 11 msec

The same is true for EIGRP. On CSR6, the VPN route to CSR3 has a global next-hop of 10.5.6.5, which was the eBGP VPNv4 peer address. The option AB hybrid configuration overrides this to correct the VRF-aware RIB, and ultimately the VRF-aware FIB. CSR6 is able to ARP for the next-hop’s MAC address once the VRF-aware route is updated with the manually-configured next-hop.

R6#show bgp vpnv4 unicast vrf EIGRP 10.3.3.3/32 bestpath
BGP routing table entry for 24:3:10.3.3.3/32, version 39
Paths: (1 available, best #1, table EIGRP)
  Advertised to update-groups:
     6
  Refresh Epoch 2
  13, imported path from 13:3:10.3.3.3/32 (global)
    10.5.6.5 (via default) from 10.5.6.5 (13.0.0.5)
      Origin incomplete, localpref 100, valid, external, best
      Extended Community: RT:13:3 0x8800:32768:0 0x8801:3:288 0x8802:65281:2560 0x8803:1:1500 0x8806:0:167971843
      Connector Attribute: count=1 type 1 len 12 value 13:3:13.0.0.12
      mpls labels in/out 6052/5020
      rx pathid: 0, tx pathid: 0x0

R6#show ip route vrf EIGRP 10.3.3.3
Routing Table: EIGRP
Routing entry for 10.3.3.3/32
  Known via "bgp 24", distance 20, metric 0
  Tag 13, type external
  Last update from 10.5.6.53 00:09:08 ago
  Routing Descriptor Blocks:
  * 10.5.6.53, from 10.5.6.5, 00:09:08 ago
      Route metric is 0, traffic share count is 1
      AS Hops 1
      Route tag 13
      MPLS label: none

R6#show ip arp vrf EIGRP 10.5.6.53
Protocol  Address          Age (min)  Hardware Addr   Type   Interface
Internet  10.5.6.53              15   0050.56a9.dc63  ARPA   Gig2.5563

A set of traceroutes confirms that EIGRP customers have VPN and central services reachability.

RP/0/0/CPU0:XRv3#traceroute 10.3.3.3 source 10.13.13.13
Type escape sequence to abort.
Tracing the route to 10.3.3.3
 1 10.13.14.14 0 msec 0 msec 0 msec
 2 10.5.6.6 [MPLS: Label 6052 Exp 0] 0 msec 0 msec 0 msec
 3 10.5.6.53 0 msec 0 msec 0 msec
 4 13.5.8.8 [MPLS: Labels 8010/92006 Exp 0] 0 msec 0 msec 0 msec
 5 13.8.12.12 [MPLS: Label 92006 Exp 0] 0 msec 0 msec 0 msec
 6 10.3.12.3 9 msec 0 msec 0 msec

RP/0/0/CPU0:XRv3#traceroute 110.0.0.2 source 10.13.13.13
Type escape sequence to abort.
Tracing the route to 110.0.0.2
 1 10.13.14.14 9 msec 0 msec 0 msec
 2 10.5.6.6 [MPLS: Label 6056 Exp 0] 0 msec 0 msec 0 msec
 3 10.5.6.53 0 msec 0 msec 0 msec
 4 10.8.10.8 [MPLS: Label 8018 Exp 0] 0 msec 0 msec 0 msec
 5 10.8.10.10 0 msec 0 msec 0 msec

Last, we can test the CSC extension to option AB. This is generally useful when the CSC core carrier is actually two carriers. For example, if AS 13 and AS 24 were the conglomerate “core carrier” while CSR4 and CSR9 represented customer carriers, the traffic would need to remain MPLS encapsulated end-to-end. The CSC feature with option AB supports this and is very easy to configure; it is supported in conjunction with an explicit or implicit next-hop as shown below. I enable CSC for VRF OSPF only, so that VRF EIGRP will remain only IP-encapsulated over the transit link.

! CSR6
vrf definition OSPF
 address-family ipv4
  inter-as-hybrid csc next-hop 10.5.6.52
! CSR5 and CSR7
vrf definition OSPF
 address-family ipv4
  inter-as-hybrid csc

The only difference in behavior is that the MPLS labels exchanged with the eBGP VPN routes are swapped at the ASBRs, much like option B. On CSR6, the next-hop in the BGP route still shows the original 10.5.6.5, since the manually-configured next-hop overrides it only in the VRF-aware RIB. However, VPN label swapping occurs like option B.

R6#show bgp vpnv4 unicast vrf OSPF 10.4.4.4 bestpath
BGP routing table entry for 24:2:10.4.4.4/32, version 33
Paths: (2 available, best #1, table OSPF)
  Advertised to update-groups:
     7          6
  Refresh Epoch 2
  13, imported path from 13:2:10.4.4.4/32 (global)
    10.5.6.5 (via default) from 10.5.6.5 (13.0.0.5)
      Origin incomplete, localpref 100, valid, external, best
      Extended Community: RT:13:2 OSPF ROUTER ID:10.4.8.8:0 OSPF RT:0.0.0.0:2:0
      Connector Attribute: count=1 type 1 len 12 value 13:2:13.0.0.8
      mpls labels in/out 6046/5019
      rx pathid: 0, tx pathid: 0x0

The RIB indicates an outgoing MPLS label of 5019, which is the VPN label carried in the eBGP route seen above. Despite this label always being exchanged, it was not delivered to the RIB/FIB for encapsulation when the CSC option was disabled.

R6#show ip route vrf OSPF 10.4.4.4
Routing Table: OSPF
Routing entry for 10.4.4.4/32
  Known via "bgp 24", distance 20, metric 0
  Tag 13, type external
  Last update from 10.5.6.52 00:01:43 ago
  Routing Descriptor Blocks:
  * 10.5.6.52, from 10.5.6.5, 00:01:43 ago
      Route metric is 0, traffic share count is 1
      AS Hops 1
      Route tag 13
      MPLS label: 5019

R6#show ip cef vrf OSPF 10.4.4.4
10.4.4.4/32
  nexthop 10.5.6.52 GigabitEthernet2.5562 label 5019

If we examine the VPNv4 debugging on CSR6, we can see that all of the prefixes learned from CSR5 have the same label value within each VPN. Even though we didn’t enable CSC support on VRF EIGRP, the label aggregation occurs for that VPN also. We can clearly see that CSC is not enabled for VRF EIGRP.

! CSR5
R5#show vrf detail | include Hybrid|^VRF
VRF EIGRP (VRF Id = 1); default RD 13:3; default VPNID
  Inter AS Hybrid mode configured, next-hop 10.5.6.6
VRF OSPF (VRF Id = 2); default RD 13:2; default VPNID
  Inter AS Hybrid mode configured, with CSC

This is a per-VRF label that is advertised inside of the VPNv4 update. Label aggregation is discussed in the multi-VRF CE chapter, but in summary, this is used in contrast to the per-prefix label. A single label represents all VPN routes inside of a VRF, which conserves labels. The downside of this approach is that the payload of the MPLS packet must be consulted for a routing decision. CSR5’s LFIB proves this; when packets arrive with label 5019, the label is removed and the OSPF RIB is consulted.

R6#debug bgp vpnv4 unicast updates 10.5.6.5 in
BGP updates debugging is on for neighbor 10.5.6.5 (inbound) for address family: VPNv4 Unicast
BGP(4): 10.5.6.5 rcvd 13:2:110.0.0.1/32, label 5019
BGP(4): 10.5.6.5 rcvd UPDATE w/ attr: nexthop 10.5.6.5, origin ?, merged path 13 100, AS_PATH , extended community RT:13:2
BGP(4): 10.5.6.5 rcvd UPDATE w/ attr: nexthop 10.5.6.5, origin ?, merged path 13, AS_PATH , extended community RT:13:2 OSPF ROUTER ID:10.4.8.8:0 OSPF RT:0.0.0.0:2:0
BGP(4): 10.5.6.5 rcvd 13:2:10.4.4.4/32, label 5019
BGP(4): 10.5.6.5 rcvd UPDATE w/ attr: nexthop 10.5.6.5, origin ?, merged path 13, AS_PATH , extended community RT:13:2 OSPF ROUTER ID:10.4.8.8:0 OSPF RT:0.0.0.0:2:0
BGP(4): 10.5.6.5 rcvd 13:2:10.4.8.0/24, label 5019

R5#show mpls forwarding-table labels 5019
Local      Outgoing   Prefix           Bytes Label   Outgoing   Next Hop
Label      Label      or Tunnel Id     Switched      interface
5019       Pop Label  IPv4 VRF[V]      6874          aggregate/OSPF

CSR5 allocates label 5019 for VRF OSPF and 5020 for VRF EIGRP. We quickly check the VPNv4 label bindings on CSR5 to verify this. CSR6 also allocates aggregated labels per VPN for the return traffic.

R5#show bgp vpnv4 unicast vrf OSPF labels
   Network          Next Hop      In label/Out label
Route Distinguisher: 13:2 (OSPF)
   10.2.9.0/24      10.5.7.7        IPv4 VRF Aggr:5019/7005
   10.4.4.4/32      13.0.0.8        IPv4 VRF Aggr:5019/8020
   10.4.8.0/24      13.0.0.8        IPv4 VRF Aggr:5019/8021
   10.9.9.9/32      10.5.7.7        IPv4 VRF Aggr:5019/7005
   110.0.0.0/32     13.0.0.8        IPv4 VRF Aggr:5019/8016
   110.0.0.1/32     13.0.0.8        IPv4 VRF Aggr:5019/8017
   110.0.0.2/32     13.0.0.8        IPv4 VRF Aggr:5019/8018
   110.0.0.3/32     13.0.0.8        IPv4 VRF Aggr:5019/8019

R5#show bgp vpnv4 unicast vrf EIGRP labels
   Network          Next Hop      In label/Out label
Route Distinguisher: 13:3 (EIGRP)
   10.1.1.1/32      10.5.7.7        5016/7006
   10.1.2.0/24      10.5.7.7        5026/7006
   10.1.13.0/24     10.5.7.7        5027/7006
   10.3.3.3/32      13.0.0.12       IPv4 VRF Aggr:5020/92006
   10.3.12.0/24     13.0.0.12       IPv4 VRF Aggr:5020/92007
   10.13.13.13/32   10.5.7.7        5028/7006
   10.13.14.0/24    10.5.7.7        5029/7006
   110.0.0.0/32     13.0.0.8        IPv4 VRF Aggr:5020/8016
   110.0.0.1/32     13.0.0.8        IPv4 VRF Aggr:5020/8017
   110.0.0.2/32     13.0.0.8        IPv4 VRF Aggr:5020/8018
   110.0.0.3/32     13.0.0.8        IPv4 VRF Aggr:5020/8019
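When the label table is long, it can help to group prefixes by their local (in) label to see at a glance which labels are aggregates and which are per-prefix. The following is a minimal sketch, assuming the output has been saved as text with one route per line; the `local_labels` function, the regular expression, and the abbreviated `SAMPLE` rows are illustrative and not part of any Cisco tooling. Continuation rows that omit the network column are not handled.

```python
import re
from collections import defaultdict

# A few rows copied from the VRF EIGRP table above.
SAMPLE = """\
10.1.1.1/32 10.5.7.7 5016/7006
10.1.2.0/24 10.5.7.7 5026/7006
10.3.3.3/32 13.0.0.12 IPv4 VRF Aggr:5020/92006
10.3.12.0/24 13.0.0.12 IPv4 VRF Aggr:5020/92007
110.0.0.0/32 13.0.0.8 IPv4 VRF Aggr:5020/8016
"""

# prefix, next hop, optional aggregate marker, in label, out label
ROW = re.compile(r"^\s*(\S+/\d+)\s+(\S+)\s+(IPv4 VRF Aggr:)?(\d+)/(\S+)\s*$")


def local_labels(text):
    """Map each local label to (is_aggregate, [prefixes])."""
    grouped = defaultdict(lambda: [False, []])
    for line in text.splitlines():
        match = ROW.match(line)
        if match is None:
            continue  # skip headers and anything unrecognized
        prefix, _next_hop, aggregate, in_label, _out_label = match.groups()
        entry = grouped[int(in_label)]
        entry[0] = entry[0] or aggregate is not None
        entry[1].append(prefix)
    return {label: (flag, prefixes) for label, (flag, prefixes) in grouped.items()}


for label, (is_aggregate, prefixes) in sorted(local_labels(SAMPLE).items()):
    kind = "aggregate" if is_aggregate else "per-prefix"
    print(label, kind, prefixes)
```

In this sample, the routes learned from within AS 13 share aggregate label 5020, while the routes learned from CSR7 each receive their own local label.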


The EIGRP VPN has connectivity still, as expected, since nothing really changed. We use EPC outbound on CSR6 to confirm that MPLS encapsulation is not observed. Since QinQ is used, the Ethernet payload is another dot1q frame. We do not see ethertype 0x8847 in the packet, and the length of 522 confirms that the packet is not MPLS encapsulated: the 500-byte IP packet plus 22 bytes of layer-2 overhead (14 Ethernet + 4 dot1q + 4 second dot1q). The IP addresses appear in the hex dump as 0A010101 (10.1.1.1) and 0A030303 (10.3.3.3).

R1#ping 10.3.3.3 source 10.1.1.1 size 500
Type escape sequence to abort.
Sending 5, 500-byte ICMP Echos to 10.3.3.3, timeout is 2 seconds:
Packet sent with a source address of 10.1.1.1
!!!!!
Success rate is 100 percent (5/5), round-trip min/avg/max = 14/61/96 ms

R6#show monitor capture CAP buffer detailed
0 522 0.000000 00:50:56:A9:DE:0D -> 00:50:56:A9:DC:63 VLAN-tagged frame
0000: 005056A9 DC630050 56A9DE0D 81000DE4 .PV..c.PV.......
0010: 81000003 08004500 01F405D0 0000FC01 ......E.........
0020: 9F310A01 01010A03 03030800 CA300016 .1...........0..
0030: 00000000 00003F9A 3BBBABCD ABCDABCD ......?.;.......
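The hex dump can be decoded programmatically to confirm the reasoning about tags and lengths. The sketch below is a minimal illustration, assuming the first three rows of the capture are pasted in as a string; the `decode` function and the keys of its result are hypothetical names chosen for this example. It walks any stacked 802.1Q tags, reads the final ethertype, and extracts the IPv4 total length and addresses.

```python
import struct

# First 48 bytes of the captured frame, copied from the EPC output.
HEX = (
    "005056A9 DC630050 56A9DE0D 81000DE4 "
    "81000003 08004500 01F405D0 0000FC01 "
    "9F310A01 01010A03 03030800 CA300016"
)


def decode(hex_dump):
    data = bytes.fromhex(hex_dump.replace(" ", ""))
    offset = 12  # skip the destination and source MAC addresses
    vlans = []
    (ethertype,) = struct.unpack_from("!H", data, offset)
    while ethertype == 0x8100:  # 802.1Q tag: 2-byte TPID + 2-byte TCI
        (tci,) = struct.unpack_from("!H", data, offset + 2)
        vlans.append(tci & 0x0FFF)  # the low 12 bits carry the VLAN ID
        offset += 4
        (ethertype,) = struct.unpack_from("!H", data, offset)
    offset += 2  # step past the final ethertype
    result = {"vlans": vlans, "ethertype": ethertype, "l2_overhead": offset}
    if ethertype == 0x0800:  # IPv4 follows directly, so no MPLS shim
        (total_length,) = struct.unpack_from("!H", data, offset + 2)
        result["ip_length"] = total_length
        result["src"] = ".".join(str(b) for b in data[offset + 12:offset + 16])
        result["dst"] = ".".join(str(b) for b in data[offset + 16:offset + 20])
    return result


frame = decode(HEX)
print(frame)
print("frame length:", frame["l2_overhead"] + frame["ip_length"])
```

The layer-2 overhead plus the IP total length reproduces the captured frame size, and the ethertype following the second tag is 0x0800 rather than 0x8847.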

We can prove that the traffic is untagged another way. CSR2 uses VPN label 6026, which is allocated by CSR6. The LSRs along the path inside AS 24 will forward the packet to the penultimate hop, which performs PHP to reveal label 6026 to CSR6. CSR6’s LFIB performs a pop, not a swap, and then uses the MPLS payload for a routing decision. This is the expected behavior for VRF-wide aggregate labels since there is no per-prefix or per-CE (per-next-hop) granularity. Forwarding traffic based on the incoming label alone is ambiguous, so a second lookup is necessary. The traffic is routed without additional labels directly to CSR5 across the VRF-aware transit link.

R2#show ip cef vrf EIGRP 10.3.3.3
10.3.3.3/32
  nexthop 24.2.14.14 GigabitEthernet2.524 label 94008 6026

R6#show mpls forwarding-table labels 6026
Local      Outgoing   Prefix           Bytes Label   Outgoing   Next Hop
Label      Label      or Tunnel Id     Switched      interface
6026       Pop Label  IPv4 VRF[V]      8296          aggregate/EIGRP

R6#show ip cef vrf EIGRP 10.3.3.3/32
10.3.3.3/32
  nexthop 10.5.6.53 GigabitEthernet2.5563

Pinging in VRF OSPF does not succeed and EPC reveals nothing. Since CSR6 would essentially be swapping the VPN label 6046 with the VPN aggregate label 5019, we expect MPLS to be functional. However, packets are being dropped per the LFIB. The in/out labels are correct but the LFIB does not specify an outgoing interface.

R9#ping 10.4.4.4 source 10.9.9.9 size 500
Type escape sequence to abort.
Sending 5, 500-byte ICMP Echos to 10.4.4.4, timeout is 2 seconds:
Packet sent with a source address of 10.9.9.9
.....
Success rate is 0 percent (0/5)

R2#show ip cef vrf OSPF 10.4.4.4/32
10.4.4.4/32
  nexthop 24.2.14.14 GigabitEthernet2.524 label 94008 6046

R6#show mpls forwarding-table labels 6046
Local      Outgoing   Prefix           Bytes Label   Outgoing   Next Hop
Label      Label      or Tunnel Id     Switched      interface
6046       5019       10.4.4.4/32[V]   0             drop

The solution is to enable MPLS on the VRF-aware transit links for BGP forwarding. Normally this is configured automatically with option B, but since there are no VRF-aware VPNv4 sessions, we must configure it manually on all transit links expected to support CSC. This is only needed on the VRF OSPF interfaces as VRF EIGRP is not configured for CSC support.

! CSR6
interface GigabitEthernet2.5562
 mpls bgp forwarding
! CSR7
interface GigabitEthernet2.5572
 mpls bgp forwarding
! CSR5
interface GigabitEthernet2.5562
 mpls bgp forwarding
interface GigabitEthernet2.5572
 mpls bgp forwarding

I quickly verify that BGP-based MPLS is enabled on all VRF-aware transit links. Without this command, all of the control plane functions appear functional (BGP, routing, etc.) but forwarding does not work.

R5#show mpls interfaces vrf OSPF
Interface              IP     Tunnel   BGP   Static   Operational
VRF OSPF:
GigabitEthernet2.5562  No     No       Yes   No       Yes
GigabitEthernet2.5572  No     No       Yes   No       Yes

R6#show mpls interfaces vrf OSPF
Interface              IP     Tunnel   BGP   Static   Operational
VRF OSPF:
GigabitEthernet2.5562  No     No       Yes   No       Yes
Lspvif1                No     No       No    No       Yes

R7#show mpls interfaces vrf OSPF
Interface              IP     Tunnel   BGP   Static   Operational
VRF OSPF:
GigabitEthernet2.5572  No     No       Yes   No       Yes
Lspvif1                No     No       No    No       Yes

Now, when we verify the LFIB, we can see the label swap occurring properly. The LFIB is now aware that BGP is allowed to perform MPLS forwarding operations on this interface, allowing it to be added to the LFIB properly.

R6#show mpls forwarding-table labels 6046
Local      Outgoing   Prefix           Bytes Label   Outgoing   Next Hop
Label      Label      or Tunnel Id     Switched      interface
6046       5019       10.4.4.4/32[V]   0             Gi2.5562   10.5.6.52
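The two forwarding behaviors seen on CSR6 so far, a pop followed by a VRF lookup for aggregate label 6026 and a direct swap for label 6046, can be contrasted in a toy model. This is a sketch under stated assumptions, not how IOS-XE is implemented: the `LFIB` and `VRF_FIB` dictionaries and the `forward` function are hypothetical, populated only with the entries shown in this section.

```python
import ipaddress

LFIB = {
    # Aggregate label: identifies only the VRF, not the destination.
    6026: {"op": "pop", "vrf": "EIGRP"},
    # Label with a complete rewrite: out label, next hop, and interface.
    6046: {"op": "swap", "out_label": 5019,
           "next_hop": "10.5.6.52", "interface": "Gi2.5562"},
}

VRF_FIB = {
    "EIGRP": {"10.3.3.3/32": ("10.5.6.53", "Gi2.5563")},
}


def forward(label, destination):
    entry = LFIB[label]
    if entry["op"] == "swap":
        # One lookup: the label alone determines the rewrite.
        return {"labels": [entry["out_label"]],
                "next_hop": entry["next_hop"], "lookups": 1}
    # Pop, then a longest-prefix match on the IP payload inside the VRF.
    address = ipaddress.ip_address(destination)
    candidates = [
        ipaddress.ip_network(prefix)
        for prefix in VRF_FIB[entry["vrf"]]
        if address in ipaddress.ip_network(prefix)
    ]
    if not candidates:
        return {"labels": [], "next_hop": None, "lookups": 2}
    best = max(candidates, key=lambda network: network.prefixlen)
    next_hop, _interface = VRF_FIB[entry["vrf"]][str(best)]
    return {"labels": [], "next_hop": next_hop, "lookups": 2}


print(forward(6026, "10.3.3.3"))  # EIGRP VPN: pop, then IP lookup, sent unlabeled
print(forward(6046, "10.4.4.4"))  # OSPF VPN with CSC: single swap to 5019
```

The model makes the trade-off visible: the aggregate label needs a second lookup on the payload, while the swap entry forwards on the label alone.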

Using ping and EPC, we confirm that the traffic is MPLS-encapsulated with a single label value of 5019 as it is sent towards CSR5. The packet size is now 526, which is 4 bytes larger than the untagged packet in the EIGRP VPN. Label 5019 (0x0139B) is visible in the stack immediately after ethertype 0x8847, which makes EPC a valuable tool for CSC operations. The VPN IP addresses appear in the hex dump as 0A090909 (10.9.9.9) and 0A040404 (10.4.4.4).

R9#ping 10.4.4.4 source 10.9.9.9 size 500
Type escape sequence to abort.
Sending 5, 500-byte ICMP Echos to 10.4.4.4, timeout is 2 seconds:
Packet sent with a source address of 10.9.9.9
!!!!!
Success rate is 100 percent (5/5), round-trip min/avg/max = 13/59/89 ms

R6#show monitor capture CAP buffer detailed
1 526 0.851998 00:50:56:A9:DE:0D -> 00:50:56:A9:DC:63 VLAN-tagged frame
0000: 005056A9 DC630050 56A9DE0D 81000DE4 .PV..c.PV.......
0010: 81000002 88470139 B1FC4500 01F4004D .....G.9..E....M
0020: 0000FE01 99A20A09 09090A04 04040800 ................
0030: 50360016 00000000 0000A525 502AABCD P6.........%P*..
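The four bytes that follow ethertype 0x8847 in the capture (0139B1FC) form one MPLS label stack entry: a 20-bit label, 3 experimental bits, a bottom-of-stack bit, and an 8-bit TTL. A minimal sketch of decoding that word is shown below; the function name `decode_label_entry` is hypothetical.

```python
def decode_label_entry(word):
    """Split a 32-bit MPLS label stack entry into its four fields."""
    return {
        "label": word >> 12,                       # top 20 bits
        "exp": (word >> 9) & 0x7,                  # 3 experimental/TC bits
        "bottom_of_stack": bool((word >> 8) & 1),  # S bit
        "ttl": word & 0xFF,                        # low 8 bits
    }


entry = decode_label_entry(0x0139B1FC)
print(entry)

# 14 Ethernet + 4 dot1q + 4 second dot1q + 4 MPLS + 500-byte IP packet
print(14 + 4 + 4 + 4 + 500)
```

The decoded label matches the aggregate label advertised by CSR5, the bottom-of-stack bit confirms that only one label is present, and the length arithmetic accounts for the 4-byte difference from the untagged EIGRP capture.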

We confirm that the LFIB counters increment when sending traffic. This is expected regardless of whether the outgoing label is an aggregate label or not. It is important to verify this feature with EPC and packet counters because VPN traceroute does not reveal this interim LSP, as shown below. I suspect this is because CSR5 performs its final routing lookup based on the MPLS payload (IP packet), not the MPLS aggregate label, so the reply does not include that label.

R6#show mpls forwarding-table labels 6046
Local      Outgoing   Prefix           Bytes Label   Outgoing   Next Hop
Label      Label      or Tunnel Id     Switched      interface
6046       5019       10.4.4.4/32[V]   7060          Gi2.5562   10.5.6.52

R9#traceroute 10.4.4.4 source 10.9.9.9
Type escape sequence to abort.
Tracing the route to 10.4.4.4
VRF info: (vrf in name/id, vrf out name/id)
  1 10.2.9.2 6 msec 4 msec 4 msec
  2 24.2.14.14 [MPLS: Labels 94008/6046 Exp 0] 7 msec 6 msec 6 msec
  3 10.5.6.6 [MPLS: Label 6046 Exp 0] 7 msec 12 msec 19 msec
  4 10.5.6.52 19 msec 11 msec 11 msec
  5 10.4.8.8 [MPLS: Label 8020 Exp 0] 10 msec 19 msec 20 msec
  6 10.4.8.4 20 msec 10 msec 9 msec

To perform a quick verification on CSR7, the backup CSC path, I shut down CSR6’s BGP session to CSR5. CSR6’s only remaining eBGP path is the non-CSC-enabled link to XRv1, which should never be used as it would blackhole traffic. The aggregate label was never generated on XRv1 and not even exchanged with CSR6. I leave this link operational for demonstration, but BGP MED ensures that CSR7 will be preferred as an alternate. We can also see label 7005, which we assume to be CSR7’s local aggregate label for VRF OSPF.

R6#show bgp vpnv4 unicast vrf OSPF 10.4.4.4/32
BGP routing table entry for 24:2:10.4.4.4/32, version 162
Paths: (2 available, best #1, table OSPF)
  Advertised to update-groups:
     7
  Refresh Epoch 2
  13
    24.0.0.7 (metric 10) (via default) from 24.0.0.2 (24.0.0.2)
      Origin incomplete, metric 0, localpref 100, valid, internal, best
      Extended Community: RT:24:2 OSPF ROUTER ID:10.4.8.8:0 OSPF RT:0.0.0.0:2:0
      Originator: 24.0.0.7, Cluster list: 24.0.0.2
      Connector Attribute: count=1 type 1 len 12 value 13:2:13.0.0.8
      mpls labels in/out 6046/7005
      rx pathid: 0, tx pathid: 0x0
  Refresh Epoch 1
  13
    10.6.11.11 (via vrf OSPF) from 10.6.11.11 (13.0.0.11)
      Origin incomplete, metric 11, localpref 100, valid, external
      Extended Community: RT:24:2 OSPF ROUTER ID:10.4.8.8:0 OSPF RT:0.0.0.0:2:0
      mpls labels in/out 6046/nolabel
      rx pathid: 0, tx pathid: 0


We verify that label 7005 is, in fact, the local aggregate label for VRF OSPF. This should be swapped to 5019, which was CSR5’s aggregate label for VRF OSPF. This hasn’t changed as a result of shutting down CSR6’s session, as CSR5 advertises it to all BGP peers for all prefixes. Since CSR2 will prefer the VPN route from CSR7 as the best path, we expect 7005 to be the VPN label for 10.4.4.4/32, as well as other VPN prefixes.

R7#show bgp vpnv4 unicast vrf OSPF labels
   Network          Next Hop      In label/Out label
Route Distinguisher: 24:2 (OSPF)
   10.2.9.0/24      24.0.0.2        IPv4 VRF Aggr:7005/2015
   10.4.4.4/32      10.5.7.5        IPv4 VRF Aggr:7005/5019
   10.4.8.0/24      10.5.7.5        IPv4 VRF Aggr:7005/5019
   10.9.9.9/32      24.0.0.2        IPv4 VRF Aggr:7005/2017
   110.0.0.0/32     10.5.7.5        IPv4 VRF Aggr:7005/5019
   110.0.0.1/32     10.5.7.5        IPv4 VRF Aggr:7005/5019
   110.0.0.2/32     10.5.7.5        IPv4 VRF Aggr:7005/5019
   110.0.0.3/32     10.5.7.5        IPv4 VRF Aggr:7005/5019

R2#show ip cef vrf OSPF 10.4.4.4/32
10.4.4.4/32
  nexthop 24.2.14.14 GigabitEthernet2.524 label 94010 7005

We notice a slight change in CSR7’s local label allocation for 10.4.4.4/32 compared to CSR6. CSR6 allocated per-prefix local labels for inter-AS VPN routes, such as 6046, which could be swapped to the peer’s aggregate label without invoking an IPv4 FIB lookup inside the VPN. CSR7 does not explicitly specify a VPN next-hop under the VRF configuration. Without CSC, this had no impact on performance, but in this case, CSR7 has to remove the local aggregate label, and then perform an IPv4 FIB lookup in the VPN while pushing the peer’s aggregate label. The VPN-aware FIB lookup still relies on the existing 10.5.7.5 next-hop, which was unchanged, so the design is valid. This two-step operation (pop/push rather than swap) is less efficient, which is why specifying the next-hop is a good design choice when using the CSC option with option AB.

R7#show mpls forwarding-table labels 7005
Local      Outgoing   Prefix           Bytes Label   Outgoing   Next Hop
Label      Label      or Tunnel Id     Switched      interface
7005       Pop Label  IPv4 VRF[V]      13670         aggregate/OSPF

R7#show ip cef vrf OSPF 10.4.4.4/32
10.4.4.4/32
  nexthop 10.5.7.5 GigabitEthernet2.5572 label 5019

This makes the traceroute look even more awkward inside the VPN. Since CSR7 and CSR5 are both performing FIB (not LFIB) lookups to route the VPN traffic, neither of them includes the MPLS labels associated with each hop.

R9#traceroute 10.4.4.4 source 10.9.9.9
Type escape sequence to abort.
Tracing the route to 10.4.4.4
VRF info: (vrf in name/id, vrf out name/id)
  1 10.2.9.2 6 msec 4 msec 4 msec
  2 24.2.14.14 [MPLS: Labels 94010/7005 Exp 0] 6 msec 5 msec 6 msec
  3 10.5.7.7 5 msec 7 msec 9 msec
  4 10.5.7.5 10 msec 10 msec 9 msec
  5 10.4.8.8 [MPLS: Label 8020 Exp 0] 13 msec 22 msec 18 msec
  6 10.4.8.4 21 msec 10 msec 16 msec

A quick EPC check on CSR7 outbound towards CSR5 shows label 5019 in the stack. This MPLS encapsulation is required for CSC service to be operational. Before continuing, CSR6’s BGP session with CSR5 is restored.

R7#show monitor capture CAP buffer detailed
0 526 0.000000 00:50:56:A9:EA:77 -> 00:50:56:A9:DC:63 VLAN-tagged frame
0000: 005056A9 DC630050 56A9EA77 81000DE5 .PV..c.PV..w....
0010: 81000002 88470139 B1FC4500 01F4005C .....G.9..E....\
0020: 0000FC01 9B930A09 09090A04 04040800 ................
0030: 18FF0019 00000000 0000A53D 8746ABCD ...........=.F..

Some of these inconsistencies appeared strange, and furthermore, the documentation clearly states that “per-prefix” labels should be allocated when CSC is enabled. After reloading the ASBRs, the behavior changes significantly. I assume this is the result of a configuration order-of-operations issue despite the configuration above being functional. Below, I show CSR5 and CSR6 label allocations for VPNv4 inside VRF OSPF; there are no aggregate labels. On CSR6, notice that label swapping only occurs when CSR5 is the next-hop, which is accurate since XRv1 only supports option A in this architecture.

R6#show bgp vpnv4 unicast vrf OSPF labels
   Network          Next Hop      In label/Out label
Route Distinguisher: 24:2 (OSPF)
   10.2.9.0/24      24.0.0.2        6048/2013
   10.4.4.4/32      10.5.6.5        6036/5021
                    10.6.11.11      6036/nolabel
   10.4.8.0/24      10.5.6.5        6037/5022
                    10.6.11.11      6037/nolabel
   10.9.9.9/32      24.0.0.2        6049/2015
   110.0.0.0/32     10.5.6.5        6038/5023
                    10.6.11.11      6038/nolabel
   110.0.0.1/32     10.5.6.5        6039/5024
                    10.6.11.11      6039/nolabel
   110.0.0.2/32     10.5.6.5        6040/5025
                    10.6.11.11      6040/nolabel
   110.0.0.3/32     10.5.6.5        6041/5026
                    10.6.11.11      6041/nolabel

R5#show bgp vpnv4 unicast vrf OSPF labels


   Network          Next Hop      In label/Out label
Route Distinguisher: 13:2 (OSPF)
   10.2.9.0/24      10.5.6.6        5009/6048
   10.4.4.4/32      13.0.0.8        5021/8007
   10.4.8.0/24      13.0.0.8        5022/8008
   10.9.9.9/32      10.5.6.6        5010/6049
   110.0.0.0/32     13.0.0.8        5023/8003
   110.0.0.1/32     13.0.0.8        5024/8004
   110.0.0.2/32     13.0.0.8        5025/8005
   110.0.0.3/32     13.0.0.8        5026/8006

The implication of this is that the LSPs will operate more efficiently now. For example, if traffic arrives at CSR5 with label 5010 and destined for 10.9.9.9/32 in the remote AS, it will be swapped directly to 6049 in a single LFIB lookup. Label 6036 entering CSR6 along the reverse LSP should behave similarly by swapping directly to label 5021 and sending to CSR5. With a next-hop interface in the LFIB below, there is no need for another FIB lookup. The comments earlier about next-hop specification with CSC are no longer true since CSR6 and CSR5 both behave identically now. Again, this is probably the result of a configuration sequence error, but I leave the existing documentation in place as a troubleshooting aid.

R5#show mpls forwarding-table labels 5010
Local      Outgoing   Prefix           Bytes Label   Outgoing   Next Hop
Label      Label      or Tunnel Id     Switched      interface
5010       6049       10.9.9.9/32[V]   3402          Gi2.5562   10.5.6.6

R6#show mpls forwarding-table labels 6036
Local      Outgoing   Prefix           Bytes Label   Outgoing   Next Hop
Label      Label      or Tunnel Id     Switched      interface
6036       5021       10.4.4.4/32[V]   7512          Gi2.5562   10.5.6.52
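A single-lookup swap only works if the label one ASBR sends is the label its peer allocated locally. The sketch below cross-checks a subset of the in/out labels from the two `show bgp vpnv4 unicast vrf OSPF labels` outputs earlier in this section. The dictionary layout, the function name, and the grouping of prefixes by AS (inferred from the next-hops in those outputs) are assumptions made for illustration.

```python
# (in label, out label) per prefix, copied from the CSR6 and CSR5
# "show bgp vpnv4 unicast vrf OSPF labels" outputs in this section.
# For CSR6, only the paths via CSR5 or CSR2 are listed.
csr6 = {
    "10.2.9.0/24": (6048, 2013),
    "10.4.4.4/32": (6036, 5021),
    "10.4.8.0/24": (6037, 5022),
    "10.9.9.9/32": (6049, 2015),
    "110.0.0.0/32": (6038, 5023),
}
csr5 = {
    "10.2.9.0/24": (5009, 6048),
    "10.4.4.4/32": (5021, 8007),
    "10.4.8.0/24": (5022, 8008),
    "10.9.9.9/32": (5010, 6049),
    "110.0.0.0/32": (5023, 8003),
}


def broken_swaps(sender, receiver, prefixes):
    """Prefixes where the sender's out label is not the receiver's in label."""
    return [p for p in prefixes if sender[p][1] != receiver[p][0]]


# Prefixes reached through AS 13 flow CSR6 -> CSR5; AS 24 prefixes flow CSR5 -> CSR6.
as13_prefixes = ["10.4.4.4/32", "10.4.8.0/24", "110.0.0.0/32"]
as24_prefixes = ["10.2.9.0/24", "10.9.9.9/32"]

toward_as13 = broken_swaps(csr6, csr5, as13_prefixes)
toward_as24 = broken_swaps(csr5, csr6, as24_prefixes)
print(toward_as13, toward_as24)
```

Both lists come back empty for the values shown, which agrees with the LFIB entries for labels 5010 and 6036.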

Traceroute proves this as the path looks more like “option B” from a traceroute perspective. I recommend ensuring that, when inter-AS option AB with CSC is enabled, labels are allocated per-prefix. The only method that worked in this setup was an ASBR reload. This is true in both directions.

R4#traceroute 10.9.9.9 source 10.4.4.4
Type escape sequence to abort.
Tracing the route to 10.9.9.9
VRF info: (vrf in name/id, vrf out name/id)
  1 10.4.8.8 4 msec 4 msec 3 msec
  2 10.5.6.52 [MPLS: Label 5010 Exp 0] 7 msec 9 msec 7 msec
  3 10.5.6.6 [MPLS: Label 6049 Exp 0] 27 msec 29 msec 32 msec
  4 24.6.14.14 [MPLS: Labels 94009/2015 Exp 0] 29 msec 31 msec 32 msec
  5 10.2.9.2 [MPLS: Label 2015 Exp 0] 15 msec 16 msec 14 msec
  6 10.2.9.9 20 msec 10 msec 16 msec

R9#traceroute 10.4.4.4 source 10.9.9.9
Type escape sequence to abort.
Tracing the route to 10.4.4.4


VRF info: (vrf in name/id, vrf out name/id)
  1 10.2.9.2 4 msec 5 msec 5 msec
  2 24.2.14.14 [MPLS: Labels 94008/6036 Exp 0] 9 msec 10 msec 10 msec
  3 10.5.6.6 [MPLS: Label 6036 Exp 0] 23 msec 31 msec 31 msec
  4 10.5.6.52 [MPLS: Label 5021 Exp 0] 30 msec 31 msec 31 msec
  5 10.4.8.8 [MPLS: Label 8007 Exp 0] 19 msec 29 msec 20 msec
  6 10.4.8.4 20 msec 12 msec 11 msec

8.4.4.2 L2VPN

MPLS L2VPN using option AB works identically to option A. Since only VPNv4 is supported, and the whole idea of option AB is to isolate customer traffic into separate VPN transit links, an option B-like forwarding mechanism would violate the general design. This section is largely similar to the option A L2VPN section but is verified briefly for completeness. We will limit our verifications to CSR5 and CSR6; no configuration changes were made between option A and option AB (this section). First, I verify that the ASBRs have a single “core PW” over MPLS to their respective PEs, as well as the VFI interface that links the bridge-domain. All connections are reported as being up.

R5#show l2vpn service vfi name VPLS | begin Interface
  Interface          Group       Encapsulation                   Prio  St  XC St
  ---------          -----       -------------                   ----  --  -----
VPLS name: VPLS, State: UP
  pw100001                       VPLS(VFI)                       0     UP  UP
  pw100002           core_pw     13.0.0.8:3(MPLS)                0     UP  UP

R6#show l2vpn service vfi name VPLS | begin Interface
  Interface          Group       Encapsulation                   Prio  St  XC St
  ---------          -----       -------------                   ----  --  -----
VPLS name: VPLS, State: UP
  pw100001                       VPLS(VFI)                       0     UP  UP
  pw100002           core_pw     2:3(MPLS)                       0     UP  UP

Next, I check for the data-plane learning on the ASBRs. I first record the MAC address of CSR1 (yellow) and CSR3 (green) for easy identification.

R1#show interfaces gig2 | include bia
  Hardware is CSR vNIC, address is 0050.56a9.1aaa (bia 0050.56a9.1aaa)

R3#show interfaces gig2 | include bia
  Hardware is CSR vNIC, address is 0050.56a9.8ccf (bia 0050.56a9.8ccf)

Looking at the ASBR bridge-domains, we can see both MAC addresses are learned. On CSR5, CSR1’s MAC is learned via the inter-AS EFP while CSR3’s MAC is learned via the VPLS PW to CSR8. The opposite is true on CSR6, which learns CSR1’s MAC via the VPLS PW to CSR2.

R5#show bridge-domain 3
Bridge-domain 3 (2 ports in all)
State: UP                    Mac learning: Enabled
Aging-Timer: 300 second(s)
    GigabitEthernet2 service instance 30
    vfi VPLS neighbor 13.0.0.8 3
   AED MAC address    Policy  Tag       Age  Pseudoport
   0   0050.56A9.1AAA forward dynamic   221  GigabitEthernet2.EFP30
   1   FFFF.FFFF.FFFF flood   static    0    OLIST_PTR:0xe80d0400
   0   0050.56A9.8CCF forward dynamic   236  VPLS.1004011

R6#show bridge-domain 3
Bridge-domain 3 (2 ports in all)
State: UP                    Mac learning: Enabled
Aging-Timer: 300 second(s)
    GigabitEthernet2 service instance 30
    vfi VPLS neighbor 24.0.0.2 3
   AED MAC address    Policy  Tag       Age  Pseudoport
   0   0050.56A9.1AAA forward dynamic   213  VPLS.1004011
   1   FFFF.FFFF.FFFF flood   static    0    OLIST_PTR:0xe8125400
   0   0050.56A9.8CCF forward dynamic   228  GigabitEthernet2.EFP30

A similar bridge-domain MAC address check on CSR2 and CSR8 confirms that the MAC addresses are being learned properly. The PE learns its local CE via the PE-CE EFP and the remote CE via the VPLS pseudoport towards the neighboring ASBR.

R2#show bridge-domain 3
Bridge-domain 3 (2 ports in all)
State: UP                    Mac learning: Enabled
Aging-Timer: 300 second(s)
    GigabitEthernet2 service instance 3
    vfi VPLS neighbor 24.0.0.6 3
   AED MAC address    Policy  Tag       Age  Pseudoport
   0   0050.56A9.1AAA forward dynamic   216  GigabitEthernet2.EFP3
   1   FFFF.FFFF.FFFF flood   static    0    OLIST_PTR:0xe80e9400
   0   0050.56A9.8CCF forward dynamic   246  VPLS.1004011

R8#show bridge-domain 3
Bridge-domain 3 (2 ports in all)
State: UP                    Mac learning: Enabled
Aging-Timer: 300 second(s)
    GigabitEthernet2 service instance 3
    vfi VPLS neighbor 13.0.0.5 3
   AED MAC address    Policy  Tag       Age  Pseudoport
   0   0050.56A9.1AAA forward dynamic   211  VPLS.1004011
   1   FFFF.FFFF.FFFF flood   static    0    OLIST_PTR:0xe80a9800
   0   0050.56A9.8CCF forward dynamic   242  GigabitEthernet2.EFP3

Ping and traceroute indicate the VPLS has formed and is functioning properly.


R3#ping vrf VPLS 10.0.0.1
Type escape sequence to abort.
Sending 5, 100-byte ICMP Echos to 10.0.0.1, timeout is 2 seconds:
!!!!!
Success rate is 100 percent (5/5), round-trip min/avg/max = 9/10/14 ms

R3#traceroute vrf VPLS 10.0.0.1
Type escape sequence to abort.
Tracing the route to 10.0.0.1
VRF info: (vrf in name/id, vrf out name/id)
  1 10.0.0.1 12 msec 10 msec 9 msec

8.4.4.3 MVPN – GRE (Profile 0) and mLDP (Profile 1)

MVPN over option AB is similar to option A but is worth investigating because the control-plane changed for VPNv4. AS 13 uses GRE encapsulation (Profile 0) and AS 24 uses MPLS encapsulation via mLDP (Profile 1). First, I verify that the default MDT forms correctly inside AS 13. Although not recommended, I used ASM groups to demonstrate that IPv4 MDT/MVPN is not a requirement for GRE-based MVPN. For brevity, I limit the verifications to the OSPF VPN. Since the MVPN features used are deployed per-AS, not per-VPN, it is acceptable to only verify one VRF per AS for illustration. CSR8, CSR5, and XRv1 are all inside of the same default MDT. So far, this is identical to MVPN for option A. I highlighted the inter-AS PIM neighbors on CSR5 in green as well; this verification gives us the bonus information of ensuring those PIM neighbors are functional.

R8#show ip pim vrf OSPF neighbor | begin ^Neigh
Neighbor          Interface                Uptime/Expires    Ver   DR
Address                                                            Prio/Mode
10.4.8.4          GigabitEthernet2.548     00:31:13/00:01:31 v2    1 / S P G
13.0.0.5          Tunnel3                  00:29:28/00:01:17 v2    1 / S P G
13.0.0.11         Tunnel3                  00:29:57/00:01:15 v2    1 / DR P G

R5#show ip pim vrf OSPF neighbor | begin ^Neigh
Neighbor          Interface                Uptime/Expires    Ver   DR
Address                                                            Prio/Mode
10.5.6.6          GigabitEthernet2.5562    00:31:43/00:01:33 v2    1 / S P G
10.5.7.7          GigabitEthernet2.5572    00:31:52/00:01:22 v2    1 / DR S P G
13.0.0.11         Tunnel4                  00:30:32/00:01:41 v2    1 / DR P G
13.0.0.8          Tunnel4                  00:30:58/00:01:16 v2    1 / S P G

RP/0/0/CPU0:XRv1#show pim vrf OSPF neighbor | begin ^Neigh
Neighbor Address   Interface                     Uptime    Expires   DR pri  Flags
10.6.11.6          GigabitEthernet0/0/0/0.5612   00:31:29  00:01:16  1       P
10.6.11.11*        GigabitEthernet0/0/0/0.5612   3d00h     00:01:25  1 (DR)  B P E
13.0.0.5           mdtOSPF                       00:29:48  00:01:26  1       P
13.0.0.8           mdtOSPF                       00:30:43  00:01:31  1       P
13.0.0.11*         mdtOSPF                       3d00h     00:01:25  1 (DR)  P


Inside of AS 24, we quickly trace the MP2MP tree from root to leaves. Although mLDP trees are built upstream towards the root, I verify it in the reverse direction for brevity. Without checking the MDT database in brief format, we know the opaque is MDT and the VPN ID is 24:2 as this was identified in the VRF configuration earlier. The root is XRv4 and it has no upstream neighbors as expected. All upstream labels are local labels since XRv4 allocated these to inform downstream nodes how to forward traffic up the tree (towards the root). The remote labels are downstream labels, learned by XRv4 via each peer’s label-mapping messages, which describe how to forward traffic down the tree (away from the root). The tree has 3 downstream peers: CSR6 and CSR7 are ASBRs and CSR2 is the PE. All of them appear the same to XRv4 as it is a P-router from the perspective of the data-plane.

RP/0/0/CPU0:XRv4#show mpls mldp database opaquetype mdt 24:2 0
mLDP database
LSM-ID: 0x00009  Type: MP2MP  Uptime: 2d18h
  FEC Root           : 24.0.0.14 (we are the root)
  Opaque decoded     : [mdt 24:2 0]
  Upstream neighbor(s) :
    None
  Downstream client(s):
    LDP 24.0.0.2:0       Uptime: 00:37:16
      Next Hop         : 24.2.14.2
      Interface        : GigabitEthernet0/0/0/0.524
      Remote label (D) : 2023
      Local label (U)  : 94023
    LDP 24.0.0.6:0       Uptime: 00:17:14
      Next Hop         : 24.6.14.6
      Interface        : GigabitEthernet0/0/0/0.564
      Remote label (D) : 6052
      Local label (U)  : 94028
    LDP 24.0.0.7:0       Uptime: 00:17:14
      Next Hop         : 24.7.14.7
      Interface        : GigabitEthernet0/0/0/0.574
      Remote label (D) : 7031
      Local label (U)  : 94031

A quick OAM check from CSR2 indicates that the tree has formed and is functioning properly.

R2#ping mpls mldp mp2mp 24.0.0.14 mdt 24:2 0
mp2mp Root node addr 24.0.0.14
 Opaque type MDT, oui:index 0x24:02, mdtnum 0
Sending 1, 72-byte MPLS Echos to Target FEC Stack TLV descriptor,
     timeout is 2.2 seconds, send interval is 0 msec, jitter value is 200 msec:
[snip]
Request #1
! reply addr 24.6.14.6
! reply addr 24.7.14.7


Last, I ensure all ASBR/PE routers have VRF-aware PIM neighbors with all others (yellow). This ensures that the MP2MP tree is functioning correctly. On CSR6 and CSR7, we can also verify the inter-AS PIM neighbors (green).

R2#show ip pim vrf OSPF neighbor | begin ^Neigh
Neighbor          Interface                Uptime/Expires    Ver   DR
Address                                                            Prio/Mode
10.2.9.9          GigabitEthernet2.529     00:46:21/00:01:40 v2    1 / DR S P G
24.0.0.6          Lspvif1                  00:44:18/00:01:43 v2    1 / S P G
24.0.0.7          Lspvif1                  00:45:17/00:01:41 v2    1 / DR S P G

R6#show ip pim vrf OSPF neighbor | begin ^Neigh
Neighbor          Interface                Uptime/Expires    Ver   DR
Address                                                            Prio/Mode
10.5.6.52         GigabitEthernet2.5562    00:46:47/00:01:44 v2    1 / DR S P G
10.6.11.11        GigabitEthernet2.5612    00:46:46/00:01:40 v2    1 / DR P G
24.0.0.2          Lspvif1                  00:44:52/00:01:43 v2    1 / S P G
24.0.0.7          Lspvif1                  00:44:52/00:01:38 v2    1 / DR S P G

R7#show ip pim vrf OSPF neighbor | begin ^Neigh
Neighbor          Interface                Uptime/Expires    Ver   DR
Address                                                            Prio/Mode
10.5.7.5          GigabitEthernet2.5572    00:46:59/00:01:34 v2    1 / S P G
24.0.0.6          Lspvif1                  00:44:55/00:01:36 v2    1 / S P G
24.0.0.2          Lspvif1                  00:45:54/00:01:39 v2    1 / S P G

Though we did not verify the EIGRP VPN, we expect that all of its PIM neighbors are in place and functional. A quick check is to ensure the RP information (10.3.3.3) is present on all PEs and ASBRs. We can see that all routers learn this information except CSR2.

RP/0/0/CPU0:XRv2#show pim vrf EIGRP rp mapping
PIM Group-to-RP Mappings
Group(s) 224.0.0.0/4
  RP 10.3.3.3 (?), v2
    Info source: 10.3.12.3 (?), elected via bsr, priority 0, holdtime 150
      Uptime: 23:21:09, expires: 00:01:40

RP/0/0/CPU0:XRv1#show pim vrf EIGRP rp mapping
PIM Group-to-RP Mappings
Group(s) 224.0.0.0/4
  RP 10.3.3.3 (?), v2
    Info source: 13.0.0.12 (?), elected via bsr, priority 0, holdtime 150
      Uptime: 23:16:21, expires: 00:02:24

R5#show ip pim vrf EIGRP rp mapping
PIM Group-to-RP Mappings
Group(s) 224.0.0.0/4
  RP 10.3.3.3 (?), v2
    Info source: 10.3.3.3 (?), via bootstrap, priority 0, holdtime 150
         Uptime: 00:58:02, expires: 00:02:10

R6#show ip pim vrf EIGRP rp mapping
PIM Group-to-RP Mappings
Group(s) 224.0.0.0/4
  RP 10.3.3.3 (?), v2
    Info source: 10.3.3.3 (?), via bootstrap, priority 0, holdtime 150
         Uptime: 00:57:12, expires: 00:01:59

R7#show ip pim vrf EIGRP rp mapping
PIM Group-to-RP Mappings
Group(s) 224.0.0.0/4
  RP 10.3.3.3 (?), v2
    Info source: 10.3.3.3 (?), via bootstrap, priority 0, holdtime 150
         Uptime: 00:57:27, expires: 00:01:43

R2#show ip pim vrf EIGRP rp mapping
PIM Group-to-RP Mappings
[no output]

Enabling VRF-aware PIM BSR debugging on CSR2, the issue is revealed. CSR2 receives BSR messages from both CSR6 and CSR7 over the PMSI as expected, but RPF fails. CSR2 expects to receive them from 13.0.0.12, which is in another AS. The first statement of the error message can be ignored since the BSR message did arrive on the correct interface, but from the incorrect neighbor address.

R2#debug ip pim vrf EIGRP bsr
PIM-BSR debugging is on
PIM(1): Received v2 Bootstrap on Lspvif0 from 24.0.0.6
PIM-BSR(1): bootstrap (10.3.3.3) on non-RPF path Lspvif0 (expected Lspvif0) or from non-RPF neighbor 24.0.0.6 (expected 13.0.0.12) discarded
PIM(1): Received v2 Bootstrap on Lspvif0 from 24.0.0.7
PIM-BSR(1): bootstrap (10.3.3.3) on non-RPF path Lspvif0 (expected Lspvif0) or from non-RPF neighbor 24.0.0.7 (expected 13.0.0.12) discarded

At a glance, this seems odd because, in option A, this inter-AS information should not be exchanged. Option AB bends those rules since the control-plane information is shared, to some degree, between ASes. This initially included VPNv4 routes, their RDs, and their extended communities (RT, etc). However, we can confirm that 13.0.0.12 is, in fact, the RPF neighbor for 10.3.3.3/32 inside the EIGRP VPN. The key to this output is the “BGP originator”. This attribute was discussed in the option B section and is used for RPF checks against multicast packets coming from the remote AS. In this case, BSR messages are rejected since multicast traffic is arriving from the ASBRs as opposed to the remote PE. The unicast FIB clearly shows 24.0.0.6 as the preferred unicast path, which is the VPN next-hop for 10.3.3.3/32.


R2#show ip rpf vrf EIGRP 10.3.3.3
RPF information for ? (10.3.3.3)
  RPF interface: Lspvif0
  RPF neighbor: ? (13.0.0.12)
  RPF route/mask: 10.3.3.3/32
  RPF type: unicast (bgp 24)
  Doing distance-preferred lookups across tables
  BGP originator: 13.0.0.12
  RPF topology: ipv4 multicast base, originated from ipv4 unicast base

R2#show ip cef vrf EIGRP 10.3.3.3/32 detail
10.3.3.3/32, epoch 0, flags [rib defined all labels]
  recursive via 24.0.0.6 label 6046
    nexthop 24.2.14.14 GigabitEthernet2.524 label 94008
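The failure can be reproduced with a small model of the neighbor check. This is a sketch of the behavior observed in the outputs above, not the actual IOS-XE RPF code: it assumes that the "BGP originator" value, when present, replaces the BGP next-hop as the expected RPF neighbor. The function and parameter names are hypothetical.

```python
def rpf_neighbor(bgp_next_hop, bgp_originator=None):
    """Expected RPF neighbor: the originator wins over the next-hop if present."""
    return bgp_originator if bgp_originator is not None else bgp_next_hop


def accept_bootstrap(sender, bgp_next_hop, bgp_originator=None):
    """Accept a BSR message only if it arrives from the expected RPF neighbor."""
    return sender == rpf_neighbor(bgp_next_hop, bgp_originator)


# Values from CSR2: unicast next-hop 24.0.0.6, BGP originator 13.0.0.12.
observed = {
    sender: accept_bootstrap(sender, "24.0.0.6", "13.0.0.12")
    for sender in ("24.0.0.6", "24.0.0.7")
}
print(observed)

# Same check with no originator value, as would be expected in option A.
without_originator = accept_bootstrap("24.0.0.6", "24.0.0.6")
print(without_originator)
```

With the originator set to the remote PE, neither local ASBR can ever pass the check, which matches the two discarded bootstrap messages in the debug. Without it, the message from the unicast next-hop would be accepted.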

This attribute cannot be overridden or modified in any obvious way. The designers of this feature likely did not consider option AB and MVPN being used together as the RPF lookup should be making decisions similar to option A. Even the multicast route, replicated from the unicast iBGP route, correctly says 24.0.0.6 is the next-hop. Only the BGP connector attribute, when present, is processed.

R2#sh ip route multicast vrf EIGRP 10.3.3.3
Routing Table: EIGRP:multicast
Routing entry for 10.3.3.3/32
  Known via "bgp 24", distance 200, metric 0
  Tag 13, type internal, replicated from topology(EIGRP)
  Last update from 24.0.0.6 01:23:23 ago
  Routing Descriptor Blocks:
  * 24.0.0.6 (default), from 24.0.0.6, 01:23:23 ago
      Route metric is 0, traffic share count is 1
      AS Hops 1
      Route tag 13
      MPLS label: 6046
      MPLS Flags: MPLS Required

R2#show bgp vpnv4 unicast vrf EIGRP 10.3.3.3/32 bestpath
BGP routing table entry for 24:3:10.3.3.3/32, version 51
Paths: (2 available, best #2, table EIGRP)
  Advertised to update-groups:
     1
  Refresh Epoch 1
  13, (Received from a RR-client)
    24.0.0.6 (metric 20) (via default) from 24.0.0.6 (24.0.0.6)
      Origin incomplete, metric 0, localpref 100, valid, internal, best
      Extended Community: RT:24:3 0x8800:32768:0 0x8801:3:288 0x8802:65281:2560
        0x8803:1:1500 0x8806:0:167971843
      Connector Attribute: count=1
       type 1 len 12 value 13:3:13.0.0.12
      mpls labels in/out nolabel/6046
      rx pathid: 0, tx pathid: 0x0

An odd behavior is observed when we test this on the opposite side of the network. CSR2 saw a BGP connector attribute from AS 13 claiming 13.0.0.12 was the originator of 10.3.3.3/32, which is true. CSR8’s view of 10.9.9.9/32, which originates at 24.0.0.2, shows the local ASBR CSR5 as the originator. For this reason, MVPN should work fine from the perspective of AS 13.

R8#show bgp vpnv4 unicast vrf OSPF 10.9.9.9
BGP routing table entry for 13:2:10.9.9.9/32, version 76
Paths: (1 available, best #1, table OSPF)
  Not advertised to any peer
  Refresh Epoch 1
  24
    13.0.0.5 (metric 2) (via default) from 13.0.0.12 (13.0.0.12)
      Origin incomplete, metric 0, localpref 100, valid, internal, best
      Extended Community: RT:13:2 OSPF ROUTER ID:10.2.9.2:0 OSPF RT:0.0.0.0:2:0
      Originator: 13.0.0.5, Cluster list: 13.0.0.12
      Connector Attribute: count=1
       type 1 len 12 value 13:2:13.0.0.5
      mpls labels in/out nolabel/5010
      rx pathid: 0, tx pathid: 0x0

On XRv2, we can see that both XRv1 and CSR5 adjusted the connector attributes to be their own addresses. This is the result of the VPN routes being imported locally into VRFs, then re-advertised towards the RRs.

RP/0/0/CPU0:XRv2#show bgp vpnv4 unicast rd 13:2 10.9.9.9 | begin 24,
  24, (Received from a RR-client)
    13.0.0.5 (metric 3) from 13.0.0.5 (13.0.0.5)
      Received Label 5010
      Origin incomplete, metric 0, localpref 100, valid, internal, best, group-best, import-candidate, not-in-vrf, import suspect
      Received Path ID 0, Local Path ID 1, version 113
      Extended community: OSPF router-id:10.2.9.2 OSPF route-type:0:2:0x0 RT:13:2
      Connector: type: 1, Value:13:2:13.0.0.5
  Path #2: Received by speaker 0
  Not advertised to any peer
  24, (Received from a RR-client)
    13.0.0.11 (metric 3) from 13.0.0.11 (13.0.0.11)
      Received Label 91008
      Origin incomplete, localpref 100, valid, internal, import-candidate, not-in-vrf, import suspect
      Received Path ID 0, Local Path ID 0, version 0
      Extended community: OSPF router-id:10.2.9.2 OSPF route-type:0:2:0x0 RT:13:2
      Connector: type: 1, Value:13:2:13.0.0.11

There appears to be some inconsistency in when a router applies the connector attribute. For example, CSR7 receives an EIGRP VPN route from XRv4 and CSR2. The one from XRv4 has a connector attribute but the one from CSR2 does not. The PE configurations are nearly identical yet the behavior appears to differ between XE and XR.

R7#show bgp vpnv4 unicast vrf EIGRP 10.1.1.1
BGP routing table entry for 24:3:10.1.1.1/32, version 352
Paths: (1 available, best #1, table EIGRP)
  Advertised to update-groups:
     6
  Refresh Epoch 1
  Local
    24.0.0.2 (metric 20) (via default) from 24.0.0.2 (24.0.0.2)
      Origin incomplete, metric 10880, localpref 100, valid, internal, best
      Extended Community: RT:24:3 Cost:pre-bestpath:128:10880 0x8800:32768:0
        0x8801:3:288 0x8802:65281:2560 0x8803:65281:1500 0x8806:0:167837953
      mpls labels in/out IPv4 VRF Aggr:7006/2016
      rx pathid: 0, tx pathid: 0x0

R7#show bgp vpnv4 unicast vrf EIGRP 10.13.13.13
BGP routing table entry for 24:3:10.13.13.13/32, version 416
Paths: (1 available, best #1, table EIGRP)
  Advertised to update-groups:
     6
  Refresh Epoch 1
  Local
    24.0.0.14 (metric 10) (via default) from 24.0.0.2 (24.0.0.2)
      Origin incomplete, metric 10752, localpref 100, valid, internal, best
      Extended Community: RT:24:3 Cost:pre-bestpath:128:10752 0x8800:32768:0
        0x8801:3:282 0x8802:65281:2560 0x8803:1:1500 0x8806:0:168627469
      Originator: 24.0.0.14, Cluster list: 24.0.0.2
      Connector Attribute: count=1
       type 1 len 12 value 24:3:24.0.0.14
      mpls labels in/out IPv4 VRF Aggr:7006/94006
      rx pathid: 0, tx pathid: 0x0

Checking CSR5, the VPN routes from CSR8 have the connector attribute set as well, which disproves the claim above.

R5#show bgp vpnv4 unicast vrf OSPF 10.4.4.4
BGP routing table entry for 13:2:10.4.4.4/32, version 41
Paths: (1 available, best #1, table OSPF)
  Advertised to update-groups:
     1
  Refresh Epoch 1
  Local
    13.0.0.8 (metric 2) (via default) from 13.0.0.12 (13.0.0.12)
      Origin incomplete, metric 1, localpref 100, valid, internal, best
      Extended Community: RT:13:2 OSPF ROUTER ID:10.4.8.8:0 OSPF RT:0.0.0.0:2:0
      Originator: 13.0.0.8, Cluster list: 13.0.0.12
      Connector Attribute: count=1
       type 1 len 12 value 13:2:13.0.0.8
      mpls labels in/out 5021/8007
      rx pathid: 0, tx pathid: 0x0

I suspect these inconsistencies are related to the MVPN methods used. AS 13 uses the GRE method, which appears to always attach the BGP connector attribute, whereas the mLDP method appears less consistent. Since MVPN ASM clearly cannot work as the RP information cannot transit the VPN, we will quickly test SSM. We expect to have a similar problem as the VPN route 10.4.4.4/32 carries a connector attribute specifying CSR8 as the originator. The RPF check expects traffic to come from CSR8, not the local ingress ASBR CSR6.

R2#show bgp vpnv4 unicast vrf OSPF 10.4.4.4/32 bestpath
BGP routing table entry for 24:2:10.4.4.4/32, version 35
Paths: (2 available, best #2, table OSPF)
  Advertised to update-groups:
     1
  Refresh Epoch 1
  13, (Received from a RR-client)
    24.0.0.6 (metric 20) (via default) from 24.0.0.6 (24.0.0.6)
      Origin incomplete, metric 0, localpref 100, valid, internal, best
      Extended Community: RT:24:2 OSPF ROUTER ID:10.4.8.8:0 OSPF RT:0.0.0.0:2:0
      Connector Attribute: count=1
       type 1 len 12 value 13:2:13.0.0.8
      mpls labels in/out nolabel/6036
      rx pathid: 0, tx pathid: 0x0

R2#show ip rpf vrf OSPF 10.4.4.4
RPF information for ? (10.4.4.4)
  RPF interface: Lspvif1
  RPF neighbor: ? (13.0.0.8)
  RPF route/mask: 10.4.4.4/32
  RPF type: unicast (bgp 24)
  Doing distance-preferred lookups across tables
  BGP originator: 13.0.0.8
  RPF topology: ipv4 multicast base, originated from ipv4 unicast base

We can demonstrate the problem by configuring an IGMP SSM join on CSR9 to receive traffic from CSR4. CSR2 will receive the C(S,G) join from CSR9 for (10.4.4.4, 232.9.9.9). CSR2’s RPF neighbor for this C-source is 13.0.0.8, which is accessible via the PMSI. CSR2 attempts to send the join, but cannot. The C(S,G) state looks correct on CSR2, but no other OSPF VPN routers will have this state entry. So long as the BGP connector attributes are used in an option AB network (they are supposed to be optional attributes, but modification appears impossible), MVPN will be broken at points. The attribute origination inconsistencies make it increasingly challenging to test this feature.

! CSR9
interface Loopback0
 ip pim sparse-mode
 ip igmp join-group 232.9.9.9 source 10.4.4.4

R2#debug ip pim vrf OSPF
PIM debugging is on
PIM(2): Received v2 Join/Prune on GigabitEthernet2.529 from 10.2.9.9, to us
PIM(2): Join-list: (10.4.4.4/32, 232.9.9.9), S-bit set
PIM(2): Add GigabitEthernet2.529/10.2.9.9 to (10.4.4.4, 232.9.9.9), Forward state, by PIM SG Join

R2#show ip mroute vrf OSPF 232.9.9.9 | begin \(
(10.4.4.4, 232.9.9.9), 00:03:34/00:02:50, flags: sT
  Incoming interface: Lspvif1, RPF nbr 13.0.0.8
  Outgoing interface list:
    GigabitEthernet2.529, Forward/Sparse, 00:03:34/00:02:50

R5#show ip mroute vrf OSPF 232.9.9.9
Group 232.9.9.9 not found

R6#show ip mroute vrf OSPF 232.9.9.9
Group 232.9.9.9 not found

R8#show ip mroute vrf OSPF 232.9.9.9
Group 232.9.9.9 not found

8.4.4.4 MPLS TE

MPLS TE works identically for option AB as it does for option A. The tunnel configurations from option A are quickly examined again for completeness. Re-using tunnel100 on XRv2 and CSR2, defined in the option A section, I quickly trace the LSPs to prove the correct operation. From XRv2, we can see the TE tunnel to CSR5 is operational. Tracing the LSP for the EIGRP VPN route 10.1.1.1/32, we see that the path via CSR5 is the best-path due to having the lower BGP RID. The VPN label bound is 5016.

RP/0/0/CPU0:XRv2#show mpls traffic-eng tunnels 100 brief
             TUNNEL NAME         DESTINATION      STATUS  STATE
            tunnel-te100            13.0.0.5          up  up
Displayed 1 (of 1) heads, 0 (of 0) midpoints, 0 (of 0) tails
Displayed 1 up, 0 down, 0 recovering, 0 recovered heads


RP/0/0/CPU0:XRv2#show bgp vpnv4 unicast vrf EIGRP 10.1.1.1 | begin 24,
  24, (Received from a RR-client)
    13.0.0.5 (metric 3) from 13.0.0.5 (13.0.0.5)
      Received Label 5016
      Origin incomplete, metric 0, localpref 100, valid, internal, best, group-best, import-candidate, imported, import suspect
      Received Path ID 0, Local Path ID 1, version 119
      Extended community: EIGRP route-info:0x8000:0 EIGRP AD:3:288 EIGRP RHB:255:1:2560 EIGRP LM:0xff:1:1500 EIGRP VRR:0x0:1.1.1.10 RT:13:3
      Connector: type: 1, Value:13:3:13.0.0.5
      Source VRF: EIGRP, Source Route Distinguisher: 13:3
  Path #2: Received by speaker 0
  Not advertised to any peer
  24, (Received from a RR-client)
    13.0.0.11 (metric 3) from 13.0.0.11 (13.0.0.11)
      Received Label 91003
      Origin incomplete, localpref 100, valid, internal, import-candidate, imported, import suspect
      Received Path ID 0, Local Path ID 0, version 0
      Extended community: EIGRP route-info:0x8000:0 EIGRP AD:3:288 EIGRP RHB:255:1:2560 EIGRP LM:0xff:1:1500 EIGRP VRR:0x0:1.1.1.10 RT:13:3
      Connector: type: 1, Value:13:3:13.0.0.11
      Source VRF: EIGRP, Source Route Distinguisher: 13:3

The VPN next-hop is reachable via IGP through a TE tunnel, which implies the RSVP-TE label is used.

RP/0/0/CPU0:XRv2#show route 13.0.0.5
Routing entry for 13.0.0.5/32
  Known via "ospf 13", distance 110, metric 3, type intra area
  Routing Descriptor Blocks
    13.0.0.5, from 13.0.0.5, via tunnel-te100
      Route metric is 3
  No advertising protos.

The TE tunnel routes via XRv1, has a bandwidth allocation of 5 Mbps, and uses label 91010. The label stack becomes {91010 5016}.

RP/0/0/CPU0:XRv2#show mpls traffic-eng tunnels 100 detail | begin Path Info
  Path Info:
    Outgoing:
      Explicit Route:
        Strict, 13.11.12.11
        Strict, 13.5.11.5
        Strict, 13.0.0.5
    Record Route: Disabled
    Tspec: avg rate=5000 kbits, burst=1000 bytes, peak rate=5000 kbits
[snip]


RP/0/0/CPU0:XRv2#show mpls traffic-eng tunnels 100 detail | include Label Outgoing Interface: GigabitEthernet0/0/0/0.521, Outgoing Label: 91010

XRv1 is the penultimate hop, so it pops the TE label and exposes the VPN label to CSR5. Since the EIGRP VPN is not enabled for option AB CSC, the inter-AS traffic is raw IP. The entire label stack is removed on CSR5 with the IP packets sent to 10.5.6.6 inside VRF EIGRP.

RP/0/0/CPU0:XRv1#show mpls forwarding labels 91010
Local  Outgoing    Prefix             Outgoing      Next Hop        Bytes
Label  Label       or ID              Interface                     Switched
------ ----------- ------------------ ------------- --------------- ----------
91010  Pop         100                Gi0/0/0/0.551 13.5.11.5       12395

R5#show mpls forwarding-table labels 5016 detail
Local      Outgoing   Prefix           Bytes Label   Outgoing   Next Hop
Label      Label      or Tunnel Id     Switched      interface
5016       No Label   10.1.1.1/32[V]   0             Gi2.5563   10.5.6.6
        MAC/Encaps=22/22, MRU=1504, Label Stack{}
        005056A9DE0D005056A9DC6381000DE4810000030800
        VPN route: EIGRP
        No output feature configured

The TE tunnel in AS 24 is also operational. CSR6’s best path to 10.1.1.1/32 in the EIGRP VPN is via CSR2. The outgoing label is 2016 and the next-hop is 24.0.0.2.

R6#show mpls traffic-eng tunnels tunnel 100 brief | begin TUNNEL
TUNNEL NAME                      DESTINATION      UP IF      DOWN IF    STATE/PROT
R6_t100                          24.0.0.2                    Gi2.567    up/up

R6#show bgp vpnv4 unicast vrf EIGRP 10.1.1.1/32 bestpath
BGP routing table entry for 24:3:10.1.1.1/32, version 18
Paths: (1 available, best #1, table EIGRP)
  Advertised to update-groups:
     4          1
  Refresh Epoch 1
  Local
    24.0.0.2 (metric 20) (via default) from 24.0.0.2 (24.0.0.2)
      Origin incomplete, metric 10880, localpref 100, valid, internal, best
      Extended Community: RT:24:3 Cost:pre-bestpath:128:10880 0x8800:32768:0
        0x8801:3:288 0x8802:65281:2560 0x8803:65281:1500 0x8806:0:167837953
      mpls labels in/out IPv4 VRF Aggr:6050/2016
      rx pathid: 0, tx pathid: 0x0

The VPN next-hop is reachable via IGP, again through a TE tunnel. The RSVP-TE label is 7012 with CSR7 as the next-hop. The label stack becomes {7012 2016} at imposition since CSR6 is effectively an ingress LER.

R6#show ip route 24.0.0.2
Routing entry for 24.0.0.2/32
  Known via "isis", distance 115, metric 20, type level-2
  Redistributing via isis 24
  Last update from 24.0.0.2 on Tunnel100, 00:09:31 ago
  Routing Descriptor Blocks:
  * 24.0.0.2, from 24.0.0.2, 00:09:31 ago, via Tunnel100
      Route metric is 20, traffic share count is 1

R6#show ip rsvp reservation detail filter session-type 7 destination 24.0.0.2
Reservation:
  Tun Dest:   24.0.0.2  Tun ID: 100  Ext Tun ID: 24.0.0.6
  Tun Sender: 24.0.0.6  LSP ID: 1
  Next Hop: 24.6.7.7 on GigabitEthernet2.567
  Label: 7012 (outgoing)
  Reservation Style is Shared-Explicit, QoS Service is Controlled-Load
  Resv ID handle: 01000409.
  Average Bitrate is 0 bits/sec, Maximum Burst is 1K bytes
  Min Policed Unit: 0 bytes, Max Pkt Size: 1500 bytes
  Status:
  Policy: Accepted. Policy source(s): MPLS/TE

CSR7 performs a label pop to expose the VPN label to CSR2. CSR2 removes all labels and forwards the IP packets to CSR1 inside of the EIGRP VPN.

R7#show mpls forwarding-table labels 7012
Local      Outgoing   Prefix           Bytes Label   Outgoing   Next Hop
Label      Label      or Tunnel Id     Switched      interface
7012       Pop Label  24.0.0.6 100 [1] 18222         Gi2.527    24.2.7.2

R2#show mpls forwarding-table labels 2016 detail
Local      Outgoing   Prefix           Bytes Label   Outgoing   Next Hop
Label      Label      or Tunnel Id     Switched      interface
2016       No Label   10.1.1.1/32[V]   0             Gi2.512    10.1.2.1
        MAC/Encaps=18/18, MRU=1504, Label Stack{}
        005056A91AAA005056A9BE8A81000DB80800
        VPN route: EIGRP
        No output feature configured

We quickly verify the LSP using traceroute inside of the VPN. Everything works identically as it did for option A, since the intra-AS TE tunnels just change the routing to the egress ASBRs (also fulfilling the role of LERs when CSC is not in use). The transit link remains raw IP.

R3#traceroute 10.1.1.1 source 10.3.3.3
Type escape sequence to abort.
Tracing the route to 10.1.1.1
VRF info: (vrf in name/id, vrf out name/id)
  1 10.3.12.12 8 msec 2 msec 2 msec
  2 13.11.12.11 [MPLS: Labels 91010/5016 Exp 0] 7 msec 7 msec 6 msec
  3 10.5.6.53 [MPLS: Label 5016 Exp 0] 16 msec 16 msec 15 msec
  4 10.5.6.6 20 msec 9 msec 12 msec
  5 24.6.7.7 [MPLS: Labels 7012/2016 Exp 0] 12 msec 17 msec 21 msec
  6 10.1.2.2 [MPLS: Label 2016 Exp 0] 20 msec 18 msec 20 msec
  7 10.1.2.1 21 msec 12 msec 12 msec
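The label operations along this path can be replayed with a short Python sketch. The hop list is assembled from the LFIB and RSVP outputs in this section; the tuple layout and the function name are assumptions made for illustration, and the stack is written top label first.

```python
# Each hop applies one operation to the label stack (top label first).
# Operations are taken from the outputs in this section.
path = [
    ("XRv2", "push", [91010, 5016]),  # TE label over VPN label
    ("XRv1", "pop", None),            # penultimate hop pops the TE label
    ("CSR5", "pop", None),            # "No Label": raw IP on the transit link
    ("CSR6", "push", [7012, 2016]),   # ingress LER in AS 24
    ("CSR7", "pop", None),            # penultimate hop pops the TE label
    ("CSR2", "pop", None),            # "No Label": raw IP towards CSR1
]


def walk(hops):
    """Return the label stack each router transmits towards its next hop."""
    stack = []
    sent = []
    for router, operation, labels in hops:
        if operation == "push":
            stack = list(labels) + stack
        elif operation == "pop":
            stack = stack[1:]
        sent.append((router, list(stack)))
    return sent


result = walk(path)
for router, stack in result:
    print(router, stack)
```

Each transmitted stack corresponds to what the following hop reports in the traceroute: XRv1 receives 91010/5016, CSR5 receives 5016, CSR6 receives an unlabeled packet, CSR7 receives 7012/2016, and CSR2 receives 2016.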

8.4.5 Confederation variation
Inter-AS option AB confederation is not currently supported in XE. Once the confederation configurations are applied, the “inter-as-hybrid” command is automatically removed. Adding it back generates the error message shown below. This chapter is a placeholder in the event this feature is supported in the future.

R6(config-router-af)#neighbor 10.5.6.5 inter-as-hybrid
%BGP: this command is only valid for eBGP peers

9. Describe multicast P2MP TE
Because this fits within the scope of MVPN profiles, it is documented there in the appropriate chapters. Specifically, it is documented under profiles 8 and 10, and both tests include configuration files.

10. Describe EVPN (EVPN and PBB-EVPN)
Ethernet VPN was developed as a next-generation L2VPN technology to overcome existing VPLS limitations. Since VPLS acts like an Ethernet switch by learning MAC addresses (data-plane learning), its scalability can be limited by the number of MAC addresses in the customer’s network. For this reason, many VPLS deployments have a small number of MAC addresses behind each site, which is not appropriate for data-center interconnect (DCI) service. EVPN also introduces per-flow load balancing, which is somewhat available in VPLS given the flow-aware transport PW (FAT-PW) label, but EVPN improves upon it. EVPN also supports multi-homing at layer 2 to prevent loops and duplicate packets; its lack of fully/partially meshed PWs means that it has the same scalability as MPLS L3VPN. It uses a new BGP SAFI (EVPN, carried under the L2VPN AFI) to carry new EVPN routes that essentially enable “MAC routing”. Multicast inclusion can also be signaled to indicate interest in multicast reception.

Both EVPN and PBB-EVPN are discussed in this section with a limited demonstration. XRv does not support the L2VPN data-plane, and XE platforms (ASR1000, CSR1000v, etc.) only support EVPN BGP features and none of the data-plane features at all. In this way, both XRv and CSR1000v make good EVPN route-reflectors. For data-plane support, ASR9000 series are the best choice.

First, we must understand several key definitions.

BUM: Broadcast, multicast, or unknown unicast. Though not specific to EVPN, this is used extensively in the descriptions and is defined here for clarity. BUM traffic generally requires special handling by EVPN and is often treated like a multicast Ethernet frame.

EVI: Ethernet VPN Identifier. This is analogous to the virtual circuit ID (VCID) seen in VPLS which identifies an EVPN to BGP. An EVI may support multiple bridge-domains. Classic design options, such as E-LINE/E-LAN/E-TREE (port-based transparent service) and EV-LINE/EV-LAN/EV-TREE (VLAN-based service) are still available, but we can also map specific C-VLANs to EVIs flexibly.

ES: Ethernet segment. This is analogous to an access circuit (AC) in VPLS which defines a link to the customer. An ES differs from an AC in that an ES actually defines a “site” which could be connected to multiple PEs. As such, multiple PEs could share the same ES if the customer is multi-homed at layer 2. Thus, the ES could be a single CE device or an entire network if the customer has multiple CE devices. A customer can follow one of these four models:
1. Single-homed device (SHD). This is a single device connected to a single PE which terminates the broadcast domain (like a router).
2. Multi-homed device (MHD) using LACP. This is a single device connected to multiple PEs. The CE device terminates the broadcast domain.
3. Single-homed network (SHN). Like SHD, this is a single device connected to a single PE, but with a layer 2 network behind it.
4. Multi-homed network (MHN) using LACP. This consists of multiple CE devices connected to multiple PE devices with a layer 2 network behind the CEs.

ESI: Ethernet Segment Identifier. Every ES is identified by a unique ESI. It is a 10 byte field that can be derived in one of three ways. Of note, on singly-attached networks, the ESI is always 0; there is no compelling reason to track the ESI in these cases.
1. LACP: Based on the CE’s information, the first 2 bytes will be the CE’s system priority, the next 6 bytes will be the CE’s LACP system-ID (like a MAC address), and the last 2 bytes will be a port key, if defined. This information is carried in the LACPDUs received from the CE.
2. MST: If LACP is not in use, MST can be used to create the ESI for an ES. This is based on the CIST’s root parameters: the first 2 bytes are the bridge priority, the next 6 bytes are the bridge MAC address, and the last 2 bytes are always 0.
Essentially, the 8 byte bridge ID is used with trailing zeroes.
3. Manual configuration: If LACP and MST are not in use, it can be manually configured. The manual configuration only allows main interfaces to be used, not subinterfaces, and requires input that looks like LACP. One will specify the system priority, system ID, and port key for a given Ethernet interface, which represents the ES from the PE’s perspective. This would need to be the same on all PEs servicing the same ES.

I-SID: Service instance identifier. An identifier telling the router where to map a customer frame. It represents a service instance like an EFP. Different I-SIDs can be mapped to the same EVI if their topologies are similar, or to different EVIs if their topologies are different. The I-SID is a 24-bit value that is used by the PBB edge domain to determine how frames should be forwarded.

There are two different redundancy models when multi-homing:
1. Single-active: One PE is active and all others are in standby, which means only this active PE is allowed to forward traffic towards the customer. VPLS also supports this option by way of active/backup PWs.
2. All-active: All PEs can forward traffic down to the customer concurrently. This does not apply to BUM traffic as it could cause loops, but is valid for unicast frames. VPLS cannot support this functionality.

There are 4 new types of BGP routes used for the BGP EVPN address-family. Each one is referenced by its “type” number. Both EVPN and PBB-EVPN use these routes, except that type 1 routes only apply to EVPN.
1. Ethernet auto-discovery (A-D) route: This route is used to signal MAC withdrawals, aliasing, and split-horizon labels. This applies to EVPN only and is not used for PBB-EVPN. These details are described in the EVPN section.
2. MAC advertisement route: This carries MAC addresses along with EVPN ESIs and MPLS labels.
3. Inclusive multicast route: This route uses provider multicast service interface (PMSI) attributes to represent ingress replication (IR); MVPN technologies are discussed at length in a dedicated section. Other multicast tunneling options can be configured as well, but IR is the default since it does not assume an LSM architecture is available. This also carries the root of the multicast delivery tree (a subcomponent of PMSI) and an MPLS label.
4. Ethernet Segment (ES) route: This route carries information relevant for multi-homed layer 2 connections (redundancy via ICCP) and the designated forwarder (DF) election. In networks without multi-homing, these routes are not shown.

Both EVPN and PBB-EVPN must also prevent loops via split-horizon mechanisms. Since these mechanisms differ between EVPN and PBB-EVPN, split-horizon is discussed in the individual sections.

In multi-homed deployments, the SP network must identify designated forwarders (DF) which are allowed to send traffic towards the customer. If a BUM frame is flooded across the network between ESes, multiple remote PEs on the other end of the network may each flood the same BUM frame to the customer, which is not ideal. Instead, only one of the PEs is allowed to do this.
The DF election is the same for EVPN and PBB-EVPN and operates as follows:
1. Identify each PE by its lowest IP address. The PEs connected to an ES are then assigned an index starting at 0. For example, if PE1 has address 1.11.11.11 and PE2 has address 1.12.12.12, then PE1 is assigned index 0 and PE2 is assigned index 1. This is the PE “ordered list” and has length ‘n’.
2. Evaluate [EVI modulo ‘n’] for each EVI. If EVIs 1000 and 1001 are used, the following occurs:
   a. EVI 1000 modulo 2 = 0, therefore PE1 is the DF and PE2 is the backup DF (BDF)
   b. EVI 1001 modulo 2 = 1, therefore PE2 is the DF and PE1 is the BDF

Aliasing is a technique used to ensure that, when operating in all-active mode, all PEs are aware of all MAC addresses in an ES, even if they have not received traffic from that MAC address. For example, if a CE is multi-homed to PE1 and PE2 using LACP, and a unicast frame is forwarded to PE1, PE2 will not know about the source MAC. If PE2 receives a frame from the core destined to this MAC address, it won’t know how to forward it unless PE1 tells it how. Aliasing is handled differently in EVPN and PBB-EVPN and is discussed there.

When using the active-standby (single-active) model, there needs to be a mechanism to notify the backup PE that it should become the active PE. Both EVPN and PBB-EVPN signal this using BGP route withdrawal, but they do so differently.

10.1 EVPN
After searching for some time, I was not able to find a configuration example that shows an EVPN implementation on IOS XR. It is discussed in detail in many presentations but the configuration examples all refer to PBB, probably because PBB-EVPN scales much better and there isn’t much benefit to EVPN over PBB-EVPN. This section will briefly describe how EVPN works.

In VPLS, the split-horizon rule states that traffic arriving on a PW cannot be sent out of any other PW. One of the few reasons to relax this rule is to support H-VPLS, but generally it makes sense. It is also why VPLS requires a full mesh of PWs; without it, some kind of spanning-tree protocol would be needed. EVPN may suffer from similar problems in multi-homed layer 2 environments. If two PEs (PE1 and PE2) are connected to the same CE which is running multi-chassis LACP (one link to each PE), then the possibility of a loop exists when the CE sends BUM traffic. If the CE selects PE1 to send a multicast frame, the frame will be replicated to all other remote PEs, including PE2, despite PE2 serving the same ES. The solution is to use a split-horizon label; the label is included with the BUM frame and identifies the source ES of the frame. When PE2 receives it, it drops the traffic as it is already connected to the source ES via LACP.

Aliasing in EVPN, like split-horizon, is achieved using the type 1 route. When a PE learns a C-MAC from the CE, it advertises this C-MAC into BGP along with the ESI from which the MAC was learned.
Any other PE attached to this ES will evaluate the ESI, see a match, then know that this MAC address is reachable via the connected ES shared with the PE that originated the BGP route.

Backup path signaling is done by withdrawing the type 1 routes associated with a MAC address. Since there is a type 1 BGP route for each C-MAC, each one is withdrawn for a given ESI, which signals the backup PE to become active.

10.2 PBB-EVPN
PBB-EVPN extends the scalability of EVPN by combining the 802.1ah MAC-in-MAC (or PBB) logic into EVPN. Each PE has a backbone MAC (B-MAC) which is used to identify that PE. In this way, only one type 2 (MAC advertisement) route is created per PE in each EVI, versus one per C-MAC in regular EVPN. This is why the type 1 route is not needed in PBB since MAC withdrawals and aliasing are not relevant.

Split-horizon is also possible in PBB-EVPN as in EVPN. However, a split-horizon label is not necessary to prevent loops with PBB-EVPN. PEs connected to the same ES will use the same B-MAC for that ES: that is, there is a 1:1 mapping between B-MAC and ESI. The receiving PEs simply check the source B-MAC on incoming BUM frames and compare it against their own. If the two match, the packet is dropped.

Aliasing in PBB-EVPN is simpler than in EVPN, much like split-horizon. Since two PEs connected to the same ES will have the same B-MAC, if a PE receives a frame destined for its own B-MAC but doesn’t have the explicit C-MAC installed, it can safely assume this C-MAC is reachable via the connected ES. No explicit routing is needed since the traffic would not have been sent to this PE if the remote PE (across the core) wasn’t already aware that it was mapped to the proper B-MAC shared by other PEs on the ES.

For backup path signaling, since type 1 BGP routes are not used for PBB-EVPN, a router simply withdraws its type 2 route for that ESI. Because the B-MAC is bound to that ESI with PBB, other routers will know that when a PE loses connectivity to an ES, that B-MAC cannot be used for forwarding anymore. The other PEs on the ES will have the same B-MAC so there is no need for per-MAC withdrawals as with EVPN.

PBB operations are similar to EVPN with some exceptions. First, the concept of a B-MAC is mapped to an ESI on a 1:1 basis. The B-MAC can be automatically derived from the LACP system ID of the CE by flipping the U/L bit, sort of like EUI-64 for IPv6 SLAAC. In this way, all PEs connected to an ES will have the same B-MAC. The route distinguishers (RDs) are still unique per PE as they are auto-generated based on the BGP router-IDs.

An EVPN type 2 route will include several key components, including the RD, ESI, B-MAC, MPLS label, and route-target (RT). The label is a downstream label advertised to other PEs to use when sending traffic towards this B-MAC within a given EVI. When multiple PEs exist on a segment, their B-MACs will be the same, but their labels will not be. Like the RD, the RT is automatically configured based on the BGP ASN and the EVI. This is like VPLS auto-route-target and can be overridden if necessary. This ensures that all endpoints within an EVI are connected in a LAN topology by default (the RTs can be adjusted to build a tree, etc.).

Since PBB control-plane is supported on XRv, we will look at some of the PBB details. The network diagram is kept simple with 3 PBB edge routers and some CE routers as well. Both XRv4 and CSR6 are EVPN route-reflectors; this is the extent of IOS XE’s support for EVPN at present. The remaining CSRs are CE routers with OSPFv3 enabled for IPv4 and IPv6. If PBB-EVPN ever suddenly starts working on XRv, then they should form neighbors according to the EVPN configuration. CSR5, CSR7, and CSR8 share a LAN, while CSR9 and CSR10 do the same in a different VPN.
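The DF election procedure described earlier in this chapter is mechanical enough to sketch in a few lines. The example below follows the steps exactly as this guide states them (sort the PEs by IP address, index from 0, then take EVI modulo n); the function name is hypothetical, and treating every non-DF PE as a backup is my own simplification for segments with more than two PEs.

```python
import ipaddress


def elect_df(pe_addresses, evi):
    """Return (DF, list of remaining PEs) for one EVI on one Ethernet segment."""
    # Step 1: build the ordered list, lowest IP address first (index 0).
    ordered = sorted(pe_addresses, key=ipaddress.IPv4Address)
    n = len(ordered)
    # Step 2: the PE whose index equals EVI modulo n is the DF.
    df = ordered[evi % n]
    backups = [pe for pe in ordered if pe != df]
    return df, backups


pes = ["1.12.12.12", "1.11.11.11"]   # PE2 and PE1 from the example above
for evi in (1000, 1001):
    df, backups = elect_df(pes, evi)
    print(f"EVI {evi}: DF={df} BDF={backups}")
```

Because consecutive EVIs land on different indexes, the DF role is spread across the PEs on a segment rather than pinned to one of them.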


The basic configurations are shown below using XRv3 as the demonstration, since XRv1 and XRv2 basic PBB configurations are identical. As discussed earlier, we identify PBB “edge” and “core” bridge-domains. PBB edge domains are assigned I-SIDs which identify ESes/VLANs and map to interfaces. PBB core domains specify the EVI used which is the basis for RT generation. Every PBB edge I-SID must map to a core bridge-domain so the router can build the internal plumbing from the EFP to the MPLS core. Using terminology from the 802.1ah (MAC in MAC) section, the PBB edge domain handles the service component (I-SID) and the PBB core domain handles the backbone component (B-MAC and B-tag). This multi-stage encapsulation allows for maximal flexibility when building EVPNs. XRv3 runs BGP to XRv4 and CSR6 which are EVPN route-reflectors. Of note, XRv3 has a second CE and thus defines a second PBB edge domain. It shares the same bridge group as the other PBB edges but uses a separate bridge-domain. This will become relevant when looking at BGP routes.

! XRv3 (similar for XRv1 and XRv2)
interface GigabitEthernet0/0/0/0.503 l2transport
 encapsulation dot1q 3503 exact
 rewrite ingress tag pop 1 symmetric
interface GigabitEthernet0/0/0/0.573 l2transport
 encapsulation dot1q 3573 exact
 rewrite ingress tag pop 1 symmetric
l2vpn
 pbb
  backbone-source-mac 0000.0013.0013
 bridge group PBB_CORE
  bridge-domain PBB_CORE
   pbb core
    evpn evi 1000
 bridge group PBB_EDGES
  bridge-domain PBB_EDGE_1
   pbb edge i-sid 1001 core-bridge PBB_CORE
   interface GigabitEthernet0/0/0/0.573
  bridge-domain PBB_EDGE_2
   pbb edge i-sid 1002 core-bridge PBB_CORE
   interface GigabitEthernet0/0/0/0.503
router bgp 1
 bgp router-id 1.13.13.13
 address-family l2vpn evpn
 neighbor-group IBGP
  remote-as 1
  timers 10 40
  update-source Loopback0
  address-family l2vpn evpn
 neighbor 1.6.6.6
  use neighbor-group IBGP
 neighbor 1.14.14.14
  use neighbor-group IBGP

There is nothing exciting about the BGP configurations on XRv4 and CSR6. They are just RRs for the L2VPN EVPN SAFI and don’t do anything fancy. This is just demonstrating the RR capability on XE and XR.

! XRv4
router bgp 1
 bgp router-id 1.14.14.14
 bgp cluster-id 1.14.14.14
 address-family l2vpn evpn
 neighbor-group IBGP
  remote-as 1
  timers 10 40
  update-source Loopback0
  address-family l2vpn evpn
   route-reflector-client
 neighbor 1.11.11.11
  use neighbor-group IBGP
 neighbor 1.12.12.12
  use neighbor-group IBGP
 neighbor 1.13.13.13
  use neighbor-group IBGP

! CSR6
router bgp 1
 no bgp default ipv4-unicast
 neighbor IBGP peer-group
 neighbor IBGP remote-as 1
 neighbor IBGP update-source Loopback0
 neighbor IBGP timers 10 40
 neighbor 1.11.11.11 peer-group IBGP
 neighbor 1.12.12.12 peer-group IBGP
 neighbor 1.13.13.13 peer-group IBGP
 address-family l2vpn evpn
  neighbor IBGP route-reflector-client
  neighbor 1.11.11.11 activate
  neighbor 1.12.12.12 activate
  neighbor 1.13.13.13 activate

First and most simply, we can verify the B-MAC configuration worked on one of the routers. In verifying this, we can also see the “chassis” MAC which would have been used had we not configured one.

RP/0/0/CPU0:XRv3#show l2vpn pbb backbone-source-mac
Backbone Source MAC: 0000.0013.0013
Chassis MAC        : 0b16.212c.3742
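In this lab the B-MAC is configured by hand, but the PBB-EVPN overview mentioned that it can also be derived from the CE’s LACP system ID by flipping the U/L bit. A minimal sketch of that bit flip is shown below. The function name and the sample system ID 0011.2233.4455 are hypothetical (no LACP is running in this topology), and this is not a claim about how any specific platform implements the derivation.

```python
def bmac_from_lacp_system_id(system_id):
    """Flip the universal/local bit of a 6-byte LACP system ID."""
    digits = system_id.replace(".", "").replace(":", "").replace("-", "")
    raw = bytearray.fromhex(digits)
    if len(raw) != 6:
        raise ValueError("LACP system ID must be 6 bytes")
    raw[0] ^= 0x02  # U/L bit: second-least-significant bit of the first octet
    text = raw.hex()
    # Return the result in Cisco dotted notation (xxxx.xxxx.xxxx).
    return ".".join(text[i:i + 4] for i in range(0, 12, 4))


# Hypothetical CE system ID; every PE on the same ES sees the same value.
print(bmac_from_lacp_system_id("0011.2233.4455"))
```

Since every PE on the segment receives the same system ID in the CE’s LACPDUs, they all arrive at the same B-MAC without any coordination between them.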

We can look at XRv3 to see all routes in the topology. So far, we only see type 2 and type 3 routes. This makes sense because type 1 only applies to EVPN (which does not appear configurable on XR) and type 4 only applies to multi-homed environments (multi-chassis LACP not supported on virtual platforms). Giving this output a closer look, we notice the word “vrf” used to identify PBB_CORE. This is odd because PBB_CORE is a bridge-domain that maps to an EVI. However, the cosmetics of the output treat this like a VRF since it maps to an RD. It would make sense, then, for the administrator to name the PBB core bridge-domains based on the EVIs they support.

Next, we can see that each PE generates exactly one type 2 MAC advertisement route. They carry a 48-bit MAC address which represents the manually-configured B-MAC on each router; I manually configure it for ease of understanding, but doing so is not necessary. We then see several type 3 Inclusive Multicast routes. We see 5 of these, so clearly PEs can advertise more than one, unlike type 2 routes. In this case, all PEs are part of I-SID 1001, which is a PBB edge domain, and thus each one advertises a route. I-SID 1002 is only present on XRv2 and XRv3, so only those two PEs generate a route for this I-SID. Because these are multicast inclusion routes, this ensures that multicast traffic doesn’t leak across PBB edge domains even if they share the same PBB core domain.

RP/0/0/CPU0:XRv3#show bgp l2vpn evpn rd 1.13.13.13:1000 | begin Network
   Network            Next Hop            Metric LocPrf Weight Path
Route Distinguisher: 1.13.13.13:1000 (default for vrf PBB_CORE)
*>i[2][0][48][0000.0011.0011][0]/104
                      1.11.11.11                    100      0 i
*>i[2][0][48][0000.0012.0012][0]/104
                      1.12.12.12                    100      0 i
*> [2][0][48][0000.0013.0013][0]/104
                      0.0.0.0                                0 i
*>i[3][1001][32][1.11.11.11]/80
                      1.11.11.11                    100      0 i
*>i[3][1001][32][1.12.12.12]/80
                      1.12.12.12                    100      0 i
*> [3][1001][32][1.13.13.13]/80
                      0.0.0.0                                0 i
*>i[3][1002][32][1.12.12.12]/80
                      1.12.12.12                    100      0 i
*> [3][1002][32][1.13.13.13]/80
                      0.0.0.0                                0 i
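When many EVIs are present, it can help to pull these bracketed prefix strings apart with a script. The sketch below parses only the two route types seen in the output above. The field positions (MAC in the fourth bracket of a type 2 route; I-SID in the second and originating IP in the fourth bracket of a type 3 route) are my reading of this particular output, not a documented XR format, and all names are hypothetical.

```python
import re
from collections import Counter

BRACKETED = re.compile(r"\[([^\]]*)\]")


def parse_evpn_prefix(prefix):
    """Parse an XR-style EVPN prefix such as [3][1001][32][1.11.11.11]/80."""
    body, _, length = prefix.rpartition("/")
    fields = BRACKETED.findall(body)
    route_type = int(fields[0])
    if route_type == 2:
        return {"type": 2, "mac": fields[3], "length": int(length)}
    if route_type == 3:
        return {"type": 3, "isid": int(fields[1]),
                "originator": fields[3], "length": int(length)}
    # Other route types are returned unparsed.
    return {"type": route_type, "fields": fields[1:], "length": int(length)}


# Prefix strings copied from the XRv3 output above.
PREFIXES = [
    "[2][0][48][0000.0011.0011][0]/104",
    "[2][0][48][0000.0012.0012][0]/104",
    "[2][0][48][0000.0013.0013][0]/104",
    "[3][1001][32][1.11.11.11]/80",
    "[3][1001][32][1.12.12.12]/80",
    "[3][1001][32][1.13.13.13]/80",
    "[3][1002][32][1.12.12.12]/80",
    "[3][1002][32][1.13.13.13]/80",
]

routes = [parse_evpn_prefix(p) for p in PREFIXES]
per_type = Counter(r["type"] for r in routes)
isid_1002 = [r["originator"] for r in routes
             if r["type"] == 3 and r["isid"] == 1002]
print(per_type)
print(isid_1002)
```

The counts match the discussion above: one type 2 route per PE and one type 3 route per PE per I-SID.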

We can also view information by specifying the PBB core domain by name (versus using the RD), much like using a VRF name. In this way, we will examine the type 2 route from XRv2. Like MVPN, EVPN BGP show commands require you to enter the entire prefix string on XR. Below, we can see many of the key fields carried inside this route. The MAC address is shown in the prefix, but the MPLS label, RT, and ESI are carried inside of the route itself. Much like VPNv4/v6, the B-MAC address 0000.0012.0012 is bound to local label 92004 on XRv2. The ESI is always 0 unless multi-homing is configured, so we ignore that for now.

RP/0/0/CPU0:XRv3#show bgp l2vpn evpn bdomain PBB_CORE [2][0][48][0000.0012.0012][0]/104
BGP routing table entry for [2][0][48][0000.0012.0012][0]/104, Route Distinguisher: 1.13.13.13:1000
[snip]
Local
  1.12.12.12 (metric 20) from 1.6.6.6 (1.12.12.12)
    Received Label 92004
    Origin IGP, localpref 100, valid, internal, best, group-best, import-candidate, imported, rib-install
    Received Path ID 0, Local Path ID 1, version 10
    Extended community: RT:1:1000
    Originator: 1.12.12.12, Cluster list: 1.6.6.6
    EVPN ESI: 0000.0000.0000.0000.0000
    Source VRF: default, Source Route Distinguisher: 1.12.12.12:1000
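For contrast with the all-zero ESI above, the sketch below builds the 10-byte value that the LACP method would produce, following the layout given in the ESI definition earlier in this chapter (2-byte system priority, 6-byte system ID, 2-byte port key). The input values are hypothetical, since this lab has no multi-homed CE, and the function name is my own.

```python
def esi_from_lacp(system_priority, system_id, port_key=0):
    """Build a 10-byte ESI: priority (2) + LACP system ID (6) + port key (2)."""
    mac = bytes.fromhex(system_id.replace(".", ""))
    if len(mac) != 6:
        raise ValueError("LACP system ID must be 6 bytes")
    raw = (system_priority.to_bytes(2, "big")
           + mac
           + port_key.to_bytes(2, "big"))
    text = raw.hex()
    # Print in the same dotted style XR uses for "EVPN ESI".
    return ".".join(text[i:i + 4] for i in range(0, 20, 4))


# Hypothetical CE values: priority 32768, system ID 0011.2233.4455, key 1.
print(esi_from_lacp(32768, "0011.2233.4455", 1))
```

Every PE attached to the same CE would compute the same value, which is what lets them recognize that they share an ES.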

Now that we know the local label on XRv2 for receiving MPLS packets, we can check the LFIB to see what happens when that label is received. The bridge-domain ID is identified as 0 (more on this later) and the router is identified as a PE. The detailed output shows us that PW flow label is automatically enabled which means load-sharing is inherent with EVPN. This label is popped before sending traffic to the customer’s bridge-domain.

RP/0/0/CPU0:XRv2#show mpls forwarding labels 92004 detail
Local  Outgoing    Prefix             Outgoing     Next Hop        Bytes
Label  Label       or ID              Interface                    Switched
------ ----------- ------------------ ------------ --------------- ----------
92004  Pop         No ID              BD=0 PE      point2point     0
     PW Flow Label: Enabled
     MAC/Encaps: 0/0, MTU: 0
     Label Stack (Top -> Bottom): { }
     Packets Switched: 0

The “BD=0” is referring to the PBB_CORE bridge-domain on XRv2. This delivers the packet from the MPLS process to the core domain.

RP/0/0/CPU0:XRv2#show l2vpn bridge-domain bd-name PBB_CORE brief
Legend: pp = Partially Programmed.
Bridge Group:Bridge-Domain Name  ID    State          Num ACs/up   Num PWs/up
-------------------------------- ----- -------------- ------------ ----------
PBB_CORE:PBB_CORE                0     up             0/0          0/0

To see the auto-generated RD/RTs without looking at BGP, we can query the EVPN EVI details for EVI 1000. This output also shows us some interesting label information; we see a unicast and multicast label for the EVI. The unicast label is 92004 which is what we just saw. We have not seen label 92005 yet.

RP/0/0/CPU0:XRv2#show evpn evi vpn-id 1000 detail
EVI        Bridge Domain                Type
---------- ---------------------------- -------
1000       PBB_CORE                     PBB
   Unicast Label  : 92004
   Multicast Label: 92005
   Flow Label: N
   RD Config: none
   RD Auto  : (auto) 1.12.12.12:1000
   RT Auto  : 1:1000
   Route Targets in Use           Type
   ------------------------------ -------
   1:1000                         Import
   1:1000                         Export
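The automatic values in this output follow the pattern described earlier: the RD combines the BGP router-ID with the EVI, and the RT combines the BGP ASN with the EVI. A small sketch of that pattern is below; it only reproduces the string format visible in this lab and makes no claim about how XR handles 4-byte ASNs or larger EVI values. The function names are hypothetical.

```python
def auto_rd(router_id, evi):
    """Auto RD as seen in this lab: <BGP router-ID>:<EVI>."""
    return f"{router_id}:{evi}"


def auto_rt(asn, evi):
    """Auto RT as seen in this lab: <BGP ASN>:<EVI>."""
    return f"{asn}:{evi}"


# XRv2 in this topology: router-ID 1.12.12.12, AS 1, EVI 1000.
print(auto_rd("1.12.12.12", 1000))
print(auto_rt(1, 1000))
```

Because the RT depends only on the ASN and EVI, every PE in the same EVI imports and exports the same value, while the RD stays unique per PE.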

Checking the LFIB for label 92005, we can see the output is very similar to the unicast label except that it identifies an “inclusive multicast” (IM) entry. The output is otherwise identical, but having a separate label is important to differentiate unicast from BUM traffic.

RP/0/0/CPU0:XRv2#show mpls forwarding labels 92005 detail
Local  Outgoing    Prefix             Outgoing     Next Hop        Bytes
Label  Label       or ID              Interface                    Switched
------ ----------- ------------------ ------------ --------------- ----------
92005  Pop         No ID              BD=0 PEIM    point2point     0
     PW Flow Label: Enabled
     MAC/Encaps: 0/0, MTU: 0
     Label Stack (Top -> Bottom): { }
     Packets Switched: 0

When looking at the BGP type 2 route, we did not see this label being signaled. The label is carried in the type 3 routes and is the same for all I-SIDs. This is the first time we have ever seen a non-zero MPLS label being carried within the PMSI attributes; the MVPN section did not operate this way. Looking at the type 3 routes for I-SIDs 1001 and 1002, we see the same label for both of XRv2’s originated BGP routes. Both of these I-SIDs (PBB edge domains) map to the same EVI (PBB core domain) so the RD/RT information is identical as well. Although not terribly significant for EVPN, the tunnel type of 6 represents ingress replication (IR) and the ID is the IP address of the LSM root, which is 1.12.12.12 (XRv2 loopback).

RP/0/0/CPU0:XRv3#show bgp l2vpn evpn bdomain PBB_CORE [3][1001][32][1.12.12.12$
BGP routing table entry for [3][1001][32][1.12.12.12]/80, Route Distinguisher: 1.13.13.13:1000
[snip]
Local
  1.12.12.12 (metric 20) from 1.6.6.6 (1.12.12.12)
    Origin IGP, localpref 100, valid, internal, best, group-best, import-candidate, imported
    Received Path ID 0, Local Path ID 1, version 11
    Extended community: RT:1:1000
    Originator: 1.12.12.12, Cluster list: 1.6.6.6
    PMSI: flags 0x00, type 6, label 92005, ID 0x010c0c0c
    Source VRF: default, Source Route Distinguisher: 1.12.12.12:1000

RP/0/0/CPU0:XRv3#show bgp l2vpn evpn bdomain PBB_CORE [3][1002][32][1.12.12.12$
BGP routing table entry for [3][1002][32][1.12.12.12]/80, Route Distinguisher: 1.13.13.13:1000
[snip]
Local
  1.12.12.12 (metric 20) from 1.6.6.6 (1.12.12.12)
    Origin IGP, localpref 100, valid, internal, best, group-best, import-candidate, imported
    Received Path ID 0, Local Path ID 1, version 12
    Extended community: RT:1:1000
    Originator: 1.12.12.12, Cluster list: 1.6.6.6
    PMSI: flags 0x00, type 6, label 92005, ID 0x010c0c0c
    Source VRF: default, Source Route Distinguisher: 1.12.12.12:1000
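The PMSI tunnel ID is printed in hex, so it is worth confirming that 0x010c0c0c really is XRv2’s loopback. A quick sketch using Python’s standard ipaddress module is below; the helper name is hypothetical, and it also tolerates spaces in case the value is written in grouped form.

```python
import ipaddress


def pmsi_tunnel_id_to_ip(tunnel_id):
    """Convert a 4-byte hex tunnel ID (e.g. 0x010c0c0c) to dotted decimal."""
    cleaned = tunnel_id.lower().replace("0x", "").replace(" ", "")
    packed = bytes.fromhex(cleaned)
    if len(packed) != 4:
        raise ValueError("expected a 4-byte IPv4 tunnel identifier")
    return str(ipaddress.IPv4Address(packed))


# Value copied from the PMSI line in the output above.
print(pmsi_tunnel_id_to_ip("0x010c0c0c"))
```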


We can see a summary of the type 2 (unicast) labels for EVPN using basic BGP show commands. Since the type 3 multicast labels are carried inside of PMSI attributes, they are not shown in summary form using BGP. Again, we can see label 92004 being used for unicast traffic towards XRv2’s B-MAC of 0000.0012.0012.

RP/0/0/CPU0:XRv3#show bgp l2vpn evpn bdomain PBB_CORE labels | begin Network
   Network            Next Hop        Rcvd Label      Local Label
Route Distinguisher: 1.13.13.13:1000 (default for vrf PBB_CORE)
*>i[2][0][48][0000.0011.0011][0]/104
                      1.11.11.11      91004           nolabel
*>i[2][0][48][0000.0012.0012][0]/104
                      1.12.12.12      92004           nolabel
*> [2][0][48][0000.0013.0013][0]/104
                      0.0.0.0         nolabel         93004

For completeness, we will examine some of the BGP show commands on XE as well. Looking at XRv2’s routes, we can see the output is significantly different from XR when looking at the summary information. The type 2 route is expanded more literally on XE; the run of 20 zeroes is the ESI, whereas XR displays it as a single zero. The summary output also contains the MPLS label for the type 2 route. XE is clearly struggling to display information in the type 3 routes as it specifies some fields are UNKNOWN.

R6#show bgp l2vpn evpn rd 1.12.12.12:1000 | begin Network
     Network          Next Hop            Metric LocPrf Weight Path
Route Distinguisher: 1.12.12.12:1000
 *>i [2][1.12.12.12:1000][00000000000000000000][0][48][000000120012][0][*][92004]/33
                      1.12.12.12                    100      0 i
 *>i [3][1.12.12.12:1000][1001][32][UNKNOWN]/17
                      1.12.12.12                    100      0 i
 *>i [3][1.12.12.12:1000][1002][32][UNKNOWN]/17
                      1.12.12.12                    100      0 i

Examining XRv2’s type 2 route in detail, we only see the RT. Nothing else is revealed since XE crams it into the summary information without saying much about it. Considering the feature isn’t supported in XE in the data-plane at all, this is understandable, and XE is just transporting BGP information (no local processing or path-related evaluations on the payload data).

R6#show bgp l2vpn evpn rd 1.12.12.12:1000 route-type 2
BGP routing table entry for [2][1.12.12.12:1000][00000000000000000000][0][48][000000120012][0][*][92004]/33, version 8
Paths: (1 available, best #1, table EVPN-BGP-Table)
  Advertised to update-groups:
     2
  Refresh Epoch 1
  Local, (Received from a RR-client)
    1.12.12.12 (metric 20) from 1.12.12.12 (1.12.12.12)
      Origin IGP, localpref 100, valid, internal, best
      Extended Community: RT:1:1000
      rx pathid: 0, tx pathid: 0x0

Looking at XRv2’s type 3 routes provides some extra PMSI information. Since MVPN is fully supported on XE, we can see all the PMSI attributes, to include an option that identifies this PMSI as being “for EVPN”. We also see the inclusive multicast MPLS labels, tunnel type (ingress replication), and LSM root (1.12.12.12). There are two routes as expected, one per supported I-SID on XRv2.

R6#show bgp l2vpn evpn rd 1.12.12.12:1000 route-type 3
BGP routing table entry for [3][1.12.12.12:1000][1001][32][UNKNOWN]/17, version 21
Paths: (1 available, best #1, table EVPN-BGP-Table)
  Advertised to update-groups:
     2
  Refresh Epoch 1
  Local, (Received from a RR-client)
    1.12.12.12 (metric 20) from 1.12.12.12 (1.12.12.12)
      Origin IGP, localpref 100, valid, internal, best
      Extended Community: RT:1:1000
      PMSI Attribute: for EVPN, Flags: 0x0, Tunnel type: 6, length 4, label: 92005, tunnel parameters: 010C 0C0C
      rx pathid: 0, tx pathid: 0x0
BGP routing table entry for [3][1.12.12.12:1000][1002][32][UNKNOWN]/17, version 25
Paths: (1 available, best #1, table EVPN-BGP-Table)
  Advertised to update-groups:
     2
  Refresh Epoch 1
  Local, (Received from a RR-client)
    1.12.12.12 (metric 20) from 1.12.12.12 (1.12.12.12)
      Origin IGP, localpref 100, valid, internal, best
      Extended Community: RT:1:1000
      PMSI Attribute: for EVPN, Flags: 0x0, Tunnel type: 6, length 4, label: 92005, tunnel parameters: 010C 0C0C
      rx pathid: 0, tx pathid: 0x0

We will detail inclusive multicast operations a bit more as well. We can clearly see a mapping between I-SIDs, IP addresses, and EVIs using a basic show command on a per EVI basis. In this example, we can see both I-SIDs 1001 and 1002, their participating routers, and the EVI common for all I-SIDs.

RP/0/0/CPU0:XRv3#show evpn evi vpn-id 1000 inclusive-multicast
ISID           Originating IP                           vpn-id
-------------- ---------------------------------------- --------
1001           1.11.11.11                               1000
1001           1.12.12.12                               1000
1001           1.13.13.13                               1000
1002           1.12.12.12                               1000
1002           1.13.13.13                               1000

The detailed output shows the BGP next-hops, along with the MPLS labels and whether the entry was locally generated or remotely learned. Again, this shows the same inclusive multicast MPLS label (signaled as a subfield of the PMSI attribute) for each I-SID on a given PE.

RP/0/0/CPU0:XRv3#show evpn evi vpn-id 1000 inclusive-multicast detail
ISID: 1001, Originating IP: 1.11.11.11 1000
    Nexthop: 1.11.11.11
    Label  : 91005
    Source : Remote
ISID: 1001, Originating IP: 1.12.12.12 1000
    Nexthop: 1.12.12.12
    Label  : 92005
    Source : Remote
ISID: 1001, Originating IP: 1.13.13.13 1000
    Nexthop: ::
    Label  : 93005
    Source : Local
ISID: 1002, Originating IP: 1.12.12.12 1000
    Nexthop: 1.12.12.12
    Label  : 92005
    Source : Remote
ISID: 1002, Originating IP: 1.13.13.13 1000
    Nexthop: ::
    Label  : 93005
    Source : Local
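One way to read this table is as XRv3’s replication list: for a BUM frame arriving in a given I-SID, one copy goes to each remote PE that originated an inclusive multicast route for that I-SID, using the label that PE advertised. The sketch below models that reading with the values from the output above. It assumes ingress replication and ignores the transport label and PBB encapsulation entirely; the data structure and function name are hypothetical.

```python
# Entries transcribed from "show evpn evi vpn-id 1000 inclusive-multicast detail".
IM_ENTRIES = [
    {"isid": 1001, "nexthop": "1.11.11.11", "label": 91005, "source": "Remote"},
    {"isid": 1001, "nexthop": "1.12.12.12", "label": 92005, "source": "Remote"},
    {"isid": 1001, "nexthop": "::",         "label": 93005, "source": "Local"},
    {"isid": 1002, "nexthop": "1.12.12.12", "label": 92005, "source": "Remote"},
    {"isid": 1002, "nexthop": "::",         "label": 93005, "source": "Local"},
]


def replication_list(entries, isid):
    """Return (next hop, multicast label) pairs for remote PEs in one I-SID."""
    return [
        (entry["nexthop"], entry["label"])
        for entry in entries
        if entry["isid"] == isid and entry["source"] == "Remote"
    ]


for isid in (1001, 1002):
    print(isid, replication_list(IM_ENTRIES, isid))
```

XRv1 appears only in the list for I-SID 1001, which is how BUM traffic in I-SID 1002 avoids being sent to a PE that has no interest in it.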

Ingress replication, as a method to transport LSM, is somewhat self-explanatory. Packets arriving from the CE are replicated to each remote PE that has expressed interest. That is to say, if the remote RT was imported for a route, it is assumed that PE is interested in receiving multicast. IR is easy to deploy because core routers treat the replicated packets just like ordinary MPLS frames and don’t need to run anything special, unlike RSVP P2MP-TE and mLDP, which may not be supported by all routers. Core routers won’t even see the inclusive multicast labels since ordinary transport labels, such as LDP, RSVP-TE, or SR will be used to transport traffic across the network.

It is important to note that because the MPLS label is the same for multicast I-SID traffic within an EVI, the PBB encapsulation (802.1ah described later) is responsible for providing the demultiplexing capability to achieve per I-SID granularity, not the MPLS label. We can prove this by creating a second EVC between XRv1 and XRv3 using bogus ESes. For additional variety, we will configure custom RDs and RTs on XRv1 and XRv3 for this EVI. The ES doesn’t actually connect a host device but for verifying the control plane, it doesn’t matter. Only XRv3 is shown, but XRv1 has a near-identical configuration. The RTs are reversed so that XRv1 and XRv3 import one another’s RTs, and the RDs are different as well.

! XRv3
interface GigabitEthernet0/0/0/0.1113 l2transport
 encapsulation dot1q 1113 exact
 rewrite ingress tag pop 1 symmetric
evpn
 evi 1313
  bgp
   rd 0.0.11.13:13
   route-target import 1113:11
   route-target export 1113:13
l2vpn
 bridge group PBB_CORE
  bridge-domain PBB_CORE_3
   pbb core
    evpn evi 1313
 bridge group PBB_EDGES
  bridge-domain PBB_EDGE_3
   pbb edge i-sid 3111 core-bridge PBB_CORE_3
   interface GigabitEthernet0/0/0/0.1113

Checking BGP on XRv3, we can see XRv1’s type 2 and type 3 routes within this new EVI. Since the PBB core domain is different, the EVI is different, which means the RDs, RTs, and MPLS labels will be different. We can see XRv3’s custom RD below. Also note that there are only two type 3 IM routes because there is only one I-SID within PBB_CORE_3 (only one PBB edge domain). The I-SID value is 3111 as expected; there is no knowledge of I-SIDs 1001 or 1002 within this PBB core domain, which is a separate EVI from the one previously configured.

RP/0/0/CPU0:XRv3#show bgp l2vpn evpn bdomain PBB_CORE_3 | begin Network
   Network            Next Hop            Metric LocPrf Weight Path
Route Distinguisher: 0.0.11.13:13 (default for vrf PBB_CORE_3)
*>i[2][0][48][0000.0011.0011][0]/104
                      1.11.11.11                    100      0 i
*> [2][0][48][0000.0013.0013][0]/104
                      0.0.0.0                                0 i
*>i[3][3111][32][1.11.11.11]/80
                      1.11.11.11                    100      0 i
*> [3][3111][32][1.13.13.13]/80
                      0.0.0.0                                0 i
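As a rough illustration of how to read these bracketed prefixes, the sketch below splits the four prefixes from the output above into named fields. The field names and their order (Ethernet tag, MAC length, MAC, IP length for type 2; I-SID, IP length, originating IP for type 3) are my reading of the display format, not something the router output labels, and `parse_evpn_prefix` is a hypothetical helper.

```python
import re

def parse_evpn_prefix(prefix: str) -> dict:
    """Split a bracketed EVPN prefix string into named fields."""
    fields = re.findall(r"\[([^\]]*)\]", prefix)
    prefix_len = int(prefix.rsplit("/", 1)[1])
    route_type = int(fields[0])
    if route_type == 2:
        # Assumed order: type, Ethernet tag, MAC length, MAC, IP length
        return {
            "type": 2,
            "eth_tag": int(fields[1]),
            "mac_len": int(fields[2]),
            "mac": fields[3],
            "ip_len": int(fields[4]),
            "prefix_len": prefix_len,
        }
    if route_type == 3:
        # Assumed order: type, I-SID (Ethernet tag), IP length, originating IP
        return {
            "type": 3,
            "isid": int(fields[1]),
            "ip_len": int(fields[2]),
            "orig_ip": fields[3],
            "prefix_len": prefix_len,
        }
    raise ValueError(f"unhandled EVPN route type {route_type}")

# The four prefixes shown for bdomain PBB_CORE_3
ROUTES = [
    "[2][0][48][0000.0011.0011][0]/104",
    "[2][0][48][0000.0013.0013][0]/104",
    "[3][3111][32][1.11.11.11]/80",
    "[3][3111][32][1.13.13.13]/80",
]

parsed = [parse_evpn_prefix(route) for route in ROUTES]
isids = {p["isid"] for p in parsed if p["type"] == 3}
print(isids)  # only I-SID 3111 appears in this core domain
```

Collecting the I-SIDs from the type 3 routes gives a quick way to confirm that 1001 and 1002 are absent from this EVI.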

Looking at XRv1’s type 2 route, we can see the original RD was XRv1’s custom RD, and the route also contains the custom RT. A new MPLS label 91006 was allocated for this new PBB core domain.

RP/0/0/CPU0:XRv3#show bgp l2vpn evpn bdomain PBB_CORE_3 [2][0][48][0000.0011.0011][0]/104
BGP routing table entry for [2][0][48][0000.0011.0011][0]/104, Route Distinguisher: 0.0.11.13:13
[snip]
Local
  1.11.11.11 (metric 20) from 1.6.6.6 (1.11.11.11)
    Received Label 91006
    Origin IGP, localpref 100, valid, internal, best, group-best, import-candidate, imported, rib-install
    Received Path ID 0, Local Path ID 1, version 60
    Extended community: RT:1113:11
    Originator: 1.11.11.11, Cluster list: 1.6.6.6
    EVPN ESI: 0000.0000.0000.0000.0000
    Source VRF: default, Source Route Distinguisher: 0.0.11.13:11

Before checking the LFIB on XRv1, we can verify the PBB_CORE_3 bridge-domain ID to correlate it with the LFIB next. This PBB core domain has an ID of 3. For sanity, we verify the local label by checking the EVI details as well, expecting to see value 91006 as the unicast label carried in the BGP type 2 message. The LFIB verifies all of this; label 91006 is popped upon reception and the packet is delivered to PBB_CORE_3, which is bridge-domain 3.

RP/0/0/CPU0:XRv1#show l2vpn bridge-domain bd-name PBB_CORE_3 brief
Legend: pp = Partially Programmed.
Bridge Group:Bridge-Domain Name  ID    State          Num ACs/up   Num PWs/up
-------------------------------- ----- -------------- ------------ ----------
PBB_CORE:PBB_CORE_3              3     up             0/0          0/0

RP/0/0/CPU0:XRv1#show evpn evi vpn-id 1313 detail | include Unicast
Unicast Label : 91006

RP/0/0/CPU0:XRv1#show mpls forwarding labels 91006
Local  Outgoing    Prefix             Outgoing     Next Hop        Bytes
Label  Label       or ID              Interface                    Switched
------ ----------- ------------------ ------------ --------------- ----------
91006  Pop         No ID              BD=3 PE      point2point     0

The same is true for multicast labels. Since XRv3 and XRv1 are connected with two separate EVIs, we expect the multicast labels assigned for their I-SIDs to vary. In this case, we see label 91007 used for the new I-SID of 3111 (EVI 1313) and 91005 still in use for the old I-SID of 1001 (EVI 1000). Note that this verification was performed remotely on XRv3 based on received BGP type 3 routes.

RP/0/0/CPU0:XRv3#show evpn evi inclusive-multicast service-id 3111 orig-ip 1.11.11.11 detail
ISID: 3111, Originating IP: 1.11.11.11 1313
  Nexthop: 1.11.11.11
  Label : 91007
  Source : Remote

RP/0/0/CPU0:XRv3#show evpn evi inclusive-multicast service-id 1001 orig-ip 1.11.11.11 detail
ISID: 1001, Originating IP: 1.11.11.11 1000
  Nexthop: 1.11.11.11
  Label : 91005
  Source : Remote

We can confirm both entries in XRv1’s LFIB. We can see that label 91005 is an inclusive multicast label for BD 0, which is PBB_CORE seen earlier. Label 91007 is an IM label for BD 3, which is PBB_CORE_3, verified above. In this way, the MPLS label is mapped to an EVI, which allows MPLS to deliver untagged packets to the proper PBB core domain. The PBB encapsulation provides the specific I-SID, not the MPLS label.

RP/0/0/CPU0:XRv1#show mpls forwarding labels 91005 91007
Local  Outgoing    Prefix             Outgoing     Next Hop        Bytes
Label  Label       or ID              Interface                    Switched
------ ----------- ------------------ ------------ --------------- ----------
91005  Pop         No ID              BD=0 PEIM    point2point     0
91006  Pop         No ID              BD=3 PE      point2point     0
91007  Pop         No ID              BD=3 PEIM    point2point     0
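The two-stage lookup described here can be sketched in a few lines. This is only a model of the idea, not how the platform implements it: the label table is copied from the LFIB output above, the edge bridge-domain name for I-SID 3111 on XRv1 is assumed to match XRv3's `PBB_EDGE_3`, the I-TAG flag bits are left at zero, no B-tag is included, and all function names are hypothetical.

```python
import struct

PBB_ITAG_ETHERTYPE = 0x88E7  # ethertype that precedes the I-TAG

# Stage 1: MPLS label -> PBB core bridge-domain (from XRv1's LFIB)
LFIB = {
    91005: ("PBB_CORE", "inclusive-multicast"),    # BD=0 PEIM
    91006: ("PBB_CORE_3", "unicast"),              # BD=3 PE
    91007: ("PBB_CORE_3", "inclusive-multicast"),  # BD=3 PEIM
}

# Stage 2: (core domain, I-SID) -> PBB edge bridge-domain
# PBB_EDGE_3 is the name on XRv3; assumed identical on XRv1.
EDGE_BY_ISID = {
    ("PBB_CORE", 1001): "PBB_EDGE_1",
    ("PBB_CORE_3", 3111): "PBB_EDGE_3",
}

def mac(text: str) -> bytes:
    """Convert a dotted Cisco-style MAC string to 6 raw bytes."""
    return bytes.fromhex(text.replace(".", ""))

def build_backbone_header(b_da: str, b_sa: str, isid: int) -> bytes:
    """B-DA, B-SA, 0x88E7, then a 32-bit I-TAG whose low 24 bits hold the I-SID."""
    return mac(b_da) + mac(b_sa) + struct.pack("!HI", PBB_ITAG_ETHERTYPE, isid & 0xFFFFFF)

def demux(label: int, backbone_frame: bytes):
    """Pop the label to find the core domain, then read the I-SID from the PBB header."""
    core, kind = LFIB[label]
    ethertype, itag = struct.unpack_from("!HI", backbone_frame, 12)
    if ethertype != PBB_ITAG_ETHERTYPE:
        raise ValueError("not a PBB I-tagged frame")
    isid = itag & 0xFFFFFF
    return core, kind, isid, EDGE_BY_ISID.get((core, isid))

# XRv3 (B-SA 0000.0013.0013) sending toward XRv1 (B-DA 0000.0011.0011)
frame_3111 = build_backbone_header("0000.0011.0011", "0000.0013.0013", 3111)
frame_1001 = build_backbone_header("0000.0011.0011", "0000.0013.0013", 1001)

print(demux(91006, frame_3111))
print(demux(91005, frame_1001))
print(demux(91005, frame_3111))  # I-SID 3111 is unknown inside PBB_CORE
```

The last call returns no edge domain, which mirrors the earlier observation that PBB_CORE has no knowledge of I-SID 3111.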

We will finish with some additional show commands. Although I have not found a command to show this in BGP, the remote PEs signal whether they are running in single-active or all-active mode. I assume that this command confirms that each remote PE is single-homed as a result of this signaling. Notice that this output does not show any local labels.

RP/0/0/CPU0:XRv3#show evpn internal-label detail
EVI        Nexthop                                  Output Label
---------- ---------------------------------------- ---------------
1000       1.11.11.11                               91004
  ESI: 0000.0000.0000.0000.0000 Tag: 0 Single-homed
1000       1.12.12.12                               92004
  ESI: 0000.0000.0000.0000.0000 Tag: 0 Single-homed
1313       1.11.11.11                               91006
  ESI: 0000.0000.0000.0000.0000 Tag: 0 Single-homed


EVPN information can be summarized as shown below on XRv1. This can be useful for trying to determine what the resulting auto-generated RDs and RTs will be. It also shows the total number of routes seen across all EVIs (PBB core domains) and supported I-SIDs (PBB edge domains).

RP/0/0/CPU0:XRv1#show evpn summary
-----------------------------
Global Information
-----------------------------
Number of EVIs : 3
Number of Local MAC Routes : 4
Number of Remote MAC Routes : 3
Number of Local IMCAST Routes : 2
Number of Remote IMCAST Routes: 5
Number of Internal Labels : 3
Number of ES Entries : 1
Number of Neighbor Entries : 3
BGP Router ID : 1.11.11.11
BGP ASN : 1
PBB BSA MAC address : 0000.0011.0011
Global peering timer : 45 seconds
Global recovery timer : 20 seconds
Global programming timer : 1500 microseconds
Global flushagain timer : 60 seconds
Standard Version : draft version 06
Version Transition Timer : 0(min) 0(sec) left
-----------------------------
High Availability Information
-----------------------------
BGP EOD : Y
Number of Marked MAC Routes : 0
Number of Swept MAC Routes : 0
Number of Marked IMCAST Routes : 0
Number of Swept IMCAST Routes : 0

Although we will not see any MAC addresses, we can confirm some basic data-plane configurations are correct. First, we look at the bridge-domain summary. We see 3 PBB edge domains and 2 PBB core domains, which is consistent with the BGP verifications we performed earlier. The BGP verifications showed two separate EVIs (1000 and 1313) with three I-SIDs (1001 and 1002 mapped to EVI 1000, and 3111 mapped to EVI 1313).

RP/0/0/CPU0:XRv3#show l2vpn bridge-domain summary
Number of groups: 2, bridge-domains: 5, Up: 5, Shutdown: 0, Partially-programmed: 0
Default: 0, pbb-edge: 3, pbb-core: 2
Number of ACs: 3 Up: 3, Down: 0, Partially-programmed: 0
Number of PWs: 0 Up: 0, Down: 0, Standby: 0, Partially-programmed: 0
Number of P2MP PWs: 0, Up: 0, Down: 0, other-state: 0
Number of VNIs: 0, Up: 0, Down: 0, Unresolved: 0


Looking at PBB_EDGE_1 as a bridge-domain, we confirm it is configured as PBB-edge and has a valid I-SID. We can see it is mapped to a PBB and has one AC that is online.

RP/0/0/CPU0:XRv1#show l2vpn bridge-domain bd-name PBB_EDGE_1
Legend: pp = Partially Programmed.
Bridge group: PBB_EDGES, bridge-domain: PBB_EDGE_1, id: 1, state: up, ShgId: 0, MSTi: 0
  Type: pbb-edge, I-SID: 1001
  Aging: 300 s, MAC limit: 4000, Action: none, Notification: syslog
  Filter MAC addresses: 0
  ACs: 1 (1 up), VFIs: 0, PWs: 0 (0 up), PBBs: 1 (1 up)
  List of PBBs:
    PBB Edge, state: up, Static MAC addresses: 0
  List of ACs:
    Gi0/0/0/0.551, state: up, Static MAC addresses: 0
  List of Access PWs:
  List of VFIs:

The PBB_CORE bridge-domain is also mapped to a PBB (core type) and has one PBB-edge bridge-domain associated with it.

RP/0/0/CPU0:XRv1#show l2vpn bridge-domain bd-name PBB_CORE
Legend: pp = Partially Programmed.
Bridge group: PBB_CORE, bridge-domain: PBB_CORE, id: 0, state: up, ShgId: 0, MSTi: 0
  Type: pbb-core
  Number of associated pbb-edge BDs: 1
  Aging: 300 s, MAC limit: 4000, Action: none, Notification: syslog
  Filter MAC addresses: 0
  ACs: 0 (0 up), VFIs: 0, PWs: 0 (0 up), PBBs: 1 (1 up)
  List of PBBs:
    PBB Core, state: up
  List of ACs:
  List of Access PWs:
  List of VFIs:

To see additional details about how the PBB edge domains are mapped to PBB core domains, we use the “detail” keyword. This reveals much more information about the behavior of the bridge-domain, including flooding and MAC operations.

RP/0/0/CPU0:XRv1#show l2vpn bridge-domain bd-name PBB_EDGE_1 detail
Legend: pp = Partially Programmed.
Bridge group: PBB_EDGES, bridge-domain: PBB_EDGE_1, id: 1, state: up, ShgId: 0, MSTi: 0
  Coupled state: disabled
  Type: pbb-edge, I-SID: 1001
  Core-bridge: PBB_CORE (State: Bridge Up)
  MIRP-lite: not supported
  Format: none
  MAC learning: enabled
  MAC withdraw: enabled
    MAC withdraw for Access PW: enabled
    MAC withdraw sent on: bridge port up
    MAC withdraw relaying (access to access): disabled
  Flooding:
    Broadcast & Multicast: enabled
    Unknown unicast: enabled
[snip]

Using “detail” on the PBB core domain shows additional EVI information. Obviously the packet/byte counts are incorrect since no traffic is flowing, but basic BGP information (ASN, RD, etc) is shown as well. MAC and flooding operations are at the bottom and most of it is omitted for brevity.

RP/0/0/CPU0:XRv1#show l2vpn bridge-domain bd-name PBB_CORE detail
Legend: pp = Partially Programmed.
Bridge group: PBB_CORE, bridge-domain: PBB_CORE, id: 0, state: up, ShgId: 0, MSTi: 0
  Coupled state: disabled
  Type: pbb-core
  Number of associated pbb-edge BDs: 1
  EVPN:
    EVI: 1000
    Route Distinguisher: (auto) 1.11.11.11:1000
    Imposition Statistics:
      Packet Count: 293221265588617217
      Byte Count : 293183607383632832
    Disposition Statistics:
      Packet Count: 311937118322620165
      Byte Count : 1176759140103596992
    AS Number: 1
  MAC learning: enabled
  MAC withdraw: enabled
    MAC withdraw for Access PW: enabled
    MAC withdraw sent on: bridge port up
    MAC withdraw relaying (access to access): disabled
  Flooding:
    Broadcast & Multicast: enabled
    Unknown unicast: enabled
[snip]

Additional Reading – Reference configurations “pbb-evpn”

11. Describe IEEE 802.1ad (QinQ), IEEE 802.1ah (Mac-in-Mac), and ITU G.8032 (REP)

11.1 802.1ad QinQ


QinQ (dot1q tunneling or provider bridging) is a method of adding additional 802.1q headers to an Ethernet frame to tunnel it across different networks. 802.1ad is an amendment to 802.1q which doesn’t change any aspects of dot1q headers per se, but simply allows them to be stacked in an Ethernet frame’s header. This is demonstrated many times throughout this document so a dedicated lab is not built for this section. There are many benefits to tag stacking:

1. The original specification can create 2^12 (4096) theoretical VLANs since the VLAN-ID is a 12-bit number. VLAN IDs 0 and 4095 are reserved and cannot be used, so only 4094 VLANs are usable in practice. Adding an additional tag roughly squares this (2^24 = 16777216 theoretical combinations), which allows for greater network growth. I personally use this extensively in many labs where I don’t want to provision public VLANs (I am only allowed to use VLANs 3501 – 3600 for my labs), so I tunnel my private VLANs inside of these publicly switched VLANs.
2. Tags can be added (pushed), removed (popped), or changed (translated) by network devices as necessary.
3. Provides the basis for provider bridging and provider backbone bridging (PBB), which is described later. When used in conjunction with PBB, hierarchical bridging networks can be designed to achieve very high scalability.

The standard denotes Ethertype 0x88A8 for the outer tag (service tag, carrier tag, metro tag, tag 1, or S-tag) and 0x8100 for the inner tag (customer tag, tag 2, or C-tag). This new Ethertype isn’t terribly significant other than identifying the Ethernet frame as tunneled or not. Below is a copy of a packet dump from another section later in the lab that shows the QinQ stack. In this case, the S-tag in yellow has value 3544 (0xDD8) and the C-tag in green has value 460 (0x1CC). As mentioned earlier, this can be used to tunnel VLANs not allowed by my cloud provider inside VLANs that are. Immediately following the QinQ encapsulation is the ethertype of 0x0800 (cyan), which identifies an IP packet as the payload of the tunneled Ethernet frame.

R6#show monitor capture CAP buffer detail
6 152 1.376039 00:50:56:A9:DE:0D -> 00:50:56:A9:EA:77 MPLS unicast
0000: 005056A9 EA770050 56A9DE0D 81000DEF .PV..w.PV.......
0010: 884701B5 D0FF00FA A1FF0000 00000050 .G.............P
0020: 56A9862A 005056A9 EA5488A8 0DD88100 V..*.PV..T......
0030: 01CC0800 45000064 00010000 FF01A77D ....E..d.......}
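To make the tag stack easier to see, here is a small sketch that walks the VLAN tags of the tunneled frame from the dump above. I am assuming the inner Ethernet frame begins at offset 0x1E, after the outer Ethernet header, one dot1q tag, two MPLS labels, and a 4-byte control word; that offset is my reading of the bytes, and `parse_vlan_stack` is a hypothetical helper rather than anything from a capture tool.

```python
import struct

TAG_NAMES = {0x88A8: "S-tag", 0x8100: "C-tag"}

def parse_vlan_stack(frame: bytes):
    """Return the list of VLAN tags and the ethertype that follows them."""
    offset = 12  # skip destination and source MAC
    tags = []
    (ethertype,) = struct.unpack_from("!H", frame, offset)
    while ethertype in TAG_NAMES:
        (tci,) = struct.unpack_from("!H", frame, offset + 2)
        tags.append({
            "kind": TAG_NAMES[ethertype],
            "pcp": tci >> 13,          # 3 priority bits
            "dei": (tci >> 12) & 0x1,  # drop eligibility indicator
            "vid": tci & 0x0FFF,       # 12-bit VLAN ID
        })
        offset += 4
        (ethertype,) = struct.unpack_from("!H", frame, offset)
    return tags, ethertype

# Inner frame bytes copied from offsets 0x1E onward in the dump
inner = bytes.fromhex(
    "005056A9862A 005056A9EA54 88A8 0DD8 8100 01CC 0800"
    "45000064 00010000 FF01A77D"
)

tags, payload_type = parse_vlan_stack(inner)
for tag in tags:
    print(tag["kind"], tag["vid"])
print(hex(payload_type))
```

The loop stops at the first ethertype that is not a VLAN tag, so the same function also handles single-tagged or untagged frames.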

Although 802.1ad does not specify more than 2 tags, many devices are not limited to this upper bound and can push more. In that case, the inner-most tag is the C-tag and all other tags are considered S-tags. The outer-most tag is “tag 1” and the next tag is “tag 2”, which would be another S-tag if more than 2 tags exist. Another interesting change in 802.1ad is that the canonical format indicator (CFI) field, formerly used when bridging token ring networks, has been recycled for drop probability use. This is similar to the frame-relay discard eligibility (DE) bit and is called the drop eligibility indicator (DEI). This can be used in conjunction with or in lieu of the 3 priority bits in the dot1q header which are also examined in later sections. Those priority bits are referenced in 802.1p and are called the priority code point (PCP).

QinQ tunneling has some downsides as well. The C-MAC addresses are always exposed, even when being bridged over a service provider’s network. The S-tag helps keep the C-tag intact and separates some data-plane functionality, but the provider network still must program all C-MACs within the S-VLAN CAM tables on all switches in the transit path. This leads to poor scalability for data center interconnect (DCI) applications where many customers will have many C-MACs. QinQ also makes the assumption that no two networks will ever have the same MAC address, which may not be true if custom MAC addressing is configured in software for aggregation/filtering reasons.

11.2 802.1ah MAC in MAC (Provider Backbone Bridges)

This feature was written to address some of the concerns previously identified with QinQ tunneling. It is also known as provider backbone bridging (PBB) since it introduces the concept of backbone MACs (B-MACs). Like the S-tags shown in QinQ tunneling, B-MACs are service provider switch MACs that serve as the outer encapsulation for customer traffic. In this way, provider switches don’t need to learn all C-MACs, only B-MACs, of which there is one per PE. This greatly enhances scalability as the B-MAC is like a tunnel or aggregate MAC. PBB was defined to offer total separation between customer and provider MAC address spaces, which is somewhat similar to what MPLS L3VPN (or IP-in-IP/GRE) does for IP VPNs. PBB is composed of three main encapsulation components:

1. Backbone component: Contains a destination address (B-DA), source address (B-SA), the ethertype 0x88A8 as defined in 802.1ad for QinQ tunneling, and a B-tag to represent the backbone VLAN. The B-tag is similar to the S-tag but is renamed for clarity between the specifications, since PBB technically can carry QinQ frames inside of it.
2. Service component: Contains ethertype 0x88E7 (defined for PBB) and an I-SID (24-bit service ID). The I-SID is used to identify the customer to which the frame belongs, since the B-tag isn’t customer specific.
3. Original frame: This could be an IP packet inside of an Ethernet frame (0x0800), an Ethernet frame inside of dot1q with a single C-tag (0x8100), or part of a hierarchical bridging architecture with two tags already on it (0x88A8). This is why the B-tag isn’t called the S-tag, as the S-tag may actually exist within the original frame that arrived at the PBB device. This component is highly variable in length and composition, but PBB doesn’t care.

This PBB encapsulation can be used in an ordinary Ethernet switched network or as a pre-encapsulation before MPLS. In PBB-EVPN, this PBB encapsulation happens first, then the entire backbone frame is encapsulated inside MPLS labels to represent the remote PE within a VPN (BGP) and the transport labels to get there (LDP, RSVP-TE, SR, etc). I am unable to demonstrate this encapsulation due to lack of hardware devices that support it.

11.3 Ethernet Ring loop-prevention

11.3.1 Cisco Resilient Ethernet Protocol (REP)

REP is a Cisco-proprietary protocol that is used to rapidly converge a layer 2 Ethernet network arrayed in a ring topology. This is an alternative to STP which works in a very specific set of network designs, but offers very fast (50 – 300 ms) convergence time. The entire “circumference” of the ring, including the endpoints, is part of the REP domain. This domain is called a “segment” in REP terms, but I personally find this term confusing. The segment is not like a LAN segment, nor is it a single link between two switches. Rather, the segment is the end-to-end circumference of the REP ring, which is terminated by exactly two edge ports. Each switch in the REP segment can have up to 2 ports in a segment (for intermediate nodes), and each link must have exactly one REP neighbor. To illustrate this, we will test two different REP topologies shown below. The left diagram is a “ring segment” in which one switch terminates the segment on both ends, since both edge ports are configured on this one switch. This is a simpler design but it is highly reliant on this one switch. The right diagram is an “open segment” in which the edge ports exist on disparate switches. The network in between can use STP (Cisco Catalyst 3560, for example), and since there are multiple switches with REP edge ports, this more complex design offers higher availability (more ingress/egress points).

First, we will configure the ring topology. REP is very easy to configure as there are not many commands. Before configuring any interfaces, we will discuss the REP admin VLAN. This VLAN is used for the flooding of REP messages to an ordinary MAC multicast address, which is the IEEE STP MAC 0180.c200.0000 also used for STP on VLAN 1 in PVST, for example. The reason this ordinary address is used is to improve REP performance; by using the hardware flood layer (HFL), REP bypasses the delay with software forwarding. Certain REP messages are therefore flooded throughout the entire REP admin VLAN, even beyond the REP segment. For this reason, the admin VLAN can be pruned to only exist in the REP segment, giving the operator additional control. We will configure VLAN 29 for this. It is important to note that the REP link-status layer (LSL) PDUs, which are like STP BPDUs, are always sent untagged in the native VLAN.

! All ME switches
vlan 29
 name REP_ADMIN
rep admin vlan 29

We can verify this configuration by checking the REP interface details, which is not intuitive. There are many other important fields in this output that we look at later, but for now, we just verify the admin VLAN.

ME3#show interfaces port-channel 13 rep detail | include Admin-vlan
Admin-vlan: 29

There are two key requirements for enabling REP on an interface: it must be an NNI and it must be a trunk port. Any other combination of port-types and switch port modes is not supported. REP is also supported on port-channel interfaces, as shown between ME1 and ME3. Of the two edge ports, one is the “primary” edge port and one is the secondary. This is relevant later, but for now, we configure the LACP bond towards ME1 as the primary and the other port towards ME2 as the secondary. Ideally, you would work clockwise or counter-clockwise around the ring and not configure both edge ports at the same time. Moving in sequence is the best way to migrate to REP from STP, but since this is a “greenfield” deployment, I will configure all REP ports per device as I have console access and don’t care about temporary traffic blackholes.

! ME3
interface Port-channel13
 description LACP TO ME1
 port-type nni
 switchport mode trunk
 rep segment 1 edge primary
interface FastEthernet0/13
 description TO ME2
 port-type nni
 switchport mode trunk
 rep segment 1 edge

Notice that configuring REP generates a message warning you about the effects on STP. STP is totally disabled on any REP port, which means BPDUs are neither sent nor received. We can prove this with a show command as well.

ME3(config-if)#rep segment 1 edge
Warning: Enabling REP automatically disables STP on this port. It is recommended to shutdown all interfaces which are not currently in use to prevent potential bridging loops.
ME3#show spanning-tree interface fastEthernet 0/13
no spanning tree info available for FastEthernet0/13

Configuring the intermediate nodes is simple. All we do is specify the REP segment as there is no need to specify edge port status. I do not show the descriptions, NNI, or trunk configurations for brevity.

! ME2
interface FastEthernet0/13
 rep segment 1
interface FastEthernet0/15
 rep segment 1

! ME1
interface FastEthernet0/15
 rep segment 1
interface Port-channel13
 rep segment 1

At this point, we can use some simple show commands to verify the REP topology. Assuming the ring is complete, all nodes will have the same view of the topology, so we can issue this command anywhere. The command starts with the primary edge port and moves around the ring until it reaches the secondary edge port. Each bridge and port along the way are identified, along with their port roles. Notice that ME2 port 15 is blocking; this is similar to STP where one port in the ring must be blocking to prevent a loop. This means that traffic from ME2 to ME1 will transit ME3, just like in STP. These ports are called “alternate” ports because a break in the ring anywhere automatically means this blocked port will be opened. If the edge ports lose connectivity, there is never a case where the ring should contain a blocked port.

ME3#show rep topology segment 1
REP Segment 1
BridgeName       PortName   Edge Role
---------------- ---------- ---- ----
ME3              Po13       Pri  Open
ME1              Po13            Open
ME1              Fa0/15          Open
ME2              Fa0/15          Alt
ME2              Fa0/13          Open
ME3              Fa0/13     Sec  Open
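The effect of the alternate port on the forwarding path can be modeled with a short sketch. It treats consecutive rows of the topology output that sit on different bridges as the two ends of one link, and considers a link usable only when both ends are “Open”. The second list mirrors the output this section shows later, after the ME2 to ME3 cable is pulled. This is only a toy model of the topology table, not of REP itself, and the function names are hypothetical.

```python
from collections import deque

def forwarding_links(segment):
    """Return switch-to-switch links whose two REP ports are both Open."""
    links = []
    for (sw_a, _, role_a), (sw_b, _, role_b) in zip(segment, segment[1:]):
        # Rows on the same bridge are two ports of one switch, not a link.
        if sw_a != sw_b and role_a == "Open" and role_b == "Open":
            links.append((sw_a, sw_b))
    return links

def path(links, src, dst):
    """Breadth-first search for the switch-level path from src to dst."""
    neighbors = {}
    for a, b in links:
        neighbors.setdefault(a, []).append(b)
        neighbors.setdefault(b, []).append(a)
    queue = deque([[src]])
    seen = {src}
    while queue:
        route = queue.popleft()
        if route[-1] == dst:
            return route
        for nxt in neighbors.get(route[-1], []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(route + [nxt])
    return None

# Complete ring, as listed by "show rep topology segment 1"
HEALTHY = [
    ("ME3", "Po13", "Open"), ("ME1", "Po13", "Open"),
    ("ME1", "Fa0/15", "Open"), ("ME2", "Fa0/15", "Alt"),
    ("ME2", "Fa0/13", "Open"), ("ME3", "Fa0/13", "Open"),
]

# Ring broken between ME2 and ME3 (failure case shown later in this section)
BROKEN = [
    ("ME3", "Po13", "Open"), ("ME1", "Po13", "Open"),
    ("ME1", "Fa0/15", "Open"), ("ME2", "Fa0/15", "Open"),
    ("ME2", "Fa0/13", "Fail"),
]

print(path(forwarding_links(HEALTHY), "ME2", "ME1"))  # ['ME2', 'ME3', 'ME1']
print(path(forwarding_links(BROKEN), "ME2", "ME1"))   # ['ME2', 'ME1']
```

With the ring intact the only usable path from ME2 to ME1 goes through ME3; once the alternate port opens, the direct link is used.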

Before we dig into detailed show commands, we will verify REP operation. Using a local SPAN session on ME3, I look at traffic on ME3’s primary edge port. At present, we can see several LSL PDUs captured by Wireshark.

! ME3
monitor session 1 source interface Po13
monitor session 1 destination interface Fa0/12 encapsulation replicate


It is interesting to see that Wireshark is calling this an STP frame, primarily because it doesn’t understand REP, and really only understands the STP destination MAC address. That MAC address is highlighted below in the REP packet debug (green). I also verify the MAC address of ME1’s LACP bond as this is the source MAC of the frame (yellow). This debug matches the Wireshark capture, and because the Ethertype isn’t specified (the field carries a length of 576, which is 590 minus 14 bytes of Ethernet encapsulation), the DSAP and SSAP are. Both of them are 0xAA (pink), so I am confident that this is the same type of packet seen above.

ME1#show interfaces port-channel 13 | include bia
Hardware is EtherChannel, address is 001d.4692.f713 (bia 001d.4692.f713)
ME3#debug rep packet
REP debug packets debugging is on

! ME3
00:43:21: REP store seq#1310 for re_tx on Fa0/13 (pak ref:2) fifo size
00:43:21: REP Rx packet on Port-channel13
00:43:21: 0x00 0x80 0x00 0xCB 0x00 0x00 0x00 0x1A
00:43:21: 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00
00:43:21: 0x01 0x80 0xC2 0x00 0x00 0x00 0x00 0x1D
00:43:21: 0x46 0x92 0xF7 0x13 0x02 0x40 0xAA 0xAA
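The fields called out above can be picked out of the debug bytes with a short sketch. I am assuming the first 16 bytes of the dump are platform-internal and that the Ethernet header starts at offset 16; that offset is inferred from where the STP MAC appears, not documented anywhere I know of.

```python
# Bytes copied from the "debug rep packet" output above
DEBUG_BYTES = """
0x00 0x80 0x00 0xCB 0x00 0x00 0x00 0x1A
0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00
0x01 0x80 0xC2 0x00 0x00 0x00 0x00 0x1D
0x46 0x92 0xF7 0x13 0x02 0x40 0xAA 0xAA
"""

ETH_OFFSET = 16  # assumption: bytes before this are not part of the frame

def cisco_mac(raw: bytes) -> str:
    """Format 6 bytes as a dotted Cisco-style MAC address."""
    text = raw.hex()
    return ".".join(text[i:i + 4] for i in range(0, 12, 4))

raw = bytes(int(token, 16) for token in DEBUG_BYTES.split())

dst = cisco_mac(raw[ETH_OFFSET:ETH_OFFSET + 6])
src = cisco_mac(raw[ETH_OFFSET + 6:ETH_OFFSET + 12])
length = int.from_bytes(raw[ETH_OFFSET + 12:ETH_OFFSET + 14], "big")
dsap, ssap = raw[ETH_OFFSET + 14], raw[ETH_OFFSET + 15]

print(dst)                   # IEEE STP multicast MAC
print(src)                   # ME1 Port-channel13 address
print(length, length + 14)   # 802.3 length and total frame size
print(hex(dsap), hex(ssap))  # LLC DSAP and SSAP
```

Because the two bytes after the source MAC decode to 576, which is below 1536, they are an 802.3 length rather than an Ethertype, which is why the LLC DSAP/SSAP follow.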

For fastest convergence, we can reduce the LSL age timer on each interface. We can only set the “age timer”, which is a dead timer, and the hello interval is one third of this value. The downside to using port-channels with REP is that the fastest age timer is 1 second.

ME3(config-if)#rep lsl-age-timer 120
Error: LSL Age Out timer value less than 1000 not supported on etherchannel

That being said, we will configure an LSL age-timer of 1 second on the port-channel interface and the minimum of 120 ms on all other interfaces. Notice that if you have mismatched timers, the REP ring becomes unstable. By configuring only ME3 with this new interval, I’ve created a mismatch with ME1. ME3 is flapping up and down since ME1 is sending hellos too slowly.

! ME3
interface Port-channel13
 rep lsl-age-timer 1000

! ME1
%REP-4-LINKSTATUS: Port-channel13 (segment 1) is non-operational due to neighbor mismatch

! ME3
%REP-4-LINKSTATUS: Port-channel13 (segment 1) is operational
%REP-4-LINKSTATUS: Port-channel13 (segment 1) is non-operational due to neighbor not responding
%REP-4-LINKSTATUS: Port-channel13 (segment 1) is operational
%REP-4-LINKSTATUS: Port-channel13 (segment 1) is non-operational due to neighbor not responding

We can confirm the issue by checking the REP interface details on both ME1 and ME3 to see the different timers. Filtering on “LSL” also shows us the PDU counters, similar to how STP interface-level show commands would reveal BPDU TX/RX counters.

ME3#show interfaces port-channel 13 rep detail | include LSL
LSL Ageout Timer: 1000 ms
LSL PDU rx: 4964, tx: 3907
BPA (STCN, LSL) TLV rx: 0, tx: 0
ME1#show interfaces port-channel 13 rep detail | include LSL
LSL Ageout Timer: 5000 ms
LSL PDU rx: 3921, tx: 5104
BPA (STCN, LSL) TLV rx: 0, tx: 0

To fix it, we simply match the timers on ME1 using the same configuration (not shown). Then, I adjust the timer to 120 ms on all ordinary ports (also not shown). The blinking lights on the switches are going crazy now, as expected, on the inter-switch links not including the LACP members. Below is a quick verification of all REP interfaces to show their LSL ageout timers.

ME1#show interfaces rep detail | include ^(Fast|Port-)|Ageout
FastEthernet0/15 REP enabled
LSL Ageout Timer: 120 ms
Port-channel13 REP enabled
LSL Ageout Timer: 1000 ms
ME2#show interfaces rep detail | include ^(Fast|Port-)|Ageout
FastEthernet0/13 REP enabled
LSL Ageout Timer: 120 ms
FastEthernet0/15 REP enabled
LSL Ageout Timer: 120 ms
ME3#show interfaces rep detail | include ^(Fast|Port-)|Ageout
FastEthernet0/13 REP enabled
LSL Ageout Timer: 120 ms
Port-channel13 REP enabled
LSL Ageout Timer: 1000 ms

Even with this fast timer, only about 12-16 kbps of bandwidth is used. It only consumes a small amount of CPU cycles as well, but more REP interfaces implies more CPU consumed for hello generation/processing. These are important things to check when using any kind of fast failover active detection mechanism.

ME1#show interfaces fastEthernet 0/15 | include put_rate
5 minute input rate 16000 bits/sec, 23 packets/sec
5 minute output rate 12000 bits/sec, 23 packets/sec
ME1#show processes cpu sorted
CPU utilization for five seconds: 5%/0%; one minute: 6%; five minutes: 6%
 PID Runtime(ms)   Invoked   uSecs   5Sec   1Min   5Min TTY Process
 198       10674      2175    4907  0.31%  0.08%  0.04%   0 Exec
 126        2642    125673      21  0.31%  0.35%  0.30%   0 REP LSL Hello PP
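As a quick sanity check on the packet rate, the arithmetic below derives the hello interval and hellos per second from the age timer, using the rule stated earlier that the hello interval is one third of the age timer. `lsl_hello_stats` is just an illustrative helper name.

```python
def lsl_hello_stats(age_timer_ms: float):
    """Return (hello interval in ms, hellos per second) for an LSL age timer."""
    hello_ms = age_timer_ms / 3   # hello interval is one third of the age timer
    return hello_ms, 1000 / hello_ms

# 5000 ms was ME1's timer before the change; 1000 and 120 are the values configured here
for age in (5000, 1000, 120):
    hello_ms, pps = lsl_hello_stats(age)
    print(f"age {age} ms -> hello every {hello_ms:.1f} ms, about {pps:.1f} pps")
```

For the 120 ms timer this works out to a hello every 40 ms, or 25 per second, which is in the same range as the 23 packets/sec counted on Fa0/15.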

To avoid flooding these LSLs to our Wireshark capture, from now on we will only look at messages in the admin VLAN 29. We apply a quick SPAN filter to achieve this, which allows us to test REP failover. I verify the SPAN configuration because failing to do so may degrade my sniffer PC.

! ME3
monitor session 1 filter vlan 29

ME3#show monitor session 1
Session 1
---------
Type              : Local Session
Source Ports      :
    Both          : Po13
Destination Ports : Fa0/12
    Encapsulation : Replicate
          Ingress : Disabled
Filter VLANs      : 29

Before we simulate a failure, we can trace the basic traffic flow for a number of VLANs. Test VLANs 100-103 have been defined on every switch with corresponding SVIs. We quickly check the REP topology again for reference, which indicates port 15 on ME2 is blocking. Traffic between ME1 and ME2 should therefore transit ME3.

ME3#show rep topology segment 1
REP Segment 1
BridgeName       PortName   Edge Role
---------------- ---------- ---- ----
ME3              Po13       Pri  Open
ME1              Po13            Open
ME1              Fa0/15          Open
ME2              Fa0/15          Alt
ME2              Fa0/13          Open
ME3              Fa0/13     Sec  Open

To prove this, I ping the broadcast address from ME2. This will source broadcast pings from all connected interfaces, which should quickly discover all of the other SVIs on the network. Specifically, we want to ensure we see responses from ME1’s test VLANs.

ME2#ping 255.255.255.255 repeat 1
Type escape sequence to abort.
Sending 1, 100-byte ICMP Echos to 255.255.255.255, timeout is 2 seconds:
Reply to request 0 from 10.0.10.91, 8 ms
Reply to request 0 from 103.0.0.3, 16 ms
Reply to request 0 from 102.0.0.3, 16 ms
Reply to request 0 from 103.0.0.1, 16 ms
Reply to request 0 from 101.0.0.3, 16 ms
Reply to request 0 from 102.0.0.1, 8 ms
Reply to request 0 from 100.0.0.3, 8 ms
Reply to request 0 from 101.0.0.1, 8 ms
Reply to request 0 from 10.0.10.93, 8 ms
Reply to request 0 from 100.0.0.1, 8 ms

The MAC addresses for ME1 are shown below so that we can correlate them with ME2’s CAM table for those 4 test VLANs. Notice how Fa0/13, the port facing ME3, is the outgoing port for all of them (along with the 4 SVIs on ME3), and nothing is forwarded out of port 15. This shows that the REP blocked port is working as expected.

ME1#show interfaces vlan 100 | include bia
Hardware is EtherSVI, address is 001d.4692.f741 (bia 001d.4692.f741)
ME1#show interfaces vlan 101 | include bia
Hardware is EtherSVI, address is 001d.4692.f742 (bia 001d.4692.f742)
ME1#show interfaces vlan 102 | include bia
Hardware is EtherSVI, address is 001d.4692.f743 (bia 001d.4692.f743)
ME1#show interfaces vlan 103 | include bia
Hardware is EtherSVI, address is 001d.4692.f744 (bia 001d.4692.f744)

ME2#show mac address-table dynamic | exclude _1_
          Mac Address Table
-------------------------------------------
Vlan    Mac Address       Type        Ports
----    -----------       --------    -----
 100    001d.4692.f741    DYNAMIC     Fa0/13
 100    001f.9d0b.16c1    DYNAMIC     Fa0/13
 101    001d.4692.f742    DYNAMIC     Fa0/13
 101    001f.9d0b.16c2    DYNAMIC     Fa0/13
 102    001d.4692.f743    DYNAMIC     Fa0/13
 102    001f.9d0b.16c3    DYNAMIC     Fa0/13
 103    001d.4692.f744    DYNAMIC     Fa0/13
 103    001f.9d0b.16c4    DYNAMIC     Fa0/13

Now, I will simulate a failure by unplugging the cable between ME2 and ME3. First, I enable pertinent REP debugs on all switches to observe the event.

! All switches
debug rep failure-recovery
debug rep prsm

Because this breaks the ring, traffic should immediately be allowed to forward over ME2’s port 15. First, ME2 and ME3 receive the link-down notification from the platform after the carrier-delay expires. Both of them show the exact same debug as they both witness the layer 1 failure. The port transitions from the OPEN to FAILED state. Both switches also flush MAC addresses learned on this port from the CAM table, which will trigger temporary unknown unicast flooding until the new port is learned. Each of them notifies the other switches in the segment as well.

! ME2 and ME3
[Fa0/13]: Link down notification received
rep_pr Fa0/13 - pr: during state OPEN_PORT, got event 1(link_not_op)
@@@ rep_pr Fa0/13 - pr: OPEN_PORT -> FAILED_PORT
[Fa0/13]rep_pr_act_fp@101: PRSM->failed state, 0x80
HREP: Sent hfl pak on Fa0/13, vlan_id 29, di 379
REP Flush from Fa0/13 to REP, sending msg
[Fa0/13]: Local failure handling
[Fa0/13]rep_pr_act_send_hw_ind@251: PRSM->sending hw failure

Even with debug off, both ME2 and ME3 generate syslog messages to show the REP link failure.

! ME2 and ME3
%REP-4-LINKSTATUS: FastEthernet0/13 (segment 1) is non-operational due to port down

ME1 is connected to both switches and receives this HW failure indication from both ME2 and ME3 on its LACP bond and port 15.


! ME1
REP LSL-OP Rx EXT Local (Fa0/15 seg:1, tc:1, frs:0) prio: 0x80 0x00 0x0F 0x04 0xC5 0xA4 0x53 0xD8 0x00
REP Flush from Fa0/15 to REP, sending msg
REP LSL-OP Rx INT Local (Po13 seg:1, tc:1, frs:0) prio: 0x80 0x00 0x0F 0x04 0xC5 0xA4 0x53 0xD8 0x00
REP Flush from Po13 to REP, sending msg

From ME3’s perspective, the topology now contains a failed port. Failed ports terminate segments just as edge ports do; however, since the segment is incomplete, all ports are open. We now see that ME2 port 15 is available for forwarding. We also check it from ME2’s perspective, which lists the nodes in the reverse order but with the same port roles. The presence of a failed port on any segment causes REP to warn the user that there is a segment failure.

ME3#show rep topology segment 1
REP Segment 1
Warning: REP detects a segment failure, topology may be incomplete

BridgeName       PortName   Edge Role
---------------- ---------- ---- ----
ME3              Po13       Sec  Open
ME1              Po13            Open
ME1              Fa0/15          Open
ME2              Fa0/15          Open
ME2              Fa0/13          Fail

ME2#show rep topology segment 1
REP Segment 1
Warning: REP detects a segment failure, topology may be incomplete

BridgeName       PortName   Edge Role
---------------- ---------- ---- ----
ME2              Fa0/13          Fail
ME2              Fa0/15          Open
ME1              Fa0/15          Open
ME1              Po13            Open
ME3              Po13       Sec  Open

We ping the broadcast address from ME2 again (not shown) so that all MAC addresses in all VLANs are resolved. Checking the CAM for all non-VLAN 1 entries, we can see that port 15 is used to reach ME1’s test SVIs as expected.

ME2#show mac address-table dynamic | exclude _1_
          Mac Address Table
-------------------------------------------
Vlan    Mac Address       Type        Ports
----    -----------       --------    -----
 100    001d.4692.f741    DYNAMIC     Fa0/15
 100    001f.9d0b.16c1    DYNAMIC     Fa0/15
 101    001d.4692.f742    DYNAMIC     Fa0/15
 101    001f.9d0b.16c2    DYNAMIC     Fa0/15
 102    001d.4692.f743    DYNAMIC     Fa0/15
 102    001f.9d0b.16c3    DYNAMIC     Fa0/15
 103    001d.4692.f744    DYNAMIC     Fa0/15
 103    001f.9d0b.16c4    DYNAMIC     Fa0/15
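Comparing the CAM table before and after the failure can also be scripted. The sketch below is a hypothetical helper, not part of the lab: it assumes the four-column row layout shown in the `show mac address-table dynamic` output, and the function names `learned_ports` and `moved_entries` are made up for illustration.

```python
import re

# One table row: VLAN, dotted MAC, entry type, port.
ENTRY = re.compile(
    r"^\s*(\d+)\s+([0-9a-f]{4}\.[0-9a-f]{4}\.[0-9a-f]{4})\s+(\w+)\s+(\S+)\s*$"
)


def learned_ports(output):
    """Return {(vlan, mac): port} for every entry row in the command output."""
    entries = {}
    for line in output.splitlines():
        match = ENTRY.match(line)
        if match:
            vlan, mac, _entry_type, port = match.groups()
            entries[(int(vlan), mac)] = port
    return entries


def moved_entries(before, after):
    """Return {(vlan, mac): (old_port, new_port)} for entries whose port changed."""
    old, new = learned_ports(before), learned_ports(after)
    return {k: (old[k], new[k]) for k in old if k in new and old[k] != new[k]}


if __name__ == "__main__":
    before = (
        " 100    001d.4692.f741    DYNAMIC     Fa0/13\n"
        " 101    001d.4692.f742    DYNAMIC     Fa0/13\n"
    )
    after = (
        " 100    001d.4692.f741    DYNAMIC     Fa0/15\n"
        " 101    001d.4692.f742    DYNAMIC     Fa0/15\n"
    )
    for (vlan, mac), (old, new) in sorted(moved_entries(before, after).items()):
        print(f"VLAN {vlan} {mac}: {old} -> {new}")
```

Header and separator rows never match the pattern because they do not begin with a numeric VLAN ID, so the raw command output can be pasted in unchanged.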

With debugging enabled, we will plug the cable back in. Both ME2 and ME3 change their port state from “FAILED” to “FAILED with no external neighbor”. This is an indication that the electrical carrier is back up but REP is not operational yet.

! ME2 and ME3
@@@ rep_pr Fa0/13 - pr: FAILED_PORT -> FAILED_PORT_NO_EXT_NEIGHBOR[Fa0/13]rep_pr_act_no_ext_neighbor@415: PRSM->fp_no_ext_neighbor state
[Fa0/13]rep_pr_lsl_event_handler@645: REP_MSG_EXT_PEER_GONE rcvd

After exchanging LSL PDUs, the two neighbors discover one another again, and move their ports to the alternate state. There is still no forwarder on the segment, which is incomplete.

! ME2 and ME3
rep_pr Fa0/13 - pr: during state FAILED_PORT_NO_EXT_NEIGHBOR, got event 0(link_op)
@@@ rep_pr Fa0/13 - pr: FAILED_PORT_NO_EXT_NEIGHBOR -> ALTERNATE_PORT[Fa0/13]rep_pr_act_ap@216: PRSM->alternate state
[Fa0/13]rep_pr_lsl_event_handler@625: REP_MSG_LINKOP_TRUE rcvd

Since one of the ports needs to be forwarding, an election takes place during the formation of a REP ring. It was not discussed until now because it is very detailed. Like STP, each port has a 64-bit priority associated with it; in REP, it is the concatenation of several fields. The first 4 bits include a failed bit (the most significant bit, set only on failed ports) followed by user-configurable fields. The next 12 bits represent a port identifier, and the remaining 48 bits represent the bridge MAC address. The alternate port election only happens on these blocked ports when an “open” port must be chosen, so the switches exchange blocked port advertisements (BPAs) containing their priorities. If a port receives a BPA with a higher priority, it transitions to the open state as shown below. At the end of the link-up condition, REP displays a syslog message.

! ME3
rep_pr Fa0/13 - pr: during state ALTERNATE_PORT, got event 2(pre_empt_ind)
@@@ rep_pr Fa0/13 - pr: ALTERNATE_PORT -> UNBLOCK_VLANS_ACT
rep_pr Fa0/13 - pr: during state UNBLOCK_VLANS_ACT, got event 3(no_local_block_vlan)
@@@ rep_pr Fa0/13 - pr: UNBLOCK_VLANS_ACT -> OPEN_PORT[Fa0/13]rep_pr_act_op@391: PRSM->active state

! ME2 and ME3
%REP-4-LINKSTATUS: FastEthernet0/13 (segment 1) is operational
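The port state transitions seen in the debugs so far can be summarized as a small lookup table. This is only a reading aid built from the debug output in this section, not Cisco's implementation. The event name `ext_peer_gone` is an assumption derived from the `REP_MSG_EXT_PEER_GONE` message, since the debug does not print the triggering event for that transition.

```python
# Transition table reconstructed from the REP port state machine (PRSM) debugs.
# Keys are (current_state, event); values are the next state.
TRANSITIONS = {
    ("OPEN_PORT", "link_not_op"): "FAILED_PORT",
    # Assumed event name; the debug only shows REP_MSG_EXT_PEER_GONE.
    ("FAILED_PORT", "ext_peer_gone"): "FAILED_PORT_NO_EXT_NEIGHBOR",
    ("FAILED_PORT_NO_EXT_NEIGHBOR", "link_op"): "ALTERNATE_PORT",
    ("ALTERNATE_PORT", "pre_empt_ind"): "UNBLOCK_VLANS_ACT",
    ("UNBLOCK_VLANS_ACT", "no_local_block_vlan"): "OPEN_PORT",
}


def run(state, events):
    """Apply events in order and return the list of states visited."""
    visited = [state]
    for event in events:
        state = TRANSITIONS[(state, event)]
        visited.append(state)
    return visited


if __name__ == "__main__":
    # ME3 Fa0/13: cable pulled, cable restored, then the port is told to open.
    path = run(
        "OPEN_PORT",
        ["link_not_op", "ext_peer_gone", "link_op",
         "pre_empt_ind", "no_local_block_vlan"],
    )
    print(" -> ".join(path))
```

A port that loses the election simply never receives `pre_empt_ind` and stays in `ALTERNATE_PORT`.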

To see the priorities, we can show the details of the REP topology. Since the link is not failed and the port priorities are both zero, the port numbers are compared. These also match, so the MAC addresses are compared. Since ME3 port 13 receives a BPA with a better priority, it opens, and since ME2 received a BPA with a worse priority, it remains an alternate port. BPAs are only sent by alternate ports during this election process, as they carry the list of blocked VLANs; VLAN load sharing is discussed later.

ME3#show rep topology segment 1 detail | begin ME2, Fa0/13
ME2, Fa0/13 (Intermediate)
  Alternate Port, some vlans blocked
  Bridge MAC: 04c5.a453.d800
  Port Number: 00F
  Port Priority: 000
  Neighbor Number: 5 / [-2]
ME3, Fa0/13 (Secondary Edge)
  Open Port, all vlans forwarding
  Bridge MAC: 001f.9d0b.1680
  Port Number: 00F
  Port Priority: 000
  Neighbor Number: 6 / [-1]

While we don’t have much control over the exact port priority values, we can administratively configure a preferred alternate port. This is somewhat unintuitive since we normally adjust priorities to prefer something, but in this case, the opposite is true. Since the reception of a superior BPA causes a port to open, we can configure a higher priority on ME3 port 13 so that ME2 port 13 becomes open for this link. We now identify this port as the secondary edge port with the “preferred” flag. We can verify the configuration by checking the REP interface details.

! ME3
interface FastEthernet0/13
 rep segment 1 edge preferred

ME3#show interfaces fastEthernet 0/13 rep detail | include Pref
Segment-id: 1 (Preferred Edge)
Preferred flag: Yes

However, when we check the REP topology, nothing has changed. Despite having a higher priority, ME3 port 13 is still forwarding.

ME3#show rep topology segment 1 detail | begin ME2, Fa0/13
ME2, Fa0/13 (Intermediate)
  Alternate Port, some vlans blocked
  Bridge MAC: 04c5.a453.d800
  Port Number: 00F
  Port Priority: 000
  Neighbor Number: 5 / [-2]
ME3, Fa0/13 (Secondary Edge)
  Open Port, all vlans forwarding
  Bridge MAC: 001f.9d0b.1680
  Port Number: 00F
  Port Priority: 010
  Neighbor Number: 6 / [-1]

The alternate port elections only happen once during the initial negotiation when a segment is complete, and they are not preemptive. As such, I quickly disconnect and reconnect the cable, which triggers the election again. Now, ME2 port 13 is forwarding and the secondary edge port is blocked. In this way, we can somewhat influence the REP convergence around the ring.

ME3#show rep topology segment 1 detail | begin ME2, Fa0/13
ME2, Fa0/13 (Intermediate)
  Open Port, all vlans forwarding
  Bridge MAC: 04c5.a453.d800
  Port Number: 00F
  Port Priority: 000
  Neighbor Number: 5 / [-2]
ME3, Fa0/13 (Secondary Edge)
  Alternate Port, some vlans blocked
  Bridge MAC: 001f.9d0b.1680
  Port Number: 00F
  Port Priority: 010
  Neighbor Number: 6 / [-1]

Unplugging the cable again, we can see the “failed” bit in action. Notice that the priority is a very high number, and nothing can ever be higher than it. As such, assuming the link was up but there was a REP communication issue, this port would always be the alternate port.

ME3#show rep topology segment 1 detail | begin ME2, Fa0/13
ME2, Fa0/13 (Intermediate)
  Failed Port, Reason: Physical link down
  Bridge MAC: 04c5.a453.d800
  Port Number: 00F
  Port Priority: 800
  Neighbor Number: Not available
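The three outcomes seen in this section (the default election, the election after setting the preferred flag, and a failed port) all follow from comparing the priority fields in order. The sketch below models each port's key as a (priority, port number, bridge MAC) tuple using the values from the `show rep topology segment 1 detail` output. The tuple representation and the function names are illustrative assumptions, not the on-wire encoding.

```python
def port_key(priority_hex, port_number_hex, bridge_mac):
    """Build a comparable key: priority first, then port number, then MAC."""
    mac = int(bridge_mac.replace(".", ""), 16)
    return (int(priority_hex, 16), int(port_number_hex, 16), mac)


def elect_alternate(ports):
    """Return the name of the port that stays alternate (blocked).

    A port that hears a BPA with a higher key opens, so the port holding
    the highest key is the one left as the alternate port.
    """
    return max(ports, key=lambda name: ports[name])


if __name__ == "__main__":
    default = {
        "ME2 Fa0/13": port_key("000", "00F", "04c5.a453.d800"),
        "ME3 Fa0/13": port_key("000", "00F", "001f.9d0b.1680"),
    }
    # Priorities and port numbers tie, so the higher bridge MAC decides.
    print(elect_alternate(default))

    preferred = dict(default)
    preferred["ME3 Fa0/13"] = port_key("010", "00F", "001f.9d0b.1680")
    # The preferred flag raises ME3's priority, so ME3 now stays blocked.
    print(elect_alternate(preferred))

    failed = dict(preferred)
    failed["ME2 Fa0/13"] = port_key("800", "00F", "04c5.a453.d800")
    # The failed bit outranks every other field.
    print(elect_alternate(failed))
```

Because the election is not preemptive, a real segment only re-evaluates these keys when the election is triggered again, for example by a link flap.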


Wireshark doesn’t help us decode the HFL packets at all, but we can still note a few interesting things. First, despite the dot1q tag not showing up, we know that this was in VLAN 29 since the local SPAN filter is only capturing that VLAN. The destination MAC address looks similar to the one typically used by Cisco for VTP, CDP, and others, except the last character is an ‘e’. Wireshark identifies this as ISL encapsulation. Either way, we know that since it isn’t the IEEE STP MAC address, even STP-aware non-Cisco switches would forward this traffic throughout the network. That is why the administrative VLAN can scope the REP signaling traffic.

Edge ports also send end point advertisements (EPAs) occasionally via LSL in the native VLAN (not HFL in the REP admin VLAN). We can enable debugs on ME3 to see the EPAs being sent, which contain much more information than normal LSLs. In fact, the large LSL packet we saw earlier in the packet dumps was probably an EPA. Both the primary and secondary edge ports originate these. The network is currently stable so the states aren’t changing, but we can see the topology information summarized in the debug. The primary edge port lists itself first and then the hops to the secondary edge port. The secondary edge port does the opposite in the reverse direction; this is how the edge ports know the segment is complete.

ME3#debug rep epasm
REP debug epa sm debugging is on

! ME3
rep_epa_edge Po13 - epa-edge: during state EDGE_PRIMARY, got event 0(epa_hello_tmo)
@@@ rep_epa_edge Po13 - epa-edge: EDGE_PRIMARY -> EDGE_PRIMARY
EPA Append BridgeInfo TLV LEN = 13 ME3
EPA Append PortInfo TLV LEN = 18, SubType 2 Po13
EPA Create REP_TLV_EPA_INFO TLV LEN = 35 , SegNum 0
Msg Length Rx 147, Datagram Rx 175
Bridge:ME3 Port1:Po13
Bridge:ME1 Port1:Po13 Port2:Fa0/15
Bridge:ME2 Port1:Fa0/15 Port2:Fa0/13
rep_epa_edge Fa0/13 - epa-edge: during state EDGE_SECONDARY, got event 0(epa_hello_tmo)
@@@ rep_epa_edge Fa0/13 - epa-edge: EDGE_SECONDARY -> EDGE_SECONDARY
EPA Append BridgeInfo TLV LEN = 13 ME3
EPA Append PortInfo TLV LEN = 20, SubType 2 Fa0/13
EPA Create REP_TLV_EPA_INFO TLV LEN = 37 , SegNum 0
Msg Length Rx 149, Datagram Rx 177
Bridge:ME3 Port1:Fa0/13
Bridge:ME2 Port1:Fa0/13 Port2:Fa0/15
Bridge:ME1 Port1:Fa0/15 Port2:Po13

If we unplug the cable to cause a segment failure, the EPA state machine transitions the primary edge port to a secondary edge port. The reason is that the primary edge port is also responsible for VLAN load balancing; when there is a break in the segment, VLAN load balancing is impossible. As such, the edge port becomes a secondary port so it can effectively ignore that configuration. The debug also shows us that the triggering state change was a “segment break”.

! ME3
03:15:32: rep_epa_edge Po13 - epa-edge: during state EDGE_PRIMARY, got event 4(seg_break)
03:15:32: @@@ rep_epa_edge Po13 - epa-edge: EDGE_PRIMARY -> EDGE_SECONDARY[Po13]rep_epa_edge_act_edge_sec@113: EPA: edge sec
[Po13]rep_epa_egde_act_update_key@208: new key=61757

Each port in a REP segment is numbered with offsets as well. This isn’t the same as the hexadecimal port numbers seen earlier, but rather is a way of measuring distance away from edge ports. The primary edge port is number 1 and the secondary edge port is -1. Ports increment by 1 as they move away from the primary edge port and decrement by 1 as they move away from the secondary edge port. From any switch, we look at the topology details and filter on the most important information. We can see that each port is assigned both a positive and a negative number. The positive number measures the distance from the primary edge port and the negative number measures the distance from the secondary edge port.

! ME1
ME1#show rep topology segment 1 detail | include ^ME|Port,|Neighbor
ME3, Po13 (Primary Edge)
  Open Port, all vlans forwarding
  Neighbor Number: 1 / [-6]
ME1, Po13 (Intermediate)
  Open Port, all vlans forwarding
  Neighbor Number: 2 / [-5]
ME1, Fa0/15 (Intermediate)
  Open Port, all vlans forwarding
  Neighbor Number: 3 / [-4]
ME2, Fa0/15 (Intermediate)
  Open Port, all vlans forwarding
  Neighbor Number: 4 / [-3]
ME2, Fa0/13 (Intermediate)
  Open Port, all vlans forwarding
  Neighbor Number: 5 / [-2]
ME3, Fa0/13 (Secondary Edge)
  Alternate Port, some vlans blocked
  Neighbor Number: 6 / [-1]
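The numbering rule can be checked with a few lines of code. This is a minimal sketch, assuming the segment is given as an ordered list from the primary edge port to the secondary edge port; the helper names `neighbor_numbers` and `resolve_offset` are hypothetical.

```python
def neighbor_numbers(ports):
    """Return (port, positive_offset, negative_offset) for each port.

    `ports` is ordered from the primary edge port to the secondary edge port.
    """
    total = len(ports)
    return [(port, i, i - total - 1) for i, port in enumerate(ports, start=1)]


def resolve_offset(ports, offset):
    """Map a positive or negative neighbor offset to the port it selects."""
    if offset == 0 or abs(offset) > len(ports):
        raise ValueError("offset out of range for this segment")
    return ports[offset - 1] if offset > 0 else ports[offset]


if __name__ == "__main__":
    ring = ["ME3 Po13", "ME1 Po13", "ME1 Fa0/15",
            "ME2 Fa0/15", "ME2 Fa0/13", "ME3 Fa0/13"]
    for port, pos, neg in neighbor_numbers(ring):
        print(f"{port}: {pos} / [{neg}]")
    # Offsets 5 and -2 name the same port in a six-port segment.
    print(resolve_offset(ring, 5), "|", resolve_offset(ring, -2))
```

The positive and negative offsets of any port always differ by the number of ports plus one, which is why either value can be used to reference the same port.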


Ports also have unique IDs. We can reveal these using the interface-level REP show command. Notice that this is the 64-bit field that is used in BPA elections to determine alternate ports. The port number is the 12-bit number following the first nibble. Below, the port number for ME1 port 15 is 0x011, or 17 in decimal. This makes sense since port 15 is the 17th port on the switch, as the two SFPs count as the first two. Virtual interfaces like LACP bonds use system-generated values, like 160 (0x0A0) shown below.

ME1#show interfaces rep detail | include REP_enabled|^PortID
FastEthernet0/15 REP enabled
PortID: 0011001D4692F700
Port-channel13 REP enabled
PortID: 00A0001D4692F700
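The field layout described above can be applied directly to the PortID strings. The following is a minimal sketch that assumes the 4-bit, 12-bit, 48-bit split given in the text; the function name and the dictionary keys are made up for illustration.

```python
def decode_port_id(port_id):
    """Split a 16-hex-digit REP PortID into its three fields."""
    if len(port_id) != 16:
        raise ValueError("expected 16 hex digits (64 bits)")
    value = int(port_id, 16)
    top_nibble = value >> 60                 # failed bit plus configurable bits
    port_number = (value >> 48) & 0xFFF      # 12-bit port number
    mac = value & 0xFFFFFFFFFFFF             # 48-bit bridge MAC
    mac_hex = f"{mac:012x}"
    dotted = ".".join(mac_hex[i:i + 4] for i in range(0, 12, 4))
    return {"top_nibble": top_nibble,
            "port_number": port_number,
            "bridge_mac": dotted}


if __name__ == "__main__":
    for port_id in ("0011001D4692F700", "00A0001D4692F700"):
        print(port_id, decode_port_id(port_id))
```

Decoding both IDs yields the same bridge MAC with different port numbers, which matches the two interfaces belonging to ME1.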

Using these numbers/port IDs, we can achieve VLAN load balancing. This is configured only on the primary edge port and performs two key tasks. First, the administrator specifies a port in the segment where certain VLANs should be blocked. Second, any VLANs not specified by this list are blocked at the primary edge port. An interface can be specified in one of three ways:

1. Using the port ID specified manually.
2. Using the neighbor offset number, positive or negative, specified manually.
3. Using the “preferred” keyword, which will reference the preferred port previously configured in the ring.

Normally, enforcing this kind of change on a REP ring is disruptive. Cisco decided not to break the REP ring just to enable VLAN load balancing, so after configuring any of these three methods, we must enable the feature in one of two ways. In both cases, it is a disruptive event.

1. Manually pre-empt the ring from the switch with the primary edge port. This will likely result in some traffic disruption.
2. Configure a delay on the interface which begins counting down after a link failure and recovery.

We will quickly test all methods. Our first test will be using the port ID method. We will choose ME1 port 15 as the alternate port for VLANs 100 and 101, while the primary edge port (ME3 LACP bond) will be the alternate port for all other VLANs. A quick look at the topology shows no change, as expected.

! ME3
interface Port-channel13
 rep block port id 0011001D4692F700 vlan 100-101

ME3#show rep topology segment 1
REP Segment 1
BridgeName       PortName   Edge Role
---------------- ---------- ---- ----
ME3              Po13       Pri  Open
ME1              Po13            Open
ME1              Fa0/15          Open
ME2              Fa0/15          Open
ME2              Fa0/13          Open
ME3              Fa0/13     Sec  Alt

I will manually preempt this REP ring to enable VLAN load balancing. The parser warns us about traffic interruptions as well.

ME3#rep preempt segment 1
The command will cause a momentary traffic disruption.
Do you still want to continue? [confirm]y
Proceeding with Manual Preemption

Now, we look at the topology in detail. We see two alternate ports in the topology, both of which specify that “some” VLANs are blocked. The primary edge port has an increased port-priority, which means it will become the alternate port for this link. The same is true for the port we specified above, which is also an alternate port.

ME3#show rep topology segment 1 detail
REP Segment 1
ME3, Po13 (Primary Edge)
  Alternate Port, some vlans blocked
  Bridge MAC: 001f.9d0b.1680
  Port Number: 0A0
  Port Priority: 080
  Neighbor Number: 1 / [-6]
ME1, Po13 (Intermediate)
  Open Port, all vlans forwarding
  Bridge MAC: 001d.4692.f700
  Port Number: 0A0
  Port Priority: 000
  Neighbor Number: 2 / [-5]
ME1, Fa0/15 (Intermediate)
  Alternate Port, some vlans blocked
  Bridge MAC: 001d.4692.f700
  Port Number: 011
  Port Priority: 040
  Neighbor Number: 3 / [-4]
ME2, Fa0/15 (Intermediate)
  Open Port, all vlans forwarding
  Bridge MAC: 04c5.a453.d800
  Port Number: 011
  Port Priority: 000
  Neighbor Number: 4 / [-3]
ME2, Fa0/13 (Intermediate)
  Open Port, all vlans forwarding
  Bridge MAC: 04c5.a453.d800
  Port Number: 00F
  Port Priority: 000
  Neighbor Number: 5 / [-2]
ME3, Fa0/13 (Secondary Edge)
  Open Port, all vlans forwarding
  Bridge MAC: 001f.9d0b.1680
  Port Number: 00F
  Port Priority: 010
  Neighbor Number: 6 / [-1]

We can check the individual port-level details to see which VLANs are blocked per-port. Notice that the “block VLAN” output shows the VLANs actually blocked, while the “configured” outputs show only what is configured. The configuration only exists on the primary edge port, which is why ME1 doesn’t show it.

ME3#show interfaces port-channel 13 rep detail | include Block
Blocked VLAN: 1-99,102-4094
Configured Load-balancing Block Port: 0011001D4692F700
Configured Load-balancing Block VLAN: 100-101

ME1#show interfaces fast 0/15 rep detail | include Block
Blocked VLAN: 100-101
Configured Load-balancing Block Port: none
Configured Load-balancing Block VLAN: none
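The two “Blocked VLAN” lists are complements of each other across the full VLAN range. The sketch below reproduces that arithmetic; it assumes a VLAN space of 1-4094, as implied by the output above, and the helper names are hypothetical.

```python
def parse_vlans(spec):
    """Expand a VLAN list such as '100-101' or '1-99,102-4094' into a set."""
    vlans = set()
    for part in spec.split(","):
        low, _, high = part.partition("-")
        vlans.update(range(int(low), int(high or low) + 1))
    return vlans


def format_vlans(vlans):
    """Collapse a set of VLAN IDs back into IOS-style range notation."""
    ranges = []
    for vlan in sorted(vlans):
        if ranges and vlan == ranges[-1][1] + 1:
            ranges[-1][1] = vlan          # extend the current run
        else:
            ranges.append([vlan, vlan])   # start a new run
    return ",".join(str(a) if a == b else f"{a}-{b}" for a, b in ranges)


def primary_edge_blocked(configured, all_vlans="1-4094"):
    """VLANs left for the primary edge port to block."""
    return format_vlans(parse_vlans(all_vlans) - parse_vlans(configured))


if __name__ == "__main__":
    # VLANs configured to be blocked at the chosen port.
    print(primary_edge_blocked("100-101"))
```

Every VLAN is therefore blocked at exactly one of the two alternate ports, which keeps the ring loop-free while using both paths.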

Based on this output, we quickly verify the CAM tables. Since VLAN 101 is blocked on ME1 port 15, ME1 should reach ME2’s VLAN 101 SVI via its LACP bond to ME3. That is to say, it should take the “long way” around the switched network. However, VLAN 102 traffic should go direct over port 15 to ME2. We can check the ARP cache and CAM table on ME1 to confirm this. VLAN 101 entries are in green and VLAN 102 entries are in yellow.

ME1#show ip arp | include 0\.0\.2
Internet  101.0.0.2    0   04c5.a453.d842  ARPA   Vlan101
Internet  102.0.0.2    0   04c5.a453.d843  ARPA   Vlan102

ME1#show mac address-table dynamic | include 04c5.a453.d84[23]
 101    04c5.a453.d842    DYNAMIC     Po13
 102    04c5.a453.d843    DYNAMIC     Fa0/15

Next, we will use the numbering method. We will block ME2 port 13 for VLANs 100 and 101. We can use the number 5 (distance from primary edge) or -2 (distance from secondary edge) for the same effect. The topology will not change unless we manually preempt it again, or configure a preempt delay. To trigger this, I unplug ME1 port 15, which isn’t part of this VLAN load balancing scheme. It should not matter since the BPA elections will ultimately block the port we specify below, along with the primary edge port.

! ME3
interface Port-channel13
 rep preempt delay 15
 rep block port -2 vlan 100-101

After plugging the cable back in, we can see that it takes 15 seconds for the VLAN load balancing to start. 15 seconds is the minimum timer, so the classic REP ring behavior will converge at first, then be changed. Ultimately, we see the primary edge port blocking, along with the port we specified. The offset of -2 specifies a port that is adjacent to the secondary edge port; we verify that by checking the topology as well.

ME3#show rep topology segment 1
REP Segment 1
BridgeName       PortName   Edge Role
---------------- ---------- ---- ----
ME3              Po13       Pri  Open
ME1              Po13            Open
ME1              Fa0/15          Alt
ME2              Fa0/15          Open
ME2              Fa0/13          Open
ME3              Fa0/13     Sec  Open

ME3#show rep topology segment 1
REP Segment 1
BridgeName       PortName   Edge Role
---------------- ---------- ---- ----
ME3              Po13       Pri  Alt
ME1              Po13            Open
ME1              Fa0/15          Open
ME2              Fa0/15          Open
ME2              Fa0/13          Alt
ME3              Fa0/13     Sec  Open

The primary edge port looks similar to before, except now it shows the configured offset of -2 instead of an explicit port ID. We see that only VLANs 100 and 101 are allowed through the primary edge port, while all other VLANs are allowed through ME2 port 13.

ME3#show interfaces port-channel 13 rep detail | include Block
Blocked VLAN: 1-99,102-4094
Configured Load-balancing Block Port: -2
Configured Load-balancing Block VLAN: 100-101

ME2#show interfaces fast 0/13 rep detail | include Block
Blocked VLAN: 100-101
Configured Load-balancing Block Port: none
Configured Load-balancing Block VLAN: none


We quickly confirm this on ME2 by looking at the MAC addresses in the CAM table for ME3 in VLAN 100 and VLAN 103. Traffic in VLAN 103 (and most other VLANs) goes directly to ME3 via port 13, but traffic for VLANs 100 and 101 goes the long way through ME1 via port 15.

ME2#show ip arp | include 10[03]\.0\.0\.3
Internet  100.0.0.3    6   001f.9d0b.16c1  ARPA   Vlan100
Internet  103.0.0.3    0   001f.9d0b.16c4  ARPA   Vlan103

ME2#show mac address-table dynamic | include 001f.9d0b.16c[14]
 100    001f.9d0b.16c1    DYNAMIC     Fa0/15
 103    001f.9d0b.16c4    DYNAMIC     Fa0/13

The last VLAN load balancing configuration option is to rely on the previously-configured “preferred” port. We configured this on the secondary edge port earlier. We quickly verify this again.

ME3#show interfaces fast 0/13 rep detail | include Prefer
Segment-id: 1 (Preferred Edge)
Preferred flag: Yes

On the primary edge port, we remove the offset option and use the preferred option instead. This is more dynamic since now the user can just move the “preferred” option around the ring to make adjustments, rather than use port IDs or neighbor offsets.

! ME3
interface Port-channel13
 rep block port preferred vlan 100-101

The result of this configuration is that both edge ports are alternate ports for complementary VLAN sets. I quickly unplug a cable and plug it back in on ME2; either link is fine. Looking at the REP topology summary after waiting 15 seconds for VLAN load balancing to start, this looks incorrect since both edges are blocking. One may think that the network is dysfunctional since no traffic can ever leave the REP ring, but this is normal for VLAN load balancing.

ME3#show rep topology segment 1
REP Segment 1
BridgeName       PortName   Edge Role
---------------- ---------- ---- ----
ME3              Po13       Pri  Alt
ME1              Po13            Open
ME1              Fa0/15          Open
ME2              Fa0/15          Open
ME2              Fa0/13          Open
ME3              Fa0/13     Sec  Alt


The interface details are predictable as the configuration is based on the “prefer” flag, and the VLANs allowed are complementary.

ME3#show interfaces port-channel 13 rep detail | include Block
Blocked VLAN: 1-99,102-4094
Configured Load-balancing Block Port: prefer
Configured Load-balancing Block VLAN: 100-101

ME3#show interfaces fast 0/13 rep detail | include Block
Blocked VLAN: 100-101
Configured Load-balancing Block Port: none
Configured Load-balancing Block VLAN: none

We quickly verify that forwarding occurs according to this REP topology. ME1 in VLAN 100 is reachable via the LACP bond while ME1 in VLAN 102 is reachable via the indirect path through ME2. Personally, I find this kind of load balancing scheme the easiest to understand. If you make the secondary edge port the preferred port, then use the “prefer” option on the primary edge port, the forwarding paths are simple. Traffic will flow through one edge port or the other, and it’s very easy to determine which way the traffic is going. This is especially useful in “open segments”, which are described next. The configurations for the REP ring segment are presented below separately.

ME3#show ip arp | include 10[02]\.0\.0\.1
Internet  100.0.0.1    0   001d.4692.f741  ARPA   Vlan100
Internet  102.0.0.1    0   001d.4692.f743  ARPA   Vlan102

ME3#show mac address-table dynamic | include 001d.4692.f74[13]
 100    001d.4692.f741    DYNAMIC     Po13
 102    001d.4692.f743    DYNAMIC     Fa0/13

Next, we modify the architecture slightly to introduce the REP open segment. A fourth switch (Cisco Catalyst 3560G) has been introduced to the ring, which produces a square. This switch is not REP-capable but will run RPVST+ and serve as the gateway into the rest of the notional layer 2 domain. Although REP is not compatible with STP, it can interwork seamlessly with it at the boundaries. ME2 and ME3 are connected to this C3560. The diagram is shown below again for reference.


We will make an edge port adjustment on ME2 by introducing a new option. We can specify the “uplink” port towards the C3560 as an edge port, but clarify that there will be no REP neighbor on it. This is loosely analogous to an IGP passive-interface, but in this case, the port is part of the REP ring. The benefit of this approach is that ME2 does not need to run STP for any VLANs now. On ME3, we will leave the primary edge port on the LACP bond to ME1, which means its uplink port will run STP. These uplink ports only carry VLANs 1 and 100-103, not the REP admin VLAN 29. This confines the BPAs to the REP ring. C3560 is the root for all VLANs and the basic configurations are shown below. Not shown are 4 test SVIs on C3560 for VLANs 100-103 which are used to test connectivity to the ME SVIs.

! ME2
interface FastEthernet0/15
 rep segment 1
interface FastEthernet0/23
 description TO C3560
 port-type nni
 switchport trunk allowed vlan 1,100-103
 switchport mode trunk
 rep segment 1 edge no-neighbor preferred
 rep stcn stp

! ME3
interface FastEthernet0/21
 description TO C3560
 port-type nni
 switchport trunk allowed vlan 1,100-103
 switchport mode trunk
 spanning-tree guard loop

! C3560
spanning-tree vlan 1-4094 priority 0
interface GigabitEthernet0/21
 description TO ME3
 switchport trunk encapsulation dot1q
 switchport trunk allowed vlan 1,100-103
 switchport mode trunk
 switchport nonegotiate
 spanning-tree guard root
interface GigabitEthernet0/23
 description TO ME2
 switchport trunk encapsulation dot1q
 switchport trunk allowed vlan 1,100-103
 switchport mode trunk
 switchport nonegotiate
 spanning-tree guard root


We begin by verifying the REP topology. VLAN load balancing should still be in effect based on the “prefer” flag, which is still set on the secondary edge port. However, ME3 displays a syslog message that says the configuration is now invalid. In order for VLAN load balancing to work, the port specified must have a REP neighbor (I assume). The topology shows only one alternate port, which implies VLAN load balancing is disabled. The asterisk next to the edge port role means that there is no neighbor.

! ME3
%REP-5-PREEMPTIONFAIL: can not perform preemption on segment 1 due to invalid Load-balancing block port

ME3#show rep topology segment 1
REP Segment 1
BridgeName       PortName   Edge Role
---------------- ---------- ---- ----
ME3              Po13       Pri  Open
ME1              Po13            Open
ME1              Fa0/15          Open
ME2              Fa0/15          Alt
ME2              Fa0/23     Sec* Open

To fix this, we could move the preferred port to somewhere else, or use the port ID / neighbor offset method. I will use the neighbor offset of 4, which selects ME2 port 15. A value of -2 would have also worked.

ME3#show rep topology segment 1 detail | section ME2, Fa0/15
ME2, Fa0/15 (Intermediate)
  Alternate Port, some vlans blocked
  Bridge MAC: 04c5.a453.d800
  Port Number: 011
  Port Priority: 000
  Neighbor Number: 4 / [-2]

! ME3
interface Port-channel13
 rep block port 4 vlan 100-101

Next, we preempt the topology to enable VLAN load balancing. ME2 port 15 is now blocking VLANs 100 and 101 while the primary edge port blocks all other VLANs.

ME3#rep preempt segment 1
The command will cause a momentary traffic disruption.
Do you still want to continue? [confirm]y
Proceeding with Manual Preemption

ME3#show rep topology segment 1
REP Segment 1
BridgeName       PortName   Edge Role
---------------- ---------- ---- ----
ME3              Po13       Pri  Alt
ME1              Po13            Open
ME1              Fa0/15          Open
ME2              Fa0/15          Alt
ME2              Fa0/23     Sec* Open

Another important aspect of REP/STP interworking is topology change notification (TCN). REP has its own TCN mechanism for convergence, but STP has a TCN BPDU as well. Upon receipt of the TCN, a switch will reduce its MAC address aging timer from the default of 300 seconds to the forward delay time (15 seconds by default) for a period of 35 seconds (max age plus forward delay time using default timers). This doesn’t clear the current CAM table in regular STP, but it does in RSTP. It would be ideal if REP could inform the STP network about changes in the REP segment so that STP switches can age out stale MAC addresses after a topology change. Because the segment has two ends, traffic that was flowing through one edge port may quickly transition to the other edge port, which is why the STP TCN propagation is useful. We will configure the REP segment TCN (STCN) to be reflected as an STP TCN on ME2 and ME3; notice that ME2 configures this on the port facing C3560 while ME3 configures it on its LACP bond to ME1. The command must go on the REP edge port regardless of whether the port is REP-facing (ME3) or STP-facing with no neighbor (ME2).

! ME2
interface FastEthernet0/23
 rep stcn stp

! ME3
interface Port-channel13
 rep stcn stp

We can see the details of the TCN propagation within the REP segment using the interface show commands. At present, the network is stable, so no TCNs have been sent. We can see that the STCN to STP TCN propagation has been enabled on these interfaces.

ME3#show interfaces port-channel 13 rep detail | include TCN
STCN Propagate to: STP
BPA (STCN, LSL) TLV rx: 0, tx: 0
BPA (STCN, HFL) TLV rx: 0, tx: 0

ME2#show interfaces fast 0/23 rep detail | include STCN
STCN Propagate to: STP
BPA (STCN, LSL) TLV rx: 0, tx: 0
BPA (STCN, HFL) TLV rx: 0, tx: 0


Based on the VLAN load-balancing scheme, traffic from C3560 to ME1 on VLAN 100 should flow through ME3 (primary edge port) and traffic to ME1 on VLAN 102 should flow through ME2 (secondary edge port). VLAN 100 is shown in yellow and VLAN 102 is shown in green.

C3560#show ip arp | include 10[02]\.0\.0\.1_
Internet  100.0.0.1    0   001d.4692.f741  ARPA   Vlan100
Internet  102.0.0.1    0   001d.4692.f743  ARPA   Vlan102

C3560#show mac address-table dynamic | include 001d.4692.f74[13]
 100    001d.4692.f741    DYNAMIC     Gi0/21
 102    001d.4692.f743    DYNAMIC     Gi0/23

On ME1, I shut down port 15, which triggers a TCN. We can see the counter increment on both ME2 and ME3 to represent this BPA activity.

ME3#show interfaces port-channel 13 rep detail | include TCN
STCN Propagate to: STP
BPA (STCN, LSL) TLV rx: 0, tx: 0
BPA (STCN, HFL) TLV rx: 0, tx: 1

ME2#show interfaces fast 0/23 rep detail | include STCN
STCN Propagate to: STP
BPA (STCN, LSL) TLV rx: 0, tx: 0
BPA (STCN, HFL) TLV rx: 0, tx: 1

With debugging enabled on C3560, we can see the TCNs being received on every VLAN allowed on the trunk. Because the network is so small, the TC event (with respect to signaling) ends immediately.

C3560#debug spanning-tree switch general
Spanning Tree Switch Shim general debugging is on

STP SW: VLAN1: topology change over - this bridge is root
STP SW: VLAN100: topology change over - this bridge is root
STP SW: VLAN101: topology change over - this bridge is root
STP SW: VLAN102: topology change over - this bridge is root
STP SW: VLAN103: topology change over - this bridge is root

Checking the STP details for VLAN 100 as an example, we can see the TC flag is still set. This flag remains set for 35 seconds (max age plus forward delay) and effectively flushes the CAM table for this VLAN when the TCN is received. If we check the command after 35 seconds, we see the flag is cleared.

C3560#show spanning-tree vlan 100 detail
 VLAN0100 is executing the rstp compatible Spanning Tree protocol
  Bridge Identifier has priority 0, sysid 100, address 0023.5dc9.5a80
  Configured hello time 2, max age 20, forward delay 15, transmit holdcount 6
  We are the root of the spanning tree
  Topology change flag set, detected flag not set
  Number of topology changes 12 last change occurred 00:00:26 ago
          from GigabitEthernet0/21
  Times:  hold 1, topology change 35, notification 2
          hello 2, max age 20, forward delay 15
  Timers: hello 0, topology change 8, notification 0, aging 300

C3560#show spanning-tree vlan 100 detail
 VLAN0100 is executing the rstp compatible Spanning Tree protocol
  Bridge Identifier has priority 0, sysid 100, address 0023.5dc9.5a80
  Configured hello time 2, max age 20, forward delay 15, transmit holdcount 6
  We are the root of the spanning tree
  Topology change flag not set, detected flag not set
  Number of topology changes 12 last change occurred 00:00:37 ago
          from GigabitEthernet0/21
  Times:  hold 1, topology change 35, notification 2
          hello 2, max age 20, forward delay 15
  Timers: hello 0, topology change 0, notification 0, aging 300

I’ve also configured a basic SPAN session on C3560 to capture these TCNs as they are flooded from the REP segment onto the C3560 designated ports. I don’t need to apply a VLAN filter since I know the REP LSLs aren’t being sent to the STP switches on VLAN 1.

! C3560
monitor session 1 source interface Gi0/21 , Gi0/23 rx
monitor session 1 destination interface Gi0/22 encapsulation replicate

Notice that a single TCN BPDU is sent from the REP primary and secondary edge ports (MAC addresses shown below) towards the STP switch. Even though the TCN propagation was configured on ME3's LACP bond to ME1, the actual TCN BPDU is sourced from the proper uplink interface, which is STP-enabled. The exception is VLAN 1; both a PVST and an IEEE TCN are sent for this VLAN, which is where the “plus” comes from in PVST+. This means that the STP-aware switch in the network could be running Cisco PVST variations (Cisco multicast MAC destination) or any other vendor's implementation that supports the IEEE STP variations without PVST. Even though the REP counters only showed 1 TCN being flooded, that really indicates one REP STCN, which may translate into several STP TCNs if PVST variants are running beyond the REP ring. I use a red circle as a marker in the capture below to show the column where the VLAN IDs are shown inside of the TCN BPDUs. In total, we see 12 TCNs, 6 from each REP edge port.
ME3#show interfaces fast 0/21 | include bia
  Hardware is Fast Ethernet, address is 001f.9d0b.1697 (bia 001f.9d0b.1697)
ME2#show interfaces fast 0/23 | include bia
  Hardware is Fast Ethernet, address is 04c5.a453.d819 (bia 04c5.a453.d819)


With the link still broken, C3560 now prefers ME3 for all traffic to ME1 as it is the only available path. Since there is a failure in the REP segment, all ports are opened, including the primary edge port which was recently blocking all VLANs except 100 and 101.
C3560#show mac address-table dynamic | include 001d.4692.f74[13]
 100    001d.4692.f741    DYNAMIC     Gi0/21
 102    001d.4692.f743    DYNAMIC     Gi0/21
ME3#show rep topology segment 1
REP Segment 1
Warning: REP detects a segment failure, topology may be incomplete
BridgeName       PortName   Edge Role
---------------- ---------- ---- ----
ME3              Po13       Sec  Open
ME1              Po13            Open
ME1              Fa0/15          Fail

Before continuing, we fix the REP segment by bringing ME1 port 15 back up. There are two other kinds of TCN propagation options to discuss, but not test. We looked at the STP option in detail and saw how it worked. The remaining options are shown below.

1. Segment: When there are different REP segments on a single switch, this option propagates REP STCNs from one segment to a list of other segments. This triggers CAM table flushes in those segments for faster layer 2 data-plane convergence. The command example is below; assuming it is applied to an interface in REP segment 1, it propagates segment 1 STCNs into the segments specified by the command.
rep stcn segment 2-5,6,7

2. Interface: When REP segments do not terminate on the same switch and might be only “one link” away, this command is useful. Rather than running full-blown STP on a P2P link between two switches where that link is not part of any REP segment, this option specifies the outgoing interface on which to send the REP STCNs. Assuming port 7 was a back-to-back link between two REP-aware switches, this command could be used to exchange REP STCNs.
rep stcn interface fastEthernet 0/7


Additional Reading – Reference configurations "rep-ring" and "rep-open"

11.3.2 ITU G.8032

Also known as Ethernet Ring Protection Switching (ERPS), this feature works similarly to Cisco's REP but is a standards-based variant. The terminology changes from REP but the general concepts remain the same. A segment is called a "ring" and an independent link between two ring nodes is called a "ring link". The ports that run this protocol on ring links are called "ring ports". The alternate (blocked) port is known as the ring protection link, or RPL. The node that is actually blocking a port is known as the “RPL owner node”. Just like REP, a single port in the ring will be blocked (the RPL) to prevent loops in the ring topology. In REP, the port adjacent to the alternate port just looks like an ordinary open port, but in G.8032/ERPS, the node that owns this port is called the RPL neighbor node.

The loose equivalent of REP BPA/EPA messages would be the ring automatic protection switching (R-APS) series of messages. These are passed along the ring to communicate RPL information and other significant events. A failure along the ring triggers an R-APS signal failure (R-APS SF) message, and when the RPL owner receives this message, the RPL port is immediately unblocked. This mirrors REP, where the existing alternate port is immediately opened when a failed port exists in the topology.

Unlike REP, G.8032 relies on existing Ethernet OAM technologies like connectivity fault management (CFM). Switches participating in ERPS must be part of the same CFM domain as a prerequisite to ERPS working; REP does not have this requirement. It is also recommended to configure CFM MEPs; although this is not required, it allows for additional CFM monitoring. ERPS registers to CFM as a client much like protocols register to BFD. In this way, the CFM process notifies ERPS of failures. ERPS uses CFM continuity check messages (CCM) at the fast rate of one packet every 3.3 ms. This helps ERPS achieve SONET-level performance in terms of HA.
Because G.8032 is not supported on the ME3400 series, it is not tested in detail here.

12. Describe broadband forum TR-101 VLAN paradigms (N:1 and 1:1)

The entire purpose of the TR-101 document is to detail how to aggregate DSL connections from the access layer into Ethernet rather than ATM. The document provides several options, both for general architecture and encapsulations. This section only shows the VLAN paradigms but other notes from the document are rolled into the SP Architecture section. TR-101 states that the access nodes (AN) must support both of the following VLAN allocation paradigms:

N:1 - Many-to-one mapping between ports and VLAN. This would mean several downstream (customer) facing ports in the same VLAN, likely because they have subscribed to the same service. For example, they all have the same destination service provider gateway device. The VLAN still has its own independent MAC address-table so that its traffic does not mix with other VLANs. The AN also must have a mechanism to prevent communication between user ports (user isolation). In Cisco ME switches, this is automatically enforced between UNI and ENI ports. In Catalyst switches, the Private VLAN (PVLAN) feature can be used to map secondary VLANs to a primary VLAN; in this case, a collection of secondary community VLANs or a single secondary isolated VLAN would face towards the customers, while the primary VLAN promiscuous port (or the ME switch’s NNI port) faces towards the provider. Additional restrictions are as follows:
* Traffic may be received 802.1Q tagged on the user port
* Traffic must be S-Tagged for C-Tag transparency within the aggregation network (QinQ)
* The 802.1Q tag value must be preserved as the C-Tag (QinQ)
* Access Node must apply an S-Tag. Must have a way to multiplex multiple services whether the received traffic was tagged or not.
* The S-Tag will be common to a TLS instance if switching is performed within the aggregation network.

1:1 - One-to-one binding between a user port and a VLAN. This basically means one VLAN per customer, where a single access port facing a customer is the only UNI in that VLAN. This paradigm is required to support both single and dual 802.1Q tagged frames and also must have the capability to disable MAC address learning. The document references the Independent VLAN Learning (IVL) model described in 802.1Q. Additional restrictions are as follows:
* Access Node must apply at least an S-Tag (outer tag)
* Access Node may apply a C-Tag for scalability reasons (QinQ)
* S-Tag must not be shared across Access Nodes
* S-Tag/C-Tag pair must be unique within Access Node

The demonstration uses three hardware Cisco ME-3400-24TS-A switches. There are 3 customers which are simulated by ME1. Each of them has a dot1q trunk to the AN, which is ME2, and because I am using a single customer switch to represent 3 customers, each one uses a different C-VLAN. I can still demonstrate QinQ tunneling with this approach, but I cannot demonstrate overlapping C-VLANs on a single physical switch. Customers 1, 2, and 3 use VLANs 10, 20, and 30 respectively. C1 and C2 are both connected to UNI ports on ME2 and use the same S-tag of 101 for transport across the aggregation switched network. ME3 serves as a backbone switch which also simulates the customer endpoints inside VRFs using jumper cables; the setup is very sloppy at layer 1 but demonstrates the technology. Below is the logical flow, since the physical cabling is less relevant.


C3 connects to an ENI port so that LACP can run, and this uses S-VLAN 102. Since all the ports are UNI ports of some kind, they cannot talk to one another. The sharing of S-tag 101 by multiple customers C1 and C2 represents the N:1 paradigm while the dedicated S-tag of 102 for C3 represents the 1:1 paradigm. This might make sense if C1 and C2 are subscribing to a similar service and have similar SLA requirements, etc., while C3 may later have a requirement to disable MAC learning (RSPAN VLANs perhaps).

The relevant configuration for ME1 is shown below. The configuration is monotonous as we must define VRFs, VLANs, and ports to the AN. Notice that C1 and C2 are regular trunk ports with a single C-VLAN allowed, while C3 uses an LACP bond to accomplish the same thing.
! ME1
ip routing
vrf definition C1
 address-family ipv4
vrf definition C2
 address-family ipv4
vrf definition C3
 address-family ipv4
vlan 10
 name C1
vlan 20
 name C2
vlan 30
 name C3
interface FastEthernet0/13
 description C1
 port-type nni
 switchport trunk allowed vlan 10
 switchport mode trunk
 spanning-tree portfast trunk
 spanning-tree bpdufilter enable
interface FastEthernet0/14
 description C2
 port-type nni
 switchport trunk allowed vlan 20
 switchport mode trunk
 spanning-tree portfast trunk
 spanning-tree bpdufilter enable
interface FastEthernet0/15
 description C3
 port-type nni
 switchport trunk allowed vlan 30
 switchport mode trunk
 channel-protocol lacp
 channel-group 2 mode active
interface FastEthernet0/16
 description C3
 port-type nni
 switchport trunk allowed vlan 30
 switchport mode trunk
 channel-protocol lacp
 channel-group 2 mode active
interface Port-channel2
 description LACP BOND TO ACCESS NODE, C3
 port-type nni
 switchport trunk allowed vlan 30
 switchport mode trunk
 spanning-tree portfast trunk
 spanning-tree bpdufilter enable
interface Vlan10
 vrf forwarding C1
 ip address 10.255.10.1 255.255.255.0
interface Vlan20
 vrf forwarding C2
 ip address 10.255.20.1 255.255.255.0
interface Vlan30
 vrf forwarding C3
 ip address 10.255.30.1 255.255.255.0

ME2 is really the focus of this test. It defines a set of S-VLANs for QinQ tunneling, but I don’t name them for brevity. The port-types are actually realistic and there aren’t any crazy workarounds. Facing the customer, the ports are UNI for C1 and C2 and ENI for C3. These are all user-to-network interfaces of sorts, so they cannot communicate with one another, which makes sense for customer-facing links. C1 and C2 are tunneled inside S-VLAN 101 while C3 is tunneled inside S-VLAN 102.
! ME2
vlan 101-109
interface FastEthernet0/13
 description C1
 switchport access vlan 101
 switchport mode dot1q-tunnel
interface FastEthernet0/14
 description C2
 switchport access vlan 101
 switchport mode dot1q-tunnel
interface FastEthernet0/15
 description C3
 port-type eni
 switchport access vlan 102
 switchport mode dot1q-tunnel
 channel-protocol lacp
 channel-group 2 mode passive
interface FastEthernet0/16
 description C3
 port-type eni
 switchport access vlan 102
 switchport mode dot1q-tunnel
 channel-protocol lacp
 channel-group 2 mode passive
interface Port-channel2
 description LACP BOND TO CUSTOMER C3
 port-type eni
 switchport access vlan 102
 switchport mode dot1q-tunnel
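As a rough illustration of how the two paradigms fall out of this configuration, the sketch below groups ME2's customer-facing ports by S-VLAN and labels each S-VLAN as N:1 or 1:1. The dictionary and function names are my own, and I count Port-channel2 as one logical user port (Fa0/15 and Fa0/16 are its members); this is not a TR-101 compliance check.

```python
from collections import defaultdict

# Customer-facing logical ports on ME2 and the S-VLAN each one tunnels into,
# copied from the dot1q-tunnel configuration. Po2 stands in for Fa0/15-16.
PORT_TO_SVLAN = {"Fa0/13": 101, "Fa0/14": 101, "Po2": 102}


def classify_svlans(port_to_svlan):
    """Label each S-VLAN as N:1 (shared by several ports) or 1:1 (one port)."""
    ports_by_svlan = defaultdict(list)
    for port, svlan in port_to_svlan.items():
        ports_by_svlan[svlan].append(port)
    return {
        svlan: "N:1" if len(ports) > 1 else "1:1"
        for svlan, ports in sorted(ports_by_svlan.items())
    }


for svlan, paradigm in classify_svlans(PORT_TO_SVLAN).items():
    print(f"S-VLAN {svlan}: {paradigm}")
```

S-VLAN 101 is shared by the C1 and C2 ports, while S-VLAN 102 is bound to the single C3 bond.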

For upstream connectivity to ME3, the backbone switch, we can bond the high-speed 1 Gbps interfaces together. These should be NNI ports which participate in STP, and we will enable loop guard on the LACP bond to protect against sudden loss of BPDUs from the upstream switch towards the root. Only S-VLANs are allowed over this link and ME2 does not need to account for any of the C-VLANs as they are tunneled inside dot1q.
! ME2
interface GigabitEthernet0/1
 description LACP TO GATEWAY
 port-type nni
 switchport trunk allowed vlan 101-109
 switchport mode trunk
 channel-protocol lacp
 channel-group 1 mode active
interface GigabitEthernet0/2
 description LACP TO GATEWAY
 port-type nni
 switchport trunk allowed vlan 101-109
 switchport mode trunk
 channel-protocol lacp
 channel-group 1 mode active
interface Port-channel1
 description LACP TO GATEWAY
 port-type nni
 switchport trunk allowed vlan 101-109
 switchport mode trunk
 spanning-tree guard loop

ME3 is the other end of the LACP gateway bond and allows the S-VLANs in. It is the root for all S-VLANs and uses root guard on the port-channel to ensure the downstream designated port never becomes an upstream root port.
! ME3
spanning-tree vlan 1-4094 priority 8192
interface GigabitEthernet0/1
 description LACP TO ACCESS NODE
 port-type nni
 switchport trunk allowed vlan 101-109
 switchport mode trunk
 channel-protocol lacp
 channel-group 1 mode passive
interface GigabitEthernet0/2
 description LACP TO ACCESS NODE
 port-type nni
 switchport trunk allowed vlan 101-109
 switchport mode trunk
 channel-protocol lacp
 channel-group 1 mode passive
interface Port-channel1
 description LACP TO ACCESS NODE
 port-type nni
 switchport trunk allowed vlan 101-109
 switchport mode trunk
 spanning-tree guard root

In real life, this traffic would be MPLS-encapsulated and transported to the other end of the network once it hits the network core. In this case, I configure another dot1q-tunnel port to remove the outermost encapsulation from the frames. This is because ME3 is going to respond to the traffic locally for testing purposes. ME3 defines the same VRFs and C-VLANs as ME1 so that the two can communicate. Ports 21/22 and 23/24 are jumpered together; ports 22 and 24 are the correctly-configured dot1q-tunnel ports at the remote end of the customer circuit, which remove the S-tag. Ports 21 and 23 simulate the customer switch by processing the customer VLANs. I used two ports because I had two separate S-tags in this test, so S-tag 101 carries VLANs 10 and 20 while S-tag 102 only carries VLAN 30. This is for testing purposes only given a lack of hardware.
! ME3
ip routing
vrf definition C1
 address-family ipv4
vrf definition C2
 address-family ipv4
vrf definition C3
 address-family ipv4
vlan 10
 name C1
vlan 20
 name C2
vlan 30
 name C3
interface FastEthernet0/21
 description LINK TO ACCESS NODE (JUMPER 102)
 port-type nni
 switchport trunk allowed vlan 30
 switchport mode trunk
 spanning-tree portfast trunk
interface FastEthernet0/22
 description LINK TO CUSTOMER (JUMPER 102)
 switchport access vlan 102
 switchport mode dot1q-tunnel
interface FastEthernet0/23
 description LINK TO ACCESS NODE (JUMPER 101)
 port-type nni
 switchport trunk allowed vlan 10,20
 switchport mode trunk
 spanning-tree portfast trunk
interface FastEthernet0/24
 description LINK TO CUSTOMER (JUMPER 101)
 switchport access vlan 101
 switchport mode dot1q-tunnel

Before we start testing, we will capture the MAC addresses of the SVIs on ME1 and ME3. These are not configurable, or else I would have configured easily recognizable values. I will always highlight ME1’s MACs in yellow with ME3’s MACs in green for clarity.
ME1#show interfaces | include ^Vlan[123]0|EtherSVI
  Hardware is EtherSVI, address is 001d.4692.f740 (bia 001d.4692.f740)
Vlan10 is up, line protocol is up
  Hardware is EtherSVI, address is 001d.4692.f741 (bia 001d.4692.f741)
Vlan20 is up, line protocol is up
  Hardware is EtherSVI, address is 001d.4692.f742 (bia 001d.4692.f742)
Vlan30 is up, line protocol is up
  Hardware is EtherSVI, address is 001d.4692.f743 (bia 001d.4692.f743)
ME3#show interfaces | include ^Vlan.0|EtherSVI
  Hardware is EtherSVI, address is 001f.9d0b.16c0 (bia 001f.9d0b.16c0)
Vlan10 is up, line protocol is up
  Hardware is EtherSVI, address is 001f.9d0b.16c1 (bia 001f.9d0b.16c1)
Vlan20 is up, line protocol is up
  Hardware is EtherSVI, address is 001f.9d0b.16c2 (bia 001f.9d0b.16c2)
Vlan30 is up, line protocol is up
  Hardware is EtherSVI, address is 001f.9d0b.16c3 (bia 001f.9d0b.16c3)

First, we ping from ME1 to ME3 within VRF C1. This represents end-to-end customer connectivity inside C-VLAN 10, which was encapsulated inside S-VLAN 101 for transport between ME2 and ME3. Since the ping succeeds, we assume everything is working, and begin verification.
ME1#ping vrf C1 10.255.10.3
Type escape sequence to abort.
Sending 5, 100-byte ICMP Echos to 10.255.10.3, timeout is 2 seconds:
!!!!!
Success rate is 100 percent (5/5), round-trip min/avg/max = 1/5/9 ms

When ME1 sends traffic to ME2, it does so on port 13 which is forwarding per STP. There is no BPDU exchange between customer and AN as I’ve filtered BPDUs on the customer side, and the UNI ports never run STP. ENI ports support STP but it is disabled by default; I do not enable it anywhere in this test. ME1 has a CAM entry for ME3’s MAC address inside the C-VLAN out of this port.
ME1#show spanning-tree vlan 10 interface fa0/13
Vlan                Role Sts Cost      Prio.Nbr Type
------------------- ---- --- --------- -------- ----------------------------
VLAN0010            Desg FWD 19        128.15   P2p Edge
ME1#show mac address-table dynamic vlan 10
          Mac Address Table
-------------------------------------------
Vlan    Mac Address       Type        Ports
----    -----------       --------    -----
  10    001f.9d0b.16c1    DYNAMIC     Fa0/13

Looking at ME2’s CAM for the S-VLAN, we can see this MAC address being identified as learned in VLAN 101. Even though port 13 is not really in VLAN 101, that is the S-tag for that port, so the CAM entry counts for that VLAN. This makes sense since when frames are received for this MAC, ME2 should forward them out port 13, but also remove the topmost dot1q tag. Traffic towards ME3 is forwarded out of the LACP bond towards the gateway switch, which is correct.
ME2#show mac address-table dynamic vlan 101
          Mac Address Table
-------------------------------------------
Vlan    Mac Address       Type        Ports
----    -----------       --------    -----
 101    001d.4692.f70f    DYNAMIC     Fa0/13
 101    001d.4692.f710    DYNAMIC     Fa0/14
 101    001d.4692.f741    DYNAMIC     Fa0/13
 101    001f.9d0b.1681    DYNAMIC     Po1
 101    001f.9d0b.16c1    DYNAMIC     Po1

When ME3 receives these frames, they arrive on the high-speed Gigabit EtherChannel (GEC) from ME2. The VLAN is 101 since the frames are still double-tagged. The destination MAC address of the frames from ME1 matches the second CAM entry, which forwards out port 24.
ME3#show mac address-table dynamic vlan 101
          Mac Address Table
-------------------------------------------
Vlan    Mac Address       Type        Ports
----    -----------       --------    -----
 101    001d.4692.f741    DYNAMIC     Po1
 101    001f.9d0b.16c1    DYNAMIC     Fa0/24

Since ME3 is also the test endpoint, it will see the C-VLAN on its jumpered interface. Since ports 23 and 24 are connected, it sees the C-VLAN tag of 10 on port 23.
ME3#show mac address-table dynamic vlan 10
          Mac Address Table
-------------------------------------------
Vlan    Mac Address       Type        Ports
----    -----------       --------    -----
  10    001d.4692.f741    DYNAMIC     Fa0/23
Total Mac Addresses for this criterion: 1

We won’t trace the process in the reverse direction or for the other C-VLANs since it is exactly the same. We quickly test that ME3 has reachability to ME1 for the other C-VLANs to ensure they work.
ME3#ping vrf C2 10.255.20.1
Type escape sequence to abort.
Sending 5, 100-byte ICMP Echos to 10.255.20.1, timeout is 2 seconds:
!!!!!
Success rate is 100 percent (5/5), round-trip min/avg/max = 1/4/9 ms
ME3#ping vrf C3 10.255.30.1
Type escape sequence to abort.
Sending 5, 100-byte ICMP Echos to 10.255.30.1, timeout is 2 seconds:
!!!!!
Success rate is 100 percent (5/5), round-trip min/avg/max = 1/5/9 ms
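The tag handling traced in this test can be sketched in a few lines. This is only an illustration: the helper names are hypothetical, and I assume both tags use TPID 0x8100 (an assumption about the dot1q-tunnel default; IEEE 802.1ad S-tags use 0x88A8 instead). The MAC addresses are the VLAN 10 SVI addresses captured earlier.

```python
import struct

TPID_DOT1Q = 0x8100  # assumed TPID for both the C-tag and the S-tag


def mac(text: str) -> bytes:
    """Convert Cisco dotted notation (001d.4692.f741) to 6 raw bytes."""
    return bytes.fromhex(text.replace(".", ""))


def push_tag(frame: bytes, vlan_id: int, pcp: int = 0) -> bytes:
    """Insert a 4-byte 802.1Q tag right after the destination and source MACs."""
    if not 1 <= vlan_id <= 4094:
        raise ValueError("VLAN ID must be 1-4094")
    tci = (pcp << 13) | vlan_id
    return frame[:12] + struct.pack("!HH", TPID_DOT1Q, tci) + frame[12:]


def pop_tag(frame: bytes):
    """Remove the outermost tag; return (vlan_id, remaining frame)."""
    tpid, tci = struct.unpack("!HH", frame[12:16])
    if tpid != TPID_DOT1Q:
        raise ValueError("frame is not tagged")
    return tci & 0x0FFF, frame[:12] + frame[16:]


# ME1 (VRF C1) sends an IPv4 frame to ME3 tagged with C-VLAN 10.
untagged = mac("001f.9d0b.16c1") + mac("001d.4692.f741") + struct.pack("!H", 0x0800) + b"payload"
customer = push_tag(untagged, 10)

# ME2's dot1q-tunnel port pushes S-tag 101; ME3's tunnel port 24 pops it.
tunneled = push_tag(customer, 101)
s_vlan, delivered = pop_tag(tunneled)
print(s_vlan, delivered == customer, len(tunneled) - len(customer))
```

The frame grows by 4 bytes while it carries the S-tag, and the C-tag of 10 is untouched when the S-tag is removed at the far tunnel port.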

The downside to QinQ in this way is that the switches unaware of the C-VLAN tags, such as ME2, will continue to increase the size of their CAM tables since the C-MACs are still exposed. VLAN 101 shows 2 C-MACs per site and VLAN 102 shows 1 C-MAC per site. This can be resolved with provider backbone bridges (PBB), which are discussed in another chapter.
ME2#show mac address-table dynamic vlan 101
          Mac Address Table
-------------------------------------------
Vlan    Mac Address       Type        Ports
----    -----------       --------    -----
 101    001d.4692.f70f    DYNAMIC     Fa0/13
 101    001d.4692.f710    DYNAMIC     Fa0/14
 101    001d.4692.f741    DYNAMIC     Fa0/13
 101    001d.4692.f742    DYNAMIC     Fa0/14
 101    001f.9d0b.1681    DYNAMIC     Po1
 101    001f.9d0b.16c1    DYNAMIC     Po1
 101    001f.9d0b.16c2    DYNAMIC     Po1
Total Mac Addresses for this criterion: 7
ME2#show mac address-table dynamic vlan 102
          Mac Address Table
-------------------------------------------
Vlan    Mac Address       Type        Ports
----    -----------       --------    -----
 102    001d.4692.f711    DYNAMIC     Po2
 102    001d.4692.f712    DYNAMIC     Po2
 102    001d.4692.f743    DYNAMIC     Po2
 102    001f.9d0b.1681    DYNAMIC     Po1
 102    001f.9d0b.16c3    DYNAMIC     Po1
Total Mac Addresses for this criterion: 5

Additional Reading – Reference configurations "tr101"

13. Describe QoS link fragmentation (LFI), cRTP, and RTP

QoS link fragmentation and interleaving (LFI) is a common technique for slow WAN links to reduce latency for priority traffic. Large data packets, such as those seen in an FTP download, take longer to serialize over slow WAN links than small voice packets do. The small voice packets are latency-sensitive while the FTP packets are not. PPP specifically has a mechanism to deal with this that is built into multi-link PPP (MLP or MLPPP). Because multi-link bundles aren’t supported with PPPoE, we will test this using a small GNS3 topology with Cisco 3725 images running 12.4(15)T. R1 is the customer site router with the 512 kbps WAN connection, while R2 is the “Internet” simulator and R3 is a LAN client.

The basic MLP configurations are shown below. R3 is also an NTP server and IP SLA responder which will help when approximating VOIP performance. R1 and R2 share many common configuration parameters mostly related to basic PPP/MLP configuration. The differences are shown below as well, since R2 issues IP addresses to R1 via IPCP from a local pool. R1 installs a default route towards R2 through IPCP, and R2 uses a simple static route for customer reachability. R3 has an IP address and default route, which is so basic that it is not shown.
! R1 and R2 (common)
interface Serial0/0
 encapsulation ppp
 ppp multilink
 ppp multilink group 2
interface Serial0/1
 encapsulation ppp
 ppp multilink
 ppp multilink group 2
interface Multilink2
 bandwidth 512
 fair-queue 64 256 0
 ppp multilink
 ppp multilink group 2
 max-reserved-bandwidth 100
! R1
interface Multilink2
 ip address negotiated
 no peer default ip address
 ppp ipcp route default
! R2
ip local pool PPPOE_IPV4 10.1.2.100 10.1.2.200
interface Loopback0
 ip address 10.1.2.2 255.255.255.0
interface Multilink2
 ip unnumbered Loopback0
 peer default ip address pool PPPOE_IPV4
ip route 10.0.0.0 255.0.0.0 Multilink2

We will conduct basic MLP checks and ensure we have reachability from R2 to R3. R1 shows the multilink bundle as operational with Serial0/0 and Serial0/1 as member interfaces. We can clearly see the local/remote endpoints (hostnames for clarity). We also confirm reachability between the Internet and the LAN client.
R1#show ppp multilink active
Multilink2
  Bundle name: R2
  Remote Endpoint Discriminator: [1] R2
  Local Endpoint Discriminator: [1] R1
  Bundle up for 00:26:14, total bandwidth 3088, load 1/255
  Receive buffer limit 24000 bytes, frag timeout 1000 ms
    0/0 fragments/bytes in reassembly list
    0 lost fragments, 1 reordered
    0/0 discarded fragments/bytes, 0 lost received
    0x5D received sequence, 0x5A sent sequence
  Member links: 2 active, 0 inactive (max not set, min not set)
    Se0/0, since 00:26:14
    Se0/1, since 00:26:14
R2#ping 10.1.3.3
Type escape sequence to abort.
Sending 5, 100-byte ICMP Echos to 10.1.3.3, timeout is 2 seconds:
!!!!!
Success rate is 100 percent (5/5), round-trip min/avg/max = 28/45/104 ms

Since this is a simulation, it will be hard to measure real latency on this slow link, but we will work out all the math anyway. First, we must define what a “slow” WAN link is. The formula for determining serialization delay is:

Delay (sec) = size (bits) / rate (bits/sec)

As an example, a 1500 byte packet (12000 bits) going over a T1 link (1544000 bps) will take about 8 ms (0.008 seconds) to serialize. In our case, since we are using a 512 kbps link, that same sized packet will take about 3 times longer to serialize since the link is about one third as fast. Specifically, 12000 / 512000 =~ 23.4 ms. While this might not seem like a long time, imagine a tiny VOIP packet stuck behind this giant 1500 byte packet in the TX ring. It has to wait a relatively long time before it can be serialized. An effective VOIP SLA will meet the following requirements as recommended by Cisco (and others):
1. Packet loss should be no more than 1 percent.
2. One-way latency (mouth to ear) should be no more than 150 ms.
3. Average one-way jitter should be less than 30 ms.

With stringent requirements like this, VOIP latency must be minimized. A general recommendation for per-link VOIP latency is 10 ms of serialization delay. This is forgivable up to 20 ms per link but is not desirable, especially for long-distance calls where there are many transit links in the path. Incurring more than that will make the cumulative 150 ms target difficult to achieve. That being said, we can use the combination of the 10 ms per-link target and the serialization delay formula to determine what a “slow” link is. In essence, we want to answer the question “What is the slowest speed that can serialize a 1500 byte packet in 10 ms or less?” If a link is faster than this number, many of the techniques discussed in this section are moot, since that “huge” 1500 byte packet serializes very quickly on fast links. To evaluate it, we rearrange the formula above: rate = size / delay = 12000 bits / 0.01 sec = 1.2 Mbps.
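The serialization delay arithmetic is easy to check in a few lines. This is a minimal sketch; the function names are my own and the inputs are the packet size and link rates quoted in the text.

```python
def serialization_delay_ms(size_bytes: int, rate_bps: int) -> float:
    """Delay (sec) = size (bits) / rate (bits/sec), returned in milliseconds."""
    return size_bytes * 8 / rate_bps * 1000


def min_rate_bps(size_bytes: int, max_delay_ms: int) -> float:
    """Slowest link rate that serializes the packet within max_delay_ms."""
    return size_bytes * 8 * 1000 / max_delay_ms


print(round(serialization_delay_ms(1500, 1_544_000), 1))  # T1
print(round(serialization_delay_ms(1500, 512_000), 1))    # our 512 kbps link
print(min_rate_bps(1500, 10))                             # 10 ms target
```

The three results are roughly 7.8 ms, 23.4 ms, and 1.2 Mbps respectively.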
Despite

this value, it is commonly understood that 768 kbps (half T1) is the cutoff for a slow link. Mathematically, a 768 kbps link would take ~15.6 ms to serialize a 1500 byte packet; this is considered acceptable when combined with LLQ and header compression. Since our link is 512 kbps, we assume this is a slow link.

The first tool to deal with this is an intelligent QoS policy. QoS is covered in detail in another section, but the basic idea is to treat VOIP traffic as low latency by guaranteeing bandwidth for it. In addition, low latency queuing (LLQ) also ensures that this traffic is serviced before any other queue, regardless of their class weights. This makes it a fitting solution for VOIP and other latency-sensitive flows. A sample policy is configured on R1 and attached to the MLPPP interface. For simplicity, we match the first 3 bits of the TOS byte to look for a binary value of 101; this will match IPP5, DSCP CS5, DSCP EF, and any value beginning with 101, which is generally low-latency traffic. Next, a shaper wraps this queue to limit it to 512 kbps as an average rate. The mathematics behind the priority value and shaper are described next.
! R1
class-map match-all CMAP_VOICE
 match precedence 5
policy-map PMAP_QUEUE
 class CMAP_VOICE
  priority 45
 class class-default
  fair-queue
  random-detect dscp-based
policy-map PMAP_SHAPE
 class class-default
  shape average 512000 5120 2560
  service-policy PMAP_QUEUE
interface Multilink2
 service-policy output PMAP_SHAPE

First, we analyze the value of 45 kbps for the voice reservation. This policy was built on the assumption that exactly two concurrent G.729 VOIP phone calls needed to be supported on this link. We will assume (and test via IP SLA) that the G.729 calls will be 20 byte payloads at a rate of 50 packets/second. The full packet size is much larger when we add all of the overhead: 20 payload + 12 RTP + 8 UDP + 20 IP + 8 PPP = 68 bytes total. 50 packets/second is the rate, and multiplying this by 68 yields 3400 bytes/second. Converting to bits, this is 27.2 kbps per phone call. 27.2 kbps multiplied by 2 is 54.4 kbps, so the 45 kbps reserved for the LLQ in the sample policy falls short of two uncompressed calls by this math; a reservation of about 55 kbps would cover both. Recall that Cisco queuing strategies always account for layer 2 overhead, as well as all overhead from upper layer protocols, which is why we must include the 8 bytes of PPP encapsulation.

Next, we examine the shaper configuration. We will assume that the SP will let us burst up to 768 kbps within a single time interval if we have accumulated credit (achieved by not sending at CIR rate in

previous intervals). Since we want to achieve a 10 ms time interval to ensure the shaper “runs” and sends data every 10 ms, we must select a committed burst (Bc) value that works. The relationships between Tc, Bc, Be, CIR, and PIR are shown below.

Tc = Bc / CIR
Tc = (Bc + Be) / PIR

First, we solve for Bc using the first formula. Tc is set to 0.01 seconds (10 ms) and CIR is set to 512000 bits/second.
Evaluation: 0.01 = Bc / 512000 => 0.01 * 512000 = Bc = 5120 bits.

Next, we solve for Be using the second formula. This determines how many bits we can accumulate and send during periods of burst; this involves the PIR which is 768 kbps.
Evaluation: 0.01 = (5120 + Be) / 768000 => 0.01 * 768000 = 5120 + Be => 7680 = 5120 + Be => Be = 2560 bits.

We can show the result of our math using the show command below. We can see Tc was evaluated to be 10 ms based on the Bc we computed. The excess burst (Be) is 2560 bits but the PIR of 768 kbps is not explicitly shown.
R1#show policy-map interface multilink 2 output
 Multilink2
  Service-policy output: PMAP_SHAPE
    Class-map: class-default (match-any)
      139 packets, 25548 bytes
      5 minute offered rate 0 bps, drop rate 0 bps
      Match: any
      Traffic Shaping
        Target/Average   Byte   Sustain   Excess    Interval  Increment
          Rate           Limit  bits/int  bits/int  (ms)      (bytes)
        512000/512000    960    5120      2560      10        640

        Adapt  Queue     Packets   Bytes     Packets   Bytes     Shaping
        Active Depth                         Delayed   Delayed   Active
               0         139       25548     0         0         no
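The shaper values can be derived from the two Tc formulas with a short calculation. This is a sketch under the same assumptions as the text (CIR 512 kbps, PIR 768 kbps, Tc 10 ms); the function name is my own.

```python
def shaper_bursts(cir_bps: int, pir_bps: int, tc_ms: int):
    """Solve Tc = Bc / CIR and Tc = (Bc + Be) / PIR for Bc and Be, in bits."""
    bc = cir_bps * tc_ms // 1000       # bits sent per interval at CIR
    be = pir_bps * tc_ms // 1000 - bc  # extra bits allowed per interval at PIR
    return bc, be


bc, be = shaper_bursts(cir_bps=512_000, pir_bps=768_000, tc_ms=10)
print(bc, be)          # Bc and Be in bits, as used in "shape average 512000 5120 2560"
print((bc + be) // 8)  # Bc + Be expressed in bytes
print(bc // 8)         # Bc expressed in bytes
```

The byte conversions come out to 960 and 640, which line up with the "Byte Limit" and "Increment (bytes)" columns in the show output.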

Although testing the shaper will be difficult, we can ensure G.729 VOIP traffic is matching the voice queue we configured. We can achieve this using IP SLA on R3. We specify a 20 byte packet payload using G.729A (the ‘A’ doesn’t affect the bandwidth and relates only to payload processing) with a TOS of 160, which is IPP5. The codec-interval is 20 ms (1000 ms / 50 pps = 20 ms); this is the default and would not normally appear in the configuration, but I show it below for completeness. Note that the codec-size of 32 is also the default, but since this isn’t a real RTP stream, we must manually account for the 12 RTP bytes. The payload of a real call would be 20 bytes with 12 RTP bytes, but in this case, the payload is 32 bytes with only an additional 8-byte UDP header.
! R3
ip sla 3
 udp-jitter 10.1.2.2 16384 source-port 16386 codec g729a codec-size 32 codec-interval 20
 tos 160
 timeout 10000
ip sla schedule 3 life forever start-time now
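As a sanity check on the packet sizes, the sketch below compares a real G.729 packet (20-byte payload plus RTP) with the IP SLA probe (32-byte payload, no RTP) and converts the result to bandwidth. The function names are my own; the header sizes are the ones used in this section, including the 8 bytes of PPP overhead.

```python
def packet_bytes(payload: int, headers) -> int:
    """Total on-the-wire size: payload plus every header in the list."""
    return payload + sum(headers)


def stream_bps(size_bytes: int, packets_per_second: int) -> int:
    """Bandwidth of a constant-rate stream in bits per second."""
    return size_bytes * packets_per_second * 8


real_call = packet_bytes(20, [12, 8, 20, 8])  # RTP, UDP, IP, PPP
sla_probe = packet_bytes(32, [8, 20, 8])      # UDP, IP, PPP (RTP folded into payload)

per_call = stream_bps(real_call, 50)          # 50 pps = 20 ms codec interval
print(real_call, sla_probe)                   # both packet sizes in bytes
print(per_call, 2 * per_call)                 # one call and two calls, in bps
```

Both packets come to 68 bytes, so the probe is a fair stand-in for a real call. By this arithmetic one call needs 27.2 kbps and two calls need 54.4 kbps.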

Checking R3, we can see the IP SLA operation is successful and we resolve both latency and jitter measurements. As seen later in the IP SLA section, this means that NTP is operational, as is the dynamic IP SLA responder on R2.
R3#show ip sla statistics 3
Round Trip Time (RTT) for       Index 3
        Latest RTT: 57 milliseconds
Latest operation start time: 01:21:57.612 UTC Fri Mar 1 2002
Latest operation return code: OK
RTT Values:
        Number Of RTT: 79               RTT Min/Avg/Max: 14/57/101 milliseconds
Latency one-way time:
        Number of Latency one-way Samples: 34
        Source to Destination Latency one way Min/Avg/Max: 3/15/42 milliseconds
        Destination to Source Latency one way Min/Avg/Max: 11/47/82 milliseconds
Jitter Time:
        Number of SD Jitter Samples: 77
        Number of DS Jitter Samples: 77
        Source to Destination Jitter Min/Avg/Max: 0/20/84 milliseconds
        Destination to Source Jitter Min/Avg/Max: 0/18/87 milliseconds
Packet Loss Values:
        Loss Source to Destination: 0           Loss Destination to Source: 0
        Out Of Sequence: 0      Tail Drop: 918
        Packet Late Arrival: 3  Packet Skipped: 0
Voice Score Values:
        Calculated Planning Impairment Factor (ICPIF): 11
        MOS score: 4.06
Number of successes: 3
Number of failures: 0
Operation time to live: Forever

We also check the policy-map on R1 to ensure the VOIP is being classified properly. The reason it only shows as a 2 kbps flow is that the probe is not continuous and runs once every 60 seconds so my computer doesn’t crash.

R1#show policy-map interface multilink 2 output | section VOICE
 Class-map: CMAP_VOICE (match-all)
  4368 packets, 218560 bytes
  5 minute offered rate 2000 bps, drop rate 0 bps
  Match: precedence 5
  Queueing
   Strict Priority
   Output Queue: Conversation 40
   Bandwidth 45 (kbps) Burst 1125 (Bytes)
   (pkts matched/bytes matched) 0/0
   (total drops/bytes drops) 0/0

To further optimize the WAN for our slow 512 kbps link, we can enable LFI. Because large 1500-byte packets take about 24 ms to serialize, we don’t want our VOIP traffic to get stuck behind them. LFI has two main components: fragmentation and interleaving. Fragmentation is oftentimes viewed negatively because it is usually discussed in the context of layer-3 protocols, like IP. IP fragmentation occurs end to end (unless virtual reassembly occurs in the network, which is CPU intensive), but layer-2 fragmentation is local to a link or potentially an entire layer-2 domain. MLPPP can use this to break up large packets into smaller fragments. For example, a 1500-byte packet broken into three 500-byte fragments still requires 24 ms to serialize in total, but each fragment takes only about 8 ms. That is well within the latency range for VOIP traffic on a link, and assuming we can break up large packets, we can squeeze VOIP packets in between them. This is interleaving, and it is often enabled in conjunction with fragmentation with MLP. In summary, we can fragment large data packets and interleave small VOIP packets in between them to achieve VOIP SLAs. The downside of this approach is the additional CPU resources needed to perform the fragmentation/reassembly and a small amount of additional overhead added to each fragment.

Before enabling fragmentation, we need to determine the fragment size. This goes back to the shaper math and the serialization delay, and should be coordinated with other QoS features appropriately. We want to maintain a 10 ms serialization delay on this link for VOIP, so we can extend that logic to ALL packets on the link. Thus, we must find the largest packet we can serialize on a 512 kbps link in 10 ms. Evaluation: 0.01 = size / 512000, so size = 0.01 * 512000 = 5120 bits. Converting this to bytes, we get 640 bytes. This sounds about right, since this is a little over one third the size of a 1500-byte packet, which took ~24 ms to serialize.

PPP can break packets up into fragments of 640 bytes regardless of how large they are, which guarantees that no single fragment takes more than 10 ms to serialize. VOIP packets are only 68 bytes, so those will serialize about 10 times faster at 1 ms. The goal was to ensure that a VOIP packet never gets stuck waiting for more than 10 ms, so we have to fragment the big packets to meet this goal. Combined with the fragmentation, the shaper running every 10 ms ensures that VOIP packets do not wait long before being serialized. Configuring this on R1 and R2 is simple.

! R1 and R2
interface Multilink2
 ppp multilink fragment size 640
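The fragment-size math above is easy to check with a short Python sketch. The function names are hypothetical and exist only for this illustration; the link rate, delay target, and packet sizes are the ones used in the text. Layer-2 overhead is ignored here.

```python
def serialization_delay_ms(size_bytes, link_bps):
    """Time to clock size_bytes onto a link of link_bps, in milliseconds."""
    return size_bytes * 8 * 1000 / link_bps


def max_fragment_bytes(target_delay_ms, link_bps):
    """Largest whole number of bytes that serializes within target_delay_ms."""
    bits = int(link_bps * target_delay_ms / 1000)
    return bits // 8


def fragment_sizes(packet_bytes, frag_bytes):
    """Split a packet into full-size fragments plus one remainder, if any."""
    full, rest = divmod(packet_bytes, frag_bytes)
    return [frag_bytes] * full + ([rest] if rest else [])


if __name__ == "__main__":
    link = 512_000  # 512 kbps, as in the text
    print(serialization_delay_ms(1500, link))  # about 23.4 ms for a full packet
    print(max_fragment_bytes(10, link))        # 640 bytes for a 10 ms target
    print(serialization_delay_ms(68, link))    # about 1 ms for a VOIP packet
    print(fragment_sizes(1500, 640))           # two full fragments and a remainder
```

A 640-byte fragment serializes in exactly 10 ms on this link, which is why that value was chosen for the fragment size.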

We can verify this took effect for all member interfaces by checking the MLP details.

R1#show ppp multilink active
Multilink2
  Bundle name: R2
  Remote Endpoint Discriminator: [1] R2
  Local Endpoint Discriminator: [1] R1
  Bundle up for 01:12:44, total bandwidth 3088, load 1/255
  Receive buffer limit 24000 bytes, frag timeout 1000 ms
    0/0 fragments/bytes in reassembly list
    0 lost fragments, 227 reordered
    0/0 discarded fragments/bytes, 0 lost received
    0x4739 received sequence, 0x476B sent sequence
  Member links: 2 active, 0 inactive (max not set, min not set)
    Se0/0, since 01:12:44, 5790 weight, 640 frag size
    Se0/1, since 01:12:44, 5790 weight, 640 frag size

MLP is very smart in that it can use multiple physical links in the bundle to serialize the fragments, further reducing latency. We can see this in the debug on R1. R3 sends a single 1000-byte packet, and R1 breaks it up. The fragment of size 648 is the largest packet that can be sent (640 + 8 PPP overhead), which takes a tiny bit longer than 10 ms to serialize. We could account for these 8 bytes by reducing the fragment size to 632, but it’s not significant. We can see the fragment sequence numbers increase as well to show fragments as part of the same packet on a per-interface basis.

R3#ping 10.1.2.2 size 1000 repeat 1
Type escape sequence to abort.
Sending 1, 1000-byte ICMP Echos to 10.1.2.2, timeout is 2 seconds:
!
Success rate is 100 percent (1/1), round-trip min/avg/max = 64/64/64 ms

R1#debug ppp multilink fragments
Multilink fragments debugging is on
Se0/0 MLP: O frag 80005212 size 648 encsize 4
Se0/1 MLP: O frag 40005213 size 370 encsize 4
Se0/1 MLP: I frag 400051D7 size 370 encsize 4
Se0/0 MLP: I frag 800051D6 size 648 encsize 4

We have not yet enabled interleaving. This is a simple command that enables VOIP packets to be inserted in between other packets. The way PPP knows about packet priorities is using fair-queuing, which MUST be enabled on the multilink bundle for interleaving to work. We configured it earlier since it is off by default on multilink interfaces. Despite having a custom queuing policy on the multilink, we still need the “fair-queue” command, at least in older versions.

! R1 and R2
interface Multilink2
 ppp multilink interleave

We can verify the configuration was successful by checking the MLP details again. We can see that interleaving is now enabled, which allows the VOIP packets to be woven in between the other packets, assuming they have TOS set appropriately.

R1#show ppp multilink active
Multilink2
  Bundle name: R2
  Remote Endpoint Discriminator: [1] R2
  Local Endpoint Discriminator: [1] R1
  Bundle up for 01:24:18, total bandwidth 3088, load 1/255
  Receive buffer limit 24000 bytes, frag timeout 1000 ms
  Interleaving enabled
    0/0 fragments/bytes in reassembly list
    0 lost fragments, 285 reordered
    0/0 discarded fragments/bytes, 0 lost received
    0x562D received sequence, 0x566D sent sequence
  Member links: 2 active, 0 inactive (max not set, min not set)
    Se0/0, since 01:24:18, 5790 weight, 640 frag size
    Se0/1, since 01:24:18, 5790 weight, 640 frag size

The final technique is using header compression for RTP. Although this technique throws off some of the math we did earlier, specifically with respect to the LLQ, it isn’t a big deal in most cases. First, we will discuss the basics of the RTP header. The three main components of the header, ignoring minor fields like version/flags, are the timestamp, sequence number, and synchronization source (SSRC). The SSRC is a way to identify a source, like an IP address would, but is seldom used in any meaningful way. It could be used to find a looped audio source, potentially. Of greater importance are the timestamps and sequence numbers. These provide a way to determine if packets are dropped and to compute delay/latency so that software can attempt to make corrections as necessary.

Before configuring this feature, we will capture the original size of the packets on the MLPPP link. If we debug PPP fragments before header compression, we can see a ton of output for each of the VOIP packets generated by the IP SLA probe on R3. I recommend sending this to the buffer. Each fragment is shown as 58 bytes, but this is after the packet is split in half and sent down each MLPPP link. Don’t fret over the number 58, just remember it for now.

! R1
Se0/0 MLP: I frag C0005018 size 58 encsize 4
Se0/0 MLP: I frag C000501A size 58 encsize 4
Se0/1 MLP: O frag C0005053 size 58 encsize 4
Se0/0 MLP: O frag C0005054 size 58 encsize 4

There are two formats available: IETF and IPHC, the latter of which is the default. Since we only expect two phone calls, we can also limit the number of header compression connections to conserve resources.

! R1 and R2
interface Multilink2
 ip rtp header-compression iphc-format
 ip rtp compression-connections 2

We can confirm the feature is enabled by checking the compression details. We can see most of the packets being compressed with a high hit ratio. Since the packets are small to begin with, they don’t compress terribly well, but it does save several overhead bytes. Real G.729 calls would compress very poorly as the codec has already heavily compressed the voice to minimize the bandwidth requirements. The feature compresses the IP and UDP headers in addition to RTP, which is shown in the header of the output for clarity. Generally the 40 bytes of overhead (20 IP + 8 UDP + 12 RTP) is compressed to approximately 5 bytes.

R1#show ip rtp header-compression multilink 2
RTP/UDP/IP header compression statistics:
  Interface Multilink2 (compression on, IPHC, RTP)
    Rcvd:    2009 total, 1977 compressed, 0 errors, 0 status msgs
             0 dropped, 0 buffer copies, 0 buffer failures
    Sent:    2009 total, 1979 compressed, 0 status msgs, 1979 not predicted
             41559 bytes saved, 79157 bytes sent
             1.52 efficiency improvement factor
    Connect: 2 rx slots, 2 tx slots,
             9 misses, 0 collisions, 0 negative cache hits, 2 free contexts
             99% hit ratio, five minute miss rate 0 misses/sec, 0 max
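As a rough check on the counters above, the sketch below assumes the “efficiency improvement factor” is simply the bytes that would have been sent without compression divided by the bytes actually sent. That formula is my assumption, not something the output states, but it reproduces the displayed value. The function names are hypothetical.

```python
def efficiency_factor(bytes_sent, bytes_saved):
    """Assumed formula: uncompressed volume divided by compressed volume."""
    return (bytes_sent + bytes_saved) / bytes_sent


def compressed_call_packet(voice_payload=20, compressed_header=5):
    """Approximate layer-3 size of a real G.729 packet after cRTP,
    using the roughly 5-byte compressed header mentioned in the text."""
    return voice_payload + compressed_header


if __name__ == "__main__":
    # Counters taken from the show output above.
    print(efficiency_factor(79157, 41559))  # roughly 1.52
    print(compressed_call_packet())         # 25 bytes instead of 60
```

For a real call, shrinking 40 bytes of headers to about 5 would take the 60-byte packet down to about 25 bytes, which is where the LLQ math from earlier would need revisiting.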

If we debug PPP fragments again, we can see the fragment size is now 49, which is 9 bytes smaller than before. This is a good indication that RTP header compression did a little bit for us.

! R2
Se0/1 MLP: I frag C0002B80 size 49 encsize 4
Se0/1 MLP: O frag C0002B84 size 49 encsize 4
Se0/0 MLP: O frag C0002B85 size 49 encsize 4
Se0/1 MLP: I frag C0002B82 size 49 encsize 4

Additional Reading - Reference configurations “lfi-crtp”

14. Describe Multichassis/Clustering High Availability (HA)

The ASR9000 supports many HA mechanisms. Many of these mechanisms work together during RSP/RP failure events, control-plane failure events, and other faults that could cause network outages.

SSO: Stateful switchover. The RSP/RP cards are deployed in an active/standby fashion and SSO allows the state from the active RSP/RP to be written to the backup one so that it can immediately take over if the primary one fails. SSO is complementary to (and a prerequisite for) the NSF/GR and NSR technologies discussed next. It synchronizes the device configuration and other critical non-routing, non-forwarding related information.

NSF: Non-stop forwarding. During an SSO event, NSF ensures that despite a brief loss of the control plane due to an RSP/RP failure, traffic can still be forwarded. The line-cards still have FIBs and can forward traffic; functions like ARP, BFD, ACL, and QoS can be handled by line-cards as well, so the impact on forwarding transit (non-exception) traffic is minimal. NSF also takes effect when soft reboots of certain software modules occur, provided the line-cards are not affected. NSF relies on active communication with peers who “help” a failed router recover using a set of special messages that vary per protocol. NSF is also called “Graceful Restart” (GR) for some protocols, and the terms NSF and GR are interchangeable. NSF is available with almost all routing or routing-related technologies, except for RIP.

NSR: Non-stop routing. Like NSF, this maintains the routing protocol information during an SSO event. BGP sessions can remain intact to continue supporting MPLS L3VPN services on a PE router, for example. Unlike NSF/GR, NSR occurs locally on a router using a checkpoint mechanism to synchronize information between RSPs/RPs. It can be used when NSF is not supported on all devices in the network. The peer devices won’t know that anything happened on the device sustaining a failure at all. Neighbor sessions and all pertinent information are migrated between RSP/RP cards when an SSO event occurs without notifying the peer, which is different from NSF. NSR is currently available with OSPFv2/v3, IS-IS, BGP, and LDP. RSVP-TE, EIGRP, and RIP are not included.

“Process restartability” isn’t so much a feature as it is a capability. If a process has failed, it will not affect any other process, and can freely be stopped, started, or restarted at any time within IOS XR. Each process runs in a protected address space, and oftentimes a failed process will restart itself (this can be controlled manually as well).

The ASR9000 can be clustered together to create a single logical router. This negates the need for complex protocol-based HA schemes since there is only a single router with increased capabilities. The RSPs can be connected directly together using Ethernet Out-of-Band Channel (EOBC) interfaces on the RSP; each RSP has two of these. The idea is that between two routers with two RSPs each, each port will connect to a remote RSP on the other router, creating a full mesh of links between RSPs across the two platforms. This uses the same nV technology used for satellite operations on the ASR9000 and is known as “nV Edge”. These links are used for control information and not for data traffic, which is why high-speed interfaces on the line cards must also be identified and connected. These are known as Inter Rack Links (IRLs), which must be at least 10 GbE and be direct layer-1 connections to the remote router’s line cards. A minimum of 2 IRLs is required, and they are preferably connected between different pairs of line cards on each router for extra resiliency. Below are some common failover scenarios:

1. Active RSP failure on the primary node. The backup RSP on the same node detects the failure through the backplane, and becomes the active RSP for the system as a whole. The primary node then informs the backup node of the change via inter-chassis control link messaging.
2. Both the active and backup RSPs on the primary node fail (entire router lost power, for example). The failed router is detected by the UDLD running on the IRLs from the backup node. The backup node then becomes the primary node.

14.1 High Availability (HA) Demonstration (NSF/NSR/GR)

Although many of these features don’t make much sense on virtual platforms, the configurations are supported in the parser and some of the feature behaviors are supported. These labs quickly examine the HA configurations for all supported protocols. Recall that GR and NSF are interchangeable terms, and the chapter headers below will use whichever feature name is supported by the parser. For example, in OSPFv2, the command “nsf” is used, while OSPFv3 uses the command “graceful-restart”. All of these features share a large unified MPLS topology so that they can be tested concurrently. The network diagram is shown below.

First, I quickly skim the IS-IS LSPDB and OSPF LSDB within the IGPs. Not all configurations are examined since there are dedicated sections for these features. The goal is to ensure the topology shows all of the proper routers and links at a quick glance. Looking at the ABRs, we can see the entire topology quickly. CSR7 shows the output for the aggregation and core IGPs, while XRv2 shows just its aggregation IGP.

R7#show isis database level-2 detail | include ^[RX]|Extended
R1.00-00              0x0000000D   0xAE88        762      0/0/0
  Metric: 10         IS-Extended R9.00
R7.00-00            * 0x00000013   0x04B2        477      0/0/0
  Metric: 10         IS-Extended XRv4.00
R8.00-00              0x0000000F   0x9427       1031      0/0/0
  Metric: 10         IS-Extended R9.00
  Metric: 10         IS-Extended XRv4.00
  Metric: 10         IS-Extended XRv2.00
R9.00-00              0x00000011   0x83DF        530      0/0/0
  Metric: 10         IS-Extended R8.00
  Metric: 10         IS-Extended R1.00
  Metric: 10         IS-Extended XRv4.00
  Metric: 10         IS-Extended XRv4.00
XRv2.00-00            0x00000008   0x90DE        502      0/0/0
  Metric: 10         IS-Extended R8.00
XRv4.00-00            0x0000000B   0x171F        870      0/0/0
  Metric: 10         IS-Extended R8.00
  Metric: 10         IS-Extended R7.00
  Metric: 10         IS-Extended R9.00
  Metric: 10         IS-Extended R9.00

R7#show isis database level-1 detail | include ^[RX]|Extended
R6.00-00              0x0000000C   0x2DE2       1061      0/0/0
  Metric: 10         IS-Extended XRv1.00
  Metric: 10         IS-Extended R7.00
R7.00-00            * 0x00000016   0x7843       1053      0/0/0
  Metric: 10         IS-Extended R6.00
  Metric: 10         IS-Extended XRv1.00
XRv1.00-00            0x0000000C   0xFA12        975      0/0/0
  Metric: 10         IS-Extended R6.00
  Metric: 10         IS-Extended R7.00

RP/0/0/CPU0:XRv2#show ospf database router | utility egrep 'Advert|Neighboring'
  Advertising Router: 222.0.0.1
     (Link ID) Neighboring Router ID: 222.0.0.13
     (Link ID) Neighboring Router ID: 222.0.0.12
     (Link ID) Neighboring Router ID: 222.0.0.10
  Advertising Router: 222.0.0.10
     (Link ID) Neighboring Router ID: 222.0.0.1
     (Link ID) Neighboring Router ID: 222.0.0.13
     (Link ID) Neighboring Router ID: 222.0.0.12
  Advertising Router: 222.0.0.12
     (Link ID) Neighboring Router ID: 222.0.0.10
     (Link ID) Neighboring Router ID: 222.0.0.1
  Advertising Router: 222.0.0.13
     (Link ID) Neighboring Router ID: 222.0.0.10
     (Link ID) Neighboring Router ID: 222.0.0.1

I repeat the same test as above using the MPLS TED show commands. This shows that all routers are TE-enabled with the proper links in the topology. Assuming all IGP links are TE-capable, the TE topology should mirror the IGP topology.

R7#show mpls traffic-eng topology level-2 brief | include IGP_Id
IGP Id: 0000.0000.0001.00, MPLS TE Id:222.0.0.1 Router Node (isis level-2)
  link[0]: Point-to-Point, Nbr IGP Id: 0000.0000.0009.00, nbr_node_id:20, gen:32
IGP Id: 0000.0000.0007.00, MPLS TE Id:222.0.0.7 Router Node (isis level-2)
  link[0]: Point-to-Point, Nbr IGP Id: 0000.0000.0014.00, nbr_node_id:22, gen:31
IGP Id: 0000.0000.0008.00, MPLS TE Id:222.0.0.8 Router Node (isis level-2)
  link[0]: Point-to-Point, Nbr IGP Id: 0000.0000.0009.00, nbr_node_id:20, gen:33
  link[1]: Point-to-Point, Nbr IGP Id: 0000.0000.0014.00, nbr_node_id:22, gen:33
  link[2]: Point-to-Point, Nbr IGP Id: 0000.0000.0012.00, nbr_node_id:21, gen:33
IGP Id: 0000.0000.0009.00, MPLS TE Id:222.0.0.9 Router Node (isis level-2)
  link[0]: Point-to-Point, Nbr IGP Id: 0000.0000.0008.00, nbr_node_id:19, gen:34
  link[1]: Point-to-Point, Nbr IGP Id: 0000.0000.0001.00, nbr_node_id:18, gen:34
  link[2]: Point-to-Point, Nbr IGP Id: 0000.0000.0014.00, nbr_node_id:22, gen:34
  link[3]: Point-to-Point, Nbr IGP Id: 0000.0000.0014.00, nbr_node_id:22, gen:34
IGP Id: 0000.0000.0012.00, MPLS TE Id:222.0.0.12 Router Node (isis level-2)
  link[0]: Point-to-Point, Nbr IGP Id: 0000.0000.0008.00, nbr_node_id:19, gen:35
IGP Id: 0000.0000.0014.00, MPLS TE Id:222.0.0.14 Router Node (isis level-2)
  link[0]: Point-to-Point, Nbr IGP Id: 0000.0000.0008.00, nbr_node_id:19, gen:36
  link[1]: Point-to-Point, Nbr IGP Id: 0000.0000.0007.00, nbr_node_id:17, gen:36
  link[2]: Point-to-Point, Nbr IGP Id: 0000.0000.0009.00, nbr_node_id:20, gen:36
  link[3]: Point-to-Point, Nbr IGP Id: 0000.0000.0009.00, nbr_node_id:20, gen:36

R7#show mpls traffic-eng topology level-1 brief | include IGP_Id
IGP Id: 0000.0000.0006.00, MPLS TE Id:222.0.0.6 Router Node (isis level-1)
  link[0]: Point-to-Point, Nbr IGP Id: 0000.0000.0011.00, nbr_node_id:23, gen:38
  link[1]: Point-to-Point, Nbr IGP Id: 0000.0000.0007.00, nbr_node_id:16, gen:38
IGP Id: 0000.0000.0007.00, MPLS TE Id:222.0.0.7 Router Node (isis level-1)
  link[0]: Point-to-Point, Nbr IGP Id: 0000.0000.0006.00, nbr_node_id:15, gen:39
  link[1]: Point-to-Point, Nbr IGP Id: 0000.0000.0011.00, nbr_node_id:23, gen:39
IGP Id: 0000.0000.0011.00, MPLS TE Id:222.0.0.11 Router Node (isis level-1)
  link[0]: Point-to-Point, Nbr IGP Id: 0000.0000.0006.00, nbr_node_id:15, gen:37
  link[1]: Point-to-Point, Nbr IGP Id: 0000.0000.0007.00, nbr_node_id:16, gen:37

RP/0/0/CPU0:XRv2#show mpls traffic-eng topology ospf brief | include IGP Id
IGP Id: 222.0.0.1, MPLS TE Id: 222.0.0.1 Router Node (OSPF 222 area 0)
  Link[0]:Point-to-Point, Nbr IGP Id:222.0.0.13, Nbr Node Id:4, gen:1870
  Link[1]:Point-to-Point, Nbr IGP Id:222.0.0.12, Nbr Node Id:1, gen:1871
  Link[2]:Point-to-Point, Nbr IGP Id:222.0.0.10, Nbr Node Id:3, gen:1872
IGP Id: 222.0.0.10, MPLS TE Id: 222.0.0.10 Router Node (OSPF 222 area 0)
  Link[0]:Point-to-Point, Nbr IGP Id:222.0.0.12, Nbr Node Id:1, gen:1873
  Link[1]:Point-to-Point, Nbr IGP Id:222.0.0.13, Nbr Node Id:4, gen:1874
  Link[2]:Point-to-Point, Nbr IGP Id:222.0.0.1, Nbr Node Id:2, gen:1875
IGP Id: 222.0.0.12, MPLS TE Id: 222.0.0.12 Router Node (OSPF 222 area 0)
  Link[0]:Point-to-Point, Nbr IGP Id:222.0.0.1, Nbr Node Id:2, gen:1868
  Link[1]:Point-to-Point, Nbr IGP Id:222.0.0.10, Nbr Node Id:3, gen:1869
IGP Id: 222.0.0.13, MPLS TE Id: 222.0.0.13 Router Node (OSPF 222 area 0)
  Link[0]:Point-to-Point, Nbr IGP Id:222.0.0.1, Nbr Node Id:2, gen:1876
  Link[1]:Point-to-Point, Nbr IGP Id:222.0.0.10, Nbr Node Id:3, gen:1877

To ensure MPLS transport will function when TE is not configured, I also spot-check some LDP sessions. Ideally, one would verify all sessions, but for brevity I demonstrate that most of them are functional. Selecting routers with many links is a good way to verify this. Out of 15 links in the network, we can verify 10 quickly using 3 show commands.

R9#show mpls ldp neighbor | include Peer
    Peer LDP Ident: 222.0.0.8:0; Local LDP Ident 222.0.0.9:0
    Peer LDP Ident: 222.0.0.1:0; Local LDP Ident 222.0.0.9:0
    Peer LDP Ident: 222.0.0.14:0; Local LDP Ident 222.0.0.9:0

R1#show mpls ldp neighbor | include Peer
    Peer LDP Ident: 222.0.0.9:0; Local LDP Ident 222.0.0.1:0
    Peer LDP Ident: 222.0.0.13:0; Local LDP Ident 222.0.0.1:0
    Peer LDP Ident: 222.0.0.12:0; Local LDP Ident 222.0.0.1:0
    Peer LDP Ident: 222.0.0.10:0; Local LDP Ident 222.0.0.1:0

R7#show mpls ldp neighbor | include Peer
    Peer LDP Ident: 222.0.0.6:0; Local LDP Ident 222.0.0.7:0
    Peer LDP Ident: 222.0.0.11:0; Local LDP Ident 222.0.0.7:0
    Peer LDP Ident: 222.0.0.14:0; Local LDP Ident 222.0.0.7:0

Next, we will validate the interconnected UMPLS topology. The core IGP is IS-IS level-2 with aggregation IGPs of OSPFv2/v3 and IS-IS level-1. To override the default IS-IS behavior, I apply a pair of filters on CSR7. The attached-bit is never allowed to be set into L1 (which would create a default routing entry), while L1 routes are never allowed to be distributed into L2. Though this would normally break IPv4/v6 reachability, it is suitable for UMPLS since BGP carries those prefixes. MPLS is used to tunnel traffic through routers that don’t have the IGP routes, but do have reachability to the ABRs.

! CSR7
clns filter-set CLNS_FILTER_CLEAR_ATT permit 49.beef
route-map RM_ATT_BIT permit 10
 match clns address CLNS_FILTER_CLEAR_ATT
route-map RM_DENY_ALL deny 10
router isis 222
 set-attached-bit route-map RM_ATT_BIT
 redistribute isis ip level-1 into level-2 route-map RM_DENY_ALL

CSR8 is a traditional RR that transports prefixes between aggregation islands. CSR1, CSR7, and XRv2 are modified RRs that adjust BGP next-hops between AS boundaries. This forces BGP to swap labels (labeled-unicast AFI) between aggregation islands, allowing UMPLS to scale as IGP routes are never redistributed. The only BGP AFI that the ABRs need to support is IPv4 labeled-unicast. We quickly verify that this capability is negotiated between CSR8 and the ABRs.

R8#show bgp ipv4 unicast summary | begin ^Neigh
Neighbor        V    AS MsgRcvd MsgSent   TblVer  InQ OutQ Up/Down  State/PfxRcd
222.0.0.1       4   222     520     520       19    0    0 01:20:43        2
222.0.0.7       4   222     376     374       19    0    0 00:56:37        2
222.0.0.12      4   222     291     308       19    0    0 00:46:44        1

R8#show bgp ipv4 unicast neighbors 222.0.0.1 | include ^BGP|MPLS_Label
BGP neighbor is 222.0.0.1, remote AS 222, internal link
    ipv4 MPLS Label capability: advertised and received
R8#show bgp ipv4 unicast neighbors 222.0.0.7 | include ^BGP|MPLS_Label
BGP neighbor is 222.0.0.7, remote AS 222, internal link
    ipv4 MPLS Label capability: advertised and received
R8#show bgp ipv4 unicast neighbors 222.0.0.12 | include ^BGP|MPLS_Label
BGP neighbor is 222.0.0.12, remote AS 222, internal link
    ipv4 MPLS Label capability: advertised and received

For L3VPN service, XRv1 and CSR10 are RRs within their aggregation IGPs. Other PEs peer with those RRs, and the RRs peer with one another. The cluster-IDs must be different since these aggregation RRs do not service the same set of clients. These RRs will only have reachability to one another once the aggregation/core infrastructures are functional. This is somewhat similar to CSC where the “customer carriers”, or aggregation IGPs in this example, must use MPLS encapsulation for their control plane traffic to communicate over the “core carrier”, or core IGP. Both XRv1 and CSR10 have all of their VPNv4/v6 sessions online.

RP/0/0/CPU0:XRv1#show bgp vpnv4 unicast summary | begin ^Neigh
Neighbor        Spk    AS MsgRcvd MsgSent   TblVer  InQ OutQ  Up/Down  St/PfxRcd
222.0.0.6         0   222     474     463       64    0    0 01:03:51          5
222.0.0.10        0   222     361     351       64    0    0 00:04:19          9

RP/0/0/CPU0:XRv1#show bgp vpnv6 unicast summary | begin ^Neigh
Neighbor        Spk    AS MsgRcvd MsgSent   TblVer  InQ OutQ  Up/Down  St/PfxRcd
222.0.0.6         0   222     475     464       35    0    0 01:03:58          4
222.0.0.10        0   222     362     352       35    0    0 00:04:26          5

R10#show bgp vpnv4 unicast all summary | begin ^Neigh
Neighbor        V    AS MsgRcvd MsgSent   TblVer  InQ OutQ Up/Down  State/PfxRcd
222.0.0.11      4   222      44      44       23    0    0 00:05:00        9
222.0.0.13      4   222      39      48       23    0    0 00:05:23        5

R10#show bgp vpnv6 unicast all summary | begin ^Neigh
Neighbor        V    AS MsgRcvd MsgSent   TblVer  InQ OutQ Up/Down  State/PfxRcd
222.0.0.11      4   222      44      44       22    0    0 00:05:00        8
222.0.0.13      4   222      39      48       22    0    0 00:05:23        5

Next, I quickly verify that CSR6 and XRv3, which are PEs, have BGP routes towards one another. The next-hops have been adjusted to the local ABRs so that label swapping occurs there. CSR6 only has one ABR servicing its aggregation IGP while XRv3 has two ABRs. Traceroute indicates that the proper number of labels are being imposed: BGP labels on the bottom with LDP labels on top, as necessary. Seeing only a single label during these tests may indicate that the BGP route recursion is not occurring, and remote loopbacks are IGP learned instead. This negatively affects the scalability of BGP.

R6#show bgp ipv4 unicast | begin Network
     Network          Next Hop            Metric LocPrf Weight Path
 *>  222.0.0.6/32     0.0.0.0                  0         32768 i
 *>i 222.0.0.10/32    222.0.0.7                0    100      0 i
 r>i 222.0.0.11/32    222.0.0.7                0    100      0 i
 *>i 222.0.0.13/32    222.0.0.7                0    100      0 i

RP/0/0/CPU0:XRv3#show bgp ipv4 labeled-unicast | begin Network
   Network            Next Hop            Metric LocPrf Weight Path
*>i222.0.0.6/32       222.0.0.1                0    100      0 i
* i                   222.0.0.12               0    100      0 i
*>i222.0.0.10/32      222.0.0.1                0    100      0 i
* i                   222.0.0.12               0    100      0 i
*>i222.0.0.11/32      222.0.0.1                0    100      0 i
* i                   222.0.0.12               0    100      0 i
*> 222.0.0.13/32      0.0.0.0                  0         32768 i

R6#traceroute 222.0.0.13 source 222.0.0.6
Type escape sequence to abort.
Tracing the route to 222.0.0.13
VRF info: (vrf in name/id, vrf out name/id)
  1 222.6.7.7 [MPLS: Label 7002 Exp 0] 10 msec 8 msec 7 msec
  2 222.7.14.14 [MPLS: Labels 94005/1003 Exp 0] 31 msec 31 msec 32 msec
  3 222.9.14.9 [MPLS: Labels 9001/1003 Exp 0] 43 msec 31 msec 31 msec
  4 222.1.9.1 [MPLS: Label 1003 Exp 0] 16 msec 21 msec 21 msec
  5 222.1.13.13 20 msec 15 msec 15 msec
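When checking label depth across many traceroutes, it can help to pull the label stacks out programmatically. The sketch below is a hypothetical helper, not part of the lab; it assumes the IOS XE traceroute line format shown above and returns the label stack seen at each hop, top label first.

```python
import re

# The five hops from the R6 traceroute above.
TRACE = """\
1 222.6.7.7 [MPLS: Label 7002 Exp 0] 10 msec 8 msec 7 msec
2 222.7.14.14 [MPLS: Labels 94005/1003 Exp 0] 31 msec 31 msec 32 msec
3 222.9.14.9 [MPLS: Labels 9001/1003 Exp 0] 43 msec 31 msec 31 msec
4 222.1.9.1 [MPLS: Label 1003 Exp 0] 16 msec 21 msec 21 msec
5 222.1.13.13 20 msec 15 msec 15 msec
"""

# Matches both "Label 7002" and "Labels 94005/1003".
LABELS = re.compile(r"\[MPLS: Labels? ([\d/]+) Exp \d+\]")


def label_stacks(trace):
    """Return one list of labels per hop; an empty list means unlabeled."""
    stacks = []
    for line in trace.splitlines():
        match = LABELS.search(line)
        if match:
            stacks.append([int(label) for label in match.group(1).split("/")])
        else:
            stacks.append([])
    return stacks


if __name__ == "__main__":
    for hop, stack in enumerate(label_stacks(TRACE), start=1):
        print(hop, len(stack), stack)
```

In this trace, hops 2 and 3 carry two labels, and the bottom label (1003) stays the same from hop 2 through hop 4 while only the top label changes. A trace where every hop shows a depth of one would be worth a second look for the recursion problem described above.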

Several different PE-CE routing protocols are demonstrated in this lab. CSR5 (BGP) and CSR4 (IS-IS) share a VPN, and CSR3 (OSPFv3) and CSR2 (EIGRP) share a VPN. A quick traceroute for IPv4 and IPv6 within each VPN confirms that the general design is functional. Note that IS-IS is not supported for IPv6 inside of a VRF, so only IPv4 is verified there. Since this lab is focused on GR/NSR, we will not trace all LSPs.

R5#traceroute 4.4.0.1 source 5.5.0.1
Type escape sequence to abort.
Tracing the route to 4.4.0.1
VRF info: (vrf in name/id, vrf out name/id)
  1 10.5.11.11 3 msec 2 msec 2 msec
  2 222.7.11.7 [MPLS: Labels 7005/10003 Exp 0] 11 msec 10 msec 11 msec
  3 222.7.14.14 [MPLS: Labels 94006/92001/10003 Exp 0] 23 msec 30 msec 33 msec
  4 222.8.14.8 [MPLS: Labels 8006/92001/10003 Exp 0] 37 msec 31 msec 31 msec
  5 222.8.12.12 [MPLS: Labels 92001/10003 Exp 0] 38 msec 31 msec 31 msec
  6 10.4.10.10 [MPLS: Label 10003 Exp 0] 16 msec 15 msec 16 msec
  7 10.4.10.4 19 msec 10 msec 134 msec

R3#traceroute 2.2.0.1 source 3.3.0.1
Type escape sequence to abort.
Tracing the route to 2.2.0.1
VRF info: (vrf in name/id, vrf out name/id)
  1 10.3.6.6 5 msec 4 msec 4 msec
  2 222.6.7.7 [MPLS: Labels 7002/93013 Exp 0] 11 msec 10 msec 10 msec
  3 222.7.14.14 [MPLS: Labels 94005/1003/93013 Exp 0] 29 msec 33 msec 32 msec
  4 222.14.9.9 [MPLS: Labels 9001/1003/93013 Exp 0] 33 msec 33 msec 37 msec
  5 222.1.9.1 [MPLS: Labels 1003/93013 Exp 0] 32 msec 31 msec 33 msec
  6 222.1.13.13 [MPLS: Label 93013 Exp 0] 21 msec 19 msec 22 msec
  7 10.2.13.2 20 msec 76 msec 13 msec

R3#traceroute ipv6
Target IPv6 address: 2002:2:2:a::1
Source address: 3003:3:3:a::1
[snip]
  1 FD00:10:3:6::6 5 msec 3 msec 4 msec
  2 ::FFFF:222.6.7.7 [MPLS: Labels 7002/93009 Exp 0] 12 msec 11 msec 12 msec
  3 2222:222:7:14::14 [MPLS: Labels 94005/1003/93009 Exp 0] 29 msec 42 msec 36 msec
  4 ::FFFF:222.9.14.9 [MPLS: Labels 9001/1003/93009 Exp 0] 35 msec 43 msec 35 msec
  5 ::FFFF:222.1.9.1 [MPLS: Labels 1003/93009 Exp 0] 35 msec 46 msec 35 msec
  6 2222:222:1:13::13 [MPLS: Label 93009 Exp 0] 23 msec 91 msec 9 msec
  7 FD00:10:2:13::2 11 msec 11 msec 10 msec

Now that the basic network has been validated, we can progress to the GR/NSR tests. Note that the CSR1000v does not support SSO, which is a requirement for NSF/GR. Some applications of NSF/GR have value outside of the traditional use-case of having redundant route processors (RPs), but generally, these labs cannot be tested in detail.

Additional Reading - Reference configurations “ha”

14.1.1 IS-IS NSF and NSR

IS-IS supports both NSF and NSR. The NSF capability of IS-IS is signaled using the Restart TLV, which is carried in the hello PDUs used by IS-IS. When an IS-IS router supports NSF, this TLV is included in hellos sent out of all interfaces to signal support for the feature. Both XE and XR support it by default, and this is true regardless of the adjacency’s IS-IS level. We confirm this on CSR7, which participates in multiple IS-IS levels and peers with XE and XR routers.

R7#show clns is-neighbors detail
Tag 222:
System Id       Interface     State  Type Priority  Circuit Id         Format
R6              Gi2.567       Up     L1   0         00                 Phase V
  Area Address(es): 49.6711
  IP Address(es):  222.6.7.6*
  IPv6 Address(es): FE80::6
  Uptime: 01:35:41
  NSF capable
  Topology: IPv4, IPv6
  Interface name: GigabitEthernet2.567
XRv1            Gi2.571       Up     L1   0         00                 Phase V
  Area Address(es): 49.6711
  IP Address(es):  222.7.11.11*
  IPv6 Address(es): FE80::11
  Uptime: 01:35:35
  NSF capable
  Topology: IPv4, IPv6
  Interface name: GigabitEthernet2.571
XRv4            Gi2.574       Up     L2   0         00                 Phase V
  Area Address(es): 49.0000
  IP Address(es):  222.7.14.14*
  IPv6 Address(es): FE80::14
  Uptime: 01:35:37
  NSF capable
  Topology: IPv4, IPv6
  Interface name: GigabitEthernet2.574

Enabling NSF is very simple and the command is the same for XE and XR. We enable this on all IS-IS level-2 and level-1-2 routers. The “ietf” option must be used with the command to enable NSF; this is discussed more later.

! CSR1, CSR7, CSR8, CSR9, XRv2, and XRv4
router isis 222
 nsf ietf

While not very exciting, NSF is not entirely enabled on these routers. Without SSO to synchronize all of the non-routing information between RPs, such as device configurations, NSF cannot realistically work. The routers will already operate in “helper” mode, which can acknowledge that a peer is performing a switchover and continue forwarding traffic (or any control-plane information required) towards it. Using basic show commands, we can verify it is enabled. XR’s only show command appears to be a simple yes/no field inside the general IS-IS protocol details.

R7#show isis nsf
Tag 222:
NSF is ENABLED, mode 'ietf'
NSF pdb state:
NSF L1 active interfaces: 0
NSF L1 active LSPs: 0
NSF interfaces awaiting L1 CSNP: 0
Awaiting L1 LSPs:
NSF L2 active interfaces: 0
NSF L2 active LSPs: 0
NSF interfaces awaiting L2 CSNP: 0
Awaiting L2 LSPs:
NSF T3 remaining: 0 seconds
Interface: GigabitEthernet2.567
    NSF L1 Restart state: Running
    NSF p2p Restart retransmissions: 0
    Maximum L1 NSF Restart retransmissions: 3
Interface: GigabitEthernet2.571
    NSF L1 Restart state: Running
    NSF p2p Restart retransmissions: 0
    Maximum L1 NSF Restart retransmissions: 3
Interface: GigabitEthernet2.574
    NSF L2 Restart state: Running
    NSF p2p Restart retransmissions: 0
    Maximum L2 NSF Restart retransmissions: 3

RP/0/0/CPU0:XRv4#show isis protocol | include Non-stop
  Non-stop forwarding: IETF NSF Restart enabled

Since the feature is not enabled on CSR6 and XRv1, I use the same show commands to demonstrate the output when NSF is disabled. On XE, this also shows us that SSO is not configured or operational.

R6#show isis nsf
Tag 222:
NSF is DISABLED
RP is ACTIVE, standby not ready, RTR chkpt peer not ready, UPD chkpt peer not ready, bulk sync pending
NSF interval timer expired (NSF OK, checkpoint pending)
Checkpointing disabled, no errors
Local state: ACTIVE, Peer state: DISABLED, Config Mode: non-SSO, Operating Mode: non-SSO

RP/0/0/CPU0:XRv1#show isis protocol | include Non-stop
 Non-stop forwarding: Disabled

Debugging NSF in a stable network shows the adjacency timer refreshes occurring roughly every 10 seconds, which is the default P2P hello timer for IS-IS. I color-code the timers to show the difference of about 10 seconds across 3 different interfaces.

R8#debug isis nsf
IS-IS NSF events debugging is on for router process 222
17:13:35.392: ISIS-NSF (222): Adjacency timer refreshed: GigabitEthernet2.584
17:13:36.781: ISIS-NSF (222): Adjacency timer refreshed: GigabitEthernet2.589
17:13:39.840: ISIS-NSF (222): Adjacency timer refreshed: GigabitEthernet2.582
17:13:44.841: ISIS-NSF (222): Adjacency timer refreshed: GigabitEthernet2.584
17:13:44.973: ISIS-NSF (222): Adjacency timer refreshed: GigabitEthernet2.589
17:13:49.049: ISIS-NSF (222): Adjacency timer refreshed: GigabitEthernet2.582
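To check the refresh cadence without reading timestamps by eye, the debug lines can be parsed and the per-interface gaps computed. The following is a minimal sketch, assuming each debug line pairs one timestamp with one interface in the order printed by R8; the function name `refresh_intervals` and the regular expression are illustrative choices, not part of any Cisco tooling.

```python
import re
from collections import defaultdict
from datetime import datetime

# Sample taken from the R8 debug output; one refresh event per line.
DEBUG_LINES = """\
17:13:35.392: ISIS-NSF (222): Adjacency timer refreshed: GigabitEthernet2.584
17:13:36.781: ISIS-NSF (222): Adjacency timer refreshed: GigabitEthernet2.589
17:13:39.840: ISIS-NSF (222): Adjacency timer refreshed: GigabitEthernet2.582
17:13:44.841: ISIS-NSF (222): Adjacency timer refreshed: GigabitEthernet2.584
17:13:44.973: ISIS-NSF (222): Adjacency timer refreshed: GigabitEthernet2.589
17:13:49.049: ISIS-NSF (222): Adjacency timer refreshed: GigabitEthernet2.582
"""

# Assumed line layout: "<HH:MM:SS.mmm>: ISIS-NSF (<tag>): Adjacency timer refreshed: <interface>"
LINE_RE = re.compile(
    r"^(\d{2}:\d{2}:\d{2}\.\d{3}): ISIS-NSF \(\d+\): "
    r"Adjacency timer refreshed: (\S+)$"
)


def refresh_intervals(text):
    """Return {interface: [seconds between consecutive refreshes]}."""
    seen = defaultdict(list)
    for line in text.splitlines():
        match = LINE_RE.match(line.strip())
        if match:
            stamp = datetime.strptime(match.group(1), "%H:%M:%S.%f")
            seen[match.group(2)].append(stamp)
    return {
        intf: [round((b - a).total_seconds(), 3) for a, b in zip(times, times[1:])]
        for intf, times in seen.items()
    }


if __name__ == "__main__":
    for intf, gaps in refresh_intervals(DEBUG_LINES).items():
        print(intf, gaps)
```

On this sample, the gaps come out slightly under 10 seconds on every interface, which is consistent with a 10-second hello timer that is not perfectly periodic.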

I enable debugging on CSR7 and also clear its IS-IS process. Because CSR7 is an L1/L2 router, we can see it trying to establish both types of adjacencies. It iterates through them, bringing up the L1 adjacencies first. The true/false markers indicate what kinds of adjacencies IS-IS needs to establish with each peer. L1 adjacencies are shown in yellow while L2 adjacencies are shown in green. XR does not appear to have an equivalent debug.

R7#debug isis nsf
IS-IS NSF events debugging is on for router process 222
R7#clear isis *
%CLNS-5-ADJCLEAR: ISIS (222): All adjacencies cleared
ISIS-NSF (222): Pick ADJ - XRv1 (Gi2.571):
ISIS-NSF (222): L1 BEST ADJ - XRv1 (Gi2.571) = not found => TRUE
ISIS-NSF (222): L2 BEST ADJ - XRv1 (Gi2.571) = not found => FALSE
ISIS-NSF (222): Pick ADJ - XRv1 (Gi2.571):
ISIS-NSF (222): L1 BEST ADJ - XRv1 (Gi2.571) = not found => TRUE
ISIS-NSF (222): L2 BEST ADJ - XRv1 (Gi2.571) = not found => FALSE
%CLNS-5-ADJCHANGE: ISIS (222): Adjacency to XRv1 (GigabitEthernet2.571) Up, new adjacency
ISIS-NSF (222): Pick ADJ - R6 (Gi2.567):
ISIS-NSF (222): L1 BEST ADJ - R6 (Gi2.567) = not found => TRUE
ISIS-NSF (222): L2 BEST ADJ - R6 (Gi2.567) = not found => FALSE
ISIS-NSF (222): Pick ADJ - R6 (Gi2.567):
ISIS-NSF (222): L1 BEST ADJ - R6 (Gi2.567) = not found => TRUE
ISIS-NSF (222): L2 BEST ADJ - R6 (Gi2.567) = not found => FALSE
%CLNS-5-ADJCHANGE: ISIS (222): Adjacency to R6 (GigabitEthernet2.567) Up, new adjacency
ISIS-NSF (222): Pick ADJ - XRv4 (Gi2.574):
ISIS-NSF (222): L1 BEST ADJ - XRv4 (Gi2.574) = not found => FALSE
ISIS-NSF (222): L2 BEST ADJ - XRv4 (Gi2.574) = not found => TRUE
ISIS-NSF (222): Pick ADJ - XRv4 (Gi2.574):
ISIS-NSF (222): L1 BEST ADJ - XRv4 (Gi2.574) = not found => FALSE
ISIS-NSF (222): L2 BEST ADJ - XRv4 (Gi2.574) = not found => TRUE
%CLNS-5-ADJCHANGE: ISIS (222): Adjacency to XRv4 (GigabitEthernet2.574) Up, new adjacency

Next, we will configure several minor options, as there are several timers associated with NSF. The “T3” timer determines how long to wait before setting the overload bit (similar to OSPF max-metric) when a switchover occurs. Setting the overload bit forces traffic to avoid a particular node by causing the IGP to converge around it. “T3” is equivalent to the “restart-time” in other protocols or the “lifetime” in XR. The “interface” timer determines how long to wait for interfaces to come up before completing the restart/switchover. The “interval” timer is a throttle to ensure NSF restarts do not happen too quickly, and serves as a hold-down time between them. XR allows you to define the interface timer (T1), which defines how long to wait before resending an unacknowledged restart attempt. This should be a few seconds, and I adjust it to 5. The expiration counter tells IS-IS to stop trying after 3 attempts. These timers do not appear in any obvious show commands. They are demonstrated here to show NSF’s granularity.


! CSR7
router isis 222
 nsf interval 7
 nsf interface wait 15
 nsf t3 manual 60
! XR
router isis 222
 nsf lifetime 60
 nsf interface-timer 5
 nsf interface-expires 3
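The relationship between the XR timers is simple arithmetic: with a 5-second interface timer and 3 permitted attempts, a restarting router spends at most 15 seconds retrying on an interface, well inside the 60-second lifetime. The sketch below only captures that arithmetic. The class name `XrNsfTimers` and its methods are hypothetical, and it assumes retries are evenly spaced at the interface-timer value, which is a simplification.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class XrNsfTimers:
    """Hypothetical container for the three XR IS-IS NSF values configured above."""

    lifetime: int           # "nsf lifetime" in seconds
    interface_timer: int    # "nsf interface-timer" (T1) in seconds
    interface_expires: int  # "nsf interface-expires" attempt count

    def retry_window(self) -> int:
        """Worst-case seconds spent retrying on one interface (assumes even spacing)."""
        return self.interface_timer * self.interface_expires

    def fits_in_lifetime(self) -> bool:
        """True when all retries can complete before the overall lifetime expires."""
        return self.retry_window() <= self.lifetime


if __name__ == "__main__":
    timers = XrNsfTimers(lifetime=60, interface_timer=5, interface_expires=3)
    print(timers.retry_window(), timers.fits_in_lifetime())
```

A check like this is mainly useful as a sanity test when tuning the values: an interface timer or expiry count large enough to exceed the lifetime would never be fully exercised.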

Enabling NSR is less straightforward than NSF. The primary difference between IS-IS NSR and NSF is that NSF is an IETF standard that involves interactions between routers. The capability is signaled via the Restart TLV, and routers are expected to know when peers switch over. NSR is local-only and does not signal anything to remote peers. The configuration for XE and XR is below. This uses the “cisco” keyword as it was developed before IETF NSF.

! CSR6 and XRv1
router isis 222
 nsf cisco

When this is configured on the CSR1000v, several errors are printed as shown below. This is followed by syslog messages every 60 seconds as well. I assume this is because the virtual platform does not support it, or possibly because SSO is not configured (also unsupported). Despite these errors, the configuration command can remain in the configuration for reference.

R6(config-router)#nsf cisco
Adjacency and LSP information learned from an interface cannot be checkpointed due to interface GigabitEthernet2.561 encoding error.
Adjacency and LSP information learned from an interface cannot be checkpointed due to interface GigabitEthernet2.567 encoding error.
Please verify that these interface types are supported for NSF in this release. If so, report the problem to the TAC.
%CLNS-3-NSF_CP_IDB_ENCODE_FAIL: ISIS (222): Interface Gi2.561 cannot be encoded for nsf cisco
%CLNS-3-NSF_CP_IDB_ENCODE_FAIL: ISIS (222): Interface Gi2.567 cannot be encoded for nsf cisco

CSR6 still reports NSF as being enabled, but this is effectively NSR when the mode is “cisco”.

R6#show isis nsf
Tag 222:
NSF is ENABLED, mode 'cisco'
RP is ACTIVE, standby not ready, RTR chkpt peer not ready, UPD chkpt peer not ready, bulk sync pending
NSF interval timer expired (NSF restart enabled)
Checkpointing disabled, no errors
Local state: ACTIVE, Peer state: DISABLED, Config Mode: non-SSO, Operating Mode: non-SSO

XRv1 appears to accept the command without complaint, but the feature is still inoperable.

RP/0/0/CPU0:XRv1#show isis protocol | include Non-stop
 Non-stop forwarding: Cisco Proprietary NSF Restart enabled

14.1.2 OSPFv2 NSF and NSR

OSPFv2 NSF and NSR are similar to IS-IS in concept. NSF requires some extensions to OSPF (IS-IS used a Restart TLV, which isn’t really an extension) so that the peer routers remain aware of the restarting router. When the router performs a switchover, the neighbors continue to sustain the existing LSA1 information that describes the network topology. This allows traffic to continue forwarding through the router until the grace period expires. OSPF uses new “grace LSAs”, which are link-local opaque LSAs containing the grace period time in their payloads. This is how NSF information is exchanged between routers, and being link-local in scope, they are not flooded throughout the area/domain. The configuration is identical to IS-IS, and is also the same on XE and XR.

! CSR1 and XRv2
router ospf 222
 nsf ietf
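To make the grace LSA payload concrete, the sketch below encodes the three TLVs that RFC 3623 defines for OSPFv2 graceful restart: the grace period (type 1), the restart reason (type 2), and the IP interface address (type 3). It builds only the TLV body, not the LSA header, and the function names and the 192.0.2.1 address are illustrative assumptions rather than values from this lab.

```python
import ipaddress
import struct

# TLV types from RFC 3623 (OSPFv2 graceful restart).
GRACE_PERIOD_TLV = 1
RESTART_REASON_TLV = 2
INTERFACE_ADDR_TLV = 3

# Restart reason code for a switch to a redundant control processor.
REASON_SWITCH_TO_REDUNDANT_RP = 3


def _tlv(tlv_type, value):
    """Encode one TLV; the value is zero-padded to a 4-octet boundary."""
    padding = (-len(value)) % 4
    return struct.pack("!HH", tlv_type, len(value)) + value + b"\x00" * padding


def grace_lsa_body(grace_period, reason, interface_ip):
    """Return the TLV portion of a grace LSA (LSA header not included)."""
    return (
        _tlv(GRACE_PERIOD_TLV, struct.pack("!I", grace_period))
        + _tlv(RESTART_REASON_TLV, struct.pack("!B", reason))
        + _tlv(INTERFACE_ADDR_TLV, ipaddress.IPv4Address(interface_ip).packed)
    )


if __name__ == "__main__":
    # 120 seconds is used here purely as an example grace period.
    body = grace_lsa_body(120, REASON_SWITCH_TO_REDUNDANT_RP, "192.0.2.1")
    print(body.hex())
```

The grace period is the only value a helper strictly needs in order to decide how long to keep advertising the restarting neighbor as adjacent; the reason code is informational.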

XE has a number of verification methods while XR does not. Of note, NSF “helper” support is enabled by default in both platforms, which is valuable on virtual routers. If they were peered with hardware routers that actually supported SSO/NSF, they could honor the received grace LSAs and react accordingly. XE clearly tells us that NSF is not operating due to SSO not being configured.

R1#show ip ospf | include NSF|IETF
 IETF Non-Stop Forwarding enabled
 IETF NSF helper support enabled
 Cisco NSF helper support enabled

R1#show ip ospf nsf
Routing Process "ospf 222"
 IETF Non-Stop Forwarding enabled
 restart-interval limit: 120 sec
 Router is not operating in SSO mode
 Global RIB has not converged yet
 IETF NSF helper support enabled
 Cisco NSF helper support enabled
 OSPF restart state is NO_RESTART
 Handle 140512595585760, Router ID 222.0.0.1, checkpoint Router ID 0.0.0.0
 Config wait timer interval 10, timer not running
 Dbase wait timer interval 120, timer not running

RP/0/0/CPU0:XRv2#show ospf | include Non-Stop
 Non-Stop Forwarding enabled
RP/0/0/CPU0:XRv2#show protocols ospf | include Non-Stop
 Non-Stop Forwarding: Enabled

We also see that the default restart interval is 2 minutes. If the neighbor does not complete its switchover during that time, the helper routers will update their LSA1s to cause a reconvergence event. Setting this timer too short may not give the switching router enough time to complete the SSO process. Setting it too long may create a longer black hole if the router legitimately fails and cannot fail over, yet somehow fails to signal it. For demonstration, I increase the timer to 10 minutes on both CSR1 and XRv2. Note that this “restart-interval” is equivalent in purpose to the “T3” timer in IS-IS.

! CSR1
router ospf 222
 nsf ietf restart-interval 600
! XRv2
router ospf 222
 nsf lifetime 600

R1#show ip ospf nsf
Routing Process "ospf 222"
 IETF Non-Stop Forwarding enabled
 restart-interval limit: 600 sec
 Router is not operating in SSO mode
 Global RIB has not converged yet
[snip]

We can also adjust a router’s willingness to be a helper for NSF. The pre-standard NSF version developed by Cisco is not discussed in detail here, but unlike IS-IS, “nsf cisco” does not enable NSR. It was a proprietary Cisco implementation that extended OSPF; it was not internal-only as NSR is. XRv2 disables the helper capability for IETF NSF while CSR1 disables it for the Cisco version. You cannot disable helper support for Cisco NSF on XR.

! XRv2
router ospf 222
 nsf ietf helper disable
! CSR1
router ospf 222
 nsf cisco helper disable

R1#show ip ospf nsf | include helper
 IETF NSF helper support enabled
 Cisco NSF helper support disabled

Next, we configure OSPF NSR on XRv3 and CSR10. This uses a dedicated “nsr” command, which is simpler than IS-IS NSR (enabled with “nsf cisco”). NSR is a good fit for routers in a heterogeneous environment where it is not guaranteed that the peers are NSF-capable. For example, XRv2 is not willing to be an IETF NSF helper, so CSR10 should consider using NSR rather than NSF. The configuration is a single command with no arguments, and there are no other options to tune.

! CSR10 and XRv3
router ospf 222
 nsr

XE has a rich set of show commands to verify NSR. I’ve highlighted several fields that suggest NSR is enabled but inoperable. NSR works by checkpointing the routing state and synchronizing that to the standby RP internally. Such checkpoints cannot be synchronized when there is only one RP.

R10#show ip ospf nsr
Active RP
Operating in simplex mode
Redundancy state: ACTIVE
Peer redundancy state: DISABLED
Checkpoint peer not ready
Checkpoint messages enabled
ISSU negotiation not complete
ISSU versions not compatible

Routing Process "ospf 222" with ID 222.0.0.10
NSR configured
Checkpoint message sequence number: 0
Standby synchronization state: unsynchronized
Bulk sync operations: 0
Next sync check time:
LSA Count: 18, Checksum Sum 0x000A1D2B

Assuming checkpoints were being generated, the statistics below would have non-zero values. This can be used to verify that the synchronization within a router is functional.

R10#show ip ospf nsr statistics
Pending checkpoint requests (current/max): 0/0
Pending checkpoint messages (current/max): 0/0

Routing Process "ospf 222" with ID 222.0.0.10
Pending checkpoint requests (current/max): 0/0
Pending checkpoint messages (current/max): 0/0
Time spent scheduling bulk syncs (max): 0 ms
Time spent in checkpoint loop (average/max): 0/0 ms
Checkpoint loop interruptions: 0

XR validation is limited to a single line of output, but this confirms the feature is enabled.

RP/0/0/CPU0:XRv3#show ospf | include NSR
 NSR (Non-stop routing) is Enabled

14.1.3 OSPFv3 GR and NSR

OSPFv3 is very similar to OSPFv2 in terms of HA. One notable difference is that OSPFv3 calls NSF “graceful-restart”, but the terms NSF and GR are interchangeable. To configure NSF/GR, we generally substitute “graceful-restart” for the word “nsf”, and the command syntax is otherwise similar to OSPFv2. There is no Cisco proprietary feature in OSPFv3, so differentiating Cisco from IETF is not necessary. I enable GR on all routers in the right-most aggregation IGP. I also adjust the restart-interval to 5 minutes, allowing any switching routers some extra time to complete the switchover.

! CSR1 and CSR10
router ospfv3 222
 graceful-restart restart-interval 300
! XRv2 and XRv3
router ospfv3 222
 graceful-restart
 graceful-restart lifetime 300

We can verify OSPFv3 GR using some simple show commands on XE and XR. Just like OSPFv2, XE provides more detail about the feature than XR. Both platforms also support GR helper mode by default, which could be valuable on virtual routers that peer with physical routers. There aren’t many timers to adjust with this feature, either.

R10#show ospfv3 graceful-restart
OSPFv3 222 address-family ipv6 (router-id 222.0.0.10)
 Graceful Restart enabled
 restart-interval limit: 300 sec
 Router is NOT running in SSO mode
 Graceful Restart helper support enabled
 Number of neighbors performing Graceful Restart is 0

RP/0/0/CPU0:XRv2#show ospfv3 | include Grace
 Graceful Restart enabled

To test OSPFv3 NSR, I will use the PE-CE connection between CSR3 and CSR6. CSR6 disables GR helper mode, which means CSR3 should support NSR (heterogeneous environment). CSR6 could technically run GR since CSR3 did not disable helper mode, but for consistency, both PE and CE use NSR over GR.

! CSR6
router ospfv3 3
 nsr
 graceful-restart helper disable
! CSR3
router ospfv3 3
 nsr

Before verifying NSR, we verify that GR helper mode is enabled on CSR3 (default) and disabled on CSR6. This is the main motivation for wanting to use NSR on CSR3 specifically. Since NSR can be VRF-aware, we must issue the proper show command to verify this on CSR6.

R3#show ospfv3 graceful-restart
OSPFv3 3 address-family ipv4 (router-id 3.3.0.1)
 Graceful Restart helper support enabled
 Number of neighbors performing Graceful Restart is 0
OSPFv3 3 address-family ipv6 (router-id 3.3.0.1)
 Graceful Restart helper support enabled
 Number of neighbors performing Graceful Restart is 0

R6#show ospfv3 vrf OSPF graceful-restart
OSPFv3 3 address-family ipv4 vrf OSPF (router-id 10.3.6.6)
 Graceful Restart helper support disabled
OSPFv3 3 address-family ipv6 vrf OSPF (router-id 10.3.6.6)
 Graceful Restart helper support disabled

Like OSPFv2, XE provides great detail regarding NSR state and behavior. I highlight the important lines, just as in OSPFv2, which indicate that NSR is configured but not operational. NSR is configured for both IPv4 and IPv6 AFIs automatically.

R6#show ospfv3 vrf OSPF nsr
Active RP
Operating in simplex mode
Redundancy state: ACTIVE
Peer redundancy state: DISABLED
Checkpoint peer not ready
Checkpoint messages enabled
ISSU negotiation not complete
ISSU versions not compatible

OSPFv3 3 address-family ipv4 vrf OSPF (router-id 10.3.6.6)
NSR configured
Checkpoint message sequence number: 0
Standby synchronization state: unsynchronized
Bulk sync operations: 0
Next sync check time:
LSA Count: 10, Checksum Sum 0x00078C5F

OSPFv3 3 address-family ipv6 vrf OSPF (router-id 10.3.6.6)
NSR configured
Checkpoint message sequence number: 0
Standby synchronization state: unsynchronized
Bulk sync operations: 0
Next sync check time:
LSA Count: 10, Checksum Sum 0x00055A60

NSR statistics regarding the checkpoints are available, but unpopulated as the feature is not properly enabled.

R6#show ospfv3 vrf OSPF nsr statistics
Pending checkpoint requests (current/max): 0/0
Pending checkpoint messages (current/max): 0/0

OSPFv3 3 address-family ipv4 vrf OSPF (router-id 10.3.6.6)
Pending checkpoint requests (current/max): 0/0
Pending checkpoint messages (current/max): 0/0
Time spent scheduling bulk syncs (max): 0 ms
Time spent in checkpoint loop (average/max): 0/0 ms
Checkpoint loop interruptions: 0

OSPFv3 3 address-family ipv6 vrf OSPF (router-id 10.3.6.6)
Pending checkpoint requests (current/max): 0/0
Pending checkpoint messages (current/max): 0/0
Time spent scheduling bulk syncs (max): 0 ms
Time spent in checkpoint loop (average/max): 0/0 ms
Checkpoint loop interruptions: 0

14.1.4 BGP GR and NSR

Unlike many other forms of GR/NSF/NSR on virtual platforms, BGP GR actually has value. BGP GR’s primary purpose is the same as the other protocols: don’t introduce churn into the protocol while a router switches over, specify a grace period, and continue forwarding as normal. BGP accomplishes this by introducing the End-of-RIB (EoR) marker, as well as a new “Graceful Restart” capability. RFC 4724 makes specific mention that this EoR marker can be valuable to speed up convergence as well. Because it marks the end of the RIB, it is effectively an explicit signal to the peer that the BGP UPDATE messages have ceased. Rather than wait for more updates, a BGP speaker is able to begin running best-path selection immediately once all peers indicate EoR. Normally this period is controlled by the “update-delay” timer, which specifies how long BGP can stay in read-only mode. The EoR marker is a more event-driven mechanism for starting the best-path selection process.

Specifically, the EoR marker for IPv4 is an UPDATE message with no prefixes. For all other AFIs, it uses the MP_UNREACH_NLRI, a multi-protocol extension used for identifying unreachable prefixes (withdrawals). No prefixes are contained inside this message, either.

XR advertises the BGP GR capability to all peers by default. We can check CSR8 to prove this. CSR8 has GR disabled for all peers by default, yet the capability is advertised from XRv2. This is not an AFI-specific feature and applies to the BGP session in general. Checking XRv1 for VPNv4, we can see it advertises the GR capability to CSR10 and CSR6 by default even though only VPNv4/v6 is negotiated, not IPv4.

R8#show bgp ipv4 unicast neighbors | include ^BGP|race
BGP neighbor is 222.0.0.1, remote AS 222, internal link
 Graceful-Restart is disabled
BGP neighbor is 222.0.0.7, remote AS 222, internal link
 Graceful-Restart is disabled
BGP neighbor is 222.0.0.12, remote AS 222, internal link
 Graceful Restart Capability: received
 Graceful-Restart is disabled

RP/0/0/CPU0:XRv1#show bgp vpnv4 unicast neighbors | utility egrep '^BGP|race'
BGP neighbor is 222.0.0.6
 Graceful Restart (GR Awareness): advertised
BGP neighbor is 222.0.0.10
 Graceful Restart (GR Awareness): advertised
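The two EoR encodings are small enough to build by hand. The sketch below constructs both forms as described in RFC 4724: an IPv4 unicast UPDATE with no withdrawn routes and no path attributes, and, for other address families, an UPDATE whose only attribute is an empty MP_UNREACH_NLRI. The function names are illustrative, and the VPNv4 example uses AFI 1 / SAFI 128.

```python
import struct

BGP_MARKER = b"\xff" * 16        # 16-octet all-ones marker in every BGP header
BGP_HEADER_LEN = 19              # marker (16) + length (2) + type (1)
BGP_TYPE_UPDATE = 2
ATTR_FLAG_OPTIONAL = 0x80        # optional, non-transitive
ATTR_MP_UNREACH_NLRI = 15


def _update(body):
    """Prepend the BGP header to an UPDATE message body."""
    header = BGP_MARKER + struct.pack("!HB", BGP_HEADER_LEN + len(body), BGP_TYPE_UPDATE)
    return header + body


def eor_ipv4_unicast():
    """EoR for IPv4 unicast: zero withdrawn routes, zero path attributes, no NLRI."""
    return _update(struct.pack("!HH", 0, 0))


def eor_multiprotocol(afi, safi):
    """EoR for any other AFI/SAFI: MP_UNREACH_NLRI carrying no withdrawn prefixes."""
    attr_value = struct.pack("!HB", afi, safi)
    attr = struct.pack("!BBB", ATTR_FLAG_OPTIONAL, ATTR_MP_UNREACH_NLRI, len(attr_value)) + attr_value
    return _update(struct.pack("!H", 0) + struct.pack("!H", len(attr)) + attr)


if __name__ == "__main__":
    print(len(eor_ipv4_unicast()), eor_ipv4_unicast().hex())
    print(len(eor_multiprotocol(1, 128)), eor_multiprotocol(1, 128).hex())
```

The IPv4 marker is the minimum-length UPDATE (23 octets), which is why a receiver can recognize it without inspecting any attributes.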

To be clear, XR advertises that it is “GR-aware” but the actual feature is not enabled by default in XR. Below, I show the output on XRv2 within the IPv4 labeled-unicast AFI. GR awareness is advertised to the XE peers, and bidirectionally exchanged with the XR peers.

RP/0/0/CPU0:XRv2#show bgp ipv4 labeled-unicast neighbors | utility egrep '^BGP|race'
BGP neighbor is 222.0.0.8
 Graceful Restart (GR Awareness): advertised
BGP neighbor is 222.0.0.10
 Graceful Restart (GR Awareness): advertised
BGP neighbor is 222.0.0.13
 Graceful Restart (GR Awareness): advertised and received
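RFC 4724 allows the Graceful Restart capability (code 64) to be sent with no AFI/SAFI entries, which signals support for the EoR marker and helper procedures without any promise to preserve forwarding state. It is an assumption on my part that this is what XR reports as “GR Awareness”; the sketch below simply encodes both variants so the difference is visible on the wire. The function name and defaults are illustrative.

```python
import struct

CAP_GRACEFUL_RESTART = 64
RESTART_STATE_FLAG = 0x8       # "R" bit: top bit of the 4-bit restart flags
FORWARDING_STATE_FLAG = 0x80   # "F" bit in the per-AFI/SAFI flags octet


def gr_capability(restart_time, afi_safis=(), restarted=False, forwarding_preserved=False):
    """Encode the Graceful Restart capability (code, length, value).

    restart_time is a 12-bit value in seconds; afi_safis is an iterable of
    (afi, safi) pairs. An empty iterable yields the "aware only" form.
    """
    if not 0 <= restart_time <= 0xFFF:
        raise ValueError("restart time must fit in 12 bits")
    flags = RESTART_STATE_FLAG if restarted else 0
    value = struct.pack("!H", (flags << 12) | restart_time)
    for afi, safi in afi_safis:
        af_flags = FORWARDING_STATE_FLAG if forwarding_preserved else 0
        value += struct.pack("!HBB", afi, safi, af_flags)
    return struct.pack("!BB", CAP_GRACEFUL_RESTART, len(value)) + value


if __name__ == "__main__":
    # Aware only: no address families listed.
    print(gr_capability(120).hex())
    # Full GR for IPv4 unicast with forwarding state preserved.
    print(gr_capability(120, [(1, 1)], forwarding_preserved=True).hex())
```

When no address families are listed, the restart time carried in the capability has no practical meaning to the receiver, since there is no forwarding state to wait for.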

Next, I enable the feature on XRv2 and re-issue the show command. GR is now “enabled” for each neighbor by default.

! XRv2
router bgp 222
 bgp graceful-restart

RP/0/0/CPU0:XRv2#show bgp ipv4 labeled-unicast neighbors | utility egrep '^BGP|race'
BGP neighbor is 222.0.0.8
 Graceful restart is enabled
 Graceful Restart (GR Awareness): advertised
 Graceful Restart capability advertised
BGP neighbor is 222.0.0.10
 Graceful restart is enabled
 Graceful Restart (GR Awareness): advertised
 Graceful Restart capability advertised
BGP neighbor is 222.0.0.13
 Graceful restart is enabled
 Graceful Restart (GR Awareness): advertised and received
 Graceful Restart capability advertised

XRv2 will assume that CSR10 is incapable of GR. Whether this is true or not does not matter; below I illustrate explicitly disabling BGP GR for a specific peer. XRv2 continues to notify CSR10 that it is GR-aware, but is not running the feature with CSR10 anymore. GR is still enabled for the other peers as shown below.

! XRv2
router bgp 222
 neighbor 222.0.0.10
  graceful-restart disable

RP/0/0/CPU0:XRv2#show bgp ipv4 labeled-unicast neighbors | utility egrep '^BGP|race'
BGP neighbor is 222.0.0.8
 Graceful restart is enabled
 Graceful Restart (GR Awareness): advertised
 Graceful Restart capability advertised
BGP neighbor is 222.0.0.10
 Graceful Restart (GR Awareness): advertised
BGP neighbor is 222.0.0.13
 Graceful restart is enabled
 Graceful Restart (GR Awareness): advertised and received
 Graceful Restart capability advertised

Enabling BGP GR on XE is identical, except that BGP sessions must be hard-reset before the changes take effect. I begin with CSR8.

! CSR8
router bgp 222
 bgp graceful-restart

R8(config-router)#bgp graceful-restart
All BGP sessions must be reset to take the new GR config
R8#clear bgp ipv4 unicast *

Now, CSR8 advertises this capability to all routers. CSR1 and CSR7 are not advertising it yet, but XRv2 always was; CSR8 and XRv2 are now capable of using BGP GR fully with one another. Again, the NSF capabilities of GR still require SSO and redundant RPs, so this would be specific to the routing optimization aspect (via the EoR marker). CSR8 claims GR is enabled for all peers, but the capability should be exchanged bidirectionally first.

R8#show bgp ipv4 unicast neighbors | include ^BGP|race
BGP neighbor is 222.0.0.1, remote AS 222, internal link
 Graceful Restart Capability: advertised
 Graceful-Restart is enabled, restart-time 120 seconds, stalepath-time 360 seconds
BGP neighbor is 222.0.0.7, remote AS 222, internal link
 Graceful Restart Capability: advertised
 Graceful-Restart is enabled, restart-time 120 seconds, stalepath-time 360 seconds
BGP neighbor is 222.0.0.12, remote AS 222, internal link
 Graceful Restart Capability: advertised and received
 Graceful-Restart is enabled, restart-time 120 seconds, stalepath-time 360 seconds
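Because “Graceful-Restart is enabled” appears for every peer regardless of what the peer sent, the capability line is the one worth checking. The sketch below parses captured XE output and lists the neighbors where the capability was both advertised and received. It assumes the XE line formats shown in the CSR8 output; the function names are hypothetical, and the approach is plain text matching rather than any structured API.

```python
import re

# Captured from CSR8 after enabling GR locally.
SAMPLE = """\
BGP neighbor is 222.0.0.1, remote AS 222, internal link
 Graceful Restart Capability: advertised
 Graceful-Restart is enabled, restart-time 120 seconds, stalepath-time 360 seconds
BGP neighbor is 222.0.0.7, remote AS 222, internal link
 Graceful Restart Capability: advertised
 Graceful-Restart is enabled, restart-time 120 seconds, stalepath-time 360 seconds
BGP neighbor is 222.0.0.12, remote AS 222, internal link
 Graceful Restart Capability: advertised and received
 Graceful-Restart is enabled, restart-time 120 seconds, stalepath-time 360 seconds
"""

NEIGHBOR_RE = re.compile(r"^\s*BGP neighbor is ([0-9.]+),")
CAPABILITY_RE = re.compile(r"^\s*Graceful Restart Capability:\s*(.+?)\s*$")


def gr_capability_state(show_output):
    """Map each neighbor address to its GR capability exchange state."""
    state = {}
    current = None
    for line in show_output.splitlines():
        match = NEIGHBOR_RE.match(line)
        if match:
            current = match.group(1)
            state[current] = "not exchanged"
            continue
        match = CAPABILITY_RE.match(line)
        if match and current is not None:
            state[current] = match.group(1)
    return state


def fully_negotiated(show_output):
    """Return neighbors where the capability was advertised and received."""
    return [
        neighbor
        for neighbor, exchange in gr_capability_state(show_output).items()
        if exchange == "advertised and received"
    ]


if __name__ == "__main__":
    print(fully_negotiated(SAMPLE))
```

On this capture, only 222.0.0.12 qualifies, which matches the observation that XRv2 was advertising the capability before the XE peers were configured.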

I quickly enable GR on all other BGP speakers except CSR10 and XRv3. To avoid any inconsistent behavior, I hard-clear all BGP sessions on all BGP routers. CSR8 now shows that the capability is exchanged with all peers. XRv1 exchanges it with CSR6 but not with CSR10, which is correct. We are assuming CSR10 and XRv3 are not GR capable.

! CSR1, CSR6, CSR7, and XRv1
router bgp 222
 bgp graceful-restart

R8#show bgp ipv4 unicast neighbors | include ^BGP|tart_Cap
BGP neighbor is 222.0.0.1, remote AS 222, internal link
 Graceful Restart Capability: advertised and received
BGP neighbor is 222.0.0.7, remote AS 222, internal link
 Graceful Restart Capability: advertised and received
BGP neighbor is 222.0.0.12, remote AS 222, internal link
 Graceful Restart Capability: advertised and received

RP/0/0/CPU0:XRv1#show bgp vpnv4 unicast neighbors | utility egrep '^BGP|race'
BGP neighbor is 222.0.0.6
 Graceful restart is enabled
 Graceful Restart (GR Awareness): advertised and received
 Graceful Restart capability advertised
 Graceful Restart capability received
 Graceful Restart capability advertised
 Graceful Restart capability received
BGP neighbor is 222.0.0.10
 Graceful restart is enabled
 Graceful Restart (GR Awareness): advertised
 Graceful Restart capability advertised
 Graceful Restart capability advertised

For cleanup, any routers that peer with CSR10 or XRv3 will disable GR towards that peer. This was demonstrated once in XRv2, but I apply it everywhere necessary. It is hard to see the value of BGP GR on Cisco routers with respect to using EoR for routing optimization. This is because they send BGP keepalives as soon as the updates are finished sending, which allows BGP best-path to start sooner. Routers that do not behave this way (such as other vendor products) may benefit. This keepalive technique is not a BGP requirement and is specific to Cisco platforms. When used only for routing convergence optimizations, BGP GR makes more sense in multi-vendor environments. The remainder of this test will focus on the traditional use-case of GR/NSF and not the convergence optimizations.

! XRv1
router bgp 222
 neighbor 222.0.0.10
  graceful-restart disable
! XRv2
router bgp 222
 neighbor 222.0.0.13
  graceful-restart disable
! CSR1
router bgp 222
 neighbor 222.0.0.10 ha-mode graceful-restart disable
 neighbor 222.0.0.13 ha-mode graceful-restart disable

BGP GR specifies two significant timers: restart-time and stalepath-time. The restart-time begins counting down once a TCP session fails between a pair of peers during an RP switchover. The peer router (not the restarting router) must provide routing information to the newly active RP (the former standby), and this timer controls how long to wait for a BGP OPEN message from the router that is switching over. It is equivalent to the IGP NSF restart timers that limit the switchover time. When a router begins an RP switchover, all routes learned from that peer are marked as stale, yet traffic is still allowed to forward through that peer. The stalepath-time controls how long these routes can be used before they are flushed. Normally, it is longer than the restart-time because receiving an OPEN message will happen before total BGP convergence, especially in large networks. When BGP GR is enabled on XE, these timers are explicitly added to the configuration using their default values as shown below. The restart and stale timers are 2 and 6 minutes respectively.

! XE default values for BGP GR
router bgp 222
 bgp graceful-restart restart-time 120
 bgp graceful-restart stalepath-time 360
 bgp graceful-restart
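The interaction of the two timers can be summarized as a small decision function from the helper's point of view. This is a deliberately simplified sketch: it assumes both timers are measured from the moment the TCP session fails, which may not match the exact point at which a given platform starts the stale-path timer, and the function name and return strings are hypothetical.

```python
def stale_route_action(elapsed, open_received_at=None, eor_received=False,
                       restart_time=120, stalepath_time=360):
    """Decide what a helper does with routes marked stale for a restarting peer.

    elapsed          -- seconds since the TCP session to the peer failed
    open_received_at -- seconds after the failure at which a new OPEN arrived,
                        or None if the peer has not come back yet
    eor_received     -- True once the peer has sent its End-of-RIB marker
    Returns "keep" or "flush".
    """
    # The peer never re-opened the session within the restart-time.
    if open_received_at is None and elapsed >= restart_time:
        return "flush"
    # The peer came back, but only after the restart-time had already expired.
    if open_received_at is not None and open_received_at > restart_time:
        return "flush"
    # EoR means the peer has finished re-advertising; anything still stale goes.
    if eor_received:
        return "flush"
    # Upper bound on how long stale routes may be used (assumed to start at failure).
    if elapsed >= stalepath_time:
        return "flush"
    return "keep"


if __name__ == "__main__":
    print(stale_route_action(60))                         # still waiting for OPEN
    print(stale_route_action(121))                        # restart-time expired
    print(stale_route_action(200, open_received_at=90))   # session back, no EoR yet
    print(stale_route_action(361, open_received_at=90))   # stalepath-time expired
```

The model shows why the stalepath-time is normally the longer of the two: the OPEN only proves the peer is alive, while the stale routes remain in use until the peer has re-sent its table.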

This does not occur in XR, so for consistency, I configure these timers on the BGP XR routers manually. XRv3 is excluded since it notionally does not support GR.

! XRv1 and XRv2
router bgp 222
 bgp graceful-restart restart-time 120
 bgp graceful-restart stalepath-time 360

NSR is less straightforward. On XE, we adjust the HA-mode to SSO in order to enable NSR. On XR, the command is simply “nsr”. Although the goal of NSR is identical to the IGP NSR features, the BGP use case is focused heavily on PE-CE routing. A common case would be that the CE router does not support BGP GR, which means it cannot “help” the PE. Rather than just configure NSR on the PE, which is supported, it makes sense to leave GR configured and add NSR as well. This will allow the router to be “helped” by iBGP peers within the provider’s network, while also allowing it to not rely on GR if it is unavailable. XRv1 is configured with BGP NSR below, and its supported CE, CSR5, was never configured with BGP GR.

! XRv1
router bgp 222
 nsr

The “show bgp nsr” command prints about 5 pages of output. For brevity, I show the first few stanzas. Most of this output isn’t NSR specific, but lists the critical BGP behaviors that are expected to be consistent when the RP changes. This includes basic fields such as the BGP ASN, BGP router ID, GR capability, and various timers. In this configuration, XRv1 can be protected by iBGP peers via GR and can protect itself using NSR.

RP/0/0/CPU0:XRv1#show bgp nsr
BGP Process Information:
BGP is operating in STANDALONE mode
Autonomous System number format: ASPLAIN
Autonomous System: 222
Router ID: 222.0.0.11
Default Cluster ID: 222.0.0.11
Active Cluster IDs: 222.0.0.11
Fast external fallover enabled
Neighbor logging is enabled
Enforce first AS enabled
Default local preference: 100
Default keepalive: 60
Graceful restart enabled
Restart time: 120
Stale path timeout time: 360
RIB purge timeout time: 600
Non-stop routing is enabled
Update delay: 120
Generic scan interval: 60

Address family: IPv4 Unicast
Dampening is not enabled
Client reflection is enabled in global config
Dynamic MED is Disabled
Dynamic MED interval : 10 minutes
Dynamic MED Timer : Not Running
Dynamic MED Periodic Timer : Not Running
Scan interval: 60
Total prefixes scanned: 4
Prefixes scanned per segment: 100000
Number of scan segments: 1
Nexthop resolution minimum prefix-length: 0 (not configured)
Main Table Version: 61
Table version synced to RIB: 61
Table version acked by RIB: 61
IGP notification: IGPs notified
RIB has converged: version 13
RIB table prefix-limit reached ? [No], version 0
Permanent Network Unconfigured
[snip]

I enable NSR on all provider BGP speakers running XE, as well as XRv2 for completeness. Just like BGP GR, this requires a hard reset on all peers for which NSR is configured on XE routers.

! CSR1, CSR6, CSR7, and CSR8
router bgp 222
 bgp ha-mode sso
! XRv2
router bgp 222
 nsr

Just like before, we should not enable this feature towards CSR10 or XRv3. There is no way to control this on XR, but XE allows it to be disabled on a per-neighbor basis, just like GR.

! CSR1
router bgp 222
 neighbor 222.0.0.10 ha-mode sso disable
 neighbor 222.0.0.13 ha-mode sso disable

To verify the feature, I begin with CSR8. NSR is being used interchangeably with SSO here, and it is enabled for all peers. The XE peers show a communication failure while the XR peer shows that the feature is not enabled. This is probably a result of different implementations, and since the feature doesn’t work on virtual machines anyway, I do not examine the output further.

R8#show bgp ipv4 unicast neighbors | include ^BGP|SSO
BGP neighbor is 222.0.0.1, remote AS 222, internal link
 SSO is enabled
 SSO last disable reason: Application disable (Active)
BGP neighbor is 222.0.0.7, remote AS 222, internal link
 SSO is enabled
 SSO last disable reason: Communication failure (Active)
BGP neighbor is 222.0.0.12, remote AS 222, internal link
 SSO is enabled
 SSO last disable reason: Application disable (Active)

As on XR, XE has dedicated show commands for NSR, except they are executed per AFI. The XR commands can also be issued per-AFI, but the global command shown above returns output for all AFIs. The output is very verbose, and most of it is invalid as a result of the feature not being fully supported on the CSR1000v. There are 3 sessions established with NSR enabled, but none of them support SSO.

R8#show bgp ipv4 unicast sso summary
Total sessions with stateful switchover support enabled: 0
Total sessions configured with NSR mode and in established state: 3
Total number of times throttle ON by TCP: 0
Total number of times throttle OFF by TCP: 0
Total number of times CF Message Transmission Failure: 0
Total number of times CF Flow OFF received: 0
Total number of times CF Flow ON received: 0
Total number of times CF Flow Warning message received: 0
Total number of times CF peer-not-ready to sync: 0
RF sent peer-comm-down notification: NO
Total number of sessions on the Enable SyncQ: 0
Total number of sessions on the Update SyncQ: 0
Total number of sessions removed from Tcp SyncQ during Bulk Sync: 0
Total number of sessions removed from Enable SyncQ during Bulk Sync: 0
Total number of sessions removed from Update SyncQ during Bulk Sync: 0
Total number of times TCP sent Sync Enable Notification: 0
Total number of times TCP sent Sync Disable Notification: 0
Total number of times TCP sent Retry SSO Notification: 0
Total APDb walk prefixes sent: 0
Total APDb walk prefix send failures: 0
Total APDb topology walk prefixes sent: 0
Total APDb topology walk prefix send failures: 0
Total APDb prefixes sent: 0
Total WPdb prefixes sent: 0
bgp_tcp_sso_enable_pending is 0
tcp_syncQ: 0, enable_syncQ: 0, update_syncQ: 0
Bulk Sync Done: NO

For completeness, we can see CSR6 trying to use NSR towards XRv1 inside the VPNv4/v6 AFI. Although it still fails, this shows that NSR is supported in many AFIs.

R6#show bgp vpnv4 unicast all neighbors | include ^BGP|SSO
BGP neighbor is 222.0.0.11, remote AS 222, internal link
 SSO is enabled
 SSO last disable reason: Application disable (Active)

R6#show bgp vpnv6 unicast all neighbors | include ^BGP|SSO
BGP neighbor is 222.0.0.11, remote AS 222, internal link
 SSO is enabled
 SSO last disable reason: Application disable (Active)

14.1.5 LDP GR and NSR
Although not a routing protocol, LDP is critical for most MPLS applications. It also has the concept of GR/NSF and NSR for this reason. Because LDP is an overlay protocol of sorts, it is recommended that the underlying protocols in the network that interact with LDP are also GR/NSF-enabled. This includes OSPF, IS-IS, and/or BGP. Similar to the IGPs, when LDP GR is enabled without SSO being supported, the router can only operate in “helper” mode to synchronize state with a device that is actively switching over. Enabling SSO is a requirement to achieve literal “nonstop forwarding” with LDP, which we cannot test in a virtual environment. I begin by enabling the feature on all P and PE routers.
! CSR1, CSR6, CSR7, CSR8, CSR9, CSR10
mpls ldp graceful-restart
! XRv1, XRv2, and XRv3
mpls ldp
 graceful-restart

When applying these changes, XE warns us that GR will not apply for existing LDP sessions, but only new ones. I manually clear all LDP peers on all LDP GR routers to avoid any confusion. XR automatically flaps all LDP peers when the feature is configured, so a manual clearing is not required.
R8(config)#mpls ldp graceful-restart
% Previously established LDP sessions may not have graceful restart protection.

When LDP GR is negotiated, we can see the details on both platforms. This reveals several new timers that will be discussed later. For now, we can verify that all peers on CSR9 and XRv2 are GR-enabled, as expected. There are no obvious incompatibilities between XE and XR with this feature. The “connect count” on XRv2 simply counts the number of times the session forms; if a session is cleared, or a peer fails, this number will increment.

R9#show mpls ldp graceful-restart
LDP Graceful Restart is enabled
Neighbor Liveness Timer: 120 seconds
Max Recovery Time: 120 seconds
Forwarding State Holding Time: 600 seconds
Down Neighbor Database (0 records):
Graceful Restart-enabled Sessions:
VRF default:
    Peer LDP Ident: 222.0.0.1:0, State: estab
    Peer LDP Ident: 222.0.0.8:0, State: estab
    Peer LDP Ident: 222.0.0.14:0, State: estab

RP/0/0/CPU0:XRv2#show mpls ldp graceful-restart
Forwarding State Hold timer : Not Running
GR Neighbors : 3
Neighbor ID      Up  Connect Count  Liveness Timer      Recovery Timer
---------------  --  -------------  ------------------  ------------------
222.0.0.1        Y   1
222.0.0.8        Y   2
222.0.0.10       Y   2              -

Alternatively, XE has a command to see some extra detail per neighbor regarding GR. This lists all GR-enabled peers along with their reconnection timers of 2 minutes.
R8#show mpls ldp neighbor graceful-restart
Peer LDP Ident: 222.0.0.9:0; Local LDP Ident 222.0.0.8:0
    TCP connection: 222.0.0.9.12152 - 222.0.0.8.646
    State: Oper; Msgs sent/rcvd: 12/12; Downstream
    Up time: 00:03:10
    Graceful Restart enabled; Peer reconnect time (msecs): 120000
Peer LDP Ident: 222.0.0.12:0; Local LDP Ident 222.0.0.8:0
    TCP connection: 222.0.0.12.14292 - 222.0.0.8.646
    State: Oper; Msgs sent/rcvd: 12/19; Downstream
    Up time: 00:03:10
    Graceful Restart enabled; Peer reconnect time (msecs): 120000
Peer LDP Ident: 222.0.0.14:0; Local LDP Ident 222.0.0.8:0
    TCP connection: 222.0.0.14.43579 - 222.0.0.8.646
    State: Oper; Msgs sent/rcvd: 12/16; Downstream
    Up time: 00:03:06
    Graceful Restart enabled; Peer reconnect time (msecs): 120000

To test what this reconnection timer does, we can cause an LDP failure in the network. Clearing all peers on CSR10 triggers a GR event (not shown). If we quickly check XRv3 and CSR1, we can see that their recovery timers start counting down from 2 minutes. Although the peer session forms before 2 minutes, this timer effectively represents the grace period seen in other GR/NSF implementations. If the active RP failed, the session must be restored within those 2 minutes.

RP/0/0/CPU0:XRv3#show mpls ldp graceful-restart
Forwarding State Hold timer : Not Running
GR Neighbors : 2
Neighbor ID      Up  Connect Count  Liveness Timer      Recovery Timer
---------------  --  -------------  ------------------  ------------------
222.0.0.1        Y   1
222.0.0.10       Y   2              Not running         106 sec remaining

R1#show mpls ldp graceful-restart LDP Graceful Restart is enabled Neighbor Liveness Timer: 120 seconds Max Recovery Time: 120 seconds Forwarding State Holding Time: 600 seconds Down Neighbor Database (1 records): VRF default: Peer LDP Ident: 222.0.0.10:0 [inst 2], Local LDP Ident: 222.0.0.1:0 Status: recovering (25 seconds left) Address list contains 0 addresses: Graceful Restart-enabled Sessions: VRF default: Peer LDP Ident: 222.0.0.13:0, State: estab Peer LDP Ident: 222.0.0.9:0, State: estab Peer LDP Ident: 222.0.0.12:0, State: estab Peer LDP Ident: 222.0.0.10:0, State: estab

XE also generates log messages to show that the session failed and that GR begins its 2 minute countdown. After 2 minutes, the process completes. ! CSR1 21:07:49.534: %LDP-5-GR: GR session 222.0.0.10:0 (inst 2): interrupted-recovery pending 21:07:52.995: %LDP-5-GR: GR session 222.0.0.10:0 (inst 5): starting graceful recovery 21:09:52.995: %LDP-5-GR: GR session 222.0.0.10:0 (inst 5): completed graceful recovery

The LDP GR timers are not intuitively named. The “neighbor-liveness” timer is the equivalent of the grace period, while the “max-recovery” timer determines how long to continue using stale label bindings after an LDP session is established. A third timer, the “forwarding-holding” timer, determines how long label bindings should be stored after a control plane resets. This sounds very similar to the “max-recovery” timer but begins counting down at the time of failure, not restoration. The default for each of the first two timers is 2 minutes, and the default for “forwarding-holding” is 10 minutes. To test these timers, I adjust them on CSR7 first.
! CSR7
mpls ldp graceful-restart timers neighbor-liveness 30


mpls ldp graceful-restart timers max-recovery 60
mpls ldp graceful-restart timers forwarding-holding 120

Before testing anything, we verify that the timers were accepted by the router. CSR7 still has 3 LDP neighbors and adjusting the timers did not cause them to flap. R7#show mpls ldp graceful-restart LDP Graceful Restart is enabled Neighbor Liveness Timer: 30 seconds Max Recovery Time: 60 seconds Forwarding State Holding Time: 120 seconds Down Neighbor Database (0 records): Graceful Restart-enabled Sessions: VRF default: Peer LDP Ident: 222.0.0.6:0, State: estab Peer LDP Ident: 222.0.0.14:0, State: estab Peer LDP Ident: 222.0.0.11:0, State: estab

Next, I clear all of CSR7’s neighbors, then check their status a few seconds later. We cannot adjust the reconnection timer of 2 minutes, but we see that the max-recovery timer begins counting down as soon as the session is established. I color-code the peers for consistency with other tests. R7#show mpls ldp neighbor graceful-restart Peer LDP Ident: 222.0.0.6:0; Local LDP Ident 222.0.0.7:0 TCP connection: 222.0.0.6.646 - 222.0.0.7.21879 State: Oper; Msgs sent/rcvd: 11/6; Downstream Up time: 00:00:17 Graceful Restart enabled; Peer reconnect time (msecs): 120000 Down Neighbor Information: Status: recovering (42 seconds left) Address list contains 0 addresses Peer LDP Ident: 222.0.0.14:0; Local LDP Ident 222.0.0.7:0 TCP connection: 222.0.0.14.49261 - 222.0.0.7.646 State: Oper; Msgs sent/rcvd: 11/13; Downstream Up time: 00:00:14 Graceful Restart enabled; Peer reconnect time (msecs): 120000 Down Neighbor Information: Status: recovering (45 seconds left) Address list contains 0 addresses Peer LDP Ident: 222.0.0.11:0; Local LDP Ident 222.0.0.7:0 TCP connection: 222.0.0.11.20174 - 222.0.0.7.646 State: Oper; Msgs sent/rcvd: 11/8; Downstream Up time: 00:00:14 Graceful Restart enabled; Peer reconnect time (msecs): 120000 Down Neighbor Information: Status: recovering (45 seconds left) Address list contains 0 addresses


Looking at the specific timestamps, we can see that almost exactly 60 seconds pass between the time a peer is established and the time the graceful recovery completes. This removes the stale label bindings from the peer since it is expected that new bindings can be exchanged.
! CSR7
21:27:48.559: %LDP-5-GR: GR session 222.0.0.6:0 (inst 4): starting graceful recovery
21:27:48.559: %LDP-5-NBRCHG: LDP Neighbor 222.0.0.6:0 (4) is UP
21:27:50.958: %LDP-5-GR: GR session 222.0.0.14:0 (inst 5): starting graceful recovery
21:27:50.958: %LDP-5-NBRCHG: LDP Neighbor 222.0.0.14:0 (5) is UP
21:27:51.699: %LDP-5-GR: GR session 222.0.0.11:0 (inst 6): starting graceful recovery
21:27:51.699: %LDP-5-NBRCHG: LDP Neighbor 222.0.0.11:0 (6) is UP
21:28:48.559: %LDP-5-GR: GR session 222.0.0.6:0 (inst 4): completed graceful recovery
21:28:50.959: %LDP-5-GR: GR session 222.0.0.14:0 (inst 5): completed graceful recovery
21:28:51.698: %LDP-5-GR: GR session 222.0.0.11:0 (inst 6): completed graceful recovery
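The 60-second gap can be checked mechanically rather than by eye. The following is a minimal Python sketch, not part of the lab itself, that pairs each "starting graceful recovery" message with its matching "completed" message and reports the elapsed time. The function name `recovery_durations` and the regular expression are illustrative assumptions; the log lines are copied from the CSR7 output above.

```python
import re
from datetime import datetime

# GR log lines copied from the CSR7 output (NBRCHG lines omitted).
LOG = """\
21:27:48.559: %LDP-5-GR: GR session 222.0.0.6:0 (inst 4): starting graceful recovery
21:27:50.958: %LDP-5-GR: GR session 222.0.0.14:0 (inst 5): starting graceful recovery
21:27:51.699: %LDP-5-GR: GR session 222.0.0.11:0 (inst 6): starting graceful recovery
21:28:48.559: %LDP-5-GR: GR session 222.0.0.6:0 (inst 4): completed graceful recovery
21:28:50.959: %LDP-5-GR: GR session 222.0.0.14:0 (inst 5): completed graceful recovery
21:28:51.698: %LDP-5-GR: GR session 222.0.0.11:0 (inst 6): completed graceful recovery
"""

# Hypothetical pattern written for this exact message format.
PATTERN = re.compile(
    r"^(\d\d:\d\d:\d\d\.\d{3}): %LDP-5-GR: GR session (\S+) \(inst \d+\): "
    r"(starting|completed) graceful recovery$"
)


def recovery_durations(log_text):
    """Return {peer: seconds between 'starting' and 'completed' messages}."""
    started = {}
    durations = {}
    for line in log_text.splitlines():
        match = PATTERN.match(line.strip())
        if not match:
            continue
        stamp, peer, phase = match.groups()
        when = datetime.strptime(stamp, "%H:%M:%S.%f")
        if phase == "starting":
            started[peer] = when
        elif peer in started:
            durations[peer] = (when - started.pop(peer)).total_seconds()
    return durations


if __name__ == "__main__":
    for peer, seconds in recovery_durations(LOG).items():
        print(f"{peer}: {seconds:.3f} s")
```

Each peer should come out within a millisecond or two of the configured max-recovery value of 60 seconds. The sketch assumes all timestamps fall on the same day, since the log lines carry no date.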

To see the “neighbor-liveness” timer in action, I clear all LDP peers on CSR7 again. I immediately (within 5 seconds) check the GR status to see that 2 of the neighbors are still down, and one has already recovered. The recovering router has already started its “max-recovery” timer, which begins after the session establishes. The liveness timer begins counting down when the peer fails and is still counting down for XRv1 and XRv4. While the router is still waiting for a peer to reconnect, all addresses bound to that LDP peer are shown in the address list until the session reforms.
R7#show mpls ldp graceful-restart
LDP Graceful Restart is enabled
Neighbor Liveness Timer: 30 seconds
Max Recovery Time: 60 seconds
Forwarding State Holding Time: 120 seconds
Down Neighbor Database (3 records):
VRF default:
Peer LDP Ident: 222.0.0.6:0 [inst 4], Local LDP Ident: 222.0.0.7:0
Status: recovering (57 seconds left)
Address list contains 0 addresses:
Peer LDP Ident: 222.0.0.11:0 [inst 6], Local LDP Ident: 222.0.0.7:0
Status: waiting for reconnection (25 seconds left)
Address list contains 3 addresses:
222.6.11.11 222.7.11.11 222.0.0.11
Peer LDP Ident: 222.0.0.14:0 [inst 5], Local LDP Ident: 222.0.0.7:0
Status: waiting for reconnection (25 seconds left)


Address list contains 5 addresses:
222.14.9.14 222.7.14.14 222.8.14.14 222.0.0.14 222.9.14.14
Graceful Restart-enabled Sessions:
VRF default:
Peer LDP Ident: 222.0.0.6:0, State: estab
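To make the relationship between the two XE timers concrete, here is a minimal Python sketch of the timeline as described in this section: the liveness timer starts at the moment of failure, and the max-recovery timer starts only once the session re-establishes. The names `GrTimers` and `gr_outcome` are hypothetical. The model ignores the forwarding-holding timer, and it assumes stale bindings are removed when the liveness timer expires without a reconnection, which is an assumption rather than something demonstrated in the lab.

```python
from dataclasses import dataclass


@dataclass
class GrTimers:
    neighbor_liveness: int = 120  # seconds; XE default per the text
    max_recovery: int = 120       # seconds; XE default per the text


def gr_outcome(timers, reconnect_after):
    """Model one peer failure at t=0 that reconnects after `reconnect_after` seconds.

    Returns (recovered, flush_time): whether the session returned before the
    liveness timer expired, and the time (seconds after the failure) at which
    stale bindings for that peer are removed.
    """
    if reconnect_after > timers.neighbor_liveness:
        # Assumption: the peer never returned in time, so stale bindings
        # are dropped when the liveness timer expires.
        return False, timers.neighbor_liveness
    # The peer returned: max-recovery starts when the session establishes.
    return True, reconnect_after + timers.max_recovery


if __name__ == "__main__":
    csr7 = GrTimers(neighbor_liveness=30, max_recovery=60)
    print(gr_outcome(csr7, reconnect_after=5))
    print(gr_outcome(csr7, reconnect_after=45))
```

With CSR7's values of 30 and 60 seconds, a peer that returns 5 seconds after the failure keeps its stale bindings until 65 seconds after the failure, while a peer that takes 45 seconds misses the 30-second liveness window.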

Seconds later, the remaining LDP sessions to XRv1 and XRv4 come up. The output is not shown again as we know the max-recovery timers begin counting down from 60 seconds after the session forms.
Next, we will examine the available timers on XR. XR allows us to configure the reconnect-timeout, which is equivalent to the XE neighbor-liveness timer. This specifies how long to wait for the peer to return after a failure. The recovery timer is fixed at 2 minutes, whereas XE allowed us to adjust this (max-recovery). The forwarding-state-holdtime behaves identically to the XE forwarding-holding timer.
! XRv2
mpls ldp
 graceful-restart reconnect-timeout 60
 graceful-restart forwarding-state-holdtime 150

Clearing the LDP peers on XRv2 and immediately looking at the LDP GR status, we see output similar to what we saw on XE. CSR8 comes up quickly and its recovery timer begins counting down from 120 seconds; this mimics the XE max-recovery timer, except it is fixed at 2 minutes. The other peers that are still down, CSR1 and CSR10, count down from the reconnect-timeout of 60 seconds.
RP/0/0/CPU0:XRv2#show mpls ldp graceful-restart
Forwarding State Hold timer : Not Running
GR Neighbors : 3
Neighbor ID      Up  Connect Count  Liveness Timer      Recovery Timer
---------------  --  -------------  ------------------  ------------------
222.0.0.1        N   1              52 sec remaining    Not running
222.0.0.8        Y   3              Not running         118 sec remaining
222.0.0.10       N   2              52 sec remaining    Not running

Last, we examine LDP NSR. Like the other NSR variants, this is used to mask RP failures on a router by handling all of the switchover locally. This can be used in environments where LDP GR is not supported by all platforms. The LDP sessions will not flap during switchover and the entire operation will appear transparent to all LDP peers. ! CSR10 mpls ldp nsr ! XRv3 mpls ldp nsr


We can verify the NSR configuration by checking the summary information on both routers. Interestingly, CSR10 shows XRv3 as not ready, but XRv3 shows CSR10 as ready. I assume this is an artifact of the virtual environment.
R10#show mpls ldp nsr
LDP Non-Stop Routing is enabled
LDP Non-Stop Routing Sessions:
VRF default:
    Peer LDP Ident: 222.0.0.1:0     State: Not Ready
    Peer LDP Ident: 222.0.0.12:0    State: Not Ready
    Peer LDP Ident: 222.0.0.13:0    State: Not Ready

RP/0/0/CPU0:XRv3#show mpls ldp nsr summary
Sessions:
Total: 2, NSR-eligible: 2, Sync-ed: 0 (2 Ready)

Like the other NSR implementations, we can query the routers for their NSR exchanges with their peers. Although NSR is not a negotiated capability like the GR/NSF family of features, statistics are still tracked per peer. R10#show mpls ldp nsr statistics neighbor 222.0.0.13 Peer: 222.0.0.13:0 In label Request Records created: 0, freed: 0 In label Withdraw Records created: 0, freed: 0 Local Address Withdraw Set: 0, Cleared: 0 Transmit contexts enqueued: 0, dequeued: 0 RP/0/0/CPU0:XRv3#show mpls ldp nsr statistics neighbor 222.0.0.10 Peer: 222.0.0.10:0 State: Ready for TCP sync Messages: 0 Init Sync: Start: MON 6 03:56:14 (18:01:31 ago) End: MON 6 03:56:14 (18:01:31 ago) Protocol: 0/0 sent/rcvd Capabilities, 0 peer addr, 0 peer label 0 rx-buf bytes, 0 app-data bytes Steady State Sync: Capabilities: 0 sent, 0 rcvd Send-Ack:0 In Lbl-WD, 0 In Lbl-Req Stdby Adj: 0 join, 0 leave

14.1.6 RSVP-TE GR
RSVP-TE supports GR but not NSR. RSVP GR applies only to TE applications, not ordinary IP, and is used to maintain state information when a node in the LSP performs an RP switchover. Below, I configure RSVP GR on XRv3 and CSR10. I also enable it on CSR1 and XRv2 so that all interfaces within the OSPFv2 aggregation IGP are using this feature, but those configurations are not shown for brevity. I make minor adjustments to the hello interval and the “miss” count; these specify how frequently RSVP hello messages are sent and how many may be missed before the neighbor is declared down. Additionally, I enable RSVP GR under all interfaces also enabled for OSPFv2.
! XRv3
rsvp
 interface GigabitEthernet0/0/0/0.503
  signalling hello graceful-restart interface-based
 interface GigabitEthernet0/0/0/0.513
  signalling hello graceful-restart interface-based
 signalling graceful-restart
 signalling hello graceful-restart refresh misses 5
 signalling hello graceful-restart refresh interval 6000
! CSR10
interface GigabitEthernet2.502
 ip rsvp signalling hello graceful-restart
interface GigabitEthernet2.503
 ip rsvp signalling hello graceful-restart
interface GigabitEthernet2.510
 ip rsvp signalling hello graceful-restart
ip rsvp signalling hello graceful-restart mode full
ip rsvp signalling hello graceful-restart refresh interval 6000
ip rsvp signalling hello graceful-restart refresh misses 5

Just like TE-FRR, nothing happens until a TE tunnel is configured. RSVP is like BFD in a way where it does not have a discovery mechanism; RSVP does have hellos, but they are used for specific applications. RP/0/0/CPU0:XRv2#show rsvp graceful-restart neighbors [no output]

At a minimum, we can at least ensure the feature was configured properly before configuring any TE tunnels. XRv2 and CSR1 both indicate that RSVP GR is enabled with the correct timers. RP/0/0/CPU0:XRv2#show rsvp graceful-restart Graceful restart: enabled Number of global neighbors: 0 Restart time: 120 seconds Recovery time: 120 seconds Recovery timer: Not running Hello interval: 6000 milliseconds Maximum Hello miss-count: 5 Pending states: 0 R1#show ip rsvp hello


Hello: RSVP Hello for Fast-Reroute/Reroute: Disabled Statistics: Disabled BFD for Fast-Reroute/Reroute: Disabled RSVP Hello for Graceful Restart: Enabled (full mode) R1#show ip rsvp hello graceful-restart Graceful Restart: Enabled (full mode) Refresh interval: 6000 msecs Refresh misses: 5 DSCP: 0x30 Advertised restart time: 5 msecs Advertised recovery time: 0 msecs Maximum wait for recovery: 3600000 msecs

Next, I configure a TE tunnel on XRv3 to route traffic to CSR1 over any link except the one directly connected to CSR1. This means that CSR10 will always be in the transit path. After configuring the tunnel, we verify that it forms correctly. The tunnel comes up and routes via CSR10 to CSR1 as expected. There is no need to route any traffic into the tunnel for this test.
! XRv3
explicit-path name EP_AVOID_DIRECT
 index 10 exclude-address ipv4 unicast 222.1.13.1
 index 20 exclude-address ipv4 unicast 222.1.13.13
interface tunnel-te100
 ipv4 unnumbered Loopback0
 destination 222.0.0.1
 path-option 10 explicit name EP_AVOID_DIRECT

RP/0/0/CPU0:XRv3#show mpls traffic-eng tunnels 100 brief
TUNNEL NAME    DESTINATION    STATUS    STATE
tunnel-te100   222.0.0.1      up        up
Displayed 1 (of 1) heads, 0 (of 0) midpoints, 0 (of 0) tails
Displayed 1 up, 0 down, 0 recovering, 0 recovered heads

RP/0/0/CPU0:XRv3#show mpls traffic-eng tunnels 100 detail | begin Path Info Path Info: Outgoing: Explicit Route: Strict, 222.10.13.10 Strict, 222.1.10.1 Strict, 222.0.0.1 [snip]

We know that the RSVP RESV message travels from tail to head, which binds RSVP labels at each hop for the TE tunnel being signaled. RSVP GR hello requests are originated at each upstream router to establish GR connectivity, and a hello acknowledgement is expected in return from the downstream router. That is to say, the hello request travels in the same direction as the RSVP PATH message, and the acknowledgement returns in the opposite direction. Below, we see that CSR1 received two sets of hello request messages. These are traveling down the LSP (head to tail) and originate from CSR10. One set of hellos uses TTL=255 so that the failure of an intermediate node will not prevent the RSVP hello from arriving. This session is between node IDs, much like an LDP session would be. The normal, link-based RSVP session still has TTL=1 as expected. These hellos are different than normal RSVP hellos as they contain the RESTART capability, which is supported in XR (unlike regular RSVP hellos). We will examine these two different “neighbors” later. Also, note that the recovery time is 0 (green) and restart time is 5 msec (cyan).
R1#debug ip rsvp dump-messages
RSVP message dump debugging is on
Incoming Hello:
 version:1 flags:0000 cksum:C054 ttl:255 reserved:0 length:32
 HELLO type HELLO REQUEST length 12:
 Src_Instance: 0x658183D5, Dst_Instance: 0x909D1D62
 RESTART_CAP type 1 length 12:
 Restart_Time: 0x00000005, Recovery_Time: 0x00000000
Outgoing Hello:
 version:1 flags:0000 cksum:C053 ttl:255 reserved:0 length:32
 HELLO type HELLO ACK length 12:
 Src_Instance: 0x909D1D62, Dst_Instance: 0x658183D5
 RESTART_CAP type 1 length 12:
 Restart_Time: 0x00000005, Recovery_Time: 0x00000000
Incoming Hello:
 version:1 flags:0000 cksum:BE55 ttl:1 reserved:0 length:32
 HELLO type HELLO REQUEST length 12:
 Src_Instance: 0x658183D5, Dst_Instance: 0x909D1D62
 RESTART_CAP type 1 length 12:
 Restart_Time: 0x00000005, Recovery_Time: 0x00000000
Outgoing Hello:
 version:1 flags:0000 cksum:BE54 ttl:1 reserved:0 length:32
 HELLO type HELLO ACK length 12:
 Src_Instance: 0x909D1D62, Dst_Instance: 0x658183D5
 RESTART_CAP type 1 length 12:
 Restart_Time: 0x00000005, Recovery_Time: 0x00000000

XRv3 generates a syslog message when the RSVP neighbor session is formed to CSR10. Notice that CSR10’s node ID is specified, not the link ID. This is because RSVP GR only supports node failures, so it tracks the node address.
! XRv3
rsvp[1049]: %ROUTING-RSVP-6-RSVP_HELLO_STATE_UP : A graceful restart hello session to 222.0.0.10 has been established


Starting from the headend, we will verify the RSVP GR neighbors. I show a summary and detailed view at each router along the way. Of note, XRv3 always identifies CSR10 by its node address. Hellos are successfully being sent and received, which indicates a functional configuration. RP/0/0/CPU0:XRv3#show rsvp graceful-restart neighbors Neighbor App State Recovery Reason Since LostCnt --------------- ----- ------ -------- ------------ -------------------- -------222.0.0.10 MPLS UP DONE N/A 06/12/2045 22:39:35 0

RP/0/0/CPU0:XRv3#show rsvp hello instance detail Neighbor: 222.0.0.10 Source: 222.0.0.13 (MPLS) State: UP (for 00:05:48) Type: ACTIVE (sending requests) Last Hello received on GigabitEthernet0/0/0/0.503 Hello interval (msec) (used when ACTIVE) Configured: 6000 Src_instance 0x883a58, Dst_instance 0x658183d5 Counters: Communication with neighbor lost: Num of times: 0 Reasons: Missed acks: 0 New Src_Inst received: 0 New Dst_Inst received: 0 I/f went down: 0 Neighbor disabled Hello: 0 Msgs Received: 60 Sent: 60 Suppressed: 0

CSR10 is a midpoint along this LSP. It shows a single LSP passing through it from 222.0.0.13 to 222.0.0.1. The details of this LSP specify the upstream and downstream nodes, which are highlighted.
R10#show ip rsvp hello client lsp summary
Local        Remote      tun_id  lsp_id  subgrp_orig  subgrp_id  FLAGS
222.0.0.13   222.0.0.1   100     2       0.0.0.0      0          0xC038

R10#show ip rsvp hello client lsp detail Hello Client LSPs (all lsp tree) Tun Dest: 222.0.0.1 Tun ID: 100 Ext Tun ID: 222.0.0.13 Tun Sender: 222.0.0.13 LSP ID: 2 Lsp flags: 0xC038 Lsp GR UP Node nbr: 222.0.0.13 Intf nbr: 222.10.13.13 Lsp GR DN Node nbr: 222.0.0.1 Intf nbr: 222.1.10.1

CSR1 is the tail end of this TE tunnel. Notice that it has two neighbors. The first is the TTL=255 “multihop” session between the routers while the other uses the directly connected interfaces at TTL=1. This is consistent with the debug evaluated earlier. Although GR cannot protect against link failures, the multihop adjacency allows the router to continue communicating, which is the reason why a second session is made. The details reveal that the single-hop session is bound to a specific interface, while the multi-hop session is not.
R1#show ip rsvp hello client nbr summary
Local        Remote        Type  NBR_STATE  HI_STATE  LSPs
222.0.0.1    222.0.0.10    GR    Normal     Up        1
222.1.10.1   222.1.10.10   GR    Normal     Up        1

R1#show ip rsvp hello client nbr detail
Hello Client Neighbors
Remote addr 222.0.0.10, Local addr 222.0.0.1
 Nbr State: Normal Type: Graceful Restart
 Nbr Hello State: Up
 LSPs protecting: 1
 I/F: Any
Remote addr 222.1.10.10, Local addr 222.1.10.1
 Nbr State: Normal Type: Graceful Restart
 Nbr Hello State: Up
 LSPs protecting: 1
 I/F: Gi2.510

To simulate a failure of CSR10, I shut down its physical interface, effectively removing it from the network entirely. Though this is not an appropriate way to simulate an RP switchover, it allows us to see how XRv3 will react. After 5 missed hellos, the neighbor is torn down and a syslog is generated. XRv3 will track the number of times the neighbor fails, along with other timestamp-related statistics. ! XRv3 rsvp[1049]: %ROUTING-RSVP-6-RSVP_HELLO_STATE_DOWN : Graceful restart hello session to 222.0.0.10 has changed to the down state RP/0/0/CPU0:XRv3#show rsvp hello instance detail Neighbor: 222.0.0.10 Source: 222.0.0.13 (MPLS) State: DOWN (for 00:00:13) Type: ACTIVE (sending requests) Last Hello received on GigabitEthernet0/0/0/0.503 Hello interval (msec) (used when ACTIVE) Configured: 6000 Src_instance 0x883a58, Dst_instance 0x0 Counters: Communication with neighbor lost: Num of times: 1 (last time at 00:00:13 ago for reason HELLO MISSED) Reasons: Missed acks: 1 New Src_Inst received: 0


New Dst_Inst received: 0 I/f went down: 0 Neighbor disabled Hello: 0 Msgs Received: 360 Sent: 368 Suppressed: 0
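As a rough sanity check on how long XRv3 takes to declare CSR10 down, the detection time can be estimated from the two values configured earlier. This small Python sketch assumes the neighbor is torn down after exactly `misses` consecutive unanswered hellos sent every `interval_ms` milliseconds; the function name `hello_detection_ms` is hypothetical, and real implementations may differ slightly in how the final interval is counted.

```python
def hello_detection_ms(interval_ms: int, misses: int) -> int:
    """Approximate time to declare an RSVP GR neighbor down, in milliseconds."""
    if interval_ms <= 0 or misses <= 0:
        raise ValueError("interval and miss count must be positive")
    return interval_ms * misses


if __name__ == "__main__":
    # Values configured in this lab: 6000 ms refresh interval, 5 misses.
    print(hello_detection_ms(6000, 5))
```

With the lab values of 6000 msec and 5 misses, the estimate is roughly 30 seconds between the failure and the HELLO_STATE_DOWN syslog.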

Before continuing, I restore CSR10 and ensure that it forms its RSVP neighbors again. A total of 4 neighbors are shown: 2 single-hop sessions (yellow) and 2 multi-hop sessions (green), one of each for CSR1 and XRv3. Despite GR being configured, all routers immediately purged all TE LSP information when CSR10 failed.
R10#show ip rsvp hello client nbr
Local          Remote         Type  NBR_STATE  HI_STATE  LSPs
222.0.0.10     222.0.0.1      GR    Normal     Up        1
222.0.0.10     222.0.0.13     GR    Normal     Up        1
222.10.13.10   222.10.13.13   GR    Normal     Up        1
222.1.10.10    222.1.10.1     GR    Normal     Up        1

The restart time and recovery time observed earlier should change this behavior. A recovery time of 0 indicates that a router has no ability to preserve its forwarding information. This causes RSVP peers adjacent to a failed node to immediately delete all LSP information through that peer. A number greater than zero causes routers on either side of the failure to continue sending RSVP PATH messages (if downstream from the failed node) or RSVP RESV messages (if upstream from the failed node) so that the failure is masked. These soft-state refreshes are imperative to normal RSVP operation. The restart timer specifies how long a peer should wait for the session to establish, and the recovery time specifies how long it should take to synchronize the label forwarding constructs. XE only lets you adjust the recovery time, but XR allows you to adjust both. I use 90 seconds for both timers. XE has a hidden command for the restart time, but it doesn’t even stick in the configuration. It is included below for reference.
! CSR1 and CSR10
ip rsvp signalling hello graceful-restart send recovery-time 90000
ip rsvp signalling hello graceful-restart send restart-time 90000 ! HIDDEN
! XRv2 and XRv3
rsvp
 signalling graceful-restart restart-time 90
 signalling graceful-restart recovery-time 90

Using debug on CSR1, we can verify that the timers have changed. 0x15F90 in decimal is 90,000 which represents the time in msec. R1#debug ip rsvp dump-messages hello RSVP message dump debugging is on


Incoming Hello: version:1 flags:0000 cksum:0137 ttl:255 reserved:0 length:32 HELLO type HELLO REQUEST length 12: Src_Instance: 0x658183D5, Dst_Instance: 0x909D1D62 RESTART_CAP type 1 length 12: Restart_Time: 0x00015F90, Recovery_Time: 0x00015F90 Outgoing Hello: version:1 flags:0000 cksum:0136 ttl:255 reserved:0 length:32 HELLO type HELLO ACK length 12: Src_Instance: 0x909D1D62, Dst_Instance: 0x658183D5 RESTART_CAP type 1 length 12: Restart_Time: 0x00015F90, Recovery_Time: 0x00015F90 Incoming Hello: version:1 flags:0000 cksum:FF37 ttl:1 reserved:0 length:32 HELLO type HELLO REQUEST length 12: Src_Instance: 0x658183D5, Dst_Instance: 0x909D1D62 RESTART_CAP type 1 length 12: Restart_Time: 0x00015F90, Recovery_Time: 0x00015F90 Outgoing Hello: version:1 flags:0000 cksum:FF36 ttl:1 reserved:0 length:32 HELLO type HELLO ACK length 12: Src_Instance: 0x909D1D62, Dst_Instance: 0x658183D5 RESTART_CAP type 1 length 12: Restart_Time: 0x00015F90, Recovery_Time: 0x00015F90
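The hexadecimal timers in the RESTART_CAP object are easy to misread, so a short Python sketch for decoding them may help. It assumes the debug line format shown above; the function name `decode_restart_cap` and the regular expression are illustrative, not part of any Cisco tooling.

```python
import re

# Hypothetical pattern matching the RESTART_CAP timer line in the debug output.
RESTART_CAP = re.compile(
    r"Restart_Time: 0x([0-9A-Fa-f]+), Recovery_Time: 0x([0-9A-Fa-f]+)"
)


def decode_restart_cap(line):
    """Return (restart_ms, recovery_ms) from one RESTART_CAP debug line."""
    match = RESTART_CAP.search(line)
    if match is None:
        raise ValueError("no RESTART_CAP timers found")
    restart_hex, recovery_hex = match.groups()
    return int(restart_hex, 16), int(recovery_hex, 16)


if __name__ == "__main__":
    before = "Restart_Time: 0x00000005, Recovery_Time: 0x00000000"
    after = "Restart_Time: 0x00015F90, Recovery_Time: 0x00015F90"
    print(decode_restart_cap(before))
    print(decode_restart_cap(after))
```

The first line reflects the original advertisement (5 msec restart, 0 msec recovery) and the second reflects the adjusted timers, which decode to 90,000 msec each.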

We can also use show commands. CSR10 and XRv3 both indicate the proper values. Ideally, this should give CSR10 plenty of time to recover from another failure. R10#show ip rsvp hello graceful-restart Graceful Restart: Enabled (full mode) Refresh interval: 6000 msecs Refresh misses: 5 DSCP: 0x30 Advertised restart time: 90000 msecs Advertised recovery time: 90000 msecs Maximum wait for recovery: 3600000 msecs RP/0/0/CPU0:XRv3#show rsvp graceful-restart Graceful restart: enabled Number of global neighbors: 1 Local MPLS router id: 222.0.0.13 Restart time: 90 seconds Recovery time: 90 seconds Recovery timer: Not running Hello interval: 6000 milliseconds Maximum Hello miss-count: 5 Pending states: 0


Since shutting down interfaces is not a valid test of an RP switchover, this lab cannot be tested further. The test results on CSR10 appear to be the same even with these new timer configurations. I was not able to see anything noticeably different; however, I was able to capture some new output on CSR1. We can see that when CSR10 fails, CSR1 does change the state to “Restarting”, which implies that, for a short time, CSR1 believed CSR10 was performing SSO. This did not last the entire 90 seconds because when IGP reachability was lost, the peer failed entirely (even though IGP converged faster than 90 seconds). I theorize that a real RP failure would have caused this behavior to reflect the timer adjustments configured earlier.
R1#show ip rsvp hello client nbr
Local        Remote        Type  NBR_STATE   HI_STATE  LSPs
222.0.0.1    222.0.0.10    GR    Restarting  Lost      1
222.1.10.1   222.1.10.10   GR    Restarting  Lost      1

14.1.7 EIGRP NSF
Like RSVP, EIGRP supports GR/NSF but not NSR. The concept is identical to all other GR/NSF features discussed so far. A router undergoing a switchover requires peers to “help” it synchronize its standby RP during an SSO event. In the case of EIGRP, it also assists with stuck-in-active (SIA) events where an “active” route has unanswered queries from at least one peer regarding its reachability status.
When a restart operation begins on an EIGRP router, the restart bit (RS-bit) is set in the hello packets towards all peers. Peers that receive and understand this message immediately send their EIGRP topology tables to the restarting router. When complete, an end-of-table (EoT) message is sent to signal the end of the topology transfer. Once the switching router is finished converging, it uses the EoT message towards the “helpers” to signify that convergence is complete.
Interestingly, NSF is enabled by default on XR. No other NSF features are enabled by default for any other protocol on XR or XE. We can confirm this with some show commands.
RP/0/0/CPU0:XRv3#show protocols ipv4 eigrp vrf EIGRP | include NSF
EIGRP NSF: enabled
NSF-aware route hold timer is 480s
NSF signal timer is 20s
NSF converge timer is 300s
RP/0/0/CPU0:XRv3#show protocols ipv6 eigrp vrf EIGRP | include NSF
EIGRP NSF: enabled
NSF-aware route hold timer is 480s
NSF signal timer is 20s
NSF converge timer is 300s

Because we cannot “enable” NSF on XRv3, I will disable it from the IPv6 AFI. The word “disable” is the only option in the parser. After issuing this command, XRv3 issues a log message about degraded ISSU capabilities. We also verify the NSF has been fully disabled for IPv6 as a demonstration.


! XRv3 router eigrp VPN vrf EIGRP address-family ipv6 nsf disable ! XRv3 eigrp[1003]: %ROUTING-EIGRP-4-ISSU_SUPPORT : EIGRP-VPN: EIGRP-v6 213: NSF disabled. ISSU requires NSF to be enabled. RP/0/0/CPU0:XRv3#show protocols ipv6 eigrp vrf EIGRP | include NSF EIGRP NSF: disabled NSF-aware route hold timer is 480s NSF signal timer is 20s NSF converge timer is 300s

To mirror XRv3’s capabilities, CSR2 enables NSF for IPv4 only. We can verify this by checking the IPv4/v6 protocol details. ! CSR2 router eigrp VPN address-family ipv4 unicast autonomous-system 213 nsf R2#show ip protocols | include eigrp|NSF *** IP Routing is NSF aware *** Routing Protocol is "eigrp 213" NSF-aware route hold timer is 240 EIGRP NSF enabled NSF signal timer is 20s NSF converge timer is 120s R2#show ipv6 protocols | include eigrp|NSF IPv6 Routing Protocol is "eigrp 213" NSF-aware route hold timer is 240 EIGRP NSF disabled NSF signal timer is 20s NSF converge timer is 120s

As a quick test, I enable NSF debugging on CSR2 and clear the EIGRP neighbor table on XRv3 (not shown). CSR2 cannot execute NSF because the neighbor was “hard cleared”, which means that the administrator was intentionally trying to reset the control plane. At a minimum, this shows that EIGRP NSF is operational as the feature “tried to” work. We can see that EoT was received from XRv3, which indicates convergence is complete; this is carried as a special flag inside of EIGRP packets.
R2#debug eigrp nsf
EIGRP NSF debugging is on


EIGRP: NSF: AS213: Checking if Graceful Restart is possible with neighbor 10.2.13.13, peer_down reason 'Interface PEER-TERMINATION received'
EIGRP: NSF: Not possible: 'peer_down was called with a HARD resync flag'
EIGRP: NSF: Enqueuing NULL update to 10.2.13.13, flags 0x1:(INIT)
EIGRP: NSF: AS213, Receive EOT from 10.2.13.13, Flags 0x8:(EOT)

Performing the same test in the opposite direction, XRv3 indicates similar results. NSF cannot be used, but the software tries to apply the feature. This is likely the result of having a single RP with SSO unsupported. The last line of this debug probably means that an EoT was received from CSR2 as this indicates convergence from an EIGRP neighbor. RP/0/0/CPU0:XRv3#debug eigrp nsf eigrp[1003]: IP-EIGRP(VPN-213): NSF: RIB Converged notification received for VRF EIGRP-v4 eigrp[1003]: IPv4-EIGRP(VPN-213): NSF: Self NOT in NSF restart mode. Received a non-RS INIT Update from peer 10.2.13.2 eigrp[1003]: IPv4-EIGRP(VPN-213): NSF: Received initial updates for the peer. Will use NSF startup mode for thethe peer now on eigrp[1003]: IPv4-EIGRP(VPN-213): NSF: Self NOT in NSF restart mode. Received a non-RS non-INIT Update from peer 10.2.13.2 eigrp[1003]: IPv4-EIGRP(VPN-213): NSF: Self NOT in NSF restart mode. Received a non-RS non-INIT Update from peer 10.2.13.2 eigrp[1003]: IP-EIGRP(VPN-213): NSF: RIB Converged notification received for VRF EIGRP-v4

Last, we will examine the EIGRP NSF timers briefly. On XE, the configuration uses both GR and NSF syntax, which is confusing because they effectively describe the same feature. In the outputs above, we saw the route hold timer, signal timer, and converge timer. The route hold timer is configured as the GR purge-time and determines how long EIGRP will keep routes for an inactive peer if EoT is not received. The signal timer indicates how long to wait after sending packets with the RS-bit set, which was described earlier. The converge timer indicates how long to wait for the EoT message from all neighbors, which is similar to the BGP EoR marker. These minor details should not be adjusted unless there is a compelling reason. I adjust these to new values for variety, then confirm them with show commands on both CSR2 and XRv3.
! CSR2
router eigrp VPN
 address-family ipv4 unicast autonomous-system 213
  timers nsf signal 25
  timers nsf converge 100
  timers graceful-restart purge-time 150
! XRv3
router eigrp VPN
 vrf EIGRP
  address-family ipv4
   timers nsf route-hold 150
   timers nsf signal 25
   timers nsf converge 100
R2#show ip protocols | include eigrp|NSF
*** IP Routing is NSF aware ***
Routing Protocol is "eigrp 213"
NSF-aware route hold timer is 150
EIGRP NSF enabled
NSF signal timer is 25s
NSF converge timer is 100s
RP/0/0/CPU0:XRv3#show protocols eigrp vrf EIGRP | include NSF
EIGRP NSF: enabled
NSF-aware route hold timer is 150s
NSF signal timer is 25s
NSF converge timer is 100s
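The manual comparison of the two show outputs can also be scripted. The following is a minimal Python sketch, not part of the original lab: it assumes the output lines look exactly like the ones captured above, and the regular expression and the function name `parse_nsf_timers` are illustrative choices only.

```python
import re

# Show output copied from the verification above (CSR2, then XRv3).
XE_OUTPUT = """\
*** IP Routing is NSF aware ***
Routing Protocol is "eigrp 213"
NSF-aware route hold timer is 150
EIGRP NSF enabled
NSF signal timer is 25s
NSF converge timer is 100s
"""

XR_OUTPUT = """\
EIGRP NSF: enabled
NSF-aware route hold timer is 150s
NSF signal timer is 25s
NSF converge timer is 100s
"""

# Matches lines such as "NSF signal timer is 25s". The trailing "s" is optional
# because XE prints the route hold timer without a unit.
TIMER_RE = re.compile(r"NSF(?:-aware)? (route hold|signal|converge) timer is (\d+)s?")


def parse_nsf_timers(show_output: str) -> dict:
    """Return {timer name: seconds} for every NSF timer line found."""
    return {name: int(value) for name, value in TIMER_RE.findall(show_output)}


xe_timers = parse_nsf_timers(XE_OUTPUT)
xr_timers = parse_nsf_timers(XR_OUTPUT)
print(xe_timers)
print("timers match:", xe_timers == xr_timers)
```

Comparing parsed dictionaries rather than raw text sidesteps the small formatting differences between XE and XR, such as the missing unit on the XE route hold timer.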

15. Describe Layer 1 failure detection
This section will be brief and covers only two features: carrier delay and link debounce. The network diagram is shown below, which uses two Cisco ME-3400-24TS-A switches with 3 links between them. Two are 100 Mbps copper NNIs and one is a 100BaseFX 100 Mbps fiber NNI. There is no link bundling.

Carrier delay is a timer that runs in software to identify when a link goes down at layer 1. This is normally set to 2 seconds to protect against very short flaps, which may introduce churn. This slows convergence significantly, so we can adjust it to a smaller value. We can specify a value in seconds or ms, and we reduce this to 0 on ME1. The interface will go down as soon as the cable is unplugged on ME1, but on ME2, this will occur 2 seconds later. On most IOS platforms, 2 seconds is the default carrier delay (the default is not revealed in the show command).
! ME1
interface FastEthernet0/15
 carrier-delay msec 0
ME1#show interfaces fastEthernet 0/15 | include Carrier
Carrier delay is 0 msec

Before continuing, we configure basic NTP so we can use the logging timestamps to approximate the carrier delay changes. We verify the clocks are synchronized; see the NTP section for more details.
! ME1
ntp master
! ME2
ntp server 10.0.10.91
ME1#show ntp status | include sync
Clock is synchronized, stratum 8, reference is 127.127.1.1
ME2#show ntp status | include sync
Clock is synchronized, stratum 9, reference is 10.0.10.91

When the cable is unplugged, the link-light immediately turns off on both switches. However, we expect a ~2 second delay between the log messages being issued on the two switches. Since this is a software timer, the LEDs may turn off, but IOS doesn’t notify any other process of the outage until the carrier delay expires. The timing of the syslog generation is imperfect, which is why we only see a ~1.4 second delay in this case.
! ME1
03:45:32.510: %LINK-3-UPDOWN: Interface FastEthernet0/15, changed state to down
03:45:33.500: %LINEPROTO-5-UPDOWN: Line protocol on Interface FastEthernet0/15, changed state to down
! ME2
03:45:33.894: %LINEPROTO-5-UPDOWN: Line protocol on Interface FastEthernet0/15, changed state to down
03:45:34.892: %LINK-3-UPDOWN: Interface FastEthernet0/15, changed state to down

The carrier delay is symmetric by default: the same timer also controls how long to wait before declaring a link-up event. Plugging the cable back in proves this, and this time the gap is much closer to 2 seconds than the link-down gap, which is somewhat coincidental.
! ME1
03:47:37.542: %LINK-3-UPDOWN: Interface FastEthernet0/15, changed state to up
03:47:38.549: %LINEPROTO-5-UPDOWN: Line protocol on Interface FastEthernet0/15, changed state to up
! ME2
03:47:39.524: %LINK-3-UPDOWN: Interface FastEthernet0/15, changed state to up
03:47:40.531: %LINEPROTO-5-UPDOWN: Line protocol on Interface FastEthernet0/15, changed state to up
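The gaps quoted above come from subtracting the first timestamp logged on ME1 from the first timestamp logged on ME2. A small Python sketch of that arithmetic follows; the helper name `delta_seconds` is illustrative, and the calculation assumes both timestamps fall on the same day and that the NTP-synchronized clocks are accurate enough for a rough comparison.

```python
from datetime import datetime

FMT = "%H:%M:%S.%f"  # syslog timestamps above carry millisecond precision


def delta_seconds(first: str, second: str) -> float:
    """Seconds elapsed between two same-day syslog timestamps."""
    t1 = datetime.strptime(first, FMT)
    t2 = datetime.strptime(second, FMT)
    return (t2 - t1).total_seconds()


# First log message from each switch for FastEthernet0/15, copied from above.
down_me1, down_me2 = "03:45:32.510", "03:45:33.894"
up_me1, up_me2 = "03:47:37.542", "03:47:39.524"

down_gap = delta_seconds(down_me1, down_me2)
up_gap = delta_seconds(up_me1, up_me2)
print(f"link-down gap: {down_gap:.3f} s")  # 1.384 s
print(f"link-up gap:   {up_gap:.3f} s")    # 1.982 s
```

Both results sit below the 2 second default, which is consistent with the observation that syslog generation timing is only an approximation of the carrier delay itself.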

We quickly compare this to port 16, which has the default carrier-delay on both switches. The down event is logged at nearly the same time on both switches.
! ME1
04:25:28.414: %LINEPROTO-5-UPDOWN: Line protocol on Interface FastEthernet0/16, changed state to down
04:25:29.420: %LINK-3-UPDOWN: Interface FastEthernet0/16, changed state to down
! ME2
04:25:28.389: %LINEPROTO-5-UPDOWN: Line protocol on Interface FastEthernet0/16, changed state to down
04:25:29.395: %LINK-3-UPDOWN: Interface FastEthernet0/16, changed state to down

Plugging port 16 back in, the up event also happens at the same time on both switches. This proves that the carrier delay is actually evaluated on the ME3400 platform, and since it is running in software, it should be platform-agnostic.
! ME1
04:25:50.929: %LINK-3-UPDOWN: Interface FastEthernet0/16, changed state to up
04:25:51.935: %LINEPROTO-5-UPDOWN: Line protocol on Interface FastEthernet0/16, changed state to up
! ME2
04:25:50.946: %LINK-3-UPDOWN: Interface FastEthernet0/16, changed state to up
04:25:51.952: %LINEPROTO-5-UPDOWN: Line protocol on Interface FastEthernet0/16, changed state to up

Rushing into a link-up event can be dangerous, so on some platforms we can also configure asymmetric carrier delays, with separate values for up and down events. The ME3400 is not one of them, which means adjusting carrier delay applies in both up and down directions at all times. For that reason, I would recommend a non-zero value even when tuning for faster convergence, since a delay of zero may lead to short-term blackholes on link-up events. An example of an “up” delay is shown below.
! Example (not configured on ME3400)
interface FastEthernet0/15
 carrier-delay up msec 500

Link debounce is not supported on the ME3400. Where it is supported, it applies to link-down events only. It runs in firmware, not software, so it is a function of the interface driver. The default is 300 ms on copper and 10 ms on fiber interfaces, and Cisco recommends keeping the default. The timer is measured in ms.
! Example (not configured on ME3400)
interface FastEthernet0/15
 link debounce time 500

Additional Reading – Reference configurations “layer1-fail”

16. Describe BGPsec
This feature is discussed/tested in detail in the security section. In summary, the feature encompasses AS origin validation by way of Resource Public Key Infrastructure (RPKI). The word “resource” relates to IP resources, like prefixes, where PKI relates to existing X.509 PKI certificate infrastructure. AS origin validation seeks to ensure that only prefixes registered in a certain AS can be originated from that AS, and is most useful to prevent accidental advertisements. Attackers can easily craft BGP prefixes with AS-paths to overcome this, and BGPsec aims to stop it by having a router in each AS (only makes sense with eBGP) sign the prefix to increase its authenticity. A Route Origin Authorization (ROA) is what binds an address space with an AS number, granting permission to that ASN to advertise prefixes in accordance with the ROA. Multiple ROAs can exist, and they may overlap; routers will consider the union of all valid ROAs when determining prefix validity. Both certificates and ROAs are published in publicly accessible repositories for general use. The time-range validity of these constructs is always relevant; if the current time does not fit within the lifetime of an ROA or certificate, that object cannot be used, much like a time-based key chain.
17. Describe backscatter traceback
This feature is discussed/tested in detail in the security section. In summary, the feature combines ACLs, remotely triggered blackholes (sometimes), sinkholes, and ICMP unreachables to determine the ASBR ingress routers from which an attack has entered the network. The backscatter is the forwarding of ICMP unreachables back into the core network and ultimately into a sinkhole for analysis. This is primarily employed to detect/defeat IP spoofing attacks against the SP network.
18. Describe lawful-intercept
This feature is discussed/tested in detail in the security section. In summary, this allows a mediation device to communicate with a router over SNMPv3 to set up and tear down wiretap sessions on a network device. This would be done in accordance with government regulations, and the mediation device would forward copies of “interesting” traffic back to law enforcement agencies (LEAs) that require it. Mediation devices are provided by third-party vendors and do most of the processing, but routers need to be enabled for SNMPv3 and must have the LI feature supported as required by law.
19. Describe BGP Flowspec
This feature is discussed/tested in detail in the security section. In summary, the feature is meant to add additional granularity to the remotely triggered black-hole (RTBH) feature. RTBH is a “big hammer” in that it completely blackholes traffic to/from certain prefixes during a DDoS attack. BGP flowspec is like a combination of RTBH and QPPB; it can signal specific QoS actions, redirect traffic into a VRF for analysis, or drop the traffic. XE only supports it as a route-reflector (full support is introduced in version 3.15S), and XRv can only create flowspec rules; it cannot be tested in the data plane due to lacking hardware linecards.
20. Describe DDoS mitigation techniques
This feature is discussed/tested in detail in the security section. In summary, common tools include ACLs, remotely triggered black holes, unicast RPF, backscatter traceback, and BGP Flowspec.
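Returning briefly to the origin validation summarized under BGPsec in section 16, the idea that a router evaluates a route against the union of all ROAs can be sketched in Python. This is a simplified illustration and not router code: the `Roa` tuple, the function name `validate_origin`, the three result strings, and the documentation prefixes and private-use AS numbers are all assumptions chosen for the example, and certificate and lifetime checks are left out entirely.

```python
import ipaddress
from typing import NamedTuple


class Roa(NamedTuple):
    prefix: str       # address block the ROA covers
    max_length: int   # longest prefix length the AS may originate
    asn: int          # AS authorized to originate the block


def validate_origin(prefix: str, origin_asn: int, roas: list) -> str:
    """Classify a route as 'valid', 'invalid', or 'not-found'."""
    route = ipaddress.ip_network(prefix)
    covered = False
    for roa in roas:
        block = ipaddress.ip_network(roa.prefix)
        # Skip ROAs for the other address family or for unrelated space.
        if route.version != block.version or not route.subnet_of(block):
            continue
        covered = True
        # Any single matching ROA is enough to make the route valid.
        if roa.asn == origin_asn and route.prefixlen <= roa.max_length:
            return "valid"
    # Covered by at least one ROA but matched by none means invalid.
    return "invalid" if covered else "not-found"


# Hypothetical, overlapping ROAs for the same documentation block.
roas = [
    Roa("198.51.100.0/24", 24, 64500),
    Roa("198.51.100.0/24", 25, 64501),
]

print(validate_origin("198.51.100.0/24", 64500, roas))    # valid
print(validate_origin("198.51.100.128/25", 64500, roas))  # invalid
print(validate_origin("203.0.113.0/24", 64500, roas))     # not-found
```

The second call shows why overlapping ROAs matter: AS 64500 is authorized for the /24 only, so a more specific /25 from that AS is rejected even though another ROA permits /25 routes from a different AS.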

21. Describe network event and fault management
This chapter encompasses a set of functions that detect, isolate, and correct malfunctions in the network. This includes accounting for environmental changes, examining logs and notifications, tracing/correcting faults, and running diagnostic tests. When network devices fail or suffer errors, an event-driven mechanism should be configured. For example, the device could send a syslog message to a remote syslog server or issue SNMP traps/informs to an SNMP server. This notifies the network staff of the fault and generally provides details regarding the fault. Once the condition that triggers the fault is resolved, the alarm state is cleared. Fault management systems may have internal mechanisms to map in-band severities (such as syslog priority 0, or emergencies) to another prioritization scheme based on network administrator weights. With respect to ITIL Incident Management, a fault would qualify as an incident, and is subject to the ITIL processes described in that section.
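The mapping from in-band syslog severities to a locally weighted priority can be illustrated with a short Python sketch. It relies on the standard Cisco message tag layout of %FACILITY-SEVERITY-MNEMONIC; the weight table, the function name `local_priority`, and the clamping rule are hypothetical and only stand in for whatever scheme a real fault management system applies.

```python
import re

# Standard syslog severity names, indexed by severity number (0 is most urgent).
SEVERITY_NAMES = [
    "emergencies", "alerts", "critical", "errors",
    "warnings", "notifications", "informational", "debugging",
]

# Hypothetical administrator weights: a positive weight promotes messages
# from that facility by the given number of levels.
FACILITY_WEIGHT = {"LINK": 1, "LINEPROTO": 0}

# Cisco syslog tags look like %FACILITY-SEVERITY-MNEMONIC.
MSG_RE = re.compile(r"%([A-Z0-9_]+)-([0-7])-([A-Z0-9_]+)")


def local_priority(message: str):
    """Return the locally weighted priority (0..7), or None if no tag is found."""
    match = MSG_RE.search(message)
    if match is None:
        return None
    facility, severity = match.group(1), int(match.group(2))
    adjusted = severity - FACILITY_WEIGHT.get(facility, 0)
    return max(0, min(7, adjusted))  # clamp to the valid severity range


msg = "%LINK-3-UPDOWN: Interface FastEthernet0/15, changed state to down"
print(SEVERITY_NAMES[3], "->", local_priority(msg))  # errors -> 2
```

Unknown facilities keep their original severity, so the weights only need to list the message sources the operations team cares most about.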

22. Describe performance management and capacity procedures
Capacity management: the process of determining the network resources required to prevent a performance or availability impact on business-critical applications.
Performance management: the practice of managing network service response time, consistency, and quality for individual and overall services.
Note: Performance problems are usually related to capacity; applications are slower due to congestion management and traffic conditioning (queuing and shaping).
Trending is the process of reviewing performance/capacity issues and reviewing the baselines for trends. This helps to understand future upgrade requirements.
Common areas for concern:
a. CPU. Used by both control and data planes on any network device. For performance and capacity management, all devices must have sufficient CPU to service these needs. Insufficient CPU on one device can impact the entire network (BGP slow-peer, slow SPF computation blackholes, etc).
b. Backplane or I/O. Refers to the total amount of traffic a device can handle. Insufficient backplane bandwidth results in dropped packets, leading to retransmissions and even more traffic on the network.
c. Memory and buffers. Used by both control and data planes on any network device. Memory can often be low during periods of routing convergence.
d. Interface and pipe sizes. Refers to the amount of data that can be sent on any one connection. Often incorrectly referred to as speed. Congestion on these links usually results in queuing of sorts, which can increase latency.


e. Queuing, latency, and jitter. The larger the queue, the longer data has to wait. Tail drop occurs when the queue is full and admittance is prohibited. This is acceptable for TCP, but not much else. When input queues fill, this can have a negative effect on CPU and memory as the device cannot process the incoming traffic fast enough.
f. Speed and distance. Data can only be forwarded up to the speed of light, which is roughly 186 miles per ms in a vacuum and about two-thirds of that in optical fiber; approximately 100 miles per ms is a workable planning figure. Even with high bandwidth, low latency international links, the distance for energy to travel can be significant. This can be a large factor for certain applications if they are not tuned for specific network characteristics.
g. Application characteristics. Small window sizes, application keepalives, and the amount of data sent over the network versus what is required are often the reasons for poor application performance.
Best practices
a. Service level management. A deliverable that seeks to identify what service can be delivered within budgetary and staffing constraints. It can be done via an SLA between users and network organizations for a service. The service would include reports and recommendations to maintain acceptable service quality. Users must be prepared to fund the upgrades. Alternatively, the network organization can define their capacity/performance management upgrades and attempt to fund upgrades on a case-by-case basis.
b. Network and application what-if analysis. Used to determine the outcome of a planned change. Significantly reduces risk and helps increase network availability. Identify risk levels for individual changes and perform more detailed analyses on those that are highest risk. Cisco recommends 5 risk levels:
1. High potential impact to many users and/or business critical applications with expected downtime. Mitigation: Validation (lab) of new solution. Extensive testing and what-if analysis showing impact. New solutions require completion of an operations support document. Perform design review, create back-out plan/change process.
2. High potential impact to many users and/or business critical applications with possible downtime. Mitigation: Perform what-if analysis and review functionality. Perform design review, create back-out plan/change process.
3. Medium potential impact to fewer users and/or business services with possible downtime. Mitigation: Perform engineering analysis, and create implementation plan/change process.
4. Lower potential service impact, such as adding new template network modules, new WAN sites, etc. Mitigation: Create implementation plan/change process.


5. No user/service impact, such as adding individual users, changing passwords/banners/SNMP configurations, etc. Mitigation: None, all processes are optional.
General comments:
- Modeling of network topologies is recommended for all significant changes, such as routing, etc.
- Perform an application what-if analysis before deploying any business application. If you don't, the application group will blame the network group for poor performance.
c. Baselining and trending. Allows network administrators to plan/complete network upgrades before a capacity problem arises. Capacity problems can be addressed in several ways: build enough capacity so there is never an issue, divide the trend information into groups and focus on the critical network areas, or rely on reporting mechanisms to detect issues and respond to them first. All methods still require periodic review to ensure capacity demands are met over time. Graphical/chart tools are valuable for visualizing this kind of information.
d. Exception management. This is a reactive methodology; the network team receives notification of capacity/performance threshold violations and responds immediately. Tools like RMON and SNMP traps/informs (described in detail later) can be used for the initial reporting. Reporting must be near real-time so that the problem does not vanish before it is seen. Netflow can also be used to measure traffic, though it is passive and would require a tool to interpret the data to raise alarms.
e. QoS management. After baselining the network, traffic can be classified for preferential treatment. A basic design could include LLQ for voice, along with bandwidth guarantees for business critical applications. Large file transfers that are not time sensitive can be given lower priority.
Collecting and reporting capacity information: This is linked to three areas of capacity management. Each of these should have an information collection plan.
a. What-if analyses. Need tools to mimic the network environment and understand the effects of changes.
b. Baselining and trending. Need "snapshots" of current devices to show current resource utilizations.
c. Exception management. Need this to alert network administrators about problems so resolution can occur.
5 steps for this process:
a. Determine your needs. Discover resources available and what is needed. Consider the people/expertise needed to use tools rather than just the tool/technology itself.
b. Define a process. Describe what happens when violations occur or what the baselining process looks like. This functionality can be outsourced to a network services organization.


c. Define capacity areas. Certain areas of the network share common capacity planning strategies: LAN, WAN, field offices, critical sites, etc. LAN bandwidth is always cheaper than WAN bandwidth, so utilization should be lower. Different sites may require different SNMP MIB monitoring. Upgrades may be more difficult or time consuming in some places than in others within the network.
d. Define the capacity variables. Depends on the devices/media in the network. General parameters such as CPU, memory, and link utilization are generally measured. Other things like queue depths and media-specific congestion can be valuable.
e. Interpret the data. The difference between peak and average is important, as peaks may be invisible depending on the polling interval. Select a polling interval that is a compromise; aggressive values cause excessive overhead while lethargic ones return less accurate information.
23. Describe maintenance and operational procedures
Maintenance procedures should be conducted in accordance with the hardware platform manual; there isn’t much more to say other than that. Sustaining operations for a service provider is critical, which is why many platforms have an in-service software upgrade (ISSU) feature. This allows the administrator to upgrade/downgrade software versions of network devices while packet forwarding continues. This relies on Cisco non-stop forwarding (NSF) and stateful switchover (SSO), which implies the platform has redundant RSPs/RPs. The standby RSP/RP can be loaded with the next version, which can be manually started by the administrator while the old one is halted. Once the new version is explicitly accepted, it is written to both RSP/RPs going forward. If the new version introduces a new command or new syntax for something not recognized in the previous version, “Config Sync” is a process designed to recognize these new commands and maintain compatibility. ISSU has some limitations regarding the software versions it can support. ISSU may not be supported across major version releases if the underlying architecture changes sufficiently such that the underlying line-card FIBs must be changed, even if both versions support ISSU. It is definitely not supported across code trains (T to S to M, for example) either. Cisco identifies code with one of three letters to denote its compatibility:
a. C: Compatible. All base-level system infrastructure is supported as well as optional HW-aware subsystem components. There will be minimal service impact when performing ISSU to a version labeled compatible.
b. B: Base-level compatible. ISSU will succeed but some sub-systems will not be able to maintain state during the transition. There may be some service interruption.
c. I: Incompatible. The core architecture is different to the point that ISSU is not supported since SSO cannot work. Incompatible updates may still be desirable, so rather than use ISSU, one can use Fast software upgrade (FSU), which is a suboption of ISSU. By using the keyword “forced” on some platforms, you can force ISSU to perform the upgrade, but it will be service impacting, so plan for that.


From a purely operational standpoint, service providers often use “runbooks” to describe their operations. A runbook details procedures to begin, stop, supervise, and debug the system. Effective runbooks allow operations staff to manage and troubleshoot a system, which may include automation. They are often exhaustive and include procedures for common scenarios with decision trees/flow charts to aid operators. A runbook begins with an index of processes that are often broken down in outline form to associate processes with the operations/technologies they support. Runbook automation (RBA) is the ability to define, build, orchestrate, manage, and report on workflows that support processes that support the network.
24. Describe the network inventory management process
Network inventory management is the process of keeping records of all network assets owned by an organization, to include assets not currently in service (on the shelf, in maintenance, etc). It enables network administrators to have a record of all network equipment within the organization. Network inventory management is generally performed through asset tracking software that scans, compiles, and records data about each device/node over a network. This obviously cannot touch network devices not currently online, but the software can be supplemented by manual inventorying done by humans. Network inventory management would include basic information about each platform. This would include their make/model, location, serial number, network addressing (for management), software version, and license keys (along with expiration dates). Capturing this data is important for businesses. The most obvious benefit is to maintain accountability of all equipment and prevent theft. It also helps with estimating the size of a network and supports planning for growth. Knowing the network diameter also helps realize costs to maintain and grow the network.
25. Describe network change, implementation, and rollback
25.1 Processes and best practices
Configuration management best practices:
1. Create standards. Helps reduce network complexity, unplanned downtime, and exposure to network impacting events.
a. Software version. Deploy consistent SW versions on similar network devices. This increases interoperability and reduces SW defects and the likelihood of complex issues.
b. IP addressing. A subnet allocation scheme ensures address blocks do not overlap and promotes contiguous address summarization. Identify certain hosts in the subnet for certain applications (i.e., always make the routers the first addresses, followed by switches, followed by hosts).
c. Naming conventions, DNS, and DHCP assignments. Simplifies the identification of devices and reduces the possibility of duplicate IP addresses. Adding DHCP ranges to DNS also helps with identifying where hosts are.


d. Baseline configurations and descriptors. Common configuration parameters, such as VTY settings, banners, NTP, SNMP, and AAA can be uniform on most network devices. Descriptors are used on interfaces to show the purpose and location of an interface.
e. Configuration upgrade procedures. Documenting these procedures ensures HW/SW upgrades go smoothly. The document should list all steps for the upgrade, reference vendor documentation, and provide post-upgrade validation steps for testing.
f. Solution templates. Define common solutions to common problems in modular form. Each template must be defined, tested, and documented to ensure uniformity when the templates are deployed. Templates may include HW modules, port/VLAN assignments, ACL/QoS configurations, out of band management (OOBM), and cable installations. These templates should not contain specifics like IP addresses or DNS/DHCP assignments.
2. Implement standards
3. Maintain documentation. Should be updated in near-real time. Cisco defines the following "critical success" factors:
a. Current device, link, and end-user inventory. Track devices, problem impact, and network change impacts. Should contain tables for similar devices, links, and user data.
b. Configuration version control system. Maintain set of current configurations and some number of previous versions for rollback, troubleshooting, or change audits. Maintain 3-5 previous versions of configuration per device.
c. TACACS configuration log. Any AAA server will do, but Cisco specifies TACACS. When used with NTP, it is easy to see who logged into a device, when it happened, and what was changed.
d. Network topology documentation. Should include both logical and physical topologies at different OSI layers for the entire network. Detail will range from routing diagrams to cabling/linecard/electrical level of detail. These are typically maintained in graphical form.
4. Validate and audit standards. Form a cross-functional team to measure configuration management quality. First objective is to implement change management (CM) to identify the issues.
a. Conduct integrity checks. Evaluate overall network configuration to include potential issues. This should catch inconsistent naming conventions, duplicate IP addresses, protocol mismatches, and other errors.
b. Device, protocol, and media audits. This would identify inconsistent HW/SW versions, incompatible modules, etc.

c. Standards and documentation reviews. Executed on a quarterly basis, all components of the standards (naming convention, IP subnetting plan, solution templates, etc) should be reviewed for inconsistencies and errors. New templates should be developed in accordance with new HW/SW upgrades as needed.
5. Review standards. The process restarts at "Create Standards".
25.2 NETCONF and YANG
NETCONF is an IETF standard which is short for Network Configuration Protocol (not to be confused with PPP’s Network Control Protocol (NCP) mechanism). It is a standard for installing, modifying, and deleting configurations on network devices. YANG is used to model the network configurations and state into tree structures; it is the data modeling language for NETCONF. Put another way, YANG is to NETCONF what MIBs are to SNMP, as YANG core models contain similar definitions for how data is structured. NETCONF supports basic operations such as <get>, <get-config>, and <edit-config>, which play roles comparable to SNMP’s “GET”, “GET-NEXT”, and “SET” and make it appear very similar to SNMP at a glance. The NETCONF / YANG model was intended to be vendor neutral to avoid dealing with specific CLI scripts or existing tools that were inflexible. Beyond what SNMP offers, NETCONF has extra features, such as being able to test configurations before committing them, being able to configure multiple network devices concurrently, and bulk-get operations that are much faster than SNMP. Performing network-wide transactions is made easier since NETCONF handles the error recovery and sequencing. That is to say, the operator doesn’t need to worry about the order of executing configuration commands within a given transaction. NETCONF runs over SSH, which is secure and, because SSH rides on TCP, means that the operator doesn’t need to resend lost requests as is necessary when using SNMP over UDP. NETCONF messages are encoded in XML, and individual YANG data models are included as “capabilities” in NETCONF hello packets. The device notifies the manager which capabilities it supports so the system knows the appropriate mechanisms to configure the device.
YANG data structures loosely resemble “classes” in object oriented programming or “structs” in non-OOP languages. The structure contains several “types”, which include integers, Booleans, enumerations, pointers, and other information types to be processed. There are some commonly used YANG types defined in RFC 6021 which are specific to networking (or at least, well suited for it). Some of these include pre-made objects for IPv4 and IPv6 addresses and prefixes, flow-labels, port-number, MAC-address, URI, domain-name, and others. This is very convenient for managing/storing configuration data within YANG. NETCONF simply manipulates this data accordingly; provided the data from the network devices fits the YANG standard, NETCONF can modify it. If a vendor simply took their CLI configuration and wrapped it in a YANG structure somehow, this would not be effective.
NETCONF / YANG is supported in both XE and XR. The diagram for this test is shown below. CSR2 will act as the NETCONF client (the manager) for these examples, and it shares a LAN with CSR1 and XRv4 (the NETCONF servers, meaning the managed devices).
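The comparison between YANG structures and classes can be made concrete with a rough Python analogy. The sketch below is not YANG and is not produced by any YANG tool; the class `InterfaceConfig`, its fields, the interface name, and the MTU bounds are hypothetical. The `ipaddress` type simply plays the role that the pre-made address and prefix types play in a YANG model.

```python
import ipaddress
from dataclasses import dataclass
from enum import Enum


class AdminState(Enum):
    """Analogous to a YANG enumeration."""
    UP = "up"
    DOWN = "down"


@dataclass
class InterfaceConfig:
    """Analogous to a YANG container holding typed leaves."""
    name: str                          # string leaf
    enabled: bool                      # boolean leaf
    mtu: int                           # integer leaf with a range restriction
    address: ipaddress.IPv4Interface   # typed IPv4 address-plus-prefix leaf
    admin_state: AdminState = AdminState.UP

    def __post_init__(self):
        # Coerce and validate typed leaves so malformed data is rejected
        # before it is ever stored, much as a data model constrains input.
        if not isinstance(self.address, ipaddress.IPv4Interface):
            self.address = ipaddress.IPv4Interface(self.address)
        if not 64 <= self.mtu <= 65535:  # hypothetical range for the example
            raise ValueError(f"mtu out of range: {self.mtu}")


intf = InterfaceConfig("GigabitEthernet1", True, 1500, "10.1.14.1/24")
print(intf.address.network)  # 10.1.14.0/24
```

The point of the analogy is that the structure, not free-form CLI text, defines what a valid value looks like, which is why simply wrapping CLI lines in a model would lose most of the benefit.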


Below is an XE configuration example with basic verification. Since NETCONF requires SSH, and specifically SSHv2, we must configure a domain name as well as generate crypto keys (not shown). We can optionally add a standard ACL to identify the NETCONF managers for extra security. The NETCONF maximum sessions can be between 4 and 16 and there must be one VTY line per session, so configuring a number greater than the number of available VTY sessions does not make sense (default is 4). The lock-time identifies the maximum amount of time the NETCONF configuration is locked (cannot be changed while locked); the default is 10 seconds. The max-message option limits the size of the NETCONF messages, where by default there is no limit.
! CSR1
ip ssh logging events
ip ssh version 2
access-list 10 permit 10.1.14.2
netconf max-sessions 5
netconf lock-time 30
netconf max-message 65535
netconf ssh acl 10

Although the documentation suggests that entering the exec command “netconf” enters the NETCONF conversation, it doesn’t appear to work on XE version 3.13S with the CSR1000v.
R2#ssh -l test 10.1.14.1 netconf
Password: test
Line has invalid autocommand "netconf"
[Connection to 10.1.14.1 closed by foreign host]

Even if we test it on CSR1 natively, the command doesn’t exist. However, if it did work, we could check the sessions using the last command below.
R1#netconf
% Bad IP address or host name address
R1#net?
% Unrecognized command
R1#show netconf session
Netconf Sessions: 0 open, maximum is 5

The XR configuration is similar except we specify the SSH port 830 (the IANA recommended port) and enable NETCONF for SSH. I add several other minor options for demonstration only to adjust the NETCONF session limits and data transfer policies. XR actually does work with NETCONF in this setup. Again, the assumption is that basic SSHv2 (domain name, keys, username, etc) has been configured. ! XRv4 ssh server netconf port 830 netconf agent tty session timeout 15 throttle memory 400 throttle process-rate 1500 netconf-yang agent ssh session limit 5 session idle-timeout 10 session absolute-timeout 15 rate-limit 65535

From CSR2, acting as the NETCONF client, we will log into XRv4 using SSHv2 over TCP port 830. The “netconf” option is an auto-command that gets run immediately upon login, which drops us into the NETCONF conversation. XRv4, as the NETCONF server, sends a “hello” message to CSR2 which contains its capabilities. We could respond with a hand-crafted “hello” message, but that is way beyond the scope of this test. We can see that if CSR2 were really a NETCONF manager, this would have worked.
R2#ssh -l cisco -p 830 10.1.14.14 netconf
IMPORTANT: READ CAREFULLY
[snip]
Password: cisco

urn:ietf:params:netconf:base:1.0
urn:ietf:params:netconf:capability:candidate:1.0
urn:ietf:params:netconf:capability:notification:1.0
285212672
]]>]]>
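For readers curious what a hand-crafted reply might look like, here is a minimal Python sketch that builds a manager-side “hello” and appends the same “]]>]]>” end-of-message marker seen above. The function name build_hello is hypothetical, and the sketch assumes NETCONF 1.0 framing and the standard base namespace. It only builds the string and does not open an SSH session.

```python
import xml.etree.ElementTree as ET

# Standard NETCONF base namespace and the NETCONF 1.0 end-of-message marker.
NETCONF_NS = "urn:ietf:params:xml:ns:netconf:base:1.0"
EOM = "]]>]]>"


def build_hello(capabilities):
    """Return a framed <hello> advertising the given capability URIs.

    build_hello is an illustrative helper, not part of any Cisco tool.
    A manager-side hello carries capabilities only (no session-id).
    """
    hello = ET.Element("hello", xmlns=NETCONF_NS)
    caps = ET.SubElement(hello, "capabilities")
    for uri in capabilities:
        ET.SubElement(caps, "capability").text = uri
    body = ET.tostring(hello, encoding="unicode")
    return '<?xml version="1.0" encoding="UTF-8"?>\n' + body + "\n" + EOM


if __name__ == "__main__":
    print(build_hello(["urn:ietf:params:netconf:base:1.0"]))
```

Pasting the printed text into the open SSH session would be one way to answer the agent by hand; a real manager would normally rely on a NETCONF library instead.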

Before terminating the NETCONF session, we can perform some basic verifications on XRv4. We can see the session from CSR2 in summary and detailed form. The session is idle since CSR2 didn’t send any NETCONF messages to XRv4, but we see the session is up.

RP/0/0/CPU0:XRv4#show netconf sessions
Session   Client     Agent  User   Date                       State
11000000  10.1.14.2  tty    cisco  Sun MON DAY 15:10:11 2015  Idle

RP/0/0/CPU0:XRv4#show netconf sessions detail
1) Session              : 11000000
   Client               : 10.1.14.2:0 (VRF -)
   Agent Type           : tty
   User                 : cisco
   State                : Idle
   Config Session       :
   Admin Config Session :
   Alarm Notification   : Not Registered
   Start Date           : Sun MON DAY 15:10:11 2015
   Elapsed Time         : 00:00:13
   Last State Changed   : 00:00:13

The command below shows detailed NETCONF statistics based on configuration changes. The table is enormous so I have snipped rows and columns from it.

RP/0/0/CPU0:XRv4#show netconf-yang statistics
Summary statistics
                # requests|     total time| min time per request|
other                    0|  0h 0m 0s 0ms|         0h 0m 0s 0ms|
close-session            0|  0h 0m 0s 0ms|         0h 0m 0s 0ms|
kill-session             0|  0h 0m 0s 0ms|         0h 0m 0s 0ms|
get-schema               0|  0h 0m 0s 0ms|         0h 0m 0s 0ms|
[snip]

Additional Reading – Reference configurations “netconf-yang”

26. Describe the incident management process based on the ITILv3 framework

ITIL defines an incident as any event that is not part of the standard operation of the service. The event causes, or may cause, an interruption of the service or a reduction in its quality. Failures that affect an item, even if they don’t affect a service, also qualify as incidents. An example is an LACP member port failing with no significant network impact. An incident differs from a problem in that the latter consists of recurring, related incidents. Problem Management is a different process than Incident Management. The objective of Incident Management (IM) is to restore a service to normal operations with minimal impact on the business or user, and at minimal cost.


IM inputs mostly come from users, but machines (management or detection systems) can provide information as well. The outputs of IM are requests for change (RFCs) which can mitigate future incidents. Incidents can be prioritized generally by considering (summing) the impact and the urgency. Impact refers to the business criticality of the incident and the extent to which it degrades the quality of the SLA (large number of users out of service, etc). Urgency reflects the required speed of resolving an incident. The chart below summarizes the prioritization scheme. Urgency is on the y-axis with impact on the x-axis.

Priority 2                  Priority 1
Limited damage              Significant damage
Recover within 2 hours      Recover within 1 hour

Priority 4                  Priority 3
Limited damage              Significant damage
Recover within 8 hours      Recover within 4 hours
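The chart can also be read as a simple lookup. The Python sketch below is only an illustration: the labels "high"/"low" for urgency and "limited"/"significant" for impact are assumptions inferred from the chart layout (top row treated as the more urgent one), and the names PRIORITY_MATRIX and classify are hypothetical.

```python
# (urgency, impact) -> (priority, recovery target in hours).
# The "high"/"low" urgency labels are an assumption; the chart itself
# only shows the four cells, with urgency on the y-axis.
PRIORITY_MATRIX = {
    ("high", "significant"): (1, 1),
    ("high", "limited"): (2, 2),
    ("low", "significant"): (3, 4),
    ("low", "limited"): (4, 8),
}


def classify(urgency, impact):
    """Return (priority, hours to recover) for an incident."""
    return PRIORITY_MATRIX[(urgency, impact)]


if __name__ == "__main__":
    priority, hours = classify("high", "limited")
    print(f"Priority {priority}, recover within {hours} hours")
```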

Incidents can be escalated functionally or hierarchically. Functional escalation moves incidents from level 1 (front line service desk support) up through levels 2 and 3. Hierarchical escalation advances the incidents upward to the Incident Manager and ultimately to the Service Manager. IM consists of several sub processes:

• Incident detection and recording
• Classification and initial support
• Investigation and analysis
• Resolution and recovery
• Incident closure
• Incident ownership, monitoring, tracking and communication
• Establish incident framework management
• Evaluation of incident framework management

27. Describe, implement, and troubleshoot advanced BGP features

27.1 Additional Paths (add-path) and Prefix Independent Convergence (PIC)

It is a well-known behavior that using BGP route-reflection within an iBGP domain can hide topological redundancies between clients in many cases. BGP only advertises its best path, which is true for RRs also. Oftentimes this is desirable since a route-reflector’s best path might be the best path for everyone else, also. Sometimes this is a limitation since clients with multiple paths could potentially load-share, or use additional paths for fast-reroute. BGP additional-paths, or add-path, is meant to overcome the limitations with hiding the BGP topology when using RRs.

The topology below shows a square-shaped provider network which uses IS-IS Level 2 for IP routing and LDP for MPLS transport label binding. RSVP-TE is also enabled for ad-hoc testing as needed. Initially, all interfaces use the default IS-IS metric of 10, but we will make modifications to this as we test different features. The SP network is BGP AS 100. The Internet (AS 6) injects several routes into the global BGP table of the SP. The SP transports these to another SP (AS 12) and MPLS transport isn't really the goal because we are testing BGP add-path. The SP has one L3VPN customer (AS 410) so we can test BGP add-path with VPNv4 in addition to IPv4.

Let's examine the state of the network before enabling any advanced features. CSR8, CSR9, and XRv11 are route-reflectors for both VPNv4 and IPv4 (the middle column). The routers in the left and right columns are PEs which peer with all three of the RRs; some PEs only run IPv4 while some only run VPNv4 depending on who their customers are. CSR5, XRv12, and XRv13 all receive several Internet routes from CSR6 and advertise them to the RRs. All of the RRs will have the exact same routes, but may have different best-paths due to the IGP metric. For example, looking at an Internet route, CSR9 prefers the path via CSR5 due to IGP metric and lowest RID while CSR8 prefers the path through XRv12.

CSR9#show bgp ipv4 unicast 28.10.45.0/25
BGP routing table entry for 28.10.45.0/25, version 50
Paths: (3 available, best #1, table default)
  Advertised to update-groups:
     1
  Refresh Epoch 1
  6 78 454, (Received from a RR-client)
    5.5.5.5 (metric 10) from 5.5.5.5 (5.5.5.5)
      Origin incomplete, metric 0, localpref 100, valid, internal, best
      rx pathid: 0, tx pathid: 0x0
  Refresh Epoch 1
  6 78 454, (Received from a RR-client)
    12.12.12.12 (metric 30) from 12.12.12.12 (12.12.12.12)
      Origin incomplete, metric 0, localpref 100, valid, internal
      rx pathid: 0, tx pathid: 0
  Refresh Epoch 1
  6 78 454, (Received from a RR-client)
    13.13.13.13 (metric 10) from 13.13.13.13 (13.13.13.13)
      Origin incomplete, metric 0, localpref 100, valid, internal
      rx pathid: 0, tx pathid: 0

CSR8#show bgp ipv4 unicast 28.10.45.0/25
BGP routing table entry for 28.10.45.0/25, version 80
Paths: (3 available, best #2, table default)
  Advertised to update-groups:
     1
  Refresh Epoch 1
  6 78 454, (Received from a RR-client)
    5.5.5.5 (metric 30) from 5.5.5.5 (5.5.5.5)
      Origin incomplete, metric 0, localpref 100, valid, internal
      rx pathid: 0, tx pathid: 0
  Refresh Epoch 1
  6 78 454, (Received from a RR-client)
    12.12.12.12 (metric 10) from 12.12.12.12 (12.12.12.12)
      Origin incomplete, metric 0, localpref 100, valid, internal, best
      rx pathid: 0, tx pathid: 0x0
  Refresh Epoch 1
  6 78 454, (Received from a RR-client)
    13.13.13.13 (metric 10) from 13.13.13.13 (13.13.13.13)
      Origin incomplete, metric 0, localpref 100, valid, internal
      rx pathid: 0, tx pathid: 0

CSR3, for example, is going to learn the best-path from each of the three RRs. XRv11 has selected CSR5 as the best path (lowest RID), which means CSR3 has no idea there is a third path through XRv13. CSR3 now has two copies of one path and one of another, but is missing the third.

CSR3#show bgp ipv4 unicast 28.10.45.0/25
BGP routing table entry for 28.10.45.0/25, version 97
Paths: (3 available, best #3, table default)
  Advertised to update-groups:
     4
  Refresh Epoch 1
  6 78 454
    5.5.5.5 (metric 20) from 11.11.11.11 (11.11.11.11)
      Origin incomplete, metric 0, localpref 100, valid, internal
      Originator: 5.5.5.5, Cluster list: 11.11.11.11
      rx pathid: 0, tx pathid: 0
  Refresh Epoch 2
  6 78 454
    12.12.12.12 (metric 20) from 8.8.8.8 (8.8.8.8)
      Origin incomplete, metric 0, localpref 100, valid, internal
      Originator: 12.12.12.12, Cluster list: 8.8.8.8
      rx pathid: 0, tx pathid: 0
  Refresh Epoch 2
  6 78 454
    5.5.5.5 (metric 20) from 9.9.9.9 (9.9.9.9)
      Origin incomplete, metric 0, localpref 100, valid, internal, best
      Originator: 5.5.5.5, Cluster list: 9.9.9.9
      rx pathid: 0, tx pathid: 0x0
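The path hiding seen on CSR3 can be reproduced with a few lines of Python. This is a simplified sketch, not the full BGP best-path algorithm: it compares only IGP metric and then lowest router ID. The CSR8 and CSR9 metrics come from the outputs above; the XRv11 metrics are not shown in the text and are assumed equal here, which is consistent with XRv11 choosing CSR5 on lowest RID.

```python
import ipaddress


def best_path(igp_metric):
    """Pick the next hop with the lowest IGP metric, then the lowest router ID."""
    return min(igp_metric, key=lambda nh: (igp_metric[nh], ipaddress.ip_address(nh)))


# IGP metric from each RR to each egress PE.
rr_views = {
    "CSR8": {"5.5.5.5": 30, "12.12.12.12": 10, "13.13.13.13": 10},
    "CSR9": {"5.5.5.5": 10, "12.12.12.12": 30, "13.13.13.13": 10},
    # Assumed values: the text does not show XRv11's IGP metrics.
    "XRv11": {"5.5.5.5": 10, "12.12.12.12": 10, "13.13.13.13": 10},
}

# Each RR reflects only its own best path to the client.
reflected = {rr: best_path(view) for rr, view in rr_views.items()}

# Any egress PE that no RR selected is invisible to the client.
hidden = set(rr_views["CSR9"]) - set(reflected.values())

if __name__ == "__main__":
    print(reflected)
    print(hidden)
```

Under these inputs the client hears about 5.5.5.5 twice and 12.12.12.12 once, while 13.13.13.13 never reaches it.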

BGP add-path has many solutions for problems like this (and others). One quick solution is to direct CSR9 to only advertise its second best path, which would be via XRv13. This way, each of the three RRs coincidentally advertises a diverse path to CSR3, XRv14, and CSR7. This is sometimes called the "shadow RR" method. Of note, the Shadow RR itself has a few steps of configuration. Specifically, it has to select the backup path as the repair path, install it in the FIB, and then advertise it to its neighbor.

! CSR9
router bgp 100
 template peer-policy IPV4
  advertise diverse-path backup
 address-family ipv4
  bgp additional-paths select backup
  bgp additional-paths install

Verifications are shown below. We check the BGP RIB, FIB, and advertised routes. The "bia" codes on the BGP advertised route indicate a backup-path, iBGP learned, and additional-path respectively. Notice that the “>” symbol is not shown on this route since it is a backup path, not a best path, which seems to violate the fundamental rule of BGP route advertisement at first glance.

CSR9#show bgp ipv4 unicast 28.10.45.0/25
BGP routing table entry for 28.10.45.0/25, version 122
Paths: (3 available, best #1, table default)
  Additional-path-install
  Advertised to update-groups:
     2
  Refresh Epoch 4
  6 78 454, (Received from a RR-client)
    5.5.5.5 (metric 10) from 5.5.5.5 (5.5.5.5)
      Origin incomplete, metric 0, localpref 100, valid, internal, best
      rx pathid: 0, tx pathid: 0x0
  Refresh Epoch 1
  6 78 454, (Received from a RR-client)
    12.12.12.12 (metric 30) from 12.12.12.12 (12.12.12.12)
      Origin incomplete, metric 0, localpref 100, valid, internal
      rx pathid: 0, tx pathid: 0
  Refresh Epoch 1
  6 78 454, (Received from a RR-client)
    13.13.13.13 (metric 10) from 13.13.13.13 (13.13.13.13)
      Origin incomplete, metric 0, localpref 100, valid, internal, backup/repair
      rx pathid: 0, tx pathid: 0

CSR9#show ip cef 28.10.45.0 detail
28.10.45.0/25, epoch 2, flags [rib only nolabel, rib defined all labels]
  recursive via 5.5.5.5
    nexthop 153.5.9.5 GigabitEthernet2.559
  recursive via 13.13.13.13, repair
    nexthop 153.9.13.13 GigabitEthernet2.593

CSR9#show bgp ipv4 unicast neighbors 3.3.3.3 advertised-routes | include 28.10.45.0/25
 *bia28.10.45.0/25 13.13.13.13 0 100 0 6 78 454 ?
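As a rough illustration of why CSR9's diverse path is XRv13, the sketch below ranks CSR9's three paths and returns the second entry. It is a simplification under stated assumptions: ranking uses only IGP metric and then lowest router ID, and the function names rank_paths and diverse_backup are hypothetical, not IOS internals.

```python
import ipaddress


def rank_paths(igp_metric):
    """Order next hops by lowest IGP metric, then lowest router ID."""
    return sorted(igp_metric, key=lambda nh: (igp_metric[nh], ipaddress.ip_address(nh)))


def diverse_backup(igp_metric):
    """Return the second-best path, which a shadow RR advertises instead of the best."""
    ranked = rank_paths(igp_metric)
    return ranked[1] if len(ranked) > 1 else None


# CSR9's IGP metrics to each egress PE, taken from the output above.
csr9 = {"5.5.5.5": 10, "12.12.12.12": 30, "13.13.13.13": 10}

if __name__ == "__main__":
    print(rank_paths(csr9))
    print(diverse_backup(csr9))
```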

When we check CSR3, we can see it has three different next-hops for this prefix, which is ideal. Notice that the route from CSR9 isn't really CSR9's best path, since we specifically directed it to advertise its second best path. This still hasn’t helped us too much, since CSR3 still only picked one best path.

CSR3#show bgp ipv4 unicast 28.10.45.0/25
BGP routing table entry for 28.10.45.0/25, version 112
Paths: (3 available, best #1, table default)
  Advertised to update-groups:
     4
  Refresh Epoch 1
  6 78 454
    5.5.5.5 (metric 20) from 11.11.11.11 (11.11.11.11)
      Origin incomplete, metric 0, localpref 100, valid, internal, best
      Originator: 5.5.5.5, Cluster list: 11.11.11.11
      rx pathid: 0, tx pathid: 0x0
  Refresh Epoch 2
  6 78 454
    12.12.12.12 (metric 20) from 8.8.8.8 (8.8.8.8)
      Origin incomplete, metric 0, localpref 100, valid, internal
      Originator: 12.12.12.12, Cluster list: 8.8.8.8
      rx pathid: 0, tx pathid: 0
  Refresh Epoch 5
  6 78 454
    13.13.13.13 (metric 20) from 9.9.9.9 (9.9.9.9)
      Origin incomplete, metric 0, localpref 100, valid, internal
      Originator: 13.13.13.13, Cluster list: 9.9.9.9
      rx pathid: 0, tx pathid: 0

Of greater benefit to CSR3 would be the ability to use these paths for ECMP, since they all have equal IGP metrics. Let's enable iBGP multipath on CSR3 for the IPv4 AF. CSR3 still identifies a best-path because BGP always selects a single best-path, but marks the ECMP paths as "multipath". The reason for always needing exactly one best-path is that this is what BGP would normally advertise (with add-path we can change it, but this describes classic BGP behavior). The actual load sharing is interesting because the RIB installs three entries as expected, but based on the IGP routes to the BGP next-hops, we see the FIB uses paths through XRv11 and CSR9 to service the three different routes. There is more diversity following these paths in the “core” of the network.

! CSR3
router bgp 100
 address-family ipv4
  maximum-paths ibgp 3

CSR3#show bgp ipv4 unicast 28.10.45.0/25
BGP routing table entry for 28.10.45.0/25, version 127
Paths: (3 available, best #1, table default)
Multipath: iBGP
  Advertised to update-groups:
     4
  Refresh Epoch 1
  6 78 454
    5.5.5.5 (metric 20) from 11.11.11.11 (11.11.11.11)
      Origin incomplete, metric 0, localpref 100, valid, internal, multipath, best
      Originator: 5.5.5.5, Cluster list: 11.11.11.11
      rx pathid: 0, tx pathid: 0x0
  Refresh Epoch 2
  6 78 454
    12.12.12.12 (metric 20) from 8.8.8.8 (8.8.8.8)
      Origin incomplete, metric 0, localpref 100, valid, internal, multipath
      Originator: 12.12.12.12, Cluster list: 8.8.8.8
      rx pathid: 0, tx pathid: 0
  Refresh Epoch 5
  6 78 454
    13.13.13.13 (metric 20) from 9.9.9.9 (9.9.9.9)
      Origin incomplete, metric 0, localpref 100, valid, internal, multipath(oldest)
      Originator: 13.13.13.13, Cluster list: 9.9.9.9
      rx pathid: 0, tx pathid: 0

CSR3#show ip route 28.10.45.0
Routing entry for 28.10.45.0/25
  Known via "bgp 100", distance 200, metric 0
  Tag 6, type internal
  Last update from 13.13.13.13 00:03:34 ago
  Routing Descriptor Blocks:
    13.13.13.13, from 9.9.9.9, 00:03:34 ago
      Route metric is 0, traffic share count is 1
      AS Hops 3
      Route tag 6
      MPLS label: none
    12.12.12.12, from 8.8.8.8, 00:03:34 ago
      Route metric is 0, traffic share count is 1
      AS Hops 3
      Route tag 6
      MPLS label: none
  * 5.5.5.5, from 11.11.11.11, 00:03:34 ago
      Route metric is 0, traffic share count is 1
      AS Hops 3
      Route tag 6
      MPLS label: none

CSR3#show ip cef 28.10.45.0
28.10.45.0/25
  nexthop 153.3.9.9 GigabitEthernet2.539 label 9003
  nexthop 153.3.11.11 GigabitEthernet2.531 label 91013
  nexthop 153.3.11.11 GigabitEthernet2.531 label 91010
  nexthop 153.3.9.9 GigabitEthernet2.539 label 9000

Looking at the labels, we come up with the following conclusions. This is significant because the FIB ultimately determines the level of HA you achieve in a network.

1. Traffic to CSR5 can be load-shared between CSR9 and XRv11.
2. Traffic to XRv12 always flows through XRv11.
3. Traffic to XRv13 always flows through CSR9.

CSR3#show mpls forwarding-table 5.5.5.5 32
Local   Outgoing  Prefix            Bytes Label  Outgoing   Next Hop
Label   Label     or Tunnel Id      Switched     interface
3010    9003      5.5.5.5/32        0            Gi2.539    153.3.9.9
        91013     5.5.5.5/32        0            Gi2.531    153.3.11.11

CSR3#show mpls forwarding-table 12.12.12.12 32
Local   Outgoing  Prefix            Bytes Label  Outgoing   Next Hop
Label   Label     or Tunnel Id      Switched     interface
3000    91010     12.12.12.12/32    0            Gi2.531    153.3.11.11

CSR3#show mpls forwarding-table 13.13.13.13 32
Local   Outgoing  Prefix            Bytes Label  Outgoing   Next Hop
Label   Label     or Tunnel Id      Switched     interface
3005    9000      13.13.13.13/32    1400         Gi2.539    153.3.9.9
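The way three BGP next-hops expand into four FIB entries can be sketched as a recursive lookup. The Python below is only a model of that idea: the label, interface, and adjacency values are copied from the CSR3 outputs, while the mapping of 153.3.9.9 to CSR9 and 153.3.11.11 to XRv11 is inferred from the conclusions in the text. The names lfib, resolve, and transit_routers are hypothetical.

```python
# BGP next hop -> list of (adjacent next hop, interface, outgoing label),
# copied from the "show mpls forwarding-table" outputs on CSR3.
lfib = {
    "5.5.5.5": [("153.3.9.9", "Gi2.539", 9003), ("153.3.11.11", "Gi2.531", 91013)],
    "12.12.12.12": [("153.3.11.11", "Gi2.531", 91010)],
    "13.13.13.13": [("153.3.9.9", "Gi2.539", 9000)],
}

# Inferred from the text: which core router owns each adjacent address.
neighbors = {"153.3.9.9": "CSR9", "153.3.11.11": "XRv11"}


def resolve(bgp_next_hops):
    """Flatten each BGP next hop into its IGP/LDP paths, as the FIB does."""
    return [path for nh in bgp_next_hops for path in lfib[nh]]


def transit_routers(bgp_next_hop):
    """Return the first-hop core routers used to reach one BGP next hop."""
    return sorted(neighbors[adj] for adj, _, _ in lfib[bgp_next_hop])


if __name__ == "__main__":
    for entry in resolve(["5.5.5.5", "12.12.12.12", "13.13.13.13"]):
        print(entry)
```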

Next, we will remove the shadow RR configuration from CSR9 and multipath configuration from CSR3 to return the network to its baseline status. A common feature to use on RRs that are spread throughout a network is disabling the IGP-metric evaluation in the BGP bestpath process. For example, CSR9 never considered XRv12 as a viable candidate because the IGP metric was 30, where the IGP metric to CSR5 and XRv13 was 10. If we disable IGP metric consideration on all RRs, then the next tie-breaker is lowest RID, and every RR will select the exact same bestpath. If we configure this feature by itself on only CSR8 and CSR9 we actually cause more damage because now CSR3 only knows about the path via CSR5. We have to disable XRv11 BGP for this test since it does not appear to support ignoring the IGP metric. The easiest way to do this is shutting down the IBGP session-group. At this point, CSR3 prefers the copy of the CSR5 path learned from CSR8 because the neighbor ID is lower. Regardless of which path CSR3 selects, it can only use CSR5 for forwarding, ignoring the other two candidate PEs.

! CSR8 and CSR9
router bgp 100
 address-family ipv4
  bgp bestpath igp-metric ignore

CSR3#show bgp ipv4 unicast 28.10.45.0/25
BGP routing table entry for 28.10.45.0/25, version 172
Paths: (2 available, best #1, table default)
  Advertised to update-groups:
     4
  Refresh Epoch 2
  6 78 454
    5.5.5.5 (metric 20) from 8.8.8.8 (8.8.8.8)
      Origin incomplete, metric 0, localpref 100, valid, internal, best
      Originator: 5.5.5.5, Cluster list: 8.8.8.8
      rx pathid: 0, tx pathid: 0x0
  Refresh Epoch 6
  6 78 454
    5.5.5.5 (metric 20) from 9.9.9.9 (9.9.9.9)
      Origin incomplete, metric 0, localpref 100, valid, internal
      Originator: 5.5.5.5, Cluster list: 9.9.9.9
      rx pathid: 0, tx pathid: 0

Earlier we used the “Shadow RR” in a highly coincidental and cherry-picked situation to solve our problem. The current network is more representative of a real setup where all of the RRs pick the exact same path. In this case, we can re-enable the Shadow RR feature on CSR9 so that it picks the next best path, ignoring the IGP metric, which is XRv12. Now, CSR3 has the backup paths. This design is more resilient because changes in IGP will not affect the BGP topology; if the RR's IGP metrics to a PE change, it doesn't matter. Since the two RRs always pick the same best paths, the shadow (CSR9) knows that it's safe to advertise the second best. Without ignoring IGP metric, RRs might pick opposing paths, and without shadow RR enabled, this is fine (we just tested that). But with shadow enabled, this might result in the remote PEs receiving the same route twice due to lack of synchronization between the RRs. Although I don’t show it here, iBGP multipath would be a good option on the remote PE (CSR3) or using backup paths (discussed later) to achieve fast convergence.

CSR9#show bgp ipv4 unicast 28.10.45.0/25
BGP routing table entry for 28.10.45.0/25, version 190
BGP Bestpath: igpmetric-ignore
Paths: (3 available, best #1, table default)
  Additional-path-install
  Advertised to update-groups:
     4
  Refresh Epoch 4
  6 78 454, (Received from a RR-client)
    5.5.5.5 (metric 10) from 5.5.5.5 (5.5.5.5)
      Origin incomplete, metric 0, localpref 100, valid, internal, best
      rx pathid: 0, tx pathid: 0x0
  Refresh Epoch 1
  6 78 454, (Received from a RR-client)
    12.12.12.12 (metric 30) from 12.12.12.12 (12.12.12.12)
      Origin incomplete, metric 0, localpref 100, valid, internal, backup/repair
      rx pathid: 0, tx pathid: 0
  Refresh Epoch 1
  6 78 454, (Received from a RR-client)
    13.13.13.13 (metric 10) from 13.13.13.13 (13.13.13.13)
      Origin incomplete, metric 0, localpref 100, valid, internal
      rx pathid: 0, tx pathid: 0

CSR3#show bgp ipv4 unicast 28.10.45.0/25
BGP routing table entry for 28.10.45.0/25, version 172
Paths: (2 available, best #1, table default)
  Advertised to update-groups:
     4
  Refresh Epoch 2
  6 78 454
    5.5.5.5 (metric 20) from 8.8.8.8 (8.8.8.8)
      Origin incomplete, metric 0, localpref 100, valid, internal, best
      Originator: 5.5.5.5, Cluster list: 8.8.8.8
      rx pathid: 0, tx pathid: 0x0
  Refresh Epoch 7
  6 78 454
    12.12.12.12 (metric 20) from 9.9.9.9 (9.9.9.9)
      Origin incomplete, metric 0, localpref 100, valid, internal
      Originator: 12.12.12.12, Cluster list: 9.9.9.9
      rx pathid: 0, tx pathid: 0
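The synchronization concern can be illustrated with a small Python model. The two RR views below are hypothetical values chosen to make the RRs disagree (they are not taken from this lab), and the ranking is again reduced to IGP metric followed by lowest router ID. With the metric considered, the primary RR's best path and the shadow RR's second-best path can be the same route; with the metric ignored, both RRs rank identically and the client receives two different next-hops.

```python
import ipaddress


def ranked(igp_metric, ignore_igp):
    """Order next hops by IGP metric (unless ignored), then lowest router ID."""
    def key(nh):
        metric = 0 if ignore_igp else igp_metric[nh]
        return (metric, ipaddress.ip_address(nh))
    return sorted(igp_metric, key=key)


def advertised(igp_metric, ignore_igp, shadow):
    """A normal RR advertises its best path; a shadow RR advertises its second best."""
    order = ranked(igp_metric, ignore_igp)
    return order[1] if shadow else order[0]


# Hypothetical metrics: the two RRs sit on opposite sides of the network.
primary_view = {"5.5.5.5": 10, "12.12.12.12": 20}
shadow_view = {"5.5.5.5": 20, "12.12.12.12": 10}


def client_sees(ignore_igp):
    """Return the set of distinct next hops a client learns from both RRs."""
    return {
        advertised(primary_view, ignore_igp, shadow=False),
        advertised(shadow_view, ignore_igp, shadow=True),
    }


if __name__ == "__main__":
    print(client_sees(ignore_igp=False))
    print(client_sees(ignore_igp=True))
```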

We will remove the shadow RR configuration from CSR9 again to test other features. We will leave the IGP-metric ignore in place and leave XRv11 BGP shutdown for now. The RRs are configured to send additional paths while the PEs are configured to receive them for IPv4. The general snippet is below for reference. Before checking the BGP tables, we can verify this negotiated capability was successful by checking the BGP neighbor details. Beware the logic on the PEs; both CSR3 and XRv14 say the "send" capability was received and the "receive" capability was sent. This is actually correct. The RR's logic is easier to digest but it's effectively the opposite of the clients.

! XE and XR PE
router bgp 100
 address-family ipv4
  bgp additional-paths receive

! XE and XR RR
router bgp 100
 address-family ipv4
  bgp additional-paths send

CSR3#show bgp ipv4 unicast neighbors 9.9.9.9 | include Additional
  Additional Paths send capability: received
  Additional Paths receive capability: advertised

RP/0/0/CPU0:XRv14#show bgp ipv4 unicast neighbors 9.9.9.9 | include Additional
  Additional-paths Send: received
  Additional-paths Receive: advertised

CSR9#show bgp ipv4 unicast neighbors 3.3.3.3 | include Additional
  Additional Paths send capability: advertised
  Additional Paths receive capability: received
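The send/receive logic is easier to follow as code. The sketch below models each speaker's add-path configuration as a set of capabilities and derives both the direction in which extra paths may flow and the wording each side would report for its neighbor. The function names are hypothetical and the status strings only approximate the CLI output shown above.

```python
SEND, RECEIVE = "send", "receive"


def may_send_additional_paths(local_caps, peer_caps):
    """Extra paths flow local -> peer only if local offers send and peer offers receive."""
    return SEND in local_caps and RECEIVE in peer_caps


def capability_status(cap, local_caps, peer_caps):
    """Describe one capability from the local router's point of view."""
    parts = []
    if cap in local_caps:
        parts.append("advertised")
    if cap in peer_caps:
        parts.append("received")
    return " and ".join(parts) or "none"


rr = {SEND}      # RR configured with "additional-paths send"
pe = {RECEIVE}   # PE configured with "additional-paths receive"

if __name__ == "__main__":
    print("RR may send to PE:", may_send_additional_paths(rr, pe))
    print("PE view, send capability:", capability_status(SEND, pe, rr))
    print("PE view, receive capability:", capability_status(RECEIVE, pe, rr))
```

From the PE's perspective the "send" capability is something it heard from the RR, which is why the PE output reads "received" for send and "advertised" for receive.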

First we will test selecting the best 2 paths from the BGP RIB and advertising them to all iBGP peers on CSR9. CSR9 shows best2 path as XRv12 due to lower RID than XRv13 (remember, we ignore the IGP metric). We clearly see them being advertised to CSR3 and XRv14 as examples. Notice that with this more powerful feature, the routers are actually advertising multiple paths. The shadow RR advertised a single backup, not multiple paths.

! CSR9
router bgp 100
 template peer-policy IPV4
  advertise additional-paths best 2
 address-family ipv4
  bgp additional-paths select best 2

CSR9#show bgp ipv4 unicast 28.10.45.0/25
BGP routing table entry for 28.10.45.0/25, version 329
BGP Bestpath: igpmetric-ignore
Paths: (3 available, best #1, table default)
  Additional-path-install
  Path advertised to update-groups:
     6          7
  Refresh Epoch 1
  6 78 454, (Received from a RR-client)
    5.5.5.5 (metric 10) from 5.5.5.5 (5.5.5.5)
      Origin incomplete, metric 0, localpref 100, valid, internal, best
      rx pathid: 0, tx pathid: 0x0
  Path advertised to update-groups:
     7
  Refresh Epoch 1
  6 78 454, (Received from a RR-client)
    12.12.12.12 (metric 30) from 12.12.12.12 (12.12.12.12)
      Origin incomplete, metric 0, localpref 100, valid, internal, backup/repair, best2
      rx pathid: 0, tx pathid: 0x1
  Path not advertised to any peer
  Refresh Epoch 1
  6 78 454, (Received from a RR-client)
    13.13.13.13 (metric 10) from 13.13.13.13 (13.13.13.13)
      Origin incomplete, metric 0, localpref 100, valid, internal
      rx pathid: 0, tx pathid: 0

CSR9#show bgp ipv4 unicast neighbors 3.3.3.3 advertised-routes | include 28.10.45.0/25
 *>i 28.10.45.0/25 5.5.5.5 0 100 0 6 78 454 ?
 *bia28.10.45.0/25 12.12.12.12 0 100 0 6 78 454 ?

CSR9#show bgp ipv4 unicast neighbors 14.14.14.14 advertised-routes | include 28.10.45.0/25
 *>i 28.10.45.0/25 5.5.5.5 0 100 0 6 78 454 ?
 *bia28.10.45.0/25 12.12.12.12 0 100 0 6 78 454 ?

Now, CSR3 and XRv14 will have three paths to choose from (CSR8 bestpath, CSR9 bestpath, and CSR9 best2 path). In order to make this effective, we should either use iBGP multipath or backup paths on the PEs. We will configure backup paths on CSR3 and XRv14 for variety. Notice that CSR3 is smart enough to pick a backup path that has a different next-hop than the best-path, although technically the "bestpath" from CSR9 (nexthop 5.5.5.5) is better than the backup due to lower originator-ID. We confirm this in the FIB. The bestpath using 5.5.5.5 as a next-hop has ECMP paths through CSR9 and XRv11 according to IGP, while the backup path has a single path via XRv11.

CSR3#show bgp ipv4 unicast 28.10.45.0/25
BGP routing table entry for 28.10.45.0/25, version 324
Paths: (3 available, best #2, table default)
  Additional-path-install
  Advertised to update-groups:
     4
  Refresh Epoch 2
  6 78 454
    12.12.12.12 (metric 20) from 9.9.9.9 (9.9.9.9)
      Origin incomplete, metric 0, localpref 100, valid, internal, backup/repair
      Originator: 12.12.12.12, Cluster list: 9.9.9.9
      rx pathid: 0x1, tx pathid: 0
  Refresh Epoch 2
  6 78 454
    5.5.5.5 (metric 20) from 8.8.8.8 (8.8.8.8)
      Origin incomplete, metric 0, localpref 100, valid, internal, best
      Originator: 5.5.5.5, Cluster list: 8.8.8.8
      rx pathid: 0x0, tx pathid: 0x0
  Refresh Epoch 2
  6 78 454
    5.5.5.5 (metric 20) from 9.9.9.9 (9.9.9.9)
      Origin incomplete, metric 0, localpref 100, valid, internal
      Originator: 5.5.5.5, Cluster list: 9.9.9.9
      rx pathid: 0x0, tx pathid: 0

CSR3#show ip cef 28.10.45.0 det
28.10.45.0/25, epoch 2, flags [rib only nolabel, rib defined all labels]
  recursive via 5.5.5.5
    nexthop 153.3.9.9 GigabitEthernet2.539 label 9003
    nexthop 153.3.11.11 GigabitEthernet2.531 label 91013
  recursive via 12.12.12.12, repair
    nexthop 153.3.11.11 GigabitEthernet2.531 label 91010
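The backup choice on CSR3 can be approximated as "the highest-ranked path whose next-hop differs from the best path". The sketch below is a simplified model under stated assumptions: ranking uses only lowest originator ID and then lowest neighbor address, which is enough to reproduce the ordering of the three paths above, and the Path tuple and function names are hypothetical.

```python
import ipaddress
from collections import namedtuple

Path = namedtuple("Path", "next_hop originator neighbor")


def rank(paths):
    """Simplified tie-break: lowest originator ID, then lowest neighbor address."""
    ip = ipaddress.ip_address
    return sorted(paths, key=lambda p: (ip(p.originator), ip(p.neighbor)))


def best_and_backup(paths):
    """Return the best path and the first lower-ranked path with a different next hop."""
    ranked = rank(paths)
    best = ranked[0]
    backup = next((p for p in ranked[1:] if p.next_hop != best.next_hop), None)
    return best, backup


# The three paths CSR3 holds for 28.10.45.0/25 in the output above.
paths = [
    Path("12.12.12.12", "12.12.12.12", "9.9.9.9"),
    Path("5.5.5.5", "5.5.5.5", "8.8.8.8"),
    Path("5.5.5.5", "5.5.5.5", "9.9.9.9"),
]

if __name__ == "__main__":
    best, backup = best_and_backup(paths)
    print("best  :", best)
    print("backup:", backup)
```

The second copy of the 5.5.5.5 path ranks above 12.12.12.12 but is skipped because it would fail together with the best path.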

XRv14 has the exact same choices as CSR3 did. XR requires extra work, because we need to apply an RPL as an additional-path selection filter to determine not only what we want to backup, but how. Below is a sample RPL that only backs up the route in question. I didn't have to match the specific prefix but did so for illustrative purposes. XR is also smart enough to pick a diverse BGP next-hop for its backup path. We perform a verification of both the BGP RIB and FIB.

! XRv14
route-policy RPL_BACKUP_45_0
  if destination in (28.10.45.0/25) then
    set path-selection backup 1 install
  endif
end-policy
router bgp 100
 address-family ipv4 unicast
  additional-paths selection route-policy RPL_BACKUP_45_0

RP/0/0/CPU0:XRv14#show bgp ipv4 unicast 28.10.45.0/25 | begin Paths
Paths: (3 available, best #1)
  Advertised to update-groups (with more than one peer):
    0.3
  Path #1: Received by speaker 0
  Advertised to update-groups (with more than one peer):
    0.3
  6 78 454
    5.5.5.5 (metric 20) from 8.8.8.8 (5.5.5.5)
      Origin incomplete, metric 0, localpref 100, valid, internal, best, group-best
      Received Path ID 0, Local Path ID 1, version 255
      Originator: 5.5.5.5, Cluster list: 8.8.8.8
  Path #2: Received by speaker 0
  Not advertised to any peer
  6 78 454
    5.5.5.5 (metric 20) from 9.9.9.9 (5.5.5.5)
      Origin incomplete, metric 0, localpref 100, valid, internal
      Received Path ID 0, Local Path ID 0, version 0
      Originator: 5.5.5.5, Cluster list: 9.9.9.9
  Path #3: Received by speaker 0
  Not advertised to any peer
  6 78 454
    12.12.12.12 (metric 20) from 9.9.9.9 (12.12.12.12)
      Origin incomplete, metric 0, localpref 100, valid, internal, backup, add-path
      Received Path ID 1, Local Path ID 2, version 266
      Originator: 12.12.12.12, Cluster list: 9.9.9.9

RP/0/0/CPU0:XRv14#show cef ipv4 28.10.45.0/25
28.10.45.0/25, version 2900, internal 0x5000001 0x0 (ptr 0xa13d2b74) [1], 0x0 (0x0), 0x0 (0x0)
 local adjacency 153.9.14.9
 Prefix Len 25, traffic index 0, precedence n/a, priority 4
   via 5.5.5.5, 3 dependencies, recursive [flags 0x6000]
    path-idx 0 NHID 0x0 [0xa13d2574 0x0]
    next hop 5.5.5.5 via 5.5.5.5/32
   via 12.12.12.12, 2 dependencies, recursive, backup [flags 0x6100]
    path-idx 1 NHID 0x0 [0xa13d2a74 0x0]
    next hop 12.12.12.12 via 12.12.12.12/32

We can also select the best 3 paths, which would effectively be the same as selecting all paths (both are options) in this case since we only have 3 paths. On CSR9, we will globally direct BGP to select the best 3 paths, then advertise the best 2 to all peers (default) but send the best 3 to XRv14. This is basic inheritance using the templates and overriding the settings by making changes directly on a BGP peer. Since best2 is already in the policy template for IPv4, we will override it specifically for XRv14. XRv14 now has 4 paths; one from CSR8 (its best path) and three from CSR9. It still picks CSR8 as the best path due to lowest neighbor ID, then picks XRv12 as the backup given the lower originator ID. I highlighted the best two paths along with all three paths from CSR9.

! CSR9
router bgp 100
 template peer-policy IPV4
  advertise additional-paths best 2
 address-family ipv4
  bgp additional-paths select best 3
  neighbor 14.14.14.14 advertise additional-paths best 3

RP/0/0/CPU0:XRv14#show bgp ipv4 unicast 28.10.45.0/25 | begin Paths
Paths: (4 available, best #1)
  Advertised to update-groups (with more than one peer):
    0.3
  Path #1: Received by speaker 0
  Advertised to update-groups (with more than one peer):
    0.3
  6 78 454
    5.5.5.5 (metric 20) from 8.8.8.8 (5.5.5.5)
      Origin incomplete, metric 0, localpref 100, valid, internal, best, group-best
      Received Path ID 0, Local Path ID 1, version 255
      Originator: 5.5.5.5, Cluster list: 8.8.8.8
  Path #2: Received by speaker 0
  Not advertised to any peer
  6 78 454
    5.5.5.5 (metric 20) from 9.9.9.9 (5.5.5.5)
      Origin incomplete, metric 0, localpref 100, valid, internal
      Received Path ID 0, Local Path ID 0, version 0
      Originator: 5.5.5.5, Cluster list: 9.9.9.9
  Path #3: Received by speaker 0
  Not advertised to any peer
  6 78 454
    12.12.12.12 (metric 20) from 9.9.9.9 (12.12.12.12)
      Origin incomplete, metric 0, localpref 100, valid, internal, backup, add-path
      Received Path ID 1, Local Path ID 2, version 266
      Originator: 12.12.12.12, Cluster list: 9.9.9.9
  Path #4: Received by speaker 0
  Not advertised to any peer
  6 78 454
    13.13.13.13 (metric 20) from 9.9.9.9 (13.13.13.13)
      Origin incomplete, metric 0, localpref 100, valid, internal
      Received Path ID 2, Local Path ID 0, version 0
      Originator: 13.13.13.13, Cluster list: 9.9.9.9

This is a good opportunity to enable multipath on XRv14 to take advantage of all paths. Notice that even when enabling iBGP multipath for 4 paths, only 3 are used, since the BGP next-hop for two of them is the same. As always, we verify the BGP table, RIB, and FIB. Because I still have add-path configured, one of the paths is still considered as a backup also. Multipath and add-path are not mutually exclusive in many cases. However, notice that the FIB does not flag anything as “backup” or “repair”, because all of the paths are already in the FIB as primary, ECMP options.

! XRv14
router bgp 100
 address-family ipv4 unicast
  maximum-paths ibgp 4

RP/0/0/CPU0:XRv14#show bgp ipv4 unicast 28.10.45.0/25 | begin Paths
Paths: (4 available, best #1)
  Advertised to update-groups (with more than one peer):
    0.3
  Path #1: Received by speaker 0
  Advertised to update-groups (with more than one peer):
    0.3
  6 78 454
    5.5.5.5 (metric 20) from 8.8.8.8 (5.5.5.5)
      Origin incomplete, metric 0, localpref 100, valid, internal, best, group-best, multipath
      Received Path ID 0, Local Path ID 1, version 272
      Originator: 5.5.5.5, Cluster list: 8.8.8.8
  Path #2: Received by speaker 0
  Not advertised to any peer
  6 78 454
    5.5.5.5 (metric 20) from 9.9.9.9 (5.5.5.5)
      Origin incomplete, metric 0, localpref 100, valid, internal
      Received Path ID 0, Local Path ID 0, version 0
      Originator: 5.5.5.5, Cluster list: 9.9.9.9
  Path #3: Received by speaker 0
  Not advertised to any peer
  6 78 454
    12.12.12.12 (metric 20) from 9.9.9.9 (12.12.12.12)
      Origin incomplete, metric 0, localpref 100, valid, internal, multipath, backup, add-path
      Received Path ID 1, Local Path ID 3, version 268
      Originator: 12.12.12.12, Cluster list: 9.9.9.9
  Path #4: Received by speaker 0
  Not advertised to any peer
  6 78 454
    13.13.13.13 (metric 20) from 9.9.9.9 (13.13.13.13)
      Origin incomplete, metric 0, localpref 100, valid, internal, multipath
      Received Path ID 2, Local Path ID 0, version 0
      Originator: 13.13.13.13, Cluster list: 9.9.9.9

RP/0/0/CPU0:XRv14#show route ipv4 unicast 28.10.45.0/25
Routing entry for 28.10.45.0/25
  Known via "bgp 100", distance 200, metric 0
  Tag 6, type internal
  Routing Descriptor Blocks
    5.5.5.5, from 8.8.8.8, BGP multi path
      Route metric is 0
    12.12.12.12, from 9.9.9.9, BGP multi path
      Route metric is 0
    13.13.13.13, from 9.9.9.9, BGP multi path
      Route metric is 0
  No advertising protos.

RP/0/0/CPU0:XRv14#show cef ipv4 28.10.45.0/25
28.10.45.0/25, version 2922, internal 0x5000001 0x0 (ptr 0xa13d2b74) [1], 0x0 (0x0), 0x0 (0x0)
 local adjacency 153.9.14.9
 Prefix Len 25, traffic index 0, precedence n/a, priority 4
   via 5.5.5.5, 2 dependencies, recursive, bgp-multipath [flags 0x6080]
    path-idx 0 NHID 0x0 [0xa13d2574 0x0]
    next hop 5.5.5.5 via 5.5.5.5/32
   via 12.12.12.12, 2 dependencies, recursive, bgp-multipath [flags 0x6080]
    path-idx 1 NHID 0x0 [0xa13d2a74 0x0]
    next hop 12.12.12.12 via 12.12.12.12/32
   via 13.13.13.13, 2 dependencies, recursive, bgp-multipath [flags 0x6080]
    path-idx 2 NHID 0x0 [0xa13d3174 0x0]
    next hop 13.13.13.13 via 13.13.13.13/32
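Why four multipath candidates yield three forwarding entries can be shown with a short Python sketch. It assumes, as a simplification of what the outputs suggest, that paths sharing a BGP next-hop collapse into one entry before the maximum-paths limit is applied; the function name multipath_next_hops is hypothetical.

```python
def multipath_next_hops(ranked_next_hops, maximum_paths):
    """Collapse duplicate BGP next hops (order preserved), then apply the limit."""
    unique = list(dict.fromkeys(ranked_next_hops))
    return unique[:maximum_paths]


# XRv14's four paths for 28.10.45.0/25, in the order shown in the BGP table:
# one from CSR8 and three from CSR9.
paths = ["5.5.5.5", "5.5.5.5", "12.12.12.12", "13.13.13.13"]

if __name__ == "__main__":
    print(multipath_next_hops(paths, maximum_paths=4))
```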

Now we will shift focus to CSR5 and XRv13. CSR5 learns two paths to 28.10.45.0/25; one via eBGP and one via CSR9. Interestingly, the only reason CSR5 learns the CSR9 route is due to CSR9 sending its best 2 paths, since the path via CSR5 is the best path for both RRs (coincidental). A quick check on CSR9 confirms this. CSR9#show bgp ipv4 unicast neighbors 5.5.5.5 advertised-routes | include 28.10.45.0/25 *>i 28.10.45.0/25 5.5.5.5 0 100 0 6 78 454 ? *bia28.10.45.0/25 12.12.12.12 0 100 0 6 78 454 ? CSR5#show bgp ipv4 unicast 28.10.45.0/25 BGP routing table entry for 28.10.45.0/25, version 10 Paths: (2 available, best #2, table default) Advertised to update-groups: 8 Refresh Epoch 4 6 78 454 12.12.12.12 (metric 20) from 9.9.9.9 (9.9.9.9) Origin incomplete, metric 0, localpref 100, valid, internal Originator: 12.12.12.12, Cluster list: 9.9.9.9 rx pathid: 0x1, tx pathid: 0 Refresh Epoch 2 6 78 454 10.5.6.6 from 10.5.6.6 (6.6.6.6) Origin incomplete, metric 0, localpref 100, valid, external, best rx pathid: 0, tx pathid: 0x0

XRv13, on the other hand, will show 4 routes. First comes the expected eBGP route, its best path. Both RRs select the path via CSR5 and reflect that to XRv13. Additionally, CSR9 sends its second-best path to XRv13. Thus, two routes have a next hop of CSR5, one has a next hop of XRv12, and one is the eBGP route. The eBGP route is preferred over the iBGP ones since no BGP attributes have been modified.

CSR9#show bgp ipv4 unicast neighbors 13.13.13.13 advertised-routes | include 28.10.45.0/25
 *>i 28.10.45.0/25 5.5.5.5 0 100 0 6 78 454 ?
 *bia28.10.45.0/25 12.12.12.12 0 100 0 6 78 454 ?

RP/0/0/CPU0:XRv13#show bgp ipv4 un 28.10.45.0/25 | begin Paths
Paths: (4 available, best #4)
  Not advertised to any peer
  Path #1: Received by speaker 0
  Not advertised to any peer
  6 78 454
    5.5.5.5 (metric 20) from 8.8.8.8 (5.5.5.5)
      Origin incomplete, metric 0, localpref 100, valid, internal
      Received Path ID 0, Local Path ID 0, version 0
      Originator: 5.5.5.5, Cluster list: 8.8.8.8
  Path #2: Received by speaker 0
  Not advertised to any peer
  6 78 454
    5.5.5.5 (metric 20) from 9.9.9.9 (5.5.5.5)
      Origin incomplete, metric 0, localpref 100, valid, internal
      Received Path ID 0, Local Path ID 0, version 0
      Originator: 5.5.5.5, Cluster list: 9.9.9.9
  Path #3: Received by speaker 0
  Not advertised to any peer
  6 78 454
    12.12.12.12 (metric 20) from 9.9.9.9 (12.12.12.12)
      Origin incomplete, metric 0, localpref 100, valid, internal
      Received Path ID 1, Local Path ID 0, version 0
      Originator: 12.12.12.12, Cluster list: 9.9.9.9
  Path #4: Received by speaker 0
  Not advertised to any peer
  6 78 454
    10.6.13.6 from 10.6.13.6 (6.6.6.6)
      Origin incomplete, metric 0, localpref 100, valid, external, best, group-best
      Received Path ID 0, Local Path ID 1, version 103
      Origin-AS validity: not-found

Using local-preference on XRv13 will make it the desired exit point for this specific prefix. We will craft a parameterized RPL for extra practice to accomplish this. Notice that XRv13 only has one route now, its best path.

! XRv13
prefix-set PS_COOL_ROUTE
  28.10.45.0/25
end-set
route-policy RPL_SET_LPREF_TO_X_FOR_Y($X, $Y)
  if destination in $Y then
    set local-preference $X
  else
    pass
  endif
end-policy
router bgp 100
 neighbor 10.6.13.6
  remote-as 6
  address-family ipv4 unicast
   route-policy RPL_SET_LPREF_TO_X_FOR_Y(200, PS_COOL_ROUTE) in

RP/0/0/CPU0:XRv13#show bgp ipv4 unicast 28.10.45.0/25 | begin Paths
Paths: (1 available, best #1)
  Advertised to update-groups (with more than one peer):
    0.1
  Path #1: Received by speaker 0
  Advertised to update-groups (with more than one peer):
    0.1
  6 78 454
    10.6.13.6 from 10.6.13.6 (6.6.6.6)
      Origin incomplete, metric 0, localpref 200, valid, external, best, group-best
      Received Path ID 0, Local Path ID 1, version 110
      Origin-AS validity: not-found
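To make the idea of a parameterized policy more concrete, here is a minimal Python sketch of what RPL_SET_LPREF_TO_X_FOR_Y does logically. It is only a model, not how XR evaluates RPL. The function name `make_set_lpref_policy` and the dictionary-based route are assumptions made for illustration, and the prefix-set is treated as an exact-match list, as in the configuration above.

```python
from ipaddress import ip_network


def make_set_lpref_policy(lpref, prefix_set):
    """Build a policy bound to its parameters, like RPL's ($X, $Y)."""
    prefixes = {ip_network(p) for p in prefix_set}

    def policy(route):
        # "if destination in $Y then set local-preference $X"
        if ip_network(route["prefix"]) in prefixes:
            return {**route, "local_pref": lpref}
        # "else pass": the route is accepted unmodified
        return dict(route)

    return policy


PS_COOL_ROUTE = ["28.10.45.0/25"]
policy = make_set_lpref_policy(200, PS_COOL_ROUTE)

matched = policy({"prefix": "28.10.45.0/25", "local_pref": 100})
other = policy({"prefix": "28.10.45.128/25", "local_pref": 100})
print(matched["local_pref"], other["local_pref"])
```

The same policy body can be reused with a different value or prefix-set simply by calling the factory again, which is the point of parameterizing the RPL in the first place.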

This implies that CSR9 must not have any secondary paths, or else we would have seen them reflected back to XRv13. We confirm this below. The reason is that all of the other PEs facing AS 6, such as CSR5 and XRv12, now prefer this iBGP path and do not advertise their eBGP path. Even though they have multiple iBGP paths, they are just reflections of the same local-preference 200 path. The RRs never learn anything unique, and by extension the remote PEs never learn anything unique, making the network somewhat stovepiped.

CSR9#show bgp ipv4 unicast 28.10.45.0/25
BGP routing table entry for 28.10.45.0/25, version 373
BGP Bestpath: igpmetric-ignore
Paths: (1 available, best #1, table default)
  Additional-path-install
  Path advertised to update-groups:
     6 7 8
  Refresh Epoch 1
  6 78 454, (Received from a RR-client)
    13.13.13.13 (metric 10) from 13.13.13.13 (13.13.13.13)
      Origin incomplete, metric 0, localpref 200, valid, internal, best
      rx pathid: 0, tx pathid: 0x0

CSR5#show bgp ipv4 unicast 28.10.45.0/25
BGP routing table entry for 28.10.45.0/25, version 54
Paths: (3 available, best #1, table default)
  Advertised to update-groups:
     7
  Refresh Epoch 2
  6 78 454
    13.13.13.13 (metric 20) from 8.8.8.8 (8.8.8.8)
      Origin incomplete, metric 0, localpref 200, valid, internal, best
      Originator: 13.13.13.13, Cluster list: 8.8.8.8
      rx pathid: 0x0, tx pathid: 0x0
  Refresh Epoch 4
  6 78 454
    13.13.13.13 (metric 20) from 9.9.9.9 (9.9.9.9)
      Origin incomplete, metric 0, localpref 200, valid, internal
      Originator: 13.13.13.13, Cluster list: 9.9.9.9
      rx pathid: 0x0, tx pathid: 0
  Refresh Epoch 2
  6 78 454
    10.5.6.6 from 10.5.6.6 (6.6.6.6)
      Origin incomplete, metric 0, localpref 100, valid, external
      rx pathid: 0, tx pathid: 0

RP/0/0/CPU0:XRv12#show bgp ipv4 unicast 28.10.45.0/25 | begin Paths
Paths: (3 available, best #1)
  Not advertised to any peer
  Path #1: Received by speaker 0
  Not advertised to any peer
  6 78 454
    13.13.13.13 (metric 20) from 8.8.8.8 (13.13.13.13)
      Origin incomplete, metric 0, localpref 200, valid, internal, best, group-best
      Received Path ID 0, Local Path ID 1, version 73
      Originator: 13.13.13.13, Cluster list: 8.8.8.8
  Path #2: Received by speaker 0
  Not advertised to any peer
  6 78 454
    13.13.13.13 (metric 20) from 9.9.9.9 (13.13.13.13)
      Origin incomplete, metric 0, localpref 200, valid, internal
      Received Path ID 0, Local Path ID 0, version 0
      Originator: 13.13.13.13, Cluster list: 9.9.9.9
  Path #3: Received by speaker 0
  Not advertised to any peer
  6 78 454
    10.6.12.6 from 10.6.12.6 (6.6.6.6)
      Origin incomplete, metric 0, localpref 100, valid, external
      Received Path ID 0, Local Path ID 0, version 0
      Origin-AS validity: not-found

We can do some local damage repair by configuring the PEs to install the eBGP route as a backup path. It doesn't really help the remote PEs on the other side of the network, like CSR3 and XRv14, but it's a small improvement. The XE command is simple for CSR5, but again, XRv12 requires a little more work. We will use a simple RPL for this one. As always, verify the FIB too (RIB verification doesn't matter because we are not load-sharing).

! XRv12
route-policy RPL_BACKUP_ALL


set path-selection backup 1 install end-policy router bgp 100 address-family ipv4 unicast additional-paths selection route-policy RPL_BACKUP_ALL RP/0/0/CPU0:XRv12#show bgp ipv4 unicast 28.10.45.0/25 | begin Paths Paths: (3 available, best #1) Not advertised to any peer Path #1: Received by speaker 0 Not advertised to any peer 6 78 454 13.13.13.13 (metric 20) from 8.8.8.8 (13.13.13.13) Origin incomplete, metric 0, localpref 200, valid, internal, best, group-best Received Path ID 0, Local Path ID 1, version 73 Originator: 13.13.13.13, Cluster list: 8.8.8.8 Path #2: Received by speaker 0 Not advertised to any peer 6 78 454 13.13.13.13 (metric 20) from 9.9.9.9 (13.13.13.13) Origin incomplete, metric 0, localpref 200, valid, internal Received Path ID 0, Local Path ID 0, version 0 Originator: 13.13.13.13, Cluster list: 9.9.9.9 Path #3: Received by speaker 0 Not advertised to any peer 6 78 454 10.6.12.6 from 10.6.12.6 (6.6.6.6) Origin incomplete, metric 0, localpref 100, valid, external, backup, add-path Received Path ID 0, Local Path ID 2, version 96 Origin-AS validity: not-found CSR5#show bgp ipv4 unicast 28.10.45.0/25 BGP routing table entry for 28.10.45.0/25, version 60 Paths: (3 available, best #1, table default) Additional-path-install Advertised to update-groups: 7 Refresh Epoch 2 6 78 454 13.13.13.13 (metric 20) from 8.8.8.8 (8.8.8.8) Origin incomplete, metric 0, localpref 200, valid, internal, best Originator: 13.13.13.13, Cluster list: 8.8.8.8 rx pathid: 0x0, tx pathid: 0x0 Refresh Epoch 4 6 78 454 13.13.13.13 (metric 20) from 9.9.9.9 (9.9.9.9)


Origin incomplete, metric 0, localpref 200, valid, internal Originator: 13.13.13.13, Cluster list: 9.9.9.9 rx pathid: 0x0, tx pathid: 0 Refresh Epoch 2 6 78 454 10.5.6.6 from 10.5.6.6 (6.6.6.6) Origin incomplete, metric 0, localpref 100, valid, external, backup/repair , recursive-via-connected rx pathid: 0, tx pathid: 0 CSR5#show ip cef 28.10.45.0 detail 28.10.45.0/25, epoch 2, flags [rib only nolabel, rib defined all labels] recursive via 13.13.13.13 nexthop 153.5.9.9 GigabitEthernet2.559 label 9000 recursive via 10.5.6.6, repair attached to GigabitEthernet2.556 RP/0/0/CPU0:XRv12#show cef ipv4 28.10.45.0/25 28.10.45.0/25, version 1286, internal 0x5000001 0x0 (ptr 0xa14803f4) [1], 0x0 (0x0), 0x0 (0x0) local adjacency 153.8.12.8 Prefix Len 25, traffic index 0, precedence n/a, priority 4 via 10.6.12.6, 3 dependencies, recursive, bgp-ext, backup [flags 0x6120] path-idx 0 NHID 0x0 [0xa147f5f4 0x0] next hop 10.6.12.6 via 10.6.12.6/32 via 13.13.13.13, 2 dependencies, recursive [flags 0x6000] path-idx 1 NHID 0x0 [0xa147ef74 0x0] next hop 13.13.13.13 via 13.13.13.13/32

A good option for introducing more path diversity into the iBGP domain is allowing the "loser" PEs to advertise their best external route. You can enable it on all the PEs in case path selections change, but in this case, I just enable it on CSR5 and XRv12. Beware that in order for best-external to work, you must remove the negotiated add-path capability, since features like this and the shadow RR are independent of (and sometimes mutually exclusive with) full-blown add-path. Now CSR5 marks this as its best eBGP path and still treats it as a fast-reroute backup.

CSR5(config-router-ptmp)#advertise best-external
% BGP: Addpath capability present. When negotiated, it will cause advertise best-external to be silently disabled. Inheritance accepted anyway for 8.8.8.8

! CSR5
router bgp 100
 template peer-policy IPV4
  advertise best-external
 address-family ipv4
  bgp additional-paths select best-external
  no bgp additional-paths receive


CSR5#show bgp ipv4 unicast 28.10.45.0/25 BGP routing table entry for 28.10.45.0/25, version 136 Paths: (3 available, best #1, table default) Additional-path-install Advertised to update-groups: 11 Refresh Epoch 1 6 78 454 13.13.13.13 (metric 20) from 8.8.8.8 (8.8.8.8) Origin incomplete, metric 0, localpref 200, valid, internal, best Originator: 13.13.13.13, Cluster list: 8.8.8.8 rx pathid: 0, tx pathid: 0x0 Refresh Epoch 1 6 78 454 13.13.13.13 (metric 20) from 9.9.9.9 (9.9.9.9) Origin incomplete, metric 0, localpref 200, valid, internal Originator: 13.13.13.13, Cluster list: 9.9.9.9 rx pathid: 0, tx pathid: 0 Refresh Epoch 2 6 78 454 10.5.6.6 from 10.5.6.6 (6.6.6.6) Origin incomplete, metric 0, localpref 100, valid, external, backup/repair, advertise-best-external , recursive-via-connected rx pathid: 0, tx pathid: 0

Similarly, we configure the feature on XRv12 so the RRs can learn its eBGP route also. XR seems content to advertise the best-external even when add-path capability is negotiated, unlike XE. It marks its best (and only) eBGP path as best external, and it's still a valid backup path. ! XRv12 router bgp 100 address-family ipv4 unicast advertise best-external RP/0/0/CPU0:XRv12#show bgp ipv4 unicast 28.10.45.0/25 BGP routing table entry for 28.10.45.0/25 Versions: Process bRIB/RIB SendTblVer Speaker 164 164 Paths: (4 available, best #1) Not advertised to any peer Path #1: Received by speaker 0 Not advertised to any peer 6 78 454 13.13.13.13 (metric 20) from 8.8.8.8 (13.13.13.13) Origin incomplete, metric 0, localpref 200, valid, internal, best, group-best Received Path ID 0, Local Path ID 1, version 73


Originator: 13.13.13.13, Cluster list: 8.8.8.8 Path #2: Received by speaker 0 Not advertised to any peer 6 78 454 13.13.13.13 (metric 20) from 9.9.9.9 (13.13.13.13) Origin incomplete, metric 0, localpref 200, valid, internal Received Path ID 0, Local Path ID 0, version 0 Originator: 13.13.13.13, Cluster list: 9.9.9.9 Path #3: Received by speaker 0 Not advertised to any peer 6 78 454 5.5.5.5 (metric 20) from 9.9.9.9 (5.5.5.5) Origin incomplete, metric 0, localpref 100, valid, internal Received Path ID 1, Local Path ID 0, version 0 Originator: 5.5.5.5, Cluster list: 9.9.9.9 Path #4: Received by speaker 0 Advertised to update-groups (with more than one peer): 0.1 6 78 454 10.6.12.6 from 10.6.12.6 (6.6.6.6) Origin incomplete, metric 0, localpref 100, valid, external, bestexternal, backup, add-path Received Path ID 0, Local Path ID 3, version 164 Origin-AS validity: not-found
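The effect of best-external on a "loser" PE can be sketched in a few lines of Python. This is a deliberately simplified model, assuming only two comparison steps (local-preference, then eBGP over iBGP) and a plain non-RR PE that never re-advertises iBGP-learned paths to iBGP peers. The `Path` class and `paths_to_advertise_to_ibgp` function are hypothetical names, not anything from IOS XE or IOS XR.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Path:
    next_hop: str
    local_pref: int
    external: bool  # True for eBGP-learned, False for iBGP-learned


def rank(path):
    # Simplified ordering: higher local-preference first, then eBGP over iBGP.
    return (-path.local_pref, not path.external)


def paths_to_advertise_to_ibgp(paths, best_external=False):
    best = min(paths, key=rank)
    if best.external:
        return [best]  # the eBGP best path is advertised as usual
    # The best path is iBGP-learned, so a non-RR PE has nothing to send,
    # unless best-external lets it fall back to its best eBGP path.
    if not best_external:
        return []
    externals = [p for p in paths if p.external]
    return [min(externals, key=rank)] if externals else []


# CSR5's view of 28.10.45.0/25 after XRv13 sets local-preference 200
csr5_paths = [
    Path("13.13.13.13", 200, external=False),  # reflected by CSR8
    Path("13.13.13.13", 200, external=False),  # reflected by CSR9
    Path("10.5.6.6", 100, external=True),      # CSR5's own eBGP path
]

without_feature = paths_to_advertise_to_ibgp(csr5_paths)
with_feature = paths_to_advertise_to_ibgp(csr5_paths, best_external=True)
print(without_feature, with_feature)
```

Without the feature the RRs hear nothing from CSR5 for this prefix; with it, they learn the local-preference 100 eBGP path, which is what gives them something extra to reflect.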

A quick check of CSR8 and CSR9 shows the paths. We did not configure any add-path options on CSR8, other than the basic negotiation, so it is still a classic RR in that it only sends its best path. CSR9, however, has now selected its three best paths. Had the “loser” PEs not advertised their best-external paths, the RRs would never have seen these additional paths. We can tell these are the “loser” paths because of the relatively low local-preference value.

CSR8#show bgp ipv4 unicast 28.10.45.0/25
BGP routing table entry for 28.10.45.0/25, version 185
BGP Bestpath: igpmetric-ignore
Paths: (3 available, best #3, table default)
  Advertised to update-groups:
     3 4
  Refresh Epoch 1
  6 78 454, (Received from a RR-client)
    12.12.12.12 (metric 10) from 12.12.12.12 (12.12.12.12)
      Origin incomplete, metric 0, localpref 100, valid, internal
      rx pathid: 0, tx pathid: 0
  Refresh Epoch 2
  6 78 454, (Received from a RR-client)
    5.5.5.5 (metric 30) from 5.5.5.5 (5.5.5.5)
      Origin incomplete, metric 0, localpref 100, valid, internal
      rx pathid: 0, tx pathid: 0
  Refresh Epoch 1


6 78 454, (Received from a RR-client) 13.13.13.13 (metric 10) from 13.13.13.13 (13.13.13.13) Origin incomplete, metric 0, localpref 200, valid, internal, best rx pathid: 0, tx pathid: 0x0 CSR9#show bgp ipv4 unicast 28.10.45.0/25 BGP routing table entry for 28.10.45.0/25, version 433 BGP Bestpath: igpmetric-ignore Paths: (3 available, best #3, table default) Additional-path-install Path advertised to update-groups: 8 Refresh Epoch 1 6 78 454, (Received from a RR-client) 12.12.12.12 (metric 30) from 12.12.12.12 (12.12.12.12) Origin incomplete, metric 0, localpref 100, valid, internal, best3 rx pathid: 0, tx pathid: 0x2 Path advertised to update-groups: 7 8 Refresh Epoch 1 6 78 454, (Received from a RR-client) 5.5.5.5 (metric 10) from 5.5.5.5 (5.5.5.5) Origin incomplete, metric 0, localpref 100, valid, internal, backup/repair, best2 rx pathid: 0, tx pathid: 0x1 Path advertised to update-groups: 7 8 9 Refresh Epoch 1 6 78 454, (Received from a RR-client) 13.13.13.13 (metric 10) from 13.13.13.13 (13.13.13.13) Origin incomplete, metric 0, localpref 200, valid, internal, best rx pathid: 0, tx pathid: 0x0

Now, CSR3 and XRv14 can see the additional paths in the network from the RRs. They still select the proper "best" path because the whole reason one uses local-preference in the first place is for traffic engineering, and add-path honors that. The backup paths they select are the next-best paths with varying next-hops as we expected. However, even with this feature enabled, you should also have either add-path or shadow RR configured, otherwise all the RRs will simply pick the higher local-preference and ignore all the other best-externals. If add-path is not supported, you can use the shadow method by itself, which we examine in VPNv4 tests. Here, we can clearly see XRv13 as the primary path with CSR5 as the backup. CSR3#show bgp ipv4 unicast 28.10.45.0/25 BGP routing table entry for 28.10.45.0/25, version 467 Paths: (3 available, best #2, table default) Additional-path-install Advertised to update-groups: 4


Refresh Epoch 4 6 78 454 5.5.5.5 (metric 20) from 9.9.9.9 (9.9.9.9) Origin incomplete, metric 0, localpref 100, valid, internal, backup/repair Originator: 5.5.5.5, Cluster list: 9.9.9.9 rx pathid: 0x1, tx pathid: 0 Refresh Epoch 2 6 78 454 13.13.13.13 (metric 20) from 8.8.8.8 (8.8.8.8) Origin incomplete, metric 0, localpref 200, valid, internal, best Originator: 13.13.13.13, Cluster list: 8.8.8.8 rx pathid: 0x0, tx pathid: 0x0 Refresh Epoch 4 6 78 454 13.13.13.13 (metric 20) from 9.9.9.9 (9.9.9.9) Origin incomplete, metric 0, localpref 200, valid, internal Originator: 13.13.13.13, Cluster list: 9.9.9.9 rx pathid: 0x0, tx pathid: 0 RP/0/0/CPU0:XRv14#show bgp ipv4 unicast 28.10.45.0/25 BGP routing table entry for 28.10.45.0/25 Versions: Process bRIB/RIB SendTblVer Speaker 394 394 Paths: (4 available, best #1) Advertised to update-groups (with more than one peer): 0.3 Path #1: Received by speaker 0 Advertised to update-groups (with more than one peer): 0.3 6 78 454 13.13.13.13 (metric 20) from 8.8.8.8 (13.13.13.13) Origin incomplete, metric 0, localpref 200, valid, internal, best, group-best Received Path ID 0, Local Path ID 1, version 314 Originator: 13.13.13.13, Cluster list: 8.8.8.8 Path #2: Received by speaker 0 Not advertised to any peer 6 78 454 13.13.13.13 (metric 20) from 9.9.9.9 (13.13.13.13) Origin incomplete, metric 0, localpref 200, valid, internal Received Path ID 0, Local Path ID 0, version 0 Originator: 13.13.13.13, Cluster list: 9.9.9.9 Path #3: Received by speaker 0 Not advertised to any peer 6 78 454 5.5.5.5 (metric 20) from 9.9.9.9 (5.5.5.5)


Origin incomplete, metric 0, localpref 100, valid, internal, backup, add-path Received Path ID 1, Local Path ID 4, version 394 Originator: 5.5.5.5, Cluster list: 9.9.9.9 Path #4: Received by speaker 0 Not advertised to any peer 6 78 454 12.12.12.12 (metric 20) from 9.9.9.9 (12.12.12.12) Origin incomplete, metric 0, localpref 100, valid, internal Received Path ID 2, Local Path ID 0, version 0 Originator: 12.12.12.12, Cluster list: 9.9.9.9

Another useful feature is group-best. This feature groups incoming routes together by AS, selects a best path, then picks the next best path from the group NOT from the same AS. CSR1, CSR2, and CSR4 all advertise 124.0.0.1/32 to CSR3. I've created three basic route-maps that set the local-preference to 101, 102, and 104 on this route inbound from CSR1, CSR2, and CSR4 respectively. CSR4 is the best path, with CSR2 as the "group-best" for AS 12. Another key point with group-best is that it is technically an additional-path and requires the negotiated capability. On CSR3 only, we enabled send/receive, along with CSR9. The other PEs are receive only and the other RRs are send only. CSR3’s changes are below.

! CSR3
router bgp 100
 template peer-policy IPV4
  advertise additional-paths group-best
 address-family ipv4
  bgp additional-paths select group-best
  bgp additional-paths send receive
  neighbor 10.1.3.1 route-map RM_LPREF_GROUP_TEST_CSR1 in
  neighbor 10.2.3.2 route-map RM_LPREF_GROUP_TEST_CSR2 in
  neighbor 10.3.4.4 route-map RM_LPREF_GROUP_TEST_CSR4 in

CSR3#show bgp ipv4 unicast 124.0.0.1/32
BGP routing table entry for 124.0.0.1/32, version 787
Paths: (3 available, best #3, table default)
  Additional-path-install
Flag: 0x820
  Path not advertised to any peer
  Refresh Epoch 10
  12
    10.1.3.1 from 10.1.3.1 (1.1.1.1)
      Origin incomplete, metric 0, localpref 101, valid, external , recursive-via-connected
      rx pathid: 0, tx pathid: 0
  Path advertised to update-groups: (Pending Update Generation)
     11
  Refresh Epoch 10
  12
    10.2.3.2 from 10.2.3.2 (2.2.2.2)


Origin incomplete, metric 0, localpref 102, valid, external, backup/repair, group-best , recursive-via-connected rx pathid: 0, tx pathid: 0x2 Path advertised to update-groups: (Pending Update Generation) 4 10 11 Refresh Epoch 8 410 10.3.4.4 from 10.3.4.4 (4.4.4.4) Origin incomplete, metric 0, localpref 104, valid, external, best , recursive-via-connected rx pathid: 0, tx pathid: 0x0

As expected, CSR3 will advertise this group-best path to CSR9 as a backup, additional path (b and a flags). Notice that CSR3 does not advertise it to CSR8 because CSR8 is not configured to receive additional paths. CSR9 advertises the ability to send/receive, while CSR8 only advertises the ability to send. Since each capability is distinctly different, I recommend enabling both everywhere to avoid the confusion if possible. CSR3#show bgp ipv4 unicast neighbors 9.9.9.9 | include Addition Additional Paths send capability: advertised and received Additional Paths receive capability: advertised and received CSR3#show bgp ipv4 unicast neighbors 9.9.9.9 advertised-routes | include 124.0.0.1/32 *> 124.0.0.1/32 10.3.4.4 0 104 0 410 ? *b a124.0.0.1/32 10.2.3.2 0 102 0 12 ? CSR3#show bgp ipv4 unicast neighbors 8.8.8.8 | include Addition Additional Paths send capability: advertised and received Additional Paths receive capability: advertised CSR3#show bgp ipv4 unicast neighbors 8.8.8.8 advertised-routes | include 124.0.0.1/32 *> 124.0.0.1/32 10.3.4.4 0 104 0 410 ?
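Because the send and receive capabilities are negotiated independently and per direction, it can help to reason about them as two separate checks. The following Python sketch models that logic under the usual add-path rule: a router may send additional paths only if it advertised "send" and the peer advertised "receive", and vice versa. The function name `addpath_operation` and the capability sets are illustrative assumptions based on the configuration described above.

```python
def addpath_operation(local_caps, peer_caps):
    """Return (can_send, can_receive) from the local router's point of view.

    Each argument is the set of add-path capabilities a router advertises,
    drawn from {"send", "receive"}.
    """
    can_send = "send" in local_caps and "receive" in peer_caps
    can_receive = "receive" in local_caps and "send" in peer_caps
    return can_send, can_receive


csr3 = {"send", "receive"}
csr9 = {"send", "receive"}
csr8 = {"send"}  # CSR8 never advertises the receive capability

to_csr9 = addpath_operation(csr3, csr9)
to_csr8 = addpath_operation(csr3, csr8)
print(to_csr9, to_csr8)
```

From CSR3's perspective the session to CSR9 works in both directions, while the session to CSR8 only allows CSR3 to receive additional paths, which is why the group-best path never reaches CSR8.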

To demonstrate the behavior of group-best specifically, we will reduce the local-preference of CSR4 to 90. The best path becomes CSR2, but BGP cannot consider CSR1 as the group-best path despite its local-preference being higher than CSR4's. This is because CSR1 and CSR2 share an AS, so the group-best should have a different next-hop AS; this is a fault-tolerance mechanism. What is also interesting is that CSR3 uses the path through CSR1 as its local backup path because it's the second best, but advertises the group-best, which is CSR4. We can prove both points by checking the BGP RIB, FIB (for backup path), and neighbor advertised routes (for group-best). The advertised group-best route is no longer flagged with 'b' for backup, because it isn't one. I also include the "brief" view of the BGP table to demonstrate the differences between the three routes in summary form. All three routes are different: one best, one backup, and one additional group-best.


CSR3#show bgp ipv4 unicast 124.0.0.1/32
BGP routing table entry for 124.0.0.1/32, version 788
Paths: (3 available, best #2, table default)
  Additional-path-install
  Path not advertised to any peer
  Refresh Epoch 11
  12
    10.1.3.1 from 10.1.3.1 (1.1.1.1)
      Origin incomplete, metric 0, localpref 101, valid, external, backup/repair , recursive-via-connected
      rx pathid: 0, tx pathid: 0
  Path advertised to update-groups:
     4 10 11
  Refresh Epoch 11
  12
    10.2.3.2 from 10.2.3.2 (2.2.2.2)
      Origin incomplete, metric 0, localpref 102, valid, external, best , recursive-via-connected
      rx pathid: 0, tx pathid: 0x0
  Path advertised to update-groups:
     11
  Refresh Epoch 9
  410
    10.3.4.4 from 10.3.4.4 (4.4.4.4)
      Origin incomplete, metric 0, localpref 90, valid, external, group-best , recursive-via-connected
      rx pathid: 0, tx pathid: 0x2

CSR3#show ip cef 124.0.0.1 detail
124.0.0.1/32, epoch 2, flags [rib only nolabel, rib defined all labels]
  recursive via 10.2.3.2
    attached to GigabitEthernet2.523
  recursive via 10.1.3.1, repair
    attached to GigabitEthernet2.513

CSR3#show bgp ipv4 unicast neighbors 9.9.9.9 advertised-routes | include 124.0.0.1/32
 *>   124.0.0.1/32     10.2.3.2          0    102      0 12 ?
 *  a124.0.0.1/32     10.3.4.4          0     90      0 410 ?

CSR3#show bgp ipv4 unicast | begin 124.0.0.1
 *b   124.0.0.1/32     10.1.3.1          0    101      0 12 ?
 *>                    10.2.3.2          0    102      0 12 ?
 *  a                  10.3.4.4          0     90      0 410 ?
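The difference between the local backup path and the advertised group-best path is easier to see as an algorithm. Below is a minimal Python sketch of the behavior as described in this section, assuming local-preference is the only attribute that differs between the paths. The `Path` class and `select` function are hypothetical; real implementations run the full best-path algorithm within each neighbor-AS group.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Path:
    neighbor: str
    neighbor_as: int
    local_pref: int


def select(paths):
    # Rank all paths; only local-preference matters in this simplified model.
    ranked = sorted(paths, key=lambda p: -p.local_pref)
    best = ranked[0]
    backup = ranked[1] if len(ranked) > 1 else None

    # Keep the best path of each neighbor AS (first one seen in ranked order).
    per_as = {}
    for p in ranked:
        per_as.setdefault(p.neighbor_as, p)

    # The advertised group-best comes from an AS other than the best path's AS.
    others = [p for asn, p in per_as.items() if asn != best.neighbor_as]
    group_best = others[0] if others else None
    return best, backup, group_best


paths = [
    Path("CSR1", 12, 101),
    Path("CSR2", 12, 102),
    Path("CSR4", 410, 90),  # local-preference lowered from 104 to 90
]

best, backup, group_best = select(paths)
print(best.neighbor, backup.neighbor, group_best.neighbor)
```

With CSR4 at local-preference 90, the model picks CSR2 as best, CSR1 as the local backup, and CSR4 as the advertised group-best, matching the three distinct flags in the summary output above. Restoring CSR4 to 104 makes CSR4 best and CSR2 both the backup and the group-best, as in the earlier output.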

So far XRv11 has not been participating much in this test, mostly because of its inability to exclude the IGP metric from its best path selection. This has a more significant effect on non-add-path features like shadow RR and best-external. To test XR as a RR with add-path, we will allow CSR8 and CSR9 to consider IGP metric in their calculations again just so all RRs behave the same. We also disable "best-external" on CSR5 so we are allowed to enable add-path negotiations. We then disable the add-path advertisements on CSR9. Essentially, we are doing some basic clean-up so that we can focus on XR.

! CSR5
router bgp 100
 template peer-policy IPV4
  no advertise best-external
 address-family ipv4
  no bgp additional-paths select best-external
  bgp additional-paths receive
! CSR8
router bgp 100
 address-family ipv4
  no bgp bestpath igp-metric ignore
! CSR9
router bgp 100
 template peer-policy IPV4
  advertise additional-paths best 2
 address-family ipv4
  no bgp bestpath igp-metric ignore
  no bgp additional-paths select best 3
  no neighbor 14.14.14.14 advertise additional-paths best 3
! XRv11
router bgp 100
 session-group IBGP
  no shutdown
! XRv12
router bgp 100
 address-family ipv4 unicast
  no advertise best-external

Let's quickly verify that XRv11 has properly negotiated the add-path capability, only because we haven't used it in a while. Remember that we added the "send" capability on CSR3, the only PE with it configured. The remaining PEs are receive-capable while XRv11 advertises send capability.

RP/0/0/CPU0:XRv11#show bgp ipv4 unicast neighbors | utility egrep '^BGP neigh|Additional'
BGP neighbor is 3.3.3.3
  Additional-paths Send: advertised and received
  Additional-paths Receive: received
  Additional-paths operation: Send
BGP neighbor is 5.5.5.5
  Additional-paths Send: advertised


Additional-paths Receive: received Additional-paths operation: Send Additional-paths operation: None BGP neighbor is 12.12.12.12 Additional-paths Send: advertised Additional-paths Receive: received Additional-paths operation: Send Additional-paths operation: None BGP neighbor is 13.13.13.13 Additional-paths Send: advertised Additional-paths Receive: received Additional-paths operation: Send Additional-paths operation: None BGP neighbor is 14.14.14.14 Additional-paths Send: advertised Additional-paths Receive: received Additional-paths operation: Send Additional-paths operation: None

We solved the local-preference 200 problem earlier using advertise-best-external, which allowed CSR5 and XRv12 to send their eBGP routes to the RRs. This time, we will enable BGP add-path receive capability on XRv11 and add-path send capability on CSR5 / XRv12. Then, we verify the negotiations were successful. We see that with CSR5 and XRv12, both send and receive capabilities are negotiated bidirectionally. The configuration is not shown as we have seen it many times so far. RP/0/0/CPU0:XRv11#show bgp ipv4 unicast neighbors | utility egrep '^BGP neigh|Additional' BGP neighbor is 3.3.3.3 Additional-paths Send: advertised and received Additional-paths Receive: advertised and received Additional-paths operation: Send and Receive BGP neighbor is 5.5.5.5 Additional-paths Send: advertised and received Additional-paths Receive: advertised and received Additional-paths operation: Send and Receive Additional-paths operation: None BGP neighbor is 12.12.12.12 Additional-paths Send: advertised and received Additional-paths Receive: advertised and received Additional-paths operation: Send and Receive Additional-paths operation: None BGP neighbor is 13.13.13.13 Additional-paths Send: advertised Additional-paths Receive: advertised and received Additional-paths operation: Send Additional-paths operation: None BGP neighbor is 14.14.14.14 Additional-paths Send: advertised


Additional-paths Receive: advertised and received Additional-paths operation: Send Additional-paths operation: None

Recall that XRv13 is setting local-preference 200 inbound for the prefix 28.10.45.0/25, making it the preferred exit point from AS 100. All of the other PEs are going to have three copies of that route, one from each RR; since each RR only learns this one path, it is the only thing it can advertise. The PEs facing AS 6 also have their less-preferred eBGP routes, which can be considered additional paths. We check the detailed output on CSR5 and the brief output on XRv12 to prove it.

CSR5#show bgp ipv4 un 28.10.45.0/25
BGP routing table entry for 28.10.45.0/25, version 444
Paths: (4 available, best #2, table default)
  Additional-path-install
  Advertised to update-groups:
     13
  Refresh Epoch 1
  6 78 454
    13.13.13.13 (metric 20) from 11.11.11.11 (11.11.11.11)
      Origin incomplete, metric 0, localpref 200, valid, internal
      Originator: 13.13.13.13, Cluster list: 11.11.11.11
      rx pathid: 0x1, tx pathid: 0
  Refresh Epoch 2
  6 78 454
    13.13.13.13 (metric 20) from 8.8.8.8 (8.8.8.8)
      Origin incomplete, metric 0, localpref 200, valid, internal, best
      Originator: 13.13.13.13, Cluster list: 8.8.8.8
      rx pathid: 0x0, tx pathid: 0x0
  Refresh Epoch 2
  6 78 454
    13.13.13.13 (metric 20) from 9.9.9.9 (9.9.9.9)
      Origin incomplete, metric 0, localpref 200, valid, internal
      Originator: 13.13.13.13, Cluster list: 9.9.9.9
      rx pathid: 0x0, tx pathid: 0
  Refresh Epoch 2
  6 78 454
    10.5.6.6 from 10.5.6.6 (6.6.6.6)
      Origin incomplete, metric 0, localpref 100, valid, external, backup/repair , recursive-via-connected
      rx pathid: 0, tx pathid: 0

RP/0/0/CPU0:XRv12#show bgp ipv4 unicast 28.10.45.0/25 brief | begin Network
   Network            Next Hop        Metric LocPrf Weight Path
*>i28.10.45.0/25      13.13.13.13          0    200      0 6 78 454 ?
* i                   13.13.13.13          0    200      0 6 78 454 ?
*                     10.6.12.6            0             0 6 78 454 ?
* i                   13.13.13.13          0    200      0 6 78 454 ?


Next, we configure both CSR5 and XRv12 to select all paths as additional paths. XRv12 automatically advertises them to neighbors without a neighbor-level command, while CSR5 requires it. BGP is smart enough to know that the three copies of the exact same route on CSR5 and XRv12 do not qualify as additional paths, and additionally, the PEs are not RRs (so basic iBGP says we cannot advertise them further). We have effectively allowed the PEs to advertise their best external routes using add-path capabilities as an alternative method to achieve the same result. A quick look at XRv11 confirms this; notice the variance in next-hops and local-preference values. BGP still selects the proper best-path, though. There is still a problem: XRv11 is not advertising any of its additional-paths to clients on the other side of the network, such as CSR3 and XRv14. BGP add-path is less effective if only the RRs have visibility of the additional paths.

RP/0/0/CPU0:XRv11#show bgp ipv4 unicast 28.10.45.0/25 brief | begin Network
   Network            Next Hop        Metric LocPrf Weight Path
* i28.10.45.0/25      5.5.5.5              0    100      0 6 78 454 ?
* i                   12.12.12.12          0    100      0 6 78 454 ?
*>i                   13.13.13.13          0    200      0 6 78 454 ?

RP/0/0/CPU0:XRv11#show bgp ipv4 unicast neighbors 3.3.3.3 advertised-routes [snip] 28.10.44.128/25 5.5.5.5 5.5.5.5 6 78 454? 28.10.45.0/25 13.13.13.13 13.13.13.13 6 78 454? 28.10.45.128/25 5.5.5.5 5.5.5.5 6 78 454? [snip]

We can correct this by specifically telling XRv11 to advertise backup paths. For some reason, XRv only lets me use the number 1, even with parameterization, but the documentation claims it can be between 0 and 7 (3 bits). After applying this configuration snippet, XRv11 selects a single back-up path via CSR5 (lower RID than XRv12) and advertises this to its neighbors. We verify the advertisement and also that the route is marked as "backup" and "add-path" in the BGP table. This is equivalent to “best2” in XE. ! XRv11 route-policy RPL_ADV_ADD_PATHS if destination in (28.10.45.0/25) then set path-selection backup 1 advertise endif end-policy router bgp 100 address-family ipv4 unicast no additional-paths selection route-policy RPL additional-paths selection route-policy RPL_ADV_ADD_PATHS RP/0/0/CPU0:XRv11#show bgp ipv4 unicast neighbors 3.3.3.3 advertised-routes | begin 28.10.45.0/25 28.10.45.0/25 13.13.13.13 13.13.13.13 6 78 454? 5.5.5.5 5.5.5.5 6 78 454?

782 © 2016 Nicholas J. Russo

[snip]
RP/0/0/CPU0:XRv11#show bgp ipv4 unicast 28.10.45.0/25
BGP routing table entry for 28.10.45.0/25
Versions:
  Process           bRIB/RIB  SendTblVer
  Speaker                290         290
Paths: (3 available, best #3)
  Advertised to update-groups (with more than one peer):
    0.1
  Path #1: Received by speaker 0
  Advertised to update-groups (with more than one peer):
    0.1
  6 78 454, (Received from a RR-client)
    5.5.5.5 (metric 10) from 5.5.5.5 (5.5.5.5)
      Origin incomplete, metric 0, localpref 100, valid, internal, backup, add-path
      Received Path ID 1, Local Path ID 2, version 290
  Path #2: Received by speaker 0
  Not advertised to any peer
  6 78 454, (Received from a RR-client)
    12.12.12.12 (metric 10) from 12.12.12.12 (12.12.12.12)
      Origin incomplete, metric 0, localpref 100, valid, internal
      Received Path ID 4, Local Path ID 0, version 0
  Path #3: Received by speaker 0
  Advertised to update-groups (with more than one peer):
    0.1
  6 78 454, (Received from a RR-client)
    13.13.13.13 (metric 30) from 13.13.13.13 (13.13.13.13)
      Origin incomplete, metric 0, localpref 200, valid, internal, best, group-best
      Received Path ID 0, Local Path ID 1, version 247

A summary view of CSR3's BGP table shows this additional route with the 'b' flag for backup. CSR3 was already configured to install these backup paths and was just waiting for XRv11 to advertise one.

CSR3#show bgp ipv4 unicast | begin 28.10.45.0/25
 *bi 28.10.45.0/25    5.5.5.5              0    100      0 6 78 454 ?
 * i                  13.13.13.13          0    200      0 6 78 454 ?
 * i                  13.13.13.13          0    200      0 6 78 454 ?
 *>i                  13.13.13.13          0    200      0 6 78 454 ?
[snip]

CSR3#show bgp ipv4 unicast neighbors 11.11.11.11 routes | begin 28.10.45.0/25
 *bi 28.10.45.0/25    5.5.5.5              0    100      0 6 78 454 ?
 * i                  13.13.13.13          0    200      0 6 78 454 ?


It would be even better if we could get the path from XRv12 advertised to our left-side PEs via XRv11 as well. A simple RPL modification to select all available paths (there doesn't appear to be a best3 equivalent; it's 0, 1, or all) solves this. CSR3 still has 3 copies of the best-path, but now has two additional paths received from XRv11. I show the RPL configuration, the XRv11 advertisement to CSR3, the CSR3 BGP table displaying all 5 paths, and the route reception on CSR3 from XRv11. The path via XRv12 may not be a hot-standby, but having it locally can still be valuable for future design changes, such as ECMP, without needing to reconverge BGP.

! XRv11
route-policy RPL_ADV_ADD_PATHS
 if destination in (28.10.45.0/25) then
  set path-selection all advertise
 endif
end-policy

RP/0/0/CPU0:XRv11#show bgp ipv4 unicast neighbors 3.3.3.3 advertised-routes | begin 28.10.45.0/25
28.10.45.0/25      13.13.13.13     13.13.13.13     6 78 454?
                   5.5.5.5         5.5.5.5         6 78 454?
                   12.12.12.12     12.12.12.12     6 78 454?
[snip]

CSR3#show bgp ipv4 unicast | begin 28.10.45.0/25
 * i 28.10.45.0/25    12.12.12.12          0    100      0 6 78 454 ?
 *bi                  5.5.5.5              0    100      0 6 78 454 ?
 * i                  13.13.13.13          0    200      0 6 78 454 ?
 * i                  13.13.13.13          0    200      0 6 78 454 ?
 *>i                  13.13.13.13          0    200      0 6 78 454 ?

CSR3#show bgp ipv4 unicast neighbors 11.11.11.11 routes | begin 28.10.45.0/25
 * i 28.10.45.0/25    12.12.12.12          0    100      0 6 78 454 ?
 *bi                  5.5.5.5              0    100      0 6 78 454 ?
 * i                  13.13.13.13          0    200      0 6 78 454 ?
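The difference between advertising only the best path, the best path plus one backup, and all paths can be modeled with a rough Python sketch. This is only an illustration under simplifying assumptions: the ranking compares local-preference and then the numerically lowest router ID, which is far fewer steps than real BGP best-path, and the mode names ("best", "backup1", "all") are hypothetical labels of mine, not CLI keywords.

```python
from dataclasses import dataclass
from ipaddress import IPv4Address


@dataclass(frozen=True)
class Path:
    next_hop: str     # BGP next-hop, also used here as the advertising router's ID
    local_pref: int


def rank(paths):
    """Order paths best-first using a simplified two-step comparison."""
    # Higher local-pref wins; ties fall to the numerically lowest router ID.
    return sorted(paths, key=lambda p: (-p.local_pref, int(IPv4Address(p.next_hop))))


def paths_to_advertise(paths, mode):
    """Return the paths a reflector would advertise for a hypothetical mode name."""
    ordered = rank(paths)
    if mode == "best":        # classic behavior: best path only
        return ordered[:1]
    if mode == "backup1":     # best path plus one backup
        return ordered[:2]
    if mode == "all":         # every available path
        return ordered
    raise ValueError(f"unknown mode: {mode}")


# The three paths XRv11 holds for 28.10.45.0/25 in this lab.
xrv11_paths = [Path("5.5.5.5", 100), Path("12.12.12.12", 100), Path("13.13.13.13", 200)]
for mode in ("best", "backup1", "all"):
    print(mode, [p.next_hop for p in paths_to_advertise(xrv11_paths, mode)])
```

Note that the router IDs are compared as integers; comparing them as strings would incorrectly rank 12.12.12.12 ahead of 5.5.5.5.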

So far, we have focused on IPv4 unicast. IPv6 unicast works identically, and one would expect all AFs to work this way. Unfortunately, current XE code only supports the send/receive negotiated capability for IPv4/IPv6 unicast/multicast. All other AFs, such as VPNv4, VPNv6, L2VPN variations, RT-filter, NSAP, etc., do not currently support this capability. The CSR does support backup path selection and best-external advertisement for VPNv4 and VPNv6, but supports no BGP add-path features for the remaining AFs. XR appears to fully support the feature for all AFs, even new AFIs such as BGP-LS and BGP flowspec. That being said, we will quickly test IPv6 unicast in our setup using 6PE, as the provider core is not IPv6 enabled. XRv11's BGP process is shut down for this test because it does not support ignoring the IGP metric in the best-path calculation.

! CSR8 and CSR9
router bgp 100
 address-family ipv6 unicast
  bgp bestpath igp-metric ignore

At this point, both CSR8 and CSR9 select CSR5 as their best path due to having the lowest BGP RID. The result on CSR3 and other remote PEs is that they receive multiple copies of the same path with the same BGP next-hop. A failure of CSR5 would break 6PE despite the existence of alternative paths.

CSR3#show bgp ipv6 unicast 2001:28:14:170::/64
BGP routing table entry for 2001:28:14:170::/64, version 333
Paths: (2 available, best #2, table default)
  Advertised to update-groups:
     5          6          7
  Refresh Epoch 1
  6
    ::FFFF:5.5.5.5 (metric 20) from 9.9.9.9 (9.9.9.9)
      Origin incomplete, metric 0, localpref 100, valid, internal
      Originator: 5.5.5.5, Cluster list: 9.9.9.9
      mpls labels in/out nolabel/5044
      rx pathid: 0, tx pathid: 0
  Refresh Epoch 1
  6
    ::FFFF:5.5.5.5 (metric 20) from 8.8.8.8 (8.8.8.8)
      Origin incomplete, metric 0, localpref 100, valid, internal, best
      Originator: 5.5.5.5, Cluster list: 8.8.8.8
      mpls labels in/out nolabel/5044
      rx pathid: 0, tx pathid: 0x0

We will configure CSR9 to select its backup path for advertisement using the minimum number of commands. CSR9 is instructed to select its backup path as the additional path, then advertise it to all IPv6 labeled-unicast neighbors. CSR3 still picks CSR5 as the exit point due to lower originator ID, but the backup path is XRv12. Ideally, we should configure CSR9 to install this as a backup path also, but that technically isn't a function of the shadow RR method, just an additional optimization. We verify that this backup path exists in the BGP table and the FIB. CSR3 will ECMP to the primary destination of CSR5 via CSR9 and XRv11, while the backup path routes via XRv11 exclusively.

! CSR9
router bgp 100
 template peer-policy IPV6_LUCAST
  advertise diverse-path backup
 address-family ipv6 unicast
  bgp additional-paths select backup
! CSR3
router bgp 100
 address-family ipv6
  bgp additional-paths install


CSR3#show bgp ipv6 unicast 2001:28:14:170::/64
BGP routing table entry for 2001:28:14:170::/64, version 347
Paths: (2 available, best #2, table default)
  Additional-path-install
  Advertised to update-groups:
     5          6          7
  Refresh Epoch 4
  6
    ::FFFF:12.12.12.12 (metric 20) from 9.9.9.9 (9.9.9.9)
      Origin incomplete, metric 0, localpref 100, valid, internal, backup/repair
      Originator: 12.12.12.12, Cluster list: 9.9.9.9
      mpls labels in/out nolabel/92026
      rx pathid: 0, tx pathid: 0
  Refresh Epoch 3
  6
    ::FFFF:5.5.5.5 (metric 20) from 8.8.8.8 (8.8.8.8)
      Origin incomplete, metric 0, localpref 100, valid, internal, best
      Originator: 5.5.5.5, Cluster list: 8.8.8.8
      mpls labels in/out nolabel/5044
      rx pathid: 0, tx pathid: 0x0

CSR3#show ipv6 cef 2001:28:14:170::/64 detail
2001:28:14:170::/64, epoch 0, flags [rib defined all labels]
  recursive via 5.5.5.5 label 5044
    nexthop 153.3.9.9 GigabitEthernet2.539 label 9037
    nexthop 153.3.11.11 GigabitEthernet2.531 label 91018
  recursive via 12.12.12.12 label 92026, repair
    nexthop 153.3.11.11 GigabitEthernet2.531 label 91010
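The practical test for whether a remote PE can survive the loss of its exit point is whether it holds at least two distinct next-hops for the prefix. A minimal Python sketch of that check follows; the function name and the tuple layout are my own assumptions for illustration, and the data is copied from the two CSR3 outputs for 2001:28:14:170::/64 (before and after the diverse-path change on CSR9).

```python
def has_diverse_backup(paths):
    """paths: iterable of (next_hop, learned_from) tuples for one prefix."""
    # Multiple copies of the same next-hop add no resiliency, so count unique values.
    return len({next_hop for next_hop, _ in paths}) > 1


# CSR3 before CSR9 advertises a diverse path: both RRs reflect the CSR5 path.
before = [("::FFFF:5.5.5.5", "9.9.9.9"), ("::FFFF:5.5.5.5", "8.8.8.8")]
# CSR3 after the change: CSR9 now reflects the XRv12 path instead.
after = [("::FFFF:12.12.12.12", "9.9.9.9"), ("::FFFF:5.5.5.5", "8.8.8.8")]

print(has_diverse_backup(before), has_diverse_backup(after))
```

Counting paths alone is misleading here: CSR3 has two paths in both cases, but only the second set gives it somewhere else to send traffic.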

Next, we will demonstrate the PEs each advertising their best-external routes, as this does not require add-path negotiation. XRv13 will set the local preference on the prefix 2001:28:14:170::/64 to 200 so that it is the preferred exit point from AS 100. We demonstrate the power of RPL again by using our highly generic existing configuration. Both CSR5 and XRv12 consider this their best path, and because it is iBGP learned (and these PEs are not RRs), it cannot be advertised to the RRs.

! XRv13
prefix-set PS_COOL_ROUTE_IPV6
 2001:28:14:170::/64
end-set
router bgp 100
 neighbor 2001:10:6:13::6
  remote-as 6
  address-family ipv6 unicast
   route-policy RPL_SET_LPREF_TO_X_FOR_Y(200, PS_COOL_ROUTE_IPV6) in

RP/0/0/CPU0:XRv12#show bgp ipv6 unicast 2001:28:14:170::/64 | begin Paths


Paths: (3 available, best #1)
  Not advertised to any peer
  Path #1: Received by speaker 0
  Not advertised to any peer
  6
    13.13.13.13 (metric 20) from 8.8.8.8 (13.13.13.13)
      Received Label 93009
      Origin incomplete, metric 0, localpref 200, valid, internal, best, group-best
      Received Path ID 0, Local Path ID 1, version 22
      Originator: 13.13.13.13, Cluster list: 8.8.8.8
  Path #2: Received by speaker 0
  Not advertised to any peer
  6
    13.13.13.13 (metric 20) from 9.9.9.9 (13.13.13.13)
      Received Label 93009
      Origin incomplete, metric 0, localpref 200, valid, internal
      Received Path ID 0, Local Path ID 0, version 0
      Originator: 13.13.13.13, Cluster list: 9.9.9.9
  Path #3: Received by speaker 0
  Not advertised to any peer
  6
    2001:10:6:12::6 from 2001:10:6:12::6 (6.6.6.6)
      Origin incomplete, metric 0, localpref 100, valid, external
      Received Path ID 0, Local Path ID 0, version 0
      Origin-AS validity: not-found

CSR5#show bgp ipv6 unicast 2001:28:14:170::/64 | begin Paths
Paths: (3 available, best #2, table default)
  Advertised to update-groups:
     6
  Refresh Epoch 2
  6
    ::FFFF:13.13.13.13 (metric 20) from 9.9.9.9 (9.9.9.9)
      Origin incomplete, metric 0, localpref 200, valid, internal
      Originator: 13.13.13.13, Cluster list: 9.9.9.9
      mpls labels in/out 5044/93009
      rx pathid: 0, tx pathid: 0
  Refresh Epoch 1
  6
    ::FFFF:13.13.13.13 (metric 20) from 8.8.8.8 (8.8.8.8)
      Origin incomplete, metric 0, localpref 200, valid, internal, best
      Originator: 13.13.13.13, Cluster list: 8.8.8.8
      mpls labels in/out 5044/93009
      rx pathid: 0, tx pathid: 0x0
  Refresh Epoch 1
  6
    2001:10:5:6::6 (FE80::6) from 10.5.6.6 (6.6.6.6)
      Origin incomplete, metric 0, localpref 100, valid, external
      mpls labels in/out 5044/nolabel
      rx pathid: 0, tx pathid: 0

The result is that the RRs only have a single route to exit the AS, and this is all they distribute to their peers. The remote PEs, such as CSR3, are going to see the same copy of that single route. Our existing solution of using shadow RR in isolation won't solve the problem because the RRs only see one path in the first place.

CSR9#show bgp ipv6 unicast 2001:28:14:170::/64
BGP routing table entry for 2001:28:14:170::/64, version 31
BGP Bestpath: igpmetric-ignore
Paths: (1 available, best #1, table default)
  Advertised to update-groups:
     2
  Refresh Epoch 1
  6, (Received from a RR-client)
    ::FFFF:13.13.13.13 (metric 10) from 13.13.13.13 (13.13.13.13)
      Origin incomplete, metric 0, localpref 200, valid, internal, best
      mpls labels in/out nolabel/93009
      rx pathid: 0, tx pathid: 0x0

CSR3#show bgp ipv6 unicast 2001:28:14:170::/64
BGP routing table entry for 2001:28:14:170::/64, version 355
Paths: (2 available, best #2, table default)
  Additional-path-install
  Advertised to update-groups:
     5          6          7
  Refresh Epoch 5
  6
    ::FFFF:13.13.13.13 (metric 20) from 9.9.9.9 (9.9.9.9)
      Origin incomplete, metric 0, localpref 200, valid, internal
      Originator: 13.13.13.13, Cluster list: 9.9.9.9
      mpls labels in/out nolabel/93009
      rx pathid: 0, tx pathid: 0
  Refresh Epoch 4
  6
    ::FFFF:13.13.13.13 (metric 20) from 8.8.8.8 (8.8.8.8)
      Origin incomplete, metric 0, localpref 200, valid, internal, best
      Originator: 13.13.13.13, Cluster list: 8.8.8.8
      mpls labels in/out nolabel/93009
      rx pathid: 0, tx pathid: 0x0

As seen earlier, we can use best-external to solve this. CSR5, XRv12, and XRv13 are configured to advertise their best external routes to the RRs, giving them a better view of the network. Then, our existing shadow RR configuration ensures that a diverse path is advertised to the remote PEs. Note that these PEs must also install this path as a backup before they are able to advertise it to the RR. Unfortunately, this "selection" keyword on XR seems to only apply to the negotiated add-path capability. XRv12 claims the path was selected as a backup, but it was not advertised to the RRs, probably because it expects the full negotiated capability to be exchanged.

! CSR5
router bgp 100
 template peer-policy IPV6_INTERNAL
  advertise best-external
 address-family ipv6 unicast
  bgp additional-paths select best-external
  bgp additional-paths install
! XRv12 (should be on XRv13 also to account for failovers)
route-policy RPL_BACKUP_1
 set path-selection backup 1 advertise install
end-policy
router bgp 100
 address-family ipv6 unicast
  additional-paths selection route-policy RPL_BACKUP_1

RP/0/0/CPU0:XRv12#show bgp ipv6 labeled-unicast neighbors 9.9.9.9 advertised-routes
Network              Next Hop        From              AS Path
::6:6:6:6/128        12.12.12.12     2001:10:6:12::6   6?
2001:28:10:44::/64   12.12.12.12     2001:10:6:12::6   6?
2001:28:10:45::/64   12.12.12.12     2001:10:6:12::6   6?
2001:28:10:46::/64   12.12.12.12     2001:10:6:12::6   6?
2001:28:10:47::/64   12.12.12.12     2001:10:6:12::6   6?

RP/0/0/CPU0:XRv12#show bgp ipv6 unicast 2001:28:14:170::/64 | begin Paths
Paths: (3 available, best #1)
  Not advertised to any peer
  Path #1: Received by speaker 0
  Not advertised to any peer
  6
    13.13.13.13 (metric 20) from 8.8.8.8 (13.13.13.13)
      Received Label 93009
      Origin incomplete, metric 0, localpref 200, valid, internal, best, group-best
      Received Path ID 0, Local Path ID 1, version 22
      Originator: 13.13.13.13, Cluster list: 8.8.8.8
  Path #2: Received by speaker 0
  Not advertised to any peer
  6
    13.13.13.13 (metric 20) from 9.9.9.9 (13.13.13.13)
      Received Label 93009
      Origin incomplete, metric 0, localpref 200, valid, internal
      Received Path ID 0, Local Path ID 0, version 0
      Originator: 13.13.13.13, Cluster list: 9.9.9.9
  Path #3: Received by speaker 0
  Not advertised to any peer
  6
    2001:10:6:12::6 from 2001:10:6:12::6 (6.6.6.6)
      Origin incomplete, metric 0, localpref 100, valid, external, backup, add-path
      Received Path ID 0, Local Path ID 2, version 30
      Origin-AS validity: not-found

At a minimum, the RRs receive the best-external from CSR5, which can be advertised to the remote PEs using the diverse-path (shadow RR) method. Another very odd issue can be observed on CSR9. Notice that the IPv6 next-hop is reachable, yet BGP marks it as inaccessible. This is because there is no MPLS label associated with the 6PE prefix. I don't have a great reason for this, but the debug on CSR9 reveals that the MPLS label for this backup path was not advertised with the prefix from CSR5. Notice that all the other routes have labels.

CSR9#show bgp ipv6 unicast 2001:28:14:170::/64
BGP routing table entry for 2001:28:14:170::/64, version 70
BGP Bestpath: igpmetric-ignore
Paths: (2 available, best #2, table default)
  Additional-path-install
  Advertised to update-groups:
     2
  Refresh Epoch 2
  6, (Received from a RR-client)
    ::FFFF:5.5.5.5 (inaccessible) from 5.5.5.5 (5.5.5.5)
      Origin incomplete, metric 0, localpref 100, valid, internal
      rx pathid: 0, tx pathid: 0
  Refresh Epoch 1
  6, (Received from a RR-client)
    ::FFFF:13.13.13.13 (metric 10) from 13.13.13.13 (13.13.13.13)
      Origin incomplete, metric 0, localpref 200, valid, internal, best
      mpls labels in/out nolabel/93009
      rx pathid: 0, tx pathid: 0x0

! CSR9
BGP(1): 5.5.5.5 rcvd ::6:6:6:6/128, label 5046
BGP(1): 5.5.5.5 rcvd 2001:28:14:170::/64
BGP(1): 5.5.5.5 rcvd 2001:28:10:44::/64, label 5047
BGP(1): 5.5.5.5 rcvd 2001:28:10:45::/64, label 5048
BGP(1): 5.5.5.5 rcvd 2001:28:10:46::/64, label 5049
BGP(1): 5.5.5.5 rcvd 2001:28:10:47::/64, label 5050

On CSR5, when debugging outbound updates, we see that the label was never sent. Digging deeper, we see that a local label was not even allocated. This is true on both CSR5 and XRv12, and I am concluding that the older, non-negotiated add-path features were not designed to work with labeled-unicast. MPLS wasn't supposed to reinvent the wheel, but rely on existing IGP and BGP routing, and since the shadow RR method is breaking rules without any active communication, MPLS may not understand. BGP may see no reason to allocate labels (the local labels are highlighted) for paths that are not best. So, using additional-path best-external with labeled-unicast AFI does not appear to be a valid design.

! CSR5
BGP(1): (base) 8.8.8.8 send UPDATE (format) ::6:6:6:6/128, next ::FFFF:5.5.5.5, label 5046, metric 0, path 6, label 5046
BGP(1): (base) 8.8.8.8 send UPDATE (format) 2001:28:14:170::/64, next ::FFFF:5.5.5.5, metric 0, path 6

CSR5#show bgp ipv6 unicast labels | begin 2001:28:14:170::/64
   2001:28:14:170::/64
                    ::FFFF:13.13.13.13    nolabel/93009
                    ::FFFF:13.13.13.13    nolabel/93009
                    2001:10:5:6::6        nolabel/nolabel

RP/0/0/CPU0:XRv12#show bgp ipv6 labeled-unicast labels | begin 2001:28:14:170::/64
*>i2001:28:14:170::/64   13.13.13.13     93009      nolabel
* i                      13.13.13.13     93009      nolabel

Let's try to salvage this design by removing the best-external configuration from CSR5 and enabling the BGP add-path negotiation. We will enable send/receive capabilities on the PEs and RRs so we can do all our testing bidirectionally as needed. For completeness, we will apply these changes on all relevant devices. This also means we can bring XRv11 back into the network. The snippets are shown below, along with the basic neighbor negotiation verifications. I am only showing outputs from the RRs to save time; if they see the proper capabilities sent and received, it's fair to assume the PEs are configured properly for add-path. It can be time consuming to do these basic verifications, but trying to troubleshoot why additional paths aren't being advertised is much more difficult.

! XRv11, XRv12, XRv13, XRv14
router bgp 100
 address-family ipv6 unicast
  additional-paths send
  additional-paths receive
! CSR3, CSR5, CSR8, CSR9
router bgp 100
 address-family ipv6 unicast
  bgp additional-paths send receive

CSR8#show bgp ipv6 unicast neighbors 3.3.3.3 | include IPv6_U|Additional
 For address family: IPv6 Unicast
  Additional Paths send capability: advertised and received
  Additional Paths receive capability: advertised and received
CSR8#show bgp ipv6 unicast neighbors 5.5.5.5 | include IPv6_U|Additional
 For address family: IPv6 Unicast
  Additional Paths send capability: advertised and received
  Additional Paths receive capability: advertised and received
CSR8#show bgp ipv6 unicast neighbors 14.14.14.14 | include IPv6_U|Additional
 For address family: IPv6 Unicast
  Additional Paths send capability: advertised and received
  Additional Paths receive capability: advertised and received
CSR8#show bgp ipv6 unicast neighbors 13.13.13.13 | include IPv6_U|Additional
 For address family: IPv6 Unicast
  Additional Paths send capability: advertised and received
  Additional Paths receive capability: advertised and received
CSR8#show bgp ipv6 unicast neighbors 12.12.12.12 | include IPv6_U|Additional
 For address family: IPv6 Unicast
  Additional Paths send capability: advertised and received
  Additional Paths receive capability: advertised and received

CSR9#show bgp ipv6 unicast neighbors 3.3.3.3 | include IPv6_U|Additional
 For address family: IPv6 Unicast
  Additional Paths send capability: advertised and received
  Additional Paths receive capability: advertised and received
CSR9#show bgp ipv6 unicast neighbors 5.5.5.5 | include IPv6_U|Additional
 For address family: IPv6 Unicast
  Additional Paths send capability: advertised and received
  Additional Paths receive capability: advertised and received
CSR9#show bgp ipv6 unicast neighbors 14.14.14.14 | include IPv6_U|Additional
 For address family: IPv6 Unicast
  Additional Paths send capability: advertised and received
  Additional Paths receive capability: advertised and received
CSR9#show bgp ipv6 unicast neighbors 13.13.13.13 | include IPv6_U|Additional
 For address family: IPv6 Unicast
  Additional Paths send capability: advertised and received
  Additional Paths receive capability: advertised and received
CSR9#show bgp ipv6 unicast neighbors 12.12.12.12 | include IPv6_U|Additional
 For address family: IPv6 Unicast
  Additional Paths send capability: advertised and received
  Additional Paths receive capability: advertised and received

RP/0/0/CPU0:XRv11#show bgp ipv6 labeled-unicast neighbors 3.3.3.3 | utility egrep 'IPv6 Lab|Additional'
 For Address Family: IPv6 Labeled-unicast
  Additional-paths Send: advertised and received
  Additional-paths Receive: advertised and received
  Additional-paths operation: Send and Receive
RP/0/0/CPU0:XRv11#show bgp ipv6 labeled-unicast neighbors 5.5.5.5 | utility egrep 'IPv6 Lab|Additional'
 For Address Family: IPv6 Labeled-unicast
  Additional-paths Send: advertised and received
  Additional-paths Receive: advertised and received
  Additional-paths operation: Send and Receive
  Additional-paths operation: None
RP/0/0/CPU0:XRv11#show bgp ipv6 labeled-unicast neighbors 14.14.14.14 | utility egrep 'IPv6 Lab|Additional'
 For Address Family: IPv6 Labeled-unicast
  Additional-paths Send: advertised and received
  Additional-paths Receive: advertised and received
  Additional-paths operation: Send and Receive
  Additional-paths operation: None
RP/0/0/CPU0:XRv11#show bgp ipv6 labeled-unicast neighbors 13.13.13.13 | utility egrep 'IPv6 Lab|Additional'
 For Address Family: IPv6 Labeled-unicast
  Additional-paths Send: advertised and received
  Additional-paths Receive: advertised and received
  Additional-paths operation: Send and Receive
  Additional-paths operation: None
RP/0/0/CPU0:XRv11#show bgp ipv6 labeled-unicast neighbors 12.12.12.12 | utility egrep 'IPv6 Lab|Additional'
 For Address Family: IPv6 Labeled-unicast
  Additional-paths Send: advertised and received
  Additional-paths Receive: advertised and received
  Additional-paths operation: Send and Receive
  Additional-paths operation: None
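Since these per-neighbor checks are repetitive, they lend themselves to scripting once the show output has been captured by whatever means you prefer. The Python sketch below only parses text; it does not connect to any device. The function name is hypothetical, and the regular expression is an assumption based solely on the XE and XR output formats shown in this section, so other software versions may print these lines differently.

```python
import re

# Matches both "Additional Paths send capability: ..." (XE) and
# "Additional-paths Send: ..." (XR) as they appear in this lab's output.
_PATTERN = r"Additional[ -]paths {direction}(?: capability)?:\s*advertised and received"


def addpath_negotiated(show_output):
    """Return True only if both send and receive were advertised and received."""
    return all(
        re.search(_PATTERN.format(direction=direction), show_output, re.IGNORECASE)
        for direction in ("send", "receive")
    )


xe_sample = """ For address family: IPv6 Unicast
  Additional Paths send capability: advertised and received
  Additional Paths receive capability: advertised and received"""

xr_sample = """ For Address Family: IPv6 Labeled-unicast
  Additional-paths Send: advertised and received
  Additional-paths Receive: advertised and received
  Additional-paths operation: Send and Receive"""

# A made-up example of a one-sided negotiation, for contrast.
half_sample = """ For address family: IPv6 Unicast
  Additional Paths send capability: advertised
  Additional Paths receive capability: advertised and received"""

for name, text in (("xe", xe_sample), ("xr", xr_sample), ("half", half_sample)):
    print(name, addpath_negotiated(text))
```

Running this over the saved output of every RR neighbor is a quick way to spot a peer where only one side has the capability configured.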

Before checking any specific BGP routing information, let's see if CSR5 allocates a label for this backup path now. It does, allocating label value 5007; this is BGP 6PE in action.

! CSR5
router bgp 100
 template peer-policy IPV6_INTERNAL
  advertise additional-paths best 2
 address-family ipv6 unicast
  bgp additional-paths select best 2

CSR5#show bgp ipv6 unicast labels | begin 2001:28:14:170::/64
   2001:28:14:170::/64
                    ::FFFF:13.13.13.13    5007/93009
                    ::FFFF:13.13.13.13    5007/93009
                    ::FFFF:13.13.13.13    5007/93009
                    2001:10:5:6::6        5007/nolabel

! XRv12
router bgp 100
 address-family ipv6 unicast
  additional-paths selection route-policy RPL_BACKUP_ALL

RP/0/0/CPU0:XRv12#show bgp ipv6 labeled-unicast labels | begin 2001:28:14:170::/64
*>i2001:28:14:170::/64   13.13.13.13        93009      92002
* i                      13.13.13.13        93009      92002
* i                      5.5.5.5            5007       92002
* i                      13.13.13.13        93009      92002
*                        2001:10:6:12::6    nolabel    92002

CSR9 has the backup routes from CSR5 and XRv12 now. Technically, the existing selection and advertisement of the diverse-path could work, but with add-path negotiation enabled, it makes more sense to use that. The parser also tells you that the non-negotiated add-path features are silently ignored when negotiated add-path is configured. I will configure CSR9 to select its best 3 paths for backup, and advertise 2 towards remote PEs by default. XRv14 will be an exception and receive all 3 paths. We can see the "best2" path from CSR5 and the "best3" path from XRv12 on the RR (CSR9). We see that CSR3 has 3 copies of the best-path, and selects the CSR5 path as the backup. XRv14 also has 3 copies of the best path plus backups via CSR5 and XRv12 (only XRv14 sees the path via XRv12, though).

! CSR9
router bgp 100
 template peer-policy IPV6_LUCAST
  no advertise diverse-path backup
  advertise additional-paths best 2
 address-family ipv6 unicast
  no bgp additional-paths select backup
  bgp additional-paths select best 3
  neighbor 14.14.14.14 advertise additional-paths best 3

CSR9#show bgp ipv6 unicast 2001:28:14:170::/64
BGP routing table entry for 2001:28:14:170::/64, version 32
Paths: (3 available, best #2, table default)
  Additional-path-install
  Path advertised to update-groups:
     4
  Refresh Epoch 1


  6, (Received from a RR-client)
    ::FFFF:12.12.12.12 (metric 30) from 12.12.12.12 (12.12.12.12)
      Origin incomplete, metric 0, localpref 100, valid, internal, best3
      mpls labels in/out nolabel/92002
      rx pathid: 0x2, tx pathid: 0x2
  Path advertised to update-groups:
     3          4
  Refresh Epoch 1
  6, (Received from a RR-client)
    ::FFFF:13.13.13.13 (metric 10) from 13.13.13.13 (13.13.13.13)
      Origin incomplete, metric 0, localpref 200, valid, internal, best
      mpls labels in/out nolabel/93009
      rx pathid: 0x1, tx pathid: 0x0
  Path advertised to update-groups:
     3          4
  Refresh Epoch 2
  6, (Received from a RR-client)
    ::FFFF:5.5.5.5 (metric 10) from 5.5.5.5 (5.5.5.5)
      Origin incomplete, metric 0, localpref 100, valid, internal, backup/repair, best2
      mpls labels in/out nolabel/5007
      rx pathid: 0x1, tx pathid: 0x1

CSR3#show bgp ipv6 unicast 2001:28:14:170::/64
BGP routing table entry for 2001:28:14:170::/64, version 694
Paths: (4 available, best #4, table default)
  Additional-path-install
  Advertised to update-groups:
     5          6          7
  Refresh Epoch 1
  6
    ::FFFF:5.5.5.5 (metric 20) from 9.9.9.9 (9.9.9.9)
      Origin incomplete, metric 0, localpref 100, valid, internal, backup/repair
      Originator: 5.5.5.5, Cluster list: 9.9.9.9
      mpls labels in/out nolabel/5007
      rx pathid: 0x1, tx pathid: 0
  Refresh Epoch 1
  6
    ::FFFF:13.13.13.13 (metric 20) from 9.9.9.9 (9.9.9.9)
      Origin incomplete, metric 0, localpref 200, valid, internal
      Originator: 13.13.13.13, Cluster list: 9.9.9.9
      mpls labels in/out nolabel/93009
      rx pathid: 0x0, tx pathid: 0
  Refresh Epoch 1
  6
    ::FFFF:13.13.13.13 (metric 20) from 11.11.11.11 (11.11.11.11)
      Origin incomplete, metric 0, localpref 200, valid, internal
      Originator: 13.13.13.13, Cluster list: 11.11.11.11
      mpls labels in/out nolabel/93009
      rx pathid: 0x1, tx pathid: 0
  Refresh Epoch 2
  6
    ::FFFF:13.13.13.13 (metric 20) from 8.8.8.8 (8.8.8.8)
      Origin incomplete, metric 0, localpref 200, valid, internal, best
      Originator: 13.13.13.13, Cluster list: 8.8.8.8
      mpls labels in/out nolabel/93009
      rx pathid: 0x0, tx pathid: 0x0

RP/0/0/CPU0:XRv14#show bgp ipv6 unicast 2001:28:14:170::/64 brief
Status codes: s suppressed, d damped, h history, * valid, > best
              i - internal, r RIB-failure, S stale, N Nexthop-discard
Origin codes: i - IGP, e - EGP, ? - incomplete
   Network                Next Hop        Metric LocPrf Weight Path
*>i2001:28:14:170::/64    13.13.13.13          0    200      0 6 ?
* i                       13.13.13.13          0    200      0 6 ?
* i                       5.5.5.5              0    100      0 6 ?
* i                       12.12.12.12          0    100      0 6 ?
* i                       13.13.13.13          0    200      0 6 ?

Next, we will examine VPNv4 and VPNv6 tests briefly. This will be short because the CSR does not support the negotiated capabilities, only the diverse-path, best-external, and backup install mechanisms. CSR4 and CSR10 are the VPNv4/v6 customers. An interesting design note with VPN service is that we can achieve additional paths simply by having unique RDs. For example, CSR5 uses RD 100:5, while XRv13 and XRv12 use RD 100:13. CSR9 doesn't have any VRFs locally configured, but by breaking the routes apart by RD, we see one route in table 100:5 and two in table 100:13. The RR runs best-path on these routes within an RD, never across, so there are two best-paths advertised to remote PEs. One is 100:13:10.10.10.10/32 via XRv13 and one is 100:5:10.10.10.10/32 via CSR5. We could have had three different paths if XRv12 had a unique RD, such as 100:12, but I wanted to combine it with XRv13 to demonstrate the point.

CSR9#show bgp vpnv4 unicast rd 100:13 10.10.10.10/32
BGP routing table entry for 100:13:10.10.10.10/32, version 39
Paths: (2 available, best #2, no table)
  Advertised to update-groups:
     1
  Refresh Epoch 1
  410, (Received from a RR-client)
    12.12.12.12 (metric 30) (via default) from 12.12.12.12 (12.12.12.12)
      Origin incomplete, metric 0, localpref 100, valid, internal
      Extended Community: RT:100:410
      mpls labels in/out nolabel/92008
      rx pathid: 0, tx pathid: 0
  Refresh Epoch 1
  410, (Received from a RR-client)
    13.13.13.13 (metric 10) (via default) from 13.13.13.13 (13.13.13.13)
      Origin incomplete, metric 0, localpref 100, valid, internal, best
      Extended Community: RT:100:410
      mpls labels in/out nolabel/93011
      rx pathid: 0, tx pathid: 0x0

CSR9#show bgp vpnv4 unicast rd 100:5 10.10.10.10/32
BGP routing table entry for 100:5:10.10.10.10/32, version 38
Paths: (1 available, best #1, no table)
  Advertised to update-groups:
     1
  Refresh Epoch 2
  410, (Received from a RR-client)
    5.5.5.5 (metric 10) (via default) from 5.5.5.5 (5.5.5.5)
      Origin incomplete, metric 0, localpref 100, valid, internal, best
      Extended Community: RT:100:410
      mpls labels in/out nolabel/5045
      rx pathid: 0, tx pathid: 0x0

We can verify that both of these routes were advertised to CSR7.

CSR9#show bgp vpnv4 unicast rd 100:13 neighbors 7.7.7.7 advertised-routes | begin Network
     Network          Next Hop        Metric LocPrf Weight Path
Route Distinguisher: 100:13
 *>i 10.10.10.10/32   13.13.13.13          0    100      0 410 ?

CSR9#show bgp vpnv4 unicast rd 100:5 neighbors 7.7.7.7 advertised-routes | begin Network
     Network          Next Hop        Metric LocPrf Weight Path
Route Distinguisher: 100:5
 *>i 10.10.10.10/32   5.5.5.5              0    100      0 410 ?
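The reason unique RDs yield multiple advertised paths is that the RR treats RD:prefix as the route key and runs best-path once per key. The rough Python sketch below models that grouping. It assumes every attribute other than IGP metric and router ID ties (true for these routes in this lab), and the function name and tuple layout are my own inventions for illustration. The IGP metrics are the ones CSR9 reports above.

```python
from collections import defaultdict
from ipaddress import IPv4Address


def rr_advertised(routes, igp_metric):
    """Return {(rd, prefix): best next-hop}; one best path per RD:prefix key.

    routes: iterable of (rd, prefix, next_hop) tuples as received from clients.
    igp_metric: dict mapping next_hop to the RR's IGP cost toward it.
    """
    by_nlri = defaultdict(list)
    for rd, prefix, next_hop in routes:
        by_nlri[(rd, prefix)].append(next_hop)
    # Simplified tie-break: lowest IGP metric, then numerically lowest router ID.
    return {
        nlri: min(hops, key=lambda nh: (igp_metric[nh], int(IPv4Address(nh))))
        for nlri, hops in by_nlri.items()
    }


csr9_metric = {"5.5.5.5": 10, "12.12.12.12": 30, "13.13.13.13": 10}

unique_rds = [
    ("100:5", "10.10.10.10/32", "5.5.5.5"),
    ("100:13", "10.10.10.10/32", "12.12.12.12"),
    ("100:13", "10.10.10.10/32", "13.13.13.13"),
]
# Same three routes, but with every PE using RD 100:13.
shared_rd = [("100:13", prefix, nh) for _, prefix, nh in unique_rds]

print(rr_advertised(unique_rds, csr9_metric))
print(rr_advertised(shared_rd, csr9_metric))
```

With the RDs as configured, two keys exist and two paths are reflected; collapsing all PEs onto one RD leaves a single key and therefore a single reflected path.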

If we look at a different RR in the network, such as CSR8, we see that it selects a different best-path in RD 100:13. Specifically, it picks XRv12 because its IGP metric is lower.

CSR8#show bgp vpnv4 unicast rd 100:5 neighbors 7.7.7.7 advertised-routes | begin Network
     Network          Next Hop        Metric LocPrf Weight Path
Route Distinguisher: 100:5
 *>i 10.10.10.10/32   5.5.5.5              0    100      0 410 ?

Total number of prefixes 1

CSR8#show bgp vpnv4 unicast rd 100:13 neighbors 7.7.7.7 advertised-routes | begin Network
     Network          Next Hop        Metric LocPrf Weight Path
Route Distinguisher: 100:13
 *>i 10.10.10.10/32   12.12.12.12          0    100      0 410 ?

CSR7 has received both RD 100:13 routes with next-hops of XRv12 and XRv13. Best-path runs on these routes first, before import into the VRF, so that only one VPN route is imported per RD. Because both XRv11 and CSR8 selected XRv12 as the best path, CSR7 has a copy of each route. Ultimately, the route through CSR8 wins due to lower BGP peer ID (the last tie breaker) and progresses to the "next round".

CSR7#show bgp vpnv4 unicast rd 100:13 10.10.10.10/32
BGP routing table entry for 100:13:10.10.10.10/32, version 151
Paths: (3 available, best #3, no table)
  Not advertised to any peer
  Refresh Epoch 3
  410
    13.13.13.13 (metric 20) (via default) from 9.9.9.9 (9.9.9.9)
      Origin incomplete, metric 0, localpref 100, valid, internal
      Extended Community: RT:100:410
      Originator: 13.13.13.13, Cluster list: 9.9.9.9
      mpls labels in/out nolabel/93011
      rx pathid: 0, tx pathid: 0
  Refresh Epoch 1
  410
    12.12.12.12 (metric 20) (via default) from 11.11.11.11 (11.11.11.11)
      Origin incomplete, metric 0, localpref 100, valid, internal
      Extended Community: RT:100:410
      Originator: 12.12.12.12, Cluster list: 11.11.11.11
      mpls labels in/out nolabel/92008
      rx pathid: 0, tx pathid: 0
  Refresh Epoch 1
  410
    12.12.12.12 (metric 20) (via default) from 8.8.8.8 (8.8.8.8)
      Origin incomplete, metric 0, localpref 100, valid, internal, best
      Extended Community: RT:100:410
      Originator: 12.12.12.12, Cluster list: 8.8.8.8
      mpls labels in/out nolabel/92008
      rx pathid: 0, tx pathid: 0x0

Best-path within RD 100:5 is much less interesting because all three RRs selected the same best path (the only path), which is via CSR5. The path selected was based on the lowest neighbor ID (the last tie breaker), indicating the routes are equivalent in every way. This best path advances to the "next round".

CSR7#show bgp vpnv4 unicast rd 100:5 10.10.10.10/32
BGP routing table entry for 100:5:10.10.10.10/32, version 128
Paths: (3 available, best #3, no table)
  Not advertised to any peer
  Refresh Epoch 3
  410
    5.5.5.5 (metric 20) (via default) from 9.9.9.9 (9.9.9.9)
      Origin incomplete, metric 0, localpref 100, valid, internal
      Extended Community: RT:100:410
      Originator: 5.5.5.5, Cluster list: 9.9.9.9
      mpls labels in/out nolabel/5045
      rx pathid: 0, tx pathid: 0
  Refresh Epoch 1
  410
    5.5.5.5 (metric 20) (via default) from 11.11.11.11 (11.11.11.11)
      Origin incomplete, metric 0, localpref 100, valid, internal
      Extended Community: RT:100:410
      Originator: 5.5.5.5, Cluster list: 11.11.11.11
      mpls labels in/out nolabel/5045
      rx pathid: 0, tx pathid: 0
  Refresh Epoch 1
  410
    5.5.5.5 (metric 20) (via default) from 8.8.8.8 (8.8.8.8)
      Origin incomplete, metric 0, localpref 100, valid, internal, best
      Extended Community: RT:100:410
      Originator: 5.5.5.5, Cluster list: 8.8.8.8
      mpls labels in/out nolabel/5045
      rx pathid: 0, tx pathid: 0x0

CSR7 imports these "best within the RD" VPN routes into the VRF for further processing, which allows it to directly compare all three routes since they are competing for a spot in the VRF RIB; they already exist in the VRF BGP table though. CSR5 wins due to lower originator ID. Enabling additional-path install and backup selection gives us fast-reroute without having to do anything fancy with the route-reflectors. This is a good design case for using different RD's at each PE, even for a single customer VRF. It is simple and does not require much additional complexity in the SP network. CSR7#show bgp vpnv4 unicast vrf B 10.10.10.10/32 BGP routing table entry for 100:7:10.10.10.10/32, version 154 Paths: (2 available, best #2, table B) Additional-path-install Advertised to update-groups: 8 Refresh Epoch 1 410, imported path from 100:13:10.10.10.10/32 (global) 12.12.12.12 (metric 20) (via default) from 8.8.8.8 (8.8.8.8) Origin incomplete, metric 0, localpref 100, valid, internal, backup/repair Extended Community: RT:100:410 Originator: 12.12.12.12, Cluster list: 8.8.8.8 , recursive-via-host mpls labels in/out nolabel/92008 rx pathid: 0, tx pathid: 0 Refresh Epoch 1 410, imported path from 100:5:10.10.10.10/32 (global) 5.5.5.5 (metric 20) (via default) from 8.8.8.8 (8.8.8.8)


Origin incomplete, metric 0, localpref 100, valid, internal, best Extended Community: RT:100:410 Originator: 5.5.5.5, Cluster list: 8.8.8.8 , recursive-via-host mpls labels in/out nolabel/5045 rx pathid: 0, tx pathid: 0x0

We will quickly demonstrate the shadow RR feature. This involves using the same RD on CSR5, XRv12, and XRv13, which forces each RR to advertise only one path. We change CSR5's RD to 100:13 to accomplish this. Beware that deleting the RD from the VRF automatically removes the VRF-related BGP configuration and the RT import/export policies, so back up the configuration before attempting the change. We already know that the route-reflectors are going to get three copies of the same route, pick a best path, then advertise that single path to CSR7. This creates the inherent problem add-path was meant to solve.
CSR9#show bgp vpnv4 unicast rd 100:13 10.10.10.10/32
BGP routing table entry for 100:13:10.10.10.10/32, version 43
Paths: (3 available, best #1, no table)
  Advertised to update-groups:
     1
  Refresh Epoch 2
  410, (Received from a RR-client)
    5.5.5.5 (metric 10) (via default) from 5.5.5.5 (5.5.5.5)
      Origin incomplete, metric 0, localpref 100, valid, internal, best
      Extended Community: RT:100:410
      mpls labels in/out nolabel/5003
      rx pathid: 0, tx pathid: 0x0
  Refresh Epoch 1
  410, (Received from a RR-client)
    12.12.12.12 (metric 30) (via default) from 12.12.12.12 (12.12.12.12)
      Origin incomplete, metric 0, localpref 100, valid, internal
      Extended Community: RT:100:410
      mpls labels in/out nolabel/92008
      rx pathid: 0, tx pathid: 0
  Refresh Epoch 1
  410, (Received from a RR-client)
    13.13.13.13 (metric 10) (via default) from 13.13.13.13 (13.13.13.13)
      Origin incomplete, metric 0, localpref 100, valid, internal
      Extended Community: RT:100:410
      mpls labels in/out nolabel/93011
      rx pathid: 0, tx pathid: 0

As usual, we ignore the IGP metric on the RRs for VPNv4/v6 so that the RRs have the exact same view of the topology regardless of IGP changes. Then, we configure CSR9 to be the shadow RR so it can advertise a diverse, second-best path. CSR8, as usual, advertises its single best path, while CSR9 advertises a backup, diverse path (which is not marked as best).
! CSR8


router bgp 100 address-family vpnv4 bgp bestpath igp-metric ignore ! CSR9 router bgp 100 template peer-policy VPNV4 advertise diverse-path backup address-family vpnv4 bgp bestpath igp-metric ignore bgp additional-paths select backup CSR8#show bgp vpnv4 unicast rd 100:13 neighbors 7.7.7.7 advertised-routes | begin Network Network Next Hop Metric LocPrf Weight Path Route Distinguisher: 100:13 *>i 10.10.10.10/32 5.5.5.5 0 100 0 410 ? CSR9#show bgp vpnv4 unicast rd 100:13 neighbors 7.7.7.7 advertised-routes | begin Network Network Next Hop Metric LocPrf Weight Path Route Distinguisher: 100:13 *bia10.10.10.10/32 12.12.12.12 0 100 0 410 ?
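The difference between the classic RR (CSR8) and the shadow RR (CSR9) can be modeled roughly as follows. This is a sketch under stated assumptions: `rank`, `classic_rr`, and `shadow_rr` are hypothetical helper names, the ranking only considers local-preference and router ID (the IGP metric is ignored, as configured above), and the fallback to the best path when no diverse path exists is an assumption of this model rather than something shown in the lab output.

```python
from ipaddress import IPv4Address

# (next_hop, local_pref) for 10.10.10.10/32 as seen on both RRs
paths = [("5.5.5.5", 100), ("12.12.12.12", 100), ("13.13.13.13", 100)]


def rank(candidates):
    # Higher local-preference first, then lower router ID (simplified best-path).
    return sorted(candidates, key=lambda p: (-p[1], int(IPv4Address(p[0]))))


def classic_rr(candidates):
    """A normal RR reflects only its single best path."""
    return rank(candidates)[0]


def shadow_rr(candidates):
    """A shadow RR reflects the best remaining path with a different next hop."""
    ranked = rank(candidates)
    best = ranked[0]
    for path in ranked[1:]:
        if path[0] != best[0]:
            return path
    return best  # assumption: with no diverse path, fall back to the best path


print("CSR8 advertises", classic_rr(paths)[0])
print("CSR9 advertises", shadow_rr(paths)[0])
```

The two results line up with the advertised-routes output above: CSR8 sends the path via 5.5.5.5, and CSR9 sends the path via 12.12.12.12.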

As expected, CSR7 runs best-path within the RD and identifies the best and backup paths. The issue is that only one of these paths is actually imported into the VRF table, which means FRR is not achieved. Even when I configured add-path install under the VRF, it did not work. I have seen online blogs describe it, and colleagues have configured it successfully, but I have personally never seen it work, so I do not document it here.
CSR7#show bgp vpnv4 unicast rd 100:13 10.10.10.10/32
BGP routing table entry for 100:13:10.10.10.10/32, version 170
Paths: (3 available, best #3, no table)
  Additional-path-install
  Not advertised to any peer
  Refresh Epoch 4
  410
    12.12.12.12 (metric 20) (via default) from 9.9.9.9 (9.9.9.9)
      Origin incomplete, metric 0, localpref 100, valid, internal, backup/repair
      Extended Community: RT:100:410
      Originator: 12.12.12.12, Cluster list: 9.9.9.9 , recursive-via-host
      mpls labels in/out nolabel/92008
      rx pathid: 0, tx pathid: 0
  Refresh Epoch 1
  410
    5.5.5.5 (metric 20) (via default) from 11.11.11.11 (11.11.11.11)
      Origin incomplete, metric 0, localpref 100, valid, internal
      Extended Community: RT:100:410
      Originator: 5.5.5.5, Cluster list: 11.11.11.11 , recursive-via-host
      mpls labels in/out nolabel/5003
      rx pathid: 0, tx pathid: 0
  Refresh Epoch 1
  410
    5.5.5.5 (metric 20) (via default) from 8.8.8.8 (8.8.8.8)
      Origin incomplete, metric 0, localpref 100, valid, internal, best
      Extended Community: RT:100:410
      Originator: 5.5.5.5, Cluster list: 8.8.8.8 , recursive-via-host
      mpls labels in/out nolabel/5003
      rx pathid: 0, tx pathid: 0x0

Only one route is imported into the VRF table, which means the FIB will only have one entry. While the VRF import process is probably very fast given that the backup path is already known, pre-programming the backup path in the FIB is a fundamental principle of FRR, and that does not happen here. The multiple RD method is simpler, more effective, and seems to be recommended by Cisco.
CSR7#show bgp vpnv4 unicast vrf B 10.10.10.10/32
BGP routing table entry for 100:7:10.10.10.10/32, version 171
Paths: (1 available, best #1, table B)
  Additional-path-install
  Advertised to update-groups:
     8
  Refresh Epoch 1
  410, imported path from 100:13:10.10.10.10/32 (global)
    5.5.5.5 (metric 20) (via default) from 8.8.8.8 (8.8.8.8)
      Origin incomplete, metric 0, localpref 100, valid, internal, best
      Extended Community: RT:100:410
      Originator: 5.5.5.5, Cluster list: 8.8.8.8 , recursive-via-host
      mpls labels in/out nolabel/5003
      rx pathid: 0, tx pathid: 0x0
CSR7#show ip cef vrf B 10.10.10.10 detail
10.10.10.10/32, epoch 0, flags [rib defined all labels]
  recursive via 5.5.5.5 label 5003
    nexthop 153.7.11.11 GigabitEthernet2.571 label 91018
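The reason a shared RD leaves the VRF with a single path, while unique RDs leave it with a best and a backup, can be illustrated with a small two-stage model. The function names `best` and `vrf_import` are hypothetical, and the comparison is reduced to "lowest next-hop ID wins", which is enough here because every other attribute is equal in the lab.

```python
from collections import defaultdict
from ipaddress import IPv4Address


def best(paths):
    # paths are (rd, next_hop) tuples; the lowest next-hop ID wins in this model
    return min(paths, key=lambda p: int(IPv4Address(p[1])))


def vrf_import(received):
    """Stage 1: one winner per RD. Stage 2: the winners compete inside the VRF."""
    by_rd = defaultdict(list)
    for path in received:
        by_rd[path[0]].append(path)
    winners = [best(group) for group in by_rd.values()]
    # First element is the VRF best path; anything after it can serve as a backup.
    return sorted(winners, key=lambda p: int(IPv4Address(p[1])))


# Same RD everywhere (shadow RR test): best and backup share RD 100:13
shared_rd = [("100:13", "5.5.5.5"), ("100:13", "12.12.12.12")]
# Unique RD on CSR5 (earlier test): two RDs carry the same prefix
unique_rd = [("100:5", "5.5.5.5"), ("100:13", "12.12.12.12"), ("100:13", "13.13.13.13")]

print(len(vrf_import(shared_rd)), "path(s) in the VRF with a shared RD")
print(len(vrf_import(unique_rd)), "path(s) in the VRF with unique RDs")
```

In the shared-RD case the backup is eliminated in stage 1, before the VRF ever sees it. With unique RDs, both stage 1 winners reach the VRF, which is what allows the repair path to be installed.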

Given the density of XRv, I will attempt to use the negotiated add-path capability between XRv11, XRv12, and XRv14 to demonstrate the equivalent of "best-external". XRv13 sets the local-preference of its exported VPN routes to 200 and advertises them to its iBGP peers. It is now the preferred exit point from AS 100 for VPN customer traffic. XRv12 selects this as its best path, giving the RRs no indication of an alternative. The eBGP path is considered an "import suspect" as it was not a candidate for import into BGP (export from VRF).
! XRv13
route-policy RPL_EXPORT_VPNV4($LPREF)


set extcommunity rt (100:410) set local-preference $LPREF end-policy vrf B address-family ipv4 unicast export route-policy RPL_EXPORT_VPNV4(200) no export route-target RP/0/0/CPU0:XRv12#show bgp vpnv4 unicast rd 100:13 10.10.10.10/32 BGP routing table entry for 10.10.10.10/32, Route Distinguisher: 100:13 Versions: Process bRIB/RIB SendTblVer Speaker 250 250 Local Label: 92008 Paths: (3 available, best #1) Not advertised to any peer Path #1: Received by speaker 0 Not advertised to any peer 410 13.13.13.13 (metric 20) from 8.8.8.8 (13.13.13.13) Received Label 93011 Origin incomplete, metric 0, localpref 200, valid, internal, best, group-best, import-candidate, imported, import suspect Received Path ID 0, Local Path ID 1, version 250 Extended community: RT:100:410 Originator: 13.13.13.13, Cluster list: 8.8.8.8 Source VRF: B, Source Route Distinguisher: 100:13 Path #2: Received by speaker 0 Not advertised to any peer 410 13.13.13.13 (metric 20) from 9.9.9.9 (13.13.13.13) Received Label 93011 Origin incomplete, metric 0, localpref 200, valid, internal, importcandidate, imported, import suspect Received Path ID 0, Local Path ID 0, version 0 Extended community: RT:100:410 Originator: 13.13.13.13, Cluster list: 9.9.9.9 Source VRF: B, Source Route Distinguisher: 100:13 Path #3: Received by speaker 0 Not advertised to any peer 410 10.10.12.10 from 10.10.12.10 (10.10.10.10) Origin incomplete, metric 0, localpref 100, valid, external, backup, add-path, import suspect Received Path ID 0, Local Path ID 2, version 233 Extended community: RT:100:410


To fix it, we need quite a bit of configuration. We need to enable the dynamic add-path capability on XRv11, XRv12, and XRv14. Each router needs to select its backup paths using RPL, and they need to be advertised to the RRs (XRv12 to XRv11), then advertised and installed on the remote PE (XRv14).
! XRv11
route-policy RPL_VPN_ADD_PATH
  set path-selection all advertise
end-policy
router bgp 100
 address-family vpnv4 unicast
  additional-paths send
  additional-paths receive
  additional-paths selection route-policy RPL_VPN_ADD_PATH
! XRv12
router bgp 100
 address-family vpnv4 unicast
  additional-paths receive
  additional-paths send
  additional-paths selection route-policy RPL_BACKUP_ALL
! XRv14
route-policy RPL_ADD_BACKUP
  set path-selection backup 1 install
end-policy
router bgp 100
 address-family vpnv4 unicast
  additional-paths receive
  additional-paths send
  additional-paths selection route-policy RPL_ADD_BACKUP
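What the add-path capability changes on the wire can be shown with a toy receive table. Without add-path, a second announcement for the same prefix from the same peer implicitly replaces the first; with add-path, a path identifier becomes part of the key, so both survive. The function `apply_updates` is a hypothetical illustration, and the path IDs 1 and 2 are simply example values.

```python
def apply_updates(updates, add_path):
    """Build a per-peer receive table from (prefix, path_id, next_hop) announcements."""
    rib = {}
    for prefix, path_id, next_hop in updates:
        # Without add-path the path ID is not part of the key, so a later
        # announcement for the same prefix overwrites the earlier one.
        key = (prefix, path_id) if add_path else prefix
        rib[key] = next_hop
    return rib


# Two announcements from one RR for the same VPNv4 prefix
updates = [
    ("100:13:10.10.10.10/32", 1, "13.13.13.13"),
    ("100:13:10.10.10.10/32", 2, "12.12.12.12"),
]

print(apply_updates(updates, add_path=False))
print(apply_updates(updates, add_path=True))
```

The first call leaves a single entry holding only the last announcement, while the second keeps both next hops, which is the precondition for the remote PE to mark one of them as a backup.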

XRv12 now considers this a valid add-path and advertises it to XRv11. The RR sees the additional path from XRv12 and can advertise it to XRv14. RP/0/0/CPU0:XRv12#show bgp vpnv4 unicast rd 100:13 10.10.10.10/32 | begin Paths Paths: (4 available, best #1) Not advertised to any peer Path #1: Received by speaker 0 Not advertised to any peer 410 13.13.13.13 (metric 20) from 8.8.8.8 (13.13.13.13) Received Label 93011 Origin incomplete, metric 0, localpref 200, valid, internal, best, group-best, import-candidate, imported Received Path ID 0, Local Path ID 1, version 291


Extended community: RT:100:410 Originator: 13.13.13.13, Cluster list: 8.8.8.8 Source VRF: B, Source Route Distinguisher: 100:13 Path #2: Received by speaker 0 Not advertised to any peer 410 13.13.13.13 (metric 20) from 9.9.9.9 (13.13.13.13) Received Label 93011 Origin incomplete, metric 0, localpref 200, valid, internal, importcandidate, imported Received Path ID 0, Local Path ID 0, version 0 Extended community: RT:100:410 Originator: 13.13.13.13, Cluster list: 9.9.9.9 Source VRF: B, Source Route Distinguisher: 100:13 Path #3: Received by speaker 0 Advertised to peers (in unique update groups): 11.11.11.11 410 10.10.12.10 from 10.10.12.10 (10.10.10.10) Origin incomplete, metric 0, localpref 100, valid, external, backup, add-path Received Path ID 0, Local Path ID 2, version 280 Extended community: RT:100:410 Path #4: Received by speaker 0 Not advertised to any peer 410 13.13.13.13 (metric 20) from 11.11.11.11 (13.13.13.13) Received Label 93011 Origin incomplete, metric 0, localpref 200, valid, internal, importcandidate, imported Received Path ID 1, Local Path ID 0, version 0 Extended community: RT:100:410 Originator: 13.13.13.13, Cluster list: 11.11.11.11 Source VRF: B, Source Route Distinguisher: 100:13 RP/0/0/CPU0:XRv11#show bgp vpnv4 unicast rd 100:13 10.10.10.10/32 | begin Paths Paths: (2 available, best #2) Advertised to update-groups (with more than one peer): 0.2 Advertised to peers (in unique update groups): 12.12.12.12 Path #1: Received by speaker 0 Not advertised to any peer 410, (Received from a RR-client) 12.12.12.12 (metric 10) from 12.12.12.12 (12.12.12.12) Received Label 92008 Origin incomplete, metric 0, localpref 100, valid, internal, not-in-vrf Received Path ID 2, Local Path ID 0, version 0


      Extended community: RT:100:410
  Path #2: Received by speaker 0
  Advertised to update-groups (with more than one peer):
    0.2
  Advertised to peers (in unique update groups):
    12.12.12.12
  410, (Received from a RR-client)
    13.13.13.13 (metric 30) from 13.13.13.13 (13.13.13.13)
      Received Label 93011
      Origin incomplete, metric 0, localpref 200, valid, internal, best, group-best, import-candidate, not-in-vrf
      Received Path ID 0, Local Path ID 1, version 45
      Extended community: RT:100:410
RP/0/0/CPU0:XRv11#show bgp vpnv4 unicast neighbors 14.14.14.14 advertised-routes
Network            Next Hop        From            AS Path
[snip]
Route Distinguisher: 100:13
10.10.10.10/32     13.13.13.13     13.13.13.13     410?
                   12.12.12.12     12.12.12.12     410?
[snip]

Finally, the remote PE gets both routes and marks the less preferred route (lower local-preference) as a backup path. We still have the same issue as earlier, which is that only a single route gets installed in the VRF RIB and thus in the FIB. This is why unique RD is still the best solution in my eyes. The reason the FIB shows two paths is due to IGP ECMP towards 13.13.13.13; don't be fooled. RP/0/0/CPU0:XRv14#show bgp vpnv4 unicast rd 100:13 10.10.10.10/32 | begin Paths Paths: (4 available, best #1) Not advertised to any peer Path #1: Received by speaker 0 Not advertised to any peer 410 13.13.13.13 (metric 20) from 8.8.8.8 (13.13.13.13) Received Label 93011 Origin incomplete, metric 0, localpref 200, valid, internal, best, group-best, import-candidate, not-in-vrf Received Path ID 0, Local Path ID 1, version 169 Extended community: RT:100:410 Originator: 13.13.13.13, Cluster list: 8.8.8.8 Path #2: Received by speaker 0 Not advertised to any peer 410 13.13.13.13 (metric 20) from 9.9.9.9 (13.13.13.13) Received Label 93011 Origin incomplete, metric 0, localpref 200, valid, internal, not-in-vrf


      Received Path ID 0, Local Path ID 0, version 0
      Extended community: RT:100:410
      Originator: 13.13.13.13, Cluster list: 9.9.9.9
  Path #3: Received by speaker 0
  Not advertised to any peer
  410
    13.13.13.13 (metric 20) from 11.11.11.11 (13.13.13.13)
      Received Label 93011
      Origin incomplete, metric 0, localpref 200, valid, internal, not-in-vrf
      Received Path ID 1, Local Path ID 0, version 0
      Extended community: RT:100:410
      Originator: 13.13.13.13, Cluster list: 11.11.11.11
  Path #4: Received by speaker 0
  Not advertised to any peer
  410
    12.12.12.12 (metric 20) from 11.11.11.11 (12.12.12.12)
      Received Label 92008
      Origin incomplete, metric 0, localpref 100, valid, internal, backup, add-path, not-in-vrf
      Received Path ID 2, Local Path ID 2, version 177
      Extended community: RT:100:410
      Originator: 12.12.12.12, Cluster list: 11.11.11.11
RP/0/0/CPU0:XRv14#show bgp vpnv4 unicast vrf B 10.10.10.10/32 | begin Paths
Paths: (1 available, best #1)
  Not advertised to any peer
  Path #1: Received by speaker 0
  Not advertised to any peer
  410
    13.13.13.13 (metric 20) from 8.8.8.8 (13.13.13.13)
      Received Label 93011
      Origin incomplete, metric 0, localpref 200, valid, internal, best, group-best, import-candidate, imported
      Received Path ID 0, Local Path ID 1, version 174
      Extended community: RT:100:410
      Originator: 13.13.13.13, Cluster list: 8.8.8.8
      Source VRF: default, Source Route Distinguisher: 100:13
RP/0/0/CPU0:XRv14#show cef vrf B ipv4 10.10.10.10/32
10.10.10.10/32, version 162, internal 0x5000001 0x0 (ptr 0xa13d4774) [1], 0x0 (0x0), 0x208 (0xa152c3c0)
 Prefix Len 32, traffic index 0, precedence n/a, priority 3
   via 13.13.13.13, 5 dependencies, recursive [flags 0x6000]
    path-idx 0 NHID 0x0 [0xa1594b74 0x0]
    recursion-via-/32
    next hop VRF - 'default', table - 0xe0000000
    next hop 13.13.13.13 via 94000/0/21
     next hop 153.8.14.8/32 Gi0/0/0/0.584 labels imposed {8009 93011}
     next hop 153.9.14.9/32 Gi0/0/0/0.594 labels imposed {9006 93011}


Last, we will examine VPNv6 with add-path. CSR5, XRv12, and XRv13 all still use the same RD, so we will test add-path first before testing diverse RDs. Looking quickly at CSR9, we see that the bestpath to ::10:10:10:10/128 is via CSR5 due to lowest metric, and then due to neighbor RID. CSR8 picks XRv12 for the same reason. Fortunately, we can achieve a little bit of redundancy based on NOT ignoring the IGP metric in this specific situation. The problem is that changes in the IGP topology may negatively influence our BGP topology with designs like this. Without even looking at XRv11, we know that it will select CSR5 over XRv12 due to lower RID, and rejects XRv13 due to higher IGP cost. CSR9#show bgp vpnv6 unicast rd 100:13 ::10:10:10:10/128 BGP routing table entry for [100:13]::10:10:10:10/128, version 33 Paths: (3 available, best #2, no table) Advertised to update-groups: 1 Refresh Epoch 1 410, (Received from a RR-client) ::FFFF:12.12.12.12 (metric 30) (via default) from 12.12.12.12 (12.12.12.12) Origin incomplete, metric 0, localpref 100, valid, internal Extended Community: RT:100:410 mpls labels in/out nolabel/92012 rx pathid: 0, tx pathid: 0 Refresh Epoch 2 410, (Received from a RR-client) ::FFFF:5.5.5.5 (metric 10) (via default) from 5.5.5.5 (5.5.5.5) Origin incomplete, metric 0, localpref 100, valid, internal, best Extended Community: RT:100:410 mpls labels in/out nolabel/5010 rx pathid: 0, tx pathid: 0x0 Refresh Epoch 1 410, (Received from a RR-client) ::FFFF:13.13.13.13 (metric 10) (via default) from 13.13.13.13 (13.13.13.13) Origin incomplete, metric 0, localpref 100, valid, internal Extended Community: RT:100:410 mpls labels in/out nolabel/93013 rx pathid: 0, tx pathid: 0 CSR8#show bgp vpnv6 unicast rd 100:13 ::10:10:10:10/128 BGP routing table entry for [100:13]::10:10:10:10/128, version 46 Paths: (3 available, best #1, no table) Advertised to update-groups: 1 Refresh Epoch 1 410, (Received from a RR-client) ::FFFF:12.12.12.12 (metric 10) (via default) 
from 12.12.12.12 (12.12.12.12) Origin incomplete, metric 0, localpref 100, valid, internal, best


Extended Community: RT:100:410 mpls labels in/out nolabel/92012 rx pathid: 0, tx pathid: 0x0 Refresh Epoch 1 410, (Received from a RR-client) ::FFFF:5.5.5.5 (metric 30) (via default) from 5.5.5.5 (5.5.5.5) Origin incomplete, metric 0, localpref 100, valid, internal Extended Community: RT:100:410 mpls labels in/out nolabel/5010 rx pathid: 0, tx pathid: 0 Refresh Epoch 1 410, (Received from a RR-client) ::FFFF:13.13.13.13 (metric 10) (via default) from 13.13.13.13 (13.13.13.13) Origin incomplete, metric 0, localpref 100, valid, internal Extended Community: RT:100:410 mpls labels in/out nolabel/93013 rx pathid: 0, tx pathid: 0

On CSR7, we will receive two copies of CSR5's route (via CSR9 and XRv11) and one copy of XRv12's route (via CSR8). Activating add-path install allows the route to XRv12 to be considered a backup, but not installed in the VRF FIB, because RD-level bestpath happens before VRF import from BGP. This is the same issue we saw before. ! CSR7 router bgp 100 address-family vpnv6 bgp additional-paths install CSR7#show bgp vpnv6 unicast rd 100:13 ::10:10:10:10/128 BGP routing table entry for [100:13]::10:10:10:10/128, version 209 Paths: (3 available, best #2, no table) Additional-path-install Not advertised to any peer Refresh Epoch 1 410 ::FFFF:5.5.5.5 (metric 20) (via default) from 11.11.11.11 (11.11.11.11) Origin incomplete, metric 0, localpref 100, valid, internal Extended Community: RT:100:410 Originator: 5.5.5.5, Cluster list: 11.11.11.11 , recursive-via-host mpls labels in/out nolabel/5010 rx pathid: 0, tx pathid: 0 Refresh Epoch 3 410 ::FFFF:5.5.5.5 (metric 20) (via default) from 9.9.9.9 (9.9.9.9) Origin incomplete, metric 0, localpref 100, valid, internal, best Extended Community: RT:100:410 Originator: 5.5.5.5, Cluster list: 9.9.9.9 , recursive-via-host


mpls labels in/out nolabel/5010 rx pathid: 0, tx pathid: 0x0 Refresh Epoch 1 410 ::FFFF:12.12.12.12 (metric 20) (via default) from 8.8.8.8 (8.8.8.8) Origin incomplete, metric 0, localpref 100, valid, internal, backup/repair Extended Community: RT:100:410 Originator: 12.12.12.12, Cluster list: 8.8.8.8 , recursive-via-host mpls labels in/out nolabel/92012 rx pathid: 0, tx pathid: 0 CSR7#show bgp vpnv6 unicast vrf B ::10:10:10:10/128 BGP routing table entry for [100:7]::10:10:10:10/128, version 208 Paths: (1 available, best #1, table B) Additional-path-install Advertised to update-groups: 6 Refresh Epoch 3 410, imported path from [100:13]::10:10:10:10/128 (global) ::FFFF:5.5.5.5 (metric 20) (via default) from 9.9.9.9 (9.9.9.9) Origin incomplete, metric 0, localpref 100, valid, internal, best Extended Community: RT:100:410 Originator: 5.5.5.5, Cluster list: 9.9.9.9 , recursive-via-host mpls labels in/out nolabel/5010 rx pathid: 0, tx pathid: 0x0 CSR7#show ipv6 cef vrf B ::10:10:10:10 detail ::10:10:10:10/128, epoch 0, flags [rib defined all labels] recursive via 5.5.5.5 label 5010 nexthop 153.7.11.11 GigabitEthernet2.571 label 91018

We can achieve a similar, and somewhat better result, by ignoring the IGP metric on CSR8 and CSR9 (we will disable XRv11 for this since it doesn't support that command), and directing CSR9 to advertise diverse paths. This is the shadow RR feature we have tested many times. CSR8 and CSR9 both select CSR5 as the bestpath given the lowest RID. In terms of advertisements, CSR8 advertises the bestpath (classic RR behavior), while CSR9 advertises its backup path (shadow RR behavior). Notice that the route advertised by CSR9 is marked as an additional and backup path, but is not the bestpath. ! CSR8 router bgp 100 address-family vpnv6 unicast bgp bestpath igp-metric ignore ! CSR9 router bgp 100 template peer-policy VPNV6 advertise diverse-path backup


address-family vpnv6 unicast bgp bestpath igp-metric ignore CSR8#show bgp vpnv6 unicast rd 100:13 ::10:10:10:10/128 BGP routing table entry for [100:13]::10:10:10:10/128, version 48 BGP Bestpath: igpmetric-ignore Paths: (3 available, best #2, no table) Advertised to update-groups: 1 Refresh Epoch 1 410, (Received from a RR-client) ::FFFF:12.12.12.12 (metric 10) (via default) from 12.12.12.12 (12.12.12.12) Origin incomplete, metric 0, localpref 100, valid, internal Extended Community: RT:100:410 mpls labels in/out nolabel/92012 rx pathid: 0, tx pathid: 0 Refresh Epoch 1 410, (Received from a RR-client) ::FFFF:5.5.5.5 (metric 30) (via default) from 5.5.5.5 (5.5.5.5) Origin incomplete, metric 0, localpref 100, valid, internal, best Extended Community: RT:100:410 mpls labels in/out nolabel/5010 rx pathid: 0, tx pathid: 0x0 Refresh Epoch 1 410, (Received from a RR-client) ::FFFF:13.13.13.13 (metric 10) (via default) from 13.13.13.13 (13.13.13.13) Origin incomplete, metric 0, localpref 100, valid, internal Extended Community: RT:100:410 mpls labels in/out nolabel/93013 rx pathid: 0, tx pathid: 0

CSR8#show bgp vpnv6 unicast rd 100:13 neighbors 7.7.7.7 advertised-routes | begin Network Network Next Hop Metric LocPrf Weight Path Route Distinguisher: 100:13 *>i ::10:10:10:10/128 ::FFFF:5.5.5.5 0 100 0 410 ? CSR9#show bgp vpnv6 unicast rd 100:13 ::10:10:10:10/128 BGP routing table entry for [100:13]::10:10:10:10/128, version 39 BGP Bestpath: igpmetric-ignore Paths: (3 available, best #2, no table) Advertised to update-groups: 2 Refresh Epoch 1 410, (Received from a RR-client)


::FFFF:12.12.12.12 (metric 30) (via default) from 12.12.12.12 (12.12.12.12) Origin incomplete, metric 0, localpref 100, valid, internal, backup/repair Extended Community: RT:100:410 mpls labels in/out nolabel/92012 rx pathid: 0, tx pathid: 0 Refresh Epoch 2 410, (Received from a RR-client) ::FFFF:5.5.5.5 (metric 10) (via default) from 5.5.5.5 (5.5.5.5) Origin incomplete, metric 0, localpref 100, valid, internal, best Extended Community: RT:100:410 mpls labels in/out nolabel/5010 rx pathid: 0, tx pathid: 0x0 Refresh Epoch 1 410, (Received from a RR-client) ::FFFF:13.13.13.13 (metric 10) (via default) from 13.13.13.13 (13.13.13.13) Origin incomplete, metric 0, localpref 100, valid, internal Extended Community: RT:100:410 mpls labels in/out nolabel/93013 rx pathid: 0, tx pathid: 0 CSR9#show bgp vpnv6 unicast rd 100:13 neighbors 7.7.7.7 advertised-routes | begin Network Network Next Hop Metric LocPrf Weight Path Route Distinguisher: 100:13 *bia::10:10:10:10/128 ::FFFF:12.12.12.12 0 100 0 410 ?

We still have the same ineffective result on CSR7, but at least now it is IGP-independent using the shadow RR method. Neither XR v3.15 nor XR v5.3.0 supports multipath for VPNv6.
CSR7#show bgp vpnv6 unicast rd 100:13 ::10:10:10:10/128
BGP routing table entry for [100:13]::10:10:10:10/128, version 213
Paths: (2 available, best #2, no table)
  Additional-path-install
  Not advertised to any peer
  Refresh Epoch 4
  410
    ::FFFF:12.12.12.12 (metric 20) (via default) from 9.9.9.9 (9.9.9.9)
      Origin incomplete, metric 0, localpref 100, valid, internal, backup/repair
      Extended Community: RT:100:410
      Originator: 12.12.12.12, Cluster list: 9.9.9.9 , recursive-via-host
      mpls labels in/out nolabel/92012
      rx pathid: 0, tx pathid: 0
  Refresh Epoch 1
  410
    ::FFFF:5.5.5.5 (metric 20) (via default) from 8.8.8.8 (8.8.8.8)
      Origin incomplete, metric 0, localpref 100, valid, internal, best
      Extended Community: RT:100:410
      Originator: 5.5.5.5, Cluster list: 8.8.8.8 , recursive-via-host
      mpls labels in/out nolabel/5010
      rx pathid: 0, tx pathid: 0x0
CSR7#show bgp vpnv6 unicast vrf B ::10:10:10:10/128
BGP routing table entry for [100:7]::10:10:10:10/128, version 212
Paths: (1 available, best #1, table B)
  Additional-path-install
  Advertised to update-groups:
     6
  Refresh Epoch 1
  410, imported path from [100:13]::10:10:10:10/128 (global)
    ::FFFF:5.5.5.5 (metric 20) (via default) from 8.8.8.8 (8.8.8.8)
      Origin incomplete, metric 0, localpref 100, valid, internal, best
      Extended Community: RT:100:410
      Originator: 5.5.5.5, Cluster list: 8.8.8.8 , recursive-via-host
      mpls labels in/out nolabel/5010
      rx pathid: 0, tx pathid: 0x0

Last, we will change CSR5's RD back to 100:5 to achieve "optimal" path diversity without using BGP add-path on the RRs. Remember that removing the RD wipes out all of the RT import/export policies and related BGP configuration. We will combine it with changing the local-preference on XRv13 to 166 for ::10:10:10:10/128 so that the remote PEs have a common exit point from the AS, while the VRF table should also have the backup path installed in the FIB.
! XRv13
vrf B
 address-family ipv6 unicast
  export route-policy RPL_EXPORT_VPNV4(166)
  no export route-target
! CSR5
vrf definition B
 no rd 100:13
 rd 100:5
router bgp 100
 address-family vpnv6 unicast
  bgp additional-paths install
  bgp additional-paths select best-external
 address-family vpnv6
  neighbor 8.8.8.8 advertise best-external
  neighbor 9.9.9.9 advertise best-external
  neighbor 11.11.11.11 advertise best-external


CSR5#show bgp vpnv6 unicast rd 100:5 ::10:10:10:10/128
BGP routing table entry for [100:5]::10:10:10:10/128, version 297
Paths: (2 available, best #2, table B)
  Advertised to update-groups:
     10
  Refresh Epoch 1
  410
    2001:10:5:10::10 (FE80::10) (via vrf B) from 10.5.10.10 (10.10.10.10)
      Origin incomplete, metric 0, localpref 100, valid, external, advertise-best-external
      Extended Community: RT:100:410 , recursive-via-connected
      rx pathid: 0, tx pathid: 0
  Refresh Epoch 7
  410, imported path from [100:13]::10:10:10:10/128 (global)
    ::FFFF:13.13.13.13 (metric 20) (via default) from 8.8.8.8 (8.8.8.8)
      Origin incomplete, metric 0, localpref 166, valid, internal, best
      Extended Community: RT:100:410
      Originator: 13.13.13.13, Cluster list: 8.8.8.8 , recursive-via-host
      mpls labels in/out nolabel/93013
      rx pathid: 0, tx pathid: 0x0
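The best-external behavior visible in this output can be summarized with a short sketch. It is a simplified model: `Path`, `best_path`, and `best_external` are hypothetical names, and only local-preference is compared, since that is the attribute that makes the internal path win on CSR5.

```python
from typing import NamedTuple, Optional


class Path(NamedTuple):
    next_hop: str
    local_pref: int
    external: bool


def best_path(paths):
    # Only local-preference is compared; the real decision process has more steps.
    return max(paths, key=lambda p: p.local_pref)


def best_external(paths) -> Optional[Path]:
    """Path a PE still advertises to iBGP peers when its overall best is internal."""
    if not best_path(paths).external:
        externals = [p for p in paths if p.external]
        if externals:
            return best_path(externals)
    return None  # the overall best is already external, nothing extra to send


# The two paths CSR5 holds for ::10:10:10:10/128 in VRF B
csr5 = [
    Path("2001:10:5:10::10", 100, external=True),      # eBGP path from the CE
    Path("::FFFF:13.13.13.13", 166, external=False),   # iBGP path via XRv13
]

print("forwarding uses:", best_path(csr5).next_hop)
print("still advertised:", best_external(csr5).next_hop)
```

CSR5 forwards via XRv13 because of the higher local-preference, yet it keeps advertising its own CE-learned path so that the RRs and remote PEs know an alternative exit exists.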

Now, each RR will have two paths, one via CSR5 and one via XRv13. On CSR9, we look at the RDs individually, and on CSR8 we look at them with a single command. XRv11's verification is shown in brief form for variety. CSR9#show bgp vpnv6 unicast rd 100:5 ::10:10:10:10/128 BGP routing table entry for [100:5]::10:10:10:10/128, version 47 BGP Bestpath: igpmetric-ignore Paths: (1 available, best #1, no table) Advertised to update-groups: 2 Refresh Epoch 4 410, (Received from a RR-client) ::FFFF:5.5.5.5 (metric 10) (via default) from 5.5.5.5 (5.5.5.5) Origin incomplete, metric 0, localpref 100, valid, internal, best Extended Community: RT:100:410 mpls labels in/out nolabel/5044 rx pathid: 0, tx pathid: 0x0 CSR9#show bgp vpnv6 unicast rd 100:13 ::10:10:10:10/128 BGP routing table entry for [100:13]::10:10:10:10/128, version 46 BGP Bestpath: igpmetric-ignore Paths: (1 available, best #1, no table) Advertised to update-groups: 2 Refresh Epoch 1 410, (Received from a RR-client)


::FFFF:13.13.13.13 (metric 10) (via default) from 13.13.13.13 (13.13.13.13) Origin incomplete, metric 0, localpref 166, valid, internal, best Extended Community: RT:100:410 mpls labels in/out nolabel/93013 rx pathid: 0, tx pathid: 0x0 CSR8#show bgp vpnv6 unicast all ::10:10:10:10/128 BGP routing table entry for [100:5]::10:10:10:10/128, version 54 BGP Bestpath: igpmetric-ignore Paths: (1 available, best #1, no table) Advertised to update-groups: 1 Refresh Epoch 2 410, (Received from a RR-client) ::FFFF:5.5.5.5 (metric 30) (via default) from 5.5.5.5 (5.5.5.5) Origin incomplete, metric 0, localpref 100, valid, internal, best Extended Community: RT:100:410 mpls labels in/out nolabel/5044 rx pathid: 0, tx pathid: 0x0 BGP routing table entry for [100:13]::10:10:10:10/128, version 52 BGP Bestpath: igpmetric-ignore Paths: (1 available, best #1, no table) Advertised to update-groups: 1 Refresh Epoch 1 410, (Received from a RR-client) ::FFFF:13.13.13.13 (metric 10) (via default) from 13.13.13.13 (13.13.13.13) Origin incomplete, metric 0, localpref 166, valid, internal, best Extended Community: RT:100:410 mpls labels in/out nolabel/93013 rx pathid: 0, tx pathid: 0x0 RP/0/0/CPU0:XRv11#show bgp vpnv6 unicast rd all ::10:10:10:10/128 Status codes: s suppressed, d damped, h history, * valid, > best i - internal, r RIB-failure, S stale, N Nexthop-discard Origin codes: i - IGP, e - EGP, ? - incomplete Network Next Hop Metric LocPrf Weight Path Route Distinguisher: 100:5 *>i::10:10:10:10/128 5.5.5.5 0 100 0 410 ? Route Distinguisher: 100:13 *>i::10:10:10:10/128 13.13.13.13 0 166 0 410 ?

Now, within the VRF tables on CSR7 and XRv14, we have multiple BGP routes and pre-installed backup paths in the FIB. Be sure to add the proper configurations to CSR7 and XRv14 so the backup paths can be installed.

! CSR7 router bgp 100 address-family vpnv6 bgp additional-paths install CSR7#show bgp vpnv6 unicast vrf B ::10:10:10:10/128 BGP routing table entry for [100:7]::10:10:10:10/128, version 239 Paths: (2 available, best #2, table B) Additional-path-install Advertised to update-groups: 6 Refresh Epoch 1 410, imported path from [100:5]::10:10:10:10/128 (global) ::FFFF:5.5.5.5 (metric 20) (via default) from 8.8.8.8 (8.8.8.8) Origin incomplete, metric 0, localpref 100, valid, internal, backup/repair Extended Community: RT:100:410 Originator: 5.5.5.5, Cluster list: 8.8.8.8 , recursive-via-host mpls labels in/out nolabel/5044 rx pathid: 0, tx pathid: 0 Refresh Epoch 1 410, imported path from [100:13]::10:10:10:10/128 (global) ::FFFF:13.13.13.13 (metric 20) (via default) from 8.8.8.8 (8.8.8.8) Origin incomplete, metric 0, localpref 166, valid, internal, best Extended Community: RT:100:410 Originator: 13.13.13.13, Cluster list: 8.8.8.8 , recursive-via-host mpls labels in/out nolabel/93013 rx pathid: 0, tx pathid: 0x0 CSR7#show ipv6 cef vrf B ::10:10:10:10/128 detail ::10:10:10:10/128, epoch 0, flags [rib defined all labels] recursive via 13.13.13.13 label 93013 nexthop 153.7.8.8 GigabitEthernet2.578 label 8009 recursive via 5.5.5.5 label 5044, repair nexthop 153.7.11.11 GigabitEthernet2.571 label 91018 ! XRv14 router bgp 100 address-family vpnv6 unicast additional-paths selection route-policy RPL_ADD_BACKUP RP/0/0/CPU0:XRv14#show bgp vpnv6 unicast vrf B ::10:10:10:10/128 brief | begin Network Network Next Hop Metric LocPrf Weight Path Route Distinguisher: 100:14 (default for vrf B) * i::10:10:10:10/128 5.5.5.5 0 100 0 410 ? *>i 13.13.13.13 0 166 0 410 ?


RP/0/0/CPU0:XRv14#show cef vrf B ipv6 ::10:10:10:10/128
::10:10:10:10/128, version 155, internal 0x5000001 0x0 (ptr 0xa13ba2f4) [1], 0x0 (0x0), 0x208 (0xa17fa12c)
 Prefix Len 128, traffic index 0, precedence n/a, priority 3
   via ::ffff:5.5.5.5, 4 dependencies, recursive, backup [flags 0x6100]
    path-idx 0 NHID 0x0 [0xa18ef0bc 0x0]
    recursion-via-/128
    next hop VRF - 'default', table - 0xe0000000
    next hop ::ffff:5.5.5.5 via ::ffff:5.5.5.5:0
     next hop 153.9.14.9/32 Gi0/0/0/0.594 labels imposed {9001 5044}
   via ::ffff:13.13.13.13, 5 dependencies, recursive [flags 0x6000]
    path-idx 1 NHID 0x0 [0xa18ef200 0x0]
    recursion-via-/128
    next hop VRF - 'default', table - 0xe0000000
    next hop ::ffff:13.13.13.13 via ::ffff:13.13.13.13:0
     next hop 153.8.14.8/32 Gi0/0/0/0.584 labels imposed {8009 93013}
     next hop 153.9.14.9/32 Gi0/0/0/0.594 labels imposed {9006 93013}

BGP add-path is a very involved feature with many minor options worth testing. PIC, on the other hand, is a simple concept with almost no configuration involved. The idea of PIC is to create a hierarchical FIB to speed up convergence. Legacy FIBs were two-tiered: each prefix was mapped directly to an outgoing interface and next-hop value. Because the number of routes in a large router is orders of magnitude greater than the number of connected routes (possible next-hops), a reconvergence event required a tremendous number of FIB updates, which could take minutes to complete. For example, an Internet router with 400,000 routes may have 10 or so connected interfaces. This means an average of 40,000 routes share the same next-hop, yet every affected FIB entry must still be rewritten individually. It is much more efficient to have the FIB entries reference a shared next-hop pointer and to update only that pointer. This is called a three-tier or hierarchical FIB, and it greatly speeds convergence. A change in next-hop status due to an IGP failure results in the FIB updating one next-hop pointer, which is referenced by 40,000 routes. This is theoretically 40,000 times faster to converge since there is one change for 40,000 prefixes rather than 40,000 individual next-hop updates.

Enabling PIC is often done simply by designing a BGP add-path architecture. To change the way the FIB is built, you can adjust the CEF table construction to favor either fast convergence or efficient memory use. The two are mutually exclusive, but on modern routers the general preference (and the default) is to favor speed. We can adjust this in global configuration and verify the behavior with a show command. The default (convergence speed) is shown below; memory utilization is the other option.

CSR1#show cef table | section favor
 Output chain build favors:
  platform: not configured
  CLI: not configured
  operational: convergence-speed
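The update-count argument above can be illustrated with a toy model. This is only a sketch in Python, not a representation of how CEF is actually implemented: the prefix list, the 192.0.2.x next-hops, and the `PathList` class are all hypothetical names chosen for illustration.

```python
# Toy comparison of a flat (two-tier) FIB and a hierarchical FIB.
# All names and addresses here are illustrative assumptions.

def build_flat_fib(prefixes, next_hop):
    """Flat FIB: every prefix stores its own copy of the next-hop."""
    return {prefix: next_hop for prefix in prefixes}

def reconverge_flat(fib, failed, backup):
    """Rewrite every entry using the failed next-hop; return the write count."""
    writes = 0
    for prefix, next_hop in fib.items():
        if next_hop == failed:
            fib[prefix] = backup  # value update only, dict size is unchanged
            writes += 1
    return writes

class PathList:
    """Shared indirection object that many prefixes point to."""
    def __init__(self, next_hop):
        self.next_hop = next_hop

def build_hier_fib(prefixes, path_list):
    """Hierarchical FIB: every prefix references the same PathList."""
    return {prefix: path_list for prefix in prefixes}

def reconverge_hier(path_list, backup):
    """Repoint the shared object once; all dependent prefixes follow."""
    path_list.next_hop = backup
    return 1

# 40,000 unique prefixes that all share one next-hop.
prefixes = [f"10.{i // 256}.{i % 256}.0/24" for i in range(40_000)]

flat = build_flat_fib(prefixes, "192.0.2.1")
shared = PathList("192.0.2.1")
hier = build_hier_fib(prefixes, shared)

flat_writes = reconverge_flat(flat, "192.0.2.1", "192.0.2.2")
hier_writes = reconverge_hier(shared, "192.0.2.2")
print(flat_writes, hier_writes)  # prints "40000 1"
```

Both tables end up forwarding via 192.0.2.2, but the hierarchical table gets there with a single write, which is the whole point of the indirection.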


CSR1(config)#cef table output-chain build favor ?
  convergence-speed   Favor faster convergence
  memory-utilization  Favor smaller memory utilization

Additional Reading – Reference configurations "bgp-add-path"

27.2 BGP RT-filter unicast / IPv4 RT-filter feature

Most designs have a BGP hierarchy using VPN RRs. If a PE imports only a subset of RTs, the RR still reflects all VPN routes to that PE, and the PE rejects the majority of them because the matching RT import is not locally configured for any VRF. The end result is generally desirable, but it would be better if the routes were filtered outbound on the sender rather than inbound on the receiver. The feature that accomplishes this is also called "RT constrained route distribution". Features like aggregation (notwithstanding the default route), redistribution, and iBGP synchronization are not supported for this feature. They don't really make sense anyway since the feature is meant to reduce routing advertisements based on RTs. When the feature is not negotiated, the default filter states that routes with all RTs should be sent to a peer, regardless of RT value. The local router will just reject the ones it doesn't actually need. Ensure your routers have enough memory to store all these RT-filter "routes" (also called RTC prefixes), as they could create a sizable footprint in the BGP LocRIB on the RRs. This feature is loosely analogous to Outbound Route Filtering (ORF).

Without enabling the feature yet, we configure a basic VPNv4 route exchange with proper RT import/export policies. Both routers show the prefix attributes and the VPNv4 allocated label. The route was successfully imported into at least one VRF, or at a minimum, not filtered from the VPNv4 LocRIB.

CSR1#debug bgp vpnv4 unicast updates in
BGP(4): 11.0.0.11 rcvd UPDATE w/ attr: nexthop 11.0.0.11, origin ?, localpref 100, metric 0, extended community RT:1:111
BGP(4): 11.0.0.11 rcvd 1:1:100.11.11.11/32, label 91000
BGP(4): Revise route installing 1 of 1 routes for 100.11.11.11/32 -> 11.0.0.11(A) to A IP table

We see similar output in XR, except that it is a little harder to read.

RP/0/0/CPU0:XRv11#debug bgp update vpnv4 unicast
bgp[1046]: [default-rtr] (vpn4u): nexthop 11.0.0.1/32, origin ?, localpref 100, metric 0, extended community RT:1:101
bgp[1046]: [default-rtr] (vpn4u): Received prefix 2ASN:1:1:100.1.1.1/32 (path ID: none) with MPLS label 1002 from neighbor 11.0.0.1
bgp[1046]: [default-upd] VRF A: table-attr walk for table TBL:A (1/1), resume version 0, subgrp version 8, target version 10

On CSR1, we remove the RT import for 1:111, so CSR1 should reject the route. The debug doesn't explicitly say the RT was rejected, but the generic "extended community not supported" message normally indicates an RT import issue.

! CSR1
BGP(4): 11.0.0.11 rcvd UPDATE w/ attr: nexthop 11.0.0.11, origin ?, localpref 100, metric 0, extended community RT:1:111
BGP(4): 11.0.0.11 rcvd 1:1:100.11.11.11/32, label 91000 -- DENIED due to: extended community not supported;
BGP(4): no valid path for 1:1:100.11.11.11/32
BGP: topo A:VPNv4 Unicast:base Remove_fwdroute for 1:1:100.11.11.11/32

Likewise on XRv11, we remove the RT import for 1:101. The XR debug is more explicit and tells you there is an issue with the RT import policy.

! XRv11
bgp[1046]: [default-rtr] (vpn4u): nexthop 11.0.0.1/32, origin ?, localpref 100, metric 0, extended community RT:1:101
bgp[1046]: [default-rtr] (vpn4u): Received prefix 2ASN:1:1:100.1.1.1/32 (path ID: none) with MPLS label 1002 from neighbor 11.0.0.1
[default-rtr] (vpn4u): Prefix 2ASN:1:1:100.1.1.1/32 (path ID: none) received from 11.0.0.1 DENIED RT extended community is not imported locally
bgp[1046]: [default-upd] (vpn4u): No unreachable (no path is available) sent to sub-group 0.1 (Regular) with 2ASN:1:1:100.1.1.1/32 - already withdrawn

Assuming that these routers don't want one another's routes, not importing the RTs keeps the VPN topology functional but still requires the receiving PE to process and reject each route individually. This can consume router resources and waste network bandwidth. We will enable the RT-filter feature on both routers. The address-family is named differently on XE and XR: the XE AF is "rtfilter unicast" and the XR AF is "ipv4 rt-filter". Assuming a semi-worthless configuration where neither router imports any RTs, the LocRIB for the RT-filter AF is empty, which implicitly says "my neighbor requested no RTs, so I send him nothing". The debug on CSR1 indicates that the prefix was denied but doesn't give much detail. We know it's due to the dynamic RT filter.

! CSR1
BGP(4): (base) 11.0.0.11 send UPDATE (format) 1:1:100.1.1.1/32, next 11.0.0.1, label 1002, metric 0, path Local, extended community RT:1:101
BGP(4): (base) 11.0.0.11 Peer based policy member(format) 1:1:100.1.1.1/32, next 11.0.0.1 result(denied)

The debug on XRv11 indicates that the update was not sent to CSR1 and is more detailed than the CSR debug.

! XRv11
bgp[1046]: [default-upd] (vpn4u): Deny UPDATE to filter-group 0.2 (Regular, pelem Regular) for 2ASN:1:1:100.11.11.11/32 (changedfl=0x1000000/0x0), path reason Dropped by RT Filter
bgp[1046]: [default-upd] (vpn4u): No unreachable (Dropped by RT Filter) sent to sub-group 0.1 (Regular) with 2ASN:1:1:100.11.11.11/32 - already withdrawn

Next, we will add the baseline RT import policy to CSR1 but not to XRv11 yet. The dynamic RT filter should be updated so that routes with that RT can be advertised between the neighbors. We will activate debugs for both the VPNv4 and rt-filter AFIs. BGP(11) in this case is rt-filter and BGP(4) is VPNv4.

CSR1#debug bgp vpnv4 unicast updates in
CSR1#debug bgp rt-filter unicast updates in
BGP: 111:2:1:111 bestpath, is sourced, is not multipath
BGP(11): 11.0.0.11 NEXT_HOP self is set for sourced RT Filter for net 111:2:1:111
BGP(11): (base) 11.0.0.11 send UPDATE (format) 111:2:1:111, next 11.0.0.1, metric 0, path Local
BGP(4): 11.0.0.11 rcvd UPDATE w/ attr: nexthop 11.0.0.11, origin ?, localpref 100, metric 0, extended community RT:1:111
BGP(4): 11.0.0.11 rcvd 1:1:100.11.11.11/32, label 91000
BGP(4): Revise route installing 1 of 1 routes for 100.11.11.11/32 -> 11.0.0.11(A) to A IP table

First, CSR1 creates an rt-filter RTC prefix in the format A:B:C:D, where:

A = BGP AS number
B = Usually 2 for the extended-community type code
C:D = The route target we are importing

CSR1#show bgp rtfilter unicast all detail
BGP routing table entry for 111:2:1:111, version 2
Paths: (1 available, best #1)
  Advertised to update-groups:
     3
  Refresh Epoch 1
  Local
    0.0.0.0 from 0.0.0.0 (1.1.1.1)
      Origin IGP, localpref 100, weight 32768, valid, sourced, local, best
      RT generation: import
      rx pathid: 0, tx pathid: 0x0
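The A:B:C:D layout described above can be expressed as a small Python helper. This is only a sketch of the text representation seen in the show output, assuming the simple "number:number" RT form used in this lab; `build_rtc`, `parse_rtc`, and `RtcPrefix` are hypothetical names, not part of any Cisco tooling.

```python
from typing import NamedTuple

class RtcPrefix(NamedTuple):
    origin_as: int   # A: BGP AS number of the router originating the RTC prefix
    type_code: int   # B: extended-community type code, usually 2
    rt: str          # C:D: the imported route target

def build_rtc(origin_as: int, rt: str, type_code: int = 2) -> str:
    """Build the A:B:C:D string for one imported RT such as '1:111'."""
    admin, _, assigned = rt.partition(":")
    if not (admin.isdigit() and assigned.isdigit()):
        raise ValueError(f"expected an RT like '1:111', got {rt!r}")
    return f"{origin_as}:{type_code}:{admin}:{assigned}"

def parse_rtc(text: str) -> RtcPrefix:
    """Parse 'A:B:C:D' or 'A:B:C:D/96' back into its parts."""
    body = text.split("/", 1)[0]      # drop the optional prefix length
    a, b, c, d = body.split(":")
    return RtcPrefix(int(a), int(b), f"{c}:{d}")

print(build_rtc(111, "1:111"))        # prints "111:2:1:111"
print(parse_rtc("111:2:1:901/96"))
```

With AS 111 and an import of RT 1:111, the helper yields the same 111:2:1:111 string that CSR1 originates.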

We can also see this on XRv11 as being received from CSR1, both in debug and show commands. XR consults this table before advertising VPN routes to a peer to ensure the peer has requested those RTs. This is similar in operation to the outbound route filtering (ORF) capability, which is used to communicate an inbound prefix filter to the remote peer.

! XRv11
bgp[1046]: [default-upd] (vpn4u): Permit UPDATE to filter-group 0.2 (Regular, pelem Regular) for 2ASN:1:1:100.11.11.11/32 (changedfl=0x1000000/0x0), path

RP/0/0/CPU0:XRv11#show bgp ipv4 rt-filter 111:2:1:111/96
BGP routing table entry for 111:2:1:111/96
Versions:
  Process           bRIB/RIB  SendTblVer
  Speaker                  2           2
Paths: (1 available, best #1)
  Not advertised to any peer
  Path #1: Received by speaker 0
  Not advertised to any peer
  Local
    11.0.0.1 from 11.0.0.1 (1.1.1.1)
      Origin IGP, metric 0, localpref 100, valid, internal, best, group-best
      Received Path ID 0, Local Path ID 1, version 2

Next, let's import some more bogus RTs on CSR1. The router creates three new RTC prefixes in the rtfilter unicast AF and advertises them to XRv11.

CSR1(config-vrf)#route-target import 1:901
CSR1(config-vrf)#route-target import 1:902
CSR1(config-vrf)#route-target import 1:903

! CSR1
BGP: 111:2:1:901 bestpath, is sourced, is not multipath
BGP(11): 11.0.0.11 NEXT_HOP self is set for sourced RT Filter for net 111:2:1:901
(base) 11.0.0.11 send UPDATE (format) 111:2:1:901, next 11.0.0.1, metric 0, path Local
BGP(11): 111:2:1:902 bestpath, is sourced, is not multipath
BGP(11): 11.0.0.11 NEXT_HOP self is set for sourced RT Filter for net 111:2:1:902
(base) 11.0.0.11 send UPDATE (format) 111:2:1:902, next 11.0.0.1, metric 0, path Local
BGP(11): 111:2:1:903 bestpath, is sourced, is not multipath
BGP(11): 11.0.0.11 NEXT_HOP self is set for sourced RT Filter for net 111:2:1:903
BGP(11): (base) 11.0.0.11 send UPDATE (format) 111:2:1:903, next 11.0.0.1, metric 0, path Local

XRv11 isn't going to do anything with those since no one is exporting VPN routes with those RTs, but we can see a 1:1 correlation between the RTs imported on CSR1 and the RTC prefixes advertised to its neighbor.


RP/0/0/CPU0:XRv11#show bgp ipv4 rt-filter
[snip]
Status codes: s suppressed, d damped, h history, * valid, > best
              i - internal, r RIB-failure, S stale, N Nexthop-discard
Origin codes: i - IGP, e - EGP, ? - incomplete
   Network            Next Hop            Metric LocPrf Weight Path
*>i111:2:1:111/96     11.0.0.1                 0    100      0 i
*>i111:2:1:901/96     11.0.0.1                 0    100      0 i
*>i111:2:1:902/96     11.0.0.1                 0    100      0 i
*>i111:2:1:903/96     11.0.0.1                 0    100      0 i

On CSR1, we can look at the details of one of these RTs to see the "RT generation" field set to the value "import", which indicates that the RTC prefix was generated from a VRF importing routes with the matching RT.

CSR1#show bgp rtfilter unicast rt 1:901
BGP routing table entry for 111:2:1:901, version 3
Paths: (1 available, best #1)
  Advertised to update-groups:
     4
  Refresh Epoch 1
  Local
    0.0.0.0 from 0.0.0.0 (1.1.1.1)
      Origin IGP, localpref 100, weight 32768, valid, sourced, local, best
      RT generation: import
      rx pathid: 0, tx pathid: 0x0

For good measure, we will create bogus RTs on XRv11 to observe any relevant debug messages. The XR debugs are very detailed, so irrelevant parts have been omitted.

RP/0/0/CPU0:XRv11(config-vrf-import-rt)#1:801
RP/0/0/CPU0:XRv11(config-vrf-import-rt)#1:802
RP/0/0/CPU0:XRv11(config-vrf-import-rt)#1:803

! XRv11
bgp[1046]: [default-upd] (ipv4rtf): ===UPDATE===: tbl=TBL:default (1/132), afi=11: [snip]: net=111:2:1:803/96 [snip]
bgp[1046]: [default-upd] (ipv4rtf): Permit UPDATE to filter-group 0.2 (Regular, pelem Regular) for 111:2:1:803/96 (changedfl=0x0/0x1000), path
bgp[1046]: [default-upd] (ipv4rtf): ===UPDATE===: tbl=TBL:default (1/132), afi=11: [snip]: net=111:2:1:802/96 [snip]
bgp[1046]: [default-upd] (ipv4rtf): Permit UPDATE to filter-group 0.2 (Regular, pelem Regular) for 111:2:1:802/96 (changedfl=0x0/0x1000), path
bgp[1046]: [default-upd] (ipv4rtf): ===UPDATE===: tbl=TBL:default (1/132), afi=11: [snip]: net=111:2:1:801/96 [snip]
bgp[1046]: [default-upd] (ipv4rtf): Permit UPDATE to filter-group 0.2 (Regular, pelem Regular) for 111:2:1:801/96 (changedfl=0x0/0x1000), path


Verify that CSR1 has received them. We will look at one of them as an example. No surprises. This effectively says "11.0.0.11 wants RT:1:801".

CSR1#show bgp rtfilter unicast rt 1:801
BGP routing table entry for 111:2:1:801, version 8
Paths: (1 available, best #1)
  Not advertised to any peer
  Refresh Epoch 1
  Local
    11.0.0.11 from 11.0.0.11 (11.11.11.11)
      Origin IGP, localpref 100, valid, internal, best
      rx pathid: 0, tx pathid: 0x0

As a final test, let's pretend XRv11 is a route-reflector for the VPNv4 and RT-filter AFIs. Ideally, CSR1 should send all VPN routes it exports to XRv11 without XRv11 having to explicitly request them via individual RTC prefixes. The RTC prefix generated by the "default-originate" command does not follow the formatting rules defined earlier (encoding the BGP AS number, extended-community type code, and RT); it is all zeroes. The zeroes in the RTC instruct the RR client to send all VPN routes, with any RTs, to the BGP peer, which is an RR. Notwithstanding this default route, aggregation is not supported with the RT filter feature.

! XRv11
router bgp 111
 neighbor 11.0.0.1
  address-family vpnv4 unicast
   route-reflector-client
  address-family ipv4 rt-filter
   route-reflector-client
   default-originate

CSR1 shows this "default" RTC and uses it to send VPN routes with any RT to the peer 11.0.0.11, which is the RR.

CSR1#show bgp rtfilter unicast all
[snip]
     Network          Next Hop            Metric LocPrf Weight Path
 *>i 0:0:0:0          11.0.0.11                     100      0 i
 *>  111:2:1:111      0.0.0.0                            32768 i
 *>  111:2:1:901      0.0.0.0                            32768 i
 *>  111:2:1:902      0.0.0.0                            32768 i
 *>  111:2:1:903      0.0.0.0                            32768 i
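The outbound decision described in this section (send nothing when the peer has requested no RTs, send only matching routes otherwise, and send everything when the 0:0:0:0 default RTC is present) can be modeled in a few lines. This is a simplified Python sketch of the behavior observed in the debugs, not the actual BGP implementation; the function names are hypothetical.

```python
DEFAULT_RTC = "0:0:0:0"  # the all-zero RTC seen with default-originate

def rts_requested(rtc_prefixes):
    """Return (wants_everything, set_of_RTs) for the RTC prefixes a peer sent."""
    wanted = set()
    for rtc in rtc_prefixes:
        body = rtc.split("/", 1)[0]          # ignore an optional /96 length
        if body == DEFAULT_RTC:
            return True, set()
        _origin_as, _type_code, c, d = body.split(":")
        wanted.add(f"{c}:{d}")
    return False, wanted

def should_advertise(route_rts, rtc_prefixes):
    """Advertise a VPN route only if the peer asked for one of its RTs."""
    wants_all, wanted = rts_requested(rtc_prefixes)
    return wants_all or bool(wanted & set(route_rts))

# Peer requested nothing: the route is dropped by the RT filter.
print(should_advertise(["1:101"], []))                    # prints "False"
# Peer imports 1:111: the matching route is permitted.
print(should_advertise(["1:111"], ["111:2:1:111/96"]))    # prints "True"
# Peer (an RR) sent the default RTC: everything is permitted.
print(should_advertise(["1:101"], ["0:0:0:0"]))           # prints "True"
```

Note that this only models a session where the RT-filter AF has been negotiated; without negotiation, the text above states that all routes are sent regardless of RT.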

Additional Reading – Reference configurations "rtfilter" and "rtfilter-rr"

27.3 BGP RR-group and Selective RT Retention

This lab investigates how to enable efficient route retention on BGP RRs when using VPNv4 and/or VPNv6. Traditionally, BGP RRs for VPN routes must always maintain all routes from RR-clients, regardless of their RT values. This is because the RR expects the PE to filter RTs that are unwanted. The feature in the previous section is one solution to the problem; it actively enables a communication method between devices. This lab uses a passive feature that is local only, requires no additional AFI negotiation, and can save memory on RRs/PEs in a highly distributed RR architecture. The diagram is shown below. Several unrelated advanced technologies have been enabled for further testing in other sections, such as TE auto-primary tunnels and MVPN, but those are not the focus of this lab.

CSR1, CSR2, XRv1, and XRv2 are BGP RRs for VPNv4 and VPNv6. CSRs 5 through 10 are LSRs within BGP AS 56. The IS-IS areas are built in such a way as to shield the RRs from the "large" IS-IS level-2 domain. This design attempts to move BGP RRs entirely out of the control plane, both from a forwarding and feature perspective (not running MPLS, PIM, etc). All PEs peer with all RRs for VPNv4 and VPNv6. There are 2 VPNs, RED and BLUE. Each CE runs VRF-lite to support both VPNs and each PE provides service for both VPNs. VPN RED uses BGP for IPv4/v6 routing while VPN BLUE uses RIPv2 and OSPFv3 for IPv4 and IPv6 routing, respectively.

The basic configurations, such as interfaces, IS-IS routing, and BGP peers, are not discussed in detail. Each router runs IS-IS on all interfaces as shown in the diagram, with the loopback interfaces advertised. Instead, we quickly verify the network using cursory show commands. On CSR6, we verify the IS-IS level-2 database. In a single show command, I reveal all of the IS-IS level-2 routers and their links while hiding the IP prefixes. Quickly skimming this output, we confirm that it corresponds with the diagram above.

R6#show isis database level-2 detail | include ^[RX]|Extended
R5.00-00              0x0000001A   0xFCCA        653               0/0/0
  Metric: 10         IS-Extended R7.00
  Metric: 10         IS-Extended R6.00
  Metric: 10         IS-Extended R8.00
R6.00-00            * 0x00000018   0xFDEF        416               0/0/0
  Metric: 10         IS-Extended R8.00
  Metric: 10         IS-Extended R5.00
R7.00-00              0x0000001A   0x2682        1122              0/0/0
  Metric: 10         IS-Extended R10.00
  Metric: 10         IS-Extended R9.00
  Metric: 10         IS-Extended R5.00
  Metric: 10         IS-Extended R8.00
R8.00-00              0x0000001B   0x0B9D        975               0/0/0
  Metric: 10         IS-Extended R7.00
  Metric: 10         IS-Extended R6.00
  Metric: 10         IS-Extended R10.00
  Metric: 10         IS-Extended R5.00
R9.00-00              0x00000019   0x7AE0        562               0/0/0
  Metric: 10         IS-Extended R10.00
  Metric: 10         IS-Extended R7.00
R10.00-00             0x0000001C   0x8E89        996               0/0/0
  Metric: 10         IS-Extended R7.00
  Metric: 10         IS-Extended R8.00
  Metric: 10         IS-Extended R9.00

Since CSR6 is an L1L2 router, it also has a level-1 LSPDB. This is much smaller, but we expect to see CSR1 along with a transit link. CSR6 sets the attached-bit so that CSR1 installs a default route via CSR6. The other three L1L2 routers do this as well (not shown below).

R6#show isis database level-1 detail | include ^[RX]|Extended
R1.00-00              0x0000000B   0xCDE7        603               0/0/0
  Metric: 10         IS-Extended R6.00
R6.00-00            * 0x0000000B   0x8844        1161              1/0/0
  Metric: 10         IS-Extended R1.00

MPLS TE is enabled for IS-IS L2 only. This output should mirror the IS-IS level-2 LSPDB exactly, as TE is enabled on all nodes and all links. This is a critical part of the MPLS transport infrastructure as LDP is not enabled on IS-IS interfaces.

R6#show mpls traffic-eng topology brief | include IGP_Id
IGP Id: 0000.0000.0005.00, MPLS TE Id:56.0.0.5 Router Node (isis level-2)
  link[0]: Point-to-Point, Nbr IGP Id: 0000.0000.0007.00, nbr_node_id:14, gen:126
  link[1]: Point-to-Point, Nbr IGP Id: 0000.0000.0006.00, nbr_node_id:11, gen:126
  link[2]: Point-to-Point, Nbr IGP Id: 0000.0000.0008.00, nbr_node_id:15, gen:126
IGP Id: 0000.0000.0006.00, MPLS TE Id:56.0.0.6 Router Node (isis level-2)
  link[0]: Point-to-Point, Nbr IGP Id: 0000.0000.0008.00, nbr_node_id:15, gen:123
  link[1]: Point-to-Point, Nbr IGP Id: 0000.0000.0005.00, nbr_node_id:10, gen:123
IGP Id: 0000.0000.0007.00, MPLS TE Id:56.0.0.7 Router Node (isis level-2)
  link[0]: Point-to-Point, Nbr IGP Id: 0000.0000.0010.00, nbr_node_id:16, gen:125
  link[1]: Point-to-Point, Nbr IGP Id: 0000.0000.0009.00, nbr_node_id:12, gen:125
  link[2]: Point-to-Point, Nbr IGP Id: 0000.0000.0005.00, nbr_node_id:10, gen:125
  link[3]: Point-to-Point, Nbr IGP Id: 0000.0000.0008.00, nbr_node_id:15, gen:125
IGP Id: 0000.0000.0008.00, MPLS TE Id:56.0.0.8 Router Node (isis level-2)
  link[0]: Point-to-Point, Nbr IGP Id: 0000.0000.0007.00, nbr_node_id:14, gen:127
  link[1]: Point-to-Point, Nbr IGP Id: 0000.0000.0006.00, nbr_node_id:11, gen:127
  link[2]: Point-to-Point, Nbr IGP Id: 0000.0000.0010.00, nbr_node_id:16, gen:127
  link[3]: Point-to-Point, Nbr IGP Id: 0000.0000.0005.00, nbr_node_id:10, gen:127
IGP Id: 0000.0000.0009.00, MPLS TE Id:56.0.0.9 Router Node (isis level-2)
  link[0]: Point-to-Point, Nbr IGP Id: 0000.0000.0010.00, nbr_node_id:16, gen:116
  link[1]: Point-to-Point, Nbr IGP Id: 0000.0000.0007.00, nbr_node_id:14, gen:116
IGP Id: 0000.0000.0010.00, MPLS TE Id:56.0.0.10 Router Node (isis level-2)
  link[0]: Point-to-Point, Nbr IGP Id: 0000.0000.0007.00, nbr_node_id:14, gen:117
  link[1]: Point-to-Point, Nbr IGP Id: 0000.0000.0008.00, nbr_node_id:15, gen:117
  link[2]: Point-to-Point, Nbr IGP Id: 0000.0000.0009.00, nbr_node_id:12, gen:117

MPLS LDP is not enabled directly on the transit links. Instead, TE auto-primary tunnels are configured. These tunnels are automatically created to every one-hop peer as determined by the TED. The tunnels have autoroute announce and FRR automatically configured as well. I enable tLDP on them so that routers more than one hop away (CSR6 to CSR9, for example) can still communicate. This feature is discussed in detail in the TE section, but in summary, the idea is to "protect every link" in the topology. This offers protection to LDP and mLDP LSPs, not just TE LSPs, and it scales quite well. CSR6, for example, is the headend for tunnels to CSR5 and CSR8. It is also the tail end for the complementary tunnels from those routers along the reverse path. I do not use this technique often, which is why I configure it for variety here. Normally this feature is paired with auto-backup tunnels for TE-FRR, but I omit that configuration for clarity.

! All LSRs
mpls traffic-eng auto-tunnel primary onehop
mpls traffic-eng auto-tunnel primary config unnumbered-interface Loopback0
mpls traffic-eng auto-tunnel primary config mpls ip
mpls traffic-eng auto-tunnel primary tunnel-num min 500 max 599

R6#show mpls traffic-eng auto-tunnel primary
State: Enabled
Auto primary tunnels: 2 (up: 2, down: 0)
Tunnel ID Range: 500 - 599
Check for deletion of FRR Active onehop tunnels every:0 Sec
Config:
  unnumbered-interface: Loopback0
  mpls ip: TRUE

R6#show mpls traffic-eng tunnels brief | begin TUNNEL
P2P TUNNELS/LSPs:
TUNNEL NAME              DESTINATION      UP IF      DOWN IF    STATE/PROT
R6_t500                  56.0.0.5         -          Gi2.556    up/up
R6_t501                  56.0.0.8         -          Gi2.568    up/up
R5_t500                  56.0.0.6         Gi2.556    -          up/up
R8_t500                  56.0.0.6         Gi2.568    -          up/up
Displayed 2 (of 2) heads, 0 (of 0) midpoints, 2 (of 2) tails
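The "one tunnel per one-hop TE neighbor" behavior can be sanity-checked against the topology with a short Python sketch. The adjacency map below is transcribed from the IS-IS level-2 database shown earlier; the `onehop_tunnels` function is a hypothetical helper for illustration and does not model how IOS builds the tunnels.

```python
# Level-2 adjacencies as listed in the IS-IS database output (R5 through R10).
ADJ = {
    "R5": ["R7", "R6", "R8"],
    "R6": ["R8", "R5"],
    "R7": ["R10", "R9", "R5", "R8"],
    "R8": ["R7", "R6", "R10", "R5"],
    "R9": ["R10", "R7"],
    "R10": ["R7", "R8", "R9"],
}

def onehop_tunnels(adj, node):
    """Return (head-end destinations, tail-end sources) for one router."""
    heads = sorted(adj[node])                                  # tunnels this node originates
    tails = sorted(n for n, nbrs in adj.items() if node in nbrs)  # tunnels terminating here
    return heads, tails

heads, tails = onehop_tunnels(ADJ, "R6")
print(heads, tails)   # prints "['R5', 'R8'] ['R5', 'R8']"

# One unidirectional tunnel per adjacency direction across the whole core.
total_tunnels = sum(len(nbrs) for nbrs in ADJ.values())
print(total_tunnels)  # prints "18"
```

The result for R6 matches the show output: two heads (to CSR5 and CSR8) and two tails (from CSR5 and CSR8).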

The one-hop tunnels always bind implicit-null as the headend is always the penultimate hop. For this reason, many tunnels are PE-P or P-P, requiring an additional label in the stack. Thanks to tLDP, CSR6 is able to send MPLS-encapsulated traffic to CSR10, which is multiple hops away. This is not specific to the features addressed in this lab, but is the basis for all MPLS forwarding and must be verified.

R6#show ip route 56.0.0.10
Routing entry for 56.0.0.10/32
  Known via "isis", distance 115, metric 20
  Tag 56, type level-2
  Redistributing via isis 56
  Last update from 56.0.0.8 on Tunnel501, 01:12:19 ago
  Routing Descriptor Blocks:
  * 56.0.0.8, from 56.0.0.10, 01:12:19 ago, via Tunnel501
      Route metric is 20, traffic share count is 1
      Route tag 56

R6#show mpls ldp bindings 56.0.0.10 32 neighbor 56.0.0.8
  lib entry: 56.0.0.10/32, rev 8
        remote binding: lsr: 56.0.0.8:0, label: 8010

R6#show mpls traffic-eng tunnels tunnel 501 | include Label
  InLabel  :  -
  OutLabel : GigabitEthernet2.568, implicit-null

R6#traceroute 56.0.0.10 source 56.0.0.6
Type escape sequence to abort.
Tracing the route to 56.0.0.10
VRF info: (vrf in name/id, vrf out name/id)
  1 56.6.8.8 [MPLS: Label 8010 Exp 0] 5 msec 3 msec 3 msec
  2 56.8.10.10 3 msec 4 msec 3 msec

Next, we configure BGP towards the RRs. The configuration on all four PEs is identical and simple, as all PEs peer with all four RRs using all supported AFIs. The VRFs will be configured after the core infrastructure is built. IPv4 MDT is also negotiated to support PIM/GRE MVPN, which is beyond the scope of this test as well. Since that AFI has no concept of RT, there is no ability to "filter" prefixes from IPv4 MDT based on RT extended communities.

! CSR5, CSR6, CSR9, CSR10
router bgp 56
 template peer-session IBGP
  remote-as 56
  update-source Loopback0
  timers 10 40
 no bgp default ipv4-unicast
 neighbor 56.0.0.1 inherit peer-session IBGP
 neighbor 56.0.0.2 inherit peer-session IBGP
 neighbor 56.0.0.11 inherit peer-session IBGP
 neighbor 56.0.0.12 inherit peer-session IBGP
 address-family vpnv4
  neighbor 56.0.0.1 activate
  neighbor 56.0.0.2 activate
  neighbor 56.0.0.11 activate
  neighbor 56.0.0.12 activate
 address-family ipv4 mdt
  neighbor 56.0.0.1 activate
  neighbor 56.0.0.2 activate
  neighbor 56.0.0.11 activate
  neighbor 56.0.0.12 activate
 address-family vpnv6
  neighbor 56.0.0.1 activate
  neighbor 56.0.0.2 activate
  neighbor 56.0.0.11 activate
  neighbor 56.0.0.12 activate

The configuration on CSR1 and CSR2 is identical as well. The RRs peer with all PEs and negotiate VPNv4, VPNv6, and IPv4 MDT. This is almost identical to the PE configuration except all peers are RR clients.

! CSR1 and CSR2
router bgp 56
 template peer-policy RR_CLIENT
  route-reflector-client
  send-community both
 template peer-session IBGP
  remote-as 56
  update-source Loopback0
  timers 10 40
 no bgp default ipv4-unicast
 neighbor 56.0.0.5 inherit peer-session IBGP
 neighbor 56.0.0.6 inherit peer-session IBGP
 neighbor 56.0.0.9 inherit peer-session IBGP
 neighbor 56.0.0.10 inherit peer-session IBGP
 address-family vpnv4
  neighbor 56.0.0.5 activate
  neighbor 56.0.0.5 inherit peer-policy RR_CLIENT
  neighbor 56.0.0.6 activate
  neighbor 56.0.0.6 inherit peer-policy RR_CLIENT
  neighbor 56.0.0.9 activate
  neighbor 56.0.0.9 inherit peer-policy RR_CLIENT
  neighbor 56.0.0.10 activate
  neighbor 56.0.0.10 inherit peer-policy RR_CLIENT
 address-family ipv4 mdt
  neighbor 56.0.0.5 activate
  neighbor 56.0.0.5 inherit peer-policy RR_CLIENT
  neighbor 56.0.0.6 activate
  neighbor 56.0.0.6 inherit peer-policy RR_CLIENT
  neighbor 56.0.0.9 activate
  neighbor 56.0.0.9 inherit peer-policy RR_CLIENT
  neighbor 56.0.0.10 activate
  neighbor 56.0.0.10 inherit peer-policy RR_CLIENT
 address-family vpnv6
  bgp rr-group EXTCOML_RR_RED
  neighbor 56.0.0.5 activate
  neighbor 56.0.0.5 inherit peer-policy RR_CLIENT
  neighbor 56.0.0.6 activate
  neighbor 56.0.0.6 inherit peer-policy RR_CLIENT
  neighbor 56.0.0.9 activate
  neighbor 56.0.0.9 inherit peer-policy RR_CLIENT
  neighbor 56.0.0.10 activate
  neighbor 56.0.0.10 inherit peer-policy RR_CLIENT

From a logical standpoint, XRv1 and XRv2 perform the same function. Their configuration is identical as well and effectively mirrors the XE configuration above. The configuration is long only due to the number of clients and AFIs, not due to complexity.

! XRv1 and XRv2
router bgp 56
 bgp cluster-id 56.0.0.11
 address-family vpnv4 unicast
 address-family vpnv6 unicast
 address-family ipv4 mdt
 af-group AF_VPNV4 address-family vpnv4 unicast
  route-reflector-client
 af-group AF_VPNV6 address-family vpnv6 unicast
  route-reflector-client
 af-group AF_IPV4_MDT address-family ipv4 mdt
  route-reflector-client
 session-group IBGP
  remote-as 56
  timers 10 40
  update-source Loopback0
 neighbor 56.0.0.5
  use session-group IBGP
  address-family vpnv4 unicast
   use af-group AF_VPNV4
  address-family vpnv6 unicast
   use af-group AF_VPNV6
  address-family ipv4 mdt
   use af-group AF_IPV4_MDT
 neighbor 56.0.0.6
  use session-group IBGP
  address-family vpnv4 unicast
   use af-group AF_VPNV4
  address-family vpnv6 unicast
   use af-group AF_VPNV6
  address-family ipv4 mdt
   use af-group AF_IPV4_MDT
 neighbor 56.0.0.9
  use session-group IBGP
  address-family vpnv4 unicast
   use af-group AF_VPNV4
  address-family vpnv6 unicast
   use af-group AF_VPNV6
  address-family ipv4 mdt
   use af-group AF_IPV4_MDT
 neighbor 56.0.0.10
  use session-group IBGP
  address-family vpnv4 unicast
   use af-group AF_VPNV4
  address-family vpnv6 unicast
   use af-group AF_VPNV6
  address-family ipv4 mdt
   use af-group AF_IPV4_MDT

At this point, all BGP sessions should form. Note that BGP (XE and XR) will never initiate a session when the longest-match route to a peer is the default route. This means all of the RRs, at this point, will be passively listening for BGP OPEN messages from their peers. They can reply along the default route, but not initiate. Once the sessions form, we quickly verify them. Rather than document all sessions on all RRs for all AFIs, I show CSR1 and XRv1 only; the sessions on the other RRs were verified as up but are not shown. No routes are currently received by the RRs as the VRFs have not yet been configured.

R1#show bgp vpnv4 unicast all summary | begin ^Neigh
Neighbor        V           AS MsgRcvd MsgSent   TblVer  InQ OutQ Up/Down  State/PfxRcd
56.0.0.5        4           56     825     895       77    0    0 02:02:09        0
56.0.0.6        4           56     816     834       77    0    0 02:02:24        0
56.0.0.9        4           56     834     836       77    0    0 02:02:16        0
56.0.0.10       4           56     821     832       77    0    0 02:02:13        0

R1#show bgp vpnv6 unicast all summary | begin ^Neigh
Neighbor        V           AS MsgRcvd MsgSent   TblVer  InQ OutQ Up/Down  State/PfxRcd
56.0.0.5        4           56     826     895      101    0    0 02:02:12        0
56.0.0.6        4           56     816     834      101    0    0 02:02:27        0
56.0.0.9        4           56     834     836      101    0    0 02:02:19        0
56.0.0.10       4           56     822     832      101    0    0 02:02:16        0

R1#show bgp ipv4 mdt all summary | begin ^Neigh
Neighbor        V           AS MsgRcvd MsgSent   TblVer  InQ OutQ Up/Down  State/PfxRcd
56.0.0.5        4           56     826     896        5    0    0 02:02:18        0
56.0.0.6        4           56     817     835        5    0    0 02:02:33        0
56.0.0.9        4           56     835     837        5    0    0 02:02:26        0
56.0.0.10       4           56     822     833        5    0    0 02:02:22        0

RP/0/0/CPU0:XRv1#show bgp vpnv4 unicast summary | begin ^Neigh
Neighbor        Spk    AS MsgRcvd MsgSent   TblVer  InQ OutQ  Up/Down  St/PfxRcd
56.0.0.5          0    56     730     724       59    0    0 01:34:55          0
56.0.0.6          0    56     712     690       59    0    0 01:34:42          0
56.0.0.9          0    56     736     690       59    0    0 01:34:46          0
56.0.0.10         0    56     728     690       59    0    0 01:34:45          0

RP/0/0/CPU0:XRv1#show bgp vpnv6 unicast summary | begin ^Neigh
Neighbor        Spk    AS MsgRcvd MsgSent   TblVer  InQ OutQ  Up/Down  St/PfxRcd
56.0.0.5          0    56     731     725       49    0    0 01:34:58          0
56.0.0.6          0    56     713     691       49    0    0 01:34:44          0
56.0.0.9          0    56     737     691       49    0    0 01:34:49          0
56.0.0.10         0    56     729     691       49    0    0 01:34:48          0

RP/0/0/CPU0:XRv1#show bgp ipv4 mdt summary | begin ^Neigh
Neighbor        Spk    AS MsgRcvd MsgSent   TblVer  InQ OutQ  Up/Down  St/PfxRcd
56.0.0.5          0    56     731     725        7    0    0 01:35:02          0
56.0.0.6          0    56     713     691        7    0    0 01:34:49          0
56.0.0.9          0    56     737     691        7    0    0 01:34:53          0
56.0.0.10         0    56     729     691        7    0    0 01:34:52          0

To prove that BGP cannot initiate sessions via a default route, we can check the TCP brief output on CSR1 and XRv1. Note that each RR has one /32 route to its directly connected L1L2 router, so it is possible that one session may be initiated by the RR. For example, CSR1 has a route to 56.0.0.6/32 and XRv1 has a route to 56.0.0.5/32 as a result of being connected to CSR6 and CSR5, respectively (this matches the level-1 LSPDB on CSR6 and the TCP output below). XRv1 coincidentally initiated the connection to CSR5 (remote port 179), while CSR1 did not initiate any. For the other three sessions on each RR, which are reachable only via the default route, the RRs are servers (local port 179), not initiators.

R1#show tcp brief
TCB                Local Address        Foreign Address        (state)
7FCBA18381D0       56.0.0.1.179         56.0.0.10.40173        ESTAB
7FCBAC1EB498       56.0.0.1.179         56.0.0.6.49282         ESTAB
7FCBA1837850       56.0.0.1.179         56.0.0.9.20797         ESTAB
7FCBA1889000       56.0.0.1.179         56.0.0.5.44310         ESTAB

RP/0/0/CPU0:XRv1#show tcp brief | include ESTAB
0x10180278 0x60000000      0      0  56.0.0.11:179          56.0.0.9:42848         ESTAB
0x1017c368 0x60000000      0      0  56.0.0.11:30859        56.0.0.5:179           ESTAB
0x10145f9c 0x60000000      0      0  56.0.0.11:179          56.0.0.10:25813        ESTAB
0x1016b444 0x60000000      0      0  56.0.0.11:179          56.0.0.6:32154         ESTAB
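The rule described above (a router only opens a BGP session actively when the longest-match route to the peer is something more specific than the default route) can be sketched in Python with the standard `ipaddress` module. This is a model of the stated behavior only, not of any router's internals; the RIB contents and function names are hypothetical.

```python
import ipaddress

def longest_match(rib, peer):
    """Return the most specific prefix in rib covering peer, or None."""
    addr = ipaddress.ip_address(peer)
    candidates = [ipaddress.ip_network(p) for p in rib]
    matches = [net for net in candidates if addr in net]
    return max(matches, key=lambda net: net.prefixlen, default=None)

def may_initiate(rib, peer):
    """Active open is allowed only if the best match is not the default route."""
    best = longest_match(rib, peer)
    return best is not None and best.prefixlen > 0

# Hypothetical RR routing table: a default route learned via the attached bit
# plus a single /32 for the directly connected L1L2 router.
rr_rib = ["0.0.0.0/0", "56.0.0.5/32"]

print(may_initiate(rr_rib, "56.0.0.5"))   # prints "True"
print(may_initiate(rr_rib, "56.0.0.9"))   # prints "False"
```

In the second case the RR must wait for the PE to open the TCP connection, which is why the RR appears with local port 179 in the TCP output.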

The PE-CE setup is configuration intensive but very simple. As described earlier, each CE (CSR3, CSR4, XRv3, and XRv4) runs VRF-lite so it can represent multiple CE routers. VPN RED uses BGP while VPN BLUE uses RIPv2/OSPFv3. The CE configurations are not shown here, and only snippets from the PEs are shown. There are 2 RDs and 8 RTs in use. VPN RED uses RD 56:100 and VPN BLUE uses RD 56:200 on all PEs. Individual PEs use different import/export RTs, shown below in the configuration. CSR5, for example, exports RT:56:213 and imports 3 others, which represent the remote PEs. The same is true inside both VPNs and both AFIs, totaling 8 RTs. Having RT diversity is important to test the RR-group and RT retention features. The MVPN configuration is shown for reference only. It is important to note that RED RTs contain the BGP ASN and a number within the 100-199 range. VPN BLUE is similar except the number is in the 200-299 range, which is important for matching.

! CSR5
vrf definition BLUE
 rd 56:200
 vpn id 56:200
 route-target export 56:213
 route-target import 56:214
 route-target import 56:203
 route-target import 56:204
 address-family ipv4
  mdt preference mldp
  mdt default mpls mldp 56.0.0.7
 address-family ipv6
  mdt preference mldp
  mdt default mpls mldp 56.0.0.7
vrf definition RED
 rd 56:100
 route-target export 56:113
 route-target import 56:114
 route-target import 56:103
 route-target import 56:104
 address-family ipv4
  mdt default 232.0.0.100
 address-family ipv6
  mdt default 232.0.0.100

All of the PEs use a very similar configuration except the RT import/export policies are adjusted slightly. The full-mesh L3VPN is the end-state. The basic PE-CE routing is shown below, along with the VRF-aware RIPv2 and OSPFv3 processes inside VPN BLUE. This is basic L3VPN and is not discussed in detail. The CE configurations are not shown, but they are configured similarly to exchange routes with the PE.

! CSR5
route-map RM_CONN_BGP deny 10
router bgp 56
 address-family ipv4 vrf BLUE
  redistribute connected route-map RM_CONN_BGP
  redistribute rip
 address-family ipv6 vrf BLUE
  redistribute ospf 56
 address-family ipv4 vrf RED
  neighbor 10.5.13.13 remote-as 100
  neighbor 10.5.13.13 activate
  neighbor 10.5.13.13 as-override
 address-family ipv6 vrf RED
  neighbor FD00:10:5:13::13 remote-as 100
  neighbor FD00:10:5:13::13 activate
  neighbor FD00:10:5:13::13 as-override
router ospfv3 56
 address-family ipv6 unicast vrf BLUE
  redistribute bgp 56
  router-id 56.0.0.5
  prefix-suppression
router rip
 address-family ipv4 vrf BLUE
  redistribute bgp 56 metric transparent
  network 10.0.0.0
  no auto-summary
  version 2

Assuming the CEs are configured, the PE should be learning IPv4 and IPv6 routes inside of both RED and BLUE VRFs. All of these routes would be advertised to all four RRs. Checking CSR1 and XRv1, we can see all of the routes are present from all PEs (each PE injects 4 routes per VPN, per AFI). For completeness, I show the entire VPNv4 routing table on CSR1, where it clearly depicts 4 routes from each PE within each RD. These are the routes within the two VPNs (recall that RED uses RD 56:100 and BLUE uses RD 56:200). This makes for 8 total VPN routes from each PE. Using the VPNv6 summary command, we can see this as well, which serves as a cursory check.

R1#show bgp vpnv4 unicast all | begin Network
     Network           Next Hop     Metric LocPrf Weight Path
Route Distinguisher: 56:100
 *>i 100.3.0.0/18      56.0.0.9          0    100      0 100 ?
 *>i 100.3.64.0/18     56.0.0.9          0    100      0 100 ?
 *>i 100.3.128.0/18    56.0.0.9          0    100      0 100 ?
 *>i 100.3.192.0/18    56.0.0.9          0    100      0 100 ?
 *>i 100.4.0.0/18      56.0.0.10         0    100      0 100 ?
 *>i 100.4.64.0/18     56.0.0.10         0    100      0 100 ?
 *>i 100.4.128.0/18    56.0.0.10         0    100      0 100 ?
 *>i 100.4.192.0/18    56.0.0.10         0    100      0 100 ?
 *>i 100.13.0.0/18     56.0.0.5          0    100      0 100 ?
 *>i 100.13.64.0/18    56.0.0.5          0    100      0 100 ?
 *>i 100.13.128.0/18   56.0.0.5          0    100      0 100 ?
 *>i 100.13.192.0/18   56.0.0.5          0    100      0 100 ?
 *>i 100.14.0.0/18     56.0.0.6          0    100      0 100 ?
 *>i 100.14.64.0/18    56.0.0.6          0    100      0 100 ?
 *>i 100.14.128.0/18   56.0.0.6          0    100      0 100 ?
 *>i 100.14.192.0/18   56.0.0.6          0    100      0 100 ?
Route Distinguisher: 56:200
 *>i 200.3.0.0/18      56.0.0.9          1    100      0 ?
 *>i 200.3.64.0/18     56.0.0.9          1    100      0 ?
 *>i 200.3.128.0/18    56.0.0.9          1    100      0 ?
 *>i 200.3.192.0/18    56.0.0.9          1    100      0 ?
 *>i 200.4.0.0/18      56.0.0.10         1    100      0 ?
 *>i 200.4.64.0/18     56.0.0.10         1    100      0 ?
 *>i 200.4.128.0/18    56.0.0.10         1    100      0 ?
 *>i 200.4.192.0/18    56.0.0.10         1    100      0 ?
 *>i 200.13.0.0/18     56.0.0.5          1    100      0 ?
 *>i 200.13.64.0/18    56.0.0.5          1    100      0 ?
 *>i 200.13.128.0/18   56.0.0.5          1    100      0 ?
 *>i 200.13.192.0/18   56.0.0.5          1    100      0 ?
 *>i 200.14.0.0/18     56.0.0.6          1    100      0 ?
 *>i 200.14.64.0/18    56.0.0.6          1    100      0 ?
 *>i 200.14.128.0/18   56.0.0.6          1    100      0 ?
 *>i 200.14.192.0/18   56.0.0.6          1    100      0 ?

R1#show bgp vpnv6 unicast all summary | begin ^Neigh
Neighbor     V   AS MsgRcvd MsgSent TblVer InQ OutQ Up/Down  State/PfxRcd
56.0.0.5     4   56    1038    1128    117   0    0 02:30:49        8
56.0.0.6     4   56    1025    1057    117   0    0 02:31:05        8
56.0.0.9     4   56    1042    1057    117   0    0 02:30:57        8
56.0.0.10    4   56    1030    1052    117   0    0 02:30:54        8

Checking XRv1, we see an interesting problem. None of the VPN routes are marked as best path. Upon closer inspection, XRv1 claims the VPN next-hop is unreachable. This is because XR expects a non-default route to the BGP next-hop, in addition to the peering restriction discussed earlier. XE did not seem to care (CSR1 and CSR2) that the only route to the VPN next-hop was a default route. XRv1 and XRv2 are ineffective as VPN RRs in the design at present.

RP/0/0/CPU0:XRv1#show bgp vpnv4 unicast | begin Network
   Network            Next Hop     Metric LocPrf Weight Path
Route Distinguisher: 56:100
* i100.3.0.0/18       56.0.0.9          0    100      0 100 ?
* i100.3.64.0/18      56.0.0.9          0    100      0 100 ?
* i100.3.128.0/18     56.0.0.9          0    100      0 100 ?
* i100.3.192.0/18     56.0.0.9          0    100      0 100 ?
[snip]

RP/0/0/CPU0:XRv1#show bgp vpnv4 unicast rd 56:100 100.3.0.0/18 | begin 100,
  100, (Received from a RR-client)
    56.0.0.9 (inaccessible) from 56.0.0.9 (56.0.0.9)
      Received Label 9026
      Origin incomplete, metric 0, localpref 100, valid, internal, not-in-vrf
      Received Path ID 0, Local Path ID 0, version 0
      Extended community: RT:56:103
      Connector: type: 1, Value:56:100:56.0.0.9

Since this issue is not the focus of this lab, we can quickly resolve it by leaking PE loopbacks (not core loopbacks) into IS-IS level-1 areas on CSR5 and CSR9. To do this semi-dynamically, I apply a route tag of 56 to the PE loopbacks. The route-map on CSR5 and CSR9 matches this tag and allows those loopbacks to leak into level-1. This has the added benefit of allowing XRv1 and XRv2 to initiate BGP sessions to the PEs, as there is now a non-default route to reach them.

! All PEs
interface Loopback0
 isis tag 56
! CSR5 and CSR9
route-map RM_LEAK_LOOP_L1 permit 10
 match tag 56
router isis 56
 redistribute isis ip level-2 into level-1 route-map RM_LEAK_LOOP_L1

This immediately fixes the problem on XRv1 as shown below. We also check to ensure that XRv1 is receiving VPNv6 routes. Like CSR1, it receives exactly 8, as expected.

RP/0/0/CPU0:XRv1#show bgp vpnv4 unicast | begin Network
   Network            Next Hop     Metric LocPrf Weight Path
Route Distinguisher: 56:100
*>i100.3.0.0/18       56.0.0.9          0    100      0 100 ?
*>i100.3.64.0/18      56.0.0.9          0    100      0 100 ?
*>i100.3.128.0/18     56.0.0.9          0    100      0 100 ?
*>i100.3.192.0/18     56.0.0.9          0    100      0 100 ?

RP/0/0/CPU0:XRv1#show bgp vpnv6 unicast summary | begin ^Neigh
Neighbor     Spk  AS MsgRcvd MsgSent TblVer InQ OutQ Up/Down  St/PfxRcd
56.0.0.5       0  56     991    1006    113   0    0 02:15:01         8
56.0.0.6       0  56     971     962    113   0    0 02:14:47         8
56.0.0.9       0  56     996     962    113   0    0 02:14:51         8
56.0.0.10      0  56     988     962    113   0    0 02:14:50         8
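The next-hop reachability difference observed between XE and XR can be modeled with a longest-prefix lookup that optionally ignores the default route. This is a minimal Python sketch of the behavior described in the text, not of either platform's actual implementation; the function `resolve` and the sample RIB contents are hypothetical.

```python
import ipaddress
from typing import List, Optional


def resolve(next_hop: str, rib: List[str], allow_default: bool) -> Optional[str]:
    """Return the longest matching prefix for next_hop, or None if unreachable."""
    addr = ipaddress.ip_address(next_hop)
    matches = [n for n in map(ipaddress.ip_network, rib) if addr in n]
    if not allow_default:
        # Model the XR behavior seen in this lab: 0.0.0.0/0 does not count.
        matches = [n for n in matches if n.prefixlen > 0]
    if not matches:
        return None
    return str(max(matches, key=lambda n: n.prefixlen))


# Hypothetical level-1 RIB before and after leaking the PE loopback.
rib_before = ["0.0.0.0/0"]
rib_after = ["0.0.0.0/0", "56.0.0.9/32"]

print(resolve("56.0.0.9", rib_before, allow_default=True))   # XE-like lookup
print(resolve("56.0.0.9", rib_before, allow_default=False))  # XR-like lookup
print(resolve("56.0.0.9", rib_after, allow_default=False))   # after the leak
```

With only a default route, the strict lookup finds nothing, which corresponds to the "inaccessible" next-hop; once the /32 is leaked, both lookups succeed.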

With the BGP configuration complete, I quickly test VPN connectivity on a few routers. Although MPLS forwarding is not the focus of this lab, this will ensure BGP was configured properly. We will perform more detailed BGP verification later. Using nested TCL, I can execute 16 pings, one between each pair of addresses within a given VPN. One example is shown below, and it succeeds.

! CSR3
tclsh
foreach x {
 100.13.0.1
 100.13.64.1
 100.13.128.1
 100.13.224.1
} {
 foreach y {
  100.3.0.1
  100.3.64.1
  100.3.128.1
  100.3.224.1
 } {
  ping vrf RED $x source $y repeat 3 timeout 1
 }
}
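To see where the count of 16 comes from, the nested loop can be expanded offline. This short Python sketch only builds the command strings that the TCL loop would issue; it does not run any pings, and the variable names are illustrative.

```python
from itertools import product

destinations = ["100.13.0.1", "100.13.64.1", "100.13.128.1", "100.13.224.1"]
sources = ["100.3.0.1", "100.3.64.1", "100.3.128.1", "100.3.224.1"]

# Outer loop over destinations, inner loop over sources, as in the TCL script.
commands = [
    f"ping vrf RED {x} source {y} repeat 3 timeout 1"
    for x, y in product(destinations, sources)
]
print(len(commands))
print(commands[0])
```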

The issue with this design is that the PEs have many copies of the exact same route. CSR10, for example, receives a copy from all four RRs. Upon deeper inspection, we see that the route from CSR1 is always preferred as it has the lowest BGP peer ID (the final tie-breaker). This is irrelevant since all routes are identical, but it is a waste of memory on both the RRs and PEs.

R10#show bgp vpnv4 unicast vrf RED | begin Network
     Network           Next Hop     Metric LocPrf Weight Path
Route Distinguisher: 56:100 (default for vrf RED)
 * i 100.3.0.0/18      56.0.0.9          0    100      0 100 ?
 * i                   56.0.0.9          0    100      0 100 ?
 * i                   56.0.0.9          0    100      0 100 ?
 *>i                   56.0.0.9          0    100      0 100 ?
 * i 100.3.64.0/18     56.0.0.9          0    100      0 100 ?
 * i                   56.0.0.9          0    100      0 100 ?
 * i                   56.0.0.9          0    100      0 100 ?
 *>i                   56.0.0.9          0    100      0 100 ?
[snip]

R10#show bgp vpnv4 unicast vrf RED 100.3.0.0/18 | include from|valid
    56.0.0.9 (metric 10) (via default) from 56.0.0.11 (56.0.0.11)
      Origin incomplete, metric 0, localpref 100, valid, internal
    56.0.0.9 (metric 10) (via default) from 56.0.0.12 (56.0.0.12)
      Origin incomplete, metric 0, localpref 100, valid, internal
    56.0.0.9 (metric 10) (via default) from 56.0.0.2 (56.0.0.2)
      Origin incomplete, metric 0, localpref 100, valid, internal
    56.0.0.9 (metric 10) (via default) from 56.0.0.1 (56.0.0.1)
      Origin incomplete, metric 0, localpref 100, valid, internal, best
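The tie-breaker above can be sketched in a few lines of Python. This assumes every earlier best-path step ties, which the identical attributes in the output suggest, and that the comparison is numeric on the peer address; in this lab each RR's router ID equals its peering address, so either interpretation of "peer ID" gives the same result. The helper name `lowest_peer` is hypothetical.

```python
import ipaddress
from typing import List


def lowest_peer(peers: List[str]) -> str:
    """Pick the numerically lowest peer address (not the lowest string)."""
    return min(peers, key=ipaddress.ip_address)


all_rrs = ["56.0.0.11", "56.0.0.12", "56.0.0.2", "56.0.0.1"]
print(lowest_peer(all_rrs))

# If 56.0.0.1 stops advertising a route, the next-lowest address wins.
# A plain string comparison would wrongly pick "56.0.0.11" here.
without_csr1 = ["56.0.0.11", "56.0.0.12", "56.0.0.2"]
print(lowest_peer(without_csr1))
```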

With such a distributed RR infrastructure, it would make sense if different RRs could service different VPNs. This is accomplished by selecting which RTs an RR can retain as discussed earlier. On XE, the feature is known as BGP “RR-groups” which are not to be confused with RR clusters. This is a simple configuration which uses an extended-community list of RTs as a pass/fail filter. A certain RR can be configured to retain only specific RTs that match the filter and reject all others. RRs assume that, by default, all RTs must be retained since any PE may need them. Using this method when there are multiple/redundant RRs in the network is a good way to increase scale and conserve memory on RRs and PEs.

On CSR1 and CSR2, I configure opposing policies. CSR1 is allowed to retain only RED RTs while CSR2 is allowed to retain anything except RED RTs. In this design, there is only one non-RED VPN, but if there were more, CSR2 would retain those other RTs as well. This illustrates that both positive and negative logic are supported for the RR-group filter. On XE, the feature is supported for VPNv4, VPNv6, MVPNv4, and MVPNv6. It is not supported for L2VPN VPLS. CSR1 is examined first.

! CSR1
ip extcommunity-list expanded EXTCOML_RR_RED permit RT:56:1..$
router bgp 56
 address-family vpnv4
  bgp rr-group EXTCOML_RR_RED
 address-family vpnv6
  bgp rr-group EXTCOML_RR_RED

Applying this configuration immediately purges any VPN routes with rejected RTs. CSR1 now has no routes for RD 56:200.

R1#show bgp vpnv4 unicast rd 56:200
[no output]

If more routes containing rejected RTs are advertised to the RRs later, they are never installed. The debug output shown below looks familiar, as the rejection looks identical to any RT rejection. The same output would be present if, for example, a PE failed to explicitly import an RT. I highlight the RD 56:200 as well, as this identifies the routes as BLUE. Make no mistake; the filter is not matching the RD, but the RTs carried by the routes. These are highlighted in green.

R1#debug bgp vpnv4 unicast updates in
BGP updates debugging is on (inbound) for address family: VPNv4 Unicast
BGP(4): 56.0.0.9 rcvd UPDATE w/ attr: nexthop 56.0.0.9, origin ?, localpref 100, metric 1, extended community RT:56:203
BGP(4): 56.0.0.9 rcvd 56:200:200.3.0.0/18, label 9034 -- DENIED due to: extended community not supported;
BGP(4): 56.0.0.9 rcvd 56:200:200.3.64.0/18, label 9035 -- DENIED due to: extended community not supported;
BGP(4): 56.0.0.9 rcvd 56:200:200.3.128.0/18, label 9036 -- DENIED due to: extended community not supported;
BGP(4): 56.0.0.9 rcvd 56:200:200.3.192.0/18, label 9037 -- DENIED due to: extended community not supported;

Examining VPN BLUE on CSR10, there are only 3 copies learned. CSR1 no longer advertises one as it cannot even retain the routes in the first place. This indicates that the feature is working correctly. CSR2 is the new best-path given the next-lowest BGP peer ID.

R10#show bgp vpnv4 unicast vrf BLUE 200.3.0.0/18 | include from|valid
    56.0.0.9 (metric 10) (via default) from 56.0.0.11 (56.0.0.11)
      Origin incomplete, metric 1, localpref 100, valid, internal
    56.0.0.9 (metric 10) (via default) from 56.0.0.12 (56.0.0.12)
      Origin incomplete, metric 1, localpref 100, valid, internal
    56.0.0.9 (metric 10) (via default) from 56.0.0.2 (56.0.0.2)
      Origin incomplete, metric 1, localpref 100, valid, internal, best

Next, we will configure CSR2 to reject RED RTs while permitting anything else. The IOS-regex logic is straightforward and applying the filter is identical to how it was done on CSR1. After applying the configuration, we quickly verify that CSR2 is not retaining any RED routes.

! CSR2
ip extcommunity-list expanded EXTCOML_RR_NOT_RED deny RT:56:1..$
ip extcommunity-list expanded EXTCOML_RR_NOT_RED permit RT:[0-9]+:[0-9]+
router bgp 56
 address-family vpnv4
  bgp rr-group EXTCOML_RR_NOT_RED
 address-family vpnv6
  bgp rr-group EXTCOML_RR_NOT_RED

R2#show bgp vpnv4 unicast rd 56:100
[no output]
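The positive and negative RR-group filters can be compared side by side with a small Python sketch. It assumes the extended-community list behaves like an ordered first-match list with an implicit deny, and that Python's `re` module is a close enough stand-in for the IOS regex engine for these simple patterns; it also evaluates one RT per route, as in this lab. The names `evaluate`, `RR_RED`, and `RR_NOT_RED` are hypothetical.

```python
import re
from typing import List, Tuple

Entry = Tuple[str, "re.Pattern[str]"]

# Stand-ins for EXTCOML_RR_RED (CSR1) and EXTCOML_RR_NOT_RED (CSR2).
RR_RED: List[Entry] = [("permit", re.compile(r"RT:56:1..$"))]
RR_NOT_RED: List[Entry] = [
    ("deny", re.compile(r"RT:56:1..$")),
    ("permit", re.compile(r"RT:[0-9]+:[0-9]+")),
]


def evaluate(entries: List[Entry], rt: str) -> str:
    """Return the action of the first matching entry (implicit deny at the end)."""
    for action, pattern in entries:
        if pattern.search(rt):
            return action
    return "deny"


for rt in ("RT:56:103", "RT:56:113", "RT:56:203", "RT:56:213"):
    print(rt, "CSR1:", evaluate(RR_RED, rt), "CSR2:", evaluate(RR_NOT_RED, rt))
```

Every RED RT is permitted by one list and denied by the other, so the two RRs retain disjoint sets of routes.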

On CSR10 within the RED VPN, CSR2 is no longer advertising reachability for RED prefixes. CSR1 remains the best path, but of greater significance is that CSR2 is not advertising any routes.

R10#show bgp vpnv4 unicast vrf RED 100.3.0.0/18 | include from|valid
    56.0.0.9 (metric 10) (via default) from 56.0.0.11 (56.0.0.11)
      Origin incomplete, metric 0, localpref 100, valid, internal
    56.0.0.9 (metric 10) (via default) from 56.0.0.12 (56.0.0.12)
      Origin incomplete, metric 0, localpref 100, valid, internal
    56.0.0.9 (metric 10) (via default) from 56.0.0.1 (56.0.0.1)
      Origin incomplete, metric 0, localpref 100, valid, internal, best


To further clean up the network, we will configure similar policies on XRv1 and XRv2. XRv1 will team up with CSR1 to service the RED VPN, while XRv2 will team up with CSR2 to service anything not in the RED VPN. Rather than create a custom command for this, XR uses the “retain route-target” syntax. This is often used for inter-AS MPLS option B ASBRs so that they can retain all RTs without having to configure the VRF locally or identify iBGP peers as RR-clients. When a peer is identified as an RR-client, all RTs are retained from those peers by default. XR allows the user to restrict this using RPL, so the feature has value on RRs as well as option B ASBRs. Looking at XRv1 first, we create an RT-set using an IOS-regex to match RED RTs. This is passed into a parameterized RPL for flexibility. Last, the RPL is invoked by VPNv4 and VPNv6 AFIs under BGP.

! XRv1
extcommunity-set rt RT_RED
 ios-regex '56:1[01][34]$'
end-set
route-policy RPL_RETAIN_IF_MATCH($RT)
 if extcommunity rt matches-any $RT then
  pass
 endif
end-policy
router bgp 56
 address-family vpnv4 unicast
  retain route-target route-policy RPL_RETAIN_IF_MATCH(RT_RED)
 address-family vpnv6 unicast
  retain route-target route-policy RPL_RETAIN_IF_MATCH(RT_RED)

Like XE, any existing routes with unmatched RTs are immediately purged. XRv1 has no need to maintain a separate table for RD 56:200, which saves memory. Checking the BLUE VRF on CSR10 reveals only 2 routes as both CSR1 and XRv1 have stopped servicing the BLUE VPN. CSR10 still learns RED routes from XRv1, also shown below.

RP/0/0/CPU0:XRv1#show bgp vpnv4 unicast rd 56:200
[no output]

R10#show bgp vpnv4 unicast vrf BLUE 200.3.0.0/18 | include from|valid
    56.0.0.9 (metric 10) (via default) from 56.0.0.12 (56.0.0.12)
      Origin incomplete, metric 1, localpref 100, valid, internal
    56.0.0.9 (metric 10) (via default) from 56.0.0.2 (56.0.0.2)
      Origin incomplete, metric 1, localpref 100, valid, internal, best

R10#show bgp vpnv4 unicast vrf RED 100.3.0.0/18 | include from|valid
    56.0.0.9 (metric 10) (via default) from 56.0.0.11 (56.0.0.11)
      Origin incomplete, metric 0, localpref 100, valid, internal
    56.0.0.9 (metric 10) (via default) from 56.0.0.12 (56.0.0.12)
      Origin incomplete, metric 0, localpref 100, valid, internal
    56.0.0.9 (metric 10) (via default) from 56.0.0.1 (56.0.0.1)
      Origin incomplete, metric 0, localpref 100, valid, internal, best

Last, we configure XRv2 to reject RED routes. RPL helps simplify positive/negative logic by embedding the logic in the conditional construct rather than the list construct. In XE, the permit/deny action was defined in the extended-community list. XRv2 uses the same RT-set as XRv1, but inverts the RPL logic. The RPL rejects the route if it contains any RT matched within the RED list.

! XRv2
extcommunity-set rt RT_RED
 ios-regex '56:1[01][34]$'
end-set
route-policy RPL_REJECT_IF_MATCH($RT)
 if extcommunity rt matches-any $RT then
  drop
 else
  pass
 endif
end-policy
router bgp 56
 address-family vpnv4 unicast
  retain route-target route-policy RPL_REJECT_IF_MATCH(RT_RED)
 address-family vpnv6 unicast
  retain route-target route-policy RPL_REJECT_IF_MATCH(RT_RED)
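The two XR policies share one RT-set and differ only in the conditional. A minimal Python sketch of that idea follows; it assumes Python's `re` approximates the `ios-regex` match on the RT value, and that a route reaching the end of a policy without a `pass` is dropped. The function names are hypothetical stand-ins for the two route policies.

```python
import re
from typing import Iterable

# Stand-in for the RT_RED extcommunity-set used on both XRv1 and XRv2.
RT_RED = re.compile(r"56:1[01][34]$")


def matches_any(rts: Iterable[str]) -> bool:
    """True if at least one RT on the route matches the RED set."""
    return any(RT_RED.search(rt) for rt in rts)


def retain_if_match(rts: Iterable[str]) -> bool:
    """XRv1-style policy: pass on match, otherwise fall through and drop."""
    return matches_any(rts)


def reject_if_match(rts: Iterable[str]) -> bool:
    """XRv2-style policy: drop on match, otherwise pass."""
    return not matches_any(rts)


for rts in (["56:113"], ["56:203"]):
    print(rts, "XRv1 retains:", retain_if_match(rts),
          "XRv2 retains:", reject_if_match(rts))
```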

XRv2 no longer retains any RED routes and has no need to create a table for RD 56:100. CSR10 now only sees 2 RED routes (CSR1 and XRv1), which means CSR2 and XRv2 are no longer servicing RED VPN clients. In this design, we still have HA since there are 2 RRs servicing each VPN. If the PEs were multi-homed, BGP PIC could be used for additional fault tolerance. This design reduced the memory requirements on all 4 RRs and all 4 PEs by 50% by load-splitting the VPN routes between available RRs.

RP/0/0/CPU0:XRv2#show bgp vpnv4 unicast rd 56:100
[no output]

R10#show bgp vpnv4 unicast vrf RED 100.3.0.0/18 | include from|valid
    56.0.0.9 (metric 10) (via default) from 56.0.0.11 (56.0.0.11)
      Origin incomplete, metric 0, localpref 100, valid, internal
    56.0.0.9 (metric 10) (via default) from 56.0.0.1 (56.0.0.1)
      Origin incomplete, metric 0, localpref 100, valid, internal, best

To confirm the claim above with respect to RR memory requirements, we check CSR1 and XRv1 for their VPN route counts per client. These numbers were 8 earlier, but now are 4, since each RR is only accepting one VPN route “color”.

R1#show bgp vpnv4 unicast all summary | begin ^Neigh
Neighbor     V   AS MsgRcvd MsgSent TblVer InQ OutQ Up/Down  State/PfxRcd
56.0.0.5     4   56    1384    1476    117   0    0 03:24:26        4
56.0.0.6     4   56    1372    1403    117   0    0 03:24:42        4
56.0.0.9     4   56    1388    1404    117   0    0 03:24:34        4
56.0.0.10    4   56    1379    1398    117   0    0 03:24:30        4

R1#show bgp vpnv6 unicast all summary | begin ^Neigh
Neighbor     V   AS MsgRcvd MsgSent TblVer InQ OutQ Up/Down  State/PfxRcd
56.0.0.5     4   56    1385    1477    133   0    0 03:24:31        4
56.0.0.6     4   56    1372    1403    133   0    0 03:24:46        4
56.0.0.9     4   56    1389    1404    133   0    0 03:24:38        4
56.0.0.10    4   56    1379    1398    133   0    0 03:24:35        4

RP/0/0/CPU0:XRv1#show bgp vpnv4 unicast summary | begin ^Neigh
Neighbor     Spk  AS MsgRcvd MsgSent TblVer InQ OutQ Up/Down  St/PfxRcd
56.0.0.5       0  56    1261    1263    147   0    0 02:57:39         4
56.0.0.6       0  56    1240    1219    147   0    0 02:57:25         4
56.0.0.9       0  56    1265    1219    147   0    0 02:57:29         4
56.0.0.10      0  56    1257    1219    147   0    0 02:57:28         4

RP/0/0/CPU0:XRv1#show bgp vpnv6 unicast summary | begin ^Neigh
Neighbor     Spk  AS MsgRcvd MsgSent TblVer InQ OutQ Up/Down  St/PfxRcd
56.0.0.5       0  56    1262    1263    129   0    0 02:57:41         4
56.0.0.6       0  56    1241    1219    129   0    0 02:57:28         4
56.0.0.9       0  56    1265    1219    129   0    0 02:57:32         4
56.0.0.10      0  56    1258    1219    129   0    0 02:57:31         4
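The 8-to-4 drop and the 50% figure follow from simple counting. The Python sketch below reproduces that arithmetic using the lab's numbers (4 routes per PE, per VPN, per AFI, and 2 VPNs); the dictionary layout and helper name are illustrative only.

```python
ROUTES_PER_PE_PER_VPN = 4  # per AFI, as stated in the text
VPNS = {"RED", "BLUE"}

# Which VPN "colors" each RR retains after the RT filters are applied.
retained = {
    "CSR1": {"RED"},
    "XRv1": {"RED"},
    "CSR2": {"BLUE"},
    "XRv2": {"BLUE"},
}


def prefixes_per_client(colors: set) -> int:
    """Prefixes an RR accepts from one PE in one AFI."""
    return ROUTES_PER_PE_PER_VPN * len(colors)


before = prefixes_per_client(VPNS)
after = {rr: prefixes_per_client(colors) for rr, colors in retained.items()}

# Copies of each RED prefix that a PE receives from the RRs.
copies_before = len(retained)
copies_after = sum(1 for colors in retained.values() if "RED" in colors)

print(before, after["CSR1"], copies_before, copies_after)
```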

Additional Reading – Reference configurations "bgp-rr-group"

27.4 Accumulated IGP attribute

If you have multiple sub-ASes within a confederation (or you control the IGP metrics for both ASes), you can essentially compare the end-to-end cost using accumulated IGP (AIGP). The remote AS will write the AIGP value, normally by copying the IGP metric (similar to how MED is populated), then advertise it to the BGP neighbor as a new attribute. This new attribute can be added to the IGP metric to the BGP next-hop to determine the total path cost; this cannot be done with MED or cost-communities, although AIGP can be transported via those methods for legacy devices. AIGP is evaluated immediately following local preference in the best path selection process (so, it's evaluated very early). Beware that grossly incompatible IGP metrics between the two ASes, confed-external or true eBGP, will break the usefulness of this feature. EIGRP in one AS versus RIP in another will always be bound by the EIGRP metric, since the RIP metrics will contribute comparatively little to the AIGP. As an odd workaround, you could encode AIGP inside cost-community, use the pre-bestpath POI, and apply the transitive option. This would allow one AS to override the policies of another, even for true eBGP.

27.4.1 Basic AIGP

This lab explores the basics of AIGP and how it is exchanged/processed in a BGP network. The network diagram is shown below. There are two ASes pictured with iBGP internally and eBGP between one another. We are only evaluating prefixes from the perspective of CSR1 and CSR4 in these labs. Most of the community manipulation happens on the ASBRs. The numbers associated with the links are the IGP costs, which are symmetric.

Below I show the configurations for CSR2 and XRv12, but the equivalent AIGP configuration exists on all 4 ASBRs. We need to activate AIGP on all neighbors to pass the attribute. We also have to set the AIGP metric manually, which can be done dynamically based on the IGP cost. This can be applied to prefixes being redistributed into BGP or as an attribute modification within the “network” statement. The RPL construct in XR is very similar. Of note, the IGP costs are constantly adjusted in this lab as well.

! CSR2
route-map AIGP_METRIC permit 10
 set aigp-metric igp-metric
router bgp 10
 no bgp default ipv4-unicast
 address-family ipv4
  network 100.1.1.1 mask 255.255.255.255 route-map AIGP_METRIC
  neighbor 1.1.1.1 aigp
  neighbor 22.0.0.12 aigp
! XRv12
route-policy SET_AIGP
 set aigp-metric igp-cost
end-policy
router bgp 20
 address-family ipv4 unicast
  network 100.4.4.4/32 route-policy SET_AIGP
 neighbor 33.0.0.3
  address-family ipv4 unicast
   aigp
 neighbor 4.4.4.4
  address-family ipv4 unicast
   aigp


Before looking at any routing information, we will verify the configuration between XRv12 and CSR4 as a demonstration. AIGP is a neighbor-specific feature but is not a negotiated capability nor a community value. Rather, it is an additional BGP path attribute, much like local-preference or MED. These commands confirm that AIGP was configured correctly.

CSR4#show bgp ipv4 unicast neighbors 12.12.12.12 | begin For address family
 For address family: IPv4 Unicast
  Session: 12.12.12.12
  BGP table version 1, neighbor version 1/0
  Output queue size : 0
  Index 1, Advertise bit 0
  AIGP is enabled
  1 update-group member

RP/0/0/CPU0:XRv2#show bgp ipv4 unicast neighbors 4.4.4.4 | begin For Address Family
 For Address Family: IPv4 Unicast
  BGP neighbor version 2
  Update group: 0.2 Filter-group: 0.1 No Refresh request being processed
  Route refresh request: received 0, sent 0
  0 accepted prefixes, 0 are bestpaths
  Cumulative no. of prefixes denied: 0.
  Prefix advertised 0, suppressed 0, withdrawn 0
  Maximum prefixes allowed 1048576
  Threshold for warning message 75%, restart interval 0 min
  AIGP is enabled
  An EoR was received during read-only mode
[snip]

A quick debug on CSR2 reveals that CSR4’s loopback, received from XRv12, now carries a new AIGP-metric field, which is currently set to XRv12’s IGP metric to 100.4.4.4/32.

CSR2#debug bgp ipv4 unicast update
BGP updates debugging is on for address family: IPv4 Unicast
BGP(0): 22.0.0.12 rcvd UPDATE w/ attr: nexthop 22.0.0.12, origin i, metric 10, aigp-metric 10, merged path 20, AS_PATH
BGP(0): 22.0.0.12 rcvd 100.4.4.4/32

We will look at CSR4’s route to CSR1’s loopback to start. The AIGP metric is better via XRv13 (11), but AIGP + local IGP metric (42) is worse than the sum of the same values carried by the XRv12 route (41). This is what makes AIGP special; unlike MED, it is combined with the IGP cost to form a pseudo end-to-end metric across AS boundaries.

CSR4#show bgp ipv4 unicast 100.1.1.1/32
BGP routing table entry for 100.1.1.1/32, version 12
Paths: (2 available, best #2, table default)
  Not advertised to any peer
  Refresh Epoch 1
  10
    13.13.13.13 (metric 31) from 13.13.13.13 (13.13.13.13)
      Origin IGP, aigp-metric 11, metric 11, localpref 100, valid, internal
      rx pathid: 0, tx pathid: 0
  Refresh Epoch 1
  10
    12.12.12.12 (metric 10) from 12.12.12.12 (12.12.12.12)
      Origin IGP, aigp-metric 31, metric 31, localpref 100, valid, internal, best
      rx pathid: 0, tx pathid: 0x0

Next, we reduce CSR4’s cost to XRv13 to 28. The AIGP metric is better via XRv13 (11), even though the IGP metric (29) is worse. The sum is still less for the best route (40 vs. 41). Notice that even though AIGP is compared early in the path selection process, this comparison is the sum of the AIGP and IGP costs, not just comparing the AIGP cost, as proven below.

CSR4#show bgp ipv4 unicast 100.1.1.1/32
BGP routing table entry for 100.1.1.1/32, version 11
Paths: (2 available, best #1, table default)
  Not advertised to any peer
  Refresh Epoch 1
  10
    13.13.13.13 (metric 29) from 13.13.13.13 (13.13.13.13)
      Origin IGP, aigp-metric 11, metric 11, localpref 100, valid, internal, best
      rx pathid: 0, tx pathid: 0x0
  Refresh Epoch 1
  10
    12.12.12.12 (metric 10) from 12.12.12.12 (12.12.12.12)
      Origin IGP, aigp-metric 31, metric 31, localpref 100, valid, internal
      rx pathid: 0, tx pathid: 0
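The two CSR4 outcomes can be reproduced with a small Python sketch that compares the sum of AIGP and the local IGP metric to the next-hop. This models only the AIGP step with both paths carrying the attribute, and it does not model what happens on a tie; the `Path` type and function names are hypothetical.

```python
from typing import List, NamedTuple


class Path(NamedTuple):
    next_hop: str
    aigp: int             # AIGP value carried in the BGP update
    igp_to_next_hop: int  # local IGP metric to the BGP next-hop


def total_cost(path: Path) -> int:
    """Pseudo end-to-end metric: received AIGP plus local IGP cost."""
    return path.aigp + path.igp_to_next_hop


def best_by_aigp(paths: List[Path]) -> Path:
    return min(paths, key=total_cost)


# Values taken from the two CSR4 outputs above.
first = [Path("13.13.13.13", 11, 31), Path("12.12.12.12", 31, 10)]
second = [Path("13.13.13.13", 11, 29), Path("12.12.12.12", 31, 10)]

print(best_by_aigp(first).next_hop, [total_cost(p) for p in first])
print(best_by_aigp(second).next_hop, [total_cost(p) for p in second])
```

In the first case the sums are 42 and 41, so the path with the higher AIGP wins; lowering the local IGP cost changes the sums to 40 and 41 and flips the result.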

Below is an example where AIGP + local IGP is the same in both cases (41). Best-path declares a tie and continues with the selection process. In most cases, the AIGP metric was copied from the IGP metric, just like MED. Therefore, MED determines that XRv13 is the better route, and generally speaking, the lower remote IGP cost will break the tie.

CSR4#show bgp ipv4 unicast 100.1.1.1/32
BGP routing table entry for 100.1.1.1/32, version 13
Paths: (2 available, best #1, table default)
  Not advertised to any peer
  Refresh Epoch 1
  10
    13.13.13.13 (metric 30) from 13.13.13.13 (13.13.13.13)
      Origin IGP, aigp-metric 11, metric 11, localpref 100, valid, internal, best
      rx pathid: 0, tx pathid: 0x0
  Refresh Epoch 1
  10
    12.12.12.12 (metric 10) from 12.12.12.12 (12.12.12.12)
      Origin IGP, aigp-metric 31, metric 31, localpref 100, valid, internal
      rx pathid: 0, tx pathid: 0

If AIGP is not present on one of the paths, the path that does carry AIGP automatically wins (assuming weight and local preference are equal), even at maximum value. This is a little odd as I personally like the behavior of the cost-community where the lack of a value assumes the midpoint between minimum and maximum.

CSR1#show bgp ipv4 unicast 100.4.4.4/32
BGP routing table entry for 100.4.4.4/32, version 31
Paths: (2 available, best #1, table default)
  Not advertised to any peer
  Refresh Epoch 6
  20
    2.2.2.2 (metric 31) from 2.2.2.2 (2.2.2.2)
      Origin IGP, aigp-metric 4294967295, metric 10, localpref 100, valid, internal, best
      rx pathid: 0, tx pathid: 0x0
  Refresh Epoch 6
  20
    3.3.3.3 (metric 11) from 3.3.3.3 (3.3.3.3)
      Origin IGP, metric 20, localpref 100, valid, internal
      rx pathid: 0, tx pathid: 0
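The presence rule can be added to the AIGP step as follows. This Python sketch encodes only what the output above demonstrates (a path carrying AIGP beats a path without it, regardless of value) plus the tie case described earlier; returning `None` to mean "continue to the next best-path step" is a modeling choice, and all names are hypothetical.

```python
from typing import NamedTuple, Optional


class Path(NamedTuple):
    next_hop: str
    aigp: Optional[int]   # None when the attribute is absent
    igp_to_next_hop: int


def aigp_step(a: Path, b: Path) -> Optional[Path]:
    """Return the winner at the AIGP step, or None to fall through."""
    if (a.aigp is None) != (b.aigp is None):
        # Exactly one path carries AIGP: it wins, even at maximum value.
        return a if a.aigp is not None else b
    if a.aigp is None:
        return None  # neither path carries AIGP, so the step is skipped
    cost_a = a.aigp + a.igp_to_next_hop
    cost_b = b.aigp + b.igp_to_next_hop
    if cost_a == cost_b:
        return None  # tie: later steps such as MED decide
    return a if cost_a < cost_b else b


with_max_aigp = Path("2.2.2.2", 4294967295, 31)
without_aigp = Path("3.3.3.3", None, 11)
winner = aigp_step(with_max_aigp, without_aigp)
print(winner.next_hop if winner else "tie")
```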

Just to prove where AIGP sits in the best path, local preference has been increased for the route via CSR3, and now it is preferred since AIGP is not evaluated. By extension, modifying weight would also work (no example shown).

CSR1#show bgp ipv4 unicast 100.4.4.4/32
BGP routing table entry for 100.4.4.4/32, version 10
Paths: (2 available, best #1, table default)
  Not advertised to any peer
  Refresh Epoch 2
  20
    3.3.3.3 (metric 11) from 3.3.3.3 (3.3.3.3)
      Origin IGP, metric 20, localpref 200, valid, internal, best
      rx pathid: 0, tx pathid: 0x0
  Refresh Epoch 1
  20
    2.2.2.2 (metric 31) from 2.2.2.2 (2.2.2.2)
      Origin IGP, aigp-metric 4294967295, metric 10, localpref 100, valid, internal
      rx pathid: 0, tx pathid: 0


Also, let's increase the AS path length of the CSR2 route. The CSR2 route still wins because AIGP is processed first.

CSR1#show bgp ipv4 unicast 100.4.4.4/32
BGP routing table entry for 100.4.4.4/32, version 12
Paths: (2 available, best #2, table default)
  Not advertised to any peer
  Refresh Epoch 3
  20
    3.3.3.3 (metric 11) from 3.3.3.3 (3.3.3.3)
      Origin IGP, metric 20, localpref 100, valid, internal
      rx pathid: 0, tx pathid: 0
  Refresh Epoch 1
  20 20 20 20
    2.2.2.2 (metric 31) from 2.2.2.2 (2.2.2.2)
      Origin IGP, aigp-metric 4294967295, metric 10, localpref 100, valid, internal, best
      rx pathid: 0, tx pathid: 0x0

Below is another example where a lower AIGP does not necessarily imply best-path. AIGP is just a component of the metric as the local IGP metric still counts, assuming the AIGP attribute is present on all contending routes. The best path has a total metric of 31 while the path with lower AIGP has a total metric of 41.

CSR1#show bgp ipv4 unicast 100.4.4.4/32
BGP routing table entry for 100.4.4.4/32, version 29
Paths: (2 available, best #2, table default)
  Not advertised to any peer
  Refresh Epoch 6
  20
    2.2.2.2 (metric 31) from 2.2.2.2 (2.2.2.2)
      Origin IGP, aigp-metric 10, metric 10, localpref 100, valid, internal
      rx pathid: 0, tx pathid: 0
  Refresh Epoch 6
  20
    3.3.3.3 (metric 11) from 3.3.3.3 (3.3.3.3)
      Origin IGP, aigp-metric 20, metric 20, localpref 100, valid, internal, best
      rx pathid: 0, tx pathid: 0x0

Last, I show another example of an AIGP + local IGP metric tie (31). The route through CSR2 is selected due to lower MED.

CSR1#show bgp ipv4 unicast 100.4.4.4/32
BGP routing table entry for 100.4.4.4/32, version 33
Paths: (2 available, best #1, table default)
  Not advertised to any peer
  Refresh Epoch 7
  20
    2.2.2.2 (metric 21) from 2.2.2.2 (2.2.2.2)
      Origin IGP, aigp-metric 10, metric 10, localpref 100, valid, internal, best
      rx pathid: 0, tx pathid: 0x0
  Refresh Epoch 7
  20
    3.3.3.3 (metric 11) from 3.3.3.3 (3.3.3.3)
      Origin IGP, aigp-metric 20, metric 20, localpref 100, valid, internal
      rx pathid: 0, tx pathid: 0

Additional Reading – Reference configurations "aigp-basic"

27.4.2 AIGP with cost-communities and BGP confederations

AIGP can also be transported inside the cost-community. Note: As seen in the cost-community section, this does not work for true eBGP peers, only confederations and iBGP. The exception is the "transitive" option, which is discussed later in this section; oddly, there is no way to pass classic cost-communities across eBGP boundaries, so it appears to be an AIGP-only extension. The network has been modified to place AS 10 and AS 20 in a confederation together. The network diagram is below; it is the same design with minor BGP modifications to construct a confederation between AS 10 and AS 20.

Although much less effective than real AIGP, you can encode the AIGP metric in a cost-community. If used with the pre-bestpath POI, it effectively trumps all local policies set by the receiving router and is even more powerful than ordinary AIGP. Below are the outputs from CSR1 and CSR4, respectively. Using it with igp-cost POI is probably most realistic as it comes after the IGP metric to BGP next-hop, but the two are not added together. In XR, a special command is needed to exchange standard/extended communities with eBGP peers; within iBGP, communities are automatically exchanged. In IOS, community exchange must be specified per neighbor, per type, regardless of the remote AS number.

The configuration snippet from XRv13 is below. The other three ASBRs have similar configurations to correspond with the show commands that follow.

! XRv13
router bgp 20
 af-group AF_IPV4 address-family ipv4 unicast
  aigp send cost-community 44 poi pre-bestpath
  next-hop-self
 neighbor 33.0.0.3
  address-family ipv4 unicast
   use af-group AF_IPV4
   send-extended-community-ebgp

CSR1#show bgp ipv4 unicast 100.4.4.4/32
BGP routing table entry for 100.4.4.4/32, version 33
Paths: (2 available, best #1, table default)
  Not advertised to any peer
  Refresh Epoch 6
  (20)
    2.2.2.2 (metric 31) from 2.2.2.2 (2.2.2.2)
      Origin IGP, metric 10, localpref 100, valid, confed-internal, best
      Extended Community: Cost:pre-bestpath:44:10
      rx pathid: 0, tx pathid: 0x0
  Refresh Epoch 5
  (20)
    3.3.3.3 (metric 11) from 3.3.3.3 (3.3.3.3)
      Origin IGP, metric 20, localpref 100, valid, confed-internal
      Extended Community: Cost:pre-bestpath:44:20
      rx pathid: 0, tx pathid: 0

CSR4#show bgp ipv4 unicast 100.1.1.1/32
BGP routing table entry for 100.1.1.1/32, version 63
Paths: (2 available, best #1, table default)
  Not advertised to any peer
  Refresh Epoch 1
  (10)
    13.13.13.13 (metric 20) from 13.13.13.13 (13.13.13.13)
      Origin IGP, metric 11, localpref 100, valid, internal, best
      Extended Community: Cost:pre-bestpath:99:11
      rx pathid: 0, tx pathid: 0x0
  Refresh Epoch 1
  (10)
    12.12.12.12 (metric 10) from 12.12.12.12 (12.12.12.12)
      Origin IGP, metric 31, localpref 100, valid, internal
      Extended Community: Cost:pre-bestpath:99:31
      rx pathid: 0, tx pathid: 0


If you tell CSR2 and CSR3 to send AIGP inside cost-community towards CSR1, each adds a second cost-community with metric 0. CSR1 sees both communities. The communities are compared in sequence, which is discussed in detail in the cost-community section. Essentially, the lowest community IDs are compared together first. Since there is no tie, the higher community IDs are never evaluated.

CSR1#show bgp ipv4 unicast 100.4.4.4/32
Paths: (2 available, best #1, table default)
  Not advertised to any peer
  Refresh Epoch 6
  (20)
    2.2.2.2 (metric 31) from 2.2.2.2 (2.2.2.2)
      Origin IGP, metric 10, localpref 100, valid, confed-internal, best
      Extended Community: Cost:pre-bestpath:44:10 Cost:pre-bestpath:99:0
      rx pathid: 0, tx pathid: 0x0
  Refresh Epoch 5
  (20)
    3.3.3.3 (metric 11) from 3.3.3.3 (3.3.3.3)
      Origin IGP, metric 20, localpref 100, valid, confed-internal
      Extended Community: Cost:pre-bestpath:44:20 Cost:pre-bestpath:99:0
      rx pathid: 0, tx pathid: 0

I assume the second community carries a metric of 0 because the IGP cost to the BGP next-hop, for eBGP peers in this network, is zero: the first non-BGP route that the recursive lookup finds is a connected route. CSR2's perspective is shown below.

CSR2#show ip route 100.4.4.4
Routing entry for 100.4.4.4/32
  Known via "bgp 10", distance 200, metric 10
  Tag 20, type internal
  Last update from 22.0.0.12 00:09:43 ago
  Routing Descriptor Blocks:
  * 22.0.0.12, from 22.0.0.12, 00:09:43 ago
      Route metric is 10, traffic share count is 1
      AS Hops 0
      Route tag 20
      MPLS label: none

CSR2#show ip route 22.0.0.12
Routing entry for 22.0.0.0/8
  Known via "connected", distance 0, metric 0 (connected, via interface)
  Routing Descriptor Blocks:
  * directly connected, via GigabitEthernet2.522
      Route metric is 0, traffic share count is 1

AIGP is also available inside the cost-community with the transitive option. Using the transitive keyword, we can transport the cost-community across true eBGP boundaries (configuration not shown). The confederation has been removed and AS 10 and AS 20 are separate ASes now. All the same cost-community rules apply. I have not found a way to do this with regular cost-communities.

! XRv12
router bgp 20
 af-group AF_IPV4 address-family ipv4 unicast
  aigp send cost-community 44 poi pre-bestpath transitive
  next-hop-self

CSR1#show bgp ipv4 unicast 100.4.4.4/32
BGP routing table entry for 100.4.4.4/32, version 6
Paths: (2 available, best #2, table default)
  Not advertised to any peer
  Refresh Epoch 1
  20
    3.3.3.3 (metric 11) from 3.3.3.3 (3.3.3.3)
      Origin IGP, metric 20, localpref 100, valid, internal
      Extended Community: Cost(transitive):pre-bestpath:44:20
      rx pathid: 0, tx pathid: 0
  Refresh Epoch 1
  20
    2.2.2.2 (metric 31) from 2.2.2.2 (2.2.2.2)
      Origin IGP, metric 10, localpref 100, valid, internal, best
      Extended Community: Cost(transitive):pre-bestpath:44:10
      rx pathid: 0, tx pathid: 0x0

Additional Reading – Reference configurations "aigp-costcom"

27.5 Cost-Community / Point Of Insertion (POI)

Somewhat similar to AIGP, the cost-community was developed as a mechanism to carry cost information around BGP networks. It has many different purposes, which are described in this chapter, along with different points of insertion (POI) that determine when this community is evaluated. This feature is quite old, so newer features like accumulated IGP (AIGP) can use it to transport AIGP metrics for older routers. AIGP is more advanced and requires newer code versions (15.4(2)S for IOS, 3.12 for XE, 4.0.0 for XR).

Pre-bestpath POI was designed for EIGRP MPLS L3VPN architectures where backdoor links exist. Normally BGP would prefer the locally sourced routes over the BGP-learned ones due to the automatic 32,768 weight value applied. This would lead to a looping condition or suboptimal routing. Using the cost-community fixes it: if a prefix loops in the control-plane back to a PE, its cost will have increased across the customer network, so it is higher, and therefore less preferred, than the cost-community carried by the original VPNv4 prefix from the source PE. This is why cost-community is valuable for overriding the weight attribute.


Another use for pre-bestpath is for extreme "hot potato" routing, where getting the traffic out of the AS as quickly as possible is desired. Because it pre-empts even the weight attribute, it can be used to change path selection within an AS or between confederation sub-ASes (not possible with true eBGP peers). There is one exception where the cost-community is fully transitive, and that is when the AIGP metric is transported in the cost-community with the "transitive" keyword. I have not found a way to establish transitivity without the AIGP application. The ordinary IOS route-map and XR RPL do not have an option for it, and the attribute is simply not exchanged.

IGP-cost is the default POI for cost-community with the exception of EIGRP MPLS L3VPN, which uses pre-bestpath (and is fully automatic; it requires no configuration). The rules for cost-community processing are the same (many are illustrated below), but the IGP-cost POI is evaluated after the IGP metric to the BGP next-hop criterion. It is effectively a new tie-breaker that occurs before the router-ID comparison, and it applies to iBGP, confed-internal, and confed-external paths.

Hierarchical cost-community selection is achievable by affixing multiple cost-communities to a specific BGP prefix. It should be supported for all AFI/SAFI on all platforms that understand the cost-community in general. To protect your network from an accidental hijack, you can use "bgp bestpath cost-community ignore". BGP simply ignores the values and proceeds as if they never existed.

This feature is tested using a topology similar to the AIGP one, with two BGP sub-ASes wrapped into a confederation. Because the cost-community is not transitive, we cannot test it with true eBGP.

These examples will not show the configuration changes each time. For all of them, the cost-community application on XRv12 and/or XRv13 was changed as the text describes. Here are the snippets from XRv12, but the actual community values vary during the experiments. Similar configurations are applied to XRv13. Additional communities and component routes are added to the configurations over time, but the general construct is shown below. The "COST_COMM_SET" is changed constantly in these experiments, but the other constructs generally remain the same. The basic network configurations are also not shown.

! XRv12
extcommunity-set cost COST_COMM_SET
  IGP:21:100
end-set
route-policy COST_COMM
  set extcommunity cost COST_COMM_SET
end-policy
router bgp 20
 address-family ipv4 unicast
  network 100.4.4.4/32 route-policy COST_COMM

Before completing the confederation configuration, I quickly prove that the cost-community is not transitive. Below I attempt true eBGP between AS 10 and AS 20. Notice how CSR2 sets the cost-community locally and "debug bgp updates" verifies it was sent to XRv12.

CSR2#show bgp ipv4 unicast 100.1.1.1/32
BGP routing table entry for 100.1.1.1/32, version 2
Paths: (1 available, best #1, table default)
  Advertised to update-groups:
     10         11
  Refresh Epoch 1
  Local
    12.0.0.1 from 0.0.0.0 (2.2.2.2)
      Origin IGP, metric 31, localpref 100, weight 32768, valid, sourced, local, best
      Extended Community: Cost:pre-bestpath:22:30
      rx pathid: 0, tx pathid: 0x0

CSR2#debug bgp ipv4 unicast update
BGP(0): (base) 22.0.0.12 send UPDATE (format) 100.1.1.1/32, next 22.0.0.2, metric 31, path Local, extended community Cost:pre-bestpath:22:30

Despite receiving the update successfully, XRv12 never honors that value. The same was true in the other direction, despite extended communities being enabled between the neighbors.

XRv12#show bgp ipv4 unicast 100.1.1.1/32 | begin Paths
Paths: (1 available, best #1)
  Advertised to peers (in unique update groups):
    4.4.4.4
  Path #1: Received by speaker 0
  Advertised to peers (in unique update groups):
    4.4.4.4
  10
    22.0.0.2 from 22.0.0.2 (2.2.2.2)
      Origin IGP, metric 31, localpref 100, valid, external, best, group-best
      Received Path ID 0, Local Path ID 1, version 23
      Origin-AS validity: not-found

When AS 10 and AS 20 are placed in a confederation together (with appropriate next-hop changes), the cost-community immediately appears. Cisco clearly states eBGP is not supported as the cost-community is not transitive. Notice that XR makes no attempt to decode the "pre-bestpath" code of 128. This number is specific to the pre-bestpath POI.

XRv12#show bgp ipv4 unicast 100.1.1.1/32 | begin Paths
Paths: (1 available, best #1)
  Advertised to peers (in unique update groups):
    4.4.4.4
  Path #1: Received by speaker 0
  Advertised to peers (in unique update groups):
    4.4.4.4
  (10)
    22.0.0.2 from 22.0.0.2 (2.2.2.2)
      Origin IGP, metric 31, localpref 100, valid, confed-external, best, group-best
      Received Path ID 0, Local Path ID 1, version 27
      Extended community: COST:128:22:30
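The COST:128:22:30 string can be read as POI, community ID, and cost. As a rough illustration only, the Python sketch below packs and unpacks those three fields. The 8-byte layout (type 0x43, sub-type 0x01, one-byte POI, one-byte ID, four-byte cost) and the value 129 for the IGP-cost POI are assumptions taken from my reading of the cost-community Internet-Draft, not from the outputs in this section. The function names are hypothetical.

```python
import struct

# Assumed wire layout (not confirmed by this text):
# type 0x43 (non-transitive opaque), sub-type 0x01 (cost community),
# then POI (1 byte), community ID (1 byte), cost (4 bytes, big-endian).
POI_NAMES = {128: "pre-bestpath", 129: "igp"}  # 129 is an assumption


def pack_cost_community(poi: int, comm_id: int, cost: int) -> bytes:
    """Build the 8-byte extended community value."""
    return struct.pack("!BBBBI", 0x43, 0x01, poi, comm_id, cost)


def unpack_cost_community(raw: bytes) -> str:
    """Render the community the way the IOS show output prints it."""
    _type, _subtype, poi, comm_id, cost = struct.unpack("!BBBBI", raw)
    return f"Cost:{POI_NAMES.get(poi, poi)}:{comm_id}:{cost}"


raw = pack_cost_community(128, 22, 30)
print(raw.hex())
print(unpack_cost_community(raw))
```

IOS decodes POI 128 into the "pre-bestpath" keyword, while XR prints the raw number, which is why the same community looks different on the two platforms.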

With respect to the community ID, the lower ID wins, even if cost is worse. This assumes that each prefix only has one cost-community with the same POI. Below, I demonstrate that 22:10 is better than 33:5 due to having a lower community ID.

CSR1#show bgp ipv4 unicast 100.4.4.4/32
BGP routing table entry for 100.4.4.4/32, version 46
Paths: (2 available, best #2, table default)
  Not advertised to any peer
  Refresh Epoch 3
  (20)
    3.3.3.3 (metric 11) from 3.3.3.3 (3.3.3.3)
      Origin IGP, metric 20, localpref 100, valid, confed-internal
      Extended Community: Cost:pre-bestpath:33:5
      rx pathid: 0, tx pathid: 0
  Refresh Epoch 5
  (20)
    2.2.2.2 (metric 31) from 2.2.2.2 (2.2.2.2)
      Origin IGP, metric 10, localpref 100, valid, confed-internal, best
      Extended Community: Cost:pre-bestpath:22:10
      rx pathid: 0, tx pathid: 0x0


When the IDs match, a tie is declared, and the cost value is compared. The lower cost always wins in this case, as expected. In short, it is an ordered comparison of ID first, cost value second.

CSR1#show bgp ipv4 unicast 100.4.4.4/32
BGP routing table entry for 100.4.4.4/32, version 47
Paths: (2 available, best #1, table default)
  Not advertised to any peer
  Refresh Epoch 3
  (20)
    3.3.3.3 (metric 11) from 3.3.3.3 (3.3.3.3)
      Origin IGP, metric 20, localpref 100, valid, confed-internal, best
      Extended Community: Cost:pre-bestpath:22:5
      rx pathid: 0, tx pathid: 0x0
  Refresh Epoch 5
  (20)
    2.2.2.2 (metric 31) from 2.2.2.2 (2.2.2.2)
      Origin IGP, metric 10, localpref 100, valid, confed-internal
      Extended Community: Cost:pre-bestpath:22:10
      rx pathid: 0, tx pathid: 0

When there are multiple cost communities, the lowest IDs are compared first, then the next-lowest, and so on. In this example, the loser route has an additional cost-community with very low cost, but it was never considered since the ID 22 cost-community was already defeated. So, even though the winner route doesn't have a second cost-community for comparison, it doesn't matter.

CSR1#show bgp ipv4 unicast 100.4.4.4/32
BGP routing table entry for 100.4.4.4/32, version 47
Paths: (2 available, best #1, table default)
  Not advertised to any peer
  Refresh Epoch 3
  (20)
    3.3.3.3 (metric 11) from 3.3.3.3 (3.3.3.3)
      Origin IGP, metric 20, localpref 100, valid, confed-internal, best
      Extended Community: Cost:pre-bestpath:22:5
      rx pathid: 0, tx pathid: 0x0
  Refresh Epoch 5
  (20)
    2.2.2.2 (metric 31) from 2.2.2.2 (2.2.2.2)
      Origin IGP, metric 10, localpref 100, valid, confed-internal
      Extended Community: Cost:pre-bestpath:22:10 Cost:pre-bestpath:33:1
      rx pathid: 0, tx pathid: 0

Just to prove it, here is an example where both paths have two cost communities but only the first was compared. The first route still wins as the second cost-community was never compared.

CSR1#show bgp ipv4 unicast 100.4.4.4/32
BGP routing table entry for 100.4.4.4/32, version 48
Paths: (2 available, best #1, table default)
  Not advertised to any peer
  Refresh Epoch 3
  (20)
    3.3.3.3 (metric 11) from 3.3.3.3 (3.3.3.3)
      Origin IGP, metric 20, localpref 100, valid, confed-internal, best
      Extended Community: Cost:pre-bestpath:22:5 Cost:pre-bestpath:33:33
      rx pathid: 0, tx pathid: 0x0
  Refresh Epoch 5
  (20)
    2.2.2.2 (metric 31) from 2.2.2.2 (2.2.2.2)
      Origin IGP, metric 10, localpref 100, valid, confed-internal
      Extended Community: Cost:pre-bestpath:22:10 Cost:pre-bestpath:33:1
      rx pathid: 0, tx pathid: 0

When the first cost communities are tied, only then do the following cost communities act as tiebreakers. Now the route via CSR2 is the best based on the next-lowest cost-community ID / cost pair (33:1 beats 33:33).

CSR1#show bgp ipv4 unicast 100.4.4.4/32
BGP routing table entry for 100.4.4.4/32, version 49
Paths: (2 available, best #2, table default)
  Not advertised to any peer
  Refresh Epoch 3
  (20)
    3.3.3.3 (metric 11) from 3.3.3.3 (3.3.3.3)
      Origin IGP, metric 20, localpref 100, valid, confed-internal
      Extended Community: Cost:pre-bestpath:22:10 Cost:pre-bestpath:33:33
      rx pathid: 0, tx pathid: 0
  Refresh Epoch 5
  (20)
    2.2.2.2 (metric 31) from 2.2.2.2 (2.2.2.2)
      Origin IGP, metric 10, localpref 100, valid, confed-internal, best
      Extended Community: Cost:pre-bestpath:22:10 Cost:pre-bestpath:33:1
      rx pathid: 0, tx pathid: 0x0
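The ordered comparison shown in the last several outputs can be modeled in a few lines of Python. This is a minimal sketch of the observed behavior, not Cisco's implementation. It assumes that a path lacking a given community ID is treated as carrying the default cost (2147483647) for that ID, which is consistent with every output in this section. The function name and the dictionary representation are hypothetical.

```python
DEFAULT_COST = 2**31 - 1  # 2147483647, assumed for any ID a path does not carry


def compare_cost_communities(a: dict, b: dict) -> int:
    """Compare two paths at a single POI.

    a and b map community ID -> cost. Returns -1 if a is preferred,
    1 if b is preferred, and 0 if this POI ends in a tie.
    """
    for comm_id in sorted(set(a) | set(b)):      # lowest community ID first
        cost_a = a.get(comm_id, DEFAULT_COST)
        cost_b = b.get(comm_id, DEFAULT_COST)
        if cost_a != cost_b:
            # Lower cost wins and higher IDs are never evaluated.
            return -1 if cost_a < cost_b else 1
    return 0


# Last example above: ID 22 ties at cost 10, so ID 33 decides.
via_csr3 = {22: 10, 33: 33}
via_csr2 = {22: 10, 33: 1}
print(compare_cost_communities(via_csr3, via_csr2))
```

Under this model, "the lower ID wins" is simply the case where one path carries an explicit cost for the lowest ID and the other path falls back to the default for it.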

Paths without the cost-community are assigned a value at the midpoint of the 32-bit range: 2^31 - 1 (2147483647). This is considered the "default" value. The feature makes the assumption that if you don't specify anything, the middle value is used for comparison. This is in contrast to AIGP, which automatically wins if it is present, even at its maximum value. The path through CSR2 wins with the value set to default - 1, but just barely so.

CSR1#show bgp ipv4 unicast 100.4.4.4/32
BGP routing table entry for 100.4.4.4/32, version 52
Paths: (2 available, best #2, table default)
  Not advertised to any peer
  Refresh Epoch 3
  (20)
    3.3.3.3 (metric 11) from 3.3.3.3 (3.3.3.3)
      Origin IGP, metric 20, localpref 100, valid, confed-internal
      rx pathid: 0, tx pathid: 0
  Refresh Epoch 5
  (20)
    2.2.2.2 (metric 31) from 2.2.2.2 (2.2.2.2)
      Origin IGP, metric 10, localpref 100, valid, confed-internal, best
      Extended Community: Cost:pre-bestpath:22:2147483646 (default-1)
      rx pathid: 0, tx pathid: 0x0

If we increase the cost value to default + 1, CSR3 becomes the best path despite not carrying the community at all.

CSR1#show bgp ipv4 unicast 100.4.4.4/32
BGP routing table entry for 100.4.4.4/32, version 53
Paths: (2 available, best #1, table default)
  Not advertised to any peer
  Refresh Epoch 3
  (20)
    3.3.3.3 (metric 11) from 3.3.3.3 (3.3.3.3)
      Origin IGP, metric 20, localpref 100, valid, confed-internal, best
      rx pathid: 0, tx pathid: 0x0
  Refresh Epoch 5
  (20)
    2.2.2.2 (metric 31) from 2.2.2.2 (2.2.2.2)
      Origin IGP, metric 10, localpref 100, valid, confed-internal
      Extended Community: Cost:pre-bestpath:22:2147483648 (default+1)
      rx pathid: 0, tx pathid: 0

When one prefix has a default cost-community (explicit default) and the other has no community at all (implicit default), the BGP best-path selection process simply continues as this evaluation criterion ties. Below is an example; CSR3 wins in a tie due to lower IGP metric to BGP next hop. You may have expected CSR2 to win based on lower MED, but due to the BGP confederation, the MED is ignored in this case. You can use "bgp bestpath med confed" to select CSR2 over CSR3 in this case, but this is not the default behavior. We can tell the peers are in a confederation based on the parenthetical AS path.

CSR1#show bgp ipv4 unicast 100.4.4.4/32
BGP routing table entry for 100.4.4.4/32, version 53
Paths: (2 available, best #1, table default)
  Not advertised to any peer
  Refresh Epoch 3
  (20)
    3.3.3.3 (metric 11) from 3.3.3.3 (3.3.3.3)
      Origin IGP, metric 20, localpref 100, valid, confed-internal, best
      rx pathid: 0, tx pathid: 0x0
  Refresh Epoch 5
  (20)
    2.2.2.2 (metric 31) from 2.2.2.2 (2.2.2.2)
      Origin IGP, metric 10, localpref 100, valid, confed-internal
      Extended Community: Cost:pre-bestpath:22:2147483647 (default)
      rx pathid: 0, tx pathid: 0
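The three default-value experiments (default - 1, default + 1, and exactly default) reduce to simple integer arithmetic. The Python sketch below replays them for a single community ID. It is an illustration under the assumption that a missing community is equivalent to the default cost; the helper names are hypothetical and "tie" means that best-path selection moves on to the next criterion.

```python
from typing import Optional

DEFAULT_COST = 2**31 - 1   # 2147483647, the midpoint of the 32-bit cost range
MAX_COST = 2**32 - 1       # largest value a 32-bit cost field can hold


def effective_cost(cost: Optional[int]) -> int:
    """A path with no cost-community is compared as if it carried the default."""
    return DEFAULT_COST if cost is None else cost


def winner(csr2_cost: Optional[int], csr3_cost: Optional[int]) -> str:
    """Return which path this POI prefers, or 'tie' if it cannot decide."""
    a, b = effective_cost(csr2_cost), effective_cost(csr3_cost)
    if a == b:
        return "tie"
    return "CSR2" if a < b else "CSR3"


print(winner(DEFAULT_COST - 1, None))  # CSR2 barely wins
print(winner(DEFAULT_COST + 1, None))  # CSR3 wins without any community
print(winner(DEFAULT_COST, None))      # explicit vs. implicit default
```

Because the default sits in the middle of the range, an administrator can make a path either more or less attractive than an untagged path by choosing a value below or above it.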

So far we have only looked at pre-bestpath, which is a very extreme way to change the path selection before any other attributes are considered, even weight. Below shows an attempt to override pre-bestpath behavior using weight; it fails. CSR2 is still the best path.

CSR1#show bgp ipv4 unicast 100.4.4.4/32
BGP routing table entry for 100.4.4.4/32, version 6
Paths: (2 available, best #1, table default)
  Not advertised to any peer
  Refresh Epoch 2
  (20)
    2.2.2.2 (metric 31) from 2.2.2.2 (2.2.2.2)
      Origin IGP, metric 10, localpref 100, valid, confed-internal, best
      Extended Community: Cost:pre-bestpath:22:214
      rx pathid: 0, tx pathid: 0x0
  Refresh Epoch 2
  (20)
    3.3.3.3 (metric 11) from 3.3.3.3 (3.3.3.3)
      Origin IGP, metric 20, localpref 100, weight 99, valid, confed-internal
      rx pathid: 0, tx pathid: 0

When we change the cost-community back to the default value to create a tie, weight takes effect and wins. I also changed the IGP metric to BGP next-hop for the CSR3 route to 1000 in order to demonstrate that the CSR3 route did not win due to that value, as it did earlier. When the cost-community is compared against a route with no cost-community at the same POI, the community ID means nothing.

CSR1#show bgp ipv4 unicast 100.4.4.4/32
BGP routing table entry for 100.4.4.4/32, version 7
Paths: (2 available, best #2, table default)
  Not advertised to any peer
  Refresh Epoch 2
  (20)
    2.2.2.2 (metric 31) from 2.2.2.2 (2.2.2.2)
      Origin IGP, metric 10, localpref 100, valid, confed-internal
      Extended Community: Cost:pre-bestpath:22:2147483647 (default)
      rx pathid: 0, tx pathid: 0
  Refresh Epoch 2
  (20)
    3.3.3.3 (metric 1000) from 3.3.3.3 (3.3.3.3)
      Origin IGP, metric 20, localpref 100, weight 99, valid, confed-internal, best
      rx pathid: 0, tx pathid: 0x0
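The relationship between the pre-bestpath POI and weight in the two outputs above can be expressed as a sort key. The Python sketch below is a deliberately reduced model: it covers only a single pre-bestpath community and the weight attribute, ignores every later best-path step, and assumes a missing community equals the default cost. The dictionary keys are hypothetical names chosen for this illustration.

```python
DEFAULT_COST = 2**31 - 1  # assumed when a path has no pre-bestpath community


def best_path(paths: list) -> dict:
    """Pick the best path using only pre-bestpath cost, then weight."""
    def key(path: dict) -> tuple:
        return (
            path.get("pre_bestpath_cost", DEFAULT_COST),  # lower cost first
            -path.get("weight", 0),                       # then higher weight
        )
    return min(paths, key=key)


csr2 = {"name": "CSR2", "pre_bestpath_cost": 214}
csr3 = {"name": "CSR3", "weight": 99}
print(best_path([csr2, csr3])["name"])   # cost 214 beats the implicit default

csr2["pre_bestpath_cost"] = DEFAULT_COST  # explicit default creates a tie
print(best_path([csr2, csr3])["name"])   # weight 99 now decides
```

Placing the cost ahead of weight in the tuple is what makes pre-bestpath so aggressive: weight is only consulted when the cost comparison ties.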

Next, we evaluate the IGP POI. This considers the IGP cost-community just after the IGP metric to BGP next-hop comparison. It's essentially a tie-breaker for routes before comparing the oldest route (external) or router ID (confed-internal, confed-external, or internal). The only way eBGP routes would carry this attribute is by way of the AIGP inside cost-community with transitive option, so realistically, it is preempting the lowest RID (or originator ID with route-reflector) comparisons.

As a quick comparison, when the type of cost-community varies, they are not compared directly. CSR3 wins despite having a larger cost value, because its community is evaluated at the pre-bestpath POI against CSR2's implicit default of 2,147,483,647 at that POI.

CSR1#show bgp ipv4 unicast 100.4.4.4/32
BGP routing table entry for 100.4.4.4/32, version 11
Paths: (2 available, best #2, table default)
  Not advertised to any peer
  Refresh Epoch 5
  (20)
    2.2.2.2 (metric 31) from 2.2.2.2 (2.2.2.2)
      Origin IGP, metric 10, localpref 100, valid, confed-internal
      Extended Community: Cost:igp:22:12
      rx pathid: 0, tx pathid: 0
  Refresh Epoch 5
  (20)
    3.3.3.3 (metric 31) from 3.3.3.3 (3.3.3.3)
      Origin IGP, metric 20, localpref 100, valid, confed-internal, best
      Extended Community: Cost:pre-bestpath:22:13
      rx pathid: 0, tx pathid: 0x0

When the two are the same, they are considered only when the IGP metric to BGP next-hop ties. CSR2 wins in a tie due to lower RID (again, because MED is not compared within the confederation).

CSR1#show bgp ipv4 unicast 100.4.4.4/32
BGP routing table entry for 100.4.4.4/32, version 12
Paths: (2 available, best #1, table default)
  Not advertised to any peer
  Refresh Epoch 5
  (20)
    2.2.2.2 (metric 31) from 2.2.2.2 (2.2.2.2)
      Origin IGP, metric 10, localpref 100, valid, confed-internal, best
      Extended Community: Cost:igp:22:12
      rx pathid: 0, tx pathid: 0x0
  Refresh Epoch 5
  (20)
    3.3.3.3 (metric 31) from 3.3.3.3 (3.3.3.3)
      Origin IGP, metric 20, localpref 100, valid, confed-internal
      Extended Community: Cost:igp:22:12
      rx pathid: 0, tx pathid: 0

If we adjust the cost to a higher number for CSR2's prefix, CSR3 wins due to a lower cost-community value. The community ID remains 22 so that the costs can be compared directly.

CSR1#show bgp ipv4 unicast 100.4.4.4/32
BGP routing table entry for 100.4.4.4/32, version 13
Paths: (2 available, best #2, table default)
  Not advertised to any peer
  Refresh Epoch 5
  (20)
    2.2.2.2 (metric 31) from 2.2.2.2 (2.2.2.2)
      Origin IGP, metric 10, localpref 100, valid, confed-internal
      Extended Community: Cost:igp:22:999
      rx pathid: 0, tx pathid: 0
  Refresh Epoch 5
  (20)
    3.3.3.3 (metric 31) from 3.3.3.3 (3.3.3.3)
      Origin IGP, metric 20, localpref 100, valid, confed-internal, best
      Extended Community: Cost:igp:22:12
      rx pathid: 0, tx pathid: 0x0

Just like pre-bestpath, the lack of the IGP-cost cost-community compares against the default value. CSR3 wins because CSR2's value is 2,147,483,647 (default).

CSR1#show bgp ipv4 unicast 100.4.4.4/32
BGP routing table entry for 100.4.4.4/32, version 13
Paths: (2 available, best #2, table default)
  Not advertised to any peer
  Refresh Epoch 5
  (20)
    2.2.2.2 (metric 31) from 2.2.2.2 (2.2.2.2)
      Origin IGP, metric 10, localpref 100, valid, confed-internal
      rx pathid: 0, tx pathid: 0
  Refresh Epoch 5
  (20)
    3.3.3.3 (metric 31) from 3.3.3.3 (3.3.3.3)
      Origin IGP, metric 20, localpref 100, valid, confed-internal, best
      Extended Community: Cost:igp:22:12
      rx pathid: 0, tx pathid: 0x0

As expected, CSR2 wins in a tie due to lower RID when CSR3 has the default cost-community metric. This is slightly different than the earlier example to prove the behavior of the default metric when using IGP cost POI.

CSR1#show bgp ipv4 unicast 100.4.4.4/32
BGP routing table entry for 100.4.4.4/32, version 14
Paths: (2 available, best #1, table default)
  Not advertised to any peer
  Refresh Epoch 5
  (20)
    2.2.2.2 (metric 31) from 2.2.2.2 (2.2.2.2)
      Origin IGP, metric 10, localpref 100, valid, confed-internal, best
      rx pathid: 0, tx pathid: 0x0
  Refresh Epoch 5
  (20)
    3.3.3.3 (metric 31) from 3.3.3.3 (3.3.3.3)
      Origin IGP, metric 20, localpref 100, valid, confed-internal
      Extended Community: Cost:igp:22:2147483647 (default)
      rx pathid: 0, tx pathid: 0

The next example proves that the IGP cost comparison happens after the ordinary IGP metric to BGP next-hop. The cost to CSR2 has been reduced but CSR3 has a better cost-community. CSR2 is the best path, though; the IGP cost cost-community is only compared when the IGP metric to BGP next-hop is a tie. It can be viewed as the "first tie breaker" ahead of the BGP RID. This POI was designed to make a less arbitrary best-path decision by assigning administrator-defined values that take effect when the tie-breakers begin.

CSR1#show bgp ipv4 unicast 100.4.4.4/32
BGP routing table entry for 100.4.4.4/32, version 16
Paths: (2 available, best #1, table default)
  Not advertised to any peer
  Refresh Epoch 5
  (20)
    2.2.2.2 (metric 30) from 2.2.2.2 (2.2.2.2)
      Origin IGP, metric 10, localpref 100, valid, confed-internal, best
      rx pathid: 0, tx pathid: 0x0
  Refresh Epoch 5
  (20)
    3.3.3.3 (metric 31) from 3.3.3.3 (3.3.3.3)
      Origin IGP, metric 20, localpref 100, valid, confed-internal
      Extended Community: Cost:igp:22:21
      rx pathid: 0, tx pathid: 0
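The position of the IGP-cost POI in the decision process can be sketched as a three-part sort key. This Python fragment is only a model of the tail end of best-path selection as observed in the IGP POI examples: it assumes all earlier steps (weight, local-pref, AS path, and so on) have tied, that a single community ID is in use, and that a missing community equals the default cost. The dictionary keys are hypothetical.

```python
import ipaddress

DEFAULT_COST = 2**31 - 1  # assumed when a path has no igp-cost community


def tie_break_key(path: dict) -> tuple:
    """Order of evaluation once the earlier best-path steps have tied."""
    return (
        path["igp_metric"],                        # IGP metric to BGP next-hop
        path.get("igp_cost", DEFAULT_COST),        # igp-cost POI cost-community
        int(ipaddress.IPv4Address(path["rid"])),   # lowest BGP router ID
    )


def best_path(paths: list) -> dict:
    return min(paths, key=tie_break_key)


# Final example above: CSR2 has the lower IGP metric, so the better
# cost-community on CSR3 is never consulted.
csr2 = {"name": "CSR2", "igp_metric": 30, "rid": "2.2.2.2"}
csr3 = {"name": "CSR3", "igp_metric": 31, "igp_cost": 21, "rid": "3.3.3.3"}
print(best_path([csr2, csr3])["name"])
```

Swapping the first two tuple elements would turn this into something closer to pre-bestpath behavior, which is a useful way to remember how the two POIs differ.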

The cost-community also interacts with BGP route aggregation. The aggregate route selects the highest cost-community value, but only if the IDs match AND the "as-set" flag is used on the aggregate. If "as-set" is not applied, no communities are carried over. The first two outputs show two component /32 routes with varying cost-communities. The aggregate /24 route uses the larger one.

CSR1#show bgp ipv4 unicast 100.4.4.4/32
BGP routing table entry for 100.4.4.4/32, version 18
Paths: (1 available, best #1, table default)
  Not advertised to any peer
  Refresh Epoch 6
  (20)
    3.3.3.3 (metric 31) from 3.3.3.3 (3.3.3.3)
      Origin IGP, metric 20, localpref 100, valid, confed-internal, best
      Extended Community: Cost:igp:22:300
      rx pathid: 0, tx pathid: 0x0


CSR1#show bgp ipv4 unicast 100.4.4.44/32
BGP routing table entry for 100.4.4.44/32, version 19
Paths: (1 available, best #1, table default)
  Not advertised to any peer
  Refresh Epoch 6
  (20)
    2.2.2.2 (metric 30) from 2.2.2.2 (2.2.2.2)
      Origin IGP, metric 10, localpref 100, valid, confed-internal, best
      Extended Community: Cost:igp:22:100
      rx pathid: 0, tx pathid: 0x0

CSR1#show bgp ipv4 unicast 100.4.4.0/24
BGP routing table entry for 100.4.4.0/24, version 21
Paths: (1 available, best #1, table default)
  Advertised to update-groups:
     5
  Refresh Epoch 1
  (20), (aggregated by 1020 100.1.1.1)
    0.0.0.0 from 0.0.0.0 (100.1.1.1)
      Origin IGP, localpref 100, weight 32768, valid, aggregated, local, best
      Extended Community: Cost:igp:22:300
      rx pathid: 0, tx pathid: 0x0

If one or more component routes don't have a cost-community, the largest explicit value is applied. The documentation says the default value is applied, but I did not observe this. The route via CSR2 does not have any cost-community and the aggregate selected the value from CSR3's route.

CSR1#show bgp ipv4 unicast 100.4.4.4/32
BGP routing table entry for 100.4.4.4/32, version 3
Paths: (1 available, best #1, table default)
  Not advertised to any peer
  Refresh Epoch 1
  (20)
    3.3.3.3 (metric 31) from 3.3.3.3 (3.3.3.3)
      Origin IGP, metric 20, localpref 100, valid, confed-internal, best
      Extended Community: Cost:igp:22:300
      rx pathid: 0, tx pathid: 0x0

CSR1#show bgp ipv4 unicast 100.4.4.44/32
BGP routing table entry for 100.4.4.44/32, version 4
Paths: (1 available, best #1, table default)
  Not advertised to any peer
  Refresh Epoch 1
  (20)
    2.2.2.2 (metric 30) from 2.2.2.2 (2.2.2.2)
      Origin IGP, metric 10, localpref 100, valid, confed-internal, best
      rx pathid: 0, tx pathid: 0x0


CSR1#show bgp ipv4 unicast 100.4.4.0/24
BGP routing table entry for 100.4.4.0/24, version 5
Paths: (1 available, best #1, table default)
  Advertised to update-groups:
     6
  Refresh Epoch 1
  (20), (aggregated by 1020 100.1.1.1)
    0.0.0.0 from 0.0.0.0 (100.1.1.1)
      Origin IGP, localpref 100, weight 32768, valid, aggregated, local, best
      Extended Community: Cost:igp:22:300
      rx pathid: 0, tx pathid: 0x0

If one or more component routes have mismatched cost-community IDs, individual cost-communities are added to the aggregate for each unique ID with a default cost value. CSR2 and CSR3 have mismatched IDs and the aggregate has two IGP cost-communities, both with default cost values, per unique ID. This isn't very useful for the aggregate, since the presence of the default cost-community effectively does nothing for BGP best-path comparisons. At this point, the cost-community values just become "tags" that would be used to identify the community IDs carried by the component routes.

CSR1#show bgp ipv4 unicast 100.4.4.4/32
BGP routing table entry for 100.4.4.4/32, version 3
Paths: (1 available, best #1, table default)
  Not advertised to any peer
  Refresh Epoch 1
  (20)
    3.3.3.3 (metric 31) from 3.3.3.3 (3.3.3.3)
      Origin IGP, metric 20, localpref 100, valid, confed-internal, best
      Extended Community: Cost:igp:22:300
      rx pathid: 0, tx pathid: 0x0

CSR1#show bgp ipv4 unicast 100.4.4.44/32
BGP routing table entry for 100.4.4.44/32, version 6
Paths: (1 available, best #1, table default)
  Not advertised to any peer
  Refresh Epoch 1
  (20)
    2.2.2.2 (metric 30) from 2.2.2.2 (2.2.2.2)
      Origin IGP, metric 10, localpref 100, valid, confed-internal, best
      Extended Community: Cost:igp:21:100
      rx pathid: 0, tx pathid: 0x0

CSR1#show bgp ipv4 unicast 100.4.4.0/24
BGP routing table entry for 100.4.4.0/24, version 7
Paths: (1 available, best #1, table default)
  Advertised to update-groups:
     6
  Refresh Epoch 1
  (20), (aggregated by 1020 100.1.1.1)
    0.0.0.0 from 0.0.0.0 (100.1.1.1)
      Origin IGP, localpref 100, weight 32768, valid, aggregated, local, best
      Extended Community: Cost:igp:21:2147483647 (default) Cost:igp:22:2147483647 (default)
      rx pathid: 0, tx pathid: 0x0

If one or more component routes have varying POI types (pre-bestpath versus IGP cost), the same rules apply as demonstrated above. If the IDs do not match, the metric is set to the default, and each custom ID is added to the aggregate. The only difference is that the POI is maintained. The above example has been extended with a third component route to illustrate this. The aggregate has two IGP and one pre-bestpath communities, each with varying IDs, and all with default metric.

CSR1#show bgp ipv4 unicast 100.4.4.4/32
BGP routing table entry for 100.4.4.4/32, version 3
Paths: (1 available, best #1, table default)
  Not advertised to any peer
  Refresh Epoch 1
  (20)
    3.3.3.3 (metric 31) from 3.3.3.3 (3.3.3.3)
      Origin IGP, metric 20, localpref 100, valid, confed-internal, best
      Extended Community: Cost:igp:22:300
      rx pathid: 0, tx pathid: 0x0

CSR1#show bgp ipv4 unicast 100.4.4.44/32
BGP routing table entry for 100.4.4.44/32, version 6
Paths: (1 available, best #1, table default)
  Not advertised to any peer
  Refresh Epoch 1
  (20)
    2.2.2.2 (metric 30) from 2.2.2.2 (2.2.2.2)
      Origin IGP, metric 10, localpref 100, valid, confed-internal, best
      Extended Community: Cost:igp:21:100
      rx pathid: 0, tx pathid: 0x0

CSR1#show bgp ipv4 unicast 100.4.4.144/32
BGP routing table entry for 100.4.4.144/32, version 8
Paths: (1 available, best #1, table default)
  Not advertised to any peer
  Refresh Epoch 1
  (20)
    2.2.2.2 (metric 30) from 2.2.2.2 (2.2.2.2)
      Origin IGP, metric 10, localpref 100, valid, confed-internal, best
      Extended Community: Cost:pre-bestpath:7:7
      rx pathid: 0, tx pathid: 0x0

CSR1#show bgp ipv4 unicast 100.4.4.0/24
BGP routing table entry for 100.4.4.0/24, version 9
Paths: (1 available, best #1, table default)
  Advertised to update-groups:
     6
  Refresh Epoch 1
  (20), (aggregated by 1020 100.1.1.1)
    0.0.0.0 from 0.0.0.0 (100.1.1.1)
      Origin IGP, localpref 100, weight 32768, valid, aggregated, local, best
      Extended Community: Cost:pre-bestpath:7:2147483647 (default) Cost:igp:21:2147483647 (default) Cost:igp:22:2147483647 (default)
      rx pathid: 0, tx pathid: 0x0

Even if the ID matches between a pre-bestpath and IGP cost POI, the two are incompatible and incomparable, and both are added to the aggregate. The outputs are almost identical to the above example, except the pre-bestpath POI component route's ID matches one of the IGP cost-community IDs.

CSR1#show bgp ipv4 unicast 100.4.4.4/32
BGP routing table entry for 100.4.4.4/32, version 3
Paths: (1 available, best #1, table default)
  Not advertised to any peer
  Refresh Epoch 1
  (20)
    3.3.3.3 (metric 31) from 3.3.3.3 (3.3.3.3)
      Origin IGP, metric 20, localpref 100, valid, confed-internal, best
      Extended Community: Cost:igp:22:300
      rx pathid: 0, tx pathid: 0x0

CSR1#show bgp ipv4 unicast 100.4.4.44/32
BGP routing table entry for 100.4.4.44/32, version 6
Paths: (1 available, best #1, table default)
  Not advertised to any peer
  Refresh Epoch 1
  (20)
    2.2.2.2 (metric 30) from 2.2.2.2 (2.2.2.2)
      Origin IGP, metric 10, localpref 100, valid, confed-internal, best
      Extended Community: Cost:igp:21:100
      rx pathid: 0, tx pathid: 0x0

CSR1#show bgp ipv4 unicast 100.4.4.144/32
BGP routing table entry for 100.4.4.144/32, version 10
Paths: (1 available, best #1, table default)
  Not advertised to any peer
  Refresh Epoch 1
  (20)
    2.2.2.2 (metric 30) from 2.2.2.2 (2.2.2.2)
      Origin IGP, metric 10, localpref 100, valid, confed-internal, best
      Extended Community: Cost:pre-bestpath:22:7
      rx pathid: 0, tx pathid: 0x0

CSR1#show bgp ipv4 unicast 100.4.4.0/24
BGP routing table entry for 100.4.4.0/24, version 11
Paths: (1 available, best #1, table default)
  Advertised to update-groups:
     6
  Refresh Epoch 1
  (20), (aggregated by 1020 100.1.1.1)
    0.0.0.0 from 0.0.0.0 (100.1.1.1)
      Origin IGP, localpref 100, weight 32768, valid, aggregated, local, best
      Extended Community: Cost:pre-bestpath:22:2147483647 (default) Cost:igp:21:2147483647 (default) Cost:igp:22:2147483647 (default)
      rx pathid: 0, tx pathid: 0x0
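The aggregation outcomes observed in this section can be condensed into a small Python function. It is a sketch of what the lab outputs showed on an as-set aggregate, not documented behavior. It keys each community by (POI, ID), returns the largest explicit cost when every tagged component shares one key, and falls back to the default cost for each unique key otherwise. How a mix of matching and mismatching keys is handled beyond the three-component examples shown here is an assumption. The function name and data layout are hypothetical.

```python
DEFAULT_COST = 2**31 - 1  # 2147483647


def aggregate_cost_communities(components: list) -> dict:
    """Model the cost-communities placed on an as-set aggregate.

    components: one dict per component route, mapping (poi, comm_id) -> cost.
    A component with no cost-community is an empty dict.
    """
    seen = {}
    for communities in components:
        for key, cost in communities.items():
            seen[key] = max(cost, seen.get(key, 0))  # remember largest explicit cost
    if len(seen) <= 1:
        return seen                                  # single shared POI/ID: keep the max
    # Mismatched IDs or POIs: every unique key survives, but only as a "tag".
    return {key: DEFAULT_COST for key in seen}


components = [
    {("igp", 22): 300},
    {("igp", 21): 100},
    {("pre-bestpath", 22): 7},
]
for (poi, comm_id), cost in sorted(aggregate_cost_communities(components).items()):
    print(f"Cost:{poi}:{comm_id}:{cost}")
```

Keying on the (POI, ID) pair rather than the ID alone is what keeps Cost:igp:22 and Cost:pre-bestpath:22 separate, matching the last aggregate shown above.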

Additional Reading – Reference configurations "costcom"

27.6 DMZ Link Bandwidth

The DMZ link bandwidth feature is used to enable unequal cost load sharing in BGP. DMZ links are the eBGP links that interconnect BGP ASes, which this document also identifies as "transit links". This feature encodes the bandwidth of those links into an extended community (which also carries the BGP ASN for administrative purposes) that is non-transitive. Routers that learn the remote BGP routes via iBGP can perform load sharing when sending traffic to those remote destinations based on the bandwidth of the DMZ links.

On XE, the feature is fully supported for IPv4 and IPv6, but only supported at the edge routers (eBGP) for VPNv4. The XE routers on the AS boundary can encode the link bandwidth (LB) community but cannot use it for load sharing for VPN-iBGP paths at this time. There is currently no VPNv6 support. XR appears to fully support all four of these AFIs.

This feature is implemented very differently between XE and XR, which makes interworking very difficult. Between the two, the units of measurement are vastly different and are not often converted when crossing between the two. Coupled with a limited set of show/debug commands, this makes troubleshooting very challenging.

The network below has two ASes using Inter AS MPLS option B for VPNv4/v6, along with ordinary IPv4/v6. CSR9 and XRv3 are VPN customers while CSR10 and CSR2 are remote Internet ASes. CSR1 and XRv4 have iBGP neighbors to each of their ASBRs, and each ASBR has a direct eBGP peering to the other ASBR at the end of the link. AS 137 uses IS-IS and MPLS-TE for transport (no LDP) while AS 173 uses OSPFv2 and LDP. This is unrelated to DMZ link bandwidth and is introduced for variety.
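To make the idea of unequal cost load sharing concrete, the Python sketch below computes the ideal share of traffic each exit would receive if load were split in exact proportion to the advertised link bandwidth. This is only the arithmetic behind the feature. The exit names and bandwidth figures are made up for illustration, and real platforms round these ratios into whatever share counts their forwarding hardware supports.

```python
from fractions import Fraction


def traffic_shares(link_bandwidth: dict) -> dict:
    """Return each exit's ideal fraction of traffic.

    link_bandwidth maps a (hypothetical) exit name to its DMZ link
    bandwidth as an integer in any consistent unit.
    """
    total = sum(link_bandwidth.values())
    if total <= 0:
        raise ValueError("at least one exit link needs a positive bandwidth")
    return {name: Fraction(bw, total) for name, bw in link_bandwidth.items()}


# Hypothetical example: one 1000-unit link and two 500-unit links.
shares = traffic_shares({"exit-A": 1000, "exit-B": 500, "exit-C": 500})
for name, share in shares.items():
    print(name, share)
```

The "any consistent unit" caveat matters in practice, since XE and XR do not use the same units for this community.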

865 © 2016 Nicholas J. Russo

First, we will quickly check the BGP peers on CSR1 and XRv4 to verify the peers are up and the proper capabilities were exchanged. IPv6 uses labeled unicast (6PE) to provide IPv6 transit across an IPv4 core. Normally, we would verify each neighbor individually as we do between CSR1 and CSR6, but in this case, I will just look at the summary for each AFI for brevity. The asterisk indicates that these are dynamic neighbors using the BGP listen command.

R1#show bgp ipv4 unicast neighbors 137.0.0.6 | section capabilities:
  Neighbor capabilities:
    Route refresh: advertised and received(new)
    Four-octets ASN Capability: advertised and received
    Address family IPv4 Unicast: advertised and received
    Address family IPv6 Unicast: advertised and received
    ipv6 MPLS Label capability: advertised and received
    Address family VPNv4 Unicast: advertised and received
    Address family VPNv6 Unicast: advertised and received
    Enhanced Refresh Capability: advertised and received
    Multisession Capability:
    Stateful switchover support enabled: NO for session 1

R1#show bgp ipv4 unicast summary | begin ^Neighbor
Neighbor        V    AS MsgRcvd MsgSent  TblVer  InQ OutQ Up/Down   State/PfxRcd
10.1.10.10      4   100    3098    3036     353    0    0 1d21h            6
*137.0.0.6      4   137    3120    3164     353    0    0 1d21h            7
*137.0.0.7      4   137    3148    3163     353    0    0 1d21h            7
*137.0.0.12     4   137      30      29     353    0    0 00:15:44         7
* Dynamically created based on a listen range command
Dynamically created neighbors: 3, Subnet ranges: 1

BGP peergroup IBGP listen range group members:
  137.0.0.0/28

Total dynamically created neighbors: 3/(3 max), Subnet ranges: 1

R1#show bgp ipv6 unicast summary | begin ^Neighbor
Neighbor        V    AS MsgRcvd MsgSent  TblVer  InQ OutQ Up/Down   State/PfxRcd
10.1.10.10      4   100    3098    3036     499    0    0 1d21h            8
*137.0.0.6      4   137    3120    3165     499    0    0 1d21h            9
*137.0.0.7      4   137    3148    3164     499    0    0 1d21h            9
*137.0.0.12     4   137      30      30     499    0    0 00:15:52         9
[snip]

R1#show bgp vpnv4 unicast all summary | begin ^Neighbor
Neighbor        V    AS MsgRcvd MsgSent  TblVer  InQ OutQ Up/Down   State/PfxRcd
*137.0.0.6      4   137    3120    3165     131    0    0 1d21h            1
*137.0.0.7      4   137    3148    3164     131    0    0 1d21h            1
*137.0.0.12     4   137      30      30     131    0    0 00:15:59         1
[snip]

R1#show bgp vpnv6 unicast all summary | begin ^Neighbor
Neighbor        V    AS MsgRcvd MsgSent  TblVer  InQ OutQ Up/Down   State/PfxRcd
*137.0.0.6      4   137    3120    3165     128    0    0 1d21h            1
*137.0.0.7      4   137    3148    3164     128    0    0 1d21h            1
*137.0.0.12     4   137      30      30     128    0    0 00:16:03         1
[snip]

We quickly do the same on XRv4 for the routers in AS 173. Notice that the IPv6 unicast AFI only includes the eBGP peer to CSR2; the iBGP peers are technically running IPv6 labeled-unicast, which XR treats as a different AFI. This is technically more correct than the XE implementation.

RP/0/0/CPU0:XRv4#show bgp ipv4 unicast summary | begin ^Neighbor
Neighbor        Spk    AS MsgRcvd MsgSent  TblVer  InQ OutQ  Up/Down  St/PfxRcd
10.2.14.2         0     2    1358    1226      22    0    0 20:18:16          7
173.0.0.4         0   173    1365    1228      22    0    0 20:18:08          6
173.0.0.11        0   173    1238    1228      22    0    0 20:18:12          6

RP/0/0/CPU0:XRv4#show bgp ipv6 unicast summary | begin ^Neighbor
Neighbor        Spk    AS MsgRcvd MsgSent  TblVer  InQ OutQ  Up/Down  St/PfxRcd
fd00:10:2:14::2   0     2    1351    1228      32    0    0 20:18:24          9

RP/0/0/CPU0:XRv4#show bgp ipv6 labeled-unicast summary | begin ^Neighbor
Neighbor        Spk    AS MsgRcvd MsgSent  TblVer  InQ OutQ  Up/Down  St/PfxRcd
173.0.0.4         0   173    1365    1228      32    0    0 20:18:17          8
173.0.0.11        0   173    1239    1228      32    0    0 20:18:22          8

RP/0/0/CPU0:XRv4#show bgp vpnv4 unicast summary | begin ^Neighbor
Neighbor        Spk    AS MsgRcvd MsgSent  TblVer  InQ OutQ  Up/Down  St/PfxRcd
173.0.0.4         0   173    1365    1228       9    0    0 20:18:24          1
173.0.0.11        0   173    1239    1228       9    0    0 20:18:28          1

RP/0/0/CPU0:XRv4#show bgp vpnv6 unicast summary | begin ^Neighbor
Neighbor        Spk    AS MsgRcvd MsgSent  TblVer  InQ OutQ  Up/Down  St/PfxRcd
173.0.0.4         0   173    1365    1228       9    0    0 20:18:27          1
173.0.0.11        0   173    1239    1228       9    0    0 20:18:31          1

We would continue this process for all of the eBGP peers as well, but skip it for the sake of brevity. The number of prefixes exchanged on CSR1 and XRv4 indicates that there is at least some connectivity.

We will begin in AS 137. The goal is to load-share traffic as follows: traffic to 2.128.0.0/16 prefixes should be shared in a 5:3:8 ratio across CSR6, CSR7, and XRv2, respectively. Traffic to 2.129.0.0/16 prefixes should be shared in a 5:3:16 ratio across CSR6, CSR7, and XRv2, respectively. We will assign bandwidth values to the XE routers to represent 5 Mbps on CSR6 and 3 Mbps on CSR7. There is nothing fancy about this configuration. We also need to instruct these routers to “look” at the transit links and encode this value into an extended community. Last, we need to ensure extended communities can be exchanged with CSR1, which is automatic for VPNv4/6 but not IPv4/6. I also use BGP RID comparison on CSR7 so that CSR4 always wins the tie-breaker, ignoring the oldest eBGP route.

! CSR6
interface GigabitEthernet2.546
 bandwidth 5000
router bgp 137
 address-family ipv4
  neighbor 10.4.6.4 dmzlink-bw
  neighbor 137.0.0.1 send-community both
 address-family vpnv4
  neighbor 10.4.6.4 dmzlink-bw
 address-family ipv6
  neighbor 137.0.0.1 send-community both
  neighbor FD00:10:4:6::4 dmzlink-bw
 address-family vpnv6
  neighbor 10.4.6.4 dmzlink-bw

! CSR7
interface GigabitEthernet2.547
 bandwidth 3000
interface GigabitEthernet2.571
 bandwidth 16000
router bgp 137
 bgp bestpath compare-routerid
 address-family ipv4
  neighbor 10.4.7.4 dmzlink-bw
  neighbor 10.7.11.11 dmzlink-bw
  neighbor 137.0.0.1 send-community both
 address-family vpnv4
  neighbor 10.4.7.4 dmzlink-bw
  neighbor 10.7.11.11 dmzlink-bw
 address-family ipv6
  neighbor 137.0.0.1 send-community both
  neighbor FD00:10:4:7::4 dmzlink-bw
  neighbor FD00:10:7:11::11 dmzlink-bw
 address-family vpnv6
  neighbor 10.4.7.4 dmzlink-bw
  neighbor 10.7.11.11 dmzlink-bw

We can locally check CSR6 and CSR7 to ensure this community was encoded using any test prefix from AS 2. As we know, the bandwidth applied at the link level was in kbps. In XE, this is converted to kBps (kilobytes per second) when it is received from an XE router, so the 5000 kbps configured on CSR6 appears as 625 kbytes. The CSR7-XRv1 link has a bandwidth of 16 Mbps, but that path is not the best path and is not advertised to CSR1.

R6#show bgp ipv4 unicast 2.128.192.0/20
BGP routing table entry for 2.128.192.0/20, version 86
Paths: (1 available, best #1, table default)
  Advertised to update-groups:
     9
  Refresh Epoch 8
  173 2
    10.4.6.4 from 10.4.6.4 (173.0.0.4)
      Origin incomplete, localpref 100, valid, external, best
      DMZ-Link Bw 625 kbytes
      rx pathid: 0, tx pathid: 0x0

R7#show bgp ipv4 unicast 2.128.192.0/20
BGP routing table entry for 2.128.192.0/20, version 105
BGP Bestpath: compare-routerid
Paths: (2 available, best #2, table default)
  Advertised to update-groups:
     3          8
  Refresh Epoch 1
  173 2
    10.7.11.11 from 10.7.11.11 (173.0.0.11)
      Origin incomplete, localpref 100, valid, external
      DMZ-Link Bw 2000 kbytes
      rx pathid: 0, tx pathid: 0
  Refresh Epoch 6
  173 2
    10.4.7.4 from 10.4.7.4 (173.0.0.4)
      Origin incomplete, localpref 100, valid, external, best
      DMZ-Link Bw 375 kbytes
      rx pathid: 0, tx pathid: 0x0
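The relationship between the configured interface bandwidth and the value XE displays is simple arithmetic. The sketch below is only an illustration: the function name is hypothetical, and the divide-by-8 rule is inferred from the outputs in this section rather than taken from Cisco documentation.

```python
def xe_lb_kbytes(interface_kbps: int) -> int:
    """Convert an interface bandwidth in kbps to the kBps figure XE displays.

    Assumption: XE divides by 8 (bits per byte), which matches the
    "DMZ-Link Bw" values shown in this section.
    """
    return interface_kbps // 8


# Interface bandwidths configured on CSR6 and CSR7 in this lab.
links = [
    ("CSR6 Gi2.546", 5000),
    ("CSR7 Gi2.547", 3000),
    ("CSR7 Gi2.571", 16000),
]

for name, kbps in links:
    print(f"{name}: {kbps} kbps -> DMZ-Link Bw {xe_lb_kbytes(kbps)} kbytes")
```

The three results (625, 375, and 2000 kbytes) line up with the values seen on R6 and R7.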

We verify that CSR1 receives this information. If CSR1 did not receive it, that would imply that the ASBRs were not properly sending extended communities to CSR1. CSR1 is not yet configured for load-sharing; we will come back to this after configuring XRv2. I want to point out that the bandwidth values are the same as they were on the XE ASBRs, as expected.

R1#show bgp ipv4 unicast 2.129.0.0
BGP routing table entry for 2.129.0.0/17, version 358
Paths: (3 available, best #3, table default)
  Advertised to update-groups:
     6
  Refresh Epoch 1
  173 2
    137.0.0.12 from *137.0.0.12 (137.0.0.12)
      Origin incomplete, localpref 100, valid, internal
      rx pathid: 0, tx pathid: 0
  Refresh Epoch 5
  173 2
    137.0.0.7 from *137.0.0.7 (137.0.0.7)
      Origin incomplete, metric 0, localpref 100, valid, internal
      DMZ-Link Bw 375 kbytes
      rx pathid: 0, tx pathid: 0
  Refresh Epoch 7
  173 2
    137.0.0.6 from *137.0.0.6 (137.0.0.6)
      Origin incomplete, metric 0, localpref 100, valid, internal, best
      DMZ-Link Bw 625 kbytes
      rx pathid: 0, tx pathid: 0x0

XR allows the LB community to be tuned based on an RPL, providing additional flexibility not present in XE. To meet the requirement stated earlier, we configure a simple RPL on XRv2 that changes the bandwidth based on the prefix. Notice that the values being set are in kBps; when XE receives the LB community from XR, it does not do any conversion, so these are raw kBps values.

! XRv2
prefix-set PL_2_128
  2.128.0.0/16 le 24
end-set
prefix-set PL_2_129
  2.129.0.0/16 le 24
end-set
route-policy RPL_DMZ
  if destination in PL_2_128 then
    set extcommunity bandwidth (137:1000)
  elseif destination in PL_2_129 then
    set extcommunity bandwidth (137:2000)
  else
    pass
  endif
end-policy
router bgp 137
 neighbor 10.11.12.11
  address-family ipv4 unicast
   route-policy RPL_DMZ in
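To make the matching logic of this policy concrete, the sketch below emulates it in Python. It is an illustration only, not how XR evaluates RPL: the helper names are hypothetical, and the `le 24` handling assumes the usual prefix-set meaning (the route falls inside the /16 and its mask length is at most 24).

```python
import ipaddress
from typing import Optional, Tuple

PrefixEntry = Tuple[ipaddress.IPv4Network, int]

# Each entry mirrors "<prefix> le 24": (covering network, maximum mask length).
PL_2_128: PrefixEntry = (ipaddress.ip_network("2.128.0.0/16"), 24)
PL_2_129: PrefixEntry = (ipaddress.ip_network("2.129.0.0/16"), 24)


def in_prefix_set(route: ipaddress.IPv4Network, entry: PrefixEntry) -> bool:
    """True if the route sits inside the network and is no longer than max_len."""
    network, max_len = entry
    return route.subnet_of(network) and route.prefixlen <= max_len


def rpl_dmz(route_text: str) -> Optional[int]:
    """Return the bandwidth value RPL_DMZ would set, or None for 'pass'."""
    route = ipaddress.ip_network(route_text)
    if in_prefix_set(route, PL_2_128):
        return 1000
    if in_prefix_set(route, PL_2_129):
        return 2000
    return None  # "pass": the route is accepted without setting bandwidth


for prefix in ("2.128.192.0/20", "2.129.0.0/17", "100.64.32.0/20"):
    print(prefix, "->", rpl_dmz(prefix))
```

The two test prefixes used throughout this section land in different branches, which is why they later carry different LB values.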

We verify that this configuration was successful on XRv2. XR shows the extended community in the form of LB:ASN:value. The 137 was automatically added, where XE leaves this as zero. The value 8 is confusing; this actually represents Mbps. The link was configured at 1000 kBps = 8000 kbps = 8 Mbps. The same is true for the second prefix, where 2000 kBps was configured, yet we see 16 Mbps. This at least proves that the RPL is working properly.

RP/0/0/CPU0:XRv2#show bgp ipv4 unicast 2.128.192.0/20
[snip]
  173 2
    10.11.12.11 from 10.11.12.11 (173.0.0.11)
      Origin incomplete, localpref 100, valid, external, best, group-best
      Received Path ID 0, Local Path ID 1, version 3
      Extended community: LB:137:8
      Origin-AS validity: not-found

RP/0/0/CPU0:XRv2#show bgp ipv4 unicast 2.129.0.0/17
[snip]
  173 2
    10.11.12.11 from 10.11.12.11 (173.0.0.11)
      Origin incomplete, localpref 100, valid, external, best, group-best
      Received Path ID 0, Local Path ID 1, version 7
      Extended community: LB:137:16
      Origin-AS validity: not-found
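The kBps-to-Mbps relationship described above can be written in both directions, which helps when choosing an RPL value for a desired XR display value. This is a sketch based only on the numbers observed in this lab; the function names are hypothetical.

```python
def xr_displayed_mbps(lb_kbytes: int) -> int:
    """Value XR shows in LB:ASN:value for a bandwidth given in kBps.

    Assumption from the outputs above: kBps * 8 gives kbps, and XR
    displays that figure in Mbps.
    """
    kbps = lb_kbytes * 8
    return kbps // 1000


def rpl_value_for_mbps(target_mbps: int) -> int:
    """kBps value to configure in the RPL so that XR displays target_mbps."""
    return target_mbps * 1000 // 8


print(xr_displayed_mbps(1000))   # RPL value for 2.128.0.0/16 prefixes
print(xr_displayed_mbps(2000))   # RPL value for 2.129.0.0/16 prefixes
print(rpl_value_for_mbps(16))    # reverse direction
```

For the two configured values, this gives 8 and 16, matching the LB:137:8 and LB:137:16 communities shown on XRv2.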

Checking CSR1 again and only focusing on the prefix from XRv2, we see the update now carries the LB community. XE leaves this value as configured, which leads me to believe XR wants the value to be configured in kBps.

R1#show bgp ipv4 unicast 2.129.0.0
BGP routing table entry for 2.129.0.0/17, version 358
Paths: (3 available, best #3, table default)
  Advertised to update-groups:
     6
  Refresh Epoch 1
  173 2
    137.0.0.12 from *137.0.0.12 (137.0.0.12)
      Origin incomplete, localpref 100, valid, internal
      DMZ-Link Bw 2000 kbytes
      rx pathid: 0, tx pathid: 0

Next, we will enable CSR1 to use these three paths proportionally. The feature is enabled on a per-AF basis. We instruct BGP to account for the LB community (the default is to ignore it) and to use up to 4 iBGP paths for load sharing. Now, we can see the paths are all marked as multipath. BGP still needs to select a best path for advertisement onward.

! CSR1
router bgp 137
 address-family ipv4
  bgp dmzlink-bw
  maximum-paths ibgp 4

R1#show bgp ipv4 unicast 2.129.0.0
BGP routing table entry for 2.129.0.0/17, version 365
Paths: (3 available, best #3, table default)
Multipath: iBGP
  Advertised to update-groups:
     6
  Refresh Epoch 1
  173 2
    137.0.0.12 from *137.0.0.12 (137.0.0.12)
      Origin incomplete, localpref 100, valid, internal, multipath
      DMZ-Link Bw 2000 kbytes
      rx pathid: 0, tx pathid: 0
  Refresh Epoch 5
  173 2
    137.0.0.7 from *137.0.0.7 (137.0.0.7)
      Origin incomplete, metric 0, localpref 100, valid, internal, multipath(oldest)
      DMZ-Link Bw 375 kbytes
      rx pathid: 0, tx pathid: 0
  Refresh Epoch 7
  173 2
    137.0.0.6 from *137.0.0.6 (137.0.0.6)
      Origin incomplete, metric 0, localpref 100, valid, internal, multipath, best
      DMZ-Link Bw 625 kbytes
      rx pathid: 0, tx pathid: 0x0

We can verify proper operation by checking the RIB and FIB for share counts. Considering the bandwidth values were in the ratio of 5:3:16, the traffic share counts are not terribly important. CEF can split traffic 16 different ways, so there isn’t tremendous granularity. The 16 buckets show an approximate proportion of 5:3:16. We can see most of the traffic going towards XRv2 via its TE tunnel (11/16), significantly less going to CSR6 (3/16), and the least going to CSR7 (2/16). Given the configured bandwidth values, this appears correct.

R1#show ip route 2.129.0.0
Routing entry for 2.129.0.0/17
  Known via "bgp 137", distance 200, metric 0
  Tag 173, type internal
  Last update from 137.0.0.6 00:02:15 ago
  Routing Descriptor Blocks:
  * 137.0.0.12, from 137.0.0.12, 00:02:15 ago
      Route metric is 0, traffic share count is 240
      AS Hops 2
      Route tag 173
      MPLS label: none
    137.0.0.7, from 137.0.0.7, 00:02:15 ago
      Route metric is 0, traffic share count is 44
      AS Hops 2
      Route tag 173
      MPLS label: none
    137.0.0.6, from 137.0.0.6, 00:02:15 ago
      Route metric is 0, traffic share count is 75
      AS Hops 2
      Route tag 173
      MPLS label: none

R1#show ip cef 2.129.0.0 internal | include midchain
    attached to Tunnel17, IP midchain out of Tunnel17 7F7C756E5F58
    attached to Tunnel16, IP midchain out of Tunnel16 7F7C756E7038
    attached to Tunnel112, IP midchain out of Tunnel112 7F7C756E6318
    < 0 > IP midchain out of Tunnel17 7F7C756E5F58
    < 1 > IP midchain out of Tunnel16 7F7C756E7038
    < 2 > IP midchain out of Tunnel112 7F7C756E6318
    < 3 > IP midchain out of Tunnel17 7F7C756E5F58
    < 4 > IP midchain out of Tunnel16 7F7C756E7038
    < 5 > IP midchain out of Tunnel112 7F7C756E6318
    < 6 > IP midchain out of Tunnel16 7F7C756E7038
    < 7 > IP midchain out of Tunnel112 7F7C756E6318
    < 8 > IP midchain out of Tunnel112 7F7C756E6318
    < 9 > IP midchain out of Tunnel112 7F7C756E6318
    <10 > IP midchain out of Tunnel112 7F7C756E6318
    <11 > IP midchain out of Tunnel112 7F7C756E6318
    <12 > IP midchain out of Tunnel112 7F7C756E6318
    <13 > IP midchain out of Tunnel112 7F7C756E6318
    <14 > IP midchain out of Tunnel112 7F7C756E6318
    <15 > IP midchain out of Tunnel112 7F7C756E6318


We quickly check a 2.128.0.0/16 prefix to verify a 5:3:8 ratio. Essentially, we should see fewer buckets assigned to XRv2 and more assigned to CSR6 and CSR7. Notice the traffic share counts are twice as high for CSR6 and CSR7. XRv2 still holds the highest count of 240, but its bandwidth was halved, so the other two paths double relative to it.

R1#show bgp ipv4 unicast 2.128.192.0/20
BGP routing table entry for 2.128.192.0/20, version 361
Paths: (3 available, best #3, table default)
Multipath: iBGP
  Advertised to update-groups:
     6
  Refresh Epoch 1
  173 2
    137.0.0.12 from *137.0.0.12 (137.0.0.12)
      Origin incomplete, localpref 100, valid, internal, multipath
      DMZ-Link Bw 1000 kbytes
      rx pathid: 0, tx pathid: 0
  Refresh Epoch 5
  173 2
    137.0.0.7 from *137.0.0.7 (137.0.0.7)
      Origin incomplete, metric 0, localpref 100, valid, internal, multipath(oldest)
      DMZ-Link Bw 375 kbytes
      rx pathid: 0, tx pathid: 0
  Refresh Epoch 7
  173 2
    137.0.0.6 from *137.0.0.6 (137.0.0.6)
      Origin incomplete, metric 0, localpref 100, valid, internal, multipath, best
      DMZ-Link Bw 625 kbytes
      rx pathid: 0, tx pathid: 0x0

R1#show ip route 2.128.192.0
Routing entry for 2.128.192.0/20
  Known via "bgp 137", distance 200, metric 0
  Tag 173, type internal
  Last update from 137.0.0.6 00:15:25 ago
  Routing Descriptor Blocks:
  * 137.0.0.12, from 137.0.0.12, 00:15:25 ago
      Route metric is 0, traffic share count is 240
      AS Hops 2
      Route tag 173
      MPLS label: none
    137.0.0.7, from 137.0.0.7, 00:15:25 ago
      Route metric is 0, traffic share count is 89
      AS Hops 2
      Route tag 173
      MPLS label: none
    137.0.0.6, from 137.0.0.6, 00:15:25 ago
      Route metric is 0, traffic share count is 150
      AS Hops 2
      Route tag 173
      MPLS label: none

This time, only 8/16 buckets are allocated for XRv2, while there are 5/16 for CSR6 and 3/16 for CSR7. When the ratio numbers add up to 16 as they do here, you end up with a very clean distribution scheme. This CEF load sharing scheme matches the target ratio perfectly.

R1#show ip cef 2.128.192.0 internal | include midchain
    attached to Tunnel17, IP midchain out of Tunnel17 7F7C756E5F58
    attached to Tunnel16, IP midchain out of Tunnel16 7F7C756E7038
    attached to Tunnel112, IP midchain out of Tunnel112 7F7C756E6318
    < 0 > IP midchain out of Tunnel17 7F7C756E5F58
    < 1 > IP midchain out of Tunnel16 7F7C756E7038
    < 2 > IP midchain out of Tunnel112 7F7C756E6318
    < 3 > IP midchain out of Tunnel17 7F7C756E5F58
    < 4 > IP midchain out of Tunnel16 7F7C756E7038
    < 5 > IP midchain out of Tunnel112 7F7C756E6318
    < 6 > IP midchain out of Tunnel17 7F7C756E5F58
    < 7 > IP midchain out of Tunnel16 7F7C756E7038
    < 8 > IP midchain out of Tunnel112 7F7C756E6318
    < 9 > IP midchain out of Tunnel16 7F7C756E7038
    <10 > IP midchain out of Tunnel112 7F7C756E6318
    <11 > IP midchain out of Tunnel16 7F7C756E7038
    <12 > IP midchain out of Tunnel112 7F7C756E6318
    <13 > IP midchain out of Tunnel112 7F7C756E6318
    <14 > IP midchain out of Tunnel112 7F7C756E6318
    <15 > IP midchain out of Tunnel112 7F7C756E6318
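The bucket counts in both CEF outputs can be reproduced with a simple proportional split of 16 buckets. The sketch below uses a largest-remainder apportionment; this is an assumption chosen because it reproduces the observed counts, not a statement of how CEF actually assigns buckets. The function name is hypothetical.

```python
from typing import List


def apportion_buckets(weights: List[int], buckets: int = 16) -> List[int]:
    """Split `buckets` among paths in proportion to `weights`.

    Uses integer floor shares, then hands any leftover buckets to the
    paths with the largest remainders (largest-remainder method).
    """
    total = sum(weights)
    shares = [w * buckets // total for w in weights]
    remainders = [w * buckets % total for w in weights]
    leftover = buckets - sum(shares)
    order = sorted(range(len(weights)), key=lambda i: remainders[i], reverse=True)
    for i in order[:leftover]:
        shares[i] += 1
    return shares


# Path order: CSR6, CSR7, XRv2 (DMZ-Link Bw in kbytes as seen on CSR1).
print(apportion_buckets([625, 375, 2000]))  # 2.129.0.0/17   -> [3, 2, 11]
print(apportion_buckets([625, 375, 1000]))  # 2.128.192.0/20 -> [5, 3, 8]
```

The 5:3:16 case cannot divide 16 buckets evenly, so XRv2 ends up with 11 buckets instead of the ideal 10.67, while the 5:3:8 case divides exactly.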

The process is identical for IPv6. We will use the legacy method on XRv2, which is a neighbor-level command to enable DMZ LB. This applies to all AFs for which there is not a more specific RPL attachment. Unfortunately, no matter what bandwidth you set on the link, XR always uses the value 500,000. This means that CSR1 will interpret this as 500,000 kBps, which is worthless in the context of the current low-speed bandwidths. Since XE does not have the per-AF granularity to make adjustments, I use this as a demonstration of a bad LB design. You also cannot verify it locally on XR due to what I consider an order-of-operations issue. CSR1 clearly shows this bogus value.

! XRv2
interface GigabitEthernet0/0/0/0.512
 bandwidth 4000
router bgp 137
 neighbor fd00:10:11:12::11
  dmz-link-bandwidth

R1#show bgp ipv6 unicast 2001:2:128:190::/62
BGP routing table entry for 2001:2:128:190::/62, version 499
Paths: (3 available, best #3, table default)
Multipath: iBGP
  Advertised to update-groups:
     6
  Refresh Epoch 1
  173 2
    ::FFFF:137.0.0.12 from *137.0.0.12 (137.0.0.12)
      Origin incomplete, localpref 100, valid, internal, multipath
      DMZ-Link Bw 500000 kbytes
      mpls labels in/out nolabel/92006
      rx pathid: 0, tx pathid: 0
  Refresh Epoch 13
  173 2
    ::FFFF:137.0.0.7 from *137.0.0.7 (137.0.0.7)
      Origin incomplete, metric 0, localpref 100, valid, internal, multipath(oldest)
      DMZ-Link Bw 375 kbytes
      mpls labels in/out nolabel/7002
      rx pathid: 0, tx pathid: 0
  Refresh Epoch 13
  173 2
    ::FFFF:137.0.0.6 from *137.0.0.6 (137.0.0.6)
      Origin incomplete, metric 0, localpref 100, valid, internal, multipath, best
      DMZ-Link Bw 625 kbytes
      mpls labels in/out nolabel/6017
      rx pathid: 0, tx pathid: 0x0

As expected, the load balancing weights are heavily tilted, and virtually all traffic should be forwarded via XRv2. I am not sure if this is an XE issue or a function of 6PE, but the CEF table shows equal distribution between all of the buckets. The IPv6 CEF has a nice distribution output that IPv4 CEF used to have. Even without knowing what the paths are, we can see they are processed in round-robin fashion (each path gets a number). This is obviously incorrect despite having the correct configuration.

R1#show ipv6 cef 2001:2:128:190::/62 internal | include Path index
  Path index [ 0 1 2 0 1 2 0 1 2 0 1 2 0 1 2 ]

VPNv4 link bandwidth load-sharing is limited to eBGP in current XE code. On CSR7, we can demonstrate this balance between the 3 Mbps link to CSR4 and the 16 Mbps link to XRv1. Take note of the inbound label.

! CSR7
address-family vpnv4
 bgp dmzlink-bw
 maximum-paths 2

R7#show bgp vpnv4 unicast rd 137:173 10.13.14.0/24
BGP routing table entry for 137:173:10.13.14.0/24, version 38
BGP Bestpath: compare-routerid
Paths: (2 available, best #2, no table)
  Advertised to update-groups:
     3          7
  Refresh Epoch 1
  173
    10.7.11.11 (via default) from 10.7.11.11 (173.0.0.11)
      Origin incomplete, localpref 100, valid, external, multipath(oldest)
      Extended Community: RT:137:173
      DMZ-Link Bw 2000 kbytes
      mpls labels in/out 7003/91005
      rx pathid: 0, tx pathid: 0
  Refresh Epoch 3
  173
    10.4.7.4 (via default) from 10.4.7.4 (173.0.0.4)
      Origin incomplete, localpref 100, valid, external, multipath, best
      Extended Community: RT:137:173
      DMZ-Link Bw 375 kbytes
      mpls labels in/out 7003/4015
      rx pathid: 0, tx pathid: 0x0

Since there is no VRF locally configured on CSR7 (Option B), there won’t be a RIB or FIB entry. Traffic will be label-switched at the ASBRs, with BGP performing a VPN label swap. Like IPv6, the feature does not appear to work very well, and the cause could be MPLS related. The distribution appears 1:1 despite enabling the DMZ LB feature for this AFI.

R7#show mpls forwarding-table labels 7003 detail
Local      Outgoing   Prefix                  Bytes Label   Outgoing   Next Hop
Label      Label      or Tunnel Id            Switched      interface
7003       4015       137:173:10.13.14.0/24   \
                                              0             Gi2.547    10.4.7.4
        MAC/Encaps=18/22, MRU=1500, Label Stack{4015}
        005056A92C57005056A9EA7781000DDB8847 00FAF000
        No output feature configured
    Per-destination load-sharing, slots: 0 2 4 6 8 10 12 14
           91005      137:173:10.13.14.0/24   \
                                              0             Gi2.571    10.7.11.11
        MAC/Encaps=18/22, MRU=1500, Label Stack{91005}
        005056A99C60005056A9EA7781000DF38847 1637D000
        No output feature configured
    Per-destination load-sharing, slots: 1 3 5 7 9 11 13 15

Next, we will configure AS 173. The goal is to balance traffic in a 2:1 ratio, preferring CSR4 twice as often as XRv1. When XRv4 receives the LB community from an XE ASBR, it divides it by 1000. Since the bandwidth was specified in kbps on CSR4, XR wants to view it in Mbps, so dividing by 1000 is logical. XR also cannot process LB values less than 1000, and since XR interprets the value as Mbps, this means that only 1 Gbps links (1000 Mbps) or faster will even register as having an LB value. As such, I set the bandwidth to 8 Gbps on CSR4 to achieve a value of 8000 on XRv4. Notice how the value is wildly different on CSR4 from XRv4. In AS 137, values set by XR were untouched by XE, but values set by XE are divided by 1000 when received by XR. XE reads the value as kbps and displays it in kBps. XR reads the value as kbps and displays it in Mbps with a lower bound of 1000 Mbps.

There is also an unidentifiable extended community sent by CSR4 that XR cannot read. A quick search did not reveal obvious documentation for extended community 0x0006. Also notice that CSR4 did not set the ASN inside the LB community and it remains 0, whereas XR as an ASBR set the value to the local ASN.

! CSR4
interface GigabitEthernet2.546
 bandwidth 8000000

R4#show bgp ipv4 unicast 100.64.32.0/20
BGP routing table entry for 100.64.32.0/20, version 91
Paths: (2 available, best #2, table default)
  Advertised to update-groups:
     3          17
  Refresh Epoch 6
  137 100
    10.4.7.7 from 10.4.7.7 (137.0.0.7)
      Origin incomplete, localpref 100, valid, external
      DMZ-Link Bw 375 kbytes
      rx pathid: 0, tx pathid: 0
  Refresh Epoch 8
  137 100
    10.4.6.6 from 10.4.6.6 (137.0.0.6)
      Origin incomplete, localpref 100, valid, external, best
      DMZ-Link Bw 1000000 kbytes
      rx pathid: 0, tx pathid: 0x0

RP/0/0/CPU0:XRv4#show bgp ipv4 unicast 100.64.32.0/20
[snip]
  137 100
    173.0.0.4 (metric 3) from 173.0.0.4 (173.0.0.4)
      Origin incomplete, metric 0, localpref 100, valid, internal, multipath
      Received Path ID 0, Local Path ID 0, version 0
      Extended community: 0x0006:0:1232348160 LB:0:8000
[snip]
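One observation may help with the unreadable 0x0006 community. This is speculation based purely on the lab output, not documented behavior: if the last 32-bit field is reinterpreted as an IEEE 754 single-precision float instead of an integer, the number decodes to the same figure CSR4 displays in kbytes. The helper name below is hypothetical.

```python
import struct


def u32_as_float(value: int) -> float:
    """Reinterpret a 32-bit unsigned integer as a big-endian IEEE 754 single."""
    return struct.unpack(">f", struct.pack(">I", value))[0]


# Value XRv4 displays next to LB:0:8000 for the path via CSR4.
print(u32_as_float(1232348160))
```

This prints 1000000.0, which equals the "DMZ-Link Bw 1000000 kbytes" shown on CSR4 for the same path. Whether XE intentionally sends a second, float-encoded copy of the bandwidth is an assumption that the lab output alone cannot confirm.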

Just like XE-to-XE, XR-to-XR is a little easier. While it still doesn’t read the interface bandwidth properly, the value on XRv4 corresponds to what we saw earlier. CSR1 showed an LB value of 500,000 kBps when it received LB from XRv2 using the legacy configuration (no RPL). XRv4 shows a value of 4,000 Mbps; these two values are equivalent to 4 Gbps. I am not sure where XR is getting this number, but at least it is consistent. Notice that there is no unknown extended community and the ASN is encoded, unlike XE.

RP/0/0/CPU0:XRv4#show bgp ipv4 unicast 100.64.32.0/20
[snip]
  Path #1: Received by speaker 0
  Not advertised to any peer
  137 100
    173.0.0.4 (metric 3) from 173.0.0.4 (173.0.0.4)
      Origin incomplete, metric 0, localpref 100, valid, internal, multipath
      Received Path ID 0, Local Path ID 0, version 0
      Extended community: 0x0006:0:1232348160 LB:0:8000
  Path #2: Received by speaker 0
  Advertised to peers (in unique update groups):
    10.2.14.2
  137 100
    173.0.0.11 (metric 2) from 173.0.0.11 (173.0.0.11)
      Origin incomplete, localpref 100, valid, internal, best, group-best, multipath
      Received Path ID 0, Local Path ID 1, version 15
      Extended community: LB:173:4000

To enable XRv4 to use this information, we need to enable maximum-paths. I use unequal-cost since, unlike AS 137, there are different IGP costs to the BGP next-hops, and I need to multipath across the two. AS 137 used static routes with RSVP-TE so this was a non-issue. We can clearly see the weights of 8000 and 4000 being applied to the candidate routes. The XR show commands for CEF distribution are very clean. XRv4 builds 3 buckets for more accurate load sharing, rather than forcing a 2:1 ratio to be split evenly over 16 buckets, which is impossible. The load distribution shows path 0 (CSR4) being used twice as often as path 1 (XRv1).

! XRv4
router bgp 173
 address-family ipv4 unicast
  maximum-paths ibgp 2 unequal-cost

RP/0/0/CPU0:XRv4#show route ipv4 unicast 100.64.32.0/20
Routing entry for 100.64.32.0/20
  Known via "bgp 173", distance 200, metric 0
  Tag 137, type internal
  Routing Descriptor Blocks
    173.0.0.4, from 173.0.0.4, BGP multi path
      Route metric is 0, Wt is 8000
    173.0.0.11, from 173.0.0.11, BGP multi path
      Route metric is 0, Wt is 4000
  No advertising protos.

RP/0/0/CPU0:XRv4#show cef ipv4 100.64.32.0/20 detail | begin Weight
    Weight distribution:
    slot 0, weight 8000, normalized_weight 2, class 0
    slot 1, weight 4000, normalized_weight 1, class 0
    Load distribution: 0 0 1 (refcount 1)

    Hash  OK  Interface                   Address
    0     Y   GigabitEthernet0/0/0/0.554  173.5.14.5
    1     Y   GigabitEthernet0/0/0/0.554  173.5.14.5
    2     Y   GigabitEthernet0/0/0/0.514  173.11.14.11
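The normalized weights and the three-entry load distribution above are consistent with dividing each weight by the greatest common divisor of all weights. The sketch below assumes that is what XR does; the output supports it, but it is not confirmed here, and the function names are hypothetical.

```python
from functools import reduce
from math import gcd
from typing import List


def normalize(weights: List[int]) -> List[int]:
    """Divide every weight by the GCD of all weights."""
    divisor = reduce(gcd, weights)
    return [w // divisor for w in weights]


def load_distribution(weights: List[int]) -> List[int]:
    """List each path index once per unit of its normalized weight."""
    slots: List[int] = []
    for path_index, count in enumerate(normalize(weights)):
        slots.extend([path_index] * count)
    return slots


# Slot 0 = CSR4 (weight 8000), slot 1 = XRv1 (weight 4000).
print(normalize([8000, 4000]))          # [2, 1]
print(load_distribution([8000, 4000]))  # [0, 0, 1]
```

Because the bucket count follows the normalized weights rather than a fixed 16, a 2:1 ratio is represented exactly with only three buckets.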

We will turn the tables and use XRv1 twice as often when CSR4 picks CSR7 as the best path. By setting the link to 2 Gbps, we will achieve a value of 2000 on XR, which is half of XRv1’s value of 4000. XR needs to specify maximum-paths under the VRF IPv4/v6 AFI, not the VPNv4 AFI.

! CSR4
interface GigabitEthernet2.547
 bandwidth 2000000

! XRv4
router bgp 173
 vrf DMZ
  address-family ipv4 unicast
   maximum-paths ibgp 2 unequal-cost

We verify the communities below. We see that both were imported, are multipath, and have the proper 2:1 ratio of XRv1:CSR4 for load sharing.

RP/0/0/CPU0:XRv4#show bgp vpnv4 unicast vrf DMZ 10.1.9.0/24
[snip]
  Path #1: Received by speaker 0
  Not advertised to any peer
  137
    173.0.0.4 (metric 3) from 173.0.0.4 (173.0.0.4)
      Received Label 4000
      Origin incomplete, metric 0, localpref 100, valid, internal, multipath, import-candidate, imported
      Received Path ID 0, Local Path ID 0, version 0
      Extended community: 0x0006:0:1215570944 LB:0:2000 RT:137:173
      Source VRF: DMZ, Source Route Distinguisher: 137:173
  Path #2: Received by speaker 0
  Not advertised to any peer
  137
    173.0.0.11 (metric 2) from 173.0.0.11 (173.0.0.11)
      Received Label 91008
      Origin incomplete, localpref 100, valid, internal, best, group-best, multipath, import-candidate, imported
      Received Path ID 0, Local Path ID 1, version 11
      Extended community: LB:173:4000 RT:137:173
      Source VRF: DMZ, Source Route Distinguisher: 137:173

We check the routing information within the VPN to verify it is correct. Since multipath was configured under the VRF, both routes can be imported into the VPN for multipath.

RP/0/0/CPU0:XRv4#show route vrf DMZ ipv4 unicast 10.1.9.0/24
Routing entry for 10.1.9.0/24
  Known via "bgp 173", distance 200, metric 0
  Tag 137, type internal
  Routing Descriptor Blocks
    173.0.0.4, from 173.0.0.4, BGP multi path
      Nexthop in Vrf: "default", Table: "default", IPv4 Unicast, Table Id: 0xe0000000
      Route metric is 0, Wt is 2000
    173.0.0.11, from 173.0.0.11, BGP multi path
      Nexthop in Vrf: "default", Table: "default", IPv4 Unicast, Table Id: 0xe0000000
      Route metric is 0, Wt is 4000
  No advertising protos.

Since these are VPN routes, the label imposition is shown, not the outgoing interface. We can connect the two easily by checking the LFIB. Label 94000 corresponds to CSR4 and 94002 corresponds to XRv1, so the load distribution appears correct.

RP/0/0/CPU0:XRv4#show cef vrf DMZ ipv4 10.1.9.0/24 detail | begin Weight
    Weight distribution:
    slot 0, weight 2000, normalized_weight 1, class 0
    slot 1, weight 4000, normalized_weight 2, class 0
    Load distribution: 0 1 1 (refcount 1)

    Hash  OK  Interface  Address
    0     Y   Unknown    94000/0
    1     Y   Unknown    94002/0
    2     Y   Unknown    94002/0

RP/0/0/CPU0:XRv4#show mpls forwarding labels 94000 94002
Local  Outgoing    Prefix             Outgoing       Next Hop        Bytes
Label  Label       or ID              Interface                      Switched
------ ----------- ------------------ -------------- --------------- ----------
94000  5007        173.0.0.4/32       Gi0/0/0/0.554  173.5.14.5      199672
94001  Pop         173.0.0.5/32       Gi0/0/0/0.554  173.5.14.5      144626
94002  Pop         173.0.0.11/32      Gi0/0/0/0.514  173.11.14.11    327767

Additional Reading – Reference configurations “bgp-dmz-lb”

27.7 BGP Multicast VPN (MVPN) Theory

MVPN is a generalized name meant to encompass all of the solutions that transport multicast traffic from place to place in a VPN environment. In most cases, the “VPN environment” specifically means MPLS L3VPN. The main business driver for this is the explosion of Internet video over the past years, and these video services are expected to increase further still. A very long set of solutions has been developed and adopted as industry standards to transport multicast across MPLS L3VPN.

The original solution, known commonly as Draft-Rosen, used GRE tunnels to connect all of the PE devices in a given MVPN instance (generally within a given L3VPN, though extranets are supported). The destination of these tunnels was also multicast, so the SP core needed to run PIM to transport it. It has worked well for many years, and one of its greatest benefits is that it doesn’t introduce any particularly new or difficult concepts. Both the provider and customer use PIM, except the provider tunnels (P-tunnels) just carry customer multicast inside. It can be protected with multicast-only FRR (MoFRR) and also works with PIM-ASM, PIM-SSM, or PIM-Bidir in the core. It can also transport any of these three traffic types used by the customer, along with customer BSR messaging for RP discovery. It carries IPv6 multicast inside these IPv4 P-tunnels as well. Overall, it is a good solution, and despite it being “legacy” it is still considered an MVPN solution (profile 0).

The main issue with this solution is that it does not take advantage of MPLS transport. It would be ideal if both unicast and multicast traffic used a common data plane in the SP core, ultimately removing the need to run PIM there. Label switched multicast (LSM) can be a candidate for FRR using the methods already used for unicast LSPs. Two main technologies were extended to support this: RSVP-TE and LDP.

RSVP was already extended to support TE (RSVP-TE), so extending it again to transport multicast was logical. Using P2MP RSVP-TE, a multipoint tunnel can be built from the headend to multiple tail ends, which is perfect for transporting one-way multicast.
Like a normal TE tunnel, each sub-tunnel (also called a sub-LSP) can request FRR treatment, make bandwidth reservations, and do any other kind of TE CSPF parameter tuning. This is Juniper’s preferred model and it is the only model that supports bandwidth reservations and simplified FRR. XE never supported this feature (the IOS S-train did), but XR does.

LDP was extended to support this also; it is called multicast LDP (mLDP). This works like LDP in that labels are exchanged on a hop-by-hop basis. mLDP can support P2MP trees, which are unidirectional, much like RSVP-TE P2MP trees. It can also support MP2MP trees, which are similar to PIM-Bidir in that traffic can flow in either direction towards the root, which is the main junction point. In P2MP trees, only downstream labels are allocated, whereas in MP2MP trees, upstream and downstream labels are exchanged. P2MP trees are more efficient (like PIM-SSM) but MP2MP trees use less state (like PIM-Bidir). Unlike RSVP LSPs, mLDP LSPs are built from tail to head as label-mapping messages are sent towards the root of the delivery tree. This is Cisco’s preferred model as it has less “core state”, even for P2MP trees, and generally scales better. It supports FRR but it’s less straightforward than RSVP-TE P2MP, and there is no support for making bandwidth reservations.

Some solutions permit dynamic discovery of PE endpoints. Rather than statically configure RSVP-TE P2MP endpoints or mLDP MP2MP roots, BGP can automatically discover endpoints. This is often called BGP auto-discovery (AD). BGP introduces 5 new route-types under the IPv4/v6 MVPN SAFI to accomplish this. The concept of Provider Multicast Service Interface (PMSI) is an abstraction for the router. As long as the C-MRIB is smart enough to know “I need to deliver this packet to the PMSI”, the MVPN process knows what to do next. The PMSI actions will typically include some kind of encapsulation, be it IP/GRE or MPLS, for transport across the core. Traffic exiting the core traverses the PMSI as well, where the encapsulation is removed. Different types of BGP routes mean different things as described below.

Type 1: Intra-AS Inclusive PMSI (I-PMSI) AD: Originated by each PE inside an AS; used to learn MVPN membership. It is called “inclusive” because it includes everyone in the MVPN, whether they want traffic or not. I-PMSI can be thought of as similar to a default MDT.

Type 2: Inter-AS I-PMSI AD: Originated by each ASBR; used to learn MVPN membership. This is examined in detail when testing inter-AS MVPN.

Type 3: Selective (S-PMSI) AD: Originated by the source (ingress) PE and used for signaling a specific P2MP tree or data MDT. This is meant to selectively target receivers that want multicast for a given C(S,G).

Type 4: Leaf AD: If the Type-3 requests it, the receiver (egress) PEs will explicitly show interest in a specific C(S,G). Normally this is not needed for the vast majority of profiles, but it has relevance for RSVP P2MP-TE auto-discovered tunnels.

Type 5: Source Active AD: Originated by the source (ingress) PE when it learns about an active source; it assists with SPT switchover. This is only relevant when the customer is using PIM-ASM.

The method by which customers exchange multicast routing has always been PIM. In Draft-Rosen, the PEs would form PIM neighbors inside the VRF and the tunnel-mesh would look like an emulated LAN. The same is true for basic mLDP MP2MP trees. This works well and most people are familiar with PIM operations on a LAN. To achieve better scalability, BGP was extended to do customer multicast (c-mcast) signaling. BGP introduces 2 new route-types under the IPv4/v6 MVPN SAFI to accomplish this; basically the PIM equivalent of a (*,G) join and an (S,G) join.
Receiving a PIM (*,G) or (S,G) join from the CE will trigger the creation of a BGP Type-6 or Type-7 route, respectively. Receiving a prune simply triggers the withdrawal of that route. BGP allows for massive scalability in MVPNs as the concept of a PIM “emulated LAN” for overlay signaling is removed.

Type 6: Shared Tree Join: Specifies the RP and group, and is meant to signal C-PIM-ASM interest in a particular group. This replaces the PIM (*,G) signaling.
Type 7: Source Tree Join: Looks identical to a Type-6 except the source’s address takes the place of the RP’s address. It can also be sent upon receipt of a Type-5 Source Active AD message. This replaces the PIM (S,G) signaling.

mLDP is also capable of carrying the c-mcast signaling in-band. That is, PIM/BGP is not needed to exchange customer multicast information since the actual (S,G) information is encoded in the mLDP label mapping messages as an opaque value. This only works with PIM-SSM, since there isn’t a mechanism for (*,G) signaling or for carrying BSR messages across the VPN. It also scales poorly, but it is a very simple solution that eliminates the need for an overlay protocol entirely. For customers using SSM that only have a few high-bandwidth multicast flows (which is common), this is a good solution.

A more rudimentary mechanism for transporting multicast is to perform the multicast replication at the head-end and reuse the unicast LSPs. This automatically gives all the flows FRR (provided the unicast LSPs were already protected), doesn’t require new signaling, and also overcomes any vendor incompatibility issues. Ingress Replication (IR) is not commonly used (and is supported on XR only), but can be used in specific corner-cases. PBB-EVPN uses this by default for its inclusive multicast exchanges as it assumes that the core is not MVPN-capable by default, which is a good assumption.

In order for the router to know which kind of P-tunnel to use for a particular multicast flow (assuming BGP AD is used), it needs to look at the I-PMSI or S-PMSI to determine the tunnel type. Each transport method has a different tunnel type. This information is carried inside the PMSI attributes within BGP AD routes; once a BGP AD route is bound to a particular C(S,G), the tunnel type specified in the PMSI attribute is used for transport. A mismatch of tunnel types between endpoints is not supported. This is how remote endpoints agree on how to encapsulate/decapsulate traffic traversing their local PMSIs.

Type 0: No tunnel specified
Type 1: RSVP-TE P2MP
Type 2: mLDP P2MP
Type 3: PIM-SSM
Type 4: PIM-ASM
Type 5: PIM-Bidir
Type 6: Ingress Replication (IR)
Type 7: mLDP MP2MP

As you can see, there are many options; many of the combinations of transport and overlay signaling are valid.

Note: This chapter lays the foundation for the MVPN tests. All of the detailed verification happens there.
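The route-type and tunnel-type numbering above can be kept straight with a small lookup table. The following Python sketch is only a study aid, not router code: the enum member names and the helper functions are my own labels (assumptions), while the numeric values are the ones listed in the text.

```python
from enum import IntEnum


class MvpnRouteType(IntEnum):
    """BGP MVPN route types as numbered in the text (member names are illustrative)."""
    INTRA_AS_I_PMSI_AD = 1
    INTER_AS_I_PMSI_AD = 2
    S_PMSI_AD = 3
    LEAF_AD = 4
    SOURCE_ACTIVE_AD = 5
    SHARED_TREE_JOIN = 6
    SOURCE_TREE_JOIN = 7


class PmsiTunnelType(IntEnum):
    """PMSI tunnel types as numbered in the text (member names are illustrative)."""
    NO_TUNNEL = 0
    RSVP_TE_P2MP = 1
    MLDP_P2MP = 2
    PIM_SSM = 3
    PIM_ASM = 4
    PIM_BIDIR = 5
    INGRESS_REPLICATION = 6
    MLDP_MP2MP = 7


def is_auto_discovery(route_type: MvpnRouteType) -> bool:
    """Types 1-5 are the auto-discovery routes; types 6-7 carry c-mcast signaling."""
    return route_type <= MvpnRouteType.SOURCE_ACTIVE_AD


def cmcast_route_for_join(is_star_g: bool) -> MvpnRouteType:
    """A C-PIM (*,G) join maps to Type 6; a C-PIM (S,G) join maps to Type 7."""
    if is_star_g:
        return MvpnRouteType.SHARED_TREE_JOIN
    return MvpnRouteType.SOURCE_TREE_JOIN


def tunnel_types_compatible(local: PmsiTunnelType, remote: PmsiTunnelType) -> bool:
    """Endpoints must agree on the tunnel type; a mismatch is not supported."""
    return local == remote


if __name__ == "__main__":
    print(cmcast_route_for_join(is_star_g=True).name)
    print(tunnel_types_compatible(PmsiTunnelType.MLDP_P2MP, PmsiTunnelType.PIM_SSM))
```

The enum values double as a quick reference when reading PMSI attributes in BGP AD routes during the MVPN tests.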
27.8 BGP Link State AF and Path Computation Element (PCE)

BGP-LS is designed to capture link-state IGP and TE information and transport it across the network. Specifically, passing it to an SDN/PCE device allows the network to be controlled remotely, enabling network-wide TE and other features. BGP cannot actually use the IGP link-state information for any kind of path selection, but the information is easily readable from the CLI. BGP is truly a transport mechanism here, as it scales much better than an IGP and can peer directly with SDN controllers or PCE devices. The feature is simple to configure and only supported in IOS XR at present. The documentation says the peer address must be private, but that isn't enforced in XR version 5.3.0. After identifying a domain-distinguisher, just enable the AF per neighbor as usual. The network diagram is shown below.


Below is the configuration of our simulated PCE server. Essentially, it is going to accept BGP-LS information from XRv12 and do nothing with it.

! XRv14
router bgp 10
 address-family link-state link-state
  domain-distinguisher 10:14.14.14.14
 neighbor 12.12.12.12
  address-family link-state link-state

The configuration is similar on XRv12, with the addition of the IS-IS LSP distribution into BGP. OSPFv2 is also supported, but OSPFv3 does not appear to be supported yet.

! XRv12
router isis 1
 distribute bgp-ls level 2
router bgp 10
 address-family link-state link-state
  domain-distinguisher 10:12.12.12.12
 neighbor 12.12.12.12
  address-family link-state link-state

A quick look at the BGP table for this AFI shows all of the IS-IS topology information from level 2 that was injected by XRv12. We will select a few different route types to examine. The BGP codes in the output are excellent and help to decipher the output. The [V] route represents the "vertex" for XRv12 itself. The [L2] is the IS-IS level, and since this isn't a link in the graph, there is a local node [N] but no remote node [R]. The NET is 0000.0000.0012.00 and the BGP RID that originated the prefix was 12.12.12.12 (XRv12). The [E] route, or "edge", is similar except it has a remote node, which is 0000.0000.0012.03. This is an IS-IS DIS for one of XRv12's links. The [T] route contains the prefix for XRv12's loopback0, which is 12.12.12.12/32. The prefix-codes are very useful; don’t ignore them.

RP/0/0/CPU0:XRv14#show bgp link-state link-state
[snip]
Status codes: s suppressed, d damped, h history, * valid, > best
              i - internal, r RIB-failure, S stale, N Nexthop-discard
Origin codes: i - IGP, e - EGP, ? - incomplete
Prefix codes: E link, V node, T IP reacheable route, u/U unknown
              I Identifier, N local node, R remote node, L link, P prefix
              L1/L2 ISIS level-1/level-2, O OSPF, D direct, S static
              a area-ID, l link-ID, t topology-ID, s ISO-ID,
              c confed-ID/ASN, b bgp-identifier, r router-ID,
              i if-address, n nbr-address, o OSPF Route-type, p IP-prefix
              d designated router address
   Network            Next Hop            Metric LocPrf Weight Path
[snip]
*>i[V][L2][I0x0][N[c10][b12.12.12.12][s0000.0000.0012.00]]/328
                      12.12.12.12                   100      0 i
[snip]
*>i[E][L2][I0x0][N[c10][b12.12.12.12][s0000.0000.0012.00]][R[c10][b12.12.12.12][s0000.0000.0012.03]]/576
                      12.12.12.12                   100      0 i
[snip]
*>i[T][L2][I0x0][N[c10][b12.12.12.12][s0000.0000.0012.00]][P[p12.12.12.12/32]]/400
                      12.12.12.12                   100      0 i
[snip]
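To practice reading these bracketed prefixes, it can help to break one apart mechanically. The Python sketch below is a study aid under stated assumptions: it parses only the display string printed by the show command (not the binary NLRI), the dictionary key names are my own, and only the prefix codes that appear in this example are mapped. The pseudonode check simply assumes that a non-zero last octet in the ISO-ID identifies a DIS pseudonode, as with the .03 remote node above.

```python
# Subset of the "Prefix codes" legend from the show output; key names are illustrative.
NLRI_TYPES = {"V": "node", "E": "link", "T": "ip_reachable_route"}
DESCRIPTORS = {"N": "local_node", "R": "remote_node", "L": "link", "P": "prefix"}
FIELDS = {"c": "asn", "b": "bgp_identifier", "s": "iso_id", "p": "ip_prefix"}


def split_groups(text):
    """Return the contents of each top-level [...] group, honoring nesting."""
    groups, depth, start = [], 0, 0
    for index, char in enumerate(text):
        if char == "[":
            if depth == 0:
                start = index + 1
            depth += 1
        elif char == "]":
            depth -= 1
            if depth == 0:
                groups.append(text[start:index])
    return groups


def parse_ls_prefix(entry):
    """Parse a displayed BGP-LS prefix such as [V][L2][I0x0][N[...]]/328."""
    # The final "/" separates the displayed length; an IP prefix such as
    # 12.12.12.12/32 sits inside brackets and is left untouched.
    body, _, length = entry.rpartition("/")
    groups = split_groups(body)
    parsed = {
        "nlri_type": NLRI_TYPES.get(groups[0], groups[0]),
        "protocol": groups[1],
        "identifier": groups[2][1:],  # strip the leading "I" code
        "length": int(length),
    }
    for group in groups[3:]:
        name = DESCRIPTORS.get(group[0], group[0])
        parsed[name] = {
            FIELDS.get(field[0], field[0]): field[1:]
            for field in split_groups(group[1:])
        }
    return parsed


def is_pseudonode(iso_id):
    """Assumption: a non-zero final octet marks an IS-IS pseudonode (DIS)."""
    return not iso_id.endswith(".00")


if __name__ == "__main__":
    edge = ("[E][L2][I0x0][N[c10][b12.12.12.12][s0000.0000.0012.00]]"
            "[R[c10][b12.12.12.12][s0000.0000.0012.03]]/576")
    print(parse_ls_prefix(edge))
```

Feeding it the [E] route from the table separates the local node (XRv12 itself) from the remote pseudonode at a glance.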

The details of the [V] route help to summarize our findings, also. The IS-IS area is not encoded anywhere in the summary information, so one would have to use the detailed show command for it. It is also nice to see the hostname (the LSP name for IS-IS at least) when trying to identify the prefixes. The TE-ID is also important since one of the main uses of this feature is MPLS TE.

RP/0/0/CPU0:XRv14#show bgp link-state link-state [V][L2][I0x0][N[c10][b12.12.12.12][s0000.0000.0012.00]]/328
BGP routing table entry for [V][L2][I0x0][N[c10][b12.12.12.12][s0000.0000.0012.00]]/328
Versions:
  Process           bRIB/RIB  SendTblVer
  Speaker                 22          22
Paths: (1 available, best #1)
  Not advertised to any peer
  Path #1: Received by speaker 0
  Not advertised to any peer
  Local
    12.12.12.12 from 12.12.12.12 (12.12.12.12)
      Origin IGP, localpref 100, valid, internal, best, group-best
      Received Path ID 0, Local Path ID 1, version 22
      Link-state: Node-name: XRv12, ISIS area: 00, Local TE Router-ID: 12.12.12.12

The [E] route details reveal some extra information, such as TE metric, IGP metric and TE RID. The show command is very long so I snip the prefix on the command line, but BGP repeats it in full (highlighted) when the output is displayed.

RP/0/0/CPU0:XRv14#show bgp link-state link-state [E][L2][I0x0][N[c10][b12.12. [snip]
BGP routing table entry for [E][L2][I0x0][N[c10][b12.12.12.12][s0000.0000.0012.00]][R[c10][b12.12.12.12][s0000.0000.0012.03]]/576
Versions:
  Process           bRIB/RIB  SendTblVer
  Speaker                  7           7
Paths: (1 available, best #1)
  Not advertised to any peer
  Path #1: Received by speaker 0
  Not advertised to any peer
  Local
    12.12.12.12 from 12.12.12.12 (12.12.12.12)
      Origin IGP, localpref 100, valid, internal, best, group-best
      Received Path ID 0, Local Path ID 1, version 7
      Link-state: Local TE Router-ID: 12.12.12.12, TE-default-metric: 100 metric: 100

The [T] route details don't really reveal anything interesting other than the IGP metric. This makes sense because this is just an IP route and doesn’t carry any interesting TE information like the “vertex” and “edge” routes do, which are critical components in the link-state graph.

RP/0/0/CPU0:XRv14#show bgp link-state link-state [T][L2][I0x0][N[c10][b12.12. [snip]
BGP routing table entry for [T][L2][I0x0][N[c10][b12.12.12.12][s0000.0000.0012.00]][P[p12.12.12.12/32]]/400
Versions:
  Process           bRIB/RIB  SendTblVer
  Speaker                 33          33
Paths: (1 available, best #1)
  Not advertised to any peer
  Path #1: Received by speaker 0
  Not advertised to any peer
  Local
    12.12.12.12 from 12.12.12.12 (12.12.12.12)
      Origin IGP, localpref 100, valid, internal, best, group-best
      Received Path ID 0, Local Path ID 1, version 33
      Link-state: Metric: 0


PCE Vocabulary and Definitions

PCE: Path Computation Element is a network entity that can compute paths based on the topological graph and a set of constraints. It's basically an enhanced version of the "TE PCALC" and is often externally hosted (not composite). Paths that span multiple domains could be controlled by a single PCE (centralized) or multiple PCEs (distributed).
PCC: The Path Computation Client generally represents an LSR that can request a path computation.
PCEP: PCE protocol is used between PCE and PCC and is defined in RFC 5440. It is just a TCP session whereby path requests, replies, commands, and error messages are exchanged.
TED: Traffic Engineering Database is what the PCE uses to determine paths through the network. The TED is not a new construct, as RSVP-TE used the same information as input to the "TE PCALC" algorithm, ultimately generating an ERO to feed RSVP-TE PATH messages.

PCE systems can operate in one of two modes:

Stateful PCE: The PCE has a strict synchronization with its PCC devices to include topology and resource information. This allows for optimal path computation network-wide. The downside is the significant control-plane overhead between PCE/PCC, and potentially between multiple PCEs (more SDN-like). With the delegation option enabled, you also give the PCE the authority to change your statically configured tunnels, not just the automatically created PCE ones. The automatically created ones, like other auto-tunnels for TE, will be spawned and re-used based on the range of numbers you specify.
Stateless PCE: Each path request is computed independently of the others, and the PCE does not remember any path information. This means less control-plane overhead but much less capability. This is less SDN-like and might be better for PCE newcomers initially.

IGP extensions to support PCE discovery: IS-IS introduces "PCED sub-TLV type 5" while OSPF includes sub-TLVs inside the Router Information LSA (Opaque Type 4, Opaque ID 0) with PCED TLV type 6. There are more sub-TLVs included to define specific PCE parameters, covered in RFCs 5088 and 5089 for OSPF and IS-IS, respectively.

Three main PCE deployment models:

Composite PCE: The PCE is hosted locally on an LSR. The PCE reads the TED and controls the LSP signaling within one box. I don't believe this is supported on XR as it would effectively turn the router into a TE controller.


External PCE: The PCE is hosted on an external platform like a server or out-of-band LSR. The TED is learned via BGP-LS and the PCE controls the LSP signaling using PCE protocol (PCEP) to control the PCCs. This seems like an appropriate model for a small SP or a test setup.
Multiple PCE: Typically uses external PCEs interconnected to share information, either for availability or scalability. This would be appropriate for a large, distributed SP.

The lab for this feature is very basic. In this case, the PCE source address for the TCP session is XRv12's loopback, and two peers are defined. The precedence value is self-explanatory: lower numbers indicate higher priority. 14.14.14.14/32 is XRv14's loopback (remember, we have a BGP-LS peering with it, since it's our imitation PCE) and 15.15.15.15/32 is a non-existent address just for demonstration. PCE is operating in stateful mode with delegation of TE tunnels. The instantiation keyword allows the PCE to create new dynamic tunnels. The auto-tunnel pcc stanza is outside of the PCE hierarchy but allows the PCC (this router) to generate auto-tunnels in a specified index range. Auto-tunnels come in many varieties and are discussed in detail in the MPLS TE section of this book.

! XRv12
mpls traffic-eng
 auto-tunnel pcc
  tunnel-id min 100 max 199
 pce
  peer source ipv4 12.12.12.12
  peer ipv4 14.14.14.14
   precedence 1
  peer ipv4 15.15.15.15
   precedence 2
  logging events peer-status
  stateful-client
   instantiation
   delegation

When we verify our PCE sessions, we see that both sessions are pending, since we have no real PCE. The remaining show commands for PCE are related to specific LSPs and tunnels, of which we have zero. I didn’t actually set up the entire PCE environment as I believe it is beyond the scope of the CCIE SP exam.

RP/0/0/CPU0:XRv12#show mpls traffic-eng pce peer
Address         Precedence   State        Learned From
--------------- ------------ ------------ --------------------
14.14.14.14     1            TCP Pending  Static config
15.15.15.15     2            TCP Pending  Static config
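The "lower precedence number wins" rule can be illustrated with a few lines of Python. This is only a sketch of the ordering rule stated above; the `PcePeer` class, the `session_up` field, and the `preferred_peer` function are hypothetical names of my own, and the exact way XR chooses among established peers was not verified in this lab since no session ever came up.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class PcePeer:
    """Hypothetical model of one configured PCE peer on the PCC."""
    address: str
    precedence: int
    session_up: bool  # assumption: True once the PCEP session is established


def preferred_peer(peers: List[PcePeer]) -> Optional[PcePeer]:
    """Pick the established peer with the lowest precedence value, if any."""
    candidates = [peer for peer in peers if peer.session_up]
    return min(candidates, key=lambda peer: peer.precedence, default=None)


if __name__ == "__main__":
    # Mirrors the lab: both peers are configured, neither session is up.
    lab_peers = [
        PcePeer("14.14.14.14", precedence=1, session_up=False),
        PcePeer("15.15.15.15", precedence=2, session_up=False),
    ]
    print(preferred_peer(lab_peers))
```

With both sessions stuck in a pending state, as in the lab, no peer is selected at all.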

Additional Reading – Reference configurations "bgp-ls"


28. Describe, implement, and troubleshoot MVPN

This section walks through many advanced MVPN configurations. The topology is shown below; XRv3 and XRv4 are core routers, with the other six as VPNv4/v6 PEs. XRv2 is a BGP RR for VPNv4/v6 and MVPNv4/v6 (or IPv4/v6 MVPN, same thing). I tried to keep XRv away from the edge since its MVPN capabilities are limited, so most of the testing will occur on the five CSR PEs. CSR10 is the RP for all ASM groups, while the other nodes are senders and/or receivers. The SSM range of 232.0.0.0/8 is also used by the MVPN clients. All 6 sites have full IPv4 unicast reachability, and IPv6 reachability is limited to the top three CEs and bottom three CEs in a segmented fashion. The core runs IS-IS level 2 with MPLS TE enabled on all links.

For all profiles, the following (*,G) and (S,G) streams are joined by the specified CE devices:

XRv1: (*, 225.0.0.1)
CSR10: (2001:10:4:6::4, FF33::1)
CSR9: (*, 225.0.0.1), (2001:10:4:6::4, FF33::1)
CSR4: (*, 225.0.0.1), (10.2.7.2, 232.0.0.5)
CSR3: (*, 225.0.0.1), (10.2.7.2, 232.0.0.5), (2001:10:4:6::4, FF33::1), (*,FF7E:240:2001:10:2:7:0:1)
CSR2: (2001:10:4:6::4, FF33::1)

Anywhere MDT default groups are used, the format is as follows:

First octet: 239 for ASM, 238 for Bidir, and 232 for SSM
Remaining octets: X.255.0.1

Anywhere MDT data groups are used, the format is as follows:

First octet: 239 for ASM, 238 for Bidir, and 232 for SSM
Second octet: 4 for IPv4, 6 for IPv6
Third octet: Router number (1, 5, 6, 7, 8, 12) for each PE
Fourth octet: Variable, 0 - 255

The network diagram used in all profiles is shown below. The routers within the box make up the SP core. XRv2 is the BGP RR for all AFIs that are in use. The 6 routers not inside the box are CE devices. Each of them has an IPv4/IPv6 static default route pointing towards the PEs, while the PEs redistribute connected PE-CE links into BGP for VPNv4/v6 reachability. VPNv4 and VPNv6 are always enabled. MVPNv4, MVPNv6, and IPv4 MDT will be enabled based on the profile being tested. The SP core runs PIM only for non-LSM profiles as a best practice.
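Because the MDT group numbering follows a fixed pattern, any group seen later in the show output can be decoded back to its core PIM mode, customer address family, and sending PE. The Python sketch below encodes the scheme exactly as described above; the function and dictionary names are my own (assumptions), not anything from IOS.

```python
import ipaddress

# First octet per core PIM mode, as laid out in the addressing plan above.
FIRST_OCTET = {"asm": 239, "bidir": 238, "ssm": 232}
PE_NUMBERS = (1, 5, 6, 7, 8, 12)


def default_mdt_group(core_mode):
    """Default MDT group: X.255.0.1, where X depends on the core PIM mode."""
    return ipaddress.IPv4Address(f"{FIRST_OCTET[core_mode]}.255.0.1")


def data_mdt_group(core_mode, customer_af, pe_number, index):
    """Data MDT group: <mode>.<4 or 6>.<PE router number>.<0-255>."""
    if customer_af not in (4, 6):
        raise ValueError("customer_af must be 4 (IPv4) or 6 (IPv6)")
    if pe_number not in PE_NUMBERS:
        raise ValueError(f"pe_number must be one of {PE_NUMBERS}")
    if not 0 <= index <= 255:
        raise ValueError("index must be between 0 and 255")
    return ipaddress.IPv4Address(
        f"{FIRST_OCTET[core_mode]}.{customer_af}.{pe_number}.{index}"
    )


def data_mdt_source_pe(group):
    """Recover the sending PE's router number from the third octet."""
    return ipaddress.IPv4Address(group).packed[2]


if __name__ == "__main__":
    print(default_mdt_group("asm"))
    print(data_mdt_group("asm", 4, 7, 0))
    print(data_mdt_source_pe("239.4.7.0"))
```

For example, a data MDT group of 239.4.7.0 decodes as an ASM core, an IPv4 customer flow, and router 7 as the sending PE.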


28.1 Profile 0: Default MDT − GRE − PIM C−mcast Signaling (Traditional Draft-Rosen)

Because the type of PIM used in the core can vary and Profile 0 generally recommends SSM, we will look at all of the options. The baseline configuration has XRv4 as the RP for 239.0.0.0/8 groups using ASM and 238.0.0.0/8 groups using PIM-Bidir. BGP IPv4 MDT and IPv4/v6 MVPN families are disabled everywhere. PIM-SM is enabled on all transit links and MDT endpoints on PEs. The basic multicast configuration is not shown in this section.

Rather than show the same basic verification over and over for the different flavors of profile 0, the baseline verifications are done here. First, we will check PIM neighbor counts on XRv3 and XRv4; because there are no neighbor filters, I assume for brevity that the neighbors are bidirectional with all the PEs. There are 2 neighbors per link because XR counts its own interface on the link (which is also why loopback0 shows 1 neighbor).

RP/0/0/CPU0:XRv3#show pim neighbor count
PIM neighbors in VRF default
Interface                         Nbr count
GigabitEthernet0_0_0_0.523        2
GigabitEthernet0_0_0_0.534        2
GigabitEthernet0_0_0_0.553        2
GigabitEthernet0_0_0_0.563        2
GigabitEthernet0_0_0_0.573        2
AMT                               1
Loopback0                         1

RP/0/0/CPU0:XRv4#show pim neighbor count
PIM neighbors in VRF default
Interface                         Nbr count
GigabitEthernet0_0_0_0.514        2
GigabitEthernet0_0_0_0.534        2
GigabitEthernet0_0_0_0.554        2
GigabitEthernet0_0_0_0.574        2
GigabitEthernet0_0_0_0.584        2
AMT                               1
Loopback0                         1

Next, I check the status of the RP on a few select nodes. 239.0.0.0/8 is the ASM range, 238.0.0.0/8 is the Bidir range, and 232.0.0.0/8 is the default SSM range (no RP required). XRv4, the BSR/RP, shows the ranges. I spot-check XRv3 and CSR8 to ensure they learned the mappings properly, and are using the right type of PIM for each RP (ASM or Bidir). This completes the basic multicast transport verification for the core network.

RP/0/0/CPU0:XRv4#show pim bsr rp-cache
PIM BSR Candidate RP Cache

Group(s) 238.0.0.0/8, RP count 1
  RP-addr         Proto Priority Holdtime(s) Uptime    Expires
  213.14.14.14    BD    192      150         00:06:27  00:02:02

Group(s) 239.0.0.0/8, RP count 1
  RP-addr         Proto Priority Holdtime(s) Uptime    Expires
  213.14.14.14    SM    192      150         00:06:27  00:02:02

RP/0/0/CPU0:XRv3#show pim rp mapping
PIM Group-to-RP Mappings
Group(s) 238.0.0.0/8
  RP 213.14.14.14 (?), v2, bidir
    Info source: 213.13.14.14 (?), elected via bsr, priority 192, holdtime 150
      Uptime: 00:07:36, expires: 00:02:08
Group(s) 239.0.0.0/8
  RP 213.14.14.14 (?), v2
    Info source: 213.13.14.14 (?), elected via bsr, priority 192, holdtime 150
      Uptime: 00:07:36, expires: 00:02:08


R8#show ip pim rp mapping
PIM Group-to-RP Mappings

Group(s) 238.0.0.0/8
  RP 213.14.14.14 (?), v2, bidir
    Info source: 213.14.14.14 (?), via bootstrap, priority 192, holdtime 150
         Uptime: 00:07:57, expires: 00:01:44
Group(s) 239.0.0.0/8
  RP 213.14.14.14 (?), v2
    Info source: 213.14.14.14 (?), via bootstrap, priority 192, holdtime 150
         Uptime: 00:07:57, expires: 00:01:44

28.1.1 PIM-ASM in the core

This variation starts with the baseline configuration and does not require any modification. The default MDT for VRF MC uses group 239.255.0.1 which, at present, is an ASM group. This default MDT is used for signaling between the nodes so C-PIM information, such as joins, prunes, hellos, and BSR messages, can transit the SP core. Each PE loopback participating in this MDT is a source for group 239.255.0.1. We can check XRv4, the RP, and we should see a P(S,G) state entry for each PE using the default MDT group. Because this yields a lot of output, I will show a summary view. The P(*,G) entry is always in the table provided there is at least one receiver (we will discuss receivers next). Each P(S,G) is duplicated for a shortest path tree (SPT) and an RP tree (RPT, or root path tree).

Note: Cisco documentation clearly states the need to run BGP IPv4 MDT between sites for this profile to work. I found that this is not true, as this entire section was done without this feature being configured. This may be because these ASM/Bidir variations are not technically compliant with Cisco’s definition of the “profile”, but they are functionally valid.

RP/0/0/CPU0:XRv4#show pim topology 239.255.0.1 | include 239.255.0.1
(*,239.255.0.1) SM Up: 00:33:44 RP: 213.14.14.14*
(213.1.1.1,239.255.0.1)RPT SM Up: 00:33:44 RP: 213.14.14.14*
(213.1.1.1,239.255.0.1)SPT SM Up: 00:33:31
(213.5.5.5,239.255.0.1)RPT SM Up: 00:33:38 RP: 213.14.14.14*
(213.5.5.5,239.255.0.1)SPT SM Up: 00:33:38
(213.6.6.6,239.255.0.1)RPT SM Up: 00:33:36 RP: 213.14.14.14*
(213.6.6.6,239.255.0.1)SPT SM Up: 00:33:36
(213.7.7.7,239.255.0.1)RPT SM Up: 00:33:26 RP: 213.14.14.14*
(213.7.7.7,239.255.0.1)SPT SM Up: 00:33:26
(213.8.8.8,239.255.0.1)RPT SM Up: 00:33:32 RP: 213.14.14.14*
(213.8.8.8,239.255.0.1)SPT SM Up: 00:33:32
(213.12.12.12,239.255.0.1)RPT SM Up: 00:01:23 RP: 213.14.14.14*
(213.12.12.12,239.255.0.1)SPT SM Up: 00:01:27

We will zoom in on CSR8’s related state information on the P-RP. The RPT entries represent the state created when a source registered with the RP; this is why the RPF interface is a decapsulation tunnel, since that is the mechanism by which “native” multicast is viewed by the router. The SPT is the P(S,G) state we normally see in XE that shows how native multicast is forwarded (not encapsulated register messages). This traffic is forwarded out all of XRv4’s other interfaces towards the remote PEs in the same MDT.

RP/0/0/CPU0:XRv4#show pim topology 239.255.0.1 213.8.8.8 | begin RPT SM
(213.8.8.8,239.255.0.1)RPT SM Up: 00:17:22 RP: 213.14.14.14*
JP: Prune(never) RPF: Decapstunnel0,213.14.14.14 Flags: KAT(00:02:42) RA RR (00:03:45)
  GigabitEthernet0/0/0/0.584  00:17:22  off Prune(00:02:35)

(213.8.8.8,239.255.0.1)SPT SM Up: 00:17:22
JP: Join(00:00:25) RPF: GigabitEthernet0/0/0/0.584,213.8.14.8 Flags: KAT(00:02:42) RA RR (00:03:45)
  GigabitEthernet0/0/0/0.514  00:16:18  fwd Join(00:02:55)
  GigabitEthernet0/0/0/0.534  00:17:17  fwd Join(00:03:12)
  GigabitEthernet0/0/0/0.554  00:16:18  fwd Join(00:02:57)
  GigabitEthernet0/0/0/0.574  00:16:18  fwd Join(00:02:53)

The reason XRv4 knows to forward the traffic to all receivers is that each PE loopback is also a receiver for the same group. This is part of the reason why PIM must be enabled on the MDT source; an IGMP join is automatically applied to that interface, which triggers a PIM P(*,G) join towards the SP core’s RP for that group. For completeness, I show some debug output from CSR8 which shows the periodic join. CSR8 also tells the RP that it doesn’t want traffic from the source 213.8.8.8, which is itself. All of the other PEs perform similar operations.

R8#show ip igmp groups 239.255.0.1
IGMP Connected Group Membership
Group Address    Interface    Uptime    Expires   Last Reporter   Group Accounted
239.255.0.1      Loopback0    00:40:49  stopped   213.8.8.8

! CSR8
PIM(0): Building Periodic (*,G) Join / (S,G,RP-bit) Prune message for 239.255.0.1
PIM(0): Insert (*,239.255.0.1) join in nbr 213.8.14.14's queue
PIM(0): Insert (213.8.8.8,239.255.0.1) sgr prune in nbr 213.8.14.14's queue
PIM(0): Building Join/Prune packet for nbr 213.8.14.14

Focusing on CSR8, we can check its global MRIB to see how it treats this default MDT group. The summary view gives us most of the information, and I remove some of the expanded output because many entries are nearly identical. We can see the big ‘Z’ flag set on these groups, which represents a multicast tunnel. The last group is the locally originated traffic which was locally registered (big ‘F’) after arriving on the loopback0 (MDT source). The OIL for these tunnels includes the MVRF MC, which is the VRF to which this MDT is bound.

R8#show ip mroute 239.255.0.1 summary | begin \(
(*, 239.255.0.1), 00:52:33/stopped, RP 213.14.14.14, OIF count: 1, flags: SJCFZ
  (213.12.12.12, 239.255.0.1), 00:00:09/00:02:50, OIF count: 1, flags: JTZ
  (213.5.5.5, 239.255.0.1), 00:31:41/00:02:36, OIF count: 1, flags: JTZ
  (213.1.1.1, 239.255.0.1), 00:31:42/00:01:47, OIF count: 1, flags: JTZ
  (213.7.7.7, 239.255.0.1), 00:31:54/00:02:43, OIF count: 1, flags: JTZ
  (213.6.6.6, 239.255.0.1), 00:32:12/00:01:44, OIF count: 1, flags: JTZ
  (213.8.8.8, 239.255.0.1), 00:32:17/00:02:03, OIF count: 1, flags: FT

R8#show ip mroute 239.255.0.1 | begin \(
(*, 239.255.0.1), 00:53:05/stopped, RP 213.14.14.14, flags: SJCFZ
  Incoming interface: GigabitEthernet2.584, RPF nbr 213.8.14.14
  Outgoing interface list:
    MVRF MC, Forward/Sparse, 00:53:03/stopped

(213.12.12.12, 239.255.0.1), 00:00:40/00:02:19, flags: JTZ
  Incoming interface: GigabitEthernet2.584, RPF nbr 213.8.14.14
  Outgoing interface list:
    MVRF MC, Forward/Sparse, 00:00:40/00:02:19
[snip]
(213.8.8.8, 239.255.0.1), 00:27:24/00:02:56, flags: FT
  Incoming interface: Loopback0, RPF nbr 0.0.0.0
  Outgoing interface list:
    GigabitEthernet2.584, Forward/Sparse, 00:27:24/00:03:04
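The flag strings in these mroute entries are dense, so a small decoder can help while studying the output. The Python sketch below covers only the flags that appear in this section; the descriptions follow the legend that IOS prints at the top of "show ip mroute" as I recall it, and the function names are my own (assumptions).

```python
# Only the flags seen in this section; descriptions follow the IOS mroute legend.
MROUTE_FLAGS = {
    "S": "Sparse",
    "s": "SSM Group",
    "C": "Connected",
    "F": "Register flag",
    "T": "SPT-bit set",
    "J": "Join SPT",
    "I": "Received Source Specific Host Report",
    "Z": "Multicast Tunnel",
    "z": "MDT-data group sender",
    "Y": "Joined MDT-data group",
}


def decode_flags(flags):
    """Expand a flag string such as 'SJCFZ' into readable descriptions."""
    return [MROUTE_FLAGS.get(flag, f"unknown ({flag})") for flag in flags]


def is_mdt_tunnel_entry(flags):
    """Big 'Z' marks an entry that feeds a multicast tunnel (MDT)."""
    return "Z" in flags


if __name__ == "__main__":
    for flag_string in ("SJCFZ", "JTZ", "FT"):
        print(flag_string, "->", ", ".join(decode_flags(flag_string)))
```

Note that case matters: big 'Z' and little 'z' (and big 'S' and little 's') mean different things, which is easy to miss when skimming.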

With the multicast tunnels operational, we can check the C-PIM information to verify we have PIM neighbors across the tunnel (in the overlay) with the other five PEs. This PIM signaling is always carried by the default MDT and cannot be optimized into a data MDT, because it needs to go to all nodes. The default MDT is often called an “emulated LAN” since all of the multi-access PIM mechanisms, such as Assert, will occur over this tunnel mesh.

R8#show ip pim vrf MC neighbor | begin ^Neighbor
Neighbor          Interface    Uptime/Expires     Ver   DR
Address                                                 Prio/Mode
213.12.12.12      Tunnel2      00:04:22/00:01:21  v2    1 / DR B P G
213.7.7.7         Tunnel2      00:35:59/00:01:17  v2    1 / B S P G
213.5.5.5         Tunnel2      00:36:29/00:01:17  v2    1 / B S P G
213.1.1.1         Tunnel2      00:36:29/00:01:18  v2    1 / B S P G
213.6.6.6         Tunnel2      00:36:29/00:01:19  v2    1 / B S P G

The C-MRIB on CSR8 shows two interesting entries currently. The first is the IGMPv3 join for the SSM pair. The big ‘Y’ is set, meaning this group should be receiving traffic on a data MDT. We can see that a new data MDT was spawned using group 239.4.7.0 specifically to carry this traffic. The third octet represents the router that is sending the traffic, so we can see that CSR7 is the root (source) of this flow. This occurs because this C(S,G) was matched by the ACL defining what traffic was a candidate for data MDT transport. The second entry is the C(*,G) entry for customer ASM. CSR10 is the C-RP, and this information was learned via BSR across the default MDT. There hasn’t been any traffic to this group yet.

R8#show ip mroute vrf MC | begin \(
(10.2.7.2, 232.0.0.5), 00:00:43/00:02:27, flags: sTIY
  Incoming interface: Tunnel2, RPF nbr 213.7.7.7, MDT:239.4.7.0/00:02:26
  Outgoing interface list:
    GigabitEthernet2.538, Forward/Sparse, 00:00:33/00:02:27

(*, 225.0.0.1), 00:08:05/00:02:16, RP 10.5.10.10, flags: SJC
  Incoming interface: Tunnel2, RPF nbr 213.5.5.5
  Outgoing interface list:
    GigabitEthernet2.538, Forward/Sparse, 00:08:05/00:02:16

For verification, we will confirm our understanding of the data MDT by checking CSR6. It should have joined the same data MDT group.

R6#show ip mroute vrf MC 232.0.0.5 10.2.7.2 | begin \(
(10.2.7.2, 232.0.0.5), 00:00:38/00:02:43, flags: sTIY
  Incoming interface: Tunnel0, RPF nbr 213.7.7.7, MDT:239.4.7.0/00:02:43
  Outgoing interface list:
    GigabitEthernet2.546, Forward/Sparse, 00:00:38/00:02:43

It is not a coincidence that both routers knew to pick the same group. This was carried inside the MDT signaling in the overlay. Although the debugs don’t reveal the contents of the overlay, we can see that it happened. We can also look at the send/receive MDT information in a separate table. CSR7 was the originator, and CSR6/CSR8 were the receivers.

! CSR6
PIM(1): Receive MDT Packet (13613) from 213.7.7.7 (Tunnel0), length (ip: 44, udp: 24), ttl: 1
PIM(1): TLV type: 1 length: 16 MDT Packet length: 16

R6#show ip pim mdt receive
Joined MDT-data [group/mdt number : source]  uptime/expires for VRF: MC
 [239.4.7.0 : 213.7.7.7]  00:14:54/00:02:05

R8#show ip pim mdt receive
Joined MDT-data [group/mdt number : source]  uptime/expires for VRF: MC
 [239.4.7.0 : 213.7.7.7]  00:15:06/00:02:53

R7#show ip pim mdt send
MDT-data send list for VRF: MC
  (source, group)                     MDT-data group/num   ref_count
  (10.2.7.2, 232.0.0.5)               239.4.7.0            1
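The lengths in the CSR6 debug (TLV type 1, TLV length 16, UDP length 24, IP length 44) line up with the IPv4 MDT Join TLV described in RFC 6037. The Python sketch below builds and parses such a TLV to show where those numbers come from. The field layout (1-byte type, 2-byte length, 1 reserved byte, then C-source, C-group, and P-group) is my reading of that RFC and should be treated as an assumption, as should the function names; the addresses are taken from the lab.

```python
import ipaddress
import struct

# Assumed layout per RFC 6037: type (1), length (2), reserved (1),
# C-source (4), C-group (4), P-group (4), all in network byte order.
MDT_JOIN_FORMAT = "!BHB4s4s4s"
MDT_JOIN_TLV_TYPE = 1
UDP_HEADER_LEN = 8
IPV4_HEADER_LEN = 20  # no IP options assumed


def build_mdt_join(c_source, c_group, p_group):
    """Encode an IPv4 MDT Join TLV mapping a C(S,G) to a data MDT P-group."""
    return struct.pack(
        MDT_JOIN_FORMAT,
        MDT_JOIN_TLV_TYPE,
        struct.calcsize(MDT_JOIN_FORMAT),  # the length covers the whole TLV
        0,                                 # reserved
        ipaddress.IPv4Address(c_source).packed,
        ipaddress.IPv4Address(c_group).packed,
        ipaddress.IPv4Address(p_group).packed,
    )


def parse_mdt_join(data):
    """Decode the TLV back into its fields."""
    tlv_type, length, _reserved, src, grp, pgrp = struct.unpack(MDT_JOIN_FORMAT, data)
    return {
        "type": tlv_type,
        "length": length,
        "c_source": str(ipaddress.IPv4Address(src)),
        "c_group": str(ipaddress.IPv4Address(grp)),
        "p_group": str(ipaddress.IPv4Address(pgrp)),
    }


if __name__ == "__main__":
    tlv = build_mdt_join("10.2.7.2", "232.0.0.5", "239.4.7.0")
    udp_len = UDP_HEADER_LEN + len(tlv)
    print(parse_mdt_join(tlv))
    print("tlv:", len(tlv), "udp:", udp_len, "ip:", IPV4_HEADER_LEN + udp_len)
```

Under these assumptions, the message CSR7 sends over the default MDT is what tells CSR6 and CSR8 that C(10.2.7.2, 232.0.0.5) now rides on P-group 239.4.7.0.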


An unfortunate drawback of the ASM model is that our data MDTs are also ASM groups. That means even though the overlay signaling was SSM, the underlying transport is not, so hopefully the SP is allowing SPT construction in the core for those flows.

R6#show ip mroute 239.4.7.0 | begin \(
(*, 239.4.7.0), 00:13:51/stopped, RP 213.14.14.14, flags: SJCZ
  Incoming interface: GigabitEthernet2.563, RPF nbr 213.6.13.13
  Outgoing interface list:
    MVRF MC, Forward/Sparse, 00:08:40/stopped

Now, we will send traffic from CSR2 to group 232.0.0.5. We don’t expect any additional PIM signaling to occur within the overlay, since the delivery trees have been built. However, in the underlay, CSR7 will have to register this new (S,G) with XRv4, and the SPT construction happens per normal PIM-SM rules. The S remains the MDT source (loopback0), but the G will be the new data MDT group assigned for this flow. XRv4 receives the register and sends an (S,G) join towards the source, CSR7’s loopback. XR PIM debugs are extensive, so I’ve only cherry-picked the valuable parts.

! CSR7
PIM(0): Adding register encap tunnel (Tunnel3) as forwarding interface of (213.7.7.7, 239.4.7.0).
PIM(0): Received v2 Join/Prune on GigabitEthernet2.574 from 213.7.14.14, to us
PIM(0): Join-list: (213.7.7.7/32, 239.4.7.0), S-bit set
PIM(0): Add GigabitEthernet2.574/213.7.14.14 to (213.7.7.7, 239.4.7.0), Forward state, by PIM SG Join

! XRv4
pim[1152]: [13] VRF : default (213.7.7.7,239.4.7.0) Received Register from 213.7.14.7 Register count 6, Register limit 20000
pim[1152]: [13] Route create: VRF default (213.7.7.7,239.4.7.0)
pim[1152]: [13] VRF : default (213.7.7.7,239.4.7.0) SEND J/P adding Join on Gi0/0/0/0.574, target 213.7.14.7
pim[1152]: [13] Sending J/P message for neighbor 213.7.14.7 on Gi0/0/0/0.574 for 1 groups, size 34 (MTU=1480)

XRv4 decapsulates the register message and sends it down the shared tree. CSR6 and CSR8 receive it and issue (S,G) joins towards CSR7 to leave the shared tree.

! CSR6
PIM(0): Insert (213.7.7.7,239.4.7.0) join in nbr 213.6.13.13's queue
PIM(0): Building Join/Prune packet for nbr 213.6.13.13
PIM(0): Adding v2 (213.7.7.7/32, 239.4.7.0), S-bit Join
PIM(0): Send v2 join/prune to 213.6.13.13 (GigabitEthernet2.563)

! CSR8
PIM(0): Insert (213.7.7.7,239.4.7.0) join in nbr 213.8.14.14's queue
PIM(0): Building Join/Prune packet for nbr 213.8.14.14
PIM(0): Adding v2 (213.7.7.7/32, 239.4.7.0), S-bit Join
PIM(0): Send v2 join/prune to 213.8.14.14 (GigabitEthernet2.584)

CSR7 receives the SPT join from XRv3 (originated by CSR6) and adds it to its delivery tree.
! CSR7
PIM(0): Received v2 Join/Prune on GigabitEthernet2.573 from 213.7.13.13, to us
PIM(0): Join-list: (213.7.7.7/32, 239.4.7.0), S-bit set
PIM(0): Add GigabitEthernet2.573/213.7.13.13 to (213.7.7.7, 239.4.7.0), Forward state, by PIM SG Join

CSR7 has not been told to stop registering, so it sends another one to the RP, along with native multicast. When XRv4 sees native multicast on the SPT, it issues the register-stop. ! CSR7 PIM(0): Received v2 Register-Stop on GigabitEthernet2.574 from 213.14.14.14 PIM(0): for source 213.7.7.7, group 239.4.7.0 PIM(0): Removing register encap tunnel (Tunnel3) as forwarding interface of (213.7.7.7, 239.4.7.0). PIM(0): Clear Registering flag to 213.14.14.14 for (213.7.7.7/32, 239.4.7.0) PIM(0): Building Periodic (*,G) Join / (S,G,RP-bit) Prune message for 239.4.7.0 ! XRv4 pim[1152]: [13] VRF : default (213.7.7.7,239.4.7.0) Received Register from 213.7.14.7 Register count 7, Register limit 20000 pim[1152]: [13] VRF : default (213.7.7.7,239.4.7.0) Send Register-Stop to 213.7.14.7

At this point, CSR7 should be delivering (213.7.7.7, 239.4.7.0) out of both interfaces towards CSR6 and CSR8. The little ‘z’ means this is the outer encapsulation for a data MDT group. R7#show ip mroute 239.4.7.0 213.7.7.7 | begin \( (213.7.7.7, 239.4.7.0), 00:12:07/00:01:48, flags: FTz Incoming interface: Loopback0, RPF nbr 0.0.0.0 Outgoing interface list: GigabitEthernet2.573, Forward/Sparse, 00:12:05/00:03:19 GigabitEthernet2.574, Forward/Sparse, 00:12:07/00:03:22

We already know there is no new signaling in the overlay. Now, we can check packet counters on all three PEs; CSR7 sends packets to the MDT while CSR6 and CSR8 receive it from the MDT. R6#show ip mroute 239.4.7.0 213.7.7.7 count | begin ^Group Group: 239.4.7.0, Source count: 1, Packets forwarded: 412, Packets received: 412


Source: 213.7.7.7/32, Forwarding: 411/0/142/0, Other: 411/0/0 R7#show ip mroute 239.4.7.0 213.7.7.7 count | begin ^Group Group: 239.4.7.0, Source count: 1, Packets forwarded: 409, Packets received: 410 Source: 213.7.7.7/32, Forwarding: 409/0/124/0, Other: 410/1/0 R8#show ip mroute 239.4.7.0 213.7.7.7 count | begin ^Group Group: 239.4.7.0, Source count: 1, Packets forwarded: 406, Packets received: 406 Source: 213.7.7.7/32, Forwarding: 405/0/142/0, Other: 405/0/0

Fortunately, native multicast counters work in XRv. We can check the packet counters on XRv3 and XRv4 for extra verification. The “A” flag means accept (inbound), the “EG” flag means egress (outbound), and the “NS” flag means negate signal (not related to this). RP/0/0/CPU0:XRv3#show mfib route 239.4.7.0 sources-only | begin 213 (213.7.7.7,239.4.7.0), Flags: Up: 00:15:41 Last Used: 00:00:00 SW Forwarding Counts: 472/470/58280 SW Replication Counts: 472/470/58280 SW Failure Counts: 1/0/0/0/0 GigabitEthernet0/0/0/0.563 Flags: NS EG, Up:00:15:41 GigabitEthernet0/0/0/0.573 Flags: A, Up:00:15:41 RP/0/0/CPU0:XRv4#show mfib route 239.4.7.0 sources-only | begin 213 (213.7.7.7,239.4.7.0), Flags: Up: 00:16:11 Last Used: 00:00:00 SW Forwarding Counts: 486/485/60140 SW Replication Counts: 486/487/60388 SW Failure Counts: 0/0/0/0/0 GigabitEthernet0/0/0/0.574 Flags: A, Up:00:16:11 GigabitEthernet0/0/0/0.584 Flags: NS EG, Up:00:16:11

CSR6 will perform a final verification using EPC. It will capture traffic entering from the SP core (multicast ICMP inside GRE) and also exiting towards the customer (multicast ICMP only). Since the capture will look at both interfaces concurrently, the timestamps for the two packets should be very close together. It looks like the router was extremely fast in switching this packet (no time difference). In both packets, I highlighted the ICMP packet source/destination addresses. We can see the difference in length is exactly 24 bytes (142-118=24), which is the length of the outer IPv4 header (20 bytes) plus a basic GRE header (4 bytes). This further strengthens our confidence that it is working.
3 142 2.000992 213.7.7.7 -> 239.4.7.0 GRE
0000: 01005E04 07000050 56A9DB37 8100CDEB ..^....PV..7....
0010: 08004500 007C024B 0000FE2F E7F4D507 ..E..|.K.../....
0020: 0707EF04 07000000 08004500 00641CB3 ..........E..d..
0030: 0000FE01 A6DC0A02 0702E800 00050800 ................
4 118 2.000992 10.2.7.2 -> 232.0.0.5 ICMP
0000: 01005E00 00050050 56A9DE0D 81000DDA ..^....PV.......
0010: 08004500 00641CB3 0000FD01 A7DC0A02 ..E..d..........
0020: 0702E800 00050800 3A150007 02420000 ........:....B..
0030: 0000142C 2DC0ABCD ABCDABCD ABCDABCD ...,-...........
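The 24-byte difference can be cross-checked against the hex dump itself. The following is a minimal Python sketch, not part of the lab: it parses the first 64 bytes of frame 3 as printed in the capture, assuming an Ethernet header with a single 802.1Q tag (18 bytes) ahead of the outer IPv4 header. The helper names `parse_ipv4` and `gre_header_len` are hypothetical.

```python
import ipaddress
import struct

# First 64 bytes of frame 3, copied from the EPC dump (offsets 0x0000-0x003F).
FRAME3 = bytes.fromhex(
    "01005E04 07000050 56A9DB37 8100CDEB"
    "08004500 007C024B 0000FE2F E7F4D507"
    "0707EF04 07000000 08004500 00641CB3"
    "0000FE01 A6DC0A02 0702E800 00050800"
)
L2_LEN = 18  # assumption: 14-byte Ethernet header plus one 4-byte 802.1Q tag


def parse_ipv4(buf, off):
    """Extract a few IPv4 header fields starting at byte offset `off`."""
    ver_ihl, _tos, total_len = struct.unpack_from("!BBH", buf, off)
    return {
        "hdr_len": (ver_ihl & 0x0F) * 4,
        "total_len": total_len,
        "proto": buf[off + 9],
        "src": str(ipaddress.IPv4Address(buf[off + 12:off + 16])),
        "dst": str(ipaddress.IPv4Address(buf[off + 16:off + 20])),
    }


def gre_header_len(buf, off):
    """Basic GRE is 4 bytes; checksum, key and sequence add 4 bytes each."""
    flags, _proto_type = struct.unpack_from("!HH", buf, off)
    length = 4
    for bit in (0x8000, 0x2000, 0x1000):  # C, K and S flag bits
        if flags & bit:
            length += 4
    return length


outer = parse_ipv4(FRAME3, L2_LEN)
gre_off = L2_LEN + outer["hdr_len"]
gre_len = gre_header_len(FRAME3, gre_off)
inner = parse_ipv4(FRAME3, gre_off + gre_len)
overhead = outer["hdr_len"] + gre_len

print("P(S,G):", outer["src"], "->", outer["dst"], "protocol", outer["proto"])
print("C(S,G):", inner["src"], "->", inner["dst"], "protocol", inner["proto"])
print("encapsulation overhead:", overhead, "bytes")
```

The outer total length (0x7C) plus the 18-byte layer-2 header gives the 142-byte frame, and the inner total length (0x64) plus the same layer-2 header gives the 118-byte frame seen on the customer-facing interface.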

Next we will examine IPv6 ASM. This is significantly more complex because we are doing ASM in both the customer and SP networks. XE routers only allow debugging in one VRF at a time, so because we already examined what happens in the SP network, we will look at the C-PIM signaling for customer ASM traffic with IPv6. CSR2 is the embedded RP and only CSR3 is a receiver for the group specified below.
R8#show ipv6 mroute vrf MC FF7E:240:2001:10:2:7:0:1 | begin \(
(*, FF7E:240:2001:10:2:7:0:1), 01:00:00/never, RP 2001:10:2:7::2, flags: SCJ
  Incoming interface: Tunnel2
  RPF nbr: ::FFFF:213.7.7.7
  Immediate Outgoing interface list:
    GigabitEthernet2.538, Forward, 01:00:00/never
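The RP address shown in the mroute entry is not configured on CSR8; it is derived from the group address. As a rough illustration, the Python sketch below applies the embedded-RP layout from RFC 3956 (flags, RIID, prefix length, network prefix) to the group used in this lab. The function name `embedded_rp` is hypothetical and the code is only a sketch of the address arithmetic, not of any router implementation.

```python
import ipaddress


def embedded_rp(group: str) -> ipaddress.IPv6Address:
    """Derive the RP address encoded in an embedded-RP IPv6 group (RFC 3956)."""
    raw = ipaddress.IPv6Address(group).packed
    flags = raw[1] >> 4
    # The R, P and T flag bits must all be set (flags nibble 0x7).
    if raw[0] != 0xFF or (flags & 0x7) != 0x7:
        raise ValueError("not an embedded-RP group address")
    riid = raw[2] & 0x0F          # RP interface ID, 4 bits
    plen = raw[3]                 # length of the RP's network prefix
    if riid == 0 or not 1 <= plen <= 64:
        raise ValueError("invalid RIID or prefix length")
    prefix = int.from_bytes(raw[4:12], "big")
    # Keep only the first `plen` bits of the 64-bit prefix field.
    host_mask = (1 << (64 - plen)) - 1
    prefix &= ~host_mask & 0xFFFFFFFFFFFFFFFF
    return ipaddress.IPv6Address((prefix << 64) | riid)


print(embedded_rp("FF7E:240:2001:10:2:7:0:1"))
```

For the lab group, the flags nibble is 7, the RIID is 2 and the prefix length is 0x40 (64), so the derived address is the prefix 2001:10:2:7::/64 with an interface ID of 2, matching the RP shown in the output.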

Quickly checking for IPv6 PIM neighbors, we see that no PE has any neighbors over the MDT. Below is a spot check.
RP/0/0/CPU0:XRv2#show pim vrf MC ipv6 neighbor | begin mdt
mdtMC
Neighbor Address        Uptime    Expires   DR pri  DR    Flags
::ffff:213.12.12.12*    01:21:53  00:01:44  1       (DR)  P
R6#show ipv6 pim vrf MC neighbor
No neighbors found.
R1#show ipv6 pim vrf MC neighbor
No neighbors found.
R8#show ipv6 pim vrf MC neighbor
No neighbors found.

Cisco requires IPv6 multicast enabled in the global process also; when this is enabled globally, it is enabled for all interfaces by default. Be sure to manually remove it (no ipv6 pim) from core-facing links so it only remains enabled on the loopback0 (MDT source). You can quickly check to see which interfaces have PIMv6 enabled. Below is a spot check of some routers, along with their newly-built PIM adjacencies.


RP/0/0/CPU0:XRv2#show pim vrf MC ipv6 neighbor | begin mdt
mdtMC
Neighbor Address        Uptime    Expires   DR pri  DR    Flags
::ffff:213.1.1.1        00:02:39  00:01:16  1
::ffff:213.5.5.5        00:01:57  00:01:19  1
::ffff:213.6.6.6        00:12:15  00:01:17  1
::ffff:213.7.7.7        00:02:09  00:01:17  1
::ffff:213.8.8.8        00:02:54  00:01:16  1
::ffff:213.12.12.12*    01:52:26  00:01:34  1       (DR)  P
RP/0/0/CPU0:XRv2#show pim interface state-on
PIM interfaces in VRF default
Address        Interface                    PIM  Nbr    Hello  DR     DR
                                                 Count  Intvl  Prior
213.12.12.12   Loopback0                    on   1      30     1      this system
213.12.13.12   GigabitEthernet0/0/0/0.523   on   2      30     1      213.12.13.13
R8#show ipv6 pim vrf MC neighbor
PIM Neighbor Table
Mode: B - Bidir Capable, G - GenID Capable
Neighbor Address        Interface  Uptime    Expires   Mode  DR pri
::FFFF:213.1.1.1        Tunnel2    00:03:26  00:01:29  B G   1
::FFFF:213.5.5.5        Tunnel2    00:02:44  00:01:31  B G   1
::FFFF:213.6.6.6        Tunnel2    00:03:40  00:01:28  B G   1
::FFFF:213.7.7.7        Tunnel2    00:02:57  00:01:29  B G   1
::FFFF:213.12.12.12     Tunnel2    00:03:37  00:01:16  B G   DR 1
R8#show ipv6 pim interface state-on
Interface          PIM   Nbr    Hello  DR
                         Count  Intvl  Prior
Loopback0          on    0      30     1
    Address: FE80::21E:E6FF:FE4D:4D00
    DR     : this system
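The PIMv6 neighbors over the MDT appear as IPv4-mapped IPv6 addresses, so each one can be tied back to a PE's IPv4 MDT source. A small Python sketch of that mapping is shown below, using only the standard `ipaddress` module; the function name `mdt_neighbor_to_pe` is hypothetical, and the trailing `*` handling assumes the XR convention of marking the local router.

```python
import ipaddress


def mdt_neighbor_to_pe(neighbor: str) -> str:
    """Return the IPv4 MDT source hidden inside an IPv4-mapped IPv6 neighbor."""
    addr = ipaddress.IPv6Address(neighbor.rstrip("*"))  # XR marks itself with '*'
    if addr.ipv4_mapped is None:
        raise ValueError(f"{neighbor} is not an IPv4-mapped IPv6 address")
    return str(addr.ipv4_mapped)


# Neighbor addresses copied from the XRv2 and CSR8 outputs.
for nbr in ("::ffff:213.1.1.1", "::FFFF:213.7.7.7", "::ffff:213.12.12.12*"):
    print(nbr, "->", mdt_neighbor_to_pe(nbr))
```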

An interesting behavior of the CSR is that if the PIM DR is within a VRF and the RP is embedded, the register message never gets sent. Since CSR4 will be the source, we will tell CSR6 to use a static RP. We know embedded RP works from the receiver side since CSR2 has the C(*,G) join from CSR8. IPv6 (*,G) entries on the RP have an incoming interface of the PIM decapsulation tunnel for register messages as opposed to “null” for IPv4. The logic is the same, though.
! CSR6
ipv6 pim vrf MC rp-address 2001:10:2:7::2
ipv6 pim vrf MC register-source GigabitEthernet2.546
no ipv6 pim vrf MC rp embedded
R2#show ipv6 mroute FF7E:240:2001:10:2:7:0:1 | begin \(
(*, FF7E:240:2001:10:2:7:0:1), 00:04:30/00:02:58, RP 2001:10:2:7::2, flags: S
  Incoming interface: Tunnel2
  RPF nbr: 2001:10:2:7::2
  Immediate Outgoing interface list:
    GigabitEthernet2.527, Forward, 00:04:30/00:02:58
R2#show ipv6 pim tunnel tunnel2
Tunnel2*
  Type  : PIM Decap
  RP    : 2001:10:2:7::2*
  Source: -

CSR4 will begin sending traffic to this group. We expect CSR6 to register the source with CSR2, which will decapsulate the register message and send it down the shared tree, then immediately join the SPT. Given the location of CSR2 in the network, it sends a (S,G) prune to CSR7 instead, which is sufficient to build the (S,G) state there.
! CSR6
IPv6 PIM: [MC] (2001:10:4:6::4,FF7E:240:2001:10:2:7:0:1) Start registering to 2001:10:2:7::2
! CSR2
IPv6 PIM: J/P entry: Prune root: 2001:10:4:6::4 group: FF7E:240:2001:10:2:7:0:1 Sender: FE80::7 flags: RPT S
IPv6 PIM: (2001:10:4:6::4,FF7E:240:2001:10:2:7:0:1)RPT Create entry (2001:10:4:6::4,FF7E:240:2001:10:2:7:0:1) J/P adding Prune on GigabitEthernet2.527

Upon receiving the traffic, CSR8 will join the SPT by sending a PIMv6 join directed to CSR6 over the default MDT. ! CSR8 IPv6 PIM: [MC] (2001:10:4:6::4,FF7E:240:2001:10:2:7:0:1) J/P adding Join on Tunnel2

Whether triggered by CSR7’s SPT join from CSR2 or CSR8’s SPT join, CSR6 will switch over to the data MDT since this group is a candidate for optimization. CSR6 will eventually receive joins from both. At this point, CSR6 sends multicast both as register messages to the C-RP and as native multicast.
! CSR6
IPv6 PIM: [MC] J/P entry: Join root: 2001:10:4:6::4 group: FF7E:240:2001:10:2:7:0:1 Sender: ::FFFF:213.7.7.7 flags:
IPv6 PIM: [MC] J/P entry: Join root: 2001:10:2:7::2 group: FF7E:240:2001:10:2:7:0:1 Sender: ::FFFF:213.8.8.8 flags:

CSR2 receives another register then tells CSR6 to stop registering. CSR6 stops sending traffic down its PIM encapsulation tunnel.
! CSR2
IPv6 PIM: (2001:10:4:6::4,FF7E:240:2001:10:2:7:0:1) Received Register from 2001:10:4:6::6
IPv6 PIM: (2001:10:4:6::4,FF7E:240:2001:10:2:7:0:1) Send Register-Stop to 2001:10:4:6::6
! CSR6
IPv6 PIM: [MC] (2001:10:4:6::4,FF7E:240:2001:10:2:7:0:1) Received Register-Stop
IPv6 PIM: [MC] (2001:10:4:6::4,FF7E:240:2001:10:2:7:0:1) Stop registering
IPv6 PIM: [MC] (2001:10:4:6::4,FF7E:240:2001:10:2:7:0:1) Tunnel5 J/P state changed from Join to Null
IPv6 PIM: [MC] (2001:10:4:6::4,FF7E:240:2001:10:2:7:0:1) Tunnel5 FWD state change from Forward to Prune

R6#show ipv6 pim vrf MC tunnel Tunnel5* Type : PIM Encap RP : 2001:10:2:7::2 Source: 2001:10:4:6::6

At this point, CSR6 and CSR8 should have valid C(S,G) information. CSR6 claims to be forwarding traffic to multiple tunnels, but Tunnel2 doesn’t exist. I am assuming this feature is not supported fully in a virtual environment. The router says it is forwarding to an MDT data group, which is correct (little ‘y’). CSR8 specifically lists the MDT to which this C(S,G) is bound, which uses the SP-core group 239.6.6.0, rooted at CSR6.
R6#show ipv6 mroute vrf MC FF7E:240:2001:10:2:7:0:1 2001:10:4:6::4 | begin \(
(2001:10:4:6::4, FF7E:240:2001:10:2:7:0:1), 00:31:35/00:00:07, flags: SFTy
  Incoming interface: GigabitEthernet2.546
  RPF nbr: 2001:10:4:6::4
  Immediate Outgoing interface list:
    Tunnel2, Forward, 00:31:35/never
    Tunnel0, Forward, 00:20:38/00:03:08
R6#show ip interface brief | include Tunnel
Tunnel0    213.6.6.6    YES unset  up  up
Tunnel1    unassigned   YES unset  up  up
Tunnel3    213.6.13.6   YES unset  up  up
Tunnel4    10.4.6.6     YES unset  up  up
Tunnel5    unassigned   YES unset  up  up


R8#show ipv6 mroute vrf MC FF7E:240:2001:10:2:7:0:1 2001:10:4:6::4 | begin \( (2001:10:4:6::4, FF7E:240:2001:10:2:7:0:1), 00:17:47/now, flags: SJY Incoming interface: Tunnel2 MDT: 239.6.6.0/00:02:23 RPF nbr: ::FFFF:213.6.6.6 Inherited Outgoing interface list: GigabitEthernet2.538, Forward, 02:30:53/never

A quick check of the packet counters indicates traffic is flowing from end to end. The 280 failures were an error on my part; the “issue” I described regarding embedded RP in a VRF slowed me down, and I did not clear counters once I got things working. You can see the difference in packet size is exactly 24 (142 – 118) bytes also.
R6#show ipv6 mroute vrf MC FF7E:240:2001:10:2:7:0:1 2001:10:4:6::4 count | include HW
HW Forwarding: 903/0/118/0, Other: 280/0/280
R8#show ipv6 mroute vrf MC FF7E:240:2001:10:2:7:0:1 2001:10:4:6::4 count | include HW
HW Forwarding: 895/0/142/0, Other: 0/0/0

In the underlay, CSR6 has spawned a data MDT for this flow, which was carried over the default MDT in the PIM signaling information. CSR6 sends the MDT information and CSR8 receives it.
R6#show ipv6 pim mdt send
MDT-data send list for VRF: MC
  (source, group)                                  MDT-data group/num  ref_count
  (2001:10:4:6::4, FF7E:240:2001:10:2:7:0:1)       239.6.6.0           1
R8#show ipv6 pim mdt receive
Joined MDT-data [group/mdt number : source]  uptime/expires for VRF: MC
 [239.6.6.0 : 213.6.6.6]  00:25:50/00:02:14

Quickly checking XRv3 and XRv4, we can see this new MDT group exists in the core and appears to be functional. RP/0/0/CPU0:XRv3#show mfib route 239.6.6.0 sources-only | begin 213 (213.6.6.6,239.6.6.0), Flags: Up: 00:31:48 Last Used: 00:00:00 SW Forwarding Counts: 954/954/118296 SW Replication Counts: 954/1907/236468 SW Failure Counts: 0/0/0/0/0 GigabitEthernet0/0/0/0.534 Flags: NS EG, Up:00:31:48 GigabitEthernet0/0/0/0.563 Flags: A, Up:00:31:48 GigabitEthernet0/0/0/0.573 Flags: NS EG, Up:00:31:46


RP/0/0/CPU0:XRv4#show mfib route 239.6.6.0 sources-only | begin 213 (213.6.6.6,239.6.6.0), Flags: Up: 00:32:09 Last Used: 00:00:00 SW Forwarding Counts: 965/964/119536 SW Replication Counts: 965/966/119784 SW Failure Counts: 0/0/0/0/0 GigabitEthernet0/0/0/0.534 Flags: A, Up:00:32:09 GigabitEthernet0/0/0/0.584 Flags: NS EG, Up:00:32:09

We will configure EPC on CSR8 to look at GRE multicast incoming from the core and ICMP multicast going out towards the customer. Like with IPv4 SSM, the capture should show both packets almost simultaneously. The first packet is the GRE packet (0x2F is protocol 47) and the P(S,G) is (213.6.6.6, 239.6.6.0). This is how the service provider transports the packet across the core. The GRE protocol type is 0x86DD, which is IPv6, and the payload has C(S,G) equal to (2001:10:4:6::4, FF7E:240:2001:10:2:7:0:1). The second packet is identical to the first except without the outer IPv4/GRE headers. If you count the difference in length, you will get 24 (0x8E - 0x76 = 0x18 = 24).
R8#show monitor capture CAP buffer dump
0
0000: 01005E06 06000050 56A9DE77 8100CE00 ..^....PV..w....
0010: 08004500 007C0C58 0000FD2F E0E7D506 ..E..|.X.../....
0020: 0606EF06 06000000 86DD6000 0000003C ..........`....<
0030: 3A3F2001 00100004 00060000 00000000 :? .............
0040: 0004FF7E 02402001 00100002 00070000 ...~.@ .........
0050: 00018000 AE3A0B3D 047F7F80 81828384 .....:.=........
0060: 85868788 898A8B8C 8D8E8F90 91929394 ................
0070: 95969798 999A9B9C 9D9E9FA0 A1A2A3A4 ................
0080: A5A6A7A8 A9AAABAC ADAEAFB0 B1B2 ..............
1
0000: 33330000 00010050 56A9FB1C 81000DD2 33.....PV.......
0010: 86DD6000 0000003C 3A3F2001 00100004 ..`....<:? .....
0020: 00060000 00000000 0004FF7E 02402001 ...........~.@ .
0030: 00100002 00070000 00018000 AE3A0B3D .............:.=
0040: 047F7F80 81828384 85868788 898A8B8C ................
0050: 8D8E8F90 91929394 95969798 999A9B9C ................
0060: 9D9E9FA0 A1A2A3A4 A5A6A7A8 A9AAABAC ................
0070: ADAEAFB0 B1B2 ......

232.4.7.0 GRE
0000: 01005E04 07000050 56A9DE77 8100CE00 ..^....PV..w....
0010: 08004500 007C0597 0000FE2F EBA8D507 ..E..|...../....
0020: 0707E804 07000000 08004500 00641FFC ..........E..d..
0030: 0000FE01 A3930A02 0702E800 00050800 ................
1 118 0.000000 10.2.7.2 -> 232.0.0.5 ICMP
0000: 01005E00 00050050 56A9FB1C 81000DD2 ..^....PV.......
0010: 08004500 00641FFC 0000FD01 A4930A02 ..E..d..........
0020: 0702E800 00050800 FA97000B 006D0000 .............m..
0030: 000017FD 6B3DABCD ABCDABCD ABCDABCD ....k=..........

For good measure, here is a PIM signaling packet transiting the default MDT. We know it is on the default MDT based on the destination group address 238.255.0.1. The TTL is 1 (0x01) and the protocol is 103/PIM (0x67). This packet in particular came from CSR5 destined to 224.0.0.13 (highlighted). It is probably a PIM Hello, BSR message, or periodic join/prune. We know that this packet transited the Bidir shared tree to reach CSR8.


R8#show monitor capture CAP buffer detailed 14 98 67.960009 213.5.5.5 -> 238.255.0.1 GRE 0000: 01005E7F 00010050 56A9DE77 8100CE00 ..^....PV..w.... 0010: 080045C0 00500AAA 0000FE2F E809D505 ..E..P...../.... 0020: 0505EEFF 00010000 080045C0 003884DA ..........E..8.. 0030: 00000167 79ADD505 0505E000 000D2400 ...gy.........$.
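The captured bytes are enough to identify the PIM message type directly. Below is a minimal Python sketch, assuming the same layout as the other captures (18 bytes of Ethernet plus 802.1Q, then outer IPv4, a 4-byte GRE header with no optional fields, then inner IPv4). The type names follow the standard PIM message type numbering; the function name `decode_inner_pim` is hypothetical.

```python
import ipaddress

# First 64 bytes of frame 14, copied from the EPC output.
FRAME14 = bytes.fromhex(
    "01005E7F 00010050 56A9DE77 8100CE00"
    "080045C0 00500AAA 0000FE2F E809D505"
    "0505EEFF 00010000 080045C0 003884DA"
    "00000167 79ADD505 0505E000 000D2400"
)
L2_LEN = 18   # assumption: Ethernet header plus one 802.1Q tag
GRE_LEN = 4   # assumption: basic GRE header (flags word is 0x0000 in the dump)

PIM_TYPES = {
    0: "Hello", 1: "Register", 2: "Register-Stop", 3: "Join/Prune",
    4: "Bootstrap", 5: "Assert", 6: "Graft", 7: "Graft-Ack",
    8: "Candidate-RP-Advertisement",
}


def decode_inner_pim(frame):
    """Walk outer IPv4 and GRE, then read the inner IPv4 and PIM headers."""
    outer_ihl = (frame[L2_LEN] & 0x0F) * 4
    inner_off = L2_LEN + outer_ihl + GRE_LEN
    inner_ihl = (frame[inner_off] & 0x0F) * 4
    pim_off = inner_off + inner_ihl
    return {
        "ttl": frame[inner_off + 8],
        "proto": frame[inner_off + 9],
        "dst": str(ipaddress.IPv4Address(frame[inner_off + 16:inner_off + 20])),
        "pim_version": frame[pim_off] >> 4,
        "pim_type": PIM_TYPES.get(frame[pim_off] & 0x0F, "unknown"),
    }


print(decode_inner_pim(FRAME14))
```

The first PIM byte in the dump is 0x24, which splits into version 2 and type 4, so this particular packet decodes as a Bootstrap message.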

Next we will test IPv6 SSM as we did earlier. The only difference with this delivery mechanism is that the IPv6 SSM pair (2001:10:4:6::4, FF33::1) is not a candidate for MDT optimization as it does not match the ACL we applied. This means the traffic will always transit the default MDT. First, we verify that CSR7 and CSR8 receive the MLDv2 joins from their customers and that the C(S,G) tree was properly built back to the root, CSR6. The only significant difference between this set of outputs and the IPv4 SSM set is the lack of any ‘y’ or ‘Y’ flags. This is because the data MDT operation is not involved with this C(S,G) at all.
R6#show ipv6 mroute vrf MC FF33::1 2001:10:4:6::4 | begin \(
(2001:10:4:6::4, FF33::1), 10:58:30/00:03:05, flags: sT
  Incoming interface: GigabitEthernet2.546
  RPF nbr: 2001:10:4:6::4
  Immediate Outgoing interface list:
    Tunnel0, Forward, 10:58:30/00:03:05
R7#show ipv6 mroute vrf MC FF33::1 2001:10:4:6::4 | begin \(
(2001:10:4:6::4, FF33::1), 18:44:37/never, flags: sTI
  Incoming interface: Tunnel2
  RPF nbr: ::FFFF:213.6.6.6
  Immediate Outgoing interface list:
    GigabitEthernet2.527, Forward, 18:44:37/never
R8#show ipv6 mroute vrf MC FF33::1 2001:10:4:6::4 | begin \(
(2001:10:4:6::4, FF33::1), 18:44:15/never, flags: sTI
  Incoming interface: Tunnel2
  RPF nbr: ::FFFF:213.6.6.6
  Immediate Outgoing interface list:
    GigabitEthernet2.538, Forward, 18:44:15/never

Since there is no data MDT, and the default MDT has been verified, we can safely assume this traffic will be transported inside the default MDT. This means that XRv4 will always receive traffic for this flow. When CSR4 starts sending traffic, we check the packet counters on the three PEs to ensure it is working. The packet counters are all increasing as traffic flows.
R6#show ipv6 mroute vrf MC FF33::1 2001:10:4:6::4 count | include HW
HW Forwarding: 790/1/118/0, Other: 0/0/0
R7#show ipv6 mroute vrf MC FF33::1 2001:10:4:6::4 count | include HW
HW Forwarding: 797/1/142/1, Other: 0/0/0
R8#show ipv6 mroute vrf MC FF33::1 2001:10:4:6::4 count | include HW
HW Forwarding: 792/1/142/1, Other: 0/0/0

To show the drawbacks of using PIM Bidir with default MDTs, I will instruct CSR3 to stop requesting this flow. CSR8 will send a C(S,G) prune towards CSR6 but it has no effect since CSR7 is still on the “emulated LAN” requesting traffic. When CSR7 sees the C(S,G) prune from CSR8, it immediately issues a C(S,G) join so that CSR6 will cancel its removal of the MDT from its OIL. Since there is no P(S,G) prune mechanism for the default MDT, traffic will still be sent to XRv4 and dropped. This is a prune override operation in action. ! CSR8 IPv6 PIM: [MC] (2001:10:4:6::4,FF33::1) GigabitEthernet2.538 Local state changed from Join to Null IPv6 PIM: [MC] (2001:10:4:6::4,FF33::1) GigabitEthernet2.538 FWD state change from Forward to Prune IPv6 PIM: [MC] FF33::1 Delete Group ! CSR7 IPv6 PIM: [MC] J/P entry: Prune root: 2001:10:4:6::4 group: FF33::1 Sender: ::FFFF:213.8.8.8 flags: S IPv6 PIM: [MC] (2001:10:4:6::4,FF33::1) J/P adding Join on Tunnel2 ! CSR6 IPv6 PIM: [MC] J/P entry: Prune root: 2001:10:4:6::4 group: FF33::1 Sender: ::FFFF:213.8.8.8 flags: S IPv6 PIM: [MC] (2001:10:4:6::4,FF33::1) Tunnel0 Prune scheduled IPv6 PIM: [MC] J/P entry: Join root: 2001:10:4:6::4 group: FF33::1 Sender: ::FFFF:213.7.7.7 flags: S IPv6 PIM: [MC] (2001:10:4:6::4,FF33::1) Tunnel0 Canceling scheduled prune
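The scheduled-prune and cancel sequence in the CSR6 debug can be modeled as a small timer on the upstream router's outgoing interface. The Python sketch below is a simplified illustration only: the class name `MdtOif` is hypothetical, and the 3-second override interval is an assumed value chosen for the example rather than something taken from the lab output.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class MdtOif:
    """Upstream router's view of one outgoing interface on a shared LAN/MDT."""
    override_interval: float = 3.0   # assumed value, in seconds
    forwarding: bool = True
    prune_deadline: Optional[float] = None
    log: List[str] = field(default_factory=list)

    def on_prune(self, now: float, sender: str) -> None:
        # A prune is not acted on immediately; other routers get time to object.
        if self.forwarding and self.prune_deadline is None:
            self.prune_deadline = now + self.override_interval
            self.log.append(f"{now:.1f}s prune from {sender}: prune scheduled")

    def on_join(self, now: float, sender: str) -> None:
        # A join from any other router on the segment overrides the prune.
        if self.prune_deadline is not None:
            self.prune_deadline = None
            self.log.append(f"{now:.1f}s join from {sender}: canceling scheduled prune")
        self.forwarding = True

    def tick(self, now: float) -> None:
        if self.prune_deadline is not None and now >= self.prune_deadline:
            self.forwarding = False
            self.prune_deadline = None
            self.log.append(f"{now:.1f}s timer expired: interface removed from OIL")


# CSR6's Tunnel0: CSR8 prunes, CSR7 overrides with a join one second later.
oif = MdtOif()
oif.on_prune(0.0, "::FFFF:213.8.8.8")
oif.on_join(1.0, "::FFFF:213.7.7.7")
oif.tick(5.0)
print("\n".join(oif.log))
print("still forwarding:", oif.forwarding)
```

Because the tunnel interface stays in the OIL, every PE on the default MDT keeps receiving the flow, including CSR8, which no longer has an interested receiver.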

Let’s ensure that CSR7 is still receiving the traffic while CSR8 is not. CSR8 doesn’t even have the C-MRIB entry anymore, let alone statistical information for the data plane. CSR7 is still receiving traffic, though.
R7#show ipv6 mroute vrf MC FF33::1 2001:10:4:6::4 count | include HW
HW Forwarding: 1237/1/142/1, Other: 0/0/0
R8#show ipv6 mroute vrf MC FF33::1 2001:10:4:6::4 count | include HW
[no output]

To prove packets are still entering CSR8, we will do two things. First, we check the counters on the default MDT group on CSR8 to show they are being received. This method isn’t fool-proof since you are basically trying to guess how many packets are along the default MDT that aren’t signaling, but the two outputs below were sampled 10 seconds apart. The source is sending 1 pps, so seeing ~12 packets in 10 seconds (20754 - 20742 = 12) is probably an indication that the traffic is being received.
R8#show ip mroute 238.255.0.1 count | begin ^Group
Group: 238.255.0.1, Source count: 0, Packets forwarded: 20742, Packets received: 20745
  RP-tree: Forwarding: 20742/1/107/1, Other: 20745/3/0
R8#show ip mroute 238.255.0.1 count | begin ^Group
Group: 238.255.0.1, Source count: 0, Packets forwarded: 20754, Packets received: 20757
  RP-tree: Forwarding: 20754/1/107/1, Other: 20757/3/0

We can also use EPC to show the packets entering and not being sent down to the customer. Below we see two packets spaced exactly one second apart, but there is no raw ICMPv6 being sent to the customer. We can see the beginning of the IPv6 source address as 2001:10:4:6, which we know is the customer behind CSR6.
R8#show monitor capture CAP buffer detailed
1 142 0.413980 213.6.6.6 -> 238.255.0.1 GRE
0000: 01005E7F 00010050 56A9DE77 8100CE00 ..^....PV..w....
0010: 08004500 007C1E5C 0000FD2F D4E9D506 ..E..|.\.../....
0020: 0606EEFF 00010000 86DD6000 0000003C ..........`....<
0030: 3A3F2001 00100004 00060000 00000000 :? .............
2 142 1.413980 213.6.6.6 -> 238.255.0.1 GRE
0000: 01005E7F 00010050 56A9DE77 8100CE00 ..^....PV..w....
0010: 08004500 007C1E5D 0000FD2F D4E8D506 ..E..|.].../....
0020: 0606EEFF 00010000 86DD6000 0000003C ..........`....<
0030: 3A3F2001 00100004 00060000 00000000 :? .............

To confirm that the flow can be repaired, we re-add the MLDv2 join to CSR3 and we now see the ICMPv6 packets flowing to the customer.
R8#show monitor capture CAP buffer detailed
0 142 0.000000 213.6.6.6 -> 238.255.0.1 GRE
0000: 01005E7F 00010050 56A9DE77 8100CE00 ..^....PV..w....
0010: 08004500 007C1F51 0000FD2F D3F4D506 ..E..|.Q.../....
0020: 0606EEFF 00010000 86DD6000 0000003C ..........`....<
0030: 3A3F2001 00100004 00060000 00000000 :? .............
1 118 0.000000 2001:*:0004 -> FF33:*:0001 IPv6-ICMP
0000: 33330000 00010050 56A9FB1C 81000DD2 33.....PV.......
0010: 86DD6000 0000003C 3A3F2001 00100004 ..`....<:? .....
0020: 00060000 00000000 0004FF33 00000000 ...........3....
0030: 00000000 00000000 00018000 96921B72 ...............r

238.255.0.1 GRE
1 118 0.000000 2001:*:0004 -> FF33:*:0001 IPv6-ICMP

Additional Reading – Reference configurations “mvpn-0-bidir”

28.2 Profile 1: Default MDT − MLDP MP2MP − PIM C−mcast Signaling (Basic mLDP)

This profile is the one most people think of when mLDP is discussed. It is a very simple configuration and is basically a one-for-one replacement with the GRE-Rosen method in terms of MDT methodology. The profile specifically calls out MP2MP trees, but we examine P2MP MDT trees also. The basic mLDP configuration is very straightforward, and I will configure support for the MP2MP tree and P2MP trees together. The MP2MP tree is a bidirectional tree that represents an emulated LAN. When used with C-PIM signaling, it forms a full-mesh of all VPN participants, which is determined by the VRF’s VPN-ID. This is not to be confused with the RD, which can (and does) vary between the PEs. An mLDP MP2MP tree is logically comparable to a PIM Bidir RP. Before doing anything, we should verify the routers are mLDP-capable. XRv lies and always says it is mLDP capable even when the process is disabled. Maybe this “capability” is loosely interpreted as the feature being supported, but this is advertised to LDP peers even when mLDP is off.
RP/0/0/CPU0:XRv4#show mpls ldp capabilities
Type    Description                              Owner
------  ---------------------------------------  ------------
0x50b   Typed Wildcard FEC                       LDP
0x3eff  Cisco IOS-XR                             LDP
0x508   MP: Point-to-Multipoint (P2MP)           mLDP
0x509   MP: Multipoint-to-Multipoint (MP2MP)     mLDP
0x703   P2MP PW                                  L2VPN-AToM
RP/0/0/CPU0:XRv4#show mpls mldp database
mLDP is not running (1082230272)

The configuration below enables mLDP in XR. I enable logging as well, but that is not required for mLDP to work. ! XRv4 mpls ldp mldp logging notifications

I have selected XRv4 to be the “root” of the mLDP “shared tree” and instructed the VRF to prefer mLDP over PIM. I also show a snippet for how to enable basic mLDP in XR because it is not on by default, as it is with XE. XR is more complicated because it introduces “core-trees”, which are essentially RPF lookup mechanisms. We specifically need to tell PIM to use the mLDP-based tree for RPF.


! XE basic PE configuration for all PEs
vrf definition MC
 vpn id 213:213
 address-family ipv4
  mdt preference mldp
  mdt default mpls mldp 213.14.14.14
! XRv2 basic PE configuration
vrf MC
 vpn id 213:213
multicast-routing
 address-family ipv4
  interface Loopback0
   enable
  mdt source Loopback0
 vrf MC
  address-family ipv4
   mdt default mldp ipv4 213.14.14.14
route-policy RPL_MLDP
  set core-tree mldp-default
end-policy
router pim
 vrf MC
  address-family ipv4
   rpf topology route-policy RPL_MLDP

As soon as a node is told about a root, the mLDP process adds this to its database. Behind the scenes, it creates an LSPvif for forwarding LSM. It automatically inherits the Loopback0 (LDP RID) configuration as the MDT source (XR needs to be explicitly told, XE does not) and enables PIM on it. An MP2MP default MDT is always identified by the number 0 contained within the opaque ID. This is the “MDT number” and as P2MP trees are spawned, the number increments.
R6#debug mpls mldp all
MLDP-MFI: Enabled MLDP MFI client on Lspvif0; status = ok
MLDP: Reclaimed success lsp vif: Lspvif0 address: 0.0.0.0 application: MDT vrf_id: 1
MLDP: Enabled IPv4 on Lspvif0 unnumbered with Loopback0
MLDP: Enable pim on lsp vif: Lspvif0
MLDP: Add success lsp vif: Lspvif0 address: 0.0.0.0 application: MDT vrf_id: 1
MLDP-MDT: [mdt 213:213 0] wavl insert success
MLDP: LDP root 213.14.14.14 added


Next, the router needs to signal to the root that it is a leaf in the tree. It does this by sending a label mapping message towards the root based on the shortest IGP path. This is not based on the multicast RPF lookups, but uses the actual unicast routing table. We see that the shortest path to XRv4 is via XRv3. We also notice that mLDP allocated a local label of 6005. This is the downstream label that CSR6 advertises to XRv3 to say “when you want to send me traffic inside this specific VPN along the MP2MP tree, use this label”. ID:5 at the end is the LSM path ID, which is another way to reference the mLDP database entry.
! CSR6
MLDP: nhop 213.6.13.13 added
MLDP-NBR: 213.13.13.13:0 mapped to next_hop: 213.6.13.13
MLDP: Root 213.14.14.14 old paths: 0 new paths: 1
MLDP-DB: [mdt 213:213 0] Changing peer from none to 213.13.13.13:0
MLDP-DB: [mdt 213:213 0] Add accepting element nbr: 213.13.13.13:0
MLDP-MFI: For accepting element nbr: 213.13.13.13:0, [mdt 213:213 0] MFI allocated label 6005
MLDP: [mdt 213:213 0] label mappping msg sent to 213.13.13.13:0 success
MLDP-DB: [mdt 213:213 0] path to peer: 213.13.13.13:0 changed None:0.0.0.0 to GigabitEthernet2.563:213.6.13.13
MLDP-MFI: After the MFI call For accepting element nbr: 213.13.13.13:0, Bind local label 6005 to PSM ID: 5, returned: 6005

If we have advertised our downstream local label to the upstream router, surely the upstream router will give us an upstream label so we can reach the root. This only occurs with MP2MP trees. This upstream label is a local label on the upstream router which says “when you want to send traffic through me to reach the root, use this label”. ! CSR6 MLDP-LDP: [mdt 213:213 0] label mapping from: 213.13.13.13:0 label: 93004 root: 213.14.14.14 Opaque_len: 14 sess_hndl: 0xF MLDP-DB: [mdt 213:213 0] Set upstream label from no_label to 93004

We can look at the mLDP database for the MP2MP tree (ID=0) for the VPN in question to verify these label mappings. We also see the LSM ID:5 which corresponds to the debug messages. Notice that the local label is downstream (XRv3 uses this to send traffic away from the root) and the outbound label is upstream (CSR6 uses this label to send traffic towards the root). The output also shows “replication clients” which are essentially downstream clients. In this case, it’s the customer VRF.
R6#show mpls mldp database opaque_type mdt 213:213 0
LSM ID : 5 (RNR LSM ID: 6)   Type: MP2MP   Uptime : 00:08:58
  FEC Root           : 213.14.14.14
  Opaque decoded     : [mdt 213:213 0]
  Opaque length      : 11 bytes
  Opaque value       : 02 000B 0002130000021300000000
  RNR active LSP     : (this entry)
  Upstream client(s) :
    213.13.13.13:0    [Active]
      Expires        : Never         Path Set ID  : 5
      Out Label (U)  : 93004         Interface    : GigabitEthernet2.563*
      Local Label (D): 6005          Next Hop     : 213.6.13.13
  Replication client(s):
    MDT  (VRF MC)
      Uptime         : 00:08:58      Path Set ID  : 6
      Interface      : Lspvif0
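The raw opaque value lines up with the decoded form `[mdt 213:213 0]`. The Python sketch below splits it into a 1-byte type, a 2-byte length, and an 11-byte value made of a 3-byte VPN-ID OUI, a 4-byte VPN-ID index and a 4-byte MDT number. That field layout is an assumption inferred from the output shown here (it matches the 11-byte length and the decoded string), and the function name `decode_mdt_opaque` is hypothetical.

```python
def decode_mdt_opaque(hex_value: str):
    """Split an mLDP MDT opaque value into type, length, VPN-ID and MDT number."""
    raw = bytes.fromhex(hex_value)            # whitespace in the string is ignored
    opaque_type = raw[0]
    length = int.from_bytes(raw[1:3], "big")
    value = raw[3:3 + length]
    if len(value) != 11:
        raise ValueError("expected an 11-byte MDT opaque value")
    oui = int.from_bytes(value[0:3], "big")    # VPN-ID OUI (3 bytes)
    index = int.from_bytes(value[3:7], "big")  # VPN-ID index (4 bytes)
    mdt_number = int.from_bytes(value[7:11], "big")
    # The VPN-ID is written in hexadecimal, as in "vpn id 213:213".
    return opaque_type, length, f"{oui:x}:{index:x}", mdt_number


print(decode_mdt_opaque("02 000B 0002130000021300000000"))
```

The MDT number of 0 in the last four bytes is what marks this entry as the default MP2MP tree.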

XRv3 also generates a log message which shows similar information. The log is incomplete in that it says no local label was assigned, but if we check the database, we clearly see it. The log also shows us the LSM-ID, a long hexadecimal number, which I use to reference the mLDP database entry. XRv3 also has a second downstream client, XRv2. The same process occurs for all clients, and the label mapping process happens all the way until the root is reached.
! XRv3
mpls_ldp[1042]: %ROUTING-MLDP-5-BRANCH_ADD : 0x00001 [mdt 213:213 0] MP2MP 213.14.14.14, Add LDP 213.6.6.6:0 branch remote label 6005, local label no_label
RP/0/0/CPU0:XRv3#show mpls mldp database 0x00001
mLDP database
LSM-ID: 0x00001  Type: MP2MP  Uptime: 01:21:25
  FEC Root           : 213.14.14.14
  Opaque decoded     : [mdt 213:213 0]
  Upstream neighbor(s) :
    213.14.14.14:0 [Active]  Uptime: 00:22:10
      Next Hop         : 213.13.14.14
      Interface        : GigabitEthernet0/0/0/0.534
      Local Label (D)  : 93005  Remote Label (U): 94007
  Downstream client(s):
    LDP 213.6.6.6:0  Uptime: 00:17:55
      Next Hop         : 213.6.13.6
      Interface        : GigabitEthernet0/0/0/0.563
      Remote label (D) : 6005  Local label (U) : 93004
    LDP 213.12.12.12:0  Uptime: 00:22:31
      Next Hop         : 213.12.13.12
      Interface        : GigabitEthernet0/0/0/0.523
      Remote label (D) : 92003  Local label (U) : 93012

A quick way to see if the root is reachable with a properly built tree is to look at the root information directly. The output is similar in XE and XR. Recall that we never configured the root on XRv3; it learned it from the label mapping messages from the downstream clients (PEs).
R6#show mpls mldp root 213.14.14.14
 Root node    : 213.14.14.14
  Metric      : 20
  Distance    : 115
  Interface   : GigabitEthernet2.563 (via unicast RT)
  FEC count   : 1
  Path count  : 1
  Path(s)     : 213.6.13.13  LDP nbr: 213.13.13.13:0  GigabitEthernet2.563
RP/0/0/CPU0:XRv3#show mpls mldp root 213.14.14.14
 Root node    : 213.14.14.14
  Metric      : 10
  Distance    : 115
  FEC count   : 1
  Path count  : 1
  Path(s)     : 213.13.14.14  LDP nbr: 213.14.14.14:0

The root also knows that it is the root using the same method as XRv3. When the metric/distance are 0, there are no LDP neighbors towards the root, and the router knows it is the root. At no time did we explicitly configure XRv4 to be root on XRv4 itself.
RP/0/0/CPU0:XRv4#show mpls mldp root 213.14.14.14
 Root node    : 213.14.14.14 (We are the root)
  Metric      : 0
  Distance    : 0
  FEC count   : 1
  Path count  : 1
  Path(s)     : 213.14.14.14  LDP nbr: none

The PEs are able to communicate over this MP2MP tree which allows them to form PIM adjacencies across their virtual interfaces within the VPN. On XRv3, when label 93005 comes in, the traffic is replicated towards both CSR6 and XRv2 (downstream clients). Label 93005 is the downstream label that XRv3 allocated and advertised towards XRv4.
RP/0/0/CPU0:XRv3#show mpls mldp database opaquetype mdt 213:213 0 | begin Upst
  Upstream neighbor(s) :
    213.14.14.14:0 [Active] Uptime: 00:33:08
      Next Hop : 213.13.14.14
      Interface : GigabitEthernet0/0/0/0.534
      Local Label (D) : 93005 Remote Label (U): 94007
RP/0/0/CPU0:XRv3#show mpls forwarding labels 93005
Local  Outgoing    Prefix             Outgoing      Next Hop      Bytes
Label  Label       or ID              Interface                   Switched
------ ----------- ------------------ ------------- ------------- ------------
93005  6005        MLDP: 0x00001      Gi0/0/0/0.563 213.6.13.6    0
       92003       MLDP: 0x00001      Gi0/0/0/0.523 213.12.13.12  0


CSR6 and XRv2, upon receiving this, associate it with the proper VPN. That is why PHP cannot be used for LSM, since the label determines in which VPN the traffic belongs. XR takes a few show commands to verify that the traffic is part of a specific MDT.
R6#show mpls forwarding-table labels 6005
Local      Outgoing   Prefix              Bytes Label   Outgoing     Next Hop
Label      Label      or Tunnel Id        Switched      interface
6005  [T]  No Label   [mdt 213:213 0][V]  \
                                          0             aggregate/MC
RP/0/0/CPU0:XRv2#show mpls forwarding labels 92003
Local  Outgoing    Prefix             Outgoing      Next Hop      Bytes
Label  Label       or ID              Interface                   Switched
------ ----------- ------------------ ------------- ------------- ------------
92003  Unlabelled  MLDP: 0x00003

RP/0/0/CPU0:XRv2#show mpls mldp database 0x00003 brief LSM ID Type Root Up Down Decoded Opaque Value 0x00003 MP2MP 213.14.14.14 1 1 [mdt 213:213 0]

We can verify the PIM adjacencies over the tunnel also. This happens within the VRF since the global process is PIM-free.
RP/0/0/CPU0:XRv2#show pim vrf MC neighbor | begin ^Neighbor
Neighbor Address   Interface        Uptime    Expires   DR pri   Flags
10.11.12.11        Gig0/0/0/0.512   02:00:40  00:01:33  1        B P
10.11.12.12*       Gig0/0/0/0.512   02:27:50  00:01:33  1 (DR)   B P E
213.1.1.1          LmdtMC           00:36:37  00:01:40  1        P
213.5.5.5          LmdtMC           00:36:37  00:01:44  1        P
213.6.6.6          LmdtMC           01:35:35  00:01:43  1        P
213.7.7.7          LmdtMC           00:36:37  00:01:41  1        P
213.8.8.8          LmdtMC           00:36:37  00:01:15  1        P
213.12.12.12*      LmdtMC           01:35:39  00:01:34  1 (DR)   P
R6#show ip pim vrf MC neighbor | begin ^Neighbor
Neighbor        Interface   Uptime/Expires      Ver   DR
Address                                               Prio/Mode
213.12.12.12    Lspvif0     00:34:34/00:01:27   v2    1 / DR P G
213.5.5.5       Lspvif0     00:34:35/00:01:36   v2    1 / S P G
213.7.7.7       Lspvif0     00:34:35/00:01:33   v2    1 / S P G
213.1.1.1       Lspvif0     00:34:35/00:01:34   v2    1 / S P G
213.8.8.8       Lspvif0     00:34:35/00:01:37   v2    1 / S P G

With the PIM neighbors up, the PEs should also learn that CSR10 is an RP. If you forget to change the core-tree on XR PEs, they will drop the BSR messages due to failed RPF. The “info source” is different but this is just cosmetic between XE and XR. RP/0/0/CPU0:XRv2#show pim vrf MC rp mapping


PIM Group-to-RP Mappings Group(s) 224.0.0.0/4 RP 10.5.10.10 (?), v2 Info source: 213.5.5.5 (?), elected via bsr, priority 0, holdtime 150 Uptime: 00:02:35, expires: 00:01:56 R6#show ip pim vrf MC rp mapping PIM Group-to-RP Mappings Group(s) 224.0.0.0/4 RP 10.5.10.10 (?), v2 Info source: 10.5.10.10 (?), via bootstrap, priority 0, holdtime 150 Uptime: 00:40:26, expires: 00:01:31

At this point, I configure CSR4, CSR3, CSR9, and XRv1 to be receivers for 225.0.0.1. This is an ASM group, so each PE should issue a C(*,G) join towards the C-RP, CSR10. I do not show the IGMP configuration because it is simple. Below is a quick check of the PEs to ensure they see the C-RP via the LSPvif0 interface with proper RPF information.
R1#show ip mroute vrf MC sparse
(*, 225.0.0.1), 01:37:11/00:02:48, RP 10.5.10.10, flags: SJC
  Incoming interface: Lspvif0, RPF nbr 213.5.5.5
  Outgoing interface list:
    GigabitEthernet2.519, Forward/Sparse, 01:21:09/00:02:48
R6#show ip mroute vrf MC sparse
(*, 225.0.0.1), 01:37:29/00:02:46, RP 10.5.10.10, flags: SJC
  Incoming interface: Lspvif0, RPF nbr 213.5.5.5
  Outgoing interface list:
    GigabitEthernet2.546, Forward/Sparse, 01:21:13/00:02:46
R8#show ip mroute vrf MC sparse
(*, 225.0.0.1), 01:37:23/00:02:17, RP 10.5.10.10, flags: SJC
  Incoming interface: Lspvif0, RPF nbr 213.5.5.5
  Outgoing interface list:
    GigabitEthernet2.538, Forward/Sparse, 01:37:23/00:02:17
RP/0/0/CPU0:XRv2#show pim vrf MC topology 225.0.0.1
(*,225.0.0.1) SM Up: 00:00:35 RP: 10.5.10.10
JP: Join(00:00:14) RPF: LmdtMC,213.5.5.5 Flags: LH
  GigabitEthernet0/0/0/0.512 00:00:35 fwd LI LH

Next, we check the ingress PE and C-RP, CSR5 and CSR10, to ensure they see the C(*,G) join. CSR5 is the only router that will have the LSPvif0 in the OIL since the RP is behind it. CSR10 has no incoming interface since it is the root of the shared tree, and is running PIM in the global table. R5#show ip mroute vrf MC sparse (*, 225.0.0.1), 01:45:03/00:03:11, RP 10.5.10.10, flags: S


Incoming interface: GigabitEthernet2.550, RPF nbr 10.5.10.10 Outgoing interface list: Lspvif0, Forward/Sparse, 01:45:03/00:03:11 R10#show ip mroute sparse (*, 225.0.0.1), 01:45:53/00:02:54, RP 10.5.10.10, flags: S Incoming interface: Null, RPF nbr 0.0.0.0 Outgoing interface list: GigabitEthernet2.550, Forward/Sparse, 01:45:53/00:02:54

Assuming CSR2 is the source and is sending traffic, we expect CSR7 to send a register message to the RP. The RP will forward the traffic down the shared tree until it can build the C(S,G) tree back to CSR7. Once it does that, it will prune itself from the C(S,G) tree since it is not in the transit path for any of those flows. All of the PEs will have C(S,G) state for the source. CSR10#debug ip pim PIM(0): Received v2 Register on GigabitEthernet2.550 from 10.2.7.7 for 10.2.7.2, group 225.0.0.1 PIM(0): Insert (10.2.7.2,225.0.0.1) join in nbr 10.5.10.5's queue PIM(0): Send v2 join/prune to 10.5.10.5 (GigabitEthernet2.550) PIM(0): Prune-list: (10.2.7.2/32, 225.0.0.1) RPT-bit set PIM(0): Received v2 Register on GigabitEthernet2.550 from 10.2.7.7 for 10.2.7.2, group 225.0.0.1 PIM(0): Send v2 Register-Stop to 10.2.7.7 for 10.2.7.2, group 225.0.0.1 PIM(0): Insert (10.2.7.2,225.0.0.1) prune in nbr 10.5.10.5's queue PIM(0): Send v2 join/prune to 10.5.10.5 (GigabitEthernet2.550)

We will verify that the PEs are actually forwarding traffic towards the customers. XRv does not appear to work in the data plane, though. R1#show ip mroute vrf MC 225.0.0.1 count | begin Group Group: 225.0.0.1, Source count: 1, Packets forwarded: 26, Packets received: 26 RP-tree: Forwarding: 3/0/122/0, Other: 3/0/0 Source: 10.2.7.2/32, Forwarding: 23/0/122/0, Other: 23/0/0 R6#show ip mroute vrf MC 225.0.0.1 count | begin Group Group: 225.0.0.1, Source count: 1, Packets forwarded: 101, Packets received: 101 RP-tree: Forwarding: 2/0/122/0, Other: 2/0/0 Source: 10.2.7.2/32, Forwarding: 99/0/122/0, Other: 99/0/0 R8#show ip mroute vrf MC 225.0.0.1 count | begin Group Group: 225.0.0.1, Source count: 1, Packets forwarded: 36, Packets received: 36 RP-tree: Forwarding: 2/0/122/0, Other: 2/0/0 Source: 10.2.7.2/32, Forwarding: 34/1/122/0, Other: 34/0/0


RP/0/0/CPU0:XRv2#show mfib vrf MC interface GigabitEthernet0/0/0/0.512 Interface : GigabitEthernet0/0/0/0.512 (Enabled) SW Mcast pkts in : 0, SW Mcast pkts out : 0 TTL Threshold : 0 Ref Count : 4

XRv LSM LFIB counters don’t appear to work, but the debugs do. On XRv3, we can see traffic arriving from CSR7 and being replicated towards XRv2 and CSR6. XRv3 spits out a nasty message that the P2MP traffic is being dropped, but it actually isn’t. We saw the counters increment on CSR6 as it forwarded traffic towards the customer. The LFIB has this properly programmed but the counters don’t increase. Notice that this MPLS packet came in from XRv4, not directly from CSR7. This is because we are using the MP2MP tree, so transit traffic goes through the root. There are ways to optimize this, which are examined later.
RP/0/0/CPU0:XRv3#debug mpls packet
netio[309]: mpls_switch: GigabitEthernet0_0_0_0.534, mpls eos 1, ttl 253, len 122, inlabel 93005, tbl_id=0x0, vrf_id=0x0 in=0xe00
netio[309]: mpls_p2mp_switch: dir: ingress
netio[309]: mpls_rewrite
netio[309]: mpls_rewrite: tos 0 eos 1 ttl 252, out_label 92003, #labels 1, RA - 0, out_intf GigabitEthernet0_0_0_0.523
netio[309]: mpls_rewrite
netio[309]: mpls_rewrite: tos 0 eos 1 ttl 252, out_label 6005, #labels 1, RA - 0, out_intf GigabitEthernet0_0_0_0.563
netio[309]: [mpls_p2mp_switch_egress:819] Pkt Drop: P2MP MPLS2IP disposition: egress - LKUP flag not set, input ifh:0x300
RP/0/0/CPU0:XRv3#show mpls forwarding labels 93005
Local  Outgoing    Prefix             Outgoing      Next Hop      Bytes
Label  Label       or ID              Interface                   Switched
------ ----------- ------------------ ------------- ------------- ------------
93005  6005        MLDP: 0x00001      Gi0/0/0/0.563 213.6.13.6    0
       92003       MLDP: 0x00001      Gi0/0/0/0.523 213.12.13.12  0

Additionally, I confirm it with EPC on CSR6. Immediately following the 0x8847 ethertype, label 0x1775 (6005) is seen entering the router, which is the correct local downstream label allocated by CSR6 towards XRv3. The packet has EXP 0 and is the bottom-most label. The IP packet has source 10.2.7.2 (green) and destination 225.0.0.1 (pink). R6#show monitor capture CAP buffer detailed 4 122 2.286987 00:50:56:A9:DB:37 -> 00:50:56:A9:DE:0D MPLS unicast 0000: 005056A9 DE0D0050 56A9DB37 81000DEB .PV....PV..7.... 0010: 88470177 51FC4500 0064012B 0000FE01 .G.wQ.E..d.+.... 0020: C9680A02 0702E100 00010800 1E7A0018 .h...........z.. 0030: 00020000 00004958 165EABCD ABCDABCD ......IX.^......
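The same decoding can be done programmatically. The following is a minimal Python sketch, not part of the lab itself, that parses the first 42 bytes of the hex dump above. It assumes exactly one 802.1Q tag and a single MPLS label, as seen in this capture; the helper name decode is my own.

```python
import ipaddress
import struct

# First 42 bytes of the captured frame, transcribed from the hex dump above.
FRAME = bytes.fromhex(
    "005056A9DE0D005056A9DB3781000DEB"  # dst MAC, src MAC, 802.1Q tag
    "8847017751FC45000064012B0000FE01"  # ethertype, MPLS entry, IPv4 header start
    "C9680A020702E1000001"              # IPv4 checksum, source, destination
)


def decode(frame: bytes) -> dict:
    """Decode the MPLS label stack entry and the IPv4 addresses.

    Offsets assume one 802.1Q tag and one MPLS label (an assumption that
    holds for this capture only).
    """
    tpid, _tci, ethertype = struct.unpack("!HHH", frame[12:18])
    if tpid != 0x8100 or ethertype != 0x8847:
        raise ValueError("expected an 802.1Q-tagged MPLS unicast frame")
    (entry,) = struct.unpack("!I", frame[18:22])
    ip_header = frame[22:42]
    return {
        "label": entry >> 12,                        # top 20 bits
        "exp": (entry >> 9) & 0x7,                   # 3 bits
        "bottom_of_stack": bool((entry >> 8) & 0x1),  # S bit
        "mpls_ttl": entry & 0xFF,                    # low 8 bits
        "src": str(ipaddress.IPv4Address(ip_header[12:16])),
        "dst": str(ipaddress.IPv4Address(ip_header[16:20])),
    }


fields = decode(FRAME)
print(fields)
```

The decoded MPLS TTL of 252 also agrees with the "ttl 252" value in the XRv3 rewrite debug shown earlier.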


Data MDTs are optional and are not strictly part of profile 1. They are used to optimize bandwidth by only sending traffic to PEs that want it. The MP2MP tree is bidirectional and when traffic is sent onto that tree, the core routers consult their LFIB and deliver traffic to each node. PEs without C-multicast clients interested in that C-group will simply drop the traffic. The benefit of only using MP2MP trees is that it minimizes state since there is one delivery tree. It is best used for low-volume/inclusive traffic, whereas data MDTs are more appropriate for high-volume/selective traffic. In this example, only CSR4 and CSR3 request a high-volume SSM flow using C-group 232.0.0.5. This is the only high-volume flow in the network; all other SSM and ASM flows should use the MP2MP tree. We will instruct all PE routers to allow a maximum of 10 data MDTs using only this C-group. I use SSM for variation, not out of necessity.
! XE PE configuration
ip access-list extended ACL_DATA_MDT
 permit ip any host 232.0.0.5
vrf definition MC
 address-family ipv4
  mdt data mpls mldp 10 list ACL_DATA_MDT immediate-switch
! XRv2 PE configuration
ipv4 access-list ACL_DATA_MDT
 10 permit ipv4 any host 232.0.0.5
multicast-routing
 vrf MC
  address-family ipv4
   mdt data mldp 10 ACL_DATA_MDT immediate-switch

This will spawn a new MDT in the core of the network, with a non-zero identifier. Since PIM is still used for C-multicast signaling, the PIM TLV process reviewed in profile 0 is used again. Looking at CSR6’s MRIB topology, we can quickly figure out the new MDT number and look into the mLDP database. Notice that the RPF interface for the C(S,G) join is the same, but the RPF neighbor is CSR7 since this is an SSM join. The MDT number is 1. The Y/y flags mean joined/sending to MDT data groups, respectively. R6#show ip mroute vrf MC ssm (10.2.7.2, 232.0.0.5), 02:10:05/00:02:37, flags: sTIY Incoming interface: Lspvif0, RPF nbr 213.7.7.7, MDT: [1, 213.7.7.7]/ 00:02:54 Outgoing interface list: GigabitEthernet2.546, Forward/Sparse, 02:10:05/00:02:37

CSR6 indicates that traffic sent downstream from the root (CSR7) towards the customer should use label 6000. This was advertised to XRv3. Notice there is no outgoing/upstream label. Data MDTs are not only P2MP, they are unidirectional. A downstream LSR can never send data towards the root of the tree.


R6#show mpls mldp database opaque_type mdt 213:213 1 LSM ID : 7 Type: P2MP Uptime : 01:33:18 FEC Root : 213.7.7.7 Opaque decoded : [mdt 213:213 1] Opaque length : 11 bytes Opaque value : 02 000B 0002130000021300000001 Upstream client(s) : 213.13.13.13:0 [Active] Expires : Never Path Set ID : 7 Out Label (U) : None Interface : GigabitEthernet2.563* Local Label (D): 6000 Next Hop : 213.6.13.13 Replication client(s): MDT (VRF MC) Uptime : 01:33:18 Path Set ID : None Interface : Lspvif0
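The raw opaque value can be decoded by hand to confirm the "[mdt 213:213 1]" interpretation. The Python sketch below is illustrative only. It assumes the type 2 value is laid out as a 3-byte VPN OUI, a 4-byte VPN index, and a 4-byte MDT number, which is consistent with the 11-byte length shown in the output; the function name is hypothetical.

```python
import struct

# Opaque value as printed by CSR6: type (1 byte), length (2 bytes), value.
OPAQUE = bytes.fromhex("02" "000B" "0002130000021300000001")


def decode_mdt_opaque(tlv: bytes) -> dict:
    """Decode an mLDP MDT opaque TLV.

    Assumed value layout (not taken from the text): 3-byte VPN OUI,
    4-byte VPN index, 4-byte MDT number.
    """
    opaque_type, length = struct.unpack("!BH", tlv[:3])
    value = tlv[3:3 + length]
    if opaque_type != 2 or length != 11 or len(value) != length:
        raise ValueError("not an 11-byte type 2 opaque value")
    oui = int.from_bytes(value[0:3], "big")
    index = int.from_bytes(value[3:7], "big")
    mdt_number = int.from_bytes(value[7:11], "big")
    # VPN IDs are written in hexadecimal, so 0x000213 prints as "213".
    return {
        "vpn_id": f"{oui:x}:{index:x}",
        "mdt_number": mdt_number,
        "length": length,
    }


decoded = decode_mdt_opaque(OPAQUE)
print(decoded)
```

MDT number 0 would identify the default MP2MP tree, while the non-zero value here identifies the first data MDT.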

On XRv3, we will look at its P2MP trees, since we know there is only one. The output doesn’t even show an upstream label field, let alone a value. The remote label from CSR6 is 6000, which is what XRv3 will use to deliver LSM towards CSR6 (downstream away from the root). The local label 93002 is what XRv3 advertised to CSR7 to receive this downstream traffic. XRv3 doesn’t need to replicate anything because there is only one client, CSR6. If there were more clients, their label mapping messages would intersect the existing P2MP tree somewhere and be grafted (like tree branches). As expected, the LFIB does not list XRv2 for this MDT.
RP/0/0/CPU0:XRv3#show mpls mldp database p2mp
mLDP database
LSM-ID: 0x00002 Type: P2MP Uptime: 01:35:51
  FEC Root : 213.7.7.7
  Opaque decoded : [mdt 213:213 1]
  Upstream neighbor(s) :
    213.7.7.7:0 [Active] Uptime: 01:35:51
      Local Label (D) : 93002
  Downstream client(s):
    LDP 213.6.6.6:0 Uptime: 01:35:51
      Next Hop : 213.6.13.6
      Interface : GigabitEthernet0/0/0/0.563
      Remote label (D) : 6000
RP/0/0/CPU0:XRv3#show mpls forwarding labels 93002
Local  Outgoing    Prefix             Outgoing      Next Hop      Bytes
Label  Label       or ID              Interface                   Switched
------ ----------- ------------------ ------------- ------------- ------------
93002  6000        MLDP: 0x00002      Gi0/0/0/0.563 213.6.13.6    0

We will use a new command on CSR7 as a quick verification. Checking the mLDP bindings shows the downstream labels used to deliver traffic. CSR7 has to replicate it towards XRv3 and XRv4. Notice label 93002 in use for traffic sent towards XRv3 in this VPN, as it matches the verification performed above.

We also check the mLDP database summary which shows us all the roots and their VPNs, along with client counts. There can be multiple roots in the network, even for the same VPN.
R7#show mpls mldp bindings opaque_type mdt 213:213 1
System ID: 3
Type: P2MP, Root Node: 213.7.7.7, Opaque Len: 14
Opaque value: [mdt 213:213 1]
lsr: 213.14.14.14:0, remote binding[D]: 94018
lsr: 213.13.13.13:0, remote binding[D]: 93002
R7#show mpls mldp database summary
LSM ID   Type    Root            Decoded Opaque Value   Client Cnt.
3        P2MP    213.7.7.7       [mdt 213:213 1]        3
1        MP2MP   213.14.14.14    [mdt 213:213 0]        1

We will send traffic from CSR2 again, this time to the SSM group. We expect CSR6 and CSR8 to see packets being forwarded to their customers. R6#show ip mroute vrf MC 232.0.0.5 count | begin Group Group: 232.0.0.5, Source count: 1, Packets forwarded: 48, Packets received: 48 Source: 10.2.7.2/32, Forwarding: 48/1/122/0, Other: 48/0/0 R8#show ip mroute vrf MC 232.0.0.5 count | begin Group Group: 232.0.0.5, Source count: 1, Packets forwarded: 60, Packets received: 60 Source: 10.2.7.2/32, Forwarding: 60/1/122/0, Other: 60/0/0

Earlier, I mentioned there can be multiple roots. Specifically, we can manually configure multiple roots for the MP2MP tree also. Both of them use the MDT ID of 0. On CSR6 and CSR7, I temporarily configure CSR5 as an MP2MP root in addition to XRv4. Examining CSR7, we can prove this.
! CSR6 and CSR7
vrf definition MC
 address-family ipv4
  mdt default mpls mldp 213.5.5.5
 address-family ipv6
  mdt default mpls mldp 213.5.5.5
R7#show mpls mldp database summary
LSM ID   Type    Root            Decoded Opaque Value   Client Cnt.
3        P2MP    213.7.7.7       [mdt 213:213 1]        3
4        MP2MP   213.5.5.5       [mdt 213:213 0]        1
1        MP2MP   213.14.14.14    [mdt 213:213 0]        1


CSR7 selects XRv4 as the root rather than CSR5. This is because, when multiple roots are configured, the default behavior is to select the root with the lowest cost. CSR5 is a hot standby, since the label mapping exchange has already occurred. This secondary root is listed as a “candidate”.
R7#show mpls mldp database opaque_type mdt 213:213 0 | include RNR|FEC_Root
LSM ID : 4 (RNR LSM ID: 2) Type: MP2MP Uptime : 00:01:51
  FEC Root : 213.5.5.5
  RNR active LSP : 1 (root: 213.14.14.14)
LSM ID : 1 (RNR LSM ID: 2) Type: MP2MP Uptime : 03:11:48
  FEC Root : 213.14.14.14
  RNR active LSP : (this entry)
  Candidate RNR ID(s): 4
R7#show mpls mldp root 213.14.14.14 | include Metric
  Metric : 10
R7#show mpls mldp root 213.5.5.5 | include Metric
  Metric : 20

In the case of a tie, the highest IP address wins. In this case, XRv4 remains the root. The LSM-ID 8 output shows that “A” is the active LSP. This hexadecimal number 0xA represents the second database entry that maps to 213.14.14.14 as the root. R6#show mpls mldp database opaque_type mdt 213:213 0 | include RNR|FEC_Root LSM ID : 8 (RNR LSM ID: 6) Type: MP2MP Uptime : 00:15:40 FEC Root : 213.5.5.5 RNR active LSP : A (root: 213.14.14.14) LSM ID : A (RNR LSM ID: 6) Type: MP2MP Uptime : 00:01:45 FEC Root : 213.14.14.14 RNR active LSP : (this entry) Candidate RNR ID(s): 8 R6#show mpls mldp root 213.14.14.14 | include Metric Metric : 20 R6#show mpls mldp root 213.5.5.5 | include Metric Metric : 20
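The selection rule described here (lowest metric first, highest IP address as the tie-breaker) can be expressed in a few lines. This Python sketch is only a model of the behavior observed in these outputs, not the actual implementation; select_root is a hypothetical helper and the metrics are copied from the show commands.

```python
import ipaddress
from typing import Mapping


def select_root(metrics: Mapping[str, int]) -> str:
    """Pick the active MP2MP root from {root address: unicast metric}.

    Lowest metric wins; on a tie, the numerically highest IP address wins.
    """
    return min(
        metrics,
        key=lambda root: (metrics[root], -int(ipaddress.IPv4Address(root))),
    )


# CSR7: XRv4 is closer, so it is selected on metric alone.
csr7_root = select_root({"213.14.14.14": 10, "213.5.5.5": 20})
# CSR6: equal metrics, so the higher address (XRv4) is selected.
csr6_root = select_root({"213.14.14.14": 20, "213.5.5.5": 20})
print(csr7_root, csr6_root)
```

Any change in unicast metric can therefore move a node to a different active root, which matters whenever the configured root lists are not identical everywhere.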

To make things complicated, I increase the IS-IS cost on XRv3’s link to XRv4 from 10 to 15. This makes CSR6 choose CSR5 as its root, while CSR7 still selects XRv4. In fact, every other router selects XRv4, since no one else is configured to use CSR5. At a glance, it appears that CSR6 will not have reachability to any other nodes since they cannot agree on the MP2MP root.
R6#show mpls mldp root 213.14.14.14 | include Metric
  Metric : 25
R6#show mpls mldp root 213.5.5.5 | include Metric
  Metric : 20
R6#show mpls mldp database opaque_type mdt 213:213 0 | include RNR|FEC_Root
LSM ID : 8 (RNR LSM ID: 6) Type: MP2MP Uptime : 00:17:30
  FEC Root : 213.5.5.5
  RNR active LSP : (this entry)
  Candidate RNR ID(s): A
LSM ID : A (RNR LSM ID: 6) Type: MP2MP Uptime : 00:03:36
  FEC Root : 213.14.14.14
  RNR active LSP : 8 (root: 213.5.5.5)

From CSR6’s perspective, nothing looks wrong as it sees all its PIM neighbors. Recall that PIM neighbors are unidirectional, and the reception of a PIM hello is sufficient to form a neighborship. This implies that every other router is able to reach CSR6.
R6#show ip pim vrf MC neighbor | begin ^Neighbor
Neighbor        Interface   Uptime/Expires      Ver   DR
Address                                               Prio/Mode
213.12.12.12    Lspvif0     04:50:19/00:01:32   v2    1 / DR P G
213.5.5.5       Lspvif0     04:50:20/00:01:42   v2    1 / S P G
213.7.7.7       Lspvif0     04:50:20/00:01:43   v2    1 / S P G
213.1.1.1       Lspvif0     04:50:20/00:01:19   v2    1 / S P G
213.8.8.8       Lspvif0     04:50:20/00:01:18   v2    1 / S P G

The issue is that CSR6 is not able to reach all the other PEs. Specifically, it cannot reach XRv2, CSR1, or CSR8. Tracing the MP2MP tree from CSR6, CSR6 is using label 93007 to send traffic upstream towards CSR5 via XRv3. I use the ID number of 8, seen earlier, to reference this MP2MP tree. R6#show mpls mldp database id 8 | section Upstream Upstream client(s) : 213.13.13.13:0 [Active] Expires : Never Path Set ID : 8 Out Label (U) : 93007 Interface : GigabitEthernet2.563* Local Label (D): 6002 Next Hop : 213.6.13.13

XRv3 forwards this towards the root, which is CSR5, and also down to CSR7, which is part of the same MP2MP tree. XRv3 does not forward it to XRv2 since that PE never sent a label mapping for the MP2MP tree using CSR5 as the root. Thus, XRv2 cannot see PIM hellos from CSR6.
RP/0/0/CPU0:XRv3#show mpls forwarding labels 93007
Local  Outgoing    Prefix             Outgoing      Next Hop      Bytes
Label  Label       or ID              Interface                   Switched
------ ----------- ------------------ ------------- ------------- ------------
93007  5001        MLDP: 0x00003      Gi0/0/0/0.553 213.5.13.5    0
       7006        MLDP: 0x00003      Gi0/0/0/0.573 213.7.13.7    0

Knowing that CSR7 is a leaf in the tree, we normally expect CSR5 (the root) to forward the packet onward towards the other side of the network. However, CSR1 and CSR8 did not issue label mappings for this MP2MP tree, and thus no “branches” exist from CSR5. The only downstream client is XRv3, with inbound traffic using label 5001 as expected. This means CSR1 and CSR8 will never see traffic from CSR6, but CSR7 will. CSR5 also will not see traffic from CSR6; even though it is the root of this tree, it was never configured to use itself as a root. As a result, in its PE role, it is not aware of the traffic coming from CSR6 within the MVPN.
R5#show mpls mldp database summary
LSM ID   Type    Root            Decoded Opaque Value   Client Cnt.
3        MP2MP   213.5.5.5       [mdt 213:213 0]        1
1        MP2MP   213.14.14.14    [mdt 213:213 0]        1
R5#show mpls mldp database id 3 | section Replication
  Replication client(s):
    213.13.13.13:0
      Uptime : 03:00:24          Path Set ID : 4
      Out label (D) : 93010      Interface : GigabitEthernet2.553*
      Local label (U): 5001      Next Hop : 213.5.13.13

We verify our findings by checking PIM neighbors on all the PEs. As expected, only CSR7 has a PIM neighborship with CSR6. More precisely, this reflects reachability in the opposite direction: CSR6 can only reach CSR7, while everyone can reach CSR6. A node only sends traffic upstream to one root, but is capable of receiving traffic from all configured roots. That is why using multiple roots works, so long as everyone has the same list of roots. The LSPs are signaled to each root at the time of configuration to enable this failover behavior.
RP/0/0/CPU0:XRv2#show pim vrf MC neighbor | include 213.6.6.6
[no output]
R5#show ip pim vrf MC neighbor | include 213.6.6.6
[no output]
R1#show ip pim vrf MC neighbor | include 213.6.6.6
[no output]
R8#show ip pim vrf MC neighbor | include 213.6.6.6
[no output]
R7#show ip pim vrf MC neighbor | include 213.6.6.6
213.6.6.6       Lspvif0     05:05:01/00:01:24   v2    1 / S P G
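The asymmetric reachability can be reasoned about with a simple model: a sender transmits only on the tree of its active root, and a receiver hears it only if it has joined that same tree, which it does for every root it is configured with. The Python sketch below encodes the root configuration used in this test; the dictionaries and helper names are my own illustration, not router output.

```python
# Roots each PE is configured with (and therefore signals an LSP towards).
CONFIGURED = {
    "CSR1": {"XRv4"},
    "CSR5": {"XRv4"},
    "CSR8": {"XRv4"},
    "XRv2": {"XRv4"},
    "CSR6": {"XRv4", "CSR5"},
    "CSR7": {"XRv4", "CSR5"},
}

# Root each PE currently sends upstream traffic towards.
ACTIVE = {
    "CSR1": "XRv4",
    "CSR5": "XRv4",
    "CSR8": "XRv4",
    "XRv2": "XRv4",
    "CSR6": "CSR5",  # changed after the IS-IS cost increase
    "CSR7": "XRv4",
}


def receivers_of(sender: str) -> list:
    """PEs that hear the sender: those that joined the sender's active tree."""
    return sorted(
        pe for pe in CONFIGURED
        if pe != sender and ACTIVE[sender] in CONFIGURED[pe]
    )


def heard_by(receiver: str) -> list:
    """PEs whose traffic (for example, PIM hellos) reaches the receiver."""
    return sorted(pe for pe in CONFIGURED if receiver in receivers_of(pe))


print(receivers_of("CSR6"))  # who sees CSR6's hellos
print(heard_by("CSR6"))      # whose hellos CSR6 sees
```

Under this model, giving every PE the same configured root set makes both functions return every other PE regardless of which root each node happens to select.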

To fix it, we will configure every node to agree on the list of roots. For simplicity, I will remove the CSR5 root from CSR6 and CSR7, as well as revert the IS-IS cost on XRv3. Multiple mLDP root testing is not the focus of this lab. Next, we will retest this MVPN profile with IPv6. We will use XRv4 as root on all devices and use Embedded RP for variety. CSR2 will be the Embedded RP for all IPv6 ASM traffic; for this test, we will use the ASM group FF7E:240:2001:10:2:7::1 and SSM group FF33::1. The SSM group will use the MP2MP default tree and the ASM group will use the P2MP data trees (opposite of what we tested earlier). The VPN ID does not change but it is shown again because without it, mLDP does not work at all. It is critical to identify a VPN-ID under the VRF. Because the MP2MP tree carries the FEC information of the VPN ID and root address, there is no differentiation between IPv4 and IPv6, so the roots must match between the two. If we wanted to use multiple roots, we would have to do so consistently between both AFs.
! XE PE configuration on all PEs
ipv6 access-list ACL_DATA_MDT_IPV6
 permit ipv6 any FF7E:240:2001:10:2:7::/96
vrf definition MC
 vpn id 213:213
 address-family ipv6
  mdt preference mldp
  mdt default mpls mldp 213.14.14.14
  mdt data mpls mldp 10 list ACL_DATA_MDT_IPV6 immediate-switch
! XRv2 PE configuration
ipv6 access-list ACL_DATA_MDT_IPV6
 10 permit ipv6 any ff7e:240:2001:10:2:7::/96
multicast-routing
 vrf MC
  address-family ipv6
   mdt default mldp ipv4 213.14.14.14
   mdt data mldp 10 ACL_DATA_MDT_IPV6 immediate-switch

We should verify that our IPv6 PIM neighbors have formed. We see the “emulated LAN” behavior in play; even though the unicast RTs did not create this full-mesh of L3VPN reachability, the VPN ID is the same on all of the nodes.
RP/0/0/CPU0:XRv2#show pim vrf MC ipv6 neighbor | begin Lmdt
LmdtMC
Neighbor Address         Uptime    Expires   DR pri   Flags
::ffff:213.1.1.1         00:03:58  00:01:43  1
::ffff:213.5.5.5         00:03:58  00:01:43  1
::ffff:213.6.6.6         00:03:57  00:01:44  1
::ffff:213.7.7.7         00:03:58  00:01:43  1
::ffff:213.8.8.8         00:03:58  00:01:42  1
::ffff:213.12.12.12*     00:04:01  00:01:32  1 (DR)   P
R5#show ipv6 pim vrf MC neighbor | begin ^Neighbor
Neighbor Address         Interface   Uptime    Expires   Mode     DR pri
::FFFF:213.1.1.1         Lspvif0     00:06:13  00:01:41  B G      1
::FFFF:213.6.6.6         Lspvif0     00:06:17  00:01:42  B G      1
::FFFF:213.7.7.7         Lspvif0     00:06:17  00:01:41  B G      1
::FFFF:213.8.8.8         Lspvif0     00:06:18  00:01:39  B G      1
::FFFF:213.12.12.12      Lspvif0     00:04:31  00:01:30  B G DR   1
R1#show ipv6 pim vrf MC neighbor | begin ^Neighbor
Neighbor Address         Interface   Uptime    Expires   Mode     DR pri
::FFFF:213.5.5.5         Lspvif0     00:06:47  00:01:36  B G      1
::FFFF:213.6.6.6         Lspvif0     00:06:47  00:01:37  B G      1
::FFFF:213.7.7.7         Lspvif0     00:06:47  00:01:37  B G      1
::FFFF:213.8.8.8         Lspvif0     00:06:46  00:01:35  B G      1
::FFFF:213.12.12.12      Lspvif0     00:05:05  00:01:26  B G DR   1
R6#show ipv6 pim vrf MC neighbor | begin ^Neighbor
Neighbor Address         Interface   Uptime    Expires   Mode     DR pri
::FFFF:213.1.1.1         Lspvif0     00:06:33  00:01:21  B G      1
::FFFF:213.5.5.5         Lspvif0     00:06:38  00:01:21  B G      1
::FFFF:213.7.7.7         Lspvif0     00:07:25  00:01:21  B G      1
::FFFF:213.8.8.8         Lspvif0     00:07:20  00:01:20  B G      1
::FFFF:213.12.12.12      Lspvif0     00:04:51  00:01:41  B G DR   1
R7#show ipv6 pim vrf MC neighbor | begin ^Neighbor
Neighbor Address         Interface   Uptime    Expires   Mode     DR pri
FE80::2                  Gi2.527     00:19:29  00:01:30  B G      1
::FFFF:213.1.1.1         Lspvif0     00:06:45  00:01:39  B G      1
::FFFF:213.5.5.5         Lspvif0     00:06:50  00:01:38  B G      1
::FFFF:213.6.6.6         Lspvif0     00:07:36  00:01:40  B G      1
::FFFF:213.8.8.8         Lspvif0     00:07:31  00:01:37  B G      1
::FFFF:213.12.12.12      Lspvif0     00:05:02  00:01:29  B G DR   1
R8#show ipv6 pim vrf MC neighbor | begin ^Neighbor
Neighbor Address         Interface   Uptime    Expires   Mode     DR pri
::FFFF:213.1.1.1         Lspvif0     00:06:42  00:01:42  B G      1
::FFFF:213.5.5.5         Lspvif0     00:06:47  00:01:42  B G      1
::FFFF:213.6.6.6         Lspvif0     00:07:28  00:01:43  B G      1
::FFFF:213.7.7.7         Lspvif0     00:07:28  00:01:42  B G      1
::FFFF:213.12.12.12      Lspvif0     00:04:59  00:01:32  B G DR   1

We will configure MLD joins on all of the PEs to request traffic for FF33::1 from CSR4 and this will use the MP2MP default tree. CSR2 and CSR3 will issue MLD joins for FF7E:240:2001:10:2:7::1. The MLD configuration is not shown, but we verify the PIMv6 signaling between the PEs. First, we will verify the top three PEs (XRv2, CSR5, and CSR1). All of them receive the MLDv2 join from their CEs, but their RPF interface is null. The reason for this is that none of them have a unicast IPv6 route to 2001:10:4:6::4, because the unicast topology does not allow it. Their OILs are properly built, but ultimately traffic cannot flow towards the customers, even if it is flooded to the PEs via the MP2MP tree. There is nothing magical about MVPN as it still relies heavily on converged unicast VPN routing.
R1#show ipv6 mroute vrf MC FF33::1 | begin \(
(2001:10:4:6::4, FF33::1), 00:04:57/never, flags: sTI
  Incoming interface: Null
  RPF nbr: ::
  Immediate Outgoing interface list:
    GigabitEthernet2.519, Forward, 00:04:57/never
R5#show ipv6 mroute vrf MC FF33::1 | begin \(
(2001:10:4:6::4, FF33::1), 00:06:25/never, flags: sTI
  Incoming interface: Null
  RPF nbr: ::
  Immediate Outgoing interface list:
    GigabitEthernet2.550, Forward, 00:06:25/never
RP/0/0/CPU0:XRv2#show pim vrf MC ipv6 topology | begin SPT$
(2001:10:4:6::4,ff33::1) SPT SSM Up: 00:09:05
JP: Join(00:00:59) Flags: RPF: Null,::
  GigabitEthernet0/0/0/0.512 00:09:05 fwd LI LH

Next, we verify CSR7 and CSR8, which have unicast reachability back to CSR4. Their C(S,G) trees are properly constructed. A quick verification on CSR6 shows the LSPvif as the “LAN” interface in the OIL, as expected. This virtual interface represents the PMSI. R7#show ipv6 mroute vrf MC FF33::1 | begin \( (2001:10:4:6::4, FF33::1), 00:00:31/never, flags: sTI Incoming interface: Lspvif0 RPF nbr: ::FFFF:213.6.6.6 Immediate Outgoing interface list: GigabitEthernet2.527, Forward, 00:00:31/never R8#show ipv6 mroute vrf MC FF33::1 | begin \( (2001:10:4:6::4, FF33::1), 00:00:43/never, flags: sTI Incoming interface: Lspvif0 RPF nbr: ::FFFF:213.6.6.6 Immediate Outgoing interface list: GigabitEthernet2.538, Forward, 00:00:43/never R6#show ipv6 mroute vrf MC FF33::1 | begin \( (2001:10:4:6::4, FF33::1), 00:01:32/00:03:10, flags: sT Incoming interface: GigabitEthernet2.546 RPF nbr: 2001:10:4:6::4 Immediate Outgoing interface list: Lspvif0, Forward, 00:01:32/00:03:10

Traffic sent to LSPvif0 will be sent down the MP2MP tree, ultimately reaching all destinations. We acknowledge this is wasteful in this scenario, because there is no connectivity between some receivers and the source. We can prove that mLDP is not smart enough to understand this by tracing the LSP. CSR6’s mLDP database uses upstream label 93004 to send traffic towards the root. XRv3 is the next-hop.
R6#show mpls mldp database opaque_type mdt 213:213 0 | section Upstream
  Upstream client(s) :
    213.13.13.13:0 [Active]
      Expires : Never            Path Set ID : B
      Out Label (U) : 93004      Interface : GigabitEthernet2.563*
      Local Label (D): 6004      Next Hop : 213.6.13.13

On XRv3, the mLDP database shows XRv2 as another downstream client of the MP2MP tree, and replicates traffic towards it using label 92003. We verify this by checking the LFIB also. At the same time, XRv3 sends traffic upstream to the root using XRv4’s local label of 94007.
RP/0/0/CPU0:XRv3#show mpls mldp database opaquetype mdt 213:213 0
mLDP database
LSM-ID: 0x00001 Type: MP2MP Uptime: 18:45:34
  FEC Root : 213.14.14.14
  Opaque decoded : [mdt 213:213 0]
  Upstream neighbor(s) :
    213.14.14.14:0 [Active] Uptime: 17:46:19
      Next Hop : 213.13.14.14
      Interface : GigabitEthernet0/0/0/0.534
      Local Label (D) : 93005 Remote Label (U): 94007
  Downstream client(s):
    LDP 213.6.6.6:0 Uptime: 15:29:10
      Next Hop : 213.6.13.6
      Interface : GigabitEthernet0/0/0/0.563
      Remote label (D) : 6004 Local label (U) : 93004
    LDP 213.12.12.12:0 Uptime: 17:46:40
      Next Hop : 213.12.13.12
      Interface : GigabitEthernet0/0/0/0.523
      Remote label (D) : 92003 Local label (U) : 93012
RP/0/0/CPU0:XRv3#show mpls forwarding labels 93004
Local  Outgoing    Prefix             Outgoing      Next Hop      Bytes
Label  Label       or ID              Interface                   Switched
------ ----------- ------------------ ------------- ------------- ------------
93004  92003       MLDP: 0x00001      Gi0/0/0/0.523 213.12.13.12  0
       94007       MLDP: 0x00001      Gi0/0/0/0.534 213.13.14.14  0
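The relationship between the mLDP database and the LFIB on an MP2MP transit node can be modeled directly: traffic arriving on the local label bound to one branch is replicated to every other branch using that branch's remote label. The Python sketch below rebuilds XRv3's replication entries from the database values above. It is a simplified illustration under that assumption, and Branch and build_lfib are hypothetical names.

```python
from typing import Dict, List, NamedTuple, Tuple


class Branch(NamedTuple):
    peer: str
    local_label: int   # label XRv3 advertised; traffic from this peer arrives with it
    remote_label: int  # label the peer advertised; XRv3 imposes it towards the peer


# Values copied from XRv3's mLDP database for [mdt 213:213 0].
BRANCHES = [
    Branch("XRv4", local_label=93005, remote_label=94007),  # upstream
    Branch("CSR6", local_label=93004, remote_label=6004),   # downstream
    Branch("XRv2", local_label=93012, remote_label=92003),  # downstream
]


def build_lfib(branches: List[Branch]) -> Dict[int, List[Tuple[int, str]]]:
    """Map each incoming label to (outgoing label, peer) for all other branches."""
    return {
        b.local_label: [
            (other.remote_label, other.peer)
            for other in branches
            if other is not b
        ]
        for b in branches
    }


lfib = build_lfib(BRANCHES)
for in_label, replications in sorted(lfib.items()):
    print(in_label, replications)
```

The entry for 93004 matches the LFIB output above: one copy towards XRv2 with label 92003 and one copy upstream towards XRv4 with label 94007.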

Traffic reaching XRv4 from XRv3 (using label 94007) is replicated to many downstream peers. Specifically, it includes CSR1, CSR5, CSR7, and CSR8. Only the last two will be able to forward the traffic further. These four downstream peers are all listed in the LFIB also.
RP/0/0/CPU0:XRv4#show mpls mldp database opaquetype mdt 213:213 0 | begin Down
  Downstream client(s):
    LDP 213.1.1.1:0 Uptime: 17:46:49
      Next Hop : 213.1.14.1
      Interface : GigabitEthernet0/0/0/0.514
      Remote label (D) : 1009 Local label (U) : 94016
    LDP 213.5.5.5:0 Uptime: 17:46:49
      Next Hop : 213.5.14.5
      Interface : GigabitEthernet0/0/0/0.554
      Remote label (D) : 5015 Local label (U) : 94015
    LDP 213.7.7.7:0 Uptime: 17:46:49
      Next Hop : 213.7.14.7
      Interface : GigabitEthernet0/0/0/0.574
      Remote label (D) : 7004 Local label (U) : 94017
    LDP 213.8.8.8:0 Uptime: 17:46:49
      Next Hop : 213.8.14.8
      Interface : GigabitEthernet0/0/0/0.584
      Remote label (D) : 8035 Local label (U) : 94011
    LDP 213.13.13.13:0 Uptime: 17:46:49
      Next Hop : 213.13.14.13
      Interface : GigabitEthernet0/0/0/0.534
      Remote label (D) : 93005 Local label (U) : 94007
RP/0/0/CPU0:XRv4#show mpls forwarding labels 94007
Local  Outgoing    Prefix             Outgoing      Next Hop      Bytes
Label  Label       or ID              Interface                   Switched
------ ----------- ------------------ ------------- ------------- ------------
94007  1009        MLDP: 0x00003      Gi0/0/0/0.514 213.1.14.1    0
       5015        MLDP: 0x00003      Gi0/0/0/0.554 213.5.14.5    0
       7004        MLDP: 0x00003      Gi0/0/0/0.574 213.7.14.7    0
       8035        MLDP: 0x00003      Gi0/0/0/0.584 213.8.14.8    0

CSR4 will begin sending traffic now. Examining XRv2, CSR5, and CSR1, they show RPF failures when traffic arrives, since there is no RPF interface but the C(S,G) matches the C-MRIB. Because MVPN LSM isn’t supported on egress XRv PEs, all the counters are zero, but normally it would show here. The RPF counters are proof that mLDP is flooding the traffic throughout the entire MP2MP tree. R4#ping FF33::1 repeat 10 Output Interface: GigabitEthernet2.546 R5#show ipv6 mroute vrf MC FF33::1 count Forwarding Counts: Pkt Count/Pkts per second/Avg Pkt Size/Kilobits per second Other counts: Total/RPF failed/Other drops(OIF-null, rate-limit etc) VRF MC 53 routes, 6 (*,G)s, 46 (*,G/m)s Group: FF33::1 Source: 2001:10:4:6::4, SW Forwarding: 0/0/0/0, Other: 0/0/0 HW Forwarding: 0/0/0/0, Other: 10/10/0 Totals - Source count: 1, Packet count: 0 R1#show ipv6 mroute vrf MC FF33::1 count Forwarding Counts: Pkt Count/Pkts per second/Avg Pkt Size/Kilobits per second Other counts: Total/RPF failed/Other drops(OIF-null, rate-limit etc)

943 © 2016 Nicholas J. Russo

VRF MC 53 routes, 6 (*,G)s, 46 (*,G/m)s
Group: FF33::1
  Source: 2001:10:4:6::4,
    SW Forwarding: 0/0/0/0, Other: 0/0/0
    HW Forwarding: 0/0/0/0, Other: 10/10/0
  Totals - Source count: 1, Packet count: 0

RP/0/0/CPU0:XRv2#show mfib vrf MC ipv6 route FF33::1/128 | begin Failure
Failure Counts: RPF / TTL / Empty Olist / Encap RL / Other
(2001:10:4:6::4,ff33::1) Flags:
  Up: 00:39:35
  Last Used: never
  SW Forwarding Counts: 0/0/0
  SW Replication Counts: 0/0/0
  SW Failure Counts: 0/0/0/0/0
  GigabitEthernet0/0/0/0.512 Flags: NS EG, Up:00:39:35

The operation was successful on CSR7 and CSR8, as expected.

R8#show ipv6 mroute vrf MC FF33::1 count
Forwarding Counts: Pkt Count/Pkts per second/Avg Pkt Size/Kilobits per second
Other counts: Total/RPF failed/Other drops(OIF-null, rate-limit etc)
VRF MC 53 routes, 6 (*,G)s, 46 (*,G/m)s
Group: FF33::1
  Source: 2001:10:4:6::4,
    SW Forwarding: 0/0/0/0, Other: 0/0/0
    HW Forwarding: 10/0/126/0, Other: 0/0/0
  Totals - Source count: 1, Packet count: 10

R7#show ipv6 mroute vrf MC FF33::1 count
Forwarding Counts: Pkt Count/Pkts per second/Avg Pkt Size/Kilobits per second
Other counts: Total/RPF failed/Other drops(OIF-null, rate-limit etc)
VRF MC 53 routes, 6 (*,G)s, 46 (*,G/m)s
Group: FF33::1
  Source: 2001:10:4:6::4,
    SW Forwarding: 0/0/0/0, Other: 0/0/0
    HW Forwarding: 10/0/126/0, Other: 0/0/0
  Totals - Source count: 1, Packet count: 10
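The counter comparison above (RPF drops on CSR5 and CSR1, clean forwarding on CSR7 and CSR8) can be automated when checking many PEs. The following is a minimal Python sketch, not part of the lab itself; the function names and the "rpf-drop"/"forwarding" labels are made up for illustration, and the field order is taken from the "Forwarding Counts" and "Other counts" legend lines printed by the router.

```python
import re

def parse_mroute_counts(line):
    """Split one 'Forwarding: a/b/c/d, Other: x/y/z' line into named counters."""
    match = re.search(
        r"Forwarding:\s*(\d+)/(\d+)/(\d+)/(\d+),\s*Other:\s*(\d+)/(\d+)/(\d+)", line
    )
    if match is None:
        raise ValueError("not a forwarding-count line: %r" % line)
    pkts, pps, avg_size, kbps, total, rpf_failed, other_drops = map(int, match.groups())
    return {
        "pkts": pkts, "pps": pps, "avg_size": avg_size, "kbps": kbps,
        "total": total, "rpf_failed": rpf_failed, "other_drops": other_drops,
    }

def classify(line):
    """Hypothetical labels: summarize what the counters say about this PE."""
    c = parse_mroute_counts(line)
    if c["pkts"] == 0 and c["total"] == 0:
        return "idle"
    if c["pkts"] == 0 and c["rpf_failed"] > 0:
        return "rpf-drop"      # traffic arrived but failed the RPF check
    if c["pkts"] > 0 and c["rpf_failed"] == 0:
        return "forwarding"    # traffic arrived and was forwarded
    return "mixed"

# Lines copied from the CSR5 and CSR8 outputs above.
print(classify("HW Forwarding: 0/0/0/0, Other: 10/10/0"))    # rpf-drop
print(classify("HW Forwarding: 10/0/126/0, Other: 0/0/0"))   # forwarding
```

Only the HW line is meaningful on these routers; the SW line stays at zero in every output shown.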

Next, we test IPv6 ASM using the P2MP data MDT. Knowing that the top half of the network does not have a route to the Embedded RP, I won’t configure the MLD join on them; we already tested the failures there. Instead, CSR3 joins the group FF7E:240:2001:10:2:7::1 and CSR8 sends a C(*,G) join towards the C-RP, which is CSR2.
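To see why the group FF7E:240:2001:10:2:7::1 maps to the C-RP address 2001:10:2:7::2, the Embedded RP address layout from RFC 3956 can be unpacked by hand. The sketch below is illustrative only (the function name is my own) and assumes the standard layout: flags nibble 7, a 4-bit RIID, an 8-bit prefix length, and a 64-bit prefix field.

```python
import ipaddress

def embedded_rp(group):
    """Derive the RP address from an Embedded RP group (RFC 3956 layout)."""
    raw = ipaddress.IPv6Address(group).packed
    if raw[0] != 0xFF or (raw[1] >> 4) != 0x7:
        raise ValueError("not an Embedded RP group (flags R, P, T must all be set)")
    riid = raw[2] & 0x0F                  # RP interface ID, low nibble of byte 2
    plen = raw[3]                         # length of the RP prefix in bits
    if not 1 <= plen <= 64:
        raise ValueError("prefix length must be between 1 and 64")
    prefix = int.from_bytes(raw[4:12], "big")
    prefix &= ((1 << plen) - 1) << (64 - plen)   # keep only the first plen bits
    return ipaddress.IPv6Address((prefix << 64) | riid)

print(embedded_rp("FF7E:240:2001:10:2:7::1"))    # 2001:10:2:7::2
```

Here the "2" in "240" is the RIID and "40" is the prefix length (0x40 = 64), which is why the RP ends in ::2.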


R8#show ipv6 mroute vrf MC FF7E:240:2001:10:2:7:0:1 | begin \(
(*, FF7E:240:2001:10:2:7:0:1), 00:00:26/never, RP 2001:10:2:7::2, flags: SCJ
  Incoming interface: Lspvif0
  RPF nbr: ::FFFF:213.7.7.7
  Immediate Outgoing interface list:
    GigabitEthernet2.538, Forward, 00:00:26/never

R7#show ipv6 mroute vrf MC FF7E:240:2001:10:2:7:0:1 | begin \(
(*, FF7E:240:2001:10:2:7:0:1), 00:01:57/00:02:36, RP 2001:10:2:7::2, flags: S
  Incoming interface: GigabitEthernet2.527
  RPF nbr: FE80::2
  Immediate Outgoing interface list:
    Lspvif0, Forward, 00:01:57/00:02:36

CSR2 is the C-RP, and is the root of the C(*,G) tree. Its incoming interface is the PIM decapsulation tunnel, which was automatically created.

R2#show ipv6 mroute FF7E:240:2001:10:2:7::1 | begin \(
(*, FF7E:240:2001:10:2:7:0:1), 00:02:31/00:03:00, RP 2001:10:2:7::2, flags: S
  Incoming interface: Tunnel2
  RPF nbr: 2001:10:2:7::2
  Immediate Outgoing interface list:
    GigabitEthernet2.527, Forward, 00:02:31/00:03:00

R2#show ipv6 pim tunnel tunnel2
Tunnel2*
  Type  : PIM Decap
  RP    : 2001:10:2:7::2*
  Source:

R2#show derived-config interface tunnel2
interface Tunnel2
 description Pim Register Tunnel (Decap) for RP 2001:10:2:7::2
 no ip address
 ipv6 unnumbered GigabitEthernet2.527
 ipv6 enable
 tunnel source GigabitEthernet2.527
 tunnel destination 2001:10:2:7::2

At this point, no new mLDP P2MP trees have been built, despite us specifying so in the ACL. This is because a C(S,G) join is generally required for the P2MP tree to be built, since the intent is to build it directly to the source. This is why P2MP mLDP trees are better suited for SSM than ASM. As observed earlier, Embedded RP does not appear to work in a VRF when deployed with MVPN on XE. The PE never sends the register message. The configuration looks right; it just never works. There is no debug output on CSR2 (C-RP), nor are any MPLS unicast packets seen in the core during this process.

CSR6#debug ipv6 pim


IPv6 PIM: [MC] (2001:10:4:6::4,FF7E:240:2001:10:2:7:0:1/128) GigabitEthernet2.546 MRIB update (f=20,c=20)
IPv6 PIM: [MC] (2001:10:4:6::4,FF7E:240:2001:10:2:7:0:1) Signal present on GigabitEthernet2.546
IPv6 PIM: [MC] (2001:10:4:6::4,FF7E:240:2001:10:2:7:0:1) Create entry
IPv6 PIM: [MC] (2001:10:4:6::4,FF7E:240:2001:10:2:7:0:1) RPF changed from ::/- to 2001:10:4:6::4/GigabitEthernet2.546
IPv6 PIM: [MC] (2001:10:4:6::4,FF7E:240:2001:10:2:7:0:1) Connected status changed from off to on
IPv6 PIM: [MC] (2001:10:4:6::4,FF7E:240:2001:10:2:7:0:1) Start registering to 2001:10:2:7::2

R6#show ipv6 mroute vrf MC FF7E:240:2001:10:2:7:0:1 | begin \(
(2001:10:4:6::4, FF7E:240:2001:10:2:7:0:1), 00:03:43/00:00:51, flags: SFT
  Incoming interface: GigabitEthernet2.546
  RPF nbr: 2001:10:4:6::4, Registering
  Immediate Outgoing interface list:
    Tunnel1, Forward, 00:03:43/never

For the sake of actually seeing this work, I configure the RP statically on CSR6 only, disable embedded RP support, and set a valid register source just to be certain. This was the same workaround conducted for profile 0.

! CSR6
ipv6 pim vrf MC rp-address 2001:10:2:7::2
ipv6 pim vrf MC register-source GigabitEthernet2.546
no ipv6 pim vrf MC rp embedded

Now, the first-hop router can successfully register traffic with the C-RP. The little ‘y’ flag means the router is sending traffic to an MDT data group. This implies the P2MP MDT was built, but we will verify that later. We also know the MP2MP tree is not in use because the MDT ID is non-zero and a new LSM-ID was assigned. The counters also show several packets forwarded.

R6#show ipv6 mroute vrf MC FF7E:240:2001:10:2:7:0:1 | begin \(
(2001:10:4:6::4, FF7E:240:2001:10:2:7:0:1), 00:19:10/00:00:33, flags: SFTy
  Incoming interface: GigabitEthernet2.546
  RPF nbr: 2001:10:4:6::4
  MDT TX nr: 2147483649 (0x80000001) LSM-ID: 0xD
  Immediate Outgoing interface list:
    Tunnel1, Forward, 00:19:10/never
    Lspvif0, Forward, 00:14:47/00:03:05

R6#show ipv6 mroute vrf MC FF7E:240:2001:10:2:7:0:1 2001:10:4:6::4 count
Forwarding Counts: Pkt Count/Pkts per second/Avg Pkt Size/Kilobits per second
Other counts: Total/RPF failed/Other drops(OIF-null, rate-limit etc)
VRF MC 54 routes, 6 (*,G)s, 46 (*,G/m)s
Group: FF7E:240:2001:10:2:7:0:1
  Source: 2001:10:4:6::4,
    SW Forwarding: 0/0/0/0, Other: 0/0/0
    HW Forwarding: 786/1/118/0, Other: 10/0/10
  Totals - Source count: 1, Packet count: 786

Checking the C-RP quickly, we see the C(S,G) entry so we know registration was successful. Because CSR2 is not in the transit path between CSR4 and CSR3, it is pruned from the C(S,G) tree.

R2#show ipv6 mroute FF7E:240:2001:10:2:7:0:1 | begin \(
(*, FF7E:240:2001:10:2:7:0:1), 00:45:09/00:03:22, RP 2001:10:2:7::2, flags: S
  Incoming interface: Tunnel2
  RPF nbr: 2001:10:2:7::2
  Immediate Outgoing interface list:
    GigabitEthernet2.527, Forward, 00:45:09/00:03:22

(2001:10:4:6::4, FF7E:240:2001:10:2:7:0:1), 00:02:10/00:03:10, RP 2001:10:2:7::2, flags: SPR
  Incoming interface: Tunnel2
  RPF nbr: 2001:10:2:7::2
  Immediate Outgoing interface list:
    GigabitEthernet2.527, Null, 00:02:10/00:03:22

The egress PE shows traffic being received from the LSPvif and forwarded to the customer correctly. The same can be observed on CSR7. It is also considered an egress PE (tail-end of the P2MP tree), probably because the RP is behind it issuing C(S,G) joins, although the traffic is dropped due to a null OIL. It is awkward that the packet counters increase on CSR7 with a null OIL. The big ‘Y’ means traffic is being received from an MDT data group.

R8#show ipv6 mroute vrf MC FF7E:240:2001:10:2:7:0:1 | begin \(
(*, FF7E:240:2001:10:2:7:0:1), 00:46:06/never, RP 2001:10:2:7::2, flags: SCJ
  Incoming interface: Lspvif0
  RPF nbr: ::FFFF:213.7.7.7
  Immediate Outgoing interface list:
    GigabitEthernet2.538, Forward, 00:46:06/never

(2001:10:4:6::4, FF7E:240:2001:10:2:7:0:1), 00:03:07/00:00:22, flags: SJY
  Incoming interface: Lspvif0, MDT: [2147483649 (0x80000001), 213.6.6.6]/00:02:02
  RPF nbr: ::FFFF:213.6.6.6
  Inherited Outgoing interface list:
    GigabitEthernet2.538, Forward, 00:46:06/never

R8#show ipv6 mroute vrf MC FF7E:240:2001:10:2:7:0:1 2001:10:4:6::4 count
Forwarding Counts: Pkt Count/Pkts per second/Avg Pkt Size/Kilobits per second
Other counts: Total/RPF failed/Other drops(OIF-null, rate-limit etc)
VRF MC 56 routes, 7 (*,G)s, 47 (*,G/m)s
Group: FF7E:240:2001:10:2:7:0:1
  Source: 2001:10:4:6::4,
    SW Forwarding: 0/0/0/0, Other: 0/0/0
    HW Forwarding: 167/1/126/0, Other: 0/0/0
  Totals - Source count: 1, Packet count: 167

R7#show ipv6 mroute vrf MC FF7E:240:2001:10:2:7:0:1 | begin \(
(*, FF7E:240:2001:10:2:7:0:1), 00:50:12/00:03:20, RP 2001:10:2:7::2, flags: S
  Incoming interface: GigabitEthernet2.527
  RPF nbr: FE80::2
  Immediate Outgoing interface list:
    Lspvif0, Forward, 00:50:12/00:03:20

(2001:10:4:6::4, FF7E:240:2001:10:2:7:0:1), 00:08:31/00:03:16, flags: STY
  Incoming interface: Lspvif0, MDT: [2147483649 (0x80000001), 213.6.6.6]/00:02:56
  RPF nbr: ::FFFF:213.6.6.6
  Outgoing interface list: Null

R7#show ipv6 mroute vrf MC FF7E:240:2001:10:2:7:0:1 2001:10:4:6::4 count
Forwarding Counts: Pkt Count/Pkts per second/Avg Pkt Size/Kilobits per second
Other counts: Total/RPF failed/Other drops(OIF-null, rate-limit etc)
VRF MC 56 routes, 7 (*,G)s, 47 (*,G/m)s
Group: FF7E:240:2001:10:2:7:0:1
  Source: 2001:10:4:6::4,
    SW Forwarding: 0/0/0/0, Other: 0/0/0
    HW Forwarding: 404/1/126/0, Other: 1/1/0
  Totals - Source count: 1, Packet count: 404

Before all of this occurred, the new P2MP tree was signaled. We will examine the debugs from CSR8. One significant difference between these debugs and the MP2MP debugs is that there is no upstream label mapping message from the core, as expected.

! CSR8
MLDP-MDT: [mdt 213:213 2147483649] wavl insert success
MLDP: LDP root 213.6.6.6 added
MLDP-DB: [mdt 213:213 2147483649] Added P2MP branch for MDT label
MLDP: Root 213.6.6.6 old paths: 0 new paths: 1
MLDP-DB: [mdt 213:213 2147483649] Changing peer from none to 213.14.14.14:0
MLDP-DB: [mdt 213:213 2147483649] Add accepting element nbr: 213.14.14.14:0
MLDP-MFI: For accepting element nbr: 213.14.14.14:0, [mdt 213:213 2147483649] MFI allocated label 8038
MLDP: [mdt 213:213 2147483649] label mappping msg sent to 213.14.14.14:0 success
MLDP-DB: [mdt 213:213 2147483649] path to peer: 213.14.14.14:0 changed None:0.0.0.0 to GigabitEthernet2.584:213.8.14
MLDP-MFI: After the MFI call For accepting element nbr: 213.14.14.14:0, Bind local label 8038 to PSM ID: 6, returned: 8038

Next, we will trace the LSP. Checking CSR8, we confirm a new P2MP tree was built.

R8#show mpls mldp database summary
LSM ID  Type   Root          Decoded Opaque Value       Client Cnt.
4       P2MP   213.7.7.7     [mdt 213:213 1]            1
5       P2MP   213.6.6.6     [mdt 213:213 2147483649]   1
1       MP2MP  213.14.14.14  [mdt 213:213 0]            1

Looking at the mLDP database, CSR8 advertises label 8038 so XRv4 can send traffic downstream. There is no upstream label since mLDP P2MP trees are downstream only.

R8#show mpls mldp database opaque_type mdt 213:213 2147483649
LSM ID : 5   Type: P2MP   Uptime : 00:03:40
  FEC Root           : 213.6.6.6
  Opaque decoded     : [mdt 213:213 2147483649]
  Opaque length      : 11 bytes
  Opaque value       : 02 000B 0002130000021380000001
  Upstream client(s) :
    213.14.14.14:0    [Active]
      Expires        : Never         Path Set ID  : 5
      Out Label (U)  : None          Interface    : GigabitEthernet2.584*
      Local Label (D): 8038          Next Hop     : 213.8.14.14
  Replication client(s):
    MDT  (VRF MC)
      Uptime         : 00:03:40      Path Set ID  : None
      Interface      : Lspvif0
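The raw "Opaque value" and the "Opaque decoded" line in this output describe the same bytes. As an illustration, the Python sketch below unpacks the value by hand. The field layout (1-byte type, 2-byte length, then a 3-byte OUI, 4-byte VPN index, and 4-byte MDT number) is an assumption inferred from the 11-byte length and the "oui:index" wording in the router output, not taken from a specification, and the function name is my own.

```python
import struct

def decode_mdt_opaque(hex_string):
    """Decode an mLDP MDT opaque value as printed by 'show mpls mldp database'."""
    raw = bytes.fromhex(hex_string.replace(" ", ""))
    opaque_type, length = struct.unpack("!BH", raw[:3])
    value = raw[3:3 + length]
    # Assumed layout: type 2 means MDT, with an 11-byte value.
    if opaque_type != 2 or length != 11 or len(value) != 11:
        raise ValueError("not an 11-byte MDT opaque value")
    oui = int.from_bytes(value[0:3], "big")
    index = int.from_bytes(value[3:7], "big")
    mdt_number = int.from_bytes(value[7:11], "big")
    return {"vpn_id": "%x:%x" % (oui, index), "mdt_number": mdt_number}

decoded = decode_mdt_opaque("02 000B 0002130000021380000001")
print(decoded)   # {'vpn_id': '213:213', 'mdt_number': 2147483649}
```

The last four bytes, 0x80000001, are the same MDT number that appears as "MDT TX nr: 2147483649 (0x80000001)" on the ingress PE.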

On the other end, CSR6 is the root of the new P2MP MDT.

R6#show mpls mldp database summary
LSM ID  Type   Root          Decoded Opaque Value       Client Cnt.
B       P2MP   213.7.7.7     [mdt 213:213 1]            1
D       P2MP   213.6.6.6     [mdt 213:213 2147483649]   2
A       MP2MP  213.14.14.14  [mdt 213:213 0]            1

There are no upstream clients from the root, and CSR6 replicates the traffic downstream to XRv3 using label 93007.

R6#show mpls mldp database opaque_type mdt 213:213 2147483649
LSM ID : D   Type: P2MP   Uptime : 00:03:52
  FEC Root           : 213.6.6.6 (we are the root)
  Opaque decoded     : [mdt 213:213 2147483649]
  Opaque length      : 11 bytes
  Opaque value       : 02 000B 0002130000021380000001
  Upstream client(s) :
    None
      Expires        : N/A           Path Set ID  : 10
  Replication client(s):
    MDT  (VRF MC)
      Uptime         : 00:03:52      Path Set ID  : None
      Interface      : Lspvif0
    213.13.13.13:0
      Uptime         : 00:03:52      Path Set ID  : None
      Out label (D)  : 93007         Interface    : GigabitEthernet2.563*
      Local label (U): None          Next Hop     : 213.6.13.13

Upon receipt of label 93007, XRv3 replicates the traffic to both CSR7 (C-RP is behind it) and XRv4 (downstream towards CSR8).

RP/0/0/CPU0:XRv3#show mpls forwarding labels 93007
Local  Outgoing    Prefix             Outgoing        Next Hop        Bytes
Label  Label       or ID              Interface                       Switched
------ ----------- ------------------ --------------- --------------- ------------
93007  7008        MLDP: 0x00005      Gi0/0/0/0.573   213.7.13.7      0
       94019       MLDP: 0x00005      Gi0/0/0/0.534   213.13.14.14    0

Upon receipt of label 94019, XRv4 replicates the traffic to CSR8 (in front of the C-receiver).

RP/0/0/CPU0:XRv4#show mpls forwarding labels 94019
Local  Outgoing    Prefix             Outgoing        Next Hop        Bytes
Label  Label       or ID              Interface                       Switched
------ ----------- ------------------ --------------- --------------- ------------
94019  8038        MLDP: 0x00005      Gi0/0/0/0.584   213.8.14.8      0
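The hop-by-hop trace above can be summarized as a small replication table. The Python sketch below is only a model for following the labels; the dictionary is hand-copied from the mLDP database and forwarding outputs in this section, and the names `REPLICATION` and `trace` are made up for illustration.

```python
# (node, label the packet arrives with) -> list of (next node, outgoing label).
# The root, CSR6, imposes the first label, so its incoming label is None.
REPLICATION = {
    ("CSR6", None): [("XRv3", 93007)],
    ("XRv3", 93007): [("CSR7", 7008), ("XRv4", 94019)],
    ("XRv4", 94019): [("CSR8", 8038)],
}

def trace(node, label=None, path=()):
    """Return every root-to-leaf path of the P2MP tree as a tuple of hops."""
    path = path + ((node, label),)
    branches = REPLICATION.get((node, label), [])
    if not branches:          # no further replication: this node is a leaf
        return [path]
    paths = []
    for next_node, out_label in branches:
        paths.extend(trace(next_node, out_label, path))
    return paths

for p in trace("CSR6"):
    print(" -> ".join("%s[%s]" % hop for hop in p))
```

The two paths end at CSR7 (label 7008) and CSR8 (label 8038), matching the local labels those PEs advertised upstream.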

Last, we will examine OAM for MP2MP and P2MP trees in this profile. Unlike Profile 12 where we were limited only to unidirectional P2MP trees (the root specified had to be local), with MP2MP pings we can test from any endpoint; the tree is bidirectional. We specify the FEC and opaque type and can quickly discover all the endpoints in the MVPN on this MDT.

R6# show mpls mldp database summary
LSM ID  Type   Root          Decoded Opaque Value       Client Cnt.
16      P2MP   213.7.7.7     [mdt 213:213 1]            1
13      MP2MP  213.14.14.14  [mdt 213:213 0]            1

R6#ping mpls mldp mp2mp 213.14.14.14 mdt 213:213 0
mp2mp Root node addr 213.14.14.14
 Opaque type MDT, oui:index 0x213:0213, mdtnum 0
Sending 1, 72-byte MPLS Echos to Target FEC Stack TLV descriptor,
     timeout is 2.2 seconds, send interval is 0 msec, jitter value is 200 msec:
[snip]
Request #1
! reply addr 213.12.13.12
! reply addr 213.7.13.7
! reply addr 213.1.14.1
! reply addr 213.5.14.5
! reply addr 213.8.14.8

Round-trip min/avg/max = 16/68/127 ms

R7#show mpls mldp database summary
LSM ID  Type   Root          Decoded Opaque Value       Client Cnt.
16      P2MP   213.7.7.7     [mdt 213:213 1]            3
13      MP2MP  213.14.14.14  [mdt 213:213 0]            2

R7#ping mpls mldp p2mp 213.7.7.7 mdt 213:213 1
p2mp Root node addr 213.7.7.7
 Opaque type MDT, oui:index 0x213:0213, mdtnum 1
Sending 1, 72-byte MPLS Echos to Target FEC Stack TLV descriptor,
     timeout is 2.2 seconds, send interval is 0 msec, jitter value is 200 msec:
[snip]
Type escape sequence to abort.
Request #1
! reply addr 213.8.14.8
! reply addr 213.6.13.6

Round-trip min/avg/max = 126/136/146 ms

Additional Reading – Reference configurations "mvpn-1"

28.3 Profile 3: Default MDT − GRE − BGP−AD − PIM C−mcast Signaling

For this test, begin with the MVPN Profile 0 configuration using the Bidir model where PIM is set up already for both the customer and provider networks. The default MDT uses PIM-Bidir while the data MDTs use PIM-SSM. This will allow us to see more varieties in the P-tunnel types. The only significant difference here is enabling BGP auto-discovery (AD). This modification is similar to jumping from profile 1 to profile 9 (using mLDP, not GRE). I do not anticipate seeing much benefit to this since the default MDT is still specified statically and signaling is done via PIM. IPv4 MDT SAFI is no longer needed; instead we configure MVPN.

! Basic XE configuration to enable BGP AD
vrf definition MC
 address-family ipv4
  mdt auto-discovery pim
 address-family ipv6
  mdt auto-discovery pim
router bgp 213
 no address-family ipv4 mdt
 address-family ipv4 mvpn
  neighbor 213.12.12.12 activate
  neighbor 213.12.12.12 send-community extended
 address-family ipv6 mvpn
  neighbor 213.12.12.12 activate
  neighbor 213.12.12.12 send-community extended

! Basic XR configuration to enable BGP AD (specific to XRv2 for BGP config)
route-policy RPL_PIM
  set core-tree pim-default
end-policy
multicast-routing
 vrf MC
  address-family ipv4
   bgp auto-discovery pim
  address-family ipv6
   bgp auto-discovery pim
router pim
 address-family ipv4
  rpf topology route-policy RPL_PIM
 address-family ipv6
  rpf topology route-policy RPL_PIM
router bgp 213
 no address-family ipv4 mdt
 address-family ipv4 mvpn
 address-family ipv6 mvpn
 no address-family ipv4 mdt
 neighbor-group IBGP
  address-family ipv4 mvpn
   route-reflector-client
  address-family ipv6 mvpn
   route-reflector-client
 vrf MC
  address-family ipv4 mvpn
  address-family ipv6 mvpn

After configuring BGP, we look at CSR1 to see if it learned any BGP AD routes. It sees a Type-1 I-PMSI route from every PE in the MVPN (IPv4) and the same type of route for the “south-side” routers on IPv6. This is because the VPNv6 L3VPN is split horizontally. We also see a Type-3 S-PMSI route from CSR7 which describes a data MDT.

R1#show bgp ipv4 mvpn vrf MC | begin Network
     Network          Next Hop            Metric LocPrf Weight Path
Route Distinguisher: 213:1 (default for vrf MC)
 *>   [1][213:1][213.1.1.1]/12
                      0.0.0.0                            32768 ?
 *>i  [1][213:1][213.5.5.5]/12
                      213.5.5.5                0    100      0 ?
 *>i  [1][213:1][213.6.6.6]/12
                      213.6.6.6                0    100      0 ?
 *>i  [1][213:1][213.7.7.7]/12
                      213.7.7.7                0    100      0 ?
 *>i  [1][213:1][213.8.8.8]/12
                      213.8.8.8                0    100      0 ?
 *>i  [1][213:1][213.12.12.12]/12
                      213.12.12.12                  100      0 i
 *>i  [3][213:1][10.2.7.2][232.0.0.5][213.7.7.7]/22
                      213.7.7.7                0    100      0 ?

R1#show bgp ipv6 mvpn vrf MC | begin Network
     Network          Next Hop            Metric LocPrf Weight Path
Route Distinguisher: 213:1 (default for vrf MC)
 *>   [1][213:1][213.1.1.1]/12
                      ::                                 32768 ?
 *>i  [1][213:1][213.5.5.5]/12
                      213.5.5.5                0    100      0 ?
 *>i  [1][213:1][213.12.12.12]/12
                      213.12.12.12                  100      0 i
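The "/12" and "/22" suffixes on these routes are not prefix lengths in the usual sense; they match the NLRI length in bytes for each MVPN route type. As a hedged illustration, the Python sketch below adds up the field sizes defined in RFC 6514 for IPv4 C-multicast addresses. The table and function names are my own, and types 5 and 7 are included because they appear in later outputs in this chapter.

```python
RD = 8          # route distinguisher
AS = 4          # source AS number
LEN = 1         # one-byte length field before each C-source / C-group
V4 = 4          # IPv4 address

# Route type -> (name, list of field sizes in bytes), IPv4 addresses assumed.
NLRI_FIELDS = {
    1: ("Intra-AS I-PMSI A-D", [RD, V4]),
    3: ("S-PMSI A-D", [RD, LEN, V4, LEN, V4, V4]),
    5: ("Source Active A-D", [RD, LEN, V4, LEN, V4]),
    7: ("Source Tree Join", [RD, AS, LEN, V4, LEN, V4]),
}

def nlri_length(route_type):
    """Return the NLRI length in bytes for the given MVPN route type."""
    return sum(NLRI_FIELDS[route_type][1])

for route_type, (name, _fields) in sorted(NLRI_FIELDS.items()):
    print("Type-%d %-20s /%d" % (route_type, name, nlri_length(route_type)))
```

This gives 12 bytes for Type-1 and 22 bytes for Type-3, which agrees with the table above.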

Looking at the I-PMSI and S-PMSI for CSR7 side by side, we see one uses type 5 and one uses type 3 provider tunnels (green). Type-5 is PIM-Bidir and Type-3 is PIM-SSM, which is correct given our multicast group allocations defined for MDT default/data within this MVPN. If we were using ASM, we would have seen Type-4 also. In this case, both trees are rooted at CSR7 (yellow). The only difference is the destination group (cyan) encoded in the tunnel parameters. This is essentially the same information that IPv4 MDT carries, except it is more verbose and capable.

R1#show bgp ipv4 mvpn vrf MC route-type 1 213.7.7.7
BGP routing table entry for [1][213:1][213.7.7.7]/12, version 376
Paths: (1 available, best #1, table MVPNv4-BGP-Table, not advertised to EBGP peer)
  Not advertised to any peer
  Refresh Epoch 1
  Local, imported path from [1][213:7][213.7.7.7]/12 (global)
    213.7.7.7 (metric 20) from 213.12.12.12 (213.12.12.12)
      Origin incomplete, metric 0, localpref 100, valid, internal, best
      Community: no-export
      Extended Community: RT:213:7
      Originator: 213.7.7.7, Cluster list: 213.12.12.12
      PMSI Attribute: Flags: 0x0, Tunnel type: 5, length 8, label: exp-null, tunnel parameters: D507 0707 EEFF 0001
      rx pathid: 0, tx pathid: 0x0

R1#show bgp ipv4 mvpn vrf MC route-type 3 10.2.7.2 232.0.0.5 213.7.7.7
BGP routing table entry for [3][213:1][10.2.7.2][232.0.0.5][213.7.7.7]/22, version 378
Paths: (1 available, best #1, table MVPNv4-BGP-Table, not advertised to EBGP peer)
  Not advertised to any peer
  Refresh Epoch 1
  Local, imported path from [3][213:7][10.2.7.2][232.0.0.5][213.7.7.7]/22 (global)
    213.7.7.7 (metric 20) from 213.12.12.12 (213.12.12.12)
      Origin incomplete, metric 0, localpref 100, valid, internal, best
      Community: no-export
      Extended Community: RT:213:7
      Originator: 213.7.7.7, Cluster list: 213.12.12.12
      PMSI Attribute: Flags: 0x0, Tunnel type: 3, length 8, label: exp-null, tunnel parameters: D507 0707 E804 0700
      rx pathid: 0, tx pathid: 0x0
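The 8-byte tunnel parameters in both PMSI attributes are simply two IPv4 addresses back to back: the root (or sender) address followed by the P-multicast group. The Python sketch below decodes them; the function name is hypothetical, and the tunnel type names follow the RFC 6514 numbering for the PIM-based types discussed here.

```python
import ipaddress

# PIM-based PMSI tunnel types from RFC 6514.
TUNNEL_TYPES = {3: "PIM-SSM", 4: "PIM-SM", 5: "BIDIR-PIM"}

def decode_pim_tunnel(tunnel_type, parameters):
    """Decode PMSI tunnel parameters as (type name, root address, P-group)."""
    raw = bytes.fromhex(parameters.replace(" ", ""))
    if len(raw) != 8:
        raise ValueError("expected 8 bytes: IPv4 root plus IPv4 group")
    root = ipaddress.IPv4Address(raw[:4])
    group = ipaddress.IPv4Address(raw[4:])
    return TUNNEL_TYPES[tunnel_type], str(root), str(group)

# Values copied from the Type-1 and Type-3 routes above.
print(decode_pim_tunnel(5, "D507 0707 EEFF 0001"))  # ('BIDIR-PIM', '213.7.7.7', '238.255.0.1')
print(decode_pim_tunnel(3, "D507 0707 E804 0700"))  # ('PIM-SSM', '213.7.7.7', '232.4.7.0')
```

Both decode to the same root, 213.7.7.7, and differ only in the group: 238.255.0.1 for the default MDT and 232.4.7.0 for the data MDT.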

Because the Type-3 S-PMSI was created once CSR7 received a C(S,G) join for a flow that is a candidate for MDT optimization, this implies the entire signaling path is already set up. Assuming CSR2 is the source with CSR3 and CSR4 as receivers, we can verify the C-MRIB entries on the three relevant PEs. The “verbose” option on the ingress PE, CSR7, shows us the GRE MDT to which this is bound; the S-PMSI route told us the P-tunnel was PIM (GRE) and the group was 232.4.7.0. We also see a little ‘y’ which means this is sending to a data MDT group. The egress PEs, CSR6 and CSR8, also show the MDT to which they are bound (using the Type-3 S-PMSI from CSR7). The big ‘Y’ means they are joined to (receiving) an MDT data group.

R7#show ip mroute vrf MC 232.0.0.5 10.2.7.2 verbose | begin \(
(10.2.7.2, 232.0.0.5), 00:05:18/00:03:06, flags: sTyp
  Incoming interface: GigabitEthernet2.527, RPF nbr 0.0.0.0
  Outgoing interface list:
    Tunnel2, GRE MDT: 232.4.7.0 (data), Forward/Sparse, 00:05:19/00:03:06, p

R6#show ip mroute vrf MC 232.0.0.5 10.2.7.2 | begin \(
(10.2.7.2, 232.0.0.5), 00:00:26/00:02:33, flags: sTIY
  Incoming interface: Tunnel3, RPF nbr 213.7.7.7, MDT:[213.7.7.7,232.4.7.0]/never
  Outgoing interface list:
    GigabitEthernet2.546, Forward/Sparse, 00:00:26/00:02:33

R8#show ip mroute vrf MC 232.0.0.5 10.2.7.2 | begin \(
(10.2.7.2, 232.0.0.5), 00:36:19/00:02:48, flags: sTIY
  Incoming interface: Tunnel3, RPF nbr 213.7.7.7, MDT:[213.7.7.7,232.4.7.0]/never
  Outgoing interface list:
    GigabitEthernet2.538, Forward/Sparse, 00:36:18/00:02:48

We can verify the data MDT is in play by checking the routers’ MDT caches. This information is derived from BGP and helps devise the binding between BGP AD routes (PMSI) and C-MRIB entries. All of them know about the data MDT. CSR7 originated the message so the next-hop (last column) is blank (0.0.0.0).

R7#show ip pim mdt bgp all | section 232
 MDT group 232.4.7.0
  213:7:213.7.7.7     213.7.7.7       0.0.0.0

R6#show ip pim mdt bgp all | section 232
 MDT group 232.4.7.0
  213:6:213.7.7.7     213.12.12.12    213.7.7.7

R8#show ip pim mdt bgp all | section 232
 MDT group 232.4.7.0
  213:8:213.7.7.7     213.12.12.12    213.7.7.7

We can also quickly verify the P(S,G) creation of the MDT data group. We also must ensure XRv3 and XRv4, core routers, have this P(S,G) state. CSR7 is performing replication to both XRv3 and XRv4 and the little ‘z’ means this is a multicast tunnel bound to an MDT data group in the MVPN. The core routers show that they are sending the traffic only to their downstream clients and not all PEs in the MVPN, which is the whole point of data MDTs. The core routers have no special MVPN-style flags for the P(S,G) state because they are not aware of anything special. Last, the egress PEs accept this traffic from the core and forward it to their MVPN customers.

R7#show ip mroute 232.4.7.0 213.7.7.7 | begin \(
(213.7.7.7, 232.4.7.0), 00:12:19/00:03:14, flags: sTz
  Incoming interface: Loopback0, RPF nbr 0.0.0.0
  Outgoing interface list:
    GigabitEthernet2.573, Forward/Sparse, 00:03:18/00:03:13
    GigabitEthernet2.574, Forward/Sparse, 00:12:19/00:03:14

RP/0/0/CPU0:XRv3#show pim topology 232.4.7.0 213.7.7.7 | begin 213
(213.7.7.7,232.4.7.0)SPT SSM Up: 00:04:47
JP: Join(00:00:04) RPF: GigabitEthernet0/0/0/0.573,213.7.13.7 Flags:
  GigabitEthernet0/0/0/0.563  00:04:47  fwd Join(00:02:40)

RP/0/0/CPU0:XRv4#show pim topology 232.4.7.0 213.7.7.7 | begin 213
(213.7.7.7,232.4.7.0)SPT SSM Up: 00:14:02
JP: Join(now) RPF: GigabitEthernet0/0/0/0.574,213.7.14.7 Flags:
  GigabitEthernet0/0/0/0.584  00:14:02  fwd Join(00:03:13)

R6#show ip mroute 232.4.7.0 213.7.7.7 | begin \(
(213.7.7.7, 232.4.7.0), 00:05:13/stopped, flags: sTIZ
  Incoming interface: GigabitEthernet2.563, RPF nbr 213.6.13.13
  Outgoing interface list:
    MVRF MC, Forward/Sparse, 00:05:13/00:00:46

R8#show ip mroute 232.4.7.0 213.7.7.7 | begin \(
(213.7.7.7, 232.4.7.0), 00:14:16/stopped, flags: sTIZ
  Incoming interface: GigabitEthernet2.584, RPF nbr 213.8.14.14
  Outgoing interface list:
    MVRF MC, Forward/Sparse, 00:14:16/00:00:43

We begin sending traffic from CSR2 and check the packet counters on all routers. CSR7 is sending them to the PMSI while CSR6 and CSR8 are receiving them from the PMSI. This occurs within the MVPN.

R7#show ip mroute vrf MC 232.0.0.5 10.2.7.2 count | begin ^Group
Group: 232.0.0.5, Source count: 1, Packets forwarded: 318, Packets received: 318
  Source: 10.2.7.2/32, Forwarding: 318/1/118/0, Other: 318/0/0

R6#show ip mroute vrf MC 232.0.0.5 10.2.7.2 count | begin ^Group
Group: 232.0.0.5, Source count: 1, Packets forwarded: 326, Packets received: 326
  Source: 10.2.7.2/32, Forwarding: 326/1/142/1, Other: 326/0/0

R8#show ip mroute vrf MC 232.0.0.5 10.2.7.2 count | begin ^Group
Group: 232.0.0.5, Source count: 1, Packets forwarded: 324, Packets received: 324
  Source: 10.2.7.2/32, Forwarding: 324/1/142/1, Other: 324/0/0

Next, we check counters on the ingress and egress PEs to ensure they are seeing the MDT-encapsulated packets along the PMSI.

R7#show ip mroute 232.4.7.0 213.7.7.7 count | begin ^Group
Group: 232.4.7.0, Source count: 1, Packets forwarded: 179, Packets received: 179
  Source: 213.7.7.7/32, Forwarding: 179/1/124/0, Other: 179/0/0

R6#show ip mroute 232.4.7.0 213.7.7.7 count | begin ^Group
Group: 232.4.7.0, Source count: 1, Packets forwarded: 186, Packets received: 186
  Source: 213.7.7.7/32, Forwarding: 186/1/142/1, Other: 186/0/0

R8#show ip mroute 232.4.7.0 213.7.7.7 count | begin ^Group
Group: 232.4.7.0, Source count: 1, Packets forwarded: 184, Packets received: 184
  Source: 213.7.7.7/32, Forwarding: 184/1/142/1, Other: 184/0/0

Although not necessary, we will check the core routers for good measure. Both XRv3 and XRv4 show transit traffic within the data MDT, although they are totally unaware that MVPN is in play.

RP/0/0/CPU0:XRv3#show mfib route 232.4.7.0 213.7.7.7 | begin 213
(213.7.7.7,232.4.7.0), Flags:
  Up: 00:21:32
  Last Used: 00:00:00
  SW Forwarding Counts: 407/407/50468
  SW Replication Counts: 407/407/50468
  SW Failure Counts: 0/0/0/0/0
  GigabitEthernet0/0/0/0.563 Flags: NS EG, Up:00:21:32
  GigabitEthernet0/0/0/0.573 Flags: A, Up:00:21:32

RP/0/0/CPU0:XRv4#show mfib route 232.4.7.0 213.7.7.7 | begin 213
(213.7.7.7,232.4.7.0), Flags:
  Up: 00:30:52
  Last Used: 00:00:00
  SW Forwarding Counts: 426/426/52824
  SW Replication Counts: 426/426/52824
  SW Failure Counts: 0/0/0/0/0
  GigabitEthernet0/0/0/0.574 Flags: A, Up:00:30:52
  GigabitEthernet0/0/0/0.584 Flags: NS EG, Up:00:30:52

As a final check, we can use EPC to measure packets coming in from the core and going out towards the customer. The two packets are shown below in that sequence. The difference in length is exactly 24 bytes (142-118) and the IP addresses are shown plainly in the output. The outer IP addressing is highlighted in yellow and the inner IP addressing is green.

R8#show monitor capture CAP buffer detailed
0  142  0.000000  213.7.7.7 -> 232.4.7.0  GRE
0000: 01005E04 07000050 56A9DE77 8100CE00   ..^....PV..w....
0010: 08004500 007C05E4 0000FE2F EB5BD507   ..E..|...../.[..
0020: 0707E804 07000000 08004500 00642049   ..........E..d I
0030: 0000FE01 A3460A02 0702E800 00050800   .....F..........

1  118  0.000000  10.2.7.2 -> 232.0.0.5  ICMP
0000: 01005E00 00050050 56A9FB1C 81000DD2   ..^....PV.......
0010: 08004500 00642049 0000FD01 A4460A02   ..E..d I.....F..
0020: 0702E800 00050800 06DC000D 00080000   ................
0030: 00002431 5328ABCD ABCDABCD ABCDABCD   ..$1S(..........
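The 24-byte difference is the 20-byte outer IPv4 header plus the 4-byte base GRE header. The Python sketch below walks the first 64 bytes of the core-side capture to show where each header sits. It is a rough illustration only: it assumes a single 802.1Q tag after the Ethernet header and a GRE header with no optional fields (both consistent with the dump above), and the names are my own.

```python
import ipaddress
import struct

# First 64 bytes of the core-side frame, copied from the capture above.
CORE_SIDE_DUMP = """
01005E04 07000050 56A9DE77 8100CE00
08004500 007C05E4 0000FE2F EB5BD507
0707E804 07000000 08004500 00642049
0000FE01 A3460A02 0702E800 00050800
"""

ETH_PLUS_DOT1Q = 14 + 4   # Ethernet header plus one VLAN tag (assumed)
GRE_BASE_HEADER = 4       # flags/version plus protocol type, no options

def parse_ipv4(frame, offset):
    """Pull the fields we care about out of a 20-byte IPv4 header."""
    (ver_ihl, _tos, total_len, _ident, _frag, _ttl, proto, _csum,
     src, dst) = struct.unpack_from("!BBHHHBBH4s4s", frame, offset)
    return {
        "header_len": (ver_ihl & 0x0F) * 4,
        "total_len": total_len,
        "proto": proto,
        "src": str(ipaddress.IPv4Address(src)),
        "dst": str(ipaddress.IPv4Address(dst)),
    }

def gre_overhead(hex_dump):
    """Return (outer header, inner header, encapsulation overhead in bytes)."""
    frame = bytes.fromhex("".join(hex_dump.split()))
    outer = parse_ipv4(frame, ETH_PLUS_DOT1Q)
    if outer["proto"] != 47:                       # 47 is GRE
        raise ValueError("outer packet is not GRE")
    inner_offset = ETH_PLUS_DOT1Q + outer["header_len"] + GRE_BASE_HEADER
    inner = parse_ipv4(frame, inner_offset)
    return outer, inner, outer["total_len"] - inner["total_len"]

outer, inner, overhead = gre_overhead(CORE_SIDE_DUMP)
print(outer["src"], "->", outer["dst"])   # 213.7.7.7 -> 232.4.7.0
print(inner["src"], "->", inner["dst"])   # 10.2.7.2 -> 232.0.0.5
print("overhead:", overhead, "bytes")     # overhead: 24 bytes
```

The outer IP length is 124 bytes and the inner is 100 bytes; adding the 18 bytes of Ethernet and VLAN framing to each gives the 142 and 118 byte frame sizes reported by EPC.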

Next, we will test IPv6 SSM. ASM introduces much more complex signaling in the MVPN customer network and is tested many times in many profiles. IPv6 SSM is interesting because the C(S,G) we use, which is (2001:10:4:6::4, FF33::1), is not a candidate for MDT optimization per our IPv6 ACL. This means that it will transit the default MDT, which uses PIM-Bidir in the underlay. All PEs will receive the traffic in their underlay (core) networks and many will drop it, which can be wasteful both on bandwidth and router resources. First, we verify the C-PIM signaling is working by checking the three relevant PEs. CSR6 should have received the C(S,G) join from CSR7 and CSR8, then added its MDT interface to the OIL.

R6#show ipv6 mroute vrf MC FF33::1 2001:10:4:6::4 | begin \(
(2001:10:4:6::4, FF33::1), 00:37:09/00:03:19, flags: sT
  Incoming interface: GigabitEthernet2.546
  RPF nbr: 2001:10:4:6::4
  Immediate Outgoing interface list:
    Tunnel3, Forward, 00:37:09/00:03:19

R7#show ipv6 mroute vrf MC FF33::1 2001:10:4:6::4 | begin \(
(2001:10:4:6::4, FF33::1), 01:06:42/never, flags: sTI
  Incoming interface: Tunnel2
  RPF nbr: ::FFFF:213.6.6.6
  Immediate Outgoing interface list:
    GigabitEthernet2.527, Forward, 01:06:42/never

R8#show ipv6 mroute vrf MC FF33::1 2001:10:4:6::4 | begin \(
(2001:10:4:6::4, FF33::1), 01:06:41/never, flags: sTI
  Incoming interface: Tunnel3
  RPF nbr: ::FFFF:213.6.6.6
  Immediate Outgoing interface list:
    GigabitEthernet2.538, Forward, 01:06:41/never

We can verify that the default MDT is in use a few ways. First, none of the routers claim to be doing any kind of data MDT. CSR6 does not send and CSR7/CSR8 do not receive any data MDT information. No one has any idea about any 232 groups; in our case, we can correlate this with no data MDTs.

R6#show ipv6 pim mdt bgp all | section 232
[no output]

R7#show ipv6 pim mdt bgp all | section 232
[no output]

R8#show ipv6 pim mdt bgp all | section 232
[no output]

We check BGP IPv6 MVPN for a matching S-PMSI route for this specific C(S,G) on the router who would have originated it, which is CSR6. There is no such network. However, we do have an I-PMSI match using CSR6 as the root (green) using a type-5 Bidir tunnel (yellow) to group 238.255.0.1 (cyan).

R6#show bgp ipv6 mvpn vrf MC route-type 3 2001:10:4:6::4 FF33::1 213.6.6.6
% Network not in table

R6#show bgp ipv6 mvpn vrf MC route-type 1 213.6.6.6
BGP routing table entry for [1][213:6][213.6.6.6]/12, version 32
Paths: (1 available, best #1, table MVPNV6-BGP-Table, not advertised to EBGP peer)
  Advertised to update-groups:
     2
  Refresh Epoch 1
  Local
    :: from 0.0.0.0 (213.6.6.6)
      Origin incomplete, localpref 100, weight 32768, valid, sourced, local, best
      Community: no-export
      Extended Community: RT:213:6
      PMSI Attribute: Flags: 0x0, Tunnel type: 5, length 8, label: exp-null, tunnel parameters: D506 0606 EEFF 0001
      rx pathid: 0, tx pathid: 0x0

Because the default MDT is working (if it were not, none of the C-MRIB state would be exchanged), we need not validate it in the core again. It is a basic PIM-Bidir configuration with XRv4 as the root and the core has no knowledge about the MDT traffic at all. At this point, we begin sending traffic from CSR4. We check packet counters on the ingress/egress PEs to ensure it is working.

R6#show ipv6 mroute vrf MC FF33::1 2001:10:4:6::4 count | include HW
    HW Forwarding: 13/0/118/0, Other: 0/0/0

R7#show ipv6 mroute vrf MC FF33::1 2001:10:4:6::4 count | include HW
    HW Forwarding: 25/1/142/1, Other: 0/0/0

R8#show ipv6 mroute vrf MC FF33::1 2001:10:4:6::4 count | include HW
    HW Forwarding: 31/1/142/1, Other: 0/0/0

A quick EPC effort on CSR8 from the core and to the customer shows the encapsulations plainly. In summary, this profile is almost identical to profile 0 except that it replaces IPv4 MDT with IPv4/v6 MVPN using BGP AD. This is the “new way” of configuring PIM/GRE-based MVPN.

R8#show monitor capture CAP buffer detailed
11  142  1.998993  213.6.6.6 -> 238.255.0.1  GRE
0000: 01005E7F 00010050 56A9DE77 8100CE00   ..^....PV..w....
0010: 08004500 007C2143 0000FD2F D202D506   ..E..|!C.../....
0020: 0606EEFF 00010000 86DD6000 0000003C   ..........`....<
0030: 3A3F2001 00100004 00060000 00000000   :? .............

12  118  1.998993  2001:*:0004 -> FF33:*:0001  IPv6-ICMP
0000: 33330000 00010050 56A9FB1C 81000DD2   33.....PV.......
0010: 86DD6000 0000003C 3A3F2001 00100004   ..`....
0020: 00060000 00000000 0004FF33 00000000
0030: 00000000 00000000 00018000 9A970661

     Network                                            Path
                                                        ?
   i                  213.5.5.5       0    100      0   ?
 * i                  213.6.6.6       0    100      0   ?
 * i                  213.8.8.8       0    100      0   ?

I’ve also highlighted the dynamic RT of 213.7.7.7:1. This was derived from the extended community carried within CSR7’s VPNv4 route to the C-source. We can verify that below, and all routers use this in their Type-7 joins to target CSR7.

R5#show bgp vpnv4 unicast vrf MC 10.2.7.0/24
BGP routing table entry for 213:5:10.2.7.0/24, version 69
Paths: (1 available, best #1, table MC)
  Not advertised to any peer
  Refresh Epoch 1
  Local, imported path from 213:7:10.2.7.0/24 (global)
    213.7.7.7 (metric 20) (via default) from 213.12.12.12 (213.12.12.12)
      Origin incomplete, metric 0, localpref 100, valid, internal, best
      Extended Community: RT:213:7 MVPN AS:213:0.0.0.0 MVPN VRF:213.7.7.7:1
      Originator: 213.7.7.7, Cluster list: 213.12.12.12
      mpls labels in/out nolabel/7000
      rx pathid: 0, tx pathid: 0x0

CSR7 also originates a Type-5 Source Active AD message to notify others of an active multicast sender. Notice that the RT is a normal RT since this is an AD route, not a BGP c-mcast signaling route. ! CSR7 PIM(1): Add Lspvif0/0.0.0.0 to (10.2.7.2, 225.0.0.1), Forward state, by BGP SG Join BGP(15): nettable_walker [5][213:7][10.2.7.2][225.0.0.1]/18 route sourced locally BGP(15): delete RIB route [5][213:7][10.2.7.2][225.0.0.1]/18 BGP(15): 213.12.12.12 NEXT_HOP self is set for sourced RT Filter for net [5][213:7][10.2.7.2][225.0.0.1]/18 BGP(15): (base) 213.12.12.12 send UPDATE (format) [5][213:7][10.2.7.2][225.0.0.1]/18, next 213.7.7.7, metric 0, path Local, extended community RT:213:7

CSR7 also receives the Type-7 SPT join, which is technically the best-path from XRv2. They all contain the same information so it doesn’t matter which one the RR picks as best. In this case, CSR5’s route was best. Upon receipt, once the BGP route is processed and installed, PIM takes over and enters the C(S,G) state into the PIM topology (MRIB). This means the C-SPT has been built back to the source so multicast can now travel our mLDP P2MP tree. Technically this is not S-PMSI since no Type-3 S-PMSI AD routes were generated, and we did not configure “Data MDT” support for this group. The I-PMSI is still used, but the C-SPT is still invoked on the ingress PE. ! CSR7 BGP(15): 213.12.12.12 rcvd UPDATE w/ attr: nexthop 213.5.5.5, origin ?, localpref 100, metric 0, originator 213.5.5.5, clusterlist 213.12.12.12, extended community RT:213.7.7.7:1 BGP(15): 213.12.12.12 rcvd [7][213:7][213][10.2.7.2/32][225.0.0.1/32]/22 BGP(15): skip vrf default table RIB route [7][213:7][213][10.2.7.2/32][225.0.0.1/32]/22 BGP(15): add RIB route [7][213:7][213][10.2.7.2/32][225.0.0.1/32]/22 PIM(1): Join-list: (10.2.7.2/32, 225.0.0.1), S-bit set, BGP C-Route PIM(1): MDT next_hop change from: 0 to 7 for (10.2.7.2, 225.0.0.1) Lspvif0


PIM(1): Add Lspvif0/0.0.0.0 to (10.2.7.2, 225.0.0.1), Forward state, by BGP SG Join

CSR10 sends a register-stop and prunes itself from the C-SPT since it is not in the transit path. CSR7 receives the register stop and tears down the encapsulation tunnel used for sending the register messages. ! CSR10 PIM(0): Received v2 Register on GigabitEthernet2.550 from 10.2.7.7 for 10.2.7.2, group 225.0.0.1 PIM(0): Send v2 Register-Stop to 10.2.7.7 for 10.2.7.2, group 225.0.0.1 PIM(0): Insert (10.2.7.2,225.0.0.1) prune in nbr 10.5.10.5's queue PIM(0): Building Join/Prune packet for nbr 10.5.10.5 PIM(0): Adding v2 (10.2.7.2/32, 225.0.0.1), S-bit Prune PIM(0): Send v2 join/prune to 10.5.10.5 (GigabitEthernet2.550) ! CSR7 PIM(1): Received v2 Register-Stop on GigabitEthernet2.574 from 10.5.10.10 PIM(1): for source 10.2.7.2, group 225.0.0.1 PIM(1): Removing register encap tunnel (Tunnel2) as forwarding interface of (10.2.7.2, 225.0.0.1). PIM(1): Clear Registering flag to 10.5.10.10 for (10.2.7.2/32, 225.0.0.1)

Before verifying mLDP operation, we check the MRIB state and counters on the ingress PE. CSR7 shows traffic entering from the customer LAN and being forwarded out the PMSI. The big ‘G’ means the entry was installed due to receiving a BGP Type-7 SPT join, and the little ‘q’ means that router originated a Type-5 SA message for this C(S,G). R7#show ip mroute vrf MC 225.0.0.1 10.2.7.2 | begin \( (10.2.7.2, 225.0.0.1), 00:28:50/00:02:32, flags: FTGq Incoming interface: GigabitEthernet2.527, RPF nbr 0.0.0.0 Outgoing interface list: Lspvif0, Forward/Sparse, 00:28:49/stopped R7#show ip mroute vrf MC 225.0.0.1 10.2.7.2 count | section ^Group Group: 225.0.0.1, Source count: 1, Packets forwarded: 861, Packets received: 862 Source: 10.2.7.2/32, Forwarding: 861/0/117/0, Other: 862/1/0

The egress PEs should show incoming traffic on the PMSI and outgoing on their customer LANs. Their flags show a little ‘g’ indicating they sourced a BGP Type-7 SPT join and a big ‘Q’ indicating they received a Type-5 Source Active.
R8#show ip mroute vrf MC 225.0.0.1 10.2.7.2 | begin \(
(10.2.7.2, 225.0.0.1), 00:31:04/00:01:43, flags: JTgQ
 Incoming interface: Lspvif0, RPF nbr 213.7.7.7
 Outgoing interface list:
 GigabitEthernet2.538, Forward/Sparse, 00:31:04/00:02:52
R8#show ip mroute vrf MC 225.0.0.1 10.2.7.2 count | section ^Group
Group: 225.0.0.1, Source count: 1, Packets forwarded: 928, Packets received: 928
 Source: 10.2.7.2/32, Forwarding: 927/0/122/0, Other: 927/0/0
R1#show ip mroute vrf MC 225.0.0.1 10.2.7.2 | begin \(
(10.2.7.2, 225.0.0.1), 00:30:59/00:02:42, flags: JTgQ
 Incoming interface: Lspvif0, RPF nbr 213.7.7.7
 Outgoing interface list:
 GigabitEthernet2.519, Forward/Sparse, 00:30:59/00:02:44
R1#show ip mroute vrf MC 225.0.0.1 10.2.7.2 count | section ^Group
Group: 225.0.0.1, Source count: 1, Packets forwarded: 929, Packets received: 929
 Source: 10.2.7.2/32, Forwarding: 928/0/122/0, Other: 928/0/0
R6#show ip mroute vrf MC 225.0.0.1 10.2.7.2 | begin \(
(10.2.7.2, 225.0.0.1), 00:30:48/00:01:56, flags: JTgQ
 Incoming interface: Lspvif0, RPF nbr 213.7.7.7
 Outgoing interface list:
 GigabitEthernet2.546, Forward/Sparse, 00:30:48/00:02:23
R6#show ip mroute vrf MC 225.0.0.1 10.2.7.2 count | section ^Group
Group: 225.0.0.1, Source count: 1, Packets forwarded: 921, Packets received: 921
 Source: 10.2.7.2/32, Forwarding: 920/0/122/0, Other: 920/0/0

The PE facing the C-RP has interesting output. Since only two packets were sent in native multicast to the RP, the counter stops incrementing as the C(S,G) is pruned. This is the correct operation. The route was still installed from having sent a BGP Type-7 route (little ‘g’), but the Type-5 SA is not bound to this C(S,G) since CSR5 did not join the C-SPT. The “other drops” column is incrementing as the OIL is null.
R5#show ip mroute vrf MC 225.0.0.1 10.2.7.2 | begin \(
(10.2.7.2, 225.0.0.1), 00:30:55/00:02:08, flags: PTXg
 Incoming interface: Lspvif0, RPF nbr 213.7.7.7
 Outgoing interface list: Null
R5#show ip mroute vrf MC 225.0.0.1 10.2.7.2 count | section ^Group
Group: 225.0.0.1, Source count: 1, Packets forwarded: 3, Packets received: 928
 Source: 10.2.7.2/32, Forwarding: 2/0/122/0, Other: 927/1/924

Since the source is behind CSR7, we know that CSR7 is the root of the mLDP P2MP tree. I perform a quick EPC on CSR7 outbound towards XRv3 to ensure the proper label bindings are used. CSR7 should use label 93011 (0x16B53) to send MVPN traffic for VRF MC downstream towards XRv3. The packet capture shows this, along with the source/destination IP addresses highlighted.
R7#show mpls mldp database summary | include 213.7.7.7
7 P2MP 213.7.7.7 [gid 65536 (0x00010000)] 3
9 P2MP 213.7.7.7 [gid 131072 (0x00020000)] 3
R7#show mpls mldp database id 7 | section Replic
Replication client(s):
 MDT (VRF MC)
 Uptime : 01:42:45 Path Set ID : None
 Interface : Lspvif0
 213.13.13.13:0
 Uptime : 01:42:45 Path Set ID : None
 Out label (D) : 93011 Interface : GigabitEthernet2.573*
 Local label (U): None Next Hop : 213.7.13.13
 213.14.14.14:0
 Uptime : 01:42:40 Path Set ID : None
 Out label (D) : 94014 Interface : GigabitEthernet2.574*
 Local label (U): None Next Hop : 213.7.14.14

R7#show monitor capture CAP buffer detailed 5 122 83460.123986 00:50:56:A9:EA:77 -> 00:50:56:A9:DB:37 MPLS unicast 0000: 005056A9 DB370050 56A9EA77 81000DF5 .PV..7.PV..w.... 0010: 884716B5 31FE4500 006404EE 0000FE01 .G..1.E..d...... 0020: C5A50A02 0702E100 00010800 080A0000 ................ 0030: 04EE0000 0000057D 6BD5ABCD ABCDABCD .......}k.......
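The label in the capture can be checked by hand. The following is a minimal sketch, assuming the standard 32-bit MPLS label stack entry layout (20-bit label, 3-bit traffic class, 1-bit bottom-of-stack, 8-bit TTL); the function name is illustrative only. It decodes the four bytes that follow ethertype 0x8847 in the capture above.

```python
def decode_mpls_entry(word: int) -> dict:
    """Split one 32-bit MPLS label stack entry into its four fields."""
    return {
        "label": word >> 12,                 # top 20 bits
        "tc": (word >> 9) & 0x7,             # 3-bit traffic class (EXP)
        "bottom": bool((word >> 8) & 0x1),   # bottom-of-stack flag
        "ttl": word & 0xFF,                  # low 8 bits
    }


# Bytes 16B5 31FE follow ethertype 8847 in the captured frame.
entry = decode_mpls_entry(0x16B531FE)
print(entry)
```

The decoded label should match the 93011 binding that CSR7 learned from XRv3, and the bottom-of-stack bit being set is consistent with the IPv4 header (0x45) starting immediately afterwards.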

Since testing ASM is much more involved and the concepts are identical between IPv4 and IPv6, we will quickly test IPv6 SSM only. Our baseline configuration does not signal S-PMSI for our IPv6 SSM group FF33::1, but we will modify the ACLs on our PEs to account for it. This deviation allows us to add some variety to the testing.
! Modification to XE/XR PE routers (same syntax)
ipv6 access-list ACL_DATA_MDT_IPV6
 permit ipv6 any host FF33::1

Adding this ACL entry to CSR6 (ingress PE), with BGP IPv6 MVPN debugging enabled, shows the immediate creation of the Type-3 S-PMSI route. Technically, the data MDT configurations only need to be added to the ingress PE connected to sources, but I add it everywhere for completeness.
! CSR6
BGP(16): nettable_walker [3][213:6][2001:10:4:6::4][FF33::1][213.6.6.6]/46 route sourced locally
BGP(16): (base) 213.12.12.12 send UPDATE (format) [3][213:6][2001:10:4:6::4][FF33::1][213.6.6.6]/46, next 213.6.6.6, metric 0, path Local, extended community RT:213:6


As an interesting test, if you delete the ACL entry, the route is immediately withdrawn and downstream PEs switch back to using the default MDT (I-PMSI). The lack of S-PMSI advertisement by the “root” of the P2MP, in the case of mLDP, means that there is no selective behavior for this C(S,G). The ACL is corrected and the entry is re-added before continuing.
R6(config-ipv6-acl)#no permit ipv6 any host FF33::1
BGP: MVPN(16) deleting the local route [3][213:6][2001:10:4:6::4][FF33::1][213.6.6.6]/46
BGP(16): no valid path for [3][213:6][2001:10:4:6::4][FF33::1][213.6.6.6]/46
BGP(16): nettable_walker [3][213:6][2001:10:4:6::4][FF33::1][213.6.6.6]/46 no best path
BGP(16): delete RIB route [3][213:6][2001:10:4:6::4][FF33::1][213.6.6.6]/46
BGP(16): (base) 213.12.12.12 send unreachable (format) [3][213:6][2001:10:4:6::4][FF33::1][213.6.6.6]/46
BGP(16): 213.12.12.12 rcv UPDATE about [3][213:6][2001:10:4:6::4][FF33::1][213.6.6.6]/46 -- withdrawn

Looking at the route from XRv2’s perspective, it was received from CSR6 and advertised to the other PEs in the update-group. With an RT of 213:6, we know that CSR1 and CSR5 will not import this Type-3 route, but CSR7 and CSR8 should. There are two other key components of this Type-3 route worth discussing. The first is the tunnel-type, which is 2, just like the P2MP default MDTs. It’s the same delivery mechanism in the data plane, but a more “selective” tunnel as not all nodes in the MVPN instance are required to join it. The 0x20001 is the global-id (GID) which is 131073 in decimal. RP/0/0/CPU0:XRv2#show bgp ipv6 mvpn rd 213:6 [3][128][2001:10:4:6::4][128][ff33::1][213.6.6.6]/312 | begin Paths Paths: (1 available, best #1, not advertised to EBGP peer) Advertised to update-groups (with more than one peer): 0.2 Path #1: Received by speaker 0 Advertised to update-groups (with more than one peer): 0.2 Local, (Received from a RR-client) 213.6.6.6 (metric 20) from 213.6.6.6 (213.6.6.6) Origin incomplete, metric 0, localpref 100, valid, internal, best, group-best, import-candidate, not-in-vrf Received Path ID 0, Local Path ID 1, version 75 Community: no-export Extended community: RT:213:6 PMSI: flags 0x00, type 2, label 0, ID 0x06000104d5060606000701000400020001 RP/0/0/CPU0:XRv2#show bgp ipv6 mvpn update-group 0.2 summary | begin Neighbor Neighbor Spk AS MsgRcvd MsgSent TblVer InQ OutQ Up/Down St/PfxRcd 213.1.1.1 0 213 1894 1788 75 0 0 1d03h 1

213.5.5.5 0 213 1900 1797 75 0 0 1d03h 1
213.6.6.6 0 213 1909 1797 75 0 0 1d03h 2
213.7.7.7 0 213 1933 1788 75 0 0 1d03h 2
213.8.8.8 0 213 1970 1788 75 0 0 1d03h 3
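The PMSI tunnel ID in the Type-3 route can be unpacked to confirm the root and GID. This is a rough sketch, assuming the ID carries an mLDP FEC element laid out as in RFC 6388 (1-byte FEC type, 2-byte address family, 1-byte address length, root address, 2-byte opaque length, then one opaque TLV with a 1-byte type and 2-byte length). The function and key names are my own.

```python
import ipaddress
import struct


def decode_mldp_fec(hex_string: str) -> dict:
    """Decode an mLDP FEC element carrying a single opaque TLV."""
    raw = bytes.fromhex(hex_string)
    fec_type, addr_family, addr_len = struct.unpack_from("!BHB", raw, 0)
    root = ipaddress.ip_address(raw[4:4 + addr_len])
    offset = 4 + addr_len
    (opaque_len,) = struct.unpack_from("!H", raw, offset)
    opaque = raw[offset + 2:offset + 2 + opaque_len]
    opaque_type, value_len = struct.unpack_from("!BH", opaque, 0)
    value = opaque[3:3 + value_len]
    return {
        "fec_type": fec_type,          # 0x06 is P2MP in RFC 6388
        "address_family": addr_family, # 1 is IPv4
        "root": str(root),
        "opaque_type": opaque_type,
        "opaque_value": int.from_bytes(value, "big"),
    }


# Tunnel ID copied from the PMSI attribute shown on XRv2 above.
fec = decode_mldp_fec("06000104d5060606000701000400020001")
print(fec)
```

The decoded root and opaque value should line up with the root address and GID that the mLDP database on the PEs reports for the same tree.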

A quick check on CSR1, CSR5, CSR7, and CSR8 confirms the assertion that only the last two PEs listed import and utilize this Type-3 S-PMSI route. R1#show bgp ipv6 mvpn vrf MC | include \[3\] [no output] R5#show bgp ipv6 mvpn vrf MC | include \[3\] [no output] R7#show bgp ipv6 mvpn vrf MC | include \[3\] *>i [3][213:7][2001:10:4:6::4][FF33::1][213.6.6.6]/46 R7#show bgp ipv6 mvpn vrf MC | include \[3\] *>i [3][213:7][2001:10:4:6::4][FF33::1][213.6.6.6]/46

Even without the Type-3 route, CSR7 and CSR8 originated Type-7 SPT joins to CSR6 to request traffic for this C(S,G). The Type-3 route just allows the core to transport it more efficiently. The way to tell if the C-traffic is actually utilizing the data MDT is to check the C-MRIB entry. On CSR7, we can see the MDT with GID 131073 which we saw in the Type-3 route. We also see the big ‘Y’ flag which means this entry receives traffic from a data MDT (learned via Type-3 S-PMSI route). The little ‘g’ means it was installed by having sent a Type-7 SPT join. Because the GID is carried in the Type-3 route, it will be common for all subscribers to the S-PMSI. A quick check of the incoming interface and OIL shows us that the PMSI is upstream and the customer LAN is downstream, which is correct.
R7#show ipv6 mroute vrf MC FF33::1 | begin \(
(2001:10:4:6::4, FF33::1), 1d04h/never, flags: sTIYg
 Incoming interface: Lspvif0, MDT: [131073 (0x00020001), 213.6.6.6]/never
 RPF nbr: ::FFFF:213.6.6.6
 Immediate Outgoing interface list:
 GigabitEthernet2.527, Forward, 1d04h/never
R8#show ipv6 mroute vrf MC FF33::1 | begin \(
(2001:10:4:6::4, FF33::1), 1d05h/never, flags: sTIYg
 Incoming interface: Lspvif0, MDT: [131073 (0x00020001), 213.6.6.6]/never
 RPF nbr: ::FFFF:213.6.6.6
 Immediate Outgoing interface list:
 GigabitEthernet2.538, Forward, 1d05h/never

CSR6’s output is very similar but reversed in many ways. The flag cases have been inverted; little ‘y’ means the router originated a Type-3 S-PMSI route for this C(S,G) and is sending to an MDT data group. The big ‘G’ means it received a Type-7 SPT join for this C(S,G). Traffic enters from the customer LAN and is sent to the PMSI using MDT 131073 (global-id). The MDT is “TX” only, where the others are receive-only; mLDP P2MP trees are always unidirectional.
R6#show ipv6 mroute vrf MC FF33::1 | begin \(
(2001:10:4:6::4, FF33::1), 04:36:09/never, flags: sTyG
 Incoming interface: GigabitEthernet2.546
 RPF nbr: 2001:10:4:6::4
 MDT TX nr: 131073 (0x00020001) LSM-ID: 0x12
 Immediate Outgoing interface list:
 Lspvif0, Forward, 04:36:09/never
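Since the upper/lower-case flag pairs are easy to mix up, a small lookup helps when reading these entries. This is only a study aid, not a parser for real router output; the table holds just the six MVPN-related flags described in this section, and the wording of each meaning is my paraphrase.

```python
# MVPN-related mroute flags as described in this section (paraphrased).
MVPN_FLAGS = {
    "G": "received a BGP C-multicast (Type-6/7) route",
    "g": "sent a BGP C-multicast (Type-6/7) route",
    "Q": "received a Type-5 Source Active route",
    "q": "sent a Type-5 Source Active route",
    "Y": "receiving from a data MDT (S-PMSI)",
    "y": "sending to a data MDT (S-PMSI)",
}


def explain(flags: str) -> list[str]:
    """Return the meaning of each MVPN flag present; other flags are skipped."""
    return [f"{f}: {MVPN_FLAGS[f]}" for f in flags if f in MVPN_FLAGS]


# Flag strings taken from the ingress (CSR6) and egress (CSR7) entries.
for flag_string in ("sTyG", "sTIYg"):
    print(flag_string, explain(flag_string))
```

Running it against the ingress and egress flag strings makes the inversion obvious: the ingress PE sends to the data MDT and receives the join, while the egress PE does the opposite.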

Unlike ASM, there is no more BGP/C-PIM signaling that has to occur for traffic to flow. Once CSR4 starts sending traffic, CSR6 pushes the proper labels and sends it down the delivery tree. We will quickly trace the delivery tree to see if there is any obvious benefit. Remember that we already have several mLDP P2MP trees in the network, and we have just added another for S-PMSI supporting a C(S,G). Since P2MP trees are receiver-driven, we start with CSR7 and CSR8. Since we know the GID, we will use that as a filter to find the P2MP tree we want to trace. We find downstream label mappings of 7016 and 8016, one of which goes to XRv3 and one of which goes to XRv4.
R7#show mpls mldp database summary | include 131073
12 P2MP 213.6.6.6 [gid 131073 (0x00020001)] 1
R7#show mpls mldp database id 12 | section Upstream
Upstream client(s) :
 213.13.13.13:0 [Active]
 Expires : Never Path Set ID : 12
 Out Label (U) : None Interface : GigabitEthernet2.573*
 Local Label (D): 7016 Next Hop : 213.7.13.13
R8#show mpls mldp database summary | include 131073
12 P2MP 213.6.6.6 [gid 131073 (0x00020001)] 1
R8#show mpls mldp database id 12 | section Upstream
Upstream client(s) :
 213.14.14.14:0 [Active]
 Expires : Never Path Set ID : 12
 Out Label (U) : None Interface : GigabitEthernet2.584*
 Local Label (D): 8016 Next Hop : 213.8.14.14

Checking those two routers, we see that XRv4 uses label 8016 to send traffic to CSR8 and tells XRv3 to use label 94000 to send traffic towards itself. XRv3 acknowledges that label from XRv4 as well as the one from CSR7, and tells CSR6 to use label 93001 to send traffic towards it. Notice that XRv4 has one upstream and one downstream client, while XRv3 has one upstream and two downstream clients. RP/0/0/CPU0:XRv4#show mpls mldp database brief | include 131073 0x00010 P2MP 213.6.6.6 1 1 [global-id 131073]


RP/0/0/CPU0:XRv4#show mpls mldp database 0x00010
mLDP database
LSM-ID: 0x00010 Type: P2MP Uptime: 00:34:09
 FEC Root : 213.6.6.6
 Opaque decoded : [global-id 131073]
 Upstream neighbor(s) :
 213.13.13.13:0 [Active] Uptime: 00:34:09
 Local Label (D) : 94000
 Downstream client(s):
 LDP 213.8.8.8:0 Uptime: 00:34:09
 Next Hop : 213.8.14.8
 Interface : GigabitEthernet0/0/0/0.584
 Remote label (D) : 8016
RP/0/0/CPU0:XRv3#show mpls mldp database brief | include 131073
0x00010 P2MP 213.6.6.6 1 2 [global-id 131073]
RP/0/0/CPU0:XRv3#show mpls mldp database 0x00010
mLDP database
LSM-ID: 0x00010 Type: P2MP Uptime: 00:35:26
 FEC Root : 213.6.6.6
 Opaque decoded : [global-id 131073]
 Upstream neighbor(s) :
 213.6.6.6:0 [Active] Uptime: 00:35:26
 Local Label (D) : 93001
 Downstream client(s):
 LDP 213.7.7.7:0 Uptime: 00:35:26
 Next Hop : 213.7.13.7
 Interface : GigabitEthernet0/0/0/0.573
 Remote label (D) : 7016
 LDP 213.14.14.14:0 Uptime: 00:35:26
 Next Hop : 213.13.14.14
 Interface : GigabitEthernet0/0/0/0.534
 Remote label (D) : 94000

CSR6 is the root, using label 93001 to send traffic towards XRv3 for further distribution. Because all of the clients in the south-side of the network are requesting this data, the default MDT is identical to the data MDT. This is not an effective use of the data MDT since all PE’s are requesting traffic. From a design perspective, it would not make sense to signal S-PMSI for this flow. We will test it quickly to verify it works. We just recently checked the C-MRIB, and knowing that the C(S,G) state won’t change, we will just check the forwarding counters. R6#show ipv6 mroute vrf MC FF33::1 count | include HW HW Forwarding: 37/0/118/0, Other: 0/0/0 R7#show ipv6 mroute vrf MC FF33::1 count | include HW HW Forwarding: 38/0/126/0, Other: 0/0/0


R8#show ipv6 mroute vrf MC FF33::1 count | include HW HW Forwarding: 60/0/126/0, Other: 0/0/0

To show a more efficient design, we temporarily remove CSR3 as a client of this C-multicast feed. Upon doing this, CSR8 immediately withdraws the Type-7 SPT join, which makes sense since it no longer wants the traffic. There is no significant change on CSR6 since CSR7’s route was the best-path and there is still at least one client. ! CSR8 BGP[16] MVPN: withdraw c-route, type 7, bs len 0 asn=213, remote-rd=213:6, source=2001:10:4:6::4/16, group=FF33::1/16, nexthop=::FFFF:213.6.6.6, len left = 0 BGP: MVPN(16) deleting the local route [7][213:6][213][2001:10:4:6::4][FF33::1]/46 BGP(16): no valid path for [7][213:6][213][2001:10:4:6::4][FF33::1]/46 BGP(16): nettable_walker [7][213:6][213][2001:10:4:6::4][FF33::1]/46 no best path BGP(16): delete RIB route [7][213:6][213][2001:10:4:6::4][FF33::1]/46 BGP(16): (base) 213.12.12.12 send unreachable (format) [7][213:6][213][2001:10:4:6::4][FF33::1]/46

XRv4 also generates a syslog message saying that it has deleted a branch to CSR8. This is because CSR8 is no longer part of this P2MP S-PMSI tree. Because XRv4 has no clients, XRv3 prunes XRv4 from the tree also.
RP/0/0/CPU0:XRv4#mpls_ldp[1042]: %ROUTING-MLDP-5-BRANCH_DELETE : 0x00010 [global-id 131073] P2MP 213.6.6.6, Delete LDP 213.8.8.8:0 branch remote label 8016
RP/0/0/CPU0:XRv3#mpls_ldp[1042]: %ROUTING-MLDP-5-BRANCH_DELETE : 0x00010 [global-id 131073] P2MP 213.6.6.6, Delete LDP 213.14.14.14:0 branch remote label 94000

A quick check of XRv3’s LFIB shows that the traffic is only going to CSR7, as expected. Now the S-PMSI has some benefit since using the I-PMSI from CSR6 would send traffic to CSR8 also, only to be dropped. Assuming there was full IPv6 connectivity in the VPN, like IPv4, the I-PMSI would span all PEs. We also check the mLDP database to verify the LFIB is correct. By removing this one client from the S-PMSI, both XRv4 and CSR8 were removed from the S-PMSI delivery tree.
RP/0/0/CPU0:XRv3#show mpls forwarding labels 93001
Local  Outgoing    Prefix             Outgoing      Next Hop      Bytes
Label  Label       or ID              Interface                   Switched
------ ----------- ------------------ ------------- ------------- ------------
93001  7016        MLDP: 0x00010      Gi0/0/0/0.573 213.7.13.7    0


RP/0/0/CPU0:XRv3#show mpls mldp database 0x00010
mLDP database
LSM-ID: 0x00010 Type: P2MP Uptime: 00:47:29
 FEC Root : 213.6.6.6
 Opaque decoded : [global-id 131073]
 Upstream neighbor(s) :
 213.6.6.6:0 [Active] Uptime: 00:47:29
 Local Label (D) : 93001
 Downstream client(s):
 LDP 213.7.7.7:0 Uptime: 00:47:29
 Next Hop : 213.7.13.7
 Interface : GigabitEthernet0/0/0/0.573
 Remote label (D) : 7016
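The pruning behavior just traced can be modeled in a few lines. This is a toy sketch, not how mLDP is implemented: the S-PMSI tree is a dictionary mapping each upstream node to its downstream clients and the label used toward each, with the values taken from the outputs in this section. The helper names are my own.

```python
def leaves(tree: dict, node: str) -> set:
    """Return the set of leaf nodes reachable below `node`."""
    kids = tree.get(node)
    if not kids:
        return {node}
    found = set()
    for kid in kids:
        found |= leaves(tree, kid)
    return found


def prune(tree: dict, node: str, root: str) -> None:
    """Remove `node`, then remove any transit node left without clients."""
    parent = next((p for p, kids in tree.items() if node in kids), None)
    if parent is None:
        return
    del tree[parent][node]
    if not tree[parent] and parent != root:
        del tree[parent]
        prune(tree, parent, root)


# upstream node -> {downstream client: label used toward that client}
spmsi = {
    "CSR6": {"XRv3": 93001},
    "XRv3": {"CSR7": 7016, "XRv4": 94000},
    "XRv4": {"CSR8": 8016},
}

before = leaves(spmsi, "CSR6")
prune(spmsi, "CSR8", root="CSR6")   # CSR8 withdraws its interest
after = leaves(spmsi, "CSR6")
print(before, after, spmsi)
```

After CSR8 leaves, XRv4 has no downstream clients and drops out as well, which mirrors the two BRANCH_DELETE syslog messages and the single remaining client on XRv3.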

As always, we finish with looking at OAM. Because mLDP P2MP trees are unidirectional, we must test them from the root. This is awkward because it looks like we are pinging the root from the root, but in reality we are just identifying the mLDP FEC and opaque values. XE does not have a keyword for “global-id”. Referencing “draft-bishnoi-mpls-mldp-opaque-types-01.txt” indicates that this opaque type is 0x01, so we can use that. From CSR6, we can test one-way connectivity from CSR6 to all of its leaves on the I-PMSI. We see three trees rooted at CSR6. The first one is the I-PMSI for the IPv4 MVPN. We verify that the remaining 5 PEs are connected on this I-PMSI. The “ping” command requires the exact hex data of the GID since there isn’t a more graceful way to test this in XE currently.
R6#show mpls mldp database summary | include 213.6.6.6
8 P2MP 213.6.6.6 [gid 65536 (0x00010000)] 2
7 P2MP 213.6.6.6 [gid 131072 (0x00020000)] 2
12 P2MP 213.6.6.6 [gid 131073 (0x00020001)] 2
R6#ping mpls mldp p2mp 213.6.6.6 hex 0x01 00010000
p2mp Root node addr 213.6.6.6
Opaque type hex value (0x1), num hex digits 4
Sending 1, 72-byte MPLS Echos to Target FEC Stack TLV descriptor,
 timeout is 2.2 seconds, send interval is 0 msec, jitter value is 200
[snip]
Request #1
! reply addr 213.8.14.8
! reply addr 213.5.13.5
! reply addr 213.1.14.1
! reply addr 213.7.13.7
! reply addr 213.12.13.12
Round-trip min/avg/max = 22/105/158 ms
Received 5 replies
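Because XE wants the GID as raw hex, converting by hand is error-prone. A minimal sketch of the conversion follows, assuming the GID is a 32-bit value written as eight zero-padded hex digits, as the database summary suggests. The helper names are hypothetical, and the command string simply mirrors the form used in this section.

```python
def gid_to_hex(gid: int) -> str:
    """Format a 32-bit mLDP global-id as eight zero-padded hex digits."""
    if not 0 <= gid <= 0xFFFFFFFF:
        raise ValueError("global-id must fit in 32 bits")
    return f"{gid:08X}"


def xe_ping_command(root: str, gid: int) -> str:
    """Build the XE ping string for a P2MP tree with opaque type 0x01."""
    return f"ping mpls mldp p2mp {root} hex 0x01 {gid_to_hex(gid)}"


# The three GIDs rooted at CSR6 in the database summary.
for gid in (65536, 131072, 131073):
    print(xe_ping_command("213.6.6.6", gid))
```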


We can also quickly verify the IPv6 I-PMSI and IPv6 S-PMSI trees. We expect the former to have two clients (CSR7 and CSR8) while the latter only has one client since we removed the MLD join from CSR3 (only CSR7 is a client).
R6#ping mpls mldp p2mp 213.6.6.6 hex 0x01 00020000
p2mp Root node addr 213.6.6.6
Opaque type hex value (0x1), num hex digits 4
Sending 1, 72-byte MPLS Echos to Target FEC Stack TLV descriptor,
 timeout is 2.2 seconds, send interval is 0 msec, jitter value is 200
[snip]
Request #1
! reply addr 213.7.13.7
! reply addr 213.8.14.8
Round-trip min/avg/max = 58/96/134 ms
R6#ping mpls mldp p2mp 213.6.6.6 hex 0x01 00020001
p2mp Root node addr 213.6.6.6
Opaque type hex value (0x1), num hex digits 4
Sending 1, 72-byte MPLS Echos to Target FEC Stack TLV descriptor,
 timeout is 2.2 seconds, send interval is 0 msec, jitter value is 200
[snip]
Request #1
! reply addr 213.7.13.7
Round-trip min/avg/max = 103/103/103 ms

Quickly testing XRv2, we look at the IPv4 I-PMSI and IPv6 I-PMSI. XR has a nice keyword for “global-id”, but you can also do it the hard way using the hex data. I demonstrate both below. It doesn’t seem to work, though, possibly due to a GID mismatch or XRv LSM limitations. I do not troubleshoot this further. RP/0/0/CPU0:XRv2#show mpls mldp database brief | include 213.12.12.12 0x0000B P2MP 213.12.12.12 0 2 [global-id 1] 0x00005 P2MP 213.12.12.12 0 2 [global-id 262145] RP/0/0/CPU0:XRv2#ping mpls mldp p2mp 213.12.12.12 hex 0x1 00000001 Sending 1, 100-byte MPLS Echos to mldp p2mp 213.12.12.12 hex fec (0x1, 00000001), timeout is 2.2 seconds, send interval is 0 msec, jitter value is 200 msec: [snip] Request #1 . RP/0/0/CPU0:XRv2#ping mpls mldp p2mp 213.12.12.12 global-id 262145


Sending 1, 100-byte MPLS Echos to mldp p2mp 213.12.12.12 global-id 262145, timeout is 2.2 seconds, send interval is 0 msec, jitter value is 200 msec: [snip] Request #1 .

Additional Reading – Reference configurations “mvpn-12”

28.11 Profile 13: Default MDT − MLDP − MP2MP − BGP−AD − BGP C−mcast Signaling

For this test, begin with the MVPN Profile 9 configuration where all of the MVPN BGP AD is set up already, plus the MP2MP/P2MP trees are built. XRv4 is the MP2MP root and there are some other P2MP roots related to SSM joins. Enabling BGP C-multicast signaling is very easy, and the snippets are below.
! Enable BGP c-mcast signaling in XE PEs
vrf definition MC
 address-family ipv4
 mdt overlay use-bgp
 address-family ipv6
 mdt overlay use-bgp
! Enable BGP c-mcast signaling in XRv2
router pim vrf MC
 address-family ipv4
 mdt c-multicast-routing bgp
 address-family ipv6
 mdt c-multicast-routing bgp

Pitfall: There does not appear to be a way to stop XRv from sending PIM hellos over the emulated LAN. Every router will still see XRv2 as a PIM neighbor. The real power of using BGP AD and C-mcast signaling together is that now (with the exception of the XRv issue) there is no PIM anywhere in the carrier network, other than PE-CE links. First, let’s ensure some of the remote PEs still know who the RP is. After all, there is no PIM adjacency on the LSPvif0 anymore, so there must be a mechanism for the BSR messages to flow. CSR6, for example, still learns the RP. The debug confirms it came from the LSPvif0 interface.
R6#show ip pim vrf MC rp mapping
PIM Group-to-RP Mappings
Group(s) 224.0.0.0/4
 RP 10.5.10.10 (?), v2
 Info source: 10.5.10.10 (?), via bootstrap, priority 0, holdtime 150
 Uptime: 00:53:56, expires: 00:02:14


R6#debug ip pim vrf MC PIM(1): Received v2 Bootstrap on Lspvif0 from 213.5.5.5 (1): pim_add_prm:: 224.0.0.0/240.0.0.0, rp=10.5.10.10, repl = 0, ver =2, is_neg =0, bidir = 0, crp = 0 PIM(1): Update prm_rp->bidir_mode = 0 vs bidir = 0 (224.0.0.0/4, RP:10.5.10.10), PIMv2

Double checking RPF and PIM neighbors, we clearly see the correct RPF to CSR5, but no PIM neighbor with CSR5. This seems to violate some fundamental PIM rules.
R6#show ip rpf vrf MC 10.5.10.10
RPF information for ? (10.5.10.10)
 RPF interface: Lspvif0
 RPF neighbor: ? (213.5.5.5)
 RPF route/mask: 10.5.10.0/24
 RPF type: unicast (bgp 213)
 Doing distance-preferred lookups across tables
 RPF topology: ipv4 multicast base, originated from ipv4 unicast base
R6#show ip pim vrf MC neighbor lspvif 0 | begin ^Neighbor
Neighbor      Interface  Uptime/Expires     Ver  DR
Address                                          Prio/Mode
213.12.12.12  Lspvif0    01:43:02/00:01:34  v2   1 / DR P G

This is where the Type-1 AD route is key. Remember that because all the PEs have discovered one another via the I-PMSI BGP AD routes, they know to use the bidirectional MP2MP tree. The Type-1 route signaled this with the tunnel-type 7 option and by including the root address in the tunnel parameters, as discussed in Profile 9. Thus, ASM can still work with this design even without PIM in the overlay. Using MP2MP root 213.14.14.14 (0xD50E0E0E), we can reach CSR5.
R5#show bgp ipv4 mvpn vrf MC route-type 1 213.5.5.5
BGP routing table entry for [1][213:5][213.5.5.5]/12, version 54
Paths: (1 available, best #1, table MVPNv4-BGP-Table, not advertised to EBGP peer)
 Advertised to update-groups:
 1
 Refresh Epoch 1
 Local
 0.0.0.0 from 0.0.0.0 (213.5.5.5)
 Origin incomplete, localpref 100, weight 32768, valid, sourced, local, best
 Community: no-export
 Extended Community: RT:213:5
 PMSI Attribute: Flags: 0x0, Tunnel type: 7, length 24, label: exp-null, tunnel parameters: 0800 0104 D50E 0E0E 000E 0200 0B00 0213 0000 0213 0000 0000
 rx pathid: 0, tx pathid: 0x0


Also recall that CSR4 is issuing IGMP joins for an ASM group 225.0.0.1 and an SSM group 232.0.0.5. In order to send the C(*,G) join to the C-RP, CSR6 must know the C-RP address first, which was just discussed. Once it knows that, it generates a Type 6 Shared-Tree Join route which is directly comparable to a PIM (*,G) join. This route information contains many pieces of information, and the RP address is included as the “source”. In this way, it looks just like the Type 7 Source-Tree Join route which is examined later. R6#show bgp ipv4 mvpn vrf MC route-type 6 213:5 213 10.5.10.10 225.0.0.1 BGP routing table entry for [6][213:5][213][10.5.10.10/32][225.0.0.1/32]/22, version 56 Paths: (1 available, best #1, table MVPNv4-BGP-Table) Advertised to update-groups: 1 Refresh Epoch 1 Local 0.0.0.0 from 0.0.0.0 (213.6.6.6) Origin incomplete, localpref 100, weight 32768, valid, sourced, local, best Extended Community: RT:213.5.5.5:2 rx pathid: 0, tx pathid: 0x0

Looking at this BGP Type-6 route on CSR5, we can see it was iBGP learned from CSR6 through the routereflector, XRv2. The RT is the MDT source address of the PE who is closest to the RP, in this case, and the “2” at the end is a random number used to differentiate the RTs for different C(*,G) joins. This allows CSR6 to “target” CSR5; since XRv2 reflects this route to many peers, those other peers should not import it. R5#show bgp ipv4 mvpn vrf MC route-type 6 213:5 213 10.5.10.10 225.0.0.1 BGP routing table entry for [6][213:5][213][10.5.10.10/32][225.0.0.1/32]/22, version 66 Paths: (1 available, best #1, table MVPNv4-BGP-Table) Not advertised to any peer Refresh Epoch 1 Local 213.6.6.6 (metric 20) from 213.12.12.12 (213.12.12.12) Origin incomplete, metric 0, localpref 100, valid, internal, best Extended Community: RT:213.5.5.5:2 Originator: 213.6.6.6, Cluster list: 213.12.12.12 rx pathid: 0, tx pathid: 0x0
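The “targeting” effect of the per-PE route target can be illustrated with a simple membership check. This is a conceptual sketch only: the import RTs for CSR5 and CSR7 are taken from outputs in this section, while the values for CSR1 and CSR8 are hypothetical placeholders that follow the same pattern, and the function name is my own.

```python
# C-multicast import RT per PE. CSR5 and CSR7 values appear in this
# section; the CSR1 and CSR8 values are hypothetical placeholders.
C_MCAST_IMPORT_RT = {
    "CSR1": {"213.1.1.1:2"},   # hypothetical
    "CSR5": {"213.5.5.5:2"},
    "CSR7": {"213.7.14.7:2"},
    "CSR8": {"213.8.8.8:2"},   # hypothetical
}


def importers(route_rt: str, import_rts: dict) -> list:
    """Return the PEs whose import RT set contains the route's RT."""
    return sorted(pe for pe, rts in import_rts.items() if route_rt in rts)


# The Type-6 route from CSR6 carries RT:213.5.5.5:2.
print(importers("213.5.5.5:2", C_MCAST_IMPORT_RT))
```

Even though the route reflector sends the Type-6 route to every client, only the PE that owns the matching RT accepts it, which is what lets a C-multicast join behave like a directed PIM join.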

With this route in the BGP RIB, the MRIB process for this VPN also needs to update itself so multicast traffic can flow. The C(*,G) entry looks normal, with the exception of the big “G” flag, which means the entry was installed by a received BGP c-mcast route. Normally, MRIB entries are the results of PIM signaling (joins, prunes, etc). In this case, the MRIB entry was BGP-signaled.


R5#show ip mroute vrf MC sparse (*, 225.0.0.1), 00:00:37/00:02:22, RP 10.5.10.10, flags: SG Incoming interface: GigabitEthernet2.550, RPF nbr 10.5.10.10 Outgoing interface list: Lspvif0, Forward/Sparse, 00:00:37/00:02:22

This raises the question of why CSR5 only has one Type-6 route when CSR1 and CSR8 should have also generated these routes. The answer is that XRv2, as an RR, only advertises its best route. CSR6 is best because the IGP metric to the BGP next-hop is lowest. I’ve highlighted the show command because it is very long. This “loss” of information does not negatively affect forwarding; the LSPvif is in the OIL of the C(*,G) if at least one BGP Type-6 route is received. It does not matter which one, and BGP can be used to help scale MVPNs in this way. XRv2 should have generated one too, but I don’t think it is fully supported.
RP/0/0/CPU0:XRv2#show bgp ipv4 mvpn rd 213:5 [6][213:5][213][32][10.5.10.10][32][225.0.0.1]/184 bestpath-compare
BGP routing table entry for [6][213:5][213][32][10.5.10.10][32][225.0.0.1]/184, Route Distinguisher: 213:5
Versions:
 Process bRIB/RIB SendTblVer
 Speaker 52 52
Paths: (3 available, best #2)
 Advertised to update-groups (with more than one peer):
 0.2
 Path #1: Received by speaker 0
 Not advertised to any peer
 Local, (Received from a RR-client)
 213.1.1.1 (metric 40) from 213.1.1.1 (213.1.1.1)
 Origin incomplete, metric 0, localpref 100, valid, internal, import-candidate, not-in-vrf, import suspect
 Received Path ID 0, Local Path ID 0, version 0
 Extended community: RT:213.5.5.5:2
 Higher IGP metric than best path (path #2)
 Path #2: Received by speaker 0
 Advertised to update-groups (with more than one peer):
 0.2
 Local, (Received from a RR-client)
 213.6.6.6 (metric 20) from 213.6.6.6 (213.6.6.6)
 Origin incomplete, metric 0, localpref 100, valid, internal, best, group-best, import-candidate, not-in-vrf, import suspect
 Received Path ID 0, Local Path ID 1, version 52
 Extended community: RT:213.5.5.5:2
 best of local AS, Overall best
 Path #3: Received by speaker 0
 Not advertised to any peer
 Local, (Received from a RR-client)
 213.8.8.8 (metric 40) from 213.8.8.8 (213.8.8.8)
 Origin incomplete, metric 0, localpref 100, valid, internal, import-candidate, not-in-vrf, import suspect
 Received Path ID 0, Local Path ID 0, version 0
 Extended community: RT:213.5.5.5:2
 Higher IGP metric than best path (path #2)
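The bestpath-compare output boils down to a single tie-breaker here. The sketch below is deliberately simplified: it assumes all earlier best-path attributes (weight, local preference, AS path, origin, MED) are equal, as they are for these three paths, and compares only the IGP metric to the next-hop. It is not a full BGP decision process, and the function name is my own.

```python
# (next-hop, IGP metric to next-hop) for the three Type-6 paths on XRv2.
PATHS = [
    ("213.1.1.1", 40),
    ("213.6.6.6", 20),
    ("213.8.8.8", 40),
]


def best_by_igp_metric(paths: list) -> str:
    """Pick the next-hop with the lowest IGP metric (all else being equal)."""
    next_hop, _metric = min(paths, key=lambda path: path[1])
    return next_hop


best = best_by_igp_metric(PATHS)
print(best)
```

The other two paths stay in XRv2’s table but are never reflected, so CSR5 only ever sees the single join from the selected originator.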

There is a lot of upcoming activity once CSR2 starts sending traffic. I have enabled IPv4 MVPN debugging on CSR7 (first hop router), CSR6 (last hop router), and CSR5 (PE towards RP). CSR7 sends a PIM register message, which is unicast, towards the RP. I enabled this debugging after the fact, so registration is complete, but we can see the registration process continuing so the RP is aware of the active source. ! CSR7 PIM(2): Send v2 Data-header Register to 10.5.10.10 for 10.2.7.2, group 225.0.0.1 PIM(2): Received v2 Register-Stop on GigabitEthernet2.574 from 10.5.10.10 PIM(2): for source 10.2.7.2, group 225.0.0.1 PIM(2): Clear Registering flag to 10.5.10.10 for (10.2.7.2/32, 225.0.0.1)

The RP tries to join the SPT and sends a C(S,G) join to CSR5. CSR5 then creates a BGP Type-7 route and sends it to the RR. RR rules still apply, and when XRv2 reflects it back, it gets rejected on ingress.
! CSR5
BGP[15] MVPN: add c-route, type 7, bs len 0 asn=0, rd=213:5, source=10.2.7.2/4, group=225.0.0.1/4, nexthop=213.7.7.7, len left = 0
BGP: MVPN(15) create local route [7][213:7][213][10.2.7.2/32][225.0.0.1/32]/22
BGP(15): nettable_walker [7][213:7][213][10.2.7.2/32][225.0.0.1/32]/22 route sourced locally
BGP(15): 213.12.12.12 NEXT_HOP self is set for sourced RT Filter for net [7][213:7][213][10.2.7.2/32][225.0.0.1/32]/22
BGP(15): (base) 213.12.12.12 send UPDATE (format) [7][213:7][213][10.2.7.2/32][225.0.0.1/32]/22, next 213.5.5.5, metric 0, path Local, extended community RT:213.7.14.7:2
BGP(15): 213.12.12.12 rcv UPDATE w/ attr: nexthop 213.5.5.5, origin ?, localpref 100, metric 0, originator 213.5.5.5, clusterlist 213.12.12.12, merged path , AS_PATH , community , extended community RT:213.7.14.7:2, SSA attribute
BGP(15): 213.12.12.12 rcv UPDATE about [7][213:7][213][10.2.7.2/32][225.0.0.1/32]/22 -- DENIED due to: ORIGINATOR is us; MP_REACH NEXTHOP is our own address;

Upon receiving this Type-7 Source-Tree Join, CSR7 issues a Type-5 Source Active AD route. This announces that there is a source actively sending traffic behind the originating PE. CSR7 adds the Type-7 to its MVPN RIB, then generates the Type 5 route and sends it to the RR.

! CSR7 BGP(15): 213.12.12.12 rcvd UPDATE w/ attr: nexthop 213.5.5.5, origin ?, localpref 100, metric 0, originator 213.5.5.5, clusterlist 213.12.12.12, extended community RT:213.7.14.7:2 BGP(15): 213.12.12.12 rcvd [7][213:7][213][10.2.7.2/32][225.0.0.1/32]/22 BGP(15): skip vrf default table RIB route [7][213:7][213][10.2.7.2/32][225.0.0.1/32]/22 BGP(15): add RIB route [7][213:7][213][10.2.7.2/32][225.0.0.1/32]/22 BGP(15): nettable_walker [5][213:7][10.2.7.2][225.0.0.1]/18 route sourced locally BGP(15): delete RIB route [5][213:7][10.2.7.2][225.0.0.1]/18 BGP(15): 213.12.12.12 NEXT_HOP self is set for sourced RT Filter for net [5][213:7][10.2.7.2][225.0.0.1]/18 BGP(15): (base) 213.12.12.12 send UPDATE (format) [5][213:7][10.2.7.2][225.0.0.1]/18, next 213.7.7.7, metric 0, path Local, extended community RT:213:7

When CSR5 receives this Source Active message, it installs it in the RIB. ! CSR5 BGP(15): 213.12.12.12 rcvd UPDATE w/ attr: nexthop 213.7.7.7, origin ?, localpref 100, metric 0, originator 213.7.14.7, clusterlist 213.12.12.12, community no-export, extended community RT:213:7 BGP(15): 213.12.12.12 rcvd [5][213:7][10.2.7.2][225.0.0.1]/18 BGP(15): skip vrf default table RIB route [5][213:7][10.2.7.2][225.0.0.1]/18 BGP(15): add RIB route [5][213:5][10.2.7.2][225.0.0.1]/18
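The trailing /22, /18, and /12 values in the IOS debugs above are NLRI lengths in bytes. As a rough sketch, assuming the field layouts defined for MVPN routes in RFC 6514 (8-byte RD, 4-byte source AS, 1-byte length prefix before each C-S and C-G address, 4-byte originator address), the numbers can be reproduced as follows. The function and constant names are illustrative only.

```python
# Field sizes in bytes, assuming the MVPN NLRI layouts from RFC 6514.
RD_LEN = 8          # route distinguisher
SOURCE_AS_LEN = 4   # source AS field carried in Type 6/7 routes
ORIGINATOR_LEN = 4  # originating router's IPv4 address (Type 1/3)


def addr_len(afi):
    """Bytes in one customer address: 4 for IPv4, 16 for IPv6."""
    return {"ipv4": 4, "ipv6": 16}[afi]


def nlri_length(route_type, afi="ipv4"):
    """Return the route-type-specific NLRI length in bytes."""
    # C-S and C-G are each preceded by a 1-byte length field.
    sg = 2 * (1 + addr_len(afi))
    if route_type == 1:        # Intra-AS I-PMSI AD
        return RD_LEN + ORIGINATOR_LEN
    if route_type == 3:        # S-PMSI AD
        return RD_LEN + sg + ORIGINATOR_LEN
    if route_type == 5:        # Source Active AD
        return RD_LEN + sg
    if route_type in (6, 7):   # Shared Tree Join / Source Tree Join
        return RD_LEN + SOURCE_AS_LEN + sg
    raise ValueError("route type not covered by this sketch")


for rtype in (1, 3, 5, 7):
    print(rtype, nlri_length(rtype, "ipv4"), nlri_length(rtype, "ipv6"))
```

The IOS-XR outputs in this section show /184 and /376 for the same Type 6/7 routes. Those values appear to be bit counts that include one extra octet, since (22 + 1) * 8 = 184 and (46 + 1) * 8 = 376, but that reading is an inference from the numbers rather than something the outputs state.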

Like CSR10, when CSR4 sees multicast traffic, it immediately tries to switch to the SPT and issues a PIM (S,G) join to CSR6. CSR6 then generates a Type 7 BGP route to indicate interest in this C(S,G). This goes to the RR, along with the Type 7 routes from CSR5, CSR1, and CSR8. All of their CEs do the same thing.
! CSR6
BGP[15] MVPN: add c-route, type 7, bs len 0 asn=0, rd=213:6, source=10.2.7.2/4, group=225.0.0.1/4, nexthop=213.7.7.7, len left = 0
BGP: MVPN(15) create local route [7][213:7][213][10.2.7.2/32][225.0.0.1/32]/22
BGP(15): nettable_walker [7][213:7][213][10.2.7.2/32][225.0.0.1/32]/22 route sourced locally
BGP(15): 213.12.12.12 NEXT_HOP self is set for sourced RT Filter for net [7][213:7][213][10.2.7.2/32][225.0.0.1/32]/22
BGP(15): (base) 213.12.12.12 send UPDATE (format) [7][213:7][213][10.2.7.2/32][225.0.0.1/32]/22, next 213.6.6.6, metric 0, path Local, extended community RT:213.7.14.7:2


CSR6 receives a Type 7 route from CSR5 for (10.2.7.2, 225.0.0.1). Clearly this source is NOT behind CSR6, and CSR6 should not accept this route. This is the purpose of the dynamic RTs; CSR6 is not importing RT:213.7.14.7:2, only CSR7 is, because the source is behind CSR7 only. The route is rejected. However, the Source Active AD route (Type-5) is accepted from the RR and installed. The RT on the Type 5 route, since this is an AD route (not C-mcast signaling, technically) is the standard unicast RT. ! CSR6 BGP(15): 213.12.12.12 rcvd UPDATE w/ attr: nexthop 213.5.5.5, origin ?, localpref 100, metric 0, originator 213.5.5.5, clusterlist 213.12.12.12, extended community RT:213.7.14.7:2 BGP(15): 213.12.12.12 rcvd [7][213:7][213][10.2.7.2/32][225.0.0.1/32]/22 -DENIED due to: extended community not supported; BGP(15): 213.12.12.12 rcvd UPDATE w/ attr: nexthop 213.7.7.7, origin ?, localpref 100, metric 0, originator 213.7.14.7, clusterlist 213.12.12.12, community no-export, extended community RT:213:7 BGP(15): 213.12.12.12 rcvd [5][213:7][10.2.7.2][225.0.0.1]/18 BGP(15): skip vrf default table RIB route [5][213:7][10.2.7.2][225.0.0.1]/18 BGP(15): add RIB route [5][213:6][10.2.7.2][225.0.0.1]/18
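The accept/deny behavior in these debugs comes down to a set intersection between the RTs on the route and the RTs the PE imports. The sketch below models only that check. The import sets are simplified from what the outputs show; CSR6's own dynamic RT is not displayed for IPv4 in this section, so a placeholder string stands in for it.

```python
def should_import(route_rts, import_rts):
    """A PE keeps a VPN route only if it carries at least one imported RT."""
    return bool(set(route_rts) & set(import_rts))


# Simplified import sets (assumptions for illustration).
# "RT:<CSR6-dynamic>" is a placeholder, not a value taken from the outputs.
csr6_imports = {"RT:213:7", "RT:<CSR6-dynamic>"}
# Only CSR7's dynamic RT matters for the C-mcast join.
csr7_imports = {"RT:213.7.14.7:2"}

type7_from_csr5 = ["RT:213.7.14.7:2"]  # C-mcast Source Tree Join
type5_from_csr7 = ["RT:213:7"]         # Source Active AD route

print("CSR6 imports Type-7:", should_import(type7_from_csr5, csr6_imports))
print("CSR6 imports Type-5:", should_import(type5_from_csr7, csr6_imports))
print("CSR7 imports Type-7:", should_import(type7_from_csr5, csr7_imports))
```

With these inputs, CSR6 drops the Type-7 and keeps the Type-5, while CSR7 keeps the Type-7, which matches the debug output.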

On CSR7, the Type-7 route is received and accepted; the dynamic RT is imported because the C-source 10.2.7.2 is behind CSR7. CSR7 originated the Type-5 Source Active message which indicated so. This is all logical. ! CSR7 BGP(15): 213.12.12.12 rcvd UPDATE w/ attr: nexthop 213.5.5.5, origin ?, localpref 100, metric 0, originator 213.5.5.5, clusterlist 213.12.12.12, extended community RT:213.7.14.7:2 BGP(15): 213.12.12.12 rcvd [7][213:7][213][10.2.7.2/32][225.0.0.1/32]/22 BGP(15): add RIB route [7][213:7][213][10.2.7.2/32][225.0.0.1/32]/22

As a quick check, let’s identify why CSR7 only saw, and subsequently installed, the Type-7 route from CSR5. This is because XRv2 only advertised the best path. CSR5’s path ties with CSR6’s on IGP metric (20) and wins on the lower router ID. As long as the first-hop router receives one Type 7, that is all that matters.
! XRv2
RP/0/0/CPU0:XRv2#show bgp ipv4 mvpn rd 213:7 [7][213:7][213][32][10.2.7.2][32][225.0.0.1]/184 bestpath-compare
BGP routing table entry for [7][213:7][213][32][10.2.7.2][32][225.0.0.1]/184, Route Distinguisher: 213:7
Versions:
  Process bRIB/RIB SendTblVer
  Speaker 57 57
Paths: (4 available, best #2)
  Advertised to update-groups (with more than one peer):
    0.2
  Path #1: Received by speaker 0
  Not advertised to any peer
  Local, (Received from a RR-client)
    213.1.1.1 (metric 40) from 213.1.1.1 (213.1.1.1)
      Origin incomplete, metric 0, localpref 100, valid, internal, import-candidate, not-in-vrf
      Received Path ID 0, Local Path ID 0, version 0
      Extended community: RT:213.7.14.7:2
      Higher IGP metric than best path (path #2)
  Path #2: Received by speaker 0
  Advertised to update-groups (with more than one peer):
    0.2
  Local, (Received from a RR-client)
    213.5.5.5 (metric 20) from 213.5.5.5 (213.5.5.5)
      Origin incomplete, metric 0, localpref 100, valid, internal, best, group-best, import-candidate, not-in-vrf
      Received Path ID 0, Local Path ID 1, version 57
      Extended community: RT:213.7.14.7:2
      best of local AS, Overall best
  Path #3: Received by speaker 0
  Not advertised to any peer
  Local, (Received from a RR-client)
    213.6.6.6 (metric 20) from 213.6.6.6 (213.6.6.6)
      Origin incomplete, metric 0, localpref 100, valid, internal, import-candidate, not-in-vrf
      Received Path ID 0, Local Path ID 0, version 0
      Extended community: RT:213.7.14.7:2
      Higher router ID than best path (path #2)
  Path #4: Received by speaker 0
  Not advertised to any peer
  Local, (Received from a RR-client)
    213.8.8.8 (metric 40) from 213.8.8.8 (213.8.8.8)
      Origin incomplete, metric 0, localpref 100, valid, internal, import-candidate, not-in-vrf
      Received Path ID 0, Local Path ID 0, version 0
      Extended community: RT:213.7.14.7:2
      Higher IGP metric than best path (path #2)
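The two bestpath-compare outputs in this section (Type 6 and Type 7) are decided by the same two tie-breakers: lowest IGP metric to the next-hop, then lowest router ID. A minimal sketch of just those two steps follows. It assumes every earlier step of the BGP decision process is tied, which holds for the paths shown, and it is not a full best-path implementation.

```python
import ipaddress


def best_path(paths):
    """Pick the path with the lowest IGP metric, then the lowest router ID.

    Only these two tie-breakers are modeled; all earlier steps of the
    BGP decision process are assumed to be equal.
    """
    return min(
        paths,
        key=lambda p: (p["igp_metric"],
                       int(ipaddress.IPv4Address(p["router_id"]))),
    )


# Values copied from the XRv2 outputs in this section.
type6_paths = [
    {"router_id": "213.1.1.1", "igp_metric": 40},
    {"router_id": "213.6.6.6", "igp_metric": 20},
    {"router_id": "213.8.8.8", "igp_metric": 40},
]
type7_paths = [
    {"router_id": "213.1.1.1", "igp_metric": 40},
    {"router_id": "213.5.5.5", "igp_metric": 20},
    {"router_id": "213.6.6.6", "igp_metric": 20},
    {"router_id": "213.8.8.8", "igp_metric": 40},
]

print("Type 6 best:", best_path(type6_paths)["router_id"])
print("Type 7 best:", best_path(type7_paths)["router_id"])
```

For the Type 6 route the metric alone decides (CSR6). For the Type 7 route CSR5 and CSR6 tie at metric 20, so the router ID breaks the tie in favor of CSR5.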

Let’s verify the Type-7 installation on CSR7. These parameters should match up with the debugs. The route is from CSR5 (doesn’t really matter) and describes C(10.2.7.2, 225.0.0.1). CSR7 also originates the Type-5 route. Again, notice the difference between the dynamic RT on the Type-7 and the standard RT on the Type-5. R7#show bgp ipv4 mvpn vrf MC route-type 7 213:7 213 10.2.7.2 225.0.0.1 BGP routing table entry for [7][213:7][213][10.2.7.2/32][225.0.0.1/32]/22, version 59 Paths: (1 available, best #1, table MVPNv4-BGP-Table) Not advertised to any peer Refresh Epoch 1 Local 213.5.5.5 (metric 20) from 213.12.12.12 (213.12.12.12)


Origin incomplete, metric 0, localpref 100, valid, internal, best Extended Community: RT:213.7.14.7:2 Originator: 213.5.5.5, Cluster list: 213.12.12.12 rx pathid: 0, tx pathid: 0x0 R7#show bgp ipv4 mvpn vrf MC route-type 5 10.2.7.2 225.0.0.1 BGP routing table entry for [5][213:7][10.2.7.2][225.0.0.1]/18, version 58 Paths: (1 available, best #1, table MVPNv4-BGP-Table, not advertised to EBGP peer) Advertised to update-groups: 1 Refresh Epoch 1 Local 0.0.0.0 from 0.0.0.0 (213.7.14.7) Origin incomplete, localpref 100, weight 32768, valid, sourced, local, best Community: no-export Extended Community: RT:213:7 rx pathid: 0, tx pathid: 0x0

Checking the same information on CSR5, we see the Type-5 was learned from CSR7 and installed, while the Type-7 was locally originated. R5#show bgp ipv4 mvpn vrf MC route-type 5 10.2.7.2 225.0.0.1 BGP routing table entry for [5][213:5][10.2.7.2][225.0.0.1]/18, version 69 Paths: (1 available, best #1, table MVPNv4-BGP-Table, not advertised to EBGP peer) Not advertised to any peer Refresh Epoch 1 Local, imported path from [5][213:7][10.2.7.2][225.0.0.1]/18 (global) 213.7.7.7 (metric 20) from 213.12.12.12 (213.12.12.12) Origin incomplete, metric 0, localpref 100, valid, internal, best Community: no-export Extended Community: RT:213:7 Originator: 213.7.14.7, Cluster list: 213.12.12.12 rx pathid: 0, tx pathid: 0x0 R5#show bgp ipv4 mvpn vrf MC route-type 7 213:7 213 10.2.7.2 225.0.0.1 BGP routing table entry for [7][213:7][213][10.2.7.2/32][225.0.0.1/32]/22, version 67 Paths: (1 available, best #1, table MVPNv4-BGP-Table) Advertised to update-groups: 1 Refresh Epoch 1 Local 0.0.0.0 from 0.0.0.0 (213.5.5.5) Origin incomplete, localpref 100, weight 32768, valid, sourced, local, best Extended Community: RT:213.7.14.7:2


rx pathid: 0, tx pathid: 0x0

CSR6, as the last hop router, acknowledges the Type-5 and originates a Type-7. CSR6’s Type-7 loses bestpath on XRv2 and is not advertised further, but it doesn’t matter since an identical C(S,G) route from CSR5 was advertised on. R6#show bgp ipv4 mvpn vrf MC route-type 5 10.2.7.2 225.0.0.1 BGP routing table entry for [5][213:6][10.2.7.2][225.0.0.1]/18, version 59 Paths: (1 available, best #1, table MVPNv4-BGP-Table, not advertised to EBGP peer) Not advertised to any peer Refresh Epoch 1 Local, imported path from [5][213:7][10.2.7.2][225.0.0.1]/18 (global) 213.7.7.7 (metric 20) from 213.12.12.12 (213.12.12.12) Origin incomplete, metric 0, localpref 100, valid, internal, best Community: no-export Extended Community: RT:213:7 Originator: 213.7.14.7, Cluster list: 213.12.12.12 rx pathid: 0, tx pathid: 0x0 R6#show bgp ipv4 mvpn vrf MC route-type 7 213:7 213 10.2.7.2 225.0.0.1 BGP routing table entry for [7][213:7][213][10.2.7.2/32][225.0.0.1/32]/22, version 57 Paths: (1 available, best #1, table MVPNv4-BGP-Table) Advertised to update-groups: 1 Refresh Epoch 1 Local 0.0.0.0 from 0.0.0.0 (213.6.6.6) Origin incomplete, localpref 100, weight 32768, valid, sourced, local, best Extended Community: RT:213.7.14.7:2 rx pathid: 0, tx pathid: 0x0

For completeness, we take a quick look at CSR1 and CSR8 also. Their process was identical to CSR6. The Type-7 routes (the C(S,G) joins) were sourced locally and the Type-5 route (Source Active) was learned via the RR with next-hop CSR7, the first-hop router. R1#show bgp ipv4 mvpn vrf MC route-type 5 10.2.7.2 225.0.0.1 | include ^BGP|from BGP routing table entry for [5][213:1][10.2.7.2][225.0.0.1]/18, version 25 Local, imported path from [5][213:7][10.2.7.2][225.0.0.1]/18 (global) 213.7.7.7 (metric 20) from 213.12.12.12 (213.12.12.12) R1#show bgp ipv4 mvpn vrf MC route-type 7 213:7 213 10.2.7.2 225.0.0.1 | include ^BGP|from BGP routing table entry for [7][213:7][213][10.2.7.2/32][225.0.0.1/32]/22, version 23


0.0.0.0 from 0.0.0.0 (213.1.1.1) R8#show bgp ipv4 mvpn vrf MC route-type 5 10.2.7.2 225.0.0.1 | include ^BGP|from BGP routing table entry for [5][213:8][10.2.7.2][225.0.0.1]/18, version 61 Local, imported path from [5][213:7][10.2.7.2][225.0.0.1]/18 (global) 213.7.7.7 (metric 20) from 213.12.12.12 (213.12.12.12) R8#show bgp ipv4 mvpn vrf MC route-type 7 213:7 213 10.2.7.2 225.0.0.1 | include ^BGP|from BGP routing table entry for [7][213:7][213][10.2.7.2/32][225.0.0.1/32]/22, version 59 0.0.0.0 from 0.0.0.0 (213.8.8.8)

Now that all the signaling was verified, we can check the multicast routing. First, we check CSR7 to ensure traffic is arriving from the VPN customer and going towards the core via LSPvif0. We see some new flags here. The big “G” means the entry was installed from a received BGP c-mcast route, in this case, a Type-7 Source-Tree Join from CSR5. The little ‘q’ means that this router issues a Type-5 Source Active route for this C(S,G). The packet counters for CSR7 continue to increase, indicating a functional data-plane. R7#show ip mroute vrf MC 225.0.0.1 10.2.7.2 | begin \( (10.2.7.2, 225.0.0.1), 00:01:25/00:01:37, flags: FTGq Incoming interface: GigabitEthernet2.527, RPF nbr 0.0.0.0 Outgoing interface list: Lspvif0, Forward/Sparse, 00:01:25/00:01:34 R7#show ip mroute vrf MC 225.0.0.1 10.2.7.2 count | begin ^Group Group: 225.0.0.1, Source count: 1, Packets forwarded: 87, Packets received: 88 Source: 10.2.7.2/32, Forwarding: 87/1/117/0, Other: 88/1/0

CSR6, CSR8, and CSR1 should have identical outputs with the exception of minor differences like packet counters, since I checked their outputs at different times. The flags are “inverted” on these entries: the little ‘g’ signifies that a BGP c-mcast route was sent representing this C(S,G), which is true because CSR6 did originate a Type-7 route for it. The big ‘Q’ means that a Type-5 Source Active route was received for this C(S,G). This is also true because CSR7 originated that message. Notice the counters are increasing. R6#show ip mroute vrf MC 225.0.0.1 10.2.7.2 | begin \( (10.2.7.2, 225.0.0.1), 00:01:02/00:01:57, flags: JTgQ Incoming interface: Lspvif0, RPF nbr 213.7.7.7 Outgoing interface list: GigabitEthernet2.546, Forward/Sparse, 00:01:02/00:02:06 R6#show ip mroute vrf MC 225.0.0.1 10.2.7.2 count | begin ^Group Group: 225.0.0.1, Source count: 1, Packets forwarded: 58, Packets received: 58


Source: 10.2.7.2/32, Forwarding: 57/1/122/0, Other: 57/0/0 R8#show ip mroute vrf MC 225.0.0.1 10.2.7.2 | begin \( (10.2.7.2, 225.0.0.1), 00:02:14/00:00:45, flags: JTgQ Incoming interface: Lspvif0, RPF nbr 213.7.7.7 Outgoing interface list: GigabitEthernet2.538, Forward/Sparse, 00:02:14/00:02:54 R8#show ip mroute vrf MC 225.0.0.1 10.2.7.2 count | begin ^Group Group: 225.0.0.1, Source count: 1, Packets forwarded: 131, Packets received: 131 Source: 10.2.7.2/32, Forwarding: 130/1/122/0, Other: 130/0/0 R1#show ip mroute vrf MC 225.0.0.1 10.2.7.2 | begin \( (10.2.7.2, 225.0.0.1), 00:02:04/00:00:54, flags: JTgQ Incoming interface: Lspvif0, RPF nbr 213.7.7.7 Outgoing interface list: GigabitEthernet2.519, Forward/Sparse, 00:02:04/00:02:32 R1#show ip mroute vrf MC 225.0.0.1 10.2.7.2 count | begin ^Group Group: 225.0.0.1, Source count: 1, Packets forwarded: 123, Packets received: 123 Source: 10.2.7.2/32, Forwarding: 122/1/122/0, Other: 122/0/0

Last, let’s check CSR5. Very few multicast packets should be shown here, since the C-RP will prune itself from the SPT when it realizes it is not in the transit path anymore. The big ‘P’ indicates a prune and the little ‘g’ indicates the entry is the result of a sent BGP c-mcast message, specifically a Type-7 SPT join. The LSPvif is still a “LAN” in terms of forwarding, because this traffic is using the MP2MP mLDP tree.
R5#show ip mroute vrf MC 225.0.0.1 10.2.7.2 | begin \(
(10.2.7.2, 225.0.0.1), 00:02:42/00:00:17, flags: PTXg
  Incoming interface: Lspvif0, RPF nbr 213.7.7.7
  Outgoing interface list: Null

R5#show ip mroute vrf MC 225.0.0.1 10.2.7.2 count | begin ^Group
Group: 225.0.0.1, Source count: 1, Packets forwarded: 5, Packets received: 154
  Source: 10.2.7.2/32, Forwarding: 4/0/122/0, Other: 153/1/148

Earlier examples showed us the MDT information inside the MRIB entries, but no longer. One way we know this traffic is using the MP2MP mLDP tree is the lack of an S-PMSI AD route (Type-3). When this route is present, as it is currently for SSM traffic, the P2MP root is encoded in the same manner as the MP2MP tunnel, except the tunnel type is changed to 2, which is mLDP P2MP. We explore this next. First, we will do an inbound capture on CSR6 to see if the downstream label 6001 is used. We select CSR6 because it participates in both a P2MP tree (SSM) and the MP2MP tree. Assuming we were confused as to which one was in use, we check the incoming label. If the label is 6013, that means the P2MP tree is used.

R6#show mpls mldp database summary
LSM ID  Type   Root          Decoded Opaque Value  Client Cnt.
1C      P2MP   213.7.7.7     [mdt 213:213 1]       1
19      MP2MP  213.14.14.14  [mdt 213:213 0]       1

R6#show mpls mldp database opaque_type mdt 213:213 0 | section Upstream
  Upstream client(s) :
    213.13.13.13:0 [Active]
      Expires : Never          Path Set ID : 1F
      Out Label (U) : 93004    Interface : GigabitEthernet2.563*
      Local Label (D): 6001    Next Hop : 213.6.13.13
R6#show mpls mldp database opaque_type mdt 213:213 1 | section Upstream
  Upstream client(s) :
    213.13.13.13:0 [Active]
      Expires : Never          Path Set ID : 23
      Out Label (U) : None     Interface : GigabitEthernet2.563*
      Local Label (D): 6013    Next Hop : 213.6.13.13

We use EPC to verify the packet, and also check the LFIB for sanity. Label 0x1771 is 6001. The counters for this LSP constantly increase.
R6#show monitor capture CAP buffer detailed
0 122 0.000000 00:50:56:A9:DB:37 -> 00:50:56:A9:DE:0D MPLS unicast
0000: 005056A9 DE0D0050 56A9DB37 81000DEB .PV....PV..7....
0010: 88470177 11FC4500 00641576 0000FE01 .G.w..E..d.v....
0020: B51D0A02 0702E100 00010800 906C001B .............l..
0030: 120A0000 00004E95 8D23ABCD ABCDABCD ......N..#......
R6#show mpls forwarding-table labels 6001
Local      Outgoing   Prefix              Bytes Label  Outgoing      Next Hop
Label      Label      or Tunnel Id        Switched     interface
6001  [T]  No Label   [mdt 213:213 0][V] \
                                          570724       aggregate/MC
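To confirm the label without reading the hex by eye, the first bytes of the captured frame can be decoded in a few lines. The sketch below assumes the frame is Ethernet with a single 802.1Q tag followed by exactly one MPLS label stack entry and an IPv4 header, which is what the capture above shows. The function name and dictionary keys are illustrative.

```python
import ipaddress
import struct

# First 48 bytes of the frame from the CSR6 capture, as hex.
FRAME_HEX = (
    "005056A9DE0D005056A9DB3781000DEB"
    "8847017711FC4500006415760000FE01"
    "B51D0A020702E10000010800906C001B"
)


def decode(frame):
    """Decode 802.1Q tag, one MPLS label stack entry, and IPv4 addresses."""
    tpid, tci, ethertype = struct.unpack("!HHH", frame[12:18])
    if tpid != 0x8100 or ethertype != 0x8847:
        raise ValueError("expected an 802.1Q tag followed by MPLS")
    (entry,) = struct.unpack("!I", frame[18:22])
    ip = frame[22:]  # IPv4 header starts right after the single label
    return {
        "vlan": tci & 0x0FFF,
        "label": entry >> 12,            # top 20 bits
        "tc": (entry >> 9) & 0x7,        # traffic class (EXP)
        "bottom_of_stack": (entry >> 8) & 0x1,
        "ttl": entry & 0xFF,
        "src": str(ipaddress.IPv4Address(ip[12:16])),
        "dst": str(ipaddress.IPv4Address(ip[16:20])),
    }


fields = decode(bytes.fromhex(FRAME_HEX))
for name, value in fields.items():
    print(name, value)
```

The decoded label is 6001 with the bottom-of-stack bit set, and the inner packet is from 10.2.7.2 to 225.0.0.1, so the ASM flow is arriving on the MP2MP tree's local label rather than on 6013.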

The process for SSM is a little more straightforward since there is no RP, and therefore no registration. Looking deeper at the Type-7 BGP SPT join, we see that for SSM, it is originated by both CSR6 and CSR8. Both of them have SSM clients (CSR3 and CSR4) that are requesting traffic for 232.0.0.5. On XRv2, we see both routes have an RD of 213:7. Note that CSR6 is the best path.
RP/0/0/CPU0:XRv2#show bgp ipv4 mvpn rd 213:7 route-type 7 | begin Network
   Network            Next Hop            Metric LocPrf Weight Path
Route Distinguisher: 213:7
*>i[7][213:7][213][32][10.2.7.2][32][232.0.0.5]/184
                      213.6.6.6                0    100      0 ?
* i                   213.8.8.8                0    100      0 ?

Normally, the RD of a VPN route is equal to that of the originator. In this case, it is set to the RD of the destination. The only way CSR6 and CSR8 can determine this is by using the BGP AD messaging. This is where the Type 1 I-PMSI message is used. CSR8 reports learning this Type 1 from CSR7 which simply carries the RD, VPN next-hop for BGP, and tunnel information (seen earlier). The BGP next-hop determines what the dynamic RT will be (shown shortly). R8#show bgp ipv4 mvpn rd 213:7 | begin Network Network Next Hop Metric LocPrf Weight Path Route Distinguisher: 213:7 *>i [1][213:7][213.7.7.7]/12 213.7.7.7 0 100 0 ? *>i [3][213:7][10.2.7.2][232.0.0.5][213.7.7.7]/22 213.7.7.7 0 100 0 ? *> [7][213:7][213][10.2.7.2/32][232.0.0.5/32]/22 0.0.0.0 32768 ?

Upon receiving it, CSR7 creates a Type 3 S-PMSI route that states it has a C(S,G) of (10.2.7.2, 232.0.0.5) reachable via PE 213.7.7.7. The tunnel type 2 indicates an mLDP P2MP tree and the root is 213.7.7.7, which is not the same as the MP2MP root of 213.14.14.14 seen in the Type-1 I-PMSI messages. The only reason CSR7 (ingress PE) generated this was because the ACL specified that this group was a candidate for data MDT construction. The Type 3 just informs egress PEs about the existing P2MP tree.
R7#show bgp ipv4 mvpn vrf MC route-type 3 10.2.7.2 232.0.0.5 213.7.7.7
BGP routing table entry for [3][213:7][10.2.7.2][232.0.0.5][213.7.7.7]/22, version 54
Paths: (1 available, best #1, table MVPNv4-BGP-Table, not advertised to EBGP peer)
  Advertised to update-groups:
     1
  Refresh Epoch 1
  Local
    0.0.0.0 from 0.0.0.0 (213.7.14.7)
      Origin incomplete, localpref 100, weight 32768, valid, sourced, local, best
      Community: no-export
      Extended Community: RT:213:7
      PMSI Attribute: Flags: 0x0, Tunnel type: 2, length 24, label: exp-null, tunnel parameters: 0600 0104 D507 0707 000E 0200 0B00 0213 0000 0213 0000 0001
      rx pathid: 0, tx pathid: 0x0
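The 24 bytes of tunnel parameters in the PMSI attribute can be unpacked to confirm the root and the MDT number. The sketch below assumes the mLDP P2MP FEC element layout from RFC 6388 (type, address family, address length, root address, opaque length, opaque value). The split of the opaque value into a 3-byte OUI, 4-byte VPN index, and 4-byte MDT number is my reading of the bytes, chosen because it lines up with the "[mdt 213:213 1]" entry in the mLDP database; treat it as an assumption.

```python
import ipaddress
import struct

# Tunnel parameters copied from the PMSI attribute on CSR7.
PARAMS = bytes.fromhex("06000104D5070707000E02000B0002130000021300000001")


def decode_p2mp_fec(data):
    """Decode an mLDP P2MP FEC element carrying one opaque TLV."""
    fec_type, afi, addr_len = struct.unpack("!BHB", data[:4])
    root = ipaddress.IPv4Address(data[4:4 + addr_len])
    off = 4 + addr_len
    (opaque_len,) = struct.unpack("!H", data[off:off + 2])
    opaque = data[off + 2:off + 2 + opaque_len]
    o_type, o_len = struct.unpack("!BH", opaque[:3])
    value = opaque[3:3 + o_len]
    # Assumed value layout: 3-byte OUI, 4-byte VPN index, 4-byte MDT number.
    oui = int.from_bytes(value[0:3], "big")
    index = int.from_bytes(value[3:7], "big")
    mdt_number = int.from_bytes(value[7:11], "big")
    return {
        "fec_type": fec_type,
        "afi": afi,
        "root": str(root),
        "opaque_len": opaque_len,
        "opaque_type": o_type,
        "vpn_id": f"{oui:x}:{index:x}",  # VPN IDs are displayed in hex
        "mdt_number": mdt_number,
    }


fec = decode_p2mp_fec(PARAMS)
for name, value in fec.items():
    print(name, value)
```

The decoded root is 213.7.7.7 and the MDT number is 1, which agrees with the P2MP entry on CSR6 and differs from the MP2MP default tree (root 213.14.14.14, MDT number 0).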

CSR8 then assigns the RD of CSR7 along with a dynamic RT so that only CSR7 imports it. Notice the RD on the Type 7 route below, along with the RT. This RT was learned via the regular VPNv4 unicast route as an extended community. This is also true for ASM but is detailed here instead. No extra configuration is needed on VPNv4/v6 to enable this, as long as extended communities are supported (they must be).
R8#show bgp ipv4 mvpn rd 213:7 route-type 7 213:7 213 10.2.7.2 232.0.0.5
BGP routing table entry for [7][213:7][213][10.2.7.2/32][232.0.0.5/32]/22, version 56
Paths: (1 available, best #1, table MVPNv4-BGP-Table)
  Advertised to update-groups:
     1
  Refresh Epoch 1
  Local
    0.0.0.0 from 0.0.0.0 (213.8.8.8)
      Origin incomplete, localpref 100, weight 32768, valid, sourced, local, best
      Extended Community: RT:213.7.14.7:2
      rx pathid: 0, tx pathid: 0x0
R8#show bgp vpnv4 unicast rd 213:7 10.2.7.2
BGP routing table entry for 213:7:10.2.7.0/24, version 27
Paths: (1 available, best #1, no table)
  Not advertised to any peer
  Refresh Epoch 1
  Local
    213.7.7.7 (metric 20) (via default) from 213.12.12.12 (213.12.12.12)
      Origin incomplete, metric 0, localpref 100, valid, internal, best
      Extended Community: RT:213:7 MVPN AS:213:0.0.0.0 MVPN VRF:213.7.14.7:2
      Originator: 213.7.14.7, Cluster list: 213.12.12.12
      mpls labels in/out nolabel/7019
      rx pathid: 0, tx pathid: 0x0
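The relationship between the two outputs can be stated in one small function: the "MVPN VRF:" extended community on the unicast route toward the C-source becomes the RT on the Type-7 route. This is a sketch of that mapping using the community strings as IOS prints them; the function name is hypothetical and a real implementation works on the encoded attribute, not on display strings.

```python
def dynamic_rt(unicast_ext_communities):
    """Return the RT for a Type-7 route, given the unicast route's communities.

    The 'MVPN VRF:' value on the unicast route toward the C-source is
    reused as the route target of the C-mcast join.
    """
    prefix = "MVPN VRF:"
    for community in unicast_ext_communities:
        if community.startswith(prefix):
            return "RT:" + community[len(prefix):]
    return None  # no VRF Route Import community present


# Communities copied from the VPNv4 and VPNv6 unicast routes in this section.
ipv4_route = ["RT:213:7", "MVPN AS:213:0.0.0.0", "MVPN VRF:213.7.14.7:2"]
ipv6_route = ["RT:213:6", "MVPN AS:213:0.0.0.0", "MVPN VRF:213.6.6.6:1"]

print(dynamic_rt(ipv4_route))
print(dynamic_rt(ipv6_route))
```

The results match the RTs seen on the IPv4 and IPv6 Type-7 routes, which is why only the PE that owns the C-source imports the join.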

When the join reaches CSR7, it is imported into the MVPN because it contained the proper RT. Only CSR6’s route makes it to CSR7 because XRv2 selected it as the best-path (seen earlier). R7#show bgp ipv4 mvpn vrf MC route-type 7 213:7 213 10.2.7.2 232.0.0.5 BGP routing table entry for [7][213:7][213][10.2.7.2/32][232.0.0.5/32]/22, version 55 Paths: (1 available, best #1, table MVPNv4-BGP-Table) Not advertised to any peer Refresh Epoch 1 Local 213.6.6.6 (metric 20) from 213.12.12.12 (213.12.12.12) Origin incomplete, metric 0, localpref 100, valid, internal, best Extended Community: RT:213.7.14.7:2 Originator: 213.6.6.6, Cluster list: 213.12.12.12 rx pathid: 0, tx pathid: 0x0

A quick check of the VPN MRIBs on CSR6 and CSR7 confirms our understanding. CSR6 claims the C(S,G) entry was installed by having sent a BGP route (Type 7), and CSR7 indicates the entry was installed by having received it from an interested receiver. This is indicated by the little ‘g’ and big ‘G’, respectively. CSR6 has joined a “data MDT”, meaning a specific tree built for this flow, and CSR7 is sending to this MDT. The data MDT directions are indicated by the big ‘Y’ and little ‘y’, respectively. CSR6’s RPF faces the PMSI interface while CSR7’s RPF faces the source. We know this is definitely not using the default MDT since the command shows an MDT ID of 1. Everything appears correct.
R6#show ip mroute vrf MC ssm
(10.2.7.2, 232.0.0.5), 1d01h/00:02:49, flags: sTIYg
  Incoming interface: Lspvif0, RPF nbr 213.7.7.7, MDT: [1, 213.7.7.7]/never
  Outgoing interface list:
    GigabitEthernet2.546, Forward/Sparse, 1d01h/00:02:49
R7#show ip mroute vrf MC ssm
(10.2.7.2, 232.0.0.5), 1d01h/stopped, flags: sTyG
  Incoming interface: GigabitEthernet2.527, RPF nbr 0.0.0.0
  Outgoing interface list:
    Lspvif0, Forward/Sparse, 1d01h/stopped
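Since the MVPN-related mroute flags come in upper/lower-case pairs that are easy to mix up, a small lookup helps when reading these outputs. The table below contains only the flags whose meaning is described in this section, paraphrased from the text; other flags in the strings are skipped rather than guessed.

```python
# Meanings paraphrased from this section; not the full IOS flag legend.
FLAG_MEANINGS = {
    "G": "entry installed from a received BGP C-mcast route",
    "g": "a BGP C-mcast route was sent for this C(S,G)",
    "Q": "a Type-5 Source Active route was received",
    "q": "this router originated a Type-5 Source Active route",
    "Y": "joined a data MDT",
    "y": "sending to a data MDT",
    "P": "pruned",
}


def explain(flags):
    """Return the known meanings for a flag string, in the order they appear."""
    return [FLAG_MEANINGS[f] for f in flags if f in FLAG_MEANINGS]


for router, flags in [("CSR7 ASM", "FTGq"), ("CSR6 ASM", "JTgQ"),
                      ("CSR5 ASM", "PTXg"), ("CSR6 SSM", "sTIYg"),
                      ("CSR7 SSM", "sTyG")]:
    print(router, flags, explain(flags))
```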

When CSR2 starts sending, there is no more BGP AD or c-mcast signaling that must occur, since the trees are already fully built. CSR7 reports sending traffic successfully out to the LSPvif. R7#show ip mroute vrf MC 232.0.0.5 count | begin ^Group Group: 232.0.0.5, Source count: 1, Packets forwarded: 96, Packets received: 96 Source: 10.2.7.2/32, Forwarding: 96/1/118/0, Other: 96/0/0

CSR6 and CSR8 also show it being sent towards their respective customers. Just for extra verification, I used EPC on egress on CSR6 to show the IP data being sent. Because it’s not encapsulated, we can see the nicely formatted IP addresses in the header without digging through the hex. I highlighted the IP protocol, source, and destination IP addresses as a sanity check.
R6#show ip mroute vrf MC 232.0.0.5 count | begin ^Group
Group: 232.0.0.5, Source count: 1, Packets forwarded: 76, Packets received: 76
  Source: 10.2.7.2/32, Forwarding: 76/1/122/0, Other: 76/0/0
R8#show ip mroute vrf MC 232.0.0.5 count | begin ^Group
Group: 232.0.0.5, Source count: 1, Packets forwarded: 69, Packets received: 69
  Source: 10.2.7.2/32, Forwarding: 69/1/122/0, Other: 69/0/0
R6#show monitor capture CAP buffer detailed
------------------------------------------------------------
#   size   timestamp   source        destination   protocol
------------------------------------------------------------
0   118    0.000000    10.2.7.2  ->  232.0.0.5     ICMP
0000: 01005E00 00050050 56A9DE0D 81000DDA ..^....PV.......
0010: 08004500 006416FA 0000FC01 AE950A02 ..E..d..........
0020: 0702E800 00050800 F9F4001C 00FF0000 ................
0030: 000053B3 2F87ABCD ABCDABCD ABCDABCD ..S./...........

Next, we will test IPv6. The concepts are identical to IPv4 but for completeness, we will test it. First, I recommend checking to ensure there are no PIM neighbors over the PMSI. XRv appears stubborn here, so we will ignore that.
R1#show ipv6 pim vrf MC neighbor | begin ^Neighbor
Neighbor Address      Interface  Uptime  Expires   Mode DR pri
::FFFF:213.12.12.12   Lspvif0    1d03h   00:01:36  B G  DR 1

R5#show ipv6 pim vrf MC neighbor | begin ^Neighbor
Neighbor Address      Interface  Uptime  Expires   Mode DR pri
::FFFF:213.12.12.12   Lspvif0    1d03h   00:01:18  B G  DR 1

R6#show ipv6 pim vrf MC neighbor | begin ^Neighbor
Neighbor Address      Interface  Uptime  Expires   Mode DR pri
::FFFF:213.12.12.12   Lspvif0    1d03h   00:01:17  B G  DR 1

R7#show ipv6 pim vrf MC neighbor | begin ^Neighbor
Neighbor Address      Interface  Uptime  Expires   Mode DR pri
FE80::2               Gi2.527    1d09h   00:01:37  B G  1
::FFFF:213.12.12.12   Lspvif0    1d08h   00:01:15  B G  DR 1

R8#show ipv6 pim vrf MC neighbor | begin ^Neighbor
Neighbor Address      Interface  Uptime  Expires   Mode DR pri
::FFFF:213.12.12.12   Lspvif0    1d08h   00:01:41  B G  DR 1

Checking XRv2 is a faster way to ensure none of the CSRs are sending PIM hellos onto the PMSI. RP/0/0/CPU0:XRv2#show pim vrf MC ipv6 neighbor | begin Lmdt LmdtMC Neighbor Address Uptime Expires DR pri DR Flags ::ffff:213.12.12.12* 1d08h 00:01:17 1 (DR)

P

A quick snapshot of the BGP IPv6 MVPN table is warranted. We check from CSR6; immediately we see far less information than we did for IPv4. The reason is twofold. First, the network is segmented with the top three PEs not having reachability to the bottom three PEs. Recall that these Type-1 I-PMSI routes have a standard unicast RT in accordance with the RT-export policies of the VRF. Normal RT rules dictate a router cannot accept VPN routes that are not imported locally. The debugs confirm this basic fact. Other dynamic RTs not destined to CSR6 are also rejected, but that is the normal operation for BGP c-mcast signaling.
R6#show bgp ipv6 mvpn vrf MC | begin Network
     Network          Next Hop            Metric LocPrf Weight Path
Route Distinguisher: 213:6 (default for vrf MC)
 *>   [1][213:6][213.6.6.6]/12
                       ::                                 32768 ?
 *>i  [1][213:6][213.7.7.7]/12
                       213.7.7.7                0    100      0 ?
 *>i  [1][213:6][213.8.8.8]/12
                       213.8.8.8                0    100      0 ?
 *>i  [7][213:6][213][2001:10:4:6::4][FF33::1]/46
                       213.7.7.7                0    100      0 ?
R6#debug bgp ipv6 mvpn updates in
R6#clear bgp ipv6 mvpn * soft
BGP(16): 213.12.12.12 rcvd [1][213:1][213.1.1.1]/12 -- DENIED due to: extended community not supported;
BGP(16): 213.12.12.12 rcvd [1][213:5][213.5.5.5]/12 -- DENIED due to: extended community not supported;
BGP(16): 213.12.12.12 rcvd [1][213:12][213.12.12.12]/12 -- DENIED due to: extended community not supported;

The table is also missing a Type-3 S-PMSI AD route for the SSM group FF33::1. This is potentially confusing, because for IPv4, once CSR7 received a Type-7 SPT join for the SSM group we tested, it immediately issued a Type-3 S-PMSI announcement. In this case, CSR6 receives a Type-7 indicating interest in the C(S,G) pair (2001:10:4:6::4, FF33::1) but does not originate a Type-3 S-PMSI AD route. This is because this particular IPv6 C(S,G) is not a candidate for data MDT optimization per the ACL rules in the VRF. As a quick test, if we add the C-G to the ACL, even without clearing anything, CSR6 immediately creates a Type-3 S-PMSI route to announce the presence of an S-PMSI (specifically, a tunnel-type 2 mLDP P2MP transport) for this group.
R6(config-ipv6-acl)#permit ipv6 any ff33::1/128
BGP(16): nettable_walker [3][213:6][2001:10:4:6::4][FF33::1][213.6.6.6]/46 route sourced locally
BGP(16): (base) 213.12.12.12 send UPDATE (format) [3][213:6][2001:10:4:6::4][FF33::1][213.6.6.6]/46, next 213.6.6.6, metric 0, path Local, extended community RT:213:6
BGP: 213.12.12.12 Next hop is our own address 213.6.6.6

Before continuing, the ACL entry is removed (not shown). A quick look on CSR1 confirms that PEs CSR6, CSR7, and CSR8 are outside of the MVPN, but PEs XRv2 and CSR5 are within it.
R1#show bgp ipv6 mvpn vrf MC | begin Network
     Network          Next Hop            Metric LocPrf Weight Path
Route Distinguisher: 213:1 (default for vrf MC)
 *>   [1][213:1][213.1.1.1]/12
                       ::                                 32768 ?
 *>i  [1][213:1][213.5.5.5]/12
                       213.5.5.5                0    100      0 ?
 *>i  [1][213:1][213.12.12.12]/12
                       213.12.12.12                  100      0 i

Looking specifically at IPv6 SSM, both CSR2 and CSR3 originated C(S,G) joins towards their PEs requesting traffic for (2001:10:4:6::4, FF33::1). CSR7 and CSR8 both originated Type-7 SPT joins for this C(S,G) and sent them to XRv2, the RR. The RR picks CSR7 because it has a lower IGP cost to the BGP next-hop.
RP/0/0/CPU0:XRv2#show bgp ipv6 mvpn rd 213:6 route-type 7 | begin Network
   Network            Next Hop            Metric LocPrf Weight Path
Route Distinguisher: 213:6
*>i[7][213:6][213][128][2001:10:4:6::4][128][ff33::1]/376
                      213.7.7.7                0    100      0 ?
* i                   213.8.8.8                0    100      0 ?

Looking specifically at this best-path on CSR6, we can see why CSR6 imported it. CSR6 attaches a dynamic RT to its unicast route for the C-S (the MVPN VRF:213.6.6.6:1 extended community below), and CSR7 copies that value into the RT of the Type-7 route it advertises towards CSR6. This way, only CSR6 imports it. This behavior is protocol agnostic and is true for IPv4 and IPv6.

R6#show bgp ipv6 mvpn vrf MC route-type 7 213:6 213 2001:10:4:6::4 FF33::1
BGP routing table entry for [7][213:6][213][2001:10:4:6::4][FF33::1]/46, version 34
Paths: (1 available, best #1, table MVPNV6-BGP-Table)
  Not advertised to any peer
  Refresh Epoch 1
  Local
    213.7.7.7 (metric 20) from 213.12.12.12 (213.12.12.12)
      Origin incomplete, metric 0, localpref 100, valid, internal, best
      Extended Community: RT:213.6.6.6:1
      Originator: 213.7.7.7, Cluster list: 213.12.12.12
      rx pathid: 0, tx pathid: 0x0

R6#show bgp vpnv6 unicast vrf MC 2001:10:4:6::/64
BGP routing table entry for [213:6]2001:10:4:6::/64, version 25
Paths: (1 available, best #1, table MC)
  Advertised to update-groups:
     2
  Refresh Epoch 1
  Local
    :: (via vrf MC) from 0.0.0.0 (213.6.6.6)
      Origin incomplete, metric 0, localpref 100, weight 32768, valid, sourced, best
      Extended Community: RT:213:6 MVPN AS:213:0.0.0.0 MVPN VRF:213.6.6.6:1
      mpls labels in/out 6001/nolabel(MC)
      rx pathid: 0, tx pathid: 0x0
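The import decision described above can be sketched in a few lines of Python. This is only an illustration of the matching logic, not how IOS implements it; the function names are my own, and the 213.8.8.8:1 value used for a second PE is a hypothetical example that does not appear in the outputs.

```python
def rt_from_vrf_route_import(community):
    """Turn the 'MVPN VRF:<ip>:<n>' community shown on the unicast route
    into the RT string an upstream-facing Type-7 route would carry."""
    prefix = "MVPN VRF:"
    if not community.startswith(prefix):
        raise ValueError("not a VRF Route Import community: " + community)
    return "RT:" + community[len(prefix):]


def imports_type7(route_rts, local_route_imports):
    """A PE imports the C-multicast route only if one of the route's RTs
    matches a VRF Route Import value that the PE itself originated."""
    local_rts = {rt_from_vrf_route_import(c) for c in local_route_imports}
    return any(rt in local_rts for rt in route_rts)


# RT carried on the Type-7 route seen on CSR6
type7_rts = ["RT:213.6.6.6:1"]

csr6 = imports_type7(type7_rts, ["MVPN VRF:213.6.6.6:1"])
other_pe = imports_type7(type7_rts, ["MVPN VRF:213.8.8.8:1"])  # hypothetical value
print(csr6, other_pe)
```

Only the PE that owns the matching VRF Route Import value accepts the join, which is why the RR can reflect the Type-7 route to every client without the other PEs installing it.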

Before sending traffic, we can verify that the C-PIM control-plane is correct. The group is mapped to the I-PMSI, or the mLDP MP2MP tree, which explains the absence of any ‘Y’ or ‘y’ flags on these groups. The “data MDT” is not in use here. Again, the big ‘G’ means the entry was installed from having received a BGP Type 6 or 7 route. The little ‘g’ means the entry was installed from having sent one. Packets exit to the PMSI on CSR6 and enter from it on CSR7 and CSR8.

R6#show ipv6 mroute vrf MC FF33::1 | begin \(
(2001:10:4:6::4, FF33::1), 00:14:12/never, flags: sTG
  Incoming interface: GigabitEthernet2.546
  RPF nbr: 2001:10:4:6::4
  Immediate Outgoing interface list:
    Lspvif0, Forward, 00:14:12/never

R7#show ipv6 mroute vrf MC FF33::1 | begin \(
(2001:10:4:6::4, FF33::1), 00:47:32/never, flags: sTIg
  Incoming interface: Lspvif0
  RPF nbr: ::FFFF:213.6.6.6
  Immediate Outgoing interface list:
    GigabitEthernet2.527, Forward, 00:47:32/never

R8#show ipv6 mroute vrf MC FF33::1 | begin \(
(2001:10:4:6::4, FF33::1), 01:55:23/never, flags: sTIg
  Incoming interface: Lspvif0
  RPF nbr: ::FFFF:213.6.6.6
  Immediate Outgoing interface list:
    GigabitEthernet2.538, Forward, 01:55:23/never

When CSR4 starts sending traffic, there is no Type-5 Source Active AD message. This is because the group is SSM and CSR6 has already received a Type-7 join for it. We can quickly verify that the traffic is flowing by checking the counters on CSR6, CSR7, and CSR8.

R4#ping FF33::1 rep 10000
Output Interface: GigabitEthernet2.546

R6#show ipv6 mroute vrf MC FF33::1 count
Forwarding Counts: Pkt Count/Pkts per second/Avg Pkt Size/Kilobits per second
Other counts: Total/RPF failed/Other drops(OIF-null, rate-limit etc)
VRF MC
53 routes, 6 (*,G)s, 46 (*,G/m)s
Group: FF33::1
  Source: 2001:10:4:6::4,
    SW Forwarding: 0/0/0/0, Other: 0/0/0
    HW Forwarding: 87/0/118/0, Other: 0/0/0
  Totals - Source count: 1, Packet count: 87

R7#show ipv6 mroute vrf MC FF33::1 count
Forwarding Counts: Pkt Count/Pkts per second/Avg Pkt Size/Kilobits per second
Other counts: Total/RPF failed/Other drops(OIF-null, rate-limit etc)
VRF MC
55 routes, 7 (*,G)s, 47 (*,G/m)s
Group: FF33::1
  Source: 2001:10:4:6::4,
    SW Forwarding: 0/0/0/0, Other: 0/0/0
    HW Forwarding: 94/0/126/0, Other: 0/0/0
  Totals - Source count: 1, Packet count: 94

R8#show ipv6 mroute vrf MC FF33::1 count
Forwarding Counts: Pkt Count/Pkts per second/Avg Pkt Size/Kilobits per second
Other counts: Total/RPF failed/Other drops(OIF-null, rate-limit etc)
VRF MC
55 routes, 7 (*,G)s, 47 (*,G/m)s
Group: FF33::1
  Source: 2001:10:4:6::4,
    SW Forwarding: 0/0/0/0, Other: 0/0/0
    HW Forwarding: 95/0/126/0, Other: 0/0/0
  Totals - Source count: 1, Packet count: 95

For variety, we will capture inbound on CSR7 on the link facing XRv4, the mLDP MP2MP root. Even though this isn’t the shortest path back to CSR6, we know that we are not using an S-PMSI since no Type 3 routes currently exist anywhere within this MVPN instance. The incoming label is 0x1B5E (7006), which is the correct MDT 0 local downstream label allocated by CSR7 for the I-PMSI. I also highlighted the IPv6 source/destination addresses as proof.

R7#show monitor capture CAP buffer dump
2
0000: 005056A9 EA770050 56A9DE77 81000DF6   .PV..w.PV..w....
0010: 884701B5 E03D0000 213F6000 0000003C   .G...=..!?`....<
0020: 3A3F2001 00100004 00060000 00000000   :? .............
0030: 0004FF33 00000000 00000000 00000000   ...3............
0040: 00018000 9DF5263C 014F4F50 51525354
0050: 55565758 595A5B5C 5D5E5F60 61626364
0060: 65666768 696A6B6C 6D6E6F70 71727374
0070: 75767778 797A7B7C 7D7E7F80 8182

0040: 00018000 C7101640 06CCCCCD CECFD0D1   .......@........
0050: D2D3D4D5 D6D7D8D9 DADBDCDD DEDFE0E1   ................
0060: E2E3E4E5 E6E7E8E9 EAEBECED EEEFF0F1   ................
0070: F2F3F4F5 F6F7F8F9 FAFBFCFD FEFF       ..............

R7#show mpls mldp database opaque_type mdt 213:213 2147483649 | section Upstream
  Upstream client(s) :
    213.13.13.13:0    [Active]
      Expires        : Never         Path Set ID  : 5
      Out Label (U)  : None          Interface    : GigabitEthernet2.573*
      Local Label (D): 7010          Next Hop     : 213.7.13.13

We also expect CSR8 to be receiving this flow. We will capture outbound on CSR8 within the VRF so we can see the raw IPv6 packets being forwarded to the customer without MPLS encapsulation. We clearly see the 0x86DD ethertype, which represents IPv6. The next-header value is 0x3A, which is 58 for ICMPv6. The parser gives you some summarized information also.

R8#show monitor capture CAP buffer detailed
1 118 1.352979 2001:*:0004 -> FF7E:*:0001 IPv6-ICMP
0000: 33330000 00010050 56A9FB1C 81000DD2
0010: 86DD6000 0000003C 3A3C2001 00100004
0020: 00060000 00000000 0004FF7E 02402001
0030: 00100002 00070000 00018000 36FB1640

      [1][213:1][213.1.1.1]/12                     0.0.0.0                        32768 ?
 *>i  [1][213:1][213.5.5.5]/12                     213.5.5.5            0    100      0 ?
 *>i  [1][213:1][213.6.6.6]/12                     213.6.6.6            0    100      0 ?
 *>i  [1][213:1][213.7.7.7]/12                     213.7.7.7            0    100      0 ?
 *>i  [1][213:1][213.8.8.8]/12                     213.8.8.8            0    100      0 ?
 *>i  [1][213:1][213.12.12.12]/12                  213.12.12.12              100      0 i
 *>i  [3][213:1][10.2.7.2][232.0.0.5][213.7.7.7]/22 213.7.7.7           0    100      0 ?

R1#show bgp ipv6 mvpn vrf MC | begin Network
     Network                        Next Hop        Metric LocPrf Weight Path
Route Distinguisher: 213:1 (default for vrf MC)
 *>   [1][213:1][213.1.1.1]/12       ::                             32768 ?
 *>i  [1][213:1][213.5.5.5]/12       213.5.5.5            0    100      0 ?
 *>i  [1][213:1][213.12.12.12]/12    213.12.12.12              100      0 i

We also expect to see a “full-mesh” of C-PIM neighbors over the I-PMSI. In the interest of brevity, we will check CSR1, CSR6, and XRv2 for extra validation, then continue with the test.

R1#show ip pim vrf MC neighbor | begin ^Neighbor
Neighbor          Interface     Uptime/Expires       Ver   DR
Address                                                    Prio/Mode
213.8.8.8         Lspvif0       00:03:49/00:01:22    v2    1 / S P G
213.5.5.5         Lspvif0       00:03:49/00:01:22    v2    1 / S P G
213.7.7.7         Lspvif0       00:03:49/00:01:22    v2    1 / S P G
213.6.6.6         Lspvif0       00:03:49/00:01:21    v2    1 / S P G
213.12.12.12      Lspvif0       00:15:08/00:01:21    v2    1 / DR P G

R6#show ip pim vrf MC neighbor | begin ^Neighbor
Neighbor          Interface     Uptime/Expires       Ver   DR
Address                                                    Prio/Mode
213.8.8.8         Lspvif0       00:03:06/00:01:35    v2    1 / S P G
213.5.5.5         Lspvif0       00:03:06/00:01:35    v2    1 / S P G
213.7.7.7         Lspvif0       00:03:06/00:01:35    v2    1 / S P G
213.1.1.1         Lspvif0       00:04:04/00:01:35    v2    1 / S P G
213.12.12.12      Lspvif0       00:14:25/00:01:34    v2    1 / DR P G

RP/0/0/CPU0:XRv2#show pim vrf MC neighbor | begin ^Neighbor
Neighbor Address    Interface                     Uptime    Expires   DR pri  Flags
10.11.12.11         GigabitEthernet0/0/0/0.512    1d07h     00:01:21  1       B P
10.11.12.12*        GigabitEthernet0/0/0/0.512    2d00h     00:01:35  1 (DR)  B PE
213.1.1.1           LmdtMC                        00:05:09  00:01:29  1       P
213.5.5.5           LmdtMC                        00:04:11  00:01:29  1       P
213.6.6.6           LmdtMC                        00:04:11  00:01:28  1       P
213.7.7.7           LmdtMC                        00:04:11  00:01:30  1       P
213.8.8.8           LmdtMC                        00:04:11  00:01:29  1       P
213.12.12.12*       LmdtMC                        00:15:37  00:01:29  1 (DR)  P

Because C-PIM signaling is used, we expect to see C(*,G) joins making their way through the PMSI towards the C-RP, with C(S,G) SSM joins making their way towards their respective C-S. In this case, CSR10 is the C-RP, which should have the (*, 225.0.0.1) join. CSR2 is an SSM source, which means CSR7 should have the (10.2.7.2, 232.0.0.5) join.

R10#show ip mroute 225.0.0.1 | begin \(
(*, 225.0.0.1), 22:23:06/00:03:16, RP 10.5.10.10, flags: S
  Incoming interface: Null, RPF nbr 0.0.0.0
  Outgoing interface list:
    GigabitEthernet2.550, Forward/Sparse, 22:23:06/00:03:16

R7#show ip mroute vrf MC 232.0.0.5 10.2.7.2 | begin \(
(10.2.7.2, 232.0.0.5), 22:22:54/00:02:56, flags: sTy
  Incoming interface: GigabitEthernet2.527, RPF nbr 0.0.0.0
  Outgoing interface list:
    Lspvif0, Forward/Sparse, 22:22:54/00:02:56

The output on CSR7 indicates that this group is “sending” to an MDT data group (little ‘y’ flag). We can find the S-PMSI tunnel binding by checking the BGP-AD messages for a matching Type-3 associated with this C(S,G). CSR7 originated this Type-3 S-PMSI route to tell other PEs about this C(S,G). The tunnel type 2 field means this is an mLDP P2MP tunnel and the root is 213.7.7.7 (0xD5070707). The MDT number is 65537 (0x10001).

R7#show bgp ipv4 mvpn vrf MC route-type 3 10.2.7.2 232.0.0.5 213.7.7.7
BGP routing table entry for [3][213:7][10.2.7.2][232.0.0.5][213.7.7.7]/22, version 200
Paths: (1 available, best #1, table MVPNv4-BGP-Table, not advertised to EBGP peer)
  Advertised to update-groups:
     3
  Refresh Epoch 1
  Local
    0.0.0.0 from 0.0.0.0 (213.7.7.7)
      Origin incomplete, localpref 100, weight 32768, valid, sourced, local, best
      Community: no-export
      Extended Community: RT:213:7
      PMSI Attribute: Flags: 0x0, Tunnel type: 2, length 17, label: exp-null,
        tunnel parameters: 0600 0104 D507 0707 0007 0100 0400 0100 01
      rx pathid: 0, tx pathid: 0x0
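To make the hex in the PMSI attribute less mysterious, here is a minimal Python sketch that walks the 17 bytes of tunnel parameters by hand. It assumes the layout of an mLDP P2MP FEC element with a single generic LSP identifier opaque value (FEC type, address family, address length, root address, opaque length, then one opaque TLV), which is what these bytes appear to contain. The function name and dictionary keys are my own.

```python
import ipaddress
import struct


def parse_mldp_p2mp_fec(hex_words):
    """Decode tunnel parameters such as
    '0600 0104 D507 0707 0007 0100 0400 0100 01' into root and GID."""
    raw = bytes.fromhex(hex_words.replace(" ", ""))
    # FEC type (1 byte), address family (2 bytes), address length (1 byte)
    fec_type, afi, addr_len = struct.unpack_from("!BHB", raw, 0)
    offset = 4
    root = ipaddress.ip_address(raw[offset:offset + addr_len])
    offset += addr_len
    # Total length of the opaque value field that follows
    (opaque_len,) = struct.unpack_from("!H", raw, offset)
    offset += 2
    # One opaque TLV: type (1 byte), length (2 bytes), value
    opaque_type, value_len = struct.unpack_from("!BH", raw, offset)
    offset += 3
    gid = int.from_bytes(raw[offset:offset + value_len], "big")
    return {
        "fec_type": fec_type,
        "afi": afi,
        "root": str(root),
        "opaque_len": opaque_len,
        "opaque_type": opaque_type,
        "gid": gid,
    }


s_pmsi = parse_mldp_p2mp_fec("0600 0104 D507 0707 0007 0100 0400 0100 01")
print(s_pmsi["root"], s_pmsi["gid"])
```

The root comes out as 213.7.7.7 and the generic LSP identifier as 65537, matching the values read off manually above. Changing only the last byte to 00 yields 65536.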

Using the “verbose” modifier, we can see the binding between the S-PMSI and the C(S,G) MRIB entry. The little ‘p’ means that it was signaled from a C-PIM join, which is expected.

R7#show ip mroute vrf MC 232.0.0.5 10.2.7.2 verbose | begin \(
(10.2.7.2, 232.0.0.5), 22:37:25/00:03:08, flags: sTyp
  Incoming interface: GigabitEthernet2.527, RPF nbr 0.0.0.0
  MDT TX nr: 65537 LSM-ID: 0x23
  Outgoing interface list:
    Lspvif0, LSM MDT: 23 (data), Forward/Sparse, 22:37:25/00:03:08, p

When CSR2 starts sending traffic, it will be delivered to all endpoints of the P2MP tree. We could trace the tree to verify, but since that has been demonstrated many times, I will use show commands on CSR6 and CSR8. If the GID is 65537, it is highly likely this is part of the same mLDP P2MP tree.

R6#show mpls mldp database summary | include 65537
23  P2MP  213.7.7.7  [gid 65537 (0x00010001)]  1

R8#show mpls mldp database summary | include 65537
22  P2MP  213.7.7.7  [gid 65537 (0x00010001)]  1

We quickly verify the C-MRIB entries on CSR6 and CSR8. The RPF interface is the PMSI (the MDT is shown without the verbose keyword here) and the OIL includes the customer LAN. The big ‘Y’ indicates that these routers have “joined” a Data MDT.

R6#show ip mroute vrf MC 232.0.0.5 10.2.7.2 | begin \(
(10.2.7.2, 232.0.0.5), 22:42:35/00:02:26, flags: sTIY
  Incoming interface: Lspvif0, RPF nbr 213.7.7.7, MDT: [65537, 213.7.7.7]/never
  Outgoing interface list:
    GigabitEthernet2.546, Forward/Sparse, 22:42:35/00:02:26

R8#show ip mroute vrf MC 232.0.0.5 10.2.7.2 | begin \(
(10.2.7.2, 232.0.0.5), 22:42:57/00:02:05, flags: sTIY
  Incoming interface: Lspvif0, RPF nbr 213.7.7.7, MDT: [65537, 213.7.7.7]/never
  Outgoing interface list:
    GigabitEthernet2.538, Forward/Sparse, 22:42:57/00:02:05

CSR2 begins sending traffic. Because the S-PMSI is already signaled and the group is SSM, no more signaling will take place. We can check the packet counters on all relevant PEs to ensure packets are being forwarded.

R6#show ip mroute vrf MC 232.0.0.5 10.2.7.2 count | begin ^Group
Group: 232.0.0.5, Source count: 1, Packets forwarded: 68, Packets received: 68
  Source: 10.2.7.2/32, Forwarding: 68/0/122/0, Other: 68/0/0

R7#show ip mroute vrf MC 232.0.0.5 10.2.7.2 count | begin ^Group
Group: 232.0.0.5, Source count: 1, Packets forwarded: 67, Packets received: 67
  Source: 10.2.7.2/32, Forwarding: 67/0/118/0, Other: 67/0/0

R8#show ip mroute vrf MC 232.0.0.5 10.2.7.2 count | begin ^Group
Group: 232.0.0.5, Source count: 1, Packets forwarded: 69, Packets received: 69
  Source: 10.2.7.2/32, Forwarding: 69/0/122/0, Other: 69/0/0
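When comparing counters across several PEs, it can help to pull the numbers out of the text rather than eyeball them. The following is a rough Python sketch, not a supported tool: it assumes the four slash-separated fields after “Forwarding:” are packet count, packets per second, average packet size, and kilobits per second, as the legend in the count output states. The function name and sample dictionary are my own.

```python
import re

# Matches both "Forwarding: 68/0/122/0" and "HW Forwarding: 87/0/118/0"
_FORWARDING_RE = re.compile(r"Forwarding:\s*(\d+)/(\d+)/(\d+)/(\d+)")


def parse_forwarding_counts(line):
    """Return the four forwarding counters from one line of
    'show ip mroute ... count' output, or None if the line has none."""
    match = _FORWARDING_RE.search(line)
    if match is None:
        return None
    pkts, pps, avg_size, kbps = (int(field) for field in match.groups())
    return {"pkts": pkts, "pps": pps, "avg_size": avg_size, "kbps": kbps}


# Source lines copied from the CSR6, CSR7, and CSR8 outputs
samples = {
    "R6": "Source: 10.2.7.2/32, Forwarding: 68/0/122/0, Other: 68/0/0",
    "R7": "Source: 10.2.7.2/32, Forwarding: 67/0/118/0, Other: 67/0/0",
    "R8": "Source: 10.2.7.2/32, Forwarding: 69/0/122/0, Other: 69/0/0",
}
for router, line in samples.items():
    counts = parse_forwarding_counts(line)
    print(router, counts["pkts"], counts["avg_size"])
```

The average packet size is 4 bytes smaller on the ingress PE (118 versus 122 bytes) than on the egress PEs in this output.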

We will quickly trace the LSPs that make up the P2MP tree used for this flow. We will use a different method this time; we know that P2MP trees are unidirectional (downstream only) and are receiver-driven, but we can still verify the tree from the root downwards. On CSR7, we will check the downstream labels learned from XRv3 and XRv4. CSR7 should use labels 93015 and 94020 for XRv3 and XRv4 respectively when sending traffic down this tree.

R7#show mpls mldp database opaque_type gid 65537 | section Replic
  Replication client(s):
    MDT  (VRF MC)
      Uptime         : 22:48:55      Path Set ID  : None
      Interface      : Lspvif0
    213.14.14.14:0
      Uptime         : 22:48:55      Path Set ID  : None
      Out label (D)  : 94020         Interface    : GigabitEthernet2.574*
      Local label (U): None          Next Hop     : 213.7.14.14
    213.13.13.13:0
      Uptime         : 22:48:42      Path Set ID  : None
      Out label (D)  : 93015         Interface    : GigabitEthernet2.573*
      Local label (U): None          Next Hop     : 213.7.13.13

Using EPC on CSR7 outbound concurrently on both core links, we can verify the packets are actually using the labels we just traced. Assuming the root node pushes the right MVPN label onto the packet, the core routers will perform swap operations until it gets to the leaves. Notice the identical timestamps, source MAC address, and IP header information. This is the router performing replication as expected. The labels are highlighted (93015 and 94020) which are the downstream labels for XRv3 and XRv4.

R7#show monitor capture CAP buffer detailed
12 122 251158.141991 00:50:56:A9:EA:77 -> 00:50:56:A9:DB:37 MPLS unicast
0000: 005056A9 DB370050 56A9EA77 81000DF5   .PV..7.PV..w....
0010: 884716B5 71FE4500 0064067E 0000FE01   .G..q.E..d.~....
0020: BD110A02 0702E800 00050800 28CC0001   ............(...
0030: 01050000 00000F7C 44FCABCD ABCDABCD   .......|D.......

13 122 251160.141991 00:50:56:A9:EA:77 -> 00:50:56:A9:DE:77 MPLS unicast
0000: 005056A9 DE770050 56A9EA77 81000DF6   .PV..w.PV..w....
0010: 884716F4 41FE4500 0064067F 0000FE01   .G..A.E..d......
0020: BD100A02 0702E800 00050800 20FB0001   ............ ...
0030: 01060000 00000F7C 4CCCABCD ABCDABCD   .......|L.......
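Reading labels out of a hex dump is error-prone, so a short Python sketch of the arithmetic may help. Each MPLS label stack entry is 32 bits: a 20-bit label, 3 traffic-class bits, a bottom-of-stack bit, and an 8-bit TTL. The function below is my own helper; it takes the bytes that follow the 0x8847 ethertype and walks entries until the bottom-of-stack bit is set.

```python
def decode_label_stack(data):
    """Decode MPLS label stack entries from raw bytes, stopping at the
    entry whose bottom-of-stack bit is set."""
    entries = []
    for offset in range(0, len(data) - 3, 4):
        word = int.from_bytes(data[offset:offset + 4], "big")
        entry = {
            "label": word >> 12,        # top 20 bits
            "tc": (word >> 9) & 0x7,    # 3 traffic-class (EXP) bits
            "bos": (word >> 8) & 0x1,   # bottom-of-stack flag
            "ttl": word & 0xFF,         # low 8 bits
        }
        entries.append(entry)
        if entry["bos"]:
            break
    return entries


# The four bytes after 0x8847 in each captured frame
to_xrv3 = decode_label_stack(bytes.fromhex("16B571FE"))
to_xrv4 = decode_label_stack(bytes.fromhex("16F441FE"))
print(to_xrv3[0]["label"], to_xrv4[0]["label"])
```

Both frames carry a single entry with the bottom-of-stack bit set and a TTL of 254, and the labels decode to 93015 and 94020 as expected.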

XRv3 and XRv4 perform label swap operations and forward down their trees to CSR6 and CSR8 respectively.

RP/0/0/CPU0:XRv3#show mpls forwarding labels 93015
Local  Outgoing    Prefix             Outgoing       Next Hop        Bytes
Label  Label       or ID              Interface                      Switched
------ ----------- ------------------ -------------- --------------- ------------
93015  6013        MLDP: 0x0001c      Gi0/0/0/0.563  213.6.13.6      0

RP/0/0/CPU0:XRv4#show mpls forwarding labels 94020
Local  Outgoing    Prefix             Outgoing       Next Hop        Bytes
Label  Label       or ID              Interface                      Switched
------ ----------- ------------------ -------------- --------------- ------------
94020  8002        MLDP: 0x0001f      Gi0/0/0/0.584  213.8.14.8      0

Checking CSR6 and CSR8, we see the packets arrive with the proper labels. The LFIB also shows the GID for verification.

R6#show mpls forwarding-table labels 6013
Local      Outgoing   Prefix                        Bytes Label   Outgoing   Next Hop
Label      Label      or Tunnel Id                  Switched      interface
6013  [T]  No Label   [gid 65537 (0x00010001)][V]   \
                                                    51484         aggregate/MC

R8#show mpls forwarding-table labels 8002
Local      Outgoing   Prefix                        Bytes Label   Outgoing   Next Hop
Label      Label      or Tunnel Id                  Switched      interface
8002  [T]  No Label   [gid 65537 (0x00010001)][V]   \
                                                    53558         aggregate/MC

Next, we will examine the more complex ASM setup. CSR10 is the C-RP, so when CSR2 starts sending traffic, CSR7 will register this C(S,G) with the C-RP. CSR10 decapsulates the payload and sends this traffic down the C(*,G) tree towards CSR6 and CSR8.

! CSR7
PIM(1): Check RP 10.5.10.10 into the (*, 225.0.0.1) entry
PIM(1): Building Triggered (*,G) Join / (S,G,RP-bit) Prune message for 225.0.0.1
PIM(1): Adding register encap tunnel (Tunnel2) as forwarding interface of (10.2.7.2, 225.0.0.1).

! CSR10
PIM(0): Received v2 Register on GigabitEthernet2.550 from 10.2.7.7 for 10.2.7.2, group 225.0.0.1
PIM(0): Adding register decap tunnel (Tunnel1) as accepting interface of (10.2.7.2, 225.0.0.1).
PIM(0): Insert (10.2.7.2,225.0.0.1) join in nbr 10.5.10.5's queue
PIM(0): Building Join/Prune packet for nbr 10.5.10.5
PIM(0): Adding v2 (10.2.7.2/32, 225.0.0.1), S-bit Join
PIM(0): Send v2 join/prune to 10.5.10.5 (GigabitEthernet2.550)

CSR5 receives that join from CSR10 and sends it towards the C-S through CSR7. CSR7 then updates the MDT in use from 0 (null in this case, since there is no MP2MP default MDT) to LSM-ID 0x17. We can see this is the I-PMSI P2MP tunnel rooted at CSR7, which is correct.

! CSR5
PIM(1): Received v2 Join/Prune on GigabitEthernet2.550 from 10.5.10.10, to us
PIM(1): Join-list: (10.2.7.2/32, 225.0.0.1), S-bit set
PIM(1): Add GigabitEthernet2.550/10.5.10.10 to (10.2.7.2, 225.0.0.1), Forward state, by PIM SG Join
PIM(1): Insert (10.2.7.2,225.0.0.1) join in nbr 213.7.7.7's queue
PIM(1): Building Join/Prune packet for nbr 213.7.7.7
PIM(1): Adding v2 (10.2.7.2/32, 225.0.0.1), S-bit Join
PIM(1): Send v2 join/prune to 213.7.7.7 (Lspvif0)

! CSR7
Received v2 Join/Prune on Lspvif0 from 213.5.5.5, to us
Join-list: (10.2.7.2/32, 225.0.0.1), S-bit set
MDT next_hop change from: 0 to 17 for (10.2.7.2, 225.0.0.1) Lspvif0
Add Lspvif0/213.5.5.5 to (10.2.7.2, 225.0.0.1), Forward state, by PIM SG Join

R7#show mpls mldp database summary | include ^17
17  P2MP  213.7.7.7  [gid 65536 (0x00010000)]  3

Now that the debug revealed the P2MP tree in use, we can verify the GID by checking the BGP-AD Type-1 I-PMSI message to confirm our findings. The tunnel type 2 indicates an mLDP P2MP tree and the GID is 0x10000, which is 65536.

R7#show bgp ipv4 mvpn vrf MC route-type 1 213.7.7.7
BGP routing table entry for [1][213:7][213.7.7.7]/12, version 185
Paths: (1 available, best #1, table MVPNv4-BGP-Table, not advertised to EBGP peer)
  Advertised to update-groups:
     3
  Refresh Epoch 1
  Local
    0.0.0.0 from 0.0.0.0 (213.7.7.7)
      Origin incomplete, localpref 100, weight 32768, valid, sourced, local, best
      Community: no-export
      Extended Community: RT:213:7
      PMSI Attribute: Flags: 0x0, Tunnel type: 2, length 17, label: exp-null,
        tunnel parameters: 0600 0104 D507 0707 0007 0100 0400 0100 00
      rx pathid: 0, tx pathid: 0x0

At this point CSR7 can start sending native multicast traffic down the C(S,G) tree. The C-RP will want to see traffic on the SPT before telling the DR to stop registering. Once it does, we see the big ‘T’ flag, which indicates that traffic has been seen along the SPT. The C-RP then issues a prune for this C(S,G) towards CSR5 as it no longer wants the traffic. It also sends a register-stop to the DR (CSR7) so that the router stops sending unicast-encapsulated traffic inside register messages concurrently with native multicast. Notice that CSR7 also tells us that this group will continue to use the default MDT since there was no match in the data MDT ACL.

R10#show ip mroute 225.0.0.1 10.2.7.2 | begin \(
(10.2.7.2, 225.0.0.1), 00:13:13/00:02:20, flags: PT
  Incoming interface: GigabitEthernet2.550, RPF nbr 10.5.10.5
  Outgoing interface list: Null

! CSR10
PIM(0): Received v2 Register on GigabitEthernet2.550 from 10.2.7.7 for 10.2.7.2, group 225.0.0.1
PIM(0): Send v2 Register-Stop to 10.2.7.7 for 10.2.7.2, group 225.0.0.1
PIM(0): Insert (10.2.7.2,225.0.0.1) prune in nbr 10.5.10.5's queue
PIM(0): Building Join/Prune packet for nbr 10.5.10.5
PIM(0): Adding v2 (10.2.7.2/32, 225.0.0.1), S-bit Prune

! CSR7
PIM(1): Received v2 Register-Stop on GigabitEthernet2.574 from 10.5.10.10
PIM(1): for source 10.2.7.2, group 225.0.0.1
PIM(1): Removing register encap tunnel (Tunnel2) as forwarding interface of (10.2.7.2, 225.0.0.1).
PIM(1): Clear Registering flag to 10.5.10.10 for (10.2.7.2/32, 225.0.0.1)
PIM(1): MDT ACL not matched for (10.2.7.2,225.0.0.1)

All of the interested PEs also send C(S,G) joins towards CSR7. This allows CSR7 to leave the interface in a forwarding state.

! CSR7
PIM(1): Received v2 Join/Prune on Lspvif0 from 213.8.8.8, to us
PIM(1): Join-list: (10.2.7.2/32, 225.0.0.1), S-bit set
PIM(1): Update Lspvif0/213.8.8.8 to (10.2.7.2, 225.0.0.1), Forward state, by PIM SG Join
PIM(1): Received v2 Join/Prune on Lspvif0 from 213.6.6.6, to us
PIM(1): Join-list: (10.2.7.2/32, 225.0.0.1), S-bit set
PIM(1): Update Lspvif0/213.6.6.6 to (10.2.7.2, 225.0.0.1), Forward state, by PIM SG Join
PIM(1): Received v2 Join/Prune on Lspvif0 from 213.1.1.1, to us
PIM(1): Join-list: (10.2.7.2/32, 225.0.0.1), S-bit set
PIM(1): Update Lspvif0/213.1.1.1 to (10.2.7.2, 225.0.0.1), Forward state, by PIM SG Join

Unlike some other examples, no Type-5 Source Active route is created. Although technically an AD route, the Type 5 is mostly used to inform other PEs of an active source when BGP c-mcast signaling is in use. There has been no change in BGP AD signaling since CSR2 started sending traffic. No Type-3 was generated by CSR7 since this group is not a candidate for S-PMSI treatment.

We can verify the C-MRIB entries for this C(S,G) on all receiver PEs. Because these all use the I-PMSI, there is no special MDT for them, and thus they do not reveal any MDT details, even with the “verbose” keyword. Only CSR7, the ingress PE, reveals some MDT information. It’s also clear that the ingress PE sends traffic out the PMSI while the egress PEs receive it from the PMSI.

R1#show ip mroute vrf MC 225.0.0.1 10.2.7.2 verbose | begin \(
(10.2.7.2, 225.0.0.1), 00:25:03/00:02:47, flags: JT
  Incoming interface: Lspvif0, RPF nbr 213.7.7.7
  Outgoing interface list:
    GigabitEthernet2.519, Forward/Sparse, 00:25:03/00:02:04

R6#show ip mroute vrf MC 225.0.0.1 10.2.7.2 verbose | begin \(
(10.2.7.2, 225.0.0.1), 00:24:15/00:02:38, flags: JT
  Incoming interface: Lspvif0, RPF nbr 213.7.7.7
  Outgoing interface list:
    GigabitEthernet2.546, Forward/Sparse, 00:24:15/00:02:19

R8#show ip mroute vrf MC 225.0.0.1 10.2.7.2 verbose | begin \(
(10.2.7.2, 225.0.0.1), 00:24:59/00:01:53, flags: JT
  Incoming interface: Lspvif0, RPF nbr 213.7.7.7
  Outgoing interface list:
    GigabitEthernet2.538, Forward/Sparse, 00:24:59/00:02:19

R7#show ip mroute vrf MC 225.0.0.1 10.2.7.2 verbose | begin \(
(10.2.7.2, 225.0.0.1), 00:25:43/00:02:59, flags: FTp
  Incoming interface: GigabitEthernet2.527, RPF nbr 0.0.0.0
  Outgoing interface list:
    Lspvif0, LSM MDT: 17 (default), Forward/Sparse, 00:25:43/00:03:05, p

Now that the control-plane is verified, we will verify the data-plane. First we will check basic multicast packet counts on each device.

R1#show ip mroute vrf MC 225.0.0.1 10.2.7.2 count | begin ^Group
Group: 225.0.0.1, Source count: 1, Packets forwarded: 818, Packets received: 818
  Source: 10.2.7.2/32, Forwarding: 817/0/122/0, Other: 817/0/0

R6#show ip mroute vrf MC 225.0.0.1 10.2.7.2 count | begin ^Group
Group: 225.0.0.1, Source count: 1, Packets forwarded: 865, Packets received: 865
  Source: 10.2.7.2/32, Forwarding: 864/0/122/0, Other: 864/0/0

R7#show ip mroute vrf MC 225.0.0.1 10.2.7.2 count | begin ^Group
Group: 225.0.0.1, Source count: 1, Packets forwarded: 865, Packets received: 866
  Source: 10.2.7.2/32, Forwarding: 865/0/117/0, Other: 866/1/0

R8#show ip mroute vrf MC 225.0.0.1 10.2.7.2 count | begin ^Group
Group: 225.0.0.1, Source count: 1, Packets forwarded: 867, Packets received: 867
  Source: 10.2.7.2/32, Forwarding: 866/0/122/0, Other: 866/0/0

Note that the PE facing the C-RP also forwarded a few multicast packets before it was pruned. This is why the ‘T’ flag is set on the C(S,G). Another 818 packets were dropped due to having a null OIL; since the I-PMSI is used, all the PEs will receive this traffic, including CSR5 and XRv2.

R5#show ip mroute vrf MC 225.0.0.1 10.2.7.2 verbose | begin \(
(10.2.7.2, 225.0.0.1), 00:32:29/00:02:54, flags: PTX
  Incoming interface: Lspvif0, RPF nbr 213.7.7.7
  Outgoing interface list: Null

R5#show ip mroute vrf MC 225.0.0.1 10.2.7.2 count | begin ^Group
Group: 225.0.0.1, Source count: 1, Packets forwarded: 3, Packets received: 822
  Source: 10.2.7.2/32, Forwarding: 2/0/122/0, Other: 821/1/818

We already determined that the LSM-ID of the I-PMSI for this MVPN is 17 (on CSR7, at least). We will use that to look at the mLDP bindings to see the downstream labels. Traffic towards XRv3 uses label 93011 while traffic to XRv4 uses label 94015.

R7#show mpls mldp database id 17 | section Replic
  Replication client(s):
    MDT  (VRF MC)
      Uptime         : 23:58:18      Path Set ID  : None
      Interface      : Lspvif0
    213.13.13.13:0
      Uptime         : 23:58:18      Path Set ID  : None
      Out label (D)  : 93011         Interface    : GigabitEthernet2.573*
      Local label (U): None          Next Hop     : 213.7.13.13
    213.14.14.14:0
      Uptime         : 23:58:18      Path Set ID  : None
      Out label (D)  : 94015         Interface    : GigabitEthernet2.574*
      Local label (U): None          Next Hop     : 213.7.14.14

Using the same EPC configuration on CSR7, we verify the LSM leaving CSR7’s upstream interfaces. We can see the proper labels encoded in hex below. Again, the timestamps, source MAC address, and IP payload are identical as CSR7 is replicating this traffic. These labels are different from the ones used for the IPv4 SSM flow (10.2.7.2, 232.0.0.5), so we know we are using a different delivery tree.

R7#show monitor capture CAP buffer detailed
2 122 0.123986 00:50:56:A9:EA:77 -> 00:50:56:A9:DB:37 MPLS unicast
0000: 005056A9 DB370050 56A9EA77 81000DF5   .PV..7.PV..w....
0010: 884716B5 31FE4500 00640D15 0000FE01   .G..1.E..d......
0020: BD7E0A02 0702E100 00010800 97620002   .~...........b..
0030: 04130000 00000FAF D323ABCD ABCDABCD   .........#......

3 122 0.123986 00:50:56:A9:EA:77 -> 00:50:56:A9:DE:77 MPLS unicast
0000: 005056A9 DE770050 56A9EA77 81000DF6   .PV..w.PV..w....
0010: 884716F3 F1FE4500 00640D15 0000FE01   .G....E..d......
0020: BD7E0A02 0702E100 00010800 97620002   .~...........b..
0030: 04130000 00000FAF D323ABCD ABCDABCD   .........#......

For brevity, we will check the LFIBs of XRv3 and XRv4 and move on. Because this is the I-PMSI tree, the traffic is delivered to all PEs. We can also see that CSR5 selected XRv3 as the preferred path to CSR7 (ECMP). This is where it sent its label mapping message.

RP/0/0/CPU0:XRv3#show mpls forwarding labels 93011
Local  Outgoing    Prefix             Outgoing       Next Hop        Bytes
Label  Label       or ID              Interface                      Switched
------ ----------- ------------------ -------------- --------------- ------------
93011  5011        MLDP: 0x00018      Gi0/0/0/0.553  213.5.13.5      0
       6003        MLDP: 0x00018      Gi0/0/0/0.563  213.6.13.6      0
       92018       MLDP: 0x00018      Gi0/0/0/0.523  213.12.13.12    0

RP/0/0/CPU0:XRv4#show mpls forwarding labels 94015
Local  Outgoing    Prefix             Outgoing       Next Hop        Bytes
Label  Label       or ID              Interface                      Switched
------ ----------- ------------------ -------------- --------------- ------------
94015  1011        MLDP: 0x0001a      Gi0/0/0/0.514  213.1.14.1      0
       8017        MLDP: 0x0001a      Gi0/0/0/0.584  213.8.14.8      0

Last, we will test IPv6 SSM. This does not have an S-PMSI associated with it, so we expect it to use the I-PMSI tunnel rooted at CSR6. We can verify this by checking the BGP AD information. There are no Type-3 routes anywhere in the “south-side” of the IPv6 MVPN network.

R6#show bgp ipv6 mvpn vrf MC | begin Network
     Network                        Next Hop        Metric LocPrf Weight Path
Route Distinguisher: 213:6 (default for vrf MC)
 *>   [1][213:6][213.6.6.6]/12       ::                             32768 ?
 *>i  [1][213:6][213.7.7.7]/12       213.7.7.7            0    100      0 ?
 *>i  [1][213:6][213.8.8.8]/12       213.8.8.8            0    100      0 ?

Verifying the current C-MRIB information, we can see the C(S,G) tree has already been built without an RP for (2001:10:4:6::4, FF33::1). Traffic enters CSR6 from the customer (CSR4) and goes to the PMSI, and the receivers accept traffic from the PMSI and deliver it to remote customers.

R6#show ipv6 mroute vrf MC FF33::1 | begin \(
(2001:10:4:6::4, FF33::1), 23:59:45/00:02:57, flags: sT
  Incoming interface: GigabitEthernet2.546
  RPF nbr: 2001:10:4:6::4
  Immediate Outgoing interface list:
    Lspvif0, Forward, 23:59:45/00:02:57

R7#show ipv6 mroute vrf MC FF33::1 | begin \(
(2001:10:4:6::4, FF33::1), 3d01h/never, flags: sTI
  Incoming interface: Lspvif0
  RPF nbr: ::FFFF:213.6.6.6
  Immediate Outgoing interface list:
    GigabitEthernet2.527, Forward, 3d01h/never

R8#show ipv6 mroute vrf MC FF33::1 | begin \(
(2001:10:4:6::4, FF33::1), 00:00:04/never, flags: sTI
  Incoming interface: Lspvif0
  RPF nbr: ::FFFF:213.6.6.6
  Immediate Outgoing interface list:
    GigabitEthernet2.538, Forward, 00:00:04/never

CSR4 begins sending traffic and we check the packet counters.

R4#ping FF33::1 repeat 10000
Output Interface: GigabitEthernet2.546

R6#show ipv6 mroute vrf MC FF33::1 count | include HW
    HW Forwarding: 22/0/118/0, Other: 0/0/0

R7#show ipv6 mroute vrf MC FF33::1 count | include HW
    HW Forwarding: 30/0/126/0, Other: 0/0/0

R8#show ipv6 mroute vrf MC FF33::1 count | include HW
    HW Forwarding: 24/0/126/0, Other: 0/0/0

To determine the outgoing label, we can look at the tunnels rooted at CSR6 with a GID of 65536. We know that is the I-PMSI for this MVPN as it was shown many times earlier as the opaque information inside the BGP Type-1 I-PMSI route. We will check the database to extract the LSM-ID. CSR6 sends LSM to XRv3 using label 93002.

R6#show mpls mldp database summary | include 213.6.6.6.*65536
17  P2MP  213.6.6.6  [gid 65536 (0x00010000)]  2

R6#show mpls mldp database id 17 | section Replic
  Replication client(s):
    MDT  (VRF MC)
      Uptime         : 1d00h         Path Set ID  : None
      Interface      : Lspvif0
    213.13.13.13:0
      Uptime         : 1d00h         Path Set ID  : None
      Out label (D)  : 93002         Interface    : GigabitEthernet2.563*
      Local label (U): None          Next Hop     : 213.6.13.13

We can see that XRv3 replicates this along the I-PMSI towards XRv2, CSR5, XRv4, and CSR7. CSR5 and CSR7 are receivers and do not forward it on, while XRv4 replicates it towards CSR1 and CSR8. Only CSR7 and CSR8 really needed the traffic, so S-PMSI could have optimized the bandwidth usage here (at the expense of state). Because there are few receivers but many PEs, using the I-PMSI for this flow is highly inefficient.

RP/0/0/CPU0:XRv3#show mpls forwarding labels 93002
Local  Outgoing    Prefix             Outgoing       Next Hop        Bytes
Label  Label       or ID              Interface                      Switched
------ ----------- ------------------ -------------- --------------- ------------
93002  5010        MLDP: 0x00016      Gi0/0/0/0.553  213.5.13.5      0
       7016        MLDP: 0x00016      Gi0/0/0/0.573  213.7.13.7      0
       92017       MLDP: 0x00016      Gi0/0/0/0.523  213.12.13.12    0
       94014       MLDP: 0x00016      Gi0/0/0/0.534  213.13.14.14    0

RP/0/0/CPU0:XRv4#show mpls forwarding labels 94014
Local  Outgoing    Prefix             Outgoing       Next Hop        Bytes
Label  Label       or ID              Interface                      Switched
------ ----------- ------------------ -------------- --------------- ------------
94014  1010        MLDP: 0x00019      Gi0/0/0/0.514  213.1.14.1      0
       8016        MLDP: 0x00019      Gi0/0/0/0.584  213.8.14.8      0

Last, we test OAM. I described earlier how there is no easy keyword for GID in XE, so we have to manually specify the opaque type of 0x01 in hexadecimal. By specifying the root of the LSP, and also testing this from the root (the tree is unidirectional and downstream), we can quickly see all of the endpoints. This tests the LSPs for any oddities that may occur in the path. From CSR1, we will test the IPv4 and IPv6 I-PMSI tunnels. Notice that the IPv4 one has 5 peers while the IPv6 one only has 2, due to VPNv4/v6 topology differences. I also quickly test the IPv4 S-PMSI for the SSM group from the root, CSR7. Notice that the opaque value changes to reflect the different GID.

R1#ping mpls mldp p2mp 213.1.1.1 hex 0x01 00010000
p2mp Root node addr 213.1.1.1
Opaque type hex value (0x1), num hex digits 4
Sending 1, 72-byte MPLS Echos to Target FEC Stack TLV descriptor,
timeout is 2.2 seconds, send interval is 0 msec, jitter value is 200 msec:
[snip]
Request #1
! reply addr 213.6.13.6
! reply addr 213.12.13.12
! reply addr 213.7.14.7
! reply addr 213.8.14.8
! reply addr 213.5.14.5
Round-trip min/avg/max = 9/74/177 ms

R1#ping mpls mldp p2mp 213.1.1.1 hex 0x01 00020000
p2mp Root node addr 213.1.1.1
Opaque type hex value (0x1), num hex digits 4
Sending 1, 72-byte MPLS Echos to Target FEC Stack TLV descriptor,
timeout is 2.2 seconds, send interval is 0 msec, jitter value is 200 msec:
[snip]
Request #1
! reply addr 213.12.13.12
! reply addr 213.5.14.5
Round-trip min/avg/max = 63/127/191 ms

R7#ping mpls mldp p2mp 213.7.7.7 hex 0x01 00010001
p2mp Root node addr 213.7.7.7
Opaque type hex value (0x1), num hex digits 4
Sending 1, 72-byte MPLS Echos to Target FEC Stack TLV descriptor,
timeout is 2.2 seconds, send interval is 0 msec, jitter value is 200
[snip]
Type escape sequence to abort.
Request #1
! reply addr 213.6.13.6
! reply addr 213.8.14.8
Round-trip min/avg/max = 10/95/181 ms
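Since the ping requires the opaque value in hex, converting a decimal GID by hand is a common place to slip. A tiny Python helper (my own, purely illustrative) produces the 8-digit hex string that follows “hex 0x01” in the commands above, and converts it back for checking.

```python
def gid_to_opaque_hex(gid):
    """Format a 32-bit generic LSP identifier as 8 uppercase hex digits."""
    if not 0 <= gid <= 0xFFFFFFFF:
        raise ValueError("GID must fit in 32 bits")
    return f"{gid:08X}"


def opaque_hex_to_gid(hex_digits):
    """Inverse of gid_to_opaque_hex, for checking a value seen in output."""
    return int(hex_digits, 16)


i_pmsi = gid_to_opaque_hex(65536)   # I-PMSI GID seen in the Type-1 route
s_pmsi = gid_to_opaque_hex(65537)   # S-PMSI GID seen in the Type-3 route
print(i_pmsi, s_pmsi)
```

The two results correspond to the opaque values used in the first and third pings.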

Additional Reading – Reference configurations “mvpn-17" 29. Describe and optimize multicast scale and performance 29.1 Inter-AS Multicast and Multicast Source Discovery Protocol (MSDP) This section uses a large BGP topology with multiple ASes to demonstrate how multicast can flow across Internet boundaries. The basic topology is shown below. AS 7 and AS 9 use route-reflection to reduce the number of iBGP peerings, while AS 8 and AS 11 do not because they contain only 2 routers each. Every BGP peering negotiates four AFIs: IPv4 unicast, IPv4 multicast, IPv6 unicast, and IPV6 multicast. This allows us to quickly adjust the path attribute, next-hops, and other features without having to change the BGP topology mid-stride. Each AS also uses its own IGP or set of IGPs to achieve reachability for IPv4 and IPv6 as shown in the diagram.


Before doing any advanced MSDP or multicast configuration, we will verify the configurations. The basic configuration is very long only because there are many protocols running, but there is nothing complex yet. As an example, I show the BGP configurations on CSR10 and XRv1 only. I create a pair of peer groups (XR neighbor-groups) for IPv4 and IPv6 peers; an alternative approach would be peer-session and peer-policy templates. IPv4 unicast and IPv4 multicast are enabled on the “IBGP” group with IPv6 unicast and IPv6 multicast enabled on the “IBGPV6” group. I chose to show CSR10 and XRv1 because they are the route-reflectors within their respective AS boundaries and would have the most interesting configuration. Notice there are no advanced features configured, and I do not show the IGP configuration because it is very simple.

! CSR10
route-map RM_CONN_TO_BGP permit 10
 match interface Loopback0
router bgp 9
 bgp log-neighbor-changes
 no bgp default ipv4-unicast
 neighbor IBGP peer-group
 neighbor IBGP remote-as 9
 neighbor IBGP update-source Loopback0
 neighbor IBGP timers 10 40
 neighbor IBGPV6 peer-group
 neighbor IBGPV6 remote-as 9
 neighbor IBGPV6 update-source Loopback0
 neighbor IBGPV6 timers 10 40
 neighbor 9.0.0.5 peer-group IBGP
 neighbor 9.0.0.6 peer-group IBGP
 neighbor 10.10.11.11 remote-as 7
 neighbor 2009::5 peer-group IBGPV6
 neighbor 2009::6 peer-group IBGPV6
 neighbor FD00:10:10:11::11 remote-as 7
 address-family ipv4
  redistribute connected route-map RM_CONN_TO_BGP
  neighbor IBGP send-community
  neighbor IBGP route-reflector-client
  neighbor IBGP next-hop-self
  neighbor 9.0.0.5 activate
  neighbor 9.0.0.6 activate
  neighbor 10.10.11.11 activate
 address-family ipv4 multicast
  neighbor IBGP send-community
  neighbor IBGP route-reflector-client
  neighbor IBGP next-hop-self
  neighbor 9.0.0.5 activate
  neighbor 9.0.0.6 activate
  neighbor 10.10.11.11 activate
 address-family ipv6
  redistribute connected route-map RM_CONN_TO_BGP
  neighbor IBGPV6 send-community
  neighbor IBGPV6 route-reflector-client
  neighbor IBGPV6 next-hop-self
  neighbor 2009::5 activate
  neighbor 2009::6 activate
  neighbor FD00:10:10:11::11 activate
 address-family ipv6 multicast
  neighbor IBGPV6 send-community
  neighbor IBGPV6 route-reflector-client
  neighbor IBGPV6 next-hop-self
  neighbor 2009::5 activate
  neighbor 2009::6 activate
  neighbor FD00:10:10:11::11 activate

! XRv1
route-policy PASS
 pass
end-policy
router bgp 7
 bgp cluster-id 7.0.0.11
 address-family ipv4 unicast
  network 7.0.0.11/32
 address-family ipv4 multicast
 address-family ipv6 unicast
  network 2007::b/128
 address-family ipv6 multicast
 neighbor-group IBGP
  remote-as 7
  timers 10 40
  update-source Loopback0
  address-family ipv4 unicast
   route-reflector-client
   next-hop-self
  address-family ipv4 multicast
   route-reflector-client
   next-hop-self
 neighbor-group IBGPV6
  remote-as 7
  timers 10 40
  update-source Loopback0
  address-family ipv6 unicast
   route-reflector-client
   next-hop-self
  address-family ipv6 multicast
   route-reflector-client
   next-hop-self
 neighbor 2007::7
  use neighbor-group IBGPV6
 neighbor 7.0.0.7
  use neighbor-group IBGP
 neighbor 2007::e
  use neighbor-group IBGPV6
 neighbor 7.0.0.14
  use neighbor-group IBGP
 neighbor 10.10.11.10
  remote-as 9
  address-family ipv4 unicast
   route-policy PASS in
   route-policy PASS out
  address-family ipv4 multicast
   route-policy PASS in
   route-policy PASS out
 neighbor fd00:10:10:11::10
  remote-as 9
  address-family ipv6 unicast
   route-policy PASS in
   route-policy PASS out
  address-family ipv6 multicast
   route-policy PASS in
   route-policy PASS out

Because all of the IPv4 BGP sessions use IPv4 peer loopbacks, and all of the IPv6 BGP sessions use IPv6 peer loopbacks, we can simply verify that the BGP sessions are up rather than verifying IGP reachability. The BGP sessions wouldn’t be up if the IGP was dysfunctional within any AS for IPv4 or IPv6. I copy/paste the commands below on all BGP speakers and only show some of the output (XRv1 and CSR10) for brevity. The benefit of this speedy verification approach is that the commands are valid on both XE and XR platforms.

! All BGP speakers
show bgp ipv4 unicast summary | begin ^Neigh
show bgp ipv4 multicast summary | begin ^Neigh
show bgp ipv6 unicast summary | begin ^Neigh
show bgp ipv6 multicast summary | begin ^Neigh

R10#show bgp ipv4 unicast summary | begin ^Neigh
Neighbor           V   AS MsgRcvd MsgSent TblVer InQ OutQ Up/Down  State/PfxRcd
9.0.0.5            4    9    5686    5660      6   0    0 14:57:00 1
9.0.0.6            4    9    5654    5666      6   0    0 14:56:52 2
10.10.11.11        4    7     880     966      6   0    0 14:31:06 3

R10#show bgp ipv4 multicast summary | begin ^Neigh
Neighbor           V   AS MsgRcvd MsgSent TblVer InQ OutQ Up/Down  State/PfxRcd
9.0.0.5            4    9    5686    5660      6   0    0 14:57:00 1
9.0.0.6            4    9    5654    5666      6   0    0 14:56:52 2
10.10.11.11        4    7     880     966      6   0    0 14:31:06 3

R10#show bgp ipv6 unicast summary | begin ^Neigh
Neighbor           V   AS MsgRcvd MsgSent TblVer InQ OutQ Up/Down  State/PfxRcd
2009::5            4    9    5680    5766     61   0    0 14:57:05 1
2009::6            4    9    5731    5770     61   0    0 14:56:50 2
FD00:10:10:11::11  4    7     880     966     61   0    0 14:30:59 3

R10#show bgp ipv6 multicast summary | begin ^Neigh
Neighbor           V   AS MsgRcvd MsgSent TblVer InQ OutQ Up/Down  State/PfxRcd
2009::5            4    9    5680    5766     61   0    0 14:57:05 1
2009::6            4    9    5731    5770     61   0    0 14:56:50 2
FD00:10:10:11::11  4    7     880     966     61   0    0 14:30:59 3

RP/0/0/CPU0:XRv1#show bgp ipv4 unicast summary | begin ^Neigh
Neighbor           Spk AS MsgRcvd MsgSent TblVer InQ OutQ Up/Down  St/PfxRcd
7.0.0.7            0    7    5571    5267      6   0    0 14:37:05 1
7.0.0.14           0    7    5255    5266      6   0    0 14:35:14 2
10.10.11.10        0    9     966     880      6   0    0 14:31:00 1

RP/0/0/CPU0:XRv1#show bgp ipv4 multicast summary | begin ^Neigh
Neighbor           Spk AS MsgRcvd MsgSent TblVer InQ OutQ Up/Down  St/PfxRcd
7.0.0.7            0    7    5571    5267      7   0    0 14:37:05 1
7.0.0.14           0    7    5255    5266      7   0    0 14:35:14 2
10.10.11.10        0    9     966     880      7   0    0 14:31:00 1

RP/0/0/CPU0:XRv1#show bgp ipv6 unicast summary | begin ^Neigh
Neighbor           Spk AS MsgRcvd MsgSent TblVer InQ OutQ Up/Down  St/PfxRcd
2007::7            0    7    5560    5265      6   0    0 14:37:01 1
2007::14           0    7    5253    5265      6   0    0 14:35:14 2
fd00:10:10:11::10  0    9     966     880      6   0    0 14:30:53 1

RP/0/0/CPU0:XRv1#show bgp ipv6 multicast summary | begin ^Neigh
Neighbor           Spk AS MsgRcvd MsgSent TblVer InQ OutQ Up/Down  St/PfxRcd
2007::7            0    7    5560    5265      6   0    0 14:37:01 1
2007::14           0    7    5253    5265      6   0    0 14:35:14 2
fd00:10:10:11::10  0    9     966     880      6   0    0 14:30:53 1
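When the same four commands are pasted on many routers, it can help to check the captured text mechanically. The following is only a minimal sketch of that idea: it treats any row whose last column is a number as an established session, since both XE and XR print a prefix count there once the session is up and a state name such as Idle or Active otherwise. The function name and the sample text are my own; this is not a Cisco tool.

```python
def established_neighbors(summary):
    """Map neighbor address -> prefixes received for established sessions."""
    result = {}
    for line in summary.splitlines():
        fields = line.split()
        # Skip blank lines, CLI prompts, and the header row.
        if len(fields) < 2 or fields[0] == "Neighbor" or "#" in fields[0]:
            continue
        # A numeric last column is a prefix count, so the session is up.
        if fields[-1].isdigit():
            result[fields[0]] = int(fields[-1])
    return result


SAMPLE = """\
Neighbor        V   AS MsgRcvd MsgSent TblVer InQ OutQ Up/Down  State/PfxRcd
9.0.0.5         4    9    5686    5660      6   0    0 14:57:00 1
9.0.0.6         4    9    5654    5666      6   0    0 14:56:52 2
10.10.11.11     4    7     880     966      6   0    0 14:31:06 3
"""

if __name__ == "__main__":
    print(established_neighbors(SAMPLE))
```

A neighbor that is configured but missing from the returned dictionary is the one to investigate first.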

At this point, all test host routers (CSR1 through CSR4) should be able to ping one another. Using a TCL script, I test this quickly on these four routers (ping results not shown for brevity).

! CSR1 through CSR4
tclsh
foreach i {
9.2.5.2
7.1.7.1
8.3.8.3
11.4.13.4
} { ping $i timeout 1 repeat 3 }

Additionally, all BGP loopbacks should have reachability between one another. This will be important for the MSDP testing later. We will initiate this from the XE BGP speakers only, but that will give us a good picture as to which loopbacks are reachable since we can probe the XR loopbacks as well.

! All XE BGP speakers
tclsh
foreach i {
7.0.0.7
7.0.0.11
7.0.0.14
8.0.0.12
8.0.0.8
11.0.0.9
11.0.0.13
9.0.0.5
9.0.0.10
9.0.0.6
} { ping $i timeout 1 repeat 3 source loopback 0 }

PIMv2 is enabled on all interfaces in the topology except on the four test hosts. We can quickly test this by looking at the IPv4 and IPv6 PIM neighbors by using copy/paste again. This is critical for multicast transport, and PIM must also be enabled on all inter-AS transit links for multicast to flow between AS boundaries.

! All PIM XE speakers
show ip pim neighbor | begin ^Neigh
show ipv6 pim neighbor | begin ^Neigh
! All PIM XR speakers
show pim neighbor
show pim ipv6 neighbor

R10#show ip pim neighbor | begin ^Neigh
Neighbor      Interface              Uptime/Expires     Ver  DR
Address                                                      Prio/Mode
9.6.10.6      GigabitEthernet2.560   16:28:04/00:01:21  v2   1 / S P G
9.5.10.5      GigabitEthernet2.550   16:25:53/00:01:33  v2   1 / S P G
10.10.11.11   GigabitEthernet2.501   16:19:57/00:01:24  v2   1 / DR P G

R10#show ipv6 pim neighbor | begin ^Neigh
Neighbor Address   Interface   Uptime    Expires   Mode  DR pri
FE80::5            Gi2.550     16:54:42  00:01:23  B G   1
FE80::11           Gi2.501     16:08:59  00:01:32  B G   DR 1
FE80::6            Gi2.560     16:52:11  00:01:25  B G   1

RP/0/0/CPU0:XRv1#show pim neighbor
[snip]
Neighbor Address  Interface              Uptime    Expires   DR pri  Flags
10.10.11.10       GigabitEth0/0/0/0.501  16:20:07  00:01:30  1       P
10.10.11.11*      GigabitEth0/0/0/0.501  16:20:14  00:01:15  1 (DR)  B P E
7.11.14.11*       GigabitEth0/0/0/0.514  16:20:14  00:01:19  1       B P E
7.11.14.14        GigabitEth0/0/0/0.514  16:17:02  00:01:28  1 (DR)  B P
7.7.11.7          GigabitEth0/0/0/0.571  16:20:09  00:01:41  1       P
7.7.11.11*        GigabitEth0/0/0/0.571  16:20:14  00:01:35  1 (DR)  B P E
7.0.0.11*         Loopback0              16:20:14  00:01:22  1 (DR)  B P E

RP/0/0/CPU0:XRv1#show pim ipv6 neighbor
[snip]
GigabitEthernet0/0/0/0.501
Neighbor Address  Uptime    Expires   DR pri  DR    Flags
fe80::10          16:09:09  00:01:24  1             B
fe80::11*         16:09:15  00:01:22  1       (DR)  B P

GigabitEthernet0/0/0/0.514
Neighbor Address  Uptime    Expires   DR pri  DR    Flags
fe80::11*         16:09:15  00:01:35  1             B P
fe80::14          16:08:56  00:01:39  1       (DR)  B P

GigabitEthernet0/0/0/0.571
Neighbor Address  Uptime    Expires   DR pri  DR    Flags
fe80::7           16:09:12  00:01:23  1             B
fe80::11*         16:09:15  00:01:33  1       (DR)  B P

Next, we will configure the rendezvous points (RP). Each AS has at least one RP to support PIM sparse mode (PIM-SM). Routers not contained within the AS boundaries are used to simulate hosts that can send and receive multicast traffic. Rather than simulate this using router loopbacks, this lab will try to make the testing more realistic by having PIM designated routers (DR) perform their natural roles on the LAN segments they serve. AS 7 uses PIM bootstrap router (BSR), which is built into PIMv2; XRv1 is the RP for that AS. AS 9 uses Cisco’s AutoRP, which was used in PIMv1 for RP discovery and dissemination; CSR10 is the RP for that AS. AS 8 and AS 11 use the static RP method given their small size. CSR9 is the RP for AS 11 and XRv2 is the RP for AS 8. Because the focus of this lab is inter-AS multicast and MSDP, we will not be examining all of the advanced PIM features here. I configure the RPs for IPv4 only; IPv6 inter-AS multicast is a very different topic and is examined later.

! XRv1
router pim
 address-family ipv4
  bsr candidate-bsr 7.0.0.11 hash-mask-len 30 priority 1
  bsr candidate-rp 7.0.0.11 priority 192 interval 60
! CSR10
ip pim autorp listener
ip pim send-rp-announce Loopback0 scope 1
ip pim send-rp-discovery Loopback0 scope 2
! CSR6 and CSR5
ip pim autorp listener
! CSR9
ip pim rp-address 11.0.0.9
! XRv3
router pim
 address-family ipv4
  rp-address 11.0.0.9
! XRv2
router pim
 address-family ipv4
  rp-address 8.0.0.12
! CSR8
ip pim rp-address 8.0.0.12

First, we will verify the routers in AS 7 have all learned the RP through BSR. In this case, the RP and BSR address is 7.0.0.11; I highlight the RP address in yellow and the BSR address in green. This is straightforward with no surprises.

R7#show ip pim rp mapping
PIM Group-to-RP Mappings

Group(s) 224.0.0.0/4
  RP 7.0.0.11 (?), v2
    Info source: 7.0.0.11 (?), via bootstrap, priority 192, holdtime 150
         Uptime: 00:03:00, expires: 00:01:40

RP/0/0/CPU0:XRv1#show pim rp mapping
PIM Group-to-RP Mappings
Group(s) 224.0.0.0/4
  RP 7.0.0.11 (?), v2
    Info source: 7.0.0.11 (?), elected via bsr, priority 192, holdtime 150
      Uptime: 00:03:48, expires: 00:01:56

RP/0/0/CPU0:XRv4#show pim rp mapping
PIM Group-to-RP Mappings
Group(s) 224.0.0.0/4
  RP 7.0.0.11 (?), v2
    Info source: 7.11.14.11 (?), elected via bsr, priority 192, holdtime 150
      Uptime: 00:04:16, expires: 00:02:28

Next, we verify the RP mappings in AS 9. The immediate problem we find with PIM BSR and AutoRP is that this dynamic RP signaling leaks beyond AS boundaries. This is generally undesirable unless the ASes share an RP, which is highly unlikely in real life and certainly not the case in this lab. The most obvious solution would be to ensure ASes did not have reachability between RPs at all, but that would break inter-AS multicast entirely. Since the loopbacks are advertised into BGP, reachability exists and RPF passes, which is why RP information is currently allowed to leak. On CSR10, we can see two RP mappings, one from AutoRP and one from BSR.

R10#show ip pim rp mapping
PIM Group-to-RP Mappings
This system is an RP (Auto-RP)
This system is an RP-mapping agent (Loopback0)

Group(s) 224.0.0.0/4
  RP 9.0.0.10 (?), v2v1
    Info source: 9.0.0.10 (?), elected via Auto-RP
         Uptime: 00:06:30, expires: 00:02:28
  RP 7.0.0.11 (?), v2
    Info source: 7.0.0.11 (?), via bootstrap, priority 192, holdtime 150
         Uptime: 00:05:23, expires: 00:02:18

To see which one is actually preferred, we can test it with the RP-hash command. We can see that the AutoRP mapping is preferred, which would technically work for intra-AS multicast in AS 9, but it is very sloppy. AS 7 did not learn the AutoRP mappings because its routers were not configured to listen for them. Additionally, those mappings are flooded in dense mode (which is what the listener command facilitates), which isn’t operational on the XR boxes presently.

R10#show ip pim rp-hash 224.1.1.1
  RP 9.0.0.10 (?), v2v1
    Info source: 9.0.0.10 (?), elected via Auto-RP
         Uptime: 00:09:43, expires: 00:02:14
  PIMv2 Hash Value (mask 255.255.255.252)
    RP 9.0.0.10, via Auto-RP
    RP 7.0.0.11, via bootstrap, priority 192, hash value 930306691
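The “hash value” in that output is not arbitrary. As a sketch, assuming the router implements the hash function from the PIM-SM specification (RFC 7761, section 4.7.2), we can reproduce it from the group address, the hash mask length of 30 configured on the candidate BSR (mask 255.255.255.252), and the candidate RP address. The function name is my own. Note that the hash only breaks ties among BSR-learned candidate RPs; it does not explain why the AutoRP mapping wins here.

```python
import ipaddress


def bsr_hash(group, mask_len, rp):
    """PIMv2 BSR hash for one (group, candidate RP) pair.

    Value = (1103515245 * ((1103515245 * (G & M) + 12345) XOR C) + 12345) mod 2^31
    """
    g = int(ipaddress.IPv4Address(group))
    m = (0xFFFFFFFF << (32 - mask_len)) & 0xFFFFFFFF  # hash mask from its length
    c = int(ipaddress.IPv4Address(rp))
    inner = (1103515245 * (g & m) + 12345) ^ c
    return (1103515245 * inner + 12345) % 2**31


if __name__ == "__main__":
    # Group and RP taken from the "show ip pim rp-hash 224.1.1.1" output.
    print(bsr_hash("224.1.1.1", 30, "7.0.0.11"))  # -> 930306691
```

With a 30-bit mask, four consecutive groups (224.1.1.0 through 224.1.1.3) hash to the same value and therefore always map to the same candidate RP.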

We can look at CSR9 to see that it learns XRv1 as the RP, which overrides the static configuration. This is a bigger deal since their intra-AS multicast would now be broken as the local RP would not be used. We could use the “override” keyword on CSR9 so that the static mapping (of equal prefix length) overrides any dynamic mappings. This is also a sloppy solution and is not preferable in this situation.

R9#show ip pim rp mapping
PIM Group-to-RP Mappings

Group(s) 224.0.0.0/4
  RP 7.0.0.11 (?), v2
    Info source: 7.0.0.11 (?), via bootstrap, priority 192, holdtime 150
         Uptime: 00:11:59, expires: 00:01:44
Group(s): 224.0.0.0/4, Static
    RP: 11.0.0.9 (?)

The better solution is to filter the BSR messages bi-directionally on all inter-AS links where BSR is enabled. We only need to do this on XRv1 and XRv4, but for extra protection, it makes sense to configure it on all transit interfaces. This way, it protects the non-BSR AS from accidentally learning BSR information. We do not have the granularity to selectively filter group/RP mappings from the BSR message; we can only turn the filter off or on. I show the configuration on XRv4 and CSR6 for brevity, but I apply the “bsr-border” command on all transit links.

! XRv4
router pim
 address-family ipv4
  interface GigabitEthernet0/0/0/0.524
   bsr-border
  interface GigabitEthernet0/0/0/0.594
   bsr-border
! CSR6
interface GigabitEthernet2.562
 ip pim bsr-border
interface GigabitEthernet2.569
 ip pim bsr-border

On XE, we can verify it by checking the PIM interface summary to find an asterisk after the PIM operating mode on each interface. Alternatively, we can look at the interface details to see it confirmed more explicitly. There does not appear to be an XR show command to verify this.

R6#show ip pim interface gig2.569
Address    Interface              Ver/    Nbr    Query  DR     DR
                                  Mode    Count  Intvl  Prior
10.6.9.6   GigabitEthernet2.569   v2/S *  1      30     1      10.6.9.9

R6#show ip pim interface gig2.569 detail
GigabitEthernet2.569 is up, line protocol is up
  Internet address is 10.6.9.6/24
  Multicast switching: fast
  Multicast packets in/out: 0/26
  Multicast TTL threshold: 0
  PIM: enabled
    PIM version: 2, mode: sparse
    PIM DR: 10.6.9.9
    PIM neighbor count: 1
    PIM Hello/Query interval: 30 seconds
    PIM Hello packets in/out: 2140/2152
    PIM J/P interval: 60 seconds
    PIM State-Refresh processing: enabled
    PIM State-Refresh origination: disabled
    PIM NBMA mode: disabled
    PIM ATM multipoint signalling: disabled
    PIM domain border: enabled
    PIM neighbors rpf proxy capable: TRUE
    PIM BFD: disabled
    PIM Non-DR-Join: FALSE
  Multicast Tagswitching: disabled

Quickly checking CSR10 and CSR9, we can see their RP mapping tables look correct now. The RP addresses are shown in yellow and the information sources are shown in green. They no longer learn the BSR information from AS 7.

R10#show ip pim rp mapping
PIM Group-to-RP Mappings
This system is an RP (Auto-RP)
This system is an RP-mapping agent (Loopback0)

Group(s) 224.0.0.0/4
  RP 9.0.0.10 (?), v2v1
    Info source: 9.0.0.10 (?), elected via Auto-RP
         Uptime: 00:29:07, expires: 00:02:53

R9#show ip pim rp mapping
PIM Group-to-RP Mappings

Group(s): 224.0.0.0/4, Static
    RP: 11.0.0.9 (?)

Although less of a problem, the AutoRP announcements are still being sent out of AS 9. Looking at CSR6, we can confirm this, as these groups are secretly operating in dense mode. Group 224.0.1.39 is used by candidate-RPs to notify the mapping agent (MA) about their candidacy, and group 224.0.1.40 is used to disseminate the RP information to clients. CSR6 sees traffic from 9.0.0.10 towards 224.0.1.40, which carries the mapping advertisements, and floods it out of the AS.

R6#show ip mroute 224.0.1.40 9.0.0.10 | begin \(
(9.0.0.10, 224.0.1.40), 00:30:09/00:02:31, flags: LT
  Incoming interface: GigabitEthernet2.560, RPF nbr 9.6.10.10
  Outgoing interface list:
    Loopback0, Forward/Sparse, 00:30:09/stopped
    GigabitEthernet2.562, Forward/Sparse, 00:30:09/stopped
    GigabitEthernet2.569, Forward/Sparse, 00:30:09/stopped

There isn’t an easy button to solve this, so we can manually remove groups 224.0.1.39 and 224.0.1.40 from the mroute OIL using a multicast boundary. We apply this to all transit links on all routers, but like BSR, it is most important to do it on the AS 9 ASBRs. I show the configuration on CSR6 and XRv1.

! CSR6
ip access-list extended ACL_DENY_AUTORP
 deny ip any host 224.0.1.39
 deny ip any host 224.0.1.40
 permit ip any any
interface GigabitEthernet2.562
 ip multicast boundary ACL_DENY_AUTORP
interface GigabitEthernet2.569
 ip multicast boundary ACL_DENY_AUTORP
! XRv1
ipv4 access-list ACL_DENY_AUTORP
 10 deny ipv4 any host 224.0.1.39
 20 deny ipv4 any host 224.0.1.40
 30 permit ipv4 any any
multicast-routing
 address-family ipv4
  interface GigabitEthernet0/0/0/0.501
   boundary ACL_DENY_AUTORP
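To make the intent of the boundary ACL explicit, here is a small model of how its three entries are evaluated against a group address. This is only a sketch of first-match ACL logic with hypothetical names; it ignores the source field (always “any” in this ACL) and is not how the router implements the boundary internally.

```python
import ipaddress

# (action, group) entries mirroring ACL_DENY_AUTORP; None stands for "any".
ACL_DENY_AUTORP = [
    ("deny", "224.0.1.39"),    # AutoRP announce group
    ("deny", "224.0.1.40"),    # AutoRP discovery group
    ("permit", None),          # everything else crosses the boundary
]


def boundary_permits(group, acl=ACL_DENY_AUTORP):
    """Return True if the group is allowed across the boundary interface."""
    addr = ipaddress.IPv4Address(group)
    for action, match in acl:
        # First matching entry wins, as with any ACL.
        if match is None or addr == ipaddress.IPv4Address(match):
            return action == "permit"
    return False  # implicit deny if nothing matched


if __name__ == "__main__":
    for g in ("224.0.1.39", "224.0.1.40", "225.0.1.3"):
        print(g, "permit" if boundary_permits(g) else "deny")
```

The final permit entry matters: without it, the implicit deny would block every group on the transit link, including the test groups used later in this section.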

First, we check the configuration to make sure the list was applied properly. We can see the filter as a multicast boundary inbound and outbound, and we see hits against the ACL entries matching those groups. This is a good indication that the filter is working.

R6#show ip multicast interface gig2.569
GigabitEthernet2.569 is up, line protocol is up
  Internet address is 10.6.9.6/24
  Multicast routing: enabled
  Multicast switching: fast
  Multicast packets in/out: 0/36
  Multicast boundary: ACL_DENY_AUTORP (in/out)
  Multicast TTL threshold: 0
  Multicast Tagswitching: disabled

R6#show access-l ACL_DENY_AUTORP
Extended IP access list ACL_DENY_AUTORP
    10 deny ip any host 224.0.1.39 (2 matches)
    20 deny ip any host 224.0.1.40 (4 matches)
    30 permit ip any any

This is also true for XRv1, though the commands are slightly different. Looking at the MRIB before the AutoRP group times out, we can see that XRv1 did see traffic from 9.0.0.10 for 224.0.1.40. This times out soon after the filters are applied and the entry is removed.

RP/0/0/CPU0:XRv1#show mrib route 224.0.1.40 9.0.0.10
[snip]
(9.0.0.10,224.0.1.40) RPF nbr: 0.0.0.0 Flags: RPF
  Up: 00:04:18
  Outgoing Interface List
    Loopback0 Flags: F IC, Up: 00:04:18
    GigabitEthernet0/0/0/0.514 Flags: F, Up: 00:04:18
    GigabitEthernet0/0/0/0.571 Flags: F, Up: 00:04:18

RP/0/0/CPU0:XRv1#show mrib route 224.0.1.40 9.0.0.10
No matching route in MRIB route-DB

We can further confirm it worked by checking the (S,G) state for the AutoRP mapping group to ensure it is only being locally consumed (forwarded to loopback0 only) or forwarded within the AS. This ensures that AS 9, the AS using AutoRP, doesn’t leak information out of its AS in the first place.

R6#show ip mroute 224.0.1.40 9.0.0.10 | begin \(
(9.0.0.10, 224.0.1.40), 00:35:15/00:02:22, flags: LT
  Incoming interface: GigabitEthernet2.560, RPF nbr 9.6.10.10
  Outgoing interface list:
    Loopback0, Forward/Sparse, 00:35:15/stopped

With multiple disparate RPs servicing overlapping ASM groups, we must use MSDP to build a topology in which they can be tied together. MSDP is very similar to BGP as it is TCP based (port 639) and is designed to notify neighboring ASes about active sources within the local AS. These source active (SA) messages are like “long-range” PIM register messages. The general value of this is that any RPs with existing (*,G) trees in a remote AS will be notified of active sources in the local AS. These remote RPs can issue (S,G) joins back to the source in another AS, and those joins may not even transit the MSDP peers (this doesn’t matter; MSDP is used for signaling only). The original intent of MSDP is that it would mirror the BGP topology exactly, at least with respect to RPs. Though MSDP can be configured on non-RPs, this only makes sense in hierarchical architectures where some aggregation nodes might reduce the overall number of MSDP peers required.

For now, we will configure MSDP peers between the four ASes in a near-full mesh, except that AS 8 and AS 11 will not have a direct peering. The basic MSDP peer configuration is simple, and the remote-AS field is optional. If not configured, it can be derived from BGP or set to zero, depending on software versions. I personally prefer to set the value for completeness.

! CSR10
ip msdp peer 7.0.0.11 connect-source Loopback0 remote-as 7
ip msdp peer 8.0.0.12 connect-source Loopback0 remote-as 8
ip msdp peer 11.0.0.9 connect-source Loopback0 remote-as 11
! CSR9
ip msdp peer 7.0.0.11 connect-source Loopback0 remote-as 7
ip msdp peer 9.0.0.10 connect-source Loopback0 remote-as 9
! XRv1
router msdp
 peer 8.0.0.12
  connect-source Loopback0
  remote-as 8
 peer 9.0.0.10
  connect-source Loopback0
  remote-as 9
 peer 11.0.0.9
  connect-source Loopback0
  remote-as 11
! XRv2
router msdp
 peer 7.0.0.11
  connect-source Loopback0
  remote-as 7
 peer 9.0.0.10
  connect-source Loopback0
  remote-as 9
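Since an MSDP peering must be configured on both ends, a quick sanity check of the intended mesh can save some troubleshooting. The sketch below models the peer statements above as a dictionary keyed by each router's Loopback0 address (the data structure and function names are hypothetical) and reports any one-sided peerings as well as the pairs missing from a full mesh.

```python
from itertools import combinations

# Loopback0 of each MSDP speaker -> peers it has configured (from the configs above).
PEERS = {
    "9.0.0.10": {"7.0.0.11", "8.0.0.12", "11.0.0.9"},   # CSR10, AS 9
    "11.0.0.9": {"7.0.0.11", "9.0.0.10"},               # CSR9, AS 11
    "7.0.0.11": {"8.0.0.12", "9.0.0.10", "11.0.0.9"},   # XRv1, AS 7
    "8.0.0.12": {"7.0.0.11", "9.0.0.10"},               # XRv2, AS 8
}


def one_sided(peers):
    """Peerings configured on one router but not on the other."""
    return sorted(
        (local, remote)
        for local, remotes in peers.items()
        for remote in remotes
        if local not in peers.get(remote, set())
    )


def missing_from_full_mesh(peers):
    """Router pairs that have no peering configured at all."""
    return [
        (a, b)
        for a, b in combinations(sorted(peers), 2)
        if b not in peers[a] and a not in peers[b]
    ]


if __name__ == "__main__":
    print("one-sided:", one_sided(PEERS))
    print("not peered:", missing_from_full_mesh(PEERS))
```

For this lab, the only pair without a peering should be the AS 8 and AS 11 RPs, which matches the near-full mesh described above.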

When the peers start to come up, MSDP prints syslog messages to inform us. Below are the log messages on CSR10 and XRv1 from their mutual peering being established.

! CSR10
%MSDP-5-PEER_UPDOWN: Session to peer 7.0.0.11 going up
! XRv1
msdp[1045]: %ROUTING-MSDP-5-INIT_PEER_UP_DOWN : MSDP peer up: 9.0.0.10

To see the status of all peers, the MSDP summary table displays it clearly on both XE and XR. Since AS 8 and AS 11 don’t peer directly, we can conclude that all peers are up by checking CSR10 and XRv1. The “peer name” is the DNS-resolved name for the peer, and the question mark means that resolution was not performed. This is not an error condition and has no effect on MSDP operations.

R10#show ip msdp summary
MSDP Peer Status Summary
Peer Address  AS  State  Uptime/   Reset  SA     Peer Name
                         Downtime  Count  Count
7.0.0.11      7   Up     00:01:43  0      0      ?
8.0.0.12      8   Up     00:01:46  0      0      ?
11.0.0.9      11  Up     00:03:51  0      0      ?

RP/0/0/CPU0:XRv1#show msdp summary
Out of Resource Handling Enabled
Maximum External SA's Global : 20000
Current External Active SAs : 0

MSDP Peer Status Summary
Peer Address  AS  State  Uptime/   Reset  Peer  Active  Cfg.Max  TLV
                         Downtime  Count  Name  SA Cnt  Ext.SAs  recv/sent
8.0.0.12      8   Up     00:02:46  0      ?     0       0        6/6
9.0.0.10      9   Up     00:02:46  0      ?     0       0        3/6
11.0.0.9      11  Up     00:00:15  0      ?     0       0        1/1

We can look at the details on each peer, much like we would with the “show bgp neighbor” command. There is not much interesting traffic at present, but we learn some new things about MSDP. We can see that the SA filtering options are very flexible as we can apply route-maps in/out to filter RP information and (S,G) information. The outputs on both routers also show us detailed statistics for messages exchanged. One of the most important message counters is the MSDP RPF failure, which is discussed in detail later. Last, MSDP is also capable of MD5 authentication, much like BGP. Notice that XR shows the keep-alive timer at 30 seconds with a dead timer of 2.5 times that (75 seconds).

R10#show ip msdp peer 7.0.0.11
MSDP Peer 7.0.0.11 (?), AS 7 (configured AS)
  Connection status:
    State: Up, Resets: 0, Connection source: Loopback0 (9.0.0.10)
    Uptime(Downtime): 00:05:31, Messages sent/received: 6/12
    Output messages discarded: 0
    Connection and counters cleared 00:09:14 ago
  SA Filtering:
    Input (S,G) filter: none, route-map: none
    Input RP filter: none, route-map: none
    Output (S,G) filter: none, route-map: none
    Output RP filter: none, route-map: none
  SA-Requests:
    Input filter: none
  Peer ttl threshold: 0
  SAs learned from this peer: 0
  Number of connection transitions to Established state: 1
    Input queue size: 0, Output queue size: 0
  MD5 signature protection on MSDP TCP connection: not enabled
  Message counters:
    RPF Failure count: 0
    SA Messages in/out: 0/0
    SA Requests in: 0
    SA Responses out: 0
    Data Packets in/out: 0/0

RP/0/0/CPU0:XRv1#show msdp peer 9.0.0.10
MSDP Peer 9.0.0.10 (?), AS 9
Description:
  Connection status:
    State: Up, Resets: 0, Connection Source: 7.0.0.11
    Uptime(Downtime): 00:06:04, SA messages received: 0
    TLV messages sent/received: 13/7
  Output messages discarded: 0
    Connection and counters cleared 00:06:04 ago
  SA Filtering:
    Input (S,G) filter: none
    Input RP filter: none
    Output (S,G) filter: none
    Output RP filter: none
  SA-Requests:
    Input filter: none
    Sending SA-Requests to peer: disabled
  Password: None
  Peer ttl threshold: 2
  Input queue size: 0, Output queue size: 0
  KeepAlive timer period: 30
  Peer Timeout timer period: 75
  NSR:
    State: Unknown, Oper-Downs: 0
    NSR-Uptime(NSR-Downtime): 2w0d

To confirm that MSDP uses TCP port 639, we can check the TCP details to see the sessions established. Both CSR10 and XRv1 show 3 MSDP sessions, each to the proper peer. We also note that the lower IP address is responsible for opening the MSDP session (that is, the lower peer IP is the TCP client).

R10#show tcp brief | include 639
7FBC42BF0C40  9.0.0.10.639    8.0.0.12.65296  ESTAB
7FBC42BF7E20  9.0.0.10.30326  11.0.0.9.639    ESTAB
7FBC42BF15C0  9.0.0.10.639    7.0.0.11.30561  ESTAB

RP/0/0/CPU0:XRv1#show tcp brief | include :639
0x10151770 0x60000000  0  0  7.0.0.11:24200  8.0.0.12:639  ESTAB
0x101b1638 0x60000000  0  0  7.0.0.11:30561  9.0.0.10:639  ESTAB
0x101c71fc 0x60000000  0  0  7.0.0.11:23252  11.0.0.9:639  ESTAB
0x101b32a8 0x60000000  0  0  0.0.0.0:639     0.0.0.0:0     LISTEN
0x101b1a84 0x00000000  0  0  0.0.0.0:639     0.0.0.0:0     LISTEN
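The client/server roles seen in the TCP tables follow a simple rule, which is consistent with the connection procedure in the MSDP specification (RFC 3618): the peer with the lower address actively opens the connection to port 639 and the peer with the higher address listens. A minimal sketch of that comparison, with a function name of my own choosing:

```python
import ipaddress

MSDP_PORT = 639


def msdp_tcp_roles(local, remote):
    """Return (client, server): the lower address opens the TCP session."""
    a = ipaddress.IPv4Address(local)
    b = ipaddress.IPv4Address(remote)
    client, server = (a, b) if a < b else (b, a)
    return str(client), str(server)


if __name__ == "__main__":
    # CSR10 (9.0.0.10) against each of its three peers.
    for peer in ("7.0.0.11", "8.0.0.12", "11.0.0.9"):
        client, server = msdp_tcp_roles("9.0.0.10", peer)
        print(f"{client} connects to {server}:{MSDP_PORT}")
```

This lines up with the CSR10 output: it is the server (local port 639) towards 7.0.0.11 and 8.0.0.12, and the client (ephemeral local port) towards 11.0.0.9.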

We will configure MD5 authentication between CSR10 and XRv1. Normally, one would configure this everywhere, but for brevity I limit it to one connection. Key-chains are not supported. We can verify this by checking the peer details and filtering for the important output.

! CSR10
ip msdp password peer 7.0.0.11 AS7AS9_MSDP
! XRv1
router msdp
 peer 9.0.0.10
  password clear AS7AS9_MSDP

R10#show ip msdp peer 7.0.0.11 | include MD5
  MD5 signature protection on MSDP TCP connection: enabled

RP/0/0/CPU0:XRv1#show msdp peer 9.0.0.10 | include Password
  Password: Configured, set on active socket


We can also verify this with low-level TCP commands. Recovering the TCB from the TCP brief for a particular TCP session, we look at the option details to find the md5 flag set. This does not appear to be visible on XR.

R10#show tcp brief | include 7.0.0.11
7FBC42BF6068  9.0.0.10.639  7.0.0.11.45268  ESTAB

R10#show tcp tcb 7FBC42BF6068 | begin ^Option
Option Flags: md5, Retrans timeout
[snip]

Next, we will adjust the session timers between CSR10 and XRv1. For faster dead-peer detection, I will reduce the keep-alive timer to 15 seconds and the maximum time to wait for a message (dead-timer, basically) to 35 seconds.

! CSR10
ip msdp keepalive 7.0.0.11 15 35
! XRv1
router msdp
 peer 9.0.0.10
  keepalive 15 35

This information does not appear to be visible on XE, so we will verify it on XR only.

RP/0/0/CPU0:XRv1#show msdp peer 9.0.0.10 | include timer period
  KeepAlive timer period: 15
  Peer Timeout timer period: 35

Next, we will actually send some multicast data between sites. CSR3 will request traffic for group 225.0.1.3, which will trigger a (*,G) join from the PIM-DR (CSR8) to the local RP (XRv2). XRv2 is the root of the shared tree and is waiting to hear about active sources via PIM register messages (intra-AS) or MSDP SA messages (inter-AS).

! CSR3
interface GigabitEthernet2.538
 ip igmp join-group 225.0.1.3

R8#show ip igmp groups 225.0.1.3
IGMP Connected Group Membership
Group Address  Interface             Uptime    Expires   Last Reporter  Group Accounted
225.0.1.3      GigabitEthernet2.538  00:00:19  00:02:40  8.3.8.3

R8#show ip mroute 225.0.1.3 | begin \(
(*, 225.0.1.3), 00:00:31/00:02:28, RP 8.0.0.12, flags: SJC
  Incoming interface: GigabitEthernet2.582, RPF nbr 8.8.12.12
  Outgoing interface list:
    GigabitEthernet2.538, Forward/Sparse, 00:00:31/00:02:28

RP/0/0/CPU0:XRv2#show pim topology 225.0.1.3 | begin 225
(*,225.0.1.3) SM Up: 00:00:46 RP: 8.0.0.12*
JP: Join(never) RPF: Decapstunnel0,8.0.0.12 Flags:
  GigabitEthernet0/0/0/0.582  00:00:46  fwd Join(00:02:43)

CSR2 will be the first sender for this group. Before starting, we enable some PIM and MSDP debugging on various routers to understand the process.

! CSR5
debug ip pim 225.0.1.3
! CSR10
debug ip pim 225.0.1.3
debug ip msdp 8.0.0.12 peer
debug ip msdp 8.0.0.12 detail
! XRv2
debug msdp peer 9.0.0.10
debug mrib route
! CSR2
R2#ping 225.0.1.3 repeat 10000

As the PIM DR, CSR5 senses traffic on the LAN and originates a PIM register message towards the local RP, which is CSR10. CSR10 is not aware of any receivers for this group and would normally reply immediately with a PIM register-stop message. This time, CSR10 first originates an MSDP source-active (SA) message to notify XRv2 that there is an active source. After that, it continues with its normal PIM operation of sending a register-stop. We can see that, generally speaking, the MSDP actions happen before the PIM ones as indicated by the debug.

! CSR5
PIM(0): Check RP 9.0.0.10 into the (*, 225.0.1.3) entry
PIM(0): Building Triggered (*,G) Join / (S,G,RP-bit) Prune message for 225.0.1.3
PIM(0): Adding register encap tunnel (Tunnel1) as forwarding interface of (9.2.5.2, 225.0.1.3).
! CSR10
MSDP(0): 8.0.0.12: Building SA message from SA cache
MSDP(0): start_index = 0, sa_cache_index = 0, Qlen = 0
MSDP(0): Sent entire sa-cache, sa_cache_index = 0, Qlen = 0
PIM(0): Received v2 Register on GigabitEthernet2.550 from 9.5.10.5
     for 9.2.5.2, group 225.0.1.3
PIM(0): Check RP 9.0.0.10 into the (*, 225.0.1.3) entry
PIM(0): Adding register decap tunnel (Tunnel2) as accepting interface of (*, 225.0.1.3).
PIM(0): Adding register decap tunnel (Tunnel2) as accepting interface of (9.2.5.2, 225.0.1.3).
MSDP(0): 8.0.0.12: Send 100-byte SA encapsulated data for (9.2.5.2, 225.0.1.3), RP 9.0.0.10

Multiple things are happening concurrently now, so we will look at CSR10 first. After sending the MSDP SA message, it doesn’t wait to receive an (S,G) join from the remote AS and immediately sends a PIM register-stop to CSR5. CSR5 receives this and, as far as it knows, there are no receivers for this group, so the DR stops sending register messages. As seen above, the SA message carries that 100-byte ICMP ping as a payload, much like a register message does.

! CSR10
PIM(0): Send v2 Register-Stop to 9.5.10.5 for 9.2.5.2, group 225.0.1.3
! CSR5
PIM(0): Received v2 Register-Stop on GigabitEthernet2.550 from 9.0.0.10
PIM(0): for source 9.2.5.2, group 225.0.1.3
PIM(0): Removing register encap tunnel (Tunnel1) as forwarding interface of (9.2.5.2, 225.0.1.3).
PIM(0): Clear Registering flag to 9.0.0.10 for (9.2.5.2/32, 225.0.1.3)
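To picture what CSR10 actually puts on the wire, the sketch below builds and parses a Source-Active TLV using the IPv4 layout from the MSDP specification (RFC 3618): type 1, a 2-byte length, an entry count, the RP address, then one 12-byte (source, group) entry per source, followed by any encapsulated data. The function names are my own, and the 100 zero bytes merely stand in for the encapsulated ping; a real router encapsulates the original multicast packet.

```python
import socket
import struct

SA_TYPE = 1  # IPv4 Source-Active TLV


def build_sa(rp, source, group, payload=b""):
    """Build an SA TLV with a single (S,G) entry and optional encapsulated data."""
    # Entry: 3 reserved bytes, Sprefix Len (always 32), group, source.
    entry = struct.pack("!3xB4s4s", 32, socket.inet_aton(group), socket.inet_aton(source))
    length = 8 + len(entry) + len(payload)  # 8-byte header + entries + data
    header = struct.pack("!BHB4s", SA_TYPE, length, 1, socket.inet_aton(rp))
    return header + entry + payload


def parse_sa(msg):
    """Parse an SA TLV back into its RP, (S,G) entries, and encapsulated data."""
    msg_type, length, count, rp = struct.unpack_from("!BHB4s", msg, 0)
    entries, offset = [], 8
    for _ in range(count):
        _sprefix_len, group, source = struct.unpack_from("!3xB4s4s", msg, offset)
        entries.append((socket.inet_ntoa(source), socket.inet_ntoa(group)))
        offset += 12
    return {
        "type": msg_type,
        "rp": socket.inet_ntoa(rp),
        "entries": entries,
        "data": msg[offset:length],
    }


if __name__ == "__main__":
    # Values from the debug: (9.2.5.2, 225.0.1.3), RP 9.0.0.10, 100 bytes of data.
    sa = build_sa("9.0.0.10", "9.2.5.2", "225.0.1.3", payload=bytes(100))
    print(len(sa), parse_sa(sa)["entries"])
```

The RP address in the header is what the receiving peer later uses for its RPF check, which is why it is carried separately from the (S,G) entries.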

On the other side of the network, XRv2 receives the MSDP SA message from CSR10. The MSDP debugging doesn’t reveal much human-readable output, but we can see that a message was received from MSDP and an MRIB (S,G) entry of (9.2.5.2, 225.0.1.3) was created. This is a good indication that it successfully processed the SA message from CSR10.

! XRv2
msdp[1045]: [1] default 9.0.0.10 peer: peer 9.0.0.10 timeout timer started. Expiry in 75 secs
mrib[1146]: [ 27] TID: 0xe0000000 (9.2.5.2,225.0.1.3) Added E*, #A=0, #F=0, #MDT_A=0, Route Ver = 0x2c30
mrib[1146]: [ 12] TID: 0xe0000000 (9.2.5.2,225.0.1.3) Updated E C* RPF*, #A=1, #F=1, #MDT_A=0, RPF=8.0.0.12 [Gi0/0/0/0.582 F* NS*] [De0 A*], Route Ver = 0x2c33
msdp[1045]: [1] default 7.0.0.11 peer: peer 7.0.0.11 timeout timer started. Expiry in 75 secs
mrib[1146]: [ 27] TID: 0xe0000000 (9.2.5.2,225.0.1.3) Unchanged
mrib[1146]: [ 12] TID: 0xe0000000 (9.2.5.2,225.0.1.3) Updated E C RPF*, #A=1, #F=1, #MDT_A=0, RPF=10.6.12.6 [Gi0/0/0/0.562 NS*], Route Ver = 0x2c34

In lieu of cryptic debug, we can verify the SA-cache on XRv2 to ensure it sees the entry. We can see the (S,G), followed by the RPF source (more on this later) and the remote AS. The encapsulated data received is a byte counter for the amount of traffic carried inside of the SA message; like PIM register messages, the payload is transferred to MSDP SA messages when they are originated. The 100-byte ICMP echo was tunneled across while the inter-AS SPT was first built.

RP/0/0/CPU0:XRv2#show msdp sa-cache 225.0.1.3
MSDP Flags:
E - set MRIB E flag , L - domain local source is active,
EA - externally active source, PI - PIM is interested in the group,
DE - SAs have been denied. Timers age/expiration,
Cache Entry:
(9.2.5.2, 225.0.1.3), RP 9.0.0.10, MBGP/AS 9, 00:08:55/00:01:50
  Learned from peer 9.0.0.10, RPF peer 9.0.0.10
  SAs recvd 10, Encapsulated data received: 100
    grp flags: PI, src flags: E, EA, PI

It is important to note that the SA cache shows received SA messages. As such, CSR10 has no entries, despite originating one. Like BGP, we can see the advertised SAs by invoking the peer command with a special option. This shows the SAs advertised to a certain peer. SA messages are categorized by whether they were advertised by their presence in the MRIB (local origination) or forwarded on from the SA cache (like a string of MSDP peers).

R10#show ip msdp sa-cache 225.0.1.3
MSDP Source-Active Cache Group not found
R10#show ip msdp peer 8.0.0.12 advertised-SAs
MSDP SA advertised to peer 8.0.0.12 (?) from mroute table
225.0.1.3 9.2.5.2 (?)
MSDP SA advertised to peer 8.0.0.12 (?) from SA cache

We can further verify the MSDP statistics on XRv2. We can see that 30 SAs have been received (periodic updates) while none have been sent. The output is similar on CSR10, which shows 30 outbound SAs and none received. CSR10 also shows a single “data” packet out, which is the 100-byte ICMP echo that was MSDP-encapsulated as a carry-over from the original PIM register message from CSR5.

RP/0/0/CPU0:XRv2#show msdp statistics peer 9.0.0.10
MSDP Peer Statistics :- default
Peer 9.0.0.10 : AS is 9, State is Up, 1 active SAs
  TLV Rcvd : 33 total
    3 keepalives, 0 notifications
    30 SAs, 0 SA Requests
    0 SA responses, 0 unknowns
  TLV Sent : 59 total
    59 keepalives, 0 notifications
    0 SAs, 0 SA Requests
    0 SA responses
  SA msgs : 30 received, 0 sent

R10#show ip msdp peer 8.0.0.12 | begin Message_count
  Message counters:
    RPF Failure count: 0
    SA Messages in/out: 0/30
    SA Requests in: 0
    SA Responses out: 0
    Data Packets in/out: 0/1

The more significant effect of having received this SA message is that XRv2 can now originate an (S,G) join back to 9.2.5.2 since it is aware of a new source. We will gloss over the MSDP RPF rules for now, but we see that the MSDP RPF peer for the originating RP is CSR10 itself (9.0.0.10), that the route towards it points to CSR6 (10.6.12.6), and that MSDP instructs the router to consult the RIB for this information.

RP/0/0/CPU0:XRv2#show msdp rpf 9.0.0.10
RPF peer for 9.0.0.10 is 9.0.0.10 AS 9, rule: 1
bgp/rib lookup: nexthop: 9.0.0.10, asnum: 9
RP/0/0/CPU0:XRv2#show route 9.0.0.10
Routing entry for 9.0.0.10/32
  Known via "bgp 8", distance 20, metric 0
  Tag 9, type external
  Routing Descriptor Blocks
    10.6.12.6, from 10.6.12.6, BGP external
      Route metric is 0
  No advertising protos.

With this information, XRv2 sends the (S,G) join to CSR6. The MRIB entry also has the ‘E’ flag set which means MSDP external, equivalent to the ‘M’ flag in XE.

RP/0/0/CPU0:XRv2#show pim topology 225.0.1.3 9.2.5.2 | begin 9.2
(9.2.5.2,225.0.1.3)SPT SM Up: 00:19:10
JP: Join(00:00:39) RPF: GigabitEthernet0/0/0/0.562,10.6.12.6 Flags: KAT(00:02:04) E RA
  GigabitEthernet0/0/0/0.582 00:19:09 fwd Join(00:03:02)

CSR6 and CSR10 continue building the SPT by sending the (S,G) join towards CSR5. Notice that CSR10 has the “A” flag set to indicate that this was a candidate for MSDP advertisement, implying that an SA may have been advertised for it. CSR6 and CSR5 have no idea MSDP is even in use, and they do not need to know.

R6#show ip mroute 225.0.1.3 9.2.5.2 | begin \(
(9.2.5.2, 225.0.1.3), 00:19:38/00:01:35, flags: T
  Incoming interface: GigabitEthernet2.560, RPF nbr 9.6.10.10
  Outgoing interface list:
    GigabitEthernet2.562, Forward/Sparse, 00:19:38/00:02:51

R10#show ip mroute 225.0.1.3 9.2.5.2 | begin \(
(9.2.5.2, 225.0.1.3), 00:19:47/00:02:00, flags: TA
  Incoming interface: GigabitEthernet2.550, RPF nbr 9.5.10.5
  Outgoing interface list:
    GigabitEthernet2.560, Forward/Sparse, 00:19:47/00:03:24

R5#show ip mroute 225.0.1.3 9.2.5.2 | begin \(
(9.2.5.2, 225.0.1.3), 00:22:16/00:03:20, flags: FT
  Incoming interface: GigabitEthernet2.525, RPF nbr 0.0.0.0
  Outgoing interface list:
    GigabitEthernet2.550, Forward/Sparse, 00:22:16/00:02:48

As a final check, we look at CSR8. I do not trace the SPT switchover process, which also occurs, but suffice it to say that CSR8 has the (S,G) entry as well due to SPT switchover. Counting packets forwarded to CSR3, we can see the feature is working as expected. I issue the command a few times to ensure the packet counters continue to increment.

R8#show ip mroute 225.0.1.3 9.2.5.2 count | begin ^Group
Group: 225.0.1.3, Source count: 1, Packets forwarded: 1372, Packets received: 1372
  Source: 9.2.5.2/32, Forwarding: 1371/1/118/0, Other: 1371/0/0
R8#show ip mroute 225.0.1.3 9.2.5.2 count | begin ^Group
Group: 225.0.1.3, Source count: 1, Packets forwarded: 1422, Packets received: 1422
  Source: 9.2.5.2/32, Forwarding: 1421/1/118/0, Other: 1421/0/0
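Repeating the command by hand works, but the same check is easy to script. The following is a minimal sketch, assuming the output has already been collected as text; the function names `packets_forwarded` and `is_incrementing` are hypothetical helpers introduced here for illustration and are not part of any Cisco tooling.

```python
import re


def packets_forwarded(show_output: str) -> int:
    """Return the group-level 'Packets forwarded' counter from
    'show ip mroute ... count' output (hypothetical helper)."""
    match = re.search(r"Packets forwarded:\s*(\d+)", show_output)
    if match is None:
        raise ValueError("no 'Packets forwarded' counter found")
    return int(match.group(1))


def is_incrementing(samples: list) -> bool:
    """True when each successive sample has a strictly larger counter."""
    counts = [packets_forwarded(sample) for sample in samples]
    return all(older < newer for older, newer in zip(counts, counts[1:]))


# The two samples shown above on CSR8, reduced to their summary line.
first = ("Group: 225.0.1.3, Source count: 1, "
         "Packets forwarded: 1372, Packets received: 1372")
second = ("Group: 225.0.1.3, Source count: 1, "
          "Packets forwarded: 1422, Packets received: 1422")

print(is_incrementing([first, second]))
```

A strictly increasing counter only shows that traffic is flowing on the (S,G) entry; it says nothing about loss, so it complements rather than replaces the ping results on the source.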

To test a more complex example, we will originate traffic from CSR4 for group 225.0.1.3 at the same time. Since AS 11 and AS 8 do not have a direct MSDP connection, AS 7 and AS 9 will need to forward the SA generated by CSR9, the local RP. This behavior is the MSDP default; this is different from iBGP, which doesn’t reflect iBGP-learned routes by default. MSDP is very liberal with SA advertisements and relies on RPF for loop prevention. To test, we can simply start the ping after enabling some debugs. I use different MSDP debugs on XRv2 to try and get more valuable information. I also disable the PIM debugging since that process is essentially the same, and we are trying to learn MSDP.

! CSR9, CSR10
debug ip msdp detail
debug ip msdp peer
! XRv2, XRv1
debug msdp cache
debug msdp events
! CSR4
ping 225.0.1.3 repeat 10000 timeout 1

Upon receiving the PIM register message from XRv3, CSR9 originates the MSDP SA for (11.4.13.4, 225.0.1.3). CSR9 sends it to both XRv1 and CSR10, as we can see from the peer advertised-SAs output.

! CSR9
MSDP(0): 9.0.0.10: Originating SA message
MSDP(0): start_index = 0, mroute_cache_index = 0, Qlen = 0
MSDP(0): Sent entire mroute table, mroute_cache_index = 0, Qlen = 0
MSDP(0): 9.0.0.10: No entries found
MSDP(0): 9.0.0.10: Building SA message from SA cache
MSDP(0): start_index = 0, sa_cache_index = 0, Qlen = 0
MSDP(0): Sent entire sa-cache, sa_cache_index = 0, Qlen = 0

R9#show ip msdp peer 7.0.0.11 advertised-SAs
MSDP SA advertised to peer 7.0.0.11 (?) from mroute table
225.0.1.3 11.4.13.4 (?)
MSDP SA advertised to peer 7.0.0.11 (?) from SA cache
225.0.1.3 9.2.5.2 (?) RP: 9.0.0.10

R9#show ip msdp peer 9.0.0.10 advertised-SAs
MSDP SA advertised to peer 9.0.0.10 (?) from mroute table
225.0.1.3 11.4.13.4 (?)
MSDP SA advertised to peer 9.0.0.10 (?) from SA cache

First, we will look at CSR10. It receives the message from CSR9 with encapsulated data inside. This passes the MSDP RPF check (discussed in detail later), which means it is added to the SA cache and forwarded to other MSDP peers, except the one from which the SA was received. Because XRv1 also received this SA from CSR9 directly, it performs the same action: it installs the SA into the cache and advertises it onward.

! CSR10
MSDP(0): 11.0.0.9: Received 120-byte msg 506 from peer
MSDP(0): 11.0.0.9: SA TLV, len: 120, ec: 1, RP: 11.0.0.9, with data
MSDP(0): 11.0.0.9: Peer RPF check passed for 11.0.0.9, peer is RP
MSDP(0): WAVL Insert SA Source 11.4.13.4 Group 225.0.1.3 RP 11.0.0.9 Successful
MSDP(0): 7.0.0.11: Forward 120-byte SA (11.4.13.4, 225.0.1.3) from 11.0.0.9 to 7.0.0.11
MSDP(0): 8.0.0.12: Forward 120-byte SA (11.4.13.4, 225.0.1.3) from 11.0.0.9 to 8.0.0.12
! XRv1
msdp[1045]: [1] default cache: adding remote RP 11.0.0.9
msdp[1045]: [1] default (11.4.13.4, 225.0.1.3) cache: new entry - rpf: 11.0.0.9

This means that CSR10 and XRv1 are both going to advertise the SA for (11.4.13.4, 225.0.1.3) to one another as well. When CSR10 receives it, this packet is rejected due to failing the MSDP RPF check. The same is true on XRv1 for the SA received from CSR10. This is perfectly logical and is loosely analogous to the BGP AS-path loop-prevention mechanism. Even without understanding the details of MSDP RPF yet, we can look at the diagram and generally understand its purpose. CSR10 and XRv1 both received the SA directly from CSR9, so accepting the same SA from a second-hand peer is less desirable.

! CSR10
MSDP(0): Received 120-byte TCP segment from 7.0.0.11
MSDP(0): Append 120 bytes to 0-byte msg 507 from 7.0.0.11, qs 1
MSDP(0): 7.0.0.11: Received 120-byte msg 507 from peer
MSDP(0): 7.0.0.11: SA TLV, len: 120, ec: 1, RP: 11.0.0.9, with data
MSDP(0): 7.0.0.11: Peer RPF check failed for 11.0.0.9, EBGP route/peer in AS 7/0
! XRv1
msdp[1045]: [1] default cache: rejecting SA from peer 9.0.0.10 for RP 11.0.0.9
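The flood-and-reject behavior just described can be modeled in a few lines. This is only a sketch under stated assumptions: it covers the three routers named above (not the full lab topology), and the `RPF_PEER` table is a hypothetical stand-in for the real RPF decision, which is examined later. The names `PEERS`, `RPF_PEER`, and `flood` are invented for the illustration.

```python
from collections import deque

# MSDP peerings among the three routers discussed (subset of the lab).
PEERS = {
    "CSR9": ["CSR10", "XRv1"],
    "CSR10": ["CSR9", "XRv1"],
    "XRv1": ["CSR9", "CSR10"],
}

# Assumption: the single peer each router accepts SAs from when the
# originating RP is CSR9. The real routers derive this from BGP.
RPF_PEER = {"CSR10": "CSR9", "XRv1": "CSR9"}


def flood(originator):
    """Flood one SA; return (routers that installed it, RPF rejections)."""
    installed = []
    rejected = []  # (receiver, sender) pairs dropped by the RPF check
    queue = deque((originator, peer) for peer in PEERS[originator])
    while queue:
        sender, receiver = queue.popleft()
        if RPF_PEER.get(receiver) != sender:
            rejected.append((receiver, sender))
            continue
        if receiver in installed:
            continue
        installed.append(receiver)
        # Forward to every MSDP peer except the one the SA came from.
        for peer in PEERS[receiver]:
            if peer != sender:
                queue.append((receiver, peer))
    return installed, rejected


installed, rejected = flood("CSR9")
print("installed:", installed)
print("rejected (receiver, sender):", rejected)
```

Both routers install the SA learned directly from CSR9, and each rejects the copy relayed by the other, which matches the debugs above.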

To prove that each router installed the proper SA entry, we can check the caches. Both of them show 11.0.0.9 as the RPF peer (as opposed to one another) which is correct given the BGP topology.

RP/0/0/CPU0:XRv1#show msdp sa-cache 225.0.1.3
MSDP Flags:
E - set MRIB E flag , L - domain local source is active,
EA - externally active source, PI - PIM is interested in the group,
DE - SAs have been denied. Timers age/expiration,
Cache Entry:
(9.2.5.2, 225.0.1.3), RP 9.0.0.10, MBGP/AS 9, 00:48:42/00:02:07
  Learned from peer 9.0.0.10, RPF peer 9.0.0.10
  SAs recvd 53, Encapsulated data received: 100
    grp flags: none, src flags: EA
(11.4.13.4, 225.0.1.3), RP 11.0.0.9, MBGP/AS 11, 00:09:50/00:02:17
  Learned from peer 11.0.0.9, RPF peer 11.0.0.9
  SAs recvd 12, Encapsulated data received: 100
    grp flags: none, src flags: EA

R10#show ip msdp sa-cache 225.0.1.3
MSDP Source-Active Cache - 1 entries for 225.0.1.3
(11.4.13.4, 225.0.1.3), RP 11.0.0.9, BGP/AS 11, 00:10:02/00:05:49, Peer 11.0.0.9

There is a problem on XRv2, however. Despite both XRv1 and CSR10 advertising their SA messages to XRv2, XRv2 rejects them both. The debug reveals this and the SA cache has no entries originating in AS 11. The only entry is the first one we tested earlier from AS 9. XRv2 is claiming that both of these SAs are failing RPF.

! XRv2
msdp[1045]: [1] default cache: rejecting SA from peer 7.0.0.11 for RP 11.0.0.9
msdp[1045]: [1] default cache: rejecting SA from peer 9.0.0.10 for RP 11.0.0.9

RP/0/0/CPU0:XRv2#show msdp sa-cache 225.0.1.3
MSDP Flags:
E - set MRIB E flag , L - domain local source is active,
EA - externally active source, PI - PIM is interested in the group,
DE - SAs have been denied. Timers age/expiration,
Cache Entry:
(9.2.5.2, 225.0.1.3), RP 9.0.0.10, MBGP/AS 9, 00:51:05/00:01:57
  Learned from peer 9.0.0.10, RPF peer 9.0.0.10
  SAs recvd 55, Encapsulated data received: 100
    grp flags: PI, src flags: E, EA, PI

XR has a special debug command for troubleshooting MSDP RPF issues. I enable it on XRv2 to get more details, along with MSDP TLV debugging which will reveal the contents of the SA TLVs. The router is indicating that there is an RPF route towards 11.0.0.9 (via BGP) but there isn’t an MSDP peer. This is by design as we never configured a peer between those ASes.

RP/0/0/CPU0:XRv2#debug msdp rpf
RP/0/0/CPU0:XRv2#debug msdp tlv
msdp[1045]: [1] default 7.0.0.11 tlv: ReadQ: 20 bytes in buffer
msdp[1045]: [1] default 7.0.0.11 tlv: ReadQ: TLV parsed length = 20, type = 1
msdp[1045]: [1] default 7.0.0.11 tlv: 7.0.0.11: recvd
msdp[1045]: [1] default 7.0.0.11 tlv: RP 11.0.0.9 has 1 entries
msdp[1045]: [1] default 7.0.0.11 tlv: -- (11.4.13.4/32 225.0.1.3)
msdp[1045]: [1] default 7.0.0.11 tlv:
msdp[1045]: [1] 11.0.0.9 RPF: Found RPF but no rpf peer for 11.0.0.9 - Failed
msdp[1045]: [1] 11.0.0.9 RPF: RPF lookup for 11.0.0.9 failed for peer 7.0.0.11: No Peer prefered
msdp[1045]: [1] default 9.0.0.10 tlv: ReadQ: 20 bytes in buffer
msdp[1045]: [1] default 9.0.0.10 tlv: ReadQ: TLV parsed length = 20, type = 1
msdp[1045]: [1] default 9.0.0.10 tlv: 9.0.0.10: recvd
msdp[1045]: [1] default 9.0.0.10 tlv: RP 11.0.0.9 has 1 entries
msdp[1045]: [1] default 9.0.0.10 tlv: -- (11.4.13.4/32 225.0.1.3)
msdp[1045]: [1] default 9.0.0.10 tlv:
msdp[1045]: [1] 11.0.0.9 RPF: Found RPF but no rpf peer for 11.0.0.9 - Failed
msdp[1045]: [1] 11.0.0.9 RPF: RPF lookup for 11.0.0.9 failed for peer 9.0.0.10: No Peer prefered


The debug also reveals a semi-frightening error message that isn’t exactly intuitive. Clearly there are BGP routes for this remote RP, but the debug seems to indicate there is an error.

! XRv2
msdp[1045]: [2] 11.0.0.9 RPF: BGP route not found for 11.0.0.9: 'sysdb' detected the 'warning' condition 'A SysDB client tried to access a nonexistent item or list an empty directory'

To better understand MSDP RPF, we will discuss its operation and purpose. There are three MSDP RPF rules and they are applied depending on the BGP topology. First, the MSDP process tries to find a BGP neighbor that has the same IP address as the MSDP peer.

1. If the MSDP peer is also an iBGP peer, apply Rule 1.
2. If the MSDP peer is also an eBGP peer, apply Rule 2.
3. If the MSDP peer is not a BGP peer at all, apply Rule 3.

The rules are all very similar in nature. I describe them in detail below, and then summarize them afterwards. To identify some terms initially, the “originating RP” is the RP that originated the SA message. In our case, this would be 11.0.0.9. The “advertising peer” is the MSDP peer that advertised the SA message in question, which may or may not be the same as the originating RP. In this case, 9.0.0.10 and 7.0.0.11 are the advertising peers.

Rule 1 (iBGP peer): Check the BGP MRIB for the best path to the originating RP. If no match exists, check the BGP URIB instead. If no match exists, RPF fails. If a path is found, the receiving RP compares the BGP neighbor address (not the next-hop) to the MSDP peer address; they must match. In summary, this rule means that the BGP and MSDP topologies should match within an AS.

Rule 2 (eBGP peer): Check the BGP MRIB for the best path to the originating RP. If no match exists, check the BGP URIB instead. If no match exists, RPF fails. If a path is found, the first AS in the path towards the originating RP must match the peer AS of the eBGP peer. If it doesn’t, RPF fails. In summary, the BGP and MSDP topologies should match between ASes. Compared to rule 1, the MSDP peer address and BGP peer address do not have to match, but the general AS advertisement path must match.

Rule 3 (no BGP peer): Check the BGP MRIB for the best path to the originating RP. If no match exists, check the BGP URIB instead. If no match exists, RPF fails. If a path is found, the BGP MRIB is searched for the best path to the advertising peer, followed by the BGP URIB if a path isn’t found. If a path is found, the first AS in the path towards the originating RP must match the AS of the MSDP peer. In summary, this rule is similar to rule 2 but performs two different lookups: one for the originating RP and one for the advertising peer.
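The three rules can be condensed into a short decision function. This is a sketch of the rules as described in this chapter, not a reproduction of IOS internals. The types `BestPath` and `BgpPeer` and the functions `lookup`, `first_as`, and `cisco_rpf` are hypothetical names introduced for illustration, and rule 3 compares the first AS of both best paths, which is how the troubleshooting steps in this chapter apply it. The sample tables are taken from the XRv2 outputs in this section.

```python
from dataclasses import dataclass


@dataclass
class BestPath:
    as_path: list   # e.g. [7, 11]; the first element is the neighboring AS
    neighbor: str   # BGP neighbor address that advertised the path


@dataclass
class BgpPeer:
    kind: str       # "ibgp" or "ebgp"
    asn: int


def lookup(addr, mrib, urib):
    """Every rule consults the BGP MRIB first, then the BGP URIB."""
    return mrib.get(addr) or urib.get(addr)


def first_as(path):
    """First AS of a best path, or None when there is no path/AS."""
    return path.as_path[0] if path and path.as_path else None


def cisco_rpf(msdp_peer, rp, bgp_peers, mrib, urib):
    """Return (rule label, passed) for an SA from msdp_peer originated by rp."""
    if msdp_peer == rp:
        return ("no check", True)   # special case: peer is the originating RP
    rp_path = lookup(rp, mrib, urib)
    peer = bgp_peers.get(msdp_peer)
    if peer is not None and peer.kind == "ibgp":
        # Rule 1: BGP neighbor address must equal the MSDP peer address.
        return ("rule 1", rp_path is not None and rp_path.neighbor == msdp_peer)
    if peer is not None and peer.kind == "ebgp":
        # Rule 2: first AS towards the RP must equal the eBGP peer's AS.
        return ("rule 2", first_as(rp_path) is not None
                and first_as(rp_path) == peer.asn)
    # Rule 3: second lookup for the advertising peer, then compare first ASes.
    peer_path = lookup(msdp_peer, mrib, urib)
    rp_as, peer_as = first_as(rp_path), first_as(peer_path)
    return ("rule 3", rp_as is not None and rp_as == peer_as)


# XRv2's view, taken from the show outputs in this section.
bgp_peers = {
    "8.0.0.8": BgpPeer("ibgp", 8),
    "10.6.12.6": BgpPeer("ebgp", 9),
    "10.12.14.14": BgpPeer("ebgp", 7),
}
mrib = {}   # the originating RP is not in the BGP MRIB at this point
urib = {
    "11.0.0.9": BestPath([7, 11], "10.12.14.14"),
    "7.0.0.11": BestPath([7], "10.12.14.14"),
    "9.0.0.10": BestPath([9], "10.6.12.6"),
}

print(cisco_rpf("7.0.0.11", "11.0.0.9", bgp_peers, mrib, urib))
print(cisco_rpf("9.0.0.10", "11.0.0.9", bgp_peers, mrib, urib))
```

The function predicts what these three rules would decide given the tables supplied. It does not model any platform-specific behavior, so a router that applies a different rule set can legitimately disagree with it.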


Summary of rules: All of them check the BGP MRIB for a path to the originating RP, followed by the BGP URIB if a path is not found. For all lookups, this is the order of operations. The difference is in the follow-on activities, which may introduce additional checks depending on the peer type.

Troubleshooting XRv2, we check the BGP neighbor tables for IPv4 unicast and multicast. 11.0.0.9 is not a BGP peer of any kind, so we assume MSDP rule 3 is invoked.

RP/0/0/CPU0:XRv2#show bgp ipv4 unicast summary | begin ^Neigh
Neighbor        Spk    AS MsgRcvd MsgSent   TblVer  InQ OutQ  Up/Down  St/PfxRcd
8.0.0.8           0     8     170     160       20    0    0 00:24:11          2
10.6.12.6         0     9      46      43       20    0    0 00:24:49         11
10.12.14.14       0     7      45      43       20    0    0 00:24:49         11

RP/0/0/CPU0:XRv2#show bgp ipv4 multicast summary | begin ^Neigh
Neighbor        Spk    AS MsgRcvd MsgSent   TblVer  InQ OutQ  Up/Down  St/PfxRcd
8.0.0.8           0     8     170     161       13    0    0 00:24:20          1
10.6.12.6         0     9      46      43       13    0    0 00:24:58          3
10.12.14.14       0     7      45      43       13    0    0 00:24:58          3

First, we check the BGP MRIB (IPv4 multicast) for a path to the originating RP (11.0.0.9). We find no match there, but we do in the BGP URIB (IPv4 unicast). The best path is via XRv4 due to being the oldest eBGP route. The first AS in the best path is AS 7.

RP/0/0/CPU0:XRv2#show bgp ipv4 multicast 11.0.0.9
% Network not in table
RP/0/0/CPU0:XRv2#show bgp ipv4 unicast 11.0.0.9
[snip]
Path #1: Received by speaker 0
Not advertised to any peer
  9 11
    10.6.12.6 from 10.6.12.6 (9.0.0.6)
      Origin incomplete, localpref 100, valid, external, group-best
      Received Path ID 0, Local Path ID 0, version 0
      Origin-AS validity: not-found
Path #2: Received by speaker 0
Advertised to update-groups (with more than one peer):
  0.3
Advertised to peers (in unique update groups):
  8.0.0.8
  7 11
    10.12.14.14 from 10.12.14.14 (7.0.0.14)
      Origin incomplete, localpref 100, valid, external, best, group-best
      Received Path ID 0, Local Path ID 1, version 14
      Origin-AS validity: not-found

Since we found a path, XRv2 progresses to run the same checks against the advertising peer (7.0.0.11). XRv4 is the best path due to the shortest AS-path length.

RP/0/0/CPU0:XRv2#show bgp ipv4 multicast 7.0.0.11
% Network not in table
RP/0/0/CPU0:XRv2#show bgp ipv4 unicast 7.0.0.11
[snip]
Path #1: Received by speaker 0
Not advertised to any peer
  9 7
    10.6.12.6 from 10.6.12.6 (9.0.0.6)
      Origin IGP, localpref 100, valid, external, group-best
      Received Path ID 0, Local Path ID 0, version 0
      Origin-AS validity: not-found
Path #2: Received by speaker 0
Advertised to update-groups (with more than one peer):
  0.3
Advertised to peers (in unique update groups):
  8.0.0.8
  7
    10.12.14.14 from 10.12.14.14 (7.0.0.14)
      Origin IGP, localpref 100, valid, external, best, group-best
      Received Path ID 0, Local Path ID 1, version 6
      Origin-AS validity: not-found

Last, the router verifies that the first AS in the best path to the originating RP (11.0.0.9) is the same as the first AS in the best path to the advertising MSDP peer (7.0.0.11). Both of them are 7, so RPF should pass according to the rules above. We saw earlier that it did not, which leads us to question the aforementioned RPF rule logic. In an attempt to repair this, I will make the MSDP RPF check “more correct” by advertising CSR9’s loopback (the originating RP address) into the BGP MRIB.

! CSR9
router bgp 11
 address-family ipv4 multicast
  redistribute connected route-map RM_CONN_TO_BGP

This means that XRv2 will now learn BGP MRIB routes for the originating RP instead of relying on the BGP URIB. This shouldn’t be required per the rules described earlier, but perhaps XR assumes that, because IPv4 multicast is running, that is the only AFI it needs to check. Now, it prefers CSR6 as the best path due to being the oldest eBGP route.

RP/0/0/CPU0:XRv2#show bgp ipv4 multicast 11.0.0.9/32
[snip]
  9 11
    10.6.12.6 from 10.6.12.6 (9.0.0.6)
      Origin incomplete, localpref 100, valid, external, best, group-best
      Received Path ID 0, Local Path ID 1, version 14
Path #2: Received by speaker 0
Not advertised to any peer
  7 11
    10.12.14.14 from 10.12.14.14 (7.0.0.14)
      Origin incomplete, localpref 100, valid, external, group-best
      Received Path ID 0, Local Path ID 0, version 0

Suddenly, RPF passes on XRv2. Everything else we verified above is still correct, but the route to the originating RP changed from being unicast to multicast. Examining the debug on XRv2, we no longer see the “scary” BGP messages about missing routes. Based on this, I assert that the XR debug message was referring to the lack of an IPv4 multicast BGP route specifically, but didn’t make that entirely clear. We will later prove that when XR BGP is running multicast and unicast AFIs, the RPF process expects to always have a routing entry in the multicast BGP RIB for RPF. Without it, RPF will fail, as BGP unicast routes are not considered in this case.

! XRv2
msdp[1045]: [1] default 9.0.0.10 tlv: ReadQ: 20 bytes in buffer
msdp[1045]: [1] default 9.0.0.10 tlv: ReadQ: TLV parsed length = 20, type = 1
msdp[1045]: [1] default 9.0.0.10 tlv: 9.0.0.10: recvd
msdp[1045]: [1] default 9.0.0.10 tlv: RP 11.0.0.9 has 1 entries
msdp[1045]: [1] default 9.0.0.10 tlv: -- (11.4.13.4/32 225.0.1.3)
msdp[1045]: [1] default 9.0.0.10 tlv:
msdp[1045]: [1] 11.0.0.9 RPF: Found RPF peer 9.0.0.10 for RP 11.0.0.9 Matched Highest in AS (iv)
msdp[1045]: [1] 11.0.0.9 RPF: Accepting SA from peer 9.0.0.10 for RP 11.0.0.9 due to Matched Highest in AS (iv)
msdp[1045]: [1] default 7.0.0.11 tlv: ReadQ: 20 bytes in buffer
msdp[1045]: [1] default 7.0.0.11 tlv: ReadQ: TLV parsed length = 20, type = 1
msdp[1045]: [1] default 7.0.0.11 tlv: 7.0.0.11: recvd
msdp[1045]: [1] default 7.0.0.11 tlv: RP 11.0.0.9 has 1 entries
msdp[1045]: [1] default 7.0.0.11 tlv: -- (11.4.13.4/32 225.0.1.3)
msdp[1045]: [1] default 7.0.0.11 tlv:
msdp[1045]: [1] 11.0.0.9 RPF: Found RPF peer 9.0.0.10 for RP 11.0.0.9 Matched Highest in AS (iv)
msdp[1045]: [1] 11.0.0.9 RPF: RPF lookup for 11.0.0.9 failed for peer 7.0.0.11: 9.0.0.10 prefered

When we check the RPF details on XRv2, we see that MSDP rule 4 is used. This appears very confusing, and the word “rule” in XR does not correlate with the MSDP “rules” we just discussed. We can at least see that 9.0.0.10 is the next-hop which has a valid BGP URIB route (the BGP peer and MSDP peer), which allows RPF to work.

RP/0/0/CPU0:XRv2#show msdp rpf 11.0.0.9
RPF peer for 11.0.0.9 is 9.0.0.10 AS 9, rule: 4
bgp/rib lookup: nexthop: 9.0.0.10, asnum: 11


The XR “rules” are specific to RFC3618, which describes conditions for RPF checks, but this is not made clear in the documentation. This is enabled on XR and cannot be disabled, yet the default on IOS and XE platforms is to use the Cisco “rules” discussed earlier. Unfortunately, both RFC3618 and Cisco call them “MSDP RPF rules” so there isn’t an easy way to distinguish them in the show commands or debug outputs. One would have to simply know the difference, the rule descriptions, and the defaults per platform. XE platforms can optionally use RFC3618 rules which we will investigate later. The RFC3618 rules are also processed in top-down fashion with lower-numbered rules being the more preferred RPF mechanisms. RFC3618, and the debug output on XR, list the rules in Roman numeral format.

i. The advertising MSDP peer is the same address as the originating RP.
ii. The advertising MSDP peer is the eBGP next-hop for the RPF route.
iii. The advertising MSDP peer is the BGP peer that advertised the route towards the originating RP, but it isn’t the same BGP peer address as the next-hop.
iv. The advertising MSDP peer resides in the closest AS towards the originating RP.
v. The advertising MSDP peer RPF was statically configured.

This raises the question of how RPF passed on the original SA when the originating RP (9.0.0.10) wasn’t in the MRIB either. This was from our first test where CSR2 was the sender and CSR10 was the originating RP. We can examine XRv2 MSDP RPF debugs to see which RFC3618 rule was used to accept this SA from CSR10. This debug can be confusing because it suggests “rule 1” was used, when clearly the routes to CSR10 are eBGP, not iBGP as the “Cisco MSDP RPF rules” specify. Again, this is because XR is following “RFC3618 RPF rule 1”; when the originating RP and advertising peer are the same address, RPF passes immediately. Under Cisco’s rules, RPF isn’t even checked in this case, as this is a clearly documented special case. The debug clearly shows that the advertising peer and originating RP are the same.

! XRv2
msdp[1045]: [1] default 9.0.0.10 tlv: ReadQ: 40 bytes in buffer
msdp[1045]: [1] default 9.0.0.10 tlv: ReadQ: TLV parsed length = 20, type = 1
msdp[1045]: [1] default 9.0.0.10 tlv: 9.0.0.10: recvd
msdp[1045]: [1] default 9.0.0.10 tlv: RP 9.0.0.10 has 1 entries
msdp[1045]: [1] default 9.0.0.10 tlv: -- (9.2.5.2/32 225.0.1.3)
msdp[1045]: [1] default 9.0.0.10 tlv:
msdp[1045]: [1] 9.0.0.10 RPF: Found RPF peer 9.0.0.10 for RP 9.0.0.10 Matched RP Address (i)
msdp[1045]: [1] 9.0.0.10 RPF: Accepting SA from peer 9.0.0.10 for RP 9.0.0.10 due to Matched RP Address (i)
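The top-down ordering of the five RFC3618 rules listed above can be sketched as a single selection function. This is a simplified illustration based only on the rule descriptions in this chapter; the function name `rfc3618_rpf_peer` and the dictionary keys (`next_hop`, `advertiser`, `external`, `as_path`) are hypothetical. How the RPF route was learned (MRIB versus URIB) is deliberately left out, and the tie-break in rule iv (highest peer address) is an assumption suggested by the “Matched Highest in AS” debug text.

```python
import ipaddress


def rfc3618_rpf_peer(rp, msdp_peers, route=None, static_peer=None):
    """Pick the RPF peer for originating RP `rp`.

    msdp_peers: dict of MSDP peer address -> AS number
    route:      dict describing the active RPF route towards rp, or None
    Returns (rule, peer); (None, None) when no rule matches.
    """
    if rp in msdp_peers:
        return ("i", rp)                       # peer is the originating RP
    if route is not None:
        if route["external"] and route["next_hop"] in msdp_peers:
            return ("ii", route["next_hop"])   # peer is the eBGP next-hop
        if route["advertiser"] in msdp_peers:
            return ("iii", route["advertiser"])  # peer advertised the route
        if route["as_path"]:
            closest_as = route["as_path"][0]
            in_as = [p for p, asn in msdp_peers.items() if asn == closest_as]
            if in_as:
                # Assumed tie-break: highest address within the closest AS.
                return ("iv", max(in_as, key=ipaddress.ip_address))
    if static_peer is not None:
        return ("v", static_peer)              # statically configured
    return (None, None)


# XRv2: MSDP peers are CSR10 (AS 9) and XRv1 (AS 7).
xrv2_peers = {"9.0.0.10": 9, "7.0.0.11": 7}
route_to_csr9 = {"next_hop": "10.6.12.6", "advertiser": "10.6.12.6",
                 "external": True, "as_path": [9, 11]}

print(rfc3618_rpf_peer("9.0.0.10", xrv2_peers))
print(rfc3618_rpf_peer("11.0.0.9", xrv2_peers, route_to_csr9))
```

An SA is accepted only when the peer that sent it equals the peer this function returns, which is why XRv2 accepts the SA for 11.0.0.9 from 9.0.0.10 and rejects the copy from 7.0.0.11.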

Going back to the second test with CSR4 as a sender, we can verify the SA cache on XRv2 to see the new entry. The AS is 11 with the remote RP being 11.0.0.9. Although there is no MSDP or BGP peering with this address, RPF passes due to the IPv4 multicast route for the originating RP advertised by CSR6 and XRv4 (best path dependent).

RP/0/0/CPU0:XRv2#show msdp sa-cache
MSDP Flags:
E - set MRIB E flag , L - domain local source is active,
EA - externally active source, PI - PIM is interested in the group,
DE - SAs have been denied. Timers age/expiration,
Cache Entry:
(9.2.5.2, 225.0.1.3), RP 9.0.0.10, MBGP/AS 9, 00:48:35/00:01:54
  Learned from peer 9.0.0.10, RPF peer 9.0.0.10
  SAs recvd 52, Encapsulated data received: 0
    grp flags: PI, src flags: E, EA, PI
(11.4.13.4, 225.0.1.3), RP 11.0.0.9, MBGP/AS 11, 00:17:24/00:01:54
  Learned from peer 9.0.0.10, RPF peer 9.0.0.10
  SAs recvd 19, Encapsulated data received: 0
    grp flags: PI, src flags: E, EA, PI

A quick check of multicast counters on CSR8 shows this working. CSR2 has many more packets than CSR4 because of the RPF problems we had to resolve. Both CSR2 and CSR4 have been sending the whole time which is why the packet counts are high.

R8#show ip mroute 225.0.1.3 count | begin ^Group
Group: 225.0.1.3, Source count: 2, Packets forwarded: 3945, Packets received: 3945
  RP-tree: Forwarding: 3/0/118/0, Other: 3/0/0
  Source: 11.4.13.4/32, Forwarding: 1095/1/118/0, Other: 1095/0/0
  Source: 9.2.5.2/32, Forwarding: 2847/1/118/0, Other: 2847/0/0
R8#show ip mroute 225.0.1.3 count | begin ^Group
Group: 225.0.1.3, Source count: 2, Packets forwarded: 3965, Packets received: 3965
  RP-tree: Forwarding: 3/0/118/0, Other: 3/0/0
  Source: 11.4.13.4/32, Forwarding: 1105/1/118/0, Other: 1105/0/0
  Source: 9.2.5.2/32, Forwarding: 2857/1/118/0, Other: 2857/0/0

Because the eBGP peers have MSDP connections between their loopbacks (which are also RPs), we bypass many of the complex RPF issues and rules. On XE, the special case of not checking RPF applies, and on XR, RFC3618 rule 1 is always applied. We can introduce some of this complexity for additional practice. On XRv1 and CSR10, I reconfigure the MSDP session slightly. XRv1 will peer with CSR10’s directly connected transit link while CSR10 continues to peer with XRv1’s loopback. This is analogous to modifying the BGP update-source, except we modify the MSDP connect-source. From XRv1’s perspective, this means that SA messages from CSR10 will be sourced from 10.10.11.10 (transit link) with an RP of 9.0.0.10 (loopback). Since the MSDP peer address (10.10.11.10) and RP (9.0.0.10) differ, this does not qualify for RFC3618 rule 1. The topology doesn’t change much, so the general design is the same, but XR will have to select another RFC3618 RPF rule to accept SA messages from 9.0.0.10.

! CSR10
ip msdp peer 7.0.0.11 connect-source gig2.501 remote-as 7
! XRv1
router msdp
 no peer 9.0.0.10
 peer 10.10.11.10
  password clear AS7AS9_MSDP
  connect-source Loopback0
  remote-as 9
  keepalive 15 35

Since CSR2 is still sending traffic, CSR10 is still originating periodic SA messages. XRv1 will receive these whether there are clients in AS 7 or not. On XRv1, the MSDP RPF rule used in this case is RFC3618 rule 3. Despite the route to 9.0.0.10/32 being eBGP-learned, rule 3 is used. At a glance, the logic for RFC3618 rule 2 holds true since the eBGP next-hop is 10.10.11.10 which is the advertising peer as well. The debug shows RFC3618 rule 3 being matched for that particular SA received from 9.0.0.10. The reason for failing to match RFC3618 rule 2 is similar to the error on XRv2. The RFC3618 definitions, and XR in general, assume that the routes in question are BGP IPv4 multicast. Since XRv1 does not have an IPv4 multicast route to 9.0.0.10, it “falls back” to the unicast route which qualifies it for RFC3618 rule 3 RPF. We confirm this with the show command also.

! XRv1
msdp[1045]: [1] default 10.10.11.10 tlv: ReadQ: TLV parsed length = 20, type = 1
msdp[1045]: [1] default 10.10.11.10 tlv: 10.10.11.10: recvd
msdp[1045]: [1] default 10.10.11.10 tlv: RP 9.0.0.10 has 1 entries
msdp[1045]: [1] default 10.10.11.10 tlv: -- (9.2.5.2/32 225.0.1.3)
msdp[1045]: [1] default 10.10.11.10 tlv:
msdp[1045]: [1] 9.0.0.10 RPF: Found RPF peer 10.10.11.10 for RP 9.0.0.10 Matched Next Hop (iii)
msdp[1045]: [1] 9.0.0.10 RPF: Accepting SA from peer 10.10.11.10 for RP 9.0.0.10 due to Matched Next Hop (iii)

RP/0/0/CPU0:XRv1#show msdp rpf 9.0.0.10
RPF peer for 9.0.0.10 is 10.10.11.10 AS 9, rule: 3
bgp/rib lookup: nexthop: 10.10.11.10, asnum: 0

If we want XRv1 to see this as a candidate for matching RFC3618 rule 2, we would need to advertise 9.0.0.10 inside BGP IPv4 multicast. We can re-use the route-map from IPv4 unicast for this.

! CSR10
router bgp 9
 address-family ipv4 multicast
  redistribute connected route-map RM_CONN_TO_BGP

With debugging still enabled on XRv1, we can see the new match. This debug clearly shows that the next-hop was the eBGP peer which is expected. The SA is exactly the same, but since XRv1 now has an IPv4 multicast route towards the originating RP, rule 2 can be used. I also verify with the show command.

! XRv1
msdp[1045]: [1] default 10.10.11.10 tlv: ReadQ: 20 bytes in buffer
msdp[1045]: [1] default 10.10.11.10 tlv: ReadQ: TLV parsed length = 20, type = 1
msdp[1045]: [1] default 10.10.11.10 tlv: 10.10.11.10: recvd
msdp[1045]: [1] default 10.10.11.10 tlv: RP 9.0.0.10 has 1 entries
msdp[1045]: [1] default 10.10.11.10 tlv: -- (9.2.5.2/32 225.0.1.3)
msdp[1045]: [1] default 10.10.11.10 tlv:
msdp[1045]: [1] 9.0.0.10 RPF: Found RPF peer 10.10.11.10 for RP 9.0.0.10 Matched External BGP Peer (ii)
msdp[1045]: [1] 9.0.0.10 RPF: Accepting SA from peer 10.10.11.10 for RP 9.0.0.10 due to Matched External BGP Peer (ii)

RP/0/0/CPU0:XRv1#show msdp rpf 9.0.0.10
RPF peer for 9.0.0.10 is 10.10.11.10 AS 9, rule: 2
bgp/rib lookup: nexthop: 10.10.11.10, asnum: 9

By now, it should be clear that the MSDP RPF rules can be confusing and often inconsistent between platforms. Fortunately, the debug messaging is helpful, especially on XR where the matched rules are shown. To demonstrate more differences between Cisco’s version and RFC3618, we can use CSR9. It is currently using Cisco’s method, which means our verifications are limited to debugs. The equivalent show command only works when RFC3618 is enabled.

R9#show ip msdp rpf-peer 8.0.0.12
This command requires "ip msdp rpf rfc3618" to be configured.
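Because the XR debug lines name the matched rule explicitly, they lend themselves to quick extraction when many SAs are being processed. The following is a small sketch that assumes the debug text has been captured to a string, one message per line; the pattern `ACCEPT` and the function `matched_rules` are hypothetical names, and the regular expression is written only against the message format shown in this section.

```python
import re

# Matches the XR "Accepting SA" debug line format shown in this section.
ACCEPT = re.compile(
    r"RPF: Accepting SA from peer (?P<peer>\S+) for RP (?P<rp>\S+) "
    r"due to (?P<reason>.+?) \((?P<rule>[ivx]+)\)"
)


def matched_rules(debug_text):
    """Return (peer, rp, reason, rule) for every accepted SA in the text."""
    return [
        (m["peer"], m["rp"], m["reason"], m["rule"])
        for m in ACCEPT.finditer(debug_text)
    ]


# Two lines copied from the XRv1 debugs before and after the MRIB change.
sample = (
    "msdp[1045]: [1] 9.0.0.10 RPF: Accepting SA from peer 10.10.11.10 "
    "for RP 9.0.0.10 due to Matched Next Hop (iii)\n"
    "msdp[1045]: [1] 9.0.0.10 RPF: Accepting SA from peer 10.10.11.10 "
    "for RP 9.0.0.10 due to Matched External BGP Peer (ii)\n"
)

for entry in matched_rules(sample):
    print(entry)
```

Rejected SAs use a different message ("RPF lookup ... failed for peer"), so a second pattern would be needed to tally those.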

To test it, I will originate traffic for a bogus group on CSR3. Even if AS 11 doesn’t have receivers, CSR9 should accept the SA message from XRv2. We saw that, in the same topology but in the reverse order, XRv2 rejected these messages under the RRC3618 construct. CSR9, however, accepts them, based on Cisco’s MSDP RPF rule 3. R9#debug ip msdp peer MSDP Peer debugging is on MSDP(0): MSDP(0): MSDP(0): 7/0 MSDP(0): MSDP(0): MSDP(0): MSDP(0): MSDP(0): 7.0.0.11

7.0.0.11: Received 120-byte msg 1017 from peer 7.0.0.11: SA TLV, len: 120, ec: 1, RP: 8.0.0.12, with data 7.0.0.11: Peer RPF check failed for 8.0.0.12, EBGP route/peer in AS 9.0.0.10: Received 120-byte msg 1018 from peer 9.0.0.10: SA TLV, len: 120, ec: 1, RP: 8.0.0.12, with data 9.0.0.10: Peer RPF check passed for 8.0.0.12, used EBGP peer WAVL Insert SA Source 8.3.8.3 Group 225.9.9.9 RP 8.0.0.12 Successful 7.0.0.11: Forward 120-byte SA (8.3.8.3, 225.9.9.9) from 9.0.0.10 to

R9#show ip msdp sa-cache


MSDP Source-Active Cache - 2 entries
(9.2.5.2, 225.0.1.3), RP 9.0.0.10, MBGP/AS 9, 04:43:52/00:05:35, Peer 9.0.0.10
  Learned from peer 9.0.0.10, RPF peer 9.0.0.10,
  SAs received: 304, Encapsulated data received: 2
(8.3.8.3, 225.9.9.9), RP 8.0.0.12, BGP/AS 8, 00:00:06/00:05:53, Peer 9.0.0.10
  Learned from peer 9.0.0.10, RPF peer 9.0.0.10,
  SAs received: 1, Encapsulated data received: 1

The path via CSR6 was chosen as the BGP best-path due to being the oldest route. Notice that CSR9 doesn’t care about the lack of an IPv4 multicast route for the originating RP, as XRv2 did.

R9#show bgp ipv4 multicast 8.0.0.12
% Network not in table

R9#show bgp ipv4 unicast 8.0.0.12
BGP routing table entry for 8.0.0.12/32, version 47
Paths: (2 available, best #2, table default)
  Advertised to update-groups:
     3          4
  Refresh Epoch 1
  7 8
    10.9.14.14 from 10.9.14.14 (7.0.0.14)
      Origin IGP, localpref 100, valid, external
      rx pathid: 0, tx pathid: 0
  Refresh Epoch 1
  9 8
    10.6.9.6 from 10.6.9.6 (9.0.0.6)
      Origin IGP, localpref 100, valid, external, best
      rx pathid: 0, tx pathid: 0x0

Per Cisco’s rule 3, the router then checks the path to the advertising peer to ensure the AS of that peer matches the first AS of the best path, which is 9. They match, so the SA is installed as we saw above. CSR9 actually has an IPv4 multicast route for this since we advertised it on CSR10 earlier, but this is irrelevant.

R9#show bgp ipv4 multicast 9.0.0.10
BGP routing table entry for 9.0.0.10/32, version 28
Paths: (2 available, best #2, table 8000)
  Advertised to update-groups:
     3          4
  Refresh Epoch 1
  7 9
    10.9.14.14 from 10.9.14.14 (7.0.0.14)
      Origin incomplete, localpref 100, valid, external
      rx pathid: 0, tx pathid: 0
  Refresh Epoch 1
  9
    10.6.9.6 from 10.6.9.6 (9.0.0.6)
      Origin incomplete, localpref 100, valid, external, best
      rx pathid: 0, tx pathid: 0x0
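
The rule 3 comparison described above can be sketched in a few lines of Python. This is only an illustration of the check as this section describes it (the first AS of the best path to the originating RP compared against the AS of the advertising MSDP peer); it is not Cisco’s implementation, and the function names are hypothetical.

```python
def first_as(as_path):
    """Return the first (neighboring) AS in a BGP AS path, or None if empty."""
    return as_path[0] if as_path else None


def rule3_accepts(best_path_to_rp, peer_as):
    """Hypothetical helper: accept the SA only if the advertising peer's AS
    equals the first AS of the best path towards the originating RP."""
    return first_as(best_path_to_rp) == peer_as


# Values taken from the CSR9 outputs above: best path to 8.0.0.12 is "9 8".
best_path = [9, 8]
print(rule3_accepts(best_path, 9))  # SA from 9.0.0.10 (AS 9)
print(rule3_accepts(best_path, 7))  # SA from 7.0.0.11 (AS 7)
```

With the values from the outputs above, the SA from 9.0.0.10 is accepted and the copy from 7.0.0.11 is rejected, which lines up with the debug messages on CSR9.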

Personally, I would prefer to have all MSDP speakers playing by the same “rules” in a real network design. We enable RFC3618 for RPF rules on CSR9 and observe the differences. We now expect more XR-like behavior with respect to the debug outputs and show commands. The XE platforms don’t tell you rules 1 through 5, but instead give you a brief description of why the RPF peer was selected.

For example, if we look at the MSDP RPF towards 9.0.0.10, we can see that the advertising peer equals the originating RP (“Peer is RP”), which is RFC3618 rule 1. Under Cisco’s rules, this would also “pass” by not even triggering the RPF check. If we repeat the show command for XRv2’s loopback on CSR9, the RFC3618 rule is 4 (“Peer is best in the closest AS”), which is consistent with how XRv2 views CSR9 (reciprocity, in a way). There is no BGP or MSDP peer with this address, and the path to 8.0.0.12 is consistent with the SA received from the MSDP peer in that AS.

! CSR9
ip msdp rpf rfc3618

R9#show ip msdp rpf-peer 9.0.0.10
RPF peer information for ? (9.0.0.10)
  RPF peer: ? (9.0.0.10)
  RPF route/mask: 9.0.0.10/32
  RPF rule: Peer is RP
  RPF type: mbgp

R9#show ip msdp rpf-peer 8.0.0.12
RPF peer information for ? (8.0.0.12)
  RPF peer: ? (9.0.0.10)
  RPF route/mask: 9.0.0.10/32
  RPF rule: Peer is best in the closest AS
  RPF type: mbgp

Examining the debugs, we can confirm these findings. From CSR10, SA messages originating from XRv2 are accepted since this is the best-path back to the originating RP. When CSR9 receives CSR10’s SAs relayed through XRv1, these are rejected, since CSR9 is expecting to learn these directly from CSR10. This makes sense because CSR10 and CSR9 have a direct MSDP peering, so RFC3618 rule 4 is used as shown above. In order for the XE debugging to be particularly useful, one would have to memorize the five RFC3618 rules since it doesn’t give you the Roman numeral as XR does. Then again, if RPF passes and everything is functional, an administrator may have less of a reason to reveal these details.

! CSR9
MSDP(0): 9.0.0.10: Received 20-byte msg 1047 from peer
MSDP(0): 9.0.0.10: SA TLV, len: 20, ec: 1, RP: 8.0.0.12
MSDP(0): 9.0.0.10: RPF check passed for 8.0.0.12, Peer is best in the closest AS
MSDP(0): 7.0.0.11: Received 20-byte msg 1044 from peer
MSDP(0): 7.0.0.11: SA TLV, len: 20, ec: 1, RP: 9.0.0.10
MSDP(0): 7.0.0.11: Peer RPF check failed for 9.0.0.10, EMBGP route/peer in AS 7/0

I view MSDP as being a combination of PIM and BGP in some ways. It has RPF rules for control-plane SAs just like PIM has RPF rules for register messages. It is like BGP based on its ability to relay information across ASes, along with its TCP transport and rich filtering options.

We notice that MSDP peers, upon receipt of a valid SA (one that passes RPF), will automatically re-send that SA to all other MSDP peers except the peer from which it was received. While this doesn’t cause any loops due to RPF rules, it can be very inefficient, especially in a meshed network. For example, when CSR9 receives SAs from CSR10, there is no reason for it to re-send them to XRv1, since XRv1 and CSR10 have a direct MSDP session. The same is true on XRv2’s reception of SAs from CSR10.

In BGP, the iBGP “split-horizon” rule effectively prevents this re-advertisement of routes within an AS; we can relax this rule with route-reflectors. Note: Unlike BGP, MSDP does not use AS information as a “split-horizon” criterion when deciding whether to re-advertise an SA. In MSDP, the opposite logic is used; SAs are flooded liberally by default and we can restrict their advertisement using mesh-groups. Mesh-groups are generally used for intra-AS flood reduction but we can use them for inter-AS as well.

The concept is simple; SAs received from a peer in a given mesh-group can never be advertised to a peer in the same mesh-group. The assumption is that the sending MSDP peer always has a direct connection to all other peers in the mesh-group. Enabling this feature can reduce SA flooding, saving routing resources (fewer RPF lookups and guaranteed failures) and bandwidth (less flooding). We will configure this on all MSDP peers in accordance with the example I gave above.

! CSR9
ip msdp mesh-group MG9 7.0.0.11
ip msdp mesh-group MG9 9.0.0.10
! XRv2
router msdp
 peer 7.0.0.11
  mesh-group MG12
 peer 9.0.0.10
  mesh-group MG12
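
To make the mesh-group forwarding rule concrete, here is a minimal Python sketch of the decision a router makes when relaying a received SA. It models only the rule stated above (never relay to the sender, and never relay to a peer in the sender’s mesh-group); the function name and the dictionary layout are assumptions made for illustration, not anything taken from XE or XR.

```python
def sa_forward_targets(peers, received_from):
    """Return the peers to which a received SA would be relayed.

    peers maps each MSDP peer address to its mesh-group name, or None
    when the peer is not in a mesh-group (hypothetical data model).
    """
    sender_group = peers.get(received_from)
    targets = []
    for peer, group in peers.items():
        if peer == received_from:
            continue  # never send the SA back to the peer that sent it
        if sender_group is not None and group == sender_group:
            continue  # same mesh-group as the sender: suppress flooding
        targets.append(peer)
    return targets


# CSR9 before and after the mesh-group configuration shown above.
csr9_default = {"7.0.0.11": None, "9.0.0.10": None}
csr9_meshed = {"7.0.0.11": "MG9", "9.0.0.10": "MG9"}

print(sa_forward_targets(csr9_default, "9.0.0.10"))  # relays to XRv1
print(sa_forward_targets(csr9_meshed, "9.0.0.10"))   # relays to nobody
```

With both peers in MG9, CSR9 has no eligible target for an SA learned from CSR10, which is the “non-transit” behavior this configuration is meant to achieve.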

Before observing the operational effect of this feature, we will verify the configuration by checking the peer details on each router. Mesh groups are only locally significant and are not signaled in MSDP. They can be any text string and every peer can belong to a maximum of one mesh group. The way I’ve configured it, it is similar to a BGP route-reflector design where XRv2 and CSR9 should be clients only; that is, not route-reflectors. They should not advertise anything other than locally originated SA messages.

R9#show ip msdp peer | include ^MSDP_Peer|mesh
MSDP Peer 7.0.0.11 (?), AS 7 (configured AS)
  Peer is member of mesh-group MG9
MSDP Peer 9.0.0.10 (?), AS 9 (configured AS)
  Peer is member of mesh-group MG9

RP/0/0/CPU0:XRv2#show msdp peer | utility egrep '^MSDP Peer|mesh'
MSDP Peer 7.0.0.11 (?), AS 7
  Peer is a member of mesh group MG12
MSDP Peer 9.0.0.10 (?), AS 9
  Peer is a member of mesh group MG12

First, we will verify the behavior on CSR9. With CSR2 sending traffic, CSR10 (9.0.0.10 as the RP address and MSDP peer) will advertise SAs to all MSDP peers since there are no mesh groups configured there. When CSR9 receives it, it can only advertise it to peers in a different mesh group than CSR10. Since all peers are in the same mesh group as CSR10, CSR9 does nothing beyond process it locally. It is interesting to note that enabling mesh groups for a peer totally bypasses the MSDP RPF checks as the debug reveals. This could be a workaround for RPF issues as well. Since both messages bypass RPF, CSR9 installs the one it receives first, which came directly from CSR10. It doesn’t really matter which one gets installed since the (S,G) join will traverse the shortest path to the originating RP anyway.

! CSR9
debug ip msdp peer
debug ip msdp detail
MSDP(0): Received 120-byte TCP segment from 9.0.0.10
MSDP(0): Append 120 bytes to 0-byte msg 3916 from 9.0.0.10, qs 1
MSDP(0): 9.0.0.10: Received 120-byte msg 3916 from peer
MSDP(0): 9.0.0.10: SA TLV, len: 120, ec: 1, RP: 9.0.0.10, with data
MSDP(0): 9.0.0.10: Peer RPF check bypassed, peer 9.0.0.10 in mesh-group MG9
MSDP(0): WAVL Insert SA Source 9.2.5.2 Group 225.0.1.3 RP 9.0.0.10 Successful
MSDP(0): Received 120-byte TCP segment from 7.0.0.11
MSDP(0): Append 120 bytes to 0-byte msg 3917 from 7.0.0.11, qs 1
MSDP(0): 7.0.0.11: Received 120-byte msg 3917 from peer
MSDP(0): 7.0.0.11: SA TLV, len: 120, ec: 1, RP: 9.0.0.10, with data
MSDP(0): 7.0.0.11: Peer RPF check bypassed, peer 7.0.0.11 in mesh-group MG9

R9#show ip msdp sa-cache 225.0.1.3
MSDP Source-Active Cache - 1 entries for 225.0.1.3
(9.2.5.2, 225.0.1.3), RP 9.0.0.10, MBGP/AS 9, 00:03:14/00:05:35, Peer 9.0.0.10

We will perform the same verification on XRv2, which should behave identically. It receives and accepts both SAs, but installs the first one received into the SA cache. This came from CSR10, which is indicated by the advertising peer address in the SA cache.

! XRv2


debug msdp tlv
debug msdp rpf
msdp[1045]: [1] default 9.0.0.10 tlv: ReadQ: 20 bytes in buffer
msdp[1045]: [1] default 9.0.0.10 tlv: ReadQ: TLV parsed length = 20, type = 1
msdp[1045]: [1] default 9.0.0.10 tlv: 9.0.0.10: recvd
msdp[1045]: [1] default 9.0.0.10 tlv: RP 9.0.0.10 has 1 entries
msdp[1045]: [1] default 9.0.0.10 tlv: -- (9.2.5.2/32 225.0.1.3)
msdp[1045]: [1] default 9.0.0.10 tlv:
msdp[1045]: [1] 9.0.0.10 RPF: Peer 9.0.0.10 is in mesh group. Bypassing RPF checks
msdp[1045]: [1] 9.0.0.10 RPF: Accepting SA from peer 9.0.0.10 for RP 9.0.0.10 due to Matched a mesh group member
msdp[1045]: [1] default 7.0.0.11 tlv: ReadQ: 20 bytes in buffer
msdp[1045]: [1] default 7.0.0.11 tlv: ReadQ: TLV parsed length = 20, type = 1
msdp[1045]: [1] default 7.0.0.11 tlv: 7.0.0.11: recvd
msdp[1045]: [1] default 7.0.0.11 tlv: RP 9.0.0.10 has 1 entries
msdp[1045]: [1] default 7.0.0.11 tlv: -- (9.2.5.2/32 225.0.1.3)
msdp[1045]: [1] default 7.0.0.11 tlv:
msdp[1045]: [1] 9.0.0.10 RPF: Peer 7.0.0.11 is in mesh group. Bypassing RPF checks
msdp[1045]: [1] 9.0.0.10 RPF: Accepting SA from peer 7.0.0.11 for RP 9.0.0.10 due to Matched a mesh group member

RP/0/0/CPU0:XRv2#show msdp sa-cache 225.0.1.3
MSDP Flags:
E - set MRIB E flag , L - domain local source is active,
EA - externally active source, PI - PIM is interested in the group,
DE - SAs have been denied. Timers age/expiration,
Cache Entry:
(9.2.5.2, 225.0.1.3), RP 9.0.0.10, MBGP/AS 9, 00:06:50/00:02:18
  Learned from peer 9.0.0.10, RPF peer 9.0.0.10
  SAs recvd 16, Encapsulated data received: 200
  grp flags: PI, src flags: E, EA, PI

As further proof, we can check CSR9 to ensure it is not advertising the SA for (9.2.5.2, 225.0.1.3) towards XRv1 (7.0.0.11). Being in the same mesh group, this should never happen. It is, however, advertising its locally originated SA which is a result of CSR4 sending traffic into the network. This is the desired result, which effectively allows CSR9 to be a “non-transit” MSDP node.

R9#show ip msdp peer 7.0.0.11 advertised-SAs
MSDP SA advertised to peer 7.0.0.11 (?) from mroute table
225.0.1.3 11.4.13.4 (?)
MSDP SA advertised to peer 7.0.0.11 (?) from SA cache
[no output]


There is a third way to bypass RPF in MSDP as well; this is the default-peer feature. This is valuable for environments where SAs should be accepted from a peer no matter what. You can specify multiple default-peers in a router, but they are processed top-down assuming the peers are up and a prefix-list is not used. We can optionally apply a prefix list to each default peer to specify which sources apply to which peer. For example, on CSR9 we could configure CSR10 and XRv1 as default peers given the sources in each of their ASes.

Although this will remain in the configuration for demonstration purposes, this feature only makes sense when MSDP and BGP are not deployed together. This can be used for stub ASes that may not be running BGP, or perhaps a customer multi-homed to two ISPs with two static default routes. The current network does not fit this design without significant rework, so we don’t test this further.

! CSR9
ip msdp default-peer 9.0.0.10 prefix-list PL_AS9
ip msdp default-peer 7.0.0.11 prefix-list PL_AS7
ip prefix-list PL_AS7 seq 5 permit 7.0.0.0/8 le 32
ip prefix-list PL_AS9 seq 5 permit 9.0.0.0/8 le 32

XR is less flexible than XE with respect to this feature, so I show a snippet below but do not commit it to the configuration. XR only allows for one default peer (good in a single-homed PE-CE environment without BGP) and thus no prefix-list support is needed. The snippet shows XRv2 using XRv1 (7.0.0.11) as its default peer as an example.

! XRv2 (example only)
router msdp
 default-peer 7.0.0.11

There are other minor options used for fine-tuning SA advertising. One of them is setting a basic SA limit, which is loosely analogous to OSPF’s “max-lsa” command. Only a certain number of SAs can be received from a peer, regardless of what (S,G) pairs they represent, and this is a security mechanism to prevent DoS attacks against the MSDP control-plane. The syntax differs slightly between XE and XR, and we configure this between CSR10 and XRv1 for demonstration with a limit of 500 SAs. XE clearly shows this new limit in the peer output but XR does not.

! XRv1
router msdp
 peer 10.10.11.10
  maximum external-sa 500
! CSR10
ip msdp sa-limit 7.0.0.11 500

R10#show ip msdp peer 7.0.0.11 | include limit
  SAs learned from this peer: 0, SAs limit: 500

We can also filter SAs based on sources, groups, or both together. On CSR10 and XRv1, we configure filters that deny SAs pairing sources in the 9.0.0.0/8 major network with any administratively-scoped (239.0.0.0/8) group, applied on the sessions towards CSR9 and XRv2. In a real deployment, a feature like this would make sense on all external peers, but I demonstrate it towards CSR9 and XRv2. The filters on CSR10 use a combination of route-maps and ACLs while XRv1 only uses ACLs. XR RPLs are not supported for SA-filtering in MSDP.

! CSR10
ip access-list extended ACL_SA_FILTER
 deny ip 9.0.0.0 0.255.255.255 239.0.0.0 0.255.255.255
 permit ip any any
route-map RM_FILTER_SA permit 10
 match ip address ACL_SA_FILTER
ip msdp sa-filter in 8.0.0.12 list ACL_SA_FILTER
ip msdp sa-filter out 8.0.0.12 list ACL_SA_FILTER
ip msdp sa-filter in 11.0.0.9 route-map RM_FILTER_SA
ip msdp sa-filter out 11.0.0.9 route-map RM_FILTER_SA

! XRv1
ipv4 access-list ACL_SA_FILTER
 10 deny ipv4 9.0.0.0 0.255.255.255 239.0.0.0 0.255.255.255
 20 permit ipv4 any any
router msdp
 peer 11.0.0.9
  sa-filter in list ACL_SA_FILTER
  sa-filter out list ACL_SA_FILTER
 peer 8.0.0.12
  sa-filter in list ACL_SA_FILTER
  sa-filter out list ACL_SA_FILTER
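
As a rough illustration of how the extended ACL above classifies an (S,G) pair, the Python sketch below treats the ACL “source” field as the multicast source and the “destination” field as the group, then applies ordinary wildcard-mask matching with first-match-wins and an implicit deny at the end. The names ACL_SA_FILTER, wildcard_match and sa_permitted are illustrative only, and the sketch says nothing about how either platform implements the filter internally.

```python
import ipaddress


def wildcard_match(addr, base, wildcard):
    """True if addr equals base on every bit the wildcard mask does not ignore."""
    a = int(ipaddress.IPv4Address(addr))
    b = int(ipaddress.IPv4Address(base))
    w = int(ipaddress.IPv4Address(wildcard))
    return ((a ^ b) & ~w & 0xFFFFFFFF) == 0


# (action, source, source wildcard, group, group wildcard), mirroring the ACL above.
ACL_SA_FILTER = [
    ("deny", "9.0.0.0", "0.255.255.255", "239.0.0.0", "0.255.255.255"),
    ("permit", "0.0.0.0", "255.255.255.255", "0.0.0.0", "255.255.255.255"),
]


def sa_permitted(source, group, acl=ACL_SA_FILTER):
    """First matching entry decides; no match falls through to an implicit deny."""
    for action, src, src_wc, grp, grp_wc in acl:
        if wildcard_match(source, src, src_wc) and wildcard_match(group, grp, grp_wc):
            return action == "permit"
    return False


print(sa_permitted("9.2.5.2", "239.1.1.1"))  # AS 9 source, admin-scoped group
print(sa_permitted("9.2.5.2", "225.0.1.3"))  # AS 9 source, other group
```

Only the combination of a 9.0.0.0/8 source and a 239.0.0.0/8 group hits the deny entry; every other (S,G) pair falls through to the permit.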

We quickly verify that the configurations were applied successfully on both routers for both peers. On CSR10, the output changes slightly based on whether a route-map or ACL is used, since XE gives us the option.

R10#show ip msdp peer 8.0.0.12 | section Filtering
  SA Filtering:
    Input (S,G) filter: ACL_SA_FILTER, route-map: none
    Input RP filter: none, route-map: none
    Output (S,G) filter: ACL_SA_FILTER, route-map: none
    Output RP filter: none, route-map: none

R10#show ip msdp peer 11.0.0.9 | section Filtering
  SA Filtering:
    Input (S,G) filter: none, route-map: RM_FILTER_SA
    Input RP filter: none, route-map: none
    Output (S,G) filter: none, route-map: RM_FILTER_SA
    Output RP filter: none, route-map: none


RP/0/0/CPU0:XRv1#show msdp peer 11.0.0.9 | begin Filtering
  SA Filtering:
    Input (S,G) filter: ACL_SA_FILTER
    Input RP filter: none
    Output (S,G) filter: ACL_SA_FILTER
    Output RP filter: none

RP/0/0/CPU0:XRv1#show msdp peer 8.0.0.12 | begin Filtering
  SA Filtering:
    Input (S,G) filter: ACL_SA_FILTER
    Input RP filter: none
    Output (S,G) filter: ACL_SA_FILTER
    Output RP filter: none

To demonstrate this in action, I will source traffic from CSR2 to group 239.1.1.1, which will be (9.2.5.2, 239.1.1.1). This will match the outgoing filter on CSR10 and the SA will not be advertised to XRv2 or CSR9 at all, but XRv1 will receive it. XRv1 is likewise not allowed to send it to XRv2 or CSR9, so they shouldn’t have the SA via any means. We enable debugging on CSR10 and XRv1 to confirm this. Of note, it seems like the XE route-map doesn’t work since the message was filtered to XRv2 (ACL) but not to CSR9 (route-map).

R2#ping 239.1.1.1 repeat 1000000 timeout 1

! CSR10
MSDP(0): 7.0.0.11: Building SA message from SA cache
MSDP(0): start_index = 0, sa_cache_index = 0, Qlen = 0
MSDP(0): Sent entire sa-cache, sa_cache_index = 0, Qlen = 0
MSDP(0): Received 3-byte TCP segment from 7.0.0.11
MSDP(0): Append 3 bytes to 0-byte msg 9190 from 7.0.0.11, qs 1
MSDP(0): 7.0.0.11: Received 3-byte msg 9190 from peer
MSDP(0): 7.0.0.11: Keepalive TLV
MSDP(0): 7.0.0.11: Sending Keepalive message to peer
MSDP(0): 8.0.0.12: Filtered encapsulated message to peer
MSDP(0): 11.0.0.9: Send 100-byte SA encapsulated data for (9.2.5.2, 239.1.1.1), RP 9.0.0.10
MSDP(0): 7.0.0.11: Send 100-byte SA encapsulated data for (9.2.5.2, 239.1.1.1), RP 9.0.0.10

The XR debug isn’t nearly as valuable since it doesn’t show the filtering in real-time, but we do see the SA arriving from 9.0.0.10.

! XRv1
msdp[1045]: [1] default 10.10.11.10 tlv: ReadQ: 120 bytes in buffer
msdp[1045]: [1] default 10.10.11.10 tlv: ReadQ: TLV parsed length = 120, type = 1
msdp[1045]: [1] default 10.10.11.10 tlv: 10.10.11.10: recvd
msdp[1045]: [1] default 10.10.11.10 tlv: RP 9.0.0.10 has 1 entries
msdp[1045]: [1] default 10.10.11.10 tlv: -- (9.2.5.2/32 239.1.1.1)
msdp[1045]: [1] default 10.10.11.10 tlv: -- 100 bytes of encap'd data remain

We confirm the error by checking CSR10 to see if it actually advertised SAs. We also check XRv2 and CSR9 for further confirmation. We can conclude the route-map mechanism isn’t working in XE in version 3.13.2S. I recommend using the ACL only, primarily because it always works and because it is consistent with XR.

R10#show ip msdp peer 8.0.0.12 advertised-SAs | include 239
[no output]

R10#show ip msdp peer 11.0.0.9 advertised-SAs | include 239
239.1.1.1 9.2.5.2 (?)

RP/0/0/CPU0:XRv2#show msdp sa-cache 239.1.1.1
[no output]

R9#show ip msdp sa-cache 239.1.1.1
MSDP Source-Active Cache - 1 entries for 239.1.1.1
(9.2.5.2, 239.1.1.1), RP 9.0.0.10, MBGP/AS 9, 00:03:49/00:05:14, Peer 9.0.0.10

Continuing on, XRv1 correctly processes this SA from CSR10 and adds it to the SA cache. XRv1 is allowed to learn this SA as it was not filtered outbound on CSR10 or inbound on XRv1.

RP/0/0/CPU0:XRv1#show msdp sa-cache 239.1.1.1
MSDP Flags:
E - set MRIB E flag , L - domain local source is active,
EA - externally active source, PI - PIM is interested in the group,
DE - SAs have been denied. Timers age/expiration,
Cache Entry:
(9.2.5.2, 239.1.1.1), RP 9.0.0.10, MBGP/AS 9, 00:01:16/00:01:54
  Learned from peer 10.10.11.10, RPF peer 10.10.11.10
  SAs recvd 2, Encapsulated data received: 100
  grp flags: none, src flags: EA

Both XE and XR show ACL hits which are revealed when the SAs are matched against a filter. Of course, on CSR10, these hits only increment when advertising SAs to XRv2 since the route-map applied to CSR9 is dysfunctional and never matches.

RP/0/0/CPU0:XRv1#show access-lists ACL_SA_FILTER
ipv4 access-list ACL_SA_FILTER
 10 deny ipv4 9.0.0.0 0.255.255.255 239.0.0.0 0.255.255.255 (28 matches)
 20 permit ipv4 any any (24 matches)

R10#show access-list ACL_SA_FILTER
Extended IP access list ACL_SA_FILTER
    10 deny ip 9.0.0.0 0.255.255.255 239.0.0.0 0.255.255.255 (4 matches)
    20 permit ip any any (43 matches)

TTL scoping is a simple mechanism to help confine multicast traffic within a certain AS. This becomes very significant in IPv6 since it is one of the main mechanisms for performing inter-AS ASM; there is no MSDP for IPv6. MSDP can also identify TTL-thresholds per peer which will cause it not to encapsulate data traffic into SA messages to peers if the TTL of the multicast data packet is less than or equal to the configured value. For example, XE has no default here, but XR has a default threshold value of 2. This makes sense because in terms of inter-AS multicast, it’s highly unlikely that the source and destination are less than 2 hops apart. Building an inter-AS multicast tree would be useless since the data-plane traffic would expire before reaching the receivers. We quickly verify the defaults before adjusting the values.

R10#show ip msdp peer 7.0.0.11 | include ttl
  Peer ttl threshold: 0

RP/0/0/CPU0:XRv1#show msdp peer 10.10.11.10 | include ttl
  Peer ttl threshold: 2

On XRv1 and CSR10, on the peering they share, I adjust the TTL-threshold value to 8. This means that multicast packets with TTL XRv2 > CSR6 > CSR10 > CSR5. CSR5 is the root of the SPT as the RPF neighbor is connected.


RP/0/0/CPU0:XRv2#show pim topology 238.3.3.3 | begin 238
(9.2.5.2,238.3.3.3)SPT SSM Up: 00:04:54
JP: Join(now) RPF: GigabitEthernet0/0/0/0.562,10.6.12.6 Flags:
  GigabitEthernet0/0/0/0.582 00:04:54 fwd Join(00:03:29)

R6#show ip mroute ssm
(9.2.5.2, 238.3.3.3), 00:05:33/00:02:53, flags: sT
  Incoming interface: GigabitEthernet2.560, RPF nbr 9.6.10.10
  Outgoing interface list:
    GigabitEthernet2.562, Forward/Sparse, 00:05:33/00:02:53

R10#show ip mroute ssm
(9.2.5.2, 238.3.3.3), 00:08:20/00:03:02, flags: sT
  Incoming interface: GigabitEthernet2.550, RPF nbr 9.5.10.5
  Outgoing interface list:
    GigabitEthernet2.560, Forward/Sparse, 00:08:20/00:03:02

R5#show ip mroute ssm
(9.2.5.2, 238.3.3.3), 00:09:02/00:03:20, flags: sT
  Incoming interface: GigabitEthernet2.525, RPF nbr 0.0.0.0
  Outgoing interface list:
    GigabitEthernet2.550, Forward/Sparse, 00:09:02/00:03:20

Checking only the packet counters on CSR8 for brevity, we can see traffic is successfully being delivered to CSR3. There was no additional complexity with PIM RPs or MSDP in this design, and once the SPT is built, no additional signaling is required when the source starts sending traffic.

R2#ping 238.3.3.3 repeat 10000 timeout 1

R8#show ip mroute 238.3.3.3 9.2.5.2 count | begin ^Group
Group: 238.3.3.3, Source count: 1, Packets forwarded: 18, Packets received: 18
  Source: 9.2.5.2/32, Forwarding: 18/1/118/0, Other: 18/0/0

R8#show ip mroute 238.3.3.3 9.2.5.2 count | begin ^Group
Group: 238.3.3.3, Source count: 1, Packets forwarded: 28, Packets received: 28
  Source: 9.2.5.2/32, Forwarding: 28/1/118/0, Other: 28/0/0

Next, we will examine IPv6 inter-AS multicast routing. Because there is no MSDP for IPv6, we can achieve intelligent scoping in one of three ways. I list them in order of my personal preference, and probably from most straightforward to most difficult.

1. Use SSM as we did for IPv4. The signaling logic is identical for IPv4 SSM and IPv6 SSM.
2. Use embedded RP for ASM, which does not apply in IPv4. The IPv6 data packet embeds the IPv6 RP information and provided there is a common RP for this group, the proper trees can be built.
3. Use the scope bits inside the IPv6 group address for ASM. These were discussed in the IPv6 general architecture section and allow the administrator to define arbitrary boundaries. We could also use these to scope SSM traffic as they are independent from the multicast transport type.

First, we will use IPv6 SSM similar to how we did for IPv4. We will use the default IPv6 SSM range of FF33::/32 since XE does not appear capable of specifying a custom range. Technically, FF3x::/32 where 3 XRv4 > CSR7. Just like with IPv4, the little ‘s’ flag on XE indicates this is an SSM group, where XR actually uses the string “SSM”.

RP/0/0/CPU0:XRv3#show pim ipv6 topology ff33::4 | begin ff33
(2007:7:1:7::1,ff33::4) SPT SSM Up: 00:04:27
JP: Join(00:00:21) Flags:
RPF: GigabitEthernet0/0/0/0.593,fe80::9
  GigabitEthernet0/0/0/0.543 00:04:27 fwd LI LH

R9#show ipv6 mroute ff33::4 | begin \(
(2007:7:1:7::1, FF33::4), 00:06:10/00:02:10, flags: sT
  Incoming interface: GigabitEthernet2.594
  RPF nbr: FE80::14
  Immediate Outgoing interface list:
    GigabitEthernet2.593, Forward, 00:06:10/00:02:10

RP/0/0/CPU0:XRv4#show pim ipv6 topology ff33::4 | begin ff33
(2007:7:1:7::1,ff33::4) SPT SSM Up: 00:06:14
JP: Join(00:00:16) Flags:
RPF: GigabitEthernet0/0/0/0.574,fe80::7
  GigabitEthernet0/0/0/0.594 00:06:14 fwd Join(00:03:21)

R7#show ipv6 mroute ff33::4 | begin \(
(2007:7:1:7::1, FF33::4), 00:07:05/00:02:57, flags: sT
  Incoming interface: GigabitEthernet2.517
  RPF nbr: 2007:7:1:7::1
  Immediate Outgoing interface list:
    GigabitEthernet2.574, Forward, 00:07:05/00:02:57

We begin pinging on CSR1 to test reachability. We can check the counters on XRv3 to ensure traffic is being delivered to CSR4. We can see packets in/packets out on XRv3, and the numbers are equal which means there is only one OIL interface (no replication). I issue the command a few seconds apart so we can see the counters increment. The ‘A’ flag on Gig0/0/0/0.593 shows that packets were “accepted” from this interface (valid RPF) and the ‘EG’ flag on Gig0/0/0/0.543 shows that packets were “egressing” towards CSR4.

R1#ping ff33::4 repeat 10000 timeout 1
Output Interface: GigabitEthernet2.517

RP/0/0/CPU0:XRv3#show mfib ipv6 route ff33::4 | begin ff33
(2007:7:1:7::1,ff33::4) Flags:
  Up: 00:09:31
  Last Used: 00:00:00
  SW Forwarding Counts: 11/11/1100
  SW Replication Counts: 11/11/1100
  SW Failure Counts: 0/0/0/0/0
  GigabitEthernet0/0/0/0.543 Flags: NS EG, Up:00:09:31
  GigabitEthernet0/0/0/0.593 Flags: A, Up:00:09:31

RP/0/0/CPU0:XRv3#show mfib ipv6 route ff33::4 | begin ff33
(2007:7:1:7::1,ff33::4) Flags:
  Up: 00:09:33
  Last Used: 00:00:00
  SW Forwarding Counts: 13/13/1300
  SW Replication Counts: 13/13/1300
  SW Failure Counts: 0/0/0/0/0
  GigabitEthernet0/0/0/0.543 Flags: NS EG, Up:00:09:33
  GigabitEthernet0/0/0/0.593 Flags: A, Up:00:09:33

Next, we will test embedded RP in an inter-AS environment. We first need to identify the RPs and statically configure them (and only them) as RPs. Because this information is contained within the IPv6 data traffic, we don’t necessarily need to be fancy with per-group RP assignments. The RP will look like a static RP on these nodes (because it is). We begin with CSR10 and CSR9, then verify the RP configuration afterwards.

! CSR10
ipv6 pim rp-address 2009::a
! CSR9
ipv6 pim rp-address 200b::9

R10#show ipv6 pim range-list 2009::a
Static SM RP: 2009::A Exp: never Learnt from : ::
 FF00::/8 Up: 00:00:26

R9#show ipv6 pim range-list 200b::9
Static SM RP: 200B::9 Exp: never Learnt from : ::
 FF00::/8 Up: 00:00:12

We configure the same feature on XRv1 and XRv2. The configuration is similar and we verify the RP mappings before continuing. The difference is that in XR, you cannot simply specify a static RP; you must use the dedicated embedded-rp command. Additionally, you must specify a group-list, but I permit all globally-scoped groups for each RP. The ACL is commonly used for scoping since we can permit/deny certain ranges within a given scope. The breakdown of the group addresses is discussed next. At this point, every AS has an RP.

! XRv1
ipv6 access-list ACL_EMBEDDED_RP
 10 permit ipv6 any ff7e:b40:2007::/96
router pim
 address-family ipv6
  embedded-rp 2007::b ACL_EMBEDDED_RP
! XRv2
ipv6 access-list ACL_EMBEDDED_RP
 10 permit ipv6 any ff7e:c40:2008::/96
router pim
 address-family ipv6
  embedded-rp 2008::c ACL_EMBEDDED_RP

RP/0/0/CPU0:XRv1#show pim ipv6 group-map ff7e:b40:2007::
IP PIM Group Mapping Table
(* indicates group mappings being used)
(+ indicates BSR group mappings active in MRIB)

Group Range           Proto Client   Groups
ff7e:b40:2007::/96*   SM    embd-cfg 0
  RP: 2007::b RPF: De6tunnel2,2007::b (us)

RP/0/0/CPU0:XRv2#show pim ipv6 group-map ff7e:c40:2008::
IP PIM Group Mapping Table
(* indicates group mappings being used)
(+ indicates BSR group mappings active in MRIB)

Group Range           Proto Client   Groups
ff7e:c40:2008::/96*   SM    embd-cfg 0
  RP: 2008::c RPF: De6tunnel2,2008::c (us)

Before continuing, we will specify the appropriate embedded RP group-ranges supported by each embedded RP. For simplicity I will use global scope (0xE) for now. The RP host address is limited to 4 bits, which is why I used hexadecimal characters for the router loopback low-order bits rather than decimal numbers. This is relevant particularly on CSR10 and the XR routers. Taking CSR9’s range FF7E:0940:200B:0:0:0:0:0/96 as the example, the fields read as follows: the E in FF7E is the scope, the 9 in 0940 is the RP host ID, the 40 in 0940 is the prefix length (0x40 = 64), 200B:0:0:0 is the prefix, and the final 32 bits (the last two groups of zeros) are the available groups.

CSR9:  FF7E:0940:200B:0:0:0:0:0/96
CSR10: FF7E:0A40:2009:0:0:0:0:0/96
XRv1:  FF7E:0B40:2007:0:0:0:0:0/96
XRv2:  FF7E:0C40:2008:0:0:0:0:0/96
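
The field breakdown above can be checked mechanically. The following Python sketch decodes an embedded-RP group address using the bit layout from RFC 3956 (flags, scope, 4 reserved bits, 4-bit RP host ID, 8-bit prefix length, 64-bit prefix, 32-bit group ID) and rebuilds the RP address from it. The function name embedded_rp is made up for this illustration, and the sketch validates only the flag nibble and the prefix length.

```python
import ipaddress


def embedded_rp(group):
    """Derive the RP address embedded in an IPv6 multicast group address."""
    g = int(ipaddress.IPv6Address(group))
    if (g >> 120) != 0xFF:
        raise ValueError("not an IPv6 multicast address")
    flags = (g >> 116) & 0xF
    if flags & 0x7 != 0x7:
        raise ValueError("R, P and T flags must all be set for embedded RP")
    riid = (g >> 104) & 0xF            # RP host ID (low 4 bits of the RP address)
    plen = (g >> 96) & 0xFF            # prefix length, e.g. 0x40 = 64
    if plen == 0 or plen > 64:
        raise ValueError("prefix length must be between 1 and 64")
    prefix = (g >> 32) & ((1 << 64) - 1)
    prefix &= ~((1 << (64 - plen)) - 1) & ((1 << 64) - 1)  # keep only plen bits
    return ipaddress.IPv6Address((prefix << 64) | riid)


for grp in ("FF7E:0940:200B::", "FF7E:0A40:2009::",
            "FF7E:0B40:2007::", "FF7E:0C40:2008::"):
    print(grp, "->", embedded_rp(grp))
```

Run against the four ranges listed above, the decoded addresses should come out as the CSR9, CSR10, XRv1 and XRv2 RP loopbacks (200b::9, 2009::a, 2007::b and 2008::c).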

A very quick way to verify this is to ping an IPv6 multicast group in each of these ranges from any of our test clients provided embedded RP is supported (it is enabled by default on XE and XR). The PIM DR that registers these sources will be smart enough to know the RP is embedded based on the flags in the group address header (these flags were discussed in the IPv6 architecture section). I will conduct this test on CSR2 using CSR5 as the PIM DR. We enable debugging on CSR5 to watch it create range-list (RP mapping) entries for the proper RPs, then confirm it creates (S,G) state for each entry.

! CSR9 is RP
R2#ping FF7E:0940:200B:0:0:0:0:0
Output Interface: GigabitEthernet2.525
! CSR10 is RP
R2#ping FF7E:0A40:2009:0:0:0:0:0
Output Interface: GigabitEthernet2.525
! XRv1 is RP
R2#ping FF7E:0B40:2007:0:0:0:0:0
Output Interface: GigabitEthernet2.525
! XRv2 is RP
R2#ping FF7E:0C40:2008:0:0:0:0:0
Output Interface: GigabitEthernet2.525

! CSR5
IPv6 PIM: (2009:9:2:5::2,FF7E:940:200B::/128) MRIB update (t=1)
IPv6 PIM: Create range list 200B::9 Sparse
IPv6 PIM: Create range FF7E:940:200B::/96
IPv6 PIM: Adding monitor for 200B::9
IPv6 PIM: RPF change for root 200B::9: nbr FE80::10, GigabitEthernet2.550 via the rib

IPv6 PIM: (2009:9:2:5::2,FF7E:A40:2009::/128) MRIB update (t=1)
IPv6 PIM: Create range list 2009::A Sparse
IPv6 PIM: Create range FF7E:A40:2009::/96
IPv6 PIM: Adding monitor for 2009::A
IPv6 PIM: RPF change for root 2009::A: nbr FE80::10, GigabitEthernet2.550 via the rib

IPv6 PIM: (2009:9:2:5::2,FF7E:B40:2007::/128) MRIB update (t=1)
IPv6 PIM: Create range list 2007::B Sparse
IPv6 PIM: Create range FF7E:B40:2007::/96
IPv6 PIM: Adding monitor for 2007::B
IPv6 PIM: RPF change for root 2007::B: nbr FE80::10, GigabitEthernet2.550 via the rib

IPv6 PIM: (2009:9:2:5::2,FF7E:C40:2008::/128) MRIB update (t=1)
IPv6 PIM: Create range list 2008::C Sparse
IPv6 PIM: Create range FF7E:C40:2008::/96
IPv6 PIM: Adding monitor for 2008::C
IPv6 PIM: RPF change for root 2008::C: nbr FE80::10, GigabitEthernet2.550 via the rib

We can check the MRIB to ensure the (S,G) state exists for each one. Since CSR5 should be registering each (S,G) to each remote RP, the entries should exist for each (S,G) that CSR2 used.

R5#show ipv6 mroute | begin FF7
(2009:9:2:5::2, FF7E:940:200B::), 00:01:40/00:01:49, flags: SFT
  Incoming interface: GigabitEthernet2.525
  RPF nbr: 2009:9:2:5::2, Registering
  Immediate Outgoing interface list:
    Tunnel0, Forward, 00:01:40/never

(2009:9:2:5::2, FF7E:A40:2009::), 00:01:11/00:02:18, flags: SFT
  Incoming interface: GigabitEthernet2.525
  RPF nbr: 2009:9:2:5::2, Registering
  Immediate Outgoing interface list:
    Tunnel0, Forward, 00:01:11/never

(2009:9:2:5::2, FF7E:B40:2007::), 00:00:55/00:02:34, flags: SFT
  Incoming interface: GigabitEthernet2.525
  RPF nbr: 2009:9:2:5::2, Registering
  Immediate Outgoing interface list:
    Tunnel0, Forward, 00:00:55/never

(2009:9:2:5::2, FF7E:C40:2008::), 00:00:40/00:02:49, flags: SFT
  Incoming interface: GigabitEthernet2.525
  RPF nbr: 2009:9:2:5::2, Registering
  Immediate Outgoing interface list:
    Tunnel0, Forward, 00:00:40/never

Last, we can check the range-list mappings beginning with the first embedded RP. We see all four entries in the table, which proves our embedded RP group construction from earlier was correct.

R5#show ipv6 pim range-list | begin ^Embed
Embedded SM RP: 2007::B Exp: never Learnt from : ::
 FF7E:B40:2007::/96 Up: 00:01:38
Embedded SM RP: 2008::C Exp: never Learnt from : ::
 FF7E:C40:2008::/96 Up: 00:01:23
Embedded SM RP: 2009::A Exp: never Learnt from : ::
 FF7E:A40:2009::/96 Up: 00:01:54
Embedded SM RP: 200B::9 Exp: never Learnt from : ::
 FF7E:940:200B::/96 Up: 00:02:23
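The mapping between each group range and its RP can be reproduced offline. The following is a minimal Python sketch, assuming the embedded RP bit layout from RFC 3956 (flags, scope, reserved nibble, RIID, prefix length, 64-bit prefix field, 32-bit group ID). The function names embedded_rp and embedded_range are hypothetical helpers for illustration, not part of any router or library API.

```python
import ipaddress


def embedded_rp(group: str) -> ipaddress.IPv6Address:
    """Derive the RP address embedded in an IPv6 group (RFC 3956 layout)."""
    g = int(ipaddress.IPv6Address(group))
    if g >> 120 != 0xFF:
        raise ValueError("not an IPv6 multicast address")
    flags = (g >> 116) & 0xF   # 0RPT nibble
    riid = (g >> 104) & 0xF    # RP interface ID (low 4 bits of the RP address)
    plen = (g >> 96) & 0xFF    # number of significant bits in the RP prefix
    if flags & 0x7 != 0x7:
        raise ValueError("R, P and T flags must all be set for embedded RP")
    if riid == 0 or not 1 <= plen <= 64:
        raise ValueError("invalid RIID or prefix length")
    prefix = (g >> 32) & 0xFFFFFFFFFFFFFFFF       # 64-bit prefix field
    mask = ((1 << plen) - 1) << (64 - plen)       # keep only the top plen bits
    return ipaddress.IPv6Address(((prefix & mask) << 64) | riid)


def embedded_range(group: str) -> ipaddress.IPv6Network:
    """Return the /96 group range that shares the same embedded RP."""
    return ipaddress.IPv6Network((group, 96), strict=False)


if __name__ == "__main__":
    for grp in ("FF7E:940:200B::", "FF7E:A40:2009::",
                "FF7E:B40:2007::", "FF7E:C40:2008::"):
        print(embedded_range(grp), "->", embedded_rp(grp))
```

Running it against the four groups used in this test should give the same RP-to-range pairs that the range-list command reports, which is a quick sanity check before generating traffic.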

It appears that register messages are never being sent. None of the remote RPs show any indication of receiving a register message. To prove it, below I show two PIM tunnels. One is the register tunnel for the IPv4 RP within AS 7 and the other is for the embedded RP in AS 8. This is the same behavior seen in the MVPN section with embedded RP as well.

R5#show ip pim tunnel
Tunnel1
  Type       : PIM Encap
  RP         : 9.0.0.10
  Source     : 9.5.10.5
  State      : UP
  Last event : Created (1d02h)

R5#show ipv6 pim tunnel
Tunnel0*
  Type  : PIM Encap
  RP    : Embedded RP Tunnel
  Source: 2009::5

Despite the embedded RP destination being known, the tunnel destination is unspecified. One may think this is just a cosmetic issue because the RP isn’t learned ahead of time and the tunnel is built dynamically. Having tested embedded RP before, I can confirm this is actually the normal case.

R5#show derived-config interface tunnel1
interface Tunnel1
 description Pim Register Tunnel (Encap) for RP 9.0.0.10
 ip unnumbered GigabitEthernet2.550
 tunnel source GigabitEthernet2.550
 tunnel destination 9.0.0.10
 tunnel tos 192

R5#show derived-config interface tunnel0
interface Tunnel0
 description Pim Register Tunnel (Encap) for Embedded RP
 no ip address
 ipv6 unnumbered Loopback0
 ipv6 enable
 tunnel source Loopback0
 tunnel destination ::
 tunnel tos 224
 tunnel ttl 65

Furthermore, tunnel0 shows no outbound packets while tunnel1 does. Tunnel0 doesn’t even show output drops or any sign of activity. Debugging CEF drops and IPv6 packets also reveals no indication of PIMv6 register messages being sent. Unfortunately, this is also the case in a working embedded RP design, making the problem difficult to troubleshoot.

R5#show interfaces tunnel1 | include output
  Last input never, output never, output hang never
  Input queue: 0/375/0/0 (size/max/drops/flushes); Total output drops: 0
  5 minute output rate 0 bits/sec, 0 packets/sec
     10 packets output, 1280 bytes, 0 underruns
     0 output errors, 0 collisions, 0 interface resets
     0 output buffer failures, 0 output buffers swapped out

R5#show interfaces tunnel0 | include output
  Last input never, output never, output hang never
  Input queue: 0/375/0/0 (size/max/drops/flushes); Total output drops: 0
  5 minute output rate 0 bits/sec, 0 packets/sec
     0 packets output, 0 bytes, 0 underruns
     0 output errors, 0 collisions, 0 interface resets
     0 output buffer failures, 0 output buffers swapped out

The feature actually works if traffic is sourced from CSR5’s loopback and sent out of the interface as well. I assume that because this local traffic is process-switched, the registration process happens successfully. I quickly test a group for each of the 4 embedded RPs to make sure the registers are working now. All of the OILs are null since there are no receivers in the network for any of these groups.

R5#ping FF7E:A40:2009:: source loopback0
Output Interface: loopback0

R10#show ipv6 mroute FF7E:A40:2009:: | begin \(
(2009::5, FF7E:A40:2009::), 00:00:25/00:03:04, flags: SP
  Incoming interface: GigabitEthernet2.550
  RPF nbr: FE80::5
  Outgoing interface list: Null

R5#ping FF7E:940:200B:: source loopback0
Output Interface: loopback0

R9#show ipv6 mroute FF7E:940:200B:: | begin \(
(2009::5, FF7E:940:200B::), 00:00:12/00:03:16, flags: SP
  Incoming interface: GigabitEthernet2.569
  RPF nbr: FE80::6
  Outgoing interface list: Null

For additional verification, IPv6 PIM debugs on CSR9 (one of the RPs) show that a register was received. We never saw this before when trying to make CSR5 register sources on its LAN interface to CSR2.

! CSR9
IPv6 PIM: (2009::5,FF7E:940:200B::) Received Register from 2009::5
IPv6 PIM: (2009::5,FF7E:940:200B::) Create entry
IPv6 PIM: (2009::5,FF7E:940:200B::) RPF changed from ::/- to FE80::6/GigabitEthernet2.569

When testing this with XRv1 and XRv2, they receive the registers, but RPF appears to be failing despite both routers having valid unicast BGP routes.

R5#ping FF7E:B40:2007:: source loopback0
Output Interface: loopback0

RP/0/0/CPU0:XRv1#show pim ipv6 topology ff7e:b40:2007:: | begin 2009
(2009::5,ff7e:b40:2007::) SM Up: 00:00:14 JP: Null(never) Flags: KAT(00:03:15) RA RR (00:04:20)
RPF: Null,::
  No interfaces in immediate olist

R5#ping FF7E:C40:2008:: source loopback0
Output Interface: loopback0

RP/0/0/CPU0:XRv2#show pim ipv6 topology ff7e:c40:2008:: | begin 2009
(2009::5,ff7e:c40:2008::) SM Up: 00:00:16 JP: Null(never) Flags: KAT(00:03:15) RA RR (00:04:18)
RPF: Null,::
  No interfaces in immediate olist

RP/0/0/CPU0:XRv1#show route ipv6 2009::5
Routing entry for 2009::5/128
  Known via "bgp 7", distance 20, metric 0
  Tag 9, type external
  Routing Descriptor Blocks
    fe80::10, from fd00:10:10:11::10, via GigabitEthernet0/0/0/0.501, BGP external
      Route metric is 0
  No advertising protos.

RP/0/0/CPU0:XRv2#show route ipv6 2009::5
Routing entry for 2009::5/128
  Known via "bgp 8", distance 20, metric 0
  Tag 9, type external
  Routing Descriptor Blocks
    fe80::6, from fd00:10:6:12::6, via GigabitEthernet0/0/0/0.562, BGP external
      Route metric is 0
  No advertising protos.

This is because whenever XR is running IPv4/v6 multicast in BGP, the IPv4/v6 unicast routes are not allowed to satisfy RPF. This is stricter than XE, where any route can satisfy RPF if the multicast BGP entries do not exist. This is part of the reason why the MSDP RPF rules were so difficult earlier; XR assumes that if BGP multicast is available it must be used, and unicast BGP routes are ignored. We will quickly advertise CSR5’s IPv6 loopback into IPv6 BGP multicast, and then start the pings again.

! CSR5
router bgp 9
 address-family ipv6 multicast
  network 2009::5/128

R5#ping FF7E:B40:2007:: source loopback0
Output Interface: loopback0

RP/0/0/CPU0:XRv1#show pim ipv6 topology ff7e:b40:2007:: | begin 2009
(2009::5,ff7e:b40:2007::) SM Up: 00:00:22 JP: Null(never) Flags: KAT(00:03:09) RA RR (00:04:12)
RPF: GigabitEthernet0/0/0/0.501,fe80::10
  No interfaces in immediate olist

R5#ping FF7E:C40:2008:: source loopback0
Output Interface: loopback0

RP/0/0/CPU0:XRv2#show pim ipv6 topology FF7E:C40:2008:: | begin 2009
(2009::5,ff7e:c40:2008::) SM Up: 00:00:05 JP: Null(never) Flags: KAT(00:03:27) RA RR (00:04:29)
RPF: GigabitEthernet0/0/0/0.562,fe80::6
  No interfaces in immediate olist
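The difference between the two platforms can be captured in a small model. The following Python sketch is only a simplified illustration of the behavior described here, not the actual RPF algorithm of either operating system. The function names, the dictionary-based RIBs, and the strict switch are all assumptions made for the example: strict=True stands for the XR-like behavior and strict=False for the XE-like fallback.

```python
import ipaddress


def longest_match(rib, source):
    """Return the next hop of the longest prefix in rib covering source, or None."""
    addr = ipaddress.ip_address(source)
    best = None
    for prefix, next_hop in rib.items():
        net = ipaddress.ip_network(prefix)
        if addr.version == net.version and addr in net:
            if best is None or net.prefixlen > best[0].prefixlen:
                best = (net, next_hop)
    return None if best is None else best[1]


def rpf_lookup(source, unicast_rib, multicast_rib, strict):
    """Model RPF selection.

    multicast_rib is None when the multicast AFI is not configured at all,
    and a (possibly empty) dict when it is configured.
    """
    if multicast_rib is not None:
        hit = longest_match(multicast_rib, source)
        if hit is not None or strict:
            # Strict mode never falls back to unicast routes.
            return hit
    return longest_match(unicast_rib, source)


if __name__ == "__main__":
    unicast = {"2009::5/128": "fe80::10"}
    print(rpf_lookup("2009::5", unicast, {}, strict=True))    # XR-like: no RPF
    print(rpf_lookup("2009::5", unicast, {}, strict=False))   # XE-like: unicast used
    print(rpf_lookup("2009::5", unicast, dict(unicast), strict=True))
```

In this model, advertising the source into the multicast table is the only way to make the strict lookup succeed while the multicast AFI remains configured, which mirrors the fix applied on CSR5.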

Despite the RPF next-hops being the same as the unicast routes, XR simply didn’t accept the unicast BGP routes for RPF when multicast AFIs were available. A quick look at the PIM RPF information confirms the valid entries. CSR9 and CSR10, also running BGP IPv6 multicast, accepted the RPF derived from the BGP IPv6 unicast route to 2009::5.

RP/0/0/CPU0:XRv1#show pim ipv6 rpf 2009::5
Table: IPv6-Multicast-default
2009::5/128 [20/0]
    via fe80::10, GigabitEthernet0/0/0/0.501

RP/0/0/CPU0:XRv2#show pim ipv6 rpf 2009::5
Table: IPv6-Multicast-default
2009::5/128 [20/0]
    via fe80::6, GigabitEthernet0/0/0/0.562

To actually test a real flow, I will join a group in AS8 from CSR3. This is an ASM join so no sources are included, as CSR8 shows.

! CSR3
interface GigabitEthernet2.538
 ipv6 mld join-group FF7E:C40:2008::3

R8#show ipv6 mld groups FF7E:C40:2008::3 detail
Interface:      GigabitEthernet2.538
Group:          FF7E:C40:2008::3
Uptime:         03:51:31
Router mode:    EXCLUDE (Expires: 00:03:23)
Host mode:      INCLUDE
Last reporter:  FE80::3
Source list is empty

CSR8 originates the (*,G) join towards the RP, which is 2008::C (XRv2). It gleans this from the group itself just like the first-hop routers did during the source registration process. XRv2 receives the join, sees itself as the RP, and becomes the root of the (*,G) tree. It is standing by to receive register messages on its decapsulation PIM tunnel.

R8#show ipv6 mroute ff7e:c40:2008::3 | begin \(
(*, FF7E:C40:2008::3), 03:52:45/never, RP 2008::C, flags: SCJ
  Incoming interface: GigabitEthernet2.582
  RPF nbr: FE80::12
  Immediate Outgoing interface list:
    GigabitEthernet2.538, Forward, 03:52:45/never

RP/0/0/CPU0:XRv2#show pim ipv6 topology ff7e:c40:2008::3 | begin ::3
(*,ff7e:c40:2008::3) SM Up: 03:50:33 JP: Join(never) Flags:
RP: 2008::c*
RPF: Decaps6tunnel2,2008::c
  GigabitEthernet0/0/0/0.582 03:50:33 fwd Join(00:02:58)


We will send traffic from CSR5 (sourcing from the loopback so it works reliably) towards this new group. When XRv2 receives the register message, it does nothing special, and the signaling follows the same PIM ASM process we have reviewed many times.

R5#ping FF7E:C40:2008::3 source loopback0 repeat 100000 timeout 1
Output Interface: loopback0

XRv2 sends the (S,G) join back towards 2009::5, which goes through CSR6 and CSR10 on the way. This is based on the BGP IPv6 multicast route which describes the RPF back to the source in question.

RP/0/0/CPU0:XRv2#show pim ipv6 topology | begin 2009
(2009::5,ff7e:c40:2008::3) SPT SM Up: 00:09:56 JP: Join(00:00:10) Flags: KAT(00:02:59) RA RR (00:04:01)
RPF: GigabitEthernet0/0/0/0.562,fe80::6
  GigabitEthernet0/0/0/0.582 00:00:34 fwd Join(00:02:55)

R6#show ipv6 mroute FF7E:C40:2008::3 | begin \(
(2009::5, FF7E:C40:2008::3), 00:04:16/00:03:02, flags: ST
  Incoming interface: GigabitEthernet2.560
  RPF nbr: FE80::10
  Immediate Outgoing interface list:
    GigabitEthernet2.562, Forward, 00:04:16/00:03:02

R10#show ipv6 mroute FF7E:C40:2008::3 | begin \(
(2009::5, FF7E:C40:2008::3), 00:04:26/00:03:03, flags: ST
  Incoming interface: GigabitEthernet2.550
  RPF nbr: FE80::5
  Immediate Outgoing interface list:
    GigabitEthernet2.560, Forward, 00:04:26/00:03:03

CSR8 switches over to the SPT (not terribly relevant in this design or for embedded RP) and is actively receiving packets along this tree. I issue the command a few seconds apart to ensure the counters are incrementing, which shows packets being forwarded towards the client, CSR3.

R8#show ipv6 mroute FF7E:C40:2008::3 2009::5 count | begin ^Group
Group: FF7E:C40:2008::3
  Source: 2009::5,
   SW Forwarding: 0/0/0/0, Other: 0/0/0
   HW Forwarding: 107/1/118/0, Other: 0/0/0
  Totals - Source count: 1, Packet count: 107

R8#show ipv6 mroute FF7E:C40:2008::3 2009::5 count | begin ^Group
Group: FF7E:C40:2008::3
  Source: 2009::5,
   SW Forwarding: 0/0/0/0, Other: 0/0/0
   HW Forwarding: 137/1/118/0, Other: 0/0/0
  Totals - Source count: 1, Packet count: 137

The last technique for inter-AS multicast is scoping. XE makes it easy with a custom command that automatically blocks all traffic at the configured scope or lower. XR requires manual multicast boundaries, which is similar to the IPv4 multicast boundary method.

Let’s assume that AS7 and AS9 are part of the same organization but are different geographical sites. They should be able to share organization-local traffic, but not site-local traffic. Their links to ASes 8 and 11 should block organizational traffic since those ASes are not within the organization.

First, we will configure the XE routers in AS 9 since they are simpler. CSR10 limits scope 5 (site-local) and below within the AS, while CSR6 limits scope 8 (organization-local) and below within the AS. This means that global traffic can use any of these three links and organization-local traffic can use Gig2.501 on CSR10. I also show the context-sensitive help; we can use numbers or textual macros (which get converted into numbers).

R10(config-subif)#ipv6 multicast boundary scope ?
                      Scope identifier for this zone
  admin-local         Admin-local(4)
  organization-local  Organization-local(8)
  site-local          Site-local(5)
  subnet-local        Subnet-local(3)
  vpn                 Virtual Routing/Forwarding(14)

! CSR10
interface GigabitEthernet2.501
 ipv6 multicast boundary scope 5
! CSR6
interface GigabitEthernet2.562
 ipv6 multicast boundary scope 8
interface GigabitEthernet2.569
 ipv6 multicast boundary scope 8
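To reason about which groups a given scope boundary will stop, it helps to pull the scope nibble out of the group address. The following Python sketch assumes the rule stated above (a boundary at scope N blocks groups whose scope is N or lower). The function names are hypothetical and exist only for this illustration.

```python
import ipaddress


def group_scope(group: str) -> int:
    """Return the 4-bit scope value of an IPv6 multicast group."""
    g = int(ipaddress.IPv6Address(group))
    if g >> 120 != 0xFF:
        raise ValueError("not an IPv6 multicast address")
    return (g >> 112) & 0xF   # fourth hex digit of the address


def blocked_by_scope_boundary(group: str, boundary_scope: int) -> bool:
    """True if a boundary configured at boundary_scope would stop this group."""
    return group_scope(group) <= boundary_scope


if __name__ == "__main__":
    # Scope 5 boundary (as on CSR10) and scope 8 boundary (as on CSR6)
    for grp in ("FF33::4", "FF38::1", "FF7E:C40:2008::3"):
        print(grp, group_scope(grp),
              blocked_by_scope_boundary(grp, 5),
              blocked_by_scope_boundary(grp, 8))
```

An organization-local group such as FF38::1 is therefore expected to pass a scope 5 boundary but not a scope 8 boundary, while the global-scope embedded RP groups pass both.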

On XR, we have to create exhaustive ACLs. Since wildcards are not supported for IPv6 ACLs on either XE or XR, we enumerate all of the possible group ranges. Since the ‘R’, ‘P’, and ‘T’ bits have some dependencies upon one another, we don’t need to enumerate all 16 combinations for the third hex digit, but we do need to account for the four valid combinations (0, 1, 3, and 7). We also should account for lesser, routable scopes, such as scopes 3 (subnet-local) and 4 (admin-local). The ACL on XRv4 is even longer since it has to account for all routable scopes at value 8 or less.

! XRv1
ipv6 access-list ACL_SITE_BOUNDARY
 10 deny ipv6 any ff05::/32
 20 deny ipv6 any ff15::/32
 30 deny ipv6 any ff35::/32
 40 deny ipv6 any ff75::/32
 50 deny ipv6 any ff04::/32
 60 deny ipv6 any ff14::/32
 70 deny ipv6 any ff34::/32
 80 deny ipv6 any ff74::/32
 90 deny ipv6 any ff03::/32
 100 deny ipv6 any ff13::/32
 110 deny ipv6 any ff33::/32
 120 deny ipv6 any ff73::/32
 130 permit ipv6 any any
multicast-routing
 address-family ipv6
  interface GigabitEthernet0/0/0/0.501
   boundary ACL_SITE_BOUNDARY

! XRv4
ipv6 access-list ACL_ORG_BOUNDARY
 10 deny ipv6 any ff08::/32
 20 deny ipv6 any ff18::/32
 30 deny ipv6 any ff38::/32
 40 deny ipv6 any ff78::/32
 50 deny ipv6 any ff07::/32
 60 deny ipv6 any ff17::/32
 70 deny ipv6 any ff37::/32
 80 deny ipv6 any ff77::/32
 90 deny ipv6 any ff06::/32
 100 deny ipv6 any ff16::/32
 110 deny ipv6 any ff36::/32
 120 deny ipv6 any ff76::/32
 130 deny ipv6 any ff05::/32
 140 deny ipv6 any ff15::/32
 150 deny ipv6 any ff35::/32
 160 deny ipv6 any ff75::/32
 170 deny ipv6 any ff04::/32
 180 deny ipv6 any ff14::/32
 190 deny ipv6 any ff34::/32
 200 deny ipv6 any ff74::/32
 210 deny ipv6 any ff03::/32
 220 deny ipv6 any ff13::/32
 230 deny ipv6 any ff33::/32
 240 deny ipv6 any ff73::/32
 250 permit ipv6 any any
multicast-routing
 address-family ipv6
  interface GigabitEthernet0/0/0/0.524
   boundary ACL_ORG_BOUNDARY
  interface GigabitEthernet0/0/0/0.594
   boundary ACL_ORG_BOUNDARY
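Because these ACLs are tedious to type and easy to get wrong, they are a good candidate for generation. The following Python sketch is one possible way to build the entries, assuming the valid flag nibbles are 0, 1, 3, and 7 (R requires P, and P requires T), that scopes 3 through the boundary scope must be denied, and that the /32 prefix length used in this lab is wanted. The function name boundary_acl and its parameters are hypothetical.

```python
# Valid 0RPT flag nibbles: none, T, P+T, R+P+T
VALID_FLAGS = (0x0, 0x1, 0x3, 0x7)


def boundary_acl(name, max_scope, min_scope=3, prefix_len=32):
    """Return IOS XR style ACL lines denying every group at max_scope or lower."""
    lines = [f"ipv6 access-list {name}"]
    seq = 10
    # Walk from the boundary scope down to the lowest routable scope.
    for scope in range(max_scope, min_scope - 1, -1):
        for flags in VALID_FLAGS:
            lines.append(
                f" {seq} deny ipv6 any ff{flags:x}{scope:x}::/{prefix_len}")
            seq += 10
    # Everything above the boundary scope is allowed through.
    lines.append(f" {seq} permit ipv6 any any")
    return lines


if __name__ == "__main__":
    print("\n".join(boundary_acl("ACL_SITE_BOUNDARY", 5)))
    print("\n".join(boundary_acl("ACL_ORG_BOUNDARY", 8)))
```

A site-local boundary yields 12 deny entries and an organization-local boundary yields 24, each followed by the final permit. Note that the /32 length only matches groups whose second 16-bit word is zero; that choice is kept as a parameter here rather than changed.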


We can see the effect of this within our existing topology. Earlier, when we tested IPv6 SSM, we used group FF33::4. This group is subnet-local with a scope value of 3, which is relatively low. XRv3 and CSR9 both have the (S,G) join for this as they are trying to build the tree back towards CSR1.

RP/0/0/CPU0:XRv3#show pim ipv6 topology | begin ff33
(2007:7:1:7::1,ff33::4) SPT SSM Up: 08:18:21 JP: Join(00:00:15) Flags:
RPF: GigabitEthernet0/0/0/0.593,fe80::9
  GigabitEthernet0/0/0/0.543 08:18:21 fwd LI LH

R9#show ipv6 mroute ff33::4 | begin \(
(2007:7:1:7::1, FF33::4), 07:47:26/00:03:17, flags: sT
  Incoming interface: GigabitEthernet2.594
  RPF nbr: FE80::14
  Immediate Outgoing interface list:
    GigabitEthernet2.593, Forward, 07:47:26/00:03:17

However, XRv4 should not have this (S,G) due to the multicast boundary applied. Only traffic scoped 9 or higher can be exchanged with AS 11. We see no multicast state for this group, and furthermore the boundary shows hits against the FF33::/32 entry. This proves that the boundary is functioning as expected, and this breaks connectivity from CSR1 to CSR4 using this group.

RP/0/0/CPU0:XRv4#show pim ipv6 topology ff33::4
No PIM topology table entries found.

RP/0/0/CPU0:XRv4#show access-list ipv6
ipv6 access-list ACL_ORG_BOUNDARY
 10 deny ipv6 any ff08::/32
 [snip, other entries are not interesting]
 210 deny ipv6 any ff03::/32
 220 deny ipv6 any ff13::/32
 230 deny ipv6 any ff33::/32 (8 matches)
 240 deny ipv6 any ff73::/32
 250 permit ipv6 any any

To test the XE scope-based boundaries, we will try to join a new group on CSR4 with CSR2 as the source. The group is FF38::4, whose scope of 8 is exactly equal to the boundary scope configured on the AS 9 border routers; we expect the (S,G) join from CSR9 to be rejected by CSR6.

! CSR4
interface GigabitEthernet2.543
 ipv6 mld join-group FF38::4 2009:9:2:5::2

RP/0/0/CPU0:XRv3#show pim ipv6 topology | begin ff38
(2009:9:2:5::2,ff38::4) SPT SSM Up: 00:00:18 JP: Join(00:00:57) Flags:
RPF: GigabitEthernet0/0/0/0.593,fe80::9
  GigabitEthernet0/0/0/0.543 00:00:18 fwd LI LH

R9#show ipv6 mroute ff38::4 | begin \(
(2009:9:2:5::2, FF38::4), 00:00:30/00:02:59, flags: sT
  Incoming interface: GigabitEthernet2.569
  RPF nbr: FE80::6
  Immediate Outgoing interface list:
    GigabitEthernet2.593, Forward, 00:00:30/00:02:59

Looking at CSR6, it has no entry for this group. There aren’t many positive-logic show commands for this feature. The best way to verify is by confirming the absence of the groups in question on a given router. IPv6 PIM debugging on XE does reveal that the (S,G) join from CSR9 is rejected due to multicast scope, but there does not appear to be a comparable show command.

R6#show ipv6 mroute ff38::4
No mroute entries found.

! CSR6
IPv6 PIM: Received J/P on GigabitEthernet2.569 from FE80::9 target: FE80::6 (to us)
IPv6 PIM: J/P Group FF38::4 is a scoped boundary or denied, skipping

Configuring another SSM join on CSR1 with a group of similar scope towards CSR2, we can confirm that this works. CSR7 receives the INCLUDE-mode join which contains the source of CSR2. This is because AS 7 and AS 9 are allowed to share organization-scope IPv6 multicast.

! CSR1
interface GigabitEthernet2.517
 ipv6 mld join-group FF38::1 2009:9:2:5::2

R7#show ipv6 mld groups FF38::1 detail
Interface:      GigabitEthernet2.517
Group:          FF38::1
Uptime:         00:00:23
Router mode:    INCLUDE
Host mode:      INCLUDE
Last reporter:  FE80::1
Group source list:
Source Address      Uptime    Expires   Fwd  Flags
2009:9:2:5::2       00:00:23  00:03:56  Yes  Remote 4

CSR7 originates an (S,G) join towards CSR2, which is via XRv1. XRv1 then sends the (S,G) join towards CSR10 which is the RPF neighbor for the source. It passes the filter outbound on XRv1 since the scope is organization-local. If the multicast boundary denied this group, the link to CSR10 would not be in the OIL.


R7#show ipv6 mroute FF38::1 | begin \(
(2009:9:2:5::2, FF38::1), 00:02:02/never, flags: sTI
  Incoming interface: GigabitEthernet2.571
  RPF nbr: FE80::11
  Immediate Outgoing interface list:
    GigabitEthernet2.517, Forward, 00:02:02/never

RP/0/0/CPU0:XRv1#show pim ipv6 topology | begin ff38
(2009:9:2:5::2,ff38::1) SPT SSM Up: 00:02:20 JP: Join(00:00:14) Flags:
RPF: GigabitEthernet0/0/0/0.501,fe80::10
  GigabitEthernet0/0/0/0.571 00:02:20 fwd Join(00:03:16)

The received (S,G) join likewise passes the filter inbound on CSR10 since the group’s scope of 8 is greater than the boundary scope of 5. CSR10 passes the join to CSR5, which is the root of the SPT.

R10#show ipv6 mroute FF38::1 | begin \(
(2009:9:2:5::2, FF38::1), 00:03:27/00:02:38, flags: sT
  Incoming interface: GigabitEthernet2.550
  RPF nbr: FE80::5
  Immediate Outgoing interface list:
    GigabitEthernet2.501, Forward, 00:03:27/00:02:38

R5#show ipv6 mroute FF38::1 | begin \(
(2009:9:2:5::2, FF38::1), 00:03:42/00:02:50, flags: sT
  Incoming interface: GigabitEthernet2.525
  RPF nbr: 2009:9:2:5::2
  Immediate Outgoing interface list:
    GigabitEthernet2.550, Forward, 00:03:42/00:02:50

To confirm that organization-local multicast works, we ping from CSR2 and check the counters on CSR7 to ensure they increment.

R7#show ipv6 mroute FF38::1 count | begin ^Group
Group: FF38::1
  Source: 2009:9:2:5::2,
   SW Forwarding: 0/0/0/0, Other: 0/0/0
   HW Forwarding: 12/1/118/0, Other: 0/0/0
  Totals - Source count: 1, Packet count: 12

R7#show ipv6 mroute FF38::1 count | begin ^Group
Group: FF38::1
  Source: 2009:9:2:5::2,
   SW Forwarding: 0/0/0/0, Other: 0/0/0
   HW Forwarding: 22/1/118/0, Other: 0/0/0
  Totals - Source count: 1, Packet count: 22

In summary, there are several key lessons from this chapter:

- It is bad practice to let local RP information (BSR, AutoRP) leak in/out of a multicast domain/AS
  1. Use BSR border at the boundaries
  2. Filter AutoRP groups manually at the boundaries
  3. Last ditch effort: Use static RP “overrides”
- Cisco and RFC3618 MSDP RPF rules are different. XE uses Cisco’s rules by default while XR uses RFC3618. I recommend enabling RFC3618 RPF rules on XE routers because they are simpler to understand and consistent with XR platforms. This also enables the “show ip msdp rpf-peer” command within XE which returns the RFC3618 RPF rule used to match the RP input.
- When XR has BGP multicast AFIs configured, the RPF lookup always uses this table over the corresponding BGP unicast AFI. Existing BGP unicast routes, however valid they may be, cannot be used for RPF unless the corresponding multicast AFI is totally disabled on the router.
- For IPv4, inter-AS ASM almost always requires MSDP. SSM is much simpler as it removes the need for RPs and MSDP entirely.
- By default, MSDP is very liberal with its SA advertisement and relies heavily on RPF to prevent loops. A well-tuned MSDP network makes use of mesh-groups, TTL-thresholds, redistribution filters, and SA filters to minimize SA flooding in the first place.
- For IPv6, there are three main design options for enabling inter-AS multicast service:
  1. SSM: Simplest and most reliable
  2. Embedded RP ASM: Works well when used in conjunction with RP ACLs and scopes
  3. Scoping with globally-reachable RPs: Difficult to manage but possible

Additional Reading – Reference configurations “inter-as-mcast”

29.2 Multicast Only Fast Re-Route (MoFRR)

MoFRR is a feature used to ensure high availability for multicast traffic. It does not introduce any new protocols but adjusts how RPF works with traditional PIM. It is also not specific to MVPN or MPLS in general.

When the PIM last hop router (LHR) receives an IGMP join, whether ASM or SSM, it will originate some kind of PIM join. ASM joins will be (*,G) PIM joins sent to the RP while SSM joins will be (S,G) PIM joins sent to the first hop router (FHR) connected to the source. In either case, the LHR and all intermediate routers towards the root of the given tree select exactly one RPF interface. This will normally follow the shortest IGP path to the root, and in the case of ECMP, select the PIM neighbor with the highest IP address.

MoFRR allows a router to send multiple PIM joins towards the root of the tree for diversity; this technique works best when the two ECMP paths are separate networks or “planes” that are totally independent, but this is not a requirement. We can test it using an existing topology. Specifically, this section uses the MVPN profile 0 with PIM-SSM basic configuration as a starting point. The diagram is shown again for reference.


The only initial modification is that the direct link between XRv3 and XRv4 has been adjusted to an IS-IS metric of 20. In this way, there are three equal-cost paths between XRv3 and XRv4; MoFRR requires ECMP to operate. We can check the FIB on each core router towards one another to verify this configuration change.

! XRv3 and XRv4
router isis 1
 interface GigabitEthernet0/0/0/0.534
  point-to-point
  address-family ipv4 unicast
   metric 20
  address-family ipv6 unicast
   metric 20

RP/0/0/CPU0:XRv3#show cef ipv4 213.14.14.14 | include via
   via 213.13.14.14, GigabitEthernet0/0/0/0.534, 5 dependencies, weight 0, class 0 [flags 0x0]
   via 213.5.13.5, GigabitEthernet0/0/0/0.553, 8 dependencies, weight 0, class 0 [flags 0x0]
   via 213.7.13.7, GigabitEthernet0/0/0/0.573, 8 dependencies, weight 0, class 0 [flags 0x0]

RP/0/0/CPU0:XRv4#show cef ipv4 213.13.13.13 | include via
   via 213.13.14.13, GigabitEthernet0/0/0/0.534, 5 dependencies, weight 0, class 0 [flags 0x0]
   via 213.5.14.5, GigabitEthernet0/0/0/0.554, 8 dependencies, weight 0, class 0 [flags 0x0]
   via 213.7.14.7, GigabitEthernet0/0/0/0.574, 8 dependencies, weight 0, class 0 [flags 0x0]

Next, we will configure some new SSM groups within the customer VPN. Since the core of the network is purely using SSM, that is what will be tested. Using SSM in the customer VPN will simplify the design and allow us to focus on MoFRR, not basic PIM-ASM processes. To support data MDTs for sources behind it, CSR1’s data MDT ACL will be extended to the entire SSM range, since CSR9 will be an IPv4 multicast source for this test. We will join a new SSM group on CSR4 as this will be the receiver.

! CSR1
ip access-list extended ACL_DATA_MDT
 no 10
 10 permit ip any 232.0.0.0 0.255.255.255
! CSR4
interface GigabitEthernet2.546
 ip igmp join-group 232.42.51.8 source 10.1.9.9

Before continuing, we will verify that the proper MDT has been built. CSR6 receives the IGMPv3 join operating in INCLUDE mode; we verify the proper source is carried in the membership report. A corresponding C(S,G) entry is created for this source/group pair.

R6#show ip igmp vrf MC groups 232.42.51.8 detail
[snip]
Interface:      GigabitEthernet2.546
Group:          232.42.51.8
Flags:          SSM
Uptime:         00:19:39
Group mode:     INCLUDE
Last reporter:  10.4.6.4
[snip]
  Source Address   Uptime    v3 Exp    CSR Exp   Fwd  Flags
  10.1.9.9         00:19:39  00:02:30  stopped   Yes  R


CSR6 translates this into a C-PIM join and sends it to CSR1. The join is sent within the default MDT which is used for C-PIM signaling.

R6#debug ip pim vrf MC
PIM(1): Insert (10.1.9.9,232.42.51.8) join in nbr 213.1.1.1's queue
PIM(1): Building Join/Prune packet for nbr 213.1.1.1
PIM(1): Adding v2 (10.1.9.9/32, 232.42.51.8), S-bit Join
PIM(1): Send v2 join/prune to 213.1.1.1 (Tunnel3)

Upon receipt, CSR1 adds the C(S,G) state to its C-MRIB and notifies CSR6, via PIM MDT TLV, that a data MDT is available for this C(S,G).

R1#debug ip pim vrf MC
PIM(17): Received v2 Join/Prune on Tunnel3 from 213.6.6.6, to us
PIM(17): Join-list: (10.1.9.9/32, 232.42.51.8), S-bit set
PIM(17): Add Tunnel3/213.6.6.6 to (10.1.9.9, 232.42.51.8), Forward state, by PIM SG Join
PIM(17): MDT next_hop change from: 232.255.0.1 to 232.4.1.0 for (10.1.9.9, 232.42.51.8) Tunnel3

R1#show ip pim mdt send
MDT-data send list for VRF: MC
  (source, group)              MDT-data group/num   ref_count
  (10.1.9.9, 232.42.51.8)      232.4.1.0            1

CSR6 receives this update and adds the MDT data to its C-MRIB entry, binding it to the proper P(S,G) entry. Without this data MDT, the traffic would be forwarded within the default MDT. The big ‘Y’ implies this entry is joined to a data MDT, and the information is embedded within the C-MRIB entry. So far, this is all review, and nothing specific to MoFRR has occurred.

R6#debug ip pim vrf MC
PIM(1): Receive MDT Packet (11649) from 213.1.1.1 (Tunnel3), length (ip: 44, udp: 24), ttl: 1
PIM(1): TLV type: 1 length: 16 MDT Packet length: 16

R6#show ip pim mdt receive
Joined MDT-data [group/mdt number : source]  uptime/expires for VRF: MC
 [232.4.1.0 : 213.1.1.1]  00:24:13/00:02:45
 [232.4.7.0 : 213.7.7.7]  03:55:34/00:02:25

R6#show ip mroute vrf MC 232.42.51.8 10.1.9.9 | begin \(
(10.1.9.9, 232.42.51.8), 00:26:17/00:02:50, flags: sTIY
  Incoming interface: Tunnel3, RPF nbr 213.1.1.1, MDT:[213.1.1.1,232.4.1.0]/00:02:11
  Outgoing interface list:
    GigabitEthernet2.546, Forward/Sparse, 00:07:08/00:02:50


As a final verification, we will check the P(S,G) entry on CSR6 to ensure it has been created. The RPF interface points towards XRv3 as expected. The big ‘Z’ identifies this as a multicast tunnel, and the OIL includes the multicast VRF being serviced.

R6#show ip mroute 232.4.1.0 213.1.1.1 | begin \(
(213.1.1.1, 232.4.1.0), 00:28:08/stopped, flags: sTIZ
  Incoming interface: GigabitEthernet2.563, RPF nbr 213.6.13.13
  Outgoing interface list:
    MVRF MC, Forward/Sparse, 00:10:28/00:01:31

We know that this join must transit XRv3 since that is the only path CSR6 has to reach CSR1. XRv3 has three options to reach 213.1.1.1, which is the result of our IS-IS metric adjustment earlier. Based on this, XRv3 will select the PIM neighbor with the highest IP address, which is XRv4.

RP/0/0/CPU0:XRv3#show route ipv4 213.1.1.1 | include via
  Known via "isis 1", distance 115, metric 30, type level-2
    213.13.14.14, from 213.1.1.1, via GigabitEthernet0/0/0/0.534
    213.5.13.5, from 213.1.1.1, via GigabitEthernet0/0/0/0.553
    213.7.13.7, from 213.1.1.1, via GigabitEthernet0/0/0/0.573

RP/0/0/CPU0:XRv3#show pim ipv4 rpf 213.1.1.1
Table: IPv4-Unicast-default
* 213.1.1.1/32 [115/30]
    via GigabitEthernet0/0/0/0.534 with rpf neighbor 213.13.14.14

Although both CSR7 and CSR5 are viable alternative paths, they do not receive the P(S,G) join from XRv3 since they are not in the reverse path towards the root of the tree. We can confirm this by checking the P-MRIB entry on CSR5 and CSR7; this group is not found. XRv3 also makes no mention of any attempt to involve those routers in the multicast delivery path. So far, this is all expected PIM behavior.

R5#show ip mroute 232.4.1.0 213.1.1.1
Group 232.4.1.0 not found

R7#show ip mroute 232.4.1.0 213.1.1.1
Group 232.4.1.0 not found

RP/0/0/CPU0:XRv3#show pim ipv4 topology 232.4.1.0 213.1.1.1 | begin 232
(213.1.1.1,232.4.1.0)SPT SSM Up: 00:15:09
JP: Join(00:00:35) RPF: GigabitEthernet0/0/0/0.534,213.13.14.14 Flags:
  GigabitEthernet0/0/0/0.563 00:15:09 fwd Join(00:03:08)

To enable MoFRR, we must have ECMP on the router on which MoFRR will be configured. In this case, XRv3 has ECMP to 213.1.1.1, and we will configure MoFRR there. The configuration is very straightforward, and we can use an ACL to select which groups to enable for MoFRR. Since MoFRR essentially duplicates the multicast flow along alternative paths, it can be wasteful of bandwidth, and should be used sparingly for critical flows.

! XRv3
ipv4 access-list ACL_MOFRR
 10 permit ipv4 any 232.0.0.0 0.255.255.255
router pim
 address-family ipv4
  mofrr flow ACL_MOFRR

To verify this, we can look at the detailed P-MRIB output on XRv3. While the summary P-MRIB information does identify this as MoFRR capable, it doesn’t tell us anything about the alternate path. Instead, towards the bottom of this command we can see a secondary RPF neighbor identified as CSR7. This router was selected over CSR5 since it has the next highest IP address. This means that XRv3 issues a P(S,G) join to CSR7 and XRv4 at the same time.

RP/0/0/CPU0:XRv3#show pim ipv4 topology 232.4.1.0 213.1.1.1 detail | begin 232
(213.1.1.1,232.4.1.0)SPT SSM Up: 00:19:06
JP: Join(00:00:08) RPF: GigabitEthernet0/0/0/0.534,213.13.14.14 MoFRR, Flags:
Up: MT clr (00:00:00) MDT: JoinSend N, Cache N/N, Misc (0x0,0/0)
Cache: Add 00:00:00, Rem 00:00:00. MT Cnt: Set 0, Unset 0. Joins sent 0
MDT-ifh 0x0/0x0, MT Slot none/ none
RPF-redirect BW usage: 0, Flags: 0x0, ObjID: 0x0
c-multicast-routing: PIM* BGPJP: 1d10h
RPF Table: IPv4-Unicast-default
RPF Secondary: GigabitEthernet0/0/0/0.573,213.7.13.7
  GigabitEthernet0/0/0/0.563 00:19:06 fwd Join(00:03:07)
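The neighbor selection described here can be sketched in a few lines. The following Python example is a simplified model based only on the tie-break stated in this section (highest neighbor IP address is primary, next highest is the MoFRR secondary). The function name select_rpf and its mofrr parameter are hypothetical, and real platforms may apply additional criteria.

```python
import ipaddress


def select_rpf(ecmp_neighbors, mofrr=False):
    """Pick the primary (and optionally secondary) RPF neighbor among ECMP paths.

    Returns a (primary, secondary) tuple; secondary is None unless mofrr is
    enabled and at least two equal-cost neighbors exist.
    """
    # Sort numerically by address, highest first.
    ranked = sorted(ecmp_neighbors, key=ipaddress.ip_address, reverse=True)
    primary = ranked[0]
    secondary = ranked[1] if mofrr and len(ranked) > 1 else None
    return primary, secondary


if __name__ == "__main__":
    # XRv3's three equal-cost next hops towards 213.1.1.1
    neighbors = ["213.5.13.5", "213.7.13.7", "213.13.14.14"]
    print(select_rpf(neighbors))               # plain PIM: one RPF neighbor
    print(select_rpf(neighbors, mofrr=True))   # MoFRR: primary plus secondary
```

With XRv3's three next hops, the model picks XRv4 (213.13.14.14) as primary and CSR7 (213.7.13.7) as secondary, leaving CSR5 unused.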

We check the P-MRIB tables on CSR5 and CSR7 again. CSR5 still doesn’t have the entry, but CSR7 does, and nothing looks fancy about it. CSR7 is totally unaware that MoFRR is in effect. CSR7 identifies XRv4 as its RPF neighbor to CSR1, which is correct.

R5#show ip mroute 232.4.1.0 213.1.1.1
Group 232.4.1.0 not found

R7#show ip mroute 232.4.1.0 213.1.1.1 | begin \(
(213.1.1.1, 232.4.1.0), 00:36:58/00:03:28, flags: sTZ
  Incoming interface: GigabitEthernet2.574, RPF nbr 213.7.14.14
  Outgoing interface list:
    GigabitEthernet2.573, Forward/Sparse, 00:03:00/00:03:28

Because both XRv3 and CSR7 identified XRv4 as their RPF neighbor towards 213.1.1.1, that means XRv4 is a “merge point”, and a single point of failure, for this flow. More importantly, it means that XRv4 should have both links to CSR7 and XRv3 in its OIL for this P(S,G). Like CSR7, and every other router besides XRv3, XRv4 is totally unaware that MoFRR is enabled in the network. XRv4 will duplicate multicast traffic towards XRv3 and CSR7 concurrently. When XRv3 receives multiple copies, it will discard the ones from CSR7 (the alternate path) until the main path fails. In XR, failure is determined by a loss of traffic for 30 ms; after this time, switchover occurs, and the entire process takes about 50 ms on real hardware routers.

RP/0/0/CPU0:XRv4#show pim ipv4 topology 232.4.1.0 213.1.1.1 | begin 232
(213.1.1.1,232.4.1.0)SPT SSM Up: 00:23:10
JP: Join(00:00:43) RPF: GigabitEthernet0/0/0/0.514,213.1.14.1 Flags:
  GigabitEthernet0/0/0/0.534 00:23:10 fwd Join(00:02:45)
  GigabitEthernet0/0/0/0.574 00:04:44 fwd Join(00:03:06)

We begin sending traffic on CSR9 to test this. Since all PIM signaling is complete, CSR4 should be receiving this traffic. For brevity, we will verify that traffic is arriving on CSR6 within the P(S,G) entry, and being transferred to the C(S,G) group for forwarding. We can see the packet counts are identical between the two, as expected.

R9#ping 232.42.51.8 repeat 10000 time 1
R6#show ip mroute 232.4.1.0 213.1.1.1 count | begin Group
Group: 232.4.1.0, Source count: 1, Packets forwarded: 85, Packets received: 85
  Source: 213.1.1.1/32, Forwarding: 85/1/142/1, Other: 85/0/0
R6#show ip mroute vrf MC 232.42.51.8 10.1.9.9 count | begin Group
Group: 232.42.51.8, Source count: 1, Packets forwarded: 85, Packets received: 85
  Source: 10.1.9.9/32, Forwarding: 85/1/142/1, Other: 85/0/0

XRv3 has very interesting output. I include the forwarding legend for clarity since the counters are different from a normal setup. We can see that 446 packets have arrived, but only half of them (223) have been forwarded. The other half are identified as RPF failures.

RP/0/0/CPU0:XRv3#show mfib ipv4 route 232.4.1.0 213.1.1.1 | begin ^Forwarding
Forwarding/Replication Counts: Packets in/Packets out/Bytes out
Failure Counts: RPF / TTL / Empty Olist / Encap RL / Other
(213.1.1.1,232.4.1.0), Flags: MoFE MoFS
  Up: 00:29:12
  Last Used: 00:00:00
  SW Forwarding Counts: 446/223/27652
  SW Replication Counts: 446/223/27652
  SW Failure Counts: 223/0/0/0/0
  GigabitEthernet0/0/0/0.534 Flags: A, Up:00:29:12
  GigabitEthernet0/0/0/0.563 Flags: NS EG, Up:00:29:12
  GigabitEthernet0/0/0/0.573 Flags: A2, Up:00:10:46
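The arithmetic behind the 50% figure can be checked straight from the counters. The Python sketch below is only an illustration and is not part of the lab; the helper name `parse_counts` is hypothetical, and the sample text is copied from the XRv3 MFIB output shown in this section.

```python
import re


def parse_counts(output: str, label: str) -> list[int]:
    """Return the slash-separated integers that follow `label:` in the text."""
    match = re.search(rf"{re.escape(label)}:\s*([\d/]+)", output)
    if match is None:
        raise ValueError(f"{label!r} not found")
    return [int(value) for value in match.group(1).split("/")]


# Counter lines copied from the XRv3 output (packets in / packets out / bytes out).
mfib_output = """
SW Forwarding Counts: 446/223/27652
SW Replication Counts: 446/223/27652
SW Failure Counts: 223/0/0/0/0
"""

packets_in, packets_out, _bytes_out = parse_counts(mfib_output, "SW Forwarding Counts")
rpf_failures = parse_counts(mfib_output, "SW Failure Counts")[0]  # first field is RPF

dropped = packets_in - packets_out
print(f"in={packets_in} out={packets_out} dropped={dropped} rpf={rpf_failures}")
print(f"drop ratio: {dropped / packets_in:.0%}")
```

Every packet that was received but not forwarded is accounted for by the RPF failure counter, which is what we expect when the only drops are the duplicate copies arriving on the secondary (A2) interface.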


Looking at the P-MRIB, which shows slightly different information than the P-PIM topology, we see the interface towards XRv4 is identified as ‘A’ for accept. This is the RPF interface. The interface towards CSR7 is identified as ‘A2’ because this is the secondary RPF, and this is where the packets are being dropped. We can see that MoFRR is reported as “inactive”, which is expected in a stable network.

RP/0/0/CPU0:XRv3#show mrib ipv4 route 232.4.1.0 | begin 232
(213.1.1.1,232.4.1.0) RPF nbr: 213.13.14.14 Flags: RPF MoFE MoFS
  Up: 00:31:08
  MOFRR State: Inactive Sequence No 1
  Incoming Interface List
    GigabitEthernet0/0/0/0.534 Flags: A, Up: 00:31:08
    GigabitEthernet0/0/0/0.573 Flags: A2, Up: 00:12:42
  Outgoing Interface List
    GigabitEthernet0/0/0/0.563 Flags: F NS, Up: 00:31:08

We can further confirm that both XRv4 and CSR7 are forwarding equal quantities of packets to XRv3, which would explain the exact 50% drop rate. The CSR only updates its counters every 10 seconds, which explains the tiny discrepancy in the packet counts. Notice that on XRv4, the software replication process has twice as many packets going out as coming in; this proves that XRv4 is replicating packets properly.

R7#show ip mroute 232.4.1.0 213.1.1.1 count | begin Group
Group: 232.4.1.0, Source count: 1, Packets forwarded: 489, Packets received: 489
  Source: 213.1.1.1/32, Forwarding: 489/1/142/1, Other: 489/0/0
RP/0/0/CPU0:XRv4#show mfib ipv4 route 232.4.1.0 213.1.1.1 | begin 232
(213.1.1.1,232.4.1.0), Flags:
  Up: 00:33:41
  Last Used: 00:00:00
  SW Forwarding Counts: 492/492/61008
  SW Replication Counts: 492/984/122016
  SW Failure Counts: 0/0/0/0/0
  GigabitEthernet0/0/0/0.514 Flags: A, Up:00:33:41
  GigabitEthernet0/0/0/0.534 Flags: NS EG, Up:00:33:41
  GigabitEthernet0/0/0/0.574 Flags: NS EG, Up:00:15:15

To test MoFRR, we will shut down XRv4’s interface to XRv3 (not shown). MoFRR determines that the link has failed when the PIM neighbor fails (in the graceful shutdown I did, it happens immediately). XRv3 is still receiving packets from CSR7 and makes this the primary RPF interface, marked with an ‘A’ below. Because a third path exists, XRv3 signals a P(S,G) join to CSR5 to use it as the new backup path, marked ‘A2’. Also note that while MoFRR is still technically inactive, the sequence number is 2, which counts the number of MoFRR triggers per (S,G). MoFRR is only active for the time it takes PIM to reconverge its forwarding state in the network. This is similar to TE-FRR, except that TE-FRR relies on RSVP rather than PIM.

RP/0/0/CPU0:XRv3#show mrib ipv4 route 232.4.1.0 | begin 232
(213.1.1.1,232.4.1.0) RPF nbr: 213.7.13.7 Flags: RPF MoFE MoFS
  Up: 00:37:42
  MOFRR State: Inactive Sequence No 2
  Incoming Interface List
    GigabitEthernet0/0/0/0.553 Flags: A2, Up: 00:00:46
    GigabitEthernet0/0/0/0.573 Flags: A, Up: 00:19:16
  Outgoing Interface List
    GigabitEthernet0/0/0/0.563 Flags: F NS, Up: 00:37:42

We verify that CSR5 is now part of the P-PIM control plane for this multicast tunnel, and that XRv4 has added CSR5 to its OIL. The OIL now contains CSR5 and CSR7 only.

R5#show ip mroute 232.4.1.0 213.1.1.1 | begin \(
(213.1.1.1, 232.4.1.0), 00:01:11/00:03:17, flags: sTZ
  Incoming interface: GigabitEthernet2.554, RPF nbr 213.5.14.14
  Outgoing interface list:
    GigabitEthernet2.553, Forward/Sparse, 00:01:11/00:03:17
RP/0/0/CPU0:XRv4#show pim ipv4 topology 232.4.1.0 213.1.1.1 | begin 232
(213.1.1.1,232.4.1.0)SPT SSM Up: 00:41:07
JP: Join(00:00:46) RPF: GigabitEthernet0/0/0/0.514,213.1.14.1 Flags:
  GigabitEthernet0/0/0/0.554 00:04:11 fwd Join(00:03:14)
  GigabitEthernet0/0/0/0.574 00:22:41 fwd Join(00:02:50)

Finally, we ensure CSR6 is still receiving packets. We check the counters 10 seconds apart to verify this.

R6#show ip mroute vrf MC 232.42.51.8 10.1.9.9 count | begin Group
Group: 232.42.51.8, Source count: 1, Packets forwarded: 705, Packets received: 705
  Source: 10.1.9.9/32, Forwarding: 705/1/142/1, Other: 705/0/0
R6#show ip mroute vrf MC 232.42.51.8 10.1.9.9 count | begin Group
Group: 232.42.51.8, Source count: 1, Packets forwarded: 715, Packets received: 715
  Source: 10.1.9.9/32, Forwarding: 715/1/142/1, Other: 715/0/0

Before continuing, we will bring the link between XRv3 and XRv4 back. We will also stop the pings from CSR9 to reset the network to an idle state. We quickly verify that XRv3 re-establishes its proper RPF interface via XRv4, which triggers a prune to CSR5. Notice that the MoFRR sequence number is 3, which indicates another change occurred when the link came back up.

RP/0/0/CPU0:XRv3#show mrib ipv4 route 232.4.1.0 | begin 232
(213.1.1.1,232.4.1.0) RPF nbr: 213.13.14.14 Flags: RPF MoFE MoFS
  Up: 00:43:40
  MOFRR State: Inactive Sequence No 3
  Incoming Interface List
    GigabitEthernet0/0/0/0.534 Flags: A, Up: 00:00:31
    GigabitEthernet0/0/0/0.573 Flags: A2, Up: 00:25:15
  Outgoing Interface List
    GigabitEthernet0/0/0/0.563 Flags: F NS, Up: 00:43:40
R5#show ip mroute 232.4.1.0 213.1.1.1 | begin \(
(213.1.1.1, 232.4.1.0), 00:08:30/00:02:59, flags: sPTZ
  Incoming interface: GigabitEthernet2.554, RPF nbr 213.5.14.14
  Outgoing interface list: Null

To test MoFRR on XE, we will add a new link between CSR6 and CSR7, while concurrently shutting down the link between XRv3 and CSR7. This gives CSR6 two options for its original P(S,G), and still leaves XRv3 with two options as well. In order for ECMP to work from CSR6 to CSR1, the IS-IS cost of this new link must be 20. Because CSR7 and XRv3 no longer have direct reachability, we have additional path diversity. The path from CSR6 to XRv4 is totally independent of links and nodes via CSR7 and XRv3, which is ideal for HA design. The updated diagram is shown below.


The interface configurations are basic and are not shown, but we will verify the ECMP behavior as I just described. CSR6 has two paths via XRv3 and CSR7, and XRv3 has two paths via XRv4 and CSR5.

R6#show ip cef 213.1.1.1
213.1.1.1/32
  nexthop 213.6.7.7 GigabitEthernet2.567 label 7005
  nexthop 213.6.13.13 GigabitEthernet2.563 label 93005
RP/0/0/CPU0:XRv3#show cef ipv4 213.1.1.1 | include via
   via 213.13.14.14, GigabitEthernet0/0/0/0.534, 7 dependencies, weight 0, class 0 [flags 0x0]
   via 213.5.13.5, GigabitEthernet0/0/0/0.553, 10 dependencies, weight 0, class 0 [flags 0x0]

In this way, we can have MoFRR enabled both at the edge and core routers for additional redundancy. In a purely planar network, this wouldn’t be necessary, since the two planes would be “enough” redundancy for the vast majority of cases. I demonstrate it here because it is possible; there is nothing that says MoFRR is limited to a certain type or quantity of nodes. The configuration on XE is very similar to XR, except we will enable the “sticky” option. This ensures that the primary RPF doesn’t change even if a better one shows up later. This leads to better stability in the network, and unlike with XRv3, we would not prune stable links to choose a link that just came back up. This is also supported in XR, where the option is called “non-revertive”, which is a little more descriptive. XE MoFRR can only accept an extended ACL, so we will identify the specific tunnel endpoint of CSR1, along with all SSM groups for variety.

! CSR6
ip multicast rpf mofrr ACL_MOFRR sticky
ip access-list extended ACL_MOFRR
 permit ip host 213.1.1.1 232.0.0.0 0.255.255.255
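To make the ACL semantics concrete, here is a small Python sketch of how the single permit entry selects (S,G) pairs, with the ACL source field matching S and the destination field matching G. It is a simplified model for illustration only; the function names are hypothetical and it does not reproduce how the router evaluates ACLs internally.

```python
import ipaddress


def wildcard_match(address: str, base: str, wildcard: str) -> bool:
    """Cisco-style wildcard match: bits set in the wildcard are 'don't care'."""
    addr = int(ipaddress.IPv4Address(address))
    base_int = int(ipaddress.IPv4Address(base))
    care_bits = ~int(ipaddress.IPv4Address(wildcard)) & 0xFFFFFFFF
    return (addr & care_bits) == (base_int & care_bits)


def acl_mofrr_permits(source: str, group: str) -> bool:
    """Model of: permit ip host 213.1.1.1 232.0.0.0 0.255.255.255"""
    return (wildcard_match(source, "213.1.1.1", "0.0.0.0")
            and wildcard_match(group, "232.0.0.0", "0.255.255.255"))


for s, g in [("213.1.1.1", "232.4.1.0"),   # the P(S,G) used in this section
             ("213.5.5.5", "232.4.1.0"),   # different tunnel source
             ("213.1.1.1", "239.1.1.1")]:  # group outside 232.0.0.0/8
    print(f"({s}, {g}) -> {acl_mofrr_permits(s, g)}")
```

Only trees rooted at CSR1 (213.1.1.1) in the 232.0.0.0/8 SSM range are eligible for MoFRR under this ACL; anything else falls through to the implicit deny.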

To verify it, we can look at the P(S,G) entry for the multicast tunnel. The output is comparable to XR. We can see RIB-based MoFRR (different from the flow-based MoFRR described earlier) is enabled and the secondary RPF information is made very clear. CSR6 selects XRv3 as the primary RPF because XRv3 (213.6.13.13) has a higher IP address than CSR7 (213.6.7.7). Otherwise, nothing is different.

R6#show ip mroute 232.4.1.0 | begin \(
(213.1.1.1, 232.4.1.0), 01:19:06/stopped, flags: sTIZ
  Incoming interface: GigabitEthernet2.563, RPF nbr 213.6.13.13, RIB based MoFRR
  Secondary RPF interface: GigabitEthernet2.567, Secondary RPF nbr 213.6.7.7
  Outgoing interface list:
    MVRF MC, Forward/Sparse, 01:01:26/00:01:33


We verify that CSR7 received the P(S,G) join for this group and adds CSR6 to its OIL. XRv3 is still running MoFRR and has added both XRv4 and CSR5 as RPF interfaces, which means both of them should have the entry, too.

R7#show ip mroute 232.4.1.0 213.1.1.1 | begin \(
(213.1.1.1, 232.4.1.0), 01:18:54/00:03:10, flags: sTZ
  Incoming interface: GigabitEthernet2.574, RPF nbr 213.7.14.14
  Outgoing interface list:
    GigabitEthernet2.567, Forward/Sparse, 00:02:58/00:03:10
RP/0/0/CPU0:XRv3#show mrib ipv4 route 232.4.1.0 | begin 232
(213.1.1.1,232.4.1.0) RPF nbr: 213.13.14.14 Flags: RPF MoFE MoFS
  Up: 00:11:08
  MOFRR State: Inactive Sequence No 1
  Incoming Interface List
    GigabitEthernet0/0/0/0.534 Flags: A, Up: 00:11:08
    GigabitEthernet0/0/0/0.553 Flags: A2, Up: 00:11:08
  Outgoing Interface List
    GigabitEthernet0/0/0/0.563 Flags: F NS, Up: 00:11:08

We quickly verify CSR5 and XRv4 as well. CSR5’s output is very basic and not exciting, but now XRv4 will be very busy. It now has three interfaces in its OIL for this group, having received P(S,G) joins from CSR5, CSR7, and XRv3. There is only one receiver, yet the traffic is being replicated three times, which is highly inefficient.

R5#show ip mroute 232.4.1.0 213.1.1.1 | begin \(
(213.1.1.1, 232.4.1.0), 00:27:36/00:02:42, flags: sTZ
  Incoming interface: GigabitEthernet2.554, RPF nbr 213.5.14.14
  Outgoing interface list:
    GigabitEthernet2.553, Forward/Sparse, 00:11:52/00:02:42
RP/0/0/CPU0:XRv4#show pim ipv4 topology 232.4.1.0 213.1.1.1 | begin 232
(213.1.1.1,232.4.1.0)SPT SSM Up: 00:12:57
JP: Join(now) RPF: GigabitEthernet0/0/0/0.514,213.1.14.1 Flags:
  GigabitEthernet0/0/0/0.534 00:12:57 fwd Join(00:02:37)
  GigabitEthernet0/0/0/0.554 00:12:57 fwd Join(00:03:23)
  GigabitEthernet0/0/0/0.574 00:05:14 fwd Join(00:02:54)

To test this, we will send pings from CSR9 again. This time we will move from XRv4 down to CSR6. As expected, XRv4 replicates the traffic three times, as we can see by the 30 packets received and 90 transmitted. Packets are accepted from CSR1 (“A” flag) and egress out of three other interfaces (“EG” flag).

R9#ping 232.42.51.8 repeat 10000 time 1
RP/0/0/CPU0:XRv4#show mfib ipv4 route 232.4.1.0 213.1.1.1 | begin 232
(213.1.1.1,232.4.1.0), Flags:
  Up: 00:16:59
  Last Used: 00:00:00
  SW Forwarding Counts: 30/30/3720
  SW Replication Counts: 30/90/11160
  SW Failure Counts: 0/0/0/0/0
  GigabitEthernet0/0/0/0.514 Flags: A, Up:00:16:59
  GigabitEthernet0/0/0/0.534 Flags: NS EG, Up:00:16:59
  GigabitEthernet0/0/0/0.554 Flags: NS EG, Up:00:16:59
  GigabitEthernet0/0/0/0.574 Flags: NS EG, Up:00:09:15
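The relationship between the replication counter and the number of egress interfaces can be checked mechanically. The Python sketch below is illustrative only; it works on text copied from the XRv4 output in this section and assumes that each interface carrying the “EG” flag receives one copy of every packet.

```python
import re

# Lines copied from the XRv4 MFIB output for (213.1.1.1,232.4.1.0).
mfib_output = """\
SW Forwarding Counts: 30/30/3720
SW Replication Counts: 30/90/11160
GigabitEthernet0/0/0/0.514 Flags: A, Up:00:16:59
GigabitEthernet0/0/0/0.534 Flags: NS EG, Up:00:16:59
GigabitEthernet0/0/0/0.554 Flags: NS EG, Up:00:16:59
GigabitEthernet0/0/0/0.574 Flags: NS EG, Up:00:09:15
"""

match = re.search(r"SW Replication Counts: (\d+)/(\d+)/(\d+)", mfib_output)
packets_in, replicated_out, _bytes_out = (int(value) for value in match.groups())

# Interfaces flagged EG are the ones packets are replicated to.
egress_interfaces = [
    line.split()[0]
    for line in mfib_output.splitlines()
    if re.search(r"Flags:.*\bEG\b", line)
]

print(f"egress interfaces: {len(egress_interfaces)}")
print(f"replication factor: {replicated_out // packets_in}")
```

With three egress interfaces, 30 packets in become 90 packets out, which is exactly the inefficiency described above: one receiver, three copies.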

There is nothing new on XRv3, since the duplicated packets arriving from CSR5 are dropped due to RPF failure. 180 packets have arrived, but only 90 are transmitted. XRv3 still forwards traffic to CSR6, which is important because CSR6 has selected XRv3 as the primary RPF neighbor.

RP/0/0/CPU0:XRv3#show mfib ipv4 route 232.4.1.0 213.1.1.1 | begin 232
(213.1.1.1,232.4.1.0), Flags: MoFE MoFS
  Up: 00:17:59
  Last Used: 00:00:00
  SW Forwarding Counts: 180/90/11160
  SW Replication Counts: 180/90/11160
  SW Failure Counts: 90/0/0/0/0
  GigabitEthernet0/0/0/0.534 Flags: A, Up:00:17:59
  GigabitEthernet0/0/0/0.553 Flags: A2, Up:00:17:59
  GigabitEthernet0/0/0/0.563 Flags: NS EG, Up:00:17:59

CSR5 and CSR7 are very straightforward and their outputs are not shown, since they receive a single flow of traffic and replicate it once downstream. CSR6 has interesting output, since we expect to see RPF failures for this P(S,G). The forwarded packet count is higher than the RPF failure count because I did not clear the counters before this test, but clearing them isn’t necessary to see the difference.

R6#show ip mroute 232.4.1.0 213.1.1.1 count | begin Forwarding
Forwarding Counts: Pkt Count/Pkts per second/Avg Pkt Size/Kilobits per second
Other counts: Total/RPF failed/Other drops(OIF-null, rate-limit etc)
Group: 232.4.1.0, Source count: 1, Packets forwarded: 1289, Packets received: 1553
  Source: 213.1.1.1/32, Forwarding: 1289/1/142/1, Other: 1553/264/0

Using EPC, we will conduct a quick verification to see two packets arriving at approximately the same time from XRv3 and CSR7, but only one being forwarded to the customer. In this case, the first packet that arrived was from XRv3 and was immediately forwarded to the customer. The backup flow from CSR7 came a short time later and was discarded. I show the MAC addresses for CSR7 and XRv3 for confirmation.

R6#show ip arp 213.6.7.7
Protocol  Address          Age (min)  Hardware Addr   Type   Interface
Internet  213.6.7.7               25  0050.56a9.ea77  ARPA   Gig2.567
R6#show ip arp 213.6.13.13
Protocol  Address          Age (min)  Hardware Addr   Type   Interface
Internet  213.6.13.13             91  0050.56a9.ea54  ARPA   Gig2.563

R6#show monitor capture CAP buffer detailed
0 142 0.000000 213.1.1.1 -> 232.4.1.0 GRE
0000: 01005E04 01000050 56A9EA54 8100CDEB ..^....PV..T....
0010: 08004500 007C05DE 0000FD2F F86DD501 ..E..|...../.m..
0020: 0101E804 01000000 08004500 006405EB ..........E..d..
0030: 0000FE01 88710A01 0909E82A 33080800 .....q.....*3...

1 118 0.000000 10.1.9.9 -> 232.42.51.8 ICMP
0000: 01005E2A 33080050 56A9DE0D 81000DDA ..^*3..PV.......
0010: 08004500 006405EB 0000FD01 89710A01 ..E..d.......q..
0020: 0909E82A 33080800 A2360006 01DD0000 ...*3....6......
0030: 0000307A A9B6ABCD ABCDABCD ABCDABCD ..0z............

2 142 0.000992 213.1.1.1 -> 232.4.1.0 GRE
0000: 01005E04 01000050 56A9EA77 81000DEF ..^....PV..w....
0010: 08004500 007C05DE 0000FD2F F86DD501 ..E..|...../.m..
0020: 0101E804 01000000 08004500 006405EB ..........E..d..
0030: 0000FE01 88710A01 0909E82A 33080800 .....q.....*3...
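Reading the sender out of a hex dump by eye is error-prone, so the following Python sketch shows one way to do it. It is an illustration only: the function names are hypothetical, the first row of each GRE frame and the two MAC addresses are copied from the outputs in this section, and the group-to-MAC mapping is the standard IPv4 rule (01:00:5e followed by the low 23 bits of the group address).

```python
import ipaddress


def group_to_mac(group: str) -> str:
    """Map an IPv4 multicast group to its Ethernet destination MAC (hex string)."""
    low23 = int(ipaddress.IPv4Address(group)) & 0x7FFFFF
    return ((0x01005E << 24) | low23).to_bytes(6, "big").hex()


def frame_macs(first_row: str) -> tuple[str, str]:
    """Return (destination MAC, source MAC) from the first 16 bytes of a frame."""
    raw = bytes.fromhex(first_row.replace(" ", ""))
    return raw[0:6].hex(), raw[6:12].hex()


# MAC addresses from the ARP output on CSR6, with the dots removed.
neighbors = {"005056a9ea54": "XRv3", "005056a9ea77": "CSR7"}

# Row 0000 of packets 0 and 2 in the capture.
gre_frames = {
    0: "01005E04 01000050 56A9EA54 8100CDEB",
    2: "01005E04 01000050 56A9EA77 81000DEF",
}

for number, row in gre_frames.items():
    dst, src = frame_macs(row)
    assert dst == group_to_mac("232.4.1.0")  # P-group of the multicast tunnel
    print(f"packet {number}: sent by {neighbors[src]}")
```

Packet 0 carries XRv3's source MAC and packet 2 carries CSR7's, which matches the order of arrival described in the text.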

To quickly test the sticky feature, we will shut down the link to XRv3 on CSR6. CSR6 immediately switches to CSR7 as the primary RPF interface, and since there are no alternate paths, the secondary RPF does not exist.

R6#show ip mroute 232.4.1.0 213.1.1.1 | begin \(
(213.1.1.1, 232.4.1.0), 01:39:16/stopped, flags: sTIZ
  Incoming interface: GigabitEthernet2.567, RPF nbr 213.6.7.7, RIB based MoFRR
  Secondary RPF interface: Null, Secondary RPF nbr 0.0.0.0
  Outgoing interface list:
    MVRF MC, Forward/Sparse, 01:21:36/00:02:47

Of paramount importance is CSR6’s ability to continue receiving multicast traffic. Checking the packet counters, we can see the forwarded packets continue to increase with no new RPF drops. Because there isn’t a second path, there isn’t a backup flow to discard.

R6#show ip mroute 232.4.1.0 213.1.1.1 count | begin Group
Group: 232.4.1.0, Source count: 1, Packets forwarded: 1808, Packets received: 2541
  Source: 213.1.1.1/32, Forwarding: 1808/1/142/1, Other: 2541/733/0
R6#show ip mroute 232.4.1.0 213.1.1.1 count | begin Group
Group: 232.4.1.0, Source count: 1, Packets forwarded: 1818, Packets received: 2551
  Source: 213.1.1.1/32, Forwarding: 1818/1/142/1, Other: 2551/733/0

We will bring the link to XRv3 back up and see how CSR6 reacts. Ideally, CSR6 should continue using CSR7 as the primary RPF, which will reduce P-PIM control-plane churn in the network core. When ECMP is in play, it may not matter which path is chosen, so constantly changing may be undesirable. As expected, XRv3 is now the secondary RPF path. When performing an unrelated RPF lookup for 213.1.1.1, we see that XRv3 is still shown, but for this particular MoFRR P(S,G) this default RPF mechanism does not take effect, thanks to the “sticky” option.

R6#show ip mroute 232.4.1.0 213.1.1.1 | begin \(
(213.1.1.1, 232.4.1.0), 01:42:25/stopped, flags: sTIZ
  Incoming interface: GigabitEthernet2.567, RPF nbr 213.6.7.7, RIB based MoFRR
  Secondary RPF interface: GigabitEthernet2.563, Secondary RPF nbr 213.6.13.13
  Outgoing interface list:
    MVRF MC, Forward/Sparse, 01:24:45/00:02:37
R6#show ip rpf 213.1.1.1
RPF information for ? (213.1.1.1)
  RPF interface: GigabitEthernet2.563
  RPF neighbor: ? (213.6.13.13)
  RPF route/mask: 213.1.1.1/32
  RPF type: unicast (isis 213)
  Doing distance-preferred lookups across tables
  RPF topology: ipv4 multicast base, originated from ipv4 unicast base
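The difference between sticky and revertive behavior can be summarized with a toy state model. The Python sketch below is not vendor code and makes two loud assumptions: that the preferred neighbor among equal-cost paths is the one with the highest IP address, and that a surviving secondary is always promoted when the primary fails. The class name `MofrrRpfModel` and its methods are hypothetical.

```python
import ipaddress
from dataclasses import dataclass, field
from typing import Optional, Set


@dataclass
class MofrrRpfModel:
    """Toy model of primary/secondary RPF neighbor selection."""
    sticky: bool
    up: Set[str] = field(default_factory=set)
    primary: Optional[str] = None
    secondary: Optional[str] = None

    def __post_init__(self) -> None:
        self._reselect()

    def _reselect(self) -> None:
        # Assumption: highest neighbor IP address wins among ECMP candidates.
        ranked = sorted(self.up, key=ipaddress.IPv4Address, reverse=True)
        if self.primary not in self.up:
            # Primary lost (or not yet chosen): promote a live secondary first.
            if self.secondary in self.up:
                self.primary = self.secondary
            else:
                self.primary = ranked[0] if ranked else None
        elif not self.sticky:
            # Revertive: always return to the preferred neighbor.
            self.primary = ranked[0]
        backups = [n for n in ranked if n != self.primary]
        self.secondary = backups[0] if backups else None

    def link_down(self, neighbor: str) -> None:
        self.up.discard(neighbor)
        self._reselect()

    def link_up(self, neighbor: str) -> None:
        self.up.add(neighbor)
        self._reselect()


XRV3, CSR7 = "213.6.13.13", "213.6.7.7"

for sticky in (True, False):
    rpf = MofrrRpfModel(sticky=sticky, up={XRV3, CSR7})
    rpf.link_down(XRV3)
    rpf.link_up(XRV3)
    print(f"sticky={sticky}: primary={rpf.primary} secondary={rpf.secondary}")
```

With `sticky=True` the model ends with CSR7 as primary and XRv3 as secondary, which is what CSR6 shows above. With `sticky=False` it reverts to XRv3, mirroring how XRv3 itself returned to XRv4 earlier in this section.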

Checking the packet counters, CSR6 is still receiving traffic, but is now incurring RPF failures as well, this time discarding traffic from XRv3.

R6#show ip mroute 232.4.1.0 213.1.1.1 count | begin Group
Group: 232.4.1.0, Source count: 1, Packets forwarded: 2058, Packets received: 2937
  Source: 213.1.1.1/32, Forwarding: 2058/1/142/1, Other: 2937/879/0
R6#show ip mroute 232.4.1.0 213.1.1.1 count | begin Group
Group: 232.4.1.0, Source count: 1, Packets forwarded: 2068, Packets received: 2957
  Source: 213.1.1.1/32, Forwarding: 2068/1/142/1, Other: 2957/889/0
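Because the counters were never cleared, the absolute values are less useful than the change between two samples. The Python sketch below is an illustration (the helper name `parse_xe_counts` is hypothetical) that computes those deltas for the two pairs of CSR6 samples in this section: the pair taken while the XRv3 link was down, and the pair taken after it came back.

```python
import re


def parse_xe_counts(output: str) -> dict:
    """Extract forwarded, received, and RPF-failed counts from 'show ip mroute count'."""
    forwarded = int(re.search(r"Packets forwarded: (\d+)", output).group(1))
    received = int(re.search(r"Packets received: (\d+)", output).group(1))
    rpf_failed = int(re.search(r"Other: \d+/(\d+)/\d+", output).group(1))
    return {"forwarded": forwarded, "received": received, "rpf_failed": rpf_failed}


def delta(first: str, second: str) -> dict:
    a, b = parse_xe_counts(first), parse_xe_counts(second)
    return {key: b[key] - a[key] for key in a}


# Samples copied from CSR6: one path only (link to XRv3 shut down).
single_path = delta(
    "Packets forwarded: 1808, Packets received: 2541 ... Other: 2541/733/0",
    "Packets forwarded: 1818, Packets received: 2551 ... Other: 2551/733/0",
)
# Samples copied from CSR6: both paths up again, CSR7 primary, XRv3 secondary.
dual_path = delta(
    "Packets forwarded: 2058, Packets received: 2937 ... Other: 2937/879/0",
    "Packets forwarded: 2068, Packets received: 2957 ... Other: 2957/889/0",
)

print("single path:", single_path)
print("dual path:  ", dual_path)
```

With one path, every received packet is forwarded and no RPF failures accrue. With two paths, twice as many packets arrive and the extra copies all show up as RPF failures, while the forwarding rate towards the customer is unchanged.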

We can prove this further with EPC again. This time, the packets from CSR7 and XRv3 arrived at exactly the same time, but the one from CSR7 was forwarded to the customer.

R6#show monitor capture CAP buffer detailed
18 142 607.249987 213.1.1.1 -> 232.4.1.0 GRE
0000: 01005E04 01000050 56A9EA77 81000DEF ..^....PV..w....
0010: 08004500 007C083D 0000FD2F F60ED501 ..E..|.=.../....
0020: 0101E804 01000000 08004500 0064084A ..........E..d.J
0030: 0000FE01 86120A01 0909E82A 33080800 ...........*3...

19 118 607.249987 10.1.9.9 -> 232.42.51.8 ICMP
0000: 01005E2A 33080050 56A9DE0D 81000DDA ..^*3..PV.......
0010: 08004500 0064084A 0000FD01 87120A01 ..E..d.J........
0020: 0909E82A 33080800 5BBC0006 043C0000 ...*3...[....
0030: 00003083 EDC8ABCD ABCDABCD

[...] 232.4.1.0 GRE
0000: [...] 8100CDEB ..^....PV..T....
0010: [...] F60ED501 ..E..|.=.../....
0020: [...] 0064084A ..........E..d.J
0030: [...] 33080800 ...........*3...

Additional Reading – Reference configurations “mvpn-mofrr”

29.3 Protecting mLDP LSPs with Fast Re-Route (FRR)

This section details how to provide FRR for mLDP traffic. For testing, we will use the same alternative MVPN network used in the MoFRR section, which adds an additional link between CSR6 and CSR7 for redundancy. The base configuration is MVPN profile 1, which uses a statically configured MP2MP default tree with PIM C-multicast signaling.


Since mLDP does not have any inherent ability to be protected by TE (unlike P2MP RSVP-TE), it relies on existing TE-FRR mechanisms. Specifically, we can provide link protection only for mLDP LSM traffic in the network core using a combination of existing technologies. As examined in the TE section, we can use NHOP tunnels for link protection coupled with primary one-hop tunnels to reach each next-hop. The primary tunnels use auto-route announce and FRR, and the NHOP backup tunnels can protect these tunnels. mLDP can be configured to forward recursively over these primary one-hop TE tunnels, which are themselves protected by NHOP backups. This protects every link in the topology if configured everywhere. This can be done manually or automatically, but the combination of auto-primary and auto-backup tunnels suits this purpose perfectly. Those topics are covered in detail in the TE section, so we configure them quickly on all LSRs in the topology.

Note: The feature does not appear to work well on XR. There is no obvious way to have auto-backups protect auto-primaries that request FRR, nor is there a way to force mLDP traffic into a TE tunnel. We will limit the test to CSR6 and CSR7 instead.

! All XE LSRs
mpls mldp path traffic-eng
mpls ldp discovery targeted-hello accept
mpls traffic-eng auto-tunnel backup nhop-only
mpls traffic-eng auto-tunnel backup tunnel-num min 5000 max 5999
mpls traffic-eng auto-tunnel primary onehop
mpls traffic-eng auto-tunnel primary config mpls ip
mpls traffic-eng auto-tunnel primary tunnel-num min 6000 max 6999
! All XR LSRs
mpls ldp
 address-family ipv4
  discovery targeted-hello accept

The configuration isn’t exactly straightforward. Enabling auto-backup and auto-primary tunnels is basic, with the exception that because these primary tunnels are one-hop, not PE-PE, we must enable LDP on them as well. This stitches the LSP together end to end. All routers are able to accept targeted sessions to support this, even the XR core routers. I explicitly used different number ranges for the backup and primary tunnels to make verification easier. The “magic” command is “mpls mldp path traffic-eng” on XE, which allows the mLDP LSPs to use TE tunnels. This command also assumes the tunnels are LDP-enabled with autoroute announce configured on them, which is fitting for auto-primary tunnels. A quick check on CSR6 shows us that the tunnels are up; there are two primary tunnels, one each to XRv3 and CSR7, and two backup tunnels that route around those protected links.

R6#show mpls traffic-eng tunnels summary | begin auto
  auto-tunnel:
    backup Enabled (2 ), id-range:5000-5999
    onehop Enabled (2 ), id-range:6000-6999
    mesh Disabled (0 ), id-range:64336-65335

To verify this further, we can look at the RSVP RESV messages to see the tunnels originating on CSR6. We can clearly see the direction each tunnel is going. The backup tunnels use tunnel IDs 5000 and 5001, and the primary tunnels use 6000 and 6001.

R6#show ip rsvp reservation filter session-type 7 sender 213.6.6.6
Destination   Tun Sender  TunID LSPID Next Hop     I/F     Fi Serv BPS
213.7.7.7     213.6.6.6   5000  1     213.6.13.13  Gi2.563 SE LOAD 0
213.7.7.7     213.6.6.6   6001  8053  213.6.7.7    Gi2.567 SE LOAD 0
213.13.13.13  213.6.6.6   5001  1     213.6.7.7    Gi2.567 SE LOAD 0
213.13.13.13  213.6.6.6   6000  928   213.6.13.13  Gi2.563 SE LOAD 0

We can determine which primary tunnel is protected by which backup tunnel by checking the FRR database. Unlike previous examples, the FRR database on the PLR has no in-label since the PLR is the tunnel head. Tunnel6000 goes to XRv3 and is backed up by Tunnel5001, which also terminates there. We can tell this is an NHOP tunnel as a result; we confirm this by looking at the RSVP RESV messages above to see the same tunnel tail-end. The same is true about the relationship between Tunnel6001 and Tunnel5000.

R6#show mpls traffic-eng fast-reroute database
P2P Headend FRR information:
Protected tunnel               In-label Out intf/label   FRR intf/label   Status
---------------------------    -------- --------------   --------------   ------
Tunnel6000                     Tun hd   Gi2.563:implicit Tu5001:implicit  ready
Tunnel6001                     Tun hd   Gi2.567:implicit Tu5000:implicit- ready

Now that we have verified FRR was enabled on the primary tunnels, we can verify that autoroute announce was as well. We can see each tunnel with autoroute announce enabled with its specified tailend.

R6#show mpls traffic-eng autoroute
MPLS TE autorouting enabled
  destination 0000.0000.0007.00, area isis level-2, has 1 tunnels
    Tunnel6001 (load balancing metric 0, nexthop 213.7.7.7)
               (flags: Announce)
  destination 0000.0000.0013.00, area isis level-2, has 1 tunnels
    Tunnel6000 (load balancing metric 0, nexthop 213.13.13.13)
               (flags: Announce)

Next, we will verify LDP is enabled on the tunnel and has established targeted sessions. This will allow the LDP label exchange to occur between the head and tail ends, which is important for tunnel stitching and described in the TE section. The reason CSR7’s entry says active/passive is because CSR7 has also established a session to CSR6, but XRv3 has not established one to CSR6. The “xmit/recv” is what really matters for label exchange.

R6#show mpls ldp discovery | begin Targeted Hello
    Targeted Hellos:
        213.6.6.6 -> 213.13.13.13 (ldp): active, xmit/recv
            LDP Id: 213.13.13.13:0
        213.6.6.6 -> 213.7.7.7 (ldp): active/passive, xmit/recv
            LDP Id: 213.7.7.7:0

This implies CSR6 can learn the labels towards any node beyond these primary tunnels, such as XRv4. We confirm this by checking the LDP bindings and ultimately, the FIB. These labels are stacked before the TE label (which is implicit-null on the primary tunnels) when sending traffic outbound. This is just a basic unicast check and isn’t terribly relevant to mLDP.

R6#show mpls ldp bindings 213.14.14.14 32
  lib entry: 213.14.14.14/32, rev 15
        local binding:  label: 6002
        remote binding: lsr: 213.13.13.13:0, label: 93000
        remote binding: lsr: 213.7.7.7:0, label: 7000
R6#show ip cef 213.14.14.14
213.14.14.14/32
  nexthop 213.7.7.7 Tunnel6001 label 7000
  nexthop 213.13.13.13 Tunnel6000 label 93000

Of greater importance is mLDP. We verify that CSR6 has joined the MP2MP default MDT, which is the result of having sent a label mapping message upstream. CSR6 informs XRv3 that it should receive traffic using label 6012, and XRv3 responds with an upstream mapping of 93009, which is used to send traffic towards the root. The next-hop interface is a TE tunnel, which is not possible unless the router is explicitly told to consider TE tunnels as paths for mLDP (configured earlier).

R6#show mpls mldp database opaque_type mdt 213:213 0
LSM ID : 1 (RNR LSM ID: 2)   Type: MP2MP   Uptime : 01:20:04
  FEC Root           : 213.14.14.14
  Opaque decoded     : [mdt 213:213 0]
  Opaque length      : 11 bytes
  Opaque value       : 02 000B 0002130000021300000000
  RNR active LSP     : (this entry)
  Upstream client(s) :
    213.13.13.13:0    [Active]
      Expires        : Never         Path Set ID  : 1
      Out Label (U)  : 93009         Interface    : Tunnel6000*
      Local Label (D): 6012          Next Hop     : 213.13.13.13
  Replication client(s):
    MDT  (VRF MC)
      Uptime         : 01:20:04      Path Set ID  : 2
      Interface      : Lspvif0
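The “Opaque value” string can be decoded by hand to confirm that it really encodes “mdt 213:213 0”. The Python sketch below is a minimal illustration and assumes the layout implied by the output itself: a 1-byte type, a 2-byte length, then an 11-byte value made up of a 3-byte VPN-ID OUI, a 4-byte VPN-ID index, and a 4-byte MDT number. The function name is hypothetical.

```python
def decode_mdt_opaque(text: str) -> dict:
    """Decode an mLDP MDT opaque value as printed by 'show mpls mldp database'."""
    raw = bytes.fromhex(text.replace(" ", ""))
    opaque_type = raw[0]
    length = int.from_bytes(raw[1:3], "big")
    value = raw[3:3 + length]
    if len(value) != length or length != 11:
        raise ValueError("unexpected opaque value length")
    # Assumed field layout: OUI (3 bytes), VPN index (4 bytes), MDT number (4 bytes).
    oui = int.from_bytes(value[0:3], "big")
    index = int.from_bytes(value[3:7], "big")
    mdt_number = int.from_bytes(value[7:11], "big")
    return {
        "type": opaque_type,
        "length": length,
        "vpn_id": f"{oui:x}:{index:x}",
        "mdt_number": mdt_number,
    }


decoded = decode_mdt_opaque("02 000B 0002130000021300000000")
print(decoded)
```

The decoded VPN ID and MDT number agree with the “Opaque decoded” and “Opaque length” fields in the CSR6 output.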

A quick OAM check verifies that all of the other XE PE routers have configured this properly as well. XRv2 is not running TE with mLDP but can still communicate; we also check the PIM neighbors inside the VPN to prove that the C-PIM signaling works.

R6#ping mpls mldp mp2mp 213.14.14.14 mdt 213:213 0
mp2mp Root node addr 213.14.14.14
Opaque type MDT, oui:index 0x213:0213, mdtnum 0
Sending 1, 72-byte MPLS Echos to Target FEC Stack TLV descriptor,
     timeout is 2.2 seconds, send interval is 0 msec, jitter value is 200 msec:
[snip]
Request #1
! reply addr 213.8.14.8
! reply addr 213.5.14.5
! reply addr 213.12.13.12
! reply addr 213.1.14.1
! reply addr 213.7.14.7
Round-trip min/avg/max = 48/411/1676 ms
Received 5 replies
R6#show ip pim vrf MC neighbor | begin ^Neighbor
Neighbor          Interface    Uptime/Expires     Ver   DR
Address                                                 Prio/Mode
213.7.7.7         Lspvif0      00:21:09/00:01:18  v2    1 / S P G
213.5.5.5         Lspvif0      00:21:09/00:01:44  v2    1 / S P G
213.1.1.1         Lspvif0      00:21:09/00:01:15  v2    1 / S P G
213.8.8.8         Lspvif0      00:21:09/00:01:43  v2    1 / S P G
213.12.12.12      Lspvif0      01:14:37/00:01:38  v2    1 / DR P G

At this point, all links are protected by NHOP tunnels using TE-FRR (assuming we configure auto-primary and auto-backup everywhere, which we didn’t). Actually verifying this feature in the data-plane is difficult in this compressed topology, so it is skipped.

Additional Reading – Reference configurations “mvpn-mldp-frr”

29.4 MVPN Extranet

Extranets are defined as networks that have been extended outside of an intranet to provide reachability to external organizations. This could be a business partnership of sorts. According to senior engineers in Cisco, extranets are much more common in enterprise MPLS networks than in SP ones, but the concepts are the same in all environments. Building extranets for unicast connectivity with MPLS L3VPN is simple and is controlled entirely by the RT import/export policies. There are some additional considerations for multicast traffic. Specifically, an MVPN extranet exists when sources are in one VPN and receivers are in another VPN. Some Internet-Drafts (draft-rosen-l3vpn-mvpn-extranet-03) regarding MVPN extranet refer to these as “red” (sources) and “blue” (receivers) VPNs for simplicity. Those terms are used here also. The network diagram is shown below and is very similar to the previous MVPN examples except that the VPNs have been split in half.


The top half of the network (XRv2, CSR5, and CSR1) is in VRF “N”, meaning North. The bottom half (CSR6, CSR7, CSR8) is in VRF “S” for South. XRv1, CSR9, CSR4, and CSR3 are all requesting the same IPv4 and IPv6 SSM groups with sources spread across both VRFs. This configuration is described in detail later. The groups in use are as follows:

IPv4 SSM with North source: (10.5.10.10, 232.0.0.7)
IPv4 SSM with South source: (10.2.7.2, 232.0.0.5)
IPv6 SSM with North source: (2001:10:5:10::10, FF33::2)
IPv6 SSM with South source: (2001:10:2:7::2, FF33::1)

29.4.1 PIM/GRE

Multicast extranets are achievable using the classic Draft-Rosen family of MDTs. However, newer MVPN profiles can take advantage of BGP-AD to simplify the extranet mechanisms, so this would be similar to profile 3. The draft specified earlier uses the terms “red” and “blue” to specify the VPNs in which the sources and receivers lie, respectively. For our first test, CSR10 is the source and CSR6/CSR8 are the extranet receivers for (10.5.10.10, 232.0.0.7). Therefore, VRF N is the “red” VPN and VRF S is the “blue” VPN for this test. The colors of red and blue will swap as we test connectivity in multiple directions for multiple protocols, which is why the VRF names are not tied to colors. Before actually doing any extranet connectivity, we will verify that intranet multicast works within VRF N. XRv1 and CSR9 are both requesting traffic for the aforementioned flow, so they should be able to receive intranet multicast. We won’t verify every component of MVPN profile 3 again, but we will begin by checking the C-MRIB on all PEs. They all claim that a data MDT is in use, which means there should be an associated BGP Type-3 S-PMSI route for the C(S,G). The XE boxes show the ‘Y’ and ‘y’ flags to indicate reception from and transmission to a data MDT, respectively. We can also see the PMSI to C(S,G) binding.

R5#show ip mroute vrf N 232.0.0.7 10.5.10.10 verbose | begin \(
(10.5.10.10, 232.0.0.7), 00:32:34/00:02:37, flags: sTyp
  Incoming interface: GigabitEthernet2.550, RPF nbr 10.5.10.10
  Outgoing interface list:
    Tunnel1, GRE MDT: 232.4.5.0 (data), Forward/Sparse, 00:43:52/00:02:37, p
R1#show ip mroute vrf N 232.0.0.7 10.5.10.10 | begin \(
(10.5.10.10, 232.0.0.7), 13:55:34/00:02:50, flags: sTIY
  Incoming interface: Tunnel0, RPF nbr 213.5.5.5, MDT:[213.5.5.5,232.4.5.0]/never
  Outgoing interface list:
    GigabitEthernet2.519, Forward/Sparse, 13:55:34/00:02:50
RP/0/0/CPU0:XRv2#show pim vrf N topology 232.0.0.7 10.5.10.10 | begin 232
(10.5.10.10,232.0.0.7)SPT SSM Up: 14:01:43
JP: Join(00:00:24) RPF: mdtN,213.5.5.5 Flags:
  GigabitEthernet0/0/0/0.512 14:01:43 fwd LI LH

Checking the MDT caches on each router, we can see that CSR5 advertises the data MDT (implicit from the BGP-AD routes), and the egress PEs install it.

R5#show ip pim vrf N mdt send
MDT-data send list for VRF: N
  (source, group)                     MDT-data group/num   ref_count
  (10.5.10.10, 232.0.0.7)             232.4.5.0            1
R1#show ip pim vrf N mdt receive
Joined MDT-data [group/mdt number : source]  uptime/expires for VRF: N
 [232.4.5.0 : 213.5.5.5]  00:47:23/stopped
RP/0/0/CPU0:XRv2#show pim vrf N mdt cache
Core Source      Cust (Source, Group)          Core Data      Expires
213.5.5.5        (10.5.10.10, 232.0.0.7)       232.4.5.0      never

Checking the BGP MVPN information, we see that CSR5 has originated the Type-3 route to describe the availability of an S-PMSI. As long as the other PEs are importing RT:213:5, they will see this S-PMSI availability. Tunnel type 3 means this is a PIM-SSM tree using GRE, the root is 213.5.5.5, and the core group is 232.4.5.0. This is how others know about the data MDT when BGP-AD is used.

R5#show bgp ipv4 mvpn vrf N route-type 3 10.5.10.10 232.0.0.7 213.5.5.5
BGP routing table entry for [3][213:5][10.5.10.10][232.0.0.7][213.5.5.5]/22, version 189
Paths: (1 available, best #1, table MVPNv4-BGP-Table, not advertised to EBGP peer)
  Advertised to update-groups:
     6
  Refresh Epoch 1
  Local
    0.0.0.0 from 0.0.0.0 (213.5.5.5)
      Origin incomplete, localpref 100, weight 32768, valid, sourced, local, best
      Community: no-export
      Extended Community: RT:213:5
      PMSI Attribute: Flags: 0x0, Tunnel type: 3, length 8, label: exp-null, tunnel parameters: D505 0505 E804 0500
      rx pathid: 0, tx pathid: 0x0
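The tunnel parameters in the PMSI attribute are easy to decode by hand, but a short script removes any doubt. This is a minimal Python sketch, assuming that the 8-byte value for tunnel type 3 is the 4-byte P-tunnel root address followed by the 4-byte P-group, which is what the output above suggests. The function name is illustrative and not part of any Cisco tool.

```python
import ipaddress


def decode_pim_ssm_tunnel_id(hex_words: str) -> tuple[str, str]:
    """Split an 8-byte PIM-SSM tunnel identifier into (root, group)."""
    raw = bytes.fromhex(hex_words.replace(" ", ""))
    if len(raw) != 8:
        raise ValueError("expected 8 bytes: 4-byte root + 4-byte group")
    root = ipaddress.IPv4Address(raw[:4])   # first 4 bytes: P-tunnel root
    group = ipaddress.IPv4Address(raw[4:])  # last 4 bytes: core (P) group
    return str(root), str(group)


# Value copied from the Type-3 route originated by CSR5
print(decode_pim_ssm_tunnel_id("D505 0505 E804 0500"))
```

The same function can be pointed at any other PMSI attribute in this section, such as the I-PMSI routes examined later, to read off the default MDT group.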

When CSR10 starts sending traffic, we can see the packet counters increasing on all three PEs, indicating that traffic is being sent. The XRv only shows packets out, but it is working.

R5#show ip mroute vrf N 232.0.0.7 10.5.10.10 count | begin ^Group
Group: 232.0.0.7, Source count: 1, Packets forwarded: 187, Packets received: 187
  Source: 10.5.10.10/32, Forwarding: 187/1/118/0, Other: 187/0/0

R1#show ip mroute vrf N 232.0.0.7 10.5.10.10 count | begin ^Group
Group: 232.0.0.7, Source count: 1, Packets forwarded: 199, Packets received: 199
  Source: 10.5.10.10/32, Forwarding: 199/1/142/1, Other: 199/0/0

RP/0/0/CPU0:XRv2#show mfib vrf N route 232.0.0.7 10.5.10.10 | begin 232
(10.5.10.10,232.0.0.7), Flags:
  Up: 14:10:10
  Last Used: never
  SW Forwarding Counts: 0/228/22800
  SW Replication Counts: 0/228/22800
  SW Failure Counts: 0/0/0/0/0
  mdtN Flags: A MI, Up:00:53:21
  GigabitEthernet0/0/0/0.512 Flags: NS EG, Up:14:10:10


To implement the “red method” as defined by the draft, the blue PEs must import the RTs for the red multicast sources and BGP-AD routes. The draft specifically states you must use a new RT for this, called the “violet” RT. This is a good practice because only a few VPN routes behind a red PE would be multicast sources, so it makes sense not to have all the blue PEs with receivers import all red RTs. On CSR5, we will solve this in two steps. First, an export-map is applied to CSR5 for IPv4 that matches 10.5.10.0/24 and adds RT:213:500. CSR10’s loopback, 10.10.10.10/32, is not a multicast source and only has the basic blue RT:213:5.

! CSR5
ip prefix-list PL_EXTRANET_SOURCES seq 5 permit 10.5.10.0/24
route-map RM_EXTRANET_EXPORT permit 10
 match ip address prefix-list PL_EXTRANET_SOURCES
 set extcommunity rt 213:500 additive
vrf definition N
 address-family ipv4
  export map RM_EXTRANET_EXPORT

Assuming CSR8 imports this RT (configuration is applied for VRF S but not shown), we will have one-way unicast connectivity from CSR8 to CSR5. This is OK because we only need this routing information for RPF; we are not attempting to establish bidirectional unicast connectivity.

R8#show bgp vpnv4 unicast vrf S 10.5.10.0/24
BGP routing table entry for 213:8:10.5.10.0/24, version 136
Paths: (1 available, best #1, table S)
  Not advertised to any peer
  Refresh Epoch 1
  Local, imported path from 213:5:10.5.10.0/24 (global)
    213.5.5.5 (metric 20) (via default) from 213.12.12.12 (213.12.12.12)
      Origin incomplete, metric 0, localpref 100, valid, internal, best
      Extended Community: RT:213:5 RT:213:500
      Originator: 213.5.5.5, Cluster list: 213.12.12.12
      Connector Attribute: count=1
       type 1 len 12 value 213:5:213.5.5.5
      mpls labels in/out nolabel/5002
      rx pathid: 0, tx pathid: 0x0

The problem is that the BGP-AD routes did not get this “violet” RT applied, so CSR8 does not know to import them. We can check XRv2, the RR, to confirm that the route only carries the blue RT:213:5.

RP/0/0/CPU0:XRv2#show bgp ipv4 mvpn rd 213:5 [1][213.5.5.5]/40 | include Exten
      Extended community: RT:213:5

R8#show bgp ipv4 mvpn vrf S route-type 1 213.5.5.5
% Network not in table

CSR8 still has a semi-valid C(S,G) entry, but there is no big ‘Y’ to indicate reception from S-PMSI. This is going to lead to problems since the entry has a valid RPF interface but is not using the proper delivery mechanism. CSR8 does not report seeing any multicast traffic arriving for this C(S,G). Its MDT cache only has an entry for the south-intranet VPN flow.

R8#show ip mroute vrf S 232.0.0.7 10.5.10.10 count | begin ^Group
Group: 232.0.0.7, Source count: 1, Packets forwarded: 0, Packets received: 0
  Source: 10.5.10.10/32, Forwarding: 0/0/0/0, Other: 0/0/0

R8#show ip pim vrf S mdt receive
Joined MDT-data [group/mdt number : source]  uptime/expires for VRF: S
 [232.4.7.0 : 213.7.7.7]  14:27:57/stopped

To fix this, we configure CSR5 to add the violet RT to its outgoing IPv4 MVPN routes. We can be indiscriminate with the route-map: any extranet receiver that wants multicast from a source behind CSR5 must import the violet RT anyway, so the route-map need not be route/prefix specific.

! CSR5
route-map RM_MVPN_EXTRANET_RT permit 10
 set extcommunity rt 213:500 additive
router bgp 213
 address-family ipv4 mvpn
  neighbor 213.12.12.12 route-map RM_MVPN_EXTRANET_RT out
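The import decision that caused the failure can be modeled in a few lines. This is a rough Python sketch, assuming only the basic rule that a VRF imports a route when the route's RT set shares at least one RT with the VRF's import list. The variable names are illustrative, and CSR8's import list shows only the violet RT because its other import RTs are not shown in this section.

```python
def is_imported(route_rts: set[str], vrf_import_rts: set[str]) -> bool:
    """A route is imported when it carries at least one RT the VRF imports."""
    return bool(route_rts & vrf_import_rts)


# Assumption: only the violet RT is listed for CSR8's VRF S import policy
csr8_vrf_s_imports = {"213:500"}

unicast_source_route = {"213:5", "213:500"}  # 10.5.10.0/24 after the export-map
ad_route_before_fix = {"213:5"}              # BGP-AD route without the violet RT
ad_route_after_fix = {"213:5", "213:500"}    # BGP-AD route after the outbound route-map

for name, rts in [
    ("unicast source", unicast_source_route),
    ("BGP-AD before fix", ad_route_before_fix),
    ("BGP-AD after fix", ad_route_after_fix),
]:
    print(f"{name}: imported={is_imported(rts, csr8_vrf_s_imports)}")
```

The middle case is the broken state: the RPF route is present, but the BGP-AD routes that describe the P-tunnel are not.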

Next, we verify that CSR8 imports the BGP-AD routes from CSR5 and updates its C-MRIB entry via the MDT cache. Checking the packet counters, we can see traffic is flowing successfully.

R8#show bgp ipv4 mvpn vrf S route-type 3 10.5.10.10 232.0.0.7 213.5.5.5
BGP routing table entry for [3][213:8][10.5.10.10][232.0.0.7][213.5.5.5]/22, version 195
Paths: (1 available, best #1, table MVPNv4-BGP-Table, not advertised to EBGP peer)
  Not advertised to any peer
  Refresh Epoch 1
  Local, imported path from [3][213:5][10.5.10.10][232.0.0.7][213.5.5.5]/22 (global)
    213.5.5.5 (metric 20) from 213.12.12.12 (213.12.12.12)
      Origin incomplete, metric 0, localpref 100, valid, internal, best
      Community: no-export
      Extended Community: RT:213:5 RT:213:500
      Originator: 213.5.5.5, Cluster list: 213.12.12.12
      PMSI Attribute: Flags: 0x0, Tunnel type: 3, length 8, label: exp-null, tunnel parameters: D505 0505 E804 0500
      rx pathid: 0, tx pathid: 0x0

R8#show ip mroute vrf S 232.0.0.7 10.5.10.10 | begin \(
(10.5.10.10, 232.0.0.7), 01:20:02/00:02:20, flags: sTIY
  Incoming interface: Tunnel1, RPF nbr 213.5.5.5, MDT:[213.5.5.5,232.4.5.0]/never
  Outgoing interface list:
    GigabitEthernet2.538, Forward/Sparse, 01:20:02/00:02:20

R8#show ip pim vrf S mdt receive
Joined MDT-data [group/mdt number : source]  uptime/expires for VRF: S
 [232.4.5.0 : 213.5.5.5]  00:01:06/stopped
 [232.4.7.0 : 213.7.7.7]  14:30:41/stopped

R8#show ip mroute vrf S 232.0.0.7 10.5.10.10 count | begin ^Group
Group: 232.0.0.7, Source count: 1, Packets forwarded: 67, Packets received: 67
  Source: 10.5.10.10/32, Forwarding: 67/1/142/1, Other: 67/0/0

As always, a quick EPC alleviates any doubt we have about forwarding. CSR8 shows GRE multicast arriving from CSR5’s MDT source towards the data MDT group 232.4.5.0. CSR8 decapsulates it and forwards the ICMP multicast for (10.5.10.10, 232.0.0.7) towards the customer. The difference in size is exactly 24 bytes (the 20-byte outer IPv4 header plus the 4-byte GRE header) and the timestamps are the same.

R8#show monitor capture CAP buffer detailed
 2  142  1.002991  213.5.5.5 -> 232.4.5.0  GRE
  0000: 01005E04 05000050 56A9862A 8100CE00   ..^....PV..*....
  0010: 08004500 007C02D1 0000FE2F F272D505   ..E..|...../.r..
  0020: 0505E804 05000000 08004500 006402DF   ..........E..d..
  0030: 0000FE01 BDA30A05 0A0AE800 00070800   ................
 3  118  1.002991  10.5.10.10 -> 232.0.0.7  ICMP
  0000: 01005E00 00070050 56A9FB1C 81000DD2   ..^....PV.......
  0010: 08004500 006402DF 0000FD01 BEA30A05   ..E..d..........
  0020: 0A0AE800 00070800 3FBA000A 01C40000   ........?.......
  0030: 00003723 059FABCD ABCDABCD ABCDABCD   ..7#............
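The 24-byte difference between the two captured frames can be confirmed from the hex itself. The following Python sketch parses the first 64 bytes of the GRE packet captured on CSR8, assuming an Ethernet header with a single 802.1Q tag (18 bytes) in front of the outer IPv4 header, which is what the 8100 TPID in the dump indicates. The helper names are illustrative.

```python
import ipaddress
import struct

# First 64 bytes of the 142-byte GRE frame captured on CSR8
CAPTURE_HEX = (
    "01005E04 05000050 56A9862A 8100CE00 "
    "08004500 007C02D1 0000FE2F F272D505 "
    "0505E804 05000000 08004500 006402DF "
    "0000FE01 BDA30A05 0A0AE800 00070800"
)
ETH_VLAN_LEN = 18  # 14-byte Ethernet header + 4-byte 802.1Q tag


def parse_ipv4(buf: bytes) -> dict:
    """Extract the few IPv4 header fields needed for this check."""
    ver_ihl, _tos, total_len = struct.unpack("!BBH", buf[:4])
    return {
        "header_len": (ver_ihl & 0x0F) * 4,
        "total_len": total_len,
        "protocol": buf[9],
        "src": str(ipaddress.IPv4Address(buf[12:16])),
        "dst": str(ipaddress.IPv4Address(buf[16:20])),
    }


frame = bytes.fromhex(CAPTURE_HEX.replace(" ", ""))
outer = parse_ipv4(frame[ETH_VLAN_LEN:])
gre_start = ETH_VLAN_LEN + outer["header_len"]
gre_flags, gre_proto = struct.unpack("!HH", frame[gre_start:gre_start + 4])
inner = parse_ipv4(frame[gre_start + 4:])
overhead = outer["total_len"] - inner["total_len"]

print(outer, hex(gre_proto), inner, overhead)
```

The outer total length (124) plus the 18-byte Layer 2 header gives the 142-byte frame, and the inner total length (100) plus 18 gives the 118-byte frame sent to the customer.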

Let’s try to bend the rules by importing the non-violet RT on CSR6. That is, we will import RT:213:5 and not the violet target of RT:213:500. This will give CSR6 reachability to 10.10.10.10/32, which isn’t really relevant, but it’s sloppy. The draft clearly states: “If the route matching S is a blue route (i.e., carries the blue RT but not the violet RT), then a Join is sent over the blue default PMSI. However, if the route matching S is a violet route (i.e., carries the violet RT), a Join is sent over the red default PMSI.” This would imply that an RPF route with only RT:213:5 would have its PIM join sent via the blue VPN. The key phrase is “default PMSI” (the default MDT), which we are not using yet. We will test I-PMSI later.

Once CSR6 imports RT:213:5, everything just works. This RT captures both the unicast route to the source, 10.5.10.0/24, and the BGP-AD S-PMSI route for the C(S,G) in question. All of the same verifications we did for CSR8 are shown below on CSR6 to confirm proper operation. This mechanism would not require the route-maps on CSR5 to add RT:213:500 to multiple prefixes.

R6#show bgp ipv4 mvpn vrf S route-type 3 10.5.10.10 232.0.0.7 213.5.5.5
BGP routing table entry for [3][213:6][10.5.10.10][232.0.0.7][213.5.5.5]/22, version 199
Paths: (1 available, best #1, table MVPNv4-BGP-Table, not advertised to EBGP peer)
  Not advertised to any peer
  Refresh Epoch 1
  Local, imported path from [3][213:5][10.5.10.10][232.0.0.7][213.5.5.5]/22 (global)
    213.5.5.5 (metric 20) from 213.12.12.12 (213.12.12.12)
      Origin incomplete, metric 0, localpref 100, valid, internal, best
      Community: no-export
      Extended Community: RT:213:5 RT:213:500
      Originator: 213.5.5.5, Cluster list: 213.12.12.12
      PMSI Attribute: Flags: 0x0, Tunnel type: 3, length 8, label: exp-null, tunnel parameters: D505 0505 E804 0500
      rx pathid: 0, tx pathid: 0x0

R6#show ip mroute vrf S 232.0.0.7 10.5.10.10 | begin \(
(10.5.10.10, 232.0.0.7), 01:29:14/00:02:59, flags: sTIY
  Incoming interface: Tunnel0, RPF nbr 213.5.5.5, MDT:[213.5.5.5,232.4.5.0]/never
  Outgoing interface list:
    GigabitEthernet2.546, Forward/Sparse, 01:29:14/00:02:59

R6#show ip pim vrf S mdt receive
Joined MDT-data [group/mdt number : source]  uptime/expires for VRF: S
 [232.4.5.0 : 213.5.5.5]  01:07:24/stopped
 [232.4.7.0 : 213.7.7.7]  14:40:22/stopped

R6#show ip mroute vrf S 232.0.0.7 10.5.10.10 count | begin ^Group
Group: 232.0.0.7, Source count: 1, Packets forwarded: 939, Packets received: 939
  Source: 10.5.10.10/32, Forwarding: 939/1/142/1, Other: 939/0/0

R6#show monitor capture CAP buffer detailed
 0  142  0.000000  213.5.5.5 -> 232.4.5.0  GRE
  0000: 01005E04 05000050 56A9EA54 8100CDEB   ..^....PV..T....
  0010: 08004500 007C049A 0000FE2F F0A9D505   ..E..|...../....
  0020: 0505E804 05000000 08004500 006404A8   ..........E..d..
  0030: 0000FE01 BBDA0A05 0A0AE800 00070800   ................
 1  118  0.000000  10.5.10.10 -> 232.0.0.7  ICMP
  0000: 01005E00 00070050 56A9DE0D 81000DDA   ..^....PV.......
  0010: 08004500 006404A8 0000FD01 BCDA0A05   ..E..d..........
  0020: 0A0AE800 00070800 42FC000A 038D0000   ........B.......
  0030: 0000372A 008DABCD ABCDABCD ABCDABCD   ..7*............

To add complexity, we will remove this C(S,G) from being a candidate for data MDT optimization. We do this by removing it from the IPv4 ACL on CSR5 so that it can no longer originate the Type-3 route. Because our ACL is only one line long, removing that one line turns the ACL into a “permit ip any any”. Instead, I will put a “deny ip any any” at the top. The BGP debugs clearly show this S-PMSI advertisement being withdrawn.

R5#show access-lists ACL_DATA_MDT
Extended IP access list ACL_DATA_MDT
    5 deny ip any any (3 matches)
    10 permit ip any host 232.0.0.7

! CSR5
BGP: MVPN(15) deleting the local route [3][213:5][10.5.10.10][232.0.0.7][213.5.5.5]/22
BGP(15): no valid path for [3][213:5][10.5.10.10][232.0.0.7][213.5.5.5]/22
BGP(15): nettable_walker [3][213:5][10.5.10.10][232.0.0.7][213.5.5.5]/22 no best path
BGP(15): delete RIB route [3][213:5][10.5.10.10][232.0.0.7][213.5.5.5]/22
BGP(15): (base) 213.12.12.12 send unreachable (format) [3][213:5][10.5.10.10][232.0.0.7][213.5.5.5]/22
BGP(15): 213.12.12.12 rcv UPDATE about [3][213:5][10.5.10.10][232.0.0.7][213.5.5.5]/22 -- withdrawn
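The reason a “deny ip any any” at sequence 5 disables the data MDT is plain first-match ACL evaluation. Here is a small Python sketch of that behavior, simplified to match on the destination (group) only, since the source is “any” in both entries. The data structure and function name are illustrative, not an IOS API.

```python
import ipaddress

# (sequence, action, destination prefix); the source is "any" in both entries
ACL_DATA_MDT = [
    (5, "deny", "0.0.0.0/0"),
    (10, "permit", "232.0.0.7/32"),
]


def acl_action(acl: list[tuple[int, str, str]], group: str) -> str:
    """Return the action of the first entry, by sequence number, that matches."""
    addr = ipaddress.ip_address(group)
    for _seq, action, dest in sorted(acl):
        if addr in ipaddress.ip_network(dest):
            return action
    return "deny"  # implicit deny at the end of a non-empty ACL


print(acl_action(ACL_DATA_MDT, "232.0.0.7"))      # entry 5 shadows entry 10
print(acl_action(ACL_DATA_MDT[1:], "232.0.0.7"))  # original one-line ACL
```

Removing sequence 5 later restores the original behavior without retyping the permit entry, which is why this approach makes reverting quick.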

A spot check on some interested PEs, intranet and extranet, shows the big ‘Y’ missing. Also, on the ingress PE CSR5, the little ‘y’ is missing. These groups will be using the I-PMSI as there is no selective P-tunnel available in the network for this C(S,G).

R5#show ip mroute vrf N 232.0.0.7 10.5.10.10 | begin \(
(10.5.10.10, 232.0.0.7), 00:22:13/00:03:11, flags: sT
  Incoming interface: GigabitEthernet2.550, RPF nbr 10.5.10.10
  Outgoing interface list:
    Tunnel1, Forward/Sparse, 01:26:35/00:03:11

R1#show ip mroute vrf N 232.0.0.7 10.5.10.10 | begin \(
(10.5.10.10, 232.0.0.7), 14:37:18/00:02:06, flags: sTI
  Incoming interface: Tunnel0, RPF nbr 213.5.5.5
  Outgoing interface list:
    GigabitEthernet2.519, Forward/Sparse, 14:37:18/00:02:06

R8#show ip mroute vrf S 232.0.0.7 10.5.10.10 | begin \(
(10.5.10.10, 232.0.0.7), 01:36:44/00:02:38, flags: sTI
  Incoming interface: Tunnel1, RPF nbr 213.5.5.5
  Outgoing interface list:
    GigabitEthernet2.538, Forward/Sparse, 01:36:44/00:02:38


Another spot check on those same routers shows the MDT cache missing an entry for (10.5.10.10, 232.0.0.7). This is expected since that MDT cache information was gleaned from the Type-3 route that was just withdrawn. CSR8 still has its intranet SSM data MDT which is not being examined right now.

R1#show ip pim vrf N mdt receive
[no output]

R8#show ip pim vrf S mdt receive
Joined MDT-data [group/mdt number : source]  uptime/expires for VRF: S
 [232.4.7.0 : 213.7.7.7]  14:50:03/stopped

R5#show ip pim vrf N mdt send
[no output]

Because we did not modify any RT policies, CSR6 is still importing RT:213:5 and CSR8 is still importing RT:213:500, which means both the BGP-AD routes and the unicast routes to the extranet sources are inside VRF S (blue VPN). Looking at the I-PMSI BGP-AD routes, we can clearly see the default MDT groups are different. VRF S uses 232.255.0.2 while VRF N uses 232.255.0.1. There is no communication between VPNs over the default MDT, which means no PIM neighbors, etc. The draft clearly states: “If MI-PMSIs are being used, the blue VRFs must immediately join the P-tunnels specified in the red I-PMSI A-D routes.” This is not happening. Despite CSR8 having imported CSR5’s I-PMSI route, it did not join that MDT. I believe this could work with other MVPN profiles (like mLDP), but not GRE. We can immediately see that, when S-PMSI is not used, the complexity of MVPN extranet increases significantly.

R8#show bgp ipv4 mvpn vrf S route-type 1 213.5.5.5
BGP routing table entry for [1][213:8][213.5.5.5]/12, version 194
Paths: (1 available, best #1, table MVPNv4-BGP-Table, not advertised to EBGP peer)
  Not advertised to any peer
  Refresh Epoch 1
  Local, imported path from [1][213:5][213.5.5.5]/12 (global)
    213.5.5.5 (metric 20) from 213.12.12.12 (213.12.12.12)
      Origin incomplete, metric 0, localpref 100, valid, internal, best
      Community: no-export
      Extended Community: RT:213:5 RT:213:500
      Originator: 213.5.5.5, Cluster list: 213.12.12.12
      PMSI Attribute: Flags: 0x0, Tunnel type: 3, length 8, label: exp-null, tunnel parameters: D505 0505 E8FF 0001
      rx pathid: 0, tx pathid: 0x0

R8#show ip igmp groups
IGMP Connected Group Membership
Group Address    Interface    Uptime     Expires    Last Reporter   Group Accounted
232.4.7.0        Loopback0    15:09:38   stopped    213.8.8.8
232.255.0.2      Loopback0    15:17:02   stopped    213.8.8.8
224.0.1.40       Loopback0    1d16h      00:02:04   213.8.8.8

R8#show ip mroute 232.255.0.1
Group 232.255.0.1 not found

There are two solutions for this which have been supported by Cisco routers since 12.2S.

a. One option is to configure the source VRF on the receiver PEs, which is known as receive side chaining (RSC). This allows the “red” sources to have traffic follow the red default MDT all the way to the egress PE, because the egress PE is actually in that VPN now. This will allow the extranet receivers to join the red default MDT. This is best used when the flows are high bandwidth and/or there are few receivers. It is my personal favorite option.

b. The second option is to configure the receiver VRF on the source PEs, which is source side chaining (SSC). This allows the ingress PE to perform ingress replication so that the red multicast traffic is sent as both red and blue. This means that the receiver PEs don’t need to add more VRF configurations. This is best used when the flows are low bandwidth (extra replication) or there are many receivers where RSC scales poorly, as every egress PE would need to join the red default MDT.

First, I will use the RSC solution. It’s somewhat similar to the “red method” defined in the draft as we are modifying the blue VPN to receive multicast differently. This is quite configuration intensive compared to the S-PMSI solution and requires a lot of verification. The configuration modifications on CSR8 are shown below. We define the “red” VRF containing sources on the receiver PE, which is blue. It imports the routes from CSR5 using the violet RT. The configuration guide says that RT export should also be configured on this additional VRF, but it is not necessary. Another unnecessary step in the guide is importing the red sources into VRF S directly; since VRF S is explicitly using VRF N for RPF, there is no reason to import those routes. We use group-based VRF selection so that VRF S (blue) knows to look in VRF N (red) for sources.

! CSR8
vrf definition N
 rd 213:88
 route-target import 213:500
 address-family ipv4
  mdt auto-discovery pim
  mdt default 232.255.0.1
ip multicast-routing vrf N distributed
ip pim vrf N ssm default
ip access-list standard ACL_N_GROUPS
 permit 232.0.0.7
ip multicast vrf S rpf select vrf N group-list ACL_N_GROUPS


One interesting thing to note is that we have a unidirectional PIM neighbor over the red default MDT. CSR8 has imported CSR5’s RTs but not vice versa, so the PIM hellos only go one way. This isn’t terribly relevant, but it is an interesting note and a quick way to check if the receiver successfully joined the source’s default MDT.

R5#show ip pim vrf N neighbor | begin ^Neighbor
Neighbor        Interface    Uptime/Expires      Ver   DR
Address                                                Prio/Mode
213.1.1.1       Tunnel1      00:06:15/00:01:24   v2    1 / B S P G
213.12.12.12    Tunnel1      16:41:47/00:01:27   v2    1 / DR B P G

R8#show ip pim vrf N neighbor | begin ^Neighbor
Neighbor        Interface    Uptime/Expires      Ver   DR
Address                                                Prio/Mode
213.5.5.5       Tunnel4      00:06:54/00:01:43   v2    1 / B S P G

We can also verify that the group-based VRF selection policy is configured correctly.

R8#show ip rpf vrf S select
Multicast Group-to-Vrf Mappings
Group(s): 232.0.0.7/32, RPF vrf: N, Acl: ACL_N_GROUPS
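Conceptually, the group-to-VRF mapping is a lookup that runs before the normal RPF check. The Python sketch below is an illustration only, assuming a simple per-receiver-VRF table keyed on group prefixes; the table and function names are hypothetical and do not reflect how IOS implements the feature internally.

```python
import ipaddress

# Receiver VRF -> list of (group prefix, VRF to use for the RPF lookup)
RPF_SELECT = {
    "S": [("232.0.0.7/32", "N")],
}


def rpf_vrf(receiver_vrf: str, group: str) -> str:
    """Pick the VRF whose routing table is consulted for the RPF lookup."""
    addr = ipaddress.ip_address(group)
    for prefix, source_vrf in RPF_SELECT.get(receiver_vrf, []):
        if addr in ipaddress.ip_network(prefix):
            return source_vrf
    return receiver_vrf  # no mapping: RPF stays in the receiver's own VRF


print(rpf_vrf("S", "232.0.0.7"))  # extranet group, RPF resolved in VRF N
print(rpf_vrf("S", "232.0.0.5"))  # intranet group, RPF stays in VRF S
```

Groups that are not matched by the ACL continue to use VRF S for RPF, so the existing intranet flows in the blue VPN are unaffected.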

Next, we step through the three major MRIB entries that build this extranet connectivity. First, we need to verify that the blue PE has joined the red default MDT properly. The ‘Z’ flag must be set so the router identifies this as a multicast tunnel. The OIL should be directing traffic into the red VRF.

R8#show ip mroute 232.255.0.1 213.5.5.5 | begin \(
(213.5.5.5, 232.255.0.1), 00:05:30/stopped, flags: sTIZ
  Incoming interface: GigabitEthernet2.584, RPF nbr 213.8.14.14
  Outgoing interface list:
    MVRF N, Forward/Sparse, 00:05:30/00:00:29

Next, the red VRF indicates that this entry is an “extranet” entry with the big ‘E’. This is a proxy mechanism in the control plane that essentially says “the real receivers are in the blue VRF”. Notice that the OIL is null since there are no real receivers in the red VPN from the perspective of the blue PE.

R8#show ip mroute vrf N 232.0.0.7 10.5.10.10 | begin \(
(10.5.10.10, 232.0.0.7), 00:05:51/stopped, flags: sTE
  Incoming interface: Tunnel4, RPF nbr 213.5.5.5
  Outgoing interface list: Null
  Extranet receivers in vrf S:
(10.5.10.10, 232.0.0.7), 00:07:19/00:02:39, OIF count: 1, flags: sTI


VRF S, the blue VPN, has a normal C-MRIB entry except that the RPF lookup uses the red VPN. This makes sense since the blue VPN has no unicast reachability to the red sources. This is a little different than the S-PMSI tests earlier where importing the violet RT actually provided one-way unicast reachability.

R8#show ip mroute vrf S 232.0.0.7 10.5.10.10 | begin \(
(10.5.10.10, 232.0.0.7), 00:07:31/00:02:27, flags: sTI
  Incoming interface: Tunnel4, RPF nbr 213.5.5.5, using vrf N
  Outgoing interface list:
    GigabitEthernet2.538, Forward/Sparse, 00:07:31/00:02:27

R8#show ip rpf vrf S 10.5.10.0
 failed, no route exists

R8#show ip rpf vrf N 10.5.10.0
RPF information for ? (10.5.10.0)
  RPF interface: Tunnel4
  RPF neighbor: ? (213.5.5.5)
  RPF route/mask: 10.5.10.0/24
  RPF type: unicast (bgp 213)
  Doing distance-preferred lookups across tables
  BGP originator: 213.5.5.5
  RPF topology: ipv4 multicast base, originated from ipv4 unicast base

Checking the counters on all three MRIB entries shows that traffic is being piped from the global table (red default MDT) into the red VPN, then into the blue VPN. It’s expected that the default MDT will have a few more packets since it carries additional control-plane traffic, and potentially other data flows. The red and blue VPN counters should be identical or very close.

R8#show ip mroute 232.255.0.1 213.5.5.5 count | begin ^Group
Group: 232.255.0.1, Source count: 1, Packets forwarded: 688, Packets received: 688
  Source: 213.5.5.5/32, Forwarding: 688/1/138/1, Other: 688/0/0

R8#show ip mroute vrf N 232.0.0.7 10.5.10.10 count | begin ^Group
Group: 232.0.0.7, Source count: 1, Packets forwarded: 596, Packets received: 596
  Source: 10.5.10.10/32, Forwarding: 596/1/142/1, Other: 596/0/0

R8#show ip mroute vrf S 232.0.0.7 10.5.10.10 count | begin ^Group
Group: 232.0.0.7, Source count: 1, Packets forwarded: 596, Packets received: 596
  Source: 10.5.10.10/32, Forwarding: 596/1/142/1, Other: 596/0/0

An EPC verification proves that traffic is being forwarded correctly. Traffic traverses the red default MDT.

R8#show monitor capture CAP buffer detailed
 6  142  1.593023  213.5.5.5 -> 232.255.0.1  GRE
  0000: 01005E7F 00010050 56A9862A 8100CE00   ..^....PV..*....
  0010: 08004500 007C1453 0000FE2F E4F4D505   ..E..|.S.../....
  0020: 0505E8FF 00010000 08004500 00641461   ..........E..d.a
  0030: 0000FE01 AC210A05 0A0AE800 00070800   .....!..........
 7  118  1.593023  10.5.10.10 -> 232.0.0.7  ICMP
  0000: 01005E00 00070050 56A9FB1C 81000DD2   ..^....PV.......
  0010: 08004500 00641461 0000FD01 AD210A05   ..E..d.a.....!..
  0020: 0A0AE800 00070800 CDB1000E 02950000   ................
  0030: 00003784 7671ABCD ABCDABCD ABCDABCD   ..7.vq..........

We will quickly re-test this on CSR6 for completeness. The downside of the RSC mechanism is that you must configure it on all receivers, so we have to do twice the work. We will use RT:213:5, the real blue VPN RT, versus the violet RT:213:500. It should work the same way since the draft suggesting the use of a “violet RT” was written long after this feature was introduced. The snippet for CSR6 is not shown here since it is the same as CSR8’s with the RD changing to 213:66 and the RT-import policy using RT:213:5. Remember that CSR6’s blue VPN (S) is not importing any new RTs. First, we verify the three MRIB entries: red default MDT, red VPN, and blue VPN. This shows the control-plane linkages between them.

R6#show ip mroute 232.255.0.1 213.5.5.5 | begin \(
(213.5.5.5, 232.255.0.1), 00:01:32/00:01:58, flags: sTIZ
  Incoming interface: GigabitEthernet2.563, RPF nbr 213.6.13.13
  Outgoing interface list:
    MVRF N, Forward/Sparse, 00:01:32/00:01:58

R6#show ip mroute vrf N 232.0.0.7 10.5.10.10 | begin \(
(10.5.10.10, 232.0.0.7), 00:01:19/00:01:40, flags: sTE
  Incoming interface: Tunnel3, RPF nbr 213.5.5.5
  Outgoing interface list: Null
  Extranet receivers in vrf S:
(10.5.10.10, 232.0.0.7), 03:48:25/00:02:55, OIF count: 1, flags: sTI

R6#show ip mroute vrf S 232.0.0.7 10.5.10.10 | begin \(
(10.5.10.10, 232.0.0.7), 03:48:29/00:02:52, flags: sTI
  Incoming interface: Tunnel3, RPF nbr 213.5.5.5, using vrf N
  Outgoing interface list:
    GigabitEthernet2.546, Forward/Sparse, 03:48:29/00:02:52

Next, we verify the packet counters for each of these three entries. Notice the red/blue VPNs have the same number of packets, since the red VRF on this PE is just a proxy mechanism.

R6#show ip mroute 232.255.0.1 213.5.5.5 count | begin ^Group
Group: 232.255.0.1, Source count: 1, Packets forwarded: 422, Packets received: 422
  Source: 213.5.5.5/32, Forwarding: 422/1/139/1, Other: 422/0/0

R6#show ip mroute vrf N 232.0.0.7 10.5.10.10 count | begin ^Group
Group: 232.0.0.7, Source count: 1, Packets forwarded: 38, Packets received: 38
  Source: 10.5.10.10/32, Forwarding: 38/1/142/1, Other: 38/0/0

R6#show ip mroute vrf S 232.0.0.7 10.5.10.10 count | begin ^Group
Group: 232.0.0.7, Source count: 1, Packets forwarded: 38, Packets received: 38
  Source: 10.5.10.10/32, Forwarding: 38/1/142/1, Other: 38/0/0

EPC verifies that the traffic was received from the core and sent to the customer successfully.

R6#show monitor capture CAP buffer detailed
 1  142  0.005997  213.5.5.5 -> 232.255.0.1  GRE
  0000: 01005E7F 00010050 56A9EA54 8100CDEB   ..^....PV..T....
  0010: 08004500 007C1DB7 0000FE2F DB90D505   ..E..|...../....
  0020: 0505E8FF 00010000 08004500 00641DC5   ..........E..d..
  0030: 0000FE01 A2BD0A05 0A0AE800 00070800   ................
 2  118  0.005997  10.5.10.10 -> 232.0.0.7  ICMP
  0000: 01005E00 00070050 56A9DE0D 81000DDA   ..^....PV.......
  0010: 08004500 00641DC5 0000FD01 A3BD0A05   ..E..d..........
  0020: 0A0AE800 00070800 13E90010 06F00000   ................
  0030: 000037AD 2BB4ABCD ABCDABCD ABCDABCD   ..7.+...........

We will quickly test the same scenario with IPv6. We can re-use the BGP route-map to add the violet RT (RT:213:500) to the IPv6 MVPN AD routes on CSR5. We will create a new route-map with an IPv6 prefix-list to match the extranet source (2001:10:5:10::/64). We can locally verify the unicast prefix’s RT addition, but must check the RR to see the IPv6 MVPN RT addition. Both are correct.

R5#show bgp vpnv6 unicast vrf N 2001:10:5:10::/64
[snip]
  Local
    :: (via vrf N) from 0.0.0.0 (213.5.5.5)
      Origin incomplete, metric 0, localpref 100, weight 32768, valid, sourced, best
      Extended Community: RT:213:5 RT:213:500
      mpls labels in/out 5009/nolabel(N)
      rx pathid: 0, tx pathid: 0x0

RP/0/0/CPU0:XRv2#show bgp ipv6 mvpn rd 213:5 [1][213.5.5.5]/40
[snip]
  Local, (Received from a RR-client)
    213.5.5.5 (metric 20) from 213.5.5.5 (213.5.5.5)
      Origin incomplete, metric 0, localpref 100, valid, internal, best, group-best, import-candidate, not-in-vrf
      Received Path ID 0, Local Path ID 1, version 149
      Community: no-export
      Extended community: RT:213:5 RT:213:500
      PMSI: flags 0x00, type 3, label 0, ID 0xd5050505e8ff0001

We also configure CSR6 and CSR8 with the RPF selector to allow any source sending to group FF33::2 inside the blue VPN (S) to be RPF-checked using the red VPN (N). Remember that CSR6 is importing RT:213:5 (blue RT) and CSR8 is importing RT:213:500 (violet RT). Both should work as they did for IPv4.

! CSR6 and CSR8
ipv6 multicast vrf S rpf select vrf N group-range ACL_N_GROUPS_IPV6
ipv6 access-list ACL_N_GROUPS_IPV6
 permit ipv6 any host FF33::2

First, we will verify CSR8’s MRIB entries in sequence. Traffic is received from the core on the red default MDT and passed to the red proxy VRF. This passes the traffic into the blue VPN for delivery to the customer.

R8#show ip mroute 232.255.0.1 213.5.5.5 | begin \(
(213.5.5.5, 232.255.0.1), 00:45:48/stopped, flags: sTIZ
  Incoming interface: GigabitEthernet2.584, RPF nbr 213.8.14.14
  Outgoing interface list:
    MVRF N, Forward/Sparse, 00:45:48/00:02:11

R8#show ipv6 mroute vrf N FF33::2 2001:10:5:10::10 | begin \(
(2001:10:5:10::10, FF33::2), 00:10:04/never, flags: sTE
  Incoming interface: Tunnel4
  RPF nbr: ::FFFF:213.5.5.5
  Outgoing interface list: Null
  Extranet receivers in vrf S, OIF count 1

R8#show ipv6 mroute vrf S FF33::2 2001:10:5:10::10 | begin \(
(2001:10:5:10::10, FF33::2), 03:38:03/never, flags: sTI
  Incoming interface: Tunnel4
  RPF nbr: ::FFFF:213.5.5.5,using vrf N
  Immediate Outgoing interface list:
    GigabitEthernet2.538, Forward, 03:38:03/never

Although the blue VPN has no unicast route back to the source, our group-based VRF selection policy satisfies RPF. This mechanism is very clean since the routes remain within the red (source) VPN.

R8#show ipv6 route vrf S 2001:10:5:10::/64
% Route not found

R8#show ipv6 cef vrf S 2001:10:5:10::10
::/0
  no route

R8#show ipv6 rpf vrf S select
Multicast Group-to-Vrf Mappings
Group(s): FF33::2/128, RPF vrf: N, Acl: ACL_N_GROUPS_IPV6

We send traffic from CSR10 and check the packet counters to see traffic being forwarded. EPC verifies that the IPv4 GRE traffic is decapsulated to reveal the IPv6-ICMP multicast inside.

R8#show ip mroute 232.255.0.1 213.5.5.5 count | begin ^Group
Group: 232.255.0.1, Source count: 1, Packets forwarded: 262, Packets received: 262
  Source: 213.5.5.5/32, Forwarding: 262/1/115/1, Other: 262/0/0

R8#show ipv6 mroute vrf N FF33::2 2001:10:5:10::10 count | begin ^Group
Group: FF33::2
  Source: 2001:10:5:10::10,
   SW Forwarding: 0/0/0/0, Other: 0/0/0
   HW Forwarding: 39/1/142/1, Other: 0/0/0
  Totals - Source count: 1, Packet count: 39

R8#show ipv6 mroute vrf S FF33::2 2001:10:5:10::10 count | begin ^Group
Group: FF33::2
  Source: 2001:10:5:10::10,
   SW Forwarding: 0/0/0/0, Other: 0/0/0
   HW Forwarding: 39/1/142/1, Other: 0/0/0
  Totals - Source count: 1, Packet count: 39

R8#show monitor cap CAP buffer detailed
 2  142  0.723006  213.5.5.5 -> 232.255.0.1  GRE
  0000: 01005E7F 00010050 56A9862A 8100CE00   ..^....PV..*....
  0010: 08004500 007C003A 0000FE2F F90DD505   ..E..|.:.../....
  0020: 0505E8FF 00010000 86DD6000 0000003C   ..........`....<
  0030: 3A3F2001 00100005 00100000 00000000   :? .............
 3  118  0.724013  2001:*:0010 -> FF33:*:0002  IPv6-ICMP
  0000: 33330000 00020050 56A9FB1C 81000DD2   33.....PV.......
  0010: 86DD6000 0000003C 3A3F2001 00100005
  0020: 00100000 00000000 0010FF33 00000000
  0030: 00000000 00000000 00028000 C3B4259E
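The destination MAC addresses in these captures follow the standard multicast mappings: the low 23 bits of an IPv4 group are appended to 01:00:5E, and the low 32 bits of an IPv6 group are appended to 33:33. A short Python sketch (the function name is illustrative) reproduces the first six bytes of each captured frame:

```python
import ipaddress


def multicast_mac(group: str) -> str:
    """Map an IPv4 or IPv6 multicast group to its Ethernet destination MAC."""
    addr = ipaddress.ip_address(group)
    if not addr.is_multicast:
        raise ValueError(f"{group} is not a multicast address")
    if addr.version == 4:
        # 01:00:5E prefix followed by the low 23 bits of the group
        mac = (0x01005E << 24) | (int(addr) & 0x7FFFFF)
    else:
        # 33:33 prefix followed by the low 32 bits of the group
        mac = (0x3333 << 32) | (int(addr) & 0xFFFFFFFF)
    return mac.to_bytes(6, "big").hex().upper()


for grp in ("232.4.5.0", "232.255.0.1", "232.0.0.7", "FF33::2"):
    print(grp, multicast_mac(grp))
```

This is a quick way to tell from a raw dump whether a frame is riding the data MDT (232.4.5.0), the red default MDT (232.255.0.1), or is already decapsulated customer traffic.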

Next, we will test the SSC extranet delivery option. We will use VRF S as the red VPN where CSR2 is the source and XRv1/CSR9 are the receivers. We already know that if we use S-PMSI, things will “just work” as long as we import the proper RTs on the receiver PEs since the data MDTs aren’t VPN-specific. Therefore, we will test with I-PMSI only, using SSC. The benefit of this approach is that we only have to configure a new VRF on CSR7 (one ingress PE router). The drawback is that multicast will be replicated for each VPN, wasting bandwidth in the core. CSR7 is currently offering an S-PMSI P-tunnel for (10.2.7.2, 232.0.0.5).

R7#show bgp ipv4 mvpn vrf S route-type 3 10.2.7.2 232.0.0.5 213.7.7.7
BGP routing table entry for [3][213:7][10.2.7.2][232.0.0.5][213.7.7.7]/22, version 200
Paths: (1 available, best #1, table MVPNv4-BGP-Table, not advertised to EBGP peer)
  Advertised to update-groups:
     5
  Refresh Epoch 1
  Local
    0.0.0.0 from 0.0.0.0 (213.7.7.7)
      Origin incomplete, localpref 100, weight 32768, valid, sourced, local, best
      Community: no-export
      Extended Community: RT:213:7
      PMSI Attribute: Flags: 0x0, Tunnel type: 3, length 8, label: exp-null, tunnel parameters: D507 0707 E804 0700
      rx pathid: 0, tx pathid: 0x0

R7#show ip pim vrf S mdt send
MDT-data send list for VRF: S
  (source, group)              MDT-data group/num   ref_count
  (10.2.7.2, 232.0.0.5)        232.4.7.0            1

The first thing we do on CSR7 is insert a “deny ip any any” at the top of its IPv4 data MDT ACL. This will ensure only I-PMSI can be used and also allows us to revert back to S-PMSI quickly if needed. A quick verification on CSR7 shows that the S-PMSI advertisement is gone. Again, when S-PMSI is used, extranet becomes easy and “just works” since the default MDTs don’t need to pseudo-merge.

R7#show bgp ipv4 mvpn vrf S route-type 3 10.2.7.2 222.0.0.5 213.7.7.7
% Network not in table

R7#show ip pim vrf S mdt send
[no output]

Rather than have XRv2 and CSR1 join the red VPN (red is now S, since red means where the sources are), we will configure CSR7 to join the blue (N) VPN. CSR7 also needs to import the RTs of each receiver PE so that BGP-AD can build the default MDT. We will use a “violet” RT so we can avoid any potential BGP-AD issues. CSR7 will export its extranet sources with RT:213:700 and interested receivers in the blue VPN will import this to satisfy RPF. I do not show this basic RT import configuration on XRv2 and CSR1. The proxy VRF N needs to be fully connected with the receivers; importing their RTs allows it to import the BGP-AD routes and exporting allows the opposite to happen on the other end. Failing to do this both ways results in a broken I-PMSI P-tunnel.

! CSR7
vrf definition N
 rd 213:77
 route-target export 213:1
 route-target export 213:12
 route-target import 213:7
 route-target import 213:12
 route-target import 213:1
 address-family ipv4
  mdt auto-discovery pim
  mdt default 232.255.0.1
 address-family ipv6
  mdt auto-discovery pim
  mdt default 232.255.0.1
vrf definition S
 address-family ipv4
  export map RM_EXTRANET_EXPORT
ip prefix-list PL_EXTRANET_SOURCES seq 5 permit 10.2.7.0/24
route-map RM_EXTRANET_EXPORT permit 10
 match ip address prefix-list PL_EXTRANET_SOURCES
 set extcommunity rt 213:700 additive
ip multicast-routing vrf N distributed
ipv6 multicast-routing vrf N
ip pim vrf N ssm default

Before looking at the C-MRIB entries, let’s ensure the BGP-AD Type-1 I-PMSI routes were properly exchanged. CSR7 has the Type-1 routes for XRv2 and CSR1 because it imported RT:213:12 and RT:213:1, respectively. It has two local routes for its own I-PMSI advertisement, one for each MDT. This gets sloppy because we are essentially introducing the red MDT into the blue VPN, which is not desirable. Fortunately, BGP picks the locally sourced route over the imported one. The locally sourced route is exported with the RTs that CSR1 and XRv2 will import, and it determines which multicast group to use.

R7#show bgp ipv4 mvpn vrf N | begin Network
     Network          Next Hop            Metric LocPrf Weight Path
Route Distinguisher: 213:77 (default for vrf N)
 *>i [1][213:77][213.1.1.1]/12
                      213.1.1.1                0    100      0 ?
 *>  [1][213:77][213.7.7.7]/12
                      0.0.0.0                            32768 ?
 *                    0.0.0.0                            32768 ?
 *>i [1][213:77][213.12.12.12]/12
                      213.12.12.12                  100      0 i

R7#show bgp ipv4 mvpn vrf N route-type 1 213.7.7.7
BGP routing table entry for [1][213:77][213.7.7.7]/12, version 42
Paths: (2 available, best #1, table MVPNv4-BGP-Table, not advertised to EBGP peer)
  Advertised to update-groups:
     2
  Refresh Epoch 1
  Local
    0.0.0.0 from 0.0.0.0 (213.7.7.7)
      Origin incomplete, localpref 100, weight 32768, valid, sourced, local, best
      Community: no-export
      Extended Community: RT:213:1 RT:213:12
      PMSI Attribute: Flags: 0x0, Tunnel type: 3, length 8, label: exp-null, tunnel parameters: D507 0707 E8FF 0001
      rx pathid: 0, tx pathid: 0x0
  Refresh Epoch 1
  Local, imported path from [1][213:7][213.7.7.7]/12 (S)
    0.0.0.0 from 0.0.0.0 (213.7.7.7)
      Origin incomplete, localpref 100, weight 32768, valid, external
      Community: no-export
      Extended Community: RT:213:7
      PMSI Attribute: Flags: 0x0, Tunnel type: 3, length 8, label: exp-null, tunnel parameters: D507 0707 E8FF 0002
      rx pathid: 0, tx pathid: 0

If we didn’t use a new “violet” RT, we would have issues on CSR1/XRv2. Each of them would learn two different Type-1 I-PMSI routes; as they have different RDs, the RR sends both to the endpoints. One of them is the wrong MDT (red) with RT:213:7, and the other is the right MDT (blue) with RTs allowed within the blue VPN. Below is an error case where CSR1 selects the wrong MDT as bestpath and fails to join the blue MDT properly.

R1#show bgp ipv4 mvpn vrf N route-type 1 213.7.7.7
BGP routing table entry for [1][213:1][213.7.7.7]/12, version 230
Paths: (2 available, best #2, table MVPNv4-BGP-Table, not advertised to EBGP peer)
  Not advertised to any peer
  Refresh Epoch 1
  Local, imported path from [1][213:77][213.7.7.7]/12 (global)
    213.7.7.7 (metric 20) from 213.12.12.12 (213.12.12.12)
      Origin incomplete, metric 0, localpref 100, valid, internal
      Community: no-export
      Extended Community: RT:213:1 RT:213:12
      Originator: 213.7.7.7, Cluster list: 213.12.12.12
      PMSI Attribute: Flags: 0x0, Tunnel type: 3, length 8, label: exp-null, tunnel parameters: D507 0707 E8FF 0001
      rx pathid: 0, tx pathid: 0
  Refresh Epoch 1
  Local, imported path from [1][213:7][213.7.7.7]/12 (global)
    213.7.7.7 (metric 20) from 213.12.12.12 (213.12.12.12)
      Origin incomplete, metric 0, localpref 100, valid, internal, best
      Community: no-export
      Extended Community: RT:213:7
      Originator: 213.7.7.7, Cluster list: 213.12.12.12
      PMSI Attribute: Flags: 0x0, Tunnel type: 3, length 8, label: exp-null, tunnel parameters: D507 0707 E8FF 0002
      rx pathid: 0, tx pathid: 0x0

By using RT:213:700 for the unicast routes, we can safely import that route to satisfy RPF without importing the bogus BGP-AD route seen above. CSR1 will now receive and select only the correct Type-1 BGP-AD route and properly bind the I-PMSI to the C-MRIB entries. Manipulating other BGP attributes/filters could also solve it, but that is sloppy and beyond the scope of good L3VPN design.

R1#show bgp ipv4 mvpn vrf N route-type 1 213.7.7.7
BGP routing table entry for [1][213:1][213.7.7.7]/12, version 264
Paths: (1 available, best #1, table MVPNv4-BGP-Table, not advertised to EBGP peer)
  Not advertised to any peer
  Refresh Epoch 1
  Local, imported path from [1][213:77][213.7.7.7]/12 (global)
    213.7.7.7 (metric 20) from 213.12.12.12 (213.12.12.12)
      Origin incomplete, metric 0, localpref 100, valid, internal, best
      Community: no-export
      Extended Community: RT:213:1 RT:213:12
      Originator: 213.7.7.7, Cluster list: 213.12.12.12
      PMSI Attribute: Flags: 0x0, Tunnel type: 3, length 8, label: exp-null, tunnel parameters: D507 0707 E8FF 0001
      rx pathid: 0, tx pathid: 0x0
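The effect of the violet RT can be reasoned about as simple set intersection. The sketch below is illustrative only: the RT values come from the outputs in this section, but the base import list for CSR1 and the base export RT on the VRF S unicast route (213:7) are assumptions, since those configurations are not shown.

```python
def imports(route_rts, import_rts):
    """A VRF imports a route when at least one RT on the route is in its import list."""
    return not set(route_rts).isdisjoint(import_rts)

# Routes originated by CSR7; the unicast route's base RT (213:7) is assumed
ROUTES = {
    "VPNv4 10.2.7.0/24 from VRF S": ["213:7", "213:700"],
    "Type-1 I-PMSI, red MDT (RD 213:7)": ["213:7"],
    "Type-1 I-PMSI, blue MDT (RD 213:77)": ["213:1", "213:12"],
}

def imported_routes(import_rts):
    """Return the names of the routes a VRF with this import list would accept."""
    return [name for name, rts in ROUTES.items() if imports(rts, import_rts)]

BASE = ["213:1", "213:12"]  # assumed intranet imports for VRF N on CSR1

print(imported_routes(BASE + ["213:7"]))    # raw red RT: both Type-1 routes arrive
print(imported_routes(BASE + ["213:700"]))  # violet RT: unicast plus blue Type-1 only
```

Importing the raw red RT pulls in the red Type-1 route alongside the blue one, which is the error case shown earlier. Importing only the violet RT brings in the unicast route needed for RPF and nothing else from the red VPN.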

Verification for SSC happens in the reverse order from RSC. We first verify traffic entering from VRF S (red), being passed to VRF N (blue), and then being encapsulated for transmission to the core. Checking the red VPN, we see a valid C-MRIB entry that differs slightly from the RSC equivalent. In RSC, the OIL was null, and we only had extranet receivers. With SSC, the router still has to send multicast traffic down the red default MDT for other potential red clients, so this extranet VPN augments the overall MVPN. The big ‘E’ signifies this is an extranet entry, with extranet receivers listed below.

R7#show ip mroute vrf S 232.0.0.5 10.2.7.2 | begin \(
(10.2.7.2, 232.0.0.5), 00:15:18/00:03:13, flags: sTE
  Incoming interface: GigabitEthernet2.527, RPF nbr 0.0.0.0
  Outgoing interface list:
    Tunnel3, Forward/Sparse, 00:15:18/00:03:13
  Extranet receivers in vrf N:
    (10.2.7.2, 232.0.0.5), 00:08:24/00:03:05, OIF count: 1, flags: sT

Once inside the blue VPN, the C-MRIB verifies RPF. The source is connected (0.0.0.0) and the VRF used for RPF lookups is S. With SSC, we did not have to explicitly configure this VRF-RPF lookup policy as we did with RSC, since the unicast routes were exchanged locally on the same PE. The blue VPN directs traffic out of its default MDT; we can see PIM neighbors have formed on this interface with each receiver (blue) PE. This signals the router to process the packet for encapsulation.

R7#show ip pim vrf N neighbor | begin ^Neighbor
Neighbor          Interface                Uptime/Expires    Ver   DR
Address                                                            Prio/Mode
213.1.1.1         Tunnel1                  01:27:46/00:01:39 v2    1 / B S P G
213.12.12.12      Tunnel1                  02:18:38/00:01:16 v2    1 / DR B P G

R7#show ip mroute vrf N 232.0.0.5 10.2.7.2 | begin \(
(10.2.7.2, 232.0.0.5), 00:21:04/00:03:13, flags: sT
  Incoming interface: GigabitEthernet2.527, RPF nbr 0.0.0.0, using vrf S
  Outgoing interface list:
    Tunnel1, Forward/Sparse, 00:21:04/00:03:13

Because CSR7 is within the default MDT now for all receivers, it will have multiple P(S,G) entries in the blue default MDT. In this case, receivers exist behind XRv2 and CSR1, but we care more about the traffic being sent from this router, so we check the local P(S,G) entry. Packets being encapsulated by CSR7 are forwarded into the core for transport to the blue receivers.

R7#show ip mroute 232.255.0.1 213.7.7.7 | begin \(
(213.7.7.7, 232.255.0.1), 00:00:29/00:03:00, flags: sT
  Incoming interface: Loopback0, RPF nbr 0.0.0.0
  Outgoing interface list:
    GigabitEthernet2.574, Forward/Sparse, 00:00:29/00:03:00
    GigabitEthernet2.573, Forward/Sparse, 00:00:29/00:03:00

We can begin sending traffic from CSR2 and checking packet counters. We see packets enter VRF S and get forwarded to VRF N. From there, they are sent down the blue MDT towards blue receivers.


R7#show ip mroute vrf S 232.0.0.5 10.2.7.2 count | begin ^Group
Group: 232.0.0.5, Source count: 1, Packets forwarded: 10, Packets received: 10
  Source: 10.2.7.2/32, Forwarding: 10/1/118/0, Other: 10/0/0

R7#show ip mroute vrf N 232.0.0.5 10.2.7.2 count | begin ^Group
Group: 232.0.0.5, Source count: 1, Packets forwarded: 10, Packets received: 10
  Source: 10.2.7.2/32, Forwarding: 10/1/118/0, Other: 10/0/0

R7#show ip mroute 232.255.0.1 213.7.7.7 count | begin ^Group
Group: 232.255.0.1, Source count: 3, Packets forwarded: 126, Packets received: 126
  Source: 213.7.7.7/32, Forwarding: 54/1/91/1, Other: 54/0/0
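When comparing counters across several routers, it can help to pull the four "Forwarding" fields out of each line. The sketch below assumes the usual IOS legend for this output (packet count / packets per second / average packet size / kilobits per second); the function and field names are illustrative.

```python
import re

FIELDS = ("packets", "pps", "avg_size", "kbps")

def parse_forwarding(line: str) -> dict:
    """Extract the four Forwarding counters from a 'show ip mroute count' source line."""
    match = re.search(r"Forwarding: (\d+)/(\d+)/(\d+)/(\d+)", line)
    if match is None:
        raise ValueError("no Forwarding counters found")
    return dict(zip(FIELDS, map(int, match.groups())))

# Source line copied from the VRF S output on CSR7
counters = parse_forwarding("Source: 10.2.7.2/32, Forwarding: 10/1/118/0, Other: 10/0/0")
print(counters)
print(counters["packets"] * counters["avg_size"])  # approximate bytes forwarded
```

The average size of 118 bytes in VRF S and VRF N matches the length of the native multicast frame seen in the EPC captures in this section.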

At the remote end, we see packets being received by the VRF N (blue) PEs.

RP/0/0/CPU0:XRv2#show mfib vrf N route 232.0.0.5 10.2.7.2 | begin 232
(10.2.7.2,232.0.0.5), Flags:
  Up: 06:00:15
  Last Used: never
  SW Forwarding Counts: 0/23/2300
  SW Replication Counts: 0/23/2300
  SW Failure Counts: 0/0/0/0/0
  mdtN Flags: A MI, Up:00:09:38
  GigabitEthernet0/0/0/0.512 Flags: NS EG, Up:06:00:15

R1#show ip mroute vrf N 232.0.0.5 10.2.7.2 count | begin ^Group
Group: 232.0.0.5, Source count: 1, Packets forwarded: 49, Packets received: 49
  Source: 10.2.7.2/32, Forwarding: 49/1/142/1, Other: 49/0/0

An EPC verification on CSR1 shows packets arriving on the blue MDT from the core and being sent towards the extranet customers.

R1#show monitor capture CAP buffer detailed
0 142 0.000000 213.7.7.7 -> 232.255.0.1 GRE
0000: 01005E7F 00010050 56A9862A 8100CDBA ..^....PV..*....
0010: 08004500 007C0AE5 0000FE2F EC5ED507 ..E..|...../.^..
0020: 0707E8FF 00010000 08004500 00645E64 ..........E..d^d
0030: 0000FE01 652B0A02 0702E800 00050800 ....e+..........
1 118 0.001007 10.2.7.2 -> 232.0.0.5 ICMP
0000: 01005E00 00050050 56A91AAA 81000DBF ..^....PV.......
0010: 08004500 00645E64 0000FD01 662B0A02 ..E..d^d....f+..
0020: 0702E800 00050800 2EF0001D 00BE0000 ................
0030: 000037E8 1697ABCD ABCDABCD ABCDABCD ..7.............
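The first 64 bytes of the GRE-encapsulated frame in the capture are enough to confirm the encapsulation by hand. The sketch below is a simplified parser under stated assumptions: a single 802.1Q tag, IPv4 headers without options, and a GRE header with no optional fields, all of which hold for this particular frame. The function name and dictionary keys are illustrative.

```python
import ipaddress

# First 64 bytes of packet 0 from the CSR1 capture
FRAME_HEX = """
01005E7F 00010050 56A9862A 8100CDBA
08004500 007C0AE5 0000FE2F EC5ED507
0707E8FF 00010000 08004500 00645E64
0000FE01 652B0A02 0702E800 00050800
"""

def parse_gre_frame(frame_hex: str) -> dict:
    raw = bytes.fromhex("".join(frame_hex.split()))
    if raw[12:14] != b"\x81\x00":
        raise ValueError("sketch expects a single 802.1Q tag")
    outer = raw[18:38]  # outer IPv4 header, 20 bytes
    gre = raw[38:42]    # basic GRE header, 4 bytes
    inner = raw[42:62]  # inner IPv4 header, 20 bytes
    if outer[0] != 0x45 or gre[:2] != b"\x00\x00":
        raise ValueError("sketch expects IHL 5 and no GRE options")
    return {
        "outer_src": str(ipaddress.IPv4Address(outer[12:16])),
        "outer_dst": str(ipaddress.IPv4Address(outer[16:20])),
        "outer_proto": outer[9],                        # 47 is GRE
        "gre_proto": int.from_bytes(gre[2:4], "big"),   # 0x0800 is IPv4
        "inner_src": str(ipaddress.IPv4Address(inner[12:16])),
        "inner_dst": str(ipaddress.IPv4Address(inner[16:20])),
        "inner_proto": inner[9],                        # 1 is ICMP
        "overhead": len(outer) + len(gre),
    }

print(parse_gre_frame(FRAME_HEX))
```

The 24 bytes of overhead (outer IPv4 plus GRE) account for the difference between the 142-byte frame arriving from the core and the 118-byte frame sent to the customer.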


Last, we examine the SSC method with IPv6. We won’t intentionally break this network up front by failing to use a “violet RT” since that issue was already demonstrated. The vast majority of the configuration will be on CSR7 again, and basic steps such as configuring the default MDT and BGP-AD under the VRF IPv6 AF are not shown. The configuration is identical to IPv4. Adding a new IPv6 prefix-list and route-map to account for the source behind CSR7 is also required, but is not shown because it is very basic. This sets the violet RT:213:700 so routers in the blue VPN (receivers like XRv2 and CSR1) can import the unicast VPNv6 routes for RPF validation, but not the BGP-AD routes related to the red VPN. It’s worth verifying that those egress PEs only have a single BGP Type-1 I-PMSI route for the blue VPN. The default MDT for VRF N (the blue VPN in this test) is 232.255.0.1, which is shown below. Notice there is no violet RT attached to these BGP-AD routes.

R1#show bgp ipv6 mvpn vrf N route-type 1 213.7.7.7
[snip]
  Local, imported path from [1][213:77][213.7.7.7]/12 (global)
    213.7.7.7 (metric 20) from 213.12.12.12 (213.12.12.12)
      Origin incomplete, metric 0, localpref 100, valid, internal, best
      Community: no-export
      Extended Community: RT:213:1 RT:213:12
      Originator: 213.7.7.7, Cluster list: 213.12.12.12
      PMSI Attribute: Flags: 0x0, Tunnel type: 3, length 8, label: exp-null, tunnel parameters: D507 0707 E8FF 0001
      rx pathid: 0, tx pathid: 0x0

RP/0/0/CPU0:XRv2#show bgp ipv6 mvpn vrf N [1][213.7.7.7]/40
[snip]
  Local, (Received from a RR-client)
    213.7.7.7 (metric 20) from 213.7.7.7 (213.7.7.7)
      Origin incomplete, metric 0, localpref 100, valid, internal, best, group-best, import-candidate, imported
      Received Path ID 0, Local Path ID 1, version 139
      Community: no-export
      Extended community: RT:213:1 RT:213:12
      PMSI: flags 0x00, type 3, label 0, ID 0xd5070707e8ff0001
      Source VRF: default, Source Route Distinguisher: 213:77

Next, we can validate two-way PIM neighbors over the blue MDT. CSR7 must be able to receive C(S,G) joins from the extranet receivers within the blue VPN.

R7#show ipv6 pim vrf N neighbor
PIM Neighbor Table
Mode: B - Bidir Capable, G - GenID Capable
Neighbor Address           Interface          Uptime    Expires  Mode DR pri
::FFFF:213.1.1.1           Tunnel1            00:24:50  00:01:35 B G     1
::FFFF:213.12.12.12        Tunnel1            01:15:44  00:01:22 B G  DR 1

R1#show ipv6 pim vrf N neighbor | include 7.7
::FFFF:213.7.7.7           Tunnel2            00:26:01  00:01:24 B G     1

RP/0/0/CPU0:XRv2#show pim vrf N ipv6 neighbor | include 7.7
::ffff:213.7.7.7 01:16:45 00:01:34 1
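The IPv6 C-PIM neighbors over the MDT appear as IPv4-mapped IPv6 addresses built from the remote PE's IPv4 address. As a small sketch, Python's standard ipaddress module can recover the IPv4 address from such a neighbor entry; the helper name is illustrative.

```python
import ipaddress

def pe_address(neighbor: str) -> str:
    """Return the IPv4 address embedded in an IPv4-mapped IPv6 neighbor address."""
    mapped = ipaddress.IPv6Address(neighbor).ipv4_mapped
    if mapped is None:
        raise ValueError(f"{neighbor} is not an IPv4-mapped IPv6 address")
    return str(mapped)

# Neighbor addresses copied from the PIM neighbor tables
for nbr in ("::FFFF:213.1.1.1", "::FFFF:213.12.12.12", "::ffff:213.7.7.7"):
    print(nbr, "->", pe_address(nbr))
```

This makes it easy to correlate IPv6 PIM neighbors with the IPv4 PIM neighbors and BGP originator addresses shown elsewhere in this section.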

Both XRv2 and CSR1 have valid RPF interfaces for these extranet sources, so they send C-PIM joins towards CSR7 over the blue MDT. The output doesn’t show which protocol the RPF route was learned from, but the admin distance of 200 hints at iBGP (VPNv6). These blue receivers need not have any red configuration on them, as the source PE (CSR7) is performing the replication.

R1#show ipv6 mroute vrf N FF33::1 2001:10:2:7::2 | begin \(
(2001:10:2:7::2, FF33::1), 00:28:37/never, flags: sTI
  Incoming interface: Tunnel2
  RPF nbr: ::FFFF:213.7.7.7
  Immediate Outgoing interface list:
    GigabitEthernet2.519, Forward, 00:28:37/never

R1#show ipv6 rpf vrf N 2001:10:2:7::2
RPF information for 2001:10:2:7::2
  RPF interface: Tunnel2
  RPF neighbor: ::FFFF:213.7.7.7
  RPF route/mask: 2001:10:2:7::/64
  RPF type: Unicast
  RPF recursion count: 0
  Metric preference: 200
  Metric: 0

RP/0/0/CPU0:XRv2#show pim vrf N ipv6 topology ff33::1 2001:10:2:7::2 | begin ffff
(2001:10:2:7::2,ff33::1) SPT SSM Up: 07:14:35
JP: Join(00:01:00) Flags:
RPF: mdtN,::ffff:213.7.7.7
  GigabitEthernet0/0/0/0.512 07:14:35 fwd LI LH

RP/0/0/CPU0:XRv2#show pim vrf N ipv6 rpf 2001:10:2:7::2
Table: IPv6-Unicast-default
2001:10:2:7::2/128 [200/0]
    via ::ffff:213.7.7.7, mdtN
    Connector 213:7:213.7.7.7, Nexthop ::ffff:213.7.7.7

We perform the MRIB verification on CSR7 in the reverse order from RSC. Traffic entering the PE arrives in the red VPN (S) from the LAN first. The traffic is sent down the red MDT (not null, as it was in RSC) and also sent into the extranet VPN, which is blue. The blue C-MRIB sends it to the blue default MDT for encapsulation, and this blue VPN uses the red VPN for RPF verification.

R7#show ipv6 mroute vrf S FF33::1 2001:10:2:7::2 | begin \(
(2001:10:2:7::2, FF33::1), 01:33:52/00:03:18, flags: sTE
  Incoming interface: GigabitEthernet2.527
  RPF nbr: 2001:10:2:7::2
  Immediate Outgoing interface list:
    Tunnel3, Forward, 01:33:46/00:03:18
  Extranet receivers in vrf N, OIF count 1

R7#show ipv6 mroute vrf N FF33::1 2001:10:2:7::2 | begin \(
(2001:10:2:7::2, FF33::1), 01:33:52/00:02:36, flags: sT
  Incoming interface: GigabitEthernet2.527
  RPF nbr: 2001:10:2:7::2,using vrf S
  Immediate Outgoing interface list:
    Tunnel1, Forward, 01:33:52/00:02:36

R7#show ip mroute 232.255.0.1 213.7.7.7 | begin \(
(213.7.7.7, 232.255.0.1), 01:36:36/00:03:10, flags: sT
  Incoming interface: Loopback0, RPF nbr 0.0.0.0
  Outgoing interface list:
    GigabitEthernet2.574, Forward/Sparse, 01:36:21/00:03:10
    GigabitEthernet2.573, Forward/Sparse, 01:36:36/00:02:52

CSR2 begins sending traffic; we validate counters for each one of the MRIB entries. Naturally, the IPv4 default MDT (blue) will have many more hits than the C(S,G) specific for this IPv6 SSM flow as it carries all traffic in the default MDT.

R7#show ipv6 mroute vrf S FF33::1 2001:10:2:7::2 count | begin ^Group
Group: FF33::1
  Source: 2001:10:2:7::2,
   SW Forwarding: 0/0/0/0, Other: 0/0/0
   HW Forwarding: 14/1/118/0, Other: 0/0/0
  Totals - Source count: 1, Packet count: 14

R7#show ipv6 mroute vrf N FF33::1 2001:10:2:7::2 count | begin ^Group
Group: FF33::1
  Source: 2001:10:2:7::2,
   SW Forwarding: 0/0/0/0, Other: 0/0/0
   HW Forwarding: 14/1/118/0, Other: 0/0/0
  Totals - Source count: 1, Packet count: 14

R7#show ip mroute 232.255.0.1 213.7.7.7 count | begin ^Group
Group: 232.255.0.1, Source count: 3, Packets forwarded: 2685, Packets received: 2685
  Source: 213.7.7.7/32, Forwarding: 1783/1/112/0, Other: 1783/0/0

At the remote end, we can see XRv2 and CSR1 accepting and forwarding packets. An EPC capture on CSR1 further proves it by showing the GRE encapsulated MDT tunnel arriving and the raw IPv6-ICMP exiting towards the extranet customer.


RP/0/0/CPU0:XRv2#show mfib vrf N ipv6 route ff33::1 2001:10:2:7::2 | begin ff33
(2001:10:2:7::2,ff33::1) Flags:
  Up: 07:34:05
  Last Used: never
  SW Forwarding Counts: 0/172/17200
  SW Replication Counts: 0/172/17200
  SW Failure Counts: 0/0/0/0/0
  mdtN Flags: A MI, Up:01:43:28
  GigabitEthernet0/0/0/0.512 Flags: NS EG, Up:07:34:05

R1#show ipv6 mroute vrf N FF33::1 2001:10:2:7::2 count | begin ^Group
Group: FF33::1
  Source: 2001:10:2:7::2,
   SW Forwarding: 0/0/0/0, Other: 0/0/0
   HW Forwarding: 191/1/142/1, Other: 0/0/0
  Totals - Source count: 1, Packet count: 191

R1#show monitor capture CAP buffer brief
0 142 0.000000 213.7.7.7 -> 232.255.0.1 GRE
1 118 0.000000 2001:*:0002 -> FF33:*:0001 IPv6-ICMP

Additional Reading – Reference configurations “mvpn-extranet-gre”

29.4.2 mLDP

Extranet services for mLDP are not very different in principle from the PIM/GRE examples. We are still using the concepts of default/data MDTs, along with Type-1 (I-PMSI) and Type-3 (S-PMSI) BGP-AD routes. The base MVPN profile in use this time is profile 9, so C-PIM signaling is still used in the overlay. VRF N covers the top three routers and VRF S covers the bottom three. VRF N uses XRv4 as its mLDP MP2MP root and VRF S uses XRv3. VRF N has a VPN ID of 213:1512 while VRF S has a VPN ID of 213:678. These must match on all routers in a VPN, along with the MP2MP root addresses.

Before getting started with extranet services, we will do very brief checks of the mLDP infrastructure for both VPNs to ensure the basic MP2MP trees are built. As the root for VRF N, XRv4 should have three downstream clients on its MP2MP tree: CSR1, CSR5, and XRv3 (towards XRv2). XRv3 likewise has one downstream (XRv2) and one upstream (XRv4) client. For VRF S, XRv3 has three downstream clients: CSR6, CSR7, and XRv4 (towards CSR8). XRv4 has one upstream client (XRv3) and one downstream client (CSR8). Matching client counts are a good indication that things are working properly.

RP/0/0/CPU0:XRv4#show mpls mldp database mp2mp brief
LSM ID     Type    Root              Up    Down  Decoded Opaque Value
0x00004    MP2MP   213.13.13.13      1     1     [mdt 213:678 0]
0x00001    MP2MP   213.14.14.14      0     3     [mdt 213:1512 0]

RP/0/0/CPU0:XRv3#show mpls mldp database mp2mp brief
LSM ID     Type    Root              Up    Down  Decoded Opaque Value
0x00002    MP2MP   213.13.13.13      0     3     [mdt 213:678 0]
0x00001    MP2MP   213.14.14.14      1     1     [mdt 213:1512 0]

Next, we validate C-PIM adjacencies in each overlay. For brevity, we will check CSR1 and CSR8 only. Each of them has a neighborship with the other two VPN members. Realistically, one would want to verify all PEs in this way as PIM neighbors can be unidirectional.

R1#show ip pim vrf N neighbor | begin ^Neighbor
Neighbor          Interface                Uptime/Expires    Ver   DR
Address                                                            Prio/Mode
213.12.12.12      Lspvif0                  00:23:09/00:01:23 v2    1 / DR B P G
213.5.5.5         Lspvif0                  00:25:10/00:01:31 v2    1 / B S P G

R8#show ip pim vrf S neighbor | begin ^Neighbor
Neighbor          Interface                Uptime/Expires    Ver   DR
Address                                                            Prio/Mode
213.6.6.6         Lspvif0                  00:03:17/00:01:24 v2    1 / B S P G
213.7.7.7         Lspvif0                  00:03:17/00:01:23 v2    1 / B S P G

A quick test of intranet MVPN traffic will verify proper operation before progressing to extranets. This will not be a comprehensive review of mLDP profile 9, however. CSR10 starts sending traffic to group 232.0.0.7. We check the C-MRIB states to see how the routers are forwarding traffic. A data MDT (mLDP P2MP) is in use, which implies the presence of a Type-3 S-PMSI route. This is where XRv stops working; it claims null RPF when clearly that is not true. We don’t expect it to work with LSM anyway.

R5#show ip mroute vrf N 232.0.0.7 10.5.10.10 verbose | begin \(
(10.5.10.10, 232.0.0.7), 00:31:59/00:03:23, flags: sTyp
  Incoming interface: GigabitEthernet2.550, RPF nbr 0.0.0.0
  MDT TX nr: 1 LSM-ID: 0x3
  Outgoing interface list:
    Lspvif0, LSM MDT: 3 (data), Forward/Sparse, 00:31:59/00:03:23, p

R1#show ip mroute vrf N 232.0.0.7 10.5.10.10 | begin \(
(10.5.10.10, 232.0.0.7), 06:20:31/00:02:34, flags: sTIY
  Incoming interface: Lspvif0, RPF nbr 213.5.5.5, MDT: [1, 213.5.5.5]/never
  Outgoing interface list:
    GigabitEthernet2.519, Forward/Sparse, 06:20:31/00:02:34

RP/0/0/CPU0:XRv2#show pim vrf N topology 232.0.0.7 10.5.10.10 | begin 232
(10.5.10.10,232.0.0.7)SPT SSM Up: 00:03:14
JP: Join(00:00:34) RPF: Null,213.5.5.5 Flags:
  GigabitEthernet0/0/0/0.512 00:03:14 fwd LI LH

RP/0/0/CPU0:XRv2#show pim vrf N rpf 10.5.10.10
Table: IPv4-Unicast-default
* 10.5.10.10/32 [200/0]
    via LmdtN with rpf neighbor 213.5.5.5

We check the MDT data bindings on all routers. XRv2 and CSR1 receive MDT information via the Type-3 S-PMSI route, while CSR5 sends it. The S-PMSI specifies an mLDP P2MP tunnel (type 2) rooted at CSR5.

R5#show bgp ipv4 mvpn vrf N route-type 3 10.5.10.10 232.0.0.7 213.5.5.5
BGP routing table entry for [3][213:5][10.5.10.10][232.0.0.7][213.5.5.5]/22,
[snip]
  Local
    0.0.0.0 from 0.0.0.0 (213.5.5.5)
      Origin incomplete, localpref 100, weight 32768, valid, sourced, local, best
      Community: no-export
      Extended Community: RT:213:5
      PMSI Attribute: Flags: 0x0, Tunnel type: 2, length 24, label: exp-null, tunnel parameters: 0600 0104 D505 0505 000E 0200 0B00 0213 0000 1512 0000 0001
      rx pathid: 0, tx pathid: 0x0

R5#show ip pim vrf N mdt send
MDT-data send list for VRF: N
  (source, group)                     MDT-data group/num   ref_count
  (10.5.10.10, 232.0.0.7)             1                    1

R1#show ip pim vrf N mdt receive
Joined MDT-data [group/mdt number : source]  uptime/expires for VRF: N
 [1 : 213.5.5.5]  00:02:11/stopped

RP/0/0/CPU0:XRv2#show pim vrf N mdt cache
Core Source      Cust (Source, Group)          Core Data          Expires
213.5.5.5        (10.5.10.10, 232.0.0.7)       [mdt 213:1512 1]   never
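The 24-byte tunnel parameters in the Type-3 route can be unpacked to recover the P2MP root and the MDT opaque value. The sketch below follows the general mLDP P2MP FEC layout (type, address family, address length, root, opaque length, opaque TLV). It assumes that the 11-byte opaque value is a 7-byte VPN ID (3-byte OUI plus 4-byte index) followed by a 4-byte MDT number, which is consistent with the decoded "[mdt 213:1512 1]" display. The function name and dictionary keys are illustrative.

```python
import ipaddress
import struct

def decode_mldp_p2mp(params: str) -> dict:
    """Decode an mLDP P2MP FEC element carrying an MDT opaque value (assumed layout)."""
    raw = bytes.fromhex(params.replace(" ", ""))
    fec_type, afi, addr_len = struct.unpack("!BHB", raw[:4])
    root = ipaddress.IPv4Address(raw[4:4 + addr_len])
    pos = 4 + addr_len
    (opaque_len,) = struct.unpack("!H", raw[pos:pos + 2])
    opaque = raw[pos + 2:pos + 2 + opaque_len]
    opaque_type = opaque[0]
    (value_len,) = struct.unpack("!H", opaque[1:3])
    value = opaque[3:3 + value_len]
    oui = int.from_bytes(value[0:3], "big")     # first part of the VPN ID
    index = int.from_bytes(value[3:7], "big")   # second part of the VPN ID
    mdt_number = int.from_bytes(value[7:11], "big")
    return {
        "fec_type": fec_type,
        "root": str(root),
        "opaque_type": opaque_type,
        "vpn_id": f"{oui:x}:{index:x}",  # VPN IDs are displayed in hex
        "mdt_number": mdt_number,
    }

# Tunnel parameters copied from the Type-3 S-PMSI route on CSR5
print(decode_mldp_p2mp("0600 0104 D505 0505 000E 0200 0B00 0213 0000 1512 0000 0001"))
```

The decoded root (213.5.5.5), VPN ID (213:1512), and MDT number (1) line up with the "mdt send", "mdt receive", and "mdt cache" outputs on the three routers.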

Sending traffic from CSR10 shows packets being sent into the core by CSR5 and received by CSR1. EPC confirms the MPLS decapsulation and native IP multicast forwarding to CSR9. The 4-byte difference in frame size represents the popping of the MPLS shim header. A final validation of label 0x3F4 (1012) shows this is the proper local downstream label for CSR1 along that data MDT. In the capture, the label stack entry is the 4 bytes 003F41FD immediately after the 0x8847 EtherType, and the IP header begins at 4500.

R5#show ip mroute vrf N 232.0.0.7 10.5.10.10 count | begin ^Group
Group: 232.0.0.7, Source count: 1, Packets forwarded: 27, Packets received: 27
  Source: 10.5.10.10/32, Forwarding: 27/1/118/0, Other: 27/0/0

R1#show ip mroute vrf N 232.0.0.7 10.5.10.10 count | begin ^Group
Group: 232.0.0.7, Source count: 1, Packets forwarded: 28, Packets received: 28
  Source: 10.5.10.10/32, Forwarding: 28/1/122/0, Other: 28/0/0

R1#show monitor capture CAP buffer detail
0 122 0.000000 00:50:56:A9:86:2A -> 00:50:56:A9:1A:AA MPLS unicast
0000: 005056A9 1AAA0050 56A9862A 81000DBA .PV....PV..*....
0010: 8847003F 41FD4500 00641ECF 0000FE01 .G.?A.E..d......
0020: A1B30A05 0A0AE800 00070800 D7E20012 ................
0030: 00410000 000039B5 6C5FABCD ABCDABCD .A....9.l_......
1 118 0.000000 10.5.10.10 -> 232.0.0.7 ICMP
0000: 01005E00 00070050 56A91AAA 81000DBF ..^....PV.......
0010: 08004500 00641ECF 0000FC01 A3B30A05 ..E..d..........
0020: 0A0AE800 00070800 D7E20012 00410000 .............A..
0030: 000039B5 6C5FABCD ABCDABCD ABCDABCD ..9.l_..........

R1#show mpls mldp database opaque_type mdt 213:1512 1 | section Upstream
  Upstream client(s) :
    213.14.14.14:0    [Active]
      Expires        : Never         Path Set ID  : 5
      Out Label (U)  : None          Interface    : GigabitEthernet2.514*
      Local Label (D): 1012          Next Hop     : 213.1.14.14
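The label value can be checked directly from the capture. The 32-bit MPLS label stack entry is laid out as a 20-bit label, 3-bit traffic class, 1-bit bottom-of-stack flag, and 8-bit TTL. The following short sketch decodes the entry 003F41FD seen after the 0x8847 EtherType; the function name is illustrative.

```python
def decode_label_stack_entry(entry_hex: str) -> dict:
    """Split a 32-bit MPLS label stack entry into its four fields."""
    word = int(entry_hex, 16)
    return {
        "label": word >> 12,                       # top 20 bits
        "tc": (word >> 9) & 0x7,                   # 3-bit traffic class
        "bottom_of_stack": bool((word >> 8) & 1),  # S bit
        "ttl": word & 0xFF,                        # low 8 bits
    }

# Label stack entry copied from packet 0 of the CSR1 capture
print(decode_label_stack_entry("003F41FD"))
```

The decoded label of 1012 matches the local downstream label in the mLDP database, and the set bottom-of-stack bit confirms that only one 4-byte shim header is present, which accounts for the 122-byte versus 118-byte frame sizes.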

Because S-PMSI is offered for (10.5.10.10, 232.0.0.7), we theorize that importing the RT associated with the BGP-AD and VPNv4 routes from CSR5 will be sufficient to build connectivity to extranets. With GRE, this worked quite easily, since the entire core was PIM-enabled and there was nothing stopping PIM P(S,G) joins for data MDT groups transiting between any PEs. At this point, neither CSR6 nor CSR8 has the red VPN (N) configured locally. VRF N will be our “red VPN” at first, the one with the sources, so we will not verify intranet MVPN on VRF S yet. Instead, we will perform a similar test as we did with GRE. CSR5 will create a “violet” RT for its BGP-AD routes along with any extranet sources. CSR8 will import this “violet” RT as an extranet client. CSR6 will import the raw red RT:213:5 and attempt to draw extranet MVPN services. These imports are happening directly into the blue VPN (S), not a proxy red VPN (N). Fortunately, the “violet” RT constructs are identical to the PIM/GRE model and the configurations are not displayed again. We can verify that the VPNv4 route for 10.5.10.0/24 and the BGP-AD Type-1/3 routes from CSR5 all have the new violet RT:213:500.

RP/0/0/CPU0:XRv2#show bgp vpnv4 unicast vrf N 10.5.10.0/24 | include Extend
      Extended community: RT:213:5 RT:213:500

RP/0/0/CPU0:XRv2#show bgp ipv4 mvpn vrf N [1][213.5.5.5]/40 | include Extend
      Extended community: RT:213:5 RT:213:500

RP/0/0/CPU0:XRv2#show bgp ipv4 mvpn vrf N [3][32][10.5.10.10][32][232.0.0.7][213.5.5.5]/120 | include Extend
      Extended community: RT:213:5 RT:213:500

A quick check of the mLDP database on CSR6 shows that the P2MP tunnel rooted at CSR5 was not built. We can clearly see the Type-3 S-PMSI route has been imported. The issue is that the VPN ID is encoded in the variable-length opaque value nested inside the FEC element. The PMSI attribute carries this value, which is 213:1512. There are no VRFs locally configured that match this VPN ID, so the MVPN construction cannot occur. We must use another method.

R6#show mpls mldp database summary
LSM ID     Type    Root              Decoded Opaque Value   Client Cnt.
C          P2MP    213.7.7.7         [mdt 213:678 1]        1
9          MP2MP   213.13.13.13      [mdt 213:678 0]        1

R6#show bgp ipv4 mvpn vrf S route-type 3 10.5.10.10 232.0.0.7 213.5.5.5
BGP routing table entry for [3][213:6][10.5.10.10][232.0.0.7][213.5.5.5]/22, version 130
Paths: (1 available, best #1, table MVPNv4-BGP-Table, not advertised to EBGP peer)
  Not advertised to any peer
  Refresh Epoch 1
  Local, imported path from [3][213:5][10.5.10.10][232.0.0.7][213.5.5.5]/22 (global)
    213.5.5.5 (metric 20) from 213.12.12.12 (213.12.12.12)
      Origin incomplete, metric 0, localpref 100, valid, internal, best
      Community: no-export
      Extended Community: RT:213:5 RT:213:500
      Originator: 213.5.5.5, Cluster list: 213.12.12.12
      PMSI Attribute: Flags: 0x0, Tunnel type: 2, length 24, label: exp-null, tunnel parameters: 0600 0104 D505 0505 000E 0200 0B00 0213 0000 1512 0000 0001
      rx pathid: 0, tx pathid: 0x0

We will create the red VPN locally on CSR6 and CSR8. This is the RSC method we used for PIM/GRE, which implies we configure the source VPN on the receiver PE. The VPN ID will be 213:1512 and for cleanliness, we will remove the red and violet RT import statements from VRF S on CSR6 and CSR8, respectively. We will import these into the red VPN (N) only. Between the violet and red RTs, both CSR6 and CSR8 should be part of this mLDP MP2MP tree. We can check for C-PIM neighbors in each router. CSR6 and CSR8 also form neighbors with one another, as expected, but of greatest importance are the neighborships with CSR5.

! CSR6
vrf definition N
 rd 213:66                      ! On CSR8, this is 213:88 instead
 vpn id 213:1512
 route-target import 213:5      ! On CSR8, this is 213:500 instead
 address-family ipv4
  mdt default mpls mldp 213.14.14.14
 address-family ipv6
  mdt default mpls mldp 213.14.14.14
vrf definition S
 no route-target import 213:5   ! On CSR8, this is 213:500 instead


R6#show ip pim vrf N neighbor | begin ^Neighbor
Neighbor          Interface                Uptime/Expires    Ver   DR
Address                                                            Prio/Mode
213.8.8.8         Lspvif10                 00:04:16/00:01:27 v2    1 / B S P G
213.12.12.12      Lspvif10                 00:06:24/00:01:31 v2    1 / DR B P G
213.1.1.1         Lspvif10                 00:06:26/00:01:23 v2    1 / B S P G
213.5.5.5         Lspvif10                 00:06:26/00:01:24 v2    1 / B S P G

R8#show ip pim vrf N neighbor | begin ^Neighbor
Neighbor          Interface                Uptime/Expires    Ver   DR
Address                                                            Prio/Mode
213.12.12.12      Lspvif1                  00:04:07/00:01:38 v2    1 / DR B P G
213.6.6.6         Lspvif1                  00:04:09/00:01:31 v2    1 / B S P G
213.5.5.5         Lspvif1                  00:04:09/00:01:31 v2    1 / B S P G
213.1.1.1         Lspvif1                  00:04:09/00:01:31 v2    1 / B S P G

Examining the C-MRIB entries, we see that traffic arrives into the proxy red VPN (N) and is piped in software to the blue VPN (S). The red entry shows both the big ‘Y’ and big ‘E’ flags, indicating that traffic was received from an MDT data group and that the entry has extranet receivers. We never saw these two flags together in the PIM/GRE method because the traditional RSC/SSC extranet configuration was not needed when using S-PMSI. Also notice that, so far, it doesn’t seem to matter whether we used the violet (CSR8) or red (CSR6) RTs for pulling in these BGP-AD and VPNv4 unicast routes.

R8#show ip mroute vrf N 232.0.0.7 10.5.10.10 | begin \(
(10.5.10.10, 232.0.0.7), 00:03:43/stopped, flags: sTYE
  Incoming interface: Lspvif1, RPF nbr 213.5.5.5, MDT: [1, 213.5.5.5]/never
  Outgoing interface list: Null
  Extranet receivers in vrf S:
    (10.5.10.10, 232.0.0.7), 11:29:51/00:02:11, OIF count: 1, flags: sTI

R8#show ip mroute vrf S 232.0.0.7 10.5.10.10 | begin \(
(10.5.10.10, 232.0.0.7), 11:29:55/00:02:07, flags: sTI
  Incoming interface: Lspvif1, RPF nbr 213.5.5.5, using vrf N
  Outgoing interface list:
    GigabitEthernet2.538, Forward/Sparse, 11:29:55/00:02:07

R6#show ip mroute vrf N 232.0.0.7 10.5.10.10 | begin \(
(10.5.10.10, 232.0.0.7), 00:01:35/00:01:24, flags: sTYE
  Incoming interface: Lspvif10, RPF nbr 213.5.5.5, MDT: [1, 213.5.5.5]/never
  Outgoing interface list: Null
  Extranet receivers in vrf S:
    (10.5.10.10, 232.0.0.7), 20:30:55/00:02:59, OIF count: 1, flags: sTI

R6#show ip mroute vrf S 232.0.0.7 10.5.10.10 | begin \(
(10.5.10.10, 232.0.0.7), 20:30:49/00:02:04, flags: sTI
  Incoming interface: Lspvif10, RPF nbr 213.5.5.5, using vrf N
  Outgoing interface list:
    GigabitEthernet2.546, Forward/Sparse, 20:30:49/00:02:04

Since these routers claim to have received data MDT information, we check their caches to confirm. This would have been derived from the Type-3 S-PMSI route, which both routers have imported into VRF N.

R8#show ip pim vrf N mdt receive
Joined MDT-data [group/mdt number : source]  uptime/expires for VRF: N
 [1 : 213.5.5.5]  10:54:30/stopped

R8#show bgp ipv4 mvpn vrf N | include \[3\]
 *>i [3][213:88][10.5.10.10][232.0.0.7][213.5.5.5]/22

R6#show ip pim vrf N mdt receive
Joined MDT-data [group/mdt number : source]  uptime/expires for VRF: N
 [1 : 213.5.5.5]  10:57:49/stopped

R6#show bgp ipv4 mvpn vrf N | include \[3\]
 *>i [3][213:66][10.5.10.10][232.0.0.7][213.5.5.5]/22

The RPF mechanism for this traffic is identical to how it was for PIM/GRE. CSR6 and CSR8 have group-based RPF selection policies inside VRF S (blue) that specify certain groups can be RPF-checked in VRF N (red). The configuration is identical on CSR6 and CSR8, so only one is shown.

R6#show ip rpf vrf S select
Multicast Group-to-Vrf Mappings
Group(s): 232.0.0.7/32, RPF vrf: N, Acl: ACL_N_GROUPS
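Conceptually, the group-to-VRF mapping acts as a lookup that redirects the RPF check for matching groups into another VRF. The sketch below models only that idea and is not the router's implementation: the first matching entry wins here, which is an assumption, and the function name and data layout are hypothetical.

```python
import ipaddress

def rpf_vrf(group: str, own_vrf: str, mappings) -> str:
    """Return the VRF whose unicast table is used for the RPF check of a group."""
    addr = ipaddress.ip_address(group)
    for prefix, vrf in mappings:
        if addr in ipaddress.ip_network(prefix):
            return vrf  # first match wins in this sketch
    return own_vrf      # no mapping: RPF is checked in the receiver's own VRF

# Mapping taken from "show ip rpf vrf S select" on CSR6
MAPPINGS_VRF_S = [("232.0.0.7/32", "N")]

print(rpf_vrf("232.0.0.7", "S", MAPPINGS_VRF_S))  # extranet group, checked in N
print(rpf_vrf("232.0.0.5", "S", MAPPINGS_VRF_S))  # intranet group, checked in S
```

This mirrors the "using vrf N" annotation in the VRF S mroute entries for (10.5.10.10, 232.0.0.7): only groups covered by the mapping are RPF-checked against the red VPN.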

Knowing that the MDT number is 1, we can check their mLDP databases to determine the incoming label for each one. This will be used for the final data-plane verifications as well. CSR8 maps this entry to LSPvif1 while CSR6 selects LSPvif10; in both cases, this matches with the virtual interface assigned to this VPN as shown in the PIM neighbor show commands. CSR8 uses local downstream label 8013 while CSR6 uses 6013. As a sanity check, we also see the root is CSR5 and the type is P2MP, which is correct. R8#show mpls mldp database opaque_type mdt 213:1512 1 LSM ID : E Type: P2MP Uptime : 00:08:29 FEC Root : 213.5.5.5 Opaque decoded : [mdt 213:1512 1] Opaque length : 11 bytes Opaque value : 02 000B 0002130000151200000001 Upstream client(s) : 213.14.14.14:0 [Active] Expires : Never Path Set ID : E Out Label (U) : None Interface : GigabitEthernet2.584* Local Label (D): 8013 Next Hop : 213.8.14.14 Replication client(s):

1211 © 2016 Nicholas J. Russo

    MDT  (VRF N)
      Uptime         : 00:08:29      Path Set ID  : None
      Interface      : Lspvif1

R6#show mpls mldp database opaque_type mdt 213:1512 1
LSM ID : 13   Type: P2MP   Uptime : 00:09:50
  FEC Root           : 213.5.5.5
  Opaque decoded     : [mdt 213:1512 1]
  Opaque length      : 11 bytes
  Opaque value       : 02 000B 0002130000151200000001
  Upstream client(s) :
    213.13.13.13:0    [Active]
      Expires        : Never         Path Set ID  : 13
      Out Label (U)  : None          Interface    : GigabitEthernet2.563*
      Local Label (D): 6013          Next Hop     : 213.6.13.13
  Replication client(s):
    MDT  (VRF N)
      Uptime         : 00:09:50      Path Set ID  : None
      Interface      : Lspvif10
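The "Opaque value" and "Opaque decoded" fields in these outputs can be related by hand. The following Python sketch shows one way to do it; the field layout (1-byte type, 2-byte length, then a 3-byte OUI, 4-byte VPN index, and 4-byte MDT number) is an assumption inferred from the outputs in this section, and the function name decode_mdt_opaque is hypothetical.

```python
def decode_mdt_opaque(text):
    """Decode an mLDP MDT opaque value as printed by the router.

    Assumed layout (inferred from the show output, not from a spec):
    type (1 byte) | length (2 bytes) | OUI (3) | VPN index (4) | MDT number (4)
    """
    raw = bytes.fromhex(text.replace(" ", ""))
    opaque_type = raw[0]
    length = int.from_bytes(raw[1:3], "big")
    value = raw[3:3 + length]
    if opaque_type != 2 or len(value) != 11:
        raise ValueError("not an 11-byte type-2 MDT opaque value")
    oui = int.from_bytes(value[0:3], "big")
    index = int.from_bytes(value[3:7], "big")
    mdt_number = int.from_bytes(value[7:11], "big")
    # The router prints the VPN ID in hex without leading zeros.
    return f"{oui:x}:{index:x}", mdt_number


# Data MDT (number 1) and default MDT (number 0) from this section.
print(decode_mdt_opaque("02 000B 0002130000151200000001"))
print(decode_mdt_opaque("02 000B 0002130000067800000000"))
```

Under these assumptions the first value decodes to VPN ID 213:1512 with MDT number 1, matching the "Opaque decoded" line.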

It is worth mentioning that although these P2MP trees will be used for delivery, both CSR6 and CSR8 are part of the red MP2MP default MDT. Without it, there would be no C-PIM overlay signaling.

R6#show mpls mldp database summary | include MP2MP
9       MP2MP  213.13.13.13  [mdt 213:678 0]   1
10      MP2MP  213.14.14.14  [mdt 213:1512 0]  1

R8#show mpls mldp database summary | include MP2MP
1       MP2MP  213.13.13.13  [mdt 213:678 0]   1
B       MP2MP  213.14.14.14  [mdt 213:1512 0]  1

A quick verification on CSR5 shows nothing unusual; the C-MRIB entry sends traffic down a data MDT (little ‘y’). The “verbose” keyword shows us the MDT details that are bound to this C(S,G).

R5#show ip mroute vrf N 232.0.0.7 10.5.10.10 verbose | begin \(
(10.5.10.10, 232.0.0.7), 00:18:49/00:02:51, flags: sTyp
  Incoming interface: GigabitEthernet2.550, RPF nbr 0.0.0.0
  MDT TX nr: 1 LSM-ID: 0x7
  Outgoing interface list:
    Lspvif0, LSM MDT: 7 (data), Forward/Sparse, 11:47:39/00:02:51, p

CSR10 begins sending traffic. CSR5 replicates the traffic to XRv3 and XRv4 using labels 93015 and 94009 respectively. Both of these core routers pass it down the tree to their receivers. Notice that label 1012 is in the replication list for XRv4; the intranet receivers expressing interest in this C(S,G) still receive the traffic flow.

R5#show mpls mldp database opaque_type mdt 213:1512 1 | section Replicat
  Replication client(s):
    MDT  (VRF N)
      Uptime         : 11:21:41      Path Set ID  : None
      Interface      : Lspvif0
    213.14.14.14:0
      Uptime         : 11:21:41      Path Set ID  : None
      Out label (D)  : 94009         Interface    : GigabitEthernet2.554*
      Local label (U): None          Next Hop     : 213.5.14.14
    213.13.13.13:0
      Uptime         : 00:16:14      Path Set ID  : None
      Out label (D)  : 93015         Interface    : GigabitEthernet2.553*
      Local label (U): None          Next Hop     : 213.5.13.13

RP/0/0/CPU0:XRv3#show mpls forwarding labels 93015
Local  Outgoing    Prefix             Outgoing      Next Hop      Bytes
Label  Label       or ID              Interface                   Switched
------ ----------- ------------------ ------------- ------------- ------------
93015  6013        MLDP: 0x00006      Gi0/0/0/0.563 213.6.13.6    0

RP/0/0/CPU0:XRv4#show mpls forwarding labels 94009
Local  Outgoing    Prefix             Outgoing      Next Hop      Bytes
Label  Label       or ID              Interface                   Switched
------ ----------- ------------------ ------------- ------------- ------------
94009  1012        MLDP: 0x00002      Gi0/0/0/0.514 213.1.14.1    0
       8013        MLDP: 0x00002      Gi0/0/0/0.584 213.8.14.8    0

Next, we check the C-MRIB counters on CSR6 and CSR8 within both the red and blue VPNs. Traffic counters should be identical between the two if things are working properly. We also show the red VPN counters on CSR1 to prove that intranet connectivity is not broken.

R8#show ip mroute vrf N 232.0.0.7 10.5.10.10 count | begin ^Group
Group: 232.0.0.7, Source count: 1, Packets forwarded: 37, Packets received: 37
  Source: 10.5.10.10/32, Forwarding: 37/1/122/0, Other: 37/0/0
R8#show ip mroute vrf S 232.0.0.7 10.5.10.10 count | begin ^Group
Group: 232.0.0.7, Source count: 1, Packets forwarded: 37, Packets received: 37
  Source: 10.5.10.10/32, Forwarding: 37/1/122/0, Other: 37/0/0

R6#show ip mroute vrf N 232.0.0.7 10.5.10.10 count | begin ^Group
Group: 232.0.0.7, Source count: 1, Packets forwarded: 34, Packets received: 34
  Source: 10.5.10.10/32, Forwarding: 34/1/122/0, Other: 34/0/0
R6#show ip mroute vrf S 232.0.0.7 10.5.10.10 count | begin ^Group
Group: 232.0.0.7, Source count: 1, Packets forwarded: 34, Packets received: 34
  Source: 10.5.10.10/32, Forwarding: 34/1/122/0, Other: 34/0/0

R1#show ip mroute vrf N 232.0.0.7 10.5.10.10 count | begin ^Group
Group: 232.0.0.7, Source count: 1, Packets forwarded: 20, Packets received: 20
  Source: 10.5.10.10/32, Forwarding: 20/1/122/0, Other: 20/0/0

Last, we conduct EPC verifications on CSR6 and CSR8 to ensure the proper labels are entering and the ICMP multicast is being decapsulated and sent to the customer. MPLS shim headers are shown in yellow and the IP multicast headers are shown in green. We see that CSR6 receives label 0x177D (6013) and CSR8 receives label 0x1F4D (8013), which is expected. R6#show monitor capture CAP buffer detailed 0 122 0.000000 00:50:56:A9:EA:54 -> 00:50:56:A9:DE:0D MPLS unicast 0000: 005056A9 DE0D0050 56A9EA54 81000DEB .PV....PV..T.... 0010: 88470177 D1FD4500 00642176 0000FE01 .G.w..E..d!v.... 0020: 9F0C0A05 0A0AE800 00070800 57020015 ............W... 0030: 00A70000 00003C27 EA64ABCD ABCDABCD ...... 00:50:56:A9:FB:1C MPLS unicast 0000: 005056A9 FB1C0050 56A9862A 81000E00 .PV....PV..*.... 0010: 884701F4 D1FD4500 006421AE 0000FE01 .G....E..d!..... 0020: 9ED40A05 0A0AE800 00070800 7BE60015 ............{... 0030: 00DF0000 00003C28 C547ABCD ABCDABCD ...... 232.0.0.7 ICMP 81000DD2 ..^....PV....... A0D40A05 ..E..d!......... 00DF0000 ........{....... ABCDABCD .. [3][213:5][10.5.10.10][232.0.0.7][213.5.5.5]/22


!! ACL modification !!
R5#show bgp ipv4 mvpn vrf N | include \[3\]
[no output]
R5#show ip mroute vrf N 232.0.0.7 10.5.10.10 verbose | begin \(
(10.5.10.10, 232.0.0.7), 00:19:29/00:03:11, flags: sTp
  Incoming interface: GigabitEthernet2.550, RPF nbr 0.0.0.0
  Outgoing interface list:
    Lspvif0, LSM MDT: 2 (default), Forward/Sparse, 11:48:18/00:03:11, p

CSR6 and CSR8 also lose their data MDT cache information and strip the ‘Y’ from the C(S,G). Otherwise, the verification is generally the same. The red VPN (N) entries are still marked as ‘E’ for extranet, and the blue VPN (S) entries still use the red VPN for RPF lookups per the group-based selection policy.

R6#show ip mroute vrf N 232.0.0.7 10.5.10.10 | begin \(
(10.5.10.10, 232.0.0.7), 00:48:39/stopped, flags: sTE
  Incoming interface: Lspvif10, RPF nbr 213.5.5.5
  Outgoing interface list: Null
  Extranet receivers in vrf S:
    (10.5.10.10, 232.0.0.7), 21:17:59/00:02:55, OIF count: 1, flags: sTI
R6#show ip mroute vrf S 232.0.0.7 10.5.10.10 | begin \(
(10.5.10.10, 232.0.0.7), 21:18:03/00:02:51, flags: sTI
  Incoming interface: Lspvif10, RPF nbr 213.5.5.5, using vrf N
  Outgoing interface list:
    GigabitEthernet2.546, Forward/Sparse, 21:18:03/00:02:51
R6#show ip pim vrf N mdt receive
[no output]

R8#show ip mroute vrf N 232.0.0.7 10.5.10.10 | begin \(
(10.5.10.10, 232.0.0.7), 00:49:04/stopped, flags: sTE
  Incoming interface: Lspvif1, RPF nbr 213.5.5.5
  Outgoing interface list: Null
  Extranet receivers in vrf S:
    (10.5.10.10, 232.0.0.7), 12:15:12/00:02:52, OIF count: 1, flags: sTI
R8#show ip mroute vrf S 232.0.0.7 10.5.10.10 | begin \(
(10.5.10.10, 232.0.0.7), 12:15:12/00:02:52, flags: sTI
  Incoming interface: Lspvif1, RPF nbr 213.5.5.5, using vrf N
  Outgoing interface list:
    GigabitEthernet2.538, Forward/Sparse, 12:15:12/00:02:52
R8#show ip pim vrf N mdt receive
[no output]


We know that an mLDP default MDT always has MDT ID 0, so we can check the mLDP database to determine the downstream local labels. Also note that there are upstream labels in the MP2MP tree, though we don’t care about that for this test. CSR6 uses label 6012 and CSR8 uses label 8011.

R6#show mpls mldp database opaque_type mdt 213:1512 0 | section Upstream
  Upstream client(s) :
    213.13.13.13:0    [Active]
      Expires        : Never         Path Set ID  : 10
      Out Label (U)  : 93013         Interface    : GigabitEthernet2.563*
      Local Label (D): 6012          Next Hop     : 213.6.13.13

R8#show mpls mldp database opaque_type mdt 213:1512 0 | section Upstream
  Upstream client(s) :
    213.14.14.14:0    [Active]
      Expires        : Never         Path Set ID  : B
      Out Label (U)  : 94014         Interface    : GigabitEthernet2.584*
      Local Label (D): 8011          Next Hop     : 213.8.14.14

CSR10 begins sending traffic. We will check the red/blue VPN counters on CSR6 and CSR8 as we always have. Additionally, we check CSR1 again to ensure intranet MVPN is functional.

R6#show ip mroute vrf N 232.0.0.7 10.5.10.10 count | begin ^Group
Group: 232.0.0.7, Source count: 1, Packets forwarded: 86, Packets received: 86
  Source: 10.5.10.10/32, Forwarding: 86/1/122/0, Other: 86/0/0
R6#show ip mroute vrf S 232.0.0.7 10.5.10.10 count | begin ^Group
Group: 232.0.0.7, Source count: 1, Packets forwarded: 86, Packets received: 86
  Source: 10.5.10.10/32, Forwarding: 86/1/122/0, Other: 86/0/0

R8#show ip mroute vrf N 232.0.0.7 10.5.10.10 count | begin ^Group
Group: 232.0.0.7, Source count: 1, Packets forwarded: 10, Packets received: 10
  Source: 10.5.10.10/32, Forwarding: 10/1/122/0, Other: 10/0/0
R8#show ip mroute vrf S 232.0.0.7 10.5.10.10 count | begin ^Group
Group: 232.0.0.7, Source count: 1, Packets forwarded: 10, Packets received: 10
  Source: 10.5.10.10/32, Forwarding: 10/1/122/0, Other: 10/0/0

R1#show ip mroute vrf N 232.0.0.7 10.5.10.10 count | begin ^Group
Group: 232.0.0.7, Source count: 1, Packets forwarded: 126, Packets received: 126
  Source: 10.5.10.10/32, Forwarding: 126/1/122/0, Other: 126/0/0


Finally, we confirm proper operation using EPC on CSR6 and CSR8. CSR6 receives label 0x177C (6012) and CSR8 receives label 0x1F4B (8011), which is correct for the default MDT in the red VPN. For brevity, we use the IOS parser to tell us what the payload is for the IP packet. R6#show monitor cap CAP buffer detailed 1 122 0.241992 00:50:56:A9:EA:54 -> 00:50:56:A9:DE:0D MPLS unicast 0000: 005056A9 DE0D0050 56A9EA54 81000DEB .PV....PV..T.... 0010: 88470177 C1FC4500 0064259C 0000FE01 .G.w..E..d%..... 0020: 9AE60A05 0A0AE800 00070800 26050016 ............&... 0030: 00AD0000 00003C47 1B3BABCD ABCDABCD ...... 232.0.0.7 ICMP 81000DDA ..^....PV....... 9DE60A05 ..E..d%......... 00AD0000 ........&....... ABCDABCD .. 00:50:56:A9:FB:1C MPLS unicast 0000: 005056A9 FB1C0050 56A9862A 81000E00 .PV....PV..*.... 0010: 884701F4 B1FD4500 00642601 0000FE01 .G....E..d&..... 0020: 9A810A05 0A0AE800 00070800 9AC90016 ................ 0030: 01120000 00003C48 A610ABCD ABCDABCD ...... 232.0.0.7 ICMP 81000DD2 ..^....PV....... 9C810A05 ..E..d&......... 01120000 ................ ABCDABCD .. 00:50:56:A9:EA:54 MPLS unicast 0000: 005056A9 EA540050 56A9EA77 81000DF5 .PV..T.PV..w.... 0010: 884716B5 31FE4500 00646365 0000FE01 .G..1.E..dce.... 0020: 602A0A02 0702E800 00050800 9BA1001E `*.............. 0030: 00E10000 00003CD3 A4D6ABCD ABCDABCD ...... 00:50:56:A9:86:2A MPLS unicast 0000: 005056A9 862A0050 56A9EA77 81000DF6 .PV..*.PV..w.... 0010: 884716F3 F1FE4500 00646365 0000FE01 .G....E..dce.... 0020: 602A0A02 0702E800 00050800 9BA1001E `*.............. 0030: 00E10000 00003CD3 A4D6ABCD ABCDABCD ...... 00:50:56:A9:86:2A MPLS unicast 0000: 005056A9 862A0050 56A9EA77 81000DF6 .PV..*.PV..w.... 0010: 884716F4 01FE4500 00646365 0000FE01 .G....E..dce.... 0020: 602A0A02 0702E800 00050800 9BA1001E `*.............. 0030: 00E10000 00003CD3 A4D6ABCD ABCDABCD ...... 00:50:56:A9:1A:AA MPLS unicast 0000: 005056A9 1AAA0050 56A9862A 81000DBA .PV....PV..*.... 0010: 8847003F 41FD4500 0064646D 0000FE01 .G.?A.E..ddm.... 
0020: 5F220A02 0702E800 00050800 929B001E _"..............
0030: 01E90000 00003CD7 ACD0ABCD ABCDABCD ......
232.0.0.5 ICMP 81000DBF ..^....PV....... 61220A02 ..E..ddm....a".. 01E90000 ................ ABCDABCD ..i [1][213:66][213.5.5.5]/12 213.5.5.5 0 100 0 ?
 *>i [3][213:66][2001:10:5:10::10][FF33::2][213.5.5.5]/46 213.5.5.5 0 100 0 ?

R6#show bgp vpnv6 unicast vrf N | begin Network
     Network            Next Hop           Metric LocPrf Weight Path
Route Distinguisher: 213:66 (default for vrf N)
 *>i 2001:10:5:10::/64  ::FFFF:213.5.5.5        0    100      0 ?

R6#show ipv6 pim vrf N neighbor
PIM Neighbor Table
Mode: B - Bidir Capable, G - GenID Capable
Neighbor Address       Interface  Uptime    Expires   Mode  DR pri
::FFFF:213.1.1.1       Lspvif1    00:14:45  00:01:19  B G   1
::FFFF:213.5.5.5       Lspvif1    00:14:45  00:01:20  B G   1
::FFFF:213.8.8.8       Lspvif1    00:14:46  00:01:19  B G   1
::FFFF:213.12.12.12    Lspvif1    00:14:45  00:01:35  B G   DR 1

There is no point in sending traffic from CSR10 since we know the P2MP trees won’t be built. If they were, CSR6 would show a P2MP tree rooted at 213.5.5.5 with the VPN ID 213:1512 inside the opaque. Even a router reload on CSR6 and CSR8 (to potentially fix any order-of-operations issues with the configuration) did not resolve the issue.

R6#show mpls mldp database summary
LSM ID  Type   Root          Decoded Opaque Value  Client Cnt.
5       P2MP   213.7.7.7     [mdt 213:678 1]       1
1       MP2MP  213.13.13.13  [mdt 213:678 0]       1
3       MP2MP  213.14.14.14  [mdt 213:1512 0]      1

However, IPv6 appears to work somewhat better with SSC: the data plane still does not work, but the control plane looks healthier. There is no S-PMSI offered for (2001:10:2:7::2, FF33::1) at present, but we can verify the extranet connectivity using the default MDT. CSR7 originates an IPv6 Type-1 I-PMSI route within the blue VPN (N) and advertises it to CSR1. We know this is for the blue VPN since the VPN ID is 213:1512. CSR1 is able to install this and join the default MDT. We can see the C-PIM neighbor with CSR7. We can also see the tree is MP2MP (type 7) and is rooted at XRv4. Notice the correct “violet” RT:213:700; this is because the route was originated locally from CSR7 inside VRF N.

R1#show bgp ipv6 mvpn vrf N route-type 1 213.7.7.7
BGP routing table entry for [1][213:1][213.7.7.7]/12, version 100
Paths: (1 available, best #1, table MVPNV6-BGP-Table, not advertised to EBGP peer)
  Not advertised to any peer
  Refresh Epoch 1
  Local, imported path from [1][213:77][213.7.7.7]/12 (global)
    213.7.7.7 (metric 20) from 213.12.12.12 (213.12.12.12)
      Origin incomplete, metric 0, localpref 100, valid, internal, best
      Community: no-export
      Extended Community: RT:213:700
      Originator: 213.7.7.7, Cluster list: 213.12.12.12
      PMSI Attribute: Flags: 0x0, Tunnel type: 7, length 24, label: exp-null, tunnel parameters: 0800 0104 D50E 0E0E 000E 0200 0B00 0213 0000 1512 0000 0000
      rx pathid: 0, tx pathid: 0x0

R1#show ipv6 pim vrf N neighbor | include 7.7.7
::FFFF:213.7.7.7       Lspvif0    00:38:11  00:01:37  B G   1
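The 24 bytes of tunnel parameters in the PMSI attribute can be unpacked to confirm the root address and the VPN ID. The Python sketch below assumes the bytes follow the mLDP FEC element layout (1-byte FEC type, 2-byte address family, 1-byte address length, root address, 2-byte opaque length, opaque value); the function name decode_mldp_fec is hypothetical and the layout should be treated as an assumption checked only against this one output.

```python
import ipaddress


def decode_mldp_fec(text):
    """Split PMSI tunnel parameters into FEC type, AFI, root, and opaque bytes.

    Assumed layout: type (1) | AFI (2) | addr len (1) | root | opaque len (2) | opaque
    """
    raw = bytes.fromhex(text.replace(" ", ""))
    fec_type = raw[0]
    afi = int.from_bytes(raw[1:3], "big")
    addr_len = raw[3]
    root = ipaddress.ip_address(raw[4:4 + addr_len])
    opaque_len = int.from_bytes(raw[4 + addr_len:6 + addr_len], "big")
    opaque = raw[6 + addr_len:6 + addr_len + opaque_len]
    return fec_type, afi, str(root), opaque.hex()


params = "0800 0104 D50E 0E0E 000E 0200 0B00 0213 0000 1512 0000 0000"
print(decode_mldp_fec(params))
```

With these assumptions the root decodes to 213.14.14.14 (XRv4), and the trailing opaque bytes carry the same type-2 MDT value (VPN ID 213:1512, MDT number 0) seen in the mLDP database.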

Checking the C-MRIB entries on CSR7, we see traffic arriving from VRF S, being forwarded to VRF N (big ‘E’ for extranet), then being encapsulated inside MPLS.

R7#show ipv6 mroute vrf S FF33::1 2001:10:2:7::2 | begin \(
(2001:10:2:7::2, FF33::1), 02:24:20/00:03:11, flags: sTE
  Incoming interface: GigabitEthernet2.527
  RPF nbr: 2001:10:2:7::2
  Immediate Outgoing interface list:
    Lspvif0, Forward, 02:24:20/00:03:11
  Extranet receivers in vrf N, OIF count 1

R7#show ipv6 mroute vrf N FF33::1 2001:10:2:7::2 | begin \(
(2001:10:2:7::2, FF33::1), 00:40:44/00:02:47, flags: sT
  Incoming interface: GigabitEthernet2.527
  RPF nbr: 2001:10:2:7::2, using vrf S
  Immediate Outgoing interface list:
    Lspvif1, Forward, 00:40:44/00:02:47

Notice that the two entries have different LSPvif interfaces; the one in VRF S is for red receivers (intranet) and the one in VRF N is for blue receivers (extranet). Two outputs are shown below; the first is for intranet and the second is for extranet. These are the two default MDTs for the red and blue VPNs respectively. We note the upstream labels for each since CSR7 will be replicating traffic for each VPN in this way. Unlike for IPv4, we will later see two packets, not three, since we are using the MP2MP tree. XRv4 will perform any downstream replication needed as CSR7’s only job is to get the traffic to the MP2MP root.

R7#show mpls mldp database opaque_type mdt 213:678 0
LSM ID : 1 (RNR LSM ID: 2)   Type: MP2MP   Uptime : 04:04:11
  FEC Root           : 213.13.13.13
  Opaque decoded     : [mdt 213:678 0]
  Opaque length      : 11 bytes
  Opaque value       : 02 000B 0002130000067800000000
  RNR active LSP     : (this entry)
  Upstream client(s) :
    213.13.13.13:0    [Active]
      Expires        : Never         Path Set ID  : 1
      Out Label (U)  : 93010         Interface    : GigabitEthernet2.573*
      Local Label (D): 7009          Next Hop     : 213.7.13.13
  Replication client(s):
    MDT  (VRF S)
      Uptime         : 04:04:11      Path Set ID  : 2
      Interface      : Lspvif0

R7#show mpls mldp database opaque_type mdt 213:1512 0
LSM ID : C (RNR LSM ID: D)   Type: MP2MP   Uptime : 00:46:19
  FEC Root           : 213.14.14.14
  Opaque decoded     : [mdt 213:1512 0]
  Opaque length      : 11 bytes
  Opaque value       : 02 000B 0002130000151200000000
  RNR active LSP     : (this entry)
  Upstream client(s) :
    213.14.14.14:0    [Active]
      Expires        : Never         Path Set ID  : C
      Out Label (U)  : 94009         Interface    : GigabitEthernet2.574*
      Local Label (D): 7011          Next Hop     : 213.7.14.14
  Replication client(s):
    MDT  (VRF N)
      Uptime         : 00:46:19      Path Set ID  : D
      Interface      : Lspvif1

CSR1 also reports a valid C(S,G) entry for this flow. Notice there are no ‘y’ or ‘Y’ flags as this is using the I-PMSI P-tunnel, which is mLDP MP2MP. Because we know the characteristics of the blue default MDT, we can query the mLDP database to find the incoming local label. That value is 1009.

R1#show ipv6 mroute vrf N FF33::1 2001:10:2:7::2 | begin \(
(2001:10:2:7::2, FF33::1), 23:15:33/never, flags: sTI
  Incoming interface: Lspvif0
  RPF nbr: ::FFFF:213.7.7.7
  Immediate Outgoing interface list:
    GigabitEthernet2.519, Forward, 23:15:33/never

R1#show mpls mldp database opaque_type mdt 213:1512 0 | section Upstream
  Upstream client(s) :
    213.14.14.14:0    [Active]
      Expires        : Never         Path Set ID  : 1
      Out Label (U)  : 94007         Interface    : GigabitEthernet2.514*
      Local Label (D): 1009          Next Hop     : 213.1.14.14

Next, we start sending traffic on CSR2. We check the C-MRIB counters on CSR7 to ensure traffic is being sent into the MPLS core as LSM.

R7#show ipv6 mroute vrf S FF33::1 2001:10:2:7::2 count | begin ^Group
Group: FF33::1
  Source: 2001:10:2:7::2,
    SW Forwarding: 0/0/0/0, Other: 0/0/0
    HW Forwarding: 8/0/118/0, Other: 0/0/0
  Totals - Source count: 1, Packet count: 8
R7#show ipv6 mroute vrf N FF33::1 2001:10:2:7::2 count | begin ^Group
Group: FF33::1
  Source: 2001:10:2:7::2,
    SW Forwarding: 0/0/0/0, Other: 0/0/0
    HW Forwarding: 8/0/118/0, Other: 0/0/0
  Totals - Source count: 1, Packet count: 8

Quickly checking the packet counters on all receivers, intranet and extranet, confirms broken extranet operation. Packets arrive as expected to the intranet receivers, but not to the extranet ones.

R6#show ipv6 mroute vrf S FF33::1 2001:10:2:7::2 count | begin ^Group
Group: FF33::1
  Source: 2001:10:2:7::2,
    SW Forwarding: 0/0/0/0, Other: 0/0/0
    HW Forwarding: 401/1/126/0, Other: 0/0/0
  Totals - Source count: 1, Packet count: 401
R8#show ipv6 mroute vrf S FF33::1 2001:10:2:7::2 count | begin ^Group
Group: FF33::1
  Source: 2001:10:2:7::2,
    SW Forwarding: 0/0/0/0, Other: 0/0/0
    HW Forwarding: 387/1/126/0, Other: 0/0/0
  Totals - Source count: 1, Packet count: 387
R1#show ipv6 mroute vrf N FF33::1 2001:10:2:7::2 count | begin ^Group
Group: FF33::1
  Source: 2001:10:2:7::2,
    SW Forwarding: 0/0/0/0, Other: 0/0/0
    HW Forwarding: 0/0/0/0, Other: 0/0/0
  Totals - Source count: 1, Packet count: 0

We already verified the control-plane on CSR7, but EPC clearly shows that CSR7 is only sending traffic upstream to XRv3 along the red default MDT. Three packets, paced one second apart, all have the same label 93010 (0x16B52) and the destination MAC of XRv3.

R7#show ip arp 213.7.13.13
Protocol  Address       Age (min)  Hardware Addr   Type  Interface
Internet  213.7.13.13   37         0050.56a9.ea54  ARPA  Giga2.573

R7#show monitor capture CAP buffer detailed
 0  126  0.000000  00:50:56:A9:EA:77 -> 00:50:56:A9:EA:54  MPLS unicast
  0000: 005056A9 EA540050 56A9EA77 81000DF5   .PV..T.PV..w....
  0010: 884716B5 203F0000 213F6000 0000003C   .G.. ?..!?`....<
  0020: 3A3F2001 00100002 00070000 00000000   :? .............
  0030: 0002FF33 00000000 00000000 00000000   ...3............
 1  126  1.000000  00:50:56:A9:EA:77 -> 00:50:56:A9:EA:54  MPLS unicast
  0000: 005056A9 EA540050 56A9EA77 81000DF5   .PV..T.PV..w....
  0010: 884716B5 203F0000 213F6000 0000003C   .G.. ?..!?`....<
  0020: 3A3F2001 00100002 00070000 00000000   :? .............
  0030: 0002FF33 00000000 00000000 00000000   ...3............
 2  126  2.001007  00:50:56:A9:EA:77 -> 00:50:56:A9:EA:54  MPLS unicast
  0000: 005056A9 EA540050 56A9EA77 81000DF5   .PV..T.PV..w....
  0010: 884716B5 203F0000 213F6000 0000003C   .G.. ?..!?`....<
  0020: 3A3F2001 00100002 00070000 00000000   :? .............
  0030: 0002FF33 00000000 00000000 00000000   ...3............
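To read the labels out of a capture like the one above without doing the hex arithmetic by hand, each 32-bit MPLS shim header can be split into its four fields: a 20-bit label, 3-bit EXP, 1-bit bottom-of-stack flag, and 8-bit TTL. The Python sketch below does this for the two shim words that follow EtherType 0x8847 in the dump; the function names are hypothetical.

```python
def decode_shim(word):
    """Split one 32-bit MPLS shim header into (label, exp, bottom_of_stack, ttl)."""
    label = word >> 12
    exp = (word >> 9) & 0x7
    bottom = (word >> 8) & 0x1
    ttl = word & 0xFF
    return label, exp, bottom, ttl


def decode_stack(hex_words):
    """Decode shim headers in order, stopping at the bottom-of-stack entry."""
    stack = []
    for text in hex_words:
        entry = decode_shim(int(text, 16))
        stack.append(entry)
        if entry[2]:
            break
    return stack


# The eight bytes after EtherType 0x8847 in the CSR7 capture: 16B5203F 0000213F
for label, exp, bottom, ttl in decode_stack(["16B5203F", "0000213F"]):
    print(f"label={label} exp={exp} S={bottom} ttl={ttl}")
```

The first entry decodes to label 93010, the upstream label of the red default MDT. The second is label 2, the reserved IPv6 explicit-null value, which is consistent with the "label: exp-null" field in the PMSI attribute. The same function applied to 0x0177D1FD from the CSR6 data MDT capture gives label 6013.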

Clearing several software processes did not resolve the issue. A reload of CSR7 still did not resolve it. I believe the configuration is correct and all control-plane verifications are accurate.

Additional Reading – Reference configurations “mvpn-extranet-mldp”

30. Describe, implement, and troubleshoot MPLS QoS models and related features

The subsections on MPLS QoS all use the same set of diagrams. The diagram changes slightly for cosmetic reasons when demonstrating IOS XR QoS but the general network remains the same. The diagram is universal for the next three sections.

The configurations below include both IOS and IOS-XE configuration files for all 3 MPLS QoS models, plus QPPB.

Additional Reading – Reference configurations “mpls-qos” and “gns3-mpls-qos”

30.1 Uniform

Uniform is a QoS model whereby changes in the SP network are propagated to the customers. Using an MPLS example, if a customer sends traffic into the network as DSCP AF41, the SP will likely mark this as EXP4 for transport. Let’s assume that this traffic was remarked to EXP3 in the core, likely due to policing. This value applies only to the topmost shim header as the rest of the label stack is generally not evaluated/seen by an LSR (the exception would be load sharing for non-IP traffic using the bottom label). Therefore, when the penultimate hop removes the topmost label to expose the VPN label (assuming a basic 2-label VPN design) to the egress PE (CSR2), the exposed label will not carry this value. The egress PE will be ignorant of any change in the SP network. This is the primary use-case for enabling explicit-null on a PE; it effectively disables PHP to allow those EXP modifications to be maintained.

After receiving an explicit-null labeled packet, the egress PE cannot automatically carry this value into an IP TOS mapping since the label stack is stripped when the routing decision is made. This happens before egress QoS in the order of operations; a new placeholder is needed to match the EXP on ingress and then set the IPP/DSCP on egress. This is the “qos-group” construct. A brute-force method would be the following applied to the ingress (core-facing) links:

! PE routers
class-map match-all CMAP_EXP0
 match mpls experimental topmost 0
class-map match-all CMAP_EXP1
 match mpls experimental topmost 1
[snip]
class-map match-all CMAP_EXP7
 match mpls experimental topmost 7
policy-map PMAP_IN_FROM_CORE
 class CMAP_EXP0
  set qos-group 0
 class CMAP_EXP1
  set qos-group 1
 [snip]
 class CMAP_EXP7
  set qos-group 7

While this works, it is configuration intensive and redundant. A more elegant and dynamic mechanism exists; we can simply tell the router to perform a direct copy from EXP to QG. It has the same effect as the 8 class-maps above. The drawback of this approach is described later with regard to discard-class-based WRED. Note: The configurations suffixed “gns3” use this semi-dynamic method while the ones with no special suffix are CSRs using the brute force method. The end result is the same; it’s just a matter of configuration elegance and length.

! PE routers
policy-map PMAP_IN_FROM_CORE
 class class-default
  set qos-group mpls experimental topmost

Let’s pretend that QG numbers 0-7 were used for some other purpose in my network. I demonstrate a simple workaround on CSR2 using the table-map. This is NOT the same logic as the table-map used for QPPB (which is really a route-map). I can still avoid using 8 class maps and maintain some semblance of dynamic configuration while just selecting new values. On CSR2, I map EXP to QG + 50. The table-map is a generic entity just used to map numbers and is not specific to MPLS QoS. Unfortunately, Cisco clearly states in their IOS XE 3S documentation that the “table-map” is supported, but it actually isn’t (it is missing both from global configuration and from MQC constructs). You must use the “brute force” method on the CSR1000v and likely any other IOS XE 3S platform.

! PE routers
policy-map PMAP_IN_FROM_CORE
 class class-default
  set qos-group mpls experimental topmost table TM_EXP_TO_QG
table-map TM_EXP_TO_QG
 map from 0 to 50
 map from 1 to 51
 [snip]
 map from 7 to 57

Going the other way (out to the customer), it would be nice if we could have similar semi-dynamic behavior. We can also configure IPP, DSCP, or EXP to be derived from the qos-group.

policy-map PMAP_OUT_TO_CUSTOMER
 class class-default
  set precedence qos-group table TM_QG_TO_IPP
table-map TM_QG_TO_IPP
 map from 50 to 0
 map from 51 to 1
 [snip]
 map from 57 to 7
 default 0
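The two table-maps are just lookup tables, so their combined effect can be modelled in a few lines. The Python sketch below is a model only, not device behavior: it assumes the elided [snip] entries follow the stated "QG = EXP + 50" pattern, and the names EXP_TO_QG, QG_TO_IPP, ingress_from_core, and egress_to_customer are hypothetical.

```python
# Model of TM_EXP_TO_QG: EXP 0-7 maps to qos-group 50-57 (assumed pattern).
EXP_TO_QG = {exp: exp + 50 for exp in range(8)}

# Model of TM_QG_TO_IPP: qos-group 50-57 maps back to IPP 0-7, default 0.
QG_TO_IPP = {qg: qg - 50 for qg in range(50, 58)}
DEFAULT_IPP = 0


def ingress_from_core(exp):
    """Core-facing input policy: derive the qos-group from the topmost EXP."""
    return EXP_TO_QG[exp]


def egress_to_customer(qos_group):
    """Customer-facing output policy: derive IPP from the qos-group."""
    return QG_TO_IPP.get(qos_group, DEFAULT_IPP)


for exp in range(8):
    qg = ingress_from_core(exp)
    print(f"EXP{exp} -> QG{qg} -> IPP{egress_to_customer(qg)}")
```

Any qos-group outside 50-57 falls through to the "default 0" entry, which is why an unrelated qos-group value would be delivered as IPP 0 in this model.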

The downside to this approach is that we lose some granularity because our egress policy is essentially just copying EXP->IPP now. We aren’t doing any queuing or traffic conditioning. If that is desired, you can break out a few individual EXPs (in my examples I use EXP1 and EXP5, using the default for the rest) for custom treatment. Unless, of course, you just want to apply generic shaping to the default class, in which case explicit break-outs would not be required.

An unfortunate missing feature is that discard-class-based WRED cannot be dynamically mapped this way. In the configurations you will see examples of this feature where I statically set the discard-class to a number. It’s effectively like IPP-based WRED (lower numbers get worse treatment) except that, like the qos-group, you can set it inbound and perform WRED outbound. If this were supported dynamically, you would be able to provide EXP->IPP mapping and WRED with very little configuration.

A quick example of uniform mode: send traffic from CSR6 with DSCP CS2 and expect to see DSCP AF11 on CSR1.

R6#ping 12.0.0.1 tos 64 repeat 5

We can quickly check CSR2 to see it matching EXP1 on ingress, setting QG1, matching QG1 on egress, and setting DSCP AF11. CSR3 performs the EXP2->EXP1 remarking for all EXP2 transit traffic; this policy is applied to all core interfaces.

! CSR3
class-map match-all CMAP_EXP2
 match mpls experimental topmost 2
policy-map PMAP_REMARK_EXP
 class CMAP_EXP2
  set mpls experimental topmost 1

R2#show policy-map interface gig2.523 input | section EXP1
    Class-map: CMAP_EXP1 (match-all)
      5 packets, 630 bytes
      5 minute offered rate 0000 bps, drop rate 0000 bps
      Match: mpls experimental topmost 1
      QoS Set
        discard-class 1
          Marker statistics: Disabled
        qos-group 1
          Marker statistics: Disabled

R2#show policy-map interface gig2.512 output | section QG1
    Class-map: CMAP_QG1 (match-all)
      5 packets, 590 bytes
      5 minute offered rate 0000 bps, drop rate 0000 bps
      Match: qos-group 1
[snip]
      bandwidth 15% (150000 kbps)
        Exp-weight-constant: 4 (1/16)
        Mean queue depth: 0 packets
        discard-class  Transmitted  Random drop  Tail drop   Minimum  Maximum  Mark
                       pkts/bytes   pkts/bytes   pkts/bytes  thresh   thresh   prob
        0              0/0          0/0          0/0         16       32       1/10
        1              5/590        0/0          0/0         11       17       1/7
[snip]

CSR1 should report seeing DSCP AF11 now.

R1#show flow monitor FNF_MONITOR_IPV4 cache format table | begin IPV4
IPV4 SRC ADDR    IPV4 DST ADDR    INTF INPUT            IP TOS  pkts
===============  ===============  ====================  ======  ==========
46.0.0.6         12.0.0.1         Gi2.512               0x28    5
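The TOS byte that Flexible NetFlow reports is the 6-bit DSCP shifted left by two bits, and the top three bits of that byte are the IP precedence. A small Python sketch of the conversion follows; the helper names are hypothetical.

```python
def dscp_to_tos(dscp):
    """Return the 8-bit TOS byte for a 6-bit DSCP (ECN bits left as zero)."""
    if not 0 <= dscp <= 63:
        raise ValueError("DSCP must be between 0 and 63")
    return dscp << 2


def tos_to_dscp(tos):
    """Return the 6-bit DSCP carried in a TOS byte."""
    return (tos >> 2) & 0x3F


def tos_to_precedence(tos):
    """Return the 3-bit IP precedence carried in a TOS byte."""
    return (tos >> 5) & 0x7


CS2 = 0b010000   # decimal 16
AF11 = 0b001010  # decimal 10

for name, dscp in (("CS2", CS2), ("AF11", AF11)):
    tos = dscp_to_tos(dscp)
    print(f"{name}: DSCP {dscp} -> TOS {tos} ({tos:#04x}), IPP {tos_to_precedence(tos)}")
```

CS2 becomes TOS 64 (0x40), which is the value passed to the ping "tos" option, and AF11 becomes 0x28, which is what CSR1 reports. The precedence of 0x28 is 1, in line with the EXP1 value that CSR3 wrote into the shim header.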

Math conversion:
DSCP CS2 = 010 000 = 16, left shift by 2 bits = 64 or 0x40 (0100 0000 for TOS)
DSCP AF11 = 001 010 = 10, left shift by 2 bits = 40 or 0x28 (0010 1000 for TOS)

30.2 Short pipe

Short pipe is the default mode and requires no special configuration. The word “pipe” indicates that the customer QoS markings are not modified by the SP and are preserved across the VPN. The word “short” indicates that any markings made by the provider are NOT used to determine the egress QoS treatment for the customer. Using the same example from above, if the flow is remarked to EXP3 in the core, this value applies to the topmost shim header only and this new marker has no influence on the egress PE’s queuing policy. Thus, PHP can be enabled for this model. The egress queuing on the PE (CSR5) will use regular IPv4/IPv6 mechanisms. The configuration on CSR5 is very straightforward and I added a simple QoS policy just to illustrate the point. The remarking performed on CSR3 will not affect how CSR5 queues traffic outbound towards the customer. I would imagine this is the most common model for service providers as it is the least complex and, since the traffic has already reached the customer edge, it doesn’t matter as much how you deliver their traffic. It was more important in the SP core to remark the shim headers but at the end of the LSP, it matters less.

A quick example of short-pipe mode: send traffic from CSR6 with DSCP CS2 and expect to see the same value on CSR5.

R6#ping 57.0.0.7 tos 64 repeat 5

R5#show policy-map interface gig2.557 output
[snip]
    Class-map: class-default (match-any)
      121 packets, 8544 bytes
[snip]
        default  59/3776  0/0  0/0  52  104  1/10
        cs2      5/590    0/0  0/0  65  104  1/10
        cs6      57/4178  0/0  0/0  91  104  1/10

Above, notice that the traffic matches class-default because we did not define any custom actions for IPP2. We can see the counters under the WRED configuration, though. Below, notice that there is no change to the IP TOS byte, as expected, and the PE was NOT able to make queuing/shaping decisions based on SP-modified markings. Only IPP was visible in short-pipe mode.

R7#show flow monitor FNF_MONITOR_IPV4 cache format table | begin IPV4
IPV4 SRC ADDR    IPV4 DST ADDR    INTF INPUT            IP TOS  pkts
===============  ===============  ====================  ======  ==========
46.0.0.6         57.0.0.7         Gi2.557               0x40    5

Math conversion:
DSCP CS2 = 010 000 = 16, left shift by 2 bits = 64 or 0x40 (0100 0000 for TOS)

30.3 Pipe (AKA long pipe)

Long pipe is a combination of the uniform and short pipe modes. It still “pipes” the customer markings through the network, so the customer has no idea that traffic markings were modified along the path. Like uniform mode, the topmost MPLS marking must arrive at the egress PE (CSR4) because the egress QoS treatment will be based on the SP-modified markings. However, these egress policies will NOT remark packets as they are sent to the customer; this is the fundamental difference between uniform and long pipe. Explicit-null is still required, as are qos-group placeholders. The QoS policies are nearly identical on CSR2 and CSR4; the only difference is that CSR4 is not setting new TOS byte values. One other note about this mechanism is that we do not HAVE to match every single value as we did in uniform mode. In CSR4’s configuration, notice that I do not have class-maps for EXP2-4 or QG2-4, because I don’t need to customize the behavior for those markings. Uniform mode MUST match every single value in order to perform a rewrite on egress.

A quick example of long-pipe mode: send traffic from CSR1 with DSCP CS2 and expect to see the same value on CSR6.

R1#ping 46.0.0.6 tos 64 repeat 5
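The three models differ only in which marking the egress PE queues on and whether it rewrites the customer's IP header. The Python sketch below captures that distinction as a toy model; it is not a description of any router's internals, and the function name egress_pe and the exp_to_dscp rewrite table are hypothetical (the single EXP1 to AF11 entry mirrors the uniform-mode example).

```python
def egress_pe(mode, customer_dscp, core_exp, exp_to_dscp=None):
    """Return (queuing_key, delivered_dscp) for a packet leaving the egress PE.

    customer_dscp: DSCP the customer originally sent.
    core_exp: topmost EXP as it arrives (or would have arrived) from the core.
    exp_to_dscp: uniform-mode rewrite table, an assumption for illustration.
    """
    if mode == "short-pipe":
        # PHP removed the topmost label, so only the IP marking is visible.
        return ("ip", customer_dscp), customer_dscp
    if mode == "pipe":
        # Explicit-null preserves EXP; queue on it but leave the IP header alone.
        return ("exp", core_exp), customer_dscp
    if mode == "uniform":
        # Explicit-null preserves EXP; queue on it and rewrite the IP header.
        rewritten = (exp_to_dscp or {}).get(core_exp, customer_dscp)
        return ("exp", core_exp), rewritten
    raise ValueError(f"unknown mode: {mode}")


CS2, AF11 = 16, 10
for mode in ("uniform", "short-pipe", "pipe"):
    key, delivered = egress_pe(mode, CS2, core_exp=1, exp_to_dscp={1: AF11})
    print(f"{mode}: queue on {key}, deliver DSCP {delivered}")
```

With the customer sending CS2 and the core remarking to EXP1, only uniform mode delivers AF11; both pipe modes deliver CS2, and only short pipe queues on the customer's own marking.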

Check the PE for matches. CSR4 sees EXP1 coming in as expected, and performs proper outbound actions based on QG1.

R4#show policy-map interface gig2.534 input | section EXP1
    Class-map: CMAP_EXP1 (match-all)
      5 packets, 630 bytes
      5 minute offered rate 0000 bps, drop rate 0000 bps
      Match: mpls experimental topmost 1
      QoS Set
        discard-class 1
          Marker statistics: Disabled
        qos-group 1
          Marker statistics: Disabled

R4#show policy-map int gig2.546 output | section QG1
    Class-map: CMAP_QG1 (match-all)
      5 packets, 590 bytes
[snip]
        discard-class  Transmitted  Random drop  Tail drop   Minimum  Maximum  Mark
                       pkts/bytes   pkts/bytes   pkts/bytes  thresh   thresh   prob
        0              0/0          0/0          0/0         16       32       1/10
        1              5/590        0/0          0/0         11       17       1/7

Above, notice that the traffic matches the specific class for QG1. We can see the counters under the WRED configuration also. Below, notice that there is no change to the IP TOS byte, as expected, yet the PE was able to make queuing/shaping decisions based on SP-modified markings.

R6#show flow monitor FNF_MONITOR_IPV4 cache format table | begin IPV4
IPV4 SRC ADDR    IPV4 DST ADDR    INTF INPUT            IP TOS  pkts
===============  ===============  ====================  ======  ==========
12.0.0.1         46.0.0.6         Gi2.546               0x40    5

Math conversion: DSCP CS2 = 010 000 = 16, left shift by 2 bits = 64 or 0x40 (0100 0000 for TOS)
30.4 QoS Policy Propagation through BGP (QPPB)
QoS Policy Propagation through BGP (QPPB) is an interesting feature where the BGP NLRI carries the QoS markings for specific routes in a roundabout way. This can simplify QoS configuration on many devices by encoding it within the prefix via communities. It isn’t terribly granular or widely deployed, but for simple policies it might be a better option than putting MQC configurations on thousands of network devices. It can be awkward because, in many cases, BGP is just transporting prefixes and is not involved in data plane operations. That is why this feature really only makes sense for IPv4/IPv6 unicast AFI/SAFI.
Notice that only CSR5 and CSR7 are running BGP as the PE-CE protocol; the other CEs just have a static default route for reachability and no other connected routes. CSR7 signals CSR5 to set IPP for any traffic that it sources from specific subnets. This essentially says “When I send traffic FROM this subnet, set this QoS marking on ingress”. This is what the table-map does. It is not related to the actual “table-map” command in global configuration that we used earlier; it is a route-map applied to BGP using the “table-map” command under an address-family. This could be handy to allow your customers to change their QoS markings per-prefix without the SP becoming involved, which would require some kind of remote signaling not available in MQC. There are a few reasons why the customer might not just do the marking locally in the first place:
1. Customer router is very low end, either in feature set or performance, but does support basic BGP and communities. Classification/marking can be a CPU intensive process.
2. Customer router is actually a firewall, again with limited feature set.
3. Customer is not network savvy and will accept imprecise QoS in return for minimal complexity.
4. Inter-domain route exchange where the source provider can advertise what the QoS treatment should be. This can simplify the ASBR QoS configurations, specifically for Inter-AS MPLS Option A. Other non-MPLS related inter-AS exchanges can benefit as well.
Specifically, this mechanism of QPPB is enabled at the link level with the “bgp-policy source ip-prec-map” command. Substituting the word “destination” for “source” is also valid, but then the router uses the destination address for the QPPB lookup. Because QPPB is always processed inbound, the logic would become “When I send traffic TOWARDS this subnet, set this QoS marking on ingress”. You can enable both simultaneously; in my example I have all QPPB features enabled at once on the customer facing link. After enabling this feature on an interface, a quick verification is recommended. Beware the verbiage; QPPB is an ingress-only feature, so the words “input” and “output” in the show command really mean “source” and “destination” in the configuration, respectively. The show command also confirms the assertion that QPPB is an input feature.

! CSR5
interface GigabitEthernet2.557
 bgp-policy source ip-prec-map
 bgp-policy destination ip-prec-map
R5#show ip interface gig2.557 | include BGP
  BGP Policy Mapping is enabled (output ip-prec-map) (input ip-prec-map)
  Input features: BGP Policy Map, QoS Classification, Access List, MCI Check
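The difference between the “source” and “destination” keywords can be sketched in a few lines of Python. This is only an illustration of the lookup logic described above, not how the router implements it. The table contents, the function names, and the choice to preserve the low five TOS bits are all assumptions made for the example.

```python
import ipaddress

# Hypothetical per-prefix precedence values, standing in for what the
# BGP table-map would attach to routes in the routing table.
QPPB_TABLE = {
    ipaddress.ip_network("7.7.7.3/32"): 3,
    ipaddress.ip_network("99.99.99.2/32"): 7,
}


def lookup_precedence(address):
    """Longest-prefix match; returns None when no route carries a value."""
    addr = ipaddress.ip_address(address)
    matches = [net for net in QPPB_TABLE if addr in net]
    if not matches:
        return None
    return QPPB_TABLE[max(matches, key=lambda net: net.prefixlen)]


def qppb_ingress(src, dst, tos, mode):
    """Return the TOS byte after ingress QPPB in 'source' or 'destination' mode."""
    key = src if mode == "source" else dst
    precedence = lookup_precedence(key)
    if precedence is None:
        return tos  # no QPPB value on the route: marking left untouched
    # Assumption: only the three precedence bits are rewritten.
    return (precedence << 5) | (tos & 0x1F)
```

Both modes run on the inbound packet; the only thing that changes is which address is used as the lookup key.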

Only a single “table-map” can be applied under a BGP AF at one time. Changing the table-map will overwrite the existing value. Thus, your QPPB route-map should have widely spaced sequence numbers for easy insertions.
CSR7 has five networks advertised into BGP. Two of them want traffic to be marked EXP3 and EXP4 by the PE, while the remaining three have no preference according to the customer. CSR7 sets the communities when it originates the networks into BGP and advertises these to the PE (CSR5). CSR5 receives these communities, checks its table-map for any corresponding matches, and takes action. CSR5 is also configured to provide IPP5 treatment for 7.7.7.5/32 (using a prefix-list match) and IPP6 treatment for 7.7.7.6/32 (matching the BGP AS-path, which CSR7 prepended). This is meant to demonstrate that communities do not have to be signaled from another router and that any arbitrary route-map match statements are valid, with a few exceptions; as an example, matching MED does not appear to work and effectively behaves as a match-any. In most cases, communities make sense, since the customer can signal QoS through BGP in this way. The prefix 7.7.7.7/32 does not match any table-map clause and thus gets no modification. This is NOT the same as setting IPP0 in a default route-map entry, since that would overwrite existing QoS markings.
! CSR5
ip bgp-community new-format
ip community-list standard COMMLIST_IPP3 permit 77:3
ip community-list standard COMMLIST_IPP4 permit 77:4
ip as-path access-list 7 permit ^7_.+
ip prefix-list PL_7775 seq 5 permit 7.7.7.5/32
route-map RM_QPPB_FOR_CUSTOMER permit 10
 match community COMMLIST_IPP3
 set ip precedence flash
route-map RM_QPPB_FOR_CUSTOMER permit 20
 match community COMMLIST_IPP4
 set ip precedence flash-override
route-map RM_QPPB_FOR_CUSTOMER permit 30
 match ip address prefix-list PL_7775
 set ip precedence critical
route-map RM_QPPB_FOR_CUSTOMER permit 40
 match as-path 7
 set ip precedence internet
router bgp 1
 address-family ipv4 vrf A
  table-map RM_QPPB_FOR_CUSTOMER
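As a rough mental model, the table-map behaves like an ordered, first-match rule list evaluated per route. The Python sketch below mirrors the four clauses of RM_QPPB_FOR_CUSTOMER. The route dictionaries, the space-separated AS-path strings, the double prepend on 7.7.7.6/32, and the Python regex standing in for the IOS expression ^7_.+ are all assumptions made for illustration.

```python
import re

# Ordered clauses modeled on RM_QPPB_FOR_CUSTOMER:
# (sequence number, match predicate, precedence to set).
CLAUSES = [
    (10, lambda r: "77:3" in r["communities"], 3),
    (20, lambda r: "77:4" in r["communities"], 4),
    (30, lambda r: r["prefix"] == "7.7.7.5/32", 5),
    # Approximation of "^7_.+": AS 7 first, then at least one more AS.
    (40, lambda r: re.search(r"^7 .+", r["as_path"]) is not None, 6),
]


def table_map(route):
    """Return the precedence of the first matching clause, or None."""
    for _seq, predicate, precedence in sorted(CLAUSES, key=lambda c: c[0]):
        if predicate(route):
            return precedence
    return None  # no clause matched: the route gets no QoS value at all


# Hypothetical routes as CSR5 might see them from CSR7.
ROUTES = [
    {"prefix": "7.7.7.3/32", "communities": {"77:3"}, "as_path": "7"},
    {"prefix": "7.7.7.4/32", "communities": {"77:4"}, "as_path": "7"},
    {"prefix": "7.7.7.5/32", "communities": set(), "as_path": "7"},
    {"prefix": "7.7.7.6/32", "communities": set(), "as_path": "7 7"},
    {"prefix": "7.7.7.7/32", "communities": set(), "as_path": "7"},
]
```

Returning None for 7.7.7.7/32 models “no modification”, which is different from returning 0; an explicit IPP0 would overwrite whatever marking the packet already carried.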

The verifications are shown below. The routing table clearly shows the mapping of each destination prefix on CSR7 to the appropriate QoS value identified in the QPPB table-map.
! Signaled by customer community
R5#show ip route vrf A 7.7.7.3 | include Routing_entry|Tag
Routing entry for 7.7.7.3/32
  Tag 7, precedence flash (3), type external
! Signaled by customer community
R5#show ip route vrf A 7.7.7.4 | include Routing_entry|Tag
Routing entry for 7.7.7.4/32
  Tag 7, precedence flash-override (4), type external
! Locally specified
R5#show ip route vrf A 7.7.7.5 | include Routing_entry|Tag
Routing entry for 7.7.7.5/32
  Tag 7, precedence critical (5), type external
! Locally specified
R5#show ip route vrf A 7.7.7.6 | include Routing_entry|Tag
Routing entry for 7.7.7.6/32
  Tag 7, precedence internet (6), type external
! Not specified at all; no treatment yet
R5#show ip route vrf A 7.7.7.7 | include Routing_entry|Tag
Routing entry for 7.7.7.7/32
  Tag 7, type external

To test it, we will send unmarked traffic from CSR7 to CSR1, sourced from all five of its addresses, using a quick TCL script. Using FNF on CSR1 will allow us to capture the markings.
foreach x {
7.7.7.3
7.7.7.4
7.7.7.5
7.7.7.6
7.7.7.7
} { ping 12.0.0.1 source $x }

Another nice feature of QPPB is that in the order of operations, it occurs before routing (and thus before MPLS imposition), so when the IPP is set by CSR5 on ingress it is automatically copied to EXP just as if we used ingress MQC to classify/mark the traffic. With FNF enabled on CSR3 we can see some of these EXP markings being carried through. Of course, any SP-remarking that occurs will overwrite what was written at imposition, so beware of the order of operations when mixing QPPB with uniform/long-pipe QoS models, or mixing it with MQC ingress classification/marking. Let’s compare CSR1’s captured IP packets with CSR3’s captured MPLS packets.
R1#show flow monitor FNF_MONITOR_IPV4 cache format table | begin IPV4
IPV4 SRC ADDR    IPV4 DST ADDR    INTF INPUT            IP TOS  pkts
===============  ===============  ====================  ======  ==========
7.7.7.5          12.0.0.1         Gi2.512               0xB8    5
7.7.7.4          12.0.0.1         Gi2.512               0x80    5
7.7.7.7          12.0.0.1         Gi2.512               0x00    5
7.7.7.3          12.0.0.1         Gi2.512               0x60    5
7.7.7.6          12.0.0.1         Gi2.512               0xC0    5

I cannot explain it, but despite telling QPPB to set IPP5 for 7.7.7.5/32, it set DSCP EF (DSCP 46 or TOS 0xB8). This is true on both classic IOS and IOS XE. Perhaps the router assumes that this is a more appropriate value; from an MPLS perspective it makes no difference as EXP only captures the 3 leftmost TOS bits. Let’s check CSR3 to see the EXP values in the shim headers. I include “20..” because I want to capture labels going to CSR2 only. The output is not totally authoritative but the correlation between CSR3’s MPLS FNF cache and CSR1’s IPv4 FNF cache is clear. Without any additional verification, we can safely assume that QPPB is marking traffic correctly.
R3#show flow monitor FNF_MPLS_MONITOR cache format table | include 20..|INTF|=====
INTF INPUT            INTF OUTPUT   MPLS LABEL 1  MPLS LABEL 2  pkts
====================  ============  ============  ============  ==========
Gi2.535               Gi2.523       3000 /6       2004 /6       5
Gi2.535               Gi2.523       3000 /0       2004 /0       5
Gi2.535               Gi2.523       3000 /5       2004 /5       5
Gi2.535               Gi2.523       3000 /4       2004 /4       5
Gi2.535               Gi2.523       3000 /3       2004 /3       5
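The bit arithmetic that ties the TOS byte, DSCP, IP precedence, and EXP together is easy to check by hand or in a short script. The helper names below are invented for this sketch; the shifts are just the standard field positions within the TOS byte (DSCP in the upper six bits, precedence in the upper three).

```python
def dscp_to_tos(dscp):
    """DSCP occupies the upper six bits of the TOS byte."""
    return dscp << 2


def tos_to_dscp(tos):
    return tos >> 2


def tos_to_precedence(tos):
    """IP precedence, and the EXP value copied at imposition, are the upper three bits."""
    return tos >> 5


# TOS values captured on CSR1 for sources 7.7.7.5, .4, .7, .3 and .6.
captured = [0xB8, 0x80, 0x00, 0x60, 0xC0]
exp_values = [tos_to_precedence(t) for t in captured]
```

With these helpers, CS2 (DSCP 16) maps to TOS 0x40, and TOS 0xB8 decodes to DSCP 46 (EF) while still yielding precedence 5, which is why the EXP value is unaffected by the unexpected EF marking.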

QPPB can also be used to set QoS-group flags for source or destination networks. It works the same way as IPP. The router can set QoS-group on ingress then perform some egress shaping or other mechanism without interrupting your existing MQC ingress classification/marking. This is a less intrusive and more flexible usage of QPPB. It can be enabled using the “ip-qos-map” command instead of “ip-prec-map”. The two can be enabled concurrently, and in this example, source-based IPP and QG mapping are enabled together.
! CSR5
interface GigabitEthernet2.557
 bgp-policy source ip-prec-map
 bgp-policy source ip-qos-map
 bgp-policy destination ip-prec-map
 bgp-policy destination ip-qos-map
R5#show ip interface gig2.557 | include BGP
  BGP Policy Mapping is enabled (output ip-qos-map) (input ip-qos-map) (output ip-prec-map) (input ip-prec-map)
  Input features: BGP Policy Map, QoS Classification, Access List, MCI Check

In this case, we set qos-group 7 for all unmatched traffic in the QPPB route-map just to illustrate its use. CSR5 can now use this on different egress interface encapsulations, MPLS or otherwise. This route-map clause just extends the existing QPPB policy.
! CSR5
route-map RM_QPPB_FOR_CUSTOMER permit 1000
 set ip qos-group 7
! Default match clause in QPPB route-map
R5#show ip route vrf A 7.7.7.7 | include Routing_entry|Tag
Routing entry for 7.7.7.7/32
  Tag 7, qos-group 7, type external

Another quick example demonstrates the IPP and QG mapping using the “destination” method. In order for this to work, CSR2 advertises a few test loopbacks into BGP within the customer VRF so that CSR7 has something to which traffic can be destined; I didn’t want to run PE-CE BGP with CSR1. A QPPB route-map entry uses a locally-defined prefix-list to match those subnets (99.99.99.2/32 and 99.99.99.22/32) and sets both an IPP and QG mapping. Another option would be to set communities at CSR2 when the routes were redistributed, match them in the QPPB route-map, and take action. Notice that the new route-map clause comes before the default entry (index 1000 above).
! CSR5
ip prefix-list PL_LOOP99_TESTS seq 5 permit 99.99.99.0/24 le 32
route-map RM_QPPB_FOR_CUSTOMER permit 200
 match ip address prefix-list PL_LOOP99_TESTS
 set ip precedence network
 set ip qos-group 99

Here are the routing entries within the VRF. Since this is a destination learned across the VPN, we must use destination-based QPPB (enabled in the beginning of this section).
! Locally specified
R5#show ip route vrf A 99.99.99.2 | include Routing_entry|bgp
Routing entry for 99.99.99.2/32
  Known via "bgp 1", distance 200, metric 0, precedence network (7), qos-group 99, type internal
! Locally specified
R5#show ip route vrf A 99.99.99.22 | include Routing_entry|bgp
Routing entry for 99.99.99.22/32
  Known via "bgp 1", distance 200, metric 0, precedence network (7), qos-group 99, type internal

A quick FNF check on CSR3 proves that IPP7 was mapped to EXP7 at imposition. I sent 15 packets per route with TOS 0. Just to prove there is no magic, I also show the label stack per prefix from CSR5 (ingress LSR).
R3#show flow monitor FNF_MPLS_MONITOR cache format table | begin INTF
INTF INPUT            INTF OUTPUT   MPLS LABEL 1  MPLS LABEL 2  pkts
====================  ============  ============  ============  ==========
Gi2.535               Gi2.523       3000 /7       2000 /7       15
Gi2.535               Gi2.523       3000 /7       2001 /7       15
R5#show ip cef vrf A 99.99.99.2
99.99.99.2/32
  nexthop 35.0.0.3 GigabitEthernet2.535 label 3000 2000
R5#show ip cef vrf A 99.99.99.22
99.99.99.22/32
  nexthop 35.0.0.3 GigabitEthernet2.535 label 3000 2001

Some Cisco documentation suggests that QPPB is processed before input ACLs and MQC in the order of operations. We will test it quickly in IOS (12.4T) by matching IPP3 traffic entering from the customer, then sourcing traffic from 7.7.7.3/32. So long as QPPB marks the traffic first, we expect to see matches in the MQC and ACL counters. In IOS, we see this is true as the ACL matches the traffic as IPP3 along with MQC.
! CSR5
class-map match-all CMAP_IPP3_ORDER_OF_OPS_TEST
 match precedence 3
policy-map PMAP_ORDER_OF_OPS_TEST
 class CMAP_IPP3_ORDER_OF_OPS_TEST
interface GigabitEthernet2.557
 ip access-group ACL_IPP3_ORDER_OF_OPS_TEST in
 service-policy input PMAP_ORDER_OF_OPS_TEST
R7#ping 12.0.0.1 source 7.7.7.3 repeat 15
Type escape sequence to abort.
Sending 15, 100-byte ICMP Echos to 12.0.0.1, timeout is 2 seconds:
Packet sent with a source address of 7.7.7.3
!!!!!!!!!!!!!!!
Success rate is 100 percent (15/15), round-trip min/avg/max = 72/94/108 ms
R5#show policy-map interface gig2.557 input | include ORDER_OF_OPS|packets
  Service-policy input: PMAP_ORDER_OF_OPS_TEST
    Class-map: CMAP_IPP3_ORDER_OF_OPS_TEST (match-all)
      15 packets, 1710 bytes
R5#show access-lists ACL_IPP3_ORDER_OF_OPS_TEST
Extended IP access list ACL_IPP3_ORDER_OF_OPS_TEST
    10 permit ip any any precedence flash (15 matches)
    20 permit ip any any

IOS XE’s operational flow is different from IOS with respect to QPPB. Specifically, QPPB is now processed after the input ACL but still before MQC. The output above was from regular IOS 12.4T; the output below is from IOS XE 3.13.2S.
R7#ping 12.0.0.1 source 7.7.7.3 repeat 15
Type escape sequence to abort.
Sending 15, 100-byte ICMP Echos to 12.0.0.1, timeout is 2 seconds:
Packet sent with a source address of 7.7.7.3
!!!!!!!!!!!!!!!
Success rate is 100 percent (15/15), round-trip min/avg/max = 72/94/108 ms
R5#show policy-map interface gig2.557 input | include ORDER_OF_OPS|packets
  Service-policy input: PMAP_ORDER_OF_OPS_TEST
    Class-map: CMAP_IPP3_ORDER_OF_OPS_TEST (match-all)
      15 packets, 1770 bytes
R5#show access-lists ACL_IPP3_ORDER_OF_OPS_TEST
Extended IP access list ACL_IPP3_ORDER_OF_OPS_TEST
    10 permit ip any any precedence flash
    20 permit ip any any (15 matches)
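The two results can be reproduced with a toy ingress pipeline. This is only a sketch of the observed behavior, not of the real feature chain: the function names are invented, and the relative order of ACL and MQC on classic IOS is an arbitrary choice here since the test only shows that both run after QPPB.

```python
def precedence(tos):
    return tos >> 5


def run_ingress(platform, tos=0x00):
    """Push one packet sourced from 7.7.7.3 through the ingress features.

    Returns which ACL entry was hit (10 = precedence flash, 20 = catch-all)
    and whether the MQC class matching precedence 3 saw the packet.
    """
    order = {
        "ios": ("qppb", "acl", "mqc"),    # QPPB first (observed on 12.4T)
        "iosxe": ("acl", "qppb", "mqc"),  # ACL before QPPB (observed on XE 3.13.2S)
    }[platform]
    result = {}
    for feature in order:
        if feature == "qppb":
            tos = (3 << 5) | (tos & 0x1F)  # the policy sets IPP3 for this source
        elif feature == "acl":
            result["acl_entry"] = 10 if precedence(tos) == 3 else 20
        elif feature == "mqc":
            result["mqc_match"] = precedence(tos) == 3
    return result
```

On "ios" the packet is already IPP3 when the ACL sees it, so entry 10 counts; on "iosxe" the ACL still sees TOS 0 and entry 20 counts, while MQC matches in both cases.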

30.5 QoS specifics on IOS XRv
XRv does not appear to support QoS data-plane operations at all. Although I provide the configurations to mirror what was done above, I cannot test them. I cannot even apply an inbound service-policy without the parser complaining.
! Any XRv router
interface GigabitEthernet0/0/0/0.523
 service-policy input PMAP_IN_FROM_CORE
!!% 'NetIO' detected the 'warning' condition 'Netio or a feature DLL returned:': Invalid argument

Outbound policies can be applied but are not actually in effect.
RP/0/0/CPU0:XRv1#show run interface gig0/0/0/0.512
interface GigabitEthernet0/0/0/0.512
 [snip]
 service-policy output PMAP_SHAPE_TO_CUSTOMER
RP/0/0/CPU0:XRv1#show policy-map interface gig0/0/0/0.512
GigabitEthernet0/0/0/0.512 direction input: Service Policy not installed
GigabitEthernet0/0/0/0.512 direction output: Service Policy not installed

Even if it were supported, WRED drop probability doesn’t look functional either, as a general comment. Considering XR’s WRED configuration is highly granular compared to IOS, I assume this is specific to XRv because the feature is hardware dependent.
! Any XRv router
 class CMAP_QG1
  set dscp af11
  bandwidth percent 15
  random-detect discard-class 1 11 packets 17 packets probability 7
 class class-default
  set precedence 0
!!% Subsystem(8147), Code(5): cerrno 0xafe98a00: Policy-map PMAP_QUEUE_TO_CUSTOMER class CMAP_QG1 failed: random-detect 'drop probability' is not supported
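For reference, the drop behavior that the random-detect line is trying to describe can be written as a small function. This is the textbook WRED model rather than a statement about how any XR platform implements it, and it assumes the trailing “probability 7” is a mark-probability denominator (1/7), as it is on IOS. The function name is invented for the sketch.

```python
def wred_drop_probability(avg_depth, min_th, max_th, mark_denominator):
    """Textbook WRED: no drops below min, linear ramp up to 1/denominator
    at max, and tail drop (probability 1.0) beyond max."""
    if avg_depth < min_th:
        return 0.0
    if avg_depth > max_th:
        return 1.0
    ramp = (avg_depth - min_th) / (max_th - min_th)
    return ramp / mark_denominator


# discard-class 1 from the example: min 11 packets, max 17 packets, 1/7.
halfway = wred_drop_probability(14, 11, 17, 7)
```

Halfway between the thresholds (an average depth of 14 packets) the model drops roughly 7% of packets; at the maximum threshold it drops one in seven.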

Nonetheless, we will try to do some testing. The provided configurations are mirrored to the CSRs previously used because I feel they are mostly correct as is. I have replaced the core CSRs with XRv routers. Below is the replacement diagram.


When we try to enable IPP-based QPPB on the interface, XRv rejects it:
! Any XRv router
interface GigabitEthernet0/0/0/0.557
 ipv4 bgp policy propagation input ip-precedence source
!!% 'ipv4-ma' detected the 'warning' condition 'Platform doesn't support ipprec based QPPB in input direction on this card'

XR QPPB appears to only support one direction per type. For example, source-based IPP and destination-based QG is acceptable, but you cannot do both source and destination for IPP or QG concurrently. The QG command is accepted by XRv but, in our case, QG is useless because we cannot apply an MQC policy-map anywhere else. Even if QPPB were actually marking traffic, no follow-on actions could occur (unless there were IOS or IOS XE routers elsewhere in the SP network). I left the command in the configuration for reference.
! XRv3
interface GigabitEthernet0/0/0/0.557
 [snip]
 ipv4 bgp policy propagation input qos-group source

Here is the RPL configuration from XRv3 since it is the most interesting part of this test. It’s just like the route-map except more powerful. Like programming functions/methods, you can pass parameters. When you call the RPL, you pass it the actual value. Rather than hardcode the default QG to 7, I pass it in from BGP where the RPL is referenced. I always try to use variety with RPL to master the syntax as it is quite comprehensive.


RP/0/0/CPU0:XRv3#show rpl
[snip]
prefix-set PS_LOOP99_TESTS
  99.99.99.2/32,
  99.99.99.22/32
end-set
!
route-policy RPL_QPPB_FOR_CUSTOMER($DEFAULT_QG)
  if community matches-any (77:3) then
    set ip-precedence 3
  elseif community matches-any (77:4) then
    set ip-precedence 4
  elseif destination in (7.7.7.5/32) then
    set ip-precedence 5
  elseif as-path length ge 2 then
    set ip-precedence 6
  elseif destination in PS_LOOP99_TESTS then
    set qos-group 9
    set ip-precedence 7
  else
    set qos-group $DEFAULT_QG
  endif
end-policy

Applying the QPPB policy to BGP is almost identical to IOS and IOS XE.
RP/0/0/CPU0:XRv3#show run router bgp 1 vrf A address-family ipv4 unicast
router bgp 1
 vrf A
  address-family ipv4 unicast
   table-policy RPL_QPPB_FOR_CUSTOMER(7)
   redistribute connected
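The comparison to programming functions can be made literal. The Python function below is a loose transliteration of RPL_QPPB_FOR_CUSTOMER, with the default QoS group passed in by the caller the same way the table-policy statement passes it. The route dictionary keys are hypothetical; only the branch order and the values are taken from the policy.

```python
PS_LOOP99_TESTS = {"99.99.99.2/32", "99.99.99.22/32"}


def rpl_qppb_for_customer(route, default_qg):
    """Return the QoS attributes the policy would attach to a route."""
    communities = route.get("communities", ())
    prefix = route["prefix"]
    if "77:3" in communities:
        return {"ip_precedence": 3}
    if "77:4" in communities:
        return {"ip_precedence": 4}
    if prefix == "7.7.7.5/32":
        return {"ip_precedence": 5}
    if route.get("as_path_length", 0) >= 2:
        return {"ip_precedence": 6}
    if prefix in PS_LOOP99_TESTS:
        return {"qos_group": 9, "ip_precedence": 7}
    return {"qos_group": default_qg}  # the "else" branch, parameterized


# Same policy, two call sites, two different defaults.
unmatched = {"prefix": "7.7.7.7/32", "as_path_length": 1}
first_call = rpl_qppb_for_customer(unmatched, 7)
second_call = rpl_qppb_for_customer(unmatched, 6)
```

Because the default is an argument rather than a constant, the same policy body can be reused wherever a different fallback value is wanted, without copying it.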

Despite all the limitations with XRv and QoS, with QPPB we can at least do some control-plane verifications. This can show us if our RPL was configured correctly. In XRv, the QPPB information is not visible from the RIB, but can be seen from the FIB. The quotes around the “A” are because XRv assumes that “A” by itself means “all”, due to ambiguity. Even though the FIB shows IPP values, we cannot enable it at the interface level, as seen above.
! Signaled by community
RP/0/0/CPU0:XRv3#show cef vrf "A" ipv4 7.7.7.3/32 | include Prec
 IP Precedence: 3
! Signaled by community
RP/0/0/CPU0:XRv3#show cef vrf "A" ipv4 7.7.7.4/32 | include Prec
 IP Precedence: 4
! Locally specified by inline prefix
RP/0/0/CPU0:XRv3#show cef vrf "A" ipv4 7.7.7.5/32 | include Prec
 IP Precedence: 5
! Locally specified by inline AS path
RP/0/0/CPU0:XRv3#show cef vrf "A" ipv4 7.7.7.6/32 | include Prec
 IP Precedence: 6
! Locally specified in default class using parameterization
RP/0/0/CPU0:XRv3#show cef vrf "A" ipv4 7.7.7.7/32 | utility egrep 'Prec|QoS'
 QoS Group: 7
! Locally specified by prefix-set
RP/0/0/CPU0:XRv3#show cef vrf "A" ipv4 99.99.99.2/32 | include QoS
 QoS Group: 9, IP Precedence: 7
! Locally specified by prefix-set
RP/0/0/CPU0:XRv3#show cef vrf "A" ipv4 99.99.99.22/32 | include QoS
 QoS Group: 9, IP Precedence: 7

We know that our uniform and long-pipe QoS models cannot work on XRv since the input MQC policies from the MPLS core cannot be applied. We also cannot configure XRv4 (formerly CSR3) to perform EXP remarking in the core for the same reason. Short-pipe mode does work correctly because when the LSRs perform label swap operations, they don’t touch the EXP bits. XRv basically has no mechanism to do anything else with the EXP bits once it has copied the IPP into EXP at imposition. A quick test verifies this.
R1#ping 57.0.0.7 tos 64 repeat 5
R1#show flow monitor FNF_MONITOR_IPV4 cache format table | begin IPV4
IPV4 SRC ADDR    IPV4 DST ADDR    INTF INPUT            IP TOS  pkts
===============  ===============  ====================  ======  ==========
12.0.0.1         57.0.0.7         Gi2.557               0x40    5

We can reuse our TCL script for CSR7 to test the QPPB marking. Earlier we saw a variety of markings consistent with our QPPB policy. Now, we see TOS 0 for everything. XRv is not marking the traffic despite the FIB being correctly programmed, again because we did not enable the BGP policy feature at the interface level.
R1#show flow monitor FNF_MONITOR_IPV4 cache format table | begin IPV4
IPV4 SRC ADDR    IPV4 DST ADDR    INTF INPUT            IP TOS  pkts
===============  ===============  ====================  ======  ==========
7.7.7.5          12.0.0.1         Gi2.512               0x00    5
7.7.7.6          12.0.0.1         Gi2.512               0x00    5
7.7.7.7          12.0.0.1         Gi2.512               0x00    5
7.7.7.3          12.0.0.1         Gi2.512               0x00    5
7.7.7.4          12.0.0.1         Gi2.512               0x00    5

NetFlow is totally unsupported in XRv, so we cannot use it for QoS verification. The command syntax isn’t even available (probably removed due to lack of supported hardware). The XR 5.2.0 documentation clearly states the parser tree begins with the word “flow”. Fortunately, the CSRs as CE devices offer some visibility between FNF and EPC.
RP/0/0/CPU0:XRv1(config)#flow?
 ^
% Invalid input detected at '^' marker.

IOS XR supports QPPB for IPv6 too. I cannot find any documentation that claims IPv6 support for IOS or IOS XE. CSR7 advertises two IPv6 prefixes to XRv3; we will match one (::7:7:7:7/128) by setting the community to give it IPP3 treatment, and we let the other match the default QG setting (::7:7:7:77/128). We don’t need to modify the RPL because I call the policy from within the IPv6 unicast AFI and can pass in a new value of 6 (not 7) with parameterization. Communities and BGP attributes are not AFI specific, so matching non-prefix related things in a QPPB RPL or route-map is preferable.
RP/0/0/CPU0:XRv3#show run router bgp 1 vrf A address-family ipv6 unicast
router bgp 1
 vrf A
  address-family ipv6 unicast
   table-policy RPL_QPPB_FOR_CUSTOMER(6)
   redistribute connected
! Signaled by community
RP/0/0/CPU0:XRv3#show cef vrf "A" ipv6 ::7:7:7:7/128 | utility egrep 'Prec|QoS'
 IP Precedence: 3
! Not specified; matches default RPL “else” statement
RP/0/0/CPU0:XRv3#show cef vrf "A" ipv6 ::7:7:7:77/128 | utility egrep 'Prec|QoS'
 QoS Group: 6

When we try to enable IPP-based QPPB on the interface, XRv also rejects it for IPv6. I just enabled source-based QoS group for reference as I did for IPv4.
interface GigabitEthernet0/0/0/0.557
 ipv6 bgp policy propagation input ip-precedence source
!!% 'ipv6-ma' detected the 'warning' condition 'Platform doesn't support ipprec based QPPB in input direction on this card'

Additional Reading – Reference configurations "mpls-qos-xr"
30.6 Network Based Application Recognition (NBAR) summary and configurations
Network based application recognition allows MQC to identify applications/protocols directly without having to use other match criteria, such as ports/protocols inside an ACL. It is invoked using the “match protocol” keyword in IOS and IOS XE.

NBAR is not supported on IOS XR. The “match protocol” parser tree only lets you match IP protocols, like TCP and UDP. You can specify a raw IP protocol number also, but this is nowhere near the power of NBAR2 on IOS XE platforms. Given this restriction, NBAR is not evaluated on XR in this test. The network is simple with CSR8, CSR9, and CSR10 arrayed in a line. NBAR is not specifically an SP technology and is unlikely to be tested, but since XE makes use of many advanced NBAR features, it is discussed. Most of the NBAR configurations will be on CSR9, applied inbound to the link facing CSR8. The diagram is shown below.

A simple and common NBAR deployment is shown by the PMAP_SIMPLE policy. We can identify some protocols based on the built-in NBAR2 protocol pack listing inside a class-map, then perform some action. In this case, I only care about matching packets, not performing QoS treatment, so the policy-map invokes the class and does nothing.
! CSR9
class-map match-any CMAP_TTY_MGMT
 match protocol telnet
 match protocol ssh
policy-map PMAP_SIMPLE
 class CMAP_TTY_MGMT

Below are a few quick tests where CSR9 is supposed to match telnet or SSH traffic arriving from CSR8. CSR8 telnets and SSHes to CSR10 to test it. The final test is to show that ping traffic does not match.
! telnet test
R8#telnet 90.0.0.10
Trying 90.0.0.10 ... Open
R9#show policy-map interface gig2.589 input class CMAP_TTY_MGMT
 GigabitEthernet2.589
  Service-policy input: PMAP_SIMPLE
    Class-map: CMAP_TTY_MGMT (match-any)
      9 packets, 552 bytes
      5 minute offered rate 0000 bps
      Match: protocol telnet
      Match: protocol ssh
R9#clear counter
Clear "show interface" counters on all interfaces [confirm]
! ssh test
R8#ssh -l fakeuser 90.0.0.10
Password:
R9#show policy-map interface gig2.589 input class CMAP_TTY_MGMT
 GigabitEthernet2.589
  Service-policy input: PMAP_SIMPLE
    Class-map: CMAP_TTY_MGMT (match-any)
      82 packets, 4818 bytes
      5 minute offered rate 1000 bps
      Match: protocol telnet
      Match: protocol ssh
R8#ping 90.0.0.10 repeat 15
R9#show policy-map interface gig2.589 input
 GigabitEthernet2.589
  Service-policy input: PMAP_SIMPLE
    Class-map: CMAP_TTY_MGMT (match-any)
      0 packets, 0 bytes
      5 minute offered rate 0000 bps
      Match: protocol telnet
      Match: protocol ssh
    Class-map: class-default (match-any)
      15 packets, 1770 bytes
      5 minute offered rate 1000 bps, drop rate 0000 bps
      Match: any

Additional Reading – Reference configurations "nbar"
30.6.1 NBAR Custom Protocols
NBAR also allows the user to define custom protocols in a number of ways. I will not be discussing the deprecated “ip nbar port-map” method as it is no longer shown in the context-sensitive help:
R9(config)#ip nbar port-map radius udp 1645 1646
%NBAR-6-PORT_MAP_DEPRECATION: Port-map command will be deprecated soon. In future it will not be necessary to configure port-map on a Protocol to create a new Custom protocol onto the same well known port.

1. Simple TCP/UDP port mapping to a custom name. Before NBAR2, some common protocols like Windows RDP and RADIUS were not pre-defined. This technique is handy for commonly used protocols as well as custom applications. PMAP_CUSTOM_RADIUS is an example of using this feature to match the old RADIUS authentication (1645) and accounting (1646) ports. NBAR2 only considers the newer RADIUS ports for classification by default.
R9#show ip nbar port-map | include radius
port-map radius udp 1813 1812
R9(config)#ip nbar custom UDP_RADIUS_OLD udp 1645 1646
R9#show ip nbar port-map | include RADIUS|radius
port-map UDP_RADIUS_OLD udp 1645 1646
port-map radius udp 1813 1812

These are wrapped in the PMAP_CUSTOM_RADIUS policy-map and applied to the interface facing CSR8.
! CSR9
class-map match-any CMAP_RADIUS
 match protocol radius
 match protocol UDP_RADIUS_OLD
policy-map PMAP_CUSTOM_RADIUS
 class CMAP_RADIUS

On CSR8, we set up a bogus RADIUS server with IP address 90.0.0.100. AAA is described in the security chapter, but the commands are in the configuration file. Interestingly, even on this relatively new XE version, the default ports are the old ports.
R8(config-radius-server)#address ipv4 90.0.0.100 ?
  acct-port  UDP port for RADIUS accounting server (default is 1646)
  alias      1-8 aliases for this server (max. 8)
  auth-port  UDP port for RADIUS authentication server (default is 1645)

Once the server is configured, we will use CSR8 to telnet to itself. CSR8 should then try to reach the RADIUS server for authentication, which appears to be on the LAN behind CSR9. We should see matches under the class-map regardless of which set of ports we use.
R8#telnet 89.0.0.8
Trying 89.0.0.8 ... Open
User Access Verification
Username: fakeuser
Password:
R9#show policy-map interface gig2.589 input class CMAP_RADIUS
 GigabitEthernet2.589
  Service-policy input: PMAP_CUSTOM_RADIUS
    Class-map: CMAP_RADIUS (match-any)
      4 packets, 472 bytes
      5 minute offered rate 0000 bps
      Match: protocol radius
      Match: protocol UDP_RADIUS_OLD

It works with the old ports, so we know our custom NBAR class is operational. Let’s double-check the new ports by using 1812 and 1813 in the RADIUS server configuration (configuration not shown). It also works. We see exactly 4 packets because the retransmit count was set to 3, which means 3 retries after the first failed attempt.
R9#clear counter
Clear "show interface" counters on all interfaces [confirm]
R8#telnet 89.0.0.8
Trying 89.0.0.8 ... Open
User Access Verification
Username: fakeuser
Password:
R9#show policy-map interface gig2.589 input class CMAP_RADIUS
 GigabitEthernet2.589
  Service-policy input: PMAP_CUSTOM_RADIUS
    Class-map: CMAP_RADIUS (match-any)
      4 packets, 472 bytes
      5 minute offered rate 0000 bps
      Match: protocol radius
      Match: protocol UDP_RADIUS_OLD

A more complicated custom protocol involves measuring byte offsets in the packet to match specific strings in the payload. It is kind of like flexible packet matching (FPM), but is less powerful and easier to use. PMAP_CUSTOM_PAYLOAD demonstrates looking for the string “BEEF” inside a UDP packet using port 55555. CSR8 uses IP SLA (beyond the scope of this test) to generate these packets every 2 seconds towards CSR10. CSR10 will not respond, but it doesn’t matter. The IP SLA has been configured to use the string BEEF, or more specifically, 0x0000BEEF since the data-pattern is measured as a 32-bit entity, and I have only configured 16 bits. This is the packet dump from CSR8 that demonstrates this. Even more confusing are the “phantom” 4 bytes of 0x00010000, which are part of neither the UDP header (CEDBD903 001867FC) nor the payload (0000BEEF, repeated) that I specified.
! CSR8
IP: s=89.0.0.8 (local), d=90.0.0.10, len 44, local feature
7F1B9765C230: 4500002C 00000000 FF110000 59000008  E..,........Y...
7F1B9765C240: 5A00000A CEDBD903 001867FC 00010000  Z...N[Y...g|....
7F1B9765C250: 0000BEEF 0000BEEF 0000BEEF           ..>o..>o..>o

We must tell NBAR to offset 6 bytes into the payload of UDP to see the first occurrence of BEEF. This would generally be used to identify a virus or malicious packet entering the network. I would have normally expected a 2 byte offset, but as seen above, IP SLA is adding some extra detail into the payload.
! CSR9
ip nbar custom FIND_BEEF 6 hex BEEF udp 55555
class-map match-any CMAP_FIND_BEEF
 match protocol FIND_BEEF
policy-map PMAP_CUSTOM_PAYLOAD
 class CMAP_FIND_BEEF

We can verify the operation using the commands shown below.
R9#show policy-map interface gig2.589 input class CMAP_FIND_BEEF
 GigabitEthernet2.589
  Service-policy input: PMAP_CUSTOM_PAYLOAD
    Class-map: CMAP_FIND_BEEF (match-any)
      7 packets, 434 bytes
      5 minute offered rate 0000 bps
      Match: protocol FIND_BEEF
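The offset arithmetic is easy to double-check offline against the bytes in the packet dump. The sketch below imitates what the FIND_BEEF custom protocol is asked to do; the function name is invented, and matching on the destination port only is an assumption made for this example.

```python
def matches_find_beef(dst_port, udp_payload):
    """Imitate 'ip nbar custom FIND_BEEF 6 hex BEEF udp 55555'."""
    offset = 6
    pattern = bytes.fromhex("BEEF")
    window = udp_payload[offset:offset + len(pattern)]
    return dst_port == 55555 and window == pattern


# UDP payload from the dump: 4 "phantom" bytes, then the repeated data pattern.
payload = bytes.fromhex("00010000" "0000BEEF" "0000BEEF" "0000BEEF")
```

Counting from zero, the phantom bytes occupy offsets 0 to 3, the zero padding of the first pattern occupies 4 and 5, and BE EF sit at offsets 6 and 7, which is why the offset is 6 rather than 2.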

Taking this a step further, let’s pretend we are trying to stop a virus with a generally unrecognizable hex string on a specific set of ports. At the same time, we need to ensure we don’t drop legitimate traffic. We can specify the port-range as we have before, but abstract a specific portion (up to 4 bytes) of the hex string for specification in a class-map. This adds some flexibility without having to define tons of custom NBAR protocols. Specifically, packets with the hex strings ending in 0x####5100, 0x####570A, and 0x####5B0C running on UDP port 55556 are considered suspect. Anything else is acceptable. Because we are fabricating the packets with IP SLA, we need to account for the first 4 “phantom” bytes. We are matching the last 2 bytes of the pattern, which implies we still need to account for a 6 byte offset; the “virus” does not modify the first 2 bytes of the payload pattern. The configuration is very similar except instead of the “hex” keyword we use “variable” to denote that this sub-protocol variable will be defined in the class-map. We can call the variable whatever we want.
! CSR9
ip nbar custom STOP_VIRUS 6 variable LAST_2_BYTES 2 udp 55556
class-map match-any CMAP_VIRUS_570A
 match protocol STOP_VIRUS LAST_2_BYTES "0x570A"
class-map match-any CMAP_VIRUS_5100
 match protocol STOP_VIRUS LAST_2_BYTES "0x5100"
class-map match-any CMAP_VIRUS_5B0C
 match protocol STOP_VIRUS LAST_2_BYTES "0x5B0C"
policy-map PMAP_STOP_VIRUS
 class CMAP_VIRUS_5100
 class CMAP_VIRUS_570A
 class CMAP_VIRUS_5B0C

To test this, I have 4 IP SLA instances: three viruses and one benign flow (uses 0x6BA4570C, which is fine). We expect PMAP_STOP_VIRUS to match the first three, but not the last one. Normally it makes more sense to configure all of these matches in a single match-any class-map, but I used three separate class-maps just to show the counters without having to start/stop IP SLAs and clear counters. All four run at the same rate, so the packet counters should be nearly equal.
R9#show policy-map interface gig2.589 input
 GigabitEthernet2.589
  Service-policy input: PMAP_STOP_VIRUS
    Class-map: CMAP_VIRUS_5100 (match-any)
      12 packets, 744 bytes
      5 minute offered rate 0000 bps
      Match: protocol STOP_VIRUS LAST_2_BYTES "0x5100"
    Class-map: CMAP_VIRUS_570A (match-any)
      13 packets, 806 bytes
      5 minute offered rate 0000 bps
      Match: protocol STOP_VIRUS LAST_2_BYTES "0x570A"
    Class-map: CMAP_VIRUS_5B0C (match-any)
      12 packets, 744 bytes
      5 minute offered rate 0000 bps
      Match: protocol STOP_VIRUS LAST_2_BYTES "0x5B0C"
    Class-map: class-default (match-any)
      12 packets, 744 bytes
      5 minute offered rate 0000 bps, drop rate 0000 bps
      Match: any
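The classification logic behind the “variable” keyword can be modeled in a few lines of Python. This is only a sketch of the matching idea, not how NBAR is implemented: it assumes the same 4 “phantom” bytes precede every IP SLA pattern, and the first two bytes of the sample virus patterns (0x1234) are made up for illustration since the text only specifies the last two.

```python
PHANTOM = bytes.fromhex("00010000")  # assumed: same 4 extra bytes as in the dump
OFFSET, LENGTH = 6, 2                 # "STOP_VIRUS 6 variable LAST_2_BYTES 2"
SUSPECT = {"0x5100", "0x570A", "0x5B0C"}  # values matched in the class-maps


def classify(pattern_hex):
    """Return the class a UDP/55556 payload would land in under PMAP_STOP_VIRUS."""
    payload = PHANTOM + bytes.fromhex(pattern_hex)
    value = "0x" + payload[OFFSET:OFFSET + LENGTH].hex().upper()
    if value in SUSPECT:
        return "CMAP_VIRUS_" + value[2:]
    return "class-default"


for pattern in ("12345100", "1234570A", "12345B0C", "6BA4570C"):
    print(pattern, "->", classify(pattern))
```

The benign 0x6BA4570C flow falls through to class-default because its last two bytes (0x570C) differ from 0x570A by a single nibble.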

30.6.2 NBAR Attributes
Another useful feature of NBAR2 is the ability to apply specific attributes to a protocol (whether it’s a built-in or custom protocol). These attributes can be used for monitoring, but also to simplify QoS policies. For example, I can classify SSH, telnet, and RADIUS together in a category of “net-admin”, then match this attribute in a class-map. The trade-off for a simplified QoS policy is a potentially expanded NBAR configuration, unless the default attributes for a protocol are already acceptable. Attribute types are as follows:
application-group: Used to group things together based on similar application/use (jabber, flash, gnutella, AOL, etc)
category: First level of categorization (email, chat, gaming, etc)
encrypted: Determine if the flow is encrypted (yes, no, unassigned)
p2p-technology: Determine if the flow uses peer-to-peer technology (yes, no, unassigned)
sub-category: Second level of categorization (auth services, database, etc)
tunnel: Determine if the flow is inside a tunnel (yes, no, unassigned)
The process contains up to three steps, the first of which is optional.
1. Define custom attributes if you like. If the built-in classifications are not appropriate, XE lets you create one custom attribute per type. Only application-group, category, and sub-category can be redefined since the other attributes are essentially yes/no questions. In this example, I created a custom application-group for RADIUS just to demonstrate. I like that you can specify the context-sensitive help string, too.
! CSR9
ip nbar attribute application-group custom RADIUS_GROUP "Custom group for RADIUS services"

The command below shows that we have a limit of one custom attribute per type, and that we already created an application-group attribute.
R9#show ip nbar attribute-custom
Name                  : application-group
Help                  : Application-group attribute
Custom Groups Limit   : 1
Custom Groups Created : RADIUS_GROUP

Name                  : category
Help                  : Category attribute
Custom Groups Limit   : 1

Name                  : sub-category
Help                  : Sub-category attribute
Custom Groups Limit   : 1

If we try to create more application-group attributes, the router rejects it.
R9(config)#ip nbar attribute application-group custom NEW_ONE
% NBAR Error: Custom groups limit reached for attribute "application-group"

2. Next, we create an attribute-map, which is essentially a list of attributes. We can specify as many or as few attributes as we like. This map will be attached to a protocol later on. Since we are attaching this to our RADIUS protocol (both the built-in and the custom one), we will select appropriate categories. See below for the snippet.
! CSR9
ip nbar attribute-map RADIUS_ATT_MAP
 attribute application-group RADIUS_GROUP
 attribute category net-admin
 attribute encrypted encrypted-no
 attribute p2p-technology p2p-tech-no
 attribute sub-category authentication-services
 attribute tunnel tunnel-no

If we use the ? symbol under the attribute map when setting the application-group, we can see our help text.
R9(config)#ip nbar attribute-map RADIUS_ATT_MAP
R9(config-attribute-map)#attribute application-group ?
  RADIUS_GROUP  Custom group for RADIUS services

3. Apply the attribute-map to a protocol using the “attribute-set” command. In this case, we want to affix these attributes to our RADIUS definitions. Before we do that, let’s verify the existing attributes of the protocols.
R9#show ip nbar protocol-attribute radius
Protocol Name     : radius
application-group : other
category          : net-admin
encrypted         : encrypted-no
p2p-technology    : p2p-tech-no
sub-category      : authentication-services
tunnel            : tunnel-no
R9#show ip nbar protocol-attribute UDP_RADIUS_OLD
Protocol Name     : UDP_RADIUS_OLD
application-group : other
category          : other
encrypted         : encrypted-unassigned
p2p-technology    : p2p-tech-unassigned
sub-category      : other
tunnel            : tunnel-unassigned

Now, we will overwrite those settings with our new attribute-map. Any unspecified attributes in the map are left alone in the original attribute listing. Notice how our policy basically mirrors what “radius” already had, plus our new custom group. The custom “UDP_RADIUS_OLD” protocol we defined had very generic attributes since the router has no idea what it is.
! CSR9
ip nbar attribute-set radius RADIUS_ATT_MAP
ip nbar attribute-set UDP_RADIUS_OLD RADIUS_ATT_MAP
R9#show ip nbar protocol-attribute radius
Protocol Name     : radius
application-group : RADIUS_GROUP
category          : net-admin
encrypted         : encrypted-no
p2p-technology    : p2p-tech-no
sub-category      : authentication-services
tunnel            : tunnel-no
R9#show ip nbar protocol-attribute UDP_RADIUS_OLD
Protocol Name     : UDP_RADIUS_OLD
application-group : RADIUS_GROUP
category          : net-admin
encrypted         : encrypted-no
p2p-technology    : p2p-tech-no
sub-category      : authentication-services
tunnel            : tunnel-no

For a given attribute, we can see all registered protocols. For example, here is the list of all protocols that are in application-group RADIUS_GROUP. I also show a truncated version of all of the category net-admin protocols to demonstrate the use. Notice we see our custom UDP_RADIUS_OLD in the list, too.
R9#show ip nbar attribute application-group RADIUS_GROUP
UDP_RADIUS_OLD      User defined Protocol UDP_RADIUS_OLD
radius              Remote Authentication Dial In User Service protocol
R9#show ip nbar attribute category net-admin
914c/g              Texas Instruments 914 Terminal
9pfs                Plan 9 file service
UDP_RADIUS_OLD      User defined Protocol UDP_RADIUS_OLD
acap                ACAP
active-directory    Active Directory Traffic
agentx              AgentX
alpes               Alpes
[snip]

At this point, we can use the attributes any way we like. I built a basic MQC configuration to test it using the policy-map PMAP_RADIUS_ATTRIBUTES. Specifically, the class-map looks for both the category net-admin AND application-group RADIUS_GROUP. We will telnet from CSR8 to itself to test the RADIUS configuration. At this point, it does not matter which ports our RADIUS configuration on CSR8 is using since either set of ports will match.
! CSR9
class-map match-all CMAP_RADIUS_ATTRIBUTES
 match protocol attribute category net-admin
 match protocol attribute application-group RADIUS_GROUP
policy-map PMAP_RADIUS_ATTRIBUTES
 class CMAP_RADIUS_ATTRIBUTES

Below are some quick tests to ensure the policy is functional.
R8#telnet 89.0.0.8
Trying 89.0.0.8 ... Open
User Access Verification
Username: fakeuser
Password:
R9#show policy-map interface gig2.589
 GigabitEthernet2.589
  Service-policy input: PMAP_RADIUS_ATTRIBUTES
    Class-map: CMAP_RADIUS_ATTRIBUTES (match-all)
      4 packets, 472 bytes
      5 minute offered rate 0000 bps
      Match: protocol attribute category net-admin
      Match: protocol attribute application-group RADIUS_GROUP
    Class-map: class-default (match-any)
      0 packets, 0 bytes
      5 minute offered rate 0000 bps, drop rate 0000 bps
      Match: any

Just like before, we see exactly four RADIUS packets enter, except this time we classified them based on two attributes. Let’s telnet from CSR8 to CSR10 and ensure telnet traffic does not match the RADIUS class. Telnet is part of category net-admin but not part of application-group RADIUS_GROUP (we validate this using both lookup mechanisms). This test is just validating basic match-all class-map logic when using NBAR attributes.
R9#show ip nbar protocol-attribute telnet | include _category|application
application-group : other
category          : net-admin
R9#show ip nbar attribute category net-admin | include ^__telnet
telnet              Telnet - virtual text-oriented terminal over network
R9#show ip nbar attribute application-group RADIUS_GROUP | include telnet
[no output]
R8#telnet 90.0.0.10
Trying 90.0.0.10 ... Open
R9#show policy-map interface gig2.589
 GigabitEthernet2.589
  Service-policy input: PMAP_RADIUS_ATTRIBUTES
    Class-map: CMAP_RADIUS_ATTRIBUTES (match-all)
      0 packets, 0 bytes
      5 minute offered rate 0000 bps
      Match: protocol attribute category net-admin
      Match: protocol attribute application-group RADIUS_GROUP
    Class-map: class-default (match-any)
      19 packets, 1144 bytes
      5 minute offered rate 0000 bps, drop rate 0000 bps
      Match: any
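The match-all versus match-any behavior seen in the two tests can be sketched in Python as follows. The attribute values are taken from the show commands in this section; the dictionary layout and the helper names `match_all` and `match_any` are purely illustrative and say nothing about NBAR internals.

```python
# Attribute values as reported by "show ip nbar protocol-attribute".
ATTRIBUTES = {
    "radius": {"category": "net-admin", "application-group": "RADIUS_GROUP"},
    "UDP_RADIUS_OLD": {"category": "net-admin", "application-group": "RADIUS_GROUP"},
    "telnet": {"category": "net-admin", "application-group": "other"},
}

# The two match statements inside CMAP_RADIUS_ATTRIBUTES.
CRITERIA = {"category": "net-admin", "application-group": "RADIUS_GROUP"}


def match_all(protocol, criteria=CRITERIA):
    """Every match statement must be true (logical AND)."""
    attrs = ATTRIBUTES[protocol]
    return all(attrs.get(key) == value for key, value in criteria.items())


def match_any(protocol, criteria=CRITERIA):
    """At least one match statement must be true (logical OR)."""
    attrs = ATTRIBUTES[protocol]
    return any(attrs.get(key) == value for key, value in criteria.items())


for name in ATTRIBUTES:
    print(name, match_all(name), match_any(name))
```

Telnet satisfies only the category statement, so it would hit a match-any version of this class-map but correctly misses the match-all one used here.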

30.6.3 NBAR Attributes with HTTP
Our next test will include using NBAR to classify HTTP based on the host/URL information. We can do this in two ways: use the built-in MQC syntax inside the class-map with “match protocol http” or generate a custom NBAR protocol of type HTTP. We will test both. I have enabled HTTP server on CSR10 (user/pass: http/http) and added files to flash called “route.txt” and “cef.txt”. The files contain the results of CSR10’s RIB and FIB respectively. Let’s quickly make sure the HTTP download works before continuing; filtering is used to reduce output for verification.
R10#show ip route | redirect flash:route.txt
R10#show ip cef | redirect flash:cef.txt
R8#more http://http:http@90.0.0.10/route.txt | begin Gateway
Gateway of last resort is 90.0.0.9 to network 0.0.0.0

S*    0.0.0.0/0 [1/0] via 90.0.0.9
      90.0.0.0/8 is variably subnetted, 2 subnets, 2 masks
C        90.0.0.0/24 is directly connected, GigabitEthernet2.590
L        90.0.0.10/32 is directly connected, GigabitEthernet2.590

R8#more http://http:http@90.0.0.10/cef.txt | include Gigabit
0.0.0.0/0           90.0.0.9      GigabitEthernet2.590
90.0.0.0/24         attached      GigabitEthernet2.590
90.0.0.0/32         receive       GigabitEthernet2.590
90.0.0.9/32         attached      GigabitEthernet2.590
90.0.0.10/32        receive       GigabitEthernet2.590
90.0.0.255/32       receive       GigabitEthernet2.590

A key point with the “host” and “url” keywords (example: www.cisco.com/go/coolstuff/page.htm):
- Host: matches the host name only (www.cisco.com)
- URL: matches everything after the host name (/go/coolstuff/page.htm)
We will use the “url” feature with a simple regular expression on CSR9. This will ultimately feed into a policy-map called PMAP_HTTP. Let’s also try it with the NBAR custom protocol method to match the CEF file. The policy-map should meter both individually.
! CSR9
class-map match-all CMAP_HTTP
 match protocol http url "*route\.txt*"
ip nbar custom HTTP_CEF_FILE http url *cef\.txt*
class-map match-all CMAP_HTTP_CUSTOM
 match protocol HTTP_CEF_FILE
policy-map PMAP_HTTP
 class CMAP_HTTP
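The host/url split described above can be illustrated with Python's standard URL parser. This is a rough sketch only: NBAR uses its own pattern syntax, so the Python regular expression `route\.txt` is an assumed approximation of the "*route\.txt*" pattern, and the helper `split_host_url` is hypothetical.

```python
import re
from urllib.parse import urlsplit


def split_host_url(full_url):
    """Split a URL into the parts the "host" and "url" keywords would see."""
    if "://" not in full_url:
        full_url = "http://" + full_url   # urlsplit needs a scheme to find the host
    parts = urlsplit(full_url)
    return parts.hostname, parts.path


host, url = split_host_url("www.cisco.com/go/coolstuff/page.htm")
print(host)  # prints: www.cisco.com
print(url)   # prints: /go/coolstuff/page.htm

# Assumed Python equivalent of the "*route\.txt*" pattern: a substring search
# with the dot escaped so that it matches a literal period.
ROUTE_FILE = re.compile(r"route\.txt")
for path in ("/route.txt", "/cef.txt"):
    print(path, bool(ROUTE_FILE.search(path)))
```

Escaping the dot matters in both syntaxes; unescaped, it would match any character, so a file such as "routeXtxt" would also be classified.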

In case you didn’t know that the parser expected a regular expression, you can verify with the following show command. It will show all the sub-classification parameters for a given protocol.
R9#show ip nbar parameter subclassification http
Protocol   Parameter          Parameter type
-----------------------------
http       mime               regexp_url
http       content-encoding   regexp_url
http       location           regexp_url
http       server             regexp_url
http       from               regexp_url
http       referer            regexp_url
http       user-agent         regexp_url
http       host               regexp_url
http       url                regexp_url

Now, we can copy the files and look for matches.
R8#copy http://http:http@90.0.0.10/route.txt null:
Accessing http://*:*@90.0.0.10/route.txt...
Loading http://*:*@90.0.0.10/route.txt
866 bytes copied in 0.045 secs (19244 bytes/sec)
R8#copy http://http:http@90.0.0.10/cef.txt null:
Accessing http://*:*@90.0.0.10/cef.txt...
Loading http://*:*@90.0.0.10/cef.txt
664 bytes copied in 0.045 secs (14756 bytes/sec)
R9#show policy-map interface gig2.589
 GigabitEthernet2.589
  Service-policy input: PMAP_HTTP
    Class-map: CMAP_HTTP (match-all)
      22 packets, 1858 bytes
      5 minute offered rate 0000 bps
      Match: protocol http url "*route\.txt*"
    Class-map: CMAP_HTTP_CUSTOM (match-all)
      22 packets, 1850 bytes
      5 minute offered rate 0000 bps
      Match: protocol HTTP_CEF_FILE
    Class-map: class-default (match-any)
      16 packets, 960 bytes
      5 minute offered rate 0000 bps, drop rate 0000 bps
      Match: any

We see some class-default matches because the underlying TCP setup/teardown process is not HTTP-specific and does not contain HTTP URL information. With a bigger file this overhead becomes less significant, because the number of class-default packets should stay about the same while the number of HTTP packets grows. As always, let’s prove it. We should see approximately 8 class-default packets (we saw 16 for two downloads), and many more data packets, when downloading a large file. I can’t explain the MQC byte counters, but I suspect it is because NBAR is sampling only certain packets in the flow to conserve CPU/memory resources. Later versions of code allow you to select coarse or fine granularity when sampling; coarse is less accurate but conserves resources.
R10#show tech-support | append bootflash:route.txt
R10#dir bootflash: | include route
   39  -rw-      642828   MON 3 2015 00:14:18 +00:00  route.txt
R8#copy http://http:http@90.0.0.10/route.txt null:
Accessing http://*:*@90.0.0.10/route.txt...
Loading http://*:*@90.0.0.10/route.txt !!!
642828 bytes copied in 0.448 secs (1434884 bytes/sec)
R9#show policy-map interface gig2.589
 GigabitEthernet2.589
  Service-policy input: PMAP_HTTP
    Class-map: CMAP_HTTP (match-all)
      2792 packets, 162518 bytes
      5 minute offered rate 0000 bps
      Match: protocol http url "*route\.txt*"
    Class-map: class-default (match-any)
      8 packets, 480 bytes
      5 minute offered rate 0000 bps, drop rate 0000 bps
      Match: any

The design reason to use the NBAR option (cef.txt) over the MQC inline option (route.txt) for HTTP matching is NBAR attributes. I’ve added an attribute-map for the CEF inspection. Because the custom protocol was defined outside of a specific class-map, this protocol can be customized. Of course the FIB has nothing to do with “file-sharing”, so the attributes below are for testing only.
! CSR9
ip nbar attribute-map HTTP_CEF_ATT_MAP
 attribute category file-sharing
 attribute encrypted encrypted-no
 attribute sub-category file-sharing
ip nbar attribute-set HTTP_CEF_FILE HTTP_CEF_ATT_MAP

Notice how some fields, like P2P and tunnel, remain unassigned because my attribute-map did not modify them. Other fields, such as category, sub-category, and encrypted, were modified. The first show command shows the map as a stand-alone entity, and the second show command is used to validate that the attribute-set operation worked; that is, that the attribute-map was properly applied to the protocol in question.
R9#show ip nbar attribute-map HTTP_CEF_ATT_MAP
Profile Name      : HTTP_CEF_ATT_MAP
category          : file-sharing
encrypted         : encrypted-no
sub-category      : file-sharing
R9#show ip nbar protocol-attribute HTTP_CEF_FILE
Protocol Name     : HTTP_CEF_FILE
application-group : other
category          : file-sharing
encrypted         : encrypted-no
p2p-technology    : p2p-tech-unassigned
sub-category      : file-sharing
tunnel            : tunnel-unassigned
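The overlay behavior of “attribute-set” (listed attributes are overwritten, unlisted ones are left alone) amounts to a dictionary merge. The sketch below is only a model of that behavior: the starting values are assumed to be the same generic defaults shown earlier for the custom UDP_RADIUS_OLD protocol, and `attribute_set` is a hypothetical helper name.

```python
# Assumed defaults for a freshly defined custom protocol (as seen earlier
# for UDP_RADIUS_OLD before any attribute-map was applied).
CUSTOM_DEFAULTS = {
    "application-group": "other",
    "category": "other",
    "encrypted": "encrypted-unassigned",
    "p2p-technology": "p2p-tech-unassigned",
    "sub-category": "other",
    "tunnel": "tunnel-unassigned",
}

# The three attributes listed in HTTP_CEF_ATT_MAP.
HTTP_CEF_ATT_MAP = {
    "category": "file-sharing",
    "encrypted": "encrypted-no",
    "sub-category": "file-sharing",
}


def attribute_set(current, attribute_map):
    """Overlay the map on the current attributes; unlisted keys are untouched."""
    return {**current, **attribute_map}


result = attribute_set(CUSTOM_DEFAULTS, HTTP_CEF_ATT_MAP)
for name, value in result.items():
    print(f"{name:18}: {value}")
```

The merged result lines up with the show output: three fields change, while application-group, p2p-technology, and tunnel keep their original values.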

Let’s double check and ensure NBAR registered these attributes with our protocol using the reverse-lookup mechanism.
R9#show ip nbar attribute category file-sharing | include CEF
HTTP_CEF_FILE       User defined Protocol HTTP_CEF_FILE
R9#show ip nbar attribute sub-category file-sharing | include CEF
HTTP_CEF_FILE       User defined Protocol HTTP_CEF_FILE
R9#show ip nbar attribute encrypted encrypted-no | include CEF
HTTP_CEF_FILE       User defined Protocol HTTP_CEF_FILE

The policy-map PMAP_HTTP_ATT was created to match this traffic now based on attributes.
! CSR9
class-map match-all CMAP_HTTP_ATT
 match protocol attribute category file-sharing
policy-map PMAP_HTTP_ATT
 class CMAP_HTTP_ATT

We can see the matches below. Of course, this policy would also match anything else displayed by the “show ip nbar attribute category file-sharing” command, as all of these protocols share this category attribute.
R8#copy http://http:http@90.0.0.10/cef.txt null:
Accessing http://*:*@90.0.0.10/cef.txt...
Loading http://*:*@90.0.0.10/cef.txt
664 bytes copied in 0.041 secs (16195 bytes/sec)
R9#show policy-map interface gig2.589
 GigabitEthernet2.589
  Service-policy input: PMAP_HTTP_ATT
    Class-map: CMAP_HTTP_ATT (match-all)
      22 packets, 1850 bytes
      5 minute offered rate 0000 bps
      Match: protocol attribute category file-sharing
    Class-map: class-default (match-any)
      8 packets, 480 bytes
      5 minute offered rate 0000 bps, drop rate 0000 bps
      Match: any

30.6.4 NBAR Protocol-ID
NBAR can be used to determine a specific “protocol-id” for a given parameter, along with the type. For common layer-3/layer-4 protocols, the IP protocol number or transport port number is used for the protocol-id, respectively. Here are some examples of layer-3 and layer-4 protocol-IDs. You can use this, along with the NBAR port-map, to determine port/protocol numbers. Some people do not know the IP protocol number for RSVP, for example, as it is only really used for MPLS-TE, call admission control, and a few other topics. TFTP is based on UDP and uses port 69, as shown below.
R9#show ip nbar protocol-id rsvp
Protocol Name             id        type
----------------------------------------------
rsvp                      46        L3 IANA
R9#show ip nbar protocol-id tftp
Protocol Name             id        type
----------------------------------------------
tftp                      69        L4 IANA

Some things are not IANA assigned but are still semi-common. Notice that the protocol-id is the same value for TFTP and BitTorrent, but the type of protocol is different. I am assuming that “L7 STANDARD” is some commonly agreed-upon numbering scheme for applications not tracked or managed by IANA. I have not found any authoritative documentation on what that means for certain.
R9#show ip nbar protocol-id bittorrent
Protocol Name             id        type
----------------------------------------------
bittorrent                69        L7 STANDARD

Custom protocols get an internally allocated protocol-id, which is a larger number.
R9#show ip nbar protocol-id FIND_BEEF
Protocol Name             id        type
----------------------------------------------
FIND_BEEF                 242       Custom

You can specify this ID for tracking/collection purposes when you define a new protocol using the “id” suffix keyword. The relevance of this number for custom protocols is generally limited to Netflow exports because NBAR can be integrated with FNF (shown under the FNF section of this document).
R9(config)#ip nbar custom PROTO_ID destination tcp 55557 id 57
R9#show ip nbar protocol-id PROTO_ID
Protocol Name             id        type
----------------------------------------------
PROTO_ID                  57        Custom

30.6.5 NBAR Protocol Discovery
NBAR can also be used as a mechanism for protocol discovery. It’s like Netflow except less granular, but also very easy and automatic. It can be enabled at the interface level for IPv4, IPv6, or both. We have applied it to the link facing CSR8. I’ve performed a file download via HTTP, a telnet session to CSR10, and some pings. The main show command by itself prints verbose output, showing several counters per protocol it learns.
! CSR9
interface GigabitEthernet2.589
 ip nbar protocol-discovery
R9#show ip nbar protocol-discovery
 GigabitEthernet2.589
 Last clearing of "show ip nbar protocol-discovery" counters 00:01:31
                            Input                    Output
                            -----                    ------
 Protocol                   Packet Count             Packet Count
                            Byte Count               Byte Count
                            5min Bit Rate (bps)      5min Bit Rate (bps)
                            5min Max Bit Rate (bps)  5min Max Bit Rate (bps)
 ------------------------   ------------------------ ------------------------
 HTTP_CEF_FILE              22                       20
                            1850                     3380
                            0                        0
                            0                        1000
 telnet                     26                       25
                            1552                     1517
                            0                        0
                            0                        0
 ping                       5                        5
                            590                      590
                            0                        0
                            0                        0
 unknown                    10                       5
                            600                      310
                            0                        0
                            0                        0
 Total                      63                       55
                            4592                     5797
                            0                        0
                            0                        1000
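As a quick sanity check on the verbose output, the Total row should equal the sum of the per-protocol rows in each direction. The Python sketch below simply re-adds the packet and byte counters copied from the output above; the data layout and the `totals` helper are illustrative only.

```python
# (packet count, byte count) per protocol, copied from the show output.
INPUT = {
    "HTTP_CEF_FILE": (22, 1850),
    "telnet": (26, 1552),
    "ping": (5, 590),
    "unknown": (10, 600),
}
OUTPUT = {
    "HTTP_CEF_FILE": (20, 3380),
    "telnet": (25, 1517),
    "ping": (5, 590),
    "unknown": (5, 310),
}


def totals(counters):
    """Sum packet and byte counters across all discovered protocols."""
    packets = sum(pkts for pkts, _ in counters.values())
    octets = sum(size for _, size in counters.values())
    return packets, octets


print("input ", totals(INPUT))   # prints: input  (63, 4592)
print("output", totals(OUTPUT))  # prints: output (55, 5797)
```

Both sums agree with the Total row, which confirms that the “unknown” bucket is included in the interface totals.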

We can narrow down this output by selecting only specific rows (packet count, byte count, etc).
R9#show ip nbar protocol-discovery stats packet-count
 GigabitEthernet2.589
 Last clearing of "show ip nbar protocol-discovery" counters 00:01:20
                            Input                    Output
                            -----                    ------
 Protocol                   Packet Count             Packet Count
 ------------------------   ------------------------ ------------------------
 telnet                     26                       25
 HTTP_CEF_FILE              22                       20
 ping                       5                        5
 unknown                    10                       5
 Total                      63                       55

We can narrow it even further by restricting it to a single protocol as well. When you do this, the “unknown” traffic is always displayed. In this particular case, the unknown traffic is the IP SLA traffic using bogus ports from earlier tests (some were still running).
R9#show ip nbar protocol-discovery stats packet-count protocol telnet
 GigabitEthernet2.589
 Last clearing of "show ip nbar protocol-discovery" counters 00:03:25
                            Input                    Output
                            -----                    ------
 Protocol                   Packet Count             Packet Count
 ------------------------   ------------------------ ------------------------
 telnet                     26                       25
 unknown                    10                       5
 Total                      63                       55

The nice thing about this feature is that it honors your custom protocols at all times. Even without configuring MQC anywhere, you can see if your custom protocols are traversing the network. Earlier we looked at using NBAR to find a certain hex string, specifically 0x0000BEEF. If we turn on that IP SLA again, NBAR should be able to see that protocol without MQC. Unfortunately, this feature doesn’t work well with our STOP_VIRUS protocol, because we abstracted the last 2 bytes to the MQC class-map, so that protocol will essentially match anything on UDP 55556. That is one downside to the MQC abstraction method (with the “variable” keyword); it relies on MQC being configured and applied.
R9#show policy-map interface brief
[no output]
R9#show ip nbar protocol-discovery protocol FIND_BEEF
 GigabitEthernet2.589
 Last clearing of "show ip nbar protocol-discovery" counters 00:00:25
                            Input                    Output
                            -----                    ------
 Protocol                   Packet Count             Packet Count
                            Byte Count               Byte Count
                            5min Bit Rate (bps)      5min Bit Rate (bps)
                            5min Max Bit Rate (bps)  5min Max Bit Rate (bps)
 ------------------------   ------------------------ ------------------------
 FIND_BEEF                  5                        0
                            310                      0
                            0                        0
                            0                        0

31. Describe, implement, and troubleshoot MPLS TE / QoS mechanisms
31.1 MPLS RSVP-TE (General)
The configurations for all MPLS TE features are contained in the files below (manual TE, FRR, CBTS/PBTS), with the exception of DS-TE and automatic tunnels. Each of those has an alternative architecture since it was not conducive to mix them in a test with the vast majority of existing TE features.
Additional Reading – Reference configurations "te-manual"
31.1.1 TE Topology (TED) construction and RSVP-TE signaling
MPLS traffic engineering is one of MPLS' most powerful features. It allows for completely arbitrary traffic steering from the source to destination based on a variety of attributes, such as available bandwidth, link colors, and other characteristics. OSPF and IS-IS were extended to carry these attributes. Also, RSVP was extended to RSVP-TE to support MPLS-TE, which includes some new objects. First, we will examine how IS-IS exchanges TE information. The diagram is shown below. The topology is large and densely connected so as to create many routing paths. CSR7, XRv13, and XRv14 are all connected by both an L3VPN and L2VPN so we can demonstrate TE features for both (there isn't much difference, really). Most of the testing will use IS-IS but it isn't significant since both IS-IS and OSPF carry the same information, just in different ways. MPLS-TE attribute information is limited to an IS-IS level or OSPF area, so the entire topology is IS-IS L2. OSPF will be examined later. Inter-area TE is examined in the Unified MPLS section while Inter-AS TE is examined in the Inter-AS MPLS sections.

Because everything is intra-level, every router can see every link characteristic in the topology. This means that we can fully validate a correct configuration by accessing only one router. From the perspective of CSR1, we can check the TE topology and look specifically at the MPLS TE router-ID. This is normally a loopback on each router, and it must be configured under the IGP process. TE must also be enabled for an IS-IS level or OSPF area. In the case of IS-IS, wide metrics must be enabled to carry the TE attributes. A very cursory check of the topology at least shows we have all 11 LSRs running TE under IS-IS. We can see the IS-IS NETs and the TE-ID, which is the loopback0 IPv4 address.
! XE routers
router isis 132
 metric-style wide
 mpls traffic-eng router-id Loopback0
 mpls traffic-eng level-2
! XR routers
router isis 132
 address-family ipv4 unicast
  metric-style wide
  mpls traffic-eng router-id Loopback0
  mpls traffic-eng level-2
CSR1#show mpls traffic-eng topology level-2 brief | include TE_Id
IGP Id: 0000.0000.0001.00, MPLS TE Id:1.1.1.1 Router Node (isis level-2)
IGP Id: 0000.0000.0002.00, MPLS TE Id:2.2.2.2 Router Node (isis level-2)
IGP Id: 0000.0000.0003.00, MPLS TE Id:3.3.3.3 Router Node (isis level-2)
IGP Id: 0000.0000.0004.00, MPLS TE Id:4.4.4.4 Router Node (isis level-2)
IGP Id: 0000.0000.0005.00, MPLS TE Id:5.5.5.5 Router Node (isis level-2)
IGP Id: 0000.0000.0006.00, MPLS TE Id:6.6.6.6 Router Node (isis level-2)
IGP Id: 0000.0000.0008.00, MPLS TE Id:8.8.8.8 Router Node (isis level-2)
IGP Id: 0000.0000.0009.00, MPLS TE Id:9.9.9.9 Router Node (isis level-2)
IGP Id: 0000.0000.0010.00, MPLS TE Id:10.10.10.10 Router Node (isis level-2)
IGP Id: 0000.0000.0011.00, MPLS TE Id:11.11.11.11 Router Node (isis level-2)
IGP Id: 0000.0000.0012.00, MPLS TE Id:12.12.12.12 Router Node (isis level-2)

If we forget to set an MPLS TE-ID on a router, this quick check will reveal that. It can be a simple mistake. We remove it under CSR3 as a test; it no longer shows up. CSR3 is now incapable of running MPLS-TE. We will see this in action later when we build TE tunnels.
CSR3(config-router)#no mpls traffic-eng router-id
CSR1#show mpls traffic-eng topology level-2 brief | include TE_Id
IGP Id: 0000.0000.0001.00, MPLS TE Id:1.1.1.1 Router Node (isis level-2)
IGP Id: 0000.0000.0002.00, MPLS TE Id:2.2.2.2 Router Node (isis level-2)
IGP Id: 0000.0000.0004.00, MPLS TE Id:4.4.4.4 Router Node (isis level-2)
IGP Id: 0000.0000.0005.00, MPLS TE Id:5.5.5.5 Router Node (isis level-2)
IGP Id: 0000.0000.0006.00, MPLS TE Id:6.6.6.6 Router Node (isis level-2)
IGP Id: 0000.0000.0008.00, MPLS TE Id:8.8.8.8 Router Node (isis level-2)
IGP Id: 0000.0000.0009.00, MPLS TE Id:9.9.9.9 Router Node (isis level-2)
IGP Id: 0000.0000.0010.00, MPLS TE Id:10.10.10.10 Router Node (isis level-2)
IGP Id: 0000.0000.0011.00, MPLS TE Id:11.11.11.11 Router Node (isis level-2)
IGP Id: 0000.0000.0012.00, MPLS TE Id:12.12.12.12 Router Node (isis level-2)
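Spotting a missing TE-ID by eye gets harder as the topology grows, so the same check can be scripted. The Python sketch below parses the filtered show output taken while CSR3 had no TE router-id and compares it with an expected set; the expected set assumes the N.N.N.N loopback0 scheme used in this lab, and the helper `missing_te_ids` is hypothetical.

```python
import re

# Filtered output captured while CSR3 had its TE router-id removed.
SHOW_OUTPUT = """\
IGP Id: 0000.0000.0001.00, MPLS TE Id:1.1.1.1 Router Node (isis level-2)
IGP Id: 0000.0000.0002.00, MPLS TE Id:2.2.2.2 Router Node (isis level-2)
IGP Id: 0000.0000.0004.00, MPLS TE Id:4.4.4.4 Router Node (isis level-2)
IGP Id: 0000.0000.0005.00, MPLS TE Id:5.5.5.5 Router Node (isis level-2)
IGP Id: 0000.0000.0006.00, MPLS TE Id:6.6.6.6 Router Node (isis level-2)
IGP Id: 0000.0000.0008.00, MPLS TE Id:8.8.8.8 Router Node (isis level-2)
IGP Id: 0000.0000.0009.00, MPLS TE Id:9.9.9.9 Router Node (isis level-2)
IGP Id: 0000.0000.0010.00, MPLS TE Id:10.10.10.10 Router Node (isis level-2)
IGP Id: 0000.0000.0011.00, MPLS TE Id:11.11.11.11 Router Node (isis level-2)
IGP Id: 0000.0000.0012.00, MPLS TE Id:12.12.12.12 Router Node (isis level-2)
"""

# Assumption: each of the 11 LSRs uses loopback0 = N.N.N.N as its TE-ID.
EXPECTED = {f"{n}.{n}.{n}.{n}" for n in (1, 2, 3, 4, 5, 6, 8, 9, 10, 11, 12)}

LINE_RE = re.compile(r"IGP Id: (\S+), MPLS TE Id:(\S+)")


def missing_te_ids(show_output, expected):
    """Return expected TE-IDs that do not appear in the show output."""
    seen = {match.group(2) for match in LINE_RE.finditer(show_output)}
    return sorted(expected - seen)


print(missing_te_ids(SHOW_OUTPUT, EXPECTED))  # prints: ['3.3.3.3']
```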

We will repair CSR3 by adding the TE-ID back. If we change the filter to be more generic and look at all of the "nodes" in the topology, we see a new node type at the bottom. The "network" node is the TE representation of an OSPF DR or IS-IS DIS. Every single link in the topology is OSPF/IS-IS point-to-point specifically to bypass the DR/DIS election, except for the link between CSR2-CSR3. This is just for demonstration purposes and has no operational benefit in this network. The point is that DR/DIS pseudonodes count as TE nodes and are included in the TE topology.
CSR1#show mpls traffic-eng topology level-2 brief | include ^IGP
IGP Id: 0000.0000.0001.00, MPLS TE Id:1.1.1.1 Router Node (isis level-2)
IGP Id: 0000.0000.0002.00, MPLS TE Id:2.2.2.2 Router Node (isis level-2)
IGP Id: 0000.0000.0003.00, MPLS TE Id:3.3.3.3 Router Node (isis level-2)
IGP Id: 0000.0000.0004.00, MPLS TE Id:4.4.4.4 Router Node (isis level-2)
IGP Id: 0000.0000.0005.00, MPLS TE Id:5.5.5.5 Router Node (isis level-2)
IGP Id: 0000.0000.0006.00, MPLS TE Id:6.6.6.6 Router Node (isis level-2)
IGP Id: 0000.0000.0008.00, MPLS TE Id:8.8.8.8 Router Node (isis level-2)
IGP Id: 0000.0000.0009.00, MPLS TE Id:9.9.9.9 Router Node (isis level-2)
IGP Id: 0000.0000.0010.00, MPLS TE Id:10.10.10.10 Router Node (isis level-2)
IGP Id: 0000.0000.0011.00, MPLS TE Id:11.11.11.11 Router Node (isis level-2)
IGP Id: 0000.0000.0012.00, MPLS TE Id:12.12.12.12 Router Node (isis level-2)
IGP Id: 0000.0000.0003.01, Network Node (isis level-2)


Tracing the entire topology is excellent practice for understanding how TE works, but we can validate a few select links for brevity. Looking at the router node for CSR1, we can see 5 attached links. This is consistent with the topology and it also links the neighbors by IS-IS NET. Provided the NETs are logical and easily understandable, we can see the neighbors are CSR8, CSR4, CSR6, CSR9, and CSR3, which is correct. The links are identified as point-to-point and this is very similar to reading OSPF LSA1s. Each link has specific attributes, including TE and IGP metrics, attribute flags (link colors discussed later), and shared risk link groups (SRLG). These detailed characteristics are validated in later sections.
CSR1#show mpls traffic-eng topology level-2 1.1.1.1 brief
IGP Id: 0000.0000.0001.00, MPLS TE Id:1.1.1.1 Router Node (isis level-2)
 link[0]: Point-to-Point, Nbr IGP Id: 0000.0000.0008.00, nbr_node_id:14, gen:1838
   frag_id: 0, Intf Address: 132.1.8.1, Nbr Intf Address: 132.1.8.8
   TE metric: 1, IGP metric: 10, attribute flags: 0x4
   SRLGs: None
 link[1]: Point-to-Point, Nbr IGP Id: 0000.0000.0004.00, nbr_node_id:7, gen:1838
   frag_id: 0, Intf Address: 132.1.4.1, Nbr Intf Address: 132.1.4.4
   TE metric: 10, IGP metric: 10, attribute flags: 0xA
   SRLGs: None
 link[2]: Point-to-Point, Nbr IGP Id: 0000.0000.0006.00, nbr_node_id:12, gen:1838
   frag_id: 0, Intf Address: 132.1.6.1, Nbr Intf Address: 132.1.6.6
   TE metric: 1, IGP metric: 10, attribute flags: 0x4
   SRLGs: None
 link[3]: Point-to-Point, Nbr IGP Id: 0000.0000.0009.00, nbr_node_id:10, gen:1838
   frag_id: 0, Intf Address: 132.1.9.1, Nbr Intf Address: 132.1.9.9
   TE metric: 1, IGP metric: 10, attribute flags: 0x2
   SRLGs: None
 link[4]: Point-to-Point, Nbr IGP Id: 0000.0000.0003.00, nbr_node_id:8, gen:1838
   frag_id: 0, Intf Address: 132.1.3.1, Nbr Intf Address: 132.1.3.3
   TE metric: 10, IGP metric: 10, attribute flags: 0x2
   SRLGs: None

Moving on to CSR3, we can see there is a link to CSR1, plus four more to CSR4, CSR9, CSR10, and a pseudonode identified as 0000.0000.0003.01. Notice that the show command references the IS-IS NET and not the loopback address, as this is an alternative method of verification. Stepping through the network in this way is the best way to validate that the network has been configured properly.
CSR1#show mpls traffic-eng topology level-2 igp-id isis 0000.0000.0003.00 brief
IGP Id: 0000.0000.0003.00, MPLS TE Id:3.3.3.3 Router Node (isis level-2)
 link[0]: Point-to-Point, Nbr IGP Id: 0000.0000.0001.00, nbr_node_id:6, gen:1847
   frag_id: 0, Intf Address: 132.1.3.3, Nbr Intf Address: 132.1.3.1
   TE metric: 10, IGP metric: 10, attribute flags: 0x2
   SRLGs: None
 link[1]: Point-to-Point, Nbr IGP Id: 0000.0000.0004.00, nbr_node_id:7, gen:1847
   frag_id: 0, Intf Address: 132.3.4.3, Nbr Intf Address: 132.3.4.4
   TE metric: 10, IGP metric: 10, attribute flags: 0x9
   SRLGs: None
 link[2]: Broadcast, DR: 0000.0000.0003.01, nbr_node_id:26, gen:1847
   frag_id: 0, Intf Address: 132.2.3.3
   TE metric: 10, IGP metric: 10, attribute flags: 0xE
   SRLGs: None
 link[3]: Point-to-Point, Nbr IGP Id: 0000.0000.0009.00, nbr_node_id:10, gen:1847
   frag_id: 0, Intf Address: 132.3.9.3, Nbr Intf Address: 132.3.9.9
   TE metric: 10, IGP metric: 10, attribute flags: 0x5
   SRLGs: None
 link[4]: Point-to-Point, Nbr IGP Id: 0000.0000.0010.00, nbr_node_id:18, gen:1847
   frag_id: 0, Intf Address: 132.3.10.3, Nbr Intf Address: 132.3.10.10
   TE metric: 10, IGP metric: 10, attribute flags: 0x0
   SRLGs: None

Next, we examine the pseudonode, which is the only one in the topology. The IGP ID is similar to the IS-IS NET of the DIS, minus the area, and with a non-zero final octet (the pseudonode ID, which sits where the N-selector appears in a NET). CSR3 is the DIS and the pseudonode connects to CSR3 and CSR2. This node isn't a real physical router in the graph and does not have any specific MPLS-TE attributes. Just like an OSPF LSA2 or IS-IS DIS LSP, it just connects other routers; all costs are zero.

CSR1#show mpls traffic-eng topology level-2 igp-id isis 0000.0000.0003.01 brief
IGP Id: 0000.0000.0003.01, Network Node (isis level-2)
  link[0]: Broadcast, Nbr IGP Id: 0000.0000.0003.00, nbr_node_id:8, gen:1848
  link[1]: Broadcast, Nbr IGP Id: 0000.0000.0002.00, nbr_node_id:16, gen:1848

From this, we can look at CSR2. We will use the IP address format for variety in the show command. CSR2 is connected to the DIS (CSR3), CSR5, CSR9, and CSR10, which is correct per the diagram. This process should be repeated for all nodes in the topology to verify the TE database (TED) is correct. CSR1#show mpls traffic-eng topology level-2 2.2.2.2 brief IGP Id: 0000.0000.0002.00, MPLS TE Id:2.2.2.2 Router Node (isis level-2) link[0]: Point-to-Point, Nbr IGP Id: 0000.0000.0005.00, nbr_node_id:21, gen:1849 frag_id: 0, Intf Address: 132.2.5.2, Nbr Intf Address: 132.2.5.5 TE metric: 10, IGP metric: 10, attribute flags: 0x9 SRLGs: None link[1]: Point-to-Point, Nbr IGP Id: 0000.0000.0009.00, nbr_node_id:10, gen:1849 frag_id: 0, Intf Address: 132.2.9.2, Nbr Intf Address: 132.2.9.9 TE metric: 10, IGP metric: 10, attribute flags: 0x1 SRLGs: None link[2]: Broadcast, DR: 0000.0000.0003.01, nbr_node_id:26, gen:1849 frag_id: 0, Intf Address: 132.2.3.2


TE metric: 10, IGP metric: 10, attribute flags: 0xE SRLGs: None link[3]: Point-to-Point, Nbr IGP Id: 0000.0000.0010.00, nbr_node_id:18, gen:1849 frag_id: 0, Intf Address: 132.2.10.2, Nbr Intf Address: 132.2.10.10 TE metric: 10, IGP metric: 10, attribute flags: 0xF SRLGs: None
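Stepping through every node by hand becomes tedious in larger topologies. As a rough illustration only, the Python sketch below pulls the neighbor IDs, metrics, and attribute flags out of the "brief" output for CSR2 shown above. The regular expressions and the function name parse_links are assumptions made for this example; they are not part of any Cisco tooling and have only been matched against the output format shown in this section.

```python
import re

# Trimmed copy of the CSR2 "brief" output shown above.
SAMPLE = """\
IGP Id: 0000.0000.0002.00, MPLS TE Id:2.2.2.2 Router Node (isis level-2)
link[0]: Point-to-Point, Nbr IGP Id: 0000.0000.0005.00, nbr_node_id:21, gen:1849
frag_id: 0, Intf Address: 132.2.5.2, Nbr Intf Address: 132.2.5.5
TE metric: 10, IGP metric: 10, attribute flags: 0x9
link[1]: Point-to-Point, Nbr IGP Id: 0000.0000.0009.00, nbr_node_id:10, gen:1849
frag_id: 0, Intf Address: 132.2.9.2, Nbr Intf Address: 132.2.9.9
TE metric: 10, IGP metric: 10, attribute flags: 0x1
link[2]: Broadcast, DR: 0000.0000.0003.01, nbr_node_id:26, gen:1849
frag_id: 0, Intf Address: 132.2.3.2
TE metric: 10, IGP metric: 10, attribute flags: 0xE
link[3]: Point-to-Point, Nbr IGP Id: 0000.0000.0010.00, nbr_node_id:18, gen:1849
frag_id: 0, Intf Address: 132.2.10.2, Nbr Intf Address: 132.2.10.10
TE metric: 10, IGP metric: 10, attribute flags: 0xF
"""

# Hypothetical patterns written for this output format only.
LINK_RE = re.compile(
    r"link\[(\d+)\]: (Point-to-Point|Broadcast), (?:Nbr IGP Id|DR): ([0-9A-Fa-f.]+),"
)
METRIC_RE = re.compile(
    r"TE metric: (\d+), IGP metric: (\d+), attribute flags: (0x[0-9A-Fa-f]+)"
)


def parse_links(text):
    """Return one dict per link[] entry found in the topology output."""
    links = []
    for line in text.splitlines():
        match = LINK_RE.search(line)
        if match:
            links.append({
                "index": int(match.group(1)),
                "type": match.group(2),
                "neighbor": match.group(3),
            })
            continue
        match = METRIC_RE.search(line)
        if match and links:
            # Metrics always follow the link[] line they belong to.
            links[-1]["te_metric"] = int(match.group(1))
            links[-1]["igp_metric"] = int(match.group(2))
            links[-1]["flags"] = int(match.group(3), 16)
    return links


links = parse_links(SAMPLE)
for link in links:
    print(link["index"], link["type"], link["neighbor"], hex(link["flags"]))
```

Comparing the resulting neighbor list against the diagram for each node is the same check performed manually above.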

Omitting the "brief" keyword will also include all of the link bandwidth information for each of the eight available priorities (discussed later). This is how TE tracks how much bandwidth is available.

CSR1#show mpls traffic-eng topology level-2 2.2.2.2
IGP Id: 0000.0000.0002.00, MPLS TE Id:2.2.2.2 Router Node (isis level-2) id 16
  link[0]: Point-to-Point, Nbr IGP Id: 0000.0000.0005.00, nbr_node_id:21, gen:1849
    frag_id: 0, Intf Address: 132.2.5.2, Nbr Intf Address: 132.2.5.5
    TE metric: 10, IGP metric: 10, attribute flags: 0x9
    SRLGs: None
    physical_bw: 1000000 (kbps), max_reservable_bw_global: 100000 (kbps)
    max_reservable_bw_sub: 0 (kbps)

                                 Global Pool     Sub Pool
             Total Allocated     Reservable      Reservable
             BW (kbps)           BW (kbps)       BW (kbps)
             ---------------     -----------     ----------
      bw[0]:               0          100000              0
      bw[1]:               0          100000              0
      bw[2]:               0          100000              0
      bw[3]:               0          100000              0
      bw[4]:               0          100000              0
      bw[5]:               0          100000              0
      bw[6]:               0          100000              0
      bw[7]:               0          100000              0

  link[1]: Point-to-Point, Nbr IGP Id: 0000.0000.0009.00, nbr_node_id:10, gen:1849
    frag_id: 0, Intf Address: 132.2.9.2, Nbr Intf Address: 132.2.9.9
    TE metric: 10, IGP metric: 10, attribute flags: 0x1
    SRLGs: None
    physical_bw: 1000000 (kbps), max_reservable_bw_global: 100000 (kbps)
    max_reservable_bw_sub: 0 (kbps)

                                 Global Pool     Sub Pool
             Total Allocated     Reservable      Reservable
             BW (kbps)           BW (kbps)       BW (kbps)
             ---------------     -----------     ----------
      bw[0]:               0          100000              0
      bw[1]:               0          100000              0
      bw[2]:               0          100000              0
      bw[3]:               0          100000              0
      bw[4]:               0          100000              0
      bw[5]:               0          100000              0
      bw[6]:               0          100000              0
      bw[7]:               0          100000              0
[snip]
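To make the per-priority columns more concrete, here is a minimal Python sketch of how the eight reservable-bandwidth values could change when an LSP is admitted on a link. It assumes the usual RSVP-TE accounting: the request is checked against the value at the LSP's setup priority, and once admitted the bandwidth is subtracted at the hold priority and every numerically higher (less important) priority. The function name admit is hypothetical, and preemption is deliberately not modeled.

```python
MAX_RESERVABLE_KBPS = 100000  # max_reservable_bw_global from the output above


def admit(unreserved, bw_kbps, setup_pri, hold_pri):
    """Return the new 8-entry unreserved list, or None if the LSP does not fit.

    Simplified model: no preemption. If admitting the LSP would require
    bumping lower-priority LSPs, the function raises instead of guessing.
    """
    if bw_kbps > unreserved[setup_pri]:
        return None  # not enough bandwidth visible at this setup priority
    updated = [
        value - bw_kbps if priority >= hold_pri else value
        for priority, value in enumerate(unreserved)
    ]
    if min(updated) < 0:
        raise NotImplementedError("preemption would be required; not modeled")
    return updated


idle_link = [MAX_RESERVABLE_KBPS] * 8
after_first = admit(idle_link, 30000, setup_pri=3, hold_pri=3)
print(after_first)

# A second LSP at priority 7 asking for 80000 kbps no longer fits.
print(admit(after_first, 80000, setup_pri=7, hold_pri=7))
```

In this model, priorities 0 through 2 still see the full 100000 kbps after the first reservation, which is why a more important LSP could still be signaled across the same link.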


The parser syntax above is generic for all IGPs that support TE distribution, which makes it the preferred way to look at the TED. The information itself is carried differently in different protocols. We can look at the raw information within IS-IS as well with the "verbose" option. From CSR1, we look at CSR8. All of the TE attributes are carried as attributes of the individual links (the IS-Extended neighbors). The TE metric is only carried if it is modified, which we will see later. For low-level debugging, this is the best way to see the source of the information, especially if the TED information appears incorrect.

CSR1#show isis database level-2 CSR8.00-00 verbose
Tag 132:
IS-IS Level-2 LSP CSR8.00-00
LSPID                 LSP Seq Num  LSP Checksum  LSP Holdtime  ATT/P/OL
CSR8.00-00            0x000002AC   0x8A57        853           0/0/0
  Auth: Length: 17
  Area Address: 00
  NLPID: 0xCC 0x8E
  Topology: IPv4 (0x0)
            IPv6 (0x2)
  Router ID: 8.8.8.8
  Hostname: CSR8
  Metric: 10 IS (MT-IPv6) CSR9.00
  Metric: 10 IS (MT-IPv6) CSR6.00
  Metric: 10 IS (MT-IPv6) CSR1.00
  Metric: 10 IS-Extended CSR6.00
    Affinity: 0x00000006
    Interface IP Address: 132.6.8.8
    Neighbor IP Address: 132.6.8.6
    Physical BW: 1000000 kbits/sec
    Reservable Global Pool BW: 100000 kbits/sec
    Global Pool BW Unreserved:
      [0]: 100000 kbits/sec, [1]: 100000 kbits/sec
      [2]: 100000 kbits/sec, [3]: 100000 kbits/sec
      [4]: 100000 kbits/sec, [5]: 100000 kbits/sec
      [6]: 100000 kbits/sec, [7]: 100000 kbits/sec
  Metric: 10 IS-Extended CSR1.00
    Affinity: 0x00000004
    Interface IP Address: 132.1.8.8
    Neighbor IP Address: 132.1.8.1
    Physical BW: 1000000 kbits/sec
    Reservable Global Pool BW: 100000 kbits/sec
    Global Pool BW Unreserved:
      [0]: 100000 kbits/sec, [1]: 100000 kbits/sec
      [2]: 100000 kbits/sec, [3]: 100000 kbits/sec
      [4]: 100000 kbits/sec, [5]: 100000 kbits/sec
      [6]: 100000 kbits/sec, [7]: 100000 kbits/sec
  Metric: 10 IS-Extended CSR9.00
    Affinity: 0x00000001
    Interface IP Address: 132.8.9.8
    Neighbor IP Address: 132.8.9.9
    Physical BW: 1000000 kbits/sec
    Reservable Global Pool BW: 100000 kbits/sec
    Global Pool BW Unreserved:
      [0]: 100000 kbits/sec, [1]: 100000 kbits/sec
      [2]: 100000 kbits/sec, [3]: 100000 kbits/sec
      [4]: 100000 kbits/sec, [5]: 100000 kbits/sec
      [6]: 100000 kbits/sec, [7]: 100000 kbits/sec
  IP Address: 8.8.8.8
  Metric: 0 IP 8.8.8.8/32
  IPv6 Address: ::8:8:8:8
  Metric: 0 IPv6 (MT-IPv6) ::8:8:8:8/128

We can also look at topology information using some TE link-level commands. We can look at TE-IGP neighbors, link attributes, and other things. The commands are verbose and, for the most part, show the same information we've already seen, except with link-level granularity and in a different display format. Below we look at the TE neighbors as seen by IGP as well as the TE link summary. There isn’t anything noteworthy about these commands by themselves, but I document them for completeness. Because this command details the IGP neighbor addresses as seen by TE, it might be valuable for explicit-path construction (discussed later).

CSR1#show mpls traffic-eng link-management igp-neighbors
Link ID:: Gi2.513
  Neighbor ID: 0000.0000.0003.00 (area: isis level-2, IP: 132.1.3.3)
  up, Sources: IGP
Link ID:: Gi2.514
  Neighbor ID: 0000.0000.0004.00 (area: isis level-2, IP: 132.1.4.4)
  up, Sources: IGP
Link ID:: Gi2.516
  Neighbor ID: 0000.0000.0006.00 (area: isis level-2, IP: 132.1.6.6)
  up, Sources: IGP
Link ID:: Gi2.518
  Neighbor ID: 0000.0000.0008.00 (area: isis level-2, IP: 132.1.8.8)
  up, Sources: IGP
Link ID:: Gi2.519
  Neighbor ID: 0000.0000.0009.00 (area: isis level-2, IP: 132.1.9.9)
  up, Sources: IGP

CSR1#show mpls traffic-eng link-management summary System Information:: Links Count: 5 Flooding System: enabled IGP Area ID:: isis level-2 Flooding Protocol: ISIS Flooding Status: data flooded Periodic Flooding: enabled (every 60 seconds, next in 28 seconds) Flooded Links: 5 IGP System ID: 0000.0000.0001.00


MPLS TE Router ID: 1.1.1.1 Neighbors: 5 Link ID:: Gi2.513 (132.1.3.1) Local Intfc ID: 21 Link Status: SRLGs: None Intfc Switching Capability Descriptors: Default: Intfc Switching Cap psc1, Encoding ethernet Link Label Type: Packet Physical Bandwidth: 1000000 kbits/sec Max Res Global BW: 100000 kbits/sec (reserved: 0% in, 3% out) Max Res Sub BW: 0 kbits/sec (reserved: 100% in, 100% out) MPLS TE Link State: MPLS TE on, RSVP on, admin-up, flooded, allocated Inbound Admission: reject-huge Outbound Admission: allow-if-room Link MTU: IP 1500, MPLS 1500 Admin. Weight: 10 (IGP) IGP Neighbor Count: 1 [snip]

We can also look at LSP statistical information. This will show the number of setup/teardown requests, tunnels admitted, tunnels rejected, and other information. This can be helpful for troubleshooting tunnel flaps. CSR1#show mpls traffic-eng link-management statistics System Information:: LSP Admission Statistics: Path: 140 setup requests, 139 admits, 1 rejects, 0 setup errors 138 tear requests, 0 preempts, 0 tear errors Resv: 147 setup requests, 147 admits, 0 rejects, 0 setup errors 133 tear requests, 0 preempts, 0 tear errors Link ID:: Gi2.513 (132.1.3.1) Link Admission Statistics: Up Path: 38 setup requests, 38 admits, 0 rejects, 0 setup errors 37 tear requests, 0 preempts, 0 tear errors Up Resv: 38 setup requests, 38 admits, 0 rejects, 0 setup errors 71 tear requests, 0 preempts, 0 tear errors Down Path: 76 setup requests, 76 admits, 0 rejects, 0 setup errors 75 tear requests, 0 preempts, 0 tear errors Down Resv: 84 setup requests, 84 admits, 0 rejects, 0 setup errors 37 tear requests, 0 preempts, 0 tear errors [snip]

The mechanism by which TE tunnels are signaled is RSVP-TE. Segment routing TE (SR-TE) is very new but not in scope for the lab exam, and is not documented in this chapter. There are three key messages used by RSVP to signal TE LSPs.


PATH: Message sent hop-by-hop from source to destination, or in MPLS terms, head to tail. It carries the previous hop (PHOP) so that the tail end knows how to reply along the same path. After RSVP-TE extensions, the PATH message carries new objects:
    Label Request Object (LRO): Used to request TE labels along the path, but does not carry the label values directly as this information needs to flow from tail to head (RESV messages).
    Explicit Route Object (ERO): This is the output of the path calculation (PCALC) algorithm which essentially tells the PATH messages which way to go. Since RSVP is used for signaling only, it is not expected to have direct visibility into the TED as this is PCALC’s duty.
    Record Route Object (RRO): Used to record the route taken by the PATH message, giving a “record” of the LSP to downstream nodes. This can be valuable when using loose-hop expansion to prevent signaling loops.
    Session Attribute Object (SAO): Carries information about the session, such as the mode of operation (shared explicit, described later), node/bandwidth protection flags, and fast-reroute.
    Sender Tspec: Carries bandwidth reservation information as an average rate.

RESV: Message sent hop-by-hop using the PHOPs learned from the PATH message. It is impossible for the RESV messages to follow a different path than the PATH messages. The next-hop (NHOP) is carried inside the RESV messages so that the upstream nodes know the proper direction of the LSP.
    Label object: This message primarily carries the label object, which is the actual label value of the TE tunnel. Coupled with the NHOP value, this defines the LSP forwarding plane.
    Record Route Object (RRO): Used to record the route taken by the RESV message, giving a “record” of the LSP to upstream nodes. This can be valuable when using loose-hop expansion to see the actual path that was signaled.

PATHERR: An error message to signal something went awry during RSVP signaling. This could be insufficient bandwidth or a link failure, as examples. Upon receipt of a PATHERR message, the headend will recompute the path.

First, we will examine PCALC operation in detail using a very basic TE tunnel from CSR1 to CSR5 [ID 1]. TE tunnel configuration in XE has some awkwardness; the tunnel is not running IP, yet requires an IP address in the form of the TE-ID (unnumbered). The tunnel source is unspecified, but the destination is the remote TE-ID (tail end). I also clear all affinity bits so link coloring is not a factor; link coloring is discussed later. The PCALC process is also called constrained shortest path first (CSPF), which takes all of the tunnel constraints into consideration, finds the shortest path that meets the constraints, and assembles the hop-by-hop path into an ERO. This gets fed into the RSVP process via the PATH message for signaling. Before creating the tunnel, we enable debug on CSR1 so we can watch the PCALC process. The "lookup" debug is typically the more valuable one, but we can also check the "spf" debug occasionally for extra detail. The SPF output is extensive but we will analyze it once. The debug is broken into sections for clarity.

CSR1#debug mpls traffic-eng path lookup
CSR1#debug mpls traffic-eng path spf

! CSR1
interface Tunnel1
 description BASIC
 ip unnumbered Loopback0
 tunnel mode mpls traffic-eng
 tunnel destination 5.5.5.5
 tunnel mpls traffic-eng affinity 0x0 mask 0x0
 tunnel mpls traffic-eng path-option 10 dynamic
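The effect of "affinity 0x0 mask 0x0" can be illustrated with a short Python sketch. It assumes the commonly documented comparison, in which a link is acceptable when its attribute flags equal the tunnel affinity in every bit position selected by the mask. The function name link_matches is hypothetical; the flag values are the ones seen on CSR1's links earlier in this section.

```python
def link_matches(link_attr_flags, tunnel_affinity, tunnel_mask):
    """True when the link agrees with the tunnel affinity on all masked bits."""
    return (link_attr_flags & tunnel_mask) == (tunnel_affinity & tunnel_mask)


# Attribute flags advertised for CSR1's five links in the TE topology output.
csr1_link_flags = [0x4, 0xA, 0x4, 0x2, 0x2]

# Mask 0x0 selects no bits, so every link passes regardless of its color.
print(all(link_matches(flags, 0x0, 0x0) for flags in csr1_link_flags))

# For contrast: requiring bit 0x2 to be set would exclude the 0x4 links.
print([hex(f) for f in csr1_link_flags if link_matches(f, 0x2, 0x2)])
```

With the mask cleared, no bits are compared at all, which is why link coloring plays no part in the path calculation that follows.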

Towards the top of the debug, we see the "constraints" being fed into the algorithm as input. The process performs a lookup within IS-IS level-2 and finds a shortest path (there are many) to CSR5 via CSR9 and CSR2 based on these constraints; right now, there aren't any. The total path-cost is 30, which is based on the TE metrics contained within the TE topology. This metric is copied over from the IGP metric by default, which is 10 for IS-IS. The path lookup process can then begin to find a path from 1.1.1.1 to 5.5.5.5.

! CSR1
TE-PCALC-API: 1.1.1.1_1->5.5.5.5_1 {7}: P2P LSP Path Lookup called
TE-PCALC: 1.1.1.1_1->5.5.5.5_1 {7}: Path Request Info
  Flags: METRIC_TE
  IP explicit-path: None (dynamic)
  bw 0, min_bw 0, metric: 0
  setup_pri 7, hold_pri 7
  affinity_bits 0x0, affinity_mask 0x0
TE-PCALC-PATH: 1.1.1.1_1->5.5.5.5_1 {7}: Area (isis level-2) Path Lookup begin

The first step of the SPF algorithm is to evaluate the neighbor links. First, the local node address is obviously in the path, so it is immediately moved from the tentative (tent) list to the path list. The 5 links to CSR1's neighbors are then added to the tent list for evaluation. After this pass, the output shows the current tent list along with each entry's administrative weight (also known as the TE metric, shown as “aw”), minimum bandwidth, and hop count. We can see the tentative list is actually a stack structure, since items added first are at the bottom of the tent list.

! CSR1
TE-PCALC-SPF: Begin SPF for tunnel1 to dest 0000.0000.0005.00
TE-PCALC-SPF: Added 0000.0000.0001.00 to tent list (aw 0, min_bw 18446744073709551615, prev_node(NULL))
TE-PCALC-SPF: Moved 0000.0000.0001.00 to path list
TE-PCALC-SPF: Evaluating link 132.1.8.8
TE-PCALC-SPF: Added 0000.0000.0008.00 to tent list (aw 10, min_bw 100000, prev_node(0000.0000.0001.00))
TE-PCALC-SPF: Evaluating link 132.1.4.4
TE-PCALC-SPF: Added 0000.0000.0004.00 to tent list (aw 10, min_bw 100000, prev_node(0000.0000.0001.00))
TE-PCALC-SPF: Evaluating link 132.1.3.3
TE-PCALC-SPF: Added 0000.0000.0003.00 to tent list (aw 10, min_bw 100000, prev_node(0000.0000.0001.00))
TE-PCALC-SPF: Evaluating link 132.1.6.6
TE-PCALC-SPF: Added 0000.0000.0006.00 to tent list (aw 10, min_bw 100000, prev_node(0000.0000.0001.00))
TE-PCALC-SPF: Evaluating link 132.1.9.9
TE-PCALC-SPF: Added 0000.0000.0009.00 to tent list (aw 10, min_bw 100000, prev_node(0000.0000.0001.00))
TE-PCALC-SPF: rrr_pcalc_dump_tentative list:
  0000.0000.0009.00: (aw=10, min_bw=100000, hops=1)
  0000.0000.0006.00: (aw=10, min_bw=100000, hops=1)
  0000.0000.0003.00: (aw=10, min_bw=100000, hops=1)
  0000.0000.0004.00: (aw=10, min_bw=100000, hops=1)
  0000.0000.0008.00: (aw=10, min_bw=100000, hops=1)

At this point, all 5 options appear equal, so CSR1 must select one to continue running SPF. It selects the first option (the last one pushed onto the tent stack) and adds it to the path list. It then evaluates all of CSR9's links. Notice that only CSR10, CSR2, and XRv12 are considered tentative candidates and are added to the tent list. The other neighbors are not, because they are either already on the path list (CSR1) or already on the tent list at a lower cost (10 directly from CSR1 versus 20 via CSR9). The tent list grows to include these new nodes, and CSR9 has been removed from the tent list (being added to the path list does this).

! CSR1
TE-PCALC-SPF: Moved 0000.0000.0009.00 to path list
TE-PCALC-SPF: Evaluating link 132.9.12.12
TE-PCALC-SPF: Added 0000.0000.0012.00 to tent list (aw 20, min_bw 100000, prev_node(0000.0000.0009.00))
TE-PCALC-SPF: Evaluating link 132.6.9.6
TE-PCALC-SPF: Evaluating link 132.4.9.4
TE-PCALC-SPF: Evaluating link 132.8.9.8
TE-PCALC-SPF: Evaluating link 132.2.9.2
TE-PCALC-SPF: Added 0000.0000.0002.00 to tent list (aw 20, min_bw 100000, prev_node(0000.0000.0009.00))
TE-PCALC-SPF: Evaluating link 132.3.9.3
TE-PCALC-SPF: Evaluating link 132.1.9.1
TE-PCALC-SPF: Evaluating link 132.9.10.10
TE-PCALC-SPF: Added 0000.0000.0010.00 to tent list (aw 20, min_bw 100000, prev_node(0000.0000.0009.00))
TE-PCALC-SPF: rrr_pcalc_dump_tentative list:
  0000.0000.0006.00: (aw=10, min_bw=100000, hops=1)
  0000.0000.0003.00: (aw=10, min_bw=100000, hops=1)
  0000.0000.0004.00: (aw=10, min_bw=100000, hops=1)
  0000.0000.0008.00: (aw=10, min_bw=100000, hops=1)
  0000.0000.0010.00: (aw=20, min_bw=100000, hops=2)
  0000.0000.0002.00: (aw=20, min_bw=100000, hops=2)
  0000.0000.0012.00: (aw=20, min_bw=100000, hops=2)

The process continues for the next node in the tent stack, which is CSR6. All of its links are evaluated, but none are added to the tent list as all of them are suboptimal. CSR6 is added to the path list and thus removed from the tent list. ! CSR1 TE-PCALC-SPF: Moved 0000.0000.0006.00 to path list TE-PCALC-SPF: Evaluating link 132.6.8.8 TE-PCALC-SPF: Evaluating link 132.1.6.1 TE-PCALC-SPF: Evaluating link 132.6.10.10 TE-PCALC-SPF: Evaluating link 132.6.12.12 TE-PCALC-SPF: Evaluating link 132.6.9.9 TE-PCALC-SPF: rrr_pcalc_dump_tentative list: 0000.0000.0003.00: (aw=10, min_bw=100000, hops=1) 0000.0000.0004.00: (aw=10, min_bw=100000, hops=1) 0000.0000.0008.00: (aw=10, min_bw=100000, hops=1) 0000.0000.0010.00: (aw=20, min_bw=100000, hops=2) 0000.0000.0002.00: (aw=20, min_bw=100000, hops=2) 0000.0000.0012.00: (aw=20, min_bw=100000, hops=2)

The process continues for other nodes, including CSR3, CSR4, CSR8, and the DIS pseudonode representing the CSR2-CSR3 link. At the end of the evaluation, the tent list has been reduced to CSR10, CSR2, and XRv12.

! CSR1
TE-PCALC-SPF: Moved 0000.0000.0003.00 to path list
TE-PCALC-SPF: Evaluating link 132.3.4.4
TE-PCALC-SPF: Evaluating link 132.1.3.1
TE-PCALC-SPF: Added 0000.0000.0003.01 to tent list (aw 20, min_bw 100000, prev_node(0000.0000.0003.00))
TE-PCALC-SPF: rrr_pcalc_dump_tentative list:
  0000.0000.0004.00: (aw=10, min_bw=100000, hops=1)
  0000.0000.0008.00: (aw=10, min_bw=100000, hops=1)
  0000.0000.0003.01: (aw=20, min_bw=100000, hops=2)
  0000.0000.0010.00: (aw=20, min_bw=100000, hops=2)
  0000.0000.0002.00: (aw=20, min_bw=100000, hops=2)
  0000.0000.0012.00: (aw=20, min_bw=100000, hops=2)
TE-PCALC-SPF: Moved 0000.0000.0004.00 to path list
TE-PCALC-SPF: Evaluating link 132.3.4.3
TE-PCALC-SPF: Evaluating link 132.1.4.1
TE-PCALC-SPF: rrr_pcalc_dump_tentative list:
  0000.0000.0008.00: (aw=10, min_bw=100000, hops=1)
  0000.0000.0003.01: (aw=20, min_bw=100000, hops=2)
  0000.0000.0010.00: (aw=20, min_bw=100000, hops=2)
  0000.0000.0002.00: (aw=20, min_bw=100000, hops=2)
  0000.0000.0012.00: (aw=20, min_bw=100000, hops=2)
TE-PCALC-SPF: Moved 0000.0000.0008.00 to path list
TE-PCALC-SPF: Evaluating link 132.6.8.6
TE-PCALC-SPF: Evaluating link 132.1.8.1
TE-PCALC-SPF: Evaluating link 132.8.9.9
TE-PCALC-SPF: rrr_pcalc_dump_tentative list:
  0000.0000.0003.01: (aw=20, min_bw=100000, hops=2)
  0000.0000.0010.00: (aw=20, min_bw=100000, hops=2)
  0000.0000.0002.00: (aw=20, min_bw=100000, hops=2)
  0000.0000.0012.00: (aw=20, min_bw=100000, hops=2)
TE-PCALC-SPF: Moved 0000.0000.0003.01 to path list
TE-PCALC-SPF: rrr_pcalc_dump_tentative list:
  0000.0000.0010.00: (aw=20, min_bw=100000, hops=2)
  0000.0000.0002.00: (aw=20, min_bw=100000, hops=2)
  0000.0000.0012.00: (aw=20, min_bw=100000, hops=2)

The evaluation continues as the tent list continues to shrink in length. When CSR2 finally gets added to the path list, evaluating the links yields the final destination. The final evaluation shows a path to CSR5. ! CSR1 TE-PCALC-SPF: Moved 0000.0000.0010.00 to path list TE-PCALC-SPF: Evaluating link 132.10.12.12 TE-PCALC-SPF: Evaluating link 132.2.10.2 TE-PCALC-SPF: No nbr node and Nbr 132.2.10.2 not in static tree TE-PCALC-SPF: Evaluating link 132.6.10.6 TE-PCALC-SPF: Evaluating link 132.10.11.11 TE-PCALC-SPF: Added 0000.0000.0011.00 to tent list (aw 30, min_bw 100000, prev_node(0000.0000.0010.00)) TE-PCALC-SPF: Evaluating link 132.3.10.3 TE-PCALC-SPF: Evaluating link 132.5.10.5 TE-PCALC-SPF: rrr_pcalc_dijkstra_spf: No acceptable reverse link for 132.5.10.10 in unidirectional path TE-PCALC-SPF: Evaluating link 132.9.10.9 TE-PCALC-SPF: rrr_pcalc_dump_tentative list: 0000.0000.0002.00: (aw=20, min_bw=100000, hops=2) 0000.0000.0012.00: (aw=20, min_bw=100000, hops=2) 0000.0000.0011.00: (aw=30, min_bw=100000, hops=3) TE-PCALC-SPF: Moved 0000.0000.0002.00 to path list TE-PCALC-SPF: Evaluating link 132.2.5.5 TE-PCALC-SPF: Added 0000.0000.0005.00 to tent list (aw 30, min_bw 100000, prev_node(0000.0000.0002.00)) TE-PCALC-SPF: Evaluating link 132.2.9.9 TE-PCALC-SPF: rrr_pcalc_dump_tentative list: 0000.0000.0012.00: (aw=20, min_bw=100000, hops=2) 0000.0000.0005.00: (aw=30, min_bw=100000, hops=3) 0000.0000.0011.00: (aw=30, min_bw=100000, hops=3)


TE-PCALC-SPF: Moved 0000.0000.0012.00 to path list TE-PCALC-SPF: Evaluating link 132.6.12.6 TE-PCALC-SPF: Evaluating link 132.10.12.10 TE-PCALC-SPF: Evaluating link 132.9.12.9 TE-PCALC-SPF: Evaluating link 132.11.12.11 TE-PCALC-SPF: rrr_pcalc_dump_tentative list: 0000.0000.0005.00: (aw=30, min_bw=100000, hops=3) 0000.0000.0011.00: (aw=30, min_bw=100000, hops=3)

The SPF process shows this final path in reverse order since the algorithm stacks the vertices from tail to head. The final path is CSR1 > CSR9 > CSR2 > CSR5 and has a cost of 30. The remaining messages state, in various ways, that the tunnel was built successfully.

! CSR1
TE-PCALC-PATH:Path from 0000.0000.0001.00 -> 0000.0000.0005.00:
  132.2.5.2->132.2.5.5 (admin_weight=30):
  132.2.9.9->132.2.9.2 (admin_weight=20):
  132.1.9.1->132.1.9.9 (admin_weight=10):
  num_hops 4, accumulated_aw 30, min_bw 100000
TE-PCALC-PATH: 1.1.1.1_1->5.5.5.5_1 {7}: Area (isis level-2) Path Lookup end: path found
TE-PCALC-API: 1.1.1.1_1->5.5.5.5_1 {7}: P2P LSP Path Lookup result: success
%MPLS_TE-5-TUN: Tun1: installed LSP 1_1 (popt 10) for nil, got 1st feasible path opt
%MPLS_TE-5-LSP: LSP 1.1.1.1 1_1: UP
%MPLS_TE-5-TUN: Tun1: LSP path change 1_1 for nil, normal
%LINEPROTO-5-UPDOWN: Line protocol on Interface Tunnel1, changed state to up
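The overall CSPF idea (prune links that fail the constraints, then run a shortest-path search on what remains) can be sketched in a few lines of Python. This is not the IOS implementation: the link list is a deliberately reduced, assumed subset of the lab topology with the pseudonode left out, every TE metric is set to 10, and the function name cspf is hypothetical. A priority queue stands in for the tent list.

```python
import heapq

# (node_a, node_b, te_metric, unreserved_kbps): simplified subset of the lab.
LINKS = [
    ("CSR1", "CSR9", 10, 100000),
    ("CSR1", "CSR3", 10, 100000),
    ("CSR1", "CSR6", 10, 100000),
    ("CSR9", "CSR2", 10, 100000),
    ("CSR9", "CSR10", 10, 100000),
    ("CSR6", "CSR10", 10, 100000),
    ("CSR3", "CSR10", 10, 100000),
    ("CSR2", "CSR10", 10, 100000),
    ("CSR2", "CSR5", 10, 100000),
]


def cspf(links, src, dst, bw_kbps=0):
    """Return (cost, path) for the cheapest path meeting bw_kbps, or None."""
    graph = {}
    for a, b, metric, available in links:
        if available < bw_kbps:
            continue  # constraint step: prune links without enough bandwidth
        graph.setdefault(a, []).append((b, metric))
        graph.setdefault(b, []).append((a, metric))

    tent = [(0, src, [src])]  # tentative list, ordered by accumulated metric
    path_list = set()         # nodes whose best path is final
    while tent:
        cost, node, path = heapq.heappop(tent)
        if node == dst:
            return cost, path
        if node in path_list:
            continue
        path_list.add(node)
        for neighbor, metric in graph.get(node, []):
            if neighbor not in path_list:
                heapq.heappush(tent, (cost + metric, neighbor, path + [neighbor]))
    return None


print(cspf(LINKS, "CSR1", "CSR5"))
print(cspf(LINKS, "CSR1", "CSR5", bw_kbps=200000))
```

In this reduced graph there is only one path of cost 30, so the result lines up with the debug above. In the full topology several equal-cost paths exist and the router's own tie-breaking decides among them.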

The entire PCALC computation happens locally on CSR1. The output of PCALC is the ERO, which is essentially the path that RSVP should signal. Provided the TED is correct, PCALC can complete. We will flap the tunnel (shut/no shut) with RSVP debugging enabled so we can view the signaling process also. First, CSR1 sees an incoming PROXY_PATH, which is basically a PATH message generated locally on the router from some other process (PCALC, in this case). The key part of this message, at least for now, is the ERO. Other things like the sender Tspec (bandwidth reservation) will be discussed later. ! CSR1 Incoming PROXY_PATH: version:1 flags:0000 cksum:0000 ttl:255 reserved:0 length:208 SESSION type 7 length 16: Tun Dest: 5.5.5.5 Tun ID: 1 Ext Tun ID: 1.1.1.1 HOP type 1 length 12: Hop Addr: 127.0.0.1 LIH: 0x00000000 TIME_VALUES type 1 length 8 : Refresh Period (msec): 30000 EXPLICIT_ROUTE type 1 length 44: 1.1.1.1 (Strict IPv4 Prefix, 8 bytes, /32) 132.1.9.9 (Strict IPv4 Prefix, 8 bytes, /32)


132.2.9.2 (Strict IPv4 Prefix, 8 bytes, /32) 132.2.5.5 (Strict IPv4 Prefix, 8 bytes, /32) 5.5.5.5 (Strict IPv4 Prefix, 8 bytes, /32) SESSION_ATTRIBUTE type 7 length 16: Setup Prio: 7, Holding Prio: 7 Flags: (0x4) SE Style Session Name: BASIC SENDER_TEMPLATE type 7 length 12: Tun Sender: 1.1.1.1 LSP ID: 5 SENDER_TSPEC type 2 length 36: version=0, length in words=7 Token bucket fragment (service_id=1, length=6 words parameter id=127, flags=0, parameter length=5 average rate=0 bytes/sec, burst depth=1000 bytes peak rate =0 bytes/sec min unit=0 bytes, max pkt size=2147483647 bytes ADSPEC type 2 length 48: version=0 length in words=10 General Parameters break bit=0 service length=8 IS Hops:0 Minimum Path Bandwidth (bytes/sec):2147483647 Path Latency (microseconds):0 Path MTU:4294967295 Controlled Load Service break bit=0 service length=0 LABEL_REQUEST type 1 length 8 : Layer 3 protocol ID: 2048

CSR1 turns this into a "real" PATH message and sends it to CSR9, which is the next item in the explicit path. Notice the interface address is used since TE is selecting specific links, not just selecting nodes. Notice that the minimum bandwidth towards the bottom is 125 MBps, or 1 Gbps, which is the link speed as seen by RSVP. By looking at "debug ip rsvp dump-messages" we can see all the details of the RSVP signaling. ! CSR1 %MPLS_TE-5-TUN: Tun1: installed LSP 1_5 (popt 10) for nil, got 1st feasible path opt Outgoing Path: version:1 flags:0000 cksum:7A91 ttl:255 reserved:0 length:200 SESSION type 7 length 16: Tun Dest: 5.5.5.5 Tun ID: 1 Ext Tun ID: 1.1.1.1 HOP type 1 length 12: Hop Addr: 132.1.9.1 LIH: 0x00000011 TIME_VALUES type 1 length 8 : Refresh Period (msec): 30000 EXPLICIT_ROUTE type 1 length 36: 132.1.9.9 (Strict IPv4 Prefix, 8 bytes, /32) 132.2.9.2 (Strict IPv4 Prefix, 8 bytes, /32) 132.2.5.5 (Strict IPv4 Prefix, 8 bytes, /32)


5.5.5.5 (Strict IPv4 Prefix, 8 bytes, /32) LABEL_REQUEST type 1 length 8 : Layer 3 protocol ID: 2048 SESSION_ATTRIBUTE type 7 length 16: Setup Prio: 7, Holding Prio: 7 Flags: (0x4) SE Style Session Name: BASIC SENDER_TEMPLATE type 7 length 12: Tun Sender: 1.1.1.1 LSP ID: 5 SENDER_TSPEC type 2 length 36: version=0, length in words=7 Token bucket fragment (service_id=1, length=6 words parameter id=127, flags=0, parameter length=5 average rate=0 bytes/sec, burst depth=1000 bytes peak rate =0 bytes/sec min unit=0 bytes, max pkt size=2147483647 bytes ADSPEC type 2 length 48: version=0 length in words=10 General Parameters break bit=0 service length=8 IS Hops:1 Minimum Path Bandwidth (bytes/sec):125000000 Path Latency (microseconds):0 Path MTU:1500 Controlled Load Service break bit=0 service length=0
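The ADSPEC reports bandwidth in bytes per second while the TE topology output uses kbps, so a quick conversion helps when comparing the two. The helper below is a hypothetical convenience written for this example, assuming decimal units (1 kbps = 1000 bits per second).

```python
def bytes_per_sec_to_kbps(bytes_per_sec):
    """Convert an RSVP ADSPEC bandwidth (bytes/sec) to kbps, decimal units."""
    return bytes_per_sec * 8 // 1000


adspec_min_path_bw = 125000000  # "Minimum Path Bandwidth (bytes/sec)" above
print(bytes_per_sec_to_kbps(adspec_min_path_bw))
```

The result equals the physical_bw of 1000000 kbps shown for the links in the TE topology, which is 1 Gbps.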

CSR9 receives this path message from CSR1; notice that the PHOP is included so CSR9 knows where to send the RESV message. The top of the ERO is a local address on CSR9, which is expected. ! CSR9 Incoming Path: version:1 flags:0000 cksum:7A91 ttl:255 reserved:0 length:200 SESSION type 7 length 16: Tun Dest: 5.5.5.5 Tun ID: 1 Ext Tun ID: 1.1.1.1 HOP type 1 length 12: Hop Addr: 132.1.9.1 LIH: 0x00000011 TIME_VALUES type 1 length 8 : Refresh Period (msec): 30000 EXPLICIT_ROUTE type 1 length 36: 132.1.9.9 (Strict IPv4 Prefix, 8 bytes, /32) 132.2.9.2 (Strict IPv4 Prefix, 8 bytes, /32) 132.2.5.5 (Strict IPv4 Prefix, 8 bytes, /32) 5.5.5.5 (Strict IPv4 Prefix, 8 bytes, /32) [snip]

CSR9 strips its own address out of the ERO, updates PHOP, and sends a PATH message to the next-hop in the ERO, which is CSR2. ! CSR9


Outgoing Path: version:1 flags:0000 cksum:29AF ttl:254 reserved:0 length:192 SESSION type 7 length 16: Tun Dest: 5.5.5.5 Tun ID: 1 Ext Tun ID: 1.1.1.1 HOP type 1 length 12: Hop Addr: 132.2.9.9 LIH: 0x0000000C TIME_VALUES type 1 length 8 : Refresh Period (msec): 30000 EXPLICIT_ROUTE type 1 length 28: 132.2.9.2 (Strict IPv4 Prefix, 8 bytes, /32) 132.2.5.5 (Strict IPv4 Prefix, 8 bytes, /32) 5.5.5.5 (Strict IPv4 Prefix, 8 bytes, /32) [snip]

CSR2 receives the PATH message from CSR9 and the process continues. CSR2 creates a new PATH message towards CSR5, updates PHOP, and removes its own address from the ERO. ! CSR2 Incoming Path: version:1 flags:0000 cksum:29AF ttl:254 reserved:0 length:192 SESSION type 7 length 16: Tun Dest: 5.5.5.5 Tun ID: 1 Ext Tun ID: 1.1.1.1 HOP type 1 length 12: Hop Addr: 132.2.9.9 LIH: 0x0000000C TIME_VALUES type 1 length 8 : Refresh Period (msec): 30000 EXPLICIT_ROUTE type 1 length 28: 132.2.9.2 (Strict IPv4 Prefix, 8 bytes, /32) 132.2.5.5 (Strict IPv4 Prefix, 8 bytes, /32) 5.5.5.5 (Strict IPv4 Prefix, 8 bytes, /32) [snip] Outgoing Path: version:1 flags:0000 cksum:DCD0 ttl:253 reserved:0 length:184 SESSION type 7 length 16: Tun Dest: 5.5.5.5 Tun ID: 1 Ext Tun ID: 1.1.1.1 HOP type 1 length 12: Hop Addr: 132.2.5.2 LIH: 0x0000000D TIME_VALUES type 1 length 8 : Refresh Period (msec): 30000 EXPLICIT_ROUTE type 1 length 20: 132.2.5.5 (Strict IPv4 Prefix, 8 bytes, /32) 5.5.5.5 (Strict IPv4 Prefix, 8 bytes, /32) [snip]
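The per-hop handling of the ERO seen in these debugs can be modeled with a small Python sketch. It is a simplification under stated assumptions: each router only knows the addresses that appear in the debugs above, the LOCAL and NEXT_HOP tables are hand-built for this one LSP, and the function names are hypothetical. Real RSVP-TE processing handles loose hops, errors, and many more objects.

```python
# Addresses owned by each router (only those visible in the debugs above).
LOCAL = {
    "CSR9": {"132.1.9.9", "132.2.9.9"},
    "CSR2": {"132.2.9.2", "132.2.5.2"},
    "CSR5": {"132.2.5.5", "5.5.5.5"},
}

# ERO hop address -> (router that owns it, sender's address on that link).
NEXT_HOP = {
    "132.2.9.2": ("CSR2", "132.2.9.9"),
    "132.2.5.5": ("CSR5", "132.2.5.2"),
}


def process_path(node, ero):
    """Strip local hops from the ERO; return None when this node is the tail."""
    remaining = list(ero)
    while remaining and remaining[0] in LOCAL[node]:
        remaining.pop(0)
    if not remaining:
        return None
    next_node, phop = NEXT_HOP[remaining[0]]
    return next_node, phop, remaining


def signal(first_node, ero):
    """Follow the PATH message hop by hop and record what each node sends."""
    trace = []
    node = first_node
    while True:
        step = process_path(node, ero)
        if step is None:
            trace.append((node, "tail", []))
            return trace
        next_node, phop, ero = step
        trace.append((node, phop, ero))
        node = next_node


ero_from_csr1 = ["132.1.9.9", "132.2.9.2", "132.2.5.5", "5.5.5.5"]
for hop in signal("CSR9", ero_from_csr1):
    print(hop)
```

Each tuple lists the router, the PHOP address it writes into the outgoing PATH message, and the ERO it forwards, which can be compared line by line with the "Outgoing Path" debugs.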

CSR5 receives the PATH message and knows that it is the tail end of the LSP (tunnel destination indicates this). Now that the PATH message signaling from head to tail is done, the RESV messages must be sent in the reverse direction. The PHOPs carried in the PATH messages are used to derive next-hops (NHOPs) for the RESV messages on the way back. This ensures that the proper interfaces are used in the reverse path. The most significant part of the RESV message is the last line, which is the label. PHP is supported with RSVP-TE so the tail-end router typically allocates implicit-null.

! CSR5
Incoming Path:
  version:1 flags:0000 cksum:DCD0 ttl:253 reserved:0 length:184
  SESSION type 7 length 16:
    Tun Dest: 5.5.5.5 Tun ID: 1 Ext Tun ID: 1.1.1.1
  HOP type 1 length 12:
    Hop Addr: 132.2.5.2 LIH: 0x0000000D
  TIME_VALUES type 1 length 8 :
    Refresh Period (msec): 30000
  EXPLICIT_ROUTE type 1 length 20:
    132.2.5.5 (Strict IPv4 Prefix, 8 bytes, /32)
    5.5.5.5 (Strict IPv4 Prefix, 8 bytes, /32)
[snip]
Outgoing Resv:
  version:1 flags:0000 cksum:E141 ttl:255 reserved:0 length:108
  SESSION type 7 length 16:
    Tun Dest: 5.5.5.5 Tun ID: 1 Ext Tun ID: 1.1.1.1
  HOP type 1 length 12:
    Hop Addr: 132.2.5.5 LIH: 0x0000000D
  TIME_VALUES type 1 length 8 :
    Refresh Period (msec): 30000
  STYLE type 1 length 8 :
    Shared-Explicit (SE)
  FLOWSPEC type 2 length 36:
    version = 0 length in words = 7
    service id = 5, service length = 6
    tspec parameter id = 127, flags = 0,length = 5
    average rate = 0 bytes/sec, burst depth = 1000 bytes
    peak rate = 0 bytes/sec
    min unit = 0 bytes,max pkt size = 1500 bytes
  FILTER_SPEC type 7 length 12:
    Tun Sender: 1.1.1.1, LSP ID: 5
  LABEL type 1 length 8 :
    Labels: 3

CSR2 receives this RESV message from CSR5, programs the label into its LFIB for this LSP, and sends another RESV back to CSR9. CSR2 allocates label 2008 for this LSP. Label 3 is implicit-null, which makes sense since CSR2 is the penultimate hop along the TE LSP. Just like LDP, having the next-hop plus label effectively creates a link in the FEC. This makes RSVP a good candidate for MPLS-TE since the RESV message nicely carries both, similar to LDP. ! CSR2


Incoming Resv: version:1 flags:0000 cksum:E141 ttl:255 reserved:0 length:108 SESSION type 7 length 16: Tun Dest: 5.5.5.5 Tun ID: 1 Ext Tun ID: 1.1.1.1 HOP type 1 length 12: Hop Addr: 132.2.5.5 LIH: 0x0000000D TIME_VALUES type 1 length 8 : Refresh Period (msec): 30000 STYLE type 1 length 8 : Shared-Explicit (SE) FLOWSPEC type 2 length 36: version = 0 length in words = 7 service id = 5, service length = 6 tspec parameter id = 127, flags = 0,length = 5 average rate = 0 bytes/sec, burst depth = 1000 bytes peak rate = 0 bytes/sec min unit = 0 bytes,max pkt size = 1500 bytes FILTER_SPEC type 7 length 12: Tun Sender: 1.1.1.1, LSP ID: 5 LABEL type 1 length 8 : Labels: 3 Outgoing Resv: version:1 flags:0000 cksum:D570 ttl:255 reserved:0 length:108 SESSION type 7 length 16: Tun Dest: 5.5.5.5 Tun ID: 1 Ext Tun ID: 1.1.1.1 HOP type 1 length 12: Hop Addr: 132.2.9.2 LIH: 0x0000000C TIME_VALUES type 1 length 8 : Refresh Period (msec): 30000 STYLE type 1 length 8 : Shared-Explicit (SE) FLOWSPEC type 2 length 36: version = 0 length in words = 7 service id = 5, service length = 6 tspec parameter id = 127, flags = 0,length = 5 average rate = 0 bytes/sec, burst depth = 1000 bytes peak rate = 0 bytes/sec min unit = 0 bytes,max pkt size = 1500 bytes FILTER_SPEC type 7 length 12: Tun Sender: 1.1.1.1, LSP ID: 5 LABEL type 1 length 8 : Labels: 2008

CSR9 receives the RESV message from CSR2, programs label 2008 into its LFIB for this LSP, and sends a new RESV with local label 9013 towards CSR1. ! CSR9


Incoming Resv: version:1 flags:0000 cksum:D570 ttl:255 reserved:0 length:108 SESSION type 7 length 16: Tun Dest: 5.5.5.5 Tun ID: 1 Ext Tun ID: 1.1.1.1 HOP type 1 length 12: Hop Addr: 132.2.9.2 LIH: 0x0000000C TIME_VALUES type 1 length 8 : Refresh Period (msec): 30000 STYLE type 1 length 8 : Shared-Explicit (SE) FLOWSPEC type 2 length 36: version = 0 length in words = 7 service id = 5, service length = 6 tspec parameter id = 127, flags = 0,length = 5 average rate = 0 bytes/sec, burst depth = 1000 bytes peak rate = 0 bytes/sec min unit = 0 bytes,max pkt size = 1500 bytes FILTER_SPEC type 7 length 12: Tun Sender: 1.1.1.1, LSP ID: 5 LABEL type 1 length 8 : Labels: 2008 Outgoing Resv: version:1 flags:0000 cksum:BA08 ttl:255 reserved:0 length:108 SESSION type 7 length 16: Tun Dest: 5.5.5.5 Tun ID: 1 Ext Tun ID: 1.1.1.1 HOP type 1 length 12: Hop Addr: 132.1.9.9 LIH: 0x00000011 TIME_VALUES type 1 length 8 : Refresh Period (msec): 30000 STYLE type 1 length 8 : Shared-Explicit (SE) FLOWSPEC type 2 length 36: version = 0 length in words = 7 service id = 5, service length = 6 tspec parameter id = 127, flags = 0,length = 5 average rate = 0 bytes/sec, burst depth = 1000 bytes peak rate = 0 bytes/sec min unit = 0 bytes,max pkt size = 1500 bytes FILTER_SPEC type 7 length 12: Tun Sender: 1.1.1.1, LSP ID: 5 LABEL type 1 length 8 : Labels: 9013

The head-end, CSR1, receives the final RESV message and programs this label into the LFIB. The TE tunnel signaling is now complete. ! CSR1 Incoming Resv:


version:1 flags:0000 cksum:BA08 ttl:255 reserved:0 length:108 SESSION type 7 length 16: Tun Dest: 5.5.5.5 Tun ID: 1 Ext Tun ID: 1.1.1.1 HOP type 1 length 12: Hop Addr: 132.1.9.9 LIH: 0x00000011 TIME_VALUES type 1 length 8 : Refresh Period (msec): 30000 STYLE type 1 length 8 : Shared-Explicit (SE) FLOWSPEC type 2 length 36: version = 0 length in words = 7 service id = 5, service length = 6 tspec parameter id = 127, flags = 0,length = 5 average rate = 0 bytes/sec, burst depth = 1000 bytes peak rate = 0 bytes/sec min unit = 0 bytes,max pkt size = 1500 bytes FILTER_SPEC type 7 length 12: Tun Sender: 1.1.1.1, LSP ID: 5 LABEL type 1 length 8 : Labels: 9013
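The hop-by-hop label allocation traced above can be summarized with a short Python sketch. It is only an illustration: the router names and label values are taken from the RESV dumps, while the `resv_labels` list and the `build_lfib` helper are hypothetical names invented for this example and are not part of any Cisco tooling.

```python
# Labels carried upstream in each RESV message, as seen in the dumps.
# Each tuple is (advertising router, upstream neighbor, label).
IMPLICIT_NULL = 3

resv_labels = [
    ("CSR5", "CSR2", IMPLICIT_NULL),  # tail end requests PHP
    ("CSR2", "CSR9", 2008),
    ("CSR9", "CSR1", 9013),
]

def build_lfib(resv_labels):
    """Return {router: (in_label, out_label)} for one TE LSP.

    A router's in-label is the label it advertised upstream;
    its out-label is the label it learned from its downstream neighbor.
    """
    advertised = {router: label for router, _, label in resv_labels}
    learned = {upstream: label for _, upstream, label in resv_labels}
    routers = set(advertised) | set(learned)
    return {r: (advertised.get(r), learned.get(r)) for r in routers}

lfib = build_lfib(resv_labels)
for router in ("CSR1", "CSR9", "CSR2", "CSR5"):
    in_label, out_label = lfib[router]
    print(f"{router}: in={in_label} out={out_label}")
```

The head-end (CSR1) ends up with no in-label and the tail-end (CSR5) with no out-label, which is consistent with the show command output for each role.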

We can also look at a terser version of the RSVP signaling with "debug ip rsvp signalling". Refreshing the tunnel again, we see CSR1 send a PATH message to CSR9 with a PHOP of its local link address facing CSR9. The ERO details are not revealed here, but the basic RSVP message type (PATH, RESV, etc.) and critical hop information (PHOP, NHOP, etc.) are shown.
! CSR1
RSVP: 1.1.1.1_5->5.5.5.5_1[Src] {7}: Path refresh, Event: none, State: stay in normal
RSVP: 1.1.1.1_5->5.5.5.5_1[Src] {7}: Path refresh (msec), config: 30000 curr: 30000 xmit: 30000
RSVP: 1.1.1.1_5->5.5.5.5_1[Src] {7}: Sending Path message to 132.1.9.9
RSVP: 1.1.1.1_5->5.5.5.5_1[Src] {7}: building hop object with src addr: 132.1.9.1

CSR9 receives this PATH message, updates the PHOP, and sends it to CSR2. We won't trace the remainder of the path as it is the same at each hop. CSR9 receives a RESV back in from CSR2 and sends it back to CSR1. ! CSR9 RSVP: 1.1.1.1_5->5.5.5.5_1[Src] {7}: Received Path message from 132.1.9.1 (on GigabitEthernet2.519) RSVP: 1.1.1.1_5->5.5.5.5_1[Src] {7}: [rsvp_examine_and_mark_md_events] Existing PSB MD = Ignore RSVP: 1.1.1.1_5->5.5.5.5_1[Src] {7}: [rsvp_examine_and_mark_tspec_events] Existing PSB TSpec = Ignore RSVP: 1.1.1.1_5->5.5.5.5_1[Src] {7}: Incoming Path, No change


RSVP: 1.1.1.1_5->5.5.5.5_1[Src] {7}: Sending Resv message to 132.1.9.1 from 132.1.9.9 RSVP: 1.1.1.1_5->5.5.5.5_1[Src] {7}: building hop object with src addr: 132.1.9.9

CSR1 receives the RESV and sees no change. ! CSR1 RSVP: session 5.5.5.5_1[1.1.1.1] (7): Received Resv message from 132.1.9.9 (on GigabitEthernet2.519) RSVP: 1.1.1.1_5->5.5.5.5_1[Src] {7}: Successfully parsed Resv message from 132.1.9.9 (on GigabitEthernet2.519) RSVP: 1.1.1.1_5->5.5.5.5_1[Src] {7}: No change in reservation

We can verify the tunnel with several show commands also. The primary show command is below and details most of the critical tunnel components. The first line shows the status of the interface. "Path" determines whether PCALC was successful and "Signalling" determines whether the path has been actually set up. A broken RSVP configuration won't affect PCALC, so PCALC can be valid while RSVP signaling is broken. The path options are listed and the one being used is identified as "basis for setup", along with the cost of the path. The next several lines just show configuration details. Next, there is RSVP information, to include the labels being used. There is no incoming label since this is the tunnel head; traffic only exits, never enters. The ERO, RRO, Tspec, and Fspec attributes are shown under the PATH and RESV fields as well. CSR1#show mpls traffic-eng tunnels tunnel 1 Name: BASIC (Tunnel1) Destination: 5.5.5.5 Status: Admin: up Oper: up Path: valid Signalling: connected path option 10, type dynamic (Basis for Setup, path weight 30) Config Parameters: Bandwidth: 0 kbps (Global) Priority: 7 7 Affinity: 0x0/0x0 Metric Type: TE (default) AutoRoute: disabled LockDown: disabled Loadshare: 0 [0] bw-based auto-bw: disabled Active Path Option Parameters: State: dynamic path option 10 is active BandwidthOverride: disabled LockDown: disabled Verbatim: disabled InLabel : OutLabel : GigabitEthernet2.519, 9013 Next Hop : 132.1.9.9 RSVP Signalling Info: Src 1.1.1.1, Dst 5.5.5.5, Tun_Id 1, Tun_Instance 5 RSVP Path Info: My Address: 132.1.9.1


Explicit Route: 132.1.9.9 132.2.9.2 132.2.5.5 5.5.5.5 Record Route: NONE Tspec: ave rate=0 kbits, burst=1000 bytes, peak rate=0 kbits RSVP Resv Info: Record Route: NONE Fspec: ave rate=0 kbits, burst=1000 bytes, peak rate=0 kbits Shortest Unconstrained Path Info: Path Weight: 30 (TE) Explicit Route: 132.1.9.9 132.2.9.2 132.2.5.5 5.5.5.5 History: Tunnel: Time since created: 1 hours, 15 minutes Time since path change: 57 minutes, 24 seconds Number of LSP IDs (Tun_Instances) used: 5 Current LSP: [ID: 5] Uptime: 57 minutes, 24 seconds Prior LSP: [ID: 4] ID: path option unknown Removal Trigger: tunnel shutdown

You can verify the tunnel signaling by checking RSVP directly as well. We can verify the PATH messages in brief and verbose form. "session-type 7" refers to RSVP-TE tunnels as opposed to regular IPv4 RSVP. This original flavor of RSVP is not commonly used since the Integrated Services (IntServ) QoS model is unpopular. The tunnel head has no PHOP or incoming interface, but all other routers will. The detailed output is similar and shows the incoming and outgoing ERO (before and after the local address is removed), along with other traffic parameters, such as bandwidth and fast-reroute. The detailed output is also very similar to the “dump-messages” debugging in terms of format.
CSR1#show ip rsvp sender filter session-type 7 tunnel-id 1
Destination     Tun Sender      TunID LSPID Prev Hop        I/F      BPS
5.5.5.5         1.1.1.1         1     5     none            none     0

CSR1#show ip rsvp sender detail filter session-type 7 tunnel-id 1 PATH: Tun Dest: 5.5.5.5 Tun ID: 1 Ext Tun ID: 1.1.1.1 Tun Sender: 1.1.1.1 LSP ID: 5 Path refreshes: sent: to NHOP 132.1.9.9 on GigabitEthernet2.519 Session Attr: Setup Prio: 7, Holding Prio: 7 Flags: (0x4) SE Style Session Name: BASIC ERO: (incoming) 1.1.1.1 (Strict IPv4 Prefix, 8 bytes, /32) 132.1.9.9 (Strict IPv4 Prefix, 8 bytes, /32) 132.2.9.2 (Strict IPv4 Prefix, 8 bytes, /32) 132.2.5.5 (Strict IPv4 Prefix, 8 bytes, /32) 5.5.5.5 (Strict IPv4 Prefix, 8 bytes, /32)


ERO: (outgoing) 132.1.9.9 (Strict IPv4 Prefix, 8 bytes, /32) 132.2.9.2 (Strict IPv4 Prefix, 8 bytes, /32) 132.2.5.5 (Strict IPv4 Prefix, 8 bytes, /32) 5.5.5.5 (Strict IPv4 Prefix, 8 bytes, /32) Traffic params - Rate: 0 bits/sec, Max. burst: 1K bytes Min Policed Unit: 0 bytes, Max Pkt Size 2147483647 bytes Fast-Reroute Backup info: Inbound FRR: Not active Outbound FRR: No backup tunnel selected Path ID handle: 2F00040A. Incoming policy: Accepted. Policy source(s): MPLS/TE Status: Proxied Output on GigabitEthernet2.519. Policy status: Forwarding. Handle: 0A00040E Policy source(s): MPLS/TE

Likewise, we can do the same for the RESV messages. This time, there is a NHOP and outgoing interface, but there won't be at the tail end. We can also see the label value of 9013 which we saw in the RSVP dump-messages and the MPLS-TE show commands.
CSR1#show ip rsvp reservation filter session-type 7 tunnel-id 1
Destination     Tun Sender      TunID LSPID Next Hop        I/F      Fi Serv BPS
5.5.5.5         1.1.1.1         1     5     132.1.9.9       Gi2.519  SE LOAD 0

CSR1#show ip rsvp reservation detail filter session-type 7 tunnel-id 1 Reservation: Tun Dest: 5.5.5.5 Tun ID: 1 Ext Tun ID: 1.1.1.1 Tun Sender: 1.1.1.1 LSP ID: 5 Next Hop: 132.1.9.9 on GigabitEthernet2.519 Label: 9013 (outgoing) Reservation Style is Shared-Explicit, QoS Service is Controlled-Load Resv ID handle: 0400040F. Average Bitrate is 0 bits/sec, Maximum Burst is 1K bytes Min Policed Unit: 0 bytes, Max Pkt Size: 1500 bytes Status: Policy: Accepted. Policy source(s): MPLS/TE

The same commands are valid on any router in the LSP. Moving to CSR9, we will issue the same commands. Since CSR9 has no concept of "tunnel 1" as an interface, we can specify that we want to see all tunnels for which CSR9 is a "midpoint". Other role options include "head" and "tail". We could also specify the “tunnel ID” of 1; Cisco’s implementation populates this ID based on the tunnel interface number. We can see the local label sent to CSR1 is 9013 and the downstream (outbound) label is 2008, received from CSR2. A quick look at the LFIB clearly shows a swap operation. The [5] represents the LSP ID, which is the “Tun_Instance” in the detailed output. The “1” just before this LSP ID in the LFIB represents the tunnel ID.
CSR9#show mpls traffic-eng tunnels role middle


P2P TUNNELS/LSPs:
LSP Tunnel BASIC is signalled, connection is up
  InLabel  : GigabitEthernet2.519, 9013
  Prev Hop : 132.1.9.1
  OutLabel : GigabitEthernet2.529, 2008
  Next Hop : 132.2.9.2
  RSVP Signalling Info:
    Src 1.1.1.1, Dst 5.5.5.5, Tun_Id 1, Tun_Instance 5
    RSVP Path Info:
      My Address: 132.2.9.9
      Explicit Route: 132.2.9.2 132.2.5.5 5.5.5.5
      Record Route: NONE
      Tspec: ave rate=0 kbits, burst=1000 bytes, peak rate=0 kbits
    RSVP Resv Info:
      Record Route: NONE
      Fspec: ave rate=0 kbits, burst=1000 bytes, peak rate=0 kbits
CSR9#show mpls forwarding-table labels 9013
Local      Outgoing   Prefix           Bytes Label   Outgoing   Next Hop
Label      Label      or Tunnel Id     Switched      interface
9013       2008       1.1.1.1 1 [5]    0             Gi2.529    132.2.9.2

As a midpoint, CSR9 has both PHOPs and NHOPs in the RSVP signaling chain. Traffic arrives from CSR1 and is switched towards CSR2. Notice that the destination and sender addresses are the same in both outputs, since the LSP is unidirectional. The output will be similar on all midpoints in the LSP, so we will skip CSR2. CSR1 is the PHOP (upstream) and CSR2 is the NHOP (downstream).
CSR9#show ip rsvp sender filter session-type 7 tunnel-id 1
Destination     Tun Sender      TunID LSPID Prev Hop        I/F      BPS
5.5.5.5         1.1.1.1         1     5     132.1.9.1       Gi2.519  0
CSR9#show ip rsvp reservation filter session-type 7 tunnel-id 1
Destination     Tun Sender      TunID LSPID Next Hop        I/F      Fi Serv BPS
5.5.5.5         1.1.1.1         1     5     132.2.9.2       Gi2.529  SE LOAD 0

On the tail end, the outputs are different. We can reference the tunnel by ID as well, or use "role tail", etc. Notice that the local label is implicit-null but there is no out label; this is the opposite of what we saw at the headend. This makes sense because the TE LSP ends at CSR5. Additionally, the RESV message has no NHOP or outgoing interface since this is the terminating point of the LSP, as mentioned earlier. CSR5#show mpls traffic-eng tunnels source-id 1 P2P TUNNELS/LSPs: LSP Tunnel BASIC is signalled, connection is up InLabel : GigabitEthernet2.525, implicit-null Prev Hop : 132.2.5.2 OutLabel : RSVP Signalling Info:


    Src 1.1.1.1, Dst 5.5.5.5, Tun_Id 1, Tun_Instance 5
    RSVP Path Info:
      My Address: 5.5.5.5
      Explicit Route: NONE
      Record Route: NONE
      Tspec: ave rate=0 kbits, burst=1000 bytes, peak rate=0 kbits
    RSVP Resv Info:
      Record Route: NONE
      Fspec: ave rate=0 kbits, burst=1000 bytes, peak rate=0 kbits
CSR5#show ip rsvp sender filter session-type 7 tunnel-id 1
Destination     Tun Sender      TunID LSPID Prev Hop        I/F      BPS
5.5.5.5         1.1.1.1         1     5     132.2.5.2       Gi2.525  0
CSR5#show ip rsvp reservation filter session-type 7 tunnel-id 1
Destination     Tun Sender      TunID LSPID Next Hop        I/F      Fi Serv BPS
5.5.5.5         1.1.1.1         1     5     none            none     SE LOAD 0

To detail an example of a valid PCALC but failed RSVP signalling, I will configure an ACL on all of CSR5's interfaces to block IP protocol 46, which is RSVP. If RSVP cannot receive PATH messages, the LSP signalling cannot complete.
! CSR5
ip access-list extended DENY_RSVP
 deny 46 any any
 permit ip any any
interface GigabitEthernet2.525
 ip access-group DENY_RSVP in
interface GigabitEthernet2.550
 ip access-group DENY_RSVP in
interface GigabitEthernet2.551
 ip access-group DENY_RSVP in

Because the RSVP PATH messages sent to CSR5 are silently discarded, no RSVP PATHERR is ever signaled, and CSR1 simply keeps originating PATH messages. PCALC sees no issue with the network at all, though. In this case, it would not make sense to troubleshoot the TED or anything related to the IGP, as the issue is likely dataplane/RSVP related; we know this because the path is "valid". There are several other issues that can cause TE failures, but most are related to specific constraints and will be examined later.
CSR1#show mpls traffic-eng tunnels tunnel 1 | section Status
Status:
  Admin: up Oper: down Path: valid Signalling: RSVP signalling proceeding
  path option 10, type dynamic (Basis for Setup, path weight 30)


CSR1#show ip rsvp sender
To            From          Pro DPort Sport Prev Hop      I/F      BPS
5.5.5.5       1.1.1.1       0   1     10    none          none     0
CSR1#show ip rsvp reservation
To            From          Pro DPort Sport Next Hop      I/F      Fi Serv BPS
[no output]

31.1.2 TE attributes
This section continues from the last with a functional topology (test ACLs removed). We will begin adding tunnels with various attributes to see how they are used. We can also use MPLS OAM (detailed in a separate section) to verify the tunnels are functional in the data-plane. The first and most straightforward tunnel attribute is the TE metric. This is a "second metric" that we assign to each link. The TE metric is commonly used for creating two topologies in a network; IGP metrics can be used for data flows while TE metrics can define a separate topology for voice flows, as an example. If it is not specified, it is copied from the IGP metric. Like any metric, it can be asymmetric, and is evaluated by the local node. A new tunnel [ID 2] will create an LSP to XRv11, but we want to force it through CSR6. Knowing that all other link costs are equal, we can lower the TE metric on CSR1 facing CSR6 to achieve this. The path through CSR6 then becomes lower cost than the other 2-hop paths through CSR3 or CSR9. We do not need to do it in the other direction, technically, since we don’t expect return traffic to take the same path.
! CSR1
interface GigabitEthernet2.516
 mpls traffic-eng administrative-weight 5

We can verify this change is present in the TED. After all, all TE-aware nodes need to be aware of this change. Checking CSR6 also, we see that its local TE metric did not change, so we have introduced some asymmetry into the network. CSR1#show mpls traffic-eng topology level-2 1.1.1.1 brief IGP Id: 0000.0000.0001.00, MPLS TE Id:1.1.1.1 Router Node (isis level-2) [snip] link[2]: Point-to-Point, Nbr IGP Id: 0000.0000.0006.00, nbr_node_id:4, gen:570 frag_id: 0, Intf Address: 132.1.6.1, Nbr Intf Address: 132.1.6.6 TE metric: 5, IGP metric: 10, attribute flags: 0x4 SRLGs: None [snip] CSR1#show mpls traffic-eng topology level-2 6.6.6.6 brief IGP Id: 0000.0000.0006.00, MPLS TE Id:6.6.6.6 Router Node (isis level-2) [snip] link[1]: Point-to-Point, Nbr IGP Id: 0000.0000.0001.00, nbr_node_id:16, gen:563


frag_id: 0, Intf Address: 132.1.6.6, Nbr Intf Address: 132.1.6.1 TE metric: 10, IGP metric: 10, attribute flags: 0x4 SRLGs: None [snip]

Once the tunnel comes up, we can see the ERO specifies CSR6 as the next hop. The path weight is 25 which implies the TE metric of 5 was used (5+10+10 for total path). Recall that this ERO was the output of the PCALC operation and the input for the initial RSVP PATH message. The tunnel configuration specifies that the TE metric should be used explicitly, but this is the default. You can also specify IGP, but this is rare. ! CSR1 interface Tunnel2 description TE METRIC ip unnumbered Loopback0 tunnel mode mpls traffic-eng tunnel destination 11.11.11.11 tunnel mpls traffic-eng affinity 0x0 mask 0x0 tunnel mpls traffic-eng path-option 10 dynamic tunnel mpls traffic-eng path-selection metric te CSR1#show mpls traffic-eng tunnels tunnel 2 | section Status|RSVP_Path Status: Admin: up Oper: up Path: valid Signalling: connected path option 10, type dynamic (Basis for Setup, path weight 25) RSVP Path Info: My Address: 132.1.6.1 Explicit Route: 132.1.6.6 132.6.12.12 132.11.12.11 11.11.11.11 Record Route: NONE Tspec: ave rate=0 kbits, burst=1000 bytes, peak rate=0 kbits
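The path weight arithmetic (5+10+10) can be reproduced with a small Python sketch. The hop names follow the ERO for Tunnel2 and the per-link values come from the text; the `te_metric` dictionary and `path_weight` function are hypothetical names used only for illustration. Keys are directional because the TE metric is evaluated by the local node and can be asymmetric.

```python
# Directional TE metrics: (local node, neighbor) -> TE metric.
# CSR1 -> CSR6 was lowered to 5; CSR6 -> CSR1 still uses 10.
te_metric = {
    ("CSR1", "CSR6"): 5,
    ("CSR6", "CSR1"): 10,
    ("CSR6", "XRv12"): 10,
    ("XRv12", "XRv11"): 10,
}

def path_weight(hops, metrics):
    """Sum the TE metric of each link in the forward direction."""
    return sum(metrics[(a, b)] for a, b in zip(hops, hops[1:]))

tunnel2_path = ["CSR1", "CSR6", "XRv12", "XRv11"]
print(path_weight(tunnel2_path, te_metric))  # 25
```

Only the forward-direction metrics contribute to the sum, which is why changing the weight on CSR1 alone is enough to influence this tunnel.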

A quick OAM check shows us the label stack and reveals the hops along the TE LSP. CSR1#traceroute mpls traffic-eng tunnel 2 Tracing MPLS TE Label Switched Path on Tunnel2, timeout is 2 seconds [snip] Type escape sequence to abort. 0 132.1.6.1 MRU 1500 [Labels: 6011 Exp: 0] L 1 132.1.6.6 MRU 1500 [Labels: 92010 Exp: 0] 2 ms L 2 132.6.12.12 MRU 1500 [Labels: implicit-null Exp: 0] 38 ms ! 3 132.11.12.11 31 ms

A similar example on XRv11 can be used to steer traffic to CSR8 away from XRv12. We increase the TE metric on the XRv11-XRv12 link so that it is less preferred. We make the configuration change then


check the TE topology. For variety, we will verify it from XRv12's perspective as all routers in the IS-IS level have this updated information. ! XRv11 mpls traffic-eng interface GigabitEthernet0/0/0/0.512 admin-weight 300 RP/0/0/CPU0:XRv12#show mpls traffic-eng topology 11.11.11.11 brief | begin 0012 Link[2]:Point-to-Point, Nbr IGP Id:0000.0000.0012.00, Nbr Node Id:1, gen:69260 Frag Id:0, Intf Address:132.11.12.11, Intf Id:0 Nbr Intf Address:132.11.12.12, Nbr Intf Id:0 TE Metric:300, IGP Metric:10 Attribute Flags: 0xc Ext Admin Group: Length: 256 bits Value : 0x::c Attribute Names: BLUE(2) ORANGE(3) Switching Capability:None, Encoding:unassigned BC Model ID:RDM Physical BW:1000000 (kbps), Max Reservable BW Global:100000 (kbps) Max Reservable BW Sub:0 (kbps)

Next, we configure the tunnel. XR has a clean way of ignoring affinities with a self-explanatory command. XR also provides some verbose logging, which includes showing the ERO in the log message. We can also verify it with the traditional show commands; the output is almost identical to IOS.
! XRv11
interface tunnel-te2
 description TE METRIC
 ipv4 unnumbered Loopback0
 logging events all
 destination 8.8.8.8
 path-selection metric te
 affinity ignore
 path-option 10 dynamic
te_control[1044]: %ROUTING-MPLS_TE-5-LSP_EXPLICITROUTE : tunnel-te2 (signalled-name: XRv11_t2, LSP Id: 2) explicit-route, 132.10.11.10, 132.6.10.6, 132.6.8.8, 8.8
te_control[1044]: %ROUTING-MPLS_TE-5-LSP_RECORDROUTE : tunnel-te2 (signalled-name: XRv11_t2, LSP Id: 2) record-route empty.
te_control[1044]: %ROUTING-MPLS_TE-5-LSP_UPDOWN : tunnel-te2 (signalled-name: XRv11_t2, LSP Id: 2) state changed to up
RP/0/0/CPU0:XRv11#show mpls traffic-eng tunnels 2


Name: tunnel-te2 Destination: 8.8.8.8 Ifhandle:0x480 Signalled-Name: XRv11_t2 Status: Admin: up Oper: up Path: valid Signalling: connected path option 10, type dynamic (Basis for Setup, path weight 30) G-PID: 0x0800 (derived from egress interface properties) Bandwidth Requested: 0 kbps CT0 Creation Time: DAY MON 9 12:31:56 2015 (00:03:45 ago) Config Parameters: Bandwidth: 0 kbps (CT0) Priority: 7 7 Number of affinity constraints: 1 Ignore all Metric Type: TE (interface) Hop-limit: disabled Cost-limit: disabled AutoRoute: disabled LockDown: disabled Policy class: not set Forward class: 0 (default) Forwarding-Adjacency: disabled Loadshare: 0 equal loadshares Auto-bw: disabled Fast Reroute: Disabled, Protection Desired: None Path Protection: Not Enabled BFD Fast Detection: Disabled Reoptimization after affinity failure: Enabled Soft Preemption: Disabled History: Tunnel has been up for: 00:01:20 (since DAY MON 09 12:34:21 UTC 2015) Current LSP: Uptime: 00:01:20 (since DAY MON 09 12:34:21 UTC 2015) Path info (IS-IS 132 level-2): Node hop count: 3 Hop0: 132.10.11.10 Hop1: 132.6.10.6 Hop2: 132.6.8.8 Hop3: 8.8.8.8

A quick OAM check verifies the TE LSP is operational and shows us the TE labels per hop. We can also verify the RSVP PATH/RESV messages using similar commands to XE. RP/0/0/CPU0:XRv11#traceroute mpls traffic-eng tunnel-te 2 Tracing MPLS TE Label Switched Path on tunnel-te2, timeout is 2 seconds [snip] Type escape sequence to abort. 0 132.10.11.11 MRU 1500 [Labels: 10010 Exp: 0]


L 1 132.10.11.10 MRU 1500 [Labels: 6011 Exp: 0] 80 ms
L 2 132.6.10.6 MRU 1500 [Labels: implicit-null Exp: 0] 50 ms
! 3 132.6.8.8 20 ms
RP/0/0/CPU0:XRv11#show rsvp sender session-type lsp-p2p
Destination Add  DPort Source Add      SPort Pro Input IF  Rate   Burst Prot
---------------- ----- --------------- ----- --- --------- ------ ----- ----
8.8.8.8          2     11.11.11.11     2     0   No        0      1K    Off
RP/0/0/CPU0:XRv11#show rsvp reservation session-type lsp-p2p
Destination Add DPort Source Add   SPort Pro Input IF      Sty Serv Rate   Burst
--------------- ----- ------------ ----- --- ------------- --- ---- ------ -----
8.8.8.8         2     11.11.11.11  2     0   Gi0/0/0/0.501 SE  LOAD 0      1K

Another TE attribute is affinity. This is often called link coloring. It is a 4-byte value, usually written in hexadecimal, where each of the 32 bit positions represents a "color". This approach allows for 32 different colors (some routers, like XR, support 256 using extended link attributes) and any combination of colors is valid. In this topology, 4 colors are used with the following binary and hexadecimal strings. This information is also depicted on the main diagram for quick reference.
Red: 0001 or 0x1
Green: 0010 or 0x2
Blue: 0100 or 0x4
Orange: 1000 or 0x8
Individual links are colored administratively (this can be asymmetric, like the TE metric) and the colors can be used as a constraint for a given TE LSP. The red path might be high bandwidth, the green path might be low latency, and the blue path might be for scavenger traffic, as an example. Some links may be both low latency and high bandwidth, and would be colored both red and green. A tunnel uses the affinity mask to select which colors it cares about in terms of constraints. The default affinity and affinity mask on XE and XR are 0x00000000 and 0x0000FFFF, respectively.
MPLS-TE made things confusing by reversing the bitwise logic used in ACLs for TE affinity. With ACLs, a value of 0 in the mask means "I care about it" while 1 means "I ignore it". The logic is reversed with affinity masks, so a value of 1 means the bit is honored. The default values, therefore, mean that the low-order 16 bits of a link's attribute flags MUST be zero for the link to be usable. In our earlier examples, we specifically had to set the mask to 0x0, which says "I don't care about any affinity bits". If we fail to clear these mask bits in a "colored" network, no tunnels will come up. Looking at the PCALC debugs along with TE logging, this becomes clear. RSVP doesn't even get an ERO and no signaling takes place since PCALC cannot find any suitable path that meets the tunnel constraints. The affinity mask is shown as 0xFFFF, which is also visible via the TE show commands [ID 3].
! CSR1
interface Tunnel3
 ip unnumbered Loopback0
 tunnel mode mpls traffic-eng


tunnel destination 11.11.11.11 tunnel mpls traffic-eng path-option 10 dynamic CSR1#debug mpls traffic-eng path lookup TE-PCALC-API: 1.1.1.1_3->11.11.11.11_3 {7}: P2P LSP Path Lookup called TE-PCALC: 1.1.1.1_3->11.11.11.11_3 {7}: Path Request Info Flags: METRIC_TE IP explicit-path: None (dynamic) bw 0, min_bw 0, metric: 0 setup_pri 7, hold_pri 7 affinity_bits 0x0, affinity_mask 0xFFFF TE-PCALC-PATH: 1.1.1.1_3->11.11.11.11_3 {7}: Area (isis level-2) Path Lookup begin TE-PCALC-PATH: 1.1.1.1_3->11.11.11.11_3 {7}: Get path: Failed to find a path to destination TE-PCALC-PATH: 1.1.1.1_3->11.11.11.11_3 {7}: Area (isis level-2) Path Lookup end: path not found TE-PCALC-API: 1.1.1.1_3->11.11.11.11_3 {7}: P2P LSP Path Lookup result: failed %MPLS_TE-5-LSP: LSP 1.1.1.1 3_3: No path to destination, 0000.0000.0011.00 (affinity)
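The affinity test that PCALC applies to each link can be sketched in Python as a bitwise comparison: only the bits selected by the mask are compared between the link's attribute flags and the tunnel's affinity. This is a minimal sketch of the commonly described rule, not Cisco's implementation; `link_matches` and the color constants are names assumed for this example.

```python
# Color bit values used in this topology.
RED, GREEN, BLUE, ORANGE = 0x1, 0x2, 0x4, 0x8

def link_matches(link_attrs, affinity, mask):
    """True if the link satisfies the tunnel's affinity constraint.

    A mask bit of 1 means the bit is compared; 0 means it is ignored.
    """
    return (link_attrs & mask) == (affinity & mask)

# Default tunnel values (0x0 / 0xFFFF): any colored link is rejected.
print(link_matches(GREEN, 0x0, 0xFFFF))         # False
print(link_matches(0x0, 0x0, 0xFFFF))           # True, colorless link
# Mask 0x0 ignores every color, so any link is acceptable.
print(link_matches(RED | ORANGE, 0x0, 0x0))     # True
# "Must be green": other colors on the link are not tested.
print(link_matches(GREEN | BLUE, 0x2, 0x2))     # True
# "Must NOT be green".
print(link_matches(GREEN, 0x0, 0x2))            # False
```

With the default mask, a single colored link anywhere on every candidate path is enough for PCALC to fail, which matches the "(affinity)" reason in the log message.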

The output below shows that the path is "not valid" since PCALC cannot resolve. The affinity mask is shown again and this is the root of the issue. The output also shows us the shortest unconstrained path, which ignores the constraints and computes a cost-based path only for informational purposes. CSR1#show mpls traffic-eng tunnels tunnel 3 Name: CSR1_t3 (Tunnel3) Destination: 11.11.11.11 Status: Admin: up Oper: down Path: not valid Signalling: Down path option 10, type dynamic Config Parameters: Bandwidth: 0 kbps (Global) Priority: 7 7 Affinity: 0x0/0xFFFF Metric Type: TE (default) AutoRoute: disabled LockDown: disabled Loadshare: 0 [0] bw-based auto-bw: disabled Shortest Unconstrained Path Info: Path Weight: 25 (TE) Explicit Route: 132.1.6.6 132.6.12.12 132.11.12.11 11.11.11.11 History: Tunnel: Time since created: 2 minutes, 28 seconds Number of LSP IDs (Tun_Instances) used: 8

We will specifically configure this tunnel to traverse green links. This assumes we have an end-to-end green path, and we have a few. To accomplish this, we can specify the green affinity bit (0x2) and modify the mask to only account for that bit (0x2). We see that the tunnel comes up and has selected the path CSR3 > CSR2 > CSR10 > XRv11, which contains only green links. Notice that the links traversed by this tunnel are all green, but not only green; multi-colored links can also be used. The affinity mask simply chose not to test for the presence or absence of the other colors, so PCALC does not care. The cost of this path was 40, but we know the lower cost (unconstrained) path was 25. This is what makes TE so powerful; routing decisions can become arbitrary yet dynamic.
! CSR1
interface Tunnel3
 tunnel mpls traffic-eng affinity 0x2 mask 0x2
CSR1#show mpls traffic-eng tunnels tunnel 3 | section Status|RSVP_Path
Status:
  Admin: up Oper: up Path: valid Signalling: connected
  path option 10, type dynamic (Basis for Setup, path weight 40)
RSVP Path Info:
  My Address: 132.1.3.1
  Explicit Route: 132.1.3.3 132.2.3.3 132.2.3.2 132.2.10.10 132.10.11.11 11.11.11.11
  Record Route: NONE
  Tspec: ave rate=0 kbits, burst=1000 bytes, peak rate=0 kbits

We can do the opposite as well by instructing the tunnel to take any path that is NOT green [ID 4]. We still need to evaluate the same bit, so the mask does not change as we are only evaluating against the green color. The difference is that we clear the bit in the affinity so as to say "This bit must be zero, and I care about it". CSR1 selects a path that avoids all green paths, traversing CSR6 > XRv12 > XRv11. ! CSR1 interface Tunnel4 description NOT GREEN ip unnumbered Loopback0 tunnel mode mpls traffic-eng tunnel destination 11.11.11.11 tunnel mpls traffic-eng affinity 0x0 mask 0x2 tunnel mpls traffic-eng path-option 10 dynamic CSR1#show mpls traffic-eng tunnels tunnel 4 | section Status|RSVP_Path Status: Admin: up Oper: up Path: valid Signalling: connected path option 10, type dynamic (Basis for Setup, path weight 25) RSVP Path Info: My Address: 132.1.6.1 Explicit Route: 132.1.6.6 132.6.12.12 132.11.12.11 11.11.11.11 Record Route: NONE Tspec: ave rate=0 kbits, burst=1000 bytes, peak rate=0 kbits


The opposite logic is invalid; you cannot set something in a tunnel's affinity then say you don't care about it in the mask. The configuration doesn't make sense and is rejected by the parser. If you don’t care about it, set the affinity bit and mask bit to zero. CSR1(config-if)#tunnel mpls traffic-eng affinity 0x2 mask 0x0 % Bits cannot be set in affinity (0x2) if unset in mask (0x0)
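The parser rule can be expressed as a one-line bitwise check: every bit set in the affinity must also be set in the mask. The Python sketch below imitates that check and reuses the wording of the IOS error message quoted above; `validate_affinity` is a hypothetical helper written for this example, not a real API.

```python
def validate_affinity(affinity, mask):
    """Raise ValueError if the affinity sets a bit that the mask ignores."""
    stray_bits = affinity & ~mask & 0xFFFFFFFF  # affinity values are 32 bits wide
    if stray_bits:
        raise ValueError(
            f"Bits cannot be set in affinity ({affinity:#x}) "
            f"if unset in mask ({mask:#x})"
        )
    return True

print(validate_affinity(0x2, 0x2))   # accepted: must be green
print(validate_affinity(0x0, 0x2))   # accepted: must not be green
try:
    validate_affinity(0x2, 0x0)      # rejected, like the CLI example
except ValueError as err:
    print(f"% {err}")
```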

The logical AND operation is valid when combining colors. From CSR3 to CSR10, we can create a path which requires all transit links be green, blue, and orange at the same time [ID 5]. There isn't a one-line mechanism to do Boolean OR logic, though, as you would normally use separate path-options for this, which is evaluated in more complex examples later. The more aggressive the constraints, the less likely the tunnel is to form. In this case, there is only one path with all three colors and it traverses CSR2. The sum of 0x2, 0x4, and 0x8 is 0xE (14 in decimal) or 1110 in binary. We specify this as the affinity and the bits we care about (mask). ! CSR3 interface Tunnel5 description GREEN BLUE ORANGE TOGETHER ip unnumbered Loopback0 tunnel mode mpls traffic-eng tunnel destination 10.10.10.10 tunnel mpls traffic-eng affinity 0xE mask 0xE tunnel mpls traffic-eng path-option 10 dynamic CSR3#show mpls traffic-eng tunnels tunnel 5 | section Status|RSVP_Path Status: Admin: up Oper: up Path: valid Signalling: connected path option 10, type dynamic (Basis for Setup, path weight 20) RSVP Path Info: My Address: 132.2.3.3 Explicit Route: 132.2.3.2 132.2.10.10 10.10.10.10 Record Route: NONE Tspec: ave rate=0 kbits, burst=1000 bytes, peak rate=0 kbits

A small adjustment of the affinity mask changes the logic significantly and breaks the tunnel. We can specify the tunnel must traverse links that are blue, green, and orange, but also NOT red. This is accomplished by setting the mask to 0xF or 1111 in binary, which forces PCALC to examine the last 4 bits of the link colors. There is no path from CSR3 to CSR10 that meets the constraints since the CSR2-CSR10 link is colored red. We told the affinity mask to care about all four of our colors, but did not instruct the tunnel to use red links. ! CSR3 interface Tunnel6


description GREEN BLUE ORANGE TOGETHER, BUT NOT RED ip unnumbered Loopback0 tunnel mode mpls traffic-eng tunnel destination 10.10.10.10 tunnel mpls traffic-eng affinity 0xE mask 0xF tunnel mpls traffic-eng path-option 10 dynamic CSR3#show mpls traffic-eng tunnels tunnel 6 | section Status|RSVP_Path Status: Admin: up Oper: down Path: not valid Signalling: Down path option 10, type dynamic %MPLS_TE-5-LSP: LSP 3.3.3.3 6_1: No path to destination, 0000.0000.0010.00 (no path)
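To see why widening the mask from 0xE to 0xF changes the meaning, it can help to decode an affinity/mask pair into the colors a link must have and the colors it must not have. The Python sketch below does this for the four colors in this topology; `decode_constraint` and the `COLORS` table are illustrative names assumed for this example.

```python
# Color name -> bit value, as used in this topology.
COLORS = {"RED": 0x1, "GREEN": 0x2, "BLUE": 0x4, "ORANGE": 0x8}

def decode_constraint(affinity, mask):
    """Return (required, forbidden) color names for an affinity/mask pair.

    Masked bit set in affinity   -> the link must have that color.
    Masked bit clear in affinity -> the link must not have that color.
    Unmasked bits are ignored.
    """
    required = [n for n, bit in COLORS.items() if mask & bit and affinity & bit]
    forbidden = [n for n, bit in COLORS.items() if mask & bit and not affinity & bit]
    return required, forbidden

print(decode_constraint(0xE, 0xE))  # (['GREEN', 'BLUE', 'ORANGE'], [])
print(decode_constraint(0xE, 0xF))  # (['GREEN', 'BLUE', 'ORANGE'], ['RED'])
print(decode_constraint(0x0, 0x2))  # ([], ['GREEN'])
```

The only difference between the 0xE and 0xF masks is that red moves from "ignored" to "forbidden", which is enough to rule out a link that is also colored red.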

We will repeat a similar set of tests with XR. The affinity configuration in XR is much simpler and more flexible than XE. You can assign strings of text to affinity values and enumerate these strings on an interface; XR can add them together to determine the link attributes, which is a simple, human-readable operation. XRv12 uses this mechanism, plus some legacy syntax. Attribute values (generally a single bit or color) are mapped to strings, and then those strings are assigned to interfaces. The manner in which bits are assigned to strings can vary between the classic hex syntax and the syntax of specifying bit positions, where 0 is the least significant bit (0x1). You can also use the legacy syntax by specifying the hexadecimal value directly at the link level.
! XRv12
mpls traffic-eng
 interface GigabitEthernet0/0/0/0.502
  attribute-names RED ORANGE
 interface GigabitEthernet0/0/0/0.512
  attribute-names BLUE ORANGE
 interface GigabitEthernet0/0/0/0.562
  attribute-flags 0xc
 interface GigabitEthernet0/0/0/0.592
  attribute-names RED
 affinity-map RED bit-position 0
 affinity-map BLUE 0x4
 affinity-map GREEN bit-position 1
 affinity-map ORANGE 0x8
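The way named attributes combine into a single attribute-flags value can be sketched in Python. The map below mirrors the XRv12 affinity-map entries (bit positions converted with a left shift, hex values used directly); `attribute_flags` and `bit_positions` are hypothetical helpers for illustration only.

```python
# Mirror of the XRv12 affinity-map: a bit-position N is the value 1 << N.
affinity_map = {
    "RED": 1 << 0,     # bit-position 0
    "BLUE": 0x4,       # classic hex syntax
    "GREEN": 1 << 1,   # bit-position 1
    "ORANGE": 0x8,     # classic hex syntax
}

def attribute_flags(names):
    """OR together the bit values of the named colors on an interface."""
    flags = 0
    for name in names:
        flags |= affinity_map[name]
    return flags

def bit_positions(flags):
    """List the set bit positions, least significant bit first."""
    return [pos for pos in range(32) if flags & (1 << pos)]

print(hex(attribute_flags(["RED", "ORANGE"])))   # 0x9
print(hex(attribute_flags(["BLUE", "ORANGE"])))  # 0xc
print(bit_positions(0xC))                        # [2, 3]
```

The BLUE plus ORANGE result of 0xc with bit positions 2 and 3 lines up with the "Attribute Flags: 0xc" and "BLUE(2) ORANGE(3)" fields in the XR topology output shown earlier.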

From XRv12, we will create a tunnel to CSR1 that requires an orange path [ID 3]. We will verify this tunnel by checking the outgoing ERO, which is used as input to the RSVP process within PATH messages. We can validate the path using OAM. The "include-strict" option selects a set of colors that must exist on a link, but no others may. If this command were used in this tunnel, it would mean only use orange links that contain no other colors. Colorless links would also be rejected using strict inclusion. This would be identical to an IOS affinity mask of 0xFFFFFFFF since all bits are being evaluated.

! XRv12
interface tunnel-te3
 description ORANGE
 ipv4 unnumbered Loopback0
 logging events all
 destination 1.1.1.1
 affinity include ORANGE
 path-option 10 dynamic

RP/0/0/CPU0:XRv12#show rsvp sender session-type lsp-p2p destination 1.1.1.1 destination 1.1.1.1 detail | begin Outgoing
  Explicit Route (Outgoing):
    Strict, 132.6.12.6/32
    Strict, 132.6.9.9/32
    Strict, 132.4.9.4/32
    Strict, 132.1.4.1/32
    Strict, 1.1.1.1/32

RP/0/0/CPU0:XRv12#traceroute mpls traffic-eng tunnel-te 3
Tracing MPLS TE Label Switched Path on tunnel-te3, timeout is 2 seconds
[snip]
Type escape sequence to abort.

  0 132.6.12.12 MRU 1500 [Labels: 6014 Exp: 0]
L 1 132.6.12.6 MRU 1500 [Labels: 9013 Exp: 0] 10 ms
L 2 132.6.9.9 MRU 1500 [Labels: 4010 Exp: 0] 10 ms
L 3 132.4.9.4 MRU 1500 [Labels: implicit-null Exp: 0] 10 ms
! 4 132.1.4.1 10 ms

We can also invert the logic by selecting a path that is NOT orange [ID 4]. We can also reference the tunnel by name, provided we include quotes if the name has spaces. The outgoing ERO is shown at the bottom. MPLS traceroute verifies connectivity and the TE labels. The number in parentheses is the bit position of this color, counted from right to left starting at 0. Bit position 3 is 1000 in binary, or 0x8. The syntax in XR makes this very straightforward.

! XRv12
interface tunnel-te4
 description NOT ORANGE
 ipv4 unnumbered Loopback0
 logging events all
 destination 1.1.1.1
 affinity exclude ORANGE
 path-option 10 dynamic

RP/0/0/CPU0:XRv12#show mpls traffic-eng tunnels name "NOT ORANGE"
Name: tunnel-te4 Destination: 1.1.1.1 Ifhandle:0x680
  Signalled-Name: XRv12_t4
  Status:
    Admin: up Oper: up Path: valid Signalling: connected
    path option 10, type dynamic (Basis for Setup, path weight 20)
    G-PID: 0x0800 (derived from egress interface properties)
    Bandwidth Requested: 0 kbps CT0
    Creation Time: MON DAY 9 13:04:02 2015 (00:04:51 ago)
  Config Parameters:
    Bandwidth: 0 kbps (CT0) Priority: 7 7
    Number of affinity constraints: 1
      Exclude bit map : 0x8
      Exclude ext bit map :
        Length: 256 bits
        Value : 0x::8
      Exclude affinity name : ORANGE(3)
[snip]
  Node hop count: 2
  Hop0: 132.9.12.9
  Hop1: 132.1.9.1
  Hop2: 1.1.1.1

RP/0/0/CPU0:XRv12#traceroute mpls traffic-eng tunnel-te 4
Tracing MPLS TE Label Switched Path on tunnel-te4, timeout is 2 seconds
[snip]
Type escape sequence to abort.

  0 132.9.12.12 MRU 1500 [Labels: 9012 Exp: 0]
L 1 132.9.12.9 MRU 1500 [Labels: implicit-null Exp: 0] 0 ms
! 2 132.1.9.1 10 ms

We can combine affinities by enumerating multiple strings. From XRv12, we build a path to CSR5 that is both red and orange. The output shows the bitmap 0x9, which is red (0x1) plus orange (0x8). This simplifies the logic for the operation and improves readability over the IOS configuration. We can see the path traverses CSR10 > CSR2 > CSR5, which is the only available path meeting the affinity constraints.

! XRv12
interface tunnel-te5
 description RED AND ORANGE TOGETHER
 ipv4 unnumbered Loopback0
 logging events all
 destination 5.5.5.5
 affinity include RED ORANGE
 path-option 10 dynamic

RP/0/0/CPU0:XRv12#show mpls traffic-eng tunnels 5
Name: tunnel-te5 Destination: 5.5.5.5 Ifhandle:0x780
  Signalled-Name: XRv12_t5
  Status:
    Admin: up Oper: up Path: valid Signalling: connected
    path option 10, type dynamic (Basis for Setup, path weight 30)
    G-PID: 0x0800 (derived from egress interface properties)
    Bandwidth Requested: 0 kbps CT0
    Creation Time: MON DAY 9 13:25:08 2015 (00:00:05 ago)
  Config Parameters:
    Bandwidth: 0 kbps (CT0) Priority: 7 7
    Number of affinity constraints: 1
      Include bit map : 0x9
      Include ext bit map :
        Length: 256 bits
        Value : 0x::9
      Include affinity name : RED(0) ORANGE(3)
[snip]
  Node hop count: 3
  Hop0: 132.10.12.10
  Hop1: 132.2.10.2
  Hop2: 132.2.5.5
  Hop3: 5.5.5.5

Likewise, we can specify affinities we specifically want to exclude in conjunction with those we want to include. We instruct XRv12 to build a path to CSR5 that is red and orange, but not blue [ID 6]. The configuration is very similar to the previous tunnel except with the added blue exclusion. No such path exists, and PCALC is not able to find a path that meets the affinity constraints. The tunnel information shows the affinity names, along with their values, both for inclusion and exclusion.

! XRv12
interface tunnel-te6
 description RED AND ORANGE TOGETHER BUT NOT BLUE
 ipv4 unnumbered Loopback0
 logging events all
 destination 5.5.5.5
 affinity include RED ORANGE
 affinity exclude BLUE
 path-option 10 dynamic

%ROUTING-MPLS_TE-5-LSP_PCALC_FAIL : Path calculation failed on tunnel-te6 (signalled-name: XRv12_t6, LSP: 0, path option: 10, explicit path: NONE): No path to destination, 5.5.5.5 (affinity)

RP/0/0/CPU0:XRv12#show mpls traffic-eng tunnels 6
Name: tunnel-te6 Destination: 5.5.5.5 Ifhandle:0x880
  Signalled-Name: XRv12_t6
  Status:
    Admin: up Oper: down Path: not valid Signalling: Down
    path option 10, type dynamic
    Last PCALC Error: DAY MON 9 13:37:14 2015
      Info: No path to destination, 5.5.5.5 (affinity)
    G-PID: 0x0800 (derived from egress interface properties)
    Bandwidth Requested: 0 kbps CT0
    Creation Time: MON DAY 9 13:37:14 2015 (00:00:17 ago)
  Config Parameters:
    Bandwidth: 0 kbps (CT0) Priority: 7 7
    Number of affinity constraints: 2
      Include bit map : 0x9
      Include ext bit map :
        Length: 256 bits
        Value : 0x::9
      Include affinity name : RED(0) ORANGE(3)
      Exclude bit map : 0x4
      Exclude ext bit map :
        Length: 256 bits
        Value : 0x::4
      Exclude affinity name : BLUE(2)
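The include, include-strict, and exclude checks from the XR examples can be approximated with the Python sketch below. It is a simplified model rather than the actual PCALC implementation: the function name xr_link_allowed is hypothetical, and treating a multi-color exclude as "reject if any listed color is present" is an assumption (the examples above only ever exclude a single color, for which the distinction does not matter).

```python
def xr_link_allowed(link_flags, include=0, exclude=0, strict=False):
    """Model the XR affinity checks for one link.

    include: every listed color must be present on the link.
    strict:  when True, the link must carry exactly the included colors.
    exclude: ASSUMPTION - the link is rejected if any listed color is present.
    """
    if strict:
        include_ok = link_flags == include
    else:
        include_ok = (link_flags & include) == include
    exclude_ok = (link_flags & exclude) == 0
    return include_ok and exclude_ok


RED, BLUE, ORANGE = 0x1, 0x4, 0x8

# tunnel-te6: include RED ORANGE (0x9), exclude BLUE (0x4)
print(xr_link_allowed(RED | ORANGE, include=0x9, exclude=0x4))
print(xr_link_allowed(RED | BLUE | ORANGE, include=0x9, exclude=0x4))
```

A red-and-orange link passes, while the same link with blue added is rejected, which mirrors why tunnel-te6 cannot find a path when every candidate path contains a link that fails one of the two checks.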

TE tunnels can also request bandwidth. This is a control-plane reservation carried in the RSVP signaling and accounted for by both RSVP and the TED. We configure a TE tunnel from CSR1 to XRv11 requesting 20 Mbps of bandwidth [ID 7]. This is a constraint that gets fed into PCALC, and then to RSVP for signaling. Looking at the RSVP dumps, we can see the PROXY PATH (output from PCALC) contains a request for 2.5 MBps (big B for bytes), which is 20 Mbps. All of the other RSVP PATH attributes are present also, such as the SAO, ERO, and sender Tspec.

! CSR1
interface Tunnel7
 description BASIC BW REQ
 ip unnumbered Loopback0
 tunnel mode mpls traffic-eng
 tunnel destination 11.11.11.11
 tunnel mpls traffic-eng priority 7 7
 tunnel mpls traffic-eng bandwidth 20000
 tunnel mpls traffic-eng affinity 0x0 mask 0x0
 tunnel mpls traffic-eng path-option 10 dynamic

! CSR1
Incoming PROXY_PATH:
  version:1 flags:0000 cksum:0000 ttl:255 reserved:0 length:212
  SESSION type 7 length 16:
    Tun Dest: 11.11.11.11 Tun ID: 7 Ext Tun ID: 1.1.1.1
  HOP type 1 length 12:
    Hop Addr: 127.0.0.1 LIH: 0x00000000
  TIME_VALUES type 1 length 8 :
    Refresh Period (msec): 30000
  EXPLICIT_ROUTE type 1 length 44:
    1.1.1.1 (Strict IPv4 Prefix, 8 bytes, /32)
    132.1.6.6 (Strict IPv4 Prefix, 8 bytes, /32)
    132.6.10.10 (Strict IPv4 Prefix, 8 bytes, /32)
    132.10.11.11 (Strict IPv4 Prefix, 8 bytes, /32)
    11.11.11.11 (Strict IPv4 Prefix, 8 bytes, /32)
  SESSION_ATTRIBUTE type 7 length 20:
    Setup Prio: 7, Holding Prio: 7
    Flags: (0x4) SE Style
    Session Name: BASIC BW REQ
  SENDER_TEMPLATE type 7 length 12:
    Tun Sender: 1.1.1.1 LSP ID: 1
  SENDER_TSPEC type 2 length 36:
    version=0, length in words=7
    Token bucket fragment (service_id=1, length=6 words
      parameter id=127, flags=0, parameter length=5
      average rate=2500000 bytes/sec, burst depth=1000 bytes
      peak rate =2500000 bytes/sec
      min unit=0 bytes, max pkt size=2147483647 bytes
[snip]
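The unit conversion between the configured tunnel bandwidth (kbps) and the Tspec rate (bytes per second) is simple arithmetic, sketched here in Python. The function name kbps_to_bytes_per_sec is hypothetical and used only for illustration.

```python
def kbps_to_bytes_per_sec(kbps: int) -> int:
    """Convert a rate in kilobits per second to bytes per second."""
    bits_per_sec = kbps * 1000
    return bits_per_sec // 8


# "tunnel mpls traffic-eng bandwidth 20000" is 20000 kbps, or 20 Mbps.
print(kbps_to_bytes_per_sec(20000))
```

For the 20000 kbps request above, this yields the 2500000 bytes/sec average rate seen in the SENDER_TSPEC.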

When signaling is complete, the head end receives the final RSVP RESV message containing the TE label for the next-hop and the flowspec record showing the reservation was successful. Note: Shared Explicit (SE) is used to ensure make-before-break (MBB). This is contained in the PATH message as a flag shown above. When an LSR needs to reroute an LSP, the new LSP is built before the old one is torn down. SE prevents double-booking of bandwidth on links during this process.

! CSR1
Incoming Resv:
  version:1 flags:0000 cksum:FB86 ttl:255 reserved:0 length:108
  SESSION type 7 length 16:
    Tun Dest: 11.11.11.11 Tun ID: 7 Ext Tun ID: 1.1.1.1
  HOP type 1 length 12:
    Hop Addr: 132.1.6.6 LIH: 0x0000000E
  TIME_VALUES type 1 length 8 :
    Refresh Period (msec): 30000
  STYLE type 1 length 8 :
    Shared-Explicit (SE)
  FLOWSPEC type 2 length 36:
    version = 0 length in words = 7
    service id = 5, service length = 6
    tspec parameter id = 127, flags = 0,length = 5
    average rate = 2500000 bytes/sec, burst depth = 1000 bytes
    peak rate = 2500000 bytes/sec
    min unit = 0 bytes,max pkt size = 1500 bytes
  FILTER_SPEC type 7 length 12:
    Tun Sender: 1.1.1.1, LSP ID: 1
  LABEL type 1 length 8 :
    Labels: 6013

We can verify this bandwidth reservation several ways. The reservations are made outbound; because the LSP is unidirectional, CSR1 needs to send 20 Mbps of data, so the reservation is applied to that outgoing interface. The PATH message carries the initial bandwidth request and the RESV confirms it.

CSR1#show ip rsvp sender filter session-type 7 tunnel-id 7
Destination   Tun Sender   TunID  LSPID  Prev Hop  I/F   BPS
11.11.11.11   1.1.1.1      7      1      none      none  20M

CSR1#show ip rsvp reservation filter session-type 7 tunnel-id 7
To            From      Pro  DPort  Sport  Next Hop   I/F      Fi  Serv  BPS
11.11.11.11   1.1.1.1   0    7      1      132.1.6.6  Gi2.516  SE  LOAD  20M

We can check bandwidth reservations per interface as tracked by RSVP as well. This is a good summary view of the bandwidth reservations local to a router.

CSR1#show ip rsvp interface
interface   rsvp  allocated  i/f max  flow max  sub max  VRF
Gi2         ena   0          750M     750M      0
Gi2.513     ena   0          100M     100M      0
Gi2.514     ena   0          100M     100M      0
Gi2.516     ena   20M        100M     100M      0
Gi2.518     ena   0          100M     100M      0
Gi2.519     ena   0          100M     100M      0

Checking the MPLS tunnel characteristics, we can see the Tspec and Fspec rates equal the reservation that was requested.

CSR1#show mpls traffic-eng tunnels tunnel 7 | section Signalling
    Admin: up Oper: up Path: valid Signalling: connected
  RSVP Signalling Info:
    Src 1.1.1.1, Dst 11.11.11.11, Tun_Id 7, Tun_Instance 1
    RSVP Path Info:
      My Address: 132.1.6.1
      Explicit Route: 132.1.6.6 132.6.10.10 132.10.11.11 11.11.11.11
      Record Route: NONE
      Tspec: ave rate=20000 kbits, burst=1000 bytes, peak rate=20000 kbits
    RSVP Resv Info:
      Record Route: NONE
      Fspec: ave rate=20000 kbits, burst=1000 bytes, peak rate=20000 kbits

The ERO indicates that CSR6 is next in the TE tunnel path. Checking its tunnel characteristics, we see similar output to indicate the bandwidth reservation.

CSR6#show mpls traffic-eng tunnels source-id 7 | section Signalling
  RSVP Signalling Info:
    Src 1.1.1.1, Dst 11.11.11.11, Tun_Id 7, Tun_Instance 1
    RSVP Path Info:
      My Address: 132.6.10.6
      Explicit Route: 132.6.10.10 132.10.11.11 11.11.11.11
      Record Route: NONE
      Tspec: ave rate=20000 kbits, burst=1000 bytes, peak rate=20000 kbits
    RSVP Resv Info:
      Record Route: NONE
      Fspec: ave rate=20000 kbits, burst=1000 bytes, peak rate=20000 kbits

For additional verification, we can check the TED to ensure the bandwidth reservation is accounted for. All nodes must be aware of this allocated bandwidth so that when new tunnels are created, the local PCALC process has a complete and updated view of the topology. CSR6 shows that 20 Mbps are allocated on its interface towards CSR10. This is also present in the IS-IS database within CSR6's LSP (verbose information). Numerically higher priorities are less important, and the default priority is 7 (the lowest). This is examined next.

CSR1#show mpls traffic-eng topology igp-id isis 0000.0000.0006.00 | begin 0010\.
  link[3]: Point-to-Point, Nbr IGP Id: 0000.0000.0010.00, nbr_node_id:7, gen:2748
      frag_id: 0, Intf Address: 132.6.10.6, Nbr Intf Address: 132.6.10.10
      TE metric: 10, IGP metric: 10, attribute flags: 0x6
      SRLGs: None
      physical_bw: 1000000 (kbps), max_reservable_bw_global: 100000 (kbps)
      max_reservable_bw_sub: 0 (kbps)

                                 Global Pool       Sub Pool
               Total Allocated   Reservable        Reservable
               BW (kbps)         BW (kbps)         BW (kbps)
               ---------------   -----------       ----------
        bw[0]:               0        100000                0
        bw[1]:               0        100000                0
        bw[2]:               0        100000                0
        bw[3]:               0        100000                0
        bw[4]:               0        100000                0
        bw[5]:               0        100000                0
        bw[6]:               0        100000                0
        bw[7]:           20000         80000                0

CSR1#show isis database level-2 CSR6.00-00 verbose | section CSR10
  Metric: 10 IS (MT-IPv6) CSR10.00
  Metric: 10 IS-Extended CSR10.00
    Affinity: 0x00000006
    Interface IP Address: 132.6.10.6
    Neighbor IP Address: 132.6.10.10
    Physical BW: 1000000 kbits/sec
    Reservable Global Pool BW: 100000 kbits/sec
    Global Pool BW Unreserved:
      [0]: 100000 kbits/sec, [1]: 100000 kbits/sec
      [2]: 100000 kbits/sec, [3]: 100000 kbits/sec
      [4]: 100000 kbits/sec, [5]: 100000 kbits/sec
      [6]: 100000 kbits/sec, [7]: 80000 kbits/sec

Leaving the existing bandwidth reservation up, we create a 40 Mbps bandwidth reservation with a higher priority using a blue path, also from CSR1 to XRv11 [ID 8]. Based on the affinity and TE metric, we know that CSR6 will be the next-hop in the path. We verify the ERO by checking the outgoing RSVP PATH message.

! CSR1
interface Tunnel8
 description BLUE PRIORITY BW REQ
 ip unnumbered Loopback0
 tunnel mode mpls traffic-eng
 tunnel destination 11.11.11.11
 tunnel mpls traffic-eng priority 4 4
 tunnel mpls traffic-eng bandwidth 40000
 tunnel mpls traffic-eng affinity 0x4 mask 0x4
 tunnel mpls traffic-eng path-option 10 dynamic

CSR1#show ip rsvp sender detail filter session-type 7 tunnel-id 8 | section outgoing
  ERO: (outgoing)
    132.1.6.6 (Strict IPv4 Prefix, 8 bytes, /32)
    132.6.12.12 (Strict IPv4 Prefix, 8 bytes, /32)
    132.11.12.11 (Strict IPv4 Prefix, 8 bytes, /32)
    11.11.11.11 (Strict IPv4 Prefix, 8 bytes, /32)

RSVP now reports a total of 60 Mbps reserved on the link between CSR1 and CSR6. The prioritization of flows happens within the PCALC process, as RSVP is just used to signal them. When we check the TED on CSR1 relating to the CSR6-facing link, we can see that 20 Mbps is reserved at priority 7 and 40 Mbps is reserved at priority 4. The reservable bandwidth at a given priority is the maximum reservable bandwidth minus everything allocated at that priority or better. If a priority 4 flow reserves 40 Mbps, then that 40 Mbps is not available to priority 4 or worse. Higher priorities can preempt (based on the hold priority, which is the second number) but lower ones cannot.

CSR1#show ip rsvp interface
interface   rsvp  allocated  i/f max  flow max  sub max  VRF
Gi2         ena   0          750M     750M      0
Gi2.513     ena   0          100M     100M      0
Gi2.514     ena   0          100M     100M      0
Gi2.516     ena   60M        100M     100M      0
Gi2.518     ena   0          100M     100M      0
Gi2.519     ena   0          100M     100M      0

CSR1#show mpls traffic-eng topology level-2 1.1.1.1 | begin 0006
  link[2]: Point-to-Point, Nbr IGP Id: 0000.0000.0006.00, nbr_node_id:4, gen:2774
      frag_id: 0, Intf Address: 132.1.6.1, Nbr Intf Address: 132.1.6.6
      TE metric: 5, IGP metric: 10, attribute flags: 0x4
      SRLGs: None
      physical_bw: 1000000 (kbps), max_reservable_bw_global: 100000 (kbps)
      max_reservable_bw_sub: 0 (kbps)

                                 Global Pool       Sub Pool
               Total Allocated   Reservable        Reservable
               BW (kbps)         BW (kbps)         BW (kbps)
               ---------------   -----------       ----------
        bw[0]:               0        100000                0
        bw[1]:               0        100000                0
        bw[2]:               0        100000                0
        bw[3]:               0        100000                0
        bw[4]:           40000         60000                0
        bw[5]:               0         60000                0
        bw[6]:               0         60000                0
        bw[7]:           20000         40000                0
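The per-priority reservable values in the TED output follow a running-sum pattern, which the Python sketch below reproduces. This is an illustrative model of the arithmetic visible in the table, not the router's implementation; the function name reservable_per_priority and its arguments are hypothetical.

```python
def reservable_per_priority(max_reservable_kbps, allocated_kbps):
    """Return the reservable bandwidth for priorities 0 through 7.

    allocated_kbps maps a hold priority to the bandwidth allocated at
    that priority. Bandwidth held at a given priority is unavailable to
    that priority and to every numerically higher (worse) priority.
    """
    reservable = []
    running_total = 0
    for priority in range(8):
        running_total += allocated_kbps.get(priority, 0)
        reservable.append(max_reservable_kbps - running_total)
    return reservable


# CSR1's link to CSR6: 40 Mbps held at priority 4, 20 Mbps at priority 7.
for priority, bw in enumerate(reservable_per_priority(100000, {4: 40000, 7: 20000})):
    print(f"bw[{priority}]: {bw}")
```

Priorities 0 through 3 still see the full 100000 kbps because nothing is held at an equal or better priority, which is what makes preemption by a better-priority tunnel possible.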

Adding a third tunnel, we configure a reservation for 90 Mbps at the highest priority following similar constraints as the previous tunnel [ID 9]. Tunnels 7 and 8 are preempted since only 10 Mbps of reservable bandwidth remains at all priorities, and both of them request more than that. Tunnel7 had no affinity constraints, so it recalculated and selected an alternate lowest-cost path that provided the bandwidth (via CSR3). Tunnel8 was constrained by affinity, so it selected a higher-cost path (via CSR8) to meet its bandwidth requirement. Tunnel9, as the highest priority, is what caused the tunnels to recalculate. Just because bandwidth is no longer available for lower priority flows does not mean those tunnels will fail forever; they just cannot compete for the bandwidth on those links.

! CSR1
interface Tunnel9
 description BLUE PRIORITY BW REQ (BEST)
 ip unnumbered Loopback0
 tunnel mode mpls traffic-eng
 tunnel destination 11.11.11.11
 tunnel mpls traffic-eng priority 0 0
 tunnel mpls traffic-eng bandwidth 90000
 tunnel mpls traffic-eng affinity 0x4 mask 0x4
 tunnel mpls traffic-eng path-option 10 dynamic

CSR1#show ip rsvp sender detail filter session-type 7 tunnel-id 7 | section outgoing
  ERO: (outgoing)
    132.1.3.3 (Strict IPv4 Prefix, 8 bytes, /32)
    132.3.10.10 (Strict IPv4 Prefix, 8 bytes, /32)
    132.10.11.11 (Strict IPv4 Prefix, 8 bytes, /32)
    11.11.11.11 (Strict IPv4 Prefix, 8 bytes, /32)

CSR1#show ip rsvp sender detail filter session-type 7 tunnel-id 8 | section outgoing
  ERO: (outgoing)
    132.1.8.8 (Strict IPv4 Prefix, 8 bytes, /32)
    132.6.8.6 (Strict IPv4 Prefix, 8 bytes, /32)
    132.6.10.10 (Strict IPv4 Prefix, 8 bytes, /32)
    132.5.10.5 (Strict IPv4 Prefix, 8 bytes, /32)
    132.5.11.11 (Strict IPv4 Prefix, 8 bytes, /32)
    11.11.11.11 (Strict IPv4 Prefix, 8 bytes, /32)

CSR1#show ip rsvp sender detail filter session-type 7 tunnel-id 9 | section outgoing
  ERO: (outgoing)
    132.1.6.6 (Strict IPv4 Prefix, 8 bytes, /32)
    132.6.12.12 (Strict IPv4 Prefix, 8 bytes, /32)
    132.11.12.11 (Strict IPv4 Prefix, 8 bytes, /32)
    11.11.11.11 (Strict IPv4 Prefix, 8 bytes, /32)

RSVP is now tracking different bandwidth reservations on different interfaces based on the new path calculations. Again, this is the result of the 90 Mbps reservation at the higher priority causing the existing TE tunnels to recalculate.

CSR1#show ip rsvp interface
interface   rsvp  allocated  i/f max  flow max  sub max  VRF
Gi2         ena   0          750M     750M      0
Gi2.513     ena   20M        100M     100M      0
Gi2.514     ena   0          100M     100M      0
Gi2.516     ena   90M        100M     100M      0
Gi2.518     ena   40M        100M     100M      0
Gi2.519     ena   0          100M     100M      0

We won't show all of the link bandwidth updates, but checking CSR1's interface to CSR6, we see that only 10 Mbps is available for all priorities as the 90 Mbps reservation was the highest priority.

CSR1#show mpls traffic-eng topology level-2 1.1.1.1 | begin 0006
  link[2]: Point-to-Point, Nbr IGP Id: 0000.0000.0006.00, nbr_node_id:4, gen:2921
      frag_id: 0, Intf Address: 132.1.6.1, Nbr Intf Address: 132.1.6.6
      TE metric: 5, IGP metric: 10, attribute flags: 0x4
      SRLGs: None
      physical_bw: 1000000 (kbps), max_reservable_bw_global: 100000 (kbps)
      max_reservable_bw_sub: 0 (kbps)

                                 Global Pool       Sub Pool
               Total Allocated   Reservable        Reservable
               BW (kbps)         BW (kbps)         BW (kbps)
               ---------------   -----------       ----------
        bw[0]:           90000         10000                0
        bw[1]:               0         10000                0
        bw[2]:               0         10000                0
        bw[3]:               0         10000                0
        bw[4]:               0         10000                0
        bw[5]:               0         10000                0
        bw[6]:               0         10000                0
        bw[7]:               0         10000                0

During the initial pre-emption, the LSPs may issue log messages that show PCALC's inability to meet the requirements. These messages are often transient, but could lead to short-term outages when TE-FRR is not in use.

! CSR1
%MPLS_TE-5-TUN: Tun8: LSP path change nil for 8_35, path verification failed
%MPLS_TE-5-LSP: LSP 1.1.1.1 8_37: No path to destination, 0000.0000.0011.00 (bw or affinity)

You cannot configure a tunnel with a setup priority that is better than its hold priority. If a tunnel is set up at a certain priority, the tunnel's ability to sustain that priority should be at least as good as the setup. You can, however, give a tunnel a better hold priority to ensure that once it is set up, it is less likely to be pre-empted.

CSR1(config-if)#tunnel mpls traffic-eng priority 4 5
% Setup priority (4) may not be higher than hold priority (5)
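The two priority rules discussed so far (the setup/hold restriction and preemption eligibility) can be summarized with a short Python sketch. The function names valid_priorities and can_preempt are hypothetical; the sketch models only the priority comparison and ignores whether enough bandwidth remains, which decides whether preemption is actually needed.

```python
def valid_priorities(setup: int, hold: int) -> bool:
    """Lower numbers are better; hold must be at least as good as setup."""
    return 0 <= hold <= setup <= 7


def can_preempt(new_setup: int, existing_hold: int) -> bool:
    """A new LSP may preempt an existing one only if its setup priority
    is strictly better (numerically lower) than the existing hold priority."""
    return new_setup < existing_hold


print(valid_priorities(4, 5))  # the rejected "priority 4 5" configuration
print(valid_priorities(5, 3))  # setup 5 with a better hold priority of 3
print(can_preempt(0, 7))       # a priority 0 tunnel versus a priority 7 7 tunnel
```

A tunnel configured with priority 5 3 sets up at priority 5 but, once established, can only be displaced by tunnels whose setup priority is 0, 1, or 2.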

Since the configuration is almost identical in XR, we will configure one test for brevity. A tunnel is built to CSR1 [ID 7] requesting 45 Mbps following the path with the lowest IGP metric (TE metric is the default). No IGP metrics have been modified, but this means that any TE metric modifications will not be considered. XRv12 makes the reservation on the interface towards CSR6 when the LSP is signaled.

! XRv12
interface tunnel-te7
 description BASIC BW REQ
 ipv4 unnumbered Loopback0
 logging events all
 priority 5 3
 signalled-bandwidth 45000
 destination 1.1.1.1
 path-selection metric igp
 affinity ignore
 path-option 10 dynamic

RP/0/0/CPU0:XRv12#show rsvp sender session-type lsp-p2p destination 1.1.1.1 detail | begin Outgoing
  Explicit Route (Outgoing):
    Strict, 132.6.12.6/32
    Strict, 132.1.6.1/32
    Strict, 1.1.1.1/32

Checking the TED, we use XRv12 to check CSR6's link to CSR1, confirming the bandwidth reservation. Notice that despite having a setup priority of 5, the hold priority is better at 3, so if other LSRs want to preempt this path, they have to do so against a priority 3 flow since it is already established. For priorities 3 through 7, 55 Mbps remains, but better priorities (0 through 2) can use the full 100 Mbps.

RP/0/0/CPU0:XRv12#show rsvp reservation session-type lsp-p2p destination 1.1.1.1
Destination Add DPort   Source Add SPort Pro   Input IF    Sty Serv   Rate Burst
--------------- ----- ------------ ----- --- ------------- --- ---- ------ -----
        1.1.1.1     7  12.12.12.12     2   0 Gi0/0/0/0.562  SE LOAD 45000K    1K

RP/0/0/CPU0:XRv12#show mpls traffic-eng topology isis 0000.0000.0006.00 | begin 0001
  Link[2]:Point-to-Point, Nbr IGP Id:0000.0000.0001.00, Nbr Node Id:4, gen:79969
      Frag Id:0, Intf Address:132.1.6.6, Intf Id:0
      Nbr Intf Address:132.1.6.1, Nbr Intf Id:0
      TE Metric:10, IGP Metric:10
      Attribute Flags: 0x4
      Ext Admin Group:
          Length: 256 bits
          Value : 0x::4
      Attribute Names: BLUE(2)
      Switching Capability:None, Encoding:unassigned
      BC Model ID:RDM
      Physical BW:1000000 (kbps), Max Reservable BW Global:100000 (kbps)
      Max Reservable BW Sub:0 (kbps)
                                 Global Pool       Sub Pool
               Total Allocated   Reservable        Reservable
               BW (kbps)         BW (kbps)         BW (kbps)
               ---------------   -----------       ----------
        bw[0]:               0        100000                0
        bw[1]:               0        100000                0
        bw[2]:               0        100000                0
        bw[3]:           45000         55000                0
        bw[4]:               0         55000                0
        bw[5]:               0         55000                0
        bw[6]:               0         55000                0
        bw[7]:               0         55000                0

Despite having all these fancy TE attributes for dynamic calculations, we can arbitrarily steer traffic through a network using a predefined explicit path. This is essentially a hand-made ERO that gets fed into PCALC for sanity checking, but is ultimately delivered to RSVP with minimal changes. This is generally useful for testing purposes, loose-hop expansion, or link/node exclusion. From CSR1, we create a path to XRv11 that follows the path CSR3 > CSR9 > CSR2 > CSR10 > XRv11 [ID 10]. We can specify this using link addresses or TE IDs; link addresses are more strict, while TE IDs allow any link to be chosen if there are parallel links between nodes. The path below uses the most strict method, which is unnecessary in this topology as there are no sets of parallel links. Technically, you don't need to specify the remote TE ID as the last entry, but it is good practice and makes the code self-documenting.

! CSR1
ip explicit-path name PATH_1_3_2_10_11 enable
 next-address 132.1.3.3
 next-address 132.3.9.9
 next-address 132.2.9.2
 next-address 132.2.10.10
 next-address 132.10.11.11

interface Tunnel10
 description EXPLICIT PATH STRICT
 ip unnumbered Loopback0
 tunnel mode mpls traffic-eng
 tunnel destination 11.11.11.11
 tunnel mpls traffic-eng affinity 0x0 mask 0x0
 tunnel mpls traffic-eng path-option 10 explicit name PATH_1_3_2_10_11

Notice that this pre-defined path is almost identical to the ERO delivered from PCALC to RSVP. As shown in the debug below, PCALC is just verifying that the path makes sense. The CSPF algorithm is still invoked to ensure other constraints (affinity, bandwidth, etc.) are accounted for, but that isn't the focus now. Using affinity with explicit paths is possible but doesn't make much sense unless loose-hop path expansion or link/node exclusion is in effect.

CSR1#show ip rsvp sender detail filter session-type 7 tunnel-id 10 | section outgoing
  ERO: (outgoing)
    132.1.3.3 (Strict IPv4 Prefix, 8 bytes, /32)
    132.3.9.9 (Strict IPv4 Prefix, 8 bytes, /32)
    132.2.9.2 (Strict IPv4 Prefix, 8 bytes, /32)
    132.2.10.10 (Strict IPv4 Prefix, 8 bytes, /32)
    132.10.11.11 (Strict IPv4 Prefix, 8 bytes, /32)
    11.11.11.11 (Strict IPv4 Prefix, 8 bytes, /32)

CSR1#debug mpls traffic-eng path lookup
TE-PCALC-API: 1.1.1.1_17->11.11.11.11_10 {7}: P2P LSP Path Lookup called
TE-PCALC: 1.1.1.1_17->11.11.11.11_10 {7}: Path Request Info
  Flags: IP_EXPLICIT_PATH METRIC_TE
  IP explicit-path: Supplied
    132.1.3.3 132.3.9.9 132.2.9.2 132.2.10.10 132.10.11.11 11.11.11.11
  bw 0, min_bw 0, metric: 0
  setup_pri 7, hold_pri 7
  affinity_bits 0x0, affinity_mask 0x0
TE-PCALC-PATH: 1.1.1.1_17->11.11.11.11_10 {7}: Area (isis level-2) Path Lookup begin
TE-PCALC: Verify Path Lookup: 1.1.1.1_17->11.11.11.11_10 {7}: (protocol nil area nil)
  Flags: METRIC_TE
  sub-lsp weight:0 (Total LSP weight:0)
  Hop List:
    132.1.3.3 132.3.9.9 132.2.9.2 132.2.10.10 132.10.11.11 11.11.11.11
TE-PCALC-VERIFY: VERIFY to 11.11.11.11 BEGIN:
TE-PCALC-VERIFY: Verify:
TE-PCALC-VERIFY: 0000.0000.0001.00, 1.1.1.1 points to
TE-PCALC-VERIFY: 0000.0000.0003.00, 132.1.3.3
TE-PCALC-VERIFY: Verify:
TE-PCALC-VERIFY: 0000.0000.0003.00, 132.1.3.3 points to
TE-PCALC-VERIFY: 0000.0000.0009.00, 132.3.9.9
TE-PCALC-VERIFY: Verify:
TE-PCALC-VERIFY: 0000.0000.0009.00, 132.3.9.9 points to
TE-PCALC-VERIFY: 0000.0000.0002.00, 132.2.9.2
TE-PCALC-VERIFY: Verify:
TE-PCALC-VERIFY: 0000.0000.0002.00, 132.2.9.2 points to
TE-PCALC-VERIFY: 0000.0000.0010.00, 132.2.10.10
TE-PCALC-VERIFY: Verify:
TE-PCALC-VERIFY: 0000.0000.0010.00, 132.2.10.10 points to
TE-PCALC-VERIFY: 0000.0000.0011.00, 132.10.11.11
TE-PCALC-VERIFY: VERIFY to 11.11.11.11 PASSED

A quick OAM check verifies the LSP was built in exactly the manner we specified.

CSR1#traceroute mpls traffic-eng tunnel 10
Tracing MPLS TE Label Switched Path on Tunnel10, timeout is 2 seconds
[snip]
Type escape sequence to abort.

  0 132.1.3.1 MRU 1500 [Labels: 3008 Exp: 0]
L 1 132.1.3.3 MRU 1500 [Labels: 9011 Exp: 0] 1 ms
L 2 132.3.9.9 MRU 1500 [Labels: 2013 Exp: 0] 29 ms
L 3 132.2.9.2 MRU 1500 [Labels: 10012 Exp: 0] 21 ms
L 4 132.2.10.10 MRU 1500 [Labels: implicit-null Exp: 0] 27 ms
! 5 132.10.11.11 31 ms

An alternative approach is to specify the TE IDs. This is simpler to read and understand, but is a little less strict in cases where there are parallel links between nodes. Given the current topology, the result is identical, since there are no parallel links between any pair of nodes. We can also identify paths by numerical ID instead of by name (not preferred). The example below uses a path ID; the distinction is similar to numbered versus named ACLs, and numbered paths are generally considered a legacy mechanism. The ERO in the PATH message gets expanded to the link values via PCALC. MPLS traceroute confirms the path is the same as it was in the link-specific path. Notice that this explicit path does not specify the last node; this technically works as long as the path specifies the penultimate hop and the tunnel destination is correct, but is more difficult to read, understand, and troubleshoot.

! CSR1
ip explicit-path identifier 11 enable
 next-address 3.3.3.3
 next-address 9.9.9.9
 next-address 2.2.2.2
 next-address 10.10.10.10

interface Tunnel11
 description EXPLICIT PATH STRICT TE ID
 ip unnumbered Loopback0
 tunnel mode mpls traffic-eng
 tunnel destination 11.11.11.11
 tunnel mpls traffic-eng affinity 0x0 mask 0x0
 tunnel mpls traffic-eng path-option 10 explicit identifier 11

CSR1#show mpls traffic-eng tunnels tunnel 11 | section RSVP_Path
    RSVP Path Info:
      My Address: 132.1.3.1
      Explicit Route: 132.1.3.3 132.3.9.9 132.2.9.2 132.2.10.10 132.10.11.11 11.11.11.11
      Record Route: NONE
      Tspec: ave rate=0 kbits, burst=1000 bytes, peak rate=0 kbits

CSR1#traceroute mpls traffic-eng tunnel 11
Tracing MPLS TE Label Switched Path on Tunnel11, timeout is 2 seconds
[snip]
Type escape sequence to abort.

  0 132.1.3.1 MRU 1500 [Labels: 3011 Exp: 0]
L 1 132.1.3.3 MRU 1500 [Labels: 9009 Exp: 0] 2 ms
L 2 132.3.9.9 MRU 1500 [Labels: 2014 Exp: 0] 34 ms
L 3 132.2.9.2 MRU 1500 [Labels: 10014 Exp: 0] 26 ms
L 4 132.2.10.10 MRU 1500 [Labels: implicit-null Exp: 0] 25 ms
! 5 132.10.11.11 53 ms

We can become even more explicit by using the "verbatim" option. This completely bypasses the PCALC sanity check as the explicit-path is turned into an ERO exactly as written. PCALC receives the explicit-path but the topology check is skipped. This can be used for Inter-AS/Inter-area TE, as well as TE over IGPs that don't support TE extensions. For example, you can do MPLS TE over RIP or EIGRP using verbatim paths since the destination address of each PATH message has been identified. Following the same path, we create a new tunnel [ID 12] with the verbatim option. We will recycle the explicit path from tunnel 10. OAM confirms the operation. This feature is examined in great detail in the Inter-AS/Inter-area TE sections of this book. Since the entire TE topology is within a single IGP flooding domain, detailing this feature here is neither interesting nor useful.

! CSR1
TE-PCALC-API: 1.1.1.1_1->11.11.11.11_10 {7}: P2P LSP Path Lookup called
TE-PCALC: 1.1.1.1_1->11.11.11.11_10 {7}: Path Request Info
  Flags: IP_EXPLICIT_PATH NO_TOPOLOGY_CHECK METRIC_TE
  IP explicit-path: Supplied
    132.1.3.3 132.3.9.9 132.2.9.2 132.2.10.10 132.10.11.11 11.11.11.11
  bw 0, min_bw 0, metric: 0
  setup_pri 7, hold_pri 7
  affinity_bits 0x0, affinity_mask 0x0
TE-PCALC-PATH: Get Path Common: Skip topology check for verbatim path to dest 11.11.11.11
TE-PCALC-PATH: get_path_no_check: share count=0, dest=11.11.11.11
TE-PCALC-PATH: static_path_no_check on verbatim path to 11.11.11.11

CSR1#traceroute mpls traffic-eng tunnel 12
Tracing MPLS TE Label Switched Path on Tunnel12, timeout is 2 seconds
[snip]
Type escape sequence to abort.

  0 132.1.3.1 MRU 1500 [Labels: 3008 Exp: 0]
L 1 132.1.3.3 MRU 1500 [Labels: 9009 Exp: 0] 65 ms
L 2 132.3.9.9 MRU 1500 [Labels: 2012 Exp: 0] 31 ms
L 3 132.2.9.2 MRU 1500 [Labels: 10016 Exp: 0] 36 ms
L 4 132.2.10.10 MRU 1500 [Labels: implicit-null Exp: 0] 27 ms
! 5 132.10.11.11 17 ms

Explicit paths can also use loose hop expansion. We can specify a path that must traverse CSR6, but the intermediate hops can be anything (provided other constraints are met). A path from CSR1 to XRv11 that meets the aforementioned requirements is shown below [ID 13].

! CSR1
ip explicit-path name PATH_1_6_11_LOOSE enable
 next-address loose 6.6.6.6

interface Tunnel13
 description EXPLICIT PATH LOOSE
 ip unnumbered Loopback0
 tunnel mode mpls traffic-eng
 tunnel destination 11.11.11.11
 tunnel mpls traffic-eng affinity 0x0 mask 0x0
 tunnel mpls traffic-eng path-option 10 explicit name PATH_1_6_11_LOOSE

The signaling for loose paths is different than normal paths. Because PCALC on CSR1 was told to use an explicit path, it finds a path to CSR6 and creates an ERO for it. CSR1's ERO is very short as it only expands to include CSR6. The PCALC debug is shown below.

CSR1#show ip rsvp sender detail filter session-type 7 tunnel-id 13 | section outgoing
  ERO: (outgoing)
    132.1.6.6 (Strict IPv4 Prefix, 8 bytes, /32)
    6.6.6.6 (Strict IPv4 Prefix, 8 bytes, /32)

! CSR1
TE-PCALC-API: 1.1.1.1_55->11.11.11.11_13 {7}: P2P LSP Path Lookup called
TE-PCALC: 1.1.1.1_55->11.11.11.11_13 {7}: Path Request Info
  Flags: IP_EXPLICIT_PATH METRIC_TE
  IP explicit-path: Supplied
    6.6.6.6 Loose
  bw 0, min_bw 0, metric: 0
  setup_pri 7, hold_pri 7
  affinity_bits 0x0, affinity_mask 0x0
TE-PCALC-PATH: 1.1.1.1_55->11.11.11.11_13 {7}: Area (isis level-2) Path Lookup begin
TE-PCALC-PATH:Path from 0000.0000.0001.00 -> 0000.0000.0006.00:
  132.1.6.1->132.1.6.6 (admin_weight=5):
  num_hops 2, accumulated_aw 5, min_bw 100000
TE-PCALC-PATH: 1.1.1.1_55->11.11.11.11_13 {7}: Freeing rrr_path_setup_t
TE-PCALC-PATH: 1.1.1.1_55->11.11.11.11_13 {7}: Free all paths in path tree
TE-PCALC: Verify Path Lookup: 1.1.1.1_55->11.11.11.11_13 {7}: (protocol nil area nil)
  Flags: METRIC_TE
  Last Strict Router: 6.6.6.6
  sub-lsp weight:0 (Total LSP weight:5)
  Hop List:
    132.1.6.6 6.6.6.6
TE-PCALC-VERIFY: VERIFY to 6.6.6.6 BEGIN:
TE-PCALC-VERIFY: Verify:
TE-PCALC-VERIFY: 0000.0000.0001.00, 1.1.1.1 points to
TE-PCALC-VERIFY: 0000.0000.0006.00, 132.1.6.6
TE-PCALC-VERIFY: VERIFY to 6.6.6.6 PASSED

Debugging RSVP messaging on CSR6, the RSVP PATH message arrives with an ERO destined for CSR6. CSR6 needs to signal the remainder of the LSP, but it cannot make an ERO since it has no basis for determining the path; every other router in the path will determine the path dynamically upon receiving the path message. The outgoing PATH has no ERO.
! CSR6
Incoming Path:
  version:1 flags:0000 cksum:F048 ttl:255 reserved:0 length:196
  SESSION type 7 length 16:
    Tun Dest: 11.11.11.11 Tun ID: 13 Ext Tun ID: 1.1.1.1
  HOP type 1 length 12:
    Hop Addr: 132.1.6.1 LIH: 0x0000000E
  TIME_VALUES type 1 length 8 :
    Refresh Period (msec): 30000
  EXPLICIT_ROUTE type 1 length 20:
    132.1.6.6 (Strict IPv4 Prefix, 8 bytes, /32)
    6.6.6.6 (Strict IPv4 Prefix, 8 bytes, /32)
  LABEL_REQUEST type 1 length 8 :
    Layer 3 protocol ID: 2048
  SESSION_ATTRIBUTE type 7 length 28:
    Setup Prio: 7, Holding Prio: 7
    Flags: (0x4) SE Style
    Session Name: EXPLICIT PATH LOOSE
  [snip]
Outgoing Path:
  version:1 flags:0000 cksum:D78B ttl:254 reserved:0 length:176
  SESSION type 7 length 16:
    Tun Dest: 11.11.11.11 Tun ID: 13 Ext Tun ID: 1.1.1.1
  HOP type 1 length 12:
    Hop Addr: 132.6.12.6 LIH: 0x0000000D
  TIME_VALUES type 1 length 8 :
    Refresh Period (msec): 30000
  LABEL_REQUEST type 1 length 8 :
    Layer 3 protocol ID: 2048
  SESSION_ATTRIBUTE type 7 length 28:
    Setup Prio: 7, Holding Prio: 7
    Flags: (0x4) SE Style
    Session Name: EXPLICIT PATH LOOSE
  [snip]


As a result of this, CSR1 has no idea what the actual path is from a control-plane perspective. After CSR6, the actual LSP path is not known to the head-end. We can enable the RRO on the tunnel to record the route in the RESV messages.
CSR1#show mpls traffic-eng tunnels tunnel 13 | section RSVP_Signalling
  RSVP Signalling Info:
       Src 1.1.1.1, Dst 11.11.11.11, Tun_Id 13, Tun_Instance 55
    RSVP Path Info:
      My Address: 132.1.6.1
      Explicit Route: 132.1.6.6 6.6.6.6
      Record Route: NONE
      Tspec: ave rate=0 kbits, burst=1000 bytes, peak rate=0 kbits
    RSVP Resv Info:
      Record Route: NONE
      Fspec: ave rate=0 kbits, burst=1000 bytes, peak rate=0 kbits

! CSR1
interface Tunnel13
 tunnel mpls traffic-eng record-route

Clearing the tunnel (shut/no shut) allows the path to be signaled again. Looking at CSR6 RSVP signaling, we can see the RRO being built. The RRO is included in the PATH message to record the PHOPs, which is less relevant than the RRO in the RESV message, which carries the NHOPs. This information is now available on the head-end, even with loose explicit-paths. If this feature seems confusing, it is a result of the artificial setup. This is typically used for Inter-AS/Inter-area TE as it has little purpose within a TE domain. The feature is examined in deeper detail in those appropriate sections.
! CSR6
Incoming Path:
  [snip]
  RECORD_ROUTE type 1 length 12:
    132.1.6.1/32, Flags:0x0 (No Local Protection)
Outgoing Path:
  [snip]
  RECORD_ROUTE type 1 length 20:
    132.6.12.6/32, Flags:0x0 (No Local Protection)
    132.1.6.1/32, Flags:0x0 (No Local Protection)
Incoming Resv:
  [snip]
  RECORD_ROUTE type 1 length 20:
    132.6.12.12/32, Flags:0x0 (No Local Protection)
    132.11.12.11/32, Flags:0x0 (No Local Protection)
Outgoing Resv:
  [snip]
  RECORD_ROUTE type 1 length 28:
    132.6.12.6/32, Flags:0x0 (No Local Protection)
    132.6.12.12/32, Flags:0x0 (No Local Protection)
    132.11.12.11/32, Flags:0x0 (No Local Protection)

CSR1#show mpls traffic-eng tunnels tunnel 13 | section RSVP_Signalling
  RSVP Signalling Info:
       Src 1.1.1.1, Dst 11.11.11.11, Tun_Id 13, Tun_Instance 56
    RSVP Path Info:
      My Address: 132.1.6.1
      Explicit Route: 132.1.6.6 6.6.6.6
      Record Route:
      Tspec: ave rate=0 kbits, burst=1000 bytes, peak rate=0 kbits
    RSVP Resv Info:
      Record Route: 132.6.12.6 132.6.12.12 132.11.12.11
      Fspec: ave rate=0 kbits, burst=1000 bytes, peak rate=0 kbits
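The way the RRO grows at each hop can be mimicked in a few lines of Python. This is only an illustration derived from the CSR6 debug above, not a description of any RSVP implementation. The function name `forward_rro` is hypothetical, and the sketch assumes that a node simply places one of its own interface addresses in front of the list it received, which is what the debug output suggests.

```python
def forward_rro(received_rro, local_addr):
    """Return the RRO a node forwards: its own address first, then what it received."""
    return [local_addr] + list(received_rro)


# PATH direction at CSR6: the list accumulates previous hops back toward the head-end.
path_out = forward_rro(["132.1.6.1"], "132.6.12.6")

# RESV direction at CSR6: the list accumulates next hops toward the tail-end.
resv_out = forward_rro(["132.6.12.12", "132.11.12.11"], "132.6.12.6")

print(path_out)
print(resv_out)
```

The second list is the same one CSR1 displays under "RSVP Resv Info", which is why the head-end can see the hops beyond CSR6 once record-route is enabled.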

Explicit paths on XR are very similar to XE, so for brevity we examine one example using TE IDs in a strict path [ID 11]. In XR, you must specify the last hop in explicit paths.
! XRv11
explicit-path name PATH_11_5_2_3_4_1
 index 10 next-address strict ipv4 unicast 5.5.5.5
 index 20 next-address strict ipv4 unicast 2.2.2.2
 index 30 next-address strict ipv4 unicast 3.3.3.3
 index 40 next-address strict ipv4 unicast 4.4.4.4
 index 50 next-address strict ipv4 unicast 1.1.1.1
interface tunnel-te11
 description EXPLICIT PATH STRICT TE ID
 ipv4 unnumbered Loopback0
 logging events all
 destination 1.1.1.1
 affinity ignore
 path-option 10 explicit name PATH_11_5_2_3_4_1

You can verify the ERO using the "dst-port" option and specifying the tunnel ID. The ERO is a per-link expansion of the TE IDs that were specified in the explicit path. This implies that PCALC was involved to resolve those TE IDs into the appropriate link addresses.
RP/0/0/CPU0:XRv11#show rsvp sender session-type lsp-p2p dst-port 11 detail | begin Outgoing
   Explicit Route (Outgoing):
     Strict, 132.5.11.5/32
     Strict, 132.2.5.2/32
     Strict, 132.2.3.2/32
     Strict, 132.2.3.3/32
     Strict, 132.3.4.4/32
     Strict, 132.1.4.1/32
     Strict, 1.1.1.1/32

Bandwidth can also be automatically adjusted for a TE tunnel based on measured traffic. This can help optimize bandwidth usage to ensure that tunnels are requesting the bandwidth they actually require, versus making reservations that remain under-utilized. The feature samples the average data rate through the tunnel and periodically adjusts the signaled bandwidth to the largest sample since the last adjustment. Globally, we can enable the feature and define the interval between the samples (5 minute default). On a per-tunnel basis, we can define the interval between adjustments (24 hour default), min/max bandwidth reservations as lower and upper bounds (unlimited by default), and a threshold that defines how different the sample must be in order to make a change (10% by default).

As a best practice, the timers should follow this logic: load-interval < global sampling frequency < tunnel adjustment frequency. This way, the load-interval can change sufficiently between samples, and the tunnel adjustments will run less frequently than the sampler. If a tunnel is down when the global sample time hits, that sample is skipped.

Below is an example of auto-bandwidth. Every minute, the global process samples the bandwidth on all tunnels. The bandwidth is adjusted every 5 minutes based on the traffic going through it. Technically, we don't have to specify an initial bandwidth, but we do using 5 Mbps. We also specify a lower bound of 2 Mbps and an upper bound of 10 Mbps [ID 14]. This ensures that the reservation is not grossly over or under the actual need.
! CSR1
mpls traffic-eng auto-bw timers frequency 60
interface Tunnel14
 description AUTO BW TIMERS
 ip unnumbered Loopback0
 load-interval 30
 tunnel mode mpls traffic-eng
 tunnel destination 11.11.11.11
 tunnel mpls traffic-eng priority 7 7
 tunnel mpls traffic-eng bandwidth 5000
 tunnel mpls traffic-eng affinity 0x0 mask 0x0
 tunnel mpls traffic-eng path-option 10 dynamic
 tunnel mpls traffic-eng auto-bw frequency 300 max-bw 10000 min-bw 2000

The tunnel comes up immediately and we can see 5 Mbps has been reserved. We can check the TE tunnel details to see the auto-bw information. The (300/96) represents the configured adjustment timer over the remaining time until the next evaluation. It counts down to zero. The zero following this field is the last measured bandwidth, which is also zero at present. The "Samples Collected" counter shows how many samples, based on the global interval, have been taken. One was skipped since the tunnel was down during the initial configuration.
CSR1#show ip rsvp reservation filter session-type 7 tunnel-id 14
Destination   Tun Sender   TunID  LSPID  Next Hop    I/F      Fi  Serv  BPS
11.11.11.11   1.1.1.1      14     1      132.1.6.6   Gi2.516  SE  LOAD  5M

CSR1#show mpls traffic-eng tunnels tunnel 14 | section auto-bw
  auto-bw: (300/96) 0 Bandwidth Requested: 5000
    Samples Missed 1: Samples Collected 3

After the timer hits zero, the adjustment happens. Since we don't have any traffic going through the tunnel, we expect the lower bound of 2 Mbps to be signaled. We see a log message stating that the tunnel has been "reoptimized", which is what happens periodically with TE tunnels anyway, but this was initiated by the auto-bw process. The LSP ID has changed from 1 to 2, and RSVP is now reporting 2 Mbps reserved on the outgoing interface. The tunnel details also show the change. This can introduce churn into the topology as the TED information is re-flooded.
! CSR1
%MPLS_TE-5-TUN: Tun14: installed LSP 14_2 (popt 10) for 14_1 (popt 10), reopt. LSP is up

CSR1#show interfaces tunnel14 | include put_rate
  30 second input rate 0 bits/sec, 0 packets/sec
  30 second output rate 0 bits/sec, 0 packets/sec

CSR1#show ip rsvp reservation filter session-type 7 tunnel-id 14
Destination   Tun Sender   TunID  LSPID  Next Hop    I/F      Fi  Serv  BPS
11.11.11.11   1.1.1.1      14     2      132.1.6.6   Gi2.516  SE  LOAD  2M

CSR1#show mpls traffic-eng tunnels tunnel 14 | section auto-bw
  auto-bw: (300/221) 0 Bandwidth Requested: 2000
    Samples Missed 1: Samples Collected 1
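The periodic adjustment just observed on Tunnel14 can be summarized with a short Python sketch. It is a simplified model of the behavior described in the text, not the router's algorithm: the function name `next_signaled_bw` and its parameters are hypothetical, and the order of operations (clamp to the bounds first, then apply the percentage threshold) is an assumption.

```python
def next_signaled_bw(current_kbps, samples_kbps, min_kbps, max_kbps, threshold_pct=10):
    """Model one auto-bw adjustment: largest sample, clamped, subject to a threshold."""
    if not samples_kbps:
        # No samples were collected (for example, the tunnel was down): keep the value.
        return current_kbps
    # Use the largest sample seen since the last adjustment.
    candidate = max(samples_kbps)
    # Clamp to the configured lower and upper bounds.
    candidate = max(min_kbps, min(max_kbps, candidate))
    # Assumption: only re-signal when the change is at least threshold_pct of the
    # currently signaled bandwidth (integer math avoids rounding surprises).
    if abs(candidate - current_kbps) * 100 >= threshold_pct * current_kbps:
        return candidate
    return current_kbps


# Tunnel14: 5 Mbps signaled, no traffic, bounds of 2 to 10 Mbps.
print(next_signaled_bw(5000, [0, 0, 0, 0, 0], 2000, 10000))  # 2000
```

With an idle tunnel the largest sample is zero, so the result is pinned to the lower bound, which matches the 2 Mbps reservation RSVP reported after the first adjustment.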

We can also perform adjustments based on consecutive threshold crossings, which are evaluated at the global sampling interval. This can react more quickly to changing conditions because we don't have to wait for the full adjustment interval; if the signaled bandwidth is consistently too low, the adjustment is made sooner. XE does not appear to support "underflow", which means we cannot react by reducing bandwidth. The timer mechanism can still react to underflow conditions, but not the event-driven threshold feature [ID 15]. If the measured bandwidth is more than 5% greater than the current signaled bandwidth 3 times in a row (sampled at the global frequency), then the overflow adjustment occurs.
! CSR1
interface Tunnel15
 description AUTO BW FLOW THRESHOLDS
 ip unnumbered Loopback0
 tunnel mode mpls traffic-eng
 tunnel destination 11.11.11.11
 tunnel mpls traffic-eng priority 7 7
 tunnel mpls traffic-eng bandwidth 5000
 tunnel mpls traffic-eng affinity 0x0 mask 0x0
 tunnel mpls traffic-eng path-option 10 dynamic
 tunnel mpls traffic-eng auto-bw overflow-limit 3 overflow-threshold 5 max-bw 8000 min-bw 1500

CSR1#show ip rsvp reservation filter session-type 7 tunnel-id 15
Destination   Tun Sender   TunID  LSPID  Next Hop    I/F      Fi  Serv  BPS
11.11.11.11   1.1.1.1      15     1      132.1.6.6   Gi2.516  SE  LOAD  5M

CSR1#show mpls traffic-eng tunnels tunnel 15 | section auto-bw
  auto-bw: (300/239) 0 Bandwidth Requested: 5000
    Adjustment Threshold: 5%
    Overflow Limit: 3 Overflow Threshold: 5%
    Overflow Threshold Crossed: 0
    Samples Missed 0: Samples Collected 1
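The overflow logic amounts to counting consecutive threshold crossings, which the following Python sketch models. The class name `OverflowDetector` and its method are hypothetical, and resetting the counter on any sample that does not cross the threshold is an assumption based on the "in a row" wording above; the router's internal bookkeeping may differ.

```python
class OverflowDetector:
    """Count consecutive samples that exceed the signaled bandwidth by a percentage."""

    def __init__(self, threshold_pct, limit):
        self.threshold_pct = threshold_pct  # e.g. overflow-threshold 5
        self.limit = limit                  # e.g. overflow-limit 3
        self.crossed = 0                    # consecutive crossings seen so far

    def observe(self, sample_kbps, signaled_kbps):
        """Feed one sample; return True when an early adjustment should fire."""
        excess = sample_kbps - signaled_kbps
        if excess * 100 > self.threshold_pct * signaled_kbps:
            self.crossed += 1
        else:
            self.crossed = 0  # assumption: the streak must be unbroken
        if self.crossed >= self.limit:
            self.crossed = 0
            return True
        return False


detector = OverflowDetector(threshold_pct=5, limit=3)
# Tunnel15 signals 5000 kbps; 5300 kbps is 6% above that.
print([detector.observe(s, 5000) for s in (5300, 5300, 5300)])
```

Because the check runs at every global sample rather than at the per-tunnel adjustment timer, three one-minute samples are enough to trigger a change that would otherwise wait for the full interval.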

Another option is to simply monitor bandwidth on the tunnel without taking any action. The output looks similar except that no adjustments are ever made. The tunnel can still request bandwidth but it cannot be adjusted. The output specifically says "collected" and not "requested" since the latter never changes. We don't specify the max-bw since it isn't relevant, and the maximum value is applied for this field.
! CSR1
interface Tunnel16
 description AUTO BW MONITOR
 ip unnumbered Loopback0
 tunnel mode mpls traffic-eng
 tunnel destination 11.11.11.11
 tunnel mpls traffic-eng priority 7 7
 tunnel mpls traffic-eng bandwidth 5000
 tunnel mpls traffic-eng affinity 0x0 mask 0x0
 tunnel mpls traffic-eng path-option 10 dynamic
 tunnel mpls traffic-eng auto-bw frequency 300 max-bw 4294967295 collect-bw

CSR1#show mpls traffic-eng tunnels tunnel 16 | section auto-bw
  auto-bw: (300/189) 0 Bandwidth Last Collected: 0
    Samples Missed 1: Samples Collected 1

CSR1#show ip rsvp reservation filter session-type 7 tunnel-id 16
Destination   Tun Sender   TunID  LSPID  Next Hop    I/F      Fi  Serv  BPS
11.11.11.11   1.1.1.1      16     1      132.1.6.6   Gi2.516  SE  LOAD  5M

The auto-bandwidth feature is very similar in XR. We will configure a TE tunnel from XRv11 to CSR8 requesting 7 Mbps of bandwidth initially, with a min/max of 2.5/9.5 Mbps [ID 14]. The adjustment "application" occurs every 5 minutes, while the global timer takes samples once every minute (IOS measures this in seconds while XR uses minutes). The tunnel will adjust bandwidth only when the difference is at least 15% and at least 10 kbps, which avoids constant churn. As an unrelated feature, the path-option is limited to a specific IS-IS process and level, which can be useful when constraining TE tunnels to a single TE topology (could be useful if, for example, XRv11 were an ABR or L1L2 router).
! XRv11
mpls traffic-eng
 auto-bw collect frequency 1
interface tunnel-te14
 description AUTO BW TIMERS
 ipv4 unnumbered Loopback0
 load-interval 30
 auto-bw
  bw-limit min 2500 max 9500
  adjustment-threshold 15 min 10
  application 5
 logging events all
 signalled-bandwidth 7000
 destination 8.8.8.8
 affinity ignore
 path-option 10 dynamic isis 132 level 2

When the tunnel comes up, we can see the auto-bandwidth details. This is very similar to the XE output. XR also has an auto-bandwidth brief command that shows each tunnel as a row, along with the relevant bandwidth information.
RP/0/0/CPU0:XRv11#show mpls traffic-eng tunnels 14 | begin Auto-bw
  Auto-bw: enabled
    No BW Applied
    Bandwidth Min/Max: 2500-9500 kbps
    Application Frequency: 5 min Jitter: 0s Time Left: 3m 40s
    Collection Frequency: 1 min
    Samples Collected: 2 Next: 35s
    Highest BW: 0 kbps Underflow BW: 0 kbps
    Adjustment Threshold: 15% 10 kbps
    Overflow Detection disabled
    Underflow Detection disabled
  [snip]

RP/0/0/CPU0:XRv11#show mpls traffic-eng tunnels auto-bw brief
Tunnel       LSP  Last appl  Requested  Signalled  Highest   Application
Name         ID   BW(kbps)   BW(kbps)   BW(kbps)   BW(kbps)  Time Left
-----------  ---  ---------  ---------  ---------  --------  -----------
tunnel-te14  2               7000       7000       0         3m 18s

After five minutes, we can see the tunnel has been reoptimized to 2.5 Mbps. Since there is no traffic in the tunnel, the minimum bandwidth of 2.5 Mbps is applied. The LSP history also shows that the previous LSP ID was 2 and the new one is 3, since the LSP was "reoptimized" by the auto-bandwidth process. The brief output shows similar output in summary form.
RP/0/0/CPU0:XRv11#show mpls traffic-eng tunnels 14 | begin Auto-bw
  Auto-bw: enabled
    Last BW Applied: 2500 kbps CT0 BW Applications: 1
    Last Application Trigger: Periodic Application
    Bandwidth Min/Max: 2500-9500 kbps
    Application Frequency: 5 min Jitter: 0s Time Left: 5s
    Collection Frequency: 1 min
    Samples Collected: 4 Next: 0s
    Highest BW: 0 kbps Underflow BW: 0 kbps
    Adjustment Threshold: 15% 10 kbps
    Overflow Detection disabled
    Underflow Detection disabled
  Fast Reroute: Disabled, Protection Desired: None
  Path Protection: Not Enabled
  BFD Fast Detection: Disabled
  Reoptimization after affinity failure: Enabled
  Soft Preemption: Disabled
  History:
    Tunnel has been up for: 00:25:01
    Current LSP:
      Uptime: 00:19:55
    Prior LSP:
      ID: 2 Path Option: 10
      Removal Trigger: reoptimization completed

RP/0/0/CPU0:XRv11#show mpls traffic-eng tunnels auto-bw brief
Tunnel       LSP  Last appl  Requested  Signalled  Highest   Application
Name         ID   BW(kbps)   BW(kbps)   BW(kbps)   BW(kbps)  Time Left
-----------  ---  ---------  ---------  ---------  --------  -----------
tunnel-te14  3    2500       2500       2500       0         21s

Next, we will look at using flow thresholds on XRv11 [ID 15]. XR also supports the underflow option, which allows over-provisioned tunnels to be reduced again, where XE only supports overflow (the other direction). This tunnel requests 10 Mbps of bandwidth and is constrained to the range of 8-20 Mbps. If there is a 5% increase in bandwidth that is at least 256 kbps of a change, and this occurs in three samples, the tunnel can react to the overflow condition and request more bandwidth. If there is a 10% reduction in bandwidth that is at least 128 kbps of a change, and this occurs two times, the tunnel can react to the underflow condition and request less bandwidth. Both are considered reoptimizations and resignal the LSP. The application frequency defaults to 1440 minutes (24 hours), so without the flow thresholds it would take a long time for the tunnel to reoptimize using default timers.
! XRv11
interface tunnel-te15
 description AUTO BW FLOW THRESHOLDS
 ipv4 unnumbered Loopback0
 load-interval 30
 auto-bw
  bw-limit min 8000 max 20000
  overflow threshold 5 min 256 limit 3
  underflow threshold 10 min 128 limit 2
 logging events all
 signalled-bandwidth 10000
 destination 3.3.3.3
 affinity ignore
 path-option 10 dynamic

The tunnel comes up as expected (output not shown). After about 2 minutes, we see a log message showing the underflow condition, as there is no traffic in the tunnel. The global sample interval is 1 minute, so after two under-threshold conditions, the underflow event is triggered. The new signaled bandwidth is the minimum, 8 Mbps, per the configuration. The brief output also shows tunnel-te14, which is shut down, and has LSP ID 0 since the tunnel is not up. This tunnel was shut down after the last test for clarity.
! XRv11
%ROUTING-MPLS_TE-5-LSP_REOPT : tunnel-te15 (signalled-name: XRv11_t15, old LSP Id: 4, new LSP Id: 5) has been reoptimized; reason: Auto BW Bandwidth Underflow Change.

RP/0/0/CPU0:XRv11#show mpls traffic-eng tunnels 15 | begin Auto-bw
  Auto-bw: enabled
    Last BW Applied: 8000 kbps CT0 BW Applications: 2
    Last Application Trigger: Underflow Condition
    Bandwidth Min/Max: 8000-20000 kbps
    Application Frequency: 1440 min Jitter: 0s Time Left: 23h 58m 38s
    Collection Frequency: 1 min
    Samples Collected: 1 Next: 38s
    Highest BW: 0 kbps Underflow BW: 0 kbps
    Adjustment Threshold: 10% 10 kbps
    Overflow Threshold: 5% 256 kbps Limit: 0/3 Overflow BW Applications: 0
    Underflow Threshold: 10% 128 kbps Limit: 1/2 Underflow BW Applications: 3
  Fast Reroute: Disabled, Protection Desired: None
  Path Protection: Not Enabled
  BFD Fast Detection: Disabled
  Reoptimization after affinity failure: Enabled
  Soft Preemption: Disabled
  History:
    Tunnel has been up for: 00:06:52
    Current LSP:
      Uptime: 00:03:21
    Prior LSP:
      ID: 4 Path Option: 10
      Removal Trigger: reoptimization completed

RP/0/0/CPU0:XRv11#show mpls traffic-eng tunnels auto-bw brief
Tunnel       LSP  Last appl  Requested  Signalled  Highest   Application
Name         ID   BW(kbps)   BW(kbps)   BW(kbps)   BW(kbps)  Time Left
-----------  ---  ---------  ---------  ---------  --------  -----------
tunnel-te14  0    2500       7000       7000       0         2m 0s
tunnel-te15  7    8000       8000       8000       0         23h 58m 55s
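XR qualifies each crossing with both a percentage and an absolute minimum change. The Python sketch below models that two-part test for a single sample. The function name `crosses_threshold` is hypothetical, and treating both limits as inclusive ("at least") is an assumption taken from the wording of the text rather than from platform documentation.

```python
def crosses_threshold(sample_kbps, signaled_kbps, pct, min_kbps, direction):
    """Return True if one sample counts as an overflow or underflow crossing.

    direction is "overflow" (sample above signaled) or "underflow" (sample below).
    """
    delta = sample_kbps - signaled_kbps
    if direction == "underflow":
        delta = -delta
    # Both conditions must hold: a relative change of at least pct percent
    # and an absolute change of at least min_kbps.
    return delta > 0 and delta * 100 >= pct * signaled_kbps and delta >= min_kbps


# tunnel-te15: 10 Mbps signaled, idle tunnel, underflow threshold 10 min 128.
print(crosses_threshold(0, 10000, 10, 128, "underflow"))
# A 1% increase does not satisfy overflow threshold 5 min 256.
print(crosses_threshold(10100, 10000, 5, 256, "overflow"))
```

The absolute minimum matters mostly for small reservations: on a 512 kbps tunnel, a drop to 450 kbps is more than 10% but only 62 kbps, so it would not count as a crossing under this model.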

Last, we will examine bandwidth monitoring on XR for TE tunnels. We create a tunnel to CSR1 only for collecting bandwidth statistics [ID 16]. The tunnel requests 512 kbps of bandwidth and this value can never change since we are only monitoring BW on this tunnel. The "Highest BW" field represents the highest sample of bandwidth seen. The underflow bandwidth is the difference between the highest measured bandwidth and current bandwidth.
! XRv11
interface tunnel-te16
 description AUTO BW MONITOR
 ipv4 unnumbered Loopback0
 load-interval 30
 auto-bw
  collect-bw-only
 logging events all
 signalled-bandwidth 512
 destination 1.1.1.1
 affinity ignore
 path-option 10 dynamic

RP/0/0/CPU0:XRv11#show mpls traffic-eng tunnels 16 | begin Auto-bw
  Auto-bw: enabled (collect bw only)
    No BW Applied
    Collect-only Requested BW: 512 kbps
    Bandwidth Min/Max: 0-4294967295 kbps
    Application Frequency: 1440 min Jitter: 0s Time Left: 23h 59m 37s
    Collection Frequency: 1 min
    Samples Collected: 0 Next: 2s
    Highest BW: 0 kbps Underflow BW: 0 kbps
    Adjustment Threshold: 10% 10 kbps
    Overflow Detection disabled
    Underflow Detection disabled

RP/0/0/CPU0:XRv11#show mpls traffic-eng tunnels auto-bw brief
Tunnel       LSP  Last appl  Requested  Signalled  Highest   Application
Name         ID   BW(kbps)   BW(kbps)   BW(kbps)   BW(kbps)  Time Left
-----------  ---  ---------  ---------  ---------  --------  -----------
tunnel-te14  0    2500       7000       7000       0         1m 26s
tunnel-te15  0    8000       10000      10000      0         23h 55m 21s
tunnel-te16  2    -          512        512        0         23h 59m 56s

Many of these features can be combined to create some very sophisticated TE tunnels using multiple path-options. Individual path options can have separate LSP attributes for maximum flexibility. On CSR1, we configure a tunnel to CSR5 to follow these requirements, in sequence:
1. Red path, 40 Mbps reservation at priority 5
2. Any color but must avoid CSR9, 105 Mbps reservation
3. Blue path, 20 Mbps reservation, must transit XRv12, should record the route taken, and the path can never be changed
4. Any path as a last resort, and should monitor the bandwidth used for future plans

We create several LSP attribute classes to account for these options. Many of these options can be configured directly on the tunnel, in which case they apply to all path-options. Applying an LSP attribute class to a path-option overrides whichever options are specified at the tunnel interface level. For example, affinity is cleared on the tunnel, so options 2 and 4 don't need to specify affinity since ignoring it is the tunnel default. These requirements use a combination of explicit and dynamic paths to meet the transit routing constraints for some options. One restriction is that all bandwidth requests for all path-options must have the same priority, so 5 is used for all options and is defined in the tunnel for inheritance [ID 17].
! CSR1
mpls traffic-eng lsp attributes LSP_ATT_RED40
 affinity 0x1 mask 0x1
 bandwidth 40000
mpls traffic-eng lsp attributes LSP_ATT_NO_R9
 bandwidth 105000
mpls traffic-eng lsp attributes LSP_ATT_BLUE20
 affinity 0x4 mask 0x4
 bandwidth 20000
 lockdown
 record-route
mpls traffic-eng lsp attributes LSP_ATT_FALLBACK
 auto-bw collect-bw
ip explicit-path name EP_NO_R9 enable
 exclude-address 9.9.9.9
ip explicit-path name EP_TRANSIT_XRV12 enable
 next-address loose 12.12.12.12
interface Tunnel17
 description COMPLEX
 ip unnumbered Loopback0
 tunnel mode mpls traffic-eng
 tunnel destination 5.5.5.5
 tunnel mpls traffic-eng priority 5 5
 tunnel mpls traffic-eng affinity 0x0 mask 0x0
 tunnel mpls traffic-eng path-option 1 dynamic attributes LSP_ATT_RED40
 tunnel mpls traffic-eng path-option 2 explicit name EP_NO_R9 attributes LSP_ATT_NO_R9
 tunnel mpls traffic-eng path-option 3 explicit name EP_TRANSIT_XRV12 attributes LSP_ATT_BLUE20
 tunnel mpls traffic-eng path-option 100 dynamic attributes LSP_ATT_FALLBACK

We can enable debugging on CSR1 to look at the PCALC lookup process. PCALC is invoked for each path-option in sequence. The first option tries to find a red path from CSR1 to CSR5, but there are none. This option fails due to affinity, which the log message reveals. The combination of syslog and debug messages is always ideal for troubleshooting.
! CSR1, path-option 1
TE-PCALC-API: 1.1.1.1_13->5.5.5.5_17 {7}: P2P LSP Path Lookup called
TE-PCALC: 1.1.1.1_13->5.5.5.5_17 {7}: Path Request Info
  Flags: METRIC_TE
  IP explicit-path: None (dynamic)
  bw 40000, min_bw 0, metric: 0
  setup_pri 5, hold_pri 5
  affinity_bits 0x1, affinity_mask 0x1
TE-PCALC-PATH: 1.1.1.1_13->5.5.5.5_17 {7}: Area (isis level-2) Path Lookup begin
TE-PCALC-PATH: 1.1.1.1_13->5.5.5.5_17 {7}: Get path: Failed to find a path to destination
TE-PCALC-PATH: 1.1.1.1_13->5.5.5.5_17 {7}: Area (isis level-2) Path Lookup end: path not found
TE-PCALC-API: 1.1.1.1_13->5.5.5.5_17 {7}: P2P LSP Path Lookup result: failed
%MPLS_TE-5-LSP: LSP 1.1.1.1 17_13: No path to destination, 0000.0000.0005.00 (affinity)

The second option inherits the "ignored" affinity attribute but is requesting more bandwidth than is available. PCALC returns success since there is a path, but the bandwidth admission control mechanism rejects this as a viable path.
! CSR1, path-option 2
TE-PCALC-API: 1.1.1.1_14->5.5.5.5_17 {7}: P2P LSP Path Lookup called
TE-PCALC: 1.1.1.1_14->5.5.5.5_17 {7}: Path Request Info
  Flags: IP_EXPLICIT_PATH METRIC_TE
  IP explicit-path: Supplied
    9.9.9.9
  bw 105000, min_bw 0, metric: 0
  setup_pri 5, hold_pri 5
  affinity_bits 0x0, affinity_mask 0x0
TE-PCALC-PATH: 1.1.1.1_14->5.5.5.5_17 {7}: Area (isis level-2) Path Lookup begin
TE-PCALC-PATH:Path from 0000.0000.0001.00 -> 0000.0000.0005.00:
  132.5.10.10->132.5.10.5 (admin_weight=25):
  132.6.10.6->132.6.10.10 (admin_weight=15):
  132.1.6.1->132.1.6.6 (admin_weight=5):
  num_hops 4, accumulated_aw 25, min_bw 100000
TE-PCALC-PATH: 1.1.1.1_14->5.5.5.5_17 {7}: Area (isis level-2) Path Lookup end: path found
TE-PCALC-API: 1.1.1.1_14->5.5.5.5_17 {7}: P2P LSP Path Lookup result: success
TE-PCALC: 1.1.1.1_14->5.5.5.5_17 {7}: modify bandwidth: [0000.0000.0005.00] (isis level-2)
[snip]
%MPLS_TE-5-LSP: LSP 1.1.1.1 17_14: Path Error from 1.1.1.1: Admission control Failure: Requested bandwidth unavailable (flags 4)
%MPLS_TE-5-LSP: LSP 1.1.1.1 17_14: DOWN: path error
%MPLS_TE-5-TUN: Tun17: installed LSP nil for 17_14 (popt 2), path error
%MPLS_TE-5-TUN: Tun17: LSP path change nil for 17_14, path error

The third option chooses a blue path with 20 Mbps of bandwidth that transits XRv12. Since the explicit route in the PATH message terminates on XRv12, a lot of path control is lost after that point. This is another semi-awkward usage of loose-hop expansion within a TE domain. We see that CSR10 is in the path despite not being on a blue link, which demonstrates the danger of using loose-hop inclusion. Nonetheless, this is a valid path-option.
! CSR1, path-option 3
TE-PCALC-API: 1.1.1.1_16->5.5.5.5_17 {7}: P2P LSP Path Lookup called
TE-PCALC: 1.1.1.1_16->5.5.5.5_17 {7}: Path Request Info
  Flags: IP_EXPLICIT_PATH METRIC_TE
  IP explicit-path: Supplied
    12.12.12.12 Loose
  bw 20000, min_bw 0, metric: 0
  setup_pri 5, hold_pri 5
  affinity_bits 0x4, affinity_mask 0x4
TE-PCALC-PATH: 1.1.1.1_16->5.5.5.5_17 {7}: Area (isis level-2) Path Lookup begin
[snip]
TE-PCALC-PATH: 1.1.1.1_16->5.5.5.5_17 {7}: Area (isis level-2) Path Lookup end: path found
TE-PCALC-API: 1.1.1.1_16->5.5.5.5_17 {7}: P2P LSP Path Lookup result: success
%MPLS_TE-5-TUN: Tun17: installed LSP 17_16 (popt 3) for nil, got 1st feasible path opt
%MPLS_TE-5-LSP: LSP 1.1.1.1 17_16: UP
%MPLS_TE-5-TUN: Tun17: LSP path change 17_16 for nil, normal

We can verify the tunnel is functional by checking the tunnel details and using OAM for traceroute. The RRO seems to be missing the CSR10-XRv12 hop as well. Notice that all of the other path-options are listed, but option 3 is used for "setup". This implies that options 1 and 2 were untenable while option 100 was not evaluated yet.
CSR1#show mpls traffic-eng tunnels tunnel 17 | section Status|RSVP_Signalling
  Status:
    Admin: up Oper: up Path: valid Signalling: connected
    path option 3, type explicit EP_TRANSIT_XRV12 (Basis for Setup, path weight 15)
    path option 1, type dynamic
    path option 2, type explicit EP_NO_R9
    path option 100, type dynamic
  RSVP Signalling Info:
       Src 1.1.1.1, Dst 5.5.5.5, Tun_Id 17, Tun_Instance 16
    RSVP Path Info:
      My Address: 132.1.6.1
      Explicit Route: 132.1.6.6 132.6.12.12 12.12.12.12
      Record Route:
      Tspec: ave rate=20000 kbits, burst=1000 bytes, peak rate=20000 kbits
    RSVP Resv Info:
      Record Route: 132.6.12.6 132.6.12.12 132.5.10.10 132.5.10.5
      Fspec: ave rate=20000 kbits, burst=1000 bytes, peak rate=20000 kbits

CSR1#traceroute mpls traffic-eng tunnel 17
Tracing MPLS TE Label Switched Path on Tunnel17, timeout is 2 seconds
[snip]
Type escape sequence to abort.
  0 132.1.6.1 MRU 1500 [Labels: 6010 Exp: 0]
L 1 132.1.6.6 MRU 1500 [Labels: 92010 Exp: 0] 3 ms
L 2 132.6.12.12 MRU 1500 [Labels: 10010 Exp: 0] 44 ms
L 3 132.10.12.10 MRU 1500 [Labels: implicit-null Exp: 0] 21 ms
! 4 132.5.10.5 29 ms
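The sequential evaluation of path-options can be captured in a small Python sketch. This is a conceptual model only: `PathOption` and `first_feasible` are hypothetical names, and each option's `try_setup` callable merely stands in for the combined result of PCALC and RSVP admission control that we observed in the debugs for Tunnel17.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class PathOption:
    index: int                     # lower index is tried first
    label: str                     # free-form description
    try_setup: Callable[[], bool]  # stand-in for PCALC plus RSVP signaling


def first_feasible(options: List[PathOption], tried: List[int]) -> Optional[PathOption]:
    """Try options in ascending index order and stop at the first that succeeds."""
    for opt in sorted(options, key=lambda o: o.index):
        tried.append(opt.index)
        if opt.try_setup():
            return opt
    return None


# Outcomes mirror the Tunnel17 walk-through; the booleans are illustrative inputs.
tunnel17 = [
    PathOption(100, "dynamic fallback", lambda: True),
    PathOption(1, "red, 40 Mbps", lambda: False),             # no red path (affinity)
    PathOption(2, "avoid CSR9, 105 Mbps", lambda: False),     # admission control failure
    PathOption(3, "blue via XRv12, 20 Mbps", lambda: True),
]
tried: List[int] = []
chosen = first_feasible(tunnel17, tried)
print(chosen.index, tried)  # 3 [1, 2, 3]
```

Option 100 never appears in the tried list, which corresponds to the tunnel output listing it without any evaluation having taken place.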

We will demonstrate a complicated TE tunnel on XR also [ID 17]. The XR options are surprisingly limited, as the only significant things we can do are specify the affinity and bandwidth. Options like bandwidth priority and record-route are global to the tunnel, not the LSP attributes per path-option. In this example, XRv12 follows rules similar to those of CSR1 but builds a tunnel to CSR4 and omits the loose next-hop.
! XRv12
explicit-path name EP_NO_R9
 index 1 exclude-address ipv4 unicast 9.9.9.9
interface tunnel-te17
 description COMPLEX
 ipv4 unnumbered Loopback0
 logging events all
 destination 4.4.4.4
 record-route
 affinity ignore
 path-option 1 dynamic attribute-set LSP_ATT_RED40
 path-option 2 explicit name EP_NO_R9 attribute-set LSP_ATT_NO_R9
 path-option 100 dynamic lockdown attribute-set LSP_ATT_ORANGE20
mpls traffic-eng
 attribute-set path-option LSP_ATT_NO_R9
  signalled-bandwidth 105000 class-type 0
  no affinity ignore
 attribute-set path-option LSP_ATT_RED40
  signalled-bandwidth 40000 class-type 0
  affinity include RED
 attribute-set path-option LSP_ATT_ORANGE20
  signalled-bandwidth 20000 class-type 0
  affinity include ORANGE

Because there is a red path with 40 Mbps available, that first option is valid. The tunnel is up and operational and the backup path-options are also listed; only details for the active path-option are expanded as shown below. The RSVP signaling information is included in the "detail" output for the tunnel, and this is where we can see the RESV RRO. The path goes from CSR9 to CSR3 to CSR4, which are all red links as expected. We also quickly confirm the 40 Mbps reservation is properly allocated by RSVP.
RP/0/0/CPU0:XRv12#show mpls traffic-eng tunnels 17
Name: tunnel-te17 Destination: 4.4.4.4 Ifhandle:0x980
  Signalled-Name: XRv12_t17
  Status:
    Admin: up Oper: up Path: valid Signalling: connected
    path option 1, type dynamic (Basis for Setup, path weight 30)
      Path-option attribute: LSP_ATT_RED40
        Number of affinity constraints: 1
          Include bit map : 0x1
          Include ext bit map :
            Length: 256 bits
            Value : 0x::1
          Include affinity name : RED(0)
        Bandwidth: 40000 (CT0)
    path option 2, type explicit EP_NO_R9
      Path-option attribute: LSP_ATT_NO_R9
    path option 100, (LOCKDOWN) type dynamic
      Path-option attribute: LSP_ATT_ORANGE20

RP/0/0/CPU0:XRv12#show mpls traffic-eng tunnels 17 detail | begin Resv Info
  Resv Info:
    Record Route:
      IPv4 132.3.9.9, flags 0x0
      IPv4 132.3.4.3, flags 0x0
      IPv4 132.3.4.4, flags 0x0
    Fspec: avg rate=40000 kbits, burst=1000 bytes, peak rate=40000 kbits

RP/0/0/CPU0:XRv12#show rsvp interface
*: RDM: Default I/F B/W % : 75% [default] (max resv/bc0), 0% [default] (bc1)
Interface      MaxBW (bps)  MaxFlow (bps)  Allocated (bps)  MaxSub (bps)
-------------  -----------  -------------  ---------------  ------------
Gi0/0/0/0.502  100M         100M           0 ( 0%)          0
Gi0/0/0/0.512  100M         100M           0 ( 0%)          0
Gi0/0/0/0.562  100M         100M           0 ( 0%)          0
Gi0/0/0/0.592  100M         100M           40M ( 40%)       0

31.1.3 Directing traffic into TE tunnels and tunnel stitching

So far, we have not transferred any real traffic inside of our TE tunnels other than OAM tests. Without TE, the routers typically use LDP for transport label bindings based on the BGP next-hops, PW destinations, or other destinations reachable through IGP. The entire purpose of TE is to modify this IGP-based behavior. For these examples, we will use CSR1 and XRv11 as endpoints and direct traffic between these PEs into TE tunnels using various methods. First, we see that they currently use LDP labels. CSR1 has a VPNv4 route (and label) for XRv13 within the VPN. Based on the IGP route to XRv11's loopback, we see three possible LDP labels that can be used.
CSR1#show bgp vpnv4 unicast vrf TE 13.13.13.13/32
BGP routing table entry for 132:1:13.13.13.13/32, version 8
Paths: (1 available, best #1, table TE)
  Advertised to update-groups:
     1
  Refresh Epoch 1
  13
    11.11.11.11 (metric 30) (via default) from 11.11.11.11 (11.11.11.11)
      Origin IGP, metric 0, localpref 100, valid, internal, best
      Extended Community: RT:132:1
      mpls labels in/out nolabel/91011
      rx pathid: 0, tx pathid: 0x0

CSR1#show ip route 11.11.11.11
Routing entry for 11.11.11.11/32
  Known via "isis", distance 115, metric 30, type level-2
  Redistributing via isis 132
  Last update from 132.1.9.9 on GigabitEthernet2.519, 17:01:37 ago
  Routing Descriptor Blocks:
    132.1.9.9, from 11.11.11.11, 17:01:37 ago, via GigabitEthernet2.519
      Route metric is 30, traffic share count is 1
  * 132.1.6.6, from 11.11.11.11, 17:01:37 ago, via GigabitEthernet2.516
      Route metric is 30, traffic share count is 1
    132.1.3.3, from 11.11.11.11, 17:01:37 ago, via GigabitEthernet2.513
      Route metric is 30, traffic share count is 1

1338 © 2016 Nicholas J. Russo

CSR1#show mpls ldp bindings 11.11.11.11 32 neighbor 9.9.9.9 lib entry: 11.11.11.11/32, rev 45 remote binding: lsr: 9.9.9.9:0, label: 9009 CSR1#show mpls ldp bindings 11.11.11.11 32 neighbor 6.6.6.6 lib entry: 11.11.11.11/32, rev 45 remote binding: lsr: 6.6.6.6:0, label: 6009 CSR1#show mpls ldp bindings 11.11.11.11 32 neighbor 3.3.3.3 lib entry: 11.11.11.11/32, rev 45 remote binding: lsr: 3.3.3.3:0, label: 3009

Although somewhat unrelated to this exact task, it can be confusing to know which ECMP path will be taken for VPN traffic from CSR7 to XRv13. The VPN CEF table points to CSR9 while the global table between BGP next-hops points to CSR3. The latter is actually correct; since the transport label lookup happens in the global table, the global FIB is consulted. We confirm this with a traceroute inside the VPN. CSR1#show ip cef vrf TE exact-route 7.7.7.7 13.13.13.13 7.7.7.7 -> 13.13.13.13 => label 91011 label 9009 TAG adj out of GigabitEthernet2.519, addr 132.1.9.9 CSR1#show ip cef exact-route 1.1.1.1 11.11.11.11 1.1.1.1 -> 11.11.11.11 => label 3009 TAG adj out of GigabitEthernet2.513, addr 132.1.3.3 CSR7#traceroute 13.13.13.13 source 7.7.7.7 Type escape sequence to abort. Tracing the route to 13.13.13.13 VRF info: (vrf in name/id, vrf out name/id) 1 10.1.7.1 0 msec 1 msec 1 msec 2 132.1.3.3 [MPLS: Labels 3009/91011 Exp 0] 98 msec 52 msec 37 msec 3 132.3.10.10 [MPLS: Labels 10009/91011 Exp 0] 32 msec 42 msec 50 msec 4 132.10.11.11 [MPLS: Label 91011 Exp 0] 168 msec 64 msec 42 msec 5 10.11.13.13 17 msec * 19 msec

The first and most straightforward way to get this flow into the tunnel is using static routes. We no longer want to use LDP labels for the topmost label, which means we have to learn the route to the BGP next-hop via a TE tunnel. The router knows that it should use the TE label when the routing converges in this way. We configure a basic tunnel that can take any path not transiting CSR9. We also configure a static route to direct all traffic to XRv11's loopback into the TE tunnel. This method includes ALL IPv4 traffic between the two, including existing BGP sessions, pings, etc. If the tunnel is down for whatever reason, regular routing rules apply. This isn’t a special case but a default behavior of the static route. If the specified outgoing interface is down, the static route is removed from the RIB unless marked with


the “permanent” option. The route with the next lowest AD out of a functional interface is used, so IGP routes + LDP labels can be a fallback for failed or shutdown TE tunnels. ! CSR1 interface Tunnel18 description STEER WITH STATIC ip unnumbered Loopback0 tunnel mode mpls traffic-eng tunnel destination 11.11.11.11 tunnel mpls traffic-eng affinity 0x0 mask 0x0 tunnel mpls traffic-eng path-option 10 explicit name EP_NO_R9 ip route 11.11.11.11 255.255.255.255 Tunnel18
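The fallback behavior described above can be sketched as a small route-selection model. This is only an illustration, assuming a single IS-IS candidate (the lab actually has three equal-cost paths); the `Route` class and `best_route` function are hypothetical names and not part of any Cisco software.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Route:
    prefix: str
    source: str            # "static" or "isis"
    distance: int          # administrative distance
    out_interface: str
    permanent: bool = False


def best_route(candidates, up_interfaces):
    """Return the usable route with the lowest administrative distance.

    A route is usable when its outgoing interface is up, or when it is
    marked permanent (simplified model of the "permanent" keyword).
    """
    usable = [
        r for r in candidates
        if r.out_interface in up_interfaces or r.permanent
    ]
    return min(usable, key=lambda r: r.distance) if usable else None


candidates = [
    Route("11.11.11.11/32", "static", 1, "Tunnel18"),
    Route("11.11.11.11/32", "isis", 115, "GigabitEthernet2.516"),
]

# Tunnel up: the static route (AD 1) wins and traffic uses the TE label.
tunnel_up = best_route(candidates, {"Tunnel18", "GigabitEthernet2.516"})
# Tunnel down: the static route is withdrawn and IS-IS + LDP takes over.
tunnel_down = best_route(candidates, {"GigabitEthernet2.516"})
print(tunnel_up.source, tunnel_down.source)
```

The model makes the point that nothing TE-specific is happening here: the failover is ordinary RIB behavior driven by interface state and administrative distance.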

We verify that the tunnel is up (path is CSR6 > CSR10 > XRv11) and that the static route was installed successfully. CSR1#show mpls traffic-eng tunnels tunnel 18 | section RSVP_Path RSVP Path Info: My Address: 132.1.6.1 Explicit Route: 132.1.6.6 132.6.10.10 132.10.11.11 11.11.11.11 Record Route: NONE Tspec: ave rate=0 kbits, burst=1000 bytes, peak rate=0 kbits CSR1#show ip route 11.11.11.11 Routing entry for 11.11.11.11/32 Known via "static", distance 1, metric 0 (connected) Routing Descriptor Blocks: * directly connected, via Tunnel18 Route metric is 0, traffic share count is 1

Next, we can find the new outgoing label by checking the RSVP RESV details for this specific tunnel. Using traceroute inside the VPN on CSR7, we can see that the traffic is now being transported inside the TE tunnel. Even outside the VPN from CSR1's global table, the traffic is inside the tunnel, as mentioned earlier. CSR1#show ip rsvp reservation detail filter session-type 7 tunnel-id 18 | include Label Label: 6011 (outgoing) CSR7#traceroute 13.13.13.13 source 7.7.7.7 Type escape sequence to abort. Tracing the route to 13.13.13.13 VRF info: (vrf in name/id, vrf out name/id) 1 10.1.7.1 1 msec 1 msec 0 msec 2 132.1.6.6 [MPLS: Labels 6011/91011 Exp 0] 29 msec 25 msec 25 msec 3 132.6.10.10 [MPLS: Labels 10011/91011 Exp 0] 24 msec 24 msec 24 msec


4 132.10.11.11 [MPLS: Label 91011 Exp 0] 16 msec 105 msec 41 msec 5 10.11.13.13 28 msec * 74 msec CSR1#traceroute 11.11.11.11 source 1.1.1.1 Type escape sequence to abort. Tracing the route to 11.11.11.11 VRF info: (vrf in name/id, vrf out name/id) 1 132.1.6.6 [MPLS: Label 6011 Exp 0] 5 msec 8 msec 8 msec 2 132.6.10.10 [MPLS: Label 10011 Exp 0] 26 msec 25 msec 25 msec 3 132.10.11.11 8 msec * 46 msec
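The traceroute output above shows the TE transport label changing at each hop while the VPN label stays untouched. A minimal sketch of that behavior follows, using the label values from the traceroute; the `apply_lfib` helper and the per-router dictionaries are hypothetical and only model a swap or pop of the top label.

```python
def apply_lfib(stack, lfib):
    """Apply one LSR's LFIB entry to the top label and return the new stack."""
    top, rest = stack[0], stack[1:]
    action, out_label = lfib[top]
    if action == "swap":
        return [out_label] + rest
    if action == "pop":            # penultimate-hop popping
        return rest
    raise ValueError(f"unknown action {action!r}")


# Label values taken from the traceroute output (TE label / VPN label).
TE_LABEL, VPN_LABEL = 6011, 91011
CSR6_LFIB = {6011: ("swap", 10011)}
CSR10_LFIB = {10011: ("pop", None)}

stack_at_csr6 = [TE_LABEL, VPN_LABEL]                  # imposed by CSR1
stack_at_csr10 = apply_lfib(stack_at_csr6, CSR6_LFIB)
stack_at_xrv11 = apply_lfib(stack_at_csr10, CSR10_LFIB)
print(stack_at_csr10, stack_at_xrv11)
```

Only the top label is ever consulted by the transit routers, which is why swapping LDP for RSVP-TE as the transport requires no change to the VPN label.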

The return traffic from XRv11 is taking an LDP path since there isn't a corresponding TE tunnel configured on XRv11. We will quickly verify this statement below. The route is learned via IGP and does not point to a TE tunnel. The router has LDP labels for both paths and selects CSR10’s label of 10007 per the FIB. We verify it via several show commands and traceroute. RP/0/0/CPU0:XRv11#show route ipv4 unicast 1.1.1.1 Routing entry for 1.1.1.1/32 Known via "isis 132", distance 115, metric 30, type level-2 Routing Descriptor Blocks 132.10.11.10, from 1.1.1.1, via GigabitEthernet0/0/0/0.501 Route metric is 30 132.11.12.12, from 1.1.1.1, via GigabitEthernet0/0/0/0.512 Route metric is 30 No advertising protos. RP/0/0/CPU0:XRv11#show mpls ldp bindings 1.1.1.1/32 neighbor 10.10.10.10 1.1.1.1/32, rev 33 Local binding: label: 91004 Remote bindings: (3 peers) Peer Label ------------------------10.10.10.10:0 10007 RP/0/0/CPU0:XRv11#show mpls ldp bindings 1.1.1.1/32 neighbor 12.12.12.12 1.1.1.1/32, rev 33 Local binding: label: 91004 Remote bindings: (3 peers) Peer Label ------------------------12.12.12.12:0 92006 RP/0/0/CPU0:XRv11#show cef exact-route 11.11.11.11 1.1.1.1 1.1.1.1/32, version 138, internal 0x1000001 0x0 (ptr 0xa140f6f4) [1], 0x0 (0xa13f48e4), 0xa28 (0xa193807c) local adjacency 132.10.11.10 Prefix Len 32, traffic index 0, precedence n/a, priority 3 via GigabitEthernet0/0/0/0.501


   via 132.10.11.10, GigabitEthernet0/0/0/0.501, 9 dependencies, weight 0, class 0 [flags 0x0]
    path-idx 0 NHID 0x0 [0xa0e8f34c 0x0]
    next hop 132.10.11.10
    local adjacency
     local label 91004      labels imposed {10007}
RP/0/0/CPU0:XRv11#traceroute 1.1.1.1 source 11.11.11.11
Type escape sequence to abort.
Tracing the route to 1.1.1.1
 1  132.10.11.10 [MPLS: Label 10007 Exp 0] 19 msec 9 msec 0 msec
 2  132.6.10.6 [MPLS: Label 6007 Exp 0] 9 msec 0 msec 0 msec
 3  132.1.6.1 0 msec * 0 msec

We will quickly demonstrate TE with static routing on XRv11 as well. The constraints are similar except neither CSR9 nor CSR10 can be in the transit path. ! XRv11 explicit-path name PATH_NO_9_10 index 10 exclude-address ipv4 unicast 9.9.9.9 index 20 exclude-address ipv4 unicast 10.10.10.10 interface tunnel-te18 description STEER WITH STATIC ipv4 unnumbered Loopback0 logging events all destination 1.1.1.1 affinity ignore path-option 10 explicit name PATH_NO_9_10 router static address-family ipv4 unicast 1.1.1.1/32 tunnel-te18

We ensure the tunnel is operational (path is CSR5 > CSR2 > CSR3 > CSR1) and verify the static route was installed.
RP/0/0/CPU0:XRv11#show mpls traffic-eng tunnels 18 role head | begin Path info
  Path info (IS-IS 132 level-2):
  Node hop count: 4
  Hop0: 132.5.11.5
  Hop1: 132.2.5.2
  Hop2: 132.2.3.2
  Hop3: 132.2.3.3
  Hop4: 132.1.3.1
  Hop5: 1.1.1.1


RP/0/0/CPU0:XRv11#show route ipv4 unicast 1.1.1.1 Routing entry for 1.1.1.1/32 Known via "static", distance 1, metric 0 (connected) Routing Descriptor Blocks directly connected, via tunnel-te18 Route metric is 0 No advertising protos.

Last, we verify the outgoing label via the RSVP RESV details and test the path through traceroute. As we will see in the MVPN and per-VRF TE sections, this approach can be problematic for multicast transport due to RPF inconsistencies. Because TE tunnels cannot run PIM (it would not make sense on a unidirectional tunnel), RPF breaks for any potential MDT endpoints whose routes now point out of the tunnel.
RP/0/0/CPU0:XRv11#show rsvp reservation session-type lsp-p2p destination 1.1.1.1 detail | include Label
  Labels: Outgoing downstream: 5012.
RP/0/0/CPU0:XRv13#traceroute 7.7.7.7 source 13.13.13.13
Type escape sequence to abort.
Tracing the route to 7.7.7.7
 1  10.11.13.11 0 msec 0 msec 0 msec
 2  132.5.11.5 [MPLS: Labels 5012/1009 Exp 0] 9 msec 9 msec 9 msec
 3  132.2.5.2 [MPLS: Labels 2010/1009 Exp 0] 9 msec 119 msec 39 msec
 4  132.2.3.3 [MPLS: Labels 3010/1009 Exp 0] 29 msec 39 msec 29 msec
 5  10.1.7.1 [MPLS: Label 1009 Exp 0] 39 msec 29 msec 19 msec
 6  10.1.7.7 29 msec * 9 msec

Another option is using "auto-route". There are two main flavors: announce and destination. Destination is only supported on IOS and XE platforms and is almost identical to the static routing approach, except that it is more dynamic. It automatically programs a static route for the tunnel destination out of the tunnel upon which it was configured [ID 19]. This can increase scalability because static routes don't need to be manually added. Changing the tunnel destination automatically updates the static route as well. Using a tunnel identical to the previous one, we apply auto-route destination to achieve the same effect. Regarding MVPN RPF, this introduces the same issues as static routing (discussed in detail in other sections).
! CSR1
interface Tunnel19
 description STEER WITH AUTOROUTE DESTINATION
 ip unnumbered Loopback0
 tunnel mode mpls traffic-eng
 tunnel destination 11.11.11.11
 tunnel mpls traffic-eng autoroute destination
 tunnel mpls traffic-eng affinity 0x0 mask 0x0


tunnel mpls traffic-eng path-option 10 explicit name EP_NO_R9

We can verify the path by checking the RSVP PATH message's outgoing ERO. We also have an automatic static route for the tunnel's destination, which works well for PE-PE tunnels. CSR1#show ip rsvp sender detail filter session-type 7 tunnel-id 19 | section outgoing ERO: (outgoing) 132.1.6.6 (Strict IPv4 Prefix, 8 bytes, /32) 132.6.10.10 (Strict IPv4 Prefix, 8 bytes, /32) 132.10.11.11 (Strict IPv4 Prefix, 8 bytes, /32) 11.11.11.11 (Strict IPv4 Prefix, 8 bytes, /32) CSR1#show ip route 11.11.11.11 Routing entry for 11.11.11.11/32 Known via "static", distance 1, metric 0 (connected) Routing Descriptor Blocks: * directly connected, via Tunnel19 Route metric is 0, traffic share count is 1

We can check the outgoing label by looking at the LFIB details for the destination prefix, and then verify it with traceroute. Within the VPN, we can verify the internal FIB details to see the TE label at the top of the stack as well. CSR1#show mpls forwarding-table 11.11.11.11 detail Local Outgoing Prefix Bytes Label Outgoing Next Hop Label Label or Tunnel Id Switched interface 1011 Pop Label 11.11.11.11/32 0 Tu19 point2point MAC/Encaps=18/22, MRU=1500, Label Stack{6010}, via Gi2.516 000C2993FE00000C29FBA33981000DBC8847 0177A000 No output feature configured CSR1#traceroute 11.11.11.11 source 1.1.1.1 Type escape sequence to abort. Tracing the route to 11.11.11.11 VRF info: (vrf in name/id, vrf out name/id) 1 132.1.6.6 [MPLS: Label 6010 Exp 0] 15 msec 8 msec 8 msec 2 132.6.10.10 [MPLS: Label 10010 Exp 0] 26 msec 25 msec 17 msec 3 132.10.11.11 20 msec * 11 msec CSR1#show ip cef vrf TE 13.13.13.13 internal | begin output_chain output chain: label 91011 label implicit-null TAG midchain out of Tunnel19 7F021339A698 label 6010 TAG adj out of GigabitEthernet2.516, addr 132.1.6.6 7F020E3BC130


As mentioned earlier, auto-route destination is not supported in XRv 5.3.0, and this is probably XRv specific. ! XRv11 interface tunnel-te88 autoroute destination !!% The requested operation is not supported: This feature is not supported in this release.

The other option for auto-route creates an IGP route through the TE tunnel. This seems highly suspect since link-state protocols do not advertise routes per se; they advertise topology information. Routers collect this information and, assuming everyone has the same view, can guarantee loop-free paths through the topological graph. This feature does not make any changes to the underlying OSPF LSDB or IS-IS LSPDB, but simply announces the destination to the head-end router as being reachable through the tunnel. Even though it looks like an IGP route, the TE label is used since the outgoing interface is a TE tunnel. The cost of the route learned via the TE tunnel is the same as the shortest IGP path between the endpoints. We test this using the same type of tunnel as before [ID 20].
! CSR1
interface Tunnel20
 description STEER WITH AUTOROUTE ANNOUNCE
 ip unnumbered Loopback0
 tunnel mode mpls traffic-eng
 tunnel destination 11.11.11.11
 tunnel mpls traffic-eng autoroute announce
 tunnel mpls traffic-eng affinity 0x0 mask 0x0
 tunnel mpls traffic-eng path-option 10 explicit name EP_NO_R9

We can verify the tunnel is up very quickly by checking the outgoing label. Seeing that it is 6011, we assume the path transits CSR6. When we check the RIB, we see the route is IS-IS but recurses out of the TE tunnel. In the context of label lookups, this isn’t an IGP + LDP binding, even though the route is IGP-installed. Only the RSVP label is used in this particular example.
CSR1#show mpls traffic-eng tunnels tunnel 20 | include Label
  InLabel :
  OutLabel : GigabitEthernet2.516, 6011
CSR1#show ip route 11.11.11.11
Routing entry for 11.11.11.11/32
  Known via "isis", distance 115, metric 30, type level-2
  Redistributing via isis 132
  Last update from 11.11.11.11 on Tunnel20, 00:00:01 ago
  Routing Descriptor Blocks:
  * 11.11.11.11, from 11.11.11.11, 00:00:01 ago, via Tunnel20
      Route metric is 30, traffic share count is 1


A quick look at CSR1's IS-IS LSP shows no direct link to XRv11, indicating the IS-IS LSPDB did not change. However, when we check the IS-IS topology, the protocol is aware that reachability to XRv11 is achieved through a new best path, which is the TE tunnel. IS-IS does track all the TE tunnels, since it carries all of the TE attributes. The two processes (TE and IS-IS/OSPF) communicate closely within software.
CSR1#show isis database level-2 CSR1.00-00 detail
Tag 132:
IS-IS Level-2 LSP CSR1.00-00
LSPID                 LSP Seq Num  LSP Checksum  LSP Holdtime  ATT/P/OL
CSR1.00-00          * 0x0000010C   0x18B6        996           0/0/0
  Auth:         Length: 17
  Area Address: 00
  NLPID:        0xCC 0x8E
  Topology:     IPv4 (0x0)
                IPv6 (0x2)
  Router ID:    1.1.1.1
  Hostname: CSR1
  Metric: 10         IS (MT-IPv6) CSR4.00
  Metric: 10         IS (MT-IPv6) CSR6.00
  Metric: 10         IS (MT-IPv6) CSR8.00
  Metric: 10         IS (MT-IPv6) CSR3.00
  Metric: 10         IS (MT-IPv6) CSR9.00
  Metric: 10         IS-Extended CSR4.00
  Metric: 10         IS-Extended CSR8.00
  Metric: 10         IS-Extended CSR6.00
  Metric: 10         IS-Extended CSR3.00
  Metric: 10         IS-Extended CSR9.00
  IP Address:   1.1.1.1
  Metric: 0          IP 1.1.1.1/32
  IPv6 Address: ::1:1:1:1
  Metric: 0          IPv6 (MT-IPv6) ::1:1:1:1/128
CSR1#show isis topology level-2 XRv11
Tag 132:
IS-IS 0 level-2 path to XRv11
System Id   Metric   Next-Hop   Interface   SNPA
XRv11       30       XRv11      Tu20        *MPLS TE-Tunnel
CSR1#show isis mpls traffic-eng tunnel
System Id   Tunnel Name   BW Metric   Nexthop       Metric   Mode
XRv11.00    Tunnel20      0           11.11.11.11

A quick traceroute test from CSR1 shows that the feature is working based on the outgoing label. The FIB internal details also show this label bound to the destination prefix 11.11.11.11/32. CSR1#traceroute 11.11.11.11 source 1.1.1.1 Type escape sequence to abort. Tracing the route to 11.11.11.11


VRF info: (vrf in name/id, vrf out name/id)
 1  132.1.6.6 [MPLS: Label 6011 Exp 0] 12 msec 11 msec 8 msec
 2  132.6.10.10 [MPLS: Label 10011 Exp 0] 27 msec 17 msec 25 msec
 3  132.10.11.11 15 msec * 20 msec

CSR1#show ip cef 11.11.11.11 internal | begin output_chain output chain: IP midchain out of Tunnel20 7F021339A698 label 6011 TAG adj out of GigabitEthernet2.516, addr 132.1.6.6 7F020E3BC130

A minor adjustment to this feature allows us to change the IGP metric for this auto-route. We can specify an absolute value instead of the IS-IS calculated one, which is based on the non-TE best path. It was 30 before, but we can change the auto-route to be 31, as an example. Now, the best-path is no longer through the TE tunnel but is via the ordinary paths. This can be used for advanced failover scenarios. For example, if the regular path IGP metric is 30 or less, that will be preferred. If the IGP cost is too high (greater than 30), we decide we want to use the TE tunnel and apply bandwidth reservations to compensate for the less optimal routing paths. ! CSR1 interface Tunnel20 tunnel mpls traffic-eng autoroute metric absolute 31 CSR1#show ip route 11.11.11.11 Routing entry for 11.11.11.11/32 Known via "isis", distance 115, metric 30, type level-2 Redistributing via isis 132 Last update from 132.1.9.9 on GigabitEthernet2.519, 00:00:04 ago Routing Descriptor Blocks: 132.1.9.9, from 11.11.11.11, 00:00:04 ago, via GigabitEthernet2.519 Route metric is 30, traffic share count is 1 132.1.6.6, from 11.11.11.11, 00:00:04 ago, via GigabitEthernet2.516 Route metric is 30, traffic share count is 1 * 132.1.3.3, from 11.11.11.11, 00:00:04 ago, via GigabitEthernet2.513 Route metric is 30, traffic share count is 1 CSR1#show mpls traffic-eng autoroute 11.11.11.11 MPLS TE autorouting enabled destination 0000.0000.0011.00, area isis level-2, has 1 tunnels Tunnel20 (load balancing metric 0, nexthop 11.11.11.11, absolute metric 31) (flags: Announce)

We can also use a relative modifier to scale the metric up or down. In this case, the IGP-computed metric can be scaled up or down within a range of +/- 10. This is mostly useful for when there are multiple TE tunnels going to the same destination, both using autoroute, and one should be preferred over the other. Another tunnel [ID 21] has been added to prove this point, and the original tunnel has

the absolute metric removed. This new tunnel has a lower IGP metric by 1, and is thus preferred. This is another dimension to flexibility in TE in addition to the path-options within a given tunnel. This provides prioritization between tunnels as well. ! CSR1 interface Tunnel20 no tunnel mpls traffic-eng autoroute metric absolute 31 interface Tunnel21 description STEER WITH AUTOROUTE ANNOUNCE -1 METRIC ip unnumbered Loopback0 tunnel mode mpls traffic-eng tunnel destination 11.11.11.11 tunnel mpls traffic-eng autoroute announce tunnel mpls traffic-eng autoroute metric relative -1 tunnel mpls traffic-eng affinity 0x0 mask 0x0 tunnel mpls traffic-eng path-option 10 explicit name EP_NO_R9 CSR1#show ip route 11.11.11.11 Routing entry for 11.11.11.11/32 Known via "isis", distance 115, metric 29, type level-2 Redistributing via isis 132 Last update from 11.11.11.11 on Tunnel21, 00:00:47 ago Routing Descriptor Blocks: * 11.11.11.11, from 11.11.11.11, 00:00:47 ago, via Tunnel21 Route metric is 29, traffic share count is 1

We can also see the paths side-by-side as tracked by the auto-route feature. This can help with troubleshooting when trying to determine which tunnel should be used. CSR1#show mpls traffic-eng autoroute 11.11.11.11 MPLS TE autorouting enabled destination 0000.0000.0011.00, area isis level-2, has 2 tunnels Tunnel20 (load balancing metric 0, nexthop 11.11.11.11) (flags: Announce) Tunnel21 (load balancing metric 0, nexthop 11.11.11.11, relative metric -1) (flags: Announce)

We will quickly test autoroute announce with metric manipulations on XR as well [ID 20, 21]. First, we create a single tunnel using the same constraints as earlier with an exact metric of 28. We check the RSVP PATH message's outgoing ERO to see the path; CSR5 is the next-hop. The LFIB reveals the TE label as 5012 and we can verify that the remote prefix 1.1.1.1/32 is using this label. ! XRv11 interface tunnel-te20 description STEER WITH AUTOROUTE ANNOUNCE


 ipv4 unnumbered Loopback0
 logging events all
 autoroute announce metric absolute 28
 destination 1.1.1.1
 affinity ignore
 path-option 10 explicit name PATH_NO_9_10
RP/0/0/CPU0:XRv11#show rsvp sender detail session-type lsp-p2p destination 1.1.1.1 | begin Outgoing
  Explicit Route (Outgoing):
    Strict, 132.5.11.5/32
    Strict, 132.2.5.2/32
    Strict, 132.2.3.2/32
    Strict, 132.2.3.3/32
    Strict, 132.1.3.1/32
    Strict, 1.1.1.1/32
RP/0/0/CPU0:XRv11#show route ipv4 unicast 1.1.1.1
Routing entry for 1.1.1.1/32
  Known via "isis 132", distance 115, metric 28, type level-2
  Routing Descriptor Blocks
    1.1.1.1, from 1.1.1.1, via tunnel-te20
      Route metric is 28
  No advertising protos.
RP/0/0/CPU0:XRv11#show mpls forwarding tunnels 20
Tunnel    Outgoing     Outgoing       Next Hop         Bytes
Name      Label        Interface                       Switched
--------  -----------  -------------  ---------------  ------------
tt20      5012         Gi0/0/0/0.551  132.5.11.5       326
RP/0/0/CPU0:XRv11#show mpls forwarding prefix 1.1.1.1/32 detail
Local   Outgoing     Prefix              Outgoing      Next Hop         Bytes
Label   Label        or ID               Interface                      Switched
------  -----------  ------------------  ------------  ---------------  ----------
91004   Pop          1.1.1.1/32          tt20          1.1.1.1          505
     Version: 145, Priority: 3
     MAC/Encaps: 18/22, MTU: 1500
     Label Stack (Top -> Bottom): { 5012 Imp-Null }
     NHID: 0
     Packets Switched: 7

Next, we create the second tunnel with a relative metric of -4 [ID 21]. We check the tunnel path and see that it also uses CSR5. This brings the metric to 26, which is better than the previous tunnel, so the RIB outgoing interface and TE label binding should change as well. ! XRv11


interface tunnel-te21 description STEER WITH AUTOROUTE ANNOUNCE -4 METRIC ipv4 unnumbered Loopback0 logging events all autoroute announce metric relative -4 destination 1.1.1.1 affinity ignore path-option 10 explicit name PATH_NO_9_10 RP/0/0/CPU0:XRv11#show mpls traffic-eng tunnels 21 | begin Path info Path info (IS-IS 132 level-2): Node hop count: 4 Hop0: 132.5.11.5 Hop1: 132.2.5.2 Hop2: 132.2.3.2 Hop3: 132.2.3.3 Hop4: 132.1.3.1 Hop5: 1.1.1.1 RP/0/0/CPU0:XRv11#show route ipv4 unicast 1.1.1.1 Routing entry for 1.1.1.1/32 Known via "isis 132", distance 115, metric 26, type level-2 Routing Descriptor Blocks 1.1.1.1, from 1.1.1.1, via tunnel-te21 Route metric is 26 No advertising protos. RP/0/0/CPU0:XRv11#show mpls traffic-eng tunnels 21 detail | include Label Outgoing Interface: GigabitEthernet0/0/0/0.551, Outgoing Label: 5013 RP/0/0/CPU0:XRv11#show mpls forwarding prefix 1.1.1.1/32 detail | include Label Label Label or ID Interface Switched Label Stack (Top -> Bottom): { 5013 Imp-Null }

The auto-route information shows both paths along with their metrics. The IS-IS topology can then show which tunnel is used when a combination of absolute and relative metrics are used; it isn't immediately obvious which one is better using this output. RP/0/0/CPU0:XRv11#show mpls traffic-eng autoroute 1.1.1.1 Destination 1.1.1.1 has 2 tunnels in IS-IS 132 level 2 tunnel-te20 (traffic share 0, nexthop 1.1.1.1, absolute metric 28) (IS-IS 132 level-2, IPV4 Unicast) Signalled-Name: XRv11_t20 tunnel-te21 (traffic share 0, nexthop 1.1.1.1, relative metric -4) (IS-IS 132 level-2, IPV4 Unicast) Signalled-Name: XRv11_t21


RP/0/0/CPU0:XRv11#show isis topology level 2 systemid 0000.0000.0001 IS-IS 132 paths to IPv4 Unicast (Level-2) routers System Id Metric Next-Hop Interface SNPA CSR1 26 CSR1 tt21 *PtoP*
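Because the auto-route output lists absolute and relative modifiers rather than final metrics, it can help to work the arithmetic out explicitly. The sketch below reproduces the numbers from the IOS XE and XR examples above, assuming an IGP metric of 30 between the endpoints. The function names are hypothetical, ties between tunnels (load sharing) are not modeled, and a tie between a tunnel and the plain IGP path is resolved in favor of the tunnel, as the Tunnel20 example showed.

```python
def autoroute_metric(igp_metric, absolute=None, relative=0):
    """Metric the head-end assigns to a destination reached via the tunnel."""
    if absolute is not None:
        return absolute
    return igp_metric + relative


def preferred(igp_metric, tunnels):
    """Return (name, metric) of the lowest-metric candidate.

    `tunnels` maps a tunnel name to its autoroute metric options.
    Tunnels are listed before the plain IGP path so that a tunnel wins
    a tie against the IGP path (first minimum wins).
    """
    candidates = {
        name: autoroute_metric(igp_metric, **opts)
        for name, opts in tunnels.items()
    }
    candidates["IGP"] = igp_metric
    best = min(candidates, key=candidates.get)
    return best, candidates[best]


IGP = 30
print(preferred(IGP, {"Tunnel20": {"absolute": 31}}))
print(preferred(IGP, {"Tunnel20": {}, "Tunnel21": {"relative": -1}}))
print(preferred(IGP, {"tunnel-te20": {"absolute": 28},
                      "tunnel-te21": {"relative": -4}}))
```

The three calls correspond to the absolute metric 31 case (ordinary paths win at 30), the IOS XE relative -1 case (Tunnel21 at 29), and the XR case (tunnel-te21 at 26 beats tunnel-te20 at 28).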

Another method is policy routing. This is generally the least desirable option, but can give additional flexibility as to what traffic goes into the TE tunnel. PBR does not work for VPN traffic and can only be used within a single routing instance, so for L3VPN, this is not an option. For global MPLS transport, such as 6PE, this can work. We use CSR4 as a test client briefly by forcing traffic to 11.11.11.11 to CSR1, and also disabling LDP on the interface to CSR1. This allows the traffic to XRv11 to be raw IP so CSR1 can steer it into the TE tunnel with PBR. CSR1 has a normal tunnel configuration with nothing fancy since the PBR is applied separately [ID 22]. Access-list based forwarding (ABF) is the XR equivalent and is not supported for this use. ! CSR4 interface GigabitEthernet2.514 no mpls ldp igp autoconfig ip route 11.11.11.11 255.255.255.255 132.1.4.1 ! CSR1 interface Tunnel22 description STEER WITH PBR ip unnumbered Loopback0 tunnel mode mpls traffic-eng tunnel destination 11.11.11.11 tunnel mpls traffic-eng affinity 0x0 mask 0x0 tunnel mpls traffic-eng path-option 10 explicit name EP_NO_R9 ip access-list extended ACL_TO_XRV11 permit ip any host 11.11.11.11 route-map PM_PBR_XRV11_TE permit 10 match ip address ACL_TO_XRV11 set interface Tunnel22 interface GigabitEthernet2.514 ip policy route-map PM_PBR_XRV11_TE

We quickly verify the tunnel path and label. The next-hop is CSR6 using label 6010.
CSR1#show mpls traffic-eng tunnels tunnel 22 | section Label|RSVP_Path
  InLabel :
  OutLabel : GigabitEthernet2.516, 6010
  RSVP Path Info:
    My Address: 132.1.6.1


Explicit Route: 132.1.6.6 132.6.10.10 132.10.11.11 11.11.11.11 Record Route: NONE Tspec: ave rate=0 kbits, burst=1000 bytes, peak rate=0 kbits

When CSR4 sends traffic towards XRv11, no label is applied since LDP is disabled on that interface and the route is static (which counts as an IGP route from the perspective of using LDP labels). Using traceroute, we can confirm this traffic is entering the tunnel. The PBR counters on CSR1 should increment every time a packet enters the tunnel via PBR. CSR4#show mpls interfaces gigabitEthernet 2.514 Interface IP Tunnel BGP Static Operational GigabitEthernet2.514 No Yes No No Yes CSR4#show ip cef 11.11.11.11 11.11.11.11/32 nexthop 132.1.4.1 GigabitEthernet2.514 CSR4#traceroute 11.11.11.11 source loopback0 Type escape sequence to abort. Tracing the route to 11.11.11.11 VRF info: (vrf in name/id, vrf out name/id) 1 132.1.4.1 1 msec 0 msec 0 msec 2 132.1.6.6 [MPLS: Label 6010 Exp 0] 11 msec 8 msec 8 msec 3 132.6.10.10 [MPLS: Label 10010 Exp 0] 18 msec 16 msec 17 msec 4 132.10.11.11 15 msec * 10 msec route-map PM_PBR_XRV11_TE, permit, sequence 10 Match clauses: ip address (access-lists): ACL_TO_XRV11 Set clauses: interface Tunnel22 Policy routing matches: 42 packets, 2947 bytes
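The PBR decision on CSR1 can be modeled in a few lines. This is a rough sketch of the logic only: the ACL and route-map contents mirror the configuration above, while the `PolicyRoute` class and the source address 4.4.4.4 (assumed here to be CSR4's loopback) are illustrative assumptions.

```python
import ipaddress

# permit ip any host 11.11.11.11
ACL_TO_XRV11 = [
    ("permit",
     ipaddress.ip_network("0.0.0.0/0"),
     ipaddress.ip_network("11.11.11.11/32")),
]


def acl_permits(acl, src, dst):
    """First matching entry decides; no match means implicit deny."""
    src_ip, dst_ip = ipaddress.ip_address(src), ipaddress.ip_address(dst)
    for action, src_net, dst_net in acl:
        if src_ip in src_net and dst_ip in dst_net:
            return action == "permit"
    return False


class PolicyRoute:
    """One route-map sequence: match an ACL, set the egress interface."""

    def __init__(self, acl, set_interface):
        self.acl = acl
        self.set_interface = set_interface
        self.matches = 0

    def lookup(self, src, dst):
        """Return the forced egress interface, or None to use the RIB."""
        if acl_permits(self.acl, src, dst):
            self.matches += 1
            return self.set_interface
        return None


pbr = PolicyRoute(ACL_TO_XRV11, "Tunnel22")
to_xrv11 = pbr.lookup("4.4.4.4", "11.11.11.11")   # steered into the tunnel
to_other = pbr.lookup("4.4.4.4", "10.10.10.10")   # falls back to the RIB
print(to_xrv11, to_other, pbr.matches)
```

Only packets that arrive unlabeled on the policy-routed interface and match the ACL are redirected, which is why LDP had to be disabled toward CSR4 for the test.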

Last, we clean up CSR4 so it is functioning as an LSR again (cleanup configuration not shown). CSR4#show ip cef 11.11.11.11 11.11.11.11/32 nexthop 132.3.4.3 GigabitEthernet2.534 label 3009 nexthop 132.4.9.9 GigabitEthernet2.549 label 9009 CSR4#show mpls interfaces gigabitEthernet 2.514 Interface IP Tunnel BGP Static Operational GigabitEthernet2.514 Yes (ldp) Yes No No Yes

Another option for steering traffic into the tunnel is using the CEF forwarding adjacency. Earlier, we saw that the auto-route options did not make any underlying changes to the IGP link-state topology, but rather changed the local routing table. In contrast, this alternative technique does actually create a

point-to-point link between the tunnel head and tail as part of the IGP topology. In order for this to work, the TE tunnel must be bidirectionally configured with forwarding-adjacency. We can also directly adjust the IGP metric on these interfaces to ensure there are no unintended routing issues; this feature is seen by all nodes in the OSPF area or IS-IS level, so set the cost appropriately. The hold-time value causes the tunnel to wait 10 seconds before informing the network of a down event. This ensures that brief tunnel flaps do not reflood OSPF/IS-IS topology information. ! CSR1 interface Tunnel24 description STEER WITH FORWARDING ADJ ip unnumbered Loopback0 tunnel mode mpls traffic-eng tunnel destination 11.11.11.11 tunnel mpls traffic-eng forwarding-adjacency holdtime 10000 tunnel mpls traffic-eng affinity 0x0 mask 0x0 tunnel mpls traffic-eng path-option 10 explicit name EP_NO_R9 isis metric 25 ! XRv11 interface tunnel-te24 description STEER WITH FORWARDING ADJ ipv4 unnumbered Loopback0 logging events all destination 1.1.1.1 forwarding-adjacency holdtime 10000 affinity ignore path-option 10 explicit name PATH_NO_9_10 router isis 132 interface tunnel-te24 address-family ipv4 unicast metric 25 CSR1#show mpls traffic-eng tunnels tunnel 24 | section Status|Config Status: Admin: up Oper: up Path: valid Signalling: connected path option 10, type explicit EP_NO_R9 (Basis for Setup, path weight 25) Config Parameters: Bandwidth: 0 kbps (Global) Priority: 7 7 Affinity: 0x0/0x0 Metric Type: TE (default) AutoRoute: disabled LockDown: disabled Loadshare: 0 [0] bw-based Forwarding adjacency: holdtime 10000 ms auto-bw: disabled
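The effect of the hold-time can be illustrated with a simple timeline model. This sketch assumes the semantics described above (a down event is advertised only if the tunnel is still down once the hold-time expires); the function name and the flap timestamps are made up for illustration.

```python
HOLDTIME_MS = 10_000  # forwarding-adjacency holdtime 10000


def announced_down_events(flaps, holdtime_ms=HOLDTIME_MS):
    """Return the times (ms) at which a down event is advertised to the IGP.

    `flaps` is a list of (down_at_ms, up_at_ms) pairs. A flap shorter than
    the hold-time is absorbed and never reaches the IGP.
    """
    return [
        down_at + holdtime_ms
        for down_at, up_at in flaps
        if up_at - down_at >= holdtime_ms
    ]


# Hypothetical timeline: a 2-second flap, then a 25-second outage.
flaps = [(0, 2_000), (50_000, 75_000)]
announced = announced_down_events(flaps)
print(announced)
```

Only the long outage produces a topology change, 10 seconds after the tunnel actually went down.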

With the tunnel up, we can begin the detailed verification process. CSR1 accounts for this tunnel as a forwarding-adjacency and has a special show command for it.

CSR1#show mpls traffic-eng forwarding-adjacency 11.11.11.11 destination 0000.0000.0011.00, area isis level-2, has 1 tunnels Tunnel24 (load balancing metric 0, nexthop 11.11.11.11) (flags: Forward-Adjacency holdtime 10000)

Checking the IS-IS database, we can see a new link to XRv11 was installed. There isn't an IS-IS neighbor (no hello packets, etc), but the link exists in the topology. The metric of this P2P link is 25 as configured on the tunnel interface.
CSR1#show isis database level-2 CSR1.00-00 detail | include IS-Extend
  Metric: 10         IS-Extended CSR4.00
  Metric: 10         IS-Extended CSR6.00
  Metric: 10         IS-Extended CSR8.00
  Metric: 25         IS-Extended XRv11.00
  Metric: 10         IS-Extended CSR3.00
  Metric: 10         IS-Extended CSR9.00
CSR1#show isis neighbors
Tag 132:
System Id   Type  Interface   IP Address   State  Holdtime  Circuit Id
CSR3        L2    Gi2.513     132.1.3.3    UP     24        00
CSR4        L2    Gi2.514     132.1.4.4    UP     25        00
CSR6        L2    Gi2.516     132.1.6.6    UP     22        00
CSR8        L2    Gi2.518     132.1.8.8    UP     27        00
CSR9        L2    Gi2.519     132.1.9.9    UP     26        00

Some additional, but less useful, IS-IS show commands are below. We can see the TE tunnels and topology information in summary form as viewed by IS-IS.
CSR1#show isis mpls traffic-eng tunnel | begin Forward
Forwarding-adjacencies
System Id   Tunnel Name   BW Metric   Nexthop       Metric   Type
XRv11.00    Tunnel24      0           11.11.11.11   25       L2
CSR1#show isis topology level-2 00.0000.0000.0011.00
Tag 132:
IS-IS 0 level-2 path to XRv11
System Id   Metric   Next-Hop   Interface   SNPA
XRv11       25       XRv11      Tu24        *MPLS TE-Tunnel

Most importantly, we verify that the tunnel endpoint is reachable through the TE tunnel. If the metric had been left at the default of 10, many other endpoints would be reachable through the tunnel as well. For example, 10.10.10.10/32 would have a metric of 20 via this new TE tunnel in addition to the three existing LDP paths, causing some awkward traffic patterns.
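The reason the forwarding-adjacency metric matters can be checked with a shortest-path calculation. The sketch below uses a deliberately reduced topology (only the CSR1 > CSR6 > CSR10 > XRv11 chain at metric 10 per link, which is an assumption; the real lab has more equal-cost paths) and adds the forwarding adjacency as an ordinary bidirectional link. All function names are hypothetical.

```python
import heapq

# Reduced IGP topology: one of the equal-cost paths, metric 10 per link.
BASE = {
    "CSR1": {"CSR6": 10},
    "CSR6": {"CSR1": 10, "CSR10": 10},
    "CSR10": {"CSR6": 10, "XRv11": 10},
    "XRv11": {"CSR10": 10},
}


def dijkstra(graph, source):
    """Return the shortest-path cost from source to every reachable node."""
    dist = {source: 0}
    heap = [(0, source)]
    while heap:
        d, node = heapq.heappop(heap)
        if d > dist.get(node, float("inf")):
            continue
        for neigh, cost in graph.get(node, {}).items():
            nd = d + cost
            if nd < dist.get(neigh, float("inf")):
                dist[neigh] = nd
                heapq.heappush(heap, (nd, neigh))
    return dist


def with_fa(graph, head, tail, metric):
    """Copy the graph and add the forwarding adjacency in both directions."""
    g = {node: dict(adj) for node, adj in graph.items()}
    g.setdefault(head, {})[tail] = metric
    g.setdefault(tail, {})[head] = metric
    return g


def fa_on_shortest_path(metric, dst):
    """True if CSR1 can reach dst at best cost by crossing the FA link."""
    g = with_fa(BASE, "CSR1", "XRv11", metric)
    best = dijkstra(g, "CSR1")[dst]
    via_fa = metric + dijkstra(g, "XRv11")[dst]
    return via_fa == best


print(fa_on_shortest_path(10, "CSR10"))   # default metric: tunnel ties at 20
print(fa_on_shortest_path(25, "CSR10"))   # metric 25: tunnel not used
print(fa_on_shortest_path(25, "XRv11"))   # metric 25: tunnel used for the tail
```

With the metric at 10, CSR10 becomes reachable at cost 20 through the tunnel as well as through the normal paths; at 25 the tunnel attracts only traffic for XRv11 itself, since 25 beats the IGP cost of 30.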


CSR1#show ip route 11.11.11.11 Routing entry for 11.11.11.11/32 Known via "isis", distance 115, metric 25, type level-2 Redistributing via isis 132 Last update from 11.11.11.11 on Tunnel24, 00:14:54 ago Routing Descriptor Blocks: * 11.11.11.11, from 11.11.11.11, 00:14:54 ago, via Tunnel24 Route metric is 25, traffic share count is 1

The verification on XR is very similar, with most of the commands being identical. The TE process is aware of the forwarding-adjacency as an announcement mechanism and IS-IS sees it as a topology link. The other show commands are included for completeness also.

RP/0/0/CPU0:XRv11#show mpls traffic-eng forwarding-adjacency 1.1.1.1
 destination 1.1.1.1 has 1 tunnels
 tunnel-te24 (traffic share 0, next-hop 1.1.1.1)
 (Adjacency Announced: yes, holdtime 10000)
 (IS-IS 132 level-2, IPV4 Unicast)

RP/0/0/CPU0:XRv11#show isis database 0000.0000.0011.00 level 2 detail | include IS-Extend
 Metric: 10 IS-Extended CSR5.00
 Metric: 10 IS-Extended CSR10.00
 Metric: 10 IS-Extended XRv12.00
 Metric: 25 IS-Extended CSR1.00
 Metric: 10 MT (IPv6 Unicast) IS-Extended CSR5.00
 Metric: 10 MT (IPv6 Unicast) IS-Extended CSR10.00
 Metric: 10 MT (IPv6 Unicast) IS-Extended XRv12.00

RP/0/0/CPU0:XRv11#show isis mpls traffic-eng tunnel
IS-IS 132 Level-2 MPLS Traffic Engineering tunnels
System Id  Tunnel  Bandw  Nexthop  Metric  Mode   IPv4/IPv6 FA  IPV4/IPv6 AA  Chkpt ID
CSR1       tt24    0      1.1.1.1  25      Fixed  En/Dis        Dis/Dis       00000000

RP/0/0/CPU0:XRv11#show isis topology level 2 systemid 0000.0000.0001
IS-IS 132 paths to IPv4 Unicast (Level-2) routers
System Id  Metric  Next-Hop  Interface  SNPA
CSR1       25      CSR1      tt24       *PtoP*

RP/0/0/CPU0:XRv11#show route ipv4 1.1.1.1
Routing entry for 1.1.1.1/32
  Known via "isis 132", distance 115, metric 25, type level-2
  Routing Descriptor Blocks
    1.1.1.1, from 1.1.1.1, via tunnel-te24
      Route metric is 25
  No advertising protos.


To ensure the feature is working, we quickly confirm the labels along the TE LSP from CSR1 to XRv11. The path goes CSR6 > XRv12 > XRv11. Traceroute confirms that this path is used for the L3VPN traffic as expected.

CSR1#show mpls traffic-eng tunnels tunnel 24 | include Label
  InLabel :
  OutLabel : GigabitEthernet2.516, 6010

CSR6#show mpls forwarding-table labels 6010
Local  Outgoing  Prefix           Bytes Label  Outgoing   Next Hop
Label  Label     or Tunnel Id     Switched     interface
6010   92011     1.1.1.1 24 [13]  34134        Gi2.562    132.6.12.12

RP/0/0/CPU0:XRv12#show mpls forwarding labels 92011
Local  Outgoing    Prefix             Outgoing      Next Hop        Bytes
Label  Label       or ID              Interface                     Switched
------ ----------- ------------------ ------------- --------------- ----------
92011  Pop         24                 Gi0/0/0/0.512 132.11.12.11    30974

CSR7#traceroute 13.13.13.13 source 7.7.7.7
Type escape sequence to abort.
Tracing the route to 13.13.13.13
VRF info: (vrf in name/id, vrf out name/id)
  1 10.1.7.1 1 msec 0 msec 1 msec
  2 132.1.6.6 [MPLS: Labels 6010/91012 Exp 0] 94 msec 31 msec 28 msec
  3 132.6.12.12 [MPLS: Labels 92011/91012 Exp 0] 31 msec 27 msec 31 msec
  4 132.11.12.11 [MPLS: Label 91012 Exp 0] 27 msec 25 msec 25 msec
  5 10.11.13.13 31 msec * 36 msec

Because the feature must be configured bidirectionally, we should verify it works both ways. XRv11's tunnel to CSR1 goes CSR5 > CSR2 > CSR3 > CSR1. I am not certain why this suboptimal path was chosen, but it might have been due to network instability or the 100kbps CSR1000v bandwidth limit that caused certain IGP adjacencies to be down at the moment this was configured. In any event, note that the tunnels don't have to follow the same path bidirectionally, just that any valid LSP exists. We trace the labels hop-by-hop, then verify with traceroute within the VPN.

RP/0/0/CPU0:XRv11#show rsvp reservation session-type lsp-p2p destination 1.1.1.1 detail | include Label
 Labels: Outgoing downstream: 5013.

CSR5#show mpls forwarding-table labels 5013
Local  Outgoing   Prefix               Bytes Label  Outgoing   Next Hop
Label  Label      or Tunnel Id         Switched     interface
5013   2019       11.11.11.11 24 [7]   \
                                       56471        Gi2.525    132.2.5.2

CSR2#show mpls forwarding-table labels 2019
Local  Outgoing   Prefix               Bytes Label  Outgoing   Next Hop
Label  Label      or Tunnel Id         Switched     interface
2019   3008       11.11.11.11 24 [7]   \
                                       56663        Gi2.523    132.2.3.3

CSR3#show mpls forwarding-table labels 3008
Local  Outgoing   Prefix               Bytes Label  Outgoing   Next Hop
Label  Label      or Tunnel Id         Switched     interface
3008   Pop Label  11.11.11.11 24 [7]   \
                                       54258        Gi2.513    132.1.3.1

RP/0/0/CPU0:XRv13#traceroute 7.7.7.7 source 13.13.13.13
Type escape sequence to abort.
Tracing the route to 7.7.7.7
 1 10.11.13.11 0 msec 19 msec 0 msec
 2 132.5.11.5 [MPLS: Labels 5013/1009 Exp 0] 39 msec 29 msec 19 msec
 3 132.2.5.2 [MPLS: Labels 2019/1009 Exp 0] 19 msec 19 msec 19 msec
 4 132.2.3.3 [MPLS: Labels 3008/1009 Exp 0] 39 msec 39 msec 39 msec
 5 10.1.7.1 [MPLS: Label 1009 Exp 0] 19 msec 29 msec 19 msec
 6 10.1.7.7 29 msec * 29 msec

Not all TE tunnels need to "home run" all the way from PE to PE. Tunnels can exist from PE-PE, PE-P, P-P, and P-PE if necessary. We can create more hierarchical TE architectures in this way, much like using HVPLS with U-PE and N-PE pseudowires. In PE-PE and P-PE tunnels, all transport signaling is done by RSVP as the tunnel provides a new LSP; LDP is not necessary at all in this case. When doing PE-P or P-P options, we need some other mechanism to bind labels for the remote PE IP prefix. The tunnels don’t terminate at the final destination, so the transport labels need to be allocated in other ways. In other words, we still need to resolve that FEC since the TE tunnel does not go end-to-end.

In the example below, a pair of TE tunnels is created. One goes from CSR1 to XRv12, and the other goes from XRv12 to XRv11. The ultimate goal is to get traffic from CSR1 to XRv11 using these two tunnels, “stitching” them together. The first tunnel takes an explicit path through CSR8 and CSR6, while the second takes any red path that is also not blue [ID 23].

! CSR1
ip explicit-path name PATH_1_8_6_12 enable
 next-address 8.8.8.8
 next-address 6.6.6.6
 next-address 12.12.12.12
interface Tunnel23
 description PE-P
 ip unnumbered Loopback0
 tunnel mode mpls traffic-eng
 tunnel destination 12.12.12.12
 tunnel mpls traffic-eng affinity 0x0 mask 0x0
 tunnel mpls traffic-eng path-option 10 explicit name PATH_1_8_6_12


! XRv12
interface tunnel-te23
 description P-PE
 ipv4 unnumbered Loopback0
 logging events all
 destination 11.11.11.11
 affinity include RED
 affinity exclude BLUE
 path-option 10 dynamic

We quickly verify the tunnel paths are correct and within constraints, and then use OAM to verify the data-plane [ID 23]. On XRv12, we should specify that we want to see the head-end of tunnel 23, since we used the same tunnel ID for both and XRv12 is the head for one and the tail for the other. XRv12 shows being both a head and tail within this "extended" LSP.

CSR1#show mpls traffic-eng tunnels tunnel 23 | section RSVP_Path
  RSVP Path Info:
    My Address: 132.1.8.1
    Explicit Route: 132.1.8.8 132.6.8.6 132.6.12.12 12.12.12.12
    Record Route: NONE
    Tspec: ave rate=0 kbits, burst=1000 bytes, peak rate=0 kbits

CSR1#traceroute mpls traffic-eng tunnel 23
Tracing MPLS TE Label Switched Path on Tunnel23, timeout is 2 seconds
[snip]
Type escape sequence to abort.
  0 132.1.8.1 MRU 1500 [Labels: 8012 Exp: 0]
L 1 132.1.8.8 MRU 1500 [Labels: 6011 Exp: 0] 2 ms
L 2 132.6.8.6 MRU 1500 [Labels: implicit-null Exp: 0] 33 ms
! 3 132.6.12.12 35 ms

RP/0/0/CPU0:XRv12#show mpls traffic-eng tunnels 23 role head | begin Path info
  Path info (IS-IS 132 level-2):
  Node hop count: 2
  Hop0: 132.10.12.10
  Hop1: 132.10.11.11
  Hop2: 11.11.11.11

RP/0/0/CPU0:XRv12#traceroute mpls traffic-eng tunnel-te 23
Tracing MPLS TE Label Switched Path on tunnel-te23, timeout is 2 seconds
[snip]
Type escape sequence to abort.
  0 132.10.12.12 MRU 1500 [Labels: 10011 Exp: 0]
L 1 132.10.12.10 MRU 1500 [Labels: implicit-null Exp: 0] 0 ms
! 2 132.10.11.11 1 ms

RP/0/0/CPU0:XRv12#show mpls traffic-eng tunnels brief up
TUNNEL NAME   DESTINATION   STATUS  STATE
tunnel-te23   11.11.11.11   up      up
PE-P          12.12.12.12   up      up

Each LSP appears to be fully functional. We will configure a static route to XRv11 on CSR1 so we can send VPN traffic into the tunnel. Auto-route destination would not be appropriate here, since the destination of CSR1's tunnel (XRv12) is not the address to which we are sending traffic (XRv11). Autoroute announce technically could work on CSR1, but we will use a static route for simplicity. We can use auto-route announce (or destination) on XRv12 for the second leg without causing any complexities. We quickly verify that each router is directing traffic to XRv11 out of the appropriate TE tunnel.

! CSR1
ip route 11.11.11.11 255.255.255.255 Tunnel23
! XRv12
interface tunnel-te23
 autoroute announce

CSR1#show ip route 11.11.11.11
Routing entry for 11.11.11.11/32
  Known via "static", distance 1, metric 0 (connected)
  Routing Descriptor Blocks:
  * directly connected, via Tunnel23
      Route metric is 0, traffic share count is 1

RP/0/0/CPU0:XRv12#show route ipv4 unicast 11.11.11.11
Routing entry for 11.11.11.11/32
  Known via "isis 132", distance 115, metric 10, type level-2
  Routing Descriptor Blocks
    11.11.11.11, from 11.11.11.11, via tunnel-te23
      Route metric is 10
  No advertising protos.

When we traceroute from CSR1 to XRv11 to verify the path, we see that CSR6 is performing PHP of the TE label as expected, and the traffic is raw IP on the hop to XRv12. This works for traffic in the core since we can route IP or MPLS. As such, there is no breakage in the BGP sessions, etc. However, this breaks the VPN service between CSR7 and XRv13.

CSR1#traceroute 11.11.11.11 source loopback0
Type escape sequence to abort.
Tracing the route to 11.11.11.11
VRF info: (vrf in name/id, vrf out name/id)
  1 132.1.8.8 [MPLS: Label 8012 Exp 0] 10 msec 8 msec 8 msec
  2 132.6.8.6 [MPLS: Label 6011 Exp 0] 8 msec 17 msec 17 msec
  3 132.6.12.12 16 msec 16 msec 8 msec
  4 132.10.12.10 [MPLS: Label 10011 Exp 0] 33 msec 25 msec 25 msec
  5 132.10.11.11 15 msec * 24 msec


CSR7#ping 13.13.13.13 source 7.7.7.7
Type escape sequence to abort.
Sending 5, 100-byte ICMP Echos to 13.13.13.13, timeout is 2 seconds:
Packet sent with a source address of 7.7.7.7
.....
Success rate is 0 percent (0/5)

Because CSR6 is performing PHP of the TE label, and the next label is the VPN label, this is being exposed to XRv12 when it was allocated by XRv11. We confirm this with MPLS packet debugging on XRv12; it receives a packet from CSR6 but cannot route it since there isn't an LFIB entry for label 91011, which is the VPN label allocated by XRv11 for XRv13's loopback.

CSR1#show bgp vpnv4 unicast vrf TE labels | include 13\.
   13.13.13.13/32   11.11.11.11   nolabel/91011

! XRv12
netio[309]: mpls_switch-2: received an mpls packet on GigabitEthernet0_0_0_0.562: proto=0, direction = ingress
netio[309]: mpls_switch-2: Setting default table-id
netio[309]: Flag set to 0x00000001
netio[309]: mpls_switch: GigabitEthernet0_0_0_0.562, mpls eos 1, ttl 252, len 126, inlabel 91011, tbl_id=0x0, vrf_id=0x0 in=0x600
netio[309]: [mpls_switch:2818] Pkt Drop: mpls_switch: GigabitEthernet0_0_0_0.562, No LFIB entry found for in_label 91011

To solve this problem, we need to introduce another label into the stack. At imposition, CSR1 is currently accounting for a transport label to XRv12 and a VPN label for the final destination. There is no label representing the remote PE anywhere in the stack. Without this label, XRv12 will not know how to forward traffic along the second leg of the LSP. The solution is to enable LDP on the TE tunnel, which is basically telling the software to originate a targeted session to the tunnel destination. Rather than manually configure the destination to reciprocate, we can configure XRv12 to accept targeted sessions and respond accordingly. We can verify the LDP neighbors are up on both sides. CSR1 initiates the targeted hellos and is considered active, while XRv12 only responds to them and is considered passive. These roles describe discovery only: in the output below, XRv12 (the higher transport address) opens the TCP connection from port 20748 to port 646 on CSR1.

! CSR1
interface Tunnel23
 mpls ip
! XRv12
mpls ldp
 address-family ipv4
  discovery targeted-hello accept

CSR1#show mpls ldp neighbor 12.12.12.12
    Peer LDP Ident: 12.12.12.12:0; Local LDP Ident 1.1.1.1:0
        TCP connection: 12.12.12.12.20748 - 1.1.1.1.646
        State: Oper; Msgs sent/rcvd: 19/18; Downstream
        Up time: 00:00:46
        LDP discovery sources:
          Targeted Hello 1.1.1.1 -> 12.12.12.12, active
        Addresses bound to peer LDP Ident:
          12.12.12.12 132.9.12.12 132.6.12.12 132.11.12.12
          132.10.12.12

RP/0/0/CPU0:XRv12#show mpls ldp neighbor 1.1.1.1
Peer LDP Identifier: 1.1.1.1:0
  TCP connection: 1.1.1.1:646 - 12.12.12.12:20748; MD5 on
  Graceful Restart: No
  Session Holdtime: 180 sec
  State: Oper; Msgs sent/rcvd: 19/20; Downstream-Unsolicited
  Up time: 00:00:59
  LDP Discovery Sources:
    IPv4: (1)
      Targeted Hello (12.12.12.12 -> 1.1.1.1, passive)
    IPv6: (0)
  Addresses bound to this peer:
    IPv4: (6)
      1.1.1.1 132.1.3.1 132.1.4.1 132.1.6.1
      132.1.8.1 132.1.9.1
    IPv6: (0)

With an LDP session established between CSR1 and XRv12, labels can be exchanged. CSR1 is already routing to 11.11.11.11/32 out of the TE tunnel, which means the TE label is imposed. We also learn label 92003, which is XRv12’s local LDP label for XRv11's loopback. If packets arrive on any of XRv12's LDP enabled interfaces with this label, XRv12 will be able to route the packet according to its LFIB. We confirm that CSR1 imposes this label above the VPN label in its FIB, and confirm that XRv12 can route MPLS packets with label 92003 out of the proper TE tunnel (the second leg). On ingress, XRv12 “swaps” this LDP local label for the TE label learned from CSR10 via RSVP. I say “swaps” because the LFIB shows a pop followed by a push, which is effectively a swap.

CSR1#show mpls ldp bindings 11.11.11.11 32 neighbor 12.12.12.12
  lib entry: 11.11.11.11/32, rev 45
        remote binding: lsr: 12.12.12.12:0, label: 92003

CSR1#show ip cef vrf TE 13.13.13.13
13.13.13.13/32
  nexthop 11.11.11.11 Tunnel23 label 92003 91011

RP/0/0/CPU0:XRv12#show mpls forwarding labels 92003 detail
Local  Outgoing    Prefix             Outgoing     Next Hop        Bytes
Label  Label       or ID              Interface                    Switched
------ ----------- ------------------ ------------ --------------- ----------
92003  Pop         11.11.11.11/32     tt23         11.11.11.11     2268
     Version: 273, Priority: 3
     MAC/Encaps: 18/22, MTU: 1500
     Label Stack (Top -> Bottom): { 10011 Imp-Null }
     NHID: 0
     Packets Switched: 33

We can run our traceroute tests again from the global and VPN tables to ensure the tunnel stitching was successful. The global table does not reveal any IP-only hops, and the VPN traffic has a fully MPLS-enabled path for transport. The TE label is topmost to steer traffic as needed, the LDP label is used at the stitch point to direct traffic towards the remote PE, and the VPN label remains unchanged at the bottom to represent the final destination. The second leg of the tunnel did not require tLDP since the tunnel terminates on the final PE.

CSR1#traceroute 11.11.11.11 source 1.1.1.1
Type escape sequence to abort.
Tracing the route to 11.11.11.11
VRF info: (vrf in name/id, vrf out name/id)
  1 132.1.8.8 [MPLS: Labels 8012/92003 Exp 0] 27 msec 33 msec 25 msec
  2 132.6.8.6 [MPLS: Labels 6011/92003 Exp 0] 15 msec 17 msec 17 msec
  3 132.6.12.12 [MPLS: Label 92003 Exp 0] 19 msec 54 msec 53 msec
  4 132.10.12.10 [MPLS: Label 10011 Exp 0] 42 msec 25 msec 25 msec
  5 132.10.11.11 15 msec * 63 msec

CSR7#traceroute 13.13.13.13 source 7.7.7.7
Type escape sequence to abort.
Tracing the route to 13.13.13.13
VRF info: (vrf in name/id, vrf out name/id)
  1 10.1.7.1 1 msec 1 msec 1 msec
  2 132.1.8.8 [MPLS: Labels 8012/92003/91011 Exp 0] 25 msec 25 msec 25 msec
  3 132.6.8.6 [MPLS: Labels 6011/92003/91011 Exp 0] 35 msec 33 msec 33 msec
  4 132.6.12.12 [MPLS: Labels 92003/91011 Exp 0] 33 msec 33 msec 33 msec
  5 132.10.12.10 [MPLS: Labels 10011/91011 Exp 0] 85 msec 58 msec 37 msec
  6 132.10.11.11 [MPLS: Label 91011 Exp 0] 35 msec 33 msec 33 msec
  7 10.11.13.13 33 msec * 24 msec
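The label operations behind the stitched path can be traced with a small model. The Python sketch below is only an illustration, assuming each router's LFIB can be reduced to a map from the incoming top label to the labels that replace it. The label values are copied from the lab outputs in this section; the `forward` helper and the `HOPS` table are hypothetical names and not anything the routers expose.

```python
# Simplified LFIBs: incoming top label -> labels that replace it.
# [] = pop, [x] = swap (XRv12's pop-then-push of 10011 behaves like a swap).
# Label values come from the lab outputs; the data model is an assumption.
HOPS = [
    ("CSR8",  {8012: [6011]}),
    ("CSR6",  {6011: []}),        # PHP for the first TE tunnel
    ("XRv12", {92003: [10011]}),  # LDP label mapped onto tunnel-te23
    ("CSR10", {10011: []}),       # PHP for the second TE tunnel
    ("XRv11", {91011: []}),       # VPN label, handed to the VRF
]


def forward(stack, hops=HOPS):
    """Walk a label stack (top label first) through each hop's LFIB."""
    stack = list(stack)
    for router, lfib in hops:
        if not stack:
            continue  # unlabeled packet: assume plain IP routing works
        top = stack[0]
        if top not in lfib:
            return router, "drop", stack
        stack = lfib[top] + stack[1:]
    return hops[-1][0], "delivered", stack


# Without tLDP, CSR1 imposes only the TE label and the VPN label.
print(forward([8012, 91011]))
# With tLDP, the LDP label for 11.11.11.11/32 sits between them.
print(forward([8012, 92003, 91011]))
```

The first call stops at XRv12 with the VPN label 91011 exposed, which is the drop seen in the earlier debug. The second call reaches XRv11 with an empty stack, matching the label stacks in the traceroute.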

We also don't have to fully stitch tunnels together. Another design option would be to tunnel traffic to XRv12, then let XRv12 use the dynamic LDP path to deliver it to XRv11. We still need LDP on the PE-P tunnel for the reasons outlined above, but stitching TE tunnels together is not a requirement. We can use a combination of LDP and RSVP for the PE-to-PE transport. We can test this by shutting down tunnel 23 on XRv12 (not shown). Now, XRv12 learns the route via IGP out of a non-TE interface, so an ordinary LDP label can be used. Since XRv12 is the penultimate hop, it can pop label 92003 on ingress rather than swap it, as shown in the LFIB.

RP/0/0/CPU0:XRv12#show route ipv4 unicast 11.11.11.11
Routing entry for 11.11.11.11/32
  Known via "isis 132", distance 115, metric 10, type level-2
  Routing Descriptor Blocks
    132.11.12.11, from 11.11.11.11, via GigabitEthernet0/0/0/0.512
      Route metric is 10
  No advertising protos.

RP/0/0/CPU0:XRv12#show mpls forwarding labels 92003 detail
Local  Outgoing    Prefix             Outgoing      Next Hop        Bytes
Label  Label       or ID              Interface                     Switched
------ ----------- ------------------ ------------- --------------- ----------
92003  Pop         11.11.11.11/32     Gi0/0/0/0.512 132.11.12.11    4080
     Version: 277, Priority: 3
     MAC/Encaps: 18/18, MTU: 1500
     Label Stack (Top -> Bottom): { Imp-Null }
     NHID: 0
     Packets Switched: 34

The LSP is shorter by one hop, but the logic is the same. We verify the path in the global and VPN tables, noting that the initial TE and LDP labels are the same as before until XRv12 is reached. CSR1 isn't aware of any changes in the path since the path from XRv12 to XRv11 was always beyond CSR1's purview.

CSR1#traceroute 11.11.11.11 source 1.1.1.1
Type escape sequence to abort.
Tracing the route to 11.11.11.11
VRF info: (vrf in name/id, vrf out name/id)
  1 132.1.8.8 [MPLS: Labels 8012/92003 Exp 0] 36 msec 25 msec 25 msec
  2 132.6.8.6 [MPLS: Labels 6011/92003 Exp 0] 15 msec 17 msec 16 msec
  3 132.6.12.12 [MPLS: Label 92003 Exp 0] 16 msec 17 msec 16 msec
  4 132.11.12.11 16 msec * 18 msec

CSR7#traceroute 13.13.13.13 source 7.7.7.7
Type escape sequence to abort.
Tracing the route to 13.13.13.13
VRF info: (vrf in name/id, vrf out name/id)
  1 10.1.7.1 1 msec 1 msec 0 msec
  2 132.1.8.8 [MPLS: Labels 8012/92003/91011 Exp 0] 75 msec 50 msec 41 msec
  3 132.6.8.6 [MPLS: Labels 6011/92003/91011 Exp 0] 35 msec 33 msec 34 msec
  4 132.6.12.12 [MPLS: Labels 92003/91011 Exp 0] 33 msec 33 msec 34 msec
  5 132.11.12.11 [MPLS: Label 91011 Exp 0] 33 msec 42 msec 33 msec
  6 10.11.13.13 23 msec * 52 msec

31.2 TE Fast-ReRoute (FRR) and rapid provisioning

MPLS-TE is an excellent option for FRR. Other options, such as IP LFA, are much newer and not as widely supported. IP FRR is discussed in a dedicated section.

31.2.1 Link (NHOP), Node (NNHOP), and Path protection – Manual

Using MPLS TE for fast reroute (FRR) is a very common way to achieve HA in MPLS networks. If a link or node within the path of a TE tunnel fails, a pre-signaled backup path is available and traffic can be routed into the backup tunnel as soon as the failure is detected. TE FRR introduces some new terminology as well.

PLR: The point of local repair is the head-end of the backup path. When a failure is detected, it routes packets into the backup tunnel. This is much faster than signaling back to the original TE LSP headend to recalculate since the FRR occurs locally at the point of failure.

MP: The merge point is the tail-end of the backup path. When traffic arrives at the MP, it is routed along the normal TE path again, merging the traffic back on the original path.

The PLR and MP work in concert to ensure that TE nodes upstream and downstream from the failure are minimally affected while FRR is active. There are three main types of protection offered, in order from least to most desirable.

Path protection: Pre-signals a backup path from head to tail on a per path-option basis to be used in case of a failure anywhere along the path. Easy to configure and automatically protects all hops in the path without having to use PLRs. It is slowest to converge and scales the worst, and is not supported on XRv.

NHOP protection: Next-hop protection (link protection) protects the path to the next-hop by routing around failed links. This may scale better than NNHOP protection if many LSPs transit an LSR but have many different NNHOPs.

NNHOP protection: Next-next-hop protection (node protection) protects the path to the hop after the next-hop by routing around failed nodes. This offers better protection than NHOP but may not scale.

RFC 4090 details two ways of achieving FRR in the data-plane.

1. N:1 or many-to-one or facility backup. The PLR pushes a new FRR label (wraps the existing MPLS packet) as part of a new TE tunnel. The MP advertises a null label (implicit or explicit), effectively allowing it to see the original TE label so that the traffic can be merged easily since the incoming label would be the same as the main path. Many LSPs can be tunneled in the FRR tunnel in this way, providing more scalability at the low cost of 4 bytes extra encapsulation.

2. 1:1 or one-to-one or detour backup. The PLR swaps the top label rather than pushes a new one. The MP advertises a normal label so the penultimate hop performs a label swap. This means that the label arriving at the MP will be different than the main path, and also different for each LSP. This scales poorly and isn't even supported on Cisco platforms.


We will configure a basic TE tunnel on CSR1 towards CSR8. CSR1 has two path options; the first is an explicit path through the network, the second is a secondary dynamic option requiring a green path and 2 Mbps. The tunnel uses auto-route destination to steer traffic into it, and also requests FRR protection.

! CSR1
mpls traffic-eng lsp attributes LSP_ATT_GREEN
 affinity 0x2 mask 0x2
 bandwidth 2000
ip explicit-path name PATH_1_3_10_6_8 enable
 next-address 3.3.3.3
 next-address 2.2.2.2
 next-address 10.10.10.10
 next-address 6.6.6.6
 next-address 8.8.8.8
interface Tunnel30
 description BASIC TE FRR
 ip unnumbered Loopback0
 tunnel mode mpls traffic-eng
 tunnel destination 8.8.8.8
 tunnel mpls traffic-eng autoroute destination
 tunnel mpls traffic-eng affinity 0x0 mask 0x0
 tunnel mpls traffic-eng path-option 10 explicit name PATH_1_3_10_6_8
 tunnel mpls traffic-eng path-option 20 dynamic attributes LSP_ATT_GREEN
 tunnel mpls traffic-eng fast-reroute
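The affinity and mask pairs in this configuration (0x2/0x2 for the green LSP attribute list, 0x0/0x0 on the tunnel itself) follow the usual TE comparison: only the bits set in the mask are checked against the link's attribute flags. The short Python sketch below illustrates that rule. The function name `link_matches` is hypothetical, and treating bit 0x2 as "green" is an assumption inferred from the name LSP_ATT_GREEN.

```python
def link_matches(link_attr_flags: int, affinity: int, mask: int) -> bool:
    """Return True when a link satisfies a tunnel's affinity/mask pair."""
    # Only bits set in the mask are compared; other bits are "don't care".
    return (link_attr_flags & mask) == (affinity & mask)


GREEN = 0x2  # assumed bit for "green", inferred from LSP_ATT_GREEN

print(link_matches(GREEN, affinity=0x2, mask=0x2))        # green link
print(link_matches(GREEN | 0x4, affinity=0x2, mask=0x2))  # green plus another color
print(link_matches(0x4, affinity=0x2, mask=0x2))          # not green
print(link_matches(0x4, affinity=0x0, mask=0x0))          # mask 0x0 accepts any link
```

Under these assumptions the first, second, and fourth calls return True and the third returns False, which is why path-option 20 is restricted to green links while the 0x0/0x0 setting places no color constraint on the explicit path.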

Debugging RSVP message dumps on CSR3, we can see this FRR request inside the RSVP PATH message from CSR1. The outgoing path and incoming RESV are not shown, but the final RESV sent to CSR1 contains an RRO with additional detail; it includes the labels at each hop in the TE tunnel. It also records where the backup tunnels are in the network, and right now there aren’t any. Each entry says there is "no local protection" so FRR isn't completely configured yet.

! CSR3
Incoming Path:
  version:1 flags:0000 cksum:FD26 ttl:255 reserved:0 length:212
 SESSION type 7 length 16:
  Tun Dest: 8.8.8.8 Tun ID: 30 Ext Tun ID: 1.1.1.1
 HOP type 1 length 12:
  Hop Addr: 132.1.3.1 LIH: 0x00000019
 TIME_VALUES type 1 length 8 :
  Refresh Period (msec): 30000
 EXPLICIT_ROUTE type 1 length 44:
  132.1.3.3 (Strict IPv4 Prefix, 8 bytes, /32)
  132.3.10.10 (Strict IPv4 Prefix, 8 bytes, /32)
  132.6.10.6 (Strict IPv4 Prefix, 8 bytes, /32)
  132.6.8.8 (Strict IPv4 Prefix, 8 bytes, /32)
  8.8.8.8 (Strict IPv4 Prefix, 8 bytes, /32)
 LABEL_REQUEST type 1 length 8 :
  Layer 3 protocol ID: 2048
 SESSION_ATTRIBUTE type 7 length 20:
  Setup Prio: 7, Holding Prio: 7
  Flags: (0x7) Local Prot desired, Label Recording, SE Style
  Session Name: BASIC TE FRR
[snip]
Outgoing Resv:
  version:1 flags:0000 cksum:A674 ttl:255 reserved:0 length:176
 SESSION type 7 length 16:
  Tun Dest: 8.8.8.8 Tun ID: 30 Ext Tun ID: 1.1.1.1
[snip]
 RECORD_ROUTE type 1 length 68:
  3.3.3.3/32, Flags:0x20 (No Local Protection, Node-id)
   Label subobject: Flags 0x1, C-Type 1, Label 3012
  10.10.10.10/32, Flags:0x20 (No Local Protection, Node-id)
   Label subobject: Flags 0x1, C-Type 1, Label 10011
  6.6.6.6/32, Flags:0x20 (No Local Protection, Node-id)
   Label subobject: Flags 0x1, C-Type 1, Label 6011
  8.8.8.8/32, Flags:0x20 (No Local Protection, Node-id)
   Label subobject: Flags 0x1, C-Type 1, Label 3

We can verify that CSR1 has this information stored in its RESV message from CSR3. The RRO is shown in almost the same format as the RSVP dump, so this is a quick way to verify the protection for a given TE tunnel. We can see that CSR1's FRR database has no protection listed for this tunnel.

CSR1#show ip rsvp reservation detail filter session-type 7 tunnel-id 30
Reservation:
  Tun Dest: 8.8.8.8 Tun ID: 30 Ext Tun ID: 1.1.1.1
  Tun Sender: 1.1.1.1 LSP ID: 3
  Next Hop: 132.1.3.3 on GigabitEthernet2.513
  Label: 3012 (outgoing)
  Reservation Style is Shared-Explicit, QoS Service is Controlled-Load
  Resv ID handle: 1D00040E.
  Average Bitrate is 0 bits/sec, Maximum Burst is 1K bytes
  Min Policed Unit: 0 bytes, Max Pkt Size: 1500 bytes
  RRO:
    3.3.3.3/32, Flags:0x20 (No Local Protection, Node-id)
      Label subobject: Flags 0x1, C-Type 1, Label 3012
    10.10.10.10/32, Flags:0x20 (No Local Protection, Node-id)
      Label subobject: Flags 0x1, C-Type 1, Label 10011
    6.6.6.6/32, Flags:0x20 (No Local Protection, Node-id)
      Label subobject: Flags 0x1, C-Type 1, Label 6011
    8.8.8.8/32, Flags:0x20 (No Local Protection, Node-id)
      Label subobject: Flags 0x1, C-Type 1, Label 3
  Status:
  Policy: Accepted. Policy source(s): MPLS/TE

CSR1#show mpls traffic-eng fast-reroute database
P2P Headend FRR information:
Protected tunnel                 In-label  Out intf/label  FRR intf/label  Status
-------------------------------  --------  --------------  --------------  ------
[no output]
[snip]

Before continuing, we can verify the TE tunnel is at least operational using MPLS OAM in the core and traceroute in the VPN. This shows us that auto-route is working properly and that there are no dataplane filters that might block traffic. It also allows us to verify the exact label values used to confirm it matches what the RESV RRO contained. The labels 3012, 10011, 6011, and implicit-null are shown in the RRO above.

CSR1#traceroute mpls traffic-eng tunnel 30
Tracing MPLS TE Label Switched Path on Tunnel30, timeout is 2 seconds
[snip]
Type escape sequence to abort.
  0 132.1.3.1 MRU 1500 [Labels: 3012 Exp: 0]
L 1 132.1.3.3 MRU 1500 [Labels: 10011 Exp: 0] 2 ms
L 2 132.3.10.10 MRU 1500 [Labels: 6011 Exp: 0] 29 ms
L 3 132.6.10.6 MRU 1500 [Labels: implicit-null Exp: 0] 26 ms
! 4 132.6.8.8 99 ms

CSR7#traceroute 14.14.14.14 source 7.7.7.7
Type escape sequence to abort.
Tracing the route to 14.14.14.14
VRF info: (vrf in name/id, vrf out name/id)
  1 10.1.7.1 0 msec 1 msec 1 msec
  2 132.1.3.3 [MPLS: Labels 3012/8009 Exp 0] 58 msec 31 msec 30 msec
  3 132.3.10.10 [MPLS: Labels 10011/8009 Exp 0] 41 msec 36 msec 40 msec
  4 132.6.10.6 [MPLS: Labels 6011/8009 Exp 0] 40 msec 31 msec 30 msec
  5 10.8.14.8 [MPLS: Label 8009 Exp 0] 16 msec 15 msec 15 msec
  6 10.8.14.14 24 msec * 69 msec

First we will look at NHOP protection. Let's assume that the link between CSR3 and CSR10 is known to be unreliable [ID 30]. We know our tunnel transits this link but want to be able to route around it quickly if/when it fails. We configure CSR3 to be our PLR and CSR10 to be our MP; a new tunnel is created from CSR3 to CSR10 that specifically avoids this link using an explicit path that excludes the link addresses. We can configure the tunnel however we like, including bandwidth reservations and affinity. The tunnel also requires colors green, blue, and orange. We also specify the outgoing interface through which the original tunnel (the one we are backing up) is transiting. This tells RSVP that RESV messages arriving on that interface can use Tunnel 30 as a facility backup if FRR is requested.

! CSR3
ip explicit-path name PATH_AVOID_CSR3_CSR10 enable
 exclude-address 132.3.10.3
 exclude-address 132.3.10.10
interface Tunnel30
 description REPAIR PATH (PLR)
 ip unnumbered Loopback0
 tunnel mode mpls traffic-eng
 tunnel destination 10.10.10.10
 tunnel mpls traffic-eng affinity 0xE mask 0xE
 tunnel mpls traffic-eng path-option 10 explicit name PATH_AVOID_CSR3_CSR10
interface GigabitEthernet2.530
 mpls traffic-eng backup-path Tunnel30

We quickly verify that the backup tunnel comes up and doesn’t traverse the link directly to CSR10. The tunnel routes through CSR2 to reach CSR10 and avoids the protected link.

CSR3#show mpls traffic-eng tunnels tunnel 30 | section Status|RSVP_Path
  Status:
    Admin: up Oper: up Path: valid Signalling: connected
    path option 10, type explicit PATH_AVOID_CSR3_CSR10 (Basis for Setup, path weight 20)
  RSVP Path Info:
    My Address: 132.2.3.3
    Explicit Route: 132.2.3.2 132.2.10.10 10.10.10.10
    Record Route: NONE
    Tspec: ave rate=0 kbits, burst=1000 bytes, peak rate=0 kbits

We check the RESV details on the tunnel head again and see that NHOP protection is now available to protect 3.3.3.3/32's "next-hop". This protects the link from CSR3 to CSR10 using a backup tunnel that avoids that link. We can see the flag is different for this hop (0x21 versus 0x20) compared to the ones that offer no protection. Notice that the output says "avail" but not "in use".

CSR1#show ip rsvp reservation detail filter session-type 7 tunnel-id 30
Reservation:
  Tun Dest: 8.8.8.8 Tun ID: 30 Ext Tun ID: 1.1.1.1
  Tun Sender: 1.1.1.1 LSP ID: 19
  Next Hop: 132.1.3.3 on GigabitEthernet2.513
  Label: 3011 (outgoing)
  Reservation Style is Shared-Explicit, QoS Service is Controlled-Load
  Resv ID handle: 2500040E.
  Average Bitrate is 0 bits/sec, Maximum Burst is 1K bytes
  Min Policed Unit: 0 bytes, Max Pkt Size: 1500 bytes
  RRO:
    3.3.3.3/32, Flags:0x21 (Local Prot Avail/to NHOP, Node-id)
      Label subobject: Flags 0x1, C-Type 1, Label 3011
    10.10.10.10/32, Flags:0x20 (No Local Protection, Node-id)
      Label subobject: Flags 0x1, C-Type 1, Label 10010
    6.6.6.6/32, Flags:0x20 (No Local Protection, Node-id)
      Label subobject: Flags 0x1, C-Type 1, Label 6011
    8.8.8.8/32, Flags:0x20 (No Local Protection, Node-id)
      Label subobject: Flags 0x1, C-Type 1, Label 3
  Status:
  Policy: Accepted. Policy source(s): MPLS/TE
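The hex flags in these outputs are bitmasks, so the jump from 0x20 to 0x21 is a single bit being set. The Python sketch below decodes them as an illustration. The bit meanings are taken from my reading of RFC 3209, RFC 4090, and RFC 4561 rather than from the router output, and the `decode` helper and table names are hypothetical.

```python
# RRO subobject flag bits (RFC 3209, RFC 4090, RFC 4561).
RRO_FLAGS = {
    0x01: "Local protection available",
    0x02: "Local protection in use",
    0x04: "Bandwidth protection",
    0x08: "Node protection",
    0x20: "Node-id",
}

# SESSION_ATTRIBUTE flag bits (RFC 3209), as seen in the PATH message.
SESSION_ATTR_FLAGS = {
    0x01: "Local protection desired",
    0x02: "Label recording desired",
    0x04: "SE style desired",
}


def decode(flags, names):
    """List the names of all bits set in flags, lowest bit first."""
    return [name for bit, name in sorted(names.items()) if flags & bit]


print(decode(0x20, RRO_FLAGS))          # hop with no local protection
print(decode(0x21, RRO_FLAGS))          # hop with NHOP protection available
print(decode(0x7, SESSION_ATTR_FLAGS))  # Tunnel30's FRR request
```

If FRR were actually triggered, the protected hop would be expected to also carry the "in use" bit (0x02) in addition to the "available" bit.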

On the PLR, we can verify the FRR configuration by checking the FRR database, which tracks the LSPs to be backed up along with the available FRR tunnels. Specifying the incoming label 3011, we can see how this specific LSP is being backed up. The summary view shows the incoming label, along with the ordinary LSP's outgoing interface and label. It also includes the FRR tunnel; because this entry is "ready" and not "active", we know FRR is not currently in effect.

CSR3#show mpls traffic-eng fast-reroute database labels 3011
[snip]
P2P LSP midpoint frr information:
LSP identifier                   In-label  Out intf/label  FRR intf/label  Status
-------------------------------  --------  --------------  --------------  ------
1.1.1.1 30 [19]                  3011      Gi2.530:10010   Tu30:10010      ready
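The FRR database entry can be read as two forwarding choices for the same incoming label. The Python sketch below models facility backup at the PLR under that reading: the normal swap to 10010 always happens, and during a failure the backup tunnel's label is pushed on top. The entry values come from the output above, but the backup tunnel label (2999 here) is a made-up placeholder because the lab output does not show it, and `plr_forward` is a hypothetical helper.

```python
# FRR database entry on CSR3, taken from the output above.
ENTRY = {
    "in_label": 3011,
    "out_intf": "Gi2.530",
    "out_label": 10010,
    "frr_intf": "Tu30",
}


def plr_forward(stack, entry, link_up, backup_label):
    """Return (interface, label stack) chosen by the PLR for one packet."""
    if not stack or stack[0] != entry["in_label"]:
        raise ValueError("packet does not match the protected LSP")
    # Normal swap: the MP (CSR10) expects label 10010 on either path.
    inner = [entry["out_label"]] + list(stack[1:])
    if link_up:
        return entry["out_intf"], inner
    # Facility backup: push the backup tunnel label over the swapped stack.
    return entry["frr_intf"], [backup_label] + inner


HYPOTHETICAL_BACKUP_LABEL = 2999  # placeholder, not from the lab output

print(plr_forward([3011, 8009], ENTRY, True, HYPOTHETICAL_BACKUP_LABEL))
print(plr_forward([3011, 8009], ENTRY, False, HYPOTHETICAL_BACKUP_LABEL))
```

Assuming the penultimate hop of the backup tunnel pops the outer label, CSR10 receives 10010 on top in both cases, which is what lets it merge the traffic back onto the original LSP.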

The final step in the configuration is configuring some kind of mechanism to detect a failed link. We can do this using RSVP hellos. Without a mechanism like this, RSVP will not be able to tell when the link fails, and will not be able to converge around it. The logic is the same as with IGPs: some kind of active detection mechanism is needed when using TE-FRR. Enabling the command globally allows FRR to use the hellos, and it also needs to be enabled at the link level. We tune the timers to 5 second hellos with a neighbor being declared dead after 4 are missed, for a total of 20 seconds dead detection time.

! CSR3 and CSR10
ip rsvp signalling hello
interface GigabitEthernet2.530
 ip rsvp signalling hello
 ip rsvp signalling hello refresh interval 5000
 ip rsvp signalling hello refresh misses 4
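The dead detection time quoted above is simply the refresh interval multiplied by the number of tolerated misses. As a quick worked check, the Python snippet below reproduces that arithmetic; `hello_dead_time_ms` is a hypothetical helper name and the calculation is a simplification that ignores jitter and processing delay.

```python
def hello_dead_time_ms(refresh_interval_ms: int, misses: int) -> int:
    """Approximate time before an RSVP hello neighbor is declared down."""
    return refresh_interval_ms * misses


# Values from the configuration: 5000 ms interval, 4 misses.
dead_ms = hello_dead_time_ms(5000, 4)
print(dead_ms / 1000, "seconds")
```

With the configured values this gives 20 seconds, which is slow for FRR; the timers here look like lab values, and the same function makes it easy to see the effect of a shorter interval.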

Verifying on CSR3, we can see that RSVP hellos are enabled for FRR. CSR10 won't produce much output for the show commands because it is the "passive" side of the link. CSR3 is the tunnel head and has FRR bound to RSVP hellos as a client. The output even describes the difference between active and passive. The detailed output mostly shows various counters related to the session.

CSR3#show ip rsvp hello
Hello:
  RSVP Hello for Fast-Reroute/Reroute: Enabled
    Statistics: Disabled
  BFD for Fast-Reroute/Reroute: Disabled
  RSVP Hello for Graceful Restart: Disabled

CSR3#show ip rsvp hello instance summary
Active Instances:
  Client  Neighbor     I/F      State  LostCnt  LSPs  Interval
  FRR     132.3.10.10  Gi2.530  Up     0        1     5000

Passive Instances:
  - None

Active = Actively tracking neighbor state on behalf of clients:
         RR = ReRoute, FRR = Fast ReRoute, or GR = Graceful Restart
Passive = Responding to hello requests from neighbor

CSR3#show ip rsvp hello instance detail
Neighbor 132.3.10.10 (router ID: 10.10.10.10) Source 132.3.10.3
  Type: Active (sending requests)
  I/F: GigabitEthernet2.530
  State: Up (Since: 2015 MON DAY 14 13:14:08 )
  Clients: Fast Reroute
  LSPs protecting: 1
  Missed acks: 4, IP DSCP: 0x30
  Refresh Interval (msec)
    Configured: 5000
    Statistics: (from 225 samples)
      Min: 5000
      Max: 5001
      Average: 5000
      Waverage: 5000 (Weight = 0.8)
      Current: 5000
  Last sent Src_instance: 0x1FC275D3
  Last recv nbr's Src_instance: 0x5256334A
  Counters:
    Communication with neighbor lost:
      Num times: 0
      Reasons:
        Missed acks: 0
        Bad Src_Inst received: 0
        Bad Dst_Inst received: 0
        I/F went down: 0
        Neighbor disabled Hello: 0
    Msgs Received: 1473
         Sent: 1484
         Suppressed: 0


CSR3 also has details regarding the FRR client and the neighbor capabilities supporting that FRR client.

CSR3#show ip rsvp hello client lsp detail
Hello Client LSPs (all lsp tree)
  Tun Dest: 8.8.8.8 Tun ID: 30 Ext Tun ID: 1.1.1.1
  Tun Sender: 1.1.1.1 LSP ID: 19
    Lsp flags: 0x24
    Lsp RR DN nbr: 132.3.10.10 FRR

CSR3#show ip rsvp hello client nbr detail
Hello Client Neighbors
  Remote addr 132.3.10.10, Local addr 132.3.10.3
    Nbr State: Normal Type: Reroute
    Nbr Hello State: Up
    LSPs protecting: 1
    I/F: Gi2.530

CSR10 shows the session as passive because it is responding to hello requests. The detail shows the message counts.

CSR10#show ip rsvp hello instance summary | section Passive
Passive Instances:
  Neighbor      I/F
  132.3.10.3    Gi2.530

CSR10#show ip rsvp hello instance detail
Neighbor 132.3.10.3 (router ID: 3.3.3.3) Source 132.3.10.10
  Type: Passive (responding to requests)
  I/F: GigabitEthernet2.530
  Last sent Src_instance: 0x5256334A
  Last recv nbr's Src_instance: 0x1FC275D3
  Counters:
    Msgs Received: 1444
         Sent: 1444

Debugging RSVP dumps, we can see the hellos being exchanged back and forth. The ordinary RSVP hello debugging does not yield any output. At this point we can assume everything is working properly. The same message can be matched on both routers by its checksum and by its Src_Instance/Dst_Instance pair.

! CSR3
Outgoing Hello:
  version:1 flags:0000 cksum:BD94 ttl:1 reserved:0 length:20
  HELLO type HELLO REQUEST length 12:
    Src_Instance: 0x1FC275D3, Dst_Instance: 0x5256334A
Incoming Hello:
  version:1 flags:0000 cksum:BD93 ttl:1 reserved:0 length:20
  HELLO type HELLO ACK length 12:
    Src_Instance: 0x5256334A, Dst_Instance: 0x1FC275D3

! CSR10
Incoming Hello:
  version:1 flags:0000 cksum:BD94 ttl:1 reserved:0 length:20
  HELLO type HELLO REQUEST length 12:
    Src_Instance: 0x1FC275D3, Dst_Instance: 0x5256334A
Outgoing Hello:
  version:1 flags:0000 cksum:BD93 ttl:1 reserved:0 length:20
  HELLO type HELLO ACK length 12:
    Src_Instance: 0x5256334A, Dst_Instance: 0x1FC275D3

To test FRR, we will temporarily remove the backup path-option from the main LSP. This will give us more time to analyze traffic inside the FRR tunnel. When an LSP fails and FRR kicks in, an RSVP PATHERR message is sent back towards the head-end, causing a recalculation using the next path-option in sequence. Once we remove the alternative path-option, there is no other choice, so traffic is stuck in the FRR tunnel for some time. For this reason, it is good design to have alternate path-options when FRR is in use.

! CSR1
interface Tunnel30
 no tunnel mpls traffic-eng path-option 20 dynamic attributes LSP_ATT_GREEN

To break the link, we will change the encapsulation on CSR10 to an incorrect VLAN. Shutting down the interface triggers a graceful shutdown, allowing IGP and RSVP to converge immediately, which is not a good way to test FRR. First, we will enable RSVP signaling debugs on CSR1 and CSR3 so we can watch the error event propagation. We also start a ping with a 1 second timeout on CSR7 (within the VPN) to XRv14. We expect to see about 20 dropped packets since we know it takes RSVP 20 seconds to determine there was a failure (this also depends on IGP timers).

! CSR10
interface GigabitEthernet2.530
 encapsulation dot1Q 999

After ~20 seconds, CSR3 determines there is a failure and sends an RSVP PATHERR message back to CSR1. CSR3 no longer has any RSVP hello instances. CSR3 should have locally repaired the tunnel.

! CSR3
RSVP: 1.1.1.1_29->8.8.8.8_30[Src] {7}: building error_spec object with errnode addr: 132.1.3.3
RSVP: 1.1.1.1_29->8.8.8.8_30[Src] {7}: Sending PathError message to 132.1.3.1
RSVP: Triggering outgoing Path refresh

CSR3#show ip rsvp hello instance summary
Active Instances:
  - None
Passive Instances:
  - None -

The labels have changed slightly from before since I was bringing the tunnel up and down for testing, but the output is clearly different. The FRR database shows this tunnel is "active". The FRR out label is still the ordinary TE label expected by CSR10 (10012). The new label pushed is 2010, so label 10012 is tunneled inside label 2010 across CSR2.

CSR3#show mpls traffic-eng fast-reroute database labels 3010 detail
FRR Database Summary:
  Protected interfaces : 1
  Protected LSPs/Sub-LSPs : 1
  Backup tunnels : 1
  Active interfaces : 1
  FRR Active tunnels : 1
P2P LSPs:
  Tun ID: 30, LSP ID: 60, Source: 1.1.1.1
  Destination: 8.8.8.8
  State : active
  InLabel : 3010
  OutLabel : Gi2.530:10012
  FRR OutLabel : Tu30:10012

CSR3#show mpls traffic-eng tunnels tunnel 30 | include Label
  InLabel :
  OutLabel : GigabitEthernet2.523, 2010

CSR1 receives this RSVP PATHERR message, which is a notification that the tunnel has been locally repaired. The notification is benign, but it also tells CSR1 to calculate a new path. The goal of TE-FRR is not to redirect traffic long-term, but to keep traffic flowing during the short time the PLR-to-head signaling takes. In this case, the head-end tunnel times out after several minutes, allowing us to view the process. CSR1 claims the tunnel is pending a reroute, but the tunnel remains up since it was locally repaired. The RSVP RESV details also show that NHOP protection is available and is now "in use". The flags have changed again to indicate this.

! CSR1
RSVP: 1.1.1.1_29->8.8.8.8_30[Src] {7}: Received PathError message from 132.1.3.3 (on GigabitEthernet2.513)
%MPLS_TE-5-LSP: LSP 1.1.1.1 30_29: Path Error from 132.1.3.3: Notify: Tunnel locally repaired (flags 0)

CSR1#show ip rsvp reservation detail filter session-type 7 tunnel-id 30
Reservation:
  Tun Dest: 8.8.8.8 Tun ID: 30 Ext Tun ID: 1.1.1.1
  Tun Sender: 1.1.1.1 LSP ID: 60
  Next Hop: 132.1.3.3 on GigabitEthernet2.513
  Label: 3010 (outgoing)
  Reservation Style is Shared-Explicit, QoS Service is Controlled-Load
  Resv ID handle: 18000404.
  Average Bitrate is 0 bits/sec, Maximum Burst is 1K bytes
  Min Policed Unit: 0 bytes, Max Pkt Size: 1500 bytes
  RRO:
    3.3.3.3/32, Flags:0x23 (Local Prot Avail/In Use/to NHOP, Node-id)
      Label subobject: Flags 0x1, C-Type 1, Label 3010
    10.10.10.10/32, Flags:0x20 (No Local Protection, Node-id)
      Label subobject: Flags 0x1, C-Type 1, Label 10012
    6.6.6.6/32, Flags:0x20 (No Local Protection, Node-id)
      Label subobject: Flags 0x1, C-Type 1, Label 6012
    8.8.8.8/32, Flags:0x20 (No Local Protection, Node-id)
      Label subobject: Flags 0x1, C-Type 1, Label 3
  Status:
    Policy: Accepted. Policy source(s): MPLS/TE

CSR1#show mpls traffic-eng tunnels tunnel 30 | section Status
  Status:
    Admin: up Oper: up Path: valid Signalling: connected
    path option 10, type explicit PATH_1_3_10_6_8 (Basis for Setup, path weight 40)
    Change in required resources detected: reroute pending
  Currently Signalled Parameters:
    Bandwidth: 0 kbps (Global) Priority: 7 7 Affinity: 0x0/0x0
    Metric Type: TE (default)

On CSR7, we can see 18 dropped packets (275 - 257), which is approximately correct given the 20 second RSVP dead-peer detection time. Traffic is now using the FRR tunnel since CSR1 cannot calculate a new path, given that we removed the backup path-option. We can also confirm the imposition of the third label (FRR) on the PLR (CSR3), which tunnels traffic across the backup path to the MP (CSR10). The label arriving on CSR10 is the normal TE label, which means the backup tunnel can be used for many LSPs (facility backup).

CSR7#ping 14.14.14.14 source 7.7.7.7 repeat 1000000 timeout 1
Type escape sequence to abort.
Sending 1000000, 100-byte ICMP Echos to 14.14.14.14, timeout is 1 seconds:
Packet sent with a source address of 7.7.7.7
!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!........
..........!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
Success rate is 93 percent (257/275), round-trip min/avg/max = 1/28/194 ms

CSR7#traceroute 14.14.14.14 source 7.7.7.7
Type escape sequence to abort.
Tracing the route to 14.14.14.14
VRF info: (vrf in name/id, vrf out name/id)
  1 10.1.7.1 1 msec 0 msec 1 msec
  2 132.1.3.3 [MPLS: Labels 3010/8009 Exp 0] 2 msec 2 msec 9 msec
  3 132.2.3.2 [MPLS: Labels 2010/10012/8009 Exp 0] 41 msec 32 msec 36 msec
  4 132.2.10.10 [MPLS: Labels 10012/8009 Exp 0] 35 msec 31 msec 45 msec
  5 132.6.10.6 [MPLS: Labels 6012/8009 Exp 0] 36 msec 35 msec 37 msec
  6 10.8.14.8 [MPLS: Label 8009 Exp 0] 82 msec 13 msec 11 msec
  7 10.8.14.14 12 msec * 2 msec
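The outage estimate from the ping counters can be reproduced with a few lines of Python. This is only a sketch: the function name is hypothetical, it assumes every lost probe cost one full timeout, and it assumes the CLI truncates the success percentage to an integer.

```python
def ping_summary(sent: int, received: int, timeout_s: int):
    """Return (lost probes, integer success percent, estimated outage seconds).

    Assumption: each lost probe waited one full timeout before the next
    probe was sent, so lost * timeout approximates the outage.
    """
    if sent <= 0 or not 0 <= received <= sent:
        raise ValueError("invalid counters")
    lost = sent - received
    success_pct = (received * 100) // sent
    return lost, success_pct, lost * timeout_s


if __name__ == "__main__":
    # Counters from the RSVP hello test: 257 of 275 probes answered.
    lost, pct, outage = ping_summary(275, 257, 1)
    print(f"lost={lost} success={pct}% outage~{outage}s")
```

The result of roughly 18 seconds sits just under the 20-second worst case because the failure can occur partway through a hello interval.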

Using RSVP hellos has become uncommon since BFD was introduced. Like many other protocols, the TE-FRR process can be a BFD client, obviating the need to run a separate dead peer detection mechanism. We will enable this on all XE-to-XE links in the topology; XRv does not support BFD, and XR in general does not support RSVP slow hellos. The configuration is almost identical to the RSVP hellos except we add the word "bfd" to the end. This cannot coexist with the legacy RSVP hellos, which must be removed first on CSR3 and CSR10 before configuring this (not shown). Checking CSR3, we can see that BFD for FRR is now enabled instead of RSVP hellos.

! All XE MPLS routers
ip rsvp signalling hello bfd
interface GigabitEthernet2.XXX
 ip rsvp signalling hello bfd

CSR3#show ip rsvp hello
Hello:
  RSVP Hello for Fast-Reroute/Reroute: Disabled
    Statistics: Disabled
  BFD for Fast-Reroute/Reroute: Enabled
  RSVP Hello for Graceful Restart: Disabled

BFD TE-FRR neighbors won't show up until a TE tunnel runs through the link. As described in the BFD section, there is no neighbor discovery, but higher-level processes (TE-FRR) can notify BFD when they need fast fall-over. The signaling of the original LSP requesting FRR builds this. In this case, the main LSP is shut down on CSR1 but the FRR tunnel is up. On the PLR (CSR3), we can see the details of the BFD neighbor with respect to TE-FRR now. This is a host-downstream neighbor as it is the NHOP of the backup TE tunnel.

CSR3#show ip rsvp hello bfd nbr detail
Hello Client Neighbors
  Remote addr 132.2.3.2, Local addr 132.2.3.3
    Type: Active
    I/F: Gi2.523
    State: Up (for 00:17:38)
    Clients: HST
    LSPs protecting: 1 (frr: 0, hst upstream: 0 hst downstream: 1)
    Communication with neighbor lost: 0

CSR3#show bfd neighbors client te-frr
IPv4 Sessions
NeighAddr     LD/RD       RH/RS  State  Int
132.2.3.2     4100/4097   Up     Up     Gi2.523

Bringing up the original LSP adds two new TE-FRR clients to BFD, the PHOP (CSR1) and NHOP (CSR10), to use RSVP terminology. The RSVP details show the PHOP as a passive neighbor and the NHOP as an active neighbor. The NHOP neighbor is protected by FRR, which is indicated by the value "1"; the output above has a zero since the backup tunnel is not protected by FRR.

CSR3#show bfd neighbors client te-frr
IPv4 Sessions
NeighAddr     LD/RD       RH/RS  State  Int
132.1.3.1     4101/4100   Up     Up     Gi2.513
132.2.3.2     4100/4097   Up     Up     Gi2.523
132.3.10.10   4103/4104   Up     Up     Gi2.530

CSR3#show ip rsvp hello bfd nbr detail
Hello Client Neighbors
  Remote addr 132.1.3.1, Local addr 132.1.3.3
    Type: Passive
    I/F: Gi2.513
    State: Up (for 00:00:53)
    Clients: None
    LSPs protecting: 0
    Communication with neighbor lost: 0
  Remote addr 132.3.10.10, Local addr 132.3.10.3
    Type: Active
    I/F: Gi2.530
    State: Up (for 00:00:53)
    Clients: FRR
    LSPs protecting: 1 (frr: 1, hst upstream: 0 hst downstream: 0)
    Communication with neighbor lost: 0

We quickly check the RSVP RESV details on CSR1 to ensure the tunnel still has FRR and that enabling BFD did not break any RSVP signaling configuration. The PLR shows the backup tunnel as ready and shows the in/out labels for the original LSP from CSR1.

CSR1#show ip rsvp reservation detail filter session-type 7 tunnel-id 30 | begin RRO
  RRO:
    3.3.3.3/32, Flags:0x21 (Local Prot Avail/to NHOP, Node-id)
      Label subobject: Flags 0x1, C-Type 1, Label 3013
    10.10.10.10/32, Flags:0x20 (No Local Protection, Node-id)
      Label subobject: Flags 0x1, C-Type 1, Label 10014
    6.6.6.6/32, Flags:0x20 (No Local Protection, Node-id)
      Label subobject: Flags 0x1, C-Type 1, Label 6014
    8.8.8.8/32, Flags:0x20 (No Local Protection, Node-id)
      Label subobject: Flags 0x1, C-Type 1, Label 3
  Status:
    Policy: Accepted. Policy source(s): MPLS/TE

CSR3#show mpls traffic-eng fast-reroute database labels 3013
P2P LSP midpoint frr information:
LSP identifier     In-label  Out intf/label  FRR intf/label    Status
-----------------  --------  --------------  ----------------  ------
1.1.1.1 30 [112]   3013      Gi2.530:10014   Tu30:10014        ready

Starting a ping from CSR7 and breaking the link again by changing encapsulation, we see the failover takes about 1 second, as BFD requires 900 ms for dead peer detection based on the current configuration. The PLR marks this LSP as being actively protected, and the PLR's LFIB shows a label swap followed by a push. Not only did the traffic get re-routed faster, we also removed another set of packets (RSVP hellos) from the network, saving some bandwidth and router resources.

CSR7#ping 14.14.14.14 source 7.7.7.7 repeat 1000000 timeout 1
Type escape sequence to abort.
Sending 1000000, 100-byte ICMP Echos to 14.14.14.14, timeout is 1 seconds:
Packet sent with a source address of 7.7.7.7
!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!.!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
!!!!!!!!
Success rate is 99 percent (147/148), round-trip min/avg/max = 1/27/166 ms

CSR3#show mpls traffic-eng fast-reroute database labels 3013
P2P LSP midpoint frr information:
LSP identifier     In-label  Out intf/label  FRR intf/label    Status
-----------------  --------  --------------  ----------------  ------
1.1.1.1 30 [112]   3013      Gi2.530:10014   Tu30:10014        active

CSR3#show mpls forwarding-table labels 3013 detail
Local      Outgoing   Prefix             Bytes Label   Outgoing   Next Hop
Label      Label      or Tunnel Id       Switched      interface
3013       10014      1.1.1.1 30 [112]   20738         Tu30       point2point
        MAC/Encaps=18/26, MRU=1496, Label Stack{2014 10014}, via Gi2.523
        000C295CE1E9000C29D781FE81000DC38847 007DE0000271E000
        No output feature configured
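The second hex group in the LFIB detail output appears to be the imposed label stack in wire format. As a sanity check, the Python sketch below decodes such a string using the standard MPLS label stack entry layout (20-bit label, 3-bit EXP/TC, 1-bit bottom-of-stack, 8-bit TTL). Reading the CLI field this way is an inference from the matching values, and the helper names are hypothetical.

```python
from typing import List, NamedTuple


class LabelEntry(NamedTuple):
    label: int    # 20-bit label value
    tc: int       # 3-bit EXP / traffic class
    bottom: bool  # bottom-of-stack bit
    ttl: int      # 8-bit TTL


def decode_label_stack(hex_string: str) -> List[LabelEntry]:
    """Split a hex string into 4-byte MPLS label stack entries."""
    raw = bytes.fromhex(hex_string)
    if len(raw) % 4 != 0:
        raise ValueError("label stack must be a multiple of 4 bytes")
    entries = []
    for offset in range(0, len(raw), 4):
        word = int.from_bytes(raw[offset:offset + 4], "big")
        entries.append(LabelEntry(
            label=word >> 12,
            tc=(word >> 9) & 0x7,
            bottom=bool((word >> 8) & 0x1),
            ttl=word & 0xFF,
        ))
    return entries


if __name__ == "__main__":
    # Hex string from the CSR3 LFIB entry for local label 3013.
    for entry in decode_label_stack("007DE0000271E000"):
        print(entry)
```

The decoded labels are 2014 and 10014, matching `Label Stack{2014 10014}`. Neither entry has the bottom-of-stack bit set, which is consistent with the VPN label remaining underneath.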


CSR1 shows that the NHOP protection is now in use for this LSP. The flags change again to indicate this. Although the FRR behavior is identical with BFD and RSVP hellos, it is always recommended to use BFD when available.

CSR1#show ip rsvp reservation detail filter session-type 7 tunnel-id 30 | begin RRO
  RRO:
    3.3.3.3/32, Flags:0x23 (Local Prot Avail/In Use/to NHOP, Node-id)
      Label subobject: Flags 0x1, C-Type 1, Label 3013
    10.10.10.10/32, Flags:0x20 (No Local Protection, Node-id)
      Label subobject: Flags 0x1, C-Type 1, Label 10014
    6.6.6.6/32, Flags:0x20 (No Local Protection, Node-id)
      Label subobject: Flags 0x1, C-Type 1, Label 6014
    8.8.8.8/32, Flags:0x20 (No Local Protection, Node-id)
      Label subobject: Flags 0x1, C-Type 1, Label 3
  Status:
    Policy: Accepted. Policy source(s): MPLS/TE

We can also configure NNHOP protection, also called node protection. Rather than protect a link between two routers, we can avoid a node entirely. We will configure CSR10 with an FRR tunnel that avoids CSR6 completely, protecting the PLR's next-next-hop using any link colors. The destination of the tunnel should be the next-hop after the protected node, which is CSR8, the actual tail-end. CSR8 is technically also the MP in this case. After the tunnel is signaled, we can see that this NNHOP backup tunnel goes through CSR9 to CSR8.

! CSR10
ip explicit-path name PATH_AVOID_CSR6 enable
 exclude-address 6.6.6.6
interface Tunnel30
 description REPAIR PATH NNHOP (PLR)
 ip unnumbered Loopback0
 tunnel mode mpls traffic-eng
 tunnel destination 8.8.8.8
 tunnel mpls traffic-eng affinity 0x0 mask 0x0
 tunnel mpls traffic-eng path-option 10 explicit name PATH_AVOID_CSR6
interface GigabitEthernet2.560
 mpls traffic-eng backup-path Tunnel30

CSR10#show mpls traffic-eng tunnels tunnel 30 | section RSVP_Path
  RSVP Path Info:
    My Address: 132.9.10.10
    Explicit Route: 132.9.10.9 132.8.9.8 8.8.8.8
    Record Route: NONE
    Tspec: ave rate=0 kbits, burst=1000 bytes, peak rate=0 kbits


Checking CSR1, we can see that CSR10 is offering NNHOP (node) protection in addition to CSR3 offering NHOP (link) protection. Each individual hop in the TE LSP can offer different protection. We see that the NNHOP protection is available, but not in use, and the flags differ from the rest (0x29). In an ideal network, every single hop would have some kind of FRR protection.

CSR1#show ip rsvp reservation detail filter session-type 7 tunnel-id 30 | begin RRO
  RRO:
    3.3.3.3/32, Flags:0x21 (Local Prot Avail/to NHOP, Node-id)
      Label subobject: Flags 0x1, C-Type 1, Label 3014
    10.10.10.10/32, Flags:0x29 (Local Prot Avail/to NNHOP, Node-id)
      Label subobject: Flags 0x1, C-Type 1, Label 10010
    6.6.6.6/32, Flags:0x20 (No Local Protection, Node-id)
      Label subobject: Flags 0x1, C-Type 1, Label 6015
    8.8.8.8/32, Flags:0x20 (No Local Protection, Node-id)
      Label subobject: Flags 0x1, C-Type 1, Label 3
  Status:
    Policy: Accepted. Policy source(s): MPLS/TE
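The RRO flag values seen so far (0x20, 0x21, 0x23, 0x29) are bitmasks. The Python sketch below decodes them using the bit assignments from the RSVP-TE specifications (RFC 3209 for local protection available/in use, RFC 4090 for bandwidth and node protection, RFC 4561 for the node-id flag). The function and dictionary names are hypothetical, and the wording differs slightly from what IOS prints.

```python
# RRO IPv4 sub-object flag bits (RFC 3209, RFC 4090, RFC 4561).
RRO_FLAGS = {
    0x01: "Local protection available",
    0x02: "Local protection in use",
    0x04: "Bandwidth protection",
    0x08: "Node protection",
    0x20: "Node-id",
}


def decode_rro_flags(value: int) -> list:
    """Return the names of all known flag bits set in `value`."""
    return [name for bit, name in RRO_FLAGS.items() if value & bit]


if __name__ == "__main__":
    for flags in (0x20, 0x21, 0x23, 0x29):
        print(f"0x{flags:02X}: {', '.join(decode_rro_flags(flags))}")
```

With the node protection bit clear, IOS describes the protection as "to NHOP"; with it set, "to NNHOP". That single bit is the difference between 0x21 and 0x29.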

The RESV RRO shows label 10010 arriving at CSR10 along the LSP, so we can check the FRR database against this inbound label. Along the normal LSP, this inbound label is swapped to 6015, which is also present in the RESV RRO. When NNHOP protection is in use, we no longer need to preserve the NHOP label since CSR6 won't be seeing this traffic at all. Instead, we can swap label 10010 for label 9011, which is the label for the backup tunnel. Technically, this isn't a single label swap operation; it is a swap to implicit-null followed by a push of 9011, as we will see later. The router is probably smart enough to do a swap for efficiency reasons in this special case, as the LFIB suggests.

CSR10#show mpls traffic-eng fast-reroute database labels 10010 detail
[snip]
P2P LSPs:
  Tun ID: 30, LSP ID: 137, Source: 1.1.1.1
  Destination: 8.8.8.8
  State : ready
  InLabel : 10010
  OutLabel : Gi2.560:6015
  FRR OutLabel : Tu30:implicit-null

CSR10#show mpls traffic-eng tunnels tunnel 30 | include Label
  InLabel :
  OutLabel : GigabitEthernet2.590, 9011

Changing the encapsulation on CSR6 so the link to CSR10 fails, we see the PLR in action. Traffic is redirected into the tunnel and the LFIB reflects this. The FRR database is updated and a RESV message is sent to CSR1 notifying it that NNHOP protection is now activated on CSR10.

CSR10#show mpls forwarding-table labels 10010 detail
Local      Outgoing   Prefix             Bytes Label   Outgoing   Next Hop
Label      Label      or Tunnel Id       Switched      interface
10010      Pop Label  1.1.1.1 30 [137]   4888          Tu30       point2point
        MAC/Encaps=18/22, MRU=1500, Label Stack{9011}, via Gi2.590
        000C29E04F84000C2949716481000E068847 02333000
        No output feature configured

CSR10#show mpls traffic-eng fast-reroute database labels 10010
P2P LSP midpoint frr information:
LSP identifier     In-label  Out intf/label  FRR intf/label    Status
-----------------  --------  --------------  ----------------  ------
1.1.1.1 30 [137]   10010     Gi2.560:6015    Tu30:implicit-nu  active

CSR1#show ip rsvp reservation detail filter session-type 7 tunnel-id 30 | begin RRO
  RRO:
    3.3.3.3/32, Flags:0x21 (Local Prot Avail/to NHOP, Node-id)
      Label subobject: Flags 0x1, C-Type 1, Label 3014
    10.10.10.10/32, Flags:0x2B (Local Prot Avail/In Use/to NNHOP, Node-id)
      Label subobject: Flags 0x1, C-Type 1, Label 10010
    6.6.6.6/32, Flags:0x20 (No Local Protection, Node-id)
      Label subobject: Flags 0x1, C-Type 1, Label 6015
    8.8.8.8/32, Flags:0x20 (No Local Protection, Node-id)
      Label subobject: Flags 0x1, C-Type 1, Label 3
  Status:
    Policy: Accepted. Policy source(s): MPLS/TE

We confirm connectivity through this new path within the VPN. Notice there aren't 3 labels anymore since we don't need to tunnel traffic to the next-hop while preserving the original TE label. We can swap it to the FRR label since we are essentially taking a new path. Technically, we "pushed" the NNHOP's label to reach the tail-end, but since it was the same router, no additional label was needed. The next-next-hop label was 3, which means the tail-end and MP are the same router.

CSR7#traceroute 14.14.14.14 source 7.7.7.7
Type escape sequence to abort.
Tracing the route to 14.14.14.14
VRF info: (vrf in name/id, vrf out name/id)
  1 10.1.7.1 0 msec 1 msec 1 msec
  2 132.1.3.3 [MPLS: Labels 3014/8009 Exp 0] 2 msec 2 msec 21 msec
  3 132.3.10.10 [MPLS: Labels 10010/8009 Exp 0] 49 msec 31 msec 35 msec
  4 132.9.10.9 [MPLS: Labels 9011/8009 Exp 0] 31 msec 31 msec 31 msec
  5 10.8.14.8 [MPLS: Label 8009 Exp 0] 50 msec 1 msec 7 msec
  6 10.8.14.14 24 msec * 2 msec
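The label handling at the PLR can be summarized as: replace the incoming TE label with the label the merge point expects, then push the backup tunnel label, omitting any label whose value is implicit-null (3). The Python sketch below models only that rule. It is a simplified illustration with hypothetical names, not how the router's forwarding code is written.

```python
IMPLICIT_NULL = 3  # reserved label value meaning "do not impose a label"


def frr_label_stack(merge_point_label: int, backup_tunnel_label: int) -> list:
    """Return the TE labels (top first) a PLR imposes during facility backup.

    merge_point_label: label the MP expects for the protected LSP
                       (the NHOP's or NNHOP's label from the RESV RRO).
    backup_tunnel_label: label for the backup tunnel's own LSP.
    """
    stack = []
    if backup_tunnel_label != IMPLICIT_NULL:
        stack.append(backup_tunnel_label)
    if merge_point_label != IMPLICIT_NULL:
        stack.append(merge_point_label)
    return stack


if __name__ == "__main__":
    # NHOP protection on CSR3: MP (CSR10) expects 10014, backup label is 2014.
    print(frr_label_stack(10014, 2014))  # [2014, 10014]
    # NNHOP protection on CSR10: MP is the tail-end (label 3), backup label is 9011.
    print(frr_label_stack(IMPLICIT_NULL, 9011))  # [9011]
```

The VPN label (8009 in the traceroutes) sits below these labels and is untouched by FRR, which is why the NHOP case shows three labels in total and the NNHOP-to-tail case shows two.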

We can also have multiple FRR backup tunnels in use at the same time protecting the same LSP. Assuming the CSR3-CSR10 link fails at the same time CSR6 fails, both the NHOP and NNHOP backups can be in use concurrently (link breaks not shown). We can verify this on each PLR by checking the FRR database and seeing both entries as "active". We can also verify within the VPN using traceroute; we see the NHOP protection across CSR2 and the NNHOP protection across CSR9 concurrently. Note the LSP information (Tunnel ID 30, LSP ID 164) in both outputs, which proves it is the same TE tunnel.

CSR3#show mpls traffic-eng fast-reroute database
P2P LSP midpoint frr information:
LSP identifier     In-label  Out intf/label  FRR intf/label    Status
-----------------  --------  --------------  ----------------  ------
1.1.1.1 30 [164]   3014      Gi2.530:10013   Tu30:10013        active

CSR10#show mpls traffic-eng fast-reroute database
P2P LSP midpoint frr information:
LSP identifier     In-label  Out intf/label  FRR intf/label    Status
-----------------  --------  --------------  ----------------  ------
1.1.1.1 30 [164]   10013     Gi2.560:6010    Tu30:implicit-nu  active

CSR7#traceroute 14.14.14.14 source 7.7.7.7
Type escape sequence to abort.
Tracing the route to 14.14.14.14
VRF info: (vrf in name/id, vrf out name/id)
  1 10.1.7.1 0 msec 1 msec 1 msec
  2 132.1.3.3 [MPLS: Labels 3014/8009 Exp 0] 2 msec 2 msec 9 msec
  3 132.2.3.2 [MPLS: Labels 2015/10013/8009 Exp 0] 37 msec 41 msec 31 msec
  4 132.2.10.10 [MPLS: Labels 10013/8009 Exp 0] 36 msec 35 msec 31 msec
  5 132.9.10.9 [MPLS: Labels 9011/8009 Exp 0] 40 msec 36 msec 40 msec
  6 10.8.14.8 [MPLS: Label 8009 Exp 0] 16 msec 15 msec 20 msec
  7 10.8.14.14 24 msec * 2 msec

A given node may offer both NHOP and NNHOP protection concurrently. A router can provide multiple backup tunnels on a given interface as well. On CSR3, we will configure NNHOP protection (in addition to the NHOP protection) using a new tunnel (ID 31). This will prefer a green path to the NNHOP, which is CSR6. We quickly verify the tunnel path; interestingly, it traverses CSR1, which looks like a loop but technically is not, since this is a different LSP. The path continues through CSR9 to merge at CSR6.

! CSR3
ip explicit-path name PATH_AVOID_CSR10 enable
 exclude-address 10.10.10.10
interface Tunnel31
 description REPAIR PATH NNHOP (PLR)
 ip unnumbered Loopback0
 tunnel mode mpls traffic-eng
 tunnel destination 6.6.6.6
 tunnel mpls traffic-eng affinity 0x2 mask 0x2
 tunnel mpls traffic-eng path-option 10 explicit name PATH_AVOID_CSR10
interface GigabitEthernet2.530
 mpls traffic-eng backup-path Tunnel31

CSR3#show mpls traffic-eng tunnels tunnel 31 | section RSVP_Path
  RSVP Path Info:
    My Address: 132.1.3.3
    Explicit Route: 132.1.3.1 132.1.9.9 132.6.9.6 6.6.6.6
    Record Route: NONE
    Tspec: ave rate=0 kbits, burst=1000 bytes, peak rate=0 kbits

We can check the FRR database to see that the original tunnel is now being backed up by the new NNHOP tunnel and not the original NHOP tunnel at CSR3. TE tunnels requesting FRR will always prefer NNHOP over NHOP when it is available (there are other comparison criteria which are evaluated in detail later). We confirm this protection on CSR1; notice that NNHOP protection is available at both CSR3 and CSR10 now.

CSR3#show mpls traffic-eng fast-reroute database backup-interface tunnel 31 detail
[snip]
P2P LSPs:
  Tun ID: 30, LSP ID: 212, Source: 1.1.1.1
  Destination: 8.8.8.8
  State : ready
  InLabel : 3020
  OutLabel : Gi2.530:10013
  FRR OutLabel : Tu31:6010

CSR1#show ip rsvp reservation detail filter session-type 7 tunnel-id 30 | begin RRO
  RRO:
    3.3.3.3/32, Flags:0x29 (Local Prot Avail/to NNHOP, Node-id)
      Label subobject: Flags 0x1, C-Type 1, Label 3020
    10.10.10.10/32, Flags:0x29 (Local Prot Avail/to NNHOP, Node-id)
      Label subobject: Flags 0x1, C-Type 1, Label 10013
    6.6.6.6/32, Flags:0x20 (No Local Protection, Node-id)
      Label subobject: Flags 0x1, C-Type 1, Label 6010
    8.8.8.8/32, Flags:0x20 (No Local Protection, Node-id)
      Label subobject: Flags 0x1, C-Type 1, Label 3
  Status:
    Policy: Accepted. Policy source(s): MPLS/TE

Notice that the outgoing label in the FRR database is the NNHOP label of 6010. Normally, CSR10 would have performed the swap from 10013 to 6010, but assuming CSR10 fails, CSR3 will perform the swap from 3020 to 6010 itself, and then push the FRR label of 1015.

CSR3#show mpls traffic-eng tunnels tunnel 31 | include Label
  InLabel : -
  OutLabel : GigabitEthernet2.513, 1015

CSR3#show mpls traffic-eng fast-reroute database backup-interface tunnel 31
[snip]
P2P LSP midpoint frr information:
LSP identifier     In-label  Out intf/label  FRR intf/label    Status
-----------------  --------  --------------  ----------------  ------
1.1.1.1 30 [212]   3020      Gi2.530:10013   Tu31:6010         ready

We quickly test this new NNHOP protection by changing the encapsulation on CSR10's link to CSR3. Rather than avoiding that link, CSR3 will now avoid CSR10 entirely, making CSR6 the MP and using CSR6's local label for the original LSP for merging. The FRR database marks this entry as active since FRR is in effect.

CSR3#show mpls traffic-eng fast-reroute database backup-interface tunnel 31
[snip]
P2P LSP midpoint frr information:
LSP identifier     In-label  Out intf/label  FRR intf/label    Status
-----------------  --------  --------------  ----------------  ------
1.1.1.1 30 [212]   3020      Gi2.530:10013   Tu31:6010         active

CSR3#show mpls forwarding-table labels 3020 detail
Local      Outgoing   Prefix             Bytes Label   Outgoing   Next Hop
Label      Label      or Tunnel Id       Switched      interface
3020       6010       1.1.1.1 30 [212]   5792          Tu31       point2point
        MAC/Encaps=18/26, MRU=1496, Label Stack{1015 6010}, via Gi2.513
        000C29FBA339000C29D781FE81000DB98847 003F70000177A000
        No output feature configured

CSR1#show ip rsvp reservation detail filter session-type 7 tunnel-id 30 | begin RRO
  RRO:
    3.3.3.3/32, Flags:0x2B (Local Prot Avail/In Use/to NNHOP, Node-id)
      Label subobject: Flags 0x1, C-Type 1, Label 3020
    10.10.10.10/32, Flags:0x29 (Local Prot Avail/to NNHOP, Node-id)
      Label subobject: Flags 0x1, C-Type 1, Label 10013
    6.6.6.6/32, Flags:0x20 (No Local Protection, Node-id)
      Label subobject: Flags 0x1, C-Type 1, Label 6010
    8.8.8.8/32, Flags:0x20 (No Local Protection, Node-id)
      Label subobject: Flags 0x1, C-Type 1, Label 3
  Status:
    Policy: Accepted. Policy source(s): MPLS/TE

CSR7#traceroute 14.14.14.14 source 7.7.7.7
Type escape sequence to abort.
Tracing the route to 14.14.14.14
VRF info: (vrf in name/id, vrf out name/id)
  1 10.1.7.1 1 msec 0 msec 0 msec
  2 132.1.3.3 [MPLS: Labels 3020/8009 Exp 0] 3 msec 18 msec 98 msec
  3 132.1.3.1 [MPLS: Labels 1015/6010/8009 Exp 0] 19 msec 26 msec 47 msec
  4 132.1.9.9 [MPLS: Labels 9016/6010/8009 Exp 0] 51 msec 161 msec 26 msec
  5 132.6.9.6 [MPLS: Labels 6010/8009 Exp 0] 39 msec 39 msec 30 msec
  6 10.8.14.8 [MPLS: Label 8009 Exp 0] 36 msec 39 msec 34 msec
  7 10.8.14.14 25 msec * 2 msec

We can also set a special flag in the RSVP PATH message to signal that node protection should be preferred over link protection. Because IOS XE and XR both automatically prefer NNHOP over NHOP, this option doesn't do much, but it may help in multi-vendor environments where the explicit flag in the RSVP PATH message is used to select NNHOP. The PATH and RESV messages are shown below. The PATH message shows the "node-protection desired" flag while the RESV message is unchanged, since NNHOP protection is already preferred on CSR3 (PLR).

! CSR1
interface Tunnel30
 tunnel mpls traffic-eng fast-reroute node-protect

CSR1#show ip rsvp sender detail filter session-type 7 tunnel-id 30 | section Flags
  Flags: (0x17) Local Prot desired, Label Recording, SE Style
         Node Prot desired

CSR1#show ip rsvp reservation detail filter session-type 7 tunnel-id 30 | begin RRO
  RRO:
    3.3.3.3/32, Flags:0x29 (Local Prot Avail/to NNHOP, Node-id)
      Label subobject: Flags 0x1, C-Type 1, Label 3022
    10.10.10.10/32, Flags:0x29 (Local Prot Avail/to NNHOP, Node-id)
      Label subobject: Flags 0x1, C-Type 1, Label 10022
    6.6.6.6/32, Flags:0x20 (No Local Protection, Node-id)
      Label subobject: Flags 0x1, C-Type 1, Label 6016
    8.8.8.8/32, Flags:0x20 (No Local Protection, Node-id)
      Label subobject: Flags 0x1, C-Type 1, Label 3
  Status:
    Policy: Accepted. Policy source(s): MPLS/TE
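The PATH message flags value (0x17) lives in the SESSION_ATTRIBUTE object and is also a bitmask. The Python sketch below decodes it using the bit assignments from RFC 3209 (local protection, label recording, SE style) and RFC 4090 (bandwidth and node protection desired). The names in the code are hypothetical.

```python
# SESSION_ATTRIBUTE flag bits (RFC 3209 and RFC 4090).
SESSION_ATTR_FLAGS = {
    0x01: "Local protection desired",
    0x02: "Label recording desired",
    0x04: "SE style desired",
    0x08: "Bandwidth protection desired",
    0x10: "Node protection desired",
}


def decode_session_attr_flags(value: int) -> list:
    """Return the names of all known flag bits set in `value`."""
    return [name for bit, name in SESSION_ATTR_FLAGS.items() if value & bit]


if __name__ == "__main__":
    # Value signaled by CSR1 after configuring "fast-reroute node-protect".
    for name in decode_session_attr_flags(0x17):
        print(name)
```

Decoding 0x17 yields local protection, label recording, SE style, and node protection, with the bandwidth protection bit (0x08) clear.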

Another special flag can be assigned to request bandwidth protection. This is used as a tie-breaker when the LSPs requesting FRR need more aggregate bandwidth than the set of backup tunnels offers. LSPs requesting FRR ask for bandwidth, and backup tunnels can provide it. Reserving bandwidth on backup tunnels can be wasteful/inefficient (it is an option, though). Alternatively, we can specify the backup bandwidth on the repair tunnels. This serves as an admission control mechanism to ensure that, when using facility backups, backup tunnels are not oversubscribed by primary TE LSPs that require dedicated bandwidth. The bandwidth is not actually reserved in the TED, but it does help balance the FRR flows across various backup paths when a failure occurs. The main tunnel requests 1 Mbps of bandwidth, while the NHOP and NNHOP backups on CSR3 offer 3 Mbps and 2 Mbps of backup bandwidth, respectively. We can see the PATH message "bandwidth-protection desired" flag set, and the RESV message also indicates that bandwidth protection is enabled for that hop using the NNHOP tunnel.

! CSR1
interface Tunnel30
 tunnel mpls traffic-eng bandwidth 1000
 tunnel mpls traffic-eng fast-reroute bw-protect
! CSR3
interface Tunnel30
 tunnel mpls traffic-eng backup-bw 3000
interface Tunnel31
 tunnel mpls traffic-eng backup-bw 2000

CSR1#show ip rsvp sender detail filter session-type 7 tunnel-id 30 | section Flags
  Flags: (0x1F) Local Prot desired, Label Recording, SE Style, Bandwidth Prot desired

CSR1#show ip rsvp reservation detail filter session-type 7 tunnel-id 30 | begin RRO
  RRO:
    3.3.3.3/32, Flags:0x2D (Local Prot Avail/Has BW/to NNHOP, Node-id)
      Label subobject: Flags 0x1, C-Type 1, Label 3018
    10.10.10.10/32, Flags:0x29 (Local Prot Avail/to NNHOP, Node-id)
      Label subobject: Flags 0x1, C-Type 1, Label 10014
    6.6.6.6/32, Flags:0x20 (No Local Protection, Node-id)
      Label subobject: Flags 0x1, C-Type 1, Label 6016
    8.8.8.8/32, Flags:0x20 (No Local Protection, Node-id)
      Label subobject: Flags 0x1, C-Type 1, Label 3
  Status:
    Policy: Accepted. Policy source(s): MPLS/TE

If we configure the main TE LSP to request more bandwidth than the NNHOP tunnel can provide, but the NHOP tunnel can provide it, the NHOP tunnel is preferred since bandwidth-protection was enabled. NNHOP tunnels with no bandwidth are still preferred over NHOP tunnels with explicit bandwidth, even when bandwidth-protection is configured. This is because a backup tunnel with "no" bandwidth is assumed to have unlimited bandwidth. The exact sequence of preferences is covered in more detail in the DS-TE section.

! CSR1
interface Tunnel30
 tunnel mpls traffic-eng bandwidth 3000

CSR1#show ip rsvp reservation detail filter session-type 7 tunnel-id 30 | begin RRO
  RRO:
    3.3.3.3/32, Flags:0x25 (Local Prot Avail/Has BW/to NHOP, Node-id)
      Label subobject: Flags 0x1, C-Type 1, Label 3017
    10.10.10.10/32, Flags:0x29 (Local Prot Avail/to NNHOP, Node-id)
      Label subobject: Flags 0x1, C-Type 1, Label 10022
    6.6.6.6/32, Flags:0x20 (No Local Protection, Node-id)
      Label subobject: Flags 0x1, C-Type 1, Label 6017
    8.8.8.8/32, Flags:0x20 (No Local Protection, Node-id)
      Label subobject: Flags 0x1, C-Type 1, Label 3
  Status:
    Policy: Accepted. Policy source(s): MPLS/TE
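The preference rules described so far can be captured in a short Python sketch. It models only the three behaviors stated in the text (NNHOP preferred over NHOP, bandwidth-protected LSPs preferring a tunnel that can carry them, and "no" backup bandwidth meaning unlimited). The real selection logic has additional criteria, and all names here are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class BackupTunnel:
    name: str
    nnhop: bool                    # True = node protection, False = link protection
    backup_bw_kbps: Optional[int]  # None = no backup-bw configured (treated as unlimited)


def select_backup(primary_bw_kbps: int, bw_protect: bool,
                  candidates: List[BackupTunnel]) -> Optional[BackupTunnel]:
    """Pick a backup tunnel using the simplified rules from the text."""
    def can_carry(tunnel: BackupTunnel) -> bool:
        return tunnel.backup_bw_kbps is None or tunnel.backup_bw_kbps >= primary_bw_kbps

    pool = list(candidates)
    if bw_protect:
        fitting = [t for t in pool if can_carry(t)]
        if fitting:
            pool = fitting
    if not pool:
        return None
    # Among the remaining tunnels, NNHOP wins over NHOP.
    return max(pool, key=lambda t: t.nnhop)


if __name__ == "__main__":
    csr3 = [BackupTunnel("Tunnel30", nnhop=False, backup_bw_kbps=3000),
            BackupTunnel("Tunnel31", nnhop=True, backup_bw_kbps=2000)]
    print(select_backup(1000, True, csr3).name)  # Tunnel31
    print(select_backup(3000, True, csr3).name)  # Tunnel30
```

The two calls mirror the lab: with a 1 Mbps request the NNHOP tunnel is chosen, and with a 3 Mbps request only the NHOP tunnel has enough backup bandwidth.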

Bandwidth protection takes precedence over node protection when both flags are set concurrently in the PATH message. Even after adding the node-protection flag to the main LSP head-end, the NHOP tunnel is preferred due to the bandwidth-protection request.
! CSR1
interface Tunnel30
 tunnel mpls traffic-eng fast-reroute bw-protect node-protect
CSR1#show ip rsvp sender detail filter session-type 7 tunnel-id 30 | section Flags
  Flags: (0x1F) Local Prot desired, Label Recording, SE Style, Bandwidth Prot desired
         Node Prot desired
CSR1#show ip rsvp reservation detail filter session-type 7 tunnel-id 30 | begin RRO
  RRO:
    3.3.3.3/32, Flags:0x25 (Local Prot Avail/Has BW/to NHOP, Node-id)
      Label subobject: Flags 0x1, C-Type 1, Label 3022
    10.10.10.10/32, Flags:0x29 (Local Prot Avail/to NNHOP, Node-id)
      Label subobject: Flags 0x1, C-Type 1, Label 10012
    6.6.6.6/32, Flags:0x20 (No Local Protection, Node-id)
      Label subobject: Flags 0x1, C-Type 1, Label 6024
    8.8.8.8/32, Flags:0x20 (No Local Protection, Node-id)
      Label subobject: Flags 0x1, C-Type 1, Label 3
  Status:
  Policy: Accepted. Policy source(s): MPLS/TE

Because the vast majority of the TE-FRR configuration and complexity lies with the PLR, we will not be testing this again with XRv, as it does not support BFD. However, we will configure dummy tunnels on XRv11 to show the syntax for the features and perform basic control-plane verification of the RSVP signaling. The head-end LSP is shown below, which requests FRR along with node protection and bandwidth protection. We also verify that the tunnel comes up.


! XRv11
explicit-path name PATH_11_5_2_3_10_6_8
 index 10 next-address strict ipv4 unicast 5.5.5.5
 index 20 next-address strict ipv4 unicast 2.2.2.2
 index 30 next-address strict ipv4 unicast 3.3.3.3
 index 40 next-address strict ipv4 unicast 10.10.10.10
 index 50 next-address strict ipv4 unicast 6.6.6.6
 index 60 next-address strict ipv4 unicast 8.8.8.8
interface tunnel-te30
 description BASIC TE FRR
 ipv4 unnumbered Loopback0
 logging events all
 signalled-bandwidth 3000
 destination 8.8.8.8
 fast-reroute protect node bandwidth
 affinity ignore
 path-option 10 explicit name PATH_11_5_2_3_10_6_8
RP/0/0/CPU0:XRv11#show mpls traffic-eng tunnels 30 brief
TUNNEL NAME     DESTINATION     STATUS  STATE
tunnel-te30     8.8.8.8         up      up

We can check the locally generated RSVP PATH message to verify the flags: local protection, node protection, and bandwidth protection are all enabled. Because we have built this tunnel over the CSR3-CSR10 link, we have the NHOP and NNHOP tunnel protection there, as well as the NNHOP protection for CSR10-CSR6. The RSVP RESV message doesn't show this in XR, so we look at the TE tunnel details instead. Specifically, we can see there is protection available, along with bandwidth protection, at CSR3. The word "node" does not appear, so we assume this is NHOP protection. At CSR10, we have available "node" protection (NNHOP) but no bandwidth protection. The XR outputs are more difficult to interpret than the XE ones inside the RESV message, but the summary information shown near the beginning is succinct and clear.
RP/0/0/CPU0:XRv11#show rsvp sender session-type lsp-p2p dst-port 30 detail
PATH: IPv4-LSP Session addr: 8.8.8.8. TunID: 30. LSPId: 2.
 Source addr: 11.11.11.11. ExtID: 11.11.11.11.
 Prot: Local, Node, BW. Backup tunnel: No.
 Setup Priority: 7, Reservation Priority: 7
 Rate: 3M bits/sec. Burst: 1K bytes. Peak: 3M bits/sec.
[snip]
RP/0/0/CPU0:XRv11#show mpls traffic-eng tunnels 30 detail | begin Resv Info
  Resv Info:
    Record Route:
      IPv4 5.5.5.5, flags 0x20 (Node-ID)
      Label 5013, flags 0x1
      IPv4 2.2.2.2, flags 0x20 (Node-ID)
      Label 2017, flags 0x1
      IPv4 3.3.3.3, flags 0x25 (Node-ID, Protection: available, b/w)
      Label 3021, flags 0x1
      IPv4 10.10.10.10, flags 0x29 (Node-ID, Protection: available, node)
      Label 10016, flags 0x1
      IPv4 6.6.6.6, flags 0x20 (Node-ID)
      Label 6012, flags 0x1
      IPv4 8.8.8.8, flags 0x20 (Node-ID)
      Label 3, flags 0x1
    Fspec: avg rate=3000 kbits, burst=1000 bytes, peak rate=3000 kbits

We can quickly check the FRR databases on the PLRs, CSR3 and CSR10, to verify this protected tunnel is shown. The RESV information shown above gives us the labels per hop of the original LSP, so we can use those as input for the FRR database to specifically look at the TE tunnel from XRv11 to CSR8.
CSR3#show mpls traffic-eng fast-reroute database labels 3021
P2P LSP midpoint frr information:
LSP identifier        In-label  Out intf/label   FRR intf/label    Status
--------------------  --------  ---------------  ----------------  ------
11.11.11.11 30 [2]    3021      Gi2.530:10016    Tu30:10016        ready
CSR10#show mpls traffic-eng fast-reroute database labels 10016
P2P LSP midpoint frr information:
LSP identifier        In-label  Out intf/label   FRR intf/label    Status
--------------------  --------  ---------------  ----------------  ------
11.11.11.11 30 [2]    10016     Gi2.560:6012     Tu30:implicit-nu  ready

Backup tunnel configuration is similar to XE. The backup tunnel itself has no special characteristics, so its details are not shown. Assigning a backup tunnel to a link is shown below.
! XRv11
interface tunnel-te31
 description DUMMY BACKUP PATH
 shutdown
mpls traffic-eng
 interface GigabitEthernet0/0/0/0.551
  backup-path tunnel-te 31

Link and node protection are the best ways to do TE-FRR. An alternative exists which is a little slower and less scalable, but not nearly as slow as having disparate path-options. This is called path protection, which is configured on a per path-option basis and is used to back up that path-option. In addition to the main path-option, this “protecting” or “backup” path is pre-signaled. In IOS, backup paths must be explicit, while the main path can be explicit or dynamic. As soon as the head-end receives the PATHERR message from the network core, it can immediately switch traffic over into the backup tunnel. This is still slower than NHOP/NNHOP protection since local repair is always faster, but it does not require any configuration changes in the network core. This might be a compromise if different organizations/administrators control P routers versus PE routers. We can create a list of backup path-options to allow multiple choices for a single path-option we want to back up. Only the best option in the list is pre-signaled. These backup paths cannot use NHOP/NNHOP protection for themselves, since they are backup paths already. For efficiency, we can re-use some older explicit-paths for backup options. There is no reliance on RESV RROs with label recording, as the head-end performs the rerouting (no concept of PLR).
! CSR1
ip explicit-path name EP_1_4_3_10_11 enable
 next-address 4.4.4.4
 next-address 3.3.3.3
 next-address 10.10.10.10
 next-address 11.11.11.11
mpls traffic-eng path-option list name POPT_LIST
 path-option 100 explicit name PATH_1_3_2_10_11
 path-option 110 explicit name EP_NO_R9
interface Tunnel32
 description PATH PROTECTION
 ip unnumbered Loopback0
 tunnel mode mpls traffic-eng
 tunnel destination 11.11.11.11
 tunnel mpls traffic-eng autoroute destination
 tunnel mpls traffic-eng affinity 0x0 mask 0x0
 tunnel mpls traffic-eng path-option 10 explicit name EP_1_4_3_10_11
 tunnel mpls traffic-eng path-option protect 10 list name POPT_LIST bandwidth 3000

CSR1 now shows two tunnels from 1.1.1.1 to 11.11.11.11. Since we created a 3 Mbps bandwidth reservation for the backup path, we can easily tell the difference at a glance. It is clear that both paths are properly signaled. Also notice that label 4014 is used for the primary path with label 3022 used for the backup path.
CSR1#show ip rsvp reservation filter session-type 7 destination 11.11.11.11
Destination   Tun Sender   TunID  LSPID  Next Hop    I/F      Fi  Serv  BPS
11.11.11.11   1.1.1.1      32     33     132.1.3.3   Gi2.513  SE  LOAD  3M
11.11.11.11   1.1.1.1      32     34     132.1.4.4   Gi2.514  SE  LOAD  0
CSR1#show ip rsvp reservation detail filter session-type 7 destination 11.11.11.11 | include Label|LSP_ID
  Tun Sender: 1.1.1.1 LSP ID: 33
  Label: 3022 (outgoing)
  Tun Sender: 1.1.1.1 LSP ID: 34
  Label: 4014 (outgoing)


Verifying the tunnel status, we see that the ordinary path-option 10 is used for the tunnel setup. The path protection engine tells us that the primary and backup paths share 1 link and 2 nodes, which is not ideal. Normally you would ensure the two paths were as different as possible so that the failure of a single link/node doesn't break both paths. Path-option 100 was the most desirable backup option and is the "basis for protect".
CSR1#show mpls traffic-eng tunnels tunnel 32 | section Status
  Status:
    Admin: up Oper: up Path: valid Signalling: connected
    path option 10, type explicit EP_1_4_3_10_11 (Basis for Setup, path weight 40)
    Path Protection: 1 Common Link(s), 2 Common Node(s)
    path protect option 10, type list name POPT_LIST
      Inuse path-option 100, type explicit (Basis for Protect, path weight 50)

For additional detail on the two paths, we can look at the specific protection attributes of the tunnel. We can see that the CSR10-XRv11 connection is the common link, with CSR3 and CSR10 being the common nodes.
CSR1#show mpls traffic-eng tunnels tunnel 32 protection
PATH PROTECTION
  LSP Head, Tunnel32, Admin: up, Oper: up
  Src 1.1.1.1, Dest 11.11.11.11, Instance 34
  Fast Reroute Protection: None
  Path Protection: 1 Common Link(s), 2 Common Node(s)
    Primary lsp path:132.1.4.4 132.3.4.3 132.3.10.10 132.10.11.11 11.11.11.11
    Protect lsp path:132.1.3.3 132.3.9.9 132.2.9.2 132.2.10.10 132.10.11.11 11.11.11.11
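The "1 Common Link(s), 2 Common Node(s)" figures can be reproduced from the two hop lists. This is a rough sketch, assuming the lab addressing convention inferred from the outputs (an interface address 132.A.B.N sits on the link between routers A and B and belongs to router N) and assuming the tail-end is not counted as a common node. The function names are hypothetical, and this is not how IOS computes the values internally.

```python
def parse_hop(addr):
    """Split 132.A.B.N into (link, node) under the assumed lab addressing."""
    _, a, b, node = (int(octet) for octet in addr.split("."))
    return frozenset((a, b)), node


def shared_resources(primary, protect):
    """Return (common_links, common_nodes) for two lists of interface hops."""
    p_links, p_nodes = zip(*(parse_hop(hop) for hop in primary))
    b_links, b_nodes = zip(*(parse_hop(hop) for hop in protect))
    tail = p_nodes[-1]  # both LSPs must end here, so it is not counted
    common_links = set(p_links) & set(b_links)
    common_nodes = (set(p_nodes) & set(b_nodes)) - {tail}
    return common_links, common_nodes


# Interface hops from the protection output (final loopback entry omitted).
primary = ["132.1.4.4", "132.3.4.3", "132.3.10.10", "132.10.11.11"]
protect = ["132.1.3.3", "132.3.9.9", "132.2.9.2", "132.2.10.10", "132.10.11.11"]

links, nodes = shared_resources(primary, protect)
print(len(links), "common link(s):", sorted(tuple(sorted(link)) for link in links))
print(len(nodes), "common node(s):", sorted(nodes))
```

The result is the CSR10-XRv11 link plus nodes 3 and 10, which agrees with the router's own summary. A fully diverse backup would return two empty sets.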

A quick look on XRv11 shows both tunnels terminating. This helps prove that the backup path is fully signaled in advance.
RP/0/0/CPU0:XRv11#show rsvp reservation destination 11.11.11.11
Destination Add DPort Source Add      SPort Pro Input IF   Sty Serv Rate   Burst
--------------- ----- --------------- ----- --- ---------- --- ---- ------ -----
11.11.11.11        32 1.1.1.1            33   0 No         SE  LOAD 3000K  1K
11.11.11.11        32 1.1.1.1            34   0 No         SE  LOAD 0      1K

Using autoroute destination for simplicity, traffic from CSR1 to XRv11 now traverses this tunnel. We can test it from inside the VPN, expecting to see label 4014 at the first MPLS hop.


CSR7#traceroute 13.13.13.13 source 7.7.7.7
Type escape sequence to abort.
Tracing the route to 13.13.13.13
VRF info: (vrf in name/id, vrf out name/id)
  1 10.1.7.1 1 msec 0 msec 1 msec
  2 132.1.4.4 [MPLS: Labels 4014/91012 Exp 0] 37 msec 25 msec 25 msec
  3 132.3.4.3 [MPLS: Labels 3023/91012 Exp 0] 15 msec 17 msec 16 msec
  4 132.3.10.10 [MPLS: Labels 10017/91012 Exp 0] 21 msec 46 msec 38 msec
  5 132.10.11.11 [MPLS: Label 91012 Exp 0] 28 msec 25 msec 25 msec
  6 10.11.13.13 25 msec * 40 msec

Assuming CSR4's link to CSR1 fails (not shown), the backup path should route around it. We know this is not a shared link or shared node, so the backup path is still effective according to the protection details we saw earlier. Shortly after shutting down this interface, we send additional VPN traffic to verify the new label 3022 is used along the backup LSP. We can also verify the tunnel status to see the backup path in use. The closer the failure is to the head-end, the less distance the PATHERR has to travel, which means faster convergence. With NHOP/NNHOP FRR, the failure location is irrelevant, which leads to more consistent FRR behavior.
CSR7#traceroute 13.13.13.13 source 7.7.7.7
Type escape sequence to abort.
Tracing the route to 13.13.13.13
VRF info: (vrf in name/id, vrf out name/id)
  1 10.1.7.1 0 msec 1 msec 1 msec
  2 132.1.3.3 [MPLS: Labels 3022/91012 Exp 0] 18 msec 16 msec 24 msec
  3 132.3.9.9 [MPLS: Labels 9024/91012 Exp 0] 22 msec 24 msec 25 msec
  4 132.2.9.2 [MPLS: Labels 2022/91012 Exp 0] 20 msec 17 msec 20 msec
  5 132.2.10.10 [MPLS: Labels 10016/91012 Exp 0] 38 msec 38 msec 33 msec
  6 132.10.11.11 [MPLS: Label 91012 Exp 0] 33 msec 25 msec 21 msec
  7 10.11.13.13 29 msec * 20 msec

To verify the control-plane in greater detail, we can check CSR1's tunnel information. Several log messages indicate that the main path via CSR4 is no longer available and that path-protection is now in use.
! CSR1
%MPLS_TE-5-LSP: LSP 1.1.1.1 32_35: No addresses to connect 1.1.1.1 to 4.4.4.4
%MPLS_TE-5-LSP: LSP 1.1.1.1 32_34: DOWN: path verification failed
%MPLS_TE-5-TUN: Tun32: installed LSP nil for 32_34 (popt 10), path verification failed
%MPLS_TE-5-TUN: Tun32: LSP path change nil for 32_34, path verification failed
%MPLS_TE-5-TUN: Tun32: installed LSP 32_33 (popt 10) for nil, Path protected LSP failure
%MPLS_TE-5-TUN: Tun32: LSP path change 32_33 for nil, protected failure


The tunnel status and protection details essentially say that the backup LSP is now in use. The tunnel details are a bit redundant, but the message is clear.
CSR1#show mpls traffic-eng tunnels tunnel 32 | section Status
  Status:
    Admin: up Oper: up Path: valid Signalling: connected
    path protect option 10, type list name POPT_LIST
      Inuse path-option 100, type explicit (Basis for Protect, path weight 50)
    path option 10, type explicit EP_1_4_3_10_11
    Path Protection: Backup lsp in use.
    path protect option 10, type list name POPT_LIST
      Inuse path-option 100, type explicit (Basis for Protect, path weight 50)
CSR1#show mpls traffic-eng tunnels tunnel 32 protection
PATH PROTECTION
  LSP Head, Tunnel32, Admin: up, Oper: up
  Src 1.1.1.1, Dest 11.11.11.11, Instance 33
  Fast Reroute Protection: None
  Path Protection: Backup lsp in use.

RSVP is now tracking only one reservation, which belongs to the backup LSP. Notice that the LSP ID for the backup path is unchanged (33), so the LSP was never resignaled, as expected.
CSR1#show ip rsvp reservation
To            From      Pro  DPort  Sport  Next Hop    I/F      Fi  Serv  BPS
11.11.11.11   1.1.1.1   0    32     33     132.1.3.3   Gi2.513  SE  LOAD  3M

Bringing CSR4's interface back up will cause the original path-option to recompute and ultimately preempt the backup path. At this point, both the main and backup paths are resignaled (new LSP IDs) but no traffic is lost during the transition. The primary LSP is signaled first, then the traffic is moved onto it. Shortly thereafter (probably concurrently), the backup path is torn down and resignaled per the "new" path option. Remember that this technique allows multiple path-options to have multiple different backup paths, and the backup paths are only used for a short time until the alternative "primary" path-options are available.
CSR1#show ip rsvp reservation
To            From      Pro  DPort  Sport  Next Hop    I/F      Fi  Serv  BPS
11.11.11.11   1.1.1.1   0    32     42     132.1.4.4   Gi2.514  SE  LOAD  0
11.11.11.11   1.1.1.1   0    32     43     132.1.3.3   Gi2.513  SE  LOAD  3M

The feature is very similar on XR. XR is more flexible with its backup options since the backup can also be dynamic. However, XR does not support backup path-option lists, so only one backup can be specified per path-option. In this example, we re-use an old explicit-path and back it up with any blue path.
! XRv11
interface tunnel-te32
 description PATH PROTECTION
 ipv4 unnumbered Loopback0
 logging events all
 destination 1.1.1.1
 affinity ignore
 path-option 10 explicit name PATH_11_5_2_3_4_1 protected-by 100
 path-option 100 dynamic attribute-set ATT_BLUE
mpls traffic-eng
 attribute-set path-option ATT_BLUE
  affinity include 0x4

RSVP shows only one LSP, where we expect two (primary and backup). There is a missing step to actually enable the feature.
RP/0/0/CPU0:XRv11#show rsvp reservation destination 1.1.1.1
Destination Add DPort Source Add      SPort Pro Input IF      Sty Serv Rate   Burst
--------------- ----- --------------- ----- --- ------------- --- ---- ------ -----
1.1.1.1            32 11.11.11.11         2   0 Gi0/0/0/0.551 SE  LOAD 0      1K

To activate the feature, we need to enable it explicitly on the tunnel. The command to do so is not supported on XRv, but the remaining configuration will be left in place.
! XRv11
interface tunnel-te32
 path-protection
!!% The requested operation is not supported: Path-protection is not supported on this platform

There are some additional minor options with respect to MPLS TE. We can specify how often the router scans for protected LSPs to "promote". This refers to selecting a better backup path when many are available. The value of 0 disables promotion, and this is the default. We will set the timer to one day (3600 seconds * 24 hours = 86400 seconds).
CSR1#show mpls traffic-eng tunnels brief | include Promotion
Periodic FRR Promotion: Not Running
! CSR1
mpls traffic-eng fast-reroute timers promotion 86400
CSR1#show mpls traffic-eng tunnels brief | include Promotion
Periodic FRR Promotion: every 86400 seconds, next in 86395 seconds

We can quickly configure this on XRv11 as well, as the syntax is very similar. There does not appear to be a mechanism to verify this configuration, though.
! XRv11
mpls traffic-eng
 fast-reroute timers promotion 86400

By default, the router does not reoptimize a tunnel when a new link in the TED comes up. This is true for newly configured links or failed links that return to service. While it seems valuable for existing LSPs to use these new links, this can introduce churn into the network, which is why it is disabled. For networks where tunnel optimization is more important than stability/flooding, we can enable this behavior globally. We will use CSR8 and XRv12 to configure these reoptimization features.
! CSR8
mpls traffic-eng reoptimize events link-up
! XRv12
mpls traffic-eng
 reoptimize events link-up

There are several reoptimization timers as well. We can delay how long an old tunnel can remain signaled before removing it after reoptimization. It might be valuable to have both tunnels active for a longer period of time in case the newly optimized tunnel fails. This is controlled with the "cleanup" timer. Additionally, we can delay the time before using the newly optimized tunnel. This is the "installation" timer and it should be less than the cleanup timer. In XE, the default cleanup timer is 10 seconds and the default installation timer is 3 seconds. This means that a newly optimized tunnel won't be used for 3 seconds, and 7 seconds later, the old tunnel is torn down. We will adjust the values slightly to 4 and 11 seconds so that the 7-second "dual tunnel" timer difference still exists. Increasing these delays can also ensure that older routers have time to program the new, optimized LSP labels into the LFIB, which may take time. If the traffic switches over to the new path too quickly, packet loss may occur. The defaults in XR are 20 seconds for cleanup and 20 seconds for installation, so the switchover and teardown happen at about the same time.
! CSR8
mpls traffic-eng reoptimize timers delay installation 4
mpls traffic-eng reoptimize timers delay cleanup 11
! XRv12
mpls traffic-eng
 reoptimize timers delay cleanup 11
 reoptimize timers delay installation 4
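The timer arithmetic can be laid out explicitly. This is a small sketch that follows the description in this section, assuming both delays are measured from the moment the reoptimized LSP is signaled; the function name is hypothetical and the model is a simplification, not a statement of the exact platform behavior.

```python
def reoptimization_timeline(installation, cleanup):
    """Return (switch_at, teardown_at, overlap) in seconds.

    switch_at:   when traffic moves onto the newly optimized LSP
    teardown_at: when the old LSP is removed
    overlap:     how long both LSPs remain signaled after the switch
    """
    if installation > cleanup:
        raise ValueError("installation delay should not exceed cleanup delay")
    return installation, cleanup, cleanup - installation


# Values quoted in this section: XE defaults, the lab values, and XR defaults.
for name, inst, clean in (("XE default", 3, 10), ("lab values", 4, 11), ("XR default", 20, 20)):
    switch_at, teardown_at, overlap = reoptimization_timeline(inst, clean)
    print(f"{name}: switch at {switch_at}s, teardown at {teardown_at}s, overlap {overlap}s")
```

With the XE defaults and the lab values the "dual tunnel" window is 7 seconds in both cases, while the XR defaults leave no overlap at all.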

XR also supports some additional delay options. When a head-end router receives a PATHERR from the PLR notifying it that FRR has occurred, we can delay how long to wait before reoptimizing. The default is 0 seconds, which makes sense since using FRR for extended periods of time is not optimal. Assuming path-protection was supported on XRv, we can also delay the time before trying to reoptimize a primary path. This timer is 3 minutes (180 seconds) by default, which is very long, but since the backup path is pre-signaled and assumed to be good, there isn't a rush. We will set it to 0 to disable it, which would imply the backup path is used forever. Last, we can delay tunnel reoptimization when an affinity failure occurs. If there isn't a path using the proper link colors, we can delay the subsequent attempts to find a path, assuming there are multiple path-options. The command is not hidden, but also does not appear to be documented.
! XRv12
mpls traffic-eng
 reoptimize timers delay after-frr 5
 reoptimize timers delay path-protection 0
 reoptimize timers delay after-affinity-failure 15

The path-protection reoptimization timer command isn't supported on XRv, but other commands are.
! XRv12
mpls traffic-eng
 reoptimize timers delay path-protection 0
!!% The requested operation is not supported: Path-protection is not supported on this platform

Tunnels are automatically reoptimized every hour (3600 seconds) by default. We can control this timer globally. A value of 0 disables periodic reoptimization, which we configure for variety. The command is significantly different on XE and XR.
! CSR8
mpls traffic-eng reoptimize timers frequency 0
! XRv12
mpls traffic-eng
 reoptimize 0

We can determine how often TE information is flooded when bandwidth changes on a per-link basis. Flooding this information too frequently can result in excessive overhead and tunnel churn. Infrequent flooding leads to outdated decision-making and bandwidth admission control. In our case, we set some new thresholds so that TE information is reflooded when reserved bandwidth crosses 25%, 50%, and 100% of the link bandwidth. Ideally it makes sense to configure this consistently everywhere, but we will limit it to CSR3's link to CSR10 for demonstration. The default triggers for down/up are shown below. The defaults represent a more balanced model than the demonstration below and I would recommend using the default values. Regardless of these thresholds, an LSP that fails to get set up will trigger a reflood. XR has a similar configuration, and we configure this on the XRv11 link facing CSR10 for demonstration purposes.
! CSR3
interface GigabitEthernet2.530
 mpls traffic-eng flooding thresholds up 25 50 100
 mpls traffic-eng flooding thresholds down 100 50 25
! XRv11
mpls traffic-eng
 interface GigabitEthernet0/0/0/0.501
  flooding thresholds up 25 50 100
  flooding thresholds down 100 50 25
CSR3#show mpls traffic-eng link-management bandwidth-allocation gig2.513 | include Threshold
  Up Thresholds: 15 30 45 60 75 80 85 90 95 96 97 98 99 100 (default)
  Down Thresholds: 100 99 98 97 96 95 90 85 80 75 60 45 30 15 (default)
CSR3#show mpls traffic-eng link-management bandwidth-allocation gig2.530 | include Threshold
  Up Thresholds: 25 50 100
  Down Thresholds: 100 50 25
RP/0/0/CPU0:XRv11#show mpls traffic-eng link-management bandwidth-allocation interface gig0/0/0/0.551 | include Threshold
  Up Thresholds : 15 30 45 60 75 80 85 90 95 96 97 98 99 100 (default)
  Down Thresholds : 100 99 98 97 96 95 90 85 80 75 60 45 30 15 (default)
RP/0/0/CPU0:XRv11#show mpls traffic-eng link-management bandwidth-allocation interface gig0/0/0/0.501 | include Threshold
  Up Thresholds : 25 50 100
  Down Thresholds : 100 50 25
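To see why the coarse 25/50/100 thresholds flood less often than the defaults, the comparison can be modeled roughly. The sketch below assumes a simplified rule: a reflood is triggered when the reserved percentage crosses one or more configured values in the direction of the change. The exact boundary handling on real platforms is not confirmed here, and the function name is hypothetical.

```python
DEFAULT_UP = [15, 30, 45, 60, 75, 80, 85, 90, 95, 96, 97, 98, 99, 100]
DEFAULT_DOWN = [100, 99, 98, 97, 96, 95, 90, 85, 80, 75, 60, 45, 30, 15]
CUSTOM_UP = [25, 50, 100]
CUSTOM_DOWN = [100, 50, 25]


def crossed_thresholds(old_pct, new_pct, up, down):
    """Return the thresholds crossed when reserved bandwidth moves old -> new.

    An empty list means no threshold-driven reflood in this simplified model.
    """
    if new_pct > old_pct:
        return [t for t in sorted(up) if old_pct < t <= new_pct]
    if new_pct < old_pct:
        return [t for t in sorted(down, reverse=True) if new_pct < t <= old_pct]
    return []


# A reservation change from 40% to 45% of the link.
print("custom :", crossed_thresholds(40, 45, CUSTOM_UP, CUSTOM_DOWN))
print("default:", crossed_thresholds(40, 45, DEFAULT_UP, DEFAULT_DOWN))
```

In this model the 40% to 45% change triggers a reflood with the default thresholds but not with the custom ones, so other routers would keep using stale bandwidth information for that link until the next periodic flood.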

We can also adjust some link-level parameters with respect to flooding and bandwidth. On both XE and XR, we can specify how long bandwidth should be held on an interface when a PATH message is sent out of it, while waiting for a RESV to come back in. If the PATH messages are lost, the bandwidth should not remain reserved forever, as the LSP will be invalid. The default is 15 seconds. We can also adjust how frequently link-state information is flooded on each link; this is different from the flooding thresholds, which were bandwidth-based (event-driven). This is 3 minutes (180 seconds) by default on XR and 60 seconds by default in XE, and we will disable it by using a value of 0. This is not recommended since it relies entirely on the bandwidth threshold process for updated flooding, which by definition is less accurate.
! CSR8
mpls traffic-eng link-management timers periodic-flooding 0
mpls traffic-eng link-management timers bandwidth-hold 16
! XRv12
mpls traffic-eng
 link-management timers bandwidth-hold 16
 link-management timers periodic-flooding 0


CSR8#show mpls traffic-eng link-management bandwidth-allocation | section System
System Information::
  Links Count: 3
  Bandwidth Hold Time: max. 16 seconds
RP/0/0/CPU0:XRv12#show mpls traffic-eng link-managemen bandwidth-allocation
System Information::
  Links Count : 4
  Bandwidth Hold time : 16 seconds
[snip]
CSR8#show mpls traffic-eng link-management summary | section IGP Area
IGP Area ID:: isis level-2
  Flooding Protocol: ISIS
  Flooding Status: data flooded
  Periodic Flooding: disabled
  Flooded Links: 3
  IGP System ID: 0000.0000.0008.00
  MPLS TE Router ID: 8.8.8.8
  Neighbors: 3
RP/0/0/CPU0:XRv12#show mpls traffic-eng link-management summary | begin level
IGP Area[1]:: IS-IS 132 level 2
  Flooding Protocol : IS-IS
  Flooding Status : flooded
  Periodic Flooding : disabled
  Flooded Links : 4
  IGP System ID : 0000.0000.0012
  MPLS TE Router ID : 12.12.12.12
  IGP Neighbors : 4

To support uniform and long-pipe MPLS QoS models (described in another section), we can tell the TE process to allocate explicit-null labels on the tail end.
! CSR8
mpls traffic-eng signalling advertise explicit-null
! XRv12
mpls traffic-eng
 signalling advertise explicit-null

Using tunnel30 on CSR1, which was the tunnel requesting FRR from CSR1 to CSR8, we observe the result of this command. The RRO clearly shows label 0, which is IPv4 explicit-null, as opposed to label 3, which is implicit-null. However, when we check the LFIB of CSR6, we can see it is still performing PHP.
CSR1#show ip rsvp reservation detail filter session-type 7 tunnel-id 30
Reservation:
  Tun Dest: 8.8.8.8 Tun ID: 30 Ext Tun ID: 1.1.1.1
  Tun Sender: 1.1.1.1 LSP ID: 1
  Next Hop: 132.1.3.3 on GigabitEthernet2.513
  Label: 3016 (outgoing)
  Reservation Style is Shared-Explicit, QoS Service is Controlled-Load
  Resv ID handle: 01000427.
  Average Bitrate is 3M bits/sec, Maximum Burst is 1K bytes
  Min Policed Unit: 0 bytes, Max Pkt Size: 1500 bytes
  RRO:
    3.3.3.3/32, Flags:0x25 (Local Prot Avail/Has BW/to NHOP, Node-id)
      Label subobject: Flags 0x1, C-Type 1, Label 3016
    10.10.10.10/32, Flags:0x20 (No Local Protection, Node-id)
      Label subobject: Flags 0x1, C-Type 1, Label 10000
    6.6.6.6/32, Flags:0x20 (No Local Protection, Node-id)
      Label subobject: Flags 0x1, C-Type 1, Label 6003
    8.8.8.8/32, Flags:0x20 (No Local Protection, Node-id)
      Label subobject: Flags 0x1, C-Type 1, Label 0
  Status:
  Policy: Accepted. Policy source(s): MPLS/TE
CSR6#show mpls forwarding-table interface Gi2.568
Local  Outgoing   Prefix           Bytes Label  Outgoing   Next Hop
Label  Label      or Tunnel Id     Switched     interface
6003   Pop Label  1.1.1.1 30 [1]   348          Gi2.568    132.6.8.8
6018   Pop Label  8.8.8.8/32       0            Gi2.568    132.6.8.8

Verifying the signaled labels, we can see that CSR8 is allocating explicit-null, but CSR6 is not interpreting it properly. Using traceroute inside the VPN and MPLS OAM, we can see that PHP is actually occurring, despite the tail-end advertising exp-null in the RESV message.
CSR8#show mpls traffic-eng tunnels source-id 30 | include Label
  InLabel : GigabitEthernet2.568, explicit-null
  OutLabel :
CSR6#show mpls traffic-eng tunnels source-id 30 | include Label
  InLabel : GigabitEthernet2.560, 6001
  OutLabel : GigabitEthernet2.568, implicit-null
CSR1#traceroute mpls traffic-eng tunnel 30
Tracing MPLS TE Label Switched Path on Tunnel30, timeout is 2 seconds
[snip]
Type escape sequence to abort.
  0 132.1.3.1 MRU 1500 [Labels: 3004 Exp: 0]
L 1 132.1.3.3 MRU 1500 [Labels: 10014 Exp: 0] 1 ms
L 2 132.3.10.10 MRU 1500 [Labels: 6009 Exp: 0] 1 ms
L 3 132.6.10.6 MRU 1500 [Labels: implicit-null Exp: 0] 10 ms
! 4 132.6.8.8 22 ms
CSR7#traceroute 14.14.14.14 source 7.7.7.7
Type escape sequence to abort.
Tracing the route to 14.14.14.14
VRF info: (vrf in name/id, vrf out name/id)
  1 10.1.7.1 1 msec 0 msec 1 msec
  2 132.1.3.3 [MPLS: Labels 3004/8008 Exp 0] 2 msec 2 msec 9 msec
  3 132.3.10.10 [MPLS: Labels 10014/8008 Exp 0] 36 msec 31 msec 31 msec
  4 132.6.10.6 [MPLS: Labels 6009/8008 Exp 0] 30 msec 31 msec 42 msec
  5 10.8.14.8 [MPLS: Label 8008 Exp 0] 16 msec 15 msec 21 msec
  6 10.8.14.14 18 msec * 2 msec

Unfortunately, the mechanism to fix this is a hidden command shown below. CSR6 needs to be told to honor the explicit-null label allocated from the tail-end. We can verify this is now working using traceroute from inside the VPN or using MPLS OAM.
! CSR6
mpls traffic-eng signalling interpret explicit-null verbatim
CSR6#show mpls traffic-eng tunnels source-id 30 | include Label
  InLabel : GigabitEthernet2.560, 6002
  OutLabel : GigabitEthernet2.568, explicit-null
CSR6#show mpls forwarding-table interface Gi2.568
Local  Outgoing    Prefix           Bytes Label  Outgoing   Next Hop
Label  Label       or Tunnel Id     Switched     interface
6002   explicit-n  1.1.1.1 30 [4]   364          Gi2.568    132.6.8.8
6018   Pop Label   8.8.8.8/32       0            Gi2.568    132.6.8.8
CSR7#traceroute 14.14.14.14 source 7.7.7.7
Type escape sequence to abort.
Tracing the route to 14.14.14.14
VRF info: (vrf in name/id, vrf out name/id)
  1 10.1.7.1 0 msec 0 msec 1 msec
  2 132.1.3.3 [MPLS: Labels 3019/8008 Exp 0] 2 msec 2 msec 9 msec
  3 132.3.10.10 [MPLS: Labels 10017/8008 Exp 0] 42 msec 31 msec 30 msec
  4 132.6.10.6 [MPLS: Labels 6002/8008 Exp 0] 37 msec 30 msec 31 msec
  5 10.8.14.8 [MPLS: Labels 0/8008 Exp 0] 16 msec 24 msec 15 msec
  6 10.8.14.8 [MPLS: Labels 0/8008 Exp 0] 16 msec 16 msec 15 msec
  7 10.8.14.14 30 msec * 2 msec
CSR1#traceroute mpls traffic-eng tunnel 30
Tracing MPLS TE Label Switched Path on Tunnel30, timeout is 2 seconds
[snip]
Type escape sequence to abort.
  0 132.1.3.1 MRU 1500 [Labels: 3019 Exp: 0]
L 1 132.1.3.3 MRU 1500 [Labels: 10017 Exp: 0] 2 ms
L 2 132.3.10.10 MRU 1500 [Labels: 6002 Exp: 0] 28 ms
L 3 132.6.10.6 MRU 1500 [Labels: explicit-null Exp: 0] 25 ms
! 4 132.6.8.8 23 ms

XR is a little more intelligent and automatically assumes that receiving an exp-null label in the RSVP RESV message is a good enough reason to use it. We configure XRv11 to allocate explicit-null labels with a tunnel that traverses XRv12; we will reuse tunnel2 on CSR1.
! XRv11
mpls traffic-eng
 signalling advertise explicit-null
RP/0/0/CPU0:XRv11#show rsvp reservation session-type lsp-p2p dst-port 2 detail | include Label
 Labels: Local downstream: 0.
RP/0/0/CPU0:XRv12#show mpls traffic-eng tunnels role mid detail | include Label
  InLabel: GigabitEthernet0/0/0/0.562, 92010
  OutLabel: GigabitEthernet0/0/0/0.512, explicit-null
RP/0/0/CPU0:XRv12#show mpls forwarding | include Exp
92010  Exp-Null-v4  2  Gi0/0/0/0.512  132.11.12.11  0

We can confirm this is operational using MPLS OAM. We cannot test it within the VPN since there isn't a steering mechanism for this tunnel, but seeing explicit-null in the label stack shows it is working correctly.
CSR1#traceroute mpls traffic-eng tunnel 2
Tracing MPLS TE Label Switched Path on Tunnel2, timeout is 2 seconds
[snip]
Type escape sequence to abort.
  0 132.1.6.1 MRU 1500 [Labels: 6011 Exp: 0]
L 1 132.1.6.6 MRU 1500 [Labels: 92010 Exp: 0] 1 ms
L 2 132.6.12.12 MRU 1500 [Labels: explicit-null Exp: 0] 40 ms
! 3 132.11.12.11 41 ms
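The difference between the XE and XR penultimate-hop behavior seen above can be summarized in a few lines. This is an illustrative sketch only: the reserved label values 0 (IPv4 explicit-null) and 3 (implicit-null) are standard, but the function and its `honor_explicit_null` parameter are hypothetical names that model what the outputs in this section show, not an actual platform API.

```python
IPV4_EXPLICIT_NULL = 0  # reserved label: keep a label header for the last hop
IMPLICIT_NULL = 3       # reserved label: request penultimate hop popping


def penultimate_hop_action(advertised_label, honor_explicit_null=True):
    """Describe what the penultimate hop does with the top (transport) label."""
    if advertised_label == IMPLICIT_NULL:
        return "pop"
    if advertised_label == IPV4_EXPLICIT_NULL:
        # Modeled on the lab: XE popped anyway until told to interpret the
        # label verbatim, while XR used explicit-null without extra config.
        return "swap to 0" if honor_explicit_null else "pop"
    return f"swap to {advertised_label}"


print("XE before hidden command:", penultimate_hop_action(0, honor_explicit_null=False))
print("XE after hidden command :", penultimate_hop_action(0, honor_explicit_null=True))
print("Tail advertises label 3 :", penultimate_hop_action(3))
```

Keeping label 0 on the last hop means the EXP bits of the transport label reach the tail-end, which is why this matters for the QoS models mentioned earlier.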

When backing up LSPs with bandwidth protection, routers will typically try to minimize the number of LSPs preempted. Tunnels requesting bandwidth protection take priority over those that don't, so if a backup tunnel cannot admit any more LSPs, LSPs without bandwidth protection can be preempted by those that request it. We can alternatively configure the feature to minimize the amount of wasted bandwidth by packing each backup tunnel with as many LSPs as possible (more efficient). More preemption may occur as a result of this increased backup bandwidth efficiency. We will enable the bandwidth optimization on CSR3.
CSR3(config)#mpls traffic-eng fast-reroute backup-prot-preempt ?
  optimize-bw  Reduce bandwidth wastage (default: minimize LSPs preempted)


31.2.2 Automatic tunnels (with OSPF)
This section uses a topology similar to the main TE topology, with some minor changes. First, OSPF is used, mostly because it supports a feature for auto-tunnels not supported in IS-IS. It also allows us to test OSPF briefly with TE, although its behavior is generally identical to IS-IS. Additionally, many link colors have been removed for simplicity as they are mutually exclusive with some types of auto-tunnels. The diagram does not mark these uncolored links with “X”; they are simply blank.

Examining basic OSPF TE extensions, we can see there are several new LSAs inside of area 0. Specifically, there are 47 Opaque-Area (LSA-10) LSAs which carry the TE information. These are flooded only within an area, much like LSA1 and LSA2.
CSR1#show ip ospf 132 0 database database-summary
OSPF Router with ID (1.1.1.1) (Process ID 132)
Area 0 database summary
  LSA Type      Count  Delete  Maxage
  Router        9      0       0
  Network       0      0       0
  Summary Net   0      0       0
  Summary ASBR  0      0       0
  Type-7 Ext    0      0       0
    Prefixes redistributed in Type-7 0
  Opaque Link   0      0       0
  Opaque Area   47     0       0
  Subtotal      56     0       0


Looking into the details of CSR1's locally originated LSAs, we see 6 different Type-10 LSAs. OSPF generates one LSA10 to represent the node itself, which always has the Link ID of 1.0.0.0. The other LSA10s represent each one of CSR1's TE-enabled links, of which there are 5.

CSR1#show ip ospf 132 0 database self-originate
            OSPF Router with ID (1.1.1.1) (Process ID 132)
                Router Link States (Area 0)
Link ID         ADV Router      Age         Seq#       Checksum Link count
1.1.1.1         1.1.1.1         303         0x80000005 0x00F0AE 6

                Type-10 Opaque Link Area Link States (Area 0)
Link ID         ADV Router      Age         Seq#       Checksum Opaque ID
1.0.0.0         1.1.1.1         359         0x80000001 0x0058D1 0
1.0.0.7         1.1.1.1         303         0x80000001 0x005681 7
1.0.0.8         1.1.1.1         322         0x80000001 0x0007C8 8
1.0.0.9         1.1.1.1         345         0x80000001 0x002B91 9
1.0.0.11        1.1.1.1         344         0x80000001 0x00D3DC 11
1.0.0.12        1.1.1.1         347         0x80000001 0x008424 12
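
The relationship between the "Link ID" and "Opaque ID" columns follows the opaque LSA convention: the first octet of the Link State ID is the opaque type (1 for TE) and the remaining three octets are the opaque ID. A minimal Python sketch of that split is shown here; the helper name split_opaque_lsid is hypothetical and used only for illustration.

```python
import ipaddress

def split_opaque_lsid(link_state_id: str) -> tuple:
    """Split a dotted opaque Link State ID into (opaque type, opaque ID)."""
    value = int(ipaddress.IPv4Address(link_state_id))
    opaque_type = value >> 24        # top 8 bits
    opaque_id = value & 0x00FFFFFF   # low 24 bits
    return opaque_type, opaque_id

# Link IDs taken from the CSR1 output in this section
for lsid in ["1.0.0.0", "1.0.0.7", "1.0.0.12"]:
    print(lsid, split_opaque_lsid(lsid))
```

This is why every TE LSA in the table begins with "1." and why the last column simply repeats the trailing octets.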

To prove that the LSAs are only created for TE-enabled links, we will disable TE tunnels on the link to CSR4 (not shown). CSR1 immediately MAXAGEs the corresponding LSA10 to mark it for deletion. It is flushed seconds later.

CSR1#show ip ospf 132 0 database self-originate
            OSPF Router with ID (1.1.1.1) (Process ID 132)
                Router Link States (Area 0)
Link ID         ADV Router      Age         Seq#       Checksum Link count
1.1.1.1         1.1.1.1         853         0x80000005 0x00F0AE 6

                Type-10 Opaque Link Area Link States (Area 0)
Link ID         ADV Router      Age         Seq#       Checksum Opaque ID
1.0.0.0         1.1.1.1         909         0x80000001 0x0058D1 0
1.0.0.7         1.1.1.1         853         0x80000001 0x005681 7
1.0.0.8         1.1.1.1         3608        0x80000002 0x00B77A 8
1.0.0.9         1.1.1.1         895         0x80000001 0x002B91 9
1.0.0.11        1.1.1.1         894         0x80000001 0x00D3DC 11
1.0.0.12        1.1.1.1         897         0x80000001 0x008424 12


Before continuing, we re-enable the TE configuration on the link to CSR4. The LSA10 with Link ID 1.0.0.0 is very simple, containing only the TE-RID. It does not represent a link, but only the TE node (vertex) itself. One of the remaining five LSAs is examined to show what the link-based variant contains. This shows all expected details, such as everything you'd see in an LSA-1 (link type, local/remote addresses, etc.) plus the TE information. It is clearly shown in the output and needs no explanation. Also notice that the bandwidth is measured in Bps (bytes/sec), so the maximum bandwidth of 125 MBps is equivalent to 1 Gbps.

CSR1#show ip ospf 132 0 database opaque-area 1.0.0.0 adv-router 1.1.1.1
            OSPF Router with ID (1.1.1.1) (Process ID 132)
                Type-10 Opaque Link Area Link States (Area 0)
  LS age: 1030
  Options: (No TOS-capability, DC)
  LS Type: Opaque Area Link
  Link State ID: 1.0.0.0
  Opaque Type: 1
  Opaque ID: 0
  Advertising Router: 1.1.1.1
  LS Seq Number: 80000001
  Checksum: 0x58D1
  Length: 28
  Fragment number : 0
    MPLS TE router ID : 1.1.1.1
    Number of Links : 0

CSR1#show ip ospf 132 0 database opaque-area 1.0.0.7 adv-router 1.1.1.1
            OSPF Router with ID (1.1.1.1) (Process ID 132)
                Type-10 Opaque Link Area Link States (Area 0)
  LS age: 1017
  Options: (No TOS-capability, DC)
  LS Type: Opaque Area Link
  Link State ID: 1.0.0.7
  Opaque Type: 1
  Opaque ID: 7
  Advertising Router: 1.1.1.1
  LS Seq Number: 80000001
  Checksum: 0x5681
  Length: 132
  Fragment number : 7
    Link connected to Point-to-Point network
      Link ID : 3.3.3.3
      Interface Address : 132.1.3.1
      Neighbor Address : 132.1.3.3
      Admin Metric : 1
      Maximum bandwidth : 125000000
      Maximum reservable bandwidth : 12500000
      Number of Priority : 8
      Priority 0 : 12500000    Priority 1 : 12500000
      Priority 2 : 12500000    Priority 3 : 12500000
      Priority 4 : 12500000    Priority 5 : 12500000
      Priority 6 : 12500000    Priority 7 : 12500000
      Affinity Bit : 0x0
      IGP Metric : 1
    Number of Links : 1
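
Because the TE LSA reports bandwidth in bytes per second, it helps to convert the values back to the more familiar bits per second when reading this output. The following Python sketch does the arithmetic for the two values shown above; the function name is a hypothetical helper, not part of any router tooling.

```python
def bytes_per_sec_to_mbps(bytes_per_sec: int) -> float:
    """Convert an OSPF-TE bandwidth value (bytes/sec) to megabits/sec."""
    return bytes_per_sec * 8 / 1_000_000

# Values taken from the CSR1 link LSA above
print(bytes_per_sec_to_mbps(125_000_000))  # maximum bandwidth (1 Gbps link)
print(bytes_per_sec_to_mbps(12_500_000))   # maximum reservable bandwidth
```

The maximum reservable bandwidth therefore works out to 100 Mbps, one tenth of the physical link rate.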

As with IS-IS, the link between CSR2 and CSR3 is a broadcast network, which means a DR is elected and a corresponding LSA2 is originated by that DR. This appears slightly differently in the TE topology. There aren't any new LSAs, but the "Link ID" in CSR3's LSA shows the DR address, which belongs to CSR3 (132.2.3.3). The TE topology show commands that are protocol-agnostic are still identical to IS-IS in this regard and are not shown again. The TE process can interpret this topology properly since the existing LSA1/LSA2 are still in the OSPF LSDB and available for review.

CSR1#show ip ospf 132 0 database opaque-area 1.0.0.7 adv-router 3.3.3.3
            OSPF Router with ID (1.1.1.1) (Process ID 132)
                Type-10 Opaque Link Area Link States (Area 0)
  LS age: 51
  Options: (No TOS-capability, DC)
  LS Type: Opaque Area Link
  Link State ID: 1.0.0.7
  Opaque Type: 1
  Opaque ID: 7
  Advertising Router: 3.3.3.3
  LS Seq Number: 80000001
  Checksum: 0x7678
  Length: 124
  Fragment number : 7
    Link connected to Broadcast network
      Link ID : 132.2.3.3
      Interface Address : 132.2.3.3
      Admin Metric : 1
      Maximum bandwidth : 125000000
      Maximum reservable bandwidth : 12500000
      Number of Priority : 8
      Priority 0 : 12500000    Priority 1 : 12500000
      Priority 2 : 12500000    Priority 3 : 12500000
      Priority 4 : 12500000    Priority 5 : 12500000
      Priority 6 : 12500000    Priority 7 : 12500000
      Affinity Bit : 0x0
      IGP Metric : 1
    Number of Links : 1

A quick look on XRv11 shows similar output. Link ID 1.0.0.0 represents the local node, and the other 3 represent XRv11's TE-enabled links. The details of the node and a single link LSA are shown below. XR supports the extended administrative group (EAG), which allows for additional affinity extensions. The EAG is an array of 32-bit unsigned integers of length 8, effectively creating 256 possible colors. Like XE, the bandwidth is still measured in Bps (it even says so in the output).

RP/0/0/CPU0:XRv11#show ospf 132 0 database self-originate
            OSPF Router with ID (11.11.11.11) (Process ID 132)
                Router Link States (Area 0)
Link ID         ADV Router      Age         Seq#       Checksum Link count
11.11.11.11     11.11.11.11     188         0x8000002b 0x006998 7

                Type-10 Opaque Link Area Link States (Area 0)
Link ID         ADV Router      Age         Seq#       Checksum Opaque ID
1.0.0.0         11.11.11.11     188         0x80000029 0x003081 0
1.0.0.6         11.11.11.11     188         0x80000029 0x0065e1 6
1.0.0.7         11.11.11.11     188         0x80000029 0x00c930 7
1.0.0.8         11.11.11.11     188         0x80000029 0x007fe4 8

RP/0/0/CPU0:XRv11#show ospf 132 0 database opaque-area 1.0.0.0 self-originate
            OSPF Router with ID (11.11.11.11) (Process ID 132)
                Type-10 Opaque Link Area Link States (Area 0)
  LS age: 252
  Options: (No TOS-capability, DC)
  LS Type: Opaque Area Link
  Link State ID: 1.0.0.0
  Opaque Type: 1
  Opaque ID: 0
  Advertising Router: 11.11.11.11
  LS Seq Number: 80000029
  Checksum: 0x3081
  Length: 28
    MPLS TE router ID : 11.11.11.11
    Number of Links : 0

RP/0/0/CPU0:XRv11#show ospf 132 0 database opaque-area 1.0.0.6 self-originate
            OSPF Router with ID (11.11.11.11) (Process ID 132)
                Type-10 Opaque Link Area Link States (Area 0)
  LS age: 264
  Options: (No TOS-capability, DC)
  LS Type: Opaque Area Link
  Link State ID: 1.0.0.6
  Opaque Type: 1
  Opaque ID: 6
  Advertising Router: 11.11.11.11
  LS Seq Number: 80000029
  Checksum: 0x65e1
  Length: 168
    Link connected to Point-to-Point network
      Link ID : 10.10.10.10
      (all bandwidths in bytes/sec)
      Interface Address : 132.10.11.11
      Neighbor Address : 132.10.11.10
      Admin Metric : 1
      Maximum bandwidth : 125000000
      Maximum reservable bandwidth global: 12500000
      Number of Priority : 8
      Priority 0 : 12500000    Priority 1 : 12500000
      Priority 2 : 12500000    Priority 3 : 12500000
      Priority 4 : 12500000    Priority 5 : 12500000
      Priority 6 : 12500000    Priority 7 : 12500000
      Affinity Bit : 0x3
      IGP Metric : 1
      Extended Administrative Group : Length: 8
        EAG[0]: 0x3
        EAG[1]: 0
        EAG[2]: 0
        EAG[3]: 0
        EAG[4]: 0
        EAG[5]: 0
        EAG[6]: 0
        EAG[7]: 0
    Number of Links : 1
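
The "256 possible colors" figure is simply 8 words of 32 bits each. The Python sketch below illustrates this and lists which colors are set in the XRv11 output. It assumes, for illustration, that bit 0 of EAG[0] is color 0 and that each following word continues the numbering; the function name eag_colors is hypothetical.

```python
def eag_colors(eag_words: list) -> list:
    """Return the set bit positions (colors) across an EAG word array.

    Assumption for this sketch: bit 0 of EAG[0] is color 0, and each
    following 32-bit word continues the numbering (EAG[1] starts at 32).
    """
    colors = []
    for word_index, word in enumerate(eag_words):
        for bit in range(32):
            if word & (1 << bit):
                colors.append(word_index * 32 + bit)
    return colors

eag = [0x3, 0, 0, 0, 0, 0, 0, 0]   # EAG[0]..EAG[7] from the XRv11 output
print(len(eag) * 32)               # total colors available
print(eag_colors(eag))             # colors set on this link
```

Note that EAG[0] carries the same value (0x3) as the classic 32-bit "Affinity Bit" field in the same LSA.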

The first type of auto-tunnel we will examine is backup. Much like the NHOP and NNHOP tunnels we configured in previous sections, Cisco has a mechanism to automatically create these backup tunnels based on LSPs flowing through a router. The manual mechanism required a backup tunnel to be manually configured, then assigned to an interface as a candidate backup. For a large-scale network, these backups would ideally exist to protect every link and/or every node in the SP core. Manually configuring these (even if done by a controller) can be time consuming and make the configuration very long. To test auto-tunnel backup, we will configure a basic TE tunnel from CSR1 to XRv11. We will use an explicit-path to take a long route intentionally to maximize testing. We verify the tunnel is operational with the RRO inside the RSVP RESV message.

! CSR1
ip explicit-path name EP_1_4_3_2_10_12_11 enable
 next-address 4.4.4.4
 next-address 3.3.3.3
 next-address 2.2.2.2
 next-address 10.10.10.10
 next-address 12.12.12.12
 next-address 11.11.11.11
interface Tunnel100
 description BASIC TE FRR (AUTO BACKUP)
 ip unnumbered Loopback0
 tunnel mode mpls traffic-eng
 tunnel destination 11.11.11.11
 tunnel mpls traffic-eng autoroute destination
 tunnel mpls traffic-eng affinity 0x0 mask 0x0
 tunnel mpls traffic-eng path-option 10 explicit name EP_1_4_3_2_10_12_11
 tunnel mpls traffic-eng fast-reroute

CSR1#show ip rsvp reservation detail filter session-type 7 tunnel-id 100 | begin RRO
  RRO:
    4.4.4.4/32, Flags:0x20 (No Local Protection, Node-id)
      Label subobject: Flags 0x1, C-Type 1, Label 4017
    3.3.3.3/32, Flags:0x20 (No Local Protection, Node-id)
      Label subobject: Flags 0x1, C-Type 1, Label 3016
    2.2.2.2/32, Flags:0x20 (No Local Protection, Node-id)
      Label subobject: Flags 0x1, C-Type 1, Label 2016
    10.10.10.10/32, Flags:0x20 (No Local Protection, Node-id)
      Label subobject: Flags 0x1, C-Type 1, Label 10014
    12.12.12.12/32, Flags:0x20 (No Local Protection, Node-id)
      Label subobject: Flags 0x1, C-Type 1, Label 92012
    132.10.12.12/32, Flags:0x0 (No Local Protection)
      Label subobject: Flags 0x1, C-Type 1, Label 92012
    11.11.11.11/32, Flags:0x20 (No Local Protection, Node-id)
      Label subobject: Flags 0x1, C-Type 1, Label 3
    132.11.12.11/32, Flags:0x0 (No Local Protection)
      Label subobject: Flags 0x1, C-Type 1, Label 3
  Status: Policy: Accepted. Policy source(s): MPLS/TE


We will enable CSR3 to provide NHOP and NNHOP protection for this LSP. Enabling "auto-tunnel backup" is done in global configuration and applies to any LSP transiting the router. First, we will enable the tunnels without any extra options. We only need one command for this and will enable debugging to see the result. The debug is very straightforward to read since it shows the exact commands used to build the backup tunnels.

We can see that the tunnel enumeration starts at 65436, which is the first number in the default range of 65436 - 65535 (100 tunnels). Logging is disabled for the tunnels and Loopback0 is used for the IP address by default. The process automatically removes affinity configurations (which is odd as there are none) and relies on the default affinity of 0x0 / mask 0xFFFF. So far, this is OK, since most of the links are colorless. The explicit-path that is generated for tunnel65436 avoids CSR3's link to CSR2; this is an NHOP tunnel. The process repeats for tunnel65437, which is identical in every way except that it is configured to avoid CSR2 by Node-ID; this is an NNHOP tunnel. By default, auto-tunnel backup will always try to make both NHOP and NNHOP tunnels for each LSP it protects.

! CSR3
mpls traffic-eng auto-tunnel backup

CSR3#debug mpls traffic-eng auto-tunnel backup all
TE_AUTO_TUN: FRR lopri process rcv msg TSPTUN_BACKUP_RPC_REMOVE_ALL, ready=1
TE_AUTO_TUN: Delaying further auto-tunnel backup configuration.
TE_AUTO_TUN: FRR lopri process rcv msg TSPTUN_CREATE_BACKUP, ready=1
TE_AUTO_TUN: Create Auto tunnel
TE_AUTO_TUN: backup CLI command:
  interface tunnel65436
  no logging event link-status
  ip unnumbered Loopback0
  tunnel destination 2.2.2.2
  tunnel mode mpls traffic-eng
  end
TE_AUTO_TUN: backup CLI command:
  interface tunnel65436
  no tunnel mpls traffic-eng affinity
  end
TE_AUTO_TUN: CLI command:
  ip explicit-path name __dynamic_tunnel65436
  index 1 exclude-address 132.2.3.3
TE_AUTO_TUN: backup CLI command:
  interface tunnel65436
  tunnel mpls traffic-eng path-option 1 exp name __dynamic_tunnel65436
  end
TE_AUTO_TUN: Create Auto tunnel
TE_AUTO_TUN: backup CLI command:
  interface tunnel65437
  no logging event link-status
  ip unnumbered Loopback0
  tunnel destination 10.10.10.10
  tunnel mode mpls traffic-eng
  end
TE_AUTO_TUN: backup CLI command:
  interface tunnel65437
  no tunnel mpls traffic-eng affinity
  end
TE_AUTO_TUN: CLI command:
  ip explicit-path name __dynamic_tunnel65437
  index 1 exclude-address 2.2.2.2
TE_AUTO_TUN: backup CLI command:
  interface tunnel65437
  tunnel mpls traffic-eng path-option 1 exp name __dynamic_tunnel65437
  end

The log messages below indicate that both tunnels are operational. Checking the RSVP PATH message summary, we can see that CSR3 is the head-end for both tunnels since there is no PHOP.

! CSR3
%MPLS_TE-5-TUN: Tun65436: installed LSP 65436_1 (popt 1) for nil, got 1st feasible path opt
%MPLS_TE-5-TUN: Tun65437: installed LSP 65437_1 (popt 1) for nil, got 1st feasible path opt
%MPLS_TE-5-LSP: LSP 3.3.3.3 65437_1: UP
%MPLS_TE-5-TUN: Tun65437: LSP path change 65437_1 for nil, normal
%MPLS_TE-5-LSP: LSP 3.3.3.3 65436_1: UP
%MPLS_TE-5-TUN: Tun65436: LSP path change 65436_1 for nil, normal

CSR3#show ip rsvp sender filter session-type 7
Destination     Tun Sender      TunID  LSPID  Prev Hop     I/F      BPS
2.2.2.2         3.3.3.3         65436  1      none         none     0
10.10.10.10     3.3.3.3         65437  1      none         none     0
11.11.11.11     1.1.1.1         100    2      132.3.4.4    Gi2.534  0

There is also a specific show command to see the backup auto-tunnel details. Later, we will look at adjusting these parameters.

CSR3#show mpls traffic-eng auto-tunnel backup
State: Enabled
Auto backup tunnels: 1 (up: 1, down: 0)
Tunnel ID Range: 65436 - 65535
Create Nhop Only: No
Check for deletion of unused tunnels every: 3600 Sec
SRLG: Not configured
Config:
  unnumbered-interface: Loopback0
  Affinity/Mask: 0x0/0xFFFF

The PATH details show the EROs used to build these tunnels. The NHOP tunnel uses CSR9 as an alternate path to CSR2, while the NNHOP tunnel goes directly to CSR10, since that is the shortest path.

CSR3#show ip rsvp sender detail filter session-type 7 tunnel-id 65436 | section outgoing
  ERO: (outgoing)
    132.3.9.9 (Strict IPv4 Prefix, 8 bytes, /32)
    132.2.9.2 (Strict IPv4 Prefix, 8 bytes, /32)
    2.2.2.2 (Strict IPv4 Prefix, 8 bytes, /32)

CSR3#show ip rsvp sender detail filter session-type 7 tunnel-id 65437 | section outgoing
  ERO: (outgoing)
    132.3.10.10 (Strict IPv4 Prefix, 8 bytes, /32)
    10.10.10.10 (Strict IPv4 Prefix, 8 bytes, /32)

The protected LSP uses label 3016 when sending traffic to CSR3. Checking the FRR database against this label, we see an entry showing that the NNHOP tunnel is currently available (but not in use) and using label 10014. We can confirm this update in the original RSVP RESV message on the headend. NNHOP is preferred over NHOP when both are available and all other parameters are equal (backup bandwidth, etc). All labels shown in the FRR database are present in the RESV RRO of the protected LSP.

CSR3#show mpls traffic-eng fast-reroute database labels 3016 detail | begin 100
Tun ID: 100, LSP ID: 2, Source: 1.1.1.1 Destination: 11.11.11.11
  State : ready
  InLabel : 3016
  OutLabel : Gi2.523:2016
  FRR OutLabel : Tu65437:10014

CSR1#show ip rsvp reservation detail filter session-type 7 tunnel-id 100 | begin RRO
  RRO:
    4.4.4.4/32, Flags:0x20 (No Local Protection, Node-id)
      Label subobject: Flags 0x1, C-Type 1, Label 4017
    3.3.3.3/32, Flags:0x29 (Local Prot Avail/to NNHOP, Node-id)
      Label subobject: Flags 0x1, C-Type 1, Label 3016
    2.2.2.2/32, Flags:0x20 (No Local Protection, Node-id)
      Label subobject: Flags 0x1, C-Type 1, Label 2016
    10.10.10.10/32, Flags:0x20 (No Local Protection, Node-id)
      Label subobject: Flags 0x1, C-Type 1, Label 10014
[snip]

To quickly test the NNHOP backup tunnel, we will break the link between CSR3 and CSR2 by changing the encapsulation on CSR2's link to CSR3 (not shown). The auto-tunnel process is just used for provisioning; everything else is just normal TE-FRR. CSR1 receives a PATHERR message, originated by the PLR (CSR3, 132.3.4.3), indicating that the tunnel was locally repaired. Because there are no other path-options available, the tunnel cannot presently be reoptimized around the failure, and traffic will remain in the TE-FRR tunnel.

! CSR1
%MPLS_TE-5-LSP: LSP 1.1.1.1 100_2: Path Error from 132.3.4.3: Notify: Tunnel locally repaired (flags 0)
%MPLS_TE-5-LSP: LSP 1.1.1.1 100_3: No addresses to connect 3.3.3.3 to 2.2.2.2

We verify that the PLR shows the NNHOP tunnel as active for the protected LSP and now swaps label 3016 for label 10014. Notice that a label push need not occur since the PLR and MP are directly connected; the PLR learned label 10014 from the RESV RRO, so from CSR10's perspective, the LFIB has not changed.

CSR3#show mpls traffic-eng fast-reroute database labels 3016 detail | begin 100
Tun ID: 100, LSP ID: 2, Source: 1.1.1.1 Destination: 11.11.11.11
  State : active
  InLabel : 3016
  OutLabel : Gi2.523:2016
  FRR OutLabel : Tu65437:10014

CSR3#show mpls forwarding-table labels 3016 detail
Local      Outgoing   Prefix            Bytes Label   Outgoing   Next Hop
Label      Label      or Tunnel Id      Switched      interface
3016       10014      1.1.1.1 100 [2]   3232          Tu65437    point2point
        MAC/Encaps=18/22, MRU=1500, Label Stack{10014}, via Gi2.530
        000C29497164000C29D781FE81000DCA8847 0271E000
        No output feature configured

The headend, of course, knows TE-FRR is in use as it generated a log message earlier. The RSVP RESV message shows this as well.

CSR1#show ip rsvp reservation detail filter session-type 7 tunnel-id 100 | begin RRO
  RRO:
    4.4.4.4/32, Flags:0x20 (No Local Protection, Node-id)
      Label subobject: Flags 0x1, C-Type 1, Label 4017
    3.3.3.3/32, Flags:0x2B (Local Prot Avail/In Use/to NNHOP, Node-id)
      Label subobject: Flags 0x1, C-Type 1, Label 3016
    10.10.10.10/32, Flags:0x20 (No Local Protection, Node-id)
      Label subobject: Flags 0x1, C-Type 1, Label 10014
[snip]

To reduce the number of tunnels required, we can configure the auto-tunnel process to only create NHOP tunnels. These may scale better; if there were many LSPs going from CSR2 to CSR3, and then from CSR3 to many different next-hops, CSR2 would need many NNHOP tunnels, whereas a single NHOP tunnel could at least protect the commonly used link. Debugging shows the auto-tunnel process deleting both tunnels from the configuration, which is followed by several syslog messages to indicate it. Notice that the original NHOP tunnel is then resignaled as a result of the configuration change. Tunnel65437 (NNHOP) is shut down and tunnel65436 (NHOP) is signaled.

! CSR3
mpls traffic-eng auto-tunnel backup nhop-only

TE_AUTO_TUN: backup CLI command:
  no ip explicit-path name __dynamic_tunnel65436
  end
TE_AUTO_TUN: backup CLI command:
  no ip explicit-path name __dynamic_tunnel65437
  end
TE_AUTO_TUN: FRR lopri process rcv msg TSPTUN_CREATE_BACKUP, ready=1
TE_AUTO_TUN: Create Auto tunnel
TE_AUTO_TUN: backup CLI command:
  interface tunnel65436
  no logging event link-status
  ip unnumbered Loopback0
  tunnel destination 2.2.2.2
  tunnel mode mpls traffic-eng
  end
TE_AUTO_TUN: backup CLI command:
  interface tunnel65436
  no tunnel mpls traffic-eng affinity
  end
TE_AUTO_TUN: CLI command:
  ip explicit-path name __dynamic_tunnel65436
  index 1 exclude-address 132.2.3.3
TE_AUTO_TUN: backup CLI command:
  interface tunnel65436
  tunnel mpls traffic-eng path-option 1 exp name __dynamic_tunnel65436
  end
%MPLS_TE-5-TUN: Tun65436: installed LSP 65436_1 (popt 1) for nil, got 1st feasible path opt
%MPLS_TE-5-LSP: LSP 3.3.3.3 65437_1: DOWN: signalling shutdown
%MPLS_TE-5-TUN: Tun65437: installed LSP nil for 65437_1 (popt 1), signalling shutdown
%MPLS_TE-5-TUN: Tun65437: LSP path change nil for 65437_1, signalling shutdown
%MPLS_TE-5-LSP: LSP 3.3.3.3 65436_1: UP
%MPLS_TE-5-TUN: Tun65436: LSP path change 65436_1 for nil, normal

Now, we can see that the “Nhop Only” flag has changed to yes in the auto-tunnel backup summary.

CSR3#show mpls traffic-eng auto-tunnel backup
State: Enabled
Auto backup tunnels: 1 (up: 1, down: 0)
Tunnel ID Range: 65436 - 65535
Create Nhop Only: Yes
Check for deletion of unused tunnels every: 3600 Sec
SRLG: Not configured
Config:
  unnumbered-interface: Loopback0
  Affinity/Mask: 0x0/0xFFFF

We check the RSVP RESV RRO on the head-end to see if the labels have changed on the original LSP (it has been signaled a few times). The RRO also shows that NHOP protection is available. The local label for CSR3 is now 3015, so we check the TE-FRR database on CSR3 to ensure there is an entry for it. We can tell this is an NHOP tunnel since the FRR label is the same as the outlabel, which indicates that the label expected by the NHOP is preserved. The PLR simply pushes an additional label to carry the traffic through the backup tunnel to the MP.

CSR1#show ip rsvp reservation detail filter session-type 7 tunnel-id 100 | begin RRO
  RRO:
    4.4.4.4/32, Flags:0x20 (No Local Protection, Node-id)
      Label subobject: Flags 0x1, C-Type 1, Label 4016
    3.3.3.3/32, Flags:0x21 (Local Prot Avail/to NHOP, Node-id)
      Label subobject: Flags 0x1, C-Type 1, Label 3015
    2.2.2.2/32, Flags:0x20 (No Local Protection, Node-id)
      Label subobject: Flags 0x1, C-Type 1, Label 2015
[snip]

CSR3#show mpls traffic-eng fast-reroute database labels 3015 detail | begin 100
Tun ID: 100, LSP ID: 14, Source: 1.1.1.1 Destination: 11.11.11.11
  State : ready
  InLabel : 3015
  OutLabel : Gi2.523:2015
  FRR OutLabel : Tu65436:2015

To verify the NHOP tunnel path, we check CSR3's tunnel details. Again, it selects CSR9, which allocates label 9015 for the backup tunnel.

CSR3#show mpls traffic-eng tunnels tunnel 65436 | section RSVP Path
  RSVP Path Info:
    My Address: 132.3.9.3
    Explicit Route: 132.3.9.9 132.2.9.2 2.2.2.2
    Record Route: NONE
    Tspec: ave rate=0 kbits, burst=1000 bytes, peak rate=0 kbits

CSR3#show mpls traffic-eng tunnels tunnel 65436 | include Label
  InLabel :
  OutLabel : GigabitEthernet2.539, 9015

We simulate a failure again by bringing CSR2's interface down facing CSR3. CSR1 shows the normal log messages to indicate it.

! CSR1
%MPLS_TE-5-LSP: LSP 1.1.1.1 100_15: No addresses to connect 3.3.3.3 to 2.2.2.2
%MPLS_TE-5-LSP: LSP 1.1.1.1 100_14: Path Error from 132.3.4.3: Notify: Tunnel locally repaired (flags 0)

We check CSR3's FRR database to see the state is now active with tunnel65436 being used, which is NHOP protection. The LFIB shows a swap from 3015 to 2015, as would normally be done, but then a push of label 9015 which is used to tunnel the LSP to the MP (CSR2).

CSR3#show mpls traffic-eng fast-reroute database labels 3015 detail | begin 100
Tun ID: 100, LSP ID: 14, Source: 1.1.1.1 Destination: 11.11.11.11
  State : active
  InLabel : 3015
  OutLabel : Gi2.523:2015
  FRR OutLabel : Tu65436:2015

CSR3#show mpls forwarding-table labels 3015 detail
Local      Outgoing   Prefix             Bytes Label   Outgoing   Next Hop
Label      Label      or Tunnel Id       Switched      interface
3015       2015       1.1.1.1 100 [14]   1565          Tu65436    point2point
        MAC/Encaps=18/26, MRU=1496, Label Stack{9015 2015}, via Gi2.539
        000C29E04F84000C29D781FE81000DD38847 02337000007DF000
        No output feature configured
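
The hex string under the MAC/Encaps line can be decoded by hand to confirm the swap-and-push behavior. The Python sketch below is a minimal illustration, assuming the first 18 bytes are an Ethernet header with one 802.1Q tag and the remainder is a sequence of standard 32-bit MPLS label stack entries (20-bit label, 3-bit TC/EXP, 1-bit S, 8-bit TTL). The function names are hypothetical.

```python
def decode_label_stack(hex_labels: str) -> list:
    """Decode 32-bit MPLS label stack entries from a hex string."""
    entries = []
    for i in range(0, len(hex_labels), 8):
        word = int(hex_labels[i:i + 8], 16)
        entries.append({
            "label": word >> 12,       # 20-bit label value
            "tc": (word >> 9) & 0x7,   # 3-bit traffic class (EXP)
            "s": (word >> 8) & 0x1,    # bottom-of-stack bit
            "ttl": word & 0xFF,        # 8-bit TTL
        })
    return entries

def decode_vlan_header(hex_mac: str) -> dict:
    """Decode an 18-byte Ethernet + 802.1Q header from a hex string."""
    return {
        "dst_mac": hex_mac[0:12],
        "src_mac": hex_mac[12:24],
        "tpid": hex_mac[24:28],                      # 8100 = 802.1Q
        "vlan": int(hex_mac[28:32], 16) & 0x0FFF,    # low 12 bits of the TCI
        "ethertype": hex_mac[32:36],                 # 8847 = MPLS unicast
    }

# Strings copied from the NHOP (two-label) LFIB entry
print(decode_vlan_header("000C29E04F84000C29D781FE81000DD38847"))
print([e["label"] for e in decode_label_stack("02337000007DF000")])
```

The decoded labels are 9015 (the backup tunnel label toward CSR9) on top of 2015 (the label the MP expects), matching the Label Stack{9015 2015} field. The two 4-byte entries also account for the 18/26 byte counts; in this rewrite string the TC, S, and TTL fields are all zero.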

CSR1 shows that the NHOP tunnel is now in use, showing that FRR is still working.

CSR1#show ip rsvp reservation detail filter session-type 7 tunnel-id 100 | begin RRO
  RRO:
    4.4.4.4/32, Flags:0x20 (No Local Protection, Node-id)
      Label subobject: Flags 0x1, C-Type 1, Label 4016
    3.3.3.3/32, Flags:0x23 (Local Prot Avail/In Use/to NHOP, Node-id)
      Label subobject: Flags 0x1, C-Type 1, Label 3015
    2.2.2.2/32, Flags:0x20 (No Local Protection, Node-id)
      Label subobject: Flags 0x1, C-Type 1, Label 2015
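
The RRO flag values seen so far (0x20, 0x21, 0x23, 0x29, 0x2B) are bitmasks. As a reading aid, the Python sketch below decodes them using the standard RSVP-TE RRO flag bits; the dictionary and function names are hypothetical, and the comments map each bit to the wording IOS prints.

```python
# RRO sub-object flag bits (RSVP-TE fast reroute and node-id extensions)
RRO_FLAGS = {
    0x01: "local protection available",  # IOS: "Local Prot Avail"
    0x02: "local protection in use",     # IOS: "In Use"
    0x04: "bandwidth protection",
    0x08: "node protection",             # IOS: "to NNHOP" (clear = "to NHOP")
    0x20: "node-id",                     # IOS: "Node-id"
}

def decode_rro_flags(flags: int) -> list:
    """Return the names of all flag bits set in an RRO flags byte."""
    return [name for bit, name in RRO_FLAGS.items() if flags & bit]

for value in (0x20, 0x21, 0x23, 0x29, 0x2B):
    print(hex(value), decode_rro_flags(value))
```

For example, 0x29 is 0x20 + 0x08 + 0x01 (node-id, node protection, protection available), while 0x23 is 0x20 + 0x02 + 0x01 (node-id, link protection available and in use).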

The concept of shared risk link group (SRLG) is similar to the SRLG seen in the IP FRR environment. We can specify links that "share risk", which tells the auto-tunnel process to not prefer, or totally exclude, these links as candidate paths for FRR. This is also called “fate sharing”. For example, let’s assume that CSR3's links to CSR9 and CSR2 both use the same physical conduit to carry the cabling. If the link to CSR2 fails due to the conduit being damaged, it is highly likely the link to CSR9 will also fail, so we should not prefer that link if possible. By assigning these links to the same SRLG, which is carried in the TE extensions inside OSPF Type-10 LSAs or IS-IS LSPs, the auto-tunnel process will consider it. The command to enable SRLG for MPLS-TE is different from the one used for IP FRR.

! CSR3
interface GigabitEthernet2.523
 mpls traffic-eng srlg 29
interface GigabitEthernet2.539
 mpls traffic-eng srlg 29

After configuring it, we will verify it locally on CSR3 by checking the raw OSPF LSAs and the TE topology. Normally you would go straight to the TE topology instead of the raw IGP information. The last octet of the OSPF LSA10 Link State ID (7 and 10 in this case) corresponds to the frag_id listed in the TED. This eliminates the guesswork when querying the LSDB for TE information.

CSR3#show ip ospf 132 0 database opaque-area 1.0.0.7 adv-router 3.3.3.3
[snip]
    Link connected to Broadcast network
      Link ID : 132.2.3.3
      Interface Address : 132.2.3.3
[snip]
      Shared Risk Link Groups:29

CSR3#show ip ospf 132 0 database opaque-area 1.0.0.10 adv-router 3.3.3.3
[snip]
    Link connected to Point-to-Point network
      Link ID : 9.9.9.9
      Interface Address : 132.3.9.3
[snip]
      Shared Risk Link Groups:29

CSR3#show mpls traffic-eng topology 3.3.3.3 brief
IGP Id: 3.3.3.3, MPLS TE Id:3.3.3.3 Router Node (ospf 132 area 0)
[snip]
  link[3]: Point-to-Point, Nbr IGP Id: 9.9.9.9, nbr_node_id:7, gen:72
      frag_id: 10, Intf Address: 132.3.9.3, Nbr Intf Address: 132.3.9.9
      TE metric: 1, IGP metric: 1, attribute flags: 0x0
      SRLGs: 29
  link[4]: Broadcast, DR: 132.2.3.3, nbr_node_id:14, gen:72
      frag_id: 7, Intf Address: 132.2.3.3
      TE metric: 1, IGP metric: 1, attribute flags: 0x0
      SRLGs: 29

Next, we need to tell the auto-tunnel process how to interpret the SRLG. We can either demand that links in the SRLG never be used, which means that if an SRLG path is the only path, no tunnel will be made, or be more liberal and prefer that SRLG links not be used, which essentially creates multiple path-options to fall back to SRLG paths as a last resort. First, we look at the strict option. As soon as we configure this feature, assuming debugging is on, we can see the original NHOP tunnel torn down and a new one created with a new explicit-path. The tunnel configuration itself is identical, but the explicit-path now contains "exclude-srlg". This option is not available in the normal explicit-path configuration on XE as SRLGs are specific to auto-backup tunnels.

! CSR3
mpls traffic-eng auto-tunnel backup srlg exclude force

! CSR3
TE_AUTO_TUN: FRR lopri process rcv msg TSPTUN_BACKUP_RPC_REMOVE_ALL, ready=1
%MPLS_TE-5-LSP: LSP 3.3.3.3 65436_1: DOWN: signalling shutdown
%MPLS_TE-5-TUN: Tun65436: installed LSP nil for 65436_1 (popt 1), signalling shutdown
%MPLS_TE-5-TUN: Tun65436: LSP path change nil for 65436_1, signalling shutdown
TE_AUTO_TUN: backup CLI command:
  no ip explicit-path name __dynamic_tunnel65436
  end
TE_AUTO_TUN: FRR lopri process rcv msg TSPTUN_CREATE_BACKUP, ready=1
TE_AUTO_TUN: Create Auto tunnel
TE_AUTO_TUN: backup CLI command:
  interface tunnel65436
  no logging event link-status
  ip unnumbered Loopback0
  tunnel destination 2.2.2.2
  tunnel mode mpls traffic-eng
  end
TE_AUTO_TUN: backup CLI command:
  interface tunnel65436
  no tunnel mpls traffic-eng affinity
  end
TE_AUTO_TUN: CLI command:
  ip explicit-path name __dynamic_tunnel65436
  index 1 exclude-address 132.2.3.3
  index 2 exclude-srlg 132.2.3.3
TE_AUTO_TUN: backup CLI command:
  interface tunnel65436
  tunnel mpls traffic-eng path-option 1 exp name __dynamic_tunnel65436
  end
%MPLS_TE-5-TUN: Tun65436: installed LSP 65436_1 (popt 1) for nil, got 1st feasible path opt
%MPLS_TE-5-LSP: LSP 3.3.3.3 65436_1: UP
%MPLS_TE-5-TUN: Tun65436: LSP path change 65436_1 for nil, normal

The tunnel still comes up, but avoids the link to CSR9 (SRLG) while also avoiding the link to CSR2 (NHOP protection). Specifically, the path routes via CSR10 > CSR9 > CSR2 to protect the direct link from CSR3 to CSR2 while remaining compliant with the SRLG policy. Technically, this is a valid NHOP tunnel, despite transiting the NNHOP router (CSR10) along the way. Notice that the SRLG exclusion method is "forced".

CSR3#show mpls traffic-eng auto-tunnel backup
State: Enabled
Auto backup tunnels: 1 (up: 1, down: 0)
Tunnel ID Range: 65436 - 65535
Create Nhop Only: Yes
Check for deletion of unused tunnels every: 3600 Sec
SRLG Exclude: Forced
Config:
  unnumbered-interface: Loopback0
  Affinity/Mask: 0x0/0xFFFF

CSR3#show mpls traffic-eng tunnels tunnel 65436 | section RSVP Path
  RSVP Path Info:
    My Address: 132.3.10.3
    Explicit Route: 132.3.10.10 132.9.10.9 132.2.9.2 2.2.2.2
    Record Route: NONE
    Tspec: ave rate=0 kbits, burst=1000 bytes, peak rate=0 kbits

CSR3 records this NHOP tunnel as a candidate to back up the primary LSP from CSR1 to XRv11. As expected, the PLR (CSR3) informs the headend (CSR1) that NHOP protection is available.

CSR3#show mpls traffic-eng fast-reroute database labels 3016 detail | begin 100
Tun ID: 100, LSP ID: 25, Source: 1.1.1.1 Destination: 11.11.11.11
  State : ready
  InLabel : 3016
  OutLabel : Gi2.523:2016
  FRR OutLabel : Tu65436:2016

CSR1#show ip rsvp reservation detail filter session-type 7 tunnel-id 100 | begin RRO
  RRO:
    4.4.4.4/32, Flags:0x20 (No Local Protection, Node-id)
      Label subobject: Flags 0x1, C-Type 1, Label 4017
    3.3.3.3/32, Flags:0x21 (Local Prot Avail/to NHOP, Node-id)
      Label subobject: Flags 0x1, C-Type 1, Label 3016
    2.2.2.2/32, Flags:0x20 (No Local Protection, Node-id)
      Label subobject: Flags 0x1, C-Type 1, Label 2016
[snip]

Breaking the link between CSR2 and CSR3, the NHOP tunnel transitions to the active state. The LFIB reflects the corresponding swap and push operations as discussed earlier.

! CSR1
%MPLS_TE-5-LSP: LSP 1.1.1.1 100_26: No addresses to connect 3.3.3.3 to 2.2.2.2
%MPLS_TE-5-LSP: LSP 1.1.1.1 100_25: Path Error from 132.3.4.3: Notify: Tunnel locally repaired (flags 0)

CSR3#show mpls traffic-eng fast-reroute database labels 3016 detail | begin 100
Tun ID: 100, LSP ID: 25, Source: 1.1.1.1 Destination: 11.11.11.11
  State : active
  InLabel : 3016
  OutLabel : Gi2.523:2016
  FRR OutLabel : Tu65436:2016

CSR3#show mpls forwarding-table labels 3016 detail
Local      Outgoing   Prefix             Bytes Label   Outgoing   Next Hop
Label      Label      or Tunnel Id       Switched      interface
3016       2016       1.1.1.1 100 [25]   654           Tu65436    point2point
        MAC/Encaps=18/26, MRU=1496, Label Stack{10015 2016}, via Gi2.530
        000C29497164000C29D781FE81000DCA8847 0271F000007E0000
        No output feature configured

We can verify this LSP using MPLS OAM. Notice that CSR10 appears twice in the path. Ideally, the backup tunnels won’t traverse nodes in the protected LSP, but this is technically acceptable.

CSR1#traceroute mpls traffic-eng tunnel 100
Tracing MPLS TE Label Switched Path on Tunnel100, timeout is 2 seconds
[snip]
Type escape sequence to abort.
  0 132.1.4.1 MRU 1500 [Labels: 4017 Exp: 0]
L 1 132.1.4.4 MRU 1500 [Labels: 3016 Exp: 0] 2 ms
L 2 132.3.4.3 MRU 1500 [Labels: 10015/2016 Exp: 0/0] 28 ms
L 3 132.3.10.10 MRU 1500 [Labels: 9014/2016 Exp: 0/0] 28 ms
L 4 132.9.10.9 MRU 1500 [Labels: 2016 Exp: 0] 33 ms
L 5 132.2.9.2 MRU 1500 [Labels: 10014 Exp: 0] 33 ms
L 6 132.2.10.10 MRU 1500 [Labels: 92012 Exp: 0] 25 ms
L 7 132.10.12.12 MRU 1500 [Labels: implicit-null Exp: 0] 25 ms
! 8 132.11.12.11 37 ms

Next, we will analyze the other SRLG option, which is to prefer non-SRLG paths. The "force" option only created a single path-option that excluded the link address and SRLG (Boolean AND logic). The "preferred" option uses the same first path-option as the "force" option, which means it first tries to find a path that satisfies both conditions. The second path-option is more relaxed and represents a regular NHOP tunnel with no SRLG restrictions as a fallback mechanism. To make the test more interesting, we will break the links between CSR3 and CSR10, as well as CSR3 and CSR1. This means that CSR3 will likely route back via CSR4 to make the NHOP tunnel to CSR2, creating a microloop. This is the same behavior we would have seen with the force option as well, since the path-option sequence still makes it very unlikely that an SRLG path will be selected, as desired. ! CSR3 mpls traffic-eng auto-tunnel backup srlg exclude preferred ! CSR3 TE_AUTO_TUN: FRR lopri process rcv msg TSPTUN_BACKUP_RPC_REMOVE_ALL, ready=1 %MPLS_TE-5-LSP: LSP 3.3.3.3 65436_1: DOWN: signalling shutdown %MPLS_TE-5-TUN: Tun65436: installed LSP nil for 65436_1 (popt 1), signalling shutdown %MPLS_TE-5-TUN: Tun65436: LSP path change nil for 65436_1, signalling shutdown TE_AUTO_TUN: backup CLI command: no ip explicit-path name __dynamic_tunnel65436 end TE_AUTO_TUN: FRR lopri process rcv msg TSPTUN_CREATE_BACKUP, ready=1 TE_AUTO_TUN: Create Auto tunnel TE_AUTO_TUN: backup CLI command: interface tunnel65436 no logging event link-status ip unnumbered Loopback0 tunnel destination 2.2.2.2

1419 © 2016 Nicholas J. Russo

 tunnel mode mpls traffic-eng
 end
TE_AUTO_TUN: backup CLI command:
 interface tunnel65436
 no tunnel mpls traffic-eng affinity
 end
TE_AUTO_TUN: CLI command:
 ip explicit-path name __dynamic_tunnel65436
 index 1 exclude-address 132.2.3.3
 index 2 exclude-srlg 132.2.3.3
TE_AUTO_TUN: CLI command:
 ip explicit-path name __dynamic_tunnel65436_pathopt2
 index 1 exclude-address 132.2.3.3
TE_AUTO_TUN: backup CLI command:
 interface tunnel65436
 tunnel mpls traffic-eng path-option 1 exp name __dynamic_tunnel65436
 end
TE_AUTO_TUN: backup CLI command:
 interface tunnel65436
 tunnel mpls traffic-eng path-option 2 exp name __dynamic_tunnel65436_pathopt2
 end
%MPLS_TE-5-LSP: LSP 3.3.3.3 65436_3: UP
%MPLS_TE-5-TUN: Tun65436: LSP path change 65436_3 for nil, reoptimization
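The two path-options generated in the debug above can be modeled in a few lines. The following is only a minimal sketch, not the router's actual CSPF: the topology, the node names (A through E), and the function names are hypothetical, and a plain hop-count breadth-first search stands in for the real constraint-based computation. It shows the difference between "force" (one strict path-option) and "preferred" (strict first, then a relaxed fallback that only avoids the protected link).

```python
from collections import deque

# Hypothetical topology (NOT the lab topology): link -> set of SRLG values.
LINKS = {
    ("A", "B"): {29},   # the protected link
    ("A", "C"): {29},   # shares SRLG 29 with the protected link
    ("C", "B"): set(),
    ("A", "D"): set(),
    ("D", "E"): set(),
    ("E", "B"): set(),
}

def shortest_path(links, src, dst):
    """Hop-count shortest path via BFS; returns a node list or None."""
    adj = {}
    for a, b in links:
        adj.setdefault(a, []).append(b)
        adj.setdefault(b, []).append(a)
    queue = deque([[src]])
    seen = {src}
    while queue:
        path = queue.popleft()
        if path[-1] == dst:
            return path
        for nxt in adj.get(path[-1], []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(path + [nxt])
    return None

def backup_path(links, protected, mode):
    """mode is 'force' or 'preferred' (names assumed for this sketch)."""
    srlgs = links[protected]
    src, dst = protected
    # Relaxed constraint: only the protected link is excluded.
    no_link = {l: s for l, s in links.items() if l != protected}
    # Strict constraint: also exclude every link sharing an SRLG with it.
    no_srlg = {l: s for l, s in no_link.items() if not (s & srlgs)}
    path = shortest_path(no_srlg, src, dst)        # path-option 1
    if path is None and mode == "preferred":
        path = shortest_path(no_link, src, dst)    # path-option 2
    return path
```

With the full hypothetical topology, both modes pick the SRLG-free path A, D, E, B. If the D-E link is removed, "force" finds nothing, while "preferred" falls back to A, C, B even though the A-C link shares SRLG 29.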

We can also verify the configuration without using debugs, as shown below. The SRLG exclusion mode is now "preferred". The ERO for the particular tunnel is also shown; notice that the path routes via CSR4 > CSR9 > CSR2.

CSR3#show mpls traffic-eng auto-tunnel backup
State: Enabled
Auto backup tunnels: 1 (up: 1, down: 0)
Tunnel ID Range: 65436 - 65535
Create Nhop Only: Yes
Check for deletion of unused tunnels every: 3600 Sec
SRLG Exclude: Preferred
Config:
 unnumbered-interface: Loopback0
 Affinity/Mask: 0x0/0xFFFF

CSR3#show mpls traffic-eng tunnels tunnel 65436 | section RSVP Path
RSVP Path Info:
 My Address: 132.3.4.3
 Explicit Route: 132.3.4.4 132.4.9.9 132.2.9.2 2.2.2.2
 Record Route: NONE
 Tspec: ave rate=0 kbits, burst=1000 bytes, peak rate=0 kbits

We perform the basic checks for verification. CSR1 is aware that NHOP protection is available because the PLR is tracking it in its FRR database. The local label for CSR3 (for the protected path) has changed to 3015 since the LSP was signaled after the last test.

CSR1#show ip rsvp reservation detail filter session-type 7 tunnel-id 100 | begin RRO
RRO:
 4.4.4.4/32, Flags:0x20 (No Local Protection, Node-id)
  Label subobject: Flags 0x1, C-Type 1, Label 4016
 3.3.3.3/32, Flags:0x21 (Local Prot Avail/to NHOP, Node-id)
  Label subobject: Flags 0x1, C-Type 1, Label 3015
 2.2.2.2/32, Flags:0x20 (No Local Protection, Node-id)
  Label subobject: Flags 0x1, C-Type 1, Label 2015
[snip]

CSR3#show mpls traffic-eng fast-reroute database labels 3015 detail | begin 100
Tun ID: 100, LSP ID: 36, Source: 1.1.1.1
Destination: 11.11.11.11
State : ready
InLabel : 3015
OutLabel : Gi2.523:2015
FRR OutLabel : Tu65436:2015

Breaking the link between CSR2 and CSR3, we can see that the NHOP protection still works despite it being suboptimal.

! CSR1
%MPLS_TE-5-LSP: LSP 1.1.1.1 100_37: No addresses to connect 3.3.3.3 to 2.2.2.2
%MPLS_TE-5-LSP: LSP 1.1.1.1 100_36: Path Error from 132.3.4.3: Notify: Tunnel locally repaired (flags 0)

CSR1#show ip rsvp reservation detail filter session-type 7 tunnel-id 100 | begin RRO
RRO:
 4.4.4.4/32, Flags:0x20 (No Local Protection, Node-id)
  Label subobject: Flags 0x1, C-Type 1, Label 4016
 3.3.3.3/32, Flags:0x23 (Local Prot Avail/In Use/to NHOP, Node-id)
  Label subobject: Flags 0x1, C-Type 1, Label 3015
 2.2.2.2/32, Flags:0x20 (No Local Protection, Node-id)
  Label subobject: Flags 0x1, C-Type 1, Label 2015
[snip]

CSR3#show mpls traffic-eng fast-reroute database labels 3015 detail | begin 100
Tun ID: 100, LSP ID: 36, Source: 1.1.1.1
Destination: 11.11.11.11
State : active
InLabel : 3015
OutLabel : Gi2.523:2015
FRR OutLabel : Tu65436:2015

CSR3#show mpls forwarding-table labels 3015 detail
Local      Outgoing   Prefix             Bytes Label   Outgoing   Next Hop
Label      Label      or Tunnel Id       Switched      interface
3015       2015       1.1.1.1 100 [36]   1102          Tu65436    point2point
        MAC/Encaps=18/26, MRU=1496, Label Stack{4017 2015}, via Gi2.534
        000C29D8E5F5000C29D781FE81000DCE8847 00FB1000007DF000
        No output feature configured

Testing with MPLS traceroute, we can see the traffic immediately routes back to CSR4 using a different label for the FRR tunnel. We still avoid the protected link as well as the SRLGs.

CSR1#traceroute mpls traffic-eng tunnel 100
Tracing MPLS TE Label Switched Path on Tunnel100, timeout is 2 seconds
[snip]
Type escape sequence to abort.
  0 132.1.4.1 MRU 1500 [Labels: 4016 Exp: 0]
L 1 132.1.4.4 MRU 1500 [Labels: 3015 Exp: 0] 1 ms
L 2 132.3.4.3 MRU 1500 [Labels: 4017/2015 Exp: 0/0] 28 ms
L 3 132.3.4.4 MRU 1500 [Labels: 9016/2015 Exp: 0/0] 26 ms
L 4 132.4.9.9 MRU 1500 [Labels: 2015 Exp: 0] 22 ms
L 5 132.2.9.2 MRU 1500 [Labels: 10013 Exp: 0] 22 ms
L 6 132.2.10.10 MRU 1500 [Labels: 92013 Exp: 0] 28 ms
L 7 132.10.12.12 MRU 1500 [Labels: implicit-null Exp: 0] 22 ms
! 8 132.11.12.11 39 ms

We can enable this feature on CSR10 as well to test additional minor features. CSR10 will be configured to provide NHOP and NNHOP protection using paths that are blue. The tunnel IDs of the automatically created backup tunnels will be in the range 1500 to 1505, and tunnels not in use (that is, not bound to an LSP requesting FRR) should be removed every minute, building in a 10-second buffer before removal.

! CSR10
mpls traffic-eng auto-tunnel backup timers removal unused 50 10
mpls traffic-eng auto-tunnel backup tunnel-num min 1500 max 1505
mpls traffic-eng auto-tunnel backup config affinity 0x4 mask 0x4

When using an aggressive scan interval, the parser warns you about router performance. Scanning every 50 seconds to discover unused tunnels, then removing them 10 seconds later, meets the 1-minute removal requirement. We can verify the basic configuration was accepted as well. Notice that only one tunnel is shown when we expected to see two.

CSR10(config)#mpls traffic-eng auto-tunnel backup timers removal unused 50 10
AUTO TUNNEL: Note that a scan interval of less than 60 seconds, may result in high CPU utilization.

CSR10#show mpls traffic-eng auto-tunnel backup
State: Enabled
Auto backup tunnels: 1 (up: 1, down: 0)
Tunnel ID Range: 1500 - 1505
Create Nhop Only: No
Check for deletion of unused tunnels every: 50 Sec
SRLG: Not configured
Config:
 unnumbered-interface: Loopback0
 Affinity/Mask: 0x4/0x4
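The Affinity/Mask pair shown above (0x4/0x4) is easier to reason about with a small worked example. This is a minimal sketch of the usual TE affinity test, in which a link is eligible when its attribute flags, restricted to the mask bits, equal the tunnel affinity restricted to the same bits. The function name and the sample attribute values are assumptions for illustration; treating bit 0x4 as "blue" follows the configuration above.

```python
BLUE = 0x4  # assumed color bit, matching "affinity 0x4 mask 0x4" above

def link_is_eligible(link_attrs, affinity, mask):
    """Return True if a link's attribute flags satisfy the affinity/mask pair.

    Only the bits set in `mask` are compared; all other bits are "don't care".
    """
    return (link_attrs & mask) == (affinity & mask)

# Hypothetical link attribute values, not taken from the lab.
blue_only = 0x4
blue_and_other = 0x5
not_blue = 0x1
```

With affinity 0x4 and mask 0x4, both `blue_only` and `blue_and_other` qualify because only the blue bit is examined, while `not_blue` is rejected. With the XE default of 0x0/0xFFFF seen earlier, only a link with no attribute bits set in the low 16 bits would qualify.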

Checking the path of the tunnel, we can see that it takes a relatively long path via CSR5 > XRv11 > XRv12. The tunnel even traverses the NNHOP, which is expected since this is a different LSP than the main one (used for FRR only). Checking the tunnel labels to determine the incoming label on the tunnel transiting CSR10 (midpoint router), we see label 10015, as well as the FRR tunnel. The FRR database confirms that tunnel1501 is protecting the original LSP. The reason the NHOP tunnel did not get created is that the NHOP path transits the NNHOP node, so using NHOP protection makes no sense. The auto-tunnel feature is smart enough to account for this case and will conserve resources.

CSR10#show ip rsvp sender detail filter session-type 7 tunnel-id 1501 | section outgoing
ERO: (outgoing)
 132.5.10.5 (Strict IPv4 Prefix, 8 bytes, /32)
 132.5.11.11 (Strict IPv4 Prefix, 8 bytes, /32)
 11.11.11.11 (Strict IPv4 Prefix, 8 bytes, /32)

CSR10#show mpls traffic-eng tunnels role middle | include Label
InLabel : GigabitEthernet2.520, 10015
OutLabel : GigabitEthernet2.502, 92012
FRR OutLabel : Tunnel1501, implicit-null

CSR10#show mpls traffic-eng fast-reroute database labels 10015 detail | begin 100
Tun ID: 100, LSP ID: 47, Source: 1.1.1.1
Destination: 11.11.11.11
State : ready
InLabel : 10015
OutLabel : Gi2.502:92012
FRR OutLabel : Tu1501:implicit-null

Since XRv does not support BFD, FRR is not going to work well, so we won't test the data plane on this tunnel. Testing the feature briefly on XRv, we will create a tunnel using explicit-paths from XRv11 to CSR1. The tunnel will request FRR and XRv12 will create the auto-tunnels. The configuration in XR is much more limited than XE at the global level. The only affinity option is ignore-all, there is no option for NHOP-only, and there is only one removal timer (scan time only; unused tunnels are removed immediately). Additional options are available at the link level, such as NHOP-only, SRLG exclusion, and tunnel attribute-sets. For variety, we will build NHOP and NNHOP tunnels to protect the LSPs traversing the XRv12-CSR10 link and CSR10 as a node. We will use the strict SRLG option to help engineer the backup tunnels. SRLGs configured on XR are similar to IP FRR, unlike XE which has different mechanisms between IP and TE FRR.

! XRv11
interface tunnel-te100
 description BASIC TE FRR (AUTO BACKUP)
 ipv4 unnumbered Loopback0
 logging events all
 destination 1.1.1.1
 fast-reroute
 affinity ignore
 path-option 10 explicit name EP_11_12_6_10_3_1

! XRv12
mpls traffic-eng
 interface GigabitEthernet0/0/0/0.562
  auto-tunnel backup
   exclude srlg
 auto-tunnel backup
  timers removal unused 0
  tunnel-id min 1200 max 1210
  affinity ignore
srlg
 interface GigabitEthernet0/0/0/0.502
  8 value 100
 interface GigabitEthernet0/0/0/0.512
  8 value 100
 interface GigabitEthernet0/0/0/0.562
  8 value 100

We can verify the configuration was successful by checking the TE topology's view of all SRLGs, as well as the basic auto-tunnel show command. XR provides very detailed information where XE provides only a summary. We can also specify the SRLG value, IGP PID with area/level, and other filters. Notice that XRv12 can see the SRLGs we configured earlier since these are carried in the TE topology. Using consistent SRLG values network-wide is important since they have global significance. Of note, we see that the tunnels are down for some reason.

RP/0/0/CPU0:XRv12#show mpls traffic-eng topology srlg ospf 132 area 0
SRLG       Admin Weight   Interface Addr TE Router ID   IGP Area ID
__________ ______________ ______________ ______________ _______________
29         1              132.3.9.3      3.3.3.3        OSPF 132 area 0
29         1              132.2.3.3      3.3.3.3        OSPF 132 area 0
100        1              132.6.12.12    12.12.12.12    OSPF 132 area 0
100        1              132.11.12.12   12.12.12.12    OSPF 132 area 0
100        1              132.10.12.12   12.12.12.12    OSPF 132 area 0

RP/0/0/CPU0:XRv12#show mpls traffic-eng auto-tunnel backup
AutoTunnel Backup Configuration:
  Interfaces count: 2
  Unused removal timeout: 0s
  Configured tunnel number range: 1200-1210
AutoTunnel Backup Summary:
  AutoTunnel Backups:
    2 created, 0 up, 2 down, 2 unused
    1 NHOP, 1 NNHOP, 0 SRLG strict, 0 SRLG preferred, 0 SRLG weighted
  Protected LSPs:
    0 NHOP, 0 NHOP+SRLG
    0 NNHOP, 0 NNHOP+SRLG
  Protected S2L Sharing Families:
    0 NHOP, 0 NHOP+SRLG
    0 NNHOP, 0 NNHOP+SRLG
  Protected S2Ls:
    0 NHOP, 0 NHOP+SRLG
    0 NNHOP, 0 NNHOP+SRLG
Cumulative Counters (last cleared 20:04:07 ago):
                      Total  NHOP  NNHOP
  Created:            2      1     1
  Connected:          2      1     1
  Removed (down):     0      0     0
  Removed (unused):   0      0     0
  Removed (in use):   0      0     0
  Range exceeded:     0      0     0
AutoTunnel Backups:
Tunnel         State   Protection   Prot.   Protected       Protected
Name                   Offered      Flows*  Interface       Node
-------------- ------- ------------ ------- --------------- ---------------
tunnel-te1200  down    NNHOP        0       Gi0/0/0/0.562   6.6.6.6
tunnel-te1201  down    NHOP         0       Gi0/0/0/0.562   N/A


Looking into the details of tunnel1200 as an example, we see several auto-tunnel backup parameters, as well as the reason for the tunnel failure. We need to specify the IPv4 unnumbered address somehow, but there wasn't an option under the MPLS-TE stanza. Instead, we use a global command to specify that loopback0 should be used by default, an assumption not made by XR. Finding the problem is easy, as shown below, but knowing the command to fix it can be difficult.

RP/0/0/CPU0:XRv12#show mpls traffic-eng tunnels 1200 | begin Auto Backup
  Auto Backup:
    Protected LSPs: 0
    Protected S2L Sharing Families: 0
    Protected S2L: 0
    Protected i/f: Gi0/0/0/0.562
    Protected node: 6.6.6.6
    Attribute-set: Not configured
    Protection: NNHOP
    Unused removal timeout: not running
  Reason for the tunnel being down: No IP source address is configured
Displayed 1 (of 2) heads, 0 (of 2) midpoints, 0 (of 0) tails
Displayed 0 up, 1 down, 0 recovering, 0 recovered heads

! XRv12
ipv4 unnumbered mpls traffic-eng loopback0

Now, the tunnels come up. I've cleared the tunnels several times, so their numbers are now 1207 and 1208. Tunnel1207 is NNHOP and tunnel1208 is NHOP, both of which are operational and also offer SRLG protection. The summary view shows that the NNHOP tunnel is protecting one LSP at present. The output also specifies that both tunnels are operating in SRLG “strict” mode, which is like “force” mode in XE.

RP/0/0/CPU0:XRv12#show mpls traffic-eng auto-tunnel backup
AutoTunnel Backup Configuration:
  Interfaces count: 1
  Unused removal timeout: 0s
  Configured tunnel number range: 1200-1210
AutoTunnel Backup Summary:
  AutoTunnel Backups:
    2 created, 2 up, 0 down, 1 unused
    1 NHOP, 1 NNHOP, 2 SRLG strict, 0 SRLG preferred, 0 SRLG weighted
  Protected LSPs:
    0 NHOP, 0 NHOP+SRLG
    0 NNHOP, 1 NNHOP+SRLG
  Protected S2L Sharing Families:
    0 NHOP, 0 NHOP+SRLG
    0 NNHOP, 0 NNHOP+SRLG
  Protected S2Ls:
    0 NHOP, 0 NHOP+SRLG
    0 NNHOP, 0 NNHOP+SRLG
Cumulative Counters (last cleared 21:59:16 ago):
                      Total  NHOP  NNHOP
  Created:            9      5     4
  Connected:          35     12    23
  Removed (down):     5      3     2
  Removed (unused):   1      1     0
  Removed (in use):   1      0     1
  Range exceeded:     0      0     0
AutoTunnel Backups:
Tunnel         State   Protection   Prot.   Protected       Protected
Name                   Offered      Flows*  Interface       Node
-------------- ------- ------------ ------- --------------- ---------------
tunnel-te1207  up      NNHOP+SRLG   1       Gi0/0/0/0.562   6.6.6.6
tunnel-te1208  up      NHOP+SRLG    0       Gi0/0/0/0.562   N/A
*Prot. Flows = Total Protected LSPs, S2Ls and S2L Sharing Families

Looking into the NNHOP tunnel details, we can see that we have protected the link to CSR6 as well as CSR6 itself as a node. The SRLG dictates we cannot use the direct link to CSR10, which would have been convenient and a shorter path. Instead, we use CSR9 for NNHOP protection. We can see this tunnel is being used since it is protecting an LSP.

RP/0/0/CPU0:XRv12#show mpls traffic-eng tunnels 1207 | begin Auto Backup
  Auto Backup:
    Protected LSPs: 1
    Protected S2L Sharing Families: 0
    Protected S2L: 0
    Protected i/f: Gi0/0/0/0.562
    Protected node: 6.6.6.6
    Attribute-set: Not configured
    Protection: NNHOP+SRLG (SRLG strict)
    Unused removal timeout: not running
  Path info (OSPF 132 area 0):
  Node hop count: 2
  Hop0: 132.9.12.9
  Hop1: 132.9.10.10
  Hop2: 10.10.10.10

The NHOP tunnel is also up and transiting CSR9, but is not protecting the node 6.6.6.6 (CSR6). NNHOP is preferred over NHOP when both are available and all other things are considered equal. This second tunnel protects an interface (a link) only.

RP/0/0/CPU0:XRv12#show mpls traffic-eng tunnels 1208 | begin Auto Backup
  Auto Backup:
    Protected LSPs: 0
    Protected S2L Sharing Families: 0
    Protected S2L: 0
    Protected i/f: Gi0/0/0/0.562
    Attribute-set: Not configured
    Protection: NHOP+SRLG (SRLG strict)
    Unused removal timeout: not running
  Path info (OSPF 132 area 0):
  Node hop count: 2
  Hop0: 132.9.12.9
  Hop1: 132.6.9.6
  Hop2: 6.6.6.6

We can quickly validate the backup LSPs using MPLS OAM. Notice the different labels in use, which we verify by checking the tunnel details. These two outputs represent the NNHOP and NHOP tunnels respectively.

RP/0/0/CPU0:XRv12#traceroute mpls traffic-eng tunnel-te 1207
Tracing MPLS TE Label Switched Path on tunnel-te1207, timeout is 2 seconds
[snip]
Type escape sequence to abort.
  0 132.9.12.12 MRU 1500 [Labels: 9017 Exp: 0]
L 1 132.9.12.9 MRU 1500 [Labels: implicit-null Exp: 0] 0 ms
! 2 132.9.10.10 10 ms

RP/0/0/CPU0:XRv12#traceroute mpls traffic-eng tunnel-te 1208
Tracing MPLS TE Label Switched Path on tunnel-te1208, timeout is 2 seconds
[snip]
Type escape sequence to abort.
  0 132.9.12.12 MRU 1500 [Labels: 9020 Exp: 0]
L 1 132.9.12.9 MRU 1500 [Labels: implicit-null Exp: 0] 10 ms
! 2 132.6.9.6 1 ms

RP/0/0/CPU0:XRv12#show mpls traffic-eng tunnels 1207 detail | include Label
    Outgoing Interface: GigabitEthernet0/0/0/0.592, Outgoing Label: 9017
RP/0/0/CPU0:XRv12#show mpls traffic-eng tunnels 1208 detail | include Label
    Outgoing Interface: GigabitEthernet0/0/0/0.592, Outgoing Label: 9020

Checking the FRR database confirms that the original tunnel is protected with NNHOP support. We can tell because the NNHOP label 10013 is used (the local label on CSR10) and the tunnel ID is 1207. The headend RSVP RESV RRO also shows protection available with the "node" flag, implying NNHOP protection.

RP/0/0/CPU0:XRv12#show mpls traffic-eng fast-reroute database
LSP midpoint FRR information:
LSP Identifier                Local  Out Intf/          FRR Intf/        Status
                              Label  Label              Label
----------------------------- ------ ------------------ ---------------- ------
11.11.11.11 100 [2]           92012  Gi0/0/0/0.562:6014 tt1207:10013     Ready

RP/0/0/CPU0:XRv11#show mpls traffic-eng tunnels 100 detail | begin Resv
  Resv Info:
    Record Route:
      IPv4 12.12.12.12, flags 0x29 (Node-ID, Protection: available, node)
        Label 92012, flags 0x1
      IPv4 132.11.12.12, flags 0x9 (Protection: available, node)
        Label 92012, flags 0x1
      IPv4 6.6.6.6, flags 0x20 (Node-ID)
        Label 6014, flags 0x1
      IPv4 10.10.10.10, flags 0x20 (Node-ID)
        Label 10013, flags 0x1
      IPv4 3.3.3.3, flags 0x21 (Node-ID, Protection: available)
        Label 3016, flags 0x1
      IPv4 1.1.1.1, flags 0x20 (Node-ID)
        Label 3, flags 0x1
    Fspec: avg rate=0 kbits, burst=1000 bytes, peak rate=0 kbits

To build only NHOP tunnels, we configure this on a per-interface basis in XR. We will apply this to XRv12 so that the NHOP tunnel is used. Because we did not enable debug earlier, we will do it quickly now to watch XRv12 tear down the existing NNHOP tunnel without touching the NHOP tunnel. The debug essentially verifies NHOP protection and disables NNHOP protection.

! XRv12
mpls traffic-eng
 interface GigabitEthernet0/0/0/0.562
  auto-tunnel backup
   nhop-only

! XRv12
te_control[1044]: DBG-AUTOBKP[1]: te_verify_mpls_te_auto_backup:2999: Verify: Adding auto-tunnel backup on I/F GigabitEthernet0_0_0_0.562
te_control[1044]: DBG-AUTOBKP[1]: te_verify_mpls_te_auto_backup_nhop_only:3138: Verify: Adding auto-tunnel backup NHOP only on I/F GigabitEthernet0_0_0_0.562
te_control[1044]: DBG-AUTOBKP[1]: te_apply_mpls_te_auto_backup:3272: Apply: Adding auto-tunnel backup on I/F GigabitEthernet0_0_0_0.562
te_control[1044]: DBG-AUTOBKP[1]: te_frr_autobackup_enable:2505: protected interface GigabitEthernet0_0_0_0.562 already present
te_control[1044]: DBG-AUTOBKP[1]: te_frr_autobackup_enable:2510: protected interface GigabitEthernet0_0_0_0.562 already configured
te_control[1044]: DBG-AUTOBKP[1]: te_apply_mpls_te_auto_backup_nhop_only:3371: Apply: Adding auto-tunnel backup NHOP only on I/F GigabitEthernet0_0_0_0.562
te_control[1044]: DBG-AUTOBKP[1]: te_frr_autobackup_set_cfg:2663: removing NNHOP backups on GigabitEthernet0_0_0_0.562
te_control[1044]: DBG-AUTOBKP[1]: te_frr_autobackup_clear_intf:1408: checking GigabitEthernet0_0_0_0.562, backup tunnel 1207
te_control[1044]: DBG-AUTOBKP[1]: te_frr_autobackup_start_unused_pruning_timer:583: Pruning timer is disabled. Return.
te_control[1044]: DBG-AUTOBKP[1]: te_frr_autobackup_remove:1236: Auto-tunnel backup 1207 removed
te_control[1044]: DBG-AUTOBKP[1]: te_frr_autobackup_clear_intf:1408: checking GigabitEthernet0_0_0_0.562, backup tunnel 1208
te_control[1044]: DBG-AUTOBKP[1]: te_frr_autobackup_stop_unused_pruning_timer:461: pruning timer to stop is not running for bkup 1208
te_control[1044]: DBG-AUTOBKP[1]: te_frr_autobackup_need_to_create_nnhop:1701 (Tunnel ID: 100): NNHOP not configured GigabitEthernet0_0_0_0.562
te_control[1044]: DBG-AUTOBKP[1]: te_frr_autobackup_need_to_create_nhop:1622 (Tunnel ID: 100): already nhop protected GigabitEthernet0_0_0_0.562

Now, we can see that the NHOP tunnel is protecting the LSP since NNHOP protection is not available. We confirm this by checking the FRR database as well. The head-end also shows that protection is available, but does not specify node protection, so NHOP protection is implied.

RP/0/0/CPU0:XRv12#show mpls traffic-eng tunnels 1208 | begin Auto Backup
  Auto Backup:
    Protected LSPs: 1
    Protected S2L Sharing Families: 0
    Protected S2L: 0
    Protected i/f: Gi0/0/0/0.562
    Attribute-set: Not configured
    Protection: NHOP+SRLG (SRLG strict)
    Unused removal timeout: not running
  Path info (OSPF 132 area 0):
  Node hop count: 2
  Hop0: 132.9.12.9
  Hop1: 132.6.9.6
  Hop2: 6.6.6.6

RP/0/0/CPU0:XRv12#show mpls traffic-eng fast-reroute database
LSP midpoint FRR information:
LSP Identifier                Local  Out Intf/          FRR Intf/        Status
                              Label  Label              Label
----------------------------- ------ ------------------ ---------------- ------
11.11.11.11 100 [2]           92012  Gi0/0/0/0.562:6014 tt1208:6014      Ready

RP/0/0/CPU0:XRv11#show mpls traffic-eng tunnels 100 detail | begin Resv
  Resv Info:
    Record Route:
      IPv4 12.12.12.12, flags 0x21 (Node-ID, Protection: available)
        Label 92012, flags 0x1
      IPv4 132.11.12.12, flags 0x1 (Protection: available)
        Label 92012, flags 0x1
      IPv4 6.6.6.6, flags 0x20 (Node-ID)
        Label 6014, flags 0x1
      IPv4 10.10.10.10, flags 0x20 (Node-ID)
        Label 10013, flags 0x1
      IPv4 3.3.3.3, flags 0x21 (Node-ID, Protection: available)
        Label 3016, flags 0x1
      IPv4 1.1.1.1, flags 0x20 (Node-ID)
        Label 3, flags 0x1
    Fspec: avg rate=0 kbits, burst=1000 bytes, peak rate=0 kbits
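The RRO flag values that appear throughout these outputs (0x20, 0x21, 0x23, 0x9, 0x29) are bitmasks, and decoding them by hand gets tedious. The sketch below decodes them using the RRO IPv4 subobject flag bits as I understand them from RFC 3209 (local protection available / in use), RFC 4090 (bandwidth and node protection), and RFC 4561 (node-ID). The table and function names are my own, and any bits not listed are simply ignored by this sketch.

```python
# RRO subobject flag bits (per RFC 3209, RFC 4090, and RFC 4561).
RRO_FLAGS = {
    0x01: "local protection available",
    0x02: "local protection in use",
    0x04: "bandwidth protection",
    0x08: "node protection",
    0x20: "node-id",
}

def decode_rro_flags(value):
    """Return the names of all known flag bits set in `value`, lowest bit first."""
    return [name for bit, name in sorted(RRO_FLAGS.items()) if value & bit]

for flags in (0x20, 0x21, 0x23, 0x29):
    print(hex(flags), decode_rro_flags(flags))
```

Read this way, 0x21 is a node-ID subobject with link (NHOP) protection available, 0x23 adds "in use" after the failure, and 0x29 adds the node protection bit that XR prints as "node", which is how the head-end distinguishes NNHOP from NHOP protection.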

XE supports primary auto-tunnels as well, which create a one-hop tunnel to every directly connected peer reachable through a TE-enabled interface. There is no way to ignore affinity for these tunnels, since link-coloring doesn't make sense in conjunction with this feature. That is why I removed most of the link colors from the XE-dense parts of the network. We will enable this feature on CSR8 using a single command and with debugging enabled. Like auto-backup, these tunnel commands are shown clearly in the debug. Unlike auto-backup, whose default range is 65436 to 65535, the default range of tunnels is 65336 to 65435, so the two do not overlap. The tunnels also have logging enabled, so we will see their statuses. Each tunnel is built to the one-hop neighbors using an explicit-path to specify using the interface towards that neighbor (these neighbor addresses are interpreted from the TED). Most importantly, each tunnel requests FRR protection and is configured with autoroute announce. Now, traffic to each peer is routed inside of this one-hop tunnel and is FRR-protected with a single command.

! CSR8
mpls traffic-eng auto-tunnel primary onehop

CSR8#debug mpls traffic-eng auto-tunnel primary all
TE_AUTO_TUN: Found a new router id 1.1.1.1 off of GigabitEthernet2.518
TE_AUTO_TUN: Create Auto tunnel
TE_AUTO_TUN: primary CLI command:
 interface tunnel65336
 no logging event link-status
 ip unnumbered Loopback0
 tunnel destination 1.1.1.1
 tunnel mode mpls traffic-eng
 end
TE_AUTO_TUN: primary CLI command:
 interface tunnel65336
 tunnel mpls traffic-eng autoroute announce
 tunnel mpls traffic-eng fast-reroute
 end
TE_AUTO_TUN: CLI command:
 ip explicit-path name __dynamic_tunnel65336
 index 1 next-address 132.1.8.8
TE_AUTO_TUN: primary CLI command:
 interface tunnel65336
 tunnel mpls traffic-eng path-option 1 exp name __dynamic_tunnel65336
 end
TE_AUTO_TUN: Found Down Tunnel65336 to router id 1.1.1.1 out GigabitEthernet2.518
TE_AUTO_TUN: Found a new router id 6.6.6.6 off of GigabitEthernet2.568
TE_AUTO_TUN: Create Auto tunnel
TE_AUTO_TUN: primary CLI command:
 interface tunnel65337
 no logging event link-status
 ip unnumbered Loopback0
 tunnel destination 6.6.6.6
 tunnel mode mpls traffic-eng
 end
TE_AUTO_TUN: primary CLI command:
 interface tunnel65337
 tunnel mpls traffic-eng autoroute announce
 tunnel mpls traffic-eng fast-reroute
 end
TE_AUTO_TUN: CLI command:
 ip explicit-path name __dynamic_tunnel65337
 index 1 next-address 132.6.8.8
TE_AUTO_TUN: primary CLI command:
 interface tunnel65337
 tunnel mpls traffic-eng path-option 1 exp name __dynamic_tunnel65337
 end
TE_AUTO_TUN: Found Down Tunnel65337 to router id 6.6.6.6 out GigabitEthernet2.568
TE_AUTO_TUN: Found a new router id 9.9.9.9 off of GigabitEthernet2.589
TE_AUTO_TUN: Create Auto tunnel
TE_AUTO_TUN: primary CLI command:
 interface tunnel65338
 no logging event link-status
 ip unnumbered Loopback0
 tunnel destination 9.9.9.9
 tunnel mode mpls traffic-eng
 end
TE_AUTO_TUN: primary CLI command:
 interface tunnel65338
 tunnel mpls traffic-eng autoroute announce
 tunnel mpls traffic-eng fast-reroute
 end
TE_AUTO_TUN: CLI command:
 ip explicit-path name __dynamic_tunnel65338
 index 1 next-address 132.8.9.8
TE_AUTO_TUN: primary CLI command:
 interface tunnel65338
 tunnel mpls traffic-eng path-option 1 exp name __dynamic_tunnel65338
 end
TE_AUTO_TUN: Found Down Tunnel65338 to router id 9.9.9.9 out GigabitEthernet2.589
%MPLS_TE-5-TUN: Tun65336: installed LSP 65336_1261 (popt 1) for nil, got 1st feasible path opt
%MPLS_TE-5-TUN: Tun65337: installed LSP 65337_793 (popt 1) for nil, got 1st feasible path opt
%MPLS_TE-5-TUN: Tun65338: installed LSP 65338_4178 (popt 1) for nil, got 1st feasible path opt
%MPLS_TE-5-LSP: LSP 8.8.8.8 65337_793: UP
%MPLS_TE-5-TUN: Tun65337: LSP path change 65337_793 for nil, normal
%MPLS_TE-5-LSP: LSP 8.8.8.8 65336_1261: UP
%MPLS_TE-5-TUN: Tun65336: LSP path change 65336_1261 for nil, normal
%MPLS_TE-5-LSP: LSP 8.8.8.8 65338_4178: UP
%MPLS_TE-5-TUN: Tun65338: LSP path change 65338_4178 for nil, normal

We can quickly check the auto-primary configuration. The output below seems to indicate we can change the tunnel ID range, enable LDP on the tunnel (for stitching PE-P and P-P tunnels), and change the unnumbered address. There is no ability to adjust affinities, as described earlier.

CSR8#show mpls traffic-eng auto-tunnel primary
State: Enabled
Auto primary tunnels: 3 (up: 3, down: 0)
Tunnel ID Range: 65336 - 65435
Check for deletion of FRR Active onehop tunnels every:0 Sec
Config:
 unnumbered-interface: Loopback0
 mpls ip: FALSE
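When the tunnel ID ranges are customized, as was done for the backup tunnels on CSR10 (1500 to 1505) and XRv12 (1200 to 1210), it is worth sanity-checking that the primary and backup pools stay disjoint. The following is a small illustrative check rather than anything the router runs; the function and constant names are hypothetical, and the default ranges are the ones displayed by the XE show commands in this section.

```python
def ranges_overlap(a, b):
    """Return True if two inclusive (min, max) ID ranges share any value."""
    return a[0] <= b[1] and b[0] <= a[1]

def range_size(r):
    """Number of tunnel IDs in an inclusive (min, max) range."""
    return r[1] - r[0] + 1

# Default XE ranges as displayed in this section.
PRIMARY_DEFAULT = (65336, 65435)
BACKUP_DEFAULT = (65436, 65535)

# Custom backup range configured on CSR10 earlier.
CSR10_BACKUP = (1500, 1505)

print(ranges_overlap(PRIMARY_DEFAULT, BACKUP_DEFAULT))
print(range_size(CSR10_BACKUP))
```

The defaults are adjacent but disjoint, each holding 100 IDs, while the custom CSR10 range allows only six backup tunnels, so a small range can itself become the limiting factor on a node with many protected links.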


We can also verify the important aspects of this feature: autoroute and FRR protection. Autoroute shows 3 different destinations, one per tunnel, as expected. The routing table shows the destinations to each router also reachable through the tunnel.

CSR8#show mpls traffic-eng autoroute
MPLS TE autorouting enabled
  destination 1.1.1.1, area ospf 132 area 0, has 1 tunnels
    Tunnel65336 (load balancing metric 0, nexthop 1.1.1.1)
      (flags: Announce)
  destination 6.6.6.6, area ospf 132 area 0, has 1 tunnels
    Tunnel65337 (load balancing metric 0, nexthop 6.6.6.6)
      (flags: Announce)
  destination 9.9.9.9, area ospf 132 area 0, has 1 tunnels
    Tunnel65338 (load balancing metric 0, nexthop 9.9.9.9)
      (flags: Announce)

CSR8#show ip route 1.1.1.1
Routing entry for 1.1.1.1/32
  Known via "ospf 132", distance 110, metric 2, type intra area
  Last update from 1.1.1.1 on Tunnel65336, 00:09:21 ago
  Routing Descriptor Blocks:
  * 1.1.1.1, from 1.1.1.1, 00:09:21 ago, via Tunnel65336
      Route metric is 2, traffic share count is 1

CSR8#show ip route 6.6.6.6
Routing entry for 6.6.6.6/32
  Known via "ospf 132", distance 110, metric 2, type intra area
  Last update from 6.6.6.6 on Tunnel65337, 00:09:27 ago
  Routing Descriptor Blocks:
  * 6.6.6.6, from 6.6.6.6, 00:09:27 ago, via Tunnel65337
      Route metric is 2, traffic share count is 1

CSR8#show ip route 9.9.9.9
Routing entry for 9.9.9.9/32
  Known via "ospf 132", distance 110, metric 2, type intra area
  Last update from 9.9.9.9 on Tunnel65338, 00:09:30 ago
  Routing Descriptor Blocks:
  * 9.9.9.9, from 9.9.9.9, 00:09:30 ago, via Tunnel65338
      Route metric is 2, traffic share count is 1

The RSVP RESV RRO is very short since the tunnel is onehop, and the label is implicit-null, but FRR has been requested. Primary one-hop tunnels, whether automatically or manually configured, always use a null label of sorts.

CSR8#show ip rsvp reservation detail filter session-type 7 | section include Tun_Dest|RRO
Tun Dest: 1.1.1.1 Tun ID: 65336 Ext Tun ID: 8.8.8.8
RRO:
 1.1.1.1/32, Flags:0x20 (No Local Protection, Node-id)
  Label subobject: Flags 0x1, C-Type 1, Label 3
Tun Dest: 6.6.6.6 Tun ID: 65337 Ext Tun ID: 8.8.8.8
RRO:
 6.6.6.6/32, Flags:0x20 (No Local Protection, Node-id)
  Label subobject: Flags 0x1, C-Type 1, Label 3
Tun Dest: 9.9.9.9 Tun ID: 65338 Ext Tun ID: 8.8.8.8
RRO:
 9.9.9.9/32, Flags:0x20 (No Local Protection, Node-id)
  Label subobject: Flags 0x1, C-Type 1, Label 3

Despite this seemingly harmless command, the network is very broken. Because of autoroute announce, CSR8 is now routing to all remote destinations through a TE tunnel that runs PE-P. VPN traffic will no longer work since the transport label is implicit-null. The VPN label is exposed to P routers way too soon, and the VPN traffic will be dropped. We need a way to learn labels from the tail-end routers (at the other end of the one-hop tunnels) for the LSP endpoint.

CSR8#show bgp vpnv4 unicast vrf TE 13.13.13.13/32
BGP routing table entry for 132:1:13.13.13.13/32, version 18
Paths: (1 available, best #1, table TE)
  Advertised to update-groups:
     1
  Refresh Epoch 1
  13
    11.11.11.11 (metric 4) (via default) from 11.11.11.11 (11.11.11.11)
      Origin IGP, metric 0, localpref 100, valid, internal, best
      Extended Community: RT:132:1
      mpls labels in/out nolabel/91013
      rx pathid: 0, tx pathid: 0x0

CSR8#show ip route 11.11.11.11
Routing entry for 11.11.11.11/32
  Known via "ospf 132", distance 110, metric 4, type intra area
  Last update from 9.9.9.9 on Tunnel65338, 00:12:33 ago
  Routing Descriptor Blocks:
    9.9.9.9, from 11.11.11.11, 00:12:33 ago, via Tunnel65338
      Route metric is 4, traffic share count is 1
  * 6.6.6.6, from 11.11.11.11, 00:12:33 ago, via Tunnel65337
      Route metric is 4, traffic share count is 1

CSR8#show mpls traffic-eng tunnels tunnel 65337 | include Label
  InLabel : -
  OutLabel : GigabitEthernet2.568, implicit-null
CSR8#show mpls traffic-eng tunnels tunnel 65338 | include Label
  InLabel : -
  OutLabel : GigabitEthernet2.589, implicit-null

RP/0/0/CPU0:XRv14#traceroute 13.13.13.13 source 14.14.14.14
Type escape sequence to abort.
Tracing the route to 13.13.13.13
 1  10.8.14.8 0 msec  0 msec  0 msec
 2   *  *  *
 3  [snip]

We can solve this problem by enabling LDP on the primary auto-tunnels. We assume the P routers will accept the LDP targeted sessions, and if they don't, we would need to enable that also (not shown). Issuing this command immediately updates all of the tunnels with the "mpls ip" command, as expected. The "mpls ip" flag in the show command also changes to TRUE to indicate it is enabled. Last, we verify the LDP targeted sessions are working properly. In the vast majority of auto-primary deployments, this command would be necessary unless all of the PEs are directly connected.

! CSR8
mpls traffic-eng auto-tunnel primary config mpls ip

! CSR8
TE_AUTO_TUN: FRR lopri process rcv msg TSPTUN_ONEHOP_SET_MPLS_IP_ALL, ready=1
TE_AUTO_TUN: primary CLI command:
 interface tunnel65336
 mpls ip
 end
TE_AUTO_TUN: Tunnel65336 not a candidate for primary
TE_AUTO_TUN: primary CLI command:
 interface tunnel65337
 mpls ip
 end
TE_AUTO_TUN: Tunnel65337 not a candidate for primary
TE_AUTO_TUN: primary CLI command:
 interface tunnel65338
 mpls ip
 end

CSR8#show mpls traffic-eng auto-tunnel primary
State: Enabled
Auto primary tunnels: 3 (up: 3, down: 0)
Tunnel ID Range: 65336 - 65435
Check for deletion of FRR Active onehop tunnels every:0 Sec
Config:
 unnumbered-interface: Loopback0
 mpls ip: TRUE

CSR8#show mpls ldp discovery | begin Target
    Tunnel65336 (ldp): Targeted -> 1.1.1.1
    Tunnel65337 (ldp): Targeted -> 6.6.6.6
    Tunnel65338 (ldp): Targeted -> 9.9.9.9
 Targeted Hellos:
    8.8.8.8 -> 1.1.1.1 (ldp): active, xmit/recv
        LDP Id: 1.1.1.1:0
    8.8.8.8 -> 6.6.6.6 (ldp): active, xmit/recv
        LDP Id: 6.6.6.6:0
    8.8.8.8 -> 9.9.9.9 (ldp): active, xmit/recv
        LDP Id: 9.9.9.9:0

Now, traffic flows as expected. We now have LDP label bindings for XRv11's loopback. Technically an "implicit-null" would be on the top of the stack if it were a real label, followed by the LDP label. Tracing to CSR1 wouldn't have required LDP on the tunnel since the auto-tunnel was PE-PE, so only the VPN label is used. The labels below are the LDP-learned labels for 11.11.11.11/32 from CSR6 and CSR9.

CSR8#show ip cef 11.11.11.11
11.11.11.11/32
  nexthop 6.6.6.6 Tunnel65337 label 6011
  nexthop 9.9.9.9 Tunnel65338 label 9011

RP/0/0/CPU0:XRv14#traceroute 13.13.13.13 source 14.14.14.14
Type escape sequence to abort.
Tracing the route to 13.13.13.13
 1  10.8.14.8 0 msec  0 msec  0 msec
 2  132.8.9.9 [MPLS: Labels 9011/91013 Exp 0] 49 msec  39 msec  39 msec
 3  132.9.12.12 [MPLS: Labels 92007/91013 Exp 0] 29 msec  69 msec  59 msec
 4  132.11.12.11 [MPLS: Label 91013 Exp 0] 59 msec  59 msec  29 msec
 5  10.11.13.13 39 msec  *  29 msec

RP/0/0/CPU0:XRv14#traceroute 7.7.7.7 source 14.14.14.14
Type escape sequence to abort.
Tracing the route to 7.7.7.7
 1  10.8.14.8 0 msec  0 msec  0 msec
 2  10.1.7.1 [MPLS: Label 1008 Exp 0] 0 msec  0 msec  0 msec
 3  10.1.7.7 0 msec  *  0 msec

This feature works well with auto-tunnel backup. Since we already have a number of one-hop primary tunnels requesting FRR, we can back them up with NHOP auto-backups. After enabling these, we can see that three new tunnels were created to back up each of the existing primary tunnels. Based on the labels received, two of them use CSR9 and one of them uses CSR1 for NHOP protection. Pay close attention to the tunnel IDs; the protected and FRR tunnels appear to be the same at a glance, but the third digit is different. It is coincidental that the auto-primary and auto-backup interface numbers happen to align as they did. The "protected" tunnels are primary and the FRR interfaces are backup.

! CSR8
mpls traffic-eng auto-tunnel backup nhop-only

CSR8#show mpls traffic-eng fast-reroute database
P2P Headend FRR information:
Protected tunnel   In-label   Out intf/label     FRR intf/label     Status
----------------   --------   ----------------   ----------------   ------
Tunnel65336        Tun hd     Gi2.518:implicit   Tu65436:implicit   ready
Tunnel65337        Tun hd     Gi2.568:implicit   Tu65437:implicit   ready
Tunnel65338        Tun hd     Gi2.589:implicit   Tu65438:implicit   ready

CSR8#show mpls traffic-eng tunnels tunnel 65436 | include Label
  InLabel :
  OutLabel : GigabitEthernet2.589, 9015
CSR8#show mpls traffic-eng tunnels tunnel 65437 | include Label
  InLabel :
  OutLabel : GigabitEthernet2.589, 9009
CSR8#show mpls traffic-eng tunnels tunnel 65438 | include Label
  InLabel :
  OutLabel : GigabitEthernet2.518, 1012

We can also adjust the removal timer for these tunnels. After a link failure, the tunnels can be removed from the configuration if desired. By default, the feature is disabled (tunnels are never removed).

! CSR8
mpls traffic-eng auto-tunnel primary timers removal rerouted 3600

CSR8#show mpls traffic-eng auto-tunnel primary
State: Enabled
Auto primary tunnels: 3 (up: 3, down: 0)
Tunnel ID Range: 65336 - 65435
Check for deletion of FRR Active onehop tunnels every:3600 Sec
Config:
  unnumbered-interface: Loopback0
  mpls ip: TRUE

Although neither clearly documented nor well advertised, XR supports this feature as well. It is hidden within the "mesh" confines, but it does work the same way. Since XRv12 already has auto-tunnel backup configured, we will enable the one-hop tunnels there.

! XRv12
mpls traffic-eng
 auto-tunnel mesh
  group 999
   onehop
  tunnel-id min 9990 max 9999

Checking the status of these tunnels, we can see that XRv12 created 4 tunnels, one for each next-hop. Only two of them are up, but all four are marked with a % sign to indicate "onehop" tunnels.

RP/0/0/CPU0:XRv12#show mpls traffic-eng tunnels brief
TUNNEL NAME       DESTINATION     STATUS   STATE
*tunnel-te1208    6.6.6.6         up       up
%tunnel-te9990    132.6.12.6      up       up
%tunnel-te9991    132.9.12.9      up       up
%tunnel-te9992    132.10.12.10    down     down
%tunnel-te9993    132.11.12.11    down     down
CSR6_t65338       12.12.12.12     up       up
CSR6_t65439       9.9.9.9         up       up
CSR6_t65440       12.12.12.12     up       up
* = automatically created backup tunnel
% = automatically created mesh onehop tunnel
Displayed 5 (of 5) heads, 1 (of 1) midpoints, 2 (of 2) tails
Displayed 3 up, 2 down, 0 recovering, 0 recovered heads

The reason those two tunnels are down is the exact reason why affinity was removed from most of the topology: we cannot adjust the affinity of these tunnels, so there is no colorless path to reach the neighbors if affinity is set. The diagram shows this clearly, as PCALC cannot find a valid path.

RP/0/0/CPU0:XRv12#show mpls traffic-eng tunnels 9992
Name: tunnel-te9992 Destination: 132.10.12.10 Ifhandle:0x1080 (auto-tunnel mesh onehop)
  Signalled-Name: autom_XRv12_t9992_mg999
  Status:
    Admin: up Oper: down Path: not valid Signalling: Down
    path option 10, type dynamic
    Last PCALC Error: Sat Oct 3 17:29:26 2015
      Info: No path to destination, 132.10.12.10 (affinity or reverselink or exclude-link or hop-limit)

RP/0/0/CPU0:XRv12#show mpls traffic-eng tunnels 9993
Name: tunnel-te9993 Destination: 132.11.12.11 Ifhandle:0x1180 (auto-tunnel mesh onehop)
  Signalled-Name: autom_XRv12_t9993_mg999
  Status:
    Admin: up Oper: down Path: not valid Signalling: Down
    path option 10, type dynamic
    Last PCALC Error: Sat Oct 3 17:29:26 2015
      Info: No path to destination, 132.11.12.11 (affinity or hop-limit)

A quick look at the affinity information for the local node confirms this. The links to CSR6 and CSR9 have no link coloring and they are functioning properly. The links to CSR10 and XRv11 are colored and thus cannot be used by this feature.

RP/0/0/CPU0:XRv12#show mpls traffic-eng topology 12.12.12.12 affinity
IGP Id: 12.12.12.12, MPLS TE Id: 12.12.12.12 Router Node (OSPF 132 area 0)
  Link[0]: Intf Address: 132.6.12.12, Nbr Intf Address: 132.6.12.6
    Attribute Flags: 0x0
    Ext Admin Group: Length: 256 bits Value : 0x::
    Attribute Names:
  Link[1]: Intf Address: 132.11.12.12, Nbr Intf Address: 132.11.12.11
    Attribute Flags: 0xc
    Ext Admin Group: Length: 256 bits Value : 0x::c
    Attribute Names: BLUE(2) ORANGE(3)
  Link[2]: Intf Address: 132.10.12.12, Nbr Intf Address: 132.10.12.10
    Attribute Flags: 0x9
    Ext Admin Group: Length: 256 bits Value : 0x::9
    Attribute Names: RED(0) ORANGE(3)
  Link[3]: Intf Address: 132.9.12.12, Nbr Intf Address: 132.9.12.9
    Attribute Flags: 0x0
    Ext Admin Group: Length: 256 bits Value : 0x::
    Attribute Names:
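The affinity check that PCALC applies to each link can be sketched in a few lines of Python. This is only an illustration: the function names are hypothetical, and it assumes the one-hop tunnels use the usual default of affinity 0x0 with mask 0xFFFF, which the output above does not state explicitly. The attribute flags are copied from the XRv12 topology output.

```python
# Attribute flags per neighbor, copied from the XRv12 topology output.
XRV12_LINKS = {
    "132.6.12.6": 0x0,    # CSR6, uncolored
    "132.11.12.11": 0xC,  # XRv11, BLUE(2) + ORANGE(3)
    "132.10.12.10": 0x9,  # CSR10, RED(0) + ORANGE(3)
    "132.9.12.9": 0x0,    # CSR9, uncolored
}


def link_allowed(link_attr, affinity, mask):
    """A link is usable when its attribute bits match the affinity under the mask."""
    return (link_attr & mask) == (affinity & mask)


def usable_neighbors(affinity, mask):
    """Return the neighbors whose links satisfy the affinity constraint."""
    return [nbr for nbr, attr in XRV12_LINKS.items()
            if link_allowed(attr, affinity, mask)]


if __name__ == "__main__":
    # Assumed default constraint: only uncolored links pass.
    print(usable_neighbors(0x0, 0xFFFF))
    # "Not blue" constraint (mask 0x4): only the BLUE link is rejected.
    print(usable_neighbors(0x0, 0x4))
```

Under the assumed default, only the links to CSR6 and CSR9 pass, which matches the two one-hop tunnels that came up.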

Design tip: This technique of auto-backup NHOP tunnels with auto-primary is best enabled everywhere. In doing this, we can automatically protect all links in the topology. This alleviates the need for individual TE tunnels to request FRR on a per-tunnel basis. Since every link is already protected, and every link has an FRR tunnel across it with autoroute, we can assume that any traffic crossing that link is protected as well. This is difficult to test since there is no PLR-originated PATHERR message back to the head-end (on a non-primary tunnel) to indicate local repair has happened. FRR happens by virtue of the links being protected; even normal LSPs can be protected if other TE tunnels are not used. This is the best way to protect mLDP LSM traffic, which is detailed in another section. With autoroute enabled, we can simply enable this feature everywhere (along with tLDP) and have full connectivity with FRR protection.

Last, we will examine automatic tunnel mesh-groups. This allows remote PEs to automatically build tunnels to one another based on an ACL that specifies all of the destinations. We can apply any constraints we want, and in this case, the path must not be blue. Using an access-list in this way is somewhat manual, but it can greatly reduce configuration. This solution does not require any additional IGP extensions other than the TE ones already in use. We will build a mesh of TE tunnels between CSR1, CSR8, and XRv11. We will use autoroute destination so we can override any existing primary tunnels on CSR8 that are using autoroute announce. On XRv11, we use autoroute announce because we must, and we also have to specify the TE tunnel default source address in global configuration mode. Only numbered ACLs are supported, which makes the configuration more difficult to interpret at a glance.

! CSR1 and CSR8
mpls traffic-eng auto-tunnel mesh
mpls traffic-eng auto-tunnel mesh tunnel-num min 5000 max 5099
interface Auto-Template500
 ip unnumbered Loopback0
 tunnel mode mpls traffic-eng
 tunnel destination access-list 5
 tunnel mpls traffic-eng autoroute destination
 tunnel mpls traffic-eng affinity 0x0 mask 0x4
 tunnel mpls traffic-eng path-option 10 dynamic

! CSR1 only
access-list 5 permit 8.8.8.8
access-list 5 permit 11.11.11.11

! CSR8 only
access-list 5 permit 1.1.1.1
access-list 5 permit 11.11.11.11

! XRv11
ipv4 unnumbered mpls traffic-eng Loopback0
ipv4 prefix-list PL_MESH_ENDPOINT
 10 permit 1.1.1.1/32
 20 permit 8.8.8.8/32
mpls traffic-eng
 auto-tunnel mesh
  group 500
   attribute-set ATT_NOT_BLUE
   destination-list PL_MESH_ENDPOINT
  timers removal unused 0
  tunnel-id min 5000 max 5099
 attribute-set auto-mesh ATT_NOT_BLUE
  logging events lsp-status state
  logging events lsp-status reroute
  autoroute announce
  affinity 0x0 mask 0x4


We quickly check all three routers to ensure the auto-mesh was built properly. Each router should have two tunnels, one to each of the other two routers in the mesh.

CSR1#show mpls traffic-eng auto-tunnel mesh
Auto-Template500:
 Using access-list 5 to clone the following tunnel interfaces:
 Destination     Interface
 -----------     ---------
 8.8.8.8         Tunnel5000
 11.11.11.11     Tunnel5001
Mesh tunnel interface numbers: min 5000 max 5099

CSR8#show mpls traffic-eng auto-tunnel mesh
Auto-Template500:
 Using access-list 5 to clone the following tunnel interfaces:
 Destination     Interface
 -----------     ---------
 1.1.1.1         Tunnel5000
 11.11.11.11     Tunnel5001
Mesh tunnel interface numbers: min 5000 max 5099

RP/0/0/CPU0:XRv11#show mpls traffic-eng auto-tunnel mesh
Auto-tunnel Mesh Global Configuration:
  Unused removal timeout: 0s (Disabled)
  Configured tunnel number range: 5000-5099
Auto-tunnel Mesh Groups Summary:
  Mesh Groups count: 1
  Mesh Groups Destinations count: 2
  Mesh Groups Tunnels count: 2 created, 2 up, 0 down, 0 FRR enabled
Mesh Group: 500 (2 Destinations)
  Status: Enabled
  Attribute-set: ATT_NOT_BLUE
  Destination-list: PL_MESH_ENDPOINT
  Recreate timer: Not running
  Destination        Tunnel ID     State     Unused timer
  ----------------   -----------   -------   ------------
  1.1.1.1            5002          up        Not running
  8.8.8.8            5003          up        Not running
  Displayed 2 tunnels, 2 up, 0 down, 0 FRR enabled
Auto-mesh Cumulative Counters:
  Last cleared: MON DAY 1 20:57:16 2015 (1d18h ago)
                       Total
  Created:             4
  Connected:           4
  Removed (unused):    0
  Removed (in use):    2
  Range exceeded:      0

Next, we quickly check each PE's FIB to ensure the tunnels are being used for routing. This proves that autoroute can be used with the announce or destination options. Normal static or policy routing would not work since the tunnel IDs are dynamically allocated. Forwarding-adjacency also works, provided the tunnels are configured bidirectionally. On XE and XR, the FIBs never show the TE label, so do not be concerned when the labels are not revealed below. CSR1#show ip cef 8.8.8.8 8.8.8.8/32 attached to Tunnel5000 CSR1#show ip cef 11.11.11.11 11.11.11.11/32 attached to Tunnel5001 CSR8#show ip cef 1.1.1.1 1.1.1.1/32 attached to Tunnel5000 CSR8#show ip cef 11.11.11.11 11.11.11.11/32 attached to Tunnel5001 RP/0/0/CPU0:XRv11#show cef ipv4 1.1.1.1 1.1.1.1/32, version 108, internal 0x1000001 0x0 (ptr 0xa13a7374) [1], 0x0 (0xa138c950), 0xa20 (0xa14aa190) Prefix Len 32, traffic index 0, precedence n/a, priority 3 via 1.1.1.1, tunnel-te5002, 3 dependencies, weight 0, class 0 [flags 0x0] path-idx 0 NHID 0x0 [0xa0e8f73c 0xa0e8f694] next hop 1.1.1.1 local adjacency local label 91004 labels imposed {ImplNull} RP/0/0/CPU0:XRv11#show cef ipv4 8.8.8.8 8.8.8.8/32, version 109, internal 0x1000001 0x0 (ptr 0xa13a75f4) [1], 0x0 (0xa138c974), 0xa20 (0xa14aa2d0) Prefix Len 32, traffic index 0, precedence n/a, priority 3 via 8.8.8.8, tunnel-te5003, 3 dependencies, weight 0, class 0 [flags 0x0] path-idx 0 NHID 0x0 [0xa0e8f544 0xa0e8f5ec] next hop 8.8.8.8 local adjacency local label 91008 labels imposed {ImplNull}


Rather than test all six tunnels, we will test two to ensure the feature works. From CSR1 to XRv11, we know the tunnel ID is 5001. Checking the outgoing label, we find value 3010. Testing from inside the VPN, we can see that the first label in the stack, on the first hop, is 3010. This implies the auto-mesh LSP is being used for transport as desired.

CSR1#show mpls traffic-eng tunnels tunnel 5001 | include Label
  InLabel :
  OutLabel : GigabitEthernet2.513, 3010

CSR7#traceroute 13.13.13.13 source 7.7.7.7
Type escape sequence to abort.
Tracing the route to 13.13.13.13
VRF info: (vrf in name/id, vrf out name/id)
  1 10.1.7.1 1 msec 1 msec 0 msec
  2 132.1.3.3 [MPLS: Labels 3010/91013 Exp 0] 31 msec 17 msec 17 msec
  3 132.3.10.10 [MPLS: Labels 10016/91013 Exp 0] 16 msec 19 msec 32 msec
  4 132.10.11.11 [MPLS: Label 91013 Exp 0] 32 msec 26 msec 32 msec
  5 10.11.13.13 25 msec * 14 msec

From XRv11 to CSR8, we perform a similar verification. Tunnel ID 5003 is used to forward traffic to CSR8, which uses label 10018. The first hop of the traceroute inside the VPN uses this value, which proves that the auto-mesh works on XR also.

RP/0/0/CPU0:XRv11#show rsvp reservation session-type lsp-p2p destination 8.8.8.8 detail | include Label
  Labels: Outgoing downstream: 10018.

RP/0/0/CPU0:XRv13#traceroute 14.14.14.14 source 13.13.13.13
Type escape sequence to abort.
Tracing the route to 14.14.14.14
 1  10.11.13.11 0 msec 0 msec 0 msec
 2  132.10.11.10 [MPLS: Labels 10018/8013 Exp 0] 89 msec 9 msec 9 msec
 3  132.6.10.6 [MPLS: Labels 6014/8013 Exp 0] 9 msec 9 msec 9 msec
 4  10.8.14.8 [MPLS: Labels 0/8013 Exp 0] 9 msec 9 msec 29 msec
 5  10.8.14.8 [MPLS: Labels 0/8013 Exp 0] 19 msec 9 msec 9 msec
 6  10.8.14.14 19 msec * 9 msec

OSPF also supports a fancier way to build auto-mesh tunnels. The XR documentation seems to suggest this is supported on XR, but it isn't. It only works on XE for OSPFv2; there is no support for IS-IS currently. We will use this to automatically build a mesh of tunnels to support the VPLS customers behind CSR1, CSR4, and CSR5. The absence of a static ACL means that all routers have identical configurations. We also use an explicit-path that selects any path that does not traverse CSR9. In general, only XE can include explicit-paths (it only makes sense to use exclusion or loose hop inclusion); XR must use dynamic paths.

! CSR1, CSR4, and CSR5
mpls traffic-eng auto-tunnel mesh
mpls traffic-eng auto-tunnel mesh tunnel-num min 5000 max 5099
router ospf 132
 mpls traffic-eng mesh-group 501 Loopback0 area 0
ip explicit-path name EP_NO_R9 enable
 exclude-address 9.9.9.9
interface Auto-Template501
 ip unnumbered Loopback0
 tunnel mode mpls traffic-eng
 tunnel destination mesh-group 501
 tunnel mpls traffic-eng autoroute destination
 tunnel mpls traffic-eng affinity 0x0 mask 0x0
 tunnel mpls traffic-eng path-option 10 explicit name EP_NO_R9

We can verify the tunnel IDs for each destination on each router for further analysis.

CSR1#show mpls traffic-eng auto-tunnel mesh | begin 501
Auto-Template501:
 Using mesh-group 501 to clone the following tunnel interfaces:
 Destination     Interface
 -----------     ---------
 4.4.4.4         Tunnel5002
 5.5.5.5         Tunnel5003
Mesh tunnel interface numbers: min 5000 max 5099

CSR4#show mpls traffic-eng auto-tunnel mesh | begin 501
Auto-Template501:
 Using mesh-group 501 to clone the following tunnel interfaces:
 Destination     Interface
 -----------     ---------
 1.1.1.1         Tunnel5000
 5.5.5.5         Tunnel5001
Mesh tunnel interface numbers: min 5000 max 5099

CSR5#show mpls traffic-eng auto-tunnel mesh | begin 501
Auto-Template501:
 Using mesh-group 501 to clone the following tunnel interfaces:
 Destination     Interface
 -----------     ---------
 1.1.1.1         Tunnel5000
 4.4.4.4         Tunnel5001
Mesh tunnel interface numbers: min 5000 max 5099

Checking the tunnel briefs for tunnels in the "up" state on CSR4 (for brevity), we can see 2 auto-mesh tunnels in the summary. This represents the tunnels for which the local router is the head-end. Below, we see 4 tunnels total: CSR4 is the head-end for the first two, as the destinations are CSR1 and CSR5. CSR4 is the tail-end for the other two, since the tunnel names contain the hostnames of the corresponding head-ends.

CSR4#show mpls traffic-eng tunnels up brief
Signalling Summary:
    LSP Tunnels Process:            running
    Passive LSP Listener:           running
    RSVP Process:                   running
    Forwarding:                     enabled
    auto-tunnel:
        backup Disabled (0 ), id-range:65436-65535
        onehop Disabled (0 ), id-range:65336-65435
        mesh   Enabled  (2 ), id-range:5000-5099
    Periodic reoptimization:        every 3600 seconds, next in 394 seconds
    Periodic FRR Promotion:         Not Running
    Periodic auto-bw collection:    every 300 seconds, next in 94 seconds

P2P TUNNELS/LSPs:
TUNNEL NAME    DESTINATION    UP IF      DOWN IF    STATE/PROT
CSR4_t5000     1.1.1.1        -          Gi2.514    up/up
CSR4_t5001     5.5.5.5        -          Gi2.534    up/up
CSR1_t5002     4.4.4.4        Gi2.514    -          up/up
CSR5_t5001     4.4.4.4        Gi2.534    -          up/up
Displayed 2 (of 2) heads, 0 (of 0) midpoints, 2 (of 2) tails

This information was dynamically exchanged using a new TLV to carry the mesh-group ID (MG-ID) inside of an opaque-area LSA (Type-10). The Link ID is always 4.0.0.0 and each MG-ID is 8 bytes long. The first 4 bytes represent the actual MG-ID, shown in hex (0x000001F5 = 501). The second 4 bytes represent the router's mesh-group IP address, which serves as the tunnel destination for everyone else, also in hex (0x04040404 = 4.4.4.4). Looking at CSR4's locally generated MG-ID LSA10, we can see this information carried in the TLV.

CSR4#show ip ospf 132 0 database opaque-area 4.0.0.0 adv-router 4.4.4.4
            OSPF Router with ID (4.4.4.4) (Process ID 132)
                Type-10 Opaque Link Area Link States (Area 0)
  LS age: 430
  Options: (No TOS-capability, DC)
  LS Type: Opaque Area Link
  Link State ID: 4.0.0.0
  Opaque Type: 4
  Opaque ID: 0
  Advertising Router: 4.4.4.4
  LS Seq Number: 80000001
  Checksum: 0x4AC3
  Length: 32
    Capability Type: Mesh-group
    Length: 8
    Value: 0000 01F5 0404 0404

If we add another MG-ID to CSR4, perhaps to create an unrelated mesh of tunnels, this single LSA can be extended without having to create an LSA for each MG-ID. We can select a different "destination" interface as well, which doesn't make much sense in this case, but is possible. The original MG-ID still exists, with the addition of MG-ID 502 (0x000001F6) and the IP address from the specified link (0x84030404 = 132.3.4.4). In the value below, the first 8 bytes (0000 01F5 0404 0404) belong to MG-ID 501 and the last 8 bytes (0000 01F6 8403 0404) belong to MG-ID 502. Each new MG-ID adds 8 bytes of length to the TLV.

! CSR4
router ospf 132
 mpls traffic-eng mesh-group 502 GigabitEthernet2.534 area 0

CSR4#show ip ospf 132 0 database opaque-area 4.0.0.0 adv-router 4.4.4.4
            OSPF Router with ID (4.4.4.4) (Process ID 132)
                Type-10 Opaque Link Area Link States (Area 0)
  LS age: 34
  Options: (No TOS-capability, DC)
  LS Type: Opaque Area Link
  Link State ID: 4.0.0.0
  Opaque Type: 4
  Opaque ID: 0
  Advertising Router: 4.4.4.4
  LS Seq Number: 80000003
  Checksum: 0xB3C0
  Length: 40
    Capability Type: Mesh-group
    Length: 16
    Value: 0000 01F5 0404 0404 0000 01F6 8403 0404
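The 8-byte pairing of MG-ID and tunnel destination described above can be decoded mechanically. The following Python sketch is only an illustration of that layout; the function name is hypothetical and the input is the TLV value string copied from the CSR4 LSA output.

```python
import ipaddress
import struct


def parse_mesh_groups(value_hex):
    """Split a Mesh-group TLV value into (MG-ID, destination IPv4) pairs.

    Each entry is 8 bytes: a 4-byte MG-ID followed by a 4-byte IPv4 address,
    both in network byte order, as described in the text.
    """
    raw = bytes.fromhex(value_hex.replace(" ", ""))
    if len(raw) % 8 != 0:
        raise ValueError("TLV value must be a multiple of 8 bytes")
    groups = []
    for offset in range(0, len(raw), 8):
        mg_id, addr = struct.unpack_from("!II", raw, offset)
        groups.append((mg_id, str(ipaddress.IPv4Address(addr))))
    return groups


if __name__ == "__main__":
    # Value copied from CSR4's opaque LSA after adding mesh-group 502.
    for mg_id, dest in parse_mesh_groups("0000 01F5 0404 0404 0000 01F6 8403 0404"):
        print(mg_id, dest)
```

Run against the 16-byte value, this yields MG-ID 501 with destination 4.4.4.4 and MG-ID 502 with destination 132.3.4.4, matching the hex conversions given in the text.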

I suspect XR simply does not support this feature because it cannot read this new TLV. I was not able to find a way to enable a feature to correct this, nor could I find a way to encode an MG-ID locally into the XR OSPF process. If XR cannot understand the MG-ID TLV, it is highly unlikely that the feature is supported.

RP/0/0/CPU0:XRv11#show ospf 132 0 database opaque-area 4.0.0.0 adv-router 4.4.4.4
            OSPF Router with ID (11.11.11.11) (Process ID 132)
                Type-10 Opaque Link Area Link States (Area 0)
  LS age: 126
  Options: (No TOS-capability, DC)
  LS Type: Opaque Area Link
  Link State ID: 4.0.0.0
  Opaque Type: 4
  Opaque ID: 0
  Advertising Router: 4.4.4.4
  LS Seq Number: 80000003
  Checksum: 0xb3c0
  Length: 40
    Unknown TLV: Type: 3 Length: 16

A more graceful way to view this information is to look at the TE topology in brief form. Looking at CSR1, it can see CSR4's remote MG-IDs. It displays all 8 bytes of the TLV value, both the MG-ID and tunnel destination. CSR1 only cares about MG-ID 501 in this specific case, as MG-ID 502 was just a demonstration. Each mg-id entry pairs the group number with its tunnel destination.

CSR1#show mpls traffic-eng topology 4.4.4.4 brief | include mg
Area mg-id's:
: mg-id 501 4.4.4.4 :: mg-id 502 132.3.4.4 :

We will verify that each of the PEs is routing correctly using the mesh tunnels before using MPLS OAM to verify the labels. The tunnel IDs being used for each remote destination appear correct.

CSR1#show ip cef 4.4.4.4
4.4.4.4/32
  attached to Tunnel5002
CSR1#show ip cef 5.5.5.5
5.5.5.5/32
  attached to Tunnel5003

CSR4#show ip cef 1.1.1.1
1.1.1.1/32
  attached to Tunnel5000
CSR4#show ip cef 5.5.5.5
5.5.5.5/32
  attached to Tunnel5001

CSR5#show ip cef 1.1.1.1
1.1.1.1/32
  attached to Tunnel5000
CSR5#show ip cef 4.4.4.4
4.4.4.4/32
  attached to Tunnel5001

We will test connectivity from CSR1 to CSR5. The outgoing label is 3022, which means CSR3 is the next-hop. MPLS OAM can verify that the tunnel is operational, since the L2VPN cannot verify this.

CSR1#show mpls traffic-eng tunnels tunnel 5003 | include Label
  InLabel :
  OutLabel : GigabitEthernet2.513, 3022

CSR1#traceroute mpls traffic-eng tunnel 5003
Tracing MPLS TE Label Switched Path on Tunnel5003, timeout is 2 seconds
[snip]
Type escape sequence to abort.
  0 132.1.3.1 MRU 1500 [Labels: 3022 Exp: 0]
L 1 132.1.3.3 MRU 1500 [Labels: 2020 Exp: 0] 2 ms
L 2 132.2.3.2 MRU 1500 [Labels: implicit-null Exp: 0] 28 ms
! 3 132.2.5.5 23 ms

We can confirm this further in the control-plane using ordinary MPLS show commands. The label stack includes the RSVP-bound label 3022 since the next-hop to the targeted LDP endpoint is reachable via a TE tunnel. Using EPC, we can verify that the L2VPN traffic is using this TE LSP. We send traffic from XRv14 to XRv13, which uses CSR1 as the ingress L2VPN PE, and capture traffic heading to CSR3. Label 3022 (0xBCE) is indeed being used for transport, with label 5011 (0x1393) being used for the PW demultiplexer.

CSR1#show l2vpn atom vc destination 5.5.5.5 vcid 71314 detail | include label_stack
    Output interface: Tu5003, imposed label stack {3022 5011}

RP/0/0/CPU0:XRv14#ping 10.0.0.13 source 10.0.0.14
Type escape sequence to abort.
Sending 5, 100-byte ICMP Echos to 10.0.0.13, timeout is 2 seconds:
!!!!!
Success rate is 100 percent (5/5), round-trip min/avg/max = 1/12/29 ms

CSR1#show monitor capture CAP buffer detail
11 144 1.035994 00:0C:29:FB:A3:39 -> 00:0C:29:D7:81:FE MPLS unicast
0000: 000C29D7 81FE000C 29FBA339 81000DB9 ..).....)..9....
0010: 884700BC E0FF0139 31FF0000 0000000C .G.....91.......
0020: 2955663D 000C29BE D8F90800 45000064 )Uf=..).....E..d
0030: 00000000 FF01A77E 0A00000E 0A00000D .......~........
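To read the labels out of a capture like this by hand, recall that each MPLS label stack entry is 4 bytes: a 20-bit label, 3 EXP bits, 1 bottom-of-stack bit, and an 8-bit TTL. The Python sketch below decodes the bytes that follow the 0x8847 EtherType in the capture above; the function name is hypothetical and the input string is copied by hand from the hex dump.

```python
def decode_label_stack(hex_bytes):
    """Decode consecutive 4-byte MPLS label stack entries until bottom-of-stack."""
    raw = bytes.fromhex(hex_bytes.replace(" ", ""))
    entries = []
    for offset in range(0, len(raw) - 3, 4):
        word = int.from_bytes(raw[offset:offset + 4], "big")
        entry = {
            "label": word >> 12,         # top 20 bits
            "exp": (word >> 9) & 0x7,    # 3 EXP / traffic class bits
            "bos": (word >> 8) & 0x1,    # bottom-of-stack flag
            "ttl": word & 0xFF,          # low 8 bits
        }
        entries.append(entry)
        if entry["bos"]:
            break
    return entries


if __name__ == "__main__":
    # Bytes immediately after EtherType 8847 in the CSR1 capture.
    for entry in decode_label_stack("00BCE0FF 013931FF"):
        print(entry)
```

The first entry decodes to label 3022 (transport) and the second to label 5011 with the bottom-of-stack bit set (PW label), agreeing with the imposed label stack shown by CSR1.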

For additional testing, we will perform the same verification from CSR5 to CSR4. We already verified the tunnel ID was 5001 and that CEF was properly programmed. The outgoing label is 2018, which means CSR2 is the next-hop in the path. We can confirm this with MPLS traceroute.

CSR5#show ip rsvp reservation detail filter session-type 7 destination 4.4.4.4 | include Label
  Label: 2018 (outgoing)

CSR5#traceroute mpls traffic-eng tunnel 5001
Tracing MPLS TE Label Switched Path on Tunnel5001, timeout is 2 seconds
[snip]
Type escape sequence to abort.
  0 132.2.5.5 MRU 1500 [Labels: 2018 Exp: 0]
L 1 132.2.5.2 MRU 1500 [Labels: 3020 Exp: 0] 1 ms
L 2 132.2.3.3 MRU 1500 [Labels: implicit-null Exp: 0] 29 ms
! 3 132.3.4.4 21 ms

The L2VPN control-plane verification shows the proper label stack as well. When we send traffic from XRv13 to CSR7, we verify the proper label stack on the outgoing packets. Label 2018 (0x7E2) is the topmost, RSVP-allocated label and label 4010 (0xFAA) is the PW label.

CSR5#show mpls l2transport vc destination 4.4.4.4 vcid 71314 detail | include label_stack
    Output interface: Tu5001, imposed label stack {2018 4010}

RP/0/0/CPU0:XRv13#ping 10.0.0.7 source 10.0.0.13
Type escape sequence to abort.
Sending 5, 100-byte ICMP Echos to 10.0.0.7, timeout is 2 seconds:
!!!!!
Success rate is 100 percent (5/5), round-trip min/avg/max = 1/10/29 ms

CSR5#show monitor capture CAP buffer detailed
5 144 1.061994 00:0C:29:5F:11:A1 -> 00:0C:29:5C:E1:E9 MPLS unicast
0000: 000C295C E1E9000C 295F11A1 81000DC5 ..)\....)_......
0010: 8847007E 20FF00FA A1FF0000 0000000C .G.~ ...........
0020: 29664C2C 000C2955 663D0800 45000064 )fL,..)Uf=..E..d
0030: 00000000 FF01A785 0A00000D 0A000007 ................

Additional Reading – Reference configurations "te-auto"

31.3 CBTS (IOS) and PBTS (XR)

This section is a continuation of the manual TE lab and reuses some of the same tunnels for testing (configurations are consolidated at the beginning of the section). Class and Policy Based Tunnel Selection (CBTS/PBTS) are mechanisms for mapping EXP values to different tunnels in XE and XR, respectively. The feature qualifies as a TE QoS mechanism but is not technically DiffServ-aware TE (DSTE), which is different. The two are not mutually exclusive, as we will see later.

We will begin with CBTS. A master tunnel [ID 40] is created that bundles all of the member tunnels. The master determines the steering mechanism (autoroute, etc.) but isn't a real TE tunnel. It's just a software interface so the routing process knows how to deliver traffic. The members have their own path-options along with mappings of EXP values. In this example, voice and video (EXP 5 and 4) are mapped to tunnel41 and follow the blue path, which we can pretend is low latency. Tunnel42 encompasses all other EXP values, which is what the "default" keyword accomplishes. Notice that the member tunnels do not need a steering mechanism.

! CSR1
interface Tunnel40
 description CBTS MASTER
 ip unnumbered Loopback0
 tunnel mode mpls traffic-eng
 tunnel destination 11.11.11.11
 tunnel mpls traffic-eng autoroute destination
 tunnel mpls traffic-eng exp-bundle master
 tunnel mpls traffic-eng exp-bundle member Tunnel41
 tunnel mpls traffic-eng exp-bundle member Tunnel42
interface Tunnel41
 description CBTS VOICE AND VIDEO
 ip unnumbered Loopback0
 tunnel mode mpls traffic-eng
 tunnel destination 11.11.11.11
 tunnel mpls traffic-eng affinity 0x4 mask 0x4
 tunnel mpls traffic-eng path-option 10 dynamic
 tunnel mpls traffic-eng exp 4 5
interface Tunnel42
 description CBTS EVERYTHING ELSE
 ip unnumbered Loopback0
 tunnel mode mpls traffic-eng
 tunnel destination 11.11.11.11
 tunnel mpls traffic-eng affinity 0x0 mask 0x0
 tunnel mpls traffic-eng path-option 10 explicit name EP_1_4_3_10_11
 tunnel mpls traffic-eng exp default

When the tunnels come up, we see only two exist, not three, since tunnel 40 is just a "traffic cop" to direct traffic into the proper member tunnels based on EXP.

CSR1#show ip rsvp reservation
To            From      Pro DPort Sport Next Hop    I/F       Fi Serv BPS
11.11.11.11   1.1.1.1   0   41    1     132.1.6.6   Gi2.516   SE LOAD 0
11.11.11.11   1.1.1.1   0   42    1     132.1.4.4   Gi2.514   SE LOAD 0

We can confirm this by checking all of the tunnel statuses. The master tunnel is up, but the output is limited to the member tunnels and their EXP mappings. We can see that the member tunnels look like regular TE tunnels with their EROs going in totally different directions, given independent tunnel constraints. Notice that label 4014 is used for the default traffic and label 6020 is used for voice/video.

CSR1#show mpls traffic-eng tunnels tunnel 40 | section Status
  Status:
    Master
    Admin: up    Oper: up    Signalling: N/A
    Member Tunnels:             Member Autoroute: Inactive
      Tunnel41: Config Exp: 4 5
      Tunnel42: Config Exp: default

CSR1#show mpls traffic-eng tunnels tunnel 41 | section Label|RSVP_Path
  InLabel :
  OutLabel : GigabitEthernet2.516, 6020
  RSVP Path Info:
    My Address: 132.1.6.1
    Explicit Route: 132.1.6.6 132.6.12.12 132.11.12.11 11.11.11.11
    Record Route: NONE
    Tspec: ave rate=0 kbits, burst=1000 bytes, peak rate=0 kbits

CSR1#show mpls traffic-eng tunnels tunnel 42 | section Label|RSVP_Path
  InLabel :
  OutLabel : GigabitEthernet2.514, 4014
  RSVP Path Info:
    My Address: 132.1.4.1
    Explicit Route: 132.1.4.4 132.3.4.3 132.3.10.10 132.10.11.11 11.11.11.11
    Record Route: NONE
    Tspec: ave rate=0 kbits, burst=1000 bytes, peak rate=0 kbits

We can also see the EXP mappings per destination. This shows the "configured" versus "actual" values, which is valuable to determine where traffic actually ends up. The D and NE flags are evaluated later.

CSR1#show mpls traffic-eng exp 11.11.11.11
Destination: 11.11.11.11
  Master: Tunnel40 Status: up
  Members     Status         Conf Exp    Actual Exp
  Tunnel41    up (Active)    4 5         4 5
  Tunnel42    up (Active)    Default     0 1 2 3 6 7
  (D) : Destination is different
  (NE): Exp values not configured on tunnel
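The relationship between "Conf Exp" and "Actual Exp" can be modeled with a short Python sketch. This is a simplified illustration, not the router's implementation: the function name and data structure are hypothetical, and the fallback used when no member carries "default" (send unclaimed EXP values to the member with the lowest configured EXP) is an assumption chosen to be consistent with the behavior observed in this lab.

```python
def cbts_actual_exp(members):
    """Map each member tunnel to the set of EXP values it actually carries.

    members: dict of tunnel name -> set of EXP ints, the string "default",
             or None when nothing is configured on the member.
    """
    actual = {name: set() for name in members}
    configured = {}        # EXP value -> tunnel that explicitly claims it
    default_tunnel = None
    for name, conf in members.items():
        if conf == "default":
            default_tunnel = name
        elif conf:
            for exp in conf:
                configured[exp] = name
    # Assumption: with no default member, unclaimed EXP values follow the
    # member holding the lowest explicitly configured EXP value.
    if default_tunnel is None and configured:
        default_tunnel = configured[min(configured)]
    for exp in range(8):
        owner = configured.get(exp, default_tunnel)
        if owner is not None:
            actual[owner].add(exp)
    return actual


if __name__ == "__main__":
    print(cbts_actual_exp({"Tunnel41": {4, 5}, "Tunnel42": "default"}))
    print(cbts_actual_exp({"Tunnel41": {4, 5}, "Tunnel42": None}))
```

With the lab configuration, Tunnel41 carries EXP 4 and 5 and Tunnel42 carries the remaining six values. If Tunnel42 has no EXP configuration at all, the sketch assigns all eight values to Tunnel41 and leaves Tunnel42 unused.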

We quickly check the autoroute process. The routing is tied to the master tunnel40, not the members. The routing table shows this as well. Traffic entering the master tunnel is further classified by EXP into the appropriate member tunnels based on the mappings shown above.

CSR1#show mpls traffic-eng autoroute 11.11.11.11
MPLS TE autorouting enabled
  destination 11.11.11.11 has 1 tunnels
    Tunnel40 (load balancing metric 0, nexthop 11.11.11.11)
    (flags: Destination)

CSR1#show ip route 11.11.11.11
Routing entry for 11.11.11.11/32
  Known via "static", distance 1, metric 0 (connected)
  Routing Descriptor Blocks:
  * directly connected, via Tunnel40
      Route metric is 0, traffic share count is 1

A quick test from within the L3VPN shows that the CBTS configuration is working. Traffic at EXP 0 uses label 4014, which is for default traffic. Extended traceroute lets us specify DSCP 40 (CS5), which is voice traffic. This uses label 6020, which is correct.

CSR7#traceroute 13.13.13.13 source 7.7.7.7
Type escape sequence to abort.
Tracing the route to 13.13.13.13
VRF info: (vrf in name/id, vrf out name/id)
  1 10.1.7.1 1 msec 1 msec 0 msec
  2 132.1.4.4 [MPLS: Labels 4014/91012 Exp 0] 30 msec 25 msec 27 msec
  3 132.3.4.3 [MPLS: Labels 3027/91012 Exp 0] 22 msec 16 msec 17 msec
  4 132.3.10.10 [MPLS: Labels 10015/91012 Exp 0] 17 msec 19 msec 43 msec
  5 132.10.11.11 [MPLS: Label 91012 Exp 0] 29 msec 25 msec 17 msec
  6 10.11.13.13 25 msec * 14 msec

CSR7#traceroute
Protocol [ip]:
Target IP address: 13.13.13.13
Source address: 7.7.7.7
DSCP Value [0]: 40
[snip]
Type escape sequence to abort.
Tracing the route to 13.13.13.13
VRF info: (vrf in name/id, vrf out name/id)
  1 10.1.7.1 0 msec 0 msec 1 msec
  2 132.1.6.6 [MPLS: Labels 6020/91012 Exp 5] 46 msec 33 msec 34 msec
  3 132.6.12.12 [MPLS: Labels 92010/91012 Exp 5] 33 msec 33 msec 33 msec
  4 132.11.12.11 [MPLS: Label 91012 Exp 5] 33 msec 33 msec 33 msec
  5 10.11.13.13 24 msec * 21 msec
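The jump from DSCP 40 to EXP 5 in the second trace follows from the three most significant DSCP bits (the IP precedence) being copied into the EXP field at label imposition. The Python sketch below shows that arithmetic; the function name is hypothetical, and it assumes no QoS policy on the ingress PE rewrites the EXP value, which is consistent with the traceroute output above.

```python
def dscp_to_exp(dscp):
    """Return the EXP value obtained by copying the top 3 DSCP bits (IP precedence)."""
    if not 0 <= dscp <= 63:
        raise ValueError("DSCP must be between 0 and 63")
    return dscp >> 3


if __name__ == "__main__":
    # DSCP 40 (CS5) from the extended traceroute, and the default DSCP 0.
    for dscp in (40, 0):
        print(dscp, dscp_to_exp(dscp))
```

DSCP 40 maps to EXP 5, which CBTS steers into tunnel41, while DSCP 0 maps to EXP 0 and follows the default member.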

MPLS OAM can be used to verify this as well, provided we specify the master tunnel as the target. We can easily test different EXP values from within the SP network this way. This time, we will test video (EXP 4) and ensure it uses label 6020. EXP 3 still uses the default tunnel, which is correct. Notice that the OAM returning information shows EXP 0 for the packet that was in tunnel40; I assume this is a cosmetic behavior of the MPLS traceroute, since it categorized the traffic properly.

CSR1#traceroute mpls traffic-eng tunnel 40 exp 3
Tracing MPLS TE Label Switched Path on Tunnel40, timeout is 2 seconds
[snip]
Type escape sequence to abort.
  0 132.1.4.1 MRU 1500 [Labels: 4014 Exp: 0]
L 1 132.1.4.4 MRU 1500 [Labels: 3027 Exp: 0] 2 ms
L 2 132.3.4.3 MRU 1500 [Labels: 10015 Exp: 0] 27 ms
L 3 132.3.10.10 MRU 1500 [Labels: implicit-null Exp: 0] 20 ms
! 4 132.10.11.11 36 ms

CSR1#traceroute mpls traffic-eng tunnel 40 exp 4
Tracing MPLS TE Label Switched Path on Tunnel40, timeout is 2 seconds
[snip]
Type escape sequence to abort.
  0 132.1.6.1 MRU 1500 [Labels: 6020 Exp: 4]
L 1 132.1.6.6 MRU 1500 [Labels: 92010 Exp: 4] 2 ms
L 2 132.6.12.12 MRU 1500 [Labels: implicit-null Exp: 4] 43 ms
! 3 132.11.12.11 21 ms

These member tunnels can be protected by FRR, make bandwidth reservations, and do most other things ordinary TE tunnels can do. We will protect the voice/video tunnel with FRR (with bandwidth protection) and give it 2 Mbps of bandwidth. Even though there aren't any NHOP/NNHOP backup tunnels along the path, the features do work. We can check the RSVP RESV details to see the RRO with labels, along with the bandwidth reservation. Notice that the second tunnel42 does not have a bandwidth reservation nor an RRO, as expected. ! CSR interface Tunnel41 tunnel mpls traffic-eng bandwidth 2000 tunnel mpls traffic-eng fast-reroute bw-protect CSR1#show ip rsvp reservation detail filter session-type 7 destination 11.11.11.11

1454 © 2016 Nicholas J. Russo

Reservation:
  Tun Dest: 11.11.11.11 Tun ID: 41 Ext Tun ID: 1.1.1.1
  Tun Sender: 1.1.1.1 LSP ID: 3
  Next Hop: 132.1.6.6 on GigabitEthernet2.516
  Label: 6007 (outgoing)
  Reservation Style is Shared-Explicit, QoS Service is Controlled-Load
  Resv ID handle: 3900041E.
  Created: 17:47:00 UTC DAY MON 26 2015
  Average Bitrate is 2M bits/sec, Maximum Burst is 1K bytes
  Min Policed Unit: 0 bytes, Max Pkt Size: 1500 bytes
  RRO:
    6.6.6.6/32, Flags:0x20 (No Local Protection, Node-id)
      Label subobject: Flags 0x1, C-Type 1, Label 6007
    12.12.12.12/32, Flags:0x20 (No Local Protection, Node-id)
      Label subobject: Flags 0x1, C-Type 1, Label 92010
    132.6.12.12/32, Flags:0x0 (No Local Protection)
      Label subobject: Flags 0x1, C-Type 1, Label 92010
    11.11.11.11/32, Flags:0x20 (No Local Protection, Node-id)
      Label subobject: Flags 0x1, C-Type 1, Label 3
    132.11.12.11/32, Flags:0x0 (No Local Protection)
      Label subobject: Flags 0x1, C-Type 1, Label 3
  Status:
  Policy: Accepted. Policy source(s): MPLS/TE
Reservation:
  Tun Dest: 11.11.11.11 Tun ID: 42 Ext Tun ID: 1.1.1.1
  Tun Sender: 1.1.1.1 LSP ID: 1
  Next Hop: 132.1.4.4 on GigabitEthernet2.514
  Label: 4014 (outgoing)
  Reservation Style is Shared-Explicit, QoS Service is Controlled-Load
  Resv ID handle: 0C000410.
  Created: 17:25:32 UTC DAY MON 26 2015
  Average Bitrate is 0 bits/sec, Maximum Burst is 1K bytes
  Min Policed Unit: 0 bytes, Max Pkt Size: 1500 bytes
  Status:
  Policy: Accepted. Policy source(s): MPLS/TE

If we remove the "default" option from tunnel42, the tunnel is considered inactive. It is still a signaled and valid LSP from the perspective of RSVP, but cannot be used for CBTS. We can see that all 8 EXP values now map to the voice/video tunnel. Since there was nothing configured on tunnel42, the router makes no assumptions about what it should transport, and so uses it for nothing. Notice the NE flag on the tunnel42 entry below.

! CSR1
interface Tunnel42
 no tunnel mpls traffic-eng exp default

CSR1#show mpls traffic-eng exp 11.11.11.11
Destination: 11.11.11.11
Master: Tunnel40    Status: up
Members     Status             Conf Exp     Actual Exp
Tunnel41    up (Active)        4 5          0 1 2 3 4 5 6 7
Tunnel42    up (Inactive)NE    Unassigned
(D) : Destination is different
(NE): Exp values not configured on tunnel

Unfortunately, OAM is not helpful in this situation, as it states there is no label. MPLS OAM actually appears to look at the configured EXP, not the actual EXP. Tracing from the VPN yields the proper results, which is a path via CSR6. Though undesirable from a design perspective, this is in accordance with the output above. The label changed to 6007 since enabling FRR caused the tunnel to be resignaled; label 6007 is shown above in the RSVP RESV RRO.

CSR1#traceroute mpls traffic-eng tunnel 40 exp 3
Tracing MPLS TE Label Switched Path on Tunnel40, timeout is 2 seconds
[snip]
Type escape sequence to abort.
  0 0.0.0.0 MRU 0 [No Label]
Q 1 *

CSR7#traceroute 13.13.13.13 source 7.7.7.7
Type escape sequence to abort.
Tracing the route to 13.13.13.13
VRF info: (vrf in name/id, vrf out name/id)
  1 10.1.7.1 1 msec 1 msec 0 msec
  2 132.1.6.6 [MPLS: Labels 6007/91012 Exp 0] 50 msec 34 msec 33 msec
  3 132.6.12.12 [MPLS: Labels 92010/91012 Exp 0] 33 msec 33 msec 41 msec
  4 132.11.12.11 [MPLS: Label 91012 Exp 0] 33 msec 33 msec 33 msec
  5 10.11.13.13 26 msec * 32 msec

Next, we will configure a few explicit values on tunnel42, but not all. The lower priority traffic is forced into tunnel42, and the system automatically places all remaining EXP values into tunnel42 as well. When a "default" is not specified, the tunnel with the lowest explicit EXP is used for all remaining values. This configuration should give us the expected result, like the original setup, as the "actual EXP" is the same.

! CSR1
interface Tunnel42
 tunnel mpls traffic-eng exp 0 1

CSR1#show mpls traffic-eng exp 11.11.11.11
Destination: 11.11.11.11
Master: Tunnel40    Status: up
Members     Status         Conf Exp     Actual Exp
Tunnel41    up (Active)    4 5          4 5
Tunnel42    up (Active)    0 1          0 1 2 3 6 7

To prove the OAM issue, we can easily verify the path is correct for EXP 1, but not EXP 2, since OAM is only looking at the configuration. This traffic is now properly placed in the non-voice/video tunnel.

CSR1#traceroute mpls traffic-eng tunnel 40 exp 1
Tracing MPLS TE Label Switched Path on Tunnel40, timeout is 2 seconds
[snip]
Type escape sequence to abort.
  0 132.1.4.1 MRU 1500 [Labels: 4014 Exp: 1]
L 1 132.1.4.4 MRU 1500 [Labels: 3027 Exp: 1] 2 ms
L 2 132.3.4.3 MRU 1500 [Labels: 10015 Exp: 1] 33 ms
L 3 132.3.10.10 MRU 1500 [Labels: implicit-null Exp: 1] 27 ms
! 4 132.10.11.11 38 ms

CSR1#traceroute mpls traffic-eng tunnel 40 exp 2
Tracing MPLS TE Label Switched Path on Tunnel40, timeout is 2 seconds
[snip]
Type escape sequence to abort.
  0 0.0.0.0 MRU 0 [No Label]
Q 1 *

We can add new tunnels to the bundle at any time. We will use tunnel43 for network control traffic. We can have up to 8 tunnels, one for each EXP.

! CSR1
interface Tunnel43
 description CBTS NET CONTROL
 ip unnumbered Loopback0
 tunnel mode mpls traffic-eng
 tunnel destination 11.11.11.11
 tunnel mpls traffic-eng affinity 0x0 mask 0x0
 tunnel mpls traffic-eng path-option 10 explicit name EP_NO_R9
 tunnel mpls traffic-eng exp 6 7
interface Tunnel40
 tunnel mpls traffic-eng exp-bundle member Tunnel43

There still isn't a default tunnel, so the remaining values are mapped to the tunnel with the lowest explicit EXP. This is still the correct operation.

CSR1#show mpls traffic-eng exp 11.11.11.11
Destination: 11.11.11.11
Master: Tunnel40    Status: up
Members     Status         Conf Exp     Actual Exp
Tunnel41    up (Active)    4 5          4 5
Tunnel42    up (Active)    0 1          0 1 2 3
Tunnel43    up (Active)    6 7          6 7

The combinations of tunnel failures, EXP mappings, and routing options are vast and this document does not cover them all. We will evaluate a few, though. If we shut down one tunnel causing the member LSP to fail, the values are moved onto the other tunnels. Shutting down tunnel41 moves EXP4 and EXP5 to tunnel42, which has the lowest configured EXP value.

CSR1#show mpls traffic-eng exp 11.11.11.11
Destination: 11.11.11.11
Master: Tunnel40    Status: up
Members     Status         Conf Exp     Actual Exp
Tunnel41    down           4 5
Tunnel42    up (Active)    0 1          0 1 2 3 4 5
Tunnel43    up (Active)    6 7          6 7

If we shut down tunnel43, its traffic (EXP6 and EXP7) is moved into tunnel41. I assume this is because EXP4 and EXP5 are the "next lowest" values. It would not be appropriate to "promote" lower values like 4 and 5 into a tunnel designed for 6 and 7. However, "demoting" higher values to the next-best tunnel appears to be functional.

CSR1#show mpls traffic-eng exp 11.11.11.11
Destination: 11.11.11.11
Master: Tunnel40    Status: up
Members     Status         Conf Exp     Actual Exp
Tunnel41    up (Active)    4 5          4 5 6 7
Tunnel42    up (Active)    0 1          0 1 2 3
Tunnel43    down           6 7

If we shut down tunnel42, its traffic (EXP0 through EXP3) is moved into tunnel41. Tunnel41 carries the "next lowest" EXP values, so this is "more" acceptable than moving the traffic into the tunnel carrying higher EXPs.

CSR1#show mpls traffic-eng exp 11.11.11.11
Destination: 11.11.11.11
Master: Tunnel40    Status: up
Members     Status         Conf Exp     Actual Exp
Tunnel41    up (Active)    4 5          0 1 2 3 4 5
Tunnel42    down           0 1
Tunnel43    up (Active)    6 7          6 7
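The placement behavior observed in the CBTS outputs so far can be summarized in a small Python model. This is only a sketch inferred from the show command outputs in this section, not Cisco's documented algorithm. The function name `cbts_actual_exp` and its arguments are hypothetical. Sending orphaned values to a default tunnel when one exists is an untested guess, since this lab never fails a member while a default is configured.

```python
from typing import Dict, Optional, Set


def cbts_actual_exp(
    members: Dict[str, Set[int]],
    up: Set[str],
    default: Optional[str] = None,
) -> Dict[str, Set[int]]:
    """Return the EXP values carried by each active member (inferred model)."""
    # A member is active if it is usable and has explicit EXPs or is the default.
    active = {t for t in up if members.get(t) or t == default}
    actual: Dict[str, Set[int]] = {t: set() for t in active}
    if not active:
        return actual

    # Active tunnels with explicit EXP values, ordered by lowest configured EXP.
    explicit = sorted(
        (t for t in active if members.get(t)), key=lambda t: min(members[t])
    )

    for exp in range(8):
        owner = next((t for t, vals in members.items() if exp in vals), None)
        if owner in active:
            target = owner                      # configured and healthy
        elif default in active:
            target = default                    # guess for orphans; shown for unassigned
        elif owner is None:
            target = explicit[0]                # unassigned: lowest explicit EXP tunnel
        else:
            # Orphaned by a failed member: "demote" to the nearest tunnel below,
            # otherwise fall back to the tunnel with the lowest explicit EXP.
            lower = [t for t in explicit if max(members[t]) < exp]
            target = lower[-1] if lower else explicit[0]
        actual[target].add(exp)
    return actual


bundle = {"Tunnel41": {4, 5}, "Tunnel42": {0, 1}, "Tunnel43": {6, 7}}
print(cbts_actual_exp(bundle, up={"Tunnel41", "Tunnel42"}))
```

Feeding the model the three failure cases above reproduces the "Actual Exp" columns shown, which is the only validation it has.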


An error condition exists when a member tunnel has an incorrect tunnel destination, as shown below. Because the tunnel is still a member of a bundle with the correct destination, the tunnel is considered inactive. It is still a valid LSP and RSVP will happily signal the path to the new (incorrect) destination. Traffic will not flow correctly through that member, so the EXP values mapped to that tunnel are moved to a valid bundle member.

interface Tunnel43
 description CBTS NET CONTROL
 ip unnumbered Loopback0
 tunnel mode mpls traffic-eng
 tunnel destination 12.12.12.12

CSR1#show ip rsvp reservation
To            From       Pro  DPort  Sport  Next Hop     I/F       Fi  Serv  BPS
11.11.11.11   1.1.1.1    0    41     9      132.1.6.6    Gi2.516   SE  LOAD  2M
11.11.11.11   1.1.1.1    0    42     3      132.1.4.4    Gi2.514   SE  LOAD  0
12.12.12.12   1.1.1.1    0    43     4      132.1.6.6    Gi2.516   SE  LOAD  0

While making the change, we can enable debugging to see CBTS remove the tunnel from the bundle. The show command also shows the D flag since the member has a different destination than the master. EXP 6 and 7 are moved to tunnel41 as the next-best valid option.

CSR1#show mpls traffic-eng exp 11.11.11.11
Destination: 11.11.11.11
Master: Tunnel40    Status: up
Members     Status            Conf Exp     Actual Exp
Tunnel41    up (Active)       4 5          4 5 6 7
Tunnel42    up (Active)       0 1          0 1 2 3
Tunnel43    up (Inactive)D    6 7
(D) : Destination is different
(NE): Exp values not configured on tunnel

CSR1#debug mpls traffic-eng exp
TE-HE-BNDL: Member destination changed.. remove from bucket
TE-HE-BNDL: Member tunnel property change. walk update adj bundle
TE-HE-BNDL: Member dest different from bundle dest

Like any TE tunnel, routing still determines how traffic is steered into the tunnel head. If we change the master to use autoroute announce, then create a non-CBTS tunnel that uses autoroute announce with a better metric, none of the traffic will enter the CBTS tunnel at all.

! CSR1
interface Tunnel40
 tunnel mpls traffic-eng autoroute announce
 no tunnel mpls traffic-eng autoroute destination
interface Tunnel44
 description CBTS COMPETITOR
 ip unnumbered Loopback0
 tunnel mode mpls traffic-eng
 tunnel destination 11.11.11.11
 tunnel mpls traffic-eng autoroute announce
 tunnel mpls traffic-eng autoroute metric relative -1
 tunnel mpls traffic-eng affinity 0x2 mask 0x2
 tunnel mpls traffic-eng path-option 10 dynamic

We can check the autoroute cache and RIB to verify that only the new tunnel44 is used.

CSR1#show mpls traffic-eng autoroute 11.11.11.11
MPLS TE autorouting enabled
 destination 0000.0000.0011.00, area isis level-2, has 2 tunnels
  Tunnel40 (load balancing metric 0, nexthop 11.11.11.11)
           (flags: Announce)
  Tunnel44 (load balancing metric 0, nexthop 11.11.11.11, relative metric -1)
           (flags: Announce)

CSR1#show ip route 11.11.11.11
Routing entry for 11.11.11.11/32
  Known via "isis", distance 115, metric 29, type level-2
  Redistributing via isis 132
  Last update from 11.11.11.11 on Tunnel44, 00:01:47 ago
  Routing Descriptor Blocks:
  * 11.11.11.11, from 11.11.11.11, 00:01:47 ago, via Tunnel44
      Route metric is 29, traffic share count is 1
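As a rough illustration of why tunnel44 wins, the metric comparison can be sketched in Python. The IGP shortest-path metric of 30 is an assumption inferred from the route metric of 29 combined with the relative adjustment of -1; the function and variable names are hypothetical.

```python
def autoroute_metric(igp_metric: int, mode: str = "relative", value: int = 0) -> int:
    """Metric a TE tunnel presents to SPF when autoroute announce is used."""
    if mode == "absolute":
        return value                 # metric is fixed regardless of the IGP path
    if mode == "relative":
        return igp_metric + value    # offset from the IGP shortest-path metric
    raise ValueError(f"unknown autoroute metric mode: {mode}")


IGP_METRIC_TO_TAIL = 30  # assumed: inferred from metric 29 with "relative -1"

tunnels = {
    "Tunnel40": autoroute_metric(IGP_METRIC_TO_TAIL),                  # CBTS master
    "Tunnel44": autoroute_metric(IGP_METRIC_TO_TAIL, "relative", -1),  # competitor
}
best = min(tunnels.values())
winners = sorted(name for name, metric in tunnels.items() if metric == best)
print(winners, best)
```

Only the tunnel or tunnels with the lowest metric are installed, so the CBTS master and all of its members are bypassed regardless of the EXP value carried by the traffic.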

The label for this new tunnel is 3018, which is not CBTS related at all. When we traceroute from inside the VPN, all traffic goes into this tunnel, and CBTS is totally bypassed. We will shut down tunnel44 before continuing so CBTS is active again. The point is that the CBTS master tunnel is not magic and competes directly with any other TE tunnels based on the routing convergence.

CSR1#show mpls traffic-eng tunnels tunnel 44 | include Label
  InLabel : 
  OutLabel : GigabitEthernet2.513, 3018

CSR7#traceroute 13.13.13.13 source 7.7.7.7
Type escape sequence to abort.
Tracing the route to 13.13.13.13
VRF info: (vrf in name/id, vrf out name/id)
  1 10.1.7.1 1 msec 0 msec 1 msec
  2 132.1.3.3 [MPLS: Labels 3018/91012 Exp 0] 26 msec 16 msec 16 msec
  3 132.2.3.2 [MPLS: Labels 2023/91012 Exp 0] 17 msec 37 msec 38 msec
  4 132.2.10.10 [MPLS: Labels 10016/91012 Exp 0] 39 msec 38 msec 38 msec
  5 132.10.11.11 [MPLS: Label 91012 Exp 0] 35 msec 25 msec 25 msec
  6 10.11.13.13 34 msec * 22 msec

CSR7#traceroute
Protocol [ip]:
Target IP address: 13.13.13.13
Source address: 7.7.7.7
DSCP Value [0]: 48
[snip]
Type escape sequence to abort.
Tracing the route to 13.13.13.13
VRF info: (vrf in name/id, vrf out name/id)
  1 10.1.7.1 1 msec 0 msec 1 msec
  2 132.1.3.3 [MPLS: Labels 3018/91012 Exp 6] 53 msec 25 msec 17 msec
  3 132.2.3.2 [MPLS: Labels 2023/91012 Exp 6] 17 msec 16 msec 25 msec
  4 132.2.10.10 [MPLS: Labels 10016/91012 Exp 6] 38 msec 43 msec 35 msec
  5 132.10.11.11 [MPLS: Label 91012 Exp 6] 25 msec 33 msec 19 msec
  6 10.11.13.13 40 msec * 40 msec

PBTS is similar to CBTS except that it is supported on XR. CBTS is not supported on XR and PBTS is not supported on XE, so the features are mutually exclusive per platform. PBTS is simpler to configure and understand since there is no need for a tunnel bundle. The first set of tunnels [ID 50 - 52] captures Network Control, Voice/Video, and everything else. If no default tunnel is available, the tunnel with the lowest EXP value carries the remaining traffic. Each tunnel is totally independent in terms of constraints (affinity, bandwidth, path-selection, etc). You cannot explicitly map EXP 0.

! XRv11
interface tunnel-te50
 description PBTS NET CONTROL
 ipv4 unnumbered Loopback0
 logging events all
 autoroute announce
 destination 1.1.1.1
 policy-class 6 7
 affinity 0x8 mask 0x8
 path-option 10 dynamic
interface tunnel-te51
 description PBTS VOICE AND VIDEO
 ipv4 unnumbered Loopback0
 logging events all
 signalled-bandwidth 5000
 autoroute announce
 destination 1.1.1.1
 policy-class 4 5
 affinity ignore
 path-option 10 explicit name PATH_11_5_2_3_4_1
interface tunnel-te52
 description PBTS DEFAULT CLASS
 ipv4 unnumbered Loopback0
 logging events all
 autoroute announce
 destination 1.1.1.1
 policy-class default
 affinity 0x4 mask 0x4
 path-option 10 dynamic

We will quickly check to see that the tunnels came up successfully. We also validate the PBTS mappings per tunnel.

RP/0/0/CPU0:XRv11#show mpls traffic-eng tunnels destination 1.1.1.1 up | utility egrep '^Name|Admin|Policy'
Name: tunnel-te50 Destination: 1.1.1.1 Ifhandle:0x1080
 Admin: up Oper: up Path: valid Signalling: connected
 AutoRoute: enabled LockDown: disabled Policy class: 6 7
Name: tunnel-te51 Destination: 1.1.1.1 Ifhandle:0x1180
 Admin: up Oper: up Path: valid Signalling: connected
 AutoRoute: enabled LockDown: disabled Policy class: 4 5
Name: tunnel-te52 Destination: 1.1.1.1 Ifhandle:0x1280
 Admin: up Oper: up Path: valid Signalling: connected
 AutoRoute: enabled LockDown: disabled Policy class: default

We also check the labels bound to each tunnel for verification purposes, and verify that autoroute is working correctly, showing all three tunnels.

RP/0/0/CPU0:XRv11#show rsvp reservation session-type lsp-p2p destination 1.1.1.1 detail | utility egrep 'Label|TunID'
RESV: IPv4-LSP Session addr: 1.1.1.1. TunID: 50. LSPId: 2.
 Labels: Outgoing downstream: 92010.
RESV: IPv4-LSP Session addr: 1.1.1.1. TunID: 51. LSPId: 3.
 Labels: Outgoing downstream: 5012.
RESV: IPv4-LSP Session addr: 1.1.1.1. TunID: 52. LSPId: 2.
 Labels: Outgoing downstream: 5002.

RP/0/0/CPU0:XRv11#show mpls traffic-eng autoroute
Destination 1.1.1.1 has 3 tunnels in IS-IS 132 level 2
tunnel-te50 (traffic share 0, nexthop 1.1.1.1, metric 0)
 (IS-IS 132 level-2, IPV4 Unicast)
 Signalled-Name: XRv11_t50
tunnel-te51 (traffic share 0, nexthop 1.1.1.1, metric 0)
 (IS-IS 132 level-2, IPV4 Unicast)
 Signalled-Name: XRv11_t51
tunnel-te52 (traffic share 0, nexthop 1.1.1.1, metric 0)
 (IS-IS 132 level-2, IPV4 Unicast)
 Signalled-Name: XRv11_t52

Since the auto-route metrics and traffic-shares are set to 0, we can verify the actual class-based forwarding by checking the FIB. All three entries are shown in the FIB and the classes correspond with each tunnel as specified by PBTS.

RP/0/0/CPU0:XRv11#show cef ipv4 1.1.1.1/32
1.1.1.1/32, version 186, internal 0x1000001 0x0 (ptr 0xa14176f4) [1], 0x0 (0xa13fcb00), 0xa20 (0xa191e050)
 Prefix Len 32, traffic index 0, precedence n/a, priority 3
   via 1.1.1.1, tunnel-te50, 3 dependencies, weight 0, class 6 7 [flags 0x0]
    path-idx 0 NHID 0x0 [0xa0e977e4 0xa0e9788c]
    next hop 1.1.1.1
    local adjacency
     local label 91006 labels imposed {ImplNull}
   via 1.1.1.1, tunnel-te51, 3 dependencies, weight 0, class 4 5 [flags 0x0]
    path-idx 1 NHID 0x0 [0xa0e97544 0xa0e975ec]
    next hop 1.1.1.1
    local adjacency
     local label 91006 labels imposed {ImplNull}
   via 1.1.1.1, tunnel-te52, 3 dependencies, weight 0, class default [flags 0x0]
    path-idx 2 NHID 0x0 [0xa0e97694 0xa0e9773c]
    next hop 1.1.1.1
    local adjacency
     local label 91006 labels imposed {ImplNull}

The detailed FIB output also lists the PBTS parameters. The distribution won't be 1:3 as it is class based, but the FIB shows all three tunnels, their paths, and their position in the hash list.

RP/0/0/CPU0:XRv11#show cef ipv4 1.1.1.1/32 detail | begin Weight
 Weight distribution:
 slot 0, weight 0, normalized_weight 1, class default
 slot 1, weight 0, normalized_weight 1, class 4 5
 slot 2, weight 0, normalized_weight 1, class 6 7

 PBTS class information:
 class 4: 1 paths, offset 1
 class 5: 1 paths, offset 1
 class 6: 1 paths, offset 2
 class 7: 1 paths, offset 2
 class default: 1 paths, offset 0

 Load distribution: 0 1 2 (refcount 3)

 Hash  OK  Interface     Address
 0     Y   tunnel-te52   point2point
 1     Y   tunnel-te51   point2point
 2     Y   tunnel-te50   point2point

Because XR does not allow us to specify TOS values in traceroute, and XRv does not support MQC-based QoS at all, it is hard to verify operation without changing the topology to use XE and a CE. As a temporary solution, and because no tunnels currently traverse the link between XRv11 and CSR10, we will disable LDP on that link on CSR10, as well as configure a static route to direct traffic to 1.1.1.1/32 towards XRv11. At the end of the PBTS section, these changes are removed.

! CSR10
interface GigabitEthernet2.501
 no mpls ldp igp autoconfig
ip route 1.1.1.1 255.255.255.255 132.10.11.11

XRv11 will perform a lookup in its FIB since the traffic is IP, not MPLS. It will then place the traffic into the proper TE tunnel based on the first 3 bits of the TOS byte. Below, we test DSCP CS6, CS5, and CS2. Each of them ends up in the proper tunnel, as we can confirm by checking the first label along the TE LSPs.

CSR10#traceroute
Protocol [ip]:
Target IP address: 1.1.1.1
Source address: 10.10.10.10
DSCP Value [0]: 48
[snip]
Tracing the route to 1.1.1.1
VRF info: (vrf in name/id, vrf out name/id)
  1 132.10.11.11 1 msec 0 msec 0 msec
  2 132.11.12.12 [MPLS: Label 92010 Exp 6] 19 msec 17 msec 17 msec
  3 132.6.12.6 [MPLS: Label 6020 Exp 6] 16 msec 43 msec 36 msec
  4 132.6.9.9 [MPLS: Label 9009 Exp 6] 35 msec 48 msec 30 msec
  5 132.4.9.4 [MPLS: Label 4013 Exp 6] 15 msec 16 msec 66 msec

CSR10#traceroute
Protocol [ip]:
Target IP address: 1.1.1.1
Source address: 10.10.10.10
DSCP Value [0]: 40
[snip]
Tracing the route to 1.1.1.1
VRF info: (vrf in name/id, vrf out name/id)
  1 132.10.11.11 1 msec 0 msec 1 msec
  2 132.5.11.5 [MPLS: Label 5012 Exp 5] 12 msec 8 msec 8 msec
  3 132.2.5.2 [MPLS: Label 2017 Exp 5] 13 msec 27 msec 27 msec
  4 132.2.3.3 [MPLS: Label 3015 Exp 5] 25 msec 15 msec 16 msec
  5 132.3.4.4 [MPLS: Label 4002 Exp 5] 15 msec 6 msec 9 msec
  6 132.1.4.1 8 msec * 9 msec

CSR10#traceroute
Protocol [ip]:
Target IP address: 1.1.1.1
Source address: 10.10.10.10
DSCP Value [0]: 16
[snip]
Tracing the route to 1.1.1.1
VRF info: (vrf in name/id, vrf out name/id)
  1 132.10.11.11 1 msec 1 msec 1 msec
  2 132.5.11.5 [MPLS: Label 5002 Exp 2] 9 msec 8 msec 21 msec
  3 132.5.10.10 [MPLS: Label 10018 Exp 2] 25 msec 32 msec 30 msec
  4 132.6.10.6 [MPLS: Label 6008 Exp 2] 29 msec 20 msec 16 msec
  5 132.1.6.1 13 msec * 25 msec
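The relationship between the DSCP values entered in traceroute and the EXP values seen in the labels is simple bit arithmetic. The short Python sketch below shows it for the three class selector values tested; the function names are illustrative only.

```python
def dscp_to_tos(dscp: int) -> int:
    """DSCP occupies the upper 6 bits of the 8-bit TOS byte."""
    if not 0 <= dscp <= 63:
        raise ValueError("DSCP must be in the range 0-63")
    return dscp << 2


def dscp_to_exp(dscp: int) -> int:
    """The first 3 bits of the TOS byte (IP precedence) are copied to EXP."""
    return dscp_to_tos(dscp) >> 5


for name, dscp in (("CS6", 48), ("CS5", 40), ("CS2", 16)):
    print(f"{name}: DSCP {dscp} -> TOS {dscp_to_tos(dscp):#04x} -> EXP {dscp_to_exp(dscp)}")
```

DSCP 48, 40, and 16 therefore yield EXP 6, 5, and 2, which matches the labels and EXP values in the three traceroutes.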

Next, we will shut down the default tunnel52 and verify that the traffic traverses tunnel51. This is the "next lowest" traffic class, so all lower priority traffic uses that in the absence of a default tunnel. Notice that the CEF details make no mention of this behavior, so the user must know that EXP values 0-3 are mapped to tunnel51 in this case using label 5012.

RP/0/0/CPU0:XRv11#show cef ipv4 1.1.1.1/32 detail | begin Weight
 Weight distribution:
 slot 0, weight 0, normalized_weight 1, class 4 5
 slot 1, weight 0, normalized_weight 1, class 6 7

 PBTS class information:
 class 4: 1 paths, offset 0
 class 5: 1 paths, offset 0
 class 6: 1 paths, offset 1
 class 7: 1 paths, offset 1

 Load distribution: 0 1 (refcount 3)

 Hash  OK  Interface     Address
 0     Y   tunnel-te51   point2point
 1     Y   tunnel-te50   point2point

CSR10#traceroute
Protocol [ip]:
Target IP address: 1.1.1.1
Source address: 10.10.10.10
DSCP Value [0]: 8
[snip]
Tracing the route to 1.1.1.1
VRF info: (vrf in name/id, vrf out name/id)
  1 132.10.11.11 1 msec 1 msec 1 msec
  2 132.5.11.5 [MPLS: Label 5012 Exp 1] 21 msec 8 msec 8 msec
  3 132.2.5.2 [MPLS: Label 2017 Exp 1] 10 msec 15 msec 15 msec
  4 132.2.3.3 [MPLS: Label 3015 Exp 1] 15 msec 15 msec 16 msec
  5 132.3.4.4 [MPLS: Label 4002 Exp 1] 15 msec 8 msec 8 msec
  6 132.1.4.1 8 msec * 12 msec

We bring up tunnel52 again and note the new label is 5013. We shut down tunnel50, the "best" tunnel. Traffic is moved into the default tunnel52 as expected, per the FIB. We confirm this by using traceroute to see label 5013 at the front of the LSP. Tunnel51 is still servicing classes 4 and 5, which we confirm with DSCP CS4.

RP/0/0/CPU0:XRv11#show mpls traffic-eng tunnels 52 detail | include Label
 Outgoing Interface: GigabitEthernet0/0/0/0.551, Outgoing Label: 5013

RP/0/0/CPU0:XRv11#show cef ipv4 1.1.1.1/32 | include class
   via 1.1.1.1, tunnel-te51, 3 dependencies, weight 0, class 4 5 [flags 0x0]
   via 1.1.1.1, tunnel-te52, 3 dependencies, weight 0, class default [flags 0x0]

CSR10#traceroute
Protocol [ip]:
Target IP address: 1.1.1.1
Source address: 10.10.10.10
DSCP Value [0]: 56
[snip]
Tracing the route to 1.1.1.1
VRF info: (vrf in name/id, vrf out name/id)
  1 132.10.11.11 1 msec 1 msec 1 msec
  2 132.5.11.5 [MPLS: Label 5013 Exp 7] 20 msec 8 msec 8 msec
  3 132.5.10.10 [MPLS: Label 10012 Exp 7] 26 msec 32 msec 25 msec
  4 132.6.10.6 [MPLS: Label 6011 Exp 7] 29 msec 38 msec 16 msec
  5 132.1.6.1 17 msec * 7 msec

CSR10#traceroute
Protocol [ip]:
Target IP address: 1.1.1.1
Source address: 10.10.10.10
DSCP Value [0]: 32
[snip]
Tracing the route to 1.1.1.1
VRF info: (vrf in name/id, vrf out name/id)
  1 132.10.11.11 1 msec 1 msec 1 msec
  2 132.5.11.5 [MPLS: Label 5012 Exp 4] 21 msec 9 msec 9 msec
  3 132.2.5.2 [MPLS: Label 2017 Exp 4] 15 msec 15 msec 15 msec
  4 132.2.3.3 [MPLS: Label 3015 Exp 4] 15 msec 16 msec 15 msec
  5 132.3.4.4 [MPLS: Label 4002 Exp 4] 15 msec 16 msec 9 msec
  6 132.1.4.1 9 msec * 20 msec
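The PBTS tunnel selection seen in these tests can be approximated with a few lines of Python. This is a sketch that only reproduces the behavior observed in this lab (explicit class match first, then the default tunnel, then the tunnel with the lowest configured class); it is not taken from XR documentation, and the function name and data layout are hypothetical.

```python
from typing import Dict, Optional, Set

# None marks the tunnel configured with "policy-class default".
Policy = Dict[str, Optional[Set[int]]]


def pbts_select(exp: int, policy: Policy, up: Set[str]) -> Optional[str]:
    """Pick the tunnel that carries traffic with the given EXP/precedence."""
    usable = {t: classes for t, classes in policy.items() if t in up}
    # 1. An operational tunnel explicitly configured for this class.
    for tunnel, classes in usable.items():
        if classes is not None and exp in classes:
            return tunnel
    # 2. The default-class tunnel, if it is operational.
    for tunnel, classes in usable.items():
        if classes is None:
            return tunnel
    # 3. Otherwise the tunnel with the lowest configured class value.
    explicit = {t: c for t, c in usable.items() if c}
    if explicit:
        return min(explicit, key=lambda t: min(explicit[t]))
    return None


policy: Policy = {"tunnel-te50": {6, 7}, "tunnel-te51": {4, 5}, "tunnel-te52": None}
print(pbts_select(1, policy, up={"tunnel-te50", "tunnel-te51"}))
```

With tunnel-te52 down, EXP 1 lands on tunnel-te51; with tunnel-te50 down, EXP 7 falls back to the default tunnel-te52, as the traceroutes showed.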


Forwarding-adjacency is also supported for PBTS. So long as there is a single tunnel from the tail to head (CSR1 to XRv11) that is also configured for the feature, all of the PBTS tunnels can rely on it. We can quickly reconfigure XRv11's PBTS tunnels, as well as bring up tunnel24 on CSR1 (not shown), which is enabled for forwarding-adjacency.

! XRv11
interface tunnel-te50
 no autoroute announce
 forwarding-adjacency
  holdtime 10000
interface tunnel-te51
 no autoroute announce
 forwarding-adjacency
  holdtime 10000
interface tunnel-te52
 no autoroute announce
 forwarding-adjacency
  holdtime 10000
router isis 132
 interface tunnel-te50
  address-family ipv4 unicast
   metric 25
 interface tunnel-te51
  address-family ipv4 unicast
   metric 25
 interface tunnel-te52
  address-family ipv4 unicast
   metric 25

The IS-IS topology shows three separate links to CSR1 from XRv11. The FIB has the same look as before since all of the paths appear "equal" until we start sending traffic.

RP/0/0/CPU0:XRv11#show isis topology systemid 0000.0000.0001
IS-IS 132 paths to IPv4 Unicast (Level-2) routers
System Id   Metric   Next-Hop   Interface   SNPA
CSR1        25       CSR1       tt52        *PtoP*
                     CSR1       tt51        *PtoP*
                     CSR1       tt50        *PtoP*

RP/0/0/CPU0:XRv11#show cef ipv4 1.1.1.1/32 | include class
   via 1.1.1.1, tunnel-te50, 9 dependencies, weight 0, class 6 7 [flags 0x0]
   via 1.1.1.1, tunnel-te51, 9 dependencies, weight 0, class 4 5 [flags 0x0]
   via 1.1.1.1, tunnel-te52, 9 dependencies, weight 0, class default [flags 0x0]

RP/0/0/CPU0:XRv11#show mpls traffic-eng forwarding-adjacency 1.1.1.1
 destination 1.1.1.1 has 4 tunnels
   tunnel-te24 (traffic share 0, next-hop 1.1.1.1)
               (Adjacency Announced: no, holdtime 10000)
   tunnel-te50 (traffic share 0, next-hop 1.1.1.1)
               (Adjacency Announced: yes, holdtime 10000)
               (IS-IS 132 level-2, IPV4 Unicast)
   tunnel-te51 (traffic share 0, next-hop 1.1.1.1)
               (Adjacency Announced: yes, holdtime 10000)
               (IS-IS 132 level-2, IPV4 Unicast)
   tunnel-te52 (traffic share 0, next-hop 1.1.1.1)
               (Adjacency Announced: yes, holdtime 10000)
               (IS-IS 132 level-2, IPV4 Unicast)

For confirmation, we verify the labels on the tunnels. Then, we perform three traceroutes on CSR10 to verify PBTS works with forwarding-adjacency in a quasi-P2MP environment. With CBTS, this behavior is different since there is a tunnel-bundle and only one link.

RP/0/0/CPU0:XRv11#show rsvp reservation session-type lsp-p2p destination 1.1.1.1 detail | utility egrep 'Label|TunID'
RESV: IPv4-LSP Session addr: 1.1.1.1. TunID: 50. LSPId: 4.
 Labels: Outgoing downstream: 92010.
RESV: IPv4-LSP Session addr: 1.1.1.1. TunID: 51. LSPId: 4.
 Labels: Outgoing downstream: 5002.
RESV: IPv4-LSP Session addr: 1.1.1.1. TunID: 52. LSPId: 4.
 Labels: Outgoing downstream: 5015.

CSR10#traceroute ip
Target IP address: 1.1.1.1
Source address: 10.10.10.10
DSCP Value [0]: 48
[snip]
Tracing the route to 1.1.1.1
VRF info: (vrf in name/id, vrf out name/id)
  1 132.10.11.11 1 msec 1 msec 0 msec
  2 132.11.12.12 [MPLS: Label 92010 Exp 6] 28 msec 16 msec 25 msec
  3 132.6.12.6 [MPLS: Label 6004 Exp 6] 34 msec 25 msec 18 msec
  4 132.6.9.9 [MPLS: Label 9000 Exp 6] 42 msec 30 msec 42 msec
  5 132.4.9.4 [MPLS: Label 4013 Exp 6] 25 msec 34 msec 16 msec
  6 132.1.4.1 25 msec * 56 msec

CSR10#traceroute ip
Target IP address: 1.1.1.1
Source address: 10.10.10.10
DSCP Value [0]: 40
[snip]
Tracing the route to 1.1.1.1
VRF info: (vrf in name/id, vrf out name/id)
  1 132.10.11.11 1 msec 1 msec 1 msec
  2 132.5.11.5 [MPLS: Label 5002 Exp 5] 18 msec 8 msec 8 msec
  3 132.2.5.2 [MPLS: Label 2019 Exp 5] 9 msec 15 msec 26 msec
  4 132.2.3.3 [MPLS: Label 3017 Exp 5] 21 msec 15 msec 15 msec
  5 132.3.4.4 [MPLS: Label 4015 Exp 5] 15 msec 17 msec 11 msec
  6 132.1.4.1 5 msec * 7 msec

CSR10#traceroute ip
Target IP address: 1.1.1.1
Source address: 10.10.10.10
DSCP Value [0]: 24
[snip]
Tracing the route to 1.1.1.1
VRF info: (vrf in name/id, vrf out name/id)
  1 132.10.11.11 1 msec 1 msec 1 msec
  2 132.5.11.5 [MPLS: Label 5015 Exp 3] 5 msec 10 msec 30 msec
  3 132.5.10.10 [MPLS: Label 10015 Exp 3] 27 msec 25 msec 25 msec
  4 132.6.10.6 [MPLS: Label 6005 Exp 3] 25 msec 20 msec 14 msec
  5 132.1.6.1 20 msec * 16 msec

31.4 DiffServ-aware Traffic Engineering (DS-TE)

DiffServ-aware TE grants finer granularity to bandwidth reservations. Multiple bandwidth pools can exist with different amounts of bandwidth to service different traffic types. The pre-standard Cisco version accounted for two pools: global and sub. The global pool is what we have already analyzed, which represents the more general pool of bandwidth. If the pool is unspecified in a bandwidth reservation, the global pool is assumed. The sub-pool is typically used for higher priority flows, such as voice, and is accounted for separately from the global pool.

The IETF later extended DS-TE to encompass up to 8 different class types (CT), each of which can have flows within the existing 8 priorities, making 64 possible combinations of TE classes; currently, only 2 of the CTs are used (seen later).

As a general comment, routers migrating from pre-standard to standard (IETF) modes can be configured to use both. In this model, the local router floods/advertises pre-standard information, but is capable of processing both pre-standard and IETF information. The network diagram is identical to the TE-manual diagram (affinity on all links), but is shown again for clarity.


31.4.1 Pre-standard Model

The pre-standard version is very straightforward to configure; Cisco only changed two commands to account for DS-TE originally. The "ip rsvp bandwidth" command at the link-level includes a "sub-pool" option to specify available sub-pool bandwidth. To use this sub-pool bandwidth, TE tunnels can specify "sub-pool" as an option within the "tunnel mpls traffic-eng bandwidth" command. The feature does not have to be explicitly enabled on core LSRs and all of the show commands are the same. The pre-standard version behaves like the IETF Russian Dolls Model (RDM) except with only two bandwidth constraints (BC). Because RSVP has not been extended to support the pre-standard version, global pool bandwidth is signaled as "controlled-load" service, whereas sub pool bandwidth is signaled as "guaranteed" service. Those terms are discussed later, but the point is that "pre-standard" is functionally similar to IETF RDM minus the RSVP extensions.

We will enable an RSVP global pool size of 100 Mbps and a sub pool size of 20 Mbps on all links in the topology. Only CSR5 and XRv11 are shown for brevity.

! CSR5
interface GigabitEthernet2.525
 ip rsvp bandwidth 100000 sub-pool 20000
interface GigabitEthernet2.550
 ip rsvp bandwidth 100000 sub-pool 20000
interface GigabitEthernet2.551
 ip rsvp bandwidth 100000 sub-pool 20000

! XRv11
rsvp
 interface GigabitEthernet0/0/0/0.501
  bandwidth 100000 sub-pool 20000
 interface GigabitEthernet0/0/0/0.512
  bandwidth 100000 sub-pool 20000
 interface GigabitEthernet0/0/0/0.551
  bandwidth 100000 sub-pool 20000

As a side note, when configuring pre-standard or IETF RDM DS-TE, it is important to ensure that each higher numbered BC has no more bandwidth than the next lower BC. In this case, it doesn't make sense for the sub-pool (BC1) to have more bandwidth than the global pool (BC0), and the router rejects such a configuration.

CSR3(config-subif)#ip rsvp bandwidth rdm bc0 10000 bc1 20000
Reservable sub pool (pool1) bandwidth must be less than or equal to RSVP Reservable Bandwidth.
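To make the nesting of the two bandwidth constraints concrete, here is a minimal Python sketch of Russian Dolls Model admission control with two pools, following the general RDM definition in RFC 4127. It deliberately ignores the 8 setup/hold priorities and preemption, and the class and method names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class RdmLink:
    """Two-constraint RDM link: BC0 bounds all traffic, BC1 bounds the sub-pool."""
    bc0_kbps: int
    bc1_kbps: int
    global_used: int = 0
    sub_used: int = 0

    def __post_init__(self) -> None:
        # Mirrors the CLI check above: BC1 may not exceed BC0.
        if self.bc1_kbps > self.bc0_kbps:
            raise ValueError("sub-pool (BC1) must be <= global pool (BC0)")

    def available(self, pool: str) -> int:
        total_free = self.bc0_kbps - self.global_used - self.sub_used
        if pool == "global":
            return total_free
        if pool == "sub":
            # The sub-pool is nested inside the global pool.
            return min(self.bc1_kbps - self.sub_used, total_free)
        raise ValueError(f"unknown pool: {pool}")

    def reserve(self, pool: str, kbps: int) -> bool:
        if kbps > self.available(pool):
            return False
        if pool == "sub":
            self.sub_used += kbps
        else:
            self.global_used += kbps
        return True


link = RdmLink(bc0_kbps=100000, bc1_kbps=20000)
link.reserve("sub", 5000)
print(link.available("global"), link.available("sub"))
```

In this model a 5 Mbps sub-pool reservation on a 100 Mbps / 20 Mbps link leaves 95 Mbps reservable from the global pool and 15 Mbps from the sub-pool.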

We can verify the configuration was successful in both routers by checking the RSVP interface summary. The RSVP interface details don't reveal anything new, but show the same output in verbose fashion.

CSR5#show ip rsvp interface
interface    rsvp    allocated    i/f max    flow max    sub max    VRF
Gi2          ena     0            750M       750M        0
Gi2.525      ena     0            100M       100M        20M
Gi2.550      ena     0            100M       100M        20M
Gi2.551      ena     0            100M       100M        20M

RP/0/0/CPU0:XRv11#show rsvp interface
*: RDM: Default I/F B/W % : 75% [default] (max resv/bc0), 0% [default] (bc1)
Interface        MaxBW (bps)    MaxFlow (bps)    Allocated (bps)    MaxSub (bps)
-------------    -----------    -------------    ---------------    ------------
Gi0/0/0/0.501    100M           100M             0 ( 0%)            20M
Gi0/0/0/0.512    100M           100M             0 ( 0%)            20M
Gi0/0/0/0.551    100M           100M             0 ( 0%)            20M

CSR5#show ip rsvp interface detail gig2.550 | section Bandwidth
   Bandwidth:
     Curr allocated: 0 bits/sec
     Max. allowed (total): 100M bits/sec
     Max. allowed (per flow): 100M bits/sec
     Max. allowed for LSP tunnels using sub-pools (pool 1): 20M bits/sec
     Set aside by policy (total): 0 bits/sec

RP/0/0/CPU0:XRv11#show rsvp interface gig0/0/0/0.501 detail
*: RDM: Default I/F B/W % : 75% [default] (max resv/bc0), 0% [default] (bc1)
INTERFACE: GigabitEthernet0/0/0/0.501 (ifh=0x400).
 VRF ID: 0x60000000 (Default).
 BW (bits/sec): Max=100M. MaxFlow=100M. Allocated=0 (0%). MaxSub=20M.
 Signalling: No DSCP marking. No rate limiting.
 States in: 0. Max missed msgs: 4. Max out-of-band missed msgs: 38000.
 Expiry timer: Not running. Refresh interval: 45s.
 Normal Refresh timer: Not running. Out-of-band refresh interval: 0s.
 Summary refresh timer: Not running.
 Refresh reduction local: Enabled. Summary Refresh: Enabled (1472 bytes max).
 Reliable summary refresh: Disabled.
 Bundling: Enabled. (1500 bytes max).
 Ack hold: 400 ms, Ack max size: 1500 bytes. Retransmit: 2100ms.

Checking the TE topology from CSR1 (a remote node) to look at CSR5's link to XRv11, we can see this sub-pool bandwidth is now available. All routers in IS-IS level 2 should see this. For completeness, we verify XRv11's link to CSR5 as well. CSR1#show mpls traffic-eng topology igp-id isis 0000.0000.0005.00 IGP Id: 0000.0000.0005.00, MPLS TE Id:5.5.5.5 Router Node (isis level-2) id 24 link[0]: Point-to-Point, Nbr IGP Id: 0000.0000.0011.00, nbr_node_id:27, gen:1698 frag_id: 0, Intf Address: 132.5.11.5, Nbr Intf Address: 132.5.11.11 TE metric: 10, IGP metric: 10, attribute flags: 0x5 SRLGs: None physical_bw: 1000000 (kbps), max_reservable_bw_global: 100000 (kbps) max_reservable_bw_sub: 20000 (kbps)

             Total Allocated   Global Pool       Sub Pool
             BW (kbps)         Reservable        Reservable
                               BW (kbps)         BW (kbps)
             ---------------   -----------       ----------
      bw[0]:            0           100000            20000
      bw[1]:            0           100000            20000
      bw[2]:            0           100000            20000
      bw[3]:            0           100000            20000
      bw[4]:            0           100000            20000
      bw[5]:            0           100000            20000
      bw[6]:            0           100000            20000
      bw[7]:            0           100000            20000
[snip]

CSR1#show mpls traffic-eng topology 11.11.11.11 IGP Id: 0000.0000.0011.00, MPLS TE Id:11.11.11.11 Router Node (isis level2) id 27 link[0]: Point-to-Point, Nbr IGP Id: 0000.0000.0005.00, nbr_node_id:24, gen:1570 frag_id: 0, Intf Address: 132.5.11.11, Nbr Intf Address: 132.5.11.5 TE metric: 10, IGP metric: 10, attribute flags: 0x5 SRLGs: None physical_bw: 1000000 (kbps), max_reservable_bw_global: 100000 (kbps) max_reservable_bw_sub: 20000 (kbps)


             Total Allocated   Global Pool       Sub Pool
             BW (kbps)         Reservable        Reservable
                               BW (kbps)         BW (kbps)
             ---------------   -----------       ----------
      bw[0]:            0           100000            20000
      bw[1]:            0           100000            20000
      bw[2]:            0           100000            20000
      bw[3]:            0           100000            20000
      bw[4]:            0           100000            20000
      bw[5]:            0           100000            20000
      bw[6]:            0           100000            20000
      bw[7]:            0           100000            20000
[snip]

As a final verification, we quickly check the ISIS LSPDB for CSR5's LSP. The sub-pool bandwidth is clearly shown in kbps, broken down by priority. CSR1#show isis database l2 verbose CSR5.00-00 | section XRv11 Metric: 10 IS-Extended XRv11.00 Affinity: 0x00000005 Interface IP Address: 132.5.11.5 Neighbor IP Address: 132.5.11.11 Physical BW: 1000000 kbits/sec Reservable Global Pool BW: 100000 kbits/sec Reservable Sub Pool BW: 20000 kbits/sec Global Pool BW Unreserved: [0]: 100000 kbits/sec, [1]: 100000 kbits/sec [2]: 100000 kbits/sec, [3]: 100000 kbits/sec [4]: 100000 kbits/sec, [5]: 100000 kbits/sec [6]: 100000 kbits/sec, [7]: 100000 kbits/sec Sub Pool BW Unreserved: [0]: 20000 kbits/sec, [1]: 20000 kbits/sec [2]: 20000 kbits/sec, [3]: 20000 kbits/sec [4]: 20000 kbits/sec, [5]: 20000 kbits/sec [6]: 20000 kbits/sec, [7]: 20000 kbits/sec Metric: 10 IS (MT-IPv6) XRv11.00

Next, we will configure a DS-TE tunnel [ID 200] requesting 5 Mbps of sub-pool bandwidth from CSR1 to XRv11. The TE tunnel configuration is identical to other TE tunnels except the bandwidth is identified as "sub-pool". We will re-use existing path-options from previous labs for brevity. When the tunnel configuration is complete, we verify that it is functional and that the bandwidth is identified as "Sub". ! CSR1 interface Tunnel200 description PRE-STD DS-TE GLOBAL POOL ip unnumbered Loopback0 tunnel mode mpls traffic-eng tunnel destination 11.11.11.11


 tunnel mpls traffic-eng autoroute destination
 tunnel mpls traffic-eng priority 7 7
 tunnel mpls traffic-eng bandwidth sub-pool 5000
 tunnel mpls traffic-eng affinity 0x0 mask 0x0
 tunnel mpls traffic-eng path-option 10 explicit name EP_NO_R9

CSR1#show mpls traffic-eng tunnels tunnel 200 | section Status|Config Status: Admin: up Oper: up Path: valid Signalling: connected path option 10, type explicit EP_NO_R9 (Basis for Setup, path weight 25) Config Parameters: Bandwidth: 5000 kbps (Sub) Priority: 7 7 Affinity: 0x0/0x0 Metric Type: TE (default) AutoRoute: disabled LockDown: disabled Loadshare: 5000 [400000] bw-based AutoRoute destination: enabled auto-bw: disabled

Debugging RSVP with dump-messages and looking at the outgoing PATH message, we can see that this is identified as “guaranteed service” with some generally irrelevant parameters. This is just a mechanism for RSVP to identify/classify the flow differently. Notice that the PATH message length is 252 bytes.

! CSR1
Outgoing Path:
 version:1 flags:0000 cksum:E808 ttl:255 reserved:0 length:252
 SESSION type 7 length 16:
  Tun Dest: 11.11.11.11 Tun ID: 200 Ext Tun ID: 1.1.1.1
 [snip]
 Guaranteed Service break bit=0 service length=8
  Path Delay (microseconds):0
  Path Jitter (microseconds):12
  Path delay since shaping (microseconds):0
  Path Jitter since shaping (microseconds):12

Checking RSVP, we can see that 5 Mbps was reserved. The RSVP interface summary does not show that these 5 Mbps are from the sub-pool, nor does the detailed command.

CSR1#show ip rsvp interface
interface    rsvp  allocated  i/f max  flow max  sub max  VRF
Gi2          ena   0          750M     750M      0
Gi2.513      ena   0          100M     100M      20M
Gi2.514      ena   0          100M     100M      20M
Gi2.516      ena   5M         100M     100M      20M
Gi2.518      ena   0          100M     100M      20M
Gi2.519      ena   0          100M     100M      20M

CSR1#show ip rsvp interface detail gig2.516 | section Bandwidth Bandwidth:


Curr allocated: 5M bits/sec Max. allowed (total): 100M bits/sec Max. allowed (per flow): 100M bits/sec Max. allowed for LSP tunnels using sub-pools (pool 1): 20M bits/sec Set aside by policy (total): 0 bits/sec

The TED does a better job of differentiating between pools. We can tell these 5 Mbps are used from the sub-pool because it was subtracted from both the sub-pool and global-pool. Global-pool reservations are only subtracted from the global-pool. Similar to TE priorities, bandwidth reserved within more important pools is subtracted from lesser pools, but not vice versa. CSR1#show mpls traffic-eng topology 1.1.1.1 [snip] link[1]: Point-to-Point, Nbr IGP Id: 0000.0000.0006.00, nbr_node_id:21, gen:2618 frag_id: 0, Intf Address: 132.1.6.1, Nbr Intf Address: 132.1.6.6 TE metric: 5, IGP metric: 10, attribute flags: 0x4 SRLGs: None physical_bw: 1000000 (kbps), max_reservable_bw_global: 100000 (kbps) max_reservable_bw_sub: 20000 (kbps)

             Total Allocated   Global Pool       Sub Pool
             BW (kbps)         Reservable        Reservable
                               BW (kbps)         BW (kbps)
             ---------------   -----------       ----------
      bw[0]:            0           100000            20000
      bw[1]:            0           100000            20000
      bw[2]:            0           100000            20000
      bw[3]:            0           100000            20000
      bw[4]:            0           100000            20000
      bw[5]:            0           100000            20000
      bw[6]:            0           100000            20000
      bw[7]:         5000            95000            15000
[snip]

To prove this, we temporarily change the bandwidth from sub-pool to global-pool on tunnel200 and look at the TED again. Now, the 5 Mbps is subtracted only from the global-pool. This proves that sub-pool reservations count against global-pool reservations on a link, while global-pool reservations never reduce the sub-pool. This makes sense because it ensures global-pool flows do not book more bandwidth than is available when sub-pool flows (more important) are also considered.

! CSR1
interface Tunnel200
 tunnel mpls traffic-eng bandwidth 5000

CSR1#show mpls traffic-eng topology 1.1.1.1
[snip]


link[2]: Point-to-Point, Nbr IGP Id: 0000.0000.0006.00, nbr_node_id:21, gen:2762 frag_id: 0, Intf Address: 132.1.6.1, Nbr Intf Address: 132.1.6.6 TE metric: 5, IGP metric: 10, attribute flags: 0x4 SRLGs: None physical_bw: 1000000 (kbps), max_reservable_bw_global: 100000 (kbps) max_reservable_bw_sub: 20000 (kbps)

             Total Allocated   Global Pool       Sub Pool
             BW (kbps)         Reservable        Reservable
                               BW (kbps)         BW (kbps)
             ---------------   -----------       ----------
      bw[0]:            0           100000            20000
      bw[1]:            0           100000            20000
      bw[2]:            0           100000            20000
      bw[3]:            0           100000            20000
      bw[4]:            0           100000            20000
      bw[5]:            0           100000            20000
      bw[6]:            0           100000            20000
      bw[7]:         5000            95000            20000
[snip]
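The pool accounting seen in the two TED outputs can be sketched in a few lines of Python. This is only an illustrative model, not how IOS implements it: the `TeLink` class, its method names, and the preemption rule (a request at a given priority ignores reservations held at numerically higher priorities) are assumptions chosen to reproduce the numbers shown above.

```python
from dataclasses import dataclass, field


@dataclass
class TeLink:
    """Hypothetical model of one TE link with a global pool and a sub-pool (kbps)."""
    max_global: int
    max_sub: int
    # reserved[pool][hold_priority] -> kbps currently held at that priority
    reserved: dict = field(
        default_factory=lambda: {"global": [0] * 8, "sub": [0] * 8}
    )

    def reserve(self, pool, hold_priority, kbps):
        """Record a reservation in 'global' or 'sub' at priority 0 (best) to 7."""
        self.reserved[pool][hold_priority] += kbps

    def unreserved(self, pool, priority):
        """Bandwidth a new LSP at `priority` could still obtain from `pool`."""
        # Only reservations held at this priority or better count; anything
        # held at a worse priority could be preempted.
        sub_used = sum(self.reserved["sub"][: priority + 1])
        global_used = sum(self.reserved["global"][: priority + 1])
        # Sub-pool reservations also consume the global pool, not vice versa.
        global_left = self.max_global - global_used - sub_used
        if pool == "global":
            return global_left
        return min(self.max_sub - sub_used, global_left)


# 5 Mbps of sub-pool bandwidth at priority 7, as on CSR1's link to CSR6
link = TeLink(max_global=100000, max_sub=20000)
link.reserve("sub", 7, 5000)
print(link.unreserved("global", 7), link.unreserved("sub", 7))
```

Reserving the same 5000 kbps from the global pool instead leaves the sub-pool untouched in this model, which mirrors the second TED output.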

Debugging RSVP dump-messages again, we can see this is just a normal “controlled-load” service with no fancy options. The difference in RSVP PATH message length is 32 bytes (252 bytes for guaranteed service minus 220 bytes for controlled-load service). Since the "service length" of the sub-pool PATH message is 8, I assume this is a measurement counting 4-byte words: 8 words x 4 bytes = 32 bytes, which matches the difference. Each of the sub-service fields is a 32-bit (4-byte) unsigned integer as a result.

! CSR1
Outgoing Path:
 version:1 flags:0000 cksum:FF75 ttl:255 reserved:0 length:220
 SESSION type 7 length 16:
  Tun Dest: 11.11.11.11 Tun ID: 200 Ext Tun ID: 1.1.1.1
 [snip]
 Controlled Load Service break bit=0 service length=0

For additional verification, we perform a quick data-plane verification of the LSP. CSR1 is using RSVPbound label 6016 to send traffic to CSR6. Using traceroute on CSR1, we verify that traffic to 11.11.11.11 is forwarded inside the TE tunnel. CSR1#show ip rsvp reservation detail filter session-type 7 destination 11.11.11.11 | include Label Label: 6016 (outgoing) CSR1#traceroute 11.11.11.11 source 1.1.1.1 Type escape sequence to abort. Tracing the route to 11.11.11.11


VRF info: (vrf in name/id, vrf out name/id)
  1 132.1.6.6 [MPLS: Label 6016 Exp 0] 16 msec 8 msec 8 msec
  2 132.6.10.10 [MPLS: Label 10013 Exp 0] 16 msec 16 msec 25 msec
  3 132.10.11.11 9 msec * 21 msec

Before continuing, we change the 5 Mbps reservation back to sub-pool bandwidth. Unlike CBTS, all of the traffic towards XRv11's loopback will use this tunnel. The difference is that bandwidth is reserved differently; a bandwidth reservation on a single tunnel cannot reserve both global-pool and sub-pool bandwidth concurrently. Because of this, the underlying QoS queuing/shaping architecture should match the DS-TE configuration. We will create a quick policy on CSR1 to demonstrate this. The PMAP_SHAPE policy is applied to all 5 of CSR1's MPLS-facing sub-interfaces (application of policy is not shown). The math behind the 26 bytes of overhead is described next.

! CSR1
class-map match-all CMAP_VOICE
 match mpls experimental topmost 5
policy-map PMAP_QUEUE
 class CMAP_VOICE
  priority 5000
 class class-default
  queue-limit 16 packets
policy-map PMAP_SHAPE
 class class-default
  shape average 100000000 account user-defined 26
  service-policy PMAP_QUEUE

We know that 5 Mbps have been reserved; let's assume this is a layer 2 reservation which accounts for necessary overhead already. Cisco queuing and policing always accounts for layer 2 overhead, and shaping can optionally be configured to do so also. The overhead of packets traversing the network will vary depending on the MPLS label stack, but a safe assumption would be: 14 bytes for Ethernet, 4 bytes of 802.1q, and two 4-byte MPLS shim headers. This is a total of 26 bytes, and we can quickly prove it by using our QoS policy. Sending a single 100-byte packet at DSCP CS5 will show us how Cisco queuing accounts for overhead. The policy-map details show a single 126 byte packet, which verifies the 26 bytes of overhead as expected. CSR7#ping 13.13.13.13 source 7.7.7.7 tos 160 repeat 1 Type escape sequence to abort. Sending 1, 100-byte ICMP Echos to 13.13.13.13, timeout is 2 seconds: Packet sent with a source address of 7.7.7.7 ! Success rate is 100 percent (1/1), round-trip min/avg/max = 21/21/21 ms CSR1#show policy-map interface gig2.516 | section VOICE Class-map: CMAP_VOICE (match-all)


1 packets, 126 bytes 5 minute offered rate 0000 bps, drop rate 0000 bps Match: mpls experimental topmost 5 Priority: 5000 kbps, burst bytes 125000, b/w exceed drops: 0

The math is not always precise. For example, sending VPN traffic to CSR8 results in only a single MPLS label due to PHP, so the overhead is 4 bytes less for a total of 22 bytes. This isn't terribly significant for DS-TE but it is worth mentioning. The reason for the detailed synchronization between RSVP reservations and QoS schedulers is that RSVP is only a control-plane reservation. DS-TE is useless without appropriate queuing mechanisms at the link-level. Because we stated that the 5 Mbps accounts for layer 2 overhead, the actual user traffic would be slightly less than 5 Mbps.

CSR7#ping 14.14.14.14 so 7.7.7.7 tos 160 repeat 1
Type escape sequence to abort.
Sending 1, 100-byte ICMP Echos to 14.14.14.14, timeout is 2 seconds:
Packet sent with a source address of 7.7.7.7
!
Success rate is 100 percent (1/1), round-trip min/avg/max = 2/2/2 ms

CSR1#show policy-map interface gig2.518 | section VOICE
 Class-map: CMAP_VOICE (match-all)
  1 packets, 122 bytes
  5 minute offered rate 0000 bps, drop rate 0000 bps
  Match: mpls experimental topmost 5
  Priority: 5000 kbps, burst bytes 125000, b/w exceed drops: 0
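The overhead arithmetic from the last two tests can be written out as a small Python sketch. The constant and function names are hypothetical, and the final calculation assumes every packet is 100 bytes at layer 3, which real traffic will not be; it only illustrates why user throughput ends up below the reserved 5 Mbps.

```python
ETHERNET_HEADER = 14  # bytes, as counted by the queuing policy in the text
DOT1Q_TAG = 4         # bytes per 802.1q tag
MPLS_SHIM = 4         # bytes per MPLS label


def l2_overhead(mpls_labels, dot1q_tags=1):
    """Layer 2 plus MPLS overhead in bytes for a given label stack depth."""
    return ETHERNET_HEADER + dot1q_tags * DOT1Q_TAG + mpls_labels * MPLS_SHIM


def accounted_bytes(ip_packet_bytes, mpls_labels):
    """Bytes the policy-map counters would show for one packet."""
    return ip_packet_bytes + l2_overhead(mpls_labels)


def user_rate_kbps(reserved_kbps, ip_packet_bytes, mpls_labels):
    """Layer 3 rate that fits inside a layer 2 reservation (fixed packet size)."""
    total = accounted_bytes(ip_packet_bytes, mpls_labels)
    return reserved_kbps * ip_packet_bytes / total


print(accounted_bytes(100, 2))  # two labels: the gig2.516 test
print(accounted_bytes(100, 1))  # one label after PHP: the gig2.518 test
print(round(user_rate_kbps(5000, 100, 2)))
```

With larger packets the fixed overhead matters less, so the gap between reserved and usable bandwidth shrinks.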

For completeness, we will configure pre-standard DS-TE on XRv11 as well. The only difference from a standard TE-tunnel in XR is that the signalled-bandwidth is identified as sub-pool bandwidth. We can reuse old explicit-paths; later we will test FRR on the link between CSR3 and CSR10, so we want to use that link. Once configured, we verify the tunnel comes up and routes the way the explicit-path specified. ! XRv11 interface tunnel-te200 description PRE-STD DS-TE GLOBAL POOL ipv4 unnumbered Loopback0 logging events all signalled-bandwidth sub-pool 11000 destination 8.8.8.8 affinity ignore path-option 10 explicit name PATH_11_5_2_3_10_6_8 RP/0/0/CPU0:XRv11#show mpls traffic-eng tunnels 200 role head | begin Path info Path info (IS-IS 132 level-2): Node hop count: 6 Hop0: 132.5.11.5 Hop1: 132.2.5.2


Hop2: 132.2.3.2
Hop3: 132.2.3.3
Hop4: 132.3.10.10
Hop5: 132.6.10.6
Hop6: 132.6.8.8
Hop7: 8.8.8.8

Checking XRv11's local TE topology information, we see that 11 Mbps of sub-pool bandwidth was allocated at priority 7 (worst). This was also reduced from the global pool, as expected, which is consistent with the RDM BC-model discussed later. It is interesting to note that XR shows the BC model is RDM, which is somewhat true since the pre-standard version behaves like IETF RDM in many ways. RP/0/0/CPU0:XRv11#show mpls traffic-eng topology 11.11.11.11 | begin 0005 Link[0]:Point-to-Point, Nbr IGP Id:0000.0000.0005.00, Nbr Node Id:6, gen:815 Frag Id:0, Intf Address:132.5.11.11, Intf Id:0 Nbr Intf Address:132.5.11.5, Nbr Intf Id:0 TE Metric:10, IGP Metric:10 Attribute Flags: 0x5 Ext Admin Group: Length: 256 bits Value : 0x::5 Attribute Names: Unnamed bits : 0 2 Switching Capability:None, Encoding:unassigned BC Model ID:RDM Physical BW:1000000 (kbps), Max Reservable BW Global:100000 (kbps) Max Reservable BW Sub:20000 (kbps) Global Pool Sub Pool Total Allocated Reservable Reservable BW (kbps) BW (kbps) BW (kbps) ---------------------------------bw[0]: 0 100000 20000 bw[1]: 0 100000 20000 bw[2]: 0 100000 20000 bw[3]: 0 100000 20000 bw[4]: 0 100000 20000 bw[5]: 0 100000 20000 bw[6]: 0 100000 20000 bw[7]: 11000 89000 9000

To test FRR with DS-TE, we will begin by modifying an existing NHOP tunnel on CSR3. This is an NHOP backup tunnel that requires green, blue, and orange colors on a link. The tunnel routes via CSR2 as this is the only option meeting the affinity constraints. The tunnel currently does not specify any backup bandwidth. ! CSR3 interface Tunnel30


description REPAIR PATH NHOP (PLR) ip unnumbered Loopback0 tunnel mode mpls traffic-eng tunnel destination 10.10.10.10 tunnel mpls traffic-eng affinity 0xE mask 0xE tunnel mpls traffic-eng path-option 10 explicit name PATH_AVOID_CSR3_CSR10 CSR3#show mpls traffic-eng tunnels tunnel 30 | section RSVP Path RSVP Path Info: My Address: 132.2.3.3 Explicit Route: 132.2.3.2 132.2.10.10 10.10.10.10 Record Route: NONE Tspec: ave rate=0 kbits, burst=1000 bytes, peak rate=0 kbits

For FRR to be effective within the context of DS-TE, we need tunnels that request FRR on CSR1 and XRv11, along with bandwidth protection. We add two tunnels [ID 210 and 211] to test this. Tunnel210 will request global pool bandwidth while tunnel211 will request sub-pool bandwidth. We can look at the RSVP RESV RRO on both CSR1 and XRv11 to see the tunnels are protected at CSR3, the PLR. First, we configure and verify the CSR1 tunnels. The NHOP backup tunnel currently has "unlimited" bandwidth since no backup-bandwidth is specified. Furthermore, not specifying a bandwidth type, such as global or sub-pool, means that the tunnel can back up a tunnel requesting bandwidth from any pool. We also shut down tunnel200 (not shown) so that it does not interfere with this test. ! CSR1 interface Tunnel210 description DS-TE GLOBAL-POOL BW FRR ip unnumbered Loopback0 tunnel mode mpls traffic-eng tunnel destination 11.11.11.11 tunnel mpls traffic-eng priority 7 7 tunnel mpls traffic-eng bandwidth 9000 tunnel mpls traffic-eng affinity 0x0 mask 0x0 tunnel mpls traffic-eng path-option 10 explicit name EP_1_4_3_10_11 tunnel mpls traffic-eng fast-reroute bw-protect interface Tunnel211 description DS-TE SUB-POOL BW FRR ip unnumbered Loopback0 tunnel mode mpls traffic-eng tunnel destination 11.11.11.11 tunnel mpls traffic-eng priority 7 7 tunnel mpls traffic-eng bandwidth sub-pool 3000 tunnel mpls traffic-eng affinity 0x0 mask 0x0 tunnel mpls traffic-eng path-option 10 explicit name EP_1_4_3_10_11 tunnel mpls traffic-eng fast-reroute bw-protect

Since tunnels 210 and 211 are requesting FRR, their RSVP RESV RROs should reflect the available NHOP protection. In the FRR section, we saw that backup tunnels with no "backup-bw" identified were

assumed to have unlimited backup bandwidth. Despite requesting bandwidth protection, the tunnel does not specify any, so the assumption of having "unlimited" bandwidth really means that no bandwidth protection is provided. For this reason, CSR1 sees the PLR as offering local NHOP protection, but no bandwidth protection. CSR1#show ip rsvp reservation detail filter session-type 7 | section Tun_ID|RRO Tun Dest: 11.11.11.11 Tun ID: 210 Ext Tun ID: 1.1.1.1 RRO: 4.4.4.4/32, Flags:0x20 (No Local Protection, Node-id) Label subobject: Flags 0x1, C-Type 1, Label 4001 3.3.3.3/32, Flags:0x21 (Local Prot Avail/to NHOP, Node-id) Label subobject: Flags 0x1, C-Type 1, Label 3007 10.10.10.10/32, Flags:0x20 (No Local Protection, Node-id) Label subobject: Flags 0x1, C-Type 1, Label 10008 11.11.11.11/32, Flags:0x20 (No Local Protection, Node-id) Label subobject: Flags 0x1, C-Type 1, Label 0 132.10.11.11/32, Flags:0x0 (No Local Protection) Label subobject: Flags 0x1, C-Type 1, Label 0 Tun Dest: 11.11.11.11 Tun ID: 211 Ext Tun ID: 1.1.1.1 RRO: 4.4.4.4/32, Flags:0x20 (No Local Protection, Node-id) Label subobject: Flags 0x1, C-Type 1, Label 4008 3.3.3.3/32, Flags:0x21 (Local Prot Avail/to NHOP, Node-id) Label subobject: Flags 0x1, C-Type 1, Label 3013 10.10.10.10/32, Flags:0x20 (No Local Protection, Node-id) Label subobject: Flags 0x1, C-Type 1, Label 10013 11.11.11.11/32, Flags:0x20 (No Local Protection, Node-id) Label subobject: Flags 0x1, C-Type 1, Label 0 132.10.11.11/32, Flags:0x0 (No Local Protection) Label subobject: Flags 0x1, C-Type 1, Label 0

Next, we configure the equivalent tunnels on XRv11. The tunnels are virtually identical with different bandwidth reservations in different pools. ! XRv11 interface tunnel-te210 description DS-TE GLOBAL-POOL BW FRR ipv4 unnumbered Loopback0 logging events all signalled-bandwidth 11000 destination 8.8.8.8 fast-reroute protect bandwidth affinity ignore path-option 10 explicit name PATH_11_5_2_3_10_6_8 interface tunnel-te211 description DS-TE SUB-POOL BW FRR


ipv4 unnumbered Loopback0 logging events all signalled-bandwidth sub-pool 4000 destination 8.8.8.8 fast-reroute protect bandwidth affinity ignore path-option 10 explicit name PATH_11_5_2_3_10_6_8

Like the CSR1 DS-TE tunnels, both tunnels are requesting bandwidth protection but it is not available. Local protection is supported at CSR3 for both tunnels, though. RP/0/0/CPU0:XRv11#show mpls traffic-eng tunnels 210 detail | begin Resv Info Resv Info: Record Route: IPv4 5.5.5.5, flags 0x20 (Node-ID) Label 5017, flags 0x1 IPv4 2.2.2.2, flags 0x20 (Node-ID) Label 2018, flags 0x1 IPv4 3.3.3.3, flags 0x21 (Node-ID, Protection: available) Label 3021, flags 0x1 IPv4 10.10.10.10, flags 0x20 (Node-ID) Label 10021, flags 0x1 IPv4 6.6.6.6, flags 0x20 (Node-ID) Label 6015, flags 0x1 IPv4 8.8.8.8, flags 0x20 (Node-ID) Label 0, flags 0x1 Fspec: avg rate=11000 kbits, burst=1000 bytes, peak rate=11000 kbits RP/0/0/CPU0:XRv11#show mpls traffic-eng tunnels 211 detail | begin Resv Info Resv Info: Record Route: IPv4 5.5.5.5, flags 0x20 (Node-ID) Label 5018, flags 0x1 IPv4 2.2.2.2, flags 0x20 (Node-ID) Label 2019, flags 0x1 IPv4 3.3.3.3, flags 0x21 (Node-ID, Protection: available) Label 3022, flags 0x1 IPv4 10.10.10.10, flags 0x20 (Node-ID) Label 10022, flags 0x1 IPv4 6.6.6.6, flags 0x20 (Node-ID) Label 6016, flags 0x1 IPv4 8.8.8.8, flags 0x20 (Node-ID) Label 0, flags 0x1 Fspec: avg rate=4000 kbits, burst=1000 bytes, peak rate=4000 kbits

CSR3 sees all four tunnels in its FRR database as well. Each one is mapped to tunnel30, the NHOP tunnel with unlimited bandwidth servicing "any" pool. CSR3#show mpls traffic-eng fast-reroute database | begin midpoint


P2P LSP midpoint frr information: LSP identifier In-label Out intf/label FRR intf/label Status ------------------------ -------- -------------------------------1.1.1.1 210 [25] 3023 Gi2.530:10023 Tu30:10023 ready 1.1.1.1 211 [25] 3024 Gi2.530:10024 Tu30:10024 ready 11.11.11.11 210 [5] 3021 Gi2.530:10021 Tu30:10021 ready 11.11.11.11 211 [5] 3022 Gi2.530:10022 Tu30:10022 ready

The backup tunnel details show that "any" pool is assumed, since no pool was specified. Also, the total in-use bandwidth on the tunnel is 27 Mbps, none of which is actually protected. 27 Mbps is the sum of all four tunnels' requested bandwidth (9 + 3 + 11 + 4 Mbps); the output does not distinguish between global pool and sub-pool requests.

CSR3#show mpls traffic-eng tunnels tunnel 30 backup
REPAIR PATH NHOP (PLR)
 LSP Head, Admin: up, Oper: up
 Tun ID: 30, LSP ID: 24, Source: 3.3.3.3
 Destination: 10.10.10.10
 Fast Reroute Backup Provided:
  Protected i/fs: Gi2.530
  Protected LSPs/Sub-LSPs: 4, Active: 0
  Backup BW: any pool unlimited; inuse: 27000 kbps
  Backup flags: 0x0

As a test, we will configure the tunnel to offer 7 Mbps of sub-pool backup bandwidth. This is just enough to back up both tunnel211 interfaces on CSR1 and XRv11 (3 Mbps + 4 Mbps). At this point, both tunnel210 interfaces have no protection at all, since the NHOP tunnel services only sub-pool flows. The FRR tunnel has now specified both the type and quantity of bandwidth, making it much more restrictive/selective in terms of which LSPs it is capable of protecting.

! CSR3
interface Tunnel30
 tunnel mpls traffic-eng backup-bw sub-pool 7000

We can verify that only the tunnel211 interfaces are backed up on the PLR by checking the FRR database and the tunnel30 backup details. The bandwidth type is sub-pool and the amount is 7 Mbps.

CSR3#show mpls traffic-eng fast-reroute database | begin midpoint
P2P LSP midpoint frr information:
LSP identifier           In-label  Out intf/label  FRR intf/label  Status
-----------------------  --------  --------------  --------------  ------
1.1.1.1 211 [25]         3024      Gi2.530:10024   Tu30:10024      ready
11.11.11.11 211 [5]      3022      Gi2.530:10022   Tu30:10022      ready

CSR3#show mpls traffic-eng tunnels tunnel 30 backup REPAIR PATH NHOP (PLR) LSP Head, Admin: up, Oper: up Tun ID: 30, LSP ID: 24, Source: 3.3.3.3


Destination: 10.10.10.10 Fast Reroute Backup Provided: Protected i/fs: Gi2.530 Protected LSPs/Sub-LSPs: 2, Active: 0 Backup BW: sub-pool; limit: 7000 kbps, inuse: 7000 kbps (BWP inuse: 7000 kbps) Backup flags: 0x0

On CSR1, we notice that tunnel210 is no longer protected by the NHOP tunnel at all. This is because it is requesting global pool bandwidth and the NHOP tunnel only provides sub-pool bandwidth backup. Tunnel211, however, now has local protection and available bandwidth, since it is requesting sub-pool bandwidth and the backup tunnel has a specific amount of bandwidth available (versus unlimited). Tunnel210 would not be protected even if it wasn't asking for bandwidth-protection simply because the bandwidth pools are mismatched between the head-end and PLR tunnels.

CSR1#show ip rsvp reservation detail filter session-type 7 | section Tun_ID|RRO
 Tun Dest: 11.11.11.11 Tun ID: 210 Ext Tun ID: 1.1.1.1
 RRO:
  4.4.4.4/32, Flags:0x20 (No Local Protection, Node-id)
   Label subobject: Flags 0x1, C-Type 1, Label 4015
  3.3.3.3/32, Flags:0x20 (No Local Protection, Node-id)
   Label subobject: Flags 0x1, C-Type 1, Label 3023
  10.10.10.10/32, Flags:0x20 (No Local Protection, Node-id)
   Label subobject: Flags 0x1, C-Type 1, Label 10023
  11.11.11.11/32, Flags:0x20 (No Local Protection, Node-id)
   Label subobject: Flags 0x1, C-Type 1, Label 0
  132.10.11.11/32, Flags:0x0 (No Local Protection)
   Label subobject: Flags 0x1, C-Type 1, Label 0
 Tun Dest: 11.11.11.11 Tun ID: 211 Ext Tun ID: 1.1.1.1
 RRO:
  4.4.4.4/32, Flags:0x20 (No Local Protection, Node-id)
   Label subobject: Flags 0x1, C-Type 1, Label 4016
  3.3.3.3/32, Flags:0x25 (Local Prot Avail/Has BW/to NHOP, Node-id)
   Label subobject: Flags 0x1, C-Type 1, Label 3024
  10.10.10.10/32, Flags:0x20 (No Local Protection, Node-id)
   Label subobject: Flags 0x1, C-Type 1, Label 10024
  11.11.11.11/32, Flags:0x20 (No Local Protection, Node-id)
   Label subobject: Flags 0x1, C-Type 1, Label 0
  132.10.11.11/32, Flags:0x0 (No Local Protection)
   Label subobject: Flags 0x1, C-Type 1, Label 0
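The RRO flag values in this output (0x20, 0x21, 0x25) are bit fields, and decoding them by hand gets tedious. The sketch below is a minimal Python decoder using the flag bits defined for the RRO IPv4 subobject in RFC 4090 (protection bits) and RFC 4561 (Node-ID bit). The function name and the label strings are made up for illustration; they are not the strings IOS or XR prints.

```python
# Bit values from RFC 4090 (0x01 to 0x08) and RFC 4561 (0x20)
RRO_FLAGS = {
    0x01: "local protection available",
    0x02: "local protection in use",
    0x04: "bandwidth protection",
    0x08: "node protection",
    0x20: "node-id",
}


def decode_rro_flags(value):
    """Return the names of all flag bits set in an RRO subobject flags byte."""
    return [name for bit, name in RRO_FLAGS.items() if value & bit]


for flags in (0x20, 0x21, 0x25):
    print(hex(flags), decode_rro_flags(flags))
```

A hop with the node protection bit (0x08) clear but local protection available is being protected by an NHOP (link protection) tunnel, which lines up with the "to NHOP" text in the IOS output.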

Checking XRv11, we see the same effect. Tunnel210 now has no protection at all, while tunnel211 now has bandwidth protection via the NHOP tunnel on CSR3. RP/0/0/CPU0:XRv11#show mpls traffic-eng tunnels 210 detail | begin Resv Info Resv Info: Record Route:


IPv4 5.5.5.5, flags 0x20 (Node-ID) Label 5017, flags 0x1 IPv4 2.2.2.2, flags 0x20 (Node-ID) Label 2018, flags 0x1 IPv4 3.3.3.3, flags 0x20 (Node-ID) Label 3021, flags 0x1 IPv4 10.10.10.10, flags 0x20 (Node-ID) Label 10021, flags 0x1 IPv4 6.6.6.6, flags 0x20 (Node-ID) Label 6015, flags 0x1 IPv4 8.8.8.8, flags 0x20 (Node-ID) Label 0, flags 0x1 Fspec: avg rate=11000 kbits, burst=1000 bytes, peak rate=11000 kbits RP/0/0/CPU0:XRv11#show mpls traffic-eng tunnels 211 detail | begin Resv Info Resv Info: Record Route: IPv4 5.5.5.5, flags 0x20 (Node-ID) Label 5018, flags 0x1 IPv4 2.2.2.2, flags 0x20 (Node-ID) Label 2019, flags 0x1 IPv4 3.3.3.3, flags 0x25 (Node-ID, Protection: available, b/w) Label 3022, flags 0x1 IPv4 10.10.10.10, flags 0x20 (Node-ID) Label 10022, flags 0x1 IPv4 6.6.6.6, flags 0x20 (Node-ID) Label 6016, flags 0x1 IPv4 8.8.8.8, flags 0x20 (Node-ID) Label 0, flags 0x1 Fspec: avg rate=4000 kbits, burst=1000 bytes, peak rate=4000 kbits

We can also specify the type of bandwidth being protected without specifying an amount. This limits the tunnel to protecting only a specific pool, as seen above, without capping the amount of bandwidth. In this example, we adjust the NHOP tunnel to support only global pool but with unlimited bandwidth. Now, the FRR database on the PLR shows protection only for the tunnel210 interfaces, since the global pool NHOP tunnel cannot back up sub-pool LSPs. There is no limit on the bandwidth and 20 Mbps are in use (sum of CSR1 and XRv11 tunnel210 global pool bandwidth requests).

! CSR3
interface Tunnel30
 tunnel mpls traffic-eng backup-bw global-pool unlimited

CSR3#show mpls traffic-eng fast-reroute database | begin midpoint
P2P LSP midpoint frr information:
LSP identifier           In-label  Out intf/label  FRR intf/label  Status
-----------------------  --------  --------------  --------------  ------
1.1.1.1 210 [30]         3008      Gi2.530:10014   Tu30:10014      ready
11.11.11.11 210 [5]      3021      Gi2.530:10021   Tu30:10021      ready


CSR3#show mpls traffic-eng tunnels tunnel 30 backup REPAIR PATH NHOP (PLR) LSP Head, Admin: up, Oper: up Tun ID: 30, LSP ID: 24, Source: 3.3.3.3 Destination: 10.10.10.10 Fast Reroute Backup Provided: Protected i/fs: Gi2.530 Protected LSPs/Sub-LSPs: 2, Active: 0 Backup BW: global pool; limit 0 kbps, inuse: 20000 kbps (BWP inuse: 0 kbps) Backup flags: 0x1

As expected, CSR1's tunnel210 is protected again, but without bandwidth protection. We only specified the type of bandwidth, not the amount, so unlimited bandwidth really means no bandwidth. Tunnel211 currently has no protection at all. This is slightly different from the previous example since the protected LSPs are not claiming to have bandwidth protection. CSR1#show ip rsvp reservation detail filter session-type 7 | section Tun_ID|RRO Tun Dest: 11.11.11.11 Tun ID: 210 Ext Tun ID: 1.1.1.1 RRO: 4.4.4.4/32, Flags:0x20 (No Local Protection, Node-id) Label subobject: Flags 0x1, C-Type 1, Label 4017 3.3.3.3/32, Flags:0x21 (Local Prot Avail/to NHOP, Node-id) Label subobject: Flags 0x1, C-Type 1, Label 3008 10.10.10.10/32, Flags:0x20 (No Local Protection, Node-id) Label subobject: Flags 0x1, C-Type 1, Label 10014 11.11.11.11/32, Flags:0x20 (No Local Protection, Node-id) Label subobject: Flags 0x1, C-Type 1, Label 0 132.10.11.11/32, Flags:0x0 (No Local Protection) Label subobject: Flags 0x1, C-Type 1, Label 0 Tun Dest: 11.11.11.11 Tun ID: 211 Ext Tun ID: 1.1.1.1 RRO: 4.4.4.4/32, Flags:0x20 (No Local Protection, Node-id) Label subobject: Flags 0x1, C-Type 1, Label 4001 3.3.3.3/32, Flags:0x20 (No Local Protection, Node-id) Label subobject: Flags 0x1, C-Type 1, Label 3017 10.10.10.10/32, Flags:0x20 (No Local Protection, Node-id) Label subobject: Flags 0x1, C-Type 1, Label 10017 11.11.11.11/32, Flags:0x20 (No Local Protection, Node-id) Label subobject: Flags 0x1, C-Type 1, Label 0 132.10.11.11/32, Flags:0x0 (No Local Protection) Label subobject: Flags 0x1, C-Type 1, Label 0

The same is true on XRv11. Tunnel211 now has no protection while tunnel210 has NHOP protection without any backup bandwidth guarantee. The point is that specifying the bandwidth type, and not the amount, can be used to differentiate between tunnels used for backups.

RP/0/0/CPU0:XRv11#show mpls traffic-eng tunnels 210 detail | begin Resv Info Resv Info: Record Route: IPv4 5.5.5.5, flags 0x20 (Node-ID) Label 5017, flags 0x1 IPv4 2.2.2.2, flags 0x20 (Node-ID) Label 2018, flags 0x1 IPv4 3.3.3.3, flags 0x21 (Node-ID, Protection: available) Label 3021, flags 0x1 IPv4 10.10.10.10, flags 0x20 (Node-ID) Label 10021, flags 0x1 IPv4 6.6.6.6, flags 0x20 (Node-ID) Label 6015, flags 0x1 IPv4 8.8.8.8, flags 0x20 (Node-ID) Label 0, flags 0x1 Fspec: avg rate=11000 kbits, burst=1000 bytes, peak rate=11000 kbits RP/0/0/CPU0:XRv11#show mpls traffic-eng tunnels 211 detail | begin Resv Info Resv Info: Record Route: IPv4 5.5.5.5, flags 0x20 (Node-ID) Label 5018, flags 0x1 IPv4 2.2.2.2, flags 0x20 (Node-ID) Label 2019, flags 0x1 IPv4 3.3.3.3, flags 0x20 (Node-ID) Label 3022, flags 0x1 IPv4 10.10.10.10, flags 0x20 (Node-ID) Label 10022, flags 0x1 IPv4 6.6.6.6, flags 0x20 (Node-ID) Label 6016, flags 0x1 IPv4 8.8.8.8, flags 0x20 (Node-ID) Label 0, flags 0x1 Fspec: avg rate=4000 kbits, burst=1000 bytes, peak rate=4000 kbits

To demonstrate multiple backup tunnel selection, we will configure a pair of basic NNHOP tunnels on CSR3 [ID 31 and 32]. The tunnel logic is recycled from earlier but modified for this test. Tunnel31 will provide 5 Mbps of bandwidth for the sub-pool, which will offer protection for it. Tunnel32 won't specify a bandwidth type, but will offer 25 Mbps of general bandwidth (any pool). Both tunnels avoid CSR10 completely. We need two separate tunnels since CSR1 and XRv11 headends have different NNHOPs. As mentioned earlier, this is why NHOP protection often scales better at the expense of less resiliency. ! CSR3 interface Tunnel31 description REPAIR PATH NNHOP (PLR) FOR CSR1 ip unnumbered Loopback0 tunnel mode mpls traffic-eng tunnel destination 11.11.11.11


tunnel mpls traffic-eng backup-bw sub-pool 5000 tunnel mpls traffic-eng affinity 0x0 mask 0x0 tunnel mpls traffic-eng path-option 10 explicit name PATH_AVOID_CSR10 interface Tunnel32 description REPAIR PATH NNHOP (PLR) FOR XRV11 ip unnumbered Loopback0 tunnel mode mpls traffic-eng tunnel destination 6.6.6.6 tunnel mpls traffic-eng backup-bw 25000 tunnel mpls traffic-eng affinity 0x0 mask 0x0 tunnel mpls traffic-eng path-option 10 explicit name PATH_AVOID_CSR10

We quickly verify the paths used by the tunnels. Tunnel31 routes via CSR2 and CSR5 to reach XRv11 to protect CSR1's LSP. Tunnel32 routes via CSR1 to reach CSR6 to protect XRv11's LSP. CSR3#show ip rsvp sender detail filter session-type 7 tunnel 31 | section outgoing ERO: (outgoing) 132.2.3.2 (Strict IPv4 Prefix, 8 bytes, /32) 132.2.5.5 (Strict IPv4 Prefix, 8 bytes, /32) 132.5.11.11 (Strict IPv4 Prefix, 8 bytes, /32) 11.11.11.11 (Strict IPv4 Prefix, 8 bytes, /32) CSR3#show ip rsvp sender detail filter session-type 7 tunnel 32 | section outgoing ERO: (outgoing) 132.1.3.1 (Strict IPv4 Prefix, 8 bytes, /32) 132.1.6.6 (Strict IPv4 Prefix, 8 bytes, /32) 6.6.6.6 (Strict IPv4 Prefix, 8 bytes, /32)

When there are multiple competing tunnels, the pool type adds a new dimension to the backup tunnel selection. Below is a chart summarizing the selection process. The chart can be generalized with a few general comments:

1. NNHOP tunnels are preferred over NHOP tunnels
2. A tunnel with limited (specified) bandwidth is preferred over unlimited bandwidth
3. Specific pool tunnels (global, sub, or specific BC) are preferred over "any" (unspecified) pool tunnels

Preference   Tunnel type   Bandwidth type     Bandwidth amount
1 (best)     NNHOP         Specific pool/BC   Limited
2            NNHOP         Any                Limited
3            NNHOP         Specific pool/BC   Unlimited
4            NNHOP         Any                Unlimited
5            NHOP          Specific pool/BC   Limited
6            NHOP          Any                Limited
7            NHOP          Specific pool/BC   Unlimited
8 (worst)    NHOP          Any                Unlimited
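The three rules and the eight-row chart can be expressed as a small ranking function. The Python sketch below is only an illustration of the selection logic, not how IOS implements it. The class and function names are hypothetical, bandwidth amounts are not checked against the backup-bw values, and Tunnel30 is assumed to terminate on CSR10 (the NHOP), which is not shown in this section.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class BackupTunnel:
    name: str
    nnhop: bool           # True = NNHOP tunnel, False = NHOP tunnel
    pool: Optional[str]   # "global", "sub", or None for any pool
    limited: bool         # True when a finite backup-bw is configured
    merge_point: str      # router where the backup tunnel terminates


@dataclass(frozen=True)
class ProtectedLsp:
    name: str
    pool: str             # pool the LSP reserves from: "global" or "sub"
    nhop: str             # next hop beyond the PLR
    nnhop: str            # next-next hop beyond the PLR


def preference(tunnel: BackupTunnel) -> int:
    """Return the chart row for a tunnel: 1 is best, 8 is worst."""
    rank = 1
    if not tunnel.nnhop:
        rank += 4         # NHOP rows start at 5
    if not tunnel.limited:
        rank += 2         # unlimited ranks below limited
    if tunnel.pool is None:
        rank += 1         # "any" ranks below a specific pool
    return rank


def eligible(tunnel: BackupTunnel, lsp: ProtectedLsp) -> bool:
    """A tunnel must rejoin the LSP path and cover the LSP's pool."""
    target = lsp.nnhop if tunnel.nnhop else lsp.nhop
    pool_ok = tunnel.pool is None or tunnel.pool == lsp.pool
    return tunnel.merge_point == target and pool_ok


def select_backup(tunnels: List[BackupTunnel],
                  lsp: ProtectedLsp) -> Optional[BackupTunnel]:
    """Pick the eligible tunnel with the best (lowest) preference."""
    candidates = [t for t in tunnels if eligible(t, lsp)]
    return min(candidates, key=preference, default=None)


# Backup tunnels on CSR3 (Tunnel30's merge point is an assumption).
TUNNELS = [
    BackupTunnel("Tunnel30", False, "global", False, "CSR10"),
    BackupTunnel("Tunnel31", True, "sub", True, "XRv11"),
    BackupTunnel("Tunnel32", True, None, True, "CSR6"),
]

# Protected LSPs transiting CSR3 toward CSR10.
CSR1_210 = ProtectedLsp("CSR1 tunnel210", "global", "CSR10", "XRv11")
CSR1_211 = ProtectedLsp("CSR1 tunnel211", "sub", "CSR10", "XRv11")
XRV11_210 = ProtectedLsp("XRv11 tunnel210", "global", "CSR10", "CSR6")
XRV11_211 = ProtectedLsp("XRv11 tunnel211", "sub", "CSR10", "CSR6")
```

Calling select_backup(TUNNELS, lsp) for each LSP gives one backup choice per head-end tunnel, and dropping Tunnel32 from the list models a shutdown of that backup tunnel.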

The FRR database on CSR3 shows very interesting results. We will analyze each row of the database individually. For clarity, the tunnels and their backup-bandwidths are shown again.

! CSR3
interface Tunnel30
 tunnel mpls traffic-eng backup-bw global-pool unlimited
interface Tunnel31
 tunnel mpls traffic-eng backup-bw sub-pool 5000
interface Tunnel32
 tunnel mpls traffic-eng backup-bw 25000

First, we see tunnel210 from CSR1. It is protected by the original NHOP tunnel30, which offers unlimited global-pool bandwidth. This is option #7 on the chart above, making it relatively undesirable. This LSP cannot choose tunnel31 since that NNHOP backup tunnel protects only sub-pool traffic (limited quantity), which would be option #1. This type of tunnel is the most desirable choice but also the most restrictive/selective. It also cannot choose tunnel32 because this tunnel goes to an incorrect NNHOP for this LSP, but it would have been a better choice since any limited bandwidth is better than specific unlimited bandwidth.

Second, we see tunnel211 also from CSR1. This uses tunnel31 for backup, which offers NNHOP protection for limited sub-pool bandwidth. This is the best possible option. The LSP could not use tunnel30, since it was requesting sub-pool bandwidth while tunnel30 only offered global-pool protection.

Third is tunnel210 from XRv11. It is protected by tunnel32, offering NNHOP protection with a limited bandwidth amount in any pool (option #2). The LSP could have potentially used tunnel30 as well, since it was requesting global-pool bandwidth, but option #2 is better than option #7.

Fourth is tunnel211 from XRv11. It also uses tunnel32 for protection since it is the only feasible option. Tunnel30 is not a candidate since it only backs up global-pool bandwidth but the head-end requests sub-pool bandwidth.

CSR3#show mpls traffic-eng fast-reroute database | begin midpoint
P2P LSP midpoint frr information:
LSP identifier        In-label  Out intf/label   FRR intf/label    Status
--------------------  --------  ---------------  ----------------  ------
1.1.1.1 210 [31]      3020      Gi2.530:10019    Tu30:10019        ready
1.1.1.1 211 [28]      3017      Gi2.530:10017    Tu31:implicit-nu  ready
11.11.11.11 210 [5]   3021      Gi2.530:10021    Tu32:6015         ready
11.11.11.11 211 [5]   3022      Gi2.530:10022    Tu32:6016         ready


If we temporarily shut down tunnel32, we notice 2 changes. CSR1's tunnels are not affected at all, since they selected other available backups in the first place. XRv11 tunnel210, which selected tunnel32, now chooses tunnel30. As mentioned earlier, tunnel30 was a candidate for protection but was not selected since it was less desirable. XRv11 tunnel211 now has no feasible backup paths, and thus is unprotected. When every LSR in a network is also a candidate PLR (which is common), there are likely to be many backup tunnels. This selection process is important to determine the sequence in which LSPs are protected by specific FRR tunnels.

CSR3#show mpls traffic-eng fast-reroute database | begin midpoint
P2P LSP midpoint frr information:
LSP identifier        In-label  Out intf/label   FRR intf/label    Status
--------------------  --------  ---------------  ----------------  ------
1.1.1.1 210 [31]      3020      Gi2.530:10019    Tu30:10019        ready
1.1.1.1 211 [28]      3017      Gi2.530:10017    Tu31:implicit-nu  ready
11.11.11.11 210 [5]   3021      Gi2.530:10021    Tu30:10021        ready

When tunnel32 comes back up, the PLR notifies the head-end that protection is available within the RSVP RESV message without any user intervention. The head-end does not need to resignal the LSP, which is why the LSP IDs remain unchanged, as do the TE labels. The LSP promotion for backup tunnels can be controlled by the FRR timers discussed earlier.

CSR3#show mpls traffic-eng fast-reroute database | begin midpoint
P2P LSP midpoint frr information:
LSP identifier        In-label  Out intf/label   FRR intf/label    Status
--------------------  --------  ---------------  ----------------  ------
1.1.1.1 210 [31]      3020      Gi2.530:10019    Tu30:10019        ready
1.1.1.1 211 [28]      3017      Gi2.530:10017    Tu31:implicit-nu  ready
11.11.11.11 210 [5]   3021      Gi2.530:10021    Tu32:6015         ready
11.11.11.11 211 [5]   3022      Gi2.530:10022    Tu32:6016         ready

Additional Reading – Reference configurations "ds-te-pre-std"

31.4.2 IETF Russian Dolls Model (RDM)

The lab is a continuation from the DS-TE pre-standard lab. The Russian Dolls Model (RDM) method is the default IETF DS-TE implementation in Cisco routers. It is defined in RFC 4127 and allows different CTs to share bandwidth on a link. This allows for the most efficient bandwidth sharing between different CTs, but still allows for preemption when higher priority classes require bandwidth. The IETF implementation is vendor-interoperable and extends the sub-pool concept to multiple CTs, briefly described earlier. To enable IETF DS-TE in RDM mode, only one command is required.

! All MPLS XE routers
mpls traffic-eng ds-te mode ietf
! All MPLS XR routers
mpls traffic-eng
 ds-te mode ietf

Although we have not defined any bandwidth constraints (BC), entering this global command changes quite a bit. In XE, all of the existing RSVP bandwidth pools configured at the link level are converted to the IETF RDM format. The global pool is identified as BC0 and the sub-pool is identified as BC1, but otherwise the command is very similar. Because the pre-standard DS-TE implementation is based on RDM, the configuration can be easily converted back and forth by IOS. Either syntax is accepted, and both XE and XR don't care either way. XR does not change the commands from the global/sub pool format by default, so we will leave XRv12 with the legacy global/sub pool commands since they still work. I will update XRv11 for completeness to show the alternative (new) syntax.

! CSR1, this happens automatically
interface GigabitEthernet2.516
 ip rsvp bandwidth rdm bc0 100000 bc1 20000
! XRv11, must be configured manually
rsvp
 interface GigabitEthernet0/0/0/0.501
  bandwidth rdm bc0 100000 bc1 20000
 interface GigabitEthernet0/0/0/0.512
  bandwidth rdm bc0 100000 bc1 20000
 interface GigabitEthernet0/0/0/0.551
  bandwidth rdm bc0 100000 bc1 20000

The RSVP commands, like before, are not terribly useful for DS-TE and haven't changed much. RSVP still uses terms like "global" and "sub" pool. The existing tunnel200 is still online and requesting 5 Mbps of sub-pool bandwidth.

CSR1#show ip rsvp interface
interface   rsvp  allocated  i/f max  flow max  sub max  VRF
Gi2         ena   0          750M     750M      0
Gi2.513     ena   0          100M     100M      20M
Gi2.514     ena   0          100M     100M      20M
Gi2.516     ena   5M         100M     100M      20M
Gi2.518     ena   0          100M     100M      20M
Gi2.519     ena   0          100M     100M      20M

Looking at the MPLS TE show commands for TE-classes, we can see there is new information. A small table is shown which defines 4 default TE-classes out of a possible 64. Only class-types 0 (global) and 1 (sub) are currently defined, but the 8 priority values still exist. XE routers appear capable of carrying up to 8 TE classes at any time, as seen in the TED later. The chart below seems to indicate that higher number TE-classes are more important. For example, TE-class 0 uses global-pool bandwidth (CT 0) and has a priority of 7, which is the worst. TE-class 5 uses sub-pool bandwidth (CT 1) and has a priority of 0, which is the best.

CSR1#show mpls traffic-eng ds-te te-class
DS-TE Mode: IETF
   TE-Class  Class-Type  Priority
*  0         0           7
*  1         1           7
*  4         0           0
*  5         1           0
* - default setting
Class-Type: 0 = Global-pool, 1 = Sub-pool

The TE tunnel configured earlier is still operational in the new RDM-based DS-TE network. Looking at the TED on CSR1, and specifically looking at the link to CSR6, we can see the output is considerably different. It shows us the RDM model, along with BC0 and BC1 reservable bandwidth quantities. A BC defines how much bandwidth a CT can use. Thus, BC0 is 100 Mbps and is the maximum reservable bandwidth between all TE-classes within CT0. BC1 is 20 Mbps and is the maximum reservable bandwidth for TE-classes using CT1.

The bandwidth reservations shown below are confusing at first glance. TE-class 1 represents sub-pool flows that are priority 7. The existing TE tunnel is priority 7, the default, because we never set it. It is also using sub-pool bandwidth (CT1), so we know it is part of TE-class 1. The 5 Mbps reservation is counted against TE-class 1 and all lower classes as well; this behavior is consistent with non-DS-TE, where only bandwidth prioritization for global-pool resources was considered. TE-class 0 is also priority 7 but represents the global-pool bandwidth, so reducing it by 5 Mbps when a sub-pool flow of the same priority is created is consistent with pre-standard DS-TE as well. TE-classes 4 and 5, despite being CT0 and CT1 respectively, are not affected by this reservation.

CSR1#show mpls traffic-eng topology 1.1.1.1
[snip]
link[2]: Point-to-Point, Nbr IGP Id: 0000.0000.0006.00, nbr_node_id:21, gen:2772
  frag_id: 0, Intf Address: 132.1.6.1, Nbr Intf Address: 132.1.6.6
  TE metric: 5, IGP metric: 10, attribute flags: 0x4
  SRLGs: None
  physical_bw: 1000000 (kbps), BC Model Id: RDM
  BC0 (max_reservable): 100000 (kbps)
  BC0 (max_reservable_bw_global): 100000 (kbps)
  BC1 (max_reservable_bw_sub): 20000 (kbps)
                Total Allocated   Reservable
                BW (kbps)         BW (kbps)
                ---------------   -----------
  TE-Class[0]:  5000              95000
  TE-Class[1]:  5000              15000
  TE-Class[2]:  0                 0
  TE-Class[3]:  0                 0
  TE-Class[4]:  0                 100000
  TE-Class[5]:  0                 20000
  TE-Class[6]:  0                 0
  TE-Class[7]:  0                 0
[snip]
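The reservable values in the TED follow a simple pattern that can be checked by hand. The Python sketch below is one way to model it, based on my reading of the RDM constraint in RFC 4127: a reservation in CT c counts against BC0 through BCc, and only reservations at an equal or better (numerically lower or equal) priority reduce what a TE-class can still reserve. The function name and data layout are hypothetical, and only the "Reservable BW" column is modeled.

```python
def rdm_reservable(bc, te_classes, reservations):
    """Compute reservable bandwidth per TE-class under RDM.

    bc           -- list of bandwidth constraints in kbps, bc[0] is BC0
    te_classes   -- dict: TE-class number -> (class_type, priority)
    reservations -- list of (class_type, priority, kbps) tuples
    """
    result = {}
    for te, (ct, prio) in te_classes.items():
        headroom = []
        # A CTc flow must fit inside every nested "doll": BC0 .. BCc.
        for b in range(ct + 1):
            used = sum(kbps for (c, p, kbps) in reservations
                       if c >= b and p <= prio)
            headroom.append(bc[b] - used)
        result[te] = min(headroom)
    return result


BC = [100000, 20000]                      # BC0 and BC1 from the lab
DEFAULT_TE = {0: (0, 7), 1: (1, 7), 4: (0, 0), 5: (1, 0)}

# The existing tunnel: 5 Mbps of CT1 (sub-pool) at priority 7.
link_to_csr6 = rdm_reservable(BC, DEFAULT_TE, [(1, 7, 5000)])
```

With the single 5 Mbps CT1 reservation, link_to_csr6 works out to 95000 for TE-class 0, 15000 for TE-class 1, and the full 100000 and 20000 for TE-classes 4 and 5, matching the link to CSR6.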

We also check the ISIS LSP. This differs significantly from the pre-standard version since there isn't a separate accounting of global and sub pools. Instead, the BC-TLV signals that the IETF RDM model is being used. The "global pool" bandwidth represents the bandwidth of BC0, which is the all-encompassing bandwidth value. 100 Mbps are available to BC0 with 20 Mbps of it being available to BC1, as shown in the LSP. For brevity, we examine the link to CSR8. The TE-classes are carried instead of the bandwidth pool types.

CSR1#show isis database verbose CSR1-00.00 | section CSR8
  Metric: 10 IS (MT-IPv6) CSR8.00
  Metric: 10 IS-Extended CSR8.00
    Affinity: 0x00000004
    Interface IP Address: 132.1.8.1
    Neighbor IP Address: 132.1.8.8
    Physical BW: 1000000 kbits/sec
    Reservable Global Pool BW: 100000 kbits/sec
    TE-Class BW Unreserved:
      [0]: 100000 kbits/sec, [1]: 20000 kbits/sec
      [2]: 0 kbits/sec, [3]: 0 kbits/sec
      [4]: 100000 kbits/sec, [5]: 20000 kbits/sec
      [6]: 0 kbits/sec, [7]: 0 kbits/sec
    BC-TLV Header
      BC Model-Id:RDM
      Bandwidth constraints[0]: 100000 kbits/sec
      Bandwidth constraints[1]: 20000 kbits/sec

The CT of a TE tunnel is carried inside of the RSVP PATH message as a class type object (CTO). The Cisco pre-standard version could not use this CTO since it required changes to RSVP; it is specific to the IETF BC models. When CSR1 signals this tunnel, the CT is carried along the path so routers know how to allocate bandwidth for it when multiple BC pools exist. The TE-class itself is not explicitly signaled in RSVP, which is why all routers must agree on what the TE-class mappings are. The combination of the CT and the LSP priority (carried in the SAO) makes up the TE-class.

R1#debug ip rsvp dump-message
Outgoing Path:
 version:1 flags:0000 cksum:A908 ttl:255 reserved:0 length:260
 SESSION type 7 length 16:
  Tun Dest: 11.11.11.11 Tun ID: 200 Ext Tun ID: 1.1.1.1
 HOP type 1 length 12:
  Hop Addr: 132.1.6.1 LIH: 0x0000000D
 TIME_VALUES type 1 length 8 :
  Refresh Period (msec): 30000
 EXPLICIT_ROUTE type 1 length 36:
  132.1.6.6 (Strict IPv4 Prefix, 8 bytes, /32)
  132.6.10.10 (Strict IPv4 Prefix, 8 bytes, /32)
  132.10.11.11 (Strict IPv4 Prefix, 8 bytes, /32)
  11.11.11.11 (Strict IPv4 Prefix, 8 bytes, /32)
 LABEL_REQUEST type 1 length 8 :
  Layer 3 protocol ID: 2048
 CLASS_TYPE type 1 length 8 :
  Class-Type value: 1
 SESSION_ATTRIBUTE type 7 length 36:
  Setup Prio: 7, Holding Prio: 7
  Flags: (0x4) SE Style
  Session Name: PRE-STD DS-TE GLOBAL POOL
[snip]

To test out how bandwidth is divided when there are multiple priorities, next we configure a new tunnel [ID 201] that uses global-pool bandwidth but is the highest priority. The tunnel will take any green path to XRv11 and reserve 7 Mbps of global-pool bandwidth. This maps to TE-class 4, which is a combination of CT0 and priority 0. Auto-route is not configured since this tunnel is for demonstration only.

! CSR1
interface Tunnel201
 description RDM DS-TE GLOBAL POOL (HI PRI)
 ip unnumbered Loopback0
 tunnel mode mpls traffic-eng
 tunnel destination 11.11.11.11
 tunnel mpls traffic-eng priority 0 0
 tunnel mpls traffic-eng bandwidth 7000 class-type 0
 tunnel mpls traffic-eng affinity 0x2 mask 0x2
 tunnel mpls traffic-eng path-option 10 dynamic

It is interesting to note that a lower-priority CT cannot reduce bandwidth from a higher-priority CT. Even though TE-class 4 is "better" than TE-class 1, it uses bandwidth from BC0 which can never take bandwidth from BC1. TE-class 0 is affected by this 7 Mbps reservation since it is within CT0, just like TE-class 4. TE-class 1 is within CT1 and cannot be affected. In short, when a flow has a lower CT but a higher priority (TE-class 4), higher CTs with lower priorities (TE-class 1) are not affected.

CSR1#show mpls traffic-eng topology 1.1.1.1
[snip]
link[2]: Point-to-Point, Nbr IGP Id: 0000.0000.0003.00, nbr_node_id:29, gen:2799
  frag_id: 0, Intf Address: 132.1.3.1, Nbr Intf Address: 132.1.3.3
  TE metric: 10, IGP metric: 10, attribute flags: 0x2
  SRLGs: None
  physical_bw: 1000000 (kbps), BC Model Id: RDM
  BC0 (max_reservable): 100000 (kbps)
  BC0 (max_reservable_bw_global): 100000 (kbps)
  BC1 (max_reservable_bw_sub): 20000 (kbps)
                Total Allocated   Reservable
                BW (kbps)         BW (kbps)
                ---------------   -----------
  TE-Class[0]:  7000              93000
  TE-Class[1]:  0                 20000
  TE-Class[2]:  0                 0
  TE-Class[3]:  0                 0
  TE-Class[4]:  7000              93000
  TE-Class[5]:  0                 20000
  TE-Class[6]:  0                 0
  TE-Class[7]:  0                 0
[snip]

Building another tunnel to XRv11 [ID 202], we will configure a 1 Mbps CT1 reservation at priority 3. Priority 3 was not specified as a valid TE-class by default. Unfortunately, there are no debug or syslog messages to indicate this is a problem, and the tunnel never comes up. I assume that the TE process has no idea how to signal this bandwidth since it does not fit into a predefined TE-class.

! CSR1
interface Tunnel202
 description RDM DS-TE SUB POOL (MED PRI)
 ip unnumbered Loopback0
 tunnel mode mpls traffic-eng
 tunnel destination 11.11.11.11
 tunnel mpls traffic-eng priority 3 3
 tunnel mpls traffic-eng bandwidth 1000 class-type 1
 tunnel mpls traffic-eng affinity 0x8 mask 0x8
 tunnel mpls traffic-eng path-option 10 dynamic

CSR1#show mpls traffic-eng tunnels tunnel 202 | section Status|Config
  Status:
    Admin: up Oper: down Path: not valid Signalling: Down
    path option 10, type dynamic
  Config Parameters:
    Bandwidth: 1000 kbps (CT1) Priority: 3 3 Affinity: 0x8/0x8
    Metric Type: TE (default)
    AutoRoute: disabled LockDown: disabled Loadshare: 1000 [0] bw-based
    auto-bw: disabled
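The failure is easier to see when the TE-class table is treated as a lookup from (class-type, priority) to a TE-class number. The Python sketch below is a hypothetical model of that lookup, not IOS code; the function name is made up, and the dictionary simply mirrors the default table shown earlier.

```python
# Default TE-class table: TE-class number -> (class_type, priority).
DEFAULT_TE_CLASSES = {0: (0, 7), 1: (1, 7), 4: (0, 0), 5: (1, 0)}


def te_class_for(te_classes, class_type, priority):
    """Return the TE-class matching (class_type, priority), or None."""
    for number, pair in te_classes.items():
        if pair == (class_type, priority):
            return number
    return None


# Tunnel202 asks for CT1 at priority 3, which has no default TE-class.
tunnel202 = te_class_for(DEFAULT_TE_CLASSES, 1, 3)

# Defining a TE-class 3 for (CT1, priority 3) gives the lookup a match.
custom = dict(DEFAULT_TE_CLASSES)
custom[3] = (1, 3)
tunnel202_fixed = te_class_for(custom, 1, 3)
```

In this model, tunnel202 is None under the default table, and the same lookup succeeds once an entry for CT1 at priority 3 exists on every router in the DS-TE domain.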


We can customize the TE-classes as well; the TE-classes seen earlier were just the defaults. We create a new TE-class 3 that uses CT1 and priority 3. This must be consistent on all MPLS routers within the DS-TE domain. After configuring this new class, we verify it on an XE and an XR router. XE shows non-default entries by omitting the asterisk while XR explicitly states whether a TE-class is a default or configured entry.

! All MPLS XE routers
mpls traffic-eng ds-te te-classes
 te-class 3 class-type 1 priority 3
! All MPLS XR routers
mpls traffic-eng
 ds-te te-classes
  te-class 3 class-type 1 priority 3

CSR1#show mpls traffic-eng ds-te te-class
DS-TE Mode: IETF
   TE-Class  Class-Type  Priority
*  0         0           7
*  1         1           7
   3         1           3
*  4         0           0
*  5         1           0
* - default setting
Class-Type: 0 = Global-pool, 1 = Sub-pool

RP/0/0/CPU0:XRv11#show mpls traffic-eng ds-te te-class
te-class 0: class-type 0 priority 7 status default
te-class 1: class-type 1 priority 7 status default
te-class 2: unused
te-class 3: class-type 1 priority 3 status configured
te-class 4: class-type 0 priority 0 status default
te-class 5: class-type 1 priority 0 status default
te-class 6: unused
te-class 7: unused

With the new TE-class 3 configured, tunnel202 comes up and we can check its bandwidth impacts on the TE topology. The tunnel routes through CSR4 which is the only orange link towards XRv11. Because the reservation was made at TE-class 3, all lower TE-classes that have lesser or equal CTs will be affected (they are within BC1 or less). In this case, the reservation affects all lower priority TE classes for BC0 and BC1. TE-classes 4 and 5 are not affected since they are higher priority; the "cascade" effect of bandwidth reduction only applies when both the CT and priority are less or equal to the level at which the reservation was made. TE-classes 4 and 5 are unaffected, despite TE-class 4 being BC0.

CSR1#show mpls traffic-eng topology 1.1.1.1
[snip]
link[1]: Point-to-Point, Nbr IGP Id: 0000.0000.0004.00, nbr_node_id:18, gen:2822
  frag_id: 0, Intf Address: 132.1.4.1, Nbr Intf Address: 132.1.4.4
  TE metric: 10, IGP metric: 10, attribute flags: 0xA
  SRLGs: None
  physical_bw: 1000000 (kbps), BC Model Id: RDM
  BC0 (max_reservable): 100000 (kbps)
  BC0 (max_reservable_bw_global): 100000 (kbps)
  BC1 (max_reservable_bw_sub): 20000 (kbps)
                Total Allocated   Reservable
                BW (kbps)         BW (kbps)
                ---------------   -----------
  TE-Class[0]:  1000              99000
  TE-Class[1]:  1000              19000
  TE-Class[2]:  0                 0
  TE-Class[3]:  1000              19000
  TE-Class[4]:  0                 100000
  TE-Class[5]:  0                 20000
  TE-Class[6]:  0                 0
  TE-Class[7]:  0                 0
[snip]

RDM tunnels on XR are very similar to XE. Recycling tunnel200, we just change the syntax to the RDM version (which isn't required) for clarity. We can verify the reservation by checking the RSVP details, but again, this doesn't show us BC-level granularity.

! XRv11
interface tunnel-te200
 signalled-bandwidth 11000 class-type 0

RP/0/0/CPU0:XRv11#show rsvp interface gig0/0/0/0.551
*: RDM: Default I/F B/W % : 75% [default] (max resv/bc0), 0% [default] (bc1)
INTERFACE: GigabitEthernet0/0/0/0.551 (ifh=0xB00).
BW (bits/sec): Max=100M. MaxFlow=100M. Allocated=11M (11%). BC0=100M. BC1=20M.

The TE topology will show this reservation using almost identical output to XE. The output is shown below for reference; we clearly see 11 Mbps reserved within TE-class 0. This represents BC0 at the lowest priority, which is correct.

RP/0/0/CPU0:XRv11#show mpls traffic-eng ds-te te-class
te-class 0: class-type 0 priority 7 status default
te-class 1: class-type 1 priority 7 status default
te-class 2: unused
te-class 3: class-type 1 priority 3 status configured
te-class 4: class-type 0 priority 0 status default
te-class 5: class-type 1 priority 0 status default
te-class 6: unused
te-class 7: unused

RP/0/0/CPU0:XRv11#show mpls traffic-eng topology 11.11.11.11 | begin 0005
Link[0]:Point-to-Point, Nbr IGP Id:0000.0000.0005.00, Nbr Node Id:6, gen:7234
  Frag Id:0, Intf Address:132.5.11.11, Intf Id:0
  Nbr Intf Address:132.5.11.5, Nbr Intf Id:0
  TE Metric:10, IGP Metric:10
  Attribute Flags: 0x5
  Ext Admin Group:
    Length: 256 bits
    Value : 0x::5
  Attribute Names: Unnamed bits : 0 2
  Switching Capability:None, Encoding:unassigned
  BC Model ID:RDM
  Physical BW:1000000 (kbps), Max Reservable BW:100000 (kbps)
  BC0:100000 (kbps) BC1:20000 (kbps)
                Total Allocated   Reservable
                BW (kbps)         BW (kbps)
                ---------------   -----------
  TE-class[0]:  11000             89000
  TE-class[1]:  0                 20000
  TE-class[2]:  0                 0
  TE-class[3]:  0                 20000
  TE-class[4]:  0                 100000
  TE-class[5]:  0                 20000
  TE-class[6]:  0                 0
  TE-class[7]:  0                 0

To test FRR with RDM, we will use tunnels 210 and 211 on CSR1 and XRv11 again. We will update them to the IETF RDM syntax for clarity. All previous tunnels are shutdown for this test.

! CSR1
interface Tunnel210
 description DS-TE RDM BC0 BW FRR
 tunnel mpls traffic-eng bandwidth 9000 class-type 0
interface Tunnel211
 description DS-TE RDM BC1 BW FRR
 tunnel mpls traffic-eng bandwidth 3000 class-type 1
! XRv11
interface tunnel-te210
 description DS-TE RDM BC0 BW FRR
 signalled-bandwidth 11000 class-type 0
interface tunnel-te211
 description DS-TE RDM BC1 BW FRR
 signalled-bandwidth 4000 class-type 1

Since pre-standard DS-TE operates just like RDM with only two pools, the FRR behavior is identical. We update the existing backup tunnels on CSR3 to the IETF RDM syntax for clarity.

! CSR3
interface Tunnel30
 tunnel mpls traffic-eng backup-bw class-type 0 unlimited
interface Tunnel31
 tunnel mpls traffic-eng backup-bw class-type 1 5000
interface Tunnel32
 tunnel mpls traffic-eng backup-bw 25000

The only changes we have made thus far have been cosmetic. There isn't any behavior change with DS-TE and FRR when switching between different modes or BC-models. Verifying each individual tunnel again would be highly redundant, so we will check the FRR database on CSR3 as a shortcut. Again, we see all 4 tunnels receiving some level of protection. The behavior is identical to pre-standard DS-TE with respect to FRR protection, except the configuration syntax changes. The backup tunnels that each LSP selects are based on the same 3-point selection process we evaluated earlier. The chart shown in the previous chapter holds true for IETF DS-TE using RDM.

CSR3#show mpls traffic-eng fast-reroute database | begin midpoint
P2P LSP midpoint frr information:
LSP identifier        In-label  Out intf/label   FRR intf/label    Status
--------------------  --------  ---------------  ----------------  ------
1.1.1.1 210 [18]      3005      Gi2.530:10008    Tu30:10008        ready
1.1.1.1 211 [16]      3007      Gi2.530:10024    Tu31:implicit-nu  ready
11.11.11.11 210 [4]   3023      Gi2.530:10015    Tu32:6018         ready
11.11.11.11 211 [4]   3001      Gi2.530:10014    Tu32:6019         ready

As a quick test, we shut down tunnel32 again to ensure we get the same result, which also supports the claim that the selection process is consistent between pre-standard and IETF DS-TE implementation.

CSR3#show mpls traffic-eng fast-reroute database | begin midpoint
P2P LSP midpoint frr information:
LSP identifier        In-label  Out intf/label   FRR intf/label    Status
--------------------  --------  ---------------  ----------------  ------
1.1.1.1 210 [18]      3005      Gi2.530:10008    Tu30:10008        ready
1.1.1.1 211 [16]      3007      Gi2.530:10024    Tu31:implicit-nu  ready
11.11.11.11 210 [4]   3023      Gi2.530:10015    Tu30:10015        ready

We quickly verify the bandwidth reserved on the link to CSR10 for additional verification. We can see the RDM BC model in action; the 7 Mbps reduced from TE-class 1 is also reduced from TE-class 0 based on the bandwidth sharing mechanism of RDM. With MAM, we will see that TE-class 0 would not have been affected by TE-class 1 reservations since CTs are entirely isolated.

CSR3#show mpls traffic-eng topology 3.3.3.3 | begin 0010
link[2]: Point-to-Point, Nbr IGP Id: 0000.0000.0010.00, nbr_node_id:18, gen:563
  frag_id: 0, Intf Address: 132.3.10.3, Nbr Intf Address: 132.3.10.10
  TE metric: 10, IGP metric: 10, attribute flags: 0x0
  SRLGs: None
  physical_bw: 1000000 (kbps), BC Model Id: RDM
  BC0 (max_reservable): 100000 (kbps)
  BC0 (max_reservable_bw_global): 100000 (kbps)
  BC1 (max_reservable_bw_sub): 20000 (kbps)
                Total Allocated   Reservable
                BW (kbps)         BW (kbps)
                ---------------   -----------
  TE-Class[0]:  27000             73000
  TE-Class[1]:  7000              13000
  TE-Class[2]:  0                 0
  TE-Class[3]:  0                 20000
  TE-Class[4]:  0                 100000
  TE-Class[5]:  0                 20000
  TE-Class[6]:  0                 0
  TE-Class[7]:  0                 0

Additional Reading – Reference configurations "ds-te-rdm"

31.4.3 IETF Maximum Allocation Model (MAM)

This lab continues from the RDM lab. Maximum Allocation Model (MAM) is defined in RFC 4125 and creates total isolation between the CTs. There is no bandwidth sharing between them which means bandwidth can be wasted when higher priority flows are not using all available bandwidth within a BC. Different priorities across different CTs have no effect on one another. This is essentially a way to map BCn to CTn, where 0

48:100:VEID-1:Blk-1/136      0.0.0.0                      32768 ?
*>i 48:100:VEID-6:Blk-1/136  48.0.0.6     0      100      0 ?
*>i 48:100:VEID-9:Blk-1/136  48.0.0.9     0      100      0 ?

Next, we can see which PWs were automatically created based on these BGP routes. Notice that the concept of split-horizon is moot for this design since there are no NPE-to-UPE PWs. The traffic arrives at the NPE on an EFP, not a PW, so there isn't any bridging of PWs requiring split-horizon to be disabled anymore. The entire access network is layer 2 only which alleviates the complexity of larger IP/MPLS and PW designs. We can also see the remote PW label is 6015, which was computed based on the VE-ID, VBO, and VBS (described in another section). Since we are using auto-route-target, the VPN-ID is used as the RT and is imported/exported by all nodes, creating a full-mesh of PWs fit for providing E-LAN service.

R1#show l2vpn vfi name VPLS
Legend: RT=Route-target, S=Split-horizon, Y=Yes, N=No
VFI name: VPLS, state: up, type: multipoint, signaling: BGP
  VPN ID: 100, VE-ID: 1, VE-SIZE: 10
  RD: 48:100, RT: 48:100
  Bridge-Domain 3512 attachment circuits:
  Pseudo-port interface: pseudowire100005
  Interface         Peer Address  VE-ID  Local Label  Remote Label  S
  pseudowire100007  48.0.0.9      9      1026         9013          Y
  pseudowire100006  48.0.0.6      6      1023         6015          Y

We can check the VPLS instance to ensure both PWs are up. Specifically, we want to look at the PW with VE-ID 6 to see the label stack and verify the FEC.

R1#show l2vpn atom vc service-name VPLS
                                           Service
Interface  Peer ID          VC ID       Type    Name                      Status
---------  ---------------  ----------  ------  ------------------------  --------
pw100006   6                100         vfi     VPLS                      UP
pw100007   9                100         vfi     VPLS                      UP

The route to CSR6 is via CSR7 due to IGP costs. We bind CSR7's LDP label to this prefix as a result, which is 7002. The full label stack is therefore {7002 6015}.

R1#show ip route 48.0.0.6
Routing entry for 48.0.0.6/32
  Known via "isis", distance 115, metric 28, type level-2
  Redistributing via isis 48
  Last update from 48.1.7.7 on GigabitEthernet2.517, 00:01:15 ago
  Routing Descriptor Blocks:
  * 48.1.7.7, from 48.0.0.6, 00:01:15 ago, via GigabitEthernet2.517
      Route metric is 28, traffic share count is 1

R1#show mpls ldp bindings 48.0.0.6 32 neighbor 48.0.0.7
  lib entry: 48.0.0.6/32, rev 24
        remote binding: lsr: 48.0.0.7:0, label: 7002

R1#show l2vpn atom vc ve-id 6 detail | include label_stack
  Output interface: Gi2.517, imposed label stack {7002 6015}

We verify that CSR1 is actually receiving frames and learns the MAC address of CSR3 via the VPLS instance. The pseudo-port represents one of the PWs within that instance, and we assume this to be towards CSR6. This pseudo-port instructs the router to add MPLS encapsulation to the frame using the labels shown above. Frames received on the EFP have their top tag removed on ingress and only the inner VLAN is preserved end-to-end. The arriving tag stack is {3512 X}, but only tag X is carried in the tunneled Ethernet frame header inside the MPLS encapsulation.

R1#show bridge-domain 3512
Bridge-domain 3512 (3 ports in all)
State: UP                    Mac learning: Enabled
Aging-Timer: 300 second(s)
    GigabitEthernet2 service instance 3512
    vfi VPLS neighbor 48.0.0.6 100
    vfi VPLS neighbor 48.0.0.9 100
   AED MAC address    Policy  Tag      Age  Pseudoport
   0   0050.56A9.86C3 forward dynamic  205  VPLS.1004021
   0   0050.56A9.42EF forward dynamic  206  GigabitEthernet2.EFP3512
   1   FFFF.FFFF.FFFF flood   static   0    OLIST_PTR:0xe8057850

As P routers, CSR7 and XRv4 perform swap and pop operations, respectively. This delivers the traffic to CSR6 with the PW label exposed.

R7#show mpls forwarding-table labels 7002
Local   Outgoing  Prefix            Bytes Label  Outgoing   Next Hop
Label   Label     or Tunnel Id      Switched     interface
7002    94008     48.0.0.6/32       990          Gi2.574    48.7.14.14

RP/0/0/CPU0:XRv4#show mpls forwarding labels 94008
Local   Outgoing     Prefix              Outgoing       Next Hop   Bytes
Label   Label        or ID               Interface                 Switched
------  -----------  ------------------  -------------  ---------  --------
94008   Pop          48.0.0.6/32         Gi0/0/0/0.564  48.6.14.6  15840

CSR6 cannot label switch this packet since the LFIB removes all labels. The packet is piped to the BD CAM so a switching decision can be made based on the destination MAC, which is highlighted. The BD CAM table shows this MAC address is accessible via an EFP that sends traffic into another access network on the other side of the core. A new carrier tag is used simply because of limits in the virtualization environment, but it could have been the same if the access networks didn't share the same vSwitch. Traffic inside the MPLS packet has tag X and the NPE pushes the carrier tag 3546 before sending it into the access network. The tag stack becomes {3546 X}.

R6#show mpls forwarding-table labels 6015
Local   Outgoing  Prefix            Bytes Label  Outgoing   Next Hop
Label   Label     or Tunnel Id      Switched     interface
6015    No Label  lbl-blk-id(2:0)   0            none       point2point

R6#show bridge-domain 3546
Bridge-domain 3546 (3 ports in all)
State: UP                    Mac learning: Enabled
Aging-Timer: 300 second(s)
    GigabitEthernet2 service instance 3546
    vfi VPLS neighbor 48.0.0.1 100
    vfi VPLS neighbor 48.0.0.9 100
   AED MAC address    Policy  Tag      Age  Pseudoport
   1   FFFF.FFFF.FFFF flood   static   0    OLIST_PTR:0xe80d4840
   0   0050.56A9.86C3 forward dynamic  120  GigabitEthernet2.EFP3546
   0   0050.56A9.42EF forward dynamic  120  VPLS.100402a

When the frame reaches CSR4, it is bridged from the ingress EFP to the egress EFP via the BD CAM table. Notice that the pseudo-port is an EFP from Gig3, which is our UPE-CE link. Traffic arrives with tag {3546 X} and is stripped on egress to simply carry tag X.

R4#show bridge-domain 3546
Bridge-domain 3546 (2 ports in all)
State: UP                    Mac learning: Enabled
Aging-Timer: 300 second(s)
    GigabitEthernet2 service instance 3546
    GigabitEthernet3 service instance 3546
   AED MAC address    Policy  Tag      Age  Pseudoport
   1   FFFF.FFFF.FFFF flood   static   0    OLIST_PTR:0xe80d4c20
   0   0050.56A9.86C3 forward dynamic  225  GigabitEthernet3.EFP3546
   0   0050.56A9.42EF forward dynamic  43   GigabitEthernet2.EFP3546

Because we cannot use EPC on the EFPs, we will verify packets being MPLS-encapsulated on CSR1 towards CSR7, and likewise being MPLS-decapsulated on CSR6 from CSR7. We expect to see the tunneled VLAN ID inside the customer traffic, along with the expected label values revealed earlier. The first packet is sent from CSR1 to CSR7, and the second is a packet arriving on CSR6. BGP signaling does not appear to allow us to use FAT-PW or CW, so those components are missing from the capture. This restriction is specified in an extended community that IOS does not let us configure (L2VPN INFO community, which has the CW, sequencing, and FAT-PW bits set to zero unconditionally).

Using the same colors as the previous section (labels are yellow, destination MAC is cyan, destination IP is pink), we can verify the packets are following the proper LSP. Packets are sent from CSR1 with labels {7002 6015} (0x1B5A, 0x177F) towards CSR7. CSR6 receives the packet only with label 6015 as expected. The difference with these captures, aside from missing a CW and FAT-PW label, is the presence of the dot1q C-VLAN tag (grey). In this case, it is VLAN 101 (0x65) but since the core is not matching this number anywhere, any VLANs could be used. This is also possible with MPLS in the access network but was not demonstrated last time.

R1#show monitor capture CAP buffer dump 2
0000: 005056A9 EA770050 56A91AAA 81000DBD  .PV..w.PV.......
0010: 884701B5 A0FF0177 F1FF0050 56A986C3  .G.....w...PV...
0020: 005056A9 42EF8100 00650800 450001F4  .PV.B....e..E...
0030: 007B0000 FF016E2F C0A8650A C0A86503  .{....n/..e...e.
0040: 08004633 001D0000 00000000 092BF620  ..F3.........+.
0050: ABCDABCD ABCDABCD ABCDABCD ABCDABCD  ................

R6#show monitor capture CAP buffer dump 1
0000: 005056A9 DE0D0050 56A9862A 81000DEC  .PV....PV..*....
0010: 88470177 F1FD0050 56A986C3 005056A9  .G.w...PV....PV.
0020: 42EF8100 00650800 450001F4 007C0000  B....e..E....|..
0030: FF016E2E C0A8650A C0A86503 08004625  ..n...e...e...F%
0040: 001D0001 00000000 092BF62D ABCDABCD  .........+.-....

Just to prove the transparency is functional, we quickly test another C-VLAN. The forwarding path, carrier tags, and LSPs do not change at all. The only difference is the dot1q C-VLAN tag, shown again in grey. This time, the customer is sending traffic on VLAN 102 (0x66). The other fields are highlighted as well for completeness; they are identical to the packets above.

R1#show monitor capture CAP buffer dump 1
0000: 005056A9 EA770050 56A91AAA 81000DBD   .PV..w.PV.......
0010: 884701B5 A0FF0177 F1FF0050 56A986C3   .G.....w...PV...
0020: 005056A9 42EF8100 00660800 450001F4   .PV.B....f..E...
0030: 00850000 FF016C25 C0A8660A C0A86603   ......l%..f...f.
0040: 08007766 001F0000 00000000 0932C4E4   ..wf.........2..
0050: ABCDABCD ABCDABCD ABCDABCD ABCDABCD   ................

R6#show monitor capture CAP buffer dump 0
0000: 005056A9 DE0D0050 56A9862A 81000DEC   .PV....PV..*....
0010: 88470177 F1FD0050 56A986C3 005056A9   .G.w...PV....PV.
0020: 42EF8100 00660800 450001F4 00850000   B....f..E.......
0030: FF016C25 C0A8660A C0A86603 08007766   ..l%..f...f...wf
0040: 001F0000 00000000 0932C4E4 ABCDABCD   .........2......
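The label and C-VLAN values called out in the captures can be cross-checked by decoding the hex by hand. The following is a minimal sketch, assuming the first three rows of the R1 capture for VLAN 101 are an Ethernet frame with one carrier dot1q tag, an MPLS label stack, and no control word before the inner Ethernet frame. The function and variable names are illustrative and are not part of any Cisco tooling.

```python
import struct

# First 48 bytes of the R1 capture for C-VLAN 101, copied from the dump.
HEX_ROWS = """
005056A9 EA770050 56A91AAA 81000DBD
884701B5 A0FF0177 F1FF0050 56A986C3
005056A9 42EF8100 00650800 450001F4
"""
FRAME = bytes.fromhex("".join(HEX_ROWS.split()))

ETHERTYPE_DOT1Q = 0x8100
ETHERTYPE_MPLS = 0x8847


def read_ethertype(frame, offset):
    """Skip one optional dot1q tag; return (vlan_or_None, ethertype, next_offset)."""
    (ethertype,) = struct.unpack_from("!H", frame, offset)
    vlan = None
    if ethertype == ETHERTYPE_DOT1Q:
        tci, ethertype = struct.unpack_from("!HH", frame, offset + 2)
        vlan = tci & 0x0FFF          # low 12 bits of the tag are the VLAN ID
        offset += 4
    return vlan, ethertype, offset + 2


def decode(frame):
    """Return (mpls_labels, customer_vlan) for a dot1q + MPLS + Ethernet frame."""
    _, ethertype, offset = read_ethertype(frame, 12)   # skip outer MACs
    if ethertype != ETHERTYPE_MPLS:
        raise ValueError("not an MPLS unicast frame")
    labels = []
    while True:
        (entry,) = struct.unpack_from("!I", frame, offset)
        offset += 4
        labels.append(entry >> 12)                     # top 20 bits are the label
        if (entry >> 8) & 1:                           # bottom-of-stack bit
            break
    # Assumption: no control word, so the inner Ethernet frame starts here.
    customer_vlan, _, _ = read_ethertype(frame, offset + 12)
    return labels, customer_vlan


labels, customer_vlan = decode(FRAME)
print(labels, customer_vlan)
```

Feeding the VLAN 102 capture through the same function should change only the customer VLAN, which is the transparency the test is meant to show.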

Additional Reading – Reference configurations "hvpls-qinq"

32.2 IP encapsulated L2VPN

Layer 2 tunneling protocol (L2TP) is another way to offer E-LINE, E-LAN, and E-TREE services when MPLS transport is not available. Like GRE, it is a form of IP encapsulation, but can carry many layer 2 protocols. L2TP, when used with xconnect, uses IP protocol 115 and is specified in RFC 3931. L2TP can also use UDP encapsulation as seen with VPDN groups, which is beyond the scope of this section. The signaling for L2TP is quite complex and offers many verbose debugs. The network diagram is the same as the MPLS L2VPN tests, except MPLS is disabled everywhere, along with all of its supporting protocols (LDP, BGP, etc). L2TP can only be used in a P2P fashion as all VFI configurations reject all encapsulations other than MPLS.

As an alternative to VPLS, we can use Overlay Transport Virtualization (OTV), a Cisco-proprietary protocol (for now) designed for data center interconnects (DCI) to build transparent LANs. Like DMVPN and LISP, it is more customer-oriented, but it is still technically a mechanism of providing E-LAN service, so I consider it in scope for the CCIE SP. Besides, a good service provider is aware of how customer-focused technologies work.

32.2.1 E-LINE with L2TP

Layer 2 tunneling protocol (L2TP) works like a P2P MPLS L2VPN in that it provides transparent layer 2 service across a service provider core. The difference is that it is IP-encapsulated, not requiring MPLS at all. L2TP is a very signaling-intensive protocol with several new messages described below (RFC 3931). Fortunately, the general logic follows a three-way handshake, much like TCP.

Start-Control-Connection-Request (SCCRQ): Initiates a control connection between two L2TP peers. Either router can initiate the session. It carries the L2TP router-ID, hostname, and other details. This is loosely analogous to a TCP SYN.

Start-Control-Connection-Reply (SCCRP): Sent in response to a SCCRQ. This is an acknowledgement that the SCCRQ was accepted and setup should continue. This is like a TCP SYN ACK.

Start-Control-Connection-Connected (SCCCN): Sent in response to a SCCRP to complete the control session setup. This is like the final TCP ACK. These three messages serve the same function as the TCP three-way handshake.

Stop-Control-Connection-Notification (StopCCN): Sent by either L2TP router to announce that the control connection is being shut down. A result code is carried in the message to explain why. This is like a TCP RST.

Hello packet (HELLO): Control message sent by either L2TP router to serve as a keepalive.

Incoming-Call-Request (ICRQ): Like a SCCRQ, except it happens after the control session is established. It happens per call, not per control session, and is mostly relevant in a dial-up deployment (BBA setup with LAC/LNS). This is also like a TCP SYN.

Incoming-Call-Reply (ICRP): Like a SCCRP, except it is sent in response to an ICRQ for a particular call. This is sent if the ICRQ parameters are accepted. This is also like a TCP SYN ACK.

Incoming-Call-Connected (ICCN): Like a SCCCN, except it is sent in response to an ICRP for a particular call. This finalizes the L2TP session for the call. This is also like a TCP ACK.

Outgoing-Call-Request (OCRQ): Like an ICRQ, except in the opposite direction. This is specific to calls from a LAC, which is not present in this test. This is also like a TCP SYN.

Outgoing-Call-Reply (OCRP): Like an ICRP, except in the opposite direction. This is specific to calls from a LAC, which is not present in this test. This is also like a TCP SYN ACK.

Outgoing-Call-Connected (OCCN): Like an ICCN, except in the opposite direction. This is specific to calls from a LAC, which is not present in this test. This is also like a TCP ACK.

Call-Disconnect-Notify (CDN): Like a StopCCN, except specific to a call within the general L2TP control session. It disconnects the specific session but does not affect L2TP control.

WAN-Error-Notify (WEN): Sent from a LAC to the LNS to indicate an error condition in the WAN. Not present in this test.

Set-Link-Info (SLI): Used to send updated status reports between L2TP speakers.

Explicit-Acknowledgement (EXP-ACK): Just acknowledges a message and is typically used with sequencing.

Zero-Length-Body (ZLB): A message received without any attribute-value pairs (AVPs). Serves the same function as the EXP-ACK. The RFC says it can only be used when control channel authentication is disabled, but I see it used all the time, with EXP-ACK never being used.

The normal P2P PW configuration is almost identical to the MPLS AToM section. The most significant change is using L2TPv3 encapsulation versus MPLS, along with some other parameters. The configurations are shown below.

We will mirror the setup from earlier where CSR8 has two PWs in primary/backup operation to CSR5 and CSR6 respectively. There is no option for control-word, yet sequencing is available. We can also adjust the tunnel TOS, much like a GRE tunnel, either by reflecting the TOS inside the payload or hard-setting a value (CS1 in this case on the backup tunnel). We will explore these features later. We use the new L2VPN configuration on CSR8 and CSR5, and use the legacy xconnect configuration on CSR6.

Pitfall: L2TP is not supported with EFP/EVC in the CSR1000v. This means we must use a main interface or subinterface, but not a service instance. Those interfaces don’t need any configuration on them other than the encapsulation when using the new L2VPN configuration model (CSR5 and CSR8). When using the legacy model (CSR6), the xconnect goes under the subinterface. We will examine these configurations soon. The network is very similar to before, but with some routers removed since L2TP cannot support E-LAN or E-TREE services. There is no need to have a large network when only P2P connections are being tested.
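As a recap of the message list earlier in this section, the two three-way handshakes (SCCRQ/SCCRP/SCCCN for the tunnel, then ICRQ/ICRP/ICCN for a call) can be modeled as an ordered replay of message names. This is a simplified sketch for study purposes only. It merges both peers into a single observer's view, tracks one call, and is not the RFC 3931 state machine; the function name is hypothetical.

```python
CONTROL_HANDSHAKE = ("SCCRQ", "SCCRP", "SCCCN")   # builds the tunnel
SESSION_HANDSHAKE = ("ICRQ", "ICRP", "ICCN")      # builds one call inside it


def l2tp_state(messages):
    """Replay observed message names and return (tunnel_up, session_up)."""
    control_step = 0
    session_step = 0
    for name in messages:
        if name == "StopCCN":
            control_step = session_step = 0       # tunnel teardown takes the call with it
        elif name == "CDN":
            session_step = 0                      # only the call is torn down
        elif control_step < 3:
            if name == CONTROL_HANDSHAKE[control_step]:
                control_step += 1
        elif session_step < 3 and name == SESSION_HANDSHAKE[session_step]:
            session_step += 1
        # HELLO, SLI, ZLB, and out-of-order messages are ignored in this sketch
    return control_step == 3, session_step == 3


full_exchange = ["SCCRQ", "SCCRP", "SCCCN", "ICRQ", "ICRP", "ICCN", "SLI"]
print(l2tp_state(full_exchange))
print(l2tp_state(full_exchange + ["CDN"]))
```

The second call illustrates the CDN description: the call goes away while the control connection stays up.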

! CSR8
template type pseudowire TMP_L2TP_PW
 encapsulation l2tpv3
 signaling protocol l2tpv3
 ip local interface Loopback0
interface pseudowire10141
 description PW TO CSR5
 source template type pseudowire TMP_L2TP_PW
 neighbor 55.0.0.5 10141
interface pseudowire10142
 description PW TO CSR6
 source template type pseudowire TMP_L2TP_PW
 neighbor 55.0.0.6 10142
interface GigabitEthernet2.3504
 encapsulation dot1Q 3580 second-dot1q 3504
l2vpn xconnect context XC_XRV14_R10
 member GigabitEthernet2.3504
 member pseudowire10141 group R10_XRV14 priority 3
 member pseudowire10142 group R10_XRV14 priority 6

! CSR5
interface pseudowire10141
 description PW TO CSR8
 encapsulation l2tpv3
 neighbor 55.0.0.8 10141
 signaling protocol l2tpv3
 ip local interface Loopback0
interface GigabitEthernet2.1014
 encapsulation dot1Q 3554 second-dot1q 1014
l2vpn xconnect context XC_XRV14_R10
 member GigabitEthernet2.1014
 member pseudowire10141

! CSR6
pseudowire-class PW_1014
 encapsulation l2tpv3
 ip local interface Loopback0
interface GigabitEthernet2.3546
 encapsulation dot1Q 3546
 xconnect 55.0.0.8 10142 encapsulation l2tpv3 pw-class PW_1014

We can verify the L2TP sessions are operational using some basic commands. We can see if the tunnel is up, and to which peer. We can also check the transport information, which shows the IP protocol number and local interface as well. The “tunnel” represents the control channel, and all higher level protocols (PPP, xconnect, etc) rely on it. Between a set of L2TP routers, there is a single tunnel, but multiple “calls” or “sessions”. This is why the SCC and IC messages are different.

R8#show l2tp tunnel
L2TP Tunnel Information Total tunnels 2 sessions 2

LocTunID   RemTunID   Remote Name State Remote Address Sessn L2TP Class/
                                                       Count VPDN Group
1501687873 1414773374 R6          est   55.0.0.6       1     l2tp_default_cl
2718846397 3521913226 R5          est   55.0.0.5       1     l2tp_default_cl

R8#show l2tp tunnel transport
L2TP Tunnel Information Total tunnels 2 sessions 2

LocTunID   Type Prot Local Address Port Remote Address Port
1501687873 IP   115  55.0.0.8      0    55.0.0.6       0
2718846397 IP   115  55.0.0.8      0    55.0.0.5       0

We can also check the packets transported by L2TP. The difference between session and tunnel packets is subtle. “Session” packets only count packets that occur within the given session (or “call”), so clearing the L2TP session or having the PW fail for any reason will reset these counters. “Tunnel” packets continue to increment even after PWs fail, provided the tunnel ID is not deallocated and the control session doesn’t completely fail. Therefore, the tunnel packets will always be greater than or equal to the session packets since one is a subset of the other. Both commands reference tunnels by their tunnel-ID, which is a 32-bit unsigned number used to represent a connection. This used to be a 16-bit unsigned number in L2TPv2, but was expanded to 32 bits in L2TPv3. Above, we see that the ID beginning with 150 corresponds to the backup connection to CSR6, while the one beginning with 271 corresponds to the primary connection to CSR5. That is why only one tunnel has packets associated.

R8#show l2tp session packets
L2TP Session Information Total tunnels 2 sessions 2

LocID      RemID      TunID      Pkts-In Pkts-Out Bytes-In Bytes-Out
2927100785 1218283557 1501687873 0       0        0        0
3331724010 4005951606 2718846397 105     144      18990    22416

R8#show l2tp tunnel packets
L2TP Tunnel Information Total tunnels 2 sessions 2

LocTunID   Pkts-In Pkts-Out Bytes-In Bytes-Out
1501687873 0       0        0        0
2718846397 126     186      32068    37198

After clearing the session and sending 100 packets through the tunnel, the session counts those, but the tunnel continues to increment from the value it had before the session cleared. The control channel was reset but remained UP (in software, at least), while the “call” was re-established. This is a minor detail and is not terribly significant, but helps highlight the difference between a “tunnel” and a “session”.

R8#show l2tp session packets
L2TP Session Information Total tunnels 2 sessions 2

LocID      RemID      TunID      Pkts-In Pkts-Out Bytes-In Bytes-Out
490257398  1184553043 1501687873 0       0        0        0
3918464858 50829490   2718846397 100     101      11800    12274

R8#show l2tp tunnel packets
L2TP Tunnel Information Total tunnels 2 sessions 2

LocTunID   Pkts-In Pkts-Out Bytes-In Bytes-Out
1501687873 0       0        0        0
2718846397 226     288      43868    49546
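The relationship between the two counter sets can be expressed as a tiny model. This is a sketch under the assumption that every session packet also increments the tunnel counter and that clearing a session zeroes only the session counter; the class is hypothetical and simply replays the inbound packet counts from the primary tunnel in the outputs above.

```python
class L2tpTunnelCounters:
    """Toy model of inbound packet counters for one tunnel carrying one session."""

    def __init__(self, tunnel_pkts_in=0, session_pkts_in=0):
        self.tunnel_pkts_in = tunnel_pkts_in
        self.session_pkts_in = session_pkts_in

    def receive(self, count):
        # Session traffic is a subset of tunnel traffic, so both counters move.
        self.tunnel_pkts_in += count
        self.session_pkts_in += count

    def clear_session(self):
        # Only the per-call counter resets; the tunnel ID and its counters survive.
        self.session_pkts_in = 0


primary = L2tpTunnelCounters(tunnel_pkts_in=126, session_pkts_in=105)
primary.clear_session()
primary.receive(100)
print(primary.session_pkts_in, primary.tunnel_pkts_in)
```

The model deliberately ignores control messages, which is why it reproduces the inbound packet totals but would not account for the extra outbound packet seen in the real output.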

Next, we verify the signaling packet flow. We will clear the backup tunnel, then watch CSR6 renegotiate with CSR8. We see that CSR6 sends the initial SCCRQ; we can confirm this by checking the tunnel details to ensure the tunnel is “locally” initiated. CSR8 responds with the SCCRP, and CSR6 confirms the session with the SCCCN. After the control channel is established, CSR6 originates the call with an ICRQ message on behalf of the xconnect process. Only the “RQ” packets are highlighted during this exchange for brevity, but the debug clearly shows all of the message transactions. Once the session is established, the two routers then exchange status messages using the SLI message.

R6#debug l2tp packet brief
L2TP tnl 08213:4ED2CB86: Tx
L2TP tnl 08213:4ED2CB86: Rx
L2TP tnl 08213:4ED2CB86: Tx
L2TP _____:08213:29450B58: Tx
L2TP _____:08213:29450B58: Rx
L2TP _____:08213:29450B58: Tx
L2TP _____:08213:29450B58: Tx
L2TP _____:08213:29450B58: Tx
L2TP _____:08213:29450B58: Tx
L2TP _____:08213:29450B58: Rx

->

->

-> -> -> Wt-SCCRP do Tx-SCCRQ ev Rx-SCCRP Wt-SCCRP->Proc-SCCRP do Rx-SCCRP Authentication success

When we verify the details for this new tunnel, we see some interesting output. CSR8 reports that the peer never had an authentication failure, and that authentication is not configured. Both of these claims are false; I assume that this status output only changes when the newer control-message hashing feature is enabled.

R8#show l2tp tunnel state | include R6
2400039996 2929395307 R8 R6 est 00:27:09

R8#show l2tp tunnel all id 2400039996 | include authen
Total peer authentication failures 0
Control message authentication is disabled

Next, we configure the “new” method of authentication on CSR8 and CSR5. The word “digest” houses many features, but when the “secret” option is used, this enables authentication and integrity checking. Again, I configure the password incorrectly on purpose on CSR5.

! CSR8
l2tp-class L2TP_CLASS_PRIMARY
 digest secret 0 L2TP_DIGEST hash SHA1
interface pseudowire10141
 signaling protocol l2tpv3 L2TP_CLASS_PRIMARY

! CSR5
interface pseudowire10141
 signaling protocol l2tpv3 L2TP_ADVANCED
l2tp-class L2TP_ADVANCED
 digest secret 0 L2TP_DIGEST_WRONG hash SHA1

As expected, the tunnel does not form due to the authentication mismatch. Like the control-channel authentication, the debug states this clearly.

! CSR5
L2TP tnl 10282:5CF2880D: Shutting down tunnel
L2TP tnl 10282:5CF2880D: Result Code
L2TP tnl 10282:5CF2880D: General error - refer to error code
L2TP tnl 10282:5CF2880D: Error Code
L2TP tnl 10282:5CF2880D: No error
L2TP tnl 10282:5CF2880D: Vendor Error
L2TP tnl 10282:5CF2880D: None
L2TP tnl 10282:5CF2880D: Optional Message
L2TP tnl 10282:5CF2880D: "SCCRQ authen failed"
L2TP tnl 10282:5CF2880D: ERROR: Validate message digest, received digest mismatch with local

Once we correct the password on CSR5, we see the tunnel form immediately. Each control message is displayed in hex and should be followed by a message stating the digest was correct.

! CSR5
L2TP tnl 08325:116B1827: Control Message
 C8 03 00 55 11 6B 18 27 00 03 00 05 80 08 00 00
 00 00 00 10 80 1B 00 09 00 0C 01 00 00 00 00 00
 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 80
 0A 00 09 00 03 A1 8E 57 BE 80 0A 00 09 00 04 77
 9B 9C 15 00 08 00 09 00 08 00 01 00 0A 00 09 00
 72 00 00 00 00
L2TP tnl 08325:116B1827: Message digest match performed, passed.
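To make the hex in this debug less opaque, the control message can be split into its header and AVPs. The sketch below assumes the bytes read row by row from the dump and follows the L2TPv3 control header and AVP layouts from RFC 3931 (flags/version, length, control connection ID, Ns, Nr, then AVPs with a 10-bit length, vendor ID, and attribute type). The function name is illustrative, and vendor-specific AVPs are left undecoded rather than guessed at.

```python
import struct

# Control message bytes, re-assembled in row order from the debug output.
HEX_ROWS = """
C8 03 00 55 11 6B 18 27 00 03 00 05 80 08 00 00
00 00 00 10 80 1B 00 09 00 0C 01 00 00 00 00 00
00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 80
0A 00 09 00 03 A1 8E 57 BE 80 0A 00 09 00 04 77
9B 9C 15 00 08 00 09 00 08 00 01 00 0A 00 09 00
72 00 00 00 00
"""
RAW = bytes.fromhex("".join(HEX_ROWS.split()))


def parse_control_message(data):
    """Return (header_dict, list_of_avp_dicts) for an L2TPv3 control message."""
    flags_ver, length, ccid, ns, nr = struct.unpack_from("!HHIHH", data, 0)
    header = {
        "version": flags_ver & 0x000F,
        "length": length,
        "control_connection_id": ccid,
        "ns": ns,
        "nr": nr,
    }
    avps = []
    offset = 12
    while offset < length:
        first, vendor, attr = struct.unpack_from("!HHH", data, offset)
        avp_len = first & 0x03FF                 # low 10 bits carry the AVP length
        if avp_len < 6:
            raise ValueError("malformed AVP length")
        avps.append({
            "mandatory": bool(first & 0x8000),
            "length": avp_len,
            "vendor": vendor,
            "type": attr,
            "value": data[offset + 6:offset + avp_len],
        })
        offset += avp_len
    return header, avps


header, avps = parse_control_message(RAW)
message_type = int.from_bytes(avps[0]["value"], "big")
print(header, message_type, len(avps))
```

Under these assumptions the control connection ID in the header matches the hex ID printed in the debug prefix, and the first AVP is the IETF Message Type AVP, whose value 16 corresponds to SLI in RFC 3931.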

Now, we can see CSR8 reporting authentication being enabled. The output here only updates when relating to the newer digest authentication and not the CHAP-style mechanism as we theorized earlier. You can also configure multiple digest secrets to perform password rollover.

R8#show l2tp tunnel state | include R5
4135688314 292231207 R8 R5 est 00:02:31

R8#show l2tp tunnel all id 4135688314 | include authen
Total peer authentication failures 0
Control message authentication is enabled with 1 digest secrets
Last control message authenticated with first digest secret

On CSR5, we will add a wrong password back in, but we will first remove the right password so that the wrong password becomes the “first” one; the right password is then re-added as the second. Now there are two passwords available. The session stays up because one is correct, but CSR5’s show command output differs from CSR8’s. CSR5 says there are two possible digests and the last message was authenticated using the second. Ideally, removing the “first” one, following a password rollover, is a time-sensitive task. Cisco supports a maximum of 2 secrets for both IOS and XR, and the expectation is that L2TP would only have 2 passwords during the manual password rollover period.

! CSR5
l2tp-class L2TP_ADVANCED
 digest secret 0 L2TP_DIGEST_WRONG hash SHA1
 digest secret 0 L2TP_DIGEST hash SHA1

R5#show l2tp tunnel state | include R8
292231207 4135688314 R5 R8 est 00:05:02

R5#show l2tp tunnel all id 292231207 | include authen
Total peer authentication failures 0
Control message authentication is enabled with 2 digest secrets
Last control message authenticated with second digest secret
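The rollover behavior, where a receiver tries each configured secret in order and reports which one matched, can be sketched with a plain HMAC-SHA1. This is a deliberately simplified illustration: the real L2TPv3 digest in RFC 3931 uses a derived key and includes nonces, and IOS internals are not documented here, so the helper names and the message bytes are assumptions made only for the example.

```python
import hashlib
import hmac


def sign(secret, message):
    """Simplified digest: HMAC-SHA1 of the message keyed directly by the secret."""
    return hmac.new(secret, message, hashlib.sha1).digest()


def authenticate(message, received_digest, secrets):
    """Return the 1-based position of the secret that validates, or None."""
    for position, secret in enumerate(secrets, start=1):
        if hmac.compare_digest(sign(secret, message), received_digest):
            return position
    return None


message = b"example control message"           # placeholder payload
csr8_secrets = [b"L2TP_DIGEST"]
csr5_secrets = [b"L2TP_DIGEST_WRONG", b"L2TP_DIGEST"]

digest_from_csr8 = sign(csr8_secrets[0], message)
print(authenticate(message, digest_from_csr8, csr5_secrets))
print(authenticate(message, digest_from_csr8, csr8_secrets))
```

In this toy model CSR5 validates with its second secret while CSR8 validates with its first, mirroring the difference between the two show outputs.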

One last feature of the digest is disabling the digest check. It is enabled by default, and can only be disabled if digest authentication is disabled. Assuming CSR6 is an old router with poor performance, we can disable this feature to save some CPU cycles at the cost of security and message integrity. Other than verifying the L2TP-class, I have not found a way to verify that this is operational.

! CSR6
l2tp-class L2TP_ADVANCED
 no digest check

R6#show l2tp class | section ADVANCED
class [L2TP_ADVANCED]
 is a statically configured class
 configuration:
  l2tp-class L2TP_ADVANCED
   authentication
   no digest check
   hello 60
[snip]

The HELLO mechanism is very simple and is similar to other protocols. The default hello timer is 60 seconds and we can see the packets exchanged using L2TP packet debugging. The difference is that L2TP receives a hello and replies with an acknowledgement (ZLB in our case). Peers must not expect hello messages according to a time interval, so there isn’t a “dead time” concept. I think the parser just prints the ZLB before the hello, but the two happen at the same time. Notice the “loc” and “rem” fields represent the tunnel IDs in hex. We adjust the timer to 20 seconds on CSR6 only; the timers do not have to match. This will flap the session. This debug does not show received ZLBs since no processing is done on them.

! CSR6
l2tp-class L2TP_ADVANCED
 hello 20

! CSR6
L2TP tnl 08217:50829095: Tx -> Hello loc 50829095 rem A50BA3D0
L2TP tnl 08217:50829095: Tx -> Hello loc 50829095 rem A50BA3D0

! CSR8
L2TP tnl 08542:A50BA3D0: Tx
L2TP tnl 08542:A50BA3D0: Rx
L2TP tnl 08542:A50BA3D0: Tx
L2TP tnl 08542:A50BA3D0: Rx

->

2.2.2.2_101 {7}: Area (isis level-1) Path Lookup begin
TE-PCALC-PATH: exclude_path: system_id 0-0-0-0-0-0-0 not known!
TE-PCALC-PATH: exclude_path: system_id 0-0-0-0-0-0-0 not known!
TE-PCALC-PATH: exclude_path: system_id 0-0-0-0-0-0-0 not known!
TE-PCALC-PATH: exclude_path: system_id 0-0-0-0-0-0-0 not known!
TE-PCALC-PATH:Path from 0000.0000.0010.00 -> 0000.0000.0002.00:
 49.2.10.10->49.2.10.2 (admin_weight=10):
 num_hops 2, accumulated_aw 10, min_bw 100000
TE-PCALC-PATH: 10.10.10.10_1->2.2.2.2_101 {7}: Area (isis level-1) Path Lookup end: path found
 2.2.2.2 expands to: 49.2.10.2 2.2.2.2
TE-PCALC-API: 10.10.10.10_1->2.2.2.2_101 {7}: LSP Path Expand result: success
TE-PCALC-PATH: 10.10.10.10_1->2.2.2.2_101 {7}: Freeing rrr_path_setup_t

CSR10#show ip rsvp sender detail filter session-type 7 destination 2.2.2.2 | section [ER]RO
 ERO: (incoming)
  49.8.10.10 (Strict IPv4 Prefix, 8 bytes, /32)
  10.10.10.10 (Strict IPv4 Prefix, 8 bytes, /32)
  2.2.2.2 (Loose IPv4 Prefix, 8 bytes, /32)
 ERO: (outgoing)
  49.2.10.2 (Strict IPv4 Prefix, 8 bytes, /32)
  2.2.2.2 (Strict IPv4 Prefix, 8 bytes, /32)
 RRO:
  49.8.10.8/32, Flags:0x0 (No Local Protection)
  49.1.8.1/32, Flags:0x0 (No Local Protection)
  49.1.14.14/32, Flags:0x0 (No Local Protection)
  49.5.14.5/32, Flags:0x0 (No Local Protection)
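The difference between the incoming and outgoing ERO on CSR10 is the essence of loose-hop expansion. The sketch below models it under simple assumptions: the router strips the leading hops that identify itself, and if the next hop is loose, it replaces that hop with the strict hops returned by a local path computation. The path computation is stubbed with a hypothetical lookup table; real PCALC runs CSPF against the TE database.

```python
def expand_ero(incoming, local_addresses, compute_path):
    """Return the outgoing ERO as a list of (address, 'strict'|'loose') tuples."""
    hops = list(incoming)
    # Remove leading subobjects that refer to this router.
    while hops and hops[0][0] in local_addresses:
        hops.pop(0)
    if hops and hops[0][1] == "loose":
        target = hops[0][0]
        strict_hops = [(addr, "strict") for addr in compute_path(target)]
        hops = strict_hops + hops[1:]
    return hops


# Values taken from the CSR10 output; the lookup table stands in for CSPF.
incoming_ero = [
    ("49.8.10.10", "strict"),
    ("10.10.10.10", "strict"),
    ("2.2.2.2", "loose"),
]
csr10_addresses = {"49.8.10.10", "10.10.10.10"}
stub_paths = {"2.2.2.2": ["49.2.10.2", "2.2.2.2"]}

outgoing_ero = expand_ero(incoming_ero, csr10_addresses, stub_paths.__getitem__)
print(outgoing_ero)
```

With the stubbed path, the result matches the outgoing ERO in the show output: two strict hops and no remaining loose hop.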

For extra debugging, we can look at the RSVP dump messages to see the incoming PATH message on CSR10. This reveals the ERO and RRO, both of which are of interest for loose-hop expansion.

! CSR10
Incoming Path:
 version:1 flags:0000 cksum:89E2 ttl:252 reserved:0 length:248
 SESSION type 7 length 16:
  Tun Dest: 2.2.2.2 Tun ID: 101 Ext Tun ID: 5.5.5.5
 HOP type 1 length 12:
  Hop Addr: 49.8.10.8 LIH: 0x0000000F
 TIME_VALUES type 1 length 8 :
  Refresh Period (msec): 30000
 EXPLICIT_ROUTE type 1 length 28:
  49.8.10.10 (Strict IPv4 Prefix, 8 bytes, /32)
  10.10.10.10 (Strict IPv4 Prefix, 8 bytes, /32)
  2.2.2.2 (Loose IPv4 Prefix, 8 bytes, /32)
 [snip]
 RECORD_ROUTE type 1 length 36:
  49.8.10.8/32, Flags:0x0 (No Local Protection)
  49.1.8.1/32, Flags:0x0 (No Local Protection)
  49.1.14.14/32, Flags:0x0 (No Local Protection)
  49.5.14.5/32, Flags:0x0 (No Local Protection)

Because the RRO is included in the PATH message, the tail-end router (CSR2) is able to see every PHOP along the path, not just the local PHOP.

CSR2#show ip rsvp sender detail filter session-type 7 destination 2.2.2.2 | section RRO
 RRO:
  49.2.10.10/32, Flags:0x0 (No Local Protection)
  49.8.10.8/32, Flags:0x0 (No Local Protection)
  49.1.8.1/32, Flags:0x0 (No Local Protection)
  49.1.14.14/32, Flags:0x0 (No Local Protection)
  49.5.14.5/32, Flags:0x0 (No Local Protection)

Unlike tunnel stitching, we can use OAM to test the entire length of the tunnel, versus each segment. From the head-end, we verify the entire path.

CSR5#traceroute mpls traffic-eng tunnel 101
Tracing MPLS TE Label Switched Path on Tunnel101, timeout is 2 seconds
[snip]
Type escape sequence to abort.
  0 49.5.14.5 MRU 1500 [Labels: 94011 Exp: 0]
L 1 49.5.14.14 MRU 1500 [Labels: 1012 Exp: 0] 1 ms
L 2 49.1.14.1 MRU 1500 [Labels: 8018 Exp: 0] 50 ms
L 3 49.1.8.8 MRU 1500 [Labels: 10015 Exp: 0] 24 ms
L 4 49.8.10.10 MRU 1500 [Labels: implicit-null Exp: 0] 24 ms
! 5 49.2.10.2 24 ms

When we configured this tunnel, we did not configure a mechanism for routing. Auto-route announce and forwarding adjacency cannot work on these tunnels; notice that the destination TE ID cannot be seen by the head-end router, which breaks the auto-route announce logic. Forwarding-adjacency also would not make sense since it builds a link in the graph, which is not suitable between OSPF areas or ISIS levels.

! CSR5
interface Tunnel101
 tunnel mpls traffic-eng autoroute announce

CSR5#show mpls traffic-eng autoroute
MPLS TE autorouting enabled
 destination 0-0-0-0-0-0-0, area isis level-1, has 1 tunnels
  Tunnel101 (load balancing metric 0, nexthop 2.2.2.2)
   (flags: Announce)

Instead, we can use autoroute destination on this tunnel since the destination actually is the real tail end. This strategy does not work on stitched tunnels, though. With autoroute destination in place, CSR5 and CSR2 have a complete, end-to-end connection, and the UMPLS architecture is totally bypassed; we no longer need BGP IPv4 labeled-unicast. The downside is that this solution defeats the purpose of UMPLS, which is to achieve better scalability. Stitched TE tunnels between ABRs in the core, or between ABRs/PEs in the aggregation islands, may scale better than PE to PE tunnels in this design. TE also may have negative effects on MVPN, which has been discussed many times in many other chapters.

! CSR5
interface Tunnel101
 tunnel mpls traffic-eng autoroute destination

Verifying L3VPN connectivity, we can see traffic from CSR5 to CSR2 has only two labels: the bottom BGP VPNv6 label and the top RSVP-TE label.

RP/0/0/CPU0:XRv11#traceroute vrf U2 ::2:6:6:6 source ::2:11:11:11
Type escape sequence to abort.
Tracing the route to ::2:6:6:6
 1 fd00:192:168:205::5 0 msec 0 msec 0 msec
 2 2049:49:5:14::14 [MPLS: Labels 94011/2007 Exp 0] 29 msec 89 msec 49 msec
 3 ::ffff:49.1.14.1 [MPLS: Labels 1012/2007 Exp 0] 69 msec 29 msec 49 msec
 4 ::ffff:49.1.8.8 [MPLS: Labels 8018/2007 Exp 0] 49 msec 49 msec 59 msec
 5 ::ffff:49.8.10.10 [MPLS: Labels 10015/2007 Exp 0] 49 msec 39 msec 39 msec
 6 fd00:192:168:202::2 [MPLS: Label 2007 Exp 0] 79 msec 69 msec 89 msec
 7 fd00:192:168:202::6 89 msec 49 msec 49 msec

Next, we will examine TE-FRR with UMPLS. CSR1, CSR9, CSR8, CSR4, and CSR10 are running BFD on all links that connect them to one another. BFD is disabled on all other routers. The goal will be to provide NHOP and NNHOP protection at various points in the network. Since tunnel stitching is just two ordinary TE tunnels, providing FRR for them is nothing new. Instead, we will configure the inter-area tunnel with the headend at CSR5 to request FRR. We can quickly confirm this is successful by verifying label bindings for each hop in the RSVP RESV RRO. Additionally, we will configure a bandwidth reservation to ensure bandwidth can be backed up between areas/levels.

! CSR5
interface Tunnel101
 tunnel mpls traffic-eng fast-reroute
 tunnel mpls traffic-eng bandwidth 5000

CSR5#show ip rsvp reservation filter session-type 7 destination 2.2.2.2
Destination Tun Sender TunID LSPID Next Hop   I/F     Fi Serv BPS
2.2.2.2     5.5.5.5    101   19    49.5.14.14 Gi2.554 SE LOAD 5M

CSR5#show ip rsvp reservation detail filter session-type 7 destination 2.2.2.2 | begin RRO
 RRO:
  14.14.14.14/32, Flags:0x20 (No Local Protection, Node-id)
   Label subobject: Flags 0x1, C-Type 1, Label 94010
  49.5.14.14/32, Flags:0x0 (No Local Protection)
   Label subobject: Flags 0x1, C-Type 1, Label 94010
  1.1.1.1/32, Flags:0x20 (No Local Protection, Node-id)
   Label subobject: Flags 0x1, C-Type 1, Label 1017
  49.1.8.1/32, Flags:0x0 (No Local Protection)
   Label subobject: Flags 0x1, C-Type 1, Label 1017
  8.8.8.8/32, Flags:0x20 (No Local Protection, Node-id)
   Label subobject: Flags 0x1, C-Type 1, Label 8024
  49.8.10.8/32, Flags:0x0 (No Local Protection)
   Label subobject: Flags 0x1, C-Type 1, Label 8024
  10.10.10.10/32, Flags:0x20 (No Local Protection, Node-id)
   Label subobject: Flags 0x1, C-Type 1, Label 10023
  49.2.10.10/32, Flags:0x0 (No Local Protection)
   Label subobject: Flags 0x1, C-Type 1, Label 10023
  2.2.2.2/32, Flags:0x20 (No Local Protection, Node-id)
   Label subobject: Flags 0x1, C-Type 1, Label 3
  49.2.10.2/32, Flags:0x0 (No Local Protection)
   Label subobject: Flags 0x1, C-Type 1, Label 3
 Status:
 Policy: Accepted. Policy source(s): MPLS/TE

First, we will provide basic NHOP protection in the core to protect the link from CSR1 to CSR8. CSR1 creates this NHOP tunnel and acts as the PLR. It will back up 7 Mbps of bandwidth from any pool. We quickly verify the FRR database on the PLR and the RSVP RESV RRO on the headend to ensure the tunnel is protected. We can use CSR1's local label shown in the RRO to see the specific FRR database entry.

! CSR1
ip explicit-path name EP_AVOID_R1_R8_LINK enable
 exclude-address 49.1.8.1
 exclude-address 49.1.8.8
interface Tunnel110
 description INTER-AREA NHOP PROTECTION
 ip unnumbered Loopback0
 tunnel mode mpls traffic-eng
 tunnel destination 8.8.8.8
 tunnel mpls traffic-eng backup-bw 7000
 tunnel mpls traffic-eng path-option 10 explicit name EP_AVOID_R1_R8_LINK
interface GigabitEthernet2.518
 mpls traffic-eng backup-path Tunnel110

CSR1#show mpls traffic-eng tunnels role head destination 8.8.8.8 | section RSVP Path
 RSVP Path Info:
  My Address: 49.1.9.1
  Explicit Route: 49.1.9.9 49.8.9.8 8.8.8.8
  Record Route: NONE
  Tspec: ave rate=0 kbits, burst=1000 bytes, peak rate=0 kbits

CSR5#show ip rsvp reservation detail filter session-type 7 destination 2.2.2.2 | section RRO
 RRO:
  14.14.14.14/32, Flags:0x20 (No Local Protection, Node-id)
   Label subobject: Flags 0x1, C-Type 1, Label 94010
  49.5.14.14/32, Flags:0x0 (No Local Protection)
   Label subobject: Flags 0x1, C-Type 1, Label 94010
  1.1.1.1/32, Flags:0x25 (Local Prot Avail/Has BW/to NHOP, Node-id)
   Label subobject: Flags 0x1, C-Type 1, Label 1019
  49.1.8.1/32, Flags:0x5 (Local Prot Avail/Has BW/to NHOP)
   Label subobject: Flags 0x1, C-Type 1, Label 1019
  8.8.8.8/32, Flags:0x20 (No Local Protection, Node-id)
   Label subobject: Flags 0x1, C-Type 1, Label 8018
  49.8.10.8/32, Flags:0x0 (No Local Protection)
   Label subobject: Flags 0x1, C-Type 1, Label 8018
  10.10.10.10/32, Flags:0x20 (No Local Protection, Node-id)
   Label subobject: Flags 0x1, C-Type 1, Label 10021
  49.2.10.10/32, Flags:0x0 (No Local Protection)
   Label subobject: Flags 0x1, C-Type 1, Label 10021
  2.2.2.2/32, Flags:0x20 (No Local Protection, Node-id)
   Label subobject: Flags 0x1, C-Type 1, Label 3
  49.2.10.2/32, Flags:0x0 (No Local Protection)
   Label subobject: Flags 0x1, C-Type 1, Label 3

R1#show mpls traffic-eng fast-reroute database labels 1019 | begin P2P LSP
P2P LSP midpoint frr information:
LSP identifier    In-label Out intf/label FRR intf/label Status
----------------  -------- -------------- -------------- ------
5.5.5.5 101 [21]  1019     Gi2.518:8018   Tu110:8018     ready

We also quickly verify that the backup bandwidth is working. The RRO indicates that the NHOP protection "has BW" but we can verify it on the PLR as well. For extra verification, we use OAM to verify the path avoids the link between CSR1 and CSR8. This is also shown in the RSVP PATH ERO shown above.

CSR1#show mpls traffic-eng tunnels tunnel 110 backup
INTER-AREA NHOP PROTECTION
 LSP Head, Admin: up, Oper: up
 Tun ID: 110, LSP ID: 1, Source: 1.1.1.1
 Destination: 8.8.8.8
 Fast Reroute Backup Provided:
  Protected i/fs: Gi2.518
  Protected LSPs/Sub-LSPs: 1, Active: 0
  Backup BW: any pool; limit: 7000 kbps, inuse: 5000 kbps (BWP inuse: 0 kbps)
  Backup flags: 0x0

CSR1#traceroute mpls traffic-eng tunnel 110
Tracing MPLS TE Label Switched Path on Tunnel110, timeout is 2 seconds
[snip]
Type escape sequence to abort.
  0 49.1.9.1 MRU 1500 [Labels: 9014 Exp: 0]
L 1 49.1.9.9 MRU 1500 [Labels: implicit-null Exp: 0] 3 ms
! 2 49.8.9.8 34 ms

We can also configure NNHOP protection across IS-IS level boundaries using loose-hop expansion. We will configure this tunnel on CSR8 going to CSR2, effectively bypassing CSR10. This tunnel will use a combination of strict and loose hops to reach the final destination. We cannot simply exclude CSR10 since CSR8 won't find an intra-area/level path to CSR2. We ensure the tunnel comes up and avoids CSR10 completely using a more explicit method.

! CSR8
ip explicit-path name EP_AVOID_CSR10 enable
 next-address 9.9.9.9
 next-address loose 4.4.4.4
 next-address loose 2.2.2.2
interface Tunnel111
 description INTER_AREA NNHOP PROTECTION
 ip unnumbered Loopback0
 tunnel mode mpls traffic-eng
 tunnel destination 2.2.2.2
 tunnel mpls traffic-eng backup-bw 9000
 tunnel mpls traffic-eng path-option 10 explicit name EP_AVOID_CSR10
interface GigabitEthernet2.580
 mpls traffic-eng backup-path Tunnel111

CSR8#show mpls traffic-eng tunnels tunnel 111 | section RSVP Path
 RSVP Path Info:
  My Address: 49.8.9.8
  Explicit Route: 49.8.9.9 49.4.9.4 4.4.4.4 2.2.2.2*
  Record Route:
  Tspec: ave rate=0 kbits, burst=1000 bytes, peak rate=0 kbits

First, we check the RSVP RESV RRO on the headend to ensure the NNHOP tunnel backs up the TE tunnel, also offering bandwidth. The PLR shows the FRR database entry as well, and we reference it by local label as derived by the RRO. RSVP is smart enough to know that a tunnel offers NHOP protection when its destination is the NHOP. Likewise, when a tunnel’s destination is the NNHOP, regardless of how the tunnel was configured, RSVP considers it an NNHOP protection tunnel.

CSR5#show ip rsvp reservation detail filter session-type 7 destination 2.2.2.2 | section RRO
 RRO:
  14.14.14.14/32, Flags:0x20 (No Local Protection, Node-id)
   Label subobject: Flags 0x1, C-Type 1, Label 94010
  49.5.14.14/32, Flags:0x0 (No Local Protection)
   Label subobject: Flags 0x1, C-Type 1, Label 94010
  1.1.1.1/32, Flags:0x25 (Local Prot Avail/Has BW/to NHOP, Node-id)
   Label subobject: Flags 0x1, C-Type 1, Label 1019
  49.1.8.1/32, Flags:0x5 (Local Prot Avail/Has BW/to NHOP)
   Label subobject: Flags 0x1, C-Type 1, Label 1019
  8.8.8.8/32, Flags:0x2D (Local Prot Avail/Has BW/to NNHOP, Node-id)
   Label subobject: Flags 0x1, C-Type 1, Label 8018
  49.8.10.8/32, Flags:0xD (Local Prot Avail/Has BW/to NNHOP)
   Label subobject: Flags 0x1, C-Type 1, Label 8018
  10.10.10.10/32, Flags:0x20 (No Local Protection, Node-id)
   Label subobject: Flags 0x1, C-Type 1, Label 10021
  49.2.10.10/32, Flags:0x0 (No Local Protection)
   Label subobject: Flags 0x1, C-Type 1, Label 10021
  2.2.2.2/32, Flags:0x20 (No Local Protection, Node-id)
   Label subobject: Flags 0x1, C-Type 1, Label 3
  49.2.10.2/32, Flags:0x0 (No Local Protection)
   Label subobject: Flags 0x1, C-Type 1, Label 3

CSR8#show mpls traffic-eng fast-reroute database labels 8018 | begin P2P LSP
P2P LSP midpoint frr information:
LSP identifier    In-label Out intf/label FRR intf/label   Status
----------------  -------- -------------- ---------------- ------
5.5.5.5 101 [21]  8018     Gi2.580:10021  Tu111:implicit-n ready
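The RRO flag bytes in these outputs (0x5, 0xD, 0x20, 0x25, 0x2D) are bitmasks, and decoding them shows why IOS prints "to NHOP" versus "to NNHOP". The sketch below uses the flag bit values defined in RFC 4090 (local protection available, local protection in use, bandwidth protection, node protection) and the Node-ID flag from RFC 4561. The function names and the text labels are illustrative and are not IOS strings.

```python
RRO_FLAG_BITS = {
    0x01: "local-protection-available",
    0x02: "local-protection-in-use",
    0x04: "bandwidth-protection",
    0x08: "node-protection",
    0x20: "node-id",
}


def decode_rro_flags(value):
    """Return the names of all flag bits set in an RRO subobject flags byte."""
    return [name for bit, name in sorted(RRO_FLAG_BITS.items()) if value & bit]


def protection_type(value):
    """Classify the protection advertised by a hop: none, NHOP (link), or NNHOP (node)."""
    if not value & 0x01:
        return "none"
    return "NNHOP" if value & 0x08 else "NHOP"


for flags in (0x20, 0x25, 0x2D):
    print(hex(flags), protection_type(flags), decode_rro_flags(flags))
```

The only difference between CSR1's 0x25 and CSR8's 0x2D is the node-protection bit, which is set because the backup tunnel on CSR8 terminates on the NNHOP.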

For completeness, we quickly check the backup bandwidth and use OAM to verify the data plane. Of the 9 Mbps available, 5 Mbps are allocated for the tunnel configured on CSR5.
CSR8#show mpls traffic-eng tunnels tunnel 111 backup
INTER_AREA NNHOP PROTECTION
  LSP Head, Admin: up, Oper: up
  Tun ID: 111, LSP ID: 3, Source: 8.8.8.8
  Destination: 2.2.2.2
  Fast Reroute Backup Provided:
    Protected i/fs: Gi2.580
    Protected LSPs/Sub-LSPs: 1, Active: 0
    Backup BW: any pool; limit: 9000 kbps, inuse: 5000 kbps (BWP inuse: 0 kbps)
    Backup flags: 0x0
CSR8#traceroute mpls traffic-eng tunnel 111
Tracing MPLS TE Label Switched Path on Tunnel111, timeout is 2 seconds
[snip]
Type escape sequence to abort.
  0 49.8.9.8 MRU 1500 [Labels: 9017 Exp: 0]
L 1 49.8.9.9 MRU 1500 [Labels: 4001 Exp: 0] 1 ms
L 2 49.4.9.4 MRU 1500 [Labels: 10022 Exp: 0] 3 ms
L 3 49.4.10.10 MRU 1500 [Labels: implicit-null Exp: 0] 21 ms
! 4 49.2.10.2 21 ms

First, we will test the NHOP tunnel. The FRR label that CSR1 (PLR) pushes to tunnel traffic through CSR9 is 9014 as shown below. We use this, along with label 8018, to see traffic in the FRR tunnel using EPC. CSR1#show mpls traffic-eng tunnels tunnel 110 | include Label InLabel : OutLabel : GigabitEthernet2.519, 9014

BFD takes about 2.7 seconds to detect a failure in this setup. We change the encapsulation on CSR8 towards CSR1 after starting a ping within the L3VPN. After roughly 3 seconds, the packets are forwarded in the TE tunnel for a short time (only one packet in this instance, since the tunnel head is very close to the PLR). The single packet that ended up in the TE-FRR tunnel is highlighted green in the ping output. The label stack is {9014 8018 2007} as highlighted in cyan. This shows that NHOP protection works as expected when using inter-area tunnels. We cannot reasonably use traceroute here because convergence happens so fast with dynamic or loose-explicit paths.
RP/0/0/CPU0:XRv11#ping vrf U2 ::2:6:6:6 source ::2:11:11:11 size 500 count 1000
Type escape sequence to abort.
Sending 1000, 500-byte ICMP Echos to ::2:6:6:6, timeout is 2 seconds:
!!!...!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
[snip]
CSR9#show monitor capture CAP buffer detailed | begin 534
13 534 4.955005 00:0C:29:FB:A3:39 -> 00:0C:29:E0:4F:84 MPLS unicast
0000: 000C29E0 4F84000C 29FBA339 81000DBF ..).O...)..9....
0010: 88470233 603901F5 2039007D 713B6000 .G.3`9...9.}q;`.
0020: 000001CC 3A3B0000 00000000 00000002 ....:;..........
0030: 00110011 00110000 00000000 00000002 ................
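The label stack can be read directly out of the hex dump. Below is a minimal Python sketch that walks the first 32 bytes of the captured frame: two MAC addresses, one 802.1Q tag, the MPLS unicast EtherType, and then 4-byte label stack entries until the bottom-of-stack bit is set. The function and constant names are illustrative only.

```python
import struct

# First 32 bytes of the captured frame, copied from the hex dump above.
CAPTURED_HEX = (
    "000C29E04F84000C29FBA33981000DBF"
    "88470233603901F52039007D713B6000"
)

ETHERTYPE_DOT1Q = 0x8100
ETHERTYPE_MPLS_UNICAST = 0x8847


def parse_mpls_stack(frame: bytes):
    """Return a list of (label, exp, bottom_of_stack, ttl) tuples."""
    offset = 12  # skip destination and source MAC addresses
    (ethertype,) = struct.unpack_from("!H", frame, offset)
    offset += 2
    while ethertype == ETHERTYPE_DOT1Q:
        offset += 2  # skip the 802.1Q tag control information
        (ethertype,) = struct.unpack_from("!H", frame, offset)
        offset += 2
    if ethertype != ETHERTYPE_MPLS_UNICAST:
        raise ValueError("frame does not carry MPLS unicast")
    entries = []
    while True:
        (word,) = struct.unpack_from("!I", frame, offset)
        offset += 4
        # 20-bit label, 3-bit EXP, 1-bit bottom-of-stack, 8-bit TTL
        entry = (word >> 12, (word >> 9) & 0x7, (word >> 8) & 0x1, word & 0xFF)
        entries.append(entry)
        if entry[2]:  # the bottom-of-stack bit ends the label stack
            return entries


stack = parse_mpls_stack(bytes.fromhex(CAPTURED_HEX))
print([label for label, _exp, _bos, _ttl in stack])
```

The three decoded labels are 9014, 8018 and 2007, which is the FRR label, the protected LSP label and the VPN label described in the text.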

We will bring the CSR1-CSR8 link back up, reoptimize the tunnel, and test NNHOP protection next. This is a little more interesting since both the protected and protecting tunnels are inter-area with loose-hops, but the RSVP signaling process is still the same. We will break the link between CSR10 and CSR8, then check the RSVP information. We don't need to use EPC because I managed to quickly capture the relevant information via show-commands (with the NHOP test above, I wasn’t fast enough). CSR5 shows the NNHOP tunnel actually "in use". The NHOP tunnel is still available but is no longer “in use”. This shows that the head-end was still notified by the PLR that FRR was activated. We won't do all of the TE-FRR verification details as this is sufficient to show that loose-hop tunnels can be both protected and protecting tunnels. Additional TE verification commands are shown in the dedicated TE section.
CSR5#show ip rsvp reservation detail filter session-type 7 destination 2.2.2.2 | section RRO
  RRO:
    14.14.14.14/32, Flags:0x20 (No Local Protection, Node-id)
      Label subobject: Flags 0x1, C-Type 1, Label 94010
    49.5.14.14/32, Flags:0x0 (No Local Protection)
      Label subobject: Flags 0x1, C-Type 1, Label 94010
    1.1.1.1/32, Flags:0x25 (Local Prot Avail/Has BW/to NHOP, Node-id)
      Label subobject: Flags 0x1, C-Type 1, Label 1021
    49.1.8.1/32, Flags:0x5 (Local Prot Avail/Has BW/to NHOP)
      Label subobject: Flags 0x1, C-Type 1, Label 1021
    8.8.8.8/32, Flags:0x2F (Local Prot Avail/In Use/Has BW/to NNHOP, Node-id)
      Label subobject: Flags 0x1, C-Type 1, Label 8024
    49.8.9.8/32, Flags:0xF (Local Prot Avail/In Use/Has BW/to NNHOP)
      Label subobject: Flags 0x1, C-Type 1, Label 8024
    10.10.10.10/32, Flags:0x20 (No Local Protection, Node-id)
      Label subobject: Flags 0x1, C-Type 1, Label 10015
    49.2.10.10/32, Flags:0x0 (No Local Protection)
      Label subobject: Flags 0x1, C-Type 1, Label 10015
    2.2.2.2/32, Flags:0x20 (No Local Protection, Node-id)
      Label subobject: Flags 0x1, C-Type 1, Label 3
    49.2.10.2/32, Flags:0x0 (No Local Protection)
      Label subobject: Flags 0x1, C-Type 1, Label 3

Additional Reading – Reference configurations "umpls-isis"

33.2.2 OSPF (summarized)

We also analyze UMPLS using OSPF instead of IS-IS. The concepts are all identical, so this section is summarized to highlight the differences. The area boundaries are in the same places, so the routers that were L1/L2 routers for IS-IS are now ABRs for OSPF. XRv12 and CSR8 are also ABRs now; despite CSR7 being in a different IS-IS area earlier, it was still a level-2 router, so there was no additional complexity involved; UMPLS did not come into play. In this case, the topology is a little more complex as a result of added ABRs. There are different area types in use here mostly to demonstrate that UMPLS can work with any area type. Unlike IS-IS, we have to carefully select how to advertise loopback interfaces in areas. I use a combination of aggregation area and core area advertisement, along with LSA3 filtering to allow routes to leak. Area 0.0.43.173 and area 0.0.202.254 follow a classic UMPLS model where the ABRs don't allow routes to leak between areas and set next-hop-self on reflected BGP routes for IPv4 labeled-unicast. Area 0.0.190.239 is a relaxed variant where the OSPF inter-area routes leak into area 0 (as with a normal OSPF design) so the LDP LSP can be used. The BGP verifications are the same and are not repeated, since we will see them when verifying the MPLS services. The area 0.0.43.173 ABR configuration points of interest are shown below; notice that the loopbacks are advertised into the NSSA (not traditional) and are leaked into area 0.

! XRv13 (XRv14 is identical with different RPL)
route-policy RPL_LEAK_LOOPBACK
 if destination in (13.13.13.13/32) then
  pass
 endif
end-policy
router ospf 49
 network point-to-point
 area 0
  route-policy RPL_LEAK_LOOPBACK in
 area 0.0.43.173
  nssa no-summary
  interface Loopback0
   passive enable


The ABRs for area 0.0.202.254 are slightly different. The loopbacks are advertised into area 0 and are leaked outbound into all non-zero areas. This is similar logic to the above except in the reverse direction. Additionally, all LSA3s are filtered from non-zero areas into area 0, which means BGP will be adjusting the next-hop. Setting the next-hop in an RPL doesn't appear to work for iBGP, even with iBGP modifications enforced, so we ignore this on XRv12. On CSR8, the OSPF filtering logic is identical with different syntax.
! XRv12
route-policy RPL_DROP
 drop
end-policy
route-policy RPL_LEAK_LOOPBACK
 if destination in (12.12.12.12/32) then
  pass
 endif
end-policy
router ospf 49
 network point-to-point
 area 0
  route-policy RPL_DROP in
  route-policy RPL_LEAK_LOOPBACK out
  interface Loopback0
   passive enable
 area 0.0.202.254
! CSR8
interface Loopback0
 ip ospf 49 area 0
ip prefix-list PL_DROP seq 5 deny 0.0.0.0/0 le 32
ip prefix-list PL_LEAK_LOOPBACK seq 5 permit 8.8.8.8/32
router ospf 49
 area 0 filter-list prefix PL_DROP in
 area 0 filter-list prefix PL_LEAK_LOOPBACK out
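To make the CSR8 filtering logic concrete, here is a small Python sketch of how an IOS prefix-list entry is matched. It assumes the usual IOS semantics: an entry without ge/le matches only the exact prefix length, "le 32" widens the match to every longer prefix, and the list ends with an implicit deny. The data structures and function names are hypothetical.

```python
from ipaddress import ip_network


def entry_matches(route, prefix, ge=None, le=None):
    """True if 'route' matches one prefix-list entry (IPv4 only)."""
    route_net = ip_network(route)
    entry_net = ip_network(prefix)
    if not route_net.subnet_of(entry_net):
        return False
    if ge is None and le is None:
        return route_net.prefixlen == entry_net.prefixlen
    low = ge if ge is not None else entry_net.prefixlen
    high = le if le is not None else 32
    return low <= route_net.prefixlen <= high


def permitted(prefix_list, route):
    """First matching entry decides; no match means implicit deny."""
    for action, prefix, ge, le in prefix_list:
        if entry_matches(route, prefix, ge, le):
            return action == "permit"
    return False


# The two prefix-lists configured on CSR8 above.
PL_DROP = [("deny", "0.0.0.0/0", None, 32)]
PL_LEAK_LOOPBACK = [("permit", "8.8.8.8/32", None, None)]

print(permitted(PL_DROP, "7.7.7.7/32"))           # nothing enters area 0
print(permitted(PL_LEAK_LOOPBACK, "8.8.8.8/32"))  # CSR8 loopback leaks out
print(permitted(PL_LEAK_LOOPBACK, "1.1.1.1/32"))  # everything else is dropped
```

With both lists applied, only the ABR's own loopback leaves area 0 as an inter-area route, and no inter-area routes enter it, which is why BGP has to carry the remaining loopbacks.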

The ABRs for area 0.0.190.239 have no filtering applied, other than the stub area configuration. This means all routers inside the non-zero areas can flood into area 0, implying that the ABRs do not need to adjust the BGP next-hop when advertising IPv4 labeled-unicast prefixes into the core. Since the ABRs are only advertising default routes, they must adjust the BGP next-hop when advertising routes to the PEs inside area 0.0.190.239 (the only PE is CSR2).
! CSR4 and CSR10
interface Loopback0
 ip ospf 49 area 0.0.190.239
router ospf 49
 area 0.0.190.239 stub no-summary

33.2.2.1 L3VPN

We will trace some L3VPN LSPs to ensure UMPLS is working correctly. First, we will test VRF U2 connectivity between XRv11 and CSR3 for IPv6. CSR5 has a VPNv6 route for ::2:3:3:3/128 with label 7010. This is the VPN label which, in UMPLS, never changes. The BGP next-hop is an IPv4 address of 7.7.7.7.
CSR5#show bgp vpnv6 unicast vrf U2 ::2:3:3:3/128
BGP routing table entry for [49:205]::2:3:3:3/128, version 232
Paths: (1 available, best #1, table U2)
  Advertised to update-groups:
     1
  Refresh Epoch 2
  65000, imported path from [49:207]::2:3:3:3/128 (global)
    ::FFFF:7.7.7.7 (metric 2) (via default) from 7.7.7.7 (7.7.7.7)
      Origin incomplete, metric 0, localpref 100, valid, internal, best
      Extended Community: RT:49:207
      mpls labels in/out nolabel/7010
      rx pathid: 0, tx pathid: 0x0

CSR5's route to 7.7.7.7/32 is via XRv14. I adjusted local-preference on XRv14 to 200 earlier (it was not shown for brevity, though), because I wanted to use XRv14 as well, since XRv13 was in the transit path for all the IS-IS tests. The label for this destination is 94004, which was allocated by XRv14. Its next-hop is 14.14.14.14 for which CSR5 should have an IGP route and LDP label. CSR5#show ip route 7.7.7.7 Routing entry for 7.7.7.7/32 Known via "bgp 49", distance 200, metric 0, type internal Last update from 14.14.14.14 00:20:16 ago Routing Descriptor Blocks: * 14.14.14.14, from 14.14.14.14, 00:20:16 ago Route metric is 0, traffic share count is 1 AS Hops 0 MPLS label: 94004 CSR5#show bgp ipv4 unicast 7.7.7.7/32 BGP routing table entry for 7.7.7.7/32, version 162 Paths: (2 available, best #1, table default) Not advertised to any peer Refresh Epoch 1 Local 14.14.14.14 (metric 2) from 14.14.14.14 (14.14.14.14) Origin incomplete, metric 0, localpref 200, valid, internal, best


Originator: 7.7.7.7, Cluster list: 14.14.14.14, 8.8.8.8 mpls labels in/out nolabel/94004 rx pathid: 0, tx pathid: 0x0 Refresh Epoch 1 Local 13.13.13.13 (metric 2) from 13.13.13.13 (13.13.13.13) Origin incomplete, metric 0, localpref 100, valid, internal Originator: 7.7.7.7, Cluster list: 13.13.13.13, 8.8.8.8 mpls labels in/out nolabel/93003 rx pathid: 0, tx pathid: 0

The route to XRv14's loopback is an OSPF route via an interface that is not a TE tunnel, which means an LDP label will be used as we predicted above. Since XRv14 is directly connected, that label value is implicit-null. The label stack is now {94004 7010}.
CSR5#show ip route 14.14.14.14
Routing entry for 14.14.14.14/32
  Known via "ospf 49", distance 110, metric 2, type intra area
  Last update from 49.5.14.14 on GigabitEthernet2.554, 01:30:45 ago
  Routing Descriptor Blocks:
  * 49.5.14.14, from 14.14.14.14, 01:30:45 ago, via GigabitEthernet2.554
      Route metric is 2, traffic share count is 1
CSR5#show mpls ldp bindings 14.14.14.14 32 neighbor 14.14.14.14
  lib entry: 14.14.14.14/32, rev 58
        remote binding: lsr: 14.14.14.14:0, label: imp-null
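The recursive lookups on the ingress PE can be summarized as a short Python sketch. It is a simplified model, not how CEF is implemented: the dictionaries stand in for the BGP labeled-unicast table and the LDP bindings, and implicit-null (label 3) is never placed on the wire. All names are hypothetical, and the values come from the CSR5 outputs above and the CSR4 outputs later in this section.

```python
IMPLICIT_NULL = 3


def imposed_stack(service_label, next_hop, bgp_lu, ldp):
    """Build the label stack an ingress PE imposes, top label first.

    bgp_lu maps a prefix to (bgp_next_hop, bgp_label).
    ldp maps an IGP-reachable next-hop to the LDP label from the downstream LSR.
    """
    stack = [service_label]
    hop = next_hop
    # Follow BGP labeled-unicast routes until the next-hop is an IGP route.
    while hop in bgp_lu:
        hop, label = bgp_lu[hop]
        if label != IMPLICIT_NULL:
            stack.insert(0, label)
    # The final next-hop is reached with the LDP transport label.
    label = ldp[hop]
    if label != IMPLICIT_NULL:
        stack.insert(0, label)
    return stack


# CSR5 towards ::2:3:3:3/128: VPN label 7010, next-hop 7.7.7.7.
csr5 = imposed_stack(
    7010, "7.7.7.7",
    bgp_lu={"7.7.7.7": ("14.14.14.14", 94004)},
    ldp={"14.14.14.14": IMPLICIT_NULL},
)
print(csr5)

# CSR4 towards 2.11.11.11/32: VPN label 5002, next-hop 5.5.5.5.
csr4 = imposed_stack(
    5002, "5.5.5.5",
    bgp_lu={"5.5.5.5": ("13.13.13.13", 93000)},
    ldp={"13.13.13.13": 9003},
)
print(csr4)
```

The first call yields the two-label stack {94004 7010} because the LDP label is implicit-null, while the second yields three labels because CSR4 is not adjacent to XRv13.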

When XRv14 receives this packet, it swaps the label from 94004 to 8002. The path via XRv12 is invalid because there is no route to 7.7.7.7/32 without causing a recursive routing loop. XR appears incapable of adjusting next-hops in RPLs for iBGP neighbors, which we attempted to configure on XRv12. “next-hop-self” works when applied to neighbors but not inside a custom RPL. As such, CSR8 is the ingress point for area 0.0.202.254. The label stack is now {8002 7010}.
RP/0/0/CPU0:XRv14#show bgp ipv4 labeled-unicast 7.7.7.7/32
[snip]
Paths: (2 available, best #1)
  Advertised to peers (in unique update groups):
    5.5.5.5
  Path #1: Received by speaker 0
  Advertised to peers (in unique update groups):
    5.5.5.5
  Local
    8.8.8.8 (metric 3) from 8.8.8.8 (7.7.7.7)
      Received Label 8002
      Origin incomplete, metric 0, localpref 100, valid, internal, best, group-best
      Received Path ID 0, Local Path ID 1, version 122
      Originator: 7.7.7.7, Cluster list: 8.8.8.8
  Path #2: Received by speaker 0
  Not advertised to any peer
  Local
    7.7.7.7 (metric 3) from 12.12.12.12 (7.7.7.7)
      Received Label 3
      Origin incomplete, metric 0, localpref 100, valid, internal
      Received Path ID 0, Local Path ID 0, version 0
      Originator: 7.7.7.7, Cluster list: 12.12.12.12

Next, XRv14 needs to look up the route for the BGP next-hop, which is 8.8.8.8 via OSPF towards CSR1. The corresponding LDP label from CSR1 is 1016. The label stack is now {1016 8002 7010}.
RP/0/0/CPU0:XRv14#show route ipv4 8.8.8.8
Routing entry for 8.8.8.8/32
  Known via "ospf 49", distance 110, metric 3, type intra area
  Routing Descriptor Blocks
    49.1.14.1, from 8.8.8.8, via GigabitEthernet0/0/0/0.514
      Route metric is 3
  No advertising protos.
RP/0/0/CPU0:XRv14#show mpls ldp bindings 8.8.8.8/32 neighbor 1.1.1.1
8.8.8.8/32, rev 68
        Local binding: label: 94002
        Remote bindings: (2 peers)
            Peer                Label
            -----------------   ---------
            1.1.1.1:0           1016

CSR1 is a pure P router and only looks at the top-most label since there are no ECMP paths. OSPF says the next-hop is towards CSR8 directly, so implicit-null is advertised by CSR8, forcing CSR1 to pop the topmost label. The label stack is now {8002 7010}.
CSR1#show ip route 8.8.8.8
Routing entry for 8.8.8.8/32
  Known via "ospf 49", distance 110, metric 2, type intra area
  Last update from 49.1.8.8 on GigabitEthernet2.518, 01:44:40 ago
  Routing Descriptor Blocks:
  * 49.1.8.8, from 8.8.8.8, 01:44:40 ago, via GigabitEthernet2.518
      Route metric is 2, traffic share count is 1
CSR1#show mpls forwarding-table labels 1016
Local      Outgoing   Prefix           Bytes Label   Outgoing   Next Hop
Label      Label      or Tunnel Id     Switched      interface
1016       Pop Label  8.8.8.8/32       80067         Gi2.518    49.1.8.8


When CSR8 receives the packets, the BGP label is removed. Notice that the BGP route is not in the routing table due to a RIB failure; this is because CSR8 has an OSPF route to 7.7.7.7 as they share an area. CSR7 allocates implicit-null via LDP for prefix 7.7.7.7/32, causing CSR8 to pop the BGP label. The label stack is now {7010}, which reveals the VPN label of 7010 to CSR7. Although BGP and LDP both dictate that the out-label operation is “pop”, the LDP implicit-null is the one that is actually processed.
CSR8#show bgp ipv4 unicast 7.7.7.7/32
BGP routing table entry for 7.7.7.7/32, version 75
Paths: (1 available, best #1, table default, RIB-failure(17))
  Advertised to update-groups:
     6
  Refresh Epoch 7
  Local, (Received from a RR-client)
    7.7.7.7 (metric 2) from 7.7.7.7 (7.7.7.7)
      Origin incomplete, metric 0, localpref 100, valid, internal, best
      mpls labels in/out 8002/imp-null
      rx pathid: 0, tx pathid: 0x0
CSR8#show ip route 7.7.7.7
Routing entry for 7.7.7.7/32
  Known via "ospf 49", distance 110, metric 2, type intra area
  Last update from 49.7.8.7 on GigabitEthernet2.578, 01:47:55 ago
  Routing Descriptor Blocks:
  * 49.7.8.7, from 7.7.7.7, 01:47:55 ago, via GigabitEthernet2.578
      Route metric is 2, traffic share count is 1
CSR8#show mpls ldp bindings 7.7.7.7 32 neighbor 7.7.7.7
  lib entry: 7.7.7.7/32, rev 22
        remote binding: lsr: 7.7.7.7:0, label: imp-null
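The RIB failure and the resulting choice of out-label can be modeled in a few lines of Python. This is only a sketch of the reasoning in the paragraph above: the route source with the lowest administrative distance is installed, and the out-label is taken from the label protocol that matches the installed route. The names are hypothetical, and the second label set is an invented example to show the case where the two labels differ.

```python
# Administrative distances as shown in the "show ip route" outputs.
ADMIN_DISTANCE = {"ospf": 110, "ibgp": 200}


def rib_winner(sources):
    """Return the route source that is installed in the routing table."""
    return min(sources, key=ADMIN_DISTANCE.__getitem__)


def out_label(winner, labels):
    """Pick the out-label from the protocol tied to the installed route."""
    protocol = "ldp" if winner == "ospf" else "bgp"
    return labels[protocol]


# CSR8 learns 7.7.7.7/32 from both OSPF and iBGP.
winner = rib_winner(["ibgp", "ospf"])
print(winner)

# On CSR8 both labels happen to be implicit-null (3).
print(out_label(winner, {"ldp": 3, "bgp": 3}))

# Hypothetical values where the labels differ: the LDP label is still used.
print(out_label(winner, {"ldp": 16, "bgp": 3}))
```

The in-label (8002) is still allocated by BGP for the labeled-unicast advertisement; only the outgoing side follows the installed OSPF route.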

On CSR7, the LFIB removes all labels and delivers the IPv6 traffic to CSR3 inside VRF U2.
CSR7#show mpls forwarding-table labels 7010 detail
Local      Outgoing   Prefix            Bytes Label   Outgoing   Next Hop
Label      Label      or Tunnel Id      Switched      interface
7010       No Label   ::2:3:3:3/128[V]  7498          Gi2.207    FE80::3
        MAC/Encaps=22/22, MRU=1504, Label Stack{}
        000C29D781FE000C29664C2C81000DD1810000CF86DD
        VPN route: U2
        No output feature configured

We can confirm the label stack at each hop using traceroute as shown below.
RP/0/0/CPU0:XRv11#traceroute vrf U2 ::2:3:3:3 source ::2:11:11:11
Type escape sequence to abort.
Tracing the route to ::2:3:3:3
 1  fd00:192:168:205::5 9 msec 9 msec 119 msec
 2  2049:49:5:14::14 [MPLS: Labels 94004/7010 Exp 0] 139 msec 49 msec 39 msec
 3  ::ffff:49.1.14.1 [MPLS: Labels 1016/8002/7010 Exp 0] 39 msec 39 msec 39 msec
 4  ::ffff:49.1.8.8 [MPLS: Labels 8002/7010 Exp 0] 39 msec 29 msec 59 msec
 5  fd00:192:168:207::7 [MPLS: Label 7010 Exp 0] 39 msec 39 msec 39 msec
 6  fd00:192:168:207::3 39 msec 39 msec 39 msec
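The hop-by-hop label operations verified above can be replayed with a toy LFIB model in Python. Each entry maps an incoming top label to the labels that replace it: a swap plus push is a two-element list, and a pop is an empty list. The table contents come from the show commands in this section; the data structure and function name are hypothetical.

```python
# node -> {incoming top label: labels that replace it (top label first)}
LFIB = {
    "XRv14": {94004: [1016, 8002]},  # swap BGP label, push LDP label
    "CSR1": {1016: []},              # PHP for 8.8.8.8/32
    "CSR8": {8002: []},              # pop BGP label (LDP implicit-null used)
    "CSR7": {7010: []},              # VPN label removed, deliver into VRF U2
}
PATH = ["XRv14", "CSR1", "CSR8", "CSR7"]


def walk(stack, path, lfib):
    """Return the stack each node receives and the stack left at the end."""
    received = []
    for node in path:
        received.append((node, list(stack)))
        top = stack[0]
        stack = lfib[node][top] + stack[1:]
    return received, stack


# CSR5 imposes {94004 7010} and sends the packet to XRv14.
received, remaining = walk([94004, 7010], PATH, LFIB)
for node, stack in received:
    print(node, stack)
print("after CSR7:", remaining)
```

The stacks received by XRv14, CSR1, CSR8 and CSR7 line up with hops 2 through 5 of the traceroute, and the VPN label 7010 stays at the bottom throughout.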

Next, we will trace the LSP from CSR6 to XRv11 inside VRF U2. We will notice some asymmetry in the forwarding path due to BGP adjustments along the way. First, we verify that CSR6 (CE) is sending traffic to 2.11.11.11 via CSR4 versus CSR2. This is a result of the BGP local-preference changes shown earlier. CSR6#show bgp vpnv4 unicast vrf U2 2.11.11.11/32 BGP routing table entry for 49:202:2.11.11.11/32, version 76 Paths: (2 available, best #2, table U2) Advertised to update-groups: 2 Refresh Epoch 1 49 49 192.168.202.2 (via vrf U2) from 192.168.202.2 (2.2.2.2) Origin incomplete, localpref 100, valid, external rx pathid: 0, tx pathid: 0 Refresh Epoch 1 49 49 192.168.204.4 (via vrf U2) from 192.168.204.4 (4.4.4.4) Origin incomplete, localpref 200, valid, external, best rx pathid: 0, tx pathid: 0x0

The same command on CSR4 shows us the VPN label allocations. The remote PE (CSR5) allocates label 5002 for this customer prefix. This label never changes.
CSR4#show bgp vpnv4 unicast vrf U2 2.11.11.11/32
BGP routing table entry for 49:204:2.11.11.11/32, version 142
Paths: (1 available, best #1, table U2)
  Advertised to update-groups:
     2
  Refresh Epoch 1
  65000, imported path from 49:205:2.11.11.11/32 (global)
    5.5.5.5 (metric 6) (via default) from 7.7.7.7 (7.7.7.7)
      Origin incomplete, metric 0, localpref 100, valid, internal, best
      Extended Community: RT:49:205
      Originator: 5.5.5.5, Cluster list: 7.7.7.7
      mpls labels in/out nolabel/5002
      rx pathid: 0, tx pathid: 0x0


The route to 5.5.5.5, the VPNv4 next-hop, is a BGP route which carries label 93000. This is allocated by XRv13, the remote ABR. XRv13 is performing the “correct” role of a seamless MPLS ABR by adjusting the BGP labeled-unicast next-hop to itself. CSR4#show ip route 5.5.5.5 Routing entry for 5.5.5.5/32 Known via "bgp 49", distance 200, metric 0, type internal Last update from 13.13.13.13 00:45:45 ago Routing Descriptor Blocks: * 13.13.13.13, from 8.8.8.8, 00:45:45 ago Route metric is 0, traffic share count is 1 AS Hops 0 MPLS label: 93000

The route to 13.13.13.13 is an OSPF route via CSR9, which is bound to LDP label 9003. The label stack is now {9003 93000 5002}. CSR4#show ip route 13.13.13.13 Routing entry for 13.13.13.13/32 Known via "ospf 49", distance 110, metric 6, type inter area Last update from 49.4.9.9 on GigabitEthernet2.549, 01:25:21 ago Routing Descriptor Blocks: * 49.4.9.9, from 13.13.13.13, 01:25:21 ago, via GigabitEthernet2.549 Route metric is 6, traffic share count is 1 CSR4#show mpls ldp bindings 13.13.13.13 32 neighbor 9.9.9.9 lib entry: 13.13.13.13/32, rev 96 remote binding: lsr: 9.9.9.9:0, label: 9003

All of the core routers are involved in this LSP. CSR9, CSR1, and CSR8 all are P routers along this path and perform basic label swaps of LDP labels. XRv12 is the PHP router along the LDP LSP and pops the LDP label to reveal the IPv4 labeled-unicast label of 93000 to XRv13. The show commands are displayed below, and the label stack becomes {93000 5002}.
CSR9#show mpls forwarding-table labels 9003
Local      Outgoing   Prefix           Bytes Label   Outgoing   Next Hop
Label      Label      or Tunnel Id     Switched      interface
9003       1001       13.13.13.13/32   1458          Gi2.519    49.1.9.1
CSR1#show mpls forwarding-table labels 1001
Local      Outgoing   Prefix           Bytes Label   Outgoing   Next Hop
Label      Label      or Tunnel Id     Switched      interface
1001       8000       13.13.13.13/32   1884          Gi2.518    49.1.8.8
CSR8#show mpls forwarding-table labels 8000
Local      Outgoing   Prefix           Bytes Label   Outgoing   Next Hop
Label      Label      or Tunnel Id     Switched      interface
8000       92002      13.13.13.13/32   286255        Gi2.582    49.8.12.12
RP/0/0/CPU0:XRv12#show mpls forwarding labels 92002
Local  Outgoing    Prefix             Outgoing      Next Hop        Bytes
Label  Label       or ID              Interface                     Switched
------ ----------- ------------------ ------------- --------------- ----------
92002  Pop         13.13.13.13/32     Gi0/0/0/0.523 49.12.13.13     764630

When XRv13 receives packets with label 93000, it pops the label. The "Received Label" of 3 means implicit-null but does not apply here since XRv13 has an OSPF route for 5.5.5.5/32, thus using the LDP label. That label is also implicit-null and, as we have seen many times before, it can be tricky to understand which “implicit-null” is being used. The pop reveals label 5002 to the remote PE. I constantly show this as a “trick” because the in-label is BGP allocated, but the out-label cannot be used since the route is not installed via BGP. In this case, the LDP and BGP labels are both 3, but this is not always the case.
RP/0/0/CPU0:XRv13#show bgp ipv4 labeled-unicast 5.5.5.5/32
[snip]
Paths: (1 available, best #1)
  Advertised to update-groups (with more than one peer):
    0.1
  Path #1: Received by speaker 0
  Advertised to update-groups (with more than one peer):
    0.1
  Local, (Received from a RR-client)
    5.5.5.5 (metric 2) from 5.5.5.5 (5.5.5.5)
      Received Label 3
      Origin incomplete, metric 0, localpref 100, valid, internal, best, group-best
      Received Path ID 0, Local Path ID 1, version 69
RP/0/0/CPU0:XRv13#show route ipv4 5.5.5.5/32
Routing entry for 5.5.5.5/32
  Known via "ospf 49", distance 110, metric 2, type intra area
  Routing Descriptor Blocks
    49.5.13.5, from 5.5.5.5, via GigabitEthernet0/0/0/0.553
      Route metric is 2
  No advertising protos.
RP/0/0/CPU0:XRv13#show mpls ldp bindings 5.5.5.5/32 neighbor 5.5.5.5
5.5.5.5/32, rev 43
        Local binding: label: 93000
        Remote bindings: (2 peers)
            Peer                Label
            -----------------   ---------
            5.5.5.5:0           ImpNull


When CSR5 receives packets with label 5002, it delivers them to XRv11 inside VRF U2 for prefix 2.11.11.11/32 as expected. CSR5#show mpls forwarding-table labels 5002 detail Local Outgoing Prefix Bytes Label Outgoing Next Hop Label Label or Tunnel Id Switched interface 5002 No Label 2.11.11.11/32[V] 4328 Gi2.205 192.168.205.11 MAC/Encaps=22/22, MRU=1504, Label Stack{} 000C2950BAB7000C295F11A181000DDF810000CD0800 VPN route: U2 No output feature configured

As usual, we can confirm this LSP using traceroute below. The CEF table on the ingress PE within the VPN also shows the imposition label stack for quick verification. CSR4#show ip cef vrf U2 2.11.11.11 2.11.11.11/32 nexthop 49.4.9.9 GigabitEthernet2.549 label 9003 93000 5002 CSR6#traceroute vrf U2 2.11.11.11 source 2.6.6.6 Type escape sequence to abort. Tracing the route to 2.11.11.11 VRF info: (vrf in name/id, vrf out name/id) 1 192.168.204.4 1 msec 1 msec 0 msec 2 49.4.9.9 [MPLS: Labels 9003/93000/5002 Exp 0] 121 msec 33 msec 33 msec 3 49.1.9.1 [MPLS: Labels 1001/93000/5002 Exp 0] 31 msec 27 msec 32 msec 4 49.1.8.8 [MPLS: Labels 8000/93000/5002 Exp 0] 34 msec 34 msec 32 msec 5 49.8.12.12 [MPLS: Labels 92002/93000/5002 Exp 0] 26 msec 32 msec 26 msec 6 49.12.13.13 [MPLS: Labels 93000/5002 Exp 0] 34 msec 33 msec 33 msec 7 192.168.205.5 [MPLS: Label 5002 Exp 0] 32 msec 26 msec 31 msec 8 192.168.205.11 26 msec * 30 msec
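Reading label stacks out of traceroute text by eye is error-prone, so a small parser can help. The Python sketch below extracts the stacks from the labeled hops of the CSR6 traceroute above and checks the two properties the text relies on: the VPN label stays at the bottom, and the BGP labeled-unicast label sits just above it until the remote ABR pops it. The regular expression and names are my own; timing values are omitted from the sample text.

```python
import re

# Labeled hops copied from the CSR6 traceroute above (timings omitted).
TRACE = """\
2 49.4.9.9 [MPLS: Labels 9003/93000/5002 Exp 0]
3 49.1.9.1 [MPLS: Labels 1001/93000/5002 Exp 0]
4 49.1.8.8 [MPLS: Labels 8000/93000/5002 Exp 0]
5 49.8.12.12 [MPLS: Labels 92002/93000/5002 Exp 0]
6 49.12.13.13 [MPLS: Labels 93000/5002 Exp 0]
7 192.168.205.5 [MPLS: Label 5002 Exp 0]
"""

# Matches both "Label 5002" and "Labels 9003/93000/5002".
LABELS_RE = re.compile(r"\[MPLS: Labels? ([\d/]+) Exp")


def label_stacks(text):
    """Return one list of labels (top label first) per labeled hop."""
    stacks = []
    for line in text.splitlines():
        match = LABELS_RE.search(line)
        if match:
            stacks.append([int(value) for value in match.group(1).split("/")])
    return stacks


stacks = label_stacks(TRACE)
for stack in stacks:
    print(stack)

# The VPN label is always at the bottom of the stack.
print(all(stack[-1] == 5002 for stack in stacks))
# The BGP-LU label sits directly above it wherever two or more labels exist.
print(all(stack[-2] == 93000 for stack in stacks if len(stack) >= 2))
```

Only the top label changes from hop 2 to hop 5, which is the LDP transport label being swapped across the core.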

33.2.2.2 L2VPN

As with IS-IS, the verification for L2VPN is based mostly on the transport path plus the binding of the PW labels. Tracing the LSP from CSR6 to CSR3 inside the L2VPN, we can determine the PW label by checking the AToM bindings. In this case, the remote PW label is 7034.
CSR2#show l2vpn atom binding 7.7.7.7
  Destination Address: 7.7.7.7,VC ID: 567
    Local Label: 2018
        Cbit: 1, VC Type: Ethernet, GroupID: n/a
        MTU: 1500, Interface Desc: n/a
        VCCV: CC Type: RA [2], TTL [3]
              CV Type: LSPV [2]
    Remote Label: 7034
        Cbit: 1, VC Type: Ethernet, GroupID: n/a
        MTU: 1500, Interface Desc: n/a
        VCCV: CC Type: RA [2], TTL [3]
              CV Type: LSPV [2]

We will quickly determine the transport label stack as well. The route to CSR7, the remote PW endpoint, is a BGP route with label 10029 via CSR10. The route to 10.10.10.10 is an OSPF route bound to implicit-null, also via CSR10. The label stack is now {10029 7034}.
CSR2#show ip route 7.7.7.7
Routing entry for 7.7.7.7/32
  Known via "bgp 49", distance 200, metric 0, type internal
  Last update from 10.10.10.10 01:05:16 ago
  Routing Descriptor Blocks:
  * 10.10.10.10, from 10.10.10.10, 01:05:16 ago
      Route metric is 0, traffic share count is 1
      AS Hops 0
      MPLS label: 10029
CSR2#show ip route 10.10.10.10
Routing entry for 10.10.10.10/32
  Known via "ospf 49", distance 110, metric 2, type intra area
  Last update from 49.2.10.10 on GigabitEthernet2.520, 01:55:27 ago
  Routing Descriptor Blocks:
  * 49.2.10.10, from 10.10.10.10, 01:55:27 ago, via GigabitEthernet2.520
      Route metric is 2, traffic share count is 1
CSR2#show mpls ldp bindings 10.10.10.10 32 neighbor 10.10.10.10
  lib entry: 10.10.10.10/32, rev 61
        remote binding: lsr: 10.10.10.10:0, label: imp-null

CSR10 selects the path via CSR8 as the BGP bestpath towards 7.7.7.7/32 because CSR8 has the lower neighbor ID (the last tie-breaker). As seen earlier, the route via XRv12 is invalid due to the infinite recursive lookups that would occur. CSR10 swaps label 10029 for 8002; we confirm this in the LFIB also. The label stack is now {8002 7034}.
CSR10#show bgp ipv4 unicast 7.7.7.7/32
BGP routing table entry for 7.7.7.7/32, version 82
Paths: (2 available, best #2, table default)
  Advertised to update-groups:
     4
  Refresh Epoch 1
  Local
    7.7.7.7 (metric 2) from 12.12.12.12 (12.12.12.12)
      Origin incomplete, metric 0, localpref 100, valid, internal
      Originator: 7.7.7.7, Cluster list: 12.12.12.12
      mpls labels in/out 10029/imp-null
      rx pathid: 0, tx pathid: 0
  Refresh Epoch 5
  Local
    8.8.8.8 (metric 2) from 8.8.8.8 (8.8.8.8)
      Origin incomplete, metric 0, localpref 100, valid, internal, best
      Originator: 7.7.7.7, Cluster list: 8.8.8.8
      mpls labels in/out 10029/8002
      rx pathid: 0, tx pathid: 0x0
CSR10#show mpls forwarding-table labels 10029
Local      Outgoing   Prefix           Bytes Label   Outgoing   Next Hop
Label      Label      or Tunnel Id     Switched      interface
10029      8002       7.7.7.7/32       338367        Gi2.580    49.8.10.8
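The two bestpath decisions seen in this section (local-preference on CSR5, neighbor address on CSR10) can be illustrated with a deliberately reduced comparison in Python. This sketch assumes every other step of the real BGP bestpath algorithm ties, which holds for the paths shown here but is not true in general. The field and function names are hypothetical.

```python
from ipaddress import ip_address


def best_path(paths):
    """Highest local-preference wins; the lowest neighbor address breaks ties.

    All other bestpath steps are assumed to tie and are left out on purpose.
    """
    return min(
        paths,
        key=lambda path: (-path["localpref"], ip_address(path["neighbor"])),
    )


# CSR10 towards 7.7.7.7/32: equal local-preference, so neighbor address decides.
csr10_paths = [
    {"neighbor": "12.12.12.12", "localpref": 100},
    {"neighbor": "8.8.8.8", "localpref": 100},
]
print(best_path(csr10_paths)["neighbor"])

# CSR5 towards 7.7.7.7/32: XRv14 wins on local-preference despite its address.
csr5_paths = [
    {"neighbor": "14.14.14.14", "localpref": 200},
    {"neighbor": "13.13.13.13", "localpref": 100},
]
print(best_path(csr5_paths)["neighbor"])
```

The addresses are compared numerically with `ip_address`; a plain string comparison would wrongly rank "12.12.12.12" below "8.8.8.8".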

As seen earlier in the L3VPN verification, CSR8 pops this BGP label due to receiving an LDP-bound implicit-null for 7.7.7.7/32. The LDP label is used because the route to prefix 7.7.7.7/32 is an IGP route. This reveals the PW label of 7034 to CSR7. CSR8#show ip route 7.7.7.7 Routing entry for 7.7.7.7/32 Known via "ospf 49", distance 110, metric 2, type intra area Last update from 49.7.8.7 on GigabitEthernet2.578, 01:47:55 ago Routing Descriptor Blocks: * 49.7.8.7, from 7.7.7.7, 01:47:55 ago, via GigabitEthernet2.578 Route metric is 2, traffic share count is 1 CSR8#show mpls ldp bindings 7.7.7.7 32 neighbor 7.7.7.7 lib entry: 7.7.7.7/32, rev 22 remote binding: lsr: 7.7.7.7:0, label: imp-null

We can verify the transport label path using MPLS OAM. Ignoring the extraneous "implicit-nulls", the label stack is as we verified it above. Looking at the L2VPN PW details, we can see the label stack including the PW label as well. CSR2#traceroute mpls ipv4 7.7.7.7/32 source 2.2.2.2 more work needed here to demux the tfs subtlv and to display the right output [snip] Type escape sequence to abort. 0 49.2.10.2 MRU 1500 [Labels: implicit-null/10029 Exp: 0/0] L 1 49.2.10.10 MRU 1500 [Labels: 8002 Exp: 0] 1 ms L 2 49.8.10.8 MRU 1500 [Labels: implicit-null Exp: 0] 29 ms ! 3 49.7.8.7 20 ms CSR2#show l2vpn atom vc vcid 567 destination 7.7.7.7 detail | include label_stack Output interface: Gi2.520, imposed label stack {10029 7034}


We will also verify the PW from CSR7 to CSR5 for extra practice. The PW label is 5009 as shown by the AToM bindings below. CSR7#show l2vpn atom binding 5.5.5.5 | include Label Local Label: 7021 Remote Label: 5009

The route to the PW endpoint of 5.5.5.5 is a BGP route, implying that we use the BGP label to reach it. Notice that CSR8 was selected because it has the lower neighbor ID.
CSR7#show ip route 5.5.5.5
Routing entry for 5.5.5.5/32
  Known via "bgp 49", distance 200, metric 0, type internal
  Last update from 8.8.8.8 01:14:05 ago
  Routing Descriptor Blocks:
  * 8.8.8.8, from 8.8.8.8, 01:14:05 ago
      Route metric is 0, traffic share count is 1
      AS Hops 0
      MPLS label: 8004
CSR7#show bgp ipv4 unicast 5.5.5.5/32
BGP routing table entry for 5.5.5.5/32, version 143
Paths: (2 available, best #2, table default)
  Not advertised to any peer
  Refresh Epoch 1
  Local
    12.12.12.12 (metric 2) from 12.12.12.12 (12.12.12.12)
      Origin incomplete, metric 0, localpref 100, valid, internal
      Originator: 5.5.5.5, Cluster list: 12.12.12.12, 13.13.13.13
      mpls labels in/out nolabel/92012
      rx pathid: 0, tx pathid: 0
  Refresh Epoch 3
  Local
    8.8.8.8 (metric 2) from 8.8.8.8 (8.8.8.8)
      Origin incomplete, metric 0, localpref 100, valid, internal, best
      Originator: 5.5.5.5, Cluster list: 8.8.8.8, 13.13.13.13
      mpls labels in/out nolabel/8004
      rx pathid: 0, tx pathid: 0x0

The route to the BGP next-hop of 8.8.8.8 is an OSPF route, implying that the LDP label should be pushed atop the BGP one. The LDP label is implicit-null, so the label stack becomes {8004 5009}. CSR7#show ip route 8.8.8.8 Routing entry for 8.8.8.8/32 Known via "ospf 49", distance 110, metric 2, type inter area Last update from 49.7.8.8 on GigabitEthernet2.578, 01:17:49 ago Routing Descriptor Blocks:


* 49.7.8.8, from 8.8.8.8, 01:17:49 ago, via GigabitEthernet2.578 Route metric is 2, traffic share count is 1 CSR7#show mpls ldp bindings 8.8.8.8 32 neighbor 8.8.8.8 lib entry: 8.8.8.8/32, rev 145 remote binding: lsr: 8.8.8.8:0, label: imp-null

Checking the LFIB, we can see that label 8004 is swapped for label 93000, with label 92002 pushed on top of that. The swap is conducted by BGP since the remote label from either XRv13 or XRv14 was allocated by BGP also. CSR8 selects XRv13 as the bestpath due to the peer having a lower neighbor ID.
CSR8#show mpls forwarding-table labels 8004 detail
Local      Outgoing   Prefix           Bytes Label   Outgoing   Next Hop
Label      Label      or Tunnel Id     Switched      interface
8004       93000      5.5.5.5/32       519868        Gi2.582    49.8.12.12
        MAC/Encaps=18/26, MRU=1496, Label Stack{92002 93000}
        000C2987637A000C296AFFD581000DFE8847 1676200016B48000
        No output feature configured

CSR8#show ip route 5.5.5.5 Routing entry for 5.5.5.5/32 Known via "bgp 49", distance 200, metric 0, type internal Last update from 13.13.13.13 01:55:05 ago Routing Descriptor Blocks: * 13.13.13.13, from 13.13.13.13, 01:55:05 ago Route metric is 0, traffic share count is 1 AS Hops 0 MPLS label: 93000 CSR8#show bgp ipv4 unicast 5.5.5.5/32 BGP routing table entry for 5.5.5.5/32, version 69 Paths: (2 available, best #2, table default) Advertised to update-groups: 5 6 Refresh Epoch 1 Local, (Received from a RR-client) 14.14.14.14 (metric 3) from 14.14.14.14 (14.14.14.14) Origin incomplete, metric 0, localpref 100, valid, internal Originator: 5.5.5.5, Cluster list: 14.14.14.14 mpls labels in/out 8004/94008 rx pathid: 0, tx pathid: 0 Refresh Epoch 1 Local, (Received from a RR-client) 13.13.13.13 (metric 3) from 13.13.13.13 (13.13.13.13) Origin incomplete, metric 0, localpref 100, valid, internal, best Originator: 5.5.5.5, Cluster list: 13.13.13.13 mpls labels in/out 8004/93000 rx pathid: 0, tx pathid: 0x0
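The last 8 bytes of the rewrite string in the CSR8 LFIB entry for label 8004 are the two outgoing label stack entries, and the MRU follows from the net change in label count. The Python sketch below reproduces both. The MRU relation is an inference from the outputs in this section, not a documented formula, and it assumes a 1500-byte interface MTU; the function names are hypothetical.

```python
import struct


def encode_label_stack(labels, exp=0, ttl=0, bottom_of_stack=False):
    """Pack labels into 4-byte MPLS label stack entries, top label first."""
    encoded = b""
    for index, label in enumerate(labels):
        is_last = index == len(labels) - 1
        s_bit = 1 if (bottom_of_stack and is_last) else 0
        word = (label << 12) | (exp << 9) | (s_bit << 8) | ttl
        encoded += struct.pack("!I", word)
    return encoded


def mru(interface_mtu, labels_removed, labels_added):
    """Largest labeled packet that still fits after the label operation."""
    return interface_mtu - 4 * (labels_added - labels_removed)


# CSR8, in-label 8004: one label removed, {92002 93000} added. The S bit stays
# clear because the PW or VPN label remains underneath these two labels.
rewrite = encode_label_stack([92002, 93000])
print(rewrite.hex().upper())

print(mru(1500, labels_removed=1, labels_added=2))  # entry for label 8004
print(mru(1500, labels_removed=1, labels_added=0))  # entries with No Label
```

The hex string matches the tail of the rewrite shown for label 8004, and the two MRU values match the 1496 and 1504 reported in the LFIB detail outputs.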


The push operation occurs because CSR8 has to reach XRv13 across the MPLS core, so it pushes an LDP label from XRv12 for 13.13.13.13/32. The label stack is now {92002 93000 5009}. CSR8#show ip route 13.13.13.13 Routing entry for 13.13.13.13/32 Known via "ospf 49", distance 110, metric 3, type inter area Last update from 49.8.12.12 on GigabitEthernet2.582, 01:53:55 ago Routing Descriptor Blocks: * 49.8.12.12, from 13.13.13.13, 01:53:55 ago, via GigabitEthernet2.582 Route metric is 3, traffic share count is 1 CSR8#show mpls ldp bindings 13.13.13.13 32 neighbor 12.12.12.12 lib entry: 13.13.13.13/32, rev 90 remote binding: lsr: 12.12.12.12:0, label: 92002

This same in-label was seen earlier in the L3VPN section, but we verify again. The LDP label is popped revealing the BGP label to the ABR, XRv13. RP/0/0/CPU0:XRv12#show mpls forwarding labels 92002 Local Outgoing Prefix Outgoing Next Hop Bytes Label Label or ID Interface Switched ------ ----------- ------------------ ------------ --------------- ---------92002 Pop 13.13.13.13/32 Gi0/0/0/0.523 49.12.13.13 1032102

XRv13 pops the BGP label since the path to 5.5.5.5/32 is an IGP route that has an implicit-null LDP label. This reveals the PW label of 5009 to CSR5.
RP/0/0/CPU0:XRv13#show route ipv4 5.5.5.5/32
Routing entry for 5.5.5.5/32
  Known via "ospf 49", distance 110, metric 2, type intra area
  Routing Descriptor Blocks
    49.5.13.5, from 5.5.5.5, via GigabitEthernet0/0/0/0.553
      Route metric is 2
  No advertising protos.
RP/0/0/CPU0:XRv13#show mpls ldp bindings 5.5.5.5/32 neighbor 5.5.5.5
5.5.5.5/32, rev 43
        Local binding: label: 93000
        Remote bindings: (2 peers)
            Peer              Label
            ---------------   ---------
            5.5.5.5:0         ImpNull

MPLS traceroute and L2VPN detailed outputs confirm the LSP we verified.


CSR7#show l2vpn atom vc vcid 567 destination 5.5.5.5 detail | include label_stack Output interface: Gi2.578, imposed label stack {8004 5009} CSR7#traceroute mpls ipv4 5.5.5.5/32 source 7.7.7.7 more work needed here to demux the tfs subtlv and to display the right output [snip] Type escape sequence to abort. 0 49.7.8.7 MRU 1500 [Labels: implicit-null/8004 Exp: 0/0] L 1 49.7.8.8 MRU 1500 [Labels: 92002/93000 Exp: 0/0] 2 ms L 2 49.8.12.12 MRU 1500 [Labels: implicit-null/93000 Exp: 0/0] 8 ms L 3 49.12.13.13 MRU 1500 [Labels: implicit-null Exp: 0] 39 ms ! 4 49.5.13.5 40 ms
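The label operations verified in this walkthrough can be replayed as a small simulation. This is only a sketch: the `LFIB` dictionary is a hypothetical, hand-built stand-in for each router's forwarding table, populated with the label values seen in the outputs above, and it models nothing beyond the top-label lookup.

```python
# Hypothetical per-router forwarding entries transcribed from the outputs above.
# Each incoming top label maps to the labels that replace it (empty list = pop).
LFIB = {
    "CSR8": {8004: [92002, 93000]},  # swap BGP label 8004 -> 93000, push LDP label 92002
    "XRv12": {92002: []},            # PHP: pop the LDP label toward XRv13
    "XRv13": {93000: []},            # pop the BGP label (IGP next hop is implicit-null)
}
PATH = ["CSR8", "XRv12", "XRv13"]


def forward(stack, router):
    """Apply one router's operation to the top label and return the new stack."""
    top, rest = stack[0], stack[1:]
    return LFIB[router][top] + rest


def walk(stack):
    """Return the label stack as sent by the head end and by every router in PATH."""
    history = [list(stack)]
    for router in PATH:
        stack = forward(stack, router)
        history.append(list(stack))
    return history


# CSR7 imposes the BGP label 8004 on top of the PW label 5009.
history = walk([8004, 5009])
for sender, stack in zip(["CSR7"] + PATH, history):
    print(f"{sender:6} sends {stack}")
```

The last entry is the stack that arrives at CSR5, which should contain only the PW label.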

A quick verification on CSR3 shows that the L2VPN E-LAN is working. There is a full mesh of EIGRP neighbors for both IPv4 and IPv6, as expected. Since the Q-count is zero, we can assume that multicast and unicast connectivity has been achieved bidirectionally across the L2VPN.
CSR3#show eigrp address-family ipv4 vrf U3 neighbors
EIGRP-IPv4 VR(VPLS) Address-Family Neighbors for AS(65000) VRF(U3)
H   Address                 Interface    Hold  Uptime    SRTT   RTO   Q    Seq
                                         (sec)           (ms)         Cnt  Num
1   192.168.0.6             Gi2.307      10    01:25:14  258    1548  0    46
0   192.168.0.11            Gi2.307      14    01:25:24  152    912   0    40
CSR3#show eigrp address-family ipv6 vrf U3 neighbors
EIGRP-IPv6 VR(VPLS) Address-Family Neighbors for AS(65000) VRF(U3)
H   Address                 Interface    Hold  Uptime    SRTT   RTO   Q    Seq
                                         (sec)           (ms)         Cnt  Num
1   Link-local address:     Gi2.307      10    01:25:19  149    894   0    41
    FE80::11
0   Link-local address:     Gi2.307      14    01:25:26  25     150   0    43
    FE80::6

33.2.2.3 MVPN (mLDP profiles 1 and 17)
The mLDP connectivity works identically regardless of the IGPs in use. We quickly verify the tree construction on CSR2 and CSR5 as examples. Because CSR2 and CSR4 do not import one another's RTs, the MVPN BGP AD type-1 routes are not imported, so they don't join one another's P2MP trees. That is why CSR2 is not joined to a P2MP tree rooted at CSR4, but CSR5 is. As seen earlier, IPv4 P2MP trees (gid 65536) are highlighted in yellow while IPv6 P2MP trees (gid 131072) are highlighted in green.
CSR2#show mpls mldp database summary
LSM ID     Type    Root       Decoded Opaque Value           Client Cnt.
11         P2MP    2.2.2.2    [gid 65536 (0x00010000)]       2
21         P2MP    5.5.5.5    [gid 65536 (0x00010000)]       1
22         P2MP    7.7.7.7    [gid 65536 (0x00010000)]       1
12         P2MP    2.2.2.2    [gid 131072 (0x00020000)]      2
1F         P2MP    5.5.5.5    [gid 131072 (0x00020000)]      1
20         P2MP    7.7.7.7    [gid 131072 (0x00020000)]      1
13         MP2MP   8.8.8.8    [mdt 49:222 0]                 1
CSR5#show mpls mldp database summary
LSM ID     Type    Root       Decoded Opaque Value           Client Cnt.
2C         P2MP    2.2.2.2    [gid 65536 (0x00010000)]       1
27         P2MP    4.4.4.4    [gid 65536 (0x00010000)]       1
13         P2MP    5.5.5.5    [gid 65536 (0x00010000)]       2
28         P2MP    7.7.7.7    [gid 65536 (0x00010000)]       1
2B         P2MP    2.2.2.2    [gid 131072 (0x00020000)]      1
29         P2MP    4.4.4.4    [gid 131072 (0x00020000)]      1
14         P2MP    5.5.5.5    [gid 131072 (0x00020000)]      2
2A         P2MP    7.7.7.7    [gid 131072 (0x00020000)]      1
15         MP2MP   8.8.8.8    [mdt 49:222 0]                 1

A quick OAM ping on CSR5 shows that the MP2MP tree is working properly. Traffic can flow bidirectionally, which means that the OAM ping can be originated from any PE in the MVPN instance.
CSR5#ping mpls mldp mp2mp 8.8.8.8 mdt 49:222 0
mp2mp Root node addr 8.8.8.8
 Opaque type MDT, oui:index 0x49:0222, mdtnum 0
Sending 1, 72-byte MPLS Echos to Target FEC Stack TLV descriptor,
     timeout is 2.2 seconds, send interval is 0 msec, jitter value is 200 msec:
[snip]
Request #1
! reply addr 49.4.9.4
! reply addr 49.7.8.7
! reply addr 49.2.10.2
Round-trip min/avg/max = 172/200/222 ms
Received 3 replies

We can also use OAM to check a P2MP tree from its root. In this case, we test the MVPNv6 tree from CSR5 to see that CSR2, CSR4, and CSR7 join. We also run the same test from CSR4 and only see responses from CSR5 and CSR7. This is expected because CSR2 did not join this P2MP tree.
CSR5#ping mpls mldp p2mp 5.5.5.5 hex 0x1 00020000
p2mp Root node addr 5.5.5.5
 Opaque type hex value (0x1), num hex digits 4
Sending 1, 72-byte MPLS Echos to Target FEC Stack TLV descriptor,
     timeout is 2.2 seconds, send interval is 0 msec, jitter value is 200 msec:
[snip]
Request #1
! reply addr 49.2.10.2
! reply addr 49.4.9.4
! reply addr 49.7.8.7
Round-trip min/avg/max = 20/115/181 ms
Received 3 replies
CSR4#ping mpls mldp p2mp 4.4.4.4 hex 0x1 00020000
p2mp Root node addr 4.4.4.4
 Opaque type hex value (0x1), num hex digits 4
Sending 1, 72-byte MPLS Echos to Target FEC Stack TLV descriptor,
     timeout is 2.2 seconds, send interval is 0 msec, jitter value is 200 msec:
[snip]
Request #1
! reply addr 49.7.8.7
! reply addr 49.5.14.5
Round-trip min/avg/max = 91/123/155 ms
Received 2 replies

For extra verification, we can see the BGP MVPNv6 routes on CSR4. Only CSR5 and CSR7 routes exist (plus the local one from CSR4), which is additional proof that CSR2 and CSR4 are not exchanging MVPN routes. The details of the CSR7 BGP AD Type-1 (I-PMSI) route are shown below. The format is identical for MVPNv4 and MVPNv6, which is discussed in detail in the MVPN section. This concludes the verification for this section; the OAM checks were sufficient to ensure the MVPN, from an SP perspective, is functional. CSR4#show bgp ipv6 mvpn vrf U1 | begin Network Network Next Hop Metric LocPrf Weight Path Route Distinguisher: 49:104 (default for vrf U1) *> [1][49:104][4.4.4.4]/12 :: 32768 ? * i [1][49:104][5.5.5.5]/12 5.5.5.5 0 100 0 ? *>i 5.5.5.5 0 100 0 ? *>i [1][49:104][7.7.7.7]/12 7.7.7.7 0 100 0 ? CSR4#show bgp ipv6 mvpn vrf U1 route-type 1 7.7.7.7 BGP routing table entry for [1][49:104][7.7.7.7]/12, version 121


Paths: (1 available, best #1, table MVPNV6-BGP-Table, not advertised to EBGP peer) Not advertised to any peer Refresh Epoch 1 Local, imported path from [1][49:107][7.7.7.7]/12 (global) 7.7.7.7 (metric 4) from 7.7.7.7 (7.7.7.7) Origin incomplete, metric 0, localpref 100, valid, internal, best Community: no-export Extended Community: RT:49:107 PMSI Attribute: Flags: 0x0, Tunnel type: 2, length 17, label: exp-null, tunnel parameters: 0600 0104 0707 0707 0007 0100 0400 0200 00 rx pathid: 0, tx pathid: 0x0
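The "tunnel parameters" bytes in the PMSI attribute above can be decoded by hand. The sketch below assumes the bytes follow the mLDP P2MP FEC element layout as I understand it from RFC 6388 (type, address family, address length, root address, opaque length, then one opaque TLV); the function name and the returned dictionary keys are hypothetical and chosen only for this illustration.

```python
import ipaddress
import struct


def decode_mldp_p2mp_fec(hex_words):
    """Decode an mLDP P2MP FEC element given as space-separated hex words."""
    data = bytes.fromhex(hex_words.replace(" ", ""))
    # Fixed header: FEC type (1 byte), address family (2 bytes), address length (1 byte)
    fec_type, family, addr_len = struct.unpack_from("!BHB", data, 0)
    root = ipaddress.ip_address(data[4:4 + addr_len])
    offset = 4 + addr_len
    # Total length of the opaque value that follows
    (opaque_len,) = struct.unpack_from("!H", data, offset)
    offset += 2
    # Single opaque TLV: type (1 byte), length (2 bytes), value
    opaque_type, value_len = struct.unpack_from("!BH", data, offset)
    offset += 3
    value = int.from_bytes(data[offset:offset + value_len], "big")
    return {
        "fec_type": fec_type,
        "address_family": family,
        "root": str(root),
        "opaque_len": opaque_len,
        "opaque_type": opaque_type,
        "gid": value,
    }


# Bytes copied from the PMSI attribute of the CSR7 type-1 route.
fec = decode_mldp_p2mp_fec("0600 0104 0707 0707 0007 0100 0400 0200 00")
print(fec)
```

Under these assumptions the decoded root and gid should correspond to the IPv6 P2MP tree rooted at CSR7 in the mLDP database summaries shown earlier.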

33.2.2.4 MPLS TE and TE-FRR
Inter-area TE is similar between IS-IS levels and OSPF areas. The same constraints and techniques apply. We will quickly test both the tunnel stitching and loose-hop expansion methods. First, we have a stitched LSP from CSR2 to CSR7. For variety, we will adjust the tunnel in area 0.0.190.239 to terminate on CSR4; this is one hop away, which makes it a primary one-hop tunnel as well.
! CSR2
ip explicit-path name EP_4_DIRECT enable
 next-address 49.2.4.4
 next-address 4.4.4.4
interface Tunnel100
 description PATH TO R4 (DIRECT)
 ip unnumbered Loopback0
 tunnel mode mpls traffic-eng
 tunnel destination 4.4.4.4
 tunnel mpls traffic-eng autoroute announce
 tunnel mpls traffic-eng path-option 10 explicit name EP_4_DIRECT

Unlike IS-IS, this area boundary with CSR7 actually matters to OSPF. IS-IS considered the inter-area link level-2, so the fact that CSR7 was in a separate area made no difference to IS-IS in that specific design. The tunnel originating from CSR4 cannot terminate on CSR7 for this reason. Instead, we terminate it on CSR8 by looping through other core routers; because this was the cost-based dynamic best-path, we don't make an explicit-path for simplicity. We verify this path with OAM.
! CSR4
interface Tunnel100
 description PATH TO CSR8 (DYNAMIC)
 ip unnumbered Loopback0
 tunnel mode mpls traffic-eng
 tunnel destination 8.8.8.8
 tunnel mpls traffic-eng autoroute announce
 tunnel mpls traffic-eng path-option 10 dynamic


CSR4#trace mpls traffic-eng tunnel 100 Tracing MPLS TE Label Switched Path on Tunnel100, timeout is 2 seconds [snip] Type escape sequence to abort. 0 49.4.9.4 MRU 1500 [Labels: 9019 Exp: 0] L 1 49.4.9.9 MRU 1500 [Labels: 1004 Exp: 0] 2 ms L 2 49.1.9.1 MRU 1500 [Labels: implicit-null Exp: 0] 9 ms ! 3 49.1.8.8 13 ms

Rather than configure a tunnel in area 0.0.202.254, we will merge the TE path with the LDP path. We don't necessarily need to stitch tunnels everywhere, so we can continue to use LDP towards CSR7 from the ABRs servicing that area. First, we have several routing issues to fix. On CSR2, the route to 7.7.7.7 ultimately recurses towards CSR10 outside of the TE tunnel. We can fix this with a static route or, more elegantly, by adjusting BGP.
CSR2#show ip route 7.7.7.7
Routing entry for 7.7.7.7/32
  Known via "bgp 49", distance 200, metric 0, type internal
  Last update from 10.10.10.10 00:08:20 ago
  Routing Descriptor Blocks:
  * 10.10.10.10, from 10.10.10.10, 00:08:20 ago
      Route metric is 0, traffic share count is 1
      AS Hops 0
      MPLS label: 10029
CSR2#show ip route 10.10.10.10
Routing entry for 10.10.10.10/32
  Known via "ospf 49", distance 110, metric 2, type intra area
  Last update from 49.2.10.10 on GigabitEthernet2.520, 00:06:37 ago
  Routing Descriptor Blocks:
  * 49.2.10.10, from 10.10.10.10, 00:06:37 ago, via GigabitEthernet2.520
      Route metric is 2, traffic share count is 1

CSR2 has selected CSR10 as the best-path for the route to 7.7.7.7/32 due to it being "closer" to CSR7 than CSR4 is. This generally makes sense and is the reason why BGP considers the IGP metric to the BGP next-hop in the first place. CSR2#show bgp ipv4 unicast 7.7.7.7/32 BGP routing table entry for 7.7.7.7/32, version 111 Paths: (2 available, best #2, table default) Not advertised to any peer Refresh Epoch 1 Local 4.4.4.4 (metric 3) from 4.4.4.4 (4.4.4.4) Origin incomplete, metric 0, localpref 100, valid, internal Originator: 7.7.7.7, Cluster list: 4.4.4.4, 8.8.8.8


mpls labels in/out nolabel/4024 rx pathid: 0, tx pathid: 0 Refresh Epoch 1 Local 10.10.10.10 (metric 2) from 10.10.10.10 (10.10.10.10) Origin incomplete, metric 0, localpref 100, valid, internal, best Originator: 7.7.7.7, Cluster list: 10.10.10.10, 8.8.8.8 mpls labels in/out nolabel/10029 rx pathid: 0, tx pathid: 0x0

We temporarily increase CSR10's cost on its loopback to 2. This means that CSR4 will become the winner based on having the lower neighbor address. If we used a number greater than 2, CSR4 would still win, but as a result of having a lower IGP metric to the BGP next-hop. I form an IGP metric tie in this case just for fun. Label 4024 is now used within the LSP. Because CSR4 is the tail end of the tunnel, the next label in the stack should be a local label of that tail-end router. Using CSR10's local label would break the LSP as CSR4 would not know how to label-switch those 10000-series labels.
! CSR10
interface Loopback0
 ip ospf cost 2
CSR2#show bgp ipv4 unicast 7.7.7.7/32
BGP routing table entry for 7.7.7.7/32, version 116
Paths: (2 available, best #1, table default)
  Not advertised to any peer
  Refresh Epoch 1
  Local
    4.4.4.4 (metric 3) from 4.4.4.4 (4.4.4.4)
      Origin incomplete, metric 0, localpref 100, valid, internal, best
      Originator: 7.7.7.7, Cluster list: 4.4.4.4, 8.8.8.8
      mpls labels in/out nolabel/4024
      rx pathid: 0, tx pathid: 0x0
  Refresh Epoch 2
  Local
    10.10.10.10 (metric 3) from 10.10.10.10 (10.10.10.10)
      Origin incomplete, metric 0, localpref 100, valid, internal
      Originator: 7.7.7.7, Cluster list: 10.10.10.10, 8.8.8.8
      mpls labels in/out nolabel/10029
      rx pathid: 0, tx pathid: 0
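The two tie-breakers at play here can be sketched in a few lines. This is a deliberately reduced model, not the full BGP best-path algorithm: it assumes every earlier attribute (weight, local preference, AS path, origin, MED, originator ID, cluster-list length) is equal, as it is for these two paths, and compares only the IGP metric to the next hop followed by the neighbor address. The `Path` class and `best_path` function are hypothetical names for this illustration.

```python
import ipaddress
from dataclasses import dataclass


@dataclass(frozen=True)
class Path:
    neighbor: str    # address of the iBGP neighbor that advertised the path
    igp_metric: int  # IGP metric to the BGP next hop


def best_path(paths):
    """Prefer the lowest IGP metric, then the lowest neighbor address."""
    # Addresses are compared numerically; a plain string comparison would
    # wrongly rank "10.10.10.10" below "4.4.4.4".
    return min(paths, key=lambda p: (p.igp_metric, ipaddress.ip_address(p.neighbor)))


# Metrics as seen on CSR2 before and after raising CSR10's loopback cost.
before = [Path("4.4.4.4", 3), Path("10.10.10.10", 2)]
after = [Path("4.4.4.4", 3), Path("10.10.10.10", 3)]

print("before:", best_path(before).neighbor)
print("after: ", best_path(after).neighbor)
```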

To verify that this works, we follow the route lookups on CSR2. Now that 4.4.4.4 is the BGP next-hop for 7.7.7.7, the traffic is routed into the tunnel via auto-route. Ignoring MPLS service labels (L3VPN, L2VPN, etc), the label stack is now {4024}. Also, ignoring mLDP requirements, LDP is not required on this tunnel despite it not being end-to-end because BGP can be used for the intermediate label bindings normally required for PE-P or P-P tunnels. CSR2#show ip route 7.7.7.7


Routing entry for 7.7.7.7/32 Known via "bgp 49", distance 200, metric 0, type internal Last update from 4.4.4.4 00:00:14 ago Routing Descriptor Blocks: * 4.4.4.4, from 4.4.4.4, 00:00:14 ago Route metric is 0, traffic share count is 1 AS Hops 0 MPLS label: 4024 CSR2#show ip route 4.4.4.4 Routing entry for 4.4.4.4/32 Known via "ospf 49", distance 110, metric 3, type intra area Last update from 4.4.4.4 on Tunnel100, 00:11:21 ago Routing Descriptor Blocks: * 4.4.4.4, from 4.4.4.4, 00:11:21 ago, via Tunnel100 Route metric is 3, traffic share count is 1 CSR2#show mpls traffic-eng tunnels tunnel 100 | include Label InLabel : OutLabel : GigabitEthernet2.524, implicit-null

The same logic is true on CSR4. Checking the route recursion, 7.7.7.7 is reachable via 8.8.8.8, which is reachable via the TE tunnel. The label stack becomes {9019 8002}. In short, creating P-P or PE-P tunnels that terminate on routers that swap BGP labels (UMPLS ABRs, for example) is an acceptable alternative to using tLDP.
CSR4#show ip route 7.7.7.7
Routing entry for 7.7.7.7/32
  Known via "bgp 49", distance 200, metric 0, type internal
  Last update from 8.8.8.8 3d07h ago
  Routing Descriptor Blocks:
  * 8.8.8.8, from 8.8.8.8, 3d07h ago
      Route metric is 0, traffic share count is 1
      AS Hops 0
      MPLS label: 8002
CSR4#show ip route 8.8.8.8
Routing entry for 8.8.8.8/32
  Known via "ospf 49", distance 110, metric 4, type intra area
  Last update from 8.8.8.8 on Tunnel100, 00:09:53 ago
  Routing Descriptor Blocks:
  * 8.8.8.8, from 8.8.8.8, 00:09:53 ago, via Tunnel100
      Route metric is 4, traffic share count is 1
CSR4#show mpls traffic-eng tunnels tunnel 100 | include Label
  InLabel  :  -
  OutLabel : GigabitEthernet2.549, 9019
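The recursion that produces these imposed label stacks can be modeled as a walk through the routing table. The sketch below is an assumption-laden simplification: the `RIB` dictionary is a hypothetical, hand-built summary of the routes and labels shown in the outputs for CSR2 and CSR4, and implicit-null is treated as "impose nothing".

```python
IMPLICIT_NULL = 3  # reserved label value meaning "do not impose a label"

# Hypothetical summary of the routes above: BGP routes recurse through a
# next hop, while autoroute-announced prefixes resolve to the TE tunnel.
RIB = {
    "CSR2": {
        "7.7.7.7": {"next_hop": "4.4.4.4", "label": 4024},
        "4.4.4.4": {"interface": "Tunnel100", "label": IMPLICIT_NULL},
    },
    "CSR4": {
        "7.7.7.7": {"next_hop": "8.8.8.8", "label": 8002},
        "8.8.8.8": {"interface": "Tunnel100", "label": 9019},
    },
}


def imposed_stack(router, prefix):
    """Recurse until an outgoing interface is found, collecting labels."""
    stack = []
    entry = RIB[router][prefix]
    while True:
        if entry["label"] != IMPLICIT_NULL:
            # Labels learned deeper in the recursion sit higher in the stack.
            stack.insert(0, entry["label"])
        if "interface" in entry:
            return stack, entry["interface"]
        entry = RIB[router][entry["next_hop"]]


for router in ("CSR2", "CSR4"):
    stack, interface = imposed_stack(router, "7.7.7.7")
    print(router, stack, "via", interface)
```

Service labels (L3VPN, L2VPN) are ignored here, just as they are in the text.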


BGP details this operation as well. When label 4024 arrives, it is swapped to another BGP label of 8002, which is CSR8's local label for 7.7.7.7/32. Then, as seen above, traffic to 8.8.8.8 is tunneled inside MPLS using RSVP-bound labels. The LFIB details show these swap and push operations, which occur in that specific sequence. As expected, BGP doesn't have the "full picture" of other things happening beyond its purview, but the LFIB does.
CSR4#show bgp ipv4 unicast 7.7.7.7/32 bestpath
BGP routing table entry for 7.7.7.7/32, version 97
Paths: (2 available, best #2, table default)
  Advertised to update-groups:
     3
  Refresh Epoch 4
  Local
    8.8.8.8 (metric 4) from 8.8.8.8 (8.8.8.8)
      Origin incomplete, metric 0, localpref 100, valid, internal, best
      Originator: 7.7.7.7, Cluster list: 8.8.8.8
      mpls labels in/out 4024/8002
      rx pathid: 0, tx pathid: 0x0
CSR4#show mpls forwarding-table labels 4024 detail
Local      Outgoing   Prefix           Bytes Label   Outgoing   Next Hop
Label      Label      or Tunnel Id     Switched      interface
4024       8002       7.7.7.7/32       41231         Tu100      point2point
        MAC/Encaps=18/26, MRU=1496, Label Stack{9019 8002}, via Gi2.549
        000C29E04F84000C29D8E5F581000DDD8847 0233B00001F42000
        No output feature configured

CSR9 and CSR1 are just P routers and perform ordinary label swap and PHP operations, respectively. The label stack becomes {1004 8002} and {8002} as it transits the core.
CSR9#show mpls forwarding-table labels 9019
Local      Outgoing   Prefix           Bytes Label   Outgoing   Next Hop
Label      Label      or Tunnel Id     Switched      interface
9019       1004       4.4.4.4 100 [1]  97530         Gi2.519    49.1.9.1
CSR1#show mpls forwarding-table labels 1004
Local      Outgoing   Prefix           Bytes Label   Outgoing   Next Hop
Label      Label      or Tunnel Id     Switched      interface
1004       Pop Label  4.4.4.4 100 [1]  95886         Gi2.518    49.1.8.8

CSR8 receives packets with only label 8002. Because CSR8's next-hop is CSR7, which was the original next-hop, the BGP label is removed, which reveals the service label if one exists. Beware of the BGP output; the BGP route is not installed in the RIB, which means CSR8 is learning the route another way. Specifically, it is an OSPF route and has an LDP binding for implicit-null. It just happens to be the same label binding as the BGP route, but technically, the implicit-nulls are different and mean different things about the forwarding path.

CSR8#show bgp ipv4 unicast 7.7.7.7/32 BGP routing table entry for 7.7.7.7/32, version 75 Paths: (1 available, best #1, table default, RIB-failure(17)) Advertised to update-groups: 6 Refresh Epoch 7 Local, (Received from a RR-client) 7.7.7.7 (metric 2) from 7.7.7.7 (7.7.7.7) Origin incomplete, metric 0, localpref 100, valid, internal, best mpls labels in/out 8002/imp-null rx pathid: 0, tx pathid: 0x0 CSR8#show ip route 7.7.7.7 Routing entry for 7.7.7.7/32 Known via "ospf 49", distance 110, metric 2, type intra area Last update from 49.7.8.7 on GigabitEthernet2.578, 3d08h ago Routing Descriptor Blocks: * 49.7.8.7, from 7.7.7.7, 3d08h ago, via GigabitEthernet2.578 Route metric is 2, traffic share count is 1 CSR8#show mpls ldp bindings 7.7.7.7 32 neighbor 7.7.7.7 lib entry: 7.7.7.7/32, rev 22 remote binding: lsr: 7.7.7.7:0, label: imp-null

We can verify the label stack using OAM, ignoring the implicit-nulls that don't belong.
CSR2#traceroute mpls ipv4 7.7.7.7/32 source 2.2.2.2
more work needed here to demux the tfs subtlv and to display the right output
[snip]
Type escape sequence to abort.
  0 49.2.4.2 MRU 1500 [Labels: implicit-null/implicit-null/4024 Exp: 0/0/0]
L 1 49.2.4.4 MRU 1500 [Labels: 9019/8002 Exp: 0/0] 2 ms
L 2 49.4.9.9 MRU 1500 [Labels: 1004/8002 Exp: 0/0] 9 ms
L 3 49.1.9.1 MRU 1500 [Labels: 8002 Exp: 0] 13 ms
L 4 49.1.8.8 MRU 1500 [Labels: implicit-null Exp: 0] 35 ms
! 5 49.7.8.7 21 ms
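When comparing OAM output against the expected stacks, it can help to extract the per-hop labels programmatically. The following is a rough sketch that parses the hop lines of the traceroute above with a regular expression and discards implicit-null entries; the pattern and the `label_stacks` helper are assumptions made for this example and are not tied to any official output grammar.

```python
import re

# Hop lines from the CSR2 MPLS traceroute to 7.7.7.7/32.
TRACE = """\
0 49.2.4.2 MRU 1500 [Labels: implicit-null/implicit-null/4024 Exp: 0/0/0]
L 1 49.2.4.4 MRU 1500 [Labels: 9019/8002 Exp: 0/0] 2 ms
L 2 49.4.9.9 MRU 1500 [Labels: 1004/8002 Exp: 0/0] 9 ms
L 3 49.1.9.1 MRU 1500 [Labels: 8002 Exp: 0] 13 ms
L 4 49.1.8.8 MRU 1500 [Labels: implicit-null Exp: 0] 35 ms
! 5 49.7.8.7 21 ms
"""

# hop number, IPv4 address, then the slash-separated label list
HOP = re.compile(r"(\d+) (\d+\.\d+\.\d+\.\d+) MRU \d+ \[Labels: (\S+) Exp:")


def label_stacks(text):
    """Return (hop, address, labels) tuples, skipping implicit-null entries."""
    hops = []
    for line in text.splitlines():
        match = HOP.search(line)
        if not match:
            continue  # the final egress line carries no label information
        labels = [int(x) for x in match.group(3).split("/") if x != "implicit-null"]
        hops.append((int(match.group(1)), match.group(2), labels))
    return hops


for hop, address, labels in label_stacks(TRACE):
    print(hop, address, labels)
```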

A quick MPLS service check for L3VPN indicates that the tunnel works. The transport labels from LDP, RSVP, and BGP are all the same, with the additional VPN label at the bottom. We confirm this value is 7007 by checking BGP VPNv4, then use traceroute to verify the rest of the label stack. The key point is that targeted LDP was not enabled on either tunnel, yet they were successfully stitched together. Although this tunneling design causes issues with MVPN as described in other sections, it does work well with UMPLS.
CSR2#show bgp vpnv4 unicast vrf U1 1.3.3.3/32
BGP routing table entry for 49:102:1.3.3.3/32, version 128
Paths: (1 available, best #1, table U1)
  Advertised to update-groups:
     1
  Refresh Epoch 2
  65000, imported path from 49:107:1.3.3.3/32 (global)
    7.7.7.7 (metric 3) (via default) from 7.7.7.7 (7.7.7.7)
      Origin incomplete, metric 0, localpref 100, valid, internal, best
      Extended Community: RT:49:107
      mpls labels in/out nolabel/7007
      rx pathid: 0, tx pathid: 0x0
CSR6#traceroute vrf U1 1.3.3.3 source 1.6.6.6
Type escape sequence to abort.
Tracing the route to 1.3.3.3
VRF info: (vrf in name/id, vrf out name/id)
  1 192.168.102.2 1 msec 1 msec 1 msec
  2 49.2.4.4 [MPLS: Labels 4024/7007 Exp 0] 2 msec 2 msec 10 msec
  3 49.4.9.9 [MPLS: Labels 9019/8002/7007 Exp 0] 32 msec 32 msec 32 msec
  4 49.1.9.1 [MPLS: Labels 1004/8002/7007 Exp 0] 32 msec 32 msec 32 msec
  5 49.1.8.8 [MPLS: Labels 8002/7007 Exp 0] 32 msec 31 msec 37 msec
  6 192.168.107.7 [MPLS: Label 7007 Exp 0] 26 msec 27 msec 20 msec
  7 192.168.107.3 20 msec * 3 msec

The loose-expansion tunnel works exactly the same in OSPF as it does for IS-IS. For this reason, we don't have to change the tunnel configuration on CSR5 at all for Tunnel101. The configuration is shown again for reference but remains unchanged.
! CSR5
ip explicit-path name EP_14_10_2_LOOSE enable
 next-address loose 14.14.14.14
 next-address loose 10.10.10.10
 next-address loose 2.2.2.2
interface Tunnel101
 description PATH TO R2 VIA XRV14 AND R10
 ip unnumbered Loopback0
 tunnel mode mpls traffic-eng
 tunnel destination 2.2.2.2
 tunnel mpls traffic-eng autoroute destination
 tunnel mpls traffic-eng priority 7 7
 tunnel mpls traffic-eng bandwidth 5000
 tunnel mpls traffic-eng path-option 10 explicit name EP_14_10_2_LOOSE
 tunnel mpls traffic-eng fast-reroute

First, CSR5 computes a strict path to the local ABR which is XRv14. This uses the directly connected interface between CSR5 and XRv14.


CSR5#show ip rsvp sender detail filter session-type 7 destination 2.2.2.2 | section ERO ERO: (incoming) 5.5.5.5 (Strict IPv4 Prefix, 8 bytes, /32) 49.5.14.14 (Strict IPv4 Prefix, 8 bytes, /32) 14.14.14.14 (Strict IPv4 Prefix, 8 bytes, /32) 10.10.10.10 (Loose IPv4 Prefix, 8 bytes, /32) 2.2.2.2 (Loose IPv4 Prefix, 8 bytes, /32) ERO: (outgoing) 49.5.14.14 (Strict IPv4 Prefix, 8 bytes, /32) 14.14.14.14 (Strict IPv4 Prefix, 8 bytes, /32) 10.10.10.10 (Loose IPv4 Prefix, 8 bytes, /32) 2.2.2.2 (Loose IPv4 Prefix, 8 bytes, /32)

When XRv14 receives the PATH message, it runs the PCALC expansion process to compute a strict path to the first loose hop, which is CSR10. This path routes via CSR1 and CSR8. RP/0/0/CPU0:XRv14#show rsvp sender session-type lsp-p2p destination 2.2.2.2 detail | begin Explicit Explicit Route (Incoming): Strict, 49.5.14.14/32 Strict, 14.14.14.14/32 Loose, 10.10.10.10/32 Loose, 2.2.2.2/32 Explicit Route (Outgoing): Strict, 49.1.14.1/32 Strict, 49.1.8.8/32 Strict, 49.8.10.10/32 Strict, 10.10.10.10/32 Loose, 2.2.2.2/32

When CSR10 receives the PATH message, it also runs the PCALC expansion process to compute a strict path to the first loose hop, which is CSR2.
CSR10#show ip rsvp sender detail filter session-type 7 sender 5.5.5.5 | section ERO
  ERO: (incoming)
    49.8.10.10 (Strict IPv4 Prefix, 8 bytes, /32)
    10.10.10.10 (Strict IPv4 Prefix, 8 bytes, /32)
    2.2.2.2 (Loose IPv4 Prefix, 8 bytes, /32)
  ERO: (outgoing)
    49.2.10.2 (Strict IPv4 Prefix, 8 bytes, /32)
    2.2.2.2 (Strict IPv4 Prefix, 8 bytes, /32)
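The per-ABR expansion seen in these ERO outputs can be sketched as a simple list transformation. This is an illustration only: the `strict_path_to` table is a hypothetical stand-in for the result of the router's own constrained path computation, and the function ignores everything else RSVP-TE does when processing a PATH message.

```python
def expand_ero(ero, local_addresses, strict_path_to):
    """Consume hops owned by this router, then expand the first loose hop.

    ero: list of (kind, address) tuples, kind being "strict" or "loose"
    local_addresses: addresses belonging to the router processing the PATH
    strict_path_to: hypothetical stand-in for the local path computation,
                    mapping a loose-hop address to the strict hops toward it
    """
    hops = list(ero)
    # 1. Remove the leading strict hops that identify this router.
    while hops and hops[0][0] == "strict" and hops[0][1] in local_addresses:
        hops.pop(0)
    # 2. Replace the first loose hop with the locally computed strict path.
    if hops and hops[0][0] == "loose":
        target = hops[0][1]
        hops = [("strict", addr) for addr in strict_path_to[target]] + hops[1:]
    return hops


# XRv14: incoming ERO from CSR5, expanded toward the loose hop CSR10.
xrv14_out = expand_ero(
    [("strict", "49.5.14.14"), ("strict", "14.14.14.14"),
     ("loose", "10.10.10.10"), ("loose", "2.2.2.2")],
    {"49.5.14.14", "14.14.14.14"},
    {"10.10.10.10": ["49.1.14.1", "49.1.8.8", "49.8.10.10", "10.10.10.10"]},
)

# CSR10: incoming ERO after transit hops consumed theirs, expanded toward CSR2.
csr10_out = expand_ero(
    [("strict", "49.8.10.10"), ("strict", "10.10.10.10"), ("loose", "2.2.2.2")],
    {"49.8.10.10", "10.10.10.10"},
    {"2.2.2.2": ["49.2.10.2", "2.2.2.2"]},
)

print(xrv14_out)
print(csr10_out)
```

Both results can be compared line by line with the "outgoing" ERO sections displayed by XRv14 and CSR10.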

Since each router in the path appended its local interface to the RSVP PATH RRO, CSR2 can see the full path back to source, to include inter-area hops. The RRO also shows available TE-FRR protection along the way, which is something normally the tail-end does not see. These TE-FRR backup tunnels were configured in the previous section and remain intact here. No labels are included in the PATH RRO since there is no MPLS forwarding along this path (it is not an LSP), which is different than the RESV RRO also shown below. This is the RRO we are used to seeing, which includes the per-hop RSVP-allocated labels along with any protection capabilities along the LSP. The labels, in sequence, are 94018 > 1032 > 8006 > 10021 > implicit-null.
CSR2#show ip rsvp sender detail filter session-type 7 sender 5.5.5.5 | section RRO
  RRO:
    49.2.10.10/32, Flags:0x0 (No Local Protection)
    49.8.10.8/32, Flags:0xD (Local Prot Avail/Has BW/to NNHOP)
    49.1.8.1/32, Flags:0x5 (Local Prot Avail/Has BW/to NHOP)
    49.1.14.14/32, Flags:0x0 (No Local Protection)
    49.5.14.5/32, Flags:0x0 (No Local Protection)
CSR5#show ip rsvp reservation detail filter session-type 7 destination 2.2.2.2 | section RRO
  RRO:
    14.14.14.14/32, Flags:0x20 (No Local Protection, Node-id)
      Label subobject: Flags 0x1, C-Type 1, Label 94018
    49.5.14.14/32, Flags:0x0 (No Local Protection)
      Label subobject: Flags 0x1, C-Type 1, Label 94018
    1.1.1.1/32, Flags:0x25 (Local Prot Avail/Has BW/to NHOP, Node-id)
      Label subobject: Flags 0x1, C-Type 1, Label 1032
    49.1.8.1/32, Flags:0x5 (Local Prot Avail/Has BW/to NHOP)
      Label subobject: Flags 0x1, C-Type 1, Label 1032
    8.8.8.8/32, Flags:0x2D (Local Prot Avail/Has BW/to NNHOP, Node-id)
      Label subobject: Flags 0x1, C-Type 1, Label 8006
    49.8.10.8/32, Flags:0xD (Local Prot Avail/Has BW/to NNHOP)
      Label subobject: Flags 0x1, C-Type 1, Label 8006
    10.10.10.10/32, Flags:0x20 (No Local Protection, Node-id)
      Label subobject: Flags 0x1, C-Type 1, Label 10021
    49.2.10.10/32, Flags:0x0 (No Local Protection)
      Label subobject: Flags 0x1, C-Type 1, Label 10021
    2.2.2.2/32, Flags:0x20 (No Local Protection, Node-id)
      Label subobject: Flags 0x1, C-Type 1, Label 3
    49.2.10.2/32, Flags:0x0 (No Local Protection)
      Label subobject: Flags 0x1, C-Type 1, Label 3
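The hexadecimal RRO flag values in these outputs are bit fields. The sketch below decodes them using the bit assignments as I understand them from RFC 4090 (protection bits) and RFC 4561 (node-id bit); the dictionary and function names are hypothetical, and the wording of each flag is mine rather than the router's.

```python
# Bit assignments assumed from RFC 4090 (0x01-0x08) and RFC 4561 (0x20).
RRO_FLAGS = {
    0x01: "local protection available",
    0x02: "local protection in use",
    0x04: "bandwidth protection",
    0x08: "node protection",
    0x20: "node-id",
}


def decode_rro_flags(value):
    """Return the names of all flag bits set in an RRO subobject."""
    return [name for bit, name in RRO_FLAGS.items() if value & bit]


# Flag values that appear in the PATH and RESV RROs above.
for flags in (0x0, 0x5, 0xD, 0x20, 0x25, 0x2D):
    print(f"{flags:#04x}: {decode_rro_flags(flags) or ['no local protection']}")
```

With the node protection bit clear, the protection is toward the next hop (NHOP); with it set, the backup bypasses the next node (NNHOP), which lines up with the text the routers print next to each value.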

We quickly verify the LSP is operational using OAM from CSR5 and verify the labels at the same time. Notice that this technique, unlike tunnel stitching, does not rely on an additional (third) BGP label in the stack to achieve scalability. An end-to-end TE design like this defeats the purpose of UMPLS because it scales poorly, but is a valid technique. CSR5#traceroute mpls traffic-eng tunnel 101 Tracing MPLS TE Label Switched Path on Tunnel101, timeout is 2 seconds [snip]


Type escape sequence to abort. 0 49.5.14.5 MRU 1500 [Labels: 94018 Exp: 0] L 1 49.5.14.14 MRU 1500 [Labels: 1032 Exp: 0] 2 ms L 2 49.1.14.1 MRU 1500 [Labels: 8006 Exp: 0] 56 ms L 3 49.1.8.8 MRU 1500 [Labels: 10021 Exp: 0] 26 ms L 4 49.8.10.10 MRU 1500 [Labels: implicit-null Exp: 0] 24 ms ! 5 49.2.10.2 40 ms

We also use traceroute within the VPN on XRv11 to test VPNv6 reachability to CSR6. CSR5 selects the path via CSR2 as best due to having a lower originator ID, so the VPN label of 2004 is used. The top-most transport labels are the same RSVP-allocated labels we verified with OAM.
CSR5#show bgp vpnv6 unicast vrf U2 ::2:6:6:6/128
BGP routing table entry for [49:205]::2:6:6:6/128, version 470
Paths: (2 available, best #1, table U2)
  Advertised to update-groups:
     1
  Refresh Epoch 2
  65000, imported path from [49:202]::2:6:6:6/128 (global)
    ::FFFF:2.2.2.2 (via default) from 7.7.7.7 (7.7.7.7)
      Origin incomplete, metric 0, localpref 100, valid, internal, best
      Extended Community: RT:49:202
      Originator: 2.2.2.2, Cluster list: 7.7.7.7
      mpls labels in/out nolabel/2004
      rx pathid: 0, tx pathid: 0x0
  Refresh Epoch 2
  65000, imported path from [49:204]::2:6:6:6/128 (global)
    ::FFFF:4.4.4.4 (metric 2) (via default) from 7.7.7.7 (7.7.7.7)
      Origin incomplete, metric 0, localpref 100, valid, internal
      Extended Community: RT:49:204
      Originator: 4.4.4.4, Cluster list: 7.7.7.7
      mpls labels in/out nolabel/4006
      rx pathid: 0, tx pathid: 0
RP/0/0/CPU0:XRv11#traceroute vrf U2 ::2:6:6:6 source ::2:11:11:11
Type escape sequence to abort.
Tracing the route to ::2:6:6:6
 1  fd00:192:168:205::5 19 msec 9 msec 0 msec
 2  2049:49:5:14::14 [MPLS: Labels 94018/2004 Exp 0] 59 msec 39 msec 39 msec
 3  ::ffff:49.1.14.1 [MPLS: Labels 1032/2004 Exp 0] 39 msec 39 msec 39 msec
 4  ::ffff:49.1.8.8 [MPLS: Labels 8006/2004 Exp 0] 39 msec 39 msec 39 msec
 5  ::ffff:49.8.10.10 [MPLS: Labels 10021/2004 Exp 0] 39 msec 39 msec 39 msec
 6  fd00:192:168:202::2 [MPLS: Label 2004 Exp 0] 39 msec 39 msec 39 msec
 7  fd00:192:168:202::6 49 msec 59 msec 49 msec


We saw earlier that this inter-area tunnel was protected by FRR as evidenced by the PATH and RESV RROs. CSR1 offers NHOP protection and is protecting the LSP with in-label 1032. It provides 7 Mbps of backup bandwidth from any pool.
CSR1#show mpls traffic-eng fast-reroute database labels 1032 | begin P2P LSP
P2P LSP midpoint frr information:
LSP identifier               In-label Out intf/label   FRR intf/label   Status
---------------------------- -------- ---------------- ---------------- ------
5.5.5.5 101 [81]             1032     Gi2.518:8006     Tu110:8006       ready
CSR1#show mpls traffic-eng tunnels tunnel 110 backup
INTER-AREA NHOP PROTECTION
  LSP Head, Admin: up, Oper: up
  Tun ID: 110, LSP ID: 66, Source: 1.1.1.1
  Destination: 8.8.8.8
  Fast Reroute Backup Provided:
    Protected i/fs: Gi2.518
    Protected LSPs/Sub-LSPs: 1, Active: 0
    Backup BW: any pool; limit: 7000 kbps, inuse: 5000 kbps (BWP inuse: 0 kbps)
    Backup flags: 0x0

Additionally, CSR8 provides NNHOP protection to completely avoid CSR10. This is also an inter-area tunnel using loose-hop expansion, which is valid and was verified in detail earlier. The protected LSP has a local label of 8020 and since the NNHOP is the tunnel destination, the FRR label becomes implicit-null since there isn't a need to instruct the merge point to forward the packet further (MP is the tail end). We can also see the backup bandwidth which is available for the protected LSP. CSR8#show mpls traffic-eng fast-reroute database labels 8020 | begin P2P LSP P2P LSP midpoint frr information: LSP identifier In-label Out intf/label FRR intf/label Status -------------------------- -------------------------------5.5.5.5 101 [83] 8020 Gi2.580:10012 Tu111:implicit-n ready CSR8#show mpls traffic-eng tunnels tunnel 111 backup INTER_AREA NNHOP PROTECTION LSP Head, Admin: up, Oper: up Tun ID: 111, LSP ID: 195, Source: 8.8.8.8 Destination: 2.2.2.2 Fast Reroute Backup Provided: Protected i/fs: Gi2.580 Protected LSPs/Sub-LSPs: 1, Active: 0 Backup BW: any pool; limit: 9000 kbps, inuse: 5000 kbps (BWP inuse: 0 kbps) Backup flags: 0x0

Additional Reading – Reference configurations "umpls-ospf"

34. Describe, implement, and troubleshoot LISP Locator/ID Separation Protocol (LISP) is less of a routing protocol and more of an dynamic tunneling protocol, somewhat similar to MPLS and mGRE encapsulations. It uses UDP encapsulation on port 4341 for data traffic and UDP port 4342 for LISP control traffic. It was developed by Cisco and is currently an experimental RFC within the IETF (RFC 6830). The basic premise of LISP is, as the name states, to separate WHERE a host is from WHO is it. IP routing relies on a host being part of a network (subnet) which is both its location on the network and its identity concurrently. LISP does not re-invert IP routing nor does it advertise IP routes. It uses routing locations (RLOC) to specify WHERE a host is and endpointIDs (EID) to specify WHO the host is. This is basic tunneling; the RLOCs are the outer encapsulation in the tunnel and are expected to be reachable via some existing IP routing protocol, just like normal IP tunneling mechanisms. The EIDs don't have to be routed between sites with a dedicated routing protocols provided they are registered to a central location where LISP speakers can determine where the EIDs are. This EID resolution is similar to DNS where a set of central devices can map EIDs to RLOCs. LISP decouples the roles of resolving EID-to-RLOC mappings and mapping registration into two components: MS: The central mechanism used for RLOC/EID registration is known as a map server (MS). ETRs (discussed later) register their EID-to-RLOC mappings with the MS. This process is loosely analogous to the NHRP registration process of mapping overlay (VPN tunnel) addresses to underlay (NBMA) addresses. Map-requests are forwarded from the MR (discussed later) to the MS, who will perform an EID-database lookup. Assuming there is a match, it will forward the map-request directly to the ETR who registered the EID; the ETR is responsible for sending the map-reply to the original ITR. 
MR: The map resolver (MR) is the target of all map-request messages. When an ITR (discussed later) sends a map-request to the MR, the request is forwarded to the MS over the ALT (discussed later). The MR does very little work in that its entire purpose is to receive map-requests and forward them to the MS, which actually has the mappings. In order to forward map-requests to the MS, the MR needs to have BGP routes inside the ALT for the EID destinations; this is how the MR knows which MS has the correct mappings if multiple MS devices exist.

ALT: Not a component of LISP so much as a virtual network used for inter-MR/MS communication. It relies on VRFs, GRE, and BGP so that the MR and MS can exist on separate devices. Although uncommon, the ALT would be required when the MR and MS are separated, which might be good for availability/scalability.

The MR and MS roles are commonly combined on a single router for simplicity, known as the MR/MS.

Two types of LISP tunneling routers are used at the customer sites for encapsulation and decapsulation:

ITR: An ingress tunnel router (ITR) communicates with the MR to find EID-to-RLOC mappings based on traffic being sent towards the provider (that is to say, traffic outbound from the customer). The ITR role is typically assigned to a CE and runs on a Cisco ISR or ASR, for example. When a packet arrives on the LAN side of an ITR (inside EID space), the ITR sends a map-request to the MR. As discussed earlier, the MR forwards the map-request to the MS. If the MS has a mapping for this EID, the map-request is forwarded to the ETR that registered the EID. If not, the MS sends a map-reply that indicates that the destination of the original packet is not an EID (no mapping). The ITR then creates a negative cache entry to blackhole traffic.

1870 © 2016 Nicholas J. Russo

ETR: An egress tunnel router (ETR) communicates with the MS to register EID-to-RLOC mappings. These mappings are sent in map-register messages. When LISP-encapsulated packets arrive on an ETR destined to one of its RLOCs (incoming traffic from the provider), the ETR verifies that the inner header is destined to one of its registered EIDs and forwards the packet natively on the LAN side (inside EID space). Upon receiving a forwarded map-request from the MS, it verifies that it has the EID being requested (it should, otherwise the MS wouldn't have sent it, but this is a sanity check). It replies directly to the ITR that originated the map-request using the ITR RLOCs listed in the message.

The ITR and ETR roles are commonly combined on a single router for simplicity, known as the xTR.

To enable communication with non-LISP domains, two additional LISP router variants are introduced:

PITR: The Proxy ITR performs ITR functions on behalf of non-LISP sites. These are typically deployed at Internet exchange points to provide interworking from non-LISP to LISP sites. Non-LISP sites will route traffic towards the PITR for encapsulation and forwarding to LISP sites. For scalability, aggregation should be used so that the PITR does not have to advertise many specific routes. No additional configuration is needed on the ITRs when the PITR is deployed.

PETR: The Proxy ETR performs ETR functions on behalf of LISP sites.
The PETR is deployed similarly to the PITR, except that it is used to provide interworking from LISP to non-LISP. This would normally be used when the SP will not allow private (EID) space to be routed over its core. ETRs must specify the PETR address to use the PETR service.

The PITR and PETR roles are commonly combined on a single router for simplicity, known as the PxTR.

LISP supports two main variations of virtualization:

Shared Model: EID spaces exist in varying VRFs with a single upstream RLOC table. For example, CSR9 has its RLOC in global space while its EIDs are in varying VRFs. Each EID VRF has its own instance ID (IID) specified on the ETR and MS so that the EID-to-RLOC mappings are kept separate. Just like BGP VPNv4/v6, overlap can exist in different VRFs, so this IID has a similar function to the BGP RD. CSR10 also uses this model except the RLOC is in a VRF, much like a front-door VRF (FVRF) model commonly used in overlay networking.


Parallel Model: EID spaces are in N VRFs with N upstream RLOCs which correspond to each different VRF. This is a more traditional model of splitting the router into VRFs and not having any inter-VRF communication between EID and RLOC namespaces. This generally means a different layer 3 interface per EID VRF towards the provider. This is implemented on CSR2, where there are two subinterfaces in two different VRFs to service two different EID spaces. This also requires two separate LISP processes.

The network diagram below shows an SP core inside AS 41 that provides IPv4/v6 transport between the PE-CE connected routes only. There is no PE-CE protocol exchange anywhere. There are three customer sites that communicate using LISP (CSR10, CSR9, and CSR2). These sites use a variety of the LISP virtualization models described above; many of the test endpoints are loopbacks on these routers for simplicity. CSR3 is an MR/MS and CSR4 is an MR only; the two communicate over an ALT. CSR5 is a public Internet site which uses the PxTR configured on CSR6 to communicate to and from the LISP customer sites. To test multicast, CSR1 is an ordinary end host and multicast source. CSR1 is also used to test mobility by moving between LANs hosted on CSR10 (home) and CSR9 (foreign); this particular feature is deemed out of scope for CCIE SP and is not tested here.
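The control-plane roles described above can be condensed into a toy model. The following is a minimal Python sketch, not how IOS implements LISP: the class and function names (`MapServer`, `resolve`) are hypothetical, the MR, ALT, and ETR hops are collapsed into a single function call, and the registrations are copied from the lab topology. It illustrates two ideas from the text: mappings are keyed by (IID, EID prefix) so overlapping prefixes in different instance IDs stay separate, and a lookup with no match produces a negative result.

```python
import ipaddress


class MapServer:
    """Toy map server: EID-to-RLOC registrations stored per instance ID (IID)."""

    def __init__(self):
        self.db = {}  # (iid, ip_network) -> list of RLOC strings

    def register(self, iid, eid_prefix, rlocs):
        """What an ETR's map-register achieves: store the mapping under its IID."""
        self.db[(iid, ipaddress.ip_network(eid_prefix))] = list(rlocs)

    def lookup(self, iid, eid):
        """Return (prefix, rlocs) for the longest matching registration, or None."""
        addr = ipaddress.ip_address(eid)
        matches = [
            (net, rlocs)
            for (entry_iid, net), rlocs in self.db.items()
            if entry_iid == iid and net.version == addr.version and addr in net
        ]
        if not matches:
            return None
        return max(matches, key=lambda match: match[0].prefixlen)


def resolve(ms, iid, eid):
    """ITR -> MR -> MS -> ETR -> ITR, collapsed into one call for illustration."""
    hit = ms.lookup(iid, eid)
    if hit is None:
        # No registration: the ITR would install a negative cache entry.
        return {"eid": eid, "action": "negative", "rlocs": []}
    prefix, rlocs = hit
    return {"eid": str(prefix), "action": "encap", "rlocs": rlocs}


ms = MapServer()
ms.register(0, "10.9.9.9/32", ["10.7.9.9"])
ms.register(0, "10.10.10.10/32", ["10.7.10.10"])
# Same EID prefix in a different IID does not collide with the IID 0 entry.
ms.register(102, "10.10.10.10/32", ["10.7.10.10", "2001:10:7:10::10"])

print(resolve(ms, 0, "10.9.9.9"))
print(resolve(ms, 0, "192.0.2.1"))
```

Keying on the IID plays the same role here that the RD plays for VPNv4/v6 routes: the prefix alone is not unique, but the (IID, prefix) pair is.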

The relevant configurations for CSR9 and CSR10 are shown below. We will begin with the basic components to understand the control-plane mechanisms. CSR9 has both RLOC and EIDs in the global table (no virtualization yet) while CSR10 has the RLOC in a front-door VRF (shared virtualization model).

Each of them registers its loopback with the MS, which is 10.3.6.3 (CSR3). The MR is 10.4.6.4 (CSR4), which is separated from the MS.

! CSR9
interface GigabitEthernet2.579
 ip address 10.7.9.9 255.255.255.0
router lisp 1
 eid-table default instance-id 0
  database-mapping 10.9.9.9/32 IPv4-interface GigabitEthernet2.579 priority 10 weight 10
 ipv4 itr map-resolver 10.4.6.4
 ipv4 etr map-server 10.3.6.3 key CSR9
ip route 0.0.0.0 0.0.0.0 10.7.9.7

! CSR10
interface GigabitEthernet2.570
 vrf forwarding UP
 ip address 10.7.10.10 255.255.255.0
router lisp 1
 locator-table vrf UP
 eid-table default instance-id 0
  database-mapping 10.10.10.10/32 IPv4-interface GigabitEthernet2.570 priority 10 weight 10
 ipv4 itr map-resolver 10.4.6.4
 ipv4 etr map-server 10.3.6.3 key CSR10
ip route vrf UP 0.0.0.0 0.0.0.0 10.7.10.7

Before any traffic can flow, the ETRs must register their EID-to-RLOC mappings with the MS (CSR3). Both CSR10 and CSR9 must do this; notice that the map-register message also contains the IID so the MS knows in which table to store the mappings. This IID serves as a method to distinguish EID routes between customers, much like the BGP RD does for MPLS VPNs. IID 0 represents the global table, which is how we can sort through the debugs to eliminate other duplicate prefixes of 10.10.10.10/32 and 10.9.9.9/32 in other VRFs. LISP debugs are some of the best debugs in IOS; they are very detailed with many granular options. Notice that the map-register messages use UDP port 4342, as this port is used only for LISP signaling. This registration occurs once every 60 seconds.

! CSR9
CSR9#debug lisp control-plane etr-map-server
LISP-1: IPv4 Map Server IID 0 10.3.6.3, Sending map-register (src_rloc 10.7.9.9) nonce 0x960C2F3D-0xB25541FE.

! CSR3
CSR3#debug lisp control-plane map-server-registration
LISP: Processing received Map-Register(3) message on GigabitEthernet2.536 from 10.7.10.10:4342 to 10.3.6.3:4342
LISP: Processing Map-Register no proxy, map-notify, no merge, no security, no mobile-node, not to-RTR, no fast-map-register, no EID-notify, ID-included, 1 record, nonce 0xE9B119D3-0xA5B3EAFD, key-id 1, auth-data-len 20, hash-function sha1, xTR-ID 0x86671414-0xB88BC82F-0x428A5BA4-0xB567A9B7, site-ID unspecified
LISP: Processing Map-Register mapping record for IID 0 10.10.10.10/32, ttl 1440, action none, authoritative, 1 locator
  10.7.10.10 pri/wei=10/10 LpR
LISP-0: MS registration IID 0 prefix 10.10.10.10/32 10.7.10.10 site CSR10, Updating.
LISP: Processing received Map-Register(3) message on GigabitEthernet2.536 from 10.7.9.9:4342 to 10.3.6.3:4342
LISP: Processing Map-Register no proxy, map-notify, no merge, no security, no mobile-node, not to-RTR, no fast-map-register, no EID-notify, ID-included, 1 record, nonce 0x960C2F3D-0xB25541FE, key-id 1, auth-data-len 20, hash-function sha1, xTR-ID 0xDF137D40-0xE6306DF2-0xB055C530-0xD3D28EB2, site-ID unspecified
LISP: Processing Map-Register mapping record for IID 0 10.9.9.9/32, ttl 1440, action none, authoritative, 1 locator
  10.7.9.9 pri/wei=10/10 LpR
LISP-0: MS registration IID 0 prefix 10.9.9.9/32 10.7.9.9 site CSR9, Updating.
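The debug above reports "hash-function sha1" with "auth-data-len 20", and the ETR "key" must match the "authentication-key" configured for the site on the MS. The following Python sketch only illustrates why a mismatched key fails and why the authentication data is 20 bytes long (the size of a SHA-1 digest). It is an assumption-laden simplification: the message bytes are a placeholder, not the real Map-Register wire format, and the function names are hypothetical.

```python
import hashlib
import hmac


def sign_register(shared_key: str, message: bytes) -> bytes:
    """ETR side: compute HMAC-SHA1 authentication data over the message."""
    return hmac.new(shared_key.encode(), message, hashlib.sha1).digest()


def ms_accepts(site_key: str, message: bytes, auth_data: bytes) -> bool:
    """MS side: recompute the HMAC with the site's configured key and compare."""
    expected = sign_register(site_key, message)
    return hmac.compare_digest(expected, auth_data)


# Placeholder payload; the real Map-Register layout is not reproduced here.
msg = b"placeholder map-register for IID 0 10.9.9.9/32"
auth = sign_register("CSR9", msg)

print(len(auth))                       # SHA-1 digest size in bytes
print(ms_accepts("CSR9", msg, auth))   # key matches the site definition
print(ms_accepts("CSR10", msg, auth))  # wrong key: counted as an authentication failure
```

When troubleshooting, a key mismatch of this kind would be expected to show up in the "Authentication failures" counter of the site registration output rather than as a registered prefix.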

A summary view of this IID 0 table shows the following registered prefixes with the MS. Looking into the details of 10.9.9.9/32, most of this information is the same as what we saw in the debug message. This is useful for troubleshooting more advanced LISP functionality like PxTR, load sharing, and dynamic EID creation. The priority/weight values are discussed later.

CSR3#show lisp site instance-id 0
LISP Site Registration Information
* = Some locators are down or unreachable

Site Name      Last      Up     Who Last             Inst     EID Prefix
               Register         Registered           ID
CSR10          00:00:11  yes    10.7.10.10           0        10.10.10.10/32
CSR9           00:00:01  yes    10.7.9.9             0        10.9.9.9/32

CSR3#show lisp site 10.9.9.9/32 instance-id 0
LISP Site Registration Information

Site name: CSR9
Allowed configured locators: any
Requested EID-prefix:
  EID-prefix: 10.9.9.9/32 instance-id 0
    First registered:     3d00h
    Routing table tag:    0
    Origin:               Configuration
    Merge active:         No
    Proxy reply:          No
    TTL:                  1d00h
    State:                complete
    Registration errors:
      Authentication failures:   0
      Allowed locators mismatch: 0
    ETR 10.7.9.9, last registered 00:00:12, no proxy-reply, map-notify
                  TTL 1d00h, no merge, hash-function sha1, nonce 0x960C2F3D-0xB25541FE
                  state complete, no security-capability
                  xTR-ID 0xDF137D40-0xE6306DF2-0xB055C530-0xD3D28EB2
                  site-ID unspecified
      Locator   Local  State      Pri/Wgt  Scope
      10.7.9.9  yes    up          10/10   IPv4 none

Because CSR3 is not an MR for IID 0, it needs to somehow receive map-requests from CSR4, which is the MR for IID 0. This is done using the ALT; there is a P2P GRE tunnel between CSR3 and CSR4 inside of a VRF called LISPALT. BGP is running inside this VRF to connect the two sites using the IPv4 AF within the VRF. The significant configuration snippets are shown below and are explained next.

! CSR3
interface Tunnel34
 vrf forwarding LISPALT
 ip address 10.3.4.3 255.255.255.0
 ipv6 address FE80::3 link-local
 ipv6 address 2001:10:3:4::3/64
 tunnel source 10.3.6.3
 tunnel destination 10.4.6.4
router lisp
 eid-table default instance-id 0
  ipv4 alt-vrf LISPALT
  ipv6 alt-vrf LISPALT
 site CSR10
  authentication-key CSR10
  eid-prefix 10.10.10.10/32
 site CSR9
  authentication-key CSR9
  eid-prefix 10.9.9.9/32
router bgp 34
 address-family ipv4 vrf LISPALT
  redistribute lisp
  neighbor 10.3.4.4 remote-as 34
  neighbor 10.3.4.4 activate
 address-family ipv6 vrf LISPALT
  redistribute lisp
  neighbor 2001:10:3:4::4 remote-as 34
  neighbor 2001:10:3:4::4 activate

! CSR4
interface Tunnel34
 vrf forwarding LISPALT
 ip address 10.3.4.4 255.255.255.0
 ipv6 address FE80::4 link-local
 ipv6 address 2001:10:3:4::4/64
 tunnel source 10.4.6.4
 tunnel destination 10.3.6.3
router lisp
 ipv4 map-resolver
 ipv4 alt-vrf LISPALT
router bgp 34
 address-family ipv4 vrf LISPALT
  neighbor 10.3.4.3 remote-as 34
  neighbor 10.3.4.3 activate
 address-family ipv6 vrf LISPALT
  neighbor 2001:10:3:4::3 remote-as 34
  neighbor 2001:10:3:4::3 activate

When CSR3 receives map-registrations, it originates an exact-match route into the ALT for advertisement to the MRs. CSR3 has originated routes for both 10.9.9.9/32 and 10.10.10.10/32 as LISP routes with AD 10 pointing to Null0. In this way, they are present in the RIB for the ALT VRF. This allows them to be redistributed into BGP for advertisement over the ALT. The routes don’t direct traffic but are important to facilitate communication between disparate MR and MS devices.

CSR3#show vrf LISPALT
  Name                             Default RD            Protocols   Interfaces
  LISPALT                          34:3                  ipv4,ipv6   Tu34

CSR3#show ip route vrf LISPALT lisp | begin Gateway
Gateway of last resort is not set

      10.0.0.0/8 is variably subnetted, 4 subnets, 2 masks
l        10.9.9.9/32 [10/1], 3d00h, Null0
l        10.10.10.10/32 [10/1], 4d19h, Null0


CSR3#show bgp vpnv4 unicast vrf LISPALT | begin Network
     Network          Next Hop            Metric LocPrf Weight Path
Route Distinguisher: 34:3 (default for vrf LISPALT)
 *>  10.9.9.9/32      0.0.0.0                  1         32768 ?
 *>  10.10.10.10/32   0.0.0.0                  1         32768 ?

CSR4 is the MR for IID 0 and needs to know where to forward map-requests it receives from ITRs. Without these BGP routes, the MR has no idea where to forward the map-requests and the ITR will never be able to LISP-encapsulate any packets. A quick check on CSR4 shows that the routes were learned from CSR3 via iBGP through the P2P GRE tunnel.

CSR4#show vrf LISPALT
  Name                             Default RD            Protocols   Interfaces
  LISPALT                          34:4                  ipv4,ipv6   Tu34

CSR4#show bgp vpnv4 unicast vrf LISPALT | begin Network
     Network          Next Hop            Metric LocPrf Weight Path
Route Distinguisher: 34:4 (default for vrf LISPALT)
 *>i 10.9.9.9/32      10.3.4.3                 1    100      0 ?
 *>i 10.10.10.10/32   10.3.4.3                 1    100      0 ?
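The MR's use of these routes amounts to an ordinary longest-prefix lookup on the requested EID inside the ALT VRF. The Python sketch below mimics that decision using the two BGP routes shown on CSR4; the table and function names are hypothetical and the model ignores everything else BGP does.

```python
import ipaddress

# ALT BGP table as seen on CSR4: EID prefix -> BGP next hop (CSR3's tunnel address).
ALT_ROUTES = {
    ipaddress.ip_network("10.9.9.9/32"): "10.3.4.3",
    ipaddress.ip_network("10.10.10.10/32"): "10.3.4.3",
}


def alt_next_hop(eid, routes=ALT_ROUTES):
    """Return the ALT next hop (an MS) for a requested EID, or None if unknown."""
    addr = ipaddress.ip_address(eid)
    candidates = [net for net in routes if addr in net]
    if not candidates:
        return None  # the MR knows of no MS that is authoritative for this EID
    best = max(candidates, key=lambda net: net.prefixlen)
    return routes[best]


print(alt_next_hop("10.9.9.9"))  # map-request is forwarded towards CSR3 over Tunnel34
print(alt_next_hop("10.1.1.1"))  # no ALT route for this EID
```

With several MS devices advertising different EID ranges into the ALT, the same lookup would select the MS that originated the most specific covering route.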

Now, we will send traffic from CSR10 to CSR9. LISP control-plane debugging is enabled on each device so each step can be visualized. First, traffic is sourced from EID space on CSR10 destined to EID space on CSR9. As an ITR, CSR10 issues a map-request to the MR, which is CSR4 (10.4.6.4). CSR10 has no idea how to reach 10.9.9.9/32 as there is no EID-to-RLOC mapping for it currently. The ITR also includes all of its RLOCs in this message (10.7.10.10 and 2001:10:7:10::10), which is why ITR-RLOCs reads as 2. This is included so the ETR knows how to respond to the map-request. Notice the timestamp in the comment.

! CSR10, time 15:10:58.345
LISP-1: IID 0 Request processing of remote EID prefix map requests to IPv4.
LISP: Send map request type remote EID prefix
LISP: Send map request for EID prefix IID 0 10.9.9.9/32
LISP-1: Remote EID IID 0 prefix 10.9.9.9/32, Send map request (1) (sources: , state: incomplete, rlocs: 0).
LISP-1: EID-AF IPv4, Sending map-request from 10.9.9.9 to 10.9.9.9 for EID 10.9.9.9/32, ITR-RLOCs 2, nonce 0x5AB0F557-0xEEBD265C (encap src 10.7.10.10, dst 10.4.6.4).

When the MR (CSR4) receives the map-request from the ITR (CSR10), it immediately forwards it towards the MS (CSR3) inside the ALT. The reason the MR even knows about the EID prefix 10.9.9.9/32 in the first place is because of BGP in the ALT, which we saw earlier. Without this BGP route, the MR cannot resolve the request as it is not aware of any map-servers that are authoritative for this EID. The debug says it is forwarding the map-request “to 10.9.9.9”, but resolving this recursively really means 10.3.4.3, as the BGP routes above reveal.


! CSR4
LISP: Processing received Map-Request(1) message on GigabitEthernet2.546 from 10.9.9.9:4342 to 10.9.9.9:4342
LISP: Received map request for IID 0 10.9.9.9/32, source_eid IID 0 10.10.10.10, ITR-RLOCs: 10.7.10.10 2001:10:7:10::10, records 1, nonce 0x5AB0F557-0xEEBD265C
LISP-0: AF IID 0 IPv4, Forwarding map request to 10.9.9.9 on the ALT.

When the MS receives it from the MR, it checks its IID 0 database and sees RLOC 10.7.9.9 for the EID prefix 10.9.9.9/32. This is the RLOC that was included in the map-register message initially sent to the MS by CSR9 (as an ETR). The MS forwards the map-request down to the ETR that sent the map-register, which is CSR9, using the ETR RLOC specified in the database.

! CSR3
LISP: Processing received Map-Request(1) message on Tunnel34 from 10.9.9.9:4342 to 10.9.9.9:4342
LISP: Received map request for IID 0 10.9.9.9/32, source_eid IID 0 10.10.10.10, ITR-RLOCs: 10.7.10.10 2001:10:7:10::10, records 1, nonce 0x5AB0F557-0xEEBD265C
LISP-0: MS EID IID 0 prefix 10.9.9.9/32 site CSR9, Forwarding map request to ETR RLOC 10.7.9.9.

When the ETR receives it from the MS, the ETR verifies that this map-request is actually for an EID that it has locally registered and was destined to a valid RLOC. It sends a map-reply directly to the "better" ITR RLOC, which is determined by priorities. Prioritization is discussed later, but in this case it selected the IPv4 RLOC over the IPv6 RLOC.

! CSR9
LISP: Processing received Map-Request(1) message on GigabitEthernet2.579 from 10.9.9.9:4342 to 10.9.9.9:4342
LISP: Received map request for IID 0 10.9.9.9/32, source_eid IID 0 10.10.10.10, ITR-RLOCs: 10.7.10.10 2001:10:7:10::10, records 1, nonce 0x5AB0F557-0xEEBD265C
LISP: Processing map request record for EID prefix IID 0 10.9.9.9/32
LISP-1: Sending map-reply from 10.7.9.9 to 10.7.10.10.

When the ITR receives the final map-reply from the ETR, it effectively learns the remote RLOC (LISP tunnel destination) it needs to use for encapsulation. Most of the debug below includes low-level table updates, but the first few lines show the remote RLOC clearly. The RTT is also shown, and although I have removed the timestamps, the entire process from ITR map-request to ETR map-reply was exactly 26 ms. I included the timestamp in the comment below for comparison against the timestamp shown above.

! CSR10, time 15:10:58.371
LISP: Received map reply nonce 0x5AB0F557-0xEEBD265C, records 1
LISP: Processing Map-Reply mapping record for IID 0 10.9.9.9/32, ttl 1440, action none, authoritative, 1 locator
  10.7.9.9 pri/wei=10/10 LpR
LISP-1: Map Request IID 0 prefix 10.9.9.9/32 remote EID prefix[LL], Received reply with rtt 26ms.
LISP: Processing mapping information for EID prefix IID 0 10.9.9.9/32
LISP-1: Remote EID IID 0 prefix 10.9.9.9/32, Change state to reused (sources: , state: incomplete, rlocs: 0).
LISP-1: Remote EID IID 0 prefix 10.9.9.9/32, Starting idle timer (delay 00:02:30) (sources: , state: reused, rlocs: 0).
LISP-1: AF IID 0 IPv4, Persistent db: ignore writing request, disabled.
LISP-1: Remote EID IID 0 prefix 10.9.9.9/32, Change state to complete (sources: , state: reused, rlocs: 0).
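The 26 ms figure can be cross-checked from the two timestamps noted in the debug comments (map-request sent at 15:10:58.345, map-reply received at 15:10:58.371). A minimal Python sketch, with a hypothetical helper name and assuming both timestamps fall on the same day:

```python
from datetime import datetime, timedelta


def rtt_ms(sent: str, received: str) -> int:
    """Whole milliseconds between two HH:MM:SS.mmm timestamps on the same day."""
    fmt = "%H:%M:%S.%f"
    delta = datetime.strptime(received, fmt) - datetime.strptime(sent, fmt)
    return delta // timedelta(milliseconds=1)


print(rtt_ms("15:10:58.345", "15:10:58.371"))  # matches the "rtt 26ms" debug line
```

The same helper can be applied to any other request/reply pair whose debug timestamps were recorded.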

This entire process happens in the reverse direction since bidirectional communication is the goal. CSR9 would be the ITR and CSR10 would be the ETR for this process. The process is identical except reversed and is not shown again. To prove connectivity is working, we will send packets from CSR10 to CSR9 and use Embedded Packet Capture (EPC) in both directions on CSR9. Two packets are shown below. The first packet has an outer UDP header (0x11, protocol 17) and sends traffic from 10.7.10.10 to 10.7.9.9 using port 4341 (0x10F5). This is the LISP data packet encapsulation and uses IP addressing from the RLOC space. The inner packet is ICMP (0x01, protocol 1) from 10.10.10.10 to 10.9.9.9, which is in EID space. The second packet is the reply; the same fields appear at the same offsets for the return traffic.

CSR9#show monitor capture CAP buffer dump
0000:  000C29E0 4F84000C 29664C2C 81000DFB   ..).O...)fL,....
0010:  08004500 0088124A 4000FE11 42FA0A07   ..E....J@...B...
0020:  0A0A0A07 09090600 10F50074 00004000   ...........t..@.
0030:  00000000 00014500 0064004A 0000FF01   ......E..d.J....
0040:  94290A0A 0A0A0A09 09090800 D2280010   .)...........(..
0050:  00000000 000056CC 5545ABCD ABCDABCD   ......V.UE......
[snip]
0000:  000C2966 4C2C000C 29E04F84 81000DFB   ..)fL,..).O.....
0010:  08004500 0088352C 4000FF11 1F180A07   ..E...5,@.......
0020:  09090A07 0A0A0600 10F50074 00004000   ...........t..@.
0030:  00000000 00014500 0064004A 0000FF01   ......E..d.J....
0040:  94290A09 09090A0A 0A0A0000 DA280010   .)...........(..
0050:  00000000 000056CC 5545ABCD ABCDABCD   ......V.UE......
[snip]
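The fields called out in the text can be extracted programmatically from the first captured packet. The Python sketch below hard-codes the first 80 bytes of that packet (reassembled from the dump) and uses fixed offsets, so it assumes exactly this framing: an 802.1Q-tagged Ethernet header (18 bytes), an outer IPv4 header without options (20 bytes), a UDP header (8 bytes), an 8-byte LISP data header, and an inner IPv4 header without options. The function name and dictionary keys are hypothetical.

```python
import ipaddress
import struct

# First 80 bytes of the first captured packet, reassembled from the dump.
FIRST_PACKET_HEX = """
000C29E0 4F84000C 29664C2C 81000DFB
08004500 0088124A 4000FE11 42FA0A07
0A0A0A07 09090600 10F50074 00004000
00000000 00014500 0064004A 0000FF01
94290A0A 0A0A0A09 09090800 D2280010
"""


def decode_lisp_frame(hex_dump: str) -> dict:
    """Decode outer IP/UDP, the LISP data header, and the inner IP header."""
    raw = bytes.fromhex("".join(hex_dump.split()))
    outer = raw[18:38]                       # outer IPv4 header (RLOC space)
    sport, dport, udp_len, _cksum = struct.unpack("!HHHH", raw[38:46])
    lisp_flags = raw[46]                     # first byte of the LISP data header
    lsb = struct.unpack("!I", raw[50:54])[0] # last 4 bytes: locator-status-bits
    inner = raw[54:74]                       # inner IPv4 header (EID space)
    return {
        "outer_proto": outer[9],
        "outer_src": str(ipaddress.IPv4Address(outer[12:16])),
        "outer_dst": str(ipaddress.IPv4Address(outer[16:20])),
        "udp_dst_port": dport,
        "l_bit": bool(lisp_flags & 0x40),    # locator-status-bits field is valid
        "locator_status_bits": lsb,
        "inner_proto": inner[9],
        "inner_src": str(ipaddress.IPv4Address(inner[12:16])),
        "inner_dst": str(ipaddress.IPv4Address(inner[16:20])),
        "icmp_type": raw[74],
    }


for field, value in decode_lisp_frame(FIRST_PACKET_HEX).items():
    print(f"{field:>20}: {value}")
```

The decoded locator-status-bits value lines up with the "locator status bits 0x00000001" reported for the remote EID in the CEF output, and the ICMP type distinguishes the echo request from the echo reply in the second packet.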

After connectivity is validated, we check the LISP mapping tables on both routers to see their new entries. Each of them has the other router's EID prefix along with a valid RLOC to reach it. Both of these entries were learned via the dynamic map-request/map-reply process.

CSR9#show ip lisp 1 map-cache
LISP IPv4 Mapping Cache for EID-table default (IID 0), 2 entries

0.0.0.0/0, uptime: 2d19h, expires: never, via static send map-request
  Negative cache entry, action: send-map-request
10.10.10.10/32, uptime: 00:00:31, expires: 23:59:28, via map-reply, complete
  Locator     Uptime    State      Pri/Wgt
  10.7.10.10  00:00:31  up          10/10

CSR10#show ip lisp 1 map-cache
LISP IPv4 Mapping Cache for EID-table default (IID 0), 2 entries

0.0.0.0/0, uptime: 4d20h, expires: never, via static send map-request
  Negative cache entry, action: send-map-request
10.9.9.9/32, uptime: 00:10:08, expires: 23:49:51, via map-reply, complete
  Locator   Uptime    State      Pri/Wgt
  10.7.9.9  00:10:08  up          10/10
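The "expires" timers in the map-cache tie back to the "ttl 1440" value (in minutes) carried in the map-reply: uptime plus remaining lifetime should equal roughly one day. A small Python sketch of that arithmetic, with a hypothetical helper name and a one-second tolerance assumed for display rounding:

```python
from datetime import timedelta


def hms(text: str) -> timedelta:
    """Convert an HH:MM:SS string from the show output into a timedelta."""
    hours, minutes, seconds = (int(part) for part in text.split(":"))
    return timedelta(hours=hours, minutes=minutes, seconds=seconds)


TTL = timedelta(minutes=1440)  # "ttl 1440" from the map-reply record

for uptime, expires in [("00:00:31", "23:59:28"), ("00:10:08", "23:49:51")]:
    total = hms(uptime) + hms(expires)
    print(total, TTL - total)  # uptime + expires, and the gap to the TTL
```

Both entries come out within a second of 24 hours, which is consistent with the cache entry lifetime being taken from the TTL in the map-reply.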

If we look at the FIB, we see interesting output. The first command is misleading because we have an exact match /32 but it says no route. Normally the lack of a route would show “0.0.0.0/0” followed by “no route”. Looking at the details, we see that LISP is working behind the scenes and performing "action encap" to get the traffic over the network. The reason it says "no route" is because the RLOC space is in a VRF on CSR10; it's a shared front-door VRF used by all EID spaces across all tables. On CSR10, the EID namespace has 10.10.10.10/32 in the global table, which has no routes aside from this loopback (shown in the routing table summary). The lack of any routes, or only having a default, is required to trigger LISP mapping resolution. Ultimately, the FIB recurses to having no route and the packet should be dropped, but LISP steps in and performs the encapsulation. This is somewhat similar to the DMVPN phase 3 operation where the NHRP cache augments the FIB.

CSR10#show ip cef 10.9.9.9
10.9.9.9/32
  no route

CSR10#show ip cef 10.9.9.9 detail
10.9.9.9/32, epoch 2, flags [default route handler, subtree context, check lisp eligibility, default route]
  SC owned,sourced: LISP remote EID - locator status bits 0x00000001
  LISP remote EID: 28 packets 2596 bytes fwd action encap
  LISP source path list
    nexthop 10.7.9.9 LISP1
  2 IPL sources [unresolved, active source]
  Dependent covered prefix type inherit, cover 0.0.0.0/0
  recursive via 0.0.0.0/0
    no route

CSR10#show ip route summary
IP routing table name is default (0x0)
IP routing table maximum-paths is 32
Route Source    Networks    Subnets     Replicates  Overhead    Memory (bytes)
application     0           0           0           0           0
connected       0           1           0           96          288
static          0           0           0           0           0
lisp            0           0           0           0           0
internal        1                                               368
Total           1           1           0           96          656

On CSR9, the output is a little different. Because both the EID prefix and RLOC namespace are in the global table, the FIB does recurse to a valid route since there is a default route in the global table, just like there is a default route in the FVRF on CSR10. The behavior is exactly the same in terms of LISP encapsulation, but the outputs are slightly different due to route recursion. LISP still trumps the FIB. There is still a specific entry for the EID prefix learned from LISP, and the action remains "action encap". Notice that the next-hops shown with "LISP1" are IPv4 addresses of the remote RLOC in use.

CSR9#show ip cef 10.10.10.10
10.10.10.10/32
  nexthop 10.7.9.7 GigabitEthernet2.579

CSR9#show ip cef 10.10.10.10 detail
10.10.10.10/32, epoch 2, flags [subtree context, check lisp eligibility, default route]
  SC owned,sourced: LISP remote EID - locator status bits 0x00000001
  LISP remote EID: 19 packets 1900 bytes fwd action encap
  LISP source path list
    nexthop 10.7.10.10 LISP1
  2 IPL sources [active source]
  Dependent covered prefix type inherit, cover 0.0.0.0/0
  recursive via 0.0.0.0/0
    recursive via 10.7.9.7
      attached to GigabitEthernet2.579

CSR9#show ip route 0.0.0.0
Routing entry for 0.0.0.0/0, supernet
  Known via "static", distance 1, metric 0, candidate default path
  Routing Descriptor Blocks:
  * 10.7.9.7
      Route metric is 0, traffic share count is 1
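The interaction between the routing table and the LISP map-cache described above can be approximated as a two-stage lookup. This Python sketch is a deliberate simplification and not the actual CEF logic: it assumes a destination is LISP-eligible only when the routing table has no match or only a default route, and then consults the map-cache. All names are hypothetical, and the tables are reduced versions of the lab state on CSR9 and CSR10.

```python
import ipaddress

prefix = ipaddress.ip_network


def longest_match(addr, table):
    """Return the most specific prefix in `table` containing `addr`, or None."""
    nets = [net for net in table if addr in net]
    return max(nets, key=lambda net: net.prefixlen) if nets else None


def forwarding_decision(dst, rib, map_cache):
    """Approximate the FIB/LISP interaction for one destination address."""
    addr = ipaddress.ip_address(dst)
    route = longest_match(addr, rib)
    if route is not None and route.prefixlen > 0:
        return ("native", rib[route])           # a real route exists: no LISP
    entry = longest_match(addr, map_cache)
    if entry is not None and map_cache[entry] is not None:
        return ("lisp-encap", map_cache[entry]) # positive map-cache entry
    return ("send-map-request", None)           # only the 0.0.0.0/0 trigger matched


# CSR9: RLOC and EID share the global table, which has a static default route.
CSR9_RIB = {prefix("0.0.0.0/0"): "10.7.9.7",
            prefix("10.7.9.0/24"): "GigabitEthernet2.579"}
CSR9_MAP_CACHE = {prefix("0.0.0.0/0"): None, prefix("10.10.10.10/32"): "10.7.10.10"}

# CSR10: the global (EID) table holds only the local loopback.
CSR10_RIB = {prefix("10.10.10.10/32"): "connected"}
CSR10_MAP_CACHE = {prefix("0.0.0.0/0"): None, prefix("10.9.9.9/32"): "10.7.9.9"}

print(forwarding_decision("10.10.10.10", CSR9_RIB, CSR9_MAP_CACHE))
print(forwarding_decision("10.9.9.9", CSR10_RIB, CSR10_MAP_CACHE))
print(forwarding_decision("10.2.2.2", CSR10_RIB, CSR10_MAP_CACHE))
```

In this model both routers reach the same "lisp-encap" result even though one has a default route and the other has no route at all, which mirrors the differing CEF outputs with identical encapsulation behavior.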

LISP does an excellent job of tracking statistics. We will look at CSR9 and CSR3 as examples. Notice the high number of map-registers sent from the ETR and received on the MS. The MS also shows the map-requests forwarded. Note: When using LISP show commands, you often need to include the process number. Notice on CSR9, I must include the "1" or else the show command does not work. On CSR3, I use the general LISP process with no PID.

CSR9#show ip lisp 1 statistics instance-id 0 | section Map-Re
Map-Requests in/out:                        1/1
Encapsulated Map-Requests in/out:           1/1
RLOC-probe Map-Requests in/out:             0/0
SMR-based Map-Requests in/out:              0/0
Map-Requests expired on-queue/no-reply      0/0
Map-Resolver Map-Requests forwarded:        0
Map-Server Map-Requests forwarded:          0
Map-Reply records in/out:                   1/1
Authoritative records in/out:               1/1
Non-authoritative records in/out:           0/0
Negative records in/out:                    0/0
RLOC-probe records in/out:                  0/0
Map-Server Proxy-Reply records out:         0
Map-Register records in/out:                0/4492
Map-Server AF disabled:                     0
Authentication failures:                    0
Map-Reply deferred/dropped:                 0/0
MR negative Map-Reply deferred/dropped:     0/0
MR Map-Request fwd deferred/dropped:        0/0
MS Map-Request fwd deferred/dropped:        0/0
MS proxy Map-Reply deferred/dropped:        0/0
RTR Map-Register fwd deferred/dropped:      0/0
[snip]

CSR3#show ip lisp statistics instance-id 0 | section Map-Re
Map-Requests in/out:                        4/4
Encapsulated Map-Requests in/out:           0/4
RLOC-probe Map-Requests in/out:             0/0
SMR-based Map-Requests in/out:              0/0
Map-Requests expired on-queue/no-reply      0/0
Map-Resolver Map-Requests forwarded:        0
Map-Server Map-Requests forwarded:          4
Map-Reply records in/out:                   0/0
Authoritative records in/out:               0/0
Non-authoritative records in/out:           0/0
Negative records in/out:                    0/0
RLOC-probe records in/out:                  0/0
Map-Server Proxy-Reply records out:         0
Map-Register records in/out:                14219/0
Map-Server AF disabled:                     0
Authentication failures:                    0
Map-Reply deferred/dropped:                 0/0
MR negative Map-Reply deferred/dropped:     0/0
MR Map-Request fwd deferred/dropped:        0/0
MS Map-Request fwd deferred/dropped:        0/0
MS proxy Map-Reply deferred/dropped:        0/0
RTR Map-Register fwd deferred/dropped:      0/0
[snip]


In many environments, the MR and MS are consolidated onto a single device. Although the MR and MS roles remain the same, the actual forwarding of the map-request from MR to MS occurs in software within the router. That is, there is no need for an ALT in this model. At present, the ALT feature only appears to support EIDs in IID 0, so for testing EID prefixes inside VRFs, we consolidate the MR/MS services on CSR3. When configuring EID prefixes not within the default IID, you must explicitly specify the IID when defining the EID prefix. All three sites have prefixes in both IID 101 and 102 as shown below. The “accept-more-specifics” option allows a more generic registration to occur where the MS does not need to enumerate individual prefixes that are contiguous. I also introduce IPv6, but LISP does an excellent job of providing feature and configuration parity between IPv4 and IPv6. The concepts are identical.

! CSR3
router lisp
 site CSR10
  authentication-key CSR10
  eid-prefix instance-id 101 10.1.10.0/23 accept-more-specifics
  eid-prefix instance-id 102 10.10.10.10/32
  eid-prefix instance-id 102 2001:10:10:10::10/128
 site CSR2
  authentication-key CSR2
  eid-prefix instance-id 101 10.2.2.2/32
  eid-prefix instance-id 102 10.2.2.2/32
  eid-prefix instance-id 102 2001:10:2:2::2/127 accept-more-specifics
 site CSR9
  authentication-key CSR9
  eid-prefix instance-id 101 10.1.9.0/24
  eid-prefix instance-id 101 10.9.9.9/32
  eid-prefix instance-id 102 10.9.9.9/32
  eid-prefix instance-id 102 2001:10:9:9::9/128
 ipv4 map-server
 ipv4 map-resolver
 ipv6 map-server
 ipv6 map-resolver
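The effect of “accept-more-specifics” can be pictured as a prefix-containment check on the MS. The Python sketch below assumes the rule is simply "an exact match is always allowed; a longer prefix is allowed only when it falls inside the configured prefix and the option is set". That rule is an assumption for illustration, the function name is hypothetical, and the prefixes are taken from the site definitions above.

```python
import ipaddress


def registration_allowed(configured: str, registered: str,
                         accept_more_specifics: bool) -> bool:
    """Decide whether a registered EID prefix fits a configured site eid-prefix."""
    cfg = ipaddress.ip_network(configured)
    reg = ipaddress.ip_network(registered)
    if cfg.version != reg.version:
        return False                 # IPv4 and IPv6 prefixes never match each other
    if reg == cfg:
        return True                  # exact match needs no extra option
    return accept_more_specifics and reg.subnet_of(cfg)


# CSR2 registers a /128 that sits inside the configured /127.
print(registration_allowed("2001:10:2:2::2/127", "2001:10:2:2::2/128", True))
# CSR10's /23 covers both contiguous /24 LANs.
print(registration_allowed("10.1.10.0/23", "10.1.11.0/24", True))
# Without the option, only the exact prefix is accepted.
print(registration_allowed("10.1.10.0/23", "10.1.10.0/24", False))
```

This also explains why a covering prefix can appear as never registered in the site table while the more specific prefix beneath it shows as up.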

We also introduce CSR2 using the parallel virtualization model. The relevant configurations for that are below. The EID prefixes in VRF C1 and C2 are not included as those are simply endpoints. Each LISP process maps to an IID (VRF) because each LISP process can only specify a single “locator-table”. The configurations are largely redundant, but this creates additional isolation between LISP virtual instances on a single xTR.

! CSR2
interface GigabitEthernet2.520
 vrf forwarding C2
 ip address 10.8.2.2 255.255.255.0
 ipv6 address 2001:10:8:2::2/64
interface GigabitEthernet2.528
 vrf forwarding C1
 ip address 10.2.8.2 255.255.255.0
 ipv6 address 2001:10:2:8::2/64
router lisp 1
 locator-table vrf C1
 eid-table vrf C1 instance-id 101
  database-mapping 10.2.2.2/32 IPv4-interface GigabitEthernet2.528 priority 10 weight 10
 ipv4 itr map-resolver 10.3.6.3
 ipv4 itr
 ipv4 etr map-server 10.3.6.3 key CSR2
 ipv4 etr
router lisp 2
 locator-table vrf C2
 eid-table vrf C2 instance-id 102
  database-mapping 10.2.2.2/32 IPv4-interface GigabitEthernet2.520 priority 20 weight 10
  database-mapping 10.2.2.2/32 IPv6-interface GigabitEthernet2.520 priority 15 weight 10
  database-mapping 2001:10:2:2::2/128 IPv4-interface GigabitEthernet2.520 priority 10 weight 10
  database-mapping 2001:10:2:2::2/128 IPv6-interface GigabitEthernet2.520 priority 15 weight 10
 ipv4 itr map-resolver 10.3.6.3
 ipv4 itr
 ipv4 etr map-server 10.3.6.3 key CSR2
 ipv4 etr
 ipv6 itr map-resolver 10.3.6.3
 ipv6 itr
 ipv6 etr map-server 10.3.6.3 key CSR2
 ipv6 etr

For practice, we will trace the LISP control-plane process again, except this time we will send traffic between CSR10 and CSR2 within a customer's VRF EID space (VRF C2). This VRF uses IID 102. As a quick check, we can verify that the MS knows about the 10.10.10.10/32 and 10.2.2.2/32 EID prefixes from each site within IID 102.

CSR3#show lisp site instance-id 102
LISP Site Registration Information
* = Some locators are down or unreachable

Site Name      Last      Up     Who Last             Inst     EID Prefix
               Register         Registered           ID
CSR10          00:00:26  yes    10.7.10.10           102      10.10.10.10/32
               00:00:07  yes    10.7.10.10           102      2001:10:10:10::10/128
CSR2           00:00:17  yes    10.8.2.2             102      10.2.2.2/32
               never     no     --                   102      2001:10:2:2::2/127
               00:00:43  yes    10.8.2.2             102      2001:10:2:2::2/128
CSR9           00:00:51  yes    10.7.9.9             102      10.9.9.9/32
               00:00:50  yes    10.7.9.9             102      2001:10:9:9::9/128

A quick debug on CSR3 shows CSR10 and CSR2 registering their IID 102 prefixes. Notice that each prefix has an IPv4 and IPv6 RLOC.

! CSR3
CSR3#debug lisp control-plane map-server-registration
LISP: Processing received Map-Register(3) message on GigabitEthernet2.536 from 10.7.10.10:4342 to 10.3.6.3:4342
LISP: Processing Map-Register no proxy, map-notify, no merge, no security, no mobile-node, not to-RTR, no fast-map-register, no EID-notify, ID-included, 1 record, nonce 0x23C5A837-0x5B4889E2, key-id 1, auth-data-len 20, hash-function sha1, xTR-ID 0x86671414-0xB88BC82F-0x428A5BA4-0xB567A9B7, site-ID unspecified
LISP: Processing Map-Register mapping record for IID 102 10.10.10.10/32, ttl 1440, action none, authoritative, 2 locators
  10.7.10.10 pri/wei=10/10 LpR
  2001:10:7:10::10 pri/wei=15/10 LpR
LISP-0: MS registration IID 102 prefix 10.10.10.10/32 10.7.10.10 site CSR10, Updating.
LISP: Processing received Map-Register(3) message on GigabitEthernet2.536 from 10.8.2.2:4342 to 10.3.6.3:4342
LISP: Processing Map-Register no proxy, map-notify, no merge, no security, no mobile-node, not to-RTR, no fast-map-register, no EID-notify, ID-included, 1 record, nonce 0xA97E340B-0xA2AA248E, key-id 1, auth-data-len 20, hash-function sha1, xTR-ID 0xB5353617-0x22428435-0xF93808BB-0x80454A5A, site-ID unspecified
LISP: Processing Map-Register mapping record for IID 102 10.2.2.2/32, ttl 1440, action none, authoritative, 2 locators
  10.8.2.2 pri/wei=10/10 LpR
  2001:10:8:2::2 pri/wei=15/10 LpR
LISP-0: MS registration IID 102 prefix 10.2.2.2/32 10.8.2.2 site CSR2, Updating.
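Each locator in these registrations carries a priority and weight ("pri/wei"). As a hedged illustration of the RFC 6830 semantics (lower priority value is preferred, priority 255 means the locator must not be used for unicast, and weight splits load among locators of equal priority), the Python sketch below picks the usable RLOCs from a locator set. The function name is hypothetical and this is not a description of the IOS selection code.

```python
def usable_rlocs(locators):
    """Return [(rloc, traffic_share)] for the best-priority locators.

    `locators` is a list of (rloc, priority, weight) tuples.
    """
    candidates = [loc for loc in locators if loc[1] != 255]  # 255 = do not use
    if not candidates:
        return []
    best = min(priority for _, priority, _ in candidates)    # lowest value wins
    chosen = [loc for loc in candidates if loc[1] == best]
    total_weight = sum(weight for _, _, weight in chosen)
    if total_weight == 0:
        # All weights zero: share equally among the chosen locators.
        return [(rloc, 1 / len(chosen)) for rloc, _, _ in chosen]
    return [(rloc, weight / total_weight) for rloc, _, weight in chosen]


# Locator set registered by CSR10 for 10.10.10.10/32 in IID 102.
csr10 = [("10.7.10.10", 10, 10), ("2001:10:7:10::10", 15, 10)]
print(usable_rlocs(csr10))  # only the priority-10 IPv4 RLOC is selected

# Hypothetical equal-priority pair showing weight-based load sharing.
print(usable_rlocs([("192.0.2.1", 10, 30), ("192.0.2.2", 10, 10)]))
```

Under these rules the IPv6 RLOC with priority 15 acts purely as a backup for data traffic and is used only if the priority-10 locator becomes unavailable.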

For another test, we originate traffic on CSR2, which uses the parallel model for virtualization. This means there is a direct link to the SP inside this VRF, which implies there is also a default route if LISP is going to be used for lookups. We confirm this before continuing.

CSR2#show ip route vrf C2 0.0.0.0 0.0.0.0
Routing Table: C2
Routing entry for 0.0.0.0/0, supernet
  Known via "static", distance 1, metric 0, candidate default path
  Routing Descriptor Blocks:
  * 10.8.2.8
      Route metric is 0, traffic share count is 1

When CSR2 sends traffic (sourced from within the proper VRF), it issues a map-request to the MR, which is CSR3 for IID 102. This is CSR2's way of asking for the valid RLOCs for EID prefix 10.10.10.10/32 within the context of IID 102.

! CSR2, time 17:15:25.650
LISP-2: IID 102 Request processing of remote EID prefix map requests to IPv4.
LISP: Send map request type remote EID prefix
LISP: Send map request for EID prefix IID 102 10.10.10.10/32
LISP-2: Remote EID IID 102 prefix 10.10.10.10/32, Send map request (1) (sources: , state: incomplete, rlocs: 0).
LISP-2: EID-AF IPv4, Sending map-request from 10.10.10.10 to 10.10.10.10 for EID 10.10.10.10/32, ITR-RLOCs 2, nonce 0x9701B20F-0x8488F010 (encap src 10.8.2.2, dst 10.3.6.3).

When CSR3 receives the map-request (as the MR), it knows that it is also an MS, so there is no need to consult the ALT for a BGP route. Instead, the local database on the MS is checked against the current EID-to-RLOC mappings. The check is successful, and the ETR that registered this mapping is CSR10. The MS forwards the map-request to CSR10.

! CSR3
LISP: Processing received Map-Request(1) message on GigabitEthernet2.536 from 10.10.10.10:4342 to 10.10.10.10:4342
LISP: Received map request for IID 102 10.10.10.10/32, source_eid IID 102 10.2.2.2, ITR-RLOCs: 10.8.2.2 2001:10:8:2::2, records 1, nonce 0x9701B20F-0x8488F010
LISP-0: MS EID IID 102 prefix 10.10.10.10/32 site CSR10, Forwarding map request to ETR RLOC 10.7.10.10.

When CSR10 receives the map-request (as the ETR), it performs a sanity check to ensure the EID-to-RLOC mappings are correct, then sends the map-reply directly to the ITR, which is CSR2. Interestingly, CSR10 uses the IPv6 RLOCs as the transport for this map-reply.

! CSR10
LISP: Processing received Map-Request(1) message on GigabitEthernet2.570 from 10.10.10.10:4342 to 10.10.10.10:4342
LISP: Received map request for IID 102 10.10.10.10/32, source_eid IID 102 10.2.2.2, ITR-RLOCs: 10.8.2.2 2001:10:8:2::2, records 1, nonce 0x9701B20F-0x8488F010
LISP: Processing map request record for EID prefix IID 102 10.10.10.10/32
LISP-1: Sending map-reply from 2001:10:7:10::10 to 2001:10:8:2::2.


When CSR2 receives the map-reply (as the ITR), it measures the time the LISP signaling process took, which was 33 ms (17:15:25.650 to 17:15:25.683). CSR2 then installs this EID-to-RLOC mapping as valid and uses these mappings to augment the FIB. The debug didn’t show the RLOC mappings here; I cannot explain why.

! CSR2, time 17:15:25.683
LISP-2: Map Request IID 102 prefix 10.10.10.10/32 remote EID prefix[LL], Received reply with rtt 33ms.
LISP: Processing mapping information for EID prefix IID 102 10.10.10.10/32
LISP-2: Remote EID IID 102 prefix 10.10.10.10/32, Change state to reused (sources: , state: incomplete, rlocs: 0).
LISP-2: Remote EID IID 102 prefix 10.10.10.10/32, Starting idle timer (delay 00:02:30) (sources: , state: reused, rlocs: 0).
LISP-2: AF IID 102 IPv4, Persistent db: ignore writing request, disabled.
LISP-2: Remote EID IID 102 prefix 10.10.10.10/32, Change state to complete (sources: , state: reused, rlocs: 0).

Again, we use EPC to validate the traffic flow on CSR7, this time facing CSR10. Two packets are shown below. The first packet is UDP (0x11, protocol 17, green) and sends traffic from 10.8.2.2 to 10.7.10.10 (yellow) using port 4341 (0x10F5, cyan). This is the LISP encapsulation and uses IP addressing from the RLOC space. The inner packet is ICMP (0x01, protocol 1, green) from 10.2.2.2 to 10.10.10.10 (yellow), which is in EID space. The second packet is the reply; the same fields are highlighted in the same colors for the return traffic.

CSR7#show monitor capture CAP buffer dump
0000: 000C2949 7164000C 29664C2C 81000DF2  ..)Iqd..)fL,....
0010: 08004500 00880046 4000FC11 5E040A08  ..E....F@...^...
0020: 02020A07 0A0A0D00 10F50074 00004800  ...........t..H.
0030: 00000000 66034500 006400AF 0000FF01  ....f.E..d......
0040: 9AD20A02 02020A0A 0A0A0800 50820027  ............P..'
0050: 00000000 00005710 D690ABCD ABCDABCD  ......W.........
[snip]
0000: 000C2966 4C2C000C 29497164 81000DF2  ..)fL,..)Iqd....
0010: 08004500 008812A8 4000FF11 48A20A07  ..E.....@...H...
0020: 0A0A0A08 02020D00 10F50074 00004800  ...........t..H.
0030: 00000000 66034500 006400AF 0000FF01  ....f.E..d......
0040: 9AD20A0A 0A0A0A02 02020000 58820027  ............X..'
0050: 00000000 00005710 D690ABCD ABCDABCD  ......W.........
[snip]

Earlier we saw that both of these prefixes had multiple RLOCs. Specifically, one was IPv4 and one was IPv6. Lower priority values are preferred; right now the IPv4 RLOCs have priority 10 while the IPv6 RLOCs have 15. IPv6 RLOCs are defined identically to the IPv4 ones and are not detailed in the inline configuration snippets. After increasing the IPv4 RLOC priorities to 20 on both CSR2 and CSR10, we still have reachability. The EPC shows that the LISP encapsulation (RLOC space) is using IPv6 while the inner packet (EID space) still uses IPv4. This permits any combination of IPvX in IPvX tunneling. Before sending traffic, we can validate that the MS knows about these RLOC priority changes.

CSR3#show lisp site 10.2.2.2/32 instance-id 102 | begin Locator
  Locator           Local  State  Pri/Wgt  Scope
  10.8.2.2          yes    up     20/10    IPv4 none
  2001:10:8:2::2    yes    up     15/10    IPv6 none

CSR3#show lisp site 10.10.10.10/32 instance-id 102 | begin Locator
  Locator           Local  State  Pri/Wgt  Scope
  10.7.10.10        yes    up     20/10    IPv4 none
  2001:10:7:10::10  yes    up     15/10    IPv6 none

First, we need to clear the map-cache entries on both CSR2 and CSR10 so they can re-signal the LISP path with the MR/MS.

CSR2#clear ip lisp 2 map-cache *
CSR10#clear ip lisp 1 map-cache *

The EPC is shown again below after sending traffic from CSR2 to CSR10. We can immediately see the packet is significantly larger, which hints towards the usage of IPv6. Two packets are shown below. The first packet has an outer IP header which wraps UDP (0x11, protocol 17, green) and sends traffic from 2001:10:8:2::2 to 2001:10:7:10::10 (yellow) using port 4341 (0x10F5, cyan). This is the LISP encapsulation and uses IP addressing from the RLOC space. The inner packet is ICMP (0x01, protocol 1, green) from 10.2.2.2 to 10.10.10.10 (yellow), which is in EID space. The second packet is the reply; the same fields are highlighted in the same colors for the return traffic.

CSR7#show monitor capture CAP buffer dump
0000: 000C2949 7164000C 29664C2C 81000DF2  ..)Iqd..)fL,....
0010: 86DD6000 00000074 11FC2001 00100008  ..`....t.. .....
0020: 00020000 00000000 00022001 00100007  .......... .....
0030: 00100000 00000000 00100D00 10F50074  ...............t
0040: 00004800 00000000 66034500 006400C3  ..H.....f.E..d..
0050: 0000FF01 9ABE0A02 02020A0A 0A0A0800  ................
0060: F15E002B 00000000 00005722 359EABCD  .^.+......W"5...
[snip]
0000: 000C2966 4C2C000C 29497164 81000DF2  ..)fL,..)Iqd....
0010: 86DD6000 00000074 11FE2001 00100007  ..`....t.. .....
0020: 00100000 00000000 00102001 00100008  .......... .....
0030: 00020000 00000000 00020D00 10F50074  ...............t
0040: 00004800 00000000 66034500 006400C3  ..H.....f.E..d..
0050: 0000FF01 9ABE0A0A 0A0A0A02 02020000  ................
0060: F95E002B 00000000 00005722 359EABCD  .^.+......W"5...
[snip]
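The size difference between the two captures can be checked with simple header arithmetic. The sketch below is illustrative only; it assumes fixed-size headers with no IPv4 options or IPv6 extension headers, and the function name `lisp_packet_len` is hypothetical. The inner length of 100 bytes comes from the inner IPv4 total-length field (0x0064) in the dumps.

```python
# Fixed header sizes in bytes (assumes no IPv4 options or IPv6 extension headers).
IPV4_HDR = 20
IPV6_HDR = 40
UDP_HDR = 8
LISP_HDR = 8


def lisp_packet_len(inner_len: int, outer_is_ipv6: bool) -> int:
    """Return the Layer 3 length of a LISP-encapsulated packet."""
    outer = IPV6_HDR if outer_is_ipv6 else IPV4_HDR
    return outer + UDP_HDR + LISP_HDR + inner_len


inner = 0x0064  # inner IPv4 total length seen in both captures (100 bytes)
v4_total = lisp_packet_len(inner, outer_is_ipv6=False)
v6_total = lisp_packet_len(inner, outer_is_ipv6=True)

# IPv4 outer header: total length field should read 0x0088 (136).
# IPv6 outer header: payload length field excludes the 40-byte header, so 0x0074 (116).
print(v4_total, v6_total - IPV6_HDR, v6_total - v4_total)
```

The results line up with the outer IPv4 total length (0x0088) in the first capture and the IPv6 payload length (0x0074) in the second, with the IPv6 transport costing 20 extra bytes per packet.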


We can verify the map-cache entries on both CSR2 and CSR10. We use the summary view on CSR2 and the detailed view on CSR10 for variety. Note that the xTR maintains all of the RLOCs assigned to an EID prefix, and locally selects the best one based on the advertised priorities. The choice is not prescribed by a remote LISP node; the remote node only influences the decision via its priority settings.

CSR2#show ip lisp 2 instance-id 102 map-cache
LISP IPv4 Mapping Cache for LISP 2 EID-table vrf C2 (IID 102), 2 entries

0.0.0.0/0, uptime: 00:15:36, expires: never, via static send map-request
  Negative cache entry, action: send-map-request
10.10.10.10/32, uptime: 00:15:34, expires: 23:44:25, via map-reply, complete
  Locator           Uptime    State  Pri/Wgt
  10.7.10.10        00:15:34  up     20/10
  2001:10:7:10::10  00:15:34  up     15/10

CSR10#show ip lisp 1 instance-id 102 map-cache 10.2.2.2/32
LISP IPv4 Mapping Cache for EID-table vrf C2 (IID 102), 2 entries

10.2.2.2/32, uptime: 00:16:15, expires: 23:43:45, via map-reply, complete
  Sources: map-reply
  State: complete, last modified: 00:16:15, map-source: 10.8.2.2
  Idle, Packets out: 8 (~ 00:15:05 ago)
  Locator         Uptime    State  Pri/Wgt
  10.8.2.2        00:16:15  up     20/10
    Last up-down state change: 00:16:15, state change count: 1
    Last route reachability change: 00:16:15, state change count: 1
    Last priority / weight change: never/never
    RLOC-probing loc-status algorithm:
      Last RLOC-probe sent: never
  2001:10:8:2::2  00:16:15  up     15/10
    Last up-down state change: 00:16:15, state change count: 1
    Last route reachability change: 00:16:15, state change count: 1
    Last priority / weight change: never/never
    RLOC-probing loc-status algorithm:
      Last RLOC-probe sent: never
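The local selection logic described above can be sketched in a few lines. This is a simplified model, not the IOS implementation: the `Rloc` type and `best_rlocs` function are hypothetical names, and weight-based load sharing among equal-priority RLOCs is left out. Treating priority 255 as "do not use" follows the LISP specification.

```python
from typing import List, NamedTuple


class Rloc(NamedTuple):
    address: str
    priority: int
    weight: int
    up: bool = True


def best_rlocs(rlocs: List[Rloc]) -> List[Rloc]:
    """Return the usable RLOCs that share the lowest (best) priority."""
    # Skip locators that are down or marked with priority 255 (never use).
    usable = [r for r in rlocs if r.up and r.priority != 255]
    if not usable:
        return []
    best = min(r.priority for r in usable)
    return [r for r in usable if r.priority == best]


# Baseline priorities for EID 10.10.10.10/32: IPv4 10, IPv6 15.
baseline = [Rloc("10.7.10.10", 10, 10), Rloc("2001:10:7:10::10", 15, 10)]
# After raising the IPv4 RLOC priority to 20, IPv6 becomes the better locator.
modified = [Rloc("10.7.10.10", 20, 10), Rloc("2001:10:7:10::10", 15, 10)]

print([r.address for r in best_rlocs(baseline)])
print([r.address for r in best_rlocs(modified)])
```

With the baseline values the IPv4 RLOC is chosen, and with the modified values the IPv6 RLOC is chosen, which is consistent with the map-cache and capture output shown in this section.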

Verifying the FIB, we see the specific EID prefix with "action encap" on both sides. Notice that the LISP next-hops are now the IPv6 RLOCs since they were the “better” RLOCs per our manual priority modifications. Because virtualization is in play, the "LISP" interfaces have subinterfaces with a number matching the IID.

CSR2#show ip cef vrf C2 10.10.10.10 detail
10.10.10.10/32, epoch 0, flags [subtree context, check lisp eligibility, default route]
  SC owned,sourced: LISP remote EID - locator status bits 0x00000003
  LISP remote EID: 18 packets 1666 bytes fwd action encap
  LISP source path list
    nexthop 2001:10:7:10::10 LISP2.102
  2 IPL sources [active source]
  Dependent covered prefix type inherit, cover 0.0.0.0/0
  recursive via 0.0.0.0/0
    recursive via 10.8.2.8
      attached to GigabitEthernet2.520

CSR10#show ip cef vrf C2 10.2.2.2 detail
10.2.2.2/32, epoch 0, flags [default route handler, subtree context, check lisp eligibility, default route]
  SC owned,sourced: LISP remote EID - locator status bits 0x00000003
  LISP remote EID: 8 packets 800 bytes fwd action encap
  LISP source path list
    nexthop 2001:10:8:2::2 LISP1.102
  2 IPL sources [unresolved, active source]
  Dependent covered prefix type inherit, cover 0.0.0.0/0
  recursive via 0.0.0.0/0
    no route

A more common usage of IPv6 with LISP would be to tunnel it inside IPv4 to connect two disjoint IPv6 "islands". We just demonstrated the opposite which, while possible, has much less utility. Assuming the SP did not provide VPNv6 transport, we can use LISP as an IPv6 transition/tunneling mechanism. We won't trace all the signaling debugs since the IPv4/IPv6 RLOCs are carried in the same LISP mapping messages.

First, we check the MS (CSR3) to ensure it knows about the CSR2 and CSR10 IID 102 IPv6 prefixes. These are 2001:10:2:2::2/128 and 2001:10:10:10::10/128, respectively. Notice that there is also an entry for 2001:10:2:2::2/127 (less specific); I configured the MS for the /127 but told it to accept longer matches, which is why the /128 was able to register successfully. The configuration snippets shown above demonstrated this option.

CSR3#show lisp site instance-id 102
LISP Site Registration Information
* = Some locators are down or unreachable

Site Name  Last      Up   Who Last     Inst  EID Prefix
           Register       Registered   ID
CSR10      00:00:02  yes  10.7.10.10   102   10.10.10.10/32
           00:00:53  yes  10.7.10.10   102   2001:10:10:10::10/128
CSR2       00:00:59  yes  10.8.2.2     102   10.2.2.2/32
           never     no   --           102   2001:10:2:2::2/127
           00:00:35  yes  10.8.2.2     102   2001:10:2:2::2/128
CSR9       00:00:27  yes  10.7.9.9     102   10.9.9.9/32
           00:00:40  yes  10.7.9.9     102   2001:10:9:9::9/128
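The relationship between the configured /127 and the registered /128 can be modeled with Python's standard `ipaddress` module. This is only a sketch of the acceptance rule as described in the text; the function `registration_allowed` and its `accept_more_specifics` parameter are hypothetical names for illustration, not a description of the map-server's internal logic.

```python
import ipaddress


def registration_allowed(configured: str, registered: str,
                         accept_more_specifics: bool) -> bool:
    """Decide whether a registered EID prefix matches a configured site prefix."""
    cfg = ipaddress.ip_network(configured)
    reg = ipaddress.ip_network(registered)
    if cfg.version != reg.version:
        return False
    if reg == cfg:
        return True
    # A longer match is accepted only when the site prefix allows it.
    return accept_more_specifics and reg.subnet_of(cfg)


print(registration_allowed("2001:10:2:2::2/127", "2001:10:2:2::2/128", True))
print(registration_allowed("2001:10:2:2::2/127", "2001:10:2:2::2/128", False))
```

Under this model the /128 registers only when longer matches are permitted, which matches the table: the /127 entry itself is never registered, while the /128 beneath it is.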

The MS also knows the priorities of the different RLOCs. In this case, IPv4 RLOCs have a lower (better) priority, so we expect those to be used for traffic forwarding. This is the opposite test of what we just did; we transition from IPv4-in-IPv6 to IPv6-in-IPv4. We simply change the priority of the IPv4 RLOCs back to 10 to re-baseline the network.

CSR3#show lisp site 2001:10:2:2::2/128 instance-id 102 | begin Locator
  Locator           Local  State  Pri/Wgt  Scope
  10.8.2.2          yes    up     10/10    IPv4 none
  2001:10:8:2::2    yes    up     15/10    IPv6 none

CSR3#show lisp site 2001:10:10:10::10/128 instance-id 102 | begin Locator
  Locator           Local  State  Pri/Wgt  Scope
  10.7.10.10        yes    up     10/10    IPv4 none
  2001:10:7:10::10  yes    up     15/10    IPv6 none

When we send traffic between the two sites, we can look for the IPv6-in-IPv4 encapsulation on CSR7 outbound towards CSR10. Two packets are shown below. The first packet has an outer UDP protocol type (0x11, protocol 17, green) and sends traffic from 10.8.2.2 to 10.7.10.10 (yellow) using port 4341 (0x10F5, cyan). This is the LISP encapsulation and uses IP addressing from the RLOC space. The inner packet is IPv6-ICMP (0x3A, protocol 58, green) from 2001:10:2:2::2 to 2001:10:10:10::10 (yellow), which is in EID space. The second packet is the reply; the same fields are highlighted in the same colors for the return traffic.

CSR7#show monitor capture CAP buffer dump
0000: 000C2949 7164000C 29664C2C 81000DF2  ..)Iqd..)fL,....
0010: 08004500 0088005E 4000FC11 5DEC0A08  ..E....^@...]...
0020: 02020A07 0A0A0454 10F50074 00004800  .......T...t..H.
0030: 00000000 66036000 0000003C 3A402001  ....f.`....<:@ .
0040: 00100002 00020000 00000000 00022001  .............. .
0050: 00100010 00100000 00000000 00108000  ................
0060: AF420348 00000001 02030405 06070809  .B.H............
[snip]
0000: 000C2966 [...]
0010: 08004500 [...]
0020: 0A0A0A08 [...]
0030: 00000000 [...]
0040: 00100010 [...]
0050: 00100002 [...]
0060: AE420348 [...]
[snip]

[...] 10.11.13.13 GRE
[...] 81000DC0  ..)j....)\......
[...] 928B0A08  ..E..|...../....
[...] 0000003C  ..........`....<
[...] 00000000  :@..............

GRE has many interesting features beyond basic tunneling. First, we will examine the GRE tunnel key. This is a 32-bit unsigned integer that differentiates tunnels with the same source and destination, and is commonly used in networks with many overlays and a single underlay. It is also a weak form of security that guards against misconfigurations and against accepting stray packets. To make it easy to see this in the packet capture, we will use the decimal key 3203386110 (0xBEEFCAFE) on the tunnel between CSR2 and XRv13.

! CSR2
interface Tunnel213
 tunnel key 3203386110

! XRv13
interface tunnel-ip213
 tunnel key 3203386110

If the tunnel keys do not match, "debug tunnel keepalive" will reveal the fault. To test it, I temporarily change one of the keys to 11111 (not shown) to ensure the key is being checked.

! CSR2
Tunnel: Drop, Failed to find tunnel for GRE/IPv4 packet 10.11.13.13->10.8.2.2 (tbl=4,"FVRF" len=104, ttl=252)

After capturing on CSR8, we can see the packet is 4 bytes larger (122 to 126) and contains our new 4-byte key. The original 4-byte header is shown in yellow with the new GRE tunnel key in green. The beginning of the header changed also; the first hex digit is now 0x2, which is 0010 in binary, and the 1 represents the flag indicating that the tunnel key is present.

CSR8#show monitor capture CAP buffer detailed
------------------------------------------------------------
 #  size  timestamp  source       destination  protocol
------------------------------------------------------------
 0  126   0.000000   10.8.2.2 ->  10.11.13.13  GRE
0000: 000C296A FFD5000C 295CE1E9 81000DC0  ..)j....)\......
0010: 080045C0 006C060C 0000FF2F 91750A08  ..E..l...../.u..
0020: 02020A0B 0D0D2000 0800BEEF CAFE45C0  ...... .......E.
0030: 0050AB62 00000159 162A0A02 0D02E000  .P.b...Y.*......

Next, we enable the GRE checksum feature. The checksum is a 16-bit field followed by 16 reserved bits that are always set to 0, so the option adds 4 bytes to the header. It is used to validate that the packet has not been corrupted. This is not supported on XR, but can be enabled unidirectionally.

! CSR2
interface Tunnel213
 tunnel checksum

Notice that the packet has grown by another 4 bytes to 130. The checksum is 0xCE10 and the next 16 bits are set to 0 (cyan). The GRE key follows that. Notice the very first hex digit of the GRE header is 0xA, which is 1010. We conclude that the leftmost bit represents the checksum option. Note: The rightmost bit represents the sequence numbers field. GRE sequence numbers, the third GRE option, do not appear to be supported on XE or XR, so they are not tested.

CSR8#show monitor capture CAP buffer detailed
 1  130   23.000000  10.8.2.2 ->  10.11.13.13  GRE
0000: 000C296A FFD5000C 295CE1E9 81000DC0  ..)j....)\......
0010: 080045C0 0070069F 0000FF2F 90DE0A08  ..E..p...../....
0020: 02020A0B 0D0DA000 0800CE10 0000BEEF  ................
0030: CAFE45C0 0050ACE7 00000159 14A50A02  ..E..P.....Y....
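The flag bits discussed above can be decoded programmatically. The following is a minimal sketch of a GRE header parser covering only the checksum, key, and sequence options; the function name `parse_gre` is hypothetical, and the two sample headers are copied from the captures in this section.

```python
import struct


def parse_gre(header: bytes) -> dict:
    """Decode the GRE base header plus optional checksum, key, and sequence fields."""
    flags_ver, proto = struct.unpack("!HH", header[:4])
    result = {"protocol": proto, "checksum": None, "key": None, "sequence": None}
    offset = 4
    if flags_ver & 0x8000:  # C bit: 16-bit checksum plus 16 reserved bits
        csum, _reserved = struct.unpack("!HH", header[offset:offset + 4])
        result["checksum"] = csum
        offset += 4
    if flags_ver & 0x2000:  # K bit: 32-bit key
        (result["key"],) = struct.unpack("!I", header[offset:offset + 4])
        offset += 4
    if flags_ver & 0x1000:  # S bit: 32-bit sequence number
        (result["sequence"],) = struct.unpack("!I", header[offset:offset + 4])
        offset += 4
    result["length"] = offset  # total GRE header length in bytes
    return result


key_only = bytes.fromhex("20000800BEEFCAFE")              # first nibble 0x2
csum_and_key = bytes.fromhex("A0000800CE100000BEEFCAFE")  # first nibble 0xA

print(parse_gre(key_only))
print(parse_gre(csum_and_key))
```

Parsing the key-only header yields an 8-byte GRE header with key 3203386110, and the second yields a 12-byte header with checksum 0xCE10, which accounts for the 122, 126, and 130 byte packet sizes.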

Although not a header extension, GRE can also issue regular keepalives. This is supported on P2P tunnels only; a keepalive is essentially an empty GRE packet. We enable this on the CSR2-XRv13 tunnel using the basic "keepalive" command.

! CSR2
interface Tunnel213
 keepalive 3 9

! XRv13
interface tunnel-ip213
 keepalive 3 9

Keepalives can be enabled unidirectionally since the mechanism works like BFD echo-mode. Inside the GRE packet from CSR2 to XRv13, there is another GRE packet which has the source/destination headers reversed. This allows XRv13 to respond normally. Notice the GRE header at the very end of the packet. The protocol number is 0x0000 and the reserved field that was previously 0x0000 is now 0x0001. I assume this low-level detail signals a keepalive to the GRE process, although technically it is part of the sequence number field. The debug below also shows the keepalive being sent and received, but notice the direction is represented by the inner packet (from the destination to the source). I used the same colors as above to show the GRE basic header (yellow), checksum (cyan), and key (green). I use grey to show the inner GRE packet with reversed IP addresses.

CSR8#show monitor capture CAP buffer dump
0
0000: 000C296A FFD5000C 295CE1E9 81000DC0  ..)j....)\......
0010: 080045C0 00400707 0000FF2F 90A60A08  ..E..@...../....
0020: 02020A0B 0D0DA000 0800CE10 0000BEEF  ................
0030: CAFE45C0 0020000F 0000FF2F 97BE0A0B  ..E.. ...../....
0040: 0D0D0A08 0202A000 0000D60F 0001BEEF  ................
0050: CAFE

Tunnel213: sending keepalive, 10.11.13.13->10.8.2.2 (len=32 ttl=255), counter=1
Tunnel213: keepalive received, 10.11.13.13->10.8.2.2 (len=32 ttl=251), resetting counter


When we enable it on XRv13, there does not appear to be a debug that shows the keepalives. However, outbound EPC on CSR8 reveals those packets. Notice that XRv13 is not using checksums, which explains the difference in length (74 versus 82, accounting for two checksums).

CSR8#show monitor capture CAP buffer dump
5
0000: 000C295C E1E9000C 296AFFD5 81000DC0  ..)\....)j......
0010: 080045C0 00380000 4000FD2F 59B50A0B  ..E..8..@../Y...
0020: 0D0D0A08 02022000 0800BEEF CAFE45C0  ...... .......E.
0030: 001C0000 0000FF2F 97D10A08 02020A0B  ......./........
0040: 0D0D2000 0000BEEF CAFE                .. .......
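The 74-byte and 82-byte keepalive frames can be reproduced from the header sizes. This is a rough sketch assuming an Ethernet header with one 802.1Q tag (as seen in the captures) and IPv4 headers without options; the helper names are hypothetical.

```python
ETH_HDR = 14    # destination MAC, source MAC, EtherType
DOT1Q_TAG = 4   # single 802.1Q VLAN tag, as in the captures
IPV4_HDR = 20   # no IP options
GRE_BASE = 4
GRE_CSUM = 4    # checksum plus reserved bits
GRE_KEY = 4


def gre_hdr_len(checksum: bool, key: bool) -> int:
    """Length of the GRE header with the selected options."""
    return GRE_BASE + (GRE_CSUM if checksum else 0) + (GRE_KEY if key else 0)


def keepalive_frame_len(checksum: bool, key: bool) -> int:
    """A keepalive carries an outer IP+GRE pair and an inner IP+GRE pair."""
    pair = IPV4_HDR + gre_hdr_len(checksum, key)
    return ETH_HDR + DOT1Q_TAG + 2 * pair


csr2_keepalive = keepalive_frame_len(checksum=True, key=True)
xrv13_keepalive = keepalive_frame_len(checksum=False, key=True)
print(csr2_keepalive, xrv13_keepalive)
```

The 8-byte difference is the checksum option appearing twice, once in the outer GRE header and once in the inner one.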

You can see a quick summary of the GRE special features using the commands below.

CSR2#show interfaces tunnel213 | section GRE|Keepalive
  Description: P2P GRE CSR2 TO XRV13
  Keepalive set (3 sec), retries 9
  Tunnel protocol/transport GRE/IP
    Key 0xBEEFCAFE, sequencing disabled
    Checksumming of packets enabled

RP/0/0/CPU0:XRv13#show interfaces tunnel-ip213 | include is enabled
  Keepalive is enabled, interval 3 seconds, maximum retry 9
  Key is enabled, key value 3203386110

The tunnel TOS and TTL parameters are straightforward and self-explanatory. These set the TOS and TTL of the outer IP header only. We demonstrate this on CSR9 sending traffic to CSR4 and capturing inbound on CSR7. CSR9 will set the outer IP header TOS to DSCP CS3 rather than reflect the tunneled value. The tunnel's default TTL is 255 but we can shorten the “length” of the tunnel by selecting a smaller value.

! CSR9
interface Tunnel49
 tunnel tos 96
 tunnel ttl 8

The OSPF hellos have DSCP CS6 set, but the tunnel overrides it to DSCP CS3. The tunnel TTL has been reduced to 8 for additional security. Comparing it to the packets going to XRv13 (a different and unmodified tunnel), the differences become clear. The TTLs are highlighted in yellow and the TOS values in green. Failing to specify a tunnel TOS means that the value is copied from the inner IP header; the packets going to XRv13 have DSCP CS6 which was copied from the OSPFv2 hellos inside. Failing to specify a TTL means that a value of 255 is always used.

CSR7#show monitor capture CAP buffer detail
 0  122   0.000000   10.7.9.9 ->  10.11.13.13  GRE
0000: 000C2966 4C2C000C 29E04F84 81000DFB  ..)fL,..).O.....
0010: 080045C0 0068087B 0000FF2F 88040A07  ..E..h.{.../....
0020: 09090A0B 0D0D0000 080045C0 0050A615  ..........E..P..
0030: 00000159 1B690A09 0D09E000 00050201  ...Y.i..........

 1  122   4.775021   10.7.9.9 ->  10.4.12.4    GRE
0000: 000C2966 4C2C000C 29E04F84 81000DFB  ..)fL,..).O.....
0010: 08004560 0068087C 0000082F 80740A07  ..E`.h.|.../.t..
0020: 09090A04 0C040000 080045C0 0050A616  ..........E..P..
0030: 00000159 1F6D0A04 0909E000 00050201  ...Y.m..........
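The relationship between the configured "tunnel tos 96" and DSCP CS3 is a simple bit shift, since DSCP occupies the upper 6 bits of the TOS byte. The helper names below are hypothetical and serve only to illustrate the conversion.

```python
from typing import Optional


def tos_to_dscp(tos: int) -> int:
    """DSCP is the upper 6 bits of the 8-bit TOS byte."""
    return tos >> 2


def class_selector(dscp: int) -> Optional[str]:
    """Return the CSx name when the DSCP value is a class selector, else None."""
    return f"CS{dscp >> 3}" if dscp % 8 == 0 else None


tunnel_tos = 96      # "tunnel tos 96", seen as 0x60 in the capture toward CSR4
copied_tos = 0xC0    # value copied from the inner OSPF hello toward XRv13

print(tos_to_dscp(tunnel_tos), class_selector(tos_to_dscp(tunnel_tos)))
print(tos_to_dscp(copied_tos), class_selector(tos_to_dscp(copied_tos)))
```

TOS 0x60 maps to DSCP 24 (CS3) and TOS 0xC0 maps to DSCP 48 (CS6), matching the second byte of the outer IP header in each captured packet.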

The “flow egress-records” command appears to do nothing. Whether it is enabled or not, egress Netflow records are always captured. FNF is enabled on CSR10 to show this.

! CSR10
flow monitor FLOW_MON_IPV4
 record netflow-original
interface Tunnel410
 ip flow monitor FLOW_MON_IPV4 output
 tunnel flow egress-records

CSR10#show interfaces tunnel410 | include flow
  Egress flow records enabled

CSR10#show flow monitor FLOW_MON_IPV4 cache
[snip]
IPV4 SOURCE ADDRESS:       10.1.10.1
IPV4 DESTINATION ADDRESS:  4.4.4.4
TRNS SOURCE PORT:          0
TRNS DESTINATION PORT:     2048
INTERFACE INPUT:           Gi2.510
FLOW SAMPLER ID:           0
IP TOS:                    0x00
IP PROTOCOL:               1
[snip]

When the command is removed and the cache is cleared, we see the same effect. I also tried putting it inbound only on the LAN to see if I would get duplicate records for outbound packets, but I did not. We will stop investigating the feature as it appears to have limited utility.

CSR10#show interfaces tunnel410 | include flow
[no output]

CSR10#show flow monitor FLOW_MON_IPV4 cache
[snip]
IPV4 SOURCE ADDRESS:       10.1.10.1
IPV4 DESTINATION ADDRESS:  4.4.4.4
TRNS SOURCE PORT:          0
TRNS DESTINATION PORT:     2048
INTERFACE INPUT:           Gi2.510
FLOW SAMPLER ID:           0
IP TOS:                    0x00
IP PROTOCOL:               1
[snip]

The tunnel bandwidth feature is another GRE tunnel command of limited value. I believe it is informational only and is meant to be used for Layer 1 transports that don't have the same bandwidth in both directions, like satellite links using Rate-Based Satellite Control Protocol (RBSCP) tunnels. I assume this information is communicated to upper-layer protocols as it is not a policer for the tunnel. The default is 8000 kbps and this value also doesn't seem to affect IGP cost calculation. This feature is probably not relevant for GRE tunnels at all, but we will investigate it briefly.

CSR9#show interfaces tunnel913 | include bandwidth
  Tunnel transmit bandwidth 8000 (kbps)
  Tunnel receive bandwidth 8000 (kbps)

CSR9#show ip ospf interface brief | include Tu913
Tu913  1  0  10.9.13.9/24  1000  P2P  1/1

We modify the tunnel on CSR9 going to XRv13 to be TX 4 Mbps and RX 1 Mbps. There is no change in the IGP metric.

! CSR9
interface Tunnel913
 tunnel bandwidth transmit 4096
 tunnel bandwidth receive 1024

CSR9#show interfaces tunnel913 | include bandwidth
  Tunnel transmit bandwidth 4096 (kbps)
  Tunnel receive bandwidth 1024 (kbps)

CSR9#show ip ospf interface brief | include Tu913
Tu913  1  0  10.9.13.9/24  1000  P2P  1/1

Only the traditional "bandwidth" command has an effect on the IGP metric. There is also a "bandwidth receive" command. Having 4 different bandwidth commands on a single interface can be very confusing. I've set the bandwidth to 16 Mbps and the RX BW to 2048 kbps so that all 4 values are different.

! CSR9
interface Tunnel913
 bandwidth 16384
 bandwidth receive 2048


The 16 Mbps value is used for the OSPF cost calculation now. The "tunnel bandwidth receive" command is deprecated according to Cisco documentation as it was once used for RBSCP. RBSCP is beyond the scope of this document, but only the “transmit” command is used now.

CSR9#show interfaces tunnel 913 | include bandwidth|BW
  MTU 9976 bytes, BW 16384 Kbit/sec, RxBW 2048 Kbit/sec, DLY 50000 usec,
  Tunnel transmit bandwidth 4096 (kbps)
  Tunnel receive bandwidth 1024 (kbps)

CSR9#show ip ospf interface brief | include Tu913
Tu913  1  0  10.9.13.9/24  6  P2P  1/1
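The change in OSPF cost from 1000 to 6 follows from the usual reference-bandwidth formula. The sketch below assumes the default OSPF reference bandwidth of 100 Mbps, which is not shown in the output above. Under that assumption, the original cost of 1000 implies an interface bandwidth of 100 kbps on the tunnel; that figure is inferred here, not shown in the captures. The function name `ospf_cost` is hypothetical.

```python
def ospf_cost(intf_bw_kbps: int, ref_bw_kbps: int = 100_000) -> int:
    """Integer OSPF cost: reference bandwidth divided by interface bandwidth.

    The result is clamped to the valid 1..65535 range.
    ref_bw_kbps defaults to 100 Mbps, an assumption about this lab.
    """
    return max(1, min(65535, ref_bw_kbps // intf_bw_kbps))


cost_after = ospf_cost(16384)   # "bandwidth 16384" configured on Tunnel913
cost_before = ospf_cost(100)    # inferred 100 kbps interface bandwidth
print(cost_before, cost_after)
```

Note that neither "tunnel bandwidth transmit" nor "tunnel bandwidth receive" appears in the calculation, which is consistent with the unchanged metric observed earlier.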

Tunnel entropy replaces the lower-order 8 bits of the tunnel key with a hash value to introduce more randomness into the packet header. The hash input is a 6-tuple combining src/dst IP, src/dst port, VRF, and IP protocol. The goal is better load sharing, assuming routers in the core do their ECMP-hash-based forwarding using the tunnel key. This also means the GRE key must be 24 bits or less, so using 0xBEEFCAFE again won't work. We will enable this on the CSR2-CSR4 tunnel, and we must enable entropy before adding the key.

CSR2(config-if)#tunnel key ?
  key

When entropy is enabled, the key can only be 24 bits. Using a small key 0xBEEF (48879) will work. We configure this on both ends of the tunnel then verify it was successful. Both ends must have a matching key with the entropy option enabled. To enable entropy, you must configure entropy first and then the tunnel key; order of operations is significant.

! CSR2 and CSR4
interface Tunnel24
 tunnel entropy
 tunnel key 48879

CSR2(config-if)#tunnel key ?
  key

CSR2#show interfaces tunnel24 | include Key
  Key 0xBEEF, sequencing disabled
  Tunnel Entropy Calculation Enabled (24-bit Key)

CSR4#show interfaces tunnel24 | include Key
  Key 0xBEEF, sequencing disabled
  Tunnel Entropy Calculation Enabled (24-bit Key)


Looking into the packet by capturing outbound on CSR8, we can see the lower-order 8 bits of the key are no longer part of the "key" used for security purposes between the endpoints. The top flow is a telnet session and has entropy value 0xBE, and the bottom flow is an ICMP flow with entropy value 0x7C. The protocol numbers 6 and 1 for TCP and ICMP, respectively, are highlighted in pink. Because every flow is likely to have a different GRE key (256 total combinations), if the underlying architecture can do hashing based on the GRE key rather than the src/dst IP information, better load sharing can be achieved. I cannot find an authoritative source that tells me whether XE/XR routers can do load sharing based on the GRE tunnel key, nor can I find a mechanism to configure it. The concept is very clever and could be very useful for load sharing in an IP-based core network.

CSR8#show monitor capture CAP buffer detailed
 3  88    0.558021   10.4.12.4 ->  10.8.2.2  GRE
0000: 000C295C E1E9000C 296AFFD5 81000DC0  ..)\....)j......
0010: 080045C0 0046196C 0000FD2F 814B0A04  ..E..F.l.../.K..
0020: 0C040A08 02022000 080000BE EFBE45C0  ...... .......E.
0030: 002A5977 0000FF06 458D0A02 04040A02  .*Yw....E.......

 4  146   3.018004   10.4.12.4 ->  10.8.2.2  GRE
0000: 000C295C E1E9000C 296AFFD5 81000DC0  ..)\....)j......
0010: 08004500 008018FC 0000FD2F 82410A04  ..E......../.A..
0020: 0C040A08 02022000 080000BE EF7C4500  ...... ......|E.
0030: 00640017 0000FF01 9F780A02 04040A02  .d.......x......
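The split between the configured key and the per-flow entropy byte can be shown by decomposing the 32-bit key field taken from each capture. This sketch only separates the two parts; it does not attempt to reproduce the hash itself, which is not documented here. The function name `split_entropy_key` is hypothetical.

```python
from typing import Tuple


def split_entropy_key(key_field: int) -> Tuple[int, int]:
    """Split the 32-bit GRE key field into (24-bit key, 8-bit entropy)."""
    return key_field >> 8, key_field & 0xFF


telnet_field = 0x00BEEFBE   # key field from packet 3 (TCP flow)
icmp_field = 0x00BEEF7C     # key field from packet 4 (ICMP flow)

for field in (telnet_field, icmp_field):
    key, entropy = split_entropy_key(field)
    print(f"key=0x{key:X} ({key}) entropy=0x{entropy:02X}")
```

Both flows share the configured key 0xBEEF (48879) in the upper 24 bits, while the low byte differs per flow, which is exactly what a core router would need in order to hash flows onto different paths.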

Additional Reading – Reference configurations "gre-p2p"

35.2 Dynamic Multipoint VPN (DMVPN) basics

DMVPN isn't really an SP technology but is very commonly used in enterprise networks to connect branch sites to corporate HQ sites in hub-spoke fashion using multipoint GRE (mGRE) tunnel interfaces. Since mGRE is within the blueprint, we examine DMVPN also; understanding what enterprise customers are doing over the SP network is valuable for customer interactions and joint troubleshooting. Encryption via IPsec can optionally be added, but this is beyond the scope of the SP blueprint.

Spokes register to the hubs using Next Hop Resolution Protocol (NHRP) so that hubs can dynamically discover spokes, which simplifies provisioning for new spokes. The registration message contains a mapping between the VPN addresses (the typically-private tunnel IP) and the NBMA addresses (the typically-public address, AKA the tunnel endpoints). The process is somewhat similar to LISP where ETRs register EID-to-RLOC mappings to the MS. In most deployments the hubs dynamically create downstream multicast mappings based on spoke registrations.

DMVPN comes in three variations:

1. Phase 1: Still takes advantage of mGRE tunnels, but does not provide any direct spoke-to-spoke communication. All spoke-to-spoke traffic is hair-pinned through the hub. In single-hub deployments, spokes can use a P2P GRE tunnel construct with some additional configuration for simplicity. In dual-hub deployments, spokes can use the mGRE tunnel construct similar to the hubs. Phase 1 is really only recommended for small, single-hub deployments with little to no spoke-to-spoke traffic.

2. Phase 2: Uses mGRE tunnel interfaces everywhere as it provides dynamic spoke-to-spoke tunnel creation. When a spoke needs to reach a remote spoke, it queries the hub requesting the NBMA (underlay) address for a given VPN (overlay) address. The hub resolves the remote spoke's NBMA address (process detailed later) and the local spoke can directly encapsulate traffic with a GRE destination of the remote spoke's NBMA address. So long as the underlay is intelligently designed to provide spoke-to-spoke access (such as the Internet or SP WAN links), this technology works well.

3. Phase 3: Similar to phase 2 but it enhances scalability. Using new NHRP redirect messages from the hub and CEF shortcuts on the spokes, a hierarchical hub architecture can be designed to further regionalize the network. All of the routers still share a tunnel mesh (and thus an IP subnet), but the IGP neighbor discovery process can be bounded by the hub hierarchy. It also allows the hub routers to advertise very coarse aggregate routes, like defaults, and not provide more specific matches. This is not possible in Phase 2 since the spokes still need to know the actual VPN next-hops for every destination, even other spokes.

The diagram below shows a network logically similar to the P2P GRE architecture. The main difference from the P2P GRE setup is that XRv13 is no longer a "hub" since it does not support DMVPN at all. It will be the "Internet" router and CSR5 will be a hub, alongside CSR4. CSR10, CSR9, and CSR2 remain spokes. This lab does not detail advanced DMVPN designs with respect to IGP, multicast, etc. The focus is on basic DMVPN constructs, detailed debugging, and control-plane verification.
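The registration and resolution roles described above can be modeled as a simple mapping table on the hub. This is a deliberately simplified sketch: the `NhrpHub` class is hypothetical, and it ignores authentication, holdtimes, and the actual message exchange used to produce a resolution reply. The VPN and NBMA addresses are the ones used by CSR9 and CSR2 in this lab.

```python
from typing import Dict, Optional


class NhrpHub:
    """Toy model of a hub's NHRP cache: VPN (overlay) address to NBMA (underlay) address."""

    def __init__(self) -> None:
        self.cache: Dict[str, str] = {}

    def register(self, vpn_addr: str, nbma_addr: str) -> None:
        """A spoke registers its VPN-to-NBMA mapping with the hub."""
        self.cache[vpn_addr] = nbma_addr

    def resolve(self, vpn_addr: str) -> Optional[str]:
        """Answer a spoke's query for the NBMA address behind a VPN address."""
        return self.cache.get(vpn_addr)


hub = NhrpHub()
hub.register("10.0.100.9", "10.7.9.9")   # CSR9 tunnel IP and tunnel source
hub.register("10.0.100.2", "10.8.2.2")   # CSR2 tunnel IP and tunnel source

# In Phase 2, CSR2 would ask for CSR9's NBMA address and then use it
# as the GRE destination for direct spoke-to-spoke traffic.
print(hub.resolve("10.0.100.9"))
print(hub.resolve("10.0.100.77"))  # unknown spoke: no mapping
```

The hub never needs static configuration for the spokes in this model, which is the provisioning benefit the text attributes to NHRP registration.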


General restrictions:
1. GRE keepalives only work on tunnels that have explicit destinations configured. Even on mGRE spokes with NHRP static mappings, GRE keepalives are not sent. Recall that keepalives can be configured unidirectionally, so a Phase 1 spoke in a single-hub deployment, for example, can use GRE keepalives to probe the hub for reachability.
2. Multicast NBMA designs apply. For Phase 1, both PIM (*,G) and (S,G) joins can be enabled since the next-hop for all routes is the hub. For Phase 2, only PIM (*,G) joins are valid since the RPF neighbor will appear spoke-to-spoke (directly connected), which is not supported for LL multicast. There are no PIM neighbors laterally across any flavor of DMVPN. For Phase 3, (S,G) joins can work, but only if the NHO process is in effect. NHRP (H) routes are used for RPF and break multicast when they exist. These H routes are common when performing summarization on spokes.

35.2.1 Phase 1

In this setup, CSR4/CSR5 are DMVPN phase 1 hubs. CSR9 and CSR10 use mGRE interfaces as spokes to connect to both hubs. They use static mappings towards each hub for unicast and multicast. CSR2 demonstrates using a P2P tunnel to CSR4, which would be common in a single-hub deployment. To connect to CSR5, both CSR2 and CSR5 must provision a new, separate P2P GRE tunnel since CSR2 cannot use the same subnet on multiple interfaces in the same VRF. For phase 1, this doesn't affect forwarding, but complicates the design. The OSPF network type should be P2MP on the hubs and either P2MP or P2P on the spokes depending on the design.

! Generic hub (CSR4)
interface Tunnel100
 description DMVPN PHASE 1 HUB
 ip address 10.0.100.4 255.255.255.0
 no ip redirects
 ip nhrp authentication NHRPAUTH
 ip nhrp map multicast dynamic
 ip nhrp network-id 100
 ip ospf network point-to-multipoint
 ip ospf hello-interval 10
 ip ospf 1 area 0
 tunnel source 10.4.12.4
 tunnel mode gre multipoint

! Generic mGRE spoke (CSR9)
interface Tunnel100
 description DMVPN PHASE 1 - SPOKE
 ip address 10.0.100.9 255.255.255.0
 no ip redirects
 ip nhrp authentication NHRPAUTH
 ip nhrp map 10.0.100.4 10.4.12.4
 ip nhrp map multicast 10.4.12.4
 ip nhrp map 10.0.100.5 10.5.6.5
 ip nhrp map multicast 10.5.6.5
 ip nhrp network-id 100
 ip nhrp nhs 10.0.100.4
 ip nhrp nhs 10.0.100.5
 ip ospf network point-to-multipoint
 ip ospf hello-interval 10
 ip ospf 1 area 0
 tunnel source 10.7.9.9
 tunnel mode gre multipoint

! P2P GRE spoke (CSR2)
interface Tunnel100
 description DMVPN PHASE 1 - HUB CSR4
 ip address 10.0.100.2 255.255.255.0
 ip nhrp authentication NHRPAUTH
 ip nhrp map 10.0.100.4 10.4.12.4
 ip nhrp map multicast 10.4.12.4
 ip nhrp network-id 100
 ip nhrp nhs 10.0.100.4
 ip ospf network point-to-point
 ip ospf 1 area 0
 keepalive 3 3
 tunnel source 10.8.2.2
 tunnel destination 10.4.12.4
 tunnel vrf FVRF
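The per-hub static mappings on an mGRE spoke follow a fixed pattern: one unicast map, one multicast map, and one NHS statement per hub. The following Python sketch renders those lines from a list of hubs. The function name `spoke_nhrp_config` and its parameters are hypothetical helpers for illustration only; the generated commands are the same ones used in the CSR9 configuration above.

```python
def spoke_nhrp_config(hubs, network_id=100, auth="NHRPAUTH"):
    """Render static NHRP lines for an mGRE spoke.

    hubs: list of (vpn_address, nbma_address) tuples (illustrative input).
    """
    lines = [f" ip nhrp authentication {auth}"]
    for vpn, nbma in hubs:
        # Unicast VPN-to-NBMA mapping, then multicast replication target.
        lines.append(f" ip nhrp map {vpn} {nbma}")
        lines.append(f" ip nhrp map multicast {nbma}")
    lines.append(f" ip nhrp network-id {network_id}")
    for vpn, _ in hubs:
        # Each hub is also a next-hop server for registration.
        lines.append(f" ip nhrp nhs {vpn}")
    return lines


hubs = [("10.0.100.4", "10.4.12.4"), ("10.0.100.5", "10.5.6.5")]
print("\n".join(spoke_nhrp_config(hubs)))
```

Adding a third hub would only require one more tuple in `hubs`, which is the main reason to template this stanza rather than type it by hand.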

We will watch the NHRP registration process in detail from the perspective of a spoke (CSR9) and a hub (CSR5). The debugs are verbose so I have stripped out the unnecessary parts. First, CSR9's tunnel comes up and adds the static mappings to the NHRP cache. Technically, those static mappings count as NHRP entries also. The NHRP process iterates over the configured hub mappings in series. I also show the NHRP cache at present to show the mappings were added. The command used is "debug nhrp".

! CSR9
NHRP: if_up: Tunnel100 proto 'NHRP_IPv4'
NHRP: Adding all static maps to cache
NHRP: Adding multicast map entry to multicast list10.5.6.5
NHRP: Adding Tunnel Endpoints (VPN: 10.0.100.5, NBMA: 10.5.6.5)
NHRP: Successfully attached NHRP subblock for Tunnel Endpoints (VPN: 10.0.100.5, NBMA: 10.5.6.5)
NHRP: No peer data updated in NHRP subblock for Tunnel Endpoints (VPN: 10.0.100.5, NBMA: 10.5.6.5)
NHRP: Adding multicast map entry to multicast list10.4.12.4
NHRP: Adding Tunnel Endpoints (VPN: 10.0.100.4, NBMA: 10.4.12.4)
NHRP: Successfully attached NHRP subblock for Tunnel Endpoints (VPN: 10.0.100.4, NBMA: 10.4.12.4)
NHRP: No peer data updated in NHRP subblock for Tunnel Endpoints (VPN: 10.0.100.4, NBMA: 10.4.12.4)

CSR9#show ip nhrp brief
   Target             Via            NBMA           Mode     Intfc   Claimed
10.0.100.4/32         10.0.100.4     10.4.12.4      static   Tu100   <   >
10.0.100.5/32         10.0.100.5     10.5.6.5       static   Tu100   <   >

CSR9#show ip nhrp multicast
  I/F        NBMA address
Tunnel100    10.4.12.4      Flags: static    (Enabled)
Tunnel100    10.5.6.5       Flags: static    (Enabled)

Next, CSR9 will attempt to register its VPN/NBMA mappings to each NHS in series. According to Cisco documentation, priority values determine which NHSs are preferred over others (0 is the highest priority), but this is typically used in specific HA architectures. NHRP “clusters” can be used for failover scenarios; these features are not examined in great detail here. Below, the debugs from CSR9 show the registration packets being sent, and the replies being received from each NHS. The transition from 'E' to 'RE' means that the NHS entry went from "expecting replies" to "expecting replies and receiving them". This means registration was successful and each NHS is marked as "up".

! CSR9
NHRP: Multicast enabled for dst 10.4.12.4
NHRP: NHS 10.0.100.4 Tunnel100 vrf 0 Cluster 0 Priority 0 Transitioned to 'E' from ' '
NHRP: Attempting to send packet through interface Tunnel100 via DEST dst 10.0.100.4
NHRP: Send Registration Request via Tunnel100 vrf 0, packet size: 108
 src: 10.0.100.9, dst: 10.0.100.4
NHRP: Encapsulation succeeded. Sending NHRP Control Packet NBMA Address: 10.4.12.4
NHRP: 132 bytes out Tunnel100
NHRP: Receive Registration Reply via Tunnel100 vrf 0, packet size: 128
NHRP: NHS 10.0.100.4 Tunnel100 vrf 0 Cluster 0 Priority 0 Transitioned to 'RE' from 'E'
NHRP: NHS-UP: 10.0.100.4
NHRP: Multicast enabled for dst 10.5.6.5
NHRP: NHS 10.0.100.5 Tunnel100 vrf 0 Cluster 0 Priority 0 Transitioned to 'E' from ' '
NHRP: Attempting to send packet through interface Tunnel100 via DEST dst 10.0.100.5
NHRP: Send Registration Request via Tunnel100 vrf 0, packet size: 108
 src: 10.0.100.9, dst: 10.0.100.5
NHRP: Encapsulation succeeded. Sending NHRP Control Packet NBMA Address: 10.5.6.5
NHRP: 132 bytes out Tunnel100
NHRP: Receive Registration Reply via Tunnel100 vrf 0, packet size: 128
NHRP: NHS 10.0.100.5 Tunnel100 vrf 0 Cluster 0 Priority 0 Transitioned to 'RE' from 'E'
NHRP: NHS-UP: 10.0.100.5

CSR9#show ip nhrp nhs detail
Legend: E=Expecting replies, R=Responding, W=Waiting
Tunnel100:
10.0.100.4  RE priority = 0 cluster = 0  req-sent 4  req-failed 0  repl-recv 4 (00:16:55 ago)
10.0.100.5  RE priority = 0 cluster = 0  req-sent 4  req-failed 0  repl-recv 4 (00:16:55 ago)

Viewing the process from the hub's perspective, we see the registration requests come in from the spoke NBMA addresses. Inside that message are the VPN/NBMA mappings, and this allows the hub to create a dynamic NHRP entry for both multicast and unicast traffic. The hub sends the registration reply so the spoke knows its request was processed.

! CSR5
NHRP: Receive Registration Request via Tunnel100 vrf 0, packet size: 108
NHRP: Tunnels gave us pak src: 10.7.9.9
NHRP: Adding Tunnel Endpoints (VPN: 10.0.100.9, NBMA: 10.7.9.9)
NHRP: NHRP subblock already exists for Tunnel Endpoints (VPN: 10.0.100.9, NBMA: 10.7.9.9)
NHRP: Cache already has a subblock node attached for Tunnel Endpoints (VPN: 10.0.100.9, NBMA: 10.7.9.9)
NHRP: Attempting to send packet through interface Tunnel100 via DEST dst 10.0.100.9
NHRP: Send Registration Reply via Tunnel100 vrf 0, packet size: 128
 src: 10.0.100.5, dst: 10.0.100.9
NHRP: Encapsulation succeeded. Sending NHRP Control Packet NBMA Address: 10.7.9.9
NHRP: 152 bytes out Tunnel100

CSR5#show ip nhrp brief
   Target             Via            NBMA           Mode      Intfc   Claimed
10.0.100.9/32         10.0.100.9     10.7.9.9       dynamic   Tu100   <   >
10.0.100.10/32        10.0.100.10    10.7.10.10     dynamic   Tu100   <   >

CSR5#show ip nhrp multicast
  I/F        NBMA address
Tunnel100    10.7.10.10     Flags: dynamic    (Enabled)
Tunnel100    10.7.9.9       Flags: dynamic    (Enabled)

Because NHRP is a component of DMVPN, we can also use the DMVPN-derived debugging. These provide significant additional detail over the NHRP debugs. The same process is shown again, and I've focused on the actual NHRP registration packet. The details of the registration packet are displayed using this debug, so checking for authentication mismatches or NAT-traversal issues (with IPSec) is simplified. The command used is "debug dmvpn all nhrp". The new F, M, and C fields represent several new concepts. “F” is the fixed field, specifying address-family, version, and type. “M” is the mandatory information field, primarily used by the hub to construct dynamic mappings to spokes that register. “C” is the client information element (CIE) field that can be added as NHS speakers need to communicate information back and forth. These three fields are shown in green just for identification purposes. CSR9 sends a registration request to CSR5, and CSR5 responds with a registration reply. ! CSR9 NHRP: Attempting to send packet through interface Tunnel100 via DEST dst 10.0.100.5 NHRP: Send Registration Request via Tunnel100 vrf 0, packet size: 108 src: 10.0.100.9, dst: 10.0.100.5 (F) afn: AF_IP(1), type: IP(800), hop: 255, ver: 1 shtl: 4(NSAP), sstl: 0(NSAP) pktsz: 108 extoff: 52 (M) flags: "unique nat ", reqid: 70 src NBMA: 10.7.9.9 src protocol: 10.0.100.9, dst protocol: 10.0.100.5 (C-1) code: no error(0) prefix: 32, mtu: 9976, hd_time: 7200


addr_len: 0(NSAP), subaddr_len: 0(NSAP), proto_len: 0, pref: 0 Responder Address Extension(3): Forward Transit NHS Record Extension(4): Reverse Transit NHS Record Extension(5): Authentication Extension(7): type:Cleartext(1), data:NHRPAUTH NAT address Extension(9): (C-1) code: no error(0) prefix: 32, mtu: 9976, hd_time: 0 addr_len: 4(NSAP), subaddr_len: 0(NSAP), proto_len: 4, pref: 0 client NBMA: 10.5.6.5 client protocol: 10.0.100.5 NHRP: Setting 'used' flag on cache entry with nhop: 10.0.100.5 NHRP: NHRP successfully mapped '10.0.100.5' to NBMA 10.5.6.5 NHRP: Encapsulation succeeded. Sending NHRP Control Packet NBMA Address: 10.5.6.5 NHRP: 132 bytes out Tunnel100 ! CSR5 NHRP: Receive Registration Request via Tunnel100 vrf 0, packet size: 108 (F) afn: AF_IP(1), type: IP(800), hop: 255, ver: 1 shtl: 4(NSAP), sstl: 0(NSAP) pktsz: 108 extoff: 52 (M) flags: "unique nat ", reqid: 72 src NBMA: 10.7.9.9 src protocol: 10.0.100.9, dst protocol: 10.0.100.5 (C-1) code: no error(0) prefix: 32, mtu: 9976, hd_time: 7200 addr_len: 0(NSAP), subaddr_len: 0(NSAP), proto_len: 0, pref: 0 Responder Address Extension(3): Forward Transit NHS Record Extension(4): Reverse Transit NHS Record Extension(5): Authentication Extension(7): type:Cleartext(1), data:NHRPAUTH NAT address Extension(9): (C-1) code: no error(0) prefix: 32, mtu: 9976, hd_time: 0 addr_len: 4(NSAP), subaddr_len: 0(NSAP), proto_len: 4, pref: 0 client NBMA: 10.5.6.5 client protocol: 10.0.100.5 NHRP: netid_in = 100, to_us = 1 NHRP: Tunnels gave us pak src: 10.7.9.9 NHRP: Tunnel100: Cache update for target 10.0.100.9/32 next-hop 10.0.100.9 10.7.9.9 NHRP: Adding Tunnel Endpoints (VPN: 10.0.100.9, NBMA: 10.7.9.9) NHRP: NHRP subblock already exists for Tunnel Endpoints (VPN: 10.0.100.9, NBMA: 10.7.9.9) NHRP: Cache already has a subblock node attached for Tunnel Endpoints (VPN: 10.0.100.9, NBMA: 10.7.9.9)


NHRP: Updating our cache with NBMA: 10.5.6.5, NBMA_ALT: 10.5.6.5 NHRP: New mandatory length: 32 NHRP: Attempting to send packet through interface Tunnel100 via DEST dst 10.0.100.9 NHRP: Send Registration Reply via Tunnel100 vrf 0, packet size: 128 src: 10.0.100.5, dst: 10.0.100.9 (F) afn: AF_IP(1), type: IP(800), hop: 255, ver: 1 shtl: 4(NSAP), sstl: 0(NSAP) pktsz: 128 extoff: 52 (M) flags: "unique nat ", reqid: 72 src NBMA: 10.7.9.9 src protocol: 10.0.100.9, dst protocol: 10.0.100.5 (C-1) code: no error(0) prefix: 32, mtu: 9976, hd_time: 7200 addr_len: 0(NSAP), subaddr_len: 0(NSAP), proto_len: 0, pref: 0 Responder Address Extension(3): (C) code: no error(0) prefix: 32, mtu: 9976, hd_time: 7200 addr_len: 4(NSAP), subaddr_len: 0(NSAP), proto_len: 4, pref: 0 client NBMA: 10.5.6.5 client protocol: 10.0.100.5 Forward Transit NHS Record Extension(4): Reverse Transit NHS Record Extension(5): Authentication Extension(7): type:Cleartext(1), data:NHRPAUTH NAT address Extension(9): (C-1) code: no error(0) prefix: 32, mtu: 9976, hd_time: 0 addr_len: 4(NSAP), subaddr_len: 0(NSAP), proto_len: 4, pref: 0 client NBMA: 10.5.6.5 client protocol: 10.0.100.5 NHRP: Setting 'used' flag on cache entry with nhop: 10.0.100.9 NHRP: NHRP successfully mapped '10.0.100.9' to NBMA 10.7.9.9 NHRP: Encapsulation succeeded. Sending NHRP Control Packet NBMA Address: 10.7.9.9 NHRP: 152 bytes out Tunnel100
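The "F" stanza in these debugs corresponds to the 20-byte fixed header that RFC 2332 defines for every NHRP packet. As a rough illustration, the Python sketch below packs and unpacks that header with values taken from the registration request shown earlier (protocol type 0x0800, hop count 255, packet size 108, extension offset 52). The function name `parse_fixed` is hypothetical, the checksum is a zero placeholder rather than a computed value, and the `shtl`/`sstl` bytes are shown raw without decoding their type bits.

```python
import struct

# RFC 2332 fixed part, in order: afn, pro.type, pro.snap, hopcnt, pktsz,
# chksum, extoff, op.version, op.type, shtl, sstl (20 bytes, network order).
FIXED = struct.Struct("!HH5sBHHHBBBB")

OP_TYPES = {
    1: "Resolution Request",
    2: "Resolution Reply",
    3: "Registration Request",
    4: "Registration Reply",
    5: "Purge Request",
    6: "Purge Reply",
    7: "Error Indication",
}


def parse_fixed(data):
    """Decode the NHRP fixed header from the first 20 bytes of data."""
    (afn, pro_type, _snap, hop, pktsz, chksum,
     extoff, ver, op, shtl, sstl) = FIXED.unpack(data[:FIXED.size])
    return {
        "afn": afn,
        "type": hex(pro_type),
        "hop": hop,
        "pktsz": pktsz,
        "chksum": chksum,
        "extoff": extoff,
        "ver": ver,
        "op": OP_TYPES.get(op, f"unknown({op})"),
        "shtl": shtl,
        "sstl": sstl,
    }


# Values mirror the (F) stanza of CSR9's registration request; chksum=0 is a placeholder.
sample = FIXED.pack(1, 0x0800, b"\x00" * 5, 255, 108, 0, 52, 1, 3, 4, 0)
print(parse_fixed(sample))
```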

Below is an example of mismatched NHRP authentication. The NHS doesn't actually send a message back to say the authentication failed, but it fails to send a registration response. CSR9 continues to retransmit the message, which could hint at an authentication issue. Before continuing, we correct the authentication issue. This test was simply to show that the debug can help debug those issues. ! CSR9 NHRP: Send Registration Request via Tunnel100 vrf 0, packet size: 107 src: 10.0.100.9, dst: 10.0.100.4 (F) afn: AF_IP(1), type: IP(800), hop: 255, ver: 1 shtl: 4(NSAP), sstl: 0(NSAP) pktsz: 107 extoff: 52


(M) flags: "unique nat ", reqid: 77 src NBMA: 10.7.9.9 src protocol: 10.0.100.9, dst protocol: 10.0.100.4 (C-1) code: no error(0) prefix: 32, mtu: 9976, hd_time: 7200 addr_len: 0(NSAP), subaddr_len: 0(NSAP), proto_len: 0, pref: 0 Responder Address Extension(3): Forward Transit NHS Record Extension(4): Reverse Transit NHS Record Extension(5): Authentication Extension(7): type:Cleartext(1), data:BADAUTH NAT address Extension(9): (C-1) code: no error(0) prefix: 32, mtu: 9976, hd_time: 0 addr_len: 4(NSAP), subaddr_len: 0(NSAP), proto_len: 4, pref: 0 client NBMA: 10.4.12.4 client protocol: 10.0.100.4 NHRP: NHRP successfully mapped '10.0.100.4' to NBMA 10.4.12.4 NHRP: Encapsulation succeeded. Sending NHRP Control Packet NBMA Address: 10.4.12.4 NHRP: 131 bytes out Tunnel100 NHRP-RATE: Retransmitting Registration Request for 10.0.100.4, reqid 77, (retrans ivl 8 sec) ! CSR5 NHRP: Receive Registration Request via Tunnel100 vrf 0, packet size: 107 (F) afn: AF_IP(1), type: IP(800), hop: 255, ver: 1 shtl: 4(NSAP), sstl: 0(NSAP) pktsz: 107 extoff: 52 (M) flags: "unique nat ", reqid: 78 src NBMA: 10.7.9.9 src protocol: 10.0.100.9, dst protocol: 10.0.100.5 (C-1) code: no error(0) prefix: 32, mtu: 9976, hd_time: 7200 addr_len: 0(NSAP), subaddr_len: 0(NSAP), proto_len: 0, pref: 0 Responder Address Extension(3): Forward Transit NHS Record Extension(4): Reverse Transit NHS Record Extension(5): Authentication Extension(7): type:Cleartext(1), data:BADAUTH NAT address Extension(9): (C-1) code: no error(0) prefix: 32, mtu: 9976, hd_time: 0 addr_len: 4(NSAP), subaddr_len: 0(NSAP), proto_len: 4, pref: 0 client NBMA: 10.5.6.5 client protocol: 10.0.100.5
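The failure mode described above (no explicit rejection, only silence followed by retransmission) can be modeled in a few lines. This is a toy Python sketch, not the IOS implementation: the `Hub` class, `try_register` function, and retry count are all hypothetical, and the only behavior taken from the text is that a hub with a mismatched authentication string sends nothing back.

```python
class Hub:
    """Toy NHS that registers spokes only when the auth string matches."""

    def __init__(self, auth):
        self.auth = auth
        self.cache = {}  # VPN address -> NBMA address

    def register(self, vpn, nbma, auth):
        if auth != self.auth:
            return None  # silently ignored: no reply, no error message
        self.cache[vpn] = nbma
        return "Registration Reply"


def try_register(hub, vpn, nbma, auth, max_tries=3):
    """Return (requests_sent, registered) after retrying up to max_tries."""
    for attempt in range(1, max_tries + 1):
        if hub.register(vpn, nbma, auth) is not None:
            return attempt, True
    return max_tries, False


hub = Hub("NHRPAUTH")
print(try_register(hub, "10.0.100.9", "10.7.9.9", "BADAUTH"))
print(try_register(hub, "10.0.100.9", "10.7.9.9", "NHRPAUTH"))
```

From the spoke's side, the two cases differ only in whether a reply ever arrives, which is why a growing request counter with no replies is the practical hint of an authentication problem.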

We have also configured IPv6 in the overlay. DMVPN can be an IPv6 transition strategy to connect separated IPv6 islands. The NHRP syntax for IPv6 is nearly identical to that for IPv4 and is shown below; only the additional components for IPv6 are shown. Some commands require a prefix length (such as /128), unlike IPv4. EIGRP is used for IPv6 unicast routing. There are two significant configurations for EIGRP. Disabling split-horizon on the hubs, which are NBMA interfaces, is important to distribute updates to spokes. Making the spokes EIGRP stubs is a good design practice which prevents them from being transit routers and ensures they can never receive EIGRP queries for active prefixes.

! Generic hub (CSR5)
interface Tunnel100
 ipv6 address FE80::5 link-local
 ipv6 address 2001:10:0:100::5/64
 ipv6 nhrp authentication NHRPAUTH
 ipv6 nhrp map multicast dynamic
 ipv6 nhrp network-id 100
router eigrp VPN
 address-family ipv6 unicast autonomous-system 65000
  af-interface Tunnel100
   no split-horizon

! Generic spoke (CSR9)
interface Tunnel100
 ipv6 address FE80::9 link-local
 ipv6 address 2001:10:0:100::9/64
 ipv6 nhrp authentication NHRPAUTH
 ipv6 nhrp map 2001:10:0:100::4/128 10.4.12.4
 ipv6 nhrp map 2001:10:0:100::5/128 10.5.6.5
 ipv6 nhrp map multicast 10.4.12.4
 ipv6 nhrp map multicast 10.5.6.5
 ipv6 nhrp network-id 100
 ipv6 nhrp nhs 2001:10:0:100::4 priority 5
 ipv6 nhrp nhs 2001:10:0:100::5 priority 2

In cases where routers are behind the "stub", like CSR1 and XRv4, you can either leak routes through or summarize on the stub router. CSR10 uses the leak method while CSR9 uses the summary method. CSR2 only advertises connected routes as it has nothing to summarize nor any downstream EIGRP neighbors. This is a quick side-note and is not directly related to DMVPN, but general EIGRP design over NBMA architectures. These same concepts apply to IPv4 as well.

! CSR10
ipv6 prefix-list PL_IPV6_ALL seq 5 permit ::/0 le 128
route-map RM_EIGRP_LEAK_IPV6 permit 10
 match ipv6 address prefix-list PL_IPV6_ALL
router eigrp VPN
 address-family ipv6 unicast autonomous-system 65000
  eigrp stub connected leak-map RM_EIGRP_LEAK_IPV6

! CSR9
router eigrp VPN
 address-family ipv6 unicast autonomous-system 65000
  eigrp stub connected summary
  af-interface Tunnel100
   summary-address ::14:14:14:0/112
  af-interface GigabitEthernet2.594
   summary-address ::/0

! CSR2
router eigrp VPN
 address-family ipv6 unicast autonomous-system 65000
  eigrp stub connected
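The leak-map on CSR10 depends on the prefix-list entry `::/0 le 128` matching every IPv6 prefix. A minimal Python sketch of the usual prefix-list matching rule (the candidate must fall inside the base prefix, and its length must fall inside the ge/le window, or equal the base length when neither is given) makes that explicit. The function `prefix_list_permits` is a hypothetical illustration, not an IOS API.

```python
import ipaddress


def prefix_list_permits(candidate, base, ge=None, le=None):
    """Return True if candidate matches a 'permit base [ge N] [le N]' entry."""
    cand = ipaddress.ip_network(candidate)
    net = ipaddress.ip_network(base)
    # The candidate must be the same address family and inside the base prefix.
    if cand.version != net.version or not cand.subnet_of(net):
        return False
    if ge is None and le is None:
        # No ge/le keywords: only an exact prefix-length match is permitted.
        return cand.prefixlen == net.prefixlen
    low = ge if ge is not None else net.prefixlen
    high = le if le is not None else cand.max_prefixlen
    return low <= cand.prefixlen <= high


# "::/0 le 128" permits any IPv6 prefix of any length.
print(prefix_list_permits("2001:10:0:100::/64", "::/0", le=128))
# Without "le 128", "::/0" would match only the default route itself.
print(prefix_list_permits("2001:10:0:100::/64", "::/0"))
```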

The debugs for NHRP are identical between IPv4 and IPv6. CSR9 includes its public IPv6 address in the "M" stanza and its LL address in the first "C" stanza. ! CSR9 NHRP: Attempting to send packet through interface Tunnel100 via DEST dst 2001:10:0:100::5 NHRP: Send Registration Request via Tunnel100 vrf 0, packet size: 164 src: 2001:10:0:100::9, dst: 2001:10:0:100::5 (F) afn: AF_IP(1), type: IPv6(86DD), hop: 255, ver: 1 shtl: 4(NSAP), sstl: 0(NSAP) pktsz: 164 extoff: 96 (M) flags: "unique nat ", reqid: 94 src NBMA: 10.7.9.9 src protocol: 2001:10:0:100::9, dst protocol: 2001:10:0:100::5 (C-1) code: no error(0) prefix: 128, mtu: 9976, hd_time: 7200 addr_len: 4(NSAP), subaddr_len: 0(NSAP), proto_len: 16, pref: 0 client NBMA: 10.7.9.9 client protocol: FE80::9 Responder Address Extension(3): Forward Transit NHS Record Extension(4): Reverse Transit NHS Record Extension(5): Authentication Extension(7): type:Cleartext(1), data:NHRPAUTH NAT address Extension(9): (C-1) code: no error(0) prefix: 128, mtu: 9976, hd_time: 0 addr_len: 4(NSAP), subaddr_len: 0(NSAP), proto_len: 16, pref: 0 client NBMA: 10.5.6.5 client protocol: 2001:10:0:100::5

The hub receives this request and programs it in its NHRP cache.


! CSR5 NHRP: Receive Registration Request via Tunnel100 vrf 0, packet size: 164 (F) afn: AF_IP(1), type: IPv6(86DD), hop: 255, ver: 1 shtl: 4(NSAP), sstl: 0(NSAP) pktsz: 164 extoff: 96 (M) flags: "unique nat ", reqid: 94 src NBMA: 10.7.9.9 src protocol: 2001:10:0:100::9, dst protocol: 2001:10:0:100::5 (C-1) code: no error(0) prefix: 128, mtu: 9976, hd_time: 7200 addr_len: 4(NSAP), subaddr_len: 0(NSAP), proto_len: 16, pref: 0 client NBMA: 10.7.9.9 client protocol: FE80::9 [snip, same as above] CSR5#show ipv6 nhrp 2001:10:0:100::9/128 2001:10:0:100::9/128 via 2001:10:0:100::9 Tunnel100 created 00:10:21, expire 01:53:32 Type: dynamic, Flags: unique registered used nhop NBMA address: 10.7.9.9

The hub sends a reply message back towards the spoke. The "M" stanza remains unchanged, but there are some additional mappings inserted. The spoke NBMA is mapped to its LL IPv6 address, as is the hub's. We did not see this in IPv4 as LL addressing isn’t used the same way.

! CSR5
NHRP: Send Registration Reply via Tunnel100 vrf 0, packet size: 228
 src: 2001:10:0:100::5, dst: 2001:10:0:100::9
 (F) afn: AF_IP(1), type: IPv6(86DD), hop: 255, ver: 1
     shtl: 4(NSAP), sstl: 0(NSAP)
     pktsz: 228 extoff: 128
 (M) flags: "unique nat ", reqid: 94
     src NBMA: 10.7.9.9
     src protocol: 2001:10:0:100::9, dst protocol: 2001:10:0:100::5
 (C-1) code: no error(0)
       prefix: 128, mtu: 9976, hd_time: 7200
       addr_len: 4(NSAP), subaddr_len: 0(NSAP), proto_len: 16, pref: 0
       client NBMA: 10.7.9.9
       client protocol: FE80::9
 (C-2) code: no error(0)
       prefix: 128, mtu: 9976, hd_time: 7200
       addr_len: 4(NSAP), subaddr_len: 0(NSAP), proto_len: 16, pref: 0
       client NBMA: 10.5.6.5
       client protocol: FE80::5
Responder Address Extension(3):
 (C) code: no error(0)
       prefix: 128, mtu: 9976, hd_time: 7200
       addr_len: 4(NSAP), subaddr_len: 0(NSAP), proto_len: 16, pref: 0
       client NBMA: 10.5.6.5
       client protocol: 2001:10:0:100::5
Forward Transit NHS Record Extension(4):
Reverse Transit NHS Record Extension(5):
Authentication Extension(7):
  type:Cleartext(1), data:NHRPAUTH
NAT address Extension(9):
 (C-1) code: no error(0)
       prefix: 128, mtu: 9976, hd_time: 0
       addr_len: 4(NSAP), subaddr_len: 0(NSAP), proto_len: 16, pref: 0
       client NBMA: 10.5.6.5
       client protocol: 2001:10:0:100::5

CSR9 receives the reply, accepts it, and programs it to its NHRP cache. It creates two entries, one for CSR5's real VPN address, and one for its LL address, as both are mapped to the same IPv4 NBMA address. The LL address has a special NHRP flag to identify it as such. It is interesting to note that the "type" of the LL mapping is static, although it was learned. Since it is bound to a hub that was a static mapping, the two are considered equivalent enough to share these attributes. ! CSR9 NHRP: Receive Registration Reply via Tunnel100 vrf 0, packet size: 228 (F) afn: AF_IP(1), type: IPv6(86DD), hop: 255, ver: 1 shtl: 4(NSAP), sstl: 0(NSAP) pktsz: 228 extoff: 128 (M) flags: "unique nat ", reqid: 90 src NBMA: 10.7.9.9 src protocol: 2001:10:0:100::9, dst protocol: 2001:10:0:100::5 [snip, same as above] CSR9#show ipv6 nhrp 2001:10:0:100::5/128 2001:10:0:100::5/128 via 2001:10:0:100::5 Tunnel100 created 00:10:22, never expire Type: static, Flags: used NBMA address: 10.5.6.5 CSR9#show ipv6 nhrp FE80::5/128 FE80::5/128 via FE80::5 Tunnel100 created 00:10:28, never expire Type: static, Flags: nhs-ll NBMA address: 10.5.6.5

At this point, the IPv6 capability in the second hub at CSR4 has not been configured yet. Below is an example of output that indicates one fully functional hub, and one dysfunctional one. The NHS at CSR4, from CSR9's perspective, is expecting replies but not receiving them; 6 requests were sent but 0 replies were received.

CSR9#show ipv6 nhrp nhs detail
Legend: E=Expecting replies, R=Responding, W=Waiting
Tunnel100:
2001:10:0:100::4  E priority = 5 cluster = 0  req-sent 6  req-failed 0  repl-recv 0
2001:10:0:100::5  RE priority = 2 cluster = 0  req-sent 1  req-failed 0  repl-recv 1 (00:00:45 ago)

Pending Registration Requests:
Registration Request: Reqid 85, Ret 32  NHS 2001:10:0:100::4 expired (Tu100)

Quickly adding the IPv6 NHRP hub configuration to CSR4 (similar to CSR5 and not shown), we can see the registration is successful on CSR9. Also notice that CSR5 has a better (lower) priority than CSR4, which should come into play with phases 2 and 3.

CSR9#show ipv6 nhrp nhs detail
Legend: E=Expecting replies, R=Responding, W=Waiting
Tunnel100:
2001:10:0:100::4  RE priority = 5 cluster = 0  req-sent 42  req-failed 0  repl-recv 1 (00:00:13 ago)
2001:10:0:100::5  RE priority = 2 cluster = 0  req-sent 1  req-failed 0  repl-recv 1 (00:34:37 ago)
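Checking the E/RE flags by eye does not scale past a few spokes, so a small script can flag any NHS that is not responding. The Python sketch below assumes each NHS entry from "show ip nhrp nhs detail" or "show ipv6 nhrp nhs detail" sits on a single line in the form `<address> <flags> priority = N cluster = N req-sent N req-failed N repl-recv N`; the regular expression and the function name `unresponsive_nhs` are hypothetical and would need adjusting if a given software release lays the fields out differently.

```python
import re

# One NHS record per line; flags are drawn from the legend letters E, R, W.
NHS_LINE = re.compile(
    r"^(?P<addr>\S+)\s+(?P<flags>[ERW]+)\s+priority = (?P<prio>\d+)"
    r"\s+cluster = (?P<cluster>\d+)\s+req-sent (?P<sent>\d+)"
    r"\s+req-failed (?P<failed>\d+)\s+repl-recv (?P<recv>\d+)"
)


def unresponsive_nhs(output):
    """Return addresses of NHS entries whose flags lack 'R' (Responding)."""
    down = []
    for line in output.splitlines():
        match = NHS_LINE.match(line.strip())
        if match and "R" not in match.group("flags"):
            down.append(match.group("addr"))
    return down


sample = """\
Legend: E=Expecting replies, R=Responding, W=Waiting
Tunnel100:
2001:10:0:100::4 E priority = 5 cluster = 0 req-sent 6 req-failed 0 repl-recv 0
2001:10:0:100::5 RE priority = 2 cluster = 0 req-sent 1 req-failed 0 repl-recv 1 (00:00:45 ago)
"""
print(unresponsive_nhs(sample))
```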

Another way to quickly verify that all spokes have registered is by checking the IPv6 NHRP caches on the hub in brief form. CSR4 has 3 spokes while CSR5 only has two; recall that CSR2 is only in the DMVPN connected to CSR4, using a P2P tunnel for connectivity to CSR5. The hubs show the spoke LL addresses as dynamic mappings as they were contained inside the NHRP registration requests initially.

CSR5#show ipv6 nhrp brief
   Target                    Via
2001:10:0:100::9/128         2001:10:0:100::9
   NBMA: 10.7.9.9            Mode: dynamic  Intfc: Tu100  Claimed:
2001:10:0:100::10/128        2001:10:0:100::10
   NBMA: 10.7.10.10          Mode: dynamic  Intfc: Tu100  Claimed:
FE80::9/128                  2001:10:0:100::9
   NBMA: 10.7.9.9            Mode: dynamic  Intfc: Tu100  Claimed:
FE80::10/128                 2001:10:0:100::10
   NBMA: 10.7.10.10          Mode: dynamic  Intfc: Tu100  Claimed:

CSR4#show ipv6 nhrp brief
   Target                    Via
2001:10:0:100::2/128         2001:10:0:100::2
   NBMA: 10.8.2.2            Mode: dynamic  Intfc: Tu100  Claimed:
2001:10:0:100::9/128         2001:10:0:100::9
   NBMA: 10.7.9.9            Mode: dynamic  Intfc: Tu100  Claimed:
2001:10:0:100::10/128        2001:10:0:100::10
   NBMA: 10.7.10.10          Mode: dynamic  Intfc: Tu100  Claimed:
FE80::2/128                  2001:10:0:100::2
   NBMA: 10.8.2.2            Mode: dynamic  Intfc: Tu100  Claimed:
FE80::9/128                  2001:10:0:100::9
   NBMA: 10.7.9.9            Mode: dynamic  Intfc: Tu100  Claimed:
FE80::10/128                 2001:10:0:100::10
   NBMA: 10.7.10.10          Mode: dynamic  Intfc: Tu100  Claimed:

The only time phase 1 NHRP resolution requests are issued by spokes is when they are trying to directly reach one another's VPN addresses when the route recursion points to a connected route. This also only occurs when there are multiple hubs and the spokes use mGRE interfaces (not P2P interfaces). For example, CSR9 tries to ping CSR10's VPN address along with its loopback. In IPv4, this VPN address is an OSPF /32 route given the P2MP network type used. The loopback is a normal OSPF /32 route. Both of these follow the regular routing rules through the hub, and there are no reachability issues. This is a very rare case, but the process is near-identical for other DMVPN phases, so we analyze it here.

CSR9#show ip cef 10.0.100.10
10.0.100.10/32
  nexthop 10.0.100.4 Tunnel100
  nexthop 10.0.100.5 Tunnel100

CSR9#show ip cef 10.10.10.10
10.10.10.10/32
  nexthop 10.0.100.4 Tunnel100
  nexthop 10.0.100.5 Tunnel100

CSR9#ping 10.0.100.10
Type escape sequence to abort.
Sending 5, 100-byte ICMP Echos to 10.0.100.10, timeout is 2 seconds:
!!!!!
Success rate is 100 percent (5/5), round-trip min/avg/max = 29/34/42 ms

CSR9#ping 10.10.10.10
Type escape sequence to abort.
Sending 5, 100-byte ICMP Echos to 10.10.10.10, timeout is 2 seconds:
!!!!!
Success rate is 100 percent (5/5), round-trip min/avg/max = 16/33/46 ms

Doing the same test to CSR2's VPN address and loopback is different. The path to the loopback is a normal OSPF /32 route, but the path to the VPN endpoint is the connected /24 route. Because of the NBMA network, CSR9 does not have direct connectivity to this. It needs to query the hub for resolution.

CSR9#show ip cef 10.0.100.2
10.0.100.0/24
  attached to Tunnel100

CSR9#show ip cef 2.2.2.2
2.2.2.2/32
  nexthop 10.0.100.4 Tunnel100
  nexthop 10.0.100.5 Tunnel100
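The difference between the two cases comes down to longest-prefix match: a /32 learned from OSPF points at the hubs, while a destination covered only by the connected /24 is "attached" to the tunnel and therefore needs an NBMA address from NHRP. The Python sketch below illustrates that decision with a toy FIB built from the CEF entries above; the `lookup` and `needs_nhrp_resolution` names and the dictionary layout are hypothetical and do not reflect how CEF is implemented.

```python
import ipaddress


def lookup(fib, destination):
    """Return (prefix, next_hops) for the longest matching prefix, or None."""
    dest = ipaddress.ip_address(destination)
    matches = [
        (ipaddress.ip_network(prefix), hops)
        for prefix, hops in fib.items()
        if dest in ipaddress.ip_network(prefix)
    ]
    if not matches:
        return None
    net, hops = max(matches, key=lambda entry: entry[0].prefixlen)
    return str(net), hops


def needs_nhrp_resolution(fib, destination):
    """True when the best route is the connected (attached) tunnel subnet."""
    result = lookup(fib, destination)
    return result is not None and result[1] == "attached"


# Toy FIB modeled on CSR9's CEF output in this section.
hubs = ["10.0.100.4", "10.0.100.5"]
fib = {
    "10.0.100.0/24": "attached",
    "10.0.100.10/32": hubs,
    "10.10.10.10/32": hubs,
    "2.2.2.2/32": hubs,
}
for target in ("10.0.100.10", "10.0.100.2", "2.2.2.2"):
    print(target, lookup(fib, target), needs_nhrp_resolution(fib, target))
```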


When it queries the hub, only CSR4 can respond since CSR2 is not running DMVPN with CSR5 at all (it uses a P2P tunnel). The resolution request is shown below. This is CSR9 asking CSR4 "What is the NBMA address for VPN address 10.0.100.2?" I usually look at the “M” field first since it contains 3 of the 4 key pieces of information, and the resolution request seeks to find the fourth.

! CSR9
NHRP: Send Resolution Request via Tunnel100 vrf 0, packet size: 88
 src: 10.0.100.9, dst: 10.0.100.2
 (F) afn: AF_IP(1), type: IP(800), hop: 255, ver: 1
     shtl: 4(NSAP), sstl: 0(NSAP)
     pktsz: 88 extoff: 52
 (M) flags: "router auth src-stable nat ", reqid: 8
     src NBMA: 10.7.9.9
     src protocol: 10.0.100.9, dst protocol: 10.0.100.2
 (C-1) code: no error(0)
       prefix: 32, mtu: 9976, hd_time: 7200
       addr_len: 0(NSAP), subaddr_len: 0(NSAP), proto_len: 0, pref: 0

The hub receives the resolution request, performs a route lookup, and then forwards the request towards CSR2. This is similar to what a LISP MR/MS does when receiving map-requests from ITRs. The NHS inserts a "transit NHS record extension" into the packet which includes its local VPN/NBMA mapping. This simply means that an NHS was in the resolution transit path from source to destination (the path of the resolution request). Although technically CSR4 knows the NBMA mapping for 10.0.100.2, it forwards the request to CSR2 for authoritative resolution. ! CSR4 NHRP: Receive Resolution Request via Tunnel100 vrf 0, packet size: 88 (F) afn: AF_IP(1), type: IP(800), hop: 255, ver: 1 shtl: 4(NSAP), sstl: 0(NSAP) pktsz: 88 extoff: 52 (M) flags: "router auth src-stable nat ", reqid: 8 src NBMA: 10.7.9.9 src protocol: 10.0.100.9, dst protocol: 10.0.100.2 (C-1) code: no error(0) prefix: 32, mtu: 9976, hd_time: 7200 addr_len: 0(NSAP), subaddr_len: 0(NSAP), proto_len: 0, pref: 0 NHRP: Route lookup for destination 10.0.100.2 in (0x0) yielded interface Tunnel100, prefixlen 24 NHRP-ATTR: In nhrp_recv_resolution_request NHRP Resolution Request packet is forwarded to 10.0.100.2 NHRP: Attempting to forward to destination: 10.0.100.2 NHRP: Forwarding: NHRP SAS picked source: 10.0.100.4 for destination: 10.0.100.2


NHRP: Attempting to send packet through interface Tunnel100 via DEST dst 10.0.100.2 NHRP: Forwarding Resolution Request via Tunnel100 vrf 0, packet size: 108 src: 10.0.100.4, dst: 10.0.100.2 (F) afn: AF_IP(1), type: IP(800), hop: 254, ver: 1 shtl: 4(NSAP), sstl: 0(NSAP) pktsz: 108 extoff: 52 (M) flags: "router auth src-stable nat ", reqid: 8 src NBMA: 10.7.9.9 src protocol: 10.0.100.9, dst protocol: 10.0.100.2 (C-1) code: no error(0) prefix: 32, mtu: 9976, hd_time: 7200 addr_len: 0(NSAP), subaddr_len: 0(NSAP), proto_len: 0, pref: 0 Responder Address Extension(3): Forward Transit NHS Record Extension(4): (C-1) code: no error(0) prefix: 32, mtu: 9976, hd_time: 7200 addr_len: 4(NSAP), subaddr_len: 0(NSAP), proto_len: 4, pref: 0 client NBMA: 10.4.12.4 client protocol: 10.0.100.4

CSR2 receives the modified resolution request from CSR9 (via CSR4 in the middle) and performs a route lookup. CSR2 determines it's a local route and generates a resolution reply for CSR9, using the NHS as transit.

! CSR2
NHRP: Receive Resolution Request via Tunnel100 vrf 0, packet size: 108
 (F) afn: AF_IP(1), type: IP(800), hop: 254, ver: 1
     shtl: 4(NSAP), sstl: 0(NSAP)
     pktsz: 108 extoff: 52
 (M) flags: "router auth src-stable nat ", reqid: 9
     src NBMA: 10.7.9.9
     src protocol: 10.0.100.9, dst protocol: 10.0.100.2
 (C-1) code: no error(0)
       prefix: 32, mtu: 9976, hd_time: 7200
       addr_len: 0(NSAP), subaddr_len: 0(NSAP), proto_len: 0, pref: 0
Responder Address Extension(3):
Forward Transit NHS Record Extension(4):
 (C-1) code: no error(0)
       prefix: 32, mtu: 9976, hd_time: 7200
       addr_len: 4(NSAP), subaddr_len: 0(NSAP), proto_len: 4, pref: 0
       client NBMA: 10.4.12.4
       client protocol: 10.0.100.4
NHRP: Route lookup for destination 10.0.100.2 in (0x0) yielded interface Tunnel100, prefixlen
NHRP: Request was to us. Process the NHRP Resolution Request.


At this point, CSR2 has enough information to add CSR9's NBMA address to its local cache. Unlike LISP, the entire resolution process does not have to happen in both directions. CSR2 marks this entry as "implicit" because an explicit resolution was not triggered; it was able to glean the information from a resolution request to which it was responding.

! CSR2
NHRP: Tunnel100: Cache update for target 10.0.100.9/32 next-hop 10.0.100.9 10.7.9.9

CSR2#show ip nhrp 10.0.100.9
10.0.100.9/32 via 10.0.100.9
   Tunnel100 created 00:18:03, expire 01:58:31
   Type: dynamic, Flags: router implicit nhop
   NBMA address: 10.7.9.9

The first line of the debug below is interesting and details why Tunnel25 was not used for NHRP signaling; it is not part of the DMVPN network and should not be used for NHRP-resolved routes. CSR2 adds a new extension for "responder address", which makes sense since CSR2 originated the resolution reply. This is where the NBMA mapping is encoded and ultimately how CSR9's initial question is answered. Notice this information is in the first CIE field also.

! CSR2
NHRP: Pak out Tunnel25 would leave logical NBMA network
NHRP: Attempting to send packet via NHS 10.0.100.4
NHRP: Send Resolution Reply via Tunnel100 vrf 0, packet size: 136
 src: 10.0.100.2, dst: 10.0.100.4
 (F) afn: AF_IP(1), type: IP(800), hop: 255, ver: 1
     shtl: 4(NSAP), sstl: 0(NSAP)
     pktsz: 136 extoff: 60
 (M) flags: "router auth dst-stable unique src-stable nat ", reqid: 9
     src NBMA: 10.7.9.9
     src protocol: 10.0.100.9, dst protocol: 10.0.100.2
 (C-1) code: no error(0)
       prefix: 32, mtu: 9976, hd_time: 7200
       addr_len: 4(NSAP), subaddr_len: 0(NSAP), proto_len: 4, pref: 0
       client NBMA: 10.8.2.2
       client protocol: 10.0.100.2
Responder Address Extension(3):
 (C) code: no error(0)
       prefix: 32, mtu: 9976, hd_time: 7200
       addr_len: 4(NSAP), subaddr_len: 0(NSAP), proto_len: 4, pref: 0
       client NBMA: 10.8.2.2
       client protocol: 10.0.100.2
Forward Transit NHS Record Extension(4):
 (C-1) code: no error(0)
       prefix: 32, mtu: 9976, hd_time: 7200
       addr_len: 4(NSAP), subaddr_len: 0(NSAP), proto_len: 4, pref: 0
       client NBMA: 10.4.12.4
       client protocol: 10.0.100.4

CSR4 receives the resolution reply from CSR2 and forwards it back to CSR9. It also adds a "reverse transit NHS record" to show that the NHS was in the transit path in the reverse (reply) direction as well.

! CSR4
NHRP: Receive Resolution Reply via Tunnel100 vrf 0, packet size: 136
 (F) afn: AF_IP(1), type: IP(800), hop: 255, ver: 1
     shtl: 4(NSAP), sstl: 0(NSAP)
     pktsz: 136 extoff: 60
 (M) flags: "router auth dst-stable unique src-stable nat ", reqid: 8
     src NBMA: 10.7.9.9
     src protocol: 10.0.100.9, dst protocol: 10.0.100.2
 (C-1) code: no error(0)
       prefix: 32, mtu: 9976, hd_time: 7200
       addr_len: 4(NSAP), subaddr_len: 0(NSAP), proto_len: 4, pref: 0
       client NBMA: 10.8.2.2
       client protocol: 10.0.100.2
Responder Address Extension(3):
 (C) code: no error(0)
       prefix: 32, mtu: 9976, hd_time: 7200
       addr_len: 4(NSAP), subaddr_len: 0(NSAP), proto_len: 4, pref: 0
       client NBMA: 10.8.2.2
       client protocol: 10.0.100.2
Forward Transit NHS Record Extension(4):
 (C-1) code: no error(0)
       prefix: 32, mtu: 9976, hd_time: 7200
       addr_len: 4(NSAP), subaddr_len: 0(NSAP), proto_len: 4, pref: 0
       client NBMA: 10.4.12.4
       client protocol: 10.0.100.4
NHRP: Forwarding Resolution Reply via Tunnel100 vrf 0, packet size: 156
 src: 10.0.100.4, dst: 10.0.100.9
 (F) afn: AF_IP(1), type: IP(800), hop: 254, ver: 1
     shtl: 4(NSAP), sstl: 0(NSAP)
     pktsz: 156 extoff: 60
 (M) flags: "router auth dst-stable unique src-stable nat ", reqid: 8
     src NBMA: 10.7.9.9
     src protocol: 10.0.100.9, dst protocol: 10.0.100.2
 (C-1) code: no error(0)
       prefix: 32, mtu: 9976, hd_time: 7200
       addr_len: 4(NSAP), subaddr_len: 0(NSAP), proto_len: 4, pref: 0
       client NBMA: 10.8.2.2
       client protocol: 10.0.100.2
Responder Address Extension(3):
 (C) code: no error(0)
       prefix: 32, mtu: 9976, hd_time: 7200
       addr_len: 4(NSAP), subaddr_len: 0(NSAP), proto_len: 4, pref: 0
       client NBMA: 10.8.2.2
       client protocol: 10.0.100.2
Forward Transit NHS Record Extension(4):
 (C-1) code: no error(0)
       prefix: 32, mtu: 9976, hd_time: 7200
       addr_len: 4(NSAP), subaddr_len: 0(NSAP), proto_len: 4, pref: 0
       client NBMA: 10.4.12.4
       client protocol: 10.0.100.4
Reverse Transit NHS Record Extension(5):
 (C-1) code: no error(0)
       prefix: 32, mtu: 9976, hd_time: 7200
       addr_len: 4(NSAP), subaddr_len: 0(NSAP), proto_len: 4, pref: 0
       client NBMA: 10.4.12.4
       client protocol: 10.0.100.4

CSR9 receives this packet which is full of information. The original 'M' stanza contains three out of the four key pieces of information: source VPN, source NBMA, and destination VPN. This was originated by CSR9 in the first place. The entire purpose of this entire process was to resolve the destination NBMA to reach CSR2. The following stanza (CIE) contains the "answer" to that query, which is 10.8.2.2 mapped to VPN address 10.0.100.2. The remaining fields are used internally by NHRP for tracking from which NHS this resolution was proxied, among other things. ! CSR9 NHRP: Receive Resolution Reply via Tunnel100 vrf 0, packet size: 156 (F) afn: AF_IP(1), type: IP(800), hop: 254, ver: 1 shtl: 4(NSAP), sstl: 0(NSAP) pktsz: 156 extoff: 60 (M) flags: "router auth dst-stable unique src-stable nat ", reqid: 8 src NBMA: 10.7.9.9 src protocol: 10.0.100.9, dst protocol: 10.0.100.2 (C-1) code: no error(0) prefix: 32, mtu: 9976, hd_time: 7200 addr_len: 4(NSAP), subaddr_len: 0(NSAP), proto_len: 4, pref: 0 client NBMA: 10.8.2.2 client protocol: 10.0.100.2 Responder Address Extension(3): (C) code: no error(0) prefix: 32, mtu: 9976, hd_time: 7200 addr_len: 4(NSAP), subaddr_len: 0(NSAP), proto_len: 4, pref: 0 client NBMA: 10.8.2.2 client protocol: 10.0.100.2 Forward Transit NHS Record Extension(4): (C-1) code: no error(0) prefix: 32, mtu: 9976, hd_time: 7200 addr_len: 4(NSAP), subaddr_len: 0(NSAP), proto_len: 4, pref: 0 client NBMA: 10.4.12.4 client protocol: 10.0.100.4 Reverse Transit NHS Record Extension(5):


(C-1) code: no error(0) prefix: 32, mtu: 9976, hd_time: 7200 addr_len: 4(NSAP), subaddr_len: 0(NSAP), proto_len: 4, pref: 0 client NBMA: 10.4.12.4 client protocol: 10.0.100.4

CSR9 processes the received resolution reply and adds this information to its NHRP cache. Notice the word "implicit" is absent since CSR9 sent an explicit resolution request to the NHS, CSR4.

! CSR9
NHRP: Tunnel100: Cache update for target 10.0.100.2/32 next-hop 10.0.100.2 10.8.2.2
NHRP: Adding Tunnel Endpoints (VPN: 10.0.100.2, NBMA: 10.8.2.2)
NHRP: Dequeue request of type Resolution Request for 10.0.100.2, reqid 8, netid 100

CSR9#show ip nhrp 10.0.100.2
10.0.100.2/32 via 10.0.100.2
   Tunnel100 created 00:00:33, expire 01:59:26
   Type: dynamic, Flags: router nhop
   NBMA address: 10.8.2.2
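As a quick sanity check on the cache entry above, the "created" and "expire" timers should add up to roughly the holding time carried in the reply (hd_time: 7200). The snippet below is only a small sketch of that arithmetic; the helper name `to_seconds` is my own and is not part of any Cisco tooling.

```python
def to_seconds(hhmmss: str) -> int:
    """Convert an IOS-style HH:MM:SS timer into seconds."""
    hours, minutes, seconds = (int(part) for part in hhmmss.split(":"))
    return hours * 3600 + minutes * 60 + seconds


HOLD_TIME = 7200  # hd_time from the resolution reply, in seconds

created = to_seconds("00:00:33")  # age of the cache entry
expire = to_seconds("01:59:26")   # time left before the entry ages out

# Age plus remaining lifetime should land within a second or so of hd_time.
print(created + expire, "seconds of", HOLD_TIME)
```

The same check works for the other dynamic entries shown in this section, since they all carry the same holding time.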

This is a very bad design: because CSR2 uses only a P2P tunnel, it can never install dynamic NHRP mappings on that interface, as the destination is fixed. CSR9 can send traffic to 10.0.100.2, but the reply traffic will never work. In this way, phase 1 can be tricky to troubleshoot and understand. Even though the NHRP mappings appear correct, there is no connectivity between VPN addresses. Normally, you would have a single hub with all sites using P2P tunnels (with prefix-suppression of sorts, somewhat like CSR2), or multiple hubs plus all sites using mGRE (CSR9 and CSR10).

Next, we can begin data-plane testing; most reachability is achievable both within the VPN and to the Internet site, with the exception of some VPN addresses due to the intentionally poor design described above. Only connected networks (transit links) have Internet access, since NAT is not configured as it is beyond the scope of this test. Of greater interest is determining the routing path; a traceroute on CSR9 to CSR10 clearly shows the hub(s) in the transit path. The CEF table confirms this both for CSR10's loopback and the OSPF summary behind it. This is the main disadvantage of DMVPN phase 1 and the reason why it's rarely used.

CSR9#traceroute 10.10.10.10
Type escape sequence to abort.
Tracing the route to 10.10.10.10
VRF info: (vrf in name/id, vrf out name/id)
  1 10.0.100.4 17 msec
    10.0.100.5 25 msec
    10.0.100.4 8 msec
  2 10.0.100.10 33 msec * 64 msec

CSR9#show ip cef 10.10.10.10
10.10.10.10/32
  nexthop 10.0.100.4 Tunnel100
  nexthop 10.0.100.5 Tunnel100

CSR9#show ip cef 1.1.1.1
1.1.1.0/24
  nexthop 10.0.100.4 Tunnel100
  nexthop 10.0.100.5 Tunnel100

The same is true for IPv6 routing. Because of EIGRP stub routing, the hubs cannot reach one another without a P2P tunnel or static mappings on their mGRE interfaces. I add a P2P tunnel quickly to achieve reachability (not shown). CSR9's paths to other remote sites are also shown and, as expected, route through the hubs directly.

CSR5#show ipv6 cef ::4:4:4:4
::4:4:4:4/128
  nexthop FE80::4 Tunnel45

CSR9#show ipv6 cef ::1:1:1:1
::1:1:1:1/128
  nexthop FE80::4 Tunnel100
  nexthop FE80::5 Tunnel100

CSR9#show ipv6 cef ::2:2:2:2
::2:2:2:2/128
  nexthop FE80::4 Tunnel100
  nexthop FE80::5 Tunnel100

Additional Reading – Reference configurations "dmvpn1"

35.2.2 Phase 2

DMVPN phase 2 solves some of the problems associated with phase 1 by allowing spokes to send traffic directly to one another. This implies that all routers must use mGRE interfaces (no P2P links with static tunnel destinations) so that they can install NHRP mappings to remote NBMA addresses inside the same DMVPN. GRE keepalives are not supported at any node, hub or spoke, in this design.

From a routing perspective, EIGRP must be configured with split-horizon disabled (like phase 1) and with next-hop-self disabled (specific to phase 2). This preserves the spoke VPN addresses as the next-hops of re-advertised routes, which allows traffic to traverse the dynamic spoke-to-spoke tunnels. For OSPF, a multiaccess network type should be used (broadcast or nonbroadcast). Typically broadcast is used as it simplifies the configuration, but the spokes must have their DR priorities set to 0, as only hub nodes have connectivity to all endpoints to distribute OSPF updates.

We will use OSPFv2 for IPv4 and EIGRP for IPv6 as we did in phase 1. I have removed the P2P tunnel between CSR4/CSR5 and used a static mapping on CSR5 to build the linkage. CSR5 is a backup hub that also references CSR4 as an NHS, so within a local DMVPN phase 2 design, you can achieve a little bit of hierarchy this way. For brevity, only the relevant configuration components are shown, since the basic configuration (addresses, tunnel source, etc.) has not changed.

! CSR4, NHS only
interface Tunnel100
 description DMVPN PHASE 2 - HUB
 ip nhrp authentication NHRPAUTH
 ip nhrp map multicast dynamic
 ip nhrp network-id 100
 ip nhrp server-only
 ip ospf network broadcast
 ip ospf priority 5
 ip ospf 1 area 0
 ipv6 nhrp authentication NHRPAUTH
 ipv6 nhrp map multicast dynamic
 ipv6 nhrp network-id 100

! CSR5, NHS and client
interface Tunnel100
 description DMVPN PHASE 2 HUB/CLIENT
 ip nhrp authentication NHRPAUTH
 ip nhrp map multicast dynamic
 ip nhrp map multicast 10.4.12.4
 ip nhrp map 10.0.100.4 10.4.12.4
 ip nhrp network-id 100
 ip nhrp nhs 10.0.100.4
 ip ospf network broadcast
 ip ospf 1 area 0
 ipv6 nhrp authentication NHRPAUTH
 ipv6 nhrp map multicast dynamic
 ipv6 nhrp map multicast 10.4.12.4
 ipv6 nhrp map 2001:10:0:100::4/128 10.4.12.4
 ipv6 nhrp network-id 100
 ipv6 nhrp nhs 2001:10:0:100::4

! General NHS client (spoke)
interface Tunnel100
 description DMVPN PHASE 2 - SPOKE
 ip nhrp authentication NHRPAUTH
 ip nhrp map 10.0.100.4 10.4.12.4
 ip nhrp map multicast 10.4.12.4
 ip nhrp map 10.0.100.5 10.5.6.5
 ip nhrp map multicast 10.5.6.5
 ip nhrp network-id 100
 ip nhrp nhs 10.0.100.4
 ip nhrp nhs 10.0.100.5
 ip ospf network broadcast
 ip ospf priority 0
 ip ospf 1 area 0
 ipv6 nhrp authentication NHRPAUTH
 ipv6 nhrp map 2001:10:0:100::4/128 10.4.12.4
 ipv6 nhrp map 2001:10:0:100::5/128 10.5.6.5
 ipv6 nhrp map multicast 10.4.12.4
 ipv6 nhrp map multicast 10.5.6.5
 ipv6 nhrp network-id 100
 ipv6 nhrp nhs 2001:10:0:100::4 priority 5
 ipv6 nhrp nhs 2001:10:0:100::5 priority 2

Before starting, a quick check of NHRP registrations on the hubs is valuable to ensure the network has converged. Notice that CSR4 has all dynamic entries, whereas CSR5 has a static entry for CSR4. Then, checking OSPF neighbors confirms that LL multicast is functioning (assuming "broadcast" network type) and the IGP can converge. The hubs should be candidate DRs while the spokes are forced to never be DRs (DROthers).

CSR4#show ip nhrp brief
   Target           Via            NBMA          Mode     Intfc  Claimed
10.0.100.2/32       10.0.100.2     10.8.2.2      dynamic  Tu100  <   >
10.0.100.5/32       10.0.100.5     10.5.6.5      dynamic  Tu100  <   >
10.0.100.9/32       10.0.100.9     10.7.9.9      dynamic  Tu100  <   >
10.0.100.10/32      10.0.100.10    10.7.10.10    dynamic  Tu100  <   >

CSR5#show ip nhrp brief
   Target           Via            NBMA          Mode     Intfc  Claimed
10.0.100.2/32       10.0.100.2     10.8.2.2      dynamic  Tu100  <   >
10.0.100.4/32       10.0.100.4     10.4.12.4     static   Tu100  <   >
10.0.100.9/32       10.0.100.9     10.7.9.9      dynamic  Tu100  <   >
10.0.100.10/32      10.0.100.10    10.7.10.10    dynamic  Tu100  <   >

CSR4#show ip ospf neighbor
Neighbor ID     Pri  State          Dead Time  Address       Interface
2.2.2.2           0  FULL/DROTHER   00:00:35   10.0.100.2    Tunnel100
5.5.5.5           1  FULL/BDR       00:00:32   10.0.100.5    Tunnel100
10.9.9.9          0  FULL/DROTHER   00:00:36   10.0.100.9    Tunnel100
10.10.10.10       0  FULL/DROTHER   00:00:37   10.0.100.10   Tunnel100

CSR5#show ip ospf neighbor
Neighbor ID     Pri  State          Dead Time  Address       Interface
2.2.2.2           0  FULL/DROTHER   00:00:32   10.0.100.2    Tunnel100
4.4.4.4           5  FULL/DR        00:00:34   10.0.100.4    Tunnel100
10.9.9.9          0  FULL/DROTHER   00:00:33   10.0.100.9    Tunnel100
10.10.10.10       0  FULL/DROTHER   00:00:35   10.0.100.10   Tunnel100
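To illustrate why the priorities above yield CSR4 as DR and CSR5 as BDR, here is a deliberately simplified sketch of the selection. It only models "priority 0 is ineligible, highest priority wins, router ID breaks ties"; real OSPF elects the BDR first and is non-preemptive, which this toy function ignores. The function name `elect_dr_bdr` is hypothetical.

```python
import ipaddress


def elect_dr_bdr(routers):
    """Pick (DR, BDR) from a {router_id: priority} mapping.

    Simplified model: priority 0 routers are never eligible; among the
    rest, the highest priority wins and the highest router ID breaks ties.
    """
    eligible = [
        (priority, int(ipaddress.IPv4Address(router_id)), router_id)
        for router_id, priority in routers.items()
        if priority > 0
    ]
    ranked = sorted(eligible, reverse=True)
    dr = ranked[0][2] if ranked else None
    bdr = ranked[1][2] if len(ranked) > 1 else None
    return dr, bdr


# Router IDs and priorities as seen in the neighbor tables above.
routers = {
    "4.4.4.4": 5,       # CSR4, "ip ospf priority 5"
    "5.5.5.5": 1,       # CSR5, default priority
    "2.2.2.2": 0,       # spokes, "ip ospf priority 0"
    "10.9.9.9": 0,
    "10.10.10.10": 0,
}
print(elect_dr_bdr(routers))
```

If both hubs were left at the default priority, this model would fall back to the router ID, which is one reason to set the priority explicitly on the intended DR.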

When CSR1 sends traffic to XRv14, CSR10 is the next router in the path. However, CSR10 does not know the NBMA address for 10.0.100.9, which is the next-hop of the route for 14.14.14.0/24. CSR10 also does not report having an NHRP mapping for this VPN address (IGP next-hop), which is loosely analogous to not having an ARP entry (MAC address) for a next-hop IP address.

CSR1#show ip cef 14.14.14.14
0.0.0.0/0
  nexthop 10.1.10.10 GigabitEthernet2.510

CSR10#show ip cef 14.14.14.14
14.14.14.0/24
  nexthop 10.0.100.9 Tunnel100

CSR10#show ip nhrp 10.0.100.9
[no output]
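The ARP analogy can be sketched in a few lines. This is a rough model, not how IOS is implemented: the dictionary stands in for the NHRP cache, the list for the configured NHS entries, and the function name `resolve_tunnel_destination` is made up for illustration. On a cache miss, the spoke temporarily sends via the first NHS while a resolution request is outstanding.

```python
from typing import Tuple

# Static hub mappings configured on a phase 2 spoke (VPN address -> NBMA address).
nhrp_cache = {"10.0.100.4": "10.4.12.4", "10.0.100.5": "10.5.6.5"}
nhs_list = ["10.0.100.4", "10.0.100.5"]  # order as configured


def resolve_tunnel_destination(next_hop: str) -> Tuple[str, bool]:
    """Return (NBMA address to encapsulate towards, resolution_needed)."""
    nbma = nhrp_cache.get(next_hop)
    if nbma is not None:
        return nbma, False
    # Cache miss: fall back to the first NHS and flag that a
    # resolution request for next_hop should be sent.
    return nhrp_cache[nhs_list[0]], True


# Before resolution: traffic for 10.0.100.9 is sent via CSR4's NBMA address.
print(resolve_tunnel_destination("10.0.100.9"))

# After a resolution reply installs the mapping, traffic goes direct.
nhrp_cache["10.0.100.9"] = "10.7.9.9"
print(resolve_tunnel_destination("10.0.100.9"))
```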

The lack of a valid NHRP mapping for 10.0.100.9 triggers an NHRP resolution request on CSR10. CSR10 sends this request to CSR4 only, presumably because it comes first in the configuration as the priorities are the same. The debug below is very chatty (I've reduced many parts), but it effectively says "I don't know the NBMA for 10.0.100.9, so I will ask CSR4. I have the NBMA/VPN mappings for the NHS, so I will send a resolution request for 10.0.100.9 to the NHS NBMA address".

! CSR10
NHRP: NHRP could not map 10.0.100.9 to NBMA, cache entry not found
NHRP: MACADDR: if_in GigabitEthernet2.510 netid-in 0 if_out Tunnel100 netid-out 100
NHRP: Sending packet to NHS 10.0.100.4 on Tunnel100
NHRP: NHRP successfully mapped '10.0.100.4' to NBMA 10.4.12.4
NHRP: Tunnel100: Cache add for target 10.0.100.9/32 next-hop 10.0.100.9 10.4.12.4
NHRP: Posted msg for temp cache installation - Tunnel100: dst 10.0.100.9, nhs 10.0.100.4(nbma 10.4.12.4),tableid 0, holdtime 185, afn 1
NHRP: Enqueued NHRP Resolution Request for destination: 10.0.100.9
NHRP: Tunnel100: Cache update for target 10.0.100.9/32 next-hop 10.0.100.9 10.4.12.4
NHRP: Adding Tunnel Endpoints (VPN: 10.0.100.9, NBMA: 10.4.12.4)
NHRP: Successfully attached NHRP subblock for Tunnel Endpoints (VPN: 10.0.100.9, NBMA: 10.4.12.4)
NHRP: Send Resolution Request via Tunnel100 vrf 0, packet size: 88
 src: 10.0.100.10, dst: 10.0.100.9
 (F) afn: AF_IP(1), type: IP(800), hop: 255, ver: 1
     shtl: 4(NSAP), sstl: 0(NSAP)
     pktsz: 88 extoff: 52
 (M) flags: "router auth src-stable nat ", reqid: 8
     src NBMA: 10.7.10.10
     src protocol: 10.0.100.10, dst protocol: 10.0.100.9
 (C-1) code: no error(0)
       prefix: 32, mtu: 9976, hd_time: 7200
       addr_len: 0(NSAP), subaddr_len: 0(NSAP), proto_len: 0, pref: 0
Responder Address Extension(3):
Forward Transit NHS Record Extension(4):
Reverse Transit NHS Record Extension(5):
Authentication Extension(7):
  type:Cleartext(1), data:NHRPAUTH
NAT address Extension(9):
NHRP: NHRP successfully mapped '10.0.100.9' to NBMA 10.4.12.4
NHRP: Encapsulation succeeded. Sending NHRP Control Packet NBMA Address: 10.4.12.4
NHRP: 112 bytes out Tunnel100

CSR4 receives the resolution request from CSR10 attempting to resolve the NBMA address for CSR9. CSR4 does a route lookup for 10.0.100.9 and sees that it points out of Tunnel100, which means it's within the same DMVPN. Like a standard NHRP resolution, the NHS adds itself as the "forward transit NHS" inside the request before sending the traffic towards the destination. ! CSR4 NHRP: Receive Resolution Request via Tunnel100 vrf 0, packet size: 88 (F) afn: AF_IP(1), type: IP(800), hop: 255, ver: 1 shtl: 4(NSAP), sstl: 0(NSAP) pktsz: 88 extoff: 52 (M) flags: "router auth src-stable nat ", reqid: 8 src NBMA: 10.7.10.10 src protocol: 10.0.100.10, dst protocol: 10.0.100.9 (C-1) code: no error(0) prefix: 32, mtu: 9976, hd_time: 7200 addr_len: 0(NSAP), subaddr_len: 0(NSAP), proto_len: 0, pref: 0 Responder Address Extension(3): Forward Transit NHS Record Extension(4): Reverse Transit NHS Record Extension(5): Authentication Extension(7): type:Cleartext(1), data:NHRPAUTH NAT address Extension(9): NHRP: Route lookup for destination 10.0.100.9 in (0x0) yielded interface Tunnel100, prefixlen 24 NHRP: Forwarding request due to authoritative request. NHRP-ATTR: In nhrp_recv_resolution_request NHRP Resolution Request packet is forwarded to 10.0.100.9 NHRP: Attempting to forward to destination: 10.0.100.9 NHRP: Forwarding: NHRP SAS picked source: 10.0.100.4 for destination: 10.0.100.9 NHRP: Attempting to send packet through interface Tunnel100 via DEST dst 10.0.100.9 NHRP: Forwarding Resolution Request via Tunnel100 vrf 0, packet size: 108 src: 10.0.100.4, dst: 10.0.100.9 (F) afn: AF_IP(1), type: IP(800), hop: 254, ver: 1 shtl: 4(NSAP), sstl: 0(NSAP) pktsz: 108 extoff: 52 (M) flags: "router auth src-stable nat ", reqid: 8 src NBMA: 10.7.10.10 src protocol: 10.0.100.10, dst protocol: 10.0.100.9


(C-1) code: no error(0) prefix: 32, mtu: 9976, hd_time: 7200 addr_len: 0(NSAP), subaddr_len: 0(NSAP), proto_len: 0, pref: 0 Responder Address Extension(3): Forward Transit NHS Record Extension(4): (C-1) code: no error(0) prefix: 32, mtu: 9976, hd_time: 7200 addr_len: 4(NSAP), subaddr_len: 0(NSAP), proto_len: 4, pref: 0 client NBMA: 10.4.12.4 client protocol: 10.0.100.4 Reverse Transit NHS Record Extension(5): Authentication Extension(7): type:Cleartext(1), data:NHRPAUTH NAT address Extension(9): NHRP: NHRP successfully mapped '10.0.100.9' to NBMA 10.7.9.9 NHRP: Encapsulation succeeded. Sending NHRP Control Packet NBMA Address: 10.7.9.9 NHRP: 132 bytes out Tunnel100
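The packet sizes in these debugs can be reconciled with some simple arithmetic. The sketch below assumes a 12-byte fixed CIE header (per the CIE layout in RFC 2332), 4-byte NBMA and protocol addresses, and a basic 4-byte GRE header with no key inside a 20-byte outer IPv4 header; the constant and function names are mine, and the tunnel's actual encapsulation options are not shown in this section.

```python
CIE_FIXED_LEN = 12    # code, prefix length, unused, MTU, holding time, lengths, preference
NBMA_ADDR_LEN = 4     # IPv4 NBMA address
PROTO_ADDR_LEN = 4    # IPv4 protocol (VPN) address
OUTER_IPV4_LEN = 20   # outer IPv4 header without options (assumption)
GRE_HEADER_LEN = 4    # basic GRE header, no key/sequence/checksum (assumption)


def add_transit_record(nhrp_len: int) -> int:
    """NHRP packet size after an NHS appends one transit NHS record (one CIE)."""
    return nhrp_len + CIE_FIXED_LEN + NBMA_ADDR_LEN + PROTO_ADDR_LEN


def on_wire(nhrp_len: int) -> int:
    """Bytes sent out of the tunnel once GRE and the outer IP header are added."""
    return nhrp_len + GRE_HEADER_LEN + OUTER_IPV4_LEN


request = 88                                # as originated by CSR10
forwarded = add_transit_record(request)     # after CSR4 adds its forward transit record
print(on_wire(request), forwarded, on_wire(forwarded))
```

The three values line up with the "112 bytes out", "packet size: 108", and "132 bytes out" lines in the CSR10 and CSR4 debugs.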

CSR9 receives the resolution request originated by CSR10 and forwarded by CSR4. CSR9 performs a route lookup for the destination and sees that it is a local interface, so the packet is processed locally. CSR9 also knows the NBMA/VPN mappings for the source, CSR10, as this information is contained in the original request. This is considered an "implicit" mapping and is valid for entry into CSR9's NHRP cache.

! CSR9
NHRP: Receive Resolution Request via Tunnel100 vrf 0, packet size: 108
 (F) afn: AF_IP(1), type: IP(800), hop: 254, ver: 1
     shtl: 4(NSAP), sstl: 0(NSAP)
     pktsz: 108 extoff: 52
 (M) flags: "router auth src-stable nat ", reqid: 8
     src NBMA: 10.7.10.10
     src protocol: 10.0.100.10, dst protocol: 10.0.100.9
 (C-1) code: no error(0)
       prefix: 32, mtu: 9976, hd_time: 7200
       addr_len: 0(NSAP), subaddr_len: 0(NSAP), proto_len: 0, pref: 0
Responder Address Extension(3):
Forward Transit NHS Record Extension(4):
 (C-1) code: no error(0)
       prefix: 32, mtu: 9976, hd_time: 7200
       addr_len: 4(NSAP), subaddr_len: 0(NSAP), proto_len: 4, pref: 0
       client NBMA: 10.4.12.4
       client protocol: 10.0.100.4
Reverse Transit NHS Record Extension(5):
Authentication Extension(7):
  type:Cleartext(1), data:NHRPAUTH
NAT address Extension(9):
NHRP: Route lookup for destination 10.0.100.9 in (0x0) yielded interface Tunnel100, prefixlen 24
NHRP-ATTR: smart spoke feature and attributes are not configured,
NHRP: Request was to us. Process the NHRP Resolution Request.
NHRP: Request was to us, responding with ouraddress
NHRP: Checking for delayed event 10.0.100.10/10.0.100.9 on list (Tunnel100).
NHRP: No delayed event node found.
NHRP: No need to delay processing of resolution event nbma src:10.7.9.9 nbma dst:10.7.10.10
NHRP: Tunnel100: Cache add for target 10.0.100.10/32 next-hop 10.0.100.10 10.7.10.10
NHRP: Adding Tunnel Endpoints (VPN: 10.0.100.10, NBMA: 10.7.10.10)
NHRP: Successfully attached NHRP subblock for Tunnel Endpoints (VPN: 10.0.100.10, NBMA: 10.7.10.10)
NHRP: Inserted subblock node for cache: Target 10.0.100.10/32 nhop 10.0.100.10
NHRP: Converted internal dynamic cache entry for 10.0.100.10/32 interface Tunnel100 to external
NHRP: Tunnel100: Cache update for target 10.0.100.10/32 next-hop 10.0.100.10 10.7.10.10

Unlike the process we examined for phase 1, the resolution reply is sent directly back to CSR10, which means the NHS won't see it again. I presume this is because the IGP routing directs the traffic directly back to the spoke; with Phase 1, the hub was always in the transit path, even with the NHRP resolutions that occurred between VPN addresses. As expected, CSR9 adds the "responder address extension" field to the NHRP message so that the originator knows who responded. ! CSR9 NHRP: Attempting to send packet through interface Tunnel100 via DEST dst 10.0.100.10 NHRP: Send Resolution Reply via Tunnel100 vrf 0, packet size: 136 src: 10.0.100.9, dst: 10.0.100.10 (F) afn: AF_IP(1), type: IP(800), hop: 255, ver: 1 shtl: 4(NSAP), sstl: 0(NSAP) pktsz: 136 extoff: 60 (M) flags: "router auth dst-stable unique src-stable nat ", reqid: 8 src NBMA: 10.7.10.10 src protocol: 10.0.100.10, dst protocol: 10.0.100.9 (C-1) code: no error(0) prefix: 32, mtu: 9976, hd_time: 7200 addr_len: 4(NSAP), subaddr_len: 0(NSAP), proto_len: 4, pref: 0 client NBMA: 10.7.9.9 client protocol: 10.0.100.9 Responder Address Extension(3): (C) code: no error(0) prefix: 32, mtu: 9976, hd_time: 7200 addr_len: 4(NSAP), subaddr_len: 0(NSAP), proto_len: 4, pref: 0 client NBMA: 10.7.9.9 client protocol: 10.0.100.9 Forward Transit NHS Record Extension(4):


(C-1) code: no error(0) prefix: 32, mtu: 9976, hd_time: 7200 addr_len: 4(NSAP), subaddr_len: 0(NSAP), proto_len: 4, pref: 0 client NBMA: 10.4.12.4 client protocol: 10.0.100.4 Reverse Transit NHS Record Extension(5): Authentication Extension(7): type:Cleartext(1), data:NHRPAUTH NAT address Extension(9): NHRP: Setting 'used' flag on cache entry with nhop: 10.0.100.10 NHRP: NHRP successfully mapped '10.0.100.10' to NBMA 10.7.10.10 NHRP: Encapsulation succeeded. Sending NHRP Control Packet NBMA Address: 10.7.10.10 NHRP: 160 bytes out Tunnel100

The output below shows the entry that CSR9 adds upon receiving the resolution request from CSR10. Since the source NBMA was contained in the message, an "implicit" entry can be added, since it's fair to assume that the return traffic should be sent towards the source NBMA address. After all, CSR9 just used this mapping to send a resolution reply directly back to CSR10, avoiding the transit NHS on the return path.

CSR9#show ip nhrp 10.0.100.10
10.0.100.10/32 via 10.0.100.10
   Tunnel100 created 00:36:11, expire 01:23:48
   Type: dynamic, Flags: router implicit used nhop
   NBMA address: 10.7.10.10

CSR10 receives the resolution reply from CSR9 which contains its NBMA address 10.7.9.9. CSR10 adds this to its NHRP cache and can directly encapsulate unicast traffic to this spoke inside mGRE. The debug reveals this new NBMA information in several different ways, which is why I have trimmed it down. ! CSR10 NHRP: Receive Resolution Reply via Tunnel100 vrf 0, packet size: 136 (F) afn: AF_IP(1), type: IP(800), hop: 255, ver: 1 shtl: 4(NSAP), sstl: 0(NSAP) pktsz: 136 extoff: 60 (M) flags: "router auth dst-stable unique src-stable nat ", reqid: 8 src NBMA: 10.7.10.10 src protocol: 10.0.100.10, dst protocol: 10.0.100.9 (C-1) code: no error(0) prefix: 32, mtu: 9976, hd_time: 7200 addr_len: 4(NSAP), subaddr_len: 0(NSAP), proto_len: 4, pref: 0 client NBMA: 10.7.9.9 client protocol: 10.0.100.9 Responder Address Extension(3): (C) code: no error(0) prefix: 32, mtu: 9976, hd_time: 7200


addr_len: 4(NSAP), subaddr_len: 0(NSAP), proto_len: 4, pref: 0 client NBMA: 10.7.9.9 client protocol: 10.0.100.9 Forward Transit NHS Record Extension(4): (C-1) code: no error(0) prefix: 32, mtu: 9976, hd_time: 7200 addr_len: 4(NSAP), subaddr_len: 0(NSAP), proto_len: 4, pref: 0 client NBMA: 10.4.12.4 client protocol: 10.0.100.4 Reverse Transit NHS Record Extension(5): Authentication Extension(7): type:Cleartext(1), data:NHRPAUTH NAT address Extension(9): NHRP: nhrp_pak_reply_from_target: answer_nbma: 10.7.9.9, responder_nbma: 10.7.9.9 NHRP: Tunnels gave us pak src: 10.7.9.9 NHRP: No need to delay processing of resolution event nbma src:10.7.10.10 nbma dst:10.7.9.9 NHRP: Tunnel100: Cache update for target 10.0.100.9/32 next-hop 10.0.100.9 10.7.9.9 NHRP: Deleted subblock node associated with cache: Target 10.0.100.9/32 nhop 10.0.100.9 NHRP: Adding Tunnel Endpoints (VPN: 10.0.100.9, NBMA: 10.7.9.9) NHRP: Successfully attached NHRP subblock for Tunnel Endpoints (VPN: 10.0.100.9, NBMA: 10.7.9.9) NHRP: Tunnel100: Cache update for target 10.0.100.9/32 next-hop 10.0.100.9 10.7.9.9 NHRP: Adding Tunnel Endpoints (VPN: 10.0.100.9, NBMA: 10.7.9.9) NHRP: Dequeue request of type Resolution Request for 10.0.100.9, reqid 8, netid 100.

Below is the final result of the NHRP process. CSR10 adds this NBMA entry as a dynamic entry into its NHRP cache. Nothing changes in the RIB/FIB since the next-hop for the routes behind a spoke is the spoke's VPN address. As long as NHRP can provide the dynamic "tunnel destination", we can use traceroute to prove traffic routes directly between nodes; notice CSR4/CSR5 are not in the transit path.

CSR10#show ip nhrp 10.0.100.9
10.0.100.9/32 via 10.0.100.9
   Tunnel100 created 00:35:53, expire 01:24:06
   Type: dynamic, Flags: router nhop
   NBMA address: 10.7.9.9

CSR1#traceroute 14.14.14.14 source 1.1.1.1
Type escape sequence to abort.
Tracing the route to 14.14.14.14
VRF info: (vrf in name/id, vrf out name/id)
  1 10.1.10.10 0 msec 0 msec 1 msec
  2 10.0.100.9 1 msec 1 msec 5 msec
  3 10.9.14.14 8 msec * 2 msec

We will quickly test IPv6 also without much debugging. The nature of EIGRP provides two identical routes to CSR10 for ::14:14:14:0/112 since the hubs are leaving the LL next-hop the same and simply readvertising it (like a route-reflector). CSR10 does not have an NHRP mapping for this LL next-hop, which is essentially a VPN address for CSR9. CSR1#show ipv6 cef ::14:14:14:14 ::14:14:14:0/112 nexthop FE80::10 GigabitEthernet2.510 CSR10#show ipv6 cef ::14:14:14:12 ::14:14:14:0/112 nexthop FE80::9 Tunnel100 nexthop FE80::9 Tunnel100 CSR10#show ipv6 nhrp FE80::9/128 [no output]

When CSR1 sends traffic, the NHRP process takes place. CSR10 sends its request to CSR4 (small debug shown), which forwards it to CSR9. CSR9 then responds directly.

! CSR10
NHRP: NHRP successfully mapped '2001:10:0:100::4' to NBMA 10.4.12.4
NHRP: IPv6-resolution: Sending packet to NHS 2001:10:0:100::4/10.4.12.4 on Tunnel100 - status: 1

When CSR10 receives the resolution reply, it installs the LL mapping into the NHRP cache. Even though it learns the publicly-routable VPN address of CSR9 via the "responder address extension", this is not relevant for routing in this case.

! CSR10
NHRP: Receive Resolution Reply via Tunnel100 vrf 0, packet size: 196
 (F) afn: AF_IP(1), type: IPv6(86DD), hop: 255, ver: 1
     shtl: 4(NSAP), sstl: 0(NSAP)
     pktsz: 196 extoff: 96
 (M) flags: "router auth dst-stable unique src-stable nat ", reqid: 12
     src NBMA: 10.7.10.10
     src protocol: FE80::10, dst protocol: FE80::9
 (C-1) code: no error(0)
       prefix: 128, mtu: 9976, hd_time: 7200
       addr_len: 4(NSAP), subaddr_len: 0(NSAP), proto_len: 16, pref: 0
       client NBMA: 10.7.9.9
       client protocol: FE80::9
Responder Address Extension(3):
 (C) code: no error(0)
       prefix: 128, mtu: 9976, hd_time: 7200
       addr_len: 4(NSAP), subaddr_len: 0(NSAP), proto_len: 16, pref: 0
       client NBMA: 10.7.9.9
       client protocol: 2001:10:0:100::9

CSR10#show ipv6 nhrp FE80::9/128
FE80::9/128 via FE80::9
   Tunnel100 created 00:00:54, expire 01:59:05
   Type: dynamic, Flags: router nhop
   NBMA address: 10.7.9.9

Traceroute confirms direct spoke-to-spoke connectivity. CSR1#traceroute ipv6 Target IPv6 address: ::14:14:14:14 Source address: ::1:1:1:1 Type escape sequence to abort. Tracing the route to ::14:14:14:14 1 2001:10:1:10::10 1 msec 0 msec 1 msec 2 2001:10:0:100::9 2 msec 1 msec 4 msec 3 ::14:14:14:14 20 msec 13 msec 17 msec

Additional Reading – Reference configurations "dmvpn2"

35.2.3 Phase 3

DMVPN phase 3 achieves greater scalability than both phases 1 and 2, as mentioned briefly earlier. In phase 2, spokes still need the routes from every other spoke since the VPN next-hops must be preserved. Even if each node can summarize its back-end networks into a single summary, this could still mean thousands of routes in a large enterprise deployment with thousands of spokes. With phase 3, the hubs can originate default routes, directing all traffic to them initially. Using new NHRP redirect messages (like ICMP redirects), the hubs can redirect follow-on packets to spokes directly. Spokes can install CEF shortcuts for one another's NBMA addresses to provide spoke-to-spoke connectivity. Hubs can also be stacked hierarchically to regionalize the DMVPN architecture, which is not examined here.

The configuration requires one new command on hubs and one new command on spokes. Also, EIGRP still needs split-horizon disabled, but the next-hops should be changed to be the hub (next-hop-self). To demonstrate the scalability improvements, as well as new NHRP route types, we will originate coarse aggregate routes from the hubs (2001::/16). The reason for using these is so that the enterprise WAN traffic (2001::/16) is kept inside the DMVPN, while default traffic (::/0) goes to the SP without encapsulation via existing default routes. The OSPF P2MP network type should be used to allow the next-hops to be the hubs. This makes the phase 3 configuration a hybrid between phase 1 and phase 2 with respect to NHRP and IGP configurations.

! General NHS (hub)
interface Tunnel100
 description DMVPN PHASE 3 - HUB
 ip nhrp authentication NHRPAUTH
 ip nhrp map multicast dynamic
 ip nhrp network-id 100
 ip nhrp server-only
 ip nhrp redirect
 ip ospf network point-to-multipoint
 ip ospf 1 area 0
 ipv6 nhrp authentication NHRPAUTH
 ipv6 nhrp map multicast dynamic
 ipv6 nhrp network-id 100
 ipv6 nhrp redirect
router eigrp VPN
 address-family ipv6 unicast autonomous-system 65000
  af-interface Tunnel100
   no split-horizon
   next-hop-self
   summary-address 2001::/16

! General NHS client (spoke)
interface Tunnel100
 description DMVPN PHASE 3 - SPOKE
 ip nhrp authentication NHRPAUTH
 ip nhrp map 10.0.100.4 10.4.12.4
 ip nhrp map multicast 10.4.12.4
 ip nhrp map 10.0.100.5 10.5.6.5
 ip nhrp map multicast 10.5.6.5
 ip nhrp network-id 100
 ip nhrp nhs 10.0.100.4
 ip nhrp nhs 10.0.100.5
 ip nhrp shortcut
 ip ospf network point-to-multipoint
 ip ospf 1 area 0
 ipv6 nhrp authentication NHRPAUTH
 ipv6 nhrp map 2001:10:0:100::4/128 10.4.12.4
 ipv6 nhrp map 2001:10:0:100::5/128 10.5.6.5
 ipv6 nhrp map multicast 10.4.12.4
 ipv6 nhrp map multicast 10.5.6.5
 ipv6 nhrp network-id 100
 ipv6 nhrp nhs 2001:10:0:100::4 priority 5
 ipv6 nhrp nhs 2001:10:0:100::5 priority 2
 ipv6 nhrp shortcut
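The split between the 2001::/16 aggregate and the existing ::/0 default comes down to ordinary longest-prefix matching on the spoke. Here is a small sketch using Python's standard `ipaddress` module; the route table labels and the non-enterprise destination `2600::1` are hypothetical values chosen for illustration.

```python
import ipaddress

# A spoke's view after hub summarization: one coarse aggregate plus a default.
routes = {
    ipaddress.ip_network("2001::/16"): "Tunnel100 via hub (DMVPN)",
    ipaddress.ip_network("::/0"): "SP default (no encapsulation)",
}


def lookup(destination: str) -> str:
    """Return the egress for the longest prefix that contains the destination."""
    address = ipaddress.ip_address(destination)
    matches = [network for network in routes if address in network]
    best = max(matches, key=lambda network: network.prefixlen)
    return routes[best]


print(lookup("2001:10:0:100::9"))  # enterprise WAN destination
print(lookup("2600::1"))           # hypothetical Internet destination
```

Because only the aggregate is needed to steer WAN traffic into the tunnel, the spoke's EIGRP table stays the same size as more spokes are added.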

We quickly check the hubs to ensure the NHRP registrations completed successfully. This is identical to other DMVPN phases. We also quickly test for OSPF/EIGRP neighbors.

CSR4#show ip nhrp brief
   Target           Via            NBMA          Mode     Intfc  Claimed
10.0.100.2/32       10.0.100.2     10.8.2.2      dynamic  Tu100  <   >
10.0.100.5/32       10.0.100.5     10.5.6.5      dynamic  Tu100  <   >
10.0.100.9/32       10.0.100.9     10.7.9.9      dynamic  Tu100  <   >
10.0.100.10/32      10.0.100.10    10.7.10.10    dynamic  Tu100  <   >

CSR5#show ip nhrp brief
   Target           Via            NBMA          Mode     Intfc  Claimed
10.0.100.2/32       10.0.100.2     10.8.2.2      dynamic  Tu100  <   >
10.0.100.4/32       10.0.100.4     10.4.12.4     static   Tu100  <   >
10.0.100.9/32       10.0.100.9     10.7.9.9      dynamic  Tu100  <   >
10.0.100.10/32      10.0.100.10    10.7.10.10    dynamic  Tu100  <   >

CSR4#show ip ospf neighbor
Neighbor ID     Pri  State     Dead Time  Address       Interface
10.10.10.10       0  FULL/  -  00:01:36   10.0.100.10   Tunnel100
10.9.9.9          0  FULL/  -  00:01:36   10.0.100.9    Tunnel100
2.2.2.2           0  FULL/  -  00:01:53   10.0.100.2    Tunnel100
5.5.5.5           0  FULL/  -  00:01:39   10.0.100.5    Tunnel100

CSR5#show ip ospf neighbor
Neighbor ID     Pri  State     Dead Time  Address       Interface
10.10.10.10       0  FULL/  -  00:01:54   10.0.100.10   Tunnel100
10.9.9.9          0  FULL/  -  00:01:53   10.0.100.9    Tunnel100
2.2.2.2           0  FULL/  -  00:01:42   10.0.100.2    Tunnel100
4.4.4.4           0  FULL/  -  00:01:52   10.0.100.4    Tunnel100

CSR4#show eigrp address-family ipv6 neighbors
EIGRP-IPv6 VR(VPN) Address-Family Neighbors for AS(65000)
H   Address                 Interface  Hold   Uptime    SRTT  RTO   Q    Seq
                                       (sec)            (ms)        Cnt  Num
4   Link-local address:     Tu100      10     00:01:48  190   5000  0    205
    FE80::5
2   Link-local address:     Tu100      12     00:02:41  133   5000  0    123
    FE80::9
1   Link-local address:     Tu100      13     00:02:41  170   5000  0    151
    FE80::2
0   Link-local address:     Tu100      12     00:02:41  204   5000  0    121
    FE80::10
3   Link-local address:     Tu45       13     1d21h     77    1476  0    206
    FE80::5

CSR5#show eigrp address-family ipv6 neighbors
EIGRP-IPv6 VR(VPN) Address-Family Neighbors for AS(65000)
H   Address                 Interface  Hold   Uptime    SRTT  RTO   Q    Seq
                                       (sec)            (ms)        Cnt  Num
4   Link-local address:     Tu100      14     00:01:59  102   5000  0    123
    FE80::9
2   Link-local address:     Tu100      11     00:01:59  113   5000  0    153
    FE80::2
1   Link-local address:     Tu100      14     00:01:59  84    5000  0    124
    FE80::10
0   Link-local address:     Tu100      14     00:02:00  139   5000  0    197
    FE80::4
3   Link-local address:     Tu45       10     1d21h     21    5000  0    198
    FE80::4

The link-state nature of OSPF makes hub summarization impossible, so each spoke still has OSPF routes for each remote spoke since they share a common area. For this reason (and others), OSPF is generally a poor choice of IGP over DMVPN. For IPv6 routes learned via EIGRP, every spoke only has a pair of 2001::/16 routes towards the hubs. My original IPv6 addressing plan was poor in that the remote networks (loopbacks) did not summarize well; they have been updated to reflect good EIGRP design using the format 2001:x:x:x::x/128 (not shown). In this way, each spoke will have a very small number of routes regardless of how many spokes are added to the network. CSR9#show ip route ospf | begin ^O O IA 1.1.1.0 [110/2002] via 10.0.100.5, 00:15:46, Tunnel100 [110/2002] via 10.0.100.4, 00:15:46, Tunnel100 2.0.0.0/32 is subnetted, 1 subnets O 2.2.2.2 [110/2001] via 10.0.100.5, 00:15:56, Tunnel100 [110/2001] via 10.0.100.4, 00:15:56, Tunnel100 4.0.0.0/32 is subnetted, 1 subnets O 4.4.4.4 [110/1001] via 10.0.100.4, 02:55:26, Tunnel100 5.0.0.0/32 is subnetted, 1 subnets O 5.5.5.5 [110/1001] via 10.0.100.5, 1d19h, Tunnel100 10.0.0.0/8 is variably subnetted, 12 subnets, 2 masks O 10.0.100.2/32 [110/2000] via 10.0.100.5, 00:15:56, Tunnel100 [110/2000] via 10.0.100.4, 00:15:56, Tunnel100 [snip] CSR9#show ipv6 route eigrp | begin ^D D 2001::/16 [90/76800640] via FE80::4, Tunnel100 via FE80::5, Tunnel100 D 2001:14:14:14::14/128 [90/10752] via FE80::14, GigabitEthernet2.594

We will send traffic from CSR1 to CSR2's loopback using IPv4. Unlike phase 2, CSR10 does have a valid IP route for 2.2.2.2/32 via CSR4 and CSR5, implying that CSR10 does not send a resolution request to the NHS. To determine which of these paths it uses for the initial packet, we check the FIB. CSR10 will encapsulate this packet and send it to CSR5 according to the output below. Including the “virtual” keyword will show the tunnel interface information (VPN address) in addition to the tunnel destination (NBMA address).

CSR10#show ip cef exact-route virtual 1.1.1.1 2.2.2.2
1.1.1.1 -> 2.2.2.2 => IP midchain out of Tunnel100, addr 10.0.100.5 7FE069004A78
                   => IP adj out of GigabitEthernet2.570, addr 10.7.10.7

When CSR5 receives this packet, it routes it normally out of the tunnel (hair-pin) and tells CSR10 "I will forward this packet, but you should send traffic straight to CSR2 in the future". This is what the NHRP redirect message does. Although the message type is “Traffic Indication”, the code identified in the mandatory (M) field is “redirect”. Notice that it does not contain CSR2's NBMA address anywhere in the packet; this is just an indication for CSR10 to originate a traditional NHRP resolution request to resolve CSR2’s NBMA. This packet goes from CSR5 directly back towards 1.1.1.1/32, where CSR10 is the DMVPN endpoint servicing that host.

! CSR5
NHRP: Attempting to Redirect, remote_nbma:10.7.10.10, dst:2.2.2.2
NHRP: inserting (10.7.10.10/2.2.2.2) in redirect table
NHRP: Attempting to send packet through interface Tunnel100 via DEST dst 1.1.1.1
NHRP: Send Traffic Indication via Tunnel100 vrf 0, packet size: 120
 src: 10.0.100.5, dst: 1.1.1.1
 (F) afn: AF_IP(1), type: IP(800), hop: 255, ver: 1
     shtl: 4(NSAP), sstl: 0(NSAP)
     pktsz: 120 extoff: 88
 (M) traffic code: redirect(0)
     src NBMA: 10.5.6.5
     src protocol: 10.0.100.5, dst protocol: 1.1.1.1
     Contents of nhrp traffic indication packet:
        45 00 00 64 01 8A 00 00 FD 01 B6 09 01 01 01 01
        02 02 02 02 08 00 B6 C8 00 27 00 00 00 00 00 00
        8A 46 3D 14 AB CD AB CD AB CD AB CD AB CD AB
Forward Transit NHS Record Extension(4):
Reverse Transit NHS Record Extension(5):
Authentication Extension(7):
  type:Cleartext(1), data:NHRPAUTH
NAT address Extension(9):
NHRP: Switching directly using pre-set NBMA 10.7.10.10
NHRP: Encapsulation succeeded. Sending NHRP Control Packet NBMA Address: 10.7.10.10
NHRP: 144 bytes out Tunnel100
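The "Contents of nhrp traffic indication packet" bytes are the start of the original data packet that triggered the redirect, which is how CSR10 learns which destination to resolve. The sketch below decodes the first 28 bytes (IPv4 header plus the first 8 bytes of payload) using only the Python standard library. The function names are my own, and the interpretation of the payload as ICMP follows from the protocol field rather than anything the debug states explicitly.

```python
import ipaddress
import struct

# First 28 bytes copied from the traffic indication debug above.
RAW = bytes.fromhex(
    "45000064018A0000FD01B609"  # version/IHL through header checksum
    "01010101"                  # source address
    "02020202"                  # destination address
    "0800B6C800270000"          # first 8 bytes of the payload
)


def parse_ipv4_header(data: bytes) -> dict:
    """Decode the fixed 20-byte IPv4 header into a dictionary."""
    (ver_ihl, _tos, total_length, ident, _frag, ttl,
     protocol, _checksum, src, dst) = struct.unpack("!BBHHHBBH4s4s", data[:20])
    return {
        "version": ver_ihl >> 4,
        "header_length": (ver_ihl & 0x0F) * 4,
        "total_length": total_length,
        "identification": ident,
        "ttl": ttl,
        "protocol": protocol,
        "src": str(ipaddress.IPv4Address(src)),
        "dst": str(ipaddress.IPv4Address(dst)),
    }


def header_checksum_ok(data: bytes) -> bool:
    """One's-complement sum over the header must fold to 0xFFFF."""
    total = sum(struct.unpack("!10H", data[:20]))
    while total >> 16:
        total = (total & 0xFFFF) + (total >> 16)
    return total == 0xFFFF


header = parse_ipv4_header(RAW)
print(header)
print("checksum valid:", header_checksum_ok(RAW))
print("payload type byte:", RAW[20])  # 8 is an ICMP echo request when protocol is 1
```

The decoded source and destination are 1.1.1.1 and 2.2.2.2, matching the test traffic, so the redirect carries the destination CSR10 must resolve even though it carries no NBMA address for CSR2.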

CSR10 receives the traffic indication message (redirect) and attempts to resolve the NBMA mapping for 2.2.2.2. Unlike phase 2, phase 3 resolves NBMA/VPN mappings where the VPN address is the final destination, not the actual VPN address/next-hop. A poorly summarized network could result in a huge NHRP cache in this case; the VPN address itself is not resolved because the hub summarization has hidden the NBMA topology. The NBMA topology and IGP routing are totally decoupled with DMVPN phase 3, which allows it to scale (assuming the IGP summarizes well). The resolution request contains the usual three parameters: source NBMA, source VPN, and destination VPN (final destination, in this case). This is sent towards CSR5, which is the NHS that originated the redirect.

! CSR10
NHRP: Receive Traffic Indication via Tunnel100 vrf 0, packet size: 120
 (F) afn: AF_IP(1), type: IP(800), hop: 255, ver: 1
     shtl: 4(NSAP), sstl: 0(NSAP)
     pktsz: 120 extoff: 88
 (M) traffic code: redirect(0)
     src NBMA: 10.5.6.5
     src protocol: 10.0.100.5, dst protocol: 1.1.1.1
     Contents of nhrp traffic indication packet:
        45 00 00 64 01 8A 00 00 FD 01 B6 09 01 01 01 01
        02 02 02 02 08 00 B6 C8 00 27 00 00 00 00 00 00
        8A 46 3D 14 AB CD AB CD AB CD AB CD AB CD AB
Forward Transit NHS Record Extension(4):
Reverse Transit NHS Record Extension(5):
Authentication Extension(7):
  type:Cleartext(1), data:NHRPAUTH
NAT address Extension(9):
NHRP: Enqueued NHRP Resolution Request for destination: 2.2.2.2
NHRP: Sending NHRP Resolution Request for dest: 2.2.2.2 to nexthop: 2.2.2.2 using our src: 10.0.100.10
NHRP: Attempting to send packet through interface Tunnel100 via DEST dst 2.2.2.2
NHRP: Send Resolution Request via Tunnel100 vrf 0, packet size: 88
 src: 10.0.100.10, dst: 2.2.2.2
 (F) afn: AF_IP(1), type: IP(800), hop: 255, ver: 1
     shtl: 4(NSAP), sstl: 0(NSAP)
     pktsz: 88 extoff: 52
 (M) flags: "router auth src-stable nat ", reqid: 13
     src NBMA: 10.7.10.10
     src protocol: 10.0.100.10, dst protocol: 2.2.2.2
 (C-1) code: no error(0)
       prefix: 32, mtu: 9976, hd_time: 7200
       addr_len: 0(NSAP), subaddr_len: 0(NSAP), proto_len: 0, pref: 0
Responder Address Extension(3):
Forward Transit NHS Record Extension(4):
Reverse Transit NHS Record Extension(5):
Authentication Extension(7):
  type:Cleartext(1), data:NHRPAUTH
NAT address Extension(9):
NHRP: NHRP successfully mapped '10.0.100.5' to NBMA 10.5.6.5
NHRP: Encapsulation succeeded. Sending NHRP Control Packet NBMA Address: 10.5.6.5
NHRP: 112 bytes out Tunnel100

CSR5 receives the resolution request, performs a route lookup, and forwards it back out of the tunnel towards CSR2. CSR5 inserts the "forward transit NHS record" into the packet so CSR2 knows which NHS forwarded the message. As an NHS, CSR5 knows the NBMA address of CSR2, which is 10.8.2.2. This is the first time we have seen that address in any debug message. As expected, the NHS does not respond with this directly to CSR10, but forwards the request towards the final destination, 2.2.2.2/32.

! CSR5
NHRP: Receive Resolution Request via Tunnel100 vrf 0, packet size: 88
 (F) afn: AF_IP(1), type: IP(800), hop: 255, ver: 1
     shtl: 4(NSAP), sstl: 0(NSAP)
     pktsz: 88 extoff: 52
 (M) flags: "router auth src-stable nat ", reqid: 13
     src NBMA: 10.7.10.10
     src protocol: 10.0.100.10, dst protocol: 2.2.2.2
 (C-1) code: no error(0)
       prefix: 32, mtu: 9976, hd_time: 7200
       addr_len: 0(NSAP), subaddr_len: 0(NSAP), proto_len: 0, pref: 0
Responder Address Extension(3):
Forward Transit NHS Record Extension(4):
Reverse Transit NHS Record Extension(5):
Authentication Extension(7):
  type:Cleartext(1), data:NHRPAUTH
NAT address Extension(9):
NHRP: Route lookup for destination 2.2.2.2 in (0x0) yielded interface Tunnel100, prefixlen 32
NHRP-ATTR: In nhrp_recv_resolution_request NHRP Resolution Request packet is forwarded to 2.2.2.2.
NHRP: Forwarding: NHRP SAS picked source: 10.0.100.5 for destination: 2.2.2.2
NHRP: Attempting to send packet through interface Tunnel100 via DEST dst 2.2.2.2
NHRP: Forwarding Resolution Request via Tunnel100 vrf 0, packet size: 108
 src: 10.0.100.5, dst: 2.2.2.2
 (F) afn: AF_IP(1), type: IP(800), hop: 254, ver: 1
     shtl: 4(NSAP), sstl: 0(NSAP)
     pktsz: 108 extoff: 52
 (M) flags: "router auth src-stable nat ", reqid: 13
     src NBMA: 10.7.10.10
     src protocol: 10.0.100.10, dst protocol: 2.2.2.2
 (C-1) code: no error(0)
       prefix: 32, mtu: 9976, hd_time: 7200
       addr_len: 0(NSAP), subaddr_len: 0(NSAP), proto_len: 0, pref: 0
Responder Address Extension(3):
Forward Transit NHS Record Extension(4):
 (C-1) code: no error(0)
       prefix: 32, mtu: 9976, hd_time: 7200
       addr_len: 4(NSAP), subaddr_len: 0(NSAP), proto_len: 4, pref: 0
       client NBMA: 10.5.6.5
       client protocol: 10.0.100.5
Reverse Transit NHS Record Extension(5):
Authentication Extension(7):
  type:Cleartext(1), data:NHRPAUTH
NAT address Extension(9):
NHRP: NHRP successfully mapped '10.0.100.2' to NBMA 10.8.2.2
NHRP: Encapsulation succeeded. Sending NHRP Control Packet NBMA Address: 10.8.2.2
NHRP: 132 bytes out Tunnel100
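The packet sizes in these debugs can be reproduced with simple arithmetic, which helps confirm what the NHS actually added. The sketch below assumes the RFC 2332 layout as I understand it: a 20-byte fixed header, an 8-byte common mandatory header, and a 12-byte fixed portion for each client information entry (CIE) before its variable-length addresses. The constant and function names are my own.

```python
FIXED_HEADER = 20      # assumed RFC 2332 fixed part
MANDATORY_COMMON = 8   # assumed: src/dst protocol lengths, flags, request ID
CIE_FIXED = 12         # assumed: code, prefix, MTU, hold time, lengths, preference


def cie_size(nbma_len: int, subaddr_len: int, proto_len: int) -> int:
    """Size of one client information entry, including its addresses."""
    return CIE_FIXED + nbma_len + subaddr_len + proto_len


def extension_offset(nbma_len: int, proto_len: int, cie_sizes: list) -> int:
    """Offset of the first extension (the 'extoff' value in the debug)."""
    addresses = nbma_len + 2 * proto_len  # src NBMA, src protocol, dst protocol
    return FIXED_HEADER + MANDATORY_COMMON + addresses + sum(cie_sizes)


# Resolution request: the mandatory CIE carries no addresses (all lengths 0).
print(extension_offset(4, 4, [cie_size(0, 0, 0)]))

# CSR5 adds one forward transit NHS record: a CIE with IPv4 NBMA and VPN addresses.
original_request = 88
print(original_request + cie_size(4, 0, 4))
```

Under these assumptions the request's extension offset comes out to 52 and the forwarded request grows from 88 to 108 bytes, both of which match the debug output. The extension offset does not change when the transit record is added because that record lives in the extensions part, after the offset.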


CSR2 receives the resolution request from CSR10 via the NHS CSR5. Although the packet isn't directed at the VPN address, CSR2 acknowledges that it is the "egress" router, so it processes and replies to the received request. Inside the "responder address" field, CSR2 returns its NBMA address, which is the fundamental problem NHRP is meant to solve. Although CSR2 does know the NBMA address of CSR10 (the source), it sends its resolution reply to the NHS on the way back. I presume the NHS is used on the way back because the route back to the “source protocol” address isn't "connected" anymore, so the NHS does not make assumptions about its whereabouts (even though it technically could, given the source NBMA address).

! CSR2
NHRP: Receive Resolution Request via Tunnel100 vrf 0, packet size: 108
 (F) afn: AF_IP(1), type: IP(800), hop: 254, ver: 1
     shtl: 4(NSAP), sstl: 0(NSAP)
     pktsz: 108 extoff: 52
 (M) flags: "router auth src-stable nat ", reqid: 13
     src NBMA: 10.7.10.10
     src protocol: 10.0.100.10, dst protocol: 2.2.2.2
 (C-1) code: no error(0)
       prefix: 32, mtu: 9976, hd_time: 7200
       addr_len: 0(NSAP), subaddr_len: 0(NSAP), proto_len: 0, pref: 0
Responder Address Extension(3):
Forward Transit NHS Record Extension(4):
 (C-1) code: no error(0)
       prefix: 32, mtu: 9976, hd_time: 7200
       addr_len: 4(NSAP), subaddr_len: 0(NSAP), proto_len: 4, pref: 0
       client NBMA: 10.5.6.5
       client protocol: 10.0.100.5
Reverse Transit NHS Record Extension(5):
Authentication Extension(7):
  type:Cleartext(1), data:NHRPAUTH
NAT address Extension(9):
NHRP: Route lookup for destination 2.2.2.2 in (0x0) yielded interface Loopback0, prefixlen 32
NHRP: We are egress router. Process the NHRP Resolution Request.
NHRP: Tunnel100: Cache add for target 10.0.100.10/32 next-hop 10.0.100.10
           10.7.10.10
NHRP: Adding Tunnel Endpoints (VPN: 10.0.100.10, NBMA: 10.7.10.10)
NHRP: Successfully attached NHRP subblock for Tunnel Endpoints (VPN: 10.0.100.10, NBMA: 10.7.10.10)
NHRP: Inserted subblock node for cache: Target 10.0.100.10/32 nhop 10.0.100.10
NHRP: Converted internal dynamic cache entry for 10.0.100.10/32 interface Tunnel100 to external
NHRP: Tunnel100: Internal Cache add for target 2.2.2.2/32 next-hop 10.0.100.2
           10.8.2.2
NHRP: Attempting to send packet through interface Tunnel100 via DEST dst 10.0.100.10
NHRP: Send Resolution Reply via Tunnel100 vrf 0, packet size: 136
 src: 10.0.100.2, dst: 10.0.100.10
 (F) afn: AF_IP(1), type: IP(800), hop: 255, ver: 1
     shtl: 4(NSAP), sstl: 0(NSAP)
     pktsz: 136 extoff: 60
 (M) flags: "router auth dst-stable unique src-stable nat ", reqid: 13
     src NBMA: 10.7.10.10
     src protocol: 10.0.100.10, dst protocol: 2.2.2.2
 (C-1) code: no error(0)
       prefix: 32, mtu: 9976, hd_time: 7200
       addr_len: 4(NSAP), subaddr_len: 0(NSAP), proto_len: 4, pref: 0
       client NBMA: 10.8.2.2
       client protocol: 10.0.100.2
Responder Address Extension(3):
 (C) code: no error(0)
       prefix: 32, mtu: 9976, hd_time: 7200
       addr_len: 4(NSAP), subaddr_len: 0(NSAP), proto_len: 4, pref: 0
       client NBMA: 10.8.2.2
       client protocol: 10.0.100.2
Forward Transit NHS Record Extension(4):
 (C-1) code: no error(0)
       prefix: 32, mtu: 9976, hd_time: 7200
       addr_len: 4(NSAP), subaddr_len: 0(NSAP), proto_len: 4, pref: 0
       client NBMA: 10.5.6.5
       client protocol: 10.0.100.5
Reverse Transit NHS Record Extension(5):
Authentication Extension(7):
  type:Cleartext(1), data:NHRPAUTH
NAT address Extension(9):
NHRP: No NHRP subblock found in packet
NHRP: nhrp_ifcache: Avl Root:7F52F862AF28
NHRP: NHRP successfully mapped '10.0.100.5' to NBMA 10.5.6.5
NHRP: Encapsulation succeeded. Sending NHRP Control Packet NBMA Address: 10.5.6.5
NHRP: 160 bytes out Tunnel100

When CSR5 receives the resolution reply from CSR2, it adds the "reverse transit NHS" field to show that the reply transited through this NHS. CSR5 was used for both the forward and reverse legs of the resolution, so its information appears in both fields.

! CSR5
NHRP: Receive Resolution Reply via Tunnel100 vrf 0, packet size: 136
 (F) afn: AF_IP(1), type: IP(800), hop: 255, ver: 1
     shtl: 4(NSAP), sstl: 0(NSAP)
     pktsz: 136 extoff: 60
 (M) flags: "router auth dst-stable unique src-stable nat ", reqid: 13
     src NBMA: 10.7.10.10
     src protocol: 10.0.100.10, dst protocol: 2.2.2.2
 (C-1) code: no error(0)
       prefix: 32, mtu: 9976, hd_time: 7200
       addr_len: 4(NSAP), subaddr_len: 0(NSAP), proto_len: 4, pref: 0
       client NBMA: 10.8.2.2
       client protocol: 10.0.100.2
Responder Address Extension(3):
 (C) code: no error(0)
       prefix: 32, mtu: 9976, hd_time: 7200
       addr_len: 4(NSAP), subaddr_len: 0(NSAP), proto_len: 4, pref: 0
       client NBMA: 10.8.2.2
       client protocol: 10.0.100.2
Forward Transit NHS Record Extension(4):
 (C-1) code: no error(0)
       prefix: 32, mtu: 9976, hd_time: 7200
       addr_len: 4(NSAP), subaddr_len: 0(NSAP), proto_len: 4, pref: 0
       client NBMA: 10.5.6.5
       client protocol: 10.0.100.5
Reverse Transit NHS Record Extension(5):
Authentication Extension(7):
  type:Cleartext(1), data:NHRPAUTH
NAT address Extension(9):
NHRP: netid_in = 0, to_us = 0
NHRP: No NHRP subblock found in packet
NHRP: Forwarding Resolution Reply to 10.0.100.10
NHRP: Attempting to forward to destination: 10.0.100.10
NHRP-MPLS: tableid: 0 vrf:
NHRP: nhrp_ifcache: Avl Root:7F6270A9FFF0
NHRP: Forwarding: NHRP SAS picked source: 10.0.100.5 for destination: 10.0.100.10
NHRP: Attempting to send packet through interface Tunnel100 via DEST dst 10.0.100.10
NHRP: Forwarding Resolution Reply via Tunnel100 vrf 0, packet size: 156
 src: 10.0.100.5, dst: 10.0.100.10
 (F) afn: AF_IP(1), type: IP(800), hop: 254, ver: 1
     shtl: 4(NSAP), sstl: 0(NSAP)
     pktsz: 156 extoff: 60
 (M) flags: "router auth dst-stable unique src-stable nat ", reqid: 13
     src NBMA: 10.7.10.10
     src protocol: 10.0.100.10, dst protocol: 2.2.2.2
 (C-1) code: no error(0)
       prefix: 32, mtu: 9976, hd_time: 7200
       addr_len: 4(NSAP), subaddr_len: 0(NSAP), proto_len: 4, pref: 0
       client NBMA: 10.8.2.2
       client protocol: 10.0.100.2
Responder Address Extension(3):
 (C) code: no error(0)
       prefix: 32, mtu: 9976, hd_time: 7200
       addr_len: 4(NSAP), subaddr_len: 0(NSAP), proto_len: 4, pref: 0
       client NBMA: 10.8.2.2
       client protocol: 10.0.100.2
Forward Transit NHS Record Extension(4):
 (C-1) code: no error(0)
       prefix: 32, mtu: 9976, hd_time: 7200
       addr_len: 4(NSAP), subaddr_len: 0(NSAP), proto_len: 4, pref: 0
       client NBMA: 10.5.6.5
       client protocol: 10.0.100.5
Reverse Transit NHS Record Extension(5):
 (C-1) code: no error(0)
       prefix: 32, mtu: 9976, hd_time: 7200
       addr_len: 4(NSAP), subaddr_len: 0(NSAP), proto_len: 4, pref: 0
       client NBMA: 10.5.6.5
       client protocol: 10.0.100.5
Authentication Extension(7):
  type:Cleartext(1), data:NHRPAUTH
NAT address Extension(9):
NHRP: nhrp_ifcache: Avl Root:7F6270A9FFF0
NHRP: No NHRP subblock found in packet
NHRP: nhrp_ifcache: Avl Root:7F6270A9FFF0
NHRP: NHRP successfully mapped '10.0.100.10' to NBMA 10.7.10.10
NHRP: Encapsulation succeeded. Sending NHRP Control Packet NBMA Address: 10.7.10.10
NHRP: 180 bytes out Tunnel100

CSR10 receives the resolution reply, processes it, and adds an NHRP mapping for 2.2.2.2/32 to the NHRP cache. NHRP tries to add a new NHRP route to the RIB with AD 250 for 2.2.2.2/32. Since there is already an exact-match OSPF route with AD 110, the next-hop-override (NHO) process is triggered instead. This is a function of the CEF shortcut; we know the VPN next-hop for 2.2.2.2/32 is 10.0.100.2, so rather than use the hubs as the IGP next-hops, the router overwrites them. Ultimately, the FIB is augmented by the NHRP cache in this way. The routing table displays these entries with "%" symbols. The VPN address of CSR2 is also marked as an “nhop” and the final destination is not, which is self-explanatory.

! CSR10
NHRP: Receive Resolution Reply via Tunnel100 vrf 0, packet size: 156
 (F) afn: AF_IP(1), type: IP(800), hop: 254, ver: 1
     shtl: 4(NSAP), sstl: 0(NSAP)
     pktsz: 156 extoff: 60
 (M) flags: "router auth dst-stable unique src-stable nat ", reqid: 13
     src NBMA: 10.7.10.10
     src protocol: 10.0.100.10, dst protocol: 2.2.2.2
 (C-1) code: no error(0)
       prefix: 32, mtu: 9976, hd_time: 7200
       addr_len: 4(NSAP), subaddr_len: 0(NSAP), proto_len: 4, pref: 0
       client NBMA: 10.8.2.2
       client protocol: 10.0.100.2
Responder Address Extension(3):
 (C) code: no error(0)
       prefix: 32, mtu: 9976, hd_time: 7200
       addr_len: 4(NSAP), subaddr_len: 0(NSAP), proto_len: 4, pref: 0
       client NBMA: 10.8.2.2
       client protocol: 10.0.100.2
Forward Transit NHS Record Extension(4):
 (C-1) code: no error(0)
       prefix: 32, mtu: 9976, hd_time: 7200
       addr_len: 4(NSAP), subaddr_len: 0(NSAP), proto_len: 4, pref: 0
       client NBMA: 10.5.6.5
       client protocol: 10.0.100.5
Reverse Transit NHS Record Extension(5):
 (C-1) code: no error(0)
       prefix: 32, mtu: 9976, hd_time: 7200
       addr_len: 4(NSAP), subaddr_len: 0(NSAP), proto_len: 4, pref: 0
       client NBMA: 10.5.6.5
       client protocol: 10.0.100.5
Authentication Extension(7):
  type:Cleartext(1), data:NHRPAUTH
NAT address Extension(9):
NHRP: netid_in = 0, to_us = 1
NHRP: This NHRP Resolution Reply is a forwarded packet
NHRP: Tunnel100: Cache update for target 2.2.2.2/32 next-hop 10.0.100.2
           10.8.2.2
NHRP: Adding Tunnel Endpoints (VPN: 10.0.100.2, NBMA: 10.8.2.2)
NHRP: Inserted subblock node for cache: Target 2.2.2.2/32 nhop 10.0.100.2
NHRP: Converted internal dynamic cache entry for 2.2.2.2/32 interface Tunnel100 to external
NHRP: Adding route entry for 2.2.2.2/32 (Tunnel100) to RIB
NHRP: Route addition failed (admin-distance)
NHRP: nexthop-override added to RIB
NHRP: Adding route entry for 10.0.100.2/32 (Tunnel100) to RIB
NHRP: Route addition failed (admin-distance)
NHRP: nexthop-override added to RIB

CSR10#show ip cef 2.2.2.2
2.2.2.2/32
  nexthop 10.0.100.2 Tunnel100

CSR10#show ip route | include 2.2.2.2
O   %    2.2.2.2 [110/2001] via 10.0.100.5, 03:44:13, Tunnel100

CSR10#show ip route next-hop-override 2.2.2.2
Routing entry for 2.2.2.2/32
  Known via "ospf 1", distance 110, metric 2001, type intra area
  Redistributing via nhrp
  Last update from 10.0.100.4 on Tunnel100, 03:22:50 ago
  Routing Descriptor Blocks:
  * 10.0.100.5, from 2.2.2.2, 03:22:50 ago, via Tunnel100
      Route metric is 2001, traffic share count is 1
    10.0.100.4, from 2.2.2.2, 03:22:50 ago, via Tunnel100
      Route metric is 2001, traffic share count is 1
    [NHO]10.0.100.2, from 10.0.100.2, 00:20:01 ago, via Tunnel100
      Route metric is 1, traffic share count is 1
      MPLS label: none

CSR10#show ip nhrp shortcut
2.2.2.2/32 via 10.0.100.2
   Tunnel100 created 00:20:20, expire 01:39:39
   Type: dynamic, Flags: router rib nho
   NBMA address: 10.8.2.2
10.0.100.2/32 via 10.0.100.2
   Tunnel100 created 00:20:20, expire 01:39:39
   Type: dynamic, Flags: router nhop rib nho
   NBMA address: 10.8.2.2

Unlike phase 2, there isn't an "implicit" NHRP entry on the remote side router. Even though CSR2 showed adding the tunnel endpoints from the original request, the redirect process still occurs. The entire process occurs again in the opposite direction so that CSR2 knows how to reach 1.1.1.1/32. Most of the debugs are not shown again, but there is one significant difference. On CSR10, the route 2.2.2.2/32 could not be added to the RIB because the OSPF route (AD 110) was better than the NHRP route (AD 250). Since CSR2 never had a route for the exact prefix 1.1.1.1/32, it is able to add this longer match. CSR2 only had a summary for 1.1.1.0/24, which was enough for routing to work, but it means the NHO process need not occur. The NHRP entry does not specify "nho" for the 1.1.1.1/32 entry, but does for the VPN address of CSR10, which is expected. New NHRP routes are denoted by "H" in the routing table.

! CSR2
NHRP: Adding route entry for 1.1.1.1/32 (Tunnel100) to RIB
NHRP: Route addition to RIB Successful

CSR2#show ip cef 1.1.1.1
1.1.1.1/32
  nexthop 10.0.100.10 Tunnel100

CSR2#show ip route 1.1.1.1
Routing entry for 1.1.1.1/32
  Known via "nhrp", distance 250, metric 1
  Last update from 10.0.100.10 on Tunnel100, 00:45:24 ago
  Routing Descriptor Blocks:
  * 10.0.100.10, from 10.0.100.10, 00:45:24 ago, via Tunnel100
      Route metric is 1, traffic share count is 1
      MPLS label: none

CSR2#show ip route nhrp | begin Gateway
Gateway of last resort is 10.2.8.8 to network 0.0.0.0

      1.0.0.0/8 is variably subnetted, 2 subnets, 2 masks
H        1.1.1.1/32 [250/1] via 10.0.100.10, 00:45:42, Tunnel100

CSR2#show ip nhrp shortcut
1.1.1.1/32 via 10.0.100.10
   Tunnel100 created 00:45:13, expire 01:14:46
   Type: dynamic, Flags: router rib
   NBMA address: 10.7.10.10
10.0.100.10/32 via 10.0.100.10
   Tunnel100 created 00:45:13, expire 01:14:46
   Type: dynamic, Flags: router nhop rib nho
   NBMA address: 10.7.10.10
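The difference between CSR10 and CSR2 comes down to one decision: does an exact-match route with a better administrative distance already exist? The following is a conceptual sketch of that decision, not a representation of how IOS implements it. The `Route` class, the `apply_shortcut` function, and the dictionary-based RIB are all hypothetical names introduced for illustration.

```python
from dataclasses import dataclass
from typing import Dict

NHRP_AD = 250  # administrative distance of NHRP routes, per the outputs above


@dataclass
class Route:
    protocol: str
    distance: int
    next_hop: str
    nho: str = ""  # next-hop override; empty string when no override exists


def apply_shortcut(rib: Dict[str, Route], prefix: str, vpn_next_hop: str) -> str:
    """Install an H route, or fall back to NHO if an exact match wins on AD."""
    existing = rib.get(prefix)  # exact-match lookup only, not longest match
    if existing is not None and existing.distance < NHRP_AD:
        existing.nho = vpn_next_hop  # keep the IGP route, override its next-hop
        return "nho"
    rib[prefix] = Route("nhrp", NHRP_AD, vpn_next_hop)
    return "H"


# CSR10: OSPF already holds the exact /32, so the shortcut becomes an override.
csr10_rib = {"2.2.2.2/32": Route("ospf", 110, "10.0.100.5")}
print(apply_shortcut(csr10_rib, "2.2.2.2/32", "10.0.100.2"))

# CSR2: only the /24 summary exists, so the /32 installs as a new H route.
csr2_rib = {"1.1.1.0/24": Route("ospf", 110, "10.0.100.5")}
print(apply_shortcut(csr2_rib, "1.1.1.1/32", "10.0.100.10"))
```

The model deliberately uses an exact-match dictionary lookup because a covering summary such as 1.1.1.0/24 does not block installation of the more specific NHRP route.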

The process for IPv6 is identical. Because phase 3 makes no assumptions about the VPN topology, having coarse aggregates pointing to the hub will always work. This kind of summarization means that NHO will be nonexistent for real destinations since there is never an exact-match route already known by the spokes. Before sending any traffic from CSR1 to CSR2, CSR10 only has the hub-generated aggregate routes for 2001::/16.

CSR10#show ipv6 cef 2001:2:2:2::2
2001::/16
  nexthop FE80::4 Tunnel100
  nexthop FE80::5 Tunnel100
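To illustrate why the coarse aggregate always works and why the shortcut later takes over without touching it, here is a small longest-prefix-match sketch using the standard `ipaddress` module. The dictionary-based FIB and the `lookup` function are illustrative assumptions; the prefixes and next-hops are taken from the outputs in this section.

```python
import ipaddress


def lookup(fib: dict, destination: str):
    """Return the (prefix, next_hops) pair for the longest matching prefix."""
    address = ipaddress.ip_address(destination)
    best_prefix, best_length = None, -1
    for prefix in fib:
        network = ipaddress.ip_network(prefix)
        if address in network and network.prefixlen > best_length:
            best_prefix, best_length = prefix, network.prefixlen
    if best_prefix is None:
        raise LookupError(f"no route to {destination}")
    return best_prefix, fib[best_prefix]


# Before the shortcut: only the hub-generated aggregate exists.
fib = {"2001::/16": ["FE80::4", "FE80::5"]}
print(lookup(fib, "2001:2:2:2::2"))

# After resolution: NHRP installs a /128 that wins on prefix length.
fib["2001:2:2:2::2/128"] = ["2001:10:0:100::2"]
print(lookup(fib, "2001:2:2:2::2"))
```

Because the new /128 is more specific than anything the IGP advertised, it never collides with an existing exact-match route, which is why no next-hop override is needed for the destination itself.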

When CSR10 receives IPv6 traffic from CSR1 towards CSR2, it initially sends the traffic along the regular routing path because it knows the NBMA address for both NHSs, per the RIB and static NHRP mappings. Upon receipt, the hub returns the NHRP redirect message towards the sender, forwards the original packet back out of the tunnel, and expects to receive an NHRP resolution request from the sender soon. Because the process happens in both directions (no "implicit" mappings), the NHS tracks the redirects for a short time until they expire. After expiration, they can be regenerated if the spoke resolution requests fail for whatever reason. The output below shows pending redirects from both CSR10 and CSR2 with 6 seconds left until they expire.

CSR4#show ipv6 nhrp redirect
I/F         NBMA address   Destination      Drop Count   Expiry
Tunnel100   10.7.10.10     2001:2:2:2::2    1            00:00:06
Tunnel100   10.8.2.2       2001:1:1:1::1    1            00:00:06

Upon receipt of the redirect message, CSR10 begins the usual phase 3 resolution process. In short, the process is as follows: CSR10 sends traffic to the hub, the hub sends a redirect to CSR10, and then CSR10 issues a resolution request to follow the normal NHRP process. For these resolutions, the NHRP resolution reply does not transit the NHS and is sent directly back from the target spoke to the originating spoke. I speculate that this is different from the previous example because with OSPF P2MP, the routers had an OSPF /32 to each VPN endpoint, whereas with EIGRPv6 they did not. Having the NHRP route (not an IGP route) to the VPN endpoint may enable DMVPN to bypass the NHS on the reverse path when sending the replies.

! CSR10
NHRP: Receive Resolution Reply via Tunnel100 vrf 0, packet size: 196
 (F) afn: AF_IP(1), type: IPv6(86DD), hop: 255, ver: 1
     shtl: 4(NSAP), sstl: 0(NSAP)
     pktsz: 196 extoff: 96
 (M) flags: "router auth dst-stable unique src-stable nat ", reqid: 19
     src NBMA: 10.7.10.10
     src protocol: 2001:10:0:100::10, dst protocol: 2001:2:2:2::2
 (C-1) code: no error(0)
       prefix: 128, mtu: 9976, hd_time: 7200
       addr_len: 4(NSAP), subaddr_len: 0(NSAP), proto_len: 16, pref: 0
       client NBMA: 10.8.2.2
       client protocol: 2001:10:0:100::2
Responder Address Extension(3):
 (C) code: no error(0)
       prefix: 128, mtu: 9976, hd_time: 7200
       addr_len: 4(NSAP), subaddr_len: 0(NSAP), proto_len: 16, pref: 0
       client NBMA: 10.8.2.2
       client protocol: 2001:10:0:100::2
Forward Transit NHS Record Extension(4):
 (C-1) code: no error(0)
       prefix: 128, mtu: 9976, hd_time: 7200
       addr_len: 4(NSAP), subaddr_len: 0(NSAP), proto_len: 16, pref: 0
       client NBMA: 10.4.12.4
       client protocol: 2001:10:0:100::4
Reverse Transit NHS Record Extension(5):
[no output]

When the resolution completes, CSR10 has specific CEF entries for CSR2's loopback, which was the target address inside the VPN. The following outputs mirror the IPv4 outputs seen earlier for other tests. The "rib" flag signifies that there are exact-match routes for these entries in the RIB, but they don’t necessarily have to be H-routes. If “rib” is present and “nho” is absent, the route is surely an H route. When both are present, the route is a non-H route; NHO and H-routes are mutually exclusive within the context of a single prefix.

CSR10#show ipv6 cef 2001:2:2:2::2
2001:2:2:2::2/128
  nexthop 2001:10:0:100::2 Tunnel100

CSR10#show ipv6 route 2001:2:2:2::2/128
Routing entry for 2001:2:2:2::2/128
  Known via "NHRP-IPv6", distance 250, metric 1
  Route count is 1/1, share count 0
  Routing paths:
    2001:10:0:100::2, Tunnel100
      Last updated 00:01:10 ago

CSR10#show ipv6 route nhrp | begin ^H
H   2001:2:2:2::2/128 [250/1]
     via 2001:10:0:100::2, Tunnel100
H   2001:10:0:100::2/128 [250/1]
     via 2001:10:0:100::2, Tunnel100

CSR10#show ipv6 nhrp shortcut
2001:2:2:2::2/128 via 2001:10:0:100::2
   Tunnel100 created 00:00:45, expire 01:59:14
   Type: dynamic, Flags: router rib
   NBMA address: 10.8.2.2
2001:10:0:100::2/128 via 2001:10:0:100::2
   Tunnel100 created 00:00:45, expire 01:59:14
   Type: dynamic, Flags: router used nhop rib
   NBMA address: 10.8.2.2
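The flag rule described above is mechanical enough to express as a small helper, which can be handy when scanning many shortcut entries. This is a minimal sketch; the function name `classify_shortcut` and the returned labels are my own, and the behavior when "rib" is absent is an assumption since the text does not cover that case.

```python
def classify_shortcut(flags: str) -> str:
    """Classify an NHRP shortcut entry from its 'Flags:' string."""
    tokens = set(flags.split())
    if "rib" not in tokens:
        # Assumption: without "rib" there is no exact-match RIB entry to classify.
        return "no RIB entry"
    if "nho" in tokens:
        return "NHO on a non-H route"
    return "H route"


# Flag strings copied from the shortcut outputs in this section.
for flags in ("router rib", "router used nhop rib", "router rib nho", "router nhop rib nho"):
    print(f"{flags!r:28} -> {classify_shortcut(flags)}")
```

Run against the outputs in this section, the IPv6 entries all classify as H routes, while the IPv4 entries on CSR10 classify as overrides because OSPF already held exact-match /32 routes.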

For completeness, the verification on CSR2 is shown below. As discussed earlier, the phase 3 resolution process happens in the reverse direction assuming traffic needs to flow bidirectionally.

CSR2#show ipv6 cef 2001:1:1:1::1
2001:1:1:1::1/128
  nexthop 2001:10:0:100::10 Tunnel100

CSR2#show ipv6 route 2001:1:1:1::1/128
Routing entry for 2001:1:1:1::1/128
  Known via "NHRP-IPv6", distance 250, metric 1
  Route count is 1/1, share count 0
  Routing paths:
    2001:10:0:100::10, Tunnel100
      Last updated 00:08:16 ago

CSR2#show ipv6 route nhrp | begin ^H
H   2001:1:1:1::1/128 [250/1]
     via 2001:10:0:100::10, Tunnel100
H   2001:10:0:100::10/128 [250/1]
     via 2001:10:0:100::10, Tunnel100

CSR2#show ipv6 nhrp shortcut
2001:1:1:1::1/128 via 2001:10:0:100::10
   Tunnel100 created 00:08:26, expire 01:51:33
   Type: dynamic, Flags: router rib
   NBMA address: 10.7.10.10
2001:10:0:100::10/128 via 2001:10:0:100::10
   Tunnel100 created 00:08:26, expire 01:51:33
   Type: dynamic, Flags: router nhop rib
   NBMA address: 10.7.10.10

Additional Reading – Reference configurations "dmvpn3"

35.3 mGRE-based L3VPN

The L3VPN-over-mGRE feature is used to dynamically connect customers within remote provider sites together. Imagine a CSC-like scenario where the core carrier refuses to provide a labeled-unicast service across two customer carrier islands. The connectivity between the customer carriers is IP only, which means MPLS transport doesn’t work automatically. The L3VPNoGRE feature fully automates the tunnel endpoint discovery and maintains the BGP VPNv4/v6 allocated labels between customer carrier sites. The feature also supports the MVPN profile 0 (PIM/GRE) implementation for efficient multicast transport; this is sensible since profile 0 works well over CSC architectures.

The network used for this test is quite complex; it is a CSC scenario where the core carrier provides an IP-only VPN to three customer carrier sites. The core carrier uses mLDP to transport LSM for variety since the core and customer carriers can use entirely different MVPN techniques. The customer carrier sites must use profile 0 since their CSC service is IP only. CSR6, CSR9, and CSR10 are the customer PE routers that service the true customers, CSR1, CSR4, and CSR8. The XRv routers don't support the dynamic L3VPNoGRE feature, and are P routers within the customer carriers. Different IGPs are used in different regions for variety. Before continuing with this test, ensure you understand MVPN/CSC technologies and techniques.


The configuration is rather involved, so the initial verifications are limited to basic reachability. Within the core, CSR2 is the RR for VPNv4/v6 AFs using the BGP listen command (dynamic peer groups). We quickly verify that the neighbors are up and we receive at least some routes. VPNv6 in the core carrier was a test for other things and is not directly related to this test.

CSR2#show bgp vpnv4 unicast all summary | begin Neighbor
Neighbor        V    AS MsgRcvd MsgSent   TblVer  InQ OutQ Up/Down  State/PfxRc
*3.3.3.3        4    33     304     337       32    0    0 04:21:39        1
*5.5.5.5        4    33     307     335       32    0    0 04:24:47        1
*7.7.7.7        4    33     309     334       32    0    0 04:24:54        1
10.255.14.14    4 65000     250     284       32    0    0 04:06:26        1
* Dynamically created based on a listen range command
Dynamically created neighbors: 3, Subnet ranges: 1

BGP peergroup IBGP listen range group members:
  0.0.0.0/0

Total dynamically created neighbors: 3/(100 max), Subnet ranges: 1

CSR2#show bgp vpnv6 unicast all summary | begin Neighbor
Neighbor        V    AS MsgRcvd MsgSent   TblVer  InQ OutQ Up/Down  State/PfxRc
*3.3.3.3        4    33     313     346       36    0    0 04:29:17        1
*5.5.5.5        4    33     316     344       36    0    0 04:32:25        1
*7.7.7.7        4    33     317     344       36    0    0 04:32:32        1
FD00:10:255:14::14
                4 65000     259     289       36    0    0 04:14:04        1
* Dynamically created based on a listen range command
Dynamically created neighbors: 3, Subnet ranges: 1

BGP peergroup IBGP listen range group members:
  0.0.0.0/0

Total dynamically created neighbors: 3/(100 max), Subnet ranges: 1

We also check the BGP RR for the customer carrier, which is CSR6. Rather than check the BGP summary for the neighbors, I will check the BGP tables for VPNv4, VPNv6, and IPv4 MDT. We learn the two remote VPNv4/v6 routes (one from each remote PE), as well as the MDT sources for MVPN transport within the IPv4 MDT address family.

CSR6#show bgp vpnv4 unicast all | begin Network
     Network          Next Hop            Metric LocPrf Weight Path
Route Distinguisher: 65000:6 (default for vrf VPN)
 *>  192.168.1.0      0.0.0.0                  0         32768 ?
 *>i 192.168.4.0      10.10.10.10              0    100      0 ?
 *>i 192.168.8.0      9.9.9.9                  0    100      0 ?
Route Distinguisher: 65000:9
 *>i 192.168.8.0      9.9.9.9                  0    100      0 ?
Route Distinguisher: 65000:10
 *>i 192.168.4.0      10.10.10.10              0    100      0 ?

CSR6#show bgp vpnv6 unicast all | begin Network
     Network          Next Hop            Metric LocPrf Weight Path
Route Distinguisher: 65000:6 (default for vrf VPN)
 *>  FD00:192:168:1::/64
                      ::                       0         32768 ?
 *>i FD00:192:168:4::/64
                      ::FFFF:10.10.10.10       0    100      0 ?
 *>i FD00:192:168:8::/64
                      ::FFFF:9.9.9.9           0    100      0 ?
Route Distinguisher: 65000:9
 *>i FD00:192:168:8::/64
                      ::FFFF:9.9.9.9           0    100      0 ?
Route Distinguisher: 65000:10
 *>i FD00:192:168:4::/64
                      ::FFFF:10.10.10.10       0    100      0 ?

CSR6#show bgp ipv4 mdt all | begin Network
     Network          Next Hop            Metric LocPrf Weight Path
Route Distinguisher: 65000:6 (default for vrf VPN)
 *>  6.6.6.6/32       0.0.0.0                                0 ?
Route Distinguisher: 65000:9
 *>i 9.9.9.9/32       9.9.9.9                  0    100      0 ?
Route Distinguisher: 65000:10
 *>i 10.10.10.10/32   10.10.10.10              0    100      0 ?

Given this, we know the BGP control plane is properly built and there is IP reachability between CSR6, CSR9, and CSR10. If there were not, the BGP sessions between customer carrier PEs would not work. The LSP tracing verification procedure does not work, because there is no LSP between these PEs, despite them trying to run VPNv4/v6. We receive a BGP label from the remote PE (CSR10), but we have no transport label. XRv2 and XRv4 do not allocate labels because 10.10.10.10/32 was a BGP learned prefix from their perspective, and the provider never allocated a label for it via labeled-unicast.

CSR6#show bgp vpnv4 unicast vrf VPN 192.168.4.0
BGP routing table entry for 65000:6:192.168.4.0/24, version 49
Paths: (1 available, best #1, table VPN)
  Not advertised to any peer
  Refresh Epoch 2
  Local, (Received from a RR-client), imported path from 65000:10:192.168.4.0/24 (global)
    10.10.10.10 (metric 10) (via default) from 10.10.10.10 (10.10.10.10)
      Origin incomplete, metric 0, localpref 100, valid, internal, best
      Extended Community: RT:65000:10
      Connector Attribute: count=1
       type 1 len 12 value 65000:10:10.10.10.10
      mpls labels in/out nolabel/10006
      rx pathid: 0, tx pathid: 0x0

CSR6#show ip route 10.10.10.10
Routing entry for 10.10.10.10/32
  Known via "isis", distance 115, metric 10, type level-1
  Redistributing via isis 65000
  Last update from 10.6.13.13 on GigabitEthernet2.563, 02:59:24 ago
  Routing Descriptor Blocks:
  * 10.6.14.14, from 14.14.14.14, 02:59:24 ago, via GigabitEthernet2.564
      Route metric is 10, traffic share count is 1
    10.6.13.13, from 13.13.13.13, 02:59:24 ago, via GigabitEthernet2.563
      Route metric is 10, traffic share count is 1

CSR6#show mpls forwarding-table 10.10.10.10
Local      Outgoing   Prefix           Bytes Label   Outgoing   Next Hop
Label      Label      or Tunnel Id     Switched      interface
6009       No Label   10.10.10.10/32   0             Gi2.563    10.6.13.13
           No Label   10.10.10.10/32   0             Gi2.564    10.6.14.14

The solution for this problem involves creating an L3VPN encapsulation profile. Here, we specify tunnel template parameters, somewhat similar to auto-tunnel configuration in TE. We can specify the tunnel protocol (only IPv4/GRE is currently supported) along with optional tunnel source and key. We then create a generalized route-map for both VPNv4 and VPNv6 that changes the next-hop on ingress. The logic is that when BGP receives a VPNv4/v6 prefix, it looks at the next-hop and creates a GRE tunnel endpoint from it. This alleviates the need for DMVPN-style signaling, so the PEs can communicate directly. BGP already knows all the endpoints since these are also the VPNv4/v6 next-hops. Even though only IPv4 is supported for the outer encapsulation at this time, VPNv6 via 6VPE can function using this feature, so we also adjust the IPv6 next-hop.

! CSR6, CSR9, CSR10
l3vpn encapsulation ip L3VPN_PROFILE
 transport ipv4 source Loopback0
 protocol gre key 65000
route-map RM_VPN_NHOP_IN permit 10
 set ip next-hop encapsulate l3vpn L3VPN_PROFILE
 set ipv6 next-hop encapsulate l3vpn L3VPN_PROFILE
router bgp 65000
 address-family vpnv4 unicast
  neighbor IBGP route-map RM_VPN_NHOP_IN in
 address-family vpnv6 unicast
  neighbor IBGP route-map RM_VPN_NHOP_IN in
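The endpoint discovery logic can be modeled in a few lines to show why no extra signaling is required. This is a conceptual sketch only and does not represent the router's internal implementation; the `tunnel_endpoints` function and the tuple-based route list are hypothetical, with prefixes and next-hops taken from CSR6's VPNv4 table shown earlier (prefix lengths assumed to be /24).

```python
from collections import defaultdict


def tunnel_endpoints(vpn_routes, local_next_hops=("0.0.0.0",)):
    """Group received VPN prefixes by BGP next-hop; each key is a GRE endpoint."""
    endpoints = defaultdict(list)
    for prefix, next_hop in vpn_routes:
        if next_hop in local_next_hops:
            continue  # locally originated routes need no tunnel
        endpoints[next_hop].append(prefix)
    return dict(endpoints)


# (prefix, BGP next-hop) pairs as seen on CSR6.
csr6_vpnv4 = [
    ("192.168.1.0/24", "0.0.0.0"),
    ("192.168.4.0/24", "10.10.10.10"),
    ("192.168.8.0/24", "9.9.9.9"),
]

for endpoint, prefixes in tunnel_endpoints(csr6_vpnv4).items():
    print(f"GRE endpoint {endpoint}: {prefixes}")
```

Every remote PE that advertises at least one VPN route automatically becomes a multipoint GRE destination, so the set of tunnel endpoints always tracks the BGP table.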

After applying this configuration to the customer carrier PEs, the BGP path now shows reachability to 10.10.10.10 via Tunnel0. This is a dynamic entity that serves as a template, and the tunnel configuration reflects the configured source address and key. Nothing has changed with the MPLS label bindings. 1967 © 2016 Nicholas J. Russo

CSR6#show bgp vpnv4 unicast vrf VPN 192.168.4.0/24
BGP routing table entry for 65000:6:192.168.4.0/24, version 10
Paths: (1 available, best #1, table VPN)
  Not advertised to any peer
  Refresh Epoch 2
  Local, (Received from a RR-client), imported path from 65000:10:192.168.4.0/24 (global)
    10.10.10.10 (metric 10) (via default) (via Tunnel0) from 10.10.10.10 (10.10.10.10)
      Origin incomplete, metric 0, localpref 100, valid, internal, best
      Extended Community: RT:65000:10
      Connector Attribute: count=1
       type 1 len 12 value 65000:10:10.10.10.10
      mpls labels in/out nolabel/10006
      rx pathid: 0, tx pathid: 0x0
CSR6#show derived-config interface tunnel0
Building configuration...
Derived configuration : 151 bytes
!
interface Tunnel0
 ip unnumbered Loopback0
 no ip redirects
 ipv6 enable
 tunnel source Loopback0
 tunnel mode gre multipoint
 tunnel key 65000
end

The routing table never changed, but now VPN traffic destined for those BGP next-hops is sent inside this tunnel using the L3VPN encapsulation profile. There are new adjacencies in the FIB that detail these connections. One is listed as IP and one is listed as MPLS (TAG).

CSR6#show adjacency 10.10.10.10
Protocol Interface                 Address
IP       Tunnel0                   10.10.10.10(4)
TAG      Tunnel0                   10.10.10.10(4)

Digging into the encapsulation details, we see the only difference between these two entries is the protocol type (0x0800 for IP versus 0x8847 for MPLS). We expect the MPLS traffic inside this tunnel to have the VPN label directly beneath this layer of encapsulation, before the customer IP header. The tunnel key of 65000 (0xFDE8) marks the end of the GRE encapsulation. Immediately following this tunnel key, assuming the type is 0x8847, will be the customer VPN label.

CSR6#show adjacency 10.10.10.10 encapsulation
Protocol Interface                 Address
IP       Tunnel0                   10.10.10.10(4)
  Encap length 28
  4500000000000000FF2F9BAF06060606
  0A0A0A0A200008000000FDE8
  Provider: TUNNEL
  Protocol header count in macstring: 2
    HDR 0: ipv4
      dst: static, 10.10.10.10
      src: static, 6.6.6.6
      prot: static, 47
      ttl: static, 255
      df: static, cleared
      per packet fields: tos ident tl chksm
    HDR 1: gre
      prot: static, 0x800
      key: static, 65000
      per packet fields: none
TAG      Tunnel0                   10.10.10.10(4)
  Encap length 28
  4500000000000000FF2F9BAF06060606
  0A0A0A0A200088470000FDE8
  Provider: TUNNEL
  Protocol header count in macstring: 2
    HDR 0: ipv4
      dst: static, 10.10.10.10
      src: static, 6.6.6.6
      prot: static, 47
      ToS: static, 0
      ttl: static, 255
      df: static, cleared
      per packet fields: ident tl chksm
    HDR 1: gre
      prot: static, 0x8847
      key: static, 65000
      per packet fields: none
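The 28-byte rewrite string of the TAG adjacency can be decoded by hand to confirm the fields the router reports. The following is a minimal Python sketch, not a Cisco tool; the function name decode_gre_adjacency and the dictionary keys are hypothetical, and the only input is the hex string copied from the output above.

```python
import ipaddress
import struct

# Rewrite string copied from the TAG adjacency in the output above (28 bytes).
MACSTRING = "4500000000000000FF2F9BAF060606060A0A0A0A200088470000FDE8"


def decode_gre_adjacency(hex_string):
    """Split an IPv4 + keyed GRE rewrite string into its header fields.

    Hypothetical helper for illustration; assumes a 20-byte IPv4 header
    (no options) followed directly by the GRE header.
    """
    raw = bytes.fromhex(hex_string)
    # IPv4 header: version/IHL, ToS, length, ID, flags/fragment, TTL,
    # protocol, checksum, source, destination.
    (ver_ihl, tos, total_len, ident, flags_frag,
     ttl, proto, checksum, src, dst) = struct.unpack("!BBHHHBBH4s4s", raw[:20])
    # GRE header: 2 bytes of flags/version and 2 bytes of protocol type.
    gre_flags, gre_proto = struct.unpack("!HH", raw[20:24])
    # The 4-byte key is present only when the K bit (0x2000) is set.
    key = struct.unpack("!I", raw[24:28])[0] if gre_flags & 0x2000 else None
    return {
        "length": len(raw),
        "src": str(ipaddress.IPv4Address(src)),
        "dst": str(ipaddress.IPv4Address(dst)),
        "ip_proto": proto,
        "ttl": ttl,
        "gre_proto": gre_proto,
        "gre_key": key,
    }


fields = decode_gre_adjacency(MACSTRING)
print(fields)
```

The decoded values line up with the HDR 0 and HDR 1 sections of the show command: IP protocol 47, TTL 255, GRE protocol type 0x8847, and key 65000.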

We can also check the tunnel endpoints to see what was discovered. Because this tunnel is used for VPNv4 and VPNv6, both IPv4 and IPv6 BGP next-hops are bound to that transport address.

CSR6#show tunnel endpoints tunnel 0
 Tunnel0 running in multi-GRE/IP mode
 Endpoint transport 9.9.9.9 Refcount 4 Base 0x7FEC1811F158 Create Time 01:35:03
   overlay ::FFFF:9.9.9.9 Refcount 2 Parent 0x7FEC1811F158 Create Time 01:35:03
   overlay 9.9.9.9 Refcount 2 Parent 0x7FEC1811F158 Create Time 01:35:03
 Endpoint transport 10.10.10.10 Refcount 4 Base 0x7FEC1811F2E8 Create Time 01:34:58
   overlay ::FFFF:10.10.10.10 Refcount 2 Parent 0x7FEC1811F2E8 Create Time 01:34:58
   overlay 10.10.10.10 Refcount 2 Parent 0x7FEC1811F2E8 Create Time 01:34:58
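The relationship between BGP next-hops and tunnel endpoints can be sketched in a few lines of Python. This is only an illustration of the idea, not how IOS-XE implements it: the function names are hypothetical, and the sketch assumes the VPNv6 next-hop is an IPv4-mapped IPv6 address, as the overlay entries in the output suggest.

```python
import ipaddress


def transport_endpoint(next_hop):
    """Return the IPv4 transport address behind a VPNv4 or VPNv6 next-hop.

    Hypothetical helper: an IPv4-mapped IPv6 next-hop (::FFFF:a.b.c.d)
    collapses to the embedded IPv4 address; anything else is used as-is.
    """
    addr = ipaddress.ip_address(next_hop)
    if addr.version == 6 and addr.ipv4_mapped is not None:
        return addr.ipv4_mapped
    return addr


def build_endpoints(next_hops):
    """Group overlay next-hops under the transport address they share."""
    endpoints = {}
    for next_hop in next_hops:
        transport = str(transport_endpoint(next_hop))
        endpoints.setdefault(transport, []).append(next_hop)
    return endpoints


# Next-hops as they appear in the overlay lines of the output above.
learned = ["9.9.9.9", "::FFFF:9.9.9.9", "10.10.10.10", "::FFFF:10.10.10.10"]
endpoints = build_endpoints(learned)
for transport, overlays in endpoints.items():
    print(transport, overlays)
```

Each transport address ends up with two overlay entries, one per address family, which is consistent with the two overlay lines listed under every endpoint.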


As an example, we will send traffic from CSR4 to CSR8 inside the customer network. The IGP cost between CSR7 and CSR5 in the core carrier is high, so capturing on CSR2's interface towards CSR7 should yield the most encapsulation. The significant components are highlighted in different colors and are shown below.

CSR4#ping 192.168.8.8
CSR2#show monitor capture CAP7 buffer dump
0000: 000C295C E1E9000C 29664C2C 81000DC7   ..)\....)fL,....
0010: 8847007D 80FD013B 91FD4500 00840010   .G.}...;..E.....
0020: 0000FD2F 97150A0A 0A0A0909 09092000   .../.......... .
0030: 88470000 FDE80232 D1FF4500 00640030   .G.....2..E..d.0
0040: 0000FF01 2E05C0A8 040AC0A8 08090800   ................
0050: 7B3B000A 00010000 00009568 6D9BABCD   {;.........hm...
[snip]

The first 12 bytes are the destination and source MAC addresses, respectively. The next 4 bytes specify 802.1Q encapsulation with VLAN ID 0xDC7 (3527). 0x8847 specifies MPLS unicast, and 0x7D8 (2008) is the LDP label allocated by CSR2 to reach CSR5, highlighted in yellow.

CSR7#show mpls ldp bindings 5.5.5.5 32
  lib entry: 5.5.5.5/32, rev 8
        local binding:  label: 7000
        remote binding: lsr: 5.5.5.5:0, label: imp-null
        remote binding: lsr: 2.2.2.2:0, label: 2008

The next label is 0x13B9 (5049), which is the VPNv4 label from CSR5 describing reachability to CSR9. The S-bit is set, as expected. This is the bottom-most label of the core carrier stack and is also highlighted in yellow.

CSR7#show bgp vpnv4 unicast vrf CH labels
   Network          Next Hop      In label/Out label
Route Distinguisher: 33:7 (CH)
   6.6.6.6/32       3.3.3.3         nolabel/3001
                    2.2.2.2         nolabel/2003
   9.9.9.9/32       5.5.5.5         nolabel/5049
   10.10.10.10/32   10.255.11.11    7010/nolabel

Next is an IP packet with protocol 0x2F (47) for GRE in cyan. Source and destination addresses are highlighted in green (10.10.10.10 -> 9.9.9.9). The payload of this GRE/IP packet is 0x8847 with a tunnel key of 0xFDE8 (65000). The next label is 0x232D (9005), which is the VPN label allocated by CSR9 describing reachability to the connected route with CSR8, highlighted in grey. This proves that the VPN label immediately follows the GRE header.

CSR10#show bgp vpnv4 unicast vrf VPN labels
   Network          Next Hop      In label/Out label
Route Distinguisher: 65000:10 (VPN)
   192.168.1.0      6.6.6.6         nolabel/6002
   192.168.4.0      0.0.0.0         10006/nolabel(VPN)
   192.168.8.0      9.9.9.9         nolabel/9005

The final significant piece of this encapsulation is the final IP header, which is IP protocol 1 (ICMP), from 192.168.4.10 to 192.168.8.9 (in pink). This is the original VPN traffic. We can summarize the encapsulation for unicast traffic as follows:

1. Layer 2 encapsulation – Ethernet, FR, ATM, PPP, etc
2. MPLS transport core labels – LDP, RSVP-TE, TE-FRR, Segment Routing, etc
3. MPLS VPN core label – BGP VPNv4 always, since only IPv4 is supported for outer encapsulation
4. IP/GRE header (L3VPN encap profile) – 4 or 8 bytes depending on key, type is always 0x8847
5. MPLS VPN customer label – BGP VPNv4 or VPNv6 between customer PEs
6. Customer traffic – IPv4 or IPv6 customer traffic
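The layering summarized above can be checked by walking the captured frame programmatically. The Python sketch below is illustrative only: the helper names are hypothetical, the frame bytes are the first five rows of the CAP7 buffer dump read row by row, and it assumes a single 802.1Q tag and IPv4 headers without surprises.

```python
import ipaddress
import struct

# First 80 bytes of the unicast capture (rows 0000 through 0040 of the dump).
FRAME = bytes.fromhex(
    "000C295CE1E9000C29664C2C81000DC7"
    "8847007D80FD013B91FD450000840010"
    "0000FD2F97150A0A0A0A090909092000"
    "88470000FDE80232D1FF450000640030"
    "0000FF012E05C0A8040AC0A808090800"
)


def mpls_entry(word):
    """Split a 32-bit label stack entry: 20-bit label, 3-bit EXP, S-bit, TTL."""
    return {"label": word >> 12, "exp": (word >> 9) & 0x7,
            "s": (word >> 8) & 0x1, "ttl": word & 0xFF}


def read_label_stack(buf, offset):
    """Read label stack entries until the bottom-of-stack (S) bit is set."""
    stack = []
    while True:
        (word,) = struct.unpack_from("!I", buf, offset)
        offset += 4
        stack.append(mpls_entry(word))
        if stack[-1]["s"]:
            return stack, offset


def read_ipv4(buf, offset):
    """Return protocol, source, destination, and the offset after the header."""
    header_len = (buf[offset] & 0x0F) * 4
    proto = buf[offset + 9]
    src = str(ipaddress.IPv4Address(buf[offset + 12:offset + 16]))
    dst = str(ipaddress.IPv4Address(buf[offset + 16:offset + 20]))
    return proto, src, dst, offset + header_len


def decode(frame):
    offset = 12  # skip destination and source MAC addresses
    tpid, tci, ethertype = struct.unpack_from("!HHH", frame, offset)
    offset += 6
    core_stack, offset = read_label_stack(frame, offset)
    outer = read_ipv4(frame, offset)
    offset = outer[3]
    gre_flags, gre_proto = struct.unpack_from("!HH", frame, offset)
    offset += 4
    gre_key = None
    if gre_flags & 0x2000:  # K bit set: a 4-byte key follows
        (gre_key,) = struct.unpack_from("!I", frame, offset)
        offset += 4
    vpn_stack, offset = read_label_stack(frame, offset)
    inner = read_ipv4(frame, offset)
    return {
        "vlan": tci & 0x0FFF,
        "ethertype": ethertype,
        "core_labels": [e["label"] for e in core_stack],
        "outer_ip": outer[:3],
        "gre_proto": gre_proto,
        "gre_key": gre_key,
        "vpn_labels": [e["label"] for e in vpn_stack],
        "inner_ip": inner[:3],
    }


result = decode(FRAME)
for name, value in result.items():
    print(name, value)
```

Reading the result from top to bottom reproduces items 1 through 6 of the list: VLAN 3527, core labels 2008 and 5049, the GRE/IP header with key 65000, customer VPN label 9005, and the customer ICMP packet.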

Next, we will examine multicast transport. Both the customer carrier and final customers are using SSM only to simplify testing, so there are no RPs anywhere. The core carrier is PIM-free using mLDP with CSR3 as the root of the MP2MP tree (MVPN profile 1). There are no P2MP mLDP trees in this example, although there is nothing precluding them from existing. CSR1 and CSR8 are both receivers for group 232.4.4.4 and signal their interest to the PEs using IGMPv3. The joins include the source 192.168.4.4, which is the only sender. The customer PEs will encapsulate this multicast traffic into GRE multicast using the MDT default group 232.0.0.255 (MVPN profile 0). We will quickly check CSR10 to verify its MDT status. CSR10 has VRF-aware PIM neighbors within the customer VPN.

CSR10#show ip pim mdt
  * implies mdt is the default MDT, # is (*,*) Wildcard,
  > is non-(*,*) Wildcard
  MDT Group/Num   Interface   Source      VRF
* 232.0.0.255     Tunnel1     Loopback0   VPN
CSR10#show ip pim vrf VPN neighbor
PIM Neighbor Table
Mode: B - Bidir Capable, DR - Designated Router, N - Default DR Priority,
      P - Proxy Capable, S - State Refresh Capable, G - GenID Capable,
      L - DR Load-balancing Capable
Neighbor          Interface   Uptime/Expires      Ver   DR
Address                                                 Prio/Mode
9.9.9.9           Tunnel1     01:51:51/00:01:35   v2    1 / S P G
6.6.6.6           Tunnel1     03:33:13/00:01:35   v2    1 / S P G

XRv11 receives a P(S,G) join from CSR7 for the default MDT. Customer AS 65000 is PIM-enabled for MVPN transport, as expected, so XRv11’s only job is to continue building the P(S,G) tree towards the source, CSR10. XRv11’s P(S,G) state shown below forwards traffic along the SPT towards the core carrier. For brevity, we do not trace the entire P(S,G) join process; if XRv11 has the join, it is indicative of a successful configuration. Additional MVPN verifications are detailed in the corresponding chapter.

RP/0/0/CPU0:XRv11#show pim topology 232.0.0.255 10.10.10.10 | begin 232
(10.10.10.10,232.0.0.255)SPT SSM Up: 01:19:08
JP: Join(00:00:41) RPF: GigabitEthernet0/0/0/0.510,10.10.11.10 Flags:
  GigabitEthernet0/0/0/0.571  01:19:08  fwd Join(00:03:14)

A quick check of CSR7's MDT status shows all neighbors are up. MDT 0 within the context of mLDP means the default MDT (MP2MP delivery tree). Even without verifying this, we could assume it was working based on the P(S,G) join above. If the core carrier’s MDT is broken, the customer MDTs relying on it are, by extension, also broken.

CSR7#show ip pim mdt
  * implies mdt is the default MDT, # is (*,*) Wildcard,
  > is non-(*,*) Wildcard
  MDT Group/Num   Interface   Source      VRF
* 0               Lspvif0     Loopback0   CH
CSR7#show ip pim vrf CH neighbor
PIM Neighbor Table
Mode: B - Bidir Capable, DR - Designated Router, N - Default DR Priority,
      P - Proxy Capable, S - State Refresh Capable, G - GenID Capable,
      L - DR Load-balancing Capable
Neighbor          Interface              Uptime/Expires      Ver   DR
Address                                                            Prio/Mode
10.255.11.11      GigabitEthernet2.571   03:41:22/00:01:36   v2    1 / DR P G
5.5.5.5           Lspvif0                03:37:09/00:01:36   v2    1 / S P G
3.3.3.3           Lspvif0                03:37:09/00:01:38   v2    1 / S P G
2.2.2.2           Lspvif0                03:37:09/00:01:31   v2    1 / S P G

When CSR4 starts sending traffic, CSR10 encapsulates it inside the default MDT and forwards it to XRv11. XRv11 forwards it to CSR7 so it enters the core carrier. These basic steps are not shown, and we skip to CSR7 for additional verification. Upon receipt, CSR7 encapsulates the traffic inside MPLS (LSM) with a single mLDP label allocated by CSR2. This is upstream towards the root, which is CSR3.

CSR7#show mpls mldp database opaque_type mdt 33:65000 0
LSM ID : 1 (RNR LSM ID: 2)   Type: MP2MP   Uptime : 03:38:27
  FEC Root           : 3.3.3.3
  Opaque decoded     : [mdt 33:65000 0]
  Opaque length      : 11 bytes
  Opaque value       : 02 000B 0000330006500000000000
  RNR active LSP     : (this entry)
  Upstream client(s) :
    2.2.2.2:0    [Active]
      Expires        : Never         Path Set ID  : 1
      Out Label (U)  : 2011          Interface    : GigabitEthernet2.527*
      Local Label (D): 7005          Next Hop     : 33.2.7.2
  Replication client(s):
    MDT  (VRF CH)
      Uptime         : 03:38:27      Path Set ID  : 2
      Interface      : Lspvif0

At this time, we will enable the capture inbound on CSR2 and use that to view the encapsulation as we did for unicast. We expect to see a single transport label (mLDP) after the layer 2 encapsulation with no customer carrier MPLS encapsulation, since that carrier is using MVPN profile 0 (no LSM). The output is described in detail next.

CSR4#ping 232.4.4.4 repeat 5 timeout 0
CSR2#show monitor capture CAP7 buffer dump
0000: 000C295C E1E9000C 29664C2C 81000DC7   ..)\....)fL,....
0010: 8847007D B1FD4500 007C00AB 0000FD2F   .G.}..E..|...../
0020: BF940A0A 0A0AE800 00FF0000 08004500   ..............E.
0030: 00641178 0000FE01 FA6BC0A8 0404E804   .d.x.....k......
0040: 04040800 ACF40027 00010000 00009669   .......'.......i
0050: 3AC4ABCD ABCDABCD ABCDABCD ABCDABCD   :...............

Jumping straight to the first set of MPLS labels, we see value 0x7DB (2011) used to send traffic towards the mLDP root (CSR3), using an upstream mLDP label allocated by CSR2 (yellow). This is the only label, and the assertion of the S-bit proves this. Immediately following this is a GRE/IP header with source 10.10.10.10 and destination 232.0.0.255, which represent the MDT source and default group, respectively (green). Following that is another IP packet, the customer ICMP ping, from 192.168.4.4 to 232.4.4.4 (cyan). There is significantly less MPLS interaction here since the core carrier uses a single label to transport the multicast between PEs (no PHP) and the customer carrier doesn't use LSM at all.

The point of this demonstration was to show that the L3VPN encapsulation profile supports both unicast and multicast transport. Like any good CSC solution, the customer and core carrier forwarding mechanisms can be entirely different as well.

The final set of tests will quickly evaluate IPv6 unicast and multicast across the network. IPv6 SSM is used again in the customer network as transport and is encapsulated inside IPv4 GRE multicast at the customer PEs. Since the P(S,G) signaled is still IPv4-based, we don’t need to retrace the signaling path for the IPv6 MVPN flow. We will perform a unicast capture first, and then validate the packet contents against the control-plane show commands. The IPv6 header is much larger to account for larger addresses, but decoding it is easier because the numbers are hexadecimal in the first place.

CSR4#ping fd00:192:168:8::8
CSR2#show monitor capture CAP7 buffer dump
0000: 000C295C E1E9000C 29664C2C 81000DC7   ..)\....)fL,....
0010: 8847007D 80FD013B 91FD4500 008400C1   .G.}...;..E.....
0020: 0000FD2F 96640A0A 0A0A0909 09092000   .../.d........ .
0030: 88470000 FDE80233 113F6000 0000003C   .G.....3.?`....<
0040: 3A3FFD00 01920168 00040000 00000000   :?.....h........
0050: 0004FD00 01920168 00080000 00000000   .......h........
0060: 00088000 9CC821D6 00020203 04050607   ......!.........
0070: 08090A0B 0C0D0E0F 10111213 14151617   ................

The core carrier label stack is {2008 5049} in yellow, the IP/GRE header is 10.10.10.10 -> 9.9.9.9 in cyan, the customer VPN label is 9009 in green, and the IPv6 header is FD00:192:168:4::4 -> FD00:192:168:8::8 in pink. We won't verify the ICMP ping because we initiated it and it’s easily understood. We will verify the rest of the encapsulation as quickly as possible using the most succinct show commands, for variation. The core carrier imposes two labels for traffic heading towards the remote PE, and this traffic was generated by the source PE sending traffic to the dynamic endpoints and imposing the VPN label for the customer route allocated by the remote PE (VPNv6). The colors in the EPC capture match the highlights below.

CSR7#show ip cef vrf CH 9.9.9.9
9.9.9.9/32
  nexthop 33.2.7.2 GigabitEthernet2.527 label 2008 5049
CSR10#show ipv6 cef vrf VPN fd00:192:168:8::8
FD00:192:168:8::/64
  nexthop ::FFFF:9.9.9.9 Tunnel0 label 9009
CSR10#show tunnel endpoints tunnel 0 | section 9\.9
 Endpoint transport 9.9.9.9 Refcount 4 Base 0x7F57F6166E98 Create Time 01:50:44
   overlay ::FFFF:9.9.9.9 Refcount 2 Parent 0x7F57F6166E98 Create Time 01:50:44
   overlay 9.9.9.9 Refcount 2 Parent 0x7F57F6166E98 Create Time 01:50:44

Unfortunately, IPv6 multicast does not appear to work. There is an RPF logic problem. When the last-hop router (egress customer PE) receives the MLDv2 SSM join from the customer, it thinks the RPF interface is the L3VPN encapsulation tunnel and not the MDT tunnel. The remote end (CSR10) never receives the PIMv6 (S,G) join since the egress PE isn’t able to send it properly. Below, all RPF/C-MFIB related commands indicate that the L3VPNoGRE tunnel is the RPF interface (Tunnel0) while the PIM neighbors form over the default MDT (Tunnel1).

CSR6#show ipv6 rpf vrf VPN FD00:192:168:4::4
RPF information for FD00:192:168:4::4
  RPF interface: Tunnel0
  RPF neighbor: ::FFFF:10.10.10.10
  RPF route/mask: FD00:192:168:4::/64
  RPF type: Unicast
  RPF recursion count: 0
  Metric preference: 200
  Metric: 0
  Using Extranet RPF Rule: BGP Imported Route, RPF VRF:
CSR6#show ipv6 mroute vrf VPN FF33::4 FD00:192:168:4::4
[snip]
(FD00:192:168:4::4, FF33::4), 00:20:40/never, flags: sTI
  Incoming interface: Tunnel0
  RPF nbr: ::FFFF:10.10.10.10
  Immediate Outgoing interface list:
    GigabitEthernet2.516, Forward, 00:20:40/never
CSR6#show ipv6 pim vrf VPN neighbor
PIM Neighbor Table
Mode: B - Bidir Capable, G - GenID Capable
Neighbor Address       Interface   Uptime     Expires    Mode   DR pri
::FFFF:9.9.9.9         Tunnel1     00:10:00   00:01:38   B G    1
::FFFF:10.10.10.10     Tunnel1     00:13:02   00:01:38   B G    DR 1
CSR10#show ipv6 mroute vrf VPN
No mroute entries found.

Quickly checking IPv4, we see this is not a problem since the router knows that the RPF interface for multicast traffic within the VPN should be the MDT "emulated LAN". I believe the IPv6 support isn’t fully implemented yet, because we clearly saw IPv4 multicast working earlier. Since there isn’t an actual Tunnel0 or Tunnel1 interface, correcting this problem appears impossible at this time (no ability to enable PIM, adjust mroutes, etc).

CSR6#show ip rpf vrf VPN 192.168.4.4
RPF information for ? (192.168.4.4)
  RPF interface: Tunnel1
  RPF neighbor: ? (10.10.10.10)
  RPF route/mask: 192.168.4.0/24
  RPF type: unicast (bgp 65000)
  Doing distance-preferred lookups across tables
  BGP originator: 10.10.10.10
  RPF topology: ipv4 multicast base, originated from ipv4 unicast base
CSR6#show ip mroute vrf VPN 232.4.4.4 192.168.4.4
[snip]
(192.168.4.4, 232.4.4.4), 04:32:28/00:02:05, flags: sTI
  Incoming interface: Tunnel1, RPF nbr 10.10.10.10
  Outgoing interface list:
    GigabitEthernet2.516, Forward/Sparse, 04:32:28/00:02:05
CSR6#show ip pim vrf VPN neighbor
PIM Neighbor Table
Mode: B - Bidir Capable, DR - Designated Router, N - Default DR Priority,
      P - Proxy Capable, S - State Refresh Capable, G - GenID Capable,
      L - DR Load-balancing Capable
Neighbor          Interface   Uptime/Expires      Ver   DR
Address                                                 Prio/Mode
9.9.9.9           Tunnel1     03:01:36/00:01:43   v2    1 / S P G
10.10.10.10       Tunnel1     04:42:57/00:01:39   v2    1 / DR S P G

Additional Reading – Reference configurations "l3vpn-mgre"

36. Describe, implement, and troubleshoot IPv6 transition mechanisms

36.1 NAT44 and NAT444

NAT44 is synonymous with ordinary IPv4 NAT which has been around for many years. It is more accurately called NAT44 nowadays, implying an IPv4-to-IPv4 translation. With other mechanisms in existence, such as NAT64 (which superseded NAT-PT) and NAT66 (also called IPv6-to-IPv6 Network Prefix Translation or NPTv6), there needed to be a way to distinguish between them, preferably without inventing different names for each one. NAT444 just means doing NAT44 twice; this is common in carrier-grade NAT (CGN) or large scale NAT (LSN) architectures. LSN is the newer term which is more accurate, since CGN implies that NAT444 is only relevant in SP networks, which is false. This document uses the CGN and LSN terms interchangeably.

The lab is generally focused on NAT444 since NAT44 is a subcomponent of it, and having a separate NAT44 lab adds no value. In this setup, CSR1 through CSR7 serve as CPE routers for small businesses or homes. In common LSN architectures, these nodes will perform NAT44 from a private IP range on the inside (locally administered) to another private IP range on the outside (SP administered). The reason is that NAT overload gives us some ~65,000 connections per IP address using TCP/UDP port numbers as multiplexers, and assigning a real global address to a CPE router is viewed as wasteful. If each CPE received a public IPv4 address, its overload utilization would be much lower, which is highly inefficient compared to LSN.

The LSN nodes, CSR8 through CSR10, each have real public addresses. They perform NAT44 again from the SP administered private addressing on the inside to public addressing on the outside. XRv1 and XRv2 are BGP routers that interact with the IPv4/v6 Internet simulator, XRv3. XRv4 has several VRFs to represent customer hosts, like residential laptops, and is used for testing only.
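The address conservation argument can be made concrete with simple arithmetic. The Python sketch below is a rough estimate under stated assumptions, not a sizing method: the subscriber count and sessions-per-subscriber figures are invented for illustration, and the model treats the 16-bit port space as one flat pool of roughly 65,000 translations per public address.

```python
# 16-bit TCP/UDP port space: the "~65,000 connections per IP address".
PORTS_PER_ADDRESS = 2 ** 16


def public_addresses_needed(subscribers, sessions_per_subscriber,
                            ports_per_address=PORTS_PER_ADDRESS):
    """Estimate how many public addresses an LSN node needs with overload.

    Hypothetical helper: assumes every session consumes exactly one port
    on a shared public address and ignores per-protocol port reuse.
    """
    total_sessions = subscribers * sessions_per_subscriber
    # Ceiling division without importing math.
    return -(-total_sessions // ports_per_address)


# Assumed figures for illustration only: 1000 CPEs with 500 sessions each.
with_lsn = public_addresses_needed(1000, 500)
without_lsn = 1000  # one public address per CPE
print(with_lsn, without_lsn)
```

Under these assumed numbers, eight shared public addresses on the LSN node replace one thousand per-CPE public addresses, which is the inefficiency the paragraph above describes.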


Several different NAT44 variations are tested here since we are performing NAT44 on ten different nodes. A summary of the variations is below. CSR8 provides a direct Ethernet service to its CPEs, while CSR9 and CSR10 provide PPPoE variations for broadband aggregation. The CPE inside addressing is in the 192.168.0.0/16 range, the SP-administered private addressing is in the 10.0.0.0/8 range, and the public addresses are in the 209.19.85.8/30 range.

1. CSR1: ACL with NAT pool, no overload
2. CSR2: Route-map with NAT pool, reversible
3. CSR3: HSRP (active) and NAT
4. CSR4: HSRP (standby) and NAT
5. CSR5: Route-map per interface with PBR input, overload
6. CSR6: Static NAT
7. CSR7: ACL with NAT overload, VRF-aware
8. CSR8: Route-map with NAT overload, CGN disabled
9. CSR9: Route-map per interface, overload, CGN enabled
10. CSR10: Route-map with NAT overload, CGN enabled, VRF-aware

The IS-IS and BGP configurations are not shown in this lab as they are very basic and exist only to enable reachability from the Internet to the public post-NAT444 IP addresses on the LSN nodes. IS-IS is used between the LSN and Internet ASBRs to exchange LSN and Internet routes back and forth.

CSR1’s NAT configuration is very basic. NAT inside is enabled on the CPE LAN with NAT outside facing the SP. The NAT pool contains the outside addresses assigned by the SP. Assigning 100 addresses to a single CPE may seem wasteful, but since they are private addresses, it matters less.

! CSR1
interface GigabitEthernet2.514
 ip nat inside
interface GigabitEthernet2.518
 ip nat outside
ip nat pool NAT_POOL 10.1.8.100 10.1.8.199 prefix-length 24
ip nat inside source list ACL_NAT pool NAT_POOL
ip access-list standard ACL_NAT
 permit 192.168.1.0 0.0.0.255

We can verify the proper interface and pool configuration using the show commands below. We don’t care about the statistical values just yet, but only that the names are correct.

R1#show ip nat pool name NAT_POOL
NAT Pool Statistics
Pool name NAT_POOL, id 1
                     Assigned   Available
 Addresses           1          99
 UDP Low Ports       0          51200
 TCP Low Ports       0          51200
 UDP High Ports      0          6451100
 TCP High Ports      0          6451100
(Low ports are less than 1024. High ports are greater than or equal to 1024.)
R1#show ip nat statistics | section interface
Outside interfaces:
  GigabitEthernet2.518
Inside interfaces:
  GigabitEthernet2.514

When using ACL-based NAT without overloading, non-extended NAT entries are automatically created. That is, when a host initiates a protocol-specific session to a remote destination (ping, telnet, etc), that entry is accounted for with a specific port/protocol mapping. However, the inside local to inside global mapping is also stored in the NAT table, which allows unsolicited inbound traffic from the outside. XRv4 sends some traffic to CSR8, the LSN node. CSR1 performs NAT44 to change XRv4’s address to an available 10.1.8.0/24 address from the specified pool. Two entries are added to the table; one is fully extended (yellow), which contains the port/protocol information. The other is non-extended (green), which associates 10.1.8.100 with 192.168.1.14.

R1#show ip nat translations
Pro  Inside global      Inside local        Outside local    Outside global
---  10.1.8.100         192.168.1.14        ---              ---
icmp 10.1.8.100:37041   192.168.1.14:37041  10.1.8.8:37041   10.1.8.8:37041

Because of this non-extended (green) entry, CSR8 can reach into the CPE network and ping XRv4 using 10.1.8.100. The NAT order of operations states that packets arriving on an inside interface are subject to routing first, then NAT. Packets arriving on an outside interface are subject to NAT first, then routing. Therefore, packets sent to 10.1.8.100 are translated to destination 192.168.1.14 first, then routed, which means CSR1 is not replying to the pings (XRv4 is). We can prove this by performing an EPC capture on CSR1’s inside LAN. We see five ICMP packets sourced from 10.1.8.8 destined for 192.168.1.14, and the five replies coming back. It is clear that CSR8 initiated this ping flow and that XRv4 responded, which shows the outside-to-inside reachability. This is often undesirable but is the default behavior of using a NAT ACL with a NAT pool without overloading enabled.

R8#ping 10.1.8.100
Type escape sequence to abort.
Sending 5, 100-byte ICMP Echos to 10.1.8.100, timeout is 2 seconds:
!!!!!
Success rate is 100 percent (5/5), round-trip min/avg/max = 3/5/16 ms
R1#show monitor capture CAP buffer brief
 -------------------------------------------------------------
 #   size   timestamp     source             destination   protocol
 -------------------------------------------------------------
   0  118    0.000000   10.1.8.8         ->  192.168.1.14     ICMP
   1  118    0.001007   192.168.1.14     ->  10.1.8.8         ICMP
   2  118    0.001999   10.1.8.8         ->  192.168.1.14     ICMP
   3  118    0.003006   192.168.1.14     ->  10.1.8.8         ICMP
   4  118    0.003997   10.1.8.8         ->  192.168.1.14     ICMP
   5  118    0.005004   192.168.1.14     ->  10.1.8.8         ICMP
   6  118    0.005996   10.1.8.8         ->  192.168.1.14     ICMP
   7  118    0.005996   192.168.1.14     ->  10.1.8.8         ICMP
   8  118    0.007995   10.1.8.8         ->  192.168.1.14     ICMP
   9  118    0.016997   192.168.1.14     ->  10.1.8.8         ICMP

When using a route-map with a NAT pool, this outside-to-inside reachability is disabled by default. We will demonstrate this on CSR2. The configuration is very similar, using a slightly different NAT pool to account for a separate Ethernet link from CSR8. The NAT ACL is wrapped in a route-map, which is normally used for multi-homing, but can also be used to hide this “outside-in” functionality.

! CSR2
interface GigabitEthernet2.524
 ip nat inside
interface GigabitEthernet2.528
 ip nat outside
ip access-list standard ACL_NAT
 permit 192.168.2.0 0.0.0.255
route-map RM_NAT permit 10
 match ip address ACL_NAT
ip nat pool NAT_POOL 10.2.8.100 10.2.8.199 prefix-length 24
ip nat inside source route-map RM_NAT pool NAT_POOL

Sending traffic from XRv4 only generates a single fully-extended NAT translation entry, which prevents inside access from CSR8. However, CSR8 still gets ICMP echo-replies. Since outside-to-inside NAT did not occur (no matching entry), CSR2 locally responded to those pings.

R2#show ip nat translations
Pro  Inside global      Inside local        Outside local    Outside global
icmp 10.2.8.100:45233   192.168.2.14:45233  10.2.8.8:45233   10.2.8.8:45233
Total number of translations: 1
R8#ping 10.2.8.100
Type escape sequence to abort.
Sending 5, 100-byte ICMP Echos to 10.2.8.100, timeout is 2 seconds:
!!!!!
Success rate is 100 percent (5/5), round-trip min/avg/max = 4/4/6 ms

This ping response on CSR2 is the result of the dynamic IP alias created for the inside global address. On static NAT mappings, we can disable it, but not on dynamic ones. Packets arrived on CSR2’s outside interface and were subject to NAT before routing. There was no matching NAT entry for the unsolicited ICMP flow from CSR8, so CSR2 performs a routing lookup and sees the dynamic alias effectively as a connected route.

R2#show ip alias
Address Type     IP Address      Port
Interface        10.2.8.2
Dynamic          10.2.8.100
Interface        192.168.1.2
Interface        192.168.2.2

Using EPC, we can prove these pings are not reaching XRv4 and that CSR2 itself is responding. We see no packets sent to the inside LAN when CSR8 initiates the pings, which validates the statements above.

R2#show monitor capture CAP buffer brief
 -------------------------------------------------------------
 #   size   timestamp     source             destination   protocol
 -------------------------------------------------------------
[no output]
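The difference between CSR1 and CSR2 comes down to what the outside-to-inside NAT lookup finds before routing takes place. The Python sketch below models that decision in simplified form; it is not the IOS NAT algorithm. The table layout, the function name, and the ICMP identifier 7 used for CSR8's unsolicited ping are all assumptions made for illustration.

```python
def outside_to_inside(table, aliases, proto, src, dst):
    """Decide what happens to a packet arriving on the NAT outside interface.

    src and dst are (address, port_or_icmp_id) tuples. NAT is consulted
    first; routing (including local IP aliases) only comes afterwards.
    """
    dst_ip, dst_port = dst
    for entry in table:
        if entry["inside_global"][0] != dst_ip:
            continue
        if entry["proto"] is None:
            # Non-extended entry: matches any protocol and port.
            return ("translate", entry["inside_local"][0])
        if (entry["proto"] == proto
                and entry["inside_global"][1] == dst_port
                and entry["outside_global"] == src):
            # Fully extended entry: only the original flow matches.
            return ("translate", entry["inside_local"][0])
    if dst_ip in aliases:
        return ("answer locally", dst_ip)
    return ("route untranslated", dst_ip)


# CSR1: ACL plus pool creates both a non-extended and an extended entry.
CSR1_TABLE = [
    {"proto": None, "inside_global": ("10.1.8.100", None),
     "inside_local": ("192.168.1.14", None), "outside_global": None},
    {"proto": "icmp", "inside_global": ("10.1.8.100", 37041),
     "inside_local": ("192.168.1.14", 37041),
     "outside_global": ("10.1.8.8", 37041)},
]
# CSR2: route-map plus pool creates only the extended entry.
CSR2_TABLE = [
    {"proto": "icmp", "inside_global": ("10.2.8.100", 45233),
     "inside_local": ("192.168.2.14", 45233),
     "outside_global": ("10.2.8.8", 45233)},
]

# Unsolicited pings from CSR8 (ICMP identifier 7 is an arbitrary assumption).
csr1_result = outside_to_inside(CSR1_TABLE, {"10.1.8.100"}, "icmp",
                                ("10.1.8.8", 7), ("10.1.8.100", 7))
csr2_result = outside_to_inside(CSR2_TABLE, {"10.2.8.100"}, "icmp",
                                ("10.2.8.8", 7), ("10.2.8.100", 7))
print(csr1_result)
print(csr2_result)
```

CSR1 translates the ping towards 192.168.1.14 because the non-extended entry matches, while CSR2 finds no match and answers from its dynamic alias, which agrees with the two captures.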


For consistency with CSR1, we can explicitly enable outside-in connectivity when using route-maps by adding the “reversible” keyword. We enable this on CSR2 and retest. Now, XRv4’s flow has a second entry to account for the outside-originated traffic (the non-extended one).

! CSR2
ip nat inside source route-map RM_NAT pool NAT_POOL reversible

R2#show ip nat translations
Pro  Inside global      Inside local        Outside local    Outside global
---  10.2.8.100         192.168.2.14        ---              ---
icmp 10.2.8.100:41137   192.168.2.14:41137  10.2.8.8:41137   10.2.8.8:41137

Pinging from CSR8 now shows up in CSR2’s capture after modifying the NAT configuration. Since there is a non-extended NAT entry, NAT sees a match, performs a translation, then routing occurs. This behavior is operationally identical to CSR1.

R8#ping 10.2.8.100
Type escape sequence to abort.
Sending 5, 100-byte ICMP Echos to 10.2.8.100, timeout is 2 seconds:
!!!!!
Success rate is 100 percent (5/5), round-trip min/avg/max = 4/140/671 ms
R2#show monitor capture CAP buffer brief
 -------------------------------------------------------------
 #   size   timestamp     source             destination   protocol
 -------------------------------------------------------------
   0  118    0.000000   10.2.8.8         ->  192.168.2.14     ICMP
   1  118    0.001999   192.168.2.14     ->  10.2.8.8         ICMP
   2  118    0.840005   10.2.8.8         ->  192.168.2.14     ICMP
   3  118    0.843011   192.168.2.14     ->  10.2.8.8         ICMP
   4  118    0.850014   10.2.8.8         ->  192.168.2.14     ICMP
   5  118    0.852013   192.168.2.14     ->  10.2.8.8         ICMP
   6  118    0.854012   10.2.8.8         ->  192.168.2.14     ICMP
   7  118    0.856010   192.168.2.14     ->  10.2.8.8         ICMP
   8  118    0.858009   10.2.8.8         ->  192.168.2.14     ICMP
   9  118    0.860008   192.168.2.14     ->  10.2.8.8         ICMP

Next, we will configure CGN on CSR8. There is a specific “CGN” mode of NAT which is discussed later, but the term is misleading because there is nothing “carrier-grade” about it. It just modifies the NAT process slightly; we will evaluate this on CSR9 and CSR10 later. CSR8, however, will just do NAT44 again with the CSR1 and CSR2 subnets, 10.1.8.0/24 and 10.2.8.0/24, as inside addresses. These subnets were the outside NAT pools on the CPEs, but are viewed as inside addresses from the perspective of CSR8. The public address 209.19.85.8/32 is configured on a loopback so as not to waste public addressing on transit links with XRv2. CSR8 uses a route-map with overload, which means reversible NAT (outside-in) is not supported at all.

! CSR8
interface GigabitEthernet2.518
 ip nat inside
interface GigabitEthernet2.528
 ip nat inside
interface GigabitEthernet2.582
 ip nat outside
ip nat inside source route-map RM_NAT interface Loopback209 overload
ip access-list standard ACL_NAT
 permit 10.1.8.0 0.0.0.255
 permit 10.2.8.0 0.0.0.255
route-map RM_NAT permit 10
 match ip address ACL_NAT

We will perform a basic routing check on CSR8 to make sure it has Internet connectivity. The Internet consists of routes from AS 13 (13.0.0.0/8) while XRv1 and XRv2 summarize this into IS-IS from BGP. We confirm that CSR8 has the routing information and reachability via ping.

R8#show ip route 13.0.0.0 255.0.0.0
Routing entry for 13.0.0.0/8
  Known via "isis", distance 115, metric 30, type level-2
  Redistributing via isis 1112
  Last update from 10.8.12.12 on GigabitEthernet2.582, 00:27:51 ago
  Routing Descriptor Blocks:
  * 10.8.12.12, from 0.0.0.0, 00:27:51 ago, via GigabitEthernet2.582
      Route metric is 30, traffic share count is 1
R8#ping 13.144.2.9 source loopback209
Type escape sequence to abort.
Sending 5, 100-byte ICMP Echos to 13.144.2.9, timeout is 2 seconds:
Packet sent with a source address of 209.19.85.8
!!!!!
Success rate is 100 percent (5/5), round-trip min/avg/max = 2/147/724 ms

At this point, hosts behind CSR1 and CSR2 should have Internet reachability. Sending traffic from both LANs to the Internet, we check the NAT tables on CSR1, CSR2, and CSR8. Since CSR1 is using ACL-based NAT with pools, we see non-extended entries for outside-in traffic. Likewise on CSR2, using the route-map-based NAT pool technique with the “reversible” option, we see non-extended entries as well. CSR8 has two fully-extended entries that map the inside local addresses (inside global addresses from CSR1 and CSR2 perspectives) to the single public IP address using port overloading. Addresses are highlighted to show their position in the NAT tables. NAT444 derives its name from the original host being represented by three addresses, which implies NAT44 performed twice. In this case, 192.168.1.14 is also seen as 10.1.8.100 and 209.19.85.8. 192.168.2.14 is also seen as 10.2.8.100 and 209.19.85.8; the hierarchy of LSN allows better IPv4 address conservation.

R1#show ip nat translations
Pro  Inside global      Inside local        Outside local      Outside global
---  10.1.8.100         192.168.1.14        ---                ---
icmp 10.1.8.100:12465   192.168.1.14:12465  13.144.2.1:12465   13.144.2.1:12465

R2#show ip nat translations
Pro  Inside global      Inside local        Outside local      Outside global
---  10.2.8.100         192.168.2.14        ---                ---
icmp 10.2.8.100:16561   192.168.2.14:16561  13.144.2.1:16561   13.144.2.1:16561
Total number of translations: 2

R8#show ip nat translations
Pro  Inside global      Inside local        Outside local      Outside global
icmp 209.19.85.8:2      10.2.8.100:16561    13.144.2.1:16561   13.144.2.1:2
icmp 209.19.85.8:1      10.1.8.100:12465    13.144.2.1:12465   13.144.2.1:1
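The two translation stages can be modeled in a few lines of Python. This is only a conceptual sketch, not how IOS stores translations: the dictionary `cpe_nat` and the function `lsn_overload` are hypothetical names, and the sequential port allocation is an assumption that happens to mirror the CSR8 table.

```python
# First stage: one-to-one pool mappings on the CPE routers (CSR1 and CSR2).
cpe_nat = {
    "192.168.1.14": "10.1.8.100",
    "192.168.2.14": "10.2.8.100",
}


def lsn_overload(inside_flows, public_ip):
    """Second stage: map each (address, port) flow to a unique public port."""
    table = {}
    for port, flow in enumerate(inside_flows, start=1):
        table[flow] = (public_ip, port)
    return table


# ICMP identifiers taken from the CPE translation tables.
flows = [
    (cpe_nat["192.168.1.14"], 12465),
    (cpe_nat["192.168.2.14"], 16561),
]
lsn_nat = lsn_overload(flows, "209.19.85.8")

for flow, public in lsn_nat.items():
    print(flow, "->", public)
```

Each original host is therefore known by three addresses: its LAN address, the CPE pool address, and the shared public address plus a port.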

Next, we will configure CSR3 and CSR4 concurrently. The original intent of this setup was Stateful NAT (SNAT), but it does not appear to be supported on the CSR1000v. We can configure a SNAT-like setup anyway. The XRv4 VRF host is on a shared LAN with both routers; CSR3 and CSR4 run HSRPv2 to determine default gateway status for XRv4. Each one tracks the presence of the default route, which is negotiated with IPCP (via PPPoE). CSR3 is active while CSR4 is standby based on priorities. The configuration is identical on CSR3 and CSR4 with the exception of the HSRP priority and interface enumerations, so only CSR3 is shown. Only pertinent configurations are shown; both routers use basic ACL NAT with overload, which means outside-in traffic is not permitted. For PPPoE technical details, review the specific chapter on this topic.

! CSR3
interface GigabitEthernet2.534
 ip nat inside
 standby version 2
 standby 344 ip 192.168.34.254
 standby 344 priority 105
 standby 344 preempt
 standby 344 name SITE34
 standby 344 track 9 decrement 10
interface Dialer3
 ip address negotiated
 ip nat outside
 ppp ipcp route default
interface GigabitEthernet2.539
 pppoe-client dial-pool-number 3
track 9 ip route 0.0.0.0 0.0.0.0 reachability

1983 © 2016 Nicholas J. Russo

ip nat inside source list ACL_NAT interface Dialer3 overload
ip access-list standard ACL_NAT
 permit 192.168.34.0 0.0.0.255

In order for basic reachability to exist with the LSN node, we have to configure CSR9 also. CSR9 hands out IP addresses via IPCP using a local address pool. This keeps all of the PPPoE clients in the same “subnet”, but they cannot reach each other since PPP only installs host routes between connected endpoints. All of CSR9’s PPPoE clients are in different VLANs, so we can set very strict session limits for broadband aggregation. For now, we will not configure any NAT on CSR9 so that we can verify CSR3 and CSR4 in isolation.

! CSR9
bba-group pppoe PPPOE_LSN
 virtual-template 345
 sessions per-mac limit 1
 sessions per-vlan limit 1
interface Virtual-Template345
 peer default ip address pool PPPOE_POOL
interface GigabitEthernet2.539
 pppoe enable group PPPOE_LSN
interface GigabitEthernet2.549
 pppoe enable group PPPOE_LSN

After configuring this, CSR3 and CSR4 should establish PPPoE sessions with CSR9. They both install static default routes as supplied by IPCP, using the PPPoE server as a next-hop. For brevity, we will verify the client details on CSR3 only. Recursive routing lookups show that the PPPoE server IP address is known via a host route, also installed by PPP.

R3#show pppoe session
     1 client session

Uniq ID  PPPoE  RemMAC          Port          VT   VA      State
           SID  LocMAC                             VA-st   Type
    N/A     26  0050.56a9.d672  Gi2.539       Di3  Vi1     UP
                0050.56a9.8ccf                     UP

R3#show ip route 0.0.0.0
Routing entry for 0.0.0.0/0, supernet
  Known via "static", distance 1, metric 0, candidate default path
  Routing Descriptor Blocks:
  * 10.34.59.9
      Route metric is 0, traffic share count is 1

R3#show ip route 10.34.59.9
Routing entry for 10.34.59.9/32
  Known via "connected", distance 0, metric 0 (connected, via interface)
  Routing Descriptor Blocks:
  * directly connected, via Dialer3
      Route metric is 0, traffic share count is 1

CSR9 shows two PPPoE sessions, one to CSR3 and one to CSR4. Notice that the VLAN numbers help differentiate the connections. We can see that CSR9 has host routes to two local pool addresses, which are allocated to CSR3 and CSR4.

R9#show pppoe session
     3 sessions in LOCALLY_TERMINATED (PTA) State
     3 sessions total

Uniq ID  PPPoE  RemMAC          Port          VT   VA      State
           SID  LocMAC                             VA-st   Type
     26     26  0050.56a9.8ccf  Gi2.539       345  Vi1.1   PTA
                0050.56a9.d672  VLAN:3539          UP
     35     35  0050.56a9.2c57  Gi2.549       345  Vi1.2   PTA
                0050.56a9.d672  VLAN:3549          UP

R9#show ip route connected | include 59\.1[0-9]+
C        10.34.59.101/32 is directly connected, Virtual-Access1.1
C        10.34.59.110/32 is directly connected, Virtual-Access1.2

For clarity, we confirm that CSR3 is .101 and CSR4 is .110. We can also draw this conclusion by looking at the in-use pool addresses and mapping those to the virtual-access IDs shown in the PPPoE session table. Vi1.1 maps to CSR3 and Vi1.2 maps to CSR4.

R3#show ip interface brief dialer 3
Interface   IP-Address     OK? Method Status   Protocol
Dialer3     10.34.59.101   YES IPCP   up       up

R4#show ip interface brief dialer 4
Interface   IP-Address     OK? Method Status   Protocol
Dialer4     10.34.59.110   YES IPCP   up       up

R9#show ip local pool PPPOE_POOL | begin Inuse
Inuse addresses:
 10.34.59.101   Vi1.1
 10.34.59.110   Vi1.2

At this point, we will verify that HSRP has converged by checking both CSR3 and CSR4. CSR3 should be active while CSR4 is standby. Last, we double check our NAT configuration to ensure the interfaces are set up properly before testing.


R3#show standby brief
                     P indicates configured to preempt.
                     |
Interface   Grp  Pri P State    Active          Standby         Virtual IP
Gi2.534     344  105 P Active   local           192.168.34.4    192.168.34.254

R3#show ip nat statistics | section interface
Outside interfaces:
  Dialer3, Virtual-Access1
Inside interfaces:
  GigabitEthernet2.534

R4#show standby brief
                     P indicates configured to preempt.
                     |
Interface   Grp  Pri P State    Active          Standby         Virtual IP
Gi2.534     344  100 P Standby  192.168.34.3    local           192.168.34.254

R4#show ip nat statistics | section interface
Outside interfaces:
  Dialer4, Virtual-Access2
Inside interfaces:
  GigabitEthernet2.534

Now, we will send traffic from XRv4 to CSR9’s PPPoE server address. We expect to see NAT entries on CSR3 only (no SNAT supported) since it is the active HSRP gateway. Notice that there is only one fully-extended entry; whether using an ACL or route-map, outside-in traffic is non-reversible when overloading is enabled. Non-extended entries don’t exist since the NAT mappings happen at layer 4, not layer 3.

R3#show ip nat translations
Pro  Inside global      Inside local         Outside local      Outside global
icmp 10.34.59.101:1     192.168.34.14:24753  10.34.59.9:24753   10.34.59.9:1
Total number of translations: 1

R4#show ip nat translations
Total number of translations: 0

As a quick test of HSRP, we will break CSR3’s PPPoE connection by shutting down its dialer interface. CSR4 becomes the active HSRP gateway since CSR3 loses its default route and decrements its HSRP priority to less than CSR4’s value. We confirm this on CSR3 and CSR4. This proves that the default-route tracker works properly.

R3#show standby brief
                     P indicates configured to preempt.
                     |
Interface   Grp  Pri P State    Active          Standby         Virtual IP
Gi2.534     344  95  P Standby  192.168.34.4    local           192.168.34.254

R4#show standby brief
                     P indicates configured to preempt.
                     |
Interface   Grp  Pri P State    Active          Standby         Virtual IP
Gi2.534     344  100 P Active   local           192.168.34.3    192.168.34.254
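The priority arithmetic behind this failover can be sketched in Python. This is a simplified model under stated assumptions: the class `HsrpRouter` and function `active_router` are hypothetical, preemption is assumed on both routers (as configured), and IP-address tie-breaking is ignored.

```python
from dataclasses import dataclass


@dataclass
class HsrpRouter:
    name: str
    priority: int          # configured "standby 344 priority" value
    decrement: int = 10    # "standby 344 track 9 decrement 10"
    tracked_up: bool = True

    def effective_priority(self) -> int:
        # The decrement applies only while the tracked object is down.
        if self.tracked_up:
            return self.priority
        return self.priority - self.decrement


def active_router(routers):
    # With preempt enabled, the highest effective priority becomes active.
    return max(routers, key=lambda r: r.effective_priority()).name


csr3 = HsrpRouter("CSR3", priority=105)
csr4 = HsrpRouter("CSR4", priority=100)

before = active_router([csr3, csr4])
csr3.tracked_up = False   # default route lost after the dialer is shut down
after = active_router([csr3, csr4])

print(before, "->", after, "with CSR3 priority", csr3.effective_priority())
```

A decrement of 10 is enough here because 105 - 10 = 95 falls below CSR4's priority of 100; a decrement of 5 or less would not trigger the failover.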

When sending traffic to CSR9 from XRv4 again, CSR4 now handles the NAT. The NAT behavior is identical to CSR3 since both routers share a common NAT strategy.

R3#show ip nat translations
Total number of translations: 0

R4#show ip nat translations
Pro  Inside global      Inside local         Outside local      Outside global
icmp 10.34.59.110:1     192.168.34.14:28849  10.34.59.9:28849   10.34.59.9:1
Total number of translations: 1

Before continuing, we bring CSR3 back into the PPPoE network. CSR9 now allocates a new address of 10.34.59.103, which is relevant for NAT44 discussed next.

R3#show ip interface brief dialer 3
Interface   IP-Address     OK? Method Status   Protocol
Dialer3     10.34.59.103   YES IPCP   up       up

Next, we can configure NAT on CSR9. We will introduce a new feature: route-map multi-homing. Since CSR9 is connected to multiple BGP gateway routers, CEF will split traffic between the two assuming ECMP is in play. Using route-maps, we can reference the same inner ACL to match the inside local addresses (from the PPPoE local pool). CSR9 has two public IP addresses available: 209.19.85.9 and 209.19.85.11. Traffic routed to XRv1 (east) will use .9, while traffic routed to XRv2 (west) will use .11. CSR9 is not using policy-routing to force traffic in a given direction; this just specifies the post-NAT address, or packet source, based on the CEF load-sharing decision process. Since inside-to-outside NAT happens after routing, the outgoing interface has already been identified, implying that the NAT process can use this as match criteria in the route-map.

! CSR9
interface Virtual-Template345
 ip nat inside
interface GigabitEthernet2.591
 ip nat outside
interface GigabitEthernet2.592
 ip nat outside
ip access-list standard ACL_NAT
 permit 10.34.59.0 0.0.0.255
route-map RM_NAT_EAST permit 10
 match ip address ACL_NAT
 match interface GigabitEthernet2.591
route-map RM_NAT_WEST permit 10
 match ip address ACL_NAT
 match interface GigabitEthernet2.592
ip nat inside source route-map RM_NAT_EAST interface Loopback209 overload
ip nat inside source route-map RM_NAT_WEST interface Loopback211 overload

We will also introduce the concept of CGN. There are two modes of NAT supported on the CSR1000v: traditional and CGN. Traditional NAT stores the outside local/global addresses per flow, which identify the final destination an inside host is trying to reach and, conversely, the source address of returning packets. CGN does not track this at all; this conserves memory, which allows NAT to scale better. The outside local/global addresses are irrelevant since a CGN device can look at the destination address/port of a returning packet and map it to an existing NAT entry. Enabling CGN automatically removes support for NAT outside mappings, so only the first command is required. The biggest downside of CGN in an enterprise environment is that the entire “ip nat outside” syntax tree disappears. Because there are no outside mappings, this feature is unsupported. Outside mappings are not commonly used anyway, and certainly not needed for LSN.

! CSR9
ip nat settings mode cgn
no ip nat settings support mapping outside
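The memory argument can be illustrated with a small Python sketch. It is a deliberate simplification and not the IOS data structure: the function `build_table` is hypothetical, the destination ports are invented, and the third destination (13.144.2.17) is made up purely to show the effect.

```python
def build_table(flows, cgn):
    """Return the set of translation entries a router would keep.

    Each flow is (inside_local, inside_global, outside_global).
    Traditional mode keys on all three; CGN mode drops the outside part.
    """
    table = set()
    for inside_local, inside_global, outside_global in flows:
        if cgn:
            table.add((inside_local, inside_global))
        else:
            table.add((inside_local, inside_global, outside_global))
    return table


# One inside socket, already mapped to one public socket, talking to
# three different Internet destinations (illustrative values only).
inside_local = ("10.34.59.103", 3)
inside_global = ("209.19.85.9", 1)
flows = [
    (inside_local, inside_global, ("13.144.2.1", 80)),
    (inside_local, inside_global, ("13.144.2.9", 80)),
    (inside_local, inside_global, ("13.144.2.17", 80)),
]

traditional = build_table(flows, cgn=False)
cgn = build_table(flows, cgn=True)
print(len(traditional), "traditional entries vs", len(cgn), "CGN entry")
```

In this model the traditional table grows with every new destination, while the CGN table grows only with the number of inside sockets, which is why CGN mode scales better on an LSN node.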

We expect traffic to be translated from 10.34.59.103 (CSR3’s PPPoE client address) to either 209.19.85.9 or .11 depending on the CEF decision. To demonstrate both NAT policies, we will destine traffic to two different Internet destinations from the same source. CSR9’s FIB says that traffic to 13.144.2.1 will be routed east to XRv1, while traffic to 13.144.2.9 will be routed west to XRv2. Yellow (east) and green (west) are used throughout this test to differentiate flows.

R9#show ip cef exact-route 10.34.59.103 13.144.2.1
10.34.59.103 -> 13.144.2.1 =>IP adj out of GigabitEthernet2.591, addr 10.9.11.11

R9#show ip cef exact-route 10.34.59.103 13.144.2.9
10.34.59.103 -> 13.144.2.9 =>IP adj out of GigabitEthernet2.592, addr 10.9.12.12

XRv4 will send pings to both destinations at the same time. We already know that CSR3 is going to translate both flows to an outside address of 10.34.59.103 based on its IPCP address; we do not verify this again. On CSR9, we see two separate NAT entries with those final Internet destinations (outside global) not tracked at all. CGN simply does not care about those Internet destinations. The traffic to 13.144.2.1 is source-translated to 209.19.85.9, while traffic to 13.144.2.9 is source-translated to 209.19.85.11. This is expected per the NAT policies; since the route-maps used interfaces as match criteria, different post-NAT addresses were used based on the routing decisions. The NAT statistical refcounts show that there is one translation entry per rule. This is a quick way to verify there is some kind of load sharing happening.

R9#show ip nat translations
Pro  Inside global      Inside local      Outside local      Outside global
icmp 209.19.85.11:1     10.34.59.103:4    ---                ---
icmp 209.19.85.9:1      10.34.59.103:3    ---                ---
Total number of translations: 2

R9#show ip nat statistics | include ref
 [Id: 3] route-map RM_NAT_EAST interface Loopback209 refcount 1
 [Id: 4] route-map RM_NAT_WEST interface Loopback211 refcount 1

We will use EPC to verify this by capturing outbound on CSR9 using both upstream interfaces. We look at the destination MAC address to verify the directionality from CSR9’s perspective. The buffer brief shows the same information as the NAT table: a different post-NAT source address is used depending on where the traffic is destined. EPC happens after routing and NAT, which is why we see post-NAT sources.

R9#show monitor capture CAP buffer brief
 -------------------------------------------------------------
 #   size   timestamp     source             destination   protocol
 -------------------------------------------------------------
   2  118    2.249987   209.19.85.9      ->  13.144.2.1       ICMP
   3  118    2.252977   209.19.85.9      ->  13.144.2.1       ICMP
   4  118    2.254976   209.19.85.9      ->  13.144.2.1       ICMP
   5  118    2.257982   209.19.85.9      ->  13.144.2.1       ICMP
   6  118    2.267976   209.19.85.9      ->  13.144.2.1       ICMP
   7  118    5.979996   209.19.85.11     ->  13.144.2.9       ICMP
   8  118    5.982987   209.19.85.11     ->  13.144.2.9       ICMP
   9  118    5.985993   209.19.85.11     ->  13.144.2.9       ICMP
  10  118    5.988999   209.19.85.11     ->  13.144.2.9       ICMP
  11  118    5.998001   209.19.85.11     ->  13.144.2.9       ICMP

The ARP entries are shown below for verifying the destination MAC addresses of XRv1 and XRv2 in the packet captures. We can confirm the statements above by verifying the MAC addresses in addition to the IP addresses. This is just to confirm that the packets are being sent out of the proper interfaces; only one of each packet is shown. The same two packets (#6 and #7) are highlighted again below.

R9#show ip arp 10.9.11.11
Protocol  Address      Age (min)  Hardware Addr   Type  Interface
Internet  10.9.11.11           3  0050.56a9.9c60  ARPA  GigabitEth2.591

R9#show ip arp 10.9.12.12
Protocol  Address      Age (min)  Hardware Addr   Type  Interface
Internet  10.9.12.12         190  0050.56a9.0e6f  ARPA  GigabitEth2.592

R9#show monitor capture CAP buffer detail
 -------------------------------------------------------------
 #   size   timestamp     source             destination   protocol
 -------------------------------------------------------------
   6  118    2.267976   209.19.85.9      ->  13.144.2.1       ICMP
  0000:  005056A9 9C600050 56A9D672 81000E07   .PV..`.PV..r....
  0010:  08004500 00640004 0000FD01 87E7D113   ..E..d..........
  0020:  55090D90 02010800 CF0E0001 0004ABCD   U...............
  0030:  ABCDABCD ABCDABCD ABCDABCD ABCDABCD   ................
   7  118    5.979996   209.19.85.11     ->  13.144.2.9       ICMP
  0000:  005056A9 0E6F0050 56A9D672 81000E08   .PV..o.PV..r....
  0010:  08004500 00640000 0000FD01 87E1D113   ..E..d..........
  0020:  550B0D90 02090800 CF120001 0000ABCD   U...............
  0030:  ABCDABCD ABCDABCD ABCDABCD ABCDABCD   ................
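For readers who want to check the hex dump by hand, the following Python sketch decodes the first 38 bytes of packet #6 (Ethernet header, 802.1Q tag, and IPv4 header). The helper `cisco_mac` is a hypothetical name; the byte offsets assume a single VLAN tag and a 20-byte IPv4 header, which matches this capture.

```python
import ipaddress
import struct

# First 38 bytes of packet #6, copied from the capture above.
frame = bytes.fromhex(
    "005056A99C60"                        # destination MAC
    "005056A9D672"                        # source MAC
    "81000E07"                            # 802.1Q TPID and TCI
    "0800"                                # EtherType: IPv4
    "4500006400040000FD0187E7D1135509"    # IPv4 header through source address
    "0D900201"                            # IPv4 destination address
)


def cisco_mac(raw: bytes) -> str:
    """Format six raw bytes in Cisco dotted notation (xxxx.xxxx.xxxx)."""
    digits = raw.hex()
    return ".".join(digits[i:i + 4] for i in range(0, 12, 4))


dst_mac = cisco_mac(frame[0:6])
src_mac = cisco_mac(frame[6:12])
tpid, tci, ethertype = struct.unpack("!HHH", frame[12:18])
vlan_id = tci & 0x0FFF                    # low 12 bits of the TCI
ttl, protocol = frame[26], frame[27]      # IPv4 header starts at offset 18
src_ip = ipaddress.IPv4Address(frame[30:34])
dst_ip = ipaddress.IPv4Address(frame[34:38])

print(dst_mac, vlan_id, src_ip, "->", dst_ip, "proto", protocol)
```

The decoded destination MAC matches the ARP entry for 10.9.11.11 (XRv1), and the source address is the east post-NAT address 209.19.85.9, consistent with the NAT table.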

Next, we will configure CSR6 and CSR7 for NAT44 operations. They will be connecting to CSR10 via PPPoE as well, but will use static addressing along with default static routes in lieu of IPCP. All of the PPPoE clients will share a VLAN as well. CSR6 will use static one-to-one NAT for its single LAN client; notice that it is using a new inside global address of 10.6.10.14. This address doesn’t fit nicely into any subnet anywhere, and it doesn’t have to. CSR7 will use the ACL NAT with overload, much like CSR3 and CSR4. The difference is that CSR7 will do this inside of a VRF just for variety, but the configuration is very similar.

! CSR6
interface GigabitEthernet2.564
 ip nat inside
interface Dialer6
 ip nat outside
ip nat inside source static 192.168.6.14 10.6.10.14
ip route 0.0.0.0 0.0.0.0 10.56.70.10 10

! CSR7
interface GigabitEthernet2.574
 vrf forwarding CPE
 ip nat inside
interface Dialer7
 vrf forwarding CPE
 ip nat outside
ip nat inside source list ACL_NAT interface Dialer7 vrf CPE overload
ip access-list standard ACL_NAT
 permit 192.168.7.0 0.0.0.255
ip route vrf CPE 0.0.0.0 0.0.0.0 10.56.70.10 10

In order for the PPPoE sessions to come up, we have to configure CSR10. Like CSR9, we will ignore the CGN configuration on CSR10 for now. CSR10 will also be performing VRF operations so we will configure that now as well. Notice that the per-VLAN PPPoE session limit is 3 because all three clients share a LAN, but the per-MAC limit is still 1. One key point is that CSR10 needs a route back to CSR6’s inside global address representing XRv4, since it was a made-up address.

! CSR10
interface Virtual-Template567
 vrf forwarding LSN
bba-group pppoe PPPOE_LSN
 virtual-template 567
 sessions per-mac limit 1
 sessions per-vlan limit 3
interface GigabitEthernet2.556
 pppoe enable group PPPOE_LSN
ip route vrf LSN 10.6.10.0 255.255.255.0 10.56.70.6

We quickly verify the PPPoE sessions are up on CSR10, then perform a quick NAT test from XRv4 to the PPPoE server address via CSR6 and CSR7. Notice that both PPPoE clients are in the same VLAN as expected. CSR10 is not performing NAT currently as we are verifying CSR6 and CSR7 first. Because CSR6 is using static NAT with a non-extended entry, traffic would be allowed in from CSR10. We won’t test this again since we tested it with CSR1 and CSR2 earlier, but the NAT table clearly shows the non-extended (permanent) entry. Both CSR6 and CSR7 show the same outside global address of CSR10’s PPPoE server address. The inside local from each represents the XRv4 VRF that is used for testing, and the inside global is the post-NAT address.

R10#show pppoe session
     2 sessions in LOCALLY_TERMINATED (PTA) State
     2 sessions total

Uniq ID  PPPoE  RemMAC          Port          VT   VA      State
           SID  LocMAC                             VA-st   Type
      8      8  0050.56a9.ea77  Gi2.556       567  Vi2.2   PTA
                0050.56a9.f961  VLAN:3556          UP
      9      9  0050.56a9.de0d  Gi2.556       567  Vi2.3   PTA
                0050.56a9.f961  VLAN:3556          UP

R6#show ip nat translations
Pro  Inside global      Inside local        Outside local      Outside global
---  10.6.10.14         192.168.6.14        ---                ---
icmp 10.6.10.14:37041   192.168.6.14:37041  10.56.70.10:37041  10.56.70.10:37041

R7#show ip nat translations
Pro  Inside global      Inside local        Outside local      Outside global
icmp 10.56.70.7:1       192.168.7.14:41137  10.56.70.10:41137  10.56.70.10:1

Next, we will configure CGN on CSR10. Using CGN mode, we disable all outside local/global address tracking. We will use an overloaded route-map VRF-aware configuration on CSR10 for variety. The behavior is operationally equivalent to ACL-based overload. The NAT ACL must account for the bogus inside global addresses offered by CSR6 as well as the general PPPoE client subnet used by CSR7.

! CSR10
interface GigabitEthernet2.501
 vrf forwarding LSN
 ip nat outside
interface Virtual-Template567
 ip nat inside
ip access-list standard ACL_NAT
 permit 10.56.70.0 0.0.0.255
 permit 10.6.10.0 0.0.0.255
route-map RM_NAT permit 10
 match ip address ACL_NAT
ip nat settings mode cgn
no ip nat settings support mapping outside
ip nat inside source route-map RM_NAT interface Loopback209 vrf LSN overload

Testing connectivity from the XRv4 clients all the way to the Internet, we can see two NAT entries on CSR10. Since CSR10 is single-homed to XRv1, there is no need for interface matching in the route-map as demonstrated on CSR9. All post-NAT addresses (inside global) use 209.19.85.10, regardless of inside local or FIB decision.

R10#show ip nat translations
Pro  Inside global      Inside local       Outside local      Outside global
icmp 209.19.85.10:2     10.56.70.7:1       ---                ---
icmp 209.19.85.10:1     10.6.10.14:53425   ---                ---
Total number of translations: 2

Last, we will configure CSR5. It is multi-homed with two PPPoE connections, one to CSR9 and one to CSR10. It conforms to the address policies of each; that is, it receives an IPCP address from CSR9, while using static addressing towards CSR10. No default routes exist on CSR5 since we will use PBR in conjunction with NAT to influence traffic patterns on a per-source basis. This cannot be achieved without PBR since we are trying to achieve source-based routing. Specifically, because inside-to-outside NAT happens after routing, the NAT process relies on the routing decision to tell it the outgoing interface. Using PBR, we can set this outgoing interface or next-hop, then match it in a NAT route-map to select the inside global address. In this case, sources in 192.168.5.0/25 (bottom half) will use Dialer59’s dynamic PPPoE pool address and be sent to CSR9. The addresses within 192.168.5.128/25 (top half) will use Dialer50’s static address and be sent to CSR10. The NAT route-maps don’t need to match source addresses because in this case, we know CSR5 has no routes, and can only route via PBR. Since PBR is only matching the sources we care about, we can omit an ACL from our NAT route-maps, despite it being a dangerous practice.

! CSR5
interface Dialer50
 ip nat outside
interface Dialer59
 ip nat outside
ip access-list standard ACL_EAST_TOP
 permit 192.168.5.128 0.0.0.127
ip access-list standard ACL_WEST_BOTTOM
 permit 192.168.5.0 0.0.0.127
route-map RM_PBR permit 10
 match ip address ACL_WEST_BOTTOM
 set interface Dialer59
route-map RM_PBR permit 20
 match ip address ACL_EAST_TOP
 set interface Dialer50
route-map RM_EAST permit 10
 match interface Dialer50
route-map RM_WEST permit 10
 match interface Dialer59
interface GigabitEthernet2.554
 ip nat inside
 ip policy route-map RM_PBR
ip nat inside source route-map RM_EAST interface Dialer50 overload
ip nat inside source route-map RM_WEST interface Dialer59 overload
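The source-based selection can be traced with a short Python sketch. It models only the wildcard-mask comparison and the order of the RM_PBR sequences; the names `wildcard_match`, `PBR_POLICY`, and `pbr_interface` are hypothetical, and this is not a model of the full IOS forwarding path.

```python
import ipaddress


def wildcard_match(addr: str, base: str, wildcard: str) -> bool:
    """Compare only the bits where the wildcard mask is 0, as a standard ACL does."""
    a = int(ipaddress.IPv4Address(addr))
    b = int(ipaddress.IPv4Address(base))
    care = ~int(ipaddress.IPv4Address(wildcard)) & 0xFFFFFFFF
    return (a & care) == (b & care)


# (sequence, ACL base, wildcard, "set interface") in RM_PBR order.
PBR_POLICY = [
    (10, "192.168.5.0", "0.0.0.127", "Dialer59"),    # ACL_WEST_BOTTOM
    (20, "192.168.5.128", "0.0.0.127", "Dialer50"),  # ACL_EAST_TOP
]


def pbr_interface(src: str):
    for _seq, base, wildcard, ifname in PBR_POLICY:
        if wildcard_match(src, base, wildcard):
            return ifname
    # No PBR match: CSR5 has no default route, so nothing forwards the packet.
    return None


for src in ("192.168.5.14", "192.168.5.214", "192.168.6.1"):
    print(src, "->", pbr_interface(src))
```

Because the NAT route-maps match only on the outgoing interface, the interface chosen here is also what decides the inside global address.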

When XRv4 sends traffic from 192.168.5.14 and 192.168.5.214, we see a separate inside global address for each. The east entry is shown in yellow and the west entry in green. PBR directed traffic from .14 to use CSR9 as the next-hop and .214 to use CSR10 as the next-hop. NAT honored those routing decisions by selecting the proper inside global (post-NAT) address. The PBR packet counters are useful to ensure the policies are configured properly.

R5#show ip nat translations
Pro  Inside global      Inside local         Outside local      Outside global
icmp 10.34.59.105:1     192.168.5.14:53425   13.144.2.1:53425   13.144.2.1:1
icmp 10.56.70.5:1       192.168.5.214:49329  13.144.2.1:49329   13.144.2.1:1
Total number of translations: 2

R5#show route-map RM_PBR
route-map RM_PBR, permit, sequence 10
  Match clauses:
    ip address (access-lists): ACL_WEST_BOTTOM
  Set clauses:
    interface Dialer59
  Policy routing matches: 5 packets, 590 bytes
route-map RM_PBR, permit, sequence 20
  Match clauses:
    ip address (access-lists): ACL_EAST_TOP
  Set clauses:
    interface Dialer50
  Policy routing matches: 5 packets, 590 bytes

We can confirm that the packets were routed out of the proper interfaces by checking CSR9 and CSR10. Each one should only have one NAT entry; CSR10 has the east entry and CSR9 has the west entry. Notice that the inside local on the CGN router is the same as the inside global on the CPE router. As expected, CSR9 selects one of its global addresses based on its ECMP scheme, while CSR10 uses its only global address. Again, neither CGN router stores outside local/global address information.

R9#show ip nat translations
Pro  Inside global      Inside local      Outside local      Outside global
icmp 209.19.85.9:1      10.34.59.105:1    ---                ---
Total number of translations: 1

R10#show ip nat translations
Pro  Inside global      Inside local      Outside local      Outside global
icmp 209.19.85.10:1     10.56.70.5:1      ---                ---
Total number of translations: 1

NAT also supports port forwarding for specific outside to inside communications. This might be valuable for VPN/remote access from the outside. Although unrelated to CGN/LSN, it is a common NAT technique in general. We will quickly demonstrate this on CSR1. This will allow outside hosts, such as CSR8, to access internal resources on specific TCP/UDP ports. We can specify the outside interface or an IP address. If the inside global address is not already present on the router (such as 10.1.8.22), the router automatically creates a dynamic alias for it. In this example, we allow outside routers to access XRv4 via telnet on port 2300 when using CSR1’s outside address. Alternatively, they can use SSH on port 2200 using a new address 10.1.8.22.

! CSR1
ip nat inside source static tcp 192.168.1.14 23 interface Gig2.518 2300
ip nat inside source static tcp 192.168.1.14 22 10.1.8.22 2200 extendable

R1#show ip alias
Address Type   IP Address     Port
Interface      10.1.8.1
Dynamic        10.1.8.22
Interface      192.168.1.1
Interface      192.168.1.2

Testing this from CSR8, it can access XRv4 via telnet and SSH. This requires that XRv4 be configured as a telnet and SSH server (not shown).

R8#telnet 10.1.8.1 2300
Trying 10.1.8.1, 2300 ... Open

R8#ssh -l cisco -p 2200 10.1.8.22
IMPORTANT: READ CAREFULLY
[snip]

CSR1 shows the non-extended entries, which we must have to allow the outside unsolicited traffic. The fully-extended entries represent actual NAT sessions currently in place: telnet is highlighted green and SSH is highlighted yellow. Specific sessions spawn these fully-extended entries on a per-session basis. The outside local/global address will always be the initiator of the port-forwarded session, which is CSR8 in this case.

R1#show ip nat translations tcp
Pro  Inside global      Inside local       Outside local      Outside global
tcp  10.1.8.1:2300      192.168.1.14:23    ---                ---
tcp  10.1.8.22:2200     192.168.1.14:22    ---                ---
tcp  10.1.8.22:2200     192.168.1.14:22    10.1.8.8:39246     10.1.8.8:39246
tcp  10.1.8.1:2300      192.168.1.14:23    10.1.8.8:33914     10.1.8.8:33914

Although not significant to NAT44, IPv6 can run alongside this architecture assuming all devices are dual-stack. Only IPv4 is subject to NAT44, so IPv6 reachability from XRv4’s VRFs to the Internet is not impacted. All transports, such as Ethernet, PPPoE, etc., can carry IPv6 and NAT44 will simply ignore it. The IPv6 routing in this design is very sloppy to create additional complexity (more elegant ways exist to issue IPv6 prefixes over PPPoE and are shown in the PPPoE section). The routing between CSR8 and its CPEs (CSR1 and CSR2) is static, which is redistributed into IS-IS and later BGP. CSR9 uses OSPFv3 to learn the downstream IPv6 prefixes from CPE devices. The PPPoE IPv6 client does not appear to work at all inside a VRF, so CSR7 has no IPv6 connectivity. The NAT64 section details the translation mechanisms for interworking between IPv4 and IPv6 to support LSN architectures. The point is that NAT44 is not in play when native IPv6 hosts are trying to reach the IPv6 Internet.

Additional Reading – Reference configurations "lsn-444"

36.2 NAT64 and NAT464

This test continues from the previous topology with some modifications. From an LSN perspective, a common use case of NAT64 is to deploy a NAT464 architecture. Like NAT444, NAT still occurs twice, but NAT464 introduces NAT64 at the CPE and the LSN. The original IPv4 packet is translated to IPv6 by the CPE, transiting the PE-CE link as IPv6. This can help bypass any firewall rules for east-west traffic that might be filtered by private IPv4 addressing. In the examples seen in this lab, notice that IPv4 has been completely removed between the CPE and LSN nodes as it is no longer needed. The LSN translates the IPv6 traffic from the CPE back to IPv4 so it can be routed to the IPv4 Internet. Native IPv6 traffic, like before, is not subject to NAT64 and would be routed normally to the IPv6 Internet. The network diagram is below; the same architecture is used for NAT464 as was used for NAT444.

NAT64 comes in two variants: stateful and stateless. These are quickly described below and are demonstrated in great detail later.

1. Stateful: Like traditional NAT, this feature maintains a state table of all translations. This includes original and translated IPv4 and IPv6 addresses. Classic NAT features such as port-overloading, ACL-based matching, and static NAT are supported. Because overload is supported, address conservation is a good use case.
2. Stateless: This feature maps IPv4 addresses directly into IPv6 addresses and does so without maintaining any state in the router. IPv4 sources and destinations can be configured to share a common prefix or use different ones. This is much less computationally expensive than stateful NAT. There is no conservation of addressing since there is no concept of “overloading”.

On CSR1, where NAT64 will transform IPv4 into IPv6 for traffic heading towards the IPv4 Internet, we will configure stateless NAT64. This is a good option for CPE routers as it is less computationally expensive and less memory hungry. Aside from enabling NAT64 on the interfaces (IPv6 must be enabled on both the v4 and v6 facing links), there are three key commands. First, we must define a NAT64 prefix. Both the IPv4 source and destination will be mapped to this prefix, and the prefix can vary in length. I find it most straightforward to use a /96 as the IPv4 address will always be encoded into the lowest-order 32 bits. The NAT64 “route” command adds a static route to the IPv4 RIB that directs traffic towards the NAT virtual interface (NVI). This forces traffic towards 13.0.0.0/8 to be NAT64’ed rather than routed normally, which is the same thing that happens for 3001::/96 when the NAT64 prefix is defined. A default route cannot be used for the NAT64 IPv4 route, so we must be selective; this is not totally realistic as the CPE should have a default route to cover all IPv4 prefixes on the Internet. The final IPv6 route is critical from a routing perspective. Since the destination and source both fit into 3001::/96, the router needs to have a more specific route to the destination prefixes. There is a 3001::/96 in the IPv6 RIB directing IPv6 traffic to the NVI, so the router needs a longer match to direct traffic to CSR8. Since the IPv4 destination always begins with 13, I create a route that matches this first octet, and is therefore 8 bits longer than the existing /96.

! CSR1
interface GigabitEthernet2.514
 nat64 enable
 ipv6 address FE80::1 link-local
 ipv6 nd ra suppress all
interface GigabitEthernet2.518
 nat64 enable
 ipv6 address FE80::1 link-local
 ipv6 nd ra suppress all
nat64 prefix stateless 3001::/96
nat64 route 13.0.0.0/8 GigabitEthernet2.518
ipv6 route 3001::D00:0/104 GigabitEthernet2.518 FE80::8
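The /96 encoding and the derivation of the /104 static route can be checked with Python's standard `ipaddress` module. This is a minimal sketch that handles only /96 prefixes; the function names `nat64_embed_96` and `first_octet_route` are hypothetical and are not part of any Cisco tooling.

```python
import ipaddress


def nat64_embed_96(prefix: str, ipv4: str) -> ipaddress.IPv6Address:
    """Place the IPv4 address in the low-order 32 bits of a /96 NAT64 prefix."""
    net = ipaddress.IPv6Network(prefix)
    if net.prefixlen != 96:
        raise ValueError("this sketch only handles /96 prefixes")
    v4 = int(ipaddress.IPv4Address(ipv4))
    return ipaddress.IPv6Address(int(net.network_address) | v4)


def first_octet_route(prefix: str, octet: int) -> ipaddress.IPv6Network:
    """Build the /104 that covers every IPv4 destination starting with `octet`."""
    net = ipaddress.IPv6Network(prefix)
    return ipaddress.IPv6Network((int(net.network_address) | (octet << 24), 104))


src6 = nat64_embed_96("3001::/96", "192.168.1.14")   # XRv4 host
dst6 = nat64_embed_96("3001::/96", "13.144.2.1")     # Internet destination
route = first_octet_route("3001::/96", 13)

print(src6, dst6, route)
print(dst6 in route, src6 in route)
```

The translated destination falls inside the /104 while the translated source does not, which is why the longer static route sends only Internet-bound traffic to CSR8 and leaves everything else on the /96 towards the NVI.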

First, we will verify the NAT64 stateless prefix. If the prefix was invalid or undefined, this output would reveal that issue. Both the v4 and v6 interfaces are using this prefix since we did not define interface-specific prefixes for this test.

R1#show nat64 prefix stateless global
Global Stateless Prefix: is valid, 3001::/96
IFs Using Global Prefix
   Gi2.514
   Gi2.518

Next, we verify that the NAT64 route is correct. It is mapped to the global prefix, which means traffic towards 13.0.0.0/8 will have its IPv4 source/destination encoded into 3001::/96 when it is sent into the IPv6 network.

R1#show nat64 routes prefix 13.0.0.0/8
IPv4 Prefix     Adj. Address   Enabled   Output IF   Global   IPv6 Prefix
13.0.0.0/8      0.0.0.5        TRUE      Gi2.518     TRUE     3001::/96

R1#show ip route 13.0.0.0 255.0.0.0
Routing entry for 13.0.0.0/8
  Known via "static", distance 1, metric 0
  Routing Descriptor Blocks:
  * 0.0.0.5, via NVI0
      Route metric is 0, traffic share count is 1

Last, we verify the IPv6 route. Because all post-NAT64 traffic will have an IPv6 destination beginning with 3001::/96, followed by 8 bits representing the first IPv4 octet, we crafted a longer-match /104 route. When the router tries to send traffic towards this final destination, it sends it to CSR8’s LL address. Without this manual route, traffic would loop (in software) back into the NVI by following the /96 and be dropped.

R1#show ipv6 route 3001::/96 longer-prefixes | begin Appl
       a - Application
S   3001::/96 [1/0]
     via ::42, NVI0
S   3001::D00:0/104 [1/0]
     via FE80::8, GigabitEthernet2.518

From XRv4, I will send some large ping packets so they are easy to see. Using EPC on CSR1, we can confirm that NAT64 is working correctly. Since this is a stateless translation, there is no translation table associated, so real-time verification is preferred. CSR8 has not been configured for NAT64 yet, so we do not expect the ping to succeed, but we can verify that CSR1 is translating the packets properly. Capturing inbound IPv4 and outbound IPv6 on CSR1, we see the same packet twice. First, we see the original IPv4 source/destination as originated by XRv4. After NAT64 occurs, those same 32-bit addresses are encoded inside the IPv6 prefix 3001::/96 as shown in the outgoing IPv6 packet. Sources are in yellow and destinations are in green. Notice that, since a single global prefix was defined, both the source and destination addresses are contained within the same prefix. This can complicate routing significantly in large networks by making aggregation difficult.

R1#show monitor capture CAP buffer detail
   0  518    0.000000   192.168.1.14     ->  13.144.2.1       ICMP
  0000:  005056A9 1AAA0050 56A9862A 81000DBA   .PV....PV..*....
  0010:  08004500 01F40000 0000FF01 E8C1C0A8   ..E.............
  0020:  010E0D90 02010800 25C9709C 0000ABCD   ........%.p.....
  0030:  ABCDABCD ABCDABCD ABCDABCD ABCDABCD   ................
   1  546    0.000000   3001:*:010E      ->  3001:*:0201      IPv6-Frag
  0000:  005056A9 FB1C0050 56A91AAA 81000DBE   .PV....PV.......
  0010:  86DD6000 000001E8 2CFE3001 00000000   ..`.....,.0.....
  0020:  00000000 0000C0A8 010E3001 00000000   ..........0.....
  0030:  00000000 00000D90 02013A00 00000000   ..........:.....

NAT64 keeps track of packet counters as well, which is a quick way to verify NAT64 operation without EPC. The number says 13 only because I sent several other pings while testing. This output indicates a problem because packets are being translated v4 to v6, but not vice versa, which indicates a lack of bidirectional reachability. This is expected since the rest of the network has not been configured. R1#show nat64 statistics prefix stateless 3001::/96 NAT64 Statistics 3001::/96: Packets translated (IPv4 -> IPv6) Stateless: 13 Stateful: 0 MAP-T: 0 Packets translated (IPv6 -> IPv4) Stateless: 0 Stateful: 0 MAP-T: 0 Packets dropped: 0

Before configuring CSR8, we will configure CSR2 with stateless NAT64 using a variation of the method above. The interfaces are almost identical except CSR2 uses an interface-level prefix of length /32. This would allow a multi-homed CPE (like CSR5) to use different NAT64 prefixes for different ISPs if necessary. When a NAT64 prefix length other than /96 is used, two things change. First, bits 64 to 71, inclusive (8 bits), are fixed to 0. These are called the universal-bits (u-bits); RFC 6052 reserves this octet for compatibility with the IPv6 host identifier format and requires it to be zero. Second, the IPv4 address is encoded immediately following the prefix, which puts it at bits 32 to 63, inclusive, for a /32. Unlike the /96, the IPv4 address is in the middle of the IPv6 address, but still technically following the network prefix as it did for the /96. These fields are discussed more when the packet capture is examined, but the immediate relevance is building the IPv6 static route correctly. Without it, NAT64 cannot route traffic upstream. By specifying a v6v4 prefix at the link level and a v4v6 prefix globally, we can effectively use different IPv6 source/destination addresses for our post-NAT traffic. This resolves the network scalability issues related to aggregation we observed on CSR1.
! CSR2
interface GigabitEthernet2.524
 nat64 enable
 ipv6 address FE80::2 link-local
 ipv6 nd ra suppress all
interface GigabitEthernet2.528
 nat64 enable
 nat64 prefix stateless v6v4 3002::/32
 ipv6 address FE80::2 link-local
 ipv6 nd ra suppress all
nat64 prefix stateless v4v6 2222::/32
nat64 route 13.0.0.0/8 GigabitEthernet2.528
ipv6 route 3002:0:D00::/40 GigabitEthernet2.528 FE80::8
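To make the address layout concrete, here is a minimal Python sketch of the IPv4-in-IPv6 embedding for the prefix lengths that RFC 6052 allows. The function name embed_ipv4 is hypothetical and the code only illustrates the bit positions; it is not a description of how IOS XE implements the translation.

```python
import ipaddress

def embed_ipv4(prefix: str, ipv4: str) -> ipaddress.IPv6Address:
    """Embed an IPv4 address in a NAT64 prefix, skipping the u-bits octet."""
    net = ipaddress.IPv6Network(prefix)
    if net.prefixlen not in (32, 40, 48, 56, 64, 96):
        raise ValueError("unsupported NAT64 prefix length")
    v4 = ipaddress.IPv4Address(ipv4).packed
    out = bytearray(net.network_address.packed)
    pos = net.prefixlen // 8          # first byte after the prefix
    for octet in v4:
        if pos == 8:                  # byte 8 holds bits 64-71 (u-bits), always 0
            pos += 1
        out[pos] = octet
        pos += 1
    return ipaddress.IPv6Address(bytes(out))

print(embed_ipv4("2222::/32", "192.168.2.14"))  # 2222:0:c0a8:20e::
print(embed_ipv4("3002::/32", "13.144.2.1"))    # 3002:0:d90:201::
print(embed_ipv4("3001::/96", "13.144.2.1"))    # 3001::d90:201
```

With a /32 prefix the four IPv4 octets land in bytes 4 to 7 and never reach the u-bits octet, which is why the first 8 bits of 13.0.0.0/8 extend the /32 into the /40 static route.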

We will quickly verify the configuration as we did for CSR1. First, we verify that the interface-level prefix is enabled and valid. This v6v4 prefix will determine the post-NAT64 IPv6 destination addresses. The v4v6 global prefix will determine the post-NAT64 IPv6 source addresses. R2#show nat64 prefix stateless v6v4 interfaces v6v4 Stateless Prefixes Interface NAT64 Enabled Global Prefix GigabitEthernet2.528 TRUE FALSE 3002::/32 R2#show nat64 prefix stateless v4v6 global Global v4v6 Stateless Prefix: is valid, 2222::/32 IFs Using Global Prefix Gi2.524 Gi2.528

Next, we confirm that the NAT64 IPv4 route is directing traffic for 13.0.0.0/8 to the NVI. This is identical to CSR1 and required to direct incoming IPv4 traffic to the NAT64 process. R2#show ip route 13.0.0.0 255.0.0.0 Routing entry for 13.0.0.0/8 Known via "static", distance 1, metric 0 Routing Descriptor Blocks: * 0.0.0.1, via NVI0 Route metric is 0, traffic share count is 1

Last, we confirm that the IPv6 static route is directing traffic for post-NAT IPv6 flows towards CSR8, rather than looping it back into the NVI. This is the same type of configuration required on CSR1. R2#show ipv6 route 3002::/32 longer-prefixes | begin Appl a - Application S 3002::/32 [1/0] via ::44, NVI0 S 3002:0:D00::/40 [1/0] via FE80::8, GigabitEthernet2.528

I begin sending large pings from XRv4 through CSR2. The NAT64 statistics are increasing, which suggests that it is working. Again, packets are being translated v4 to v6, but not vice versa, which implies a unidirectional forwarding path. Notice that we specifically check the v4v6 prefix of 2222::/32 since the only functional NAT64 direction is v4 to v6 at this point.

R2#show nat64 statistics prefix stateless v4v6 2222::/32 NAT64 Statistics 2222::/32: Packets translated (IPv4 -> IPv6) Stateless: 5 Stateful: 0 MAP-T: 0 Packets translated (IPv6 -> IPv4) Stateless: 0 Stateful: 0 MAP-T: 0 Packets dropped: 0

Using EPC in the same manner as on CSR1, we can see the IPv4 to IPv6 translation. Source addresses are yellow and destination addresses are green. The IPv4 address is embedded in a different spot as a result of the NAT64 prefix length. The u-bits and suffix are all zeroes as well, but of greater significance, stateless NAT64 is working. The most significant piece of this output is the difference between the IPv6 source and destination prefixes.
R2#show monitor capture CAP buffer detailed
0 518 0.000000 192.168.2.14 -> 13.144.2.1 ICMP
0000: 005056A9 BE8A0050 56A9862A 81000DC4 .PV....PV..*....
0010: 08004500 01F40000 0000FF01 E7C1C0A8 ..E.............
0020: 020E0D90 02010800 95C9009C 0000ABCD ................
0030: ABCDABCD ABCDABCD ABCDABCD ABCDABCD ................
1 546 0.000000 2222:*:0000 -> 3002:*:0000 IPv6-Frag
0000: 005056A9 FB1C0050 56A9BE8A 81000DC8 .PV....PV.......
0010: 86DD6000 000001E8 2CFE2222 0000C0A8 ..`.....,.""....
0020: 020E0000 00000000 00003002 00000D90 ..........0.....
0030: 02010000 00000000 00003A00 00000000 ..........:.....

Next, we will configure NAT64 on CSR8, which is an LSN node. This will translate IPv6 traffic from the customers CSR1 and CSR2 into public IPv4 traffic that can be used on the IPv4 Internet. As an LSN, CSR8 should be conservative with IPv4 addressing, so it will use stateful NAT64 to overload a small set of public IPv4 addresses using port numbers as differentiators. Defining an IPv4 pool is similar to regular NAT44, and the pool size is limited to a single IPv4 address. The IPv6 ACL matches the candidate traffic for NAT64, and because the CPEs are performing stateless NAT64, we can create very generic rules. The source/destination of the IPv6 packets from CSR1 and CSR2 are highly deterministic since we know the NAT64 stateless prefixes in use by CSR1 and CSR2. Last, we must define a NAT64 stateful prefix on CSR8. Like the stateless prefix, this creates a static route to the NVI so IPv6 traffic for these destinations can be NAT64’ed. I use per-interface prefixes since the CPEs are using different NAT64 prefixes; the formats should match so addressing can be derived properly. CSR8 needs to know where the IPv4 address is encoded within the IPv6 address to decode it properly, so a generic prefix only works when all CPEs are using the same prefix length. The two IPv6 static routes are used to direct traffic back to the CPEs. Notice that CSR1’s route is somewhat complex since it encompasses the entire NAT64 prefix, then uses a more specific match to keep traffic from looping into the NVI. CSR2 uses an entirely different source IPv6 range, which is cleaner from a routing perspective.
! CSR8
interface GigabitEthernet2.518
 nat64 enable
 nat64 prefix stateful 3001::/96
 ipv6 address FE80::8 link-local
interface GigabitEthernet2.528
 nat64 enable
 nat64 prefix stateful 3002::/32
 ipv6 address FE80::8 link-local
interface GigabitEthernet2.582
 nat64 enable
 ipv6 address FE80::8 link-local
ipv6 access-list ACL_LSN_64
 permit ipv6 3001::/96 3001::/96
 permit ipv6 2222::/32 3002::/32
nat64 v4 pool NAT64_POOL 209.19.85.8 209.19.85.8
nat64 v6v4 list ACL_LSN_64 pool NAT64_POOL overload
ipv6 route 2222::/16 GigabitEthernet2.528 FE80::2
ipv6 route 3001::C0A8:100/120 GigabitEthernet2.518 FE80::1
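The need for matching prefix lengths between the CPE and the LSN can be illustrated by decoding an address with the right and the wrong length. The Python below is a sketch only; extract_ipv4 is a hypothetical helper name, and the byte positions follow the layout described earlier (IPv4 bits follow the prefix and skip bits 64 to 71).

```python
import ipaddress

def extract_ipv4(ipv6: str, prefix_len: int) -> ipaddress.IPv4Address:
    """Recover the embedded IPv4 address, given the NAT64 prefix length."""
    raw = ipaddress.IPv6Address(ipv6).packed
    positions = []
    pos = prefix_len // 8
    while len(positions) < 4:
        if pos != 8:                  # byte 8 is the reserved u-bits octet
            positions.append(pos)
        pos += 1
    return ipaddress.IPv4Address(bytes(raw[i] for i in positions))

# Correct lengths recover the original hosts behind CSR2 and CSR1.
print(extract_ipv4("2222:0:c0a8:20e::", 32))   # 192.168.2.14
print(extract_ipv4("3001::c0a8:10e", 96))      # 192.168.1.14
# Decoding a /32-encoded address as if it were a /96 yields garbage.
print(extract_ipv4("3002:0:d90:201::", 96))    # 0.0.0.0
```

The last call shows why a single generic stateful prefix would not work here: the decoder reads the wrong bytes when the assumed length differs from the length the CPE used.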

One interesting behavior of stateful NAT64 is that the addresses contained within the IPv4 pool cannot exist on any interfaces, even shutdown ones. This is because NAT64 installs an NVI route to the pool addresses which forces them into the NAT process; having a connected route already in the routing table would break this logic. I completely remove Loopback209 (not shown) so that the command below is accepted. R8(config)#nat64 v4 pool NAT64_POOL 209.19.85.8 209.19.85.8 %NAT64: The range 209.19.85.8-209.19.85.8 cannot contain an interface address (found overlap with address on interface Loopback209)

Unlike stateless NAT64, we did not need to add a static route to forward traffic further upstream. CSR8 learns an IS-IS summary route for 13.0.0.0/8 from XRv2, which is sufficient for traffic towards the Internet from the LSN. R8#show ip route 13.0.0.0 255.0.0.0 Routing entry for 13.0.0.0/8 Known via "isis", distance 115, metric 20, type level-2


Redistributing via isis 1112 Last update from 10.8.12.12 on GigabitEthernet2.582, 01:56:07 ago Routing Descriptor Blocks: * 10.8.12.12, from 0.0.0.0, 01:56:07 ago, via GigabitEthernet2.582 Route metric is 20, traffic share count is 1

Return traffic is broken as a result of removing Loopback209. This was advertised into IS-IS earlier which ultimately allowed the Internet to know about it. Although summarization was used in previous labs to influence routing decisions, I am illustrating a situation where such alternate paths would not exist. There is technically a /30 covering this prefix, known to XRv3, but that is beyond the scope of the current problem. RP/0/0/CPU0:XRv3#show cef 209.19.85.8/32 %Prefix not found or IP is not running. VRF default. RP/0/0/CPU0:XRv3#show cef 209.19.85.8 209.19.85.8/30, version 480, internal 0x5000001 0x0 (ptr 0xa1413cf4) [1], 0x0 (0x0), 0x0 (0x0) local adjacency 10.12.13.12 Prefix Len 30, traffic index 0, precedence n/a, priority 4 via 10.12.13.12, 2 dependencies, recursive, bgp-ext [flags 0x6020] path-idx 0 NHID 0x0 [0xa14144f4 0x0] next hop 10.12.13.12 via 10.12.13.12/32

Fortunately, NAT64 adds this IPv4 pool to the RIB as a static route, so we can simply redistribute it into IS-IS to achieve reachability. From a NAT64 perspective, this is critical since returning traffic is directed into the NVI to be translated from v4 to v6. From a routing perspective, it gives us a mechanism to redistribute the NAT64 pool to IPv4 routing protocols. Once complete, the Internet router sees this exact match prefix from XRv1.
! CSR8
ip prefix-list PL_PUBLIC seq 5 permit 209.19.85.8/32
route-map RM_STATIC_TO_ISIS permit 10
 match ip address prefix-list PL_PUBLIC
router isis 1112
 redistribute static ip route-map RM_STATIC_TO_ISIS
RP/0/0/CPU0:XRv3#show cef 209.19.85.8/32
209.19.85.8/32, version 486, internal 0x5000001 0x0 (ptr 0xa14147f4) [1], 0x0 (0x0), 0x0 (0x0)
 local adjacency 10.11.13.11
 Prefix Len 32, traffic index 0, precedence n/a, priority 4
   via 10.11.13.11, 2 dependencies, recursive, bgp-ext [flags 0x6020]
    path-idx 0 NHID 0x0 [0xa1414574 0x0]
    next hop 10.11.13.11 via 10.11.13.11/32


With basic IPv4 routing resolved, we will continue verifying stateful NAT64. Verifying the prefix is similar to stateless NAT64 as it shows the user-defined prefixes as valid and applied to the proper interfaces. Again, we used different interface-level prefixes on the LSN because the downstream clients have different NAT64 stateless prefixes in this example. R8#show nat64 prefix stateful interfaces Stateful Prefixes Interface NAT64 Enabled Global Prefix GigabitEthernet2.518 TRUE FALSE 3001::/96 GigabitEthernet2.528 TRUE FALSE 3002::/32

Next, we verify that the pool was configured properly. It consists of exactly 1 useable IPv4 address. R8#show nat64 pools name NAT64_POOL Protocol HSL ID Name Is Single Range Ranges IPv4 1 NAT64_POOL TRUE (209.19.85.8 - 209.19.85.8) 209.19.85.8 - 209.19.85.8

As a troubleshooting aid when dealing with routing issues, we can see the static routes created by NAT64 for IPv6 prefixes and IPv4 pools using the commands below. A single /32 sufficiently covers the IPv4 pool in this case. An IPv4 pool may require a collection of static routes to cover the entire range. Additionally, each IPv6 prefix causes NAT64 to add a static route to direct traffic into the NVI. R8#show nat64 prefix stateful static-routes Stateful Prefixes NAT64 Prefix Static Route Ref-Count 3001::/96 1 3002::/32 1 R8#show nat64 pools routes Pools configured: 1 Protocol HSL ID Name Is Single Range Ranges IPv4 1 NAT64_POOL TRUE (209.19.85.8 - 209.19.85.8) 209.19.85.8 - 209.19.85.8


Static Routes for Range: 1 209.19.85.8/32
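The remark that a pool may need a collection of static routes can be illustrated with Python's standard ipaddress module, which can summarize an arbitrary address range into the minimal set of covering prefixes. This is a sketch of the general idea only; I am not claiming IOS XE uses this exact algorithm, the helper name pool_routes is hypothetical, and the second range below is a made-up example rather than a pool configured in this lab.

```python
import ipaddress

def pool_routes(first: str, last: str) -> list:
    """Return the minimal list of prefixes that exactly covers first..last."""
    return list(ipaddress.summarize_address_range(
        ipaddress.IPv4Address(first), ipaddress.IPv4Address(last)))

# The single-address pool from CSR8 needs exactly one /32.
print(pool_routes("209.19.85.8", "209.19.85.8"))
# A hypothetical three-address pool is not a clean power-of-two block,
# so it needs two prefixes: a /32 and a /31.
print(pool_routes("209.19.85.9", "209.19.85.11"))
```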

Because we know how traffic is being encoded inside IPv6, we can quickly check the IPv6 FIB to ensure return traffic is routed correctly. For CSR2 this is easier since 2222::/16 is not used for anything else. For CSR1, we must ensure the longer-match route was configured correctly. I highlight the embedded IPv4 private source addresses for clarity (192.168.1.14 or 192.168.2.14 for CSR1 and CSR2, respectively). R8#show ipv6 cef 3001::C0A8:10E 3001::C0A8:100/120 nexthop FE80::1 GigabitEthernet2.518 R8#show ipv6 cef 2222:0:c0a8:020E:: 2222::/16 nexthop FE80::2 GigabitEthernet2.528

To test it, I begin sending large pings from XRv4 towards the Internet. I begin with using CSR2 as the gateway since it is the simpler case. This time, the pings are successful as shown below. With EPC enabled on CSR8, we can see the incoming IPv6 packet from CSR2 and the outgoing IPv4 packet from CSR8. The public IPv4 address of 209.19.85.8 is used as the IPv4 source with a destination of the original IPv4 Internet host. Because CSR8 had a matching IPv6 stateful prefix for CSR2’s NAT64 stateless prefix, it knew exactly where to find the IPv4 addresses.
R8#show monitor capture CAP buffer detailed
0 546 0.000000 2222:*:0000 -> 3002:*:0000 IPv6-Frag
0000: 005056A9 FB1C0050 56A9BE8A 81000DC8 .PV....PV.......
0010: 86DD6000 000001E8 2CFD2222 0000C0A8 ..`.....,.""....
0020: 020E0000 00000000 00003002 00000D90 ..........0.....
0030: 02010000 00000000 00003A00 00000000 ..........:.....
1 518 0.000000 209.19.85.8 -> 13.144.2.1 ICMP
0000: 005056A9 0E6F0050 56A9FB1C 81000DFE .PV..o.PV.......
0010: 08004500 01F40000 0000FD01 865CD113 ..E..........\..
0020: 55080D90 02010800 96620003 0000ABCD U........b......
0030: ABCDABCD ABCDABCD ABCDABCD ABCDABCD ................

Viewing the translation table, the output is somewhat similar to the deprecated NAT-PT feature. There are 4 key pieces of information. The original IPv4 address is the destination that XRv4 was trying to reach initially (Internet host). When this was translated into an IPv6 address, it was wrapped inside the 3002::/32 prefix, which is highlighted. The second row shows the original IPv6 address, which was the source of the traffic entering CSR8. After translation, the post-NAT64 address came from the IPv4 pool and represents CSR8’s public address. A simpler way to read this table is by columns. The IPv4 column shows the destination/source of the post-NAT IPv4 packet (outgoing). The IPv6 column shows the destination/source of the pre-NAT IPv6 packet (incoming). This “shortcut” is confirmed by the EPC capture above.
R8#show nat64 translations
Proto  Original IPv4    Translated IPv4
       Translated IPv6  Original IPv6
----------------------------------------------------------------------------
icmp   13.144.2.1:1     [3002:0:d90:201::]:24732
       209.19.85.8:1    [2222:0:c0a8:20e::]:24732

We can check the statistics to confirm that all 5 packets in the ping test succeeded. Gig2.528 translated packets from v6 to v4, while Gig2.582 translated them back from v4 to v6. R8#show nat64 statistics interface gig2.528 NAT64 Statistics Interface Statistics GigabitEthernet2.528 (IPv4 configured, IPv6 configured): Packets translated (IPv4 -> IPv6) Stateless: 0 Stateful: 0 MAP-T: 0 Packets translated (IPv6 -> IPv4) Stateless: 0 Stateful: 5 MAP-T: 0 Packets dropped: 0 R8#show nat64 statistics interface gig2.582 NAT64 Statistics Interface Statistics GigabitEthernet2.582 (IPv4 configured, IPv6 configured): Packets translated (IPv4 -> IPv6) Stateless: 0 Stateful: 5 MAP-T: 0 Packets translated (IPv6 -> IPv4) Stateless: 0 Stateful: 0 MAP-T: 0 Packets dropped: 0

Quickly backtracking to CSR2 (and clearing the NAT64 statistics for clarity), we can check the v4v6 prefix for translations. It shows that 5 packets were translated in both directions, which suggests connectivity is functional. When we viewed this output before CSR8 was configured, we only saw v4 to v6 translations. Seeing bidirectional translations on the CPE is a good sign that the LSN is functional. R2#clear nat64 statistics RP/0/0/CPU0:XRv4#ping vrf 2 13.144.2.1 size 500


Type escape sequence to abort. Sending 5, 500-byte ICMP Echos to 13.144.2.1, timeout is 2 seconds: !!!!! Success rate is 100 percent (5/5), round-trip min/avg/max = 1/57/89 ms R2#show nat64 statistics prefix stateless v4v6 2222::/32 NAT64 Statistics 2222::/32: Packets translated (IPv4 -> IPv6) Stateless: 5 Stateful: 0 MAP-T: 0 Packets translated (IPv6 -> IPv4) Stateless: 5 Stateful: 0 MAP-T: 0 Packets dropped: 0

When we attempt the same test using CSR1, it fails. We did not trace the detailed routing process for CSR2 since it “just worked”, so for completeness, we will trace it using CSR1 in the forwarding path. RP/0/0/CPU0:XRv4#ping vrf 1 13.144.2.1 size 500 Type escape sequence to abort. Sending 5, 500-byte ICMP Echos to 13.144.2.1, timeout is 2 seconds: ..... Success rate is 0 percent (0/5)

We used EPC earlier to identify that CSR1 is sending traffic to CSR8 with IPv6 source 3001::C0A8:010E (192.168.1.14) and IPv6 destination 3001::0D90:0201 (13.144.2.1). Based on this, CSR8 should be routing this traffic to the NVI. The NAT64-installed static route for the IPv6 stateful prefix 3001::/96 appears to be working. R8#show ipv6 cef 3001::d90:201 3001::/96 nexthop ::100.0.0.1 NVI0

Once in the NVI, the NAT64 ACL is processed. This particular flow matches sequence 10. Assuming NAT64 worked, CSR8 also has a route to the final destination as well. Without this route, the earlier test using CSR2 would have also failed. R8#show access-lists ACL_LSN_64 IPv6 access list ACL_LSN_64 permit ipv6 3001::/96 3001::/96 sequence 10 permit ipv6 2222::/32 3002::/32 sequence 20 R8#show ip cef 13.144.2.1 13.0.0.0/8


nexthop 10.8.12.12 GigabitEthernet2.582

At this point, NAT64 should be occurring, but it isn’t. Instead, all 5 of the pings were dropped. R8#show nat64 statistics prefix stateful 3001::/96 NAT64 Statistics 3001::/96: Packets translated (IPv4 -> IPv6) Stateless: 0 Stateful: 0 MAP-T: 0 Packets translated (IPv6 -> IPv4) Stateless: 0 Stateful: 0 MAP-T: 0 Packets dropped: 5

Even with all NAT64 debugging enabled, no information is given. I send more pings via XRv4 (not shown), yet we are presented with no debug output, no NAT64 translations, and another 5 dropped packets.
R8#debug nat64 all
NAT64 debugging is on
[no output]
R8#show nat64 translations
Proto  Original IPv4    Translated IPv4
       Translated IPv6  Original IPv6
----------------------------------------------------------------------------
Total number of translations: 0
R8#show nat64 statistics prefix stateful 3001::/96 | include dropped
Packets dropped: 10

At this point, the problem is impossible to identify without an educated guess. The only significant difference between CSR1 and CSR2 is that CSR1’s source IPv6 address also falls within the NAT64 stateful prefix range. That is to say, when CSR8 tries to perform NAT64 on a specific destination inside the range 3001::/96, it also sees the IPv6 packet’s source falls within the same range. I assume that this is the reason NAT64 silently discards these packets as a loop control mechanism. After hours of testing and troubleshooting (absent any formal documentation), I believe this is the reason for the failure. I leave CSR1 broken for reference; if we were going to fix it, we would need to use a separate stateless NAT64 prefix for source and destination as CSR2 does.

We progress to configuring CSR3 and CSR4 with stateless NAT64. They both serve a single client on a LAN so their configurations will be similar. For brevity, I do not show the NAT44 removal and IPv4 cleanup. I also do not show enabling IPv6 or NAT64 on the interfaces as this is very basic. There is nothing preventing CSR3 and CSR4 from sending traffic to the same IPv6 destination, so the NAT64 v6v4 prefix applied to the dialers can be identical. The v4v6 prefix applied globally, used for IPv6 source addresses, should be different to guarantee optimal routing. The prefix lengths between v6v4 and v4v6 prefixes do not have to match. I create the standard 13.0.0.0/8 route to the NVI for incoming IPv4 traffic, and enable a minor fragmentation option. IPv6 can insert fragmentation headers for fragmented packets, and NAT64 does this automatically. That is why the EPC captures shown earlier displayed “IPv6-Frag” versus “IPv6-ICMP”. This option prevents that fragmentation header from being imposed and I enable it on CSR3 only. The /104 static routes are similar to the one used on CSR1; this directs the IPv6 traffic towards the LSN rather than looping it through the NVI indefinitely.
! CSR3
interface Dialer3
 nat64 prefix stateless v6v4 3034:BEEF:2BAD::/96
nat64 prefix stateless v4v6 3034:3::/32
nat64 settings fragmentation header disable
nat64 route 13.0.0.0/8 Dialer3
ipv6 route 3034:BEEF:2BAD:0:0:0:0d00:0/104 Dialer3
! CSR4
interface Dialer4
 nat64 prefix stateless v6v4 3034:BEEF:2BAD::/96
nat64 prefix stateless v4v6 3034:4::/32
nat64 route 13.0.0.0/8 Dialer4
ipv6 route 3034:BEEF:2BAD:0:0:0:0d00:0/104 Dialer4
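The difference between the CSR1 design and the others can be checked mechanically. The Python sketch below compares each CPE's source (v4v6) prefix against its destination (v6v4) prefix; it only tests the overlap condition that I suspect triggers the silent drops on the LSN, and it says nothing about what the NAT64 code actually does internally. The helper name source_inside_destination_prefix is hypothetical.

```python
import ipaddress

def source_inside_destination_prefix(v4v6: str, v6v4: str) -> bool:
    """True when post-NAT source addresses can fall inside the destination prefix."""
    return ipaddress.IPv6Network(v4v6).overlaps(ipaddress.IPv6Network(v6v4))

# (v4v6 source prefix, v6v4 destination prefix) per CPE, taken from the configs.
designs = {
    "CSR1": ("3001::/96", "3001::/96"),
    "CSR2": ("2222::/32", "3002::/32"),
    "CSR3": ("3034:3::/32", "3034:BEEF:2BAD::/96"),
    "CSR4": ("3034:4::/32", "3034:BEEF:2BAD::/96"),
}

for cpe, (src, dst) in designs.items():
    risky = source_inside_destination_prefix(src, dst)
    print(f"{cpe}: source/destination prefixes overlap = {risky}")
```

Only CSR1 reports an overlap, which is consistent with it being the only CPE whose traffic is dropped at the LSN.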

Since HSRP was originally tracking the presence of an IPv4 default route, I remove the tracking mechanism entirely since it is beyond the scope of NAT64 and complicates our test. CSR3 is active for both IPv4 and IPv6 due to having a higher priority. R3#show standby brief

                     P indicates configured to preempt.
                     |
Interface   Grp  Pri P State   Active  Standby       Virtual IP
Gi2.534     344  105 P Active  local   192.168.34.4  192.168.34.254
Gi2.534     346  105 P Active  local   FE80::4       FE80::254

The CSR3 and CSR4 configurations are almost identical, so I only verify CSR3 for brevity. Below, I quickly verify the global v4v6 prefix (sources) and interface-specific v6v4 prefix (destination). CSR4 would show a different v4v6 prefix but the same v6v4 prefix. R3#show nat64 prefix stateless v4v6 global Global v4v6 Stateless Prefix: is valid, 3034:3::/32 IFs Using Global Prefix Gi2.534


Di3 R3#show nat64 prefix stateless v6v4 interfaces v6v4 Stateless Prefixes Interface NAT64 Enabled Global Prefix Dialer3 TRUE FALSE 3034:BEEF:2BAD::/96

Using a specific IPv4 Internet host, I confirm that this traffic is forwarded into the NVI so that NAT64 can occur. This static route was installed via the “nat64 route” command. R3#show ip route 13.144.2.1 Routing entry for 13.0.0.0/8 Known via "static", distance 1, metric 0 Routing Descriptor Blocks: * 0.0.0.1, via NVI0 Route metric is 0, traffic share count is 1

After NAT64, the router needs to send traffic upstream to CSR9. The longer-match /104 static routes accomplish this nicely. The same route can be used on both CSR3 and CSR4 since they use a common v6v4 (destination) prefix, as seen in the configuration snippets earlier. R3#show ipv6 route 3034:beef:2bad::/96 longer-prefixes | begin Appl a - Application S 3034:BEEF:2BAD::/96 [1/0] via ::44, NVI0 S 3034:BEEF:2BAD::D00:0/104 [1/0] via Dialer3, directly connected

Using EPC while pinging from XRv4, we can confirm CSR3 is functioning correctly; an IPv4 packet enters and an IPv6 packet exits. The IPv6 source/destination addresses are from different major networks, which is a requirement for stateful NAT64 to work at the LSN. Notice that the IPv6 packet is 546 bytes, which was the same length as the IPv6 packet seen earlier with the fragmentation header. That IPv6 fragmentation header is 8 bytes, as is the PPPoE encapsulation (cyan), which is why the sizes appear the same now. This proves that the IPv6 fragmentation header was removed from the IPv6 packet after NAT64 occurred; if it was not removed, the size would have been 554. We will see this on CSR4 next.
R3#show monitor capture CAP buffer detail
4 518 2.020003 192.168.34.14 -> 13.144.2.1 ICMP
0000: 00000C9F F1580050 56A9862A 81000DCE .....X.PV..*....
0010: 08004500 01F40002 0000FF01 C7BFC0A8 ..E.............
0020: 220E0D90 02010800 D5C6C09C 0002ABCD "...............
0030: ABCDABCD ABCDABCD ABCDABCD ABCDABCD ................
5 546 2.020003 00:50:56:A9:8C:CF -> 00:50:56:A9:D6:72 PPPoE Session Stage
0000: 005056A9 D6720050 56A98CCF 81000DD3 .PV..r.PV.......
0010: 88641100 0005020A 00576000 000001E0 .d.......W`.....
0020: 3AFE3034 0003C0A8 220E0000 00000000 :.04....".......
0030: 00003034 BEEF2BAD 00000000 00000D90 ..04..+.........
0040: 02018000 1E5DC09C 0001ABCD ABCDABCD .....]..........

I shut down CSR3’s client-side LAN interface so that CSR4 becomes the HSRP active router. Using EPC, I perform the same test. Everything is identical except for two details. The IPv6 source address is using the prefix 3034:4::/32 versus 3034:3::/32, and the packet is 8 bytes larger due to the IPv6 fragmentation header being added by NAT64. This header is shown in pink for completeness, but is not examined/discussed in detail. The IPv6 packet is 554 bytes as expected when accounting for this additional header. R4#show monitor capture CAP buffer detail 3 518 2.413965 192.168.34.14 -> 13.144.2.1 ICMP 0000: 00000C9F F1580050 56A9862A 81000DCE .....X.PV..*.... 0010: 08004500 01F40000 0000FF01 C7C1C0A8 ..E............. 0020: 220E0D90 02010800 A5C8F09C 0000ABCD "............... 0030: ABCDABCD ABCDABCD ABCDABCD ABCDABCD ................ 4 554 2.413965 00:50:56:A9:2C:57 -> 00:50:56:A9:D6:72 PPPoE Session Stage 0000: 005056A9 D6720050 56A92C57 81000DDD .PV..r.PV.,W.... 0010: 88641100 00040212 00576000 000001E8 .d.......W`..... 0020: 2CFE3034 0004C0A8 220E0000 00000000 ,.04...."....... 0030: 00003034 BEEF2BAD 00000000 00000D90 ..04..+......... 0040: 02013A00 00000000 00008000 EE5CF09C ..:..........\..
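The frame sizes in these captures can be reproduced with simple arithmetic. The Python sketch below is only a worked example; it assumes a 20-byte IPv4 header with no options and that the captured length counts the Ethernet and 802.1Q headers but not the FCS. The function names are hypothetical.

```python
ETHERNET = 14      # destination MAC, source MAC, EtherType
DOT1Q = 4          # 802.1Q VLAN tag
IPV4_HDR = 20      # assumes no IPv4 options
IPV6_HDR = 40
FRAG_HDR = 8       # IPv6 fragmentation extension header
PPPOE_PPP = 8      # 6-byte PPPoE header plus 2-byte PPP protocol field

def ipv4_frame(ip_total_len: int) -> int:
    return ETHERNET + DOT1Q + ip_total_len

def ipv6_frame(ip_total_len: int, frag_header: bool, pppoe: bool) -> int:
    payload = ip_total_len - IPV4_HDR        # ICMP header + data is carried over
    size = ETHERNET + DOT1Q + IPV6_HDR + payload
    if frag_header:
        size += FRAG_HDR
    if pppoe:
        size += PPPOE_PPP
    return size

print(ipv4_frame(500))                                  # 518: original IPv4 ping
print(ipv6_frame(500, frag_header=True, pppoe=False))   # 546: CSR1/CSR2 uplinks
print(ipv6_frame(500, frag_header=False, pppoe=True))   # 546: CSR3 over PPPoE
print(ipv6_frame(500, frag_header=True, pppoe=True))    # 554: CSR4 over PPPoE
```

The two 546-byte cases coincide only because the fragmentation header and the PPPoE/PPP overhead happen to be the same size.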

Both CSR3 and CSR4 reported 5 packets translated from IPv4 to IPv6, which is accurate given the quick testing we conducted. This is a good indication that CPE stateless NAT64 is functioning properly in one direction so far. I bring CSR3’s LAN interface back up before continuing. R3#show nat64 statistics global | begin Global Global Stats: Packets translated (IPv4 -> IPv6) Stateless: 5 Stateful: 0 MAP-T: 0 Packets translated (IPv6 -> IPv4) Stateless: 0 Stateful: 0 MAP-T: 0 R4#show nat64 statistics global | begin Global Global Stats:


Packets translated (IPv4 -> IPv6) Stateless: 5 Stateful: 0 MAP-T: 0 Packets translated (IPv6 -> IPv4) Stateless: 0 Stateful: 0 MAP-T: 0

CSR9 will perform stateful NAT64 just as CSR8 does. CSR9 has two public addresses available that are not contiguous, so I will create two separate NAT64 rules with different ACLs. This can be used for “port sharing” so that each public IPv4 address is “less overloaded”, since there are multiple public IPv4 addresses. CSR9’s public loopbacks are completely removed in order to create the NAT pools, and the resulting NVI-generated IPv4 static routes are redistributed into IS-IS. Since two pools are created, two NAT64 rules are required. One rule maps traffic from CSR3 to 209.19.85.9 and the other maps traffic from CSR4 to 209.19.85.11. With this design (assuming CSR3 and CSR4 had some kind of FHRP load sharing scheme, which they don’t currently), each public address on CSR9 is less saturated with ports. Because CSR3 and CSR4 use the same v6v4 (destination) prefix, CSR9 can define a global NAT64 stateful prefix. CSR8 had to use these per-interface since CSR1 and CSR2 used different destination IPv6 addresses. With PPPoE, this would be impossible since there is only a single “interface” towards the CPE routers, so having the PPPoE clients use the same v6v4 (destination) prefix is required. Enabling NAT64 on the virtual-template and upstream transit links is not shown for brevity.
! CSR9
ipv6 access-list ACL_NAT_FROM_3
 permit ipv6 3034:3::/32 3034:BEEF:2BAD::/96
ipv6 access-list ACL_NAT_FROM_4
 permit ipv6 3034:4::/32 3034:BEEF:2BAD::/96
nat64 prefix stateful 3034:BEEF:2BAD::/96
nat64 v4 pool NAT64_POOL_11 209.19.85.11 209.19.85.11
nat64 v4 pool NAT64_POOL_9 209.19.85.9 209.19.85.9
nat64 v6v4 list ACL_NAT_FROM_3 pool NAT64_POOL_9 overload
nat64 v6v4 list ACL_NAT_FROM_4 pool NAT64_POOL_11 overload

Rather than use static routes on CSR9 for return traffic to 3034:3::/32 and 3034:4::/32, we can use an IPv6 IGP; in this case, we use OSPFv3. It is already in place for IPv6 routing to the IPv6 Internet, and the static routes were auto-generated on the CPEs for stateless NAT64. I redistribute these static NVI routes into OSPFv3 on the CPEs, which provides dynamic return paths for CSR9 (the LSN). I use the most specific prefix-list possible to capture both source prefixes to keep the configuration the same.
! CSR3 and CSR4
ipv6 prefix-list PL_V4V6 seq 5 permit 3034::/29 ge 32 le 32
route-map RM_STATIC_TO_OSPF permit 10
 match ipv6 address prefix-list PL_V4V6
router ospfv3 9
 address-family ipv6 unicast
  redistribute static route-map RM_STATIC_TO_OSPF
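The choice of 3034::/29 in the prefix-list can be verified by computing the smallest prefix that contains both v4v6 source prefixes. The Python below is a minimal sketch using the standard ipaddress module (subnet_of requires Python 3.7 or later); the helper name common_supernet is hypothetical.

```python
import ipaddress

def common_supernet(a: str, b: str) -> ipaddress.IPv6Network:
    """Return the longest prefix that contains both networks."""
    net = ipaddress.IPv6Network(a)
    other = ipaddress.IPv6Network(b)
    # Shorten the prefix one bit at a time until it also covers the other network.
    while not other.subnet_of(net):
        net = net.supernet()
    return net

print(common_supernet("3034:3::/32", "3034:4::/32"))  # 3034::/29
```

The second hextets 3 (binary 011) and 4 (binary 100) differ in the third-lowest bit, so the last three bits of the 32-bit prefix must be wildcarded, which gives /29. The "ge 32 le 32" qualifiers then restrict the match to exact /32 routes inside that block.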

We confirm that CSR3 and CSR4 properly redistribute these from static into OSPF. CSR9 learns these routes as OSPF external type-2 routes, as expected. R3#show ospfv3 ipv6 rib redistribution OSPFv3 9 address-family ipv6 (router-id 192.168.34.3) 3034:3::/32, type 2, metric 20, tag 0, from static via ::43, NVI0 R4#show ospfv3 ipv6 rib redistribution OSPFv3 9 address-family ipv6 (router-id 192.168.34.4) 3034:4::/32, type 2, metric 20, tag 0, from static via ::43, NVI0 R9#show ipv6 route 3034::/29 longer-prefixes | begin Appl a - Application OE2 3034:3::/32 [110/20] via FE80::3, Virtual-Access2.2 OE2 3034:4::/32 [110/20] via FE80::4, Virtual-Access2.1

Before testing NAT64, we quickly validate the flow inside CSR9. When packets arrive, whether from CSR3 or CSR4, they are directed into the NVI as expected.
R9#show ipv6 route 3034:BEEF:2BAD::13.144.2.1
Routing entry for 3034:BEEF:2BAD::/96
  Known via "static", distance 1, metric 0
  Route count is 1/1, share count 0
  Routing paths:
    ::100.0.0.1, NVI0
      Last updated 00:02:45 ago

Assuming there is a match against one of the NAT64 ACLs, the traffic is routed to the final IPv4 destination. CSR9 has two ECMP paths for this via XRv1 and XRv2, which is fine. R9#show ip cef 13.144.2.1 13.0.0.0/8 nexthop 10.9.11.11 GigabitEthernet2.591 nexthop 10.9.12.12 GigabitEthernet2.592


Returning traffic will be destined to either 209.19.85.9 or 209.19.85.11 depending on whether it came from CSR3 or CSR4. In either case, traffic is directed into the NVI as a result of the NAT64 IPv4 pools. The other public addresses are shown below from the other LSNs (learned via IS-IS) but are not relevant for this test. Assuming there was an existing NAT64 translation, traffic would be routed back towards CSR3 or CSR4. This return-flow routing occurs via the OSPFv3 routes we verified earlier.
R9#show ip route 209.19.85.8 255.255.255.252 longer-prefixes | begin Gate
Gateway of last resort is not set
      209.19.85.0/32 is subnetted, 4 subnets
i L2     209.19.85.8 [115/20] via 10.9.12.12, 03:28:12, GigabitEthernet2.592
S        209.19.85.9 [1/0] via 100.0.0.2, NVI0
i L2     209.19.85.10 [115/20] via 10.9.11.11, 05:30:58, GigabitEthernet2.591
S        209.19.85.11 [1/0] via 100.0.0.2, NVI0

To test it, we use XRv4 again inside VRF 34 which is currently using CSR3 as its default gateway. The pings immediately work, which implies bidirectional connectivity has been achieved. RP/0/0/CPU0:XRv4#ping vrf 34 13.144.2.1 size 500 Type escape sequence to abort. Sending 5, 500-byte ICMP Echos to 13.144.2.1, timeout is 2 seconds: !!!!! Success rate is 100 percent (5/5), round-trip min/avg/max = 1/59/89 ms

Checking the NAT64 translation table on CSR9, we can see that the original IPv4 packet was destined to 13.144.2.1, an Internet host, and its source was adjusted to be 209.19.85.9. This makes sense since the traffic came from CSR3 (3034:3::/32) and the NAT64 rule specified 209.19.85.9 as the address to use.
R9#show nat64 translations
Proto  Original IPv4    Translated IPv4
       Translated IPv6  Original IPv6
---------------------------------------------------------------------------
icmp   13.144.2.1:1     [3034:beef:2bad::d90:201]:156
       209.19.85.9:1    [3034:3:c0a8:220e::]:156

The NAT64 statistics also indicate that 5 packets were translated bidirectionally across CSR9.
R9#show nat64 statistics global | begin Global
Global Stats:
   Packets translated (IPv4 -> IPv6)
      Stateless: 0
      Stateful: 5
      MAP-T: 0
   Packets translated (IPv6 -> IPv4)
      Stateless: 0
      Stateful: 5
      MAP-T: 0


We can use EPC for additional verification. It does not appear capable of capturing the PPPoE traffic entering the router on the server side (incoming IPv6 packet), so we will limit the verification to the outgoing IPv4 packet. Since ECMP was in play, CSR9 had to decide whether to forward the packet to XRv1 or XRv2. Using show commands, we confirm that XRv2 was selected, and EPC proves it with the destination MAC address in cyan. As usual, I show the source address in yellow (209.19.85.9) and the destination address in green (13.144.2.1).
R9#show ip cef exact-route 209.19.85.9 13.144.2.1
209.19.85.9 -> 13.144.2.1 =>IP adj out of GigabitEthernet2.592, addr 10.9.12.12
R9#show ip arp 10.9.12.12
Protocol  Address     Age (min)  Hardware Addr   Type  Interface
Internet  10.9.12.12  99         0050.56a9.0e6f  ARPA  Gig2.592
R9#show monitor capture CAP buffer detailed
0 518 0.000000 209.19.85.9 -> 13.144.2.1 ICMP
0000: 005056A9 0E6F0050 56A9D672 81000E08 .PV..o.PV..r....
0010: 08004500 01F40000 4000FD01 465BD113 ..E.....@...F[..
0020: 55090D90 02010800 96640001 0000ABCD U........d......
0030: ABCDABCD ABCDABCD ABCDABCD ABCDABCD ................
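The same check can be done programmatically instead of by eye. The following is a minimal Python sketch, not part of the lab itself, that decodes the first three rows of the hex column above. It assumes the frame carries exactly one 802.1Q tag followed by an IPv4 header without options; the function names are my own and purely illustrative.

```python
import ipaddress
import struct

# Hex column of the first three rows of the capture above (ASCII column omitted).
DUMP = """
0000: 005056A9 0E6F0050 56A9D672 81000E08
0010: 08004500 01F40000 4000FD01 465BD113
0020: 55090D90 02010800 96640001 0000ABCD
"""


def dump_to_bytes(text: str) -> bytes:
    """Join the hex words of an offset-prefixed dump into raw bytes."""
    digits = []
    for line in text.strip().splitlines():
        _offset, _sep, words = line.partition(":")
        digits.append("".join(words.split()))
    return bytes.fromhex("".join(digits))


def cisco_mac(raw: bytes) -> str:
    """Format six bytes in dotted notation, e.g. 0050.56a9.0e6f."""
    text = raw.hex()
    return ".".join(text[i:i + 4] for i in (0, 4, 8))


def parse_dot1q_ipv4(frame: bytes) -> dict:
    """Decode an 802.1Q-tagged Ethernet frame carrying IPv4 (assumed layout)."""
    tpid, tci, ethertype = struct.unpack("!HHH", frame[12:18])
    if tpid != 0x8100 or ethertype != 0x0800:
        raise ValueError("expected a single 802.1Q tag followed by IPv4")
    ip_header = frame[18:38]  # 20-byte IPv4 header, no options assumed
    return {
        "dst_mac": cisco_mac(frame[0:6]),
        "src_mac": cisco_mac(frame[6:12]),
        "vlan": tci & 0x0FFF,
        "src_ip": str(ipaddress.IPv4Address(ip_header[12:16])),
        "dst_ip": str(ipaddress.IPv4Address(ip_header[16:20])),
    }


if __name__ == "__main__":
    for key, value in parse_dot1q_ipv4(dump_to_bytes(DUMP)).items():
        print(f"{key}: {value}")
```

The decoded destination MAC should equal the ARP entry for 10.9.12.12, which is what confirms that XRv2 was the selected next-hop.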

We expect CSR4 to work identically, except that the post-NAT64 address of 209.19.85.11 should be used. Shutting down CSR3’s LAN interface, we can force traffic through CSR4. XRv4 can still ping the Internet host (not shown) which means the HSRP failover is functioning as designed. Checking the NAT translations, we see that the other public IPv4 address is used. The statistics also show an additional 5 packets successfully translated, for a total of 10 packets. All of this output is indicative of success so far.
R9#show nat64 translations
Proto  Original IPv4    Translated IPv4
       Translated IPv6  Original IPv6
---------------------------------------------------------------------------
icmp   13.144.2.1:1     [3034:beef:2bad::d90:201]:20636
       209.19.85.11:1   [3034:4:c0a8:220e::]:20636
Total number of translations: 1
R9#show nat64 statistics global | begin Global
Global Stats:
   Packets translated (IPv4 -> IPv6)
      Stateless: 0
      Stateful: 10
      MAP-T: 0
   Packets translated (IPv6 -> IPv4)
      Stateless: 0
      Stateful: 10
      MAP-T: 0

Using EPC again, we ensure the packet headers are correct. This time, CSR9 selects XRv1 as the next-hop towards the Internet host since the CEF ECMP function is a combination of both source and destination addresses. Only the source was modified and CEF made a different routing decision. We confirm this decision by checking the destination MAC address in the Ethernet frame. The IPv4 source and destination are in yellow and green, respectively, and represent the correct addresses. Before continuing, CSR3 is brought back online.
R9#show ip cef exact-route 209.19.85.11 13.144.2.1
209.19.85.11 -> 13.144.2.1 =>IP adj out of GigabitEthernet2.591, addr 10.9.11.11
R9#show ip arp 10.9.11.11
Protocol  Address     Age (min)  Hardware Addr   Type  Interface
Internet  10.9.11.11  124        0050.56a9.2dc6  ARPA  Gig2.591
R9#show monitor capture CAP buffer detailed
0 518 0.000000 209.19.85.11 -> 13.144.2.1 ICMP
0000: 005056A9 2DC60050 56A9D672 81000E07 .PV.-..PV..r....
0010: 08004500 01F40000 0000FD01 8659D113 ..E..........Y..
0020: 550B0D90 02010800 96640001 0000ABCD U........d......
0030: ABCDABCD ABCDABCD ABCDABCD ABCDABCD ................

Next, we will configure CSR5 for stateless NAT64 in a dual-homed scenario. NAT64 does not currently support route-maps, so the source-based PBR approach cannot be used as it was for NAT444. Instead, we will use destination-based load sharing using the longest-match routing principle with respect to directing traffic into the NVI. Traffic to 13.0.0.0/8 will be sent to CSR9 via Dialer59, but traffic to 13.144.2.24/29 will be sent to CSR10 via Dialer50. You cannot configure multiple NAT64 routes for the same destination in ECMP fashion, but using longest-match to approximate some load sharing is accepted. This could be valuable when a large portion of the upstream traffic flow is directed to subnet 13.144.2.24/29 and it warrants using a separate path for better load distribution. The v4v6 prefix 5000:AAAA::/64 is used for the IPv6 source addressing and there are separate v6v4 (destination) prefixes facing CSR9 and CSR10. Dialer59 uses v6v4 prefix 3034:BEEF:2BAD::/96 just like CSR3 and CSR4, which simplifies the configuration on CSR9. Dialer50 uses v6v4 prefix 3005:DEED:567A::/64 towards CSR10.
! CSR5
interface Dialer50
 nat64 prefix stateless v6v4 3005:DEED:567A::/64
interface Dialer59
 nat64 prefix stateless v6v4 3034:BEEF:2BAD::/96
nat64 prefix stateless v4v6 5000:AAAA::/64
nat64 route 13.0.0.0/8 Dialer59
nat64 route 13.144.2.24/29 Dialer50
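The longest-match behavior described above can be sketched in a few lines of Python. This is only an illustration of the selection logic using the two NAT64 routes from the configuration; it is not how IOS-XE implements the lookup, and the `select_uplink` name is hypothetical.

```python
import ipaddress
from typing import Optional

# The two "nat64 route" statements configured on CSR5.
NAT64_ROUTES = {
    "13.0.0.0/8": "Dialer59",      # towards CSR9
    "13.144.2.24/29": "Dialer50",  # towards CSR10
}


def select_uplink(destination: str) -> Optional[str]:
    """Return the output interface of the most specific matching NAT64 route."""
    addr = ipaddress.IPv4Address(destination)
    best_net = None
    best_if = None
    for prefix, interface in NAT64_ROUTES.items():
        net = ipaddress.IPv4Network(prefix)
        if addr in net and (best_net is None or net.prefixlen > best_net.prefixlen):
            best_net, best_if = net, interface
    return best_if  # None means the packet is not a NAT64 candidate


if __name__ == "__main__":
    for host in ("13.144.2.1", "13.144.2.25", "8.8.8.8"):
        print(host, "->", select_uplink(host))
```

Only the eight addresses 13.144.2.24 through 13.144.2.31 take the Dialer50 path; everything else inside 13.0.0.0/8 falls back to Dialer59.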

Before continuing, we verify this configuration. CSR5 has two NAT64-generated static routes to direct traffic towards 13.0.0.0/8 and 13.144.2.24/29 into the NVI. Although not terribly significant, the next-hops are different for each NAT64 route. This differentiation allows NAT64 to know which v6v4 (destination) prefix to use since the “nat64 route” command identified an outgoing interface. The NVI0 is a “multi-access network” in this way since a next-hop is specified along with the outgoing interface.
R5#show ip route 13.0.0.0 255.0.0.0 longer-prefixes | begin Gate
Gateway of last resort is not set
      13.0.0.0/8 is variably subnetted, 2 subnets, 2 masks
S        13.0.0.0/8 [1/0] via 0.0.0.1, NVI0
S        13.144.2.24/29 [1/0] via 0.0.0.2, NVI0

We can see the mapping of IPv4 NAT64 routes to NVI0 next-hops using the show command below. This information serves as input to the IPv6 process that generates the static routes.
R5#show nat64 routes
IPv4 Prefix     Adj. Address  Enabled  Output IF  Global IPv6 Prefix
13.0.0.0/8      0.0.0.1       TRUE     Di59
13.144.2.24/29  0.0.0.2       TRUE     Di50

The IPv6 routing is somewhat difficult on CSR5. We need two static IPv6 routes since CSR5 has two uplinks; the masks on these routes “match” what the NAT64 IPv4 routes were. Dialer59 services traffic inside 13.0.0.0/8, while Dialer50 services traffic inside 13.144.2.24/29. Because each of these uplinks has a different v6v4 (destination) prefix length, the IPv6 routes will differ significantly. Dialer59 is easy since the first 96 bits are the prefix with the last 32 as the IPv4 address, so the route prefix length is 96 (prefix) + 8 (IPv4 mask) = 104. Dialer50’s IPv4 address begins at bit 73 because there are 8 u-bits immediately following the /64 prefix. Because this is complicated, I highlight those bits in pink below. The overall route prefix becomes 64 (prefix) + 8 (u-bits) + 29 (IPv4 mask) = 101. Following the route addition, the v4v6 (source) prefix must be redistributed into OSPFv3; this was required on CSR3 and CSR4 as well. This will allow CSR9 to learn the prefix in order to route returning traffic to CSR5. CSR10 has not yet been configured so we do not expect to have connectivity to 13.144.2.24/29 at this time.
! CSR5
ipv6 route 3034:BEEF:2BAD::D00:0/104 Dialer59
ipv6 route 3005:DEED:567A:0:000D:9002:1800:0/101 Dialer50
ipv6 prefix-list PL_V4V6 seq 5 permit 5000:AAAA::/64
route-map RM_STATIC_TO_OSPF permit 10
 match ipv6 address prefix-list PL_V4V6
router ospfv3 9
 address-family ipv6 unicast
  redistribute static route-map RM_STATIC_TO_OSPF
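Since the bit arithmetic is easy to get wrong by hand, here is a short Python sketch that derives both static routes. It assumes the RFC 6052 address layout (IPv4 bits follow the prefix, skipping the u-octet at bits 64 through 71) and uses helper names of my own invention; treat it as a calculator for checking the manual work rather than anything the router runs.

```python
import ipaddress

RFC6052_LENGTHS = (32, 40, 48, 56, 64, 96)


def embed_ipv4(prefix: str, ipv4: str) -> ipaddress.IPv6Address:
    """Embed an IPv4 address into a NAT64 prefix, skipping the u-octet (byte 8)."""
    net = ipaddress.IPv6Network(prefix)
    if net.prefixlen not in RFC6052_LENGTHS:
        raise ValueError("prefix length must be one of %s" % (RFC6052_LENGTHS,))
    out = bytearray(net.network_address.packed)
    pos = net.prefixlen // 8
    for octet in ipaddress.IPv4Address(ipv4).packed:
        if pos == 8:  # bits 64-71 are the u-bits and stay zero
            pos += 1
        out[pos] = octet
        pos += 1
    return ipaddress.IPv6Address(bytes(out))


def nat64_static_route(prefix: str, ipv4_network: str) -> ipaddress.IPv6Network:
    """Return the IPv6 route that covers an IPv4 network inside a v6v4 prefix."""
    net4 = ipaddress.IPv4Network(ipv4_network)
    prefix_len = ipaddress.IPv6Network(prefix).prefixlen
    route_len = prefix_len + net4.prefixlen
    # When the masked IPv4 bits extend past bit 63, they also span the 8 u-bits.
    if prefix_len <= 64 and route_len > 64:
        route_len += 8
    base = embed_ipv4(prefix, str(net4.network_address))
    return ipaddress.IPv6Network((base, route_len))


if __name__ == "__main__":
    print(nat64_static_route("3034:BEEF:2BAD::/96", "13.0.0.0/8"))
    # 3034:beef:2bad::d00:0/104
    print(nat64_static_route("3005:DEED:567A::/64", "13.144.2.24/29"))
    # 3005:deed:567a:0:d:9002:1800:0/101
```

Both results agree with the static routes configured on CSR5, which is a useful sanity check before committing them.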

We quickly verify that the redistribution and IPv6 route additions were successful. CSR5 redistributes the v4v6 (source) prefix into OSPFv3 and CSR9 learns it as an OSPF E2 route. The manual static routes remain in CSR5’s local IPv6 RIB to forward traffic towards the ISPs.
R5#show ospfv3 ipv6 rib redistribution
OSPFv3 9 address-family ipv6 (router-id 192.168.5.5)
5000:AAAA::/64, type 2, metric 20, tag 0, from static
  via ::43, NVI0
R5#show ipv6 route 3034:BEEF:2BAD::D00:0/104 longer-prefixes | begin Appl
       a - Application
S   3034:BEEF:2BAD::D00:0/104 [1/0]
     via Dialer59, directly connected
S   3005:DEED:567A:0:D:9002:1800:0/101 [1/0]
     via Dialer50, directly connected
R9#show ipv6 route 5000:AAAA::/64
Routing entry for 5000:AAAA::/64
  Known via "ospf 9", distance 110, metric 20, type extern 2
  Redistributing via isis 1112
  Route count is 1/1, share count 0
  Routing paths:
    FE80::5, Virtual-Access2.3
      Last updated 00:01:21 ago

In order for CSR9 to perform NAT64 against traffic from CSR5, we must update CSR9 to account for the new v4v6 (source) prefix 5000:AAAA::/64 that is used by CSR5. CSR9 has two public, post-NAT IPv4 addresses, and traffic from CSR5 must be mapped to one of them; I simply extend one of the NAT64 ACLs to include CSR5’s source prefix. This will cause traffic from CSR5 to use 209.19.85.9 just as CSR3 does.
! CSR9
ipv6 access-list ACL_NAT_FROM_3
 sequence 20 permit ipv6 5000:AAAA::/64 3034:BEEF:2BAD::/96

There is one minor clean-up we should perform on CSR9 before continuing. Currently, CSR9 is redistributing OSPFv3 into ISIS as this was required for raw IPv6 Internet reachability, beyond the scope of NAT entirely. Because CSR9 is learning NAT64 stateless v4v6 (source) prefixes such as 3034:3::/32, 3034:4::/32, and 5000:AAAA::/64, we should apply a filter to prevent these from leaking to the Internet. Normally these prefixes would be unique local addresses from FC00::/7, but I used public IPv6 prefixes for demonstration. I apply a simple route-map that does not include external routes; only OSPF internal routes are candidates for redistribution. Checking the local ISIS LSP, we can see that the NAT64 v4v6 (source) prefixes are not included. These PE-CE transit links should never leak to the Internet as their only purpose is to support the NAT464 architecture.
! CSR9
route-map RM_OSPF_TO_ISIS_V6 permit 10
 match route-type internal
router isis 1112
 address-family ipv6
  redistribute ospf 9 route-map RM_OSPF_TO_ISIS_V6
R9#show isis database detail R9.00-00 | begin IPv6 Add
  IPv6 Address: 2001:10:9:12::9
  Metric: 0 IPv6 (MT-IPv6) 2001:192:168:34::/64
  Metric: 0 IPv6 (MT-IPv6) 2001:192:168:5::/64

To test this new NAT64 configuration, I begin sending pings from XRv4 to 13.144.2.1 which succeed (not shown). This will match the 13.0.0.0/8 NAT64 route on CSR5 and should be sent towards CSR9. Since we have never used a /64 for the source prefix before, I use EPC on CSR5 to see exactly where the IPv4 address is encoded. The source addresses are yellow and the destination addresses are green. The destination IPv4 address occupies the last 32 bits of the /96 v6v4 prefix assigned to Dialer59, as before. The source IPv4 address is encoded into the /64 v4v6 prefix, yet it appears 1 byte (8 bits) after the /64 prefix ends; that byte is colored grey. These are the u-bits described earlier, and in this case, they are set to zero. This separates the IPv4 address from the NAT64 v4v6 prefix which makes it a little harder to read, but otherwise has no significance at present. Notice that CSR5 includes the IPv6 fragmentation header (pink) since we did not explicitly disable it as we did on CSR3.
R5#show monitor capture CAP buffer detail
2 518 0.011993 192.168.5.14 -> 13.144.2.1 ICMP
0000: 005056A9 DC630050 56A9862A 81000DE2 .PV..c.PV..*....
0010: 08004500 01F40000 0000FF01 E4C1C0A8 ..E.............
0020: 050E0D90 02010800 D5C8C09C 0000ABCD ................
0030: ABCDABCD ABCDABCD ABCDABCD ABCDABCD ................
3 554 0.011993 00:50:56:A9:DC:63 -> 00:50:56:A9:D6:72 PPPoE Session Stage
0000: 005056A9 D6720050 56A9DC63 81000DE7 .PV..r.PV..c....
0010: 88641100 00070212 00576000 000001E8 .d.......W`.....
0020: 2CFE5000 AAAA0000 000000C0 A8050E00 ,.P.............
0030: 00003034 BEEF2BAD 00000000 00000D90 ..04..+.........
0040: 02013A00 00000000 00008000 7FDBC09C ..:.............

Checking CSR9, we can see a stateful translation entry for this ICMP flow. The “Original IPv6” address is the most interesting here since this clearly shows the u-bit offset that EPC revealed above. The IPv4 address is still encoded correctly but it doesn’t “look nice”. Just before the “c0” in this address are two zeroes to represent those u-bits; the router doesn’t show these leading zeroes for brevity. Also notice that the “Translated IPv6” address is the same post-NAT address used when traffic comes from CSR3 (green); this is the result of modifying the “FROM_3” ACL.
R9#show nat64 translations
Proto  Original IPv4    Translated IPv4
       Translated IPv6  Original IPv6
---------------------------------------------------------------------------
icmp   13.144.2.1:1     [3034:beef:2bad::d90:201]:45212
       209.19.85.9:1    [5000:aaaa::c0:a805:e00:0]:45212
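To make addresses like these easier to read, the embedded IPv4 address can be pulled back out in software. The sketch below is my own helper, assuming the RFC 6052 layout with the u-octet at byte 8, and it decodes three of the IPv6 addresses seen in the translation tables so far.

```python
import ipaddress


def extract_ipv4(prefix: str, ipv6: str) -> ipaddress.IPv4Address:
    """Recover the IPv4 address embedded in an IPv6 address under a NAT64 prefix."""
    net = ipaddress.IPv6Network(prefix)
    addr = ipaddress.IPv6Address(ipv6)
    if addr not in net:
        raise ValueError(f"{addr} is not inside {net}")
    raw = addr.packed
    picked = []
    pos = net.prefixlen // 8
    while len(picked) < 4:
        if pos != 8:  # byte 8 holds the u-bits and carries no IPv4 data
            picked.append(raw[pos])
        pos += 1
    return ipaddress.IPv4Address(bytes(picked))


if __name__ == "__main__":
    # /64 v4v6 prefix from CSR5: the u-octet sits between prefix and IPv4 bits.
    print(extract_ipv4("5000:AAAA::/64", "5000:aaaa::c0:a805:e00:0"))      # 192.168.5.14
    # /32 v4v6 prefix from CSR3: IPv4 bits end before the u-octet.
    print(extract_ipv4("3034:3::/32", "3034:3:c0a8:220e::"))               # 192.168.34.14
    # /96 v6v4 prefix: IPv4 bits are simply the last 32 bits.
    print(extract_ipv4("3034:BEEF:2BAD::/96", "3034:beef:2bad::d90:201"))  # 13.144.2.1
```

The first result matches the 192.168.5.14 source seen in the EPC capture on CSR5, which confirms that the odd-looking “c0:a805:e00” grouping is just the u-bit offset at work.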

CSR5 will not be able to send traffic to 13.144.2.24/29 until CSR10 is configured. We quickly configure CSR10 with a straightforward stateful NAT64 configuration similar to CSR9. In the NAT444 lab, CSR10 was running VRF-lite since NAT44 is VRF-aware. NAT64 is not currently VRF-aware, so all VRFs are removed from CSR10 and only the global table is used (VRF removal and subsequent cleanup not shown). CSR10 will match traffic using the prefixes defined by CSR5’s stateless NAT64 process and map those to its single public IPv4 address of 209.19.85.10. This pool spawns a static route to the NVI and must be redistributed into IS-IS so upstream routers can forward returning traffic back towards the LSN.
! CSR10
ipv6 access-list ACL_NAT
 permit ipv6 5000:AAAA::/64 3005:DEED:567A::/64
nat64 v4 pool NAT64_POOL 209.19.85.10 209.19.85.10
nat64 v6v4 list ACL_NAT pool NAT64_POOL overload
nat64 prefix stateful 3005:DEED:567A::/64
ip prefix-list PL_PUBLIC seq 5 permit 209.19.85.10/32
route-map RM_STATIC_TO_ISIS permit 10
 match ip address prefix-list PL_PUBLIC
router isis 1112
 redistribute static ip route-map RM_STATIC_TO_ISIS

First, we check to ensure CSR10 creates the IPv4 static route for the NAT64 pool (pointing to the NVI) and redistributes it into IS-IS. This ensures that the Internet routers will have reachability to this post-NAT public address on the LSN.
R10#show ip route 209.19.85.10
Routing entry for 209.19.85.10/32
  Known via "static", distance 1, metric 0
  Redistributing via isis 1112
  Advertised by isis 1112 metric-type internal level-2 route-map RM_STATIC_TO_ISIS
  Routing Descriptor Blocks:
  * 100.0.0.2, via NVI0
      Route metric is 0, traffic share count is 1
R10#show isis database detail R10.00-00 | include 209
  Metric: 0 IP 209.19.85.10/32
RP/0/0/CPU0:XRv3#show cef 209.19.85.10/32
209.19.85.10/32, version 494, internal 0x5000001 0x0 (ptr 0xa1413ef4) [1], 0x0 (0x0), 0x0 (0x0)
 local adjacency 10.11.13.11
 Prefix Len 32, traffic index 0, precedence n/a, priority 4
   via 10.11.13.11, 2 dependencies, recursive, bgp-ext [flags 0x6020]
    path-idx 0 NHID 0x0 [0xa1414574 0x0]
    next hop 10.11.13.11 via 10.11.13.11/32

Next, we ensure that the NAT64 stateful prefix has a corresponding static route directing IPv6 traffic into the NVI. This captures traffic from CSR5 by matching its v6v4 (destination) stateless NAT64 prefix.
R10#show ipv6 route 3005:DEED:567A::/64
Routing entry for 3005:DEED:567A::/64
  Known via "static", distance 1, metric 0
  Route count is 1/1, share count 0
  Routing paths:
    ::100.0.0.1, NVI0
      Last updated 00:13:41 ago

Because there is no good way for CSR10 to configure a static route towards CSR5’s v4v6 (source) prefix, I use EIGRPv6 on CSR5 to advertise it dynamically. CSR6, CSR7, and CSR10 were already using EIGRPv6 for this purpose so I extend it to CSR5 as well. It is not directly related to NAT64, but CSR10 must have a route back to 5000:AAAA::/64. Since CSR5 had to redistribute this static route into OSPFv3 for CSR9, the prefix-list/route-map constructs are already in place. EIGRPv6 can simply reuse them even if the route-map name is unintuitive. After the static route is redistributed into EIGRP, we check the topology locally on CSR5 to verify. I also verify that CSR10 learns this as an EIGRP external route, which allows it to send return traffic from the IPv4 Internet back to CSR5. We confirm that the virtual-access interface actually belongs to CSR5 by checking the PPPoE and interface MAC address details.
! CSR5
router eigrp LSN
 address-family ipv6 unicast autonomous-system 10
  topology base
   redistribute static route-map RM_STATIC_TO_OSPF
R5#show eigrp address-family ipv6 topology 5000:AAAA::/64
EIGRP-IPv6 VR(LSN) Topology Entry for AS(10)/ID(192.168.5.5) for 5000:AAAA::/64
  State is Passive, Query origin flag is 1, 1 Successor(s), FD is 12030537142
  Descriptor Blocks:
  ::43, from Rstatic, Send flag is 0x0
      Composite metric is (12030537142/0), route is External
      Vector metric:
        Minimum bandwidth is 56 Kbit
        Total delay is 5000000000 picoseconds
        Reliability is 255/255
        Load is 1/255
        Minimum MTU is 9216
        Hop count is 0
      External data:
        Originating router is 192.168.5.5 (this system)
        AS number of route is 0
        External protocol is Static, external metric is 0
        Administrator tag is 0 (0x00000000)
R10#show ipv6 route 5000:AAAA::/64
Routing entry for 5000:AAAA::/64
  Known via "eigrp 10", distance 170, metric 145188571, type external
  Route count is 1/1, share count 0
  Routing paths:
    FE80::21E:14FF:FEFF:9B00, Virtual-Access2.2
      Last updated 00:07:16 ago
R10#show pppoe session
     3 sessions in LOCALLY_TERMINATED (PTA) State
     3 sessions total
Uniq ID  PPPoE  RemMAC          Port       VT   VA     State
           SID  LocMAC                          VA-st  Type
      6      6  0050.56a9.ea77  Gi2.556    567  Vi2.1  PTA
                0050.56a9.f961  VLAN:3556       UP
     10     10  0050.56a9.dc63  Gi2.556    567  Vi2.2  PTA
                0050.56a9.f961  VLAN:3556       UP
     11     11  0050.56a9.de0d  Gi2.556    567  Vi2.3  PTA
                0050.56a9.f961  VLAN:3556       UP
R5#show interfaces gigabitEthernet2 | include bia
  Hardware is CSR vNIC, address is 0050.56a9.dc63 (bia 0050.56a9.dc63)

To protect the IPv6 Internet from these “bogus” v4v6 (source) NAT64 prefixes, I use the same filtering technique on CSR10 as I did on CSR9 except using a prefix-list. The “route-type” matching inside the route-map does not appear to work for EIGRPv6 at this time. This configuration ensures that the NAT64 prefixes are not redistributed from EIGRP into IS-IS.
! CSR10
ipv6 prefix-list PL_V4V6 seq 5 permit 5000:AAAA::/64
route-map RM_EIGRP_TO_ISIS_V6 deny 10
 match ipv6 address prefix-list PL_V4V6
route-map RM_EIGRP_TO_ISIS_V6 permit 1000
router isis 1112
 address-family ipv6
  redistribute eigrp 10 route-map RM_EIGRP_TO_ISIS_V6
R10#show isis database detail R10.00-00 | begin IPv6 Add
  IPv6 Address: 2001:10:10:11::10
  Metric: 0 IPv6 (MT-IPv6) 2001:192:168:5::/64
  Metric: 0 IPv6 (MT-IPv6) 2001:192:168:6::/64
  Metric: 0 IPv6 (MT-IPv6) 2001:192:168:7::/64

At this point, XRv4 should have reachability to 13.144.2.25 which is a host address on the Internet. This traffic should be routed via CSR10. The source will be mapped into v4v6 prefix 5000:AAAA::/64 and the destination will be mapped into v6v4 prefix 3005:DEED:567A::/64 associated with Dialer50. CEF directs traffic for 13.144.2.24/29 to the NVI using address 0.0.0.2, which is tied to Dialer50.
R5#show ip cef 13.144.2.25
13.144.2.24/29
  nexthop 0.0.0.2 NVI0
R5#show nat64 route
IPv4 Prefix     Adj. Address  Enabled  Output IF  Global IPv6 Prefix
13.0.0.0/8      0.0.0.1       TRUE     Di59
13.144.2.24/29  0.0.0.2       TRUE     Di50
R5#show nat64 prefix stateless v4v6 global
Global v4v6 Stateless Prefix: is valid, 5000:AAAA::/64
IFs Using Global Prefix
   Gi2.554 Di50 Di59
R5#show nat64 prefix stateless v6v4 interfaces
v6v4 Stateless Prefixes
Interface  NAT64 Enabled  Global  Prefix
Dialer50   TRUE           FALSE   3005:DEED:567A::/64
Dialer59   TRUE           FALSE   3034:BEEF:2BAD::/96

After NAT64 occurs, the destination IPv6 address will fit within a relatively specific route that sends traffic for the IPv4 prefix 13.144.2.24/29 out Dialer50. Without this route, traffic would loop back into the NVI and be dropped. These are the static routes we verified earlier but are shown again for completeness.
R5#show ipv6 route 3005:DEED:567A::/64 longer-prefixes | begin Appl
       a - Application
S   3005:DEED:567A::/64 [1/0]
     via ::44, NVI0
S   3005:DEED:567A:0:D:9002:1800:0/101 [1/0]
     via Dialer50, directly connected

Using EPC on CSR5, we can see the incoming LAN IPv4 and outgoing WAN IPv6 packet. Both the IPv6 source and destination include the u-bits, shown in grey, since their prefix lengths are shorter than /96. Even the /32 prefixes included the u-bits, but it was less significant since the IPv4 addresses were encoded first. If the prefix length is less than or equal to 64, the u-bits will always exist. If they exist, u-bits are always in the same place as described earlier. Source addresses are shown in yellow with destination addresses in green, as usual. For consistency, the PPPoE encapsulation is shown in cyan and the IPv6 fragmentation header is shown in pink, but neither one is relevant to this discussion.
R5#show monitor capture CAP buffer detailed
2 518 2.615966 192.168.5.14 -> 13.144.2.25 ICMP
0000: 005056A9 DC630050 56A9862A 81000DE2 .PV..c.PV..*....
0010: 08004500 01F40000 0000FF01 E4A9C0A8 ..E.............
0020: 050E0D90 02190800 D5C8C09C 0000ABCD ................
0030: ABCDABCD ABCDABCD ABCDABCD ABCDABCD ................
3 554 2.615966 00:50:56:A9:DC:63 -> 00:50:56:A9:F9:61 PPPoE Session Stage
0000: 005056A9 F9610050 56A9DC63 81000DE4 .PV..a.PV..c....
0010: 88641100 000A0212 00576000 000001E8 .d.......W`.....
0020: 2CFE5000 AAAA0000 000000C0 A8050E00 ,.P.............
0030: 00003005 DEED567A 0000000D 90021900 ..0...Vz........
0040: 00003A00 00000000 00008000 9BC0C09C ..:.............

Quickly checking the statistics, we can see 5 packets being translated v4-to-v6 inbound on the LAN and v6-to-v4 inbound on the WAN (return flow). This is an indication that NAT64 is working.
R5#show nat64 statistics interface gig2.554
NAT64 Statistics
Interface Statistics
 GigabitEthernet2.554 (IPv4 configured, IPv6 configured):
   Packets translated (IPv4 -> IPv6)
      Stateless: 5
      Stateful: 0
      MAP-T: 0
   Packets translated (IPv6 -> IPv4)
      Stateless: 0
      Stateful: 0
      MAP-T: 0
   Packets dropped: 0
R5#show nat64 statistics interface dialer50
NAT64 Statistics
Interface Statistics
 Dialer50 (IPv4 not configured, IPv6 configured):
   Packets translated (IPv4 -> IPv6)
      Stateless: 0
      Stateful: 0
      MAP-T: 0
   Packets translated (IPv6 -> IPv4)
      Stateless: 5
      Stateful: 0
      MAP-T: 0
   Packets dropped: 0

CSR10 will receive packets towards 3005:DEED:567A:0:D:9002:1900:0 which was the destination seen in the EPC above. This is the “Translated IPv4” address from CSR10’s perspective seen below. The public IPv4 address used comes from the NAT v4 pool and is 209.19.85.10, which is redistributed into ISIS. XRv4 can now reach all Internet hosts using both CSR9 and CSR10.
R10#show nat64 translations
Proto  Original IPv4    Translated IPv4
       Translated IPv6  Original IPv6
---------------------------------------------------------------------------
icmp   13.144.2.25:1    [3005:deed:567a:0:d:9002:1900:0]:57500
       209.19.85.10:1   [5000:aaaa::c0:a805:e00:0]:57500

Assuming that CSR6 is not capable of stateless NAT64, we can use stateful NAT64 with static mappings. Although this does not support address overloading, it does allow us to specify direct mappings between IPv4 and IPv6 addresses. When traffic enters CSR6 destined to 13.144.2.1, CSR6 will encode the destination inside of the already-supported destination prefix 3005:DEED:567A::/64. We have to determine the correct destination IPv6 address manually because CSR10 must be able to decode the embedded IPv4 address from it. CSR10 is already expecting traffic for this stateful prefix as it supports CSR5. We have to expand CSR10’s NAT ACL to account for a new source, CSR6, as well as advertise this new source into EIGRP.
! CSR6
nat64 prefix stateful 6006:FEED:BEAD::/96
nat64 v6v4 static 3005:DEED:567A:0:D:9002:100:0 13.144.2.1
ipv6 prefix-list PL_V4V6 seq 5 permit 6006:FEED:BEAD::/96
route-map RM_STATIC_TO_EIGRP permit 10
 match ipv6 address prefix-list PL_V4V6
router eigrp LSN
 address-family ipv6 unicast autonomous-system 10
  topology base
   redistribute static route-map RM_STATIC_TO_EIGRP

CSR10 should learn this as an EIGRP external route from CSR6. We confirm this using some PPPoE/ARP show commands before continuing. Without this route, CSR10 would not be able to deliver returning traffic to CSR6.
R10#show ipv6 route 6006:FEED:BEAD::/96
Routing entry for 6006:FEED:BEAD::/96
  Known via "eigrp 10", distance 170, metric 145188571, type external
  Route count is 1/1, share count 0
  Routing paths:
    FE80::21E:BDFF:FE69:6200, Virtual-Access2.3
      Last updated 00:09:51 ago
R10#show pppoe session
     3 sessions in LOCALLY_TERMINATED (PTA) State
     3 sessions total
Uniq ID  PPPoE  RemMAC          Port       VT   VA     State
           SID  LocMAC                          VA-st  Type
      6      6  0050.56a9.ea77  Gi2.556    567  Vi2.1  PTA
                0050.56a9.f961  VLAN:3556       UP
     10     10  0050.56a9.dc63  Gi2.556    567  Vi2.2  PTA
                0050.56a9.f961  VLAN:3556       UP
     11     11  0050.56a9.de0d  Gi2.556    567  Vi2.3  PTA
                0050.56a9.f961  VLAN:3556       UP
R6#show interfaces gigabitEthernet2 | include bia
  Hardware is CSR vNIC, address is 0050.56a9.de0d (bia 0050.56a9.de0d)

We will expand the NAT ACL to include the new source, and adjust the EIGRP-to-ISIS redistribution to prevent this “bogus” route from polluting the IPv6 Internet. This is the same procedure we configured earlier so that CSR10 could support CSR5. I show the full PL and ACL configuration for completeness; there is one line for each supported CPE router at this point. CSR10’s logic hardly changes. After updating the NAT ACL, traffic from 6006:FEED:BEAD::/96 will be a candidate for NAT using 209.19.85.10 as the post-NAT public address. In this way, both the CPE and LSN are technically using stateful NAT64.
! CSR10
ipv6 prefix-list PL_V4V6 seq 5 permit 5000:AAAA::/64
ipv6 prefix-list PL_V4V6 seq 10 permit 6006:FEED:BEAD::/96
ipv6 access-list ACL_NAT
 permit ipv6 5000:AAAA::/64 3005:DEED:567A::/64
 permit ipv6 6006:FEED:BEAD::/96 3005:DEED:567A::/64


When traffic enters CSR6 towards 13.144.2.1, it is directed to the NVI via a static route. This was installed by NAT64 when the static mapping was added. The route is a /32 host route so no other Internet destinations are included.
R6#show ip cef 13.144.2.1
13.144.2.1/32
  nexthop 100.0.0.1 NVI0

After NAT64 occurs, the resulting IPv6 packet is forwarded to CSR10 by following the default route. This route was originated by CSR10 inside EIGRPv6 from the last lab, and CSR6 will rely on it here. Unlike stateless NAT64, there is no static NVI route created for this destination IPv6 address; there is no concept of a v4v6 (source) prefix with static mappings. The source address used comes from the stateful NAT prefix 6006:FEED:BEAD::/96 defined above. CSR6 does install a static route for this prefix since the returning traffic must be directed into the NVI for translation back to IPv4.
R6#show ipv6 cef 3005:DEED:567A:0:D:9002:100:0
::/0
  nexthop FE80::21E:14FF:FE6F:8300 Dialer6
R6#show ipv6 route 6006:FEED:BEAD::/96
Routing entry for 6006:FEED:BEAD::/96
  Known via "static", distance 1, metric 0
  Redistributing via eigrp 10
  Route count is 1/1, share count 0
  Routing paths:
    ::100.0.0.1, NVI0
      Last updated 00:19:27 ago

Since CSR6 has a static, stateful mapping, it will appear in the NAT64 translation table even when no traffic is flowing. This behavior is consistent with NAT44 as this static entry works in either direction. These are both effectively “destination” addresses when examined from an upstream traffic flow perspective. Traffic to 13.144.2.1 has its destination changed to the “Original IPv6” address and its source changed to the NAT64 stateful prefix encoded with the original IPv4 source address (which varies).
R6#show nat64 translations
Proto  Original IPv4    Translated IPv4
       Translated IPv6  Original IPv6
---------------------------------------------------------------------------------
       13.144.2.1       3005:deed:567a:0:d:9002:100:0

When XRv4 sends traffic to 13.144.2.1, CSR6 translates it as shown below. The “Original IPv4” address is the original source, which is XRv4. This is encoded into the last 32 bits of the NAT64 /96 stateful prefix as examined many times thus far. The “destination” IPv4/IPv6 addresses are fixed and do not change, with the exception of a port number used for demultiplexing.
R6#show nat64 translations
Proto  Original IPv4       Translated IPv4
       Translated IPv6     Original IPv6
---------------------------------------------------------------------------------
       13.144.2.1          3005:deed:567a:0:d:9002:100:0
icmp   192.168.6.14:12444  [6006:feed:bead::c0a8:60e]:12444
       13.144.2.1:12444    [3005:deed:567a:0:d:9002:100:0]:12444

Checking CSR10, NAT64 is performed based on traffic entering the router with a destination inside the stateful NAT64 prefix of 3005:DEED:567A::/64. Because CSR6 manually formatted this correctly in the static NAT64 mapping, CSR10 is able to perform dynamic NAT64 by extracting the original IPv4 destination address of 13.144.2.1. The “Original IPv6” address, which carries the encoded original IPv4 source, doesn’t get decoded again. It is only used for mapping returning flows destined to 209.19.85.10 back to the proper downstream customer.
R10#show nat64 translations
Proto  Original IPv4    Translated IPv4
       Translated IPv6  Original IPv6
---------------------------------------------------------------------------
icmp   13.144.2.1:1     [3005:deed:567a:0:d:9002:100:0]:16540
       209.19.85.10:1   [6006:feed:bead::c0a8:60e]:16540

Quickly checking CSR6’s NAT64 statistics, we can see packets being translated bidirectionally on the proper interfaces.
R6#show nat64 statistics interface gig2.564
NAT64 Statistics
Interface Statistics
 GigabitEthernet2.564 (IPv4 configured, IPv6 configured):
   Packets translated (IPv4 -> IPv6)
      Stateless: 0
      Stateful: 5
      MAP-T: 0
   Packets translated (IPv6 -> IPv4)
      Stateless: 0
      Stateful: 0
      MAP-T: 0
   Packets dropped: 0
R6#show nat64 statistics interface dialer6
NAT64 Statistics
Interface Statistics
 Dialer6 (IPv4 not configured, IPv6 configured):
   Packets translated (IPv4 -> IPv6)
      Stateless: 0
      Stateful: 0
      MAP-T: 0
   Packets translated (IPv6 -> IPv4)
      Stateless: 0
      Stateful: 5
      MAP-T: 0
   Packets dropped: 0

With this static NAT64 architecture in place, accessing new Internet hosts is simple. To access XRv3’s remaining 3 interfaces, we simply need to add a static mapping for each. We manually determine the post-NAT IPv6 destination address so that CSR10 can correctly decode the embedded IPv4 address. From XRv4, I ping all 4 destinations and examine the translation table. All of the pings succeed.
! CSR6
nat64 v6v4 static 3005:DEED:567A:0:D:9002:900:0 13.144.2.9
nat64 v6v4 static 3005:DEED:567A:0:D:9002:1100:0 13.144.2.17
nat64 v6v4 static 3005:DEED:567A:0:D:9002:1900:0 13.144.2.25
R6#show nat64 translations protocol icmp
Proto  Original IPv4       Translated IPv4
       Translated IPv6     Original IPv6
---------------------------------------------------------------------------
icmp   192.168.6.14:61596  [6006:feed:bead::c0a8:60e]:61596
       13.144.2.9:61596    [3005:deed:567a:0:d:9002:900:0]:61596
icmp   192.168.6.14:8348   [6006:feed:bead::c0a8:60e]:8348
       13.144.2.25:8348    [3005:deed:567a:0:d:9002:1900:0]:8348
icmp   192.168.6.14:53404  [6006:feed:bead::c0a8:60e]:53404
       13.144.2.1:53404    [3005:deed:567a:0:d:9002:100:0]:53404
icmp   192.168.6.14:156    [6006:feed:bead::c0a8:60e]:156
       13.144.2.17:156     [3005:deed:567a:0:d:9002:1100:0]:156
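Working out each static mapping by hand gets tedious as the host list grows. As a rough aid, the Python sketch below generates the “nat64 v6v4 static” lines for a /64 stateful prefix. It relies on the observation that, for a /64, the IPv4 address sits in bytes 9 through 12, which is the same as shifting the 32-bit value left by 24 bits; the function name is hypothetical and the shortcut does not apply to other prefix lengths.

```python
import ipaddress


def static_mapping_line(prefix64: str, ipv4: str) -> str:
    """Build one 'nat64 v6v4 static' command for a /64 NAT64 prefix."""
    net = ipaddress.IPv6Network(prefix64)
    if net.prefixlen != 64:
        raise ValueError("this shortcut only applies to /64 prefixes")
    # 8 prefix bytes + 1 u-octet + 4 IPv4 bytes + 3 suffix bytes (24 bits).
    embedded = int(net.network_address) | (int(ipaddress.IPv4Address(ipv4)) << 24)
    ipv6 = str(ipaddress.IPv6Address(embedded)).upper()
    return f"nat64 v6v4 static {ipv6} {ipv4}"


if __name__ == "__main__":
    for host in ("13.144.2.1", "13.144.2.9", "13.144.2.17", "13.144.2.25"):
        print(static_mapping_line("3005:DEED:567A::/64", host))
```

The four generated lines match the mappings configured on CSR6, so the same loop could be pointed at any additional Internet hosts that need static entries.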

On CSR7, I demonstrate using the well-known prefix (WKP) for stateful NAT64. The WKP prefix is 64:FF9B::/96 and cannot be disabled. Any router with NAT64 enabled, even stateless NAT64, will have a static route directing traffic to this destination into the NVI. CSR7 will use this as the destination IPv6 prefix as CSR10 will automatically direct this into its NVI without us having to configure it. The NAT64 manually-specified prefix configured on CSR7 will be used as the source address when NAT64 generates IPv6 packets. The basic NVI static route redistribution into EIGRP is shown again which is similar to CSR6. ! CSR7 nat64 prefix stateful 7007:FEED:BEAD::/96 nat64 v6v4 static 64:FF9B::D90:201 13.144.2.1 ipv6 prefix-list PL_V4V6 seq 5 permit 7007:FEED:BEAD::/96 route-map RM_STATIC_TO_EIGRP permit 10 match ipv6 address prefix-list PL_V4V6

2029 © 2016 Nicholas J. Russo

router eigrp LSN
 address-family ipv6 unicast autonomous-system 10
  topology base
   redistribute static route-map RM_STATIC_TO_EIGRP

CSR7 must define a longer-match static route to ensure traffic towards this destination isn’t looped into its local NVI. Since CSR7 is running NAT64 also, we need to totally ignore the 64:FF9B::/96 prefix from a routing perspective. A /97 is not very specific, but it’s good enough for this test case.

! CSR7
ipv6 route 64:FF9B::/97 Dialer7

R7#show ipv6 route 64:FF9b::/96 longer-prefixes | begin Appl
       a - Application
S   64:FF9B::/96 [1/0]
     via ::100.0.0.2, NVI0
S   64:FF9B::/97 [1/0]
     via Dialer7, directly connected
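A small Python sketch can show why the /97 is "good enough" here. It models longest-prefix matching over the two static routes on CSR7; the route table dictionary and function name are illustrative, not anything IOS-XE exposes. Because a /97 under the WKP fixes only the first bit of the embedded IPv4 address, it covers embedded addresses from 0.0.0.0 through 127.255.255.255, which includes the 13.144.2.x targets in this lab but not every possible IPv4 destination.

```python
import ipaddress


def longest_match(routes, destination):
    """Return (network, next_hop) for the most specific matching route, or None."""
    dest = ipaddress.ip_address(destination)
    best = None
    for prefix, next_hop in routes.items():
        network = ipaddress.ip_network(prefix)
        if dest in network and (best is None or network.prefixlen > best[0].prefixlen):
            best = (network, next_hop)
    return best


# The two WKP routes present on CSR7 in the text.
CSR7_ROUTES = {
    "64:ff9b::/96": "NVI0",
    "64:ff9b::/97": "Dialer7",
}

if __name__ == "__main__":
    # Embedded 13.144.2.1: first IPv4 bit is 0, so the /97 wins.
    print(longest_match(CSR7_ROUTES, "64:ff9b::d90:201"))
    # Embedded 192.168.1.1 (a hypothetical target): first IPv4 bit is 1,
    # so only the /96 towards the local NVI matches.
    print(longest_match(CSR7_ROUTES, "64:ff9b::c0a8:101"))
```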

CSR10 must expand its list of “bogus” IPv6 prefixes to filter from ISIS, along with its NAT ACL. The NAT ACL now matches traffic from CSR7 towards the WKP. Notice that CSR10 does not have to define a new NAT64 prefix, since 64:FF9B::/96 is automatically created. The IPv6 static route for the WKP towards the NVI is automatically generated as soon as NAT64 is enabled, as we saw on CSR7 also.

! CSR10
ipv6 prefix-list PL_V4V6 seq 5 permit 5000:AAAA::/64
ipv6 prefix-list PL_V4V6 seq 10 permit 6006:FEED:BEAD::/96
ipv6 prefix-list PL_V4V6 seq 15 permit 7007:FEED:BEAD::/96
ipv6 access-list ACL_NAT
 permit ipv6 5000:AAAA::/64 3005:DEED:567A::/64
 permit ipv6 6006:FEED:BEAD::/96 3005:DEED:567A::/64
 permit ipv6 7007:FEED:BEAD::/96 64:FF9B::/96

R10#show ipv6 route 64:FF9B::/96
Routing entry for 64:FF9B::/96
  Known via "static", distance 1, metric 0
  Route count is 1/1, share count 0
  Routing paths:
    ::100.0.0.2, NVI0
      Last updated 00:44:27 ago

After CSR7’s redistribution of its local NAT64 prefix into EIGRP, CSR10 now has a path back to CSR7’s NAT64 source prefix. We confirm this using the PPPoE and interface-level show commands.

R10#show ipv6 route 7007:FEED:BEAD::/96
Routing entry for 7007:FEED:BEAD::/96
  Known via "eigrp 10", distance 170, metric 145188571, type external
  Route count is 1/1, share count 0
  Routing paths:
    FE80::21E:49FF:FECA:A400, Virtual-Access2.2
      Last updated 00:00:14 ago

R10#show pppoe session
     3 sessions in LOCALLY_TERMINATED (PTA) State
     3 sessions total

Uniq ID  PPPoE  RemMAC          Port        VT   VA      State
         SID    LocMAC                           VA-st   Type
8        8      0050.56a9.ea77  Gi2.556     567  Vi2.2   PTA
                0050.56a9.f961  VLAN:3556        UP
7        7      0050.56a9.dc63  Gi2.556     567  Vi2.1   PTA
                0050.56a9.f961  VLAN:3556        UP
9        9      0050.56a9.de0d  Gi2.556     567  Vi2.3   PTA
                0050.56a9.f961  VLAN:3556        UP

R7#show interfaces gigabitEthernet 2 | include bia
  Hardware is CSR vNIC, address is 0050.56a9.ea77 (bia 0050.56a9.ea77)

On CSR7, we can see the new static translation. Tracing the routing logic, incoming IPv4 packets to 13.144.2.1 are sent to the NVI. Outgoing IPv6 packets match the /97 route via Dialer7 towards CSR10. We verified CSR10’s route to the Internet earlier, as well as its local WKP route to the NVI for incoming IPv6 packets.

R7#show nat64 translations
Proto  Original IPv4         Translated IPv4
       Translated IPv6       Original IPv6
---------------------------------------------------------------------------------
       13.144.2.1            64:ff9b::d90:201
Total number of translations: 1

R7#show ip cef 13.144.2.1
13.144.2.1/32
  nexthop 100.0.0.1 NVI0

R7#show ipv6 cef 64:ff9b::d90:201
64:FF9B::/97
  attached to Dialer7

Just like CSR6, this works without issue. CSR7 translates private IPv4 into transitory IPv6, then CSR10 translates transitory IPv6 into public IPv4. I highlight the private and public IPv4 source addresses, which is the primary use-case of NAT464 for LSN in general. I don’t verify all the statistics again since this logic is identical to CSR6, except that it uses the WKP.

R7#show nat64 translations
Proto  Original IPv4         Translated IPv4
       Translated IPv6       Original IPv6
---------------------------------------------------------------------------------
       13.144.2.1            64:ff9b::d90:201
icmp   192.168.7.14:45212    [7007:feed:bead::c0a8:70e]:45212
       13.144.2.1:45212      [64:ff9b::d90:201]:45212

R10#show nat64 translations
Proto  Original IPv4         Translated IPv4
       Translated IPv6       Original IPv6
---------------------------------------------------------------------------
icmp   13.144.2.1:1          [64:ff9b::d90:201]:45212
       209.19.85.10:1        [7007:feed:bead::c0a8:70e]:45212

Rather than enumerate the remaining Internet mappings on CSR7 as we did on CSR6, we will use TCP/UDP specific mappings for variety. Starting with TCP, we create a mapping for telnet. This will give XRv4 telnet access to XRv3, but nothing else. Because I use port 23 for both the IPv4 and IPv6 components, the number is effectively passed through. This allows CSR10 to maintain the destination port 23 when it performs NAT64-overload. I use IPv4 address 13.144.2.9 on XRv3 for variety.

! CSR7
nat64 v6v4 static tcp 64:FF9B::D90:209 23 13.144.2.9 23

R7#show nat64 translations protocol tcp
Proto  Original IPv4         Translated IPv4
       Translated IPv6       Original IPv6
---------------------------------------------------------------------------
tcp    ----
       13.144.2.9:23         [64:ff9b::d90:209]:23

When the telnet flow begins, CSR7 creates a new state entry for the specific flow, which includes the source IPv4 address and source port. The source port is carried over exactly (it cannot be modified), and the destination port is likewise preserved due to our configuration. The flow appears to be coming from CSR7 towards the WKP destination shown below.

R7#show nat64 translations protocol tcp
Proto  Original IPv4         Translated IPv4
       Translated IPv6       Original IPv6
---------------------------------------------------------------------------
tcp    ----
       13.144.2.9:23         [64:ff9b::d90:209]:23
tcp    192.168.7.14:29960    [7007:feed:bead::c0a8:70e]:29960
       13.144.2.9:23         [64:ff9b::d90:209]:23

This WKP destination is routed to CSR10’s NVI. The IPv4 address is decoded from the packet and the TCP port number is maintained. Due to overloading, the original source port is rewritten to some demultiplexing value that CSR10 selects; it is 1025 in this case. The final IPv4 packet is highlighted. XRv3 correctly acknowledges this as a telnet session from 209.19.85.10 using port 1025, as shown in its TCP table.

R10#show nat64 translations protocol tcp
Proto  Original IPv4         Translated IPv4
       Translated IPv6       Original IPv6
---------------------------------------------------------------------------
tcp    13.144.2.9:23         [64:ff9b::d90:209]:23
       209.19.85.10:1025     [7007:feed:bead::c0a8:70e]:29960

RP/0/0/CPU0:XRv3#show tcp brief | include :23
0x1014721c 0x60000000      0      0  13.144.2.9:23      209.19.85.10:1025      ESTAB

After clearing the NAT64 statistics on CSR7, I sent exactly 8 ICMP echo packets from XRv4 to XRv3. The ping fails since CSR7 has no matching translation entry for this protocol. This is the expected behavior of using port/protocol specific static mappings.

RP/0/0/CPU0:XRv4#ping vrf 7 13.144.2.9 count 8 timeout 1
Type escape sequence to abort.
Sending 8, 100-byte ICMP Echos to 13.144.2.9, timeout is 1 seconds:
........
Success rate is 0 percent (0/8)

R7#show nat64 statistics interface gig2.574
NAT64 Statistics
Interface Statistics
 GigabitEthernet2.574 (IPv4 configured, IPv6 configured):
  Packets translated (IPv4 -> IPv6)
   Stateless: 0
   Stateful: 0
   MAP-T: 0
  Packets translated (IPv6 -> IPv4)
   Stateless: 0
   Stateful: 0
   MAP-T: 0
  Packets dropped: 8

Next, we will perform the same test with UDP. Using IP SLA, we will create a probe using port 42518 towards 13.144.2.17. IP SLA might be useful to test a custom application across NAT64 devices to an Internet destination. The IP SLA configuration on XRv3 and XRv4 is very basic; the details are described in the IP SLA section. XRv3 is configured to respond to 13.144.2.17:42518 and XRv4 is configured to send traffic to that address/port pair.

! XRv3
ipsla
 responder
  type udp ipv4 address 13.144.2.17 port 42518
! XRv4
ipsla
 operation 7
  type udp echo
   vrf 7
   destination address 13.144.2.17
   control disable
   timeout 3000
   destination port 42518
   frequency 5
 schedule operation 7
  start-time now
  life forever

On CSR7, we use the same logic as we did for TCP. The port number 42518 is maintained across the NAT64 process; otherwise, the configuration is identical.

! CSR7
nat64 v6v4 static udp 64:FF9B::D90:211 42518 13.144.2.17 42518

Checking the translations on CSR7 and CSR10, we can see the IP SLA probe. XRv4 is using source port 35760, which is transparent across the static NAT64 mapping. The destination port is also transparent due to the configuration. CSR10 adjusts the source port since it is overloading a single IP address, but otherwise the logic is similar. Port 42518 is ultimately revealed to XRv3, which can respond correctly.

R7#show nat64 translations protocol udp
Proto  Original IPv4         Translated IPv4
       Translated IPv6       Original IPv6
---------------------------------------------------------------------------
udp    ----
       13.144.2.17:42518     [64:ff9b::d90:211]:42518
udp    192.168.7.14:35760    [7007:feed:bead::c0a8:70e]:35760
       13.144.2.17:42518     [64:ff9b::d90:211]:42518

R10#show nat64 translations protocol udp
Proto  Original IPv4         Translated IPv4
       Translated IPv6       Original IPv6
----------------------------------------------------------------------------
udp    13.144.2.17:42518     [64:ff9b::d90:211]:42518
       209.19.85.10:1024     [7007:feed:bead::c0a8:70e]:35760

To ensure it is working, we check the IP SLA statistics on XRv4. The probe is succeeding and there is bidirectional reachability for this new “custom application”.

RP/0/0/CPU0:XRv4#show ipsla statistics 7 | utility egrep 'Number|return'
    Number of operations attempted: 74
    Number of operations skipped  : 0
    Latest operation return code  : OK

Last, I demonstrate changing the destination port. This can hide the actual ports used by the target from the user. For example, to access 13.144.2.25 via telnet, the user (XRv4) must use port 2300 rather than 23.

! CSR7
nat64 v6v4 static tcp 64:FF9B::D90:219 23 13.144.2.25 2300

R7#show nat64 translations v6 translated 13.144.2.25
Proto  Original IPv4         Translated IPv4
       Translated IPv6       Original IPv6
---------------------------------------------------------------------------
tcp    ----
       13.144.2.25:2300      [64:ff9b::d90:219]:23

The logic is identical to the original telnet test, except XRv4 specifies port 2300 now. CSR7 translates this to destination port 23; CSR10 remains unaware of any changes.

R7#show nat64 translations v6 translated 13.144.2.25
Proto  Original IPv4         Translated IPv4
       Translated IPv6       Original IPv6
---------------------------------------------------------------------------
tcp    ----
       13.144.2.25:2300      [64:ff9b::d90:219]:23
tcp    192.168.7.14:23444    [7007:feed:bead::c0a8:70e]:23444
       13.144.2.25:2300      [64:ff9b::d90:219]:23
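The behavior of the three protocol/port-specific mappings on CSR7 can be summarized with a toy lookup table in Python. This is only a model of the observable behavior described above, not of how IOS-XE implements NAT64 internally; the dictionary and function names are invented for illustration. The key is the IPv4-side protocol, address, and port that XRv4 targets, and the value is the IPv6-side address and port that CSR7 emits.

```python
# Static destination mappings on CSR7, as configured in the text:
# (protocol, IPv4 destination, IPv4 port) -> (IPv6 destination, IPv6 port)
STATIC_MAP = {
    ("tcp", "13.144.2.9", 23): ("64:ff9b::d90:209", 23),
    ("udp", "13.144.2.17", 42518): ("64:ff9b::d90:211", 42518),
    ("tcp", "13.144.2.25", 2300): ("64:ff9b::d90:219", 23),
}


def translate_destination(proto, dst_ip, dst_port=None):
    """Return the IPv6 (address, port) for a flow, or None if it would be dropped."""
    return STATIC_MAP.get((proto, dst_ip, dst_port))


if __name__ == "__main__":
    # Telnet to 13.144.2.9 passes the port through unchanged.
    print(translate_destination("tcp", "13.144.2.9", 23))
    # Telnet to 13.144.2.25 must target port 2300, which becomes 23.
    print(translate_destination("tcp", "13.144.2.25", 2300))
    # ICMP to 13.144.2.9 has no matching entry, so the ping is dropped.
    print(translate_destination("icmp", "13.144.2.9"))
```

Keying on the protocol and port as well as the address is what makes the ping to 13.144.2.9 fail while telnet to the same address succeeds.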

Additional Reading – Reference configurations "lsn-464"

36.3 Dual stack lite (DS-lite)

DS-Lite takes the advantages of NAT464 and simplifies the deployment concept. Because NAT64 may not be supported on low-end CPE routers in residential deployments, DS-Lite is an alternative that allows IPv4 to be encapsulated inside IPv6. Just like the NAT464 model, the CPE-to-LSN links are still IPv6, but the redundancy, performance, and scaling concerns are eliminated since translation is no longer occurring at the CPE. IPv4 traffic tunneled to the LSN inside IPv6 is decapsulated, and then regular NAT44 is performed.

When coupled with NAT44, however, DS-Lite adds a key piece of information. Because NAT44 was not performed on the CPEs, it is possible that the same private addressing could be used by multiple CPEs beneath a single LSN. With NAT444, this is not a concern since the SP-allocated private addressing, which is the outside address for the CPE, would be unique at each CPE. With DS-Lite, no such concept exists. The NAT44 process, when interacting with DS-Lite, adds the source IPv6 address of the DS-Lite tunnel to the NAT state. This mapping provides a destination address for return traffic entering the LSN from the outside, and also differentiates CPE private addressing. Like with NAT444 and NAT464, native IPv6 traffic is routed normally with no additional encapsulation or NAT applied.

This feature is not currently supported on IOS, IOS-XE, or XRv. It is supported on IOS-XR platforms, but only with specific hardware enhancements: the ASR9000 series Integrated Service Module (ISM) for CGN and the CRS Carrier Grade Services Engine (CGSE). A sample configuration from Cisco is shown below, though it cannot be demonstrated on XRv.

interface te0/0/0/0
 ipv6 address 2001:db8:ff00::1/64
!
interface te0/1/0/0
 ipv4 address 192.168.100.1/24
!
interface ServiceApp61
 ipv6 address 2001:db8:1::1/64
 service cgn demo service-type ds-lite
!
interface ServiceApp41
 ipv4 address 192.168.1.1 255.255.255.252
 service cgn demo service-type ds-lite
!
service cgn demo
 service-type ds-lite dslite-1
  map address-pool x.y.z.0/24
  aftr-tunnel-endpoint-address 2001:db8:ffff::1
  !
  address-family ipv4
   interface ServiceApp42
  !
  address-family ipv6
   interface ServiceApp41
  !
!
router static
 address-family ipv4 unicast
  x.y.z.0/24 ServiceApp42
 !
 address-family ipv6 unicast
  2001:db8:ffff::1/128 ServiceApp41
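Since DS-Lite cannot be demonstrated on XRv, a short Python sketch may help show why the tunnel's IPv6 source address must be part of the NAT44 state. This is a conceptual model only, not the CGN implementation: the class name, the sequential port allocation, and all addresses (documentation ranges 2001:db8::/32 and 198.51.100.0/24) are assumptions made for illustration.

```python
import itertools


class DsLiteNat44:
    """Toy NAT44 table whose keys include the CPE's IPv6 tunnel source."""

    def __init__(self, public_ip, first_port=1024):
        self.public_ip = public_ip
        self._ports = itertools.count(first_port)  # naive sequential allocation
        self.outbound = {}  # (cpe_ipv6, proto, private_ip, private_port) -> public_port
        self.inbound = {}   # (proto, public_port) -> (cpe_ipv6, private_ip, private_port)

    def translate_out(self, cpe_ipv6, proto, private_ip, private_port):
        """Map a decapsulated IPv4 flow to the shared public address."""
        key = (cpe_ipv6, proto, private_ip, private_port)
        if key not in self.outbound:
            port = next(self._ports)
            self.outbound[key] = port
            self.inbound[(proto, port)] = (cpe_ipv6, private_ip, private_port)
        return self.public_ip, self.outbound[key]

    def translate_in(self, proto, public_port):
        """Find which tunnel (and private host) return traffic belongs to."""
        return self.inbound.get((proto, public_port))


if __name__ == "__main__":
    nat = DsLiteNat44("198.51.100.1")
    # Two CPEs use the very same private address and source port.
    a = nat.translate_out("2001:db8:a::1", "tcp", "192.168.1.10", 5000)
    b = nat.translate_out("2001:db8:b::1", "tcp", "192.168.1.10", 5000)
    print(a, b)
    # Return traffic is steered to the correct tunnel endpoint.
    print(nat.translate_in("tcp", a[1]))
    print(nat.translate_in("tcp", b[1]))
```

Without the IPv6 address in the key, the two flows would collide on the same table entry and the LSN would have no way to pick the right tunnel for return traffic.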

36.4 IPv6 tunneling over IPv4 networks

There are many techniques for connecting isolated IPv6 networks over IPv4 transport. Several methods are tested below, and they all share a common network diagram. Each subsection is independent from the others, but all rely on the same MPLS network providing L3VPN service for IPv4 only. The MPLS configurations are very basic, with XRv1 as a VPNv4 RR and all LSRs running LDP. There is no PE-CE routing protocol; as in the LISP lab, the PEs just redistribute the connected transit link into BGP and the CEs have a static default IPv4 route pointing towards the PEs. Routers are color-coded in the diagram below to show which tunneling techniques they will use. The focus is on 6RD since the other mechanisms are less applicable to service providers, but they are valid IPv6 transition options nonetheless.

Below is a comparison of different IPv6 tunneling techniques. Some techniques are clearly better than others based on the natural progression of time (e.g., 6RD is a newer variant of 6to4), but each of them has valid use cases. Legacy auto-tunnels are not examined here at all. One thing that all tunnels have in common is they require an IPv4 source address, which is omitted from the chart for brevity.

Protocol/Method: Manual IPv6-in-IPv4
  Uses: P2P tunnel, intra or inter site, manually configured
  Notes: Carries IPv6 only, can run IGPs and BGP

Protocol/Method: GRE
  Uses: P2P tunnel, intra or inter site, manually configured
  Notes: Carries IPv4, IPv6, CLNS, etc. but adds extra 4-byte encapsulation, can run IGPs and BGP

Protocol/Method: 6to4
  Uses: P2MP tunnel, inter site, automatic based on IPv4 address
  Notes: Limited to 2002::/16 prefix, inflexible prefix format, anycast-based relay for Internet access using specific addresses

Protocol/Method: 6RD
  Uses: P2MP tunnel, inter site, automatic based on IPv4 address
  Notes: Flexible prefix format, simplified relay mechanism, can use SP’s own address space

Protocol/Method: ISATAP
  Uses: P2MP tunnel, intra site, automatic based on IPv4 address
  Notes: Can run IGPs (unicast neighbors) and BGP

Additional Reading - Reference configurations "ipv6-tunnels"

36.4.1 GRE / Manual IPv6 tunnels

The simplest method to bridge disparate IPv6 islands is to create a GRE tunnel between sites and run IPv6 inside of it. This is also supported with DMVPN and any other GRE-based technology. XR also supports this method, but none of the others (at least not without special hardware modules). CSR6 and XRv4 form this basic GRE tunnel between one another. The tunnel endpoints have reachability through the L3VPN via IPv4 using default routes. There are no tunnel IPv4 addresses since the customer sites are IPv6 only.

! CSR6
interface GigabitEthernet2.546
 encapsulation dot1Q 3546
 ip address 10.4.6.6 255.255.255.0
ip route 0.0.0.0 0.0.0.0 10.4.6.4
interface Tunnel614
 description BASIC GRE
 ipv6 address FE80::6 link-local
 ipv6 address FD00:10:6:14::6/64
 tunnel source 10.4.6.6
 tunnel destination 10.2.14.14
! XRv4
interface GigabitEthernet0/0/0/0.524
 ipv4 address 10.2.14.14 255.255.255.0
 encapsulation dot1q 3524
router static
 address-family ipv4 unicast
  0.0.0.0/0 10.2.14.2
interface tunnel-ip614
 description BASIC GRE
 ipv6 address fe80::14 link-local
 ipv6 address fd00:10:6:14::14/64
 tunnel source 10.2.14.14
 tunnel destination 10.4.6.6

A quick verification on each router shows that the traffic is carried inside IPv4 between the tunnel source and destination. This is nothing new as we tested this in the GRE section as well.

R6#show interfaces tunnel614 | include Tunnel
Tunnel614 is up, line protocol is up
  Hardware is Tunnel
  Tunnel linestate evaluation up
  Tunnel source 10.4.6.6, destination 10.2.14.14
  Tunnel protocol/transport GRE/IP
  Tunnel TTL 255, Fast tunneling enabled
  Tunnel transport MTU 1476 bytes
  Tunnel transmit bandwidth 8000 (kbps)
  Tunnel receive bandwidth 8000 (kbps)

RP/0/0/CPU0:XRv4#show interfaces tunnel-ip614 | include Tunnel
  Hardware is Tunnel
  Tunnel TOS 0
  Tunnel mode GRE IPV4,
  Tunnel source 10.2.14.14, destination 10.4.6.6
  Tunnel TTL 255

There is no complexity to the BGP configuration either. The snippets are shown below. Dynamic routing is one of the benefits of GRE tunneling that is not supported in 6to4 or 6RD, although one could argue that it isn’t necessary for those models in the first place.

! CSR6
router bgp 614
 no bgp default ipv4-unicast
 neighbor FD00:10:6:14::14 remote-as 614
 address-family ipv6
  network ::10:6:6:6/128
  neighbor FD00:10:6:14::14 activate
! XRv4
router bgp 614
 address-family ipv6 unicast
  network ::10:14:14:14/128
 neighbor fd00:10:6:14::6
  remote-as 614
  address-family ipv6 unicast


From CSR6, we can verify the BGP session is up, then check to ensure CSR6 learns XRv4’s loopback. We also quickly test reachability.

R6#show bgp ipv6 unicast summary | begin ^Neighbor
Neighbor          V   AS  MsgRcvd  MsgSent  TblVer  InQ  OutQ  Up/Down   State/PfxRcd
FD00:10:6:14::14  4   14  22       23       3       0    0     00:18:10  1

R6#show bgp ipv6 unicast | begin Network
     Network            Next Hop           Metric  LocPrf  Weight  Path
 *>  ::10:6:6:6/128     ::                 0               32768   i
 *>i ::10:14:14:14/128  FD00:10:6:14::14   0       100     0       i

R6#ping ::10:14:14:14 source ::10:6:6:6
Type escape sequence to abort.
Sending 5, 100-byte ICMP Echos to ::10:14:14:14, timeout is 2 seconds:
Packet sent with a source address of ::10:6:6:6
!!!!!
Success rate is 100 percent (5/5), round-trip min/avg/max = 5/11/33 ms

As a final data-plane check, we capture packets inbound on CSR2 (PE) to see IPv6 packets encapsulated inside GRE. Notice that the GRE header indicates that the tunneled protocol is IPv6 using 0x86DD. The IPv4 and IPv6 source addresses are highlighted in yellow, where the IPv4 comes first and is the outer encapsulation. IPv4 and IPv6 destination addresses are in green and represent CSR6’s address. The GRE header is in pink, which clearly shows the IPv6 protocol type. Notice that the IPv4 IP protocol is 47 (0x2F, colored cyan) which is regular GRE; there is nothing special for IPv6 in the outer IP header since the GRE header shows the tunneled protocol.

R2#show monitor capture CAP buffer dump
0
0000: 005056A9 BE8A0050 56A9862A 81000DC4   .PV....PV..*....
0010: 08004500 007C0000 4000FF2F 53390A02   ..E..|..@../S9..
0020: 0E0E0A04 06060000 86DD6000 0000003C   ..........`....<
0030: 3A400000 00000000 00000010 00140014   :@..............
0040: 00140000 00000000 00000010 00060006   ................
0050: 00068100 CCAC24C8 00000001 02030405   ......$.........
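Because the color highlighting does not survive in plain text, the same fields can be pulled out of the dump programmatically. The following Python sketch decodes the captured bytes shown above, assuming the frame layout implied by the text: Ethernet with one 802.1Q tag, a 20-byte IPv4 header without options, a 4-byte GRE header, then IPv6. The helper name is illustrative.

```python
import ipaddress
import struct

# Hex columns of the EPC dump from CSR2, copied from the output above.
DUMP = """
0000: 005056A9 BE8A0050 56A9862A 81000DC4
0010: 08004500 007C0000 4000FF2F 53390A02
0020: 0E0E0A04 06060000 86DD6000 0000003C
0030: 3A400000 00000000 00000010 00140014
0040: 00140000 00000000 00000010 00060006
0050: 00068100 CCAC24C8 00000001 02030405
"""


def dump_to_bytes(dump):
    """Strip the offset column and join the hex words into raw bytes."""
    data = bytearray()
    for line in dump.strip().splitlines():
        _, _, words = line.partition(":")
        data += bytes.fromhex(words.replace(" ", ""))
    return bytes(data)


frame = dump_to_bytes(DUMP)

# Ethernet (12 bytes of MACs), then the 802.1Q tag and the real EtherType.
tpid, tci, ethertype = struct.unpack_from("!HHH", frame, 12)
vlan_id = tci & 0x0FFF

# Outer IPv4 header starts after 14 bytes of Ethernet plus the 4-byte tag.
ip4 = frame[18:38]
ip_proto = ip4[9]
outer_src = ipaddress.IPv4Address(ip4[12:16])
outer_dst = ipaddress.IPv4Address(ip4[16:20])

# Basic GRE header: 2 bytes of flags/version, 2 bytes of protocol type.
gre_flags, gre_proto = struct.unpack_from("!HH", frame, 38)

# Inner IPv6 header: source at offset 8, destination at offset 24.
ip6 = frame[42:82]
inner_src = ipaddress.IPv6Address(ip6[8:24])
inner_dst = ipaddress.IPv6Address(ip6[24:40])

if __name__ == "__main__":
    print("VLAN", vlan_id, "IP protocol", ip_proto, "GRE type", hex(gre_proto))
    print("outer", outer_src, "->", outer_dst)
    print("inner", inner_src, "->", inner_dst)
```

The decoded values line up with the description: the outer header carries protocol 47 between the two tunnel endpoints, and the GRE protocol type identifies the IPv6 payload.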

Manual tunneling is not examined here because it is very simple and almost identical to GRE. The difference is that the tunnel mode is changed to “ipv6ip” with no additional options like 6to4 or 6RD. This is similar to “ip-in-ip” tunneling (IP protocol 4), which has slightly less overhead than GRE but can only carry IPv4 inside IPv4. IPv6-in-IPv4 is the same concept, so non-IPv6 traffic cannot be tunneled inside. IPv6-in-IPv4 uses IP protocol 41, which is shared by 6to4 and 6RD, as we will see soon. The downside of these approaches (both GRE and manual) is that they are labor-intensive and difficult to deploy at large scale.

36.4.2 6to4 automatic tunnels

6to4 tunnels are a fast way to deploy IPv6 tunnels automatically (and without storing state) throughout the network. Somewhat similar to DMVPN, no tunnel destination is specified. Since the IPv6 addresses are so large and can easily encompass IPv4 addresses, the IPv4 tunnel endpoints are embedded inside of the IPv6 addresses. A 6to4 prefix always begins with 2002::/16, as this prefix was allocated by IANA specifically for 6to4 tunneling. The next 32 bits represent the local IPv4 address for a given site converted to hexadecimal. If the tunnel source was 10.4.6.6 as tested earlier, the 6to4 prefix for that site would be 2002:0A04:0606::/48, which leaves a 16-bit subnet ID as is normally expected for IPv6. This implies that all IPv6 networks local to that site must be encompassed within 2002:0A04:0606::/48, since any remote nodes sending traffic to this prefix automatically know to encapsulate the IPv6 packet into an IPv4 packet destined for 10.4.6.6.

CSR3, CSR9, and CSR10 are used for this test. First, we will configure it the “hard way” on CSR9 and CSR10, which does not use the general-prefix construct. The tunnel mode is changed to IPv6-in-IPv4 (protocol 41) with the special “6to4” option. This means that the tunnel does not need an explicit destination as it is embedded inside the IPv6 prefix. Also notice that we don’t need to configure global addressing on the tunnel itself. CSR9 and CSR10 have LAN simulations on loopback0 which are subnets within the /48 prefix specific to each site; you cannot deviate from this, but you do have 65536 subnets of size /64 to allocate, which is a good solution in my opinion.

! CSR9
interface Tunnel3910
 description 6TO4
 ipv6 address FE80::9 link-local
 tunnel source 10.9.11.9
 tunnel mode ipv6ip 6to4
interface Loopback0
 ipv6 address 2002:A09:B09:9998::9/128
 ipv6 address 2002:A09:B09:9999::9/128
ipv6 route 2002::/16 Tunnel3910
! CSR10
interface Tunnel3910
 description 6TO4
 ipv6 address FE80::10 link-local
 tunnel source 10.10.13.10
 tunnel mode ipv6ip 6to4
interface Loopback0
 ipv6 address 2002:A0A:D0A:AAAA::A/128
 ipv6 address 2002:A0A:D0A:AAAB::A/128
ipv6 route 2002::/16 Tunnel3910
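The per-site prefix derivation described above is simple enough to check with a few lines of Python. This is a minimal sketch using only the standard library; the function name is illustrative. It places 0x2002 in the top 16 bits and the site's IPv4 tunnel source in the next 32 bits, yielding the /48 for that site.

```python
import ipaddress


def sixto4_prefix(ipv4_source: str) -> ipaddress.IPv6Network:
    """Derive the 6to4 /48 site prefix from an IPv4 tunnel source."""
    v4 = int(ipaddress.IPv4Address(ipv4_source))
    # 2002 in bits 0-15, the IPv4 address in bits 16-47, the rest zero.
    return ipaddress.IPv6Network(((0x2002 << 112) | (v4 << 80), 48))


if __name__ == "__main__":
    for site, source in (("example", "10.4.6.6"),
                         ("CSR9", "10.9.11.9"),
                         ("CSR10", "10.10.13.10")):
        print(site, source, "->", sixto4_prefix(source))

    # Each site has 16 bits of subnet ID between the /48 and a /64.
    print(2 ** (64 - 48), "possible /64 subnets per site")

    # The loopbacks configured on CSR9 must fall inside its /48.
    csr9 = sixto4_prefix("10.9.11.9")
    print(ipaddress.IPv6Address("2002:A09:B09:9998::9") in csr9)
```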


Since CSR9 and CSR10 have almost identical configurations, we limit our verification to CSR9 for brevity. Despite not having a destination, the tunnel comes up. This is a result of 6to4 being enabled, which is a multipoint tunnel, similar to DMVPN. The source address is the PE-CE link address, which has connectivity with MPLS L3VPN. Notice that MTU is 1480 bytes as opposed to the more commonly seen 1476 with GRE tunnels. Since the IPv6 packet is wrapped inside of a standard 20-byte IPv4 header with no 4-byte GRE encapsulation, the MTU is automatically optimized to the maximum possible value (1500 – 20).

R9#show interfaces tunnel3910 | include Tunnel
Tunnel3910 is up, line protocol is up
  Hardware is Tunnel
  Tunnel linestate evaluation up
  Tunnel source 10.9.11.9
  Tunnel protocol/transport IPv6 6to4
  Tunnel TTL 255
  Tunnel transport MTU 1480 bytes
  Tunnel transmit bandwidth 8000 (kbps)
  Tunnel receive bandwidth 8000 (kbps)

If we want to send traffic to CSR10, we follow the highly generic 2002::/16 route which directs traffic out of the tunnel. Looking at the IPv6 CEF table, we can see an interesting action in the output chain. After identifying Tunnel3910 as the outgoing interface, another lookup is performed in the IPv4 table. This lookup is based on the embedded IPv4 address for CSR10’s tunnel so that the router knows how to ultimately encapsulate the packet inside IPv4, then add the appropriate layer 2 encapsulation.

R9#show ipv6 route 2002::/16
Routing entry for 2002::/16
  Known via "static", distance 1, metric 0
  Route count is 1/1, share count 0
  Routing paths:
    directly connected via Tunnel3910
      Last updated 02:05:41 ago

R9#show ipv6 cef 2002::/16 internal | begin output
  output chain:
    IPV6 midchain out of Tunnel3910, addr :: 7F3DE5349238
    Lookup in table IPv4:Default

Additional details in the adjacency table reveal this as well. We can also confirm the IP protocol number of 41 this way. Notice that the destination IPv4 address in the encapsulation string is all zeroes and the comment mentions that it must “copy from payload”, which is accurate (yellow). The tunnel source IPv4 address is shown just before it and is colored green.

R9#show adjacency tunnel3910 :: encapsulation
Protocol Interface                 Address
IPV6     Tunnel3910                ::(7)
  Encap length 20
  4500000000000000FF29A6C30A090B09
  00000000
  Provider: TUNNEL
  Protocol header count in macstring: 1
    HDR 0: ipv4
       dst: per packet, copy from payload: ipv6 dst (2-5)
       src: static, 10.9.11.9
      prot: static, 41
       ttl: static, 255
        df: static, cleared
      per packet fields: dst tos ident tl chksm
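The 20-byte encapsulation string is just an IPv4 header template, and the "copy from payload: ipv6 dst (2-5)" note refers to bytes 2 through 5 of the IPv6 destination address. A short Python sketch, using only the standard library and illustrative function names, decodes the template and derives the per-packet tunnel destination the way the adjacency describes it.

```python
import ipaddress
import struct

# Encapsulation string from the adjacency output (two lines joined).
TEMPLATE = bytes.fromhex("4500000000000000FF29A6C30A090B09" "00000000")


def ipv4_checksum(header: bytes) -> int:
    """One's-complement sum over a 20-byte header; 0 means the checksum is valid."""
    total = sum(struct.unpack("!10H", header))
    while total >> 16:
        total = (total & 0xFFFF) + (total >> 16)
    return ~total & 0xFFFF


def tunnel_destination(ipv6_dst: str) -> ipaddress.IPv4Address:
    """Copy bytes 2-5 of the IPv6 destination, as the adjacency describes."""
    return ipaddress.IPv4Address(ipaddress.IPv6Address(ipv6_dst).packed[2:6])


ttl = TEMPLATE[8]
protocol = TEMPLATE[9]
source = ipaddress.IPv4Address(TEMPLATE[12:16])
destination = ipaddress.IPv4Address(TEMPLATE[16:20])

if __name__ == "__main__":
    print("ttl", ttl, "protocol", protocol, "src", source, "dst", destination)
    # The template checksum already verifies with the zeroed per-packet fields.
    print("checksum check:", ipv4_checksum(TEMPLATE))
    # Destination for a packet towards CSR10's loopback.
    print(tunnel_destination("2002:A0A:D0A:AAAA::A"))
```

The static fields match the adjacency comments (source 10.9.11.9, protocol 41, TTL 255), and the destination placeholder is filled per packet from the 6to4 address itself.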

As a safety mechanism, it is good practice (but technically not required for connectivity) to ensure that traffic destined for a site’s own prefix is not sent from that site into the 6to4 network. For example, CSR9 owns the aggregate prefix 2002:A09:B09::/48. If a host behind CSR9 sends traffic to a subnet of this prefix for which CSR9 does not have a longer match, CSR9 will try to send it out of the 6to4 tunnel. Although the packet won’t actually go anywhere since the destination is local, it is sloppy and could introduce a control-plane DoS attack.

R9#show ipv6 cef 2002:A09:B09:1:2:3:4:5
2002::/16
  attached to Tunnel3910

We can correct this easily with null routes on CSR9 and CSR10. The /48 null route is generic enough that longer matches won’t get blackholed, but specific enough to override the 2002::/16 route for local traffic.

! CSR9
ipv6 route 2002:A09:B09::/48 null0
! CSR10
ipv6 route 2002:A0A:D0A::/48 null0

R9#show ipv6 cef 2002:A09:B09:1:2:3:4:5
2002:A09:B09::/48
  attached to Null0

R10#show ipv6 cef 2002:A0A:D0A:1:2:3:4:5
2002:A0A:D0A::/48
  attached to Null0

We can test connectivity across the 6to4 tunnel while using EPC outbound on CSR9 at the same time. We will initiate the ping from CSR10 so that we capture CSR9’s echo-replies. I also highlight the IPv4 addresses embedded in the IPv6 prefix for clarity; it shows packets going from 10.10.13.10 to 10.9.11.9, which is correct.


R10#ping 2002:A09:B09:9998::9 source 2002:A0A:D0A:AAAA::A
Type escape sequence to abort.
Sending 5, 100-byte ICMP Echos to 2002:A09:B09:9998::9, timeout is 2 seconds:
Packet sent with a source address of 2002:A0A:D0A:AAAA::A
!!!!!
Success rate is 100 percent (5/5), round-trip min/avg/max = 1/8/22 ms

EPC dumps on CSR9 show the entire packet. Unlike the GRE tunnel we saw earlier, the IP protocol is now 41 (0x29) which indicates IPv6-in-IPv4 encapsulation, shown in cyan. As a result, there is no GRE header, which saves 4 bytes of encapsulation (but makes IPv6 the only possible payload protocol). The IPv4 and IPv6 source addresses are shown in yellow again, which is nothing different than before. Likewise, the IPv4 and IPv6 destination addresses are shown in green, which is also similar to the GRE example. In summary, IPv6 traffic flows from CSR9 to CSR10 inside IPv4.

R9#show monitor capture CAP buffer dump
0
0000: 005056A9 9C600050 56A9D672 81000E07
0010: 08004500 00780014 0000FF29 8F230A09
0020: 0B090A0A 0D0A6000 0000003C 3A402002
0030: 0A090B09 99980000 00000000 00092002
0040: 0A0A0D0A AAAA0000 00000000 000A8100
0050: 26C01AA2 00000001 02030405 06070809

.PV..`.PV..r.... ..E..x.....).#.. ......`....
i0.0.0.0/0 30.0.0.4 4009 nolabel
*> 9.0.0.0/28 10.9.11.9 nolabel 91003

RP/0/0/CPU0:XRv1#show mpls ldp bindings 30.0.0.4/32 neighbor 30.0.0.3
30.0.0.4/32, rev 12
        Local binding: label: 91001
        Remote bindings: (1 peers)
            Peer                Label
            -----------------   ---------
            30.0.0.3:0          3001

CSR3 performs PHP as a P router to expose label 4009 to CSR4. When CSR4 receives label 4009, the LFIB directs the router to perform another lookup inside the VRF 5 FIB. This is because 0.0.0.0/0 was a local aggregate and CSR4 cannot forward the packet based on the MPLS label alone.

R3#show mpls forwarding-table labels 3001
Local      Outgoing   Prefix           Bytes Label   Outgoing     Next Hop
Label      Label      or Tunnel Id     Switched      interface
3001       Pop Label  30.0.0.4/32      8706497       Gi2.534      30.3.4.4

R4#show mpls forwarding-table labels 4009
Local      Outgoing   Prefix           Bytes Label   Outgoing     Next Hop
Label      Label      or Tunnel Id     Switched      interface
4009       No Label   0.0.0.0/0[V]     5310          aggregate/5

R4#show ip cef vrf 5 55.46.253.1
55.46.253.0/26
  nexthop 10.4.5.5 GigabitEthernet2.545


When CSR5 responds, it sends its replies to 9.0.0.0. CSR4 imposes two labels just as XRv1 did in the opposite direction. Label 91003 is the VPN label for 9.0.0.0/28 from XRv1 and label 3002 is CSR3’s LDP binding for 30.0.0.11/32.

R4#show ip cef vrf 5 9.0.0.0
9.0.0.0/28
  nexthop 30.3.4.3 GigabitEthernet2.534 label 3002 91003

R4#show bgp vpnv4 unicast vrf 5 9.0.0.0/28
BGP routing table entry for 4:5:9.0.0.0/28, version 53
Paths: (1 available, best #1, table 5)
  Advertised to update-groups:
     3
  Refresh Epoch 1
  9, imported path from 11:9:9.0.0.0/28 (global)
    30.0.0.11 (metric 20) (via default) from 30.0.0.11 (30.0.0.11)
      Origin IGP, metric 0, localpref 100, valid, internal, best
      Extended Community: RT:11:9
      mpls labels in/out nolabel/91003
      rx pathid: 0, tx pathid: 0x0

R4#show mpls ldp bindings 30.0.0.11 32 neighbor 30.0.0.3
  lib entry: 30.0.0.11/32, rev 8
        remote binding: lsr: 30.0.0.3:0, label: 3002

CSR3 is a P router performing PHP and XRv1 forwards packets towards CSR9. Since 9.0.0.0/28 was not a local aggregate defined on XRv1, we do not see the inefficiency of multiple LFIB/FIB lookups on XRv1 as we did on CSR4. When CSR9 receives the packets on the outside interface, NAT happens before routing, so the destination of 9.0.0.0 is changed back to 192.168.9.1 and the packet is routed to the loopback.

R3#show mpls forwarding-table labels 3002
Local      Outgoing   Prefix           Bytes Label   Outgoing     Next Hop
Label      Label      or Tunnel Id     Switched      interface
3002       Pop Label  30.0.0.11/32     8534837       Gi2.531      30.3.11.11

RP/0/0/CPU0:XRv1#show mpls forwarding vrf 9 prefix 9.0.0.0/28
Local  Outgoing    Prefix             Outgoing       Next Hop        Bytes
Label  Label       or ID              Interface                      Switched
------ ----------- ------------------ -------------- --------------- ----------
91003  Unlabelled  9.0.0.0/28[V]      Gi0/0/0/0.591  10.9.11.9       2000

CSR9’s connectivity to the IPv6 Internet is less interesting as there is no NAT in play. Rather than trace the LSPs for a very basic L3VPN connection (the route leaking was already verified), I use traceroute from CSR9 (CE) and CSR7 (IPv6 Internet) for brevity. 2 labels are imposed by the ingress LSR in either direction as expected, and the packets are label-switched across the core. The only interesting part of the output below is label 92009; this is technically an IPv6 labeled-unicast label, not a VPNv6 label, since the route was imported from the global table. When XRv2 leaked the route from VRF to global, a new local label was allocated and advertised via BGP IPv6 labeled-unicast.

R9#traceroute ipv6
Target IPv6 address: ::77:BABE:0
Source address: 9000::9
[snip]
Tracing the route to ::77:BABE:0
  1 FD00:10:9:11::11 [AS 30] 3 msec 3 msec 2 msec
  2 ::FFFF:30.3.11.3 [AS 30] [MPLS: Labels 3003/92009 Exp 0] 9 msec 7 msec 8 msec
  3 FD00:10:7:12::12 [AS 30] [MPLS: Label 92009 Exp 0] 25 msec 20 msec 27 msec
  4 FD00:10:7:12::7 [AS 30] 27 msec 15 msec 15 msec

R7#traceroute ipv6
Target IPv6 address: 9000::9
Source address: ::77:BABE:0
[snip]
Tracing the route to 9000::9
  1 FD00:10:7:12::12 [AS 30] 3 msec 3 msec 3 msec
  2 ::FFFF:30.3.12.3 [AS 30] [MPLS: Labels 3002/91007 Exp 0] 7 msec 7 msec 9 msec
  3 FD00:10:9:11::11 [AS 30] [MPLS: Label 91007 Exp 0] 23 msec 23 msec 22 msec
  4 FD00:10:9:11::9 [AS 30] 27 msec 15 msec 15 msec

A quick verification on XRv2 proves that a new label was issued; the VPNv6 label of 92008 was not carried over during the VRF-to-global export process. This is the label learned by XRv1 in the global table.

RP/0/0/CPU0:XRv2#show bgp vpnv6 unicast vrf 7 ::/0 | include Local Label
Local Label: 92008
RP/0/0/CPU0:XRv2#show route ipv6 ::/0 detail | include Local Label
    Local Label: 0x16769 (92009)
RP/0/0/CPU0:XRv1#show route ipv6 unicast ::/0 detail | include Label
    Label: 0x16769 (92009)
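To connect the hexadecimal label in the XR output with the two-label stack seen in the traceroute, the following Python sketch encodes and decodes MPLS label stack entries using the standard 32-bit layout from RFC 3032 (20-bit label, 3-bit EXP, 1-bit bottom-of-stack flag, 8-bit TTL). The function names and the TTL value of 255 are assumptions for illustration; the traceroute does not reveal the actual TTL on the wire.

```python
import struct


def encode_label_stack(labels, exp=0, ttl=255):
    """Pack labels (top of stack first) into 4-byte MPLS stack entries."""
    out = b""
    for index, label in enumerate(labels):
        bottom = 1 if index == len(labels) - 1 else 0
        entry = (label << 12) | (exp << 9) | (bottom << 8) | ttl
        out += struct.pack("!I", entry)
    return out


def decode_label_stack(data):
    """Return a list of (label, exp, bottom_of_stack, ttl) tuples."""
    entries = []
    for (word,) in struct.iter_unpack("!I", data):
        entries.append((word >> 12, (word >> 9) & 0x7, (word >> 8) & 0x1, word & 0xFF))
    return entries


if __name__ == "__main__":
    # The hex value printed by XR is simply the label in base 16.
    print(hex(92009))
    # Stack imposed towards ::77:BABE:0 in the first traceroute:
    # transport label 3003 on top, labeled-unicast label 92009 at the bottom.
    stack = encode_label_stack([3003, 92009])
    print(stack.hex())
    print(decode_label_stack(stack))
```

After CSR3 pops the transport label, only the bottom entry remains, which is why the third hop in the traceroute reports the single label 92009.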

Next, we will use a set of alternative techniques on CSR2 and CSR8. Like CSR9, these are CE routers with private IPv4 addressing, and thus require NAT. CSR6, the ingress PE, will perform NAT for these clients. This means that CSR6 will need to learn the private addresses from the CE and not distribute them indiscriminately to the IPv4 Internet. For IPv6, both clients will be in a basic L3VPN central services VPN. This is usually the best approach for IPv6 Internet connectivity when only a default route is required. For brevity, CSR2 and CSR8 interface configurations are not shown, but their private networks are 192.168.2.0/24 and 192.168.8.0/24, respectively. The IPv6 networks 2000::/16 and 8000::/16 are used by CSR2 and CSR8, respectively, as well. First, I show the VPN-specific configurations on CSR6. There is no route leaking with this example; the VRFs just perform standard RT import/export actions. The only imported RT for both VRFs is the central services RT for each AFI.

! CSR6
vrf definition 2
 rd 6:2
 address-family ipv4
  route-target export 6:2
  route-target import 4:5
 address-family ipv6
  route-target export 6:2
  route-target import 12:7
vrf definition 8
 rd 6:8
 address-family ipv4
  route-target export 6:8
  route-target import 4:5
 address-family ipv6
  route-target export 6:8
  route-target import 12:7

Next, we configure BGP on the PE. CSR6 has the responsibility to ensure that private routes learned from the customer are not advertised further. Rather than actually match the private range, CSR6 marks all routes from the customers with the “no-advertise” community, which prevents them from being advertised at all. This solution works well when the customer is ONLY sending private routes to the PE and there is no possibility of public IPv4 routes being advertised.

! CSR6
route-map RM_SET_NO_ADV permit 10
 set community no-advertise
router bgp 30
 address-family ipv4 vrf 2
  neighbor 10.2.6.2 remote-as 2
  neighbor 10.2.6.2 activate
  neighbor 10.2.6.2 route-map RM_SET_NO_ADV in
 address-family ipv6 vrf 2
  neighbor FD00:10:2:6::2 remote-as 2
  neighbor FD00:10:2:6::2 activate
 address-family ipv4 vrf 8
  neighbor 10.6.8.8 remote-as 8
  neighbor 10.6.8.8 activate
  neighbor 10.6.8.8 route-map RM_SET_NO_ADV in
 address-family ipv6 vrf 8
  neighbor FD00:10:6:8::8 remote-as 8
  neighbor FD00:10:6:8::8 activate

Below are the basic BGP configurations on CSR2 and CSR8. They advertise their private networks to CSR6 and do not perform local NAT44.

! CSR2
router bgp 2
 no bgp default ipv4-unicast
 neighbor 10.2.6.6 remote-as 30
 neighbor FD00:10:2:6::6 remote-as 30
 address-family ipv4
  network 192.168.2.0
  neighbor 10.2.6.6 activate
 address-family ipv6
  network 2000::/16
  neighbor FD00:10:2:6::6 activate

! CSR8
router bgp 8
 no bgp default ipv4-unicast
 neighbor 10.6.8.6 remote-as 30
 neighbor FD00:10:6:8::6 remote-as 30
 address-family ipv4
  network 192.168.8.0
  neighbor 10.6.8.6 activate
 address-family ipv6
  network 8000::/16
  neighbor FD00:10:6:8::6 activate

Before configuring NAT on CSR6, we verify that the PE learns the private IPv4 routes from both customers and the “no-advertise” community is set. The IPv6 routes are also present but lack this community as they are public and can be advertised to the IPv6 Internet.

R6#show bgp vpnv4 unicast vrf 2 192.168.2.0/24
BGP routing table entry for 6:2:192.168.2.0/24, version 5
Paths: (1 available, best #1, table 2, not advertised to any peer)
  Not advertised to any peer
  Refresh Epoch 1
  2
    10.2.6.2 (via vrf 2) from 10.2.6.2 (10.2.6.2)
      Origin IGP, metric 0, localpref 100, valid, external, best
      Community: no-advertise
      Extended Community: RT:6:2
      mpls labels in/out 6019/nolabel
      rx pathid: 0, tx pathid: 0x0

R6#show bgp vpnv4 unicast vrf 8 192.168.8.0/24
BGP routing table entry for 6:8:192.168.8.0/24, version 6
Paths: (1 available, best #1, table 8, not advertised to any peer)
  Not advertised to any peer
  Refresh Epoch 1
  8
    10.6.8.8 (via vrf 8) from 10.6.8.8 (192.168.8.1)
      Origin IGP, metric 0, localpref 100, valid, external, best
      Community: no-advertise
      Extended Community: RT:6:8
      mpls labels in/out 6023/nolabel
      rx pathid: 0, tx pathid: 0x0

R6#show bgp vpnv6 unicast vrf 2 2000::/16
BGP routing table entry for [6:2]2000::/16, version 397
Paths: (1 available, best #1, table 2)
  Advertised to update-groups:
     3
  Refresh Epoch 1
  2
    FD00:10:2:6::2 (FE80::2) (via vrf 2) from FD00:10:2:6::2 (10.2.6.2)
      Origin IGP, metric 0, localpref 100, valid, external, best
      Extended Community: RT:6:2
      mpls labels in/out 6043/nolabel
      rx pathid: 0, tx pathid: 0x0

R6#show bgp vpnv6 unicast vrf 8 8000::/16
BGP routing table entry for [6:8]8000::/16, version 398
Paths: (1 available, best #1, table 8)
  Advertised to update-groups:
     3
  Refresh Epoch 1
  8
    FD00:10:6:8::8 (FE80::8) (via vrf 8) from FD00:10:6:8::8 (192.168.8.1)
      Origin IGP, metric 0, localpref 100, valid, external, best
      Extended Community: RT:6:8
      mpls labels in/out 6021/nolabel
      rx pathid: 0, tx pathid: 0x0

On the other side of the BGP connection, we verify that both CSR2 and CSR8 learn the default route from CSR6 in both AFIs.

R2#show bgp ipv4 unicast 0.0.0.0/0
BGP routing table entry for 0.0.0.0/0, version 17
Paths: (1 available, best #1, table default)
  Not advertised to any peer
  Refresh Epoch 1
  30
    10.2.6.6 from 10.2.6.6 (30.0.0.6)
      Origin incomplete, localpref 100, valid, external, best
      rx pathid: 0, tx pathid: 0x0

R8#show bgp ipv4 unicast 0.0.0.0/0
BGP routing table entry for 0.0.0.0/0, version 113
Paths: (1 available, best #1, table default)
  Not advertised to any peer
  Refresh Epoch 1
  30
    10.6.8.6 from 10.6.8.6 (30.0.0.6)
      Origin incomplete, localpref 100, valid, external, best
      rx pathid: 0, tx pathid: 0x0

R2#show bgp ipv6 unicast ::/0
BGP routing table entry for ::/0, version 212
Paths: (1 available, best #1, table default)
  Not advertised to any peer
  Refresh Epoch 1
  30
    FD00:10:2:6::6 (FE80::6) from FD00:10:2:6::6 (30.0.0.6)
      Origin incomplete, localpref 100, valid, external, best
      rx pathid: 0, tx pathid: 0x0

R8#show bgp ipv6 unicast ::/0
BGP routing table entry for ::/0, version 242
Paths: (1 available, best #1, table default)
  Advertised to update-groups:
     2
  Refresh Epoch 1
  30
    FD00:10:6:8::6 (FE80::6) from FD00:10:6:8::6 (30.0.0.6)
      Origin incomplete, localpref 100, weight 66, valid, external, best
      rx pathid: 0, tx pathid: 0x0

NAT configuration, even when crossing VRFs, is very simple. Inside interfaces are configured towards the customer (inside VRF) and outside interfaces are configured towards the MPLS core (global interfaces). Two NAT pools are defined, along with corresponding VRF-aware static null routes. NAT rules are applied within a VRF to perform NAT source translation for the private networks within AS 2 and AS 8. There is nothing different about the NAT configuration other than ensuring the static routes and NAT rules are VRF-aware.

! CSR6
interface GigabitEthernet2.536
 ip nat outside
interface GigabitEthernet2.526
 ip nat inside
interface GigabitEthernet2.568
 ip nat inside
ip nat pool NAT_POOL_2 2.0.0.0 2.0.0.15 prefix-length 28
ip nat pool NAT_POOL_8 8.0.0.0 8.0.0.15 prefix-length 28
ip route vrf 2 2.0.0.0 255.255.255.240 Null0
ip route vrf 8 8.0.0.0 255.255.255.240 Null0
ip access-list standard ACL_NAT_2
 permit 192.168.2.0 0.0.0.255
ip access-list standard ACL_NAT_8
 permit 192.168.8.0 0.0.0.255
ip nat inside source list ACL_NAT_2 pool NAT_POOL_2 vrf 2
ip nat inside source list ACL_NAT_8 pool NAT_POOL_8 vrf 8

To complete the configuration, the static null routes defined above are advertised into BGP via network statements. To ensure these routes are not advertised down to the customers, a prefix-list that only permits the default route is applied. Although learning these NAT pool routes is harmless for CSR2 and CSR8, it is sloppy from a design perspective.

! CSR6
ip prefix-list PL_DEFAULT seq 5 permit 0.0.0.0/0
router bgp 30
 address-family ipv4 vrf 2
  network 2.0.0.0 mask 255.255.255.240
  neighbor 10.2.6.2 prefix-list PL_DEFAULT out
 address-family ipv4 vrf 8
  network 8.0.0.0 mask 255.255.255.240
  neighbor 10.6.8.8 prefix-list PL_DEFAULT out

As a quick verification, we confirm that CSR5 (IPv4 Internet) learns the routes to 2.0.0.0/28 and 8.0.0.0/28, but does not see routes for any private networks. We also confirm that CSR2 and CSR8 (CE routers) do not learn the NAT pools that CSR6 defined; they only learn the default routes for IPv4 as seen earlier.

R5#show bgp ipv4 unicast 2.0.0.0/28
BGP routing table entry for 2.0.0.0/28, version 37
Paths: (1 available, best #1, table default)
  Not advertised to any peer
  Refresh Epoch 1
  30
    10.4.5.4 from 10.4.5.4 (30.0.0.4)
      Origin IGP, localpref 100, valid, external, best
      rx pathid: 0, tx pathid: 0x0

R5#show bgp ipv4 unicast 8.0.0.0/28
BGP routing table entry for 8.0.0.0/28, version 38
Paths: (1 available, best #1, table default)
  Not advertised to any peer
  Refresh Epoch 1
  30
    10.4.5.4 from 10.4.5.4 (30.0.0.4)
      Origin IGP, localpref 100, valid, external, best
      rx pathid: 0, tx pathid: 0x0

R5#show bgp ipv4 unicast 192.168.2.0/24
% Network not in table

R5#show bgp ipv4 unicast 192.168.8.0/24
% Network not in table

R2#show bgp ipv4 unicast 2.0.0.0/28
% Network not in table

R8#show bgp ipv4 unicast 8.0.0.0/28
% Network not in table

The IPv6 configuration is much less complicated since the routes on CSR2 and CSR8 were public already; these were advertised directly into BGP and are learned by CSR7 as expected.

R7#show bgp ipv6 unicast 2000::/16
BGP routing table entry for 2000::/16, version 1038
Paths: (1 available, best #1, table default)
  Not advertised to any peer
  Refresh Epoch 1
  30 2
    FD00:10:7:12::12 (FE80::12) from FD00:10:7:12::12 (30.0.0.12)
      Origin IGP, localpref 100, valid, external, best
      rx pathid: 0, tx pathid: 0x0

R7#show bgp ipv6 unicast 8000::/16
BGP routing table entry for 8000::/16, version 1039
Paths: (1 available, best #1, table default)
  Not advertised to any peer
  Refresh Epoch 1
  30 8
    FD00:10:7:12::12 (FE80::12) from FD00:10:7:12::12 (30.0.0.12)
      Origin IGP, localpref 100, valid, external, best
      rx pathid: 0, tx pathid: 0x0

To test connectivity, CSR2 and CSR8 both attempt to ping the IPv4 Internet from their loopbacks. The pings succeed (not shown) and CSR6 creates NAT state for each. Since overloading is not used, outside-to-inside traffic is also permitted when using an ACL with a NAT pool; this is just an observation and is not relevant for this test. The post-NAT IPv4 source addresses for CSR2 and CSR8 are 2.0.0.0 and 8.0.0.0, respectively.

R6#show ip nat translations
Pro  Inside global  Inside local   Outside local    Outside global
---  2.0.0.0        192.168.2.1    ---              ---
---  8.0.0.0        192.168.8.1    ---              ---
icmp 2.0.0.0:6      192.168.2.1:6  55.46.253.193:6  55.46.253.193:6
icmp 8.0.0.0:5      192.168.8.1:5  55.46.253.193:5  55.46.253.193:5
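The one-to-one behavior observed in the translation table can be illustrated with a toy model. This is only a sketch under simplifying assumptions: the `DynamicNat` class and its method names are hypothetical, timeouts are ignored, and it does not represent how IOS implements NAT internally. It shows why a pool-based rule without overload yields a simple (address-only) entry that also admits outside-initiated traffic while the entry exists.

```python
import ipaddress


class DynamicNat:
    """Toy one-to-one dynamic NAT without overload (hypothetical model)."""

    def __init__(self, first: str, last: str):
        start = int(ipaddress.IPv4Address(first))
        end = int(ipaddress.IPv4Address(last))
        # Every address from first to last (inclusive) is available for allocation.
        self.free = [str(ipaddress.IPv4Address(n)) for n in range(start, end + 1)]
        self.local_to_global = {}
        self.global_to_local = {}

    def outbound(self, inside_local: str) -> str:
        """Inside-to-outside: reuse or allocate an inside global address."""
        if inside_local not in self.local_to_global:
            inside_global = self.free.pop(0)  # IndexError once the pool is exhausted
            self.local_to_global[inside_local] = inside_global
            self.global_to_local[inside_global] = inside_local
        return self.local_to_global[inside_local]

    def inbound(self, inside_global: str):
        """Outside-to-inside: succeeds only while a simple entry exists."""
        return self.global_to_local.get(inside_global)


pool_2 = DynamicNat("2.0.0.0", "2.0.0.15")
print(pool_2.outbound("192.168.2.1"))  # first free address in the pool
print(pool_2.inbound("2.0.0.0"))       # mapped back to the inside local address
print(pool_2.inbound("2.0.0.1"))       # None: no translation exists yet
```

With overload, the mapping would be keyed on protocol and port as well, so unsolicited outside-to-inside traffic would not match an existing entry.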

Because this IPv4 Internet access is provided via a central services architecture, the MPLS label stack is identical for VRFs 2 and 8 as the exact same VPN route is used. Rather than trace this basic LSP, we will end the forward-LSP verification here.

R6#show ip cef vrf 2 55.46.253.193
0.0.0.0/0
  nexthop 30.3.6.3 GigabitEthernet2.536 label 3001 4009

R6#show ip cef vrf 8 55.46.253.193
0.0.0.0/0
  nexthop 30.3.6.3 GigabitEthernet2.536 label 3001 4009

Returning traffic is very similar except there are two separate VPN routes on CSR4 which originated in different VRFs (one per NAT pool). The transport label is the same but the VPN label differs in this case. Nonetheless, this is basic L3VPN and does not warrant additional verification.

R4#show ip cef vrf 5 2.0.0.0/28
2.0.0.0/28
  nexthop 30.3.4.3 GigabitEthernet2.534 label 3000 6013

R4#show ip cef vrf 5 8.0.0.0/28
8.0.0.0/28
  nexthop 30.3.4.3 GigabitEthernet2.534 label 3000 6020

What is interesting is how a command like “ip nat outside” can be honored when traffic entering CSR6 from CSR3 is MPLS encapsulated. This configuration works because CSR6 locally generated the static null routes for 2.0.0.0/28 and 8.0.0.0/28. This means that the label action on CSR6 will be to strip all labels and perform an IPv4 VRF-aware FIB lookup on the destination. As seen above, labels 6013 and 6020 were allocated by CSR6 for the NAT pools.

R6#show mpls forwarding-table labels 6013
Local      Outgoing   Prefix           Bytes Label   Outgoing     Next Hop
Label      Label      or Tunnel Id     Switched      interface
6013       No Label   2.0.0.0/28[V]    1770          aggregate/2

R6#show mpls forwarding-table labels 6020
Local      Outgoing   Prefix           Bytes Label   Outgoing     Next Hop
Label      Label      or Tunnel Id     Switched      interface
6020       No Label   8.0.0.0/28[V]    1180          aggregate/8

However, performing a VRF-aware routing lookup on 2.0.0.0/28 or 8.0.0.0/28 alone would drop traffic, as these routes point to Null0. Because outside NAT happens before routing, the traffic is un-NAT’ed first and then the routing lookup occurs. The key is that the order of operations still applies with respect to MPLS label removal, outside NAT, and routing. After the label is removed, the order of operations “restarts” for IP traffic. After un-NAT, traffic is forwarded back towards the proper customer.

R6#show ip cef vrf 2 2.0.0.0/28
2.0.0.0/28
  attached to Null0

R6#show ip cef vrf 8 8.0.0.0/28
8.0.0.0/28
  attached to Null0
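The order of operations described above (label removal, then outside-to-inside NAT, then the VRF routing lookup) can be sketched as a toy pipeline. The tables below are hypothetical simplifications built from the CSR6 outputs in this section; the function names and the `nat_first` switch are illustrative assumptions and do not reflect the actual IOS data path.

```python
import ipaddress

# Aggregate labels map to a VRF in which an IP lookup is performed.
LFIB = {6013: "2", 6020: "8"}

# Inside global -> inside local, per VRF (simple NAT entries).
NAT_TABLE = {("2", "2.0.0.0"): "192.168.2.1", ("8", "8.0.0.0"): "192.168.8.1"}

# Per-VRF routes: prefix -> next hop (or Null0).
VRF_ROUTES = {
    "2": {"2.0.0.0/28": "Null0", "192.168.2.0/24": "10.2.6.2"},
    "8": {"8.0.0.0/28": "Null0", "192.168.8.0/24": "10.6.8.8"},
}


def route_lookup(vrf: str, dst: str):
    """Longest-prefix match inside one VRF; returns the next hop or None."""
    addr = ipaddress.ip_address(dst)
    best = None
    for prefix, next_hop in VRF_ROUTES[vrf].items():
        net = ipaddress.ip_network(prefix)
        if addr in net and (best is None or net.prefixlen > best[0].prefixlen):
            best = (net, next_hop)
    return best[1] if best else None


def forward(label: int, dst: str, nat_first: bool = True):
    """Pop the label, optionally un-NAT the destination, then route in the VRF."""
    vrf = LFIB[label]
    if nat_first:
        dst = NAT_TABLE.get((vrf, dst), dst)
    return route_lookup(vrf, dst)


print(forward(6013, "2.0.0.0"))                   # 10.2.6.2
print(forward(6013, "2.0.0.0", nat_first=False))  # Null0
```

The second call shows what would happen if routing preceded NAT: the packet would match the static null route and be dropped.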

IPv6 Internet connectivity in this design is uninteresting as it is a basic L3VPN central services design with no complexity. A quick traceroute within the customer VRFs confirms this works properly.

R2#traceroute ipv6
Target IPv6 address: ::77:BABE:2
Source address: 2000::2
[snip]
Tracing the route to ::77:BABE:2
  1 FD00:10:2:6::6 [AS 30] 4 msec 4 msec 4 msec
  2 ::FFFF:30.3.6.3 [AS 30] [MPLS: Labels 3003/92008 Exp 0] 6 msec 8 msec 8 msec
  3 FD00:10:7:12::12 [AS 30] [MPLS: Label 92008 Exp 0] 23 msec 22 msec 22 msec
  4 FD00:10:7:12::7 [AS 30] 23 msec 15 msec 15 msec

R8#traceroute ipv6
Target IPv6 address: ::77:BABE:2
Source address: 8000::8
[snip]
Tracing the route to ::77:BABE:2
  1 FD00:10:6:8::6 [AS 30] 4 msec 4 msec 3 msec
  2 ::FFFF:30.3.6.3 [AS 30] [MPLS: Labels 3003/92008 Exp 0] 7 msec 7 msec 9 msec
  3 FD00:10:7:12::12 [AS 30] [MPLS: Label 92008 Exp 0] 23 msec 22 msec 22 msec
  4 FD00:10:7:12::7 [AS 30] 23 msec 15 msec 15 msec

We will examine another set of Internet connectivity methods on CSR1. For IPv4, CSR6 will leak the default IPv4 route from global into the VRF. CSR1 has no requirement for NAT44 as it only supports a few users, so using direct public addressing is acceptable. The prefix-list filter is applied as a best practice to ensure CSR1 does not become overloaded with IPv4 Internet routes accidentally. No RTs are imported into VRF 1 for IPv4, so this accidental advertisement is currently impossible. For IPv6, we will assume that CSR1 is an older router that cannot support IPv6 BGP, so it uses a static default route instead. Since CSR6 cannot learn CSR1’s public IPv6 routes dynamically, we configure a VRF-aware static route and redistribute it into BGP.

! CSR6
route-map RM_VRF_1_IMPORT_BGP permit 10
 match ip address prefix-list PL_DEFAULT
vrf definition 1
 rd 6:1
 address-family ipv4
  import ipv4 unicast map RM_VRF_1_IMPORT_BGP
  route-target export 6:1
 address-family ipv6
  route-target export 6:1
  route-target import 12:7
ipv6 route vrf 1 1000::/16 GigabitEthernet2.516 FE80::1 tag 1
route-map RM_STATIC_TO_BGP_VRF_1 permit 10
 match tag 1
router bgp 30
 address-family ipv4 vrf 1
  neighbor 10.1.6.1 remote-as 1
  neighbor 10.1.6.1 activate
  neighbor 10.1.6.1 prefix-list PL_DEFAULT out
 address-family ipv6 vrf 1
  redistribute static route-map RM_STATIC_TO_BGP_VRF_1

The most interesting part of this configuration is the global-to-VRF route leaking. CSR6 learns the route inside the default table natively. Once it is exported from global (imported to VRF), the route is marked “af-export” with the VRF name of the target VRF. The newly-created VPN route has no RTs or MPLS labels, but does specify a global next-hop. Labels are not required because the MPLS core is IPv4-capable, so if the penultimate hop removes the transport label and exposes the IPv4 header to the egress PE, the egress PE can route it correctly.

R6#show bgp ipv4 unicast 0.0.0.0
BGP routing table entry for 0.0.0.0/0, version 25
Paths: (1 available, best #1, table default)
  Not advertised to any peer
  Refresh Epoch 2
  Local
    30.0.0.4 (metric 20) from 30.0.0.4 (30.0.0.4)
      Origin incomplete, metric 0, localpref 100, valid, internal, af-export(1), best
      rx pathid: 0, tx pathid: 0x0

R6#show bgp vpnv4 unicast vrf 1 0.0.0.0/0
BGP routing table entry for 6:1:0.0.0.0/0, version 3
Paths: (1 available, best #1, table 1)
  Advertised to update-groups:
     12
  Refresh Epoch 2
  Local, imported path from 0.0.0.0/0 (global)
    30.0.0.4 (metric 20) (via default) from 30.0.0.4 (30.0.0.4)
      Origin incomplete, metric 0, localpref 100, valid, internal, no-import, no-import, best
      rx pathid: 0, tx pathid: 0x0

The configuration on CSR1 is very basic. The loopback prefix of 1.0.0.0/28 is advertised into BGP so CSR6 can learn it. A default IPv6 route is also configured which complements the specific 1000::/16 static route added onto CSR6.

! CSR1
router bgp 1
 no bgp default ipv4-unicast
 neighbor 10.1.6.6 remote-as 30
 address-family ipv4
  network 1.0.0.0 mask 255.255.255.240
  neighbor 10.1.6.6 activate
ipv6 route ::/0 GigabitEthernet2.516 FE80::6

We quickly confirm that CSR6 learns 1.0.0.0/28 from CSR1 and CSR1 learns the default route from CSR6.

R6#show bgp vpnv4 unicast vrf 1 1.0.0.0/28
BGP routing table entry for 6:1:1.0.0.0/28, version 4
Paths: (1 available, best #1, table 1)
  Advertised to update-groups:
     15
  Refresh Epoch 1
  1
    10.1.6.1 (via vrf 1) from 10.1.6.1 (1.0.0.1)
      Origin IGP, metric 0, localpref 100, valid, external, best
      Extended Community: RT:6:1
      mpls labels in/out 6011/nolabel
      rx pathid: 0, tx pathid: 0x0

R1#show bgp ipv4 unicast 0.0.0.0/0
BGP routing table entry for 0.0.0.0/0, version 8
Paths: (1 available, best #1, table default)
  Not advertised to any peer
  Refresh Epoch 1
  30
    10.1.6.6 from 10.1.6.6 (30.0.0.6)
      Origin incomplete, localpref 100, valid, external, best
      rx pathid: 0, tx pathid: 0x0

Because no VPN label was allocated for the default route when it was leaked into the VRF, only one label is imposed. This is the LDP label advertised by CSR3 for the path to 30.0.0.4/32.

R6#show ip cef vrf 1 55.46.253.1 detail
0.0.0.0/0, epoch 0, flags [rib only nolabel, rib defined all labels, default route]
  recursive via 30.0.0.4
    nexthop 30.3.6.3 GigabitEthernet2.536 label 3001

R6#show mpls ldp bindings 30.0.0.4 32 neighbor 30.0.0.3
  lib entry: 30.0.0.4/32, rev 9
        remote binding: lsr: 30.0.0.3:0, label: 3001

When CSR3 pops this label, the IPv4 packet is exposed to CSR4. Since CSR4 leaked the Internet route from VRF to global, it has a route for the final destination in the global table. The next-hop is in VRF 5, but connectivity still works. Only a single MPLS label is required to tunnel the traffic through the P routers as there is no need to bind VPN labels for IPv4 prefixes in the global table. The exception was 6PE, which is required when the core is not IPv6-capable (verified on CSR9 earlier).

R3#show mpls forwarding-table labels 3001
Local      Outgoing   Prefix           Bytes Label   Outgoing     Next Hop
Label      Label      or Tunnel Id     Switched      interface
3001       Pop Label  30.0.0.4/32      8870220       Gi2.534      30.3.4.4

R4#show ip route 55.46.253.1
Routing entry for 55.46.253.0/26
  Known via "bgp 30", distance 20, metric 0
  Tag 5, type external
  Last update from 10.4.5.5 3d14h ago
  Routing Descriptor Blocks:
  * 10.4.5.5 (5), from 10.4.5.5, 3d14h ago
      Route metric is 0, traffic share count is 1
      AS Hops 1
      Route tag 5
      MPLS label: none

Traffic in the reverse direction uses basic L3VPN for transport. Since CSR6 learns 1.0.0.0/28 via BGP, label 6011 actually forwards traffic directly to CSR1 inside the VPN. This is different from the static null routes for 2.0.0.0/28 and 8.0.0.0/28 observed earlier for NAT support.

R4#show ip cef vrf 5 1.0.0.0/28
1.0.0.0/28
  nexthop 30.3.4.3 GigabitEthernet2.534 label 3000 6011

R6#show mpls forwarding-table labels 6011 detail
Local      Outgoing   Prefix           Bytes Label   Outgoing     Next Hop
Label      Label      or Tunnel Id     Switched      interface
6011       No Label   1.0.0.0/28[V]    1180          Gi2.516      10.1.6.1
        MAC/Encaps=18/18, MRU=1504, Label Stack{}
        005056A91AAA005056A9DE0D81000DBC0800
        VPN route: 1
        No output feature configured

For IPv6, the default route was imported into VRF 1 using the normal RT-import process. This is why label 92008, the original VPNv6 label allocated by XRv2, is used. This is in contrast to XRv1 where label 92009 was used (IPv6 labeled-unicast) since route leaking was used at the IPv6 Internet peering point. Returning traffic in the reverse direction is also using a basic L3VPN forwarding scheme.

R6#show ipv6 cef vrf 1 ::/0
::/0
  nexthop 30.3.6.3 GigabitEthernet2.536 label 3003 92008

RP/0/0/CPU0:XRv2#show cef vrf 7 ipv6 1000::/16
1000::/16, version 60, internal 0x5000001 0x0 (ptr 0xa13faa74) [1], 0x0 (0x0), 0x208 (0xa156d3c0)
 Prefix Len 16, traffic index 0, precedence n/a, priority 3
   via ::ffff:30.0.0.6, 3 dependencies, recursive [flags 0x6000]
    path-idx 0 NHID 0x0 [0xa18350bc 0x0]
    recursion-via-/128
    next hop VRF - 'default', table - 0xe0000000
    next hop ::ffff:30.0.0.6 via ::ffff:30.0.0.6:0
     next hop 30.3.12.3/32 Gi0/0/0/0.532 labels imposed {3000 6022}

We verify connectivity using IPv4 ping and IPv6 traceroute for brevity.

R1#ping 55.46.253.1 source 1.0.0.1
Type escape sequence to abort.
Sending 5, 100-byte ICMP Echos to 55.46.253.1, timeout is 2 seconds:
Packet sent with a source address of 1.0.0.1
!!!!!
Success rate is 100 percent (5/5), round-trip min/avg/max = 7/7/9 ms

R1#traceroute ipv6
Target IPv6 address: ::77:BABE:0
Source address: 1000::1
[snip]
Tracing the route to ::77:BABE:0
  1 FE80::6 11 msec 3 msec 4 msec
  2 ::FFFF:30.3.6.3 [MPLS: Labels 3003/92008 Exp 0] 7 msec 5 msec 11 msec
  3 FD00:10:7:12::12 [MPLS: Label 92008 Exp 0] 23 msec 22 msec 22 msec
  4 FD00:10:7:12::7 23 msec 15 msec 15 msec

CSR8 can be multi-homed to XRv1 as well. Normally this would call for providing the full routing table to CSR8, but an alternative approach is to use simple BGP manipulations to prefer one default gateway over the other. We will examine IPv6 first, saving the final NAT44 option for later. XRv1 will simply provide a default route, just like CSR6. Importing RT 12:7 matches the default route and nothing else. We confirm that only the IPv6 default route is imported before continuing.

! XRv1
vrf 8
 address-family ipv6 unicast
  import route-target
   12:7
  export route-target
   11:8
router bgp 30
 vrf 8
  rd 11:8
  address-family ipv6 unicast
  neighbor fd00:10:8:11::8
   remote-as 8
   address-family ipv6 unicast
    route-policy RPL_PASS in
    route-policy RPL_PASS out

RP/0/0/CPU0:XRv1#show bgp vpnv6 unicast vrf 8 | begin Network
   Network            Next Hop            Metric LocPrf Weight Path
Route Distinguisher: 11:8 (default for vrf 8)
*>i::/0               30.0.0.12                0    100      0 ?

Next, we configure CSR8 for BGP IPv6 unicast. The configuration is very basic, and to ensure XRv1 is used, we adjust the weight for that peer. Otherwise, CSR8 will prefer the older default route when comparing CSR6 and XRv1. The other BGP parameters, such as the network statements, are already configured on CSR8 from earlier. I add a lower weight to CSR6 just for clarity, although it is not required. This will make configuration adjustments easier later. We confirm that CSR8 learns both IPv6 default routes and prefers XRv1 over CSR6 due to the BGP weight attribute.

! CSR8
router bgp 8
 neighbor FD00:10:8:11::11 remote-as 30
 address-family ipv6
  neighbor FD00:10:6:8::6 weight 66
  neighbor FD00:10:8:11::11 activate
  neighbor FD00:10:8:11::11 weight 1111

R8#show bgp ipv6 unicast ::/0
BGP routing table entry for ::/0, version 249
Paths: (2 available, best #1, table default)
  Advertised to update-groups:
     2
  Refresh Epoch 1
  30
    FD00:10:8:11::11 (FE80::11) from FD00:10:8:11::11 (30.0.0.11)
      Origin incomplete, localpref 100, weight 1111, valid, external, best
      rx pathid: 0, tx pathid: 0x0
  Refresh Epoch 2
  30
    FD00:10:6:8::6 (FE80::6) from FD00:10:6:8::6 (30.0.0.6)
      Origin incomplete, localpref 100, weight 66, valid, external
      rx pathid: 0, tx pathid: 0
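The selection between the two default routes can be illustrated with a deliberately reduced model of best-path selection. This is a sketch only: the `Path` type, the `best_path` function, and the age values are hypothetical. It compares just weight and path age, skipping the intermediate steps (local preference, AS-path length, origin, and so on), which is reasonable here only because those attributes tie in the output above.

```python
from dataclasses import dataclass


@dataclass
class Path:
    neighbor: str
    weight: int
    age_seconds: int  # how long the path has been in the table (hypothetical values)


def best_path(paths):
    """Highest weight wins; if weights tie, prefer the oldest path."""
    return max(paths, key=lambda p: (p.weight, p.age_seconds))


weighted = [
    Path("FD00:10:8:11::11", weight=1111, age_seconds=60),    # XRv1, newer
    Path("FD00:10:6:8::6", weight=66, age_seconds=3600),      # CSR6, older
]
unweighted = [
    Path("FD00:10:8:11::11", weight=0, age_seconds=60),
    Path("FD00:10:6:8::6", weight=0, age_seconds=3600),
]

print(best_path(weighted).neighbor)    # FD00:10:8:11::11
print(best_path(unweighted).neighbor)  # FD00:10:6:8::6
```

Without the weight adjustment, the older path via CSR6 would remain best, which is why the weight is set explicitly on the XRv1 peering.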

We confirm that XRv1 learns CSR8’s public IPv6 prefix 8000::/16. Because the exported RT is something that XRv2 is already importing, we assume that this route is being advertised to the IPv6 Internet. Again, this is basic L3VPN so traceroute is used to speed up the verification.

RP/0/0/CPU0:XRv1#show bgp vrf 8 ipv6 unicast 8000::/16 | begin " 8$"
  8
    fd00:10:8:11::8 from fd00:10:8:11::8 (192.168.8.1)
      Origin IGP, metric 0, localpref 100, valid, external, best, group-best, import-candidate
      Received Path ID 0, Local Path ID 1, version 64
      Extended community: RT:11:8

R8#traceroute ipv6
Target IPv6 address: ::77:BABE:2
Source address: 8000::8
[snip]
Tracing the route to ::77:BABE:2
  1 FD00:10:8:11::11 [AS 30] 3 msec 2 msec 2 msec
  2 ::FFFF:30.3.11.3 [AS 30] [MPLS: Labels 3003/92008 Exp 0] 7 msec 6 msec 10 msec
  3 FD00:10:7:12::12 [AS 30] [MPLS: Label 92008 Exp 0] 30 msec 26 msec 22 msec
  4 FD00:10:7:12::7 [AS 30] 23 msec 15 msec 15 msec

We progress to CSR10’s IPv6 connectivity. As with CSR9, XRv1 will leak the default IPv6 route from the global table. The difference is that CSR10 is an old router, like CSR1, and is incapable of running BGP IPv6. It uses a static default route to reach the IPv6 Internet. XRv1 configures a VRF-aware static route to CSR10’s public IPv6 prefix A000::/16. Some RPL infrastructure is re-used to selectively redistribute this into BGP.

! XRv1
prefix-set PS_10
  a000::/16
end-set
vrf 10
 address-family ipv6 unicast
  import from default-vrf route-policy RPL_IF_DEST_PASS(PS_DEFAULT)
  export route-target
   11:10
router bgp 30
 vrf 10
  rd 11:10
  address-family ipv6 unicast
   redistribute static route-policy RPL_IF_DEST_PASS(PS_10)

! CSR10
ipv6 route ::/0 GigabitEthernet2.510 FE80::1

Because the prefix A000::/16 was locally originated into BGP but was not a null aggregate (that is, it has a valid next-hop), a prefix-specific VPNv6 label is allocated. The export RT of 11:10 is imported by XRv2 so it is assumed this prefix can be reached from the IPv6 Internet.

RP/0/0/CPU0:XRv1#show bgp vpnv6 unicast vrf 10 a000::/16
BGP routing table entry for a000::/16, Route Distinguisher: 11:10
Versions:
  Process           bRIB/RIB  SendTblVer
  Speaker                 20          20
    Local Label: 91008
Paths: (1 available, best #1)
  Advertised to peers (in unique update groups):
    30.0.0.12
  Path #1: Received by speaker 0
  Advertised to peers (in unique update groups):
    30.0.0.12
  Local
    fe80::10 from 0.0.0.0 (30.0.0.11)
      Origin incomplete, metric 0, localpref 100, weight 32768, valid, redistributed, best, group-best, import-candidate
      Received Path ID 0, Local Path ID 1, version 20
      Extended community: RT:11:10

As a quick review, the IPv6 labeled-unicast label of 92009 is used since the core is not IPv6-aware. This was allocated by XRv2 since traffic received from the core will use this label and the LFIB lookup occurs in the global table. When XRv2 receives traffic along this LSP, the label stack is removed and traffic is forwarded to CSR7.

RP/0/0/CPU0:XRv1#show cef vrf 10 ipv6 ::/0
::/0, version 6, proxy default, internal 0x5000011 0x0 (ptr 0xa13f9d74) [1], 0x0 (0x0), 0x208 (0xa156d460)
 Prefix Len 0, traffic index 0, precedence n/a, priority 3
   via ::ffff:30.0.0.12, 9 dependencies, recursive [flags 0x6000]
    path-idx 0 NHID 0x0 [0xa1801050 0x0]
    recursion-via-/128
    next hop VRF - 'default', table - 0xe0000000
    next hop ::ffff:30.0.0.12 via ::ffff:30.0.0.12:0
     next hop 30.3.11.3/32 Gi0/0/0/0.531 labels imposed {3003 92009}

RP/0/0/CPU0:XRv2#show mpls forwarding | include 92009
92009  Unlabelled  ::/0               Gi0/0/0/0.572 fe80::7         321200

Using traceroute, we can verify IPv6 Internet connectivity. Label 92009 is technically an IPv6 labeled-unicast label from XRv2, but on XRv1, it is used as a VPNv6 label at imposition. The label carries over when the prefix is imported from default to VRF. IPv6 Internet connectivity is significantly more straightforward, even with route leaking, than IPv4 as there is seldom a need for NAT.

R10#traceroute ipv6
Target IPv6 address: ::77:BABE:3
Source address: A000::A
[snip]
Tracing the route to ::77:BABE:3
  1 ::FFFF:30.3.11.11 4 msec 3 msec 2 msec
  2 ::FFFF:30.3.11.3 [MPLS: Labels 3003/92009 Exp 0] 8 msec 8 msec 8 msec
  3 FD00:10:7:12::12 [MPLS: Label 92009 Exp 0] 23 msec 22 msec 22 msec
  4 FD00:10:7:12::7 23 msec 15 msec 15 msec

This completes IPv6 Internet connectivity for all CE routers. CSR8’s link via XRv1 and CSR10 have not yet been enabled for IPv4 Internet connectivity. There is a third option that this document has not yet explored. The first option was local NAT on the CE device as seen on CSR9. The second option was NAT on the ingress PE as seen on CSR6 supporting CSR2 and CSR8. The third option is NAT on the egress PE towards the IPv4 Internet. This might be used when there are many ingress PEs and a small set of very powerful IPv4 Internet peering routers. The NAT44 can be centralized on these routers rather than managed on the ingress PEs. The drawback is the increased processing load that results from consolidating NAT44 on a smaller set of devices. First, the VRF RT import/export policies are defined. There is no route leaking with this solution to keep the focus on NAT44.

! XRv1
vrf 8
 address-family ipv4 unicast
  import route-target
   4:5
  export route-target
   11:8
vrf 10
 address-family ipv4 unicast
  import route-target
   4:5
  export route-target
   11:10

Next, BGP is configured between the CEs (CSR10 and CSR8) and XRv1. Weight is used on CSR8 to prefer XRv1 for testing, just as it was in IPv6. The BGP configuration is very simple as there is no route leaking or advanced filtering. The most interesting part of this configuration is that CSR8 and CSR10 advertise their private LAN routes to XRv1, and XRv1 makes no effort to filter them. This is because the routes must be advertised all the way to CSR4, the NAT44 point.

! CSR8
router bgp 8
 neighbor 10.8.11.11 remote-as 30
 address-family ipv4
  network 192.168.8.0
  neighbor 10.6.8.6 weight 66
  neighbor 10.8.11.11 activate
  neighbor 10.8.11.11 weight 1111

! CSR10
router bgp 10
 no bgp default ipv4-unicast
 neighbor 10.10.11.11 remote-as 30
 address-family ipv4
  network 192.168.10.0
  neighbor 10.10.11.11 activate

! XRv1
router bgp 30
 vrf 8
  address-family ipv4 unicast
  neighbor 10.8.11.8
   remote-as 8
   address-family ipv4 unicast
    route-policy RPL_PASS in
    route-policy RPL_PASS out
router bgp 30
 vrf 10
  address-family ipv4 unicast
  neighbor 10.10.11.10
   remote-as 10
   address-family ipv4 unicast
    route-policy RPL_PASS in
    route-policy RPL_PASS out

Before continuing, we quickly verify that XRv1 learns the routes from both CEs. We also verify that both CEs learn a default from XRv1. CSR8 learns the same route from CSR6 and XRv1, but prefers XRv1 due to the BGP weight configuration.

RP/0/0/CPU0:XRv1#show bgp vpnv4 unicast 192.168.8.0/22 longer-prefixes | begin Network
   Network            Next Hop            Metric LocPrf Weight Path
Route Distinguisher: 11:8 (default for vrf 8)
*> 192.168.8.0/24     10.8.11.8                0             0 8 i
Route Distinguisher: 11:10 (default for vrf 10)
*> 192.168.10.0/24    10.10.11.10              0             0 10 i

R8#show bgp ipv4 unicast 0.0.0.0/0
BGP routing table entry for 0.0.0.0/0, version 116
Paths: (2 available, best #1, table default)
  Advertised to update-groups:
     2
  Refresh Epoch 1
  30
    10.8.11.11 from 10.8.11.11 (30.0.0.11)
      Origin incomplete, localpref 100, weight 1111, valid, external, best
      rx pathid: 0, tx pathid: 0x0
  Refresh Epoch 2
  30
    10.6.8.6 from 10.6.8.6 (30.0.0.6)
      Origin incomplete, localpref 100, weight 66, valid, external
      rx pathid: 0, tx pathid: 0

R10#show bgp ipv4 unicast 0.0.0.0/0
BGP routing table entry for 0.0.0.0/0, version 2
Paths: (1 available, best #1, table default)
  Not advertised to any peer
  Refresh Epoch 1
  30
    10.10.11.11 from 10.10.11.11 (30.0.0.11)
      Origin incomplete, localpref 100, valid, external, best
      rx pathid: 0, tx pathid: 0x0

To protect the IPv4 Internet from learning these private LAN routes, a basic prefix filter is applied outbound on CSR4. This denies all private and IPv4 link-local prefixes, as well as the default route. The default route prefix-list is reused from before and contained in the route-map match clause using Boolean “or” logic. A quick check on CSR5 ensures that these private routes are not learned.

! CSR4
ip prefix-list PL_PRIVATE seq 5 permit 10.0.0.0/8 le 32
ip prefix-list PL_PRIVATE seq 10 permit 172.16.0.0/12 le 32
ip prefix-list PL_PRIVATE seq 15 permit 192.168.0.0/16 le 32
ip prefix-list PL_PRIVATE seq 20 permit 169.254.0.0/16 le 32
route-map RM_INTERNET_OUT deny 10
 match ip address prefix-list PL_DEFAULT PL_PRIVATE
route-map RM_INTERNET_OUT permit 100
router bgp 30
 address-family ipv4 vrf 5
  neighbor 10.4.5.5 route-map RM_INTERNET_OUT out

R5#show bgp ipv4 unicast 192.168.8.0/22 longer-prefixes
[no output]
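The filter logic above can be modeled in a few lines of Python to make the prefix-list semantics concrete. This is only a sketch, not IOS behavior verbatim: the function names are hypothetical, PL_DEFAULT is assumed to be an exact match on 0.0.0.0/0 (its definition is not shown in this section), and "le 32" is modeled as any prefix inside the covering network whose length falls between the entry's own length and 32.

```python
import ipaddress

# (network, le) pairs copied from PL_PRIVATE; le=None means exact length only.
PL_PRIVATE = [
    ("10.0.0.0/8", 32),
    ("172.16.0.0/12", 32),
    ("192.168.0.0/16", 32),
    ("169.254.0.0/16", 32),
]
# Assumption: PL_DEFAULT matches only the default route itself.
PL_DEFAULT = [("0.0.0.0/0", None)]


def prefix_list_permits(entries, prefix):
    """Return True if any entry of the modeled prefix-list matches the prefix."""
    route = ipaddress.ip_network(prefix)
    for net_text, le in entries:
        net = ipaddress.ip_network(net_text)
        max_len = net.prefixlen if le is None else le
        inside = route.subnet_of(net)
        if inside and net.prefixlen <= route.prefixlen <= max_len:
            return True
    return False


def advertised_to_internet(prefix):
    """Model RM_INTERNET_OUT: sequence 10 denies a match on either list
    (Boolean OR); sequence 100 permits everything else."""
    denied = (prefix_list_permits(PL_DEFAULT, prefix)
              or prefix_list_permits(PL_PRIVATE, prefix))
    return not denied
```

Under these assumptions, `advertised_to_internet("192.168.8.0/24")` is false (matching the empty output on CSR5), while a public prefix such as 4.8.0.0/32 passes the filter.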

Next, we configure NAT44 on CSR4. Unlike the other NAT44 configurations, we use port overloading on CSR4. This is not specific to running NAT on the egress PE; I introduce it here for variety only. Each NAT pool contains a single address, and there is a corresponding VRF-aware static route per pool. Although the NAT interfaces are in different VRFs, the router seamlessly moves packets across as necessary. CSR4 creates ACLs to match the private LANs on both CSR8 and CSR10 which are used for the inside NAT rules. These rules are also VRF-aware.

! CSR4
interface GigabitEthernet2.534
 ip nat inside
interface GigabitEthernet2.545
 ip nat outside
ip nat pool NAT_POOL_8 4.8.0.0 4.8.0.0 prefix-length 30
ip nat pool NAT_POOL_10 4.10.0.0 4.10.0.0 prefix-length 30
ip route vrf 5 4.8.0.0 255.255.255.255 Null0
ip route vrf 5 4.10.0.0 255.255.255.255 Null0
ip access-list standard ACL_NAT_10
 permit 192.168.10.0 0.0.0.255
ip access-list standard ACL_NAT_8
 permit 192.168.8.0 0.0.0.255
ip nat inside source list ACL_NAT_10 pool NAT_POOL_10 vrf 5 overload
ip nat inside source list ACL_NAT_8 pool NAT_POOL_8 vrf 5 overload

The BGP configuration involves redistributing these static routes. Since there is a VRF-aware default route already being redistributed into BGP, we technically don’t need to configure anything new. The configuration is shown again for reference, but this command was already configured to advertise the VRF-aware default into BGP. We quickly confirm that these NAT pool prefixes were advertised into BGP properly.

! CSR4, reference only
 address-family ipv4 vrf 5
  redistribute static

R4#show bgp vpnv4 unicast vrf 5 4.8.0.0/14 longer-prefixes | begin Network
     Network          Next Hop            Metric LocPrf Weight Path
Route Distinguisher: 4:5 (default for vrf 5)
Export Map: RM_VRF_5_TO_GLOBAL, Address-Family: IPv4 Unicast, Pfx Count/Limit: 5/1000
 *>  4.8.0.0/32       0.0.0.0                  0         32768 ?
 *>  4.10.0.0/32      0.0.0.0                  0         32768 ?

Because CSR4’s export RT policy for VRF 5 adds a bogus RT by default, these NAT pool routes carry that RT. No router imports this RT anywhere in the network and it was only required to force route leaking to work. As such, no fancy filtering is required to prevent the CEs from learning these NAT pool routes. XRv1 is unable to import either route as shown below.

R4#show bgp vpnv4 unicast vrf 5 4.8.0.0/32 | include RT:
      Extended Community: RT:4:999
R4#show bgp vpnv4 unicast vrf 5 4.10.0.0/32 | include RT:
      Extended Community: RT:4:999

RP/0/0/CPU0:XRv1#show bgp vpnv4 unicast vrf 8 4.8.0.0/32
% Network not in table
RP/0/0/CPU0:XRv1#show bgp vpnv4 unicast vrf 10 4.10.0.0/32
% Network not in table


Of greater importance is ensuring the IPv4 Internet (CSR5) learns these routes. We quickly confirm that below. This completes the control-plane verification as all of the routing information is correct.

R5#show bgp ipv4 unicast 4.8.0.0/14 longer-prefixes | begin Network
     Network          Next Hop            Metric LocPrf Weight Path
 *>  4.8.0.0/32       10.4.5.4                 0             0 30 ?
 *>  4.10.0.0/32      10.4.5.4                 0             0 30 ?

When traffic arrives at XRv1 from the customers, it follows the exact same LSP since both paths recurse to the default route. The transport LSP using label 3001 has been traced multiple times already, so that part is skipped.

RP/0/0/CPU0:XRv1#show cef vrf 8 55.46.253.193 | include labels
    next hop 30.3.11.3/32 Gi0/0/0/0.531 labels imposed {3001 4009}
RP/0/0/CPU0:XRv1#show cef vrf 10 55.46.253.193 | include labels
    next hop 30.3.11.3/32 Gi0/0/0/0.531 labels imposed {3001 4009}

When CSR4 receives label 4009, it is directed to perform an ordinary IPv4 FIB lookup in VRF 5. Because the IPv4 packet is revealed to CSR4, it can perform inside NAT on this traffic. The MPLS label identifies the proper VRF, which in turn determines which NAT rule to use.

R4#show mpls forwarding-table labels 4009
Local      Outgoing   Prefix           Bytes Label   Outgoing      Next Hop
Label      Label      or Tunnel Id     Switched      interface
4009       No Label   0.0.0.0/0[V]     6490          aggregate/5

R4#show ip cef vrf 5 55.46.253.193
55.46.253.192/26
  nexthop 10.4.5.5 GigabitEthernet2.545

Because NAT overload is used, there are no outside-to-inside mappings created. The post-NAT source addresses are 4.8.0.0 and 4.10.0.0, which are the only IPv4 addresses available in each pool. Since CSR4 learns the original private addresses (inside locals), those are identified so that the return flow can be un-NAT’ed.

R4#show ip nat translations
Pro  Inside global    Inside local      Outside local      Outside global
icmp 4.8.0.0:1        192.168.8.1:6     55.46.253.193:6    55.46.253.193:1
icmp 4.10.0.0:1       192.168.10.1:5    55.46.253.193:5    55.46.253.193:1
Total number of translations: 2
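To illustrate how an overloaded pool with a single address can still return traffic to the right inside host, here is a toy Python model of the translation table. The class and method names are hypothetical, and the allocator simply hands out the lowest unused identifier per pool address. That happens to reproduce the ICMP identifiers in the output above, but it is an assumption for illustration and not a statement about how IOS allocates ports or identifiers.

```python
class OverloadNat:
    """Toy model of one 'ip nat inside source ... pool ... overload' rule."""

    def __init__(self, pool_address):
        self.pool_address = pool_address
        self.next_id = 1      # assumed allocator: lowest unused identifier
        self.by_inside = {}   # (inside local IP, inside ID) -> global ID
        self.by_global = {}   # global ID -> (inside local IP, inside ID)

    def translate_out(self, inside_ip, inside_id):
        """Inside-to-outside: rewrite the source to (pool address, global ID)."""
        key = (inside_ip, inside_id)
        if key not in self.by_inside:
            self.by_inside[key] = self.next_id
            self.by_global[self.next_id] = key
            self.next_id += 1
        return (self.pool_address, self.by_inside[key])

    def translate_in(self, global_ip, global_id):
        """Outside-to-inside: restore the inside local pair, or None when no
        translation exists (overload creates no outside-initiated entries)."""
        if global_ip != self.pool_address:
            return None
        return self.by_global.get(global_id)


nat_vrf8 = OverloadNat("4.8.0.0")
first_flow = nat_vrf8.translate_out("192.168.8.1", 6)
```

The reverse lookup is what allows the reply destined for 4.8.0.0 to be mapped back to 192.168.8.1 before the VRF route lookup takes place.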

Because outside NAT happens before routing, the reply traffic destined for 4.8.0.0 or 4.10.0.0 is translated back into the correct inside local address as identified in the NAT translation table. Since the return path is regular L3VPN, two labels are imposed, and the VPN label differs for each as these are unique, non-aggregate prefixes. This completes the return path.

R4#show ip cef vrf 5 192.168.8.0
192.168.8.0/24
  nexthop 30.3.4.3 GigabitEthernet2.534 label 3002 91009
R4#show ip cef vrf 5 192.168.10.0
192.168.10.0/24
  nexthop 30.3.4.3 GigabitEthernet2.534 label 3002 91004

Additional Reading – Reference configurations "inet-access"

37. Describe, implement, and troubleshoot end-to-end fast convergence

37.1 Loop Free Alternate (LFA) for IPv4

IPv4 LFA is supported for OSPFv2, IS-IS, and EIGRP. The purpose of this feature is to create a loop-free path from a given node by running SPF using other routers as the root of the tree (not including EIGRP, which is discussed in a dedicated section). Specifically, a router with LFA configuration only runs SPF from the perspective of its neighbors, not every router in the area/level, as the local router is just trying to find a backup next-hop towards a specific destination.

Because all of the LFA testing was done using the same topology, the configurations are provided once. To move through the demonstrations, all you need to do is shut down or enable certain interfaces, change administrative distances, enable/disable LFA where desired, tune metrics, etc. All of the IGPs for both IPv4 and IPv6 have been pre-built.

Additional Reading (includes all direct and IPv4 rLFA examples) – Reference configurations "lfa"

37.1.1 OSPFv2

37.1.1.1 Direct LFA

The network diagram is shown below. Many diverse IGP paths are used so that testing the different LFA variables is made easy.


The basic LFA configuration is shown below, beginning with the XE routers. LFA is enabled on a per-prefix basis, which is the only option supported on XE (XR also supports per-link, which is discussed later). The concept of prefix-priority is related to the order in which prefixes are updated in the RIB/FIB after SPF is finished. Higher priorities are processed first, and generally these are loopbacks (LSP endpoints) or IPTV SSM sources. A route-map can be applied to select prefixes for high-priority, but the default is /32 prefixes. Prefix-priority is not discussed in detail as it is very straightforward.

! All XE routers
router ospf 1
 fast-reroute per-prefix enable area 0 prefix-priority high

A quick summary on CSR1 shows the default tiebreaker policy for backup paths. The tiebreakers are used to select the best backup when multiple backups exist. These are examined in great detail.

CSR1#show ip ospf fast-reroute

            OSPF Router with ID (1.1.1.1) (Process ID 1)

Loop-free Fast Reroute protected prefixes:

  Area   Topology name   Priority   Remote LFA Enabled
     0   Base            High       No

Repair path selection policy tiebreaks (built-in default policy):
  10  srlg
  20  primary-path
  30  interface-disjoint
  40  lowest-metric
  50  linecard-disjoint
  60  node-protecting
  70  broadcast-interface-disjoint
 256  load-sharing

To simplify the output initially, I have increased the cost of CSR1's LAN segment and P2P CSR4 link to 5. This gives us a single best-path through the P2P CSR2 connection, which is a good starting point. I have configured all of the CSRs to keep all backup paths, even ignored ones, as this will make troubleshooting easier. The command consumes more memory with little operational gain beyond showing the administrator more information. It is very useful for troubleshooting as we will see soon.

! CSR1
interface GigabitEthernet2.513
 ip ospf cost 5
interface GigabitEthernet2.514
 ip ospf cost 5
! All XE routers
router ospf 1
 fast-reroute keep-all-paths

Let's examine how CSR1 sees XRv14's loopback with IPv4 address 14.14.14.14/32. For readability, I am breaking the output of this command into sections and describing each section individually.

CSR1#show ip ospf rib 14.14.14.14 255.255.255.255

            OSPF Router with ID (1.1.1.1) (Process ID 1)

                Base Topology (MTID 0)

OSPF local RIB
Codes: * - Best, > - Installed in global RIB
LSA: type/LSID/originator

The first entry is the best path via the P2P link. The “RIB” flag without the “Repair” flag is an indication that this is the best path.

*>  14.14.14.14/32, Intra, cost 3, area 0
     SPF Instance 20, age 00:00:19
     Flags: RIB, HiPrio
      via 12.0.0.2, GigabitEthernet2.512
       Flags: RIB
       LSA: 1/14.14.14.14/14.14.14.14

This is the repair path via XRv13 over the LAN. When compared against the best path, it is a different interface (and by extension a different broadcast interface), is downstream to the destination (closer), and provides node protection since CSR2 is not in the repair path. These are all “good” things that generally make a repair path more desirable. Having repair paths rely on the same interfaces, same linecards, and same nodes may result in lower availability.

      repair path via 123.0.0.13, GigabitEthernet2.513, cost 7
       Flags: RIB, Repair, IntfDj, BcastDj, CostWon, NodeProt, Downstr
       LSA: 1/14.14.14.14/14.14.14.14

This candidate path was ignored because of the cost tiebreaker. The path is also interface disjoint but has a higher cost than the path shown above.

      repair path via 14.0.0.4, GigabitEthernet2.514, cost 8
       Flags: Ignore, Repair, IntfDj, BcastDj, NodeProt
       LSA: 1/14.14.14.14/14.14.14.14

This path is equal in many ways to the selected repair path, but was ignored due to not providing node protection. This is another link through CSR2, so LFA would rather use a different node in the transit path, all higher-priority tiebreakers being equal. Node protection guarantees that if CSR2 fails completely, the LFA does not rely on CSR2 as the next-hop.

      repair path via 123.0.0.2, GigabitEthernet2.513, cost 7
       Flags: Ignore, Repair, IntfDj, BcastDj, Downstr
       LSA: 1/14.14.14.14/14.14.14.14
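The "downstream" and "node-protecting" properties seen in these flags correspond to simple distance comparisons, as defined in RFC 5286. The Python sketch below evaluates them with Dijkstra on a small hypothetical topology. The node names (S for the computing router, E for the primary next-hop, N1 to N3 for candidate neighbors, D for the destination) and the link costs are invented for illustration and are not the lab topology; costs are also assumed symmetric, which real OSPF does not require.

```python
import heapq


def shortest_costs(graph, root):
    """Dijkstra: lowest cost from root to every reachable node."""
    dist = {root: 0}
    heap = [(0, root)]
    while heap:
        cost, node = heapq.heappop(heap)
        if cost > dist.get(node, float("inf")):
            continue  # stale heap entry
        for neigh, link_cost in graph[node].items():
            new_cost = cost + link_cost
            if new_cost < dist.get(neigh, float("inf")):
                dist[neigh] = new_cost
                heapq.heappush(heap, (new_cost, neigh))
    return dist


def build_graph(links):
    """Build an undirected graph; each link has the same cost both ways."""
    graph = {}
    for a, b, cost in links:
        graph.setdefault(a, {})[b] = cost
        graph.setdefault(b, {})[a] = cost
    return graph


def classify_lfa(graph, source, primary_nh, neighbor, dest):
    """Evaluate the three RFC 5286 inequalities for one candidate neighbor."""
    d_s = shortest_costs(graph, source)
    d_n = shortest_costs(graph, neighbor)
    d_e = shortest_costs(graph, primary_nh)
    return {
        # The neighbor does not route back through the computing router.
        "loop_free": d_n[dest] < d_n[source] + d_s[dest],
        # The neighbor is strictly closer to the destination than we are.
        "downstream": d_n[dest] < d_s[dest],
        # The neighbor's path does not depend on the primary next-hop.
        "node_protecting": d_n[dest] < d_n[primary_nh] + d_e[dest],
    }


# Hypothetical topology: primary path is S -> E -> D with cost 2.
GRAPH = build_graph([
    ("S", "E", 1), ("E", "D", 1),
    ("S", "N1", 2), ("N1", "D", 1),
    ("S", "N2", 1), ("N2", "E", 1),
    ("S", "N3", 1),
])
```

In this made-up topology, N1 satisfies all three inequalities, N2 is loop-free but neither downstream nor node-protecting (it reaches D only through E at equal cost), and N3 is not loop-free at all because its only path to D runs back through S.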

Let's say we want to use CSR4 as the backup path. We can use the shared risk link group (SRLG) to identify the P2P CSR2 and LAN interfaces as sharing risk; that is, we don't want to re-route traffic between two links in the same SRLG because perhaps they use the same fiber pair or physical conduit. This is an administrative setting similar to the MPLS-TE SRLG and serves an identical purpose, except that it is local only (not carried in topology information).

! CSR1
interface GigabitEthernet2.512
 srlg gid 222
interface GigabitEthernet2.513
 srlg gid 222

The paths sharing the SRLG are marked in the OSPF RIB with the “SRLG” flag and are now ignored. Notice that CSR4 is technically not "downstream" from CSR1 towards the destination. This is because both nodes have a cost of 3 to reach 14.14.14.14/32, and like the EIGRP feasibility condition, the next-hop node must be strictly closer (lower metric, cannot be equal) to the destination. CSR4 is also higher cost than those LAN interfaces, but SRLG is the first and most important tiebreaker by default. This is the “trump card” that can administratively exclude certain links from LFA consideration. CSR4 is the highest cost backup path, but is also the most desirable.

CSR1#show ip ospf rib 14.14.14.14 255.255.255.255
[snip]
*>  14.14.14.14/32, Intra, cost 3, area 0
     SPF Instance 21, age 00:07:24
     Flags: RIB, HiPrio
      via 12.0.0.2, GigabitEthernet2.512
       Flags: RIB
       LSA: 1/14.14.14.14/14.14.14.14
      repair path via 14.0.0.4, GigabitEthernet2.514, cost 8
       Flags: RIB, Repair, IntfDj, BcastDj, NodeProt
       LSA: 1/14.14.14.14/14.14.14.14
      repair path via 123.0.0.13, GigabitEthernet2.513, cost 7
       Flags: Ignore, Repair, IntfDj, BcastDj, SRLG, NodeProt, Downstr
       LSA: 1/14.14.14.14/14.14.14.14
      repair path via 123.0.0.2, GigabitEthernet2.513, cost 7
       Flags: Ignore, Repair, IntfDj, BcastDj, SRLG, Downstr
       LSA: 1/14.14.14.14/14.14.14.14

With the SRLGs in place, next we will change some OSPF costs. The cost on the CSR2 P2P link (on CSR1 side) has been set to 5 with other CSR1 link costs set to 1. The cost on CSR2 to XRv14 has been increased to 2, making CSR1 select the LAN interface via XRv13 to reach 14.14.14.14/32. I did not remove the SRLG, so the best-path is through an interface with SRLG set. This means that we don't really want to use the CSR2 P2P link as a backup. It also has a worse cost, so even without the SRLG, CSR2 P2P link would not be preferred.

! CSR1
interface GigabitEthernet2.512
 ip ospf cost 5
interface GigabitEthernet2.513
 ip ospf cost 1
interface GigabitEthernet2.514
 ip ospf cost 1
! CSR2
interface GigabitEthernet2.524
 ip ospf cost 2

CSR1#show ip ospf rib 14.14.14.14 255.255.255.255
[snip]
*>  14.14.14.14/32, Intra, cost 3, area 0
     SPF Instance 25, age 00:00:11
     Flags: RIB, HiPrio
      via 123.0.0.13, GigabitEthernet2.513
       Flags: RIB
       LSA: 1/14.14.14.14/14.14.14.14
      repair path via 14.0.0.4, GigabitEthernet2.514, cost 4
       Flags: RIB, Repair, IntfDj, BcastDj
       LSA: 1/14.14.14.14/14.14.14.14
      repair path via 12.0.0.2, GigabitEthernet2.512, cost 8
       Flags: Ignore, Repair, IntfDj, SRLG
       LSA: 1/14.14.14.14/14.14.14.14

To prove the claim above, we remove the SRLGs; the end result is the same, but cost is the differentiator. Notice that the path through CSR4 is not considered node-protecting because CSR4 has equal cost paths through XRv13 and CSR5 to reach 14.14.14.14/32. That is, we cannot guarantee that if we fast-reroute traffic to CSR4 that CSR4 won’t forward it through XRv13. CSR4’s ECMP load-sharing mechanism could potentially forward traffic through XRv13 which means node-protection cannot be assumed. Node-protection was not analyzed for this LFA decision, but confirming the observation is good practice.

! CSR1
interface GigabitEthernet2.512
 no srlg gid 222
interface GigabitEthernet2.513
 no srlg gid 222

CSR1#show ip ospf rib 14.14.14.14 255.255.255.255
[snip]
*>  14.14.14.14/32, Intra, cost 3, area 0
     SPF Instance 26, age 00:00:49
     Flags: RIB, HiPrio
      via 123.0.0.13, GigabitEthernet2.513
       Flags: RIB
       LSA: 1/14.14.14.14/14.14.14.14
      repair path via 14.0.0.4, GigabitEthernet2.514, cost 4
       Flags: RIB, Repair, IntfDj, BcastDj, CostWon
       LSA: 1/14.14.14.14/14.14.14.14
      repair path via 12.0.0.2, GigabitEthernet2.512, cost 8
       Flags: Ignore, Repair, IntfDj

CSR4#show ip route 14.14.14.14
Routing entry for 14.14.14.14/32
  Known via "ospf 1", distance 110, metric 3, type intra area
  Last update from 45.0.0.5 on GigabitEthernet2.545, 00:29:25 ago
  Routing Descriptor Blocks:
    45.0.0.5, from 14.14.14.14, 00:29:25 ago, via GigabitEthernet2.545
      Route metric is 3, traffic share count is 1
      Repair Path: 34.0.0.13, via GigabitEthernet2.534
  * 34.0.0.13, from 14.14.14.14, 00:29:25 ago, via GigabitEthernet2.534
      Route metric is 3, traffic share count is 1
      Repair Path: 45.0.0.5, via GigabitEthernet2.545

We will attempt to make CSR1’s LFA shown above “node-protecting”. A quick cost adjustment on CSR4 (increase cost on link towards XRv13 to 2) shows the path above as being node-protected since CSR4 is only routing through CSR5. This doesn't change the LFA selection but is just an interesting note. Personally, even if node-protection is low on the tiebreaker list, I would try to design the network such that node-protection was made available as often as possible. Now XRv13 could fail completely and it would have no effect on the backup path, which was not the case before the CSR4 cost adjustment.

! CSR4
interface GigabitEthernet2.534
 ip ospf cost 2

CSR4#show ip route 14.14.14.14
Routing entry for 14.14.14.14/32
  Known via "ospf 1", distance 110, metric 3, type intra area
  Last update from 45.0.0.5 on GigabitEthernet2.545, 00:30:10 ago
  Routing Descriptor Blocks:
  * 45.0.0.5, from 14.14.14.14, 00:30:10 ago, via GigabitEthernet2.545
      Route metric is 3, traffic share count is 1

CSR1#show ip ospf rib 14.14.14.14 255.255.255.255
[snip]
*>  14.14.14.14/32, Intra, cost 3, area 0
     SPF Instance 27, age 00:06:17
     Flags: RIB, HiPrio
      via 123.0.0.13, GigabitEthernet2.513
       Flags: RIB
       LSA: 1/14.14.14.14/14.14.14.14
      repair path via 14.0.0.4, GigabitEthernet2.514, cost 4
       Flags: RIB, Repair, IntfDj, BcastDj, CostWon, NodeProt
       LSA: 1/14.14.14.14/14.14.14.14
      repair path via 12.0.0.2, GigabitEthernet2.512, cost 8
       Flags: Ignore, Repair, IntfDj
       LSA: 1/14.14.14.14/14.14.14.14

One final example shows the best-path through XRv13 with the CSR4 P2P link disabled. CSR1 can select either link on CSR2, which is both downstream and node-protecting. LFA should select the P2P link versus the LAN interface as it is interface-disjoint (disjoint meaning the same LAN interface was not reused). To facilitate this, CSR2's cost on the LAN interface is increased to 2 because if CSR2 routes to XRv13 over that link, it is not loop free, and cannot be a backup path. The below output shows that the path through the CSR2 P2P link was selected based on "IntfDj". The path is also broadcast interface disjoint but that was not considered (this particular option is discussed later). Of note, the cost through this preferred backup path is higher than the ignored one; this is because interface disjoint is processed before lowest metric by default.

! CSR1
interface GigabitEthernet2.514
 shutdown
! CSR2
interface GigabitEthernet2.513
 ip ospf cost 2

CSR1#show ip ospf rib 14.14.14.14 255.255.255.255
[snip]
*>  14.14.14.14/32, Intra, cost 3, area 0
     SPF Instance 30, age 00:17:41
     Flags: RIB, HiPrio
      via 123.0.0.13, GigabitEthernet2.513
       Flags: RIB
       LSA: 1/14.14.14.14/14.14.14.14
      repair path via 12.0.0.2, GigabitEthernet2.512, cost 8
       Flags: RIB, Repair, IntfDj, BcastDj, NodeProt
       LSA: 1/14.14.14.14/14.14.14.14
      repair path via 123.0.0.2, GigabitEthernet2.513, cost 4
       Flags: Ignore, Repair, NodeProt
       LSA: 1/14.14.14.14/14.14.14.14

To modify the tiebreaker options, you can manually configure a new sequence. In this example, we reapply an SRLG to CSR1's CSR2 P2P and LAN interfaces. We then tell OSPF that SRLG disjoint is required (not just preferred), as are downstream nodes. Notice that changing the selection process wipes out the default settings, so you cannot make a few changes and maintain the default policy. You'd need to redefine the entire selection process. CSR1 only cares about the SRLG-disjoint and downstream characteristics now; cost, for example, is not considered by LFA at all.

! CSR1
interface GigabitEthernet2.512
 srlg gid 45
interface GigabitEthernet2.513
 srlg gid 45
! CSR1
router ospf 1
 fast-reroute per-prefix tie-break srlg required index 10
 fast-reroute per-prefix tie-break downstream required index 20

CSR1#show ip ospf fast-reroute
[snip]
Repair path selection policy tiebreaks:
  10  srlg (required)
  20  downstream (required)
 256  load-sharing

All OSPF costs have been reset to 1 on CSR2. The link between CSR1 and CSR4 has been enabled again. CSR1's cost on the LAN interface has been set to 5, making the P2P CSR2 link the best path. However, notice how there is no backup path. The paths through CSR2 and XRv13 on the LAN are not available because they share an SRLG with the primary path (although both are downstream). The link through CSR4 is not available because it is not downstream to the destination (both have a cost of 3), although it passes the SRLG test. This demonstrates the danger of using the "required" keyword for LFA tiebreakers as the logic between required options is Boolean AND, not Boolean OR. All conditions must be true, but in our case we have 3 candidate repair paths that are ignored because no single path meets both criteria.

! CSR2
interface GigabitEthernet2.513
 ip ospf cost 1
! CSR1
interface GigabitEthernet2.514
 no shutdown
interface GigabitEthernet2.513
 ip ospf cost 5
interface GigabitEthernet2.512
 ip ospf cost 1

CSR1#show ip ospf rib 14.14.14.14 255.255.255.255
[snip]
*>  14.14.14.14/32, Intra, cost 3, area 0
     SPF Instance 37, age 00:00:12
     Flags: RIB, HiPrio
      via 12.0.0.2, GigabitEthernet2.512
       Flags: RIB
       LSA: 1/14.14.14.14/14.14.14.14
      repair path via 14.0.0.4, GigabitEthernet2.514, cost 4
       Flags: Ignore, Repair, IntfDj, BcastDj, NodeProt
       LSA: 1/14.14.14.14/14.14.14.14
      repair path via 123.0.0.2, GigabitEthernet2.513, cost 7
       Flags: Ignore, Repair, IntfDj, BcastDj, SRLG, Downstr
       LSA: 1/14.14.14.14/14.14.14.14
      repair path via 123.0.0.13, GigabitEthernet2.513, cost 7
       Flags: Ignore, Repair, IntfDj, BcastDj, SRLG, NodeProt, Downstr
       LSA: 1/14.14.14.14/14.14.14.14
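The difference between a preferred and a required tiebreaker can be sketched in Python using the three candidates from the output above. This is a simplified reading of the behavior described in this section, not the IOS algorithm: the function name and data layout are hypothetical, only Boolean attributes are modeled (lowest-metric is left out), and a path is treated as satisfying "srlg" when it does not carry the SRLG flag.

```python
def select_repair_paths(candidates, policy):
    """Apply tiebreakers in index order and return the surviving next-hops.

    policy is a list of (index, attribute, required) tuples.
    A required tiebreaker eliminates every path lacking the attribute, even
    if that leaves nothing. A preferred tiebreaker narrows the set only when
    at least one path has the attribute.
    """
    remaining = list(candidates)
    for _index, attribute, required in sorted(policy):
        matching = [c for c in remaining if attribute in c["satisfies"]]
        if required or matching:
            remaining = matching
    return [c["via"] for c in remaining]


# Candidates taken from the CSR1 OSPF RIB flags shown above.
CANDIDATES = [
    {"via": "14.0.0.4", "satisfies": {"srlg", "node-protecting"}},
    {"via": "123.0.0.2", "satisfies": {"downstream"}},
    {"via": "123.0.0.13", "satisfies": {"downstream", "node-protecting"}},
]

REQUIRED_POLICY = [(10, "srlg", True), (20, "downstream", True)]
PREFERRED_POLICY = [(10, "srlg", False), (20, "downstream", False)]

with_required = select_repair_paths(CANDIDATES, REQUIRED_POLICY)
with_preferred = select_repair_paths(CANDIDATES, PREFERRED_POLICY)
```

With both tiebreakers required, no candidate survives, which mirrors the three ignored repair paths. With the same tiebreakers merely preferred, the model keeps the path via 14.0.0.4 because it wins the first tiebreaker and the second one cannot narrow the set any further.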

A quick fix would be to somehow make CSR4 downstream (closer) to 14.14.14.14/32. If we increase the cost on CSR1's P2P interface to CSR2 to a number less than 5, CSR1 could potentially pick CSR4 as a best-path. Let's assume we want to keep the current best-path (via the CSR2 P2P link) but use CSR4 as a repair-path without modifying the tie-breakers. Increasing the cost on both the CSR2 and XRv13 P2P links to XRv14, followed by increasing the cost on CSR1's P2P link to CSR4, should meet the goal. The first modification would make CSR1 think that CSR2, XRv13, and CSR4 are all equidistant from XRv14. Increasing the cost on CSR1's P2P link to CSR4 ensures that the best-path is maintained and CSR4 is now considered downstream.

! CSR2
interface GigabitEthernet2.524
 ip ospf cost 2
! XRv13
router ospf 1
 area 0
  interface GigabitEthernet0/0/0/0.534
   cost 2
! CSR1
interface GigabitEthernet2.514
 ip ospf cost 2

Now, the path through CSR4 is both downstream (CSR4's cost is 3, CSR1's cost is 4, and 3 < 4) and does not share an SRLG with the primary path. It is a valid backup-path per our tie-breaker configuration. Before continuing, those custom tie-breaker commands are removed.

CSR1#show ip ospf rib 14.14.14.14 255.255.255.255
[snip]
*>  14.14.14.14/32, Intra, cost 4, area 0
     SPF Instance 40, age 00:00:52
     Flags: RIB, HiPrio
      via 12.0.0.2, GigabitEthernet2.512
       Flags: RIB
       LSA: 1/14.14.14.14/14.14.14.14
      repair path via 14.0.0.4, GigabitEthernet2.514, cost 5
       Flags: RIB, Repair, IntfDj, BcastDj, NodeProt, Downstr
       LSA: 1/14.14.14.14/14.14.14.14
      repair path via 123.0.0.2, GigabitEthernet2.513, cost 8
       Flags: Ignore, Repair, IntfDj, BcastDj, SRLG, Downstr
       LSA: 1/14.14.14.14/14.14.14.14
      repair path via 123.0.0.13, GigabitEthernet2.513, cost 7
       Flags: Ignore, Repair, IntfDj, BcastDj, SRLG, NodeProt, Downstr
       LSA: 1/14.14.14.14/14.14.14.14

An interesting corner-case is using the broadcast-interface-disjoint tiebreaker. This ensures that an LFA does not transit through the same broadcast interface used by the primary path, but from the perspective of another router. For example, if CSR1 selects the best path via XRv13 (LAN interface) to reach 14.14.14.14/32 and CSR2 is an LFA, it is preferable for CSR2 not to follow the path CSR2 > XRv13 > XRv14. This would cause the primary and backup paths to both rely on the same LAN, which is what the BcastDj option attempts to avoid. Instead, the LFA via CSR4 > XRv13 > XRv14 is a better option. Adjusting costs as shown below ensures that XRv13 > XRv14 is the primary path, but the LFA via CSR2 is ignored because it uses the same LAN as the primary path.

! CSR1
interface GigabitEthernet2.512
 ip ospf cost 2
interface GigabitEthernet2.514
 ip ospf cost 2
! CSR2
interface GigabitEthernet2.524
 ip ospf cost 3
! CSR4
interface GigabitEthernet2.545
 ip ospf cost 2

CSR1#show ip ospf rib 14.14.14.14 255.255.255.255
[snip]
*>  14.14.14.14/32, Intra, cost 3, area 0
     SPF Instance 15, age 00:00:20
     Flags: RIB, HiPrio
      via 123.0.0.13, GigabitEthernet2.513
       Flags: RIB
       LSA: 1/14.14.14.14/14.14.14.14
      repair path via 14.0.0.4, GigabitEthernet2.514, cost 5
       Flags: RIB, Repair, IntfDj, BcastDj
       LSA: 1/14.14.14.14/14.14.14.14
      repair path via 12.0.0.2, GigabitEthernet2.512, cost 5
       Flags: Ignore, Repair, IntfDj
       LSA: 1/14.14.14.14/14.14.14.14

CSR2#show ip route 14.14.14.14
Routing entry for 14.14.14.14/32
  Known via "ospf 1", distance 110, metric 3, type intra area
  Last update from 123.0.0.13 on GigabitEthernet2.513, 00:07:26 ago
  Routing Descriptor Blocks:
  * 123.0.0.13, from 14.14.14.14, 00:07:26 ago, via GigabitEthernet2.513
      Route metric is 3, traffic share count is 1
      Repair Path: 24.0.0.14, via GigabitEthernet2.524

Next, we will do some basic tests with XR direct LFA using OSPFv2. The network has been reset with all OSPF costs set to 1 and all interfaces enabled (cleanup not shown). We enable per-prefix LFA under area 0 which applies to all area 0 interfaces also (by inheritance). This is applied on XRv14 only since XRv13 will be used for per-link LFA later.

! XRv14
router ospf 1
 area 0
  fast-reroute per-prefix

Examining the route to CSR1's loopback, we see two equal cost paths, both of which back each other up. Notice they are primary paths (both are used for ECMP), downstream (both CSR2 and XRv13 are closer to CSR1 than XRv14 is), node protecting (the paths do not rely on the same transit routers), and SRLG disjoint.

RP/0/0/CPU0:XRv14#show ospf routes 1.1.1.1/32 backup-path detail

OSPF Route entry for 1.1.1.1/32
  Route type: Intra-area
  Metric: 3
  SPF priority: 4, SPF version: 69
  RIB version: 0, Source: Unknown
      43.0.0.13, from 1.1.1.1, via GigabitEthernet0/0/0/0.543, path-id 2
         Backup path:
            24.0.0.2, from 1.1.1.1, via GigabitEthernet0/0/0/0.524, protected bitmap 0000000000000002
            Attribues: Metric: 3, Primary, Downstream, Node Protect, SRLG Disjoint
      24.0.0.2, from 1.1.1.1, via GigabitEthernet0/0/0/0.524, path-id 1
         Backup path:
            43.0.0.13, from 1.1.1.1, via GigabitEthernet0/0/0/0.543, protected bitmap 0000000000000001
            Attribues: Metric: 3, Primary, Downstream, Node Protect, SRLG Disjoint

We can see another path through CSR5 by looking at the diagram but OSPF does not display it here. We will apply SRLGs to XRv14's interfaces facing CSR2 and XRv13 so that we prefer not to use backup paths that way. We make SRLG the #1 tiebreaker also, following the same convention tested with XE earlier. Additionally, we will increase the cost of XRv14's interface facing CSR2 to 2 so that ECMP is not in play. Notice that applying SRLGs in XR is identical across LFA and MPLS-TE. In XE, there are two different mechanisms, but XR consolidates this nicely.

! XRv14
srlg
 interface GigabitEthernet0/0/0/0.524
  8 value 654
 interface GigabitEthernet0/0/0/0.543
  8 value 654
router ospf 1
 fast-reroute per-prefix tiebreaker srlg-disjoint index 1
 area 0
  interface GigabitEthernet0/0/0/0.524
   cost 2

RP/0/0/CPU0:XRv14#show ospf routes 1.1.1.1/32 backup-path detail

OSPF Route entry for 1.1.1.1/32
  Route type: Intra-area
  Metric: 3
  SPF priority: 4, SPF version: 74
  RIB version: 0, Source: Unknown
      43.0.0.13, from 1.1.1.1, via GigabitEthernet0/0/0/0.543, path-id 1
         Backup path:
            54.0.0.5, from 1.1.1.1, via GigabitEthernet0/0/0/0.554, protected bitmap 0000000000000001
            Attribues: Metric: 4, Node Protect, SRLG Disjoint

If we remove the SRLGs (delete the entire feature for brevity), the router goes back to picking CSR2. At this point, both paths are SRLG disjoint but CSR2 is downstream (closer to CSR1 than XRv14 is) where CSR5 is not.

! XRv14
no srlg

RP/0/0/CPU0:XRv14#show ospf routes 1.1.1.1/32 backup-path detail

OSPF Route entry for 1.1.1.1/32
  Route type: Intra-area
  Last updated: MON 8 12:54:59.512
  Metric: 3
  SPF priority: 4, SPF version: 74
  RIB version: 0, Source: Unknown
      43.0.0.13, from 1.1.1.1, via GigabitEthernet0/0/0/0.543, path-id 1
         Backup path:
            24.0.0.2, from 1.1.1.1, via GigabitEthernet0/0/0/0.524, protected bitmap 0000000000000001
            Attribues: Metric: 4, Downstream, Node Protect, SRLG Disjoint

Let's make CSR5 downstream somehow. We accomplish this by significantly increasing all of XRv14's metrics so that all neighbors appear downstream. We will also increase CSR2's metric by even more so that it loses the LFA tie-break based on worse cost. To prove that metric is the tie-breaker, we will put the downstream attribute as the #2 item in the tie-breaker process.

! XRv14
router ospf 1
 fast-reroute per-prefix tiebreaker downstream index 2
 area 0
  interface GigabitEthernet0/0/0/0.524
   cost 110
  interface GigabitEthernet0/0/0/0.543
   cost 101
  interface GigabitEthernet0/0/0/0.554
   cost 101

RP/0/0/CPU0:XRv14#show ospf routes 1.1.1.1/32 backup-path detail

OSPF Route entry for 1.1.1.1/32
  Route type: Intra-area
  Metric: 103
  SPF priority: 4, SPF version: 91
  RIB version: 0, Source: Unknown
      43.0.0.13, from 1.1.1.1, via GigabitEthernet0/0/0/0.543, path-id 1
         Backup path:
            54.0.0.5, from 1.1.1.1, via GigabitEthernet0/0/0/0.554, protected bitmap 0000000000000001
            Attribues: Metric: 104, Downstream, Node Protect, SRLG Disjoint

We now reset all the costs on XRv14 back to 1 (not shown) and shift focus to XRv13. We will briefly test per-link LFA, which is an alternative LFA technique. Per-link runs SPF for all prefixes learned through a primary next-hop. This is obviously less ideal as certain links will not yield LFA paths for all prefixes, but is significantly less CPU intensive. This is only supported on XR and appears to be a less popular variant of LFA. I use a medium prefix-priority just for variety, although it doesn’t mean much in a network where we are only evaluating loopback addresses.

! XRv13
router ospf 1
 fast-reroute per-link priority-limit medium
 area 0
  fast-reroute per-link

Based on the output below, we can conclude the following. The “(!)” symbol indicates an FRR path.

Prefixes with next-hop 123.0.0.1 (1.1.1.1/32, 4.4.4.4/32) use CSR4 34.0.0.4
Prefixes with next-hop 123.0.0.2 (2.2.2.2/32) use XRv14 43.0.0.14
Prefixes with next-hop 43.0.0.14 (5.5.5.5/32, 14.14.14.14/32) use CSR2 123.0.0.2
Prefixes with next-hop 34.0.0.4 (5.5.5.5/32, 4.4.4.4/32) use CSR1 123.0.0.1

RP/0/0/CPU0:XRv13#show route ospf
O    1.1.1.1/32 [110/2] via 123.0.0.1, 00:02:34, GigabitEthernet0/0/0/0.513
                [110/0] via 34.0.0.4, 00:02:34, GigabitEthernet0/0/0/0.534 (!)
O    2.2.2.2/32 [110/0] via 43.0.0.14, 00:02:34, GigabitEthernet0/0/0/0.543 (!)
                [110/2] via 123.0.0.2, 00:02:34, GigabitEthernet0/0/0/0.513
O    4.4.4.4/32 [110/0] via 123.0.0.1, 00:02:03, GigabitEthernet0/0/0/0.513 (!)
                [110/2] via 34.0.0.4, 00:02:03, GigabitEthernet0/0/0/0.534
O    5.5.5.5/32 [110/3] via 43.0.0.14, 00:02:03, GigabitEthernet0/0/0/0.543
                [110/0] via 123.0.0.2, 00:02:03, GigabitEthernet0/0/0/0.513 (!)
                [110/0] via 123.0.0.1, 00:02:03, GigabitEthernet0/0/0/0.513 (!)
                [110/3] via 34.0.0.4, 00:02:03, GigabitEthernet0/0/0/0.534
O    14.14.14.14/32 [110/2] via 43.0.0.14, 00:02:03, GigabitEthernet0/0/0/0.543
                    [110/0] via 123.0.0.2, 00:02:03, GigabitEthernet0/0/0/0.513 (!)
O    24.0.0.0/24 [110/2] via 43.0.0.14, 00:02:34, GigabitEthernet0/0/0/0.543
O    54.0.0.0/24 [110/2] via 43.0.0.14, 00:02:34, GigabitEthernet0/0/0/0.543


If we add several prefixes to (or behind) XRv14, their FRR paths will be identical to 14.14.14.14/32 regardless of the number of prefixes we add and without additional computation. I've added 4 to demonstrate the point. Of course, the result would be the same with per-prefix, but with additional computation for each prefix. When the network has an enormous number of prefixes, it is likely that the number of links on a router is significantly less than the number of prefixes, so this can be a way of scaling LFA.

RP/0/0/CPU0:XRv13#show route longer-prefixes 14.14.14.0/24 | begin ^O
O    14.14.14.1/32 [110/0] via 123.0.0.2, 00:00:32, GigabitEthernet0/0/0/0.513 (!)
                   [110/2] via 43.0.0.14, 00:00:32, GigabitEthernet0/0/0/0.543
O    14.14.14.2/32 [110/0] via 123.0.0.2, 00:00:32, GigabitEthernet0/0/0/0.513 (!)
                   [110/2] via 43.0.0.14, 00:00:32, GigabitEthernet0/0/0/0.543
O    14.14.14.3/32 [110/0] via 123.0.0.2, 00:00:32, GigabitEthernet0/0/0/0.513 (!)
                   [110/2] via 43.0.0.14, 00:00:32, GigabitEthernet0/0/0/0.543
O    14.14.14.4/32 [110/0] via 123.0.0.2, 00:00:32, GigabitEthernet0/0/0/0.513 (!)
                   [110/2] via 43.0.0.14, 00:00:32, GigabitEthernet0/0/0/0.543
O    14.14.14.14/32 [110/0] via 123.0.0.2, 00:12:21, GigabitEthernet0/0/0/0.513 (!)
                    [110/2] via 43.0.0.14, 00:12:21, GigabitEthernet0/0/0/0.543
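The scaling argument can be illustrated with a minimal Python sketch. It is only a model of the idea, not how XR implements it: the table BACKUP_FOR_PRIMARY and the function backups_for are hypothetical names, and the next-hop pairs are copied from XRv13's routing table shown earlier. The point is that the backup is looked up per primary next-hop, so adding prefixes adds no new backup computation.

```python
# Per-link LFA model: one backup next-hop per primary next-hop (link).
# The pairs below are read from XRv13's "show route ospf" output.
BACKUP_FOR_PRIMARY = {
    "123.0.0.1": "34.0.0.4",
    "123.0.0.2": "43.0.0.14",
    "43.0.0.14": "123.0.0.2",
    "34.0.0.4": "123.0.0.1",
}


def backups_for(primary_next_hops):
    """Return the shared per-link backup for each primary next-hop."""
    return {nh: BACKUP_FOR_PRIMARY[nh] for nh in primary_next_hops}


# Every prefix behind XRv14 uses primary next-hop 43.0.0.14.
routes = {
    "14.14.14.14/32": ["43.0.0.14"],
    "14.14.14.1/32": ["43.0.0.14"],
    "14.14.14.2/32": ["43.0.0.14"],
    "14.14.14.3/32": ["43.0.0.14"],
    "14.14.14.4/32": ["43.0.0.14"],
}

for prefix, next_hops in routes.items():
    # All five prefixes resolve to the same backup, 123.0.0.2.
    print(prefix, backups_for(next_hops))
```

The table has four entries no matter how many prefixes are added to the routes dictionary, which is the property that makes per-link attractive when prefixes greatly outnumber links.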

37.1.1.2 Remote LFA

In ring designs with 4 or more nodes, you cannot use regular LFA to reach some nodes (assuming all costs are equal) since your neighbors sometimes route through you. The network has been modified by setting all OSPF costs to 1, shutting down the shared LAN interface, removing SRLGs, and removing CSR5 completely. CSR1's tiebreaker policies have been set to the default values. We now have a pentagon shape where all costs are equal (CSR1-CSR2-XRv14-XRv13-CSR4-CSR1). If CSR1 tries to compute a backup path to CSR2's loopback, CSR4 says that it routes through CSR1 to get there. This is not going to work for ordinary LFA as it is not a loop-free path. Remote LFA (rLFA) solves this problem with the following logic: "I can tunnel the traffic to a node that CAN reach CSR2 through a backup path (like XRv13) and let that node deliver the traffic".

In this scenario, each node will have a backup path for nodes that are exactly two hops away; for example, CSR1 has backup paths for XRv13 and XRv14, but not CSR2 and CSR4. This is using regular LFA with no modifications, but ideally we want to protect paths to CSR2 and CSR4 also around the ring topology.

Note: Only MPLS encapsulation with LDP signaling is currently supported. LDP is enabled on all OSPF-enabled links in the network (not shown). I suppose this is a limitation of Segment Routing (SR) in that remote-LFA cannot use it for tunnel label bindings.

CSR1#show ip ospf rib
[snip]
*   1.1.1.1/32, Intra, cost 1, area 0, Connected
      via 1.1.1.1, Loopback0
*>  2.2.2.2/32, Intra, cost 2, area 0
      via 12.0.0.2, GigabitEthernet2.512
*>  4.4.4.4/32, Intra, cost 2, area 0
      via 14.0.0.4, GigabitEthernet2.514
*>  13.13.13.13/32, Intra, cost 3, area 0
      via 14.0.0.4, GigabitEthernet2.514
       repair path via 12.0.0.2, GigabitEthernet2.512, cost 4
*>  14.14.14.14/32, Intra, cost 3, area 0
      via 12.0.0.2, GigabitEthernet2.512
       repair path via 14.0.0.4, GigabitEthernet2.514, cost 4

Remote LFA introduces several new terms which are discussed first. There are two "spaces" that have been defined to show how remote-LFA works. I derive these definitions from "draft-shand-remote-lfa-00" as they are succinct and clear.

P-space: Set of routers reachable without any path (including ECMP options) transiting the protected link. For example, from CSR1's perspective we are trying to protect a path to CSR2. Our shortest path is the direct link, which is what we are trying to protect, so the P-space routers would not be reachable over that link. A quick look at CSR1's routing table can tell us which routers we CAN reach through the link towards CSR4. Clearly, we can reach CSR4 and XRv13 this way. XRv14 and CSR2 are not considered P-space routers because CSR1's shortest path to them is via the protected link. Attached is a picture to help visualize the P-space; the red colored link is the link we are trying to protect.

CSR1#show ip route | include 14.0.0.4
[snip]
O        4.4.4.4 [110/2] via 14.0.0.4, 00:06:40, GigabitEthernet2.514
O        13.13.13.13 [110/3] via 14.0.0.4, 00:09:23, GigabitEthernet2.514

Q-space: Set of routers from which the destination can be reached without any path (including ECMP options) transiting the protected link. For example, we are trying to reach CSR2; that is the destination.

Q-space determines which routers can reach CSR2 on their best-paths without transiting the protected link (CSR1-CSR2). We would need to check each router's route to 2.2.2.2/32 to see which ones fit within the Q-space and which don't. I will use traceroute to speed things up. The highlights represent the protected link, which implies the router is NOT part of the Q-space if its path traverses that link. This output reveals that XRv13 and XRv14 are in the Q-space, but CSR1 and CSR4 are not. XRv13 and XRv14 can both reach CSR2 without transiting the link we are trying to protect. Attached is a picture to help visualize the Q-space.

CSR1#traceroute 2.2.2.2 source loopback0
Type escape sequence to abort.
Tracing the route to 2.2.2.2
VRF info: (vrf in name/id, vrf out name/id)
  1 12.0.0.2 1 msec * 1 msec

CSR4#traceroute 2.2.2.2 source loopback 0
Type escape sequence to abort.
Tracing the route to 2.2.2.2
VRF info: (vrf in name/id, vrf out name/id)
  1 14.0.0.1 [MPLS: Label 1002 Exp 0] 1 msec 1 msec 1 msec
  2 12.0.0.2 1 msec * 1 msec

RP/0/0/CPU0:XRv13#traceroute 2.2.2.2 source 13.13.13.13
Type escape sequence to abort.
Tracing the route to 2.2.2.2
 1  43.0.0.14 [MPLS: Label 94001 Exp 0] 29 msec 19 msec 29 msec
 2  24.0.0.2 29 msec * 29 msec

RP/0/0/CPU0:XRv14#traceroute 2.2.2.2 source 14.14.14.14
Type escape sequence to abort.
Tracing the route to 2.2.2.2
 1  24.0.0.2 0 msec * 0 msec


PQ-node: A router that exists in both the P and Q spaces concurrently. The logic is as follows: the source router (CSR1) can reach any P-space router without transiting the protected link. Any Q-space router can reach the destination (CSR2) without transiting the protected link. Therefore, CSR1 can reach XRv13 and XRv13 can reach CSR2 without transiting the protected link. CSR1 establishes a "repair tunnel" to XRv13 (the PQ-node) inside of MPLS. XRv13 is the tail end of this repair tunnel, sending tunneled packets along their natural best-path to CSR2. Attached is a picture to help visualize the PQ-node.
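The P-space, Q-space, and PQ-node definitions can be checked with a short Python sketch on the five-node ring. This is an illustration under stated assumptions, not the router's actual algorithm: the function names (dijkstra, ring, p_space, q_space) are hypothetical, all link costs are 1 and symmetric as in the lab, and "no shortest path transits the protected link" is expressed as a strict inequality so that ECMP ties through the protected link disqualify a node.

```python
import heapq


def dijkstra(graph, source):
    """Shortest-path cost from source to every node."""
    dist = {node: float("inf") for node in graph}
    dist[source] = 0
    heap = [(0, source)]
    while heap:
        d, node = heapq.heappop(heap)
        if d > dist[node]:
            continue
        for neighbor, cost in graph[node].items():
            if d + cost < dist[neighbor]:
                dist[neighbor] = d + cost
                heapq.heappush(heap, (d + cost, neighbor))
    return dist


def ring(nodes, cost=1):
    """Build a symmetric ring topology from an ordered node list."""
    graph = {n: {} for n in nodes}
    for a, b in zip(nodes, nodes[1:] + nodes[:1]):
        graph[a][b] = cost
        graph[b][a] = cost
    return graph


def p_space(graph, s, e):
    """Nodes that s reaches with no shortest path over link s-e."""
    from_s, from_e, link = dijkstra(graph, s), dijkstra(graph, e), graph[s][e]
    return {x for x in graph if x not in (s, e) and from_s[x] < link + from_e[x]}


def q_space(graph, s, e):
    """Nodes that reach e with no shortest path over link s-e.

    Costs are symmetric here, so dist(x, e) equals dist(e, x).
    """
    from_s, from_e, link = dijkstra(graph, s), dijkstra(graph, e), graph[s][e]
    return {x for x in graph if x not in (s, e) and from_e[x] < from_s[x] + link}


pentagon = ring(["CSR1", "CSR2", "XRv14", "XRv13", "CSR4"])
p = p_space(pentagon, "CSR1", "CSR2")
q = q_space(pentagon, "CSR1", "CSR2")
print(sorted(p))      # ['CSR4', 'XRv13']
print(sorted(q))      # ['XRv13', 'XRv14']
print(sorted(p & q))  # ['XRv13']
```

The intersection contains only XRv13, which agrees with the PQ-node identified above for protecting the CSR1-CSR2 link.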


Like regular LFA, the configuration is very straightforward and is demonstrated on CSR1. The first command tells it what kind of encapsulation to use, and right now "mpls-ldp" is the only option. Future encapsulations could potentially include MPLS with Segment Routing, GRE/IP, or any other tunneling techniques. The second command puts an upper bound on how far the tunnel can go. Realistically, you do not want to tunnel backup traffic too far in your network. In our case, I set an aggressive value of 3 to ensure we do not tunnel traffic more than 2 hops (plus the cost of the loopback). After all, if CSR1 can tunnel traffic to XRv13 (the PQ router), that is all we need in a small 5-node ring. Likewise, in the opposite direction, if we can tunnel to XRv14 as a PQ-node, we can back up CSR4's loopback. I am only enabling rLFA in area 0 although you can omit the area and enable it for all areas (plus external routes). In fact, you must omit the area if you want to protect external routes, which implicitly means rLFA is enabled for all areas as well.

! CSR1
router ospf 1
 fast-reroute per-prefix remote-lfa area 0 tunnel mpls-ldp
 fast-reroute per-prefix remote-lfa area 0 maximum-cost 3

The head-end (CSR1) sends targeted LDP hellos towards 13.13.13.13/32 (and also XRv14 to protect CSR4). We need to enable the PQ-nodes to accept these targeted hellos; we will enable it on XRv13 only first. Notice what happens if you skip this step as we did on XRv14. The LDP neighbor does not form and rLFA will not work. CSR1 picked XRv13 because it was a PQ-router; the algorithm that determined this happens behind the scenes automatically based on the aforementioned P-space / Q-space discussion. You do not need to specify the rLFA tunnel destination, which greatly simplifies the configuration.


! XRv13
mpls ldp
 address-family ipv4
  discovery targeted-hello accept

CSR1#show mpls ldp discovery
 Local LDP Identifier:
    1.1.1.1:0
    Discovery Sources:
    Interfaces:
        GigabitEthernet2.512 (ldp): xmit/recv
            LDP Id: 2.2.2.2:0
        GigabitEthernet2.514 (ldp): xmit/recv
            LDP Id: 4.4.4.4:0
    Targeted Hellos:
        1.1.1.1 -> 13.13.13.13 (ldp): active, xmit/recv
            LDP Id: 13.13.13.13:0
        1.1.1.1 -> 14.14.14.14 (ldp): active, xmit

CSR1#show mpls ldp neighbor | include Peer_LDP
    Peer LDP Ident: 4.4.4.4:0; Local LDP Ident 1.1.1.1:0
    Peer LDP Ident: 2.2.2.2:0; Local LDP Ident 1.1.1.1:0
    Peer LDP Ident: 13.13.13.13:0; Local LDP Ident 1.1.1.1:0

CSR1 learns XRv13's LDP local-label for 2.2.2.2/32 via the rLFA tunnel, which is 93008. This will be the inner label when rLFA traffic is tunneled to XRv13. Also take note of CSR4's label for 2.2.2.2/32, which is 4002, as this will be used after CSR1's SPF process completes and traffic stops forwarding over the rLFA tunnel. Just like TE-FRR in ordinary MPLS, the backup path is only used for a very short time until the tunnel head-end repairs the failure. In this case, once the IGP is aware of the failure, it uses the backup path immediately (LFA or rLFA) and computes SPF in the meantime. Assuming SPF completes quickly, traffic will follow a new, more stable LSP, which in this case would use label 4002 via CSR4 directly.

CSR1#show mpls ldp bindings 2.2.2.2 32
  lib entry: 2.2.2.2/32, rev 4
        local binding:  label: 1000
        remote binding: lsr: 4.4.4.4:0, label: 4002
        remote binding: lsr: 2.2.2.2:0, label: imp-null
        remote binding: lsr: 13.13.13.13:0, label: 93008

When sending traffic destined to XRv13 (which is the tail end of the rLFA tunnel), we must use CSR4's label for 13.13.13.13/32, which is 4004. This is the outer label for the rLFA tunnel, as well as the targeted LDP session with XRv13.

CSR1#show mpls ldp bindings 13.13.13.13 32
  lib entry: 13.13.13.13/32, rev 10
        local binding:  label: 1002
        remote binding: lsr: 4.4.4.4:0, label: 4004
        remote binding: lsr: 2.2.2.2:0, label: 2003
        remote binding: lsr: 13.13.13.13:0, label: imp-null
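To tie the two binding lookups together, here is a minimal Python sketch of how the rLFA label stack is assembled. The dictionary LDP_BINDINGS and the function rlfa_label_stack are hypothetical names used only for illustration; the label values are copied from CSR1's LDP bindings for 2.2.2.2/32 and 13.13.13.13/32.

```python
# CSR1's remote LDP bindings, keyed by prefix and then by advertising LSR.
LDP_BINDINGS = {
    "2.2.2.2/32": {
        "4.4.4.4": 4002,
        "2.2.2.2": "imp-null",
        "13.13.13.13": 93008,
    },
    "13.13.13.13/32": {
        "4.4.4.4": 4004,
        "2.2.2.2": 2003,
        "13.13.13.13": "imp-null",
    },
}


def rlfa_label_stack(destination, pq_prefix, pq_lsr, first_hop_lsr):
    """Return [outer, inner] labels for the repair tunnel.

    outer: the first hop's label for the PQ-node (reach the tunnel tail).
    inner: the PQ-node's own label for the destination, learned over tLDP.
    """
    outer = LDP_BINDINGS[pq_prefix][first_hop_lsr]
    inner = LDP_BINDINGS[destination][pq_lsr]
    return [outer, inner]


stack = rlfa_label_stack(
    destination="2.2.2.2/32",
    pq_prefix="13.13.13.13/32",
    pq_lsr="13.13.13.13",
    first_hop_lsr="4.4.4.4",
)
print(stack)  # [4004, 93008]

# After SPF converges, the ordinary LSP uses CSR4's label for the destination.
print(LDP_BINDINGS["2.2.2.2/32"]["4.4.4.4"])  # 4002
```

The inner label is the reason the targeted LDP session is needed: without it, CSR1 would have no way of knowing which label XRv13 expects for 2.2.2.2/32.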

As seen below, a repair tunnel called "MPLS-Remote-Lfa1" was created to back up this destination. Assuming XRv14 was correctly configured to accept tLDP sessions (which it is not), we would see a second rLFA tunnel towards XRv14. The number of this tunnel changes every time the protected link flaps, because after CSR1 converges and uses the real LSP, the rLFA tunnel is torn down as it is not needed.

CSR1#show ip ospf rib 2.2.2.2 255.255.255.255
[snip]
*>  2.2.2.2/32, Intra, cost 2, area 0
      SPF Instance 52, age 00:28:45
      Flags: RIB, HiPrio
      via 12.0.0.2, GigabitEthernet2.512
       Flags: RIB
       LSA: 1/2.2.2.2/2.2.2.2
       repair path via 13.13.13.13, MPLS-Remote-Lfa1, cost 4
        Flags: RIB, Repair, IntfDj, BcastDj
        LSA: 1/2.2.2.2/2.2.2.2

You can see some extra tunnel details which help define the path. This shows XRv13 as the tail-end (PQ-node) with CSR4 being the first hop in the tunnel's path.

CSR1#show ip ospf fast-reroute remote-lfa tunnels
 Interface MPLS-Remote-Lfa1
  Tunnel type: MPLS-LDP
  Tailend router ID: 13.13.13.13
  Termination IP address: 13.13.13.13
  Outgoing interface: GigabitEthernet2.514
  First hop gateway: 14.0.0.4
  Tunnel metric: 2
  Protects:
   12.0.0.2 GigabitEthernet2.512, total metric 4

BFD must be configured for rLFA to work. The control-plane is fully operational without it as verified above, but you must use BFD. IGP fast hellos will not work, regardless of how fast they are, and BFD will work, no matter how slow it is. Thus, BFD has been enabled and bound to OSPF on the link between CSR1 and CSR2 (basic BFD configuration not shown). The "FRR" registered protocol represents LFA. Cisco suggests BFD is a requirement for direct LFA as well, which makes sense.

CSR1#show bfd neighbors details | include 12.0.0.2|Register
12.0.0.2          4097/4097          Up          Up          Gi2.512
Registered protocols: OSPF CEF FRR


I added a quick Flexible Netflow (FNF) configuration to CSR4 to track incoming MPLS packets. This way, we can visualize CSR1 sending MPLS-encapsulated backup traffic through the rLFA tunnel.

! CSR4
flow record FNF_MPLS_RECORD
 match mpls label 1 details
 match mpls label 2 details
 collect counter packets
flow monitor FNF_MPLS_MONITOR
 record FNF_MPLS_RECORD
interface GigabitEthernet2.514
 mpls flow monitor FNF_MPLS_MONITOR input

The output below shows the targeted LDP messages being sent from CSR1 to XRv13 (notice the label value of 4004 with EXP 6). This is just a basic TCP session between CSR1 and XRv13. This isn't relevant for rLFA, but we use it to verify that FNF was configured correctly.

CSR4#show flow monitor FNF_MPLS_MONITOR cache format table | begin MPLS
MPLS LABEL 1  MPLS LABEL 2        pkts
============  ============  ==========
4004 /6       0 /0                 118

Next, I send several pings from CSR1's loopback to CSR2's loopback with DSCP CS1 (for identification, just in case something else generates traffic behind the scenes), then break the protected link on CSR2. We expect to see some traffic (not many packets) with a label stack of {4004 93008} which is used to tunnel traffic inside the rLFA tunnel to XRv13 (4004), then use XRv13's label for 2.2.2.2/32 to reach the final destination (93008). After OSPF converges, CSR1 uses the ordinary LDP-derived LSP via CSR4's label for 2.2.2.2/32 (4002). Tunneled rLFA traffic counters are shown in green and native forwarding through CSR4, after IGP convergence, is shown in yellow.

CSR1#ping 2.2.2.2 source loopback0 tos 32 repeat 100000000

! CSR2
interface GigabitEthernet2.512
 shutdown

CSR4#show flow monitor FNF_MPLS_MONITOR cache format table | begin MPLS
MPLS LABEL 1  MPLS LABEL 2        pkts
============  ============  ==========
4002 /1       0 /0                 242
4004 /1       93008 /1             128
4004 /6       0 /0                  18

CSR4#show flow monitor FNF_MPLS_MONITOR cache format table | begin MPLS
MPLS LABEL 1  MPLS LABEL 2        pkts
============  ============  ==========
4002 /1       0 /0                 267
4004 /1       93008 /1             128
4004 /6       0 /0                  18

In summary, the traffic with label stack {4004 93008} was inside the rLFA tunnel and was used for only the amount of time it took CSR1 to reconverge (I did not tune OSPF timers for ultra-fast convergence). Notice the difference in packet counters between the two FNF show commands; the rLFA tunnel stopped forwarding after 128 packets and the label stack {4002} was used thereafter as this was the post-SPF LSP. I highlighted the packet counters in different colors above so you can see that rLFA counters stop while the normal LSP counters continue to increase. We don't need to trace all the LSPs as the FNF output reveals the key behavioral components of this feature.

When we have a ring topology with an even number of nodes greater than 3, remote LFA becomes a little more complicated. CSR1 is removed from the network and CSR4's link to XRv13 is restored, making a square.

Given what we know about P and Q spaces, assuming we want to protect the link between CSR4 and CSR5, there is no traditional PQ-node. Without a PQ-node, it seems that remote LFA cannot work. This appears to be the case because CSR4's route to XRv14 is an ECMP split between XRv13 and CSR5, which means XRv14 is not in CSR4's P-space. Additionally, because XRv13's route to CSR5 is also via ECMP, XRv13 is not part of CSR5's Q-space.

CSR4#show ip cef 14.14.14.14 | include _nexthop
  nexthop 34.0.0.13 GigabitEthernet2.534 label [93015|5004]
  nexthop 45.0.0.5 GigabitEthernet2.545 label [5004|93015]

RP/0/0/CPU0:XRv13#show cef ipv4 5.5.5.5/32 | include next hop
    next hop 34.0.0.4
    next hop 43.0.0.14


In order for rLFA to work, XRv14 needs to be in CSR4’s P-space or XRv13 needs to be in CSR5’s Q-space. This problem is solved by leveraging the “extended P-space” which makes the first condition somewhat true. The logic is as follows: CSR4 will only use the repair path when the direct link to CSR5 is down. This initial hop does not need to be subject to CSR4's normal routing process; this effectively allows us to relax the "you can't use ECMP" rule. The extended P-space is the union (combination) of the P-spaces of the source router (CSR4) and all of its P-space neighbors (XRv13). In terms of cost, a router is in the extended P-space if the cost XRv13->XRv14 is less than the cost XRv13->CSR4->CSR5->XRv14 (1 < 3). Basically, your neighbor (XRv13) has to be closer to the target PQ-node via a non-protected link. In this case, XRv14 is within XRv13's P-space since it is reachable without using the protected link and is closer than CSR4 (downstream); XRv14 is in CSR4's extended P-space and is thus the PQ-node.
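The extended P-space logic can be sketched in Python for the four-node square. This is a hedged illustration rather than the router's implementation: the function names are hypothetical, all costs are 1 and symmetric, and the membership test is the cost comparison described above (a neighbor N of the source reaches X more cheaply than by going N -> source -> protected link -> X).

```python
import heapq


def dijkstra(graph, source):
    """Shortest-path cost from source to every node."""
    dist = {node: float("inf") for node in graph}
    dist[source] = 0
    heap = [(0, source)]
    while heap:
        d, node = heapq.heappop(heap)
        if d > dist[node]:
            continue
        for neighbor, cost in graph[node].items():
            if d + cost < dist[neighbor]:
                dist[neighbor] = d + cost
                heapq.heappush(heap, (d + cost, neighbor))
    return dist


def ring(nodes, cost=1):
    """Build a symmetric ring topology from an ordered node list."""
    graph = {n: {} for n in nodes}
    for a, b in zip(nodes, nodes[1:] + nodes[:1]):
        graph[a][b] = cost
        graph[b][a] = cost
    return graph


def p_space(graph, s, e):
    from_s, from_e, link = dijkstra(graph, s), dijkstra(graph, e), graph[s][e]
    return {x for x in graph if x not in (s, e) and from_s[x] < link + from_e[x]}


def q_space(graph, s, e):
    from_s, from_e, link = dijkstra(graph, s), dijkstra(graph, e), graph[s][e]
    return {x for x in graph if x not in (s, e) and from_e[x] < from_s[x] + link}


def extended_p_space(graph, s, e):
    """Union of the P-spaces of s's neighbors other than e."""
    from_e, link = dijkstra(graph, e), graph[s][e]
    members = set()
    for n in graph[s]:
        if n == e:
            continue
        from_n = dijkstra(graph, n)
        for x in graph:
            # e.g. XRv13 -> XRv14 (1) vs XRv13 -> CSR4 -> CSR5 -> XRv14 (3)
            if x not in (s, e) and from_n[x] < from_n[s] + link + from_e[x]:
                members.add(x)
    return members


square = ring(["CSR4", "CSR5", "XRv14", "XRv13"])
q = q_space(square, "CSR4", "CSR5")
print(sorted(p_space(square, "CSR4", "CSR5") & q))           # []
print(sorted(extended_p_space(square, "CSR4", "CSR5") & q))  # ['XRv14']
```

With the plain P-space the intersection is empty, while the extended P-space yields XRv14 as the PQ-node, matching the reasoning in the text.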

Fortunately, there is nothing special you have to configure for this advanced feature to work. Just enable the rLFA feature and the router “figures it out”. We should see an rLFA tunnel from CSR4 to XRv14. In earlier examples we did not have LDP targeted session acceptance configured on XRv14 for demonstration purposes, so we must add it.


! XRv14
mpls ldp
 address-family ipv4
  discovery targeted-hello accept

I've also set up BFD and bound OSPF to it on the link between CSR4 and CSR5; remember that this is a requirement for rLFA to work. CSR4 actually builds two tunnels, one via CSR5 and one via XRv13. Since our current objective is protecting the link to CSR5, we will ignore the second tunnel. Both of them terminate on XRv14 since it would be the PQ-node in either direction. Since XRv doesn't support BFD, we test link failures between CSR4 and CSR5 only.

CSR4#show ip ospf fast-reroute remote-lfa tunnels
[snip]
 Interface MPLS-Remote-Lfa5
  Tunnel type: MPLS-LDP
  Tailend router ID: 14.14.14.14
  Termination IP address: 14.14.14.14
  Outgoing interface: GigabitEthernet2.534
  First hop gateway: 34.0.0.13
  Tunnel metric: 2
  Protects:
   45.0.0.5 GigabitEthernet2.545, total metric 3
[snip]

Performing a similar test as before, we send pings from CSR4 to CSR5, break the direct link, and ensure XRv13 sees the proper rLFA label stack. CSR4 learns label 94003 for the prefix 5.5.5.5/32 from XRv14 via tLDP. This will be the bottom label in the rLFA tunnel stack. CSR4 also needs a label for XRv14's loopback, the tunnel tail end. Because the route to 14.14.14.14/32 is IGP learned, an LDP label must be used, and XRv13 advertises label 93015. We expect the rLFA tunnel to use label stack {93015 94003} to tunnel traffic first to XRv14, then use XRv14's label for CSR5's loopback to deliver the traffic. When OSPF converges, we should use XRv13's LDP label for CSR5's loopback, which is 93002.

CSR4#show mpls ldp bindings 5.5.5.5 32
  lib entry: 5.5.5.5/32, rev 41
        local binding:  label: 4001
        remote binding: lsr: 5.5.5.5:0, label: imp-null
        remote binding: lsr: 13.13.13.13:0, label: 93002
        remote binding: lsr: 14.14.14.14:0, label: 94003

CSR4#show mpls ldp bindings 14.14.14.14 32
  lib entry: 14.14.14.14/32, rev 14
        local binding:  label: 4003
        remote binding: lsr: 13.13.13.13:0, label: 93015
        remote binding: lsr: 5.5.5.5:0, label: 5004
        remote binding: lsr: 14.14.14.14:0, label: imp-null


FNF does not appear to be supported on XRv, so we will use MPLS packet debugging, which does work. The XR debugs for this are extremely detailed; we will send these logs to the buffer only. We expect to see a short burst of traffic with label stack {93015 94003} with all follow-on traffic using label 93002. Like last time, we use MPLS EXP 1 to differentiate the traffic from other potential flows, enhancing readability.

XRv13#debug mpls packet

CSR4#ping 5.5.5.5 source loopback0 tos 32 repeat 100000000

! CSR5
interface GigabitEthernet2.545
 shutdown

Here is debug output from XRv13 that shows traffic coming in with label 93015, which is XRv13's label for XRv14's loopback. We assume this is the rLFA tunnel because the EXP bits are set to 1 and the label values appear correct. The LFIB states PHP should occur, so only label 94003 should go out towards XRv14. Notice that the packet length is 130 bytes on ingress.

! XRv13
mpls_switch: FINT0_0_CPU0, mpls eos 0, ttl 254, len 130, inlabel 93015, tbl_id=0xe0000000, vrf_id=0x60000000 in=0xb00
mpls_rewrite
mpls_rewrite: tos 32 eos 0 ttl 254, out_label 1048577, #labels 0, RA - 0, out_intf GigabitEthernet0_0_0_0.543
mpls_rewrite: POP to mpls
mpls_rewrite: tos 1 (0x1) eos 1 ttl 254, out_label 94003, top_of_stack 0x16f333fe

RP/0/0/CPU0:XRv13#show mpls forwarding labels 93015
Local  Outgoing    Prefix             Outgoing       Next Hop        Bytes
Label  Label       or ID              Interface                      Switched
------ ----------- ------------------ -------------- --------------- ----------
93015  Pop         14.14.14.14/32     Gi0/0/0/0.543  43.0.0.14       1665321

A few seconds later, OSPF converges and XRv13 only sees traffic using label 93002 to reach CSR5's loopback as the rLFA tunnel is torn down. Notice the packet length is also 4 bytes less (126 vs. 130) than it was when using rLFA as the label stack contains one label, not two. XRv13 performs a label swap, not a pop, to send this traffic towards CSR5 via XRv14. The debug earlier clearly showed "POP to mpls" where this debug does not.

! XRv13
mpls_switch-2: Setting default table-id Flag set to 0x00000001
mpls_switch: FINT0_0_CPU0, mpls eos 1, ttl 254, len 126, inlabel 93002, tbl_id=0xe0000000, vrf_id=0x60000000 in=0xb00
mpls_rewrite
mpls_rewrite: tos 32 eos 1 ttl 254, out_label 94003, #labels 1, RA - 0, out_intf GigabitEthernet0_0_0_0.543
mpls_rewrite: tos 1 (0x1) eos 1 ttl 254, out_label 94003, top_of_stack 0x16f333fe
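The "top_of_stack" value and the 4-byte length difference in the debugs can both be explained by the MPLS label stack entry layout from RFC 3032: 20 bits of label, 3 bits of EXP, 1 bottom-of-stack bit, and 8 bits of TTL. The short Python sketch below packs those fields the same way; the function names are hypothetical and the input values are taken from the debug output.

```python
LABEL_ENTRY_BYTES = 4  # each MPLS label stack entry is one 32-bit word


def encode_label_entry(label, exp, bottom_of_stack, ttl):
    """Pack label(20) | exp(3) | S(1) | ttl(8) into a 32-bit integer."""
    return (label << 12) | (exp << 9) | (int(bottom_of_stack) << 8) | ttl


def decode_label_entry(word):
    """Unpack a 32-bit label stack entry into its four fields."""
    return {
        "label": word >> 12,
        "exp": (word >> 9) & 0x7,
        "s": (word >> 8) & 0x1,
        "ttl": word & 0xFF,
    }


# Outgoing entry seen on XRv13: label 94003, EXP 1, bottom of stack, TTL 254.
word = encode_label_entry(94003, exp=1, bottom_of_stack=True, ttl=254)
print(hex(word))                 # 0x16f333fe
print(decode_label_entry(word))  # {'label': 94003, 'exp': 1, 's': 1, 'ttl': 254}

# Two labels during rLFA vs. one label after convergence.
print(2 * LABEL_ENTRY_BYTES - 1 * LABEL_ENTRY_BYTES)  # 4
```

The encoded value matches the top_of_stack field in the debug, and the one-label difference accounts for the 130-byte versus 126-byte packet lengths.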

Because BFD is not supported on XRv, we will not test the data plane of rLFA, but instead will verify control plane signaling when the rLFA tunnel headend is on XRv. Using the same square topology, we enable rLFA on XRv13 to protect the link to XRv14. On CSR5, we will configure LDP to accept targeted hellos as XRv13 must learn CSR5's label for XRv14.

! XRv13
router ospf 1
 fast-reroute per-prefix remote-lfa tunnel mpls-ldp
 fast-reroute per-prefix remote-lfa maximum-cost 20

! CSR5
mpls ldp discovery targeted-hello accept

To verify that we learn the label, we can check the LDP bindings to discover the value 5004. This is the bottom label in the rLFA tunnel. The top label is CSR4's label for CSR5, which is 4001. Thus, the rLFA tunnel encapsulation is {4001 5004}.

RP/0/0/CPU0:XRv13#show mpls ldp bindings 14.14.14.14/32 neighbor 5.5.5.5
14.14.14.14/32, rev 129
        Local binding: label: 93015
        Remote bindings: (3 peers)
            Peer                Label
            -----------------   ---------
            5.5.5.5:0           5004

RP/0/0/CPU0:XRv13#show mpls ldp bindings 5.5.5.5/32 neighbor 4.4.4.4
5.5.5.5/32, rev 164
        Local binding: label: 93002
        Remote bindings: (3 peers)
            Peer                Label
            -----------------   ---------
            4.4.4.4:0           4001

The OSPF RIB, like XE, also tracks the backup paths with all of the tiebreaker attributes (SRLG, node protecting, interface disjoint, etc).

RP/0/0/CPU0:XRv13#show ospf routes 14.14.14.14/32 backup-path detail
OSPF Route entry for 14.14.14.14/32
  Route type: Intra-area
  Metric: 2
  SPF priority: 4, SPF version: 134
  RIB version: 0, Source: Unknown
      43.0.0.14, from 14.14.14.14, via GigabitEthernet0/0/0/0.543, path-id 1
       Backup path: Remote, LFA: 5.5.5.5
        34.0.0.4, from 14.14.14.14, via GigabitEthernet0/0/0/0.534, protected bitmap 0000000000000001
        Attribues: Metric: 2, SRLG Disjoint

We can check the FIB to ensure this is the correct label stack and that the backup path is installed also.

RP/0/0/CPU0:XRv13#show cef ipv4 14.14.14.14/32
14.14.14.14/32, version 479, internal 0x1000001 0x0 (ptr 0xa140f174) [1], 0x0 (0xa13f46a4), 0xa28 (0xa177c158)
 local adjacency 43.0.0.14
 Prefix Len 32, traffic index 0, precedence n/a, priority 3
   via 34.0.0.4, GigabitEthernet0/0/0/0.534, 5 dependencies, weight 0, class 0, backup [flags 0x300]
    path-idx 0 NHID 0x0 [0xa0e8f34c 0x0]
    next hop 34.0.0.4, PQ-node 5.5.5.5
    local adjacency
     local label 93015      labels imposed {4001 5004}
   via 43.0.0.14, GigabitEthernet0/0/0/0.543, 5 dependencies, weight 0, class 0, protected [flags 0x400]
    path-idx 1 bkup-idx 0 NHID 0x0 [0xa184c384 0xa184c1b4], parent-ifh 0x100
    next hop 43.0.0.14
     local label 93015      labels imposed {ImplNull}

Just for extra XR rLFA verification/practice, we will repeat the exercise to trace the LSP protecting the link between XRv13 and CSR4. Remember that XRv13 will build two tunnels to its PQ-node: one protects the link to XRv14 and one protects the link to CSR4. First we check the label bindings from LDP, then the OSPF RIB, then the FIB. We see that the label stack {94003 5002} contains two labels: the bottom label is 5002 which represents CSR5's label for 4.4.4.4/32, while the top label is 94003 which represents XRv14's label for 5.5.5.5/32. The RIB shows us the protected routing information and the FIB shows us the full label stack.

RP/0/0/CPU0:XRv13#show mpls ldp bindings 4.4.4.4/32 neighbor 5.5.5.5
4.4.4.4/32, rev 121
        Local binding: label: 93009
        Remote bindings: (3 peers)
            Peer                Label
            -----------------   ---------
            5.5.5.5:0           5002

RP/0/0/CPU0:XRv13#show mpls ldp bindings 5.5.5.5/32 neighbor 14.14.14.14
5.5.5.5/32, rev 164
        Local binding: label: 93002
        Remote bindings: (3 peers)
            Peer                Label
            -----------------   ---------
            14.14.14.14:0       94003

RP/0/0/CPU0:XRv13#show ospf routes 4.4.4.4/32 backup-path detail
OSPF Route entry for 4.4.4.4/32
  Route type: Intra-area
  Metric: 2
  SPF priority: 4, SPF version: 134
  RIB version: 0, Source: Unknown
      34.0.0.4, from 4.4.4.4, via GigabitEthernet0/0/0/0.534, path-id 1
       Backup path: Remote, LFA: 5.5.5.5
        43.0.0.14, from 4.4.4.4, via GigabitEthernet0/0/0/0.543, protected bitmap 0000000000000001
        Attribues: Metric: 2, SRLG Disjoint

RP/0/0/CPU0:XRv13#show cef ipv4 4.4.4.4/32
4.4.4.4/32, version 481, internal 0x1000001 0x0 (ptr 0xa140ee74) [1], 0x0 (0xa13f4368), 0xa28 (0xa177c1b0)
 local adjacency 34.0.0.4
 Prefix Len 32, traffic index 0, precedence n/a, priority 3
   via 34.0.0.4, GigabitEthernet0/0/0/0.534, 5 dependencies, weight 0, class 0, protected [flags 0x400]
    path-idx 0 bkup-idx 1 NHID 0x0 [0xa184c29c 0xa184c0cc], parent-ifh 0x100
    next hop 34.0.0.4
     local label 93009      labels imposed {ImplNull}
   via 43.0.0.14, GigabitEthernet0/0/0/0.543, 5 dependencies, weight 0, class 0, backup [flags 0x300]
    path-idx 1 NHID 0x0 [0xa0e8f49c 0x0]
    next hop 43.0.0.14, PQ-node 5.5.5.5
    local adjacency
     local label 93009      labels imposed {94003 5002}

We will do one more test with XRv rLFA with OSPF for practice. We will revert to the pentagon topology for this, which means the router does not need to invoke an extended P-space calculation. Note: Unlike IS-IS, we do not have the capability to filter rLFA targets by RID, but we can enable the process globally versus having to do it on every interface. The output below should confirm that we have an LDP neighbor with CSR1, along with some new labels. The rLFA tunnel to CSR1 allows XRv13 to protect the path to XRv14's loopback, which is normally reachable via a direct link.

RP/0/0/CPU0:XRv13#show ospf route 14.14.14.14/32 backup-path
Topology Table for ospf 1 with ID 13.13.13.13
Codes: O - Intra area, O IA - Inter area
       O E1 - External type 1, O E2 - External type 2
       O N1 - NSSA external type 1, O N2 - NSSA external type 2
O    14.14.14.14/32, metric 2
       43.0.0.14, from 14.14.14.14, via GigabitEthernet0/0/0/0.543, path-id 1
           Backup path: Remote, LFA: 1.1.1.1
              34.0.0.4, from 14.14.14.14, via GigabitEthernet0/0/0/0.534, protected bitmap 0000000000000001

RP/0/0/CPU0:XRv13#show mpls ldp neighbor 1.1.1.1
Peer LDP Identifier: 1.1.1.1:0
  TCP connection: 1.1.1.1:646 - 13.13.13.13:24247
  Graceful Restart: No
  Session Holdtime: 180 sec
  State: Oper; Msgs sent/rcvd: 14/17; Downstream-Unsolicited
  Up time: 00:03:33
  LDP Discovery Sources:
    IPv4: (1)
      Targeted Hello (13.13.13.13 -> 1.1.1.1, active/passive)
    IPv6: (0)
  Addresses bound to this peer:
    IPv4: (3)
      1.1.1.1        12.0.0.1       14.0.0.1
    IPv6: (0)

We now learn CSR1's label for 14.14.14.14/32 which is 1007. This is the bottom label in the rLFA tunnel stack. The top label is whatever CSR4's label is for 1.1.1.1/32 (the rLFA endpoint), which is 4000. Thus, the rLFA label stack is {4000 1007}. After OSPF converges, XRv13 will use label 4004 to reach XRv14's loopback as this is the ordinary LDP LSP, using labels learned from the IGP nexthop (CSR4).

RP/0/0/CPU0:XRv13#show mpls ldp bindings 14.14.14.14/32
14.14.14.14/32, rev 186
        Local binding: label: 93000
        Remote bindings: (3 peers)
            Peer                Label
            -----------------   ---------
            1.1.1.1:0           1007
            4.4.4.4:0           4004
            14.14.14.14:0       ImpNull

RP/0/0/CPU0:XRv13#show mpls ldp bindings 1.1.1.1/32
1.1.1.1/32, rev 214
        Local binding: label: 93004
        Remote bindings: (3 peers)
            Peer                Label
            -----------------   ---------
            1.1.1.1:0           ImpNull
            4.4.4.4:0           4000
            14.14.14.14:0       94002

37.1.2 IS-IS

37.1.2.1 Direct LFA


The LFA topology for IS-IS is identical to the one used for OSPF, except we set the OSPF AD to 255 to remove its routes from the RIB. The LFA process is very similar between OSPF and IS-IS so this test will focus on the differences. The diagram is shown again below.

General notes:
• On XE, IPv6 LFA for OSPFv3 and IS-IS both appear unsupported. Although the configuration is generic to the IS-IS process and not to an AF, the IS-IS IPv6 RIB doesn't show LFA paths.
• IS-IS does not have a concept of "broadcast interface disjoint", so that isn't a tiebreaker.
• When you enable LFA in XE, you must select an IS-IS level, unlike OSPF where the area is optional. In XR, the level is optional.
• IS-IS does not appear to have a show command to list the tie-breaker sequence.
• IS-IS considers loopbacks (/32) to be "medium" priority by default.

Like the OSPFv2 direct LFA test, I've increased CSR1's metric on the LAN to CSR2/XRv13 and the P2P link to CSR4 to 15. CSR1 prefers the P2P path via CSR2 to reach XRv14's loopback, 14.14.14.14/32. The repair path via XRv13 is downstream (XRv13 is closer to XRv14 than CSR1 is), node-protecting (we don't go through CSR2 on the repair-path), and SRLG-disjoint (no SRLGs configured yet).

! CSR1
interface GigabitEthernet2.513
 isis metric 15
interface GigabitEthernet2.514
 isis metric 15


CSR1#show isis rib 14.14.14.14
IPv4 local RIB for IS-IS process 1
IPV4 unicast topology base (TID 0, TOPOID 0x0) =================
Repair path attributes:
    DS - Downstream, LC - Linecard-Disjoint, NP - Node-Protecting
    PP - Primary-Path, SR - SRLG-Disjoint
Routes under majornet 14.0.0.0/8:
14.14.14.14/32
  [115/L2/20] via 12.0.0.2(GigabitEthernet2.512), from 14.14.14.14, tag 0, LSP[6/25]
     (installed)
     repair path: 123.0.0.13(GigabitEthernet2.513) metric:25 (DS,NP,SR) LSP[3]
  [115/L2/25] via 123.0.0.13(GigabitEthernet2.513), from 13.13.13.13, tag 0, LSP[3/28]

If we remove XRv13 by shutting down its LAN interface, CSR1 will install the repair path via CSR2 through the LAN, but it is no longer node-protecting since we rely on CSR2 for primary and repair paths. This is less ideal and is the reason IS-IS picked the XRv13 repair path over the CSR2 repair path in the first place. Notice the extra path via CSR4 at the bottom; this is like an "ignored" but valid backup path.

! XRv13
interface GigabitEthernet0/0/0/0.513
 shutdown

CSR1#show isis rib 14.14.14.14 | begin 14\.14
14.14.14.14/32
  [115/L2/20] via 12.0.0.2(GigabitEthernet2.512), from 14.14.14.14, tag 0, LSP[6/28]
     (installed)
     repair path: 123.0.0.2(GigabitEthernet2.513) metric:25 (DS,SR) LSP[6]
  [115/L2/35] via 14.0.0.4(GigabitEthernet2.514), from 13.13.13.13, tag 0, LSP[3/31]

Let's force the backup path to use CSR4 versus CSR2 or XRv13. We bring XRv13 back into the network and set SRLGs on CSR1's interface to CSR2 (P2P) and LAN. Now, the backup path through CSR4 is preferred due to the SRLG being disjoint. Node-protection is achieved since CSR2 is not in CSR4's path to XRv14. Notice that CSR4 is not downstream when compared to CSR1 since both of their metrics to XRv14 are 20.

! CSR1
interface GigabitEthernet2.512
 srlg gid 222
interface GigabitEthernet2.513
 srlg gid 222

! XRv13
interface GigabitEthernet0/0/0/0.513
 no shutdown

2123 © 2016 Nicholas J. Russo

CSR1#show isis rib 14.14.14.14 | begin 14\.14 14.14.14.14/32 [115/L2/20] via 12.0.0.2(GigabitEthernet2.512), from 14.14.14.14, tag 0, LSP[6/29] (installed) repair path: 14.0.0.4(GigabitEthernet2.514) metric:35 (NP,SR) LSP[3] [115/L2/25] via 123.0.0.13(GigabitEthernet2.513), from 13.13.13.13, tag 0, LSP[3/32]

The next example changes CSR1's interface costs again and removes the SRLGs. The P2P link to CSR2 is changed to 15 and the LAN remains 15, while the cost to CSR4 is set to 10. We have three-way ECMP but one of the paths (via XRv13 LAN) is listed as a backup for the CSR2 P2P link path. We see a new option, primary-path, which indicates that the repair-path is also an ECMP path already used for forwarding.

! CSR1
interface GigabitEthernet2.512
 isis metric 15
 no srlg gid 222
interface GigabitEthernet2.513
 no srlg gid 222
interface GigabitEthernet2.514
 isis metric 10

CSR1#show isis rib 14.14.14.14 | begin 14\.14
14.14.14.14/32
  [115/L2/25] via 123.0.0.13(GigabitEthernet2.513), from 13.13.13.13, tag 0, LSP[4/42]
     (installed)
  [115/L2/25] via 123.0.0.13(GigabitEthernet2.513), from 14.14.14.14, tag 0, LSP[3/42]
  [115/L2/25] via 12.0.0.2(GigabitEthernet2.512), from 14.14.14.14, tag 0, LSP[3/42]
     (installed)
     repair path: 123.0.0.13(GigabitEthernet2.513) metric:25 (PP,DS,NP,SR) LSP[3]
  [115/L2/25] via 123.0.0.2(GigabitEthernet2.513), from 14.14.14.14, tag 0, LSP[3/42]
     (installed)

We can also check the repair path to CSR5 from here. CSR1 should prefer CSR4 based on lowest metric but uses the P2P path via CSR2 as a repair-path. CSR2 is not downstream since both routers have a metric of 20 to reach 5.5.5.5/32, but it is outside of the primary path (node protecting) and does not share an SRLG.

CSR1#show isis rib 5.5.5.5 | begin 5\.5
5.5.5.5/32
  [115/L2/20] via 14.0.0.4(GigabitEthernet2.514), from 5.5.5.5, tag 0, LSP[5/39]
     (installed)
     repair path: 12.0.0.2(GigabitEthernet2.512) metric:35 (NP,SR) LSP[5]

Shutting down CSR2's P2P link to CSR1 forces the repair path through the CSR2 LAN hop. Considering the attributes of this path are identical to those of the P2P link, I am assuming IS-IS prefers P2P over multi-access links. Cisco mentions LFA doesn’t work on IS-IS broadcast networks, but this might be a hardware-based limitation since the control-plane has shown multi-access-facing LFAs in earlier tests.

CSR1#show isis rib 5.5.5.5 | begin 5\.5
5.5.5.5/32
  [115/L2/20] via 14.0.0.4(GigabitEthernet2.514), from 5.5.5.5, tag 0, LSP[5/41]
     (installed)
     repair path: 123.0.0.2(GigabitEthernet2.513) metric:35 (NP,SR) LSP[5]

Completely removing CSR2 from the network, CSR1 now prefers XRv13 via the LAN. This is less preferable because it does not offer node protection, as XRv13 has an ECMP path to CSR5 that routes through CSR4.

! CSR2
interface GigabitEthernet2
 shutdown

CSR1#show isis rib 5.5.5.5 | begin 5\.5
5.5.5.5/32
  [115/L2/20] via 14.0.0.4(GigabitEthernet2.514), from 5.5.5.5, tag 0, LSP[5/40]
     (installed)
     repair path: 123.0.0.13(GigabitEthernet2.513) metric:35 (SR) LSP[5]

Of course, we can change XRv13's costs so that it prefers to route through XRv14 to reach CSR5. Reducing the metric on the XRv13 link facing XRv14 to 5 should force traffic through XRv14 only. We should achieve node protection (and downstream status) for this backup path now. The repair-path metric also drops from 35 to 30 because the backup path’s cost follows XRv13’s best path through XRv14. In summary, IS-IS LFA behaves almost identically to OSPF LFA.

! XRv13
router isis 1
 interface GigabitEthernet0/0/0/0.534
  address-family ipv4 unicast
   metric 5

CSR1#show isis rib 5.5.5.5 | begin 5\.5
5.5.5.5/32
  [115/L2/20] via 14.0.0.4(GigabitEthernet2.514), from 5.5.5.5, tag 0, LSP[13/17]
     (installed)
     repair path: 123.0.0.13(GigabitEthernet2.513) metric:30 (DS,NP,SR) LSP[13]

Next, we quickly examine direct LFA on IOS-XR. Both IPv4 and IPv6 are supported, which is an improvement over XE. I have enabled LFA for both AFs on XRv13 but IPv6 will be analyzed later. The configuration is highly redundant since the feature must be enabled under each interface per-AFI.

! XRv13
router isis 1
 interface GigabitEthernet0/0/0/0.513
  address-family ipv4 unicast
   fast-reroute per-prefix
  address-family ipv6 unicast
   fast-reroute per-prefix
 interface GigabitEthernet0/0/0/0.534
  address-family ipv4 unicast
   fast-reroute per-prefix
  address-family ipv6 unicast
   fast-reroute per-prefix
 interface GigabitEthernet0/0/0/0.543
  address-family ipv4 unicast
   fast-reroute per-prefix
  address-family ipv6 unicast
   fast-reroute per-prefix

Earlier we changed the IPv4 metrics so that XRv13 routes via XRv14 to reach 5.5.5.5/32. It now has an LFA via CSR4; the "detail" keyword lets us see which path attributes apply. This path is not a primary path (P), has a total metric of 20 (TM), is not linecard disjoint (LC), is node protecting (NP), is downstream (D), and is SRLG disjoint. The “SRLG: Yes” phrase is confusing because XR is really saying “SRLG is absent”, not “SRLG is present”. The output also shows us that the default prefix-priority of this loopback is medium, not high like XE. Default tie-breakers for XR IS-IS LFA are as follows. Notice that SRLG is not considered by default.

10 Primary path
20 Lowest metric
30 Line card disjoint
40 Node protection

RP/0/0/CPU0:XRv13#show isis ipv4 fast-reroute 5.5.5.5/32 detail
L2 5.5.5.5/32 [15/115] medium priority
     via 43.0.0.14, GigabitEthernet0/0/0/0.543, XRv14, Weight: 0
       FRR backup via 34.0.0.4, GigabitEthernet0/0/0/0.534, CSR4, Weight: 0
       P: No, TM: 20, LC: No, NP: Yes, D: Yes, SRLG: Yes
     src CSR5.00-00, 5.5.5.5

Next, we will use SRLG to see if we can select the path through CSR1 instead of CSR4. Below is the SRLG configuration. XR SRLG configuration is nice because you can use descriptive names or values. We also have to tell IS-IS to use the SRLG in its computations. I also adjust SRLG to be the most important tiebreaker for IS-IS LFA, which is specific to level-2.

! XRv13
srlg
 name SAME_FIBER_PAIR value 134
 interface GigabitEthernet0/0/0/0.534
  name SAME_FIBER_PAIR
 interface GigabitEthernet0/0/0/0.543
  name SAME_FIBER_PAIR
router isis 1
 address-family ipv4 unicast
  fast-reroute per-prefix tiebreaker srlg-disjoint index 1 level 2

Now, the LFA is via CSR1, which is not downstream (CSR1 cost is 20, XRv13 is 15), but is node protecting and SRLG disjoint.

RP/0/0/CPU0:XRv13#show isis fast-reroute 5.5.5.5/32 detail
L2 5.5.5.5/32 [15/115] medium priority
     via 43.0.0.14, GigabitEthernet0/0/0/0.543, XRv14, Weight: 0
       FRR backup via 123.0.0.1, GigabitEthernet0/0/0/0.513, CSR1, Weight: 0
       P: No, TM: 30, LC: No, NP: Yes, D: No, SRLG: Yes
     src CSR5.00-00, 5.5.5.5

Last, we modify the IS-IS path selection to make lowest-metric the most important tie-breaker with SRLG as the second most important. The original backup path via CSR4 should be preferred since its total metric is lower (20 versus 30), but notice that this path is not SRLG disjoint. As a reminder, the text “SRLG: No” indicates that the SRLG is present and the LFA is not SRLG-disjoint.

! XRv13
router isis 1
 address-family ipv4 unicast
  fast-reroute per-prefix tiebreaker lowest-backup-metric index 1 level 2
  fast-reroute per-prefix tiebreaker srlg-disjoint index 2 level 2

RP/0/0/CPU0:XRv13#show isis fast-reroute 5.5.5.5/32 detail
L2 5.5.5.5/32 [15/115] medium priority
     via 43.0.0.14, GigabitEthernet0/0/0/0.543, XRv14, Weight: 0
       FRR backup via 34.0.0.4, GigabitEthernet0/0/0/0.534, CSR4, Weight: 0
       P: No, TM: 20, LC: No, NP: Yes, D: Yes, SRLG: No
     src CSR5.00-00, 5.5.5.5
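The effect of reordering the tie-breakers in the last two tests can be modeled as a chain of filters applied in index order. This is a minimal sketch of that idea and an assumption about the logic, not XR's actual implementation. The function names and dictionary keys are hypothetical; the metrics and SRLG status come from the two "detail" outputs above.

```python
# Sketch: ordered LFA tie-breakers as successive filters (a model, not XR code).


def lowest_backup_metric(cands):
    """Keep only the candidates with the lowest total backup metric."""
    best = min(c["metric"] for c in cands)
    return [c for c in cands if c["metric"] == best]


def srlg_disjoint(cands):
    """Prefer SRLG-disjoint candidates; keep everything if none qualify."""
    preferred = [c for c in cands if c["srlg_disjoint"]]
    return preferred or cands


def select_lfa(cands, tiebreakers):
    """Apply tie-breakers from lowest index to highest until one path remains."""
    remaining = list(cands)
    for rule in tiebreakers:
        if len(remaining) == 1:
            break  # tie already broken; later rules are never evaluated
        remaining = rule(remaining)
    return remaining[0]["via"]


# XRv13's candidate backups for 5.5.5.5/32 once the SRLG is configured.
candidates = [
    {"via": "CSR4", "metric": 20, "srlg_disjoint": False},
    {"via": "CSR1", "metric": 30, "srlg_disjoint": True},
]

# srlg-disjoint at index 1 selects CSR1.
print(select_lfa(candidates, [srlg_disjoint, lowest_backup_metric]))
# lowest-backup-metric at index 1 and srlg-disjoint at index 2 selects CSR4.
print(select_lfa(candidates, [lowest_backup_metric, srlg_disjoint]))
```

The early exit in `select_lfa` mirrors the observation that a lower-priority tie-breaker is irrelevant once a higher-priority one has already left a single candidate.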

In summary, direct (or local) LFA is very similar between IS-IS and OSPF, and also between XE and XR. The exact tiebreakers and their sequential arrangements vary slightly between platforms and protocols.

37.1.2.2 Remote LFA

Remote LFA (rLFA) for IS-IS is very similar to OSPF. We still require MPLS encapsulation with LDP signaling regardless of the IGP in play. The mechanics of rLFA (P-space, extended P-space, Q-space, and PQ-nodes) were discussed already and are not repeated here because the concepts are the same. The testing below just demonstrates the feature using IS-IS. We will use the same pentagon topology as we used in OSPF, ensuring that we bind IS-IS to BFD on the P2P link between CSR1 and CSR2. All interfaces have an IS-IS metric of 10. For brevity we will not demonstrate the square topology as the configuration and verification are identical.

A quick verification of BFD is in order because without BFD, LFA data-plane operations do not work. This test is targeting the CSR1-CSR2 link as the protected link, so ensuring BFD is enabled on that link is a good first step.

CSR1#show bfd neighbors client isis
IPv4 Sessions
NeighAddr        LD/RD        RH/RS    State    Int
12.0.0.2         4097/4097    Up       Up       Gi2.512

We enable rLFA on CSR1 under the IS-IS process. We must specify an IS-IS level, which is different than OSPF. In OSPF, we could optionally specify an area, and failing to specify an area would protect all areas as well as external routes. IS-IS is less generic and requires the rLFA feature specified for each level. The output below is very terse and there is no "detail" option, but gives quick insight. We don’t really care about Lfa24 because we are trying to protect the link to CSR2, which implies an rLFA tunnel to XRv13. If we were trying to protect the link to CSR4, the rLFA tunnel to XRv14 would matter.

! CSR1
router isis 1
 fast-reroute remote-lfa level-2 mpls-ldp maximum-metric 20

CSR1#show isis fast-reroute remote-lfa tunnels
Tag 1 - Fast-Reroute Remote-LFA Tunnels:
  MPLS-Remote-Lfa23: use Gi2.514, nexthop 14.0.0.4, end point 13.13.13.13
  MPLS-Remote-Lfa24: use Gi2.512, nexthop 12.0.0.2, end point 14.14.14.14

We should verify that we have an LDP neighbor with the PQ-node, XRv13, and that we learn its LDP local-label for 2.2.2.2/32. This prefix is the final destination that we are trying to protect. Forming an LDP session with XRv13 assumes that XRv13 is accepting tLDP sessions from CSR1 (configured earlier). The bottom label in the rLFA tunnel will be 93003. Also note that label 4007 will be used for normal forwarding once the rLFA tunnel is torn down after IS-IS reconverges. The top label is the transport label towards the rLFA tunnel endpoint, 13.13.13.13/32; it is learned from CSR4 with a value of 4001.

CSR1#show mpls ldp neighbor 13.13.13.13
    Peer LDP Ident: 13.13.13.13:0; Local LDP Ident 1.1.1.1:0
        TCP connection: 13.13.13.13.48869 - 1.1.1.1.646
        State: Oper; Msgs sent/rcvd: 23/24; Downstream
        Up time: 00:06:33
        LDP discovery sources:
          Targeted Hello 1.1.1.1 -> 13.13.13.13, active
        Addresses bound to peer LDP Ident:
          13.13.13.13     34.0.0.13       43.0.0.13

CSR1#show mpls ldp bindings 2.2.2.2 32
  lib entry: 2.2.2.2/32, rev 4
        local binding:  label: 1004
        remote binding: lsr: 2.2.2.2:0, label: imp-null
        remote binding: lsr: 4.4.4.4:0, label: 4007
        remote binding: lsr: 13.13.13.13:0, label: 93003

CSR1#show mpls ldp bindings 13.13.13.13 32 neighbor 4.4.4.4
  lib entry: 13.13.13.13/32, rev 10
        remote binding: lsr: 4.4.4.4:0, label: 4001

We still have MPLS FNF configured on CSR4 facing CSR1 from the OSPF tests. When we send traffic from CSR1 to CSR2's loopback, then break the P2P link between them (breakage not shown), we see a short burst of traffic with label stack {4001 93003} which is rLFA-encapsulated traffic shown in green. Traffic with label 4007 is used thereafter for normal forwarding when IS-IS converges, shown in yellow. Notice that after 119 packets, the rLFA is no longer used, and the traffic uses the regular LDP signaled LSP for transport.

CSR4#show flow monitor FNF_MPLS_MONITOR cache format table | begin MPLS
MPLS LABEL 1  MPLS LABEL 2        pkts
============  ============  ==========
4007 /1       0 /0                   5
4001 /1       93003 /1             119

CSR4#show flow monitor FNF_MPLS_MONITOR cache format table | begin MPLS
MPLS LABEL 1  MPLS LABEL 2        pkts
============  ============  ==========
4007 /1       0 /0                  33
4001 /1       93003 /1             119

Because BFD is not supported on XRv, we will limit our rLFA tests to control-plane verifications. We will enable rLFA on XRv13 to protect the path to 14.14.14.14/32 (protect the link between XRv13 and XRv14). rLFA for IS-IS in XR is enabled at the link-level; the AF-level command is used to filter PQ-node candidacy based on RID with a prefix-list. I added this filter to the configuration just to test the feature; I don't care about backing up the path to 4.4.4.4/32 so I do not need an rLFA tunnel to CSR2 for this test, only CSR1. This can help scale rLFA since building too many tunnels to protect links unnecessarily may be resource-intensive.

! XRv13
ipv4 prefix-list PL_CSR1
 10 permit 1.1.1.1/32
router isis 1
 address-family ipv4 unicast
  fast-reroute per-prefix remote-lfa prefix-list PL_CSR1
 interface GigabitEthernet0/0/0/0.513
  address-family ipv4 unicast
   fast-reroute per-prefix remote-lfa tunnel mpls-ldp level 2
   fast-reroute per-prefix remote-lfa maximum-metric 20
 interface GigabitEthernet0/0/0/0.534
  address-family ipv4 unicast
   fast-reroute per-prefix remote-lfa tunnel mpls-ldp level 2
   fast-reroute per-prefix remote-lfa maximum-metric 20
 interface GigabitEthernet0/0/0/0.543
  address-family ipv4 unicast
   fast-reroute per-prefix remote-lfa tunnel mpls-ldp level 2
   fast-reroute per-prefix remote-lfa maximum-metric 20

The output below shows that XRv14's loopback is backed up by an rLFA tunnel terminating on CSR1. Notice that the total metric (TM) is the cost to the tunnel endpoint, not the entire path.

RP/0/0/CPU0:XRv13#show isis fast-reroute 14.14.14.14/32 detail
L2 14.14.14.14/32 [10/115] medium priority
     via 43.0.0.14, GigabitEthernet0/0/0/0.543, XRv14, Weight: 0
       Remote FRR backup via CSR1 [1.1.1.1], via 34.0.0.4, GigabitEthernet0/0/0/0.534 CSR4, Weight: 0
       P: No, TM: 20, LC: No, NP: No, D: No, SRLG: No
     src XRv14.00-00, 14.14.14.14

RP/0/0/CPU0:XRv13#show mpls ldp neighbor 1.1.1.1
Peer LDP Identifier: 1.1.1.1:0
  TCP connection: 1.1.1.1:646 - 13.13.13.13:51129
  Graceful Restart: No
  Session Holdtime: 180 sec
  State: Oper; Msgs sent/rcvd: 13/13; Downstream-Unsolicited
  Up time: 00:02:32
  LDP Discovery Sources:
    IPv4: (1)
      Targeted Hello (13.13.13.13 -> 1.1.1.1, active/passive)
    IPv6: (0)
  Addresses bound to this peer:
    IPv4: (3)
      1.1.1.1        12.0.0.1       14.0.0.1
    IPv6: (0)

Next, we determine the label stacks required if this rLFA tunnel were to be used. We learn CSR1's label for XRv14's loopback is 1007 which would be the bottom label in the rLFA stack. Label 4004 would be the post-IGP convergence label we would use after the rLFA tunnel is no longer needed. Last, we need to know the transport label to reach the rLFA tunnel endpoint, which is 1.1.1.1/32. We have label 4000 from CSR4 that directs us there. The rLFA stack would be {4000 1007} and the label used after IS-IS converges is 4004.

RP/0/0/CPU0:XRv13#show mpls ldp bindings 14.14.14.14/32
14.14.14.14/32, rev 186
        Local binding: label: 93000
        Remote bindings: (3 peers)
            Peer                Label
            -----------------   ---------
            1.1.1.1:0           1007
            4.4.4.4:0           4004
            14.14.14.14:0       ImpNull

RP/0/0/CPU0:XRv13#show mpls ldp bindings 1.1.1.1/32 neighbor 4.4.4.4
1.1.1.1/32, rev 214
        Local binding: label: 93004
        Remote bindings: (3 peers)
            Peer                Label
            -----------------   ---------
            4.4.4.4:0           4000
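The label stack arithmetic used in both rLFA tests follows one pattern: the outer label is the repair next hop's binding for the PQ-node, and the inner label is the PQ-node's own binding for the protected destination. The sketch below is only an illustration of that lookup. The function name, argument names, and dictionary layout are hypothetical; the label values are copied from the LDP binding outputs shown for CSR1 and XRv13.

```python
# Sketch: deriving an rLFA label stack from LDP remote bindings.
# Dictionary layout: prefix -> {advertising LSR: label}.


def rlfa_label_stack(bindings, dest_prefix, pq_prefix, pq_lsr, repair_lsr):
    """Return [outer, inner] labels for traffic tunneled to the PQ-node."""
    outer = bindings[pq_prefix][repair_lsr]  # transport label toward the PQ-node
    inner = bindings[dest_prefix][pq_lsr]    # PQ-node's label for the destination
    return [outer, inner]


# CSR1 protecting 2.2.2.2/32 with PQ-node XRv13, repairing via CSR4.
csr1_bindings = {
    "2.2.2.2/32": {"2.2.2.2:0": "imp-null", "4.4.4.4:0": 4007, "13.13.13.13:0": 93003},
    "13.13.13.13/32": {"4.4.4.4:0": 4001},
}

# XRv13 protecting 14.14.14.14/32 with PQ-node CSR1, repairing via CSR4.
xrv13_bindings = {
    "14.14.14.14/32": {"1.1.1.1:0": 1007, "4.4.4.4:0": 4004, "14.14.14.14:0": "ImpNull"},
    "1.1.1.1/32": {"4.4.4.4:0": 4000},
}

print(rlfa_label_stack(csr1_bindings, "2.2.2.2/32", "13.13.13.13/32",
                       "13.13.13.13:0", "4.4.4.4:0"))
print(rlfa_label_stack(xrv13_bindings, "14.14.14.14/32", "1.1.1.1/32",
                       "1.1.1.1:0", "4.4.4.4:0"))
```

After the IGP converges, the single label used is simply the new next hop's binding for the destination, for example `csr1_bindings["2.2.2.2/32"]["4.4.4.4:0"]`.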

37.1.3 EIGRP

EIGRP LFA is the simplest topic in this section. All EIGRP LFA does is take feasible successors and preinstall them into the FIB for slightly faster convergence. EIGRP is not actually doing more work than it used to, and only feasible successor (FS) paths are candidates for FRR. After all, they are guaranteed to be LFAs already, unlike OSPF and IS-IS which required additional logic. This implies that all EIGRP LFA paths must be downstream; in OSPF and IS-IS we saw that backup paths did not need to be downstream because the link-state nature of those protocols lets a router run SPF from the perspective of its neighbors. EIGRP does not have that intelligence, so meeting the feasibility condition (which is effectively mandating “downstream”) is required. In a virtual platform, EIGRP LFA probably does not have a significant performance impact since there are no hardware linecards or TCAM components to program.

We will use the same topology we used earlier for EIGRP FRR with IPv4. IPv6 does not appear to be supported at this time, nor is there a concept of remote LFA in EIGRP. EIGRP LFA is also not supported in XR. As a result, the EIGRP FRR options are much less comprehensive than OSPF and IS-IS. For this test, only interface delay is considered, which effectively allows us to mimic IS-IS and OSPF path selection mechanisms; using bandwidth makes the EIGRP calculation more complicated since it's evaluated per-path, not per-hop.


We will begin by evaluating CSR5's route to XRv13 with metric modifications. With FRR enabled in OSPF or IS-IS, we expect two ECMP paths via CSR4 and XRv14, each of which backs the other one up. In EIGRP, this concept does not apply, and we only see the two ECMP paths as we normally would. This is acceptable since the paths are both installed in the FIB anyway (primary paths used for ECMP), so EIGRP need not identify them as mutually-supporting LFAs.

! CSR5
router eigrp LFA_TEST
 address-family ipv4 unicast autonomous-system 1
  topology base
   fast-reroute per-prefix all

CSR5#show eigrp address-family ipv4 topology frr | section 13\.13
P 13.13.13.13/32, 2 successors, FD is 3997696
        via 45.0.0.4 (3997696/720896), GigabitEthernet2.545
        via 54.0.0.14 (3997696/720896), GigabitEthernet2.554

Increasing the delay on CSR5's link to XRv14 will allow CSR5 to prefer the path via CSR4 as primary. The path through XRv14 is an LFA because it is an EIGRP feasible successor (XRv14 is closer to XRv13 than CSR5 is, since 720896 < 3997696). OSPF and IS-IS call this "downstream" which is a more generic term of expressing the feasibility condition in EIGRP (same concept).

CSR5#show eigrp address-family ipv4 topology frr | section 13\.13
P 13.13.13.13/32, 1 successors, FD is 3997696
        via 45.0.0.4 (3997696/720896), GigabitEthernet2.545
        via 54.0.0.14 (7274496/720896), GigabitEthernet2.554, [LFA]

Next, we will set the delay of all EIGRP interfaces to 5 to make traffic engineering a little easier (not shown). Using EIGRP named-mode on XE displays the delay in picoseconds (ps) versus microseconds (µs), which were used in classic EIGRP. Note: A quick conversion from ps to µs is to trim the last 6 zeroes of the ps field. If we increase the delay value on XRv14's link to XRv13 to 10, then CSR5 and XRv14 are equidistant from XRv13. The EIGRP feasibility condition (or downstream condition) is not satisfied, and the alternate path via XRv14 is not an LFA. OSPF and IS-IS are more intelligent, and given their link-state information they can still use non-downstream paths in some cases. EIGRP cannot: here the RD reported by XRv14 equals CSR5's own best metric (6619136), so the path fails the feasibility condition. This proves that EIGRP’s FS condition remains inflexible despite the LFA feature.

CSR5#show eigrp address-family ipv4 topology frr | section 13\.13
P 13.13.13.13/32, 1 successors, FD is 3997696
        via 45.0.0.4 (6619136/3342336), GigabitEthernet2.545, serno 152
        via 54.0.0.14 (9895936/6619136), GigabitEthernet2.554
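The two CSR5 outputs above differ only in whether the path via XRv14 meets the feasibility condition. The check itself is a single strict comparison, sketched below with the numbers from those outputs. The function name is hypothetical, and the comparison value in the second case follows the text's statement that FD equals RD at 6619136.

```python
# Sketch: the EIGRP feasibility condition as a strict inequality.


def is_feasible_successor(reported_distance: int, feasible_distance: int) -> bool:
    """A neighbor qualifies only if its RD is strictly less than the local FD."""
    return reported_distance < feasible_distance


# First test: XRv14 reports 720896 while CSR5's FD is 3997696.
print(is_feasible_successor(720896, 3997696))   # qualifies, so it is tagged [LFA]

# Second test: XRv14 reports 6619136, equal to CSR5's own best metric.
print(is_feasible_successor(6619136, 6619136))  # equal is not enough, so no LFA
```

Because the comparison is strict, an equidistant neighbor never qualifies, which is the inflexibility the second output demonstrates.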

EIGRP also has a limited number of tie-breakers, such as interface-disjoint, linecard-disjoint, lowest metric, and SRLG-disjoint. We will test interface-disjoint as the tie-breaker since we have a LAN interface. We will use CSR1 for this demonstration as it has additional path diversity. All interfaces have been reset to delay 5 as a baseline. Initially, CSR1 does 3-way load sharing to XRv14's loopback via CSR2 LAN, CSR2 P2P, and XRv13 LAN. Changing XRv13's cost to XRv14 to 4 makes it the best path. However, rather than prefer CSR2 over the LAN, CSR1 prefers CSR2 via the P2P link to be interface-disjoint. The CSR2 LAN link is definitely a feasible successor (LFA) but was not preferred. Notice that CSR4 is not a candidate LFA because its RD equals CSR1's FD (5963776). Also note that we did not need to explicitly configure this tie-breaker since it's enabled within EIGRP and cannot be disabled.

! XRv13
router eigrp LFA_TEST
 address-family ipv4
  interface GigabitEthernet0/0/0/0.543
   metric delay 4

CSR1#show eigrp address-family ipv4 topology frr | sec 14\.14
P 14.14.14.14/32, 1 successors, FD is 5963776
        via 123.0.0.13 (5963776/2686976), GigabitEthernet2.513
        via 12.0.0.2 (6619136/3342336), GigabitEthernet2.512, [LFA]
        via 123.0.0.2 (6619136/3342336), GigabitEthernet2.513
        via 14.0.0.4 (9240576/5963776), GigabitEthernet2.514

Increasing CSR1's cost on the CSR2 P2P link increases the metric to 14.14.14.14/32 via that link, but it's still the preferred LFA. In the previous example, both of these FS paths had the same metric (6619136), but the interface-disjoint characteristic served as the tie-breaker. This proves that interface-disjoint is a higher tie-breaker than lowest metric.

! CSR1
interface GigabitEthernet2.512
 delay 6

CSR1#show eigrp address-family ipv4 topology frr | sec 14\.14
P 14.14.14.14/32, 1 successors, FD is 5963776
        via 123.0.0.13 (5963776/2686976), GigabitEthernet2.513
        via 12.0.0.2 (7274496/3342336), GigabitEthernet2.512, [LFA]
        via 123.0.0.2 (6619136/3342336), GigabitEthernet2.513
        via 14.0.0.4 (9240576/5963776), GigabitEthernet2.514

When we configure lowest metric as the #1 tie-breaker, the LFA via CSR2 LAN is used. None of the metrics have changed, but the order in which tie-breakers are evaluated allowed us to select a new LFA. This tie-breaker prioritization process works identically to OSPF and IS-IS.

! CSR1
router eigrp LFA_TEST
 address-family ipv4 unicast autonomous-system 1
  topology base
   fast-reroute tie-break lowest-backup-path-metric 1

CSR1#show eigrp address-family ipv4 topology frr | sec 14\.14
P 14.14.14.14/32, 1 successors, FD is 5963776
        via 123.0.0.13 (5963776/2686976), GigabitEthernet2.513
        via 12.0.0.2 (7274496/3342336), GigabitEthernet2.512
        via 123.0.0.2 (6619136/3342336), GigabitEthernet2.513, [LFA]
        via 14.0.0.4 (9240576/5963776), GigabitEthernet2.514
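The last two CSR1 outputs can be reproduced with a small model: first drop any path that fails the feasibility condition, then apply the tie-breakers in priority order. This is a sketch of the observed behavior under that assumption, not a description of the IOS XE implementation. All function and key names are hypothetical; the metric, RD, and interface values are taken from the topology output above.

```python
# Sketch: EIGRP LFA selection as "feasibility filter, then ordered tie-breakers".

FD = 5963776                          # CSR1's feasible distance to 14.14.14.14/32
PRIMARY_IF = "GigabitEthernet2.513"   # successor path via 123.0.0.13

paths = [
    {"via": "12.0.0.2", "metric": 7274496, "rd": 3342336, "ifname": "GigabitEthernet2.512"},
    {"via": "123.0.0.2", "metric": 6619136, "rd": 3342336, "ifname": "GigabitEthernet2.513"},
    {"via": "14.0.0.4", "metric": 9240576, "rd": 5963776, "ifname": "GigabitEthernet2.514"},
]


def feasible(cands, fd):
    """Only feasible successors (RD strictly below FD) are LFA candidates."""
    return [c for c in cands if c["rd"] < fd]


def interface_disjoint(cands):
    """Prefer candidates that leave through a different interface than the primary."""
    preferred = [c for c in cands if c["ifname"] != PRIMARY_IF]
    return preferred or cands


def lowest_backup_path_metric(cands):
    """Keep only the candidates with the lowest metric."""
    best = min(c["metric"] for c in cands)
    return [c for c in cands if c["metric"] == best]


def select_lfa(cands, fd, tiebreakers):
    remaining = feasible(cands, fd)
    for rule in tiebreakers:
        if len(remaining) <= 1:
            break
        remaining = rule(remaining)
    return remaining[0]["via"] if remaining else None


# Interface-disjoint evaluated first: the P2P path wins despite its higher metric.
print(select_lfa(paths, FD, [interface_disjoint, lowest_backup_path_metric]))
# Lowest metric configured as tie-breaker 1: the CSR2 LAN path wins.
print(select_lfa(paths, FD, [lowest_backup_path_metric, interface_disjoint]))
```

In both orderings the path via 14.0.0.4 is excluded before any tie-breaker runs because its RD equals the FD.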

Let's reset the network to baseline delays of 5 (not shown) and test the SRLG feature. The only noteworthy metric changes are that CSR1's link to CSR4 has delay 2 and CSR4’s link to CSR5 has delay 3. This allows CSR1 to do 4-way load sharing to XRv14 as all paths have a cumulative delay value of 11 (XRv14’s loopback counts as delay 1).

! CSR1
interface GigabitEthernet2.514
 delay 2
! CSR4
interface GigabitEthernet2.545
 delay 3

CSR1#show ip eigrp topology frr | section 14\.14
P 14.14.14.14/32, 4 successors, FD is 5963776
        via 12.0.0.2 (6619136/3342336), GigabitEthernet2.512
        via 123.0.0.2 (6619136/3342336), GigabitEthernet2.513
        via 123.0.0.13 (6619136/3342336), GigabitEthernet2.513
        via 14.0.0.4 (6619136/5308416), GigabitEthernet2.514


We now reduce CSR1's delay to CSR2 on the P2P link to 4 so it becomes the best path, but there are three candidate LFAs all with equal cost. CSR1 selects XRv13 but the evaluation criterion for this decision is unclear. My initial guess is that XRv13 was chosen because it has the higher IP address.

! CSR1
interface GigabitEthernet2.512
 delay 4

CSR1#show ip eigrp topology frr | section 14\.14
P 14.14.14.14/32, 1 successors, FD is 5963776
        via 12.0.0.2 (5963776/3342336), GigabitEthernet2.512
        via 123.0.0.2 (6619136/3342336), GigabitEthernet2.513
        via 123.0.0.13 (6619136/3342336), GigabitEthernet2.513, [LFA]
        via 14.0.0.4 (6619136/5308416), GigabitEthernet2.514

Shutting down the LAN interface on XRv13 causes the router to pick CSR4. Despite having a lower IP address than 123.0.0.2, 14.0.0.4 was likely chosen due to offering node-protection, where 123.0.0.2 is the same router as 12.0.0.2 (successor path). Clearly the RD is not being evaluated since CSR4’s is the highest, but this was never a criterion for LFA paths provided the RD < FD (meets feasibility condition). Node-protection is not a term used in the EIGRP LFA documentation anywhere, so I cannot prove that node-protection is even evaluated by EIGRP. Considering it does not have a concept of “nodes” (no corresponding graph structure), node-protection in EIGRP is likely nonexistent.

! XRv13
interface GigabitEthernet0/0/0/0.513
 shutdown

CSR1#show ip eigrp topology frr | section 14\.14
P 14.14.14.14/32, 1 successors, FD is 5963776
        via 12.0.0.2 (5963776/3342336), GigabitEthernet2.512
        via 123.0.0.2 (6619136/3342336), GigabitEthernet2.513
        via 14.0.0.4 (6619136/5308416), GigabitEthernet2.514, [LFA]

Using SRLG, let's assume we want the router to always pick CSR4 over the LAN interfaces if possible. We apply the same SRLG to the CSR2 P2P link and the LAN interface. We must manually assign SRLG to the EIGRP tie-break list because it is not assigned by default. This is similar to IS-IS, but different than OSPF where SRLG is evaluated by default. Below, we add it as the second tiebreaker (lowest metric was the first). We also bring XRv13 back into the LAN. Without this SRLG configuration, XRv13 would immediately become the preferred LFA again. Using SRLGs in this way allows us to semi-administratively select CSR4 as the preferred LFA.

! CSR1
interface GigabitEthernet2.512
 srlg gid 45
interface GigabitEthernet2.513
 srlg gid 45
router eigrp LFA_TEST
 address-family ipv4 unicast autonomous-system 1
  topology base
   fast-reroute tie-break srlg-disjoint 2
! XRv13
interface GigabitEthernet0/0/0/0.513
 no shutdown

CSR1#show ip eigrp topology frr | section 14\.14
P 14.14.14.14/32, 1 successors, FD is 5963776
        via 12.0.0.2 (5963776/3342336), GigabitEthernet2.512
        via 123.0.0.13 (6619136/3342336), GigabitEthernet2.513
        via 123.0.0.2 (6619136/3342336), GigabitEthernet2.513
        via 14.0.0.4 (6619136/5308416), GigabitEthernet2.514, [LFA]

We can override the SRLG by making the cost to CSR4 less desirable. We increase the delay from 2 to 3 on CSR1's P2P link to CSR4 and now XRv13 is preferred again. CSR4 is still an EIGRP FS, and therefore a candidate LFA, but is rejected since lowest backup metric is the #1 tie-breaker. The selected XRv13 path violates the SRLG configuration, but SRLG is the #2 tie-breaker, so EIGRP doesn’t even evaluate it since the tie was already broken in the previous step.

! CSR1
interface GigabitEthernet2.514
 delay 3

CSR1#show ip eigrp topology frr | section 14\.14
P 14.14.14.14/32, 1 successors, FD is 5963776
        via 12.0.0.2 (5963776/3342336), GigabitEthernet2.512
        via 123.0.0.13 (6619136/3342336), GigabitEthernet2.513, [LFA]
        via 123.0.0.2 (6619136/3342336), GigabitEthernet2.513
        via 14.0.0.4 (7274496/5308416), GigabitEthernet2.514

37.2 Loop Free Alternate (LFA) for IPv6 (XR Only)

This feature only seems to be supported on XR currently. Remote LFA for IPv6 is going to require LDPv6 which supposedly came out in XR 5.3.0, which is newer than what the CCIE lab will use. We will examine it briefly for completeness since IPv6 is a major focus area of the CCIE SP exam.

37.2.1 OSPFv3

37.2.1.1 Direct LFA


We will evaluate using direct LFA with OSPFv3 on XR. Be sure to adjust the administrative distances to make OSPF preferred over IS-IS if you were previously testing IS-IS. The network is shown again below for reference.

We will examine the network from XRv14's perspective for this test. We enable LFA under area 0 in per-prefix mode. Initially, there is a familiar situation with two ECMP paths that are backing each other up. These are ECMP paths from XRv14 to ::1:1:1:1/128, which is CSR1’s IPv6 loopback prefix.

! XRv14
router ospfv3 1
 area 0
  fast-reroute per-prefix

RP/0/0/CPU0:XRv14#show ospfv3 routes ::1:1:1:1/128 backup-path
Topology Table for OSPFv3 1 with ID 14.14.14.14
* ::1:1:1:1/128, Intra, cost 2/0, area 0
    GigabitEthernet0/0/0/0.524, fe80::2, path-id 1
       Backup path:
          fe80::13, from 1.1.1.1, via GigabitEthernet0/0/0/0.543, protected bitmap 0x1
          Attribues: Metric: 2, Primary, Downstream, Node Protect, SRLG Disjoint
    GigabitEthernet0/0/0/0.543, fe80::13, path-id 2
       Backup path:
          fe80::2, from 1.1.1.1, via GigabitEthernet0/0/0/0.524, protected bitmap 0x2
          Attribues: Metric: 2, Primary, Downstream, Node Protect, SRLG Disjoint

For consistency, we recycle an old SRLG configuration on XRv14 to verify SRLGs work with OSPFv3. We then increase the OSPFv3 cost to XRv13 so that CSR2 is the primary with CSR5 as the backup. We also achieve node protection as CSR2 is not in the backup path. XRv13 is node-protecting also, but would violate the SRLG policy.

! XRv14
srlg
 interface GigabitEthernet0/0/0/0.524
  8 value 654
 interface GigabitEthernet0/0/0/0.543
  8 value 654
router ospfv3 1
 fast-reroute per-prefix tiebreaker srlg-disjoint index 1
 area 0
  interface GigabitEthernet0/0/0/0.543
   cost 2

RP/0/0/CPU0:XRv14#show ospfv3 routes ::1:1:1:1/128 backup-path
Topology Table for OSPFv3 1 with ID 14.14.14.14
* ::1:1:1:1/128, Intra, cost 2/0, area 0
    GigabitEthernet0/0/0/0.524, fe80::2, path-id 1
       Backup path:
          fe80::5, from 1.1.1.1, via GigabitEthernet0/0/0/0.554, protected bitmap 0x1
          Attribues: Metric: 3, Node Protect, SRLG Disjoint

Rather than repeat the exact set of tests we did for IPv4, we will test new things. Next, we will examine XRv14's path to CSR4's loopback. Due to the cost modification, XRv14 routes through CSR5 with a backup via XRv13. This is SRLG disjoint and node-protecting as expected.

RP/0/0/CPU0:XRv14#show ospfv3 routes ::4:4:4:4/128 backup-path
Topology Table for OSPFv3 1 with ID 14.14.14.14
* ::4:4:4:4/128, Intra, cost 2/0, area 0
    GigabitEthernet0/0/0/0.554, fe80::5, path-id 1
       Backup path:
          fe80::13, from 4.4.4.4, via GigabitEthernet0/0/0/0.543, protected bitmap 0x1
          Attribues: Metric: 3, Downstream, Node Protect, SRLG Disjoint

Before continuing, we remove our OSPFv3 cost modification so XRv14 performs ECMP back to CSR4 with the primary paths repairing one another (output not shown). We will test a new technology that also applies to IPv4 but was not evaluated earlier. This is the usage of “lfa-candidate” and “use-candidate-only”. The former identifies interfaces that are LFA candidate interfaces (that is, allowed to host LFA paths exiting them) and the latter forces LFAs to use only those interfaces explicitly identified as LFA candidates. It's like a stricter version of SRLG since we can hard-code candidate interfaces and force LFA to follow those constraints. The configuration below shows our LFA parameters. We want to use the link via CSR2 for LFAs if possible.

! XRv14
router ospfv3 1
 fast-reroute per-prefix lfa-candidate interface GigabitEthernet0/0/0/0.524
 fast-reroute per-prefix use-candidate-only enable

After applying these changes, we still have two ECMP paths, but both are backed up via CSR2. This makes sense since only that interface is able to host LFA paths. When using CSR5, the backup path is SRLG disjoint (the SRLG is on the CSR2 and XRv13 links) and node-protecting. When routing through XRv13, the backup path through CSR2 is neither SRLG disjoint nor node protecting, which would normally make it a relatively undesirable LFA.

RP/0/0/CPU0:XRv14#show ospfv3 routes ::4:4:4:4/128 backup-path
Topology Table for OSPFv3 1 with ID 14.14.14.14
* ::4:4:4:4/128, Intra, cost 2/0, area 0
    GigabitEthernet0/0/0/0.554, fe80::5, path-id 1
       Backup path:
          fe80::2, from 4.4.4.4, via GigabitEthernet0/0/0/0.524, protected bitmap 0x3
          Attribues: Metric: 3, Node Protect, SRLG Disjoint
    GigabitEthernet0/0/0/0.543, fe80::13, path-id 2
       Backup path:
          fe80::2, from 4.4.4.4, via GigabitEthernet0/0/0/0.524, protected bitmap 0x3
          Attribues: Metric: 3,

The path is not node protecting because CSR2 does 3-way ECMP between its two links to CSR1 and its link to XRv13. Thus, if the primary path from XRv14 to CSR4 routes through XRv13, we cannot guarantee node protection if we install an LFA via CSR2. Though LFA is configured locally, the resulting backup paths are highly dependent on the network views of the adjacent nodes.

CSR2#show ipv6 route ::4:4:4:4
Routing entry for ::4:4:4:4/128
  Known via "ospf 1", distance 110, metric 2, type intra area
  Backup from "isis 1 [115]"
  Route count is 3/3, share count 0
  Routing paths:
    FE80::1, GigabitEthernet2.512
      Last updated 00:24:47 ago
    FE80::13, GigabitEthernet2.513
      Last updated 00:00:00 ago
    FE80::1, GigabitEthernet2.513
      Last updated 00:00:00 ago

We can engineer node protection by increasing the OSPFv3 cost of CSR2's LAN interface to 2, which forces CSR2 not to use that interface to reach CSR4. As soon as we do that, we achieve node protection, since the LFA path (via CSR2) does not rely on XRv13 for reachability to CSR4. Instead, CSR2 routes via CSR1 to reach CSR4, bypassing XRv13 entirely. This short series of tests proves that LFA is very similar between IPv4 and IPv6.
! CSR2
interface GigabitEthernet2.513
 ospfv3 1 ipv6 cost 2
RP/0/0/CPU0:XRv14#show ospfv3 routes ::4:4:4:4/128 backup-path
Topology Table for OSPFv3 1 with ID 14.14.14.14
* ::4:4:4:4/128, Intra, cost 2/0, area 0
    GigabitEthernet0/0/0/0.554, fe80::5, path-id 1
      Backup path:
        fe80::2, from 4.4.4.4, via GigabitEthernet0/0/0/0.524, protected bitmap 0x3
        Attribues: Metric: 3, Node Protect, SRLG Disjoint
    GigabitEthernet0/0/0/0.543, fe80::13, path-id 2
      Backup path:
        fe80::2, from 4.4.4.4, via GigabitEthernet0/0/0/0.524, protected bitmap 0x3
        Attribues: Metric: 3, Node Protect,
CSR2#show ipv6 route ::4:4:4:4
Routing entry for ::4:4:4:4/128
  Known via "ospf 1", distance 110, metric 2, type intra area
  Backup from "isis 1 [115]"
  Route count is 1/1, share count 0
  Routing paths:
    FE80::1, GigabitEthernet2.512
      Last updated 00:24:30 ago
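The ECMP effect on node protection follows directly from the LFA inequalities in RFC 5286, which use a strict "less than". The sketch below checks them for S = XRv14, N = CSR2, E = XRv13, and D = CSR4. The distances 2 (CSR2 to CSR4) and 2 (XRv14 to CSR4) come from the outputs above; the remaining per-link costs of 1 are assumptions chosen to be consistent with those outputs, and the function names are illustrative.

```python
def is_loop_free(d_n_d, d_n_s, d_s_d):
    # RFC 5286 inequality 1: N's shortest path to D does not return through S.
    return d_n_d < d_n_s + d_s_d


def is_node_protecting(d_n_d, d_n_e, d_e_d):
    # RFC 5286 inequality 3: N's shortest path to D avoids the primary next hop E.
    # The comparison is strict, so an equal-cost (ECMP) path through E fails it.
    return d_n_d < d_n_e + d_e_d


D_N_D = 2   # CSR2 -> CSR4 (metric 2 in CSR2's route output)
D_S_D = 2   # XRv14 -> CSR4 (cost 2/0 in XRv14's route output)
D_N_S = 1   # CSR2 -> XRv14, assumed link cost
D_E_D = 1   # XRv13 -> CSR4, assumed link cost

print(is_loop_free(D_N_D, D_N_S, D_S_D))               # CSR2 is a valid LFA
print(is_node_protecting(D_N_D, d_n_e=1, d_e_d=D_E_D))  # before: ECMP via XRv13
print(is_node_protecting(D_N_D, d_n_e=2, d_e_d=D_E_D))  # after: cost raised to 2
```

Under these assumptions the node-protecting check fails while CSR2 still has an equal-cost path through XRv13 and passes once the LAN interface cost is raised to 2.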

37.2.1.2 Remote LFA
Remote LFA does not appear to be supported for OSPFv3 (XR 5.3.0) at this time. This chapter is left here as a placeholder for completeness since I expect the feature to be supported soon.
37.2.2 IS-IS
37.2.2.1 Direct LFA
We will use XRv13 to reach CSR5's loopback for this demonstration. The diagram is shown again below for reference.


Enabling IS-IS LFA for IPv6 is identical to IPv4 and is done on a per-interface, per-AFI basis as shown below. Initially we see the ECMP paths backing each other up. Both CSR4 and XRv14 are downstream (closer to CSR5), node protecting, primary (ECMP) paths with equal metrics, and SRLG disjoint.
! XRv13
router isis 1
 interface GigabitEthernet0/0/0/0.513
  address-family ipv6 unicast
   fast-reroute per-prefix
 interface GigabitEthernet0/0/0/0.534
  address-family ipv6 unicast
   fast-reroute per-prefix
 interface GigabitEthernet0/0/0/0.543
  address-family ipv6 unicast
   fast-reroute per-prefix
RP/0/0/CPU0:XRv13#show isis ipv6 fast-reroute ::5:5:5:5/128 detail
L2 ::5:5:5:5/128 [20/115] medium priority
     via fe80::14, GigabitEthernet0/0/0/0.543, XRv14, Weight: 0
       FRR backup via fe80::4, GigabitEthernet0/0/0/0.534, CSR4, Weight: 0
       P: Yes, TM: 20, LC: No, NP: Yes, D: Yes, SRLG: Yes
     via fe80::4, GigabitEthernet0/0/0/0.534, CSR4, Weight: 0
       FRR backup via fe80::14, GigabitEthernet0/0/0/0.543, XRv14, Weight: 0
       P: Yes, TM: 20, LC: No, NP: Yes, D: Yes, SRLG: Yes
     src CSR5.00-00, ::5:5:5:5


We will increase XRv13's metric to XRv14 to 15 so that XRv13 prefers the path through CSR4 to ::5:5:5:5/128. We see the increase in total-metric (TM) to 25, and the path is no longer ECMP (primary, indicated by the “P” flag). We have introduced a strict primary/backup relationship between the CSR4/XRv14 paths after this metric adjustment.
! XRv13
router isis 1
 interface GigabitEthernet0/0/0/0.543
  address-family ipv6 unicast
   metric 15
RP/0/0/CPU0:XRv13#show isis ipv6 fast-reroute ::5:5:5:5/128 detail
L2 ::5:5:5:5/128 [20/115] medium priority
     via fe80::4, GigabitEthernet0/0/0/0.534, CSR4, Weight: 0
       FRR backup via fe80::14, GigabitEthernet0/0/0/0.543, XRv14, Weight: 0
       P: No, TM: 25, LC: No, NP: Yes, D: Yes, SRLG: Yes
     src CSR5.00-00, ::5:5:5:5
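The TM and P values can be reproduced with simple arithmetic. The sketch below assumes each neighbor is 10 away from CSR5 and that XRv13's link metrics start at 10, which is consistent with the total metric of 20 shown earlier but is not confirmed by the text; the function name is illustrative.

```python
def classify_paths(link_metric, neighbor_distance):
    """Return (total metric per neighbor, sorted list of primary/ECMP neighbors)."""
    totals = {n: link_metric[n] + neighbor_distance[n] for n in link_metric}
    best = min(totals.values())
    primaries = sorted(n for n, total in totals.items() if total == best)
    return totals, primaries


# Assumed distance from each neighbor to CSR5's loopback.
neighbor_distance = {"CSR4": 10, "XRv14": 10}

before = classify_paths({"CSR4": 10, "XRv14": 10}, neighbor_distance)
after = classify_paths({"CSR4": 10, "XRv14": 15}, neighbor_distance)

print(before)  # both totals are 20, so both paths are primary (P: Yes)
print(after)   # XRv14 totals 25 and becomes backup only (P: No, TM: 25)
```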

Next, we remove that metric adjustment and instead apply an SRLG to XRv13's links to CSR4 and XRv14. Notice that the paths are no longer flagged as SRLG disjoint, yet they are still used as backup paths for one another. This is because XR does not consider SRLGs by default; the two paths back each other up just as they did in the very first test.
! XRv13
srlg
 interface GigabitEthernet0/0/0/0.534
  8 value 333
 interface GigabitEthernet0/0/0/0.543
  8 value 333
router isis 1
 interface GigabitEthernet0/0/0/0.543
  address-family ipv6 unicast
   no metric
RP/0/0/CPU0:XRv13#show isis ipv6 fast-reroute ::5:5:5:5/128 detail
L2 ::5:5:5:5/128 [20/115] medium priority
     via fe80::14, GigabitEthernet0/0/0/0.543, XRv14, Weight: 0
       FRR backup via fe80::4, GigabitEthernet0/0/0/0.534, CSR4, Weight: 0
       P: Yes, TM: 20, LC: No, NP: Yes, D: Yes, SRLG: No
     via fe80::4, GigabitEthernet0/0/0/0.534, CSR4, Weight: 0
       FRR backup via fe80::14, GigabitEthernet0/0/0/0.543, XRv14, Weight: 0
       P: Yes, TM: 20, LC: No, NP: Yes, D: Yes, SRLG: No
     src CSR5.00-00, ::5:5:5:5


Next, we make SRLG the #1 tiebreaker. Now, XRv13 will prefer not to back up those ECMP paths with one another.
! XRv13
router isis 1
 address-family ipv6 unicast
  fast-reroute per-prefix tiebreaker srlg-disjoint index 1

XRv13 uses the paths through XRv14 and CSR4 for ECMP but backs them both up via CSR1. CSR1 is not downstream to CSR5 as XRv13 is closer, but when used to back up the XRv14 path only, node protection is achieved. Overall, it is a less desirable LFA, but since CSR4 and XRv14 “fate-share”, it makes sense to introduce a third link for better resiliency.
RP/0/0/CPU0:XRv13#show isis ipv6 fast-reroute ::5:5:5:5/128 detail
L2 ::5:5:5:5/128 [20/115] medium priority
     via fe80::14, GigabitEthernet0/0/0/0.543, XRv14, Weight: 0
       FRR backup via fe80::1, GigabitEthernet0/0/0/0.513, CSR1, Weight: 0
       P: No, TM: 30, LC: No, NP: Yes, D: No, SRLG: Yes
     via fe80::4, GigabitEthernet0/0/0/0.534, CSR4, Weight: 0
       FRR backup via fe80::1, GigabitEthernet0/0/0/0.513, CSR1, Weight: 0
       P: No, TM: 30, LC: No, NP: No, D: No, SRLG: Yes
     src CSR5.00-00, ::5:5:5:5

The path from XRv13 to CSR2 is another candidate path but provides node protection for CSR4, not XRv14. It is functionally equivalent to the CSR1 path. By shutting down CSR1's LAN interface, the router selects CSR2. When CSR1 comes back online, XRv13 still prefers CSR2, which appears similar to the eBGP tie-breaker of selecting the oldest route when all other tiebreakers are equal. Notice that node protection is achieved when backing up CSR4 but not XRv14. This is the opposite of the problem we saw above; one LFA protects both primary paths when the second LFA is just as good.
! CSR1
interface GigabitEthernet2.513
 shutdown
RP/0/0/CPU0:XRv13#show isis ipv6 fast-reroute ::5:5:5:5/128 detail
L2 ::5:5:5:5/128 [20/115] medium priority
     via fe80::14, GigabitEthernet0/0/0/0.543, XRv14, Weight: 0
       FRR backup via fe80::2, GigabitEthernet0/0/0/0.513, CSR2, Weight: 0
       P: No, TM: 30, LC: No, NP: No, D: No, SRLG: Yes
     via fe80::4, GigabitEthernet0/0/0/0.534, CSR4, Weight: 0
       FRR backup via fe80::2, GigabitEthernet0/0/0/0.513, CSR2, Weight: 0
       P: No, TM: 30, LC: No, NP: Yes, D: No, SRLG: Yes
     src CSR5.00-00, ::5:5:5:5


It would be ideal if we could use CSR1 to back up the path through XRv14 and CSR2 to back up the path through CSR4. This way, we achieve node protection for the primary paths using two node-protecting LFAs for maximal redundancy. We can do this by telling XR to evaluate the node-protect (NP) attribute, which it was not doing by default.
! XRv13
router isis 1
 address-family ipv6 unicast
  fast-reroute per-prefix tiebreaker node-protecting index 5

We can clearly see that CSR1 backs up XRv14 and CSR2 backs up CSR4 as desired. After the SRLG attribute is evaluated, node protection comes next, which allows XRv13 to select LFAs more intelligently. Tuning these tie-breakers must be carefully considered based on the network architecture since adjusting the selection process could have undesirable effects. In this case, we tuned the process to achieve a highly desirable result.
RP/0/0/CPU0:XRv13#show isis ipv6 fast-reroute ::5:5:5:5/128 detail
L2 ::5:5:5:5/128 [20/115] medium priority
     via fe80::14, GigabitEthernet0/0/0/0.543, XRv14, Weight: 0
       FRR backup via fe80::1, GigabitEthernet0/0/0/0.513, CSR1, Weight: 0
       P: No, TM: 30, LC: No, NP: Yes, D: No, SRLG: Yes
     via fe80::4, GigabitEthernet0/0/0/0.534, CSR4, Weight: 0
       FRR backup via fe80::2, GigabitEthernet0/0/0/0.513, CSR2, Weight: 0
       P: No, TM: 30, LC: No, NP: Yes, D: No, SRLG: Yes
     src CSR5.00-00, ::5:5:5:5
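The ordered tie-breaker evaluation can be sketched as successive filtering. This is a simplified model and not XR's actual selection code: the class, the function, and the rule that a tie-breaker is skipped when no candidate satisfies it are assumptions. The SRLG and NP values per candidate are the ones reported in the outputs of this section.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    name: str
    srlg_disjoint: bool
    node_protecting: bool


def select_backups(candidates, tiebreakers):
    """Apply (index, attribute) tie-breakers in ascending index order.

    Returns the names of the candidates still tied after all tie-breakers.
    """
    remaining = list(candidates)
    for _, attribute in sorted(tiebreakers):
        preferred = [c for c in remaining if getattr(c, attribute)]
        if preferred:  # narrow the field only if someone satisfies the rule
            remaining = preferred
    return [c.name for c in remaining]


tiebreakers = [(1, "srlg_disjoint"), (5, "node_protecting")]

# Candidates to back up the primary path via XRv14.
for_xrv14 = [Candidate("CSR4", False, True),
             Candidate("CSR1", True, True),
             Candidate("CSR2", True, False)]
# Candidates to back up the primary path via CSR4.
for_csr4 = [Candidate("XRv14", False, True),
            Candidate("CSR1", True, False),
            Candidate("CSR2", True, True)]

print(select_backups(for_xrv14, tiebreakers))  # ['CSR1']
print(select_backups(for_csr4, tiebreakers))   # ['CSR2']
```

With only the SRLG tie-breaker configured, the model leaves CSR1 and CSR2 tied for both primary paths, which is the situation the node-protecting tie-breaker resolves.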

37.2.2.2 Remote LFA
Remote LFA on XR using LDPv6 is probably beyond the scope of the blueprint given that it is supported in XR 5.3.0 whereas the test uses XR 5.2.0. For completeness, we will examine it briefly; LDPv6 is also introduced in XR 5.3.0 and is not discussed in detail. It works identically to LDP for IPv4 except that it allocates labels to IPv6 prefixes instead of IPv4 prefixes. OSPFv3 does not support rLFA at this time, so only IS-IS is tested. Basic rLFA concepts like P-space, extended P-space, Q-space, and PQ-node are not discussed again as they are identical between all rLFA variations (OSPF, IS-IS, IPv4, IPv6). We will use a square topology of XRv routers for this test. The network diagram is below. IS-IS level 2 is enabled everywhere, and all transit links have only link-local addressing. The OSPFv3 and LDP RIDs are hardcoded to X.X.X.X where X is the router number. The IPv6 loopbacks are ::X:X:X:X/128 and there is no IPv4 configuration anywhere.


We enable rLFA under each interface within IS-IS on XRv11 so that it can tunnel traffic to XRv14 to protect its connected links to XRv12 and XRv13. The rLFA configuration is staged under all of the other routers, but it needs to be activated with the base command "fast-reroute per-prefix" in XR.
! XRv11
router isis 1
 interface GigabitEthernet0/0/0/0.512
  address-family ipv6 unicast
   fast-reroute per-prefix
   fast-reroute per-prefix remote-lfa tunnel mpls-ldp
   fast-reroute per-prefix remote-lfa maximum-metric 20
 interface GigabitEthernet0/0/0/0.513
  address-family ipv6 unicast
   fast-reroute per-prefix
   fast-reroute per-prefix remote-lfa tunnel mpls-ldp
   fast-reroute per-prefix remote-lfa maximum-metric 20

The basic LDPv6 configuration is shown below but is not discussed in detail. The RID is set manually, as is the discovery transport-address; IPv4 is not running, so the TCP session is IPv6-based as well. I also disable LDPv4 globally since it doesn’t need to be enabled. Last, I enable LDPv6 on the proper interfaces. Similar configurations exist for all other XR routers in the network.
! XRv11
mpls ldp
 default-vrf implicit-ipv4 disable
 router-id 11.11.11.11
 address-family ipv6
  discovery targeted-hello accept
  discovery transport-address ::11:11:11:11
 interface GigabitEthernet0/0/0/0.512
  address-family ipv6
 interface GigabitEthernet0/0/0/0.513
  address-family ipv6

The output below shows us that the route to XRv12's loopback is protected via an rLFA tunnel to XRv14. This will show even if LDPv6 is broken (that is, the targeted session fails).
RP/0/0/CPU0:XRv11#show isis ipv6 fast-reroute detail ::12:12:12:12/128
L2 ::12:12:12:12/128 [10/115] medium priority
     via fe80::12, GigabitEthernet0/0/0/0.512, XRv12, Weight: 0
       Remote FRR backup via XRv14 [::14:14:14:14], via fe80::13, GigabitEthernet0/0/0/0.513 XRv13, Weight: 0
       P: No, TM: 20, LC: No, NP: No, D: No, SRLG: Yes
     src XRv12.00-00, ::12:12:12:12

Next, we check that XRv11 has formed the targeted LDP session with XRv14, over which it learns XRv14's label for ::12:12:12:12/128. The discovery command references XRv14 by its RID (which looks like an IPv4 address), while the neighbor command uses the IPv6 address; the neighbor output shows the proper address association.
RP/0/0/CPU0:XRv11#show mpls ldp ipv6 discovery 14.14.14.14
Local LDP Identifier: 11.11.11.11:0
Discovery Sources:
  Targeted Hellos:
    ::11:11:11:11 -> ::14:14:14:14 (active), xmit/recv
      LDP Id: 14.14.14.14:0
          Hold time: 90 sec (local:90 sec, peer:90 sec)
RP/0/0/CPU0:XRv11#show mpls ldp neighbor ::14:14:14:14
Peer LDP Identifier: 14.14.14.14:0
  TCP connection: ::14:14:14:14:60551 - ::11:11:11:11:646
  Graceful Restart: No
  Session Holdtime: 180 sec
  State: Oper; Msgs sent/rcvd: 10/11; Downstream-Unsolicited
  Up time: 00:03:14
  LDP Discovery Sources:
    IPv4: (0)
    IPv6: (1)
      Targeted Hello (::11:11:11:11 -> ::14:14:14:14, active)
  Addresses bound to this peer:
    IPv4: (0)
    IPv6: (1)
      ::14:14:14:14

Lastly, we check the LDPv6 LIB to ensure we learn XRv14's label for ::12:12:12:12/128, which is 94011. This is the inner label of the rLFA tunnel; the outer label will be XRv13's label for ::14:14:14:14/128, which is 93002. After IGP convergence, XRv11 will use XRv13's label for ::12:12:12:12/128 directly, which is 93016, and the rLFA tunnel is torn down since there is no more PQ-node. The rLFA label stack would be {93002 94011} assuming rLFA was in use during the IGP convergence period. This is the same logic we evaluated with rLFA earlier for both OSPFv2 and IPv4 IS-IS.
RP/0/0/CPU0:XRv11#show mpls ldp ipv6 bindings ::12:12:12:12/128 neighbor 14.14.14.14:0
::12:12:12:12/128, rev 8
        Local binding: label: 91004
        Remote bindings: (2 peers)
            Peer                Label
            ------------------------
            14.14.14.14:0       94011
RP/0/0/CPU0:XRv11#show mpls ldp ipv6 bindings ::14:14:14:14/128 neighbor 13.13.13.13:0
::14:14:14:14/128, rev 6
        Local binding: label: 91001
        Remote bindings: (3 peers)
            Peer                Label
            ------------------------
            13.13.13.13:0       93002
RP/0/0/CPU0:XRv11#show mpls ldp ipv6 bindings ::12:12:12:12/128 neighbor 13.13.13.13:0
::12:12:12:12/128, rev 8
        Local binding: label: 91004
        Remote bindings: (3 peers)
            Peer                Label
            ------------------------
            13.13.13.13:0       93016
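The way the two labels combine can be shown with a small lookup. This sketch only mirrors the reasoning in the text; the function and the dictionary layout are illustrative, and the label values are the ones from the bindings output.

```python
def rlfa_label_stack(bindings, release_peer, pq_peer, pq_loopback, destination):
    """Return [outer, inner] for an rLFA tunnel through a PQ-node."""
    outer = bindings[(pq_loopback, release_peer)]  # next hop's label to reach the PQ-node
    inner = bindings[(destination, pq_peer)]       # PQ-node's own label for the destination
    return [outer, inner]


# (prefix, advertising LDP peer) -> remote label, as learned by XRv11
bindings = {
    ("::12:12:12:12/128", "14.14.14.14:0"): 94011,
    ("::14:14:14:14/128", "13.13.13.13:0"): 93002,
    ("::12:12:12:12/128", "13.13.13.13:0"): 93016,
}

stack = rlfa_label_stack(
    bindings,
    release_peer="13.13.13.13:0",   # XRv13, the backup next hop
    pq_peer="14.14.14.14:0",        # XRv14, the PQ-node
    pq_loopback="::14:14:14:14/128",
    destination="::12:12:12:12/128",
)
print(stack)  # [93002, 94011]
```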

As a final check, we validate the CEF entry and ensure the label binding is correct. I don't know why the second label is not shown in the label stack for the backup path, but without BFD we can't actually test it. CEF imposes the topmost label and I am assuming the rLFA process, behind the scenes, is imposing label 94011 before CEF sees it.
RP/0/0/CPU0:XRv11#show cef ipv6 ::12:12:12:12/128
::12:12:12:12/128, version 520, internal 0x1000001 0x0 (ptr 0xa13b8874) [1], 0x0 (0xa13842d8), 0xa28 (0xa173512c)
 local adjacency fe80::12
 Prefix Len 128, traffic index 0, precedence n/a, priority 3
   via fe80::12, GigabitEthernet0/0/0/0.512, 5 dependencies, weight 0, class 0, protected [flags 0x400]
    path-idx 0 bkup-idx 1 NHID 0x0 [0xa180d384 0xa180d29c], parent-ifh 0x200
    next hop fe80::12
    local label 91004      labels imposed {ImplNull}
   via fe80::13, GigabitEthernet0/0/0/0.513, 5 dependencies, weight 0, class 0, backup [flags 0x300]
    path-idx 1 NHID 0x0 [0xa0c7f4a4 0x0]
    next hop fe80::13
    local adjacency
    local label 91004      labels imposed {93002}

Additional Reading – Reference configurations "rlfa-ipv6"
37.3 Convergence optimizations for BGP
BGP has many special timers that can tune its behavior. This lab explores many of them within the context of IPv4 and IPv6 AFIs. The ordinary “neighbor timers” are evaluated in a specific section dedicated to individual protocol hellos and are not discussed here. The lab topology is similar to that seen in the inter-AS multicast section, except that the BGP IPv4/v6 multicast AFIs have been disabled. Other multicast features remain enabled but are not the focus of this test, and are thus ignored. The diagram is shown again for reference. There are 4 different ASes which have many external and internal BGP peerings for additional testing. Each AS has its own IPv4/v6 IGP for intra-AS reachability. Unlike the inter-AS multicast lab, I enabled BGP IPv4/v6 unicast on CSR1 through CSR4; these were test hosts in the multicast lab but are now used as BGP routers for additional testing. XRv1, CSR10, and CSR8 are route reflectors for both AFIs in their respective ASes, while AS11 has a full mesh of iBGP peers.


The first timer is the advertisement interval. This timer determines the rate at which advertisements are sent to a peer; the default for eBGP is 30 seconds while iBGP is 0 seconds (no delay). This allows for updates to be grouped together and advertised in chunks rather than independently. This grouping mechanism is loosely analogous to the OSPF LSA-group pacing mechanism for LSA age refreshing. When there are multiple changes in an AS a few seconds apart, they are propagated immediately within the AS but delayed externally so that a more efficient update can be assembled. Since BGP was designed primarily for very high scalability, this batching can slow convergence. Before making configuration adjustments, we can confirm the default timers on XE and XR for eBGP and iBGP peers. A timer of 0 seconds, the iBGP default, implies no delay in propagating updates.
R6#show bgp ipv4 unicast neighbors | include ^BGP|advertisement
BGP neighbor is 9.0.0.10, remote AS 9, internal link
  Default minimum time between advertisement runs is 0 seconds
BGP neighbor is 10.6.9.9, remote AS 11, external link
  Default minimum time between advertisement runs is 30 seconds
BGP neighbor is 10.6.12.12, remote AS 8, external link
  Default minimum time between advertisement runs is 30 seconds
RP/0/0/CPU0:XRv4#show bgp ipv4 unicast neighbors | utility egrep '^BGP|advertisement'
BGP neighbor is 7.0.0.11
  Minimum time between advertisement runs is 0 secs
BGP neighbor is 10.9.14.9
  Minimum time between advertisement runs is 30 secs
  Policy for incoming advertisements is PASS
  Policy for outgoing advertisements is PASS
BGP neighbor is 10.12.14.12
  Minimum time between advertisement runs is 30 secs
  Policy for incoming advertisements is PASS
  Policy for outgoing advertisements is PASS

We can confirm this behavior by creating two changes in quick succession within AS 9 and observing the result. When a new network is added to IPv4 BGP on CSR2, it is advertised immediately to iBGP peers, and then to eBGP peers. The iBGP advertisement interval expires immediately, so when a second network is added, it is also advertised to iBGP peers immediately. The eBGP advertisement interval begins counting down from 30, and until it expires, updates will not be sent towards a given peer. To test it, I enable BGP update debugging for IPv4 on CSR2 and CSR6. Adding the first network to CSR2, we see that CSR2 advertises it to its iBGP peers immediately, and when CSR6 receives it from the iBGP RR (CSR10), it advertises it to its eBGP peers immediately. Outbound updates are shown in green with inbound updates shown in yellow. The clocks on CSR2 and CSR6 are not synchronized, but it does not matter since we are only comparing time differences on each local router. The difference for both is 1 ms, which is effectively zero.
! CSR2
interface Loopback1
 ip address 9.2.2.2 255.255.255.255
router bgp 9
 address-family ipv4
  network 9.2.2.2 mask 255.255.255.255
! CSR2
01:57:19.681: BGP: topo global:IPv4 Unicast:base Remove_fwdroute for 9.2.2.2/32
01:57:19.681: BGP(0): redistributedlocal route 9.2.2.2/32 modified
01:57:19.682: BGP(0): 9.0.0.10 NEXT_HOP is set to self for net 9.2.2.2/32,
01:57:19.682: BGP(0): (base) 9.0.0.10 send UPDATE (format) 9.2.2.2/32, next 9.0.0.2, metric 0, path Local
! CSR6
07:41:19.505: BGP(0): 9.0.0.10 rcvd UPDATE w/ attr: nexthop 9.0.0.2, origin i, localpref 100, metric 0, originator 9.0.0.2, clusterlist 9.0.0.10
07:41:19.505: BGP(0): 9.0.0.10 rcvd 9.2.2.2/32
07:41:19.505: BGP(0): Revise route installing 1 of 1 routes for 9.2.2.2/32 -> 9.0.0.2(global) to main IP table
07:41:19.506: BGP(0): (base) 10.6.12.12 send UPDATE (format) 9.2.2.2/32, next 10.6.12.6, metric 0, path Local

Quickly adding a second network into IPv4 BGP on CSR2 (within a few seconds), we notice that CSR2 advertises it to its iBGP peers immediately, but CSR6 waits. The first eBGP update was sent at 07:41:19, which means the next update should not be sent before 07:41:49; it might be a few seconds after, but never before. CSR6 receives the update immediately but cannot advertise it until 07:41:49, which is when the advertisement-interval for eBGP peers will expire. The timestamp of the eBGP update for the second network proves this.
! CSR2
interface Loopback1
 ip address 9.2.2.22 255.255.255.255 secondary
router bgp 9
 address-family ipv4
  network 9.2.2.22 mask 255.255.255.255
! CSR2
01:57:31.579: BGP: topo global:IPv4 Unicast:base Remove_fwdroute for 9.2.2.22/32
01:57:31.579: BGP(0): redistributedlocal route 9.2.2.22/32 modified
01:57:31.579: BGP(0): 9.0.0.10 NEXT_HOP is set to self for net 9.2.2.22/32,
01:57:31.579: BGP(0): (base) 9.0.0.10 send UPDATE (format) 9.2.2.22/32, next 9.0.0.2, metric 0, path Local
! CSR6
07:41:31.402: BGP(0): 9.0.0.10 rcvd UPDATE w/ attr: nexthop 9.0.0.2, origin i, localpref 100, metric 0, originator 9.0.0.2, clusterlist 9.0.0.10
07:41:31.402: BGP(0): 9.0.0.10 rcvd 9.2.2.22/32
07:41:31.402: BGP(0): Revise route installing 1 of 1 routes for 9.2.2.22/32 -> 9.0.0.2(global) to main IP table
! CSR6, about 18 seconds later
07:41:49.827: BGP(0): (base) 10.6.12.12 send UPDATE (format) 9.2.2.22/32, next 10.6.12.6, metric 0, path Local

The construction of these more efficient updates is handled by the BGP update-group feature. Updates are assembled once and then replicated to all peers that have the same outbound policies. These outbound policies include route filters, advertisement intervals, etc. Internal and external peers are always in different update groups regardless of the outbound policies. We can display an update group and all of its members by specifying one of its neighbors. In this case, both XRv2 and CSR9 are in the same update group, which is why only one update was created above.
R6#show bgp ipv4 unicast update-group 10.6.12.12
BGP version 4 update-group 6, external, Address Family: IPv4 Unicast
  BGP Update version : 275/0, messages 0, active RGs: 1
  Topology: global, highest version: 275, tail marker: 275
  Format state: Current working (OK, last minimum advertisement interval)
                Refresh blocked (not in list, last not in list)
  Update messages formatted 98, replicated 171, current 0, refresh 0, limit 1000
  Number of NLRIs in the update sent: max 3, min 0
  Minimum time between advertisement runs is 30 seconds
  Has 2 members:
   10.6.12.12       10.6.9.9

Adjusting the advertisement-interval for one of the peers will break the update group efficiency. To make things interesting, I will increase the advertisement-interval (slow the convergence) towards CSR9. We now see two update groups, one for each eBGP peer, because the advertisement intervals differ. While an interval is counting down, the output also shows the time remaining, as demonstrated below.
! CSR6
router bgp 9
 address-family ipv4
  neighbor 10.6.9.9 advertisement-interval 45
R6#show bgp ipv4 unicast update-group
BGP version 4 update-group 6, external, Address Family: IPv4 Unicast
  BGP Update version : 277/0, messages 0, active RGs: 1
  Topology: global, highest version: 277, tail marker: 277
  Format state: Current blocked (minimum advertisement interval, last minimum advertisement interval)
                Refresh blocked (not in list, last not in list)
  Update messages formatted 100, replicated 175, current 0, refresh 0, limit 1000
  Number of NLRIs in the update sent: max 3, min 0
  Minimum time between advertisement runs is 30 seconds (expires in 1 seconds)
  Has 1 member:
   10.6.12.12
BGP version 4 update-group 7, external, Address Family: IPv4 Unicast
  BGP Update version : 277/0, messages 0, active RGs: 1
  Topology: global, highest version: 277, tail marker: 277
  Format state: Current blocked (minimum advertisement interval, last minimum advertisement interval)
                Refresh blocked (not in list, last not in list)
  Update messages formatted 11, replicated 11, current 0, refresh 0, limit 1000
  Number of NLRIs in the update sent: max 3, min 0
  Minimum time between advertisement runs is 45 seconds (expires in 43 seconds)
  Has 1 member:
   10.6.9.9
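The update-group split can be modeled as grouping peers by a key built from their outbound characteristics. This is a conceptual sketch only: the real key contains many more attributes, and the dictionary fields and the `None` policy value are assumptions made for illustration.

```python
from collections import defaultdict


def build_update_groups(peers):
    """Group peer addresses by (session type, outbound policy, advertisement interval)."""
    groups = defaultdict(list)
    for peer in peers:
        key = (peer["type"], peer["out_policy"], peer["adv_interval"])
        groups[key].append(peer["addr"])
    return dict(groups)


peers = [
    {"addr": "9.0.0.10", "type": "internal", "out_policy": None, "adv_interval": 0},
    {"addr": "10.6.12.12", "type": "external", "out_policy": None, "adv_interval": 30},
    {"addr": "10.6.9.9", "type": "external", "out_policy": None, "adv_interval": 30},
]

groups_before = build_update_groups(peers)

# Equivalent of "neighbor 10.6.9.9 advertisement-interval 45"
peers[2]["adv_interval"] = 45
groups_after = build_update_groups(peers)

print(len(groups_before), len(groups_after))  # 2 3
```

Before the change both eBGP peers share one key, so one formatted update can be replicated to both; afterwards each peer needs its own.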

Quickly retesting the advertisement using the test networks on CSR2, we enable debugging on CSR6 only. Since we didn’t change the iBGP settings, we won’t confirm how advertisement works over iBGP again. We see that CSR6 receives the update for 9.2.2.2/32 at time 08:00:22 and the update for 9.2.2.22/32 at time 08:00:28, about 6 seconds later. 9.2.2.2/32 was immediately advertised to both eBGP peers in two separate updates (thanks to the two separate update-groups) and their advertisement-intervals begin counting down.
! CSR6
08:00:22.850: BGP(0): 9.0.0.10 rcvd UPDATE w/ attr: nexthop 9.0.0.2, origin i, localpref 100, metric 0, originator 9.0.0.2, clusterlist 9.0.0.10
08:00:22.850: BGP(0): 9.0.0.10 rcvd 9.2.2.2/32
08:00:22.850: BGP(0): Revise route installing 1 of 1 routes for 9.2.2.2/32 -> 9.0.0.2(global) to main IP table
08:00:22.852: BGP(0): (base) 10.6.12.12 send UPDATE (format) 9.2.2.2/32, next 10.6.12.6, metric 0, path Local
08:00:22.852: BGP(0): (base) 10.6.9.9 send UPDATE (format) 9.2.2.2/32, next 10.6.9.6, metric 0, path Local
08:00:28.719: BGP(0): 9.0.0.10 rcvd UPDATE w/ attr: nexthop 9.0.0.2, origin i, localpref 100, metric 0, originator 9.0.0.2, clusterlist 9.0.0.10
08:00:28.719: BGP(0): 9.0.0.10 rcvd 9.2.2.22/32
08:00:28.719: BGP(0): Revise route installing 1 of 1 routes for 9.2.2.22/32 -> 9.0.0.2(global) to main IP table


30 seconds after the time the first update is sent to XRv2, the update for 9.2.2.22/32 is also sent; this is time 08:00:52. 45 seconds after that initial time (or 15 seconds after the update was sent to XRv2), the update for 9.2.2.22/32 is sent to CSR9 since we slowed down the advertisement-interval. This later time is 08:01:07. This proves that the advertisement-interval works as expected.
! CSR6
08:00:52.733: BGP(0): (base) 10.6.12.12 send UPDATE (format) 9.2.2.22/32, next 10.6.12.6, metric 0, path Local
08:01:07.071: BGP(0): (base) 10.6.9.9 send UPDATE (format) 9.2.2.22/32, next 10.6.9.6, metric 0, path Local
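The timing seen in the debugs can be approximated with a simple model of the advertisement interval. This is a simplified sketch, not the IOS implementation: it ignores jitter and timer granularity (the real timestamps are off by a fraction of a second), uses whole seconds relative to the first received update, and the function name is an assumption.

```python
def advertise(arrivals, interval):
    """Return [(send_time, [routes])] for (ready_time, route) pairs toward one peer."""
    updates = []
    next_allowed = float("-inf")  # no interval is running before the first update
    for ready, route in sorted(arrivals):
        if updates and ready <= updates[-1][0]:
            # An update is already pending at or after this time: batch into it.
            updates[-1][1].append(route)
            continue
        send_at = max(ready, next_allowed)
        updates.append((send_at, [route]))
        next_allowed = send_at + interval  # the interval restarts when an update is sent
    return updates


# Routes become ready on CSR6 at 08:00:22 (t=0) and 08:00:28 (t=6).
arrivals = [(0, "9.2.2.2/32"), (6, "9.2.2.22/32")]

print(advertise(arrivals, 30))  # toward XRv2: second update at t=30 (about 08:00:52)
print(advertise(arrivals, 45))  # toward CSR9: second update at t=45 (about 08:01:07)
print(advertise(arrivals, 0))   # iBGP default: both sent as soon as they are ready
```

Any further routes that become ready while an update is pending are batched into it, which is the efficiency the interval is meant to provide.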

The feature works identically in XR, and for variety we will perform our configurations for IPv6 BGP. Currently, XRv2 and CSR9 are in the same update-group for this AFI due to their identical outbound filtering RPLs and matching advertisement-intervals.
RP/0/0/CPU0:XRv4#show bgp ipv6 unicast update-group neighbor fd00:10:9:14::9
Update group for IPv6 Unicast, index 0.3:
  Attributes:
    Outbound policy: PASS
    First neighbor AS: 11
    Directly connected IPv6 EBGP
    4-byte AS capable
    Non-labeled address-family capable
    Minimum advertisement interval: 30 secs
  Update group desynchronized: 0
  Sub-groups merged: 8
  Number of refresh subgroups: 0
  Messages formatted: 201, replicated: 292
  All neighbors are assigned to sub-group(s)
    Neighbors in sub-group: 0.1, Filter-Groups num:1
     Neighbors in filter-group: 0.4(RT num: 0)
      fd00:10:9:14::9           fd00:10:12:14::12

Modifying an advertisement interval will break this. In this case, I reduce the interval to 15.5 seconds on the peer to XRv2 (15 seconds plus 500 ms). XR adds greater capability by permitting millisecond granularity as well. However, the configuration is per-neighbor, not per-neighbor per-AFI. Even though we only test IPv6, this adjustment affects all AFIs negotiated to a given neighbor. Checking the update groups, we can see the update group has been split as the advertisement intervals differ (30 vs. 15.5 seconds). For brevity (and because XR’s BGP update debugging is not very clear) we will not test this in real-time, but the effect is identical to XE.
! XRv4
router bgp 7
 neighbor fd00:10:12:14::12
  advertisement-interval 15 500


RP/0/0/CPU0:XRv4#show bgp ipv6 unicast update-group neighbor fd00:10:9:14::9
Update group for IPv6 Unicast, index 0.3:
  Attributes:
    Outbound policy: PASS
    First neighbor AS: 11
    Directly connected IPv6 EBGP
    4-byte AS capable
    Non-labeled address-family capable
    Minimum advertisement interval: 30 secs
  Update group desynchronized: 0
  Sub-groups merged: 8
  Number of refresh subgroups: 0
  Messages formatted: 201, replicated: 292
  All neighbors are assigned to sub-group(s)
    Neighbors in sub-group: 0.1, Filter-Groups num:1
     Neighbors in filter-group: 0.4(RT num: 0)
      fd00:10:9:14::9
RP/0/0/CPU0:XRv4#show bgp ipv6 unicast update-group neighbor fd00:10:12:14::12
Update group for IPv6 Unicast, index 0.1:
  Attributes:
    Outbound policy: PASS
    First neighbor AS: 8
    Directly connected IPv6 EBGP
    4-byte AS capable
    Non-labeled address-family capable
    Minimum advertisement interval: 15.500 secs
  Update group desynchronized: 0
  Sub-groups merged: 0
  Number of refresh subgroups: 0
  Messages formatted: 11, replicated: 11
  All neighbors are assigned to sub-group(s)
    Neighbors in sub-group: 0.3, Filter-Groups num:1
     Neighbors in filter-group: 0.1(RT num: 0)
      fd00:10:12:14::12

The next timer is the BGP scan-time, which controls how often BGP “walks the RIB” to verify IGP reachability for BGP next-hops. This occurs every 60 seconds by default, and in order to see it in action, we have to disable a smarter feature known as next-hop tracking (NHT), which is discussed later. Since the scan-time is timer-based and can only be tuned coarsely, it is not a good choice for event-driven or high-availability architectures. This is another per-AFI feature; below I show the defaults for IPv4/IPv6 on XE and XR.
R10#show bgp ipv4 unicast summary | include scan
BGP activity 93/55 prefixes, 406/352 paths, scan interval 60 secs
R10#show bgp ipv6 unicast summary | include scan
BGP activity 93/55 prefixes, 406/352 paths, scan interval 60 secs
RP/0/0/CPU0:XRv1#show bgp ipv4 unicast | include scan
BGP generic scan interval 60 secs
BGP scan interval 60 secs
RP/0/0/CPU0:XRv1#show bgp ipv6 unicast | include scan
BGP generic scan interval 60 secs
BGP scan interval 60 secs

To disable NHT and adjust the scan timer concurrently for brevity, I issue the commands below. The scanner has been set to 3 times its normal speed so that we can detect next-hop changes in 20 seconds. NHT cannot be disabled on XR, but we will increase the next-hop trigger delays to larger numbers than the scan time which effectively disables it. NHT is discussed in greater detail later.
! CSR10
router bgp 9
 address-family ipv4
  no bgp nexthop trigger enable
  bgp scan-time 20
! XRv1
router bgp 7
 address-family ipv6 unicast
  bgp scan-time 20
  nexthop trigger-delay critical 30000
  nexthop trigger-delay non-critical 30000

Before testing this with any prefixes, we can watch the scanner in action by debugging BGP events. On XE, I enable the debugging for IPv4 unicast yet the debug shows all of the scanners. Notice that all of the scanners start at once, but then the IPv4 scanner runs twice more 20 seconds apart, while the others are still waiting for their next cycle of 60 seconds.
R10#debug bgp ipv4 unicast events
02:54:32.322: BGP: tbl IPv4 Unicast:base Performing BGP Nexthop scanning for general scan
02:54:32.322: BGP(0): Future scanner version: 5972, current scanner version: 5971
02:54:32.322: BGP: tbl IPv6 Unicast:base Performing BGP Nexthop scanning for general scan
02:54:32.322: BGP(1): Future scanner version: 5958, current scanner version: 5957
02:54:32.322: BGP: tbl L2VPN E-VPN:base Performing BGP Nexthop scanning for general scan
02:54:32.322: BGP(10): Future scanner version: 5958, current scanner version: 5957
02:54:32.322: BGP: tbl MVPNv4 Unicast:base Performing BGP Nexthop scanning for general scan
02:54:32.322: BGP(15): Future scanner version: 5958, current scanner version: 5957
02:54:32.322: BGP: tbl MVPNv6 Unicast:base Performing BGP Nexthop scanning for general scan
02:54:32.322: BGP(16): Future scanner version: 5958, current scanner version: 5957
02:54:52.325: BGP: Regular scanner timer event
02:54:52.325: BGP: tbl IPv4 Unicast:base Performing BGP Nexthop scanning for nhop scan
02:54:52.325: BGP(0): Future scanner version: 5973, current scanner version: 5972
02:55:12.324: BGP: Regular scanner timer event
02:55:12.324: BGP: tbl IPv4 Unicast:base Performing BGP Nexthop scanning for nhop scan
02:55:12.324: BGP(0): Future scanner version: 5974, current scanner version: 5973

The output is similar on XR. We can see the scans running 20 seconds apart.
RP/0/0/CPU0:XRv1#debug bgp event afi ipv6 unicast
02:58:02.990 : bgp[1052]: [default-event] (ip6u): Scanning IPv6 Unicast routing table
02:58:02.990 : bgp[1052]: [default-event] (ip6u): Garbage collection for table TBL:default (2/1) prefix tree - Target version 260
02:58:02.990 : bgp[1052]: [default-event] (ip6u): Prefix count 18, Segment count 1, Segment size 100000
02:58:02.990 : bgp[1052]: [default-event] (ip6u): Finished scanning IPv6 Unicast routing table
02:58:23.009 : bgp[1052]: [default-event] (ip6u): Scanning IPv6 Unicast routing table
[snip]
02:58:43.027 : bgp[1052]: [default-event] (ip6u): Scanning IPv6 Unicast routing table
[snip]
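Because the scanner is purely timer-driven, the delay between an IGP event and BGP noticing it depends on where the event falls within the scan cycle. The sketch below computes the next scan time under the assumption of a perfectly periodic timer (the debugs show a few milliseconds of drift); the function name and the example event times are hypothetical.

```python
import math


def next_scan(event_time, first_scan, interval):
    """Return the time of the first periodic scan at or after event_time (seconds)."""
    if event_time <= first_scan:
        return first_scan
    cycles = math.ceil((event_time - first_scan) / interval)
    return first_scan + cycles * interval


# Scans at t=0, 20, 40, ... (scan-time 20). A next-hop fails at t=25.
detected = next_scan(25, first_scan=0, interval=20)
print(detected, detected - 25)  # 40 15

# With the default scan-time of 60, a failure just after a scan waits almost a full cycle.
print(next_scan(1, first_scan=0, interval=60) - 1)  # 59
```

The worst case approaches the full scan interval, which is why a timer-based scanner suits event-driven designs poorly.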

To test it, I will reduce the scan timer even further on CSR2 to 10 seconds. This creates a log message warning the user about potential CPU issues as the scanner will scan all prefixes in the BGP table, even non-best-paths. NHT is also disabled on CSR2 which allows the scanner to operate properly. Then, I shut down CSR6’s loopback. After about 40 seconds, this will also kill the BGP session to CSR10, but we see the work of the scanner with respect to the next-hop reachability of all eBGP routes learned from nexthop 9.0.0.6. When the loopback gets shut down, IGP converges quickly and for a few seconds, CSR2 still thinks the BGP routes have valid next-hops. After the scanner runs, those next-hops become inaccessible and the BGP routes are no longer candidates for the best-path.

! CSR2
R2(config-router-af)#bgp scan-time 10
%BGP-5-AGGRESSIVE_SCAN_TIME: bgp scan-time configuration less than 15 seconds can cause high cpu usage by BGP Scanner.

Before CSR6’s loopback is shut down, the “baseline” network show commands are below. 9.0.0.6/32 is reachable and the BGP prefixes that rely on it are candidates for the best-path selection algorithm.

R2#show ip route 9.0.0.6
Routing entry for 9.0.0.6/32
  Known via "ospfv3 9", distance 110, metric 3, type intra area
  Last update from 9.2.5.5 on GigabitEthernet2.525, 00:00:44 ago

  Routing Descriptor Blocks:
  * 9.2.5.5, from 9.0.0.6, 00:00:44 ago, via GigabitEthernet2.525
      Route metric is 3, traffic share count is 1
R2#show bgp ipv4 unicast 8.0.0.3/32
BGP routing table entry for 8.0.0.3/32, version 248
Paths: (1 available, best #1, table default)
  Not advertised to any peer
  Refresh Epoch 2
  8
    9.0.0.6 (metric 3) from 9.0.0.10 (9.0.0.10)
      Origin incomplete, metric 0, localpref 100, valid, internal, best
      Originator: 9.0.0.6, Cluster list: 9.0.0.10
      rx pathid: 0, tx pathid: 0x0

IGP has already withdrawn reachability to 9.0.0.6/32 but BGP is not yet aware. This is the purpose of the scanner; it detects these IGP changes and updates the information about the BGP next-hop, such as the accessibility, metric, etc.

R2#show ip route 9.0.0.6
% Subnet not in table
R2#show bgp ipv4 unicast 8.0.0.3/32
BGP routing table entry for 8.0.0.3/32, version 248
Paths: (1 available, best #1, table default)
  Not advertised to any peer
  Refresh Epoch 2
  8
    9.0.0.6 (metric 3) from 9.0.0.10 (9.0.0.10)
      Origin incomplete, metric 0, localpref 100, valid, internal, best
      Originator: 9.0.0.6, Cluster list: 9.0.0.10
      rx pathid: 0, tx pathid: 0x0

When the scanner runs, it notices that 9.0.0.6 is inaccessible, and marks the BGP prefixes that use this next-hop as such. The debug doesn’t show us the exact scanner activity; it only reveals that the scanner ran according to schedule.

! CSR2
03:16:13.780: BGP: Regular scanner timer event
03:16:13.780: BGP: tbl IPv4 Unicast:base Performing BGP Nexthop scanning for nhop scan
03:16:13.780: BGP(0): Future scanner version: 1609, current scanner version: 1608
03:16:15.278: BGP: aggregate timer expired

R2#show ip route 9.0.0.6
% Subnet not in table

R2#show bgp ipv4 unicast 8.0.0.3/32
BGP routing table entry for 8.0.0.3/32, version 261
Paths: (1 available, no best path)
  Not advertised to any peer
  Refresh Epoch 2
  8
    9.0.0.6 (inaccessible) from 9.0.0.10 (9.0.0.10)
      Origin incomplete, metric 0, localpref 100, valid, internal
      Originator: 9.0.0.6, Cluster list: 9.0.0.10
      rx pathid: 0, tx pathid: 0

The BGP scanner is similar to RIP for unicast routing. Being timer-based, it does not work well in large deployments where routing changes occur frequently, nor does it facilitate an end-to-end HA design. The NHT feature was introduced as an event-driven mechanism; the BGP “RIB watcher” process maintains a list of all BGP next-hops. This process will report back to BGP when one of two events happens: the next-hop becomes unreachable or the IGP metric to the next-hop changes. The first event is more important and, in most cases, should be reported more quickly; the default delay is 5 seconds in XE. This value should be tuned to slightly longer than it takes IGP to converge because notifying BGP of a next-hop failure in a timely manner is not valuable unless there is an alternative path. Alerting BGP before IGP has a chance to calculate the backup path means that routes will be marked inaccessible and may be withdrawn entirely. Waiting a second or so is a less disruptive option than immediate withdrawal. The NHT delay value can be set as low as 0 seconds, which implies no delay, and should only be used in cases where some sort of IP fast-reroute is in play. This may include ECMP, OSPF LFA, ISIS LFA, or even basic EIGRP feasible successor designs.

For the demonstration, I configure the feature on CSR9 and XRv2. Notice that only XR allows us to differentiate between “critical” events, such as prefix reachability, and “non-critical” events, such as IGP metric changes. The defaults for these two event types are 0 and 3 seconds, respectively, so XR assumes you have an immediately-converging IGP by default. XR measures these timers in milliseconds while XE measures them in seconds.

! CSR9
router bgp 11
 address-family ipv6
  bgp nexthop trigger delay 2
! XRv2
router bgp 8
 address-family ipv4 unicast
  nexthop trigger-delay critical 2000
  nexthop trigger-delay non-critical 4000
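To make the timer-based versus event-driven difference concrete, here is a minimal Python sketch of how long BGP waits before reacting to a next-hop failure under each approach. It is a simplified model, not router behavior: it assumes the scanner fires at exact multiples of the scan interval, and the function names are hypothetical.

```python
import math

def scanner_detection_delay(failure_time, scan_time):
    """Seconds between an IGP failure and the next periodic scan.

    Assumption: scans fire at t = 0, scan_time, 2 * scan_time, ...
    """
    next_scan = math.ceil(failure_time / scan_time) * scan_time
    return next_scan - failure_time

def nht_detection_delay(trigger_delay):
    """Event-driven NHT: BGP reacts trigger_delay seconds after the RIB event."""
    return trigger_delay

# A failure one second after a scan waits almost a full cycle for the scanner.
print(scanner_detection_delay(failure_time=61, scan_time=60))  # 59
# With NHT, the wait is the configured delay regardless of when the failure hits.
print(nht_detection_delay(trigger_delay=5))                    # 5
```

The scanner's reaction time depends on where in the cycle the failure lands, while the NHT delay is a fixed, tunable value.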

Before testing this, we can also specify a route-map/RPL to determine which routes can qualify as nexthops. Although not specifically related to the delay adjustments, this is useful for avoiding aggregate routes. For example, we may want to adjust certain iBGP nodes to always prefer prefixes of length 30 or longer for next-hops. This would cover transit links redistributed into IGP or next-hop-self on ASBRs using their loopbacks. It ensures that the route recursion doesn’t select a default route, for example, then claim the next-hop is reachable. All next-hops would be reachable in this case, which is undesirable. For IPv6, next-hops should be length /64 or longer, which means that the route is likely not an aggregate.

! CSR9
ipv6 prefix-list PL_64_AND_LONGER seq 5 permit ::/0 ge 64
route-map RM_BGP_NHOP_V6 permit 10
 match ipv6 address prefix-list PL_64_AND_LONGER
router bgp 11
 address-family ipv6
  bgp nexthop route-map RM_BGP_NHOP_V6
! XRv2
route-policy RPL_BGP_NHOP
  if destination in (0.0.0.0/0 ge 30) then
    pass
  endif
end-policy
router bgp 8
 address-family ipv4 unicast
  nexthop route-policy RPL_BGP_NHOP

Applying this to XRv2, we immediately see a problem. In most real deployments, private transit links would be /30 or longer (or possibly unnumbered, but that is unrelated to this). In this test lab, they are /24. This means that all of the eBGP routes learned on XRv2 will have /24 routes for their next-hops. Below is the BGP table where we can clearly see none of the eBGP routes have best-paths due to nexthop inaccessibility. The iBGP routes, however, are reachable since the next-hop is the advertising iBGP peer’s loopback address.

RP/0/0/CPU0:XRv2#show bgp ipv4 unicast | begin Network
   Network            Next Hop            Metric LocPrf Weight Path
*  7.0.0.1/32         10.6.12.6                              0 9 7 ?
*                     10.12.14.14                            0 7 ?
*  7.0.0.7/32         10.6.12.6                              0 9 7 ?
*                     10.12.14.14                            0 7 ?
*  7.0.0.11/32        10.6.12.6                              0 9 7 i
*                     10.12.14.14                            0 7 i
*  7.0.0.14/32        10.6.12.6                              0 9 7 i
*                     10.12.14.14              0             0 7 i
*  7.1.7.0/24         10.6.12.6                              0 9 7 i
*                     10.12.14.14                            0 7 i
*>i8.0.0.3/32         8.0.0.3                  0    100      0 ?
*>i8.0.0.8/32         8.0.0.8                  0    100      0 ?
*> 8.0.0.12/32        0.0.0.0                  0         32768 i
*>i8.3.8.0/24         8.0.0.8                  0    100      0 i

RP/0/0/CPU0:XRv2#show bgp ipv4 unicast 7.0.0.1/32
BGP routing table entry for 7.0.0.1/32
Versions:
  Process           bRIB/RIB  SendTblVer
  Speaker                188         188
Paths: (2 available, no best path)
  Not advertised to any peer
  Path #1: Received by speaker 0
  Not advertised to any peer
  9 7
    10.6.12.6 (inaccessible) from 10.6.12.6 (9.0.0.6)
      Origin incomplete, localpref 100, valid, external
      Received Path ID 0, Local Path ID 0, version 0
      Origin-AS validity: not-found
  Path #2: Received by speaker 0
  Not advertised to any peer
  7
    10.12.14.14 (inaccessible) from 10.12.14.14 (7.0.0.14)
      Origin incomplete, localpref 100, valid, external
      Received Path ID 0, Local Path ID 0, version 0
      Origin-AS validity: not-found
RP/0/0/CPU0:XRv2#show route 10.6.12.6
Routing entry for 10.6.12.0/24
  Known via "connected", distance 0, metric 0 (connected)
  Routing Descriptor Blocks
    directly connected, via GigabitEthernet0/0/0/0.562
      Route metric is 0
  No advertising protos.
RP/0/0/CPU0:XRv2#show route 10.12.14.14
Routing entry for 10.12.14.0/24
  Known via "connected", distance 0, metric 0 (connected)
  Routing Descriptor Blocks
    directly connected, via GigabitEthernet0/0/0/0.524
      Route metric is 0
  No advertising protos.

As an adjustment for this misconfiguration, I add an exception for 10-networks that are exactly /24 in length. This still follows the spirit of the “no aggregates for next-hops” policy since we know the 10.x.x.x/24 routes are not aggregates. The XR RPL structure has an “eq” option so that we don’t have to use “ge 24 le 24” as we would in an XE prefix-list. The BGP routes are immediately valid again since their BGP next-hops are reachable.

! XRv2
route-policy RPL_BGP_NHOP
  if destination in (0.0.0.0/0 ge 30) or destination in (10.0.0.0/8 eq 24) then
    pass
  endif
end-policy

RP/0/0/CPU0:XRv2#show bgp ipv4 unicast | begin Network
   Network            Next Hop            Metric LocPrf Weight Path
*  7.0.0.1/32         10.6.12.6                              0 9 7 ?
*>                    10.12.14.14                            0 7 ?
*  7.0.0.7/32         10.6.12.6                              0 9 7 ?
*>                    10.12.14.14                            0 7 ?
[snip]
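The effect of the next-hop policy can be sketched in a few lines of Python using the standard ipaddress module. This is an illustrative model only: the function names are hypothetical, the demo RIB (including its default route) is invented for the example, and it assumes that only the longest-match route is tested against the policy.

```python
import ipaddress

# (covering network, minimum length, maximum length), mirroring the final
# RPL_BGP_NHOP on XRv2: 0.0.0.0/0 ge 30  or  10.0.0.0/8 eq 24
POLICY = [
    (ipaddress.ip_network("0.0.0.0/0"), 30, 32),
    (ipaddress.ip_network("10.0.0.0/8"), 24, 24),
]

def route_permitted(route, policy=POLICY):
    """True if the resolving route matches any policy entry."""
    return any(
        route.subnet_of(net) and lo <= route.prefixlen <= hi
        for net, lo, hi in policy
    )

def resolve_next_hop(next_hop, rib, policy=POLICY):
    """Longest-prefix match for next_hop; None when no route qualifies."""
    addr = ipaddress.ip_address(next_hop)
    candidates = [route for route in rib if addr in route]
    if not candidates:
        return None
    best = max(candidates, key=lambda route: route.prefixlen)
    return best if route_permitted(best, policy) else None

# Hypothetical RIB: a default route, one /24 transit link, one /32 loopback.
rib = [ipaddress.ip_network(p) for p in ("0.0.0.0/0", "10.6.12.0/24", "8.0.0.3/32")]
strict = [POLICY[0]]  # the original "ge 30" rule only

print(resolve_next_hop("10.6.12.6", rib, strict))  # None
print(resolve_next_hop("10.6.12.6", rib))          # 10.6.12.0/24
print(resolve_next_hop("192.0.2.1", rib))          # None
```

With only the "ge 30" rule, the /24 transit link fails the policy and the next-hop is treated as inaccessible; adding the "eq 24" exception restores it, while a next-hop that only matches the default route remains unresolved.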

We can also view all of the BGP next-hops. XE prints a simple table showing the next-hop addresses and number of BGP prefixes bound to that next-hop which also includes non best-paths. This is specific to the BGP table, not the routing table.

R9#show bgp ipv6 unicast nexthops
 # Paths   Nexthop Address
      14   FD00:10:9:14::14 (FE80::14)
      14   FD00:10:6:9::6 (FE80::6)
       1   200B::4
       2   200B::D

The show command in XR is much more involved. It includes many pieces of statistical information with respect to next-hop processing and path changes. It also shows us the configured NHT values for both critical and non-critical events, which serves as a good verification of the configuration. The status codes clearly show which next-hops are reachable, connected versus IGP learned, and other detailed status information. Like CSR9, we see a few connected next-hops for eBGP routes and a few IGP-learned next-hops for iBGP routes.

RP/0/0/CPU0:XRv2#show bgp ipv4 unicast nexthops
Total Nexthop Processing
  Time Spent: 0.000 secs

Maximum Nexthop Processing
  Received: 3d04h
  Bestpaths Deleted: 0
  Bestpaths Changed: 16
  Time Spent: 0.000 secs

Last Notification Processing
  Received: 3d04h
  Time Spent: 0.000 secs

Gateway Address Family: IPv4 Unicast
Table ID: 0xe0000000
Nexthop Count: 5
Critical Trigger Delay: 2000msec
Non-critical Trigger Delay: 4000msec

Nexthop Version: 1, RIB version: 1

Status codes: R/UR Reachable/Unreachable
              C/NC Connected/Not-connected
              L/NL Local/Non-local
              PR   Pending Registration
              I    Invalid (Policy drop)

Next Hop        Status          Metric  Notf  LastRIBEvent   RefCount
8.0.0.3         [R][NC][NL]          2   0/0  2d01h (Reg)         1/3
8.0.0.8         [R][NC][NL]          1   1/0  4d08h (Cri)         2/5
10.6.12.6       [R][C][NL]           0   1/0  4d08h (Cri)       16/19
10.12.14.14     [R][C][NL]           0   1/0  4d08h (Cri)       16/19

We can drill into a single next-hop as well by specifying the next-hop explicitly. This reveals additional information about the next-hop such as the routing source (RIP), BGP prefix reference counts, and other low-level debugging information.

RP/0/0/CPU0:XRv2#show bgp ipv4 unicast nexthops 8.0.0.3
Nexthop: 8.0.0.3
 VRF: default
 Nexthop ID: 0x600007b, Version: 0x0
 Nexthop Flags: 0x00000000, Gateway Flags: 0x00000080
 Nexthop Handle: 0x10c6f890, Gateway Handle: 0x10799720
 RIB Related Information
   Gateway: reachable, non-Connected route, prefix length 32
   Resolving Route: 8.0.0.3/32 (rip)
   Paths: 0
   RIB Nexhop ID: 0x0
   Status: [Reachable][Not Connected][Not Local]
   Metric: 2
   Registration: Synchronous, Completed: 2d01h
   Events: Critical (0)/Non-critical (0)
   Last Received: 2d01h (Registration)
   Last gw update: (Crit-sync) 2d01h(rib)
   Reference Count: 1
 Prefix Related Information
   Active Tables: [IPv4 Unicast]
   Metrices: [0x2]
   Reference Counts: [1]
 Interface Handle: 0x0

To test NHT, I will shut down CSR8’s loopback. XRv2 learns this via IGP and it is the next-hop for one prefix, 8.0.0.3/32. This is a RIB failure, but it doesn’t matter for the purpose of this test. Because RIP converges slowly, it can take up to 180 seconds for this route to be invalidated, and the NHT delay adds 2 more seconds after that. Once the route is withdrawn (debug reveals this), a critical next-hop change occurs. BGP marks the next-hop as inaccessible and removes the BGP prefixes associated with this next-hop. We can see a very large metric, but most importantly, a critical event occurred a few seconds before. This “critical event” was the removal of the route. If we issue the command a few seconds later, the next-hop information for 8.0.0.3/32 is removed entirely.

RP/0/0/CPU0:XRv2#debug routing ipv4
ipv4_rib[1144]: RIB Routing: Vrf: "default", Tbl: "default" IPv4 Unicast, Delete active route to 8.0.0.3 via 8.8.12.8 interface GigabitEthernet0/0/0/0.582, metric [120/2], label None, by client-id 14

RP/0/0/CPU0:XRv2#show bgp ipv4 unicast nexthops 8.0.0.3
Nexthop: 8.0.0.3
 VRF: default
 Nexthop ID: 0x600007e, Version: 0x0
 Nexthop Flags: 0x00000000, Gateway Flags: 0x00000080
 Nexthop Handle: 0x10c6f890, Gateway Handle: 0x10799698
 RIB Related Information
   Gateway: unreachable, non-Connected route, prefix length 512
   Resolving Route: 8.0.0.3/32 (rip)
   Paths: 0
   RIB Nexhop ID: 0x0
   Status: [Unreachable]
   Metric: 4294967295
   Registration: Synchronous, Completed: 00:00:22
   Events: Critical (1)/Non-critical (0)
   Last Received: 00:00:02 (Critical)
   Last gw update: (Crit-notif) 00:00:02(rib)
   Reference Count: 0
 Prefix Related Information
   Active Tables: [IPv4 Unicast]
   Metrices: [0xffffffff]
   Reference Counts: [0]
 Interface Handle: 0x0

RP/0/0/CPU0:XRv2#show bgp ipv4 unicast nexthops 8.0.0.3
% Nexthop information not found

On CSR9, we will actually test the timers in greater detail. I will trigger a non-critical event by changing the bandwidth on CSR4’s loopback interface. EIGRP will recompute a new metric and BGP NHT will notice this. I enable BGP event debugging (only available under the IPv4 parser tree but affects all AFIs) and IPv6 routing table debugs. When EIGRP adds the new (worse) metric, the BGP “RIB watcher” immediately notifies BGP of the change. The prefix 200B::4/128 is still reachable, so this would be a “non-critical” next-hop change event, and the penalty information applies only to dampening (not configured). Exactly 2 seconds later (the NHT delay, visible in the jump from 03:20:42.213 to 03:20:44.213), BGP begins walking its prefixes for updates as it would have normally done via the scanner in older versions. In between those 2 seconds, in a real network, IGP should have converged. In summary, BGP intentionally waits out the NHT delay after the RIB watcher’s notification in order to give IGP time to converge after a topology change event. Routes that are not directly affected by the next-hop change are still reassessed, as seen below.

! CSR9
debug bgp ipv4 unicast events
debug ipv6 routing
03:20:42.212: [EIGRP-IPv6]IPv6RT[default]: eigrp 11, Route add 200B::4/128 [worse metric 7178700, 16000]
03:20:42.212: [EIGRP-IPv6]IPv6RT[default]: eigrp 11, Update path FE80::13/GigabitEthernet2.593 Flags : 0 : 0, tag 0, metric 7178700
03:20:42.213: BGP: bgp_rwatch_notify: BGP_RWATCH_APPL_NHOP
03:20:42.213: EvD: charge penalty 500, new accum. penalty 500, flap count 11
03:20:42.213: BGP: bgp_rwatch_notify: BGP_RWATCH_APPL_NBR
03:20:42.213: BGP: nbr global 200B::4 bgp_process_bnbr_notification reachable
03:20:44.213: [BGP Router]IPv6RT[default]: bgp 11, Update 2007::1/128 [20/0], 1 paths tag 0
03:20:44.213: [BGP Router]IPv6RT[default]: bgp 11, Updating route 2007::1/128 [20/0], [20/0] 20
03:20:44.213: [BGP Router]IPv6RT[default]: bgp 11, Update path FE80::14/GigabitEthernet2.594 Flags : 4 : 4, tag 0, metric 0
[snip]

Another global BGP timer is the update-delay, which generally applies only to BGP during initial startup. This timer starts counting when the first BGP peer is formed. When the timer expires, the router will begin the BGP best-path selection process for all routes and be allowed to advertise best-paths to peers. The idea is to operate in “read-only” mode for a short time when BGP starts, collecting all prefixes from all peers before running best-path initially, which ultimately determines which routes a BGP speaker advertises to others. The default timer is 120 seconds and this is a global BGP setting, not per-AFI. We will test a faster timer on CSR9. This timer is different from the rest because it represents a maximum time; we will configure a timer of 30 seconds so that CSR9 could begin advertising updates in less than 30 seconds if all prefixes are received. Note that this feature is only valuable when a BGP speaker has multiple peers; with only one peer, there is never a reason to wait once the routes have been received.

! CSR9
router bgp 11
 bgp update-delay 30

R9#debug bgp ipv4 unicast events
BGP events debugging is on

R9#debug bgp ipv4 unicast update
BGP updates debugging is on for address family: IPv4 Unicast

We enable debugging and then “clear bgp ipv4 unicast” to restart the entire process. The peer with CSR4 comes up first at 23:21:54. Because this is an iBGP peer, CSR4 immediately advertises its BGP routes to CSR9. For brevity, I do not show all of the other eBGP peer formations and received updates, but this is the “read-only” phase. CSR9 will receive all of these updates from all peers, and having a reasonable (not zero) update-delay allows BGP time to collect this information.

! CSR9
23:21:54.579: BGP(0): 11.0.0.4 was the first peer to be established for IPv4 Unicast
23:21:54.579: BGP: nopeerup-delay post-boot, set to default, 60s
23:21:54.579: %BGP-5-ADJCHANGE: neighbor 11.0.0.4 Up
23:21:54.580: BGP: nbr_topo global 11.0.0.4 IPv4 Unicast:base (0x7F6994DBB708:1) rcvd Refresh Start-of-RIB
23:21:54.580: BGP: nbr_topo global 11.0.0.4 IPv4 Unicast:base (0x7F6994DBB708:1) refresh_epoch is 2
23:21:54.580: BGP(0): 11.0.0.4 rcvd UPDATE w/ attr: nexthop 11.0.0.4, origin ?, localpref 100, metric 0
23:21:54.580: BGP(0): 11.0.0.4 rcvd 11.0.0.4/32
23:21:54.580: BGP: nbr_topo global 11.0.0.4 IPv4 Unicast:base (0x7F6994DBB708:1) rcvd Refresh End-of-RIB

As soon as the first peer is established, the local router “needs to delay RO exit” (that is, it must remain in read-only mode), so the BGP update generation check exits without sending anything. This message is printed continuously, once per second and once per AFI, until RO mode ends. Mixed in with this reminder are the syslog messages that the other BGP peers have come up, but this is less significant since the update-delay begins counting when the first peer comes up, which explains the debug output above. Those peers were discussed above (neighbor forms, updates received, etc.) but not shown.

! CSR9
23:21:54.969: BGP: tbl IPv4 Unicast:base Generate update check
23:21:54.969: BGP: tbl IPv4 Unicast:base AF needs to delay RO exit

After the update-delay expires, about 30 seconds in this case, the IPv4 unicast AFI becomes read-write and the BGP best-path selection process begins. I omit some of the prefixes for brevity but all BGP prefixes are evaluated for their best-paths at this time. The debug also makes this timer easy to troubleshoot.

! CSR9
23:22:24.670: BGP(base): waited 30s[update delay] for all peers
23:22:24.670: BGP: tbl IPv4 Unicast:base Computed bestpaths, table version went from 1 to 20
23:22:24.670: BGP(0): Revise route installing 1 of 1 routes for 7.0.0.1/32 -> 10.9.14.14(global) to main IP table

23:22:24.670: BGP(0): Revise route installing 1 of 1 routes for 7.0.0.7/32 -> 10.9.14.14(global) to main IP table
[snip]
23:22:24.670: BGP: tbl IPv4 Unicast:base AF is now RW
23:22:24.671: BGP: tbl IPv4 Unicast:base IMP Initial import complete.
23:22:24.671: BGP: notified IGPs about convergence

Immediately after best-path runs, neighbor updates can be sent provided the advertisement-interval allows for it.

! CSR9
23:22:24.674: BGP(0): 11.0.0.13 NEXT_HOP is set to self for net 7.0.0.1/32,
23:22:24.674: BGP(0): (base) 11.0.0.13 send UPDATE (format) 7.0.0.1/32, next 11.0.0.9, metric 0, path 7

Although we do not test it due to lack of revealing XR debugs, we configure the BGP update-delay timer on XRv3 for completeness. Because the update-delay is a maximum timer by default in XR, we can optionally enforce this delay as a minimum timer by using the “always” keyword. This forces the specified delay regardless of whether or not BGP receives all prefixes from all peers in a timely manner.

! XRv3
router bgp 11
 bgp update-delay 30 always
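The "maximum timer" versus "always" distinction can be summarized in a small Python sketch. This models only the behavior as described in the text above, not the actual router implementation; the function and parameter names are hypothetical, and times are plain seconds.

```python
def read_only_exit_time(first_peer_up, update_delay, all_peers_done=None, always=False):
    """Time at which BGP leaves read-only mode and runs best-path.

    first_peer_up  -- time the first peer reached Established
    update_delay   -- configured update-delay in seconds
    all_peers_done -- time the last peer finished sending its initial
                      routes, or None if that never happens (assumption:
                      the router can tell when this point is reached)
    always         -- models the XR "always" keyword (delay is enforced)
    """
    deadline = first_peer_up + update_delay
    if always or all_peers_done is None:
        return deadline
    # Default: the delay is only an upper bound on the read-only phase.
    return min(deadline, all_peers_done)

print(read_only_exit_time(0, 30, all_peers_done=12))               # 12
print(read_only_exit_time(0, 30, all_peers_done=12, always=True))  # 30
print(read_only_exit_time(0, 30, all_peers_done=45))               # 30
```

In the first case the router may leave read-only mode early, in the second the "always" keyword holds it for the full 30 seconds, and in the third the timer expires before a slow peer finishes.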

Another minor BGP timer is the nopeerup-delay, which prevents peer formation after some “hard” event. We can delay BGP peers from forming during four different disruptive events:

1. Cold-boot: When the router first boots up. I assume this occurs when the router was powered off or was forcefully reset (not “reload” command).
2. NSF-switchover: After an NSF switchover only.
3. Post-boot: When the system is already booted and all peers go down. This includes the “reload” command.
4. User-initiated: When a BGP peer is manually cleared by the administrator.

The default value for all of these fields is zero according to the documentation and 60 seconds according to the debug (I trust the debug). The feature appears to be supported only on XE platforms. This network is not NSF-capable since that feature doesn’t apply to virtual routers, so we can test the other three timers. I configure all timers on CSR6 since it has a combination of internal and external peers.

! CSR6
router bgp 9
 bgp nopeerup-delay cold-boot 240
 bgp nopeerup-delay post-boot 300
 bgp nopeerup-delay nsf-switchover 120
 bgp nopeerup-delay user-initiated 30

First, I reload CSR6 and enable BGP debugging as soon as the console is available. This will allow us to see the nopeerup-delay post-boot timer of 300 seconds configured above. Based on this test, we see that “post-boot” relates to administrator-initiated reloads as this is the timer revealed in the debugs. However, this feature does not appear to work. All BGP peers form immediately despite the debug saying that the backoff should be 5 minutes. The first log message was printed at 02:32:30 and the first BGP peer came up ~38 seconds later, which indicates unexpected behavior. This may be an artifact of the CSR1000v operating in a virtual environment.

! CSR6
debug bgp ipv4 unicast
debug bgp ipv4 unicast events
02:32:30.646: %VUDI-6-EVENT: [serial number: 92V72A6SE1F], [vUDI: ], vUDI is successfully retrieved from license file
02:33:08.388: BGP: ses global 10.6.12.12 (0x7F2109416840:1) Up
02:33:08.388: BGP: nopeerup-delay post-boot [config:300s/default:60s] set to 300s
02:33:08.388: %BGP-5-ADJCHANGE: neighbor 10.6.12.12 Up

The user-initiated timer also does not appear to work. After a peer is cleared manually, it comes back up almost immediately despite the 30-second delay configured. Although the configuration guide makes reference to this feature typically being used with NSF, it is not listed as a prerequisite. While this test was not able to demonstrate the feature functioning properly, we can at least verify it was configured properly by checking the debugs that show the newly configured values (as observed above).

R6#clear bgp ipv4 unicast 10.6.9.9
02:37:46.231: %BGP-3-NOTIFICATION: sent to neighbor 10.6.9.9 6/4 (Administrative Reset) 0 bytes
02:37:46.231: %BGP-5-ADJCHANGE: neighbor 10.6.9.9 Down User reset
02:37:46.231: %BGP_SESSION-5-ADJCHANGE: neighbor 10.6.9.9 IPv4 Unicast topology base removed from session User reset
02:37:46.387: %BGP-5-ADJCHANGE: neighbor 10.6.9.9 Up

The last BGP feature we will test is the slow-peer feature. A slow-peer is a BGP peer that cannot keep up with the routing updates sent to it. For example, peering a large carrier router with a small branch router, then trying to send it many updates very quickly may not work. Since this slow-peer might be in the same update-group as other peers operating correctly, the sending router would be slowing all updates down for all peers in the update-group. The number of updates to be serviced would queue, and convergence is slowed significantly. There are 3 ways to deal with slow-peers:

1. Slow-peer detection: The router simply issues a syslog message to notify the administrator of a dynamically-detected slow-peer but takes no action. The update-groups remain intact as they were before the discovery was made; this is just a warning mechanism.
2. Slow-peer dynamic protection: The router detects the slow peer and creates a “slow-peer update-group” for this peer, and any other slow peers with identical outbound attributes. Peers can remain in this slow group indefinitely or temporarily until they stop being slow. This feature relies on the slow-peer detection mechanism as well.
3. Slow-peer static configuration: This is slow-peer protection but is statically configured. This option may be the result of using slow-peer detection and manually taking action after an internal network policy review, for example. This is also a mechanism to simply force a BGP peer into a different update-group without having to adjust the advertisement-interval.

We will configure all 3 options on XE only as the feature does not appear to be supported on XR. On CSR6, we will enable slow-peer detection for IPv4 unicast (all peers) except XRv2. For IPv6 unicast, we enable it for the eBGP peer CSR9 only. The detection is enabled per-AFI and has a threshold value associated with it. This measures how “slow” a “slow-peer” is and is 5 minutes by default; it is the difference between the timestamp of the oldest update in the peer’s update queue and the current time. If an update was added to a peer’s update queue at 12:00 and it is still in the queue at 12:05, the peer is considered slow. Smaller values imply a stricter definition of “slow-peer”, so I reduce it to the minimum value of 2 minutes. I explicitly disable slow-peer detection toward XRv2 for IPv4 and enable it only toward CSR9 for IPv6.

! CSR6
router bgp 9
 address-family ipv4
  bgp slow-peer detection threshold 120
  neighbor 10.6.12.12 slow-peer detection disable
 address-family ipv6
  neighbor FD00:10:6:9::9 slow-peer detection threshold 120
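The detection threshold described above boils down to a single age comparison. The following is a minimal Python sketch with a hypothetical function name; it assumes the comparison is "age greater than or equal to threshold", which matches the 12:00 versus 12:05 example in the text but is not confirmed against the router implementation.

```python
from datetime import datetime, timedelta

def is_slow_peer(oldest_queued, now, threshold_seconds=300):
    """True if the oldest update has waited in the peer's queue for at
    least threshold_seconds (assumed >= comparison; default 300 s)."""
    return now - oldest_queued >= timedelta(seconds=threshold_seconds)

queued = datetime(2016, 1, 1, 12, 0)

# Default 5-minute threshold: slow at 12:05, not yet slow at 12:03.
print(is_slow_peer(queued, datetime(2016, 1, 1, 12, 5)))   # True
print(is_slow_peer(queued, datetime(2016, 1, 1, 12, 3)))   # False

# With the 120-second threshold configured on CSR6, 12:03 already qualifies.
print(is_slow_peer(queued, datetime(2016, 1, 1, 12, 3), threshold_seconds=120))  # True
```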

To verify the configuration, we can check the BGP neighbor details per-peer, per-AFI. For IPv4, we see that slow-peer detection is enabled for both CSR10 and CSR9, but not XRv2. For IPv6, only the peer to CSR9 is running the feature.

R6#show bgp ipv4 unicast neighbors | include ^BGP|detection
BGP neighbor is 9.0.0.10, remote AS 9, internal link
 Slow-peer detection is enabled, threshold value is 120
BGP neighbor is 10.6.9.9, remote AS 11, external link
 Slow-peer detection is enabled, threshold value is 120
BGP neighbor is 10.6.12.12, remote AS 8, external link
 Slow-peer detection is disabled
R6#show bgp ipv6 unicast neighbors | include ^BGP|detection

BGP neighbor is 2009::A, remote AS 9, internal link
 Slow-peer detection is disabled
BGP neighbor is FD00:10:6:9::9, remote AS 11, external link
 Slow-peer detection is enabled, threshold value is 120
BGP neighbor is FD00:10:6:12::12, remote AS 8, external link
 Slow-peer detection is disabled

As seen earlier, adjusting the advertisement-interval for the eBGP peers caused the router to create a different update group for XRv2 and CSR9. As such, CSR6 has three update groups. We can tell this is the case as each update-group has only one member, so in this case, slow-peer detection isn’t very effective. Each update-group has a cache (queue) where BGP updates are stored before being sent to peers. The cache is sized dynamically based on available router memory, number of peers, etc. Although an inefficient use of update-groups, this AFI on CSR6 doesn’t need to worry about slow-peers as each peer is already in a separate update-group. As a side note, the group “leader” is the peer with the highest IPv4/v6 address, and when BGP prepares an update for the group, it does so targeting the “leader”. In a way, BGP is performing business-as-usual by preparing an update for a single peer, although it is replicated to all others in the update-group.

R6#show bgp ipv4 unicast replication
                                                        Current Next
Index  Members  Leader       MsgFmt  MsgRepl  Csize     Version Version
    6        1  10.6.12.12      221      296  0/1000    554/0
   14        1  9.0.0.10         75       75  0/1000    554/0
   15        1  10.6.9.9         12       12  0/1000    554/0

For IPv6, we did not make any adjustments to advertisement-intervals, so the eBGP peers are in the same update-group.

R6#show bgp ipv6 unicast replication
                                                              Current Next
Index  Members  Leader             MsgFmt  MsgRepl  Csize     Version Version
    9        2  FD00:10:6:12::12      734     1449  0/1000    1782/0
   16        1  2009::A               200      200  0/1000    1782/0
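The leader rule mentioned earlier (the member with the highest address) is consistent with the IPv6 replication output, where FD00:10:6:12::12 leads the two-member eBGP group. A minimal Python sketch of that selection follows; the function name is hypothetical and members are assumed to belong to a single address family.

```python
import ipaddress

def group_leader(members):
    """Return the update-group member with the numerically highest address.

    Assumes all members are in the same address family; mixing IPv4 and
    IPv6 addresses would raise a TypeError during comparison.
    """
    return max(members, key=ipaddress.ip_address)

# The two eBGP peers that share IPv6 update-group 9 on CSR6.
print(group_leader(["FD00:10:6:9::9", "FD00:10:6:12::12"]))  # FD00:10:6:12::12
```

Comparing parsed addresses rather than strings matters here: a plain string comparison would rank FD00:10:6:9::9 higher because the character "9" sorts after "1".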

We will attempt to treat CSR9 as a slow-peer. First, we check CSR6 to see how many prefixes it is advertising to CSR9 and find a total of 15.

R6#show bgp ipv6 unicast neighbors FD00:10:6:9::9 advertised-routes | include Total
Total number of prefixes 15

Next, we define an MQC policy to drop BGP messages of 120 bytes or longer. This is an approximate value used to differentiate small keepalives (60-80 bytes) from larger BGP updates. This will mean that the BGP updates will be unacknowledged and CSR6 will keep trying to send them. This filter is temporary as it will break basic BGP operation entirely between CSR6 and CSR9.

! CSR9
class-map match-all CMAP_DROP_BGP_UPDATES
 match protocol bgp
 match packet length min 120
policy-map PMAP_FILTER
 class CMAP_DROP_BGP_UPDATES
  police 8000 conform-action drop
 class class-default
interface GigabitEthernet2.569
 service-policy input PMAP_FILTER

I will configure CSR6 to perform an outbound route-refresh to CSR9 which will trigger the update process.

R6#debug bgp ipv6 unicast FD00:10:6:9::9 updates out
BGP updates debugging is on for neighbor FD00:10:6:9::9 (outbound) for address family: IPv6 Unicast

! CSR6
BGP(1): (base) FD00:10:6:12::12 send UPDATE (format) 2007::1/128, next FD00:10:6:12::6, metric 0, path 7
BGP(1): (base) FD00:10:6:12::12 send UPDATE (format) 2007::B/128, next FD00:10:6:12::6, metric 0, path 7
BGP(1): FD00:10:6:12::12 NEXT_HOP part 1 net 2008:8:3:8::/64, next FD00:10:6:12::12
BGP(1): (base) FD00:10:6:12::12 send UPDATE (format) 2008:8:3:8::/64, next FD00:10:6:12::12, metric 0, path 8
BGP(1): (base) FD00:10:6:12::12 send UPDATE (format) 200B::9/128, next FD00:10:6:12::6, metric 0, path 11

We check to ensure CSR9 is dropping these large update packets. Although we cannot be 100% sure these are the BGP updates, we are fairly confident this is the case based on the MQC policy.

R9#show policy-map interface gig2.569
 GigabitEthernet2.569
  Service-policy input: PMAP_FILTER
    Class-map: CMAP_DROP_BGP_UPDATES (match-all)
      7 packets, 6854 bytes
      5 minute offered rate 0000 bps, drop rate 0000 bps
      Match: protocol bgp
      Match: packet length min 120

      police:
          cir 8000 bps, bc 1500 bytes
        conformed 7 packets, 6854 bytes; actions:
          drop
        exceeded 0 packets, 0 bytes; actions:
          drop
        conformed 0000 bps, exceeded 0000 bps
    Class-map: class-default (match-any)
      85 packets, 7422 bytes
      5 minute offered rate 0000 bps, drop rate 0000 bps
      Match: any

We can see the updates piling up in the cache for that update-group. Unfortunately, I am not able to find a way to actually make the slow-peer detection discover a slow-peer despite these efforts. This illustrates an example of how a slow-peer could form, though, and in the interest of time I continue on. CSR9’s inability to acknowledge dropped BGP packets is probably less of a “slow-peer” issue and more of a fundamentally broken BGP relationship. With professional-grade test equipment to inject thousands of BGP routes, this would be easier to test.

R6#show bgp ipv6 unicast replication
                                                              Current Next
Index  Members  Leader             MsgFmt  MsgRepl  Csize     Version Version
    9        2  FD00:10:6:12::12      765     1484  5/1000    1789/0
   16        1  2009::A               206      206  0/1000    1789/0

If a slow-peer were detected, the time at which the last detection takes place is recorded in the neighbor details. This is valuable for historical tracking.

R6#show bgp ipv6 unicast neighbors FD00:10:6:9::9 | include detect
 Slow-peer detection is enabled, threshold value is 120
 Last detected as dynamic slow peer: never

Beyond detection, we can configure a router to automatically add a slow-peer to its own update group. Since a single slow-peer effectively slows down updates for the entire group, this is often the best method for dealing with slow-peers (as opposed to shutting them down, etc). On CSR9, we will assume XRv4 is a potential slow-peer for both AFIs. For IPv4, the slow-peer will be dynamically added to its own update-group (along with other slow-peers potentially) only when the slow-peer is detected. This implies that slow-peer detection is also enabled when we configure slow-peer protection. After the peer is no longer considered slow (say, the CPU utilization drops sufficiently so that BGP updates can be processed at normal speeds), it can be moved into the normal update group, which is the essence of dynamic recovery. The “permanent” option disables this recovery and is discussed later.


! CSR9
router bgp 11
 address-family ipv4
  neighbor 10.9.14.14 slow-peer split-update-group dynamic
 address-family ipv6
  neighbor FD00:10:9:14::14 slow-peer split-update-group dynamic permanent

We can verify that slow-peer protection is configured for the IPv4 peer of XRv4. Detection is implicitly enabled, dynamic protection is explicitly enabled, and the peer is allowed to recover. Like the detection timer, the dynamic recovery timestamp will update when the slow-peer is no longer deemed as such.

R9#show bgp ipv4 unicast neighbors 10.9.14.14 | include low.peer
 Slow-peer detection is enabled, threshold value is 300
 Slow-peer split-update-group dynamic is enabled, and active
 Last detected as dynamic slow peer: never
 Dynamic slow peer recovered: never

The output for the IPv6 peer of XRv4 is almost identical. The difference is that the “permanent” keyword implies this neighbor can never recover from being labeled a slow-peer (short of manually clearing it, reloading the router, etc). Once deemed a slow-peer, the router makes no assumption and has no expectation that the peer will ever speed up.

R9#show bgp ipv6 unicast neighbors FD00:10:9:14::14 | include low.peer
 Slow-peer detection is enabled, threshold value is 300
 Slow-peer split-update-group dynamic permanent is enabled, and active
 Last detected as dynamic slow peer: never
 Dynamic slow peer recovered: never

If a peer has been identified as slow, we can clear this penalty manually using the command below. This is the only way for a slow-peer to gracefully recover from dynamic protection mode when the “permanent” option is also applied.

R9#clear bgp ipv6 unicast FD00:10:9:14::14 slow

Both the slow-peer detection and slow-peer dynamic protection can be configured per neighbor or per AFI, and I have demonstrated a number of these options. The final option for dealing with slow-peers is to statically configure them on a per-neighbor basis only. This is a fail-safe mechanism for breaking the update-group by excluding specific peers known to be slow. Before the slow-peer feature was introduced, adjusting the advertisement-interval was the only way to guarantee this, which is a valid workaround but less elegant.

! CSR9
router bgp 11
 address-family ipv4
  neighbor 10.6.9.6 slow-peer split-update-group static

When we verify this, the output changes slightly. Slow-peer detection remains disabled because there is nothing to detect; we statically configure this peer as slow. The static mode of the command is enabled, and we know that the timestamps for detection and recovery time do not apply here.

R9#show bgp ipv4 unicast neighbors 10.6.9.6 | include low.peer
 Slow-peer split-update-group static is enabled
 Slow-peer detection is disabled
 Slow-peer split-update-group dynamic is disabled
 Last detected as dynamic slow peer: never
 Dynamic slow peer recovered: never

Looking at all of CSR9’s update-groups, we now expect to see 3. There is one that is common for all iBGP peers since their advertisement-intervals and outbound policies are identical. There is a second for the IPv4 eBGP peer to XRv4, which has slow-peer detection enabled dynamically, and it is currently in a regular update-group (that is, not currently identified as a slow-peer). Other non-slow eBGP peers, if they existed, would also be in this update-group. The final group is the update-group for slow peers and is identified as a “slow update group”. Other slow peers would end up in this group if identified as such; XRv4, if dynamically determined to be slow, is an example.

Note: Since the slow-peer syntax is not supported in XR, you can use the “advertisement-interval” to target a slow peer and achieve the same effect.

R9#show bgp ipv4 unicast update-group
BGP version 4 update-group 7, internal, Address Family: IPv4 Unicast
  BGP Update version : 26/0, messages 0, active RGs: 1
  NEXT_HOP is always this router for eBGP paths
  Community attribute sent to this neighbor
  Topology: global, highest version: 26, tail marker: 26
  Format state: Current working (OK, last not in list)
                Refresh blocked (not in list, last not in list)
  Update messages formatted 14, replicated 28, current 0, refresh 0, limit 1000
  Number of NLRIs in the update sent: max 3, min 0
  Minimum time between advertisement runs is 0 seconds
  Has 2 members:
   11.0.0.13  11.0.0.4

BGP version 4 update-group 8, external, Address Family: IPv4 Unicast
  BGP Update version : 26/0, messages 0, active RGs: 1
  Topology: global, highest version: 26, tail marker: 26
  Format state: Current working (OK, last minimum advertisement interval)
                Refresh blocked (not in list, last not in list)
  Update messages formatted 27, replicated 43, current 0, refresh 0, limit 1000
  Number of NLRIs in the update sent: max 3, min 0
  Minimum time between advertisement runs is 30 seconds
  Has 1 member:
   10.9.14.14

BGP version 4 update-group 9, external, Address Family: IPv4 Unicast
  BGP Update version : 26/0, messages 0, active RGs: 1
  Slow update group
  Topology: global, highest version: 26, tail marker: 26
  Format state: Current working (OK, last minimum advertisement interval)
                Refresh blocked (not in list, last not in list)
  Update messages formatted 11, replicated 11, current 0, refresh 0, limit 1000
  Number of NLRIs in the update sent: max 2, min 0
  Minimum time between advertisement runs is 30 seconds
  Has 1 member:
   10.6.9.6

Additional Reading – Reference configurations "bgp-conv-timers"

37.4 Convergence optimizations for IGPs

This lab explores the various OSPF and IS-IS fast convergence techniques, specifically related to SPF calculation and topology advertisement. Most of these are detailed timers; dead neighbor detection timers (hellos) are not covered in this section as there is a dedicated section for that. This lab uses the same topology and basic configurations as the LDP lab, except that all advanced LDP features are removed. There are multiple IS-IS levels and multiple OSPF areas, which makes it an ideal topology for this test. Both the IS-IS and OSPFv2/v3 testing use the same set of configurations, which are shown beneath the diagram.

Additional Reading – Reference configurations "fast-conv"

37.4.1 IS-IS

IS-IS has a handful of convergence timers which are used to fine-tune how the internal IS-IS components operate. These components include intra-level SPF, partial recalculations for non-intra-level topology or stub network information (PRC), LSP generation adjustments, and more. We will define each one in detail and test some examples.

spf-interval: This is the time between consecutive SPF runs. This is a throttling mechanism, not a dampening mechanism (nothing is disabled or moved into a hold-down state), and it is designed to protect the router from IS-IS hogging the CPU during times of continuous network instability within a level. There are three values that make up this throttling mechanism.

The first value is the max-hold time, which is the maximum time a router waits between SPF runs. If 2 or more SPF runs are needed, the SPF backoff timer waits twice as long as it did before the previous run, up to the max-hold. The default is 10 seconds, and this will make more sense when we examine the other 2 timers.

The second value is the initial wait time, which is the time a node waits before starting the first SPF run. Having this be smaller means the first SPF run happens faster, but waiting a short time allows the router to collect more changed LSPs, which might mean fewer SPF runs overall. The default is 5500 ms.

The third value is how long the router waits after the completion of the first SPF run before beginning the second. Multiple SPF runs might be necessary if a router receives more LSP updates after the first changes. The default is 5500 ms.

It is challenging to find the perfect values to optimize convergence as this is entirely network dependent. This is best illustrated with the examples below. Inside level-1, we will start the first SPF run after 100 ms, with the second run being 200 ms after the first. The max backoff time will be 2 seconds. Thus, if multiple SPF runs are needed within level-1, they would run after 100, 200, 400, 800, 1600, and 2000 ms periods; the wait-timer doubles until the max-hold is reached. For level-1, these aggressive timers might be OK if the topology is stable. If more than 6 SPF runs are required, they would all be spaced 2000 ms apart.

! CSR3
router isis LDP
 spf-interval level-1 2 100 200

For level-2, more conservative timers might be more appropriate. Here, we will configure the routers to wait 3 seconds before starting SPF, with the next SPF run being 7 seconds later. The max-hold time is 28 seconds, so the SPF runs will occur at 3, 7, 14, and 28 second intervals.

! CSR3
router isis LDP
 spf-interval level-2 28 3000 7000
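The two backoff sequences quoted for level-1 and level-2 can be reproduced with a few lines of arithmetic. This is a minimal sketch of the doubling-until-max-hold behavior as described in this section, not a model of the IOS scheduler itself; the function name `spf_wait_schedule` is hypothetical and all values are in milliseconds.

```python
def spf_wait_schedule(max_hold_ms, initial_ms, second_ms, runs):
    """Return the wait (in ms) that precedes each of `runs` consecutive SPF runs."""
    waits = []
    for n in range(runs):
        if n == 0:
            # The first run only waits for the initial-wait timer.
            waits.append(initial_ms)
        else:
            # Later runs double the second-wait each time, capped at max-hold.
            waits.append(min(second_ms * 2 ** (n - 1), max_hold_ms))
    return waits


# spf-interval level-1 2 100 200 (max-hold is configured in seconds)
print(spf_wait_schedule(2000, 100, 200, 7))
# spf-interval level-2 28 3000 7000
print(spf_wait_schedule(28000, 3000, 7000, 4))
```

The first call yields 100, 200, 400, 800, 1600, 2000, 2000 and the second yields 3000, 7000, 14000, 28000, matching the sequences given in the text.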


We can quickly test these timers in some capacity. Rather than dig into unnecessary detail on the exact operation of SPF, we will perform cursory checks. If we clear IS-IS on CSR10, we can see that ~3 seconds after the adjacency comes back up, SPF begins. We can reveal this with debugs.

R3#debug isis spf-events
IS-IS SPF events debugging is on for IPv4 unicast topology base
! CSR3
00:53:18.260: %CLNS-5-ADJCHANGE: ISIS: Adjacency to R10 (GigabitEthernet2.530) Down, neighbor forgot us
00:53:18.332: %CLNS-5-ADJCHANGE: ISIS: Adjacency to R10 (GigabitEthernet2.530) Up, new adjacency
! About 3 seconds later
00:53:21.311: ISIS-SPF: Compute L2 SPT
00:53:21.311: ISIS-SPF: 4 nodes for level-2
[snip]

We will also test the more aggressive level-1 timers. If we clear IS-IS on a node that isn’t connected to CSR3, we won’t see an adjacency flap, so we can also enable a debug to show us LSP changes. More generally, the command shows us any event that triggers SPF, and we see that the first LSP change seen by CSR3 starts the 100 ms timer. After 100 ms (perfect timing in this case), SPF runs for level-1.

R3#debug isis spf-triggers
IS-IS SPF triggering events debugging is on for IPv4 unicast topology base
! CSR3
01:04:46.046: ISIS-SPF-TRIG: L1, 0000.0000.0009.00-00 TLV contents changed, code 22
01:04:46.146: ISIS-SPF: Compute L1 SPT
01:04:46.146: ISIS-SPF: 7 nodes for level-1
[snip]

prc-interval: This is the time between consecutive partial recalculation (PRC) runs. The command has the same parameters as spf-interval and is configured per-process only (not per-level). The concepts of max-hold, initial-wait, and backoff-wait are the same; only the default timers differ. The max-hold is 5 seconds, the initial-wait is 2 seconds, and the backoff-wait is 5 seconds. This effectively means that the first PRC waits 2 seconds and every PRC afterwards will be run 5 seconds apart. The significant difference is that this feature relates to how often partial recalculations are run, not full SPF invocations. A partial recalculation is invoked when a leaf (stub) network changes, such as an IP route.

Inside level-2, we will make the timer a little more aggressive on CSR10. The initial-wait is reduced to 1500 ms so PRC begins faster. The second PRC, if necessary, runs 2500 ms after the first, with a max-hold of 5000 ms (the default). This means that the first PRC would be at the 1500 ms mark, followed by runs after 2500 and 5000 ms periods as needed.

! CSR10
router isis LDP
 prc-interval 5 1500 2500

To trigger a partial recalculation, we can enable SPF event and trigger debugging on CSR10 (level-2) and clear CSR8 (level-1). This will change the IP routes inside CSR3, and CSR10 will invoke the PRC process to essentially withdraw reachability to these prefixes. The timing was almost perfect with PRC occurring 1499 ms after receiving the LSP update.

! CSR10
01:21:53.712: ISIS-SPF: L2 LSP 1 (0000.0000.0003.00-00) flagged for recalculation from 7FBCAB81848D
01:21:55.211: ISIS-SPF: LSP 1 (0000.0000.0003.00-00) Type STD
01:21:55.211: ISIS-SPF: spf_result: next_hop_parents:0x7FBC415CF020 root_distance:10, parent_count:1, parent_index:4 db_on_paths:1
01:21:55.211: ISIS-SPF: Calculating routes for L2 LSP 1 (0000.0000.0003.00-00)
[snip]

max-lsp-lifetime: Measured in seconds, this is the lifetime a router stamps into the LSPs it originates, which determines how long other IS-IS routers retain those LSPs without a refresh. The default is 1200 seconds (20 minutes) and this is configured per-process, not per-level. A quick check of the database shows many LSPs each with an LSP holdtime of less than 1200 seconds. When an LSP is refreshed, its holdtime is set to the max-lsp-lifetime, and the number counts down to zero. This hold-time is actually carried in the LSP that a router originates.

R3#show isis database
Tag LDP:
IS-IS Level-1 Link State Database:
LSPID                 LSP Seq Num  LSP Checksum  LSP Holdtime  ATT/P/OL
R2.00-00              0x00000240   0x3FC0        1110          0/0/0
R3.00-00            * 0x00000248   0x3C8E        559           1/0/0
R3.01-00            * 0x00000228   0x3EC2        500           0/0/0
R6.00-00              0x00000231   0x6CAF        843           0/0/0
R8.00-00              0x0000023C   0xD046        1161          0/0/0
R9.00-00              0x00000247   0xE1CA        1102          0/0/0
XRv2.00-00            0x00000241   0x92D8        854           0/0/0
IS-IS Level-2 Link State Database:
LSPID                 LSP Seq Num  LSP Checksum  LSP Holdtime  ATT/P/OL
R1.00-00              0x000001F0   0x8FF4        484           0/0/0
R3.00-00            * 0x00000237   0xC4FC        427           0/0/0
R10.00-00             0x0000020F   0x26AF        354           0/0/0
XRv4.00-00            0x00000208   0xA7D1        963           0/0/0


On CSR3, we will configure a very aggressive holdtime, so aggressive that other routers don’t generate their LSPs fast enough to keep the network stable. If we check CSR8, we can see that the hold-time for CSR3’s LSP is now counting down from 120 seconds.

! CSR3
router isis LDP
 max-lsp-lifetime 120

R8#show isis database R3.00-00
Tag LDP:
IS-IS Level-1 LSP R3.00-00
LSPID                 LSP Seq Num  LSP Checksum  LSP Holdtime  ATT/P/OL
R3.00-00              0x0000024F   0x5B68        70            1/0/0

Notice that after the LSP expires ~70 seconds later, it is refreshed immediately by CSR3. The LSP sequence number increases from 0x24F to 0x250, the next number in sequence, and the LSP has a new checksum.

R8#show isis database R3.00-00
Tag LDP:
IS-IS Level-1 LSP R3.00-00
LSPID                 LSP Seq Num  LSP Checksum  LSP Holdtime  ATT/P/OL
R3.00-00              0x00000250   0x5969        102           1/0/0

We can confirm this using debug on CSR8. Letting it run a few minutes, we can see the LSP updates coming every 60-90 seconds for all of CSR3’s LSPs, including the DIS LSP. I assume there is some LSP refresh timer approximation happening on CSR3 implicitly as a result of changing the max lifetime. In the output, 0000.0000.0003.00-00 is the router LSP and 0000.0000.0003.01-00 is the DIS (pseudonode) LSP.

R8#debug isis update-packets
IS-IS Update related packet debugging is on for router process LDP
! CSR8
01:43:17.739: ISIS-Upd: Rec L1 LSP 0000.0000.0003.00-00, seq 252, ht 117,
01:43:18.080: ISIS-Upd: Rec L1 LSP 0000.0000.0003.01-00, seq 230, ht 117,
01:44:39.576: ISIS-Upd: Rec L1 LSP 0000.0000.0003.01-00, seq 231, ht 117,
01:44:43.360: ISIS-Upd: Rec L1 LSP 0000.0000.0003.00-00, seq 253, ht 117,
01:45:47.601: ISIS-Upd: Rec L1 LSP 0000.0000.0003.01-00, seq 232, ht 117,
01:45:54.352: ISIS-Upd: Rec L1 LSP 0000.0000.0003.00-00, seq 254, ht 117,
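To put numbers on the "every 60-90 seconds" observation, the gaps between consecutive receptions of each LSP can be computed from the debug timestamps. This is a small sketch that assumes the debug line format shown above; the function name `refresh_gaps_ms` and the regular expression are my own and are not part of any Cisco tooling.

```python
import re
from collections import defaultdict

DEBUG = """\
01:43:17.739: ISIS-Upd: Rec L1 LSP 0000.0000.0003.00-00, seq 252, ht 117,
01:43:18.080: ISIS-Upd: Rec L1 LSP 0000.0000.0003.01-00, seq 230, ht 117,
01:44:39.576: ISIS-Upd: Rec L1 LSP 0000.0000.0003.01-00, seq 231, ht 117,
01:44:43.360: ISIS-Upd: Rec L1 LSP 0000.0000.0003.00-00, seq 253, ht 117,
01:45:47.601: ISIS-Upd: Rec L1 LSP 0000.0000.0003.01-00, seq 232, ht 117,
01:45:54.352: ISIS-Upd: Rec L1 LSP 0000.0000.0003.00-00, seq 254, ht 117,
"""

# hh:mm:ss.mmm timestamp followed by the received LSP ID
LINE = re.compile(
    r"(\d+):(\d+):(\d+)\.(\d{3}): ISIS-Upd: Rec L\d LSP ([0-9A-Fa-f.\-]+),"
)


def refresh_gaps_ms(text):
    """Map each LSP ID to the gaps (in ms) between its consecutive receptions."""
    seen = defaultdict(list)
    for h, m, s, ms, lsp in LINE.findall(text):
        stamp = ((int(h) * 60 + int(m)) * 60 + int(s)) * 1000 + int(ms)
        seen[lsp].append(stamp)
    return {
        lsp: [later - earlier for earlier, later in zip(times, times[1:])]
        for lsp, times in seen.items()
    }


for lsp, gaps in refresh_gaps_ms(DEBUG).items():
    print(lsp, gaps)
```

For this capture the router LSP is refreshed roughly 86 and 71 seconds apart and the DIS LSP roughly 81 and 68 seconds apart, which falls inside the window described above.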

lsp-refresh-interval: Measured in seconds, this is how long the router will wait before re-flooding its own generated LSPs (local networks, DIS, etc). The default is 15 minutes and this is set per-process, not per-level. This number has to be less than the max-lsp-lifetime since an LSP must be refreshed before being purged. On CSR3, we will configure this to be 15 seconds, which will generate even more frequent debugging output on CSR8. Now we can see the updates are about 15 seconds apart (slightly faster).

! CSR3
router isis LDP
 lsp-refresh-interval 15

! CSR8
01:49:30.508: ISIS-Upd: Rec L1 LSP 0000.0000.0003.00-00, seq 259, ht 117,
01:49:32.113: ISIS-Upd: Rec L1 LSP 0000.0000.0003.01-00, seq 237, ht 117,
01:49:42.269: ISIS-Upd: Rec L1 LSP 0000.0000.0003.00-00, seq 25A, ht 117,
01:49:46.172: ISIS-Upd: Rec L1 LSP 0000.0000.0003.01-00, seq 238, ht 117,
01:49:54.561: ISIS-Upd: Rec L1 LSP 0000.0000.0003.00-00, seq 25B, ht 117,
01:49:59.666: ISIS-Upd: Rec L1 LSP 0000.0000.0003.01-00, seq 239, ht 117,

lsp-gen-interval: This specifies how much time passes between creating new versions of a single LSP. This can be set on a per-process (both levels) or per-level basis. Just like the SPF and PRC calculations, this command has the max-hold, initial-wait, and backoff-wait timers. By default, the max-hold is 5 seconds, the initial-wait is 50 ms (fast), and the backoff-wait is 5 seconds. Essentially this means that the first LSP change is propagated quickly but any subsequent changes occur 5 seconds later. Testing this is a little tricky, but we can easily verify the max-hold and backoff-wait timers. On CSR3, we will slow things down by making the router wait a full second before propagating any updates, then backing off subsequent updates at 10 seconds each time.

! CSR3
router isis LDP
 lsp-gen-interval level-2 10 1000 10000

To simulate this, we will add a new loopback to IS-IS on CSR3, then remove it about 2 seconds later. By making the first change, the IS-IS process waits 1 second then floods the update. We can’t effectively measure this other than a good guess by watching debug output on other routers. If we wait more than 1 second but less than 10 seconds and remove the loopback, this change will not be regenerated until 10 seconds after the first one was.

! CSR3
interface Loopback33
 ip address 33.33.33.33 255.255.255.255
router isis LDP
 passive-interface Loopback33
 ! wait 2 seconds
 no passive-interface Loopback33


By debugging on CSR1, we can see the first update where there was a “leaf route changed” in topology ID (TID) 0 which is IPv4. This carries the new IPv4 prefix 33.33.33.33/32 which we added to IS-IS. Just under 10 seconds later (9999 ms), another update occurs to say the leaf routes changed again, which indicates the removal of the loopback. Whether we removed the loopback at the 1.1 or 9.9 second mark, the LSP generation throttle would ensure that the time delta here is always 10 seconds given our configuration. We can see the sequence number increments from 0x26A to 0x26B, indicating that the LSP has been updated.

! CSR1
01:56:03.989: ISIS-Upd: Rec L2 LSP 0000.0000.0003.00-00, seq 26A, ht 118,
01:56:03.989: ISIS-Upd: from SNPA 0050.56a9.862a (GigabitEthernet2.514)
01:56:03.989: ISIS-Upd: LSP newer than database copy
01:56:03.989: ISIS-Upd: TLV contents different, code 135
01:56:03.989: ISIS-Upd: TID 2: TLV contents different, code 0x87
01:56:03.989: ISIS-Upd: TID 2 no change
01:56:03.989: ISIS-Upd: TID 0 leaf routes changed

01:56:13.988: ISIS-Upd: Rec L2 LSP 0000.0000.0003.00-00, seq 26B, ht 118,
01:56:13.988: ISIS-Upd: from SNPA 0050.56a9.862a (GigabitEthernet2.514)
01:56:13.988: ISIS-Upd: LSP newer than database copy
01:56:13.988: ISIS-Upd: TLV contents different, code 135
01:56:13.988: ISIS-Upd: TID 2: TLV contents different, code 0x87
01:56:13.988: ISIS-Upd: TID 2 no change
01:56:13.988: ISIS-Upd: TID 0 leaf routes changed
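The claim that the delta is always 10 seconds, regardless of when the second change is made, can be checked with a deliberately simplified model of the LSP generation throttle. This is a sketch under assumptions: it only models what this section describes (initial-wait before the first LSP, then a backoff that doubles up to the max-hold), it ignores how the real implementation resets the timers after a quiet period, and the function name `lsp_generation_times` is hypothetical. All times are in milliseconds.

```python
def lsp_generation_times(change_times_ms, initial_ms, second_ms, max_ms):
    """Return the times at which new LSP versions are generated (simplified model)."""
    generated = []
    wait = second_ms
    for change in sorted(change_times_ms):
        if not generated:
            # First change: only the initial-wait applies.
            generated.append(change + initial_ms)
            continue
        if change <= generated[-1]:
            # Change arrived before the pending LSP was built, so that LSP carries it.
            continue
        # Later changes cannot be advertised until the backoff expires.
        generated.append(max(change, generated[-1] + wait))
        wait = min(wait * 2, max_ms)
    return generated


# lsp-gen-interval level-2 10 1000 10000, loopback added at 0 ms
for removal in (1100, 2000, 9900):
    print(removal, lsp_generation_times([0, removal], 1000, 10000, 10000))
```

In each case the model generates LSPs at 1000 ms and 11000 ms, a 10 second delta, which is consistent with the 9999 ms gap seen in the debug on CSR1.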

Another interesting IS-IS feature is incremental SPF (iSPF). This command is used for intra-level SPF calculations to only rebuild the parts of the tree requiring it. In a sparsely-connected graph, this can be very valuable, since a full SPF run is CPU intensive and iSPF can reduce the impact. This is different than PRC, which is for prefix information (or any NLRI carried by IS-IS) and is for stub networks. iSPF helps converge a network more efficiently when intra-level links and nodes change state. The feature is also interoperable with routers that don’t support it, so we will enable it on CSR10 only. It can be enabled for level-1, level-2, or both. The numerical argument is somewhat benign; it simply details how long the router should wait before activating iSPF from the time you configure it. The default is 120 seconds, but I set it to 1 second. If I enter the command, I don’t want the router to wait 2 minutes to obey the instruction.

! CSR10
router isis LDP
 ispf level-2 1

To test it, I will shut down the link to XRv4 on CSR1. Since CSR1 is a stubby node, CSR10 should only need to recompute that part of the network without needing to perform much work on CSR3. We don’t examine the whole process in detail, but we can see that iSPF is enabled and functional.

R10#debug isis spf-events terse
IS-IS SPF events debugging terse is on for IPv4 unicast topology base
! CSR10
02:15:41.202: ISIS-SPF: Compute L2 SPT
02:15:41.202: ISIS-SPF: 4 nodes for level-2
02:15:41.202: ISIS-SPF: I-SPF: Adding LSP: (1) to NewLSP, metric: 10
02:15:41.202: ISIS-SPF: I-SPF: Adding LSP: (1) to 7FBC413E5BE0, metric: 10. From 7FBCAB7EBE28
02:15:41.202: ISIS-SPF: I-SPF: Adding LSP: (2) to NewLSP, metric: 10
02:15:41.202: ISIS-SPF: I-SPF: Adding LSP: (2) to 7FBC413E5BE0, metric: 10. From 7FBCAB7EBE28
[snip]

Another feature is IS-IS fast flood. This is used to ensure that an LSP that causes a reconvergence event (say, a failed node) is flooded to other neighbors before the local router runs SPF. The idea is to quickly “spread the alarm” within the level so other routers can run SPF in parallel, rather than running SPF immediately upon receipt and flooding the LSP later. However, flooding too many LSPs can be counterproductive, so the number can be set between 1 and 15 LSPs, where the default is 5. This often improves IS-IS convergence in a network, and is configured on CSR3. CSR3 will fast-flood 2 LSPs before running SPF during a reconvergence event.

! CSR3
router isis LDP
 fast-flood 2

These IS-IS timers work identically on IOS-XR. There does not seem to be a PRC interval timer, but the others exist. XR does a better job of labeling the values with their names rather than arraying them in a command with vague context-sensitive help information. LSP-related commands are not AFI-specific and are configured under the process. AFI-specific commands are configured under the AFI stanzas.

! XRv4
router isis LDP
 lsp-gen-interval maximum-wait 4800 initial-wait 600 secondary-wait 1200 level 2
 max-lsp-lifetime 1800 level 2
 address-family ipv4 unicast
  ispf level 2
  spf-interval maximum-wait 3000 initial-wait 50 secondary-wait 3000 level 2

37.4.2 OSPFv2 and OSPFv3

The mechanism by which OSPF builds a graph, at least within an OSPF area, is very similar to IS-IS within a level. The SPF timers are similar in behavior but vary slightly in syntax. OSPF also introduces many more complicated timers than IS-IS does.


timers throttle spf: This timer is identical to the IS-IS spf-interval timer. The difference is that the timers are arrayed differently on the CLI. First is the initial-wait, then the backoff-wait, then the max-hold. The defaults for each one are 5 seconds, 10 seconds, and 10 seconds, respectively. This means that OSPF waits 5 seconds for the first SPF run (very slow), and 10 seconds for every other run. We will speed things up a bit. In this case, we configure CSR7 to wait 500 ms from the moment it receives an LSA1 or LSA2 update to the time it starts SPF. The second SPF run happens 1500 ms after the first, with subsequent runs waiting 3000, 6000, and 12000 ms as the backoff doubles up to the max-hold. OSPF also shows you this information in a show command, which IS-IS does not.

! CSR7
router ospf 92
 timers throttle spf 500 1500 12000

R7#show ip ospf | include SPF
 Initial SPF schedule delay 500 msecs
 Minimum hold time between two consecutive SPFs 1500 msecs
 Maximum wait time between two consecutive SPFs 12000 msecs
[snip]
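Because the same three values are arrayed differently in the two protocols, it is easy to transpose them when moving between IS-IS and OSPF. The sketch below converts between the two orderings, assuming the classic IOS/IOS XE syntax used in this lab (IS-IS takes max-hold in seconds first; OSPF takes all three values in milliseconds with max-hold last). The helper names are hypothetical and the optional IS-IS level keyword is omitted.

```python
def isis_to_ospf(max_hold_s, initial_ms, second_ms):
    """spf-interval <max-s> <initial-ms> <second-ms> -> OSPF equivalent."""
    return f"timers throttle spf {initial_ms} {second_ms} {max_hold_s * 1000}"


def ospf_to_isis(start_ms, hold_ms, max_wait_ms):
    """timers throttle spf <start> <hold> <max> -> IS-IS equivalent."""
    # IS-IS takes max-hold in whole seconds, so round up to avoid shortening it.
    max_hold_s = -(-max_wait_ms // 1000)
    return f"spf-interval {max_hold_s} {start_ms} {hold_ms}"


print(isis_to_ospf(2, 100, 200))        # the level-1 values used on CSR3
print(ospf_to_isis(500, 1500, 12000))   # the values used on CSR7
```

The first call prints "timers throttle spf 100 200 2000" and the second prints "spf-interval 12 500 1500".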

We can verify this with debugging. On CSR4, we will flap a link to create an SPF event (not shown). CSR7 receives the update from XRv2 and CSR4, and the SPF initial timer starts upon receipt of the first SPF-triggering LSA update. Exactly 500 ms later, SPF begins, which is 10 times faster than the default. The 500 ms was a good buffer because it gave CSR7 time to collect another LSA update from CSR4 before running SPF for the first time; a value of 10 ms, for example, would have necessitated at least two SPF runs, since the second LSU from CSR4 would have come in too late to be considered in the first SPF run.

R7#debug ip ospf spf intra
OSPF SPF intra debugging is on
! CSR7
08:26:29.656: OSPF-92 SPF : Detect change in LSA type 1, LSID 92.0.0.12 from 92.0.0.12 area 0
08:26:30.052: OSPF-92 SPF : Detect change in LSA type 1, LSID 92.0.0.4 from 92.0.0.4 area 0
08:26:30.156: OSPF-92 MON : Begin SPF at 477043.804ms, process time 8623ms
[snip]

timers throttle lsa: This is equivalent to the IS-IS lsp-gen-interval and uses the same trio of timers we have seen many times. By default, the initial timer is 0 ms (happens right away), and the backoff and max-hold timers are set to 5 seconds. This means if two interfaces flap on a router in quick succession, one of them is reflected immediately, but the other must wait 5 seconds. By increasing the initial timer to 50 ms or so, you can actually speed up convergence in the network since both link-state changes would be captured in the first update.

! CSR7
router ospf 92
 timers throttle lsa 50 100 500

Once we configure these timers, we can confirm them using show commands.

R7#show ip ospf | include LSA throttle
 Initial LSA throttle delay 50 msecs
 Minimum hold time for LSA throttle 100 msecs
 Maximum wait time for LSA throttle 500 msecs

As a quick test, we will shut down one of CSR7’s transit links while showing the clock just before we do (copy/paste works well). First, we enable LSA-generation debugs, then quickly capture a timestamp and shut down the interface. Since CSR7 is an ABR and its only link to area 1 was lost, we see CSR7 generate LSAs for both areas. The only reason it does this in area 0 is to clear the ABR flag, not because an area 0 link failed. In any case, we can see that these LSAs are rate-limited from being created right away. Exactly 50 ms later, they are generated and flooded soon thereafter.

R7#debug ip ospf lsa-generation
OSPF LSA generation debugging is on
R7(config)#interface gig2.571
R7(config-subif)#do show clock
08:36:49.476 UTC
R7(config-subif)#shutdown
! CSR7
08:36:49.481: %OSPF-5-ADJCHG: Process 92, Nbr 92.0.0.11 on GigabitEthernet2.571 from FULL to DOWN, Neighbor Down: Interface down or detached
08:36:49.481: OSPF-92 LSGEN: Scheduling rtr LSA for area 1, build flag 0x41 (from 0x7FAFE34F1C42)
08:36:49.481: OSPF-92 LSGEN: Scheduling rtr LSA for area 1, build flag 0x41 (from 0x7FAFE34D0247)
08:36:49.483: OSPF-92 LSGEN: Scheduling rtr LSA for area 0, build flag 0x41 (from 0x7FAFE34EC3C2)
08:36:49.483: OSPF-92 LSGEN: Scheduling rtr LSA for area 1, build flag 0x41 (from 0x7FAFE34EC3C2)
08:36:49.485: OSPF-92 LSGEN: Rate limit LSA generation for 1/92.0.0.7/92.0.0.7
08:36:49.485: OSPF-92 LSGEN: Rate limit LSA generation for 1/92.0.0.7/92.0.0.7
08:36:49.535: OSPF-92 LSGEN: Suppressing prefix 92.6.7.0/24 from router LSA
08:36:49.535: OSPF-92 LSGEN: Suppressing prefix 92.11.7.0/24 from router LSA
08:36:49.535: OSPF-92 LSGEN: Build router LSA for area 0, router ID 92.0.0.7, seq 0x800000E3
08:36:49.536: OSPF-92 LSGEN: Build router LSA for area 1, router ID 92.0.0.7, seq 0x800000CB


timers lsa arrival: This timer is measured in ms and serves to ensure a router does not keep accepting the same LSA too quickly. Used in conjunction with the LSA throttle timers earlier for LSA generation, this timer enforces LSA reception instead. The default is 1000 ms and if the same LSA arrives faster than this period, the LSA is dropped. This might mean the LSAs would need to be re-flooded, but it would allow the local router to have a more stable LSDB if one router is generating LSAs too quickly.

! CSR7
router ospf 92
 timers lsa arrival 500

We can quickly verify this by checking the OSPF summary.

R7#show ip ospf | include arrival
 Minimum LSA arrival 500 msecs
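The arrival check is simple enough to express as a small filter. This is a sketch of the rule as described in this section (a new copy of the same LSA is dropped if it arrives sooner than the minimum arrival interval after the last accepted copy); the function name `accept_lsas` and the tuple-based LSA key are hypothetical, and times are in milliseconds.

```python
def accept_lsas(arrivals, min_arrival_ms=1000):
    """Return True/False per arrival, where each arrival is (time_ms, lsa_key)."""
    last_accepted = {}
    verdicts = []
    for time_ms, lsa_key in arrivals:
        previous = last_accepted.get(lsa_key)
        if previous is not None and time_ms - previous < min_arrival_ms:
            # Same LSA arrived too soon after the last accepted copy: drop it.
            verdicts.append(False)
        else:
            last_accepted[lsa_key] = time_ms
            verdicts.append(True)
    return verdicts


# The same router LSA (type 1, LSID 92.0.0.4) arriving at 0, 600, and 1200 ms
key = (1, "92.0.0.4")
arrivals = [(0, key), (600, key), (1200, key)]
print(accept_lsas(arrivals, 1000))  # default timer
print(accept_lsas(arrivals, 500))   # timers lsa arrival 500
```

With the default of 1000 ms, the copy at 600 ms is dropped; with the 500 ms value configured on CSR7, all three copies are accepted.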

timers pacing flood: This timer determines how long to wait before flooding an LSA on an interface after the last one was sent. When multiple updates need to be sent, they are queued in an interface flood-list. The inter-update interval is controlled with this command. Every interface has a flood-list, and it’s hard to actually see the queue grow since the timer is very fast.

R7#show ip ospf flood-list GigabitEthernet2.567
            OSPF Router with ID (92.0.0.7) (Process ID 92)
 Interface GigabitEthernet2.567, Queue length 0

The timer is measured in ms with a default value of 33 ms. We will increase the value to 75 to create more delay between updates to potentially reduce CPU utilization at the cost of slower convergence. We can confirm it by checking the OSPF process summary.

! CSR7
router ospf 92
 timers pacing flood 75

R7#show ip ospf | include flood_pacing
 Interface flood pacing timer 75 msecs

timers pacing retransmission: This timer determines how frequently to retransmit unacknowledged LSAs on a link using the logic described above. The “flood” timer controls the inter-packet flood delay when packets are first flooded, while the “retransmission” timer controls the subsequent inter-packet flood delay which, in a perfect world, would not occur. It makes sense to make the retransmission timer slower than the flood timer to conserve router resources, and by default it is 66 ms (twice the flood timer). The LSAs that are candidates for retransmission are placed in an interface-level retransmission-list, much like the flood-list above. The flood/retransmission timers accomplish the same thing but apply to different queues. The configuration and verification are below.

! CSR7
router ospf 92
 timers pacing retransmission 150

R7#show ip ospf | include Retrans
 Retransmission pacing timer 150 msecs

Like the flood-list, each interface keeps track of the LSAs requiring retransmission as well. While the queue is empty, the command to see those LSAs is below.

R7#show ip ospf retransmission-list GigabitEthernet2.567
            OSPF Router with ID (92.0.0.7) (Process ID 92)
 Neighbor 92.0.0.6, interface GigabitEthernet2.567 address 92.6.7.6

timers pacing lsa-group: This timer is meant to optimize the refresh process for aging LSAs. Rather than refresh all LSAs at the 30 minute mark regardless of age, this timer groups LSAs together with similar ages. The default is 240 seconds (4 minutes), which means that LSAs within 4 minutes of one another are refreshed together. This is sometimes referred to as “controlled bursting” and is a better approach than refreshing all LSAs at the 30 minute mark (half of MAXAGE) or refreshing them individually. The smaller the number, the more frequent (but less inclusive) the bursts. We can effectively disable the feature by setting the timer to its maximum value of 1800 (30 minutes). We could also approximate the “individual LSA” refresh approach by using the minimum of 10 seconds. I use a value of 2 minutes, which means there will be twice as many refreshes per 30 minute interval, but each refresh will contain approximately half as many LSAs. Testing these pacing timers is difficult to do in a timely manner, so we will limit verification to some show commands.

! CSR7
router ospf 92
 timers pacing lsa-group 120

R7#show ip ospf | include LSA_group
 LSA group pacing timer 120 secs

We can do a little extra verification for the LSA-group timer since there is a special command for it. We can see that the grouped updates are approximately 120 seconds apart, and the interval is displayed for sanity. This helps us verify that the configuration is correct. We can also see that the next update is due in about 65 seconds; this number is always less than the LSA-group interval.
R7#show ip ospf timers lsa-group
 OSPF Router with ID (92.0.0.7) (Process ID 92)
 Group size 8, Head 2, Search Index 4, Interval 120 sec
 Next update due in 00:01:05
 Current time 550287
 Index 0 Timestamp 550353
 Index 1 Timestamp 550475
 Index 2 Timestamp 550595
 Index 3 Timestamp 550723
 Index 4 Timestamp 550845
 Index 5 Timestamp 550966
 Index 6 Timestamp 551089
 Index 7 Timestamp 551212
 Failure Head 0, Last 0 LSA group failure logged
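The spacing claim above can be checked with a little arithmetic. The following is a minimal sketch, assuming the timestamps in the output are expressed in seconds (the output itself does not state the unit); the helper names are hypothetical and exist only for illustration.

```python
# Values copied from the "show ip ospf timers lsa-group" output on R7.
# Assumption: timestamps are in seconds; the output does not label the unit.
CURRENT_TIME = 550287
TIMESTAMPS = [550353, 550475, 550595, 550723, 550845, 550966, 551089, 551212]


def gaps(timestamps):
    """Return the spacing between consecutive group timestamps."""
    return [later - earlier for earlier, later in zip(timestamps, timestamps[1:])]


def bursts_per_refresh(interval_s, refresh_s=1800):
    """Approximate number of grouped refresh bursts per 30-minute refresh cycle."""
    return refresh_s / interval_s


if __name__ == "__main__":
    print("gaps between groups:", gaps(TIMESTAMPS))
    print("time until first group:", TIMESTAMPS[0] - CURRENT_TIME)
    print("bursts per cycle at 240 s:", bursts_per_refresh(240))
    print("bursts per cycle at 120 s:", bursts_per_refresh(120))
```

Every gap lands within a few seconds of the configured 120, and the distance from the current time to the first timestamp is close to the "next update due" value. Halving the interval from 240 to 120 doubles the number of bursts per refresh cycle, which is the trade-off described above.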

queue-depth: These commands are not well documented, as the IOS XE 3S documentation makes no mention of them. They are self-explanatory though; you can control how many OSPF hellos/updates may exist in the software queue for processing. This ensures the router doesn’t get stuck processing OSPF packets forever, but also allows the router to have some backlog mechanism. Below, I allow up to 64 hello packets to be queued, but an unlimited number of LSUs. There does not appear to be a show command to verify this.
! CSR7
router ospf 92
 queue-depth update unlimited
 queue-depth hello 64

ispf: When enabled, SPF will only recalculate the part of the tree that requires it. This works identically in logic to the IS-IS feature seen earlier. We can verify it by looking at the OSPF summary.
! CSR7
router ospf 92
 ispf

R7#show ip ospf | include Increment
 Incremental-SPF enabled

We can quickly test this by changing the IGP cost on CSR4’s loopback (not shown). A minor change like this would cause an IS-IS PRC without needing iSPF enabled, but for OSPFv2, this is an intra-area change to the LSA1, which normally triggers a full SPF run. With iSPF, this alleviates the need for CSR7 to recompute paths to CSR6, XRv2 or XRv1. OSPF invokes an “Incremental” SPF mechanism which, based on the minor change from CSR4’s LSA1, can be significantly more CPU-efficient than a full SPF run.
! CSR7
OSPF-92 SPF: Detect change in LSA type 1, LSID 92.0.0.4 from 92.0.0.4 area 0
OSPF-92 INTRA: Insert LSA to New_LSA list type 1, LSID 92.0.0.4, from 92.0.0.4 area 0
OSPF-92 SPF: Detect change in LSA type 3, LSID 92.0.0.4 from 92.0.0.11 area 1
OSPF-92 SPF : Do not schedule partial SPF type 3, LSID 92.0.0.4, adv_rtr 92.0.0.11, area 1: INTRA/INTER spf scheduled
OSPF-92 MON : Begin SPF at 554752.098ms, process time 10137ms
OSPF-92 INTRA: Running SPF for area 0, SPF-type Incremental
OSPF-92 INTRA: Initializing to run spf
OSPF-92 INTRA: Running incremental SPF for area 0
[snip]

The last OSPFv2 convergence topic is knowing how to read the SPF statistics page, which can be cryptic since the command provides no legend. On CSR7, we can see several SPF runs over the past 24 hours. Below are the key field definitions.

Delta T: Time that has passed from when a given SPF run started to the current time. In the example below, this was about 20 hours ago. Essentially, it says “I started an SPF run this long ago”.

Intra: Time in ms for SPF to process intra-area LSAs, such as LSA1 and LSA2, and to build the topology graph within an area. The timer also accounts for installing intra-area routes in the RIB, which can be a lengthy process. If a topology is densely connected and/or has many IP prefixes, this number can be large.

D-Intra: Same concept as “Intra” except measures the time required to delete intra-area routes (O) from the RIB, measured in ms.

Summ: Same concept as “Intra” except measures the time required to run partial SPF on inter-area LSAs (LSA3) and install them in the RIB. This is also measured in ms.

D-Summ: Same concept as “Summ” except measures the time required to delete inter-area routes (O IA) from the RIB, measured in ms.

Ext: Same concept as “Intra” and “Summ” except measures the time required to run partial SPF on external LSAs (LSA5 and LSA7) and install them in the RIB. This is also measured in ms.

D-Ext: Same concept as “Ext” except measures the time required to delete external (E and N) routes from the RIB, measured in ms.

Total: The sum of all aforementioned counters, measured in ms. This is how long it took the entire SPF process to do everything required, including additions and deletions for all OSPF route types.

Reason: This is a code that justifies the need to run SPF.


R: A change in a router LSA (type 1) has occurred. This means full SPF (or possibly iSPF) must run within the area since the topology changed or an intra-area prefix changed.

N: A change in a network LSA (type 2) has occurred. This might indicate that a DR failed (which would also be an “R” condition) when the BDR succeeded it, or that a network type was changed to include or exclude the DR election. When the reason is “N”, the reason is often “R” as well.

SA: A change in a summary ASBR LSA (type 4) has occurred. This would change if a new ASBR was detected in an area and the ABRs needed to notify the other areas about it. The opposite is also true; if an ASBR stops performing redistribution, that would trigger a removal of this LSA. Changes in the cost to reach the ASBR would also cause SPF to run using this code.

SN: A change in a summary LSA (type 3) has occurred. If an ABR changes what prefixes it advertises between areas, that would cause a partial SPF run for those prefixes. Changes in the inter-area prefix costs would also trigger SPF using this code.

X: A change in an external LSA (type 5 or 7) has occurred. If an ASBR changes what prefixes it advertises into OSPF via redistribution (or potentially a default route that is not inter-area), that would cause a partial SPF run. Changes in the external prefix costs would also trigger SPF using this code.

R7#show ip ospf statistics
 OSPF Router with ID (92.0.0.7) (Process ID 92)
 Area 0: SPF algorithm executed 3 times
 Area 1: SPF algorithm executed 3 times
 Summary OSPF SPF statistic
 SPF calculation time
 Delta T   Intra  D-Intra  Summ  D-Summ  Ext  D-Ext  Total  Reason
 20:22:16  0      0        0     0       0    0      0      R
 20:22:11  0      0        0     0       0    0      0      R, SN
 00:36:11  0      0        0     0       0    0      0      R, N, SN, SA, X
 00:36:10  0      0        0     0       0    0      0      R
 00:36:07  0      1        0     0       0    0      1      R, N, SN
 00:36:01  0      0        0     0       0    1      1      R, SN, X
 00:35:53  0      0        0     0       0    0      0      R, N, SN, SA, X
 00:35:52  0      0        0     0       0    1      1      R, SN, X
 00:35:48  0      0        0     0       0    0      0      R, N, SN
 00:35:45  0      0        0     0       0    0      0      R, SN
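Because the XE output carries no legend, it can help to keep the reason codes in a small lookup table. The sketch below is illustrative only: the dictionary restates the definitions given above, and the function name is hypothetical.

```python
# Legend for the OSPFv2 "Reason" column, restating the definitions above.
REASON_LEGEND = {
    "R": "Router-LSA (type 1)",
    "N": "Network-LSA (type 2)",
    "SN": "Summary-LSA, IP network (type 3)",
    "SA": "Summary-LSA, ASBR (type 4)",
    "X": "AS-external-LSA (type 5 or 7)",
}


def explain_reasons(field):
    """Expand a comma-separated Reason field such as 'R, SN, X'."""
    codes = [code.strip() for code in field.split(",") if code.strip()]
    return [REASON_LEGEND.get(code, "unknown code") for code in codes]


if __name__ == "__main__":
    for reason in ("R", "R, SN, X", "R, N, SN, SA, X"):
        print(f"{reason:18} -> {explain_reasons(reason)}")
```

The trailing-comma form that XR prints (for example "R,") is handled as well, since empty fields are discarded before the lookup.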


Looking at a specific entry, we can see the number of link-state IDs processed of each type, along with all of the routers that were involved in the change. This was most likely the result of clearing OSPF on a handful of area 0 routers at once, since all of them had a change.
R7#show ip ospf statistics detail
 OSPF Router with ID (92.0.0.7) (Process ID 92)
 Area 0: SPF algorithm executed 8 times
 SPF 1 executed 00:55:34 ago, SPF type Full
 SPF calculation time (in msec):
 SPT  Intra  D-Intr  Summ  D-Summ  Ext7  D-Ext7  Total
 0    0      0       0     0       0     0       0
 LSIDs processed R:1 N:0 Stub:1 SN:9 SA:0 X7:0
 Change record 0x0
 LSIDs changed 8
 Changed LSAs. Recorded is LS ID and LS type:
 92.0.0.7(R) 92.0.0.4(R) 92.0.0.6(R) 92.0.0.11(R)
 92.0.0.12(R) 92.0.0.11(R) 92.0.0.7(R) 92.0.0.6(R)

Before continuing to OSPFv3, we will configure all of these OSPFv2 timers on XRv1 for completeness. Some timer commands, such as the retransmission pacing timer, do not appear to exist. The retransmission pacing interval appears to be set to twice the pacing flood interval, which is logical behavior. Incremental SPF does not appear to be supported, either.
! XRv1
router ospf 92
 timers throttle lsa all 50 2000 5000
 timers throttle spf 100 2000 10000
 timers lsa group-pacing 360
 timers lsa min-arrival 150
 timers lsa refresh 2400
 timers pacing flood 50

RP/0/0/CPU0:XRv1#show ospf | include sec
 Initial SPF schedule delay 100 msecs
 Minimum hold time between two consecutive SPFs 2000 msecs
 Maximum wait time between two consecutive SPFs 10000 msecs
 Initial LSA throttle delay 50 msecs
 Minimum hold time for LSA throttle 2000 msecs
 Maximum wait time for LSA throttle 5000 msecs
 Minimum LSA interval 2000 msecs. Minimum LSA arrival 150 msecs
 LSA refresh interval 2400 seconds
 Flood pacing interval 50 msecs. Retransmission pacing interval 100 msecs


XR also shows the OSPF statistics for SPF, as well as many other features (TE, interface, and other details). XR also gives you a summary of the reason codes, which is very handy. If you forget the XE codes, you can use this command on XR since the codes are identical across platforms.
RP/0/0/CPU0:XRv1#show ospf statistics spf
 SPF statistics for OSPF 92
 Reason Codes: R - Router-LSA, N - Network-LSA,
 SN - Summary-LSA (IP network), SA - Summary-LSA (ASBR),
 X - AS-external-LSA
 Last 40 Dijkstra Calculations
 Delta T   Area  Runtime  Reason
 1d00h     0     0        R,
 1d00h     0     0        R,
 1d00h     0     0        R,
 23:57:16  0     0        R,
 23:57:14  0     0        R,
 21:04:49  0     0        R,
 21:04:48  0     0        R,
 21:04:46  0     0        R,
 21:03:56  0     0        R,
 [snip]

OSPFv3 is very similar to OSPFv2 in terms of how SPF is performed and the timers that are available. Because the original LDP lab did not have OSPFv3 configured, we will configure it quickly on all OSPFv2 routers using the same area boundaries. The configuration is not shown, but we can quickly verify that the loopbacks are all advertised into OSPFv3. Checking the intra and inter-area prefixes, we can see CSR5 and XRv3 are inter-area, while all other loopbacks are in area 0 from XRv2’s perspective. CSR6 also originates a default route if the IS-IS IPv6 default route is present, which is an external route.
RP/0/0/CPU0:XRv2#show ospfv3 database prefix | include ::92
 Prefix Address: ::92:0:0:4
 Prefix Address: ::92:0:0:6
 Prefix Address: ::92:0:0:7
 Prefix Address: ::92:0:0:11
 Prefix Address: ::92:0:0:12

RP/0/0/CPU0:XRv2#show ospfv3 database inter-area prefix | include ::92
 Prefix Address: ::92:0:0:5
 Prefix Address: ::92:0:0:13
 Prefix Address: ::92:0:0:13
 Prefix Address: ::92:0:0:5
 Prefix Address: ::92:0:0:5
 Prefix Address: ::92:0:0:13

R7#show ospfv3 database external ::/0
 OSPFv3 92 address-family ipv6 (router-id 92.0.0.7)
 Type-5 AS External Link States
 LS age: 367
 LS Type: AS External Link
 Link State ID: 0
 Advertising Router: 92.0.0.6
 LS Seq Number: 80000001
 Checksum: 0x70E3
 Length: 32
 Prefix Address: ::
 Prefix Length: 0, Options: None
 Metric Type: 2 (Larger than any link state path)
 Metric: 1
 External Route Tag: 92

The convergence timers and queue-depth adjustments in OSPFv3 work identically to those in OSPFv2, so they are not evaluated again. We can literally copy/paste the OSPFv2 configuration on CSR7 into OSPFv3 and it works the same. We verify it with a similar show command. The only convergence feature not also supported in OSPFv3 is incremental SPF.
! CSR7
router ospfv3 92
 address-family ipv6 unicast
  timers throttle spf 500 1500 12000
  timers throttle lsa 50 100 500
  timers lsa arrival 500
  timers pacing lsa-group 120
  timers pacing flood 75
  timers pacing retransmission 150
  queue-depth hello 64
  queue-depth update unlimited

R7#show ospfv3 92 ipv6 | include sec
 Initial SPF schedule delay 500 msecs
 Minimum hold time between two consecutive SPFs 1500 msecs
 Maximum wait time between two consecutive SPFs 12000 msecs
 Initial LSA throttle delay 50 msecs
 Minimum hold time for LSA throttle 100 msecs
 Maximum wait time for LSA throttle 500 msecs
 Minimum LSA arrival 500 msecs
 LSA group pacing timer 120 secs
 Interface flood pacing timer 75 msecs
 Retransmission pacing timer 150 msecs


One significant improvement in OSPFv3, however, is the intra-area prefix LSA (type 9). OSPFv3 intelligently decouples the topology information from the IP prefix information into two separate LSAs. Legitimate topology changes, such as failed intra-area links or nodes, will trigger an SPF run (LSA1). Minor changes to prefixes within that area (LSA9) should not, which mirrors the IS-IS behavior. The OSPFv3 SPF statistics table is very similar to the OSPFv2 table with some new entries. Only the new entries are defined below for brevity.

SPT: Since IP prefixes are decoupled from the tree construction process, this is a new field that represents the time it takes to process all LSA1s and to find shortest paths to each node, measured in ms. This does not include interaction with stub networks (IPv4/v6 prefixes).

Prefix: This is the time it takes to process all stub networks (IPv4/v6 prefixes) and install them in the RIB, measured in ms. In other words, this processes the LSA9s.

We also see new reason codes of “P” and “L”. The “P” code is for “prefix” and represents a partial SPF recalculation when only LSA9 (prefix) information is changed. The “L” code is for “link” and represents changes to the LSA8 Link LSA, which is not discussed here.
R7#show ospfv3 92 statistic
 OSPFv3 92 address-family ipv6 (router-id 92.0.0.7)
 Area 0: SPF algorithm executed 8 times
 Area 1: SPF algorithm executed 5 times
 SPF calculation time
 Delta T   SPT  Prefix  D-Int  Sum  D-Sum  Ext  D-Ext  Total  Reason
 00:23:35  0    0       0      0    0      0    0      0      R N SN SA X L P
 00:23:26  0    0       0      0    0      0    0      0      R N SN SA X L P
 00:19:40  0    0       0      0    0      0    0      0      R X L P
 00:19:30  0    0       0      0    0      0    0      0      R SN X
 00:18:53  0    0       0      0    0      0    0      0      R P
 00:18:21  0    0       0      0    0      0    0      0      P
 00:18:11  0    0       0      0    0      0    0      0      R N SN SA L P
 00:16:48  0    0       0      0    0      0    0      0      R N SN X L P
 00:16:38  0    0       0      0    0      0    0      0      N SN X P
 00:16:19  0    0       0      1    0      0    0      1      P

If we clear the OSPFv3 process on CSR7, we can see many subsequent SPF runs where there were many changes. These definitely qualify as full SPF runs, since there are many changes to LSA1 and LSA2.
R7#show ospfv3 92 statistic
 [snip]
 00:20:35  0    0       0      0    0      0    0      0      R N SN X L P
 00:20:25  0    0       0      0    0      0    0      0      N SN X P
 00:20:06  0    0       0      1    0      0    0      1      P
 00:00:16  0    0       0      0    0      0    0      0      R N SN SA X L P
 00:00:14  0    0       0      0    0      0    0      0      R N SN X L P
 00:00:11  0    0       0      0    0      0    0      0      R SN X L
 00:00:05  0    0       0      0    0      0    0      0      R SN X P

Changing the IGP cost on a stub network, or perhaps adding/deleting the network entirely, registers only as a “P” to indicate a prefix-based recalculation, which is lightweight. In this example, I add a new loopback to CSR6 while debugging on CSR7. This is one of the most significant improvements of OSPFv3 over OSPFv2.
R7#debug ospfv3 spf intra
 OSPFv3 SPF intra debugging is on for process 92, IPv6, Default vrf

! CSR6
interface Loopback66
 ipv6 address ::92:0:0:66/128
 ospfv3 92 ipv6 area 0

! CSR7
OSPFv3-92-IPv6 MON : Begin SPF at 555747.351, process time 38ms
OSPFv3-92-IPv6 INTRA: Running SPF for area 0, cause P
OSPFv3-92-IPv6 INTRA: Starting Intra-Area SPF (Prefix)
OSPFv3-92-IPv6 INTRA: Process Prefix LSAs

R7#show ospfv3 92 statistic
 [snip]
 00:05:14  0    0       0      0    0      0    0      0      R N SN SA X L P
 00:05:13  0    0       0      0    0      0    0      0      R N SN X L P
 00:05:10  0    0       0      0    0      0    0      0      R SN X L
 00:05:04  0    0       0      0    0      0    0      0      R SN X P
 00:00:42  0    0       0      0    0      0    0      0      P
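The distinction between a topology-driven run and a prefix-only run can be expressed as a simple classification of the Reason column. This is a rough sketch, not how IOS XE decides internally: it assumes that “R” or “N” implies the shortest-path tree had to be recomputed, that a lone “P” is the lightweight prefix-only case described above, and that anything else is a partial run. The function name and labels are hypothetical.

```python
# Assumption: R (router LSA) and N (network LSA) are the topology-affecting codes.
TOPOLOGY_CODES = {"R", "N"}


def classify_run(reason):
    """Classify a space-separated OSPFv3 Reason field from the statistics table."""
    codes = set(reason.split())
    if codes & TOPOLOGY_CODES:
        return "full"
    if codes == {"P"}:
        return "prefix-only"
    return "partial"


if __name__ == "__main__":
    # Reason values as they appear after the new loopback was added on CSR6.
    rows = ["R N SN SA X L P", "R N SN X L P", "R SN X L", "R SN X P", "P"]
    for row in rows:
        print(f"{row:18} -> {classify_run(row)}")
```

Only the final row, recorded after the loopback was added, falls into the prefix-only category; the earlier rows all include “R” and stem from clearing the process.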

This is also true for external routes. Since CSR6 is originating an IPv6 default only when it learns one from IS-IS, we can shut down CSR3’s link to the level-1 domain (not shown). This ultimately causes CSR6 to withdraw the external LSA from area 0; with debugging on CSR7, we can see this SPF get scheduled as well.
! CSR7
OSPFv3-92-IPv6 SPF : Schedule partial SPF for LSA 4005/0/92.0.0.6

R7#show ospfv3 92 statistic
 [snip]
 00:15:53  0    0       0      0    0      0    0      0      R SN X P
 00:11:32  0    0       0      0    0      0    0      0      P
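The “4005” in the debug line is the OSPFv3 LS type in hexadecimal. RFC 5340 splits this 16-bit field into a U-bit, two flooding-scope bits, and a 13-bit function code. The sketch below decodes it; the assumption that the debug key is formatted as LS type / link-state ID / advertising router is inferred from the output above rather than taken from Cisco documentation, and the function names are hypothetical.

```python
# Flooding scope (S2 S1 bits) and function codes as defined in RFC 5340.
SCOPES = {0b00: "link-local", 0b01: "area", 0b10: "AS", 0b11: "reserved"}
FUNCTION_CODES = {
    1: "Router-LSA",
    2: "Network-LSA",
    3: "Inter-Area-Prefix-LSA",
    4: "Inter-Area-Router-LSA",
    5: "AS-External-LSA",
    7: "NSSA-LSA",
    8: "Link-LSA",
    9: "Intra-Area-Prefix-LSA",
}


def decode_ls_type(ls_type):
    """Split a 16-bit OSPFv3 LS type into (U-bit, scope name, LSA name)."""
    u_bit = (ls_type >> 15) & 0x1
    scope = (ls_type >> 13) & 0x3
    function = ls_type & 0x1FFF
    return u_bit, SCOPES[scope], FUNCTION_CODES.get(function, "unknown")


def parse_debug_key(key):
    """Parse 'type/lsid/adv-router' (assumed format of the debug output)."""
    ls_type_hex, link_state_id, adv_router = key.split("/")
    return decode_ls_type(int(ls_type_hex, 16)), int(link_state_id), adv_router


if __name__ == "__main__":
    print(parse_debug_key("4005/0/92.0.0.6"))
    print(decode_ls_type(0x2009))
```

Decoding 0x4005 yields an AS-scoped AS-External-LSA, which matches the default route CSR6 withdraws, while 0x2009 is the area-scoped intra-area prefix LSA behind the “P” reason code.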

Unfortunately, the SPF statistics table doesn’t show this as reason “X”. In fact, it doesn’t show any new activity at all, although we know it happened. When we bring the link on CSR3 back up (not shown), SPF runs again, but the statistics still don’t show it. Personally, I find this SPF statistics command to be of limited value because there are other cases where it records SPF runs only selectively. I assume this is a cosmetic issue because the fundamental IPv6 OSPFv3 routing is fully functional.
! CSR7
OSPFv3-92-IPv6 SPF : Schedule partial SPF for LSA 4005/1/92.0.0.6

R7#show ospfv3 92 statistic
 [snip]
 00:16:27  0    0       0      0    0      0    0      0      R SN X P
 00:12:05  0    0       0      0    0      0    0      0      P

XR has similar output, and since this intra-area prefix calculation is an OSPFv3 feature (not a Cisco-specific enhancement), XR also shows “PFX” as a reason to run partial SPF inside an area. In this example, the SPF run was actually a full run since we see the “R” and “N” codes also, but it illustrates that XR uses “PFX” and not “P” to indicate a prefix-based SPF run. “ASE” stands for AS external, which is equivalent to “X” on XE.
RP/0/0/CPU0:XRv1#show ospfv3 92 statistics spf
 OSPFv3 router 92.0.0.11 (Process 92)
 OSPFv3 SPF statistics summary
 SPF algorithm executed 14 times
 Area 0 executed 14 times
 Area 1 executed 14 times
 SPF calculation times (in msec)
 Delta T Dijkstra Intra D-Intra
 00:28:10 0 0 0
 00:28:10 0 0 0
 00:28:10 0 0 0
 00:28:09 0 0 0
 [snip]

Inter D-Inter 0 0 0 0 0 0 0 0

Ext 0 0 0 0

D-Ext 0 0 0 0

Reason R R N R N PFX ASE

Regarding OSPFv3 timers on XR, they are very similar to the OSPFv2 timers on XR. Some of them aren’t supported (LSA refresh, etc.) but the syntax is otherwise similar, if not identical. They have the same effect as the OSPFv2 timers and are not tested in detail.
! XRv1
router ospfv3 92
 timers lsa arrival 150
 timers pacing flood 50
 timers pacing lsa-group 360
 timers throttle lsa all 50 2000 5000
 timers throttle spf 100 2000 10000

38. Describe, implement, and troubleshoot multi-VRF CE and advanced VRF techniques

This entire section uses a large topology with an MPLS core and several customer VRFs. Every combination of XE and XR serving in PE and CE roles is represented. VRF-Lite for all protocols is tested on both XE and XR (where supported). Backdoor links for OSPF, EIGRP, and BGP are also examined, along with many other VRF features. The network diagram is below. CSR10 and CSR1 are meant to be generalized “central services” sites so their RTs can be imported/exported as needed for testing. CSR2 and CSR9 are used for traffic leaking and are not part of the MPLS-specific testing. Note that this section is very configuration intensive with the vast majority of it being highly repetitive. Not all configurations will be shown “in line” with the documentation as a result.

Additional Reading - Reference configurations "vrf"

38.1 Multi-VRF CE (VRF-Lite)

This section merges all of the PE-CE routing protocols together with VRF-lite on the CEs. The PE-CE routing techniques are generally the same whether the CE uses VRF-lite or not, but for brevity they are combined here with a single analysis.

38.1.1 Basic VRF-Lite

VRF-Lite is not a difficult feature. Just like a VLAN represents a virtual switch, a VRF represents a virtual router. Unless manually configured, traffic/routes do not leak between VRFs, which provides good software isolation between tables. This is especially useful for modeling larger networks from a single router. VRFs were initially designed for MPLS L3VPN PE routers so that customer routes could be kept isolated from other customers; for L3VPN, the main motivation was overlapping address space within different customer networks. VRF-Lite removes the MPLS overtone and simply segments a router into SVRs, as discussed earlier. The concept of RD is only relevant when the VRF needs to participate in any kind of BGP routing. XR correctly places the RD configuration under BGP because it only has relevance to BGP. In XE, we don’t need to define an RD under the VRF unless BGP is used. RTs serve no purpose for VRF-Lite (assuming no route leaking) and can be skipped for now. Notice that the non-BGP related VRFs don’t have RDs configured. We still have to initialize each AFI we want the VRF to support; this causes the router to allocate the appropriate RIB/FIB constructs for the specified AFIs.
! Basic XE VRF-Lite
vrf definition BGP
 rd 214:104
 address-family ipv4
 address-family ipv6
vrf definition EIGRP
 address-family ipv4
 address-family ipv6

! Basic XR VRF-Lite
vrf BGP
 address-family ipv4 unicast
 address-family ipv6 unicast
vrf RIP
 address-family ipv4 unicast
 address-family ipv6 unicast
router bgp 104
 vrf BGP
  rd 214:104

Verifying the configuration is very straightforward. Since there are no advanced VRF features activated (yet), we can use the XE summary commands to verify the VRFs. The only real difference between any pair of VRFs might be the presence or absence of an RD. We can also see the AFIs and interfaces.
R5#show vrf ISIS
 Name  Default RD  Protocols  Interfaces
 ISIS              ipv4,ipv6  Gi2.103
                              Lo103

R5#show vrf BGP
 Name  Default RD  Protocols  Interfaces
 BGP   214:104     ipv4,ipv6  Gi2.104
                              Lo104
                              Gi2.204

In XR, the interfaces are not shown in summary form, but are revealed in the VRF details. We can verify the RD, interface, and AFIs. Not seeing IPv6 enabled under the first VRF, for example, would be a good troubleshooting reason why something like OSPFv3 was not working.
RP/0/0/CPU0:XRv3#show vrf OSPF detail | exclude No
 VRF OSPF; RD not set; VPN ID not set
 VRF mode: Regular
 Description not set
 Interfaces:
  GigabitEthernet0/0/0/0.101
  Loopback101
  GigabitEthernet0/0/0/0.201
 Address family IPV4 Unicast
 Address family IPV6 Unicast

RP/0/0/CPU0:XRv3#show vrf BGP detail | exclude No
 VRF BGP; RD 214:104; VPN ID not set
 VRF mode: Regular
 Description not set
 Interfaces:
  GigabitEthernet0/0/0/0.104
  Loopback104
  GigabitEthernet0/0/0/0.204
 Address family IPV4 Unicast
 Address family IPV6 Unicast

The show command syntax in both platforms changes when introducing VRFs. In XR, the syntax is very deterministic, generally in the format of “show (process) [vrf (vrf-name)] [afi] (feature)” where the VRF and AFI fields are optional. Below are some examples.
RP/0/0/CPU0:XRv3#show ipv4 vrf BGP interface brief loopback104
 Interface     IP-Address     Status  Protocol
 Loopback104   192.104.13.13  Up      Up

RP/0/0/CPU0:XRv3#show eigrp vrf EIGRP ipv6 neighbors gig0/0/0/0.102
 IPv6-EIGRP VR(VPN) Neighbors for AS(102) VRF EIGRP
 H  Address              Interface      Hold   Uptime    SRTT  RTO  Q    Seq
                                        (sec)            (ms)       Cnt  Num
 0  Link Local Address:  Gi0/0/0/0.102  14     12:43:50  16    200  0    27
    fe80::11

RP/0/0/CPU0:XRv3#show ospf vrf OSPF our-address
 Local Address Database for ospf 101, VRF OSPF with ID 0.0.101.13
 192.5.101.13   GigabitEthernet0/0/0/0.201
 192.11.101.13  GigabitEthernet0/0/0/0.101
 192.101.13.13  Loopback101
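The XR pattern is regular enough to capture in a few lines. The following is a minimal sketch of the “show (process) [vrf (vrf-name)] [afi] (feature)” template; the helper name and its parameters are hypothetical, and it only assembles strings rather than validating them against any real parser.

```python
def xr_show(process, feature, vrf=None, afi=None):
    """Assemble an XR show command following the general template above."""
    parts = ["show", process]
    if vrf:
        parts += ["vrf", vrf]      # optional VRF field
    if afi:
        parts.append(afi)          # optional address-family field
    parts.append(feature)
    return " ".join(parts)


if __name__ == "__main__":
    print(xr_show("ipv4", "interface brief loopback104", vrf="BGP"))
    print(xr_show("eigrp", "neighbors gig0/0/0/0.102", vrf="EIGRP", afi="ipv6"))
    print(xr_show("ospf", "our-address", vrf="OSPF"))
```

The three calls reproduce the three XR commands shown above, which is the point of the template: the VRF and AFI always sit in the same position regardless of the protocol.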

The XE syntax also includes the “vrf” keyword for many show commands, but is less deterministic/consistent. For example, the OSPFv2 show commands never reference a VRF, just the process ID, which might be mapped to a VRF. The EIGRP show commands reference the VRF after the AFI, which is backwards compared to XR. The RIP command contains the process and feature before specifying the VRF, which is also inconsistent with the XE EIGRP and all XR commands. In short, the XE commands are inconsistent, which is one of the reasons XR was designed to be simple.
R5#show ip ospf 101 interface brief
 Interface  PID  Area  IP Address/Mask  Cost  State  Nbrs F/C
 Lo101      101  0     192.101.5.5/32   1     LOOP   0/0
 Gi2.201    101  0     192.5.101.5/24   1     P2P    1/1
 Gi2.101    101  0     192.12.101.5/24  1     P2P    1/1

R5#show eigrp address-family ipv4 vrf EIGRP neighbors gig2.102
 EIGRP-IPv4 VR(VPN) Address-Family Neighbors for AS(102) VRF(EIGRP)
 H  Address        Interface  Hold   Uptime    SRTT  RTO  Q    Seq
                              (sec)            (ms)       Cnt  Num
 0  192.12.102.12  Gi2.102    10     13:31:27  15    100  0    19

R5#show ip rip database vrf RIP 192.106.14.14 255.255.255.255
 192.106.14.14/32
  [2] via 192.12.106.12, 00:00:12, GigabitEthernet2.106

Interfaces are assigned to VRFs by putting the VRF under that interface. The XR VRF interface syntax is identical to the VRF definition and BGP VRF configuration; it is very consistent. XE uses different commands for each one. Changing the VRF of an interface on XE will remove all IPv4/v6 addresses from that interface. To change it on XR, you must remove all IPv4/v6 addresses manually, remove/change the VRF, then add the IPv4/v6 addresses back along with the new VRF (if desired).
! XRv3
interface Loopback101
 vrf OSPF
 ipv4 address 192.101.13.13 255.255.255.255
 ipv6 address ::192:101:13:13/128

! CSR5
interface Loopback101
 vrf forwarding OSPF
 ip address 192.101.5.5 255.255.255.255
 ipv6 address ::192:101:5:5/128

38.1.2 OSPF and sham-links

Both OSPFv2 and OSPFv3 can be used on both XE and XR as a PE-CE routing protocol. In a traditional environment where the CE router is not running a VRF, the PE router would run the OSPF process inside of the customer VRF while the CE would run OSPF in the global routing table. Using VRF-Lite on the CE, both sides will be running it in a VRF. Because the configurations for PE and CE are very similar up to this point, only one of each is shown. CSR5 is a CE and XRv2 is a PE. OSPFv3 instance numbers are shown here just for variety and have nothing to do with VRF-Lite; the value is link-local and must match on a segment for a neighbor to form. This is true whether OSPFv3 is inside a VRF or the global table.
! CSR5
interface GigabitEthernet2.101
 vrf forwarding OSPF
 ip ospf network point-to-point
 ip ospf 101 area 0
 ospfv3 network point-to-point
 ospfv3 101 ipv6 area 0 instance 12
router ospfv3 101
 address-family ipv6 unicast vrf OSPF
router ospf 101 vrf OSPF

! XRv2
interface GigabitEthernet0/0/0/0.101
 vrf OSPF
router ospf 101
 router-id 0.0.101.12
 vrf OSPF
  area 0
   interface GigabitEthernet0/0/0/0.101
    network point-to-point
router ospfv3 101
 router-id 0.0.101.12
 vrf OSPF
  area 0
   interface GigabitEthernet0/0/0/0.101
    network point-to-point
    instance 12

Assuming we replicate these configurations appropriately on all 8 devices (all PEs and CEs), we should have OSPFv2/v3 neighbors everywhere. For brevity, we will verify this from the PE side only on the 4 PE routers. Notice that we do not need to specify an AFI for XR’s OSPFv3; the IPv4 AFI is not supported for OSPFv3 on XR, so IPv6 is assumed. We see 4 OSPFv2 and 4 OSPFv3 neighbors, so we can assume that OSPFv2/v3 is working properly.
R3#show ip ospf 101 neighbor | include FULL
 0.0.101.14    0  FULL/  00:00:38  192.3.101.14  Gig2.101

R3#show ospfv3 vrf OSPF ipv6 neighbor | include FULL
 0.0.101.14    0  FULL/  00:00:30  8             Gig2.101

R4#show ip ospf 101 neighbor | include FULL
 192.4.101.6   0  FULL/  00:00:31  192.4.101.6   Gig2.101

R4#show ospfv3 vrf OSPF ipv6 neighbor | include FULL
 192.4.101.6   0  FULL/  00:00:37  17            Gig2.101

RP/0/0/CPU0:XRv1#show ospf vrf OSPF neighbor | include FULL
 0.0.101.13    1  FULL/  00:00:30  192.11.101.13  Gig0/0/0/0.101

RP/0/0/CPU0:XRv1#show ospfv3 vrf OSPF neighbor | include FULL
 0.0.101.13    1  FULL/  00:00:36  13             Gig0/0/0/0.101

RP/0/0/CPU0:XRv2#show ospf vrf OSPF neighbor | include FULL
 192.12.101.6  1  FULL/  00:00:36  192.12.101.5   Gig0/0/0/0.101

RP/0/0/CPU0:XRv2#show ospfv3 vrf OSPF neighbor | include FULL
 192.12.101.6  1  FULL/  00:00:37  12             Gig0/0/0/0.101

Next, we will quickly check to see if each PE learned OSPFv2/v3 routes from the CE. Each CE has a loopback with a single route. We will check CSR3 and XRv1 only. CSR3 learns the IPv4 and IPv6 routes from XRv4, and XRv1 does the same from XRv3.
R3#show ip route vrf OSPF ospf | begin Gate
 Gateway of last resort is not set
 O 192.101.14.14 [110/2] via 192.3.101.14, 14:09:56, GigabitEthernet2.101

R3#show ipv6 route vrf OSPF ospf | begin Application
 a - Application
 O ::192:101:14:14/128 [110/1]
    via FE80::14, GigabitEthernet2.101

RP/0/0/CPU0:XRv1#show route vrf OSPF ospf
 O 192.101.13.13/32 [110/2] via 192.11.101.13, 14:14:37, GigabitEthernet0/0/0/0.101

RP/0/0/CPU0:XRv1#show route vrf OSPF ipv6 ospf
 O ::192:101:13:13/128
    [110/1] via fe80::13, 14:14:19, GigabitEthernet0/0/0/0.101

Next, we must configure bidirectional redistribution between OSPF and BGP. OSPF to BGP allows customer routes to be advertised via VPNv4/v6 (and have labels allocated) to remote BGP peers. Redistributing from BGP into OSPF allows the received BGP routes to be learned by the customer. We will show the configuration on CSR3 and XRv1 only. The connected redistribution on OSPFv3 isn’t a requirement, but it is included here because IOS platforms automatically exclude connected routes when redistributing IPv6 protocols.
! CSR3
router bgp 214
 address-family ipv4 vrf OSPF
  redistribute ospf 101
 address-family ipv6 vrf OSPF
  redistribute ospf 101 include-connected
router ospfv3 101
 address-family ipv6 unicast vrf OSPF
  redistribute bgp 214
router ospf 101 vrf OSPF
 redistribute bgp 214 subnets

! XRv1
router bgp 214
 vrf OSPF
  rd 214:101
  address-family ipv4 unicast
   redistribute ospf 101
  address-family ipv6 unicast
   redistribute ospfv3 101
router ospf 101
 vrf OSPF
  redistribute bgp 214
router ospfv3 101
 vrf OSPF
  redistribute bgp 214

At this point, the PEs should have the locally redistributed routes inside their BGP tables, as well as the remote iBGP routes from the remote peer. Both PEs can see the loopbacks of XRv3 and XRv4, the CEs, within the BGP table. This verifies that the OSPF to BGP redistribution was successful at both ends.
R3#show bgp vpnv4 unicast vrf OSPF | include 1[34]\.1[34]
 *>i 192.101.13.13/32  214.0.0.11    2  100  0      ?
 *>  192.101.14.14/32  192.3.101.14  2       32768  ?

RP/0/0/CPU0:XRv1#show bgp vpnv4 unicast vrf OSPF | include 1[34]\.1[34]
 *> 192.101.13.13/32  192.11.101.13  2       32768  ?
 *>i192.101.14.14/32  214.0.0.3      2  100  0      ?

R3#show bgp vpnv6 unicast vrf OSPF | include 1[34]\:1[34]
 *>i ::192:101:13:13/128
 *>  ::192:101:14:14/128

RP/0/0/CPU0:XRv1#show bgp vpnv6 unicast vrf OSPF | include 1[34]\:1[34]
 *> ::192:101:13:13/128
 *>i::192:101:14:14/128

To verify that the BGP to OSPF redistribution worked, we can check the local OSPF LSDB on each PE to look for the Type 3 (for IPv6) or Type 5 LSA (for IPv4). The reason for the discrepancy is discussed later. For now, we can see that the redistribution worked.
R3#show ip ospf 101 database | include Link_States|13\.13
 Router Link States (Area 0)
 Summary Net Link States (Area 0)
 Type-5 AS External Link States
 192.101.13.13 192.3.101.3 896 0x8000001A 0x007D87 3489661142

RP/0/0/CPU0:XRv1#show ospf vrf OSPF database | utility egrep 'Link States|14\.14'
 Router Link States (Area 0)
 Summary Net Link States (Area 0)
 Type-5 AS External Link States
 192.101.14.14 0.0.101.11 941 0x8000001a 0x0017a7 3489661142

R3#show ospfv3 vrf OSPF ipv6 database | include Link_States|13:13
 Router Link States (Area 0)
 Inter Area Prefix Link States (Area 0)
 192.3.101.3 1092 0x8000001A ::192:101:13:13/128
 Link (Type-8) Link States (Area 0)
 Intra Area Prefix Link States (Area 0)

RP/0/0/CPU0:XRv1#show ospfv3 vrf OSPF database | utility egrep 'Link States|14\.14'
 Router Link States (Area 0)
 Inter Area Prefix Link States (Area 0)
 0.0.101.11 1002 0x8000001a ::192:101:14:14/128
 Link (Type-8) Link States (Area 0)
 Intra Area Prefix Link States (Area 0)

At this point, one would assume there is full reachability between sites. The tricky part about using OSPF with VRF-Lite on the CEs is the notion of the down bit (DN bit). The PEs will set this bit on LSA3 and LSA5 routes redistributed from BGP which help control the “flow” of the LSA. The idea is that when backdoor links exist, the LSA originated at one PE is flooded to a second PE. That second PE should never install this route as it could create suboptimal routing at best, or a loop at worst. Examining a remote prefix (redistributed from BGP to OSPF locally) on CSR3, we can see the “downward” bit in the LSA options field. IOS routers only recently started setting the DN-bit in their LSA5 as it used to only happen for LSA3. The older mechanism for LSA5 loop-prevention was to use route tags, which still happens for backwards compatibility. Converting the route tag below to hex yields 0xD00000D6, and converting the last 2 bytes back into decimal yields the BGP AS number, which is 214. The output is almost identical for XRv1, as the DN-bit is set and the route-tag represents BGP AS 214. R3#show ip ospf 101 database external 192.101.13.13 OSPF Router with ID (192.3.101.3) (Process ID 101)

2202 © 2016 Nicholas J. Russo

Type-5 AS External Link States
LS age: 1857
Options: (No TOS-capability, DC, Downward)
LS Type: AS External Link
Link State ID: 192.101.13.13 (External Network Number )
Advertising Router: 192.3.101.3
LS Seq Number: 8000001A
Checksum: 0x7D87
Length: 36
Network Mask: /32
Metric Type: 2 (Larger than any link state path)
MTID: 0
Metric: 2
Forward Address: 0.0.0.0
External Route Tag: 3489661142

RP/0/0/CPU0:XRv1#show ospf vrf OSPF database external 192.101.14.14
OSPF Router with ID (0.0.101.11) (Process ID 101, VRF OSPF)
Type-5 AS External Link States
LS age: 76
Options: (No TOS-capability, DC, DN)
LS Type: AS External Link
Link State ID: 192.101.14.14 (External Network Number)
Advertising Router: 0.0.101.11
LS Seq Number: 8000001b
Checksum: 0x15a8
Length: 36
Network Mask: /32
Metric Type: 2 (Larger than any link state path)
TOS: 0
Metric: 2
Forward Address: 0.0.0.0
External Route Tag: 3489661142
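The route-tag arithmetic described above can be checked with a few lines of Python. This is only a sketch of the conversion the text performs by hand; the function name decode_vpn_route_tag is hypothetical and the only assumption is the one stated in the text, namely that the low-order 2 bytes of the tag hold the BGP AS number.

```python
def decode_vpn_route_tag(tag: int) -> dict:
    """Split a 32-bit OSPF external route tag the way the text does by hand."""
    if not 0 <= tag <= 0xFFFFFFFF:
        raise ValueError("route tag must fit in 32 bits")
    return {
        "hex": f"0x{tag:08X}",               # full tag in hex
        "high_16_bits": f"0x{tag >> 16:04X}",  # flag bits, not interpreted here
        "as_number": tag & 0xFFFF,           # low-order 2 bytes (assumed to be the BGP AS)
    }


info = decode_vpn_route_tag(3489661142)
print(info["hex"], info["as_number"])
```

Running it against the tag shown in the LSA output reproduces the hex value and AS number quoted in the text.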

We quickly check the OSPFv3 LSA3 details to see the DN-bit set on them as well. The same logic applies for any OSPF AFI.

R3#show ospfv3 vrf OSPF ipv6 database inter-area prefix ::192:101:13:13/128
OSPFv3 101 address-family ipv6 vrf OSPF (router-id 192.3.101.3)
LS age: 1133
LS Type: Inter Area Prefix Links
Link State ID: 4
Advertising Router: 192.3.101.3
LS Seq Number: 8000001B
Checksum: 0x2A6D
Length: 44
Metric: 1
Prefix Address: ::192:101:13:13
Prefix Length: 128, Options: DN

RP/0/0/CPU0:XRv1#show ospfv3 vrf OSPF database inter-area prefix ::192:101:14:14
OSPFv3 Router with ID (0.0.101.11) (Process ID 101 VRF OSPF)
Inter Area Prefix Link States (Area 0)
LS age: 850
LS Type: Inter Area Prefix Links
Link State ID: 6
Advertising Router: 0.0.101.11
LS Seq Number: 8000001b
Checksum: 0xf658
Length: 44
Metric: 1
Prefix Address: ::192:101:14:14
Prefix Length: 128, Options: DN , Priority: Critical

If a router is running OSPF inside of a VRF, it automatically assumes that it is a PE for an MPLS network. On CSR6, we can see this by looking at the OSPF summary information for OSPFv2 and v3. This output isn’t available on XR, but the assumption is the same, as XRv3 and XRv4 also think they are MPLS PEs. These VRF-Lite CE routers clearly are not MPLS PEs and are not connected to the MPLS superbackbone.

R6#show ip ospf 101 | include VPN
Connected to MPLS VPN Superbackbone, VRF OSPF
R6#show ospfv3 vrf OSPF ipv6 | include VPN
Connected to MPLS VPN Superbackbone

The result of this is that all of the CE routers will learn these LSA3/LSA5 inside their LSDBs, but cannot install them in the routing table. Neither CSR6 nor XRv4 makes any mention of the routing bit being set on the LSA, which means the LSA is not a candidate for RIB installation.

R6#show ip ospf 101 database external 192.101.13.13
OSPF Router with ID (192.4.101.6) (Process ID 101)
Type-5 AS External Link States
LS age: 1511
Options: (No TOS-capability, DC, Downward)
LS Type: AS External Link
Link State ID: 192.101.13.13 (External Network Number )
Advertising Router: 192.4.101.4
LS Seq Number: 8000001B
Checksum: 0x6D94
Length: 36
Network Mask: /32
Metric Type: 2 (Larger than any link state path)
MTID: 0
Metric: 2
Forward Address: 0.0.0.0
External Route Tag: 3489661142

RP/0/0/CPU0:XRv4#show ospf vrf OSPF database external 192.101.13.13
OSPF Router with ID (0.0.101.14) (Process ID 101, VRF OSPF)
Type-5 AS External Link States
LS age: 673
Options: (No TOS-capability, DC, DN)
LS Type: AS External Link
Link State ID: 192.101.13.13 (External Network Number)
Advertising Router: 192.3.101.3
LS Seq Number: 8000001b
Checksum: 0x7b88
Length: 36
Network Mask: /32
Metric Type: 2 (Larger than any link state path)
TOS: 0
Metric: 2
Forward Address: 0.0.0.0
External Route Tag: 3489661142

The same is true for IPv6. CSR6 and XRv4 both learn the LSA3 but cannot install it in their RIB. They assume they are MPLS L3VPN PE routers since OSPF is inside a VRF, so installing this route could cause a loop.

RP/0/0/CPU0:XRv4#show ospfv3 vrf OSPF database inter-area prefix ::192:101:13:13/128
OSPFv3 Router with ID (0.0.101.14) (Process ID 101 VRF OSPF)
Inter Area Prefix Link States (Area 0)
LS age: 1952
LS Type: Inter Area Prefix Links
Link State ID: 4
Advertising Router: 192.3.101.3
LS Seq Number: 8000001b
Checksum: 0x2a6d
Length: 44
Metric: 1
Prefix Address: ::192:101:13:13
Prefix Length: 128, Options: DN , Priority: Critical

R6#show ospfv3 vrf OSPF ipv6 database inter-area prefix ::192:101:13:13/128
OSPFv3 101 address-family ipv6 vrf OSPF (router-id 192.101.6.6)
LS age: 1027
LS Type: Inter Area Prefix Links
Link State ID: 3
Advertising Router: 192.4.101.4
LS Seq Number: 8000001C
Checksum: 0x2471
Length: 44
Metric: 1
Prefix Address: ::192:101:13:13
Prefix Length: 128, Options: DN

To solve this problem, we need to notify OSPF that it is running inside of a VRF-Lite environment and to ignore the DN bit. The two essentially mean the same thing, and I phrase it this way because XE and XR have different syntaxes. XE is consistent using the “capability vrf-lite” syntax for both OSPFv2 and v3. XR is less consistent since it uses “capability vrf-lite” for OSPFv3 and “disable-dn-bit-check” for OSPFv2. On XE only, applying this command will reset the OSPFv2/v3 neighbors.

! CSR6
router ospfv3 101
 address-family ipv6 unicast vrf OSPF
  capability vrf-lite
router ospf 101 vrf OSPF
 capability vrf-lite
! XRv4
router ospf 101
 vrf OSPF
  disable-dn-bit-check
router ospfv3 101
 vrf OSPF
  capability vrf-lite
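The decision the CE routers are making can be summarized as a small truth table. The Python sketch below is purely illustrative and is not how IOS XE or IOS XR implement the check; the function and parameter names are hypothetical, and the logic is limited to what the text states: a router running OSPF in a VRF behaves as a PE and refuses DN-marked LSAs unless the VRF-Lite capability (or the DN-bit check override) is configured.

```python
def lsa_candidate_for_rib(dn_bit_set: bool, ospf_in_vrf: bool, vrf_lite: bool) -> bool:
    """Return True when an LSA3/LSA5 may be considered for RIB installation.

    vrf_lite stands in for either "capability vrf-lite" or
    "disable-dn-bit-check", which the text treats as equivalent.
    """
    # A router with OSPF inside a VRF assumes it is an MPLS PE
    # unless it has been told it is a VRF-Lite CE.
    acts_as_pe = ospf_in_vrf and not vrf_lite
    # A PE must not use an LSA that another PE marked with the DN bit.
    return not (dn_bit_set and acts_as_pe)


# CE in a VRF before and after the fix described in the text.
print(lsa_candidate_for_rib(dn_bit_set=True, ospf_in_vrf=True, vrf_lite=False))
print(lsa_candidate_for_rib(dn_bit_set=True, ospf_in_vrf=True, vrf_lite=True))
```

The first call models the CE before the fix (the LSA sits in the LSDB but is not usable) and the second models it afterwards.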

In most of the outputs seen below, the routing bit will now be set on the LSA3/LSA5 on the CE routers. The DN-bit is still being set by the PEs, but the CEs are ignoring it entirely. Oddly, the routing bit line does not appear in the XE LSA output at all, although older IOS versions did display it. XR clearly displays it, as shown below.

R6#show ip ospf 101 database external 192.101.13.13
OSPF Router with ID (192.4.101.6) (Process ID 101)
Type-5 AS External Link States
LS age: 1392
Options: (No TOS-capability, DC, Downward)
[snip]

R6#show ospfv3 vrf OSPF ipv6 database inter-area prefix ::192:101:13:13/128
OSPFv3 101 address-family ipv6 vrf OSPF (router-id 192.101.6.6)
LS age: 1644
LS Type: Inter Area Prefix Links
[snip]

RP/0/0/CPU0:XRv4#show ospf vrf OSPF database external 192.101.13.13
OSPF Router with ID (0.0.101.14) (Process ID 101, VRF OSPF)
Type-5 AS External Link States
Routing Bit Set on this LSA
LS age: 246
Options: (No TOS-capability, DC, DN)
LS Type: AS External Link
Link State ID: 192.101.13.13 (External Network Number)
[snip]
Network Mask: /32
[snip]

RP/0/0/CPU0:XRv4#show ospfv3 vrf OSPF database inter-area prefix ::192:101:13:13/128
OSPFv3 Router with ID (0.0.101.14) (Process ID 101 VRF OSPF)
Inter Area Prefix Link States (Area 0)
Routing Bit Set on this LSA
[snip]
Prefix Address: ::192:101:13:13
Prefix Length: 128, Options: DN , Priority: Critical

Quickly checking the routes on CSR6, we can see all the routes are installed to the remote loopbacks. Notice that some are external and some are inter-area. Specifically, the routes reachable via the XR PEs are external, and the route via the CSR PE is inter-area. These discrepancies are resolved later.

R6#show ip route vrf OSPF | include 192.101
192.101.5.0/32 is subnetted, 1 subnets
O E2 192.101.5.5 [110/2] via 192.4.101.4, 00:05:04, GigabitEthernet2.101
192.101.6.0/32 is subnetted, 1 subnets
C 192.101.6.6 is directly connected, Loopback101
192.101.13.0/32 is subnetted, 1 subnets
O E2 192.101.13.13 [110/2] via 192.4.101.4, 00:05:04, Gigabit2.101
192.101.14.0/32 is subnetted, 1 subnets
O IA 192.101.14.14 [110/3] via 192.4.101.4, 00:05:04, Gigabit2.101

Looking into the BGP routing information for VPNv4 on a PE, we see there are several new extended communities used by BGP when transporting OSPF routes inside of an L3VPN. Specifically, the routes from the XE PE have a domain-ID encoded in them. The domain-ID type is 0x0005, which is the default in IOS platforms, and the value 0x0065 converted to decimal is 101. This is where the OSPF PID has global significance; it is used by default to seed the domain-ID for an MPLS L3VPN. XR makes no assumptions and does not encode this extended community by default, so the routes from XRv1 and XRv2 carry no domain-ID. We can see the difference by comparing routes with different iBGP next-hops.

R4#show bgp vpnv4 unicast vrf OSPF 192.101.14.14
BGP routing table entry for 214:101:192.101.14.14/32, version 255
Paths: (2 available, best #2, table OSPF)
Not advertised to any peer
[snip]
Refresh Epoch 2
Local
214.0.0.3 (metric 10) (via default) from 214.0.0.3 (214.0.0.3)
Origin incomplete, metric 2, localpref 100, valid, internal, best
Extended Community: RT:214:101 OSPF DOMAIN ID:0x0005:0x000000650200 OSPF RT:0.0.0.0:2:0 OSPF ROUTER ID:192.3.101.3:0
mpls labels in/out nolabel/3015
rx pathid: 0, tx pathid: 0x0

R4#show bgp vpnv4 unicast vrf OSPF 192.101.13.13
BGP routing table entry for 214:101:192.101.13.13/32, version 127
Paths: (2 available, best #1, table OSPF)
Not advertised to any peer
Refresh Epoch 2
Local
214.0.0.11 (metric 30) (via default) from 214.0.0.3 (214.0.0.3)
Origin incomplete, metric 2, localpref 100, valid, internal, best
Extended Community: RT:214:101 OSPF RT:0.0.0.0:1:0 OSPF ROUTER ID:0.0.101.11:0
Originator: 214.0.0.11, Cluster list: 0.0.3.12
mpls labels in/out nolabel/91014
rx pathid: 0, tx pathid: 0x0
[snip]

When the domain-IDs match at remote ends of a VPN, they are considered to be in the same “AS” so to speak. The MPLS network acts like a “super backbone” as a third level of hierarchy greater than area 0. This allows area 0 networks to be bridged together and see one another as inter-area. When the domain-ID does not match, BGP makes no assumptions about this continuity and treats the routes as external when redistributing them. If one router has a domain-ID and the other does not, this is considered a non-match. We could set the domain-ID to be the same as we saw above on the routes from CSR3, but for variety, we will change it to a new value of 244 (0xF4) so that we can demonstrate the change on XE platforms as well. We quickly verify that the configuration is correct by checking the OSPF process details. For now, we will limit this change only to OSPFv2 processes on the PEs.

! CSR4
router ospf 101 vrf OSPF
 domain-id type 0005 value 000000F40200
! XRv1
router ospf 101
 vrf OSPF
  domain-id type 0005 value 000000F40200

R4#show ip ospf 101 | include Domain
Domain ID type 0x0005, value 0x000000F40200
RP/0/0/CPU0:XRv1#show ospf vrf OSPF | begin Domain
Primary Domain ID: 0x5:0x000000f40200
[snip]
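The hex arithmetic behind these domain-ID values is easy to get wrong by hand, so here is a minimal Python sketch of it. The helper names are hypothetical, and the byte layout is an assumption inferred only from the values shown in this section (two zero bytes, a 2-byte seed, then the trailing 0x0200 seen in every example, whose meaning the text does not explain).

```python
def build_domain_id_value(seed: int) -> str:
    """Return 12 hex digits laid out like the domain-ID values in the text."""
    if not 0 <= seed <= 0xFFFF:
        raise ValueError("seed must fit in 2 bytes")
    # 0000 | 2-byte seed | 0200 (trailing bytes copied from the outputs, unexplained)
    return f"0000{seed:04X}0200"


def seed_from_domain_id_value(value_hex: str) -> int:
    """Recover the 2-byte seed (for example the OSPF PID) from a 6-byte value."""
    digits = value_hex[2:] if value_hex.lower().startswith("0x") else value_hex
    raw = bytes.fromhex(digits)
    if len(raw) != 6:
        raise ValueError("domain-ID value must be exactly 6 bytes")
    return int.from_bytes(raw[2:4], "big")


print(seed_from_domain_id_value("0x000000650200"))  # default value seeded by XE
print(build_domain_id_value(244))                   # value configured in the text
```

The first call recovers the OSPF PID from the default XE value, and the second produces the string used with the domain-id command.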

Now, all routes learned by CSR4 will have the same domain-ID. Provided these extended communities match the local domain-ID (verified above), then the routes are considered inter-area.

R4#show bgp vpnv4 unicast vrf OSPF 192.101.13.13
BGP routing table entry for 214:101:192.101.13.13/32, version 324
Paths: (2 available, best #1, table OSPF)
Not advertised to any peer
Refresh Epoch 2
Local
214.0.0.11 (metric 30) (via default) from 214.0.0.3 (214.0.0.3)
Origin incomplete, metric 2, localpref 100, valid, internal, best
Extended Community: RT:214:101 OSPF DOMAIN ID:0x0005:0x000000F40200 OSPF RT:0.0.0.0:1:0 OSPF ROUTER ID:0.0.101.11:0
Originator: 214.0.0.11, Cluster list: 0.0.3.12
mpls labels in/out nolabel/91014
rx pathid: 0, tx pathid: 0x0
[snip]

R4#show bgp vpnv4 unicast vrf OSPF 192.101.14.14 bestpath
BGP routing table entry for 214:101:192.101.14.14/32, version 330
Paths: (2 available, best #2, table OSPF)
Not advertised to any peer
Refresh Epoch 4
Local
214.0.0.3 (metric 10) (via default) from 214.0.0.3 (214.0.0.3)
Origin incomplete, metric 2, localpref 100, valid, internal, best
Extended Community: RT:214:101 OSPF DOMAIN ID:0x0005:0x000000F40200 OSPF RT:0.0.0.0:2:0 OSPF ROUTER ID:192.3.101.3:0
mpls labels in/out nolabel/3015
rx pathid: 0, tx pathid: 0x0

We can verify this worked by checking the CE routers to ensure all of their OSPF routes are inter-area. Notice that CSR6 now shows 8 inter-area routes and 0 external type-2 routes, whereas some of these routes were external type-2 before. XRv4 doesn’t show a line with 0 external routes; it simply displays nothing except the inter-area routes, which is an implicit verification method.

R6#show ip route vrf OSPF summary | section ospf 101
ospf 101 5 3 0 768 2336
Intra-area: 0 Inter-area: 8 External-1: 0 External-2: 0
NSSA External-1: 0 NSSA External-2: 0

RP/0/0/CPU0:XRv4#show route vrf OSPF summary detail | begin ospf 101
ospf 101 7 7 0 0
Inter-area: 7 7 0 0
Total 12 12 1 1

The concept is identical for IPv6 routes. The only difference is that XE no longer makes assumptions about the domain-ID. At this point, neither the XE nor the XR PEs will encode a domain-ID, since neither platform uses the OSPF PID as a seed value for OSPFv3.

R4#show bgp vpnv6 unicast vrf OSPF ::192:101:13:13/128
BGP routing table entry for [214:101]::192:101:13:13/128, version 363
Paths: (2 available, best #1, table OSPF)
Not advertised to any peer
Refresh Epoch 6
Local
::FFFF:214.0.0.11 (metric 30) (via default) from 214.0.0.3 (214.0.0.3)
Origin incomplete, metric 1, localpref 100, valid, internal, best
Extended Community: RT:214:101 OSPF RT:0.0.0.0:2:0 OSPF ROUTER ID:0.0.101.11:0
Originator: 214.0.0.11, Cluster list: 0.0.3.12
mpls labels in/out nolabel/91017
rx pathid: 0, tx pathid: 0x0
[snip]

R4#show bgp vpnv6 unicast vrf OSPF ::192:101:14:14/128
BGP routing table entry for [214:101]::192:101:14:14/128, version 368
Paths: (2 available, best #2, table OSPF)
Not advertised to any peer
[snip]
Refresh Epoch 6
Local
::FFFF:214.0.0.3 (metric 10) (via default) from 214.0.0.3 (214.0.0.3)
Origin incomplete, metric 1, localpref 100, valid, internal, best
Extended Community: RT:214:101 OSPF ROUTER ID:192.3.101.3:0 OSPF RT:0.0.0.0:2:0
mpls labels in/out nolabel/3016
rx pathid: 0, tx pathid: 0x0

Checking the local domain-IDs on CSR4 and XRv1, we see that there are no defaults. Personally, I prefer the XR behavior because it never makes a domain-ID assumption and is consistent between OSPFv2 and OSPFv3. XE makes assumptions about the OSPFv2 domain-ID only by using the OSPF PID, which is inconsistent.

R4#show ospfv3 vrf OSPF | include Domain
Domain ID (none)
RP/0/0/CPU0:XRv1#show ospfv3 vrf OSPF | begin Domain
[no output]


Since there are no domain-IDs carried in any VPNv6 routes and also no local domain-IDs, OSPF makes the bold assumption that the routes are indeed in the same domain. This behavior is in accordance with RFC 4577 and is not a Cisco-specific behavior. We can confirm this on CSR6 and XRv4. At this point, we could consider our configuration “complete” since all of the OSPFv2/v3 CE routers see one another’s routes as inter-area (not external).

R6#show ipv6 route vrf OSPF summary | section ospf 101
ospf 101 8 1152 1664
Intra-area: 0 Inter-area: 8 External-1: 0 External-2: 0
NSSA External 1: 0 NSSA External 2: 0

RP/0/0/CPU0:XRv4#show route vrf OSPF ipv6 summary detail | begin ospf 101
ospf 101 7 7 0 0
Inter-area: 7 7 0 0
Total 12 12 1 1

We will first configure the domain-ID on all of the XE PEs only, this time using value 246 (0xF6). This should introduce some external routes into the VPN routing tables since routes from CSR3 and CSR4 will have domain-IDs. XRv1 and XRv2 will not; this behavior would mimic the original issue observed with OSPFv2 where XE PEs added domain-IDs but XR PEs did not. After configuring the feature, we confirm it on both routers.

! CSR3 and CSR4
router ospfv3 101
 address-family ipv6 unicast vrf OSPF
  domain-id type 0005 value 000000F60200

R3#show ospfv3 vrf OSPF | include Domain
Domain ID 0x000000F60200, type 0x0005
R4#show ospfv3 vrf OSPF | include Domain
Domain ID 0x000000F60200, type 0x0005

Checking CSR4, we can see that VPNv6 prefixes from CSR3 have this new domain-ID, which matches the local domain-ID, and therefore means routes from CSR3 will be inter-area. Routes from the remote XR PEs have no domain-ID, and are therefore external. Both CSR6 and XRv4 now see a mix of inter-area and external routes.

R4#show bgp vpnv6 unicast vrf OSPF ::192:101:13:13/128
BGP routing table entry for [214:101]::192:101:13:13/128, version 363
Paths: (2 available, best #1, table OSPF)
Not advertised to any peer
Refresh Epoch 8
Local
::FFFF:214.0.0.11 (metric 30) (via default) from 214.0.0.3 (214.0.0.3)
Origin incomplete, metric 1, localpref 100, valid, internal, best
Extended Community: RT:214:101 OSPF RT:0.0.0.0:2:0 OSPF ROUTER ID:0.0.101.11:0
Originator: 214.0.0.11, Cluster list: 0.0.3.12
mpls labels in/out nolabel/91017
rx pathid: 0, tx pathid: 0x0
[snip]

R4#show bgp vpnv6 unicast vrf OSPF ::192:101:14:14/128
BGP routing table entry for [214:101]::192:101:14:14/128, version 377
Paths: (2 available, best #2, table OSPF)
Not advertised to any peer
[snip]
Refresh Epoch 8
Local
::FFFF:214.0.0.3 (metric 10) (via default) from 214.0.0.3 (214.0.0.3)
Origin incomplete, metric 1, localpref 100, valid, internal, best
Extended Community: RT:214:101 OSPF DOMAIN ID:0x0005:0x000000F60200 OSPF ROUTER ID:192.3.101.3:0 OSPF RT:0.0.0.0:2:0
mpls labels in/out nolabel/3016
rx pathid: 0, tx pathid: 0x0

R6#show ipv6 route vrf OSPF summary | section ospf 101
ospf 101 8 1152 1664
Intra-area: 0 Inter-area: 3 External-1: 0 External-2: 5
NSSA External 1: 0 NSSA External 2: 0

RP/0/0/CPU0:XRv4#show route vrf OSPF ipv6 summary detail | begin ospf 101
ospf 101 7 7 0 0
Inter-area: 2 2 0 0
External-2: 5 5 0 0
Total 12 12 1 1

For completeness, we will configure the IPv6 domain-ID on the XR PEs as well, and quickly verify that the configuration was successful. As mentioned earlier, XR is more consistent with domain-ID configuration and default behavior than XE.

! XRv1 and XRv2
router ospfv3 101
 vrf OSPF
  domain-id type 0005 value 000000F60200

RP/0/0/CPU0:XRv1#show ospfv3 vrf OSPF | begin Domain
Primary Domain ID: 0x0005:0x000000f60200
RP/0/0/CPU0:XRv2#show ospfv3 vrf OSPF | begin Domain
Primary Domain ID: 0x0005:0x000000f60200


Finally, we will verify that the XR-originated VPNv6 prefixes are carrying the proper domain-ID extended community, and that the XE routers only see inter-area routes again.

R4#show bgp vpnv6 unicast vrf OSPF ::192:101:13:13/128
BGP routing table entry for [214:101]::192:101:13:13/128, version 380
Paths: (2 available, best #1, table OSPF)
Not advertised to any peer
Refresh Epoch 8
Local
::FFFF:214.0.0.11 (metric 30) (via default) from 214.0.0.3 (214.0.0.3)
Origin incomplete, metric 1, localpref 100, valid, internal, best
Extended Community: RT:214:101 OSPF DOMAIN ID:0x0005:0x000000F60200 OSPF RT:0.0.0.0:2:0 OSPF ROUTER ID:0.0.101.11:0
Originator: 214.0.0.11, Cluster list: 0.0.3.12
mpls labels in/out nolabel/91017
rx pathid: 0, tx pathid: 0x0
[snip]

R4#show bgp vpnv6 unicast vrf OSPF ::192:101:14:14/128
BGP routing table entry for [214:101]::192:101:14:14/128, version 377
Paths: (2 available, best #2, table OSPF)
Not advertised to any peer
[snip]
Refresh Epoch 8
Local
::FFFF:214.0.0.3 (metric 10) (via default) from 214.0.0.3 (214.0.0.3)
Origin incomplete, metric 1, localpref 100, valid, internal, best
Extended Community: RT:214:101 OSPF DOMAIN ID:0x0005:0x000000F60200 OSPF ROUTER ID:192.3.101.3:0 OSPF RT:0.0.0.0:2:0
mpls labels in/out nolabel/3016
rx pathid: 0, tx pathid: 0x0

When we try to verify this claim, we find some unanticipated results. CSR6 and XRv4 still see external routes, which doesn’t make sense.

R6#show ipv6 route vrf OSPF summary | section ospf 101
ospf 101 8 1152 1664
Intra-area: 0 Inter-area: 6 External-1: 0 External-2: 2
NSSA External 1: 0 NSSA External 2: 0

RP/0/0/CPU0:XRv4#show route vrf OSPF ipv6 summary detail | begin ospf 101
ospf 101 7 7 0 0
Inter-area: 5 5 0 0
External-2: 2 2 0 0
Total 12 12 1 1


If we look at the specific routes that are still external, we see that the PE-CE transit links connected to XRv1 and XRv2 are the only ones. For whatever reason, XRv1 and XRv2 did not set the domain-ID on those routes during redistribution, possibly because they were not OSPF-learned. XE is smarter than XR in this regard and considers those to be part of the OSPFv3 domain, even as connected routes. I am not able to find a workaround for this since the RPL does not contain a mechanism to manually set this extended community. At least we know that the OSPF-learned routes, which are much more important and relevant, have the extended community applied and the feature is working properly. Inter-area reachability to the transit links is not important, so XR’s inability to add the domain-ID to these VPNv6 prefixes is insignificant.

R6#show ipv6 route vrf OSPF | include E2
OE2 - OSPF ext 2, ON1 - OSPF NSSA ext 1, ON2 - OSPF NSSA ext 2
OE2 FD00:192:11:101::/64 [110/1]
OE2 FD00:192:12:101::/64 [110/1]

RP/0/0/CPU0:XRv4#show route vrf OSPF ipv6 | include E2
E1 - OSPF external type 1, E2 - OSPF external type 2, E – EGP
O E2 fd00:192:11:101::/64
O E2 fd00:192:12:101::/64

RP/0/0/CPU0:XRv1#show bgp vpnv6 unicast vrf OSPF fd00:192:11:101::/64 | include Extended
Extended community: OSPF route-type:0:2:0x0 OSPF router-id:214.0.0.11 RT:214:101
RP/0/0/CPU0:XRv2#show bgp vpnv6 unicast vrf OSPF fd00:192:12:101::/64 | include Extended
Extended community: OSPF route-type:0:2:0x0 OSPF router-id:214.0.0.12 RT:214:101

It is worth discussing the other new extended communities and OSPF domain-ID types. First, the OSPF route-type is something not present in normal OSPF/BGP architectures. The attribute carries the same information for both IPv4 and IPv6 routes. XE and XR display them a little differently but the premise is the same. The first 32 bits represent the area-ID, which is 0 in this case (XR shows 0 and XE shows 0.0.0.0). The next 8 bits specify the route type: 1 or 2 for intra-area routes (whether it came from LSA1 or LSA2), 3 for inter-area, 5 for external, and 7 for NSSA-external. The creators of this extended community kept it simple by using the same numbers as the OSPF LSA types. Below, we can see two area 0 routes with route-type 2. We don’t have any DRs in the network, so I am not sure why the route-type is 2 and not 1, but at least we know it is intra-area. The last field of the route-type is the options field, which is 8 bits, and is used to signal information about external routes. Since neither route was external, these flags are 0 (XR shows 0x0 and XE shows 0).

RP/0/0/CPU0:XRv1#show bgp vpnv4 unicast vrf OSPF 192.101.6.6/32 | include Extended
Extended community: OSPF domain-id:0x5:0x000000f40200 OSPF route-type:0:2:0x0 OSPF router-id:192.4.101.4 RT:214:101
R4#show bgp vpnv6 unicast vrf OSPF ::192:101:14:14/128 | include OSPF_RT
OSPF ROUTER ID:192.3.101.3:0 OSPF RT:0.0.0.0:2:0

The area-ID value is 0 for external routes as well. To see this, we will create a new loopback on CSR6 and redistribute it into OSPFv2 and v3.

! CSR6
interface Loopback600
 description EXTERNAL TEST
 vrf forwarding OSPF
 ip address 192.60.6.6 255.255.255.255
 ipv6 address ::192:60:6:6/128
route-map RM_CONN_TO_OSPF permit 10
 match interface Loopback600
router ospfv3 101
 address-family ipv6 unicast vrf OSPF
  redistribute connected route-map RM_CONN_TO_OSPF
router ospf 101 vrf OSPF
 redistribute connected subnets route-map RM_CONN_TO_OSPF

We can verify the redistribution was successful by checking the local LSDB. Note: The PEs must be configured to match external routes when redistributing from OSPF into BGP. This is not allowed by default and must be explicitly specified. Because there are no external routes on CSR5 or XRv3, we only need to configure external redistribution on CSR3 and CSR4.

R6#show ip ospf 101 database external self-originate | include Link|Mask
Type-5 AS External Link States
LS Type: AS External Link
Link State ID: 192.60.6.6 (External Network Number )
Network Mask: /32
R6#show ospfv3 vrf OSPF database external self-originate | include Prefix
Prefix Address: ::192:60:6:6
Prefix Length: 128, Options: None

! CSR3 and CSR4
router bgp 214
 address-family ipv6 vrf OSPF
  redistribute ospf 101 include-connected match internal external
 address-family ipv4 vrf OSPF
  redistribute ospf 101 match internal external

When we check the route on XRv1, we can see the route-type is 5 to signal an external route. The low-order bit of the options byte is set to 1, which indicates external type-2.

RP/0/0/CPU0:XRv1#show bgp vpnv4 unicast vrf OSPF 192.60.6.6/32 | include Extend
Extended community: OSPF domain-id:0x5:0x000000f40200 OSPF route-type:0:5:0x1 OSPF router-id:192.4.101.4 RT:214:101

When the route is external type-1, the options field is clear, and the route-type remains 5. We update the route-map to make this change, then tell OSPFv2/v3 to perform the redistribution again to capture the changes. XRv1 shows the new route-type. Also note there is no need to carry the OSPF metric in a new extended community as it is already carried in the MED value. This is standard BGP behavior when redistributing from an IGP, and OSPF can use that to keep metrics consistent across the L3VPN.

! CSR6
route-map RM_CONN_TO_OSPF permit 10
 set metric-type type-1

R6#clear ip ospf 101 redistribution
R6#clear ospfv3 vrf OSPF redistribution
RP/0/0/CPU0:XRv1#show bgp vpnv4 unicast vrf OSPF 192.60.6.6/32 | include Extend
Extended community: OSPF domain-id:0x5:0x000000f40200 OSPF route-type:0:5:0x0 OSPF router-id:192.4.101.4 RT:214:101
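The three route-type fields described above (area-ID, route type, options) can be decoded mechanically from the strings printed by either platform. The following Python sketch is an illustration only: the function name and dictionary keys are hypothetical, and it parses the display strings seen in this section rather than the on-the-wire encoding.

```python
import ipaddress

ROUTE_TYPES = {1: "intra-area", 2: "intra-area", 3: "inter-area",
               5: "external", 7: "NSSA-external"}


def decode_route_type(text: str) -> dict:
    """Decode 'area:type:options' as printed by XR (0:5:0x1) or XE (0.0.0.0:2:0)."""
    area_txt, type_txt, options_txt = text.split(":")
    # XE prints the area in dotted form, XR as a plain integer.
    area = int(ipaddress.IPv4Address(area_txt)) if "." in area_txt else int(area_txt)
    route_type = int(type_txt)
    options = int(options_txt, 0)  # base 0 accepts both "0x1" and "0"
    metric_type = None
    if route_type in (5, 7):
        # Low-order options bit set means external type-2, clear means type-1.
        metric_type = "type-2" if options & 0x01 else "type-1"
    return {"area": area, "kind": ROUTE_TYPES[route_type], "metric_type": metric_type}


print(decode_route_type("0:5:0x1"))
print(decode_route_type("0:5:0x0"))
print(decode_route_type("0.0.0.0:2:0"))
```

The three sample strings are taken from the outputs in this section: the external type-2 route, the same route after changing it to type-1, and an intra-area route as displayed by XE.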

The OSPF RID community needs no explaining. It contains the OSPF RID of the originating PE for the route in question. Looking at CSR6 and XRv4 loopbacks side by side from XRv2’s perspective, we can see different OSPF RIDs since the egress PEs differ for those two prefixes.

RP/0/0/CPU0:XRv2#show bgp vpnv4 unicast vrf OSPF 192.101.14.14 | include Extend
Extended community: OSPF domain-id:0x5:0x000000f40200 OSPF route-type:0:2:0x0 OSPF router-id:192.3.101.3 RT:214:101
RP/0/0/CPU0:XRv2#show bgp vpnv4 unicast vrf OSPF 192.101.6.6 | include Extend
Extended community: OSPF domain-id:0x5:0x000000f40200 OSPF route-type:0:2:0x0 OSPF router-id:192.4.101.4 RT:214:101

The domain-ID types are just meant to be differentiators between various domain-ID formats. The Cisco default is 0x0005, as we saw earlier. All four domain-ID formats (0x0005, 0x0105, 0x0205, 0x8005) are 8 bytes long where the first 2 bytes always represent the type. Type 0x0005 uses the next 2 bytes for the “global administrator” field and the last 4 for the “local administrator” field, which is where the OSPF PID is encoded. Types 0x0105 and 0x0205 use 4 bytes for global and 2 bytes for local, but the difference is really just cosmetic. You can set the actual value to whatever is necessary to make the OSPF routes appear the way they need to (sometimes you may want inter-area, sometimes external). The 0x8005 type is exceptional; using 0x8005 is a way to achieve backward compatibility between all domain-ID types. When this domain-ID type is used, the first 2 bytes are ignored, so the domain-ID types no longer need to match. It is a raw bit-for-bit comparison of the remaining 6 bytes, which works perfectly for backwards compatibility. We will configure XRv1 with a domain-ID type of 0x8005 but use the same ID value of 0xF4 for OSPFv2.

! XRv1
router ospf 101
 vrf OSPF
  domain-id type 8005 value 000000F40200

XRv4 still sees the route to XRv3 as inter-area, which proves that CSR3 ignored the difference between 0x0005 and 0x8005.

R3#show bgp vpnv4 unicast vrf OSPF 192.101.13.13
BGP routing table entry for 214:101:192.101.13.13/32, version 596
Paths: (1 available, best #1, table OSPF)
Advertised to update-groups:
3
Refresh Epoch 1
Local, (Received from a RR-client)
214.0.0.11 (metric 20) (via default) from 214.0.0.11 (214.0.0.11)
Origin incomplete, metric 2, localpref 100, valid, internal, best
Extended Community: RT:214:101 OSPF RT:0.0.0.0:1:0 OSPF ROUTER ID:0.0.101.11:0 OSPF DOMAIN ID:0x8005:0x000000F40200
mpls labels in/out nolabel/91014
rx pathid: 0, tx pathid: 0x0

RP/0/0/CPU0:XRv4#show route vrf OSPF 192.101.13.13
Routing entry for 192.101.13.13/32
Known via "ospf 101", distance 110, metric 3, type inter area
Routing Descriptor Blocks
192.3.101.3, from 192.3.101.3, via GigabitEthernet0/0/0/0.101
Route metric is 3
No advertising protos.

We will quickly test type 0x8005 on XE using OSPFv3. Changing CSR3’s domain-ID to 0x8005, XRv2 still considers it within the same domain since the low-order 6-byte domain value still matches. CSR5 still sees the route as inter-area, as we expect.

! CSR3
router ospfv3 101
 address-family ipv6 unicast vrf OSPF
  domain-id type 8005 value 000000F60200

RP/0/0/CPU0:XRv2#show bgp vpnv6 unicast vrf OSPF ::192:101:14:14
[snip]
Local, (Received from a RR-client)
214.0.0.3 (metric 30) from 214.0.0.3 (214.0.0.3)
Received Label 3016
Origin incomplete, metric 1, localpref 100, valid, internal, best, group-best, import-candidate, imported
Received Path ID 0, Local Path ID 1, version 535
Extended community: OSPF router-id:192.3.101.3 OSPF route-type:0:2:0x0 OSPF domain-id:0x8005:0x000000f60200 RT:214:101
Source VRF: OSPF, Source Route Distinguisher: 214:101

R5#show ipv6 route vrf OSPF ::192:101:14:14
Routing entry for ::192:101:14:14/128
Known via "ospf 101", distance 110, metric 2, type inter area
Route count is 1/1, share count 0
Routing paths:
FE80::12, GigabitEthernet2.101
Last updated 03:20:39 ago
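The domain-ID matching behavior observed throughout this section can be condensed into one comparison function. The Python sketch below follows the rules exactly as this text describes them (no domain-ID on either side is a match, a domain-ID on only one side is a non-match, and type 0x8005 causes the type field to be ignored); it is not taken from any vendor implementation, and the names same_domain and advertised_as are hypothetical. It applies only to routes that were internal to OSPF at the originating site.

```python
from typing import Optional, Tuple

# (type, 6-byte value), for example (0x0005, bytes.fromhex("000000F40200"))
DomainId = Tuple[int, bytes]


def same_domain(local: Optional[DomainId], remote: Optional[DomainId]) -> bool:
    """Compare a PE's local domain-ID with the one carried on a VPN route."""
    if local is None and remote is None:
        return True   # neither side has one: assumed to be the same domain
    if local is None or remote is None:
        return False  # only one side has one: treated as a non-match
    (local_type, local_value), (remote_type, remote_value) = local, remote
    if local_value != remote_value:
        return False  # the 6-byte values must always be identical
    # Type 0x8005 on either side means the 2-byte type is not compared.
    return local_type == remote_type or 0x8005 in (local_type, remote_type)


def advertised_as(local: Optional[DomainId], remote: Optional[DomainId]) -> str:
    """How the PE presents the route to its CE when redistributing into OSPF."""
    return "inter-area" if same_domain(local, remote) else "external"


f4 = bytes.fromhex("000000F40200")
print(advertised_as((0x0005, f4), (0x8005, f4)))  # the 0x8005 test above
print(advertised_as((0x0005, f4), None))          # XE default vs. XR default
print(advertised_as(None, None))                  # OSPFv3 before any domain-ID
```

Each sample call corresponds to a case demonstrated earlier: mixed 0x0005 and 0x8005 types, a domain-ID present on only one PE, and no domain-ID anywhere.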

Let’s assume that between XRv4 and CSR6, and XRv3 and CSR5, that we have installed a pair of low-speed backdoor links. These could be non-MPLS private WAN leased lines that we want to use only for backup, while preferring the L3VPN MPLS path via the PEs primarily. The backdoor link configuration is not shown, but we can quickly check the neighbors on all four CEs for both IPv4/v6 to verify the neighbor was formed properly. We also increase the cost of this link to a very high value in an attempt to prefer MPLS. The backdoor link is Gig2.201 on XE or Gig0/0/0/0.201 on XR and has a cost of 500. We can quickly verify this on CSR6 and XRv4 as an example. Notice that there is 1 neighbor on the link also, as expected. This serves as a quick check to verify the cost configuration and OSPFv2/v3 neighbors.

R6#show ip ospf 101 interface brief
Interface    PID   Area   IP Address/Mask    Cost  State Nbrs F/C
Lo101        101   0      192.101.6.6/32     1     LOOP  0/0
Gi2.201      101   0      192.6.101.6/24     500   P2P   1/1
Gi2.101      101   0      192.4.101.6/24     1     P2P   1/1

R6#show ospfv3 101 vrf OSPF ipv6 interface brief
Interface    PID   Area   AF     Cost  State Nbrs F/C
Lo101        101   0      ipv6   1     LOOP  0/0
Gi2.201      101   0      ipv6   500   P2P   1/1
Gi2.101      101   0      ipv6   1     P2P   1/1

RP/0/0/CPU0:XRv4#show ospf vrf OSPF interface brief
[snip]
Interface        PID   Area   IP Address/Mask     Cost  State Nbrs F/C
Lo101            101   0      192.101.14.14/32    1     LOOP  0/0
Gi0/0/0/0.101    101   0      192.3.101.14/24     1     P2P   1/1
Gi0/0/0/0.201    101   0      192.6.101.14/24     500   P2P   1/1

RP/0/0/CPU0:XRv4#show ospfv3 vrf OSPF interface brief
Interface        PID   Area   IPv6 Address                 Cost  State Nbrs F/C
Gi0/0/0/0.101    101   0      fe80::14                     1     P2P   1/1
Gi0/0/0/0.201    101   0      fe80::14                     500   P2P   1/1
Lo101            101   0      fe80::e92e:ddff:fef3:2b48    0     LOOP  0/0

When we check the routing table on CSR6, we can see that it is preferring the slow backdoor link over the fast MPLS connection. This is because the backdoor route is intra-area, and is always preferred over inter-area. This OSPF “route preference” behavior cannot be modified and the problem persists for both IPv4 and IPv6. Assigning a different area to the transit link would not work as this would imply bridging area 0 over a non-backbone area. Technically this is possible with virtual links and other awkward workarounds, but that is beyond the scope of this problem.

R6#show ip route vrf OSPF 192.101.14.14
Routing Table: OSPF
Routing entry for 192.101.14.14/32
Known via "ospf 101", distance 110, metric 501, type intra area
Last update from 192.6.101.14 on GigabitEthernet2.201, 00:04:04 ago
Routing Descriptor Blocks:
* 192.6.101.14, from 0.0.101.14, 00:04:04 ago, via GigabitEthernet2.201
Route metric is 501, traffic share count is 1

R6#show ipv6 route vrf OSPF ::192:101:14:14/128
Routing entry for ::192:101:14:14/128
Known via "ospf 101", distance 110, metric 500, type intra area
Route count is 1/1, share count 0
Routing paths:
FE80::14, GigabitEthernet2.201
Last updated 00:05:52 ago
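To see why raising the cost to 500 did not help, it may be useful to model the selection order as a sort key where the route type is compared before the metric. The Python sketch below is a deliberate simplification (it ignores the finer rules for comparing external routes), and the Path class and best_path function are hypothetical names used only for illustration. The costs are the ones seen on CSR6 in this section: 501 via the backdoor and 3 via the PE.

```python
from dataclasses import dataclass
from typing import Iterable

# Lower rank wins; the metric is only a tie-breaker within the same rank.
RANK = {"intra-area": 0, "inter-area": 1, "external": 2}


@dataclass(frozen=True)
class Path:
    via: str
    route_type: str
    cost: int


def best_path(paths: Iterable[Path]) -> Path:
    """Pick the preferred OSPF path: route type first, then cost."""
    return min(paths, key=lambda p: (RANK[p.route_type], p.cost))


backdoor = Path("GigabitEthernet2.201", "intra-area", 501)
mpls = Path("GigabitEthernet2.101", "inter-area", 3)
print(best_path([mpls, backdoor]).via)

# If both paths were intra-area, cost would become the deciding factor.
mpls_intra = Path("GigabitEthernet2.101", "intra-area", 3)
print(best_path([mpls_intra, backdoor]).via)
```

The first result is the backdoor interface despite its much higher cost; the second shows that the MPLS-facing interface wins once both paths are of the same type.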

A quick check on XRv4 shows the same problem in the opposite direction. It is preferring the slow backdoor link to CSR6 when we prefer that it use the MPLS link primarily.

RP/0/0/CPU0:XRv4#show route vrf OSPF ipv4 192.101.6.6
Routing entry for 192.101.6.6/32
  Known via "ospf 101", distance 110, metric 501, type intra area
  Routing Descriptor Blocks
    192.6.101.6, from 192.4.101.6, via GigabitEthernet0/0/0/0.201
      Route metric is 501
  No advertising protos.

RP/0/0/CPU0:XRv4#show route vrf OSPF ipv6 ::192:101:6:6
Routing entry for ::192:101:6:6/128
  Known via "ospf 101", distance 110, metric 500, type intra area
  Routing Descriptor Blocks
    fe80::6, from ::, via GigabitEthernet0/0/0/0.201
      Route metric is 500
  No advertising protos.

The aptly-named “sham-link” can be used to solve this problem. Like a virtual-link, it will connect two non-adjacent OSPF speakers using a demand-circuit that allows LSAs to be exchanged across the network. Sham-link routers must be in the same domain (same or compatible domain-ID type with identical domain-ID value). The sham-link can be used to connect two L3VPN PEs together over an MPLS core to create an intra-area link between them. This would form a square between the PEs and CEs, which would allow the OSPF cost to be the deciding factor in terms of routing.

To build a sham-link, we must create host-route loopbacks (/32 for IPv4 and /128 for IPv6) which exist inside of the same VPN as the process we intend to bridge. The loopbacks must be advertised into BGP and must not be advertised into the VRF-aware OSPF process; the purpose is to create reachability between these endpoints over the MPLS network, not the customer network. We will configure OSPFv2 and OSPFv3 sham-links between CSR3 and CSR4 first. Note that even if we were using OSPFv3 for the IPv4 AFI, the sham-link endpoint must be a /128 IPv6 address. We can advertise the sham-link endpoints into BGP via network statements or redistribution; it does not matter. We will use both methods for variety.

! CSR3
interface Loopback101
 description SHAM LINK ENDPOINT
 vrf forwarding OSPF
 ip address 101.0.0.3 255.255.255.255
 ipv6 address ::101:0:0:3/128
router bgp 214
 address-family ipv4 vrf OSPF
  network 101.0.0.3 mask 255.255.255.255
 address-family ipv6 vrf OSPF
  network ::101:0:0:3/128

! CSR4
interface Loopback101
 description SHAM LINK ENDPOINT
 vrf forwarding OSPF
 ip address 101.0.0.4 255.255.255.255
 ipv6 address ::101:0:0:4/128
route-map RM_SHAM_TO_BGP permit 10
 match interface Loopback101
router bgp 214
 address-family ipv4 vrf OSPF
  redistribute connected route-map RM_SHAM_TO_BGP
 address-family ipv6 vrf OSPF
  redistribute connected route-map RM_SHAM_TO_BGP

Before configuring the sham-link itself, we should verify that we have connectivity between these sham-link endpoints. The traffic will be MPLS-encapsulated since it is inside a VPN as these look like ordinary VPNv4/v6 routes with BGP allocated labels. We only need to perform the ping test from one PE since the return traffic (echo-reply) confirms bidirectional connectivity.

R3#show bgp vpnv4 unicast vrf OSPF | include 101.0.0.[34]
 *>  101.0.0.3/32     0.0.0.0          0          32768 i
 *>i 101.0.0.4/32     214.0.0.4        0    100       0 ?

R4#show bgp vpnv4 unicast vrf OSPF | include 101.0.0.[34]
 * i 101.0.0.3/32     214.0.0.3        0    100       0 i
 *>  101.0.0.4/32     0.0.0.0          0          32768 ?

R3#ping vrf OSPF 101.0.0.4 source 101.0.0.3
Type escape sequence to abort.
Sending 5, 100-byte ICMP Echos to 101.0.0.4, timeout is 2 seconds:
Packet sent with a source address of 101.0.0.3
!!!!!
Success rate is 100 percent (5/5), round-trip min/avg/max = 7/8/10 ms

We quickly perform the same verification for the IPv6 sham-link endpoints. Connectivity is achieved via 6VPE, the technique we have been using for ordinary IPv6 L3VPN connectivity.

R3#show bgp vpnv6 unicast vrf OSPF | include ::101:0:0:[34]
 *>  ::101:0:0:3/128  ::               0          32768 i
 *>i ::101:0:0:4/128  ::FFFF:214.0.0.4

R4#show bgp vpnv6 unicast vrf OSPF | include ::101:0:0:[34]
 * i ::101:0:0:3/128  ::FFFF:214.0.0.3
 *>  ::101:0:0:4/128  ::               0          32768 ?

R3#ping vrf OSPF ::101:0:0:4 source ::101:0:0:3
Type escape sequence to abort.
Sending 5, 100-byte ICMP Echos to ::101:0:0:4, timeout is 2 seconds:
Packet sent with a source address of ::101:0:0:3%OSPF
!!!!!
Success rate is 100 percent (5/5), round-trip min/avg/max = 2/2/3 ms

Although technically not required, we should ensure that these new loopbacks do not get advertised into the customer OSPF process. Because OSPF is redistributing from BGP without a filter at present, the OSPF process will contain these as external routes. We can manually filter them out on both CSR3 and CSR4. There is no reason to expose these addresses to the customer.

! CSR3 and CSR4
ip prefix-list PL_SHAM_ENDPOINTS seq 10 permit 101.0.0.0/8 ge 32
route-map RM_BGP_TO_OSPF_V4 deny 10
 match ip address prefix-list PL_SHAM_ENDPOINTS
route-map RM_BGP_TO_OSPF_V4 permit 1000
ipv6 prefix-list PL_SHAM_ENDPOINTS seq 5 permit ::101:0:0:0/80 ge 128
route-map RM_BGP_TO_OSPF_V6 deny 10
 match ipv6 address prefix-list PL_SHAM_ENDPOINTS
route-map RM_BGP_TO_OSPF_V6 permit 1000
router ospf 101
 redistribute bgp 214 subnets route-map RM_BGP_TO_OSPF_V4
router ospfv3 101
 address-family ipv6 unicast vrf OSPF
  redistribute bgp 214 route-map RM_BGP_TO_OSPF_V6
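To make the prefix-list logic concrete, the following Python sketch mimics how an entry such as "permit 101.0.0.0/8 ge 32" selects only host routes inside the covering block. The helper name `prefix_list_permits` is hypothetical, and the sketch assumes the usual prefix-list semantics where "ge" without "le" means "this length or longer".

```python
import ipaddress


def prefix_list_permits(entry: str, ge: int, candidate: str) -> bool:
    """Return True if candidate falls inside entry and its length is >= ge.

    Models 'permit <entry> ge <ge>' with no 'le' keyword, so the upper
    bound is the maximum prefix length of the address family.
    """
    base = ipaddress.ip_network(entry)
    net = ipaddress.ip_network(candidate)
    if net.version != base.version:
        return False
    return net.subnet_of(base) and net.prefixlen >= ge


# Sham-link endpoints are matched (and therefore denied by the route-map).
print(prefix_list_permits("101.0.0.0/8", 32, "101.0.0.3/32"))
print(prefix_list_permits("::101:0:0:0/80", 128, "::101:0:0:4/128"))

# Ordinary customer prefixes are not matched and fall through to permit 1000.
print(prefix_list_permits("101.0.0.0/8", 32, "192.101.6.6/32"))
print(prefix_list_permits("::101:0:0:0/80", 128, "::192:101:6:6/128"))
```

Because the match is on a covering block rather than individual addresses, any future sham-link endpoint numbered from the same range is filtered without touching the prefix-list.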

We can verify this was successful by checking the local LSDBs for any external routes beginning with 101.0.0.X or ::101:0:0:X. We don’t see any. We still see IPv6 externals because of the XR behavior discussed earlier, where XR fails to add the domain-ID to redistributed connected routes. The point is that the sham-link endpoints do not exist as external OSPFv2/v3 routes, which means the customer cannot learn them.

R3#show ip ospf 101 database | begin Type-5
[no output]

R3#show ospfv3 vrf OSPF database | begin External
                Type-5 AS External Link States
ADV Router       Age         Seq#        Prefix
192.3.101.3      1457        0x80000005  FD00:192:11:101::/64
192.3.101.3      1457        0x80000003  FD00:192:12:101::/64
192.4.101.4      1511        0x80000005  FD00:192:11:101::/64
192.4.101.4      1511        0x80000005  FD00:192:12:101::/64

R4#show ip ospf 101 database | begin Type-5
[no output]

R4#show ospfv3 vrf OSPF database | begin External
                Type-5 AS External Link States
ADV Router       Age         Seq#        Prefix
192.3.101.3      1682        0x80000005  FD00:192:11:101::/64
192.3.101.3      1682        0x80000003  FD00:192:12:101::/64
192.4.101.4      1730        0x80000005  FD00:192:11:101::/64
192.4.101.4      1730        0x80000005  FD00:192:12:101::/64

Next, we will configure the sham-link. We can optionally specify the OSPF cost as well, and provided this cost (plus the cost of the PE-CE transit links) is less than the backdoor cost, we can prefer MPLS over the backdoor. We will use a cost of 214 on the OSPFv2 sham-link without authentication. On the OSPFv3 sham-link, we use the default cost of 1, but enable authentication for variety. Older versions of IOS do not allow for sham-link authentication, so we test that in XE for OSPFv3 only. For brevity, only CSR3 is shown, but the CSR4 configuration is identical with reversed IPv4/v6 addresses for the sham-link endpoints.

! CSR3
key chain KC_SHAM_AUTH
 key 1
  key-string SHAM_AUTH
  cryptographic-algorithm hmac-sha-512
router ospf 101 vrf OSPF
 area 0 sham-link 101.0.0.3 101.0.0.4 cost 214
router ospfv3 101
 address-family ipv6 unicast vrf OSPF
  area 0 sham-link ::101:0:0:3 ::101:0:0:4 authentication key-chain KC_SHAM_AUTH

You cannot use MD5 authentication with OSPFv3 sham-links and must use one of the SHA options. Failure to specify a valid option inside the key chain will result in error messages like the following.

! CSR3
%OSPFv3-5-NOCRYPTOALG: Key ID 1 in key chain KC_SHAM_AUTH does not have a valid cryptographic algorithm
%OSPFv3-4-NOVALIDKEY: No valid authentication key under key-chain KC_SHAM_AUTH

Assuming the configuration is correct, the routers will generate syslog messages to indicate that the sham-links are up. In case we miss these messages, we can check the status of the sham-links with a special show command. The pertinent information includes the fact that it is a demand-circuit with suppressed hellos. DoNotAge (DNA) LSAs are allowed, which implies the link is DC-capable; DC is an LSA option in the LSA1. The link is considered point-to-point, which is a link type within the LSA1 as well. This is different from the virtual-link, which has its own link type in the LSA1. The sham-link is graphically equivalent to an ordinary point-to-point link. That is why we can explicitly specify a sham-link cost, but we cannot explicitly specify a virtual-link cost.

! CSR3
%OSPF-5-ADJCHG: Process 101, Nbr 192.4.101.4 on OSPF_SL0 from LOADING to FULL, Loading Done
%OSPFv3-5-ADJCHG: Process 101, IPv6, VRF OSPF, Nbr 192.4.101.4 on OSPFv3_SL0 from LOADING to FULL, Loading Done

R3#show ip ospf 101 sham-links
Sham Link OSPF_SL0 to address 101.0.0.4 is up
Area 0 source address 101.0.0.3
  Run as demand circuit
  DoNotAge LSA allowed. Cost of using 214 State POINT_TO_POINT,
  Timer intervals configured, Hello 10, Dead 40, Wait 40,
    Hello due in 00:00:07
    Adjacency State FULL (Hello suppressed)
    Index 2/2, retransmission queue length 0, number of retransmission 0
    First 0x0(0)/0x0(0) Next 0x0(0)/0x0(0)
    Last retransmission scan length is 0, maximum is 0
    Last retransmission scan time is 0 msec, maximum is 0 msec

R3#show ospfv3 vrf OSPF sham-links
          OSPFv3 101 address-family ipv6 vrf OSPF (router-id 192.3.101.3)
Sham Link OSPFv3_SL0 to address ::101:0:0:4 is up
  Interface ID 25
  Area 0 source address ::101:0:0:3
  Run as demand circuit
  DoNotAge LSA allowed. Cost of using 1
  Transmit Delay is 1 sec, State POINT_TO_POINT,
  Timer intervals configured, Hello 10, Dead 40, Wait 40, Retransmit 5
    Adjacency State FULL (Hello suppressed)
    Index 1/2/2, retransmission queue length 0, number of retransmission 0
    First 0x0(0)/0x0(0)/0x0(0) Next 0x0(0)/0x0(0)/0x0(0)
    Last retransmission scan length is 0, maximum is 0
    Last retransmission scan time is 0 msec, maximum is 0 msec

To verify the topological significance of the sham-link, we will inspect the LSA1 from the perspective of CSR3. Most of the output is removed, but we can see a new link inside the LSA1. It uses a link address of 0.0.0.24, which is a new software IDB generated by the router specifically for the sham-link. XE allows us to see this new interface number for confirmation as well. The link is point-to-point and has a cost of 214 as expected. If we didn’t know a sham-link was being used, this LSA1 output is indistinguishable from an ordinary OSPF P2P link using IP unnumbered.

R3#show ip ospf 101 database router self-originate
            OSPF Router with ID (192.3.101.3) (Process ID 101)
                Router Link States (Area 0)
  LS age: 732
  Options: (No TOS-capability, DC)
  [snip]
    Link connected to: another Router (point-to-point)
     (Link ID) Neighboring Router ID: 192.4.101.4
     (Link Data) Router Interface address: 0.0.0.24
      Number of MTID metrics: 0
       TOS 0 Metrics: 214
  [snip]

R3#show ifnum list | include 24
SWIDB 24 ACTIVE SL0 08E00000 00000000 00000000 00000000 00000000 00000000 00000000 00000000


For IPv6, the output is similar. The router is DC-capable, and has a link to CSR4 using interface ID 25. The IDB process confirms this is used for a sham-link as well. The link is point-to-point with the default cost of 1.

R3#show ospfv3 vrf OSPF database router self-originate
          OSPFv3 101 address-family ipv6 vrf OSPF (router-id 192.3.101.3)
                Router Link States (Area 0)
  LS age: 674
  Options: (V6-Bit, E-Bit, R-Bit, DC-Bit)
  [snip]
    Link connected to: another Router (point-to-point)
      Link Metric: 1
      Local Interface ID: 25
      Neighbor Interface ID: 25
      Neighbor Router ID: 192.4.101.4

R3#show ifnum list | include 25
SWIDB 25 ACTIVE SL0 08EA0000 00000000 00000000 00000000 00000000 00000000 00000000 00000000

Because the CEs share area 0 with the PEs, they can see this new link in the OSPF graph. Like the MPLS-TE forwarding-adjacency, they use this in their SPF calculations. CSR6 now routes over MPLS to reach XRv4’s loopback. The total OSPF path cost is 217. A cost of 1 represents the local PE-CE hop, 214 for the sham-link, 1 more for the remote PE-CE hop, and 1 more for the loopback itself. Of greatest significance is the route-type, which is now intra-area. The routes that were once inter-area, as defined by being in the same domain, are now bridged into the same area by the sham-link. A virtual-link connects a disjoint area 0 over a non-backbone OSPF area while a sham-link connects a disjoint area 0 over MPLS. The IPv6 route is similar, but has a metric of 3 to account for the two PE-CE links and the sham-link default cost of 1. The cost of the IPv6 loopback is 0 compared to the IPv4 loopback cost of 1.

R6#show ip route vrf OSPF 192.101.14.14
Routing Table: OSPF
Routing entry for 192.101.14.14/32
  Known via "ospf 101", distance 110, metric 217, type intra area
  Last update from 192.4.101.4 on GigabitEthernet2.101, 00:17:57 ago
  Routing Descriptor Blocks:
  * 192.4.101.4, from 0.0.101.14, 00:17:57 ago, via GigabitEthernet2.101
      Route metric is 217, traffic share count is 1

R6#show ipv6 route vrf OSPF ::192:101:14:14
Routing entry for ::192:101:14:14/128
  Known via "ospf 101", distance 110, metric 3, type intra area
  Route count is 1/1, share count 0
  Routing paths:
    FE80::4, GigabitEthernet2.101
      Last updated 00:15:46 ago
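The IPv4 cost arithmetic can be checked with a small shortest-path sketch in Python. It models only the intra-area graph that CSR6 sees for this square; the node names and the `spf` and `build_graph` helpers are illustrative, and the costs are the ones quoted in the text (1 per PE-CE link, 214 for the sham-link, 500 for the backdoor, and 1 for XRv4's loopback).

```python
import heapq


def spf(graph, source):
    """Plain Dijkstra: return the lowest cost from source to every node."""
    dist = {source: 0}
    heap = [(0, source)]
    while heap:
        d, node = heapq.heappop(heap)
        if d > dist.get(node, float("inf")):
            continue  # stale heap entry
        for nbr, cost in graph.get(node, {}).items():
            nd = d + cost
            if nd < dist.get(nbr, float("inf")):
                dist[nbr] = nd
                heapq.heappush(heap, (nd, nbr))
    return dist


LOOPBACK = "192.101.14.14/32"


def build_graph(with_sham_link):
    """Intra-area topology for area 0 as seen from CSR6 (IPv4 costs)."""
    graph = {
        "CSR6": {"CSR4": 1, "XRv4": 500},
        "CSR4": {"CSR6": 1},
        "CSR3": {"XRv4": 1},
        "XRv4": {"CSR3": 1, "CSR6": 500, LOOPBACK: 1},
    }
    if with_sham_link:
        # The sham-link appears as an ordinary point-to-point link.
        graph["CSR4"]["CSR3"] = 214
        graph["CSR3"]["CSR4"] = 214
    return graph


before = spf(build_graph(False), "CSR6")[LOOPBACK]
after = spf(build_graph(True), "CSR6")[LOOPBACK]
print(before, after)
```

Without the sham-link the only intra-area path is the backdoor at 501; with it, the path through the PEs costs 1 + 214 + 1 + 1 = 217 and wins on metric alone.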

A quick traceroute from CSR6 proves that the sham-link is working correctly. Traffic is being routed towards the L3VPN PEs and encapsulated inside MPLS.

R6#traceroute vrf OSPF 192.101.14.14 source 192.101.6.6
Type escape sequence to abort.
Tracing the route to 192.101.14.14
VRF info: (vrf in name/id, vrf out name/id)
  1 192.4.101.4 4 msec 4 msec 4 msec
  2 192.3.101.3 [MPLS: Label 3015 Exp 0] 4 msec 4 msec 4 msec
  3 192.3.101.14 5 msec 13 msec 13 msec

R6#traceroute vrf OSPF ipv6
Target IPv6 address: ::192:101:14:14
Source address: ::192:101:6:6
[snip]
Type escape sequence to abort.
Tracing the route to ::192:101:14:14
  1 FD00:192:4:101::4 4 msec 4 msec 3 msec
  2 FD00:192:3:101::3 [MPLS: Label 3016 Exp 0] 4 msec 4 msec 4 msec
  3 ::192:101:14:14 212 msec 5 msec 5 msec

To verify that there is no disruption to the customer network, we will verify that the sham-link endpoints did not leak into the customer RIB. This should be impossible given the filtering we configured earlier, but we will ensure the routes do not exist. This makes the sham-link entirely transparent to the customer CE devices.

R6#show ip route vrf OSPF 101.0.0.0 255.0.0.0 longer-prefixes | begin Gateway
Gateway of last resort is not set
[no output]

RP/0/0/CPU0:XRv4#show route vrf OSPF ipv4 | include 101.0.0
[no output]

R6#show ipv6 route vrf OSPF ::101:0:0:0/80 longer-prefixes | begin Applica
       a – Application
[no output]

RP/0/0/CPU0:XRv4#show route vrf OSPF ipv6 | include ::101
[no output]

The same exact problem presents itself on the other side of the network with XRv1 and XRv2 as PEs. XRv3 prefers the high-cost backdoor link because the route to CSR5’s loopback is intra-area via that path.

RP/0/0/CPU0:XRv3#show route vrf OSPF 192.101.5.5
Routing entry for 192.101.5.5/32
  Known via "ospf 101", distance 110, metric 501, type intra area
  Routing Descriptor Blocks
    192.5.101.5, from 192.12.101.6, via GigabitEthernet0/0/0/0.201
      Route metric is 501
  No advertising protos.

The concepts for sham-links in XR are exactly the same and the configuration is similar. First, we will create the loopbacks on XRv1 and XRv2 inside of the customer VPN. Then, we advertise them into BGP using a network statement or redistribution. XR does not support the “match interface” or anything equivalent within RPL, so we can minimize RPL configuration using parameters.

! XRv1
interface Loopback101
 description SHAM LINK ENDPOINT
 vrf OSPF
 ipv4 address 101.0.0.11 255.255.255.255
 ipv6 address ::101:0:0:11/128
router bgp 214
 vrf OSPF
  address-family ipv4 unicast
   network 101.0.0.11/32
  address-family ipv6 unicast
   network ::101:0:0:11/128

! XRv2
interface Loopback101
 description SHAM LINK ENDPOINT
 vrf OSPF
 ipv4 address 101.0.0.12 255.255.255.255
 ipv6 address ::101:0:0:12/128
prefix-set PS_SHAM_V4
  101.0.0.0/8 ge 32
end-set
prefix-set PS_SHAM_V6
  ::101:0:0:0/80 ge 128
end-set
route-policy RPL_SHAM_TO_BGP($PS)
  if destination in $PS then
    pass
  endif
end-policy
router bgp 214
 vrf OSPF
  address-family ipv4 unicast
   redistribute connected route-policy RPL_SHAM_TO_BGP(PS_SHAM_V4)
  address-family ipv6 unicast
   redistribute connected route-policy RPL_SHAM_TO_BGP(PS_SHAM_V6)

We will verify that these routes were successfully advertised into BGP on both PEs. Notice that the remote sham-link endpoints from CSR3 and CSR4 are also learned; they are still inside the iBGP advertisement space. This allows us to build sham-links to CSR3 and CSR4 if we desire, but since there are no east-to-west backdoor links, there is little reason to do so. We can see all 4 routes from all 4 PEs within IPv4 and IPv6 scopes.

RP/0/0/CPU0:XRv1#show bgp vpnv4 unicast vrf OSPF | include 101.0.0
*>i101.0.0.3/32       214.0.0.3        0    100      0 i
*>i101.0.0.4/32       214.0.0.4        0    100      0 ?
*> 101.0.0.11/32      0.0.0.0          0         32768 i
* i101.0.0.12/32      214.0.0.12       0    100      0 ?

RP/0/0/CPU0:XRv1#show bgp vpnv6 unicast vrf OSPF | include ::101:0:0
*>i::101:0:0:3/128    214.0.0.3        0    100      0 i
*>i::101:0:0:4/128    214.0.0.4        0    100      0 ?
*> ::101:0:0:11/128   ::               0         32768 i
* i::101:0:0:12/128   214.0.0.12       0    100      0 ?

We verify reachability between these XRv1/XRv2 sham-link endpoints, again relying on basic MPLS L3VPN (6VPE) connectivity. This is exactly the same procedure we followed for the XE sham-link.

RP/0/0/CPU0:XRv1#ping vrf OSPF 101.0.0.12 source 101.0.0.11
Type escape sequence to abort.
Sending 5, 100-byte ICMP Echos to 101.0.0.12, timeout is 2 seconds:
!!!!!
Success rate is 100 percent (5/5), round-trip min/avg/max = 1/20/89 ms

RP/0/0/CPU0:XRv1#ping vrf OSPF ::101:0:0:12 source ::101:0:0:11
Type escape sequence to abort.
Sending 5, 100-byte ICMP Echos to ::101:0:0:12, timeout is 2 seconds:
!!!!!
Success rate is 100 percent (5/5), round-trip min/avg/max = 1/1/1 ms

Like last time, we should remove these sham-link endpoints from OSPF. Notice that XRv2 is originating XRv1’s sham-link endpoint and vice versa; they are learning one another’s BGP routes then redistributing them back into OSPF. Fortunately, the DN-bit protects the PEs from installing these routes, but it is still bad design and may be confusing for customers, who will install these routes. XRv1, for example, has these new loopbacks inside of OSPFv3.

RP/0/0/CPU0:XRv1#show ospfv3 vrf OSPF database | include ::101:0:0:1[12]
0.0.101.11      141         0x80000001 ::101:0:0:12/128
214.0.0.12      81          0x80000001 ::101:0:0:11/128

To correct it, we can use RPL to filter any sham-link endpoints when redistributing from BGP to OSPF. The prefix-sets we used above on XRv2 can be re-used for this task and are not shown again. They must be copied to XRv1 since only XRv2 has them configured at present (XRv1 used network statements instead). Using a new RPL policy along with parameterization, we instruct OSPF to filter any sham-link endpoints from entering from BGP. The filter also accounts for the CSR3/CSR4 sham-link endpoints. For this reason, when there are many PEs in an OSPF-serving L3VPN, the sham-link endpoints should aggregate nicely to simplify this filtering task.

! XRv1 and XRv2
route-policy RPL_BGP_TO_OSPF($SHAMS)
  if destination in $SHAMS then
    drop
  else
    pass
  endif
end-policy
router ospf 101
 vrf OSPF
  redistribute bgp 214 route-policy RPL_BGP_TO_OSPF(PS_SHAM_V4)
router ospfv3 101
 vrf OSPF
  redistribute bgp 214 route-policy RPL_BGP_TO_OSPF(PS_SHAM_V6)

To verify it, we can check XRv1 for any 101.0.0.X or ::101:0:0:X routes. Now, we don’t see any in the PE LSDBs or in the customer routing tables.

RP/0/0/CPU0:XRv1#show ospfv3 vrf OSPF database | include ::101:0:0:
[no output]

RP/0/0/CPU0:XRv1#show ospf vrf OSPF database | include 101.0.0
[no output]

RP/0/0/CPU0:XRv3#show route vrf OSPF ipv4 | include 101.0.0
[no output]

RP/0/0/CPU0:XRv3#show route vrf OSPF ipv6 | include ::101
[no output]


Next, we will configure the sham-link. The sham-link has its own stanza in XR, which is convenient because it has many sub-options like authentication and cost. We will set a custom cost and enable basic MD5 authentication on the OSPFv2 sham-link. The OSPFv3 sham-link will be more secure with IPSec encryption using DES and SHA-1 hash. For brevity, only XRv1’s configuration is shown since the only difference on XRv2 is reversing the IPv4/v6 addresses.

! XRv1
router ospf 101
 vrf OSPF
  area 0
   sham-link 101.0.0.11 101.0.0.12
    cost 12
    authentication message-digest
    message-digest-key 1 md5 SHAM_AUTH
router ospfv3 101
 vrf OSPF
  area 0
   sham-link ::101:0:0:11 ::101:0:0:12
    encryption ipsec spi 1212 esp des clear 1234567890abcdef authentication sha1 clear 1234567890123456789012345678901234567890

Like any other OSPFv2/v3 neighbor, we will see a log message for the sham-links, along with the numbers allocated to them. Unlike XE, XR allocates different numbers for different sham-links between processes. XE assigned both SLs a value of 0 on CSR3 and CSR4 because the OSPF processes are entirely different.

! XRv1
ospfv3[1028]: %ROUTING-OSPFv3-5-ADJCHG : Process 101, Nbr 214.0.0.12 on OSPF_SL1 from LOADING to FULL, Loading Done
ospf[1018]: %ROUTING-OSPF-5-ADJCHG : Process 101, Nbr 214.0.0.12 on OSPF_SL0 in area 0 from LOADING to FULL, Loading Done, vrf OSPF vrfid 0x60000007

In case you miss the log messages, the XR show commands give the same details as the XE show commands. We verify that the link is a demand-circuit, point-to-point, suppressing hellos, has the proper cost/authentication, and is operational.

RP/0/0/CPU0:XRv1#show ospf vrf OSPF sham-links
Sham Links for OSPF 101, VRF OSPF
Sham Link OSPF_SL0 to address 101.0.0.12 is up
Area 0, source address 101.0.0.11
  IfIndex = 2
  Run as demand circuit
  DoNotAge LSA allowed., Cost of using 12
  Transmit Delay is 1 sec, State POINT_TO_POINT,
  Timer intervals configured, Hello 10, Dead 40, Wait 40, Retransmit 5
    Hello due in 00:00:07:199
    Adjacency State FULL (Hello suppressed)
    Number of DBD retrans during last exchange 0
    Index 2/2, retransmission queue length 0, number of retransmission 0
    First 0(0)/0(0) Next 0(0)/0(0)
    Last retransmission scan length is 0, maximum is 0
    Last retransmission scan time is 0 msec, maximum is 0 msec
  Message digest authentication enabled
    Youngest key id is 1

RP/0/0/CPU0:XRv1#show ospfv3 vrf OSPF sham-links
Sham Links for OSPFv3 101, VRF OSPF
Sham Link OSPF_SL1 to address ::101:0:0:12 is up, ipsec is up
Area 0, source address ::101:0:0:11
  IfIndex = 2
  ESP Encryption DES, Authentication SHA1, SPI 1212
  Run as demand circuit
  DoNotAge LSA allowed., Cost of using 1
  Transmit Delay is 1 sec, State POINT_TO_POINT,
  Timer intervals configured, Hello 10, Dead 40, Wait 40, Retransmit 5
    Hello due in 00:00:02
    Adjacency State FULL (Hello suppressed)
    Number of DBD retrans during last exchange 0
    Index 2/2, retransmission queue length 0, number of retransmission 0
    First 0(0)/0(0) Next 0(0)/0(0)
    Last retransmission scan length is 0, maximum is 0
    Last retransmission scan time is 0 msec, maximum is 0 msec

Next, we will confirm that the link exists within the router LSA on XRv2 for both OSPFv2 and OSPFv3. We can see that the OSPFv2 link has a cost of 12, is point-to-point, is DC-capable, and connects directly to XRv1. The OSPFv3 link has all the same attributes except uses a cost of 1, the default. Just like ordinary OSPF P2P interfaces, the costs need not be symmetric, but for simplicity, I configure them so.

RP/0/0/CPU0:XRv2#show ospf vrf OSPF database router self-originate
            OSPF Router with ID (214.0.0.12) (Process ID 101, VRF OSPF)
                Router Link States (Area 0)
  LS age: 490
  Options: (No TOS-capability, DC)
  [snip]
    Link connected to: another Router (point-to-point)
     (Link ID) Neighboring Router ID: 0.0.101.11
     (Link Data) Router Interface address: 0.0.0.2
      Number of TOS metrics: 0
       TOS 0 Metrics: 12

RP/0/0/CPU0:XRv2#show ospfv3 vrf OSPF database router self-originate
            OSPFv3 Router with ID (214.0.0.12) (Process ID 101 VRF OSPF)
                Router Link States (Area 0)
  LS age: 554
  Options: (V6-Bit E-Bit R-Bit DC-Bit)
  [snip]
    Link connected to: another Router (point-to-point)
      Link Metric: 1
      Local Interface ID: 2
      Neighbor Interface ID: 2
      Neighbor Router ID: 0.0.101.11

Because XRv3 and CSR5 can use this sham-link in their SPF calculations, the MPLS network is considered intra-area and its cost can be directly compared to the slow backdoor. XRv3 now routes to CSR5’s loopback via MPLS. The total cost is 15: 1 for the local PE-CE link, 12 for the sham-link, 1 for the remote PE-CE link, and 1 for the loopback. The type is intra-area as expected. The output is similar for IPv6 with a different metric based on the IPv6 topology and sham-link cost.

RP/0/0/CPU0:XRv3#show route vrf OSPF 192.101.5.5
Routing entry for 192.101.5.5/32
  Known via "ospf 101", distance 110, metric 15, type intra area
  Routing Descriptor Blocks
    192.11.101.11, from 192.12.101.6, via GigabitEthernet0/0/0/0.101
      Route metric is 15
  No advertising protos.

RP/0/0/CPU0:XRv3#show route vrf OSPF ipv6 ::192:101:5:5
Routing entry for ::192:101:5:5/128
  Known via "ospf 101", distance 110, metric 3, type intra area
  Routing Descriptor Blocks
    fe80::11, from ::, via GigabitEthernet0/0/0/0.101
      Route metric is 3
  No advertising protos.

Traceroute confirms the proper traffic pattern. XR does not show the labels for IPv6, possibly due to a cosmetic adjustment, but we can see the 214.X.X.X address which represents the MPLS core. There is no possibility of this IPv6 traceroute working without MPLS encapsulation, so we can safely assume the traffic was labeled.

RP/0/0/CPU0:XRv3#traceroute vrf OSPF 192.101.5.5 source 192.101.13.13
Type escape sequence to abort.
Tracing the route to 192.101.5.5
 1  192.11.101.11 0 msec 0 msec 0 msec
 2  214.11.12.12 [MPLS: Label 92011 Exp 0] 0 msec 0 msec 0 msec
 3  192.12.101.5 0 msec 0 msec 0 msec

RP/0/0/CPU0:XRv3#traceroute vrf OSPF ::192:101:5:5 source ::192:101:13:13
Type escape sequence to abort.
Tracing the route to ::192:101:5:5
 1  fd00:192:11:101::11 9 msec 19 msec 9 msec
 2  2001:214:11:12::12 0 msec 0 msec 0 msec
 3  fd00:192:12:101::5 0 msec 0 msec 0 msec

For good measure, we will use TCL to verify reachability from CSR5 and CSR6 to all endpoints for both AFIs. XR technically supports TCL, but it is very complicated and is not covered in this chapter. We will therefore limit the TCL verifications to XE CE routers. The outputs from CSR5 and CSR6 are not shown, but the pings were 100% successful to all destinations. Notice the OSPF “external” test prefixes are included here also.

! CSR5 and CSR6
tclsh
foreach x {
192.101.5.5
192.101.6.6
192.101.13.13
192.101.14.14
192.60.6.6
::192:101:5:5
::192:101:6:6
::192:101:13:13
::192:101:14:14
::192:60:6:6
} { ping vrf OSPF $x source loopback101 repeat 3 timeout 1 }

38.1.3 EIGRP and Site-of-Origin (SoO)

EIGRP is another option for PE-CE routing with L3VPN. The VRF-aware EIGRP configuration is almost identical for IPv4 and IPv6, assuming we use named-mode on the XE platforms. There is no “DN-bit” concept in EIGRP, so the VRF-lite configuration is not much different from the PE, with the exception of BGP redistribution. For EIGRP IPv6, there is no “network” statement, so all IPv6-enabled interfaces are automatically configured for EIGRP. They can be selectively disabled by shutting them down under the EIGRP process as needed, but in our case, we don’t need to disable EIGRP IPv6 on any interfaces. Only the configurations for CSR5 and XRv3 are shown for brevity. The backdoor link is Gig2.202; it exists between XRv4/CSR6 and XRv3/CSR5 and is currently shut down. Note that XR does not log EIGRP neighbor changes by default, so I personally enable this feature at all times to assist with troubleshooting.

! CSR5
router eigrp VPN
 address-family ipv4 unicast vrf EIGRP autonomous-system 102
  network 192.0.0.0 0.255.255.255
 address-family ipv6 unicast vrf EIGRP autonomous-system 102

! XRv3
router eigrp VPN
 vrf EIGRP
  address-family ipv4
   router-id 13.13.13.13
   log-neighbor-changes
   autonomous-system 102
   interface Loopback102
    passive-interface
   interface GigabitEthernet0/0/0/0.102
   interface GigabitEthernet0/0/0/0.202
  address-family ipv6
   router-id 13.13.13.13
   log-neighbor-changes
   autonomous-system 102
   interface Loopback102
    passive-interface
   interface GigabitEthernet0/0/0/0.102
   interface GigabitEthernet0/0/0/0.202

We quickly verify the neighbors are up on all EIGRP PEs for both AFIs. We only expect to see one neighbor on each PE, which is the corresponding CE.

R3#show eigrp address-family ipv4 vrf EIGRP neighbors | begin ^0
0   192.3.102.14            Gi2.102          11 1d20h      29   174  0  67

R3#show eigrp address-family ipv6 vrf EIGRP neighbors | begin ^0
0   Link-local address:     Gi2.102          12 1d09h      27   162  0  64
    FE80::14

R4#show eigrp address-family ipv4 vrf EIGRP neighbors | begin ^0
0   192.4.102.6             Gi2.102          13 2d01h       1   100  0  57

R4#show eigrp address-family ipv6 vrf EIGRP neighbors | begin ^0
0   Link-local address:     Gi2.102          11 2d01h       1   100  0  62
    FE80::6

RP/0/0/CPU0:XRv1#show eigrp vrf EIGRP ipv4 neighbors | begin ^0
0   192.11.102.13           Gi0/0/0/0.102    11 1d20h      15   200  0  60

RP/0/0/CPU0:XRv1#show eigrp vrf EIGRP ipv6 neighbors | begin ^0
0   Link Local Address:     Gi0/0/0/0.102    13 11:20:03   15   200  0  82
    fe80::13

RP/0/0/CPU0:XRv2#show eigrp vrf EIGRP ipv4 neighbors | begin ^0
0   192.12.102.5            Gi0/0/0/0.102    11 06:16:52    1   200  0  78

RP/0/0/CPU0:XRv2#show eigrp vrf EIGRP ipv6 neighbors | begin ^0
0   Link Local Address:     Gi0/0/0/0.102    11 05:40:54    1   200  0  90
    fe80::5


We verify that CSR4 and XRv2 learn some EIGRP routes from CSR6 and CSR5 in both IPv4 and IPv6 AFIs. In this case, there is a single loopback from each CE, just like there was for OSPF. This confirms the CE-to-PE route exchange.

R4#show ip route vrf EIGRP eigrp | begin Gateway
Gateway of last resort is not set
      192.102.6.0/32 is subnetted, 1 subnets
D        192.102.6.6 [90/10880] via 192.4.102.6, 1d21h, GigabitEthernet2.102

R4#show ipv6 route vrf EIGRP eigrp | begin Applic
       a - Application
D   ::192:102:6:6/128 [90/10880]
     via FE80::6, GigabitEthernet2.102

RP/0/0/CPU0:XRv2#show route vrf EIGRP ipv4 eigrp
D    192.102.5.5/32 [90/10880] via 192.12.102.5, 06:22:27, GigabitEthernet0/0/0/0.102

RP/0/0/CPU0:XRv2#show route vrf EIGRP ipv6 eigrp
D    ::192:102:5:5/128
      [90/10880] via fe80::5, 05:46:31, GigabitEthernet0/0/0/0.102

These EIGRP routes are redistributed from EIGRP into BGP, and vice versa, on all four PEs. The configurations on CSR4 and XRv2 are shown; they are identical to the VRF-aware CE configurations with the addition of redistribution discussed earlier.

! CSR4
router eigrp VPN
 address-family ipv4 unicast vrf EIGRP autonomous-system 102
  topology base
   redistribute bgp 214
 address-family ipv6 unicast vrf EIGRP autonomous-system 102
  topology base
   redistribute bgp 214

! XRv2
router eigrp VPN
 vrf EIGRP
  address-family ipv4
   redistribute bgp 214
  address-family ipv6
   redistribute bgp 214

When we verify the routes inside of VPNv4/v6, we see that there are several new extended communities attached to them. The cost-community has a dedicated section and is not discussed in detail here, but notice that the point of insertion (POI) is pre-bestpath, which is value 128. XE decodes this for us but XR does not. The second 128 represents the community-ID which allows cost-communities to be compared directly, not to be confused with the POI value of 128 meant to indicate pre-bestpath evaluation. When used in this fashion, the cost-community always uses POI and community ID values of 128; this cannot be modified. The metric value is easy to read and is directly derived from the RIB metric of 10880.

R4#show ip route vrf EIGRP 192.102.6.6 | include metric
  Known via "eigrp 102", distance 90, metric 10880, type internal
      Route metric is 10880, traffic share count is 1

RP/0/0/CPU0:XRv2#show route vrf EIGRP ipv4 192.102.5.5/32 | include metric
  Known via "eigrp 102", distance 90, metric 10880, type internal
      Route metric is 10880

The other extended-community values are EIGRP-specific and are encoded according to the table below.

Attribute        Type    Usage                  Value
General          0x8800  Route information      Route flag and tag
Metric           0x8801  Route metric and AS    ASN and delay
                 0x8802  Route metric           Reliability, hop count, and bandwidth
                 0x8803  Route metric           Reserved, load, and MTU
External         0x8804  External route metric  Remote AS and remote ID
                 0x8805  External route metric  Remote protocol and remote metric
Originating RID  0x8806  Originating RID        Originating RID

R4#show bgp vpnv4 unicast vrf EIGRP 192.102.6.6/32
[snip]
  Local
    192.4.102.6 (via vrf EIGRP) from 0.0.0.0 (214.0.0.4)
      Origin incomplete, metric 10880, localpref 100, weight 32768, valid, sourced, best
      Extended Community: RT:214:102 Cost:pre-bestpath:128:10880
        0x8800:32768:0 0x8801:102:288 0x8802:65281:2560 0x8803:65281:1500
        0x8806:0:3221513734
      mpls labels in/out 4012/nolabel
      rx pathid: 0, tx pathid: 0x0
RP/0/0/CPU0:XRv2#show bgp vpnv4 unicast vrf EIGRP 192.102.5.5/32
[snip]
  Local
    192.12.102.5 from 0.0.0.0 (214.0.0.12)
      Origin incomplete, metric 10880, localpref 100, weight 32768, valid, redistributed, best, group-best, import-candidate
      Received Path ID 0, Local Path ID 1, version 639
      Extended community: COST:128:128:10880 EIGRP route-info:0x8000:0 EIGRP AD:102:288 EIGRP RHB:255:1:2560 EIGRP LM:0x0:1:1500 EIGRP VRR:0x0:5.5.12.192 RT:214:102
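Because XE prints these communities as raw type codes, it can help to label them using the table above. The following is a minimal sketch; the dictionary and function name are illustrative, not part of any Cisco tooling.

```python
# Labels paraphrase the table above; the mapping exists only for illustration.
EIGRP_EXTCOMM = {
    0x8800: "route flag and tag",
    0x8801: "ASN and delay",
    0x8802: "reliability/hop count and bandwidth",
    0x8803: "reserved/load and MTU",
    0x8804: "remote AS and remote ID",
    0x8805: "remote protocol and remote metric",
    0x8806: "originating RID",
}

def label_xe_communities(text):
    """Turn XE tokens such as '0x8801:102:288' into (label, value1, value2)."""
    labeled = []
    for token in text.split():
        type_hex, first, second = token.split(":")
        label = EIGRP_EXTCOMM.get(int(type_hex, 16), "unknown")
        labeled.append((label, int(first), int(second)))
    return labeled

for entry in label_xe_communities("0x8800:32768:0 0x8801:102:288 0x8802:65281:2560"):
    print(entry)
```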

There isn’t much reason to deep-dive the extended communities, but we will verify the values for sanity once. XR makes it a little easier by naming the communities by their purpose. 0x8800 carries the route flag, which is 0x8000 (32768), and the route-tag (not set) on both routes. 0x8801 carries the AS number (102) and the route delay (288). We can check the EIGRP topology on both routers to confirm a value of 288, which is the delay in tens of microseconds scaled by 256 (11,250,000 picoseconds * 256 = 2,880,000,000 picoseconds = 2880 microseconds = 288 tens of microseconds).

R4#show eigrp address-family ipv4 vrf EIGRP topology 192.102.6.6/32
EIGRP-IPv4 VR(VPN) Topology Entry for AS(102)/ID(192.4.102.4)
           Topology(base) TID(0) VRF(EIGRP)
EIGRP-IPv4(102): Topology base(0) entry for 192.102.6.6/32
  State is Passive, Query origin flag is 1, 1 Successor(s), FD is 1392640, RIB is 10880
  Descriptor Blocks:
  192.4.102.6 (GigabitEthernet2.102), from 192.4.102.6, Send flag is 0x0
      Composite metric is (1392640/163840), route is Internal
      Vector metric:
        Minimum bandwidth is 1000000 Kbit
        Total delay is 11250000 picoseconds
        [snip]
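The delay arithmetic can be checked with a few lines of Python. This is a sketch that assumes the scaling inferred in the text (delay in tens of microseconds multiplied by 256); the function name is hypothetical.

```python
from fractions import Fraction

PS_PER_TEN_US = 10_000_000  # 10 microseconds expressed in picoseconds

def classic_scaled_delay(total_delay_ps):
    """Convert a topology-table delay in picoseconds to the 0x8801 delay value."""
    tens_of_us = Fraction(total_delay_ps, PS_PER_TEN_US)
    return int(tens_of_us * 256)

print(classic_scaled_delay(11_250_000))  # delay shown in CSR4's topology entry
```

Using Fraction avoids floating-point rounding on the intermediate 1.125 value.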

Verifying 0x8802 is also somewhat tricky. The first 2-byte value contains both the reliability and the hop count together. 65281 translates to 0xFF01, which represents a reliability of 255 and a hop count of 1.

R4#show eigrp address-family ipv4 vrf EIGRP topology 192.102.6.6/32 | include Reliab|Hop
        Reliability is 255/255
        Hop count is 1
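Splitting the 2-byte value into its two bytes is a simple shift-and-mask operation. A minimal sketch (function name hypothetical):

```python
def split_two_bytes(value):
    """Return (high byte, low byte) of a 16-bit value such as 65281 (0xFF01)."""
    if not 0 <= value <= 0xFFFF:
        raise ValueError("expected a 16-bit value")
    return (value >> 8) & 0xFF, value & 0xFF

reliability, hop_count = split_two_bytes(65281)
print(hex(65281), reliability, hop_count)
```

The same split applies to the first value of 0x8803, where the high byte is the reserved field and the low byte is the load.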

Extended community 0x8802 also carries the bandwidth value of 2560. This matches the classic EIGRP scaled bandwidth, which is the inverse bandwidth (10,000,000 divided by the minimum bandwidth in Kbps) scaled by 256: 10,000,000 / 1,000,000 = 10, and 10 * 256 = 2560.

R4#show eigrp address-family ipv4 vrf EIGRP topology 192.102.6.6/32 | include bandw
        Minimum bandwidth is 1000000 Kbit
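The bandwidth value can be reproduced under the assumption that the community carries the classic EIGRP scaled bandwidth, 256 * (10^7 / minimum bandwidth in Kbps). That formula is not stated in the output itself, so treat this as a sketch with a hypothetical function name.

```python
CLASSIC_BW_REFERENCE = 10_000_000  # assumed classic EIGRP bandwidth constant (Kbps)

def classic_scaled_bandwidth(min_bw_kbps):
    """Return the assumed classic scaled (inverse) bandwidth for a path."""
    if min_bw_kbps <= 0:
        raise ValueError("bandwidth must be positive")
    return 256 * CLASSIC_BW_REFERENCE // min_bw_kbps

print(classic_scaled_bandwidth(1_000_000))  # 1 Gbps minimum bandwidth
```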

The extended community 0x8803 has a decimal value of 65281, which we know evaluates to 0xFF01. The 1 represents the load, and I assume the “Reserved” field is set to 0xFF. XR seems to set this to zero, though, so it is probably ignored. The route MTU is clearly shown as a decimal value of 1500, which is convenient.

R4#show eigrp address-family ipv4 vrf EIGRP topology 192.102.6.6/32 | include MTU|Load
        Load is 1/255
        Minimum MTU is 1500

The decimal number carried in 0x8806 by CSR4 is 0xC0046606 in hex. This translates to 192.4.102.6, while XRv2 shows CSR5’s loopback in reverse order. This is the originating router’s EIGRP RID. We can quickly confirm this by checking the topology information. Ideally both RIDs would be the loopback, but since they are not manually set, EIGRP selected the first available interface at the time I configured it. For external routes, this is an important field for loop prevention.

R4#show eigrp address-family ipv4 vrf EIGRP topology 192.102.6.6/32 | include Origin
        Originating router is 192.4.102.6
RP/0/0/CPU0:XRv2#show eigrp vrf EIGRP ipv4 topology 192.102.5.5/32 | include Origin
        Originating router is 192.102.5.5
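Both representations of the RID can be decoded with Python's standard ipaddress module. This is a small sketch; the helper names are hypothetical, and the reversed string 5.5.102.192 is used as an example of the XR-style octet order.

```python
import ipaddress

def rid_from_int(value):
    """Decode the 32-bit decimal RID that XE prints in 0x8806."""
    return str(ipaddress.IPv4Address(value))

def reverse_octets(dotted):
    """Undo the reversed octet order used in the XR 'EIGRP VRR' field."""
    return ".".join(reversed(dotted.split(".")))

print(hex(3221513734), rid_from_int(3221513734))
print(reverse_octets("5.5.102.192"))
```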

These extended communities are added to BGP routes automatically, by default, when EIGRP is redistributed from within a VPN. All of this information is necessary so that routes across the VPN, assuming the ASN matches, can be considered internal. The MPLS network essentially collapses to a single node in the EIGRP topology and can be used to seamlessly bridge EIGRP islands. This is somewhat like a sham-link except that it is zero-cost and in-band, whereas the sham-link was an explicit, out-of-band link with an associated OSPF cost (it was part of the graph). As is normal for any IGP to BGP redistribution, the IGP metric is also copied into the BGP MED. Checking XRv2, we can see the IGP metric inside the BGP MED, which is installed in the local routing table as an iBGP route.

RP/0/0/CPU0:XRv2#show route vrf EIGRP ipv4 192.102.6.6/32 | include metric
  Known via "bgp 214", distance 200, metric 10880, type internal
    Route metric is 10880

Checking the EIGRP topology on CSR5, we can see all of the original routing information from CSR6’s loopback is preserved across the L3VPN. The hop count is only two, since the MPLS network (including the PEs) only counts as 1 hop. The originating router is still the same, and the reported distance (RD) of 1392640 is the FD of the remote PE (CSR4). This proves that the MPLS network is viewed as a single hop; because it’s one hop, it looks like one router, and thus has no cost. The PE-CE link costs are accounted for, however. For OSPF, the sham-link itself incurred a cost of at least 1.

R5#show eigrp address-family ipv4 vrf EIGRP topology 192.102.6.6/32
EIGRP-IPv4 VR(VPN) Topology Entry for AS(102)/ID(192.102.5.5)
           Topology(base) TID(0) VRF(EIGRP)
EIGRP-IPv4(102): Topology base(0) entry for 192.102.6.6/32
  State is Passive, Query origin flag is 1, 1 Successor(s), FD is 2048000, RIB is 16000
  Descriptor Blocks:
  192.12.102.12 (GigabitEthernet2.102), from 192.12.102.12, Send flag is 0x0
      Composite metric is (2048000/1392640), route is Internal
      Vector metric:
        Minimum bandwidth is 1000000 Kbit
        Total delay is 21250000 picoseconds
        Reliability is 255/255
        Load is 1/255
        Minimum MTU is 1500
        Hop count is 2
        Originating router is 192.4.102.6
R4#show eigrp address-family ipv4 vrf EIGRP topology 192.102.6.6/32 | include Composite_metric
      Composite metric is (1392640/163840), route is Internal
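The FD and RIB values in these outputs can be reproduced from the vector metrics. The sketch below assumes the EIGRP wide-metric formula with default K values (only bandwidth and delay contribute), a scaling constant of 65536, and a RIB scale of 128; none of these constants are shown in the output, so treat them as assumptions. Function names are hypothetical.

```python
WIDE_SCALE = 65536           # assumed wide-metric scaling constant
BW_REFERENCE = 10_000_000    # assumed reference bandwidth in Kbps
RIB_SCALE = 128              # assumed default metric rib-scale

def wide_metric(min_bw_kbps, total_delay_ps):
    """Composite metric = scaled throughput + scaled latency (default K values)."""
    throughput = BW_REFERENCE * WIDE_SCALE // min_bw_kbps
    latency = total_delay_ps * WIDE_SCALE // 1_000_000
    return throughput + latency

def rib_metric(composite):
    """Scale the 64-bit composite metric down to the value installed in the RIB."""
    return composite // RIB_SCALE

for name, delay_ps in (("CSR4", 11_250_000), ("CSR5", 21_250_000)):
    fd = wide_metric(1_000_000, delay_ps)
    print(name, fd, rib_metric(fd))
```

Under these assumptions, the only difference between the CSR4 and CSR5 results is the extra 10,000,000 picoseconds of delay on the second PE-CE link, which is consistent with the MPLS core adding no cost of its own.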

We can disable the extended community population on CSR4 by telling EIGRP not to pass this information to BGP on a per-AFI basis. Notice that the normal BGP redistribution actions still occur: the IGP metric is copied to the BGP MED, the export route-target is imposed, and an MPLS label is allocated. The VPN can still work, but the remote PEs will not be able to assume this route is part of the same EIGRP AS as their local customers.

! CSR4
router eigrp VPN
 address-family ipv4 unicast vrf EIGRP autonomous-system 102
  no populate bgp-ext-comm
R4#show bgp vpnv4 unicast vrf EIGRP 192.102.6.6/32 | begin Local
  Local
    192.4.102.6 (via vrf EIGRP) from 0.0.0.0 (214.0.0.4)
      Origin incomplete, metric 10880, localpref 100, weight 32768, valid, sourced, best
      Extended Community: RT:214:102
      mpls labels in/out 4035/nolabel
      rx pathid: 0, tx pathid: 0x0

As soon as this happens, XRv2 prints out an angry syslog message. The message warns us that during the BGP to EIGRP redistribution, we did not set a metric. Normally this is required when redistributing into EIGRP (except when using connected, static, or other EIGRP routes). We didn’t set a metric earlier because the extended-communities handled that for us. Now, without any extended-community information, EIGRP cannot accept this route since it has no idea what the seed metric should be.

! XRv2
eigrp[1002]: %ROUTING-EIGRP-3-NO_REDISTMETRIC : EIGRP-VPN: EIGRP-v4 102: No metric set while redistributing route 192.102.6.6/32
RP/0/0/CPU0:XRv2#show eigrp vrf EIGRP ipv4 topology 192.102.6.6/32
% IPv4-EIGRP (VRF EIGRP): Route not in topology table of EIGRP VPN.

To solve it, we will set a bogus set of default metric inputs on XRv1 and XRv2 for EIGRP IPv4. The route is very much external now as the extended-community enhancements were removed, but at least EIGRP can see it. Last, we can check the routing on CSR5 to ensure it installs this route. The bogus metrics make the external routes stand out very clearly.

! XRv1 and XRv2
router eigrp VPN
 vrf EIGRP
  address-family ipv4
   default-metric 12345 543 234 123 1500
RP/0/0/CPU0:XRv2#show eigrp vrf EIGRP ipv4 topology 192.102.6.6/32
IPv4-EIGRP VR(VPN) AS(102) VRF EIGRP: Topology entry for 192.102.6.6/32
  State is Passive, Query origin flag is 1, 1 Successor(s), FD is 408947559, RIB is 3194902
  Routing Descriptor Blocks:
  214.0.0.4, from Redistributed, Send flag is 0x0
      Composite metric is (408947559/0), Route is External
      Vector metric:
        Minimum bandwidth is 12345 Kbit
        Total delay is 5430000000 picoseconds
        Reliability is 234/255
        Load is 123/255
        Minimum MTU is 1500
        Hop count is 0
      External data:
        Originating router is 214.0.0.12 (this system)
        AS number of route is 102
        External protocol is BGP, external metric is 10880
        Administrator tag is 0 (0x00000000)
R5#show ip route vrf EIGRP eigrp | section include EX
       D - EIGRP, EX - EIGRP external, O - OSPF, IA - OSPF inter area
D EX     192.102.6.6 [170/3200022] via 192.12.102.12, 00:03:13, GigabitEthernet2.102
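The five default-metric arguments map directly onto the vector metric shown in the topology entry; the only unit conversion is the delay, which is entered in tens of microseconds and displayed in picoseconds. A minimal sketch of that mapping (function and key names hypothetical):

```python
PS_PER_TEN_US = 10_000_000  # 10 microseconds in picoseconds

def default_metric_to_vector(bandwidth_kbps, delay_tens_us, reliability, load, mtu):
    """Map 'default-metric' arguments onto the displayed vector metric fields."""
    return {
        "min_bandwidth_kbit": bandwidth_kbps,
        "total_delay_ps": delay_tens_us * PS_PER_TEN_US,
        "reliability": reliability,
        "load": load,
        "mtu": mtu,
    }

vector = default_metric_to_vector(12345, 543, 234, 123, 1500)
for field, value in vector.items():
    print(field, value)
```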

One of the pitfalls of disabling extended-community support on CSR4 is that it has bidirectional implications. That is to say, not only will EIGRP not instruct BGP to carry the information, but it also will not honor information from BGP, either. CSR4 learns CSR5’s loopback via iBGP with all of the extended-community values. However, the route does not exist in the EIGRP topology since, like XRv2, there is no default redistribution metric and EIGRP is now blind to these extended-communities.

R4#show bgp vpnv4 unicast vrf EIGRP 192.102.5.5/32 bestpath | begin Local
  Local
    214.0.0.12 (metric 20) (via default) from 214.0.0.12 (214.0.0.12)
      Origin incomplete, metric 10880, localpref 100, valid, internal, best
      Extended Community: RT:214:102 Cost:pre-bestpath:128:10880
        0x8800:32768:0 0x8801:102:288 0x8802:65281:2560 0x8803:1:1500
        0x8806:0:3227911429
      mpls labels in/out nolabel/92012
      rx pathid: 0, tx pathid: 0x0
R4#show eigrp address-family ipv4 vrf EIGRP topology 192.102.5.5/32
EIGRP-IPv4 VR(VPN) Topology Entry for AS(102)/ID(192.4.102.4)
           Topology(base) TID(0) VRF(EIGRP)
%Entry 192.102.5.5/32 not in topology table

To fix this, we will set an easily-visible default-metric on CSR4 as well. Like on XRv2, the route is now external and locally originated by CSR4. Any EIGRP speakers behind CSR4 in the customer VPN will not know there is another EIGRP process across MPLS in this configuration.

! CSR4
router eigrp VPN
 address-family ipv4 unicast vrf EIGRP autonomous-system 102
  topology base
   default-metric 112347 10203 100 200 1499
R4#show eigrp address-family ipv4 vrf EIGRP topology 192.102.5.5/32
EIGRP-IPv4 VR(VPN) Topology Entry for AS(102)/ID(192.4.102.4)
           Topology(base) TID(0) VRF(EIGRP)
EIGRP-IPv4(102): Topology base(0) entry for 192.102.5.5/32
  State is Passive, Query origin flag is 1, 1 Successor(s), FD is 6692471435
  Descriptor Blocks:
  214.0.0.12, from Redistributed, Send flag is 0x0
      Composite metric is (6692471435/0), route is External
      Vector metric:
        Minimum bandwidth is 112347 Kbit
        Total delay is 102030000000 picoseconds
        Reliability is 100/255
        Load is 200/255
        Minimum MTU is 1499
        Hop count is 0
        Originating router is 192.4.102.4
      External data:
        AS number of route is 214
        External protocol is BGP, external metric is 10880
        Administrator tag is 0 (0x00000000)

Unlike CSR5, which only saw one external route, CSR6 will see all routes as external since the router not supporting extended-communities is its local PE (CSR4). We will ensure CSR6 has the route for CSR5’s loopback and test reachability.

R6#show ip route vrf EIGRP eigrp | section 5\.5
D EX     192.102.5.5 [170/52290053] via 192.4.102.4, 00:03:25, GigabitEthernet2.102
      192.102.13.0/32 is subnetted, 1 subnets
R6#ping vrf EIGRP 192.102.5.5 source 192.102.6.6
Type escape sequence to abort.
Sending 5, 100-byte ICMP Echos to 192.102.5.5, timeout is 2 seconds:
Packet sent with a source address of 192.102.6.6
!!!!!
Success rate is 100 percent (5/5), round-trip min/avg/max = 6/7/11 ms

The capability to disable this is identical for IPv6. Before we disable it, we will quickly verify that both CSR5 and CSR6 IPv6 loopbacks have the proper extended communities on XRv2. There is a new community, 0x8807, and it is so confusing that XR doesn’t even know what to call it. The hex values decode into 80:169:171:0:0:0 which is meaningless at first sight. The important thing is the communities are being passed.

RP/0/0/CPU0:XRv2#show bgp vpnv6 unicast vrf EIGRP ::192:102:5:5/128 | begin Local
[snip]
  Local
    fe80::5 from 0.0.0.0 (214.0.0.12)
      Origin incomplete, metric 10880, localpref 100, weight 32768, valid, redistributed, best, group-best, import-candidate
      Received Path ID 0, Local Path ID 1, version 570
      Extended community: COST:128:128:10880 EIGRP route-info:0x8000:0 EIGRP AD:102:288 EIGRP RHB:255:1:2560 EIGRP LM:0x0:1:1500 EIGRP VRR:0x0:5.5.102.192 RT:214:102
RP/0/0/CPU0:XRv2#show bgp vpnv6 unicast vrf EIGRP ::192:102:6:6/128 | begin Local
[snip]
  Local, (Received from a RR-client)
    214.0.0.4 (metric 20) from 214.0.0.4 (214.0.0.4)
      Received Label 4043
      Origin incomplete, metric 10880, localpref 100, valid, internal, best, group-best, import-candidate, imported
      Received Path ID 0, Local Path ID 1, version 593
      Extended community: COST:128:128:10880 EIGRP route-info:0x8000:0 EIGRP AD:102:288 EIGRP RHB:255:1:2560 EIGRP LM:0xff:1:1500 EIGRP VRR:0x0:6.102.4.192 0x8807:0x50:0xa9:0xab:0x00:0x00:0x00 RT:214:102
      Source VRF: EIGRP, Source Route Distinguisher: 214:102

For brevity, we will do a quick EIGRP topology check on both CE devices, then verify reachability. The logic of the MPLS network being viewed as a single EIGRP router is still the same as verified by the hop count.

R5#show eigrp address-family ipv6 vrf EIGRP topology ::192:102:6:6/128 | include Hop
        Hop count is 2
R5#show eigrp address-family ipv6 vrf EIGRP topology | section 6:6
P ::192:102:6:6/128, 1 successors, FD is 2048000
        via FE80::12 (2048000/1392640), GigabitEthernet2.102
R6#show eigrp address-family ipv6 vrf EIGRP topology | section 5:5
P ::192:102:5:5/128, 1 successors, FD is 2048000
        via FE80::4 (2048000/1392640), GigabitEthernet2.102
R6#show eigrp address-family ipv6 vrf EIGRP topology ::192:102:5:5/128 | include Hop
        Hop count is 2
R5#ping vrf EIGRP ::192:102:6:6 source ::192:102:5:5
Type escape sequence to abort.
Sending 5, 100-byte ICMP Echos to ::192:102:6:6, timeout is 2 seconds:
Packet sent with a source address of ::192:102:5:5%EIGRP
!!!!!
Success rate is 100 percent (5/5), round-trip min/avg/max = 6/10/26 ms

On CSR4, we will disable populating these communities into VPNv6 BGP. We verify that the route learned from CSR6 does not have these communities applied, but the one received from XRv2 does.

! CSR4
router eigrp VPN
 address-family ipv6 unicast vrf EIGRP autonomous-system 102
  no populate bgp-ext-comm
R4#show bgp vpnv6 unicast vrf EIGRP ::192:102:6:6/128 | begin Local
  Local
    FE80::6 (FE80::6) (via vrf EIGRP) from 0.0.0.0 (214.0.0.4)
      Origin incomplete, metric 10880, localpref 100, weight 32768, valid, sourced, best
      Extended Community: RT:214:102
      mpls labels in/out 4043/nolabel
      rx pathid: 0, tx pathid: 0x0
R4#show bgp vpnv6 unicast vrf EIGRP ::192:102:5:5/128 | begin Local
  Local
    ::FFFF:214.0.0.12 (metric 20) (via default) from 214.0.0.12 (214.0.0.12)
      Origin incomplete, metric 10880, localpref 100, valid, internal, best
      Extended Community: RT:214:102 Cost:pre-bestpath:128:10880
        0x8800:32768:0 0x8801:102:288 0x8802:65281:2560 0x8803:1:1500
        0x8806:0:3227911429
      mpls labels in/out nolabel/92010
      rx pathid: 0, tx pathid: 0x0

XRv2 learns this route, but is unable to redistribute it into EIGRP as there is no default-metric specified. The EIGRP-v6 process on XRv2 didn’t give us a log message this time, so we must be aware of EIGRP redistribution behavior with respect to unspecified seed metrics. We can confirm that other VPNv6 routes, like XRv4’s loopback, did get redistributed from BGP into EIGRP successfully. For that route, all of the extended-communities were properly interpreted as the hop count is 1 (the MPLS network is one EIGRP router) and the route is internal.

RP/0/0/CPU0:XRv2#show bgp vpnv6 unicast vrf EIGRP ::192:102:6:6/128 | begin Local
  Local, (Received from a RR-client)
    214.0.0.4 (metric 20) from 214.0.0.4 (214.0.0.4)
      Received Label 4043
      Origin incomplete, metric 10880, localpref 100, valid, internal, best, group-best, import-candidate, imported
      Received Path ID 0, Local Path ID 1, version 594
      Extended community: RT:214:102
      Source VRF: EIGRP, Source Route Distinguisher: 214:102
RP/0/0/CPU0:XRv2#show eigrp vrf EIGRP ipv6 topology ::192:102:6:6/128
% IPv6-EIGRP (VRF EIGRP): Route not in topology table of EIGRP VPN.
RP/0/0/CPU0:XRv2#show eigrp vrf EIGRP ipv6 topology ::192:102:14:14/128
IPv6-EIGRP VR(VPN) AS(102) VRF EIGRP: Topology entry for ::192:102:14:14/128
  State is Passive, Query origin flag is 1, 1 Successor(s), FD is 1377280, RIB is 10760
  Routing Descriptor Blocks:
  ::ffff:214.0.0.3, from VPNv6 Sourced, Send flag is 0x0
      Composite metric is (1377280/0), Route is Internal (VPNv6 Sourced)
      Vector metric:
        Minimum bandwidth is 1000000 Kbit
        Total delay is 11015625 picoseconds
        Reliability is 255/255
        Load is 1/255
        Minimum MTU is 1500
        Hop count is 1
        Originating router is 14.14.14.14

To solve it, we will hardcode a metric on the BGP redistribution, rather than set a default metric. This is accomplished with an RPL and cannot be placed in-line as XE allows.

! XRv1 and XRv2
route-policy RPL_BGP_TO_EIGRP
  set eigrp-metric 111111 222 33 44 1455
end-policy
router eigrp VPN
 vrf EIGRP
  address-family ipv6
   redistribute bgp 214 route-policy RPL_BGP_TO_EIGRP

The significant downside of this hardcoded-metric approach is that, unless specific prefixes are matched in the RPL, all of the EIGRP metrics will be modified in this way, overriding the legitimate extended-community values. This does allow CSR6’s loopback to enter the EIGRP topology, but also overrides the values for XRv4’s loopback. Fortunately, the EIGRP ASN is still honored so the route remains internal.

RP/0/0/CPU0:XRv2#show eigrp vrf EIGRP ipv6 topology ::192:102:6:6/128
IPv6-EIGRP VR(VPN) AS(102) VRF EIGRP: Topology entry for ::192:102:6:6/128
  State is Passive, Query origin flag is 1, 1 Successor(s), FD is 151388165, RIB is 1182720
  Routing Descriptor Blocks:
  ::ffff:214.0.0.4, from Redistributed, Send flag is 0x0
      Composite metric is (151388165/0), Route is External
      Vector metric:
        Minimum bandwidth is 111111 Kbit
        Total delay is 2220000000 picoseconds
        Reliability is 33/255
        Load is 44/255
        Minimum MTU is 1455
        Hop count is 0
      External data:
        Originating router is 214.0.0.12 (this system)
        AS number of route is 102
        External protocol is BGP, external metric is 10880
        Administrator tag is 0 (0x00000000)
RP/0/0/CPU0:XRv2#show eigrp vrf EIGRP ipv6 topology ::192:102:14:14/128
IPv6-EIGRP VR(VPN) AS(102) VRF EIGRP: Topology entry for ::192:102:14:14/128
  State is Passive, Query origin flag is 1, 1 Successor(s), FD is 1377280, RIB is 1182720
  Routing Descriptor Blocks:
  ::ffff:214.0.0.3, from VPNv6 Sourced, Send flag is 0x0
      Composite metric is (151388165/0), Route is Internal (VPNv6 Sourced)
      Vector metric:
        Minimum bandwidth is 111111 Kbit
        Total delay is 2220000000 picoseconds
        Reliability is 33/255
        Load is 44/255
        Minimum MTU is 1455
        Hop count is 1

A quick routing table check on CSR5 confirms that both routes are installed with the appropriate route types as verified in the EIGRP topology. They have the same RIB metric, which is probably not desirable.

R5#show ipv6 route vrf EIGRP eigrp | section include :6:6|:14:14
EX  ::192:102:6:6/128 [170/1187840]
     via FE80::12, GigabitEthernet2.102
D   ::192:102:14:14/128 [90/1187840]
     via FE80::12, GigabitEthernet2.102

To fix it, we can modify the RPL on XRv1/XRv2 to target these routes. Since we cannot match on the presence or absence of the cost-community or any advanced EIGRP communities, we can match the next-hop of CSR4 to capture all of its prefixes easily. Be sure to specify this address as an IPv4-mapped IPv6 address; although XR displays VPNv6 next-hops with a leading “::FFFF:” in front of the IPv4 address, they are IPv6 next-hops. Notice that now only CSR6’s loopback, the only true external route, is marked as “Redistributed” with a bogus metric while the other routes are “VPNv6 Sourced” with the correct metric. Both the summary and detailed outputs are shown below.

! XRv1 and XRv2
route-policy RPL_BGP_TO_EIGRP
  if next-hop in (::FFFF:214.0.0.4) then
    set eigrp-metric 111111 222 33 44 1455
  else
    pass
  endif
end-policy
RP/0/0/CPU0:XRv2#show eigrp vrf EIGRP ipv6 topology | begin ::192
P ::192:102:5:5/128, 1 successors, FD is 1392640, RIB is 10880
        via fe80::5 (1392640/163840), GigabitEthernet0/0/0/0.102
P ::192:102:6:6/128, 1 successors, FD is 151388165, RIB is 1182720
        via Redistributed (151388165/0)
P ::192:102:13:13/128, 1 successors, FD is 1377280, RIB is 10760
        via VPNv6 Sourced (1377280/0)
P ::192:102:14:14/128, 1 successors, FD is 1377280, RIB is 10760
        via VPNv6 Sourced (1377280/0)
RP/0/0/CPU0:XRv2#show eigrp vrf EIGRP ipv6 topology ::192:102:6:6/128
IPv6-EIGRP VR(VPN) AS(102) VRF EIGRP: Topology entry for ::192:102:6:6/128
  State is Passive, Query origin flag is 1, 1 Successor(s), FD is 151388165, RIB is 1182720
  Routing Descriptor Blocks:
  ::ffff:214.0.0.4, from Redistributed, Send flag is 0x0
      Composite metric is (151388165/0), Route is External
      Vector metric:
        Minimum bandwidth is 111111 Kbit
        Total delay is 2220000000 picoseconds
        Reliability is 33/255
        Load is 44/255
        Minimum MTU is 1455
        Hop count is 0
      External data:
        Originating router is 214.0.0.12 (this system)
        AS number of route is 102
        External protocol is BGP, external metric is 10880
        Administrator tag is 0 (0x00000000)
RP/0/0/CPU0:XRv2#show eigrp vrf EIGRP ipv6 topology ::192:102:13:13/128
IPv6-EIGRP VR(VPN) AS(102) VRF EIGRP: Topology entry for ::192:102:13:13/128
  State is Passive, Query origin flag is 1, 1 Successor(s), FD is 1377280, RIB is 10760
  Routing Descriptor Blocks:
  ::ffff:214.0.0.11, from VPNv6 Sourced, Send flag is 0x0
      Composite metric is (1377280/0), Route is Internal (VPNv6 Sourced)
      Vector metric:
        Minimum bandwidth is 1000000 Kbit
        Total delay is 11015625 picoseconds
        Reliability is 255/255
        Load is 1/255
        Minimum MTU is 1500
        Hop count is 1
        Originating router is 13.13.13.13
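The next-hop match relies on the ::FFFF:a.b.c.d form, which is an IPv4-mapped IPv6 address. The sketch below uses Python's standard ipaddress module to show the difference between the mapped form and a bare ::a.b.c.d address, and models the if/else logic of the policy. It is an illustration of the decision only, not of how XR evaluates RPL; the function names are hypothetical.

```python
import ipaddress

CSR4_NEXT_HOP = ipaddress.IPv6Address("::ffff:214.0.0.4")
BOGUS_METRIC = (111111, 222, 33, 44, 1455)  # values from the route-policy

def mapped_ipv4(next_hop):
    """Return the embedded IPv4 address for ::ffff:a.b.c.d, else None."""
    return ipaddress.IPv6Address(next_hop).ipv4_mapped

def select_metric(next_hop, inherited_metric):
    """Set the bogus metric for CSR4's routes; pass everything else unchanged."""
    if ipaddress.IPv6Address(next_hop) == CSR4_NEXT_HOP:
        return BOGUS_METRIC
    return inherited_metric

print(mapped_ipv4("::FFFF:214.0.0.4"))  # mapped form carries the IPv4 address
print(mapped_ipv4("::214.0.0.4"))       # not a mapped address
print(select_metric("::ffff:214.0.0.11", None))
```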

As we know, CSR4 is now ignoring the extended-community values that XRv2 advertised inside of CSR5’s loopback, so connectivity will be broken as it was for IPv4. CSR4 is totally ignoring all BGP-carried EIGRP extended-communities.

R4#show eigrp address-family ipv6 vrf EIGRP topology ::192:102:5:5/128
EIGRP-IPv6 VR(VPN) Topology Entry for AS(102)/ID(192.4.102.4)
           Topology(base) TID(0) VRF(EIGRP)
%Entry ::192:102:5:5/128 not in topology table

We will use the same destructive approach on CSR4 by setting the metric within the redistribute statement. Using “default-metric” was a more appropriate solution as it respects the extended-communities, if they exist.

! CSR4
route-map RM_BGP_TO_EIGRP permit 10
 set metric 55555 444 33 22 1411
router eigrp VPN
 address-family ipv6 unicast vrf EIGRP autonomous-system 102
  topology base
   redistribute bgp 214 route-map RM_BGP_TO_EIGRP
R4#show eigrp address-family ipv6 vrf EIGRP topology ::192:102:5:5/128
EIGRP-IPv6 VR(VPN) Topology Entry for AS(102)/ID(192.4.102.4)
           Topology(base) TID(0) VRF(EIGRP)
EIGRP-IPv6(102): Topology base(0) entry for ::192:102:5:5/128
  State is Passive, Query origin flag is 1, 1 Successor(s), FD is 302776437
  Descriptor Blocks:
  ::FFFF:214.0.0.12, from Redistributed, Send flag is 0x0
      Composite metric is (302776437/0), route is External
      Vector metric:
        Minimum bandwidth is 55555 Kbit
        Total delay is 4440000000 picoseconds
        Reliability is 33/255
        Load is 22/255
        Minimum MTU is 1411
        Hop count is 0
      External data:
        Originating router is 192.4.102.4 (this system)
        AS number of route is 214
        External protocol is BGP, external metric is 10880
        Administrator tag is 0 (0x00000000)

As expected, since CSR4 is the one not supporting the extended-communities, all remote routes are viewed as external, so there is little sense in matching specific entities in the route-map as all routes will have the same suboptimal appearance. The topology lists these as “Redistributed” as opposed to “VPNv6 Sourced”.

R4#show eigrp address-family ipv6 vrf EIGRP topology | section include ::192
P ::192:102:14:14/128, 1 successors, FD is 302776437
        via Redistributed (302776437/0)
P ::192:102:13:13/128, 1 successors, FD is 302776437
        via Redistributed (302776437/0)
P ::192:102:6:6/128, 1 successors, FD is 1392640
        via FE80::6 (1392640/163840), GigabitEthernet2.102
P ::192:102:5:5/128, 1 successors, FD is 302776437
        via Redistributed (302776437/0)

To achieve full reachability, I must instruct XRv1 to perform the same metric manipulations as XRv2 in order for its EIGRP VPN process to redistribute CSR6’s loopback from VPNv4/v6 (configuration not shown since it is identical). We quickly verify that XRv3 has the routes and that they are of the proper type.

RP/0/0/CPU0:XRv3#show route vrf EIGRP ipv4 eigrp | include 192.102.
D    192.102.5.5/32 [90/16000] via 192.11.102.11, 02:08:13, GigabitEthernet0/0/0/0.102
D EX 192.102.6.6/32 [170/3200022] via 192.11.102.11, 00:04:50, GigabitEthernet0/0/0/0.102
D    192.102.14.14/32 [90/15880] via 192.11.102.11, 1d07h, GigabitEthernet0/0/0/0.102
RP/0/0/CPU0:XRv3#show route vrf EIGRP ipv6 eigrp | include ::192:102:
D    ::192:102:5:5/128
D EX ::192:102:6:6/128
D    ::192:102:14:14/128

Likewise, CSR3 needs to set metrics only for routes coming from CSR4, since it can honor the extended-communities from XRv1 and XRv2. The configuration for EIGRP-v4 is identical, using the default-metric under the EIGRP AFI. The IPv6 configuration varies slightly to account for only adjusting metrics from CSR4. Since XE doesn’t support in-line matches like XR does, we define an IPv6 prefix-list with an IPv4-mapped IPv6 address representing CSR4’s VPNv6 endpoint. We quickly check the routes on XRv4 to ensure it learned all routes and that their route-types are correct. If “default-metric” cannot be used, using route-maps/RPL in this way can retain the EIGRP metrics for internal routes and seed new ones for external routes.

! CSR3
ipv6 prefix-list PL_IPV6_NHOP_CSR4 seq 5 permit ::FFFF:214.0.0.4/128
route-map RM_BGP_TO_EIGRP permit 10
 match ipv6 next-hop prefix-list PL_IPV6_NHOP_CSR4
 set metric 55555 444 33 22 1411
route-map RM_BGP_TO_EIGRP permit 1000
RP/0/0/CPU0:XRv4#show route vrf EIGRP ipv4 eigrp | include 192.102.
D    192.102.5.5/32 [90/16000] via 192.3.102.3, 02:12:01, GigabitEthernet0/0/0/0.102
D EX 192.102.6.6/32 [170/52290053] via 192.3.102.3, 00:00:18, GigabitEthernet0/0/0/0.102
D    192.102.13.13/32 [90/15880] via 192.3.102.3, 1d07h, GigabitEthernet0/0/0/0.102
RP/0/0/CPU0:XRv4#show route vrf EIGRP ipv6 eigrp | include ::192:102:
D    ::192:102:5:5/128
D EX ::192:102:6:6/128
D    ::192:102:13:13/128

For completeness, we will advertise an external route to the MPLS network by redistributing a new loopback on CSR5, much like we did for OSPF. We cannot use CSR6 again because its PE is not capable of encoding the extended-communities. We also cannot use anything inside 192.0.0.0/8 due to CSR5’s network statement.

! CSR5
interface Loopback500
 description EXTERNAL TEST EIGRP
 vrf forwarding EIGRP
 ip address 50.50.5.5 255.255.255.255
 ipv6 address ::50:50:5:5/128
route-map RM_CONN_TO_EIGRP permit 10
 match interface Loopback500
router eigrp VPN
 address-family ipv4 unicast vrf EIGRP autonomous-system 102
  topology base
   redistribute connected route-map RM_CONN_TO_EIGRP
 address-family ipv6 unicast vrf EIGRP autonomous-system 102
  topology base
   redistribute connected route-map RM_CONN_TO_EIGRP

XRv2 learns this route, which carries some new extended-communities. We can check the EIGRP topology to try to decode the values.

RP/0/0/CPU0:XRv2#show bgp vpnv4 unicast vrf EIGRP 50.50.5.5/32
[snip]
  Local
    192.12.102.5 from 0.0.0.0 (214.0.0.12)
      Origin incomplete, metric 10880, localpref 100, weight 32768, valid, redistributed, best, group-best, import-candidate
      Received Path ID 0, Local Path ID 1, version 695
      Extended community: COST:128:129:10880 EIGRP route-info:0x0:0 EIGRP AD:102:288 EIGRP RHB:255:1:2560 EIGRP LM:0x0:1:1500 EIGRP AR:0:192.102.5.5 EIGRP PM:11:0 RT:214:102

In the first new community (AR), the leading 0 represents the remote AS, which is 0 per the EIGRP topology (the route was redistributed from “connected”). The next 4 bytes represent the originator ID, which is CSR5’s loopback. The second new community (PM) carries the remote protocol and metric; I assume 11 means “connected” and 0 is the remote metric.

RP/0/0/CPU0:XRv2#show eigrp vrf EIGRP ipv4 topology 50.50.5.5/32 | begin External data
      External data:
        Originating router is 192.102.5.5
        AS number of route is 0
        External protocol is Connected, external metric is 0
        Administrator tag is 0 (0x00000000)
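The protocol number can be looked up in a small table. The IDs below are listed as I recall them from the EIGRP specification (RFC 7868) and should be treated as an assumption to verify; the values 9 (BGP) and 11 (Connected) are at least consistent with the outputs in this section. The function name is hypothetical.

```python
# Assumed external protocol IDs (recalled from RFC 7868; verify before relying on them).
EXTERNAL_PROTOCOLS = {
    1: "IGRP", 2: "EIGRP", 3: "Static", 4: "RIP", 5: "HELLO", 6: "OSPF",
    7: "IS-IS", 8: "EGP", 9: "BGP", 10: "IDRP", 11: "Connected",
}

def decode_pm(token):
    """Decode an XR-style 'EIGRP PM:<protocol>:<metric>' token."""
    _, protocol, metric = token.rsplit(":", 2)
    protocol = int(protocol)
    return EXTERNAL_PROTOCOLS.get(protocol, f"unknown({protocol})"), int(metric)

print(decode_pm("EIGRP PM:11:0"))
```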

A quick look at CSR3’s BGP and EIGRP tables shows similar information. XE is less friendly in decoding information, but converting 3227911429 to hex yields 0xC0660505, or 192.102.5.5, which is the originator ID. This is important for loop prevention with EIGRP external routes, so transporting it is important.

R3#show bgp vpnv4 unicast vrf EIGRP 50.50.5.5/32
[snip]
  Local, (Received from a RR-client)
    214.0.0.12 (metric 30) (via default) from 214.0.0.12 (214.0.0.12)
      Origin incomplete, metric 10880, localpref 100, valid, internal, best
      Extended Community: RT:214:102 Cost:pre-bestpath:129:10880
        0x8800:0:0 0x8801:102:288 0x8802:65281:2560 0x8803:1:1500
        0x8804:0:3227911429 0x8805:11:0
      mpls labels in/out nolabel/92024
      rx pathid: 0, tx pathid: 0x0

Even though the route is external, carrying the extended-communities still allows the MPLS domain to carry the EIGRP metrics for the external route. XRv3 sees the route as two hops away with the proper vector metric information.

RP/0/0/CPU0:XRv3#show eigrp vrf EIGRP ipv4 topology 50.50.5.5/32
IPv4-EIGRP VR(VPN) AS(102) VRF EIGRP: Topology entry for 50.50.5.5/32
  State is Passive, Query origin flag is 1, 1 Successor(s), FD is 2048000, RIB is 16000
  Routing Descriptor Blocks:
  192.11.102.11 (GigabitEthernet0/0/0/0.102), from 192.11.102.11, Send flag is 0x0
      Composite metric is (2048000/1392640), Route is External
      Vector metric:
        Minimum bandwidth is 1000000 Kbit
        Total delay is 21250000 picoseconds
        Reliability is 255/255
        Load is 1/255
        Minimum MTU is 1500
        Hop count is 2
      External data:
        Originating router is 192.102.5.5
        AS number of route is 0
        External protocol is Connected, external metric is 0
        Administrator tag is 0 (0x00000000)

The last significant feature of EIGRP is the site-of-origin (SoO). This is an extended-community carried in a new EIGRP TLV that allows an administrator to identify an EIGRP “site”. A multi-homed site, such as one with backdoor links, might be susceptible to transient routing loops during EIGRP convergence. For example, if XRv4’s loopback fails, it would issue QUERY messages to CSR6 and CSR3 (assume there was a backdoor to CSR3). If CSR4, the remote PE, receives the QUERY before CSR3 has a chance to respond, it will REPLY affirmatively that there is an alternate path (the BGP route). Although count-to-infinity may eventually break the loop, this is not an ideal circumstance. SoO works like route-tags except is more dynamic and solves a very specific problem. Setting it on a PE-CE link on the PE side tells EIGRP to copy the SoO value into a BGP extended-community during redistribution. Below, we can see the new SoO community of 102:14 on CSR3. ! CSR3 route-map RM_EIGRP_SOO permit 10 set extcommunity soo 102:14 interface GigabitEthernet2.102 ip vrf sitemap RM_EIGRP_SOO


R3#show bgp vpnv4 unicast vrf EIGRP 192.102.14.14/32
[snip]
  Local
    192.3.102.14 (via vrf EIGRP) from 0.0.0.0 (214.0.0.3)
      Origin incomplete, metric 10752, localpref 100, weight 32768, valid, sourced, best
      Extended Community: SoO:102:14 RT:214:102 Cost:pre-bestpath:128:10752 0x8800:32768:0 0x8801:102:282 0x8802:65281:2560 0x8803:65281:1500 0x8806:0:235802126
      mpls labels in/out 3025/nolabel
      rx pathid: 0, tx pathid: 0x0

When remote PEs receive this and perform redistribution back into EIGRP, this community is carried over again. Because CSR4 cannot understand these communities, it does not carry this information from BGP into EIGRP. This is consistent with its treatment of the EIGRP vector metrics as well.

R5#show eigrp address-family ipv4 vrf EIGRP topology 192.102.14.14/32 | include SoO
      Extended Community: SoO:102:14
R6#show eigrp address-family ipv4 vrf EIGRP topology 192.102.14.14/32 | include SoO
[no output]
RP/0/0/CPU0:XRv3#show eigrp vrf EIGRP topology 192.102.14.14/32 | begin Ext
      Extended Community: SoO:102:14

If another site (including CEs) has an SoO set on an interface, routes received on that interface that match the SoO are discarded. We cannot use the backdoor (currently shut down) between CSR6 and XRv4 since CSR4 does not support the extended-communities. Instead, we can configure XRv3 to be in the same “site” as XRv4. This means that XRv3 will not be able to learn routes marked with SoO:102:14 from the PE.

! XRv3
router eigrp VPN
 vrf EIGRP
  interface GigabitEthernet0/0/0/0.102
   site-of-origin 102:14

RP/0/0/CPU0:XRv3#show eigrp vrf EIGRP topology 192.102.14.14/32
% IPv4-EIGRP (VRF EIGRP): Route not in topology table of EIGRP VPN.

Notice that this is unlike a route-tag in that routes sent from XRv3 do not have the SoO set. This is set only on the redistribution from EIGRP into BGP, and used as an ingress filter for routes carrying the SoO TLV in EIGRP messages. XRv1 is not aware of any SoO configuration on XRv3 at all.
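The two SoO behaviors described so far (stamping the community when a PE redistributes EIGRP into BGP, and discarding a matching route at ingress) can be sketched in Python. This is only a simplified model under stated assumptions, not the EIGRP implementation: the `Route` class and both function names are hypothetical, and the sketch assumes that a route which already carries an SoO keeps it during redistribution.

```python
from dataclasses import dataclass, replace
from typing import Optional


@dataclass(frozen=True)
class Route:
    prefix: str
    soo: Optional[str] = None  # e.g. "102:14"; None means no SoO attached


def redistribute_into_bgp(route: Route, interface_soo: Optional[str]) -> Route:
    """PE side: stamp the PE-CE interface's SoO on an untagged route.

    Assumption: a route that already carries an SoO is left unchanged.
    """
    if route.soo is None and interface_soo is not None:
        return replace(route, soo=interface_soo)
    return route


def accept_on_ingress(route: Route, interface_soo: Optional[str]) -> bool:
    """Ingress check: discard a route whose SoO equals the interface's SoO."""
    if route.soo is None or interface_soo is None:
        return True
    return route.soo != interface_soo


# Values mirror the lab: CSR3 tags XRv4's loopback with SoO 102:14.
tagged = redistribute_into_bgp(Route("192.102.14.14/32"), "102:14")

print(tagged.soo)
print(accept_on_ingress(tagged, "102:14"))  # interface in the same "site"
print(accept_on_ingress(tagged, None))      # interface without any SoO
```

Under this model, an interface with no SoO configured never filters anything, which is why a router with no matching SoO configuration passes the tagged route along untouched.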


RP/0/0/CPU0:XRv1#show bgp vpnv4 unicast vrf EIGRP 192.102.13.13 | include Extended
      Extended community: COST:128:128:10752 EIGRP route-info:0x8000:0 EIGRP AD:102:282 EIGRP RHB:255:1:2560 EIGRP LM:0x0:1:1500 EIGRP VRR:0x0:13.13.13.13 RT:214:102

If we bring up the backdoor link between CSR5 and XRv3 using Gig2.202 (not shown), XRv3 can still learn these routes via that link. Notice that the route still has the SoO set; CSR5 did not filter it since it does not have any SoO configurations that match that particular route.

RP/0/0/CPU0:XRv3#show eigrp vrf EIGRP topology 192.102.14.14/32
IPv4-EIGRP VR(VPN) AS(102) VRF EIGRP: Topology entry for 192.102.14.14/32
  State is Passive, Query origin flag is 1, 1 Successor(s), FD is 2688000, RIB is 21000
  Routing Descriptor Blocks:
  192.5.102.5 (GigabitEthernet0/0/0/0.202), from 192.5.102.5, Send flag is 0x0
    Composite metric is (2688000/2032640), Route is Internal
    Vector metric:
      Minimum bandwidth is 1000000 Kbit
      Total delay is 31015625 picoseconds
      Reliability is 255/255
      Load is 1/255
      Minimum MTU is 1500
      Hop count is 3
      Originating router is 14.14.14.14
    Extended Community: SoO:102:14

A more realistic use of SoO is to prevent loops within the same site when a backdoor link is present. To do this, we can configure a number of different designs. We will examine the first using IPv4 and the third using IPv6; the second test case is difficult to verify and test.

1. Same SoO on the PEs, configured on the PE-CE link only. This is used so that the two sites have no reachability over MPLS whatsoever. If the two CE routers are actually in the same site, this design is very logical (the backdoor link is a high-speed LAN interface, for example).
2. Different SoO on the PEs, configured on the PE-CE link only. This allows failover to occur, but once a route comes “full circle” around the backdoor link, it is discarded at ingress on the PE that originally injected it. This prevents transient routing loops.
3. SoO X on the local CE’s backdoor link and the local PE’s PE-CE link. SoO Y on the remote CE’s backdoor link and the remote PE’s PE-CE link. This provides better protection than #2 and, unlike #1, still allows sites to communicate, but provides less redundancy if the CE and backdoor routers are not the same device (that is to say, there is a larger EIGRP network in between the CE routers).


The first setup is very straightforward. Both PEs will use an SoO value for the PE-CE link. When the two sites exchange routes, the extended-communities are carried over BGP. We will configure this for the IPv4 AFI only, leaving the advanced example for IPv6. In XR, only one SoO can be applied per interface, per AFI.

! XRv1 and XRv2
router eigrp VPN
 vrf EIGRP
  address-family ipv4
   interface GigabitEthernet0/0/0/0.102
    site-of-origin 102:1212

In the ideal state, the entire MPLS network looks like a single EIGRP router, and so XRv2 prefers the BGP route to XRv1 versus the EIGRP route via CSR5. XRv1 adds the SoO and XRv2 sees it.

RP/0/0/CPU0:XRv1#show bgp vpnv4 unicast vrf EIGRP 192.102.13.13/32 | begin from
    192.11.102.13 from 0.0.0.0 (214.0.0.11)
      Origin incomplete, metric 10752, localpref 100, weight 32768, valid, redistributed, best, group-best, import-candidate, import suspect
      Received Path ID 0, Local Path ID 1, version 788
      Extended community: SoO:102:1212 COST:128:128:10752 EIGRP route-info:0x8000:0 EIGRP AD:102:282 EIGRP RHB:255:1:2560 EIGRP LM:0x0:1:1500 EIGRP VRR:0x0:13.13.13.13 RT:214:102

RP/0/0/CPU0:XRv2#show bgp vpnv4 unicast vrf EIGRP 192.102.13.13/32 | begin from
  Local, (Received from a RR-client)
    214.0.0.11 (metric 10) from 214.0.0.11 (214.0.0.11)
      Received Label 91016
      Origin incomplete, metric 10752, localpref 100, valid, internal, best, group-best, import-candidate, imported, import suspect
      Received Path ID 0, Local Path ID 1, version 745
      Extended community: SoO:102:1212 COST:128:128:10752 EIGRP route-info:0x8000:0 EIGRP AD:102:282 EIGRP RHB:255:1:2560 EIGRP LM:0x0:1:1500 EIGRP VRR:0x0:13.13.13.13 RT:214:102
      Source VRF: EIGRP, Source Route Distinguisher: 214:102

XRv2 is not allowed to advertise this route to CSR5 due to SoO. Checking CSR5’s EIGRP topology, it only has one route via the backdoor link.

R5#show eigrp address-family ipv4 vrf EIGRP topology 192.102.13.13/32
EIGRP-IPv4 VR(VPN) Topology Entry for AS(102)/ID(192.102.5.5)
           Topology(base) TID(0) VRF(EIGRP)
EIGRP-IPv4(102): Topology base(0) entry for 192.102.13.13/32
  State is Passive, Query origin flag is 1, 1 Successor(s), FD is 1376256, RIB is 10752
  Descriptor Blocks:
  192.5.102.13 (GigabitEthernet2.202), from 192.5.102.13, Send flag is 0x0
[snip]

Shutting down this backdoor link completely breaks connectivity between XRv3 and CSR5. This is expected since we told the PEs that these routers were in the same “site”, and thus should not be using the WAN for transport between one another.

RP/0/0/CPU0:XRv3#show eigrp vrf EIGRP topology 192.102.5.5/32
% IPv4-EIGRP (VRF EIGRP): Route not in topology table of EIGRP VPN.

R5#show eigrp address-family ipv4 vrf EIGRP topology 192.102.13.13/32
EIGRP-IPv4 VR(VPN) Topology Entry for AS(102)/ID(192.102.5.5)
           Topology(base) TID(0) VRF(EIGRP)
%Entry 192.102.13.13/32 not in topology table

We will bring the backdoor link up for our next test using IPv6. Here, we configure XRv3’s backdoor and XRv1 to be in one site, while CSR5’s backdoor and XRv2 are in another. This ensures that routes from a given site cannot re-enter the source site under any circumstances. It is loosely analogous to the “transit AS” prevention techniques used in BGP.

! XRv1
router eigrp VPN
 vrf EIGRP
  address-family ipv6
   interface GigabitEthernet0/0/0/0.102
    site-of-origin 102:11
! XRv2
router eigrp VPN
 vrf EIGRP
  address-family ipv6
   interface GigabitEthernet0/0/0/0.102
    site-of-origin 102:12
! XRv3
router eigrp VPN
 vrf EIGRP
  address-family ipv6
   interface GigabitEthernet0/0/0/0.202
    site-of-origin 102:11
! CSR5
route-map RM_EIGRP_V6_SOO per 10
 set extcommunity soo 102:12
interface GigabitEthernet2.202
 ip vrf sitemap RM_EIGRP_V6_SOO

We check XRv2 for XRv3’s loopback, which should have SoO:102:11. Then, we check XRv1 for CSR5’s loopback, which should have SoO:102:12. XRv1 shows two lines of output because it is an RR-client, whereas XRv2 is an RR.

RP/0/0/CPU0:XRv2#show bgp vpnv6 unicast vrf EIGRP ::192:102:13:13/128 | include SoO
      Extended community: SoO:102:11 COST:128:128:10752 EIGRP route-info:0x8000:0 EIGRP AD:102:282 EIGRP RHB:255:1:2560 EIGRP LM:0x0:1:1500 EIGRP VRR:0x0:13.13.13.13 RT:214:102

RP/0/0/CPU0:XRv1#show bgp vpnv6 unicast vrf EIGRP ::192:102:5:5/128 | include SoO
      Extended community: SoO:102:12 COST:128:128:10880 EIGRP route-info:0x8000:0 EIGRP AD:102:288 EIGRP RHB:255:1:2560 EIGRP LM:0x0:1:1500 EIGRP VRR:0x0:5.5.102.192 RT:214:102
      Extended community: SoO:102:12 COST:128:128:10880 EIGRP route-info:0x8000:0 EIGRP AD:102:288 EIGRP RHB:255:1:2560 EIGRP LM:0x0:1:1500 EIGRP VRR:0x0:5.5.102.192 RT:214:102

These routes are allowed to be learned by remote CEs as there is no SoO on the PE-CE link configured on the CE side. This allows MPLS to interconnect the two sites if the backdoor link fails. The EIGRP metrics have not been adjusted for any kind of route-preference, and the MPLS network does not qualify as a feasible successor, so we must use “all-links” to see the alternate paths when using the summary command.

R5#show eigrp address-family ipv6 vrf EIGRP topology all-links | section 13:13
P ::192:102:13:13/128, 1 successors, FD is 1376256, serno 198
        via FE80::13 (1376256/131072), GigabitEthernet2.202
        via FE80::12 (2032640/1377280), GigabitEthernet2.102

RP/0/0/CPU0:XRv3#show eigrp vrf EIGRP ipv6 topology all-links | begin 102:5:5
P ::192:102:5:5/128, 1 successors, FD is 1392640, RIB is 10880, serno 322
        via fe80::5 (1392640/163840), GigabitEthernet0/0/0/0.202
        via fe80::11 (2048000/1392640), GigabitEthernet0/0/0/0.102

We quickly shut down the backdoor link to verify MPLS WAN connectivity with traceroute. This allows the two sites to have some form of redundancy and loop protection concurrently.

R5#traceroute vrf EIGRP ipv6
Target IPv6 address: ::192:102:13:13
Source address: ::192:102:5:5
[snip]
Tracing the route to ::192:102:13:13
  1 FD00:192:12:102::12 3 msec 2 msec 2 msec
  2 2001:214:11:12::11 [MPLS: Label 91018 Exp 0] 93 msec 6 msec 5 msec
  3 ::192:102:13:13 94 msec 6 msec 6 msec

To demonstrate the loop protection, we will bring the backdoor link online and shut down CSR5’s PE-CE link. Because this PE-CE link was considered a connected route from CSR5’s perspective, shutting down the link will remove this connected route. The link is still up on the PE, XRv2, so it will be redistributed into BGP and assigned SoO:102:12. XRv1 will advertise the PE-CE link from XRv2 to XRv3 since the route’s SoO (SoO:102:12) does not match XRv1’s local SoO (SoO:102:11). XRv3 will do the same and advertise it to CSR5.

RP/0/0/CPU0:XRv2#show bgp vpnv6 unicast vrf EIGRP fd00:192:12:102::/64 | begin Local$
  Local
    :: from 0.0.0.0 (214.0.0.12)
      Origin incomplete, metric 0, localpref 100, weight 32768, valid, redistributed, best, group-best, import-candidate
      Received Path ID 0, Local Path ID 1, version 677
      Extended community: SoO:102:12 COST:128:128:10240 EIGRP route-info:0x8000:0 EIGRP AD:102:256 EIGRP RHB:255:0:2560 EIGRP LM:0x0:1:1500 EIGRP VRR:0x0:12.0.0.214 RT:214:102

RP/0/0/CPU0:XRv1#show bgp vpnv6 unicast vrf EIGRP fd00:192:12:102::/64 | begin Local$
[snip, alternate RR route not shown]
  Not advertised to any peer
  Local
    214.0.0.12 (metric 10) from 214.0.0.12 (214.0.0.12)
      Received Label 92008
      Origin incomplete, metric 0, localpref 100, valid, internal, best, group-best, import-candidate, imported
      Received Path ID 0, Local Path ID 1, version 660
      Extended community: SoO:102:12 COST:128:128:10240 EIGRP route-info:0x8000:0 EIGRP AD:102:256 EIGRP RHB:255:0:2560 EIGRP LM:0x0:1:1500 EIGRP VRR:0x0:12.0.0.214 RT:214:102
      Source VRF: EIGRP, Source Route Distinguisher: 214:102

RP/0/0/CPU0:XRv3#show eigrp vrf EIGRP ipv6 topology fd00:192:12:102::/64
IPv6-EIGRP VR(VPN) AS(102) VRF EIGRP: Topology entry for fd00:192:12:102::/64
  State is Passive, Query origin flag is 1, 1 Successor(s), FD is 1966080, RIB is 15360
  Routing Descriptor Blocks:
  fe80::11 (GigabitEthernet0/0/0/0.102), from fe80::11, Send flag is 0x0
    Composite metric is (1966080/1310720), Route is Internal
    Vector metric:
      Minimum bandwidth is 1000000 Kbit
      Total delay is 20000000 picoseconds
      Reliability is 255/255
      Load is 1/255
      Minimum MTU is 1500
      Hop count is 1
      Originating router is 214.0.0.12
    Extended Community: SoO:102:12

CSR5 should reject the route on ingress because its backdoor link is configured with SoO:102:12. This might reduce some redundancy if there were hosts on that LAN, or other networks directly connected to the PE; CSR5 would no longer have access. Unfortunately, the feature does not appear to work, and CSR5 is able to learn the route through the backdoor. I suspect that the “ip vrf sitemap” command doesn’t work on CEs for IPv6.

R5#show eigrp address-family ipv6 vrf EIGRP topology FD00:192:12:102::/64
EIGRP-IPv6 VR(VPN) Topology Entry for AS(102)/ID(192.102.5.5)
           Topology(base) TID(0) VRF(EIGRP)
EIGRP-IPv6(102): Topology base(0) entry for FD00:192:12:102::/64
  State is Passive, Query origin flag is 1, 1 Successor(s), FD is 2621440, RIB is 20480
  Descriptor Blocks:
  FE80::13 (GigabitEthernet2.202), from FE80::13, Send flag is 0x0
      Composite metric is (2621440/1966080), route is Internal
      Vector metric:
        Minimum bandwidth is 1000000 Kbit
        Total delay is 30000000 picoseconds
        Reliability is 255/255
        Load is 1/255
        Minimum MTU is 1500
        Hop count is 2
        Originating router is 214.0.0.12
      Extended Community: SoO:102:12

Because this test did not work, we will reverse the problem. We will bring CSR5’s link with XRv2 (PE) back up and shut down XRv3’s link to XRv1 (PE) instead. We will now be tracing the PE-CE link between XRv1 and XRv3. We first ensure that XRv1 creates the route, assigns SoO:102:11, and advertises it to XRv2. XRv2 should advertise it to CSR5, which should also see the SoO in the EIGRP topology.

RP/0/0/CPU0:XRv1#show bgp vpnv6 unicast vrf EIGRP fd00:192:11:102::/64 | begin Local$
  Local
    :: from 0.0.0.0 (214.0.0.11)
      Origin incomplete, metric 0, localpref 100, weight 32768, valid, redistributed, best, group-best, import-candidate
      Received Path ID 0, Local Path ID 1, version 688
      Extended community: SoO:102:11 COST:128:128:10240 EIGRP route-info:0x8000:0 EIGRP AD:102:256 EIGRP RHB:255:0:2560 EIGRP LM:0x0:1:1500 EIGRP VRR:0x0:11.0.0.214 RT:214:102


RP/0/0/CPU0:XRv2#show bgp vpnv6 unicast vrf EIGRP fd00:192:11:102::/64 | begin Local
  Local, (Received from a RR-client)
    214.0.0.11 (metric 10) from 214.0.0.11 (214.0.0.11)
      Received Label 91006
      Origin incomplete, metric 0, localpref 100, valid, internal, best, group-best, import-candidate, imported
      Received Path ID 0, Local Path ID 1, version 683
      Extended community: SoO:102:11 COST:128:128:10240 EIGRP route-info:0x8000:0 EIGRP AD:102:256 EIGRP RHB:255:0:2560 EIGRP LM:0x0:1:1500 EIGRP VRR:0x0:11.0.0.214 RT:214:102
      Source VRF: EIGRP, Source Route Distinguisher: 214:102

R5#show eigrp address-family ipv6 vrf EIGRP topology FD00:192:11:102::/64
EIGRP-IPv6 VR(VPN) Topology Entry for AS(102)/ID(192.102.5.5)
           Topology(base) TID(0) VRF(EIGRP)
EIGRP-IPv6(102): Topology base(0) entry for FD00:192:11:102::/64
  State is Passive, Query origin flag is 1, 1 Successor(s), FD is 1966080, RIB is 15360
  Descriptor Blocks:
  FE80::12 (GigabitEthernet2.102), from FE80::12, Send flag is 0x0
      Composite metric is (1966080/1310720), route is Internal
      Vector metric:
        Minimum bandwidth is 1000000 Kbit
        Total delay is 20000000 picoseconds
        Reliability is 255/255
        Load is 1/255
        Minimum MTU is 1500
        Hop count is 1
        Originating router is 214.0.0.11
      Extended Community: SoO:102:11

Now, when CSR5 advertises the route to XRv3, the route is rejected. This proves that XR’s implementation of SoO works correctly for IPv4 and IPv6, whereas XE only works reliably for IPv4.

RP/0/0/CPU0:XRv3#show eigrp vrf EIGRP ipv6 topology fd00:192:11:102::/64
% IPv6-EIGRP (VRF EIGRP): Route not in topology table of EIGRP VPN.
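The route's journey in this last test can be traced with a short Python sketch. It is an illustrative model only: `first_drop` and the checkpoint list are hypothetical names, the SoO values are the ones configured in this lab, and the sketch assumes a route is discarded at the first interface it crosses whose configured SoO equals the route's own SoO.

```python
from typing import List, Optional, Tuple

# (interface the route crosses, SoO configured on it or None)
Checkpoint = Tuple[str, Optional[str]]


def first_drop(route_soo: str, checkpoints: List[Checkpoint]) -> Optional[str]:
    """Return the first interface whose SoO matches the route's SoO, else None."""
    for name, interface_soo in checkpoints:
        if interface_soo is not None and interface_soo == route_soo:
            return name
    return None


# fd00:192:11:102::/64 is tagged SoO 102:11 by XRv1, crosses MPLS to XRv2,
# is advertised to CSR5, and then travels over the backdoor toward XRv3.
checkpoints = [
    ("XRv2 PE-CE link", "102:12"),
    ("CSR5 PE-CE link", None),
    ("CSR5 backdoor link", "102:12"),
    ("XRv3 backdoor link", "102:11"),
]

print(first_drop("102:11", checkpoints))
# 102:99 is a made-up SoO that matches no interface in the lab.
print(first_drop("102:99", checkpoints))
```

The first call stops at XRv3's backdoor link, which is where the route was rejected in the output above. The model describes the intended behavior; as noted earlier, CSR5 did not apply the equivalent check for IPv6 in the opposite direction.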

Before continuing, we will restore the PE-CE link on XRv3. Just like in the OSPF section, we will use TCL to verify full reachability from CSR5 and CSR6. The validation is not shown, but the script is below.

! CSR5 and CSR6
tclsh
foreach x {
192.102.5.5
192.102.6.6
192.102.13.13
192.102.14.14
50.50.5.5
::192:102:5:5
::192:102:6:6
::192:102:13:13
::192:102:14:14
::50:50:5:5
} { ping vrf EIGRP $x source loopback102 repeat 3 timeout 1 }

We will immediately notice the possibility of problems. A single PE router (CSR4) not supporting the critical EIGRP extended communities can break many things. CSR4 thinks that the best path to CSR6’s loopback is via XRv1, which is not true. Because this path has a relatively low pre-bestpath cost-community, it is compared against the local route’s default value of 2147483647 (2^31 - 1) and is much better. If CSR4 supported the community, the cost-community on the local route would certainly be better.

R4#show bgp vpnv4 unicast vrf EIGRP 192.102.6.6/32
BGP routing table entry for 214:102:192.102.6.6/32, version 884
Paths: (3 available, best #1, table EIGRP, RIB-failure(17) - next-hop mismatch)
  Not advertised to any peer
  Refresh Epoch 4
  Local
    214.0.0.11 (metric 30) (via default) from 214.0.0.3 (214.0.0.3)
      Origin incomplete, metric 3210262, localpref 100, valid, internal, best
      Extended Community: SoO:102:1212 RT:214:102 Cost:pre-bestpath:129:3210262 (default-2144273385) 0x8800:0:0 0x8801:102:139776 0x8802:59907:207360 0x8803:123:1500 0x8804:102:3590324236 0x8805:9:10880
      Originator: 214.0.0.11, Cluster list: 0.0.3.12
      mpls labels in/out nolabel/91030
      rx pathid: 0, tx pathid: 0x0
  Refresh Epoch 1
  Local
    214.0.0.11 (metric 30) (via default) from 214.0.0.12 (214.0.0.12)
      Origin incomplete, metric 3210262, localpref 100, valid, internal
      Extended Community: SoO:102:1212 RT:214:102 Cost:pre-bestpath:129:3210262 (default-2144273385) 0x8800:0:0 0x8801:102:139776 0x8802:59907:207360 0x8803:123:1500 0x8804:102:3590324236 0x8805:9:10880
      Originator: 214.0.0.11, Cluster list: 0.0.3.12
      mpls labels in/out nolabel/91030
      rx pathid: 0, tx pathid: 0
  Refresh Epoch 1
  Local
    192.4.102.6 (via vrf EIGRP) from 0.0.0.0 (214.0.0.4)
      Origin incomplete, metric 10880, localpref 100, weight 32768, valid, sourced
      Extended Community: RT:214:102
      rx pathid: 0, tx pathid: 0

To fix this problem, we must either re-enable the EIGRP extended-community processing on CSR4 or instruct BGP to ignore these communities entirely. We will choose the latter solution as it is more complicated. This feature cannot be modified on a per-AF basis and is global to the BGP process in its entirety. CSR4 will totally ignore the cost-community now, relying on the regular BGP best-path selection algorithm. Now, the local route is the best-path due to the weight attribute.

! CSR4
router bgp 214
 bgp bestpath cost-community ignore

R4#show bgp vpnv4 unicast vrf EIGRP 192.102.6.6/32 bestpath
[snip]
  Local
    192.4.102.6 (via vrf EIGRP) from 0.0.0.0 (214.0.0.4)
      Origin incomplete, metric 10880, localpref 100, weight 32768, valid, sourced, best
      Extended Community: RT:214:102
      mpls labels in/out 4053/nolabel
      rx pathid: 0, tx pathid: 0x0
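The effect of the cost-community on CSR4's decision can be illustrated with a deliberately reduced best-path model in Python. This is a sketch under stated assumptions rather than the real algorithm: only two steps are modeled (pre-bestpath cost, then weight), the community ID is not compared, and the `Path` class and `best_path` function are hypothetical names. A path without a cost-community is assigned the default of 2147483647, which agrees with the "(default-2144273385)" delta printed by CSR4.

```python
from dataclasses import dataclass
from typing import List, Optional

DEFAULT_COST = 2147483647  # cost assumed for a path with no cost-community


@dataclass
class Path:
    name: str
    weight: int
    cost: Optional[int] = None  # pre-bestpath cost-community value, if any


def best_path(paths: List[Path], ignore_cost_community: bool = False) -> Path:
    """Pick the winner using only two steps: lowest cost, then highest weight."""
    def key(p: Path):
        if ignore_cost_community or p.cost is None:
            cost = DEFAULT_COST
        else:
            cost = p.cost
        return (cost, -p.weight)
    return min(paths, key=key)


# Values taken from the CSR4 output for 192.102.6.6/32.
remote = Path("via 214.0.0.11 (XRv1)", weight=0, cost=3210262)
local = Path("via 192.4.102.6 (local)", weight=32768)

print(best_path([remote, local]).name)
print(best_path([remote, local], ignore_cost_community=True).name)
print(DEFAULT_COST - remote.cost)
```

With the community honored, the remote path wins on cost before weight is ever consulted. With it ignored, both paths tie on cost and the locally sourced path wins on weight, matching the "bestpath" output above.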

This still may not fix all of the problems in the network. In order for this to be effective, we would need to enable it everywhere, which would break many other things we tested. We will leave the EIGRP topology “broken” as a reference of how not to configure EIGRP over L3VPN. As a recommendation, either use cost-community everywhere or don’t use it anywhere. XRv1 is preferring XRv2 to reach CSR6 now, and advertises this as its bestpath to CSR4. Even though CSR4 is picking the proper route, the other routers are not. Manually manipulating the entire network to fix these issues is not worth it, so again, we leave the network broken.

RP/0/0/CPU0:XRv1#show bgp vpnv4 unicast vrf EIGRP 192.102.6.6/32 brief | begin Network
   Network            Next Hop        Metric LocPrf Weight Path
Route Distinguisher: 214:102 (default for vrf EIGRP)
* i192.102.6.6/32     214.0.0.12     3210262    100      0 ?
*>i                   214.0.0.12     3210262    100      0 ?

As one final comment, EIGRP is capable of carrying SoO inside of external routes. Looking at 50.50.5.5/32, which was redistributed into EIGRP on CSR5, it carries the SoO from XRv2. When CSR3 redistributes it back into EIGRP, the SoO is carried with it as well. We won’t retest SoO for this specific route, but be aware that the SoO rules apply to all EIGRP routes that carry the SoO attribute.

R3#show bgp vpnv4 unicast vrf EIGRP 50.50.5.5/32
BGP routing table entry for 214:102:50.50.5.5/32, version 234
Paths: (1 available, best #1, table EIGRP)
  Advertised to update-groups:
     14
  Refresh Epoch 1
  Local, (Received from a RR-client)
    214.0.0.12 (metric 30) (via default) from 214.0.0.12 (214.0.0.12)
      Origin incomplete, metric 10880, localpref 100, valid, internal, best
      Extended Community: SoO:102:1212 RT:214:102 Cost:pre-bestpath:129:10880 0x8800:0:0 0x8801:102:288 0x8802:65281:2560 0x8803:1:1500 0x8804:0:3227911429 0x8805:11:0
      mpls labels in/out nolabel/92024
      rx pathid: 0, tx pathid: 0x0

R3#show eigrp address-family ipv4 vrf EIGRP topology 50.50.5.5/32
EIGRP-IPv4 VR(VPN) Topology Entry for AS(102)/ID(192.3.102.3)
           Topology(base) TID(0) VRF(EIGRP)
EIGRP-IPv4(102): Topology base(0) entry for 50.50.5.5/32
  State is Passive, Query origin flag is 1, 1 Successor(s), FD is 1392640
  Descriptor Blocks:
  214.0.0.12, from VPNv4 Sourced, Send flag is 0x0
      Composite metric is (1392640/0), route is External (VPNv4 Sourced)
      Vector metric:
        Minimum bandwidth is 1000000 Kbit
        Total delay is 11250000 picoseconds
        Reliability is 255/255
        Load is 1/255
        Minimum MTU is 1500
        Hop count is 1
      Extended Community: SoO:102:1212
      External data:
        Originating router is 192.102.5.5
        AS number of route is 0
        External protocol is Connected, external metric is 0
        Administrator tag is 0 (0x00000000)

38.1.4 IS-IS

IS-IS is not commonly used as a PE-CE routing protocol as it is not common in enterprise (customer) networks. VRF-aware IS-IS is also not currently supported on XR, so demonstrating it in this topology is difficult. Additionally, IPv6 IS-IS is only supported in the XE global process. These limitations will drastically reduce the length of this section.

R3(config-subif)#ipv6 router isis 103
%ISIS: Cannot enable ISIS-IPv6 for non-default VRF process
R3(config-subif)#ipv6 router isis
%ISIS: Cannot enable ISIS-IPv6 for non-default VRF interface

The protocol itself does not support carrying SoO attributes, nor is there any concept of a sham-link for back-door protection. VRF-aware IS-IS is only enabled between CSR4 (PE) and CSR6 (CE) for a brief demonstration. First, we verify the neighbor is up and IS-IS is operational. The configuration is identical to global-table IS-IS except the “vrf” keyword is included under the process. The PE configuration is shown below, which also redistributes from BGP.

! CSR4
router isis 103
 vrf ISIS
 net 49.0103.0000.0000.0004.00
 is-type level-2-only
 redistribute bgp 214

R4#show isis 103 neighbors
Tag 103:
System Id      Type Interface   IP Address      State Holdtime Circuit Id
R6             L2   Gi2.103     192.4.103.6     UP    23       00

Because there is no remote IS-IS VPN, we will exchange VPN routes with CSR1 (central services) by importing RT:214:8. CSR8 will import RT:214:103 as well (not shown). Now, CSR6 will learn new routes via CSR4. We can see that all 15 routes are IS-IS Level 2 routes. The LSPDB shows that these are IP-External entries as part of CSR4’s LSP. Their metrics are 0 at the point of redistribution, much like loopbacks.

R6#show ip route vrf ISIS summary | section isis 103
isis 103        0           15          0           1440        4320
  Level 1: 0 Level 2: 15 Inter-area: 0

R6#show isis 103 database detail level-2 R4.00-00
Tag 103:
IS-IS Level-2 LSP R4.00-00
LSPID                 LSP Seq Num  LSP Checksum  LSP Holdtime      ATT/P/OL
R4.00-00              0x000001C4   0xCAFB        1116              0/0/0
  Area Address: 49.0103
  NLPID:        0xCC
  Hostname: R4
  Metric: 10         IS R6.00
  IP Address:   192.4.103.4
  Metric: 10         IP 192.4.103.0 255.255.255.0
  Metric: 0          IP-External 82.125.160.0 255.255.255.240
  Metric: 0          IP-External 82.125.160.16 255.255.255.240
[snip]
  Metric: 0          IP-External 82.125.160.0 255.255.255.192
  Metric: 0          IP-External 82.125.160.64 255.255.255.192
  Metric: 0          IP-External 10.1.8.0 255.255.255.0

It is worth noting that when redistributing from IS-IS, the connected routes are not included. We confirm this on CSR4 by checking the VPNv4 table for the CSR4-CSR6 PE-CE link. We verify that CSR6’s loopback made it into the table, which is an IS-IS route on CSR4.

R4#show bgp vpnv4 unicast vrf ISIS 192.4.103.0/24
% Network not in table

R4#show bgp vpnv4 unicast vrf ISIS 192.103.6.6/32
[snip]
  Local
    192.4.103.6 (via vrf ISIS) from 0.0.0.0 (214.0.0.4)
      Origin incomplete, metric 10, localpref 100, weight 32768, valid, sourced, best
      Extended Community: RT:214:103
      mpls labels in/out 4055/nolabel
      rx pathid: 0, tx pathid: 0x0

Something interesting about SoO is that, when configured on a PE router, it is entirely protocol-agnostic. IS-IS can neither carry nor interpret SoO, but routes learned on the PE-CE link can still have this community encoded when redistributing into BGP. We quickly test this on CSR4 by applying the SoO on the link towards CSR6 and checking the BGP VPNv4 table.

! CSR4
route-map RM_SOO permit 10
 set extcommunity soo 103:4
interface GigabitEthernet2.103
 ip vrf sitemap RM_SOO

R4#show bgp vpnv4 unicast vrf ISIS 192.103.6.6/32
BGP routing table entry for 214:103:192.103.6.6/32, version 1177
BGP Bestpath: cost-community-ignore
Paths: (1 available, best #1, table ISIS)
  Advertised to update-groups:
     1
  Refresh Epoch 1
  Local
    192.4.103.6 (via vrf ISIS) from 0.0.0.0 (214.0.0.4)
      Origin incomplete, metric 10, localpref 100, weight 32768, valid, sourced, best
      Extended Community: SoO:103:4 RT:214:103
      mpls labels in/out 4056/nolabel
      rx pathid: 0, tx pathid: 0x0

This could be used to filter routes advertised to CSR1 on CSR8 using BGP SoO, which is covered in detail later. Identifying this peer as being in the same “site” as CSR6 means that routes can no longer be exchanged within the IPv4 AFI. CSR8 cannot advertise CSR6’s loopback to CSR1 now, but it can import it locally.

! CSR8
router bgp 214
 address-family ipv4 vrf BOTTOM
  neighbor 10.1.8.1 soo 103:4

R8#show bgp vpnv4 unicast vrf BOTTOM 192.103.6.6/32
[snip]
  Local, imported path from 214:103:192.103.6.6/32 (global)
    214.0.0.4 (metric 10) (via default) from 214.0.0.3 (214.0.0.3)
      Origin incomplete, metric 10, localpref 100, valid, internal, best
      Extended Community: SoO:103:4 RT:214:103
      Originator: 214.0.0.4, Cluster list: 0.0.3.12
      mpls labels in/out nolabel/4056
      rx pathid: 0, tx pathid: 0x0

R8#show bgp vpnv4 unicast vrf BOTTOM neighbors 10.1.8.1 advertised-routes | include 192.103
[no output]

R1#show bgp ipv4 unicast 192.103.6.6/32
% Network not in table

Interestingly, because IS-IS does not understand SoO, CSR4 still freely redistributes CSR1’s routes from BGP into IS-IS despite the SoO being the same. Selecting any route from CSR1, we can see the SoO is set and IS-IS appears to ignore it. CSR6 installs this route in its RIB as expected.

R4#show bgp vpnv4 unicast vrf ISIS 82.125.160.48/28
[snip]
  82, imported path from 214:8:82.125.160.48/28 (global)
    214.0.0.8 (metric 10) (via default) from 214.0.0.3 (214.0.0.3)
      Origin IGP, metric 0, localpref 100, valid, internal, best
      Extended Community: SoO:103:4 RT:214:8
      Originator: 214.0.0.8, Cluster list: 0.0.3.12
      mpls labels in/out nolabel/8028
      rx pathid: 0, tx pathid: 0x0

R4#show isis 103 database level-2 detail R4.00-00 | include 82.125.160.48
  Metric: 0          IP-External 82.125.160.48 255.255.255.240

R6#show ip route vrf ISIS 82.125.160.48 255.255.255.240
Routing Table: ISIS
Routing entry for 82.125.160.48/28
  Known via "isis", distance 115, metric 10, type level-2
  Redistributing via isis 103
  Last update from 192.4.103.4 on GigabitEthernet2.103, 00:19:25 ago
  Routing Descriptor Blocks:
  * 192.4.103.4, from 192.4.103.4, 00:19:25 ago, via GigabitEthernet2.103
      Route metric is 10, traffic share count is 1
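The two effects of the BGP “neighbor soo” setting seen on CSR8 (routes learned from CSR1 pick up SoO:103:4, and routes already carrying SoO:103:4 are withheld from CSR1) can be sketched in Python. This is a simplified model, not router behavior in full: `BgpRoute` and both function names are hypothetical, the sketch assumes an inbound route that already has an SoO keeps it, and 203.0.113.0/24 is a made-up prefix used only as an untagged example.

```python
from dataclasses import dataclass, replace
from typing import List, Optional


@dataclass(frozen=True)
class BgpRoute:
    prefix: str
    soo: Optional[str] = None


def receive_from_neighbor(route: BgpRoute, neighbor_soo: Optional[str]) -> BgpRoute:
    """Inbound: stamp the neighbor's SoO on a route that has none (assumption)."""
    if neighbor_soo is not None and route.soo is None:
        return replace(route, soo=neighbor_soo)
    return route


def advertise_to_neighbor(routes: List[BgpRoute],
                          neighbor_soo: Optional[str]) -> List[BgpRoute]:
    """Outbound: withhold any route whose SoO equals the neighbor's SoO."""
    return [r for r in routes if neighbor_soo is None or r.soo != neighbor_soo]


# CSR6's loopback was tagged SoO 103:4 on CSR4's PE-CE link.
csr6_loopback = BgpRoute("192.103.6.6/32", soo="103:4")
untagged = BgpRoute("203.0.113.0/24")  # hypothetical prefix with no SoO

sent = advertise_to_neighbor([csr6_loopback, untagged], "103:4")
print([r.prefix for r in sent])

learned = receive_from_neighbor(BgpRoute("82.125.160.48/28"), "103:4")
print(learned.soo)
```

Nothing in this model involves IS-IS, which is consistent with CSR4 redistributing CSR1's tagged routes into IS-IS without any filtering.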


In the interest of keeping the network interesting, we will leave the IS-IS to BGP AS 82 connectivity broken intentionally. Removing the SoO from either CSR4’s PE-CE link or CSR8’s neighbor statement to CSR1 will restore connectivity. IS-IS was never designed to use SoO in this context but this lab illustrates the protocol-agnostic characteristics of SoO.

38.1.5 BGP and Site-of-Origin (SoO)

BGP is the simplest, most popular, most appropriate PE-CE routing protocol option. Traditionally, it is used between ASNs for inter-network routing, and MPLS L3VPN PE-CE is a perfect use-case. BGP’s flexible filtering support, as well as SoO, make it the best option. It is also widely offered by service providers in real life since it is supported on all vendor platforms and is straightforward to implement in many cases. The configurations on CSR5 (CE) and XRv2 (PE) are shown below as a summary reference. Note that the backdoor connections are configured but in a shutdown state initially.

! CSR5
router bgp 104
 no bgp default ipv4-unicast
 address-family ipv4 vrf BGP
  redistribute connected
  neighbor 192.5.104.13 remote-as 104
  neighbor 192.5.104.13 shutdown
  neighbor 192.5.104.13 activate
  neighbor 192.12.104.12 remote-as 214
  neighbor 192.12.104.12 activate
 address-family ipv6 vrf BGP
  redistribute connected
  neighbor FD00:192:5:104::13 remote-as 104
  neighbor FD00:192:5:104::13 shutdown
  neighbor FD00:192:5:104::13 activate
  neighbor FD00:192:12:104::12 remote-as 214
  neighbor FD00:192:12:104::12 activate
! XRv2
router bgp 214
 vrf BGP
  rd 214:104
  address-family ipv4 unicast
  address-family ipv6 unicast
  neighbor 192.12.104.5
   remote-as 104
   address-family ipv4 unicast
    route-policy PASS in
    route-policy PASS out
  neighbor fd00:192:12:104::5
   remote-as 104
   address-family ipv6 unicast
    route-policy PASS in
    route-policy PASS out

We quickly verify that the neighbors are up between PE-CE on these routers. The XR routers work well without needing to define multiple sessions for different AFs as the IPv6 next-hops are automatically adjusted. The verifications on the left-half of the network are not shown for brevity. The backdoor peer is marked as “Active” on XRv3 because only CSR5 has the peer shut down; XRv3 keeps attempting to establish the session, but it cannot form.

RP/0/0/CPU0:XRv3#show bgp vrf BGP summary | begin Neighbor
Neighbor           Spk   AS MsgRcvd MsgSent  TblVer  InQ OutQ  Up/Down  St/PfxRcd
192.5.104.5          0  104     111      86       0    0    0    3d07h Active
192.11.104.11        0  214    5873    5849      74    0    0    4d01h          0

RP/0/0/CPU0:XRv3#show bgp vrf BGP ipv6 unicast summary | begin Neighbor
Neighbor           Spk   AS MsgRcvd MsgSent  TblVer  InQ OutQ  Up/Down  St/PfxRcd
192.11.104.11        0  214    5874    5850      67    0    0    4d01h          0
fd00:192:5:104::5    0  104     108      86       0    0    0    3d07h Active

R5#show bgp vpnv4 unicast vrf BGP summary | begin ^Neighbor
Neighbor             V   AS MsgRcvd MsgSent  TblVer  InQ OutQ Up/Down  State/PfxRcd
192.5.104.13         4  104       0       0       1    0    0 3d07h    Idle (Admin)
192.12.104.12        4  214    5858    6509     350    0    0 4d01h           0

R5#show bgp vpnv6 unicast vrf BGP summary | begin ^Neighbor
Neighbor             V   AS MsgRcvd MsgSent  TblVer  InQ OutQ Up/Down  State/PfxRcd
FD00:192:5:104::13   4  104       0       0       1    0    0 3d07h    Idle (Admin)
FD00:192:12:104::12  4  214    5856    6493    1288    0    0 4d01h           0

Because we are using the same basic MPLS L3VPN construct as we did for the other tests, we should see all of the customer VPN routes in each of the CE sites. The output above shows 0, which means the peer is up but no prefixes were received. We quickly check XRv2 to see if it is advertising routes to CSR5; it isn’t. XR is trying to be “smart” and not advertise routes to CSR5 that it knows will be dropped on ingress by the eBGP loop prevention mechanism. If a peer has AS X and AS X appears in the AS_PATH of a BGP route, it is highly likely that the peer will reject the routes on ingress.

RP/0/0/CPU0:XRv2#show bgp vrf BGP ipv4 unicast neighbors 192.12.104.5 advertised-routes
[no output]

Looking across the network to CSR3 and CSR4 as next-hops, we see that XRv2 has learned VPNv4 routes from those peers and imported them into the BGP VPN. Because they have AS 104 in the path, XR knows that CSR5 cannot accept this, and doesn’t even attempt to advertise it, as seen above. 2267 © 2016 Nicholas J. Russo

RP/0/0/CPU0:XRv2#show bgp vpnv4 unicast vrf BGP | include 214.0.0.[34]
*>i192.3.104.0/24     214.0.0.3     0    100      0 104 ?
*>i192.4.104.0/24     214.0.0.4     0    100      0 104 ?
*>i192.6.104.0/24     214.0.0.3     0    100      0 104 ?
*>i192.104.6.6/32     214.0.0.4     0    100      0 104 ?
*>i192.104.14.14/32   214.0.0.3     0    100      0 104 ?

We can disable this intelligent behavior on XRv2 for IPv4/v6 as shown below. Now, XRv2 is advertising these routes to CSR5.

! XRv12
router bgp 214
 vrf BGP
  address-family ipv4 unicast
   as-path-loopcheck out disable
  address-family ipv6 unicast
   as-path-loopcheck out disable

RP/0/0/CPU0:XRv2#show bgp vrf BGP ipv4 unicast neighbors 192.12.104.5 advertised-routes
Network            Next Hop        From            AS Path
Route Distinguisher: 214:104 (default for vrf BGP)
192.3.104.0/24     192.12.104.12   214.0.0.3       214 104?
192.4.104.0/24     192.12.104.12   214.0.0.4       214 104?
192.6.104.0/24     192.12.104.12   214.0.0.3       214 104?
192.11.104.0/24    192.12.104.12   214.0.0.11      214 104?
192.104.6.6/32     192.12.104.12   214.0.0.4       214 104?
192.104.13.13/32   192.12.104.12   214.0.0.11      214 104?
192.104.14.14/32   192.12.104.12   214.0.0.3       214 104?

CSR5 still only has a set of local routes due to the loop prevention mechanism. Debugging BGP updates inbound on CSR5 reveals this. We expected this to happen once the PE started advertising the routes.

R5#show bgp vpnv4 unicast vrf BGP | begin Network
     Network          Next Hop            Metric LocPrf Weight Path
Route Distinguisher: 214:104 (default for vrf BGP)
 *>  192.5.104.0      0.0.0.0                  0         32768 ?
 *>  192.12.104.0     0.0.0.0                  0         32768 ?
 *>  192.104.5.5/32   0.0.0.0                  0         32768 ?

! CSR5
BGP(0): 192.12.104.12 rcv UPDATE about 192.5.104.0/24 -- DENIED due to: ASPATH contains our own AS;
BGP: 192.12.104.12 Modifying prefix 192.5.104.0/24 from 0 -> 4 address


There are two options to fix it. The first solution is CE-based. It relaxes the loop-prevention rule by telling CSR5 to accept BGP routes that contain the local AS in the AS path up to a configured number of times. This is a number between 1 and 10 on both XE and XR platforms; the local AS can occur anywhere in the AS path, up to the number of times we specify. The larger the number, the larger the possibility of a loop. The configuration for CSR5 is shown below; it allows at most 1 instance of AS 104 in the AS path of any route received from XRv2. After configuring it, we will ensure the configuration is applied properly to the neighbor. Because we have fixed the problem for IPv4 and IPv6, we will verify both concurrently.

! CSR5
router bgp 104
 address-family ipv4 vrf BGP
  neighbor 192.12.104.12 allowas-in 1
 address-family ipv6 vrf BGP
  neighbor FD00:192:12:104::12 allowas-in 1

R5#show bgp vpnv4 unicast vrf BGP neighbors 192.12.104.12 | include For address|allow
 For address family: VPNv4 Unicast
  My AS number is allowed for 1 number of times
R5#show bgp vpnv6 unicast vrf BGP neighbors FD00:192:12:104::12 | include For address|allow
 For address family: VPNv6 Unicast
  My AS number is allowed for 1 number of times
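As a rough illustration of the inbound check that “allowas-in” relaxes, the following Python sketch counts occurrences of the local AS in a received AS path. The function name `accept_update` is hypothetical, and the prepended path is an invented example rather than output from this lab.

```python
def accept_update(as_path, local_as, allowas_in=0):
    """Inbound eBGP loop check.

    The update is accepted only if the local AS appears in the AS path no
    more than allowas_in times. The default of 0 models the normal rule.
    """
    return as_path.count(local_as) <= allowas_in


received = [214, 104]             # path seen by CSR5 from XRv2
prepended = [214, 104, 104, 104]  # hypothetical: remote CE prepends AS 104 twice

print(accept_update(received, local_as=104))                  # denied by default
print(accept_update(received, local_as=104, allowas_in=1))    # accepted
print(accept_update(prepended, local_as=104, allowas_in=1))   # denied again
print(accept_update(prepended, local_as=104, allowas_in=3))   # accepted
```

The third call shows why remote prepending forces the CE to raise the value above 1.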

Now, we are able to learn all of the remote prefixes from the other BGP speakers. AS 104 is still in the path, but this is permitted one time. If a remote CE router applied AS path prepending to influence the SP’s routing decision, this solution would not work without increasing the value up from 1.

R5#show bgp vpnv4 unicast vrf BGP | include 192.12.104.12
 *>  192.3.104.0        192.12.104.12     0 214 104 ?
 *>  192.4.104.0        192.12.104.12     0 214 104 ?
 *>  192.6.104.0        192.12.104.12     0 214 104 ?
 *>  192.11.104.0       192.12.104.12     0 214 104 ?
 *>  192.104.6.6/32     192.12.104.12     0 214 104 ?
 *>  192.104.13.13/32   192.12.104.12     0 214 104 ?
 *>  192.104.14.14/32   192.12.104.12     0 214 104 ?

R5#show bgp vpnv6 unicast vrf BGP | begin Network
     Network               Next Hop              Metric LocPrf Weight Path
Route Distinguisher: 214:104 (default for vrf BGP)
 *>  ::192:104:5:5/128     ::                         0         32768 ?
 *>  ::192:104:6:6/128     FD00:192:12:104::12                      0 214 104 ?
 *>  ::192:104:13:13/128   FD00:192:12:104::12                      0 214 104 ?
[snip]

As a CE within the same VPN, XRv13 will have the same problem, but we can use an alternative solution. The second and most straightforward solution is to have the PE routers replace all instances of the customer AS number with the SP’s AS number. That is, every instance of 104 in the AS path will be rewritten to 214 when advertised to the eBGP peer inside the VPN. The AS path length remains the same, but the values change. XRv1 will iterate over the AS path and make this adjustment before advertising the routes, “overriding” the VPN AS with the SP AS, hence the feature’s name. We also don’t have to disable the outbound loop-check on XRv1 since the AS override operation happens before the outbound loop-check and advertisement. After configuring the feature, we can verify the configuration worked by checking the neighbor details.

! XRv1
router bgp 214
 vrf BGP
  neighbor 192.11.104.13
   address-family ipv4 unicast
    as-override
   address-family ipv6 unicast
    as-override

RP/0/0/CPU0:XRv1#show bgp vrf BGP ipv4 unicast neighbors 192.11.104.13 | utility egrep 'For Address|override'
 For Address Family: IPv4 Unicast
  AS override is set
 For Address Family: IPv6 Unicast
  AS override is set
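The rewrite can be sketched in a few lines of Python. This is a simplified model under stated assumptions: the helper names `as_override` and `advertise_ebgp` are hypothetical, the override is applied first and the PE then prepends its own AS as for any eBGP advertisement, and the prepended path is an invented example.

```python
def as_override(as_path, neighbor_as, local_as):
    """Replace every occurrence of the neighbor's AS with the local (SP) AS."""
    return [local_as if asn == neighbor_as else asn for asn in as_path]


def advertise_ebgp(as_path, neighbor_as, local_as, override=False):
    """Build the AS path the CE would receive from the PE."""
    if override:
        as_path = as_override(as_path, neighbor_as, local_as)
    return [local_as] + as_path  # normal eBGP prepend of the PE's own AS


# Route originated by a remote AS 104 site, as held in the PE's VRF table.
print(advertise_ebgp([104], neighbor_as=104, local_as=214))                 # [214, 104]
print(advertise_ebgp([104], neighbor_as=104, local_as=214, override=True))  # [214, 214]

# Hypothetical: the remote CE prepends AS 104 twice. The length is preserved.
print(advertise_ebgp([104, 104, 104], neighbor_as=104, local_as=214, override=True))
```

Because no instance of AS 104 survives the rewrite, the receiving CE needs no change regardless of how much prepending the remote site applies.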

Next, we can verify that XRv1 is advertising these BGP VPN routes to the CE, which it is. This output does not account for the AS override operation and shows the original VPN AS.

RP/0/0/CPU0:XRv1#show bgp vrf BGP ipv4 unicast neighbors 192.11.104.13 advertised-routes
Network            Next Hop        From            AS Path
Route Distinguisher: 214:104 (default for vrf BGP)
192.3.104.0/24     192.11.104.11   214.0.0.3       214 104?
192.4.104.0/24     192.11.104.11   214.0.0.3       214 104?
192.6.104.0/24     192.11.104.11   214.0.0.3       214 104?
192.12.104.0/24    192.11.104.11   214.0.0.12      214 104?
192.104.5.5/32     192.11.104.11   214.0.0.12      214 104?
192.104.6.6/32     192.11.104.11   214.0.0.3       214 104?
192.104.14.14/32   192.11.104.11   214.0.0.3       214 104?


RP/0/0/CPU0:XRv1#show bgp vrf BGP ipv6 unicast neighbors 192.11.104.13 advertised-routes
Network               Next Hop              From            AS Path
Route Distinguisher: 214:104 (default for vrf BGP)
::192:104:5:5/128     fd00:192:11:104::11   214.0.0.12      214 104?
::192:104:6:6/128     fd00:192:11:104::11   214.0.0.3       214 104?
::192:104:14:14/128   fd00:192:11:104::11   214.0.0.3       214 104?
[snip]

Last, we verify that the CE receives the routes without seeing its own AS. The originating AS appears to be 214 according to the new AS path. This scales much better than “allowas-in” because remote CEs can use AS path prepending without necessitating any changes on the local CE.

RP/0/0/CPU0:XRv3#show bgp vrf BGP ipv4 unicast neighbors 192.11.104.11 routes | begin Network
   Network            Next Hop          Metric LocPrf Weight Path
Route Distinguisher: 214:104 (default for vrf BGP)
*> 192.3.104.0/24     192.11.104.11                        0 214 214 ?
*> 192.4.104.0/24     192.11.104.11                        0 214 214 ?
*> 192.6.104.0/24     192.11.104.11                        0 214 214 ?
*> 192.12.104.0/24    192.11.104.11                        0 214 214 ?
*> 192.104.5.5/32     192.11.104.11                        0 214 214 ?
*> 192.104.6.6/32     192.11.104.11                        0 214 214 ?
*> 192.104.14.14/32   192.11.104.11                        0 214 214 ?

RP/0/0/CPU0:XRv3#show bgp vrf BGP ipv6 unicast neighbors 192.11.104.11 routes | begin Network
   Network               Next Hop              Metric LocPrf Weight Path
Route Distinguisher: 214:104 (default for vrf BGP)
*> ::192:104:5:5/128     fd00:192:11:104::11                      0 214 214 ?
*> ::192:104:6:6/128     fd00:192:11:104::11                      0 214 214 ?
*> ::192:104:14:14/128   fd00:192:11:104::11                      0 214 214 ?
[snip]

Now, CSR5 and XRv13 have reachability over the SP network to one another, but CSR6 and XRv14 will have the same problems.

R5#ping vrf BGP 192.104.13.13 source 192.104.5.5
Type escape sequence to abort.
Sending 5, 100-byte ICMP Echos to 192.104.13.13, timeout is 2 seconds:
Packet sent with a source address of 192.104.5.5
!!!!!
Success rate is 100 percent (5/5), round-trip min/avg/max = 5/5/6 ms

For variety, we will troubleshoot XRv4 as if we don’t understand the problem. XRv4 only sees its local routes. Debugging BGP updates inbound, we see messages similar to those on CSR5, whereby the routes advertised by the PE are rejected due to the AS path. It is interesting to note that XE makes no attempt to filter the routes outbound despite being able to see the loop, whereas we had to disable this intelligent behavior on XRv2.

RP/0/0/CPU0:XRv4#show bgp vrf BGP ipv4 unicast | begin Network
   Network            Next Hop            Metric LocPrf Weight Path
Route Distinguisher: 214:104 (default for vrf BGP)
*> 192.3.104.0/24     0.0.0.0                  0         32768 ?
*> 192.6.104.0/24     0.0.0.0                  0         32768 ?
*> 192.104.14.14/32   0.0.0.0                  0         32768 ?

! XRv4
bgp[1052]: [default-rtr] VRF BGP (ip4u): UPDATE from 192.3.104.3, prefix 192.11.104.0/24 (path ID: none) DENIED due to:
bgp[1052]: [default-rtr] VRF BGP (ip4u): as-path contains our own AS, or 0;

We already know there are two solutions: allow AS 104 in on the CE (allowas-in), or change 104 to 214 outbound on the PE (as-override). We will use “allowas-in” to see how it works on XR. As always, after configuring the feature we will verify it was applied to the neighbor properly.

! XRv14
router bgp 104
 vrf BGP
  neighbor 192.3.104.3
   address-family ipv4 unicast
    allowas-in 1
  neighbor fd00:192:3:104::3
   address-family ipv6 unicast
    allowas-in 1

RP/0/0/CPU0:XRv4#show bgp vrf BGP ipv4 unicast neighbors 192.3.104.3 | utility egrep 'For Address|allow'
 For Address Family: IPv4 Unicast
  Maximum prefixes allowed 1048576
  My AS number is allowed 1 times in received updates
RP/0/0/CPU0:XRv4#show bgp vrf BGP ipv6 unicast neighbors fd00:192:3:104::3 | utility egrep 'For Address|allow'
 For Address Family: IPv6 Unicast
  Maximum prefixes allowed 524288
  My AS number is allowed 1 times in received updates


Next, we verify that XRv14 is actually learning routes with AS 104 in the path for both IPv4 and IPv6. This would imply that the configuration worked. AS 104 is still in the AS path which means the provider did not rewrite it.

RP/0/0/CPU0:XRv4#show bgp vrf BGP ipv4 unicast regexp 104$ | begin Network
   Network            Next Hop          Metric LocPrf Weight Path
Route Distinguisher: 214:104 (default for vrf BGP)
*> 192.4.104.0/24     192.3.104.3                          0 214 104 ?
*> 192.5.104.0/24     192.3.104.3                          0 214 104 ?
*> 192.11.104.0/24    192.3.104.3                          0 214 104 ?
*> 192.12.104.0/24    192.3.104.3                          0 214 104 ?
*> 192.104.5.5/32     192.3.104.3                          0 214 104 ?
*> 192.104.6.6/32     192.3.104.3                          0 214 104 ?
*> 192.104.13.13/32   192.3.104.3                          0 214 104 ?

RP/0/0/CPU0:XRv4#show bgp vrf BGP ipv6 unicast regexp 104$ | begin Network
   Network               Next Hop            Metric LocPrf Weight Path
Route Distinguisher: 214:104 (default for vrf BGP)
*> ::192:104:5:5/128     fd00:192:3:104::3                      0 214 104 ?
*> ::192:104:6:6/128     fd00:192:3:104::3                      0 214 104 ?
*> ::192:104:13:13/128   fd00:192:3:104::3                      0 214 104 ?
[snip]
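The `104$` filter selects paths whose originating (rightmost) AS ends in 104. The Python sketch below approximates that match with the standard `re` module; it assumes the AS path is rendered as a space-separated string and is not a model of the router's own regex engine. The helper name `matches_as_path` and AS 65104 are hypothetical.

```python
import re


def matches_as_path(pattern, as_path):
    """Render the AS path as text and test it against a regular expression."""
    text = " ".join(str(asn) for asn in as_path)
    return re.search(pattern, text) is not None


print(matches_as_path(r"104$", [214, 104]))    # True: originated by AS 104
print(matches_as_path(r"104$", [214, 214]))    # False: AS override was applied

# An unanchored pattern also matches any AS number that merely ends in 104.
print(matches_as_path(r"104$", [214, 65104]))     # True
print(matches_as_path(r"\b104$", [214, 65104]))   # False once a boundary is required
```

In this lab the looser pattern is harmless because no other AS ending in 104 exists.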

A quick ping check to CSR5 and XRv13, using a mix of IPv4 and IPv6, shows that XRv14 now has connectivity over the VPN.

RP/0/0/CPU0:XRv4#ping vrf BGP 192.104.5.5 source 192.104.14.14
Type escape sequence to abort.
Sending 5, 100-byte ICMP Echos to 192.104.5.5, timeout is 2 seconds:
!!!!!
Success rate is 100 percent (5/5), round-trip min/avg/max = 1/7/9 ms
RP/0/0/CPU0:XRv4#ping vrf BGP ::192:104:13:13 source ::192:104:14:14
Type escape sequence to abort.
Sending 5, 100-byte ICMP Echos to ::192:104:13:13, timeout is 2 seconds:
!!!!!
Success rate is 100 percent (5/5), round-trip min/avg/max = 1/6/9 ms

Last, we need to fix CSR6. We have already seen the problem several times so we will fix it quickly by enabling AS override on CSR4. After configuring, we verify that it was applied to the VPN peer.

! CSR4
router bgp 214
 address-family ipv4 vrf BGP
  neighbor 192.4.104.6 as-override
 address-family ipv6 vrf BGP
  neighbor FD00:192:4:104::6 as-override

R4#show bgp vpnv4 unicast vrf BGP neighbors 192.4.104.6 | include For address|Override
 For address family: VPNv4 Unicast
  Overrides the neighbor AS with my AS before sending updates
R4#show bgp vpnv6 unicast vrf BGP neighbors FD00:192:4:104::6 | include For address|Override
 For address family: VPNv6 Unicast
  Overrides the neighbor AS with my AS before sending updates

CSR6 receives the IPv4 and IPv6 routes. For variety, we will check the specific IPv4 route to XRv13 and IPv6 route to CSR5. Notice that the AS path is “214 214” as expected per the AS override feature.

R6#show bgp vpnv4 unicast vrf BGP 192.104.13.13/32
BGP routing table entry for 214:104:192.104.13.13/32, version 29
Paths: (1 available, best #1, table BGP)
  Not advertised to any peer
  Refresh Epoch 2
  214 214
    192.4.104.4 (via vrf BGP) from 192.4.104.4 (214.0.0.4)
      Origin incomplete, localpref 100, valid, external, best
      rx pathid: 0, tx pathid: 0x0
R6#show bgp vpnv6 unicast vrf BGP ::192:104:5:5/128
BGP routing table entry for [214:104]::192:104:5:5/128, version 1324
Paths: (1 available, best #1, table BGP)
  Not advertised to any peer
  Refresh Epoch 1
  214 214
    FD00:192:4:104::4 (FE80::4) (via vrf BGP) from FD00:192:4:104::4 (214.0.0.4)
      Origin incomplete, localpref 100, valid, external, best
      rx pathid: 0, tx pathid: 0x0

A quick ping check verifies VPN connectivity to all three remote sites. We can reasonably assume that all of the BGP customers can communicate now.

R6#ping vrf BGP 192.104.13.13 source 192.104.6.6
Type escape sequence to abort.
Sending 5, 100-byte ICMP Echos to 192.104.13.13, timeout is 2 seconds:
Packet sent with a source address of 192.104.6.6
!!!!!
Success rate is 100 percent (5/5), round-trip min/avg/max = 6/7/9 ms
R6#ping vrf BGP ::192:104:5:5 source ::192:104:6:6
Type escape sequence to abort.
Sending 5, 100-byte ICMP Echos to ::192:104:5:5, timeout is 2 seconds:
Packet sent with a source address of ::192:104:6:6%BGP
!!!!!
Success rate is 100 percent (5/5), round-trip min/avg/max = 6/7/12 ms
R6#ping vrf BGP ::192:104:14:14 source ::192:104:6:6
Type escape sequence to abort.
Sending 5, 100-byte ICMP Echos to ::192:104:14:14, timeout is 2 seconds:
Packet sent with a source address of ::192:104:6:6%BGP
!!!!!
Success rate is 100 percent (5/5), round-trip min/avg/max = 5/8/11 ms

Next, we will bring up the backdoor links between CSR6/XRv4 and CSR5/XRv3. We quickly check XRv3 and XRv4 to ensure they see CSR5 and CSR6, respectively, as BGP peers for both IPv4 and IPv6. These sessions were disabled in previous verifications.

RP/0/0/CPU0:XRv3#show bgp vrf BGP ipv4 unicast summary | begin Neighbor
Neighbor            Spk    AS MsgRcvd MsgSent   TblVer  InQ OutQ  Up/Down  St/PfxRcd
192.5.104.5           0   104     120      91      104    0    0 00:01:03  8
192.11.104.11         0   214    7306    7275      104    0    0 23:15:42  5
RP/0/0/CPU0:XRv3#show bgp vrf BGP ipv6 unicast summary | begin Neighbor
Neighbor            Spk    AS MsgRcvd MsgSent   TblVer  InQ OutQ  Up/Down  St/PfxRcd
192.11.104.11         0   214    7306    7275      103    0    0 23:15:46  5
fd00:192:5:104::5     0   104     122      92      103    0    0 00:01:22  8
RP/0/0/CPU0:XRv4#show bgp vrf BGP ipv4 unicast summary | begin Neighbor
Neighbor            Spk    AS MsgRcvd MsgSent   TblVer  InQ OutQ  Up/Down  St/PfxRcd
192.3.104.3           0   214    7820    7091      112    0    0 20:37:18  5
192.6.104.6           0   104     118      90      112    0    0 00:00:21  8
RP/0/0/CPU0:XRv4#show bgp vrf BGP ipv6 unicast summary | begin Neighbor
Neighbor            Spk    AS MsgRcvd MsgSent   TblVer  InQ OutQ  Up/Down  St/PfxRcd
fd00:192:3:104::3     0   214    7759    7041       71    0    0 20:37:18  5
fd00:192:6:104::6     0   104     116      88       71    0    0 00:00:17  8

SoO works similarly in BGP as it does in EIGRP. Since we are bending the eBGP loop-prevention rules, there are cases where AS 104 routes don’t have AS 104 in the AS path, and may bounce around inside AS 104 freely. They may even become best-paths, which can cause suboptimal traffic patterns or loops. The simplest SoO deployment is having the PEs identify the CEs as being in the same site. For example, assuming CSR5 and XRv3 were located together, there may never be a case where traffic between these routers should transit the WAN. To implement this, we can configure XRv2 and XRv1 to identify a common SoO for this site and apply it to the CE routers. We can apply it directly to a neighbor under a specific AFI or use an RPL for additional granularity (per-prefix matches, etc). I demonstrate both methods below.

! XRv1
router bgp 214
 vrf BGP
  neighbor 192.11.104.13
   remote-as 104
   address-family ipv4 unicast
    site-of-origin 214:513
! XRv2
route-policy RPL_FROM_CSR5
  set extcommunity soo (214:513)
end-policy
router bgp 214
 vrf BGP
  neighbor 192.12.104.5
   remote-as 104
   address-family ipv4 unicast
    route-policy RPL_FROM_CSR5 in

After applying these configurations and refreshing the BGP routes, we notice that the MPLS network cannot be used for transit between CSR5 and XRv3 anymore. This would be ideal for multi-homed scenarios where east-west traffic should be LAN-based (backdoor link only), even in failure cases. A quick check shows that the two routers are exchanging loopbacks via iBGP, which are both best-paths.

R5#show bgp vpnv4 unicast vrf BGP 192.104.13.13/32
BGP routing table entry for 214:104:192.104.13.13/32, version 10
Paths: (1 available, best #1, table BGP)
  Advertised to update-groups:
     18
  Refresh Epoch 1
  Local
    192.5.104.13 (via vrf BGP) from 192.5.104.13 (13.13.13.13)
      Origin incomplete, metric 0, localpref 100, valid, internal, best
      rx pathid: 0, tx pathid: 0x0
RP/0/0/CPU0:XRv3#show bgp vpnv4 unicast vrf BGP 192.104.5.5/32 | begin Local$
  Local
    192.5.104.5 from 192.5.104.5 (5.5.5.5)
      Origin incomplete, metric 0, localpref 100, valid, internal, best, group-best, import-candidate
      Received Path ID 0, Local Path ID 1, version 118


As a PE, XRv2 sees both paths, each of which originates from AS 104. Looking at XRv3’s loopback, it is received from XRv1 (iBGP over MPLS) and from CSR5 (eBGP from customer). Hot-potato routing shows that the eBGP path is preferred as best, so XRv2 will not advertise the route from XRv3 any further.

RP/0/0/CPU0:XRv2#show bgp vpnv4 unicast vrf BGP 192.104.13.13/32 | begin Paths
Paths: (2 available, best #1)
  Advertised to update-groups (with more than one peer):
    0.2
  Path #1: Received by speaker 0
  Advertised to update-groups (with more than one peer):
    0.2
  104
    192.12.104.5 from 192.12.104.5 (5.5.5.5)
      Origin incomplete, localpref 100, valid, external, best, group-best, import-candidate, import suspect
      Received Path ID 0, Local Path ID 1, version 1143
      Extended community: SoO:214:513 RT:214:104
  Path #2: Received by speaker 0
  Not advertised to any peer
  104, (Received from a RR-client)
    214.0.0.11 (metric 10) from 214.0.0.11 (214.0.0.11)
      Received Label 91013
      Origin incomplete, metric 0, localpref 100, valid, internal, import-candidate, imported, import suspect
      Received Path ID 0, Local Path ID 0, version 0
      Extended community: SoO:214:513 RT:214:104
      Source VRF: BGP, Source Route Distinguisher: 214:104

As a side note, it isn’t a good idea to make a PE the RR. Because XRv2 has selected the path through CSR5 as best, other nodes will route through CSR5 to reach XRv3 when transiting XRv1 would have been better. This may cause suboptimal routing depending on the bandwidth of the links, administrative policies, etc. SoO does not aim to fix this specifically.

R6#traceroute vrf BGP 192.104.13.13 source 192.104.6.6
Type escape sequence to abort.
Tracing the route to 192.104.13.13
VRF info: (vrf in name/id, vrf out name/id)
  1 192.4.104.4 14 msec 4 msec 2 msec
  2 214.4.8.8 [MPLS: Labels 8003/92031 Exp 0] 6 msec 7 msec 9 msec
  3 214.8.12.12 [MPLS: Label 92031 Exp 0] 20 msec 20 msec 20 msec
  4 192.12.104.5 20 msec 12 msec 11 msec
  5 192.5.104.13 91 msec 6 msec 6 msec

We will continue testing BGP with SoO. If we break the BGP connection between CSR5 and XRv3, the two routers will no longer be able to communicate. Now, XRv2 selects XRv1 as the best-path to XRv3’s loopback, but should not advertise it to CSR5 due to SoO restrictions. However, XRv2 is still advertising it to CSR5 despite SoO. This appears incorrect.

RP/0/0/CPU0:XRv2#show bgp vpnv4 unicast vrf BGP 192.104.13.13/32 | begin Paths
Paths: (1 available, best #1)
  Advertised to update-groups (with more than one peer):
    0.2
  Path #1: Received by speaker 0
  Advertised to update-groups (with more than one peer):
    0.2
  104, (Received from a RR-client)
    214.0.0.11 (metric 10) from 214.0.0.11 (214.0.0.11)
      Received Label 91013
      Origin incomplete, metric 0, localpref 100, valid, internal, best, group-best, import-candidate, imported
      Received Path ID 0, Local Path ID 1, version 1177
      Extended community: SoO:214:513 RT:214:104
      Source VRF: BGP, Source Route Distinguisher: 214:104
RP/0/0/CPU0:XRv2#show bgp vrf BGP ipv4 unicast neighbors 192.12.104.5 advertised-routes
Network            Next Hop        From            AS Path
Route Distinguisher: 214:104 (default for vrf BGP)
192.3.104.0/24     192.12.104.12   214.0.0.4       214 104?
192.4.104.0/24     192.12.104.12   214.0.0.4       214 104?
192.6.104.0/24     192.12.104.12   214.0.0.4       214 104?
192.11.104.0/24    192.12.104.12   214.0.0.11      214 104?
192.104.6.6/32     192.12.104.12   214.0.0.4       214 104?
192.104.13.13/32   192.12.104.12   214.0.0.11      214 104?
192.104.14.14/32   192.12.104.12   214.0.0.4       214 104?

A similar check on XRv1 yields better results. XRv1 does not advertise CSR5’s loopback to XRv3, as expected. Two-way connectivity is therefore broken, but CSR5 should not have received the routes from XRv2 in the first place.

RP/0/0/CPU0:XRv1#show bgp vpnv4 unicast vrf BGP 192.104.5.5/32
[snip]
  104
    214.0.0.12 (metric 10) from 214.0.0.12 (214.0.0.12)
      Received Label 92014
      Origin incomplete, metric 0, localpref 100, valid, internal, best, group-best, import-candidate, imported
      Received Path ID 0, Local Path ID 1, version 1248
      Extended community: SoO:214:513 RT:214:104
      Source VRF: BGP, Source Route Distinguisher: 214:104
RP/0/0/CPU0:XRv1#show bgp vrf BGP ipv4 unicast neighbors 192.11.104.13 advertised-routes
Network            Next Hop        From            AS Path
Route Distinguisher: 214:104 (default for vrf BGP)
192.3.104.0/24     192.11.104.11   214.0.0.3       214 104?
192.4.104.0/24     192.11.104.11   214.0.0.3       214 104?
192.6.104.0/24     192.11.104.11   214.0.0.3       214 104?
192.104.6.6/32     192.11.104.11   214.0.0.3       214 104?
192.104.14.14/32   192.11.104.11   214.0.0.3       214 104?

The issue with using an RPL to set SoO on ingress is that the automatic filtering that SoO provides is then lost. We would need to manually match the SoO in an outbound RPL for that peer to drop the routes. This is the trade-off of the RPL method: it offers additional granularity at the cost of being more configuration intensive. On XRv1, specifying a neighbor-level SoO automatically works bidirectionally, so XRv1’s configuration is complete.

! XRv2
route-policy RPL_TO_CSR5
  if extcommunity soo matches-every (214:513) then
    drop
  else
    pass
  endif
end-policy
router bgp 214
 vrf BGP
  address-family ipv4 unicast
  neighbor 192.12.104.5
   address-family ipv4 unicast
    route-policy RPL_TO_CSR5 out
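The difference between the two methods can be sketched as two separate steps in Python: tagging on ingress and filtering on egress. This is only a conceptual model with hypothetical helper names (`tag_inbound`, `filter_outbound`) and routes represented as dictionaries. The neighbor-level command behaves like both steps together, while an inbound RPL performs only the first and needs an outbound policy for the second.

```python
SITE_SOO = "SoO:214:513"


def tag_inbound(route, soo):
    """Attach the site's SoO to a route received from the CE."""
    tagged = dict(route)
    tagged["ext_communities"] = set(route.get("ext_communities", ())) | {soo}
    return tagged


def filter_outbound(routes, soo):
    """Drop routes that already carry the site's SoO before advertising to the CE."""
    return [r for r in routes if soo not in r.get("ext_communities", ())]


# XRv3's loopback was tagged by the PE serving the same site.
site_route = tag_inbound({"prefix": "192.104.13.13/32"}, SITE_SOO)
other_route = {"prefix": "192.104.6.6/32", "ext_communities": {"RT:214:104"}}

# Inbound RPL only: no egress filtering happens, so both routes reach CSR5.
print([r["prefix"] for r in [site_route, other_route]])
# With the egress step added, the same-site route is withheld.
print([r["prefix"] for r in filter_outbound([site_route, other_route], SITE_SOO)])
```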

After applying this new policy, XRv2 cannot advertise XRv3’s loopback to CSR5. Neither CSR5 nor XRv3 has a route to the other’s loopback, which is expected when the backdoor is down.

RP/0/0/CPU0:XRv2#show bgp vrf BGP ipv4 unicast neighbors 192.12.104.5 advertised-routes
Network            Next Hop        From            AS Path
Route Distinguisher: 214:104 (default for vrf BGP)
192.3.104.0/24     192.12.104.12   214.0.0.4       214 104?
192.4.104.0/24     192.12.104.12   214.0.0.4       214 104?
192.6.104.0/24     192.12.104.12   214.0.0.4       214 104?
192.104.6.6/32     192.12.104.12   214.0.0.4       214 104?
192.104.14.14/32   192.12.104.12   214.0.0.4       214 104?
R5#show bgp vpnv4 unicast vrf BGP 192.104.13.13/32
% Network not in table
RP/0/0/CPU0:XRv3#show bgp vrf BGP ipv4 unicast 192.104.5.5/32
% Network not in table

Notice that IPv6 connectivity is totally unaffected when the backdoor is down. SoO is enabled per AFI, unlike EIGRP, so we can have different “site-based” topologies for IPv4 and IPv6. Both CSR5 and XRv3 have IPv6 routes towards one another’s loopbacks.

R5#show bgp vpnv6 unicast vrf BGP ::192:104:13:13/128
BGP routing table entry for [214:104]::192:104:13:13/128, version 1737
Paths: (1 available, best #1, table BGP)
  Not advertised to any peer
  Refresh Epoch 1
  214 104
    FD00:192:12:104::12 (FE80::12) (via vrf BGP) from FD00:192:12:104::12 (214.0.0.12)
      Origin incomplete, localpref 100, valid, external, best
      rx pathid: 0, tx pathid: 0x0
RP/0/0/CPU0:XRv3#show bgp vrf BGP ipv6 unicast ::192:104:5:5/128 | begin Paths
Paths: (1 available, best #1)
  Not advertised to any peer
  Path #1: Received by speaker 0
  Not advertised to any peer
  214 214
    fd00:192:11:104::11 from 192.11.104.11 (214.0.0.11)
      Origin incomplete, localpref 100, valid, external, best, group-best, import-candidate
      Received Path ID 0, Local Path ID 1, version 126

We can also use an approach similar to that tested in the EIGRP section, whereby a site “guards” itself with the same SoO around the perimeter routers to prevent routes from that site from re-entering it, be it via the SP network or backdoor links. On the left side of the network, we will first begin by enabling SoO on the PE devices. CSR6 and XRv4 will be considered in different “sites” so that they can talk over MPLS or the backdoor. We will use the per-neighbor method on CSR3 and the route-map method on CSR4. We also advertise the PE-CE link into BGP so we can test the feature as we did for EIGRP.

! CSR3
router bgp 214
 address-family ipv4 vrf BGP
  network 192.3.104.0
  neighbor 192.3.104.14 soo 214:3
 address-family ipv6 vrf BGP
  network FD00:192:3:104::/64
  neighbor FD00:192:3:104::14 soo 214:3
! CSR4
route-map RM_SOO_IN permit 10
 set extcommunity soo 214:4
router bgp 214
 address-family ipv4 vrf BGP
  network 192.4.104.0
  neighbor 192.4.104.6 route-map RM_SOO_IN in
 address-family ipv6 vrf BGP
  network FD00:192:4:104::/64
  neighbor FD00:192:4:104::6 route-map RM_SOO_IN in

Unlike XR, which has no command to verify per-neighbor SoO, XE will show this under the neighbor details. As expected, when using the route-map, it will not.

R3#show bgp vpnv4 unicast vrf BGP neighbors 192.3.104.14 | include v4|Origin
    Address family IPv4 Unicast: advertised and received
 For address family: VPNv4 Unicast
  Translates address family IPv4 Unicast for VRF BGP
  neighbor Site-of-Origin is SoO:214:3
R3#show bgp vpnv6 unicast vrf BGP neighbors fd00:192:3:104::14 | include v6|Origin
    Address family IPv6 Unicast: advertised and received
 For address family: VPNv6 Unicast
  Translates address family IPv6 Unicast for VRF BGP
  neighbor Site-of-Origin is SoO:214:3

Looking at CSR3’s view of the connected PE-CE link between itself and XRv4, we can see that the SoO imposition is only partially working. The remotely learned route from the eBGP peer (XRv4) has the proper SoO applied, but the connected route originated via the network statement does not. Unlike EIGRP, SoO is being applied to routes learned from a peer, not routes recursing out of a given interface. BGP has no concept of interfaces.

R3#show bgp vpnv6 unicast vrf BGP FD00:192:3:104::/64
BGP routing table entry for [214:104]FD00:192:3:104::/64, version 84
Paths: (2 available, best #2, table BGP)
  Advertised to update-groups:
     12         11
  Refresh Epoch 1
  104
    FD00:192:3:104::14 (FE80::14) (via vrf BGP) from FD00:192:3:104::14 (14.14.14.14)
      Origin incomplete, metric 0, localpref 100, valid, external
      Extended Community: SoO:214:3 RT:214:104
      mpls labels in/out 3045/nolabel
      rx pathid: 0, tx pathid: 0
  Refresh Epoch 1
  Local
    :: (via vrf BGP) from 0.0.0.0 (214.0.0.3)
      Origin IGP, metric 0, localpref 100, weight 32768, valid, sourced, local, best
      Extended Community: RT:214:104
      mpls labels in/out 3045/nolabel(BGP)
      rx pathid: 0, tx pathid: 0x0

Applying a route-map to the BGP network statement does not work, although one would think that it would manually adjust the attributes of the prefix. Instead, we can use the interface-level command that is protocol-agnostic to solve this problem. For a quick side-demo, we will set a bogus SoO just to see if it accidentally overrides the BGP-level neighbor one for routes learned via that interface. Fortunately, it does not. We know we will have the exact same problem on CSR4, so we will correct the problem there also.

! CSR3 and CSR4
route-map RM_SOO_IN permit 10
 set extcommunity soo 214:999
interface GigabitEthernet2.104
 ip vrf sitemap RM_SOO_IN

R3#show bgp vpnv6 unicast vrf BGP FD00:192:3:104::/64
[snip]
  Local
    :: (via vrf BGP) from 0.0.0.0 (214.0.0.3)
      Origin IGP, metric 0, localpref 100, weight 32768, valid, sourced, local, best
      Extended Community: SoO:214:999 RT:214:104
      mpls labels in/out 3011/nolabel(BGP)
      rx pathid: 0, tx pathid: 0x0
  Refresh Epoch 1
  104
    FD00:192:3:104::14 (FE80::14) (via vrf BGP) from FD00:192:3:104::14 (14.14.14.14)
      Origin incomplete, metric 0, localpref 100, valid, external
      Extended Community: SoO:214:3 RT:214:104
      mpls labels in/out 3011/nolabel
      rx pathid: 0, tx pathid: 0
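The two imposition points seen in this test can be summarized with a short Python sketch. It models only what the output above shows on CSR3 and makes no claim about XE internals: the function name `impose_soo` and the `source` field are hypothetical.

```python
def impose_soo(route, neighbor_soo=None, sitemap_soo=None):
    """Pick the SoO a PE attaches, following the behavior observed on CSR3.

    Routes learned from the BGP neighbor take the neighbor-level SoO, while
    locally originated connected routes take the interface sitemap SoO. The
    sitemap value does not override the neighbor-level one.
    """
    if route["source"] == "neighbor":
        return neighbor_soo
    if route["source"] == "connected":
        return sitemap_soo
    return None


learned = {"prefix": "FD00:192:3:104::/64", "source": "neighbor"}
local = {"prefix": "FD00:192:3:104::/64", "source": "connected"}

print(impose_soo(learned, neighbor_soo="214:3", sitemap_soo="214:999"))  # 214:3
print(impose_soo(local, neighbor_soo="214:3", sitemap_soo="214:999"))    # 214:999
print(impose_soo(local, neighbor_soo="214:3"))                           # None
```

The last call corresponds to the earlier state, before the sitemap was applied, where the connected route carried no SoO at all.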

We will change the SoO back to 214:3 as expected (not shown) and verify it quickly. Now, the PE-CE link and the eBGP-learned route have the proper SoO.

R3#show bgp vpnv6 unicast vrf BGP FD00:192:3:104::/64 | include from|SoO
    :: (via vrf BGP) from 0.0.0.0 (214.0.0.3)
      Extended Community: SoO:214:3 RT:214:104
    FD00:192:3:104::14 (FE80::14) (via vrf BGP) from FD00:192:3:104::14 (14.14.14.14)
      Extended Community: SoO:214:3 RT:214:104

Since we know the locally-originated route is the best-path, CSR3 can advertise this to CSR4. It carries the SoO attribute. CSR4 learns 3 routes: one directly from CSR3 (RR), another reflected from XRv2 (RR), and one from the eBGP peer CSR6. The best-path is the one received directly from CSR3. The BGP RID of 214.0.0.3 is compared against the originator ID of the same value on the path reflected by XRv2, which results in a tie; the direct path then wins on the final tiebreakers, since it has no cluster list and its peer address 214.0.0.3 is less than 214.0.0.12. The AS path length is 0 for both internal paths, which immediately disqualifies the eBGP path with its longer AS path.

R4#show bgp vpnv6 unicast vrf BGP fd00:192:3:104::/64
BGP routing table entry for [214:104]FD00:192:3:104::/64, version 2317
BGP Bestpath: cost-community-ignore
Paths: (3 available, best #2, table BGP)
  Advertised to update-groups:
     5
  Refresh Epoch 1
  Local
    ::FFFF:214.0.0.3 (metric 10) (via default) from 214.0.0.12 (214.0.0.12)
      Origin IGP, metric 0, localpref 100, valid, internal
      Extended Community: SoO:214:3 RT:214:104
      Originator: 214.0.0.3, Cluster list: 0.0.3.12
      mpls labels in/out nolabel/3011
      rx pathid: 0, tx pathid: 0
  Refresh Epoch 3
  Local
    ::FFFF:214.0.0.3 (metric 10) (via default) from 214.0.0.3 (214.0.0.3)
      Origin IGP, metric 0, localpref 100, valid, internal, best
      Extended Community: SoO:214:3 RT:214:104
      mpls labels in/out nolabel/3011
      rx pathid: 0, tx pathid: 0x0
  Refresh Epoch 2
  104
    FD00:192:4:104::6 (FE80::6) (via vrf BGP) from FD00:192:4:104::6 (6.6.6.6)
      Origin incomplete, localpref 100, valid, external
      Extended Community: SoO:214:4 RT:214:104
      rx pathid: 0, tx pathid: 0
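The selection among these three paths can be reproduced with a simplified Python comparison. It is a sketch of a subset of the best-path steps only: weight, local preference, and MED are omitted because they do not separate the paths here, and the dictionary field names are hypothetical.

```python
import ipaddress

ORIGIN_RANK = {"igp": 0, "egp": 1, "incomplete": 2}


def sort_key(path):
    """Lower tuples are preferred; fields follow the usual tiebreak order."""
    return (
        len(path["as_path"]),                        # shortest AS path
        ORIGIN_RANK[path["origin"]],                 # lowest origin code
        0 if path["ebgp"] else 1,                    # eBGP over iBGP
        path["igp_metric"],                          # lowest IGP metric to next hop
        int(ipaddress.ip_address(path["rid"])),      # lowest RID / originator ID
        path["cluster_len"],                         # shortest cluster list
        int(ipaddress.ip_address(path["peer"])),     # lowest peer address
    )


paths = [
    {"name": "reflected by XRv2", "as_path": [], "origin": "igp", "ebgp": False,
     "igp_metric": 10, "rid": "214.0.0.3", "cluster_len": 1, "peer": "214.0.0.12"},
    {"name": "direct from CSR3", "as_path": [], "origin": "igp", "ebgp": False,
     "igp_metric": 10, "rid": "214.0.0.3", "cluster_len": 0, "peer": "214.0.0.3"},
    {"name": "eBGP from CSR6", "as_path": [104], "origin": "incomplete", "ebgp": True,
     "igp_metric": 0, "rid": "6.6.6.6", "cluster_len": 0, "peer": "FD00:192:4:104::6"},
]

best = min(paths, key=sort_key)
print(best["name"])  # direct from CSR3
```

The eBGP path drops out at the first field, and the two internal paths stay tied until the cluster list and peer address are compared.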

CSR4 then advertises this to CSR6, which should ideally see SoO:214:3. CSR6 doesn’t see any SoO information on the route received from CSR4. Recall that XE does not send communities by default, and XR does not send communities to eBGP neighbors by default.

R6#show bgp vpnv6 unicast vrf BGP fd00:192:3:104::/64
BGP routing table entry for [214:104]FD00:192:3:104::/64, version 1587
Paths: (2 available, best #2, table BGP)
  Advertised to update-groups:
     1
  Refresh Epoch 1
  214
    FD00:192:4:104::4 (FE80::4) (via vrf BGP) from FD00:192:4:104::4 (214.0.0.4)
      Origin IGP, localpref 100, valid, external
      rx pathid: 0, tx pathid: 0
  Refresh Epoch 1
  Local
    FD00:192:6:104::14 (via vrf BGP) from FD00:192:6:104::14 (14.14.14.14)
      Origin incomplete, metric 0, localpref 100, valid, internal, best
      rx pathid: 0, tx pathid: 0x0

We will quickly enable the sending of extended communities to eBGP peers on CSR3 and CSR4.

! CSR3
router bgp 214
 address-family ipv6 vrf BGP
  neighbor FD00:192:3:104::14 send-community extended
! CSR4
router bgp 214
 address-family ipv6 vrf BGP
  neighbor FD00:192:4:104::6 send-community extended

Now, we can see the SoO on CSR6 as expected. We also check the PE-CE link between CSR4/CSR6 on XRv4; it can see SoO:214:4 as imposed by CSR4. These connected-route impositions required the interface-level site-map command since the BGP-level SoO configuration does not affect locally-originated routes.

R6#show bgp vpnv6 unicast vrf BGP fd00:192:3:104::/64
BGP routing table entry for [214:104]FD00:192:3:104::/64, version 1587
Paths: (2 available, best #2, table BGP)
  Advertised to update-groups:
     1
  Refresh Epoch 1
  214
    FD00:192:4:104::4 (FE80::4) (via vrf BGP) from FD00:192:4:104::4 (214.0.0.4)
      Origin IGP, localpref 100, valid, external
      Extended Community: SoO:214:3
      rx pathid: 0, tx pathid: 0
  Refresh Epoch 1
  Local
    FD00:192:6:104::14 (via vrf BGP) from FD00:192:6:104::14 (14.14.14.14)
      Origin incomplete, metric 0, localpref 100, valid, internal, best
      rx pathid: 0, tx pathid: 0x0


RP/0/0/CPU0:XRv4#show bgp vrf BGP ipv6 unicast fd00:192:4:104::/64 | begin Paths Paths: (2 available, best #2) Advertised to CE peers (in unique update groups): fd00:192:3:104::3 Path #1: Received by speaker 0 Not advertised to any peer 214 fd00:192:3:104::3 from fd00:192:3:104::3 (214.0.0.3) Origin IGP, localpref 100, valid, external, group-best, import suspect Received Path ID 0, Local Path ID 0, version 0 Extended community: SoO:214:4 Path #2: Received by speaker 0 Advertised to CE peers (in unique update groups): fd00:192:3:104::3 Local fd00:192:6:104::6 from fd00:192:6:104::6 (6.6.6.6) Origin incomplete, metric 0, localpref 100, valid, internal, best, group-best, import-candidate, import suspect Received Path ID 0, Local Path ID 1, version 71

Because those connected routes were technically originated in AS 214, the AS path check will pass without needing “allowas-in” or “as-override”. Shutting down CSR6’s PE-CE link proves that CSR6 can still learn that connected route with SoO:214:4 from XRv4. Normally, this is the situation we are trying to prevent, since CSR4 is effectively learning a route from its local “site”. R6#show bgp vpnv6 unicast vrf BGP fd00:192:4:104::/64 BGP routing table entry for [214:104]FD00:192:4:104::/64, version 1656 Paths: (1 available, best #1, table BGP) Not advertised to any peer Refresh Epoch 1 214 FD00:192:6:104::14 (via vrf BGP) from FD00:192:6:104::14 (14.14.14.14) Origin IGP, localpref 100, valid, internal, best Extended Community: SoO:214:4 rx pathid: 0, tx pathid: 0x0
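The comparison that SoO is meant to provide in this scenario can be sketched in a few lines of Python. This is a minimal model only, assuming the commonly described behavior in which a route is blocked when its SoO equals the SoO configured for the neighbor; the function name and its arguments are hypothetical and do not correspond to any router API.

```python
from typing import Optional


def soo_permits(route_soo: Optional[str], neighbor_soo: Optional[str]) -> bool:
    """Return True if a route may be exchanged with a neighbor.

    Assumption: the route is blocked only when both values are set and equal.
    """
    if route_soo is None or neighbor_soo is None:
        return True
    return route_soo != neighbor_soo


# The CSR4/CSR6 link route carries SoO:214:4 (imposed by CSR4).
print(soo_permits("214:4", "214:4"))  # same site
print(soo_permits("214:3", "214:4"))  # different site
print(soo_permits(None, "214:4"))     # no SoO attached
```

Whether a given platform actually performs this comparison on a particular peering type is a separate question, which is why it has to be tested rather than assumed.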

Next, we need to configure the SoO on the backdoor peerings for IPv6 at a minimum, as well as ensure extended-communities are being exchanged more liberally since SoO is being actively used by PE and CE routers. CSR6 will identify as being in site 4 while XRv4 will be in site 3. XR automatically sends communities inside of iBGP, so it only needs to be explicitly enabled with the eBGP PE. ! XRv4 router bgp 214 vrf BGP neighbor fd00:192:3:104::3


remote-as 214 address-family ipv6 unicast send-extended-community-ebgp neighbor fd00:192:6:104::6 remote-as 104 address-family ipv6 unicast site-of-origin 214:3 ! CSR6 router bgp 104 address-family ipv6 vrf BGP neighbor FD00:192:4:104::4 send-community extended neighbor FD00:192:6:104::14 send-community extended neighbor FD00:192:6:104::14 soo 214:4

XR doesn’t allow this to be committed and claims the SoO is an eBGP-only configuration. However, XE takes the code, so we will attempt it. According to this, the third SoO technique is not supported on XR. ! XRv4 neighbor fd00:192:6:104::6 address-family ipv6 unicast site-of-origin 214:3 !!% Change would result in internal neighbor (fd00:192:6:104::6 VRF BGP) with external-only config

We verify that the configuration was successful on CSR6. Despite multiple BGP hard resets, CSR6 still learns the route regardless of whether the BGP SoO is set or not. XR warned us that this wasn’t supported, but XE seems to accept the configuration and gives no indication of a problem, either in show or debug commands. This suggests that the “perimeter” technique is not supported on either platform and is likely specific to EIGRP deployments. R6#show bgp vpnv6 unicast vrf BGP neighbors FD00:192:6:104::14 | include v6|Origin Address family IPv6 Unicast: advertised and received For address family: VPNv6 Unicast Translates address family IPv6 Unicast for VRF BGP neighbor Site-of-Origin is SoO:214:4 R6#show bgp vpnv6 unicast vrf BGP fd00:192:4:104::/64 BGP routing table entry for [214:104]FD00:192:4:104::/64, version 7 Paths: (1 available, best #1, table BGP) Not advertised to any peer Refresh Epoch 1 214 FD00:192:6:104::14 (via vrf BGP) from FD00:192:6:104::14 (14.14.14.14) Origin IGP, localpref 100, valid, internal, best Extended Community: SoO:214:4


rx pathid: 0, tx pathid: 0x0

In summary, it appears that BGP SoO is best used in the strict sense of identifying sites so that intra-site traffic does not transit the SP network. Usage of the site-map command is very rare with BGP and was used for demonstration purposes only, however using it in this way does speak to its flexibility. There is another handy use for SoO which can be used to reduce suboptimal routing. Considering, at present, routes learned via CSR3 have SoO:214:3 and routes via CSR4 have SoO:214:4, we have a “tag” on routes based on which PE they were learned from. When all links in the network are up, CSR4 learns routes behind the remote CE (the one to which it is not connected) via eBGP. This means that traffic routed to CSR4 destined for networks behind XRv4 will transit the (possibly) slow backdoor link. It may be desirable for CSR4 to prefer the iBGP route via CSR3 because that PE-CE link (CSR3/XRv4) is much faster, where CSR4 is just a backup. Following this scenario, we can adjust the best-path selection algorithm based on prefixes to achieve this. Matching on SoO is a simple and dynamic way, especially if it is already being used for loop control in other parts of the network. R4#show bgp vpnv6 unicast vrf BGP ::192:104:14:14/128 BGP routing table entry for [214:104]::192:104:14:14/128, version 2350 BGP Bestpath: cost-community-ignore Paths: (2 available, best #1, table BGP) Advertised to update-groups: 1 Refresh Epoch 1 104 FD00:192:4:104::6 (FE80::6) (via vrf BGP) from FD00:192:4:104::6 (6.6.6.6) Origin incomplete, localpref 100, valid, external, best Extended Community: SoO:214:4 RT:214:104 mpls labels in/out 4055/nolabel rx pathid: 0, tx pathid: 0x0 Refresh Epoch 5 104 ::FFFF:214.0.0.3 (metric 10) (via default) from 214.0.0.3 (214.0.0.3) Origin incomplete, metric 0, localpref 100, valid, internal Extended Community: SoO:214:3 RT:214:104 mpls labels in/out 4055/3009 rx pathid: 0, tx pathid: 0

Since the BGP peer over iBGP is VPNv6, and the eBGP peer is IPv6 unicast inside of a VRF, we cannot just apply a route-map to the VPNv6 peer if other VRFs are using the same SoO. For example, if SoO:214:3 were also being used by VRF EIGRP, the route-map could have unintended effects. In this case, we are safe, but it is a future consideration. The route-map effectively says “routes from site 3 should be preferred via CSR3”, which is a very logical decision.

! CSR4
ip extcommunity-list standard EXTCOML_SOO_3 permit soo 214:3
route-map RM_BGP_CSR3_IN permit 10
 match extcommunity EXTCOML_SOO_3
 set local-preference 200
route-map RM_BGP_CSR3_IN permit 1000
router bgp 214
 address-family vpnv6
  neighbor 214.0.0.3 route-map RM_BGP_CSR3_IN in

After a soft reset, we verify the local preference update was successful. Now CSR4 prefers the path over MPLS; we verify with traceroute. R4#show bgp vpnv6 unicast vrf BGP ::192:104:14:14/128 [snip] 104 ::FFFF:214.0.0.3 (metric 10) (via default) from 214.0.0.12 (214.0.0.12) Origin incomplete, metric 0, localpref 100, valid, internal Extended Community: SoO:214:3 RT:214:104 Originator: 214.0.0.3, Cluster list: 0.0.3.12 mpls labels in/out nolabel/3009 rx pathid: 0, tx pathid: 0 Refresh Epoch 1 104 FD00:192:4:104::6 (FE80::6) (via vrf BGP) from FD00:192:4:104::6 (6.6.6.6) Origin incomplete, localpref 100, valid, external Extended Community: SoO:214:4 RT:214:104 rx pathid: 0, tx pathid: 0 Refresh Epoch 6 104 ::FFFF:214.0.0.3 (metric 10) (via default) from 214.0.0.3 (214.0.0.3) Origin incomplete, metric 0, localpref 200, valid, internal, best Extended Community: SoO:214:3 RT:214:104 mpls labels in/out nolabel/3009 rx pathid: 0, tx pathid: 0x0 R4#traceroute vrf BGP ipv6 ::192:104:14:14 Type escape sequence to abort. Tracing the route to ::192:104:14:14 1 FD00:192:3:104::3 [MPLS: Label 3009 Exp 0] 4 msec 2 msec 3 msec 2 ::192:104:14:14 [AS 104] 4 msec 4 msec 3 msec
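The effect of the inbound policy on path selection can be illustrated with a small Python model. It is a sketch under stated assumptions: only the local-preference step of best-path selection is modeled, and the `Path` class and both function names are hypothetical helpers invented for this illustration. The next hops and communities are taken from the output above.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class Path:
    next_hop: str
    local_pref: int = 100
    ext_communities: frozenset = frozenset()


def rm_bgp_csr3_in(path: Path) -> Path:
    """Model of RM_BGP_CSR3_IN: sequence 10 matches SoO:214:3 and sets
    local preference 200; sequence 1000 permits everything else unchanged."""
    if "SoO:214:3" in path.ext_communities:
        return replace(path, local_pref=200)
    return path


def best_by_local_pref(paths):
    # Simplification: highest local preference wins; later tie-breakers
    # (AS path length, eBGP over iBGP, and so on) are not modeled.
    return max(paths, key=lambda p: p.local_pref)


ebgp = Path("FD00:192:4:104::6", 100, frozenset({"SoO:214:4", "RT:214:104"}))
ibgp = rm_bgp_csr3_in(
    Path("::FFFF:214.0.0.3", 100, frozenset({"SoO:214:3", "RT:214:104"}))
)
best = best_by_local_pref([ebgp, ibgp])
print(best.next_hop, best.local_pref)
```

Because local preference is evaluated before the eBGP-over-iBGP comparison, raising it to 200 on the iBGP path is enough to override the eBGP path that previously won.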

The downside of this approach is that now CSR4 will prefer CSR3 for all traffic going into AS 104. Adding some kind of non-overlapping attributes (possibly prefixes, RTs, other custom communities, etc.) to the route-map match clauses can add granularity to this routing enhancement. Applying this policy everywhere would make sense, but I demonstrate it only on CSR4 for brevity.

R4#show bgp vpnv6 unicast vrf BGP ::192:104:6:6/128 [snip] 104 ::FFFF:214.0.0.3 (metric 10) (via default) from 214.0.0.3 (214.0.0.3) Origin incomplete, metric 0, localpref 200, valid, internal, best Extended Community: SoO:214:3 RT:214:104 mpls labels in/out nolabel/3016 rx pathid: 0, tx pathid: 0x0 R4#traceroute vrf BGP ipv6 ::192:104:6:6 Type escape sequence to abort. Tracing the route to ::192:104:6:6 1 FD00:192:3:104::3 [MPLS: Label 3016 Exp 0] 5 msec 4 msec 2 msec 2 FD00:192:3:104::14 4 msec 4 msec 3 msec 3 FD00:192:6:104::6 [AS 104] 5 msec 5 msec 5 msec

The reachability script is below. The only failure will be between CSR5 and XRv3 since the backdoor is down and their PEs assert that those routers belong to the same “site”. This was to demonstrate SoO specifically. Bringing it up yields full reachability. ! CSR5 and CSR6 tclsh foreach x { 192.104.5.5 192.104.6.6 192.104.13.13 192.104.14.14 ::192:104:5:5 ::192:104:6:6 ::192:104:13:13 ::192:104:14:14 } { ping vrf BGP $x source loopback104 repeat 3 timeout 1 }

38.1.6 Static routing

Static routing is another possible way to achieve PE-CE routing. Most service providers will offer either BGP or static routing as options due to their widespread support and simplicity. Looking at CSR4 and XRv2, all we have to do is add a VRF-aware static route and redistribute it into BGP per AFI.

! CSR4
ip route vrf STATIC 192.105.6.6 255.255.255.255 Gig2.105 192.4.105.6
ipv6 route vrf STATIC ::192:105:6:6/128 Gig2.105 FD00:192:4:105::6
router bgp 214
 address-family ipv4 vrf STATIC
  redistribute static
 address-family ipv6 vrf STATIC
  redistribute static
! XRv2
router static
 vrf STATIC
  address-family ipv4 unicast
   192.105.5.5/32 GigabitEthernet0/0/0/0.105 192.12.105.5
  address-family ipv6 unicast
   ::192:105:5:5/128 GigabitEthernet0/0/0/0.105 fd00:192:12:105::5
router bgp 214
 vrf STATIC
  rd 214:105
  address-family ipv4 unicast
   redistribute static
  address-family ipv6 unicast
   redistribute static

Using some lesser-known static routing show commands, we can verify the static routes inside of the VRF, per AFI.

RP/0/0/CPU0:XRv2#show static vrf STATIC ipv4 unicast topology
VRF: STATIC Table Id: 0xe0000018 AFI: IPv4 SAFI: Unicast
Prefix/Len          Interface                    Nexthop              Object  Metrics
192.105.5.5/32      GigabitEthernet0_0_0_0.105   192.12.105.5         None    [0/0/1/0]

RP/0/0/CPU0:XRv2#show static vrf STATIC ipv6 unicast topology
VRF: STATIC Table Id: 0xe0800018 AFI: IPv6 SAFI: Unicast
Prefix/Len          Interface                    Nexthop              Object  Metrics
::192:105:5:5/128   GigabitEthernet0_0_0_0.105   fd00:192:12:105::5   None    [0/0/1/0]

R4#show ip static route vrf STATIC
Codes: M - Manual static, A - AAA download, N - IP NAT, D - DHCP,
[snip]
Codes in []: A - active, N - non-active, B - BFD-tracked, D - Not Tracked, P - permanent
Static local RIB for STATIC
M  192.105.6.6/32 [1/0] via GigabitEthernet2.105 192.4.105.6 [A]
R4#show ipv6 static vrf STATIC
IPv6 Static routes Table - STATIC
Codes: * - installed in RIB, u/m - Unicast/Multicast only
[snip]
*   ::192:105:6:6/128 via FD00:192:4:105::6, GigabitEthernet2.105, distance 1

A quick check on XRv2 shows all of the remote static routes inside of BGP with the proper VPNv4 next hops for both AFIs. Based on this, we can assume all of the static-to-BGP redistribution was successful on all 4 PEs within VRF STATIC.

RP/0/0/CPU0:XRv2#show bgp vpnv4 unicast vrf STATIC | begin Network
   Network              Next Hop        Metric LocPrf Weight Path
Route Distinguisher: 214:105 (default for vrf STATIC)
*> 192.105.5.5/32       192.12.105.5         0         32768 ?
*>i192.105.6.6/32       214.0.0.4            0    100      0 ?
*>i192.105.13.13/32     214.0.0.11           0    100      0 ?
*>i192.105.14.14/32     214.0.0.3            0    100      0 ?

RP/0/0/CPU0:XRv2#show bgp vpnv6 unicast vrf STATIC | begin Network
   Network                Next Hop             Metric LocPrf Weight Path
Route Distinguisher: 214:105 (default for vrf STATIC)
*> ::192:105:5:5/128      fd00:192:12:105::5        0         32768 ?
*>i::192:105:6:6/128      214.0.0.4                 0    100      0 ?
*>i::192:105:13:13/128    214.0.0.11                0    100      0 ?
*>i::192:105:14:14/128    214.0.0.3                 0    100      0 ?

We also need to ensure the CE routers have static routes pointing towards the PE. A simple default for IPv4 and IPv6 would work, but we will use a coarse aggregate for variety. Only configurations on XRv3 and CSR5 are shown for brevity. Assuming this MPLS network was also the mechanism by which CEs reached the Internet, a default route is most appropriate. If not, a coarse aggregate encompassing all of the WAN subnets would work well, with a default pointing elsewhere towards the Internet.

! CSR5
ip route vrf STATIC 192.0.0.0 255.0.0.0 GigabitEthernet2.105 192.12.105.12
ipv6 route vrf STATIC ::192:0:0:0/80 GigabitEthernet2.105 FD00:192:12:105::12
! XRv3
router static
 vrf STATIC
  address-family ipv4 unicast
   192.0.0.0/8 GigabitEthernet0/0/0/0.105 192.11.105.11
  address-family ipv6 unicast
   ::192:0:0:0/80 GigabitEthernet0/0/0/0.105 fd00:192:11:105::11
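As a quick sanity check on the aggregates, the following Python sketch uses the standard `ipaddress` module to confirm that the 192.0.0.0/8 and ::192:0:0:0/80 statics cover the four loopbacks in VRF STATIC. The helper name `covered` is hypothetical; the prefixes and addresses come from the configurations in this section.

```python
import ipaddress

# Coarse aggregates configured on the CE routers.
aggregates = [
    ipaddress.ip_network("192.0.0.0/8"),
    ipaddress.ip_network("::192:0:0:0/80"),
]

# Loopbacks of the four sites in VRF STATIC, both address families.
loopbacks = [
    "192.105.5.5", "192.105.6.6", "192.105.13.13", "192.105.14.14",
    "::192:105:5:5", "::192:105:6:6", "::192:105:13:13", "::192:105:14:14",
]


def covered(addr: str) -> bool:
    """True if addr falls inside an aggregate of the same address family."""
    ip = ipaddress.ip_address(addr)
    return any(ip.version == net.version and ip in net for net in aggregates)


for addr in loopbacks:
    print(addr, covered(addr))
```

An address outside the aggregates, such as 10.0.0.1, would not be covered and would need a default route or another static.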

There isn’t much more to test at this point, so we will verify reachability before continuing. Running this script from the XE CEs is a quick way to ensure all 4 sites in the STATIC VPN can communicate.

! CSR5 and CSR6
tclsh
foreach x {
192.105.5.5
192.105.6.6
192.105.13.13
192.105.14.14
::192:105:5:5
::192:105:6:6
::192:105:13:13
::192:105:14:14
} { ping vrf STATIC $x source loopback105 repeat 3 timeout 1 }

As a quick note, SoO can also apply to static routes provided they point traffic out of the interface to which the SoO is applied. XR does not support this, but XE does. This would only be useful if BGP was used somewhere in the VPN, as it was for IS-IS, which is where the filtering would happen. Even if we use the same SoO on CSR3 and CSR4, CSR6 and XRv4 will still be able to talk to one another. BGP will import the routes into the proper table, but since static routing is used, there is no possible way for the SoO to be “advertised” into the customer domain. ! CSR3 and CSR4 route-map RM_STATIC_SOO permit 10 set extcommunity soo 214:105 interface GigabitEthernet2.105 ip vrf sitemap RM_STATIC_SOO

We verify that SoO:214:105 has been applied by both CSR3 and CSR4 on their locally redistributed static routes. The information is exchanged, but no filtering occurs since there is nothing being advertised to the CE routers, as described above. Without a PE-CE protocol, the SoO could be used as a route-tag as seen in the BGP section for best-path adjustment, but in that case, using ordinary communities would be more appropriate. R3#show bgp vpnv6 unicast vrf STATIC ::192:105:6:6/128 BGP routing table entry for [214:105]::192:105:6:6/128, version 411 Paths: (1 available, best #1, table STATIC) Advertised to update-groups: 13 Refresh Epoch 7 Local, (Received from a RR-client) ::FFFF:214.0.0.4 (metric 10) (via default) from 214.0.0.4 (214.0.0.4) Origin incomplete, metric 0, localpref 100, valid, internal, best Extended Community: SoO:214:105 RT:214:105 mpls labels in/out nolabel/4027 rx pathid: 0, tx pathid: 0x0 R3#show bgp vpnv6 unicast vrf STATIC ::192:105:14:14/128 BGP routing table entry for [214:105]::192:105:14:14/128, version 410 Paths: (1 available, best #1, table STATIC) Advertised to update-groups: 13 Refresh Epoch 1 Local FD00:192:3:105::14 (via vrf STATIC) from 0.0.0.0 (214.0.0.3) Origin incomplete, metric 0, localpref 100, weight 32768, valid, sourced, best


Extended Community: SoO:214:105 RT:214:105 mpls labels in/out 3005/nolabel rx pathid: 0, tx pathid: 0x0 R6#ping vrf STATIC ::192:105:14:14 source ::192:105:6:6 Type escape sequence to abort. Sending 5, 100-byte ICMP Echos to ::192:105:14:14, timeout is 2 seconds: Packet sent with a source address of ::192:105:6:6%STATIC !!!!! Success rate is 100 percent (5/5), round-trip min/avg/max = 3/6/16 ms

38.1.7 RIP

RIP is a very uncommon PE-CE option but it is supported in XE and XR. XR does not support RIPng (IPv6) at all, and because of the high density of XR in the network, RIPng is not tested. Below are the PE configurations for RIPv2 on an XE and XR PE. When redistributing from BGP, we can set the metric to transparent which allows RIP to copy the BGP MED value when importing routes from BGP. We won’t need this command in XR as it is automatic so long as we don’t specify a hard metric.

! CSR4
router rip
 address-family ipv4 vrf RIP
  redistribute bgp 214 metric transparent
  network 192.4.106.0
  no auto-summary
  version 2
! XRv2
router rip
 vrf RIP
  interface GigabitEthernet0/0/0/0.106
  redistribute bgp 214

The CE configurations are nearly identical, except there is no BGP redistribution and RIP is enabled on the loopbacks/backdoors. The backdoors are currently shut down.

! CSR6
router rip
 address-family ipv4 vrf RIP
  network 192.4.106.0
  network 192.6.106.0
  network 192.106.6.0
  no auto-summary
  version 2
! XRv3
router rip
 vrf RIP
  interface Loopback106
   passive-interface
  interface GigabitEthernet0/0/0/0.106
  interface GigabitEthernet0/0/0/0.206

Checking XRv2, we can see all of the RIP loopbacks in BGP VPNv4, which means the RIP advertisement from CE to PE, and the redistribution from RIP, were successful. All of the PE-CE transit links are included by default as well.

RP/0/0/CPU0:XRv2#show bgp vpnv4 unicast vrf RIP | begin Network
   Network              Next Hop        Metric LocPrf Weight Path
Route Distinguisher: 214:106 (default for vrf RIP)
*>i192.3.106.0/24       214.0.0.3            0    100      0 ?
*>i192.4.106.0/24       214.0.0.4            0    100      0 ?
*>i192.11.106.0/24      214.0.0.11           0    100      0 ?
*> 192.12.106.0/24      0.0.0.0              0         32768 ?
*> 192.106.5.5/32       192.12.106.5         1         32768 ?
*>i192.106.6.6/32       214.0.0.4            1    100      0 ?
*>i192.106.13.13/32     214.0.0.11           1    100      0 ?
*>i192.106.14.14/32     214.0.0.3            1    100      0 ?

Although unrelated to RIP, we will perform some cleanup by removing all of these connected routes. This is a good practice for EIGRP and OSPF also. Doing so is simple; a deny-all route-map/RPL applied to connected redistribution will take explicit control of the connected routes and deny them. Only true RIP-learned routes will be permitted since that redistribution is unfiltered. This solution scales better than a “white-list” prefix-filter under RIP since any RIP-learned prefix is now allowed by default.

! CSR3 and CSR4
route-map RM_DENY_ALL deny 10
router bgp 214
 address-family ipv4 vrf RIP
  redistribute connected route-map RM_DENY_ALL
  redistribute rip
! XRv1 and XRv2
route-policy RPL_DENY_ALL
 drop
end-policy
router bgp 214
 vrf RIP
  address-family ipv4 unicast
   redistribute connected route-policy RPL_DENY_ALL
   redistribute rip
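The logic of the cleanup can be sketched in Python. This is a simplified model, not a description of how either operating system implements redistribution: each source protocol is paired with a policy function, and a route enters BGP only if its source has a policy that returns True. All function names are hypothetical; the two prefixes are taken from the XRv2 output above.

```python
def deny_all(prefix: str) -> bool:
    """Model of RM_DENY_ALL / RPL_DENY_ALL: nothing is permitted."""
    return False


def permit_all(prefix: str) -> bool:
    """Model of unfiltered redistribution."""
    return True


def redistribute_into_bgp(rib, policies):
    """rib is a list of (prefix, source) pairs; policies maps a source
    protocol to a policy function. Sources without a policy are ignored."""
    return [
        prefix
        for prefix, source in rib
        if source in policies and policies[source](prefix)
    ]


rib = [("192.12.106.0/24", "connected"), ("192.106.5.5/32", "rip")]
policies = {"connected": deny_all, "rip": permit_all}
print(redistribute_into_bgp(rib, policies))
```

Any new RIP-learned prefix passes without a policy change, which is the scaling advantage over maintaining an explicit permit list.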

Checking XRv2 again, all we see are the loopbacks without the PE-CE transit links, as expected.

RP/0/0/CPU0:XRv2#show bgp vpnv4 unicast vrf RIP | begin Network
   Network              Next Hop        Metric LocPrf Weight Path
Route Distinguisher: 214:106 (default for vrf RIP)
*> 192.106.5.5/32       192.12.106.5         1         32768 ?
*>i192.106.6.6/32       214.0.0.4            1    100      0 ?
*>i192.106.13.13/32     214.0.0.11           1    100      0 ?
*>i192.106.14.14/32     214.0.0.3            1    100      0 ?

For additional complexity, we will open backdoor connections between XRv4/CSR6 and CSR5/XRv3. Because RIP is hop-count based, the CEs will prefer this backdoor over the MPLS network. Also note that our fancy “redistribute connected” filter is no longer effective since transit links can be learned via RIP now; we will ignore this as it is outside the scope of this test (we could easily filter it from RIP updates). We will pretend that CSR6 and XRv4 actually are in the same site and should never reach one another via MPLS. SoO seems like it might work here, so we will test it. We will re-use some old route-maps that have the same SoO from older testing. ! CSR3 and CSR4 interface GigabitEthernet2.106 ip vrf sitemap RM_STATIC_SOO

Checking CSR3, it has a locally-originated route (redistributed from RIP) and an iBGP learned one from CSR4. As expected, both of them have the same SoO. R3#show bgp vpnv4 unicast vrf RIP 192.106.14.14/32 BGP routing table entry for 214:106:192.106.14.14/32, version 443 Paths: (2 available, best #2, table RIP) Advertised to update-groups: 14 Refresh Epoch 2 Local, (Received from a RR-client) 214.0.0.4 (metric 10) (via default) from 214.0.0.4 (214.0.0.4) Origin incomplete, metric 2, localpref 100, valid, internal Extended Community: SoO:214:105 RT:214:106 mpls labels in/out 3065/4025 rx pathid: 0, tx pathid: 0 Refresh Epoch 1 Local 192.3.106.14 (via vrf RIP) from 0.0.0.0 (214.0.0.3) Origin incomplete, metric 1, localpref 100, weight 32768, valid, sourced, best Extended Community: SoO:214:105 RT:214:106 mpls labels in/out 3065/nolabel rx pathid: 0, tx pathid: 0x0


To speed up the convergence process, we can shut down the backdoor link on both CSR6 and XRv4 concurrently. SoO actually does work for RIP filtering; CSR6 and XRv4 do not have routes for one another’s loopback anymore, despite the PEs having the routes in their local VPN tables. Debugging RIP on CSR3, we do not see CSR6’s loopback being advertised to XRv4 at all. CSR3 did properly redistribute this route from BGP into RIP, as the database shows. R3#show ip rip database vrf RIP 192.106.6.6 255.255.255.255 192.106.6.6/32 redistributed [2] via 214.0.0.4, ! CSR3 RIP: sending v2 update to 224.0.0.9 via GigabitEthernet2.106 (192.3.106.3) RIP: build update entries 192.5.106.0/24 via 0.0.0.0, metric 2, tag 0 192.11.106.0/24 via 0.0.0.0, metric 3, tag 0 192.12.106.0/24 via 0.0.0.0, metric 3, tag 0 192.106.5.5/32 via 0.0.0.0, metric 2, tag 0 192.106.13.13/32 via 0.0.0.0, metric 2, tag 0 RP/0/0/CPU0:XRv4#show route vrf RIP 192.106.6.6 % Network not in table

CSR5 and XRv3 will be in “different” sites but should prefer MPLS over the backdoor. We can use an offset list on the CE router backdoor interfaces to make this link less preferable. For example, CSR5 is currently preferring the backdoor link to reach XRv3’s loopback. R5#show ip route vrf RIP 192.106.13.13 Routing Table: RIP Routing entry for 192.106.13.13/32 Known via "rip", distance 120, metric 1 Redistributing via rip Last update from 192.5.106.13 on GigabitEthernet2.206, 00:00:19 ago Routing Descriptor Blocks: * 192.5.106.13, from 192.5.106.13, 00:00:19 ago, via GigabitEthernet2.206 Route metric is 1, traffic share count is 1

I am surprised to see that XR even has a site-of-origin command for RIP. We will use another traditional multi-site SoO design on these routers where each PE router has a different SoO. This means that a routing update loop will stop at the originating PE if the route manages to come “full circle”, but this assumes RIP can somehow carry SoO (which it does not appear to). There is no backdoor filtering. This design would be more relevant for EIGRP than anything else since EIGRP can carry SoO. ! CSR5 router rip address-family ipv4 vrf RIP offset-list 0 in 8 GigabitEthernet2.206


! XRv3 route-policy RPL_ADD_8 add rip-metric 8 end-policy router rip vrf RIP interface GigabitEthernet0/0/0/0.206 route-policy RPL_ADD_8 in
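The arithmetic behind the offset can be checked with a short Python sketch. It assumes the CE simply installs the lowest-metric RIP route when both paths are present, and it caps metrics at the RIP infinity of 16; the function names are hypothetical. The metrics used (1 over the backdoor, 2 via the PE) are the values seen on CSR5 in this section.

```python
RIP_INFINITY = 16  # a metric of 16 means unreachable in RIP


def apply_offset(metric: int, offset: int) -> int:
    """Add an inbound offset, never exceeding RIP infinity."""
    return min(metric + offset, RIP_INFINITY)


def preferred(candidates: dict) -> str:
    """Return the name of the path with the lowest metric."""
    return min(candidates, key=candidates.get)


# Before the offset: the backdoor advertises XRv3's loopback with metric 1,
# while the PE advertises it with metric 2.
before = {"backdoor": 1, "mpls": 2}
# After adding 8 inbound on the backdoor interface.
after = {"backdoor": apply_offset(1, 8), "mpls": 2}

print(preferred(before), before)
print(preferred(after), after)
```

This only describes the CE decision when both routes are offered at the same time; it says nothing about which route the PE learns first.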

On the PEs, we need to configure the SoO on the PE-CE link as well. This completes the SoO design. ! XRv1 router rip vrf RIP interface GigabitEthernet0/0/0/0.106 site-of-origin 214:9013 ! XRv2 router rip vrf RIP interface GigabitEthernet0/0/0/0.106 site-of-origin 214:9005

Verifying some BGP routes on XRv2, we can see CSR5’s loopback with SoO values of 214:9005 and 214:9013, respectively. One was locally originated and one was iBGP-learned. These values were imposed by the PEs at the redistribution point.

RP/0/0/CPU0:XRv2#show bgp vpnv4 unicast vrf RIP 192.106.5.5/32 | begin Local$
  Local
    192.12.106.5 from 0.0.0.0 (214.0.0.12)
      Origin incomplete, metric 1, localpref 100, weight 32768, valid, redistributed, best, group-best, import-candidate
      Received Path ID 0, Local Path ID 1, version 1358
      Extended community: SoO:214:9005 RT:214:106
  Path #2: Received by speaker 0
  Not advertised to any peer
  Local, (Received from a RR-client)
    214.0.0.11 (metric 10) from 214.0.0.11 (214.0.0.11)
      Received Label 91031
      Origin incomplete, metric 10, localpref 100, valid, internal, import-candidate, imported
      Received Path ID 0, Local Path ID 0, version 0
      Extended community: SoO:214:9013 RT:214:106
      Source VRF: RIP, Source Route Distinguisher: 214:106

There is a race condition regarding XRv2’s view of XRv3’s loopback. If it learns the RIP route from CSR5 first, it will install it into the RIB and advertise it into BGP. We can simulate this by shutting down the BGP peering to XRv1 for a minute or so. In this case, XRv2 routes the traffic to CSR5 and then over the backdoor. This implies that CSR5’s route is via the slow backdoor, which is not what we want.

RP/0/0/CPU0:XRv2#show bgp vpnv4 unicast vrf RIP 192.106.13.13/32 | begin Local$
  Local
    192.12.106.5 from 0.0.0.0 (214.0.0.12)
      Origin incomplete, metric 10, localpref 100, weight 32768, valid, redistributed, best, group-best, import-candidate
      Received Path ID 0, Local Path ID 1, version 1420
      Extended community: SoO:214:9005 RT:214:106
  Path #2: Received by speaker 0
  Not advertised to any peer
  Local, (Received from a RR-client)
    214.0.0.11 (metric 10) from 214.0.0.11 (214.0.0.11)
      Received Label 91012
      Origin incomplete, metric 1, localpref 100, valid, internal, import-candidate, imported
      Received Path ID 0, Local Path ID 0, version 0
      Extended community: SoO:214:9013 RT:214:106
      Source VRF: RIP, Source Route Distinguisher: 214:106
R5#show ip route vrf RIP 192.106.13.13
Routing Table: RIP
Routing entry for 192.106.13.13/32
  Known via "rip", distance 120, metric 9
  Redistributing via rip
  Last update from 192.5.106.13 on GigabitEthernet2.206, 00:00:05 ago
  Routing Descriptor Blocks:
  * 192.5.106.13, from 192.5.106.13, 00:00:05 ago, via GigabitEthernet2.206
      Route metric is 9, traffic share count is 1

If XRv2 learns the BGP route first, it will install it in the RIB, advertise it to CSR5 with a low metric, and CSR5 will install it. We can simulate this by breaking the backdoor link on CSR5’s side for a minute or so. This is a more ideal end-state since the customer is using MPLS over the slow backdoor. RP/0/0/CPU0:XRv2#show bgp vpnv4 unicast vrf RIP 192.106.13.13/32 | begin Local$ Local, (Received from a RR-client) 214.0.0.11 (metric 10) from 214.0.0.11 (214.0.0.11) Received Label 91012 Origin incomplete, metric 1, localpref 100, valid, internal, best, group-best, import-candidate, imported Received Path ID 0, Local Path ID 1, version 1432 Extended community: SoO:214:9013 RT:214:106 Source VRF: RIP, Source Route Distinguisher: 214:106 R5#show ip route vrf RIP 192.106.13.13


Routing Table: RIP Routing entry for 192.106.13.13/32 Known via "rip", distance 120, metric 2 Redistributing via rip Last update from 192.12.106.12 on GigabitEthernet2.106, 00:00:25 ago Routing Descriptor Blocks: * 192.12.106.12, from 192.12.106.12, 00:00:25 ago, via GigabitEthernet2.106 Route metric is 2, traffic share count is 1

Solutions to prevent this situation are numerous: metric-based ingress filtering on the PEs, or egress filtering on the CEs to prevent remote routes from even hitting the MPLS network, would be best. Unfortunately, we can’t use the RPL to match RIP metrics, otherwise this would be a clean solution. We can match prefixes directly and filter them, which scales poorly but is effective. We will use ingress filtering on XRv1 and XRv2 to ensure routes that came over the backdoor (we can tell by the huge metric) cannot be installed in the routing table. Each PE permits only the loopback of its directly attached CE. The downside of this approach is that if a PE-CE link fails, the PE “diagonal” from a CE will not have reachability, which is a rare case. We use opposing logic in the RPLs for variety.

! XRv1
route-policy RPL_RIP_FILTER
 if not destination in (192.106.13.13/32) then
  drop
 else
  pass
 endif
end-policy
! XRv2
route-policy RPL_RIP_FILTER
 if destination in (192.106.5.5/32) then
  pass
 else
  drop
 endif
end-policy
! XRv1 and XRv2
router rip
 vrf RIP
  interface GigabitEthernet0/0/0/0.106
   route-policy RPL_RIP_FILTER in
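The two ways of writing the ingress filter ("if not in the set then drop else pass" versus "if in the set then pass else drop") are logically equivalent, which the following Python sketch demonstrates. It is only an illustration of the boolean logic, with plain string comparison standing in for RPL prefix matching; the function names and the test prefixes chosen are hypothetical.

```python
def filter_negated(prefix: str, allowed: set) -> str:
    """Form 1: if not destination in (...) then drop else pass."""
    if prefix not in allowed:
        return "drop"
    else:
        return "pass"


def filter_positive(prefix: str, allowed: set) -> str:
    """Form 2: if destination in (...) then pass else drop."""
    if prefix in allowed:
        return "pass"
    else:
        return "drop"


allowed = {"192.106.13.13/32"}  # loopback of the locally attached CE
candidates = ["192.106.13.13/32", "192.106.5.5/32", "192.5.106.0/24"]

for prefix in candidates:
    print(prefix, filter_negated(prefix, allowed), filter_positive(prefix, allowed))
```

Both forms accept only the locally attached loopback and drop anything that leaked in over the backdoor.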

To test it, we can break the BGP connection between XRv1 and XRv2 again. The CE routers will utilize the slow backdoor to communicate and advertise one another’s loopbacks to the PEs. The PEs should reject it, ultimately having no routes to the diagonally-opposed CEs. Only the local loopbacks are allowed. RP/0/0/CPU0:XRv1#show route vrf RIP rip


R 192.106.13.13/32 [120/1] via 192.11.106.13, 02:49:50, GigabitEthernet0/0/0/0.106 RP/0/0/CPU0:XRv2#show route vrf RIP rip R 192.106.5.5/32 [120/1] via 192.12.106.5, 00:01:27, GigabitEthernet0/0/0/0.106

The backdoor link continues to work as an alternate path when the MPLS network fails. R5#ping vrf RIP 192.106.13.13 source 192.106.5.5 Type escape sequence to abort. Sending 5, 100-byte ICMP Echos to 192.106.13.13, timeout is 2 seconds: Packet sent with a source address of 192.106.5.5 !!!!! Success rate is 100 percent (5/5), round-trip min/avg/max = 2/3/4 ms

When BGP comes back online, both PEs only see the BGP-learned route, ensuring that the PEs immediately prefer the MPLS network again without question.

R5#show ip cef vrf RIP 192.106.13.13
192.106.13.13/32
  nexthop 192.12.106.12 GigabitEthernet2.106
RP/0/0/CPU0:XRv3#show cef vrf RIP ipv4 192.106.5.5/32
192.106.5.5/32, version 105, internal 0x1000001 0x0 (ptr 0xa144ad74) [1], 0x0 (0xa142e250), 0x0 (0x0)
 local adjacency 192.11.106.11
 Prefix Len 32, traffic index 0, precedence n/a, priority 2
   via 192.11.106.11, GigabitEthernet0/0/0/0.106, 5 dependencies, weight 0, class 0 [flags 0x0]
    path-idx 0 NHID 0x0 [0xa0ec734c 0x0]
    next hop 192.11.106.11
    local adjacency

Finally, we test connectivity between all sites. CSR6 and XRv4 cannot communicate only because the backdoor is down (PE imposes the same SoO), but otherwise, there is full reachability. ! CSR5 and CSR6 tclsh foreach x { 192.106.5.5 192.106.6.6 192.106.13.13 192.106.14.14 } { ping vrf RIP $x source loopback106 repeat 3 timeout 1 }

38.2 VRF label modes

When using VPNv4/v6 BGP to allocate labels for VPN prefixes, we are often accustomed to seeing a unique VPN label per customer prefix. The logic is that this label serves as a demultiplexer value; when the label is received from the MPLS core, the egress LSR performs a single LFIB lookup on a single label in a single table. There is no need to look at the IPv4/v6 destination since the local label that was allocated will determine the IPv4/v6 next-hop within the VPN, as well as represent a specific customer prefix. This is an intelligent approach, but comes with scalability limitations. The most obvious example is transporting the Internet prefixes inside of a central-service L3VPN. Allocating ~350,000 labels is more than one third of the available MPLS label space, not to mention increased burden on the router in terms of memory (large LIB, LFIB, etc).

New options were invented to help increase efficiency with respect to label resource management. The first and oldest alternative was to allocate labels on a per-VRF basis. For each customer, there will be a single, common VPN label allocated by BGP for VPNv4/v6. The drawback of this approach is that it violates one of the original designs of MPLS. When the aggregate VPN label arrives on an egress PE from the core, it could represent any prefix within a given VPN, so the router must perform a second FIB (not LFIB) lookup on the IPv4/v6 destination address to determine how to route it. In cases where there is only a single PE-CE link per VPN, this is obviously inefficient/unnecessary and could affect performance. When there are multiple PE-CE links in the same VPN, the inefficiency remains but it is necessary since the next-hop CE may vary depending on the VPN prefix.

Below is the original output of the BGP allocated labels without modification. We see a unique VPN label for each prefix, which is the default.

R7#show bgp vpnv4 unicast vrf TOP labels | include 88\.
   88.7.7.7/32        0.0.0.0       7020/nolabel(TOP)
   88.125.160.0/28    10.7.10.10    7012/nolabel
   88.125.160.16/28   10.7.10.10    7013/nolabel
   88.125.160.32/28   10.7.10.10    7014/nolabel
   88.125.160.48/28   10.7.10.10    7015/nolabel
   88.125.160.64/27   10.7.10.10    7016/nolabel
   88.125.160.96/27   10.7.10.10    7017/nolabel
[snip]
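The label-space claim above can be checked with a little arithmetic. The sketch below is only an illustration: it assumes the full 20-bit MPLS label field minus the 16 reserved values (0 through 15) is available, whereas real platforms usually carve out a smaller dynamic range. The function name is hypothetical.

```python
# MPLS labels are 20 bits wide; values 0-15 are reserved by the architecture.
LABEL_BITS = 20
TOTAL_LABELS = 2 ** LABEL_BITS      # 1,048,576 possible label values
RESERVED_LABELS = 16                # labels 0 through 15


def label_space_fraction(prefix_count: int) -> float:
    """Fraction of the usable label space consumed by one label per prefix."""
    usable = TOTAL_LABELS - RESERVED_LABELS
    return prefix_count / usable


if __name__ == "__main__":
    # ~350,000 Internet prefixes, one VPN label each (per-prefix mode).
    print(f"{label_space_fraction(350_000):.1%} of the usable label space")
```

With per-VRF allocation the same table would consume a single label, which is the motivation for the modes discussed next.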

We have the flexibility to change this per protocol (VPNv4 or VPNv6) or for all protocols, as well as per VRF or for all VRFs. We will enable per-VRF mode on CSR7 inside VRF TOP for VPNv4 only. Notice that CSR7 learns several 88.0.0.0/8 routes from CSR10, and advertises a local loopback inside the same VRF. Before issuing the command, each one had a different label as shown above. Afterwards, however, they all use the same label, regardless of the route’s origin. The label is explicitly marked as an IPv4 VRF aggregate label for clarity.

! CSR7
mpls label mode vrf TOP protocol bgp-vpnv4 per-vrf

R7#show bgp vpnv4 unicast vrf TOP labels | include 88\.
   88.7.7.7/32        0.0.0.0       IPv4 VRF Aggr:7005/nolabel(TOP)
   88.125.160.0/28    10.7.10.10    IPv4 VRF Aggr:7005/nolabel
   88.125.160.16/28   10.7.10.10    IPv4 VRF Aggr:7005/nolabel
   88.125.160.32/28   10.7.10.10    IPv4 VRF Aggr:7005/nolabel
   88.125.160.48/28   10.7.10.10    IPv4 VRF Aggr:7005/nolabel
   88.125.160.64/27   10.7.10.10    IPv4 VRF Aggr:7005/nolabel
[snip]

As a result, packets arriving for 88.7.7.7/32 and 88.125.160.0/28 must be looked up in the LFIB first. The LFIB basically says “this belongs to VRF TOP, but since it is an aggregate label, we don’t know how to forward the packet”. The label is popped to reveal the IPv4 packet. The LFIB entry contains no outgoing interface or next-hop. All it knows is that the IPv4 VRF TOP FIB must be consulted.

R7#show mpls forwarding-table labels 7005 detail
Local      Outgoing   Prefix           Bytes Label   Outgoing   Next Hop
Label      Label      or Tunnel Id     Switched      interface
7005       Pop Label  IPv4 VRF[V]      0             aggregate/TOP
        MAC/Encaps=0/0, MRU=0, Label Stack{}
        VPN route: TOP
        No output feature configured

The second VRF-aware FIB lookup tells us how to route the packet once the MPLS labels are removed. There is no way to carry this information inside of MPLS anymore. As an example, two packets could arrive with identical labels but be forwarded two different ways based on their IPv4 destinations.

R7#show ip cef vrf TOP 88.7.7.7
88.7.7.7/32
  receive for Loopback7
R7#show ip cef vrf TOP 88.125.160.0
88.125.160.0/28
  nexthop 10.7.10.10 GigabitEthernet2.570
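The two-stage lookup just described can be modeled in a few lines. This is a minimal sketch, not how IOS-XE implements CEF: the dictionaries and the `forward` function are hypothetical, and only the label, VRF name, prefixes, and forwarding results are taken from the CSR7 outputs in this section.

```python
import ipaddress

# Hypothetical model of CSR7's tables, populated from the outputs in this section.
LFIB = {
    7005: {"action": "pop", "vrf": "TOP"},  # IPv4 VRF aggregate label
}
VRF_FIB = {
    "TOP": {
        ipaddress.ip_network("88.7.7.7/32"): "receive for Loopback7",
        ipaddress.ip_network("88.125.160.0/28"): "nexthop 10.7.10.10 GigabitEthernet2.570",
    }
}


def forward(label: int, destination: str) -> str:
    """Look up the label in the LFIB, then the IP destination in the VRF FIB."""
    entry = LFIB[label]                  # first lookup: the label alone
    fib = VRF_FIB[entry["vrf"]]          # an aggregate label names a table, not a next-hop
    addr = ipaddress.ip_address(destination)
    matches = [net for net in fib if addr in net]
    if not matches:
        return "drop"
    best = max(matches, key=lambda net: net.prefixlen)  # longest-prefix match
    return fib[best]


if __name__ == "__main__":
    # Same label, two different forwarding decisions.
    print(forward(7005, "88.7.7.7"))
    print(forward(7005, "88.125.160.1"))
```

Both calls use label 7005, yet the results differ, which is exactly why the second FIB lookup cannot be avoided in per-VRF mode.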

The per-VRF label feature was a good “first attempt” at solving this problem. A better approach is to allocate labels per-next-hop, or as Cisco calls it, “per-CE”. Imagine the scenario described earlier with two CEs in the same VRF. We don’t necessarily need an aggregate label for the whole VRF, as this causes inefficiencies in the forwarding lookups. One label per next-hop means that the label can still demultiplex the traffic’s direction, saving the PE an IPv4/v6 FIB lookup. The CE can look at the IP packet and route the traffic as it normally would. This will also work well in our current situation of having a local loopback in the same VRF as customer routes. We can test this using VPNv6; below is the output before any configuration changes where labels are allocated per-prefix by default.

R7#show bgp vpnv6 unicast vrf TOP labels | section 2BAD:88
   2BAD:88:125:160::/72       FD00:10:7:10::10   7007/nolabel
   2BAD:88:125:160:100::/72   FD00:10:7:10::10   7009/nolabel
   2BAD:88:125:160:200::/72   FD00:10:7:10::10   7010/nolabel
   2BAD:88:125:160:300::/72   FD00:10:7:10::10   7011/nolabel

Specifically for this VRF (we only have one, so this doesn’t matter for CSR7) we will enable the per-CE label allocation. Notice that these labels are not considered “aggregates”; they still have forwarding instructions, as the LFIB will show later. We can see that the next-hop of CSR10 uses label 7020 for all prefixes, while prefixes with next-hop :: (local) use label 7012. Local/aggregate routes would normally share the same label since an IPv4/v6 FIB lookup is required no matter what.

! CSR7
mpls label mode vrf TOP protocol bgp-vpnv6 per-ce

R7#show bgp vpnv6 unicast vrf TOP labels | section 2BAD::?88
   2BAD::88:7:7/128           ::                 7012/nolabel(TOP)
   2BAD:88:125:160::/72       FD00:10:7:10::10   7020/nolabel
   2BAD:88:125:160:100::/72   FD00:10:7:10::10   7020/nolabel
   2BAD:88:125:160:200::/72   FD00:10:7:10::10   7020/nolabel
   2BAD:88:125:160:300::/72   FD00:10:7:10::10   7020/nolabel

The LFIB outputs are different for per-CE versus per-VRF. The first entry for label 7020 is an ordinary LFIB entry for a PE router with the exception of the “prefix” field. A “next-hop-ID” is assigned, which we know is CSR10, and is mapped to this specific local label. I cannot find a command to show the next-hop bindings and the relevance of the ID number 3, and Cisco does not discuss it in their documentation. It is not terribly important; the LFIB entry clearly shows CSR10 as the next-hop anyway.

R7#show mpls forwarding-table labels 7020 detail
Local      Outgoing   Prefix           Bytes Label   Outgoing   Next Hop
Label      Label      or Tunnel Id     Switched      interface
7020       No Label   nh-id(3)         0             Gi2.570    FE80::10
        MAC/Encaps=18/18, MRU=1504, Label Stack{}
        005056A9F961005056A9EA7781000DF286DD
        No output feature configured


Below is the normal L3VPN output for a local/aggregate route. The prefix ID for the loopback prefix is known since the next-hop is local and removing the VPN label is required anyway.

R7#show mpls forwarding-table labels 7012 detail
Local      Outgoing   Prefix           Bytes Label   Outgoing   Next Hop
Label      Label      or Tunnel Id     Switched      interface
7012       Pop Label  2BAD::88:7:7/128[V]   \
                                       0             aggregate/TOP
        MAC/Encaps=0/0, MRU=0, Label Stack{}
        VPN route: TOP
        No output feature configured

All of these label mode modifications can be viewed by checking the VRF details as well. The per-VRF option even shows the label value since it is authoritative for the entire VRF.

R7#show vrf detail TOP | include ^Address|label_alloc
Address family ipv4 unicast (Table ID = 0x1):
  VRF label allocation mode: per-vrf (Label 7005)
Address family ipv6 unicast (Table ID = 0x1E000001):
  VRF label allocation mode: per-ce
Address family ipv4 multicast not active

We know that the per-CE feature allocates labels per next-hop. We can check the BGP next-hops to see how many are in VRF TOP, which is a good indication of how many labels we need. This doesn’t account for local/aggregate prefixes. Note that any route with a zero next-hop (0.0.0.0 for IPv4 or :: for IPv6) will get a label per-prefix since the router cannot reliably determine the next-hop. This example is a little artificial in this way, but it still demonstrates the feature.

R7#show bgp vpnv6 unicast vrf TOP nexthops
 # Paths  Nexthop Address
Route Distinguisher: 214:7 (default for vrf TOP)
      20  ::FFFF:214.0.0.8
       7  FD00:10:7:10::10 (FE80::10) (TOP)
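To compare how many labels each mode consumes, the allocation rules described so far can be sketched as follows. This is an illustrative model only: the `allocate` function, its mode strings, and the starting label value are assumptions, and the route list simply mirrors the VPNv6 prefixes shown for VRF TOP. Routes with a zero next-hop keep their own label in per-CE mode, as noted above.

```python
from itertools import count


def allocate(routes, mode, first_label=16):
    """Return {prefix: label} for routes given as (prefix, next_hop) pairs.

    mode is "per-prefix", "per-vrf", or "per-ce"; label values are arbitrary.
    """
    labels = count(first_label)
    bindings, by_key = {}, {}
    for prefix, next_hop in routes:
        if mode == "per-prefix":
            key = prefix
        elif mode == "per-vrf":
            key = "vrf"
        elif mode == "per-ce":
            # Local routes (zero next-hop) keep a label of their own.
            key = prefix if next_hop in ("0.0.0.0", "::") else next_hop
        else:
            raise ValueError(f"unknown label mode: {mode}")
        if key not in by_key:
            by_key[key] = next(labels)
        bindings[prefix] = by_key[key]
    return bindings


ROUTES = [
    ("2BAD::88:7:7/128", "::"),
    ("2BAD:88:125:160::/72", "FD00:10:7:10::10"),
    ("2BAD:88:125:160:100::/72", "FD00:10:7:10::10"),
    ("2BAD:88:125:160:200::/72", "FD00:10:7:10::10"),
    ("2BAD:88:125:160:300::/72", "FD00:10:7:10::10"),
]

if __name__ == "__main__":
    for mode in ("per-prefix", "per-vrf", "per-ce"):
        used = len(set(allocate(ROUTES, mode).values()))
        print(f"{mode}: {used} label(s)")
```

The per-CE count tracks the number of distinct next-hops plus local routes, which matches the reasoning behind checking the BGP next-hop table.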

Nothing in networking is free, and these optimizations come with several limitations. Carrier supporting Carrier (CSC) is not a valid design when using these label aggregation features. Likewise, BGP PIC and iBGP multipath are not advertised to work either. The per-CE feature has even more limitations: routes can only be imported from BGP and cannot be eBGP multi-hop. To demonstrate some of these restrictions, we will try to enable VRF-aware LDP on the VRF TOP interface to support CSC. We can also try to enable IPv4 labeled-unicast inside BGP, but we get the same message.

R7(config)#int gig2.570
R7(config-subif)#mpls ip
% Cannot configure CSC for vrf TOP, please remove the per-VRF label mode configuration before configuring CSC

R7(config)#router bgp 214
R7(config-router)#address-family ipv4 vrf TOP
R7(config-router-af)#neighbor 10.7.10.10 send-label
% Cannot configure CSC for vrf TOP, please remove the per-VRF label mode configuration before configuring CSC

We can configure three bogus static routes on CSR7 within the VRF. Two of them use CSR10’s global address of FD00:10:7:10::10 and one of them uses FE80::10. Since the BGP next-hops recurse to the link-local (LL) address, the static route pointing to FE80::10 is allowed to share a label with the BGP prefixes from CSR10. This is confusing since the BGP label list for VPNv6 clearly shows the pre-recursive next-hop of the global address, but label 7020 is really associated with FE80::10 per the LFIB. The other two static routes use a new label bound to FD00:10:7:10::10 per the LFIB.

! CSR7
ipv6 route vrf TOP 2BAD::88:10:10/128 GigabitEthernet2.570 FD00:10:7:10::10
ipv6 route vrf TOP 2BAD::88:10:100/128 GigabitEthernet2.570 FD00:10:7:10::10
ipv6 route vrf TOP 2BAD::88:100:100/128 GigabitEthernet2.570 FE80::10
router bgp 214
 address-family ipv6 vrf TOP
  redistribute static

R7#show bgp vpnv6 unicast vrf TOP labels | section 2BAD::?88
   2BAD::88:7:7/128           ::                 7012/nolabel(TOP)
   2BAD::88:10:10/128         FD00:10:7:10::10   7013/nolabel
   2BAD::88:10:100/128        FD00:10:7:10::10   7013/nolabel
   2BAD::88:100:100/128       FE80::10           7020/nolabel
   2BAD:88:125:160::/72       FD00:10:7:10::10   7020/nolabel
   2BAD:88:125:160:100::/72   FD00:10:7:10::10   7020/nolabel
[snip]

R7#show ipv6 static vrf TOP | begin distance
*   2BAD::88:10:10/128 via FD00:10:7:10::10, GigabitEthernet2.570, distance 1
*   2BAD::88:10:100/128 via FD00:10:7:10::10, GigabitEthernet2.570, distance 1
*   2BAD::88:100:100/128 via FE80::10, GigabitEthernet2.570, distance 1

R7#show mpls forwarding-table labels 7013
Local      Outgoing   Prefix           Bytes Label   Outgoing   Next Hop
Label      Label      or Tunnel Id     Switched      interface
7013       No Label   nh-id(4)         0             Gi2.570    FD00:10:7:10::10

R7#show mpls forwarding-table labels 7020
Local      Outgoing   Prefix           Bytes Label   Outgoing   Next Hop
Label      Label      or Tunnel Id     Switched      interface
7020       No Label   nh-id(3)         0             Gi2.570    FE80::10

Because prefixes are aggregated by labels, per-prefix counters no longer function. Since the per-VRF and per-next-hop methods both collapse LFIB entries into single (or few) entries, only aggregate counters are collected. The LFIB has no way to differentiate and makes no attempt. We know label 7005 represents all of VRF TOP for IPv4, so we can check counters for that only.

R1#ping 88.125.160.65 source 82.125.161.129 repeat 10000

R7#show mpls forwarding-table labels 7005
Local      Outgoing   Prefix           Bytes Label   Outgoing   Next Hop
Label      Label      or Tunnel Id     Switched      interface
7005       Pop Label  IPv4 VRF[V]      244614        aggregate/TOP

Likewise for the per-next-hop behavior configured for VPNv6, we can only account for packets on a per-next-hop basis.

R1#ping 2BAD:88:125:160:300::1 source 2BAD:82:125:160:400::1 repeat 1000

R7#show mpls forwarding-table interface gig2.570
Local      Outgoing   Prefix           Bytes Label   Outgoing   Next Hop
Label      Label      or Tunnel Id     Switched      interface
7013       No Label   nh-id(4)         0             Gi2.570    FD00:10:7:10::10
7020       No Label   nh-id(3)         254998        Gi2.570    FE80::10

We can still send traffic for the bogus routes and temporarily create a new loopback on CSR10 to respond. The counters for the second next-hop will increment because technically the LFIB still received an MPLS packet and performed a pop operation.

R1#ping 2BAD::88:10:100 source 2BAD:82:125:160:400::1 repeat 100

R7#show mpls forwarding-table interface gig2.570
Local      Outgoing   Prefix           Bytes Label   Outgoing   Next Hop
Label      Label      or Tunnel Id     Switched      interface
7013       No Label   nh-id(4)         11800         Gi2.570    FD00:10:7:10::10
7020       No Label   nh-id(3)         299484        Gi2.570    FE80::10

There is one final option which takes the per-VRF method and enhances it. Aggregates created on the PE, along with any connected routes, will all share a single label. All other routes keep their per-prefix labels, which lets the SP preserve dedicated labels for specific routes that need them for multipath, add-path, per-prefix counters, or any other feature not supported with per-VRF or per-CE labels. The aggregate/connected routes behave like per-VRF label aggregates.

On CSR7, we configured these label mode options on a per-VRF, per-AFI basis. On CSR8, we will summarily configure the per-VRF connected/aggregate feature for all VRFs and all AFIs for variety. We will also configure some IPv4 and IPv6 aggregates within the VRFs. Last, we redistribute the connected PE-CE transit link so that we have a combination of many “interesting” routes.

! CSR8
mpls label mode all-vrfs protocol all-afs vrf-conn-aggr
router bgp 214
 address-family ipv4 vrf BOTTOM
  redistribute connected
  aggregate-address 82.125.160.64 255.255.255.192
  aggregate-address 82.125.160.0 255.255.255.192
 address-family ipv6 vrf BOTTOM
  redistribute connected
  aggregate-address 2BAD:82:125:160:200::/71
  aggregate-address 2BAD:82:125:160::/71

We can see that VPNv4 has allocated label 8025 as an aggregate label for the connected transit link and aggregate addresses. Because the aggregates did not suppress longer-matches, the component routes are present as well, and with per-prefix labels.

R8#show bgp vpnv4 unicast vrf BOTTOM labels
   Network            Next Hop      In label/Out label
Route Distinguisher: 214:8 (BOTTOM)
   10.1.8.0/24        0.0.0.0       IPv4 VRF Aggr:8025/nolabel(BOTTOM)
   82.125.160.0/28    10.1.8.1      8005/nolabel
   82.125.160.0/26    0.0.0.0       IPv4 VRF Aggr:8025/aggregate(BOTTOM)
   82.125.160.16/28   10.1.8.1      8006/nolabel
   82.125.160.32/28   10.1.8.1      8007/nolabel
   82.125.160.48/28   10.1.8.1      8008/nolabel
   82.125.160.64/27   10.1.8.1      8009/nolabel
   82.125.160.64/26   0.0.0.0       IPv4 VRF Aggr:8025/aggregate(BOTTOM)

The LFIB for this label entry is identical to that seen for the per-VRF mode. This design isn’t very effective since all of the longer-matches are unsuppressed, but realistically many would be suppressed, making use of the aggregated labels to reduce LIB size.

R8#show mpls forwarding-table labels 8025 detail
Local      Outgoing   Prefix           Bytes Label   Outgoing   Next Hop
Label      Label      or Tunnel Id     Switched      interface
8025       Pop Label  IPv4 VRF[V]      0             aggregate/BOTTOM
        MAC/Encaps=0/0, MRU=0, Label Stack{}
        VPN route: BOTTOM
        No output feature configured

VPNv6 is nearly identical except that it contains a different set of prefixes. These use IPv6 aggregated labels, and in this case, all aggregate and connected routes share label 8019. The LFIB output is identical with the exception of the prefix ID indicating an IPv6 VRF aggregate.

R8#show bgp vpnv6 unicast vrf BOTTOM labels
   Network                    Next Hop         In label/Out label
Route Distinguisher: 214:8 (BOTTOM)
   2BAD:82:125:160::/72       FD00:10:1:8::1   8036/nolabel
   2BAD:82:125:160::/71       ::               IPv6 VRF Aggr:8019/aggregate(BOTTOM)
   2BAD:82:125:160:100::/72   FD00:10:1:8::1   8037/nolabel
   2BAD:82:125:160:200::/72   FD00:10:1:8::1   8038/nolabel
   2BAD:82:125:160:200::/71   ::               IPv6 VRF Aggr:8019/aggregate(BOTTOM)
   FD00:10:1:8::/64           ::               IPv6 VRF Aggr:8019/nolabel(BOTTOM)

R8#show mpls forwarding-table labels 8019 detail
Local      Outgoing   Prefix           Bytes Label   Outgoing   Next Hop
Label      Label      or Tunnel Id     Switched      interface
8019       Pop Label  IPv6 VRF[V]      0             aggregate/BOTTOM
        MAC/Encaps=0/0, MRU=0, Label Stack{}
        VPN route: BOTTOM
        No output feature configured

We can confirm the configuration and label value by checking the VRF details. Just like the per-VRF label mode, the per-VRF connected/aggregate option uses a single, predetermined label, so the VRF stores it.

R8#show vrf detail BOTTOM | include ^Address|label_alloc
Address family ipv4 unicast (Table ID = 0x2):
  VRF label allocation mode: vrf-conn-aggr (Label 8025)
Address family ipv6 unicast (Table ID = 0x1E000001):
  VRF label allocation mode: vrf-conn-aggr (Label 8019)
Address family ipv4 multicast not active
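The connected/aggregate mode on CSR8 can be summarized as "one shared label for connected and aggregate routes, one label each for everything else". The sketch below models only that rule; the function name, the origin strings, and the starting label are hypothetical, while the prefixes are the VPNv4 routes seen in VRF BOTTOM.

```python
from itertools import count


def allocate_conn_aggr(routes, first_label=16):
    """Return {prefix: label} for routes given as (prefix, origin) pairs.

    Connected and aggregate routes share one label; other routes get their own.
    """
    labels = count(first_label)
    shared = None
    bindings = {}
    for prefix, origin in routes:
        if origin in ("connected", "aggregate"):
            if shared is None:
                shared = next(labels)   # allocate the shared label once
            bindings[prefix] = shared
        else:
            bindings[prefix] = next(labels)
    return bindings


ROUTES = [
    ("10.1.8.0/24", "connected"),
    ("82.125.160.0/28", "bgp"),
    ("82.125.160.0/26", "aggregate"),
    ("82.125.160.16/28", "bgp"),
    ("82.125.160.32/28", "bgp"),
    ("82.125.160.48/28", "bgp"),
    ("82.125.160.64/27", "bgp"),
    ("82.125.160.64/26", "aggregate"),
]

if __name__ == "__main__":
    bindings = allocate_conn_aggr(ROUTES)
    print(f"{len(set(bindings.values()))} labels for {len(bindings)} routes")
```

If the aggregates suppressed their component routes, only the shared label would remain, which is the saving the text describes.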

The feature is very similar on XR and is examined briefly. The configuration happens under BGP, which is a more appropriate place for it. On XRv1, we will enable the per-VRF mode for the EIGRP IPv4 AFI and per-CE mode for the EIGRP IPv6 AFI. The topology is imperfect for validating these, but we can show that they are working.

! XRv1
router bgp 214
 vrf EIGRP
  address-family ipv4 unicast
   label mode per-vrf
  address-family ipv6 unicast
   label mode per-ce

After issuing these commands, we can verify it by checking the BGP label table. The label “type” shows whether we are using per-VRF, per-CE, or per-prefix. The default for IPv4/v6 and VPNv4/v6 AFIs is per-prefix. Interestingly, we see per-VRF labels allocated for OSPF, EIGRP, and RIP, when we only configured it for EIGRP. We also see per-prefix labels for RD 214:101 and 214:106 (OSPF and RIP), but not RD 214:102 (EIGRP). The IOS XR documentation states “Even if label mode is in use, per-vrf label is allocated for connected, aggregate, and local prefixes.” This helps explain why we see aggregated, per-VRF labels in the BGP label table for VRFs we haven’t configured at all. In the case of OSPF, EIGRP, and RIP, they all included connected routes when redistributing into BGP, and so those routes were assigned the VRF aggregate label.

RP/0/0/CPU0:XRv1#show bgp label table
Label   Type              VRF/RD    Context
91005   IPv6 VRF Table    OSPF
91006   IPv6 VRF Table    EIGRP
91007   IPv4 VRF Table    OSPF
91008   IPv4 VRF Table    EIGRP
91009   IPv4 VRF Table    RIP
91010   IPv4 VRF Prefix   214:104   192.11.104.0/24
91011   IPv6 VRF Prefix   214:104   fd00:192:11:104::/64
91012   IPv4 VRF Prefix   214:106   192.106.13.13/32
91013   IPv4 VRF Prefix   214:104   192.104.13.13/32
91014   IPv4 VRF Prefix   214:101   192.101.13.13/32
91015   IPv6 VRF Prefix   214:104   ::192:104:13:13/128
91017   IPv6 VRF Prefix   214:101   ::192:101:13:13/128
91019   IPv4 VRF Prefix   214:105   192.105.13.13/32
91020   IPv4 VRF Prefix   214:104   192.5.104.0/24
91021   IPv6 VRF Prefix   214:104   fd00:192:5:104::/64
91022   IPv4 VRF Prefix   214:101   192.5.101.0/24
91023   IPv6 VRF Prefix   214:101   fd00:192:5:101::/64
91032   IPv6 VRF Prefix   214:101   ::192:101:5:5/128
91033   IPv6 VRF Prefix   214:101   fd00:192:12:101::/64
91040   IPv6 Nexthop      EIGRP     1

For example, XRv1 has allocated label 91006 for EIGRP IPv6 prefixes that meet the condition outlined above, while using label 91040 for routes learned from each CE. The connected route uses the per-VRF label while the EIGRP-learned routes from XRv3 (specifically FE80::13 as a next-hop) use the per-next-hop label. It is interesting to note that XR per-CE label allocation works with non-BGP protocols for PE-CE as we demonstrate below.

RP/0/0/CPU0:XRv1#show bgp vrf EIGRP ipv6 unicast labels | begin Network
   Network                  Next Hop      Rcvd Label   Local Label
Route Distinguisher: 214:102 (default for vrf EIGRP)
*>i::50:50:5:5/128          214.0.0.12    92025        nolabel
*>i::192:102:5:5/128        214.0.0.12    92010        nolabel
*>i::192:102:6:6/128        214.0.0.4     4052         nolabel
*> ::192:102:13:13/128      fe80::13      nolabel      91040
*>i::192:102:14:14/128      214.0.0.3     3028         nolabel
*>ifd00:192:3:102::/64      214.0.0.3     3027         nolabel
*>ifd00:192:4:102::/64      214.0.0.4     4005         nolabel
*> fd00:192:5:102::/64      fe80::13      nolabel      91040
*>ifd00:192:6:102::/64      214.0.0.3     3010         nolabel
*> fd00:192:11:102::/64     ::            nolabel      91006
*>ifd00:192:12:102::/64     214.0.0.12    92008        nolabel

For EIGRP IPv4, we don’t see any per-prefix or next-hop labels, implying that all EIGRP IPv4 prefixes share the per-VRF label. This is the expected behavior based on our configuration.

RP/0/0/CPU0:XRv1#show bgp vrf EIGRP labels | begin Network
   Network              Next Hop        Rcvd Label   Local Label
Route Distinguisher: 214:102 (default for vrf EIGRP)
*>i50.50.5.5/32         214.0.0.12      92024        nolabel
*>i192.3.102.0/24       214.0.0.3       3055         nolabel
*>i192.4.102.0/24       214.0.0.4       4006         nolabel
*> 192.5.102.0/24       192.11.102.13   nolabel      91008
*>i192.6.102.0/24       214.0.0.3       3056         nolabel
*> 192.11.102.0/24      0.0.0.0         nolabel      91008
*>i192.12.102.0/24      214.0.0.12      92007        nolabel
*>i192.102.5.5/32       214.0.0.12      92012        nolabel
*>i192.102.6.6/32       214.0.0.4       4053         nolabel
*> 192.102.13.13/32     192.11.102.13   nolabel      91008
*>i192.102.14.14/32     214.0.0.3       3057         nolabel

In the case of OSPF, where we didn’t change the label mode at all, we can see individual, per-prefix labels for all truly OSPF-learned routes. Routes locally originated into BGP, such as the connected PE-CE link to XRv3 or the sham-link endpoint, use the per-VRF label 91007 identified in the label table viewed earlier. This is efficient because the VPN FIB needs to be consulted anyway, so allocating 1 label for N connected/aggregate routes conserves label resources.

RP/0/0/CPU0:XRv1#show bgp vrf OSPF labels | begin Network
   Network              Next Hop        Rcvd Label   Local Label
Route Distinguisher: 214:101 (default for vrf OSPF)
*>i101.0.0.3/32         214.0.0.3       3048         nolabel
*>i101.0.0.4/32         214.0.0.4       4026         nolabel
*> 101.0.0.11/32        0.0.0.0         nolabel      91007
*>i101.0.0.12/32        214.0.0.12      92005        nolabel
*>i192.3.101.0/24       214.0.0.3       3049         nolabel
*> 192.5.101.0/24       192.11.101.13   nolabel      91022
*>i192.6.101.0/24       214.0.0.3       3051         nolabel
*> 192.11.101.0/24      0.0.0.0         nolabel      91007
*> 192.12.101.0/24      192.11.101.13   nolabel      91025
*>i192.60.6.6/32        214.0.0.4       4021         nolabel
*> 192.101.5.5/32       192.11.101.13   nolabel      91026
* i                     214.0.0.12      92011        91026
*>i192.101.6.6/32       214.0.0.4       4011         nolabel
*> 192.101.13.13/32     192.11.101.13   nolabel      91014
*>i192.101.14.14/32     214.0.0.3       3054         nolabel

We can drill into the details of a single label as well. Using 91040, since it is the most interesting, we can see the table details displayed in verbose fashion, as well as the label’s age. The label context contains a reference ID to the actual next-hop. We can check the nexthop-set with BGP to see what “1” means, and as expected, it references FE80::13.

RP/0/0/CPU0:XRv1#show bgp label 91040
BGP Label Entry: 91040
  Last Updated : MON 03 21:09:07.386 (00:33:19 ago)
  Label Type : IPv6 Nexthop
  VRF/RD : EIGRP
  Label Context: 1
  Refcount : 2

RP/0/0/CPU0:XRv1#show bgp vrf EIGRP ipv6 unicast nexthop-set
Resilient per-CE nexthop set, ID 1
  Number of nexthops 1, Label 91040, Flags 0x1
  Nexthops:
    fe80::13

Looking at a per-prefix label, we see similar information, with the addition of the prefix to which the label is bound. XE did not appear to support these kinds of label-specific show commands.

RP/0/0/CPU0:XRv1#show bgp label 91020
BGP Label Entry: 91020
  Last Updated : MON 01 23:04:58.145 (1d22h ago)
  Label Type : IPv4 VRF Prefix
  VRF/RD : 214:104
  Label Context: 192.5.104.0/24
  Refcount : 1


We can view the label allocations in summary form as well. Because XR supports so many label varieties, we can perform a quick check to see the quantity of each label allocated. As an example, one per-CE IPv6 label was allocated, which is used by the EIGRP IPv6 VPN. There are zero per-CE labels for IPv4 because we did not enable that mode anywhere.

RP/0/0/CPU0:XRv1#show bgp label summary
Label Alloc Type     Labels
----------------     ------
GBL IPv4 Prefix      0
VRF IPv4 Prefix      0
VPN IPv4 Prefix      0
CE IPv4 Nexthop      0
VRF IPv4 Table       3
GBL IPv6 Prefix      0
VRF IPv6 Prefix      0
VPN IPv6 Prefix      0
CE IPv6 Nexthop      1
VRF IPv6 Table       2
ASBR Nexthop         0
----------------     ------
TOTAL                6

Checking the data plane, we can see these labels programmed into the LFIB. Because the label refers to a remote VPN prefix, referencing the label by itself did not work, but using a label range did. We also look at a per-VRF label, which is identified as such in the LFIB, whereas the per-next-hop label shows “No ID”, which sets it apart. The aggregate label LFIB entry essentially directs the router to perform a VPN FIB lookup.

RP/0/0/CPU0:XRv1#show mpls forwarding labels 91039 91040
Local  Outgoing    Prefix             Outgoing      Next Hop        Bytes
Label  Label       or ID              Interface                     Switched
------ ----------- ------------------ ------------- --------------- ----------
91040  Unlabelled  No ID              Gi0/0/0/0.102 fe80::13        0
       Aggregate   No ID              EIGRP                         0

RP/0/0/CPU0:XRv1#show mpls forwarding labels 91007
Local  Outgoing    Prefix             Outgoing      Next Hop        Bytes
Label  Label       or ID              Interface                     Switched
------ ----------- ------------------ ------------- --------------- ----------
91007  Aggregate   OSPF: Per-VRF Aggr[V]   \
                                      OSPF                          12488


On XRv2, we will perform a more complex example using RPL. XRv2 learns 4 total routes via BGP between IPv4 and IPv6 from CSR5. Currently, each one has a different, per-prefix label. Only examining IPv4, we want to use a per-VRF label for all non-host routes.

RP/0/0/CPU0:XRv2#show bgp label table | include 214:104
92013   IPv4 VRF Prefix   214:104   192.12.104.0/24
92014   IPv4 VRF Prefix   214:104   192.104.5.5/32
92015   IPv6 VRF Prefix   214:104   ::192:104:5:5/128
92016   IPv6 VRF Prefix   214:104   fd00:192:12:104::/64

Ideally, we should see 192.12.104.0/24 move to the per-VRF label, while CSR5’s loopback still has a per-prefix label. This might be useful to have different LFIB counters or for performance reasons, since a per-VRF label requires a FIB (not LFIB) lookup after the bottom label is removed.

! XRv2
prefix-set PS_HOST_ROUTES
  0.0.0.0/0 ge 32
end-set
route-policy RPL_PER_PREFIX_IF_MATCH($PS)
  if destination in $PS then
    set label-mode per-prefix
  else
    set label-mode per-vrf
  endif
end-policy
router bgp 214
 vrf BGP
  address-family ipv4 unicast
   label mode route-policy RPL_PER_PREFIX_IF_MATCH(PS_HOST_ROUTES)

Looking at the label table again, we only see 3 routes with per-prefix labels within RD 214:104. The one IPv4 route remaining is CSR5’s loopback, which is a host route, and should have a per-prefix label.

RP/0/0/CPU0:XRv2#show bgp label table | include 214:104
92014   IPv4 VRF Prefix   214:104   192.104.5.5/32
92015   IPv6 VRF Prefix   214:104   ::192:104:5:5/128
92016   IPv6 VRF Prefix   214:104   fd00:192:12:104::/64

We have a new entry in the table for the IPv4 VRF table for the BGP VRF. The label allocated was 92029, and the details indicate it is only a few minutes old. BGP also indicates that the PE-CE link to CSR5 and the backdoor link between CSR5 and XRv3 are both using this label.

RP/0/0/CPU0:XRv2#show bgp label table | include BGP
92029   IPv4 VRF Table    BGP       -

RP/0/0/CPU0:XRv2#show bgp label 92029
BGP Label Entry: 92029
  Last Updated : MON 03 22:03:21.159 (00:02:50 ago)
  Label Type : IPv4 VRF Table
  VRF/RD : BGP
  Refcount : 2

RP/0/0/CPU0:XRv2#show bgp vrf BGP labels | include 92029
*> 192.5.104.0/24       192.12.104.5    nolabel      92029
*> 192.12.104.0/24      192.12.104.5    nolabel      92029

If we create a dummy prefix on CSR5 with a mask of /31, this route is also included in the per-VRF label. We can see the prefix bound to label 92029 on XRv2 as we expect.

! CSR5
interface Loopback104
 ip address 192.104.55.55 255.255.255.254 secondary

RP/0/0/CPU0:XRv2#show bgp vrf BGP labels | include 92029
*> 192.5.104.0/24       192.12.104.5    nolabel      92029
*> 192.12.104.0/24      192.12.104.5    nolabel      92029
*> 192.104.55.54/31     192.12.104.5    nolabel      92029
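The decision made by the route-policy above can be restated as a small function. This is only a sketch of the matching logic, not of RPL itself: the function name is hypothetical, and it assumes IPv4 prefixes only, since the policy is attached under the IPv4 unicast address family.

```python
import ipaddress


def label_mode(prefix: str) -> str:
    """Mimic the policy: host routes keep per-prefix labels, all others use per-vrf."""
    network = ipaddress.IPv4Network(prefix)
    # "0.0.0.0/0 ge 32" matches any IPv4 prefix whose length is at least 32.
    if network.prefixlen >= 32:
        return "per-prefix"
    return "per-vrf"


if __name__ == "__main__":
    for prefix in ("192.104.5.5/32", "192.12.104.0/24", "192.104.55.54/31"):
        print(prefix, label_mode(prefix))
```

The /31 falls through to the per-vrf branch because its length is below 32, which agrees with the label 92029 binding observed on XRv2.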

38.3 VRF selection for traffic leaking

Although route-leaking is a little more straightforward (discussed later), we can also forcibly move traffic between VRFs, VRF to global, and global to VRF using other techniques. IOS and IOS-XE platforms support three: VRF selection, VRF autoclassify, and VRF receive with PBR. All three are discussed, but only the last one works on the CSR1000v.

VRF selection does not appear to be supported on the CSR1000v as it is somewhat hardware specific. There isn’t anything more to say about this since the command doesn’t work. Normally, any traffic coming from the specified source would have its destination IP address looked up in the specified VRF, much like the PBR approach we will evaluate later.

R2(config)#vrf selection source 10.2.9.0 255.255.255.0 vrf VIA_XRV4
% VRF Select: failed to add config

VRF autoclassify should work, but the system tries to add dynamic route-maps with spaces in the name. I was not able to find a debug command that actually shows the route-map name, but I can see the system try to make them. This feature does not work as a result of the route-maps not being properly created. With the exception of the command that doesn't work, the configuration below is valid.

! CSR2
interface GigabitEthernet2.529
 description VRF SELECT USING SECONDARIES
 encapsulation dot1Q 3529
 ip address 10.2.9.2 255.255.255.0 secondary vrf VIA_XRV4
 ip address 10.9.2.2 255.255.255.0 secondary vrf VIA_CSR6
 ip address 2.2.2.2 255.255.255.0

R2(config-subif)#ip vrf autoclassify source
% Spaces not Allowed in route-map names
% Spaces not Allowed in route-map names

R2#debug route-map api
RMAP: App:IPVRF; Call to:rmap_api_reg_app() -- 3
RMAP: App:IPVRF; Call to: rmap_api_create() -- 3
RMAP: App:IPVRF; Call to: rmap_api_create() -- 3

Ultimately, no route-maps are made, IP VRF auto-classify is not listed as an input feature on that interface, and PBR is not active. The feature may not be fully supported on this IOS-XE version; always check the official Cisco release notes for your current version.

R2#show route-map dynamic
Current active dynamic routemaps = 0
R2#show ip interface gig2.529 | include Input_feat
  Input features: MCI Check
R6#show ip policy
Interface      Route map
[no output]

What does work is the “receive VRF” selection using manual PBR. Before beginning, note that you must enable PBR on the interface or configure VRF Select before configuring the VRF receive feature. Since VRF Select is not supported, we will create a PBR policy.

R2(config-subif)#ip vrf receive VIA_XRV4
% Need to enable Policy Based Routing or VRF Select on the interface first

Specifically, we want to take traffic from 10.2.9.0/24 and use VRF VIA_XRV4 for the route lookup. We want to do the same for 10.9.2.0/24 but use VRF VIA_CSR6 for the route lookup. We can be even more specific; rather than simply select a VRF for a normal routing lookup, we can cross VRFs while also specifying a next-hop. To demonstrate this, the first case uses the ordinary VRF lookup; packets arriving with source 10.2.9.0/24 will be routed normally, but the destination address will be looked up in VIA_XRV4. For sources 10.9.2.0/24, the regular routing process is overridden and the next-hop is set to CSR6, but is also allowed to cross VRFs. The effect is the same, since we can take traffic from the global table (the interface is not in a VRF) and leak it into other tables.

! CSR2
ip access-list standard ACL_VIA_CSR6
 permit 10.9.2.0 0.0.0.255
ip access-list standard ACL_VIA_XRV4
 permit 10.2.9.0 0.0.0.255
route-map RM_VRF_SELECTION permit 10
 match ip address ACL_VIA_XRV4
 set vrf VIA_XRV4
route-map RM_VRF_SELECTION permit 20
 match ip address ACL_VIA_CSR6
 set ip vrf VIA_CSR6 next-hop 10.2.6.6
interface GigabitEthernet2.529
 encapsulation dot1Q 3529
 ip vrf receive VIA_XRV4
 ip vrf receive VIA_CSR6
 ip address 2.2.2.2 255.255.255.0
 ip policy route-map RM_VRF_SELECTION
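The source-based selection performed by RM_VRF_SELECTION can be pictured as an ordered match on the packet's source address. The sketch below is a simplified model, not the IOS-XE PBR implementation: the `ROUTE_MAP` structure and `select` function are hypothetical, while the sequence numbers, source networks, VRF names, and next-hop come from the configuration above.

```python
import ipaddress

# Hypothetical model of RM_VRF_SELECTION: (sequence, source network, action).
ROUTE_MAP = [
    (10, ipaddress.ip_network("10.2.9.0/24"), {"vrf": "VIA_XRV4", "next_hop": None}),
    (20, ipaddress.ip_network("10.9.2.0/24"), {"vrf": "VIA_CSR6", "next_hop": "10.2.6.6"}),
]


def select(source: str) -> dict:
    """Return the lookup table (and optional forced next-hop) for a packet source."""
    addr = ipaddress.ip_address(source)
    # Route-map sequences are evaluated in ascending order; first match wins.
    for _seq, network, action in sorted(ROUTE_MAP, key=lambda entry: entry[0]):
        if addr in network:
            return action
    # No match: the packet is routed normally in the interface's own (global) table.
    return {"vrf": None, "next_hop": None}


if __name__ == "__main__":
    for src in ("10.2.9.9", "10.9.2.9", "2.2.2.9"):
        print(src, select(src))
```

Sequence 10 only picks the table and leaves the next-hop to the VRF's routing table, while sequence 20 also pins the next-hop, mirroring the two styles contrasted in the text.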

The problem with this approach is that it doesn’t consider the return traffic. Even if XRv4 and CSR6 had return routes for 10.2.9.0/24 and 10.9.2.0/24 respectively, CSR2 will receive the packets on VRF-enabled interfaces, perform a route lookup on the destination, and drop the traffic. We can override this with PBR in the return path to leak traffic from a VRF to the global table. Using similar techniques, we use an extended ACL to match destinations (for the return traffic). For CSR6, we will instruct the policy to perform a normal route lookup in the global table, which means CSR2 will need a route to 10.9.2.0/24 via 2.2.2.9. For XRv4, we will use PBR to avoid that by statically defining the next-hop in the PBR policy to 2.2.2.9. We also specify the keyword “global” so that the lookup on that hard-coded next-hop happens outside of the incoming link VRF, which was VIA_XRV4. We don’t need to specify any “receive” VRFs on those interfaces since the traffic isn’t being considered for migration into another VRF, but just the global table.

! CSR2
ip access-list extended ACL_RETURN_CSR6
 permit ip any 10.9.2.0 0.0.0.255
ip access-list extended ACL_RETURN_XRV4
 permit ip any 10.2.9.0 0.0.0.255
route-map RM_FROM_XRV4 permit 10
 match ip address ACL_RETURN_XRV4
 set ip global next-hop 2.2.2.9
route-map RM_FROM_CSR6 permit 10
 match ip address ACL_RETURN_CSR6
 set global
ip route 10.9.2.0 255.255.255.0 GigabitEthernet2.529 2.2.2.9
interface GigabitEthernet2.526
 vrf forwarding VIA_CSR6
 ip policy route-map RM_FROM_CSR6
interface GigabitEthernet2.524
 vrf forwarding VIA_XRV4
 ip policy route-map RM_FROM_XRV4

This is certainly a lot of manual labor. Fortunately, the “set global” and “set vrf” commands are highly scalable since we can just select tables for lookups without statically defining next-hops. However, this technique assumes that the selected table has a route that can forward the leaked packets. Any combination of static and dynamic routing can be used to achieve this. Before testing, I will clear all counters on the PBR route-maps and enable PBR debugs.

R2#clear route-map counters RM_FROM_XRV4
R2#clear route-map counters RM_FROM_CSR6
R2#clear route-map counters RM_VRF_SELECTION
R2#debug ip policy
Policy routing debugging is on

A quick verification shows that PBR is properly applied on the correct interfaces. We also configure some basic secondary addresses (loopbacks would work also) along with a static default route. This is only needed so CSR9 can encapsulate the IP packets to CSR2’s MAC address, allowing PBR to take over once the layer 2 encapsulation is removed. These are effectively our test source addresses.

R2#show ip policy
Interface      Route map
Gi2.524        RM_FROM_XRV4
Gi2.526        RM_FROM_CSR6
Gi2.529        RM_VRF_SELECTION

! CSR9
interface GigabitEthernet2.529
 encapsulation dot1Q 3529
 ip address 10.9.2.9 255.255.255.0 secondary
 ip address 10.2.9.9 255.255.255.0 secondary
 ip address 2.2.2.9 255.255.255.0
ip route 0.0.0.0 0.0.0.0 GigabitEthernet2.529 2.2.2.2

Unfortunately, the CSR1000v (CSR2) is not revealing any PBR debugs at all when we send some test traffic from CSR9 to CSR6. We source the traffic from one of CSR9’s secondary addresses that is mapped to the VIA_CSR6 VPN. We can check the route-map counters for additional confirmation. Notice that no traffic was matched for the first SELECTION match clause, which was for VIA_XRV4. Likewise, no traffic has entered from XRv4 yet. The successful ping and route-map counters are indicative of a successful configuration.

R9#ping 10.2.6.6 source 10.9.2.9
Type escape sequence to abort.
Sending 5, 100-byte ICMP Echos to 10.2.6.6, timeout is 2 seconds:
Packet sent with a source address of 10.9.2.9
!!!!!
Success rate is 100 percent (5/5), round-trip min/avg/max = 4/6/10 ms
R2#show route-map | include ^route-map|Policy
route-map RM_VRF_SELECTION, permit, sequence 10
  Policy routing matches: 0 packets, 0 bytes
route-map RM_VRF_SELECTION, permit, sequence 20
  Policy routing matches: 5 packets, 590 bytes
route-map RM_FROM_XRV4, permit, sequence 10
  Policy routing matches: 0 packets, 0 bytes
route-map RM_FROM_CSR6, permit, sequence 10
  Policy routing matches: 5 packets, 590 bytes

We retest from CSR9, this time sending traffic to XRv4. We will send more packets so the counters look different. Also note that there is nothing stopping us from using more advanced PBR matching, like TCP/UDP ports, DSCP, etc. Unlike auto-classify or VRF select, we gain more flexibility from this approach despite the poorly scaling configurations.

R9#ping 10.2.14.14 source 10.2.9.9 repeat 10
Type escape sequence to abort.
Sending 10, 100-byte ICMP Echos to 10.2.14.14, timeout is 2 seconds:
Packet sent with a source address of 10.2.9.9
!!!!!!!!!!
Success rate is 100 percent (10/10), round-trip min/avg/max = 2/12/19 ms
R2#show route-map | include ^route-map|Policy
route-map RM_VRF_SELECTION, permit, sequence 10
  Policy routing matches: 10 packets, 1180 bytes
route-map RM_VRF_SELECTION, permit, sequence 20
  Policy routing matches: 5 packets, 590 bytes
route-map RM_FROM_XRV4, permit, sequence 10
  Policy routing matches: 10 packets, 1180 bytes
route-map RM_FROM_CSR6, permit, sequence 10
  Policy routing matches: 5 packets, 590 bytes

38.4 VRF route leaking

Both XE and XR can leak routes between VRFs as well. BGP is typically required for this. On CSR6 and XRv4, we will leak routes from the global table to and from the different customer VRFs. Since we have been using CSR6 and XRv4 for CE routers, they have many VRFs with many diverse routes. Their links to CSR2 are in the global table and are currently used for the VRF traffic leaking with PBR. Both CSR6 and XRv4 have a static route (configured earlier) to 10.0.0.0/8 via CSR2 in their global tables. This is entirely disconnected from the MPLS L3VPN tests conducted earlier. There are two important side notes with route leaking. First, the VRF must have an RD. If it doesn’t, the parser gives you an error message. This is because the routes are treated like VPN routes and need to be stored separately in the BGP table.

R6(config-vrf-af)#export ipv4 unicast 10 map RM_VRF_TO_GLOBAL
%vrf ISIS does not have "rd" configured, please configure "rd" before configuring export route-map

Second, these commands only have visibility to routes in the BGP RIB. If the route isn’t in global BGP, it cannot be imported to a VRF. If the route isn’t in VPN BGP, it cannot be exported to the global table. For this reason, we will work with the pre-existing BGP VRF since it already has BGP-learned routes, which implies it has an RD configured. When configuring this feature, we can configure an upper-bound on the number of prefixes imported. Since we are moving routes between tables, they are being copied, so the boundary condition is a safety check. First, we will leak all VPNv4 BGP host routes within the BGP VPN into the global table. The RT requirement seems odd, since RTs serve no obvious purpose when leaking to the global table, but they are actually required for the feature to work. If you forget the RTs, iBGP routes appear to work, but locally-originated and eBGP learned routes do not. Overall, the behavior is simply inconsistent if you forget to add RTs.

! CSR6
ip prefix-list PL_HOST_ROUTES seq 5 permit 0.0.0.0/0 ge 32
route-map RM_VRF_TO_GLOBAL permit 10
 match ip address prefix-list PL_HOST_ROUTES
vrf definition BGP
 rd 214:104
 address-family ipv4
  export ipv4 unicast 10 map RM_VRF_TO_GLOBAL
  route-target export 1:1
  route-target import 1:1
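To make the "0.0.0.0/0 ge 32" match concrete, here is a small Python sketch of how a single prefix-list entry with optional ge/le bounds selects routes. It is a simplified model written for illustration; `prefix_list_permits` is a hypothetical helper, not a Cisco API, and it models only one permit entry.

```python
import ipaddress


def prefix_list_permits(route, base="0.0.0.0/0", ge=None, le=None):
    """Simplified model of one prefix-list permit entry.

    The route must fall inside `base`; with no ge/le the mask length must
    match exactly, otherwise it must lie within the ge..le range.
    """
    route_net = ipaddress.ip_network(route)
    base_net = ipaddress.ip_network(base)
    if not route_net.subnet_of(base_net):
        return False
    if ge is None and le is None:
        return route_net.prefixlen == base_net.prefixlen
    low = ge if ge is not None else base_net.prefixlen
    high = le if le is not None else route_net.max_prefixlen
    return low <= route_net.prefixlen <= high


# PL_HOST_ROUTES: permit 0.0.0.0/0 ge 32 -> only /32 host routes pass
print(prefix_list_permits("192.104.5.5/32", ge=32))
print(prefix_list_permits("10.0.0.0/8", ge=32))
```

Under this model the loopback /32 passes and the /8 aggregate does not, which is why only host routes are candidates for export through RM_VRF_TO_GLOBAL.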

After applying this configuration, we see several host-routes in the BGP global table, as well as the routing table, since these are the best available routes. The routing table annotates that the next-hops are in VRF BGP, not the global table. Notice that all BGP paths are exported, not just the best paths. Since CSR6’s local loopback prefix is directly connected, there is no next-hop, but the packet is delivered locally to the router.

R6#show bgp ipv4 unicast | begin Network
     Network          Next Hop            Metric LocPrf Weight Path
 * i 192.104.5.5/32   192.6.104.14                  100      0 214 104 ?
 *>                   192.4.104.4                            0 214 214 ?
 *>  192.104.6.6/32   0.0.0.0                  0         32768 ?
 * i 192.104.13.13/32 192.6.104.14                  100      0 214 104 ?
 *>                   192.4.104.4                            0 214 214 ?
 *>i 192.104.14.14/32 192.6.104.14             0    100      0 ?
R6#show ip route bgp | begin Gate
Gateway of last resort is not set
      192.104.5.0/32 is subnetted, 1 subnets
B        192.104.5.5 [20/0] via 192.4.104.4 (BGP), 00:01:56
      192.104.6.0/32 is subnetted, 1 subnets
B        192.104.6.6 is directly connected, 00:01:30, Loopback104
      192.104.13.0/32 is subnetted, 1 subnets
B        192.104.13.13 [20/0] via 192.4.104.4 (BGP), 00:01:56
      192.104.14.0/32 is subnetted, 1 subnets
B        192.104.14.14 [200/0] via 192.6.104.14 (BGP), 00:01:35

The details of the BGP route in the global table indicate that it was imported from VRF BGP, which includes the entire VPNv4 prefix, with the RD and VRF identified. All attributes of the route, including communities, are carried over. Communities like SoO and RT are not terribly useful in global BGP.

R6#show bgp ipv4 unicast 192.104.5.5/32
BGP routing table entry for 192.104.5.5/32, version 3
Paths: (2 available, best #2, table default)
  Not advertised to any peer
  Refresh Epoch 1
  214 104, imported path from 214:104:192.104.5.5/32 (BGP)
    192.6.104.14 from 192.6.104.14 (14.14.14.14)
      Origin incomplete, localpref 100, valid, internal
      Extended Community: SoO:214:4 RT:1:1
      rx pathid: 0, tx pathid: 0
  Refresh Epoch 1
  214 214, imported path from 214:104:192.104.5.5/32 (BGP)
    192.4.104.4 from 192.4.104.4 (214.0.0.4)
      Origin incomplete, localpref 100, valid, external, best
      Extended Community: RT:1:1
      rx pathid: 0, tx pathid: 0x0

We can also move routes in the opposite direction, taking them from the global table into the VRF table. We will use a match-all route-map and not set a prefix limit. This is the most dangerous approach but is shown here for variety; it effectively permits all global IPv4 unicast BGP routes to be leaked into the VRF. We must advertise the static 10.0.0.0/8 route into BGP and do so with a simple network statement. There is no need to specify RTs when importing to VRF from global.

! CSR6
route-map RM_MATCH_ALL permit 10
router bgp 104
 address-family ipv4
  network 10.0.0.0
vrf definition BGP
 address-family ipv4
  import ipv4 unicast map RM_MATCH_ALL

First, we verify that advertising 10.0.0.0/8 into global BGP was successful. The route is marked as “af-export(1)” to signify that one VRF is importing this route. Then, we check the VRF BGP table for the route and find it. Since the route was already imported to VRF BGP, it cannot be imported again, explaining the “no-import” flag. It is interesting to note that the VRF treats this route as external despite having an AS path length of 0. I assume this is done so that the route can be advertised to iBGP peers.

R6#show bgp ipv4 unicast 10.0.0.0
BGP routing table entry for 10.0.0.0/8, version 5
Paths: (1 available, best #1, table default)
  Not advertised to any peer
  Refresh Epoch 1
  Local
    10.2.6.2 from 0.0.0.0 (6.6.6.6)
      Origin IGP, metric 0, localpref 100, weight 32768, valid, sourced, local, af-export(1), best
      rx pathid: 0, tx pathid: 0x0
R6#show bgp vpnv4 unicast vrf BGP 10.0.0.0
BGP routing table entry for 214:104:10.0.0.0/8, version 18
Paths: (1 available, best #1, table BGP)
  Advertised to update-groups:
     3          4
  Refresh Epoch 1
  Local, imported path from 10.0.0.0/8 (global)
    10.2.6.2 (via default) from 0.0.0.0 (6.6.6.6)
      Origin IGP, metric 0, localpref 100, weight 32768, valid, external, noimport, no-import, best
      rx pathid: 0, tx pathid: 0x0

We can check the summary information of these import/export policies within the VRF BGP table. It shows the route-map name, AFI, and prefix counts. The default limit is 1000. Notice that the 4 prefixes (loopbacks) exported from VRF BGP to global only count as 4 and not 6; we learned 6 unique paths but this is a per-prefix counter, which does not account for alternate paths.

R6#show bgp vpnv4 unicast vrf BGP | include t_Map
Import Map: RM_MATCH_ALL, Address-Family: IPv4 Unicast, Pfx Count/Limit: 1/1000
Export Map: RM_VRF_TO_GLOBAL, Address-Family: IPv4 Unicast, Pfx Count/Limit: 4/10
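The difference between counting prefixes and counting paths can be shown with a short Python sketch. The path list is copied from the global BGP table displayed earlier for CSR6; the `prefix_count` helper and the `EXPORT_LIMIT` constant are hypothetical names used only for this illustration.

```python
# (prefix, next_hop) pairs as seen in CSR6's global BGP table after the export
exported_paths = [
    ("192.104.5.5/32", "192.6.104.14"),
    ("192.104.5.5/32", "192.4.104.4"),
    ("192.104.6.6/32", "0.0.0.0"),
    ("192.104.13.13/32", "192.6.104.14"),
    ("192.104.13.13/32", "192.4.104.4"),
    ("192.104.14.14/32", "192.6.104.14"),
]

EXPORT_LIMIT = 10  # the "10" in "export ipv4 unicast 10 map ..."


def prefix_count(paths):
    """Count unique prefixes; alternate paths to one prefix count once."""
    return len({prefix for prefix, _next_hop in paths})


print(len(exported_paths), "paths,", prefix_count(exported_paths), "prefixes")
print("within limit:", prefix_count(exported_paths) <= EXPORT_LIMIT)
```

Six paths collapse to four unique prefixes, matching the 4/10 shown for the export map.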

At this point, we should have reachability from CSR9 to those leaked BGP loopbacks. We already verified that CSR6’s global table had the remote loopbacks, but we also have to verify that the VRF table has the 10.0.0.0/8 summary. The next-hop is in the global (or “default”) table which allows CSR6 to move packets across tables.

R6#show ip route vrf BGP 10.0.0.0
Routing Table: BGP
Routing entry for 10.0.0.0/8
  Known via "bgp 104", distance 20, metric 0, type external
  Last update from 10.2.6.2 00:09:52 ago
  Routing Descriptor Blocks:
  * 10.2.6.2 (default), from 0.0.0.0, 00:09:52 ago
      Route metric is 0, traffic share count is 1
      AS Hops 0
      MPLS label: none

Verifying CSR5, we can see it now learns a route to 10.0.0.0/8 within the VPN. There is nothing special about this route as it looks just like any other MPLS L3VPN route from the PE. This aggregate route should allow CSR5 to reach CSR9.

R5#show bgp vpnv4 unicast vrf BGP 10.0.0.0
BGP routing table entry for 214:104:10.0.0.0/8, version 337
Paths: (1 available, best #1, table BGP)
  Not advertised to any peer
  Refresh Epoch 1
  214 104
    192.12.104.12 (via vrf BGP) from 192.12.104.12 (214.0.0.12)
      Origin IGP, localpref 100, valid, external, best
      rx pathid: 0, tx pathid: 0x0

We can test connectivity from CSR6 to CSR5, being sure to specify a valid source address on CSR6. The route lookup occurs in the global table, and CEF doesn’t make any special note about the next-hop of CSR4 (although we know 192.4.104.4 is within VRF BGP, not the global table).

R6#ping 192.104.5.5 source 10.2.6.6
Type escape sequence to abort.
Sending 5, 100-byte ICMP Echos to 192.104.5.5, timeout is 2 seconds:
Packet sent with a source address of 10.2.6.6
!!!!!
Success rate is 100 percent (5/5), round-trip min/avg/max = 9/9/10 ms
R6#show ip cef 192.104.5.5 detail
192.104.5.5/32, epoch 2, flags [rib only nolabel, rib defined all labels]
  recursive via 192.4.104.4
    attached to GigabitEthernet2.104

When we test connectivity to XRv4’s loopback, the ping fails. This is because the 10.0.0.0/8 route was imported as an eBGP route, which means it can be advertised to iBGP peers. The next-hop is unchanged by default. Normally, this would just mean that XRv4 would select the path via CSR3 instead. The key phrase below is “import dampened” which means there is some kind of routing problem with this BGP prefix. We expected 10.2.6.2 to be inaccessible, as explained earlier, but this new “dampened” word is different.

RP/0/0/CPU0:XRv4#show bgp vpnv4 unicast vrf BGP 10.0.0.0/8 | begin Paths
Paths: (2 available, best #1)
  Not advertised to any peer
  Path #1: Received by speaker 0
  Not advertised to any peer
  214 104
    192.3.104.3 from 192.3.104.3 (214.0.0.3)
      Origin IGP, localpref 100, valid, external, best, group-best, import-candidate, import dampened
      Received Path ID 0, Local Path ID 1, version 2786
  Path #2: Received by speaker 0
  Not advertised to any peer
  Local
    10.2.6.2 (inaccessible) from 192.6.104.6 (6.6.6.6)
      Origin IGP, metric 0, localpref 100, valid, internal, import dampened
      Received Path ID 0, Local Path ID 0, version 0

The issue is that the only route covering the next-hop 10.2.6.2 is 10.0.0.0/8 itself, which creates a recursive routing loop on XRv4. It generates occasional syslog messages to warn us about this. We would not have this issue if the next-hop was not covered by the aggregate in question.

! XRv4
ipv4_rib[1144]: %ROUTING-RIB-7-SERVER_ROUTING_DEPTH : Recursion loop looking up prefix 10.2.6.2 in Vrf: "BGP" Tbl: "default" Safi: "Unicast" added by bgp
ipv4_rib[1144]: %ROUTING-RIB-5-TABLE_NH_DAMPED : Nexthops in Vrf: BGP Tbl: default (0xe0000011) Safi: Unicast are getting damped
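The recursion problem is easy to reproduce in a toy model. The Python sketch below resolves a destination by repeatedly looking up each next-hop with a longest-prefix match; `resolve` and the sample tables are hypothetical, and the 192.6.104.0/24 connected network used in the working case is an assumed mask chosen only for illustration.

```python
import ipaddress


def resolve(rib, dst, max_depth=8):
    """Follow next hops until a connected route is found; raise on a loop.

    `rib` maps ip_network -> next-hop string, or None for a connected route.
    """
    seen = []
    addr = ipaddress.ip_address(dst)
    for _ in range(max_depth):
        matches = [net for net in rib if addr in net]
        if not matches:
            raise LookupError(f"no route to {addr}")
        best = max(matches, key=lambda net: net.prefixlen)
        next_hop = rib[best]
        if next_hop is None:
            return best  # connected route: resolution is complete
        if (best, next_hop) in seen:
            raise RecursionError(f"recursion loop looking up {next_hop}")
        seen.append((best, next_hop))
        addr = ipaddress.ip_address(next_hop)
    raise RecursionError("maximum recursion depth exceeded")


# The next-hop 10.2.6.2 is only reachable through the route it belongs to.
looping_vrf = {ipaddress.ip_network("10.0.0.0/8"): "10.2.6.2"}

# A next-hop outside the aggregate resolves through a connected network
# (the /24 mask is an assumption for this sketch).
working_vrf = {
    ipaddress.ip_network("10.0.0.0/8"): "192.6.104.6",
    ipaddress.ip_network("192.6.104.0/24"): None,
}

print(resolve(working_vrf, "10.9.2.9"))
try:
    resolve(looping_vrf, "10.9.2.9")
except RecursionError as err:
    print(err)
```

The looping table fails because resolving 10.2.6.2 lands on the same 10.0.0.0/8 entry again, which is the condition the SERVER_ROUTING_DEPTH message describes.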

To fix it, we apply next-hop-self to CSR6’s neighbor configuration towards XRv4. XRv4 then learns this route via the backdoor, as expected, and reachability is achieved. Normally, we would never need next-hop-self in this situation, but since the route was viewed as eBGP, the next-hop was not changed when it was advertised to an iBGP neighbor.

! CSR6
router bgp 104
 address-family ipv4 vrf BGP
  neighbor 192.6.104.14 next-hop-self

RP/0/0/CPU0:XRv4#show bgp vpnv4 unicast vrf BGP 10.0.0.0/8 | begin Paths
Paths: (1 available, best #1)
  Not advertised to any peer
  Path #1: Received by speaker 0
  Not advertised to any peer
  Local
    192.6.104.6 from 192.6.104.6 (6.6.6.6)
      Origin IGP, metric 0, localpref 100, valid, internal, best, group-best, import-candidate, import suspect
      Received Path ID 0, Local Path ID 1, version 3052
R6#ping 192.104.14.14 source 10.2.6.6
Type escape sequence to abort.
Sending 5, 100-byte ICMP Echos to 192.104.14.14, timeout is 2 seconds:
Packet sent with a source address of 10.2.6.6
!!!!!
Success rate is 100 percent (5/5), round-trip min/avg/max = 3/5/14 ms

Tying this in with the existing VRF traffic leaking, CSR9 now has reachability to all 4 of the VRF BGP loopbacks. The summary results are shown below; provided CSR2 can somehow get the traffic back and forth between CSR9 and CSR6 (which it can), there should be no problems. CSR6 will perform route lookups in the global FIB on traffic coming inbound from CSR2, but the next-hops will be inside the BGP VRF (either towards CSR4 or XRv4). Returning traffic will always go to 10.9.2.9, which is in the global table. Again, the FIB does not make any special annotation of the VRF since it just needs to switch packets between interfaces and add layer 2 encapsulation as needed.

R6#show ip cef vrf BGP 10.9.2.9 detail
10.0.0.0/8, epoch 0, flags [rib only nolabel, rib defined all labels]
  recursive via 10.2.6.2
    attached to GigabitEthernet2.526
R9#ping 192.104.5.5 source 10.9.2.9
[snip]
Success rate is 100 percent (5/5), round-trip min/avg/max = 9/9/11 ms
R9#ping 192.104.6.6 source 10.9.2.9
[snip]
Success rate is 100 percent (5/5), round-trip min/avg/max = 4/5/7 ms
R9#ping 192.104.13.13 source 10.9.2.9
[snip]
Success rate is 100 percent (5/5), round-trip min/avg/max = 11/30/85 ms
R9#ping 192.104.14.14 source 10.9.2.9
[snip]
Success rate is 100 percent (5/5), round-trip min/avg/max = 3/6/15 ms

We will examine the same feature on XRv4 now. Fortunately, XR does not require the RT configuration to export routes from a VRF to a global table. Using the same logic as before (except with a parameterized RPL), we can match all host routes in the BGP VRF. We also need to specify a BGP RID since there are no available IPv4 unicast global loopbacks; without a RID, BGP cannot select a best-path as it cannot properly initialize. We also must initialize the global IPv4 unicast AFI on XRv4, which is something we haven’t done yet. The “else drop” stanza is technically unnecessary since the RPL will drop all routes not explicitly passed, but I add it for completeness.

! XRv4
vrf BGP
 address-family ipv4 unicast
  export to default-vrf route-policy RPL_LEAK(PS_VRF_TO_GLOBAL)
prefix-set PS_VRF_TO_GLOBAL
  0.0.0.0/0 ge 32
end-set
route-policy RPL_LEAK($PS)
  if destination in $PS then
    pass
  else
    drop
  endif
end-policy
router bgp 104
 bgp router-id 14.14.14.14
 address-family ipv4 unicast

Looking at the BGP details, we can see the 4 loopbacks inside the global table now. The BGP details of a specific prefix don’t specify in what VRF the next-hop lies, but it does list the source VRF, which implies the next-hop is accessible inside that VRF. The route is marked as “imported” much like a normal VPNv4 route would when using RTs. The IPv4 unicast RIB specifies that the next-hop is in VRF BGP, which is correct. The outputs in XR are similar to XE and nicely detail the route leaking operation.

RP/0/0/CPU0:XRv4#show bgp ipv4 unicast | begin Network
   Network            Next Hop            Metric LocPrf Weight Path
*> 192.104.5.5/32     192.3.104.3                            0 214 104 ?
*>i192.104.6.6/32     192.6.104.6              0    100      0 ?
*> 192.104.13.13/32   192.3.104.3                            0 214 104 ?
*> 192.104.14.14/32   0.0.0.0                  0         32768 ?
RP/0/0/CPU0:XRv4#show bgp ipv4 unicast 192.104.6.6/32 | begin Local
  Local
    192.6.104.6 from 192.6.104.6 (6.6.6.6)
      Origin incomplete, metric 0, localpref 100, valid, internal, best, group-best, import-candidate, imported
      Received Path ID 0, Local Path ID 1, version 3
      Source VRF: BGP, Source Route Distinguisher: 214:104
RP/0/0/CPU0:XRv4#show route 192.104.6.6/32
Routing entry for 192.104.6.6/32
  Known via "bgp 104", distance 200, metric 0, type internal
  Routing Descriptor Blocks
    192.6.104.6, from 192.6.104.6
      Nexthop in Vrf: "BGP", Table: "default", IPv4 Unicast, Table Id: 0xe0000011
      Route metric is 0
  No advertising protos.

We can also see a list of imported routes within the IPv4 unicast AFI on a per-VRF or per-neighbor basis. We can specify the AFI and neighbor concurrently when verifying multi-homed VRF deployments.

RP/0/0/CPU0:XRv4#show bgp ipv4 unicast imported-routes vrf BGP | begin Network
   Network            Neighbor         Route Distinguisher   Source VRF
*> 192.104.5.5/32     192.3.104.3      214:104               BGP
*>i192.104.6.6/32     192.6.104.6      214:104               BGP
*> 192.104.13.13/32   192.3.104.3      214:104               BGP
*> 192.104.14.14/32   0.0.0.0          214:104               BGP
RP/0/0/CPU0:XRv4#show bgp ipv4 unicast imported-routes neighbor 192.3.104.3 | begin Network
   Network            Neighbor         Route Distinguisher   Source VRF
*> 192.104.5.5/32     192.3.104.3      214:104               BGP
*> 192.104.13.13/32   192.3.104.3      214:104               BGP

Next, we will configure XRv4 to import its static route of 10.0.0.0/8 from the global table to the BGP VRF table. The configuration is very similar to XE; there are no RTs required. We will use redistribution to advertise the static route into BGP for variety.

! XRv4
vrf BGP
 address-family ipv4 unicast
  import from default-vrf route-policy RPL_LEAK(PS_GLOBAL_TO_VRF)
prefix-set PS_GLOBAL_TO_VRF
  10.0.0.0/8
end-set
router bgp 104
 address-family ipv4 unicast
  redistribute static

After configuring this, we can verify that the static route was properly redistributed into BGP IPv4 unicast. After that, it becomes a candidate for import into VRF BGP. We verify that the import operation was successful, and also note that it is treated as an eBGP route. XR specifically marks this as an “extranet” route since it was imported from another routing table; specifically, it lists the source VRF as “default”, which means global. XRv4 also learns the 10.0.0.0/8 prefix via CSR3 (PE) and CSR6 (backdoor), but they are not best paths due to the weight attribute.

RP/0/0/CPU0:XRv4#show bgp ipv4 unicast 10.0.0.0/8 brief | begin Network
   Network            Next Hop            Metric LocPrf Weight Path
*> 10.0.0.0/8         10.2.14.2                0         32768 ?
RP/0/0/CPU0:XRv4#show bgp vpnv4 unicast vrf BGP 10.0.0.0/8 brief | begin Network
   Network            Next Hop            Metric LocPrf Weight Path
Route Distinguisher: 214:104 (default for vrf BGP)
*> 10.0.0.0/8         10.2.14.2                0         32768 ?
*                     192.3.104.3                            0 214 104 i
* i                   192.6.104.6              0    100      0 i
RP/0/0/CPU0:XRv4#show bgp vpnv4 unicast vrf BGP 10.0.0.0/8 | begin Local
  Local
    10.2.14.2 from 0.0.0.0 (14.14.14.14)
      Origin incomplete, metric 0, localpref 100, weight 32768, valid, extranet, best, group-best, import-candidate, imported
      Received Path ID 0, Local Path ID 1, version 3054
      Source VRF: default, Source Route Distinguisher: 0:0:0
  Path #2: Received by speaker 0
  Not advertised to any peer
  214 104
    192.3.104.3 from 192.3.104.3 (214.0.0.3)
      Origin IGP, localpref 100, valid, external, group-best
      Received Path ID 0, Local Path ID 0, version 0
  Path #3: Received by speaker 0
  Not advertised to any peer
  Local
    192.6.104.6 from 192.6.104.6 (6.6.6.6)
      Origin IGP, metric 0, localpref 100, valid, internal
      Received Path ID 0, Local Path ID 0, version 0

The only significant result of using redistribution instead of the network statement is that CSR3 still prefers the iBGP route to 10.0.0.0/8 from CSR4 over the eBGP route from XRv4. This is because origin IGP is preferred over incomplete. This isn’t a big deal, but is interesting to note as it changes the traffic pattern.

R3#show bgp vpnv4 unicast rd 214:104 10.0.0.0/8
BGP routing table entry for 214:104:10.0.0.0/8, version 1800
Paths: (2 available, best #2, table BGP)
  Advertised to update-groups:
     13         14
  Refresh Epoch 1
  104
    192.3.104.14 (via vrf BGP) from 192.3.104.14 (14.14.14.14)
      Origin incomplete, localpref 100, valid, external
      Extended Community: SoO:214:3 RT:214:104
      rx pathid: 0, tx pathid: 0
  Refresh Epoch 2
  104, (Received from a RR-client)
    214.0.0.4 (metric 10) (via default) from 214.0.0.4 (214.0.0.4)
      Origin IGP, metric 0, localpref 100, valid, internal, best
      Extended Community: SoO:214:4 RT:214:104
      mpls labels in/out nolabel/4048
      rx pathid: 0, tx pathid: 0x0
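CSR3's choice follows from the order of the BGP best-path steps: origin is compared before the eBGP-over-iBGP preference. The Python sketch below models only a subset of the tie-breakers (weight, local preference, AS path length, origin, MED, eBGP over iBGP); the `Path` class and `best_path` function are hypothetical names, and the real algorithm has further steps and conditions (for example, MED is normally compared only between paths from the same neighboring AS).

```python
from dataclasses import dataclass

ORIGIN_RANK = {"igp": 0, "egp": 1, "incomplete": 2}


@dataclass
class Path:
    neighbor: str
    as_path: tuple
    origin: str
    local_pref: int = 100
    weight: int = 0
    med: int = 0
    ebgp: bool = False


def sort_key(path):
    """Lower tuple wins; only a subset of the real tie-breakers is modeled."""
    return (
        -path.weight,              # highest weight
        -path.local_pref,          # highest local preference
        len(path.as_path),         # shortest AS path
        ORIGIN_RANK[path.origin],  # IGP < EGP < incomplete
        path.med,                  # lowest MED
        0 if path.ebgp else 1,     # eBGP over iBGP
    )


def best_path(paths):
    return min(paths, key=sort_key)


# The two paths CSR3 holds for 214:104:10.0.0.0/8
from_xrv4 = Path("192.3.104.14", (104,), "incomplete", ebgp=True)
from_csr4 = Path("214.0.0.4", (104,), "igp")

print(best_path([from_xrv4, from_csr4]).neighbor)
```

Because both paths tie on weight, local preference, and AS path length, the origin code decides the outcome before the eBGP preference is ever consulted.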

XR clearly annotates the next-hop VRF in the FIB. When performing a global lookup on XRv4 for CSR5’s loopback, the VRF associated with the next-hop is displayed. This is probably just a cosmetic difference since XE also “knows” the VRF of the next-hop. The same is true for the VRF-aware FIB lookup towards 10.0.0.0/8, where the VRF containing the next-hop is “default”.

RP/0/0/CPU0:XRv4#show cef ipv4 192.104.5.5/32
192.104.5.5/32, version 24, internal 0x5000001 0x0 (ptr 0xa14156f4) [1], 0x0 (0x0), 0x0 (0x0)
 local adjacency 192.3.104.3
 Prefix Len 32, traffic index 0, precedence n/a, priority 4
   via 192.3.104.3, 3 dependencies, recursive, bgp-ext [flags 0x6020]
    path-idx 0 NHID 0x0 [0xa14153f4 0x0]
    next hop VRF - 'BGP', table - 0xe0000011
    next hop 192.3.104.3 via 192.3.104.3/32
RP/0/0/CPU0:XRv4#show cef vrf BGP ipv4 10.0.0.0/8
10.0.0.0/8, version 4655, internal 0x5000001 0x0 (ptr 0xa1415674) [1], 0x0 (0x0), 0x0 (0x0)
 local adjacency 10.2.14.2
 Prefix Len 8, traffic index 0, precedence n/a, priority 3
   via 10.2.14.2, 2 dependencies, recursive [flags 0x6000]
    path-idx 0 NHID 0x0 [0xa1413674 0x0]
    next hop VRF - 'default', table - 0xe0000000
    next hop 10.2.14.2 via 10.2.14.2/32

Before testing connectivity from CSR9, we need to make one additional modification to the policies on CSR2. Currently, when packets arrive from CSR9 with a source inside 10.2.9.0/24, the router doesn’t actually forward the traffic, but just sets a VRF for the route lookup. Sending traffic from CSR9’s address in that range to any loopback will fail since the VIA_XRV4 table does not have a route for those prefixes.

R2#show route-map RM_VRF_SELECTION
route-map RM_VRF_SELECTION, permit, sequence 10
  Match clauses:
    ip address (access-lists): ACL_VIA_XRV4
  Set clauses:
    vrf VIA_XRV4
  Policy routing matches: 98 packets, 11564 bytes
route-map RM_VRF_SELECTION, permit, sequence 20
  Match clauses:
    ip address (access-lists): ACL_VIA_CSR6
  Set clauses:
    ip vrf VIA_CSR6 next-hop 10.2.6.6
  Policy routing matches: 99 packets, 11394 bytes
R2#show ip cef vrf VIA_XRV4 192.104.14.14
0.0.0.0/0
  no route

To solve it, a VRF-aware static default route is added to CSR2. This means that after the policy-based lookup occurs, the VRF-aware FIB can forward traffic to XRv4. We could also run BGP or IGP with CSR2 to exchange routing information, but the static route is quick and easy.

! CSR2
ip route vrf VIA_XRV4 0.0.0.0 0.0.0.0 GigabitEthernet2.524 10.2.14.14

R2#show ip cef vrf VIA_XRV4 192.104.14.14
0.0.0.0/0
  nexthop 10.2.14.14 GigabitEthernet2.524

CSR9 can now reach XRv4’s loopback using 10.2.9.9 as a source, but no other loopbacks are reachable. This is due to the asymmetric routing that occurs. In this example, CSR9 cannot reach CSR6’s loopback.

R9#ping 192.104.14.14 source 10.2.9.9
Type escape sequence to abort.
Sending 5, 100-byte ICMP Echos to 192.104.14.14, timeout is 2 seconds:
Packet sent with a source address of 10.2.9.9
!!!!!
Success rate is 100 percent (5/5), round-trip min/avg/max = 2/5/19 ms
R9#ping 192.104.6.6 source 10.2.9.9
Type escape sequence to abort.
Sending 5, 100-byte ICMP Echos to 192.104.6.6, timeout is 2 seconds:
Packet sent with a source address of 10.2.9.9
.....
Success rate is 0 percent (0/5)

Considering that the traffic is probably reaching CSR6’s loopback (the BGP routing is correct), the issue is with the return flow. We quickly verify that traffic reaching XRv4 for CSR6’s loopback is routed to CSR6, as expected.

RP/0/0/CPU0:XRv4#show cef ipv4 192.104.6.6/32
192.104.6.6/32, version 26, internal 0x5000001 0x0 (ptr 0xa1416df4) [1], 0x0 (0x0), 0x0 (0x0)
 local adjacency 192.6.104.6
 Prefix Len 32, traffic index 0, precedence n/a, priority 4
   via 192.6.104.6, 3 dependencies, recursive [flags 0x6000]
    path-idx 0 NHID 0x0 [0xa1416b74 0x0]
    next hop VRF - 'BGP', table - 0xe0000011
    next hop 192.6.104.6 via 192.6.104.6/32

When CSR6 sends the reply packet back to 10.2.9.2, it follows its local static route towards CSR2.

R6#show ip cef vrf BGP 10.2.9.2
10.0.0.0/8
  nexthop 10.2.6.2 GigabitEthernet2.526

The issue is that CSR2 does not have an inbound PBR policy to account for traffic to this destination. When the PBR lookup fails, CSR2 attempts to perform a normal VRF-aware FIB lookup, yet has no routing entry for it either.

R2#show route-map RM_FROM_CSR6
route-map RM_FROM_CSR6, permit, sequence 10
  Match clauses:
    ip address (access-lists): ACL_RETURN_CSR6
  Set clauses:
    global
  Policy routing matches: 93 packets, 10842 bytes
R2#show access-list ACL_RETURN_CSR6
Extended IP access list ACL_RETURN_CSR6
    10 permit ip any 10.9.2.0 0.0.0.255
R2#show ip cef vrf VIA_CSR6 10.2.9.2
0.0.0.0/0
  no route

We can fix this problem by adding another entry to the ACL to account for 10.2.9.0/24 destinations, which essentially permits asymmetric routing to work. More interestingly, we can add a static VRF-aware route with a global next-hop for this destination. When the PBR fails to match the incoming traffic to perform a lookup in the global table, CSR2 performs a lookup in the VRF VIA_CSR6 routing table. The static route is used, which essentially directs traffic for this destination towards CSR9’s address in the global table.

! CSR2
ip route vrf VIA_CSR6 10.2.9.0 255.255.255.0 GigabitEthernet2.529 2.2.2.9 global

R2#show ip cef vrf VIA_CSR6 10.2.9.2
10.2.9.0/24
  nexthop 2.2.2.9 GigabitEthernet2.529
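The order of operations on CSR2's CSR6-facing interface (PBR first, then the VRF table as a fallback) can be sketched as follows. This is an illustrative Python model only; `forward_from_csr6` and the table structures are hypothetical names, and the VRF table contains just the static route discussed above.

```python
import ipaddress

# Destinations matched by ACL_RETURN_CSR6 in route-map RM_FROM_CSR6
PBR_ACL = [ipaddress.ip_network("10.9.2.0/24")]

# VRF VIA_CSR6 before and after adding the static route with a global next-hop
VRF_BEFORE = {}
VRF_AFTER = {ipaddress.ip_network("10.2.9.0/24"): ("2.2.2.9", "global")}


def forward_from_csr6(dst, vrf_table):
    """Describe how CSR2 handles a packet arriving on the VIA_CSR6 interface."""
    addr = ipaddress.ip_address(dst)
    if any(addr in net for net in PBR_ACL):
        return "pbr: lookup in global table"
    # PBR did not match: fall back to the normal VRF-aware lookup
    matches = [net for net in vrf_table if addr in net]
    if not matches:
        return "drop: no route in VRF VIA_CSR6"
    next_hop, table = vrf_table[max(matches, key=lambda net: net.prefixlen)]
    return f"vrf route: next hop {next_hop} ({table})"


print(forward_from_csr6("10.2.9.9", VRF_BEFORE))
print(forward_from_csr6("10.2.9.9", VRF_AFTER))
print(forward_from_csr6("10.9.2.9", VRF_AFTER))
```

The static route only matters for traffic the route-map does not match, so the existing PBR behavior for 10.9.2.0/24 is unchanged.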

If the asymmetry were happening in the other direction, we would have the same problem on CSR2’s interface to XRv4. We can test this by sending traffic to XRv4’s loopback using a source of 10.9.2.9, which is directed to CSR6. To verify the asymmetry, CSR6 performs a lookup in its global FIB and routes the packet to XRv4 over the backdoor. XRv4 performs a lookup in its VRF-aware FIB, processes the local packet, and routes the reply directly to CSR2.

R6#show ip cef 192.104.14.14
192.104.14.14/32
  nexthop 192.6.104.14 GigabitEthernet2.204
RP/0/0/CPU0:XRv4#show cef vrf BGP ipv4 192.104.14.14
192.104.14.14/32, version 17, attached, receive
 Prefix Len 32
 internal 0x3006041 (ptr 0xa1416274) [3], 0x0 (0xa13dfe84), 0x0 (0x0)
RP/0/0/CPU0:XRv4#show cef vrf BGP ipv4 10.9.2.9
10.0.0.0/8, version 4655, internal 0x5000001 0x0 (ptr 0xa1415674) [1], 0x0 (0x0), 0x0 (0x0)
 local adjacency 10.2.14.2
 Prefix Len 8, traffic index 0, precedence n/a, priority 3
   via 10.2.14.2, 2 dependencies, recursive [flags 0x6000]
    path-idx 0 NHID 0x0 [0xa1413674 0x0]
    next hop VRF - 'default', table - 0xe0000000
    next hop 10.2.14.2 via 10.2.14.2/32

CSR2’s inbound policy-routing matches an ACL that only accounts for traffic destined to 10.2.9.0/24. For any other destination, the VRF default route we added earlier would send the packet straight back to XRv4, causing it to loop until TTL expires, which is undesirable.

R2#show route-map RM_FROM_XRV4
route-map RM_FROM_XRV4, permit, sequence 10
  Match clauses:
    ip address (access-lists): ACL_RETURN_XRV4
  Set clauses:
    ip global next-hop 2.2.2.9
  Policy routing matches: 92 packets, 10840 bytes
R2#show access-lists ACL_RETURN_XRV4
Extended IP access list ACL_RETURN_XRV4
    10 permit ip any 10.2.9.0 0.0.0.255
R2#show ip cef vrf VIA_XRV4 10.9.2.0
0.0.0.0/0
  nexthop 10.2.14.14 GigabitEthernet2.524


Because this route-map clause manually sets the correct next-hop, we don’t have to worry about changing the VIA_XRV4 routing table. Instead, we can simply invoke the pre-existing ACL_RETURN_CSR6 access-list in the same route-map entry. This will use Boolean OR logic, so that traffic destinations matching either list can be forwarded to 2.2.2.9. We are effectively extending the match logic in the route-map to account for both 10.2.9.0/24 and 10.9.2.0/24.

! CSR2
route-map RM_FROM_XRV4 permit 10
 match ip address ACL_RETURN_XRV4 ACL_RETURN_CSR6

R2#show route-map RM_FROM_XRV4
route-map RM_FROM_XRV4, permit, sequence 10
  Match clauses:
    ip address (access-lists): ACL_RETURN_XRV4 ACL_RETURN_CSR6
  Set clauses:
    ip global next-hop 2.2.2.9
  Policy routing matches: 95 packets, 11194 bytes
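The Boolean OR behavior of several ACLs on one match line can be expressed compactly. The Python sketch below is a hypothetical model (`clause_matches` is not a Cisco construct) that treats each ACL as a single destination network, as in this lab.

```python
import ipaddress

ACLS = {
    "ACL_RETURN_XRV4": ipaddress.ip_network("10.2.9.0/24"),
    "ACL_RETURN_CSR6": ipaddress.ip_network("10.9.2.0/24"),
}


def clause_matches(dst, acl_names):
    """ACLs listed on one match line are ORed: a single hit is enough."""
    addr = ipaddress.ip_address(dst)
    return any(addr in ACLS[name] for name in acl_names)


before = ["ACL_RETURN_XRV4"]
after = ["ACL_RETURN_XRV4", "ACL_RETURN_CSR6"]

print(clause_matches("10.9.2.9", before))
print(clause_matches("10.9.2.9", after))
```

With only the original ACL the 10.9.2.9 destination is missed; listing both ACLs lets the same route-map entry, and therefore the same hard-coded next-hop, cover both return prefixes.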

Next, we verify that CSR9 has reachability to all remote loopbacks from all valid sources. We can use a nested set of foreach loops to accomplish this. After enumerating all the possibilities for the destination, we nest another loop to iterate over the sources. The result is X * Y (8 total) verifications. The output is not shown, but it was successful for all ping targets.

! CSR9
tclsh
foreach x { 192.104.5.5 192.104.6.6 192.104.13.13 192.104.14.14 } {
  foreach y { 10.2.9.9 10.9.2.9 } {
    ping $x source $y repeat 3 timeout 1
  }
}
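For readers who prefer to generate the test matrix off-box, the same destination/source enumeration can be sketched in Python. The `ping_commands` helper is a hypothetical name; it only builds the command strings and does not send any traffic.

```python
from itertools import product

destinations = ["192.104.5.5", "192.104.6.6", "192.104.13.13", "192.104.14.14"]
sources = ["10.2.9.9", "10.9.2.9"]


def ping_commands(dsts, srcs):
    """Build one IOS ping command per (destination, source) pair."""
    return [
        f"ping {dst} source {src} repeat 3 timeout 1"
        for dst, src in product(dsts, srcs)
    ]


commands = ping_commands(destinations, sources)
print(len(commands))
for command in commands:
    print(command)
```

Four destinations and two sources yield the eight verifications mentioned above, in the same order as the nested Tcl loops.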

Regarding IPv6, XE only supports route-export from VRF to global. Importing from global to VRF does not appear supported in this code release. Supporting only one direction makes the feature awkward and limited.

R6(config-vrf-af)#export ?
  ipv6  Address family based VRF export
  map   Route-map based VRF export
R6(config-vrf-af)#import ?
  map   Route-map based VRF import


We will test it briefly just to see the routes populated in the IPv6 global table, but nothing more. We will create an IPv6 prefix-list just to match the host routes inside of VRF BGP. We also need to include dummy RTs for this to work properly.

! CSR6
ipv6 prefix-list PL_HOST_ROUTES_V6 seq 5 permit ::/0 ge 128
route-map RM_VRF_TO_GLOBAL_V6 permit 10
 match ipv6 address prefix-list PL_HOST_ROUTES_V6
vrf definition BGP
 address-family ipv6
  export ipv6 unicast map RM_VRF_TO_GLOBAL_V6
  route-target export 1:1
  route-target import 1:1

As expected, we see all 4 loopbacks within the global table, including the non-best-paths seen by BGP. All of the BGP candidate routes are imported, which is consistent with the IPv4 behavior. The VPNv6 BGP table for that VRF shows that 4 prefixes were exported out of a maximum of 1000. This is the same default as IPv4 and serves as a safety mechanism.

R6#show bgp ipv6 unicast | begin Network
     Network                Next Hop            Metric LocPrf Weight Path
 * i ::192:104:5:5/128      FD00:192:6:104::14            100      0 214 104 ?
 *>                         FD00:192:4:104::4                      0 214 214 ?
 *>  ::192:104:6:6/128      ::                       0         32768 ?
 *                          FD00:192:4:104::4                      0 214 214 ?
 * i ::192:104:13:13/128    FD00:192:6:104::14            100      0 214 104 ?
 *>                         FD00:192:4:104::4                      0 214 214 ?
 * i ::192:104:14:14/128    FD00:192:6:104::14       0    100      0 ?
 *>                         FD00:192:4:104::4                      0 214 214 ?
R6#show bgp vpnv6 unicast vrf BGP | include t_Map
Export Map: RM_VRF_TO_GLOBAL_V6, Address-Family: IPv6 Unicast, Pfx Count/Limit: 4/1000


The global routing table now shows these routes as BGP routes. For the connected loopback, we can see the % symbol is used to indicate that the interface is in a different VRF.
R6#show ipv6 route bgp | begin ^B
B   ::192:104:5:5/128 [20/0]
     via FE80::4, GigabitEthernet2.104
B   ::192:104:6:6/128 [20/0]
     via Loopback104%BGP, directly connected
B   ::192:104:13:13/128 [20/0]
     via FE80::4, GigabitEthernet2.104
B   ::192:104:14:14/128 [20/0]
     via FE80::4, GigabitEthernet2.104

An interesting note is that the VRF table selects XRv4 as the best path to XRv4’s loopback, which makes sense given the shorter AS path length. The global table instead selects CSR4 as the next-hop for this route (also seen above) because the next-hop toward XRv4 is marked inaccessible there.
R6#show bgp vpnv6 unicast vrf BGP ::192:104:14:14/128
BGP routing table entry for [214:104]::192:104:14:14/128, version 64
Paths: (2 available, best #1, table BGP)
  Advertised to update-groups:
     1
  Refresh Epoch 1
  Local
    FD00:192:6:104::14 (via vrf BGP) from FD00:192:6:104::14 (14.14.14.14)
      Origin incomplete, metric 0, localpref 100, valid, internal, best
      Extended Community: SoO:214:4 RT:1:1
      rx pathid: 0, tx pathid: 0x0
  Refresh Epoch 2
  214 214
    FD00:192:4:104::4 (FE80::4) (via vrf BGP) from FD00:192:4:104::4 (214.0.0.4)
      Origin incomplete, localpref 100, valid, external
      Extended Community: SoO:214:3 RT:1:1
      rx pathid: 0, tx pathid: 0
R6#show bgp ipv6 unicast ::192:104:14:14/128
BGP routing table entry for ::192:104:14:14/128, version 55
Paths: (2 available, best #2, table default)
  Not advertised to any peer
  Refresh Epoch 1
  Local, imported path from [214:104]::192:104:14:14/128 (BGP)
    FD00:192:6:104::14 (inaccessible) from FD00:192:6:104::14 (14.14.14.14)
      Origin incomplete, metric 0, localpref 100, valid, internal
      Extended Community: SoO:214:4 RT:1:1
      rx pathid: 0, tx pathid: 0
  Refresh Epoch 2
  214 214, imported path from [214:104]::192:104:14:14/128 (BGP)
    FD00:192:4:104::4 (FE80::4) from FD00:192:4:104::4 (214.0.0.4)
      Origin incomplete, localpref 100, valid, external, best
      Extended Community: SoO:214:3 RT:1:1
      rx pathid: 0, tx pathid: 0x0

This doesn’t make sense since both next-hops sit on networks connected to CSR6 inside the VRF; either both should be inaccessible or both should be accessible. Neither connected network was exported into the global table. The only difference I can see is that the valid route shows the next-hop’s link-local (LL) address in parentheses.
R6#show ipv6 route FD00:192:4:104::
% Route not found
R6#show ipv6 route FD00:192:6:104::
% Route not found

Adjusting the next-hop inbound to include LL addresses does not solve the problem. The LL address of FE80::14 is still absent from the iBGP route.
! CSR6
route-map RM_IPV6_NHOP_ADJUST permit 10
 set ipv6 next-hop FD00:192:6:104::14 FE80::14
router bgp 104
 address-family ipv6 vrf BGP
  neighbor FD00:192:6:104::14 route-map RM_IPV6_NHOP_ADJUST in
R6#show bgp vpnv6 unicast vrf BGP ::192:104:14:14/128 | include via vrf
    FD00:192:6:104::14 (via vrf BGP) from FD00:192:6:104::14 (14.14.14.14)
    FD00:192:4:104::4 (FE80::4) (via vrf BGP) from FD00:192:4:104::4 (214.0.0.4)

A last-ditch effort involves leaking that connected /64 route from the VRF into the global table to satisfy the route recursion. This does solve the problem, although it is something of a hack.
! CSR6
ipv6 prefix-list PL_HOST_ROUTES_V6 seq 10 permit FD00:192:6:104::/64
R6#show ipv6 route bgp | begin ^B
B   ::192:104:5:5/128 [20/0]
     via FE80::4, GigabitEthernet2.104
B   ::192:104:6:6/128 [20/0]
     via Loopback104%BGP, directly connected
B   ::192:104:13:13/128 [20/0]
     via FE80::4, GigabitEthernet2.104
B   ::192:104:14:14/128 [200/0]
     via FD00:192:6:104::14%BGP, GigabitEthernet2.104%BGP
B   FD00:192:6:104::/64 [20/0]
     via GigabitEthernet2.204%BGP, directly connected

The feature appears more stable on XR. We will leak the IPv6 VPN loopbacks into the global table, reusing the RPL from earlier by feeding it a new prefix set. We also have to initialize the IPv6 unicast AFI under BGP.
! XRv4
vrf BGP
 address-family ipv6 unicast
  export to default-vrf route-policy RPL_LEAK(PS_VRF_TO_GLOBAL_V6)
prefix-set PS_VRF_TO_GLOBAL_V6
  ::/0 ge 128
end-set
router bgp 104
 address-family ipv6 unicast

Once the IPv6 unicast AFI has finished initializing and importing the routes, we can see them in the global table. All of the next-hops are reachable, so we don’t have the same problem on XR as we did on XE. We can see all four of the routes have the correct next-hops, all have valid best-paths, and all were imported from the source VRF BGP.
RP/0/0/CPU0:XRv4#show bgp ipv6 unicast imported-routes vrf BGP | begin Network
   Network              Neighbor           Route Distinguisher  Source VRF
*> ::192:104:5:5/128    fd00:192:3:104::3  214:104              BGP
*>i::192:104:6:6/128    fd00:192:6:104::6  214:104              BGP
*> ::192:104:13:13/128  fd00:192:3:104::3  214:104              BGP
*> ::192:104:14:14/128  ::                 214:104              BGP

The routing table displays the expected information as well. Remote routes, like CSR5’s loopback, are reachable via the PE (CSR3). The “local site” route of CSR6’s loopback is reachable via the backdoor link.
RP/0/0/CPU0:XRv4#show route ipv6 unicast bgp
B    ::192:104:5:5/128
      [20/0] via fe80::3 (nexthop in vrf BGP), 00:00:44, GigabitEthernet0/0/0/0.104
B    ::192:104:6:6/128
      [200/0] via fd00:192:6:104::6 (nexthop in vrf BGP), 00:00:44
B    ::192:104:13:13/128
      [20/0] via fe80::3 (nexthop in vrf BGP), 00:00:44, GigabitEthernet0/0/0/0.104
B    ::192:104:14:14/128 is directly connected, 00:00:44, Loopback104 (nexthop in vrf BGP)

We can also import global routes into a VRF on XR for IPv6. Using the connected route on XRv4’s interface to CSR2, we re-use the RPL using a new prefix-set to import the route from global to VRF BGP. We need to ensure this route is advertised into BGP, or else the import will not work.
! XRv4
prefix-set PS_GLOBAL_TO_VRF_V6
  2001::/16 le 96
end-set
vrf BGP
 address-family ipv6 unicast
  import from default-vrf route-policy RPL_LEAK(PS_GLOBAL_TO_VRF_V6)
router bgp 104
 address-family ipv6 unicast
  network 2001:dead:beef:cafe::/64

First, we confirm the route was successfully advertised into the IPv6 unicast global BGP table via the network statement. Then, we confirm it was imported into the VPNv6 BGP table for VRF BGP using the import route-policy. Just like with IPv4, BGP calls this route an “extranet”, which has eBGP characteristics, except with an AS path length of 0 (route is local). The source VRF is identified as being the default table (global) with no RD.
RP/0/0/CPU0:XRv4#show bgp ipv6 unicast 2001::/16 longer-prefixes | begin Network
   Network                   Next Hop   Metric LocPrf Weight Path
*> 2001:dead:beef:cafe::/64  ::              0         32768 i
RP/0/0/CPU0:XRv4#show bgp vpnv6 unicast vrf BGP 2001:dead:beef:cafe::/64 | beg$
  Local
    :: from 0.0.0.0 (14.14.14.14)
      Origin IGP, metric 0, localpref 100, weight 32768, valid, extranet, best, group-best, import-candidate, imported
      Received Path ID 0, Local Path ID 1, version 139
      Source VRF: default, Source Route Distinguisher: 0:0:0

From routers across the VPN, there is nothing special about this route, and it looks like a normal BGP route within the MPLS L3VPN. We can verify reachability from XRv4 to the other loopbacks using this route to ensure the import/export successfully achieves full reachability.

R5#show bgp vpnv6 unicast vrf BGP 2001:DEAD:BEEF:CAFE::/64
BGP routing table entry for [214:104]2001:DEAD:BEEF:CAFE::/64, version 2411
Paths: (1 available, best #1, table BGP)
  Not advertised to any peer
  Refresh Epoch 1
  214 104
    FD00:192:12:104::12 (FE80::12) (via vrf BGP) from FD00:192:12:104::12 (214.0.0.12)
      Origin IGP, localpref 100, valid, external, best
      rx pathid: 0, tx pathid: 0x0
RP/0/0/CPU0:XRv4#ping ::192:104:5:5 source 2001:dead:beef:cafe::14
[snip]
Success rate is 100 percent (5/5), round-trip min/avg/max = 1/11/29 ms
RP/0/0/CPU0:XRv4#ping ::192:104:6:6 source 2001:dead:beef:cafe::14
[snip]
Success rate is 100 percent (5/5), round-trip min/avg/max = 1/4/9 ms
RP/0/0/CPU0:XRv4#ping ::192:104:13:13 source 2001:dead:beef:cafe::14
[snip]
Success rate is 100 percent (5/5), round-trip min/avg/max = 1/6/9 ms

38.5 L3VPN import/export maps
Both XE and XR support the ability to apply additional constraints to restrict VPN membership beyond a simple RT import/export policy. This allows the routers to advertise certain routes with additional exported route-targets, or to filter inbound routes more aggressively even if they match the RT-import configured. Essentially, it is a per-prefix granularity tool to refine VPN membership. For this test, we will use CSR7 and XRv2 as the PEs to exchange routes between VRF BGP and VRF TOP. First, we verify the existing RT import/export policies on both routers. CSR7 has an asymmetric policy where it imports and exports routes with different RTs. XRv2 uses RT:214:104 for both directions.
R7#show vrf detail TOP | include Address|VPN|RT
VRF TOP (VRF Id = 1); default RD 214:7; default VPNID
Address family ipv4 unicast (Table ID = 0x1):
  Export VPN route-target communities
    RT:214:7
  Import VPN route-target communities
    RT:214:8
Address family ipv6 unicast (Table ID = 0x1E000001):
  Export VPN route-target communities
    RT:214:7
  Import VPN route-target communities
    RT:214:8
Address family ipv4 multicast not active
RP/0/0/CPU0:XRv2#show vrf BGP
VRF   RD       RT               AFI   SAFI
BGP   214:104
               import  214:104  IPV4  Unicast
               export  214:104  IPV4  Unicast
               import  214:104  IPV6  Unicast
               export  214:104  IPV6  Unicast

The first goal will be to get all IPv4 host routes from XRv2 VRF BGP into CSR7 VRF TOP without configuring CSR7. If we didn’t have the host route restriction, we could just add another export RT to XRv2 with value RT:214:8. That configuration is shown below, but is uncommitted, since it is incorrect.
! XRv2 (not committed, incorrect)
vrf BGP
 address-family ipv4 unicast
  export route-target
   214:8

We know that CSR7 is already importing RT:214:8, and it would work. However, we need to be more granular and only attach RT:214:8 to host routes inside the VPN. Non-matching routes will not have RT:214:8 applied, which means CSR7 will not import them into VRF TOP. Using RPL, we can construct a flexible policy to do this. We already have a prefix-set matching IPv4 host routes from earlier, so we can reuse that. The policy adds (does not replace) another RT to the extended community list and simply passes other routes. The “passed” routes will only have the “normal” export RTs defined under the VRF.
! XRv2
vrf BGP
 address-family ipv4 unicast
  export route-policy RPL_IF_MATCH_SET_RT(PS_HOST_ROUTES, RT_214_8)
extcommunity-set rt RT_214_8
  214:8
end-set
route-policy RPL_IF_MATCH_SET_RT($PS, $RT)
  if destination in $PS then
    set extcommunity rt $RT additive
  else
    pass
  endif
end-policy
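The effect of this export policy can be traced with a small Python model. It is only a sketch of the logic described above: the function names are hypothetical, the host-route test stands in for the PS_HOST_ROUTES prefix-set (assumed to match /32 prefixes), and the RT values are the ones configured on XRv2 and CSR7.

```python
import ipaddress

VRF_EXPORT_RTS = {"214:104"}   # normal export RT on XRv2 VRF BGP
CSR7_IMPORT_RTS = {"214:8"}    # import RT on CSR7 VRF TOP


def export_policy(prefix, rts):
    """Mimic RPL_IF_MATCH_SET_RT: add RT 214:8 to host routes, pass the rest."""
    net = ipaddress.ip_network(prefix)
    if net.prefixlen == net.max_prefixlen:   # host route (/32 for IPv4)
        return set(rts) | {"214:8"}          # additive: existing RTs are kept
    return set(rts)                          # pass: only the normal export RTs


def imported_by_csr7(prefix):
    """CSR7 imports a route only if it carries at least one of its import RTs."""
    return bool(export_policy(prefix, VRF_EXPORT_RTS) & CSR7_IMPORT_RTS)


for prefix in ("192.12.104.0/24", "192.104.5.5/32"):
    print(prefix, sorted(export_policy(prefix, VRF_EXPORT_RTS)), imported_by_csr7(prefix))
```

Under these assumptions the PE-CE /24 keeps only RT 214:104 and is not imported by CSR7, while the /32 carries both RTs and is imported.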

After applying this policy, we confirm that it is working locally. XRv2 learns multiple routes from CSR5, and two are shown below. The PE-CE link does not have the new RT but the host route does, implying that the policy worked.


RP/0/0/CPU0:XRv2#show bgp vpnv4 unicast vrf BGP 192.12.104.0/24 | include Extend
      Extended community: SoO:214:513 RT:214:104
RP/0/0/CPU0:XRv2#show bgp vpnv4 unicast vrf BGP 192.104.5.5/32 | include Extend
      Extended community: SoO:214:513 RT:214:8 RT:214:104

Of greater importance is verifying that these routes were imported on CSR7. We can use a regex to filter based on originating AS 104 to ensure only the host routes were allowed. Since XRv2 exported only one host route, it attached RT:214:8 to that single route only. The route contains all information we would expect to see, including the new RT, MPLS label, and other components needed for reachability.
R7#show bgp vpnv4 unicast vrf TOP quote-regexp "104$" | begin Network
     Network          Next Hop            Metric LocPrf Weight Path
Route Distinguisher: 214:7 (default for vrf TOP)
 *>i 192.104.5.5/32   214.0.0.12               0    100      0 104 ?
R7#show bgp vpnv4 unicast vrf TOP 192.104.5.5/32
BGP routing table entry for 214:7:192.104.5.5/32, version 728
Paths: (1 available, best #1, table TOP)
  Advertised to update-groups:
     1
  Refresh Epoch 1
  104, imported path from 214:104:192.104.5.5/32 (global)
    214.0.0.12 (metric 20) (via default) from 214.0.0.12 (214.0.0.12)
      Origin incomplete, metric 0, localpref 100, valid, internal, best
      Extended Community: SoO:214:513 RT:214:8 RT:214:104
      mpls labels in/out nolabel/92014
      rx pathid: 0, tx pathid: 0x0

At this point, we will not have reachability since XRv2 is not importing RT:214:7, which is exported by CSR7. We can either export RT:214:104 on CSR7 or import RT:214:7 on XRv2. The two approaches have different consequences. Exporting RT:214:104 on CSR7 means that all 4 of the PEs will be importing those routes, since all of them are importing RT:214:104 into VRF BGP. Importing RT:214:7 on XRv2 means that only XRv2 (not counting CSR8) will import those routes, which is probably cleaner. However, we add the additional requirement that XRv2 should not import all of the routes from CSR7 since there are several. We will attempt to match /26 routes with RT:214:7 set. It is important that the RPL test for RT:214:7 in the match statement, or else the /26 filter would apply to all imported routes. Our requirement is to permit /26 prefixes with RT:214:7 while not applying this strict filter to other imported routes.
! XRv2
vrf BGP
 address-family ipv4 unicast
  import route-policy RPL_IF_MATCH_PASS(PS_LENGTH_26, RT_214_7)
extcommunity-set rt RT_214_7
  214:7
end-set
prefix-set PS_LENGTH_26
  0.0.0.0/0 ge 26 le 26
end-set
route-policy RPL_IF_MATCH_PASS($PS, $RT)
  if destination in $PS and extcommunity rt matches-every $RT then
    pass
  elseif extcommunity rt matches-every $RT then
    drop
  else
    pass
  endif
end-policy
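The three branches of RPL_IF_MATCH_PASS can be checked as a small decision function. This is a hedged Python sketch rather than RPL: rpl_if_match_pass is a hypothetical name, and because the extcommunity-set holds a single value, "matches-every" is modeled here as simple membership.

```python
def rpl_if_match_pass(prefix_len, rts, wanted_len=26, wanted_rt="214:7"):
    """Return 'pass' or 'drop' following the three-stage logic of the policy."""
    has_rt = wanted_rt in rts
    if prefix_len == wanted_len and has_rt:
        return "pass"   # /26 carrying RT 214:7
    elif has_rt:
        return "drop"   # any other prefix length carrying RT 214:7
    else:
        return "pass"   # unrelated imports are left alone


cases = [
    (26, {"214:7"}),     # wanted route from CSR7
    (24, {"214:7"}),     # unwanted route from CSR7
    (24, {"214:104"}),   # ordinary VRF BGP route
]
for prefix_len, rts in cases:
    print(prefix_len, sorted(rts), rpl_if_match_pass(prefix_len, rts))
```

The final else branch is what keeps the strict /26 filter from affecting routes that arrive with other RTs.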

After applying this configuration, XRv2 is not importing any routes from AS 88 into VRF BGP. Because XRv2 is a route-reflector, it already retains all the VPN routes, so we can verify that it is at least receiving them from CSR7 within RD:214:7 (RD, not RT). The RT of the route is RT:214:7 as expected. For some reason, the configuration did not work.
RP/0/0/CPU0:XRv2#show bgp vpnv4 unicast vrf BGP regexp 88$
[no output]
RP/0/0/CPU0:XRv2#show bgp vpnv4 unicast rd 214:7 88.125.160.128/26
[snip]
  88, (Received from a RR-client)
    214.0.0.7 (metric 20) from 214.0.0.7 (214.0.0.7)
      Received Label 7005
      Origin IGP, metric 0, localpref 100, valid, internal, best, group-best, import-candidate, not-in-vrf
      Received Path ID 0, Local Path ID 1, version 896
      Extended community: RT:214:7

The reason for this issue is in the order of operations. The import-map is processed only after the routes enter the VRF, and they enter the VRF via the “import route-target” command. Therefore, we must specify the RT manually in the VRF, then use the RPL to filter the routes further. This is also like the export-map which is processed after “export route-target”, which is why we must be careful not to overwrite RT values. We will correct our configuration below. The RPL is still correct in its three-stage logic: pass if RT:214:7 and /26, drop if RT:214:7 (non /26 prefixes from AS 88), pass otherwise (for unrelated imports).
! XRv2
vrf BGP
 address-family ipv4 unicast
  import route-target
   214:7
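The order of operations can be expressed as a two-stage pipeline: the import-RT check runs first, and the import policy only ever sees routes that survived it. The Python below is an illustrative model of that ordering, not an implementation of either platform; vrf_import and accept_all are hypothetical names.

```python
def vrf_import(route_rts, import_rts, policy):
    """Stage 1: import-RT gate. Stage 2: import policy (only if stage 1 passed)."""
    if not set(route_rts) & set(import_rts):
        return False          # rejected before the policy is ever consulted
    return policy(route_rts)


def accept_all(route_rts):
    """A deliberately permissive policy, to show it cannot bypass the RT gate."""
    return True


as88_route_rts = {"214:7"}

# Before the fix: VRF BGP imports only RT 214:104, so the policy never runs.
print(vrf_import(as88_route_rts, {"214:104"}, accept_all))

# After the fix: RT 214:7 is configured for import, so the policy gets a say.
print(vrf_import(as88_route_rts, {"214:104", "214:7"}, accept_all))
```

Even a policy that accepts everything cannot pull in a route whose RTs are absent from the configured import list, which matches the "not-in-vrf" result seen on XRv2.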

Now, we can see the /26 routes from AS 88 in VRF BGP as expected. The RPL filtered all of the non /26 routes during the import process, but only routes already matching RT:214:7 were candidates to be compared against that filter.
RP/0/0/CPU0:XRv2#show bgp vpnv4 unicast vrf BGP regexp 88$ | begin Network
   Network            Next Hop            Metric LocPrf Weight Path
Route Distinguisher: 214:104 (default for vrf BGP)
*>i88.125.160.128/26  214.0.0.7                0    100      0 88 i
*>i88.125.160.192/26  214.0.0.7                0    100      0 88 i

We quickly check the routes on CSR5 and CSR10 to ensure the ultimate endpoints can reach one another. Then, we confirm the data-plane is functional using ping. CSR10 has the route to CSR5’s loopback, and CSR5 has the routes to CSR10’s /26 prefixes. Notice that we achieved this new reachability without configuring CSR7 at all.
R10#show bgp ipv4 unicast quote-regexp "104$" | begin Network
     Network            Next Hop            Metric LocPrf Weight Path
 *>  192.104.5.5/32     10.7.10.7                              0 214 104 ?
R5#show bgp vpnv4 unicast vrf BGP quote-regexp "88$" | begin Network
     Network            Next Hop            Metric LocPrf Weight Path
Route Distinguisher: 214:104 (default for vrf BGP)
 *>  88.125.160.128/26  192.12.104.12                          0 214 88 i
 *>  88.125.160.192/26  192.12.104.12                          0 214 88 i
R5#ping vrf BGP 88.125.160.129 source 192.104.5.5
[snip]
Success rate is 100 percent (5/5), round-trip min/avg/max = 8/9/11 ms
R5#ping vrf BGP 88.125.160.193 source 192.104.5.5
[snip]
Success rate is 100 percent (5/5), round-trip min/avg/max = 8/8/11 ms

We will examine the feature on XE as well, this time using IPv6 prefixes. Specifically, we will enable connectivity between XRv3’s loopback and the /70s within AS 82 by only configuring CSR8. First, we will import XRv3’s loopback into VRF BOTTOM using RT:214:104. We will do it incorrectly as we did on XR by trying to import the RT within the import-map; this is not supported on either platform.
! CSR8
ipv6 prefix-list PL_XRV13_LOOPBACK seq 5 permit ::192:104:13:13/128
ip extcommunity-list standard EXTCOML_RT_214_104 permit rt 214:104
route-map RM_VRF_IMPORT permit 10
 match extcommunity EXTCOML_RT_214_104
 match ipv6 address prefix-list PL_XRV13_LOOPBACK
vrf definition BOTTOM
 address-family ipv6
  import map RM_VRF_IMPORT

We can verify the configuration was successfully applied by checking the VRF details. BGP debugging reveals that the route is still rejected due to the import-RT policy not including RT:214:104. Even though the route-map clearly matches this RT, the route is not accepted for VPN membership due to the order of operations. The import-map is only applied against routes already imported by the “route-target import” command.
R8#show vrf detail BOTTOM | begin v6
Address family ipv6 unicast (Table ID = 0x1E000001):
  Flags: 0x0
  Export VPN route-target communities
    RT:214:8
  Import VPN route-target communities
    RT:214:7 RT:214:103
  Import route-map: RM_VRF_IMPORT
  No global export route-map
  No export route-map
  VRF label distribution protocol: not configured
  VRF label allocation mode: vrf-conn-aggr (Label 8019)
Address family ipv4 multicast not active
R8#debug bgp vpnv6 unicast updates in
BGP(5): 214.0.0.3 rcvd [214:104]::192:104:13:13/128, label 91015 -- DENIED due to: extended community not supported;

We can modify the configuration on CSR8 as we did on XRv2. We will remove the extcommunity-list containing the new RT and just import it the normal way. Then, we leave the filter in place to select the prefixes we want based on other attributes (other community values, prefix length, etc). This may create issues later when dealing with imports unrelated to RT:214:104, but we will continue anyway.
! CSR8
no ip extcommunity-list standard EXTCOML_RT_214_104 permit rt 214:104
route-map RM_VRF_IMPORT permit 10
 no match extcommunity EXTCOML_RT_214_104
vrf definition BOTTOM
 address-family ipv6
  route-target import 214:104


Now, we can see XRv3’s loopback inside of the VRF table as expected. This means that the custom import was successful and that the other routes originating from AS 104 were filtered.
R8#show bgp vpnv6 unicast vrf BOTTOM quote-regexp "104$" | begin Network
     Network              Next Hop            Metric LocPrf Weight Path
Route Distinguisher: 214:8 (default for vrf BOTTOM)
 *>i ::192:104:13:13/128  ::FFFF:214.0.0.11        0    100      0 104 ?

There is no reachability from AS 82 to XRv3’s loopback since XRv1 is not importing RT:214:8, which CSR8 is exporting. We will configure CSR8 to export RT:214:104 so that XRv1 can import it. The danger here is that all PEs will import these routes, but will not have bidirectional connectivity since CSR8 only imported XRv3’s loopback. In a real design, another RT would be used to exchange these routes, say RT:214:500, which XRv1 can import and CSR8 can append to the /70s it wants XRv1 to see. We will also take this opportunity to prove the order of operations when using export maps by intentionally forgetting the “additive” keyword.
! CSR8
ipv6 prefix-list PL_LENGTH_70 seq 5 permit ::/0 ge 70 le 70
route-map RM_VRF_EXPORT permit 10
 match ipv6 address prefix-list PL_LENGTH_70
 set extcommunity rt 214:104
vrf definition BOTTOM
 address-family ipv6
  export map RM_VRF_EXPORT

After applying this export, BGP needs to be cleared in order for the RT changes to have effect. First, we can validate that the configuration was successful by checking the VRF details.
R8#show vrf detail BOTTOM | begin v6
Address family ipv6 unicast (Table ID = 0x1E000001):
  Flags: 0x0
  Export VPN route-target communities
    RT:214:8
  Import VPN route-target communities
    RT:214:104
  Import route-map: RM_VRF_IMPORT
  No global export route-map
  Export route-map: RM_VRF_EXPORT
  VRF label distribution protocol: not configured
  VRF label allocation mode: vrf-conn-aggr (Label 8019)
Address family ipv4 multicast not active


Looking at the details of a /70 route, we can see the new RT:214:104 was applied, but the originally exported RT:214:8 was erased. This will allow all of the VRF BGP PEs to import this route, but CSR7 can no longer import it. This proves that the export-map happens after the “route-target export” command, overwriting whatever RTs were exported.
R8#show bgp vpnv6 unicast vrf BOTTOM 2BAD:82:125:160:400::/70
BGP routing table entry for [214:8]2BAD:82:125:160:400::/70, version 6
Paths: (1 available, best #1, table BOTTOM)
  Advertised to update-groups:
     5
  Refresh Epoch 1
  82
    FD00:10:1:8::1 (FE80::1) (via vrf BOTTOM) from 10.1.8.1 (82.125.160.1)
      Origin IGP, metric 0, localpref 100, valid, external, best
      Extended Community: RT:214:104
      mpls labels in/out 8016/nolabel
      rx pathid: 0, tx pathid: 0x0
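The difference between overwriting and appending RTs in an export-map can be modeled in a few lines. This Python sketch only mirrors the behavior described in the text; apply_export_map is a hypothetical helper, and the RT values are those used on CSR8 and CSR7.

```python
def apply_export_map(vrf_export_rts, matched, set_rts, additive=False):
    """Return the RTs attached to a route after the export-map is evaluated."""
    if not matched:
        return set(vrf_export_rts)                  # unmodified: normal export RTs
    if additive:
        return set(vrf_export_rts) | set(set_rts)   # append to the existing RTs
    return set(set_rts)                             # overwrite the existing RTs


CSR7_IMPORT_RTS = {"214:8"}

overwritten = apply_export_map({"214:8"}, matched=True, set_rts={"214:104"})
appended = apply_export_map({"214:8"}, matched=True, set_rts={"214:104"}, additive=True)
untouched = apply_export_map({"214:8"}, matched=False, set_rts={"214:104"})

print(sorted(overwritten), bool(overwritten & CSR7_IMPORT_RTS))
print(sorted(appended), bool(appended & CSR7_IMPORT_RTS))
print(sorted(untouched), bool(untouched & CSR7_IMPORT_RTS))
```

Without "additive", a matched /70 loses RT 214:8 and falls out of CSR7's import policy, while unmatched prefixes keep their normal export RT.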

This issue might be difficult to track down without regression testing, since reachability between XRv3 and CSR1 is now achieved, as shown below. XRv3 has the routes from AS 82, as expected, and has reachability as well.
RP/0/0/CPU0:XRv3#show bgp vrf BGP ipv6 unicast regexp "82$" | begin Network
   Network                   Next Hop             Metric LocPrf Weight Path
Route Distinguisher: 214:104 (default for vrf BGP)
*> 2bad:82:125:160:400::/70  fd00:192:11:104::11                     0 214 82 i
*> 2bad:82:125:160:800::/70  fd00:192:11:104::11                     0 214 82 i
*> 2bad:82:125:160:c00::/70  fd00:192:11:104::11                     0 214 82 i
RP/0/0/CPU0:XRv3#ping vrf BGP 2BAD:82:125:160:400::1 source ::192:104:13:13
[snip]
Success rate is 100 percent (5/5), round-trip min/avg/max = 1/6/9 ms

Debugging on CSR7 reveals that the routes from AS 82 are now being denied as they no longer carry RT:214:8 which CSR7 expects to import. Other routes (the non /70 prefixes) are still imported to CSR7 since their RT values were never adjusted. The /72s are still successfully installed, for example.
R7#debug bgp vrf BGP vpnv6 unicast updates in
BGP(5): 214.0.0.12 rcvd [214:8]2BAD:82:125:160:C00::/70, label 8044 -- DENIED due to: extended community not supported;
BGP(5): 214.0.0.12 rcvd [214:8]2BAD:82:125:160:800::/70, label 8043 -- DENIED due to: extended community not supported;
BGP(5): 214.0.0.12 rcvd [214:8]2BAD:82:125:160:400::/70, label 8016 -- DENIED due to: extended community not supported;
R7#show bgp vpnv6 unicast vrf TOP quote-regexp "82$" | begin Network
     Network                   Next Hop          Metric LocPrf Weight Path
Route Distinguisher: 214:7 (default for vrf TOP)
 *>i 2BAD:82:125:160::/72      ::FFFF:214.0.0.8       0    100      0 82 i
 *>i 2BAD:82:125:160:100::/72  ::FFFF:214.0.0.8       0    100      0 82 i
 *>i 2BAD:82:125:160:200::/72  ::FFFF:214.0.0.8       0    100      0 82 i
 *>i 2BAD:82:125:160:300::/72  ::FFFF:214.0.0.8       0    100      0 82 i

To fix this, we can correct the route-map to use the “additive” keyword when setting a new RT. This is the most straightforward and logical way to solve the problem and is something we used on XRv2 earlier. However, I will solve it another way; perhaps we are not allowed to use the “additive” keyword. We can remove the generalized RT export from the VRF and configure export RT:214:8 under the IPv4 AFI only, so that IPv4 is unaffected. Removing it from the top-level VRF configuration means the IPv6 AFI cannot inherit it. We will extend the IPv6 export route-map to include a final catch-all clause (one with no match statement) that sets RT:214:8 on everything not previously matched. The /70s matched in the first clause are assigned both RT:214:104 and RT:214:8, ultimately receiving all the RTs they need. Non /70 routes only receive RT:214:8 as expected. We could also have solved this by leaving the inherited export RT in the VRF and omitting the second route-map clause, which would have been simpler.
! CSR8
vrf definition BOTTOM
 no route-target export 214:8
 address-family ipv4
  route-target export 214:8
route-map RM_VRF_EXPORT permit 10
 match ipv6 address prefix-list PL_LENGTH_70
 set extcommunity rt 214:8 214:104
route-map RM_VRF_EXPORT permit 20
 set extcommunity rt 214:8


We can verify this on CSR8 by looking at a /70 and /71, noticing that the /70 gets both RT values. One of the routes is eBGP learned while the other is a local aggregate; both are subject to the export-map adjustments.
R8#show bgp vpnv6 unicast vrf BOTTOM 2BAD:82:125:160:C00::/70
BGP routing table entry for [214:8]2BAD:82:125:160:C00::/70, version 8
Paths: (1 available, best #1, table BOTTOM)
  Advertised to update-groups:
     11
  Refresh Epoch 1
  82
    FD00:10:1:8::1 (FE80::1) (via vrf BOTTOM) from 10.1.8.1 (82.125.160.1)
      Origin IGP, metric 0, localpref 100, valid, external, best
      Extended Community: RT:214:8 RT:214:104
      mpls labels in/out 8016/nolabel
      rx pathid: 0, tx pathid: 0x0
R8#show bgp vpnv6 unicast vrf BOTTOM 2BAD:82:125:160:200::/71
BGP routing table entry for [214:8]2BAD:82:125:160:200::/71, version 21
Paths: (1 available, best #1, table BOTTOM)
  Advertised to update-groups:
     10 11
  Refresh Epoch 1
  Local, (aggregated by 214 214.0.0.8)
    :: (via default) from 0.0.0.0 (214.0.0.8)
      Origin IGP, localpref 100, weight 32768, valid, aggregated, local, atomic-aggregate, best
      Extended Community: RT:214:8
      mpls labels in/out IPv6 VRF Aggr:8019/aggregate(BOTTOM)
      rx pathid: 0, tx pathid: 0x0

Next, let’s verify that CSR7 is able to import these routes. We don’t need to see the route details; we can safely assume that so long as BGP selected a best-path, the routes are valid. The data-plane is expected to work since it worked for all other prefixes earlier.
R7#show bgp vpnv6 unicast vrf TOP quote-regexp "82$" | include /70
 *>i 2BAD:82:125:160:400::/70
 *>i 2BAD:82:125:160:800::/70
 *>i 2BAD:82:125:160:C00::/70

There is one other issue with the RT inheritance on CSR8, and this is the result of an earlier change. When configuring RT import values at the top level and the AFI level, the AFI level takes precedence. Given this configuration, only RT:214:104 is imported into the IPv6 AFI. The RTs 214:7 and 214:103 are no longer considered for import into the IPv6 AFI.
! CSR8
vrf definition BOTTOM
 rd 214:8
 route-target import 214:7
 route-target import 214:103
 address-family ipv6
  route-target import 214:104
R8#show vrf detail BOTTOM | begin v6
Address family ipv6 unicast (Table ID = 0x1E000001):
  Flags: 0x0
  No Export VPN route-target communities
  Import VPN route-target communities
    RT:214:104
  Import route-map: RM_VRF_IMPORT
  No global export route-map
  Export route-map: RM_VRF_EXPORT
  VRF label distribution protocol: not configured
  VRF label allocation mode: vrf-conn-aggr (Label 8019)
Address family ipv4 multicast not active

To correct this, we need to add RT:214:103 (for IS-IS testing completed earlier) and RT:214:7 (to import routes from CSR7) under the IPv6 AFI. Because the IPv4 AFI has not defined explicit RT-import policies, it can still inherit from the top-level configuration, so we won’t delete those. We can tell this new configuration is working by looking at the VRF details.
! CSR8
vrf definition BOTTOM
 rd 214:8
 ! route-target import 214:7
 ! route-target import 214:103
 address-family ipv6
  route-target import 214:104
  route-target import 214:7
  route-target import 214:103
R8#show vrf detail BOTTOM | begin v6
Address family ipv6 unicast (Table ID = 0x1E000001):
  Flags: 0x0
  No Export VPN route-target communities
  Import VPN route-target communities
    RT:214:104 RT:214:7 RT:214:103
  Import route-map: RM_VRF_IMPORT
  No global export route-map
  Export route-map: RM_VRF_EXPORT
  VRF label distribution protocol: not configured
  VRF label allocation mode: vrf-conn-aggr (Label 8019)
Address family ipv4 multicast not active
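The inheritance rule observed on CSR8 (AFI-level RTs replace, rather than extend, the top-level RTs) can be summarized in a short Python model. This is a sketch of the behavior seen in the outputs above, not a statement about every XE release; effective_import_rts is a hypothetical helper.

```python
def effective_import_rts(top_level, afi_level):
    """AFI-level RTs win outright; top-level RTs apply only when the AFI has none."""
    return set(afi_level) if afi_level else set(top_level)


TOP_LEVEL = {"214:7", "214:103"}

# IPv4 AFI: nothing configured, so it inherits the top-level imports.
print(sorted(effective_import_rts(TOP_LEVEL, set())))

# IPv6 AFI with only 214:104: the top-level values are no longer considered.
print(sorted(effective_import_rts(TOP_LEVEL, {"214:104"})))

# IPv6 AFI after the correction: every required RT is listed explicitly.
print(sorted(effective_import_rts(TOP_LEVEL, {"214:104", "214:7", "214:103"})))
```

Because the inheritance is all-or-nothing, adding a single AFI-level RT silently drops the inherited ones until they are repeated under the AFI.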


Even after this, we still don’t have the AS 88 routes in CSR8’s VPN table. The import-map we applied earlier is the problem; it indiscriminately drops anything that is not XRv3’s loopback. We can tell by looking at the routes in the VPN table that are not from AS 82; there is only one. This is somewhat similar to the RPL on XRv2 where we needed to use multiple conditional matches to achieve the desired result.
R8#show bgp vpnv6 unicast vrf BOTTOM quote-regexp "[^82]" | begin Network
     Network              Next Hop            Metric LocPrf Weight Path
Route Distinguisher: 214:8 (default for vrf BOTTOM)
 *>i ::192:104:13:13/128  ::FFFF:214.0.0.11        0    100      0 104 ?
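One caution about the "[^82]" filter: in regular-expression syntax, square brackets with a caret form a negated character class, meaning "any single character other than 8 or 2", not "any AS other than 82". The Python sketch below illustrates only that character-class behavior. It assumes the AS path is matched as the plain string of AS numbers shown in the Path column, and IOS's regex engine differs from Python's re module in other details, so treat this as an approximation.

```python
import re

# Negated character class: matches any one character that is not '8' or '2'.
NOT_8_OR_2 = re.compile(r"[^82]")

as_paths = ["104", "82", "88"]

# A path is displayed only if the pattern matches somewhere in the string.
visible = [path for path in as_paths if NOT_8_OR_2.search(path)]
print(visible)
```

Under this assumption a path consisting solely of 88 would be hidden by the filter just like 82, so the single line of output by itself does not distinguish "AS 88 routes are absent" from "AS 88 routes are filtered from the display". A pattern such as "88$" tests for AS 88 directly.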

Ideally, we need to allow all routes from AS 88. We have successfully imported RT:214:7 but the routes were filtered by the second-stage import-map. We can safely permit all routes from AS 88 because this route-map is only evaluated after the RT import process.
! CSR8
ip as-path access-list 88 permit 88$
route-map RM_VRF_IMPORT permit 10
 match ipv6 address prefix-list PL_XRV13_LOOPBACK
route-map RM_VRF_IMPORT permit 20
 match as-path 88
R8#show bgp vpnv6 unicast vrf BOTTOM quote-regexp "88$" | begin Network
     Network                   Next Hop          Metric LocPrf Weight Path
Route Distinguisher: 214:8 (default for vrf BOTTOM)
 *>i 2BAD:88:125:160::/72      ::FFFF:214.0.0.7       0    100      0 88 i
 *>i 2BAD:88:125:160:100::/72  ::FFFF:214.0.0.7       0    100      0 88 i
[snip]
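The import route-map behaves like an ordered list of permit clauses followed by an implicit deny. The Python below is a simplified model of that evaluation for the two clauses configured on CSR8; the helper names and the dictionary layout are hypothetical, and the AS-path test uses Python's re module as a stand-in for the as-path access-list.

```python
import re


def is_xrv3_loopback(route):
    """Clause 10: match the PL_XRV13_LOOPBACK prefix-list."""
    return route["prefix"] == "::192:104:13:13/128"


def from_as_88(route):
    """Clause 20: match as-path access-list 88 (permit 88$)."""
    return re.search(r"88$", route["as_path"]) is not None


def route_map_permits(route, clauses):
    """First matching permit clause wins; no match means implicit deny."""
    return any(clause(route) for clause in clauses)


as88 = {"prefix": "2BAD:88:125:160::/72", "as_path": "88"}
loopback = {"prefix": "::192:104:13:13/128", "as_path": "104"}
other_104 = {"prefix": "::192:104:5:5/128", "as_path": "104"}

before = [is_xrv3_loopback]
after = [is_xrv3_loopback, from_as_88]

print(route_map_permits(as88, before), route_map_permits(as88, after))
print(route_map_permits(loopback, after), route_map_permits(other_104, after))
```

With only the first clause, AS 88 routes hit the implicit deny; adding the second clause admits them while the remaining AS 104 routes stay filtered.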

Now that routing has been verified, we quickly test IPv6 reachability from both XRv3’s loopback and AS 88 to the prefixes in AS 82. This proves that the “extranet” configuration was successful.
RP/0/0/CPU0:XRv3#ping vrf BGP 2BAD:82:125:160:400::1 source ::192:104:13:13
[snip]
Success rate is 100 percent (5/5), round-trip min/avg/max = 1/6/9 ms
R10#ping 2BAD:82:125:160:C00::1 source 2BAD:88:125:160::1
[snip]
Success rate is 100 percent (5/5), round-trip min/avg/max = 6/6/7 ms


This section introduced many advanced L3VPN and VRF techniques. In summary: 1. Top-level VRF configuration of RTs in XE is convenient but confusing. XR doesn’t even support it. Policies configured there are inherited to child AFIs only if the child AFIs don’t have policies of their own. The inheritance is not additive and is overridden by the child policy as those are more specific. I recommend to only configure RTs at the AFI level for clarity and consistency with XR. 2. Import/export maps are processed after the import/export RT process, respectively. You cannot use the import-map to match RTs for import. The import-map applies to routes already passing the import-RT check and serves as a second-stage filter for additional granularity. This routemap/RPL has an implicit-deny at the end; it will reject non-matching routes from VRF import. 3. Export-map RT application will overwrite RTs written by “route-target export” unless the “additive” keyword is used. Prefixes not permitted by the export-map are not denied, but they are unmodified; they will have the normal export RTs configured in the VRF and nothing else. This can be considered an “implicit-pass” at the end of the route-map/RPL. This feature isn’t meant for denying routes, only adjusting exported RTs. 38.6 Half-Duplex VRF (HDVRF) The half-duplex VRF (HDVRF) feature is a minor option within the scope of L3VPN. Integrating this with the existing large-scale VRF lab previously would have been difficult, so a small lab has been developed specifically to test this feature. An HDVRF effectively defines two VRFs per interface: one upstream and one downstream. Like an isolated private VLAN, a single downstream VRF can service many customers but ensure they do not have direct connectivity across the local PE. Instead, traffic can hairpin across the network; although seemingly inefficient, this might be valuable for security reasons. 
Perhaps there is a centralized security checkpoint consisting of firewalls and intrusion prevention systems that inspects all spoke-to-spoke traffic on the network. HDVRF simplifies configuration on the local PE since two VRFs, the upstream and downstream VRFs, can service an unlimited number of customers without leaking traffic laterally. The feature is not currently supported on XR platforms or for IPv6 in general.

Below is a basic L3VPN network with a corporate headquarters site, where XRv2 allows traffic to hairpin between spokes. The Internet peering point is access-only, and spokes cannot use CSR10 for hairpinning traffic between one another. The core runs basic OSPFv2, BGP VPNv4, and LDP as shown below.


Before beginning, note that the parser returns an error when trying to configure HDVRF with IPv6; it tells you the VRF assignment is simply ignored. To avoid any unforeseen consequences of this unsupported configuration, the IPv6 AFI has been removed from all VRFs. There is no 6VPE in this lab, either.

R5(config-subif)#vrf forwarding UP downstream DOWN
% VRF forwarding ignored for IPv6 on HDVRF interface GigabitEthernet2.556

Before configuring the VRFs, we will perform a quick check of the existing IGP, LDP, and BGP infrastructure; the configurations are very basic and not shown for brevity. CSR5 sees 5 routers in the area with no DRs as all links are point-to-point. XRv1 and CSR9 each see 3 LDP neighbors which suggests all LDP sessions are up. XRv1 is the VPNv4 RR and has a valid BGP session to each PE.

R5#show ip ospf 35 0 database | begin ^Link
Link ID         ADV Router      Age         Seq#       Checksum Link count
35.0.0.5        35.0.0.5        204         0x80000017 0x004AEB 5
35.0.0.8        35.0.0.8        11          0x80000018 0x00AD29 3
35.0.0.9        35.0.0.9        1904        0x80000018 0x00EF5E 7
35.0.0.11       35.0.0.11       11          0x8000001B 0x005DED 7
35.0.0.14       35.0.0.14       1862        0x80000014 0x00E8D3 3

RP/0/0/CPU0:XRv1#show mpls ldp neighbor brief
Peer               GR  NSR  Up Time     Discovery   Addresses   Labels
                                        ipv4  ipv6  ipv4  ipv6  ipv4   ipv6
-----------------  --  ---  ----------  ----------  ----------  ------------
35.0.0.5:0         N   N    10:32:14    1     0     3     0     10     0
35.0.0.9:0         N   N    10:32:14    1     0     4     0     10     0
35.0.0.8:0         N   N    10:27:40    1     0     2     0     10     0

R9#show mpls ldp neighbor | include Peer
    Peer LDP Ident: 35.0.0.5:0; Local LDP Ident 35.0.0.9:0
    Peer LDP Ident: 35.0.0.11:0; Local LDP Ident 35.0.0.9:0
    Peer LDP Ident: 35.0.0.14:0; Local LDP Ident 35.0.0.9:0

RP/0/0/CPU0:XRv1#show bgp vpnv4 unicast summary | begin ^Neigh
Neighbor        Spk    AS MsgRcvd MsgSent   TblVer  InQ OutQ  Up/Down  St/PfxRcd
35.0.0.5          0    35     706     659       24    0    0 10:30:48          0
35.0.0.8          0    35     699     641       24    0    0 10:28:20          0
35.0.0.14         0    35     635     641       24    0    0 10:29:32          0

CSR8 and XRv4 are basic L3VPN PEs with no fancy configuration. They use BGP to communicate with CSR10 and XRv2, respectively. The sessions are currently up and the PEs are learning routes from each CE eBGP peer. We can see the route counts: CSR8 learns one route while XRv4 learns several.

R8#show bgp vpnv4 unicast vrf CENTRAL summary | begin ^Neigh
Neighbor        V    AS MsgRcvd MsgSent   TblVer  InQ OutQ Up/Down  State/PfxRcd
10.8.10.10      4    10      16      19       42    0    0 00:10:50        1

RP/0/0/CPU0:XRv4#show bgp vrf INTERNET summary | begin ^Neigh
Neighbor        Spk    AS MsgRcvd MsgSent   TblVer  InQ OutQ  Up/Down  St/PfxRcd
10.12.14.12       0    12      27      27       26    0    0 00:23:09          7

CSR8 will only receive a default route from CSR10. The Internet default route is exported with RT:35:8; it makes sense that only a default route is provided for Internet access in this single-homed network.

R8#show bgp vpnv4 unicast vrf CENTRAL 0.0.0.0/0
BGP routing table entry for 35:100:0.0.0.0/0, version 18
Paths: (1 available, best #1, table CENTRAL)
  Advertised to update-groups:
     5
  Refresh Epoch 1
  10
    10.8.10.10 (via vrf CENTRAL) from 10.8.10.10 (10.10.10.10)
      Origin IGP, localpref 100, valid, external, best
      Extended Community: RT:35:8
      mpls labels in/out 8011/nolabel
      rx pathid: 0, tx pathid: 0x0

XRv4 learns intra-enterprise routes from XRv2. Because the spokes don’t need all these routes, we use aggregation on XRv4. The “as-set” keyword preserves the component routes’ AS-path information on the aggregate, so routers in AS 12 see their own AS in the path and reject it rather than accidentally installing it. Component routes are also suppressed (“summary-only”), and the aggregation configuration is shown for reference.

! XRv4
router bgp 35
 vrf CENTRAL
  address-family ipv4 unicast
   aggregate-address 12.24.0.0/16 as-set summary-only


RP/0/0/CPU0:XRv4#show bgp vpnv4 unicast vrf CENTRAL 12.24.0.0/16 | begin Paths
Paths: (1 available, best #1)
  Advertised to peers (in unique update groups):
    35.0.0.11
  Path #1: Received by speaker 0
  Advertised to peers (in unique update groups):
    35.0.0.11
  12, (aggregated by 35 35.0.0.14)
    0.0.0.0 from 0.0.0.0 (35.0.0.14)
      Origin incomplete, localpref 100, weight 32768, valid, aggregated, best, group-best, import-candidate
      Received Path ID 0, Local Path ID 1, version 63
      Extended community: RT:35:14
      Origin-AS validity: not-found (iBGP signalled)
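The loop-prevention effect of “as-set” on XRv4’s aggregate can be sketched in a few lines of Python. This is only an illustrative model, not router code: the function names are invented for the example, and the three component paths are an assumption (the lab only tells us the components come from AS 12). The model always stores the inherited ASNs as a set; when every component shares the same path, a real router may display a plain sequence instead, as the “12” in the output above suggests.

```python
def aggregate_as_path(component_paths, as_set=True):
    """Return the AS_PATH attached to a locally originated aggregate.

    With as-set, the ASNs of all component routes are carried along
    (modelled here as a single frozenset); without it, the aggregate
    is originated with an empty AS_PATH.
    """
    if not as_set:
        return []
    asns = set()
    for path in component_paths:
        asns.update(path)
    return [frozenset(asns)] if asns else []


def advertise_ebgp(path, local_as):
    """Prepend the advertising AS, as an eBGP speaker does on export."""
    return [local_as] + list(path)


def accepts(path, receiver_as):
    """eBGP loop prevention: reject a path that already contains our AS."""
    for segment in path:
        members = segment if isinstance(segment, frozenset) else {segment}
        if receiver_as in members:
            return False
    return True


# Hypothetical component routes learned by XRv4 (AS 35) from AS 12.
components = [[12], [12], [12]]

with_set = advertise_ebgp(aggregate_as_path(components, as_set=True), 35)
without_set = advertise_ebgp(aggregate_as_path(components, as_set=False), 35)

print(accepts(with_set, 12))     # AS 12 finds itself in the path
print(accepts(without_set, 12))  # nothing stops AS 12 from installing it
print(accepts(with_set, 6713))   # the spoke AS is unaffected
```

The design point is that the aggregate stays useful to everyone except the AS that contributed the component routes, which is exactly the behavior wanted here.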

We quickly check XRv1, the VPNv4 RR, to ensure it is learning and retaining these routes. This is a very clean way of providing Internet and Intranet connectivity to remote sites with only two routes.

RP/0/0/CPU0:XRv1#show bgp vpnv4 unicast rd 35:100 | begin Network
   Network            Next Hop            Metric LocPrf Weight Path
Route Distinguisher: 35:100
*>i0.0.0.0/0          35.0.0.8                 0    100      0 10 i
*>i12.24.0.0/16       35.0.0.14                     100      0 12 ?

With the basic network verification complete, we will configure the HDVRF feature on CSR5. CSR6, CSR7, and XRv3 are spoke sites that all require Internet and Intranet access without direct spoke-to-spoke access. Without considering HDVRF, there are two general solutions to this problem:

1. Create three VRFs, each of which imports the RTs exported by CSR8 and XRv4, and exports a “spoke RT” that those routers can import. Spokes will have no connectivity unless the spoke routes are leaked somewhere upstream.

2. Create one VRF which imports the RTs exported by CSR8 and XRv4 and exports a “spoke RT” that those routers can import. Spokes will have direct connectivity across CSR5 unless some kind of data-plane filter is applied.

Neither of these solutions works well in our environment to meet the requirements stated earlier. The first solution scales poorly since each spoke needs its own VRF, and spoke-to-spoke traffic is not inherent without additional work. The second is efficient from a forwarding perspective but “insecure” since spoke-to-spoke traffic bypasses XRv2, the central point for all network security services.

HDVRF is a combination of these two methods. We create two VRFs on CSR5 to represent upstream and downstream flows. Before adjusting any of the RT policies, we will configure the basic VRFs and apply them to the PE-CE interfaces.

! CSR5
vrf definition DOWN
 rd 35:6713
 address-family ipv4
vrf definition UP
 rd 35:100
 address-family ipv4
interface GigabitEthernet2.553
 encapsulation dot1Q 3553
 vrf forwarding UP downstream DOWN
interface GigabitEthernet2.556
 vrf forwarding UP downstream DOWN
interface GigabitEthernet2.557
 vrf forwarding UP downstream DOWN

When we verify this configuration, we see that the DOWN VRF has a special “D” flag next to each interface. This specifies that the particular VRF is a downstream VRF on that interface. The lack of any flag specifies that the VRF is an upstream VRF. Below, we can see both UP and DOWN VRFs applied to the same interfaces.

R5#show vrf
  Name                 Default RD      Protocols   Interfaces
  DOWN                 35:6713         ipv4        Gi2.556 [D]
                                                   Gi2.557 [D]
                                                   Gi2.553 [D]
  UP                   35:100          ipv4        Gi2.556
                                                   Gi2.557
                                                   Gi2.553

We will need to exchange routing information with these spokes. Before we can do that, we must determine in which VRF the connected PE-CE subnets exist. This will determine in which VRF we must configure the PE-CE routing protocol. We quickly look at the VRF UP and DOWN routing tables to see that the connected routes are in the DOWN VRF. If we tried to configure BGP in the UP VRF, for example, there would be no route to the eBGP peer and the session would never form.

R5#show ip route vrf UP connected | include ^C_
[no output]

R5#show ip route vrf DOWN connected | include ^C_
C        10.5.6.0/24 is directly connected, GigabitEthernet2.556
C        10.5.7.0/24 is directly connected, GigabitEthernet2.557
C        10.5.13.0/24 is directly connected, GigabitEthernet2.553

Next, we configure BGP to each spoke. The BGP configuration is basic; the only noteworthy comment is that these sessions are in the DOWN VRF since that VRF contains the connected PE-CE subnets. We quickly verify that the configuration was successful by verifying that the BGP sessions are up (CE configurations not shown).

! CSR5
router bgp 35
 template peer-session IPV4_UCAST_DOWN
  remote-as 6713
  password SPOKE
  ttl-security hops 1
 address-family ipv4 vrf DOWN
  neighbor 10.5.6.6 inherit peer-session IPV4_UCAST_DOWN
  neighbor 10.5.7.7 inherit peer-session IPV4_UCAST_DOWN
  neighbor 10.5.13.13 inherit peer-session IPV4_UCAST_DOWN

R5#show bgp vpnv4 unicast vrf DOWN summary | begin ^Neigh
Neighbor        V    AS MsgRcvd MsgSent   TblVer  InQ OutQ Up/Down  State/PfxRcd
10.5.6.6        4  6713     756     765       81    0    0 11:17:40        1
10.5.7.7        4  6713     752     763       81    0    0 11:17:33        1
10.5.13.13      4  6713      49      72       81    0    0 00:45:15        1

Incoming routes learned by the PE are placed into the downstream VRF. This makes sense because BGP was configured in VRF DOWN and the connected subnets also exist there. A quick check of CSR5’s VRF DOWN routing table reveals these three eBGP routes, one from each spoke.

R5#show ip route vrf DOWN bgp | include ^B_
B        100.6.6.6 [20/0] via 10.5.6.6, 00:54:27
B        100.7.7.7 [20/0] via 10.5.7.7, 00:54:27
B        100.13.13.13 [20/0] via 10.5.13.13, 00:54:25

Ultimately, both the Internet and Intranet routers will need reachability to these subnets, just like a normal L3VPN. Since the routes only exist in VRF DOWN, we will export RT:35:6713 onto these prefixes so that CSR8 and XRv4 can import them. Both the export and import processes are shown below.

! CSR5
vrf definition DOWN
 address-family ipv4
  route-target export 35:6713

! CSR8
vrf definition INTERNET
 address-family ipv4
  route-target import 35:6713

! XRv4
vrf CENTRAL
 address-family ipv4 unicast
  import route-target
   35:6713

We don’t expect to have bidirectional connectivity since the routing advertisement is only complete in one direction. We verify that both CSR10 and XRv2 learn these spoke routes via BGP. From the perspective of CSR8 and XRv4, this is a normal L3VPN with no advanced features; they continue to import and export RTs as usual. The CENTRAL and INTERNET VRFs on these routers need not be HDVRF-aware since the feature is local to CSR5.

R10#show bgp ipv4 unicast 100.0.0.0/8 longer-prefixes | begin Network
     Network          Next Hop            Metric LocPrf Weight Path
 *>  100.6.6.6/32     10.8.10.8                              0 35 6713 i
 *>  100.7.7.7/32     10.8.10.8                              0 35 6713 i
 *>  100.13.13.13/32  10.8.10.8                              0 35 6713 i

RP/0/0/CPU0:XRv2#show bgp ipv4 unicast 100.0.0.0/8 longer-prefixes | begin Network
   Network            Next Hop            Metric LocPrf Weight Path
*> 100.6.6.6/32       10.12.14.14                            0 35 6713 i
*> 100.7.7.7/32       10.12.14.14                            0 35 6713 i
*> 100.13.13.13/32    10.12.14.14                            0 35 6713 i

To complete the routing configuration, we introduce the half-duplex nature of the VRFs on CSR5. Incoming traffic towards the PE will use the upstream VRF for its route lookup, which implies that VRF UP should be importing the remote L3VPN routes. In summary, VRF UP is importing upstream routes to control how clients route upstream. VRF DOWN is exporting downstream routes to control how upstream clients route back towards spokes.

After importing the Internet and Intranet RTs into VRF UP on CSR5, we look at the VPNv4 routes across both VRFs. VRF UP has reachability to the Internet and Intranet sites, while VRF DOWN has reachability to the spokes. In a normal L3VPN configuration, one would think that these were entirely different topologies as reachability appears fragmented.

! CSR5
vrf definition UP
 address-family ipv4
  route-target import 35:8
  route-target import 35:14

R5#show bgp vpnv4 unicast all | begin Network
     Network          Next Hop            Metric LocPrf Weight Path
Route Distinguisher: 35:100 (default for vrf UP)
 *>i 0.0.0.0          35.0.0.8                 0    100      0 10 i
 *>i 12.24.0.0/16     35.0.0.14                     100      0 12 ?
Route Distinguisher: 35:6713 (default for vrf DOWN)
 *>  100.6.6.6/32     10.5.6.6                 0             0 6713 i
 *>  100.7.7.7/32     10.5.7.7                 0             0 6713 i
 *>  100.13.13.13/32  10.5.13.13               0             0 6713 i
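The “fragmented” reachability at this stage follows directly from the RT policy, and a small Python model may help make that concrete. This is a sketch under stated assumptions rather than a description of any router internals: the dictionary keys (“CSR8:hub”, “XRv4:hub”) are hypothetical labels for the hub-side VRFs, and `imported_routes` is an invented helper. The RT values and prefixes are the ones used in this lab.

```python
# RT policy per VRF at this point in the lab (VRF DOWN imports nothing yet).
VRFS = {
    "CSR5:UP":   {"export": set(),       "import": {"35:8", "35:14"}},
    "CSR5:DOWN": {"export": {"35:6713"}, "import": set()},
    "CSR8:hub":  {"export": {"35:8"},    "import": {"35:6713"}},
    "XRv4:hub":  {"export": {"35:14"},   "import": {"35:6713"}},
}

# Prefixes each VRF originates into VPNv4 (learned from its CEs).
ORIGINATED = {
    "CSR5:DOWN": ["100.6.6.6/32", "100.7.7.7/32", "100.13.13.13/32"],
    "CSR8:hub":  ["0.0.0.0/0"],
    "XRv4:hub":  ["12.24.0.0/16"],
}


def imported_routes(vrfs, originated):
    """Return, per VRF, the remote prefixes whose export RTs it imports."""
    result = {name: set() for name in vrfs}
    for src, prefixes in originated.items():
        export_rts = vrfs[src]["export"]
        for dst, policy in vrfs.items():
            # A route is imported when at least one RT matches.
            if dst != src and export_rts & policy["import"]:
                result[dst].update(prefixes)
    return result


routes = imported_routes(VRFS, ORIGINATED)
for name in VRFS:
    print(name, sorted(routes[name]))
```

In this model VRF UP ends up with only the two hub routes, the hub VRFs end up with only the spoke host routes, and VRF DOWN imports nothing, which mirrors the BGP table above.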


At present, the spokes have no ability to reach any remote destinations since they haven’t learned any routes from CSR5; VRF DOWN is not importing anything so nothing can be advertised to the spokes. As a quick fix, we will configure a static route on CSR6 only, to achieve reachability.

! CSR6
ip route 0.0.0.0 0.0.0.0 10.5.6.5

When CSR6 sends traffic to CSR5, the FIB lookup happens in VRF UP regardless of whether VRF DOWN has a route or not. Two MPLS labels are imposed: the top label is XRv1’s label for CSR8’s loopback (transport) and the bottom label is CSR8’s VPN label for the default route, which covers CSR10’s loopback.

R5#show ip cef vrf DOWN 10.10.10.10
0.0.0.0/0
  no route

R5#show ip cef vrf UP 10.10.10.10
0.0.0.0/0
  nexthop 35.5.11.11 GigabitEthernet2.551 label 91005 8010

R5#show mpls ldp bindings 35.0.0.8 32 neighbor 35.0.0.11
  lib entry: 35.0.0.8/32, rev 20
        remote binding: lsr: 35.0.0.11:0, label: 91005

R5#show bgp vpnv4 unicast vrf UP labels
   Network          Next Hop      In label/Out label
Route Distinguisher: 35:100 (UP)
   0.0.0.0          35.0.0.8        nolabel/8010
   12.24.0.0/16     35.0.0.14       nolabel/94023

XRv1 performs PHP and exposes label 8010 to CSR8. CSR8 removes all labels and delivers the traffic to the Internet destination of 10.10.10.10. Once the label stack is imposed by CSR5, there is nothing special in the forwarding-plane.

RP/0/0/CPU0:XRv1#show mpls forwarding labels 91005
Local  Outgoing    Prefix             Outgoing      Next Hop        Bytes
Label  Label       or ID              Interface                     Switched
------ ----------- ------------------ ------------- --------------- ----------
91005  Pop         35.0.0.8/32        Gi0/0/0/0.581 35.8.11.8       17575

R8#show mpls forwarding-table labels 8010 detail
Local      Outgoing   Prefix           Bytes Label   Outgoing   Next Hop
Label      Label      or Tunnel Id     Switched      interface
8010       No Label   0.0.0.0/0[V]     590           Gi2.580    10.8.10.10
        MAC/Encaps=18/18, MRU=1504, Label Stack{}
        005056A9F961005056A9FB1C81000DFC0800
        VPN route: INTERNET
        No output feature configured

Earlier, we confirmed that the Internet and Intranet sites already had ordinary L3VPN routes back to the spokes. When CSR10 replies, CSR8 imposes two labels in similar fashion to what CSR5 did above. Rather than trace the LSP from CSR8 to CSR5, we will skip to CSR5, which receives packets labeled 5004 after XRv1 performs PHP. CSR5 removes all labels and delivers the packets to CSR6 inside VRF DOWN. This makes sense because the VPN route originated inside VRF DOWN since that is where BGP was configured. It is also the VRF in which the connected next-hop resides.

R8#show ip cef vrf INTERNET 100.6.6.6/32
100.6.6.6/32
  nexthop 35.8.11.11 GigabitEthernet2.581 label 91000 5004

R5#show mpls forwarding-table labels 5004 detail
Local      Outgoing   Prefix           Bytes Label   Outgoing   Next Hop
Label      Label      or Tunnel Id     Switched      interface
5004       No Label   100.6.6.6/32[V]  590           Gi2.556    10.5.6.6
        MAC/Encaps=18/18, MRU=1504, Label Stack{}
        005056A9DE0D005056A9DC6381000DE40800
        VPN route: DOWN
        No output feature configured

To ensure connectivity is functional, we quickly ping from CSR6 to CSR10 before continuing.

R6#ping 10.10.10.10 source 100.6.6.6
Type escape sequence to abort.
Sending 5, 100-byte ICMP Echos to 10.10.10.10, timeout is 2 seconds:
Packet sent with a source address of 100.6.6.6
!!!!!
Success rate is 100 percent (5/5), round-trip min/avg/max = 8/9/11 ms

Rather than configure static routes on all spokes, we can advertise the expected BGP routes to them. This is accomplished by importing CSR8’s and XRv4’s RT values on VRF DOWN. Since VRF DOWN is never used for upstream route lookups, having these in the PE’s VRF DOWN routing table means nothing. Of greater importance is being able to advertise them down to the spokes. Note that this step is optional and is not required if the CEs have other upstream routing mechanisms (static, etc), as we just demonstrated on CSR6. We quickly check to ensure the routes were imported.

! CSR5
vrf definition DOWN
 address-family ipv4
  route-target import 35:8
  route-target import 35:14

R5#show bgp vpnv4 unicast vrf DOWN quote-regexp "^[^6]" | begin Network
     Network          Next Hop            Metric LocPrf Weight Path
Route Distinguisher: 35:6713 (default for vrf DOWN)
 *>i 0.0.0.0          35.0.0.8                 0    100      0 10 i
 *>i 12.24.0.0/16     35.0.0.14                     100      0 12 ?

Looking at the details of the default route as an example, we see interesting output. CSR5 claims that the route was imported directly from VRF UP, not from VPNv4 BGP. This is odd because VRF UP is not exporting RT:35:8 yet the output seems to suggest that. In reality, this means that VRF DOWN is “borrowing” the route from the upstream routing VRF so that it can advertise it downwards.

R5#show bgp vpnv4 unicast vrf DOWN 0.0.0.0/0
BGP routing table entry for 35:6713:0.0.0.0/0, version 21
Paths: (1 available, best #1, table DOWN)
  Advertised to update-groups:
     3
  Refresh Epoch 1
  10, imported path from 35:100:0.0.0.0/0 (UP)
    35.0.0.8 (metric 3) (via default) from 35.0.0.11 (35.0.0.11)
      Origin IGP, metric 0, localpref 100, valid, internal, best
      Extended Community: RT:35:8
      Originator: 35.0.0.8, Cluster list: 35.0.0.11
      mpls labels in/out nolabel/8010
      rx pathid: 0, tx pathid: 0x0

As mentioned earlier, the local VRF DOWN RIB on CSR5 will now have these BGP routes, but they are irrelevant since they are not consulted when traffic is received from the CE. It is critical that the spokes learn these routes; we check CSR7 and XRv3 specifically. We also allowed the spokes to learn 12.24.0.0/16 since this was also “borrowed” from VRF UP to VRF DOWN. This is needed to route traffic towards the central services resources behind XRv2 in the case that the default route is not available.

R5#show ip route vrf DOWN bgp | include \[200
B*    0.0.0.0/0 [200/0] via 35.0.0.8, 00:04:50
B        12.24.0.0 [200/0] via 35.0.0.14, 00:04:50

R7#show ip route bgp | include ^B
B*    0.0.0.0/0 [20/0] via 10.5.7.5, 00:05:13
B        12.24.0.0 [20/0] via 10.5.7.5, 00:05:13

RP/0/0/CPU0:XRv3#show route bgp
B*   0.0.0.0/0 [20/0] via 10.5.13.5, 00:05:26
B    12.24.0.0/16 [20/0] via 10.5.13.5, 00:05:26

A quick set of tests confirms that this achieves connectivity between spokes and “hubs”. We don’t need to trace each LSP again since the forwarding plane is identical regardless of how the spokes learn their upstream routes.

R7#ping 10.10.10.10 source 100.7.7.7
[snip]
!!!!!
Success rate is 100 percent (5/5), round-trip min/avg/max = 7/8/10 ms

R7#ping 12.24.48.1 source 100.7.7.7
[snip]
!!!!!
Success rate is 100 percent (5/5), round-trip min/avg/max = 5/8/19 ms

RP/0/0/CPU0:XRv3#ping 10.10.10.10 source 100.13.13.13
[snip]
!!!!!
Success rate is 100 percent (5/5), round-trip min/avg/max = 1/5/9 ms

RP/0/0/CPU0:XRv3#ping 12.24.48.1 source 100.13.13.13
[snip]
!!!!!
Success rate is 100 percent (5/5), round-trip min/avg/max = 1/6/19 ms

Next, we will verify the hairpinning capability of HDVRF. Both CSR7 and XRv3 learn default routes from CSR5 inside the same VRF; under normal circumstances, traffic would be routed locally by CSR5 between these two endpoints. Since the routing lookup always happens in the upstream VRF, traffic will not be routed directly. In this example, we will trace the path from CSR7 to XRv3. When CSR7 sends packets to CSR5 for this destination, it follows its default route.

R7#show ip cef 100.13.13.13
0.0.0.0/0
  nexthop 10.5.7.5 GigabitEthernet2.557

CSR5 performs a lookup in VRF UP and also finds that the default route is the best match. Traffic is tunneled inside MPLS to CSR8 and then forwarded to CSR10, since that is where the default route was originated.

R5#show ip cef vrf UP 100.13.13.13
0.0.0.0/0
  nexthop 35.5.11.11 GigabitEthernet2.551 label 91005 8010

CSR10’s longest-match for XRv3 is via CSR8 following a host route. Since CSR8 imported those downstream routes, but VRF UP didn’t do it locally on CSR5, we have effectively created a large hairpin terminating on CSR10.

R10#show ip cef 100.13.13.13
100.13.13.13/32
  nexthop 10.8.10.8 GigabitEthernet2.580

CSR8 tunnels traffic inside MPLS back to CSR5, which forwards it to XRv3 inside VRF DOWN.

R8#show ip cef vrf INTERNET 100.13.13.13/32
100.13.13.13/32
  nexthop 35.8.11.11 GigabitEthernet2.581 label 91000 5009

R5#show mpls forwarding-table labels 5009 detail
Local      Outgoing   Prefix              Bytes Label   Outgoing   Next Hop
Label      Label      or Tunnel Id        Switched      interface
5009       No Label   100.13.13.13/32[V]  \
                                          1888          Gi2.553    10.5.13.13
        MAC/Encaps=18/18, MRU=1504, Label Stack{}
        005056A9EA54005056A9DC6381000DE10800
        VPN route: DOWN
        No output feature configured

Using traceroute inside CSR7 reveals the entire path. CSR10 terminates the hairpin and is highlighted, but it is clear that traffic between spokes cannot be routed locally by CSR5.

R7#traceroute 100.13.13.13 source 100.7.7.7
Type escape sequence to abort.
Tracing the route to 100.13.13.13
VRF info: (vrf in name/id, vrf out name/id)
  1 10.5.7.5 [AS 10] 4 msec 4 msec 3 msec
  2 35.5.11.11 [AS 10] [MPLS: Labels 91005/8010 Exp 0] 7 msec 5 msec 6 msec
  3 10.8.10.8 [AS 10] [MPLS: Label 8010 Exp 0] 16 msec 16 msec 15 msec
  4 10.8.10.10 [AS 10] 19 msec 11 msec 10 msec
  5 10.8.10.8 [AS 10] 10 msec 11 msec 10 msec
  6 35.8.11.11 [AS 10] [MPLS: Labels 91000/5009 Exp 0] 32 msec 51 msec 50 msec
  7 10.5.13.5 [AS 10] [MPLS: Label 5009 Exp 0] 21 msec 19 msec 18 msec
  8 10.5.13.13 [AS 10] 32 msec 28 msec 27 msec

This behavior is highly undesirable at present. Recall that CSR10 represents the Internet while XRv2 represents the central services site containing the security tools. Traffic is being hairpinned via an Internet router and totally bypassing the security features at XRv2. This is the result of VRF UP only having a default route to reach the spoke loopbacks. Instead, we must find a way to attract spoke-to-spoke traffic to XRv2. The simplest solution is a static null route advertised into BGP on XRv2. After configuring it, we verify that XRv1 (RR) learns this prefix.

! XRv2
router static
 address-family ipv4 unicast
  100.0.0.0/8 Null0
router bgp 12
 address-family ipv4 unicast
  network 100.0.0.0/8

RP/0/0/CPU0:XRv1#show bgp vpnv4 unicast rd 35:100 100.0.0.0/8 brief | begin Network
   Network            Next Hop            Metric LocPrf Weight Path
Route Distinguisher: 35:100
*>i100.0.0.0/8        35.0.0.14                0    100      0 12 i

We quickly verify that all three spokes learn this prefix. Although the next-hop is the same as the default route, all of them will be following this new route to reach one another’s loopbacks. I use three different verifications on the spokes for variety.

R6#show ip route 100.0.0.0 255.0.0.0
Routing entry for 100.0.0.0/8
  Known via "bgp 6713", distance 20, metric 0
  Tag 35, type external
[snip]

R7#show ip cef 100.0.0.0/8
100.0.0.0/8
  nexthop 10.5.7.5 GigabitEthernet2.557

RP/0/0/CPU0:XRv3#show bgp ipv4 unicast 100.0.0.0/8 brief | begin Network
   Network            Next Hop            Metric LocPrf Weight Path
*> 100.0.0.0/8        10.5.13.5                              0 35 12 i

Of paramount importance is the upstream VRF routing lookup on CSR5. Since the static null route on XRv2 is generic enough to cover all spoke loopbacks, yet more specific than the default route, it is preferred. Now traffic will hairpin through XRv2 as desired. Without performing a detailed verification, we know that label 94008 was allocated by XRv4; in this case, it represents XRv4’s local VPN label for prefix 100.0.0.0/8.

R5#show ip cef vrf UP 100.6.6.6
100.0.0.0/8
  nexthop 35.5.9.9 GigabitEthernet2.559 label 9010 94008

R5#show ip cef vrf UP 100.7.7.7
100.0.0.0/8
  nexthop 35.5.9.9 GigabitEthernet2.559 label 9010 94008

R5#show ip cef vrf UP 100.13.13.13
100.0.0.0/8
  nexthop 35.5.9.9 GigabitEthernet2.559 label 9010 94008
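The reason the /8 wins is ordinary longest-prefix matching in VRF UP. The following Python sketch reproduces that decision using only the standard `ipaddress` module. The `longest_match` helper and the two dictionaries are illustrative constructs, not anything CSR5 exposes; the prefixes and BGP next-hops are taken from the lab outputs.

```python
import ipaddress


def longest_match(table, destination):
    """Return (prefix, next_hop) for the most specific covering prefix."""
    dest = ipaddress.ip_address(destination)
    covering = [
        ipaddress.ip_network(prefix)
        for prefix in table
        if dest in ipaddress.ip_network(prefix)
    ]
    if not covering:
        return None
    best = max(covering, key=lambda net: net.prefixlen)
    return str(best), table[str(best)]


# VRF UP on CSR5 before XRv2 advertises the null route.
up_before = {
    "0.0.0.0/0": "35.0.0.8",       # via CSR8 (Internet)
    "12.24.0.0/16": "35.0.0.14",   # via XRv4 (central services)
}

# VRF UP after 100.0.0.0/8 is learned from XRv4.
up_after = dict(up_before)
up_after["100.0.0.0/8"] = "35.0.0.14"

for spoke in ("100.6.6.6", "100.7.7.7", "100.13.13.13"):
    print(spoke, longest_match(up_before, spoke), longest_match(up_after, spoke))
```

Because VRF UP never holds the spoke /32 routes, any covering prefix shorter than /32 and longer than /0 is enough to steer the hairpin; the /8 is simply a convenient choice in this addressing plan.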

Using traceroute from CSR7 again verifies this behavior. XRv2 is highlighted in the output below. At this point the HDVRF configuration is complete.

R7#traceroute 100.13.13.13 source 100.7.7.7
Type escape sequence to abort.
Tracing the route to 100.13.13.13
VRF info: (vrf in name/id, vrf out name/id)
  1 10.5.7.5 [AS 10] 4 msec 2 msec 2 msec
  2 35.5.9.9 [AS 10] [MPLS: Labels 9010/94008 Exp 0] 6 msec 6 msec 11 msec
  3 35.9.14.14 [AS 10] [MPLS: Label 94008 Exp 0] 19 msec 21 msec 20 msec
  4 10.12.14.12 [AS 10] 20 msec 15 msec 14 msec
  5 10.12.14.14 [AS 10] 14 msec 14 msec 15 msec
  6 35.9.14.9 [AS 10] [MPLS: Labels 9001/5009 Exp 0] 45 msec 56 msec 51 msec
  7 10.5.13.5 [AS 10] [MPLS: Label 5009 Exp 0] 22 msec 8 msec 11 msec
  8 10.5.13.13 [AS 10] 25 msec 27 msec 29 msec

Earlier I briefly discussed the need for the spokes to learn the prefix 12.24.0.0/16 despite having a default route. To prove its value, we temporarily shut down CSR10’s BGP session to CSR8 (not shown). Because CSR10 represents the Internet, we fully expect that the spokes will no longer have Internet reachability. They should, however, continue to have central services and “hairpin” services via XRv2. Checking CSR7, we can see that it retains both of these BGP routes but no longer learns a default dynamically. Without the 12.24.0.0/16 route, the central services subnets would be unreachable even if the Internet were operational, since the default route leads to CSR10. This is more of a basic L3VPN/routing issue than an HDVRF issue, but integrating an intelligent routing design with HDVRF is important.

R7#show ip route bgp | include ^B
B        12.24.0.0 [20/0] via 10.5.7.5, 00:34:33
B        100.0.0.0/8 [20/0] via 10.5.7.5, 00:10:44

CSR7 will not have Internet access since there is no route, but has full access to XRv2’s services. We can tell by the long latency on the spoke-to-spoke ping that traffic is still being hairpinned via XRv2.

R7#show ip cef 10.10.10.10
0.0.0.0/0
  no route

R7#ping 100.13.13.13 source 100.7.7.7
[snip]
!!!!!
Success rate is 100 percent (5/5), round-trip min/avg/max = 9/25/39 ms

R7#ping 12.24.48.1 source 100.7.7.7
[snip]
!!!!!
Success rate is 100 percent (5/5), round-trip min/avg/max = 4/6/13 ms

Additional Reading – Reference configurations "hdvrf"

38.7 BGP Local Convergence (VRF Local Protection)


This feature is difficult to classify by name since it is a combination of BGP, HA, L3VPN, and VRF features. It is designed to speed up BGP convergence on a local egress PE by protecting against a PE-CE link failure. Normally, when the PE-CE link fails, the eBGP session running over that link also fails. This will occur as a result of the “fast-external-fallover” feature enabled by default if the link drops line-protocol. If the line-protocol doesn’t fail, such as would be the case in many Ethernet last-mile deployments, the BGP keepalive timer (or possibly BFD timer) must expire. Once this happens, the PE that suffered the link failure will withdraw the prefix from the BGP topology, which typically involves sending the MP_UNREACH_NLRI towards the RR. Assuming that BGP convergence is slow (that is to say, it takes time for the remote PEs to learn this update to select better paths), this can result in a significant amount of dropped traffic. Like TE-FRR in a way, VPN traffic destined for hosts behind the PE suffering the link failure will still be sent towards that PE. Here, the traffic would typically be blackholed while BGP was converging. With local protection, this egress PE will swap the VPN label to the VPN label of the backup egress PE and send the packet back into the MPLS core, effectively performing local repair until BGP converges. There are two key network design requirements in order to use this feature: the CE site must be multi-homed to at least two different PEs, and the primary egress PE must learn the backup PE’s VPN route. The second requirement can be achieved many ways: BGP additional-paths, diverse RDs, etc. The network diagram is shown below. CSR9 and CSR10 are in the same VPN and each one is multi-homed to two PEs.
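The local-repair decision on the primary egress PE can be illustrated with a short Python sketch. It is a conceptual model only and makes several assumptions: the class and function names are invented, the interface name and both label values are hypothetical placeholders, and the real platform behavior (timers, how the backup path is programmed) is not modelled. It only captures the idea that, when the PE-CE link is down, the VPN label is swapped for the backup PE’s VPN label and the packet is sent back into the core.

```python
from dataclasses import dataclass


@dataclass
class ProtectedVpnEntry:
    """One VPN prefix on the primary egress PE with a pre-installed backup."""
    prefix: str
    ce_interface: str        # primary path: the local PE-CE link
    backup_pe: str           # loopback of the backup egress PE
    backup_vpn_label: int    # VPN label advertised by the backup PE
    transport_label: int     # label towards the backup PE's loopback


def forward(entry, pe_ce_link_up):
    """Return (encapsulation, next hop, label stack) for a packet to entry.prefix."""
    if pe_ce_link_up:
        # Normal case: pop the VPN label and deliver natively to the CE.
        return ("ip", entry.ce_interface, [])
    # Local repair: swap to the backup PE's VPN label and push a transport
    # label so the core can carry the packet to the backup PE.
    return ("mpls", entry.backup_pe,
            [entry.transport_label, entry.backup_vpn_label])


# Hypothetical values: a primary PE protecting CSR9's host route via a
# backup PE at 212.0.0.8. Interface and labels are made up for illustration.
entry = ProtectedVpnEntry(
    prefix="9.10.0.9/32",
    ce_interface="PE-CE link",
    backup_pe="212.0.0.8",
    backup_vpn_label=8020,
    transport_label=12008,
)

print(forward(entry, pe_ce_link_up=True))
print(forward(entry, pe_ce_link_up=False))
```

The sketch also shows why the second design requirement exists: without knowing the backup PE’s VPN route, the primary PE would have no `backup_vpn_label` to swap to.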

Before beginning any advanced configuration, we verify the basic IGP, LDP, and BGP components. IS-IS level-2 is running on all nodes. A quick check of XRv1’s LSPDB shows 6 LSPs, each with the appropriate number of transit links. We also verify that each LSR advertises a single loopback into the IGP. This single command verifies the entire IGP topology.

RP/0/0/CPU0:XRv1#show isis database detail | include "^[RX]|Extended"
R5.00-00              0x0000000e   0x9ec1        1015            0/0/0
  Metric: 10         IS-Extended XRv1.00
  Metric: 0          IP-Extended 212.0.0.5/32
R6.00-00              0x0000000d   0x96a5        697             0/0/0
  Metric: 10         IS-Extended XRv2.00
  Metric: 10         IS-Extended XRv1.00
  Metric: 0          IP-Extended 212.0.0.6/32
R7.00-00              0x00000009   0x023a        443             0/0/0
  Metric: 10         IS-Extended XRv2.00
  Metric: 10         IS-Extended XRv1.00
  Metric: 0          IP-Extended 212.0.0.7/32
R8.00-00              0x00000009   0xa4b3        887             0/0/0
  Metric: 10         IS-Extended XRv2.00
  Metric: 0          IP-Extended 212.0.0.8/32
XRv1.00-00          * 0x0000000c   0xef36        1123            0/0/0
  Metric: 10         IS-Extended R6.00
  Metric: 10         IS-Extended R5.00
  Metric: 10         IS-Extended R7.00
  Metric: 0          IP-Extended 212.0.0.11/32
XRv2.00-00            0x00000009   0xd150        951             0/0/0
  Metric: 10         IS-Extended R6.00
  Metric: 10         IS-Extended R8.00
  Metric: 10         IS-Extended R7.00
  Metric: 0          IP-Extended 212.0.0.12/32

Since there are 6 LSRs in the network, we expect to see exactly 6 label bindings, one per loopback. This is because each LSR is only allocating labels for host routes. Checking the core routers, we can see all LDP sessions are up with 6 IPv4-based labels exchanged between each set of peers.

RP/0/0/CPU0:XRv1#show mpls ldp neighbor brief
Peer               GR  NSR  Up Time     Discovery   Addresses   Labels
                                        ipv4  ipv6  ipv4  ipv6  ipv4   ipv6
-----------------  --  ---  ----------  ----------  ----------  ------------
212.0.0.6:0        N   N    01:10:32    1     0     3     0     6      0
212.0.0.5:0        N   N    01:10:26    1     0     2     0     6      0
212.0.0.7:0        N   N    01:09:22    1     0     3     0     6      0

RP/0/0/CPU0:XRv2#show mpls ldp neighbor brief
Peer               GR  NSR  Up Time     Discovery   Addresses   Labels
                                        ipv4  ipv6  ipv4  ipv6  ipv4   ipv6
-----------------  --  ---  ----------  ----------  ----------  ------------
212.0.0.6:0        N   N    01:12:04    1     0     3     0     6      0
212.0.0.7:0        N   N    01:12:04    1     0     3     0     6      0
212.0.0.8:0        N   N    01:12:04    1     0     2     0     6      0

XRv1 is the RR for both VPNv4 and VPNv6. We quickly verify that both sessions are up to all 4 PEs. Ignore the route counts for now as those will make sense once we verify VPN connectivity.

RP/0/0/CPU0:XRv1#show bgp vpnv4 unicast summary | begin ^Neigh
Neighbor        Spk    AS MsgRcvd MsgSent   TblVer  InQ OutQ  Up/Down  St/PfxRcd
212.0.0.5         0   212      96     137       45    0    0 01:12:46          1
212.0.0.6         0   212      92     136       45    0    0 01:12:33          1
212.0.0.7         0   212     118     141       45    0    0 00:27:34          1
212.0.0.8         0   212     115     139       45    0    0 00:27:39          1

RP/0/0/CPU0:XRv1#show bgp vpnv6 unicast summary | begin ^Neigh
Neighbor        Spk    AS MsgRcvd MsgSent   TblVer  InQ OutQ  Up/Down  St/PfxRcd
212.0.0.5         0   212      96     137       38    0    0 01:12:49          1
212.0.0.6         0   212      92     136       38    0    0 01:12:35          1
212.0.0.7         0   212     118     141       38    0    0 00:27:37          1
212.0.0.8         0   212     115     139       38    0    0 00:27:41          1

Next, we quickly verify the PE-CE routing. The local protection feature supports all protocols (including static routing) except for IS-IS within the IPv4 AFI, while IPv6 only supports BGP and static routing. To test as many combinations as possible, we will use a variety of PE-CE configurations. CSR9 will use BGP for both IPv4 and IPv6. The configuration is not shown since it is very basic, but we quickly verify that all sessions are functional from the perspective of CSR9. The only interesting configuration is that CSR9 is configured to always compare BGP RID; this ensures that CSR7 is always the primary next-hop since the eBGP oldest-route comparison is skipped.

! CSR9
router bgp 65009
 bgp bestpath compare-routerid

R9#show bgp ipv4 unicast summary | begin ^Neigh
Neighbor        V    AS MsgRcvd MsgSent   TblVer  InQ OutQ Up/Down  State/PfxRcd
10.7.9.7        4   212      29      31       31    0    0 00:20:45        1
10.8.9.8        4   212      43      46       31    0    0 00:31:29        1

R9#show bgp ipv6 unicast summary | begin ^Neigh
Neighbor        V    AS MsgRcvd MsgSent   TblVer  InQ OutQ Up/Down  State/PfxRcd
FD00:10:7:9::7  4   212      29      30       34    0    0 00:20:49        1
FD00:10:8:9::8  4   212      42      45       34    0    0 00:31:30        1

CSR10 will use OSPFv2 for IPv4 and static routing for IPv6. Both CSR5 and CSR6, as PEs, are never allowed to be the OSPF DR. If CSR10 fails, there is no reason for CSR5 and CSR6 to form a neighborship as no routes need to be exchanged laterally. The IPv6 static routing is basic; note that the PE-CE LAN segment does not have any routable IPv6 addressing as it is not required. CSR10 uses different administrative distances for selection. CSR5 is always the primary next-hop as a result.

! CSR5 and CSR6
interface GigabitEthernet2.550
 ip ospf priority 0
ipv6 route vrf 910 910:10:10:10::10/128 GigabitEthernet2.550 FE80::10

! CSR10
ipv6 route ::/0 GigabitEthernet2.550 FE80::6 8
ipv6 route ::/0 GigabitEthernet2.550 FE80::5 4

Because the local protection feature requires the primary PE (CSR5 and CSR7 in this case) to learn the iBGP route via the backup PE, we used unique RDs on all PEs. BGP additional-paths would have been appropriate also, but given XE’s limited support for this feature within VPNv4/v6, unique RD was a simpler solution. We quickly check XRv1, the RR, to ensure that each PE is using a different RD. CSR9 and CSR10 contribute one route each as shown below. It is between these customer host addresses that we will test local protection. The RDs are highlighted for clarity; the key is that they are all different.

RP/0/0/CPU0:XRv1#show bgp vpnv4 unicast | begin Network
   Network            Next Hop            Metric LocPrf Weight Path
Route Distinguisher: 212:9105
*>i9.10.0.10/32       212.0.0.5               20    100      0 ?
Route Distinguisher: 212:9106
*>i9.10.0.10/32       212.0.0.6               20    100      0 ?
Route Distinguisher: 212:9107
*>i9.10.0.9/32        212.0.0.7                0    100      0 65009 ?
Route Distinguisher: 212:9108
*>i9.10.0.9/32        212.0.0.8                0    100      0 65009 ?

RP/0/0/CPU0:XRv1#show bgp vpnv6 unicast | begin Network
   Network                 Next Hop       Metric LocPrf Weight Path
Route Distinguisher: 212:9105
*>i910:10:10:10::10/128    212.0.0.5           0    100      0 ?
Route Distinguisher: 212:9106
*>i910:10:10:10::10/128    212.0.0.6           0    100      0 ?
Route Distinguisher: 212:9107
*>i910:9:9:9::9/128        212.0.0.7           0    100      0 65009 ?
Route Distinguisher: 212:9108
*>i910:9:9:9::9/128        212.0.0.8           0    100      0 65009 ?
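The reason unique RDs help can be sketched with a toy model. This is only an illustration, assuming a route reflector that advertises a single best path per NLRI (no additional-paths); the function name `reflect` and the shared RD value `212:910` are hypothetical and do not come from the lab.

```python
# Toy model: a route reflector that keeps one best path per NLRI.
# For VPNv4 the NLRI is the pair (RD, prefix), so the RD decides
# whether two PE advertisements collide or coexist.

def reflect(paths):
    """Return {nlri: next_hop} as a single-best-path RR would reflect it."""
    best = {}
    for nlri, next_hop in paths:
        # The first path seen stands in for real best-path selection.
        best.setdefault(nlri, next_hop)
    return best

# Hypothetical design: both PEs share one RD, so the NLRIs are identical.
shared_rd = [(("212:910", "9.10.0.9/32"), "212.0.0.7"),
             (("212:910", "9.10.0.9/32"), "212.0.0.8")]

# The lab's design: each PE has its own RD, so the NLRIs differ.
unique_rd = [(("212:9107", "9.10.0.9/32"), "212.0.0.7"),
             (("212:9108", "9.10.0.9/32"), "212.0.0.8")]

shared_result = reflect(shared_rd)
unique_result = reflect(unique_rd)
print(len(shared_result), "path reflected with a shared RD")
print(len(unique_result), "paths reflected with unique RDs")
```

With unique RDs, the primary PE receives the backup PE's path from the RR, which is the prerequisite for local protection described above.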

Because all of our PE-CE links are Ethernet-based, BGP’s fast-external-fallover will not be valuable for detecting PE-CE link failures in a timely manner. We will use BFD to achieve less than 3 seconds detection time; the relatively slow timer is used given the CSR1000v’s 100 kbps rate-limit. The BFD configuration is not discussed in detail since there is a dedicated chapter for it. For brevity, only the CE configurations are shown, but the BFD configurations are very similar on both ends of the PE-CE link.

! CSR9
router bgp 65009
 neighbor 10.7.9.7 fall-over bfd single-hop
 neighbor 10.8.9.8 fall-over bfd single-hop
 neighbor FD00:10:7:9::7 fall-over bfd single-hop
 neighbor FD00:10:8:9::8 fall-over bfd single-hop
interface GigabitEthernet2.579
 bfd interval 900 min_rx 900 multiplier 3
interface GigabitEthernet2.589
 bfd interval 900 min_rx 900 multiplier 3

! CSR10
interface GigabitEthernet2.550
 bfd interval 900 min_rx 900 multiplier 3
router ospf 910
 bfd all-interfaces
ipv6 route static bfd GigabitEthernet2.550 FE80::5
ipv6 route static bfd GigabitEthernet2.550 FE80::6
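The "less than 3 seconds" figure follows from the timers above. As a rough sketch, assuming both ends negotiate the same 900 ms interval and use the same multiplier (the helper name `bfd_detection_ms` is ours, not a Cisco term), the detection time is the interval multiplied by the multiplier:

```python
def bfd_detection_ms(interval_ms, multiplier):
    """Approximate BFD detection time for symmetric timers, in milliseconds."""
    return interval_ms * multiplier

# Values taken from "bfd interval 900 min_rx 900 multiplier 3".
detect = bfd_detection_ms(900, 3)
print(f"{detect} ms, about {detect / 1000:.1f} seconds")
```

Three consecutive missed packets at 900 ms each gives 2700 ms, which stays under the 3 second target.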

From the CE routers, we quickly verify that BFD is operational. Each CE should have 4 operational BFD sessions, two per AFI.

R9#show bfd neighbors

IPv4 Sessions
NeighAddr            LD/RD         RH/RS     State     Int
10.7.9.7             4097/4097     Up        Up        Gi2.579
10.8.9.8             4101/4097     Up        Up        Gi2.589

IPv6 Sessions
NeighAddr            LD/RD         RH/RS     State     Int
FD00:10:7:9::7       2/1           Up        Up        Gi2.579
FD00:10:8:9::8       1/2           Up        Up        Gi2.589

R10#show bfd neighbors

IPv4 Sessions
NeighAddr            LD/RD         RH/RS     State     Int
10.5.10.5            4097/4097     Up        Up        Gi2.550
10.5.10.6            4098/4097     Up        Up        Gi2.550

IPv6 Sessions
NeighAddr            LD/RD         RH/RS     State     Int
FE80::5              1/1           Up        Up        Gi2.550
FE80::6              2/1           Up        Up        Gi2.550

Quickly note that a positive side effect of removing CSR5 and CSR6 from the DR election is that they do not establish a BFD session with one another. OSPF only establishes BFD sessions to the DR and BDR as discussed in the BFD chapter.

R6#show bfd neighbors client ospf

IPv4 Sessions
NeighAddr            LD/RD         RH/RS     State     Int
10.5.10.10           4097/4098     Up        Up        Gi2.550

To test connectivity over the basic L3VPN, we use traceroute for brevity. Notice that CSR9 will always use CSR7 and CSR10 will always use CSR5, assuming CSR7 and CSR5 are up. These PEs are used for ingress and egress by default; when comparing iBGP paths, the ingress PEs will select the egress PE with the lower BGP RID. This was done intentionally to converge the network deterministically and without advanced BGP best-path adjustments.

R10#traceroute 9.10.0.9 source 9.10.0.10
Type escape sequence to abort.
Tracing the route to 9.10.0.9
VRF info: (vrf in name/id, vrf out name/id)
  1 10.5.10.5 6 msec 4 msec 4 msec
  2 212.5.11.11 [MPLS: Labels 91003/7008 Exp 0] 8 msec 8 msec 7 msec
  3 10.7.9.7 [MPLS: Label 7008 Exp 0] 16 msec 16 msec 14 msec
  4 10.7.9.9 20 msec 11 msec 14 msec

R9#traceroute 9.10.0.10 source 9.10.0.9
Type escape sequence to abort.
Tracing the route to 9.10.0.10
VRF info: (vrf in name/id, vrf out name/id)
  1 10.7.9.7 5 msec 5 msec 3 msec
  2 212.7.11.11 [MPLS: Labels 91000/5001 Exp 0] 8 msec 8 msec 8 msec
  3 10.5.10.5 [MPLS: Label 5001 Exp 0] 10 msec 16 msec 13 msec
  4 10.5.10.10 22 msec 15 msec 10 msec

Before enabling local protection on the PEs, we will verify that the primary PEs (CSR5 and CSR7), at a minimum, have the iBGP paths from the backup PEs (CSR6 and CSR8). These do not need to be installed as “BGP backup paths” in the literal sense, but the routers need to have them. For variety, we check the local CE route on CSR5 using IPv4 and CSR7 using IPv6. The key word is “local”; this feature only provides FRR service for traffic egressing the core towards the customer. The primary paths are highlighted in yellow and the backup paths are in green. Notice that the label operation for the primary path is a basic disposition (the VPN label is removed and the outgoing label is “nolabel”), while the backup path requires a VPN label swap. This makes sense since the backup PE changed the VPN next-hop, so BGP must perform a label swap to that backup PE’s local label. This label swapping intelligence has nothing to do with local protection as this is a fundamental component of BGP/MPLS interaction.

R5#show bgp vpnv4 unicast vrf 910 9.10.0.10/32
BGP routing table entry for 212:9105:9.10.0.10/32, version 35
Paths: (2 available, best #2, table 910)
  Advertised to update-groups:
     1
  Refresh Epoch 1
  Local, imported path from 212:9106:9.10.0.10/32 (global)
    212.0.0.6 (metric 20) (via default) from 212.0.0.11 (212.0.0.11)
      Origin incomplete, metric 20, localpref 100, valid, internal
      Extended Community: RT:212:910 OSPF DOMAIN ID:0x0005:0x0000038E0200
        OSPF RT:0.0.0.0:5:1 OSPF ROUTER ID:10.5.10.6:0
      Originator: 212.0.0.6, Cluster list: 212.0.0.11
      mpls labels in/out 5001/6033
      rx pathid: 0, tx pathid: 0
  Refresh Epoch 1
  Local
    10.5.10.10 (via vrf 910) from 0.0.0.0 (212.0.0.5)
      Origin incomplete, metric 20, localpref 100, weight 32768, valid, sourced, best
      Extended Community: RT:212:910 OSPF DOMAIN ID:0x0005:0x0000038E0200
        OSPF RT:0.0.0.0:5:1 OSPF ROUTER ID:10.5.10.5:0
      mpls labels in/out 5001/nolabel
      rx pathid: 0, tx pathid: 0x0

R7#show bgp vpnv6 unicast vrf 910 910:9:9:9::9/128
BGP routing table entry for [212:9107]910:9:9:9::9/128, version 38
Paths: (2 available, best #1, table 910)
  Advertised to update-groups:
     5
  Refresh Epoch 2
  65009
    FD00:10:7:9::9 (FE80::9) (via vrf 910) from FD00:10:7:9::9 (9.10.0.9)
      Origin incomplete, metric 0, localpref 100, valid, external, best
      Extended Community: RT:212:910
      mpls labels in/out 7001/nolabel
      rx pathid: 0, tx pathid: 0x0
  Refresh Epoch 1
  65009, imported path from [212:9108]910:9:9:9::9/128 (global)
    ::FFFF:212.0.0.8 (metric 20) (via default) from 212.0.0.11 (212.0.0.11)
      Origin incomplete, metric 0, localpref 100, valid, internal
      Extended Community: RT:212:910
      Originator: 212.0.0.8, Cluster list: 212.0.0.11
      mpls labels in/out 7001/8015
      rx pathid: 0, tx pathid: 0

Before enabling local protection, we will break the link between CSR7 and CSR9 by changing the VLAN encapsulation (not shown). BGP update debugging is enabled on CSR7 for VPNv4 and VPNv6. Although CSR7 immediately installs the backup VPN route via CSR8, it withdraws its own route immediately to trigger a BGP convergence event. This is a good thing; there is no reason for CSR7 to continue advertising its ability to be an egress PE since this is no longer true.

R7#debug bgp vpnv4 unicast updates out
BGP updates debugging is on (outbound) for address family: VPNv4 Unicast
BGP(4): Revise route installing 1 of 1 routes for 9.10.0.9/32 -> 212.0.0.8(910) to 910 IP table
BGP(4): (base) 212.0.0.11 send unreachable (format) 212:9107:9.10.0.9/32
BGP(4): 212.0.0.11 rcv UPDATE about 212:9107:9.10.0.9/32 -- withdrawn, label 524288

R7#debug bgp vpnv6 unicast updates out
BGP updates debugging is on (outbound) for address family: VPNv4 Unicast VPNv6 Unicast
BGP(5): Revise route installing [212:9107]910:9:9:9::9/128 -> ::FFFF:212.0.0.8 (::) to IPv6 910 table
BGP(5): (base) 212.0.0.11 send unreachable (format) [212:9107]910:9:9:9::9/128
BGP(5): 212.0.0.11 rcv UPDATE about [212:9107]910:9:9:9::9/128 -- withdrawn, label 524288

The fundamental issue is more subtle; CSR7 also immediately deallocates its local labels for those prefixes since it no longer considers itself an egress PE for them. This means that any traffic that may already be transiting the core will be dropped when CSR7 receives it, causing a short-term black hole. Until BGP converges around the failure, CSR7 will continue to drop packets received from the core. No matter how fast BGP converges, the packets that are already transiting the core would still be dropped.

R7#show bgp vpnv4 unicast vrf 910 labels
   Network          Next Hop          In label/Out label
Route Distinguisher: 212:9107 (910)
   9.10.0.9/32      212.0.0.8         nolabel/8012
   9.10.0.10/32     212.0.0.6         nolabel/6033
                    212.0.0.5         nolabel/5001

R7#show bgp vpnv6 unicast vrf 910 labels
   Network                 Next Hop           In label/Out label
Route Distinguisher: 212:9107 (910)
   910:9:9:9::9/128        ::FFFF:212.0.0.8   nolabel/8015
   910:10:10:10::10/128    ::FFFF:212.0.0.6   nolabel/6020
                           ::FFFF:212.0.0.5   nolabel/5028

Local protection doesn’t change the operation of BGP routing so much as it changes the label retention mechanism. Rather than immediately deallocating the local label for that prefix, the egress PE will retain it for 5 minutes, giving BGP plenty of time to converge. Packets that arrive on the egress PE will have their VPN label swapped per the LFIB, and any additional transport labels will be pushed as well. This behavior is similar to the PLR in TE-FRR and is only meant to be a temporary backup while BGP converges. The feature does not have any knobs and is simple to enable for IPv4 and IPv6; we enable it on all PEs. Technically, the feature only needs to be enabled on the primary PE, but we can enable it everywhere in case we decide to modify the BGP best-path selection algorithm later to prefer CSR6 or CSR8. The feature isn’t explicitly supported in XR, but using BGP additional-paths, you could protect PE-CE links as seen in the “add-path” section. Once configured, we verify it by checking the VRF details.

! CSR5, CSR6, CSR7, and CSR8
vrf definition 910
 address-family ipv4
  protection local-prefixes
 address-family ipv6
  protection local-prefixes

R7#show vrf detail 910 | include ^Addr|Local
Address family ipv4 unicast (Table ID = 0x2):
  Local prefix protection enabled
Address family ipv6 unicast (Table ID = 0x1E000001):
  Local prefix protection enabled
Address family ipv4 multicast not active

Because BGP converges extremely fast in such a small network with so few prefixes, we need to slow it down. Using the “advertisement-interval” on each PE towards the RR, we can effectively limit how frequently the PEs can send successive updates/withdrawals across the network. We will introduce a long timer of 180 seconds (3 minutes) onto each PE. This means that the failure of a PE-CE link could take up to 3 minutes to converge, giving us time to verify that local protection works.

! CSR5, CSR6, CSR7, and CSR8
router bgp 212
 address-family vpnv4
  neighbor 212.0.0.11 advertisement-interval 180
 address-family vpnv6
  neighbor 212.0.0.11 advertisement-interval 180

After adjusting the advertisement interval, I quickly flap CSR9’s loopback. We can check the VPNv4/v6 update-groups on CSR7 to watch it count down as a result of seeing this instability. For the next ~3 minutes, CSR7 cannot send any kind of BGP updates to XRv1, the RR. We also ensure that CSR10’s VPN traffic is still egressing through CSR7. The VPN label allocated by CSR7 is 7003; this is the label we are trying to retain until BGP converges.

R7#show bgp vpnv4 unicast all update-group 212.0.0.11 | include Minimum
  Minimum time between advertisement runs is 180 seconds (expires in 176 seconds)

R10#traceroute 9.10.0.9 source 9.10.0.10
Type escape sequence to abort.
Tracing the route to 9.10.0.9
VRF info: (vrf in name/id, vrf out name/id)
  1 10.5.10.5 4 msec 5 msec 4 msec
  2 212.5.11.11 [MPLS: Labels 91003/7003 Exp 0] 8 msec 7 msec 7 msec
  3 10.7.9.7 [MPLS: Label 7003 Exp 0] 12 msec 21 msec 15 msec
  4 10.7.9.9 19 msec 11 msec 10 msec


Moving quickly, we change the VLAN tag on CSR9 again to break the PE-CE link (not shown). Because CSR7 still has about 2 minutes remaining on its advertisement-interval, CSR7 cannot notify XRv1 about this change quite yet. Traffic from CSR10 will continue to flow towards CSR7 until the network is notified about the change. CSR7 knows that 212.0.0.8 (CSR8) is the correct next-hop for actually delivering the traffic. The difference is that its local label of 7003 is swapped to CSR8’s local label of 8001 and sent towards 212.0.0.8. Without local protection, label 7003 would have been immediately deallocated.

R7#show bgp vpnv4 unicast all update-group 212.0.0.11 | include Minimum
  Minimum time between advertisement runs is 180 seconds (expires in 123 seconds)

R7#show bgp vpnv4 unicast vrf 910 9.10.0.9/32
BGP routing table entry for 212:9107:9.10.0.9/32, version 12
Paths: (1 available, best #1, table 910)
Flag: 0x820
  Advertised to update-groups: (Pending Update Generation)
     9
  Refresh Epoch 1
  65009, imported path from 212:9108:9.10.0.9/32 (global)
    212.0.0.8 (metric 20) (via default) from 212.0.0.11 (212.0.0.11)
      Origin incomplete, metric 0, localpref 100, valid, internal, best
      Extended Community: RT:212:910
      Originator: 212.0.0.8, Cluster list: 212.0.0.11
      mpls labels in/out 7003/8001
      rx pathid: 0, tx pathid: 0x0

Since CSR8 is multiple hops away, CSR7 must add additional MPLS encapsulation. The route to CSR8’s loopback is learned via IGP, so an LDP label is used. In this case, label 92001 is pushed atop the stack to create {92001 8001}. Like a TE-FRR PLR, CSR7 adds additional labels to tunnel traffic towards the correct egress PE. This is why the primary PE must have the backup PE’s route; if CSR7 didn’t already have this iBGP path, local protection would not be useful.

R7#show ip route 212.0.0.8
Routing entry for 212.0.0.8/32
  Known via "isis", distance 115, metric 20, type level-2
  Redistributing via isis 212
  Last update from 212.7.12.12 on GigabitEthernet2.572, 02:21:12 ago
  Routing Descriptor Blocks:
  * 212.7.12.12, from 212.0.0.8, 02:21:12 ago, via GigabitEthernet2.572
      Route metric is 20, traffic share count is 1

R7#show mpls ldp bindings 212.0.0.8 32 neighbor 212.0.0.12
  lib entry: 212.0.0.8/32, rev 10
        remote binding: lsr: 212.0.0.12:0, label: 92001


XRv2 performs PHP to expose label 8001 to CSR8. CSR8 removes all labels and delivers traffic to CSR9 as a normal egress PE.

RP/0/0/CPU0:XRv2#show mpls forwarding labels 92001
Local  Outgoing    Prefix             Outgoing      Next Hop        Bytes
Label  Label       or ID              Interface                     Switched
------ ----------- ------------------ ------------- --------------- ----------
92001  Pop         212.0.0.8/32       Gi0/0/0/0.582 212.8.12.8      56767

R8#show mpls forwarding-table labels 8001 detail
Local      Outgoing   Prefix           Bytes Label   Outgoing   Next Hop
Label      Label      or Tunnel Id     Switched      interface
8001       No Label   9.10.0.9/32[V]   3672          Gi2.589    10.8.9.9
        MAC/Encaps=18/18, MRU=1504, Label Stack{}
        005056A9D672005056A9FB1C81000E050800
        VPN route: 910
        No output feature configured

We can use traceroute from CSR10 to confirm this new path. Notice that label 7003 is swapped for label 8001 at CSR7 as shown below.

R10#traceroute 9.10.0.9 source 9.10.0.10
Type escape sequence to abort.
Tracing the route to 9.10.0.9
VRF info: (vrf in name/id, vrf out name/id)
  1 10.5.10.5 7 msec 4 msec 4 msec
  2 212.5.11.11 [MPLS: Labels 91003/7003 Exp 0] 9 msec 8 msec 7 msec
  3 212.7.11.7 [MPLS: Label 7003 Exp 0] 27 msec 31 msec 30 msec
  4 212.7.12.12 [MPLS: Labels 92001/8001 Exp 0] 31 msec 31 msec 31 msec
  5 10.8.9.8 [MPLS: Label 8001 Exp 0] 20 msec 17 msec 15 msec
  6 10.8.9.9 20 msec 10 msec 15 msec
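The label operations along the protected path can be traced with a small model. This is a sketch only: the `LFIB` dictionary and the `forward` helper are illustrative constructs of ours, populated with the label values observed in this lab, and they do not represent how any router stores its forwarding state.

```python
# Each entry maps an incoming top label to
# (labels that replace it, next node). Index 0 is the top of the stack.
LFIB = {
    "CSR7": {7003: ([92001, 8001], "XRv2")},  # VPN swap 7003->8001, push LDP 92001
    "XRv2": {92001: ([], "CSR8")},            # PHP: pop the transport label
    "CSR8": {8001: ([], "CSR9")},             # VPN disposition towards the CE
}

def forward(node, stack):
    """Walk a labeled packet through the toy LFIBs and record each hop."""
    trace = [(node, list(stack))]
    while node in LFIB:
        top, rest = stack[0], stack[1:]
        replacement, node = LFIB[node][top]
        stack = replacement + rest
        trace.append((node, list(stack)))
    return trace

trace = forward("CSR7", [7003])
for hop, labels in trace:
    print(hop, labels)
```

The trace shows the packet arriving at CSR7 with {7003}, leaving towards XRv2 with {92001 8001}, reaching CSR8 with {8001}, and arriving at CSR9 unlabeled, which matches the traceroute hops.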

At some point, CSR7’s advertisement-interval will expire, which will trigger the much-needed BGP reconvergence event. CSR5 will be forced to select CSR8 as the best egress point; we can determine this by checking the VRF-aware FIB on CSR5. Label 8001 is at the bottom of the stack, which we already know is CSR8’s VPNv4 label for CSR9’s loopback. Traceroute from CSR10 also confirms this new LSP as there is no swapping of the VPN label anymore.

R5#show ip cef vrf 910 9.10.0.9/32
9.10.0.9/32
  nexthop 212.5.11.11 GigabitEthernet2.551 label 91004 8001

R10#traceroute 9.10.0.9 source 9.10.0.10
Type escape sequence to abort.
Tracing the route to 9.10.0.9
VRF info: (vrf in name/id, vrf out name/id)
  1 10.5.10.5 6 msec 4 msec 4 msec
  2 212.5.11.11 [MPLS: Labels 91004/8001 Exp 0] 9 msec 8 msec 9 msec
  3 212.7.11.7 [MPLS: Labels 7009/8001 Exp 0] 32 msec 31 msec 31 msec
  4 212.7.12.12 [MPLS: Labels 92001/8001 Exp 0] 31 msec 31 msec 31 msec
  5 10.8.9.8 [MPLS: Label 8001 Exp 0] 16 msec 16 msec 15 msec
  6 10.8.9.9 24 msec 10 msec 10 msec

Since the advertisement-interval was 2 minutes less than the time local protection retains local labels (always 5 minutes), we wait a bit longer before checking CSR7. After those 5 minutes expire, we can see that label 7003 has been totally deallocated as local protection is assumed to no longer be required.

R7#show bgp vpnv4 unicast vrf 910 9.10.0.9/32
BGP routing table entry for 212:9107:9.10.0.9/32, version 13
Paths: (1 available, best #1, table 910)
  Not advertised to any peer
  Refresh Epoch 1
  65009, imported path from 212:9108:9.10.0.9/32 (global)
    212.0.0.8 (metric 20) (via default) from 212.0.0.11 (212.0.0.11)
      Origin incomplete, metric 0, localpref 100, valid, internal, best
      Extended Community: RT:212:910
      Originator: 212.0.0.8, Cluster list: 212.0.0.11
      mpls labels in/out nolabel/8001
      rx pathid: 0, tx pathid: 0x0
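The relationship between the two timers is simple arithmetic, sketched below. The function name `protection_windows` is hypothetical, and the calculation assumes the worst case in which the failure occurs just as a full advertisement interval begins; in the test above, only about 123 seconds remained.

```python
LABEL_RETENTION_S = 300  # local protection retains the label for 5 minutes

def protection_windows(advertisement_interval_s, retention_s=LABEL_RETENTION_S):
    """Return (worst-case seconds traffic depends on the label swap,
    seconds the label lingers after BGP has converged)."""
    relying = min(advertisement_interval_s, retention_s)
    spare = max(retention_s - advertisement_interval_s, 0)
    return relying, spare

relying, spare = protection_windows(180)
print(f"swap needed for up to {relying}s, label retained {spare}s beyond that")
```

If the advertisement interval ever exceeded the retention time, the spare window would drop to zero and traffic could be dropped before BGP converges, which is why the lab keeps the interval below 5 minutes.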

We will quickly test the feature using IPv6 on the other side of the network. Because static routing is in use, we cannot flap CSR10’s loopback to achieve the desired effect. Instead, we will quickly remove and re-add CSR5’s static IPv6 route to CSR10’s loopback. This has the same effect as it will trigger a BGP update from CSR5 and start the advertisement-interval. We check CSR5 to ensure the interval is counting down while also ensuring CSR9 still routes via CSR5 to reach CSR10. Note the original VPNv6 label value of 5032.

R5#show bgp vpnv6 unicast all update-group 212.0.0.11 | include Minimum
  Minimum time between advertisement runs is 180 seconds (expires in 178 seconds)

R9#traceroute ipv6
Target IPv6 address: 910:10:10:10::10
Source address: 910:9:9:9::9
[snip]
  1 FD00:10:7:9::7 3 msec 4 msec 5 msec
  2 ::FFFF:212.0.0.11 [MPLS: Labels 91000/5032 Exp 0] 7 msec 8 msec 9 msec
  3 ::FFFF:212.5.11.5 [MPLS: Label 5032 Exp 0] 22 msec 17 msec 20 msec
  4 910:10:10:10::10 [AS 212] 24 msec 14 msec 16 msec

Next, we break the PE-CE link between CSR5 and CSR10. Since this is a multi-access network, we must pay attention as to where we break the link. Doing so on the CE side would break everything, so we must break it on CSR5. We change the VLAN tag and wait about 3 seconds for BFD to detect the failure (not shown). Once the link fails, we immediately check the VPNv6 BGP entry to ensure the local label value of 5032 has not been deallocated. CSR5 is performing the VPN label swap operation as expected.

R5#show bgp vpnv6 unicast vrf 910 910:10:10:10::10/128
BGP routing table entry for [212:9105]910:10:10:10::10/128, version 17
Paths: (1 available, best #1, table 910)
Flag: 0x820
  Advertised to update-groups: (Pending Update Generation)
     2
  Refresh Epoch 1
  Local, imported path from [212:9106]910:10:10:10::10/128 (global)
    ::FFFF:212.0.0.6 (metric 20) (via default) from 212.0.0.11 (212.0.0.11)
      Origin incomplete, metric 0, localpref 100, valid, internal, best
      Extended Community: RT:212:910
      Originator: 212.0.0.6, Cluster list: 212.0.0.11
      mpls labels in/out 5032/6027
      rx pathid: 0, tx pathid: 0x0

We can confirm this by checking the LFIB details. Before the failure, label 5032 was removed and the raw IPv6 traffic was delivered directly to CSR10. A broken PE-CE link requires CSR5 to swap the VPN label to 6027, CSR6’s label for the same VPN route, then add a transport label to tunnel the packets across the core.

R5#show mpls forwarding-table labels 5032 detail
Local      Outgoing   Prefix                    Bytes Label   Outgoing   Next Hop
Label      Label      or Tunnel Id              Switched      interface
5032       6027       910:10:10:10::10/128[V]   \
                                                1320          Gi2.551    212.5.11.11
        MAC/Encaps=18/26, MRU=1496, Label Stack{91001 6027}
        005056A92DC6005056A9DC6381000DDF8847 163790000178B000
        VPN route: 910
        No output feature configured

We can confirm the path using traceroute on CSR9 again. Notice that XRv1 is seen in the path twice; like TE-FRR, this is a valid design for short bursts of traffic and does not qualify as a “loop”. Also notice the label activity at CSR5; label 5032 is received and label stack {91001 6027} is sent back into the core as verified above. This behavior appears identical between IPv4 and IPv6.

R9#traceroute ipv6
Target IPv6 address: 910:10:10:10::10
Source address: 910:9:9:9::9
[snip]
  1 FD00:10:7:9::7 4 msec 4 msec 5 msec
  2 ::FFFF:212.0.0.11 [MPLS: Labels 91000/5032 Exp 0] 8 msec 8 msec 10 msec
  3 ::FFFF:212.5.11.5 [MPLS: Label 5032 Exp 0] 8 msec 22 msec 23 msec
  4 ::FFFF:212.0.0.11 [MPLS: Labels 91001/6027 Exp 0] 23 msec 22 msec 23 msec
  5 ::FFFF:212.6.11.6 [MPLS: Label 6027 Exp 0] 22 msec 27 msec 23 msec
  6 910:10:10:10::10 [AS 212] 23 msec 21 msec 15 msec


For a brief period of about 2 minutes, we can see that BGP has converged but CSR5 is still retaining its protected local label. This makes sense since 2 minutes is the difference between the advertisement interval and the local protection label retention timer. CSR9 now routes directly via CSR6, indicating BGP has converged correctly. CSR7 uses label 6027 directly at imposition rather than relying on CSR5 to swap it.

R9#traceroute ipv6
Target IPv6 address: 910:10:10:10::10
Source address: 910:9:9:9::9
[snip]
  1 FD00:10:7:9::7 3 msec 5 msec 4 msec
  2 ::FFFF:212.0.0.11 [MPLS: Labels 91001/6027 Exp 0] 113 msec 8 msec 8 msec
  3 ::FFFF:212.6.11.6 [MPLS: Label 6027 Exp 0] 18 msec 18 msec 17 msec
  4 910:10:10:10::10 [AS 212] 28 msec 22 msec 15 msec

R5#show bgp vpnv6 unicast vrf 910 labels
[snip]
   910:10:10:10::10/128    ::FFFF:212.0.0.6   5032/6027

Checking CSR5’s label allocations shortly thereafter, label 5032 has been deallocated and traffic is now flowing solely through CSR6 as verified above.

R5#show bgp vpnv6 unicast vrf 910 labels
[snip]
   910:10:10:10::10/128    ::FFFF:212.0.0.6   nolabel/6027

Additional Reading – Reference configurations "vrf-local-prot"

39. Describe, implement, and troubleshoot Layer 2 failure detection

39.1 Link Aggregation Control Protocol (LACP)

LACP is a way to bond links together to create larger logical channels. This provides two key benefits: high availability and increased bandwidth. HA is inherent because if some number of links in the bundle fail, the remaining links serve as backups automatically. Bandwidth is increased by way of load-sharing across the links, and there are many algorithms to accomplish this. This lab uses two hardware ME3400-24TS-A switches with ten links between them.


LACP is supported on ENI and NNI interfaces only, so the port-type is set to NNI for simplicity. The configuration of LACP is very simple; we specify the LACP bond number and the mode of operation. The channel-protocol is an optional field which ensures no other protocols (like Port Aggregation Protocol, PAgP) get negotiated. LACP active mode means that the local switch actively sends LACPDUs to try and establish the bond, while passive mode means a switch will respond to them. Two switches both running in passive mode cannot form LACP bonds, but two switches in active mode can. Both sides are configured as active on all ten ports. I also configure this as a trunk port to carry multiple VLANs if we decide to test that later, although LACP bonds can be access ports (or layer 3 routed ports). The configuration at the port level can be identical on both sides and for all ports in the bond. Since these ports were originally UNI ports, we also have to “no shutdown” them, as UNI ports are shutdown by default in this particular Metro IOS version.

! ME1 and ME2
interface range FastEthernet0/13 - 22
 port-type nni
 switchport mode trunk
 channel-protocol lacp
 channel-group 1 mode active
 spanning-tree portfast trunk
 no shutdown

With this very basic configuration, we can see that the bond comes up with no issues. Looking at ME1, we see that the first 8 ports are actively used in the bond, while the remaining 2 are in hot-standby mode. LACP supports up to 16 links in a bond, but only 8 can be used at once, with up to 8 standby links. PAgP and static Etherchannel do not support the hot-standby feature. The “SU” flags in the output indicate that the port-channel is in use and that it is switched (running at layer 2). ME2 has the exact same output and is not shown.

ME1#show etherchannel summary
Flags:  D - down        P - bundled in port-channel
        I - stand-alone s - suspended
        H - Hot-standby (LACP only)
        R - Layer3      S - Layer2
        U - in use      f - failed to allocate aggregator
        M - not in use, minimum links not met
        u - unsuitable for bundling
        w - waiting to be aggregated
        d - default port

Number of channel-groups in use: 1
Number of aggregators:           1

Group  Port-channel  Protocol    Ports
------+-------------+-----------+-------------------------------------------
1      Po1(SU)         LACP      Fa0/13(P)   Fa0/14(P)   Fa0/15(P)
                                 Fa0/16(P)   Fa0/17(P)   Fa0/18(P)
                                 Fa0/19(P)   Fa0/20(P)   Fa0/21(H)
                                 Fa0/22(H)

Most of the time, the command above is the only command you really need to verify and troubleshoot LACP. We will examine advanced commands as well since they are not so well-known. First, we can look at the LACP neighbors. The device ID is the neighbor’s MAC address, which is also used for the STP bridge-ID MAC address component. We can prove this by checking the STP information on ME2. All ten ports have the same neighbor ID since there is no multi-homed, multi-chassis LACP being tested here. All ports are using “slow” LACPDUs, which are sent every 30 seconds; this happens automatically once an Etherchannel is stable. The “Age” column resets to 0 when an LACPDU is received and counts up to 30. During the initial setup, “fast” LACPDUs are sent once per second. The port numbers are listed in hexadecimal and are just sequential numbers assigned to ports; this is similar to the STP port ID and is used as a tie-breaker for selecting active links (more on this later). The port state is a bit field which shows what the port is doing, following the LACP state byte defined in IEEE 802.1AX. Code 0x3D has the activity, aggregation, synchronization, collecting, and distributing bits set, which corresponds to a forwarding member. Code 0x5 has only the activity and aggregation bits set, which is consistent with hot-standby since the port is not in sync and is neither collecting nor distributing.

ME1#show lacp 1 neighbor
Flags:  S - Device is requesting Slow LACPDUs
        F - Device is requesting Fast LACPDUs
        A - Device is in Active mode       P - Device is in Passive mode

Channel group 1 neighbors

Partner's information:

                  LACP port                        Admin  Oper   Port    Port
Port      Flags   Priority  Dev ID          Age    key    Key    Number  State
Fa0/13    SA      32768     04c5.a453.d800  22s    0x0    0x1    0x110   0x3D
Fa0/14    SA      32768     04c5.a453.d800  27s    0x0    0x1    0x111   0x3D
Fa0/15    SA      32768     04c5.a453.d800   2s    0x0    0x1    0x112   0x3D
Fa0/16    SA      32768     04c5.a453.d800  24s    0x0    0x1    0x113   0x3D
Fa0/17    SA      32768     04c5.a453.d800   3s    0x0    0x1    0x114   0x3D
Fa0/18    SA      32768     04c5.a453.d800  18s    0x0    0x1    0x115   0x3D
Fa0/19    SA      32768     04c5.a453.d800  24s    0x0    0x1    0x116   0x3D
Fa0/20    SA      32768     04c5.a453.d800  25s    0x0    0x1    0x117   0x3D
Fa0/21    SA      32768     04c5.a453.d800  24s    0x0    0x1    0x118   0x5
Fa0/22    SA      32768     04c5.a453.d800   3s    0x0    0x1    0x119   0x5

ME2#show spanning-tree vlan 1 | section Bridge
  Bridge ID  Priority    32769  (priority 32768 sys-id-ext 1)
             Address     04c5.a453.d800
             Hello Time   2 sec  Max Age 20 sec  Forward Delay 15 sec
             Aging Time  300 sec

If we change a few ports on ME2 to be in passive mode (we can have a variety of modes on the same switch for the same bond), we can see the output changes to reflect that. Changing ports 13 and 14 on ME2 means that ME1 sees these peers as passive now, while all other ports are still active. The partner port state for those two ports also changes from 0x3D to 0x3C, as the low-order (activity) bit is cleared for passive ports.

! ME2
interface range FastEthernet0/13 - 14
 channel-group 1 mode passive

ME1#show lacp 1 neighbor
Flags:  S - Device is requesting Slow LACPDUs
        F - Device is requesting Fast LACPDUs
        A - Device is in Active mode       P - Device is in Passive mode

Channel group 1 neighbors

Partner's information:

                  LACP port                        Admin  Oper   Port    Port
Port      Flags   Priority  Dev ID          Age    key    Key    Number  State
Fa0/13    SP      32768     04c5.a453.d800  16s    0x0    0x1    0x110   0x3C
Fa0/14    SP      32768     04c5.a453.d800  15s    0x0    0x1    0x111   0x3C
Fa0/15    SA      32768     04c5.a453.d800  19s    0x0    0x1    0x112   0x3D
Fa0/16    SA      32768     04c5.a453.d800   2s    0x0    0x1    0x113   0x3D
[snip]
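The port state codes seen in these outputs can be decoded bit by bit. The sketch below assumes the bit ordering of the LACP state byte from IEEE 802.1AX (activity in the least significant bit); the list and function names are ours and are not Cisco terminology.

```python
# LACP state bits, least significant bit first (per IEEE 802.1AX).
LACP_STATE_BITS = [
    "Activity",         # set = active mode, clear = passive mode
    "Timeout",          # set = short (fast) timeout, clear = long (slow)
    "Aggregation",      # link may be aggregated
    "Synchronization",  # link is in sync with the aggregator
    "Collecting",       # receiving frames on this link
    "Distributing",     # sending frames on this link
    "Defaulted",        # using default partner information
    "Expired",          # receive state machine has expired
]

def decode_lacp_state(value):
    """Return the names of the bits set in an LACP port state byte."""
    return [name for bit, name in enumerate(LACP_STATE_BITS) if value & (1 << bit)]

for code in (0x3D, 0x3C, 0x5):
    print(hex(code), decode_lacp_state(code))
```

The only difference between 0x3D and 0x3C is the activity bit, which lines up with the SA versus SP flags, while 0x5 lacks the synchronization, collecting, and distributing bits, as expected for a hot-standby port. The timeout bit is clear in all three codes, matching the slow LACPDUs.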

LACP has the concept of a system-ID, which is loosely analogous to the STP bridge-ID. It is computed the same way; a 2 byte administratively-defined priority is prepended to the MAC address of the switch, and the lower value is considered the decision-maker in terms of which links are used for forwarding. Below, we can see the system-ID on both ME1 and ME2. Both have the default priority of 32768 and their switch global MAC addresses. ME1 has the lower MAC address and, as such, is the decision-maker on which links get aggregated.

ME1#show lacp sys-id
32768, 001d.4692.f700

ME2#show lacp sys-id
32768, 04c5.a453.d800

For variety, we will configure ME2 to be the LACP decision-maker by adjusting its system priority globally, then confirming the change. Changing the decision-maker, in this case from ME1 to ME2, will cause the bond to flap, so don’t do it in production.

! ME2
lacp system-priority 16384

ME2#show lacp sys-id
16384, 04c5.a453.d800
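The decision-maker election is a plain numeric comparison of the system-IDs. A minimal sketch, using the priorities and MAC addresses shown above (the helper names `system_id` and `decision_maker` are hypothetical):

```python
def system_id(priority, mac):
    """Build a comparable LACP system-ID: priority first, then MAC address."""
    return (priority, int(mac.replace(".", ""), 16))

def decision_maker(systems):
    """Return the name of the switch with the numerically lowest system-ID."""
    return min(systems, key=lambda name: system_id(*systems[name]))

# Default priorities: the lower MAC address wins.
before = decision_maker({
    "ME1": (32768, "001d.4692.f700"),
    "ME2": (32768, "04c5.a453.d800"),
})

# After "lacp system-priority 16384" on ME2: priority is compared first.
after = decision_maker({
    "ME1": (32768, "001d.4692.f700"),
    "ME2": (16384, "04c5.a453.d800"),
})
print(before, "->", after)
```

Because the priority occupies the most significant bytes, lowering it on ME2 overrides ME1's lower MAC address.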

We can change the port-priority to select which ports should be used for forwarding. Much like changing the STP port-priority on an upstream switch to influence root port selection on a downstream switch, this is configured on a per-interface basis to select usable ports. Our example has 10 ports, yet only 8 are used for forwarding. The 8 ports with the lowest port-ID (which is the concatenation of the port-priority and the port-number) are used for forwarding. That is why the first 8 ports are used, since their port numbers as seen above are lower. We will reduce the port priorities of ports 21 and 22 on ME1 to observe the effect; we see that nothing changes.

! ME1
interface range FastEthernet0/21 - 22
 lacp port-priority 16384

Looking at the internal details of the bond, ME1 shows that the new port-priority was configured correctly yet the ports are stuck in hot-standby. This is because ME1 is no longer the decision-maker on which ports are used for forwarding in the bond as it has the higher LACP system-ID. This is equivalent to configuring port-priority on a downstream STP switch on the root port and expecting it to do something. Only ME2 can determine which ports can be used for forwarding, so ME1’s port-priorities for this channel are meaningless.

ME1#show lacp internal | include 2[12]
Fa0/21    SA      hot-sby     16384         0x1       0x1     0x118       0x5
Fa0/22    SA      hot-sby     16384         0x1       0x1     0x119       0x5

We will use a similar configuration on ME2 to achieve the desired effect. This change also causes ports in the bond to flap, so don’t do it in production. ME1 still shows its local port-priority of 16384, which is meaningless, but the ports are now active in the bundle since ME2 decided to do this. I show outputs from both ME1 and ME2 for clarity here; we can see that the port-priority column is a reflection of the local configuration only. Now, the ports with the next-worst LACP port-IDs, which are ports 19 and 20, are placed into hot-standby mode.

! ME2
interface range FastEthernet0/21 - 22
 lacp port-priority 8192

ME1#show lacp internal
[snip]
                            LACP port     Admin     Oper    Port        Port
Port      Flags   State     Priority      Key       Key     Number      State
Fa0/13    SA      bndl      32768         0x1       0x1     0x110       0x3D
Fa0/14    SA      bndl      32768         0x1       0x1     0x111       0x3D
Fa0/15    SA      bndl      32768         0x1       0x1     0x112       0x3D
Fa0/16    SA      bndl      32768         0x1       0x1     0x113       0x3D
Fa0/17    SA      bndl      32768         0x1       0x1     0x114       0x3D
Fa0/18    SA      bndl      32768         0x1       0x1     0x115       0x3D
Fa0/19    SA      hot-sby   32768         0x1       0x1     0x116       0x5
Fa0/20    SA      hot-sby   32768         0x1       0x1     0x117       0x5
Fa0/21    SA      bndl      16384         0x1       0x1     0x118       0x3D
Fa0/22    SA      bndl      16384         0x1       0x1     0x119       0x3D

ME2#show lacp internal
[snip]
                            LACP port     Admin     Oper    Port        Port
Port      Flags   State     Priority      Key       Key     Number      State
Fa0/13    SP      bndl      32768         0x1       0x1     0x110       0x3C
Fa0/14    SP      bndl      32768         0x1       0x1     0x111       0x3C
Fa0/15    SA      bndl      32768         0x1       0x1     0x112       0x3D
Fa0/16    SA      bndl      32768         0x1       0x1     0x113       0x3D
Fa0/17    SA      bndl      32768         0x1       0x1     0x114       0x3D
Fa0/18    SA      bndl      32768         0x1       0x1     0x115       0x3D
Fa0/19    SA      hot-sby   32768         0x1       0x1     0x116       0x5
Fa0/20    SA      hot-sby   32768         0x1       0x1     0x117       0x5
Fa0/21    SA      bndl      8192          0x1       0x1     0x118       0x3D
Fa0/22    SA      bndl      8192          0x1       0x1     0x119       0x3D
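The port selection that ME2 performs can be approximated with a short Python sketch. It assumes only what the text states: the port-ID is the port-priority followed by the port-number, lower is better, and at most 8 ports are bundled. The names `MAX_ACTIVE` and `select_ports` are illustrative, not anything from IOS.

```python
MAX_ACTIVE = 8  # the bundle in this example forwards on at most 8 ports


def select_ports(ports: dict[str, tuple[int, int]]) -> tuple[list[str], list[str]]:
    """Rank ports by (priority, port number); lowest port-IDs are bundled."""
    ranked = sorted(ports, key=lambda name: ports[name])
    return ranked[:MAX_ACTIVE], ranked[MAX_ACTIVE:]


# ME2's view: Fa0/13 through Fa0/22 map to port numbers 0x110 through 0x119.
me2_ports = {f"Fa0/{13 + i}": (32768, 0x110 + i) for i in range(10)}
me2_ports["Fa0/21"] = (8192, 0x118)  # lacp port-priority 8192
me2_ports["Fa0/22"] = (8192, 0x119)

active, standby = select_ports(me2_ports)
print(active)   # Fa0/21 and Fa0/22 first, then Fa0/13 through Fa0/18
print(standby)  # ['Fa0/19', 'Fa0/20']
```

The result matches the tables above: ports 19 and 20 have the worst remaining port-IDs and fall into hot-standby.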

The “neighbor” command seen earlier reflects the remote peer configuration. On ME1, for example, we can see the remote port-priority on those newly-bundled ports. This is in contrast to the “internal” command which reflects the local configuration.

ME1#show lacp neighbor | include 2[12]
Fa0/21    SA      8192      04c5.a453.d800  17s    0x0    0x1    0x118    0x3D
Fa0/22    SA      8192      04c5.a453.d800   4s    0x0    0x1    0x119    0x3D

The detail version of the “internal” and “neighbor” commands will display a more verbose version of the tabular data for local and remote LACP fields, respectively. Snippets are shown below, but very little new information is provided, so we don’t analyze it. In this example, we look at ME2’s port 13 details (neighbor) and ME1’s port 13 details (internal), both from ME1’s perspective.

ME1#show lacp neighbor detail
[snip]
          Partner               Partner                     Partner
Port      System ID             Port Number     Age         Flags
Fa0/13    16384,04c5.a453.d800  0x110           18s         SP

          LACP Partner          Partner         Partner
          Port Priority         Oper Key        Port State
          32768                 0x1             0x3C

          Port State Flags Decode:
          Activity:   Timeout:   Aggregation:   Synchronization:
          Passive     Long       Yes            Yes

          Collecting:   Distributing:   Defaulted:   Expired:
          Yes           Yes             No           No

ME1#show lacp internal detail
[snip]
          Actor                 Actor                       Actor
Port      System ID             Port Number     Age         Flags
Fa0/13    32768,001d.4692.f700  0x110           11s         SA

          LACP Actor            Actor           Actor
          Port Priority         Oper Key        Port State
          32768                 0x1             0x3D

          Port State Flags Decode:
          Activity:   Timeout:   Aggregation:   Synchronization:
          Active      Long       Yes            Yes

          Collecting:   Distributing:   Defaulted:   Expired:
          Yes           Yes             No           No
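The hexadecimal port-state values in these outputs are bitmaps. As a rough sketch, assuming the standard LACP actor/partner state bit order (activity in the least significant bit, expired in the most significant), the following Python decodes the three values seen in this section. The helper name `decode_port_state` is hypothetical.

```python
# Bit 0 is the least significant bit of the LACP port-state byte.
FLAGS = [
    "Activity",         # 1 = active, 0 = passive
    "Timeout",          # 1 = short, 0 = long
    "Aggregation",
    "Synchronization",
    "Collecting",
    "Distributing",
    "Defaulted",
    "Expired",
]


def decode_port_state(state: int) -> dict[str, bool]:
    """Map each flag name to whether its bit is set in the state byte."""
    return {name: bool((state >> bit) & 1) for bit, name in enumerate(FLAGS)}


for value in (0x3D, 0x3C, 0x5):
    set_flags = [name for name, on in decode_port_state(value).items() if on]
    print(hex(value), set_flags)
```

Under this decoding, 0x3D is an active port that is synchronized, collecting, and distributing; 0x3C is the same with passive activity (the "SP" ports on ME2); and 0x5 has only activity and aggregation set, which fits both the hot-standby and down ports since neither is synchronized.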

All of these details represent the same fields that are carried in LACPDUs during the bond negotiation. A short debugging effort on ME1 reveals some of these LACPDUs being sent. The “p-pri” and “s-pri” fields represent the port and system priorities, respectively, and are shown in hexadecimal (0x8000 = 32768).

! ME1
ME1#debug lacp packet
Link Aggregation Control Protocol packet debugging is on

! ME1
00:59:15: LACP :lacp_bugpak: Send LACP-PDU packet via Fa0/13
00:59:15: LACP : packet size: 124
00:59:15: LACP: pdu: subtype: 1, version: 1
00:59:15: LACP: Act: tlv:1, tlv-len:20, key:0x1, p-pri:0x8000, p:0x110, pstate:0x3D, s-pri:0x8000, s-mac:001d.4692.f700

Assuming I unplug one of the links actively in the bundle, the hot-standby links will take over for it. I unplug port 13, which means port 19 is the next most desirable port for forwarding. Notice that the line protocol is moved from down to up across interfaces in about one second. Port 20 still remains down, as LACP is holding it in this state given its high port-ID compared to the remaining ports.

! ME1
01:03:34: %LINEPROTO-5-UPDOWN: Line protocol on Interface FastEthernet0/13, changed state to down
01:03:35: %LINK-3-UPDOWN: Interface FastEthernet0/13, changed state to down
01:03:35: %LINEPROTO-5-UPDOWN: Line protocol on Interface FastEthernet0/19, changed state to up

ME1#show interfaces fastEthernet 0/20 | include line_proto
FastEthernet0/20 is up, line protocol is down (notconnect)

A quick check of the LACP internal information shows the first port as down, which interestingly has the same code of 0x5 as hot-standby. Port 19 is now in the bundle while port 20 remains in hot-standby.

ME1#show lacp internal
[snip]
                            LACP port     Admin     Oper    Port        Port
Port      Flags   State     Priority      Key       Key     Number      State
Fa0/13    SA      down      32768         0x1       0x1     0x110       0x5
Fa0/14    SA      bndl      32768         0x1       0x1     0x111       0x3D
Fa0/15    SA      bndl      32768         0x1       0x1     0x112       0x3D
Fa0/16    SA      bndl      32768         0x1       0x1     0x113       0x3D
Fa0/17    SA      bndl      32768         0x1       0x1     0x114       0x3D
Fa0/18    SA      bndl      32768         0x1       0x1     0x115       0x3D
Fa0/19    SA      bndl      32768         0x1       0x1     0x116       0x3D
Fa0/20    SA      hot-sby   32768         0x1       0x1     0x117       0x5
Fa0/21    SA      bndl      16384         0x1       0x1     0x118       0x3D
Fa0/22    SA      bndl      16384         0x1       0x1     0x119       0x3D

We plug port 13 in before continuing. Now that we have tested the high-availability and channel formation details, we will discuss the bandwidth benefits. These benefits are not specific to LACP and apply to all types of Ethernet link aggregation. Because this FastEthernet Etherchannel (FEC, not to be confused with MPLS forwarding equivalence class) consists of 8 active members, it has an aggregated bandwidth of 800 Mbps. This is important from an IGP perspective, because if we were running routing over this single logical link, we should prefer it more than a single 100 Mbps link, but less than a single 1 Gbps link. In the case of switching, STP takes this new bandwidth into account automatically; we know that classic STP uses a cost of 19 for 100 Mbps ports and 4 for 1 Gbps ports. The cost of this 800 Mbps FEC is 5, which sounds right. This value of 5 is much better than 19 but just slightly worse than 4.

ME1#show interfaces port-channel 1 | include BW
MTU 1998 bytes, BW 800000 Kbit/sec, DLY 100 usec,

ME1#show spanning-tree interface port-channel 1
Vlan                Role Sts Cost      Prio.Nbr Type
------------------- ---- --- --------- -------- --------------------------
VLAN0001            Desg FWD 5         128.64   P2p

To achieve the load-sharing, Etherchannels can select from a variety of algorithms, and I show the context-sensitive help below with added comments. The term “IP” always includes IPv4, but may include IPv6 depending on the platform. Non-IP traffic (PPPoE, MPLS, CLNS, etc.) is always load balanced based on MAC addresses since the IP-based load sharing doesn’t make sense.

ME1(config)#port-channel load-balance ?
dst-ip        Dst IP Addr            ! Hashed based on destination IP address only
dst-mac       Dst Mac Addr           ! Hashed based on destination MAC address only
src-dst-ip    Src XOR Dst IP Addr    ! Hashed based on source/dest IP address
src-dst-mac   Src XOR Dst Mac Addr   ! Hashed based on source/dest MAC address
src-ip        Src IP Addr            ! Hashed based on source IP address only
src-mac       Src Mac Addr           ! Hashed based on source MAC address only

The load-balancing method is configured globally and the default on this platform is “src-mac”. We can verify this with a show command, and we see that both IP and non-IP traffic uses this method.

ME1#show etherchannel load-balance
EtherChannel Load-Balancing Configuration:
        src-mac

EtherChannel Load-Balancing Addresses Used Per-Protocol:
Non-IP: Source MAC address
  IPv4: Source MAC address

We can quickly test to see if this works. Each switch has a management SVI in VLAN 1, and the IP addresses for ME1 and ME2 are 10.0.10.91 and 10.0.10.92, respectively. We can test, using an exec command, to see which link is used when ME1 sends traffic to ME2 from this SVI, and vice versa. First, we need to record the MAC addresses of the SVIs.

ME1#show interfaces vlan 1 | include bia
Hardware is EtherSVI, address is 001d.4692.f740 (bia 001d.4692.f740)

ME2#show interfaces vlan 1 | include bia
Hardware is EtherSVI, address is 04c5.a453.d840 (bia 04c5.a453.d840)

Next, we issue the “test” commands on ME1 and ME2 to determine which member links will be used for these flows. Notice that the ports between switches don’t have to be the same, since the load-balancing decisions only affect how traffic is transmitted. Traffic can be received on any bundled link, since each switch can have a different load-sharing policy for traffic transmission.

ME1#test etherchannel load-balance interface port-channel 1 mac 001d.4692.f740 04c5.a453.d840
Would select Fa0/21 of Po1

ME2#test etherchannel load-balance interface port-channel 1 mac 04c5.a453.d840 001d.4692.f740
Would select Fa0/14 of Po1

Since the current load-sharing mechanism only cares about the source MAC, the destination MAC can change but the outgoing interface remains the same. We do a few bogus tests on ME1 to prove this.

ME1#test etherchannel load-balance interface port-channel 1 mac 001d.4692.f740 04c5.a453.2bad
Would select Fa0/21 of Po1

ME1#test etherchannel load-balance interface port-channel 1 mac 001d.4692.f740 04c5.a453.deed
Would select Fa0/21 of Po1

ME1#test etherchannel load-balance interface port-channel 1 mac 001d.4692.f740 04c5.a453.beef
Would select Fa0/21 of Po1
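To illustrate why the destination MAC is irrelevant under “src-mac”, here is a toy Python model. The hash below (low-order bits of the key, modulo the member count) is purely an assumption for illustration; the platform's real hash is not documented in this text, so this sketch will not reproduce the Fa0/21 result shown above. The member list and function name are hypothetical as well.

```python
def pick_member(src_mac: str, dst_mac: str, members: list[str],
                method: str = "src-mac") -> str:
    """Toy member selection: NOT the real hardware hash, illustration only."""
    src = int(src_mac.replace(".", ""), 16)
    dst = int(dst_mac.replace(".", ""), 16)
    key = {"src-mac": src, "dst-mac": dst, "src-dst-mac": src ^ dst}[method]
    return members[key % len(members)]


members = [f"Fa0/{n}" for n in range(13, 21)]  # 8 hypothetical bundled ports
src = "001d.4692.f740"
dsts = ["04c5.a453.d840", "04c5.a453.2bad", "04c5.a453.deed", "04c5.a453.beef"]

# With src-mac, every destination yields the same member link.
print({pick_member(src, d, members, "src-mac") for d in dsts})
# With src-dst-mac, the destination influences the choice.
print({pick_member(src, d, members, "src-dst-mac") for d in dsts})
```

The property that matters is the one the “test” commands demonstrated: a source-only key pins all of a host's traffic to one link, while an XOR of source and destination spreads a single host's flows across members.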

As a data-plane test, I will send many packets from ME2 to ME1, then check the counters while the flow is happening. We check port 14 on ME2 and port 21 on ME1 to see the packets increasing quickly.

ME2#ping 10.0.10.91 size 1500 repeat 1000000

ME1#show interfaces fastEthernet 0/21 | include packets_output
7055 packets output, 10339446 bytes, 0 underruns

ME2#show interfaces fastEthernet 0/14 | include packets_output
7127 packets output, 10368100 bytes, 0 underruns

For further proof, we can look at all the output packets on ME1 within the bond. It is obvious that one port has the most packets by far. Other ports are accounting for LACPDUs being transferred back and forth, along with any other flows that may be going between a different set of MAC addresses.

ME1#show interfaces | include packets_output
[snip]
2762 packets output, 237064 bytes, 0 underruns
341 packets output, 75578 bytes, 0 underruns
2932 packets output, 235918 bytes, 0 underruns
300 packets output, 64138 bytes, 0 underruns
306 packets output, 64906 bytes, 0 underruns
299 packets output, 63744 bytes, 0 underruns
290 packets output, 57346 bytes, 0 underruns
271 packets output, 50786 bytes, 0 underruns
7060 packets output, 10340618 bytes, 0 underruns
245 packets output, 40138 bytes, 0 underruns

We will test 2 other load sharing methods. ME1 will use the combination of source/destination IP while ME2 uses the combination of source/destination MAC. Notice that ME1 uses source/destination MAC load-sharing automatically for non-IP traffic, while ME2 uses this technique for both.

! ME1
port-channel load-balance src-dst-ip

! ME2
port-channel load-balance src-dst-mac

ME1#show etherchannel load-balance
EtherChannel Load-Balancing Configuration:
        src-dst-ip

EtherChannel Load-Balancing Addresses Used Per-Protocol:
Non-IP: Source XOR Destination MAC address
  IPv4: Source XOR Destination IP address

ME2#show etherchannel load-balance
EtherChannel Load-Balancing Configuration:
        src-dst-mac

EtherChannel Load-Balancing Addresses Used Per-Protocol:
Non-IP: Source XOR Destination MAC address
  IPv4: Source XOR Destination MAC address

With these new methods, we may get different results for our load sharing on both switches. ME1 coincidentally selects port 21 again while ME2 selects port 13. Notice that ME1 used IP addresses for input while ME2 used MAC addresses.

ME1#test etherchannel load-balance interface port-channel 1 ip 10.0.10.91 10.0.10.92
Would select Fa0/21 of Po1

ME2#test etherchannel load-balance interface port-channel 1 mac 04c5.a453.d840 001d.4692.f740
Would select Fa0/13 of Po1

We will clear counters on both switches and perform a short ping test of 100 packets. We can see that both switches have just over 100 packets (add in a few recent LACPDUs) on their selected outgoing interfaces.

ME2#ping 10.0.10.91 size 1500 repeat 100

ME1#show interfaces fastEthernet 0/21 | include packets_output
103 packets output, 152450 bytes, 0 underruns

ME2#show interfaces fastEthernet 0/13 | include packets_output
102 packets output, 152056 bytes, 0 underruns

As a note, when using the “test” command with an IP-based load-sharing scheme, the parser warns you that the result may not be accurate for IP traffic. For example, if we used the MAC addresses as input on ME1, we get a different output interface (port 13), which may not be true for IP traffic. Since ME1 does use this mechanism for non-IP traffic, this command is still valuable, but you must be diligent to provide the proper inputs depending on the traffic type. MPLS traffic exiting ME1 towards ME2 would actually use port 13, as an example.

ME1#test etherchannel load-balance interface port-channel 1 mac 001d.4692.f740 04c5.a453.d840
Configured load-balance is "src-dst-ip", results may not be accurate for IP traffic
Would select Fa0/13 of Po1

Additional Reading – Reference configurations "lacp"

39.2 Uni-Directional Link Detection (UDLD)

UDLD is a feature that is designed to detect unidirectional links. It is generally used on fiber links where there are explicit and separate transmit (TX) and receive (RX) fiber lines. Each set of TX/RX lines is called a “pair” and ideally, both TX and RX directions are functional. When one of the lines fails, the overall link does not perform as expected, and UDLD helps mitigate this. The network diagram is very simple and consists of two hardware Cisco ME-3400-24TS-A switches connected with a single fiber pair. The SFPs used are 100BaseFX, which is 100 Mbps.

First, we will configure the interfaces to support a basic fiber connection between the two. I hardcode the duplex to full since sometimes, in my experience, the FX SFPs will negotiate half-duplex. We cannot set the speed on these interfaces as this is controlled by the SFP hardware. UDLD is not yet enabled, but CDP is by default, and we will use that for verification assistance.

! ME1 and ME2
interface GigabitEthernet0/1
 description UDLD TEST
 port-type nni
 duplex full

We verify the basic link characteristics first. We see a 100Base-FX SFP with a speed of 100 Mbps, along with full-duplex operation. The interface hardware components appear to be working properly.

ME1#show interfaces gigabitEthernet 0/1 | include media_type|line_proto
GigabitEthernet0/1 is up, line protocol is up (connected)
Full-duplex, 100Mb/s, link type is auto, media type is 100BaseFX SFP

ME2#show interfaces gigabitEthernet 0/1 | include media_type|line_proto
GigabitEthernet0/1 is up, line protocol is up (connected)
Full-duplex, 100Mb/s, link type is auto, media type is 100BaseFX SFP

We can also see that CDP sees neighbors in both directions, which is a quick indication that the fiber pair is working. We are also able to ping between management SVIs, which is another verification method.

ME1#show cdp neighbors gigabitEthernet 0/1
Capability Codes: R - Router, T - Trans Bridge, B - Source Route Bridge
                  S - Switch, H - Host, I - IGMP, r - Repeater, P - Phone,
                  D - Remote, C - CVTA, M - Two-port Mac Relay

Device ID        Local Intrfce     Holdtme    Capability  Platform  Port ID
ME2              Gig 0/1           145              S I   ME-3400-2 Gig 0/1

ME2#show cdp neighbors gigabitEthernet 0/1
[snip]
Device ID        Local Intrfce     Holdtme    Capability  Platform  Port ID
ME1              Gig 0/1           136              S I   ME-3400-2 Gig 0/1

ME1#ping 10.0.10.92
Type escape sequence to abort.
Sending 5, 100-byte ICMP Echos to 10.0.10.92, timeout is 2 seconds:
!!!!!
Success rate is 100 percent (5/5), round-trip min/avg/max = 1/4/9 ms

Now that we are certain the link is functional, we will enable UDLD. There are two modes of operation: normal and aggressive. In normal mode, when a unidirectional link is detected, UDLD does not take action. The port forwarding states will behave according to the STP topology. Once a port has been determined to be bidirectional (functional), if the UDLD message hold-down time expires without having seen UDLD messages from the remote end, the port is considered to be unidirectional. UDLD can be enabled globally or per-interface. When enabled globally, it only applies to fiber ports, but you can enable it on copper ports at the interface level. ME1 uses the global command and ME2 uses the port command; both are using UDLD normal mode. We can also disable UDLD on specific ports when it is enabled globally; on ME1, I demonstrate this by disabling UDLD on Gig0/2, a dummy port. The net result is that UDLD normal mode is enabled only on Gig0/1 on both switches.

! ME1
udld enable
interface GigabitEthernet0/2
 udld port disable

! ME2
interface GigabitEthernet0/1
 udld port

We can see the state of UDLD by checking the neighbors. The device name shown in the neighbor table is the remote device’s serial number, which we prove by checking ME2.

ME1#show udld neighbors
Port     Device Name   Device ID     Port ID    Neighbor State
----     -----------   ---------     -------    --------------
Gi0/1    FOC1510V51X   1             Gi0/1      Bidirectional

ME2#show version | include System_serial
System serial number : FOC1510V51X

Oftentimes, looking at the neighbor summary is sufficient for UDLD verification, but specifying an interface prints more detail. Specifically, we can see the message interval and timeout which are discussed later.

ME1#show udld gigabitEthernet 0/1
Interface Gi0/1
---
Port enable administrative configuration setting: Follows device default
Port enable operational state: Enabled
Current bidirectional state: Bidirectional
Current operational state: Advertisement - Single neighbor detected
Message interval: 15
Time out interval: 5

Entry 1
---
Expiration time: 34
Device ID: 1
Current neighbor state: Bidirectional
Device name: FOC1510V51X
Port ID: Gi0/1
Neighbor echo 1 device: FOC1132U4V2
Neighbor echo 1 port: Gi0/1
Message interval: 15
Time out interval: 5
CDP Device name: ME2

We can test UDLD by simply unplugging one of the lines on one of the pairs. Since the link is currently in the bidirectional state, this will change shortly, which should trigger UDLD. Unplugging the RX line on ME2 means that the ME2 interface goes down but ME1 stays up. There is nothing of interest to test on ME2 since the interface is down, but on ME1, we can see the UDLD and CDP neighbors are gone. However, the interface stays up, and is forwarding per STP. UDLD in normal mode isn’t terribly useful and it doesn’t generate a syslog message, either.

ME1#show udld neighbors
Port     Device Name   Device ID     Port ID    Neighbor State
----     -----------   ---------     -------    --------------
[no output]

ME1#show cdp neighbors gigabitEthernet 0/1
[snip]
Device ID        Local Intrfce     Holdtme    Capability  Platform  Port ID
[no output]

ME1#show spanning-tree vlan 1
VLAN0001
  Spanning tree enabled protocol rstp
  Root ID    Priority    32769
             Address     001d.4692.f700
             This bridge is the root
             Hello Time   2 sec  Max Age 20 sec  Forward Delay 15 sec

  Bridge ID  Priority    32769  (priority 32768 sys-id-ext 1)
             Address     001d.4692.f700
             Hello Time   2 sec  Max Age 20 sec  Forward Delay 15 sec
             Aging Time  300 sec

Interface           Role Sts Cost      Prio.Nbr Type
------------------- ---- --- --------- -------- -----------------------------
Gi0/1               Desg FWD 19        128.1    P2p

Next, we will restore fiber connectivity and enable UDLD aggressive mode. This is also enabled globally or per-port, as we see on ME1 and ME2. UDLD remains disabled on ME1 Gig0/2 explicitly, so only Gig0/1 on both switches is UDLD-enabled. I also enable err-disable recovery on both switches so that after a UDLD failure is fixed, the port can be restored automatically.

! ME1
errdisable recovery cause udld
errdisable recovery interval 30
udld aggressive

! ME2
errdisable recovery cause udld
errdisable recovery interval 30
interface GigabitEthernet0/1
 udld port aggressive

ME1 sees ME2 as a bidirectional neighbor again, except this time in aggressive mode. Aggressive mode will err-disable unidirectional links so that alternative paths can be used by STP and other protocols. This only occurs after aggressive mode tries to recover the link by sending more frequent UDLD messages across; err-disable is the result of the recovery failure.

ME1#show udld neighbors
Port     Device Name   Device ID     Port ID    Neighbor State
----     -----------   ---------     -------    --------------
Gi0/1    FOC1510V51X   1             Gi0/1      Bidirectional

ME1#show udld gigabitEthernet 0/1
Interface Gi0/1
---
Port enable administrative configuration setting: Follows device default
Port enable operational state: Enabled / in aggressive mode
Current bidirectional state: Bidirectional
Current operational state: Advertisement - Single neighbor detected
Message interval: 15
Time out interval: 5
[snip]

Now, we unplug the ME2 RX connection to break bidirectional connectivity. ME1’s interface stays up until UDLD detects the broken connectivity. UDLD aggressive-mode prints syslog messages to notify us about this failure, then err-disables the port.

! ME1
02:49:49: %UDLD-4-UDLD_PORT_DISABLED: UDLD disabled interface Gi0/1, aggressive mode failure detected
02:49:49: %PM-4-ERR_DISABLE: udld error detected on Gi0/1, putting Gi0/1 in err-disable state
02:49:49: %LINEPROTO-5-UPDOWN: Line protocol on Interface Vlan1, changed state to down
02:49:50: %LINEPROTO-5-UPDOWN: Line protocol on Interface GigabitEthernet0/1, changed state to down

The port is now unusable; since there are no alternative paths, there are no interfaces in the “up” state on either switch, so connectivity is totally broken. If there were alternative paths, those would be used instead.

ME1#show interfaces gigabitEthernet 0/1 status
Port      Name               Status       Vlan       Duplex  Speed Type
Gi0/1     UDLD TEST          err-disabled 1            full    100 100BaseFX SFP

ME1#show spanning-tree vlan 1
Spanning tree instance(s) for vlan 1 does not exist.

The err-disable recovery timer tries to recover the port every 30 seconds, so plugging the cable back in restores the port, and thus the UDLD error clears. We can see that in 6 seconds, the err-disable process will attempt to recover this port, and the log messages from a successful recovery are shown below. Without the auto-recovery configured, manual “shutdown” and “no shutdown” commands are required. Alternatively, the “udld reset” command can be used.

ME1#show errdisable recovery | begin reason
Interface       Errdisable reason       Time left(sec)
---------       -----------------       --------------
Gi0/1           udld                    6

! ME1
%PM-4-ERR_RECOVER: Attempting to recover from udld err-disable state on Gi0/1
%LINEPROTO-5-UPDOWN: Line protocol on Interface GigabitEthernet0/1, changed state to up

We will break the link again to capture more debugging information. We can watch this UDLD behavior in action by looking at UDLD events on ME1 (or whichever switch still has its RX line connected). Before the failure, we see UDLD messages being exchanged every 15 seconds as expected.

ME1#debug udld events
UDLD events debugging is on

! ME1
02:58:36: New_entry = 55128D8 (Gi0/1)
02:58:36: Found an entry from same device (Gi0/1)
02:58:36: Cached entries = 2 (Gi0/1)
02:58:36: Entry (0x62359C0) deleted: 1 entries cached
02:58:36: Cached entries = 1 (Gi0/1)
02:58:36: Checking if multiple neighbors (Gi0/1)
02:58:36: Single neighbor detected (Gi0/1)
02:58:36: Checking if link is bidirectional (Gi0/1)
02:58:36: Found my own ID pair in 2way conn list (Gi0/1)

02:58:51: New_entry = 62359C0 (Gi0/1)
02:58:51: Found an entry from same device (Gi0/1)
02:58:51: Cached entries = 2 (Gi0/1)
02:58:51: Entry (0x55128D8) deleted: 1 entries cached
02:58:51: Cached entries = 1 (Gi0/1)
02:58:51: Checking if multiple neighbors (Gi0/1)
02:58:51: Single neighbor detected (Gi0/1)
02:58:51: Checking if link is bidirectional (Gi0/1)
02:58:51: Found my own ID pair in 2way conn list (Gi0/1)

Once we unplug the cable, the output changes. ME1 shows no output for about 45 seconds (3 times the hello interval) then declares that all neighbors on Gig0/1 have aged out. Then, UDLD aggressive mode sends 8 hellos 1 second apart to try and recover the link. After that, the link is considered down and is err-disabled.

! ME1
03:02:20: allNeighborsAgedOutEvent during link up. (Gi0/1)
03:02:20: Phase set from ADV to LUP because all neighbors aged out (Gi0/1)
03:02:20: prev = 0 entry = 55128D8 next = 0 exp_time = 0 (Gi0/1)
03:02:20: udsb->cache = 0x532E274 (Gi0/1)
03:02:20: timeout timer = 7 (Gi0/1)
03:02:21: timeout timer = 6 (Gi0/1)
03:02:22: timeout timer = 5 (Gi0/1)
03:02:23: timeout timer = 4 (Gi0/1)
03:02:24: timeout timer = 3 (Gi0/1)
03:02:25: timeout timer = 2 (Gi0/1)
03:02:26: timeout timer = 1 (Gi0/1)
03:02:27: timeout timer = 0 (Gi0/1)
03:02:27: Phase set to udld_advertisement from phase udld_link_up in aggresive mode after all neighbors aged out. (Gi0/1)
03:02:27: %UDLD-4-UDLD_PORT_DISABLED: UDLD disabled interface Gi0/1, aggressive mode failure detected
03:02:27: %PM-4-ERR_DISABLE: udld error detected on Gi0/1, putting Gi0/1 in err-disable state
03:02:27: Port UDLD set error disabled (Gi0/1)
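The detection time described here can be estimated with simple arithmetic. The Python sketch below assumes, per the text, that a neighbor ages out after 3 message intervals and that aggressive mode then spends about 8 seconds on 1-second recovery probes. The function and parameter names are illustrative, and real timings vary by a few seconds depending on when the last message arrived.

```python
def aggressive_detection_seconds(message_interval: int,
                                 hold_multiplier: int = 3,
                                 resync_probes: int = 8) -> int:
    """Approximate worst-case time from failure to err-disable, in seconds."""
    age_out = message_interval * hold_multiplier  # neighbor hold time
    recovery = resync_probes * 1                  # one probe per second
    return age_out + recovery


print(aggressive_detection_seconds(15))  # 53
```

With the default 15-second message interval this gives roughly 53 seconds, which lines up with the "almost a minute" behavior seen in the debugs.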

We can plug the line back in and perform a manual CLI reset to fix the network so we can test it once more with the UDLD packet debugs. We could also wait for err-disable recovery, but the manual CLI reset is faster and more appropriate for testing.

ME1#udld reset
1 ports shutdown by UDLD were reset.

ME1#show interfaces gigabitEthernet 0/1 status
Port      Name               Status       Vlan       Duplex  Speed Type
Gi0/1     UDLD TEST          connected    1            full    100 100BaseFX SFP

With the network stable again, I break the network once more by unplugging the RX line from ME2. Packet debugging has been enabled on ME1 so we can see the UDLD activity, as well as aggressive mode trying to recover the link. The first two probes are ordinary messages and are sent 15 seconds apart. Once UDLD sees the issue, it sends 8 “resynch” messages in quick succession, once per second, until it flushes the UDLD state from the database. I personally think aggressive mode is a much better option than normal mode. I would not consider this “high” availability since it takes almost a minute to detect and bypass the failure.

ME1#debug udld packets
UDLD packets debugging is on

! ME1
03:09:48: UDLD send probe message, flags = rec_timeout (Gi0/1)
03:09:48: P (Gi0/1)
03:10:03: UDLD send probe message, flags = rec_timeout (Gi0/1)
03:10:03: P (Gi0/1)
03:10:17: UDLD send probe message, flags = rec_timeout | resynch (Gi0/1)
03:10:17: Pr (Gi0/1)
03:10:18: UDLD send probe message, flags = rec_timeout | resynch (Gi0/1)
03:10:18: Pr (Gi0/1)
03:10:19: UDLD send probe message, flags = rec_timeout | resynch (Gi0/1)
03:10:19: Pr (Gi0/1)
03:10:20: UDLD send probe message, flags = rec_timeout | resynch (Gi0/1)
03:10:20: Pr (Gi0/1)
03:10:21: UDLD send probe message, flags = rec_timeout | resynch (Gi0/1)
03:10:21: Pr (Gi0/1)
03:10:22: UDLD send probe message, flags = rec_timeout | resynch (Gi0/1)
03:10:22: Pr (Gi0/1)
03:10:23: UDLD send probe message, flags = rec_timeout | resynch (Gi0/1)
03:10:23: Pr (Gi0/1)
03:10:24: UDLD send probe message, flags = rec_timeout | resynch (Gi0/1)
03:10:24: Pr (Gi0/1)
03:10:25: UDLD send flush message, flags = 0 (Gi0/1)
03:10:25: F (Gi0/1)
03:10:25: %UDLD-4-UDLD_PORT_DISABLED: UDLD disabled interface Gi0/1, aggressive mode failure detected

The only knob that UDLD has available is how often UDLD messages are sent in steady-state operation. This can speed up UDLD “convergence”. The default is 15 seconds, but we can adjust it to any number between 1 and 90. I adjust it to 10 on ME1 only, since the timers don’t have to match. Once we set this, the details reveal it. ME1 uses a timer of 10, while it learns that ME2 uses the default timer of 15.

! ME1
udld message time 10

ME1#show udld gigabitEthernet 0/1 | include Message|Device
Message interval: 10
Device ID: 1
Device name: FOC1510V51X
Message interval: 15
CDP Device name: ME2

UDLD packet debugging reveals this new timer in use. I omit the received probe information since it is verbose and not valuable for this verification. We can see ME1 sending probes 10 seconds apart now.

ME1#debug udld packets
UDLD packets debugging is on

03:15:29: UDLD send probe message, flags = rec_timeout (Gi0/1)
03:15:29: P (Gi0/1)
03:15:39: UDLD send probe message, flags = rec_timeout (Gi0/1)
03:15:39: P (Gi0/1)

Additional Reading – Reference configurations "udld"

40. Describe, implement, and troubleshoot Layer 3 failure detection

40.1 Individual Protocol Hello packets

Most protocols have their own neighbor discovery mechanism, which is typically a hello packet of sorts. This is true for EIGRP, OSPF, IS-IS, RSVP, FHRPs, BGP, LDP, PIM, and others. There are typically two timers associated with each: an interval that determines how frequently to send hellos, and a hold/dead timer that says how long one can go without receiving one before declaring the neighbor dead. The timer behavior varies significantly between protocols and is discussed in this section. The network diagram includes XE and XR routers to see the configuration/verification on both platforms.

First, we will examine OSPF timers briefly. The hello and dead timers must match on all peers sharing an interface/network. This can be changed per-link. The hello/dead timers are easily understood and there are no magic tricks with OSPF. If the dead-interval is not specified, it is automatically assumed to be four times the hello interval. We will adjust the hello/dead timers to 2/6 seconds on the OSPF LAN.

! CSR8, CSR9, CSR10
interface GigabitEthernet2.518
 ip ospf dead-interval 6
 ip ospf hello-interval 2

! XRv1
router ospf 181
 area 0
  interface GigabitEthernet0/0/0/0.518
   dead-interval 6
   hello-interval 2

A quick debug on XRv1 and CSR8 shows hellos being sent and received as expected. XR shows more detailed output to include the DR/BDR information, timers, and other pieces of information.

RP/0/0/CPU0:XRv1#debug ospf 181 hello gig0/0/0/0.518
ospf[1018]: Send hello to 224.0.0.5 area 0 on GigabitEthernet0/0/0/0.518 from 181.18.90.11 (nbr/if state 3/7)
ospf[1018]: Send hello pkt pri 0 options 0x12 DR 181.18.90.9 BDR 181.18.90.10 hello 2 dead 6 netmask 255.255.255.0, vrf default vrfid 0x60000000
ospf[1018]: Rcv hello from 181.0.0.9 area 0 from GigabitEthernet0/0/0/0.518 181.18.90.9 (nbr/if state 3/7) vrf default vrfid 0x60000000
ospf[1018]: Rcv hello pkt pri 5 options 0x12 DR 181.18.90.9 BDR 181.18.90.10 hello 2 dead 6 netmask 255.255.255.0 (nbr_dr 181.18.90.9 nbr_bdr 181.18.90.10)
ospf[1018]: End of hello processing
ospf[1018]: Rcv hello from 181.0.0.8 area 0 from GigabitEthernet0/0/0/0.518 181.18.90.8 (nbr/if state 3/7) vrf default vrfid 0x60000000
ospf[1018]: Rcv hello pkt pri 0 options 0x12 DR 181.18.90.9 BDR 181.18.90.10 hello 2 dead 6 netmask 255.255.255.0 (nbr_dr 181.18.90.9 nbr_bdr 181.18.90.10)
ospf[1018]: End of hello processing
ospf[1018]: Rcv hello from 181.0.0.10 area 0 from GigabitEthernet0/0/0/0.518 181.18.90.10 (nbr/if state 3/7) vrf default vrfid 0x60000000
ospf[1018]: Rcv hello pkt pri 3 options 0x12 DR 181.18.90.9 BDR 181.18.90.10 hello 2 dead 6 netmask 255.255.255.0 (nbr_dr 181.18.90.9 nbr_bdr 181.18.90.10)
ospf[1018]: End of hello processing

R8#debug ip ospf hello
OSPF-181 HELLO Gi2.518: Rcv hello from 181.18.90.11 area 0 181.18.90.11
OSPF-181 HELLO Gi2.518: Send hello to 224.0.0.5 area 0 from 181.18.90.8
OSPF-181 HELLO Gi2.518: Rcv hello from 181.0.0.10 area 0 181.18.90.10
OSPF-181 HELLO Gi2.518: Rcv hello from 181.0.0.9 area 0 181.18.90.9

We will configure CSR9 to have a different set of timers so that the neighborship fails. Both XE and XR show these failures in the debug messages, and both use “R” and “C” for Remote and Configured, respectively. The remote value is what the peer is using, while configured is the local value. We can see that CSR9 is using a dead time of 13 and a hello time of 6, while CSR8 expects a dead time of 6 and a hello time of 2. The subnet mask also must match on broadcast links, and in this case, it does.

! CSR9
interface GigabitEthernet2.518
 ip ospf dead-interval 13
 ip ospf hello-interval 6
! CSR8
OSPF-181 HELLO Gi2.518: Rcv hello from 181.0.0.9 area 0 181.18.90.9
OSPF-181 HELLO Gi2.518: Mismatched hello parameters from 181.18.90.9


OSPF-181 HELLO Gi2.518: Dead R 13 C 6, Hello R 6 C 2 Mask R 255.255.255.0 C 255.255.255.0
! XRv1
Rcv hello from 181.0.0.9 area 0 from GigabitEthernet0/0/0/0.518 181.18.90.9 (nbr/if state 3/7) vrf default vrfid 0x60000000
Mismatched hello parameters from 181.18.90.9
Dead R 13 C 6, Hello R 6 C 2 Mask R 255.255.255.0 C 255.255.255.0
hello from 181.0.0.9 area 0 failed validation
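The hello validation seen in these debugs can be sketched in a few lines of Python. This is only an illustration of the comparison, not router code: the HelloParams class and check_hello function are hypothetical names, and the summary string simply imitates the "R"/"C" layout from the debug output.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class HelloParams:
    """Hello parameters that must agree on a broadcast segment (hypothetical type)."""
    hello: int                    # hello interval, seconds
    dead: Optional[int] = None    # dead interval, seconds; None means "not configured"
    mask: str = "255.255.255.0"

    @property
    def effective_dead(self) -> int:
        # An unconfigured dead interval defaults to four times the hello interval.
        return self.dead if self.dead is not None else 4 * self.hello


def check_hello(remote: HelloParams, configured: HelloParams) -> Optional[str]:
    """Return None when the parameters match, else an R/C style mismatch summary."""
    matches = (
        remote.hello == configured.hello
        and remote.effective_dead == configured.effective_dead
        and remote.mask == configured.mask
    )
    if matches:
        return None
    return (
        f"Dead R {remote.effective_dead} C {configured.effective_dead}, "
        f"Hello R {remote.hello} C {configured.hello} "
        f"Mask R {remote.mask} C {configured.mask}"
    )


# CSR9's mismatched hello as seen by CSR8 (remote 6/13, configured 2/6).
print(check_hello(HelloParams(hello=6, dead=13), HelloParams(hello=2, dead=6)))
```

Any single differing field is enough to fail validation, which is why the router prints all three pairs together.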

Before continuing, we will set CSR9 back to the proper values (not shown). We will quickly verify the timers on CSR9 as well as the neighbors. CSR9 is a DR/BDR candidate and should have full adjacencies with all nodes on the segment.

R9#show ip ospf interface gig2.518 | include Timer
  Timer intervals configured, Hello 2, Dead 6, Wait 6, Retransmit 5

R9#show ip ospf neighbor gig2.518
Neighbor ID     Pri   State           Dead Time   Address         Interface
181.0.0.8         0   FULL/DROTHER    00:00:05    181.18.90.8     Gig2.518
181.0.0.10        3   FULL/BDR        00:00:05    181.18.90.10    Gig2.518
181.18.90.11      0   FULL/DROTHER    00:00:04    181.18.90.11    Gig2.518

For the fastest possible OSPF detection, we can set the dead timer to “minimal”, which is 1 second. We then specify how many hello packets should be sent within that 1 second interval. A smaller number means less frequent hello packets. Below, we set the dead interval to 1 second with the hello interval at 1/3 of a second, or 333ms. OSPF can send hellos as fast as 50ms apart using a hello-multiplier of 20 (1/20th of a second). To ensure I don’t violate the CSR1000v 100 kbps bandwidth constraint, I only configure this on the link between CSR5 and CSR10. We also verify the timers are correct and that a neighbor has formed. We can verify this on one side, since a timer mismatch implies no neighbor. Notice the very short dead timer.

! CSR5 and CSR10
interface GigabitEthernet2.550
 ip ospf dead-interval minimal hello-multiplier 3

R5#show ip ospf interface gig2.550 | include Timer
  Timer intervals configured, Hello 333 msec, Dead 1, Wait 1, Retransmit 5

R5#show ip ospf neighbor gig2.550
Neighbor ID     Pri   State     Dead Time   Address        Interface
181.0.0.10        0   FULL/     990 msec    181.5.10.10    GigabitEthernet2.550
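The fast-hello arithmetic is simple enough to check by hand, but a small Python sketch makes the relationship explicit. The function name is hypothetical, and the accepted multiplier range of 3 to 20 is an assumption on my part; only the upper bound of 20 is stated in the text.

```python
def fast_hello_interval_ms(hello_multiplier: int) -> int:
    """Hello spacing in milliseconds when the dead interval is 'minimal' (1 second).

    The multiplier is the number of hellos sent per second. The 3..20 range
    check is an assumption, not something confirmed by the text.
    """
    if not 3 <= hello_multiplier <= 20:
        raise ValueError("hello-multiplier assumed to be in the range 3..20")
    return 1000 // hello_multiplier


for multiplier in (3, 20):
    print(multiplier, fast_hello_interval_ms(multiplier), "ms")
```

With a multiplier of 3 this gives the 333 msec shown in the interface output, and a multiplier of 20 gives the 50 ms floor.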

OSPFv3 is similar. We will use a hello timer of 2 and a dead timer of 8 for routers on the LAN. We quickly verify the timers as well.

! CSR8, CSR9, and CSR10
interface GigabitEthernet2.518
 ospfv3 181 hello-interval 2
 ospfv3 181 dead-interval 8
! XRv1
router ospfv3 181
 interface GigabitEthernet0/0/0/0.518
  dead-interval 8
  hello-interval 2

R9#show ospfv3 interface gig2.518 | include Timer
  Timer intervals configured, Hello 2, Dead 8, Wait 8, Retransmit 5

RP/0/0/CPU0:XRv1#show ospfv3 interface gig0/0/0/0.518 | include Timer
  Timer intervals configured, Hello 2, Dead 8, Wait 8, Retransmit 5

Because OSPFv2 fast hellos existed before BFD (discussed later) and OSPFv3 was created later, there was no good reason to support OSPFv3 fast hellos. Thus, like EIGRP, the fastest (shortest) hello-interval is 1 second. When only the hello-interval is specified, the dead-timer is automatically 4 times this value, which is 4 seconds.

! CSR5 and CSR10
interface GigabitEthernet2.550
 ospfv3 181 hello-interval 1

R10#show ospfv3 interface gig2.550 | include Timer
  Timer intervals configured, Hello 1, Dead 4, Wait 4, Retransmit 5

IS-IS has a slightly different behavior with respect to matching timers. Timers do not need to match on both sides, and the IS-IS hellos carry a “hold-time” value. This value tells a peer “if you don’t receive a hello from me within this time period, consider me dead”. In this way, a different set of timers can be established per node. By default, the hold-time is 3 times the hello time; this factor is the hello-multiplier. Like many IS-IS commands, we can optionally specify the level to which these hellos apply. Since XRv1 is the DIS, the logic changes to act like the “minimal” configuration. That is to say, the hello interval is divided by the multiplier, but only on the DIS. If we want a 2 second hello interval from the DIS, we can use 6/3. Below, we configure different hello/hold timers on all 4 routers.

! XRv1
router isis 181
 interface GigabitEthernet0/0/0/0.513
  hello-interval 6
  hello-multiplier 3
! XRv3
router isis 181
 interface GigabitEthernet0/0/0/0.513
  hello-interval 2 level 2
! CSR4
interface GigabitEthernet2.513
 isis hello-interval minimal
 isis hello-multiplier 3
! CSR10
interface GigabitEthernet2.513
 isis hello-interval 6 level-2

Based on this configuration, XRv1 uses 2/6, XRv3 uses 2/6 (3 is the default hello-multiplier), CSR4 uses 333ms/1 (similar to OSPF fast hellos), and CSR10 uses 6/18. The verification for this is very loose; we cannot clearly see the timer configuration for either XE or XR. We can only look at the interface details to see when the “next” hello packet will be sent.

R4#show clns interface gig2.513 | include Hello
  Next IS-IS LAN Level-2 Hello in 196 milliseconds

R10#show clns interface gig2.513 | include Hello
  Next IS-IS LAN Level-2 Hello in 3 seconds

RP/0/0/CPU0:XRv3#show isis interface gig0/0/0/0.513 | include IIH
  Next LAN IIH in: 1 s

RP/0/0/CPU0:XRv1#show isis interface gig0/0/0/0.513 | include IIH
  Next LAN IIH in: 461 ms
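The four hello/hold pairs can be derived with a short Python sketch. It models only the behavior described in this section (hold-time equals hello times multiplier, the DIS dividing its hello interval by the multiplier, and “minimal” fixing the hold-time at 1 second); the function name and its arguments are hypothetical, and real platforms may differ in detail.

```python
from typing import Optional, Tuple


def isis_hello_and_hold(
    hello_interval: Optional[float],
    multiplier: int = 3,
    is_dis: bool = False,
) -> Tuple[float, float]:
    """Return (seconds between hellos, advertised hold-time in seconds).

    hello_interval=None stands for the 'minimal' keyword.
    """
    if hello_interval is None:
        # 'minimal': hold-time is 1 second, hellos are spread across that second.
        return 1.0 / multiplier, 1.0
    if is_dis:
        # As described in the text: the DIS divides its hello interval by the
        # multiplier, so the advertised hold-time equals the configured interval.
        return hello_interval / multiplier, float(hello_interval)
    return float(hello_interval), float(hello_interval * multiplier)


routers = {
    "XRv1 (DIS)": isis_hello_and_hold(6, 3, is_dis=True),
    "XRv3": isis_hello_and_hold(2),
    "CSR4": isis_hello_and_hold(None, 3),
    "CSR10": isis_hello_and_hold(6),
}
for name, (hello, hold) in routers.items():
    print(f"{name}: hello every {hello:.3f}s, hold-time {hold:.0f}s")
```

Because each router advertises its own hold-time, none of these pairs has to agree with the others for the adjacencies to stay up.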

Likewise, when looking at a neighbor, we can see the hold-time as it counts down, but not the neighbor’s configured value. From the DIS’ perspective, CSR4’s hold-time is sub-second so the value shows as 0. CSR10 has a hold time of 18 (6 * 3) so a number like 16 makes sense. XRv3’s hold time is 6 seconds (2 * 3) so a number like 5 makes sense.

RP/0/0/CPU0:XRv1#show isis neighbors
IS-IS 181 neighbors:
System Id   Interface       SNPA             State   Holdtime   Type   IETF-NSF
R4          Gi0/0/0/0.513   0050.56a9.2c57   Up      0          L2     Capable
R10         Gi0/0/0/0.513   0050.56a9.f961   Up      16         L2     Capable
XRv3        Gi0/0/0/0.513   0050.56a9.ea54   Up      5          L2     Capable

To prove the awkwardness of the DIS timers, we can enable IS-IS hello debugging on XRv1 and CSR10. XRv1 will receive many hellos from CSR4 because it has a sub-second hello interval (omitted). We can see the hello packets being originated from XRv1 using the DIS’ hostname-based LAN ID (XRv1.01), and being sent every ~2 seconds despite configuring a 6 second hello interval (6 / 3).


RP/0/0/CPU0:XRv1#debug isis adjacencies
13:58:13.380 : isis[1010]: RECV L2 LAN IIH from GigabitEthernet0/0/0/0.513 SNPA 0050.56a9.ea54: System ID XRv3, Holdtime 6, LAN ID XRv1.01, length 90
13:58:13.490 : isis[1010]: RECV L2 LAN IIH from GigabitEthernet0/0/0/0.513 SNPA 0050.56a9.2c57: System ID R4, Holdtime 1, LAN ID XRv1.01, length 90
[snip, R4 hellos]
13:58:13.960 : isis[1010]: SEND L2 LAN IIH on GigabitEthernet0/0/0/0.513: isis_osi_send_lan_hello 1491 LAN ID XRv1.01, 3 neighbors, Holdtime 6s, Length 90
13:58:14.150 : isis[1010]: RECV L2 LAN IIH from GigabitEthernet0/0/0/0.513 SNPA 0050.56a9.2c57: System ID R4, Holdtime 1, LAN ID XRv1.01, length 90
[snip, R4 hellos]
13:58:15.000 : isis[1010]: RECV L2 LAN IIH from GigabitEthernet0/0/0/0.513 SNPA 0050.56a9.ea54: System ID XRv3, Holdtime 6, LAN ID XRv1.01, length 90
[snip, R4 hellos]
13:58:15.550 : isis[1010]: SEND L2 LAN IIH on GigabitEthernet0/0/0/0.513: isis_osi_send_lan_hello 1491 LAN ID XRv1.01, 3 neighbors, Holdtime 6s, Length 90
13:58:15.660 : isis[1010]: RECV L2 LAN IIH from GigabitEthernet0/0/0/0.513 SNPA 0050.56a9.2c57: System ID R4, Holdtime 1, LAN ID XRv1.01, length 90
[snip, R4 hellos]
13:58:16.890 : isis[1010]: RECV L2 LAN IIH from GigabitEthernet0/0/0/0.513 SNPA 0050.56a9.f961: System ID R10, Holdtime 18, LAN ID XRv1.01, length 90
13:58:17.010 : isis[1010]: RECV L2 LAN IIH from GigabitEthernet0/0/0/0.513 SNPA 0050.56a9.ea54: System ID XRv3, Holdtime 6, LAN ID XRv1.01, length 90
[snip, R4 hellos]
13:58:17.540 : isis[1010]: SEND L2 LAN IIH on GigabitEthernet0/0/0/0.513: isis_osi_send_lan_hello 1491 LAN ID XRv1.01, 3 neighbors, Holdtime 6s, Length 90

CSR10 receives hellos from all other neighbors (shown by MAC address) which all show the DIS’ NET minus the area (0000.0000.0011.01). Hellos can be jittered up to 25% so the interval is not exactly 2 seconds. Also, we see that CSR10 sends a single hello for every ~3 received from the DIS as its interval is 3 times slower.

R10#show ip arp 181.13.40.11
Protocol   Address        Age (min)   Hardware Addr    Type   Interface
Internet   181.13.40.11   34          0050.56a9.9c60   ARPA   Gig2.513

R10#debug isis adj-packets
14:01:47.932: ISIS-Adj: Sending L2 LAN IIH on GigabitEthernet2.513, length 90
14:01:48.074: ISIS-Adj: Rec L2 IIH from 0050.56a9.9c60 (GigabitEthernet2.513), cir type L2, cir id 0000.0000.0011.01, length 90, ht(6)
14:01:48.074: ISIS-Adj: tid 0
14:01:48.074: ISIS-Adj: ,2
14:01:48.074: ISIS-Adj: he_knows_us 1, old state 0, new state 0, level 2
[snip, other hellos]
14:01:49.694: ISIS-Adj: Rec L2 IIH from 0050.56a9.9c60 (GigabitEthernet2.513), cir type L2, cir id 0000.0000.0011.01, length 90, ht(6)
14:01:49.694: ISIS-Adj: tid 0
14:01:49.694: ISIS-Adj: ,2
14:01:49.694: ISIS-Adj: he_knows_us 1, old state 0, new state 0, level 2
14:01:51.554: ISIS-Adj: Rec L2 IIH from 0050.56a9.9c60 (GigabitEthernet2.513), cir type L2, cir id 0000.0000.0011.01, length 90, ht(6)
14:01:51.554: ISIS-Adj: tid 0
14:01:51.554: ISIS-Adj: ,2
14:01:51.554: ISIS-Adj: he_knows_us 1, old state 0, new state 0, level 2

EIGRP behaves similarly to IS-IS with respect to its per-neighbor treatment of hello/hold timers. The difference is that there isn’t a pseudonode concept since EIGRP is not link-state, so the additional complexity observed in IS-IS doesn’t apply. EIGRP also does not have a sub-second hello option, so the fastest possible failure detection is 1 second; hello timer is 1 second with a 1 second hold-time. This is probably highly unstable and a hold-time of at least 2 seconds would be preferred in this case. There is no difference between IPv4 and IPv6, so only IPv4 is examined here. XRv3 and CSR2 have timer adjustments shown below, while CSR4 uses the default timers of 5/15 (no configuration changes).

! CSR2
router eigrp L3
 address-family ipv4 unicast autonomous-system 181
  af-interface default
   hello-interval 4
   hold-time 16
! XRv3
router eigrp L3
 vrf L3
  address-family ipv4
   interface GigabitEthernet0/0/0/0.523
    hello-interval 3
    hold-time 9

We can quickly verify the configuration was successful by checking the interface details. Notice that all of the timers are different.

R4#show eigrp address-family ipv4 vrf L3 interfaces detail | include Hello
  Hello-interval is 5, Hold-time is 15

R2#show eigrp address-family ipv4 interfaces detail | include Hello
  Hello-interval is 4, Hold-time is 16

RP/0/0/CPU0:XRv3#show eigrp vrf L3 ipv4 interfaces detail | include Hello
  Hello interval is 3 sec, hold time is 9 sec


Checking the EIGRP neighbors on CSR2, we can see the per-neighbor hold times, just like IS-IS. XRv3 has a hold time of 9 seconds, so a value of 8 makes sense. CSR4 has a hold time of 15 seconds, so a value of 12 makes sense. Similarly for XRv3, CSR4 and CSR2 both have reasonable hold times (that is, values less than or equal to their configured hold times), which shows a correct configuration.

R2#show eigrp address-family ipv4 neighbors
EIGRP-IPv4 VR(L3) Address-Family Neighbors for AS(181)
H   Address      Interface   Hold    Uptime     SRTT   RTO    Q     Seq
                             (sec)              (ms)          Cnt   Num
1   10.23.4.13   Gi2.523     8       01:18:02   51     306    0     162
0   10.23.4.4    Gi2.523     12      1d22h      27     162    0     110

RP/0/0/CPU0:XRv3#show eigrp vrf L3 ipv4 neighbors
IPv4-EIGRP VR(L3) Neighbors for AS(181) VRF L3
H   Address      Interface       Hold    Uptime     SRTT   RTO    Q     Seq
                                 (sec)              (ms)          Cnt   Num
1   10.23.4.4    Gi0/0/0/0.523   10      02:43:02   505    3030   0     118
0   10.23.4.2    Gi0/0/0/0.523   13      02:43:02   1      200    0     79

EIGRP hello debugging on CSR2 also shows packets being sent and received. Packets from CSR4 are 5 seconds apart (yellow), while packets received from XRv3 are 3 seconds apart (green). Packets sent from CSR2 are transmitted every 4 seconds, but only one is shown here (cyan).

! CSR2
20:52:09.736: EIGRP: received packet with MD5 authentication, key id = 1
20:52:09.736: EIGRP: Received HELLO on Gi2.523 - paklen 60 nbr 10.23.4.4
20:52:09.736: AS 181, Flags 0x0:(NULL), Seq 0/0 interfaceQ 0/0 iidbQ un/rely 0/0 peerQ un/rely 0/0
20:52:10.298: EIGRP: received packet with MD5 authentication, key id = 1
20:52:10.298: EIGRP: Received HELLO on Gi2.523 - paklen 60 nbr FE80::13
20:52:10.298: AS 181, Flags 0x0:(NULL), Seq 0/0 interfaceQ 0/0 all iidbQ un/rely 0/0 peerQ un/rely 0/0
20:52:10.597: EIGRP: received packet with MD5 authentication, key id = 1
20:52:10.597: EIGRP: Received HELLO on Gi2.523 - paklen 60 nbr 10.23.4.13
20:52:10.597: AS 181, Flags 0x0:(NULL), Seq 0/0 interfaceQ 0/0 iidbQ un/rely 0/0 peerQ un/rely 0/0
20:52:11.160: EIGRP: received packet with MD5 authentication, key id = 1
20:52:11.160: EIGRP: Received HELLO on Gi2.523 - paklen 60 nbr FE80::4
20:52:11.160: AS 181, Flags 0x0:(NULL), Seq 0/0 interfaceQ 0/0 iidbQ un/rely 0/0 peerQ un/rely 0/0
20:52:13.456: EIGRP: received packet with MD5 authentication, key id = 1
20:52:13.456: EIGRP: Received HELLO on Gi2.523 - paklen 60 nbr 10.23.4.13
20:52:13.456: AS 181, Flags 0x0:(NULL), Seq 0/0 interfaceQ 0/0 iidbQ un/rely 0/0 peerQ un/rely 0/0
20:52:13.548: EIGRP: Sending HELLO on Gi2.523 - paklen 66
20:52:13.548: AS 181, Flags 0x0:(NULL), Seq 0/0 interfaceQ 0/0 iidbQ un/rely 0/0


20:52:14.681: EIGRP: received packet with MD5 authentication, key id = 1
20:52:14.681: EIGRP: Received HELLO on Gi2.523 - paklen 60 nbr 10.23.4.4
20:52:14.681: AS 181, Flags 0x0:(NULL), Seq 0/0 interfaceQ 0/0 iidbQ un/rely 0/0 peerQ un/rely 0/0

Output from XR is verbose, especially with authentication enabled. Sent packets are shown in cyan, with received packets shown in yellow and green from CSR4 and CSR2, respectively.

! XRv3
20:58:11.402 : eigrp[1002]: IPv4-EIGRP(L3-181): interface GigabitEthernet0_0_0_0.523, keychain KC_GEN_AUTH, Found live authentication key(id = 1, alg = 3)
20:58:11.402 : eigrp[1002]: IPv4-EIGRP(L3-181): interface GigabitEthernet0_0_0_0.523, packet being sent with key(id = 1)
20:58:11.402 : eigrp[1002]: IPv4-EIGRP(L3-181): Sending HELLO on Gi0/0/0/0.523
20:58:11.402 : eigrp[1002]: IPv4-EIGRP(L3-181): AS 181, Flags 0x00000000, Seq 0/0 iidbQ un/rely 0/0 ddb Gr/Nsf/NiP 1/1/0
20:58:13.201 : eigrp[1002]: IPv4-EIGRP(L3-181): interface GigabitEthernet0_0_0_0.523, received authenticated packet, key id = 1
20:58:13.201 : eigrp[1002]: IPv4-EIGRP(L3-181): interface GigabitEthernet0_0_0_0.523, pkt authentication, local key(id = 1)
20:58:13.201 : eigrp[1002]: IPv4-EIGRP(L3-181): interface GigabitEthernet0_0_0_0.523, packet authentication key id = 1, authentication data is valid
20:58:13.201 : eigrp[1002]: IPv4-EIGRP(L3-181): Received HELLO on Gi0/0/0/0.523 nbr 10.23.4.4
20:58:13.201 : eigrp[1002]: IPv4-EIGRP(L3-181): AS 181, Flags 0x00000000, Seq 0/0 iidbQ un/rely 0/0 ddb Gr/Nsf/NiP 1/1/0 peer Cr/0/NIn/0/NIa/0/NiP/0/Eot/0/uNS/1/IGr/0/LAck/118 peerQ un/rely 0/0
20:58:13.321 : eigrp[1002]: IPv4-EIGRP(L3-181): interface GigabitEthernet0_0_0_0.523, received authenticated packet, key id = 1
20:58:13.321 : eigrp[1002]: IPv4-EIGRP(L3-181): interface GigabitEthernet0_0_0_0.523, pkt authentication, local key(id = 1)
20:58:13.321 : eigrp[1002]: IPv4-EIGRP(L3-181): interface GigabitEthernet0_0_0_0.523, packet authentication key id = 1, authentication data is valid
20:58:13.321 : eigrp[1002]: IPv4-EIGRP(L3-181): Received HELLO on Gi0/0/0/0.523 nbr 10.23.4.2
20:58:13.321 : eigrp[1002]: IPv4-EIGRP(L3-181): AS 181, Flags 0x00000000, Seq 0/0 iidbQ un/rely 0/0 ddb Gr/Nsf/NiP 1/1/0 peer Cr/0/NIn/0/NIa/0/NiP/0/Eot/0/uNS/1/IGr/0/LAck/79 peerQ un/rely 0/0
20:58:13.981 : eigrp[1002]: IPv4-EIGRP(L3-181): interface GigabitEthernet0_0_0_0.523, keychain 'KC_GEN_AUTH' Valid authentication keys found (id = 1)
20:58:13.981 : eigrp[1002]: IPv4-EIGRP(L3-181): interface GigabitEthernet0_0_0_0.523, keychain KC_GEN_AUTH, Found live authentication key(id = 1, alg = 3)
20:58:13.981 : eigrp[1002]: IPv4-EIGRP(L3-181): interface GigabitEthernet0_0_0_0.523, packet being sent with key(id = 1)
20:58:13.981 : eigrp[1002]: IPv4-EIGRP(L3-181): Sending HELLO on Gi0/0/0/0.523
20:58:13.981 : eigrp[1002]: IPv4-EIGRP(L3-181): AS 181, Flags 0x00000000, Seq 0/0 iidbQ un/rely 0/0 ddb Gr/Nsf/NiP 1/1/0

RIP timers aren’t really used for failure detection since RIP packets actually carry routes; it doesn’t have a dedicated neighbor discovery/maintenance mechanism. We can change the frequency at which the routing advertisements are sent, invalidated, and flushed. We use RIP and RIPng as the PE-CE routing protocol between CSR5 and CSR6 for a quick demonstration. The only caveat is that the remote update timer must be less than the local invalid timer. If it isn’t, then the routing advertisements will come in too slowly, and the receiving router will have periods of downtime since the routes are flushed too quickly. Below is an example of this phenomenon. CSR5 uses the default timers (30/180/180/240) while CSR6 uses new, aggressive timers. CSR6 sends updates every 5 seconds and flushes received updates after 15 seconds. CSR5 is sending updates every 30 seconds yet CSR6 can only retain them for 15 seconds. For the remaining ~15 seconds, CSR6 will have no routes from CSR5. CSR6’s advertisements are shown in yellow, and the received routes and subsequent downtime are shown in green.

! CSR6
router rip
 timers basic 5 15 15 20

R6#debug ip rip
R6#debug ip routing
22:39:53.022: RIP: received v2 update from 10.5.6.5 on GigabitEthernet2.556
22:39:53.022: 10.7.7.7/32 via 0.0.0.0 in 8 hops
22:39:53.022: RT: updating rip 10.7.7.7/32 (0x0) : 10.5.6.5 Gi2.556 0 1048578
22:39:53.022: RT: add 10.7.7.7/32 via 10.5.6.5, rip metric [120/8]
22:39:55.023: RIP: sending v2 flash update to 224.0.0.9 via GigabitEthernet2.556 (10.5.6.6)
22:39:55.023: RIP: build flash update entries - suppressing null update
22:39:56.369: RIP: sending v2 update to 224.0.0.9 via GigabitEthernet2.556 (10.5.6.6)
22:39:56.369: RIP: build update entries
22:39:56.369: 10.6.6.6/32 via 0.0.0.0, metric 1, tag 0
22:39:56.369: 10.66.66.66/32 via 0.0.0.0, metric 1, tag 0
22:40:01.324: RIP: sending v2 update to 224.0.0.9 via GigabitEthernet2.556 (10.5.6.6)
22:40:01.325: RIP: build update entries
22:40:01.325: 10.6.6.6/32 via 0.0.0.0, metric 1, tag 0
22:40:01.325: 10.66.66.66/32 via 0.0.0.0, metric 1, tag 0


22:40:05.711: RIP: sending v2 update to 224.0.0.9 via GigabitEthernet2.556 (10.5.6.6)
22:40:05.711: RIP: build update entries
22:40:05.711: 10.6.6.6/32 via 0.0.0.0, metric 1, tag 0
22:40:05.711: 10.66.66.66/32 via 0.0.0.0, metric 1, tag 0
22:40:10.062: RIP: sending v2 update to 224.0.0.9 via GigabitEthernet2.556 (10.5.6.6)
22:40:10.062: RIP: build update entries
22:40:10.062: 10.6.6.6/32 via 0.0.0.0, metric 1, tag 0
22:40:10.062: 10.66.66.66/32 via 0.0.0.0, metric 1, tag 0
22:40:14.570: RIP: sending v2 update to 224.0.0.9 via GigabitEthernet2.556 (10.5.6.6)
22:40:14.570: RIP: build update entries
22:40:14.570: 10.6.6.6/32 via 0.0.0.0, metric 1, tag 0
22:40:14.570: 10.66.66.66/32 via 0.0.0.0, metric 1, tag 0
22:40:17.555: RT: del 10.7.7.7 via 10.5.6.5, rip metric [120/8]
22:40:17.555: RT: delete subnet route to 10.7.7.7/32
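The size of the outage window follows directly from the two timers involved. The Python sketch below is a deliberately simplified model: the function name is hypothetical, and it ignores update jitter as well as the holddown and flush timers, so treat the result as an approximation.

```python
def rip_route_gap(remote_update: int, local_invalid: int) -> int:
    """Approximate seconds per update cycle during which a learned route is missing.

    remote_update: how often the neighbor advertises its routes.
    local_invalid: how long the local router keeps a route without a refresh.
    Simplified model: jitter, holddown, and flush timers are ignored.
    """
    return max(0, remote_update - local_invalid)


# CSR5 advertises every 30 seconds, CSR6 invalidates after 15 seconds.
print(rip_route_gap(remote_update=30, local_invalid=15))
# Matching 5/15 timers on both sides leave no gap.
print(rip_route_gap(remote_update=5, local_invalid=15))
```

A result of zero is the goal, which is why the neighbor's update timer has to stay below the local invalid timer.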

To fix it, we mirror the timers on CSR5 under the VRF. Additionally, we configure “fast” timers on both nodes for RIPng as well. Also note that RIPng is not VRF-aware by default and requires a global command to enable that feature set (this would have been done during the initial L3VPN setup, but I draw attention to it here).

! CSR5
ipv6 rip vrf-mode enable
router rip
 address-family ipv4 vrf L3
  timers basic 5 15 15 20
ipv6 router rip L3
 address-family ipv6 vrf L3
  timers 5 15 15 20
! CSR6
ipv6 router rip CUST
 timers 5 15 15 20

We can verify updates are being passed back and forth successfully. We can see packets being sent and received 5 seconds apart, which is correct. Updates in green are being sent from CSR6 and updates in yellow are received by CSR6 from CSR5.

R6#debug ipv6 rip
22:48:44.530: RIPng [Gi2.556, default VRF]: process "CUST" is sending a multicast update.
22:48:44.530: src=FE80::6
22:48:44.530: dst=FF02::9 (GigabitEthernet2.556)
22:48:44.530: sport=521, dport=521, length=72
22:48:44.530: command=2, version=1, mbz=0, #rte=3
22:48:44.530: tag=0, metric=1, prefix=::10:6:6:6/128
22:48:44.530: tag=0, metric=1, prefix=FD00:10:5:6::/64
22:48:44.530: tag=0, metric=1, prefix=::10:66:66:66/128
22:48:44.530: RIPng [default VRF]: a message has been received.
22:48:44.893: RIPng [default VRF]: a message has been received.
22:48:44.893: RIPng [Gi2.556, default VRF]: response received from FE80::5 for process "CUST".
22:48:44.893: src=FE80::5 (GigabitEthernet2.556)
22:48:44.893: dst=FF02::9
22:48:44.893: sport=521, dport=521, length=32
22:48:44.893: command=2, version=1, mbz=0, #rte=2
22:48:44.893: tag=0, metric=1, prefix=FD00:10:5:6::/64
22:48:44.893: tag=0, metric=8, prefix=::10:7:7:7/128
22:48:49.530: RIPng [Gi2.556, default VRF]: process "CUST" is sending a multicast update.
22:48:49.530: src=FE80::6
22:48:49.530: dst=FF02::9 (GigabitEthernet2.556)
22:48:49.530: sport=521, dport=521, length=72
22:48:49.530: command=2, version=1, mbz=0, #rte=3
22:48:49.530: tag=0, metric=1, prefix=::10:6:6:6/128
22:48:49.530: tag=0, metric=1, prefix=FD00:10:5:6::/64
22:48:49.530: tag=0, metric=1, prefix=::10:66:66:66/128
22:48:49.530: RIPng [default VRF]: a message has been received.
22:48:49.893: RIPng [default VRF]: next RIB walk in 10000.
22:48:49.893: RIPng [default VRF]: a message has been received.
22:48:49.893: RIPng [Gi2.556, default VRF]: response received from FE80::5 for process "CUST".
22:48:49.893: src=FE80::5 (GigabitEthernet2.556)
22:48:49.893: dst=FF02::9
22:48:49.893: sport=521, dport=521, length=32
22:48:49.893: command=2, version=1, mbz=0, #rte=2
22:48:49.893: tag=0, metric=1, prefix=FD00:10:5:6::/64
22:48:44.893: tag=0, metric=8, prefix=::10:7:7:7/128

HSRP timers must match for the process to converge, but they can be configured differently on each router. Whichever device is the active router for a given group will propagate its timers to the standby nodes. In this way, the configuration need not match, since the HSRP process ensures the timers are carried in the hello messages. This is similar to STP where the timers of the root bridge are carried in BPDUs for all non-root switches to honor, but individual switches can have entirely different timer settings. The default HSRP timers are 3/10 for hello/hold, so we will modify them to 4/12 on XRv3, while leaving the CSR4 timers unchanged.

! XRv3
router hsrp
 interface GigabitEthernet0/0/0/0.523
  address-family ipv4
   hsrp 234 version 2
    timers 4 12

We can clearly see that XRv3 is the active router based on its higher priority (configuration not shown). CSR4 is the standby router but has inherited the new timers. XRv3 also shows us the configured timers, so comparing the configured and real-time timers is easier.

RP/0/0/CPU0:XRv3#show hsrp gig0/0/0/0.523 234
IPv4 Groups:
                        P indicates configured to preempt.
                        |
Interface       Grp   Pri   P   State    Active addr   Standby addr   Group addr
Gi0/0/0/0.523   234   105   P   Active   local         10.23.4.4      10.23.4.254

R4#show standby gig2.524 234 | include State|Hello
  State is Standby
  Hello time 4 sec, hold time 12 sec

RP/0/0/CPU0:XRv3#show hsrp gig0/0/0/0.523 234 detail | include [Hh]ello
  Hellotime 4000 msec holdtime 12000 msec
  Configured hellotime 4000 msec holdtime 12000 msec

Since CSR4 is the active router for IPv6 hosts, we will change its timers using the smallest possible values, which are 15 ms and 50 ms. XRv3 accepts these values as they are learned via HSRP exchanges from the active router. Since HSRP runs different instances for IPv4 and IPv6, the two topologies can achieve rudimentary load-sharing (and entirely different configurations) in this way.

! CSR4
interface GigabitEthernet2.524
 standby 236 timers msec 15 msec 50

R4#show standby gig2.524 236 | include State|Hello
  State is Active
  Hello time 15 msec, hold time 50 msec

RP/0/0/CPU0:XRv3#show hsrp gig0/0/0/0.523 236 detail | include Hello
  Hellotime 15 msec holdtime 50 msec

For debugging purposes, HSRP 236 is temporarily disabled to avoid flooding the log buffer/console. We will look at the IPv4 exchanges. Even with detailed debugging, there isn’t much to see. The packets contain their timers, priorities, interface MAC addresses, and virtual IP addresses. The in/out hello packets show the source IP of the packet received as the destination is multicast (224.0.0.102 for HSRPv2). We can see that packets are sent ~4 seconds apart in either direction.


R4#debug standby packets hello detail
23:05:30.380: HSRP: Gi2.524 Grp 234 Hello out 10.23.4.4 Standby pri 100 vIP 10.23.4.254
23:05:30.380: hel 4000 hol 12000 id 0050.56a9.2c57
23:05:31.951: HSRP: Gi2.524 Grp 234 Hello in 10.23.4.13 Active pri 105 vIP 10.23.4.254
23:05:31.951: hel 4000 hol 12000 id 0050.56a9.ea54
23:05:33.819: HSRP: Gi2.524 Grp 234 Hello out 10.23.4.4 Standby pri 100 vIP 10.23.4.254
23:05:33.819: hel 4000 hol 12000 id 0050.56a9.2c57
23:05:35.970: HSRP: Gi2.524 Grp 234 Hello in 10.23.4.13 Active pri 105 vIP 10.23.4.254
23:05:35.970: hel 4000 hol 12000 id 0050.56a9.ea54

XRv3’s debugging is very similar to CSR4’s, showing the hello packets and their contents. One interesting difference is that XR doesn’t appear to show outgoing HSRP packets, only incoming ones.

RP/0/0/CPU0:XRv3#debug hsrp packets hello detail
23:10:06.699 : hsrp[1095]: SB234: Gi0/0/0/0.523 Hello in 10.23.4.4 Standby pri 100 hel 4000 hol 12000 ip 10.23.4.254 ver 2
23:10:10.239 : hsrp[1095]: SB234: Gi0/0/0/0.523 Hello in 10.23.4.4 Standby pri 100 hel 4000 hol 12000 ip 10.23.4.254 ver 2
23:10:13.539 : hsrp[1095]: SB234: Gi0/0/0/0.523 Hello in 10.23.4.4 Standby pri 100 hel 4000 hol 12000 ip 10.23.4.254 ver 2

BGP timers can be configured differently on both sides, but the hold-time must match (after negotiation) for the session to form. BGP sessions use keepalive/hold timers of 60/180 by default on Cisco routers. Before doing anything, we will verify the default timers on CSR8. I show internal and external VPNv4 sessions and a single external VPNv6 session to prove it.

R8#show bgp vpnv4 unicast all neighbors | include ^BGP|hold_time
BGP neighbor is 7.7.7.7, vrf L3, remote AS 7, external link
  Last read 00:00:17, last write 00:00:32, hold time is 180, keepalive interval is 60 seconds
BGP neighbor is 181.0.0.4, remote AS 181, internal link
  Last read 00:00:28, last write 00:00:37, hold time is 180, keepalive interval is 60 seconds
BGP neighbor is 181.0.0.5, remote AS 181, internal link
  Last read 00:00:16, last write 00:00:43, hold time is 180, keepalive interval is 60 seconds
BGP neighbor is 181.0.0.13, remote AS 181, internal link
  Last read 00:00:10, last write 00:00:46, hold time is 180, keepalive interval is 60 seconds

R8#show bgp vpnv6 unicast vrf L3 neighbors fd00:10:7:8::7 | include hold_time
  Last read 00:00:00, last write 00:00:22, hold time is 180, keepalive interval is 60 seconds


The hold-time is negotiated as part of the BGP session establishment where the smaller (faster) value is preferred. We can set a lower hold-time on XRv3 to prove this. A side effect of this is that the router that acknowledged the adjustment (CSR8, in this case) adjusts its keepalive interval to 1/3 of the new, smaller hold-time (175 / 3, rounded down to 58). This means the keepalive interval decreases proportionally to the hold-time. In this case, XR follows suit and also uses 58 seconds despite being configured for 60.

! XRv3
router bgp 181
 neighbor 181.0.0.8
  timers 60 175

R8#show bgp vpnv4 unicast all neighbors 181.0.0.13 | include hold_time
  Last read 00:00:37, last write 00:00:37, hold time is 175, keepalive interval is 58 seconds

RP/0/0/CPU0:XRv3#show bgp vpnv4 unicast neighbors 181.0.0.8 | include hold time
  Hold time is 175, keepalive interval is 58 seconds
  Configured hold time: 175, keepalive: 60, min acceptable hold time: 3

A more extreme example would be setting much lower timers, such as 10/40, and observing the remote side. We configure 10/40 timers on CSR8 towards CSR5, who then calculates its new keepalive timer as 13 seconds. Like in XR, there is a difference between configured and actual hold times. When a local node makes explicit changes to a timer, the two sets of timers are shown together. Unlike XR, it seems that XE is less generous; it still uses its configured keepalive timer despite CSR5 adjusting its timer proportionally. Asymmetric timers are typically undesirable and add complexity while yielding little gain.

! CSR8
router bgp 181
 neighbor 181.0.0.5 timers 10 40

R5#show bgp vpnv4 unicast all neighbors 181.0.0.8 | include hold_time
  Last read 00:00:00, last write 00:00:03, hold time is 40, keepalive interval is 13 seconds

R8#show bgp vpnv4 unicast all neighbors 181.0.0.5 | include hold_time
  Last read 00:00:06, last write 00:00:08, hold time is 40, keepalive interval is 10 seconds
  Configured hold time is 40, keepalive interval is 10 seconds
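One way to reason about these values is a small Python model. The hold-time rule (use the smaller of the two advertised values) is standard BGP behavior; the keepalive rule used here, the smaller of the configured keepalive and one third of the negotiated hold-time, is my assumption, chosen because it reproduces every keepalive shown so far (58, 13, and 10). The function name is hypothetical.

```python
from typing import Tuple


def bgp_effective_timers(
    configured_keepalive: int,
    configured_hold: int,
    remote_hold: int,
) -> Tuple[int, int]:
    """Return (keepalive, hold) in seconds as seen from the local router.

    Hold-time: the smaller of the local and remote advertised values.
    Keepalive: assumed to be min(configured keepalive, hold // 3); this is a
    model that fits the outputs above, not a documented platform guarantee.
    """
    hold = min(configured_hold, remote_hold)
    keepalive = min(configured_keepalive, hold // 3)
    return keepalive, hold


# CSR8 (defaults 60/180) peering with XRv3, which advertises a hold-time of 175.
print(bgp_effective_timers(60, 180, remote_hold=175))
# CSR5 (defaults) peering with CSR8, which advertises a hold-time of 40.
print(bgp_effective_timers(60, 180, remote_hold=40))
# CSR8 (configured 10/40) peering with CSR5, which advertises 180.
print(bgp_effective_timers(10, 40, remote_hold=180))
```

Under this model CSR8's keepalive of 10 is simply the smaller of its configured 10 and 40 / 3, so both platforms would be following the same rule; that reading is an inference from the outputs rather than something the text states.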

As a security mechanism to protect the router’s control-plane from too many BGP keepalives, we can specify a minimum neighbor hold-time. This allows a BGP speaker to reject sessions that have overly aggressive hold-timers; this is ideal for PE routers. By default, this is set to 0, allowing any hold timer. We can verify this by checking CSR8’s session to CSR5, as an example.

R8#show bgp vpnv4 unicast all neighbors 181.0.0.5 | include Minimum
  Minimum holdtime from neighbor is 0 seconds

We will configure CSR8, as a PE, to ensure CSR7 is using at least a 60 second hold timer. Because we have not changed CSR7 yet, and we know the default was 180, we can assume CSR7 will negotiate a hold-time of 60 when peering with CSR8. This is an acceptable hold-time. We verify these timers on CSR7 and CSR8; again we see that XE is stubborn and uses its locally configured values, despite the remote peer calculating its keepalive to be 1/3 of the negotiated hold timer.

! CSR8
router bgp 181
 address-family ipv6 vrf L3
  neighbor FD00:10:7:8::7 timers 15 60 60

R8#show bgp vpnv6 unicast vrf L3 neighbors FD00:10:7:8::7 | include hold_?time
  Last read 00:00:15, last write 00:00:06, hold time is 60, keepalive interval is 15 seconds
  Configured hold time is 60, keepalive interval is 15 seconds
  Minimum holdtime from neighbor is 60 seconds

R7#show bgp ipv6 unicast neighbors FD00:10:7:8::8 | include hold_?time
  Last read 00:00:00, last write 00:00:09, hold time is 60, keepalive interval is 20 seconds

Assuming CSR7 wanted to use an aggressive hold-time, the PE would not allow it, and the session does not form. Both CSR7 and CSR8 generate BGP log messages to detail this error. CSR7 tries to use a hold-time of 59 seconds which is “too fast” for CSR8. The syslog messages make this error very apparent.

! CSR7
router bgp 181
 neighbor FD00:10:7:8::8 timers 15 59
! CSR8
%BGP-3-NOTIFICATION: sent to neighbor FD00:10:7:8::7 active 2/6 (unacceptable hold time) 0 bytes
! CSR7
%BGP-3-NOTIFICATION: received from neighbor FD00:10:7:8::8 active 2/6 (unacceptable hold time) 0 bytes
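The minimum hold-time check amounts to a single comparison against the hold-time the neighbor advertises. The Python sketch below uses a hypothetical function name and returns the notification text from the syslog messages above; it does not model any other validation a router performs on the OPEN message.

```python
from typing import Optional


def check_remote_hold(remote_hold: int, min_hold: int = 0) -> Optional[str]:
    """Return None if the neighbor's hold-time is acceptable, else the error text.

    min_hold=0 mirrors the default shown earlier, which accepts any value.
    Only the minimum hold-time comparison is modeled here.
    """
    if remote_hold < min_hold:
        return "2/6 (unacceptable hold time)"
    return None


print(check_remote_hold(60, min_hold=60))   # CSR7 advertising 60: accepted
print(check_remote_hold(59, min_hold=60))   # CSR7 advertising 59: rejected
```

The comparison is against the advertised value, which is why lowering CSR7 by a single second is enough to tear the session down.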

This isn’t very useful for the customer, since they legitimately may not know what the hold-time is supposed to be. The BGP debugs don’t help, either, so the customer would need to contact the provider. Assume that the issue was fixed and that CSR7 is now using a hold-time of 60 seconds again. Debugging on CSR7 towards CSR8, we can see CSR8 sending keepalives to CSR7 every ~15 seconds (yellow). CSR7 is sending keepalives to CSR8 every ~20 seconds as well (green), and these values don’t need to match.

R7#debug bgp ipv6 unicast keepalives
00:28:56.997: BGP: FD00:10:7:8::8 received KEEPALIVE, length (excl. header) 0
00:29:02.756: BGP: ses global FD00:10:7:8::8 (0x7F401571BBC8:1) Keep alive timer fired.
00:29:02.756: BGP: FD00:10:7:8::8 KEEPALIVE requested (bgp_keepalive_timer_expired)
00:29:02.756: BGP: ses global FD00:10:7:8::8 (0x7F401571BBC8:1) service keepalive IO request.
00:29:02.756: BGP: FD00:10:7:8::8 KEEPALIVE write request serviced in BGP_IO
00:29:10.310: BGP: FD00:10:7:8::8 received KEEPALIVE, length (excl. header) 0
00:29:20.164: BGP: ses global FD00:10:7:8::8 (0x7F401571BBC8:1) Keep alive timer fired.
00:29:20.164: BGP: FD00:10:7:8::8 KEEPALIVE requested (bgp_keepalive_timer_expired)
00:29:20.165: BGP: ses global FD00:10:7:8::8 (0x7F401571BBC8:1) service keepalive IO request.
00:29:20.165: BGP: FD00:10:7:8::8 KEEPALIVE write request serviced in BGP_IO
00:29:24.648: BGP: FD00:10:7:8::8 received KEEPALIVE, length (excl. header) 0

For completeness, we will do the same debugging on XRv3. Since XR is more generous than XE, it has reduced its keepalive timer to 58 seconds, along with CSR8, so we should see keepalives at approximately the same interval. Keepalives sent to CSR8 occur every ~58 seconds (yellow), and keepalives received from CSR8 also occur every ~58 seconds as expected (green). This simplifies debugging by ensuring that both the keepalive and hold timers are synchronized.
RP/0/0/CPU0:XRv3#debug bgp keepalive 181.0.0.8
00:36:25.824 : bgp[1052]: [default-ka]: Keepalive timer expired for 181.0.0.8
00:36:25.824 : bgp[1052]: [default-ka]: Keepalive timer started for 181.0.0.8(loc 5)
00:36:25.824 : bgp[1052]: [default-ka]: KEEPALIVE sent to 181.0.0.8
00:36:35.864 : bgp[1052]: [default-iord]: KEEPALIVE received from 181.0.0.8
00:37:23.841 : bgp[1052]: [default-ka]: Keepalive timer expired for 181.0.0.8
00:37:23.841 : bgp[1052]: [default-ka]: Keepalive timer started for 181.0.0.8(loc 5)
00:37:23.841 : bgp[1052]: [default-ka]: KEEPALIVE sent to 181.0.0.8
00:37:31.160 : bgp[1052]: [default-iord]: KEEPALIVE received from 181.0.0.8

PIM hellos are also very straightforward and the hold-time is non-configurable. It is automatically 3.5 times the hello timer received from each neighbor. As such, the timers can be completely different like EIGRP and IS-IS. The default hello timer is 30; we reduce it to 15 on CSR4 and 3 on XRv3 for IPv4/v6.
! CSR4
interface GigabitEthernet2.524
 ip pim query-interval 15
 ipv6 pim hello-interval 15
! XRv3
router pim
 vrf L3
  address-family ipv4
   hello-interval 3
  address-family ipv6
   hello-interval 3
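As a quick sanity check of the 3.5x rule, the sketch below computes the hold-time each router would apply to its neighbor using the hello timers configured above. The function and dictionary names are mine and purely illustrative.

```python
def pim_holdtime(neighbor_hello_s):
    """Hold-time applied to a neighbor: 3.5 times the hello timer it advertises."""
    return 3.5 * neighbor_hello_s


# Hello timers from the configuration above, plus the default for comparison.
hello_timers = {"CSR4": 15, "XRv3": 3, "default": 30}

for router, hello in hello_timers.items():
    # The value is what the *other* side uses to age out this router.
    print(f"{router}: hello {hello}s, neighbors hold it for {pim_holdtime(hello)}s")
```

The point is that each side ages out its neighbor based on the neighbor's hello timer, so CSR4 and XRv3 can run different timers without breaking the adjacency.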

We can quickly confirm the configuration was correct by checking the interface details on both routers for both address-families.
R4#show ip pim vrf L3 interface detail | include Hello
 PIM Hello/Query interval: 15 seconds
 PIM Hello packets in/out: 34334/36527
R4#show ipv6 pim vrf L3 interface gig2.524
Interface          PIM   Nbr   Hello  DR
                         Count Intvl  Prior
Gi2.524            on    1     15     1
    Address: FE80::4
    DR     : FE80::13
RP/0/0/CPU0:XRv3#show pim vrf L3 ipv4 interface gig0/0/0/0.523
PIM interfaces in VRF L3
Address      Interface                   PIM  Nbr   Hello  DR     DR
                                              Count Intvl  Prior
10.23.4.13   GigabitEthernet0/0/0/0.523  on   2     3      1      this system
RP/0/0/CPU0:XRv3#show pim vrf L3 ipv6 interface gig0/0/0/0.523
PIM interfaces in VRF L3
Interface                   PIM  Nbr   Hello  DR
                                 Count Intvl  Prior
GigabitEthernet0/0/0/0.523  on   2     3      1
    Primary Address : fe80::13
                 DR : this system

We will debug the IPv4 PIM hellos to show the variance in timers. Yellow represents CSR4 sending PIM hellos every 15 seconds, while green represents receiving PIM hellos from XRv3 every 3 seconds. Of note, IPv6 PIM requires non-LL addressing, unlike many other IPv6 protocols, and the PIMv6 hello debug is very verbose.
R4#debug ip pim vrf L3 hello
23:31:48.447: PIM(1): Send periodic v2 Hello on GigabitEthernet2.524 with GenID = 2707292449
23:31:48.536: PIM(1): Received v2 hello on GigabitEthernet2.524 from 10.23.4.13
23:31:48.536: PIM(1): Ignored unknown option 2 in PIM Hello packet
23:31:48.536: PIM(1): Neighbor (10.23.4.13) Hello GENID = 21332
23:31:51.635: PIM(1): Received v2 hello on GigabitEthernet2.524 from 10.23.4.13
23:31:51.635: PIM(1): Ignored unknown option 2 in PIM Hello packet
23:31:51.635: PIM(1): Neighbor (10.23.4.13) Hello GENID = 21332
23:31:54.735: PIM(1): Received v2 hello on GigabitEthernet2.524 from 10.23.4.13
23:31:54.735: PIM(1): Ignored unknown option 2 in PIM Hello packet
23:31:54.735: PIM(1): Neighbor (10.23.4.13) Hello GENID = 21332
23:31:57.840: PIM(1): Received v2 hello on GigabitEthernet2.524 from 10.23.4.13
23:31:57.840: PIM(1): Ignored unknown option 2 in PIM Hello packet
23:31:57.840: PIM(1): Neighbor (10.23.4.13) Hello GENID = 21332
23:32:00.945: PIM(1): Received v2 hello on GigabitEthernet2.524 from 10.23.4.13
23:32:00.945: PIM(1): Ignored unknown option 2 in PIM Hello packet
23:32:00.945: PIM(1): Neighbor (10.23.4.13) Hello GENID = 21332
23:32:03.197: PIM(1): Send periodic v2 Hello on GigabitEthernet2.524 with GenID = 2707292449

The XR debugs are concise, so much so that receiving a PIM hello doesn’t even show the originator’s IP address. It only indicates that a packet has been “read”. Considering that this “read” event occurs ~15 seconds apart and there is only one other PIM router on the segment, we can assume this is from CSR4 (yellow). The XRv3 originated packets are ~3 seconds apart, as expected (green).
RP/0/0/CPU0:XRv3#debug pim vrf L3 ipv4 io
23:34:46.048 : pim[1160]: [16] Read packet paklen 58 hdrlen 20
23:34:46.048 : pim[1160]: [16] Read 58 bytes of data
23:34:46.498 : pim[1160]: [16] pim_net_send: app_offset 20, nw_offset 0
23:34:46.498 : pim[1160]: [16] Sending packet, src 10.23.4.13, dest 224.0.0.13, datasize 42
23:34:49.598 : pim[1160]: [16] pim_net_send: app_offset 20, nw_offset 0
23:34:49.598 : pim[1160]: [16] Sending packet, src 10.23.4.13, dest 224.0.0.13, datasize 42
23:34:52.697 : pim[1160]: [16] pim_net_send: app_offset 20, nw_offset 0
23:34:52.697 : pim[1160]: [16] Sending packet, src 10.23.4.13, dest 224.0.0.13, datasize 42
23:34:55.797 : pim[1160]: [16] pim_net_send: app_offset 20, nw_offset 0
23:34:55.797 : pim[1160]: [16] Sending packet, src 10.23.4.13, dest 224.0.0.13, datasize 42
23:34:58.897 : pim[1160]: [16] pim_net_send: app_offset 20, nw_offset 0
23:34:58.897 : pim[1160]: [16] Sending packet, src 10.23.4.13, dest 224.0.0.13, datasize 42
23:35:00.607 : pim[1160]: [16] Read packet paklen 58 hdrlen 20
23:35:00.607 : pim[1160]: [16] Read 58 bytes of data

Additional Reading – Reference configurations "hello"

40.2 Bidirectional Forwarding Detection (BFD)

Bidirectional Forwarding Detection (BFD) is a lightweight keepalive protocol designed to reduce dead peer detection time across layer 2 networks. It is used primarily on networks that do not rely on line-protocol for interface status (like Ethernet). BFD uses a pair of UDP ports for transport and allows other protocols to "register" to it. BFD relies on a single session between its neighbors so that protocol-specific hello messages don't need to be aggressively tuned for fast fall-over. Individual protocols can be configured to use BFD; by default, none of them do. Some of these client protocols include EIGRP, OSPFv2/3, IS-IS, BGP, RSVP-TE, PIM, and xconnect. BFD uses echo messages to test reachability between neighbors and control messages for signaling. BFD does not perform neighbor detection; it relies on the registered protocol to do that, then runs probes based on those discovered IPv4/v6 neighbor addresses.

The network diagram is shown below; large LANs are used for certain protocols rather than P2P links to demonstrate BFD behavior. It is similar to the previous section without XRv in the core, as XRv does not support BFD (XRv1 is swapped with CSR1 and XRv3 is replaced by CSR3).


The echo function is on by default if you just enable BFD at link level using "bfd interval". Echoes are supported in BFD version 1, where version 0 only supported control packets. We can tell if echo is enabled by checking the BFD neighbor details. The OSPF LAN is configured using this method; the difference between echo and control packets is discussed later. One interesting note about BFD with OSPF is that, on broadcast networks, BFD sessions only form to the DR/BDR. In this case, CSR1 and CSR8 are not eligible to participate in the DR election, and thus do not form BFD peers with one another. Only CSR9 and CSR10 can be the DR/BDR.
! CSR1, CSR8, CSR9, CSR10
interface GigabitEthernet2.518
 bfd interval 900 min_rx 900 multiplier 3
R1#show bfd neighbors client ospf
IPv4 Sessions
NeighAddr        LD/RD        RH/RS   State   Int
181.18.90.9      4103/4100    Up      Up      Gi2.518
181.18.90.10     4104/4106    Up      Up      Gi2.518
R8#show bfd neighbors client ospf
IPv4 Sessions
NeighAddr        LD/RD        RH/RS   State   Int
181.18.90.9      4098/4097    Up      Up      Gi2.518
181.18.90.10     4100/4103    Up      Up      Gi2.518
R9#show bfd neighbors client ospf
IPv4 Sessions
NeighAddr        LD/RD        RH/RS   State   Int
181.18.90.1      4100/4103    Up      Up      Gi2.518
181.18.90.8      4097/4098    Up      Up      Gi2.518
181.18.90.10     4098/4101    Up      Up      Gi2.518
R10#show bfd neighbors client ospf
IPv4 Sessions
NeighAddr        LD/RD        RH/RS   State   Int
181.5.10.5       4102/4097    Up      Up      Gi2.550
181.18.90.1      4106/4104    Up      Up      Gi2.518
181.18.90.8      4103/4100    Up      Up      Gi2.518
181.18.90.9      4101/4098    Up      Up      Gi2.518
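To put the "bfd interval 900 min_rx 900 multiplier 3" values in perspective, here is a rough sketch of the control-packet detection time as I understand it from RFC 5880 (asynchronous mode): the multiplier received from the neighbor times the agreed transmit interval of that neighbor. The function name is hypothetical, and the sketch assumes both neighbors use the same timers; it does not model the echo function.

```python
def detection_time_ms(local_min_rx_ms, remote_min_tx_ms, remote_multiplier):
    """Approximate BFD detection time for control packets, in milliseconds."""
    # The neighbor may not send faster than we are willing to receive, so the
    # agreed interval is the larger of our min_rx and its desired min_tx.
    agreed_interval_ms = max(local_min_rx_ms, remote_min_tx_ms)
    # The session is declared down after this many intervals pass silently.
    return agreed_interval_ms * remote_multiplier


# Both ends configured with: bfd interval 900 min_rx 900 multiplier 3
print(detection_time_ms(900, 900, 3), "ms")
```

With these timers a dead neighbor is noticed in under three seconds, compared to the tens of seconds that default IGP dead timers would need.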

We can also check the OSPF process to ensure BFD is enabled. From CSR10, we can check an OSPF neighbor, interface, or process to see if BFD is enabled or not. BFD needs to be bound to OSPF (or any other protocol) on a per-interface basis. Using the command below under the OSPF process registers OSPF to BFD on all interfaces, which is typically desired.
! CSR1, CSR8, CSR9, CSR10
router ospf 181
 bfd all-interfaces
R10#show ip ospf neighbor 181.0.0.1 | include BFD
 In the area 0 via interface GigabitEthernet2.518, BFD enabled
R10#show ip ospf interface gig2.518 | include BFD
 Transmit Delay is 1 sec, State BDR, Priority 3, BFD enabled
R10#show ip ospf 181 | include BFD
 BFD is enabled

A quick check of the same routers shows similar results for OSPFv3 (assuming BFD is enabled for all interfaces). In this topology, CSR8 and CSR9 cannot be DR/BDR, and thus do not form BFD peers with one another. More accurately, because they do not form OSPF neighbors, the OSPF process does not inform BFD that a session should be created.
! CSR1, CSR8, CSR9, CSR10
router ospfv3 181
 address-family ipv6 unicast
  bfd all-interfaces
R1#show bfd neighbors client ospfv3
IPv6 Sessions
NeighAddr        LD/RD        RH/RS   State   Int
FE80::8          2/2          Up      Up      Gi2.518
FE80::9          1/2          Up      Up      Gi2.518
FE80::10         3/4          Up      Up      Gi2.518
R8#show bfd neighbors client ospfv3
IPv6 Sessions
NeighAddr        LD/RD        RH/RS   State   Int
FE80::1          2/2          Up      Up      Gi2.518
FE80::10         1/3          Up      Up      Gi2.518
R9#show bfd neighbors client ospfv3
IPv6 Sessions
NeighAddr        LD/RD        RH/RS   State   Int
FE80::1          2/1          Up      Up      Gi2.518
FE80::10         1/2          Up      Up      Gi2.518
R10#show bfd neighbors client ospfv3
IPv6 Sessions
NeighAddr        LD/RD        RH/RS   State   Int
FE80::1          4/3          Up      Up      Gi2.518
FE80::5          1/1          Up      Up      Gi2.550
FE80::8          3/1          Up      Up      Gi2.518
FE80::9          2/1          Up      Up      Gi2.518

Likewise, we check the OSPFv3 show commands to verify BFD is enabled.
R10#show ospfv3 neighbor 181.0.0.1 | include BFD
 In the area 0 via interface GigabitEthernet2.518, BFD enabled
R10#show ospfv3 interface gig2.518 | include BFD
 Transmit Delay is 1 sec, State DR, Priority 5, BFD enabled
R10#show ospfv3 181 | include BFD
 BFD is enabled

The output below shows that the echo feature is enabled using the timers we configured. This is an ideal configuration because the echoes are what make BFD lightweight on any router. From CSR9, we will look at CSR1 within the IPv4 AF. We see that OSPF is registered to BFD.
R9#show bfd neighbors ipv4 181.18.90.1 details | include ^Sess|^Regist
Session state is UP and using echo function with 900 ms interval.
Session Host: Software
Registered protocols: OSPF CEF

Interestingly, BFD echoes do not appear to be supported for IPv6 addresses. We will further reinforce this theory when we test BGP using non-LL addressing.
R8#show bfd neighbors ipv6 fe80::10 details | include echo
Session state is UP and not using echo function.

The BFD control packets are not lightweight like echoes and cannot be offloaded to hardware on all platforms. Below is a control packet on UDP port 3784 (0xEC8) and an echo on UDP port 3785 (0xEC9). The source UDP port for all BFD packets appears to be 49152 (0xC000). The control ports are shown in cyan and the echo ports are shown in pink. The echo always has the source and destination IP set to the sender's address; this is done so the receiver can immediately loop the packet back out and not process it locally. The IP addressing is highlighted in green in both packets below. Notice the destination MAC of the echo is still CSR9's MAC, which allows the frame to be delivered at layer 2, while the source is always CSR1’s MAC (yellow/grey). The control packet is actually destined to CSR9 like a regular packet would be and is punted to the CPU (RSP/RP) for processing.
R9#show interfaces gigabitEthernet 2 | include bia
 Hardware is CSR vNIC, address is 0050.56a9.d672 (bia 0050.56a9.d672)
R9#show ip arp 181.18.90.1
Protocol  Address          Age (min)  Hardware Addr   Type   Interface
Internet  181.18.90.1             38  0050.56a9.1aaa  ARPA   GigabitEth2.518
R9#monitor capture CAP match ipv4 protocol udp any any range 3784 3785
R9#show monitor capture CAP buffer detailed
0 70 0.000000 181.18.90.1 -> 181.18.90.9 UDP
0000: 005056A9 D6720050 56A91AAA 81000DBE .PV..r.PV.......
0010: 080045C0 00340990 0000FF11 9339B512 ..E..4.......9..
0020: 5A01B512 5A09C000 0EC80020 8E8E20C0 Z...Z...... .. .
0030: 03180000 10010000 1002000F 4240000F ............B@..
6 58 0.173987 181.18.90.1 -> 181.18.90.1 UDP
0000: 005056A9 D6720050 56A91AAA 81000DBE .PV..r.PV.......
0010: 080045C0 00280991 0000FF11 934CB512 ..E..(.......L..
0020: 5A01B512 5A01C000 0EC90014 3BDC0000 Z...Z.......;...
0030: 00000000 10010001 C6F6 ..................
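If you want to double-check the hex by hand, the short Python sketch below decodes the 58 byte echo frame from the capture above. It is only a reading aid: the offsets assume a single 802.1Q tag followed by an IPv4 header, which is what this particular capture contains, and the variable names are mine.

```python
import ipaddress
import struct

# The echo frame copied from the capture above (58 bytes).
ECHO_FRAME_HEX = (
    "005056A9 D6720050 56A91AAA 81000DBE "
    "080045C0 00280991 0000FF11 934CB512 "
    "5A01B512 5A01C000 0EC90014 3BDC0000 "
    "00000000 10010001 C6F6"
)
frame = bytes.fromhex(ECHO_FRAME_HEX.replace(" ", ""))


def mac(raw):
    """Format six raw bytes as a colon-separated MAC address."""
    return ":".join(f"{octet:02x}" for octet in raw)


dst_mac, src_mac = mac(frame[0:6]), mac(frame[6:12])
# 802.1Q tag (TPID 0x8100 + TCI), then the real EtherType.
tpid, tci, ethertype = struct.unpack("!HHH", frame[12:18])

ip_header = frame[18:]
ihl_bytes = (ip_header[0] & 0x0F) * 4          # IPv4 header length in bytes
src_ip = ipaddress.IPv4Address(ip_header[12:16])
dst_ip = ipaddress.IPv4Address(ip_header[16:20])

# UDP header follows the IPv4 header: source port, destination port, length, checksum.
src_port, dst_port, udp_len, _ = struct.unpack("!HHHH", ip_header[ihl_bytes:ihl_bytes + 8])

print("L2 :", src_mac, "->", dst_mac)
print("L3 :", src_ip, "->", dst_ip)
print("UDP:", src_port, "->", dst_port)
```

The source and destination IP decode to the same address while the MAC addresses differ, which is exactly the echo behavior described above.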

We can confirm that we are receiving both echoes and control packets by counting the matches. Note: uRPF and ICMP redirects must be disabled for echoes to work, for obvious reasons. The router is smart enough to not send ICMP redirects even if you don’t explicitly disable them, but uRPF will break this feature.
R9#show monitor capture CAP buffer detailed | count 0EC8
Number of lines which match regexp = 64
R9#show monitor capture CAP buffer detailed | count 0EC9
Number of lines which match regexp = 140

We will simulate a failure on the LAN by configuring an inbound ACL on CSR8 to drop BFD echoes. We will debug BFD events on CSR8 to monitor the change. CSR8 shows the echo failure, notifies all of its clients, and then the OSPF neighbors fail. CSR8 has neighbors to CSR9/CSR10, the DR/BDR, so BFD tracks those.
! CSR8
ip access-list extended ACL_DENY_BFD_ECHO
 deny udp any any eq 3785
 permit ip any any
interface GigabitEthernet2.518
 ip access-group ACL_DENY_BFD_ECHO in
R8#debug bfd event
BFD-DEBUG Event: V1 FSM ld:4098 handle:2 event:ECHO FAILURE state:UP (0)
BFD-DEBUG EVENT: bfd_session_destroyed, proc:OSPF, handle:2 act
BFD-DEBUG Event: notify client(CEF) IP:181.18.90.9, ld:4098, handle:2, event:DOWN, cp independent failure (0)
BFD-DEBUG Event: notify client(OSPF) IP:181.18.90.9, ld:4098, handle:2, event:DOWN, cp independent failure (0)
%OSPF-5-ADJCHG: Process 181, Nbr 181.0.0.9 on GigabitEthernet2.518 from FULL to DOWN, Neighbor Down: BFD node down
BFD-DEBUG Event: V1 FSM ld:4100 handle:4 event:RX DOWN state:UP (0)
BFD-DEBUG EVENT: bfd_session_destroyed, proc:OSPF, handle:4 act
BFD-DEBUG Event: notify client(CEF) IP:181.18.90.10, ld:4100, handle:4, event:DOWN, cp independent failure (0)
BFD-DEBUG Event: notify client(OSPF) IP:181.18.90.10, ld:4100, handle:4, event:DOWN, cp independent failure (0)
%OSPF-5-ADJCHG: Process 181, Nbr 181.0.0.10 on GigabitEthernet2.518 from FULL to DOWN, Neighbor Down: BFD node down

Removing the ACL, we see the sessions come back up. CSR8 can successfully receive BFD packets, and then initializes the BFD neighbor relationship. OSPF forms its neighbor first as it needs to tell BFD that it requests fast detection. When BFD comes up, it notifies OSPF that it can begin using BFD state.
! CSR8
%OSPF-5-ADJCHG: Process 181, Nbr 181.0.0.9 on GigabitEthernet2.518 from LOADING to FULL, Loading Done
BFD-DEBUG EVENT: bfd_session_created, proc:OSPF, idb:GigabitEthernet2.518 handle:2 act
BFD-DEBUG Event: V1 FSM ld:4118 handle:4 event:RX INIT state:DOWN (0)
BFD-DEBUG Event: notify client(CEF) IP:181.18.90.10, ld:4118, handle:4, event:UP, (0)
BFD-DEBUG Event: notify client(OSPF) IP:181.18.90.10, ld:4118, handle:4, event:UP, (0)
BFD-DEBUG Event: V1 FSM ld:4118 handle:4 event:RX UP state:UP (0)
%OSPF-5-ADJCHG: Process 181, Nbr 181.0.0.10 on GigabitEthernet2.518 from LOADING to FULL, Loading Done
BFD-DEBUG EVENT: bfd_session_created, proc:OSPF, idb:GigabitEthernet2.518 handle:4 act
BFD-DEBUG Event: V1 FSM ld:4117 handle:2 event:RX INIT state:DOWN (0)
BFD-DEBUG Event: notify client(CEF) IP:181.18.90.9, ld:4117, handle:2, event:UP, (0)
BFD-DEBUG Event: notify client(OSPF) IP:181.18.90.9, ld:4117, handle:2, event:UP, (0)
BFD-DEBUG Event: V1 FSM ld:4117 handle:2 event:RX UP state:UP (0)

A quick repeat of the same test, except filtering only the control traffic, yields similar results. BFD still does the exact same thing, except the failure event is based on not being able to detect a BFD peer at all. Losing echoes was a different failure event, but the result is identical. In most designs, the echo timer is much faster than the control timer, so a loss of control packets would take longer to discover. Upon failure, BFD notifies its clients, who immediately tear down their neighbor sessions.
! CSR8
ip access-list extended ACL_DENY_BFD_CONTROL
 deny udp any any eq 3784
 permit ip any any
interface GigabitEthernet2.518
 ip access-group ACL_DENY_BFD_CONTROL in
! CSR8
BFD-DEBUG Event: V1 FSM ld:4118 handle:4 event:DETECT TIMER EXPIRED state:UP (0)
BFD-DEBUG EVENT: bfd_session_destroyed, proc:OSPF, handle:4 act
BFD-DEBUG Event: notify client(CEF) IP:181.18.90.10, ld:4118, handle:4, event:DOWN, (0)
BFD-DEBUG Event: notify client(OSPF) IP:181.18.90.10, ld:4118, handle:4, event:DOWN, (0)
%OSPF-5-ADJCHG: Process 181, Nbr 181.0.0.10 on GigabitEthernet2.518 from FULL to DOWN, Neighbor Down: BFD node down
BFD-DEBUG Event: V1 FSM ld:4117 handle:2 event:DETECT TIMER EXPIRED state:UP (0)
BFD-DEBUG EVENT: bfd_session_destroyed, proc:OSPF, handle:2 act
BFD-DEBUG Event: notify client(CEF) IP:181.18.90.9, ld:4117, handle:2, event:DOWN, (0)
BFD-DEBUG Event: notify client(OSPF) IP:181.18.90.9, ld:4117, handle:2, event:DOWN, (0)
%OSPF-5-ADJCHG: Process 181, Nbr 181.0.0.9 on GigabitEthernet2.518 from FULL to DOWN, Neighbor Down: BFD node down

Upon restoration, the process is reversed, and when OSPF forms neighbors, it notifies BFD. When BFD completes its neighbor exchange, it notifies OSPF that it is now operational. This process is identical to the previous example in terms of debug and is not shown again. We quickly verify it using show commands.
R8#show bfd neighbors client ospf
IPv4 Sessions
NeighAddr        LD/RD        RH/RS   State   Int
181.18.90.9      4117/4109    Up      Up      Gi2.518
181.18.90.10     4118/4115    Up      Up      Gi2.518

Next, we look at BFD with IS-IS. The echo functionality was not explicitly enabled on the IS-IS LAN using the BFD-template construct. BFD makes no assumptions about your configuration when using templates; in this case, echoes are not used, so all of the BFD packets are control packets. If you use the "interval" command in a bfd-template without explicitly enabling "echo", only BFD control packets are affected by the timer specification. BFD still operates and is effective, but the lightweight behavior is reduced. Some high-end platforms can offload the entire BFD process to linecards, but echo offloading works on almost all routers since it is just based on routing. Despite having a DIS, IS-IS still forms neighbors with all nodes on the segment, or at least that is what IS-IS tells BFD to do. I also enable basic authentication, which is self-explanatory. BFD dampening is like BGP or event dampening; continuous flaps will cause BFD to penalize an unstable neighbor. We do not analyze BFD dampening in detail since the logic is identical to BGP dampening (discussed in the security section).
! CSR1, CSR3, CSR4, CSR10
key chain KC_BFD_AUTH
 key 1
  key-string BFD_AUTH
bfd-template single-hop BFD_AUTH_DAMP
 interval min-tx 900 min-rx 900 multiplier 3
 authentication sha-1 keychain KC_BFD_AUTH
 dampening 3 1500 3000 5
interface GigabitEthernet2.513
 bfd template BFD_AUTH_DAMP

We can verify the “full mesh” of neighbors, as directed by IS-IS, and also verify that echoes are not enabled. Echoes must be enabled bidirectionally in order for BFD to negotiate them. Also notice that since IS-IS works with both IPv4 and IPv6, we can see the IPv6 peers by checking the IS-IS clients. For brevity, I just show CSR1 and CSR3, but the neighbors are fully meshed.
R1#show bfd neighbors client isis
IPv4 Sessions
NeighAddr        LD/RD        RH/RS   State   Int
181.13.40.3      4099/4105    Up      Up      Gi2.513
181.13.40.4      4097/4105    Up      Up      Gi2.513
181.13.40.10     4101/4104    Up      Up      Gi2.513
IPv6 Sessions
NeighAddr        LD/RD        RH/RS   State   Int
FE80::3          4100/4106    Up      Up      Gi2.513
FE80::4          4098/4106    Up      Up      Gi2.513
FE80::10         4102/4105    Up      Up      Gi2.513
R3#show bfd neighbors client isis
IPv4 Sessions
NeighAddr        LD/RD        RH/RS   State   Int
181.13.40.1      4105/4099    Up      Up      Gi2.513
181.13.40.4      4097/4097    Up      Up      Gi2.513
181.13.40.10     4102/4099    Up      Up      Gi2.513
IPv6 Sessions
NeighAddr        LD/RD        RH/RS   State   Int
FE80::1          4106/4100    Up      Up      Gi2.513
FE80::4          4098/4098    Up      Up      Gi2.513
FE80::10         4104/4100    Up      Up      Gi2.513

R3#show bfd neighbors ipv4 181.13.40.10 details | include ^Sess|^Regist
Session state is UP and not using echo function.
Session Host: Software
Registered protocols: ISIS CEF
R3#show bfd neighbors ipv6 fe80::10 details | include ^Sess|^Regist
Session state is UP and not using echo function.
Session Host: Software
Registered protocols: ISIS CEF

We can verify BFD is enabled by checking the IS-IS process as well. The commands clearly show support for both IPv4 and IPv6 address families.
R10#show clns is-neighbors detail | include ^R[0-9]|BFD
R1             Gi2.513     Up     L2   5     R1.01      Phase V
  BFD enabled: (MTID:0, ipv4) (MTID:2, ipv6)
R3             Gi2.513     Up     L2   0     R1.01      Phase V
  BFD enabled: (MTID:0, ipv4) (MTID:2, ipv6)
R4             Gi2.513     Up     L2   0     R1.01      Phase V
  BFD enabled: (MTID:0, ipv4) (MTID:2, ipv6)
R10#show isis neighbors detail | include ^R[0-9]|BFD
R1             L2   Gi2.513     181.13.40.1     UP    9        R1.01
  Remote BFD Support:TLV (MTID:0, IPV4) (MTID:2, IPV6)
  BFD enabled: (MTID:0, ipv4) (MTID:2, ipv6)
R3             L2   Gi2.513     181.13.40.3     UP    24       R1.01
  Remote BFD Support:TLV (MTID:0, IPV4) (MTID:2, IPV6)
  BFD enabled: (MTID:0, ipv4) (MTID:2, ipv6)
R4             L2   Gi2.513     181.13.40.4     UP    26       R1.01
  Remote BFD Support:TLV (MTID:0, IPV4) (MTID:2, IPV6)
  BFD enabled: (MTID:0, ipv4) (MTID:2, ipv6)
R10#show clns interface gig2.513 | include BFD
  BFD enabled: (MTID:0, ipv4) (MTID:2, ipv6)

A quick look at CSR3's capture shows only packets with varying source/destination addresses. If there were many echoes, we would see packets destined to themselves at layer 3 arriving on CSR3's interface from the other 3 routers on the segment. We see no such packets, which is a good indication that BFD echoes are not enabled on this LAN segment.
R3#show monitor capture CAP buffer brief
-------------------------------------------------------------
 #   size   timestamp     source             destination   protocol
-------------------------------------------------------------
 0     98    0.000000   181.13.40.1      ->  181.13.40.10     UDP
 1     98    0.009002   181.13.40.10     ->  181.13.40.1      UDP
 2     98    0.020995   181.13.40.4      ->  181.13.40.10     UDP
 3     98    0.033995   181.13.40.1      ->  181.13.40.10     UDP
 4     98    0.041990   181.13.40.4      ->  181.13.40.10     UDP
 5     98    0.042997   181.13.40.1      ->  181.13.40.3      UDP
 6     98    0.050001   181.13.40.4      ->  181.13.40.1      UDP
 7     98    0.057996   181.13.40.10     ->  181.13.40.1      UDP
 8     98    0.069989   181.13.40.3      ->  181.13.40.10     UDP
 9     98    0.077999   181.13.40.3      ->  181.13.40.1      UDP
10     98    0.085994   181.13.40.10     ->  181.13.40.1      UDP
11     98    0.100993   181.13.40.1      ->  181.13.40.10     UDP
12     98    0.108988   181.13.40.3      ->  181.13.40.10     UDP
13     98    0.116998   181.13.40.10     ->  181.13.40.1      UDP
14     98    0.124994   181.13.40.3      ->  181.13.40.1      UDP

We can also use output filtering to help prove the point. Searching for the hexadecimal control-port in the packet details, we have many hits. Searching for the echo port yields zero hits.
R3#show monitor capture CAP buffer detail | count 0EC8
Number of lines which match regexp = 924
R3#show monitor capture CAP buffer detail | count 0EC9
Number of lines which match regexp = 0

A quick detailed look at one of the packets clearly shows it is a control packet. This particular packet was sent from CSR1 to CSR3.
R3#show interfaces gigabitEthernet 2 | include bia
 Hardware is CSR vNIC, address is 0050.56a9.8ccf (bia 0050.56a9.8ccf)
R3#monitor capture CAP match ipv4 protocol udp any any range 3784 3785
R3#show monitor capture CAP buffer detail
79 98 0.900955 181.13.40.1 -> 181.13.40.3 UDP
0000: 005056A9 8CCF0050 56A91AAA 81000DB9 .PV....PV.......
0010: 080045C0 00501F7A 0000FF11 E143B50D ..E..P.z.....C..
0020: 2801B50D 2803C000 0EC8003C 2A4520C4 (...(......

95 will work well given the current EIGRP metric configuration.
! CSR1
track 25 ip route 153.2.2.2 255.255.255.255 metric threshold
 threshold metric up 55 down 95

The object is currently down since we are routing via XRv13. The router shows you the RIB and scaled metrics also. This is a good way to double-check your math. Recall that the scaled metric is the RIB metric divided by the scaling factor (2560 for EIGRP by default).
R1#show track 25
Track 25
  IP route 153.2.2.2 255.255.255.255 metric threshold
  Metric threshold is Down (EIGRP/259072/101)
    2 changes, last change 00:00:17
  Metric threshold down 95 up 55
  First-hop interface is GigabitEthernet2.513

If we bring up the interface to CSR2, the object comes up. When EIGRP converges, the cost to CSR2 is much lower, resulting in a lower scaled metric of 51.
R1#show track 25
Track 25
  IP route 153.2.2.2 255.255.255.255 metric threshold
  Metric threshold is Up (EIGRP/130816/51)
    3 changes, last change 00:00:00
  Metric threshold down 95 up 55
  First-hop interface is GigabitEthernet2.512
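The math behind both outputs can be reproduced with a small Python sketch. Treat it as a model under assumptions rather than the router's actual algorithm: I assume the scaled metric is truncated to an integer, that the object goes up at or below the "up" threshold and down at or above the "down" threshold, and that it keeps its previous state in between. The function names are invented for the example.

```python
EIGRP_RESOLUTION = 2560  # default scaling factor for EIGRP mentioned above


def scaled_metric(rib_metric, resolution=EIGRP_RESOLUTION):
    """Scale a RIB metric down to the range used by threshold tracking."""
    return rib_metric // resolution  # assumption: result is truncated


def next_state(current_state, scaled, up=55, down=95):
    """Apply the up/down thresholds; values in between keep the old state."""
    if scaled <= up:
        return "Up"
    if scaled >= down:
        return "Down"
    return current_state


# Path via XRv13 (RIB metric 259072), then via CSR2 (RIB metric 130816).
state = "Up"
for rib_metric in (259072, 130816):
    scaled = scaled_metric(rib_metric)
    state = next_state(state, scaled)
    print(f"EIGRP/{rib_metric}/{scaled} -> {state}")
```

The gap between 55 and 95 acts as hysteresis, so small metric changes between the two thresholds do not flap the object.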

If desired, you can change the scaled metric defaults. Different protocols have different metric ranges. EIGRP and BGP can have very large metrics/MEDs, while OSPF and IS-IS tend to use smaller numbers. RIP, for example, is never scaled, and its metrics are translated directly since they only count to 15. Notice that the minimum and maximum limits in the parser are determined by the protocol types.
R1(config)#track resolution ip route ospf ?
  Resolution value
R1(config)#track resolution ip route eigrp ?
  Resolution value

The “stub-object” is an interesting feature. There are no fancy options involved; you simply create the object, assign a default-state (if desired), and it’s always up or always down. The exception to this, and one of its best use cases, is to have it be controlled by EEM. EEM is discussed in detail in a dedicated section, but a quick example is shown below. A stub-object [ID 30] is created with default state “down” (which is the default, so the configuration does not display it). A manual EEM script can be run which changes it to “up”. You could also use it in conjunction with other tracked objects within track lists.
! CSR1
track 30 stub-object
event manager applet SET_30_UP
 event none
 action 100 track set 30 state up
R1#show track 30 brief
Track Type        Instance  Parameter  State  Last Change
30    stub-object           Undefined  Down   00:01:18

To demonstrate using EEM to adjust the value, we write a small script and manually execute it from the CLI. Assuming some other process relied on this object, we can use EEM to influence network behavior.
! CSR1
event manager applet SET_30_UP
 event none
 action 100 track set 30 state up
R1#event manager run SET_30_UP
R1#show track 30 brief
Track Type        Instance  Parameter  State  Last Change
30    stub-object           EEM        Up     00:00:02

The track feature also has timers regarding how often it re-evaluates its tracked objects. There are two sets of timers: a global timer which determines how often the router reassesses the state of an object, and a per-object timer which determines how long to delay going up or going down. We will re-use route tracking object 25 from earlier for this demonstration.

The default timers are shown below. Notice that some fields say “expired”; this is because there are zero objects of that specified type. We have not configured any IPv6 route objects, which explains this output. The poll interval is measured in seconds and represents the global timer for that object type. The right-most column just counts down to zero, then restarts. Notice that the stub-object, by default, has a fast timer. The route objects have slower timers to allow protocols to converge without flapping object states constantly. If you are using scaled metric threshold tracking in your environment, the poll interval for IPv4/IPv6 routes should be slightly longer than your IGP convergence time.
R1#show track timers
Tracking Timers
Object type    Poll Interval   Time to next poll
interface                  1       0.887
ip route                  15      12.607
ip sla                     5       0.292
ipv6 route                15     expired
list                       1       0.67
stub-object                1       0.230

We will enable timestamps on CSR1 and modify the timer to be 30 seconds. You can also set millisecond granularity if desired.
! CSR1
service timestamps debug datetime
service timestamps log datetime
track timer ip route 30
R1#show track timers | include ip_route
ip route                  30      12.718

The network is fully operational, which means the tracked object is up. When the timer restarts at 30 seconds, we will shut down the link to CSR2 on CSR1 immediately thereafter. We expect the tracked object to stay up for about 26-28 seconds or so, depending on how fast I am. The sequence of events is shown below with timestamps and key actions highlighted. The EIGRP neighbor fails immediately as expected, but the tracked object remains in the up state for some time. Notice that the object took 28 seconds to go down, which is what we expected.
R1#show track timers | include ip_route
ip route                  30      29.887
! CSR1
interface GigabitEthernet2.512
 shutdown
23:17:28: %DUAL-5-NBRCHANGE: EIGRP-IPv4 1: Neighbor 153.1.2.2 (GigabitEthernet2.512) is down: interface down
R1#show track timers | include ip_route
ip route                  30       0.608
23:17:56: %TRACK-6-STATE: 25 ip route 153.2.2.2/32 metric threshold Up -> Down

We can also tell the object itself to delay going up or down on a per-object basis. We configure this object to delay for 10 additional seconds when coming up. We will repeat the same test, except we will bring the interface back up, and expect 35-38 seconds of delay (~30 seconds from the track timer plus ~10 seconds from the up-delay, less the few seconds it takes me to act after the timer restarts). First, we configure the up-delay and verify that the 10 second delay was properly applied to the object.
! CSR1
track 25 ip route 153.2.2.2 255.255.255.255 metric threshold
 threshold metric up 55 down 95
 delay up 10
R1#show track 25
Track 25
  IP route 153.2.2.2 255.255.255.255 metric threshold
  Metric threshold is Up (EIGRP/130816/51)
    9 changes, last change 00:13:52
  Metric threshold down 95 up 55
  Delay up 10 secs
  First-hop interface is GigabitEthernet2.512

We follow the same test procedure as above except “no shutdown” the link between CSR1 and CSR2. In this case, it was about 36 seconds, since it took EIGRP a few seconds to discover CSR2 and converge. Notice that when the countdown timer hits 0.593 seconds, we still have to wait another 10 seconds because of the per-object delay-up time. In the last example, as soon as the global timer expired, the router re-evaluated the object and changed its state. This time, the router still performed the re-evaluation and directed the object to change state, but the object delayed the “up” action.

R1#show track timers | include ip_route
ip route 30 29.626

! CSR1
interface GigabitEthernet2.512
 no shutdown

23:21:00: %DUAL-5-NBRCHANGE: EIGRP-IPv4 1: Neighbor 153.1.2.2 (GigabitEthernet2.512) is up: new adjacency

! About 30 seconds later
R1#show track timers | include ip_route
ip route 30 0.593

! Another 10 seconds later
R1#show track timers | include ip_route
ip route 30 21.441

23:21:36: %TRACK-6-STATE: 25 ip route 153.2.2.2/32 metric threshold Down -> Up
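The timing seen in both tests can be approximated with simple arithmetic: the state change lands when the global poll timer next expires, plus any per-object delay. The following is a minimal sketch of that simplified model only; the function name and its arguments are illustrative and are not IOS terminology, and the 28 and 26 second inputs are assumed values chosen to resemble the two tests.

```python
def expected_state_change_delay(seconds_until_poll: float,
                                object_delay: float = 0.0) -> float:
    """Approximate seconds between a routing change and the TRACK log message.

    Simplified model (an assumption, not a documented IOS algorithm):
    the object is re-evaluated when the global track timer expires, and
    the object then waits out its own up/down delay before changing state.
    """
    if seconds_until_poll < 0 or object_delay < 0:
        raise ValueError("times must be non-negative")
    return seconds_until_poll + object_delay


# First test: link shut with about 28 seconds left on the timer, no object delay.
print(expected_state_change_delay(28.0))        # 28.0
# Second test: assume about 26 seconds left on the timer, plus "delay up 10".
print(expected_state_change_delay(26.0, 10.0))  # 36.0
```

The model ignores routing protocol convergence time, which is why the measured values drift by a few seconds.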

47.9 IPv6 SLA

A number of probes are supported for IPv6, but not all. Cisco documentation specifies that only ICMP-echo, UDP-echo, UDP-jitter, and TCP-connect are supported, but in reality, path-echo and path-jitter are also supported (and possibly more). IPv6 SLA operation is identical to that of its IPv4 counterpart. I’ve configured OSPFv3 using IPv6 AF only (XR doesn’t support OSPFv3 AF IPv4) for this demonstration. The configuration is basic and is not shown. The probe numbers used earlier will be repeated but with 60 added. For example, IP SLA ID 3 from the IPv4 testing will be ID 63 here.

The first probe to examine is the ICMP-echo probe [ID 61]. The probe has been discussed in detail already so the basic operation is not revisited. Two new options exist under the probe: traffic-class and flow-label. Traffic-class is functionally equivalent to IP TOS and is still an unsigned 8-bit integer. We will use DSCP AF21 for this probe; DSCP AF21 in decimal is 18. Bit-wise left-shifting twice results in a traffic-class value of 72 or 0x48.

R1(config-ip-sla-echo)#traffic-class ?
  Traffic Class Value
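The DSCP-to-traffic-class arithmetic above can be sketched in a few lines of Python. This is only an illustration of the bit shifting; the helper names are made up for this example, and the AF formula (8 times the class plus 2 times the drop precedence) is the standard assured-forwarding encoding rather than anything specific to IP SLA.

```python
def af_to_dscp(af_class: int, drop_precedence: int) -> int:
    """Return the decimal DSCP for AFxy, e.g. AF21 -> class 2, drop 1."""
    if not 1 <= af_class <= 4 or not 1 <= drop_precedence <= 3:
        raise ValueError("AF class is 1-4 and drop precedence is 1-3")
    return 8 * af_class + 2 * drop_precedence


def dscp_to_traffic_class(dscp: int, ecn: int = 0) -> int:
    """Place the 6-bit DSCP in the upper bits of the 8-bit traffic class."""
    if not 0 <= dscp <= 63 or not 0 <= ecn <= 3:
        raise ValueError("DSCP is 0-63 and ECN is 0-3")
    return (dscp << 2) | ecn


dscp = af_to_dscp(2, 1)
traffic_class = dscp_to_traffic_class(dscp)
print(dscp, traffic_class, hex(traffic_class))  # 18 72 0x48
```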

The flow “label” has nothing to do with MPLS, despite being 20 bits (the exact same size as the label field in the MPLS shim header). This is a component of the IPv6 header with a few key uses. First, it allows routers to avoid having to look deeper into the packet (could be encrypted, etc) to identify the flow. Loosely analogous to the GRE tunnel key with “entropy” enabled, it could be unique per flow to allow for better load-sharing in the core. Packets in the same flow should have the same flow label and should be forwarded along the same path. Intermediate network devices will generally keep packets with the same flow-label in the same path if they have multiple paths to a destination (ECMP, UCMP, etc). In this case, we set a value of 19, but it doesn’t do anything significant in the network. A value of zero means the packet has no flow specifications. R1(config-ip-sla-echo)#flow-label ? Flow Label Value
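Both fields sit in the first 32 bits of the IPv6 header: a 4-bit version, the 8-bit traffic class, and the 20-bit flow label. The sketch below packs and unpacks that word using the values from this probe (traffic class 72, flow label 19). The function names are hypothetical and the code only illustrates the header layout; it is not how IOS builds the packet.

```python
import struct
from typing import Tuple


def build_first_word(traffic_class: int, flow_label: int) -> bytes:
    """Pack version 6, traffic class, and flow label into 4 network-order bytes."""
    if not 0 <= traffic_class <= 0xFF:
        raise ValueError("traffic class is 8 bits")
    if not 0 <= flow_label <= 0xFFFFF:
        raise ValueError("flow label is 20 bits")
    word = (6 << 28) | (traffic_class << 20) | flow_label
    return struct.pack("!I", word)


def parse_first_word(data: bytes) -> Tuple[int, int, int]:
    """Return (version, traffic_class, flow_label) from the first 4 header bytes."""
    (word,) = struct.unpack("!I", data[:4])
    return word >> 28, (word >> 20) & 0xFF, word & 0xFFFFF


raw = build_first_word(72, 19)
print(raw.hex())              # 64800013
print(parse_first_word(raw))  # (6, 72, 19)
```

Flow label 19 is 0x13 in hexadecimal, which is how a debug would typically print it.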

All other probe configurations (timers, history, scheduling, object tracking) are the same. The probe configuration is shown below for completeness. ! CSR1 ip sla 61 icmp-echo ::153:2:2:2 source-ip ::153:1:1:1 traffic-class 72 flow-label 19 frequency 5 ip sla schedule 61 life forever start-time now


We see the probe succeeds without issue. We can also see that our QoS and flow-label settings were properly applied as CSR2’s debug reveals those values. The traffic class is shown as a decimal number from 0 to 255 while the flow label is shown in hexadecimal. R1#show ip sla statistics 61 | include Number Number of successes: 4 Number of failures: 0 R2#debug ipv6 icmp IPV6: source ::153:1:1:1 (GigabitEthernet2.512) dest ::153:2:2:2 (Loopback0) traffic class 72, flow 0x13, len 76+18, prot 58, hops 64,forward to ulp

Next, we test the TCP connect probe [ID 62]. Again, the threshold, timeout, and frequency settings work identically for IPv6 as they do for IPv4. ! CSR1 ip sla 62 tcp-connect ::153:2:2:2 55555 source-ip ::153:1:1:1 threshold 10000 timeout 10000 frequency 10 ip sla schedule 62 life forever start-time now

We also enable debugging on CSR2, which is a dynamic responder, to ensure the IPv6 TCP operation is successful. Note that “ip sla responder” applies to IPv6 probes without any additional configuration. On an XE responder, the operation 0 must be specified for the debug. Notice that the responder performs the same actions it did for IPv4 and also tracks the IPv6 addressing in its “recent source” list. CSR1 reports no failures.

R1#show ip sla statistics 62 | include Number
Number of successes: 5
Number of failures: 0

R2#debug ip sla trace 0
IPSLA-RESP_TRACE: Receive Control msg: len =68
IPSLA-RESP_TRACE: Ctrl-Msg Ver: 1 ID: 168 Len: 68
IPSLA-RESP_TRACE: Ctrl-Msg: command: SLA_CMD_TCPV6_CONN_ENABLE, ip: ::153:2:2:2, port: 55555, duration: 10000
[snip]
IPSLA-RESP_TRACE: %TRACE: TCP socket accepted
IPSLA-RESP_TRACE: Cleaning up port (::153:2:2:2,55555)

R2#show ip sla responder
General IP SLA Responder on Control port 1967
General IP SLA Responder on Control V2 port 1167
General IP SLA Responder is: Enabled
Number of control message received: 2 Number of errors: 0
Recent sources:
 ::153:1:1:1 [13:53:16.508 UTC DAY MON 11 2015]
 ::153:1:1:1 [13:53:06.508 UTC DAY MON 11 2015]

We also test using a static responder. We configure a probe on CSR2 to send UDP-echoes to CSR1 without any control messaging [ID 63]. This means we need a static responder configured on CSR1 for the target IPv6 address and port pair. The command syntax is the same as it is for IPv4; this is somewhat odd since there is no “ipv6 sla” tree, yet the IPv6 functionality is nested within the “ip sla” tree. ! CSR1 ip sla responder udp-echo ipaddress ::153:1:1:1 port 55556 R1#show ip sla responder | section udp udpEcho Responder: IP Address Port 153.1.1.1 55556 153.1.1.1 55557 ::153:1:1:1 55556

Next, we configure the probe on CSR2. As expected, no failures are observed since CSR1 is responding properly to these probes. ! CSR2 ip sla 63 udp-echo ::153:1:1:1 55556 control disable frequency 5 ip sla schedule 63 life forever start-time now R2#show ip sla statistics 63 | include Number Number of successes: 3 Number of failures: 0

Recall that UDP-jitter was a complicated operation because the information it is capable of resolving is dependent upon a few things. First, NTP synchronization determines whether the probe resolves one-way latency or not. Second, a dynamic responder is required for jitter/MOS resolution. We quickly verify these behaviors with IPv6 SLA. We create a UDP-jitter probe on CSR2 with control messages disabled to probe CSR1 [ID 64]. With NTP configured, we expect to resolve one-way latency only. For consistency, we will use the same parameters as IP SLA ID 4, to include the port number, which means we need to quickly add another static responder entry to CSR1 (not shown).

! CSR2
ip sla 64
 udp-jitter ::153:1:1:1 55557 control disable num-packets 200 interval 25
 request-data-size 72
 frequency 10
ip sla schedule 64 life forever start-time now

R2#show ntp status | include ^sync
Clock is synchronized, stratum 9, reference is 153.1.1.1

R2#show ip sla statistics 64 | section Latency|Jitter|Number
Number Of RTT: 200
RTT Min/Avg/Max: 1/1/2 milliseconds
Latency one-way time:
 Number of Latency one-way Samples: 1
 Source to Destination Latency one way Min/Avg/Max: 9/9/9 milliseconds
 Destination to Source Latency one way Min/Avg/Max: 3/3/3 milliseconds
Jitter Time:
 Number of SD Jitter Samples: 0
 Number of DS Jitter Samples: 0
 Source to Destination Jitter Min/Avg/Max: 0/0/0 milliseconds
 Destination to Source Jitter Min/Avg/Max: 0/0/0 milliseconds
Number Of RTT Over Threshold: 0 (0%)
Source to Destination Loss Periods Number: 0
Destination to Source Loss Periods Number: 0
Number of successes: 7
Number of failures: 0

Trying a similar probe in the opposite direction from CSR1 to CSR2 [ID 65], we get statistics for jitter and latency. This is because NTP is synchronized and CSR2 is already a dynamic IP SLA responder. ! CSR1 ip sla 65 udp-jitter ::153:2:2:2 55558 num-packets 200 interval 25 request-data-size 72 frequency 10

Note that if the latency is very low, the router may just display all zeroes, even for the number of samples. To increase latency a little, I shut down the direct link between CSR1 and CSR2. The reason we know that this was not occurring for IP SLA ID 64 is because the number of jitter samples was still zero.

R1#show ip sla statistics 65 | section Jitter|Latency|Number
Number Of RTT: 200
RTT Min/Avg/Max: 3/4/66 milliseconds
Latency one-way time:
 Number of Latency one-way Samples: 16
 Source to Destination Latency one way Min/Avg/Max: 4/11/65 millisecon
 Destination to Source Latency one way Min/Avg/Max: 0/0/2 milliseconds
Jitter Time:
 Number of SD Jitter Samples: 199
 Number of DS Jitter Samples: 199
 Source to Destination Jitter Min/Avg/Max: 0/2/61 milliseconds
 Destination to Source Jitter Min/Avg/Max: 0/1/3 milliseconds
Number Of RTT Over Threshold: 0 (0%)
Source to Destination Loss Periods Number: 0
Destination to Source Loss Periods Number: 0
Number of successes: 33
Number of failures: 0
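The two UDP-jitter results above follow a simple decision rule, sketched below in Python. This only encodes the behavior observed in this lab (NTP synchronization enables one-way latency; a dynamic responder enables jitter); the function name and the metric labels are invented for illustration.

```python
def udp_jitter_metrics(ntp_synchronized: bool, dynamic_responder: bool) -> set:
    """Return the set of metrics a UDP-jitter probe is expected to resolve.

    Encodes the lab observations only: RTT is always available, one-way
    latency needs synchronized clocks, and jitter needs a dynamic responder.
    """
    metrics = {"rtt"}
    if ntp_synchronized:
        metrics.add("one-way latency")
    if dynamic_responder:
        metrics.add("jitter")
    return metrics


# ID 64: NTP synchronized, static responder (control disabled).
print(sorted(udp_jitter_metrics(True, False)))  # ['one-way latency', 'rtt']
# ID 65: NTP synchronized, dynamic responder.
print(sorted(udp_jitter_metrics(True, True)))   # ['jitter', 'one-way latency', 'rtt']
```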

Advanced ICMP probes, such as path-echo and path-jitter, also appear supported. The path-echo probe configured on CSR1 to test IPv6 [ID 70] mirrors the configuration of its IPv4 counterpart. ! CSR1 ip sla 70 path-echo ::153:2:2:2 threshold 50 timeout 100 frequency 10 hops-of-statistics-kept 3 samples-of-history-kept 10 ip sla schedule 70 life forever start-time now

Remember that in order to see valuable data from this probe, we need to look at the aggregated statistics. This particular probe discovers each of the hops in each of the paths (2 dimensional array, or matrix of values) and displays them within the aggregated statistics. R1#show ip sla statistics aggregated 70 IPSLAs aggregated statistics IPSLA operation id: 70 Start Time Index: 14:25:41.428 UTC [snip] Path Index: 1 Hop in Path Index: 1 Type of operation: path-echo Number of successes: 2 Number of failures: 0 Target Address ::153:13:13:13 Start Time Index: 14:25:41.428 UTC [snip] Path Index: 1 Hop in Path Index: 2 Type of operation: path-echo Number of successes: 2 Number of failures: 0 Target Address ::153:14:14:14 Start Time Index: 14:25:41.428 UTC [snip] Path Index: 1 Hop in Path Index: 3 Type of operation: path-echo Number of successes: 2 Number of failures: 0 Target Address ::153:2:2:2


Last, we examine path-jitter for IPv6. Recall that this probe discovers all hops as returned by “traceroute”, which runs first. Then, individual prober operations are spawned to measure metrics for each hop. As such, we must have reachability to the transit links (unless IPv6 LL addressing is used), so OSPFv3 prefix-suppression should not be enabled. In this case, the loopback on each device is the only routable IPv6 address, and as such, is used as the source address for all ICMP unreachables returned to CSR1. If the transit links had any other addressing, the probing device would need reachability to whatever “traceroute” returns. The path-jitter probe configured on CSR1 [ID 71] will trace the path to CSR2, and the direct link to CSR2 is still down. We can see the output is very similar to IPv4.

! CSR1
ip sla 71
 path-jitter ::153:2:2:2 source-ip ::153:1:1:1
 frequency 5
ip sla schedule 71 life forever start-time now

R1#show ip sla statistics 71
IPSLAs Latest Operation Statistics
IPSLA operation id: 71
Latest RTT: 4 milliseconds
Latest operation start time: 14:21:58 UTC [snip]
Latest operation return code: OK
---- Path Jitter Statistics ----
Hop IP ::153:13:13:13:
Round Trip Time milliseconds:
 Latest RTT: 1 ms
 Number of RTT: 10
 RTT Min/Avg/Max: 1/1/2 ms
Jitter time milliseconds:
 Number of jitter: 5
 Jitter Min/Avg/Max: 1/1/1 ms
Packet Values:
 Packet Loss (Timeouts): 0
 Out of Sequence: 0
 Discarded Samples: 0
Hop IP ::153:14:14:14:
Round Trip Time milliseconds:
 Latest RTT: 2 ms
 Number of RTT: 10
 RTT Min/Avg/Max: 2/2/3 ms
[snip]
Hop IP ::153:2:2:2:
Round Trip Time milliseconds:
 Latest RTT: 4 ms
 Number of RTT: 10
 RTT Min/Avg/Max: 3/4/7 ms
[snip]

It is likely that the DNS, HTTP, and FTP probes work for IPv6 also, but they are not tested here for brevity. The MPLS ones currently only work for IPv4 FECs. DHCP does not appear to support IPv6, which makes sense since the DHCPv4 and DHCPv6 operations are fundamentally different (address vs. prefix).

47.10 IOS-XR IP SLA and EOT

IOS XR IP SLA is not very different from IOS and XE. A few quick notes:
1. TCP-connect and fancy operations (HTTP, DNS, DHCP, etc) are not supported.
2. ICMP operations supported include echo, path-echo, and path-jitter. No ICMP-jitter.
3. UDP operations supported include echo and jitter. VOIP-style codec probes are not supported.
4. MPLS operations supported include ping and traceroute for both IPv4 and TE targets.
5. In general, XR is less capable than XE with regards to IP SLA.

Because TCP operations are not supported, static IP SLA responders only give the option for UDP, and only IPv4 responders appear to be supported. We will quickly iterate over the list of probes already discussed. XRv4 is configured as a dynamic IP SLA responder. Each re-tested probe will have an ID of 300 plus whatever number the original probe was on CSR1 or CSR2.

! XRv4
ipsla
 responder

The first test is the basic ICMP-echo probe. We will configure it from XRv3 to XRv4 [ID 301]. ! XRv3 ipsla operation 301 type icmp echo source address 153.13.13.13 destination address 153.14.14.14 frequency 5 schedule operation 301 start-time now life forever
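The XR probe stanzas in this section all share one shape: an operation block followed by a schedule block. As a rough illustration, the sketch below renders the ICMP-echo stanza for ID 301 from its parameters. The function name is hypothetical, and the one-space-per-level indentation is an assumption about how the flattened configuration above is nested.

```python
def render_xr_icmp_echo(op_id: int, source: str, destination: str,
                        frequency: int) -> str:
    """Render an IOS XR ICMP-echo probe stanza as text.

    The keywords are taken from the configuration shown above; the
    nesting depth of each line is assumed for readability.
    """
    lines = [
        "ipsla",
        f" operation {op_id}",
        "  type icmp echo",
        f"   source address {source}",
        f"   destination address {destination}",
        f"   frequency {frequency}",
        f" schedule operation {op_id}",
        "  start-time now",
        "  life forever",
    ]
    return "\n".join(lines)


print(render_xr_icmp_echo(301, "153.13.13.13", "153.14.14.14", 5))
```

Generating the stanza from one `op_id` argument keeps the operation and its schedule entry in step, which is easy to get wrong when copying stanzas by hand.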

We can see the probe is successful and it provides a detailed RTT breakdown also, unlike IOS and XE. Quick tip: You can un-schedule all probes at once with “no ipsla schedule”.

RP/0/0/CPU0:XRv3#show ipsla statistics 301
Entry number: 301
 Modification time: 19:57:04.964 UTC [date]
 Start time : 19:57:04.984 UTC [date]
 Number of operations attempted: 10
 Number of operations skipped : 0
 Current seconds left in Life : Forever
 Operational state of entry : Active
 Operational frequency(seconds): 5
 Connection loss occurred : FALSE
 Timeout occurred : FALSE
 Latest RTT (milliseconds) : 1
 Latest operation start time : 19:57:45.222 UTC [date]
 Next operation start time : 19:57:50.222 UTC [date]
 Latest operation return code : OK
 RTT Values:
  RTTAvg : 1   RTTMin: 1   RTTMax : 1
  NumOfRTT: 1  RTTSum: 1   RTTSum2: 1

IOS XR has a different way of indicating failures. The output below clearly shows the number of attempts, but then has some Boolean variables representing whether the operation timed out or the connection was lost. Since ICMP is connectionless, the connection-loss field doesn’t apply, but once XRv4’s loopback is shut down, the IP SLA process marks the timeout flag as TRUE and increments the “skipped” counter. This isn’t a great description because the operation wasn’t skipped; it did run, but it failed. The XE mechanism of reporting successes/failures was more straightforward.

RP/0/0/CPU0:XRv3#show ipsla statistics 301
Entry number: 301
[snip]
 Number of operations attempted: 10
 Number of operations skipped : 1
 Current seconds left in Life : Forever
 Operational state of entry : Active
 Operational frequency(seconds): 5
 Connection loss occurred : FALSE
 Timeout occurred : TRUE
[snip]

Next, we will configure a path-echo probe in the same direction [ID 310]. I have shut down XRv3’s link to XRv4 to force traffic through CSR1 and CSR2.

! XRv3
ipsla
 operation 310
  type icmp path-echo
   source address 153.13.13.13
   destination address 153.14.14.14
   frequency 10
 schedule operation 310
  start-time now
  life forever

Traceroute quickly reveals the effect of this minor routing change.

RP/0/0/CPU0:XRv3#traceroute 153.14.14.14 source 153.13.13.13
Type escape sequence to abort.
Tracing the route to 153.14.14.14
 1  153.1.13.1 [MPLS: Label 1004 Exp 0] 0 msec 0 msec 0 msec
 2  153.1.2.2 [MPLS: Label 2004 Exp 0] 0 msec 0 msec 0 msec
 3  153.2.14.14 0 msec * 0 msec

Next, I disable ICMP unreachables on XRv4’s interface facing CSR2, causing traceroute to fail. ! XRv4 interface GigabitEthernet0/0/0/0.524 ipv4 unreachables disable RP/0/0/CPU0:XRv3#traceroute 153.14.14.14 source 153.13.13.13 Type escape sequence to abort. Tracing the route to 153.14.14.14 1 153.1.13.1 [MPLS: Label 1004 Exp 0] 0 msec 0 msec 0 msec 2 153.1.2.2 [MPLS: Label 2004 Exp 0] 0 msec 0 msec 0 msec 3 * * * [snip]

Unfortunately, we can’t see much in XR because the debugs are not helpful; they are all marked with “cisco-support”, so Cisco did not intend for the general user to rely on them. The good news is that the show command actually does report failures and timeouts at least for this probe type, unlike IOS and XE.

RP/0/0/CPU0:XRv3#debug ipsla ?
  common        Debug the IPSLA common modules(cisco-support)
  error         Output IP SLAs Error Messages(cisco-support)
  ippmserver    Debug the IPSLA IPPM Server process(cisco-support)
  master-agent  Debug the IPSLA master agent process(cisco-support)
  responder     Debug the IPSLA responder process(cisco-support)
  sub-agent     Debug the IPSLA sub agent process(cisco-support)
  trace         Output IP SLAs Trace Messages(cisco-support)

RP/0/0/CPU0:XRv3#show ipsla statistics 310
Entry number: 310
[snip]
 Number of operations attempted: 8
 Number of operations skipped : 24
 Current seconds left in Life : Forever
 Operational state of entry : Active
 Operational frequency(seconds): 10
 Connection loss occurred : FALSE
 Timeout occurred : TRUE
[snip]

Once we roll back the change on XRv4 to allow it to send unreachables again (not shown), we see the timeout value change to false and the “attempted” operations counter begins to increase again.

RP/0/0/CPU0:XRv3#show ipsla statistics 310
Entry number: 310
[snip]
 Number of operations attempted: 12
 Number of operations skipped : 32
 Current seconds left in Life : Forever
 Operational state of entry : Active
 Operational frequency(seconds): 10
 Connection loss occurred : FALSE
 Timeout occurred : FALSE
[snip]

Next is the path-jitter probe, which is configured on XRv3 [ID 312]. This operation issues a traceroute first, followed by ICMP-echo probes to each hop returned within the traceroute list. We still have the same variables from before: N, S, T, and F. I’ve included their default values where known.

RP/0/0/CPU0:XRv3(config-ipsla-op)#type icmp path-jitter packet ?
  count     Number of packets to be transmitted during a probe ! (N, 10 pkts)
  interval  Inter packet interval ! (T, 20 msec)
RP/0/0/CPU0:XRv3(config-ipsla-op)#type icmp path-jitter ?
  datasize   Protocol data size in payload of probe packets ! (S)
  frequency  Frequency of the probing ! (F, 60 sec)

! XRv3
ipsla
 operation 312
  type icmp path-jitter
   source address 153.13.13.13
   destination address 153.14.14.14
   frequency 10
 schedule operation 312
  start-time now
  life forever

We start the probe and then check the statistics. We expect to see three hops in the path (CSR1, CSR2, and XRv4). The output from XR is very extensive but provides similar information as IOS and XE. I have omitted the statistical output from subsequent hops but it is present for the first hop. There does not appear to be a way to specify “targetOnly” as we did in IOS and XE, but this is a minor feature anyway.

RP/0/0/CPU0:XRv3#show ipsla statistics 312
Entry number: 312
 Modification time: 20:22:23.390 UTC [date]
 Start time : 20:22:23.410 UTC [date]
 Number of operations attempted: 1
 Number of operations skipped : 0
 Current seconds left in Life : Forever
 Operational state of entry : Active
 Operational frequency(seconds): 10
 Connection loss occurred : FALSE
 Timeout occurred : FALSE
 Latest RTT (milliseconds) : 8
 Latest operation start time : 20:22:26.650 UTC [date]
 Next operation start time : 20:22:36.650 UTC [date]
 Latest operation return code : OK
 ---- ICMP Path Jitter Statistics ----
 Source IP - 153.13.13.13
 Destination IP - 153.14.14.14
 Number of Echos - 10
 Interval between Echos - 20 ms
 Hop Index: 0
 Hop IP 153.1.13.1:
  RTT Values:
   RTTAvg : 1    RTTMin: 1    RTTMax : 1
   NumOfRTT: 10  RTTSum: 10   RTTSum2: 10
  Packet Loss Values:
   PacketCount : 10   PacketLoss : 0
   OutOfSequence: 0   DiscardedSamples: 0
   VerifyErrors : 0   DroppedErrors : 0
  Jitter Values :
   Jitter : 0         JitterAve : 0
   PosCount : 0       NegCount: 0
   MinPosJitter: 0    MaxPosJitter: 0
   SumPos: 0          Sum2Pos : 0
   AvePosJitter: 0
   MinNegJitter: 0    MaxNegJitter: 0
   SumNeg: 0          Sum2Neg : 0
   AveNegJitter: 0
 Hop Index: 1
 Hop IP 153.1.2.2:
 [snip]
 Hop Index: 2
 Hop IP 153.14.14.14:
 [snip]

Next, we will configure a UDP-echo on XRv4 to probe XRv3’s loopback [ID 303]. We will make XRv3 a static responder for UDP 55556 and the IP address of its loopback (which is the address XRv4 is targeting). Control-messages should be disabled on XRv4 as the UDP data should be sent immediately. ! XRv3 ipsla responder type udp ipv4 address 153.13.13.13 port 55556 ! XRv4 ipsla operation 303 type udp echo source address 153.14.14.14 destination address 153.13.13.13 control disable destination port 55556 schedule operation 303 start-time now life forever

XRv4 reports success and XRv3, as a static responder, also counts the received probes on that specific port/address pair. Notice there are no control probes on a static (aka permanent) responder. RP/0/0/CPU0:XRv4#show ipsla statistics 303 Entry number: 303 [snip] Number of operations attempted: 2 Number of operations skipped : 0 Current seconds left in Life : Forever Operational state of entry : Active Operational frequency(seconds): 60 Connection loss occurred : FALSE Timeout occurred : FALSE Latest RTT (milliseconds) : 1 [snip] RP/0/0/CPU0:XRv3#show ipsla responder statistics all ports Port Statistics --------------Local Address Port Port Type Probes Drops CtrlProbes Discard 153.13.13.13 55556 Permanent 2 0 0

Next we will examine the UDP-jitter probe on XR. This was an interesting probe with many contributing factors beyond the scope of the probe itself. We will configure a probe on XRv4 [ID 304] to probe XRv3’s loopback for latency/jitter information. Notice we use the same UDP destination port of 55556 which already has a corresponding static responder configured on XRv3 from the last example.

! XRv4
ipsla
 operation 304
  type udp jitter
   source address 153.14.14.14
   destination address 153.13.13.13
   control disable
   destination port 55556
   frequency 10
 schedule operation 304
  start-time now
  life forever

The two routers are not synchronized with NTP and the responder is static. As expected, we cannot resolve jitter or one-way delay (latency). So far, this is consistent with IOS and XE behaviors.

RP/0/0/CPU0:XRv4#show ipsla statistics 304
Entry number: 304
[snip]
 Number of operations attempted: 5
 Number of operations skipped : 0
 Current seconds left in Life : Forever
 Operational state of entry : Active
 Operational frequency(seconds): 10
 Connection loss occurred : FALSE
 Timeout occurred : FALSE
[snip]
 Latest operation return code : OK
 RTT Values:
  RTTAvg : 1    RTTMin: 1    RTTMax : 1
  NumOfRTT: 10  RTTSum: 10   RTTSum2: 10
 Packet Loss Values:
[snip]
 Jitter Values :
  MinOfPositivesSD: 0  MaxOfPositivesSD: 0
  NumOfPositivesSD: 0  SumOfPositivesSD: 0
  Sum2PositivesSD : 0
  MinOfNegativesSD: 0  MaxOfNegativesSD: 0
[snip]
 One Way Values :
  NumOfOW: 0
  OWMinSD : 0   OWMaxSD: 0   OWSumSD: 0
  OWSum2SD: 0   OWAveSD: 0
  OWMinDS : 0   OWMaxDS: 0   OWSumDS: 0
  OWSum2DS: 0   OWAveDS: 0

Configuring another UDP jitter probe on XRv3 to probe XRv4 will use XRv4’s dynamic responder and should give us jitter, but not latency [ID 305]. ! XRv3 ipsla operation 305 type udp jitter source address 153.13.13.13 destination address 153.14.14.14 destination port 55558 schedule operation 305 start-time now life forever

As expected, IP SLA ID 305 can resolve jitter only as a result of XRv4 being a dynamic responder. The lack of NTP synchronization implies that one-way latency cannot be collected. RP/0/0/CPU0:XRv3#show ipsla statistics 305 Entry number: 305 [snip] Number of operations attempted: 2 Number of operations skipped : 0 Current seconds left in Life : Forever Operational state of entry : Active Operational frequency(seconds): 60 Connection loss occurred : FALSE Timeout occurred : FALSE [snip] Latest operation return code : OK [snip] Jitter Values : MinOfPositivesSD: 40 MaxOfPositivesSD: 40 NumOfPositivesSD: 1 SumOfPositivesSD: 40 Sum2PositivesSD : 1600 [snip] JitterAve: 40 JitterSDAve: 40 JitterDSAve: 0 Interarrival jitterout: 0 Interarrival jitterin: 0 One Way Values : NumOfOW: 0 OWMinSD : 0 OWMaxSD: 0 OWSumSD: 0 OWSum2SD: 0 OWAveSD: 0 OWMinDS : 0 OWMaxDS: 0 OWSumDS: 0 OWSum2DS: 0 OWAveDS: 0


Notice how the responder on XRv4 records this as a dynamically-learned probe and tracks its progress. The responder also shows that control probes were received, which makes sense for a dynamic responder. RP/0/0/CPU0:XRv4#show ipsla responder statistics all ports Port Statistics --------------Local Address Port Port Type Probes Drops CtrlProbes Discard 153.14.14.14 55558 Dynamic 20 0 2 ON

Let’s quickly enable NTP on the XRv routers, using CSR1 as the server because it’s already configured for it. We wait several minutes for NTP to synchronize. ! XRv3 and XRv4 ntp server 153.1.1.1 source Loopback0 RP/0/0/CPU0:XRv3#show ntp status | include sync Clock is synchronized, stratum 9, reference is 153.1.1.1 RP/0/0/CPU0:XRv4#show ntp status | include sync Clock is synchronized, stratum 9, reference is 153.1.1.1

IP SLA ID 305 on XRv3 should start reporting both latency and jitter since it is targeting a dynamic responder. IP SLA ID 304 on XRv4 should start reporting latency only since it is targeting a static responder. The only change we made in the network was synchronizing time; this proves that the XR behavior is identical to IOS and XE with respect to latency and jitter resolution/collection.

RP/0/0/CPU0:XRv3# show ipsla statistics 305
Entry number: 305
[snip]
 Jitter Values :
  MinOfPositivesSD: 10  MaxOfPositivesSD: 10
  NumOfPositivesSD: 1   SumOfPositivesSD: 10
  Sum2PositivesSD : 100
  MinOfNegativesSD: 10  MaxOfNegativesSD: 10
  NumOfNegativesSD: 1   SumOfNegativesSD: 10
  Sum2NegativesSD : 100
  MinOfPositivesDS: 20  MaxOfPositivesDS: 20
  NumOfPositivesDS: 1   SumOfPositivesDS: 20
  Sum2PositivesDS : 400
  MinOfNegativesDS: 20  MaxOfNegativesDS: 20
  NumOfNegativesDS: 1   SumOfNegativesDS: 20
  Sum2NegativesDS : 400
  JitterAve: 15  JitterSDAve: 10  JitterDSAve: 20
  Interarrival jitterout: 0  Interarrival jitterin: 0
 One Way Values :
  NumOfOW: 2
  OWMinSD : 10   OWMaxSD: 10   OWSumSD: 20
  OWSum2SD: 200  OWAveSD: 10
  OWMinDS : 0    OWMaxDS: 20   OWSumDS: 20
  OWSum2DS: 400  OWAveDS: 10

RP/0/0/CPU0:XRv4#show ipsla statistics 304
[snip]
 Jitter Values :
  MinOfPositivesSD: 0  MaxOfPositivesSD: 0
  NumOfPositivesSD: 0  SumOfPositivesSD: 0
  Sum2PositivesSD : 0
[snip]
 One Way Values :
  NumOfOW: 1
  OWMinSD : 19   OWMaxSD: 19   OWSumSD: 19
  OWSum2SD: 361  OWAveSD: 19
  OWMinDS : 1    OWMaxDS: 1    OWSumDS: 1
  OWSum2DS: 1    OWAveDS: 1
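XR reports a sample count, a sum, and a second "Sum2" value for each metric rather than a spread. Assuming the Sum2 fields are sums of squared samples (an inference from the numbers above, for example one 19 ms sample giving 361, and not something stated in the output), the mean and standard deviation can be recovered as sketched below. The function name is illustrative.

```python
import math
from typing import Optional, Tuple


def mean_and_stddev(count: int, total: float,
                    total_squares: float) -> Optional[Tuple[float, float]]:
    """Recover mean and population standard deviation from running sums.

    Assumes total_squares is the sum of squared samples. Returns None
    when there are no samples.
    """
    if count <= 0:
        return None
    mean = total / count
    # Clamp at zero to absorb tiny negative values from rounding.
    variance = max(total_squares / count - mean * mean, 0.0)
    return mean, math.sqrt(variance)


# ID 305, source to destination: NumOfOW 2, OWSumSD 20, OWSum2SD 200.
print(mean_and_stddev(2, 20, 200))  # (10.0, 0.0)
# ID 305, destination to source: NumOfOW 2, OWSumDS 20, OWSum2DS 400.
print(mean_and_stddev(2, 20, 400))  # (10.0, 10.0)
```

The second result matches the reported minimum of 0 and maximum of 20 with an average of 10.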

Next, we will examine the MPLS ping and traceroute operations. Remember that “mpls oam” must be explicitly enabled in order to use these commands; this is not enabled by default in XR. We will configure an MPLS LSP ping [ID 313] and LSP traceroute [ID 314] on XRv3 targeting XRv4’s loopback address. Before configuring the probes, we will manually test MPLS ping/traceroute from the CLI. Everything looks correct so far. RP/0/0/CPU0:XRv3#ping mpls ipv4 153.14.14.14/32 source 153.13.13.13 [snip] !!!!! Success rate is 100 percent (5/5), round-trip min/avg/max = 1/2/10 ms RP/0/0/CPU0:XRv3#trace mpls ipv4 153.14.14.14/32 source 153.13.13.13 [snip] Codes: '!' - success, 'Q' - request not sent, '.' - timeout, 'L' - labeled output interface, 'B' - unlabeled output interface, [snip] Type escape sequence to abort. 0 153.1.13.13 MRU 1500 [Labels: 1004 Exp: 0] L 1 153.1.13.1 MRU 1500 [Labels: 2004 Exp: 0] 0 ms L 2 153.1.2.2 MRU 1500 [Labels: implicit-null Exp: 0] 10 ms ! 3 153.2.14.14 1 ms

Next, we configure the probes. It is interesting to note that sometimes exp-null is required to get XR to respond to MPLS OAM traceroute probes. This is true when XR is the target of an LSP traceroute, but isn’t terribly significant since XR IP SLA supports it. Notice that the configuration between the LSP ping and traceroute operations is identical since the features share many common parameters and behaviors.

! XRv3
ipsla
 operation 313
  type mpls lsp ping
   source address 153.13.13.13
   target ipv4 153.14.14.14/32
   frequency 10
   force explicit-null
 operation 314
  type mpls lsp trace
   source address 153.13.13.13
   target ipv4 153.14.14.14/32
   frequency 10
   force explicit-null
 schedule operation 313
  start-time now
  life forever
 schedule operation 314
  start-time now
  life forever

Checking the probe statistics, everything appears to be working. This makes sense since we manually tested the LSP ping/traceroute operations before configuring the probes.

RP/0/0/CPU0:XRv3#show ipsla statistics 313 | utility egrep 'Number|Timeout|loss'
 Number of operations attempted: 6
 Number of operations skipped : 0
 Connection loss occurred : FALSE
 Timeout occurred : FALSE

RP/0/0/CPU0:XRv3#show ipsla statistics 314 | utility egrep 'Number|Timeout|loss'
 Number of operations attempted: 4
 Number of operations skipped : 0
 Connection loss occurred : FALSE
 Timeout occurred : FALSE

We will temporarily break the LSP by telling CSR2 to stop advertising LDP labels completely. We will rerun our manual verifications and expect the probes to report failures also. XRv did not increase the “skipped” counter for the MPLS ping failures, but instead said connection loss occurred. The MPLS traceroute doesn’t change the timeout/connection flags nor does it increment the “skipped” counter; it only reports the RTT as unknown, so you can use that as a failure indicator. This can be difficult to identify when looking at the outputs quickly.

! CSR2
no mpls ldp advertise-labels

RP/0/0/CPU0:XRv3#ping mpls ipv4 153.14.14.14/32 source 153.13.13.13
[snip]
Codes: '!' - success, 'Q' - request not sent, '.' - timeout,
  'L' - labeled output interface, 'B' - unlabeled output interface,
[snip]
BBBBB
Success rate is 0 percent (0/5)

RP/0/0/CPU0:XRv3#traceroute mpls ipv4 153.14.14.14/32 source 153.13.13.13 [snip] Codes: '!' - success, 'Q' - request not sent, '.' - timeout, 'L' - labeled output interface, 'B' - unlabeled output interface, [snip] 0 153.1.13.13 MRU 1500 [Labels: 1004 Exp: 0] B 1 153.1.13.1 MRU 1500 [No Label] 0 ms

RP/0/0/CPU0:XRv3#show ipsla statistics 313 | utility egrep 'Number|Timeout|loss'
 Number of operations attempted: 50
 Number of operations skipped : 0
 Connection loss occurred : TRUE
 Timeout occurred : FALSE

RP/0/0/CPU0:XRv3#show ipsla statistics 314 | utility egrep 'Number|Timeout|loss|Latest RTT'
 Number of operations attempted: 55
 Number of operations skipped : 0
 Connection loss occurred : FALSE
 Timeout occurred : FALSE
 Latest RTT (milliseconds) : Unknown
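Because XR signals failure in three different ways depending on the probe type (timeout flag, connection-loss flag, or an unknown RTT), a script that checks probe health has to look at all three. The sketch below parses captured “show ipsla statistics” text one field per line; the function names are hypothetical and the rule is only the heuristic described in this section, not a Cisco-defined status.

```python
def parse_fields(show_output: str) -> dict:
    """Split each 'name : value' line of the output into a dictionary entry."""
    fields = {}
    for line in show_output.splitlines():
        key, sep, value = line.partition(":")
        if sep:
            fields[key.strip()] = value.strip()
    return fields


def probe_failed(show_output: str) -> bool:
    """Apply the heuristic: any of the three failure signals means failure."""
    fields = parse_fields(show_output)
    return (fields.get("Timeout occurred") == "TRUE"
            or fields.get("Connection loss occurred") == "TRUE"
            or fields.get("Latest RTT (milliseconds)") == "Unknown")


lsp_trace = """Number of operations attempted: 55
Number of operations skipped : 0
Connection loss occurred : FALSE
Timeout occurred : FALSE
Latest RTT (milliseconds) : Unknown"""
print(probe_failed(lsp_trace))  # True
```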

To see some extra valuable information on the MPLS traceroute, be sure to use the aggregated statistics show command. Unfortunately, unlike XE, this does not show the LSP hop-by-hop. The detail modifier prints the output in a compact tabular format shown below.

RP/0/0/CPU0:XRv3#show ipsla statistics aggregated detail 314 | begin ^Entry
Entry StartT Pth Hop S Dst Comps SumCmp SumCmp2H SumCmp2L TMax TMin
314 1436637790389 0 0 0 22 22 0 22 1 1
314 1436638019453 0 0 0 17 17 0 17 1 1


The MPLS ping and traceroute features can also test a TE tunnel. Be careful if you are using XRv 5.3.0 with verbatim TE; bug CSCus55972 states that the next hop in the ERO path must be a /32, so XRv3 has a static route to 153.1.13.1/32. The connected /24 is not good enough for TE tunnel construction as a result. Using a static route is Cisco’s suggested workaround. This is totally unrelated to IP SLA but I mention it here to explain why the static route is configured. The only reason we use verbatim path-options is to build TE tunnels over EIGRP as the IGP, since it is incapable of carrying TE topology information.

! XRv3
explicit-path name R1_R2_XRv4
 index 1 next-address strict ipv4 unicast 153.1.13.1
 index 2 next-address strict ipv4 unicast 153.1.2.2
 index 3 next-address strict ipv4 unicast 153.2.14.14
 index 4 next-address strict ipv4 unicast 153.14.14.14
interface tunnel-te34
 ipv4 unnumbered Loopback0
 destination 153.14.14.14
 path-option 10 explicit name R1_R2_XRv4 verbatim
router static
 address-family ipv4 unicast
  153.1.13.1/32 GigabitEthernet0/0/0/0.513

Let’s ensure the tunnel is up and do a manual MPLS ping/traceroute to ensure MPLS data plane is functional, too.

RP/0/0/CPU0:XRv3#show mpls traffic-eng tunnels brief
TUNNEL NAME   DESTINATION    STATUS  STATE
tunnel-te34   153.14.14.14   up      up
Displayed 1 (of 1) heads, 0 (of 0) midpoints, 0 (of 0) tails
Displayed 1 up, 0 down, 0 recovering, 0 recovered heads

RP/0/0/CPU0:XRv3#ping mpls traffic-eng tunnel-te 34
[snip]
!!!!!
Success rate is 100 percent (5/5), round-trip min/avg/max = 1/2/10 ms

RP/0/0/CPU0:XRv3#traceroute mpls traffic-eng tunnel-te 34
[snip]
  0 153.1.13.13 MRU 1500 [Labels: 1005 Exp: 0]
L 1 153.1.13.1 MRU 1500 [Labels: 2005 Exp: 0] 10 ms
L 2 153.1.2.2 MRU 1500 [Labels: implicit-null Exp: 0] 0 ms
! 3 153.2.14.14 1 ms

We will configure a ping probe [ID 370] and a traceroute probe [ID 371] on XRv3 to test the state of this tunnel at regular intervals (every 10 seconds).

! XRv3
ipsla
 operation 370
  type mpls lsp ping
   target traffic-eng tunnel 34
   frequency 10
 operation 371
  type mpls lsp trace
   target traffic-eng tunnel 34
   frequency 10
 schedule operation 370
  start-time now
  life forever
 schedule operation 371
  start-time now
  life forever

Below are the outputs from the probes; both are successful.

RP/0/0/CPU0:XRv3#show ipsla statistics 370 | utility egrep 'Number|loss|return code|Latest RTT'
Number of operations attempted: 2
Number of operations skipped : 0
Connection loss occurred : FALSE
Latest RTT (milliseconds) : 1
Latest operation return code : OK
RP/0/0/CPU0:XRv3#show ipsla statistics 371 | utility egrep 'Number|loss|return code'
Number of operations attempted: 3
Number of operations skipped : 0
Connection loss occurred : FALSE
Latest operation return code : OK

To simulate a failure, we will disable MPLS TE tunnels globally on CSR2. This rips out the RSVP-based LSP and sends a PATHERR message back to the headend. We would expect the tunnel to be dysfunctional now. In fact, the MPLS OAM process doesn’t even send the pings because the tunnel is down.

! CSR2
no mpls traffic-eng tunnels

RP/0/0/CPU0:XRv3#show mpls traffic-eng tunnels brief
     TUNNEL NAME     DESTINATION    STATUS  STATE
     tunnel-te34     153.14.14.14   down    down
Displayed 1 (of 1) heads, 0 (of 0) midpoints, 0 (of 0) tails
Displayed 0 up, 1 down, 0 recovering, 0 recovered heads

RP/0/0/CPU0:XRv3#ping mpls traffic-eng tunnel-te 34
[snip]
Codes: '!' - success, 'Q' - request not sent, '.' - timeout,
QQQQQ
Success rate is 0 percent (0/5)

RP/0/0/CPU0:XRv3#traceroute mpls traffic-eng tunnel-te 34
[snip]
Q 1 *

Below are the failed probes. These outputs are more consistent than the IPv4 MPLS ping/traceroute manual tests shown above. They both continue to increment the “attempted” counter, but both indicate connection loss and tell us there was a TX error along the LSP. The output above indicates the requests were not even sent because the LSP is down entirely. This output is very clear and concise, even more so than the XE equivalent.

RP/0/0/CPU0:XRv3#show ipsla statistics 370 | utility egrep 'Number|loss|return code'
Number of operations attempted: 9
Number of operations skipped : 0
Connection loss occurred : TRUE
Latest operation return code : MplsLspEchoTxError
RP/0/0/CPU0:XRv3#show ipsla statistics 371 | utility egrep 'Number|loss|return_code'
Number of operations attempted: 14
Number of operations skipped : 0
Connection loss occurred : TRUE
Latest operation return code : MplsLspEchoTxError
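If you want to check these filtered outputs from a script rather than by eye, a minimal sketch in Python follows. It is not a Cisco tool; the function names parse_ipsla_stats and probe_failed are hypothetical, and the only assumption about the CLI is the "Key : Value" line layout seen in the outputs above.

```python
def parse_ipsla_stats(text):
    """Turn 'Key : Value' lines from the filtered show output into a dict."""
    stats = {}
    for line in text.splitlines():
        # Split on the first colon only, so values containing colons survive.
        key, sep, value = line.partition(":")
        if not sep:
            continue  # skip prompts or lines without a colon
        stats[key.strip()] = value.strip()
    return stats


def probe_failed(stats):
    """Hypothetical health check: connection loss or a non-OK return code."""
    return (stats.get("Connection loss occurred") == "TRUE"
            or stats.get("Latest operation return code", "OK") != "OK")


# Sample text copied from the probe 370 output after the tunnel went down.
sample = """Number of operations attempted: 9
Number of operations skipped : 0
Connection loss occurred : TRUE
Latest operation return code : MplsLspEchoTxError"""

stats = parse_ipsla_stats(sample)
print(stats["Latest operation return code"], probe_failed(stats))
```

The same two functions work for the healthy output, where probe_failed returns False because the connection loss flag is FALSE and the return code is OK.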

XR simplifies the configuration logic for aggregate statistics and history. There are two different syntax trees now: history and statistics. We will reuse IP SLA ID 301 to test these features and attempt to mirror the settings from IP SLA ID 1 on CSR1. The original goal was to collect aggregate statistics for RTT in “buckets” of width 25 ms between 0 and 100 ms (so we need 4 buckets). We also want to collect history for all events (not just failures) in a rotating list with 15 buckets. XR does not give us an option on how often to sample for history, probably because it writes history in step with the probe frequency (which makes sense). The configuration is shown below. Be sure to modify the “statistics hourly” category since the statistics “interval” is measured only in seconds. The hourly aggregated statistics are where we achieve the granularity we expect. The “statistics interval” is used for collecting enhanced statistics, which are shown later.


! XRv3
ipsla
 operation 301
  type icmp echo
   history
    lives 1
    filter all
    buckets 15
   source address 153.13.13.13
   destination address 153.14.14.14
   statistics hourly
    buckets 1
    distribution count 4
    distribution interval 25
   frequency 5
   statistics interval 5
    buckets 4

The output doesn’t show us the bucket breakdowns like XE did, but we know each row is a bucket of width 25 ms. All of our probes are landing in the first bucket since the network is small and RTT is low.

RP/0/0/CPU0:XRv3#show ipsla statistics aggregated detail 301 | begin Entry S
Entry StartT        Pth Hop Dst Comps SumCmp SumCmp2H SumCmp2L TMax TMin
301   1436638963658 1   1   0   10    10     0        10       1    1
301   1436638963658 1   1   1   0     0      0        0        0    0
301   1436638963658 1   1   2   0     0      0        0        0    0
301   1436638963658 1   1   3   0     0      0        0        0    0
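To make the bucket logic concrete, here is a rough Python sketch of how RTT samples could be sorted into 4 distribution rows of width 25 ms. This is my own illustration, not XR's code. I am assuming that an RTT exactly on a boundary goes to the higher bucket, that anything past the last boundary lands in the last bucket, and I keep the sum of squares as one number (SumCmp2) rather than splitting it into the high/low columns.

```python
def aggregate(rtts_ms, count=4, interval=25):
    """Sort RTT samples into `count` buckets of width `interval` ms."""
    rows = [{"Dst": i, "Comps": 0, "SumCmp": 0, "SumCmp2": 0, "TMax": 0, "TMin": 0}
            for i in range(count)]
    for rtt in rtts_ms:
        # Assumption: overflow samples are clamped into the last bucket.
        row = rows[min(rtt // interval, count - 1)]
        row["TMin"] = rtt if row["Comps"] == 0 else min(row["TMin"], rtt)
        row["TMax"] = max(row["TMax"], rtt)
        row["Comps"] += 1          # completed operations in this bucket
        row["SumCmp"] += rtt       # sum of completion times
        row["SumCmp2"] += rtt * rtt  # sum of squared completion times
    return rows


# Ten probes at 1 ms each, like the lab output above.
for row in aggregate([1] * 10):
    print(row)
```

With ten 1 ms samples, only the Dst 0 row has non-zero counters, which lines up with the first row of the table above.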

We can also view enhanced statistics, which are based on the interval we configured earlier. Every 5 seconds, the router computes these statistics for display. This shows enhanced failure statistics over time.

RP/0/0/CPU0:XRv3#show ipsla statistics enhanced aggregated 301
Entry number: 301
Interval : 5 seconds
Bucket : 1 Start Time Index: 18:27:24.469 UTC [snip]
Number of Failed Operations due to a Disconnect : 0
Number of Failed Operations due to a Timeout : 0
Number of Failed Operations due to a Busy : 0
Number of Failed Operations due to a No Connection : 0
Number of Failed Operations due to an Internal Error: 0
Number of Failed Operations due to a Sequence Error : 0
Number of Failed Operations due to a Verify Error : 0
RTT Values:
RTTAvg : 1 RTTMin: 1 RTTMax : 1
NumOfRTT: 1 RTTSum: 1 RTTSum2: 1
Bucket : 2 [snip]

Without examining every case, we will briefly examine IP SLA reaction configurations. In this example, we will configure XRv3 to issue a syslog message and start a path-echo operation [ID 312] if XRv4’s loopback becomes unreachable after 3 consecutive failures of the ICMP-echo operation [ID 301]. The path-echo should only run for 5 minutes (300 seconds) when invoked. The configuration is rather involved, just like XE, and is shown below. The key command is “reaction trigger” which effectively says “if probe 301 experiences the trigger condition, start probe 312”.

! XRv3
ipsla
 reaction operation 301
  react timeout
   action logging
   action trigger
   threshold type consecutive 3
 reaction trigger 301 312
 schedule operation 301
  start-time now
  life forever
 schedule operation 312
  start-time pending
  life 300
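The “consecutive 3” threshold can be pictured as a small state machine. The Python sketch below is only my mental model, not XR's implementation; the class name ConsecutiveTimeoutReaction is made up, and I am assuming the reaction clears on the first successful probe after it has been set.

```python
class ConsecutiveTimeoutReaction:
    """Model of a 'threshold type consecutive N' timeout reaction."""

    def __init__(self, threshold=3):
        self.threshold = threshold
        self.failures = 0     # current run of consecutive timeouts
        self.active = False   # True once the threshold has been exceeded

    def record(self, timed_out):
        """Feed one probe result; return 'set', 'clear', or None."""
        if timed_out:
            self.failures += 1
            if not self.active and self.failures >= self.threshold:
                self.active = True
                return "set"    # log and trigger the second probe here
        else:
            self.failures = 0
            if self.active:
                self.active = False
                return "clear"  # assumption: first success resets it
        return None


reaction = ConsecutiveTimeoutReaction(threshold=3)
results = [False, True, True, False, True, True, True, True, False]
events = [reaction.record(r) for r in results]
print(events)
```

Note that two timeouts followed by a success do not fire anything; the counter starts over, so only an unbroken run of three timeouts produces the “set” event.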

Initially, the ICMP-echo probe succeeds and the path-echo is not yet running, as expected.

RP/0/0/CPU0:XRv3#show ipsla statistics 301 | utility egrep 'Number|occurred'
Number of operations attempted: 6
Number of operations skipped : 0
Connection loss occurred : FALSE
Timeout occurred : FALSE
RP/0/0/CPU0:XRv3#show ipsla statistics 312
Entry number: 312
Modification time: 18:58:55.476 UTC [snip]
Start time : Not started yet
[snip]

If we temporarily apply an ACL to CSR1 that denies all ICMP echoes, we expect to see XRv3 log the event and start the path-echo probe after 3 consecutive failures. I will temporarily remove LDP from CSR1’s interface to XRv3 so that traffic is raw IP; otherwise the ACL cannot classify it. Below are the syslog message, the failures from the original probe, and the start of the second probe. The second probe fails too because it is ICMP-based, but it doesn’t matter for this test. I was just illustrating the reaction configuration to ensure that the second probe was started at the proper time.

! CSR1
ip access-list extended ACL_DENY_ICMP_ECHO
 deny icmp any any echo
 permit ip any any
interface GigabitEthernet2.513
 ip access-group ACL_DENY_ICMP_ECHO in
 no mpls ip

! XRv3
ipsla_sa[260]: %MGBL-IPSLA-5-THRESHOLD_SET : Monitor element has exceeded the threshold condition. Op:301, TargetAddr:153.14.14.14, MonElem:Timeout(Type:consecutive)

RP/0/0/CPU0:XRv3#show ipsla statistics 301 | utility egrep 'Number|occurred'
Number of operations attempted: 9
Number of operations skipped : 9
Connection loss occurred : FALSE
Timeout occurred : TRUE
RP/0/0/CPU0:XRv3#show ipsla statistics 312 | utility egrep 'Number|occurred'
Number of operations attempted: 11
Number of operations skipped : 0
Connection loss occurred : FALSE
Timeout occurred : TRUE

When I re-enable MPLS on the link, the traffic succeeds as it tunnels through CSR1, so the ACL does nothing. Notice this log message says “threshold clear” and “reset”, while the first one said “threshold set” and “exceeded”. I also remove the ACL for cleanup purposes.

! CSR1
interface GigabitEthernet2.513
 no ip access-group ACL_DENY_ICMP_ECHO in
 mpls ip

! XRv3
ipsla_sa[260]: %MGBL-IPSLA-5-THRESHOLD_CLEAR : Monitor element has reset the threshold reaction. Op:301, TargetAddr:153.14.14.14, MonElem:Timeout(Type:consecutive)

RP/0/0/CPU0:XRv3#show ipsla statistics 301 | utility egrep 'Number|occurred'
Number of operations attempted: 76
Number of operations skipped : 17
Connection loss occurred : FALSE
Timeout occurred : FALSE

TWAMP was discussed earlier, and the XR feature set is identical: only the server and session-reflector are supported. Like XE, “server” enables the TWAMP server and “responder” enables the TWAMP session-reflector, which for v1.0 must be on the same router. Also like XE, we can see the TWAMP standards for reference; this is useful when determining interoperability between different platforms or vendors.

! XRv4
ipsla
 server twamp
  port 60000
 responder twamp

RP/0/0/CPU0:XRv4#show ipsla twamp status
TWAMP Server is enabled
TWAMP Server port : 60000
TWAMP Reflector is enabled

RP/0/0/CPU0:XRv4#show ipsla twamp standards
Feature          Organization  Standard
TWAMP Server     IETF          RFC5357
TWAMP Reflector  IETF          RFC5357

Now we will examine the IPv6 SLA probes supported in XR. Their behavior is generally identical to their IPv4 counterparts, as we saw in XE also. We begin with the IPv6 ICMP-echo probe [ID 361] on XRv3. XR still calls the IPv6 field “TOS” as opposed to the more correct “traffic-class”. There does not appear to be an option for assigning flow-labels at this time, either. This is probably a result of sharing the same syntax tree as IPv4 rather than being subcomponents of a probe-type as seen in XE.

! XRv3
operation 361
 type icmp echo
  source address ::153:13:13:13
  destination address ::153:14:14:14
  frequency 5
schedule operation 361
 start-time now
 life forever

We schedule the probe to ping XRv4’s loopback and see no issues.

RP/0/0/CPU0:XRv3#show ipsla statistics 361 | utility egrep 'Number|occurred'
Number of operations attempted: 7
Number of operations skipped : 0
Connection loss occurred : FALSE
Timeout occurred : FALSE

None of the UDP probes appear to support IPv6 in XR v5.3.0. As such, they are not discussed further.

RP/0/0/CPU0:XRv4(config-ipsla-op)#type udp echo destination address ?
  A.B.C.D  Enter IPv4 address of the target device
RP/0/0/CPU0:XRv4(config-ipsla-op)#type udp jitter destination address ?
  A.B.C.D  Enter IPv4 address of the target device

Enhanced Object Tracking (EOT) in XR is similar to IOS/XE. One enhancement is the ability to use named objects, not just numbers. However, XR only supports a limited number of object types: rtr (IP SLA), ipv4 route (not IPv6), list, and line-protocol. We will quickly test these options using some dummy objects as their functions are identical to what was already tested and documented. There does not appear to be a way to set an object’s default state. My workaround for this is to use the line-protocol object; to have an always-up object, track a loopback interface. To have an always-down object, track a nonexistent interface. Below is the verification of my basic TRUE/FALSE objects; these are like stub-objects in a sense but cannot have their states changed by EEM.

! XRv3
track ALWAYS_TRUE
 type line-protocol state
  interface Loopback0
track ALWAYS_FALSE
 type line-protocol state
  interface Loopback999

RP/0/0/CPU0:XRv3#show track brief
Track           Object                  Parameter       Value
-------------------------------------------------------------------------
ALWAYS_FALSE    interface Loopback999   line protocol   Down
ALWAYS_TRUE     interface Loopback0     line protocol   Up

Next, we will test the “route” object to test reachability to a specific prefix. I made two objects: one specific to the loopback /32 address, and one for a generic /24 for which we do not have an exact match. The former is up and the latter is down.

! XRv3
track XRV14_LOOPBACK_ROUTE
 type route reachability
  route ipv4 153.14.14.14/32
track XRV14_LOOPBACK_LESS_SPECIFIC
 type route reachability
  route ipv4 153.14.14.0/24

RP/0/0/CPU0:XRv3#show track XRV14_LOOPBACK_ROUTE
Track XRV14_LOOPBACK_ROUTE
 Ip route 153.14.14.14 255.255.255.255 reachability ip vrf default
 Reachability is UP
 First Hop interface GigabitEthernet0_0_0_0.534
 1 change, last change 15:53:12 UTC [snip]

RP/0/0/CPU0:XRv3#show track XRV14_LOOPBACK_LESS_SPECIFIC
Track XRV14_LOOPBACK_LESS_SPECIFIC
 Ip route 153.14.14.0 255.255.255.0 reachability ip vrf default
 Reachability is DOWN
 1 change, last change 15:54:13 UTC [snip]
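The up/down result above is what you would expect if the route object requires an exact prefix match rather than a longest-match lookup. Here is a tiny Python sketch of that idea; the function name route_object_up and the sample RIB contents are hypothetical, and this is only a model of the behavior I observed, not a statement of how XR implements it.

```python
import ipaddress


def route_object_up(rib, tracked):
    """Return True only if the tracked prefix exists in the RIB exactly."""
    prefixes = {ipaddress.ip_network(p) for p in rib}
    return ipaddress.ip_network(tracked) in prefixes


# Hypothetical RIB: the remote loopback /32 plus one connected /24.
rib = ["153.14.14.14/32", "153.1.13.0/24"]

print(route_object_up(rib, "153.14.14.14/32"))  # exact /32 is present
print(route_object_up(rib, "153.14.14.0/24"))   # no exact /24 entry
```

Under this model the /24 object stays down even though a more specific /32 inside it is reachable, which matches the two show track outputs.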

Next, we will tie EOT to an IP SLA operation. EOT in XR does not support the “state” option, so measuring the threshold values inside EOT is not possible; only reachability appears to be supported. The object below tracks the reachability of IP SLA ID 301 and when it times out, the object goes down. This feature appears to be dysfunctional in XR v5.3.0 (it may be XRv-specific). I have the IP SLA running successfully but EOT does not acknowledge it.

! XRv3
track XRV14_REACH
 type rtr 301 reachability

RP/0/0/CPU0:XRv3#show ipsla statistics 301 | utility egrep 'Number|occurred'
Number of operations attempted: 35
Number of operations skipped : 0
Connection loss occurred : FALSE
Timeout occurred : FALSE
RP/0/0/CPU0:XRv3#show track XRV14_REACH
Track XRV14_REACH
 Response Time Reporter 301 reachability
 ipsla operation not in progress

To test the object-list features, I will add three objects to track XRv3 loopback0 and its two transit interfaces to ensure the line-protocol is up. XR does not let me track the loopback using a route object, for some reason, but the parser accepts it for the transit links.

! XRv3
track XRV3_TO_CSR1
 type line-protocol state
  interface GigabitEthernet0/0/0/0.513
track XRV3_TO_XRV4
 type line-protocol state
  interface GigabitEthernet0/0/0/0.534
track XRV3_LOOPBACK
 type route reachability
  route ipv4 153.13.13.13/32

The track process does not appear to actually track connected routes though, so using line-protocol is better for this test. I leave it in the configuration for Loopback0 as a demonstration.

! XRv3
track XRV3_LOOPBACK
 type route reachability
  route ipv4 153.13.13.13/32
!!% 'FIB' detected the 'warning' condition 'Invalid argument'

RP/0/0/CPU0:XRv3#show track brief | include XRV3
XRV3_LOOPBACK0  interface Loopback0                    line protocol  Up
XRV3_TO_XRV4    interface GigabitEthernet0/0/0/0.534   reachability   Up
XRV3_TO_CSR1    interface GigabitEthernet0/0/0/0.513   reachability   Up

The logic of the weighted threshold list is very odd in XR. In XE, if the sum of the weights was higher than the up-value, the object was up. If the sum of the weights was lower than the down-value, the object was down. In XR, the sum of the weights must lie in the range of the two values to be up. The interesting thing about XR’s implementation is that you can specify multiple threshold ranges, but the logical comparison between those ranges is Boolean AND. In this example, the list can suffer the loss of a single transit link only. The current sum of the weights is 20, which is within the range 11